AR# 69751

Xilinx PCI Express - FAQs and Debug Checklist

Description

This answer record provides FAQs and a Debug Checklist for general Xilinx PCI Express IP issues.

For FAQs and Debug Checklists specific to a particular IP's operation, please refer to the link for the IP below:

(Xilinx Answer 70477) 7 Series Integrated Block for PCI Express - FAQs and Debug Checklist
(Xilinx Answer 70478) AXI Bridge for PCI Express - FAQs and Debug Checklist
(Xilinx Answer 70479) AXI Bridge for PCI Express Gen3 - FAQs and Debug Checklist
(Xilinx Answer 70480) Virtex-7 FPGA Gen3 Integrated Block for PCI Express - FAQs and Debug Checklist
(Xilinx Answer 70481) DMA Subsystem for PCI Express - FAQs and Debug Checklist
(Xilinx Answer 70482) UltraScale FPGA Gen3 Integrated Block for PCI Express - FAQs and Debug Checklist
(Xilinx Answer 70483) UltraScale+ PCI Express Integrated Block - FAQs and Debug Checklist


For general PCIe and software / drivers FAQs and Debug Checklist, please refer to the Solution section below.


This article is part of the PCI Express Solution Centre

(Xilinx Answer 34536) Xilinx Solution Center for PCI Express

Solution

FAQs:

Simulation

Q) TXOUTCLK has a random phase shift in Simulation. Is this expected?

A) This is expected behavior with TXOUTCLK. Buffer bypass is enabled on the TX side in the PCIe use mode to achieve minimum TX lane-to-lane skew.

As a result, the delay aligner (in the TXOUTCLK path, Fig. 3-30 in UG578) is enabled.

During a TX reset at startup, or after a rate change, you will see this phase shift as the GT phase-aligns its internal parallel clock with the incoming TX/RXUSRCLK (which is derived from TXOUTCLK).

In hardware, TXOUTCLK (and thus the resulting GT TX/RXUSRCLK) is continuously adjusted by the delay aligner to compensate for voltage and temperature (VT) drift, keeping it aligned with the internal parallel clock inside the GT.

Debug Checklist:

Link Training Debug

Enumeration shows no PCIe device (lspci)


  1. Check using an ILA whether the cfg_ltssm_state signal shows the L0 state ('h10).
    • If in the L0 state, check whether it consistently stays in L0 or keeps cycling through the Recovery state. Repeatedly entering Recovery indicates a link integrity issue.
    • If it is stable in the L0 state, check whether PCIe Configuration Request TLPs are exchanged and that a Completion TLP is returned for each (see the host-side sketch after this list).
      This is part of the PCI enumeration process and must be done within 100 ms of the Power Good indication.
      Try a warm reboot; if the device is detected after the warm reboot, this requirement was likely violated.
    • Certain server systems might not repeat PCIe discovery after a PCIe slot/device failure, requiring a Cold Boot (power cycle) to recover.

  2. If the cfg_ltssm_state signal shows state 00 indefinitely:
    • Ensure that the cfg_link_training_enable input pin is driven to 1'b1.
    • Check using AXI JTAG whether the GT reset FSM has completed and is back in the 00 state. If so, check that the phy_status_rst pin is connected to the PCIe reset_done pin.
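
From the host side, a minimal presence check can be sketched as below (Linux, using standard pciutils and sysfs; 10ee is the Xilinx vendor ID):

    # List any enumerated Xilinx endpoints (vendor ID 10ee)
    lspci -d 10ee:

    # If nothing is listed, force a re-scan of the PCI bus
    sudo sh -c 'echo 1 > /sys/bus/pci/rescan'

    # If the device only appears after the rescan or a warm reboot,
    # suspect the 100 ms enumeration-window requirement described above
    lspci -d 10ee: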

After system boot, no clock is seen

    • Use the AXI JTAG debugger to determine where the GT reset FSM is stuck (most likely the PLLLock signal from one or more GT Channel primitives is not asserted).
    • If Eyescan (In-System IBERT / ISI) is included in the design, choose the correct free_run_clk frequency in the IP customization GUI and connect an available on-board free-running clock to it.
    • Check that the DRP clock is running at the correct frequency.

Hang / Kernel Panic / Unexpected Reboot at Runtime


  • The Debug flow for this is similar to the BSOD or Hang on Boot flows, so check those as well. In addition, check the dmesg output (see the sketch after this list).
  • If a driver is loaded and actively in use before the problem occurs, some access performed by the driver is likely triggering the hang.
    Use a Lecroy trace or printk statements in the driver to narrow down which traffic pattern or TLP is causing the hang.
  • Typically this is caused by a request that is lost or not properly responded to, or by the Host running out of Flow Control credits; either prevents the CPU from completing the task requested by the driver and causes the Kernel to lock up.
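
A quick way to inspect the kernel log around the failure (standard Linux dmesg usage; the grep pattern below is only a starting point and can be widened):

    # Look for AER errors, link-state changes, and driver messages
    dmesg | grep -iE 'pcie|aer|xilinx|dma'

    # Or follow the log live while reproducing the issue
    sudo dmesg -w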

BSOD - Blue Screen of Death or Hang on Boot

  • Extended Configuration Space option in the GUI: If enabled, make sure that appropriate logic is connected to the cfg_ext_* interface so that the Configuration Space linked list is terminated properly and all Extended Configuration Requests are responded to.
    • A hang will be seen in Lecroy when Configuration Space accesses are made to the User Defined Configuration Space address 0x400 or 0x300 (depending on the device).
  • Certain BIOSes issue a Memory Read request during the Enumeration process; if there are not enough Non-Posted credits at the AXI interface, the packet sits in the PCIe IP buffer, preventing further Configuration Space accesses (which are also Non-Posted) from passing through.
    • Hangs will be seen in Lecroy where the last few TLPs are ACKed but the Non-Posted or Posted credits advertised by the PCIe IP are not incremented.
  • AXIS_RX_* interface (a.k.a. AXIS_CQ_* on newer devices) ready signal to the PCIe Hard Block:
    • Similar reasoning to rx_np_ok/rx_np_req: some systems broadcast Vendor Defined Messages on boot, and these need to be flushed out of the IP buffer if they are not applicable to the user design.
      Messages can come out on the cfg_msg_* interface (which does not have a ready signal) or the AXIS_RX/CQ interface (which does). Neither interface must be throttled.
    • Hangs will be seen in Lecroy where the last few TLPs are ACKed but the Non-Posted or Posted credits advertised by the PCIe IP are not incremented.
    • A hang in Lecroy will be preceded by Packet Errors or NAKs.
  • AXIS_TX_* interface (a.k.a. AXIS_RQ_* and AXIS_CC_* on newer devices), along with the Interrupt (cfg_interrupt_*) and Message (cfg_msg_*) sideband interfaces:
    • Check all of the *valid* input signals to the PCIe IP. If a *valid* signal is stuck high, the IP can send random TLPs on the PCIe link, causing an error on the Host side.
      This is a rare occurrence but can happen if PR/Tandem is being used with improper termination logic on the IP interface.
    • A hang in Lecroy will be preceded by many unrecognizable TLPs going upstream.
  • Driver / lspci output: This is useful if the problem goes away on a certain board or design. A driver might have been installed on the machine previously and be auto-loading during boot.
    Check this by loading a working design first, running the lspci -vvv -d 10ee: command, and looking at the "Kernel driver in use" line.
    If a driver is listed there, it is loaded for the Xilinx device with that particular Vendor:Device ID.
    Remove the driver from the Kernel by deleting the .ko file (not just with rmmod, because the driver will load again on the next boot) before trying the failing design again, as sketched below.
    This prevents driver accesses that the design you are loading might not be built to handle.
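
A sketch of this driver check and removal (the module name xdma below is hypothetical; substitute whatever "Kernel driver in use" reports):

    # Check which driver, if any, has bound to the Xilinx device
    sudo lspci -vvv -d 10ee: | grep -i kernel

    # Unload it for the current session
    sudo rmmod xdma

    # Locate the .ko file, then delete it so it cannot auto-load on the next boot
    modinfo -n xdma
    sudo rm <path reported by modinfo>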

PCIe link issues during reset

  • If FLR is enabled, ensure that flr_done is driven as in the reference design.
  • Hot Reset, Link Disable, and FLR clear the Bus Master Enable (BME) bit and the BARs. Use the proper PCI utility to re-enable them (see the sketch below), or traffic cannot be sent/received.
  • Some systems reset the MPS (Max Payload Size) after a reset, so ensure that the driver restores the MPS after a link reset is issued.
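
A minimal re-enable sketch using setpci (01:00.0 is a placeholder bus address; note that writing COMMAND this way overwrites the whole register, so a read-modify-write is safer in a script):

    # Set Memory Space Enable (bit 1) and Bus Master Enable (bit 2)
    sudo setpci -s 01:00.0 COMMAND=0x06

    # Read the Device Control register (PCIe capability offset 8);
    # the MPS field is bits 7:5
    sudo setpci -s 01:00.0 CAP_EXP+8.w

    # lspci decodes the same MPS field in readable form
    sudo lspci -vvv -s 01:00.0 | grep MaxPayload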

Gen3 Link issues

  • Check the Link Status 2 register in the PCIe Configuration Space to see whether the Link Equalization phases were attempted (see the lspci sketch after this list).
  • Check the LTSSM state using AXI JTAG to see which state it has reached.
  • Try bypassing Equalization Phase 2/3.
  • Get a PCIe Analyzer trace to see which Equalization Preset values are requested by the link partner and by the Xilinx receiver.
  • Try Equalization Preset 5, LPM/DFE, or the RX auto-adaptation modules to try to improve link quality.
  • Use the PCIe PIPE descrambler module in the Xilinx PCIe MAC to check for lane-to-lane skew at Gen3 speed.
  • If a third-party MAC is used, try the Xilinx example design first to rule out any board or setup issues.
    The most common issue is MAC-to-GT integration; please review all connections on the mandatory ports as stated in (PG239).
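
On a Linux host, lspci decodes the Link Status 2 equalization flags directly (a sketch assuming a reasonably recent pciutils; 01:00.0 is a placeholder):

    # EqualizationComplete / EqualizationPhase1..3 flags show how far EQ progressed
    sudo lspci -vvv -s 01:00.0 | grep -i lnksta2

    # Compare against the currently negotiated link speed and width
    sudo lspci -vvv -s 01:00.0 | grep LnkSta: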

General Checks for Link Training Issues

  • Check if it is possible to change TX drive parameters on the host
  • Check if PERSIST is enabled as a Bitstream setting. This option is not supported for non-tandem designs using SPI/BPI flash and has been known to cause link training issues.
  • When using In-System IBERT, make sure that a free-running clock is used.
  • Ensure that there is no skew between lanes on the board.
  • Always check the quality of the voltage supplies, both to the GTs and to the fabric.
  • Ensure the quality of the slot (reference) clock.
  • If you are seeing issues with the link partner detecting the Xilinx device, then in the UltraScale PCIe core configuration GUI: on the first tab select "Advanced Mode", then in the GT Settings tab set "Receiver Detect" to "Falling Edge".

Protocol Violation:

Receiver Overflow

  • Check whether lspci reports RxOF+ in the AER status (see the sketch below).
  • Ensure that the system sets the Relaxed Ordering bit in the PCIe config space. If not, it might cause an overflow.
  • Check if sufficient credits have been assigned.
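
A quick AER check from the host (10ee is the Xilinx vendor ID; root is needed for the full capability dump):

    # RxOF+ in the UESta line indicates a logged Receiver Overflow
    sudo lspci -vvv -d 10ee: | grep -i uesta

    # Relaxed Ordering shows as RlxdOrd+/- in the DevCtl line
    sudo lspci -vvv -d 10ee: | grep -i rlxdord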

Unsupported Requests/Completion Timeout

  • Use the command "lspci -vvv -d [Vendor ID]:[Device ID]" and check to ensure that the BAR exists:

    • The list should contain something similar to the following (or more than one entry if you have multiple BARs):
      Region 0: Memory at <address>.
      Make sure that this line does not contain words such as "disabled" or "virtual".
      "Virtual" means that the BAR information has been lost at the Endpoint card, although it was previously visible.
      This can be caused by an unexpected link down.
      "Disabled" means that the Memory Enable bit in the PCIe Command register is not set. Set it through your driver or with the setpci utility (see the sketch below).

    • If the BAR does not show up at all, but lspci does indicate that your device is detected, then your system might be running out of address space during BAR allocation.
      This can be caused by large PCIe BARs in your design. Use the pci=realloc kernel boot parameter to re-map your MMIO, or use a 64-bit BAR instead of a 32-bit BAR.

    • In short, this error is typically caused by missing BAR information or by the Command register (Memory Enable bit) not being set.
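
A sketch of the check-and-enable flow (01:00.0 is a placeholder; the GRUB step assumes a GRUB-based Linux distribution):

    # Confirm the BAR line is present and not marked 'disabled' or 'virtual'
    sudo lspci -vvv -s 01:00.0 | grep Region

    # Set the Memory Enable bit if the BAR shows as disabled
    sudo setpci -s 01:00.0 COMMAND=0x02

    # If BARs fail to allocate at all, boot with pci=realloc, e.g. by appending
    # it to GRUB_CMDLINE_LINUX in /etc/default/grub and regenerating the config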

Missing Interrupts

  • Check the Interrupt Enable bit in the PCIe Configuration Space (see the lspci sketch below):
    • If you are using MSI, the MSI Control register has this Enable bit.
    • If you are using MSI-X, the MSI-X Control register has this Enable bit.
    • If you are using Legacy Interrupts, there is no enable bit as such; instead, check the Interrupt Disable bit in the Command register, and the Interrupt Pin and Interrupt Line registers of the PCI Configuration Space.

  • Ensure that cfg_interrupt_int is held until cfg_interrupt_done/fail is received.
    Certain IP only need this signal asserted for one clock cycle, while others require it to be held steady until the IP responds with a done or fail indication.
    Some are also one-hot encoded, so check the Product Guide for the given IP.
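
The interrupt enables can be read back from the host (10ee is the Xilinx vendor ID):

    # MSI / MSI-X capability lines report Enable+ or Enable-
    sudo lspci -vvv -d 10ee: | grep -i 'msi'

    # For Legacy interrupts, check DisINTx in the Control line and the pin/IRQ routing
    sudo lspci -vvv -d 10ee: | grep -iE 'disintx|interrupt:'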


Simulation Debug

  • Check by bypassing EQ Phase 2 and 3 using PL_EQ_BYPASS_PHASE23
    • Useful if the issue is seen only in simulation with a Gen3 design.
  • For PCIe DMA simulation issues, always ensure that the IP is generated in Vivado with the target language set to Verilog.
    • A timeout error might be seen if the target language is set to VHDL.
  • When using third-party simulators, always ensure that the version of the simulator supported by the given Vivado release is used.
    • Changing the ModelSim version from 10.3a to 10.5c (the recommended version for that Vivado release) has in some cases resolved the issue.
  • Check all reset and clock signals. Are these working as expected?  Are the frequencies and polarities correct? 
  • Check the top level connections. Are TX and RX connected as expected? 
  • Are the Unisim_ver models being called first?  If not, put them first in the library call list of the simulation launch.  
  • Is the script calling any FAST libraries? 
  • Are both sides expecting PIPE or serial (i.e. is one sending via txp/txn, and the other pipe_txdata)?
  • Are the transactions on the user interface synchronous to the user clocks?
  • Are all top level inputs driven?  

General Checks 

  • Compare the lspci log between the working and the failing cases
  • Cross-check the read/write request size against the Max Payload Size register in the config space of the endpoint, as there is a chance the host machine did not scan the config registers correctly.
    • [Issue: Missing DMA read data when the request is 256 or 512 bytes, although it worked with 128 bytes.
      The issue was seen when programming the device on the board first, then placing the board in the PCIe slot and having the host rescan the PCIe bus.
      The Host bridge did not enumerate the EP config registers properly.]
  • Make sure the first_be and last_be signals are correctly driven from the user application.
    • [Issue: Missing payload in TLPs when the DW count is greater than 1.]
  • Each time you load a new .bit file into the FPGA, you need to either reboot the PC or remove and rescan the device to enumerate again with the parameters of the newly loaded .bit file (see the sketch after this list).
    Use the setpci command to enable Memory and Bus Master. See: (Xilinx Answer 37406)
  • Make sure the BAR is configured correctly and confirm Memory Read and Memory Write addresses are correct.
    • [Issue: PCIe Simulation failed when the RP tried to write to the EP due to incorrect BAR configuration.]
  • Make sure that the packet formatting is correct as per the PG.
    • [Issue: Completion packet for a MemRd request seen without any data.]
  • In Tandem PCIe designs on UltraScale/UltraScale+ devices, if BAR access transactions fail despite successful link training and enumeration, check that bit 12 of the MCAP Control register (offset 14h of the MCAP extended config space) is set to 1.
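
A sketch of the remove/rescan sequence after loading a new .bit file (0000:01:00.0 is a placeholder for your device's bus address):

    # Remove the stale device node before reprogramming the FPGA
    sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove'

    # ... reprogram the FPGA over JTAG here ...

    # Rescan so the host re-reads the new configuration space
    sudo sh -c 'echo 1 > /sys/bus/pci/rescan'

    # Re-enable Memory Space and Bus Master (see Xilinx Answer 37406)
    sudo setpci -s 01:00.0 COMMAND=0x06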

Revision History:


04/19/2018 : Initial Release

 

AR# 69751
Date 09/24/2018
Status Active
Type General Article