Xcell Journal Online
Implementing 70 High-Speed Channels with 9 FPGAs



by Jose C. Da Silva, Design Engineer, LIP (Laboratorio Instrumentacao e Particulas) – Lisbon
jc.silva@cern.ch

Adarsh Jain, Design Engineer, LIP (Laboratorio Instrumentacao e Particulas) – Lisbon
adarsh.jain@cern.ch (9/10/04)


Using nine Xilinx XC2VP7 circuits on a data concentrator card greatly reduced costs and PCB design effort and increased board reliability.


Implementing 70 high-speed differential pairs on a 9U PCB using regular off-the-shelf deserializers can be a nightmare; high-speed PCB design, noise, clock jitter, and signal integrity are the main challenges. Even the smallest deserializer packages would occupy roughly two-thirds of a 9U board, on which you would still need space for the logic: configuration, memories, access interfaces, and local control.

Our design concerns a data concentrator card (DCC), part of a large high-energy physics experiment at the European Organization for Nuclear Research (CERN) in Geneva. A very large particle accelerator called the Large Hadron Collider (LHC) is being constructed near the Franco-Swiss border west of Geneva. A number of experiments will be conducted to observe and measure the various properties of several existing, and possibly new, fundamental particles.

One such experiment is called the Compact Muon Solenoid (CMS), which is based on a large superconducting magnet system. The CMS will have a number of subdetectors, including an Electromagnetic Calorimeter (ECAL). The ECAL will use about 80,000 crystals to capture the energy of the photons and electrons. The data collected from these crystals will be captured, processed, and transmitted by the DCCs (about 60 of them) for further analysis.

Design Overview
The DCC includes 70 high-speed optical receiver channels (provided by 6 blocks of 12 channels each, of which 70 channels are used) implemented on a 9U VME board (36 cm x 40 cm). Each channel works at 800 Mbps using a 2-byte 8b/10b protocol.

For the implementation of the transceivers, we had two choices:

  1. As many as 70 discrete deserializers, along with 35 FPGAs for the required control (this number was based on cost considerations), for a total device count of 105. This would have given us more granularity and a lower cost, but more components and hence longer debug and test times.
  2. Only nine Xilinx® Virtex-II Pro™ devices with eight embedded RocketIO™ transceivers on each (only the XC2VP7-FG456 part was available at the time). We would lose some granularity, but the PCB would be much less dense and easier to test (see Figure 1).
We picked the second choice, as it meant a significant reduction in device count (from 105 to 9). And because the DCCs will be in operation for four to five years, this choice will have a large impact on the overall PCB design and on the final cost of production and maintenance from a long-term perspective.

Also, after deserialization, we need to verify the integrity of the received data and reformat it for downstream processing and analysis. We found that the remaining resources in the selected device were enough for most purposes. Of the 72 transceivers available (nine devices with eight each), we use 70 and leave the other two unconnected. The use of 800 Mbps per channel is a system choice, but the design could work at 1.6 Gbps or higher.
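
As a quick sanity check, the device and transceiver budget for the two options can be worked through in a few lines of Python. All figures are taken from the text above; the variable names are ours and purely illustrative.

```python
# Device-count and transceiver budget for the two options.
# Figures come from the article; variable names are illustrative only.

CHANNELS_NEEDED = 70

# Option 1: one discrete deserializer per channel plus 35 control FPGAs.
discrete_deserializers = CHANNELS_NEEDED
control_fpgas = 35
option1_devices = discrete_deserializers + control_fpgas

# Option 2: nine XC2VP7 devices with eight RocketIO transceivers each.
option2_devices = 9
transceivers_per_fpga = 8
transceivers_available = option2_devices * transceivers_per_fpga
spare_transceivers = transceivers_available - CHANNELS_NEEDED

print("option 1 devices:", option1_devices)              # 105
print("option 2 devices:", option2_devices)              # 9
print("transceivers available:", transceivers_available) # 72
print("spare transceivers:", spare_transceivers)         # 2
```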

PCB Design Issues
The DCC PCB is a 12-layer board with four power planes and eight routing layers. We have mostly followed the main rules for high-speed design and analog considerations from Chapter 4 of the Xilinx RocketIO™ Transceiver User Guide, such as:

  • All high-speed traces were impedance controlled and routed manually as microstrip edge-coupled differential pairs, with impedance matched to 50 Ohms and as close as possible to the source (respecting the crosstalk rules). No other lines were routed in the same area as the high-speed layout, and the immediately adjacent layer was the ground plane.
  • All high-speed differential pair signals were AC coupled with 100 nF capacitors and internally terminated to 50 Ohms.
  • Each transceiver power supply pin was filtered with its own LC filter, and the “analog” supply has a separate power plane, also with dedicated filters. No transceiver power supply was left unconnected, regardless of whether the transceiver was used or not. We used the same type of LC filters on the optical receivers.
  • Approximately 350 power supply decoupling capacitors of three different values (to match the main clock frequencies in use on the board) were placed as close as possible to the central power pins of the Xilinx FPGAs. Other capacitors were placed near each FPGA.
  • Each FPGA received one high-quality reference clock (low jitter – 100 ps peak-to-peak) differential pair from an individual buffer. We recommend using two independent reference clock sources to ease the internal usage of this clock on the FPGA if using all of the RocketIO transceivers.
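
To give a feel for the AC-coupling values in the list above, the sketch below estimates the high-pass corner frequency formed by a 100 nF series capacitor and the termination. It assumes a simple first-order RC model, which is our simplification and not a calculation from the article; the function name is hypothetical. Two cases are shown: the 50 Ohm termination alone, and 100 Ohms as a rough allowance for an assumed 50 Ohm source impedance in series.

```python
import math


def highpass_corner_hz(resistance_ohms, capacitance_farads):
    """Corner frequency of a first-order RC high-pass: f = 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * resistance_ohms * capacitance_farads)


COUPLING_CAP_F = 100e-9  # 100 nF, from the article

# Case 1: only the 50 Ohm internal termination is considered.
corner_termination_only = highpass_corner_hz(50.0, COUPLING_CAP_F)

# Case 2 (assumption): 50 Ohm source in series with the 50 Ohm termination.
corner_with_source = highpass_corner_hz(100.0, COUPLING_CAP_F)

print(f"{corner_termination_only / 1e3:.1f} kHz")  # about 31.8 kHz
print(f"{corner_with_source / 1e3:.1f} kHz")       # about 15.9 kHz
```

Under either assumption the corner sits in the tens of kilohertz, several orders of magnitude below the 800 Mbps signalling rate, which is consistent with using a DC-balanced code such as 8b/10b on an AC-coupled link.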
RocketIO Implementation and Issues
Virtex-II Pro devices provide the first stage of processing for the front-end data (received from the on-detector electronics) on the DCC board. Each device receives 800 Mbps of serial data on each of its eight channels from the optical receivers, for a total of 6.4 Gbps per device. In a nutshell, the purpose of the Xilinx FPGAs is to process this data and prepare it for readout.
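
The rates quoted above follow from simple arithmetic, sketched here in Python. The 8b/10b overhead (10 line bits per data byte) is a property of the code itself; the function names are illustrative and not taken from the design.

```python
LINE_RATE_BPS = 800_000_000   # per channel, from the article
CHANNELS_PER_FPGA = 8
BYTES_PER_WORD = 2            # 2-byte data path
LINE_BITS_PER_BYTE = 10       # 8b/10b: 10 line bits carry 8 data bits


def aggregate_line_rate_bps(line_rate_bps, channels):
    """Total serial input rate for one device."""
    return line_rate_bps * channels


def payload_rate_bps(line_rate_bps):
    """Data rate per channel after removing the 8b/10b overhead."""
    return line_rate_bps * 8 // LINE_BITS_PER_BYTE


def word_rate_hz(line_rate_bps, bytes_per_word):
    """Number of decoded words delivered per second on one channel."""
    return line_rate_bps // (LINE_BITS_PER_BYTE * bytes_per_word)


print(aggregate_line_rate_bps(LINE_RATE_BPS, CHANNELS_PER_FPGA))  # 6400000000
print(payload_rate_bps(LINE_RATE_BPS))                            # 640000000
print(word_rate_hz(LINE_RATE_BPS, BYTES_PER_WORD))                # 40000000
```

So each channel delivers one 16-bit word at a nominal 40 MHz, and each device handles 6.4 Gbps of serial input.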

RocketIO transceivers are used to deserialize the received data and perform 8b/10b decoding. The 16-bit data is then written in a programmable latency buffer to match the trigger latency. A number of data verification checks are carried out. The data is finally formatted into 64-bit words and written into FIFOs. From there, it is read out by the event builder on the board.
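
The two buffering steps in this data path can be illustrated with a small behavioural model in Python. This is only a sketch of the idea, not the HDL used on the board: the class and function names are hypothetical, and the packing order (first word in the least significant 16 bits) is an assumption, since the article does not specify it.

```python
from collections import deque


class LatencyBuffer:
    """Delay line: a word pushed now is returned `latency` pushes later."""

    def __init__(self, latency, fill=0):
        if latency < 1:
            raise ValueError("latency must be at least 1")
        # Pre-filled so that the first `latency` outputs are `fill`.
        self._line = deque([fill] * latency, maxlen=latency)

    def push(self, word):
        oldest = self._line[0]
        # With maxlen set, appending discards the oldest entry.
        self._line.append(word & 0xFFFF)
        return oldest


def pack_words_64(words):
    """Pack 16-bit words into 64-bit words, four at a time.

    Assumption: the first word of each group lands in bits 15..0.
    """
    words = list(words)
    if len(words) % 4 != 0:
        raise ValueError("need a multiple of four 16-bit words")
    packed = []
    for start in range(0, len(words), 4):
        value = 0
        for lane, word in enumerate(words[start:start + 4]):
            value |= (word & 0xFFFF) << (16 * lane)
        packed.append(value)
    return packed


# Usage: delay a short stream by three word periods, then pack it.
buf = LatencyBuffer(latency=3)
delayed = [buf.push(w) for w in [1, 2, 3, 4, 5]]
print(delayed)                                               # [0, 0, 0, 1, 2]
print(hex(pack_words_64([0x1111, 0x2222, 0x3333, 0x4444])[0]))  # 0x4444333322221111
```

Making the delay a constructor parameter mirrors the "programmable" aspect: the latency can be matched to the trigger latency without changing the rest of the path.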

Without going into the details of the functionality, we will focus on the various issues we faced (and solved) in making the real hardware churn out correct data, with a focus on the use of RocketIO transceivers. Much of what we learned was on a trial-and-error basis. The main issue was related to the reference clock, which we’ll describe in detail in the next section.

The other significant issue that we faced was the alignment of the K character within the 2-byte data path of the received data. We were initially using the Gigabit_Ethernet primitive in half-rate mode for a 2-byte data path. But we observed that not all of the channels were putting the K character in the same place within the 2-byte word and there was no way to force this alignment in the Gigabit_Ethernet primitive (the ALIGN_COMMA_MSB parameter of this primitive is set to FALSE by default).

Because our protocol expected the K to always appear on the LSB of the word, we switched to the GT_CUSTOM primitive, where we could force the alignment and subsequently swap the position of K to the LSB of the data. The simulations showed perfect alignment – but in real hardware, some of the channels were getting misaligned.

A colleague of ours referred us to the design note about 32-bit word comma alignment in the RocketIO Transceiver User Guide. Although this is usually needed only for a 4-byte data path, we implemented a similar scheme for our 2-byte data path and this fixed our misalignment problem.
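
The general idea behind such a realignment scheme can be shown with a behavioural model in Python. This is a sketch under stated assumptions, not the design note's logic or our HDL: it assumes the low byte of each word is transmitted first, that the comma is K28.5 (0xBC), and that each received byte carries a "is K" flag; all class and function names are hypothetical. When the comma shows up in the high byte lane, the model holds back that byte and pairs it with the low byte of the next received word, which restores the original word boundary.

```python
K28_5 = 0xBC  # comma character (assumed to be the K character in use)


class CommaAligner2Byte:
    """Moves the K character back to the low byte of a 2-byte word."""

    def __init__(self):
        self._misaligned = False
        self._held = None  # (byte, is_k) from the previous high lane

    def push(self, data, k_flags):
        """k_flags: bit 0 = low byte is K, bit 1 = high byte is K.

        Returns a realigned (data, k_flags) pair, or None while waiting
        for the second half of a word.
        """
        lo, hi = data & 0xFF, (data >> 8) & 0xFF
        lo_k, hi_k = bool(k_flags & 1), bool(k_flags & 2)
        if hi_k and hi == K28_5:
            self._misaligned = True    # comma in the wrong lane
        elif lo_k and lo == K28_5:
            self._misaligned = False   # comma where the protocol expects it
        if not self._misaligned:
            out = (data & 0xFFFF, k_flags & 3)
        elif self._held is None:
            out = None
        else:
            held_byte, held_k = self._held
            out = ((lo << 8) | held_byte, (int(lo_k) << 1) | int(held_k))
        self._held = (hi, hi_k)
        return out


def receive_with_skew(words, skew):
    """Test helper: emulate a receiver that locks `skew` bytes late."""
    stream = []
    for data, k in words:
        stream.append((data & 0xFF, bool(k & 1)))
        stream.append(((data >> 8) & 0xFF, bool(k & 2)))
    stream = stream[skew:]
    received = []
    for i in range(0, len(stream) - 1, 2):
        (lo, lo_k), (hi, hi_k) = stream[i], stream[i + 1]
        received.append(((hi << 8) | lo, (int(hi_k) << 1) | int(lo_k)))
    return received


# Usage: K on the low byte of the first, second, and fifth words.
sent = [(0x00BC, 1), (0x11BC, 1), (0x2233, 0), (0x4455, 0), (0x66BC, 1), (0x7788, 0)]
aligner = CommaAligner2Byte()
fixed = [aligner.push(d, k) for d, k in receive_with_skew(sent, skew=1)]
print(fixed[1:] == sent[1:5])  # True
```

One word is lost while the model waits for the second half of the first shifted word; after that, every output word again has K on the low byte.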

Clock, Programming, and JTAG
We cannot over-emphasize the need for a high-quality reference clock. Besides satisfying all of the criteria specified in the RocketIO Transceiver User Guide, we made sure that our reference clock was as clean as we could possibly make it (see Figure 2).

We used a quartz-based phase-locked loop (QPLL) circuit developed at CERN for our system to provide the cleanest clock source we could (100 ps peak-to-peak jitter). We found that many problems in the performance of the RocketIO devices could be traced to a noisy or jittery reference clock. If you are using RocketIO transceivers on both halves of the chip, then it’s much better to have two reference clocks. We believe this helps even if you are running the RocketIO transceivers in half-rate mode (which is our case).
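
To put the 100 ps figure in perspective, the sketch below expresses it as a fraction of one bit period (unit interval) at the line rates mentioned in this article. Comparing reference-clock jitter directly with the data unit interval is only a rough yardstick of our own choosing, not the specification from the user guide; the function names are illustrative.

```python
def unit_interval_ps(line_rate_bps):
    """Duration of one bit on the serial link, in picoseconds."""
    return 1e12 / line_rate_bps


def jitter_fraction_of_ui(jitter_ps, line_rate_bps):
    """Peak-to-peak jitter expressed as a fraction of one unit interval."""
    return jitter_ps / unit_interval_ps(line_rate_bps)


JITTER_PP_PS = 100.0  # peak-to-peak, from the article

for rate_bps in (800e6, 1.6e9):
    ui = unit_interval_ps(rate_bps)
    frac = jitter_fraction_of_ui(JITTER_PP_PS, rate_bps)
    print(f"{rate_bps / 1e6:.0f} Mbps: UI = {ui:.0f} ps, jitter = {frac:.0%} of UI")
```

At 800 Mbps the unit interval is 1250 ps, so 100 ps is 8% of a bit period; at 1.6 Gbps the same jitter would take up 16%, which illustrates why the clock quality matters more as the line rate rises.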

Another aspect of the clocking scheme that we used was to pass the reference clock through a global clock buffer after an input global differential clock buffer. With this arrangement we observed improved stability and, in the FPGA Editor, a more uniform distribution of the reference clock.

Also, though not directly related to the high-speed transceivers, we found that independent post-configuration DCM reset logic (usually recommended if you have an external feedback clock) is useful even when using internal feedback. This solved a problem we were having with the DCMs, which sometimes failed to lock after reconfiguration. Xilinx Technical Support helped us find the solution (Xilinx Answer Record 14425).

As for programming and JTAG, we used the same group of EPROMs to configure eight of the nine FPGAs. One of the FPGAs is the master and provides the clock for all the devices in the chain. The ninth FPGA has a different pinout and its own separate EPROM.

All circuits are connected in the same JTAG chain, which shortened reprogramming time, mainly during the test stages. We found that a pull-up resistor is needed on the TDO output of each Xilinx device, something that we hope Xilinx will add in future devices. JTAG is also used to check the board interconnections after assembly.

Conclusion
In this article, we’ve shown the advantages of using embedded deserializers instead of discrete components on a large project. By using nine 456-pin FPGAs to do the same job as 105 TQFPs, we saved time in both the design and debugging phases. This is also a flexible approach, as the FPGAs are reprogrammable, and a more economical solution in the long term.

We are currently considering migrating to a bigger Xilinx device as our processing requirements from the FPGAs increase. Therefore, we are studying the new devices available and how such a migration will affect our PCB design in terms of the routing of the high-speed lines.

We believe that by following the rules for high-speed design, such as clean clock distribution, power supply filtering, and good routing of the internal reference clocks, it is possible to obtain a successful design in good time. For more information, please write to us at jc.silva@cern.ch or adarsh.jain@cern.ch.

© 1994-2008 Xilinx, Inc. All Rights Reserved.