Implementing 70 high-speed differential
pairs on a 9U PCB using regular off-the-shelf
deserializers can be a nightmare; high-speed
PCB design, noise, clock jitter, and
signal integrity are the main challenges.
Even the smallest deserializer packages
would occupy roughly two-thirds of a 9U
board, on which you would still need space
for the logic – configuration, memories,
access interfaces, and local control.
Our design concerns a data concentrator
card (DCC), part of a large high-energy
physics experiment at the European
Organization for Nuclear Research
(CERN) in Geneva. A very large particle
accelerator called the Large Hadron
Collider (LHC) is being constructed near
the Franco-Swiss border west of Geneva. A
number of experiments will be conducted
to observe and measure the various properties
of several existing, and possibly new,
fundamental particles.
One such experiment is called the
Compact Muon Solenoid (CMS), which is
based on a large superconducting magnet
system. The CMS will have a number of subdetectors,
including an Electromagnetic
Calorimeter (ECAL). The ECAL will use
about 80,000 crystals to capture the energy
of the photons and electrons. The data collected
from these crystals will be captured,
processed, and transmitted by the DCCs
(about 60 of them) for further analysis.
Design Overview
The DCC includes 70 high-speed optical
receiver channels (housed in 6 receiver blocks
of 12 channels each) implemented on a 9U VME
board (36 cm x 40 cm) working at 800 Mbps
using a 2-byte 8b/10b protocol.
For the implementation of the transceivers,
we had two choices:
- As many as 70 discrete deserializers,
along with 35 FPGAs for the required
control (this number was based on
cost considerations), for a total device
count of 105. This would have given
us more granularity and a lower cost,
but more components and hence
higher debug and testing times.
- Only nine Xilinx® Virtex-II Pro™ devices with eight embedded
RocketIO™ transceivers on each (only
the XC2VP7-FG456 part was available
at the time). We would lose some granularity,
but the PCB would be much
less dense and easier to test. (Figure 1.)
We picked the second choice, as it meant
a significant saving in device count (from
105 to 9). And because the DCCs will be in
operation for four to five years, this choice
will have a huge impact on overall PCB design
and on the final cost of production and
maintenance from a long-term perspective.
Also, after deserialization, we will need
to verify the integrity of received data and
reformat it for downstream processing and
analysis. We found that the remaining
resources in the selected device were enough for most purposes. Of the 72 transceivers
available, we use 70 and leave the other two
unconnected. The use of 800 Mbps per
channel is a system choice, but the design
could work at 1.6 Gbps or higher.
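As a quick check of the figures above, the following Python sketch works through the device-count and bandwidth arithmetic. The numbers are taken from the text; the function and variable names are ours, and the payload figure assumes only the standard 8b/10b overhead of 10 line bits per data byte.

```python
# Back-of-the-envelope figures for the DCC links (all names are illustrative).
DEVICES = 9                    # Virtex-II Pro FPGAs on the board
TRANSCEIVERS_PER_DEVICE = 8    # RocketIO transceivers per XC2VP7
CHANNELS_USED = 70
LINE_RATE_BPS = 800_000_000    # serial line rate per channel


def link_budget(line_rate_bps: int, channels: int) -> dict:
    """Return payload and word rates for 8b/10b-coded links with 2-byte words."""
    payload_bps = line_rate_bps * 8 // 10   # 8 data bits per 10 line bits
    word_rate_hz = line_rate_bps // 20      # one 16-bit word per 20 line bits
    return {
        "payload_bps_per_channel": payload_bps,
        "word_rate_hz": word_rate_hz,
        "total_payload_bps": payload_bps * channels,
    }


available = DEVICES * TRANSCEIVERS_PER_DEVICE
spare = available - CHANNELS_USED
discrete_device_count = 70 + 35   # deserializers plus control FPGAs
budget = link_budget(LINE_RATE_BPS, CHANNELS_USED)

print(f"transceivers available: {available}, spare: {spare}")
print(f"device count: {discrete_device_count} discrete vs {DEVICES} embedded")
print(f"payload per channel: {budget['payload_bps_per_channel'] / 1e6:.0f} Mbps")
print(f"16-bit word rate: {budget['word_rate_hz'] / 1e6:.0f} MHz")
```

Under these assumptions, each 800 Mbps channel carries 640 Mbps of decoded data, that is, one 16-bit word at 40 MHz.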
PCB Design Issues
The DCC PCB is a 12-layer board with
four power planes and eight routing layers.
We have mostly followed the main
rules for high-speed design and analog
considerations from Chapter 4 of the
Xilinx RocketIO™ Transceiver User
Guide, such as:
- All high-speed traces are impedance
controlled and routed manually as
microstrip edge-coupled differential
pairs, with impedance matched to
50 Ohms and as close as possible to the
source (respecting the crosstalk rules).
No other lines were routed in the same
area as the high-speed layout, and the
immediately adjacent layer was the
ground plane.
- All high-speed differential
pair signals were AC coupled
with 100 nF capacitors and
internally terminated to 50 Ohms.
- All of the transceivers’ power supply
pins were filtered with an individual
LC filter and a separate power plane
for the “analog” supply, also with specific
filters. No transceiver power supply
was left unconnected, regardless of
whether it was used or not. We used
the same type of LC filters on the optical
receivers.
- Approximately 350 power supply decoupling
capacitors of three different values
(to match the main clock frequencies in
use on the board) were placed as close as possible to the central power pins of the
Xilinx FPGAs. Other capacitors were
placed near each FPGA.
- Each FPGA received one high-quality
reference clock (low jitter – 100 ps
peak-to-peak) differential pair from an
individual buffer. We recommend using
two independent reference clock sources
to ease the internal usage of this clock
on the FPGA if using all of the
RocketIO transceivers.
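To give a feel for the AC-coupling values in the list above, here is a minimal Python sketch that estimates the high-pass corner formed by a 100 nF capacitor and a 50 Ohm termination. It assumes a simple first-order RC model and counts only the termination resistance; the resistance actually seen by the capacitor also depends on the driver, so treat the result as an order-of-magnitude figure. The function name is ours.

```python
import math


def highpass_corner_hz(r_ohms: float, c_farads: float) -> float:
    """Corner frequency of a first-order RC high-pass: f = 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)


R_TERMINATION = 50.0   # internal termination from the text, in ohms
C_COUPLING = 100e-9    # 100 nF coupling capacitor from the text, in farads

corner = highpass_corner_hz(R_TERMINATION, C_COUPLING)
print(f"high-pass corner: {corner / 1e3:.1f} kHz")
```

With these values the corner lands near 32 kHz, several orders of magnitude below the 800 Mbps line rate. Because 8b/10b coding limits run lengths to five identical bits, the signal has very little energy at such low frequencies.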
RocketIO Implementation and Issues
Virtex-II Pro devices provide the first stage
of processing for the front-end data
(received from the on-detector electronics)
on the DCC board. Each device receives
800 Mbps of serial data on each of its eight
channels from the optical receivers, for a
total of 6.4 Gbps per device. In a nutshell,
the purpose of the Xilinx FPGAs is to
process this data and prepare it for readout.
RocketIO transceivers are used to deserialize
the received data and perform
8b/10b decoding. The 16-bit data is then written in a programmable latency buffer
to match the trigger latency. A number of
data verification checks are carried out. The
data is finally formatted into 64-bit words
and written into FIFOs. From there, it is
read out by the event builder on the board.
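The data path just described can be modelled in a few lines of Python. This is only a behavioural sketch, not the FPGA logic: the class and function names are ours, the latency buffer is modelled as a plain delay line, and the order in which 16-bit words are packed into a 64-bit word (first word in the least significant bits) is an assumption, since the text does not specify it.

```python
from collections import deque
from typing import Sequence


class LatencyBuffer:
    """Delay line with a programmable depth, measured in 16-bit words."""

    def __init__(self, latency: int, fill: int = 0) -> None:
        if latency < 0:
            raise ValueError("latency must be non-negative")
        self._line = deque([fill] * latency)

    def push(self, word: int) -> int:
        """Store one word and return the word written `latency` pushes ago."""
        self._line.append(word & 0xFFFF)
        return self._line.popleft()


def pack64(words: Sequence[int]) -> int:
    """Pack four 16-bit words into one 64-bit word (assumed order: first word lowest)."""
    if len(words) != 4:
        raise ValueError("expected exactly four 16-bit words")
    value = 0
    for index, word in enumerate(words):
        value |= (word & 0xFFFF) << (16 * index)
    return value


buffer = LatencyBuffer(latency=3)
delayed = [buffer.push(w) for w in (0x1111, 0x2222, 0x3333, 0x4444, 0x5555)]
print([hex(w) for w in delayed])
print(hex(pack64([0x1111, 0x2222, 0x3333, 0x4444])))
```

In the model, the first three outputs are the fill value, after which each word reappears three pushes after it was written.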
Without going into the details of the
functionality, we will focus on the various
issues we faced (and solved) in making the
real hardware churn out correct data, with
a focus on the use of RocketIO transceivers.
Much of what we learned was on a
trial-and-error basis. The main issue was
related to the reference clock, which we’ll
describe in detail in the next section.
The other significant issue that we
faced was the alignment of the K character
within the 2-byte data path of the received
data. We were initially using
the Gigabit_Ethernet primitive
in half-rate mode for a 2-byte
data path. But we observed that
not all of the channels were
putting the K character in the
same place within the 2-byte
word and there was no way to
force this alignment in the
Gigabit_Ethernet primitive
(the ALIGN_COMMA_MSB
parameter of this primitive is
set to FALSE by default).
Because our protocol expected
the K to always appear on the
LSB of the word, we switched to
the GT_CUSTOM primitive,
where we could force the alignment and subsequently
swap the position of K to the LSB
of the data. The simulations showed perfect
alignment – but in real hardware, some of
the channels were getting misaligned.
A colleague of ours referred us to the
design note about 32-bit word comma
alignment in the RocketIO transceiver user
guide. Although this is usually needed only
for a 4-byte data path, we implemented a
similar scheme for our 2-byte data path and
this fixed our misalignment problem.
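The idea behind such a realignment scheme can be illustrated with a small software model. The Python sketch below is our own simplification and is not taken from the Xilinx design note: each received word is a pair of bytes in arrival order, the K character is assumed to belong in the first slot, and 0xBC (K28.5) is used as an example comma. How these slots map onto the MSB and LSB of the RocketIO data bus is not modelled.

```python
from typing import Iterable, Iterator, Optional, Tuple

Byte = Tuple[int, bool]    # (value, is_k): is_k marks an 8b/10b K character
Word = Tuple[Byte, Byte]   # two bytes in arrival order

COMMA: Byte = (0xBC, True)  # K28.5, used here only as an example comma


def data(value: int) -> Byte:
    """Build an ordinary (non-K) data byte."""
    return (value & 0xFF, False)


def realign(words: Iterable[Word]) -> Iterator[Word]:
    """Yield words whose K characters always sit in the first slot.

    When a K character is seen in the second slot, the word boundary is off
    by one byte. From then on, each output word is rebuilt from the byte held
    over from the previous input word plus the first byte of the current one.
    """
    shifted = False
    held: Optional[Byte] = None
    for first, second in words:
        if first[1]:
            shifted = False          # K in the expected slot: boundary is right
        elif second[1]:
            shifted = True           # K one byte late: boundary is off by one
        if shifted:
            if held is not None:
                yield (held, first)
            held = second            # keep the late byte for the next word
        else:
            held = None
            yield (first, second)


misaligned = [
    (data(0xEE), COMMA),
    (data(1), data(2)),
    (data(3), COMMA),
    (data(4), data(5)),
]
for word in realign(misaligned):
    print(word)
```

A channel that is already aligned passes through unchanged, so the same logic can sit on every channel regardless of how the deserializer happened to lock.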
Clock, Programming, and JTAG
We cannot over-emphasize the need for a
high-quality reference clock. Besides satisfying
all of the criteria specified in the
RocketIO user manual, we made sure that
our reference clock was as clean as we could
possibly get (see Figure 2).
We used a quartz-based phase-locked
loop (QPLL) circuit developed at CERN
to give our system the lowest-jitter clock
source we could (100 ps peak-to-peak). We
found that a lot of problems in the performance
of the RocketIO devices could be
traced to a noisy/jittery reference clock. If
you are using RocketIO transceivers on
both halves of the chip, then it’s much better
to have two reference clocks. We believe
this helps even if you are running the
RocketIO transceivers in half-rate mode
(which is our case).
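For a rough sense of scale, the 100 ps peak-to-peak figure can be compared with the serial bit period (the unit interval, UI). The Python sketch below does only that arithmetic; the function name is ours, and the comparison is illustrative, since reference-clock jitter does not translate one-to-one into jitter on the serial data.

```python
def jitter_fraction_of_ui(jitter_ps: float, line_rate_bps: float) -> float:
    """Express a peak-to-peak jitter figure as a fraction of one bit period."""
    unit_interval_ps = 1e12 / line_rate_bps
    return jitter_ps / unit_interval_ps


JITTER_PS = 100.0  # peak-to-peak reference clock jitter quoted in the text

for rate_bps in (800e6, 1.6e9):
    fraction = jitter_fraction_of_ui(JITTER_PS, rate_bps)
    print(f"{rate_bps / 1e6:.0f} Mbps: {fraction:.0%} of a unit interval")
```

The same 100 ps is 8% of a bit period at 800 Mbps but 16% at 1.6 Gbps, which suggests why clock quality matters more as the line rate goes up.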
Another aspect of the clocking scheme
that we used was to pass the reference clock
through a global clock buffer after an input
global differential clock buffer. With this
arrangement we observed improved stability,
and FPGA Editor showed a more uniform
distribution of the reference clock.
Also, though not directly related to the
high-speed transceivers, we found that an
independent post-configuration DCM
reset logic (usually recommended if you
have an external feedback clock) is useful
even when using internal feedback. This
solved a problem we were having with the
DCMs where they were sometimes not
locking after reconfiguration. Xilinx
Technical Support helped us find the solution
(Xilinx Answer Record 14425).
As for programming and JTAG, we
used the same group of EPROMs to configure
eight of the nine FPGAs. One of the
FPGAs is the master and provides the clock
for all the devices in the chain. The ninth
FPGA has a different pinout and a separate
EPROM for itself.
All circuits are connected in the same
JTAG chain, which improved reprogramming
time mainly during the “test” stages.
We found that a need exists for a pull-up
resistor on the TDO output of each Xilinx
device, something that we hope Xilinx will
add in future devices. JTAG is also
used to check the board interconnections
after assembly.
Conclusion
In this article, we’ve shown the advantages
of using embedded deserializers instead of
discrete components on a large project. By
using nine 456-pin FPGAs to do the same
job as 105 TQFPs, we saved time, both in
the design and debugging phases. The
approach is also flexible, because the FPGAs
are reprogrammable, and more economical
in the long term.
We are currently considering migrating
to a bigger Xilinx device as our processing
requirements from the FPGAs increase.
Therefore, we are studying the new devices
available and how such a migration will
affect our PCB design in terms of the routing
of the high-speed lines.
We believe that by following the design
rules concerning high-speed design, like
clean clock distribution, power supply
filtering, and good routing of the internal
reference clocks, it is possible to obtain a
successful design in good time. For more
information, please write to us at
jc.silva@cern.ch or adarsh.jain@cern.ch.
Printable PDF version of this article with graphics. (9/10/04) 235 KB