Why do FFTs with large complex multipliers fail in PAR when targeting a Spartan-3A DSP device?
This can also happen on Virtex parts, but it is more common on Spartan-3A DSP devices because they have fewer DSP48As per column.
The failure is caused by the size of the complex multiplier, which requires a DSP48 cascade that is too long to be placed.
This often happens with the streaming architecture, which can create large complex multipliers and therefore long DSP48 cascades. For example, consider an FFT core with 18-bit input data and 18-bit twiddle factors that uses the streaming architecture. The implementation is decimation in frequency, so the complex multiplier comes after the butterfly, and the streaming architecture places a complex multiplier after every second butterfly. The data path grows by 1 bit in each butterfly, so it is 20 bits wide by the time it reaches the complex multiplier. The 18-bit twiddle factors are also extended internally by 1 bit so that +1 can be represented exactly, which makes the second input to the complex multiplier 19 bits wide. Each complex multiplier is therefore 20 x 19 bits.
Because the DSP48 inputs are 18 bits wide, each real multiplier must be built by cascading several DSP48s; in this case, 4 are required. With the "Optimize complex multipliers for speed using DSP48s" option selected, the complex multiplier uses 4 real multipliers, for a total of 16 DSP48As per complex multiplier. These must be cascaded in two groups of 8 DSP48s to produce the separate real and imaginary outputs.
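The arithmetic in this example can be reproduced with a short Python sketch. It is not taken from the core; the function names and the operand-splitting rule (a full 18-bit signed top piece plus 17-bit unsigned lower pieces) are assumptions chosen to match the figures quoted above, and the cascade length simply assumes two real multipliers per output, as in the two groups of 8 described in the text.

```python
import math

DSP48_INPUT_BITS = 18  # signed multiplier input width quoted in the text


def operand_pieces(width, port=DSP48_INPUT_BITS):
    """Number of DSP48-sized pieces needed for a signed operand.

    Assumption: the top piece uses the full signed port, and each lower
    piece is unsigned, so it carries only (port - 1) useful bits.
    """
    if width <= port:
        return 1
    return 1 + math.ceil((width - port) / (port - 1))


def complex_multiplier_size(data_bits, twiddle_bits, butterflies_before_mult=2):
    """Estimate operand widths and DSP48 usage for one 4-multiplier complex multiply."""
    data_w = data_bits + butterflies_before_mult  # 1 bit of growth per butterfly
    twiddle_w = twiddle_bits + 1                  # extra bit so +1 is exact
    per_real = operand_pieces(data_w) * operand_pieces(twiddle_w)
    return {
        "data_width": data_w,
        "twiddle_width": twiddle_w,
        "dsp48_per_real_multiplier": per_real,
        "dsp48_per_complex_multiplier": 4 * per_real,
        # Assumed: each output (real, imaginary) cascades two real multipliers.
        "cascade_length": 2 * per_real,
    }


if __name__ == "__main__":
    for name, value in complex_multiplier_size(18, 18).items():
        print(f"{name}: {value}")
```

For 18-bit data and 18-bit twiddle factors this gives a 20 x 19 multiplier, 4 DSP48s per real multiplier, 16 per complex multiplier, and a cascade of 8, which agrees with the description above.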
There are three possible work-arounds:
- Use a different, larger part with more DSP48s per column.
- Clear the "Optimize complex multipliers for speed using DSP48s" check box so that each complex multiplier is built from 3 real multipliers instead of 4. This might reduce your maximum clock frequency, but data precision is unchanged. It also reduces your DSP48 count to 75% of its current level.
- Reduce your input data width to 16 bits, your twiddle factor width to 17 bits, or both. Changing only one of these gives a complex multiplier that uses 8 DSP48s instead of 16; changing both gives one that uses only 4. Clock frequency is not affected, but data precision is slightly reduced.
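The DSP48 counts quoted for the second and third work-arounds can be compared side by side with a rough Python sketch. The helper names are hypothetical, the operand-splitting rule is an assumption chosen to match the numbers above, and the 3-multiplier case is modeled simply as three real multipliers of the same size (consistent with the 75% figure, but ignoring any extra bit growth inside that structure).

```python
import math


def pieces(width, port=18):
    """DSP48-sized pieces for a signed operand (assumed 18-bit top, 17-bit lower pieces)."""
    return 1 if width <= port else 1 + math.ceil((width - port) / (port - 1))


def dsp48_per_complex_multiplier(data_bits, twiddle_bits, real_multipliers=4):
    """Estimated DSP48 count for one complex multiplier in the streaming architecture."""
    data_w = data_bits + 2        # two butterflies of 1-bit growth before the multiplier
    twiddle_w = twiddle_bits + 1  # extra bit so +1 is exact
    return real_multipliers * pieces(data_w) * pieces(twiddle_w)


# (description, input data bits, twiddle bits, real multipliers per complex multiply)
OPTIONS = [
    ("original: 18-bit data, 18-bit twiddle, 4 multipliers", 18, 18, 4),
    ("3-real-multiplier structure",                          18, 18, 3),
    ("16-bit input data",                                    16, 18, 4),
    ("17-bit twiddle factors",                               18, 17, 4),
    ("16-bit data and 17-bit twiddle factors",               16, 17, 4),
]

if __name__ == "__main__":
    for label, data_bits, twiddle_bits, mults in OPTIONS:
        count = dsp48_per_complex_multiplier(data_bits, twiddle_bits, mults)
        print(f"{label}: {count} DSP48s")
```

Under these assumptions the estimates are 16, 12, 8, 8, and 4 DSP48s respectively, matching the figures given in the list above.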
For a detailed list of LogiCORE Fast Fourier Transform (FFT) Release Notes and Known Issues, see (Xilinx Answer 29209).