With their inherent flexibility, Xilinx FPGAs and SoCs are ideal for high-performance or multi-channel digital signal processing (DSP) applications that can take advantage of hardware parallelism. Xilinx FPGAs and SoCs combine this processing bandwidth with comprehensive solutions, including easy-to-use design tools for hardware designers, software developers, and system architects.
A standard von Neumann DSP architecture requires 256 cycles to complete a 256-tap FIR filter, while Xilinx FPGAs can achieve the same result in a single clock cycle.
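The cycle-count contrast above can be illustrated with a small software model. This is a minimal sketch, assuming a direct-form FIR filter. The function names and the cycle-counting model are illustrative assumptions, not Xilinx APIs or a timing simulation of any device.

```python
def fir_sequential(window, coeffs):
    """One multiply-accumulate per 'cycle', as a single-MAC processor would do."""
    acc = 0
    cycles = 0
    for x, c in zip(window, coeffs):
        acc += x * c
        cycles += 1
    return acc, cycles


def fir_parallel(window, coeffs):
    """One multiplier per tap: all products are formed together, then summed."""
    products = [x * c for x, c in zip(window, coeffs)]  # conceptually simultaneous
    return sum(products), 1  # modelled as a single 'cycle' of throughput


TAPS = 256
coeffs = [(i % 7) - 3 for i in range(TAPS)]       # arbitrary integer coefficients
window = [(i * 5) % 11 - 5 for i in range(TAPS)]  # arbitrary input samples

seq_result, seq_cycles = fir_sequential(window, coeffs)
par_result, par_cycles = fir_parallel(window, coeffs)
```

Both functions return the same output sample; only the modelled cycle count differs (256 versus 1). In real hardware the adder tree is typically pipelined, so "a single clock cycle" is best read as a throughput of one output per clock rather than a one-cycle latency.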
This massive parallelism translates into exceptional levels of DSP performance:
Xilinx DSP solutions include silicon, IP, reference designs, development boards, tools, documentation, and training to enable a wide range of applications in a breadth of markets, including, but not limited to, Wireless Communications, Data Center, and Aerospace and Defense.
Various tool flows are available for different use models and different levels of design abstraction:
Hardware designers can design in:
Software developers accustomed to developing in C/C++ can design using:
System architects can rapidly evaluate new algorithms with:
With Xilinx FPGAs and SoCs, designers can use multiple flows to deploy their DSP applications depending on design approach and level of abstraction.
Based on an ASIC-class architecture, Xilinx FPGAs combine multi-hundred gigabit-per-second I/O bandwidth with over 20 TeraMACs of fixed-point DSP performance in the Virtex® UltraScale+™ family. The Xilinx DSP slice and its parallelism are key to the achievable DSP performance in the latest generation of Xilinx FPGAs.
The UltraScale™ DSP48E2 slice is the 5th generation of DSP slices in Xilinx architectures.
This dedicated DSP processing block is implemented in full-custom silicon that delivers industry-leading power/performance, allowing efficient implementations of popular DSP functions such as a multiply-accumulator (MACC), multiply-adder (MADD), or complex multiply.
The slice also provides capabilities to perform different kinds of logic operations, such as AND, OR and XOR operations (UG579).
UltraScale architecture builds on the success of 7 series (DSP48E1), with further enhancements:
These enhancements help DSP-critical applications perform more computation within the DSP48E2 slice before going into the FPGA fabric, ultimately leading to both resource and power savings.
|DSP Tile/Slice Type||DSP48E1||DSP48E2|
|Multiple Add/Sub/Acc operations|
|Multiplier and MACC||25x18||27x18|
|Squaring: [(A or B) +/- D]^2|
|WMUX Feedback Ultra Efficient Complex Multiply CMACC||5 x DSP48E1||3 x DSP48E2|
|Integrated Pattern Detect Circuitry|
|Integrated Logic Unit|
|Wide Mux Functions (48 bit)|
|Wide XOR (96 bit)|
|Optional 96-bit Output|
|Sequential Complex Multiply, AB dyn access|
|AB Register Pipeline Balancing Improved|
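The table lists a complex multiply in three DSP48E2 slices, but the source does not describe how it is mapped. As an illustration only, one well-known identity computes a complex product with three real multiplications instead of four, at the cost of extra additions. The sketch below assumes integer inputs, and the function name is hypothetical.

```python
def complex_mul_3(a, b, c, d):
    """Compute (a + jb) * (c + jd) with three real multiplications.

    The sums (a + b), (d - c) and (c + d) are formed before multiplying,
    which is the kind of work a pre-adder could absorb (an assumption here,
    not a statement about the actual DSP48E2 mapping).
    """
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    real = k1 - k3  # = a*c - b*d
    imag = k1 + k2  # = a*d + b*c
    return real, imag


real, imag = complex_mul_3(3, 4, 5, -2)
reference = complex(3, 4) * complex(5, -2)
```

The result matches Python's built-in complex multiplication while using one fewer multiplier, which is the resource that dedicated DSP slices provide.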
Depending on your design preferences, Xilinx has tools supporting RTL, C/C++, and model-based design entry. This flexibility in the design flow, along with an extensive DSP IP catalog, facilitates easier adoption of Xilinx tools and devices.
The Vivado IDE serves as a design cockpit for system-level design: you can build your complete design, implement it, and write out a bit file to program your device.
The following table shows some of the key DSP performance metrics for the 7 series, UltraScale, and UltraScale+ families. For SoC device performance, see the Software Developer section.
|Metric||Artix-7||Kintex-7||Kintex UltraScale||Kintex UltraScale+||Virtex-7||Virtex UltraScale||Virtex UltraScale+|
|Logic Cells (K) / System Logic Elements (K) (1)||13–215||65–478||318–1,451||356–1,143||326–1,424||783–5,541||862–3,780|
|Fixed Point Performance (GMACs)||25–464||178–1,424||507–4,090||1,218–3,143||831–2,671||444–2,134||2,031–10,948|
|Fixed Point Performance For Symmetric Filters (GMACs) (2)||50–928||356–2,848||1,014–8,180||2,436–6,286||1,662–5,342||888–4,268||4,062–21,896|
|INT8 GOPs (3)||50–928||356–2,848||1,774–14,315||4,263–11,000||1,662–5,342||1,554–7,469||7,108–38,318|
|Single Precision Floating Point (GFLOPs) (4)||10–196||96–770||320–2,685||800–1,673||449–1,444||294–1,411||1,354–7,299|
|Single Precision Floating Point (GFLOPs) (5)||7–147||72–577||240–2,028||609–1,571||337–1,083||220–1,058||1,015–5,474|
|Half Precision Floating Point (GFLOPs) (6)||15–295||144–1,154||480–4,056||1,218–3,142||674–2,166||440–2,116||2,030–10,948|
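In the table above, the "Symmetric Filters" row is exactly twice the plain fixed-point row. The footnote that defines it is not part of this text, but the ratio is consistent with the standard folding technique for symmetric FIR filters: when `h[k] == h[N-1-k]`, the two samples that share a coefficient are added first, so each multiplier serves two taps. A minimal sketch, with hypothetical function names:

```python
def fir_direct(window, coeffs):
    """Direct form: one multiplication per tap."""
    return sum(x * h for x, h in zip(window, coeffs)), len(coeffs)


def fir_symmetric(window, coeffs):
    """Folded form for symmetric coefficients: pre-add, then multiply once."""
    n = len(coeffs)
    if coeffs != coeffs[::-1]:
        raise ValueError("coefficients are not symmetric")
    acc = 0
    multiplies = 0
    for k in range(n // 2):
        acc += coeffs[k] * (window[k] + window[n - 1 - k])  # pre-add, one multiply
        multiplies += 1
    if n % 2:  # odd length: the centre tap has no partner
        acc += coeffs[n // 2] * window[n // 2]
        multiplies += 1
    return acc, multiplies


half = [3, -1, 4, 1, -5, 9, 2, 6]
coeffs = half + half[::-1]  # 16 symmetric taps
window = list(range(16))

direct_result, direct_mults = fir_direct(window, coeffs)
folded_result, folded_mults = fir_symmetric(window, coeffs)
```

Both forms produce the same output sample, but the folded form uses 8 multiplications instead of 16, so a fixed number of multipliers can sustain twice the effective tap rate.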
To make the most efficient use of the DSP48 slices within Xilinx FPGAs, review the following information and techniques and apply them where possible.
Xilinx has introduced software development environments and a comprehensive set of familiar and powerful tools, libraries, and methodologies that allow software developers to target Xilinx FPGAs and SoCs with ease. With high-level abstraction environments like Vivado High-Level Synthesis (HLS), SDAccel, and SDSoC, Xilinx can offer GPU-like and familiar embedded application development and runtime experiences for C, C++, and/or OpenCL development.
SDSoC provides the ability to profile a given application and allows for the creation of hardware accelerators to run more efficiently in the Programmable Logic (PL), where the flexibility and parallelism of the FPGA are leveraged to provide large performance improvements. This also enables other functions of the application to run in the Processing System (PS) in parallel if desired.
By targeting Xilinx FPGAs and SoCs, many DSP and embedded applications can achieve higher efficiency and lower power.
The following tables show some of the key features and DSP performance metrics for both the Xilinx Zynq-7000 SoC and Zynq UltraScale+ MPSoC families. For non-SoC device performance, visit the Hardware Designer section.
|PROCESSING SYSTEM||Zynq-7000 SoC||Zynq UltraScale+ MPSoC|
|Application Processing Unit (APU)|
|Real-Time Processing Unit (RPU)|
|Dynamic Memory Interface||DDR3, DDR3L, DDR2, LPDDR2||DDR4, LPDDR4, DDR3, DDR3L, LPDDR3|
|High-Speed Peripherals||USB 2.0, Gigabit Ethernet, SD/SDIO||PCIe® Gen2, USB 3.0, SATA 3.1, DisplayPort, Gigabit Ethernet, SD/SDIO|
|Security||RSA, AES, and SHA, ARM TrustZone®||RSA, AES, and SHA, ARM TrustZone|
|Max I/O Pins||128||214|
|PROGRAMMABLE LOGIC||Zynq-7000 SoC||Zynq UltraScale+ MPSoC|
|System Logic Elements (K)||23–444||103–1,045|
|Max Memory (Mb)||1.8–26.5||5.3–70.6|
|Max I/O Pins||100–362||252–668|
|Fixed Point Performance (GMACs) (1)||42–1,313||213–3,143|
|Fixed Point Performance For Symmetric Filters (GMACs) (1) (2)||84–2,626||426–6,286|
|INT8 GOPs (1) (3)||84–2,626||745–11,000|
|INT16 GOPs (1)||84–2,626||426–6,286|
|Single Precision Floating Point (GFLOPs) (1) (4)||23–716||142–1,673|
|Single Precision Floating Point (GFLOPs) (1) (5)||17–537||106–1,571|
|Half Precision Floating Point (GFLOPs) (1) (6)||34–1,074||212–3,142|
To learn more about Xilinx SoCs and MPSoCs, go to:
The Processing System (PS) provides DSP processing capabilities by way of the different ARM processing cores.
For more information on DSP capabilities in the ARM processors, visit:
Some useful examples can be found at the following locations:
For Zynq UltraScale+ MPSoC, see UG1211 for a demonstration of an FFT using the ARM NEON instruction set.
For Zynq-7000 SoC, the following Tech Tips are available on Xilinx wiki when targeting the Cortex-A9 and ARM SIMD:
Xilinx devices offer very flexible data-type support. Varying precisions of fixed point, floating point, and integer are supported natively in Xilinx tools, with floating point implemented with the aid of the Floating Point Operator IP core.
Floating-point designs implemented on FPGAs lead to higher resource and power usage than fixed-point or integer implementations. Converting to a fixed-point solution where possible will bring large benefits:
For more details on the benefits of converting from floating point to fixed point data types, please read WP491.
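To make the floating-point to fixed-point conversion concrete, the sketch below quantizes values to a signed 16-bit format with 15 fractional bits (often written Q1.15) and multiplies them with integer arithmetic only. The format choice and all function names are assumptions for illustration; they are not taken from WP491 or from any Xilinx tool.

```python
FRAC_BITS = 15          # assumed Q1.15 format: 1 sign bit, 15 fractional bits
SCALE = 1 << FRAC_BITS  # 32768


def to_fixed(value):
    """Round to the nearest Q1.15 integer and saturate to the 16-bit range."""
    q = int(round(value * SCALE))
    return max(-SCALE, min(SCALE - 1, q))


def fixed_mul(a, b):
    """Multiply two Q1.15 integers; the raw product has 30 fractional bits,
    so shift right by 15 (this truncates toward negative infinity)."""
    return (a * b) >> FRAC_BITS


def to_float(q):
    """Convert a Q1.15 integer back to a float for inspection."""
    return q / SCALE


a, b = 0.3, 0.7
product = to_float(fixed_mul(to_fixed(a), to_fixed(b)))
error = abs(product - a * b)  # quantization error versus the float result
```

The integer multiply and shift are the only operations needed per product, which is why fixed-point arithmetic maps onto far less logic than a floating-point operator. The price is a bounded quantization error and a limited range, so values outside [-1, 1) saturate in this format.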
The tables below show a small selection of algorithms and the performance improvements possible when a Xilinx device, in particular the fabric in the programmable logic (PL), is used to accelerate the design.
|Algorithm||CPU/GPU||Zynq UltraScale+ MPSoC||Advantage|
|Stereo LocalBM @ 2K||ARM: 0.5 FPS/Watt; Nvidia: 3.5 FPS/Watt|
|ARM: 0.1 FPS/Watt; Nvidia: 0.8 FPS/Watt|
|ARM: 0.1 Imgs/s/W; Nvidia: 8.8 Imgs/s/W|
Note 1: ARM: Quad-core A53 run on Raspberry Pi @ 1200MHz
Note 2: Nvidia benchmarks were done using Tegra X1
Note 3: Optical Flow (LK) – Window Size 11x1
|Forward Projection||ARM: 3 sec/view||0.016 sec/view||188x|
|Motion Detection||ARM: 0.7 FPS||67 FPS||90x|
|Noise Reduction-Sobel||ARM: 1 FPS||67 FPS||60x|
|Canny Edge Detection||ARM: 0.66 FPS||40 FPS||45x|
|3D Image Reconstruction||ARM: 75k||8k||9x|
|DPD||ARM: 506 ms||31.3 ms||16x|
|FIR||TI DSP: 64020 ns||1200 ns||53x|
|FFT||TI DSP: 1036 ns||128 ns||8x|
Note 1: Cortex-A9 core used only on the Zynq devices when targeting ARM
Note 2: TI benchmarks were done using C66 DSP core
Xilinx high-level design tools like Vivado System Generator for DSP and Vivado High-Level Synthesis provide a level of abstraction that empowers system architects and domain experts to rapidly evaluate new algorithms and focus on developing the differentiating parts of their design. The complete Xilinx DSP solution is a combination of these design tools, IP, reference designs, methodologies, and boards that work together to get to a working production design in the shortest time possible.
The Vivado System Generator for DSP is a model-based design tool that leverages the MATLAB and Simulink environment to define, test, and implement production-quality DSP algorithms in programmable logic in a fraction of the traditional RTL development time.
The tool provides:
Learn more about Vivado System Generator for DSP:
Vivado High-Level Synthesis, included as a no-cost upgrade in all Vivado HLx Editions, enables portable C, C++, and SystemC algorithm specifications to be directly targeted into Xilinx devices without the need to create RTL. Just as there are compilers from C/C++ to different processor architectures, the HLS compiler provides the same functionality from C/C++ to Xilinx FPGAs.
Learn more about Vivado High-Level Synthesis:
Xilinx provides best-in-class tools to enable Digital Signal Processing (DSP) applications to be implemented efficiently and at low power on a Xilinx FPGA or SoC. Whether you are designing with RTL, C/C++/SystemC, or MATLAB/Simulink, the Xilinx tools below can facilitate your DSP design and reduce your time-to-market.
Xilinx offers a range of libraries which are optimized for performance, resource utilization and ease of use.
|Libraries & Frameworks||Description|
|Reconfigurable Acceleration Stack||The Xilinx Reconfigurable Acceleration Stack enables the world’s largest cloud service providers to develop and deploy acceleration platforms at cloud scale and delivers ultimate flexibility for complex cloud computing applications like machine learning, data analytics, and video transcoding.||Acceleration Zone|
|GitHub Repositories||Xilinx has created GitHub repositories which contain useful examples for many applications including DSP related functions.|
Xilinx and its partners work together to produce tools and boards to ease the adoption of Xilinx FPGAs and SoCs for DSP applications across many market segments. Xilinx also works closely with its partners to provide an extensive range of FPGA Mezzanine Cards.
|Avnet DSP-Centric Development Kits and Modules||Working with MathWorks and leading high-speed analog suppliers, Avnet offers DSP-centric development kits and production-ready system-on-modules (SOM) for embedded vision, software-defined radio, and high-performance motor control.|
|MathWorks Computing Software||MathWorks MATLAB® and Simulink® can reduce FPGA and SoC system development time significantly by enabling users to:|
|Analog Devices Add-On Boards||The AD-FMCDAQ2-EBZ FMC board is a self-contained data acquisition and signal synthesis prototyping platform whose ease of use enables quicker end-system signal processing development.|