Home : Documentation : Xcell Journal Online : Article
Implementing the H.264/AVC Video Coding Standard on FPGAs



by Wilson C. Chung, Senior Staff Video and Image Processing Engineer, Xilinx, Inc.
wilson.chung@xilinx.com (10/15/04)


Xilinx Virtex FPGAs provide excellent co-, pre-, and post-processing hardware acceleration solutions.


H.264/AVC is the latest international video coding standard in a series of such standards: H.261, MPEG-1, MPEG-2, H.263, and MPEG-4 Visual (part 2). It was approved by the ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) as Recommendation H.264 and by ISO/IEC as International Standard 14496-10 (MPEG-4 part 10) Advanced Video Coding (AVC) in May 2003.

Despite H.264/AVC’s promises of improved coding efficiency over existing video coding standards, it still presents tremendous engineering challenges to system architects, DSP engineers, and hardware designers. The H.264/AVC standard brought in the most significant changes and algorithmic discontinuities in the evolution of video coding standards since the introduction of H.261 in 1990.

The computational complexity, data locality, and algorithm and data parallelism required to implement the H.264/AVC coding standard often directly influence the overall architectural decisions at the system level. In turn, these decisions determine the ultimate cost of developing any commercially viable H.264/AVC system solution in the broadcasting, video editing, teleconferencing, and consumer electronics fields.

Complexity Analysis
To achieve a real-time H.264/AVC standard definition (SD) or high definition (HD) resolution encoding solution, system architects often employ multiple FPGAs and programmable DSPs. To illustrate the enormous computational complexity required, let’s explore the typical run-time cycle requirements of the H.264/AVC encoder based on the software model provided by the Joint Video Team (JVT), comprising experts from ITU-T’s Video Coding Experts Group (VCEG) and ISO/IEC’s Moving Picture Experts Group (MPEG).

Profiling with Intel™ VTune™ software on an Intel Pentium™ III 1.0 GHz general-purpose CPU with 512 MB of memory indicates that an H.264/AVC SD main-profile encoding solution would require approximately 1,600 BOPS (billions of operations per second).

Table 1 illustrates a typical profile of the H.264/AVC encoder complexity based on the Pentium III general-purpose processor architecture. Notice that in Table 1, the motion estimation, macroblock/block processing (including mode decision), and motion compensation modules are the primary candidates for hardware acceleration.

Table 1 – H.264/AVC encoder complexity profile by files

Functional block         % of total run-time cycles
mv_search.c              67.31 %
block.c                   8.19 %
refbuf.c                  6.95 %
macroblock.c              3.48 %
rdopt.c                   3.37 %
biariencode.c             3.21 %
cabac.c                   2.98 %
memcpy.asm*               2.91 %
abs.c*                    0.57 %
image.c                   0.54 %
rdopt_coding_state.c      0.46 %
loopFilter.c              0.03 %
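
As a rough illustration of how the profile in Table 1 bounds the benefit of hardware acceleration, the sketch below applies Amdahl's law to the listed percentages. The percentages come from Table 1; the acceleration factors, the function name `overall_speedup`, and the assumption that offloading incurs no data-transfer overhead are hypothetical and chosen only for illustration.

```python
# Run-time share per source file, copied from Table 1 (percent of total cycles).
PROFILE = {
    "mv_search.c": 67.31,
    "block.c": 8.19,
    "refbuf.c": 6.95,
    "macroblock.c": 3.48,
    "rdopt.c": 3.37,
    "biariencode.c": 3.21,
    "cabac.c": 2.98,
    "memcpy.asm": 2.91,
    "abs.c": 0.57,
    "image.c": 0.54,
    "rdopt_coding_state.c": 0.46,
    "loopFilter.c": 0.03,
}


def overall_speedup(offloaded, factor):
    """Amdahl's law: overall speedup when the listed files run `factor` times faster.

    Assumption (not from the article): moving work to hardware costs nothing in
    communication or synchronisation, so this is an optimistic upper bound.
    """
    accelerated = sum(PROFILE[name] for name in offloaded) / 100.0
    return 1.0 / ((1.0 - accelerated) + accelerated / factor)


if __name__ == "__main__":
    top_three = ["mv_search.c", "block.c", "refbuf.c"]
    for factor in (10, 100, float("inf")):
        only_me = overall_speedup(["mv_search.c"], factor)
        three = overall_speedup(top_three, factor)
        print(f"x{factor}: mv_search.c only {only_me:.2f}, top three files {three:.2f}")
```

Under these assumptions, even an arbitrarily fast motion-search engine caps the overall gain at roughly 3x, because the remaining third of the cycles still runs in software.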

However, computation complexity alone does not determine if a functional module should be mapped to hardware or remain in software. To evaluate the viability of software and hardware partitioning of the H.264/AVC coding standard implementation on a platform that consists of a mixture of FPGAs, programmable DSPs, or general-purpose host processors, we need to look at a number of architectural issues that influence the overall design decision.

  • Data locality. In a synchronous design, the ability to access memory in a particular order and granularity while minimizing the number of clock cycles due to latency, bus contention, alignment, DMA transfer rate, and the types of memory used (such as ZBT memory, SDRAM, and SRAM) is very important. The data locality issue is primarily dictated by the physical interfaces between the data unit and the arithmetic unit (or the processing engine).
  • Data parallelism. Most signal processing algorithms operate on data that is highly parallelizable (such as FIR filtering). Single instruction multiple data (SIMD) and vector processors are particularly efficient for data that can be parallelized or made into a vector format (or long data width).

    FPGA fabric exploits this parallelism by providing a large number of block RAMs, which together deliver very high aggregate memory bandwidth. In the new Xilinx Virtex-4™ SX device family, the number of block RAMs closely matches the number of XtremeDSP™ slices (SX25: 128 block RAMs, 128 DSP slices; SX35: 192 block RAMs, 192 DSP slices; SX55: 320 block RAMs, 512 DSP slices).

  • Signal processing algorithm parallelism. In a typical programmable DSP or a general-purpose processor, signal processing algorithm parallelism is often referred to as instruction level parallelism (ILP). A very long instruction word (VLIW) processor is an example of such a machine that exploits ILP by grouping multiple instructions (ADD, MULT, and BRA) to be executed in a single cycle. A heavily pipelined execution unit in the processor is also an excellent example of hardware that exploits the parallelism. Modern programmable DSPs have adopted this architecture (including the Texas Instruments™ TMS320C64x).

    However, not all algorithms can exploit such parallelism. Recursive algorithms such as IIR filtering, variable-length coding (VLC) in MPEG-1/2/4, and context-adaptive variable-length coding (CAVLC) and context-adaptive binary arithmetic coding (CABAC) in H.264/AVC map inefficiently onto these programmable DSPs, because data recursion prevents ILP from being used effectively. Instead, dedicated hardware engines can be built efficiently in the FPGA fabric.

  • Computational complexity. A programmable DSP is bounded in the computation it can deliver, as measured by the clock rate of the processor. The signal processing algorithms best suited to the FPGA fabric are the computationally intensive ones, such as the sum of absolute differences (SAD) engine in motion estimation, and video scaling. Mapping these modules onto the FPGA fabric frees cycles on the host processor or the programmable DSP for other algorithms. Furthermore, FPGAs can have multiple clock domains in the fabric, so individual hardware blocks can run at separate clock speeds based on their computational requirements.
  • Theoretic optimality in quality. Any theoretic optimal solution based on the rate-distortion curve can be achieved if and only if the complexity is unbounded. In a programmable DSP or general-purpose processor, the computational complexity is always bounded by the clock cycles available. FPGAs, on the other hand, offer much more flexibility by exploiting data and algorithm parallelism by means of multiple instantiations of the hardware engines, or increased use of block RAM and register banks in the fabric.

    A programmable DSP or general-purpose processor is often limited by the number of instructions issued per cycle, the pipeline depth of the execution unit, or the maximum data width available to keep the execution units fully fed. Video quality is often compromised as a result of the limited cycles available per task in a programmable DSP, whereas hardware resources can be fully allocated to the task in FPGA fabric (for example, a three-step search on a DSP versus a full-search motion estimation in hardware).
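
The trade-off between a cycle-limited search and an exhaustive one can be sketched in a few lines. The example below is a minimal illustration, not the JVT reference software: it computes the SAD for a block and compares a full search over a ±7 window with a three-step search. The function names, the 8 x 8 block size, the search range, and the synthetic random frames are all assumptions made for the sake of the example.

```python
import random


def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def inside(ref, bx, by, dx, dy, n):
    """True when the displaced block lies entirely inside the reference frame."""
    return 0 <= by + dy <= len(ref) - n and 0 <= bx + dx <= len(ref[0]) - n


def full_search(cur, ref, bx, by, n=8, rng=7):
    """Exhaustive search over every displacement in [-rng, rng] x [-rng, rng].
    Returns ((cost, dx, dy), number_of_SADs_evaluated)."""
    best, evaluated = None, 0
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            if not inside(ref, bx, by, dx, dy, n):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            evaluated += 1
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best, evaluated


def three_step_search(cur, ref, bx, by, n=8, step=4):
    """Three-step search: probe the 8 neighbours at the current step size,
    move to the best one, halve the step, and repeat until the step is 1."""
    cx, cy = 0, 0
    best_cost = sad(cur, ref, bx, by, 0, 0, n)
    evaluated = 1
    while step >= 1:
        next_x, next_y = cx, cy
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                mx, my = cx + dx, cy + dy
                if (dx == 0 and dy == 0) or not inside(ref, bx, by, mx, my, n):
                    continue
                cost = sad(cur, ref, bx, by, mx, my, n)
                evaluated += 1
                if cost < best_cost:
                    best_cost, next_x, next_y = cost, mx, my
        cx, cy = next_x, next_y
        step //= 2
    return (best_cost, cx, cy), evaluated


def make_frames(width=40, height=40, motion=(-2, 3), seed=0):
    """Synthetic data: a random reference frame and a current frame whose
    content is the reference displaced by `motion` = (dx, dy)."""
    rnd = random.Random(seed)
    ref = [[rnd.randrange(256) for _ in range(width)] for _ in range(height)]
    dx, dy = motion
    cur = [[ref[(y + dy) % height][(x + dx) % width] for x in range(width)]
           for y in range(height)]
    return cur, ref


if __name__ == "__main__":
    cur, ref = make_frames()
    print("full search:", full_search(cur, ref, 16, 16))
    print("three-step :", three_step_search(cur, ref, 16, 16))
```

The three-step search evaluates at most 25 candidates against 225 for the full search in this window, but it can settle on a worse match, which is the quality-versus-cycles compromise described above. In FPGA fabric, the candidate SADs of a full search can be spread across multiple parallel engines.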

Implementing Functional Modules onto FPGAs
Figure 1 shows the overall H.264/AVC macroblock-level encoder with its major functional blocks and data flows. One of the primary successes of the H.264/AVC standard is its ability to predict the content of a picture to be encoded by exploiting pixel redundancy in ways and directions not exploited in previous standards. Unfortunately, compared with previous standards, this increases the complexity and memory access bandwidth approximately four-fold.

Improved Prediction Methods
Let’s highlight some of the main features of the H.264/AVC video coding standard design that enable its enhanced coding efficiency, evaluating these functional modules based on the design criteria discussed in the previous section.

  • Quarter-pixel-accurate motion compensation. Prior standards use half-pixel motion vector accuracy. The new design improves on this by providing quarter-pixel motion vector accuracy. The prediction values at half-pixel positions are calculated by applying a one-dimensional six-tap FIR filter [1, -5, 20, 20, -5, 1]/32 horizontally and vertically. Prediction values at quarter-pixel positions are generated by averaging samples at the full- and half-pixel positions. These sub-pixel interpolation operations can be efficiently implemented in hardware inside the FPGA fabric.
  • Variable block-sized motion compensation with small block size. The standard provides more flexibility for the tiling structure within a macroblock of 16 x 16 pixels. It allows the use of 16 x 16, 16 x 8, 8 x 16, 8 x 8, 8 x 4, 4 x 8, and 4 x 4 sub-macroblock sizes. Because of the growing number of tiling combinations within a given 16 x 16 macroblock, finding a rate-distortion-optimal tiling is extremely computationally intensive. This additional feature places an enormous burden on the computational engines used in the motion estimation, refinement, and mode decision processes.
  • In-the-loop adaptive deblocking filtering. The deblocking filter has been successfully applied in H.263+ and MPEG-4 part 2 implementations as a post-processing filter. In H.264/AVC, the deblocking filter is moved inside the motion-compensated loop to filter block edges resulting from the prediction and residual difference coding stages of the decoding process. The filtering is applied on both 4 x 4 block and 16 x 16 macroblock boundaries, in which two pixels on either side of the boundary may be updated using a three-tap filter. The filter coefficients or “strength” are governed by a content-adaptive non-linear filtering scheme.
  • Directional spatial prediction for intra coding. In cases where motion estimation cannot be exploited, intra-directional spatial prediction is used to eliminate spatial redundancies. This technique attempts to predict the current block by extrapolating the neighboring pixels from adjacent blocks in a defined set of directions. The difference between the predicted block and the actual block is then coded.

    This approach is particularly useful in flat backgrounds where spatial redundancies exist. There are a total of nine prediction directions for Intra_4x4 prediction, and four prediction directions for Intra_16x16 prediction. Note that data causality requires fast memory access to the 13 neighboring pixel values above and to the left of the current block in the case of Intra_4x4. For Intra_16x16, 16 neighboring pixels on each side are used to predict a 16 x 16 block.

  • Multiple reference picture motion compensation. The H.264/AVC standard offers the option for multiple reference frames in the inter-frame coding. Unless the number of the referenced pictures is one, the index at which the reference picture is located inside the multi-picture buffer has to be signaled. The multi-picture buffer size determines the memory usage in the encoder and decoder. These reference frame buffers must be addressed correspondingly during the motion estimation and compensation stages in the encoder.
  • Weighted prediction. The JVT recognized that, when encoding video scenes that involve fades, weighted motion-compensated prediction dramatically improves coding efficiency.
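
The quarter-pixel interpolation described in the first item of this list can be sketched in one dimension as follows. The tap weights and the averaging step come from the text; the rounding offsets, the clipping to an 8-bit range, the edge replication, and all function names are assumptions added to make the example self-contained.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # six-tap filter from the text, normalised by 32


def clip8(value):
    """Clamp to the 8-bit sample range (assumed here, since the filter can overshoot)."""
    return max(0, min(255, value))


def half_pel(row, i):
    """Half-pixel sample between row[i] and row[i + 1].

    Taps that fall outside the row reuse the nearest edge pixel (an assumption).
    Adding 16 before the shift by 5 rounds the division by 32 to nearest.
    """
    acc = 0
    for k, tap in enumerate(TAPS):
        j = min(max(i - 2 + k, 0), len(row) - 1)
        acc += tap * row[j]
    return clip8((acc + 16) >> 5)


def quarter_pel(row, i):
    """The two quarter-pixel samples between row[i] and row[i + 1], each the
    rounded average of the adjacent full- and half-pixel samples."""
    h = half_pel(row, i)
    return ((row[i] + h + 1) >> 1, (h + row[i + 1] + 1) >> 1)


def interpolate_row(row):
    """Upsample a row by four: full, quarter, half, quarter, full, ..."""
    out = []
    for i in range(len(row) - 1):
        q1, q3 = quarter_pel(row, i)
        out += [row[i], q1, half_pel(row, i), q3]
    out.append(row[-1])
    return out


if __name__ == "__main__":
    ramp = [0, 10, 20, 30, 40, 50, 60, 70]
    print(interpolate_row(ramp))
```

Each output sample needs only a handful of multiplications by small constants, additions, and a shift, which is why this kind of filter maps naturally onto parallel hardware.
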

Improved Coding Efficiency
In addition to improved prediction methods, other parts of the standard design were also enhanced for improved coding efficiency. Two additional features are most likely to impact the overall system architecture based on our design criteria for software and hardware partitioning:

  • Small block size, hierarchical, exact-match inverse, and short word-length transform. H.264/AVC, like other standards, applies transform coding to the motion-compensated prediction residual. But unlike previous standards, which use an 8 x 8 discrete cosine transform (DCT), this transform is applied to 4 x 4 blocks and is exactly invertible in a 16-bit integer format. The small block size helps reduce blocking and ringing artifacts, while the precise integer specification eliminates any mismatch between the encoder and decoder in the inverse transform.

    Furthermore, an additional transform based on the Hadamard matrix is also used to exploit the redundancy of 16 DC coefficients of the already transformed blocks. Compared to a DCT, all applied integer transforms have only integer numbers ranging from -2 to 2 in the transform matrix. This allows you to compute the transform and the inverse transform in 16-bit arithmetic using only low-complexity shifters and adders.

  • Arithmetic and context-adaptive entropy coding. Two methods of entropy coding exist: a low-complexity technique based on context-adaptively switched sets of variable-length codes (CAVLC) and the computationally more demanding algorithm of context-based adaptive binary arithmetic coding (CABAC). CAVLC is the baseline entropy coding method of H.264/AVC. Its basic coding tool consists of a single VLC of structured Exp-Golomb codes, which by means of individually customized mappings is applied to all syntax elements except those related to the quantized transform coefficients. For the quantized transform coefficients, a more sophisticated coding scheme is applied. The transform coefficients are first mapped into a 1-D array based on a predefined scan pattern. After quantization, a block contains only a few significant non-zero coefficients.

    Based on this statistical behavior, five data elements are used to convey information about the quantized transform coefficients for a luminance 4 x 4 block. The efficiency of entropy coding can be improved further by using CABAC.

    There are two parts in CABAC. The arithmetic coding core engine and its associated probability estimation are specified as multiplication-free, low-complexity methods using only shifts and table look-ups. The use of adaptive codes allows CABAC to adapt to non-stationary symbol statistics. By using context modeling based on switching between conditional probability models that are estimated from previously coded syntax elements, CABAC can achieve a reduction in bit rate of 5-15% compared to CAVLC.
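
To make the "shifters and adders" claim for the 4 x 4 transform concrete, here is a minimal sketch of a forward 4 x 4 integer transform computed with a butterfly and checked against a plain matrix product. The matrix `CF` is the forward core transform commonly quoted for H.264/AVC, whose entries lie between -2 and 2 as the text states; the scaling that is normally folded into quantization is omitted, and the function names are hypothetical.

```python
# Forward core transform matrix (entries in the range -2..2).
CF = (
    (1, 1, 1, 1),
    (2, 1, -1, -2),
    (1, -1, -1, 1),
    (1, -2, 2, -1),
)


def butterfly(x0, x1, x2, x3):
    """1-D transform of four samples using only additions, subtractions,
    and left shifts by one (multiplication by 2)."""
    a0, a1 = x0 + x3, x1 + x2
    a2, a3 = x1 - x2, x0 - x3
    return (a0 + a1, (a3 << 1) + a2, a0 - a1, a3 - (a2 << 1))


def forward_4x4(block):
    """2-D transform CF * X * CF^T: butterfly on each row, then on each column."""
    rows = [butterfly(*r) for r in block]
    cols = [butterfly(*c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major order


def matmul(a, b):
    """Plain 4 x 4 integer matrix product, used only as a reference."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def forward_4x4_reference(block):
    """Same transform computed directly from the matrix definition."""
    cf_t = [list(c) for c in zip(*CF)]
    return matmul(matmul(CF, block), cf_t)


if __name__ == "__main__":
    residual = [
        [5, 11, 8, 10],
        [9, 8, 4, 12],
        [1, 10, 11, 4],
        [19, 6, 15, 7],
    ]
    print(forward_4x4(residual))
    print(forward_4x4(residual) == forward_4x4_reference(residual))
```

For residual samples in the range -255 to 255, the largest magnitude this forward stage can produce is 255 x 36 = 9,180, which fits comfortably in a signed 16-bit word.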

Figure 2 depicts a typical system-level functional block partition of the H.264/AVC SD video codec. The solution is implemented based on the Spectrum Digital EVM DM642 evaluation module for the Texas Instruments TMS320DM642 DSP, together with the Xilinx XEVM642-2VP20 Virtex-II Pro™ or XEVM642-4VSX25 Virtex-4™ daughtercard.

Conclusion
When used in an optimized fashion, the coding tools of the H.264/AVC standard increase coding efficiency by about 50% compared to previous video coding standards (like MPEG-4 part 2 and MPEG-2) for a wide range of bit rates and resolutions. Currently, it is the most likely successor to the widely used MPEG-2. However, the algorithm is quite complex, particularly at resolutions greater than source input format (SIF).

The DVD-Forum, with its HD-DVD initiatives, has selected H.264/AVC together with WMV-9 and MPEG-2 as the standard video coding formats. The European DVB consortium has also selected H.264/AVC as the next format after MPEG-2. These announcements, plus endorsements from Hollywood studios, content distributors, and broadcast infrastructures, have further validated the importance of the H.264/AVC video coding standard for the next few years. For more comprehensive studies and technical details of the H.264/AVC video coding standard, please see the References.

References
“Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264/ISO/IEC 14496-10 AVC),” in Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, 2003.

A. Luthra, G.J. Sullivan, and T. Wiegand. July 2003. “Special Issue on The H.264/AVC Video Coding Standard.” IEEE Trans. Circuits Syst. Video Technol. 13(7): 557-725.
