H.264/AVC is the latest in a series of international
video coding standards that includes H.261, MPEG-1, MPEG-2,
H.263, and MPEG-4 Visual (part 2). It
was approved in May 2003 by the ITU-T (International
Telecommunication Union Telecommunication
Standardization Sector) as Recommendation
H.264 and by ISO/IEC as
International Standard 14496-10 (MPEG-4 part 10) Advanced Video Coding (AVC).
Despite H.264/AVC’s promises of
improved coding efficiency over existing
video coding standards, it still presents
tremendous engineering challenges to system
architects, DSP engineers, and hardware
designers. The H.264/AVC standard
brought in the most significant changes
and algorithmic discontinuities in the evolution
of video coding standards since the
introduction of H.261 in 1990.
The computational complexity,
data locality, and algorithm and
data parallelism required to implement the
H.264/AVC coding standard often directly
influence the overall architectural decision
at the system level. In turn, this
determines the ultimate cost of developing
any commercially viable H.264/AVC system
solution in the broadcasting, video
editing, teleconferencing, and consumer
electronics fields.
Complexity Analysis
To achieve a real-time H.264/AVC standard
definition (SD) or high definition
(HD) resolution encoding solution, system
architects often employ multiple
FPGAs and programmable DSPs. To illustrate
the enormous computational complexity
required, let’s explore the typical
run-time cycle requirements of the
H.264/AVC encoder based on the software
model provided by the Joint Video Team
(JVT), comprising experts from ITU-T’s
Video Coding Experts Group (VCEG) and
ISO/IEC’s Moving Picture Experts Group
(MPEG).
Profiling with Intel™ VTune™ software
on an Intel Pentium™ III 1.0 GHz
general-purpose CPU with 512 MB of
memory indicates that an H.264/AVC SD
main profile encoding solution would
require approximately 1,600 BOPS (billions
of operations per second).
Table 1 illustrates a typical profile of
the H.264/AVC encoder complexity
based on the Pentium III general-purpose
processor architecture. Notice that in
Table 1, motion estimation, macroblock/block processing (including
mode decision), and motion compensation
modules are the primary candidates
for hardware acceleration.
Table 1 – H.264/AVC encoder complexity profile by files
| Functional Blocks | % of Run-Time Total Cycles |
| mv_search.c | 67.31 % |
| block.c | 8.19 % |
| refbuf.c | 6.95 % |
| macroblock.c | 3.48 % |
| rdopt.c | 3.37 % |
| biariencode.c | 3.21 % |
| cabac.c | 2.98 % |
| memcpy.asm* | 2.91 % |
| abs.c* | 0.57 % |
| image.c | 0.54 % |
| rdopt_coding_state.c | 0.46 % |
| loopFilter.c | 0.03 % |
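As a rough illustration of what this profile implies, Amdahl's law bounds the overall speedup you can expect when only some of the files are accelerated. The sketch below uses the percentages from Table 1; the function names, the choice of offloaded files, and the acceleration factor are assumptions made for illustration, not measurements.

```python
# Run-time share per source file, taken from Table 1 (percent of total cycles).
PROFILE = {
    "mv_search.c": 67.31,
    "block.c": 8.19,
    "refbuf.c": 6.95,
    "macroblock.c": 3.48,
    "rdopt.c": 3.37,
    "biariencode.c": 3.21,
    "cabac.c": 2.98,
    "memcpy.asm": 2.91,
    "abs.c": 0.57,
    "image.c": 0.54,
    "rdopt_coding_state.c": 0.46,
    "loopFilter.c": 0.03,
}


def amdahl_speedup(offloaded, factor):
    """Overall speedup if the offloaded files run `factor` times faster.

    `factor` is a hypothetical acceleration ratio, not a measured value.
    """
    accelerated = sum(PROFILE[name] for name in offloaded) / 100.0
    return 1.0 / ((1.0 - accelerated) + accelerated / factor)


def speedup_limit(offloaded):
    """Upper bound on overall speedup as the acceleration factor grows without limit."""
    accelerated = sum(PROFILE[name] for name in offloaded) / 100.0
    return 1.0 / (1.0 - accelerated)
```

With only `mv_search.c` offloaded, `speedup_limit` returns roughly 3 however fast the accelerator is, because the remaining files still account for about a third of the cycles.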
However, computational complexity
alone does not determine whether a functional
module should be mapped to hardware or
remain in software. To evaluate how to partition
an H.264/AVC implementation between software and hardware
on a platform that consists of a mixture
of FPGAs, programmable DSPs, and general-purpose
host processors, we need to look at
a number of architectural issues that influence
the overall design decision.
- Data locality. In a synchronous design,
it is very important to access memory in a particular
order and granularity while minimizing
the number of clock cycles lost
to latency, bus contention, alignment,
DMA transfer rate, and the type of
memory used (such as ZBT memory,
SDRAM, or SRAM).
The data locality issue is primarily
dictated by the physical interfaces
between the data unit and the arithmetic
unit (or the processing engine).
- Data parallelism. Most signal processing
algorithms operate on data that is
highly parallelizable (such as FIR filtering).
Single instruction multiple data
(SIMD) and vector processors are particularly
efficient for data that can be
parallelized or made into a vector format
(or long data width).
FPGA fabric exploits this by providing
a large amount of block RAM, which supports
very high aggregate memory
bandwidth. In the new
Xilinx Virtex-4™ SX device family, the
amount of block RAM matches closely
with the number of Xtreme DSP™
slices (SX25 – 128 block RAM, 128
DSP slices; SX35 – 192 block RAM,
192 DSP slices; SX55 – 320 block
RAM, 512 DSP slices).
- Signal processing algorithm parallelism.
In a typical programmable DSP
or a general-purpose processor, signal
processing algorithm parallelism is
often referred to as instruction level
parallelism (ILP). A very long instruction
word (VLIW) processor is an
example of such a machine that
exploits ILP by grouping multiple
instructions (ADD, MULT, and BRA)
to be executed in a single cycle. A
heavily pipelined execution unit in the
processor is also an excellent example
of hardware that exploits the parallelism.
Modern programmable DSPs
have adopted this architecture (including
the Texas Instruments™
TMS320C64x).
However, not all algorithms can
exploit such parallelism. Recursive
algorithms like IIR filtering, variable-length
coding (VLC) in MPEG-1/2/4,
context-adaptive variable-length coding
(CAVLC), and context-adaptive binary
arithmetic coding (CABAC) in
H.264/AVC are particularly
inefficient when mapped to
these programmable DSPs. This is
because data recursion prevents ILP
from being used effectively. Instead,
dedicated hardware engines can be
built efficiently in the FPGA fabric.
- Computational complexity.
A programmable DSP is bounded in the
computational complexity it can handle, as determined
by the clock rate of the processor.
Signal processing algorithms implemented
in the FPGA fabric are typically
computationally intensive. Examples
include the sum of absolute differences (SAD) engine in
motion estimation, and video scaling.
By mapping these modules onto the
FPGA fabric, the host processor or the
programmable DSP has extra cycles
for other algorithms. Furthermore,
FPGAs can have multiple clock
domains in the fabric, so selected hardware
blocks can run at separate
clock speeds based on their computational
requirements.
- Theoretical optimality in quality. Any
theoretically optimal solution based on
the rate-distortion curve can be
achieved if and only if the complexity
is unbounded. In a programmable
DSP or general-purpose processor, the
computational complexity is always
bounded by the clock cycles available.
FPGAs, on the other hand, offer much
more flexibility by exploiting data and
algorithm parallelism by means of
multiple instantiations of the hardware
engines, or increased use of block
RAM and register banks in the fabric.
A programmable DSP or general-purpose
processor is often limited by the
number of instruction issues per cycle,
the level of pipeline in the execution
unit, or the maximum data width to
fully feed the execution units. Video
quality is often compromised as a result
of the limited cycles available per task
in a programmable DSP, whereas hardware
resources can be dedicated to each task in
FPGA fabric (for example, a three-step search on a DSP versus full-search
motion estimation in an FPGA).
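To make the SAD engine and the three-step versus full-search trade-off mentioned above more concrete, here is a minimal software sketch. It assumes frames stored as lists of rows of integer luma samples; the function names, the ±7 search window, and the starting step of 4 are illustrative choices, not values taken from the JVT model.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by the candidate vector (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def in_bounds(ref, bx, by, dx, dy, n):
    """True if the displaced block lies entirely inside the reference frame."""
    return (0 <= by + dy and by + dy + n <= len(ref)
            and 0 <= bx + dx and bx + dx + n <= len(ref[0]))


def full_search(cur, ref, bx, by, n, search_range=7):
    """Evaluate every candidate in the window.

    Returns (best_sad, dx, dy, evaluations).
    """
    best = None
    count = 0
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            if not in_bounds(ref, bx, by, dx, dy, n):
                continue
            count += 1
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best + (count,)


def three_step_search(cur, ref, bx, by, n, step=4):
    """Coarse-to-fine search: test the 8 neighbours at the current step size,
    recentre on the best point, then halve the step.

    Returns (best_sad, dx, dy, evaluations).
    """
    cx, cy = 0, 0
    best_cost = sad(cur, ref, bx, by, 0, 0, n)
    count = 1
    while step >= 1:
        best_here = (best_cost, cx, cy)
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue
                mx, my = cx + dx, cy + dy
                if not in_bounds(ref, bx, by, mx, my, n):
                    continue
                count += 1
                cost = sad(cur, ref, bx, by, mx, my, n)
                if cost < best_here[0]:
                    best_here = (cost, mx, my)
        best_cost, cx, cy = best_here
        step //= 2
    return (best_cost, cx, cy, count)
```

For a ±7 window away from the frame edges, the full search evaluates 225 candidates per block, while the three-step search evaluates at most 25 but may settle on a local minimum. In FPGA fabric, the independent SAD evaluations can be spread across parallel engines, which is what makes the exhaustive search affordable.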
Implementing Functional Modules onto FPGAs
Figure 1 shows the overall H.264/AVC
macroblock-level encoder with major functional
blocks and data flows defined. One
of the primary successes of the H.264/AVC
standard is its ability to predict the
content of a picture to be encoded by
exploiting pixel redundancy in
ways and directions not exploited
in earlier standards. Unfortunately, compared
with previous standards, this
increases the complexity and memory
access bandwidth approximately four-fold.
Improved Prediction Methods
Let’s highlight some of the main features of
the H.264/AVC video coding standard
design that enable its enhanced coding efficiency,
evaluating these functional modules
based on the design criteria discussed in the
previous section.
- Quarter-pixel-accurate motion compensation.
Prior standards use half-pixel
motion vector accuracy. The new design
improves on this by providing quarter-pixel
motion vector accuracy. The prediction
values at half-pixel positions are
calculated by applying a one-dimensional
six-tap FIR filter [1, -5, 20, 20, -5,
1]/32 horizontally and vertically.
Prediction values at quarter-pixel positions
are generated by averaging samples
at the full- and half-pixel positions.
These sub-pixel interpolation operations
can be efficiently implemented
in hardware inside the FPGA fabric.
- Variable block-size motion compensation
with small block sizes. The standard
provides more flexibility for
tiling a macroblock of
16 x 16 pixels. It allows the use of 16 x
16, 16 x 8, 8 x 16, 8 x 8, 8 x 4, 4 x 8,
and 4 x 4 partition sizes.
Because of the many combinations
of tiling geometry within a given
16 x 16 macroblock, finding a rate-distortion-optimal
tiling solution is
extremely computationally intensive.
This additional feature places an
enormous burden on the computational
engines used in the motion estimation,
refinement, and mode decision
processes.
- In-the-loop adaptive deblocking filtering.
The deblocking filter has been successfully
applied in H.263+ and
MPEG-4 part 2 implementations as a
post-processing filter. In H.264/AVC,
the deblocking filter is moved inside
the motion-compensated loop to filter
block edges resulting from the prediction
and residual difference coding
stages of the decoding process. The filtering
is applied on both 4 x 4 block
and 16 x 16 macroblock boundaries, in
which two pixels on either side of the
boundary may be updated using a
three-tap filter. The filter coefficients or
“strength” are governed by a content-adaptive
non-linear filtering scheme.
- Directional spatial prediction for intra
coding. In cases where motion estimation
cannot be exploited, intra-directional
spatial prediction is used to
eliminate spatial redundancies. This
technique attempts to predict the current
block by extrapolating the neighboring
pixels from adjacent blocks in a
defined set of directions. The difference
between the predicted block and
the actual block is then coded.
This approach is particularly useful in
flat backgrounds where spatial redundancies
exist. There are a total of nine
prediction directions for Intra_4x4
prediction, and four prediction directions
for Intra_16x16 prediction.
Note that data causality demands
fast memory access to the 13 neighboring
pixel values above and to the
left of the current block in the case of
Intra_4x4. For Intra_16x16, 16
neighboring pixels on each of these two sides are
used to predict a 16 x 16 block.
- Multiple reference picture motion compensation.
The H.264/AVC standard
offers the option for multiple reference
frames in the inter-frame coding. Unless
the number of the referenced pictures is
one, the index at which the reference
picture is located inside the multi-picture
buffer has to be signaled. The
multi-picture buffer size determines the
memory usage in the encoder and
decoder. These reference frame buffers
must be addressed correspondingly during
the motion estimation and compensation
stages in the encoder.
- Weighted prediction. The JVT recognized
that, for video
scenes that involve fades,
weighted motion-compensated prediction
dramatically improves coding
efficiency.
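The half- and quarter-pixel interpolation described in the first item above can be sketched in one dimension as follows. The tap values and the averaging step come from the text; the rounding offset of 16, the clipping to the 8-bit range, and the function names are assumptions that follow common practice rather than anything stated here.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # six-tap filter from the text, normalised by 32


def clip8(value):
    """Clamp a value to the 8-bit sample range (assumed sample depth)."""
    return max(0, min(255, value))


def half_pel(row, i):
    """Half-pixel sample between row[i] and row[i + 1].

    Needs the six full-pixel samples row[i - 2] .. row[i + 3].
    """
    acc = sum(t * row[i - 2 + k] for k, t in enumerate(TAPS))
    return clip8((acc + 16) >> 5)  # add 16 to round, then divide by 32


def quarter_pel(row, i):
    """Quarter-pixel sample between row[i] and the half-pixel sample to its right,
    formed as the rounded average of the two."""
    return (row[i] + half_pel(row, i) + 1) >> 1
```

Because each output needs only six neighbouring samples, a handful of multiplications by small constants, and a shift, the filter maps naturally onto a short pipeline of adders and shifters in the FPGA fabric.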
Improved Coding Efficiency
In addition to improved prediction methods,
other parts of the standard design were
also enhanced for improved coding efficiency.
Two additional features are most likely to
impact the overall system architecture based
on our design criteria for software and hardware
partitioning:
- Small block size, hierarchical, exact-match
inverse, and short word-length
transform. H.264/AVC, like other
standards, applies transform coding
to the motion-compensated prediction
residual. But, unlike previous standards
that use an 8 x 8 discrete cosine transform
(DCT), this transform is applied
to 4 x 4 blocks, and is exactly invertible
in a 16-bit integer format. The small
block helps reduce blocking and ringing
artifacts, while the precise integer
specification eliminates any mismatch
issues between the encoder and decoder
in the inverse transform.
Furthermore, an additional transform
based on the Hadamard matrix is also
used to exploit the redundancy of 16
DC coefficients of the already transformed
blocks. Compared to a DCT, all
applied integer transforms have only
integer numbers ranging from -2 to 2 in
the transform matrix. This allows you to
compute the transform and the inverse
transform in 16-bit arithmetic using
only low-complexity shifters and adders.
- Arithmetic and context-adaptive
entropy coding. Two methods of
entropy coding exist: a low-complexity
technique based on the use of context-adaptively
switched sets of variable-length
codes (CAVLC) and the computationally
more demanding algorithm
of context-based adaptive binary
arithmetic coding (CABAC). CAVLC
is the baseline entropy coding method
of H.264/AVC. Its basic coding tool
consists of a single VLC of structured
Exp-Golomb codes, which by means of
individually customized mappings are
applied to all syntax elements except
those related to the quantized transform
coefficients. For the quantized coefficients,
a more sophisticated CAVLC scheme is
applied. The transform coefficients are
first mapped into a 1-D array based on
a predefined scan pattern. After quantization,
a block typically contains only a few significant
non-zero coefficients.
Based on this statistical behavior, five
data elements are used to convey information
of the quantized transform coefficients
for a luminance 4 x 4 block.
The efficiency of entropy coding can be
improved further by using CABAC.
There are two parts in CABAC. The
arithmetic coding core engine and its
associated probability estimation are
specified as multiplication-free low-complexity
methods using only shifts
and table look-ups. The use of adaptive
codes allows it to adapt to non-stationary
symbol statistics. By using context
modeling based on switching between
conditional probability models that are
estimated from previous coded syntax
elements, CABAC can achieve a reduction
in bit rate of 5 to 15% compared
to CAVLC.
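To show why the 4 x 4 transform in the first item above needs only shifters and adders, here is a minimal sketch of the forward transform computed with butterflies. The matrix is the forward core transform as it is commonly cited for H.264/AVC; the text itself only states that the entries range from -2 to 2, so treat the exact matrix and all function names as assumptions. The scaling that the standard folds into quantization is left out.

```python
# Forward core transform matrix as commonly cited for H.264/AVC (assumed here;
# the text only states that its entries lie between -2 and 2).
CF = ((1, 1, 1, 1),
      (2, 1, -1, -2),
      (1, -1, -1, 1),
      (1, -2, 2, -1))


def butterfly(a, b, c, d):
    """1-D transform of four samples using only adds, subtracts, and shifts."""
    s0, s1 = a + d, b + c
    d0, d1 = a - d, b - c
    return (s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1))


def forward_4x4(block):
    """2-D transform: butterfly on each row, then on each column."""
    rows = [butterfly(*row) for row in block]
    cols = [butterfly(*col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major order


def forward_4x4_reference(block):
    """Same result by explicit matrix products CF * X * CF^T, for checking."""
    tmp = [[sum(CF[i][k] * block[k][j] for k in range(4)) for j in range(4)]
           for i in range(4)]
    return [[sum(tmp[i][k] * CF[j][k] for k in range(4)) for j in range(4)]
            for i in range(4)]
```

The multiplications by 2 become single-bit left shifts, so each 1-D pass costs eight additions and two shifts, and every intermediate value is an exact integer.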
Figure 2 depicts a typical system-level
functional block partition of the
H.264/AVC SD video codec. The solution
is implemented based on the Spectrum
Digital EVM DM642 evaluation module
for the Texas Instruments TMS320DM642
DSP, together with the Xilinx XEVM642-2VP20 Virtex-II Pro™ or XEVM642-4VSX25 Virtex-4™ daughtercard.
Conclusion
When used in an optimized fashion, the
coding tools of the H.264/AVC standard
increase coding efficiency by about 50%
compared to previous video coding standards
(like MPEG-4 part 2 and MPEG-2)
for a wide range of bit rates and resolutions.
Currently, it is the most likely successor
to the widely used MPEG-2. However, the algorithm is quite complex,
particularly at resolutions greater than source input
format (SIF).
The DVD Forum, with its HD-DVD
initiatives, has selected H.264/AVC together
with WMV-9 and MPEG-2 as the standard
video coding formats. The European
DVB consortium has also selected
H.264/AVC as the next format after
MPEG-2. These announcements, plus
endorsements from Hollywood studios,
content distributors, and broadcast infrastructures,
have further validated the
importance of the H.264/AVC video coding
standard for the next few years.
For more comprehensive studies
and technical details of the H.264/AVC
video coding standard, please see the References.
References
“Draft ITU-T Recommendation and Final Draft
International Standard of Joint Video Specification
(ITU-T Rec. H.264/ISO/IEC
14496-10 AVC),” in Joint Video Team (JVT) of
ISO/IEC MPEG and ITU-T VCEG, JVT-G050,
2003.
A. Luthra, G.J. Sullivan, and T. Wiegand. July 2003.
“Special Issue on The H.264/AVC Video Coding
Standard.” IEEE Transactions on Circuits and Systems for Video
Technology 13(7): 557-725.