Fast Hardware Implementation of an H.264 Quantizer.

High Speed Hardware Implementation of an H.264 Quantizer
Alex Braun
Shruti Lakdawala
H.264
 Video compression standard
 Compression is the process of compacting data into a smaller number of bits.
 Achieved by:
   Removing redundancy between consecutive frames
   Transforming the data into a different domain
   Quantization
   Reordering the data and encoding it as compactly as possible
H.264 Encoder block diagram
Quantization
 Scales the data down to a smaller range of values, thereby reducing the number of bits.
 To avoid floating point arithmetic, the scaled values are rounded to integers.
 There are 52 values of Qstep, one for each value of the quantization parameter QP (0 to 51).
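The 52 step sizes can be generated rather than listed. The sketch below assumes the usual H.264 pattern, in which Qstep doubles every time QP increases by 6, with six base values as tabulated in Richardson's book (see References). The names `QSTEP_BASE` and `qstep` are illustrative and do not come from the slides.

```python
# Six base step sizes for QP = 0..5 (assumed from the standard's Qstep table).
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]


def qstep(qp: int) -> float:
    """Return the quantizer step size for a quantization parameter 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    # Qstep doubles for every increase of 6 in QP.
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))


ALL_QSTEPS = [qstep(qp) for qp in range(52)]
```

Because of the doubling rule, only QP mod 6 and QP div 6 are needed to characterize a step size, which is what makes a small look-up table practical in hardware.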
Quantization - 2
 To reduce the complexity of the quantization block, the division operation is implemented by multiplying the array by a multiplication factor (MF) and then applying a binary right shift.
Z = (Y x MF + f) >> qbits
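As a worked illustration of why a multiply followed by a right shift can stand in for the division, the relation can be written out as below. This follows the formulation in Richardson's book (see References), not the slide itself: the post-scaling factor PF from the transform, the expression for qbits, and the example values (QP = 4, position (0,0)) are assumptions introduced for illustration.

```latex
\[
Z = \operatorname{round}\!\left( Y \cdot \frac{PF}{Q_{step}} \right)
  = \operatorname{round}\!\left( Y \cdot \frac{MF}{2^{qbits}} \right),
\qquad
qbits = 15 + \left\lfloor \frac{QP}{6} \right\rfloor
\]

% Example: QP = 4, position (0,0): Q_step = 1, PF = 0.25, qbits = 15
\[
MF = \frac{PF}{Q_{step}} \cdot 2^{qbits} = 0.25 \cdot 32768 = 8192
\]
```

Dividing by 2^qbits is a right shift by qbits bits, so the only real arithmetic left in hardware is one integer multiplication and one addition for rounding.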
Implementation
Quantisation Equation
Architecture
Quantization on Three Arrays
 H.264 performs quantization on three arrays:
   4 x 4 array of Residual coefficients
   4 x 4 array of Luma DC coefficients
   2 x 2 array of Chroma DC coefficients
 Mode select is used to quantize the three arrays differently, because the quantization equation is slightly different for each array.
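The mode selection described above can be sketched in software. The equations used here are the ones given in Richardson's book (see References): residual coefficients use a shift of qbits, while the DC arrays use a doubled rounding offset and a shift of qbits + 1. In that formulation the two DC modes share the same form, so how the hardware on these slides distinguishes them is not known. The function name `quantize`, the mode strings, and the example values are hypothetical.

```python
def quantize(y: int, mf: int, qbits: int, f: int, mode: str) -> int:
    """Quantize one coefficient y using a multiply, add and right shift.

    mf    : multiplication factor (from a look-up table)
    qbits : shift amount, assumed to be 15 + QP // 6
    f     : rounding offset
    mode  : 'residual', 'luma_dc' or 'chroma_dc' (hypothetical names)
    """
    if mode == "residual":
        magnitude = (abs(y) * mf + f) >> qbits
    elif mode in ("luma_dc", "chroma_dc"):
        # DC arrays: doubled offset and one extra bit of shift.
        magnitude = (abs(y) * mf + 2 * f) >> (qbits + 1)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # The sign of the input is restored after the unsigned arithmetic.
    return -magnitude if y < 0 else magnitude


# Example values (assumed): QP = 4 gives qbits = 15 and MF = 8192 at (0,0).
QBITS = 15
MF_00 = 8192
F_INTRA = (1 << QBITS) // 3  # rounding offset of 2^qbits / 3

z_residual = quantize(100, MF_00, QBITS, F_INTRA, "residual")
z_negative = quantize(-100, MF_00, QBITS, F_INTRA, "residual")
z_luma_dc = quantize(100, MF_00, QBITS, F_INTRA, "luma_dc")
```

Working on the magnitude and restoring the sign afterwards keeps the shift from rounding negative values toward minus infinity.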
New Architecture
 Pipelining is used for fast implementation
(Block diagram. Labeled signals and blocks: Y, Z, mode, QP, f, LUT, MF, QP_div_6, Data Path)
Look Up Table
 The multiplication factor and qbits depend on the position of the element in the array and on the quantization step.
 Look Up Tables are required for the pre-calculated MF and qbits values.
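A software analogue of such a look-up table is sketched below. The MF values are the ones tabulated in Richardson's book (see References), indexed by QP mod 6 and by one of three position classes in the 4 x 4 array; the slides do not show the table contents, so treat the values and the names `MF_TABLE`, `position_class`, and `lookup` as assumptions.

```python
# Rows: QP % 6. Columns: position class 0, 1, 2 (see position_class).
MF_TABLE = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def position_class(i: int, j: int) -> int:
    """Map a position (i, j) in a 4 x 4 array to one of three MF columns."""
    if i % 2 == 0 and j % 2 == 0:
        return 0  # (0,0), (0,2), (2,0), (2,2)
    if i % 2 == 1 and j % 2 == 1:
        return 1  # (1,1), (1,3), (3,1), (3,3)
    return 2      # all remaining positions


def lookup(qp: int, i: int, j: int) -> tuple[int, int]:
    """Return (MF, qbits) for a quantization parameter and array position."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    mf = MF_TABLE[qp % 6][position_class(i, j)]
    qbits = 15 + qp // 6  # qp // 6 plays the role of the QP_div_6 signal
    return mf, qbits
```

Only 18 MF entries are needed for all 52 values of QP, because the effect of QP div 6 is absorbed into the shift amount.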
Data Path
 Six Stage Booth-Recoded Wallace Tree Multiplier
 Add and Shift broken into two stages
   Two 15-bit Fast Carry Look Ahead Adders
   One 16-bit Fast Carry Look Ahead Incrementer and Right Shift Block
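To illustrate what Booth recoding contributes to the multiplier, the sketch below recodes a multiplier into radix-4 Booth digits, which halves the number of partial products that an adder tree has to sum. The radix (4) and the 16-bit operand width are assumptions, since the slides do not state them, and the Wallace tree itself is not modeled: the partial products are simply added with `sum`. The names `booth_radix4_digits` and `booth_multiply` are illustrative.

```python
def booth_radix4_digits(multiplier: int, bits: int = 16) -> list[int]:
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2}; digit k has weight 4**k.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even for radix-4 recoding")
    m = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    previous_bit = 0  # implicit bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (m >> i) & 1
        b1 = (m >> (i + 1)) & 1
        digits.append(b0 + previous_bit - 2 * b1)
        previous_bit = b1
    return digits


def booth_multiply(a: int, b: int, bits: int = 16) -> int:
    """Multiply a by b by summing one partial product per Booth digit."""
    digits = booth_radix4_digits(b, bits)
    # In hardware these partial products would feed the Wallace tree.
    partial_products = [d * a * (4 ** k) for k, d in enumerate(digits)]
    return sum(partial_products)


product = booth_multiply(100, 8192)
```

With 16-bit operands this yields 8 partial products instead of 16, each obtainable from the multiplicand by a shift and an optional negation.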
(Data path diagram. Labeled signals and blocks: Y, MF, 6 Stage Multiplier, adders (+), constant 1, f, CO, Right Shift, QP_div_6, Z)
Performance
 Latency
   As tested: 9 clock cycles
   If implemented with the LUT in parallel with the last stage of the transform block: 8 clock cycles
 Throughput
   1 result per clock cycle
 Frequency
   As implemented: 309 MHz
   Max frequency of the Data Path without area constraints: 355 MHz
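The figures above can be converted into time units with a little arithmetic. The sketch below uses only the numbers on this slide; the function name `latency_ns` is illustrative. The 9-cycle latency is consistent with six multiplier stages, two add-and-shift stages, and one LUT stage, although the slides do not give this breakdown explicitly.

```python
def latency_ns(cycles: int, freq_mhz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * 1000.0 / freq_mhz


clock_period_ns = 1000.0 / 309           # one cycle at 309 MHz
latency_tested_ns = latency_ns(9, 309)   # LUT as a separate stage
latency_overlap_ns = latency_ns(8, 309)  # LUT overlapped with the transform
results_per_second = 309e6 * 1           # 1 result per clock cycle
speedup_unconstrained = 355 / 309        # faster data path vs. as implemented
```

At 309 MHz the clock period is a little over 3.2 ns, and the ratio 355 / 309 is roughly 1.15.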
Area
                                                   Area (gates)
Data Path                                          58037
High Speed Data Path (not used in final design)    60845
LUTs                                               10385
Total System                                       938977
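The area figures can be cross-checked against each other. The sketch below assumes that the system contains 16 parallel data paths, one per coefficient of a 4 x 4 array; that count is an inference from the numbers and is not stated on this slide.

```python
DATA_PATH_GATES = 58037
LUT_GATES = 10385
TOTAL_SYSTEM_GATES = 938977
NUM_DATA_PATHS = 16  # assumption: one data path per 4 x 4 coefficient

quantizer_gates = NUM_DATA_PATHS * DATA_PATH_GATES
system_gates = quantizer_gates + LUT_GATES
lut_share = LUT_GATES / TOTAL_SYSTEM_GATES  # fraction of area in the LUTs
```

Under that assumption the data paths account for 928592 gates, and adding the LUTs reproduces the reported system total exactly.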
Comparison to Another Implementation
                          Pipelined       Combinational
Technology                TSMC 0.25µ      Xilinx Virtex-2 Pro (0.15µ)
Latency                   8-9 clocks      1 clock
Frequency                 309 MHz         94 MHz
Area LUT (gates)          10385           10320
Area Quantizer (gates)    928592          119040
Area System (gates)       938977          129360
Critical Path Delay       3.23 ns         10.6 ns
Areas for Improvement
 Implement the LUTs as ROMs to reduce area
 Pipeline the LUTs and use the faster Data Path implementation for a ~15% improvement in clock frequency
 Implement in a smaller technology
 Gate the clocks to the 12 unused data paths when in 2x2 DC Chroma mode
References
 Richardson, Iain E. G. H.264 and MPEG-4 Video Compression. John Wiley & Sons Ltd., England. 2003.
 H.264/MPEG-4 Part 10 Tutorials. http://www.vcodex.com/h264.html
 Kordasiewicz, R., Shirani, S. “Hardware Implementation of the Optimized Transform and Quantization Blocks of H.264”. Electrical and Computer Engineering, 2004. Canadian Conference on, Volume 2, 2-5 May 2004, Page(s): 943-946.
 Malvar, H., Hallapuro, A., Karczewicz, M., Kerofsky, L. “Low-Complexity Transform and Quantization in H.264/AVC”. Circuits and Systems for Video Technology, IEEE Transactions on, Volume 13, Issue 7, July 2003, Page(s): 598-603.
 Malvar, H. S. “Low-Complexity length-4 transform and quantization with 16-bit arithmetic,” in ITU-T SG16, Sept. 2001, Doc. VCEG-N44.
 Kerofsky, L., Lei, S. “Reduced bit-depth quantization,” in Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Sept. 2001, Doc. VCEG-N20.
 Kerofsky, L. “H.26L transform/quantization complexity reduction Ad Hoc Report,” in Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Nov. 2001, Doc. VCEG-O09.