Fast Hardware Implementation of an H.264 Quantizer.
High Speed Hardware Implementation of an H.264 Quantizer
Alex Braun
Shruti Lakdawala
H.264
Video Compression Standard
Video compression is the process of compacting data into a smaller number of bits.
Achieved by:
Removing redundancy between consecutive frames.
Transforming the data into a different domain.
Quantization.
Reordering the data and encoding it as compactly as possible.
H.264 Encoder block diagram
Quantization
Scales the data down to a smaller range of values, thereby reducing the number of bits.
To avoid floating point arithmetic, the values are rounded.
There are 52 values of Qstep, indexed by the quantization parameter QP (0 to 51).
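The relationship between QP and Qstep can be sketched as follows. The six base step sizes and the doubling of Qstep for every increase of 6 in QP come from the standard H.264 description (for example Richardson's text), not from these slides, and the function names are hypothetical. This is a floating-point reference model only, not the hardware method.

```python
# Base step sizes for QP = 0..5 (standard H.264 values, assumed here;
# they do not appear on the slides).
BASE_QSTEP = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]


def qstep(qp: int) -> float:
    """Return Qstep for a quantization parameter in 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in 0..51")
    # Qstep doubles every time QP increases by 6.
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))


def quantize_reference(y: float, qp: int) -> int:
    """Floating-point reference: Z = round(Y / Qstep).

    Python's round() rounds exact halves to the nearest even integer,
    which may differ from the rounding used in a real encoder.
    """
    return round(y / qstep(qp))
```

For example, `qstep(4)` is 1.0 and `qstep(10)` is 2.0, so a coefficient of 100 quantizes to 100 at QP 4 and to 50 at QP 10.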
Quantization - 2
To reduce the complexity of the quantization block, the division operation is implemented by multiplying the array by a multiplication factor (MF) and then using a binary right shift.
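A minimal sketch of the multiply-and-shift idea, assuming the usual H.264 formulation in which MF / 2^qbits approximates PF / Qstep (PF being the transform's post-scaling factor) and qbits = 15 + floor(QP / 6). The function name, the sign handling, and the example constants are assumptions for illustration; the slides do not give them.

```python
def quantize_mult_shift(w: int, mf: int, qbits: int, f: int) -> int:
    """Integer quantization: |Z| = (|W| * MF + f) >> qbits, sign(Z) = sign(W).

    w     : transform coefficient (integer)
    mf    : multiplication factor
    qbits : right-shift amount, assumed to be 15 + QP // 6
    f     : rounding offset added before the shift
    """
    magnitude = (abs(w) * mf + f) >> qbits
    return -magnitude if w < 0 else magnitude


# Example under the stated assumptions: QP = 4, position (0, 0).
# Qstep = 1.0 and PF = 0.25, so MF = 8192 and qbits = 15,
# because 8192 / 2**15 == 0.25.
example = quantize_mult_shift(100, 8192, 15, 0)
```

The multiply and shift replace the division by Qstep with operations that map directly onto a hardware multiplier and a shifter.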
Implementation
Quantization Equation
Architecture
Quantization on Three Arrays
H.264 performs quantization on three arrays:
4 x 4 array of Residual coefficients
4 x 4 array of Luma DC coefficients
2 x 2 array of Chroma DC coefficients
Mode select is used to quantize the three arrays differently because the quantization equation is slightly different for each array.
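The mode-dependent behaviour can be sketched as below. The slides do not give the per-mode equations, so this follows the common textbook form, in which the DC arrays always use the position (0, 0) multiplication factor, twice the rounding offset, and one extra bit of shift. The mode encoding and all names are hypothetical.

```python
# Hypothetical mode encoding; the actual encoding used in the design is not
# given on the slides.
RESIDUAL, LUMA_DC, CHROMA_DC = 0, 1, 2


def quantize_mode(y: int, mode: int, mf: int, mf00: int,
                  qbits: int, f: int) -> int:
    """Quantize one coefficient according to the selected mode.

    mf   : multiplication factor for this coefficient's position
    mf00 : multiplication factor for position (0, 0), used by the DC modes
    """
    if mode == RESIDUAL:
        magnitude = (abs(y) * mf + f) >> qbits
    elif mode in (LUMA_DC, CHROMA_DC):
        # Assumed DC form: MF(0,0), doubled offset, shift by qbits + 1.
        magnitude = (abs(y) * mf00 + 2 * f) >> (qbits + 1)
    else:
        raise ValueError("unknown mode")
    return -magnitude if y < 0 else magnitude
```

In this sketch the two DC modes share one formula and differ only in array size (4 x 4 versus 2 x 2); the real design may distinguish them further.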
New Architecture
Pipelining is used for fast implementation
[Block diagram: LUT and Data Path. Signals: Y, Z, mode, QP, f, MF, QP_div_6]
Look Up Table
The multiplication factor and qbits depend on the position of the elements in the array and the quantization step.
Look Up Tables are required for the pre-calculated MF and qbits.
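A software sketch of such a look-up table, assuming the multiplication factor values published in the standard H.264 literature (for example Richardson's text) rather than anything stated on the slides. The table layout and function name are hypothetical and say nothing about how the hardware LUT is organized.

```python
# MF values indexed by QP % 6. Columns (assumed textbook values):
#   0: positions where row and column are both even, e.g. (0,0), (2,2)
#   1: positions where row and column are both odd,  e.g. (1,1), (3,3)
#   2: all other positions
MF_TABLE = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def lut(qp: int, i: int, j: int) -> tuple[int, int]:
    """Return (MF, qbits) for coefficient position (i, j) and parameter QP."""
    if not (0 <= qp <= 51 and 0 <= i <= 3 and 0 <= j <= 3):
        raise ValueError("expected QP in 0..51 and a position inside 4 x 4")
    row = MF_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        mf = row[0]
    elif i % 2 == 1 and j % 2 == 1:
        mf = row[1]
    else:
        mf = row[2]
    qbits = 15 + qp // 6  # the shift grows by one bit every 6 QP steps
    return mf, qbits
```

Only QP % 6 and the position class select MF, while QP // 6 (the QP_div_6 signal) only changes the shift, which keeps the table small.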
Data Path
Six Stage Booth-Recoded Wallace Tree Multiplier
Add and Shift broken into two stages:
Two 15-bit Fast Carry Look Ahead Adders
One 16-bit Fast Carry Look Ahead Incrementer and Right Shift Block
[Data path diagram. Blocks: 6 Stage Multiplier, adders (+) with carry outs (CO), Right Shift. Signals: Y, MF, f, QP_div_6, 1, Z]
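One way to read the add-and-shift stages is as a wide addition split into narrower pieces linked by carry outs. The sketch below is a guess at that structure, not the authors' circuit: it assumes a 46-bit product, an offset f below 2^30, a split into two 15-bit adders plus a 16-bit incrementer, and a final shift of 15 + QP_div_6. All names and widths are assumptions.

```python
MASK15 = (1 << 15) - 1
MASK16 = (1 << 16) - 1


def split_add_shift(product: int, f: int, qp_div_6: int) -> int:
    """Compute (product + f) >> (15 + qp_div_6) using chunked adders.

    Assumes 0 <= product < 2**46, 0 <= f < 2**30, and no overflow
    out of the top 16 bits.
    """
    # First 15-bit adder: low bits of the product plus low bits of f.
    s0 = (product & MASK15) + (f & MASK15)
    co0 = s0 >> 15  # carry out of the first adder

    # Second 15-bit adder: next 15 bits plus the incoming carry.
    s1 = ((product >> 15) & MASK15) + ((f >> 15) & MASK15) + co0
    co1 = s1 >> 15  # carry out of the second adder

    # 16-bit incrementer: the top bits only ever need +0 or +1.
    hi = ((product >> 30) + co1) & MASK16

    total = (hi << 30) | ((s1 & MASK15) << 15) | (s0 & MASK15)
    return total >> (15 + qp_div_6)
```

Because f is much narrower than the product, the upper bits can use an incrementer instead of a full adder, which is one plausible reason for the mix of block types listed above.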
Performance
Latency
As tested: 9 clock cycles
If implemented with the LUT in parallel with the last stage of the transform block: 8 clock cycles
Throughput
1 result per clock cycle
Frequency
As implemented: 309 MHz
Max frequency of the data path without area constraints: 355 MHz
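These figures can be combined into a rough timing estimate. The sketch assumes an ideal pipeline with no stalls, so that N back-to-back inputs take latency + N - 1 cycles; the function names are hypothetical.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def pipeline_cycles(n_results: int, latency: int) -> int:
    """Cycles to produce n_results at one result per cycle after the fill."""
    return latency + n_results - 1


def pipeline_time_ns(n_results: int, latency: int, freq_mhz: float) -> float:
    """Total time in nanoseconds for n_results back-to-back inputs."""
    return pipeline_cycles(n_results, latency) * clock_period_ns(freq_mhz)


# Numbers from the slides: 9-cycle latency at 309 MHz.
period = clock_period_ns(309)        # roughly 3.24 ns per cycle
first_result = pipeline_time_ns(1, 9, 309)
```

The latency only matters for the first result; over long runs the cost per result approaches one clock period.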
Area
Block                                              Area (gates)
Data Path                                          58037
High Speed Data Path (not used in final design)    60845
LUTs                                               10385
Total System                                       938977
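A quick arithmetic check on the area figures, under the assumption (not stated outright on the slides) that the total is the LUT area plus several identical copies of the data path. The variable and function names are hypothetical.

```python
# Gate counts taken from the slides.
DATA_PATH_GATES = 58037
LUT_GATES = 10385
TOTAL_GATES = 938977


def implied_data_paths() -> int:
    """Number of identical data paths implied by the totals."""
    quantizer_gates = TOTAL_GATES - LUT_GATES
    copies, remainder = divmod(quantizer_gates, DATA_PATH_GATES)
    if remainder:
        raise ValueError("totals are not an exact multiple of one data path")
    return copies


copies = implied_data_paths()
```

The division comes out exact at 16 copies, one per coefficient of a 4 x 4 array, which matches the later remark about 12 unused data paths when only the four 2 x 2 chroma DC coefficients are processed.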
Comparison to Another Implementation
                          Pipelined (this design)   Combinational
Technology                TSMC 0.25 µm              Xilinx Virtex-2 Pro (0.15 µm)
Latency                   8-9 clocks                1 clock
Frequency                 309 MHz                   94 MHz
Area LUT (gates)          10385                     10320
Area Quantizer (gates)    928592                    119040
Area System (gates)       938977                    129360
Critical Path Delay       3.23 ns                   10.6 ns
Areas for Improvement
Implement LUTs as ROMs to reduce area
Pipeline LUTs and use faster Data Path implementation for ~15% improvement
Implement in a smaller technology
Gate clocks to the 12 unused data paths when in 2x2 DC Chroma mode
References
Richardson, Iain E. G. H.264 and MPEG-4 Video Compression. John Wiley & Sons Ltd., England. 2003.
H.264/MPEG-4 Part 10 Tutorials. http://www.vcodex.com/h264.html
Kordasiewicz, R., Shirani, S. “Hardware Implementation of the Optimized Transform and Quantization Blocks of H.264”. Canadian Conference on Electrical and Computer Engineering, 2004. Volume 2, 2-5 May 2004, Page(s): 943-946.
Malvar, H., Hallapuro, A., Karczewicz, M., Kerofsky, L. “Low-Complexity Transform and Quantization in H.264/AVC”. IEEE Transactions on Circuits and Systems for Video Technology, Volume 13, Issue 7, July 2003, Page(s): 598-603.
Malvar, H. S. “Low-Complexity length-4 transform and quantization with 16-bit arithmetic,” in ITU-T SG16, Sept. 2001, Doc. VCEG-N44.
Kerofsky, L. and Lei, S. “Reduced bit-depth quantization,” in Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Sept. 2001, Doc. VCEG-N20.
Kerofsky, L. “H.26L transform/quantization complexity reduction Ad Hoc Report,” in Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Nov. 2001, Doc. VCEG-O09.