Adaptive slice-level parallelism for H.264/AVC encoding
Adaptive slice-level parallelism for
H.264/AVC encoding using pre
macroblock mode selection
Bongsoo Jung, Byeungwoo Jeon
Journal of Visual Communication
and Image Representation 2008
1
Outline
Introduction
Complexity Analysis
Method
Pre Macroblock Mode Selection
Adaptive Slice-level Parallelism
Experimental Results
Conclusions
2
Introduction
H.264/AVC achieves high coding
efficiency
Variable block sizes, multiple reference frames,
quarter-pel motion vector accuracy, etc.
High computational complexity
Complexity reduction algorithm
Parallel processing
3
Introduction
GOP level
Simple but high latency
Frame level
Keeps coding efficiency, but the dependence
among frames limits thread scalability
Slice level
Slices are encoded independently, at the cost of coding efficiency
Macroblock level
High dependency
4
Introduction
MBs in a slice may not have similar
computational complexity.
Unnecessary extra waiting time in
some threads.
[Figure: encoding time of slices 0-7 on processing units PU0-PU7]
5
Main Purpose
Objective
Use a parallel algorithm to speed up the
H.264/AVC encoder
Maximize parallel efficiency by
distributing the workload equally
Method
Preprocessing: fast MB mode selection
Adaptive slice-level parallelism
6
Complexity Analysis
Inter prediction mode of MBs in H.264
Intra prediction mode: 4*4, 16*16
7
Complexity Analysis
The run-time complexity of the
H.264/AVC encoder
Pentium IV 2.4GHz
Foreman_CIF with IPPP structure
8
Pre Macroblock Mode Selection
Overview
Why?
High computational complexity of ME in
variable block size
Remove unnecessary ME block sizes and RD
calculations for intra prediction modes
This removal leads to
Complexity reduction
Workload balancing among slices
9
Pre Macroblock Mode Selection
Inter MB mode selection
MC block sizes in video sequence
High temporal correlation
Foreground region : 8*8 or smaller
Non-moving region : 16*16
Check consistency history of block size
16*16 and zero MV
Two measurements
Zero motion consistency (ZMC)
Large block consistency (LBC)
10
Pre Macroblock Mode Selection
Inter MB mode selection
Zero Motion Consistency (ZMC)
Indicates how long a specified block has had
a zero MV consecutively
t: frame index; ZMC^0 = 0
(n,m;i,j) indicates the 4*4 block at (n,m)
within MB (i,j)
When a block is encoded in intra mode,
ZMC is set to 0
A high ZMC value implies a high probability
that the block belongs to the background region
11
Pre Macroblock Mode Selection
Inter MB mode selection
Zero Motion Consistency Score
Indicates how likely an MB is to be a stationary
region
TMOTION: a threshold value
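The ZMC counter and its score can be sketched as follows. The slide equations are not reproduced in this transcript, so the update rule (increment on a zero MV, reset otherwise) and the scoring rule (an MB scores High only if the minimum ZMC over its sixteen 4*4 blocks reaches TMOTION) are assumptions inferred from the descriptions. The function names `update_zmc` and `zmc_score` are hypothetical.

```python
T_MOTION = 4  # threshold value quoted on a later slide


def update_zmc(prev_zmc, mv, is_intra=False):
    """Return the new ZMC counter for one 4x4 block.

    prev_zmc: counter from the previous frame (ZMC^0 = 0)
    mv:       (dx, dy) motion vector of the block in the current frame
    is_intra: True when the block was coded in intra mode
    """
    if is_intra:
        return 0             # intra coding resets the history
    if mv == (0, 0):
        return prev_zmc + 1  # one more consecutive frame with a zero MV
    return 0                 # assumed: any non-zero MV breaks the run


def zmc_score(zmc_blocks, t_motion=T_MOTION):
    """Score an MB as 'High' or 'Low' from its sixteen 4x4-block ZMC values.

    Assumption: the MB is judged by its least consistent block (the minimum).
    """
    return "High" if min(zmc_blocks) >= t_motion else "Low"


# A block that is static for three frames, moves once, then is static again.
zmc = 0
history = []
for mv in [(0, 0), (0, 0), (0, 0), (1, 0), (0, 0)]:
    zmc = update_zmc(zmc, mv)
    history.append(zmc)
print(history)  # [1, 2, 3, 0, 1]

static_mb = [6] * 16       # every 4x4 block static for 6 frames
mixed_mb = [6] * 15 + [1]  # one block moved recently
print(zmc_score(static_mb), zmc_score(mixed_mb))  # High Low
```

Taking the minimum is a conservative choice: a single recently moving 4*4 block is enough to keep the MB out of the stationary class.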
12
Pre Macroblock Mode Selection
Inter MB mode selection
Large Block Consistency (LBC)
Indicates the number of consecutive frames
in which the (i,j)th MB has a 16*16 MC block size
bestMode^t(i,j): the best MB mode of the (i,j)th MB in the tth
frame
LBC^0 = 0
When a block is encoded in intra mode
LBC is set to 0
13
Pre Macroblock Mode Selection
Inter MB mode selection
Large Block Consistency Score
Indicates how likely an MB is to be partitioned as
16*16
TMODE1, TMODE2: threshold values used to
assess the LBC
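A minimal sketch of the LBC counter and its three-level score follows. The exact equations are missing from this transcript, so several points are assumptions: SKIP and P16x16 are both treated as 16*16 modes, the counter resets on any other mode, and the score boundaries are LBC >= TMODE2 for High and TMODE1 <= LBC < TMODE2 for Medium. The defaults 1 and 4 are the threshold values the slides quote elsewhere. All function and mode names are hypothetical.

```python
LARGE_BLOCK_MODES = {"SKIP", "P16x16"}  # assumed set of 16*16 MC modes


def update_lbc(prev_lbc, best_mode):
    """Return the new LBC counter of one MB given its best mode in this frame."""
    if best_mode in LARGE_BLOCK_MODES:
        return prev_lbc + 1  # one more consecutive frame coded as 16*16
    return 0                 # intra modes and smaller partitions reset the run


def lbc_score(lbc, t_mode1=1, t_mode2=4):
    """Map an LBC counter to 'High', 'Medium' or 'Low' (assumed boundaries)."""
    if lbc >= t_mode2:
        return "High"
    if lbc >= t_mode1:
        return "Medium"
    return "Low"


lbc = 0
trace = []
for mode in ["P16x16", "SKIP", "P16x16", "P16x16", "P8x8", "P16x16"]:
    lbc = update_lbc(lbc, mode)
    trace.append((lbc, lbc_score(lbc)))

for counter, score in trace:
    print(counter, score)
```

In this trace the MB reaches High only after four consecutive large-block frames, and a single P8x8 decision drops it back to Low.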
14
Pre Macroblock Mode Selection
Inter MB mode selection
An illustration of LBCS
15
Pre Macroblock Mode Selection
Inter MB mode selection
Conditional probability of MB modes
given ZMCS = High (TMOTION = 4)
The other block sizes are very unlikely to
appear (probability below about 0.04)
SKIP and P16*16 modes can be detected early
16
Pre Macroblock Mode Selection
Inter MB mode selection
Joint conditional probability of MB modes given
LBCS with ZMCS = Low (TMODE1 = 1, TMODE2 = 4)
A: LBCS = High, B: LBCS = Medium, C: LBCS = Low
17
Pre Macroblock Mode Selection
Pre selective intra mode selection
High computational load of computing RD
costs of intra mode
Comparing temporal correlation with
spatial correlation of the current MB prior
to frame coding
18
Pre Macroblock Mode Selection
Selective intra mode selection
Mean Absolute Temporal Difference
c_{x,y}: pixel value at location (x,y) of the MB in the current frame
r_{x,y}: pixel value at location (x,y) of the MB in the previous frame
X, Y: horizontal and vertical dimensions of an MB
Mean Absolute Spatial Difference
MASDH : The MASD between horizontally
neighboring pixels
MASDV : The MASD between vertically
neighboring pixels
19
Pre Macroblock Mode Selection
Selective intra mode selection
Compare MATD and MASD to
determine whether the RD costs of intra modes
should be calculated for the current MB
Intra mode search is skipped when the MB is more temporally correlated
than spatially correlated
w: weighting factor, currently set to 0.6
A larger w makes skipping the intra mode
search easier
A smaller QP will incur more intra modes
than a larger QP
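The intra-skip test can be sketched as below. The slide formulas are not in this transcript, so three things are assumptions: the normalisation of each measure, the combination of MASDH and MASDV as a plain average, and the decision rule MATD < w * MASD. That rule is at least consistent with the remark that a larger w makes skipping easier. The helper names are hypothetical.

```python
def matd(cur, ref):
    """Mean absolute temporal difference between the current MB and the
    MB at the same position in the previous frame."""
    rows, cols = len(cur), len(cur[0])
    total = sum(abs(cur[y][x] - ref[y][x])
                for y in range(rows) for x in range(cols))
    return total / (rows * cols)


def masd(cur):
    """Mean absolute spatial difference inside the current MB.

    Assumption: MASD is the average of the horizontal and vertical terms.
    """
    rows, cols = len(cur), len(cur[0])
    masd_h = sum(abs(cur[y][x] - cur[y][x + 1])
                 for y in range(rows) for x in range(cols - 1))
    masd_h /= rows * (cols - 1)
    masd_v = sum(abs(cur[y][x] - cur[y + 1][x])
                 for y in range(rows - 1) for x in range(cols))
    masd_v /= (rows - 1) * cols
    return (masd_h + masd_v) / 2


def skip_intra_search(cur, ref, w=0.6):
    """True when the MB looks more temporally than spatially correlated."""
    return matd(cur, ref) < w * masd(cur)


# Textured MB that did not change since the previous frame: skip intra.
textured = [[10 * x for x in range(16)] for _ in range(16)]
print(skip_intra_search(textured, textured))  # True

# Flat MB whose brightness jumped between frames: keep the intra check.
flat_now = [[100] * 16 for _ in range(16)]
flat_before = [[50] * 16 for _ in range(16)]
print(skip_intra_search(flat_now, flat_before))  # False
```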
20
Pre Macroblock Mode Selection
MB mode classification
Decision table of candidate MB mode
A block diagram of MB selection
21
Adaptive Slice-level Parallelism
Overview
Characteristics
Easy to implement
Lower communication overhead
among processor units
Good scalability
Increases bitrate
Slice boundary is defined on the
basis of a fixed number of MBs or
fixed number of bits
Hard to decide a slice boundary prior to
encoding
22
Adaptive Slice-level Parallelism
Fixed MB assignment
The number of consecutive MBs in
each slice
L : The number of processor units on a multi-core system
M : The total number of MBs in a frame
i : Slice index
Example: number of processing units L = 8, sequence resolution
is CIF (352*288), M = 22*18 = 396
About 49 MBs can be assigned to each slice (396 / 8 = 49.5)
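The CIF example can be checked with a short sketch. The slide formula for the per-slice MB count is not reproduced here, so the way the remainder is handled (the first M mod L slices receive one extra MB) is an assumption, and `fixed_assignment` is a hypothetical name.

```python
def fixed_assignment(num_mbs, num_slices):
    """Split num_mbs consecutive MBs into num_slices nearly equal slices.

    Assumption: the first (num_mbs % num_slices) slices take one extra MB.
    """
    base, extra = divmod(num_mbs, num_slices)
    return [base + 1 if i < extra else base for i in range(num_slices)]


mbs_per_frame = (352 // 16) * (288 // 16)  # 22 * 18 = 396 MBs in a CIF frame
sizes = fixed_assignment(mbs_per_frame, 8)
print(sizes)       # [50, 50, 50, 50, 49, 49, 49, 49]
print(sum(sizes))  # 396
```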
23
Adaptive Slice-level Parallelism
Fixed MB assignment
The scheduling of slice-level
parallelism in eight processor units
[Figure: slice-to-PU schedules over encoding time on PU0-PU7 for the ideal case and the practical case; in the practical case the longest-running slices form the bottleneck]
24
Adaptive Slice-level Parallelism
Fixed MB assignment
The imbalance of computational
load distribution
Exhaustive Search Method
Fast ME / Fast Mode Search
25
Adaptive Slice-level Parallelism
Fixed MB assignment
Computational load for encoding one
frame in slice level parallelism
C^t_slice(i): the computational load of the ith slice in the tth frame
Computational load of the tth frame on
a single-processor system
L: number of slices in a frame
26
Adaptive Slice-level Parallelism
Fixed MB assignment
The speedup of multiprocessor system
over a single processor system
To achieve the maximum speedup
Computation loads of each slice should be
as similar as possible
Adaptive slice partition method
27
Adaptive Slice-level Parallelism
Complexity estimation model
A simple estimation method by utilizing
the result of fast MB mode selection
Define the group value g corresponding
to the candidate MB modes
28
Adaptive Slice-level Parallelism
Complexity estimation model
Complexity model
C_{k,CHKIntra}(g): complexity cost of the kth MB
g: group index
e_inter: estimated complexity cost of inter mode in g = 1
e_intra: complexity cost according to the intra mode check
in g = 1
α1, α2, α3, β1, β2, β3: weighting values of the complexity cost
29
Adaptive Slice-level Parallelism
Complexity estimation model
Relative computational load
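CHKintra = 0
Assume e_inter = 1, e_intra = 0
\[
C_{k,\,CHK_{Intra}=0}(g) =
\begin{cases}
e_{inter} + e_{intra} = 1, & g = 1 \\
\alpha_1 e_{inter} + \beta_1 e_{intra} = 2.42, & g = 2 \\
\alpha_2 e_{inter} + \beta_2 e_{intra} = 3.12, & g = 3 \\
\alpha_3 e_{inter} + \beta_3 e_{intra} = 5.28, & g = 4
\end{cases}
\]
α1 = 2.42, α2 = 3.12, α3 = 5.28
CHKintra = 1
Assume e_inter = 1, e_intra = 3.97
\[
C_{k,\,CHK_{Intra}=1}(g) =
\begin{cases}
e_{inter} + e_{intra} = 4.97, & g = 1 \\
\alpha_1 e_{inter} + \beta_1 e_{intra} = 6.48, & g = 2 \\
\alpha_2 e_{inter} + \beta_2 e_{intra} = 7.23, & g = 3 \\
\alpha_3 e_{inter} + \beta_3 e_{intra} = 9.48, & g = 4
\end{cases}
\]
β1 = 0.82, β2 = 0.83, β3 = 0.84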
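For estimating loads in practice, the eight relative values quoted on this slide can simply be tabulated. The sketch below does that rather than re-deriving them from the weights. The table layout and the names `RELATIVE_LOAD`, `mb_complexity` and `frame_complexity` are hypothetical.

```python
# Relative computational load per MB, indexed by CHKintra and then by group g.
# The numbers are the ones quoted on the slide.
RELATIVE_LOAD = {
    0: {1: 1.00, 2: 2.42, 3: 3.12, 4: 5.28},  # intra check skipped
    1: {1: 4.97, 2: 6.48, 3: 7.23, 4: 9.48},  # intra check performed
}


def mb_complexity(g, chk_intra):
    """Estimated complexity cost of one MB in group g (1..4)."""
    return RELATIVE_LOAD[chk_intra][g]


def frame_complexity(mbs):
    """Sum of estimated MB costs; mbs is a list of (g, chk_intra) pairs."""
    return sum(mb_complexity(g, chk) for g, chk in mbs)


# Three cheap background MBs and one expensive MB that needs every check.
example = [(1, 0), (1, 0), (1, 0), (4, 1)]
print(round(frame_complexity(example), 2))  # 12.48
```

By these figures, a single MB in group 4 with the intra check costs as much as nine or ten background MBs, which is why equal MB counts do not give equal slice loads.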
30
Adaptive Slice-level Parallelism
Adaptive MB assignment
The total computational load of the tth
frame
\[
\tilde{C}^t = \sum_{k=0}^{M-1} C_{k,CHK_{Intra}}(g)
\]
Ideal computational load of each slice for
a uniform workload distribution
\[
\tilde{C}^t_{slice} = \frac{\tilde{C}^t}{L}
\]
31
Adaptive Slice-level Parallelism
Adaptive MB assignment
MB assignment of slice
Balances the load across slices much better
than fixed MB assignment
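One way to realise adaptive MB assignment is a greedy pass over the estimated MB costs in raster order. This is a sketch under stated assumptions, not the paper's exact rule, since the slide equation is not reproduced in this transcript. It closes a slice as soon as the running total reaches that slice's share of the estimated frame load and keeps at least one MB for every remaining slice. `adaptive_assignment` is a hypothetical name.

```python
def adaptive_assignment(costs, num_slices):
    """Return the number of consecutive MBs given to each slice.

    costs: estimated complexity of every MB in raster-scan order
    """
    if len(costs) < num_slices:
        raise ValueError("need at least one MB per slice")
    target = sum(costs) / num_slices  # ideal load of one slice
    sizes = []
    start = 0
    cumulative = 0.0
    for k, cost in enumerate(costs):
        cumulative += cost
        slices_left = num_slices - len(sizes)
        mbs_left = len(costs) - (k + 1)
        # Close the slice once the running total reaches its share of the
        # frame load, but always leave one MB for every later slice.
        reached = cumulative >= target * (len(sizes) + 1)
        must_close = mbs_left == slices_left - 1
        if slices_left > 1 and (reached or must_close):
            sizes.append(k + 1 - start)
            start = k + 1
    sizes.append(len(costs) - start)  # the last slice takes the remainder
    return sizes


def slice_loads(costs, sizes):
    """Total estimated load of each slice for a given assignment."""
    loads, pos = [], 0
    for n in sizes:
        loads.append(sum(costs[pos:pos + n]))
        pos += n
    return loads


# Eight cheap MBs followed by four expensive ones, split over four slices.
costs = [1.0] * 8 + [5.28] * 4
adaptive = adaptive_assignment(costs, 4)
print(adaptive)  # [8, 2, 1, 1]
print(max(slice_loads(costs, adaptive)) < max(slice_loads(costs, [3, 3, 3, 3])))  # True
```

In this toy frame the heaviest slice carries an estimated load of 10.56 under the greedy split, against 15.84 when each slice simply gets three MBs.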
32
Adaptive Slice-level Parallelism
Adaptive MB assignment
Entire block diagram
33
Experimental Results
Overview
Performance comparison between
proposed MB mode decision and the
conventional method
Comparing adaptive slice-level
parallelism with fixed slice-level
parallelism
34
Experimental Results
MB mode selection
Average encoding time saving AST[%]
FULL_1Slice : Exhaustive method
FMD_1Slice : Fast MB mode search method
BDPSNR and BDBR are used to measure the
performance against FULL_1Slice
35
Experimental Results
Rate distortion curves
36
Experimental Results
R-D performance compared to one
slice per frame (FMD_1Slice)
37
Experimental Results
Rate distortion curves
38
Experimental Results
Slice-level parallelism
Comparing adaptive and fixed slice level
parallelism
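Speedup
\[
\mathrm{Speedup}_{FMD\_Fixed} = \frac{\mathrm{EncTime}(FMD\_1Slice)}{\max_i \mathrm{EncTime}_{slice_i}(FMD\_Fixed) + \mathrm{OverheadTime}}
\]
\[
\mathrm{Speedup}_{FMD\_Adaptive} = \frac{\mathrm{EncTime}(FMD\_1Slice)}{\max_i \mathrm{EncTime}_{slice_i}(FMD\_Adaptive) + \mathrm{OverheadTime}}
\]
EncTime(FMD_1Slice): encoding time of one slice per frame on a single-processor system
max_i EncTime_{slice_i}(FMD_Fixed): the longest encoding time of a slice using
fixed mode
max_i EncTime_{slice_i}(FMD_Adaptive): the longest encoding time of a slice using
adaptive mode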
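The speedup definition reduces to a one-line function. The sketch below applies it to two hypothetical sets of slice times (illustrative numbers, not measurements from the paper) to show how the slowest slice bounds the gain.

```python
def speedup(single_time, slice_times, overhead=0.0):
    """Single-processor encoding time divided by the parallel time,
    where the parallel time is the slowest slice plus overhead."""
    return single_time / (max(slice_times) + overhead)


# Hypothetical timings in milliseconds; both splits sum to the same 80 ms.
balanced = [10.0] * 8
unbalanced = [6.0, 8.0, 10.0, 16.0, 12.0, 10.0, 9.0, 9.0]

print(speedup(80.0, balanced))    # 8.0
print(speedup(80.0, unbalanced))  # 5.0
```

With eight processing units, the unbalanced split loses three units' worth of speedup even though the total work is unchanged.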
39
Experimental Results
Speedup
40
Conclusions
Proposed a fast MB mode selection scheme
using the consistency history of the 16*16 block
size and zero MV
Proposed a selective intra mode selection scheme that
compares temporal and spatial correlation
Building on these two schemes, the authors
proposed an adaptive slice-level
parallelism scheme to speed up the H.264/AVC
encoder
41