Adaptive slice-level parallelism for H.264/AVC encoding

Adaptive slice-level parallelism for H.264/AVC encoding using pre macroblock mode selection
Bongsoo Jung, Byeungwoo Jeon
Journal of Visual Communication and Image Representation, 2008
1
Outline
- Introduction
- Complexity Analysis
- Method
  - Pre Macroblock Mode Selection
  - Adaptive Slice-level Parallelism
- Experimental Results
- Conclusions
2
Introduction
- H.264/AVC achieves high coding efficiency
  - Variable block sizes, multiple reference frames, quarter-pel motion vector accuracy, etc.
- High computational complexity
  - Complexity reduction algorithms
  - Parallel processing
3
Introduction
- GOP level: simple, but high latency
- Frame level: keeps coding efficiency, but the dependence among frames limits thread scalability
- Slice level: slices are encoded independently, but coding efficiency is lower
- Macroblock level: high dependency among MBs
4
Introduction
- MBs may not have similar computational complexity, so slices with the same number of MBs can take different times to encode
  - Some threads incur unnecessary extra waiting time
[Figure: encoding time of slices 0 to 7 on processing units PU0 to PU7]
5
Main Purpose
- Objective
  - Use a parallel algorithm to speed up the H.264/AVC encoder
  - Maximize parallel efficiency by distributing the workload equally
- Method
  - Pre-processing: fast MB mode selection
  - Adaptive slice-level parallelism
6
Complexity Analysis
- Inter prediction modes of MBs in H.264
- Intra prediction modes: 4×4, 16×16
7
Complexity Analysis
- The run-time complexity of the H.264/AVC encoder
  - Platform: Pentium IV, 2.4 GHz
  - Test sequence: Foreman_CIF with IPPP structure
8
Pre Macroblock Mode Selection
Overview
- Why?
  - Motion estimation (ME) over variable block sizes has high computational complexity
  - Remove unnecessary ME block sizes and unnecessary RD cost calculation for intra prediction modes
- This removal leads to
  - Complexity reduction
  - Workload balancing among slices
9
Pre Macroblock Mode Selection
Inter MB mode selection
- MC block sizes in video sequences
  - High temporal correlation
  - Foreground region: 8×8 or smaller
  - Non-moving region: 16×16
- Check the consistency history of the 16×16 block size and of zero MVs
- Two measurements
  - Zero motion consistency (ZMC)
  - Large block consistency (LBC)
10
Pre Macroblock Mode Selection
Inter MB mode selection
- Zero Motion Consistency (ZMC)
  - Indicates for how many consecutive frames a specified block has had a zero MV
  - t: frame index; ZMC^0 = 0
  - (n,m;i,j): the 4×4 block at position (n,m) within MB (i,j)
  - When a block is encoded in intra mode, its ZMC is set to 0
  - High ZMC value → high probability of belonging to the background region
11
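The ZMC update equation itself is not reproduced on the slide. A minimal sketch, assuming the natural reading of the definition above: each 4×4 block's counter grows by one when its MV in frame t is zero, resets to 0 when the MV is non-zero (assumption), and resets to 0 when the block is intra coded. The function name `update_zmc` and the use of `None` to mark an intra-coded block are hypothetical.

```python
from typing import List, Optional, Tuple

MV = Tuple[int, int]  # (mv_x, mv_y) of one 4x4 block


def update_zmc(prev_zmc: List[int], mvs: List[Optional[MV]]) -> List[int]:
    """Compute ZMC^t for the sixteen 4x4 blocks (n,m) of one MB (i,j).

    prev_zmc: ZMC^{t-1} per 4x4 block (all zeros at t = 0).
    mvs: MV of each 4x4 block in frame t, or None if the block was
         intra coded (hypothetical representation).
    """
    new_zmc = []
    for count, mv in zip(prev_zmc, mvs):
        if mv is None:          # intra-coded block: ZMC is set to 0
            new_zmc.append(0)
        elif mv == (0, 0):      # zero MV: the run grows by one frame
            new_zmc.append(count + 1)
        else:                   # non-zero MV ends the run (assumption)
            new_zmc.append(0)
    return new_zmc
```

Calling `update_zmc` once per frame, starting from sixteen zeros, makes the counter of a block that stays still for three frames equal to 3; a single non-zero MV or an intra decision drops it back to 0.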
Pre Macroblock Mode Selection
Inter MB mode selection
- Zero Motion Consistency Score (ZMCS)
  - Indicates how likely an MB is to be a stationary region
  - T_MOTION: a threshold value
12
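The scoring formula is not reproduced here either. One plausible reading, used below purely as an assumption: an MB gets ZMCS = High only when every one of its sixteen 4×4 blocks has ZMC ≥ T_MOTION, and Low otherwise. T_MOTION = 4 is the value quoted on the conditional-probability slide; `zmc_score` is a hypothetical name.

```python
from typing import List

T_MOTION = 4  # threshold value quoted for the conditional-probability results


def zmc_score(zmc: List[int], t_motion: int = T_MOTION) -> str:
    """Classify one MB as 'High' (likely stationary) or 'Low'.

    Assumption: High requires all 4x4 blocks of the MB to have had a
    zero MV for at least t_motion consecutive frames.
    """
    return "High" if min(zmc) >= t_motion else "Low"
```

Requiring all sixteen blocks to pass is a conservative choice: a single moving 4×4 block keeps the MB out of the early SKIP/P16×16 path.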
Pre Macroblock Mode Selection
Inter MB mode selection
- Large Block Consistency (LBC)
  - Indicates the number of consecutive frames having a 16×16 MC block size at the (i,j)th MB
  - bestMode^t(i,j): the best MB mode of the (i,j)th MB in the tth frame
  - LBC^0 = 0
  - When a block is encoded in intra mode, its LBC is set to 0
13
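As with ZMC, the LBC equation is not shown. A minimal sketch under stated assumptions: the counter grows by one when bestMode^t(i,j) uses a single 16×16 MC block (here assumed to mean SKIP or P16x16), and is reset to 0 both for intra modes (as stated on the slide) and for smaller inter partitions (assumption). Mode strings and the name `update_lbc` are hypothetical.

```python
# Modes assumed to use a single 16x16 motion-compensated block.
LARGE_BLOCK_MODES = {"SKIP", "P16x16"}


def update_lbc(prev_lbc: int, best_mode: str) -> int:
    """Return LBC^t(i,j) from LBC^{t-1}(i,j) and bestMode^t(i,j).

    Hypothetical mode names: intra modes start with "I" (e.g. "I4x4"),
    inter modes look like "SKIP", "P16x16", "P8x8", ...
    """
    if best_mode.startswith("I"):       # intra MB: LBC is set to 0
        return 0
    if best_mode in LARGE_BLOCK_MODES:  # one more frame with 16x16 MC
        return prev_lbc + 1
    return 0                            # smaller partition ends the run (assumption)
```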
Pre Macroblock Mode Selection
Inter MB mode selection
- Large Block Consistency Score (LBCS)
  - Indicates how likely an MB is to be partitioned as 16×16
  - T_MODE1, T_MODE2: threshold values used to assess the LBC
14
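The later results use three LBCS levels (High, Medium, Low) with T_MODE1 = 1 and T_MODE2 = 4. A sketch of one natural mapping, which is an assumption rather than the paper's stated rule: High when LBC ≥ T_MODE2, Medium when T_MODE1 ≤ LBC < T_MODE2, Low otherwise. `lbc_score` is a hypothetical name.

```python
T_MODE1 = 1  # lower threshold quoted on the joint-probability slide
T_MODE2 = 4  # upper threshold quoted on the joint-probability slide


def lbc_score(lbc: int, t_mode1: int = T_MODE1, t_mode2: int = T_MODE2) -> str:
    """Map an LBC value to 'High', 'Medium' or 'Low' (assumed rule)."""
    if lbc >= t_mode2:
        return "High"    # long run of 16x16 decisions
    if lbc >= t_mode1:
        return "Medium"  # short run of 16x16 decisions
    return "Low"         # no recent 16x16 decision
```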
Pre Macroblock Mode Selection
Inter MB mode selection
- An illustration of LBCS
15
Pre Macroblock Mode Selection
Inter MB mode selection
- Conditional probability of MB modes given ZMCS = High (T_MOTION = 4)
  - The other block sizes are very unlikely to appear (probability below about 0.04)
  - SKIP and P16×16 modes can therefore be detected early
16
Pre Macroblock Mode Selection
Inter MB mode selection
- Joint conditional probability of MB modes given LBCS, with ZMCS = Low (T_MODE1 = 1, T_MODE2 = 4)
  - A: LBCS = High, B: LBCS = Medium, C: LBCS = Low
17
Pre Macroblock Mode Selection
Pre selective intra mode selection
- Computing the RD costs of intra modes imposes a high computational load
- Compare the temporal correlation with the spatial correlation of the current MB prior to frame coding
18
Pre Macroblock Mode Selection
Selective intra mode selection
- Mean Absolute Temporal Difference (MATD)
  - c_{x,y}: pixel value at location (x,y) of the MB in the current frame
  - r_{x,y}: pixel value at location (x,y) of the MB in the previous frame
  - X, Y: horizontal and vertical dimensions of an MB
- Mean Absolute Spatial Difference (MASD)
  - MASD_H: the MASD between horizontally neighboring pixels
  - MASD_V: the MASD between vertically neighboring pixels
19
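The MATD/MASD formulas are not reproduced on the slide. The sketch below follows the definitions as stated: MATD averages |c_{x,y} − r_{x,y}| over the X·Y pixels of the MB. Normalizing MASD_H and MASD_V by the number of neighboring pixel pairs ((X−1)·Y and X·(Y−1)) is an assumption; the paper may divide by X·Y instead. Blocks are plain nested lists indexed `block[y][x]`, and the function names are hypothetical.

```python
from typing import List

Block = List[List[int]]  # block[y][x]: Y rows of X luma samples


def matd(cur: Block, prev: Block) -> float:
    """Mean absolute temporal difference between the MB and the same location in the previous frame."""
    Y, X = len(cur), len(cur[0])
    total = sum(abs(cur[y][x] - prev[y][x]) for y in range(Y) for x in range(X))
    return total / (X * Y)


def masd_h(cur: Block) -> float:
    """Mean absolute difference between horizontally neighboring pixels."""
    Y, X = len(cur), len(cur[0])
    total = sum(abs(cur[y][x] - cur[y][x + 1]) for y in range(Y) for x in range(X - 1))
    return total / ((X - 1) * Y)  # normalization by pair count is an assumption


def masd_v(cur: Block) -> float:
    """Mean absolute difference between vertically neighboring pixels."""
    Y, X = len(cur), len(cur[0])
    total = sum(abs(cur[y][x] - cur[y + 1][x]) for y in range(Y - 1) for x in range(X))
    return total / (X * (Y - 1))  # normalization by pair count is an assumption
```

For a 16×16 horizontal ramp that rises by 2 per column, MASD_H is 2.0 and MASD_V is 0.0; shifting every pixel by 1 between frames gives MATD = 1.0.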
Pre Macroblock Mode Selection
Selective intra mode selection
- Compare MATD and MASD to determine whether the RD costs of intra modes should be calculated for the current MB
  - More temporally correlated than spatially correlated → intra RD cost calculation can be skipped
- w: weighting factor, currently set to 0.6
  - A larger w makes it easier to skip the intra mode search
  - A smaller QP incurs more intra modes than a larger QP
20
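The exact inequality is not given on the slide. A sketch consistent with the two stated facts (w = 0.6, and a larger w makes skipping easier) is to skip the intra RD check when MATD < w · MASD. Comparing against the smaller of MASD_H and MASD_V (i.e. requiring the condition in both directions) is an assumption, as is the name `check_intra`; its return value plays the role of CHK_Intra.

```python
W = 0.6  # weighting factor value quoted on the slide


def check_intra(matd_value: float, masd_h_value: float, masd_v_value: float,
                w: float = W) -> bool:
    """Return True if intra-mode RD costs should be computed (CHK_Intra = 1).

    Assumed rule: the MB is treated as more temporally than spatially
    correlated when MATD < w * MASD in both directions; intra search is
    then skipped. Increasing w enlarges the skip region.
    """
    temporally_correlated = matd_value < w * min(masd_h_value, masd_v_value)
    return not temporally_correlated
```

With MATD = 5.0 and MASD_H = 6.0, the default w = 0.6 keeps the intra check (5.0 ≥ 3.6), whereas w = 0.9 skips it (5.0 < 5.4), which is the "larger w makes skipping easier" behavior described above.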
Pre Macroblock Mode Selection
MB mode classification
- Decision table of candidate MB modes
- A block diagram of MB mode selection
21
Adaptive Slice-level Parallelism
Overview
- Characteristics
  - Easy to implement
  - Low overhead of inter-communication among processing units
  - Good scalability
  - Increased bitrate
- Slice boundaries are defined on the basis of a fixed number of MBs or a fixed number of bits
  - Hard to decide a slice boundary prior to encoding
22
Adaptive Slice-level Parallelism
Fixed MB assignment
- The number of consecutive MBs in each slice
  - L: the number of processing units on a multi-core system
  - M: the total number of MBs in a frame
  - i: slice index
- Example: with L = 8 processing units and CIF resolution (352×288), M = 22 × 18 = 396
  → about 49 MBs can be assigned to each slice (396 / 8 = 49.5)
23
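The slide's formula for the per-slice MB count is not reproduced. A minimal sketch of fixed assignment, assuming the M mod L leftover MBs are handed out one each to the first slices (the paper may instead round differently or give the remainder to the last slice); `fixed_assignment` is a hypothetical name.

```python
from typing import List


def fixed_assignment(num_mbs: int, num_slices: int) -> List[int]:
    """Split num_mbs consecutive MBs into num_slices slices of near-equal MB count.

    Assumption: the num_mbs % num_slices leftover MBs go to the first slices.
    """
    base, extra = divmod(num_mbs, num_slices)
    return [base + 1 if i < extra else base for i in range(num_slices)]


M = (352 // 16) * (288 // 16)  # CIF: 22 x 18 = 396 MBs
sizes = fixed_assignment(M, 8)
print(M, sizes)  # 396 [50, 50, 50, 50, 49, 49, 49, 49]
```

This balances MB counts, not workload: each slice still receives about 49 MBs regardless of how expensive those MBs are to encode.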
Adaptive Slice-level Parallelism
Fixed MB assignment
- The scheduling of slice-level parallelism on eight processing units
[Figure: ideal case vs. practical case; encoding time of slices 0 to 7 on PU0 to PU7. In the practical case, the slowest slice is the bottleneck.]
24
Adaptive Slice-level Parallelism
Fixed MB assignment
- The imbalance of computational load distribution among slices
  - Exhaustive search method
  - Fast ME / fast mode search
25
Adaptive Slice-level Parallelism
Fixed MB assignment
- Computational load for encoding one frame with slice-level parallelism
  - C^t_slice(i): the computational load of the ith slice in the tth frame
- Computational load of the tth frame on a single processor system
  - L: number of slices in a frame
26
Adaptive Slice-level Parallelism
Fixed MB assignment
- The speedup of a multiprocessor system over a single processor system
- To achieve the maximum speedup
  - The computational loads of all slices should be as similar as possible
  → Adaptive slice partition method
27
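The load and speedup equations on these two slides are not reproduced. Presumably, a single processor needs the sum of all slice loads, while the parallel frame finishes only when its slowest slice does, so the speedup is Σ_i C^t_slice(i) / max_i C^t_slice(i), which reaches L only when all slices are equally loaded. A sketch under that assumption, ignoring communication overhead; `speedup` is a hypothetical name.

```python
from typing import Sequence


def speedup(slice_loads: Sequence[float]) -> float:
    """Speedup of slice-level parallelism over a single processor.

    Single processor: all slices run back to back (sum of loads).
    Parallel: one slice per processing unit, so the frame time is the
    load of the slowest slice (max). Overhead is ignored here.
    """
    return sum(slice_loads) / max(slice_loads)


balanced = [1.0] * 8
unbalanced = [1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(speedup(balanced))    # 8.0
print(speedup(unbalanced))  # 4.5
```

Doubling the load of just one of eight slices cuts the speedup from 8 to 4.5, which is why the loads, not the MB counts, have to be equalized.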
Adaptive Slice-level Parallelism
Complexity estimation model
- A simple estimation method that utilizes the result of fast MB mode selection
- Define a group value g corresponding to the candidate MB modes
28
Adaptive Slice-level Parallelism
Complexity estimation model
- Complexity model
  - C_{k,CHK_Intra}(g): complexity cost of the kth MB
  - g: group index
  - e_Inter: estimated complexity cost of inter mode in g = 1
  - e_Intra: complexity cost of the intra mode check in g = 1
  - α1, α2, α3, β1, β2, β3: weighting values of the complexity cost
29
Adaptive Slice-level Parallelism
Complexity estimation model
- Relative computational load C_{k,CHK_Intra}(g)
  - CHK_Intra = 0 (intra RD check skipped): assume e_Inter = 1, e_Intra = 0
    - g = 1: e_Inter + e_Intra = 1
    - g = 2, 3, 4: weighted by α1, α2, α3
    - α1 = 2.42, α2 = 3.12, α3 = 5.28
  - CHK_Intra = 1 (intra RD check performed): assume e_Inter = 1, e_Intra = 3.97
    - g = 1: e_Inter + e_Intra = 4.97
    - g = 2, 3, 4: weighted by α1, α2, α3 and β1, β2, β3
    - β1 = 0.82, β2 = 0.83, β3 = 0.84
- Resulting relative loads

  g               1      2      3      4
  CHK_Intra = 0   1.00   2.42   3.12   5.28
  CHK_Intra = 1   4.97   6.48   7.23   9.48
30
Adaptive Slice-level Parallelism
Adaptive MB assignment
- The total computational load at the tth frame
$$\tilde{C}^t = \sum_{k=0}^{M-1} C_{k,\mathrm{CHK}_{Intra}}(g)$$
- Ideal computational load of each slice for a uniform workload distribution
$$\tilde{C}^t_{slice} = \frac{\tilde{C}^t}{L}$$
31
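The two formulas above can be evaluated directly from the relative loads tabulated on the previous slide. A sketch that looks up C_{k,CHK_Intra}(g) for each MB from its (CHK_Intra, g) pair; the table values are the slide's, while the data layout and the names `COST`, `frame_load` and `ideal_slice_load` are hypothetical.

```python
from typing import Dict, List, Tuple

# Relative per-MB load C_{k,CHK_Intra}(g) from the slide (e_Inter normalized to 1).
COST: Dict[int, Dict[int, float]] = {
    0: {1: 1.00, 2: 2.42, 3: 3.12, 4: 5.28},  # intra RD check skipped
    1: {1: 4.97, 2: 6.48, 3: 7.23, 4: 9.48},  # intra RD check performed
}


def frame_load(mbs: List[Tuple[int, int]]) -> float:
    """Estimated total load of a frame; mbs holds (chk_intra, g) for k = 0..M-1."""
    return sum(COST[chk][g] for chk, g in mbs)


def ideal_slice_load(mbs: List[Tuple[int, int]], num_slices: int) -> float:
    """Target load per slice for a uniform distribution over num_slices slices."""
    return frame_load(mbs) / num_slices


mbs = [(0, 1), (0, 4), (1, 2), (1, 1)]
print(frame_load(mbs), ideal_slice_load(mbs, 2))
```

For these four MBs the total is 1 + 5.28 + 6.48 + 4.97 = 17.73, so each of two slices should ideally carry about 8.87.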
Adaptive Slice-level Parallelism
Adaptive MB assignment
- MB assignment of each slice
  - Much better balanced than fixed MB assignment
32
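The slide does not spell out how slice boundaries are placed. One simple possibility, shown here as a hypothetical greedy rule rather than the paper's procedure: walk the MBs in raster order and close slice i at the first MB where the running estimated load reaches (i + 1) · C̃_slice, leaving all remaining MBs to the last slice. `adaptive_assignment` is a hypothetical name.

```python
from typing import List


def adaptive_assignment(mb_costs: List[float], num_slices: int) -> List[int]:
    """Return the number of consecutive MBs assigned to each slice.

    Hypothetical greedy rule: slice i ends at the first MB where the
    running cost reaches (i + 1) * target, with target = total / num_slices.
    """
    target = sum(mb_costs) / num_slices
    sizes: List[int] = []
    count = 0
    running = 0.0
    for cost in mb_costs:
        running += cost
        count += 1
        # Close the current slice once its share is reached; the last
        # slice stays open and takes every remaining MB.
        if len(sizes) < num_slices - 1 and running >= (len(sizes) + 1) * target:
            sizes.append(count)
            count = 0
    sizes.append(count)
    return sizes


print(adaptive_assignment([1, 1, 1, 1, 5.28, 5.28, 1, 1], 2))  # [5, 3]
```

In this toy frame a fixed split of 4 + 4 MBs gives slice loads of 4 and 12.56, while the greedy split of 5 + 3 MBs gives 9.28 and 7.28, so the slowest slice finishes much sooner. With uniform costs the rule falls back to near-equal MB counts.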
Adaptive Slice-level Parallelism
Adaptive MB assignment
- Entire block diagram
33
Experimental Results
Overview
- Performance comparison between the proposed MB mode decision and the conventional method
- Comparison of adaptive slice-level parallelism with fixed slice-level parallelism
34
Experimental Results
MB mode selection
- Average encoding time saving, AST [%]
  - FULL_1Slice: exhaustive method, one slice per frame
  - FMD_1Slice: fast MB mode search method, one slice per frame
- BDPSNR and BDBR are used to measure performance relative to FULL_1Slice
35
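The AST formula is not shown on the slide. A common definition, assumed here, is the per-test-point saving (T_ref − T_test) / T_ref × 100 with T_ref from FULL_1Slice and T_test from FMD_1Slice, averaged over all test points (sequences and QPs). Function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def time_saving(t_ref: float, t_test: float) -> float:
    """Encoding-time saving in percent of the reference time."""
    return (t_ref - t_test) / t_ref * 100.0


def ast(ref_times: Sequence[float], test_times: Sequence[float]) -> float:
    """Average time saving over paired runs (e.g. FULL_1Slice vs FMD_1Slice)."""
    return mean(time_saving(r, t) for r, t in zip(ref_times, test_times))
```

For example, reference times of 100 s and 200 s against test times of 40 s and 60 s give savings of 60 % and 70 %, so AST is 65 %.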
Experimental Results
Rate distortion curves
36
Experimental Results
- R-D performance compared to one slice per frame (FMD_1Slice)
37
Experimental Results
Rate distortion curves
38
Experimental Results
Slice-level parallelism
- Comparing adaptive and fixed slice-level parallelism
$$\mathrm{Speedup}_{\mathrm{FMD\_Fixed}} = \frac{\mathrm{EncTime}(\mathrm{FMD\_1Slice})}{\max_i \mathrm{EncTime}_{slice_i}(\mathrm{FMD\_Fixed}) + \mathrm{OverheadTime}}$$
$$\mathrm{Speedup}_{\mathrm{FMD\_Adaptive}} = \frac{\mathrm{EncTime}(\mathrm{FMD\_1Slice})}{\max_i \mathrm{EncTime}_{slice_i}(\mathrm{FMD\_Adaptive}) + \mathrm{OverheadTime}}$$
  - EncTime(FMD_1Slice): encoding time of one slice per frame on a single processor system
  - max_i EncTime_slice_i: the longest encoding time of a slice using fixed or adaptive mode, respectively
39
Experimental Results
Speedup
40
Conclusions
- Proposed a fast MB mode selection using the consistency history of block size and zero MV
- Proposed an intra mode selection by comparing temporal and spatial correlation
- Using these two schemes, proposed a new adaptive slice-level parallelism to speed up the H.264/AVC encoder
41