A Highly Parallel Framework for HEVC Coding Unit Partitioning Tree Decision on Many-core Processors
Chenggang Yan, Yongdong Zhang, Jizheng Xu, Feng Dai, Liang Li, Qionghai Dai, and Feng Wu
IEEE SIGNAL PROCESSING LETTERS, VOL. 21, NO. 5, MAY 2014

Outline

Introduction
Related Work
Proposed Method
Experimental Results
Conclusion

Introduction(1/3)

In HEVC, each frame is divided into non-overlapping CTUs, each of which can be recursively split into smaller CUs.
For a CTU, the CU partitioning tree (CUPT) controls how the CTU is coded, using CUs with variable block sizes and coding modes.
The price of this higher coding efficiency is higher computational complexity.

Introduction(2/3)

To speed up the CUPT decision process, many researchers have tried to reduce the search space by not searching every branch of the quad-tree [10].
• To preserve coding efficiency, many branches of the quad-tree cannot be skipped, so the speedup is no more than a factor of two.
• Many researchers consider only RD-based intra mode selection, although inter mode selection is much more time-consuming.
[10] L. Shen, Z. Liu, and X. Zhang et al., “An effective CU size decision method for HEVC encoders,” IEEE Trans. Multimedia, vol. 15, pp. 465–470, Jan. 2013.

Introduction(3/3)

Many-core processors are good candidates for speeding up compression algorithms.
• Efficient parallelization of the CUPT decision (CUPTD) on many-core processors is challenging because CUPTD has complicated data dependencies.
• If CUPTD is not extensively parallelizable, cores will be left unused and performance may suffer.

Related Work(1/3)

HEVC CU Partition Tree Decision (CUPTD)

Related Work(2/3)

For RD-based intra prediction:
• Instead of applying intra coding at the PU level, HEVC conducts intra prediction sequentially at the TU level, always using the nearest neighboring reference samples from the already reconstructed TUs.
• To enhance coding efficiency, HEVC provides as many as 35 prediction modes.
• Just as in H.264/AVC, the left, above, and above-right neighboring reconstructed samples are used for intra prediction.

Related Work(3/3)

For RD-based inter prediction:
• The best motion vector predictor is selected from a given advanced motion vector prediction candidate list (AMVPCL).
• The AMVPCL is composed of both spatial candidates and temporal candidates.
• Spatial candidates need the motion information of the neighboring left, left-down, upper, upper-left, and upper-right PUs.
• Because of how RD-based intra/inter prediction works, the search of the current CU branch may have data dependencies on its neighboring left, left-down, upper, upper-left, and upper-right CU branches.

Proposed Method A(1/2)

Problem Formulation

Proposed Method A(2/2)

• M: maximum depth of the CTU.
• H0 and H1: overhead of not splitting the CU and of splitting the CU, respectively.
• H(𝑉𝑖−𝑚): the best RD cost computed for the CU 𝑉𝑖−𝑚 without any restriction.
• G(𝑉𝑖−𝑚): the best RD cost computed for the CU 𝑉𝑖−𝑚 when it is not split into sub-CUs.
The HM 7.0 encoder computes the best RD cost starting from H(𝑉𝑖0).
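
The definitions above suggest a recursive comparison between coding a CU whole and splitting it into four sub-CUs. The sketch below is one plausible reading, not the formula from the slide: the function name `best_rd_cost`, the `(x, y, size)` CU representation, and the `unsplit_cost` callback (standing in for G) are all hypothetical, and it assumes that a CU at the maximum depth M carries no split overhead.

```python
from typing import Callable, Tuple

CU = Tuple[int, int, int]  # (x, y, size); hypothetical representation of a CU


def best_rd_cost(cu: CU, depth: int, max_depth: int,
                 unsplit_cost: Callable[[CU], float],
                 h0: float, h1: float) -> float:
    """Return H(cu): the best RD cost when splitting is allowed down to max_depth."""
    if depth == max_depth:
        # Assumption: the smallest CU cannot split, so no overhead is added.
        return unsplit_cost(cu)
    # Option 1: code the CU whole, G(cu) plus the overhead of not splitting.
    no_split = unsplit_cost(cu) + h0
    # Option 2: split into four equal sub-CUs and recurse on each.
    x, y, size = cu
    half = size // 2
    children = [(x, y, half), (x + half, y, half),
                (x, y + half, half), (x + half, y + half, half)]
    split = h1 + sum(best_rd_cost(c, depth + 1, max_depth, unsplit_cost, h0, h1)
                     for c in children)
    return min(no_split, split)
```

A serial encoder evaluates this recursion depth-first from the CTU downward, which is the order that the CTU-level and CU-level parallelism in the following slides tries to relax.
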
Proposed Method B(1/3)

CTU-Level Parallelism
• Before the current CTU is processed, the best RD costs of its neighboring left, upper, upper-left, and upper-right CTUs must already have been computed.
• The current CTU therefore has data dependencies on its neighboring left, upper, upper-left, and upper-right CTUs.
• We use the same DAG-based order as described in our previous work [14] to parallelize CTUs.
[14] C. Yan et al., “Highly parallel framework for HEVC motion estimation on many-core platform,” in Data Compression Conf., Snowbird, UT, 2013, pp. 63–72.

Proposed Method B(2/3)

Generate a DAG to capture the dependency relationships of CTUs.
• The DAG consists of a set of vertices V and a set of edges E.
• Each data dependency corresponds to an edge.
• When a CTU has been processed, its vertex is removed from the DAG.
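
To make the DAG concrete, the sketch below builds the dependency lists for a small grid of CTUs and computes the earliest parallel "wave" in which each CTU could run. It is an illustration under stated assumptions rather than the authors' code: `build_ctu_dag`, `earliest_wave`, and the (row, column) coordinate convention are hypothetical names and choices.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, column) of a CTU; hypothetical convention


def build_ctu_dag(rows: int, cols: int) -> Dict[Coord, List[Coord]]:
    """Map each CTU to the neighboring CTUs it depends on (one edge per dependency)."""
    deps: Dict[Coord, List[Coord]] = {}
    for i in range(rows):
        for j in range(cols):
            # Left, upper, upper-left, and upper-right neighbors.
            candidates = [(i, j - 1), (i - 1, j), (i - 1, j - 1), (i - 1, j + 1)]
            deps[(i, j)] = [(r, c) for r, c in candidates
                            if 0 <= r < rows and 0 <= c < cols]
    return deps


def earliest_wave(deps: Dict[Coord, List[Coord]]) -> Dict[Coord, int]:
    """Longest dependency chain ending at each CTU, i.e. its earliest wave."""
    wave: Dict[Coord, int] = {}
    for node in sorted(deps):  # row-major order visits every dependency first
        wave[node] = 1 + max((wave[d] for d in deps[node]), default=-1)
    return wave
```

For a grid with at least two columns, the CTU at (i, j) lands in wave j + 2i, so CTUs along the same staggered diagonal have no edges between them and can be processed at the same time.
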
Proposed Method B(3/3)
Proposed Method B(1/)

Step 1: Initialize DQ and CM. DQ is a waiting queue. CM records the number of related CTUs for each CTU.
Step 2: Get coordinates from DQ and process the corresponding CTUs in parallel on the many-core platform.
Step 3: Update CM. When the CTU with coordinate (i, j) has been processed, the values at coordinates (i+1, j), (i+1, j-1), (i, j+1), and (i+1, j+1) in CM are each decremented by one.
Step 4: When values in CM become zero, get the corresponding coordinates and push them into DQ.
Step 5: Repeat steps 2 to 4 until the frame is finished.
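
The five steps can be simulated serially to see which CTUs become ready together. This is a minimal sketch, assuming (i, j) means (row, column) and that all CTUs taken from DQ in one pass finish before CM is updated; the function name `schedule_ctus` is hypothetical, and a real encoder would dispatch each batch to cores instead of collecting it in a list.

```python
from collections import deque
from typing import List, Tuple


def schedule_ctus(rows: int, cols: int) -> List[List[Tuple[int, int]]]:
    """Return batches of CTU coordinates; CTUs in one batch can run in parallel."""
    def in_frame(i: int, j: int) -> bool:
        return 0 <= i < rows and 0 <= j < cols

    # Step 1: CM[i][j] counts the unprocessed CTUs that CTU (i, j) depends on
    # (left, upper, upper-left, upper-right); DQ holds the CTUs that are ready.
    cm = [[sum(in_frame(r, c) for r, c in
               ((i, j - 1), (i - 1, j), (i - 1, j - 1), (i - 1, j + 1)))
           for j in range(cols)] for i in range(rows)]
    dq = deque((i, j) for i in range(rows) for j in range(cols) if cm[i][j] == 0)

    batches: List[List[Tuple[int, int]]] = []
    while dq:  # Step 5: repeat until every CTU of the frame has been processed
        # Step 2: take every ready coordinate as one parallel batch.
        batch = list(dq)
        dq.clear()
        batches.append(batch)
        for i, j in batch:
            # Step 3: decrement the counters of the CTUs that depend on (i, j).
            for r, c in ((i + 1, j), (i + 1, j - 1), (i, j + 1), (i + 1, j + 1)):
                if in_frame(r, c):
                    cm[r][c] -= 1
                    # Step 4: a counter that reaches zero marks a ready CTU.
                    if cm[r][c] == 0:
                        dq.append((r, c))
    return batches
```

For example, `schedule_ctus(3, 4)` starts with the single CTU (0, 0), and the third batch contains both (1, 0) and (0, 2), the first point at which two CTUs can be processed at once.
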
Proposed Method C(1/3)

CU-Level Parallelism
When H(𝑉𝑖0) is computed for the current CTU 𝑉𝑖0, the RD-based inter/intra modes of the left, upper, upper-left, and upper-right CTUs must already have been completely decided.
We analyze the CU-level dependencies within the same frame:
• There exist completely independent CUs (CICUs), which have no data dependencies on other CUs within the same CTU.
• There exist partially independent CUs (PICUs), which have no remaining data dependencies on other CUs once the related CUs within the same CTU have been processed.

Proposed Method C(2/3)

CICUs:
• The CICU’s left boundary coincides with the CTU’s left boundary.
• The CICU’s upper boundary coincides with the CTU’s upper boundary.

Proposed Method C(3/3)

PICUs:
• PICUs do not meet the requirements of CICUs.
• The PICU’s left boundary coincides with the CTU’s left boundary, or the neighboring left largest-size CU has been computed.
• The PICU’s upper boundary coincides with the CTU’s upper boundary, or the neighboring upper and upper-right largest-size CUs have been computed.
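
The CICU and PICU conditions can be read as a small classification rule. The sketch below is one interpretation only: it assumes a CICU must satisfy both boundary conditions at once, takes `x` and `y` as the CU's offset from the top-left corner of its CTU, and uses hypothetical flags for whether the relevant neighboring CUs have been computed. The "blocked" label for a CU that is not yet ready is also an invention of this example.

```python
def classify_cu(x: int, y: int, left_done: bool,
                upper_done: bool, upper_right_done: bool) -> str:
    """Classify a CU inside its CTU as "CICU", "PICU", or "blocked" (not ready yet)."""
    on_left = x == 0   # the CU's left boundary lies on the CTU's left boundary
    on_top = y == 0    # the CU's upper boundary lies on the CTU's upper boundary
    if on_left and on_top:
        return "CICU"  # assumption: both boundary conditions must hold
    # PICU conditions: each side is either on the CTU boundary or already computed.
    left_ok = on_left or left_done
    upper_ok = on_top or (upper_done and upper_right_done)
    return "PICU" if left_ok and upper_ok else "blocked"
```

Under this reading, the top-left CU at every depth is a CICU and can start immediately, while other CUs become PICUs as their left, upper, and upper-right neighbors finish.
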
Experimental Results

To compare our proposed method with serial execution, we adopt an encoder ported from the HEVC reference software HM 7.0 without any optimization.
The experimental platform in this letter is the Tile64, a member of the TILERA many-core family that contains 64 processing cores [17].
[17] S. Bell et al., “TILE64-Processor: A 64-core SoC with mesh,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2008, pp. 88–598.

Conclusion

We propose an efficient parallel framework for HEVC CUPTD on many-core processors.
Experiments conducted on the Tile64 platform demonstrate that our method reduces encoding time compared with the default encoding scheme in HM 7.0.