Implementation and Improvement of Wavefront Parallel Processing for HEVC Encoding on Many-core Platform
Shaobo Zhang, Xiaoyun Zhang, Zhiyong Gao
2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
Outline
• Introduction
• Proposed Method
• Experimental Results
• Conclusion
Introduction
• HEVC introduces two parallel tools, Tile and Wavefront Parallel Processing (WPP), to facilitate high-level parallel processing.
• Unlike slice and Tile, WPP neither changes the regular raster scan order nor breaks coding dependencies at row boundaries.
• WPP often provides better compression performance and avoids some visual artifacts that Tile and slice parallelism may induce.
Introduction(Cont.)
• Several related works focus on improving
parallelism of HEVC.
• Chi[4] presents a novel approach called Overlapped Wavefront (OWF) to enhance the parallel efficiency of WPP.
• Yan[5] utilizes the data dependencies among
neighboring CTUs and PU regions to exploit the
implicit parallelism.
[4] C. C. Chi et al., “Parallel scalability and efficiency of HEVC parallelization approaches,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, pp. 1827–1838, Dec. 2012.
[5] C. Yan et al., “Highly parallel framework for HEVC motion estimation on many-core platform,” Proc. DCC, pp. 63–72, Mar. 2013.
Introduction(Cont.)
• WPP and its applications still have some shortcomings.
– The HEVC test model (HM) is a single-core codec, so its serial realization of WPP is not suitable for HEVC encoding on a many-core platform.
– The wavefront dependencies introduce parallelization inefficiencies, which become worse when a high number of processors is used.
Proposed Method
• Except for the first row of a slice, WPP requires control signaling that indicates, when a CTU is processed, whether the top-right CTU in the previous row has already been encoded.
• Additional memory is required to store the side information and CABAC probabilities needed by the following rows.
Proposed Method(Cont.)
• A try-and-wait mechanism is presented to apply WPP to an HEVC encoder on a many-core platform.
– The control signaling is stored CTU by CTU, so W × H bytes are required.
– Before it is processed, the current CTU checks whether the top-right CTU in the previous row has been completed. If not, the corresponding core waits and tries again.
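The try-and-wait check can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: W and H are taken to be the frame width and height in CTUs, Python threads stand in for cores, rows are assigned to cores round-robin, and the function name encode_frame_wpp is hypothetical. The actual CTU encoding is replaced by a log entry.

```python
import threading
import time


def encode_frame_wpp(width, height, num_cores):
    """Process a width x height grid of CTUs in wavefront order.

    Returns the order in which CTUs were completed (illustration only).
    """
    done = bytearray(width * height)  # one flag byte per CTU: W x H bytes
    order = []                        # completion log, stands in for real encoding
    lock = threading.Lock()

    def worker(core):
        # Assumption: core c handles rows c, c + num_cores, c + 2 * num_cores, ...
        for y in range(core, height, num_cores):
            for x in range(width):
                if y > 0:
                    # Top-right CTU in the previous row (clamped at the last column).
                    dep = (y - 1) * width + min(x + 1, width - 1)
                    while not done[dep]:  # try ...
                        time.sleep(0)     # ... wait, then attempt again
                with lock:
                    order.append((x, y))  # placeholder for encoding CTU (x, y)
                done[y * width + x] = 1   # signal that this CTU is finished

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


order = encode_frame_wpp(width=6, height=4, num_cores=3)
```

In this sketch every CTU below the first row appears in the log only after its top-right neighbour, which is the dependency the control signaling is meant to enforce.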
• Ping-pong storage is used to reduce the memory needed for side information.
• A data reuse structure is also used for storing the CABAC probabilities.
– Once the probabilities of the previous row have been used, they are no longer needed and can be overwritten by the newest probabilities. The data reuse structure reduces probability storage by 88%.
• Based on the above methods, WPP is realized efficiently for a real-time HEVC encoder on a many-core platform.
Proposed Method(Cont.)
• Parallel scalability model of WPP
– When the encoding speed no longer increases as cores are added, the encoder has reached its Maximum Parallel Scalability (MPS).
• k : number of cores.
• n : number of CTU units (rows, Tiles or slices) in one frame.
Proposed Method(Cont.)
• α : remaining rows.
• u = ceil(H/k)
• v = (H−1)mod k
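As a small worked example, the two quantities u and v defined above can be evaluated directly. The sketch assumes that H is the number of CTU rows in the frame; the function name and the sample values (17 rows, as for a 1080-line frame with 64 × 64 CTUs) are illustrative only.

```python
import math


def wpp_row_split(H, k):
    """Return (u, v) for a frame of H CTU rows encoded on k cores."""
    u = math.ceil(H / k)  # u = ceil(H/k)
    v = (H - 1) % k       # v = (H - 1) mod k
    return u, v


print(wpp_row_split(17, 8))   # (3, 0)
print(wpp_row_split(17, 36))  # (1, 16)
```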
Proposed Method(Cont.)
• Improvement of parallel scalability for WPP
– Reduce CTU size
– Combine WPP with slice-level parallelism
– Combine WPP with frame-level parallelism
Proposed Method(Cont.)
• Reduce CTU size
– Reducing the CTU size is an efficient way to increase the number of CTU rows in a frame and thereby improve the parallel scalability.
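The effect of the CTU size on the number of rows available to the wavefront can be checked with a short calculation. The 1080-line frame height is an illustrative assumption, and ctu_rows is a hypothetical helper name.

```python
import math


def ctu_rows(frame_height, ctu_size):
    """Number of CTU rows needed to cover frame_height luma lines."""
    return math.ceil(frame_height / ctu_size)


for size in (64, 32, 16):
    print(size, ctu_rows(1080, size))
# 64 17
# 32 34
# 16 68
```

Each halving of the CTU size doubles the number of rows that can be in flight at once.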
Proposed Method(Cont.)
– Although reducing the CTU size effectively increases the parallel scalability of WPP, it decreases the coding efficiency.
– Kim[6] reports a BD-rate performance loss of about 3.4% to 14.4% when the CTU size decreases from 32 × 32 to 16 × 16.
– A CTU size of 32 × 32 is therefore preferable to balance parallelism against performance loss.
[6] Kim et al., “Block partitioning structure in the HEVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, pp. 1649–1668, Dec. 2012.
Proposed Method(Cont.)
• Combine WPP with slice-level parallelism
– Slice-level parallelism, such as slice and Tile, can break some dependencies among rows, so the parallel scalability can be enhanced when it is combined with WPP.
– Clare[7] implements two types of Tile and WPP combination, which divide the frame into two side-by-side Tiles, either independent or dependent, with each Tile wavefront processed.
[7] G. Clare et al., “Wavefront parallel processing for HEVC encoding and decoding,” JCTVC-F0274, July 2011.
Proposed Method(Cont.)
– Combining 2 to 4 slices with WPP at a 32 × 32 CTU size gives promising parallel scalability while keeping the performance loss minor.
• m : number of slices or tiles.
• Hm = H/m.
• v' = (Hm−1) mod [floor(k/m)]
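The per-slice quantities above can be evaluated in the same way as u and v. This sketch assumes that H is the number of CTU rows, that H divides evenly by m, and that there is at least one core per slice (k ≥ m); the function name and sample values are illustrative.

```python
def per_slice_split(H, k, m):
    """Return (Hm, v') for H CTU rows, k cores and m slices or Tiles."""
    if H % m != 0 or k < m:
        raise ValueError("sketch assumes H divisible by m and k >= m")
    Hm = H // m                   # Hm = H/m
    cores_per_slice = k // m      # floor(k/m)
    v_prime = (Hm - 1) % cores_per_slice
    return Hm, v_prime


print(per_slice_split(34, 16, 2))  # (17, 0)
print(per_slice_split(34, 36, 2))  # (17, 16)
```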
Proposed Method(Cont.)
• Combine WPP with frame-level parallelism
– Two GOP structures, IPpP and IPpp, are introduced to improve parallelism, where I and P frames can be used as reference frames while p (denoting a disposable frame) cannot be used as a reference.
– When a row has been encoded and no more tasks are available in the current picture, WPP combined with frame-level parallelism starts the next 1 to 3 frames simultaneously.
Proposed Method(Cont.)
– It can be inferred that H − 2 cores are enough for encoding in parallel.
– The start time of the Nth picture can be deduced as NW + 2Nr + 1.
– The finishing moment of the Nth picture can be deduced as (N + 2)W + 2Nr + 2.
• r : maximum vertical search range.
• N : index of the picture (the Nth picture).
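A short calculation makes the schedule implied by these two expressions concrete. The sketch assumes that W is the frame width in CTUs, that r is expressed in CTU rows, that time is counted in the same units as in the expressions above, and that pictures are indexed from N = 0; the sample values W = 60 and r = 2 are illustrative.

```python
def start_time(N, W, r):
    """Start time of picture N: N*W + 2*N*r + 1."""
    return N * W + 2 * N * r + 1


def finish_time(N, W, r):
    """Finishing moment of picture N: (N + 2)*W + 2*N*r + 2."""
    return (N + 2) * W + 2 * N * r + 2


for N in range(3):
    print(N, start_time(N, 60, 2), finish_time(N, 60, 2))
# 0 1 122
# 1 65 186
# 2 129 250
```

Under these expressions each picture starts W + 2r time units after the previous one (64 in the example), well before the previous picture finishes.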
Proposed Method(Cont.)
– The finishing moment of the Nth frame is (α + 2)W + 2αr + 2.
– (p + 1)(H − r) cores are enough to attain its MPS.
• r : maximum vertical search range.
• p : number of disposable frames.
• α = ceil[ N/(p+1) ].
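The expressions with disposable frames can be evaluated as below. As before, this is a sketch under assumptions: W and H are the frame width and height in CTUs, r is in CTU rows, and the sample values are illustrative. For p = 0, α equals N and the finishing moment reduces to (N + 2)W + 2Nr + 2.

```python
import math


def finish_time_disposable(N, W, r, p):
    """Finishing moment of frame N with p disposable frames."""
    alpha = math.ceil(N / (p + 1))  # alpha = ceil[N/(p+1)]
    return (alpha + 2) * W + 2 * alpha * r + 2


def cores_for_mps(H, r, p):
    """Core count stated as sufficient to attain MPS: (p + 1)(H - r)."""
    return (p + 1) * (H - r)


print(finish_time_disposable(4, 60, 2, 0), cores_for_mps(34, 2, 0))  # 378 32
print(finish_time_disposable(4, 60, 2, 1), cores_for_mps(34, 2, 1))  # 250 64
```

In this example one disposable frame brings the finishing moment of frame 4 from 378 down to 250 time units, at the cost of twice as many cores.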
Experimental Results
• Test sequences and encoding environment
– The encoder, named FHM10.0, is migrated from the HEVC reference software HM10.0.
– The input videos are standard test sequences of 100 frames each, and the motion search range is set to 64.
– The Main profile is selected, with the default encoding test conditions specified in [8].
– The experimental platform is the GX36, a member of the TILERA many-core processor family with 36 processing cores.
[8] F. Bossen, “Common test conditions and software reference configurations,” JCTVC-I1100, Apr. 2012.
Experimental Results
• Parallel scalability analysis
Conclusion
• Several effective methods, namely the try-and-wait data interface, ping-pong storage and the data reuse structure, are presented to realize WPP in parallel on an HEVC encoder.
• Three effective methods are presented to
improve parallel scalability of WPP.
• Experimental results show that the proposed methods improve the maximum parallel scalability by more than 40% compared with WPP alone.