
Procedure Cloning and Integration
for Converting Parallelism
from Coarse to Fine Grain
Won So & Alex Dean
Center for Embedded Systems Research
Department of Electrical and Computer Engineering
NC State University
1
Overview
• I. Introduction
• II. Integration Methods
• III. Overview of the Experiment
• IV. Experimental Results
• V. Conclusions and Future Work
2
I. Introduction
• Motivation
– Multimedia applications are pervasive and require a higher
level of performance than previous workloads.
– Digital signal processors are adopting ILP architectures such
as VLIW/EPIC.
• Philips Trimedia TM100, TI VelociTI architecture, BOPS
ManArray, and StarCore SC120, etc.
– Typical utilization is low: from 1/8 to 1/2.
• Not enough independent instructions within limited
instruction window.
• A single instruction stream has limited ILP.
– Exploit thread level parallelism (TLP) with ILP.
• Find far more distant independent instructions (coarse-grain
parallelism): exists at various levels (e.g. loop level,
procedure level)
3
I. Introduction (cont.)
• Software Thread Integration (STI)
– Software technique which interleaves multiple threads at the machine
instruction level
– Previous work focused on Hardware-to-Software Migration (HSM) for
low-end embedded processors.
– Integration can produce better performance
• Increases the number of independent instructions.
• Compiler generates a more efficient instruction schedule.
– Fusing/jamming multiple procedure calls into one
– Converts procedure-level parallelism to ILP
• STI for high-end embedded processors
• Goal: Help programmers make multithreaded
programs run faster on a uniprocessor
[Figure: run-time schedules of thread1 and thread2 executed
separately (with idle issue slots) vs. the scheduled execution of
the integrated thread, showing the performance enhancement]
4
I. Introduction (cont.)
• Previous work I: Multithreaded architectures
– SMT, Multiscalar
– SM (Speculative Multithreading), DMT, XIMD
– Welding, Superthreading
→ STI achieves multithreading on uniprocessors with no
architectural support.
• Previous work II: Software Techniques
– Loop jamming or fusion
→ STI fuses entire procedures, removing the loop-boundary
constraints.
– Procedure Cloning
→ STI makes procedure clones which do the work of
multiple procedures/calls concurrently.
5
II. Integration Methods
1. Identify the candidate procedures
2. Examine parallelism
3. Perform integration
4. Select an execution model to invoke best clone
6
II-1. Identify the Candidate Procedures
• Profile application
• In multimedia applications, these procedures are
typically DSP kernels: filter operations
(FIR/IIR) and frequency-time
transformations (FFT, DCT)
• Example: JPEG. (from gprof)
[Chart: execution time of CJPEG and DJPEG with function breakdown
(CPU cycles, 0 to 1.4E+08): FDCT/IDCT, Encode/Decode,
Pre/Post Process, Etc.]
7
II-2. Examine Parallelism
• Integration requires concurrent execution.
• If not already identified, find purely independent
procedure-level data parallelism.
1) Each procedure call handles its own data set, input and
output.
2) Those data sets are independent of each other.
– Abundant because multimedia applications typically process
their large data by separating it into blocks. (e.g. FDCT/IDCT)
– More details in [So02]
[Figure: original function calls (Call 1–4, each with its own
input i1–i4 and output o1–o4) vs. modified calls to integrated
functions (Call 1 + Call 2, Call 3 + Call 4), integrating 2 threads]
8
II-3. Perform Integration
• Design the control structure of the integrated procedure.
– Use Control Dependence Graph (CDG).
– Care for data-dependent predicates.
– Techniques can be applied repeatedly and hierarchically.
– Case a: Function with a loop (P1 is data-independent)
[Diagram: CFG and CDG for case a — each thread runs block 1, a loop
L1 (data-independent predicate P1) around block 2, then block 3;
the integrated thread fuses corresponding blocks 1+1', 2+2', 3+3'
under a single copy of P1]
9
II-3. Perform Integration (cont.)
– Case b: Loop with a conditional (P1 is data-dependent and
P2 is not)
[Diagram: case b — block 1 precedes loop L1 (data-independent
predicate P2); inside the loop, data-dependent predicate P1 selects
block 2 or 3, then block 4 runs. The integrated thread fuses 1+1'
and 4+4', shares one copy of P2, and tests P1 and P1' jointly to
select among fused bodies 2+2', 2+3', 3+2', 3+3']
– Case c: Loop with different iterations (P1 is data-dependent)
[Diagram: case c — each thread loops over blocks 1, 2 a
data-dependent number of iterations (predicates P1, P1'); the
integrated thread runs the fused body 1+1', 2+2' while both
predicates hold, then finishes the remaining iterations of
whichever thread's loop is longer separately]
10
II-3. Perform Integration (cont.)
• Two levels of integration: assembly and HLL
– Assembly: Better control but requires scheduler.
– HLL: Use compilers to schedule instructions.
– Which is better depends on capabilities of the tools and
compilers
• Code transform: Fusing (or jamming) two blocks
– Duplicate and interleave the code.
– Rename and allocate new local variables and parameters.
• Superset of loop jamming
– Not only jamming the loops but also the rest of the function.
– Allows a larger variety of threads to be jammed together.
• Two side effects with negative impact on performance
– Code size increase: if it exceeds the I-cache size
– Register pressure: if it exceeds the # of physical registers
11
II-4. Select an Execution Model
•
Two approaches: ‘Direct call’ and ‘Smart RTOS’
1) Direct call: Statically bind at compile time
• Modify the caller to invoke a specific version of the procedure every time (e.g. 2-threaded clone).
• Simple and appropriate for a simple system.
• Same approach is used in Procedure Cloning.
• If multiple procedures have been cloned, each may have a different optimal # of
threads
2) Call via RTOS: Dynamically bind at run time
• The RTOS selects a version at run time based on expected performance.
• Adaptive and appropriate for a more complex system.
[Diagram: 1) Direct call — the application calls other procedures
and the integrated procedure clones directly; 2) Call via RTOS —
the application calls into the RTOS, which dispatches to other
procedures or the integrated procedure clones]
12
II-4. Select an Execution Model (cont.)
• Smart RTOS model: 3 levels of execution
– Applications: Thread-forking requests for kernel procedures
– Thread library: Contains discrete and integrated versions.
– Smart RTOS: Chooses efficient version of the thread.
[Diagram: applications TA, TB, TC issue fork requests to the Smart
RTOS, which queues pending requests; its scheduler picks from a
thread library holding discrete threads T1, T2, T3 (low ILP, slow)
and integrated threads T1_2, T2_3, T1_3 (high ILP, fast)]
13
III. Overview of the Experiment
• Objective
– Develop a general approach for performing STI.
– Examine performance benefits and bottlenecks of STI.
– Did not focus on a Smart RTOS model.
• Sample application: JPEG
– Standard image compression algorithm.
– Obtained from MediaBench.
– Input: 512x512x24-bit lena.ppm
– 2 applications: compress (CJPEG) and decompress (DJPEG)
[Diagram: CJPEG algorithm — lena.ppm → [Preprocess] read image,
image preparation, block preparation → [FDCT] forward DCT,
quantize → [Encoding] Huffman/differential encoding →
[Postprocess] frame build, write image → encoded image lena.jpg]
[Diagram: DJPEG algorithm — lena.jpg → [Preprocess] read image,
frame decode → [Decoding] Huffman/differential decoding →
[IDCT] dequantize, inverse DCT → [Postprocess] image build,
write image → decoded image lena.ppm]
14
III. Overview of the Experiment (cont.)
• Integration method
– Integrated procedures
• FDCT (Forward DCT) in CJPEG
• Encode (Huffman Encoding) in CJPEG
• IDCT (Inverse DCT) in DJPEG
– Methods
• Manually integrate threads at C source level.
• Build 2 integrated versions: integrating 2 and 3 threads
• Execute them with a ‘direct call’ model.
• Experiment
– Compile with various compilers: GCC, SGI Pro64, ORC (Open
Research Compiler) and Intel C++ Compiler.
– Run on an EPIC machine: Itanium™ running Linux for IA-64.
– Evaluate the performance: with the PMU (Performance
Monitoring Unit) in the Itanium™ using the software tool pfmon.
15
III. Overview of the Experiment (cont.)
[Diagram: experimental flow]
Applications → Threads:
– CJPEG: FDCT_NOSTI / FDCT_STI2 / FDCT_STI3,
Encode_NOSTI / Encode_STI2 / Encode_STI3
– DJPEG: IDCT_NOSTI / IDCT_STI2 / IDCT_STI3
Compile → Compilers and optimizations:
– GCC (GNU C Compiler): -O2
– Pro64 (SGI Pro64 Compiler): -O2
– ORCC (Open Research Compiler): -O2 / -O3
– Intel (Intel C++ Compiler): -O2 / -O3 / -O2u0
(-O2: level 2 optimization; -O3: level 3 optimization;
-O2u0: -O2 without loop unrolling)
Run → Platform: Itanium™ processor, Linux for IA-64
Measure → Results: performance / IPC, cycle breakdown
16
IV. Experimental Results
• Measured and plotted data
– CPU Cycles (execution time), speedup by STI, and IPC.
• Normalized performance: compared with NOSTI/GCC-O2.
• Speedup by STI: compared with NOSTI compiled with
each compiler.
• IPC = number of instructions retired / CPU cycles.
– Cycle and speedup breakdown.
• Cycle breakdown
– 2 categories of cycles: inherent execution and stall
– 7 sources of stall: Instruction access, data access, RSE,
dependencies, issue limit, branch resteer, taken branches
• Speedup breakdown: sources of speedup and slowdown
– Code Size
• Code size of the procedure
17
IV. Experimental Results – FDCT in CJPEG
[Charts: FDCT/CJPEG normalized performance, speedup by STI, IPC
variations, and code size (bytes) across GCC-O2, Pro64-O2, ORCC-O2,
ORCC-O3, Intel-O2, Intel-O3, and Intel-O2-u0 for NOSTI, STI2, STI3]
• Sweet spot varies between one, two and three threads.
• STI speeds up the best compiler (Intel-O2-u0) by 17%.
• IPC does NOT correlate well with performance.
• Code expansion for the function is 75% to 255%. (I-cache: 16K)
18
IV. Cycle Breakdown – FDCT in CJPEG
[Charts: FDCT/CJPEG CPU cycle breakdown (NOSTI, STI2, STI3 per
compiler) and speedup breakdown by cycle category: Inh.Exe,
Inst.Acc, DataAcc, RSE, Dep., IssueLim., Br.Res., TakenBr]
• I-cache miss is crucial with Intel. (big code size
from loop unrolling)
• I-cache misses are reduced significantly after
disabling loop unrolling.
• Sources of speedup: Inh.Exe, DataAcc, Dep, IssueLim
• Sources of slowdown: InstAcc, BrRes
19
IV. Experimental Results – EOB in CJPEG
• NOSTI: best with Intel-O3.
• Speedup by STI: 0.69%~17.38%
• 13.61% speedup over the best compiler.
[Charts: EOB/CJPEG normalized performance and speedup by STI
across GCC-O2, Pro64-O2, ORCC-O2, ORCC-O3, Intel-O2, and Intel-O3
for NOSTI, STI2, STI3]
20
IV. Cycle Breakdown – EOB in CJPEG
• I-cache miss is not crucial, though it tends
to increase after integration.
• Sources of speedup: Inh.Exe, DataAcc, IssueLim
• Sources of slowdown: InstAcc, Dep
[Chart: EOB/CJPEG speedup breakdown (STI2 and STI3 per compiler)
by cycle category: Inh.Exe, Inst.Acc, DataAcc, RSE, Dep.,
IssueLim., Br.Res., TakenBr]
21
IV. Experimental Results – IDCT in DJPEG
• Wide performance variation for code from different compilers.
• Wide variation in STI impact too.
[Charts: IDCT/DJPEG normalized performance and speedup by STI
(-200% to 100%) across GCC-O2, Pro64-O2, ORCC-O2, ORCC-O3,
Intel-O2, Intel-O3, and Intel-O2-u0 for NOSTI, STI2, STI3]
22
IV. Cycle Breakdown - IDCT in DJPEG
[Charts: IDCT/DJPEG CPU cycle breakdown (NOSTI, STI2, STI3 per
compiler) and speedup breakdown by cycle category: Inh.Exe,
Inst.Acc, DataAcc, RSE, Dep., IssueLim., Br.Res., TakenBr]
• I-cache misses are reduced significantly after
disabling loop unrolling.
• I-cache miss is crucial in both ORCC and Intel.
23
IV. Experimental Results – Overall CJPEG App.
24
IV. Experimental Results - Overall DJPEG App.
25
IV. Experimental Results (cont.)
• Speedup by STI
– Procedure speedup up to 18%.
– Application speedup up to 11%.
– STI does not always improve performance.
– The limited Itanium I-cache is a major bottleneck.
• Compiler variations
– ‘Good’ compilers – compilers other than GCC – have many
optimization features. (e.g. speculation, predication)
– The number of instructions generated is greater than that of GCC.
– Absolute performance and speedup by STI are larger.
– But more susceptible to code size limitations.
– Apply optimizations like loop unrolling carefully.
26
V. Conclusions and Future Work
• Summary
– Developed STI technique for converting abundant TLP to ILP
on VLIW/EPIC architectures
– Introduced static and dynamic execution models
– Demonstrated potential for significant performance
improvement by STI.
– Relevant to high-end embedded processors with ILP support
running multimedia applications.
• Future Work
– Extend the proposed methodology to more varied threads.
– Examine the performance with other realistic workloads.
– Develop a tool to automate the integration process at the
appropriate level.
– Build a detailed model and algorithm for the dynamic approach.
[email protected] ------ www.cesr.ncsu.edu/agdean
27