Efficient Execution of Augmented Reality
Applications on Mobile Programmable Accelerators
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke
University of Michigan
December 10, 2013
Augmented Reality
• Physical world + computer-generated inputs
  – Examples: Commerce, Information, Games
• Compared to multimedia applications:
  1. User interactive
  2. Computationally intensive
Application Characteristics
[Chart: execution-time breakdown into Data Parallel, Software Pipelinable, and Remaining portions for the AR kernels, including video conferencing with virtual object manipulation]
1. 69% of execution time is in data-parallel loops (DLP loops), e.g., the feature-extracting kernels
   => SIMD / Coarse-Grained Reconfigurable Architecture (CGRA)
2. 15% of execution time is in software-pipelinable loops (SWP loops), e.g., the virtual object rendering kernel
   => CGRA
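As a rough sanity check (my own back-of-envelope, not from the talk), an Amdahl's-law bound shows why these two loop classes carry most of the acceleration opportunity: together they cover 84% of execution time, capping the ideal whole-application speedup at 1/0.16 ≈ 6.25x.

```python
# Back-of-envelope Amdahl's-law bound (illustrative assumption, not from the slides):
# 69% (DLP) + 15% (SWP) of execution time is accelerated by a factor s,
# while the remaining 16% runs unchanged.
def amdahl_speedup(accelerated_fraction, s):
    """Overall speedup when only `accelerated_fraction` of time is sped up by s."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / s)

dlp, swp = 0.69, 0.15
for s in (2, 4, 8, float("inf")):
    print(s, round(amdahl_speedup(dlp + swp, s), 2))
# Output ends at 6.25: even infinite acceleration of DLP+SWP loops caps at 1/0.16.
```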
SIMD vs. CGRA
• SIMD
  – Identical lanes
  – Shared instruction fetch (same schedule across PEs)
  – SIMD memory access
[Diagram: one instruction broadcast to 16 SIMD lanes, PE0 through PE15]
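A minimal sketch of the SIMD execution model just described (my own illustration, not the authors' hardware): a single fetched instruction drives all identical lanes, each operating on its own data in lockstep.

```python
# Minimal lockstep SIMD model (illustrative): one fetched instruction, 16 lanes.
NUM_LANES = 16

def simd_step(instruction, lane_regs):
    """Apply one shared, already-fetched instruction to every lane's registers."""
    op, dst, src_a, src_b = instruction          # fetched and decoded once
    for regs in lane_regs:                       # identical lanes, same schedule
        if op == "add":
            regs[dst] = regs[src_a] + regs[src_b]
        elif op == "mul":
            regs[dst] = regs[src_a] * regs[src_b]

lanes = [{"r0": i, "r1": 2, "r2": 0} for i in range(NUM_LANES)]
simd_step(("mul", "r2", "r0", "r1"), lanes)      # every PE executes the same op
print([regs["r2"] for regs in lanes])            # [0, 2, 4, ..., 30]
```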
SIMD vs. CGRA
• Homogeneous CGRA
  – Identical units
  – Mesh-like interconnects
  – Software pipelining
• Heterogeneous CGRA
  – More energy efficient than homogeneous CGRA
  – Less performance compared to homogeneous CGRA
[Diagram: 4x4 heterogeneous PE array (PE0-PE15); legend: PEs with multipliers, PEs with memory units, PEs without complex units, PEs with all units]
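A small data-structure sketch of the heterogeneous array above (names and the exact unit placement are my assumptions, not the authors' layout): each PE advertises only the functional units it has, and the scheduler can only place an operation on a PE that supports it, which is where the energy savings over a homogeneous array come from.

```python
# Illustrative model of a heterogeneous 4x4 CGRA: capabilities differ per PE.
# The unit placement below is assumed for illustration only.
BASIC = frozenset({"alu"})                 # simple integer units in every PE
MUL = BASIC | {"mul"}                      # PEs that also have multipliers
MEM = BASIC | {"mem"}                      # PEs that also have memory units

pe_capabilities = [
    MUL, BASIC, BASIC, MEM,
    MUL, BASIC, BASIC, MEM,
    MUL, BASIC, BASIC, MEM,
    MUL, BASIC, BASIC, MEM,
]

def can_place(op, pe_id):
    """A scheduler may place an op only on a PE that has the needed unit."""
    return op in pe_capabilities[pe_id]

print(can_place("mul", 0), can_place("mem", 1))   # True False
```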
SIMD vs. CGRA
[Charts: normalized execution time and normalized energy for SIMD, homogeneous CGRA, and heterogeneous CGRA, broken down into DLP loops, SWP loops, and total]
1. In DLP loops, SIMD outperforms CGRA.
2. In total execution time and energy, CGRA outperforms SIMD (due to SWP loops).
3. In energy consumption, heterogeneous CGRA beats homogeneous CGRA
   (20% less energy with only 4% performance loss).
Adding SIMD Support for CGRA
• Heterogeneous CGRA
  – Group multiple PEs to form identical SIMD cores
[Diagram: 16 PEs (PE0-PE15) partitioned into SIMD cores]
1. How do we obtain the efficiency of single instruction fetch?
2. How do we achieve the efficiency of SIMD memory access?
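One possible grouping, sketched below (the exact partition is my assumption, consistent with the 16-PE baseline described later): each SIMD core receives the same mix of memory-unit, multiplier, and basic PEs, so all cores are identical and can share one schedule while working on different loop iterations.

```python
# Illustrative partition of a 16-PE heterogeneous CGRA into 4 identical SIMD cores.
# Each core gets one memory-unit PE, one multiplier PE, and two basic PEs
# (the exact PE assignment is assumed for illustration).
simd_cores = [
    {"mem_pe": 3,  "mul_pe": 0,  "basic_pes": (1, 2)},
    {"mem_pe": 7,  "mul_pe": 4,  "basic_pes": (5, 6)},
    {"mem_pe": 11, "mul_pe": 8,  "basic_pes": (9, 10)},
    {"mem_pe": 15, "mul_pe": 12, "basic_pes": (13, 14)},
]

def core_signature(core):
    """What the scheduler sees: counts of memory, multiplier, and basic PEs."""
    return (1, 1, len(core["basic_pes"]))

# Every core has the same signature, so one 4-PE schedule can be replicated
# across all four cores, each core executing a different loop iteration.
assert len({core_signature(c) for c in simd_cores}) == 1
print(core_signature(simd_cores[0]))   # (1, 1, 2)
```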
Efficient Instruction Fetch
• Fetch each instruction from memory only once
  – Pass the instruction around the ring to the next SIMD core
  – The last SIMD core stores the instruction in a recycle buffer
[Diagram: ring of SIMD Cores 0-3; the instruction circulates while each core executes a different loop iteration]
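A behavioral sketch of this fetch scheme (function and variable names are mine): the first pass fetches each instruction from memory once and forwards it core to core on the ring; the last core deposits it in the recycle buffer, so later iterations replay instructions without touching instruction memory again.

```python
from collections import deque

NUM_CORES = 4
recycle_buffer = deque(maxlen=16)            # 16-entry recycle buffer (per the setup slide)
fetch_count = 0                              # counts real instruction-memory fetches
trace = []                                   # (core, instruction) pairs, for illustration

def fetch_from_memory(pc):
    """Model an instruction-memory fetch; counted to show how few are needed."""
    global fetch_count
    fetch_count += 1
    return ("insn", pc)

def issue(pc, first_pass):
    # First pass: fetch once, forward core-to-core on the ring, and let the
    # last core store the instruction in the recycle buffer.
    # Later passes: replay the instruction straight out of the recycle buffer.
    insn = fetch_from_memory(pc) if first_pass else recycle_buffer[pc]
    for core in range(NUM_CORES):            # each core executes its own iteration
        trace.append((core, insn))
    if first_pass:
        recycle_buffer.append(insn)

for wave in range(8):                        # 8 batches of iterations, 4-instruction loop body
    for pc in range(4):
        issue(pc, first_pass=(wave == 0))

print(fetch_count)                           # 4: each instruction hit memory only once
```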
SIMD Memory Access
• Single memory request, multiple responses
  – Split transaction
    • Enables forwarding (request ID)
  – Request carries a SIMD-mode flag and stride information
[Diagram: MemUnits 0-3 connected to memory banks 0-3]
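A sketch of the split-transaction idea (naming and the word-interleaved banking are my assumptions): one memory unit issues a single request tagged with a request ID, the SIMD-mode flag, and a stride; the banks then produce one response per lane, and each response is forwarded back to the right memory unit by matching the ID.

```python
# Illustrative split-transaction SIMD memory access: one request, many responses.
NUM_BANKS = 4
WORD = 4                                      # word-interleaved banking is assumed

def simd_load(memory, base, stride, request_id):
    """One tagged request; each lane's word comes back as a separate response."""
    responses = []
    for lane in range(NUM_BANKS):
        addr = base + lane * stride           # stride lets the banks derive addresses
        bank = (addr // WORD) % NUM_BANKS
        responses.append({"id": request_id, "lane": lane,
                          "bank": bank, "data": memory[addr]})
    return responses                          # routed to MemUnit 0..3 by matching "id"

memory = {addr: addr * 10 for addr in range(0, 64, WORD)}
for resp in simd_load(memory, base=0, stride=4, request_id=7):
    print(resp)                               # four responses for a single request
```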
Experimental Setup
• Baseline
  – Heterogeneous CGRA with 16 PEs
    • 4 PEs with memory units, 4 PEs with multipliers
• Our solution
  – Baseline + SIMD support
    • 1-cycle-latency ring network, 16-entry recycle buffer
• Compiler
  – IMPACT frontend compiler
  – Edge-centric modulo scheduler
  – ADRES framework
• Power
  – 65nm technology @ 200MHz / 1V
  – CACTI
Evaluation for DLP Loops
• 16 cores (SIMD) vs. 4 SIMD cores exploiting ILP within the loops (our solution)
[Charts: normalized execution time and normalized energy for SIMD, Proposed, and Baseline]
- Our solution is 14.1% slower compared to SIMD.
- Our solution achieves nearly the same energy efficiency as SIMD.
Evaluation for Total Execution
[Charts: normalized execution time and normalized energy for SIMD, Proposed, and Baseline]
Our solution achieves 17.6% speedup with 16.9% less energy
compared to baseline heterogeneous CGRA.
Conclusion
• Best performing / most energy-efficient solution:
  – DLP loops: SIMD
  – Whole application: CGRA
• Two techniques implement SIMD support efficiently on a CGRA:
  – Efficient instruction fetch: ring network + recycle buffer
  – SIMD memory access: split transaction + stride information in the request header
  – Results in a 3.4% power saving
• A CGRA with SIMD support improves overall performance by 17.6% with 16.9% less energy.
Questions?
For more information
http://cccp.eecs.umich.edu
[email protected]
CGRA Memory Access
• Resolve bank conflicts through buffering
  – The compiler accounts for the additional buffering delay
[Diagram: MemUnits 0-3 connected to memory banks 0-3]
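A sketch of the baseline bank-conflict handling (the banking function, base latency, and names are my assumptions): when two memory units touch the same bank in the same cycle, the later request is buffered and completes a cycle later, and the compiler can schedule around that extra latency.

```python
from collections import defaultdict

def bank_access_latency(requests, num_banks=4, base_latency=1):
    """requests: list of (mem_unit, addr) issued in the same cycle.
    Same-bank collisions are serialized by buffering, adding one cycle each."""
    per_bank = defaultdict(list)
    for unit, addr in requests:
        per_bank[addr % num_banks].append((unit, addr))   # simple interleaving (assumed)
    latency = {}
    for reqs in per_bank.values():
        for extra, (unit, addr) in enumerate(reqs):        # buffered requests wait longer
            latency[(unit, addr)] = base_latency + extra
    return latency

# MemUnit 0 and MemUnit 2 both hit bank 1, so one of them is delayed a cycle.
print(bank_access_latency([(0, 1), (1, 2), (2, 5), (3, 3)]))
```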
Compilation Flow
Program
 → Loop Classification
    – DLP loops → ILP Matching
       • Low ILP → Acyclic Scheduling
       • High ILP → Modulo Scheduling
    – SWP loops → Modulo Scheduling
 → Code Generation
 → Executable
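A small dispatch sketch mirroring the flow above (the loop attributes and the ILP threshold are assumptions for illustration, not the authors' heuristics): DLP loops whose ILP fits a narrow SIMD core are scheduled acyclically, while high-ILP DLP loops and SWP loops are modulo scheduled.

```python
# Illustrative scheduler dispatch following the compilation flow above.
SIMD_CORE_WIDTH = 4        # assumed threshold for ILP matching

def pick_scheduler(loop):
    """Return the scheduling path chosen for a classified loop."""
    if loop["kind"] == "DLP":                          # data-parallel loop
        # ILP matching: does the loop body's ILP fit a narrow SIMD core?
        return "acyclic" if loop["ilp"] < SIMD_CORE_WIDTH else "modulo"
    if loop["kind"] == "SWP":                          # software-pipelinable loop
        return "modulo"
    return "scalar"                                    # remaining (non-loop) code

loops = [
    {"name": "feature_extract", "kind": "DLP", "ilp": 2},
    {"name": "object_render",   "kind": "SWP", "ilp": 6},
]
print({loop["name"]: pick_scheduler(loop) for loop in loops})
```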
Power Analysis
[Chart: normalized power breakdown (FU, RF, Control, Memory) in CGRA mode (baseline) vs. SIMD mode]
• (-) Savings from memory
• (+) Overheads from the ring network, recycle buffer, and SIMD memory access
- SIMD mode further saves power by 3.4%.
Resource Utilization in DLP Loops
[Chart: resource utilization in CGRA mode vs. SIMD mode]
SIMD mode can utilize 13.6% more resources in DLP loops.
- The compiler generates a more efficient schedule when targeting fewer resources
  (less routing, less exploration).