Compiler-directed Synthesis of Multifunction Loop Accelerators
CGRA Express: Accelerating Execution using Dynamic Operation Fusion
Yongjun Park, Hyunchul Park, Scott Mahlke
CCCP Research Group, University of Michigan
Coarse-Grained Reconfigurable Architecture (CGRA)
Array of PEs connected in a mesh-like interconnect
High throughput with a large number of resources
Distributed hardware offers low cost/power consumption
High flexibility with dynamic reconfiguration
University of Michigan
Electrical Engineering and Computer Science
CGRA : Attractive Alternative to ASICs
Suitable for running multimedia applications for future embedded systems
High throughput, low power consumption, high flexibility
Morphosys: 8x8 array with RISC processor (Viterbi at 80 Mbps)
SiliconHive: hierarchical systolic array (H.264 at 30 fps)
ADRES: 4x4 array with tightly coupled VLIW (50-60 MOps/mW)
Performance Bottleneck: Acyclic Code
[Figure: an application's control flow (Block 0 through Block 5) split into loop and sequential regions. In the original code the loop region is dominant; after software pipelining the acyclic region is dominant.]
[Chart: execution time (M cycles, axis 0 to 250) for aac, 3d, and h.264, comparing a normal schedule with software pipelining.]
Acyclic region is substantial!
It's time to optimize acyclic code.
Key Idea: Chaining Instructions
1. Clock period: longest operation with register file access.
   Critical path: Slow! Non-critical path: Fast!
2. CGRA is not VLIW. Register file access is not frequent!
3. Opportunity of instruction chaining.
4. Considerable register access time ≈ Arithmetic operation delay
Operation delays (3.5ns clock period @ IBM 90nm)
Group            Opcode           Delay(ns)
Multi cycle op   MUL, LD, ST      1.65
Arith            ADD, SUB         1.74
Shift            LSL, LSR, ASR    1.36
Comp             EQ, NE, LT       0.93
Logic            AND, OR, XOR     0.73
RF Access                         1.61
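The chaining opportunity in the delay table can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up here, multi-cycle operations are left out, and register file access and wire delay are ignored, which is a simplifying assumption rather than something the slides state.

```python
# Delays in ns, copied from the delay table (3.5 ns clock, IBM 90 nm).
# Multi-cycle operations (MUL, LD, ST) are deliberately left out.
DELAY_NS = {
    "ADD": 1.74, "SUB": 1.74,
    "LSL": 1.36, "LSR": 1.36, "ASR": 1.36,
    "EQ": 0.93, "NE": 0.93, "LT": 0.93,
    "AND": 0.73, "OR": 0.73, "XOR": 0.73,
}
CLOCK_NS = 3.5


def chain_delay(ops):
    """Sum of the delays of a chain of dependent operations, in ns."""
    return sum(DELAY_NS[op] for op in ops)


def fits_in_one_cycle(ops, clock_ns=CLOCK_NS):
    """True if the whole chain finishes within one clock period.

    Assumption: register file access and wire delay are ignored.
    """
    return chain_delay(ops) <= clock_ns


for chain in (["ADD", "LSR"], ["LSR", "LSL"], ["ADD", "ADD", "AND"]):
    print(chain, round(chain_delay(chain), 2), fits_in_one_cycle(chain))
```

Under these assumptions an ADD followed by a shift fits comfortably in one 3.5 ns period, while a third dependent operation pushes the chain past the clock edge.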
Dynamic Operation Fusion
Execute multiple dependent operations in one cycle
Key benefits
1. Minimal hardware overhead
2. Multiple subgraphs can be executed simultaneously.
3. Dynamic merging of FUs
[Figure: MUL, LD, ADD, and LSR operations mapped onto a 4x4 CGRA, with inputs A and B, constants 512 and 10, and output Out. The cycle count of current execution is compared with operation fusion.]
Assumption: instruction time = RF read time = RF write time
Hardware Support
Simple bypass network
Small overhead:
3.8%(SRAM),
2.3%(MUX)
                baseline   modified   overhead(%)
control bit     845        877        3.8
area (mm^2)     1.447      1.48       2.3
Compiler Support
Tick-based scheduling
Tick: small time unit based on hardware delay information
Clock cycle = # of ticks
Clock boundary constraint checking
Resource conflict
Time conflict
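A minimal sketch of the tick idea and the clock boundary check follows. The tick length of 0.25 ns and all function names are hypothetical choices made for this illustration; the slides do not give the actual tick size, and resource conflicts are not modeled here.

```python
import math

TICK_NS = 0.25    # hypothetical tick length, not taken from the slides
CLOCK_NS = 3.5    # clock period quoted earlier in the deck
TICKS_PER_CYCLE = round(CLOCK_NS / TICK_NS)   # clock cycle = # of ticks


def to_ticks(delay_ns):
    """Round a hardware delay up to a whole number of ticks (conservative)."""
    return math.ceil(delay_ns / TICK_NS - 1e-9)


def crosses_clock_boundary(start_tick, length_ticks):
    """True if an operation occupying ticks
    [start_tick, start_tick + length_ticks - 1] spans two clock cycles."""
    first_cycle = start_tick // TICKS_PER_CYCLE
    last_cycle = (start_tick + length_ticks - 1) // TICKS_PER_CYCLE
    return first_cycle != last_cycle


def earliest_legal_start(ready_tick, length_ticks):
    """Start when the operands are ready, unless the operation would straddle
    a clock edge; in that case wait for the start of the next cycle."""
    if crosses_clock_boundary(ready_tick, length_ticks):
        return (ready_tick // TICKS_PER_CYCLE + 1) * TICKS_PER_CYCLE
    return ready_tick


add_ticks = to_ticks(1.74)     # ADD delay from the delay table
shift_ticks = to_ticks(1.36)   # shift delay from the delay table
print(TICKS_PER_CYCLE, add_ticks, shift_ticks)
print(earliest_legal_start(add_ticks, add_ticks))   # second ADD can chain
print(earliest_legal_start(12, shift_ticks))        # pushed to next cycle
```

Rounding delays up to whole ticks keeps the check conservative: an operation is never assumed to finish earlier than the hardware delay allows.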
Dynamic Operation Fusion Example(1)
1. Conventional scheduling: 5 cycles
DataFlow Graph: SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5); inputs RF[0], RF[1], and constants; output RF[2]
Schedule Table (columns FU0 to FU5)
Time   Operations
0      OP 0, OP 1
1      OP 2
2      OP 3
3      OP 4
4      OP 5
[Figure: CGRA mapping of OP 0 through OP 5 onto the array, with the register file.]
Dynamic Operation Fusion Example(2)
2. Dynamic operation fusion: 3 cycles
DataFlow Graph: SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5); inputs RF[0], RF[1], and constants; output RF[2]
Schedule Table (columns FU0 to FU5)
Time   Operations
0      OP 0, OP 1, OP 2
1      OP 3, OP 4
2      OP 5
[Figure: CGRA mapping of OP 0 through OP 5 onto the array, with the register file.]
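The two schedules in this example can be reproduced with a small tick-based scheduler. This is a sketch under stated assumptions, not the authors' compiler: the dependence edges of the graph are assumed (the extracted slide text loses the arrows), the 0.25 ns tick is hypothetical, register file access time is ignored, and every operation gets its own FU so resource conflicts never arise.

```python
import math

CLOCK_NS = 3.5
TICK_NS = 0.25                                # hypothetical tick length
TICKS_PER_CYCLE = round(CLOCK_NS / TICK_NS)
DELAY_NS = {"ADD": 1.74, "SUB": 1.74, "LSL": 1.36, "LSR": 1.36}

# op id -> (opcode, predecessor op ids). Opcodes come from the slide;
# the edges are an ASSUMPTION made for this illustration.
DFG = {
    0: ("SUB", []),
    1: ("ADD", []),
    2: ("ADD", [0, 1]),
    3: ("LSR", [2]),
    4: ("LSL", [3]),
    5: ("ADD", [4]),
}


def ticks(opcode):
    """Operation delay rounded up to whole ticks."""
    return math.ceil(DELAY_NS[opcode] / TICK_NS - 1e-9)


def schedule(dfg, fuse):
    """Return {op: cycle}. Ops are visited in id order, which is a valid
    topological order for the graph above."""
    finish = {}     # op -> tick at which its result is available
    cycle_of = {}
    for op in sorted(dfg):
        opcode, preds = dfg[op]
        ready = max((finish[p] for p in preds), default=0)
        if not fuse:
            # Conventional: a result is only visible at the next clock edge.
            ready = math.ceil(ready / TICKS_PER_CYCLE) * TICKS_PER_CYCLE
        length = ticks(opcode)
        start = ready
        # Clock boundary constraint: an op may not straddle a clock edge.
        if start // TICKS_PER_CYCLE != (start + length - 1) // TICKS_PER_CYCLE:
            start = (start // TICKS_PER_CYCLE + 1) * TICKS_PER_CYCLE
        finish[op] = start + length
        cycle_of[op] = start // TICKS_PER_CYCLE
    return cycle_of


def total_cycles(cycle_of):
    return max(cycle_of.values()) + 1


conventional = schedule(DFG, fuse=False)
fused = schedule(DFG, fuse=True)
print(conventional, total_cycles(conventional))
print(fused, total_cycles(fused))
```

With the assumed edges, the conventional pass places one dependent operation per cycle and the fused pass packs OP 0 to OP 2 into the first cycle and OP 3 and OP 4 into the second, matching the 5-cycle and 3-cycle schedule tables.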
Experimental Setup
Benchmarks
multimedia applications for embedded systems
Audio decoding (AAC)
Video decoding (H.264)
3D graphics (3D)
Two designs
baseline : 4x4 heterogeneous CGRA
express : 4x4 heterogeneous CGRA with bypass network
Performance Enhancement
[Chart: execution time (millions of cycles, axis 0 to 180) for baseline and express on aac, h.264, and 3d, broken down into loop(dependency), loop(resource), and acyclic.]
Express achieves a 7-17% reduction in execution time.
Most of the reduction comes from the acyclic code region.
Express also improves the performance of resource-constrained loops.
The bypass network gives the compiler more freedom.
Detailed Result for 3D Graphics
Target application: 3D graphics
Performance enhancement: 17% faster than the baseline
Power consumption: 3% higher than the baseline
Energy consumption: 15% more efficient
                        baseline   express   ratio
power (mW)              298.26     306.78    102.86%
# of cycles (million)   156.81     130.22    83.04%
energy (mJ)             233.85     199.74    85.42%
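The three ratios are tied together by energy = power x time. The sketch below recomputes them from the power and cycle figures, assuming both designs run at the same clock frequency (an assumption; the slide does not say so). Under that assumption the frequency cancels and the energy ratio is simply the product of the power and cycle ratios.

```python
# Figures copied from the 3D graphics result table.
POWER_MW = {"baseline": 298.26, "express": 306.78}
CYCLES_M = {"baseline": 156.81, "express": 130.22}


def ratio(values):
    """express / baseline for one metric."""
    return values["express"] / values["baseline"]


power_ratio = ratio(POWER_MW)
cycle_ratio = ratio(CYCLES_M)
# Energy = power x time and time = cycles / frequency, so with an unchanged
# clock frequency (assumed) the frequency cancels out of the ratio.
energy_ratio = power_ratio * cycle_ratio

print(f"power  {power_ratio:.2%}")
print(f"cycles {cycle_ratio:.2%}")
print(f"energy {energy_ratio:.2%}")
```

The product lands within rounding of the energy ratio reported in the table, which is consistent with the 3% power increase being more than offset by the 17% cycle reduction.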
Conclusion
Acyclic region becomes the performance bottleneck.
The run-time for loops decreases by large factors.
Dynamic operation fusion executes back-to-back operations in a single cycle.
Bypass network
Tick-based scheduler
Up to 17% faster and 15% more energy efficient with 3% hardware overhead
Questions?