CGRA Express: Accelerating Execution using Dynamic Operation Fusion
Yongjun Park, Hyunchul Park, Scott Mahlke
CCCP Research Group, University of Michigan

Coarse-Grained Reconfigurable Architecture (CGRA)

Array of PEs connected in a mesh-like interconnect
High throughput with a large number of resources
Distributed hardware offers low cost/power consumption
High flexibility with dynamic reconfiguration
University of Michigan
Electrical Engineering and Computer Science

CGRA: Attractive Alternative to ASICs

Suitable for running multimedia applications for future embedded systems
  - High throughput, low power consumption, high flexibility

Morphosys: 8x8 array with RISC processor (Viterbi at 80Mbps)
SiliconHive: hierarchical systolic array (h.264 at 30fps)
ADRES: 4x4 array with tightly coupled VLIW (50-60 MOps/mW)

Performance Bottleneck: Acyclic Code

[Figure: an application's blocks (Block 0 to Block 5, loop and sequential) under a normal schedule and under software pipelining. Original: loop region dominant. Software pipeline: acyclic region dominant.]
[Chart: execution time (M cycles, axis 0 to 250) for aac, 3d, and h.264.]

Acyclic region is substantial!
It's time to optimize acyclic code.

Key Idea: Chaining Instructions

1. Clock period: longest operation with register file access.
   - Critical path: Slow!
   - Non-critical path: Fast!
2. CGRA is not VLIW. Register file access is not frequent!
3. Opportunity of instruction chaining.
4. Considerable register access time ≈ Arithmetic operation delay

Operation delays (3.5ns clock period @ IBM 90nm)

Group            Opcode           Delay(ns)
Multi cycle op   MUL, LD, ST      1.65
Arith            ADD, SUB         1.74
Shift            LSL, LSR, ASR    1.36
Comp             EQ, NE, LT       0.93
Logic            AND, OR, XOR     0.73
RF Access                         1.61
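
The delay figures above suggest a quick feasibility check for chaining. The following is a minimal Python sketch, not the authors' method. It assumes that the delay of a chain is simply the sum of the per-group delays in the table, and it ignores register file access, bypass multiplexers, and wire delay. Multi-cycle ops are assumed not to be chained. The names `DELAY_NS`, `chain_delay`, and `fits_in_cycle` are illustrative only.

```python
# Per-group delays in ns, copied from the delay table (IBM 90nm).
# ASSUMPTION: multi-cycle ops (MUL, LD, ST) are not chained, so they are omitted.
CLOCK_PERIOD_NS = 3.5

GROUP_DELAY_NS = {
    ("ADD", "SUB"): 1.74,
    ("LSL", "LSR", "ASR"): 1.36,
    ("EQ", "NE", "LT"): 0.93,
    ("AND", "OR", "XOR"): 0.73,
}
# Flatten the groups into a per-opcode lookup.
DELAY_NS = {op: d for ops, d in GROUP_DELAY_NS.items() for op in ops}


def chain_delay(ops):
    """Sum of the delays of back-to-back dependent operations.

    ASSUMPTION: RF access, bypass mux, and wire delays are ignored.
    """
    return sum(DELAY_NS[op] for op in ops)


def fits_in_cycle(ops):
    """True if the whole chain completes within one clock period."""
    return chain_delay(ops) <= CLOCK_PERIOD_NS
```

Under these assumptions, ADD followed by LSR needs 3.10 ns and fits in one 3.5 ns cycle, while ADD, ADD, AND needs 4.21 ns and does not.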

Dynamic Operation Fusion

Execute multiple dependent operations in one cycle
Key benefits
  1. Minimal hardware overhead
  2. Multiple subgraphs can be executed simultaneously.
  3. Dynamic merging of FUs
Assumption: instruction time = RF read time = RF write time

[Figure: 4x4 CGRA running the operation Add512r10 (labels: A, B, MUL, LD, ADD, 512, 10, LSR, Out), comparing the current schedule with operation fusion; cycle-count labels: "Current", "fusion", "13 Cycle".]

Hardware Support

Simple bypass network
Small overhead: 3.8% (SRAM), 2.3% (MUX)

              baseline  modified  overhead(%)
control bit   845       877       3.8
area (mm^2)   1.447     1.48      2.3

Compiler Support

Tick-based scheduling
  - Tick: small time unit based on hardware delay information
  - Clock cycle = # of ticks
Clock boundary constraint checking
  - Resource conflict
  - Time conflict
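
The slide does not spell out how these checks work, so the following Python sketch is one plausible reading rather than the authors' scheduler. It assumes a hypothetical tick of 0.25 ns (so a 3.5 ns cycle is 14 ticks), treats a time conflict as an operation that would straddle a clock boundary, and treats a resource conflict as two operations claiming the same FU in the same cycle. All names are illustrative.

```python
TICK_PS = 250                          # ASSUMPTION: hypothetical tick of 0.25 ns
CLOCK_PS = 3500                        # 3.5 ns clock period
TICKS_PER_CYCLE = CLOCK_PS // TICK_PS  # 14 ticks per clock cycle


def to_ticks(delay_ps):
    """Round a hardware delay (in picoseconds) up to whole ticks."""
    return -(-delay_ps // TICK_PS)


def time_conflict(start_tick, delay_ticks):
    """True if the operation would cross a clock boundary."""
    first_cycle = start_tick // TICKS_PER_CYCLE
    last_cycle = (start_tick + delay_ticks - 1) // TICKS_PER_CYCLE
    return first_cycle != last_cycle


def resource_conflict(busy, fu, start_tick):
    """True if `fu` already executes an operation in that clock cycle."""
    return (fu, start_tick // TICKS_PER_CYCLE) in busy


def place(busy, fu, start_tick, delay_ps):
    """Try to place an operation; return its finish tick, or None on conflict."""
    delay_ticks = to_ticks(delay_ps)
    if time_conflict(start_tick, delay_ticks):
        return None
    if resource_conflict(busy, fu, start_tick):
        return None
    busy.add((fu, start_tick // TICKS_PER_CYCLE))
    return start_tick + delay_ticks
```

With these numbers an ADD (1.74 ns) rounds up to 7 ticks, so two dependent ADDs on different FUs fill one 14-tick cycle exactly. An LSR (6 ticks) starting at tick 10 would cross the boundary and is rejected.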

Dynamic Operation Fusion Example (1)

1. Conventional Scheduling: 5 cycles

[Figure: data-flow graph with SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5); inputs RF[0], RF[1], and constants; output RF[2].]

Schedule Table (FU0 to FU5)

Time  Operations
0     OP 0, OP 1
1     OP 2
2     OP 3
3     OP 4
4     OP 5

[Figure: CGRA mapping of OP 0 to OP 5 and the register file.]

Dynamic Operation Fusion Example (2)

2. Dynamic Operation Fusion: 3 cycles

[Figure: data-flow graph with SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5); inputs RF[0], RF[1], and constants; output RF[2].]

Schedule Table (FU0 to FU5; labels: RF, const)

Time  Operations
0     OP 0, OP 1, OP 2
1     OP 3, OP 4
2     OP 5

[Figure: CGRA mapping of OP 0 to OP 5 and the register file.]
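
The cycle counts of the two example schedules (5 and 3) can be reproduced with a small Python sketch. It is an illustration under stated assumptions, not the authors' scheduler. The dependence edges in `PREDS` are a guess, since the data-flow graph itself is not recoverable from the slide text. Delays come from the opcode delay table, and register file access time, bypass delay, and FU availability are ignored.

```python
CLOCK = 350   # 3.5 ns clock period, in units of 0.01 ns

# Delays from the opcode delay table, in units of 0.01 ns.
DELAY = {"SUB": 174, "ADD": 174, "LSR": 136, "LSL": 136}

# Operations of the example, indexed as on the slide.
OPS = {0: "SUB", 1: "ADD", 2: "ADD", 3: "LSR", 4: "LSL", 5: "ADD"}

# ASSUMPTION: the dependence edges are a guess that is consistent
# with both schedule tables; the slide text does not preserve them.
PREDS = {2: [0, 1], 3: [2], 4: [3], 5: [4]}


def schedule(fuse):
    """Return {op: cycle} for an ASAP schedule with unlimited FUs."""
    finish, cycle_of = {}, {}
    for op in sorted(OPS):  # op ids are already in dependence order
        d = DELAY[OPS[op]]
        start = max((finish[p] for p in PREDS.get(op, [])), default=0)
        if not fuse:
            # Conventional: a result is usable only from the next clock edge.
            start = -(-start // CLOCK) * CLOCK
        if start // CLOCK != (start + d - 1) // CLOCK:
            # The op would straddle a clock boundary: defer to the next cycle.
            start = (start // CLOCK + 1) * CLOCK
        finish[op] = start + d
        cycle_of[op] = start // CLOCK
    return cycle_of


def num_cycles(fuse):
    """Number of clock cycles used by the schedule."""
    return max(schedule(fuse).values()) + 1
```

With these assumptions `num_cycles(False)` gives 5 and `num_cycles(True)` gives 3. In the fused case OP 0, OP 1, and OP 2 share cycle 0, OP 3 and OP 4 share cycle 1, and OP 5 runs in cycle 2.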

Experimental Setup

Benchmarks
  - Multimedia applications for embedded systems
  - Audio decoding (AAC)
  - Video decoding (H.264)
  - 3D graphics (3D)
Two designs
  - baseline: 4x4 heterogeneous CGRA
  - express: 4x4 heterogeneous CGRA with bypass network

Performance Enhancement

[Chart: execution time (millions of cycles, axis 0 to 180) for baseline and express on aac, h.264, and 3d; each bar is split into loop (dependency), loop (resource), and acyclic.]

Express achieves a 7-17% reduction in execution time.
  - Most of the reduction comes from the acyclic code region.
Express also improves the performance of resource-constrained loops.
  - The bypass network gives more freedom to the compiler.

Detailed Result for 3D Graphics

Target application
  - 3D graphics
Power consumption
  - 3% higher than the baseline
Performance enhancement
  - 17% faster than the baseline
Energy consumption
  - 15% more efficient

                        baseline  express  ratio
power (mW)              298.26    306.78   102.86%
# of cycles (million)   156.81    130.22   83.04%
energy (mJ)             233.85    199.74   85.42%
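
The energy row follows from the other two rows if both designs run at the same clock frequency. That is an assumption here, since the slide does not state it. In that case energy scales with power times cycle count. A small Python check using the reported power, cycle, and energy numbers:

```python
# Reported values for the 3D graphics benchmark: (baseline, express)
power_mw = (298.26, 306.78)
cycles_m = (156.81, 130.22)
energy_mj = (233.85, 199.74)


def ratio(pair):
    """Express value relative to the baseline value."""
    baseline, express = pair
    return express / baseline


power_ratio = ratio(power_mw)   # about 1.0286
cycle_ratio = ratio(cycles_m)   # about 0.8304

# ASSUMPTION: same clock frequency for both designs, so
# energy ratio = power ratio * cycle ratio.
predicted_energy_ratio = power_ratio * cycle_ratio

print(f"power  {power_ratio:.2%}")
print(f"cycles {cycle_ratio:.2%}")
print(f"energy {predicted_energy_ratio:.2%} (reported {ratio(energy_mj):.2%})")
```

The product of the two ratios agrees with the reported energy ratio to within rounding: a 3% power increase is outweighed by a 17% cut in cycles.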

Conclusion

The acyclic region becomes the performance bottleneck.
  - The run-time for loops decreases by large factors.
Dynamic operation fusion executes back-to-back operations in a single cycle.
  - Bypass network
  - Tick-based scheduler
Up to 17% faster and 15% more energy efficient with 3% hardware overhead

Questions?