Compiler-directed Synthesis of Multifunction Loop Accelerators


Modulo Graph Embedding : Mapping Applications onto Coarse-Grained Reconfigurable Architectures
Hyunchul Park, Kevin Fan, Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan
University of Michigan
Electrical Engineering and Computer Science
Coarse-Grained Reconfigurable Architecture (CGRA)
[Figure: CGRA block diagram; labeled components: Config, FU, LRF]
• Array of PEs connected in a mesh-like interconnect
• Characterized by array size, node functionalities, interconnect, register file configurations
• Execute compute-intensive kernels in multimedia applications
CGRA : Attractive Alternative to ASICs
• Suitable for running multimedia applications on embedded systems
– High computation throughput
– Low power consumption and scalability
– High flexibility with fast configuration
• MorphoSys : 8x8 array with RISC processor
– SIMD style execution of loops
• PipeRench : 1-D reconfigurable hardware
– Virtualize hardware pipeline
• ADRES : 8x8 array with tightly coupled VLIW
– Modulo scheduling with simulated annealing
Scheduling in CGRA
• Different from conventional VLIW
– Sparse interconnect and distributed register files
– No dedicated routing resources
• Need a good compiler to exploit the abundance of computing resources
[Figure: Conventional VLIW (FU0 to FU3 sharing a Central RF) vs. CGRA (FU0 to FU3, each with its own LRF)]
Objectives of This Work
• Modulo scheduling technique for CGRAs
– Exploit loop-level parallelism by overlapping execution of iterations
• Targeting low-cost CGRAs
– Achieve high-quality schedules under hardware restrictions
• Fast compilation time
Modulo Scheduling Basics
• Expose loop-level parallelism by overlapping execution of iterations
• Initiation interval (II)
– A new iteration is started every II cycles
[Figure: Overlapped Execution. Successive iterations of a loop body with operations A, B, C start II cycles apart and overlap in time.]
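As a rough illustration of the overlap, the sketch below starts a new iteration every II cycles and lists what executes in each cycle. The function name `overlapped_schedule`, the three-operation loop body, and the one-cycle latency per operation are assumptions made for this example, not details taken from the slides.

```python
def overlapped_schedule(ops, ii, iterations):
    """Return {cycle: [(op, iteration), ...]} when iteration i starts at cycle i * ii.

    Assumes each op takes one cycle and op k of an iteration runs k cycles
    after that iteration starts.
    """
    table = {}
    for it in range(iterations):
        for offset, op in enumerate(ops):
            table.setdefault(it * ii + offset, []).append((op, it))
    return table


schedule = overlapped_schedule(["A", "B", "C"], ii=1, iterations=3)
for cycle in sorted(schedule):
    print(cycle, schedule[cycle])
```

With II = 1, cycle 2 holds C of the first iteration, B of the second, and A of the third, which is the steady state where all three iterations overlap.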
Modulo Scheduling for CGRA
• Mapping DFG onto 3-D scheduling space
• Limited number of scheduling slots : (number of PEs) x II
– Minimize routing cost (number of slots used for routing)
• Sparse interconnect and distributed register files
– Ensure routability of operands
[Figure: a DFG mapped onto the Scheduling Space of a 4x4 CGRA, with time as the third dimension and a depth of II cycles]
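The slot limit can be made concrete with a small reservation table in which a slot is identified by a PE and a cycle taken modulo II. This is only a sketch: the class name `ModuloReservationTable` and its methods are hypothetical, and PEs are flattened to a single index for brevity.

```python
class ModuloReservationTable:
    """Tracks which (PE, cycle mod II) slots are taken in a modulo schedule."""

    def __init__(self, num_pes, ii):
        self.num_pes = num_pes
        self.ii = ii
        self.slots = {}  # (pe, cycle % ii) -> name of the op holding the slot

    def capacity(self):
        # Total scheduling slots: (number of PEs) x II
        return self.num_pes * self.ii

    def reserve(self, pe, cycle, op):
        """Claim a slot; return False if it is already used by another op."""
        key = (pe, cycle % self.ii)
        if key in self.slots:
            return False
        self.slots[key] = op
        return True


mrt = ModuloReservationTable(num_pes=16, ii=4)  # 4x4 array, II = 4
print(mrt.capacity())
print(mrt.reserve(0, 1, "add"), mrt.reserve(0, 5, "mul"))
```

Cycles 1 and 5 collide on the same PE because both fold to cycle 1 when II = 4, so every slot spent on routing is a slot that computation cannot use.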
Our Approach
• Systematic approach to generate a good schedule in reasonable time
• Minimize routing cost
– Convert scheduling problem into graph embedding
– Leverage graph embedding algorithm
• Ensure routability of operands
– Skewed scheduling space
– Create a narrow, but tall scheduling space
1 : Minimize Routing Cost
• Routing cost : number of PEs used for routing
• Determined by positions of producer and consumer
– Minimize distance between producers and consumers
• Height-based list scheduling
– Schedule operations in the order of dependence height
– Place consumers close to producers
• Need to carefully place operations at the same height
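One way to obtain the scheduling order is sketched below. It assumes that the dependence height of an operation is the number of operations on the longest path from it to a sink, and that taller operations are scheduled first; the DFG is a hypothetical example, not the one from the slides.

```python
from functools import lru_cache


def dependence_heights(consumers):
    """Map each op to its height: ops on the longest path from it to a sink."""

    @lru_cache(maxsize=None)
    def height(op):
        return 1 + max((height(c) for c in consumers[op]), default=0)

    return {op: height(op) for op in consumers}


# Hypothetical DFG: op -> list of ops that consume its result.
dfg = {0: [4], 1: [4], 2: [5], 3: [5], 4: [6], 5: [6], 6: []}
heights = dependence_heights(dfg)

# Tallest first, so producers are placed before their consumers.
order = sorted(dfg, key=lambda op: (-heights[op], op))
print(heights)
print(order)
```

Operations 0 to 3 all end up at the same height here, which is the case where the order alone does not say which of them should sit next to each other.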
Scheduling Example – Routing Cost
[Figure: a DFG (operations 0-6) scheduled onto a 1x4 CGRA (PE 0 to PE 3) in two ways. One placement inserts routing operations 4’ and 5’ (Routing Cost = 2); the other needs none (Routing Cost = 0).]
• Common consumer information is important!
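The effect of placement on routing cost can be sketched with a simplified cost model. The model assumes a 1xN array in which a PE can feed itself or an adjacent PE directly and every additional hop occupies one PE as a routing operation; the function `routing_cost` and the three-operation example are hypothetical and ignore time.

```python
def routing_cost(placement, edges):
    """Count PEs used purely to forward values on a 1xN array.

    placement maps op -> PE index; edges is a list of (producer, consumer).
    """
    return sum(
        max(0, abs(placement[src] - placement[dst]) - 1) for src, dst in edges
    )


# Ops "a" and "b" share the common consumer "c".
edges = [("a", "c"), ("b", "c")]
spread = {"a": 0, "b": 3, "c": 1}  # producers placed far apart
packed = {"a": 0, "b": 1, "c": 1}  # producers placed next to each other

print(routing_cost(spread, edges))
print(routing_cost(packed, edges))
```

Placing the two producers of a common consumer side by side removes the routing operation that the spread placement needs.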
Affinity Graph Heuristic
• Consider placement of operations with the same height together
– Use common consumer information
• Affinity value between operations
– Measured by the distance to common consumers in the DFG
• Construct affinity graph
– Nodes : operations, edges : affinity values
• Place operations with affinity edges close to each other
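A minimal sketch of building such a graph follows. The slides do not give the affinity formula, so the weighting used here, 2^(max_dist - d) for a common consumer reached within d edges by both operations, is an assumption chosen only so that closer common consumers count more; the helper names and the DFG are hypothetical.

```python
from collections import deque
from itertools import combinations


def distances(consumers, src, max_dist):
    """BFS distance from src to every op reachable within max_dist edges."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        op = queue.popleft()
        if dist[op] == max_dist:
            continue
        for c in consumers[op]:
            if c not in dist:
                dist[c] = dist[op] + 1
                queue.append(c)
    del dist[src]
    return dist


def affinity_graph(consumers, ops, max_dist=2):
    """Return {(a, b): affinity} for pairs of ops that share a consumer."""
    reach = {op: distances(consumers, op, max_dist) for op in ops}
    edges = {}
    for a, b in combinations(ops, 2):
        common = reach[a].keys() & reach[b].keys()
        value = sum(
            2 ** (max_dist - max(reach[a][c], reach[b][c])) for c in common
        )
        if value:
            edges[(a, b)] = value
    return edges


# Hypothetical DFG: op -> list of ops that consume its result.
dfg = {0: [4], 1: [4], 2: [5], 3: [5], 4: [6], 5: [6], 6: []}
affinity = affinity_graph(dfg, ops=[0, 1, 2, 3])
print(affinity)
```

Pairs that feed the same operation directly, such as (0, 1), get a larger value than pairs that only meet further down the DFG, such as (0, 2).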
Affinity Graph Example
[Figure: Drawing affinity graph onto scheduling space. A DFG with operations 0-5 at heights 3, 2, and 1, its Affinity Graph, and a Good and a Bad mapping onto a 2x4 CGRA.]
Leveraging Graph Embedding
• Graph embedding
– Drawing a graph onto a target space
• Grid layout algorithm by Li & Kurata
– Embed complicated biochemical networks onto 2-D grid space
– Simulated annealing
• Our scheduling problem is a graph embedding problem
– Draw affinity graph onto scheduling space minimizing edge length
[Figure: Process Flow of Grid Layout [Li 2005]]
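To make the embedding idea concrete, here is a generic annealing-style sketch that places weighted graph nodes on a grid and tries to shorten the total weighted Manhattan edge length. It is not the Li & Kurata algorithm and not the authors' scheduler; the function names, the cooling schedule, and the example graph are all assumptions.

```python
import math
import random


def edge_length(pos, edges):
    """Total weighted Manhattan length of all edges for a given placement."""
    return sum(
        w * (abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1]))
        for (a, b), w in edges.items()
    )


def embed(nodes, edges, rows, cols, steps=2000, seed=0):
    """Place nodes on a rows x cols grid by randomly swapping cell contents."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # layout[i] is the node sitting at cells[i], or None for an empty cell.
    layout = list(nodes) + [None] * (len(cells) - len(nodes))
    rng.shuffle(layout)

    def positions(lay):
        return {n: cells[i] for i, n in enumerate(lay) if n is not None}

    cost = edge_length(positions(layout), edges)
    best, best_cost = positions(layout), cost
    temp = 1.0
    for _ in range(steps):
        i, j = rng.sample(range(len(cells)), 2)
        layout[i], layout[j] = layout[j], layout[i]
        new_cost = edge_length(positions(layout), edges)
        # Accept improvements always, and worse layouts with a probability
        # that shrinks as the temperature drops.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = positions(layout), cost
        else:
            layout[i], layout[j] = layout[j], layout[i]  # reject: undo swap
        temp = max(temp * 0.995, 1e-3)
    return best, best_cost


# Hypothetical affinity graph on four operations, embedded on a 2x4 grid.
graph = {(0, 1): 3, (2, 3): 3, (0, 2): 1, (1, 3): 1}
placement, cost = embed([0, 1, 2, 3], graph, rows=2, cols=4)
print(placement, cost)
```

Heavier edges pull their endpoints together more strongly, which mirrors the goal of placing operations with high affinity close to each other.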
2 : Ensure Routability of Operands
• Resources are repeatedly used every II cycles
– Routing can fail due to previously scheduled operations
• Backtracking : hard to make forward progress for CGRA
• Take a preventive approach
[Figure: a DFG (operations 0-7) modulo scheduled onto a 1x3 CGRA (PE 0 to PE 2) with initiation interval II. Routing failed for Op 7!]
Skewed Scheduling Space
[Figure: operations 0-7 scheduled onto PE 0 to PE 2 in a skewed scheduling space]
• Should prevent routing failures in advance
• Skew scheduling space
– Staggering down to the right
• Create a narrow, but tall scheduling space
– Operations can be routed to the right
• Dynamically adjust scheduling space
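The staggering can be pictured with a small helper that lists the slots of one scheduling level. It assumes a skew of one cycle per PE column and folds time modulo II; the function name `skewed_slots` and these parameter choices are hypothetical, and the dynamic adjustment mentioned above is not modeled.

```python
def skewed_slots(base_time, num_pes, ii, skew=1):
    """Slots (pe, cycle mod II) for one level, offset by skew * pe cycles."""
    return [(pe, (base_time + skew * pe) % ii) for pe in range(num_pes)]


print(skewed_slots(0, num_pes=3, ii=4))
print(skewed_slots(3, num_pes=3, ii=4))
```

Under this assumption a value produced on PE p at cycle t can move one PE to the right per cycle and arrive exactly when the next slot of the same level becomes available, which is why operands can still be routed to the right.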
System Flow
Experimental Setup
• Twelve innermost loop kernels from various domains
• Three designs with different RF configurations
– Evaluate the impact of register file sharing
[Figure: the three designs: Dedicated RF, Shared RF, Central RF]
Evaluation of Affinity Heuristic
[Figure: bar chart of Routing Cost (0 to 60), affinity vs. greedy, for blowfish, channel, dct, fft, fir, fsed, sharp, sobel, viterbi]
• Results of acyclic scheduling
• Average of 59% reduction in routing cost
Modulo Graph Embedding vs. Simulated Annealing
[Figure: bar chart of utilization (0 to 1), MGE vs. SA, for blowfish, channel, dct, fft, fir, fsed, iir, sharp, sobel, viterbi, idct, dequant, and the average]
• Utilization = (# slots used for computation) / (# total slots)
• Compilation time : MGE (~ 5 sec) vs. SA (5 min ~ 3 hours)
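A small worked example of the utilization metric, taking the total number of slots to be (number of PEs) x II as in the scheduling-space definition. The 4x4 array, II of 4, and 40 computation slots are made-up numbers, not results from the talk.

```python
def utilization(num_pes, ii, compute_slots):
    """Fraction of scheduling slots used for computation rather than routing."""
    return compute_slots / (num_pes * ii)


# Hypothetical case: 4x4 array (16 PEs), II = 4, 40 slots doing computation.
print(utilization(num_pes=16, ii=4, compute_slots=40))
```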
Impact of Register File Configurations
Conclusions
• Modulo scheduler targeting low-cost CGRAs
– Provide high computation throughput, scalability, power efficiency
• Two heuristics to generate a good schedule
– Affinity graph heuristic
– Skewed scheduling space
– Average utilizations of 56-68% for three designs
• Systematic approach allows fast compilation time
– All benchmarks finished within 5s
Questions ?