Transcript (PowerPoint slides)

A General Constraint-centric
Scheduling Framework for
Spatial Architectures
Tony Nowatzki (UW-Madison)
Michael Sartin-Tarm (UW-Madison)
Lorenzo De Carli (UW-Madison)
Karthikeyan Sankaralingam (UW-Madison)
Cristian Estan (Broadcom Corp.)
Behnam Robatmili (Qualcomm Research)
A spatial architecture is a type of microprocessor
hardware in which the organization of functional
units, the interconnection network, or storage is
exposed to the software.
• Regions of computation are mapped to hardware units
• Alternative paradigm to instruction-wise pipelined execution
• Pioneering examples: RAW, Wavescalar, TRIPS
• Recent proposals: Tartan, CCA, PLUG, FlexCore, SoftHV,
MESCAL, SPL, C-Cores, DySER, BERET
The fundamental scheduling problem shared by all
spatial architectures is mapping computation to
hardware resources.
Why is this hard?
[Figure, built up over several slides: a program sub-region (could be a basic
block, hyperblock, PDG, or other) such as LD R1; LD R2; MUL R3,R1,R2; ST R3
is mapped instruction by instruction onto the hardware substrate. Each
placement constrains where dependent instructions can go, and some candidate
placements (marked x) are infeasible.]
We have contributed a general spatial architecture
scheduler that uses a declarative approach.
Existing heuristic approaches are popular and effective,
but inflexible.
In general:
• They are architecture-specific.
• They suffer from poor developer/architect productivity.
• They lack insight into optimality.
Existing declarative approaches do not cover
enough responsibilities.
• More on this later…
Our mathematical scheduling framework resolves
these issues.
• It is comprehensive.
• It is usable for a diverse set of spatial architectures.
• It is not complex to implement.
Our model uses an Integer-Linear Programming
(ILP) solver.
• A single objective is specified for minimization.
• The objective is a function of a system of linear equations.
• There are guaranteed bounds on optimality.
[Toolchain figure: Source Code → Compiler → Software DAGs;
Software DAGs + Hardware Graph → ILP Scheduler → Assembler → Binary]
Executive Summary
• We have implemented a declarative general spatial architecture
scheduler in ILP.
• It has been applied to 3 diverse spatial architectures.
• Provides competitive performance to native schedulers.
• Released open-source, and examples are available online.
• http://research.cs.wisc.edu/vertical/ilp-scheduler
• Our tool is the basis of a formulative scheduling guide.
• Constraint Centric Scheduling Guide, ACM SIGARCH Comp. Arch News, May 2013
• An interactive demo will be online as part of a synthesis lecture.
• Optimization and Mathematical Modeling in Computer Architecture
• TRIPS (IEEE Computer ‘04, ASPLOS ‘09): general-purpose microprocessor
• DySER (HPCA ‘11, IEEE Micro ‘12): spatial accelerator
• PLUG (SIGCOMM ‘09, PACT ‘10): network-lookup engine
Sets
• Software: edges (e ∈ E) connect vertices (v ∈ V).
• Hardware: links (l ∈ L) connect nodes (n ∈ N) and routers (r ∈ R).
Mappings
• MVN(V, N): vertices to nodes
• MEL(E, L): edges to links
Our framework consists of 5 main responsibilities.
1. Placement of Computation
2. Routing of Data
3. Latency Management
4. Resource Management
5. Optimization Objective

Key Contributions
- Developing scheduling in terms of these responsibilities
- Encoding responsibilities in terms of ILP constraints
1. Placement of Computation (hardware mapping)
Each vertex v must be mapped to exactly one compatible node n:

∀v: ∑_{n | C(v,n)=1} MVN(v,n) = 1
∀v, ∀n | C(v,n)=0: MVN(v,n) = 0

[Figure: e.g. an ADD vertex may be placed on a + node but not on a × node.]
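The placement constraints above can be sketched as a validity check on a candidate 0/1 mapping. This is a minimal illustration, not the talk's GAMS formulation; the vertex and node names (add1, plus0, ...) are made up.

```python
# Check the placement constraints against a compatibility table C:
#   forall v: sum over {n | C(v,n)=1} of Mvn(v,n) == 1
#   forall v, n with C(v,n)=0: Mvn(v,n) == 0
def placement_ok(vertices, nodes, C, Mvn):
    for v in vertices:
        # every vertex sits on exactly one compatible node
        if sum(Mvn.get((v, n), 0) for n in nodes if C.get((v, n), 0) == 1) != 1:
            return False
        # no vertex sits on an incompatible node
        if any(C.get((v, n), 0) == 0 and Mvn.get((v, n), 0) == 1 for n in nodes):
            return False
    return True

vertices = ["add1", "mul1"]
nodes = ["plus0", "times0"]
C = {("add1", "plus0"): 1, ("mul1", "times0"): 1}     # ADD on +, MUL on x
good = {("add1", "plus0"): 1, ("mul1", "times0"): 1}
bad = {("add1", "times0"): 1, ("mul1", "times0"): 1}  # ADD on an x node
print(placement_ok(vertices, nodes, C, good))  # True
print(placement_ok(vertices, nodes, C, bad))   # False
```

In the real framework the solver searches over all feasible MVN assignments rather than checking a given one.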
2. Routing of Data (network mapping)
• An incoming edge of a vertex must be mapped to an incoming link of its node.
• An outgoing edge of a vertex must be mapped to an outgoing link of its node.
• If an edge is routed through a router, it must be mapped to both an incoming
and an outgoing link of that router.
[Figure: routes from the ADD and MUL nodes traced through the network.]
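The router rule can be sketched as a flow-continuity check over the set of links an edge is mapped to. The router and link names here are illustrative, not from the talk.

```python
# Sketch of the router constraint: if an edge uses any incoming link of
# a router, it must also use one of that router's outgoing links (and
# vice versa), so the edge's route through the network is continuous.
def routing_ok(edge_links, routers_in, routers_out):
    for r in routers_in:
        uses_in = any(l in edge_links for l in routers_in[r])
        uses_out = any(l in edge_links for l in routers_out[r])
        if uses_in != uses_out:
            return False
    return True

routers_in = {"r1": ["n0->r1"], "r2": ["r1->r2"]}
routers_out = {"r1": ["r1->r2"], "r2": ["r2->n5"]}
print(routing_ok({"n0->r1", "r1->r2", "r2->n5"}, routers_in, routers_out))  # True
print(routing_ok({"n0->r1"}, routers_in, routers_out))                      # False: dead-ends at r1
```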
3. Latency Management (timing)
• The execution delay of a vertex is the maximum delay of its incoming edges.
• The routing delay of an edge is equal to the number of links it is mapped to.
• The overall latency of a mapping is the maximum latency of any vertex.
[Figure: worked example with an ADD (+2 cycles) and a MUL (+4 cycles); inputs
at time 0 accumulate routing and execution delays, giving an overall latency
of 12.]
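The timing rules can be sketched as a forward pass over the DFG in topological order. The graph below is a reconstruction echoing the slide's ADD (+2) / MUL (+4) figure; the exact edge routings are assumed, not taken from the talk.

```python
# Forward pass over a DFG given in topological order. A vertex's
# arrival time is the max over its incoming edges of (predecessor
# arrival + predecessor execution delay + routing delay, i.e. the
# number of links the edge is mapped to).
def arrival_times(dfg, exec_delay, links_used):
    T = {}
    for v, preds in dfg:  # dfg must be topologically ordered
        T[v] = max((T[u] + exec_delay[u] + links_used[(u, v)] for u in preds),
                   default=0)
    return T

dfg = [("in1", []), ("in2", []), ("add", ["in1"]),
       ("mul", ["add", "in2"]), ("out", ["mul"])]
exec_delay = {"in1": 0, "in2": 0, "add": 2, "mul": 4, "out": 0}
links_used = {("in1", "add"): 2, ("add", "mul"): 2,
              ("in2", "mul"): 3, ("mul", "out"): 2}
T = arrival_times(dfg, exec_delay, links_used)
print(T["out"])  # overall latency: 12
```

In the ILP model these are linear inequalities over per-vertex time variables rather than an explicit traversal.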
4. Resource Management (utilization)
• The utilization of a node is the sum of all vertices mapped to it.
• The utilization of a link is the sum of all edges mapped to it.
• The overall utilization of a mapping is the maximum utilization of any node
or link.
[Figure: e.g. LD R1 and LD R2 mapped to the same node give it a utilization
of 2; the remaining nodes and links have utilization 1 or 2.]
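The utilization rules reduce to counting over the two mappings. A minimal sketch, with a hypothetical mapping in which two loads share one node as on the slide:

```python
# Node utilization = number of vertices mapped to the node; link
# utilization = number of edges mapped to the link; overall
# utilization is the maximum count over all nodes and links.
from collections import Counter

def overall_utilization(Mvn, Mel):
    node_util = Counter(n for (v, n), bit in Mvn.items() if bit)
    link_util = Counter(l for (e, l), bit in Mel.items() if bit)
    return max([*node_util.values(), *link_util.values()], default=0)

Mvn = {("ld_r1", "n0"): 1, ("ld_r2", "n0"): 1,   # two loads share n0
       ("mul", "n1"): 1, ("st_r3", "n2"): 1}
Mel = {("e0", "l0"): 1, ("e1", "l0"): 1, ("e2", "l1"): 1}
print(overall_utilization(Mvn, Mel))  # 2
```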
5. Optimization Objective (final performance target!)
• The overall latency of a mapping is the maximum latency of any vertex:
  ∀v ∈ V_out: T(v) ≤ LAT
• The overall utilization of a mapping is the maximum utilization of any node
or link:
  ∀n ∈ N: U(n) ≤ SVC
  ∀l ∈ L: U(l) ≤ SVC
• The optimization target is a function balancing latency and utilization.
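The talk does not spell out the exact balancing function, so the sketch below uses a simple weighted sum of the two bound variables LAT and SVC; the weight alpha is illustrative only, not the framework's actual objective.

```python
# LAT upper-bounds the latency of every output vertex; SVC upper-bounds
# the utilization of every node and link. The solver minimizes some
# function of the two; this weighted sum is an assumed stand-in.
def objective(T_out, node_util, link_util, alpha=1.0):
    LAT = max(T_out.values())
    SVC = max([*node_util.values(), *link_util.values()])
    return LAT + alpha * SVC

print(objective({"out": 12}, {"n0": 2, "n1": 1}, {"l0": 2}))  # 14.0
```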
Results: Feasibility
• Spatial scheduling can be modeled as a system of linear equations.
• The ILP-based approach is concise.
• Our implementation is ~50 lines of GAMS* code for each arch.
*GAMS is an algebraic modeling language for optimization; it can express ILP among other constraint theories.
Results: Performance (TRIPS)
The ILP scheduler is competitive with SPS, the native TRIPS scheduler:
• Better on 22 of 43 benchmarks (geometric mean: +2.9%)
• Worse on 21 of 43 benchmarks (geometric mean: -1.9%)
• Small slowdowns and speedups relative to SPS are primarily due to dynamic events
• The 3 worst benchmarks are 5.4%, 6.0%, and 13.2% worse, primarily due to a
lack of cache-bank knowledge
- using TRIPS performance benchmarks: microbenchmarks, EEMBC
Results: Performance (DySER)
The ILP scheduler outperforms the specialized scheduler on all benchmarks
(geometric mean: +9.2%)
• Individual block latencies are reduced by 38% on average
• Long dependency chains lead to significant speedups for the ILP scheduler
- using DySER performance benchmarks: PARBOIL
Results: Performance (PLUG)
The ILP scheduler matches or outperforms hand-generated schedules on all
benchmarks (geometric mean: +2.7%)
• PLUG DFGs are inherently more complex
- using PLUG network-lookup benchmarks
Related Work

Year  Technique              Approach
1950  ILP machine scheduler  M-Job-DAG to N-resource scheduling
1992  ILP for VLIW           Modulo scheduling
2002  ILP for RAW            M-Job-DAG to N-resource scheduling, assuming
                             fixed network delays
2008  SMT for PLA            Strict communication and computation requirements

Prior techniques do not model the routing and utilization responsibilities.