How to Turn the
Technological Constraint Problem
into a Control Policy Problem
Using Slack
Brian Fields
Rastislav Bodík
Mark D. Hill
University of Wisconsin-Madison
The Problem: Managing constraints
Technological constraints dominate memory design
Constraint: Memory latency
Design: Cache hierarchy
Non-uniformity: Load latencies
Policy: What to replace?
The Problem: Managing constraints
In the future, technological constraints will also
dominate microprocessor design
Constraint   | Design          | Non-uniformity  | Policy
Wires        | Clusters        | Bypasses        | ?
Power        | Fast/slow ALUs  | Exe. latencies  | ?
Complexity   | Grid, ILDP      | L1 latencies    | ?
Policy Goal: Minimize effect of lower-quality resources
Key Insight: Control policy crucial
With non-uniform machines,
the technological constraint problem becomes
a control policy problem
Key Insight: Control policy crucial
The best possible policy:
imposes delays only on instructions that can absorb them, so that execution time is not increased
Achieved through slack:
The amount an instruction can be delayed
without increasing execution time
Contributions/Outline
Understanding (how to measure slack in a simulator?)
• determining slack: resource constraints important
• reporting slack: apportion to individual instructions
• analysis: suggest nonuniform machines to build
Predicting (how to predict slack in hardware?)
• simple, delay and observe approach works well
Case study (how to design a control policy?)
• on power-efficient machine, up to 20% speedup
Determining slack: Why hard?
Microprocessors are complex:
Sometimes slack is determined by resources (e.g. ROB)
"Probe the processor" approach: delay and observe
1. Delay a dynamic instruction by n cycles
2. See if execution time increased
   a) No: increase n, restart, and go to step 1
Srinivasan and Lebeck (MICRO '98): an approximation for loads
• heuristics to predict whether execution time increased
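As a concrete illustration, here is a minimal Python sketch of the delay-and-observe probe. The hook run_with_delay(inst, n) is hypothetical: assume it re-runs the program with one dynamic instruction delayed by n cycles and returns total execution time.

```python
# Minimal sketch of "delay and observe" (simulator-side).
# run_with_delay is a hypothetical simulator hook, not a real API.

def measure_slack(inst, baseline_cycles, run_with_delay, max_n=100):
    """Largest n such that delaying `inst` by n cycles leaves
    total execution time unchanged."""
    slack = 0
    n = 1
    while n <= max_n:
        if run_with_delay(inst, n) > baseline_cycles:
            break           # the delay became visible; stop probing
        slack = n           # no slowdown observed; try a larger delay
        n += 1
    return slack
```

Note that each probe is a separate full run of the program, which is what makes the dependence-graph alternative on the next slide attractive.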
Determining slack
Alternative approach: dependence-graph analysis
1. Build a resource-sensitive dependence graph
2. Analyze it to find slack
But how to build a resource-sensitive graph?
Casmira and Grunwald's solution (Kool Chips Workshop '00):
graphs only over instructions in the issue window
Data-Dependence Graph
[Figure: example data-dependence graph with edge latencies; slack = 0 cycles]
Our Dependence Graph Model (ISCA '01)
[Figure: the same instructions modeled with fetch (F), execute (E), and commit (C) nodes; slack = 0 cycles]
Our Dependence Graph Model (ISCA '01)
[Figure: the F/E/C graph annotated with edge latencies; with resource edges modeled, slack = 6 cycles]
Modeling resources increases observable slack
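To make the graph concrete, below is a minimal Python sketch of a resource-sensitive F/E/C graph. The record fields (exec_latency, src_producers), the edge latencies, and the single ROB edge are illustrative assumptions; the full ISCA '01 model has more edge types.

```python
# Sketch: resource-sensitive dependence graph with F/E/C nodes.
# Edge latencies and the instruction record format are illustrative
# placeholders, not the exact ISCA '01 construction.
import collections

def build_graph(insts, rob_size=128, frontend_depth=12):
    g = collections.defaultdict(list)               # node -> [(succ, latency)]
    for i, inst in enumerate(insts):
        g[('F', i)].append((('E', i), frontend_depth))     # fetch to execute
        g[('E', i)].append((('C', i), inst.exec_latency))  # execute to commit
        if i > 0:
            g[('F', i - 1)].append((('F', i), 1))          # in-order fetch
            g[('C', i - 1)].append((('C', i), 1))          # in-order commit
        for j in inst.src_producers:                       # data dependences
            g[('E', j)].append((('E', i), insts[j].exec_latency))
        if i >= rob_size:                                  # finite ROB:
            g[('C', i - rob_size)].append((('F', i), 1))   # fetch stalls
    return g
```

The ROB edge is what makes the graph "resource-sensitive": it is exactly the kind of resource constraint that a pure data-dependence graph misses.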
Reporting slack
Global slack:
# cycles a dynamic operation can be delayed without increasing execution time
Apportioned slack:
Distribute global slack among operations using an apportioning strategy
[Figure: example graph annotated with global slack (GS) and apportioned slack (AS) values, e.g. GS = 15, AS = 10]
Slack measurements (Perl)
Machine: 6-wide out-of-order superscalar, 128-entry issue window, 12-stage pipeline
[Chart: percent of dynamic instructions (y-axis) vs. number of cycles of slack (x-axis, 1 to 100); successive builds add the global and apportioned slack curves]
Analysis via apportioning strategy
What non-uniform designs can slack tolerate?
Design: Fast/slow ALUs
Non-uniformity: Exe. latency
Apportioning strategy: Double latency
Good news: 80% of dynamic instructions can have their latency doubled
Contributions/Outline
Understanding (how to measure slack in a simulator?)
• determining slack: resource constraints important
• reporting slack: apportion to individual instructions
• analysis: suggest nonuniform machines to build
Predicting (how to predict slack in hardware?)
• simple, delay and observe approach works well
Case study (how to design a control policy?)
• on power-efficient machine, up to 20% speedup
Measuring slack in hardware: delay and observe
Goal:
Determine whether a static instruction has n cycles of slack
1. Delay a dynamic instance by n cycles
2. Check if it was critical (via the critical-path analyzer, ISCA '01):
   a) No: the instruction has n cycles of slack
   b) Yes: the instruction does not have n cycles of slack
Two predictor designs
1. Explicit slack predictor
   • retry delay and observe with different values of slack
   • problem: obtaining unperturbed measurements
2. Implicit slack predictor
   • delay and observe with the machine's natural non-uniform delays
   • "bin" instructions to match the non-uniform hardware
Contributions/Outline
Understanding (how to measure slack in a simulator?)
• determining slack: resource constraints important
• reporting slack: apportion to individual instructions
• analysis: suggest nonuniform machines to build
Predicting (how to predict slack in hardware?)
• simple, delay and observe approach works well
Case study (how to design a control policy?)
• on power-efficient machine, up to 20% speedup
Fast/slow pipeline microarchitecture
(save ~37% core power)
[Diagram: Fetch + Rename feeds a Steer stage, which routes instructions to either a fast, 3-wide pipeline or a slow, 3-wide pipeline, each with its own window (WIN), register file, and ALUs; the two pipelines share a data cache and a cross-domain bypass bus]
Design has three non-uniformities:
• higher execution latencies
• increased (cross-domain) bypass latency
• decreased effective issue bandwidth
Selecting bins for implicit slack predictor
Two decisions:
1. Steer to the fast or slow pipeline, then
2. Schedule with high or low priority within a pipeline
Use the implicit slack predictor with four (2^2) bins:

                 Steer: Fast | Steer: Slow
Schedule: High        1      |      3
Schedule: Low         2      |      4
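A sketch of how the four bins map back onto the two hardware decisions; the bin-to-decision encoding follows the table above:

```python
def decode_bin(slack_bin):
    """Bin 1 -> (fast, high), 2 -> (fast, low),
       bin 3 -> (slow, high), 4 -> (slow, low)."""
    steer = 'fast' if slack_bin in (1, 2) else 'slow'
    priority = 'high' if slack_bin in (1, 3) else 'low'
    return steer, priority
```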
Putting it all together
Prediction path: the PC indexes a 4 KB slack prediction table, which supplies a slack bin # to the fast/slow pipeline core.
Training path: the criticality analyzer (~1 KB) drives a 4-bin slack state machine that updates the table.
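A sketch of the two paths, assuming 2-bit bin entries (4 KB = 16K entries) and a one-step-at-a-time update rule; the real 4-bin state machine and the exact ordering of bins by delay tolerance are simplified here.

```python
TABLE_ENTRIES = 16 * 1024     # 4 KB of 2-bit bin numbers (an assumption)

class SlackPredictor:
    """PC-indexed slack bins; assumes bins ordered from 1 (least delay
    tolerated: fast/high) to 4 (most delay tolerated: slow/low)."""
    def __init__(self):
        self.table = [1] * TABLE_ENTRIES   # conservative default: no slack

    def predict(self, pc):
        return self.table[pc % TABLE_ENTRIES]     # slack bin #

    def train(self, pc, was_critical):
        """Training path: the criticality analyzer reports whether the
        naturally delayed instance turned out critical."""
        i = pc % TABLE_ENTRIES
        if was_critical:
            self.table[i] = max(1, self.table[i] - 1)  # back off
        else:
            self.table[i] = min(4, self.table[i] + 1)  # try more delay
```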
Fast/slow pipeline performance
[Chart: normalized IPC for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and the average, comparing 2 fast, high-power pipelines, the slack-based policy, and reg-dep steering]
Slack used up
[Chart: average global slack per dynamic instruction (cycles) per benchmark, comparing 2 fast, high-power pipelines, the slack-based policy, and reg-dep steering]
Conclusion: Future processor design flow
Future processors will be non-uniform.
A slack-based policy can control them.
1. Measure slack in a simulator
   • decide early on what designs to build
2. Predict slack in hardware
   • simple implementation
3. Design a control policy
   • map policy decisions to slack bins
Backup slides
Define local slack
Local slack:
# cycles an edge latency can be increased without delaying subsequent instructions
[Figure: example graph; individual edges can absorb 1-2 cycles of delay]
In real programs, ~20% of insts have local slack of at least 5 cycles
Compute local slack
Compute arrival times with a forward pass over the graph; then, for each edge,
local slack = arrival(consumer) - (arrival(producer) + edge latency)
[Figure: the example graph annotated with arrival times]
In real programs, ~20% of insts have local slack of at least 5 cycles
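A minimal sketch of this computation, using the node -> [(succ, latency)] graph representation from the earlier sketch; `order` is assumed to be a topological order of the nodes.

```python
def arrival_times(g, order):
    """Forward pass: arrival[v] = max over incoming edges (u, v) of
    arrival[u] + latency; `order` must be topological."""
    arrival = {v: 0 for v in order}
    for u in order:
        for v, lat in g.get(u, []):
            arrival[v] = max(arrival[v], arrival[u] + lat)
    return arrival

def local_slack(u, v, lat, arrival):
    # cycles edge (u, v) can be stretched without delaying node v
    return arrival[v] - (arrival[u] + lat)
```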
Define global slack
Global slack:
# cycles an edge latency can be increased without delaying the last instruction in the program
[Figure: the example graph; edges can absorb 1-2 cycles of delay globally]
In real programs, >90% of insts have global slack of at least 5 cycles
Compute global slack
Backward-propagate, accumulating local slacks:
GS6 = LS6 = 0
GS5 = LS5 = 2
GS3 = GS6 + LS3 = 1
GS1 = MIN(GS3, GS5) + LS1 = 2
(local slacks: LS1 = 1, LS2 = 0, LS3 = 1, LS5 = 2)
In real programs, >90% of insts have global slack of at least 5 cycles
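The backward pass follows directly from the recurrence GS(u) = min over successors v of GS(v) + LS(u, v). A sketch, reusing the arrival times from the forward pass above:

```python
def global_slack(g, order, arrival):
    """Backward pass in reverse topological order; nodes with no
    successors (the last instruction) get global slack 0."""
    gs = {}
    for u in reversed(order):
        succs = g.get(u, [])
        if not succs:
            gs[u] = 0
        else:
            # GS(u) = min over (u, v) of GS(v) + local slack of the edge
            gs[u] = min(gs[v] + arrival[v] - (arrival[u] + lat)
                        for v, lat in succs)
    return gs
```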
Apportioned slack
Goal:
Distribute slack to the instructions that need it
Thus, the apportioning strategy depends on the nature of the machine's non-uniformities, e.g.:
non-uniformity: two-speed bypass buses (1 cycle, 2 cycles)
strategy: give 1 cycle of slack to as many edges as possible
Define apportioned slack
Apportioned slack:
Distribute global slack among edges
For example:
GS1 = 2, AS1 = 1
GS2 = 1, AS2 = 1
GS3 = 1, AS3 = 0
GS5 = 2, AS5 = 1
In real programs, >75% of insts can be apportioned slack of at least 5 cycles
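One plausible greedy realization of a "fixed amount to as many edges as possible" strategy (not necessarily the paper's exact algorithm): try to grant every edge a fixed amount of extra latency and keep the grant only if total execution time stays flat. finish_time recomputes the whole schedule per trial, so this is a clarity-over-speed sketch.

```python
def finish_time(g, order, extra):
    """Completion time of the last node when each edge (u, v) carries
    extra.get((u, v), 0) additional cycles of latency."""
    arr = {v: 0 for v in order}
    for u in order:
        for v, lat in g.get(u, []):
            arr[v] = max(arr[v], arr[u] + lat + extra.get((u, v), 0))
    return arr[order[-1]]

def apportion_fixed(g, order, edges, amount):
    """Greedily give `amount` cycles of apportioned slack to as many
    edges as possible without stretching execution time."""
    base = finish_time(g, order, {})
    extra, granted = {}, []
    for e in edges:                      # edges: list of (u, v) pairs
        extra[e] = extra.get(e, 0) + amount
        if finish_time(g, order, extra) == base:
            granted.append(e)            # the grant was absorbed
        else:
            extra[e] -= amount           # revoke: it delayed the program
    return granted
```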
Slack measurements
[Chart: percent of dynamic instructions vs. number of cycles of slack (1 to 100), showing global, apportioned, and local slack curves]
Multi-speed ALUs
Can we tolerate ALUs running at half frequency?
Yes, but:
1. For all types of operations? (needed for multi-speed clusters)
2. Can we make all integer ops double latency?
Load slack
Can we tolerate a long-latency L1 hit?
design: wire-constrained machine, e.g. Grid
non-uniformity: multi-latency L1
apportioning strategy: apportion ALL slack to load instructions
Apportion all slack to loads
[Chart: percent of dynamic loads vs. number of cycles of slack on load instructions (0 to 100), for perl, gcc, and gzip]
Most loads can tolerate an L2 cache hit
Multi-speed ALUs
Can we tolerate ALUs running at half frequency?
design: fast/slow ALUs
non-uniformity: multi-latency execution, bypass
apportioning strategy: give slack equal to the original latency + 1
Latency+1 apportioning
[Chart: percent of dynamic instructions per benchmark (ammp through vortex, plus average) that can be apportioned latency+1 cycles of slack]
Most instructions can tolerate doubling their latency
Breakdown by operation (Latency+1 apportioning)
[Chart: per-benchmark breakdown of slackful vs. non-slackful dynamic instructions, split into loads, stores, int ops, and flt ops]
Validation
Two steps:
1. Increase instruction latencies by their apportioned slack, for three apportioning strategies:
   1) latency+1
   2) 5 cycles to as many instructions as possible
   3) 12 cycles to as many loads as possible
2. Compare to the baseline (no delays inserted)
Validation
[Chart: execution time with delays inserted, as a percent of the baseline, for the three strategies across benchmarks]
Worst case: inaccuracy of 0.6%
Predicting slack
Two steps to PC-indexed, history-based prediction:
1. Measure the slack of a dynamic instruction
   Need: ability to measure the slack of a dynamic instruction
2. Store it in an array indexed by the PC of the static instruction
   Need: locality of slack
   • can capture 80% of the potential exploitable slack
Locality of slack experiment
For each static instruction:
1. Measure the % of slackful dynamic instances
2. Multiply by the # of dynamic instances
3. Sum across all static instructions
4. Compare to the total slackful dynamic instructions (the ideal case)
(slackful = has enough apportioned slack to double its latency)
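A sketch of the experiment's arithmetic, assuming a trace of (pc, is_slackful) pairs; the threshold parameter is an assumption that models the 90%/95%/100% series in the chart that follows.

```python
from collections import defaultdict

def locality_of_slack(trace, threshold=1.0):
    """trace: iterable of (pc, is_slackful) pairs. A static instruction's
    slackful instances count as captured only if at least `threshold`
    of its dynamic instances are slackful (e.g. 0.9, 0.95, 1.0)."""
    total, slackful = defaultdict(int), defaultdict(int)
    for pc, ok in trace:
        total[pc] += 1
        slackful[pc] += bool(ok)
    captured = sum(slackful[pc] for pc in total
                   if slackful[pc] / total[pc] >= threshold)
    ideal = sum(slackful.values())      # every slackful instance captured
    return captured / ideal if ideal else 0.0
```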
Locality of slack
[Chart: percent of (weighted) static instructions per benchmark, comparing the ideal case against predictors that require 90%, 95%, or 100% of a static instruction's dynamic instances to be slackful]
PC-indexed, history-based predictor can capture most of the available slack
Predicting slack
Two steps to PC-indexed, history-based prediction:
1. Measure the slack of a dynamic instruction
   Need: ability to measure the slack of a dynamic instruction
2. Store it in an array indexed by the PC of the static instruction
   Need: locality of slack
   • can capture 80% of the potential exploitable slack
Measuring slack in hardware: delay and observe
Goal:
Determine whether a static instruction has n cycles of slack
1. Delay a dynamic instance by n cycles
2. Check if it was critical (via the critical-path analyzer):
   a) No: the instruction has n cycles of slack
   b) Yes: the instruction does not have n cycles of slack
Review: Critical-path analyzer (ISCA '01)
[Figure: dependence graph with edge latencies illustrating token propagation]
• Don't need to measure latencies; just observe last-arriving edges
• Plant a token and propagate it forward
• If the token survives, the node is critical
• If the token dies, the node is noncritical
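A software-terms sketch of the token-passing idea, reusing the graph and arrival-time representation from the earlier sketches; the hardware observes last-arriving edges directly rather than recomputing arrival times.

```python
def is_critical(g, order, arrival, start):
    """Plant a token at `start`, propagate it only along last-arriving
    edges, and report whether it survives to the end of the graph."""
    alive = {start}
    for u in order:
        if u not in alive:
            continue
        for v, lat in g.get(u, []):
            if arrival[u] + lat == arrival[v]:   # last-arriving edge
                alive.add(v)                     # the token survives
    return order[-1] in alive                    # survived to the end?
```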
Baseline policies (existing, not based on slack)
1. Simple register-dependence steering (reg-dep)
Send to the fast cluster until:
2. the window is half full (fast-first win)
3. too many insts are ready (fast-first rdy)
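A sketch of these three heuristics as a steering function; the cluster attributes and the READY_LIMIT threshold are hypothetical placeholders, since the slide gives no exact parameters.

```python
READY_LIMIT = 8    # hypothetical threshold for fast-first rdy

def steer(policy, inst, fast, slow):
    """fast/slow are hypothetical cluster objects with window_occupancy,
    window_size, ready_count, and holds_producer_of(inst)."""
    if policy == 'reg-dep':
        # keep an instruction with the cluster holding its producer
        return fast if fast.holds_producer_of(inst) else slow
    if policy == 'fast-first-win':
        return fast if fast.window_occupancy < fast.window_size // 2 else slow
    if policy == 'fast-first-rdy':
        return fast if fast.ready_count < READY_LIMIT else slow
    raise ValueError(policy)
```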
Baseline policies (existing, not based on slack)
[Chart: normalized IPC per benchmark for 2 fast clusters, register-dependence steering, fast-first window, and fast-first ready]
Slack-based policies
[Chart: normalized IPC per benchmark for 2 fast clusters, token-passing slack, ALOLD slack, and reg-dep steering]
10% better performance from hiding non-uniformities
Extra slow cluster (still save ~25% core power)
[Chart: normalized IPC per benchmark for 2 fast clusters, token-passing slack, ALOLD slack, and the best existing policy]