How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack
Brian Fields, Rastislav Bodík, Mark D. Hill
University of Wisconsin-Madison

The Problem: Managing constraints

Technological constraints dominate memory design.
Constraint: memory latency
Design: cache hierarchy
Non-uniformity: load latencies
Policy: what to replace?

The Problem: Managing constraints

In the future, technological constraints will also dominate microprocessor design.

    Constraint     Design            Non-uniformity    Policy
    Wires          Clusters          Bypasses          ?
    Power          Fast/slow ALUs    Exe. latencies    ?
    Complexity     Grid, ILDP        L1 latencies      ?

Policy goal: minimize the effect of lower-quality resources.

Key Insight: Control policy crucial

With non-uniform machines, the technological constraint problem becomes a control policy problem.

The best possible policy imposes delays only on instructions for which execution time is not increased. This is achieved through slack: the amount an instruction can be delayed without increasing execution time.

Contributions/Outline

Understanding (can we measure slack in a simulator?)
• determining slack: resource constraints are important
• reporting slack: apportion it to individual instructions
• analysis: suggest non-uniform machines to build
Predicting (how can we predict slack in hardware?)
• a simple delay-and-observe approach works well
Case study (how do we design a control policy?)
• on a power-efficient machine, up to 20% speedup

Determining slack: Why hard?

Microprocessors are complex: sometimes slack is determined by resources (e.g., the ROB).

"Probe the processor" approach: delay and observe (a simulator-style sketch of this probe loop appears at the end of this section):
1. Delay a dynamic instruction by n cycles.
2. See whether execution time increased.
   a) No: increase n, restart, and go to step 1.

Srinivasan and Lebeck's approximation for loads (MICRO '98): heuristics to predict whether execution time would increase.

Determining slack

Alternative approach: dependence-graph analysis.
1. Build a resource-sensitive dependence graph.
2. Analyze it to find slack.
But how do we build a resource-sensitive graph? Casmira and Grunwald's solution (Kool Chips Workshop '00): build graphs containing only the instructions in the issue window.

Data-Dependence Graph

[Figure: data-dependence graph with edge latencies of 1-3 cycles; slack = 0 cycles]

Our Dependence Graph Model (ISCA '01)

[Figure, built up over two slides: the same instructions modeled with F (fetch), E (execute), and C (commit) nodes per instruction and weighted edges for both dependences and resources; slack = 6 cycles]
Modeling resources increases observable slack.

Reporting slack

Global slack: the number of cycles a dynamic operation can be delayed without increasing execution time.
Apportioned slack: distribute the global slack among operations using an apportioning strategy.
[Figure: example graph in which two operations each have global slack GS = 15, but apportioning grants one AS = 10 and the other AS = 5]

Slack measurements (Perl)

Machine: 6-wide out-of-order superscalar, 128-entry issue window, 12-stage pipeline.
[Graph, built up over three slides: percent of dynamic instructions vs. number of cycles of slack (1 to 100), with curves for global and apportioned slack]

Analysis via apportioning strategy

What non-uniform designs can slack tolerate?
Design: fast/slow ALUs
Non-uniformity: execution latency
Apportioning strategy: double each instruction's latency
Good news: 80% of dynamic instructions can have their latency doubled.
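To make the "probe the processor" loop concrete, here is a minimal simulator-side sketch in Python. The run_with_delay hook is hypothetical: any cycle-accurate simulator callback that re-runs the program with one dynamic instruction delayed by n cycles and returns the total cycle count would do. Doubling the probe delay is one reasonable way to "increase n and restart"; the talk does not prescribe a schedule.

    def measure_slack(inst, run_with_delay, max_delay=100):
        """Return the largest probe delay n (in cycles) that `inst`
        tolerated without increasing total execution time."""
        baseline = run_with_delay(inst, 0)          # unperturbed run
        slack, n = 0, 1
        while n <= max_delay:
            if run_with_delay(inst, n) > baseline:  # execution time grew:
                break                               # inst has < n cycles of slack
            slack = n                               # inst tolerated n cycles
            n *= 2                                  # grow the probe and restart
        return slack  # a finer search between slack and the failing n would tighten this

Each probe costs a full simulation run, which is one reason the talk turns to dependence-graph analysis in the simulator and to an implicit predictor in hardware.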
Contributions/Outline

(Outline slide repeated; moving on to Predicting.)

Measuring slack in hardware: delay and observe

Goal: determine whether a static instruction has n cycles of slack.
1. Delay a dynamic instance by n cycles.
2. Check whether it became critical (via the critical-path analyzer, ISCA '01):
   a) No: the instruction has n cycles of slack.
   b) Yes: the instruction does not have n cycles of slack.

Two predictor designs

1. Explicit slack predictor: retry delay-and-observe with different values of slack.
   Problem: obtaining unperturbed measurements.
2. Implicit slack predictor: delay and observe using the machine's natural non-uniform delays, and "bin" instructions to match the non-uniform hardware.

Contributions/Outline

(Outline slide repeated; moving on to the Case study.)

Fast/slow pipeline microarchitecture

[Diagram: Fetch + Rename feeds a Steer stage, which feeds two 3-wide pipelines, fast and slow, each with its own issue window (WIN), register file, and ALUs; the pipelines share a data cache and are connected by a bypass bus. The design saves ~37% core power.]

The design has three non-uniformities:
• higher execution latencies
• increased (cross-domain) bypass latency
• decreased effective issue bandwidth

Selecting bins for the implicit slack predictor

Two decisions:
1. Steer to the fast or slow pipeline, then
2. Schedule with high or low priority within a pipeline.

Use the implicit slack predictor with four (2²) bins:

                      Steer: Fast    Steer: Slow
    Schedule: High         1              3
    Schedule: Low          2              4

Putting it all together

Prediction path: the PC indexes a 4 KB slack prediction table, which supplies a slack bin number to the fast/slow pipeline core.
Training path: the criticality analyzer (~1 KB) drives a 4-bin slack state machine, which updates the prediction table.
(A software sketch of this predictor follows the conclusion slide.)

Fast/slow pipeline performance

[Bar chart: normalized IPC for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and the average, comparing 2 fast high-power pipelines, the slack-based policy, and reg-dep steering]

Slack used up

[Bar chart, built up over two slides: average cycles of global slack per dynamic instruction for each benchmark, comparing 2 fast high-power pipelines, the slack-based policy, and reg-dep steering]

Conclusion: Future processor design flow

Future processors will be non-uniform. A slack-based policy can control them.
1. Measure slack in a simulator: decide early on what designs to build.
2. Predict slack in hardware: simple implementation.
3. Design a control policy: map policy decisions onto slack bins.
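The prediction and training paths above fit in a few lines of software. The sketch below assumes a direct-mapped, PC-indexed table and an illustrative training rule (step one bin slower after a noncritical observation; reset to the fastest bin after a critical one); the bin numbering follows the 2x2 table from the bin-selection slide, but the paper's exact state machine and table organization are not reproduced here.

    # Bins 1-4 from the slide: {fast, slow} pipeline x {high, low} priority.
    FAST_HIGH, FAST_LOW, SLOW_HIGH, SLOW_LOW = 0, 1, 2, 3

    class SlackPredictor:
        def __init__(self, entries=4096):
            self.entries = entries
            self.table = [FAST_HIGH] * entries       # start everything fast/high

        def _index(self, pc):
            return pc % self.entries                 # direct-mapped, PC-indexed

        def predict(self, pc):
            """Return (steer_to_slow, schedule_low_priority) for this PC."""
            b = self.table[self._index(pc)]
            return b >= SLOW_HIGH, b in (FAST_LOW, SLOW_LOW)

        def train(self, pc, was_critical):
            """Update from the criticality analyzer's verdict on a delayed instance."""
            i = self._index(pc)
            if was_critical:
                self.table[i] = FAST_HIGH            # no slack observed: back off
            elif self.table[i] < SLOW_LOW:
                self.table[i] += 1                   # tolerated the delay: go slower

    # Usage (hypothetical PC value):
    pred = SlackPredictor()
    steer_slow, low_prio = pred.predict(0x4004F0)
    pred.train(0x4004F0, was_critical=False)         # analyzer saw it noncritical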
Backup slides

Define local slack

Local slack: the number of cycles an edge's latency can be increased without delaying subsequent instructions.
[Figure: dependence graph with edge latencies of 1-3 cycles; three edges have local slacks of 2 cycles, 1 cycle, and 1 cycle]
In real programs, ~20% of instructions have a local slack of at least 5 cycles.

Compute local slack

Compute each node's arrival time with a forward pass; the local slack of an edge is then the consumer's arrival time minus (the producer's arrival time + the edge latency).
[Figure: the same graph annotated with per-node arrival times]

Define global slack

Global slack: the number of cycles an edge's latency can be increased without delaying the last instruction in the program.
[Figure: the same graph; the edges now show global slacks of 2 cycles, 2 cycles, and 1 cycle]
In real programs, >90% of instructions have a global slack of at least 5 cycles.

Compute global slack

Calculate global slack with a backward pass that accumulates local slacks (a runnable sketch of this two-pass computation follows the validation slides below):
GS1 = MIN(GS3, GS5) + LS1 = 2
GS3 = GS6 + LS3 = 1
GS5 = LS5 = 2
GS6 = LS6 = 0
(with local slacks LS1 = 1, LS2 = 0, LS3 = 1, LS5 = 2)

Apportioned slack

Goal: distribute slack to the instructions that need it. The apportioning strategy therefore depends on the nature of the non-uniformities in the machine.
Example non-uniformity: bypass buses of two speeds (1 cycle and 2 cycles). Strategy: give 1 cycle of slack to as many edges as possible.

Define apportioned slack

Apportioned slack: distribute the global slack among edges (a greedy apportioning sketch also follows the validation slides).
For example: GS1 = 2, AS1 = 1; GS2 = 1, AS2 = 1; GS5 = 2, AS5 = 1; GS3 = 1, AS3 = 0.
In real programs, >75% of instructions can be apportioned a slack of at least 5 cycles.

Slack measurements

[Graph: percent of dynamic instructions vs. number of cycles of slack (1 to 100), with curves for global, apportioned, and local slack]

Multi-speed ALUs

Can we tolerate ALUs running at half frequency? Yes, but:
1. For all types of operations? (needed for multi-speed clusters)
2. Can we make all integer ops double latency?

Load slack

Can we tolerate a long-latency L1 hit?
Design: a wire-constrained machine, e.g., Grid
Non-uniformity: multi-latency L1
Apportioning strategy: apportion ALL slack to load instructions

Apportion all slack to loads

[Graph: percent of dynamic loads vs. number of cycles of slack on load instructions (0 to 100), with curves for perl, gcc, and gzip]
Most loads can tolerate an L2 cache hit.

Multi-speed ALUs

Can we tolerate ALUs running at half frequency?
Design: fast/slow ALUs
Non-uniformity: multi-latency execution, bypass
Apportioning strategy: give each instruction slack equal to its original latency + 1

Latency+1 apportioning

[Bar chart: percent of dynamic instructions per benchmark (ammp, art, gcc, gzip, mesa, parser, perl, vortex, average)]
Most instructions can tolerate doubling their latency.

Breakdown by operation (latency+1 apportioning)

[Stacked bar chart: slackful vs. non-slackful percentages for loads, stores, integer ops, and float ops, per benchmark]

Validation

Two steps:
1. Increase instruction latencies by their apportioned slack, for three apportioning strategies: (1) latency+1, (2) 5 cycles to as many instructions as possible, (3) 12 cycles to as many loads as possible.
2. Compare to the baseline (no delays inserted).

[Bar chart: percent of execution time relative to the baseline for the three strategies, per benchmark]
Worst case: inaccuracy of 0.6%.
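The backup slides' two-pass slack computation can be written down directly. This sketch assumes a dependence DAG given as topologically ordered node ids plus a latency per edge; the backward pass mirrors the slide's recurrence GS = MIN(successor GS) + LS. It illustrates the arithmetic only, not the paper's resource-sensitive graph builder.

    def compute_slack(nodes, edges):
        """nodes: ids in topological order; edges: {(u, v): latency}."""
        succs = {n: [] for n in nodes}
        preds = {n: [] for n in nodes}
        for (u, v), lat in edges.items():
            succs[u].append((v, lat))
            preds[v].append((u, lat))

        # Forward pass: earliest arrival time of each node.
        arrival = {n: 0 for n in nodes}
        for v in nodes:
            for u, lat in preds[v]:
                arrival[v] = max(arrival[v], arrival[u] + lat)

        # Local slack of edge (u, v): latency it can absorb without
        # delaying v past its current arrival time.
        local = {(u, v): arrival[v] - (arrival[u] + lat)
                 for (u, v), lat in edges.items()}

        # Backward pass: an edge's global slack is its local slack plus
        # the smallest global slack among its consumer's outgoing edges.
        gslack = {}
        for v in reversed(nodes):
            out = [gslack[(v, w)] for w, _ in succs[v]]
            downstream = min(out) if out else 0      # 0 at the last instruction
            for u, _ in preds[v]:
                gslack[(u, v)] = local[(u, v)] + downstream
        return arrival, local, gslack

    # Tiny made-up example: the path through node 2 is off-critical.
    nodes = [1, 2, 3, 4]
    edges = {(1, 2): 1, (1, 3): 3, (2, 4): 1, (3, 4): 1}
    arrival, local, gslack = compute_slack(nodes, edges)
    # gslack[(1, 2)] == gslack[(2, 4)] == 2: both edges see the same 2 cycles,
    # but they cannot both spend them -- that is what apportioning decides.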
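And a greedy sketch of one apportioning strategy from these slides: give a fixed k cycles (e.g., k = 5) to as many edges as possible. An edge keeps the extra k cycles only if the program's completion time is unchanged; recomputing the finish time per candidate edge is the simple, quadratic way to check this, fine for illustration. The result depends on visiting order, which is one reason apportioning is a strategy rather than a unique answer.

    def finish_time(nodes, edges):
        """Completion time of the last instruction (nodes in topological order)."""
        arrival = {n: 0 for n in nodes}
        for v in nodes:
            for (u, w), lat in edges.items():
                if w == v:
                    arrival[v] = max(arrival[v], arrival[u] + lat)
        return max(arrival.values())

    def apportion_fixed(nodes, edges, k):
        """Greedily pick edges that can each absorb k extra cycles, all at
        once, without increasing total execution time."""
        current = dict(edges)
        baseline = finish_time(nodes, current)
        slackful = set()
        for e in edges:
            current[e] += k                          # tentatively slow this edge
            if finish_time(nodes, current) == baseline:
                slackful.add(e)                      # k cycles apportioned to e
            else:
                current[e] -= k                      # no room: undo
        return slackful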
Predicting slack

Two steps to PC-indexed, history-based prediction:
1. Measure the slack of a dynamic instruction. Need: the ability to measure slack of a dynamic instruction.
2. Store it in an array indexed by the PC of the static instruction. Need: locality of slack.
• This can capture 80% of the potential exploitable slack.

Locality of slack experiment

For each static instruction:
1. Measure the percentage of slackful dynamic instances.
2. Multiply by the number of dynamic instances.
3. Sum across all static instructions.
4. Compare to the total number of slackful dynamic instructions (the ideal case).
(slackful = has enough apportioned slack to double its latency)

Locality of slack

[Bar chart, built up over three slides: percent of (weighted) static instructions per benchmark when a static instruction is counted if 90%, 95%, or 100% of its dynamic instances are slackful, compared to the ideal]
A PC-indexed, history-based predictor can capture most of the available slack.

Predicting slack

(Slide repeated from above.)

Measuring slack in hardware: delay and observe

(Slide repeated from above: delay a dynamic instance by n cycles, then check criticality with the critical-path analyzer.)

Review: Critical-path analyzer (ISCA '01)

[Figure sequence over four slides: dependence graph with edge latencies]
• We don't need to measure latencies; just observe last-arriving edges.
• Plant a token and propagate it forward.
• If the token survives, the node is critical; if the token dies, the node is noncritical.
(A toy version of this token test closes the transcript.)

Baseline policies (existing, not based on slack)

1. Simple register-dependence steering (reg-dep).
Send to the fast cluster until:
2. the window is half full (fast-first win), or
3. there are too many ready instructions (fast-first rdy).

[Bar chart: normalized IPC per benchmark for 2 fast clusters, register dependence, fast-first window, and fast-first ready]

Slack-based policies

[Bar chart: normalized IPC per benchmark for 2 fast clusters, token-passing slack, ALOLD slack, and reg-dep steering]
10% better performance from hiding non-uniformities.

Extra slow cluster (still saves ~25% core power)

[Bar chart: normalized IPC per benchmark for 2 fast clusters, token-passing slack, ALOLD slack, and the best existing policy]
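Finally, a toy version of the token-passing criticality test reviewed above, under a strong simplification: each node records a single last-arriving input edge. Walking back from the last node along last-arriving edges yields the critical path, and a planted token "survives" exactly when its node lies on that walk. The real ISCA '01 analyzer propagates tokens forward in hardware and copes with ties; this sketch shows only the core idea.

    def critical_nodes(last_arriving_parent, last_node):
        """last_arriving_parent: {node: parent whose edge arrived last, or None}.
        Returns the set of nodes on the last-arriving chain (the critical path)."""
        critical = set()
        n = last_node
        while n is not None:
            critical.add(n)                          # a token planted here survives
            n = last_arriving_parent[n]
        return critical

    def token_survives(node, last_arriving_parent, last_node):
        return node in critical_nodes(last_arriving_parent, last_node)

    # Made-up example: node 2's result arrives early, so it is noncritical.
    parents = {0: None, 1: 0, 2: 0, 3: 1}
    assert critical_nodes(parents, last_node=3) == {0, 1, 3}
    assert not token_survives(2, parents, last_node=3)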