Performance-Aware Speculation Control using Wrong Path Usefulness Prediction

Chang Joo Lee, Hyesoon Kim*, Onur Mutlu**, Yale N. Patt
HPS Research Group, University of Texas at Austin
*School of Computer Science, Georgia Institute of Technology
**Microsoft Research
2015-11-06

Outline
- Motivation
- Mechanism
- Experimental Evaluation
- Conclusion

Fetch Gating (Pipeline Gating)
- Proposed by Manne et al. [ISCA98].
- Stops fetching instructions on the wrong path to save energy.
- Assumes wrong-path instructions do not contribute to performance and consume energy.
- Various fetch gating mechanisms: Baniasadi and Moshovos [ISLPED01], Karkhanis et al. [ISLPED02], Aragon et al. [HPCA03], Buyuktosunoglu et al. [GLSVLSI03], Collins et al. [MICRO04].

Limitations of Previous Mechanisms
- Hardware complexity
  - Branch confidence estimator, changes to critical/power-hungry structures.
  - Additional hardware can offset energy savings due to fetch gating.
- Assumption
  - Wrong-path execution consumes energy but is useless for performance.

Is Wrong Path Execution Really Useless?
[Figure: IPC and energy delta (%) with perfect fetch gating for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and AVG]
- Performance of most benchmarks increases with perfect fetch gating.
- mcf: Performance degrades by 30% and energy consumption increases by 15%.
- parser: Energy consumption decreases by 28% but performance degrades by 5%.

Why Does Performance Degrade with Perfect Fetch Gating?
[Figure: L2 cache fills (%) per benchmark, split into correct path fills, unused wrong path fills, and used wrong path fills, for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, and twolf; annotations: MPKI: 36.6, MPKI: 1.5]
- mcf: almost all of the wrong-path L2 fills are used; memory intensive (MPKI: 36.6); 30% performance degradation with perfect fetch gating.
- parser: 37% of fills are used wrong path fills, 14% are unused wrong path fills; 5% performance degradation with perfect fetch gating.
- Wrong path execution can prefetch useful data: Butler [Thesis93], Pierce and Mudge [IPPS94, MICRO96], Mutlu et al. [IEEE TC05].

Why Can Wrong Path Execution Be Useful?
[Figure: control-flow example from mcf. BB2 ends in the branch BR (taken / not-taken). Mispredicted path: BB3 (Load A, Load B, ..., JMP BB5), then BB5 (Load C, ...). Path after misprediction recovery: BB4 (Load A, Load B, ...), then BB5. Labels on the loads: L2 cache miss, cache hit.]
- From mcf.
- Hammock structure within a frequently executed loop.
- BR in BB2 is frequently mispredicted.
- Since memory latency is large, wrong path prefetching benefit can be significant.
- Taking into account wrong-path usefulness is important.

Our Solution: Performance-Aware Speculation Control
- Hardware complexity: simple, low cost fetch gating mechanism.
- Wrong-path usefulness: low cost Wrong Path Usefulness Predictor (WPUP).
- Fetch gate only when wrong path execution is useless.
[Figure: block diagram. Performance-Aware Speculation Control contains Fetch Gating and WPUP; signals between it and the Fetch Engine: Lookup, Useful, Gate Enable, Branch Count.]

Our Fetch Gating Mechanism
- Branch-count based mechanism
  - More branches -> higher chance of misprediction.
  - Fetch gate if (# of Branches) > Threshold.
- Mispredictions show phase behavior.
  - Threshold is determined by branch prediction accuracy for a certain period.
  - Higher accuracy -> higher threshold.
- No need for complex logic (e.g. confidence estimator).

Two WPUP Mechanisms
- Branch PC-based WPUP (fine grained)
- Phase-based WPUP (coarse grained)
- Can be combined with other fetch gating mechanisms.

Branch PC-based WPUP
- Basic idea
  - Identifies and records conditional branch PCs that lead to useful wrong-path memory references.
  - If the fetched branch is recorded as useful, do not fetch gate.

Branch PC-based WPUP Implementation
- Fetch Engine
  - Latest Branch PC Register (LBPC, 16 bits)
  - LBPC value carried through pipeline
- Miss Status Holding Registers (MSHR)
  - Branch ID field (BID, 10 bits)
    - Already used for branch misprediction recovery
  - Branch PC field (BPC, 16 bits)
  - Wrong Path field (WP, 1 bit)
- WPUP Cache
  - 4-way set-associative, no data store, LRU

Branch PC-Based WPUP (Training)
[Figure: the mcf example with BR 2 in BB2 at PC2 (LBPC: PC 2). Animation steps: BID 2 from branch unit; Load A in BB3 with PC 2 and BID 2; Load B in BB3 with PC 2 and BID 2; Load C in BB5 with PC 2 and BID 2; misprediction recovery; Load A in BB4.]

MSHR
Addr   BID   BPC   WP
A      2     PC2   0 -> 1
B      2     PC2   0 -> 1
C      2     PC2   0 -> 1

- L2 cache miss on the wrong path allocates the MSHR entries above.
- MSHR hit; wrong path was useful.
- BPC 2 is stored in WPUP cache.

Branch PC-Based WPUP (Prediction)
[Figure: the same example with LBPC: PC 2. On fetching BR 2 at PC2 the question is "Fetch gate?"; wrong-path execution would cover BB3 and BB5. The WPUP cache (Addr, LRU) contains PC2.]
- Hit; do not fetch gate.
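
The training and prediction steps of the branch PC-based WPUP can be sketched in software as follows. This is a minimal illustrative model, not the authors' hardware design: the class and method names, the number of cache sets (32), the modulo set index, and the rule that MSHR entries whose BID equals the mispredicted branch's BID are marked as wrong-path are all assumptions made for the example. Only the field widths, the 4-way tag-only LRU organization, and the overall flow come from the slides.

```python
from collections import OrderedDict
from dataclasses import dataclass

PC_MASK = 0xFFFF  # LBPC and BPC are 16-bit fields in the slides


class WPUPCache:
    """Tag-only (no data store), set-associative cache of branch PCs, LRU."""

    def __init__(self, num_sets=32, ways=4):  # num_sets is an assumed value
        self.num_sets = num_sets
        self.ways = ways
        # Each set is ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, pc):
        return self.sets[pc % self.num_sets]  # assumed index function

    def insert(self, pc):
        pc &= PC_MASK
        entries = self._set_for(pc)
        if pc in entries:
            entries.move_to_end(pc)  # refresh LRU position
        else:
            if len(entries) >= self.ways:
                entries.popitem(last=False)  # evict least recently used tag
            entries[pc] = None

    def lookup(self, pc):
        pc &= PC_MASK
        entries = self._set_for(pc)
        if pc in entries:
            entries.move_to_end(pc)
            return True
        return False


@dataclass
class MSHREntry:
    addr: int
    bid: int          # branch ID of the latest branch when the miss was issued
    bpc: int          # PC of that branch (copied from LBPC)
    wp: bool = False  # set once the branch is known to be mispredicted


class PCBasedWPUP:
    def __init__(self, cache=None):
        self.cache = cache if cache is not None else WPUPCache()
        self.mshr = {}  # addr -> MSHREntry; fill/deallocation is not modeled

    def on_l2_miss(self, addr, bid, lbpc):
        """Allocate an MSHR entry tagged with the latest branch's BID and PC."""
        if addr not in self.mshr:
            self.mshr[addr] = MSHREntry(addr, bid, lbpc & PC_MASK)

    def on_misprediction(self, bid):
        """Mark outstanding misses issued under the mispredicted branch."""
        for entry in self.mshr.values():
            if entry.bid == bid:  # assumption: exact BID match only
                entry.wp = True

    def on_correct_path_mshr_lookup(self, addr):
        """Training: a correct-path access hits a wrong-path MSHR entry."""
        entry = self.mshr.get(addr)
        if entry is not None and entry.wp:
            self.cache.insert(entry.bpc)  # wrong path was useful
            return True
        return False

    def do_not_gate(self, branch_pc):
        """Prediction: a WPUP cache hit means fetch should not be gated."""
        return self.cache.lookup(branch_pc)
```

Replaying the slide example with this model: a load to address A misses under branch PC2 with BID 2, the branch is found mispredicted, and Load A on the correct path then hits the marked MSHR entry, so a later fetch of the branch at PC2 hits in the WPUP cache and is not gated.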

Phase-based WPUP
- Basic idea
  - Predict if the current phase will provide useful wrong-path memory references.
  - If so, do not fetch gate.
[Figure: number of useful wrong-path references over time (x100K cycles)]

Phase-based WPUP Implementation
- Wrong Path Usefulness Counter (WPUC, 5 bits)
  - Incremented for each useful wrong-path memory reference
  - Reset periodically
- Do not fetch gate if WPUC > threshold.
- BPC fields or WPUP cache not needed.

Simulation Methodology
- Alpha ISA execution-driven simulator
- Baseline processor configuration
  - 2GHz, 8-wide issue, out-of-order, 128-entry ROB
  - Hybrid branch predictor (64K-entry gshare and 64K-entry PAs)
  - 11 stages (minimum branch misprediction penalty)
  - 1MB, 8-way unified L2 cache
  - 32 L2 MSHRs, 300 cycle memory latency
  - Stream prefetcher
- Wattch power model: 100 nm, 1.2V technology
- Manne's fetch gating
  - Gating threshold: 3 low confidence branches
  - JRS confidence estimator (4K-entry, 4-bit MDC)
  - Tuned for the best energy-delay product
- Branch count-based fetch gating

  BP Acc (%)   Threshold
  100~99       18
  99~97        16
  97~95        13
  95~93        12
  93~90        11
  90~85        7
  85~0         3

Branch-Count Based Fetch Gating
[Figure: IPC delta (%) and energy delta (%) for perfect, manne, and fg-br, per benchmark: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, plus hmean (IPC) and AVG (energy)]
- Performance and energy savings are higher than Manne's.
- Manne's and our fetch gating degrade performance of mcf and parser.

WPUP Mechanisms
[Figure: IPC delta (%) and energy delta (%) for fg-br, fg-br/pc-wpup, and fg-br/phase-wpup, per benchmark: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, AVG]
- Improves performance of mcf and parser.
- Improves performance and energy savings compared to Manne's.

Hardware Cost
Performance-Aware Speculation Control vs. Manne's Fetch Gating

  Hardware cost       Fetch Gating   WPUP   Total
  Manne               2049B          -      2049B
  FG-BR/PC-WPUP       6B             260B   266B
  FG-BR/PHASE-WPUP    6B             45B    51B

Comparison with Manne's Fetch Gating
[Figure: IPC delta (%) and energy delta (%) for manne, manne/pc-wpup, and fg-br/pc-wpup, per benchmark: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, AVG]
- WPUPs improve performance and energy efficiency of Manne's.
- 2.5% less performance degradation, 1.0% more energy savings.

Energy-Delay Product
[Figure: EDP delta (%) for manne, fg-br/pc-wpup, and fg-br/phase-wpup, per benchmark: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, AVG]
- Improves Energy-Delay Product (2.6% compared to Manne's).

Conclusion
- Performance-Aware Speculation Control
  - Branch count-based fetch gating
    - Simple and low cost.
  - Introduced Wrong Path Usefulness Prediction
    - Recovers performance loss due to fetch gating by executing useful wrong-path instructions.
    - Can be combined with other fetch gating mechanisms.
- Reduces performance loss due to fetch gating and also saves energy.

Questions?
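
For reference, the branch-count gating rule and the phase-based WPUP from the mechanism and methodology slides can be combined in a short sketch. The threshold table is the one given in the simulation methodology; everything else is an assumption made for illustration: the function and class names, treating each accuracy range as inclusive at its lower bound, the saturating behavior of the 5-bit WPUC, and the WPUC threshold value of 8 (the slides do not state it).

```python
# (lower bound of the branch prediction accuracy range in %, gating threshold),
# taken from the "BP Acc (%) / Threshold" table in the methodology slide.
ACCURACY_TO_THRESHOLD = [
    (99, 18), (97, 16), (95, 13), (93, 12), (90, 11), (85, 7), (0, 3),
]


def gating_threshold(bp_accuracy_pct):
    """Map the accuracy measured over the last period to a branch-count threshold."""
    for lower_bound, threshold in ACCURACY_TO_THRESHOLD:
        if bp_accuracy_pct >= lower_bound:  # assumed: lower bound is inclusive
            return threshold
    raise ValueError("accuracy must be between 0 and 100")


class PhaseWPUP:
    """Phase-based WPUP built around the Wrong Path Usefulness Counter (WPUC)."""

    def __init__(self, threshold=8, bits=5):  # threshold value is hypothetical
        self.max_count = (1 << bits) - 1      # 5-bit counter -> at most 31
        self.threshold = threshold
        self.wpuc = 0

    def on_useful_wrong_path_reference(self):
        # Assumed to saturate rather than wrap around.
        self.wpuc = min(self.wpuc + 1, self.max_count)

    def periodic_reset(self):
        self.wpuc = 0

    def wrong_path_useful(self):
        return self.wpuc > self.threshold


def should_fetch_gate(branch_count, bp_accuracy_pct, wpup):
    """Gate only if the branch count exceeds the threshold and the current
    phase is not predicted to produce useful wrong-path references."""
    over_threshold = branch_count > gating_threshold(bp_accuracy_pct)
    return over_threshold and not wpup.wrong_path_useful()
```

With 96% branch prediction accuracy the table gives a threshold of 13, so a branch count of 14 gates fetch unless the WPUC has exceeded its own threshold since the last periodic reset.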