Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi Princeton University.
Why Thread Criticality Prediction?
[Figure: instructions executed by threads T0–T3, annotated with D-cache misses, I-cache misses, and stall time]
• Sources of variability: algorithm, process variations, thermal emergencies, etc.
• Threads 1 & 3 are critical; the other threads stall, causing performance degradation and energy inefficiency.
• With thread criticality prediction:
  1. Task stealing for performance
  2. DVFS for energy efficiency
  3. Many others …

Related Work
• Instruction criticality [Fields et al. 2001, Tune et al. 2001, etc.]
• Thrifty barrier [Li et al. 2005]: faster cores transitioned into a low-power mode based on a prediction of barrier stall time.
• DVFS for energy efficiency at barriers [Liu et al. 2005].
• Meeting points [Cai et al. 2008]: DVFS non-critical threads by tracking loop-iteration completion rates across cores (parallel loops).
Our approach:
  1. Also handles non-barrier code
  2. Works on constant or variable loop iteration sizes
  3. Predicts criticality at any point in time, not just at barriers

Thread Criticality Prediction Goals
Design goals, and the design decisions they drive:
  1. Accuracy (absolute and relative TCP accuracy) → find a suitable architectural metric
  2. Low-overhead implementation (simple HW, so SW policies can be built on top) → thread-comparative approach rather than a history-based local approach
  3. One predictor, many uses →
This paper: TBB, DVFS. Other uses: shared LLC management, SMT and memory priority, …

Outline of this Talk
• Thread Criticality Predictor Design
  – Methodology
  – Identify µarchitectural events impacting thread criticality
  – Introduce basic TCP hardware
• Thread Criticality Predictor Uses
  – Apply to Intel's Threading Building Blocks (TBB)
  – Apply for energy efficiency in barrier-based programs

Methodology
Evaluations on a range of architectures (high-performance and embedded domains), full-system including the OS, with detailed power/energy studies on an FPGA emulator:

Infrastructure | GEMS Simulator | ARM Simulator | FPGA Emulator
Domain | High-performance, wide-issue, out-of-order | Embedded, in-order | Embedded, in-order
System | 16-core CMP with Solaris 10 | 4–32 core CMP | 4-core CMP with Linux 2.6
Cores | 4-issue SPARC | 2-issue ARM | 1-issue SPARC
Caches | 32KB L1, 4MB L2 | 32KB L1, 4MB L2 | 4KB I-cache, 8KB D-cache

Why not History-Based TCPs?
+ Info local to core: no communication
– Requires repetitive barrier behavior
– Problem for in-order pipelines: variant IPCs
[Chart: Ocean, barrier 8 — per-iteration stall and compute time over iterations 0–8, normalized to iteration 0]

Thread-Comparative Metrics for TCP
[Charts: % error of each metric in tracking compute time across SPLASH-2 (Cholesky, FFT, Radix, Water-Sp, Water-Nsq, Volrend, Barnes, Ocean, LU) and PARSEC (Swaptions, Fluidanimate, Blackscholes, Streamcluster)]
Metrics evaluated in turn:
• In-order instruction count
• In-order L1 D-cache misses per instruction
• In-order L1 I- & D-cache misses per instruction
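As a rough illustration of what the "% error of metric in tracking compute time" charts measure, here is one plausible way to score a metric: normalize both per-thread compute times and per-thread metric values to their maxima and take the worst-case gap. This is a sketch under assumed definitions, not the paper's exact error formula; all names are illustrative.

```python
# Sketch (illustrative, not the paper's exact formula): score how well an
# architectural metric tracks relative per-thread compute time.

def metric_error_percent(compute_times, metric_values):
    """Return the % error of a metric in tracking relative compute time.

    Both lists hold one value per thread for the same parallel region.
    Each is normalized to its own maximum (the critical thread); the
    error is the worst-case gap between the two normalized profiles.
    """
    t_max = max(compute_times)
    m_max = max(metric_values)
    errors = [abs(t / t_max - m / m_max)
              for t, m in zip(compute_times, metric_values)]
    return 100.0 * max(errors)

# Example: thread 2 is critical; the miss counts rank threads similarly.
times  = [80, 90, 100, 85]      # per-thread compute time
misses = [40, 46, 50, 41]       # e.g. combined cache misses per thread
print(round(metric_error_percent(times, misses), 1))  # prints 3.0
```

A metric with low error under this kind of score identifies the critical thread without ever measuring time directly, which is what makes simple cache statistics attractive.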
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses
[Charts: adding L2 misses — in-order L1 & L2 cache misses per instruction — yields the lowest % error in tracking compute time across the benchmarks, and the out-of-order L1 & L2 variant tracks similarly well]

Basic TCP Hardware
• Per-core criticality counters track poorly cached, slow threads: the counters are updated on L1 I-cache, L1 D-cache, and L2 misses, with costlier L2 misses contributing more.
• An interval bound register periodically refreshes the criticality counters.
[Figure: a 4-core CMP, each core with L1 I$ and D$, sharing an L2 cache; miss notifications flow to the L2 controller, where the criticality counters reside]
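The counter update described above can be sketched in software. This is a behavioral model only: the miss weights, the halving-based refresh, and the class names are my assumptions for illustration, not the authors' hardware parameters.

```python
# Behavioral sketch of per-core criticality counters (illustrative).
# WEIGHTS are placeholder miss penalties, not the paper's values.
WEIGHTS = {"l1i": 10, "l1d": 10, "l2": 100}

class CriticalityCounters:
    def __init__(self, n_cores, interval_bound=100_000):
        self.counters = [0] * n_cores
        self.insts = 0
        self.interval_bound = interval_bound  # models the interval bound register

    def on_miss(self, core, kind):
        # Poorly cached, slow threads accumulate larger counts.
        self.counters[core] += WEIGHTS[kind]

    def on_instruction(self):
        # Periodic refresh so stale history does not dominate
        # (halving is an assumed decay policy).
        self.insts += 1
        if self.insts >= self.interval_bound:
            self.counters = [c // 2 for c in self.counters]
            self.insts = 0

    def most_critical(self):
        return max(range(len(self.counters)), key=self.counters.__getitem__)

tcp = CriticalityCounters(n_cores=4)
tcp.on_miss(1, "l2")
tcp.on_miss(3, "l1d")
print(tcp.most_critical())  # prints 1: an L2 miss outweighs an L1 miss
```

Centralizing these counters at the L2 controller is cheap because the miss events they consume already arrive there.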
TCP Hardware
[Figure: per-core criticality counter values held in the L2 controller]

TBB Task Stealing & Thread Criticality
• TBB's dynamic scheduler distributes tasks: each thread maintains a software queue filled with tasks; on an empty queue, a thread "steals" a task from another thread's queue.
• Approach 1: default TBB uses random task stealing — more failed steals at higher core counts, hence poor performance.
• Approach 2: occupancy-based task stealing [Contreras, Martonosi 2008] — steal based on the number of items in the SW queue; must track and compare maximum occupancy counts.

TCP-Guided TBB Task Stealing
• TCP initiates steals from the critical thread: scan-for-max logic over the criticality counters selects the victim.
• Modest message overhead: one L2 access latency per steal request.
• Scalable: 14-bit criticality counters amount to 112 bytes of storage @ 64 cores (64 × 14 bits).
[Figure: four cores with software task queues SW Q0–Q3; core 2's steal request goes to the L2 controller, whose TCP steal control logic scans the criticality counters and redirects the steal to the most critical core]

TCP-Guided TBB Performance
[Chart: % performance improvement versus random task stealing for the occupancy-based and criticality-based approaches on Blackscholes, Fluidanimate, Swaptions, and Streamcluster at 4, 8, 16, and 32 cores]
• Average improvement over random task stealing (32 cores): 21.6%
• Average improvement over occupancy-based stealing (32 cores): 13.8%
• TCP access penalized with L2 latency
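The victim-selection policy above — steal from the most critical core that still has queued work — can be sketched as follows. This is a software analogy of the scan-for-max logic, with hypothetical names; the real mechanism lives in the L2 controller.

```python
# Sketch of TCP-guided victim selection for task stealing (illustrative).

def pick_victim(criticality_counters, queues, thief):
    """Steal from the most critical core that still has queued tasks."""
    candidates = [c for c in range(len(queues))
                  if c != thief and queues[c]]
    if not candidates:
        return None  # nothing to steal anywhere
    return max(candidates, key=lambda c: criticality_counters[c])

counters = [5, 21, 2, 0]                          # core 1 is most critical
queues = [["t0"], ["t4", "t5"], [], ["t7"]]       # per-core SW task queues
print(pick_victim(counters, queues, thief=2))     # prints 1
```

Unlike occupancy-based stealing, the thief needs no per-queue occupancy bookkeeping: one request to the shared counters answers "who is falling behind".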
Adapting TCP for Energy Efficiency in Barrier-Based Programs
[Figure: T1 suffers an L2 D-cache miss and is critical, so T0, T2, and T3 reach the barrier early and stall]
• Approach: DVFS non-critical threads to eliminate barrier stall time — here T1 is critical, so DVFS T0, T2, and T3.
• Challenges: relative criticalities, misprediction costs, DVFS overheads.

TCP for DVFS: Results
[Chart: normalized energy savings relative to each original benchmark, ranging from roughly 0.05 to 0.3]
• Average 15% energy savings on an FPGA platform with 4 cores and a 50% fixed leakage cost.
• See the paper for details: TCP mispredictions, DVFS overheads, etc.

Conclusions
• Goal 1: Accuracy — accurate TCPs based on simple cache statistics.
• Goal 2: Low-overhead hardware — scalable per-core criticality counters; TCP placed in a central location where cache information is already available.
• Goal 3: Versatility — TBB improved by 13.8% over the best known approach @ 32 cores; DVFS used to achieve 15% energy savings. Two uses shown, many others possible…
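As a closing illustration of the barrier-DVFS direction in the talk: slow each non-critical thread roughly in proportion to its predicted remaining work so everyone arrives at the barrier together. This is a directional sketch, not the paper's algorithm; the discrete frequency levels and the proportional rule are assumptions.

```python
# Sketch of a criticality-proportional DVFS policy (illustrative).
F_LEVELS = [1.0, 0.75, 0.5]  # assumed available frequency ratios

def dvfs_settings(criticality_counters):
    """Return a per-core frequency ratio; the critical core stays at 1.0."""
    c_max = max(criticality_counters)
    settings = []
    for c in criticality_counters:
        # Remaining-work ratio ~ criticality ratio; snap to a legal level.
        target = c / c_max if c_max else 1.0
        settings.append(min(F_LEVELS, key=lambda f: abs(f - target)))
    return settings

print(dvfs_settings([100, 80, 55, 45]))  # prints [1.0, 0.75, 0.5, 0.5]
```

A real policy must also weigh the challenges the talk lists: misprediction costs (a slowed thread that turns out to be critical lengthens the barrier) and the latency/energy overhead of each DVFS transition.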