Handling Branches in TLS Systems with Multi-Path Execution
Polychronis Xekalakis and Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA
Introduction
Power efficiency, complexity and time-to-market reasons lead to chip multiprocessors (CMPs)
Many simple cores = high thread-level parallelism (TLP) but low instruction-level parallelism (ILP)
 – OK for throughput computing and embarrassingly parallel applications
Problem:
 – No benefits for sequential applications
 – Even for mostly parallel applications, Amdahl’s Law limits performance gains with many cores
Solution: Speculative Multithreading (SM)
HPCA 2010
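As a rough illustration of the Amdahl's Law point on the Introduction slide, the sketch below computes the best-case speedup of a program whose parallelizable fraction is spread over a given number of cores. The function name and the sample fractions are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at full cost; the parallel part is divided among cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9):      # hypothetical parallel fractions
        for n in (4, 16, 64):      # hypothetical core counts
            print(f"p={p:.1f} cores={n:2d} speedup={amdahl_speedup(p, n):.2f}")
```

With a purely sequential program the speedup stays at 1 regardless of core count, and a program that is 90% parallel can never exceed a speedup of 10, which is the gap speculative multithreading is meant to attack.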
Speculative Multithreading
Basic Idea: Use idle cores/contexts to speculate on future application needs
 – TLS: speculatively execute parallel threads
 – HT/RA (helper threads / runahead): speculatively perform future memory operations
 – MP: speculatively execute along multiple branch targets
No SM model works best at all times
Hardware infrastructure is very similar
Our Idea: Combine SM models and seamlessly exploit (speculative) TLP and/or ILP
 – In this work: TLS + MP
 – (for TLS + HT/RA see [ICS’09])
ICS 2009
Key Contributions
Analyze branch prediction for TLS systems
Propose a mixed execution model that combines TLS with MP execution
We show that TLS allows MP to be more aggressive
Our approach outperforms state-of-the-art SM models:
 – TLS by 9.2% avg. (up to 23.2%)
 – MP by 28.2% avg. (up to 138%)
Outline
Introduction
Speculative Multithreaded Models
Analysis of Branch Prediction in TLS
Mixed Execution Model
Experimental Setup and Results
Conclusions
Thread Level Speculation
Compiler deals with:
 – Task selection
 – Code generation
HW deals with:
 – Different context
 – Spawn threads
 – Detecting violations
 – Replaying
 – Arbitrate commit
[Figure labels: Speculative; Thread 1; Thread 2]
Thread Level Speculation
Benefit: TLP/ILP
 – TLP (Overlapped Execution)
 – ILP (Prefetching)
[Figure: two panels, "Overlapped Execution" and "Prefetching"; labels in each: Speculative; Thread 1; Thread 2]
MultiPath Execution
Compiler deals with:
 – Nothing
HW deals with:
 – Different context
 – When to do MP
 – Discard wrong path
[Figure labels: Main Thread; Correct Paths; MP Mode; Wrong Paths]
MultiPath Execution
Benefit:
 – ILP (Branch Pred.)
[Figure labels: Main Thread; Correct Paths; Branch Misp. Cost; Wrong Paths]
Analysis of Branch Prediction in TLS
Impact of Branch Prediction on TLS
TLS emulates a wider processor:
 – Removing mispredictions is important (Amdahl)
Branch Entropy for TLS
Much harder for TLS:
 – History partitioning
 – History re-order
Increasing the Size of the Branch Predictor
Aliasing is not much of a problem
The fundamental limitation is lack of history
Designing a Better Predictor
Predictors that exploit longer histories are not necessarily better
Mixed Execution Model
Mixed Execution Model
When there are idle resources:
 – Try MP on top of TLS!
Map TLS threads on empty cores
Map MP threads on empty contexts (same core)
Minimal extra HW:
 – Branch confidence estimator
 – MP bit: thread is in MP mode
 – PATHS: how many outstanding branches
 – DIR: which path the thread followed
Combined TLS/MP Model
[Figure labels: Speculative; Thread 1; Thread 2]
Combined TLS/MP Model
Low Confidence Branch
 – Thread 1: MP: 0, PATHS: 000, DIR: 000
[Figure labels: Speculative; Thread 1; Thread 2]
Combined TLS/MP Model
Multi-Path Mode
 – Thread 1a: MP: 1, PATHS: 001, DIR: 000
 – Thread 1b: MP: 1, PATHS: 001, DIR: 001
[Figure labels: Speculative; Thread 1a; Thread 1b; Thread 2]
Combined TLS/MP Model
Branch Resolved
 – Thread 1a: MP: 1, PATHS: 001, DIR: 000
 – Thread 1b: MP: 0, PATHS: 000, DIR: 000
[Figure labels: Speculative; Thread 1a; Thread 1b; Thread 2]
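The MP, PATHS and DIR values on the Combined TLS/MP Model slides can be read as a small state machine. The sketch below is one possible reading, not the hardware described in the paper: it assumes PATHS and DIR are bit vectors with one bit per outstanding forked branch, and the names `ThreadState`, `fork` and `resolve` are hypothetical. The slides show the wrong-path thread (Thread 1a) still carrying MP: 1 after resolution; the sketch simply drops it.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class ThreadState:
    """Per-thread MP bookkeeping; field names follow the slide labels."""
    name: str
    mp: bool = False     # MP bit: thread is in multi-path mode
    paths: int = 0       # PATHS: one bit per outstanding forked branch (assumed encoding)
    direction: int = 0   # DIR: path followed at each forked branch (assumed encoding)

    def show(self) -> str:
        return (f"{self.name} MP: {int(self.mp)} "
                f"PATHS: {self.paths:03b} DIR: {self.direction:03b}")


def fork(thread: ThreadState, slot: int) -> List[ThreadState]:
    """Split a thread at a low-confidence branch, tracking it in bit `slot`."""
    bit = 1 << slot
    if thread.paths & bit:
        raise ValueError("slot already tracks an outstanding branch")
    paths = thread.paths | bit
    a = replace(thread, name=thread.name + "a", mp=True, paths=paths,
                direction=thread.direction & ~bit)
    b = replace(thread, name=thread.name + "b", mp=True, paths=paths,
                direction=thread.direction | bit)
    return [a, b]


def resolve(threads: List[ThreadState], slot: int, outcome: int) -> List[ThreadState]:
    """Keep threads whose DIR bit matches `outcome` (0 or 1); drop the wrong path."""
    bit = 1 << slot
    survivors = []
    for t in threads:
        if not t.paths & bit:
            survivors.append(t)            # not involved in this branch
            continue
        followed = 1 if t.direction & bit else 0
        if followed == outcome:
            paths = t.paths & ~bit
            survivors.append(replace(t, mp=paths != 0, paths=paths,
                                     direction=t.direction & ~bit))
    return survivors


if __name__ == "__main__":
    t1 = ThreadState("Thread 1")
    print(t1.show())
    t1a, t1b = fork(t1, slot=0)
    print(t1a.show())
    print(t1b.show())
    for s in resolve([t1a, t1b], slot=0, outcome=1):
        print(s.show())
```

Forking sets the MP bit and the PATHS bit in both copies and distinguishes them only by DIR, which matches the values shown for Thread 1a and Thread 1b; resolving with outcome 1 leaves Thread 1b with all three fields cleared.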
Intricacies to be Handled
How do we map TLS/MP threads?
 – Different mapping policies for TLS threads
Dealing with thread ordering:
 – Correct data forwarding
Dealing with violations:
 – While in “MP-Mode”, delay restarts/kills/commits
 – No squashes on the wrong path
Thread spawning:
 – Delayed as well, to keep contention low
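The "delay while in MP-Mode" rule on the Intricacies slide can be pictured as a per-thread queue of deferred control events. This is a minimal sketch under that assumption only; the class `MPAwareThread`, its methods, and the event names are hypothetical and do not come from the paper's hardware description.

```python
from collections import deque
from typing import Deque, List

# Control events the slide says are delayed in MP mode (spawns included).
CONTROL_EVENTS = {"restart", "kill", "commit", "spawn"}


class MPAwareThread:
    """Defers TLS control events while the thread is in multi-path mode."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.in_mp_mode = False
        self.pending: Deque[str] = deque()   # events held back during MP mode
        self.applied: List[str] = []         # events that actually took effect

    def enter_mp_mode(self) -> None:
        self.in_mp_mode = True

    def signal(self, event: str) -> None:
        if event not in CONTROL_EVENTS:
            raise ValueError(f"unknown control event: {event}")
        if self.in_mp_mode:
            self.pending.append(event)       # delay until the branch resolves
        else:
            self.applied.append(event)

    def branch_resolved(self) -> None:
        """Leave MP mode and apply the deferred events in arrival order."""
        self.in_mp_mode = False
        while self.pending:
            self.applied.append(self.pending.popleft())


if __name__ == "__main__":
    t = MPAwareThread("Thread 1")
    t.signal("spawn")
    t.enter_mp_mode()
    t.signal("restart")
    t.signal("commit")
    print(t.applied, list(t.pending))
    t.branch_resolved()
    print(t.applied, list(t.pending))
```

Holding events until the branch resolves means that only the surviving path ever acts on them, which is one way to obtain the "no squashes on the wrong path" property listed on the slide.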
Experimental Setup and Results
Experimental Setup
Simulator, Compiler and Benchmarks:
 – SESC (http://sesc.sourceforge.net/)
 – POSH (Liu et al., PPoPP ’06)
 – SPEC 2000 Int.
Architecture:
 – Four-way CMP, 4-issue cores, 6 contexts/core
 – 32 Kbit OGEHL, 1 KByte BTB, 32-entry RAS
 – 8 Kbit enhanced JRS confidence estimator
 – 32 KB L1 data (multi-versioned) and instruction caches
 – 1 MB unified L2 caches
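For readers unfamiliar with the confidence estimator named in the setup, the sketch below shows the general idea of a JRS-style estimator: a table of resetting counters that count correct predictions since the last misprediction, with a branch treated as high confidence once its counter reaches a threshold. The table geometry (2048 four-bit counters, which happens to total 8 Kbit), the threshold, the history length and the index hash are all assumptions for illustration; the slides give only the total size.

```python
class JRSConfidenceEstimator:
    """Resetting-counter confidence estimator (illustrative parameters)."""

    def __init__(self, entries: int = 2048, counter_bits: int = 4,
                 threshold: int = 15, history_bits: int = 11) -> None:
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.index_mask = entries - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.history_mask = (1 << history_bits) - 1
        self.table = [0] * entries
        self.history = 0                      # global branch outcome history

    def _index(self, pc: int, predicted_taken: bool) -> int:
        # Assumed "enhanced" indexing: fold the predicted direction into the
        # history before XOR-ing with the branch address.
        h = (self.history << 1) | int(predicted_taken)
        return (pc ^ h) & self.index_mask

    def is_high_confidence(self, pc: int, predicted_taken: bool) -> bool:
        return self.table[self._index(pc, predicted_taken)] >= self.threshold

    def update(self, pc: int, predicted_taken: bool, actual_taken: bool) -> None:
        i = self._index(pc, predicted_taken)
        if predicted_taken == actual_taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)  # saturate
        else:
            self.table[i] = 0                                       # reset on miss
        self.history = ((self.history << 1) | int(actual_taken)) & self.history_mask


if __name__ == "__main__":
    ce = JRSConfidenceEstimator()
    pc = 0x400123                              # hypothetical branch address
    for _ in range(40):
        ce.update(pc, predicted_taken=True, actual_taken=True)
    print(ce.is_high_confidence(pc, True))
    ce.update(pc, predicted_taken=True, actual_taken=False)
    print(ce.is_high_confidence(pc, True))
```

In the mixed execution model, a low-confidence answer from such an estimator is what would trigger forking a second path.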
Comparing TLS, MP and Combined TLS/MP
Additive benefits; no point in doubling the predictor
9.2% over TLS, 28.2% over MP
Pipeline Flushes
Significant reduction in pipeline flushes
More than base MP!
Conclusions
Also in the Paper …
Detailed HW description
Impact of scheduling
Limiting MP to DP
Effect of scaling
Effect of a better CE
Conclusions
CMPs are here to stay:
 – What about single-threaded apps and apps with significant sequential sections?
 – We advocate the use of speculative multithreading
Analyzed branch prediction for modern TLS systems
Proposed a new mixed execution model
 – TLS is nicely complemented by MP
Unified scheme outperforms existing SM models
 – TLS by 9.2% avg. (up to 23.2%)
 – MP by 28.2% avg. (up to 138%)
Backup Slides
Prediction Stats
Stat. (%)   Misp.   PVN    PVP    SPEC   SENS
Bzip2       5.7     22.8   98.2   90.7   95
Crafty      5.2     16.9   97.6   89.1   96
Gap         3.3     19.5   98.8   89.7   97.5
Gzip        5.1     24.1   98.6   91.4   95.4
Mcf         3.9     27.9   99.2   91.8   96.6
Parser      3.4     20.8   98.9   90     97.3
Twolf       10      23.2   96.4   91.3   89.5
Vortex      0.3     11.6   99.8   88.5   99.8
Vpr         6.6     24.4   98     91     93.9
Avg.        4.8     21.3   98.4   90.4   95.7
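The slide does not define the column names. Assuming they follow the definitions commonly used in the confidence-estimation literature (PVN and PVP as predictive value of a negative and positive test, SPEC as specificity, SENS as sensitivity), they can be computed from four event counts as sketched below. The class and function names and the sample counts are hypothetical, and the sample counts are not taken from the table.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConfidenceCounts:
    correct_high: int     # prediction correct, estimator said high confidence
    correct_low: int      # prediction correct, estimator said low confidence
    incorrect_high: int   # prediction wrong, estimator said high confidence
    incorrect_low: int    # prediction wrong, estimator said low confidence


def confidence_metrics(c: ConfidenceCounts) -> Dict[str, float]:
    """Return the five statistics as percentages (assumed definitions)."""
    total = c.correct_high + c.correct_low + c.incorrect_high + c.incorrect_low
    return {
        # Share of all predictions that were wrong.
        "Misp.": 100.0 * (c.incorrect_high + c.incorrect_low) / total,
        # Of the low-confidence estimates, how many were really mispredictions.
        "PVN": 100.0 * c.incorrect_low / (c.correct_low + c.incorrect_low),
        # Of the high-confidence estimates, how many were really correct.
        "PVP": 100.0 * c.correct_high / (c.correct_high + c.incorrect_high),
        # Of the mispredictions, how many were flagged low confidence.
        "SPEC": 100.0 * c.incorrect_low / (c.incorrect_high + c.incorrect_low),
        # Of the correct predictions, how many were flagged high confidence.
        "SENS": 100.0 * c.correct_high / (c.correct_high + c.correct_low),
    }


if __name__ == "__main__":
    sample = ConfidenceCounts(correct_high=900, correct_low=50,
                              incorrect_high=10, incorrect_low=40)
    for name, value in confidence_metrics(sample).items():
        print(f"{name:6s} {value:6.2f}")
```

For multi-path execution the low-confidence side matters most: SPEC indicates how many mispredictions get covered by a fork, while PVN indicates how many forks are actually useful.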
Performance Model
$S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup ($S_{all}$): $S_{all} = T_{seq} / T_{mt}$
2. Compute sequential TLS speedup ($S_{seq}$): $S_{seq} = T_{seq} / T_{1p}$
3. Compute speedup due to ILP ($S_{ilp}$): $S_{ilp} = (T_1 + T_2) / (T_1' + T_2')$
4. Use everything to compute TLP ($S_{ovl}$): $S_{ovl} = S_{all} / (S_{seq} \times S_{ilp})$
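The four steps of the performance model can be checked numerically with a short sketch. The slides do not define the time symbols, so the parameter descriptions in the comments are one plausible reading rather than the authors' definitions, and the sample timings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpeedupBreakdown:
    s_all: float   # overall speedup
    s_seq: float   # sequential TLS speedup
    s_ilp: float   # speedup due to ILP
    s_ovl: float   # speedup due to overlapped execution (TLP)


def decompose_speedup(t_seq: float, t_mt: float, t_1p: float,
                      t1: float, t2: float,
                      t1_prime: float, t2_prime: float) -> SpeedupBreakdown:
    """Apply the four steps from the slide.

    Assumed meanings: t_seq is the sequential run time, t_mt the multithreaded
    run time, t_1p the run time of the TLS code on one processor, t1/t2 the
    thread times without the ILP benefit and t1_prime/t2_prime with it.
    """
    s_all = t_seq / t_mt                          # step 1
    s_seq = t_seq / t_1p                          # step 2
    s_ilp = (t1 + t2) / (t1_prime + t2_prime)     # step 3
    s_ovl = s_all / (s_seq * s_ilp)               # step 4
    return SpeedupBreakdown(s_all, s_seq, s_ilp, s_ovl)


if __name__ == "__main__":
    b = decompose_speedup(t_seq=100.0, t_mt=50.0, t_1p=110.0,
                          t1=60.0, t2=50.0, t1_prime=50.0, t2_prime=40.0)
    print(b)
    # The three factors multiply back to the overall speedup by construction.
    print(abs(b.s_seq * b.s_ilp * b.s_ovl - b.s_all) < 1e-12)
```

Because the overlap term is defined as whatever remains after the other two factors are divided out, the product of the three factors always reproduces the overall speedup.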