StageNetSlice: A Reconfigurable Microarchitecture Building Block for Resilient CMP Systems
Shantanu Gupta, Shuguang Feng, Amin Ansari, Jason Blome and Scott Mahlke
University of Michigan, Ann Arbor

Journey of Silicon Technology
[Figure: CPU performance (log scale) versus year, 1985 to 2015. Processors plotted: 486, Pentium, Pentium II, Pentium III, Pentium 4, Core Duo, Core 2 Quad, Cell, IBM z servers. Annotations: Perfect transistors; Memory redundancy; Rising Variability and Defects; Unreliable Silicon.]

Reliability Threats
• Transient Faults
• Hard Faults (Manufacturing defects and device wear-out)
• Manufacturing Defects That Escape Testing (Inefficient Burn-in Testing)
• Parametric Variability (Uncertainty in device and environment)
  – Intra-die variations in ILD thickness
• Thermal Runaway: Increased Heating, Higher Transistor Leakage, Higher Power Dissipation
[Figure: transistor cross-section with Source, Gate, Drain, N+ regions and P substrate.]

Tolerating Permanent Faults
• Traditional solutions
  – TMR
  – Tandem / HP Non-stop
  – IBM zSeries
• Impractical
  – Cost
  – Power
  – Low gain
• Current approaches
  1. Detection / Prediction
     • Using sensors
     • Analytical models
     • Redundant Computation
     • BIST
  2. Repair
     • Replacement
     • Reconfiguration
[Images: Teramac (1995); K-pos DP-31/32.]

Reconfiguration Granularity
• Range of choices for the reconfiguration granularity: CORE level, STAGE level, MODULE level
[Figure: MTTF increase (%) versus Area increase (%) for CORE level, STAGE level and MODULE level reconfiguration, alongside a FETCH, DEC, EXEC, MEM, WB pipeline. Annotations: Lower design complexity; Lower overheads; Better resource utilization.]
Prior work cited on the slide:
  – ElastIC, DT'06
  – Reunion, MICRO'06
  – Configurable Isolation, ISCA'07
  – Online Diagnosis of Hard Faults, MICRO'05
  – Ultra Low-Cost Defect Protection, ASPLOS'06
  – Detour, DSN'08

Goal of this Research
• Incremental redundancy solutions not enough
• Design a computing substrate
  – Highly reconfigurable
  – Provides scalable fault tolerance
  – Marginal overheads
Design that can enable CMPs capable of facing ~100s of faults while maintaining useful throughput

CMP Fabric
[Figure: four cores (Core 0 to Core 3), each a pipeline of Stage1, Stage2, Stage3, ..., StageN separated by latches.]

StageNet (SN) Fabric
[Figure: rows of Stage1, Stage2, Stage3, ..., StageN connected through crossbar switches, with one logical pipeline highlighted and a Configuration Manager.]
• Wearout Sensors
  – Delay
  – Temperature
  – Current

SN – Benefits
[Figure: the StageNet fabric (four rows of Stage1 to StageN) with the Configuration Manager.]

SN Slice (SNS)
[Figure: a 5 stage pipeline (Fetch, Decode, Issue, Ex/Mem, WB) with Gen PC, Branch Predictor, Register File, inter-stage latches, and register wb, bypass and branch resolution paths, shown next to the SN Slice, which has double buffers between stages and a Scoreboard.]
• Challenge 1: No global stall / flush signals
• Challenge 2: No data forwarding
• Challenge 3: Transmission delay

SNS Performance Hit
[Figure: commit-time timelines comparing the 5 stage pipeline with the SNS pipeline.]
Sources of the SNS performance hit marked on the commit-time figure (instructions 1 to 10, including a branch BR and a register dependency):
  1. Branch induced stall
  2. Data forwarding
  3. Transmission delays

1. Stream-ID for Control Handling
• stream-id: 1 bit to represent the execution path
• Toggled upon a branch misprediction
• Wrong-path instructions are squashed
[Figure: SN Slice with SID bits added to the stages.]

Stream-ID Example
[Figure: Gen PC, Branch Predictor, Fetch, Decode, Issue, Register File, Scoreboard and Ex/Mem with their SID bits. On a branch mispredict the Stream-ID is toggled, the wrong-path instructions are squashed, and execution continues on the right path. Legend: committed, squashed.]

SNS with Stream-ID
How good is the performance?
[Figure: commit-time timelines for the 5 stage pipeline and the SNS pipeline with Stream-ID. Labels: 1. Branch induced stall; 2. Data forwarding; 3. Transmission delays.]

Simulation Infrastructure
• Trimaran Compiler
• Liberty Simulation Environment

  Branch predictor        Global, 16-bit, gshare predictor
  Level 1 I/D cache       4-way, 16KB, 1 cycle latency
  Level 2 unified cache   8-way, 64KB, 5 cycle latency

[Figure: tool flow. Benchmarks, Trimaran, Rebel, Assembler, HPL-PD Assembly, HPL-PD Emulator (FUNCTIONAL), SN Architecture (TIMING), Liberty Simulation Framework.]

SNS with StreamID
[Figure: performance results.] 4X slowdown

2. Bypass$ for data forwarding
• Bypass Cache
  – Fully associative structure
  – FIFO replacement policy
  – Entry fields: REG ID, VALUE
• Key benefits
  – Reduced stalls
  – Lower bandwidth consumption
[Figure: SN Slice with the Bypass $ added.]

SNS with Stream-ID, Bypass$
[Figure: commit-time timelines for the 5 stage pipeline and the SNS pipeline. Labels: 2. Data forwarding; 3. Transmission delays.]

SNS with StreamID, Bypass$
[Figure: performance results.] 2X slowdown

3. Transmission delay
• Multiple cycles for instruction transfer
• Low utilization

Hide delay with Macro-ops
• Need to improve utilization
  – Balance transfer and compute time
• Send instruction bundles
  – Macro-ops (MOP)
  – Greedy selection policy
• Advantages
  – Removes temporary intermediates
  – Parallelizes transfer and compute
Max length 4, max live-ins 2
[Figure: dataflow graph of >>, LD, +, /, &, <<, and ST operations grouped into macro-ops.]

SNS with Stream-ID, Bypass$, MOP
[Figure: SN Slice with a Packer added, and commit-time timelines for the 5 stage pipeline and the SNS pipeline. Label: 3. Transmission delays.]

SNS: Final Performance Results
[Figure: performance results.] 1.1X slowdown

SNS: Design Summary
• StreamID
  – SID reg.
• Bypass$
  – Bypass$, Scoreboard
• MOPs
  – Packer, Buffer sizes
~12% area overhead, ~10% perf. overhead

SN – Defect Tolerance
[Figure: a Traditional CMP and a StageNet CMP shown side by side with a # Faults counter.]
SN delivers 50% more cumulative work

Conclusions
• Architectural innovations are crucial for tackling the high failure rates
• StageNetSlice (SNS) is a potential solution:
  – Provides stage level reconfiguration
  – Incurs low overheads
• Ongoing work
  – SNS design for aggressive cores
  – Optimal SN configuration
[Figure: a Traditional CMP next to a StageNet CMP built from SNSs.]

Thank You
http://cccp.eecs.umich.edu

Back up

Scoreboard
• Entry fields: REG ID, Valid
• Scoreboard to handle RAW dependencies
• Stalls generate backpressure
[Figure: SN Slice with the Scoreboard highlighted.]

Area overhead breakdown
Router area for 32 and 64 bit configurations

Architectural Details

Varying Crossbar Width
[Figure: performance results.] 1.35X slowdown

Stage modifications for SNS

SN – Throughput
[Figure: throughput results.] 4X

SN – Cumulative Work
[Figure: cumulative work results.] 50%
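The Stream-ID slides describe a 1-bit tag that is toggled on a branch misprediction so that wrong-path instructions can be squashed without a global flush signal. The following is a minimal sketch of that idea, not the authors' implementation. The `Instr` record, the `stamp` helper and the instruction names are hypothetical, and the sketch assumes the front end has already learned of the redirect by the time it stamps the right-path instructions.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    sid: int                      # stream-ID bit stamped by the front end
    mispredicted_branch: bool = False


def stamp(names, sid, mispredicted=frozenset()):
    """Hypothetical front end: tag each fetched instruction with the fetch-side SID."""
    return [Instr(n, sid, n in mispredicted) for n in names]


def execute(instrs, sid=0):
    """Hypothetical Ex/Mem stage: squash any instruction whose SID differs from ours."""
    committed, squashed = [], []
    for ins in instrs:
        if ins.sid != sid:
            squashed.append(ins.name)   # fetched down the wrong path
            continue
        committed.append(ins.name)
        if ins.mispredicted_branch:
            sid ^= 1                    # toggle the stream-ID on a mispredict
    return committed, squashed


# Instructions fetched before the redirect carry SID 0; w1 and w2 are wrong-path.
old_path = stamp(["i1", "BR", "w1", "w2"], sid=0, mispredicted={"BR"})
# After the redirect the front end has toggled its own SID to 1.
new_path = stamp(["c1", "c2"], sid=1)

committed, squashed = execute(old_path + new_path)
print(committed, squashed)
```

Because each stage only compares a local bit against the bit carried by the instruction, no stage needs to know how many wrong-path instructions are still in flight.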
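The Bypass$ slide specifies only a fully associative structure with FIFO replacement that maps a REG ID to a VALUE. A small sketch of such a structure is given below. The class name, the default capacity of 4, and the choice to treat a rewrite of the same register as a fresh (newest) entry are assumptions made for illustration and are not taken from the slides.

```python
from collections import OrderedDict


class BypassCache:
    """Sketch of a fully associative, FIFO-replaced REG ID -> VALUE store."""

    def __init__(self, capacity=4):      # capacity is an assumed parameter
        self.capacity = capacity
        self.entries = OrderedDict()     # insertion order doubles as FIFO order

    def write(self, reg_id, value):
        if reg_id in self.entries:
            # Assumption: a new result for the same register becomes the newest entry.
            del self.entries[reg_id]
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the oldest entry
        self.entries[reg_id] = value

    def lookup(self, reg_id):
        """Return the cached value, or None when the consumer must wait for the register file."""
        return self.entries.get(reg_id)


cache = BypassCache(capacity=2)
cache.write("r1", 10)
cache.write("r2", 20)
cache.write("r3", 30)                    # evicts r1, the oldest entry
print(cache.lookup("r1"), cache.lookup("r2"), cache.lookup("r3"))
```

A hit lets a dependent instruction proceed with the recently produced value locally, which is where the slide's "reduced stalls" and "lower bandwidth consumption" would come from.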
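The macro-op slide names a greedy selection policy with a maximum length of 4 and at most 2 live-ins per bundle. One plausible reading, sketched below, walks the instructions in program order and closes the current bundle as soon as adding the next instruction would break either limit. The `(dest, srcs)` instruction encoding, the function names and the sample program are hypothetical, and the selection actually used in StageNet may differ.

```python
def live_ins(bundle):
    """Registers read by the bundle that are not produced earlier inside it."""
    defined, needed = set(), set()
    for dest, srcs in bundle:
        needed |= {s for s in srcs if s not in defined}
        if dest is not None:
            defined.add(dest)
    return needed


def pack(instrs, max_len=4, max_live_ins=2):
    """Greedy in-order packing into macro-ops under the two limits from the slide."""
    bundles, current = [], []
    for ins in instrs:
        candidate = current + [ins]
        too_long = len(candidate) > max_len
        too_many_inputs = len(live_ins(candidate)) > max_live_ins
        if current and (too_long or too_many_inputs):
            bundles.append(current)      # close the bundle and start a new one
            current = [ins]
        else:
            current = candidate          # a lone instruction is always accepted
    if current:
        bundles.append(current)
    return bundles


# Hypothetical program: (destination register or None, source registers)
program = [
    ("r3", ("r1",)),         # r3 = LD [r1]
    ("r4", ("r2",)),         # r4 = LD [r2]
    ("r5", ("r3", "r4")),    # r5 = r3 + r4
    (None, ("r5", "r1")),    # ST [r1] = r5
    ("r7", ("r6",)),         # r7 = r6 >> 1
]

bundles = pack(program)
print([len(b) for b in bundles])
```

In the first bundle, r3, r4 and r5 never leave the macro-op, which illustrates the slide's point about removing temporary intermediates.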