StageNetSlice: A Reconfigurable Microarchitecture
Building Block for Resilient CMP Systems
Shantanu Gupta, Shuguang Feng, Amin Ansari,
Jason Blome and Scott Mahlke
University of Michigan, Ann Arbor
Journey of Silicon Technology
[Figure: CPU performance (log scale) versus year, 1985 to 2015. Processors plotted: 486, Pentium, Pentium II, Pentium III, Pentium 4, Core Duo, Core 2 Quad, Cell. Annotations: perfect transistors; rising variability and defects; unreliable silicon; memory redundancy; IBM z servers.]
2
Reliability Threats
• Transient Faults
• Hard Faults (manufacturing defects and device wear-out)
• Manufacturing Defects That Escape Testing (inefficient burn-in testing)
• Parametric Variability (uncertainty in device and environment)
[Figure: transistor cross-section showing source, gate and drain, with N+ regions in a P substrate.]
[Figure: intra-die variations in ILD thickness.]
[Figure: thermal runaway cycle linking increased heating, higher transistor leakage and higher power dissipation.]
3
Tolerating Permanent Faults
• Traditional solutions
– TMR
– Tandem / HP Non-stop
– IBM zSeries
• Impractical
– Cost
– Power
– Low gain
• Current approaches
1. Detection / Prediction
• Using sensors
• Analytical models
• Redundant Computation
• BIST
2. Repair
• Replacement
• Reconfiguration
[Figure labels: Teramac (1995); K-pos DP-31/32]
4
Reconfiguration Granularity
• Range of choices for the reconfiguration granularity: CORE level, STAGE level, MODULE level
[Figure: MTTF increase (%) versus area increase (%) for CORE-level, STAGE-level and MODULE-level reconfiguration, illustrated on a FETCH / DEC / EXEC / MEM / WB pipeline. Annotations: lower design complexity; lower overheads; better resource utilization.]
- ElastIC, DT’06
- Reunion, MICRO’06
- Configurable Isolation, ISCA’07
- Online Diagnosis of Hard Faults, MICRO’05
- Ultra Low-Cost Defect Protection, ASPLOS’06
- Detour, DSN’08
5
Goal of this Research
• Incremental redundancy solutions are not enough
• Design a computing substrate
– Highly reconfigurable
– Provides scalable fault tolerance
– Marginal overheads
A design that enables CMPs to face ~100s of faults while maintaining useful throughput
6
CMP Fabric
[Figure: conventional CMP with four cores (Core 0 to Core 3). Each core is a fixed pipeline of Stage1, Stage2, Stage3, ..., StageN with latches between stages.]
7
StageNet (SN) Fabric
[Figure: four rows of pipeline stages (Stage1, Stage2, Stage3, ..., StageN) connected through crossbar switches. A logical pipeline is formed by selecting one instance of each stage.]
Wearout Sensors
• Delay
• Temperature
• Current
Configuration Manager
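The role the slide gives the configuration manager, stitching working stages from anywhere in the fabric into logical pipelines, can be illustrated with a minimal Python sketch. It assumes a fault-free full crossbar between adjacent stages, so any healthy instance of one stage can feed any healthy instance of the next. The function name `form_logical_pipelines`, the stage names and the fault-set representation are illustrative assumptions, not details taken from the talk.

```python
from typing import Dict, List, Set, Tuple

STAGES = ["fetch", "decode", "issue", "ex_mem"]  # assumed stage names


def form_logical_pipelines(
    num_slices: int, faulty: Set[Tuple[int, str]]
) -> List[Dict[str, int]]:
    """Group healthy stage instances into logical pipelines.

    `faulty` holds (slice_index, stage_name) pairs reported as broken,
    for example by wearout sensors. Each returned dict maps a stage name
    to the slice that supplies that stage for one logical pipeline.
    """
    healthy = {
        stage: [s for s in range(num_slices) if (s, stage) not in faulty]
        for stage in STAGES
    }
    # The scarcest stage type bounds the number of complete pipelines.
    count = min(len(instances) for instances in healthy.values())
    return [{stage: healthy[stage][i] for stage in STAGES} for i in range(count)]


# Four slices, with one broken stage in each of slices 0, 1 and 2.
faults = {(0, "fetch"), (1, "decode"), (2, "issue")}
pipelines = form_logical_pipelines(4, faults)
print(len(pipelines))  # 3
print(pipelines[0])
```

In this hypothetical case three different slices each have one broken stage, yet three logical pipelines remain because the healthy stages of the damaged slices are recombined.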
8
SN – Benefits
[Figure: StageNet fabric with four rows of stages (Stage1, Stage2, Stage3, ..., StageN) and the Configuration Manager.]
9
SN Slice (SNS)
[Figure: baseline 5 stage pipeline: Gen PC and Branch Predictor, Fetch, Decode, Issue with Register File, Ex/Mem and WB, separated by latches, with bypass, branch resolution and register wb paths.]
[Figure: SN Slice: Fetch (with Gen PC and Branch Predictor), Decode, Issue (with Register File and Scoreboard) and Ex/Mem, decoupled by double buffers.]
• Challenge 1: No global stall / flush signals
• Challenge 2: No data forwarding
• Challenge 3: Transmission delay
10
SNS Performance Hit
[Figure: baseline 5 stage pipeline and SN Slice block diagrams, with commit times of a numbered instruction sequence containing a branch (BR) and a register dependency, shown for the 5 stage pipeline and for the SNS pipeline.]
Delays annotated on the SNS pipeline:
1. Branch induced stall
2. Data forwarding
3. Transmission delays
11
1. Stream-ID for Control Handling
[Figure: SN Slice with stream-ID (SID) registers, initially 0, added to the stages.]
• stream-id: 1 bit to represent the execution path
• Toggled upon a branch misprediction
• Wrong-path instructions are squashed
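The stream-ID rules above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: only the execute stage is modelled, each instruction carries the stream-ID bit it was stamped with at fetch, and the fetch stage is assumed to toggle its own bit when the redirect reaches it. The names `Instr` and `execute_stream` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instr:
    name: str
    sid: int                     # stream-id bit stamped at fetch
    mispredicted: bool = False   # only meaningful for branches


def execute_stream(instrs: List[Instr]) -> List[str]:
    """Return the names of the instructions that commit.

    The execute stage keeps its own 1-bit stream-id. An instruction whose
    tag differs from it is on a wrong path and is squashed locally, so no
    global flush signal is needed.
    """
    exec_sid = 0
    committed = []
    for ins in instrs:
        if ins.sid != exec_sid:
            continue              # wrong-path instruction: squash
        committed.append(ins.name)
        if ins.mispredicted:
            exec_sid ^= 1         # toggle; fetch is assumed to do the same
    return committed


stream = [
    Instr("add", 0),
    Instr("br", 0, mispredicted=True),
    Instr("wrong1", 0),   # fetched before the redirect reached fetch
    Instr("wrong2", 0),
    Instr("right1", 1),   # fetched after fetch toggled its stream-id
    Instr("right2", 1),
]
print(execute_stream(stream))  # ['add', 'br', 'right1', 'right2']
```

A single bit is enough in this sketch because at most one stale stream is in flight between a misprediction and the arrival of correctly tagged instructions.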
12
Stream-ID Example
[Figure: instructions tagged with stream-ID 0 flow through Fetch, Decode, Issue and Ex/Mem. On a branch mispredict the stream-ID is toggled from 0 to 1, the wrong-path instructions still tagged 0 are squashed, and execution continues on the right path with stream-ID 1. Legend: committed, squashed.]
13
SNS with Stream-ID
How good is the performance?
[Figure: SN Slice with SID registers, and commit times of a numbered instruction sequence containing a branch (BR) and a register dependency on the 5 stage pipeline versus the SNS pipeline. Annotated delays: 1. Branch induced stall, 2. Data forwarding, 3. Transmission delays.]
14
Simulation Infrastructure
• Trimaran Compiler
• Liberty Simulation Environment

Branch predictor         Global, 16-bit, gshare predictor
Level 1 I/D cache        4-way, 16KB, 1 cycle latency
Level 2 unified cache    8-way, 64KB, 5 cycle latency

[Figure: tool-flow components: Benchmarks, Trimaran, Rebel, Assembler, HPL-PD Assembly, HPL-PD Emulator (FUNCTIONAL), SN Architecture (TIMING), Liberty Simulation Framework.]
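For readers unfamiliar with the gshare scheme named in the configuration, here is a minimal Python sketch of a generic gshare predictor. It reads "16-bit" as 16 bits of global history indexing a table of 2-bit saturating counters, which is an assumption; the class name `GShare` and the initial counter value are likewise illustrative and do not describe the simulator's actual implementation.

```python
class GShare:
    """gshare: the PC XORed with global history indexes 2-bit counters."""

    def __init__(self, history_bits: int = 16) -> None:
        self.mask = (1 << history_bits) - 1
        self.history = 0
        # One 2-bit saturating counter per index, starting weakly not-taken.
        self.counters = [1] * (1 << history_bits)

    def _index(self, pc: int) -> int:
        return (pc ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        # Counter values 2 and 3 predict taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GShare()
before = predictor.predict(0x40)
for _ in range(40):
    predictor.update(0x40, True)   # a branch that is always taken
after = predictor.predict(0x40)
print(before, after)  # False True
```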
15
SNS with StreamID
4X slowdown
16
2. Bypass$ for data forwarding
[Figure: SN Slice with SID registers and the Bypass $ added; each Bypass $ entry holds REG ID and VALUE.]
• Bypass Cache
- Fully associative structure
- FIFO replacement policy
• Key benefits
- Reduced stalls
- Lower bandwidth consumption
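A fully associative structure with FIFO replacement, holding REG ID and VALUE pairs, can be sketched as below. The capacity, the method names and the policy of updating an existing entry in place are assumptions made for illustration; the slides do not specify them.

```python
from collections import OrderedDict
from typing import Optional


class BypassCache:
    """Fully associative register-value cache with FIFO replacement."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[int, int]" = OrderedDict()

    def write(self, reg_id: int, value: int) -> None:
        if reg_id in self.entries:
            # Update in place; FIFO order depends on insertion time only.
            self.entries[reg_id] = value
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[reg_id] = value

    def read(self, reg_id: int) -> Optional[int]:
        # A hit forwards the value locally; a miss (None) means the
        # operand has to come from the register file instead.
        return self.entries.get(reg_id)


cache = BypassCache(capacity=2)
cache.write(1, 10)
cache.write(2, 20)
cache.write(1, 11)   # update: r1 keeps its place in the FIFO
cache.write(3, 30)   # evicts r1, the oldest insertion
print(cache.read(1), cache.read(2), cache.read(3))  # None 20 30
```

Every hit in such a structure is an operand that does not have to travel from the register file, which is where the reduced stalls and lower bandwidth consumption listed above would come from.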
17
SNS with Stream-ID, Bypass$
[Figure: SN Slice with SID registers and Bypass $, and commit times of a numbered instruction sequence containing a branch (BR) and a register dependency on the 5 stage pipeline versus the SNS pipeline. Remaining annotated delays: 2. Data forwarding, 3. Transmission delays.]
18
SNS with StreamID, Bypass$
2X slowdown
19
3. Transmission delay
[Figure: SN Slice with SID registers and Bypass $.]
Multiple cycles for instruction transfer → Low utilization
20
Hide delay with Macro-ops
• Need to improve utilization
– Balance transfer and compute time
• Send instruction bundles
– Macro-ops (MOP)
– Greedy selection policy
• Advantages
– Removes temp. intermediates
– Parallelizes transfer and compute
[Figure: example dataflow graph of >>, LD, +, /, &, << and ST operations grouped into macro-ops. Limits shown: max length 4, max live-ins 2.]
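One way a greedy selection policy with these two limits could work is sketched below in Python. The sketch walks instructions in program order and closes the current macro-op whenever adding the next instruction would exceed the length or live-in limit. The `Op` record, the register names and the exact stopping rule are assumptions; the slides state only that selection is greedy, with a maximum length of 4 and at most 2 live-ins.

```python
from typing import List, NamedTuple, Set, Tuple


class Op(NamedTuple):
    name: str
    dest: str               # "" when the op writes no register (e.g. a store)
    srcs: Tuple[str, ...]


def live_ins(ops: List[Op]) -> Set[str]:
    """Registers read by the bundle before any op inside it writes them."""
    defined: Set[str] = set()
    needed: Set[str] = set()
    for op in ops:
        needed.update(s for s in op.srcs if s not in defined)
        if op.dest:
            defined.add(op.dest)
    return needed


def pack_greedy(
    ops: List[Op], max_len: int = 4, max_live_ins: int = 2
) -> List[List[Op]]:
    """Grow each macro-op in program order until a limit would be exceeded."""
    mops: List[List[Op]] = []
    current: List[Op] = []
    for op in ops:
        candidate = current + [op]
        too_big = len(candidate) > max_len
        too_many_inputs = len(live_ins(candidate)) > max_live_ins
        if current and (too_big or too_many_inputs):
            mops.append(current)
            current = [op]
        else:
            current = candidate
    if current:
        mops.append(current)
    return mops


program = [
    Op("ld", "r1", ("r10",)),
    Op("add", "r2", ("r1", "r11")),
    Op("shr", "r3", ("r2",)),
    Op("st", "", ("r3", "r10")),
    Op("ld", "r4", ("r12",)),
    Op("and", "r5", ("r4", "r13")),
    Op("div", "r6", ("r5", "r14")),
]
for mop in pack_greedy(program):
    print([op.name for op in mop], sorted(live_ins(mop)))
```

In the first bundle of this made-up program, r1, r2 and r3 are produced and consumed inside the macro-op, so they never appear as live-ins, which is the sense in which bundling removes temporary intermediates from the transfers.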
21
SNS with Stream-ID, Bypass$, MOP
[Figure: SN Slice with SID registers, Bypass $ and a Packer, and commit times of a numbered instruction sequence containing a branch (BR) and a register dependency on the 5 stage pipeline versus the SNS pipeline, with instructions grouped into macro-ops. Remaining annotated delay: 3. Transmission delays.]
22
SNS: Final Performance Results
1.1X slowdown
23
SNS: Design Summary
[Figure: final SN Slice with SID registers, Bypass $, Scoreboard, Packer and double buffers between Fetch, Decode, Issue and Ex/Mem.]
• StreamID – SID reg.
• Bypass$ – Bypass$, Scoreboard
• MOPs – Packer, Buffer sizes
~12% area overhead, ~10% perf. overhead
24
SN – Defect Tolerance
[Figure: Traditional CMP versus StageNet CMP as the number of faults (# Faults) increases.]
SN delivers 50% more cumulative work
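The notion of cumulative work, throughput integrated over the lifetime of the chip as faults accumulate, can be made concrete with a toy Python model. Everything in it is an assumption for illustration: each working pipeline contributes one unit of work per unit time, the interconnect is fault-free, and the fault arrival times are made up. The numbers it prints are not the results reported in the talk.

```python
from typing import Callable, List, Tuple

Fault = Tuple[float, int, int]  # (time, core index, stage index)
Hits = List[Tuple[int, int]]    # (core index, stage index) pairs so far


def alive_traditional(num_cores: int, hit: Hits) -> int:
    """A conventional core is lost as soon as any one of its stages fails."""
    return num_cores - len({core for core, _ in hit})


def alive_stagenet(num_cores: int, num_stages: int, hit: Hits) -> int:
    """With stage-level sharing, the scarcest stage type limits throughput."""
    broken = set(hit)
    return min(
        sum((core, stage) not in broken for core in range(num_cores))
        for stage in range(num_stages)
    )


def cumulative_work(
    faults: List[Fault], horizon: float, alive: Callable[[Hits], int]
) -> float:
    """Integrate the number of working pipelines over [0, horizon]."""
    work, prev_t = 0.0, 0.0
    hit: Hits = []
    for t, core, stage in sorted(faults):
        if t >= horizon:
            break
        work += alive(hit) * (t - prev_t)
        hit.append((core, stage))
        prev_t = t
    return work + alive(hit) * (horizon - prev_t)


# Made-up fault arrivals for a 4-core, 4-stage system (not data from the talk).
faults = [(1.0, 0, 0), (2.0, 1, 1), (3.0, 2, 2), (4.0, 3, 0)]
trad = cumulative_work(faults, 6.0, lambda hit: alive_traditional(4, hit))
sn = cumulative_work(faults, 6.0, lambda hit: alive_stagenet(4, 4, hit))
print(trad, sn)  # 10.0 17.0
```

In this toy scenario the conventional design has no working core after the fourth fault, while the stage-sharing design still has two logical pipelines, so its cumulative work keeps growing.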
25
Conclusions
• Architectural innovations are crucial for tackling the high failure rates
• StageNetSlice (SNS) is a potential solution:
– Provides stage level reconfiguration
– Incurs low overheads
• Ongoing work
– SNS design for aggressive cores
– Optimal SN configuration
[Figure: Traditional CMP next to a StageNet CMP built from SNSs.]
26
Thank You
http://cccp.eecs.umich.edu
27
Back up
28
Scoreboard
[Figure: SN Slice with a Scoreboard alongside the Register File; each Scoreboard entry holds REG ID and Valid.]
• Scoreboard to handle RAW dependencies
• Stalls generate backpressure
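A scoreboard with one Valid bit per REG ID can be sketched as follows. The interface (`can_issue`, `issue`, `writeback`) and the choice to also wait on a pending write to the destination register are assumptions made so that a single valid bit per register is sufficient; the slides only state that the scoreboard handles RAW dependencies.

```python
from typing import Iterable


class Scoreboard:
    """One valid bit per register; a cleared bit marks a pending write."""

    def __init__(self, num_regs: int) -> None:
        self.valid = [True] * num_regs

    def can_issue(self, srcs: Iterable[int], dest: int) -> bool:
        # RAW hazard: some source register is still waiting for its value.
        # The destination check is an added assumption (see lead-in).
        return all(self.valid[r] for r in srcs) and self.valid[dest]

    def issue(self, dest: int) -> None:
        self.valid[dest] = False   # result not yet available

    def writeback(self, dest: int) -> None:
        self.valid[dest] = True    # value has reached the register file


sb = Scoreboard(num_regs=8)
sb.issue(dest=1)                       # e.g. r1 = load ...
stalled = not sb.can_issue([1, 2], 3)  # r3 = r1 + r2 must wait for r1
sb.writeback(dest=1)
ready = sb.can_issue([1, 2], 3)
print(stalled, ready)  # True True
```

While `can_issue` returns False the issue stage accepts no new instruction, so its input buffer fills and the upstream stages have to stop sending, which is the backpressure the slide refers to.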
29
Area overhead breakdown
Router area for 32 and 64 bit configurations
30
Architectural Details
31
Varying Crossbar Width
1.35X slowdown
32
Stage modifications for SNS
33
SN – Throughput
4X
34
SN – Cumulative Work
50%
35