CSE502: Computer Architecture
CSE 502:
Computer Architecture
Core Pipelining
CSE502: Computer Architecture
Before there was pipelining…
Single-cycle: | insn0.(fetch,decode,exec) | insn1.(fetch,decode,exec) |
Multi-cycle:  | insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec |
time →
• Single-cycle control: hardwired
– Low CPI (1)
– Long clock period (to accommodate slowest instruction)
• Multi-cycle control: micro-programmed
– Short clock period
– High CPI
• Can we have both low CPI and short clock period?
CSE502: Computer Architecture
Pipelining
Multi-cycle: | insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec |
Pipelined:
  insn0:  fetch  dec    exec
  insn1:         fetch  dec    exec
  insn2:                fetch  dec    exec
  time →
• Start with multi-cycle design
• When insn0 goes from stage 1 to stage 2
… insn1 starts stage 1
• Each instruction passes through all stages
… but instructions enter and leave at faster rate
Can have as many insns in flight as there are stages
CSE502: Computer Architecture
Pipeline Examples
[Figure: the same lookup (address → comparators "= = = =" → hit?) built as one, two, and three pipeline stages]
1 stage:  Stage delay = n,   Bandwidth ≈ 1/n
2 stages: Stage delay = n/2, Bandwidth ≈ 2/n
3 stages: Stage delay = n/3, Bandwidth ≈ 3/n
Increases throughput at the expense of latency
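The stage-delay and bandwidth figures on this slide can be checked with a small model. This is only a sketch under stated assumptions: the stages are perfectly balanced, and `latch` is a hypothetical per-stage pipeline-register overhead that is not given on the slide. It is included because it is one common reason latency grows as stages are added. The function names and the value n = 12 are illustrative.

```python
def stage_delay(n, k, latch=0.0):
    """Delay of one stage when logic of total delay n is cut into k equal stages.

    latch is an assumed per-stage register overhead (not from the slides).
    """
    return n / k + latch


def bandwidth(n, k, latch=0.0):
    """Results produced per unit time once the pipeline is full."""
    return 1.0 / stage_delay(n, k, latch)


def latency(n, k, latch=0.0):
    """Time for a single item to pass through all k stages."""
    return k * stage_delay(n, k, latch)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, stage_delay(12, k, latch=1), bandwidth(12, k, latch=1),
              latency(12, k, latch=1))
```

With `latch=0` the latency stays at n while bandwidth scales as k/n. With any positive overhead, each added stage raises the latency by that overhead, which matches the slide's throughput-for-latency trade.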
CSE502: Computer Architecture
Processor Pipeline Review
Fetch → Decode → Execute → Memory → (Write-back)
[Figure: datapath with PC and +4 adder, I-cache, Reg File, ALU, D-cache]
CSE502: Computer Architecture
Stage 1: Fetch
• Fetch an instruction from memory every cycle
– Use PC to index memory
– Increment PC (assume no branches for now)
• Write state to the pipeline register (IF/ID)
– The next stage will read this pipeline register
CSE502: Computer Architecture
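Stage 1: Fetch Diagram
[Figure: a MUX selects between PC + 1 and target as the next PC; the PC (with enable) indexes the Instruction Cache; the instruction bits and PC + 1 are written to the IF / ID pipeline register (with enable), which feeds Decode]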
CSE502: Computer Architecture
Stage 2: Decode
• Decodes opcode bits
– Set up Control signals for later stages
• Read input operands from register file
– Specified by decoded instruction bits
• Write state to the pipeline register (ID/EX)
– Opcode
– Register contents
– PC+1 (even though decode didn’t use it)
– Control signals (from insn) for opcode and destReg
CSE502: Computer Architecture
Stage 2: Decode Diagram
[Figure: instruction bits from the IF / ID pipeline register (written by Fetch) select regA and regB in the Register File, which also has destReg, data, and en write inputs; the ID / EX pipeline register holds the control signals, regA contents, regB contents, and PC + 1 for Execute]
CSE502: Computer Architecture
Stage 3: Execute
• Perform ALU operations
– Calculate result of instruction
• Control signals select operation
• Contents of regA used as one input
• Either regB or constant offset (from insn) used as second input
– Calculate PC-relative branch target
• PC+1+(constant offset)
• Write state to the pipeline register (EX/Mem)
– ALU result, contents of regB, and PC+1+offset
– Control signals (from insn) for opcode and destReg
CSE502: Computer Architecture
Stage 3: Execute Diagram
[Figure: the ID / EX pipeline register (written by Decode) supplies control signals, regA contents, regB contents, and PC + 1; a MUX picks the second ALU input; an adder forms the target PC+1+offset; the EX/Mem pipeline register holds the control signals, ALU result, regB contents, and target for Memory]
CSE502: Computer Architecture
Stage 4: Memory
• Perform data cache access
– ALU result contains address for LD or ST
– Opcode bits control R/W and enable signals
• Write state to the pipeline register (Mem/WB)
– ALU result and Loaded data
– Control signals (from insn) for opcode and destReg
CSE502: Computer Architecture
Stage 4: Memory Diagram
[Figure: the EX/Mem pipeline register (written by Execute) supplies the ALU result as in_addr and regB contents as in_data to the Data Cache, with en and R/W driven by the control signals; the Mem/WB pipeline register holds the control signals, ALU result, and Loaded data for Write-back; PC+1+offset leaves as target]
CSE502: Computer Architecture
Stage 5: Write-back
• Writing result to register file (if required)
– Write Loaded data to destReg for LD
– Write ALU result to destReg for arithmetic insn
– Opcode bits control register write enable signal
CSE502: Computer Architecture
Stage 5: Write-back Diagram
[Figure: the Mem/WB pipeline register (written by Memory) supplies Loaded data, ALU result, and control signals; MUXes select the data and destReg sent back to the register file]
CSE502: Computer Architecture
Putting It All Together
[Figure: complete five-stage datapath: PC, Inst Cache, Register file (R0 to R7, with R0 = 0), ALU, eq? comparator, Data Cache, and MUXes, separated by the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers, which carry PC+1, target, valA, valB, offset, ALU result, mdata, dest, and op]
CSE502: Computer Architecture
Pipelining Idealism
• Uniform Sub-operations
– Operation can be partitioned into uniform-latency sub-ops
• Repetition of Identical Operations
– Same ops performed on many different inputs
• Repetition of Independent Operations
– All repetitions of op are mutually independent
CSE502: Computer Architecture
Pipeline Realism
• Uniform Sub-operations … NOT!
– Balance pipeline stages
• Stage quantization to yield balanced stages
• Minimize internal fragmentation (left-over time near end of cycle)
• Repetition of Identical Operations … NOT!
– Unifying instruction types
• Coalescing instruction types into one “multi-function” pipe
• Minimize external fragmentation (idle stages to match length)
• Repetition of Independent Operations … NOT!
– Resolve data and resource hazards
• Inter-instruction dependency detection and resolution
Pipelining is expensive
CSE502: Computer Architecture
The Generic Instruction Pipeline
Instruction Fetch
IF
Instruction Decode
ID
Operand Fetch
OF
Instruction Execute
EX
Write-back
WB
CSE502: Computer Architecture
Balancing Pipeline Stages
IF: TIF = 6 units
ID: TID = 2 units
OF: TOF = 9 units
EX: TEX = 5 units
WB: TOS = 9 units
Without pipelining: Tcyc ≥ TIF + TID + TOF + TEX + TOS = 31
Pipelined: Tcyc ≥ max{TIF, TID, TOF, TEX, TOS} = 9
Speedup = 31 / 9
Can we do better?
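The speedup arithmetic on this slide can be reproduced in a few lines. This is a sketch that assumes the five sub-operation latencies are 6, 2, 9, 5, and 9 units (they sum to the slide's 31) and ignores pipeline-register overhead. The dictionary and function names are illustrative.

```python
# Sub-operation latencies in arbitrary time units (assumed values summing to 31).
STAGE_LATENCY = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}


def unpipelined_cycle(latencies):
    """One long cycle must cover every sub-operation in sequence."""
    return sum(latencies.values())


def pipelined_cycle(latencies):
    """The clock is set by the slowest stage."""
    return max(latencies.values())


def ideal_speedup(latencies):
    """Throughput speedup once the pipeline is full."""
    return unpipelined_cycle(latencies) / pipelined_cycle(latencies)


if __name__ == "__main__":
    print(unpipelined_cycle(STAGE_LATENCY), pipelined_cycle(STAGE_LATENCY),
          round(ideal_speedup(STAGE_LATENCY), 2))
```

A perfectly balanced five-stage split would give a speedup of 5. The shortfall here comes entirely from the 2-unit and 5-unit stages idling for part of each 9-unit cycle.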
CSE502: Computer Architecture
Balancing Pipeline Stages (1/2)
• Two methods for stage quantization
– Merge multiple sub-ops into one
– Divide sub-ops into smaller pieces
• Recent/Current trends
– Deeper pipelines (more and more stages)
– Multiple different pipelines/sub-pipelines
– Pipelining of memory accesses
CSE502: Computer Architecture
Balancing Pipeline Stages (2/2)
Coarser-Grained Machine Cycle: 4 machine cyc / instruction
  IF&ID: TIF&ID = 8 units
  OF:    TOF = 9 units
  EX:    TEX = 5 units
  WB:    TOS = 9 units
  # stages = 4, Tcyc = 9 units
Finer-Grained Machine Cycle: 11 machine cyc / instruction
  IF IF | ID | OF OF OF | EX EX | WB WB WB
  # stages = 11, Tcyc = 3 units
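Both quantizations on this slide can be derived mechanically from the sub-operation latencies. The sketch below assumes latencies of 6, 2, 9, 5, and 9 units, merges IF with ID for the coarse design, and cuts every sub-op into 3-unit pieces for the fine design. The helper names are hypothetical, and "fragmentation" here means the slack left over across one instruction's stages.

```python
import math

# Assumed sub-operation latencies (units), as used in the balancing example.
SUBOP_LATENCY = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}


def split_into_stages(latencies, t_cyc):
    """Stages needed per sub-op when each is cut into t_cyc-long pieces."""
    return {name: math.ceil(t / t_cyc) for name, t in latencies.items()}


def fragmentation(num_stages, t_cyc, latencies):
    """Idle time per instruction: total stage time minus useful work."""
    return num_stages * t_cyc - sum(latencies.values())


# Coarse design: merge IF and ID into a single 8-unit stage.
coarse_stages = [SUBOP_LATENCY["IF"] + SUBOP_LATENCY["ID"],
                 SUBOP_LATENCY["OF"], SUBOP_LATENCY["EX"], SUBOP_LATENCY["OS"]]
coarse_t_cyc = max(coarse_stages)

# Fine design: subdivide every sub-op into 3-unit stages.
fine_split = split_into_stages(SUBOP_LATENCY, 3)
fine_num_stages = sum(fine_split.values())

if __name__ == "__main__":
    print(len(coarse_stages), coarse_t_cyc,
          fragmentation(len(coarse_stages), coarse_t_cyc, SUBOP_LATENCY))
    print(fine_num_stages, 3, fragmentation(fine_num_stages, 3, SUBOP_LATENCY))
```

Under these assumptions the fine-grained design triples the peak issue rate (one instruction per 3 units instead of 9) and wastes less time per instruction, at the cost of eleven pipeline registers and a longer window in which hazards can occur.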
CSE502: Computer Architecture
Pipeline Examples
Generic   MIPS R2000/R3000   AMDAHL 470V/7
IF        IF                 PC GEN, Cache Read, Cache Read
ID        RD                 Decode
OF        RD                 Read REG, Addr GEN, Cache Read, Cache Read
EX        ALU                EX 1, EX 2
WB        MEM, WB            Check Result, Write Result
CSE502: Computer Architecture
Instruction Dependencies (1/2)
• Data Dependence
– Read-After-Write (RAW) (only true dependence)
• Read must wait until earlier write finishes
– Anti-Dependence (WAR)
• Write must wait until earlier read finishes (avoid clobbering)
– Output Dependence (WAW)
• Earlier write can’t overwrite later write
• Control Dependence (a.k.a. Procedural Dependence)
– Branch condition must execute before branch target
– Instructions after branch cannot run before branch
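The three data-dependence types above can be told apart purely from the register sets two instructions read and write. The following is a minimal sketch. The `Insn` record and the sample instructions are made up for illustration, and the check deliberately ignores any instructions in between (a later redefinition of a register would hide some of these dependences).

```python
from typing import FrozenSet, NamedTuple, Set


class Insn(NamedTuple):
    text: str                # human-readable form, for printing only
    writes: FrozenSet[str]   # registers written
    reads: FrozenSet[str]    # registers read


def dependences(earlier: Insn, later: Insn) -> Set[str]:
    """Classify the register dependences of `later` on `earlier`."""
    kinds = set()
    if earlier.writes & later.reads:
        kinds.add("RAW")   # true dependence: later reads what earlier wrote
    if earlier.reads & later.writes:
        kinds.add("WAR")   # anti-dependence: later overwrites an earlier source
    if earlier.writes & later.writes:
        kinds.add("WAW")   # output dependence: both write the same register
    return kinds


add = Insn("r3 = r1 + r2", frozenset({"r3"}), frozenset({"r1", "r2"}))
use = Insn("r5 = r3 & r4", frozenset({"r5"}), frozenset({"r3", "r4"}))
clobber = Insn("r1 = r6 + r7", frozenset({"r1"}), frozenset({"r6", "r7"}))
rewrite = Insn("r3 = r8 + r9", frozenset({"r3"}), frozenset({"r8", "r9"}))

if __name__ == "__main__":
    for later in (use, clobber, rewrite):
        print(add.text, "->", later.text, sorted(dependences(add, later)))
```

Only the RAW case carries a value from one instruction to the next. WAR and WAW arise from reusing a register name, which is why they are not counted as true dependences.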
CSE502: Computer Architecture
Instruction Dependencies (2/2)
# for (;(j<high)&&(array[j]<array[low]);++j);
        bge   j, high, $36
        mul   $15, j, 4
        addu  $24, array, $15
        lw    $25, 0($24)
        mul   $13, low, 4
        addu  $14, array, $13
        lw    $15, 0($14)
        bge   $25, $15, $36
$35:
        addu  j, j, 1
        ...
$36:
        addu  $11, $11, -1
        ...
Real code has lots of dependencies
Real code has lots of dependencies
CSE502: Computer Architecture
Hardware Dependency Analysis
• Processor must handle
– Register Data Dependencies (same register)
• RAW, WAW, WAR
– Memory Data Dependencies (same address)
• RAW, WAW, WAR
– Control Dependencies
CSE502: Computer Architecture
Pipeline Terminology
• Pipeline Hazards
– Potential violations of program dependencies
– Must ensure program dependencies are not violated
• Hazard Resolution
– Static method: performed at compile time in software
– Dynamic method: performed at runtime using hardware
– Two options: Stall (costs perf.) or Forward (costs hw.)
• Pipeline Interlock
– Hardware mechanism for dynamic hazard resolution
– Must detect and enforce dependencies at runtime
CSE502: Computer Architecture
Pipeline: Steady State
           t0   t1   t2   t3   t4   t5   …
Inst j     IF   ID   RD   ALU  MEM  WB
Inst j+1        IF   ID   RD   ALU  MEM  WB
Inst j+2             IF   ID   RD   ALU  MEM  WB
Inst j+3                  IF   ID   RD   ALU  MEM  WB
Inst j+4                       IF   ID   RD   ALU  MEM  WB
CSE502: Computer Architecture
Pipeline: Data Hazard
[Figure: pipeline timing diagram for Inst j to Inst j+4 over cycles t0 to t5, stages IF, ID, RD, ALU, MEM, WB, with every instruction starting one cycle after its predecessor]
CSE502: Computer Architecture
Option 1: Stall on Data Hazard
[Figure: pipeline timing diagram for Inst j to Inst j+4 over cycles t0 to t5 (stages IF, ID, RD, ALU, MEM, WB); the dependent instruction is "Stalled in RD", and the instructions behind it are "Stalled in ID" and "Stalled in IF" until it can proceed]
CSE502: Computer Architecture
Option 2: Forwarding Paths (1/3)
[Figure: pipeline timing diagram for Inst j to Inst j+4 over cycles t0 to t5 (stages IF, ID, RD, ALU, MEM, WB), with forwarding paths drawn out of the ALU and MEM stages]
Many possible paths
Requires stalling even with forwarding paths
CSE502: Computer Architecture
Option 2: Forwarding Paths (2/3)
[Figure: pipeline IF, ID, Register File (src1, src2, dest), ALU, MEM, WB, with forwarding paths back to the ALU inputs]
CSE502: Computer Architecture
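Option 2: Forwarding Paths (3/3)
[Figure: pipeline IF, ID, Register File (src1, src2, dest), ALU, MEM, WB, with "=" comparators matching src1 and src2 against the dest of instructions in later stages]
Deeper pipeline may require additional forwarding paths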
CSE502: Computer Architecture
Pipeline: Control Hazard
[Figure: pipeline timing diagram for Inst i to Inst i+4 over cycles t0 to t5, stages IF, ID, RD, ALU, MEM, WB, with every instruction starting one cycle after its predecessor]
CSE502: Computer Architecture
Pipeline: Stall on Control Hazard
[Figure: pipeline timing diagram for Inst i to Inst i+4 over cycles t0 to t5 (stages IF, ID, RD, ALU, MEM, WB); the instructions after the branch are "Stalled in IF" until the branch resolves]
CSE502: Computer Architecture
Pipeline: Prediction for Control Hazards
[Figure: pipeline timing diagram for Inst i to Inst i+4 over cycles t0 to t5; the instructions fetched speculatively after the branch are turned into nops (Speculative State Cleared), and fetch restarts with New Inst i+2, New Inst i+3, and New Inst i+4 (Fetch Resteered)]
CSE502: Computer Architecture
Going Beyond Scalar
• Scalar pipeline limited to CPI ≥ 1.0
– Can never run more than 1 insn. per cycle
• “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
– Superscalar means executing multiple insns. in parallel
CSE502: Computer Architecture
Architectures for Instruction Parallelism
• Scalar pipeline (baseline)
– Instruction/overlap parallelism = D
– Operation Latency = 1
– Peak IPC = 1.0
[Figure: D different instructions overlapped; successive instructions vs. time in cycles (1 to 12)]
CSE502: Computer Architecture
Superscalar Machine
• Superscalar (pipelined) Execution
– Instruction parallelism = D x N
– Operation Latency = 1
– Peak IPC = N per cycle
[Figure: D x N different instructions overlapped, N issued together; successive instructions vs. time in cycles (1 to 12)]
CSE502: Computer Architecture
Superscalar Example: Pentium
Prefetch: 4× 32-byte buffers
Decode1: Decode up to 2 insts
Decode2 (one per pipe): Read operands, Addr comp
Execute (one per pipe): Asymmetric pipes
Writeback (one per pipe)
u-pipe only: shift, rotate, some FP
v-pipe only: jmp, jcc, call, fxch
both: mov, lea, simple ALU, push/pop, test/cmp
CSE502: Computer Architecture
Pentium Hazards & Stalls
• “Pairing Rules” (when can’t two insns exec?)
– Read/flow dependence
• mov eax, 8
• mov [ebp], eax
– Output dependence
• mov eax, 8
• mov eax, [ebp]
– Partial register stalls
• mov al, 1
• mov ah, 0
– Function unit rules
• Some instructions can never be paired
– MUL, DIV, PUSHA, MOVS, some FP
CSE502: Computer Architecture
Limitations of In-Order Pipelines
• If the machine parallelism is increased
– … dependencies reduce performance
– CPI of in-order pipelines degrades sharply
• As N approaches avg. distance between dependent instructions
• Forwarding is no longer effective
– Must stall often
In-order pipelines are rarely full
CSE502: Computer Architecture
The In-Order N-Instruction Limit
• On average, parent-child separation is about ± 5 insn.
– (Franklin and Sohi ’92)
Ex. Superscalar degree N = 4
Any dependency
between these
instructions will
cause a stall
Dependent insn
must be N = 4
instructions away
Average of 5 means there are many
cases when the separation is < 4…
each of these limits parallelism
Reasonable in-order superscalar is effectively N=2
CSE502: Computer Architecture
In Search of Parallelism
• “Trivial” Parallelism is limited
– What is trivial parallelism?
• In-order: sequential instructions do not have dependencies
• In all previous examples, all instructions executed either at the
same time or after earlier instructions
– previous slides show that superscalar execution quickly
hits a ceiling
• So what is “non-trivial” parallelism? …
CSE502: Computer Architecture
What is Parallelism?
• Work
– T1: time to complete a computation on a sequential system
• Critical Path
– T∞: time to complete the same computation on an infinitely-parallel system
• Average Parallelism
– Pavg = T1 / T∞
• For a p-wide system
– Tp ≥ max{T1/p, T∞}
– Pavg >> p ⇒ Tp ≈ T1/p
Example:
x = a + b;
y = b * 2
z = (x-y) * (x+y)
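Work and critical path can be computed for the slide's example (x = a + b; y = b * 2; z = (x-y) * (x+y)) by treating it as a dependence graph. This is a sketch under two assumptions that the slide does not state: every operation takes one time unit, and the statement for z is split into its three arithmetic operations. The graph encoding and function names are illustrative.

```python
# Each operation maps to the operations whose results it needs.
DEPS = {
    "x = a + b": [],
    "y = b * 2": [],
    "x - y": ["x = a + b", "y = b * 2"],
    "x + y": ["x = a + b", "y = b * 2"],
    "z = (x-y) * (x+y)": ["x - y", "x + y"],
}


def work(deps):
    """T1: total unit-latency operations, i.e. time on a sequential machine."""
    return len(deps)


def critical_path(deps):
    """T-infinity: length of the longest dependence chain."""
    depth = {}

    def depth_of(op):
        if op not in depth:
            depth[op] = 1 + max((depth_of(p) for p in deps[op]), default=0)
        return depth[op]

    return max(depth_of(op) for op in deps)


def average_parallelism(deps):
    return work(deps) / critical_path(deps)


def time_lower_bound(deps, p):
    """Tp can be no smaller than max(T1/p, T-infinity)."""
    return max(work(deps) / p, critical_path(deps))


if __name__ == "__main__":
    print(work(DEPS), critical_path(DEPS), average_parallelism(DEPS))
    print([time_lower_bound(DEPS, p) for p in (1, 2, 4)])
```

Under these assumptions the example has five units of work on a three-step critical path, so widening the machine beyond two units cannot shorten it further.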
CSE502: Computer Architecture
ILP: Instruction-Level Parallelism
• ILP is a measure of the amount of interdependencies between instructions
• Average ILP = num instructions / longest path
code1: ILP = 1 (must execute serially)
  T1 = 3, T∞ = 3
  r1 ← r2 + 1
  r3 ← r1 / 17
  r4 ← r0 - r3
code2: ILP = 3 (can execute at the same time)
  T1 = 3, T∞ = 1
  r1 ← r2 + 1
  r3 ← r9 / 17
  r4 ← r0 - r10
CSE502: Computer Architecture
ILP != IPC
• Instruction level parallelism usually assumes infinite
resources, perfect fetch, and unit-latency for all
instructions
• ILP is more a property of the program dataflow
• IPC is the “real” observed metric of exactly how
many instructions are executed per machine cycle,
which includes all of the limitations of a real machine
• The ILP of a program is an upper-bound on the
attainable IPC
CSE502: Computer Architecture
Scope of ILP Analysis
ILP=1:
  r1 ← r2 + 1
  r3 ← r1 / 17
  r4 ← r0 - r3
ILP=3:
  r11 ← r12 + 1
  r13 ← r19 / 17
  r14 ← r0 - r20
Both sequences analyzed together: ILP=2
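The ILP values quoted for these sequences follow from "instructions divided by longest dependence chain". The sketch below assumes unit latency and counts only true (RAW) register dependences. Instructions are encoded as hypothetical (destination, sources) tuples, with constants omitted.

```python
def ilp(program):
    """Average ILP = instruction count / longest RAW dependence chain.

    program: list of (dest_reg, [source_regs]) in program order.
    """
    last_writer = {}   # register -> index of the most recent writer
    depth = []         # depth[i] = chain length ending at instruction i
    for i, (dest, srcs) in enumerate(program):
        producers = [last_writer[r] for r in srcs if r in last_writer]
        depth.append(1 + max((depth[p] for p in producers), default=0))
        last_writer[dest] = i
    return len(program) / max(depth)


serial = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
independent = [("r11", ["r12"]), ("r13", ["r19"]), ("r14", ["r0", "r20"])]

if __name__ == "__main__":
    print(ilp(serial), ilp(independent), ilp(serial + independent))
```

Analyzing the six instructions as one window gives 6 / 3 = 2, so the measured ILP depends on how large a scope is examined.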
CSE502: Computer Architecture
DFG Analysis
• A: R1 = R2 + R3
• B: R4 = R5 + R6
• C: R1 = R1 * R4
• D: R7 = LD 0[R1]
• E: BEQZ R7, +32
• F: R4 = R7 - 3
• G: R1 = R1 + 1
• H: R4 → ST 0[R1]
• J: R1 = R1 – 1
• K: R3 → ST 0[R1]
CSE502: Computer Architecture
In-Order Issue, Out-of-Order Completion
[Figure: an in-order instruction stream begins execution in order and is sent to parallel functional-unit pipelines (INT; Fadd1, Fadd2; Fmul1, Fmul2, Fmul3; Ld/St), which complete out of order]
Issue = send an instruction
to execution
Issue stage needs to check:
1. Structural Dependence
2. RAW Hazard
3. WAW Hazard
4. WAR Hazard
CSE502: Computer Architecture
Example
A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 – 1
K: R3 → ST 0[R1]
[Figure: dataflow graph of instructions A to K and the resulting issue schedule over cycles 1 to 8]
IPC = 10/8 = 1.25
CSE502: Computer Architecture
Example (2)
A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R9 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R9]
J: R1 = R9 – 1
K: R3 → ST 0[R1]
[Figure: dataflow graph of instructions A to K and the resulting issue schedule over cycles 1 to 7]
IPC = 10/7 = 1.43
CSE502: Computer Architecture
Track with Simple Scoreboarding
• Scoreboard: a bit-array, 1-bit for each GPR
– If the bit is not set: the register has valid data
– If the bit is set: the register has stale data
• i.e., some outstanding instruction is going to change it
• Issue in Order: RD ← Fn (RS, RT)
– If SB[RS] or SB[RT] is set ⇒ RAW, stall
– If SB[RD] is set ⇒ WAW, stall
– Else, dispatch to FU (Fn) and set SB[RD]
• Complete out-of-order
– Update GPR[RD], clear SB[RD]
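The issue and completion rules above map almost directly onto code. This is a minimal sketch of the one-bit-per-register scoreboard only. The class and method names are illustrative, functional units and latencies are not modeled, and the caller is assumed to retry a stalled instruction on a later cycle.

```python
class Scoreboard:
    """One pending bit per GPR: set while an in-flight insn will write it."""

    def __init__(self, num_regs=8):
        self.pending = [False] * num_regs

    def try_issue(self, rd, rs, rt):
        """Apply the in-order issue checks for RD <- Fn(RS, RT)."""
        if self.pending[rs] or self.pending[rt]:
            return "stall: RAW"   # a source is still being produced
        if self.pending[rd]:
            return "stall: WAW"   # an older write to RD is outstanding
        self.pending[rd] = True   # dispatch to the FU and mark RD busy
        return "issued"

    def complete(self, rd):
        """Out-of-order completion: the result is written, so clear the bit."""
        self.pending[rd] = False


if __name__ == "__main__":
    sb = Scoreboard()
    print(sb.try_issue(rd=1, rs=2, rt=3))   # first writer of R1
    print(sb.try_issue(rd=4, rs=1, rt=5))   # reads R1 too early
    print(sb.try_issue(rd=1, rs=5, rt=6))   # second writer of R1
    sb.complete(1)
    print(sb.try_issue(rd=4, rs=1, rt=5))   # R1 is now valid
```

No WAR check is needed in this scheme because issue is in order and operands are read at issue, so a later writer can never overtake an earlier reader.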
CSE502: Computer Architecture
Out-of-Order Issue
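[Figure: an in-order instruction stream enters an extra stage of dependency-resolution (DR) buffers, one in front of each functional-unit pipeline (INT; Fadd1, Fadd2; Fmul1, Fmul2, Fmul3; Ld/St); execution starts out of program order and completion is out of order]
Need an extra stage/buffers for Dependency Resolution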
CSE502: Computer Architecture
OOO Scoreboarding
• Similar to In-Order scoreboarding
– Need new tables to track status of individual instructions and
functional units
– Still enforce dependencies
• Stall dispatch on WAW
• Stall issue on RAW
• Stall completion on WAR
• Limitations of Scoreboarding?
• Hints
– No structural hazards
– Can always write a RAW-free code sequence
• Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
– Think about x86 ISA with only 8 registers
Finite number of registers in any ISA will force you to reuse register names at some point ⇒ WAR, WAW ⇒ stalls
CSE502: Computer Architecture
Lessons thus Far
• More out-of-orderness ⇒ More ILP exposed
– But more hazards
• Stalling is a generic technique to ensure sequencing
• RAW stall is a fundamental requirement (?)
• Compiler analysis and scheduling can help (not covered in this course)
CSE502: Computer Architecture
Ex. Tomasulo’s Algorithm [IBM 360/91, 1967]
[Figure: IBM 360/91 floating-point unit. The Storage Bus and Instruction Unit feed the Floating Point Buffers (FLB, entries 1 to 6) and the Floating Point Operand Stack (FLOS); a Decoder and the Floating Point Registers (FLR, registers 0, 2, 4, 8, each with Tags and Busy Bits) feed the FLB Bus and FLR Bus; Tag/Sink and Tag/Source/Ctrl. entries sit in front of the Adder (3 entries) and the Multiply/Divide unit (2 entries); Store Data Buffers (SDB, 3 entries) hold Tag and data; results return on the Common Data Bus (CDB)]
CSE502: Computer Architecture
FYI: Historical Note
• Tomasulo’s algorithm (1967) was not the first
• Also at IBM, Lynn Conway proposed multi-issue
dynamic instruction scheduling (OOO) in Feb 1966
– Ideas got buried due to internal politics, changing project
goals, etc.
– But it’s still the first (as far as I know)
CSE502: Computer Architecture
Modern Enhancements to Tomasulo’s Algorithm
                  Tomasulo                   Modern
Machine Width     Peak IPC = 1               Peak IPC = 6+
Structural Deps   2 FP FU’s, Single CDB      6-10+ FU’s, Many forwarding buses
Anti-Deps         Operand copying            Renamed registers
Output-Deps       RS Tag                     Renamed registers
True Deps         Tag-based forwarding       Tag-based forwarding
Exceptions        Imprecise                  Precise (requires ROB)
CSE502: Computer Architecture
Balancing Pipeline Stages
IF: TIF = 6 units
ID: TID = 2 units
EX: TEX = 9 units
MEM: TMEM = 5 units
WB: TWB = 9 units
Without pipelining: Tcyc ≥ TIF + TID + TEX + TMEM + TWB = 31
Pipelined: Tcyc ≥ max{TIF, TID, TEX, TMEM, TWB} = 9
Speedup = 31 / 9
Can we do better in terms of either performance or efficiency?
CSE502: Computer Architecture
Balancing Pipeline Stages
Two Methods for Stage Quantization:
– Merging of multiple stages
– Further subdividing a stage
Recent Trends:
– Deeper pipelines (more and more stages)
• Pipeline depth growing more slowly since Pentium 4. Why?
– Multiple pipelines (subpipelines)
– Pipelined memory/cache accesses (tricky)
CSE502: Computer Architecture
The Cost of Deeper Pipelines
Instruction pipelines are not ideal,
i.e., instructions in different stages can have dependencies
Suppose
  add 1 2 3
  nand 3 4 5
RAW!!
[Figure: timing diagram over cycles t0 to t5 for Inst0 (add) and Inst1 (nand) moving through stages F, D, E, M, W, with a Stall inserted for the dependent nand]
CSE502: Computer Architecture
Types of Dependencies and Hazards
Data Dependence (Both memory and register)
– True dependence (RAW)
Instruction must wait for all required input operands
– Anti-Dependence (WAR)
Later write must not clobber a still-pending earlier read
– Output dependence (WAW)
Earlier write must not clobber already-completed later write
Control Dependence (aka Procedural Dependence)
– Conditional branches may change instruction sequence
– Instructions after cond. branch depend on outcome
(more exact definition later)
CSE502: Computer Architecture
Terminology
Pipeline Hazards:
– Potential violations of program dependences
– Must ensure program dependences are not violated
Hazard Resolution:
– Static Method: Performed at compile time in software
– Dynamic Method: Performed at run time using hardware
Pipeline Interlock:
– Hardware mechanisms for dynamic hazard resolution
– Must detect and enforce dependences at run time
CSE502: Computer Architecture
Necessary Conditions for Data Hazards
WAW Hazard: i: rk ← _ (Reg Write), j: rk ← _ (Reg Write)
WAR Hazard: i: _ ← rk (Reg Read), j: rk ← _ (Reg Write)
RAW Hazard: i: rk ← _ (Reg Write), j: _ ← rk (Reg Read)
[Figure: instructions i and j access register rk in pipeline stages X and Y; the separation of those stages is the Hazard Distance]
dist(i,j) ≤ dist(X,Y) ⇒ Hazard!!
dist(i,j) > dist(X,Y) ⇒ Safe
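The hazard-distance condition can be turned into a stall count. The sketch below reads the first condition as dist(i,j) ≤ dist(X,Y) and adds assumptions of its own: the later instruction j uses the register in stage X, the earlier instruction i uses it in stage Y, stages are numbered from 0, and a value written in a cycle is not visible to a read in that same cycle. The six-stage layout is borrowed from the timing diagrams earlier in the deck. All names are illustrative.

```python
STAGES = ["IF", "ID", "RD", "ALU", "MEM", "WB"]


def hazard_distance(stage_x, stage_y):
    """dist(X, Y): how many stages later i touches the register than j does."""
    return stage_y - stage_x


def is_hazard(dist_ij, stage_x, stage_y):
    """Hazard when j follows i by no more than the stage distance."""
    return dist_ij <= hazard_distance(stage_x, stage_y)


def stall_cycles(dist_ij, stage_x, stage_y):
    """Bubbles needed so j reaches stage X strictly after i leaves stage Y."""
    return max(0, hazard_distance(stage_x, stage_y) - dist_ij + 1)


# RAW example: registers are read in RD and written in WB.
RD = STAGES.index("RD")
WB = STAGES.index("WB")

if __name__ == "__main__":
    for d in range(1, 6):
        print(d, is_hazard(d, RD, WB), stall_cycles(d, RD, WB))
```

With reads in RD and writes in WB the distance is 3, so under these assumptions a consumer immediately behind its producer waits three cycles and a consumer four or more instructions behind never stalls.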
CSE502: Computer Architecture
Handling Data Hazards
Avoidance (static)
– Make sure there are no hazards in the code
Detect and Stall (dynamic)
– Stall until earlier instructions finish
Detect and Forward (dynamic)
– Get correct value from elsewhere in pipeline
CSE502: Computer Architecture
Handling Data Hazards:
Avoidance
Programmer/compiler must know implementation
details
– Insert nops between dependent instructions
add 1 2 3
nop
nop
nand 3 4 5
write R3 in cycle 5
read R3 in cycle 6
CSE502: Computer Architecture
Problems with Avoidance
Binary compatibility
– New implementations may require more nops
Code size
– Higher instruction cache footprint
– Longer binary load times
– Worse in machines that execute multiple instructions /
cycle
• Intel Itanium – 25-40% of instructions are nops
Slower execution
– CPI=1, but many instructions are nops
CSE502: Computer Architecture
Handling Data Hazards:
Detect & Stall
Detection
– Compare regA & regB with DestReg of preceding insn.
• 3 bit comparators
Stall
– Do not advance pipeline register for Fetch/Decode
– Pass nop to Execute
CSE502: Computer Architecture
Problems with Detect & Stall
CPI increases on every hazard
Are these stalls necessary? Not always!
– The new value for R3 is in the EX/Mem register
– Reroute the result to the nand
• Called “forwarding” or “bypassing”
CSE502: Computer Architecture
Handling Data Hazards:
Detect & Forward
Detection
– Same as detect and stall, but…
• each possible hazard requires different forwarding paths
Forward
– Add data paths for all possible sources
– Add mux in front of ALU to select source
“bypassing logic” often a critical path in wide-issue machines
– # paths grows quadratically with machine width
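The "mux in front of ALU" can be sketched as a priority choice among the possible sources of an operand. This is an illustration only. The `Latch` record is a hypothetical stand-in for the destReg and result fields of the EX/Mem and Mem/WB pipeline registers, and it does not model a load whose data is not yet available in EX/Mem (the case that still needs a stall).

```python
from typing import List, NamedTuple, Optional


class Latch(NamedTuple):
    dest: Optional[int]   # destReg of the insn held here, None if it writes nothing
    value: int            # result that insn will eventually write back


def read_operand(src: int, regfile: List[int], ex_mem: Latch, mem_wb: Latch) -> int:
    """Select the newest in-flight value for register `src`."""
    if ex_mem.dest == src:      # comparator 1: youngest producer wins
        return ex_mem.value
    if mem_wb.dest == src:      # comparator 2: next-youngest producer
        return mem_wb.value
    return regfile[src]         # no match: the register file is up to date


if __name__ == "__main__":
    regs = [0] * 8
    regs[3] = 10
    print(read_operand(3, regs, Latch(3, 99), Latch(3, 55)))
    print(read_operand(3, regs, Latch(None, 0), Latch(3, 55)))
    print(read_operand(3, regs, Latch(None, 0), Latch(None, 0)))
```

Every ALU input needs one comparator per forwarding source, and in an N-wide machine both the number of inputs and the number of sources scale with N, which is one way to see the quadratic growth mentioned above.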
CSE502: Computer Architecture
Handling Control Hazards
Avoidance (static)
– No branches?
– Convert branches to predication
• Control dependence becomes data dependence
Detect and Stall (dynamic)
– Stop fetch until branch resolves
Speculate and squash (dynamic)
– Keep going past branch, throw away instructions if wrong
CSE502: Computer Architecture
Avoidance: if-conversion
Source:
  if (a == b) {
    x++;
    y = n / d;
  }
With a branch:
  sub  t1 ← a, b
  jnz  t1, PC+2
  add  x ← x, #1
  div  y ← n, d
With predicated instructions:
  sub      t1 ← a, b
  add(t1)  x ← x, #1
  div(t1)  y ← n, d
With conditional moves:
  sub       t1 ← a, b
  add       t2 ← x, #1
  div       t3 ← n, d
  cmov(t1)  x ← t2
  cmov(t1)  y ← t3
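The conditional-move form of the slide's `if (a == b) { x++; y = n / d; }` example can be modeled in software to check that it computes the same result as the branch. This is a sketch with assumptions: Python's conditional expression stands in for a hardware cmov, integer division is used, and d is assumed non-zero because the if-converted version performs the divide unconditionally.

```python
def with_branch(a, b, x, y, n, d):
    """Original control flow: the body runs only when a == b."""
    if a == b:
        x = x + 1
        y = n // d
    return x, y


def if_converted(a, b, x, y, n, d):
    """Branch-free form: compute both results, then select."""
    t1 = a - b                 # sub  t1 <- a, b
    t2 = x + 1                 # add  t2 <- x, #1
    t3 = n // d                # div  t3 <- n, d  (always executed; assumes d != 0)
    take = (t1 == 0)           # predicate: true when a == b
    x = t2 if take else x      # cmov(t1) x <- t2
    y = t3 if take else y      # cmov(t1) y <- t3
    return x, y


if __name__ == "__main__":
    for args in [(1, 1, 5, 0, 20, 4), (1, 2, 5, 0, 20, 4)]:
        print(with_branch(*args), if_converted(*args))
```

The control dependence on the branch has become a data dependence on `take`. The price is that the add and divide always execute, even when their results are discarded.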
CSE502: Computer Architecture
Handling Control Hazards:
Detect & Stall
Detection
– In decode, check if opcode is branch or jump
Stall
– Hold next instruction in Fetch
– Pass noop to Decode
CSE502: Computer Architecture
Problems with Detect & Stall
CPI increases on every branch
Are these stalls necessary? Not always!
– Branch is only taken half the time
– Assume branch is NOT taken
• Keep fetching, treat branch as noop
• If wrong, make sure bad instructions don’t complete
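The benefit of assuming not-taken can be estimated with a simple CPI model. All numbers below are assumptions for illustration: a base CPI of 1, branches making up 20% of instructions, a 2-cycle bubble per stalled or mispredicted branch, and the slide's rough figure that half of the branches are taken.

```python
def cpi_stall_on_branch(branch_frac, penalty, base_cpi=1.0):
    """Every branch pays the full penalty while fetch waits for it to resolve."""
    return base_cpi + branch_frac * penalty


def cpi_predict_not_taken(branch_frac, taken_frac, penalty, base_cpi=1.0):
    """Only taken branches (the mispredictions) pay the penalty."""
    return base_cpi + branch_frac * taken_frac * penalty


if __name__ == "__main__":
    print(cpi_stall_on_branch(branch_frac=0.2, penalty=2))
    print(cpi_predict_not_taken(branch_frac=0.2, taken_frac=0.5, penalty=2))
```

Under these assumptions, predicting not-taken halves the branch overhead. A more accurate predictor shrinks the effective `taken_frac` term further, since only wrong guesses cost cycles.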