Transcript hazards

Lecture 7
Pipeline Hazards
Hazards
CS510 Computer Architectures
Lecture 7 - 1
Pipelining Lessons
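[Figure: laundry pipeline timeline. Tasks A, B, C, and D in task order against time from 6 PM to 9 PM; the time axis is divided into intervals of 30, 40, 40, 40, 40, and 20 minutes.]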
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
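The lessons above can be checked with a small timing model. This is a minimal sketch, not part of the lecture: the stage times of 30, 40, and 20 minutes are an assumption read off the figure's time axis, and the function names are hypothetical.

```python
def sequential_time(stage_times, n_tasks):
    """Total time when each task runs through every stage before the next starts."""
    return n_tasks * sum(stage_times)


def pipelined_time(stage_times, n_tasks):
    """Total time when tasks overlap and each stage handles one task at a time."""
    stage_free_at = [0] * len(stage_times)  # when each stage finishes its latest task
    for _ in range(n_tasks):
        t = 0  # time at which the current task leaves the previous stage
        for s, duration in enumerate(stage_times):
            start = max(t, stage_free_at[s])  # wait for the task and for the stage
            t = start + duration
            stage_free_at[s] = t
    return stage_free_at[-1]


stages = [30, 40, 20]  # assumed stage times in minutes
unpipelined = sequential_time(stages, 4)
pipelined = pipelined_time(stages, 4)
print(unpipelined, pipelined, unpipelined / pipelined)
```

With these assumed numbers the four tasks finish in 210 minutes instead of 360. The speedup stays well below the stage count of 3 because the stages are unbalanced and the pipeline needs time to fill and drain.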
It's Not That Easy to Achieve
the Promised Performance
• Limits to pipelining: Hazards prevent the next instruction
from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of
instructions
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline
– Control hazards: Pipelining of branches and other instructions
that change the PC
• Common solution is to stall the pipeline until the hazard is
resolved, inserting one or more “bubbles”, i.e., idle clock
cycles, in the pipeline
Structural Hazards / Memory
[Figure: pipeline timing diagram. Instruction order (LOAD, Instr 1, Instr 2, Instr 3, Instr 4) against time in clock cycles CC1 to CC9; each instruction passes through Mem, Reg, ALU, Mem, Reg.]
Operation on memory by 2 different instructions in the same clock cycle
Structural Hazards
with Single-Port Memory
[Figure: pipeline timing diagram. Instruction order (LOAD, Instr 1, Instr 2, Stall, Stall, Stall, Instr 3) against time in clock cycles CC1 to CC9; each instruction passes through Mem, Reg, ALU, Mem, Reg.]
3 cycles stall with 1-port memory
Avoiding Structural Hazard
with Dual-Port Memory
[Figure: pipeline timing diagram. Instruction order (LOAD, Instr 1 to Instr 5) against time in clock cycles CC1 to CC9; each instruction passes through IM, Reg, ALU, DM, Reg, with instruction memory (IM) and data memory (DM) accessed separately.]
No stall with 2-port memory
Speed Up Equation
for Pipelining
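\[
\text{Speedup from pipelining}
= \frac{\text{Ave Instr Time}_{\text{unpipelined}}}{\text{Ave Instr Time}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]
Ideal CPI = CPIunpipelined / Pipeline depth (number of pipeline stages)
\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]
Ideal CPI for pipelined machines is almost always 1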
Speed Up Equation
for Pipelining
\[
\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}
= 1 + \text{Pipeline stall clock cycles per instr}
\]
\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]
\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]
Dual-Port vs Single-Port Memory
• Machine A: 2-port memory (needs no stall for Load); same clock cycle as unpipelined machine
• Machine B: 1-port memory (needs 3 cycles stall for Load); 1.05 times faster clock rate than the unpipelined machine
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = [Pipeline Depth/(1 + 0)] x (clockunpipe/clockpipe)
= Pipeline Depth
SpeedUpB = [Pipeline Depth/(1 + 0.4 x 3)] x (clockunpipe/(clockunpipe/1.05))
= (Pipeline Depth/2.2) x 1.05
≈ 0.48 x Pipeline Depth
SpeedUpA / SpeedUpB = 2.2/1.05 ≈ 2.1
Machine A is about 2.1 times faster
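The comparison can be reproduced numerically from the speedup equation. This is a sketch under the stated parameters (40% loads, a 3-cycle stall per load on Machine B, a 1.05 times faster clock); the pipeline depth of 5 and the function name are assumptions for illustration, and the depth cancels in the ratio. The stall term for Machine B is 0.4 x 3 = 1.2 cycles per instruction, so its CPI is 2.2.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine.

    clock_ratio is (unpipelined clock cycle) / (pipelined clock cycle).
    """
    return ideal_cpi * depth / (ideal_cpi + stall_cpi) * clock_ratio


depth = 5            # assumed pipeline depth; it cancels in the A/B ratio
load_fraction = 0.4  # loads are 40% of instructions

speedup_a = pipeline_speedup(depth, stall_cpi=0.0)
speedup_b = pipeline_speedup(depth, stall_cpi=load_fraction * 3, clock_ratio=1.05)

print(f"A: {speedup_a:.2f}  B: {speedup_b:.2f}  A/B: {speedup_a / speedup_b:.2f}")
```

Under these assumptions the faster clock of Machine B does not come close to compensating for its load stalls.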
Data Hazard on Registers
[Figure: pipeline timing diagram over clock cycles CC1 to CC9 for the sequence below; each instruction passes through Mem, Reg, ALU, Mem, Reg, and the instructions after ADD all use R1.]
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1
Data Hazard on Registers
Registers can be made to store and read in the same cycle: data is stored in the first half of the clock cycle, and that data can be read in the second half of the same clock cycle.
[Figure: one clock cycle of register Ri, with "Store into Ri" in the first half and "Read from Ri" in the second half.]
Data Hazard on Registers
[Figure: pipeline timing diagram over clock cycles CC1 to CC9 for the sequence below; each instruction passes through Mem, Reg, ALU, Mem, Reg.]
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1
Needs to Stall 2 cycles
Three Generic Data Hazards
Instr i followed by Instr j
Read After Write (RAW)
Instr j tries to read operand before Instr i writes it
Instr i:  LW  R1, 0(R2)
Instr j:  SUB R4, R1, R5
Three Generic Data Hazards
Instr i followed by Instr j
Write After Read (WAR)
Instr j tries to write operand before Instr i reads it
Instr i:  ADD R1, R2, R3
Instr j:  LW  R2, 0(R5)
Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages,
– Reads are always in stage 2, and
– Writes are always in stage 5
Three Generic Data Hazards
Instr i followed by Instr j
Write After Write (WAW)
Instr j tries to write operand before Instr i writes it
– Leaves wrong result (Instr i's, not Instr j's)
Instr i:  LW R1, 0(R2)
Instr j:  LW R1, 0(R3)
Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
Will see WAR and WAW in later, more complicated pipes
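The three hazard classes can be expressed as a register-overlap check between two instructions. The sketch below is illustrative only: the `Instr` record and the `data_hazards` function are hypothetical names, and the check reports register dependences, which become actual hazards only if the pipeline lets the accesses happen out of order.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]    # register written, or None
    srcs: Tuple[str, ...]  # registers read


def data_hazards(i: Instr, j: Instr) -> set:
    """Dependence types when instruction j follows instruction i."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")  # j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")  # j writes what i reads
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")  # both write the same register
    return found


# The three example pairs from the slides.
raw = data_hazards(Instr("LW", "R1", ("R2",)), Instr("SUB", "R4", ("R1", "R5")))
war = data_hazards(Instr("ADD", "R1", ("R2", "R3")), Instr("LW", "R2", ("R5",)))
waw = data_hazards(Instr("LW", "R1", ("R2",)), Instr("LW", "R1", ("R3",)))
print(raw, war, waw)
```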
Forwarding
to Avoid Data Hazards
[Figure: pipeline timing diagram over clock cycles CC1 to CC9 for the sequence below; each instruction passes through Mem, Reg, ALU, Mem, Reg, with the result of ADD forwarded to the later instructions that use R1.]
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1
HW Change
for Forwarding
[Figure: datapath fragment for forwarding. Components shown: D/A Buffer, two MUXes, ALU, Zero? test, A/M Buffer, Data Memory, M/W Buffer.]
Load Delay Due to Data Hazard
[Figure: pipeline timing diagram for the sequence below; each instruction passes through IM, Reg, ALU, DM, Reg.]
LOAD R1,0(R2)
SUB R4,R1,R6
AND R6,R1,R7
OR R8,R1,R9
Load Delay = 2 cycles
Load Delay
with Forwarding
[Figure: pipeline timing diagram for the sequence below; each instruction passes through IM, Reg, ALU, DM, Reg.]
LOAD R1,0(R2)
SUB R4,R1,R6
AND R6,R1,R7
OR R8,R1,R9
Load Delay with Forwarding = 1 cycle
We need to add HW, called Pipeline Interlock
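The effect of a pipeline interlock on this sequence can be sketched as a stall counter. This is a simplified model, not the DLX hardware: it tracks only loads, assumes every other result is forwarded without a stall, and uses hypothetical names (`count_load_stalls`, `load_delay`). A `load_delay` of 1 corresponds to the forwarding case and 2 to the case without forwarding.

```python
def count_load_stalls(program, load_delay=1):
    """Count interlock stall cycles caused by loads.

    program is a list of (op, dest, srcs) tuples in program order.
    load_delay is how many issue slots a dependent instruction must wait
    when it directly follows the load.
    """
    ready = {}  # register -> earliest issue slot for an instruction that reads it
    slot = 0    # issue slot of the current instruction
    stalls = 0
    for op, dest, srcs in program:
        earliest = max((ready.get(r, 0) for r in srcs), default=0)
        if earliest > slot:
            stalls += earliest - slot  # interlock holds the instruction back
            slot = earliest
        if dest is not None:
            if op == "LW":
                ready[dest] = slot + 1 + load_delay
            else:
                ready.pop(dest, None)  # overwritten by a forwarded result
        slot += 1
    return stalls


program = [
    ("LW", "R1", ("R2",)),
    ("SUB", "R4", ("R1", "R6")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
]
print(count_load_stalls(program, load_delay=1))
print(count_load_stalls(program, load_delay=2))
```

Only SUB stalls in either case, because by the time AND and OR issue, the stall has already pushed them far enough from the load.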
Software Scheduling
to Avoid Load Hazards
Try to produce fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code (with forwarding):      Fast code:
LW    Rb,b                        LW    Rb,b
LW    Rc,c                        LW    Rc,c
Stall                             LW    Re,e
ADD   Ra,Rb,Rc    (RAW)           ADD   Ra,Rb,Rc
Stall                             LW    Rf,f
SW    a,Ra        (RAW)           SW    a,Ra
LW    Re,e                        SUB   Rd,Re,Rf
LW    Rf,f                        Stall
Stall                             SW    d,Rd        (RAW)
SUB   Rd,Re,Rf    (RAW)
Stall
SW    d,Rd        (RAW)
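The two schedules can be compared by counting stalls. The sketch below uses a stall rule chosen to mirror the stalls drawn on the slide, which is an assumption rather than a statement about the DLX hardware: an instruction stalls one cycle when it reads a register written by the instruction directly before it, if that producer is a load or the consumer is a store. The function name is hypothetical.

```python
def count_stalls(program):
    """Count one-cycle stalls under the slide's stall rule (see lead-in)."""
    stalls = 0
    prev_op, prev_dest = None, None
    for op, dest, srcs in program:
        depends = prev_dest is not None and prev_dest in srcs
        if depends and (prev_op == "LW" or op == "SW"):
            stalls += 1
        prev_op, prev_dest = op, dest
    return stalls


# (op, destination register or None, source registers)
slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
print(count_stalls(slow), count_stalls(fast))
```

Both schedules contain the same eight instructions; reordering alone removes three of the four stalls under this rule.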
Compiler Avoiding Load Stalls
% loads stalling pipeline (chart axis from 0% to 80%)

Benchmark    Unscheduled    Scheduled
tex          65%            25%
spice        42%            14%
gcc          54%            31%
Pipelined DLX Datapath
[Figure: pipelined DLX datapath with IF, ID, EX, Mem, and WB stages. Components shown: PC, +4 adder, Instr. Memory, MUXes, F/D Buffer, Reg File, Sign Ext (16 to 32), D/A Buffer, ALU, Add, Zero? test, A/M Buffer, SMD, Data Memory, LMD, M/W Buffer.]
Annotations on the figure:
• Branch Address Calculation
• Decide Condition
• Branch Decision for target address
Control Hazard on Branches:
Three Stall Cycles
[Figure: pipeline timing diagram over clock cycles CC1 to CC9, program execution order in instructions; each instruction passes through IM, Reg, ALU, DM, Reg.]
40 BEQ R1,R3, 36
44 AND R12,R2, R5
48 OR R13,R6, R2
52 ADD R14,R2, R2
80 LD R4,R7, 100
The instructions at 44, 48, and 52 shouldn't be executed when the branch condition is true!
Branch Target available
Branch Delay = 3 cycles
Control Hazard on Branches:
Three Stall Cycles
Branch instruction       IF   ID   EX   MEM  WB
Branch successor              IF   ID   EX   MEM
Branch successor + 1               IF   ID   EX
Branch successor + 2                    IF   ID

• We don’t know yet that the instruction being executed is a branch. Fetch the branch successor.
• Now, we know the instruction being executed is a branch. But stall until the branch target address is known.
• Now, the target address is available.
3 wasted clock cycles for the TAKEN branch
Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1 + 0.3 x 3 = 1.9
– Half of the ideal speed
• Two part solution:
– Determine whether the branch is TAKEN or NOT TAKEN sooner, AND
– Compute TAKEN Branch Address (Branch Target) earlier
• DLX branch tests if register = 0 or ≠ 0
DLX Solution: Get New PC earlier
- Move Zero test to ID stage
- Additional ADDER to calculate New PC (taken PC) in ID stage
- 1 clock cycle penalty for branch in contrast to 3 cycles
Pipelined DLX Datapath
[Figure: revised pipelined DLX datapath with IF, ID, EX, Mem, and WB stages. Components shown: PC, +4 adder, Instr. Memory, MUXes, F/D Buffer, Reg File, Sign Ext (16 to 32), Zero? test, two Add units, D/A Buffer, ALU, A/M Buffer, SMD, Data Memory, LMD, M/W Buffer.]
Annotations on the figure:
• To get target addr. earlier
• To get the Condition Earlier. Target Address available after ID.
• When a branch instruction is in Execute stage, Next Address is available here.
Branch Behavior in Programs
• Conditional branch frequencies
– integer average --- 14 to 16 %
– floating point --- 3 to 12 %
• Forward and backward taken branches
– forward taken --- 60 %
– backward taken --- 85 %
– the average of all conditional branches ---- 67 %
4 Branch Hazard Alternatives
• Stall until branch direction is clear
• Predict branch NOT TAKEN
• Predict branch TAKEN
• Delayed branch
4 Branch Hazard Alternatives:
(1) STALL
Stall until branch direction is clear

Branch instruction       IF     ID     EX     MEM    WB
Branch successor                stall  stall  stall  IF     ID     EX     MEM
Branch successor + 1                                        IF     ID     EX
Branch successor + 2                                               IF     ID
3 cycle penalty

Revised DLX pipeline (get the branch address at EX)

Branch instruction       IF     ID     EX     MEM    WB
Branch successor                stall  IF     ID     EX     MEM    WB
Branch successor + 1                          IF     ID     EX     MEM
Branch successor + 2                                 IF     ID
1 cycle penalty (Branch Delay Slot)
4 Branch Hazard Alternatives:
(2) Predict Branch “NOT TAKEN”
• Execute successor instructions in the sequence
• PC+4 is already calculated, so use it to get the next instruction
• Flush instructions in the pipeline if branch is actually TAKEN
• Advantage of late pipeline state update
• 47% of DLX branches are NOT TAKEN on the average
NOT TAKEN branch instruction i   IF   ID   EX   MEM  WB
instruction i+1                       IF   ID   EX   MEM  WB
instruction i+2                            IF   ID   EX   MEM  WB
No penalty

TAKEN branch instruction i       IF   ID   EX   MEM  WB
instruction i+1                       IF   ID   EX   MEM  WB     (flush this instruction in progress)
instruction T                              IF   ID   EX   MEM  WB
1 cycle penalty
4 Branch Hazard Alternatives:
(3) Predict Branch “TAKEN”
– 53% DLX branches TAKEN on average
– Branch target address available after ID in DLX
• DLX still incurs 1 cycle branch penalty for TAKEN branch
• Other machines: branch target known before outcome
NOT TAKEN instruction i      IF   ID     EX   MEM  WB
Instruction T                     stall  IF
Instruction i+1                               IF   ID   EX   MEM  WB
TAKEN address not available at this time (the stall cycle)
2 cycle penalty in DLX (1 in other machines)

TAKEN branch instruction i   IF   ID     EX   MEM  WB
Instruction T                     stall  IF   ID   EX   MEM  WB
Instruction T+1                               IF   ID   EX   MEM  WB
TAKEN address available (after the stall cycle)
1 cycle penalty in DLX (0 in other machines)
4 Branch Hazard Alternatives:
(4) Delayed Branch
Delayed Branch
– Delay branch to take place AFTER a successor instruction
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken
(sequential successors 1 to n form a Delayed Branch of length n)
– A 1-slot delayed branch allows the branch decision and the branch target address to be ready in the 5 stage DLX pipeline with the control hazard improvement
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch TAKEN
– From fall through: only valuable when branch NOT TAKEN
– Canceling branches allow more slots to be filled
• Compiler effectiveness for single delayed branch slot:
– Fills about 60% of delayed branch slots
– About 80% of instructions executed in delayed branch slots are useful in computation
– About 50% (60% x 80%) of slots usefully filled
4 Branch Hazard Alternatives:
Delayed Branch
From before
Original:                  Scheduled:
ADD R1, R2, R3             if R2=0 then
if R2=0 then               ADD R1, R2, R3    (in delay slot)
Delay slot
- Always improves performance
- Branch must not depend on rescheduled instructions

From target
Original:                  Scheduled:
SUB R4, R5, R6             ADD R1, R2, R3
ADD R1, R2, R3             if R1=0 then
if R1=0 then               SUB R4, R5, R6    (in delay slot)
Delay slot
- Improves performance when TAKEN (loop)
- Must be alright to execute rescheduled instructions if Not Taken
- May need to duplicate the instruction if it is the target of another branch instr.

From fall through
Original:                  Scheduled:
ADD R1, R2, R3             ADD R1, R2, R3
if R1=0 then               if R1=0 then
Delay slot                 SUB R4, R5, R6    (in delay slot)
SUB R4, R5, R6
- Improves performance when NOT TAKEN
- Must be alright to execute the instruction if Taken
Limitations on Delayed
Branch
• Difficulty in finding useful instructions to fill the delayed
branch slots
• Solution - Squashing
– Delayed branch associated with a branch prediction
– Instructions in the predicted path are executed in the
delayed branch slot
– If the branch outcome is mispredicted, instructions in the
delayed branch slot are squashed(discarded)
Canceling Branch
• Used when delayed branch scheduling, i.e., filling the delay slot, cannot be done due to
– Restrictions on scheduling instructions at the delay slots
– Limitations on the ability to predict at compile time whether the branch will be TAKEN or NOT TAKEN
• Instruction includes the direction in which the branch was predicted
– When the branch behaves as predicted, the instructions in the delay slot are executed
– When the branch is incorrectly predicted, the instructions in the delay slot are turned into No-OPs
• Canceling Branch allows the delay slot to be filled even if the instruction placed in the delay slot does not meet the requirements
Evaluating Branch
Alternatives
\[
\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{\text{CPI}}
= \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}
\]
Conditional and unconditional branches collectively 14% frequency; 65% of branches are TAKEN

Scheduling scheme    Branch penalty   CPI                 Speedup vs unpipelined   Speedup vs stall
Stall pipeline       3                1+0.14x3=1.42       5/1.42=3.5               1.0
Predict Taken        1                1+0.14x1=1.14       5/1.14=4.4               1.26
Predict Not Taken    1                1+0.14x0.65=1.09    5/1.09=4.5               1.29
Delayed branch       0.5              1+0.14x0.5=1.07     5/1.07=4.6               1.31
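The table can be recomputed from the formula. This is a sketch assuming a pipeline depth of 5, a 14% branch frequency, and 65% taken branches as stated on the slide; the dictionary and function names are hypothetical. Because the code keeps full precision, the last digit of some entries can differ from a table that rounds intermediate values.

```python
def branch_cpi(branch_freq, penalty):
    """CPI with an ideal CPI of 1 plus the average branch penalty per instruction."""
    return 1.0 + branch_freq * penalty


DEPTH = 5            # pipeline depth used in the table
BRANCH_FREQ = 0.14   # conditional and unconditional branches
TAKEN_FRACTION = 0.65

# Average penalty in cycles per branch for each scheme.
penalties = {
    "Stall pipeline": 3.0,
    "Predict taken": 1.0,
    "Predict not taken": TAKEN_FRACTION * 1.0,  # 1 cycle, taken branches only
    "Delayed branch": 0.5,
}

stall_cpi = branch_cpi(BRANCH_FREQ, penalties["Stall pipeline"])
results = {}
for name, penalty in penalties.items():
    cpi = branch_cpi(BRANCH_FREQ, penalty)
    results[name] = {
        "cpi": cpi,
        "speedup": DEPTH / cpi,        # versus the unpipelined machine
        "vs_stall": stall_cpi / cpi,   # versus the stall scheme
    }
    print(f"{name:18s} CPI={cpi:.2f} speedup={DEPTH / cpi:.2f} vs stall={stall_cpi / cpi:.2f}")
```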
Static(Compiler) Prediction of
Taken/Untaken Branches
Code Motion

      LW    R1, 0(R2)
      SUB   R1, R1, R3      (depends on LW, need to stall)
      BEQZ  R1, L
      OR    R4, R5, R6
      ADD   R10,R4,R3
L:    ADD   R7, R8, R9

• Moving the OR up: if branch is almost always NOT TAKEN, and R4 is not needed on the taken path, and R5 and R6 are not modified in the following instruction(s), this move can increase speed
• Moving the ADD at L up: if branch is almost always TAKEN, and R7 is not needed, and R8 and R9 are not modified on the fall-through path, this move can increase speed
Static(Compiler) Prediction
of Taken/Untaken Branches
• Improves strategy for placing instructions in delay slot
• Two strategies
– Direction-based prediction: TAKEN backward branch, NOT TAKEN forward branch
– Profile-based prediction: record branch behaviors, predict branch based on the prior run(s)
[Figure: two bar charts over the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv. Panel and axis labels: "Always taken", "Taken backwards, Not Taken Forwards", "Misprediction Rate" (0% to 70%), "Frequency of Misprediction" (0% to 14%).]
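Direction-based prediction needs only the branch address and its target. The sketch below is illustrative: the function names and the branch trace are hypothetical, with a loop-closing backward branch and a rarely taken forward branch.

```python
def predict_taken(branch_pc, target_pc):
    """Direction-based rule: backward branches TAKEN, forward branches NOT TAKEN."""
    return target_pc < branch_pc


def misprediction_rate(trace):
    """Fraction of branches in the trace that the direction-based rule gets wrong.

    trace is a sequence of (branch_pc, target_pc, taken) tuples.
    """
    trace = list(trace)
    wrong = sum(predict_taken(pc, target) != taken for pc, target, taken in trace)
    return wrong / len(trace)


# Hypothetical trace: a backward loop branch taken 9 times out of 10,
# and a forward branch taken 2 times out of 10.
trace = (
    [(100, 40, True)] * 9 + [(100, 40, False)]
    + [(60, 80, True)] * 2 + [(60, 80, False)] * 8
)
print(misprediction_rate(trace))
```

Profile-based prediction would replace `predict_taken` with a lookup of each branch's majority direction recorded in a prior run.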
Evaluating Static Branch Prediction Strategies
• Misprediction rate ignores frequency of branch
• Instructions between mispredicted branches is a better metric
[Figure: instructions per mispredicted branch (log scale, 1 to 100000) for the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv, comparing Profile-based and Direction-based prediction.]
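The difference between the two metrics shows up when two programs share a misprediction rate but differ in branch frequency. The counts below are hypothetical and chosen only for illustration, as are the function names.

```python
def misprediction_rate(branches, mispredicted):
    """Mispredicted branches as a fraction of all branches."""
    return mispredicted / branches


def instructions_per_misprediction(instructions, mispredicted):
    """Average number of instructions executed between mispredicted branches."""
    return instructions / mispredicted


# Hypothetical programs, each running 1,000,000 instructions.
branchy = {"instructions": 1_000_000, "branches": 200_000, "mispredicted": 20_000}
sparse = {"instructions": 1_000_000, "branches": 50_000, "mispredicted": 5_000}

for name, p in (("branchy", branchy), ("sparse", sparse)):
    rate = misprediction_rate(p["branches"], p["mispredicted"])
    gap = instructions_per_misprediction(p["instructions"], p["mispredicted"])
    print(f"{name}: rate={rate:.0%}, instructions per misprediction={gap:.0f}")
```

Both programs mispredict 10% of their branches, yet the second runs four times as many instructions between mispredictions, which is what determines the stall cost per instruction.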
Pipelining Summary
• Just overlap tasks; easy if tasks are independent
• Speed Up <= Pipeline Depth; if ideal CPI is 1, then:
\[
\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle Unpipelined}}{\text{Clock Cycle Pipelined}}
\]
• Hazards limit performance on computers:
– Structural: need more HW resources
– Data: need forwarding, compiler scheduling
– Control: Dynamic Prediction, Delayed branch slot, Static (compiler) Prediction