Stalling
The easiest solution is to stall the pipeline.
We could delay the AND instruction by introducing a one-cycle delay into
the pipeline, sometimes called a bubble.
lw  $2, 20($3)
and $12, $2, $5
[Pipeline diagram, clock cycles 1 to 7: the lw runs in cycles 1 to 5; the and is held back by a one-cycle bubble and finishes in cycle 7.]
Notice that we’re still using forwarding in cycle 5, to get data from the
MEM/WB pipeline register to the ALU
Stalling and forwarding
Without forwarding, we’d have to stall for two cycles to wait for the LW
instruction’s writeback stage
lw  $2, 20($3)
and $12, $2, $5
[Pipeline diagram, clock cycles 1 to 8: the and is held back by two bubbles, waiting for the lw's writeback stage, and finishes in cycle 8.]
In general, you can always stall to avoid hazards, but dependencies are
very common in real code, and frequent stalling can reduce performance
significantly.
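The cycle counts in the two examples above follow from simple arithmetic: in a five-stage pipeline the first instruction needs five cycles, each later instruction adds one cycle, and each stall adds one more. A minimal Python sketch of that count; the names `PIPELINE_STAGES` and `total_cycles` are illustrative and not from the slides:

```python
PIPELINE_STAGES = 5  # IF, ID, EX, MEM, WB


def total_cycles(n_instructions: int, stall_cycles: int) -> int:
    """Cycles to run n instructions through an initially empty pipeline."""
    fill = PIPELINE_STAGES - 1  # extra cycles the first instruction spends in flight
    return fill + n_instructions + stall_cycles


# lw followed by a dependent and:
with_forwarding = total_cycles(2, stall_cycles=1)     # one bubble
without_forwarding = total_cycles(2, stall_cycles=2)  # two bubbles
print(with_forwarding, without_forwarding)
```

With forwarding the pair takes 7 cycles, without it 8, matching the clock-cycle ranges of the two diagrams.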
Load-Use Hazard Detection
• Check when the using instruction (the one that reads the loaded register) is decoded in the ID stage
• ALU operand register numbers in ID stage are given by
• IF/ID.RegisterRs, IF/ID.RegisterRt
• Load-use hazard when
• ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert bubble
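The condition above can be written as a small predicate. This is only a sketch in Python: the function name and the argument names are made up for illustration and simply mirror the pipeline-register fields named on the slide.

```python
def load_use_hazard(id_ex_mem_read: bool, id_ex_rt: int,
                    if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in EX is a load whose destination (rt)
    is a source register of the instruction currently in ID."""
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)


# lw $2, 20($3) is in EX (MemRead = 1, rt = 2);
# and $12, $2, $5 is in ID (rs = 2, rt = 5).
stall = load_use_hazard(id_ex_mem_read=True, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
print(stall)
```

If the instruction in EX is not a load, or the register numbers do not match, the predicate is false and no bubble is inserted.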
How to Stall the Pipeline
• Force control values in ID/EX register to 0
• EX, MEM and WB do nop (no-operation)
• Prevent update of PC and IF/ID register
• Using instruction is decoded again
• Following instruction is fetched again
• 1-cycle stall allows MEM to read data for lw
• Can subsequently forward to EX stage
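The steps above amount to three outputs of the hazard unit. A hedged Python sketch follows; `stall_controls` is a hypothetical helper, and the control dictionary is an abbreviated stand-in for the real control signals:

```python
def stall_controls(stall: bool, decoded_control: dict) -> dict:
    """Return the write enables and the control values latched into ID/EX."""
    if stall:
        # Bubble: every control value is forced to 0, so EX, MEM and WB do a nop.
        id_ex_control = {name: 0 for name in decoded_control}
    else:
        id_ex_control = dict(decoded_control)
    return {
        "PCWrite": not stall,      # PC holds its value during a stall
        "IF/IDWrite": not stall,   # IF/ID holds, so the same instruction is decoded again
        "ID/EX.control": id_ex_control,
    }


# Abbreviated control signals for an R-type instruction such as and.
and_control = {"RegWrite": 1, "MemRead": 0, "MemWrite": 0}
print(stall_controls(True, and_control))
print(stall_controls(False, and_control))
```

Because the PC and IF/ID register are frozen rather than cleared, nothing is lost: the stalled instructions simply repeat their stages in the next cycle.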
Stalling delays the entire pipeline
If we delay the second instruction, we’ll have to delay the third one too
— This is necessary to make forwarding work between AND and OR
— It also prevents problems such as two instructions trying to write to
the same register in the same cycle
lw  $2, 20($3)
and $12, $2, $5
or  $13, $12, $2
[Pipeline diagram, clock cycles 1 to 8: the bubble that delays the and also delays the or by one cycle.]
What about EX, MEM, WB
But what about the ALU during cycle 4, the data memory in cycle 5, and
the register file write in cycle 6?
lw  $2, 20($3)
and $12, $2, $5
or  $13, $12, $2
[Pipeline diagram, clock cycles 1 to 8: during the stall the and repeats its register-read (ID) stage and the or repeats its instruction-fetch (IF) stage.]
Those units aren’t used in those cycles because of the stall, so we can set
the EX, MEM and WB control signals to all 0s.
Detecting Stalls, cont.
lw  $2, 20($3)
and $12, $2, $5
[Pipeline diagram: the two instructions drawn with the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers between stages.]
When should stalls be detected?
EX stage (of the instruction causing the stall)
What is the stall condition?
if (ID/EX.MemRead = 1 and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt))
then stall
Adding hazard detection to the CPU
[Figure: pipelined datapath with a Hazard Unit added. Its labeled inputs are ID/EX.MemRead, ID/EX.RegisterRt and the Rs and Rt fields from IF/ID; its labeled outputs are PC Write, IF/ID Write and the select of a multiplexer that passes either the Control outputs or 0 into ID/EX. The Forwarding Unit, with inputs EX/MEM.RegisterRd and MEM/WB.RegisterRd, is also shown.]
Stalls and Performance
Stalls reduce performance
— But are required to get correct results
Compiler can arrange code to avoid hazards and stalls
— Requires knowledge of the pipeline structure
Code Scheduling to Avoid Stalls
Reorder code so that a load result is not used by the
next instruction
Example: C code for A = B + E; C = B + F;
Original order:
lw   $t1, 0($t0)
lw   $t2, 4($t0)
(stall)
add  $t3, $t1, $t2
sw   $t3, 12($t0)
lw   $t4, 8($t0)
(stall)
add  $t5, $t1, $t4
sw   $t5, 16($t0)
13 cycles

Reordered:
lw   $t1, 0($t0)
lw   $t2, 4($t0)
lw   $t4, 8($t0)
add  $t3, $t1, $t2
sw   $t3, 12($t0)
add  $t5, $t1, $t4
sw   $t5, 16($t0)
11 cycles
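The two cycle counts can be reproduced by counting load-use pairs. The sketch below assumes a five-stage pipeline with forwarding, so the only stalls are a lw immediately followed by an instruction that reads its destination. The tuple encoding and the function names are illustrative choices, not part of the slides.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


def total_cycles(program):
    # 4 cycles to fill the pipeline, one per instruction, one per stall.
    return 4 + len(program) + load_use_stalls(program)


# Each entry: (opcode, destination register or None, source registers)
original = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw",  None,  ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw",  None,  ["$t5", "$t0"]),
]
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]

print(total_cycles(original), total_cycles(reordered))  # prints "13 11"
```

Moving the third lw up separates each load from its first use, which removes both stalls without changing the result.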
Branches in the original pipelined datapath
When are they resolved?
[Figure: the original pipelined datapath. Labels recoverable from the diagram include the PCSrc multiplexer, the PC and its "+ 4" adder, the Shift left 2 and Add units that form the branch target, the ALU Zero output, and the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers.]
Branch Hazards
If branch outcome determined in MEM:
Flush the instructions fetched after the branch
(set control values to 0)
Reducing Branch Delay
Move hardware to determine outcome to ID stage
— Target address adder
— Register comparator
Example: branch taken
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or  $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
72: lw  $4, 50($7)
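In the listing above, the beq at address 40 carries an offset of 7 and the taken path resumes at address 72. That is consistent with the usual MIPS rule that the target is PC + 4 plus the word offset shifted left by 2. A minimal check in Python (`branch_target` is an illustrative name):

```python
def branch_target(branch_pc: int, offset_words: int) -> int:
    """Target of a MIPS conditional branch: (PC + 4) + (offset << 2)."""
    return branch_pc + 4 + (offset_words << 2)


print(branch_target(40, 7))  # prints 72
```

This is the addition that the target address adder performs once it is moved to the ID stage.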
Data Hazards for Branches
If a comparison register is a destination of 2nd or 3rd preceding
ALU instruction
Can resolve using forwarding

add $1, $2, $3        IF   ID   EX   MEM  WB
add $4, $5, $6             IF   ID   EX   MEM  WB
…                               IF   ID   EX   MEM  WB
beq $1, $4, target                   IF   ID   EX   MEM  WB
Data Hazards for Branches
If a comparison register is a destination of preceding ALU
instruction or 2nd preceding load instruction
Need 1 stall cycle

lw  $1, addr          IF   ID   EX   MEM  WB
add $4, $5, $6             IF   ID   EX   MEM  WB
beq stalled                     IF   ID
beq $1, $4, target                   ID   EX   MEM  WB
Data Hazards for Branches
If a comparison register is a destination of immediately preceding
load instruction
Need 2 stall cycles

lw  $1, addr          IF   ID   EX   MEM  WB
beq stalled                IF   ID
beq stalled                     ID
beq $1, $0, target                   ID   EX   MEM  WB
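The three cases above can be summarized in one rule. The formula below is an inference from the diagrams rather than something stated on the slides: it assumes the branch compares its registers in ID, that an ALU result can be forwarded from the end of EX, and that a loaded value can be forwarded from the end of MEM. The function name and the "alu"/"load" labels are illustrative.

```python
def branch_stall_cycles(producer: str, distance: int) -> int:
    """Stall cycles for a branch resolved in ID.

    producer: "alu" or "load", the kind of instruction writing the compared register.
    distance: 1 if the producer immediately precedes the branch, 2 if it is the
              2nd preceding instruction, and so on.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    # Number of instructions that must separate results from the branch's ID stage:
    # ALU results are ready after EX, loaded values one stage later, after MEM.
    ready_after = {"alu": 2, "load": 3}[producer]
    return max(0, ready_after - distance)


for producer, distance in [("alu", 2), ("alu", 1), ("load", 2), ("load", 1)]:
    print(producer, distance, branch_stall_cycles(producer, distance))
```

The results agree with the slides: no stall for an ALU instruction two or three back, one stall for the preceding ALU instruction or the 2nd preceding load, and two stalls for an immediately preceding load.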
Branch Prediction
• Longer pipelines can’t readily determine branch
outcome early
• Stall penalty becomes unacceptable
• Predict (i.e., guess) outcome of branch
• Only stall if prediction is wrong
• Simplest prediction strategy
• predict branches not taken
• Works well for loops if the loop tests are done at the
start.
• Fetch instruction after branch, with no delay
Dynamic Branch Prediction
In deeper and superscalar pipelines, branch penalty is more
significant
Use dynamic prediction
Branch prediction buffer (aka branch history table)
Indexed by recent branch instruction addresses
Stores outcome (taken/not taken)
To execute a branch
Check table, expect the same outcome
Start fetching from fall-through or target
If wrong, flush pipeline and flip prediction
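A branch prediction buffer of this kind can be sketched in a few lines of Python. The class name, the table size of 16 entries, the initial "not taken" state, and the choice to index with the low-order bits of the word address are all assumptions made for illustration.

```python
class BranchHistoryTable:
    """1-bit-per-entry prediction buffer indexed by low-order branch address bits."""

    def __init__(self, entries: int = 16, initial_taken: bool = False):
        self.entries = entries
        self.taken = [initial_taken] * entries

    def _index(self, pc: int) -> int:
        # Instructions are word aligned, so drop the two low bits before indexing.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.taken[self._index(pc)]

    def update(self, pc: int, actually_taken: bool) -> bool:
        """Record the real outcome; return True if the prediction was wrong."""
        mispredicted = self.predict(pc) != actually_taken
        # Storing the outcome is the same as flipping the bit when the guess was wrong.
        self.taken[self._index(pc)] = actually_taken
        return mispredicted


bht = BranchHistoryTable()
print(bht.update(40, True))   # first encounter, predicted not taken
print(bht.predict(40))        # now expects the same outcome again
```

Because only part of the address selects an entry, two branches whose addresses differ by a multiple of the table size (in words) share a prediction bit; in this sketch, addresses 40 and 104 do.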
1-Bit Predictor: Shortcoming
Inner loop branches mispredicted twice!
outer: …
…
inner: …
…
beq …, …, inner
…
beq …, …, outer
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop
next time around
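The double misprediction is easy to reproduce. The following Python sketch feeds the outcomes of the inner-loop branch to a single 1-bit predictor; the loop sizes (4 inner iterations, 3 outer iterations) and the initial prediction of "taken" are arbitrary assumptions.

```python
def inner_loop_outcomes(inner_iterations: int, outer_iterations: int):
    """Outcomes of the inner-loop branch: taken on every pass except the last."""
    one_pass = [True] * (inner_iterations - 1) + [False]
    return one_pass * outer_iterations


def one_bit_mispredictions(outcomes, initial_prediction: bool = True) -> int:
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # a 1-bit predictor always remembers the last outcome
    return misses


outcomes = inner_loop_outcomes(inner_iterations=4, outer_iterations=3)
print(one_bit_mispredictions(outcomes))  # prints 5
```

The first pass through the inner loop mispredicts once (the exit), and each later pass mispredicts twice: once on re-entry and once on exit.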
2-Bit Predictor
Only change prediction on two successive mispredictions
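The rule above can be implemented directly with a prediction bit plus a "missed last time" bit. This Python sketch is one possible encoding of the two bits, not necessarily the state diagram used in the lecture; the branch pattern (3 taken, 1 not taken, repeated 3 times) and the initial state are assumptions.

```python
def two_bit_mispredictions(outcomes, initial_prediction: bool = True) -> int:
    prediction = initial_prediction
    missed_last_time = False
    misses = 0
    for taken in outcomes:
        if taken == prediction:
            missed_last_time = False      # a correct guess restores full confidence
        else:
            misses += 1
            if missed_last_time:
                prediction = taken        # second miss in a row: flip the prediction
                missed_last_time = False
            else:
                missed_last_time = True   # first miss: keep the prediction for now
    return misses


# Inner-loop branch: taken three times, then not taken, for three outer iterations.
outcomes = ([True] * 3 + [False]) * 3
print(two_bit_mispredictions(outcomes))  # prints 3
```

Only the loop exit is mispredicted on each pass, because a single not-taken outcome is not enough to change the prediction for the next entry into the loop.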
Calculating the Branch Target
Even with predictor, still need to calculate the target
address
1-cycle penalty for a taken branch
Branch target buffer
Cache of target addresses
Indexed by PC when instruction fetched
If hit and instruction is branch predicted taken,
can fetch target immediately
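A branch target buffer can be sketched as a lookup performed during fetch. The Python class below is a simplification under stated assumptions: it uses an unbounded dictionary keyed by the full PC, whereas real hardware uses a small fixed-size table with tags, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Cache of branch target addresses, looked up with the PC at fetch time."""

    def __init__(self):
        self.targets = {}

    def record(self, branch_pc: int, target: int) -> None:
        """Remember the target once a branch has computed it."""
        self.targets[branch_pc] = target

    def next_fetch_pc(self, pc: int, predicted_taken: bool) -> int:
        target = self.targets.get(pc)
        if target is not None and predicted_taken:
            return target   # hit and predicted taken: fetch the target immediately
        return pc + 4       # miss, or predicted not taken: fall through


btb = BranchTargetBuffer()
btb.record(40, 72)  # the beq at address 40 branches to address 72
print(btb.next_fetch_pc(40, predicted_taken=True))
print(btb.next_fetch_pc(40, predicted_taken=False))
```

On a hit with a taken prediction the fetch stage redirects without waiting for the target adder, which is what removes the 1-cycle penalty for a taken branch.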
Concluding Remarks
ISA influences design of datapath and control
Datapath and control influence design of ISA
Pipelining improves instruction throughput
using parallelism
More instructions completed per second
Latency for each instruction not reduced
Hazards: structural, data, control
Main additions in hardware:
forwarding unit
hazard detection and stalling
branch predictor
branch target table