Pipelined datapath and control
Forwarding
The actual result $1 - $3 is computed in clock cycle 3, before it’s
needed in cycles 4 and 5
We forward that value to later instructions, to prevent data hazards:
— In clock cycle 4, AND gets the value $1 - $3 from EX/MEM
— In cycle 5, OR gets that same result from MEM/WB
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
[Pipeline diagram: clock cycles 1 to 7, with each instruction passing through IM, Reg, DM and Reg, starting one cycle after the instruction before it.]
Outline of forwarding hardware
A forwarding unit selects the correct ALU inputs for the EX stage:
— No hazard: ALU’s operands come from the register file, like normal
— Data hazard: operands come from either the EX/MEM or MEM/WB
pipeline registers instead
The ALU sources will be selected by two new multiplexers, with control
signals named ForwardA and ForwardB
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
[Pipeline diagram for the three instructions, each passing through IM, Reg, DM and Reg.]
Simplified datapath with forwarding muxes
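[Datapath figure: PC, instruction memory, registers, ALU and data memory, separated by the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers. Two three-input multiplexers (inputs 0, 1, 2), controlled by ForwardA and ForwardB, feed the ALU operands.]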
Detecting EX/MEM Hazards
How to detect an impending hazard?
sub $2, $1, $3
or $12, $2, $5
[Pipeline diagram: each instruction passes through IM, Reg, DM and Reg, with OR one cycle behind SUB.]
In the case above, the hazard shows up in cycle 3, when SUB is in EX and OR is in ID
It is a hazard because:
ID/EX.rd == IF/ID.rs
An EX/MEM hazard occurs between the instruction currently in its EX
stage and the previous instruction if:
1. The previous instruction will write to the register file, and
2. The destination is one of the ALU source registers in the EX stage
EX/MEM data hazard equations
The first ALU source comes from the pipeline register when necessary:
if (EX/MEM.RegWrite and EX/MEM.rd == ID/EX.rs)
ForwardA = 2
The second ALU source is similar:
if (EX/MEM.RegWrite and EX/MEM.rd == ID/EX.rt)
ForwardB = 2
sub $2, $1, $3
and $12, $2, $5
[Pipeline diagram: each instruction passes through IM, Reg, DM and Reg, with AND one cycle behind SUB.]
MEM/WB data hazards
A MEM/WB hazard may occur between an instruction in the EX stage and
the instruction from two cycles ago
One new problem arises if a register is updated twice in a row:
add $1, $2, $3
add $1, $1, $4
sub $5, $5, $1
Register $1 is written by both of the previous instructions, but only the
most recent result (from the second ADD) should be forwarded
[Pipeline diagram: add $1, $2, $3; add $1, $1, $4; sub $5, $5, $1, each passing through IM, Reg, DM and Reg.]
MEM/WB hazard equations
Here is an equation for detecting and handling MEM/WB hazards for the
first ALU source:
if (MEM/WB.RegWrite and MEM/WB.rd == ID/EX.rs and
(EX/MEM.rd ≠ ID/EX.rs or not(EX/MEM.RegWrite)))
ForwardA = 1
The second ALU operand is handled similarly:
if (MEM/WB.RegWrite and MEM/WB.rd == ID/EX.rt and
(EX/MEM.rd ≠ ID/EX.rt or not(EX/MEM.RegWrite)))
ForwardB = 1
Handled by a forwarding unit which uses the control signals stored in
pipeline registers to set the values of ForwardA and ForwardB
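The EX/MEM and MEM/WB hazard equations can be combined into one small function. The following is a minimal Python sketch, not a hardware description: the class name `PipeRegs`, its field names, and the functions `forward_select` and `forwarding_unit` are illustrative assumptions. The mux encodings (0 = register file, 1 = MEM/WB, 2 = EX/MEM) follow the equations on these slides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PipeRegs:
    """Pipeline-register fields read by the forwarding unit (names are assumed)."""
    id_ex_rs: int
    id_ex_rt: int
    ex_mem_rd: int
    ex_mem_reg_write: bool
    mem_wb_rd: int
    mem_wb_reg_write: bool


def forward_select(src: int, r: PipeRegs) -> int:
    """Return the mux select (0, 1 or 2) for one ALU source register."""
    ex_mem_hit = r.ex_mem_reg_write and r.ex_mem_rd == src
    mem_wb_hit = r.mem_wb_reg_write and r.mem_wb_rd == src
    if ex_mem_hit:
        return 2  # EX/MEM hazard: forward the most recent result
    if mem_wb_hit:
        return 1  # MEM/WB hazard, and EX/MEM does not also match
    return 0      # no hazard: use the register file value


def forwarding_unit(r: PipeRegs) -> Tuple[int, int]:
    """Return (ForwardA, ForwardB) for the instruction in EX."""
    return forward_select(r.id_ex_rs, r), forward_select(r.id_ex_rt, r)


# add $1,$2,$3 ; add $1,$1,$4 ; sub $5,$5,$1 with the sub in EX:
# both older instructions write $1, so the EX/MEM copy must win.
double_write = PipeRegs(id_ex_rs=5, id_ex_rt=1,
                        ex_mem_rd=1, ex_mem_reg_write=True,
                        mem_wb_rd=1, mem_wb_reg_write=True)
print(forwarding_unit(double_write))  # (0, 2)
```

Checking the EX/MEM match first is equivalent to the extra "EX/MEM.rd ≠ ID/EX.rs or not(EX/MEM.RegWrite)" term in the MEM/WB equation: the older value is used only when the newer one does not apply.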
Simplified datapath with forwarding
[Datapath figure: the simplified datapath with a forwarding unit added. The unit's inputs are ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd and MEM/WB.RegisterRd; its outputs, ForwardA and ForwardB, control the two ALU operand multiplexers.]
Example
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Assume again each register initially contains its number plus 100
— After the first instruction, $2 should contain -2 (= 101 - 103)
— The other instructions should all use -2 as one of their operands
We’ll try to keep the example short:
— Assume no forwarding is needed except for register $2
— We’ll skip the first two cycles, since they’re the same as before
Clock cycle 3
IF: or $13, $6, $2
ID: and $12, $2, $5
EX: sub $2, $1, $3
[Datapath figure: SUB's operands 101 and 103 reach the ALU, which produces -2. In ID, AND reads 102 (the old value of $2) and 105 from the register file.]
Clock cycle 4: forwarding $2 from EX/MEM
IF: add $14, $2, $2
ID: or $13, $6, $2
EX: and $12, $2, $5
MEM: sub $2, $1, $3
[Datapath figure: EX/MEM.RegisterRd is 2, which matches ID/EX.RegisterRs, so the ALU's first operand is -2 from EX/MEM instead of the old 102. The second operand is 105 and the ALU result is 104.]
Clock cycle 5: forwarding $2 from MEM/WB
IF: sw $15, 100($2)
ID: add $14, $2, $2
EX: or $13, $6, $2
MEM: and $12, $2, $5
WB: sub $2, $1, $3
[Datapath figure: MEM/WB.RegisterRd is 2, which matches ID/EX.RegisterRt, so the ALU's second operand is -2 from MEM/WB. The first operand is 106 from the register file and the ALU result is -2. SUB writes -2 into $2 during this cycle.]
Forwarding resolved two data hazards
The data hazard during cycle 4:
— The forwarding unit notices that the ALU’s first source register for the
AND is also the destination of the SUB instruction
— The correct value is forwarded from the EX/MEM register, overriding
the incorrect old value still in the register file
The data hazard during cycle 5:
— The ALU’s second source (for OR) is the SUB destination again
— This time, the value has to be forwarded from the MEM/WB pipeline
register instead
There are no other hazards involving the SUB instruction
— During cycle 5, SUB writes its result back into register $2
— The ADD instruction can read this new value from the register file in
the same cycle
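The summary above can be checked with a short Python sketch that reports, for each source operand, whether its value comes from EX/MEM, MEM/WB or the register file. The helper names `dest_and_sources` and `forwarding_sources` are illustrative. The sketch assumes only R-type instructions and sw (no loads, which would need a stall), and assumes a result written in WB can be read from the register file in the same cycle.

```python
import re


def dest_and_sources(instr):
    """Split an instruction into (destination, sources) register numbers."""
    op = instr.split()[0]
    nums = [int(n) for n in re.findall(r"\$(\d+)", instr)]
    if op == "sw":            # sw rt, offset(rs): reads both, writes none
        return None, nums
    return nums[0], nums[1:]  # R-type: op rd, rs, rt


def forwarding_sources(program):
    """Return (instruction, register, origin) for every source operand."""
    report = []
    for i, instr in enumerate(program):
        _, sources = dest_and_sources(instr)
        for src in sources:
            origin = "register file"
            # Check the previous two instructions, nearest first.
            for distance, stage in ((1, "EX/MEM"), (2, "MEM/WB")):
                j = i - distance
                if j >= 0 and dest_and_sources(program[j])[0] == src:
                    origin = stage
                    break
            report.append((instr, src, origin))
    return report


program = [
    "sub $2, $1, $3",
    "and $12, $2, $5",
    "or $13, $6, $2",
    "add $14, $2, $2",
    "sw $15, 100($2)",
]
forwarded = [r for r in forwarding_sources(program) if r[2] != "register file"]
for row in forwarded:
    print(row)
```

Only two operands are forwarded: $2 for AND (from EX/MEM) and $2 for OR (from MEM/WB). ADD and SW are three or more instructions behind SUB, so they read $2 from the register file.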
Complete pipelined datapath...so far
[Datapath figure: the complete pipelined datapath so far. Control produces the EX, M and WB signal groups, which are carried through the ID/EX, EX/MEM and MEM/WB registers. The datapath includes the ALUSrc and RegDst multiplexers, the extender for Instr [15 - 0], and the forwarding unit, which takes Rs, Rt, EX/MEM.RegisterRd and MEM/WB.RegisterRd.]
What about stores?
Two “easy” cases:
add $1, $2, $3
sw $4, 0($1)
[Pipeline diagram: clock cycles 1 to 6.]
add $1, $2, $3
sw $1, 0($4)
[Pipeline diagram: clock cycles 1 to 6.]
Store Bypassing: Version 1
MEM: add $1, $2, $3
EX: sw $4, 0($1)
[Datapath figure: the complete pipelined datapath with the forwarding unit, shown with ADD in MEM and SW in EX.]
Store Bypassing: Version 2
MEM: add $1, $2, $3
EX: sw $1, 0($4)
[Datapath figure: the complete pipelined datapath with the forwarding unit, shown with ADD in MEM and SW in EX.]
What about stores?
A harder case:
lw $1, 0($2)
sw $1, 0($4)
[Pipeline diagram: clock cycles 1 to 6.]
In what cycle is the load value available?
— End of cycle 4
In what cycle is the store value needed?
— Start of cycle 5
What do we have to add to the datapath?
Load/Store Bypassing: Extend the Datapath
[Datapath figure: the complete pipelined datapath with one addition, a two-input multiplexer (inputs 0 and 1) controlled by a new signal, ForwardC.]
Sequence :
lw $1, 0($2)
sw $1, 0($4)
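The slides do not give an equation for ForwardC, so the following Python sketch is only one plausible condition, under stated assumptions: the store's rt number is carried along in EX/MEM (the hypothetical field `ex_mem_rt`), and mux input 1 selects the value being written back from MEM/WB while input 0 selects the store data already in EX/MEM.

```python
def forward_c(mem_wb_reg_write: bool, mem_wb_rd: int,
              ex_mem_mem_write: bool, ex_mem_rt: int) -> int:
    """Select the data memory's write-data input for a store in MEM.

    Assumed encoding: 1 = value from MEM/WB, 0 = store data from EX/MEM.
    """
    if ex_mem_mem_write and mem_wb_reg_write and mem_wb_rd == ex_mem_rt:
        return 1
    return 0


# Cycle 5 of the sequence: lw $1, 0($2) is in WB, sw $1, 0($4) is in MEM.
print(forward_c(mem_wb_reg_write=True, mem_wb_rd=1,
                ex_mem_mem_write=True, ex_mem_rt=1))  # 1
```

This matches the timing on the previous slide: the load value exists at the end of cycle 4 and the store needs it at the start of cycle 5, so no stall is required.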
Miscellaneous comments
Each MIPS instruction writes to at most one register
— This makes the forwarding hardware easier to design, since there is
only one destination register that ever needs to be forwarded
Forwarding is especially important with deep pipelines like the ones in all
current PC processors
The textbook has some additional material not shown here:
— Their hazard detection equations also ensure that the source register
is not $0, which can never be modified
Load-Use Data Hazard
Need to stall for one cycle
What about loads?
Consider the instruction sequence shown below:
— The load data doesn’t come from memory until the end of cycle 4
— But the AND needs that value at the beginning of the same cycle!
This is a “true” data hazard—the data is not available when we need it
lw $2, 20($3)
and $12, $2, $5
[Pipeline diagram: clock cycles 1 to 6, with AND one cycle behind LW.]
We call this a load-use hazard
Stalling
The easiest solution is to stall the pipeline
We could delay the AND instruction by introducing a one-cycle delay into
the pipeline, sometimes called a bubble
lw $2, 20($3)
and $12, $2, $5
[Pipeline diagram: clock cycles 1 to 7, with a one-cycle bubble delaying the AND.]
Notice that we’re still using forwarding in cycle 5, to get data from the
MEM/WB pipeline register to the ALU
Stalling and forwarding
Without forwarding, we’d have to stall for two cycles to wait for the LW
instruction’s writeback stage
lw $2, 20($3)
and $12, $2, $5
[Pipeline diagram: clock cycles 1 to 8, with a two-cycle delay before the AND continues.]
In general, you can always stall to avoid hazards—but dependencies are
very common in real code, and stalling often can reduce performance by
a significant amount
Load-Use Hazard Detection
• Check when using instruction is decoded in ID stage
• ALU operand register numbers in ID stage are given by
• IF/ID.RegisterRs, IF/ID.RegisterRt
• Load-use hazard when
• ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert bubble
How to Stall the Pipeline
• Force control values in ID/EX register to 0
• EX, MEM and WB do nop (no-operation)
• Prevent update of PC and IF/ID register
• Using instruction is decoded again
• Following instruction is fetched again
• 1-cycle stall allows MEM to read data for lw
• Can subsequently forward to EX stage
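The detection condition and the stall actions listed above can be sketched together in Python. This is an illustration, not a hardware description: the names `HazardSignals` and `hazard_unit` and the field `zero_control` are assumptions chosen to mirror the bullets.

```python
from dataclasses import dataclass


@dataclass
class HazardSignals:
    pc_write: bool      # False keeps the PC unchanged this cycle
    if_id_write: bool   # False keeps the IF/ID register unchanged
    zero_control: bool  # True forces the ID/EX control values to 0 (bubble)


def hazard_unit(id_ex_mem_read: bool, id_ex_rt: int,
                if_id_rs: int, if_id_rt: int) -> HazardSignals:
    """Detect a load-use hazard while the using instruction is in ID."""
    stall = id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)
    return HazardSignals(pc_write=not stall,
                         if_id_write=not stall,
                         zero_control=stall)


# lw $2, 20($3) is in EX and and $12, $2, $5 is in ID.
print(hazard_unit(id_ex_mem_read=True, id_ex_rt=2, if_id_rs=2, if_id_rt=5))
```

When the stall is signalled, the instruction in ID is decoded again and the one in IF is fetched again, while the bubble travels through EX, MEM and WB as a nop.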
Stalling delays the entire pipeline
If we delay the second instruction, we’ll have to delay the third one too
— This is necessary to make forwarding work between AND and OR
— It also prevents problems such as two instructions trying to write to
the same register in the same cycle
lw $2, 20($3)
and $12, $2, $5
or $13, $12, $2
[Pipeline diagram: clock cycles 1 to 8, with AND and OR both delayed by one cycle.]
What about EX, MEM, WB
But what about the ALU during cycle 4, the data memory in cycle 5, and
the register file write in cycle 6?
lw $2, 20($3)
and $12, $2, $5
or $13, $12, $2
[Pipeline diagram: clock cycles 1 to 8. During the stall, AND repeats its Reg stage and OR repeats its IM stage.]
Those units aren’t used in those cycles because of the stall, so we can set
the EX, MEM and WB control signals to all 0s.
Detecting Stalls, cont.
lw $2, 20($3)
and $12, $2, $5
[Pipeline diagram: both instructions pass through IM, Reg, DM and Reg, with the if/id, id/ex, ex/mem and mem/wb pipeline registers marked between stages.]
When should stalls be detected?
EX stage (of the instruction causing the stall)
What is the stall condition?
if (ID/EX.MemRead = 1 and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt))
then stall
Adding hazard detection to the CPU
[Datapath figure: the complete pipelined datapath with a hazard unit added. Its inputs are ID/EX.MemRead, ID/EX.RegisterRt and the Rs and Rt fields from IF/ID. Its outputs are PC Write, IF/ID Write, and the select of a multiplexer that replaces the control values entering ID/EX with 0. The forwarding unit is unchanged.]
Stalls and Performance
Stalls reduce performance
— But are required to get correct results
Compiler can arrange code to avoid hazards and stalls
— Requires knowledge of the pipeline structure
Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the
next instruction
Ex: C code for A = B + E; C = B + F;
Original order:
lw $t1, 0($t0)
lw $t2, 4($t0)
(stall)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
(stall)
add $t5, $t1, $t4
sw $t5, 16($t0)
13 cycles
Reordered:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
11 cycles
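The two cycle counts can be reproduced with a short Python sketch. It assumes a five-stage pipeline with forwarding, where the only stalls are one-cycle load-use stalls (an instruction that reads a register loaded by the instruction immediately before it). The function names `parse` and `count_cycles` are illustrative.

```python
import re


def parse(instr):
    """Return (opcode, destination register, source registers)."""
    op = instr.split()[0]
    regs = re.findall(r"\$\w+", instr)
    if op == "sw":                 # sw reads both registers, writes none
        return op, None, regs
    return op, regs[0], regs[1:]   # lw and R-type write the first register


def count_cycles(program, stages=5):
    """Cycles to run the program: fill time plus one per instruction and stall."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = parse(prev)
        _, _, cur_sources = parse(cur)
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1            # load-use hazard: one bubble
    return len(program) + (stages - 1) + stalls


original = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2",
            "sw $t3, 12($t0)", "lw $t4, 8($t0)", "add $t5, $t1, $t4",
            "sw $t5, 16($t0)"]
reordered = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)",
             "add $t3, $t1, $t2", "sw $t3, 12($t0)", "add $t5, $t1, $t4",
             "sw $t5, 16($t0)"]
print(count_cycles(original), count_cycles(reordered))  # 13 11
```

Both orders run seven instructions, so the two-cycle difference comes entirely from the two load-use stalls that the reordering removes.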
Branches in the original pipelined datapath
When are they resolved?
[Datapath figure: the original pipelined datapath without forwarding. A multiplexer controlled by PCSrc selects the next PC from either PC + 4 or the branch target, which is computed by an adder from the sign-extended Instr [15 - 0] shifted left by 2. Control signals shown: RegWrite, ALUSrc, ALUOp, RegDst, MemWrite, MemRead, MemToReg.]
Branch Hazards
If branch outcome determined in MEM:
[Pipeline figure: the instructions fetched after the branch are marked "Flush these instructions (set control values to 0)", and the PC is redirected.]
Reducing Branch Delay
Move hardware to determine outcome to ID stage
— Target address adder
— Register comparator
Example: branch taken
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
72: lw $4, 50($7)
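The addresses in this example follow from the MIPS branch target calculation: the offset field counts instructions (words) relative to the instruction after the branch. A one-function Python sketch makes the arithmetic explicit; the name `branch_target` is illustrative.

```python
def branch_target(pc: int, offset_words: int) -> int:
    """Target of a MIPS branch at address pc with the given word offset."""
    return pc + 4 + (offset_words << 2)


# beq $1, $3, 7 at address 40 branches to the lw at address 72.
print(branch_target(40, 7))  # 72
```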
[Figures: two datapath snapshots titled "Example: Branch Taken".]
Data Hazards for Branches
If a comparison register is a destination of 2nd or 3rd preceding
ALU instruction
add $1, $2, $3         IF  ID  EX  MEM WB
add $4, $5, $6             IF  ID  EX  MEM WB
…                              IF  ID  EX  MEM WB
beq $1, $4, target                 IF  ID  EX  MEM WB
Can resolve using forwarding
Data Hazards for Branches
If a comparison register is a destination of preceding ALU
instruction or 2nd preceding load instruction
Need 1 stall cycle
lw $1, addr            IF  ID  EX  MEM WB
add $4, $5, $6             IF  ID  EX  MEM WB
beq stalled                    IF  ID
beq $1, $4, target                 ID  EX  MEM WB
Data Hazards for Branches
If a comparison register is a destination of immediately preceding
load instruction
— Need 2 stall cycles
lw $1, addr            IF  ID  EX  MEM WB
beq stalled                IF  ID
beq stalled                    ID
beq $1, $0, target                 ID  EX  MEM WB
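The three cases for a branch resolved in ID reduce to a simple rule, sketched here in Python. The function name and the `ready_after` table are illustrative: an ALU result can be forwarded to ID two instructions later, and a load result three instructions later, so any shortfall becomes stall cycles.

```python
def branch_stall_cycles(producer: str, distance: int) -> int:
    """Stall cycles for a branch that compares a register in ID.

    producer: "alu" or "load", the instruction that writes the register.
    distance: 1 if it immediately precedes the branch, 2 if it is the
              2nd preceding instruction, and so on.
    """
    ready_after = {"alu": 2, "load": 3}[producer]
    return max(0, ready_after - distance)


for producer, distance in [("alu", 2), ("alu", 1), ("load", 2), ("load", 1)]:
    print(producer, distance, branch_stall_cycles(producer, distance))
```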
Branch Prediction
• Longer pipelines can’t readily determine branch
outcome early
• Stall penalty becomes unacceptable
• Predict (i.e., guess) outcome of branch
• Only stall if prediction is wrong
• Simplest prediction strategy
• predict branches not taken
• Works well for loops if the loop tests are done at the
start.
• Fetch instruction after branch, with no delay
MIPS with Predict Not Taken
[Pipeline figures: one case labelled "Prediction correct" and one labelled "Prediction incorrect".]
Dynamic Branch Prediction
In deeper and superscalar pipelines, branch penalty is more
significant
Use dynamic prediction
Branch prediction buffer (aka branch history table)
Indexed by recent branch instruction addresses
Stores outcome (taken/not taken)
To execute a branch
Check table, expect the same outcome
Start fetching from fall-through or target
If wrong, flush pipeline and flip prediction
1-Bit Predictor: Shortcoming
Inner loop branches mispredicted twice!
outer: …
…
inner: …
…
beq …, …, inner
…
beq …, …, outer
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop
next time around
2-Bit Predictor
Only change prediction on two successive mispredictions
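The difference between the two predictors can be seen by simulating the inner-loop branch. The Python sketch below is an illustration under assumptions: the 2-bit predictor is modelled as a saturating counter (one common realization, predicting taken when the counter is 2 or 3), both predictors start out predicting taken, and the function names are made up for this example.

```python
def mispredicts_1bit(outcomes, predict_taken=True):
    """1-bit predictor: always predict the previous outcome."""
    count = 0
    for taken in outcomes:
        if predict_taken != taken:
            count += 1
            predict_taken = taken      # flip on every misprediction
    return count


def mispredicts_2bit(outcomes, counter=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    count = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            count += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return count


# Inner-loop branch: taken three times, then not taken, for 3 outer iterations.
inner_branch = [True, True, True, False] * 3
print(mispredicts_1bit(inner_branch), mispredicts_2bit(inner_branch))  # 5 3
```

The 1-bit predictor misses the loop exit and then the re-entry on the next pass, while the 2-bit counter misses only the exit.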
Calculating the Branch Target
Even with predictor, still need to calculate the target
address
1-cycle penalty for a taken branch
Branch target buffer
Cache of target addresses
Indexed by PC when instruction fetched
If hit and instruction is branch predicted taken,
can fetch target immediately
Concluding Remarks
ISA influences design of datapath and control
Datapath and control influence design of ISA
Pipelining improves instruction throughput
using parallelism
More instructions completed per second
Latency for each instruction not reduced
Hazards: structural, data, control
Main additions in hardware:
forwarding unit
hazard detection and stalling
branch predictor
branch target table