Pipelined datapath and control


Forwarding
 The actual result $1 - $3 is computed in clock cycle 3, before it’s
needed in cycles 4 and 5
 We forward that value to later instructions, to prevent data hazards:
— In clock cycle 4, AND gets the value $1 - $3 from EX/MEM
— In cycle 5, OR gets that same result from MEM/WB
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
[Figure: pipeline diagram over clock cycles 1 to 7; each instruction passes through the five pipeline stages, starting one cycle after the previous one]
Outline of forwarding hardware
 A forwarding unit selects the correct ALU inputs for the EX stage:
— No hazard: ALU’s operands come from the register file, like normal
— Data hazard: operands come from either the EX/MEM or MEM/WB
pipeline registers instead
 The ALU sources will be selected by two new multiplexers, with control
signals named ForwardA and ForwardB
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
[Figure: pipeline diagram for these three instructions]
Simplified datapath with forwarding muxes
[Figure: datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers; two three-input muxes (inputs 0, 1, 2), controlled by ForwardA and ForwardB, feed the ALU inputs]
Detecting EX/MEM Hazards

How to detect an impending hazard?
sub $2, $1, $3
or $12, $2, $5
[Figure: pipeline diagram for these two instructions]

In the above case, it occurs in cycle 3, when SUB is in EX and OR is in ID
— Hazard because:
ID/EX.rd == IF/ID.rs

An EX/MEM hazard occurs between the instruction currently in its EX
stage and the previous instruction if:
1. The previous instruction will write to the register file, and
2. The destination is one of the ALU source registers in the EX stage
EX/MEM data hazard equations
 The first ALU source comes from the pipeline register when necessary:
if (EX/MEM.RegWrite and EX/MEM.rd == ID/EX.rs)
ForwardA = 2
 The second ALU source is similar:
if (EX/MEM.RegWrite and EX/MEM.rd == ID/EX.rt)
ForwardB = 2
sub $2, $1, $3
and $12, $2, $5
[Figure: pipeline diagram for these two instructions]
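The two EX/MEM conditions on this slide can be sketched in Python. This is only an illustration of the logic, not the hardware itself; the function name `ex_mem_forward` and its argument names are assumptions made for this example.

```python
def ex_mem_forward(ex_mem_regwrite, ex_mem_rd, id_ex_src):
    """Return the forwarding mux select for one ALU source register.

    2 = take the ALU result held in EX/MEM, 0 = use the register file value.
    id_ex_src is ID/EX.rs when computing ForwardA, ID/EX.rt for ForwardB.
    """
    if ex_mem_regwrite and ex_mem_rd == id_ex_src:
        return 2
    return 0


# sub $2, $1, $3 is in MEM (EX/MEM.rd = 2) while and $12, $2, $5 is in EX.
forward_a = ex_mem_forward(True, 2, 2)   # rs = $2 matches the SUB destination
forward_b = ex_mem_forward(True, 2, 5)   # rt = $5 does not
print(forward_a, forward_b)
```

Only the first source of the AND is forwarded here; the second source still comes from the register file.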
MEM/WB data hazards
 A MEM/WB hazard may occur between an instruction in the EX stage and
the instruction from two cycles ago
 One new problem is if a register is updated twice in a row:
add $1, $2, $3
add $1, $1, $4
sub $5, $5, $1
 Register $1 is written by both of the previous instructions, but only the
most recent result (from the second ADD) should be forwarded
[Figure: pipeline diagram of this add, add, sub sequence]
MEM/WB hazard equations
 Here is an equation for detecting and handling MEM/WB hazards for the
first ALU source:
if (MEM/WB.RegWrite and MEM/WB.rd == ID/EX.rs and
    (EX/MEM.rd ≠ ID/EX.rs or not(EX/MEM.RegWrite)))
    ForwardA = 1
 The second ALU operand is handled similarly:
if (MEM/WB.RegWrite and MEM/WB.rd == ID/EX.rt and
    (EX/MEM.rd ≠ ID/EX.rt or not(EX/MEM.RegWrite)))
    ForwardB = 1
 Handled by a forwarding unit which uses the control signals stored in
pipeline registers to set the values of ForwardA and ForwardB
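A minimal Python sketch of a forwarding unit that combines the EX/MEM and MEM/WB rules is shown below. The `Stage` tuple and the function name `forward_select` are assumptions made for this example. Checking EX/MEM first has the same effect as the extra "EX/MEM does not match" term in the MEM/WB equations: the most recent result wins.

```python
from collections import namedtuple

# Only the pipeline-register fields the forwarding unit looks at.
Stage = namedtuple("Stage", ["reg_write", "rd"])


def forward_select(src, ex_mem, mem_wb):
    """Mux select for one ALU source: 2 = EX/MEM, 1 = MEM/WB, 0 = register file."""
    if ex_mem.reg_write and ex_mem.rd == src:
        return 2                      # most recent result takes priority
    if mem_wb.reg_write and mem_wb.rd == src:
        return 1
    return 0


# add $1,$2,$3 ; add $1,$1,$4 ; sub $5,$5,$1, at the moment the SUB is in EX.
ex_mem = Stage(reg_write=True, rd=1)  # second ADD
mem_wb = Stage(reg_write=True, rd=1)  # first ADD
forward_a = forward_select(5, ex_mem, mem_wb)  # rs = $5
forward_b = forward_select(1, ex_mem, mem_wb)  # rt = $1
print(forward_a, forward_b)
```

For the SUB, the second source is taken from EX/MEM (the second ADD), even though MEM/WB also holds a value for $1.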
Simplified datapath with forwarding
[Figure: the simplified datapath with a Forwarding Unit added. Its inputs are ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd and MEM/WB.RegisterRd; its outputs are the ForwardA and ForwardB mux selects]
Example
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
 Assume again each register initially contains its number plus 100
— After the first instruction, $2 should contain -2 (= 101 - 103)
— The other instructions should all use -2 as one of their operands
 We’ll try to keep the example short:
— Assume no forwarding is needed except for register $2
— We’ll skip the first two cycles, since they’re the same as before
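As a reference point, the values the pipeline must produce can be checked with a plain sequential model. The sketch below is not a pipeline simulation; it simply executes the five instructions in order under the slide's assumption that each register starts with its number plus 100. The dictionary names `regs` and `memory` are chosen for this example.

```python
# Sequential (non-pipelined) execution of the example program.
regs = {n: n + 100 for n in range(1, 32)}   # each register holds its number plus 100
regs[0] = 0                                  # $0 is hard-wired to zero in MIPS
memory = {}

regs[2] = regs[1] - regs[3]                  # sub $2, $1, $3
regs[12] = regs[2] & regs[5]                 # and $12, $2, $5
regs[13] = regs[6] | regs[2]                 # or  $13, $6, $2
regs[14] = regs[2] + regs[2]                 # add $14, $2, $2
memory[regs[2] + 100] = regs[15]             # sw  $15, 100($2)

print(regs[2], regs[12], regs[13], regs[14], memory)
```

With forwarding in place, the pipelined datapath should reach exactly these register and memory values.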
Clock cycle 3
IF: or $13, $6, $2
ID: and $12, $2, $5
EX: sub $2, $1, $3
[Figure: datapath snapshot for cycle 3. In EX, the ALU subtracts 103 from 101 and produces -2; in ID, AND reads the old value 102 for $2 and 105 for $5]
Clock cycle 4: forwarding $2 from EX/MEM
IF: add $14, $2, $2
ID: or $13, $6, $2
EX: and $12, $2, $5
MEM: sub $2, $1, $3
[Figure: datapath snapshot for cycle 4. ID/EX.RegisterRs = 2 matches EX/MEM.RegisterRd = 2, so ForwardA = 2; the ALU receives -2 (instead of the old value 102) and 105, and produces 104]
Clock cycle 5: forwarding $2 from MEM/WB
IF: sw $15, 100($2)
ID: add $14, $2, $2
EX: or $13, $6, $2
MEM: and $12, $2, $5
WB: sub $2, $1, $3
[Figure: datapath snapshot for cycle 5. ID/EX.RegisterRt = 2 matches MEM/WB.RegisterRd = 2 (EX/MEM.RegisterRd is 12), so ForwardB = 1; the ALU receives 106 and -2]
Forwarding resolved two data hazards
 The data hazard during cycle 4:
— The forwarding unit notices that the ALU’s first source register for the
AND is also the destination of the SUB instruction
— The correct value is forwarded from the EX/MEM register, overriding
the incorrect old value still in the register file
 The data hazard during cycle 5:
— The ALU’s second source (for OR) is the SUB destination again
— This time, the value has to be forwarded from the MEM/WB pipeline
register instead
 There are no other hazards involving the SUB instruction
— During cycle 5, SUB writes its result back into register $2
— The ADD instruction can read this new value from the register file in
the same cycle
Complete pipelined datapath...so far
[Figure: the complete pipelined datapath so far, showing the Control unit with its WB, M and EX signal groups carried through the pipeline registers, the register file, the ALU with its forwarding muxes and ALUSrc mux, the data memory, the RegDst mux, and the Forwarding Unit with inputs Rs, Rt, EX/MEM.RegisterRd and MEM/WB.RegisterRd]
What about stores?

Two “easy” cases:
add $1, $2, $3
sw $4, 0($1)
and
add $1, $2, $3
sw $1, 0($4)
[Figure: pipeline diagrams over clock cycles 1 to 6 for both two-instruction sequences]
Store Bypassing: Version 1
MEM: add $1, $2, $3
EX: sw $4, 0($1)
[Figure: the complete forwarding datapath, with ADD in MEM and SW in EX; the store uses $1 as its base address register]
Store Bypassing: Version 2
MEM: add $1, $2, $3
EX: sw $1, 0($4)
[Figure: the complete forwarding datapath, with ADD in MEM and SW in EX; the store uses $1 as the data to be written to memory]
What about stores?




A harder case:
lw $1, 0($2)
sw $1, 0($4)
[Figure: pipeline diagram over clock cycles 1 to 6 for these two instructions]
In what cycle is the load value available?
— End of cycle 4
In what cycle is the store value needed?
— Start of cycle 5
What do we have to add to the datapath?
Load/Store Bypassing: Extend the Datapath
Sequence:
lw $1, 0($2)
sw $1, 0($4)
[Figure: the forwarding datapath extended with an additional two-input mux controlled by a new signal, ForwardC]
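The slides do not give an equation for ForwardC. The sketch below shows one plausible rule and is entirely an assumption: it supposes that the store's Rt register number is carried into EX/MEM, and that select value 1 means "take the store data from MEM/WB". The function name `forward_c` and its arguments are hypothetical.

```python
def forward_c(ex_mem_memwrite, ex_mem_rt, mem_wb_regwrite, mem_wb_rd):
    """Assumed ForwardC rule: 1 = store data comes from MEM/WB, 0 = from EX/MEM.

    Hypothetical condition, not taken from the slides: the instruction in MEM
    is a store, the instruction in WB writes a register, and that register is
    the one whose value the store writes to memory.
    """
    if ex_mem_memwrite and mem_wb_regwrite and mem_wb_rd == ex_mem_rt:
        return 1
    return 0


# lw $1, 0($2) is in WB while sw $1, 0($4) is in MEM.
select = forward_c(ex_mem_memwrite=True, ex_mem_rt=1,
                   mem_wb_regwrite=True, mem_wb_rd=1)
print(select)
```

Under this assumed rule, the loaded value reaches the data memory's write port in the cycle right after it is read, so the lw/sw pair needs no stall.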
Miscellaneous comments
 Each MIPS instruction writes to at most one register
— This makes the forwarding hardware easier to design, since there is
only one destination register that ever needs to be forwarded
 Forwarding is especially important with deep pipelines like the ones in all
current PC processors
 The textbook has some additional material not shown here:
— Their hazard detection equations also ensure that the source register
is not $0, which can never be modified
Load-Use Data Hazard
Need to stall for one cycle
What about loads?
 Consider the instruction sequence shown below:
— The load data doesn’t come from memory until the end of cycle 4
— But the AND needs that value at the beginning of the same cycle!
 This is a “true” data hazard—the data is not available when we need it
lw $2, 20($3)
and $12, $2, $5
[Figure: pipeline diagram over clock cycles 1 to 6 for these two instructions]
 We call this a load-use hazard
Stalling
 The easiest solution is to stall the pipeline
 We could delay the AND instruction by introducing a one-cycle delay into
the pipeline, sometimes called a bubble
lw $2, 20($3)
and $12, $2, $5
[Figure: pipeline diagram over clock cycles 1 to 7; AND is delayed by a one-cycle bubble]
 Notice that we’re still using forwarding in cycle 5, to get data from the
MEM/WB pipeline register to the ALU
Stalling and forwarding
 Without forwarding, we’d have to stall for two cycles to wait for the LW
instruction’s writeback stage
lw $2, 20($3)
and $12, $2, $5
[Figure: pipeline diagram over clock cycles 1 to 8; without forwarding, AND is delayed by two cycles]
 In general, you can always stall to avoid hazards—but dependencies are
very common in real code, and stalling often can reduce performance by
a significant amount
Load-Use Hazard Detection
• Check when using instruction is decoded in ID stage
• ALU operand register numbers in ID stage are given by
• IF/ID.RegisterRs, IF/ID.RegisterRt
• Load-use hazard when
• ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert bubble
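The detection condition above translates directly into a small Python predicate. The function and argument names are assumptions made for this sketch; the logic follows the slide.

```python
def load_use_hazard(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the result of a load that is in EX."""
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


# lw $2, 20($3) is in EX (destination rt = 2); and $12, $2, $5 is in ID.
must_stall = load_use_hazard(id_ex_memread=True, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
print(must_stall)
```

If the instruction in EX is not a load (MemRead is 0), the predicate is false and ordinary forwarding handles the dependence.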
How to Stall the Pipeline
• Force control values in ID/EX register to 0
• EX, MEM and WB do nop (no-operation)
• Prevent update of PC and IF/ID register
• Using instruction is decoded again
• Following instruction is fetched again
• 1-cycle stall allows MEM to read data for lw
• Can subsequently forward to EX stage
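The stall steps above can be sketched as one clock edge of the pipeline's front end. Everything here is a simplified assumption for illustration: the function name `clock_front_end`, the use of strings for instructions, the three control signal names, and the instruction addresses.

```python
def clock_front_end(pc, if_id, decoded_control, fetched, stall):
    """Return (next PC, next IF/ID contents, control values written to ID/EX)."""
    if stall:
        # Hold PC and IF/ID, and send all-zero control values (a bubble) to EX.
        bubble = {signal: 0 for signal in decoded_control}
        return pc, if_id, bubble
    # Normal operation: advance the PC and latch the newly fetched instruction.
    return pc + 4, fetched, dict(decoded_control)


control_for_and = {"RegWrite": 1, "MemRead": 0, "MemWrite": 0}
stalled = clock_front_end(8, "and $12, $2, $5", control_for_and,
                          "or $13, $12, $2", stall=True)
normal = clock_front_end(8, "and $12, $2, $5", control_for_and,
                         "or $13, $12, $2", stall=False)
print(stalled)
print(normal)
```

In the stalled case the AND stays in IF/ID and is decoded again, and the PC is unchanged so the OR is fetched again.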
Stalling delays the entire pipeline
 If we delay the second instruction, we’ll have to delay the third one too
— This is necessary to make forwarding work between AND and OR
— It also prevents problems such as two instructions trying to write to
the same register in the same cycle
lw $2, 20($3)
and $12, $2, $5
or $13, $12, $2
[Figure: pipeline diagram over clock cycles 1 to 8; AND and OR are both delayed by one cycle]
What about EX, MEM, WB
 But what about the ALU during cycle 4, the data memory in cycle 5, and
the register file write in cycle 6?
lw $2, 20($3)
and $12, $2, $5
or $13, $12, $2
[Figure: the same pipeline diagram over clock cycles 1 to 8, showing the bubble passing through EX, MEM and WB]
 Those units aren’t used in those cycles because of the stall, so we can set
the EX, MEM and WB control signals to all 0s.
Detecting Stalls, cont.

lw $2, 20($3)
and $12, $2, $5
[Figure: pipeline diagram for these two instructions, with the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers labeled between stages]
When should stalls be detected?
EX stage (of the instruction causing the stall)
What is the stall condition?
if (ID/EX.MemRead = 1 and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt))
    then stall
Adding hazard detection to the CPU
[Figure: the forwarding datapath with a Hazard Unit added. Its inputs are ID/EX.MemRead, ID/EX.RegisterRt, and the Rs and Rt fields from IF/ID; its outputs are PC Write, IF/ID Write, and the select of a mux that replaces the control signals entering ID/EX with 0]
Stalls and Performance
 Stalls reduce performance
— But are required to get correct results
 Compiler can arrange code to avoid hazards and stalls
— Requires knowledge of the pipeline structure
Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the
next instruction
Example: C code for A = B + E; C = B + F;
Original order (13 cycles):
lw $t1, 0($t0)
lw $t2, 4($t0)
(stall)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
(stall)
add $t5, $t1, $t4
sw $t5, 16($t0)
Reordered (11 cycles):
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
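The two cycle counts can be reproduced with a small Python sketch. It assumes a five-stage pipeline with forwarding, where the only stall is one cycle when an instruction uses the result of the load immediately before it. The tuple encoding of instructions and the function name `cycles` are assumptions made for this example.

```python
def cycles(program, stages=5):
    """Total cycles for a list of (op, dest, sources) tuples, with forwarding."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        # One bubble when an instruction reads the register a preceding lw loads.
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return len(program) + (stages - 1) + stalls


original = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]
# Move the third load up so that no add directly follows the load it depends on.
reordered = [original[i] for i in (0, 1, 4, 2, 3, 5, 6)]

print(cycles(original), cycles(reordered))
```

Seven instructions need 7 + 4 = 11 cycles to drain through five stages; the two load-use stalls in the original order add the remaining 2.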
Branches in the original pipelined datapath
When are they resolved?
[Figure: the original pipelined datapath with its control signals, including the PCSrc mux that selects between PC + 4 and the branch target computed by the shift-left-2 unit and adder]
Branch Hazards
If branch outcome determined in MEM:
[Figure: pipeline diagram; the instructions fetched after the branch are flushed (control values set to 0)]
Reducing Branch Delay
Move hardware to determine outcome to ID stage
— Target address adder
— Register comparator
Example: branch taken
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
72: lw $4, 50($7)
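The branch target in this example follows from the usual MIPS rule: the offset counts words relative to the instruction after the branch. A short Python check, with the function name `branch_target` chosen for this sketch:

```python
def branch_target(branch_pc, offset_words):
    """MIPS branch target: address after the branch plus the offset shifted left by 2."""
    return branch_pc + 4 + (offset_words << 2)


target = branch_target(40, 7)   # beq at address 40 with offset 7
print(target)
```

This gives 40 + 4 + 28 = 72, the address of the lw in the listing.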
Example: Branch Taken
Data Hazards for Branches
If a comparison register is a destination of 2nd or 3rd preceding ALU instruction
add $1, $2, $3        IF  ID  EX  MEM WB
add $4, $5, $6            IF  ID  EX  MEM WB
…                             IF  ID  EX  MEM WB
beq $1, $4, target                IF  ID  EX  MEM WB
Can resolve using forwarding
Data Hazards for Branches
If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
Need 1 stall cycle
lw $1, addr           IF  ID  EX  MEM WB
add $4, $5, $6            IF  ID  EX  MEM WB
beq stalled                   IF  ID
beq $1, $4, target                ID  EX  MEM WB
Data Hazards for Branches
If a comparison register is a destination of immediately preceding load instruction
Need 2 stall cycles
lw $1, addr           IF  ID  EX  MEM WB
beq stalled               IF  ID
beq stalled                   ID
beq $1, $0, target                ID  EX  MEM WB
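The three branch cases can be summarized in one small Python function. It assumes, as on these slides, that the branch compares its registers in ID and that forwarding is available. The function name and the `distance` encoding (1 means immediately preceding) are assumptions made for this sketch.

```python
def branch_stall_cycles(producer_kind, distance):
    """Stall cycles for a branch resolved in ID.

    producer_kind: "alu" or "load", the instruction that writes a compared register.
    distance: 1 if it immediately precedes the branch, 2 for 2nd preceding, etc.
    """
    # An ALU result can be forwarded to ID two instructions later, a load three.
    ready_after = {"alu": 2, "load": 3}[producer_kind]
    return max(0, ready_after - distance)


for kind, distance in [("alu", 2), ("alu", 1), ("load", 2), ("load", 1)]:
    print(kind, distance, branch_stall_cycles(kind, distance))
```

The results match the slides: no stall for a 2nd or 3rd preceding ALU instruction, one stall for a preceding ALU instruction or 2nd preceding load, and two stalls for an immediately preceding load.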
Branch Prediction
• Longer pipelines can’t readily determine branch
outcome early
• Stall penalty becomes unacceptable
• Predict (i.e., guess) outcome of branch
• Only stall if prediction is wrong
• Simplest prediction strategy
• predict branches not taken
• Works well for loops if the loop tests are done at the
start.
• Fetch instruction after branch, with no delay
MIPS with Predict Not Taken
[Figures: two pipeline diagrams, one for a correct prediction and one for an incorrect prediction]
Dynamic Branch Prediction
 In deeper and superscalar pipelines, branch penalty is more
significant
 Use dynamic prediction
 Branch prediction buffer (aka branch history table)
 Indexed by recent branch instruction addresses
 Stores outcome (taken/not taken)
 To execute a branch
 Check table, expect the same outcome
 Start fetching from fall-through or target
 If wrong, flush pipeline and flip prediction
1-Bit Predictor: Shortcoming
Inner loop branches mispredicted twice!
outer: …
…
inner: …
…
beq …, …, inner
…
beq …, …, outer
 Mispredict as taken on last iteration of inner loop
 Then mispredict as not taken on first iteration of inner loop
next time around
2-Bit Predictor
Only change prediction on two successive mispredictions
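The difference between the two predictors can be seen in a short simulation. The sketch below models both as saturating counters, which is one common way to build a 2-bit predictor and is an assumption here, as are the starting state and the loop trip counts.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of a saturating-counter predictor with the given width."""
    top = (1 << bits) - 1
    counter = top                       # assumption: start in the strongest "taken" state
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter > top // 2
        if predicted_taken != taken:
            wrong += 1
        # Move toward "taken" or "not taken", saturating at both ends.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return wrong


# Inner-loop branch: taken 3 times, then not taken; the outer loop runs 3 times.
inner_branch = ([True] * 3 + [False]) * 3

print(mispredictions(inner_branch, bits=1), mispredictions(inner_branch, bits=2))
```

Under these assumptions the 1-bit predictor misses on the loop exit and again on the next entry, while the 2-bit predictor misses only on the exit.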
Calculating the Branch Target
 Even with predictor, still need to calculate the target
address
 1-cycle penalty for a taken branch
 Branch target buffer
 Cache of target addresses
 Indexed by PC when instruction fetched
 If hit and instruction is branch predicted taken,
can fetch target immediately
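A branch target buffer can be sketched as a small lookup table keyed by the fetch address. This is a simplification: real hardware indexes a fixed-size table with some of the PC bits and checks a tag, whereas this example uses a Python dictionary. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy branch target buffer: maps a branch's PC to its target and prediction."""

    def __init__(self):
        self.entries = {}   # branch PC -> (target address, predicted taken?)

    def next_fetch(self, pc):
        """Address to fetch after pc: the cached target on a predicted-taken hit."""
        entry = self.entries.get(pc)
        if entry is not None and entry[1]:
            return entry[0]
        return pc + 4

    def update(self, pc, target, taken):
        """Record the resolved outcome of the branch at pc."""
        self.entries[pc] = (target, taken)


btb = BranchTargetBuffer()
first = btb.next_fetch(40)          # miss: fall through to the next instruction
btb.update(40, 72, taken=True)      # the branch at 40 resolved as taken, target 72
second = btb.next_fetch(40)         # hit, predicted taken: fetch the target
print(first, second)
```

After the first execution of the branch, the fetch stage can redirect to the target in the same cycle the branch is fetched.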
Concluding Remarks




• ISA influences design of datapath and control
• Datapath and control influence design of ISA
• Pipelining improves instruction throughput using parallelism
  • More instructions completed per second
  • Latency for each instruction not reduced
• Hazards: structural, data, control
• Main additions in hardware:
  • forwarding unit
  • hazard detection and stalling
  • branch predictor
  • branch target table