Computer Organization & Design

Computer Organization and
Architecture (AT70.01)
Comp. Sc. and Inf. Mgmt.
Asian Institute of Technology
Instructor: Dr. Sumanta Guha
Slide Sources: Patterson &
Hennessy COD book site
(copyright Morgan Kaufmann)
adapted and supplemented
COD Ch. 6
Enhancing Performance with
Pipelining
Pipelining

Start work ASAP!! Do not waste time!
[Figure: laundry timelines from 6 PM to 2 AM for four loads A, B, C, D in task order, not pipelined vs. pipelined]
Assume 30 min. for each task – wash, dry, fold, store – and that separate tasks use separate hardware and so can be overlapped
Pipelined vs. Single-Cycle
Instruction Execution: the Plan
[Figure: program execution order vs. time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Single-cycle: each instruction (instruction fetch, Reg, ALU, data access, Reg) takes 8 ns and the next one starts only after it finishes. Pipelined: a new instruction starts every 2 ns]
Assume 2 ns for memory access, ALU operation; 1 ns for register access:
therefore, single cycle clock 8 ns; pipelined clock cycle 2 ns.
Pipelining: Keep in Mind
Pipelining does not reduce latency of a single task, it
increases throughput of entire workload
Pipeline rate is limited by the longest stage
    potential speedup = number of pipe stages
    unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it – when there is
slack in the pipeline – reduce speedup
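The points above can be checked numerically with a small sketch. It assumes a clocked pipeline in which every stage takes one cycle equal to the longest stage, so n tasks through k stages take (k + n - 1) cycles; the function names are made up for illustration and are not from the slides.

```python
def unpipelined_time(stage_lengths, n_tasks):
    """Total time when each task runs all its stages before the next task starts."""
    return n_tasks * sum(stage_lengths)


def pipelined_time(stage_lengths, n_tasks):
    """Total time for a clocked pipeline whose cycle equals the longest stage."""
    cycle = max(stage_lengths)
    # k cycles for the first task, then one more cycle per additional task.
    return (len(stage_lengths) + n_tasks - 1) * cycle


def speedup(stage_lengths, n_tasks):
    """Ratio of unpipelined to pipelined total time."""
    return unpipelined_time(stage_lengths, n_tasks) / pipelined_time(stage_lengths, n_tasks)


for stages in ([30, 30, 30, 30], [20, 20, 60, 20]):
    for n in (1, 4, 1000):
        print(stages, n, round(speedup(stages, n), 2))
```

Under this model the speedup approaches sum(stage lengths) / max(stage length) as n grows: 4 for the balanced stages, but only 2 for the unbalanced ones. For small n the fill and drain time keeps it well below that limit.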
Example Problem

Problem: for the laundry, fill in the following table when
1. the stage lengths are 30, 30, 30, 30 min., resp. (pipeline 1)
2. the stage lengths are 20, 20, 60, 20 min., resp. (pipeline 2)

Person   Unpipelined   Pipeline 1    Ratio unpipelined   Pipeline 2    Ratio unpipelined
         finish time   finish time   to pipeline 1       finish time   to pipeline 2
1
2
3
4
n

Come up with a formula for pipeline speed-up!
Pipelining MIPS

What makes it easy with MIPS?
    all instructions are same length
        so fetch and decode stages are similar for all instructions
    just a few instruction formats
        simplifies instruction decode and makes it possible in one stage
    memory operands appear only in load/stores
        so memory access can be deferred to exactly one later stage
    operands are aligned in memory
        one data transfer instruction requires one memory access stage
Pipelining MIPS

What makes it hard?
    structural hazards: different instructions, at different stages,
    in the pipeline want to use the same hardware resource
    control hazards: succeeding instruction, to put into pipeline,
    depends on the outcome of a previous branch instruction,
    already in pipeline
    data hazards: an instruction in the pipeline requires data to
    be computed by a previous instruction still in the pipeline
Before actually building the pipelined datapath and control
we first briefly examine these potential hazards
individually…
Structural Hazards


Structural hazard: inadequate hardware to simultaneously
support all instructions in the pipeline in the same clock cycle
E.g., suppose single – not separate – instruction and data
memory in pipeline below with one read port

then a structural hazard between first and fourth lw instructions
[Figure: pipelined execution, 2 ns per stage, of
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
lw $4, 400($0)
The data access of the first lw and the instruction fetch of the fourth lw fall in the same clock cycle: hazard if single memory]
MIPS was designed to be pipelined: structural hazards are easy
to avoid!
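The single-memory conflict described above can be located mechanically. This is a minimal sketch, assuming an ideal five-stage pipeline in which instruction i (counting from 0) is in IF during cycle i + 1 and in MEM during cycle i + 4; `memory_conflicts` is a hypothetical helper, not part of the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_conflicts(instructions):
    """Return (cycle, fetching instruction, accessing instruction) triples
    where IF and MEM would need a single one-port memory in the same cycle."""
    conflicts = []
    for i, early in enumerate(instructions):
        mem_cycle = i + 1 + STAGES.index("MEM")  # cycle in which `early` is in MEM
        j = mem_cycle - 1                        # index of the instruction in IF that cycle
        uses_data_memory = early.split()[0] in ("lw", "sw")
        if uses_data_memory and j < len(instructions):
            conflicts.append((mem_cycle, instructions[j], early))
    return conflicts


program = ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)", "lw $4, 400($0)"]
for cycle, fetching, accessing in memory_conflicts(program):
    print(f"cycle {cycle}: fetch of '{fetching}' collides with data access of '{accessing}'")
```

With separate instruction and data memories, as in the datapath built later, the two accesses go to different hardware and the conflict disappears.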
Control Hazards


Control hazard: need to make a decision based on the result of
a previous instruction still executing in pipeline
Solution 1 Stall the pipeline
[Figure: pipeline stall on a branch for add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). A bubble is inserted after beq, so lw is fetched 4 ns after beq instead of 2 ns]
Note that branch outcome is computed in ID stage with
added hardware (later…)
Control Hazards

Solution 2 Predict branch outcome

e.g., predict branch-not-taken :
[Figure, prediction success: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) enter the pipeline 2 ns apart with no bubble]
[Figure, prediction failure: add $4, $5, $6; beq $1, $2, 40; then or $7, $8, $9 is fetched 4 ns after beq. The lw fetched right after beq is undone (= flushed) and passes through the remaining stages as a bubble]
Control Hazards

Solution 3 Delayed branch: always execute the sequentially next
statement with the branch executing after one instruction delay
– compiler’s job to find a statement that can be put in the slot
that is independent of branch outcome

MIPS does this – but it is an option in SPIM (Simulator -> Settings)
[Figure: beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0) enter the pipeline 2 ns apart]
Delayed branch: beq is followed by an add that is
independent of branch outcome
Data Hazards


Data hazard: instruction needs data from the result of a
previous instruction still executing in pipeline
Solution Forward data if possible…
[Figure: instruction pipeline diagram of add $s0, $t0, $t1 through IF, ID, EX, MEM, WB; shade indicates use – left=write, right=read]
[Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each passing through IF, ID, EX, MEM, WB]
Without forwarding – blue line –
data has to go back in time;
with forwarding – red line –
data is available in time
Data Hazards

Forwarding may not be enough

e.g., if an R-type instruction following a load uses the result of the
load – called load-use data hazard
lw $s0, 20($t1)
sub $t2, $s0, $t3
[Figure, no stall: lw followed immediately by sub]
Without a stall it is impossible
to provide input to the sub
instruction in time
[Figure, one bubble inserted between lw and sub]
With a one-stage stall, forwarding
can get the data to the sub
instruction in time
Reordering Code to Avoid
Pipeline Stall (Software Solution)
Example:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)    # data hazard: $t2 is used right after it is loaded
sw $t0, 4($t1)

Reordered code:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)    # interchanged
sw $t2, 0($t1)    # interchanged
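A rough way to compare the two orderings is to count load-use pairs. The sketch below assumes, as the slide does, that a store reading the just-loaded register counts as a hazard; the parser only handles the lw/sw/R-type forms used here, and all names are illustrative.

```python
def parse(instr):
    """Split 'lw $t0, 0($t1)' into ('lw', ['$t0', '0', '$t1'])."""
    op, rest = instr.split(None, 1)
    operands = [tok.strip() for tok in rest.replace("(", ",").replace(")", "").split(",")]
    return op, operands


def reads(instr):
    """Registers read by an instruction."""
    op, ops = parse(instr)
    if op == "lw":
        return {ops[2]}          # base register only
    if op == "sw":
        return {ops[0], ops[2]}  # stored value and base register
    return set(ops[1:])          # R-type: both source registers


def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, ops = parse(prev)
        if op == "lw" and ops[0] in reads(cur):
            stalls += 1
    return stalls


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print(load_use_stalls(original), load_use_stalls(reordered))
```

Interchanging the two stores is safe here because they write different addresses and neither depends on the other.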
Pipelined Datapath
We now move to actually building a pipelined datapath
First recall the 5 steps in instruction execution
1. Instruction Fetch & PC Increment (IF)
2. Instruction Decode and Register Read (ID)
3. Execution or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
Review: single-cycle processor
    all 5 steps done in a single clock cycle
    dedicated hardware required for each step
What happens if we break the execution into multiple cycles, but
keep the extra hardware?
Review - Single-Cycle Datapath
[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, adders and multiplexors) divided into five “steps”:
IF – Instruction Fetch
ID – Instruction Decode
EX – Execute/Address Calc.
MEM – Memory Access
WB – Write Back]
Pipelined Datapath – Key Idea

What happens if we break the execution into multiple cycles, but keep
the extra hardware?


Answer: We may be able to start executing a new instruction at each
clock cycle - pipelining
…but we shall need extra registers to hold data between cycles
– pipeline registers
Pipelined Datapath
Pipeline registers wide enough to hold data coming in
[Figure: single-cycle datapath with pipeline registers inserted between stages: IF/ID (64 bits), ID/EX (128 bits), EX/MEM (97 bits), MEM/WB (64 bits)]
Pipelined Datapath
Pipeline registers wide enough to hold data coming in
[Figure: the pipelined datapath again, with registers IF/ID (64 bits), ID/EX (128 bits), EX/MEM (97 bits), MEM/WB (64 bits)]
Only data flowing right to left may cause hazard…, why?
Bug in the Datapath
[Figure: pipelined datapath with registers IF/ID, ID/EX, EX/MEM, MEM/WB; the write register number (WN) input of the register file is fed directly from the instruction in IF/ID]
Write register number comes from another later instruction!
Corrected Datapath
[Figure: corrected pipelined datapath with registers IF/ID (64 bits), ID/EX (133 bits), EX/MEM (102 bits), MEM/WB (69 bits)]
Destination register number is also passed through ID/EX, EX/MEM
and MEM/WB registers, which are now wider by 5 bits
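The register widths quoted in the figures can be reproduced by adding up the fields each register must hold. The field breakdown below is an assumption inferred from the standard datapath (the slides give only the totals), and `width` is a hypothetical helper.

```python
# Assumed contents of each pipeline register, in bits (data only, no control).
FIELDS = {
    "IF/ID":  {"PC+4": 32, "instruction": 32},
    "ID/EX":  {"PC+4": 32, "read data 1": 32, "read data 2": 32,
               "sign-extended immediate": 32},
    "EX/MEM": {"branch target": 32, "Zero": 1, "ALU result": 32, "read data 2": 32},
    "MEM/WB": {"memory read data": 32, "ALU result": 32},
}
DEST_REG_BITS = 5  # destination register number carried along in the corrected datapath


def width(register, with_dest=False):
    """Width of a pipeline register, optionally including the destination register number."""
    bits = sum(FIELDS[register].values())
    if with_dest and register != "IF/ID":  # IF/ID already holds the whole instruction
        bits += DEST_REG_BITS
    return bits


for name in FIELDS:
    print(name, width(name), width(name, with_dest=True))
```

Under these assumptions the totals come out to 64, 128, 97 and 64 bits before the fix and 64, 133, 102 and 69 bits after it, matching the figures.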
Pipelined Example

Consider the following instruction sequence:
lw $t0, 10($t1)
sw $t3, 20($t4)
add $t5, $t6, $t7
sub $t8, $t9, $t10
Single-Clock-Cycle Diagrams, clock cycles 1 to 8 (instruction occupying each stage):
Clock cycle   IF    ID    EX    MEM   WB
1             lw
2             sw    lw
3             add   sw    lw
4             sub   add   sw    lw
5                   sub   add   sw    lw
6                         sub   add   sw
7                               sub   add
8                                     sub
Alternative View –
Multiple-Clock-Cycle Diagram
Time axis: CC 1 to CC 8
Instruction          CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw $t0, 10($t1)      IM    REG   ALU   DM    REG
sw $t3, 20($t4)            IM    REG   ALU   DM    REG
add $t5, $t6, $t7                IM    REG   ALU   DM    REG
sub $t8, $t9, $t10                     IM    REG   ALU   DM    REG
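Both views of the example can be generated from one rule: in an ideal pipeline with no stalls, instruction i (counting from 0) occupies stage s during clock cycle i + s + 1. The sketch below builds the per-cycle view; the function name is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def occupancy(program):
    """Map clock cycle -> {stage: instruction} for an ideal pipeline (no stalls)."""
    table = {}
    for i, instr in enumerate(program):
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s + 1, {})[stage] = instr
    return table


program = ["lw $t0, 10($t1)", "sw $t3, 20($t4)", "add $t5, $t6, $t7", "sub $t8, $t9, $t10"]
for cycle, stages in sorted(occupancy(program).items()):
    row = "  ".join(f"{stage}: {stages.get(stage, '-')}" for stage in STAGES)
    print(f"CC {cycle}  {row}")
```

Four instructions through five stages take 4 + 5 - 1 = 8 clock cycles, which is why the diagrams stop at clock cycle 8.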
Notes

One significant difference in the execution of an R-type
instruction between multicycle and pipelined implementations:
    register write-back for the R-type instruction is the 5th (the last,
    write-back) pipeline stage vs. the 4th stage for the multicycle
    implementation. Why?
    think of structural hazards when writing to the register file…
Worth repeating: the essential difference between the pipeline
and multicycle implementations is the insertion of pipeline
registers to decouple the 5 stages
The CPI of an ideal pipeline (no stalls) is 1. Why?
The RaVi Architecture Visualization Project of Dortmund U. has
pipeline simulations – see link in our Additional Resources
page
As we develop control for the pipeline keep in mind that the
text does not consider jump – should not be too hard to
implement!
Recall Single-Cycle Control –
the Datapath
[Figure: single-cycle datapath with control. The Control unit takes Instruction [31–26] and produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; PCSrc selects between PC + 4 and the branch target; ALU control takes ALUOp and Instruction [5–0]. Register inputs come from Instruction [25–21], [20–16] and [15–11]; Instruction [15–0] is sign-extended from 16 to 32 bits]
Recall Single-Cycle – ALU Control
Instruction   ALUOp   Instruction    Funct    Desired       ALU control
opcode                operation      field    ALU action    input
LW            00      load word      xxxxxx   add           010
SW            00      store word     xxxxxx   add           010
Branch eq     01      branch eq      xxxxxx   subtract      110
R-type        10      add            100000   add           010
R-type        10      subtract       100010   subtract      110
R-type        10      AND            100100   and           000
R-type        10      OR             100101   or            001
R-type        10      set on less    101010   set on less   111

Truth table for ALU control bits
ALUOp1  ALUOp0   F5 F4 F3 F2 F1 F0   Operation
0       0        X  X  X  X  X  X    010
0       1        X  X  X  X  X  X    110
1       X        X  X  0  0  0  0    010
1       X        X  X  0  0  1  0    110
1       X        X  X  0  1  0  0    000
1       X        X  X  0  1  0  1    001
1       X        X  X  1  0  1  0    111
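The truth table translates directly into a small decoder. This is a sketch that takes bit strings for readability; `alu_control` is an illustrative name, and only the rows listed in the table are handled.

```python
def alu_control(alu_op, funct):
    """Return the 3-bit ALU control input for a 2-bit ALUOp and 6-bit funct field.

    Both arguments are bit strings, e.g. alu_op="10", funct="100010".
    """
    if alu_op == "00":
        return "010"  # lw / sw: add to form the address
    if alu_op == "01":
        return "110"  # beq: subtract to compare
    # ALUOp1 = 1 (R-type): only the low four funct bits F3..F0 matter.
    r_type = {
        "0000": "010",  # add
        "0010": "110",  # subtract
        "0100": "000",  # AND
        "0101": "001",  # OR
        "1010": "111",  # set on less than
    }
    return r_type[funct[-4:]]


print(alu_control("10", "100010"))  # sub
print(alu_control("00", "xxxxxx"))  # lw or sw
```

The don't-care entries in the table are what make this small: F5 and F4 are never examined, and the funct field is ignored entirely unless ALUOp1 is set.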
Recall Single-Cycle – Control Signals
Effect of control bits
RegDst
    deasserted: The register destination number for the Write register comes from the rt field (bits 20-16)
    asserted: The register destination number for the Write register comes from the rd field (bits 15-11)
RegWrite
    deasserted: None
    asserted: The register on the Write register input is written with the value on the Write data input
ALUSrc
    deasserted: The second ALU operand comes from the second register file output (Read data 2)
    asserted: The second ALU operand is the sign-extended, lower 16 bits of the instruction
PCSrc
    deasserted: The PC is replaced by the output of the adder that computes the value of PC + 4
    asserted: The PC is replaced by the output of the adder that computes the branch target
MemRead
    deasserted: None
    asserted: Data memory contents designated by the address input are put on the Read data output
MemWrite
    deasserted: None
    asserted: Data memory contents designated by the address input are replaced by the value of the Write data input
MemtoReg
    deasserted: The value fed to the register Write data input comes from the ALU
    asserted: The value fed to the register Write data input comes from the data memory

Determining control bits
Instruction   RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format      1       0       0         1         0        0         0       1       0
lw            0       1       1         1         1        0         0       0       0
sw            X       1       X         0         0        1         0       0       0
beq           X       0       X         0         0        0         1       0       1
Pipeline Control


Initial design – motivated by single-cycle datapath control – use
the same control signals
Observe:
    No separate write signal for the PC as it is written every cycle
    No separate write signals for the pipeline registers as they are
    written every cycle
    (these first two will be modified by the hazard detection unit!!)
    No separate read signal for instruction memory as it is read every
    clock cycle
    No separate read signal for register file as it is read every clock cycle
Need to set control signals during each pipeline stage
Since control signals are associated with components active
during a single pipeline stage, can group control lines into five
groups according to pipeline stage
Pipelined Datapath with
Control I
[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) annotated with the control signals PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite and MemtoReg]
Same control signals as the single-cycle datapath
Pipeline Control Signals

There are five stages in the pipeline
    instruction fetch / PC increment
    instruction decode / register fetch
    execution / address calculation
    memory access
    write back
Note: nothing to control as instruction memory read and PC write are always enabled

              Execution/Address Calculation     Memory access stage         Write-back stage
              stage control lines               control lines               control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format      1       1       0       0         0       0        0          1         0
lw            0       0       0       1         0       1        0          1         1
sw            X       0       0       1         0       0        1          0         X
beq           X       0       1       0         1       0        0          0         X
Pipeline Control
Implementation

Pass control signals along just like the data – extend each pipeline
register to hold needed control bits for succeeding stages
[Figure: Control decodes the instruction from IF/ID; ID/EX holds the EX, M and WB control groups; EX/MEM holds M and WB; MEM/WB holds WB]
Note: The 6-bit funct field of the instruction required in the EX
stage to generate ALU control can be retrieved as the 6 least
significant bits of the immediate field which is sign-extended and
passed from the IF/ID register to the ID/EX register
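How the control groups thin out as they travel down the pipeline can be sketched with a lookup table. The bit strings follow the stage-grouped control table of these slides (EX = RegDst, ALUOp1, ALUOp0, ALUSrc; M = Branch, MemRead, MemWrite; WB = RegWrite, MemtoReg); the dictionary layout and the function name are assumptions made for illustration.

```python
# Control bits per instruction class, grouped by the stage that consumes them.
CONTROL = {
    "R-format": {"EX": "1100", "M": "000", "WB": "10"},
    "lw":       {"EX": "0001", "M": "010", "WB": "11"},
    "sw":       {"EX": "X001", "M": "001", "WB": "0X"},
    "beq":      {"EX": "X010", "M": "100", "WB": "0X"},
}


def control_in_registers(kind):
    """Control groups stored in each pipeline register as the instruction advances."""
    bits = CONTROL[kind]
    return {
        "ID/EX":  {group: bits[group] for group in ("EX", "M", "WB")},
        "EX/MEM": {group: bits[group] for group in ("M", "WB")},
        "MEM/WB": {"WB": bits["WB"]},
    }


for register, groups in control_in_registers("lw").items():
    print(register, groups)
```

Each stage consumes its own group and forwards the rest, so ID/EX needs 9 control bits, EX/MEM 5 and MEM/WB only 2.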
Pipelined Datapath with
Control II
[Figure: pipelined datapath with the Control unit and the WB, M and EX control fields added to the ID/EX, EX/MEM and MEM/WB registers]
Control signals emanate from the control portions of the pipeline registers
Pipelined Execution and Control
Instruction sequence:
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
Label “before<i>” means i-th instruction before lw
Label “after<i>” means i-th instruction after add
[Figures: pipelined datapath with control for clock cycles 1 to 9, showing the register numbers read and the control bits held in each pipeline register. Control generated in ID: lw gives WB = 11, M = 010, EX = 0001; sub, and, or, add give WB = 10, M = 000, EX = 1100]
Clock cycle   IF          ID          EX          MEM         WB
1             lw          before<1>   before<2>   before<3>   before<4>
2             sub         lw          before<1>   before<2>   before<3>
3             and         sub         lw          before<1>   before<2>
4             or          and         sub         lw          before<1>
5             add         or          and         sub         lw
6             after<1>    add         or          and         sub
7             after<2>    after<1>    add         or          and
8             after<3>    after<2>    after<1>    add         or
9             after<4>    after<3>    after<2>    after<1>    add
Revisiting Hazards


So far our datapath and control have ignored hazards
We shall revisit data hazards and control hazards and
enhance our datapath and control to handle them in
hardware…
Data Hazards and Forwarding

Problem with starting an instruction before previous are finished:

data dependencies that go backward in time – called data hazards
[Figure: multiple-clock-cycle diagram (IM, Reg, ALU, DM, Reg), CC 1 to CC 9, for the sequence
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)]
$2 = 10 before sub; $2 = -20 after sub
                       CC 1  CC 2  CC 3  CC 4  CC 5     CC 6  CC 7  CC 8  CC 9
Value of register $2:  10    10    10    10    10/-20   -20   -20   -20   -20
Software Solution

Have compiler guarantee never any data hazards!
    by rearranging instructions to insert independent instructions
    between instructions that would otherwise have a data hazard
    between them,
    or, if such rearrangement is not possible, insert nops

sub $2, $1, $3
lw $10, 40($3)
slt $5, $6, $7
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

or

sub $2, $1, $3
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Such compiler solutions may not always be possible, and nops
slow the machine down
MIPS: nop = “no operation” = 00…0 (32 bits) = sll $0, $0, 0
Hardware Solution:
Forwarding
Idea: use intermediate data, do not wait for result to be
finally written to the destination register. Two steps:
1. Detect data hazard
2. Forward intermediate data to resolve hazard
Pipelined Datapath with
Control II (as before)
[Figure: pipelined datapath with the Control unit and the WB, M and EX control fields in the ID/EX, EX/MEM and MEM/WB registers, repeated from the earlier slide]
Control signals emanate from the control portions of the pipeline registers
Hazard Detection

Hazard conditions:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
E.g., in the earlier example, the first hazard, between sub $2, $1, $3 and
and $12, $2, $5, is detected when the and is in the EX stage and the
sub is in the MEM stage, because
    EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)
Whether to forward also depends on:
    whether the earlier instruction (the one further along in the pipeline) is
    going to write a register – if not, there is no need to forward, even if
    there is a register number match as in the conditions above
    whether the destination register of that earlier instruction is $0 – in which
    case there is no need to forward the value ($0 is always 0 and never overwritten)
Data Forwarding
Plan:
    allow inputs to the ALU not just from ID/EX, but also from later
    pipeline registers, and
    use multiplexors and control signals to choose appropriate
    inputs to the ALU
[Figure: multiple-clock-cycle diagram (IM, Reg, ALU, DM, Reg), CC 1 to CC 9, for the sequence
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)]
                        CC 1  CC 2  CC 3  CC 4  CC 5     CC 6  CC 7  CC 8  CC 9
Value of register $2:   10    10    10    10    10/-20   -20   -20   -20   -20
Value of EX/MEM:        X     X     X     -20   X        X     X     X     X
Value of MEM/WB:        X     X     X     X     -20      X     X     X     X
Dependencies between pipeline registers move forward in time
Forwarding Hardware
[Figure a, no forwarding: datapath before adding forwarding hardware. ID/EX feeds the ALU directly; EX/MEM, data memory and MEM/WB follow]
[Figure b, with forwarding: datapath after adding forwarding hardware. Multiplexors controlled by ForwardA and ForwardB select each ALU input; the forwarding unit takes Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd]
Forwarding Hardware:
Multiplexor Control
Mux control     Source    Explanation
ForwardA = 00   ID/EX     The first ALU operand comes from the register file
ForwardA = 10   EX/MEM    The first ALU operand is forwarded from prior ALU result
ForwardA = 01   MEM/WB    The first ALU operand is forwarded from data memory or an earlier ALU result (*)
ForwardB = 00   ID/EX     The second ALU operand comes from the register file
ForwardB = 10   EX/MEM    The second ALU operand is forwarded from prior ALU result
ForwardB = 01   MEM/WB    The second ALU operand is forwarded from data memory or an earlier ALU result (*)
(*) Depending on the selection in the rightmost multiplexor
(see datapath with control diagram)
Data Hazard: Detection and
Forwarding

Forwarding unit determines multiplexor control according to the
following rules:
1. EX hazard
if (     EX/MEM.RegWrite                              // if there is a write…
     and ( EX/MEM.RegisterRd ≠ 0 )                    // to a non-$0 register…
     and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) )   // which matches, then…
    ForwardA = 10
if (     EX/MEM.RegWrite                              // if there is a write…
     and ( EX/MEM.RegisterRd ≠ 0 )                    // to a non-$0 register…
     and ( EX/MEM.RegisterRd = ID/EX.RegisterRt ) )   // which matches, then…
    ForwardB = 10
2. MEM hazard
if (     MEM/WB.RegWrite                              // if there is a write…
     and ( MEM/WB.RegisterRd ≠ 0 )                    // to a non-$0 register…
     and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRs )     // and not already a register match with earlier pipeline register…
     and ( MEM/WB.RegisterRd = ID/EX.RegisterRs ) )   // but match with later pipeline register, then…
    ForwardA = 01
if (     MEM/WB.RegWrite                              // if there is a write…
     and ( MEM/WB.RegisterRd ≠ 0 )                    // to a non-$0 register…
     and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRt )     // and not already a register match with earlier pipeline register…
     and ( MEM/WB.RegisterRd = ID/EX.RegisterRt ) )   // but match with later pipeline register, then…
    ForwardB = 01
The check against EX/MEM.RegisterRd is necessary, e.g., for sequences such as add $1, $1, $2; add $1, $1, $3; add $1, $1, $4;
(array summing?), where an earlier pipeline (EX/MEM) register has more recent data
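The forwarding rules above can be written as a small function for experimentation. This sketch transcribes the slide's conditions literally; the dictionary keys, integer register numbers and function names are assumptions made for illustration.

```python
def forward_select(ex_mem, mem_wb, src):
    """Mux control for one ALU operand whose source register number is `src`.

    ex_mem and mem_wb are dicts with keys 'RegWrite' (bool) and 'Rd' (int).
    """
    # EX hazard: the instruction one ahead writes the register we need.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src:
        return "10"
    # MEM hazard: the instruction two ahead writes it and EX/MEM does not match.
    if (mem_wb["RegWrite"] and mem_wb["Rd"] != 0
            and ex_mem["Rd"] != src and mem_wb["Rd"] == src):
        return "01"
    return "00"  # no hazard: use the value read from the register file


def forwarding_unit(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""
    return (forward_select(ex_mem, mem_wb, id_ex_rs),
            forward_select(ex_mem, mem_wb, id_ex_rt))


# Third add of: add $1, $1, $2; add $1, $1, $3; add $1, $1, $4
ex_mem = {"RegWrite": True, "Rd": 1}   # second add, now in MEM
mem_wb = {"RegWrite": True, "Rd": 1}   # first add, now in WB
print(forwarding_unit(ex_mem, mem_wb, id_ex_rs=1, id_ex_rt=4))
```

In the summing sequence both pipeline registers hold a value for $1, and checking EX/MEM first makes the unit pick the more recent one.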
Forwarding Hardware with
Control
Called forwarding unit, not hazard detection unit,
because once data is forwarded there is no hazard!
[Figure: datapath with forwarding hardware and control wires. IF/ID.RegisterRs, IF/ID.RegisterRt and IF/ID.RegisterRd pass into ID/EX as Rs, Rt and Rd; the forwarding unit also takes EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the multiplexors at the ALU inputs]
Datapath with forwarding hardware and control wires – certain details,
e.g., branching hardware, are omitted to simplify the drawing
Note: so far we have only handled forwarding to R-type instructions…!
Forwarding: execution example
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
[Figures: datapath with forwarding unit for clock cycles 3 to 6, showing the register numbers compared by the forwarding unit in each cycle]
Clock cycle   IF          ID          EX          MEM         WB
3             or          and         sub         before<1>   before<2>
4             add         or          and         sub         before<1>
5             after<1>    add         or          and         sub
6             after<2>    after<1>    add         or          and
Data Hazards and Stalls

Load word can still cause a hazard:
    an instruction tries to read a register following a load instruction that
    writes to the same register
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: multiple-clock-cycle diagram (IM, Reg, ALU, DM, Reg), CC 1 to CC 9, for the sequence above]
As even a pipeline dependency goes backward in time,
forwarding will not solve the hazard
therefore, we need a hazard detection unit to stall the pipeline after
the load instruction
Pipelined Datapath with
Control II (as before)
[Figure: pipelined datapath with the Control unit and the WB, M and EX control fields in the ID/EX, EX/MEM and MEM/WB registers, repeated from the earlier slide]
Control signals emanate from the control portions of the pipeline registers
Hazard Detection Logic to Stall

The hazard detection unit implements the following check to decide whether to stall:

    if ( ID/EX.MemRead                                      // if the instruction in the EX stage is a load…
         and ( ( ID/EX.RegisterRt = IF/ID.RegisterRs )      // and the destination register
            or ( ID/EX.RegisterRt = IF/ID.RegisterRt ) ) )  // matches either source register
                                                            // of the instruction in the ID stage, then…
        stall the pipeline
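The load-use check can be written as a small predicate. This is only a sketch: the class and field names (IDEX, IFID, mem_read, rs, rt) are made up here to mirror the pipeline register fields, and registers are represented by their numbers.

```python
from dataclasses import dataclass


@dataclass
class IDEX:
    mem_read: bool  # ID/EX.MemRead: the instruction in EX is a load
    rt: int         # ID/EX.RegisterRt: destination register of the load


@dataclass
class IFID:
    rs: int         # IF/ID.RegisterRs: first source of the instruction in ID
    rt: int         # IF/ID.RegisterRt: second source of the instruction in ID


def must_stall(id_ex: IDEX, if_id: IFID) -> bool:
    """Return True when the load-use check asks for a stall."""
    return id_ex.mem_read and id_ex.rt in (if_id.rs, if_id.rt)


# lw $2, 20($1) in EX and and $4, $2, $5 in ID: the load result is needed at once
print(must_stall(IDEX(mem_read=True, rt=2), IFID(rs=2, rt=5)))   # True
# same register numbers, but the instruction in EX is not a load
print(must_stall(IDEX(mem_read=False, rt=2), IFID(rs=2, rt=5)))  # False
```

When the instruction in EX is not a load, the dependency is left to the forwarding unit and no stall is requested.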
Mechanics of Stalling


If the stall check succeeds, the pipeline needs to stall for only 1 clock cycle after the load, because after that the forwarding unit can resolve the dependency
What the hardware does to stall the pipeline for 1 cycle:
    it does not let the IF/ID register change (disable write!), which causes the instruction in the ID stage to repeat, i.e., stall
    therefore, the instruction just behind it, in the IF stage, must be stalled as well, so the hardware does not let the PC change (disable write!), which causes the instruction in the IF stage to repeat, i.e., stall
    it changes all the EX, MEM and WB control fields in the ID/EX pipeline register to 0, so effectively the instruction just behind the load becomes a nop; a bubble is said to have been inserted into the pipeline
        note that we cannot turn that instruction into a nop by zeroing all the bits in the instruction itself (recall nop = 00…0, 32 bits) because it has already been decoded and its control signals generated
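The three stall actions can be sketched as one clock-edge update. Everything in this model is an illustrative assumption rather than the actual hardware: the State class, the clock_edge function, and the use of single integers for the EX, MEM and WB control fields.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    pc: int            # address of the instruction in the IF stage
    if_id: str         # instruction held in the IF/ID register (ID stage)
    id_ex_ctrl: dict   # EX, MEM and WB control fields held in ID/EX


NOP_CTRL = {"EX": 0, "MEM": 0, "WB": 0}


def clock_edge(state, fetched, decoded_ctrl, stall):
    """Return the front-end state after one clock edge."""
    if stall:
        # PCWrite = 0 and IF/IDWrite = 0: PC and IF/ID keep their old values.
        # The control fields written into ID/EX are zeroed: a bubble.
        return replace(state, id_ex_ctrl=dict(NOP_CTRL))
    # Normal operation: PC advances, IF/ID and ID/EX take new contents.
    return State(pc=state.pc + 4, if_id=fetched, id_ex_ctrl=decoded_ctrl)


# Hypothetical control values; 'and' is in ID behind a load, 'or' is being fetched.
before = State(pc=8, if_id="and $4, $2, $5",
               id_ex_ctrl={"EX": 1, "MEM": 1, "WB": 1})
after = clock_edge(before, "or $4, $4, $2",
                   {"EX": 1, "MEM": 0, "WB": 1}, stall=True)
print(after.pc, after.if_id, after.id_ex_ctrl)
```

After the stalled edge the PC and IF/ID contents are unchanged, so both instructions repeat their stages, while the zeroed control fields travel down the pipeline as the bubble.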
Hazard Detection Unit
[Figure: pipelined datapath with forwarding unit and hazard detection unit; the hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt as inputs and drives PCWrite, IF/IDWrite and the multiplexor that zeros the control fields going into ID/EX]
Datapath with forwarding hardware, the hazard detection unit and control wires; certain details, e.g., branching hardware, are omitted to simplify the drawing
Stalling Resolves a Hazard

Same instruction sequence as before, for which forwarding by itself could not resolve the hazard:

    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7

[Figure: pipeline diagram over clock cycles CC 1 to CC 10; a bubble is inserted after lw, so and repeats its Reg (ID) stage and or repeats its IM (IF) stage]
The hazard detection unit inserts a 1-cycle bubble in the pipeline, after which all pipeline register dependencies go forward, so the forwarding unit can handle them and there are no more hazards
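The effect of the bubble on timing can be checked with a small calculation. The sketch below assumes a 5-stage pipeline with full forwarding where the only stall is the 1-cycle load-use bubble; the function name ex_cycles and the tuple encoding of instructions are made up for illustration.

```python
def ex_cycles(program):
    """Return the clock cycle in which each instruction is in the EX stage.

    program is a list of (op, dest, sources) tuples using register numbers.
    """
    cycles = []
    for i, (op, dest, srcs) in enumerate(program):
        if i == 0:
            cycle = 3  # IF in CC 1, ID in CC 2, EX in CC 3
        else:
            prev_op, prev_dest, _ = program[i - 1]
            # load-use hazard: the previous instruction is a load whose
            # destination is a source of this instruction
            bubble = 1 if prev_op == "lw" and prev_dest in srcs else 0
            cycle = cycles[-1] + 1 + bubble
        cycles.append(cycle)
    return cycles


program = [
    ("lw", 2, (1,)),     # lw  $2, 20($1)
    ("and", 4, (2, 5)),  # and $4, $2, $5
    ("or", 8, (2, 6)),   # or  $8, $2, $6
    ("add", 9, (4, 2)),  # add $9, $4, $2
    ("slt", 1, (6, 7)),  # slt $1, $6, $7
]
ex = ex_cycles(program)
print(ex)           # [3, 5, 6, 7, 8]
print(ex[-1] + 2)   # 10: slt reaches WB in CC 10
```

Without the bubble the last instruction would reach WB in CC 9, so the stall costs exactly one clock cycle for this sequence.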
[Figure: Execution example: stalling. Datapath snapshots at clock cycles 2 through 7 for the sequence lw $2, 20($1); and $4, $2, $5; or $4, $4, $2; add $9, $4, $2, showing the hazard detection unit holding PCWrite and IF/IDWrite for one cycle and a bubble moving down the pipeline behind the lw]
Control (or Branch) Hazards

The problem with branches in the pipeline we have so far is that the branch decision is not made until the MEM stage. So which instructions, if any, should we feed into the pipeline following the branch instruction?

Possible solution: stall the pipeline until the branch decision is known
    not efficient: slows the pipeline significantly!
Another solution: predict the branch outcome
    e.g., always predict branch-not-taken and continue with the next sequential instructions
    if the prediction is wrong, we have to flush the pipeline behind the branch (discard the instructions already fetched or decoded) and continue execution at the branch target
Predicting Branch-not-taken:
Misprediction delay
    40 beq $1, $3, 7
    44 and $12, $2, $5
    48 or  $13, $6, $2
    52 add $14, $2, $2
    72 lw  $4, 50($7)

[Figure: pipeline diagram of the instructions above, time in clock cycles CC 1 to CC 9 against program execution order]
The branch-taken outcome (prediction wrong) is decided only when beq is in the MEM stage, so the three sequential instructions that follow it, already in the pipeline, have to be flushed, and execution resumes at lw
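The addresses in this example can be checked with a little arithmetic. The sketch below assumes the standard MIPS rule that a beq offset counts words relative to PC + 4; the function names are made up for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def branch_target(branch_pc, offset_words):
    """MIPS beq target: (PC + 4) plus the word offset shifted left by 2."""
    return branch_pc + 4 + (offset_words << 2)


def flushed_on_taken(decision_stage):
    """Instructions fetched behind the branch by the time it reaches the
    stage where the decision is made, under predict-not-taken."""
    return STAGES.index(decision_stage)


print(branch_target(40, 7))     # 72, the address of lw $4, 50($7)
print(flushed_on_taken("MEM"))  # 3 instructions (and, or, add) are flushed
```

The count is simply the number of stages the branch has already passed through, since one new instruction enters the pipeline per clock cycle.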
Optimizing the Pipeline to
Reduce Branch Delay

Move the branch decision from the MEM stage (as in our current pipeline) earlier to the ID stage
    calculating the branch target address involves moving the branch adder from the EX stage to the ID stage; the inputs to this adder, the PC value and the immediate field, are already available in the IF/ID pipeline register
    calculating the branch decision is done efficiently, e.g., for an equality test, by XORing respective bits, ORing all the results and inverting, rather than using the ALU to subtract and then test for zero (which incurs a carry delay)
        with the more efficient equality test we can put it in the ID stage without significantly lengthening this stage; remember that an objective of pipeline design is to keep pipeline stages balanced
    we must correspondingly make additions to the forwarding and hazard detection units to forward to, or stall, the branch at the ID stage in case the branch decision depends on an earlier result
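The XOR-based equality test can be mimicked in software to see why it needs no carry chain. This is only a bit-level sketch with a made-up function name; in hardware the XORs run in parallel and the OR is a tree of gates, not a loop.

```python
def equal_by_xor(a: int, b: int, width: int = 32) -> bool:
    """Equality test as XOR of respective bits, OR of the results, then invert."""
    mask = (1 << width) - 1
    diff = (a ^ b) & mask        # bit i is 1 exactly where a and b differ
    any_diff = 0
    for i in range(width):       # OR all the XOR outputs together
        any_diff |= (diff >> i) & 1
    return not any_diff          # invert: equal when no bit differs


print(equal_by_xor(0x1234, 0x1234))  # True
print(equal_by_xor(0x1234, 0x1235))  # False
```

Each XOR output depends only on its own pair of bits, so no result has to ripple from one bit position to the next, unlike the borrow or carry in a subtraction.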
Flushing on Misprediction


Same strategy as for stalling on a load-use data hazard…
Zero out all the control values (or the instruction itself) in the pipeline registers for the instructions following the branch that are already in the pipeline, effectively turning them into nops, so they are flushed
    in the optimized pipeline, with the branch decision made in the ID stage, we have to flush only one instruction, the one in the IF stage; the branch delay penalty is then only one clock cycle
Optimized Datapath for Branch
[Figure: pipelined datapath with the branch adder and an equality comparator (=) on the register outputs placed in the ID stage, and an IF.Flush control line into the IF/ID register]
IF.Flush control zeros out the instruction in the IF/ID pipeline register (which follows the branch)
Branch decision is moved from the MEM stage to the ID stage; simplified drawing not showing enhancements to the forwarding and hazard detection units
Pipelined Branch
Execution example:

    36 sub $10, $4, $8
    40 beq $1, $3, 7
    44 and $12, $2, $5
    48 or  $13, $2, $6
    52 add $14, $4, $2
    56 slt $15, $6, $7
    …
    72 lw  $4, 50($7)

[Figure: datapath snapshots at clock cycles 3 and 4; in cycle 3 beq is in ID while and $12, $2, $5 is being fetched; in cycle 4 lw $4, 50($7) is fetched from address 72 and a bubble (nop) follows beq]
Optimized pipeline with only one bubble as a result of the taken branch
Simple Example: Comparing
Performance

Compare performance for single-cycle, multicycle, and pipelined datapaths using the gcc instruction mix
    assume 2 ns for memory access, 2 ns for ALU operation, 1 ns for register read or write
    assume gcc instruction mix: 23% loads, 13% stores, 19% branches, 2% jumps, 43% ALU
    for pipelined execution assume
        50% of the loads are followed immediately by an instruction that uses the result of the load
        25% of branches are mispredicted
        branch delay on misprediction is 1 clock cycle
        jumps always incur 1 clock cycle delay, so their average time is 2 clock cycles
Simple Example: Comparing
Performance



Single-cycle (p. 373): average instruction time 8 ns
Multicycle (p. 397): average instruction time 8.04 ns
Pipelined:
loads use 1 cc (clock cycle) when no load-use dependency and 2 cc
when there is dependency – given 50% of loads are followed by
dependency the average cc per load is 1.5

stores use 1 cc each

branches use 1 cc when predicted correctly and 2 cc when not –
given 25% misprediction average cc per branch is 1.25

jumps use 2 cc each

ALU instructions use 1 cc each

therefore, average CPI is
1.5 × 23% + 1 × 13% + 1.25 × 19% + 2 × 2% + 1 × 43% ≈ 1.18

therefore, average instruction time is 1.18 × 2 ns = 2.36 ns
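The CPI arithmetic can be reproduced directly from the stated assumptions. The dictionary names below are made up for illustration; the numbers are the gcc mix and the per-class cycle counts given above, with a 2 ns pipelined clock.

```python
# gcc instruction mix (fractions of all instructions)
mix = {"load": 0.23, "store": 0.13, "branch": 0.19, "jump": 0.02, "alu": 0.43}

# average clock cycles per instruction class in the pipelined datapath
cpi = {
    "load": 0.5 * 1 + 0.5 * 2,       # 50% of loads have a load-use stall
    "store": 1.0,
    "branch": 0.75 * 1 + 0.25 * 2,   # 25% of branches are mispredicted
    "jump": 2.0,
    "alu": 1.0,
}

avg_cpi = sum(mix[k] * cpi[k] for k in mix)
clock_ns = 2.0

print(f"{avg_cpi:.4f}")                       # 1.1825, which rounds to 1.18
print(f"{round(avg_cpi, 2) * clock_ns:.2f}")  # 2.36 (ns per instruction)
```

The second line follows the slide in rounding the CPI to 1.18 before multiplying by the clock cycle time.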
