EECS 322 Computer Architecture

EECS 322 Computer
Architecture
Pipeline Control,
Data Hazards
and Branch Hazards
Instructor: Francis G. Wolff
Case Western Reserve University
This presentation uses PowerPoint animation: please view it as a slide show
Models
Single-cycle model (non-overlapping)
• Each instruction executes in a single clock cycle
• The clock cycle must be stretched to fit
the slowest instruction (p. 438)
Multi-cycle model (non-overlapping)
• Each instruction executes over multiple clock cycles
• The clock cycle must be stretched to fit the slowest step
• Functional units can be shared within the execution
of a single instruction
Pipeline model (overlapping, p. 522)
• Each instruction executes over multiple clock cycles
• The clock cycle must be stretched to fit the slowest step
• Throughput approaches one instruction per clock cycle
• Gains efficiency by overlapping the execution of multiple
instructions, increasing hardware utilization (p. 377)
Recap: Can pipelining get us into trouble?
• Yes: Pipeline Hazards
– structural hazards: attempt to use the same resource two different
ways at the same time
• e.g., multiple memory accesses, multiple register writes
• solutions:
– multiple memories (separate instruction & data memory)
– stretch pipeline
– control hazards: attempt to make a decision before the condition is
evaluated
• e.g., any conditional branch
• solutions: prediction, delayed branch
– data hazards: attempt to use item before it is ready
• e.g., add r1,r2,r3; sub r4, r1 ,r5; lw r6, 0(r7); or r8, r6 ,r9
• solutions: forwarding/bypassing, stall/bubble
Review: Single-Cycle Datapath
2 adders: PC+4 adder, Branch/Jump offset adder
[Figure: single-cycle datapath. Blocks: PC, instruction memory, registers, sign extend (16 to 32 bits), shift left 2, ALU, data memory, two adders, muxes. Control signals: RegDst, RegWrite, ALUSrc, ALUctl, Branch, MemRead, MemWrite, MemtoReg]
Harvard Architecture: Separate instruction and data memory
Review: Multi vs. Single-cycle Processor Datapath
Combine adders: add 1½ Mux & 3 temp. registers, A, B, ALUOut
Combine Memory: add 1 Mux & 2 temp. registers, IR, MDR
[Figure: multi-cycle datapath. Blocks: PC, memory, instruction register (IR), memory data register (MDR), registers, A, B, sign extend (16 to 32 bits), shift left 2, ALU, ALUOut, ALU control, muxes. Control signals: IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA, ALUSrcB, ALUOp, MemtoReg]
Single-cycle= 1 ALU + 2 Mem + 4 Muxes + 2 adders + OpcodeDecoders
Multi-cycle = 1 ALU + 1 Mem + 5½ Muxes + 5 Reg (IR,A,B,MDR,ALUOut) + FSM
Multi-cycle Processor Datapath
Single-cycle= 1 ALU + 2 Mem + 4 Muxes + 2 adders + OpcodeDecoders
Multi-cycle = 1 ALU + 1 Mem + 5½ Muxes + 5 Reg (IR,A,B,MDR,ALUOut) + FSM
[Figure: multi-cycle datapath. Blocks: PC, memory, instruction register (IR), memory data register (MDR), registers, A, B, sign extend (16 to 32 bits), shift left 2, ALU, ALUOut, ALU control, muxes. Control signals: IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA, ALUSrcB, ALUOp, MemtoReg]
5x32 = 160 additional FFs for multi-cycle processor over single-cycle processor
[Figure 6.25: pipelined datapath with a 32-bit PC and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Control fields carried along: 4 EX, 3 M, 2 W. Annotations: datapath registers 160 FFs, + 213 FFs, + 16 FFs]
213+16 = 229 additional FFs for pipeline over multi-cycle processor
Overhead
Single-cycle model
Chip Area
• 8 ns Clock (125 MHz), (non-overlapping)
• 1 ALU + 2 adders
• 0 Muxes
• 0 Datapath Register bits (Flip-Flops)
Speed
Multi-cycle model
• 2 ns Clock (500 MHz), (non-overlapping)
• 1 ALU + Controller
• 5 Muxes
• 160 Datapath Register bits (Flip-Flops)
Pipeline model
• 2 ns Clock (500 MHz), (overlapping)
• 2 ALU + Controller
• 4 Muxes
• 373 Datapath + 16 Controlpath Register bits (Flip-Flops)
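The register-bit counts quoted on these slides can be cross-checked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, and the reading that the 373 datapath bits include the 160 multi-cycle register bits is an assumption inferred from the slide figures (160 + 213 = 373).

```python
WORD_BITS = 32

# Multi-cycle: five 32-bit temporary registers (IR, A, B, MDR, ALUOut).
multi_cycle_ffs = 5 * WORD_BITS

# Pipeline: totals quoted on the slides.
pipeline_datapath_ffs = 373
pipeline_control_ffs = 16

# Assumption: the pipeline datapath bits include the multi-cycle ones,
# so the difference is what the pipeline adds to the datapath.
extra_datapath_ffs = pipeline_datapath_ffs - multi_cycle_ffs
extra_total_ffs = extra_datapath_ffs + pipeline_control_ffs

print("multi-cycle register bits:", multi_cycle_ffs)
print("extra pipeline datapath bits:", extra_datapath_ffs)
print("extra pipeline bits in total:", extra_total_ffs)
```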
Pipeline Control: Controlpath Register bits
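[Figure 6.29: the Control unit output is split into EX, M, and WB fields that travel through the pipeline registers. ID/EX carries 9 control bits (EX, M, WB), EX/MEM carries 5 control bits (M, WB), MEM/WB carries 2 control bits (WB)]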
Pipeline Control: Controlpath table
Figure 5.20, Single Cycle
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1
Figure 6.28
             ID/EX control lines             EX/MEM control lines       MEM/WB control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0       0       0        0         1         0
lw           0       0       0       1       0       1        0         1         1
sw           X       0       0       1       0       0        1         0         X
beq          X       0       1       0       1       0        0         0         X
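As a rough illustration of how the control lines are grouped into the EX, M, and WB fields carried by the pipeline registers, the table can be encoded as a small lookup. This is a sketch, not the course's code: the names `CONTROL`, `group_bits`, and `pipeline_control` are hypothetical, and lw's RegDst is taken as 0 (select `$rt`), which matches the EX=0001 value shown for lw in the single-stepping slides.

```python
# Field order follows the pipeline grouping: EX, then M, then WB.
EX_FIELDS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc")
M_FIELDS = ("Branch", "MemRead", "MemWrite")
WB_FIELDS = ("RegWrite", "MemtoReg")

# None stands for "don't care" (X in the table).
CONTROL = {
    "R-format": dict(RegDst=1, ALUOp1=1, ALUOp0=0, ALUSrc=0,
                     Branch=0, MemRead=0, MemWrite=0,
                     RegWrite=1, MemtoReg=0),
    "lw": dict(RegDst=0, ALUOp1=0, ALUOp0=0, ALUSrc=1,
               Branch=0, MemRead=1, MemWrite=0,
               RegWrite=1, MemtoReg=1),
    "sw": dict(RegDst=None, ALUOp1=0, ALUOp0=0, ALUSrc=1,
               Branch=0, MemRead=0, MemWrite=1,
               RegWrite=0, MemtoReg=None),
    "beq": dict(RegDst=None, ALUOp1=0, ALUOp0=1, ALUSrc=0,
                Branch=1, MemRead=0, MemWrite=0,
                RegWrite=0, MemtoReg=None),
}


def group_bits(instr, fields):
    """Concatenate the control bits of one group, printing X for don't-care."""
    settings = CONTROL[instr]
    return "".join("X" if settings[f] is None else str(settings[f])
                   for f in fields)


def pipeline_control(instr):
    """Return the EX, M, and WB control fields for an instruction class."""
    return {
        "EX": group_bits(instr, EX_FIELDS),
        "M": group_bits(instr, M_FIELDS),
        "WB": group_bits(instr, WB_FIELDS),
    }


for name in CONTROL:
    print(name, pipeline_control(name))
```

ID/EX stores all three groups (4 + 3 + 2 = 9 bits), EX/MEM keeps M and WB (5 bits), and MEM/WB keeps only WB (2 bits).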
Pipeline Hazards
Pipeline hazards
• Solution #1 always works (for non-real-time applications):
stall, delay & procrastinate!
Structural Hazards (e.g. fetching from the same memory bank)
• Solution #2: partition architecture
Control Hazards (e.g. branching)
• Solution #1: stall! but decreases throughput
• Solution #2: guess and back-track
• Solution #3: delayed decision: delay branch & fill slot
Data Hazards (e.g. register dependencies)
• Worst case situation
• Solution #2: re-order instructions
• Solution #3: forwarding or bypassing: delayed load
Pipeline Datapath and Controlpath
[Figure 6.30: pipelined datapath with the Control unit. The EX, M, and WB control fields are carried in ID/EX, EX/MEM, and MEM/WB and drive PCSrc, RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, RegWrite, and MemtoReg]
[Figure 6.30, animated: load inst. lw $10,20($1) stepping through the pipeline registers]
Clock 0: PC=0
Clock 1: IF/ID: PC=4, IR=lw $10,20($1)
Clock 2: ID/EX: PC=4, A=C$1, B=X, S=20, T=$10, D=0; WB=11, M=010, EX=0001
Clock 3: EX/MEM: PC=4+20<<2, ALU=20+C$1, D=$10; WB=11, M=010
Then MEM/WB: MDR=Mem[20+C$1], Aluout, D=$10; WB=11
Pipeline single stepping
Contents of Register 1 = C$1 = 3; C$2=4; C$3=4; C$4=6; C$5=7; C$10=8; … Memory[23]=9;
Formats: add $rd,$rs=A,$rt=B;   lw $rt=B,@($rs=A)
Clock  <IF/ID>             <ID/EX>                    <EX/MEM>                       <MEM/WB>
       <PC, IR>            <PC, A, B, S, Rt, Rd>      <PC, Z, ALU, B, R>             <MDR, ALU, R>
0      <0,?>               <?,?,?,?,?,?>              <?,?,?,?,?>                    <?,?,?>
1      <4,lw $10,20($1)>   <0,?,?,?,?,?>              <?,?,?,?,?>                    <?,?,?>
2      <8,sub $11,$2,$3>   <4,C$1=3,C$10=8,20,$10,0>  <0,?,?,?,?>                    <?,?,?>
3      <12,and $12,$4,$5>  <8,C$2=4,C$3=4,X,$3,$11>   <4+(20<<2)=84,0,20+3=23,8,$10> <?,?,?>
4      <16,or $13,$6,$7>   <12,C$4=6,C$5=7,X,$5,$12>  <X,1,4-4=0,4,$11>              <Mem[23]=9,23,$10>
5      <20,add $14,$8,$9>  <16,C$6,C$7,X,$7,$13>      <X,0,6 and 7=6,7,$12>          <X,0,$11>
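The entries of the single-stepping table can be reproduced with a small sketch that computes what one instruction deposits in ID/EX, EX/MEM, and MEM/WB. The helper names `trace_lw` and `trace_rtype` are hypothetical, the register and memory contents are the ones given on the slide, and the tuple layouts follow the slide's `<PC, A, B, S, Rt, Rd>`, `<PC, Z, ALU, B, R>`, `<MDR, ALU, R>` conventions. Note that MIPS `and` is bitwise, so 6 and 7 gives 6.

```python
# Register and memory contents from the slide (C$n = contents of $n).
REGS = {1: 3, 2: 4, 3: 4, 4: 6, 5: 7, 10: 8}
MEM = {23: 9}
X = "X"  # field not used by this instruction

ALU_OPS = {
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,  # bitwise AND
}


def trace_lw(pc, rt, imm, rs):
    """Pipeline-register contents left by lw $rt, imm($rs) fetched at pc."""
    a, b = REGS[rs], REGS[rt]
    rd_field = (imm >> 11) & 0x1F               # instruction bits [15-11]
    id_ex = (pc + 4, a, b, imm, rt, rd_field)   # <PC, A, B, S, Rt, Rd>
    alu = a + imm                               # effective address
    branch_pc = pc + 4 + (imm << 2)             # branch adder output (unused by lw)
    ex_mem = (branch_pc, int(alu == 0), alu, b, rt)  # <PC, Z, ALU, B, R>
    mem_wb = (MEM[alu], alu, rt)                # <MDR, ALU, R>
    return id_ex, ex_mem, mem_wb


def trace_rtype(pc, op, rd, rs, rt):
    """Pipeline-register contents left by op $rd, $rs, $rt fetched at pc."""
    a, b = REGS[rs], REGS[rt]
    id_ex = (pc + 4, a, b, X, rt, rd)
    alu = ALU_OPS[op](a, b)
    ex_mem = (X, int(alu == 0), alu, b, rd)
    mem_wb = (X, alu, rd)
    return id_ex, ex_mem, mem_wb


print("lw  $10,20($1):", trace_lw(0, rt=10, imm=20, rs=1))
print("sub $11,$2,$3 :", trace_rtype(4, "sub", rd=11, rs=2, rt=3))
print("and $12,$4,$5 :", trace_rtype(8, "and", rd=12, rs=4, rt=5))
```

Each instruction's three tuples appear in the table one clock apart, starting two clocks after the instruction is fetched.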
Clock 1 (Figure 6.31a): IF: lw $10, 20($1); ID: before<1>; EX: before<2>; MEM: before<3>; WB: before<4>
Clock 2 (Figure 6.31b): IF: sub $11, $2, $3; ID: lw $10, 20($1); EX: before<1>; MEM: before<2>; WB: before<3>
Clock 3 (Figure 6.32a): IF: and $12, $4, $5; ID: sub $11, $2, $3; EX: lw $10, . . .; MEM: before<1>; WB: before<2>
Clock 4 (Figure 6.32b): IF: or $13, $6, $7; ID: and $12, $4, $5; EX: sub $11, . . .; MEM: lw $10, . . .; WB: before<1>
Data Dependencies: that can be resolved by forwarding
Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:    10    10    10    10    10/-20  -20   -20   -20   -20
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5     (data hazard: resolved by forwarding)
or $13, $6, $2      (data hazard: resolved by forwarding)
add $14, $2, $2     (at same time: not a hazard)
sw $15, 100($2)     (forward in time: not a hazard)
Figure 6.36
Data Hazards: arithmetic
Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:    10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM:         X     X     X     -20   X       X     X     X     X
Value of MEM/WB:         X     X     X     X     -20     X     X     X     X
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5     (forwards in time: can be resolved)
or $13, $6, $2      (forwards in time: can be resolved)
add $14, $2, $2     (at same time: not a hazard)
sw $15, 100($2)
Figure 6.37
Data Dependencies: no forwarding
Clock           1    2    3          4          5    6    7    8
sub $2,$1,$3    IF   ID   EX         M          WB
and $12,$2,$5        IF   ID/Stall   ID/Stall   ID   EX   M    WB
Clock 5: sub writes $2 in the 1st half of the cycle (WB); and reads $2 in the 2nd half (ID)
Suppose every instruction is dependent = 1 + 2 stalls = 3 clocks
MIPS = Clock / CPI = 500 MHz / 3 = 167 MIPS
Data Dependencies: no forwarding
A dependent instruction will take    = 1 + 2 stalls = 3 clocks
An independent instruction will take = 1 + 0 stalls = 1 clock
Suppose the instructions are dependent 10% of the time?
Average instruction time = 10%*3 + 90%*1 = 0.10*3 + 0.90*1 = 1.2 clocks
MIPS = Clock / CPI = 500 MHz / 1.2 = 417 MIPS (10% dependency)
MIPS = Clock / CPI = 500 MHz / 3 = 167 MIPS (100% dependency)
MIPS = Clock / CPI = 500 MHz / 1 = 500 MIPS (0% dependency)
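The three MIPS figures follow from one formula, sketched below. The function names `average_cpi` and `mips` are illustrative only; the 2-clock stall penalty and the 500 MHz clock are the values used on the slide.

```python
def average_cpi(dependent_fraction, stall_clocks=2):
    """Average clocks per instruction when a fraction of instructions stall."""
    dependent = dependent_fraction * (1 + stall_clocks)
    independent = (1 - dependent_fraction) * 1
    return dependent + independent


def mips(clock_mhz, cpi):
    """Millions of instructions per second = clock rate (MHz) / CPI."""
    return clock_mhz / cpi


for fraction in (0.0, 0.10, 1.0):
    cpi = average_cpi(fraction)
    print(f"{fraction:.0%} dependency: CPI = {cpi:.2f}, "
          f"{mips(500, cpi):.0f} MIPS")
```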
Data Dependencies: with forwarding
Clock           1    2    3    4    5    6
sub $2,$1,$3    IF   ID   EX   M    WB
and $12,$2,$5        IF   ID   EX   M    WB
Detected Data Hazard 1a: ID/EX.$rs = EX/M.$rd
Suppose every instruction is dependent = 1 + 0 stalls = 1 clock
MIPS = Clock / CPI = 500 MHz / 1 = 500 MIPS
Data Dependencies: Hazard Conditions
Data Hazard Condition
occurs whenever an instruction's source register needs a result that
an earlier instruction has not yet written to its destination register.
Example
sub $2, $1, $3      (sub $rd, $rs, $rt)
and $12, $2, $5     (and $rd, $rs, $rt)
Data Hazard Detection
always compares a destination with a source.
Hazard Type   Destination          Source
1a.           EX/MEM.$rdest   =    ID/EX.$rs
1b.           EX/MEM.$rdest   =    ID/EX.$rt
2a.           MEM/WB.$rdest   =    ID/EX.$rs
2b.           MEM/WB.$rdest   =    ID/EX.$rt
Data Dependencies: Hazard Conditions
1a Data Hazard: EX/MEM.$rd = ID/EX.$rs
sub $2, $1, $3      (sub $rd, $rs, $rt)
and $12, $2, $5     (and $rd, $rs, $rt)
1b Data Hazard: EX/MEM.$rd = ID/EX.$rt
sub $2, $1, $3      (sub $rd, $rs, $rt)
and $12, $1, $2     (and $rd, $rs, $rt)
2a Data Hazard: MEM/WB.$rd = ID/EX.$rs
sub $2, $1, $3      (sub $rd, $rs, $rt)
and $12, $1, $5     (and $rd, $rs, $rt)
or $13, $2, $1      (or $rd, $rs, $rt)
2b Data Hazard: MEM/WB.$rd = ID/EX.$rt
sub $2, $1, $3      (sub $rd, $rs, $rt)
and $12, $1, $5     (and $rd, $rs, $rt)
or $13, $6, $2      (or $rd, $rs, $rt)
Data Dependencies: Worst case
Data Hazard:
sub $2, $1, $3      (sub $rd, $rs, $rt)
and $12, $2, $2     (and $rd, $rs, $rt)
or $13, $2, $2      (or $rd, $rs, $rt)
Data Hazard 1a: EX/MEM.$rd = ID/EX.$rs
Data Hazard 1b: EX/MEM.$rd = ID/EX.$rt
Data Hazard 2a: MEM/WB.$rd = ID/EX.$rs
Data Hazard 2b: MEM/WB.$rd = ID/EX.$rt
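The four comparisons can be written as a short function. This is a minimal sketch using only the register comparisons listed on the slides; the name `forwarding_hazards` and its parameters are hypothetical, and `None` is used here to mean that the pipeline register holds no destination. A fuller forwarding unit would normally also check that the earlier instruction really writes a register and that the destination is not `$0`.

```python
def forwarding_hazards(id_ex_rs, id_ex_rt, ex_mem_rd=None, mem_wb_rd=None):
    """Return the hazard types (1a, 1b, 2a, 2b) seen by the instruction in EX."""
    hazards = []
    if ex_mem_rd is not None:
        if ex_mem_rd == id_ex_rs:
            hazards.append("1a")
        if ex_mem_rd == id_ex_rt:
            hazards.append("1b")
    if mem_wb_rd is not None:
        if mem_wb_rd == id_ex_rs:
            hazards.append("2a")
        if mem_wb_rd == id_ex_rt:
            hazards.append("2b")
    return hazards


# Worst case: sub $2,$1,$3 ; and $12,$2,$2 ; or $13,$2,$2
# "and" is in EX while "sub" sits in EX/MEM.
print(forwarding_hazards(id_ex_rs=2, id_ex_rt=2, ex_mem_rd=2))
# "or" is in EX while "and" sits in EX/MEM and "sub" in MEM/WB.
print(forwarding_hazards(id_ex_rs=2, id_ex_rt=2, ex_mem_rd=12, mem_wb_rd=2))
```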
Data Dependencies: Hazard Conditions
Hazard Type   Source           Destination
1a.           ID/EX.$rs   =    EX/MEM.$rdest
1b.           ID/EX.$rt   =    EX/MEM.$rdest
2a.           ID/EX.$rs   =    MEM/WB.$rdest
2b.           ID/EX.$rt   =    MEM/WB.$rdest
Pipeline Registers
ID/EX:  $rs, $rt, $rd
EX/MEM: $rd
MEM/WB: $rd
[Figure 6.38: a. No forwarding: ID/EX, EX/MEM, MEM/WB, registers, ALU, data memory. b. With forwarding: a forwarding unit compares Rs and Rt from ID/EX with EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the ForwardA and ForwardB muxes at the ALU inputs]
Data Hazards: Loads
Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8  CC 9
Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5      (backwards in time: cannot be resolved)
or $8, $2, $6       (forwards in time: can be resolved)
add $9, $4, $2      (at same time: not a hazard)
slt $1, $6, $7
Figure 6.44
Data Hazards: load stalling
Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8  CC 9  CC 10
Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5      (stall: a bubble is inserted before it continues)
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Figure 6.45
Data Hazards: Hazard detection unit (page 490)
Stall Condition
Source                         Destination
IF/ID.$rs or IF/ID.$rt    =    ID/EX.$rt   and   ID/EX.MemRead=1
Stall Example
lw $2, 20($1)       (lw $rt, addr($rs))
and $4, $2, $5      (and $rd, $rs, $rt)
No Stall Example: (only need to look at next instruction)
lw $2, 20($1)       (lw $rt, addr($rs))
and $4, $1, $5      (and $rd, $rs, $rt)
or $8, $2, $6       (or $rd, $rs, $rt)
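The stall condition reduces to a single test on the instruction in ID against the load in EX. The sketch below uses a hypothetical function name, `must_stall`, with register numbers passed as plain integers.

```python
def must_stall(if_id_rs, if_id_rt, id_ex_rt, id_ex_mem_read):
    """True when the instruction in ID uses the register a load in EX will write."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


# lw $2, 20($1) followed by and $4, $2, $5: the load's $rt is a source.
print(must_stall(if_id_rs=2, if_id_rt=5, id_ex_rt=2, id_ex_mem_read=True))
# lw $2, 20($1) followed by and $4, $1, $5: no dependence on $2.
print(must_stall(if_id_rs=1, if_id_rt=5, id_ex_rt=2, id_ex_mem_read=True))
```

Only the instruction immediately after the load needs checking: one instruction later, the loaded value can be forwarded from MEM/WB.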
Data Hazards: Hazard detection unit (page 490)
Example
load: assume half of the loads are immediately
followed by an instruction that uses the loaded value.
What is the average number of clocks for the load?
load instruction time: 50%*(1 clock) + 50%*(2 clocks) = 1.5
Hazard Detection Unit: when to stall
[Figure 6.46: pipelined datapath with a hazard detection unit and a forwarding unit. Hazard detection unit signals: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, PCWrite, IF/IDWrite. Forwarding unit signals: Rs, Rt, EX/MEM.RegisterRd, MEM/WB.RegisterRd]
Data Dependency Units
Forwarding Condition
Source                         Destination
ID/EX.$rs or ID/EX.$rt    =    EX/MEM.$rd
ID/EX.$rs or ID/EX.$rt    =    MEM/WB.$rd
Stall Condition
Source                         Destination
IF/ID.$rs or IF/ID.$rt    =    ID/EX.$rt   and   ID/EX.MemRead=1
Data Dependency Units
Pipeline Registers
Stalling Comparisons:   IF/ID ($rs, $rt, $rd) against ID/EX ($rs, $rt, $rd)
Forwarding Comparisons: EX/MEM ($rd) and MEM/WB ($rd)
Stall Condition
Source                         Destination
IF/ID.$rs or IF/ID.$rt    =    ID/EX.$rt   and   ID/EX.MemRead=1
Branch Hazards: Soln #1, Stall until Decision made (fig. 6.4)
@3C: add $4, $5, $6
@40: beq $1, $3, 7
@44: and $12, $2, $5
@48: or $13, $6, $2
@4C: add $14, $2, $2
@50: lw $4, 50($7)
Soln #1: Stall until Decision is made
[Figure 6.4: program execution order vs. time (2 to 16 ns) for add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg at 2 ns per stage; beq is fetched 2 ns after add, and a stall puts 4 ns between the fetch of beq and the fetch of lw]
Decision made in ID stage: do load
Branch Hazards: Soln #2, Predict until Decision made
Clock            1    2    3    4    5    6    7    8
beq $1,$3,7      IF   ID   EX   M    WB
and $12, $2, $5       IF   ID   EX   M    WB        Predict false branch
lw $4, 50($7)              IF   ID   EX   M    WB
discard “and $12,$2,$5” instruction
Decision made in ID stage: discard & branch
Branch Hazards: Soln #3, Delayed Decision
Clock            1    2    3    4    5    6    7    8
beq $1,$3,7      IF   ID   EX   M    WB
add $4,$6,$6          IF   ID   EX   M    WB
lw $4, 50($7)              IF   ID   EX   M    WB
Move the instruction from before the branch into the slot after it
Do not need to discard instruction
Decision made in ID stage: branch
Branch Hazards: Soln #3, Delayed Decision
Clock            1    2    3    4    5    6    7    8
beq $1,$3,7      IF   ID   EX   M    WB
and $12, $2, $5       IF   ID   EX   M    WB
lw $4, 50($7)              IF   ID   EX   M    WB
Decision made in ID stage: do branch
Branch Hazards: Decision made in the ID stage (figure 6.4)
Clock            1    2    3    4    5    6    7    8
beq $1,$3,7      IF   ID   EX   M    WB
nop                   IF   ID   EX   M    WB        No decision yet: insert a nop
lw $4, 50($7)              IF   ID   EX   M    WB   Decision: do load
Branch Hazards: Soln #2, Predict until Decision made
Branch Decision made in MEM stage:
Discard values when wrong prediction
Predict false branch
Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8  CC 9
Program execution order (in instructions):
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
Same effect as 3 stalls
Figure 6.50
[Figure 6.51: Early branch comparison: the register comparison (=) and the branch target adder sit in the ID stage, with an IF.Flush control line, the hazard detection unit, and the forwarding unit]
Flush: if wrong prediction, add nops
Performance
load: assume half of the loads are immediately
followed by an instruction that uses the loaded value (i.e. data dependency)
load instruction time = 50%*(1 clock) + 50%*(2 clocks) = 1.5
Jump: assume that jumps always pay 1 full clock cycle
delay (stall). Jump instruction time = 2
Branch: assume the branch delay of a misprediction is 1 clock cycle
and that 25% of the branches are mispredicted.
branch time = 75%*(1 clock) + 25%*(2 clocks) = 1.25
Also known as the instruction latency within a pipeline
Performance, page 504
Instruction   Single-Cycle   Multi-Cycle   Pipeline   Instruction
              Clocks         Clocks        Cycles     Mix
loads         1              5             1.5        23%    (50% dependency)
stores        1              4             1          13%
arithmetic    1              4             1          43%
branches      1              3             1.25       19%    (25% mispredicted)
jumps         1              3             2          2%
Clock speed   125 MHz        500 MHz       500 MHz
              8 ns           2 ns          2 ns
CPI           1              4.02          1.18              = Σ Cycles*Mix
MIPS          125 MIPS       124 MIPS      424 MIPS          = Clock/CPI
Pipeline cycles give the pipeline throughput
load instruction time = 50%*(1 clock) + 50%*(2 clocks) = 1.5
branch time = 75%*(1 clock) + 25%*(2 clocks) = 1.25
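The CPI and MIPS rows are a weighted sum over the instruction mix. The sketch below recomputes them from the per-instruction cycle counts and mix percentages in the table; the dictionary and function names are illustrative only. With the CPI rounded to two decimals (1.18), the pipeline comes out at about 424 MIPS, and the multi-cycle model at about 124 MIPS.

```python
MIX = {"loads": 0.23, "stores": 0.13, "arithmetic": 0.43,
       "branches": 0.19, "jumps": 0.02}

CYCLES = {
    "single-cycle": {"loads": 1, "stores": 1, "arithmetic": 1,
                     "branches": 1, "jumps": 1},
    "multi-cycle": {"loads": 5, "stores": 4, "arithmetic": 4,
                    "branches": 3, "jumps": 3},
    # loads: 50% * 1 + 50% * 2; branches: 75% * 1 + 25% * 2
    "pipeline": {"loads": 1.5, "stores": 1, "arithmetic": 1,
                 "branches": 1.25, "jumps": 2},
}

CLOCK_MHZ = {"single-cycle": 125, "multi-cycle": 500, "pipeline": 500}


def cpi(model):
    """CPI = sum over instruction classes of cycles * mix fraction."""
    return sum(CYCLES[model][kind] * MIX[kind] for kind in MIX)


def mips(model):
    """MIPS = clock rate (MHz) / CPI, using the CPI rounded to 2 decimals."""
    return CLOCK_MHZ[model] / round(cpi(model), 2)


for model in CYCLES:
    print(f"{model:12s} CPI = {cpi(model):.2f}  MIPS = {mips(model):.0f}")
```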