CPE 631 Review: Pipelining Electrical and Computer Engineering Aleksandar Milenkovic,

CPE 631 Review: Pipelining
Electrical and Computer Engineering
University of Alabama in Huntsville
Aleksandar Milenkovic, [email protected]
http://www.ece.uah.edu/~milenka
Outline
• Pipelined Execution
• 5 Steps in MIPS Datapath
• Pipeline Hazards
  – Structural
  – Data
  – Control
AM
LaCASA
2
Laundry Example (by David Patterson)
• Four loads of clothes: A, B, C, D
• Task: each one to wash, dry, and fold
• Resources
  – Washer takes 30 minutes
  – Dryer takes 40 minutes
  – “Folder” takes 20 minutes

Sequential Laundry
[Figure: task order (loads A, B, C, D) versus time, 6 PM to midnight; the loads run back to back, each taking 30 + 40 + 20 minutes]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry
[Figure: task order (loads A, B, C, D) versus time, 6 PM to midnight; the loads overlap, with timeline segments of 30, 40, 40, 40, 40, and 20 minutes]
• Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons
[Figure: pipelined laundry timeline, 6 PM to 9 PM, loads A, B, C, D, with segments of 30, 40, 40, 40, 40, and 20 minutes]
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup

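The two laundry totals can be checked with a short sketch. The function names are illustrative, and the model assumes a load may wait between stages (for example, wet clothes sitting in a basket until the dryer is free).

```python
# Stage times in minutes, taken from the laundry example: wash, dry, fold.
STAGE_MINUTES = [30, 40, 20]


def sequential_minutes(loads, stages=STAGE_MINUTES):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages=STAGE_MINUTES):
    """Overlap the loads; each stage can handle only one load at a time."""
    stage_free = [0] * len(stages)  # time at which each stage becomes free
    finish = 0
    for _ in range(loads):
        t = 0  # every load is ready at time 0
        for i, duration in enumerate(stages):
            start = max(t, stage_free[i])  # wait for the stage to be free
            t = start + duration
            stage_free[i] = t
        finish = t
    return finish


if __name__ == "__main__":
    seq = sequential_minutes(4)
    pipe = pipelined_minutes(4)
    print(seq / 60, pipe / 60, seq / pipe)
```

With four loads the speedup is well below the three-stage ideal, because the stages are unbalanced and the pipeline needs time to fill and drain.
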
Computer Pipelines
• Execute billions of instructions, so throughput is what matters
• What is desirable in instruction sets for pipelining?
  – Variable length instructions vs. all instructions same length?
  – Memory operands part of any operation vs. memory operands only in loads or stores?
  – Register operand many places in instruction format vs. registers located in same place?

A "Typical" RISC

• Registers
  – 32 64-bit general-purpose (integer) registers (R0-R31)
  – 32 64-bit floating-point registers (F0-F31)
• Data types
  – 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double words for integer data
  – 32-bit single- or 64-bit double-precision numbers
• Addressing Modes for MIPS Data Transfers
  – Load-store architecture: Immediate, Displacement
  – Memory is byte addressable with a 64-bit address
  – Mode bit to select Big Endian or Little Endian

MIPS64 Instruction Formats

Register-Register
    Bits:   31-26   25-21   20-16   15-11   10-6    5-0
    Field:  Op      Rs      Rt      Rd      shamt   funct

Register-Immediate
    Bits:   31-26   25-21   20-16   15-0
    Field:  Op      Rs      Rt      immediate

Jump / Call
    Bits:   31-26   25-0
    Field:  Op      address

Floating-point (FR)
    Bits:   31-26   25-21   20-16   15-11   10-6    5-0
    Field:  Op      Fmt     Ft      Fs      Fd      funct

Floating-point (FI)
    Bits:   31-26   25-21   20-16   15-0
    Field:  Op      Fmt     Ft      immediate

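The fixed field positions are what make decoding cheap: every field is a shift and a mask. Below is a minimal sketch for the Register-Register and Register-Immediate formats. The helper names and the sample field values are made up for illustration and are not real opcode assignments.

```python
def field(word, hi, lo):
    """Extract bits hi..lo (inclusive, bit 0 = least significant) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_register_register(word):
    return {
        "op": field(word, 31, 26),
        "rs": field(word, 25, 21),
        "rt": field(word, 20, 16),
        "rd": field(word, 15, 11),
        "shamt": field(word, 10, 6),
        "funct": field(word, 5, 0),
    }


def decode_register_immediate(word):
    return {
        "op": field(word, 31, 26),
        "rs": field(word, 25, 21),
        "rt": field(word, 20, 16),
        "immediate": field(word, 15, 0),
    }


if __name__ == "__main__":
    # Arbitrary illustrative field values: rs=2, rt=3, rd=1, funct=0x2C.
    word = (2 << 21) | (3 << 16) | (1 << 11) | 0x2C
    print(decode_register_register(word))
```

Because Rs and Rt sit in the same bit positions in both formats, the register file can be read before the instruction type is known.
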
MIPS64 Instructions
• MIPS Operations (See Appendix B, Figure B.26)
  – Data Transfers (LB, LBU, SB, LH, LHU, SH, LW, LWU, SW, LD, SD, L.S, L.D, S.S, S.D, MFC0, MTC0, MOV.S, MOV.D, MFC1, MTC1)
  – Arithmetic/Logical (DADD, DADDI, DADDU, DADDIU, DSUB, DSUBU, DMUL, DMULU, DDIV, DDIVU, MADD, AND, ANDI, OR, ORI, XOR, XORI, LUI, DSLL, DSRL, DSRA, DSLLV, DSRLV, DSRAV, SLT, SLTI, SLTU, SLTIU)
  – Control (BEQZ, BNEZ, BEQ, BNE, BC1T, BC1F, MOVN, MOVZ, J, JR, JAL, JALR, TRAP, ERET)
  – Floating Point (ADD.D, ADD.S, ADD.PS, SUB.D, SUB.S, SUB.PS, MUL.D, MUL.S, MUL.PS, MADD.D, MADD.S, MADD.PS, DIV.D, DIV.S, DIV.PS, CVT._._, C._.D, C._.S)

5 Steps of Simple RISC Datapath
[Figure: datapath divided into five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled elements: Next PC, Next SEQ PC, Adder (+4), Inst Memory (Address), Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, ALU, Zero?, Data Memory, LMD, WB Data]

5 Steps of Simple RISC Datapath (cont’d)
[Figure: the same five-step datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the steps; Next SEQ PC and RD are carried along through the pipeline registers]
• Data stationary control
  – local decode for each instruction phase / pipeline stage

Visualizing Pipeline
[Figure: instruction order versus time (clock cycles CC 1 to CC 7); each instruction passes through IM, Reg, ALU, DM, Reg, starting one cycle after the instruction before it]

Instruction Flow through Pipeline
[Figure: contents of the IM, Reg, ALU, DM, and Reg stages in clock cycles CC 1 to CC 4 as Add R1,R2,R3; Lw R4,0(R2); Sub R6,R5,R7; and Xor R9,R8,R1 enter the pipeline; stages not yet reached hold Nop]

Simple RISC Pipeline Definition: IF, ID
• Stage IF
  – IF/ID.IR ← Mem[PC];
  – if EX/MEM.cond {IF/ID.NPC, PC ← EX/MEM.ALUOUT}
    else {IF/ID.NPC, PC ← PC + 4};
• Stage ID
  – ID/EX.A ← Regs[IF/ID.IR[6…10]];
  – ID/EX.B ← Regs[IF/ID.IR[11…15]];
  – ID/EX.Imm ← (IF/ID.IR[16])^16 ## IF/ID.IR[16…31];
  – ID/EX.NPC ← IF/ID.NPC;
  – ID/EX.IR ← IF/ID.IR;

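The ID-stage immediate is a sign extension: the sign bit of the 16-bit field is replicated 16 times and concatenated (##) in front of the field. Here is a minimal sketch of that one transfer. The function names are illustrative, and the result is treated as a 32-bit pattern as in the register-transfer notation.

```python
def sign_extend_16(imm16):
    """Replicate the sign bit of a 16-bit field 16 times and prepend it (32-bit result)."""
    imm16 &= 0xFFFF
    sign = (imm16 >> 15) & 1
    upper = 0xFFFF if sign else 0x0000  # sixteen copies of the sign bit
    return (upper << 16) | imm16


def as_signed_32(value):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return value - (1 << 32) if value & (1 << 31) else value


if __name__ == "__main__":
    print(hex(sign_extend_16(0x0004)), as_signed_32(sign_extend_16(0x0004)))
    print(hex(sign_extend_16(0xFFFC)), as_signed_32(sign_extend_16(0xFFFC)))
```
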
Simple RISC Pipeline Definition: EX
• ALU
  – EX/MEM.IR ← ID/EX.IR;
  – EX/MEM.ALUOUT ← ID/EX.A func ID/EX.B; or
    EX/MEM.ALUOUT ← ID/EX.A func ID/EX.Imm;
  – EX/MEM.cond ← 0;
• load/store
  – EX/MEM.IR ← ID/EX.IR;
  – EX/MEM.B ← ID/EX.B;
  – EX/MEM.ALUOUT ← ID/EX.A + ID/EX.Imm;
  – EX/MEM.cond ← 0;
• branch
  – EX/MEM.ALUOUT ← ID/EX.NPC + (ID/EX.Imm << 2);
  – EX/MEM.cond ← (ID/EX.A func 0);

Simple RISC Pipeline Def.: MEM, WB
• Stage MEM
  – ALU
      MEM/WB.IR ← EX/MEM.IR;
      MEM/WB.ALUOUT ← EX/MEM.ALUOUT;
  – load/store
      MEM/WB.IR ← EX/MEM.IR;
      MEM/WB.LMD ← Mem[EX/MEM.ALUOUT]; or
      Mem[EX/MEM.ALUOUT] ← EX/MEM.B;
• Stage WB
  – ALU
      Regs[MEM/WB.IR[16…20]] ← MEM/WB.ALUOUT; or
      Regs[MEM/WB.IR[11…15]] ← MEM/WB.ALUOUT;
  – load
      Regs[MEM/WB.IR[11…15]] ← MEM/WB.LMD;

It’s Not That Easy for Computers
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions
  – Data hazards: Instruction depends on result of prior instruction still in the pipeline
  – Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port/Structural Hazards
[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) versus time (Cycle 1 to Cycle 7); each instruction passes through Ifetch, Reg, ALU, DMem, Reg. With a single memory port, the DMem access of Load and the Ifetch of Instr 3 fall in the same cycle]

One Memory Port/Structural Hazards (cont’d)
[Figure: instruction order (Load, Instr 1, Instr 2, Stall, Instr 3) versus time (Cycle 1 to Cycle 7); a stall is inserted after Instr 2, a bubble moves through the pipeline stages, and the Ifetch of Instr 3 is delayed by one cycle]

Data Hazard on R1
[Figure: instruction order versus time (clock cycles), with stages IF, ID/RF, EX, MEM, WB (drawn as Ifetch, Reg, ALU, DMem, Reg)]
    dadd r1,r2,r3
    dsub r4,r1,r3
    and  r6,r1,r7
    or   r8,r1,r9
    xor  r10,r1,r11

Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes operand before InstrI writes it.
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5

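The three definitions reduce to comparing the destination and source registers of an earlier instruction I and a later instruction J. The sketch below shows that comparison on the instruction pairs used in these slides. The Instr tuple and function name are illustrative. The check finds the register dependences; whether one becomes an actual hazard depends on the pipeline, as noted above for WAR and WAW.

```python
from collections import namedtuple

# Illustrative instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")


def dependences(i, j):
    """Return the kinds of data dependence of later instruction j on earlier instruction i."""
    kinds = set()
    if i.dest is not None and i.dest in j.srcs:
        kinds.add("RAW")  # j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        kinds.add("WAR")  # j writes what i reads
    if i.dest is not None and i.dest == j.dest:
        kinds.add("WAW")  # both write the same register
    return kinds


if __name__ == "__main__":
    add = Instr("add", "r1", ("r2", "r3"))
    print(dependences(add, Instr("sub", "r4", ("r1", "r3"))))   # RAW slide
    print(dependences(Instr("sub", "r4", ("r1", "r3")), add))   # WAR slide
    print(dependences(Instr("sub", "r1", ("r4", "r3")), add))   # WAW slide
```
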
Forwarding to Avoid Data Hazard
[Figure: instruction order versus time (clock cycles); each instruction passes through Ifetch, Reg, ALU, DMem, Reg, with the result of add forwarded to the later instructions that read r1]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

HW Change for Forwarding
[Figure: datapath fragment showing NextPC, Registers, Immediate, the ID/EX, EX/MEM, and MEM/WR pipeline registers, the ALU with muxes on its inputs, and Data Memory]

Forwarding to DM input
- Forward R1 from EX/MEM.ALUOUT to ALU input (lw)
- Forward R1 from MEM/WB.ALUOUT to ALU input (sw)
- Forward R4 from MEM/WB.LMD to memory input (memory output to memory input)
    add R1,R2,R3
    lw  R4,0(R1)
    sw  12(R1),R4
[Figure: instruction order versus time (clock cycles CC 1 to CC 7); each instruction passes through IM, Reg, ALU, DM, Reg]

Forwarding to DM input (cont’d)
Forward R1 from MEM/WB.ALUOUT to DM input
    add R1,R2,R3
    sw  0(R4),R1
[Figure: instruction order versus time (clock cycles CC 1 to CC 6); each instruction passes through IM, Reg, ALU, DM, Reg]

Forwarding to Zero
Forward R1 from EX/MEM.ALUOUT to Zero
    add  R1,R2,R3
    beqz R1,50
Forward R1 from MEM/WB.ALUOUT to Zero
    add  R1,R2,R3
    sub  R4,R5,R6
    bnez R1,50
[Figure: for each sequence, instruction order versus time (clock cycles CC 1 to CC 6); instructions pass through IM, Reg, ALU, DM, Reg, and Z marks the zero test of the branch]

Data Hazard Even with Forwarding
[Figure: instruction order versus time (clock cycles); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

Data Hazard Even with Forwarding
[Figure: the same sequence with a one-cycle bubble inserted after the load, delaying sub, and, and or by one cycle]
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd

Fast code:
    LW   Rb,b
    LW   Rc,c
    LW   Re,e
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra
    SUB  Rd,Re,Rf
    SW   d,Rd

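The difference between the two sequences can be counted with a simplified model: with forwarding, an instruction stalls one cycle only when it reads a register loaded by the instruction immediately before it. The tuple encoding and function name below are illustrative, and the model ignores every other source of stalls.

```python
# Each instruction is (opcode, destination register or None, source registers).
SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]


def load_use_stalls(program):
    """Count instructions that read a register loaded by their immediate predecessor."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "LW" and prev_dest in cur[2]:
            stalls += 1
    return stalls


if __name__ == "__main__":
    print(load_use_stalls(SLOW), load_use_stalls(FAST))
```

Under this model the slow sequence stalls at ADD and SUB, while the fast sequence has an independent instruction after every load.
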
Control Hazard on Branches: Three Stage Stall
    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11
[Figure: the five instructions versus time; each passes through Ifetch, Reg, ALU, DMem, Reg]

Example: Branch Stall Impact
• If 30% branch, Stall 3 cycles significant
• Two part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
  – Move Zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3

Pipelined Simple RISC Datapath
[Figure: five-stage datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB; the Zero? test and a second Adder appear alongside the Reg File and Sign Extend, and a MUX selects between Next SEQ PC and the branch target for Next PC]
• Data stationary control
  – local decode for each instruction phase / pipeline stage

Four Branch Hazard Alternatives
• #1: Stall until branch direction is clear
• #2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% MIPS branches not taken on average
  – PC+4 already calculated, so use it to get next instruction

Branch not Taken
Branch is untaken (determined during ID), we have fetched the fall-through and just continue
=> no wasted cycles
    Instructions          Time [clocks]
    branch (not taken)    IF   ID   Ex   Mem  WB
    Ii+1                       IF   ID   Ex   Mem  WB
    Ii+2                            IF   ID   Ex   Mem  WB

Branch is taken (determined during ID), restart the fetch at the branch target
=> one cycle wasted
    Instructions          Time [clocks]
    branch (taken)        IF   ID   Ex   Mem  WB
    Ii+1                       IF   idle idle idle idle
    branch target                   IF   ID   Ex   Mem  WB
    branch target+1                      IF   ID   Ex   Mem  WB

Four Branch Hazard Alternatives
• #3: Predict Branch Taken
  – Treat every branch as taken
  – 53% MIPS branches taken on average
  – But haven’t calculated branch target address in MIPS
    – MIPS still incurs 1 cycle branch penalty
  – Makes sense only when branch target is known before branch outcome

Four Branch Hazard Alternatives
• #4: Delayed Branch
  – Define branch to take place AFTER a following instruction
        branch instruction
        sequential successor1
        sequential successor2
        ........
        sequential successorn
        branch target if taken
    (sequential successor1 through successorn form the branch delay of length n)
  – 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  – MIPS uses this

Delayed Branch
• Where to get instructions to fill branch delay slot?
  – Before branch instruction
  – From the target address: only valuable when branch taken
  – From fall through: only valuable when branch not taken

Scheduling the branch delay slot: From Before
    ADD R1,R2,R3
    if(R2=0) then
      <Delay Slot>
Becomes
    if(R2=0) then
      <ADD R1,R2,R3>
• Delay slot is scheduled with an independent instruction from before the branch
• Best choice, always improves performance

Scheduling the branch delay slot: From Target
    SUB R4,R5,R6
    ...
    ADD R1,R2,R3
    if(R1=0) then
      <Delay Slot>
Becomes
    ...
    ADD R1,R2,R3
    if(R1=0) then
      <SUB R4,R5,R6>
• Delay slot is scheduled from the target of the branch
• Must be OK to execute that instruction if branch is not taken
• Usually the target instruction will need to be copied because it can be reached by another path
  => programs are enlarged
• Preferred when the branch is taken with high probability

Scheduling the branch delay slot: From Fall Through
    ADD R1,R2,R3
    if(R2=0) then
      <Delay Slot>
    SUB R4,R5,R6
Becomes
    ADD R1,R2,R3
    if(R2=0) then
      <SUB R4,R5,R6>
• Delay slot is scheduled from the not-taken fall through
• Must be OK to execute that instruction if branch is taken
• Improves performance when branch is not taken

Delayed Branch Effectiveness
• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Example: Branch Stall Impact
• Assume CPI = 1.0 ignoring branches
• Assume solution was stalling for 3 cycles
• If 30% branch, Stall 3 cycles

    Op       Freq   Cycles   CPI(i)   (% Time)
    Other    70%    1        .7       (37%)
    Branch   30%    4        1.2      (63%)

• => new CPI = 1.9, or almost 2 times slower

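The table is a weighted sum: each class contributes frequency times cycles to the CPI, and its share of execution time is that contribution divided by the total. A short sketch that recomputes the numbers, with illustrative function and variable names:

```python
def effective_cpi(mix):
    """mix maps an instruction class to (frequency, cycles); returns the weighted CPI."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = effective_cpi(mix)
    return {name: freq * cycles / total for name, (freq, cycles) in mix.items()}


if __name__ == "__main__":
    # A branch costs 1 cycle plus the 3-cycle stall.
    mix = {"Other": (0.70, 1), "Branch": (0.30, 1 + 3)}
    print(round(effective_cpi(mix), 2))
    print({name: round(100 * share) for name, share in time_share(mix).items()})
```
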
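Example 2: Speed Up Equation for Pipelining

\[ \mathrm{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction} \]

\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]

For simple RISC pipeline, CPI = 1:

\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]
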
Example 3: Evaluating Branch Alternatives (for 1 program)

    Scheduling scheme    Branch penalty   CPI    Speedup v. stall
    Stall pipeline       3                1.42   1.0
    Predict taken        1                1.14   1.26
    Predict not taken    1                1.09   1.29
    Delayed branch       0.5              1.07   1.31

• Conditional & Unconditional = 14%, 65% change PC

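The CPI column can be reproduced from the two percentages, reading them as follows (an interpretation, not stated in full on the slide): 14% of instructions are conditional or unconditional branches, and 65% of those change the PC. Predict-not-taken then pays its one-cycle penalty only on the branches that change the PC. The names below are illustrative.

```python
BRANCH_FREQ = 0.14      # fraction of instructions that are branches
CHANGE_PC_FRAC = 0.65   # fraction of branches that change the PC

# Average stall cycles per branch for each scheme, using the table's penalties.
STALLS_PER_BRANCH = {
    "Stall pipeline": 3,
    "Predict taken": 1,
    "Predict not taken": 1 * CHANGE_PC_FRAC,
    "Delayed branch": 0.5,
}


def scheme_cpi(stalls_per_branch, ideal_cpi=1.0):
    """CPI = ideal CPI + branch frequency x average stall cycles per branch."""
    return ideal_cpi + BRANCH_FREQ * stalls_per_branch


if __name__ == "__main__":
    for name, stalls in STALLS_PER_BRANCH.items():
        print(f"{name:18s} CPI = {scheme_cpi(stalls):.2f}")
```
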
Example 4: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads & Stores are 40% of instructions executed

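One way to compare the two machines is to apply the speedup equation for CPI = 1 to each. The sketch below assumes that every load or store on Machine B costs one stall cycle for the structural hazard; the slide does not state this. The pipeline depth of 5 is a placeholder that cancels in the ratio, and the function name is illustrative.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio):
    """Speedup = depth / (1 + stall CPI) x (unpipelined cycle time / pipelined cycle time)."""
    return depth / (1.0 + stall_cpi) * clock_ratio


if __name__ == "__main__":
    depth = 5  # placeholder; cancels when the two machines are compared
    speedup_a = pipeline_speedup(depth, stall_cpi=0.0, clock_ratio=1.0)
    # Assumption: 40% of instructions each stall one cycle on the single port.
    speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)
    print(round(speedup_a / speedup_b, 2))
```

Under these assumptions the faster clock of Machine B does not make up for its memory stalls.
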
Extended Simple RISC Pipeline
DLX pipe with three unpipelined FP functional units
[Figure: IF and ID feed four parallel execution units (EX Int, EX FP/I Mult, EX FP Add, EX FP/I Div), which rejoin at Mem and WB]
In reality, the intermediate results are probably not cycled around the EX unit; instead the EX stage has some number of clock delays larger than 1

Extended Simple RISC Pipeline (cont’d)
• Initiation or repeat interval: number of clock cycles that must elapse between issuing two operations
• Latency: the number of intervening clock cycles between an instruction that produces a result and an instruction that uses the result

    Functional unit        Latency   Initiation interval
    Integer ALU            0         1
    Data Memory            1         1
    FP Add                 3         1
    FP/Integer Multiply    6         1
    FP/Integer Divide      24        25

Extended Simple RISC Pipeline (cont’d)
[Figure: IF and ID feed parallel paths: Ex (integer); M1 to M7 (multiply); A1 to A4 (FP add); and a further unit shown as “..”; all paths rejoin at M and WB]

Extended Simple RISC Pipeline (cont’d)
• Multiple outstanding FP operations
  – FP/I Adder and Multiplier are fully pipelined
  – FP/I Divider is not pipelined
• Pipeline timing for independent operations

    MUL.D   IF   ID   M1   M2   M3   M4   M5   M6   M7   Mem  WB
    ADD.D   IF   ID   A1   A2   A3   A4   Mem  WB
    L.D     IF   ID   Ex   Mem  WB
    S.D     IF   ID   Ex   Mem  WB

Hazards and Forwarding in Longer Pipes
• Structural hazard: divide unit is not fully pipelined
  – detect it and stall the instruction
• Structural hazard: number of register writes can be larger than one due to varying running times
• WAW hazards are possible
• Exceptions!
  – instructions can complete in different order than they were issued
• RAW hazards will be more frequent

Examples
• Stalls arising from RAW hazards

Clock cycle:       1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
L.D   F4, 0(R2)    IF    ID    EX    Mem   WB
MUL.D F0, F4, F6         IF    ID    stall M1    M2    M3    M4    M5    M6    M7    Mem   WB
ADD.D F2, F0, F8               IF    stall ID    stall stall stall stall stall stall A1    A2    A3    A4    Mem   WB
S.D   0(R2), F2                            IF    stall stall stall stall stall stall ID    EX    stall stall stall Mem

• Three instructions that want to perform a write back to the FP register file simultaneously

Clock cycle:       1     2     3     4     5     6     7     8     9     10    11
MUL.D F0, F4, F6   IF    ID    M1    M2    M3    M4    M5    M6    M7    Mem   WB
...                      IF    ID    EX    Mem   WB
...                            IF    ID    EX    Mem   WB
ADD.D F2, F4, F6                     IF    ID    A1    A2    A3    A4    Mem   WB
...                                        IF    ID    EX    Mem   WB
...                                              IF    ID    EX    Mem   WB
L.D   F2, 0(R2)                                        IF    ID    EX    Mem   WB

Solving Register Write Conflicts
• First approach: track the use of the write port in the ID stage and stall an instruction before it issues
  – use a shift register that indicates when already issued instructions will use the register file
  – if there is a conflict with an already issued instruction, stall the instruction for one clock cycle
  – on each clock cycle the reservation register is shifted one bit
• Alternative approach: stall a conflicting instruction when it tries to enter MEM or WB stage
  – we can stall either instruction
  – e.g. give priority to the unit with the longest latency
  – Pros: the conflict does not have to be detected until the entrance of MEM or WB stage
  – Cons: complicates pipeline control; stalls now can arise from two different places

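The first approach can be sketched as a bit vector in which bit k means "the register write port is reserved k cycles from now". The class and method names are hypothetical, and the cycle distances in the usage lines are illustrative rather than taken from the slides.

```python
class WritePortReservation:
    """Shift register tracking future uses of the single register write port."""

    def __init__(self):
        self.bits = 0  # bit k set => write port is reserved k cycles from now

    def try_issue(self, cycles_until_wb):
        """Reserve the port for an instruction in ID; return False if it must stall."""
        mask = 1 << cycles_until_wb
        if self.bits & mask:
            return False  # conflict with an already issued instruction
        self.bits |= mask
        return True

    def tick(self):
        """Advance one clock cycle: every reservation moves one step closer."""
        self.bits >>= 1


if __name__ == "__main__":
    port = WritePortReservation()
    print(port.try_issue(9))   # long-latency operation reserves cycle +9
    for _ in range(3):
        port.tick()
    print(port.try_issue(6))   # would write in the same cycle, so it stalls
    port.tick()
    print(port.try_issue(6))   # one cycle later the slot is free
```

A stalled instruction simply retries on the next cycle, which matches the one-cycle stall described in the list above.
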
WAW Hazards
[Figure: pipeline timing of ADD.D F2, F4, F6 (IF ID A1 A2 A3 A4 Mem WB) and a later L.D F2, 0(R2) (IF ID EX Mem WB), shown with other instructions (IF ID EX Mem WB); both write F2]
• Result of ADD.D is overwritten without any instruction ever using it
  – WAWs occur when useless instruction is executed
  – still, we must detect them and provide correct execution
  – Why?
            BNEZ   R1, foo
            DIV.D  F0, F2, F4   ; delay slot from fall-through
            ...
      foo:  L.D    F0, qrs

Solving WAW Hazards
• First approach: delay the issue of load instruction until ADD.D enters MEM
• Second approach: stamp out the result of the ADD.D by detecting the hazard and changing the control so that ADD.D does not write; L.D issues right away
• Detect hazard in ID when L.D is issuing
  – stall L.D, or
  – make ADD.D no-op
• Luckily this hazard is rare

Hazard Detection in ID Stage
• Possible hazards
  – hazards among FP instructions
  – hazards between an FP instruction and an integer instr.
    – FP and integer registers are distinct, except for FP load-stores, and FP-integer moves
• Assume that pipeline does all hazard detection in ID stage

Hazard Detection in ID Stage (cont’d)
• Check for structural hazards
  – wait until the required functional unit is not busy and make sure that the register write port is available
• Check for RAW data hazards
  – wait until source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result
• Check for WAW data hazards
  – determine if any instruction in A1, .. A4, M1, .. M7, D has the same register destination as this instruction; if so, stall the issue of the instruction in ID

Forwarding Logic
• Check if the destination register in any of EX/MEM, A4/MEM, M7/MEM, D/MEM, or MEM/WB pipeline registers is one of the source registers of a FP instruction
• If so, the appropriate input multiplexer will have to be enabled so as to choose the forwarded data
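
The check can be sketched as a lookup over the pipeline registers named above. The dictionary representation, the function name, and the rule that the first match in list order wins are assumptions made for illustration; the slides do not specify a priority among the pipeline registers.

```python
# Pipeline registers that can supply a forwarded result, as listed on the slide.
# The order is an assumed priority (first match wins), not taken from the slides.
FORWARD_SOURCES = ["EX/MEM", "A4/MEM", "M7/MEM", "D/MEM", "MEM/WB"]


def select_forwarding(pipeline_dests, source_regs):
    """Map each source register to the pipeline register that should feed its mux.

    pipeline_dests: dict from pipeline register name to the destination
    register of the instruction it holds (absent or None if it writes nothing).
    """
    selection = {}
    for src in source_regs:
        for name in FORWARD_SOURCES:
            if pipeline_dests.get(name) == src:
                selection[src] = name  # enable the mux input fed by this register
                break
    return selection


if __name__ == "__main__":
    in_flight = {"M7/MEM": "F0", "MEM/WB": "F8"}
    print(select_forwarding(in_flight, ["F0", "F8"]))
    print(select_forwarding({"EX/MEM": "F2"}, ["F4", "F6"]))
```

Source registers that are missing from the result are read from the register file as usual.
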