CpE 442
Designing a Pipeline Processor (lect. II)
CPE 442 hazards.1
Introduction to Computer Architecture
Outline of Today’s Lecture
° Recap and Introduction (5 minutes)
° Introduction to Hazards (15 minutes)
° Forwarding (25 minutes)
° 1 cycle Load Delay (5 minutes)
° 1 cycle Branch Delay (15 minutes)
° What makes pipelining hard
° Summary (5 minutes)
Review: Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock-cycle timing of Load, Store, and R-type under each implementation.
Single Cycle Implementation (Cycle 1, Cycle 2): Load, then Store, with the unused part of the cycle marked Waste.
Multiple Cycle Implementation (Cycle 1 to Cycle 10): Load uses Ifetch, Reg, Exec, Mem, Wr; Store uses Ifetch, Reg, Exec, Mem; R-type starts with Ifetch.
Pipeline Implementation: Load, Store, and R-type start Ifetch in successive cycles and overlap in the Reg, Exec, Mem, and Wr stages.]
Review: A Pipelined Datapath
[Figure: pipelined datapath. The stages Ifetch, Reg/Dec, Exec, Mem, and Wr are separated by the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr Registers. Components: PC, IUnit, RFile (Ra, Rb, Rw, Di, busA, busB), Exec Unit (Zero), Data Mem (RA, WA, Di, Do), and muxes. Control signals: ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, RegWr. Imm16, Rs, Rt, Rd, and PC+4 travel through the pipeline registers.]
Review: Pipeline Control “Data Stationary Control”
° The Main Control generates the control signals during Reg/Dec
• Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
• Control signals for Mem (MemWr, Branch) are used 2 cycles later
• Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: Main Control in Reg/Dec loads the ID/Ex Register with ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, and RegWr. Exec uses ExtOp, ALUSrc, ALUOp, and RegDst; the Ex/Mem Register carries MemWr, Branch, MemtoReg, and RegWr to Mem; the Mem/Wr Register carries MemtoReg and RegWr to Wr.]
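The idea of data stationary control can be sketched in a few lines of Python. This is only an illustration: the signal groups follow the slide, while the dictionary layout, the `bubble` helper, and the `advance` function are assumptions introduced here, not part of the lecture's datapath.

```python
# Signal groups taken from the slide; the dict layout is an illustrative assumption.
EXEC_SIGNALS = ("ExtOp", "ALUSrc", "ALUOp", "RegDst")
MEM_SIGNALS = ("MemWr", "Branch")
WR_SIGNALS = ("MemtoReg", "RegWr")
ALL_SIGNALS = EXEC_SIGNALS + MEM_SIGNALS + WR_SIGNALS


def bubble():
    """Control word with every signal deasserted."""
    return {name: 0 for name in ALL_SIGNALS}


def advance(id_ex, ex_mem, decoded):
    """Model one clock edge; return the new (ID/Ex, Ex/Mem, Mem/Wr) control contents."""
    new_mem_wr = {name: ex_mem[name] for name in WR_SIGNALS}
    new_ex_mem = {name: id_ex[name] for name in MEM_SIGNALS + WR_SIGNALS}
    new_id_ex = dict(decoded)
    return new_id_ex, new_ex_mem, new_mem_wr


# Control for a load-like instruction (values chosen for illustration).
load_control = dict(bubble(), ExtOp=1, ALUSrc=1, MemtoReg=1, RegWr=1)

id_ex, ex_mem, mem_wr = advance(bubble(), bubble(), load_control)  # edge 1
id_ex, ex_mem, mem_wr = advance(id_ex, ex_mem, bubble())           # edge 2
id_ex, ex_mem, mem_wr = advance(id_ex, ex_mem, bubble())           # edge 3
print(mem_wr)  # the Wr-stage signals decoded three edges earlier
```

Each stage only looks at the control bits that arrived with its own instruction, which is why no stage needs to know what the others are doing.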
Review: Pipeline Summary
° Pipeline Processor:
• Natural enhancement of the multiple clock cycle processor
• Each functional unit can only be used once per instruction
• If an instruction is going to use a functional unit:
- it must use it at the same stage as all other instructions
• Pipeline Control:
- Each stage’s control signal depends ONLY on the instruction
that is currently in that stage
Outline of Today’s Lecture
° Recap and Introduction (5 minutes)
° Introduction to Hazards
° Forwarding (25 minutes)
° 1 cycle Load Delay (5 minutes)
° 1 cycle Branch Delay (15 minutes)
° What makes pipelining hard
° Summary (5 minutes)
Introduction to Hazards
° Limits to pipelining: Hazards prevent the next
instruction from executing during its
designated clock cycle
• structural hazards: HW cannot support this
combination of instructions
• data hazards: instruction depends on the
result of a prior instruction still in the
pipeline
• control hazards: pipelining of branches &
other instructions
° Common solution is to stall the pipeline until the
hazard is resolved, inserting “bubbles” in the pipeline
A Single Memory is a Structural Hazard
[Figure: pipeline diagram, time (clock cycles) across, instruction order down: Load, Instr 1, Instr 2, Instr 3, Instr 4. Each instruction passes through Mem, Reg, ALU, Mem, Reg. With a single memory, the data access (Mem) of Load falls in the same cycle as the instruction fetch (Mem) of Instr 3.]
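As a rough check of where a single shared memory is asked to do two things at once, the following Python sketch lists the colliding cycles. It assumes one instruction is fetched per cycle and that only loads and stores use the Mem stage for data; the function name and the `'load'`/`'store'`/`'other'` labels are hypothetical.

```python
def shared_memory_conflicts(ops):
    """Return the clock cycles in which a data access and an instruction fetch collide.

    ops[i] is 'load', 'store', or 'other'; instruction i is fetched in cycle i + 1
    and reaches its Mem stage three cycles later (Ifetch, Reg, Exec, Mem, Wr).
    """
    conflicts = []
    for i, op in enumerate(ops):
        if op not in ("load", "store"):
            continue
        mem_cycle = (i + 1) + 3
        # The instruction at 0-based index mem_cycle - 1 is fetched in mem_cycle.
        if mem_cycle - 1 < len(ops):
            conflicts.append(mem_cycle)
    return conflicts


# Load, Instr 1, Instr 2, Instr 3, Instr 4 as on the slide.
print(shared_memory_conflicts(["load", "other", "other", "other", "other"]))
```

For the slide's sequence the only collision is in cycle 4, where Load accesses data while Instr 3 is being fetched.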
Option 1: Stall to resolve Memory
Structural Hazard
[Figure: same sequence (Load, Instr 1, Instr 2, Instr 3, Instr 4), time (clock cycles) across. Instr 3 is stalled: a bubble takes its place for one cycle so that its fetch no longer coincides with the Mem access of Load.]
Option 2: Duplicate to Resolve Structural Hazard
• Separate Instruction Cache (Im) & Data Cache (Dm)
[Figure: same sequence (Load, Instr 1, Instr 2, Instr 3, Instr 4), time (clock cycles) across. Each instruction passes through Im, Reg, ALU, Dm, Reg, so instruction fetches (Im) and data accesses (Dm) never contend for the same memory.]
Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
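The dependences on r1 in this sequence can be listed mechanically. The Python sketch below is an illustration only: the tuple encoding `(opcode, destination, source1, source2)` and the function name are assumptions. It reports read-after-write pairs and how many instructions apart they are; which distances turn into hazards depends on the pipeline.

```python
# Each instruction: (opcode, destination, source1, source2).
program = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r3"),
    ("and", "r6", "r1", "r7"),
    ("or", "r8", "r1", "r9"),
    ("xor", "r10", "r1", "r11"),
]


def raw_dependences(instrs):
    """Return (producer_index, consumer_index, register) for each read-after-write pair."""
    deps = []
    for i, (_, dest, _, _) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            _, later_dest, src1, src2 = instrs[j]
            if dest in (src1, src2):
                deps.append((i, j, dest))
            if later_dest == dest:
                break  # a later write to dest ends this producer's live range
    return deps


for producer, consumer, reg in raw_dependences(program):
    distance = consumer - producer
    print(f"{program[consumer][0]} reads {reg} written by {program[producer][0]} "
          f"(distance {distance})")
```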
Data Hazard on r1:
(Figure 6.30, page 397, P&H)
• Dependencies backwards in time are hazards
[Figure: pipeline diagram, time (clock cycles) across, instruction order down. Stages: IF (Im), ID/RF (Reg), EX (ALU), MEM (Dm), WB (Reg). Instructions:
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11]
Option1: HW Stalls to Resolve Data Hazard
• Dependencies backwards in time are hazards
[Figure: add r1,r2,r3, sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, xor r10,r1,r11 through IF, ID/RF, EX, MEM, WB, time (clock cycles) across. Three bubbles delay the Reg read of sub r4,r1,r3 until add has written r1 in WB.]
But recall use of “Data Stationary Control”
° The Main Control generates the control signals during Reg/Dec
• Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
• Control signals for Mem (MemWr, Branch) are used 2 cycles later
• Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: Main Control in Reg/Dec loads the ID/Ex Register with ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, and RegWr; the Ex/Mem Register carries MemWr, Branch, MemtoReg, and RegWr; the Mem/Wr Register carries MemtoReg and RegWr.]
Option 1: How HW really stalls pipeline
• HW doesn’t change PC => keeps fetching same instruction
& sets control signals to benign values (0)
[Figure: add r1,r2,r3 followed by three stall rows, each fetching from Im and then carrying bubbles, then sub r4,r1,r3 and and r6,r1,r7. Stages: IF, ID/RF, EX, MEM, WB, time (clock cycles) across.]
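The stall mechanism described on this slide (hold the PC, zero the control signals) might be modeled as follows. This is a sketch under assumptions: the `Control` class carries only three of the lecture's signals, and `decode_step` is a hypothetical helper, not the actual hazard logic of the datapath.

```python
from dataclasses import dataclass


@dataclass
class Control:
    """A few of the lecture's control signals; 0 means 'do nothing'."""
    RegWr: int = 0
    MemWr: int = 0
    Branch: int = 0


def decode_step(pc, decoded, hazard):
    """Return (next PC, control word passed on to the ID/Ex Register)."""
    if hazard:
        # Hold the PC so the same instruction is fetched again,
        # and send a bubble (all-zero control) down the pipeline.
        return pc, Control()
    return pc + 4, decoded


sub_control = Control(RegWr=1)
print(decode_step(104, sub_control, hazard=True))   # PC held, bubble issued
print(decode_step(104, sub_control, hazard=False))  # PC advances, sub issued
```

Because a bubble writes neither registers nor memory and never branches, letting it flow through the remaining stages is harmless.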
Option 2: SW inserts independent instructions
• Worst case inserts NOP instructions
[Figure: add r1,r2,r3, nop, nop, nop, sub r4,r1,r3, and r6,r1,r7 through IF (Im), ID/RF (Reg), EX (ALU), MEM (Dm), WB (Reg), time (clock cycles) across.]
Outline of Today’s Lecture
° Recap and Introduction (5 minutes)
° Introduction to Hazards (15 minutes)
° Forwarding
° 1 cycle Load Delay (5 minutes)
° 1 cycle Branch Delay (15 minutes)
° What makes pipelining hard
° Summary (5 minutes)
Option 3 Insight: Data is available! (Figure 6.35, page 415, P&H)
• Pipeline registers already contain needed data
[Figure: add r1,r2,r3, sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, xor r10,r1,r11 through IF, ID/RF, EX, MEM, WB, time (clock cycles) across. The value of r1 is taken from the pipeline registers instead of waiting for Reg.]
HW Change for “Forwarding” (Bypassing):
• Increase multiplexors to add paths from pipeline registers
• Assumes register read during write gets new value
(otherwise more results to be forwarded)
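A minimal sketch of the selection the added multiplexors would make for one ALU operand, written in Python for illustration. The parameter names are assumptions modeled on the lecture's register names (ID/Ex, Ex/Mem, Mem/Wr), registers are given as plain numbers, and register 0 is treated as the MIPS hardwired zero; this is not the circuit in the text's figure.

```python
def forward_source(id_ex_src, ex_mem_rd, ex_mem_reg_wr, mem_wr_rd, mem_wr_reg_wr):
    """Choose where one ALU operand comes from: 'Ex/Mem', 'Mem/Wr', or 'RegFile'."""
    if ex_mem_reg_wr and ex_mem_rd != 0 and ex_mem_rd == id_ex_src:
        return "Ex/Mem"   # the most recent result takes priority
    if mem_wr_reg_wr and mem_wr_rd != 0 and mem_wr_rd == id_ex_src:
        return "Mem/Wr"
    return "RegFile"      # no newer value is in flight


# sub r4,r1,r3 is in Exec while add r1,r2,r3 is in Mem.
print(forward_source(1, ex_mem_rd=1, ex_mem_reg_wr=1, mem_wr_rd=0, mem_wr_reg_wr=0))
print(forward_source(3, ex_mem_rd=1, ex_mem_reg_wr=1, mem_wr_rd=0, mem_wr_reg_wr=0))

# and r6,r1,r7 is in Exec while sub is in Mem and add is in Wr.
print(forward_source(1, ex_mem_rd=4, ex_mem_reg_wr=1, mem_wr_rd=1, mem_wr_reg_wr=1))
```

Checking Ex/Mem before Mem/Wr matters when two in-flight instructions write the same register: the younger result is the one the operand should see.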
Complete data Path with Hazard detection and Forwarding
Figure 6.41 in the text
Outline of Today’s Lecture
° Recap and Introduction (5 minutes)
° Introduction to Hazards (15 minutes)
° Forwarding (25 minutes)
° 1 cycle Load Delay
° 1 cycle Branch Delay (15 minutes)
° What makes pipelining hard
° Summary (5 minutes)
From Last Lecture: The Delay Load Phenomenon
[Figure: Cycles 1 to 8. I0: Load, Plus 1, Plus 2, Plus 3, and Plus 4 are fetched in consecutive cycles; each passes through Ifetch, Reg/Dec, Exec, Mem, Wr.]
° Although Load is fetched during Cycle 1:
• The data is NOT written into the Reg File until the
end of Cycle 5
• We cannot read this value from the Reg File until
Cycle 6
• 3-instruction delay before the load takes effect
Forwarding reduces Data Hazard to 1 cycle:
(Figure 6.47, page 420 P&H)
[Figure: lw r1,0(r2), sub r4,r1,r6, and r6,r1,r7, or r8,r1,r9 through IF (Im), ID/RF (Reg), EX (ALU), MEM (Dm), WB (Reg), time (clock cycles) across.]
Option1: HW Stalls to Resolve Data Hazard
• “Interlock”: checks for hazard & stalls
[Figure: lw r1,0(r2), sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9 through IF, ID/RF, EX, MEM, WB, time (clock cycles) across, with one stall (a row of bubbles) inserted after the load.]
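The check an interlock performs for a load followed by a dependent instruction can be sketched as below. This is a hedged illustration in Python: the parameter names are assumptions (a `MemRead`-style flag for the instruction in Exec, and the two source register numbers of the instruction in Reg/Dec), not signals defined on these slides.

```python
def load_use_stall(id_ex_is_load, id_ex_dest, if_id_src1, if_id_src2):
    """True when the instruction in Reg/Dec needs the result of a load still in Exec."""
    return bool(id_ex_is_load) and id_ex_dest in (if_id_src1, if_id_src2)


# lw r1,0(r2) is in Exec; sub r4,r1,r3 is being decoded.
print(load_use_stall(True, 1, 1, 3))

# One cycle later the load is in Mem, so sub no longer sees a load in Exec.
print(load_use_stall(False, 0, 1, 3))
```

One stall cycle is enough here because, once the load reaches Mem, its data can be forwarded to the dependent instruction's Exec stage.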
Option 2: SW inserts independent instructions
• Worst case inserts NOP instructions
• MIPS I solution: No HW checking
[Figure: lw r1,0(r2), nop, sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9 through IF (Im), ID/RF (Reg), EX (ALU), MEM (Dm), WB (Reg), time (clock cycles) across.]
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f
are in memory.
Slow code:
LW  Rb,b
LW  Rc,c
ADD Ra,Rb,Rc
SW  a,Ra
LW  Re,e
LW  Rf,f
SUB Rd,Re,Rf
SW  d,Rd
Slow code:          Fast code:
LW  Rb,b            LW  Rb,b
LW  Rc,c            LW  Rc,c
ADD Ra,Rb,Rc        LW  Re,e
SW  a,Ra            ADD Ra,Rb,Rc
LW  Re,e            LW  Rf,f
LW  Rf,f            SW  a,Ra
SUB Rd,Re,Rf        SUB Rd,Re,Rf
SW  d,Rd            SW  d,Rd
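To see why the reordered sequence is faster, one can count the load-use stalls in each version. The Python sketch below assumes forwarding with a one-cycle load delay, as in the preceding slides, so a stall occurs only when an instruction reads a register loaded by the instruction immediately before it. The tuple encoding `(opcode, register written, registers read)` is a hypothetical convenience.

```python
# (opcode, register written or None, registers read)
SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]


def load_use_stalls(code):
    """Count instructions that read a register loaded by the instruction just before them."""
    stalls = 0
    for previous, current in zip(code, code[1:]):
        opcode, written, _ = previous
        if opcode == "LW" and written in current[2]:
            stalls += 1
    return stalls


print("slow:", load_use_stalls(SLOW), "stall cycles")
print("fast:", load_use_stalls(FAST), "stall cycles")
```

In the slow version ADD follows the load of Rc and SUB follows the load of Rf; the fast version moves an independent instruction into each of those positions.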
Outline of Today’s Lecture
° Recap and Introduction (5 minutes)
° Introduction to Hazards (15 minutes)
° Forwarding (25 minutes)
° 1 cycle Load Delay (5 minutes)
° 1 cycle Branch Delay
° What makes pipelining hard
° Summary (5 minutes)
From Last Lecture: The Delay Branch Phenomenon
[Figure: Cycles 4 to 11. 12: Beq (target is 1000) is fetched in Cycle 4, followed by 16: R-type, 20: R-type, and 24: R-type in the next three cycles, then 1000: Target of Br. Each passes through Ifetch, Reg/Dec, Exec, Mem, Wr.]
° Although Beq is fetched during Cycle 4:
• Target address is NOT written into the PC until the end of Cycle 7
• Branch’s target is NOT fetched until Cycle 8
• 3-instruction delay before the branch takes effect
Control Hazard on Branches: 3 stage stall
Branch Stall Impact
° If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
° 2 part solution:
• Determine branch taken or not sooner, AND
• Compute taken branch address earlier
° Solution Option 1:
• Move Zero test to ID/RF stage
• Adder to calculate new PC in ID/RF stage
• 1 clock cycle penalty for branch vs. 3
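The CPI figure on this slide follows from adding the average branch stall to the base CPI. A small Python sketch of that arithmetic is given here; the function name is hypothetical, and the model assumes every branch pays the full stall.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Effective CPI when every branch costs a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles


# 30% branches, 3-cycle stall (the slide's numbers), then the 1-cycle penalty of Option 1.
print(round(cpi_with_branch_stalls(1, 0.30, 3), 2))
print(round(cpi_with_branch_stalls(1, 0.30, 1), 2))
```

Under the same assumptions, cutting the penalty from 3 cycles to 1 brings the CPI from 1.9 down to 1.3.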
Option 1: move HW forward to reduce branch delay
Data Path before change
[Figure: datapath before the change. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back.]
Branch Delay now 1 clock cycle
Data Path after change
[Figure: datapath after the change. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back.]
Option 2: No Stalls. Define the branch as delayed: insert an
instruction after the branch and allow it to execute always.
° Worst case, SW inserts NOP into the branch delay slot if no instruction can
be found
° Where to get instructions to fill branch delay slot?
• Before branch instruction,
example sw r1,0(r2); beqd r0,r2,T
change to beqd r0,r2,T; sw r1,0(r2)
• From the target address: only valuable when the branch is taken
• From fall through: only valuable when the branch is not taken
° Compiler effectiveness for single branch delay slot:
• Fills about 60% of branch delay slots
• About 80% of instructions executed in branch delay slots
useful in computation
• about 50% (60% x 80%) of slots usefully filled
Complete data Path with Hazard detection and Forwarding
Figure 6.41 in the text
Example: Text Figure 6.52
Outline of Today’s Lecture
° Recap and Introduction (5 minutes)
° Introduction to Hazards (15 minutes)
° Forwarding (25 minutes)
° 1 cycle Load Delay (5 minutes)
° 1 cycle Branch Delay (15 minutes)
° What makes pipelining hard
° Summary (5 minutes)
When is pipelining hard?
° Interrupts: 5 instructions executing in 5 stage pipeline
• How to stop the pipeline?
• Restart?
• Who caused the interrupt?
Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation
° Load with data page fault, Add with instruction page fault?
° Solution 1: interrupt vector per instruction
° Solution 2: interrupt ASAP, restart everything incomplete
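The load/add question above comes down to timing: the later instruction can fault in an earlier clock cycle. The Python sketch below illustrates this with the slide's pair, assuming the load is fetched in cycle 1 and the add in cycle 2. The tuple layout and function names are hypothetical, and treating Solution 1 as "record faults with each instruction and act on them in program order" is one reading of the slide, not a description of the text's hardware.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def fault_cycle(fetch_cycle, stage):
    """Clock cycle in which an instruction fetched at fetch_cycle occupies stage."""
    return fetch_cycle + STAGES.index(stage)


# (name, fetch cycle, faulting stage, cause)
faults = [
    ("load", 1, "MEM", "data page fault"),
    ("add", 2, "IF", "instruction page fault"),
]


def first_in_time(pending):
    """The fault the hardware observes first."""
    return min(pending, key=lambda f: fault_cycle(f[1], f[2]))


def first_in_program_order(pending):
    """The fault that belongs to the earliest instruction."""
    return min(pending, key=lambda f: f[1])


print(first_in_time(faults)[0])           # add: its fetch faults in cycle 2
print(first_in_program_order(faults)[0])  # load: it comes first in the program
```

Handling the add's fault first would leave the earlier load unfinished, which is why the order in which faults are acted on matters.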
Data path with Exception Handling, Text Figure 6.55: add a
Cause register, an Exception PC, and the constant address of the
Exception Handling routine
Review: Summary of Pipelining Basics
° Speed Up ≤ Pipeline Depth (number of stages); if ideal CPI is 1, then:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
° Hazards limit performance on computers:
• structural: need more HW resources
• data: need forwarding, compiler scheduling
• control: early evaluation & PC, delayed branch, prediction
° Increasing length of pipe increases impact of hazards since pipelining
helps instruction bandwidth, not latency
° Compilers key to reducing cost of data and control hazards
• load delay slots
• branch delay slots
° Exceptions, Instruction Set, FP make pipelining harder
° Longer pipelines => Branch prediction, more instruction parallelism?
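The speedup relation in this summary can be evaluated directly. The Python sketch below is illustrative: the function and parameter names are assumptions, and the example stall figure of 0.9 cycles per instruction is the branch-stall case from earlier in the lecture (30% branches times 3 cycles), with equal clock cycles assumed.

```python
def pipeline_speedup(depth, stall_cycles_per_instr,
                     clock_unpipelined=1.0, clock_pipelined=1.0):
    """Speedup over the unpipelined machine, assuming an ideal CPI of 1."""
    return depth / (1 + stall_cycles_per_instr) * (clock_unpipelined / clock_pipelined)


print(pipeline_speedup(5, 0.0))            # no stalls: speedup equals the depth
print(round(pipeline_speedup(5, 0.9), 2))  # 0.9 stall cycles per instruction
```

With no stalls the speedup reaches the pipeline depth; any stall cycles per instruction pull it below that bound.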