lecture_5_6_7_instruction_sets.ppt

EEL 5708
High Performance Computer Architecture
Pipelining
Lotzi Bölöni
EEL5708
Acknowledgements
• All the lecture slides were adapted from the slides of David Patterson (1998, 2001) and David E. Culler (2001), Copyright 1998-2002, University of California Berkeley
EEL5708
Pipelining
EEL5708
Sequential Laundry
[Figure: sequential laundry timeline from 6 PM to midnight. Loads A-D are processed one after another, each taking 30 min wash, 40 min dry, and 20 min fold.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
EEL5708
Pipelined Laundry
Start work ASAP
[Figure: pipelined laundry timeline from 6 PM onward. Load A washes for 30 min, then the dryer runs back-to-back for loads A-D (40 min each), and load D is folded for 20 min at the end; loads A-D overlap in different stages.]
• Pipelined laundry takes 3.5 hours for 4 loads
EEL5708
Pipelining Lessons
[Figure: the pipelined laundry timeline again (6 PM to roughly 9:30 PM), with loads A-D overlapping in the washer, dryer, and folding stages.]
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
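To make these lessons concrete, here is a small sketch in C (mine, not part of the original slides) that computes the two laundry completion times, assuming the stage times from the figures (30 min wash, 40 min dry, 20 min fold) and four loads:

#include <stdio.h>

/* Sequential: each load runs start-to-finish before the next begins.
 * Pipelined: after the first load fills the pipe, one load completes per
 * slowest-stage time (the 40-minute dryer). The simple pipelined formula
 * below matches this example; in general, fill/drain time and stage
 * imbalance make the accounting slightly more involved. */
int main(void) {
    int stages[] = {30, 40, 20};          /* wash, dry, fold (minutes) */
    int n_stages = 3, n_loads = 4;

    int per_load = 0, slowest = 0;
    for (int i = 0; i < n_stages; i++) {
        per_load += stages[i];
        if (stages[i] > slowest) slowest = stages[i];
    }

    int sequential = n_loads * per_load;                 /* 4 * 90 = 360 min */
    int pipelined  = per_load + (n_loads - 1) * slowest; /* 90 + 3 * 40 = 210 min */

    printf("sequential: %d min (%.1f h)\n", sequential, sequential / 60.0);
    printf("pipelined:  %d min (%.1f h)\n", pipelined, pipelined / 60.0);
    return 0;
}

The output, 6.0 hours vs. 3.5 hours, matches the two timelines above.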
EEL5708
Fast, Pipelined Instruction Interpretation
[Figure: pipelined instruction interpretation. Stages: Next Instruction (NI, instruction address) → Instruction Fetch (IF, instruction register) → Decode & Operand Fetch (D, operand registers) → Execute (E, result registers) → Store Results (W, registers or memory). Over time, successive instructions occupy successive stages, so NI, IF, D, E, and W for different instructions proceed in parallel.]
EEL5708
Instruction Pipelining
• Execute billions of instructions, so throughput is what matters
– except when?
• What is desirable in instruction sets for pipelining?
– Variable-length instructions vs. all instructions the same length?
– Memory operands as part of any operation vs. memory operands only in loads or stores?
– Register operands in many places in the instruction format vs. registers always located in the same place?
EEL5708
Example: MIPS (Note register location)
Register-Register:
Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]
Register-Immediate:
Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
Branch:
Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
Jump / Call:
Op [31:26] | target [25:0]
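As a rough illustration of why fixed register-field positions help a pipelined decoder (this sketch is mine, not from the slides), the fields of a register-register instruction in the layout above can be pulled out with constant shifts and masks, so the register file can be read before the opcode is fully interpreted. The bit pattern below is just an arbitrary example value:

#include <stdint.h>
#include <stdio.h>

/* Field layout assumed from the slide: Op 31:26, Rs1 25:21, Rs2 20:16,
 * Rd 15:11, Opx 10:0. */
typedef struct {
    unsigned op, rs1, rs2, rd, opx;
} RRFields;

static RRFields decode_rr(uint32_t instr) {
    RRFields f;
    f.op  = (instr >> 26) & 0x3F;  /* 6 bits  */
    f.rs1 = (instr >> 21) & 0x1F;  /* 5 bits  */
    f.rs2 = (instr >> 16) & 0x1F;  /* 5 bits  */
    f.rd  = (instr >> 11) & 0x1F;  /* 5 bits  */
    f.opx = instr & 0x7FF;         /* 11 bits */
    return f;
}

int main(void) {
    uint32_t instr = 0x012A4020;   /* arbitrary example encoding */
    RRFields f = decode_rr(instr);
    printf("op=%u rs1=%u rs2=%u rd=%u opx=%u\n", f.op, f.rs1, f.rs2, f.rd, f.opx);
    return 0;
}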
EEL5708
5 Steps of MIPS Datapath
[Figure: the MIPS datapath divided into five steps. Instruction Fetch: PC, instruction memory, and an adder producing Next SEQ PC (PC + 4). Instr. Decode / Reg. Fetch: register file read of RS1 and RS2, sign-extension of the immediate. Execute / Addr. Calc: ALU with operand MUXes and a Zero? test. Memory Access: data memory. Write Back: a MUX selecting the WB data written to register RD; another MUX selects the Next PC.]
EEL5708
5 Steps of MIPS Datapath
[Figure: the same five-step datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back. The destination register number (RD) travels with the instruction through the pipeline registers until Write Back.]
• Data stationary control
– local decode for each instruction phase / pipeline stage
EEL5708
Visualizing Pipelining
Figure 3.3, Page 133 , CA:AQA 2e
[Figure: successive instructions shown across Cycles 1-7; each instruction flows through Ifetch, Reg, ALU, DMem, and Reg in consecutive clock cycles, so several instructions occupy different stages in the same cycle.]
EEL5708
It’s Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
– Control hazards: pipelining of branches & other instructions that change the PC
– Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline
EEL5708
Speed Up Equation for Pipelining
CPI_pipelined = Ideal CPI + Pipeline stall clock cycles per instruction

Speedup = (Ideal CPI × Pipeline depth) / (Ideal CPI + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)

Speedup = Pipeline depth / (1 + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)
EEL5708
Structural Hazard Example: Dual-port
vs. Single-port
• Machine A: dual-ported memory
• Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed

SpeedupA = Pipeline depth / (1 + 0) × (clock_unpipelined / clock_pipelined)
         = Pipeline depth
SpeedupB = Pipeline depth / (1 + 0.4 × 1) × (clock_unpipelined / (clock_unpipelined / 1.05))
         = (Pipeline depth / 1.4) × 1.05
         = 0.75 × Pipeline depth
SpeedupA / SpeedupB = Pipeline depth / (0.75 × Pipeline depth) = 1.33

• Machine A is 1.33 times faster
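A quick way to check this arithmetic (a sketch under the slide's assumptions, not part of the original deck) is to apply the speedup formula from the previous slide to both machines; the pipeline depth cancels out of the ratio:

#include <stdio.h>

/* Machine A: dual-ported memory, no structural stall, same clock.
 * Machine B: single-ported memory, 1-cycle stall on the 40% of
 * instructions that are loads, but a 1.05x faster clock. */
int main(void) {
    double depth = 5.0;  /* any depth gives the same A/B ratio */
    double speedup_a = depth / (1.0 + 0.0);
    double speedup_b = depth / (1.0 + 0.4 * 1.0) * 1.05;
    printf("SpeedupA = %.2f\n", speedup_a);
    printf("SpeedupB = %.2f\n", speedup_b);
    printf("A over B = %.2f\n", speedup_a / speedup_b);  /* ~1.33 */
    return 0;
}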
EEL5708
Three Generic Data Hazards
InstrI followed by InstrJ
• Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
EEL5708
Three Generic Data Hazards
InstrI followed by InstrJ
• Write After Read (WAR)
InstrJ tries to write operand before InstrI reads it
– Gets wrong operand
• Can’t happen in our 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
EEL5708
Three Generic Data Hazards
InstrI followed by InstrJ
• Write After Write (WAW)
InstrJ tries to write operand before InstrI writes it
– Leaves wrong result ( InstrI not InstrJ )
• Can’t happen in our 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
EEL5708
Software Scheduling to Avoid Load
Hazards
Try producing fast code for
  a = b + c;
  d = e – f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW   Rb,b
  LW   Rc,c
  ADD  Ra,Rb,Rc
  SW   a,Ra
  LW   Re,e
  LW   Rf,f
  SUB  Rd,Re,Rf
  SW   d,Rd

Fast code:
  LW   Rb,b
  LW   Rc,c
  LW   Re,e
  ADD  Ra,Rb,Rc
  LW   Rf,f
  SW   a,Ra
  SUB  Rd,Re,Rf
  SW   d,Rd

The fast version separates each load from the instruction that consumes its result, so the load-use hazards no longer force stalls.
EEL5708
Control Hazard on Branches
Three Stage Stall
EEL5708
Branch Stall Impact
• If CPI = 1 and 30% of instructions are branches, a 3-cycle stall gives a new CPI = 1 + 0.3 × 3 = 1.9!
• Two part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• Branch tests if register = 0 or <> 0
• Solution:
– Move Zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
EEL5708
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% branches taken on average
– But haven’t calculated branch target address
» still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
EEL5708
Four Branch Hazard Alternatives
#4: Delayed Branch
– Define branch to take place AFTER a following instruction:
  branch instruction
  sequential successor_1
  sequential successor_2
  ........
  sequential successor_n
  branch target if taken
  (the n sequential successors form the branch delay of length n)
– 1 slot delay allows proper decision and branch target address in a 5-stage pipeline
EEL5708
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful in
computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: with deeper (7-8 stage) pipelines and multiple instructions issued per clock (superscalar), a single delay slot is no longer enough
EEL5708
Evaluating Branch Alternatives
Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)

Scheduling scheme    Branch penalty   CPI    Speedup vs. unpipelined   Speedup vs. stall
Stall pipeline             3          1.42            3.5                    1.0
Predict taken              1          1.14            4.4                    1.26
Predict not taken          1          1.09            4.5                    1.29
Delayed branch             0.5        1.07            4.6                    1.31

Assumes branches (conditional & unconditional) are 14% of instructions and that 65% of them change the PC (are taken).
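The table can be reproduced, up to rounding, from the speedup formula. The sketch below is mine, assuming a 5-stage pipeline, a 14% branch frequency, and 65% of branches taken, as stated above:

#include <stdio.h>

/* CPI = 1 + branch_frequency * effective_penalty.
 * Speedup vs. unpipelined = depth / CPI; speedup vs. stall = CPI_stall / CPI.
 * "Predict not taken" pays its 1-cycle penalty only when the branch is
 * actually taken (65% of branches). */
int main(void) {
    const double depth = 5.0, branch_freq = 0.14, taken = 0.65;

    struct { const char *name; double penalty; } schemes[] = {
        {"Stall pipeline",    3.0},
        {"Predict taken",     1.0},
        {"Predict not taken", 1.0 * taken},
        {"Delayed branch",    0.5},
    };

    double cpi_stall = 1.0 + branch_freq * schemes[0].penalty;
    for (int i = 0; i < 4; i++) {
        double cpi = 1.0 + branch_freq * schemes[i].penalty;
        printf("%-18s CPI=%.2f  vs-unpipelined=%.1f  vs-stall=%.2f\n",
               schemes[i].name, cpi, depth / cpi, cpi_stall / cpi);
    }
    return 0;
}

Small rounding differences from the slide’s figures are expected.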
EEL5708
Pipelining Summary
• Just overlap tasks; it’s easy if the tasks are independent
• Speedup ≤ Pipeline depth; if ideal CPI is 1, then:

  Speedup = Pipeline depth / (1 + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)
• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW,WAR,WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction
EEL5708