inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 31 – Pipelined Execution, part II 2004-11-10 Lecturer PSOE Dan Garcia www.cs.berkeley.edu/~ddgarcia The Incredibles! Election Data is now available… Purple America! www.princeton.edu/~rvdb/JAVA/election2004/ www.usatoday.com/news/politicselections/vote2004/countymap.htm CS61C L31 Pipelined Execution,

inst.eecs.berkeley.edu/~cs61c
CS61C : Machine Structures
Lecture 31 –
Pipelined Execution, part II
2004-11-10
Lecturer PSOE Dan Garcia
www.cs.berkeley.edu/~ddgarcia
The Incredibles!
Election Data is
now available…
Purple America!
www.princeton.edu/~rvdb/JAVA/election2004/
www.usatoday.com/news/politicselections/vote2004/countymap.htm
CS61C L31 Pipelined Execution, part II (1)
Garcia, Fall 2004 © UCB
Review: Pipeline (1/2)
• Optimal Pipeline
• Each stage is executing part of an
instruction each clock cycle.
• One inst. finishes during each clock cycle.
• On average, execute far more quickly.
• What makes this work?
• Similarities between instructions allow us
to use same stages for all instructions
(generally).
• Each stage takes about the same amount
of time as all others: little wasted time.
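The review above can be made concrete with a small cycle count. This is only a sketch: it assumes an ideal pipeline with no hazards and one clock cycle per stage, and the function names are hypothetical rather than taken from the lecture.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Ideal pipeline: n_stages cycles to finish the first instruction,
    then one more instruction finishes in every following cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts (needs at least one instruction)."""
    if n_instructions < 1:
        raise ValueError("need at least one instruction")
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


if __name__ == "__main__":
    # 1000 instructions on a 5-stage pipeline: 5000 cycles versus 1004 cycles.
    print(unpipelined_cycles(1000, 5), pipelined_cycles(1000, 5))
    print(speedup(1000, 5))
```

For long instruction streams the speedup approaches the number of stages, which is why equal-length stages matter: the clock period is set by the slowest stage.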
Review: Pipeline (2/2)
• Pipelining is a BIG IDEA
• widely used concept
• What makes it less than perfect?
• Structural hazards: suppose we had
only one cache?
⇒ Need more HW resources
• Control hazards: need to worry about
branch instructions?
⇒ Delayed branch
• Data hazards: an instruction depends on
a previous instruction?
Control Hazard: Branching (1/7)
[Pipeline diagram: time (clock cycles) across, instruction order down; beq followed by Instr 1 through Instr 4, each passing through I$, Reg, ALU, D$, Reg]
Where do we do the compare for the branch?
Control Hazard: Branching (2/7)
• We put branch decision-making
hardware in ALU stage
• therefore two more instructions after the
branch will always be fetched, whether or
not the branch is taken
• Desired functionality of a branch
• if we do not take the branch, don’t waste
any time and continue executing
normally
• if we take the branch, don’t execute any
instructions after the branch, just go to
the desired label
Control Hazard: Branching (3/7)
• Initial Solution: Stall until decision is
made
• insert “no-op” instructions: those that
accomplish nothing, just take time
• Drawback: branches take 3 clock cycles
each (assuming comparator is put in ALU
stage)
Control Hazard: Branching (4/7)
• Optimization #1:
• move asynchronous comparator up to
Stage 2
• as soon as instruction is decoded
(Opcode identifies it as a branch),
immediately make a decision and set the
value of the PC (if necessary)
• Benefit: since branch is complete in
Stage 2, only one unnecessary
instruction is fetched, so only one no-op
is needed
• Side Note: This means that branches are
idle in Stages 3, 4 and 5.
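The cost figures quoted on these slides (3 cycles per branch with the comparator in the ALU stage, 2 cycles with it in Stage 2) follow from a simple count. The sketch below assumes the pipeline stalls with no-ops until the branch resolves and numbers the stages 1 = fetch, 2 = decode, 3 = ALU; the function names and the 20% branch fraction in the usage lines are hypothetical.

```python
def cycles_per_branch(resolve_stage: int) -> int:
    """Cycles consumed by one branch when fetch stalls until it resolves.

    The branch uses one issue slot, and each stage after fetch that it must
    pass through before the decision is known costs one no-op.
    """
    if resolve_stage < 1:
        raise ValueError("stages are numbered from 1")
    no_ops = resolve_stage - 1
    return 1 + no_ops


def average_cpi(branch_fraction: float, resolve_stage: int) -> float:
    """Average cycles per instruction if only branches ever stall."""
    no_ops = resolve_stage - 1
    return 1.0 + branch_fraction * no_ops


if __name__ == "__main__":
    print(cycles_per_branch(3))  # comparator in the ALU stage
    print(cycles_per_branch(2))  # comparator moved up to Stage 2
    # Hypothetical workload in which 20% of instructions are branches.
    print(average_cpi(0.2, 3), average_cpi(0.2, 2))
```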
Control Hazard: Branching (5/7)
• Insert a single no-op (bubble)
[Pipeline diagram: time (clock cycles) across, instruction order down; beq, add, lw, each passing through I$, Reg, ALU, D$, Reg, with a single bubble in the pipeline]
• Impact: 2 clock cycles per branch
instruction ⇒ slow
Control Hazard: Branching (6/7)
• Optimization #2: Redefine branches
• Old definition: if we take the branch,
none of the instructions after the branch
get executed by accident
• New definition: whether or not we take
the branch, the single instruction
immediately following the branch gets
executed (called the branch-delay slot)
• The term “Delayed Branch” means
we always execute inst after branch
Control Hazard: Branching (7/7)
• Notes on Branch-Delay Slot
• Worst-Case Scenario: can always put a
no-op in the branch-delay slot
• Better Case: can find an instruction
preceding the branch which can be
placed in the branch-delay slot without
affecting flow of the program
- re-ordering instructions is a common
method of speeding up programs
- compiler must be very smart in order to find
instructions to do this
- usually can find such an instruction at least
50% of the time
- Jumps also have a delay slot…
Example: Nondelayed vs. Delayed Branch
Nondelayed Branch           Delayed Branch
      or  $8, $9, $10             add $1, $2, $3
      add $1, $2, $3              sub $4, $5, $6
      sub $4, $5, $6              beq $1, $4, Exit
      beq $1, $4, Exit            or  $8, $9, $10
      xor $10, $1, $11            xor $10, $1, $11
Exit:                       Exit:
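The reordering in this example can be sketched as a small scheduling pass. This is a minimal illustration, not how any particular compiler does it: it assumes straight-line code ending in a branch, tracks register dependencies only (no memory operations), and the `Instr` type and `fill_delay_slot` function are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str               # register written, "" if none
    srcs: Tuple[str, ...]   # registers read


NOP = Instr("nop", "", ())


def fill_delay_slot(block: List[Instr]) -> List[Instr]:
    """Return the block with the branch-delay slot filled.

    `block` is straight-line code whose last instruction is the branch.
    An earlier instruction is moved into the slot only if hopping over the
    instructions behind it cannot change any register value they see.
    """
    *body, branch = block
    for i in range(len(body) - 1, -1, -1):
        cand = body[i]
        later = body[i + 1:] + [branch]  # everything the candidate would hop over
        reads_later = {r for ins in later for r in ins.srcs}
        writes_later = {ins.dest for ins in later if ins.dest}
        safe = (
            cand.dest not in reads_later             # nobody later needs its result
            and cand.dest not in writes_later        # no write-after-write reordering
            and not (set(cand.srcs) & writes_later)  # its inputs are not changed later
        )
        if safe:
            return body[:i] + body[i + 1:] + [branch, cand]
    return body + [branch, NOP]  # worst case: put a no-op in the slot


if __name__ == "__main__":
    block = [
        Instr("or", "$8", ("$9", "$10")),
        Instr("add", "$1", ("$2", "$3")),
        Instr("sub", "$4", ("$5", "$6")),
        Instr("beq", "", ("$1", "$4")),
    ]
    print([ins.op for ins in fill_delay_slot(block)])  # ['add', 'sub', 'beq', 'or']
```

Here `add` and `sub` cannot move because `beq` reads `$1` and `$4`, so the pass settles on `or`, matching the delayed-branch column of the example.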
Data Hazards (1/2)
• Consider the following sequence of
instructions
add $t0, $t1, $t2
sub $t4, $t0, $t3
and $t5, $t0, $t6
or  $t7, $t0, $t8
xor $t9, $t0, $t10
Data Hazards (2/2)
Dependencies backwards in time are hazards
[Pipeline diagram: time (clock cycles) across, instruction order down; stages IF, ID/RF, EX, MEM, WB drawn as I$, Reg, ALU, D$, Reg for each instruction]
add $t0,$t1,$t2
sub $t4,$t0,$t3
and $t5,$t0,$t6
or $t7,$t0,$t8
xor $t9,$t0,$t10
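Which of these dependencies actually point backwards in time can be checked by comparing stage cycles. The sketch below assumes no forwarding, and that the register file is written in the first half of a cycle and read in the second half, so a read in the same cycle as the write sees the new value. The names `raw_hazards` and `PROGRAM` are hypothetical.

```python
from typing import List, Tuple

# (text, destination register, source registers)
Instr = Tuple[str, str, Tuple[str, ...]]

ID_STAGE = 2  # registers are read in ID/RF
WB_STAGE = 5  # registers are written in WB

PROGRAM: List[Instr] = [
    ("add $t0,$t1,$t2", "$t0", ("$t1", "$t2")),
    ("sub $t4,$t0,$t3", "$t4", ("$t0", "$t3")),
    ("and $t5,$t0,$t6", "$t5", ("$t0", "$t6")),
    ("or $t7,$t0,$t8", "$t7", ("$t0", "$t8")),
    ("xor $t9,$t0,$t10", "$t9", ("$t0", "$t10")),
]


def raw_hazards(program: List[Instr]) -> List[Tuple[str, str, str]]:
    """List (producer, consumer, register) triples where the consumer would
    read the register before the producer has written it back."""
    hazards = []
    for p, (p_text, dest, _) in enumerate(program):
        # Instruction number p (0-based) is fetched in cycle p + 1,
        # so it occupies stage s in cycle p + s.
        write_cycle = p + WB_STAGE
        for c in range(p + 1, len(program)):
            c_text, c_dest, srcs = program[c]
            read_cycle = c + ID_STAGE
            if dest in srcs and read_cycle < write_cycle:
                hazards.append((p_text, c_text, dest))
            if c_dest == dest:
                break  # later readers depend on this newer value instead
    return hazards


if __name__ == "__main__":
    for producer, consumer, reg in raw_hazards(PROGRAM):
        print(f"{consumer} reads {reg} before '{producer}' writes it")
```

Under these assumptions only `sub` and `and` are flagged; `or` reads `$t0` in the same cycle that `add` writes it, and `xor` reads one cycle later.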
Data Hazard Solution: Forwarding
• Forward result from one stage to another
[Pipeline diagram: stages IF, ID/RF, EX, MEM, WB drawn as I$, Reg, ALU, D$, Reg for each instruction, with forwarding paths from add to the instructions that use $t0]
add $t0,$t1,$t2
sub $t4,$t0,$t3
and $t5,$t0,$t6
or $t7,$t0,$t8
xor $t9,$t0,$t10
“or” hazard solved by register hardware
Data Hazard: Loads (1/4)
• Dependencies backwards in time are
hazards
[Pipeline diagram: stages IF, ID/RF, EX, MEM, WB drawn as I$, Reg, ALU, D$, Reg; lw $t0,0($t1) followed by sub $t3,$t0,$t2]
• Can’t solve with forwarding
• Must stall instruction dependent on
load, then forward (more hardware)
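Why forwarding covers ALU results but not a load followed immediately by its use comes down to when the value first exists. A sketch under the assumption that a forwarded value is usable in the cycle after the stage that produces it (EX for ALU operations, MEM for loads) and is needed at the start of the consumer's EX stage; `stalls_with_forwarding` is a hypothetical helper.

```python
EX, MEM = 3, 4  # stage numbers: IF=1, ID/RF=2, EX=3, MEM=4, WB=5


def stalls_with_forwarding(producer_is_load: bool, distance: int) -> int:
    """Bubbles needed between a producer and a consumer `distance` slots later.

    The producer is taken to be fetched in cycle 1, so it is in stage s
    during cycle s; the consumer is fetched in cycle 1 + distance.
    """
    if distance < 1:
        raise ValueError("consumer must come after the producer")
    ready_stage = MEM if producer_is_load else EX
    usable_from = ready_stage + 1  # first cycle in which forwarding can deliver it
    needed_at = distance + EX      # cycle in which the consumer enters EX
    return max(0, usable_from - needed_at)


if __name__ == "__main__":
    print(stalls_with_forwarding(False, 1))  # add then sub: 0 bubbles
    print(stalls_with_forwarding(True, 1))   # lw then sub: 1 bubble
    print(stalls_with_forwarding(True, 2))   # one unrelated instruction in between: 0
```

The single bubble for a load followed directly by its consumer is the stall the interlock inserts; placing an unrelated instruction between the two removes it.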
Data Hazard: Loads (2/4)
• Hardware must stall pipeline
• Called “interlock”
[Pipeline diagram: stages IF, ID/RF, EX, MEM, WB drawn as I$, Reg, ALU, D$, Reg; a bubble appears in each of the three instructions after the load]
lw $t0, 0($t1)
sub $t3,$t0,$t2
and $t5,$t0,$t4
or $t7,$t0,$t6
Data Hazard: Loads (3/4)
• Instruction slot after a load is called
“load delay slot”
• If that instruction uses the result of the
load, then the hardware interlock will
stall it for one cycle.
• If the compiler puts an unrelated
instruction in that slot, then no stall
• Letting the hardware stall the instruction
in the delay slot is equivalent to putting
a nop in the slot (except the latter uses
more code space)
Data Hazard: Loads (4/4)
• Stall is equivalent to nop
[Pipeline diagram: each instruction passes through I$, Reg, ALU, D$, Reg; the nop occupies a full row of bubbles]
lw $t0, 0($t1)
nop
sub $t3,$t0,$t2
and $t5,$t0,$t4
or $t7,$t0,$t6
Historical Trivia
• First MIPS design did not interlock and
stall on load-use data hazard
• Real reason for name behind MIPS:
Microprocessor without
Interlocked
Pipeline
Stages
• Word Play on acronym for
Millions of Instructions Per Second,
also called MIPS
Administrivia
• No lab this week (wed, thu or fri)
• Due to Veterans Day holiday on Thursday.
• The lab is posted as a take-home lab;
show TA your results in the following lab.
• Grade freezing update : through HW4
• You have until next Wed to request
regrades on HW3,HW4 & P1
• Back to 61C…Advanced Pipelining!
• “Out-of-order” Execution
• “Superscalar” Execution
Review Pipeline Hazard: Stall is dependency
[Laundry diagram: tasks A through F in task order, time axis from 6 PM to 2 AM in 30-minute steps, with a bubble]
A depends on D; stall since folder tied up
Out-of-Order Laundry: Don’t Wait
[Laundry diagram: tasks A through F in task order, time axis from 6 PM to 2 AM in 30-minute steps, with a bubble]
A depends on D; rest continue; need
more resources to allow out-of-order
Superscalar Laundry: Parallel per stage
[Laundry diagram: tasks A through F in task order, time axis from 6 PM to 2 AM in 30-minute steps; loads labeled (light clothing), (dark clothing), (very dirty clothing), (light clothing), (dark clothing), (very dirty clothing)]
More resources, HW to match mix of
parallel tasks?
Superscalar Laundry: Mismatch Mix
[Laundry diagram: tasks A through D in task order, time axis from 6 PM to 2 AM in 30-minute steps; loads labeled (light clothing), (light clothing), (dark clothing), (light clothing)]
Task mix underutilizes extra resources
Peer Instruction
Assume 1 instr/clock, delayed branch, 5 stage
pipeline, forwarding, interlock on unresolved
load hazards (after 103 loops, so pipeline full)
Loop: lw    $t0, 0($s1)
      addu  $t0, $t0, $s2
      sw    $t0, 0($s1)
      addiu $s1, $s1, -4
      bne   $s1, $zero, Loop
      nop
• How many pipeline stages (clock cycles) per
loop iteration to execute this code?
Choices: 1  2  3  4  5  6  7  8  9  10
Peer Instruction Answer
• Assume 1 instr/clock, delayed branch, 5
stage pipeline, forwarding, interlock on
unresolved load hazards. 103 iterations,
so pipeline full.
Loop: 1. lw    $t0, 0($s1)
      2. (data hazard so stall)
      3. addu  $t0, $t0, $s2
      4. sw    $t0, 0($s1)
      5. addiu $s1, $s1, -4
      6. bne   $s1, $zero, Loop
      7. nop (delayed branch so exec. nop)
• How many pipeline stages (clock cycles)
per loop iteration to execute this code?
Choices: 1  2  3  4  5  6  7  8  9  10
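The numbered walk-through above can be reproduced with a short counter. This is a sketch under the slide's stated assumptions only: one instruction issued per clock, forwarding for everything except a load followed immediately by its use, and a delay slot that always executes. The `LOOP` encoding and `cycles_per_iteration` are hypothetical names.

```python
from typing import List, Tuple

# (opcode, destination register or "", source registers)
Instr = Tuple[str, str, Tuple[str, ...]]

LOOP: List[Instr] = [
    ("lw", "$t0", ("$s1",)),
    ("addu", "$t0", ("$t0", "$s2")),
    ("sw", "", ("$t0", "$s1")),
    ("addiu", "$s1", ("$s1",)),
    ("bne", "", ("$s1", "$zero")),
    ("nop", "", ()),  # branch-delay slot
]


def cycles_per_iteration(loop: List[Instr]) -> int:
    """Steady-state clock cycles per loop iteration.

    Every instruction, including the one in the delay slot, takes one issue
    slot; a load whose result is read by the very next instruction adds one
    interlock stall. No other hazard is modeled.
    """
    cycles = 0
    for i, (op, dest, _) in enumerate(loop):
        cycles += 1
        next_srcs = loop[(i + 1) % len(loop)][2]  # wraps to the next iteration
        if op == "lw" and dest in next_srcs:
            cycles += 1  # load-use interlock
    return cycles


if __name__ == "__main__":
    print(cycles_per_iteration(LOOP))  # 7
```

Six issue slots plus one load-use stall give the seven steps listed on the slide.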
“And in Conclusion..”
• Pipeline challenge is hazards
• Forwarding helps w/many data hazards
• Delayed branch helps with control hazard in
5 stage pipeline
• More aggressive performance:
• Superscalar
• Out-of-order execution