Chapter 6: Pipelining
Spring 2012
Ilam University
Forecast
Pipelining
Big Picture
Datapath
Control
Data Hazards
Stalls
Forwarding
Control Hazards
Exceptions
Motivation
Time/Program = Instructions/Program (code size) x Cycles/Instruction (CPI) x Time/Cycle (cycle time)
Single cycle implementation
CPI = 1
Cycle = imem + RFrd + ALU + dmem + RFwr + muxes + control
E.g. 500+250+500+500+250+0+0 = 2000ps
Time/program = P x 2ns
Multicycle
Multicycle implementation:
Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
i       F  D  X  M  W
i+1                    F  D  X
i+2                             F  D  X  M
i+3                                         F
i+4     (starts after cycle 13)
Multicycle
Multicycle implementation
CPI = 3, 4, 5
Cycle = max(memory, RF, ALU, mux, control)
=max(500,250,500) = 500ps
Time/prog = P x 4 x 500ps = P x 2000ps = P x 2ns (taking an average CPI of 4)
Would like:
CPI = 1 + overhead from hazards (later)
Cycle = 500ps + overhead
In practice, ~3x improvement
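The three design points above can be compared with a few lines of arithmetic. This is a minimal sketch: the function name `exec_time_ps` and the instruction count `P` are illustrative assumptions, and the pipelined case ignores hazard and latch overhead, so it gives the ideal bound rather than the ~3x seen in practice.

```python
def exec_time_ps(instr_count, cpi, cycle_ps):
    """Time/program = instructions x CPI x cycle time (in picoseconds)."""
    return instr_count * cpi * cycle_ps

P = 1_000_000  # hypothetical instruction count, chosen only for illustration

single_cycle = exec_time_ps(P, 1, 2000)  # CPI = 1, 2000 ps cycle
multicycle = exec_time_ps(P, 4, 500)     # average CPI of 4, 500 ps cycle
ideal_pipe = exec_time_ps(P, 1, 500)     # CPI = 1, 500 ps cycle, no overhead

print(single_cycle / multicycle)  # 1.0
print(single_cycle / ideal_pipe)  # 4.0
```

The multicycle design only breaks even with the single-cycle one here; the gain comes from combining the short cycle with a CPI near 1.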
Big Picture
Instruction latency = 5 cycles
Instruction throughput = 1/5 instr/cycle
CPI = 5 cycles per instruction
Instead
Pipelining: process instructions like a lunch buffet
ALL microprocessors use it
E.g. Pentium 4, Athlon, PowerPC G5
Big Picture
Instruction Latency = 5 cycles (same)
Instruction throughput = 1 instr/cycle
CPI = 1 cycle per instruction
CPI = cycles between instruction completions = 1
Laundry Example
Ideal Pipelining
Comb. logic of n gate delays between latches (L): BW = ~(1/n)
Two stages of n/2 gate delays each, separated by a latch: BW = ~(2/n)
Three stages of n/3 gate delays each, separated by latches: BW = ~(3/n)
Bandwidth increases linearly with pipeline depth
Latency increases by latch delays
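The trade-off between bandwidth and latency can be sketched numerically. The function below is an illustration, not part of the slides: it assumes the logic divides evenly across stages and that every stage pays the same latch delay, and the numbers 60 and 2 are arbitrary.

```python
def pipeline_metrics(n_gate_delays, stages, latch_delay):
    """Return (cycle_time, bandwidth, latency) for an evenly divided pipeline."""
    cycle_time = n_gate_delays / stages + latch_delay  # logic per stage + latch
    bandwidth = 1 / cycle_time                         # results per unit time
    latency = stages * cycle_time                      # time for one result
    return cycle_time, bandwidth, latency

for k in (1, 2, 3):
    cycle, bw, lat = pipeline_metrics(n_gate_delays=60, stages=k, latch_delay=2)
    print(k, cycle, round(bw, 4), lat)
```

With these assumed numbers the latency grows from 62 to 64 to 66 gate delays as stages are added, while the cycle time shrinks from 62 to 32 to 22.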
Ideal Pipelining
Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
i       F  D  X  M  W
i+1        F  D  X  M  W
i+2           F  D  X  M  W
i+3              F  D  X  M  W
i+4                 F  D  X  M  W
Pipelining Idealisms
Uniform subcomputations
Can pipeline into stages with equal delay
Identical computations
Can fill pipeline with identical work
Independent computations
No relationships between work units
Are these practical?
No, but can get close enough to get significant speedup
Complications
Datapath
Five (or more) instructions in flight
Control
Must correspond to multiple instructions
Instructions may have data and control flow dependences
I.e. units of work are not independent
One may have to stall and wait for another
Datapath (Fig. 6.10)
Datapath (Fig. 6.11)
Control
Control
Set by 5 different instructions
Divide and conquer: carry IR down the pipe
MIPS ISA requires sequential execution
True for most general purpose ISAs
Program Dependences
[Figure: instructions i1, i2, i3 in program order, each drawn as a chain of subcomputations]
A true dependence between two instructions may only involve one subcomputation of each instruction.
The implied sequential precedences are an overspecification. They are sufficient but not necessary to ensure program correctness.
Program Data Dependences
True dependence (RAW): D(i) ∩ R(j) ≠ ∅
j cannot execute until i produces its result
Anti-dependence (WAR): R(i) ∩ D(j) ≠ ∅
j cannot write its result until i has read its sources
Output dependence (WAW): D(i) ∩ D(j) ≠ ∅
j cannot write its result until i has written its result
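The three definitions can be checked mechanically with set intersections. The sketch below reads D(i) as the set of locations instruction i writes and R(i) as the set it reads, which is the interpretation consistent with the three cases; the dictionary layout and the function name are illustrative assumptions.

```python
def classify_dependences(i, j):
    """i precedes j in program order; each is a dict with 'reads' and 'writes' sets."""
    found = []
    if i["writes"] & j["reads"]:
        found.append("RAW")  # j reads something i writes
    if i["reads"] & j["writes"]:
        found.append("WAR")  # j overwrites something i still has to read
    if i["writes"] & j["writes"]:
        found.append("WAW")  # both write the same location
    return found

add = {"reads": {"$2", "$3"}, "writes": {"$1"}}  # add $1, $2, $3
sub = {"reads": {"$5", "$1"}, "writes": {"$4"}}  # sub $4, $5, $1
print(classify_dependences(add, sub))  # ['RAW']
```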
Control Dependences
Conditional branches
Branch must execute to determine which
instruction to fetch next
Instructions following a conditional branch are
control dependent on the branch instruction
Example (quicksort/MIPS)
# for (; (j < high) && (array[j] < array[low]) ; ++j );
# $10 = j
# $9 = high
# $6 = array
# $8 = low
        bge   done, $10, $9
        mul   $15, $10, 4
        addu  $24, $6, $15
        lw    $25, 0($24)
        mul   $13, $8, 4
        addu  $14, $6, $13
        lw    $15, 0($14)
        bge   done, $25, $15
cont:
        addu  $10, $10, 1
        ...
done:
        addu  $11, $11, -1
Resolution of Pipeline Hazards
Pipeline hazards
Potential violations of program dependences
Must ensure program dependences are not violated
Hazard resolution
Static: compiler/programmer guarantees correctness
Dynamic: hardware performs checks at runtime
Pipeline interlock
Hardware mechanism for dynamic hazard resolution
Must detect and enforce dependences at runtime
Pipeline Hazards
Necessary conditions:
WAR: write stage earlier than read stage
Is this possible in IF-RD-EX-MEM-WB?
WAW: write stage earlier than write stage
Is this possible in IF-RD-EX-MEM-WB?
RAW: read stage earlier than write stage
Is this possible in IF-RD-EX-MEM-WB?
If conditions not met, no need to resolve
Check for both register and memory
RAW Hazard
Earlier instruction produces a value used by a later instruction:
add $1, $2, $3
sub $4, $5, $1
Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
add     F  D  X  M  W
sub        F  D  X  M  W
RAW Hazard - Stall
Detect dependence and stall:
add $1, $2, $3
sub $4, $5, $1
[Timing diagram, cycles 1 to 13: add runs F D X M W in cycles 1 to 5; sub is held back by the stall and then runs F D X M W.]
Control Dependence
One instruction affects which executes next
sw $4, 0($5)
bne $2, $3, loop
sub $6, $7, $8
Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
sw      F  D  X  M  W
bne        F  D  X  M  W
sub           F  D  X  M  W
Control Dependence - Stall
Detect dependence and stall
sw $4, 0($5)
bne $2, $3, loop
sub $6, $7, $8
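[Timing diagram, cycles 1 to 13: sw runs F D X M W in cycles 1 to 5 and bne in cycles 2 to 6; sub is delayed by the stall and completes its X, M, and W stages later than in the unstalled case.]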
Pipelined Datapath
Start with single-cycle datapath (Fig. 6.10)
Pipelined execution
Assume each instruction has its own datapath
But each instruction uses a different part in every cycle
Multiplex all on to one datapath
Latches separate cycles (like multicycle)
Ignore hazards for now: data, control
Pipelined Datapath (Fig. 6.12)
[Figure: pipelined datapath. The pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separate instruction fetch (PC, instruction memory), register read (register file, sign extend), ALU execution (ALU, shift left 2, branch adder), data memory access, and write-back.]
Pipelined Datapath
Instruction flow
add and load
Write of registers
Pass register specifiers
Any info needed by a later stage gets passed down the pipeline
E.g. store value through EX
Pipelined Control
IF and ID: None
EX: ALUOp, ALUSrc, RegDst
MEM: Branch, MemRead, MemWrite
WB: MemtoReg, RegWrite
Figure 6.25
[Pipelined datapath with the control signals identified: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg. The ALU control block takes ALUOp and 6 bits of Instruction [15-0]; RegDst selects between Instruction [20-16] and Instruction [15-11].]
Figure 6.29
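[Control lines for the last three stages: the Control unit decodes the instruction from IF/ID and writes the EX, M, and WB signal groups into ID/EX; M and WB continue into EX/MEM; WB continues into MEM/WB.]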
Figure 6.30
[Pipelined datapath with the control signals connected to the EX, M, and WB portions of the pipeline registers ID/EX, EX/MEM, and MEM/WB. Signals shown: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg.]
Pipelined Control
Controlled by different instructions
Decode instructions and pass the signals down the pipe
Control sequencing is embedded in the pipeline
No explicit FSM
Instead, distributed FSM
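The idea of decoding once and letting each pipeline register carry only the signals still needed can be sketched as follows. The table values are the usual textbook MIPS control settings, included here as an assumption rather than taken from the slides; `None` marks a don't-care, and the function names are hypothetical.

```python
# Control values per instruction class, grouped by the stage that uses them.
# Assumed textbook MIPS settings; None marks a don't-care.
CONTROL = {
    "R-type": {"EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
               "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
               "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":     {"EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
               "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
               "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":     {"EX": {"RegDst": None, "ALUOp": 0b00, "ALUSrc": 1},
               "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
               "WB": {"RegWrite": 0, "MemtoReg": None}},
    "beq":    {"EX": {"RegDst": None, "ALUOp": 0b01, "ALUSrc": 0},
               "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
               "WB": {"RegWrite": 0, "MemtoReg": None}},
}

def decode(op):
    """ID stage: produce all control signals once, to be latched into ID/EX."""
    return CONTROL[op]

def to_ex_mem(id_ex):
    """EX signals are consumed in EX; only M and WB move on."""
    return {"M": id_ex["M"], "WB": id_ex["WB"]}

def to_mem_wb(ex_mem):
    """M signals are consumed in MEM; only WB moves on."""
    return {"WB": ex_mem["WB"]}

id_ex = decode("lw")
mem_wb = to_mem_wb(to_ex_mem(id_ex))
print(mem_wb)  # {'WB': {'RegWrite': 1, 'MemtoReg': 1}}
```

No state machine appears anywhere: the sequencing is just the signals moving one register further each cycle.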
Pipelining
Not too complex yet
Data hazards
Control hazards
Exceptions
RAW Hazards
Must first detect RAW hazards
Pipeline analysis proves that WAR/WAW don’t occur
ID/EX.WriteRegister = IF/ID.ReadRegister1
ID/EX.WriteRegister = IF/ID.ReadRegister2
EX/MEM.WriteRegister = IF/ID.ReadRegister1
EX/MEM.WriteRegister = IF/ID.ReadRegister2
MEM/WB.WriteRegister = IF/ID.ReadRegister1
MEM/WB.WriteRegister = IF/ID.ReadRegister2
RAW Hazards
Not every register match is a hazard, because:
WriteRegister not used (e.g. sw)
ReadRegister not used (e.g. addi, jump)
Do something only if necessary
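The comparisons listed above, filtered by whether each register field is actually used, can be sketched as a single function. The dictionary field names (`read1`, `read2`, `write_reg`, `reg_write`) are illustrative assumptions standing in for the pipeline register fields.

```python
def raw_hazard(if_id, older_stages):
    """Detect a RAW hazard for the instruction in IF/ID.

    if_id: dict with 'read1' and 'read2' (register number, or None if unused).
    older_stages: dicts for ID/EX, EX/MEM, MEM/WB, each with 'write_reg'
    and 'reg_write' (False when the instruction writes no register, e.g. sw).
    """
    sources = {r for r in (if_id["read1"], if_id["read2"]) if r is not None}
    for stage in older_stages:
        if stage["reg_write"] and stage["write_reg"] in sources:
            return True
    return False

# add $1, $2, $3 is in ID/EX while sub $4, $5, $1 is in IF/ID.
add_in_ex = {"write_reg": 1, "reg_write": True}
sub_in_id = {"read1": 5, "read2": 1}
print(raw_hazard(sub_in_id, [add_in_ex]))  # True
```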
RAW Hazards
Hazard Detection Unit
Several 5-bit (or 6-bit) comparators
Response? Stall pipeline
Instructions in IF and ID stay
IF/ID pipeline latch not updated
Send ‘nop’ down pipeline (called a bubble)
PCWrite, IF/IDWrite, and nop mux
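The stall response can be sketched as one clock edge of the pipeline front end. This is a simplified model with hypothetical names: `stall` stands for the hazard detection unit's output, holding `pc` and `if_id` models PCWrite = 0 and IF/IDWrite = 0, and returning `NOP` models the nop mux.

```python
NOP = {"op": "nop", "reg_write": False, "write_reg": None}

def step_front_end(pc, if_id, fetched, decoded, stall):
    """Return (next_pc, next_if_id, next_id_ex) after one clock edge.

    fetched: the instruction just read from instruction memory.
    decoded: the control/data bundle the ID stage produced this cycle.
    """
    if stall:
        # PC and IF/ID keep their values; a bubble goes down the pipe instead.
        return pc, if_id, NOP
    return pc + 4, fetched, decoded

print(step_front_end(100, "sub", "or", {"op": "sub"}, stall=True))
```

On a stall the same instruction is decoded again next cycle, and the comparison is repeated until the hazard clears.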
RAW Hazard Forwarding
A better response – forwarding
Also called bypassing
Comparators ensure register is read after it is written
Instead of stalling until write occurs, use mux to select forwarded value rather than register value
Control mux with hazard detection logic
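The mux selection can be sketched as a priority check: take the newest matching result, otherwise fall back to the register file. The conditions follow the common textbook formulation for MIPS, and the check against register 0 relies on $0 always reading as zero; the function and field names are assumptions for illustration.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where an ALU operand comes from for one source register.

    ex_mem, mem_wb: dicts with 'reg_write' (bool) and 'rd' (destination register).
    Returns 'EX/MEM', 'MEM/WB', or 'REGFILE'.
    """
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"   # result of the instruction one ahead (newest value)
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"   # result of the instruction two ahead
    return "REGFILE"

# sub $2, $1, $3 followed by and $12, $2, $5: and is in EX, sub is in EX/MEM.
sub = {"reg_write": True, "rd": 2}
idle = {"reg_write": False, "rd": 0}
print(forward_select(2, ex_mem=sub, mem_wb=idle))  # EX/MEM
```

Checking EX/MEM first matters when two older instructions write the same register: the more recent result must win.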
Forwarding
Use temporary results, don’t wait for them to be written
register file forwarding to handle read/write to same register
ALU forwarding
Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:    10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM:         X     X     X     -20   X       X     X     X     X
Value of MEM/WB:         X     X     X     X     -20     X     X     X     X
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
(Slide annotation: what if this $2 was $13?)
[Each instruction passes through IM, Reg, ALU, DM, Reg in successive clock cycles, starting one cycle after the previous instruction.]
Forwarding
[Figure: pipelined datapath with a forwarding unit. The unit compares the source register numbers carried down from IF/ID (IF/ID.RegisterRs, IF/ID.RegisterRt) with EX/MEM.RegisterRd and MEM/WB.RegisterRd, and controls the muxes at the ALU inputs. IF/ID.RegisterRt and IF/ID.RegisterRd also feed the destination register mux.]
Forwarding Paths
(ALU instructions)
[Figure: pipeline stages IF, ID, RD, ALU, MEM, WB with forwarding paths labeled a, b, c; instruction i writes R1, and i+1, i+2, i+3 read R1]
(i -> i+1): forwarding via path a
(i -> i+2): forwarding via path b
(i -> i+3): i writes R1 before i+3 reads R1
Write before Read RF
Register file design
2-phase clocks common
Write RF on first phase
Read RF on second phase
Hence, same cycle:
Write $1
Read $1
No bypass needed
If read before write or DFF-based, need bypass
Can't always forward
Load word can still cause a hazard: an instruction tries to read a register immediately after a load instruction that writes the same register.
Thus, we need a hazard detection unit to “stall” the instruction that follows the load.
Time (in clock cycles): CC 1 to CC 9
Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Each instruction passes through IM, Reg, ALU, DM, Reg; lw starts in CC 1 and each later instruction starts one cycle after the one before it.]
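The load-use case that forwarding cannot cover can be detected with one condition: the instruction in EX is a load, and its destination matches a source of the instruction in ID. The sketch below uses the common textbook form of that test; the parameter names are assumptions modeled on the pipeline register fields.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID must wait one cycle for a load in EX.

    id_ex_mem_read: True if the instruction in EX is a load.
    id_ex_rt: destination register of that load.
    if_id_rs, if_id_rt: source registers of the instruction in ID.
    """
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))

# lw $2, 20($1) is in EX while and $4, $2, $5 is in ID.
print(load_use_stall(True, 2, 2, 5))  # True
```

After the one-cycle stall the loaded value is available in MEM/WB, so ordinary forwarding takes over.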
Stalling
We can stall the pipeline by keeping an instruction in the same stage
Time (in clock cycles): CC 1 to CC 10
Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Timing diagram: a bubble is inserted; and repeats its Reg stage and or repeats its IM stage for one cycle, so the sequence ends in CC 10.]
Forwarding Paths
(Load instructions)
[Figure: pipeline stages IF, ID, RD, ALU, MEM, WB with load forwarding paths labeled d and e; instruction i loads R1 from MEM[], and i+1, i+2 read R1]
(i -> i+1): stall i+1
(i -> i+1): forwarding via path d
(i -> i+2): i writes R1 before i+2 reads R1
Control Flow Hazards
Control flow instructions
branches, jumps, jals, returns
Can’t fetch until branch outcome known
Too late for next IF
Control Flow Hazards
What to do?
Always stall
Easy to implement
Performs poorly
1/6th of instructions are branches, each branch stalls for 3 cycles
CPI = 1 + 3 x 1/6 = 1.5 (lower bound)
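The CPI estimate generalizes to other branch frequencies and penalties. The helper below is a sketch with assumed names; the `penalized_fraction` parameter is an added illustration for schemes where only some branches pay the penalty, and the 1/2 used in the second call is a made-up value, not a measurement.

```python
from fractions import Fraction

def cpi_with_branch_stalls(branch_fraction, stall_cycles,
                           penalized_fraction=1, base_cpi=1):
    """CPI = base + (fraction of branches) x (fraction penalized) x (stall cycles)."""
    return base_cpi + branch_fraction * penalized_fraction * stall_cycles

# Always stall: every branch pays 3 cycles.
print(float(cpi_with_branch_stalls(Fraction(1, 6), 3)))  # 1.5

# Hypothetical: only half of the branches pay the penalty.
print(float(cpi_with_branch_stalls(Fraction(1, 6), 3, Fraction(1, 2))))  # 1.25
```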
Control Flow Hazards
Predict branch not taken
Send sequential instructions down pipeline
Kill instructions later if incorrect
Must stop memory accesses and RF writes
Late flush of instructions on misprediction
Complex
Global signal (wire delay)
Branch Hazards
When we decide to branch, other instructions are in the pipeline!
Time (in clock cycles): CC 1 to CC 9
Program execution order (in instructions):
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
[Each instruction passes through IM, Reg, ALU, DM, Reg, starting one cycle after the one before it.]
Control Flow Hazards
Even better but more complex
Predict taken
Predict both (eager execution)
Predict one or the other dynamically
Adapt to program branch patterns
Lots of chip real estate these days
Pentium III, 4, Alpha 21264
Current research topic
Dynamic branch prediction
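One common way to predict dynamically is a 2-bit saturating counter per branch. The slides do not say which scheme they have in mind, so the sketch below is an illustrative choice; the class name, the starting state, and the loop-style test pattern are all assumptions.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self):
        self.state = 1  # assumed starting point: weakly not taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def accuracy(outcomes):
    """Fraction of branch outcomes predicted correctly, updating after each one."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        correct += (predictor.predict() == taken)
        predictor.update(taken)
    return correct / len(outcomes)

# A loop branch taken nine times, then falling through once.
print(accuracy([True] * 9 + [False]))  # 0.8
```

The two misses are the first iteration, before the counter has adapted, and the loop exit.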
Improving Performance
Try and avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Add a “branch delay slot”
Always execute following instruction
“delay slot” (later example on MIPS pipeline)
Put useful instruction there, otherwise ‘nop’
rely on compiler to “fill” the slot with something useful
Superscalar: start more than one instruction in the same cycle
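The benefit of reordering the lw/sw sequence on this slide can be checked by counting load-use stalls before and after. The model is deliberately simple and rests on an assumption: an instruction stalls one cycle if it reads the register loaded by the lw immediately before it, with store data treated like any other source. The tuple layout and function name are hypothetical.

```python
def count_load_use_stalls(program):
    """program: list of (op, dest, sources) tuples in program order.

    Counts one stall whenever an instruction reads the register
    loaded by the immediately preceding lw.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

original = [
    ("lw", "$t0", ["$t1"]),         # lw $t0, 0($t1)
    ("lw", "$t2", ["$t1"]),         # lw $t2, 4($t1)
    ("sw", None, ["$t2", "$t1"]),   # sw $t2, 0($t1)
    ("sw", None, ["$t0", "$t1"]),   # sw $t0, 4($t1)
]
# Swap the two stores; they write different addresses, so the result is unchanged.
reordered = [original[0], original[1], original[3], original[2]]

print(count_load_use_stalls(original), count_load_use_stalls(reordered))  # 1 0
```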
Dynamic Scheduling
The hardware performs the “scheduling”
hardware tries to find instructions to execute
out of order execution is possible
speculative execution and dynamic branch prediction
All modern processors are very complicated
DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
PowerPC and Pentium: branch history table
Compiler technology important
Exceptions
Even worse: in one cycle
I/O interrupt
User trap to OS (EX)
Illegal instruction (ID)
Arithmetic overflow
Hardware error
Etc.
Interrupt priorities must be supported