Chapter 6: Pipelining

Spring 2012
Ilam University
Forecast

Pipelining




Big Picture
Datapath
Control
Data Hazards




Stalls
Forwarding
Control Hazards
Exceptions
Motivation

Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
where Instructions/Program is the code size, Cycles/Instruction is the CPI, and Time/Cycle is the cycle time.
Single cycle implementation

CPI = 1
Cycle = imem + RFrd + ALU + dmem + RFwr + muxes + control
E.g. 500+250+500+500+250+0+0 = 2000ps
Time/program = P x 2ns (P = number of instructions executed)
Multicycle

Multicycle implementation:

Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
i       F  D  X  M  W
i+1                    F  D  X
i+2                             F  D  X  M
i+3                                         F
i+4
Multicycle

Multicycle implementation
CPI = 3, 4, 5
Cycle = max(memory, RF, ALU, mux, control)
      = max(500,250,500) = 500ps
Time/prog = P x 4 x 500 = P x 2000ps = P x 2ns (taking an average CPI of 4)
Would like:
CPI = 1 + overhead from hazards (later)
Cycle = 500ps + overhead
In practice, ~3x improvement
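
The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch using the slide's example numbers; the function name and the instruction count P are illustrative assumptions, and the pipelined figure ignores the hazard and latch overhead mentioned above.

```python
# Time/Program = Instructions/Program x CPI x Time/Cycle, in picoseconds.
# time_per_program_ps and P are hypothetical names, not from the slides.

def time_per_program_ps(instructions, cpi, cycle_ps):
    return instructions * cpi * cycle_ps

P = 1_000_000  # assumed number of executed instructions

single_cycle = time_per_program_ps(P, cpi=1, cycle_ps=2000)
multicycle = time_per_program_ps(P, cpi=4, cycle_ps=500)   # average CPI of 4
# Ideal pipeline: CPI = 1 with the 500 ps cycle, no overhead included.
ideal_pipeline = time_per_program_ps(P, cpi=1, cycle_ps=500)

speedup_bound = single_cycle / ideal_pipeline
print(single_cycle == multicycle, speedup_bound)
```

Under these assumptions the single-cycle and multicycle designs take the same time, and the ideal pipeline is 4x faster; the overheads are what bring this down to roughly 3x in practice.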
Big Picture




Instruction latency = 5 cycles
Instruction throughput = 1/5 instr/cycle
CPI = 5 cycles per instruction
Instead


Pipelining: process instructions like a lunch buffet
Virtually all modern microprocessors use it

E.g. Pentium 4, Athlon, PowerPC G5
Big Picture




Instruction Latency = 5 cycles (same)
Instruction throughput = 1 instr/cycle
CPI = 1 cycle per instruction
CPI = cycles between instruction completions = 1
Laundry Example
Ideal Pipelining

Unpipelined: combinational logic with n gate delays, followed by a latch (L). BW = ~(1/n)
Two stages: n/2 gate delays per stage, each stage followed by a latch (L). BW = ~(2/n)
Three stages: n/3 gate delays per stage, each stage followed by a latch (L). BW = ~(3/n)

Bandwidth increases linearly with pipeline depth
Latency increases by latch delays
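
The two closing observations can be made concrete with a small model. This is a sketch under the assumption that the n gate delays divide evenly across the stages and that every stage adds one fixed latch delay; the function name and the sample numbers are illustrative only.

```python
def pipeline_metrics(n_gate_delays, stages, latch_delay=0.0):
    """Return (bandwidth, latency) for logic with n gate delays split
    evenly into `stages` stages, each followed by a latch."""
    cycle = n_gate_delays / stages + latch_delay
    bandwidth = 1.0 / cycle       # results per unit of gate delay
    latency = stages * cycle      # time for one item to pass through
    return bandwidth, latency

for k in (1, 2, 3):
    bw, lat = pipeline_metrics(n_gate_delays=12, stages=k, latch_delay=1)
    print(k, round(bw, 3), lat)
```

With a latch delay of zero the bandwidth is exactly k/n; with a nonzero latch delay the bandwidth grows slightly less than linearly and the latency grows by one latch delay per stage.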
Ideal Pipelining

Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
i       F  D  X  M  W
i+1        F  D  X  M  W
i+2           F  D  X  M  W
i+3              F  D  X  M  W
i+4                 F  D  X  M  W
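
A diagram like the one above can be generated mechanically. The following is a minimal sketch for the hazard-free case only; `ideal_schedule` and `render` are hypothetical helper names, and the stage letters follow the F, D, X, M, W convention of the slides.

```python
STAGES = ["F", "D", "X", "M", "W"]

def ideal_schedule(num_instructions, stages=STAGES):
    """Map each instruction index to {cycle: stage}; cycles start at 1.
    Instruction k enters the first stage in cycle k + 1 (no stalls)."""
    return {
        i: {i + 1 + s: name for s, name in enumerate(stages)}
        for i in range(num_instructions)
    }

def render(schedule):
    """Return the schedule as a text table, one row per instruction."""
    last = max(max(row) for row in schedule.values())
    lines = ["Cycle: " + " ".join(f"{c:>2}" for c in range(1, last + 1))]
    for i, row in schedule.items():
        label = "i" if i == 0 else f"i+{i}"
        cells = " ".join(f"{row.get(c, ''):>2}" for c in range(1, last + 1))
        lines.append(f"{label:<6} {cells}")
    return "\n".join(lines)

sched = ideal_schedule(5)
print(render(sched))
```

Five instructions finish in 9 cycles here (N + 4 in general), compared with 5N cycles when each instruction runs to completion before the next one starts.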
Pipelining Idealisms

Uniform subcomputations
  Can pipeline into stages with equal delay
Identical computations
  Can fill pipeline with identical work
Independent computations
  No relationships between work units
Are these practical?
  No, but can get close enough to get significant speedup
Complications

Datapath
  Five (or more) instructions in flight
Control
  Must correspond to multiple instructions
Instructions may have
  data and control flow dependences
  I.e. units of work are not independent
  One may have to stall and wait for another
Datapath (Fig. 6.10)
Datapath (Fig. 6.11)
Control

Control



Set by 5 different instructions
Divide and conquer: carry IR down the pipe
MIPS ISA requires sequential execution

True for most general purpose ISAs
Program Dependences

[Figure: instructions i1, i2, i3 in program order, each drawn as a sequence of subcomputations, with the dependences between them.]

A true dependence between two instructions may only involve one subcomputation of each instruction.
The implied sequential precedences are an overspecification: they are sufficient but not necessary to ensure program correctness.
Program Data Dependences

For an instruction i followed by an instruction j, with D(x) read as the set of locations x writes (its destinations) and R(x) as the set of locations x reads:

True dependence (RAW)
  j cannot execute until i produces its result
  D(i) ∩ R(j) ≠ ∅
Anti-dependence (WAR)
  j cannot write its result until i has read its sources
  R(i) ∩ D(j) ≠ ∅
Output dependence (WAW)
  j cannot write its result until i has written its result
  D(i) ∩ D(j) ≠ ∅
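
The three set conditions translate directly into code. This is a small sketch, assuming each instruction is described by the set of registers it reads and the set it writes; the dictionary layout and the function name are illustrative, not part of the slides.

```python
def dependences(i, j):
    """Classify the data dependences of a later instruction j on an earlier
    instruction i. Each argument is a dict with 'reads' and 'writes' sets."""
    found = set()
    if i["writes"] & j["reads"]:
        found.add("RAW")   # j reads what i writes
    if i["reads"] & j["writes"]:
        found.add("WAR")   # j overwrites what i still has to read
    if i["writes"] & j["writes"]:
        found.add("WAW")   # both write the same location
    return found

add = {"reads": {"$2", "$3"}, "writes": {"$1"}}   # add $1, $2, $3
sub = {"reads": {"$5", "$1"}, "writes": {"$4"}}   # sub $4, $5, $1

print(dependences(add, sub))
```

For this pair the only dependence is RAW on $1. Swapping the program order of the two instructions would turn it into a WAR dependence.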
Control Dependences

Conditional branches


Branch must execute to determine which
instruction to fetch next
Instructions following a conditional branch are
control dependent on the branch instruction
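Example (quicksort/MIPS)

# for (; (j < high) && (array[j] < array[low]) ; ++j );
# $10 = j
# $9 = high
# $6 = array
# $8 = low
        bge   $10, $9, done      # leave the loop if j >= high
        mul   $15, $10, 4        # byte offset of array[j]
        addu  $24, $6, $15       # address of array[j]
        lw    $25, 0($24)        # $25 = array[j]
        mul   $13, $8, 4         # byte offset of array[low]
        addu  $14, $6, $13       # address of array[low]
        lw    $15, 0($14)        # $15 = array[low]
        bge   $25, $15, done     # leave the loop if array[j] >= array[low]
cont:
        addu  $10, $10, 1        # ++j
        ...
done:
        addu  $11, $11, -1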
Resolution of Pipeline Hazards

Pipeline hazards
  Potential violations of program dependences
  Must ensure program dependences are not violated
Hazard resolution
  Static: compiler/programmer guarantees correctness
  Dynamic: hardware performs checks at runtime
Pipeline interlock
  Hardware mechanism for dynamic hazard resolution
  Must detect and enforce dependences at runtime
Pipeline Hazards

Necessary conditions:
  WAR: write stage earlier than read stage
    Is this possible in IF-RD-EX-MEM-WB?
  WAW: write stage earlier than write stage
    Is this possible in IF-RD-EX-MEM-WB?
  RAW: read stage earlier than write stage
    Is this possible in IF-RD-EX-MEM-WB?
If conditions not met, no need to resolve
Check for both register and memory
RAW Hazard

Earlier instruction produces a value used by a later instruction:
  add $1, $2, $3
  sub $4, $5, $1

Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
add     F  D  X  M  W
sub        F  D  X  M  W
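RAW Hazard - Stall

Detect dependence and stall:
  add $1, $2, $3
  sub $4, $5, $1

[Pipeline diagram, cycles 1-13: add occupies F D X M W in cycles 1-5; the F D X M W stages of sub are shifted to later cycles so that it reads $1 only after add has produced it.]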
Control Dependence

One instruction affects which executes next
  sw $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8

Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
sw      F  D  X  M  W
bne        F  D  X  M  W
sub           F  D  X  M  W
Control Dependence - Stall

Detect dependence and stall
  sw $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8

[Pipeline diagram, cycles 1-13: sw occupies F D X M W in cycles 1-5 and bne in cycles 2-6; sub is delayed until the outcome of bne is known.]
Pipelined Datapath

Start with single-cycle datapath (Fig. 6.10)
Pipelined execution
  Assume each instruction has its own datapath
  But each instruction uses a different part in every cycle
  Multiplex all on to one datapath
  Latches separate cycles (like multicycle)
Ignore hazards for now
  Data
  Control
Pipelined Datapath (Fig. 6.12)

[Figure: the single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory) divided into stages by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]
Pipelined Datapath

Instruction flow
  add and load
  Write of registers
  Pass register specifiers
Any info needed by a later stage gets passed down the pipeline
  E.g. store value through EX
Pipelined Control

IF and ID
  None
EX
  ALUOp, ALUSrc, RegDst
MEM
  Branch, MemRead, MemWrite
WB
  MemtoReg, RegWrite
Figure 6.25

[Figure: pipelined datapath with the control signals identified: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg.]
Figure 6.29

[Figure: Control decodes the instruction held in IF/ID. The EX, M, and WB signal groups enter ID/EX; M and WB continue into EX/MEM; WB continues into MEM/WB.]
Figure 6.30

[Figure: the pipelined datapath of Figure 6.25 with the control signals driven from the EX, M, and WB fields of the pipeline registers ID/EX, EX/MEM, and MEM/WB.]
Pipelined Control

Controlled by different instructions
Decode instructions and pass the signals down the pipe
Control sequencing is embedded in the pipeline
  No explicit FSM
  Instead, distributed FSM
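
The idea of decoding once and passing the signals down the pipe can be sketched as follows. The signal values are the usual textbook settings for this small MIPS subset and should be treated as an assumption here; `CONTROL`, `decode`, and `pass_down` are hypothetical names, and None marks a don't-care.

```python
# Control values grouped by the stage that consumes them (EX, M, WB).
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": None, "ALUOp": 0b00, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
    "beq":      {"EX": {"RegDst": None, "ALUOp": 0b01, "ALUSrc": 0},
                 "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
}

def decode(op):
    """ID stage: produce all three groups, to be latched into ID/EX."""
    return dict(CONTROL[op])

def pass_down(groups, consumed):
    """Drop the group used by the current stage; latch the rest onward."""
    return {name: bits for name, bits in groups.items() if name != consumed}

id_ex = decode("lw")
ex_mem = pass_down(id_ex, "EX")    # EX bits used up in the EX stage
mem_wb = pass_down(ex_mem, "M")    # M bits used up in the MEM stage
print(sorted(id_ex), sorted(ex_mem), sorted(mem_wb))
```

Because each pipeline register carries the bits of the instruction it currently holds, the five stages are controlled by up to five different instructions at once without a central state machine.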
Pipelining

Not too complex yet



Data hazards
Control hazards
Exceptions
RAW Hazards

Must first detect RAW hazards

Pipeline analysis proves that WAR/WAW don’t occur
ID/EX.WriteRegister = IF/ID.ReadRegister1
ID/EX.WriteRegister = IF/ID.ReadRegister2
EX/MEM.WriteRegister = IF/ID.ReadRegister1
EX/MEM.WriteRegister = IF/ID.ReadRegister2
MEM/WB.WriteRegister = IF/ID.ReadRegister1
MEM/WB.WriteRegister = IF/ID.ReadRegister2
RAW Hazards

Not all matches are hazards because
  WriteRegister not used (e.g. sw)
  A ReadRegister not used (e.g. ReadRegister2 by addi, both by jump)
Do something only if necessary
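
The comparisons and the two exceptions can be combined into one check. This is a minimal sketch: the `Latch` record is a hypothetical view of what a pipeline register holds, and the test for register 0 is an added assumption based on MIPS hard-wiring $0 to zero, not something the slides state.

```python
from collections import namedtuple

# write_reg is None when the instruction writes no register (e.g. sw);
# read_regs lists only the source registers the instruction really uses.
Latch = namedtuple("Latch", ["write_reg", "read_regs"])

def raw_hazard(if_id, older_latches):
    """True if the instruction in IF/ID reads a register that an older
    in-flight instruction (ID/EX, EX/MEM, MEM/WB) has yet to write."""
    for older in older_latches:
        if older is None or older.write_reg is None:
            continue                      # WriteRegister not used
        if older.write_reg == 0:
            continue                      # assumed: $0 is never a hazard
        if older.write_reg in if_id.read_regs:
            return True
    return False

add = Latch(write_reg=1, read_regs=(2, 3))      # add $1, $2, $3
sub = Latch(write_reg=4, read_regs=(5, 1))      # sub $4, $5, $1
sw = Latch(write_reg=None, read_regs=(4, 5))    # sw $4, 0($5)

print(raw_hazard(sub, [add, None, None]), raw_hazard(sub, [sw, None, None]))
```

In hardware each `in` test corresponds to a pair of register-number comparators, which is why the detection unit is built from a handful of narrow comparators.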
RAW Hazards

Hazard Detection Unit


Several 5-bit (or 6-bit) comparators
Response? Stall pipeline




Instructions in IF and ID stay
IF/ID pipeline latch not updated
Send ‘nop’ down pipeline (called a bubble)
PCWrite, IF/IDWrite, and nop mux
RAW Hazard Forwarding

A better response – forwarding



Also called bypassing
Comparators ensure register is read after it is
written
Instead of stalling until write occurs


Use mux to select forwarded value rather than
register value
Control mux with hazard detection logic
Forwarding

Use temporary results, don’t wait for them to be written
  register file forwarding to handle read/write to same register
  ALU forwarding

Time (in clock cycles):  CC 1    CC 2    CC 3    CC 4    CC 5    CC 6    CC 7    CC 8    CC 9
Value of register $2:    10      10      10      10      10/-20  -20     -20     -20     -20
Value of EX/MEM:         X       X       X       -20     X       X       X       X       X
Value of MEM/WB:         X       X       X       X       -20     X       X       X       X

Program execution order (in instructions):
  sub $2, $1, $3
  and $12, $2, $5
  or $13, $6, $2
  add $14, $2, $2
  sw $15, 100($2)
(Annotation on one of the $2 operands in the figure: what if this $2 was $13?)
Forwarding

[Figure: pipelined datapath with a forwarding unit. The register numbers Rs, Rt, and Rd are carried down from IF/ID; the forwarding unit compares the source registers with EX/MEM.RegisterRd and MEM/WB.RegisterRd and controls the muxes in front of the ALU.]
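
The selection made by the forwarding unit for one ALU input can be sketched as below. This follows the common textbook conditions and is an assumption rather than a transcription of the figure; the function name, the tuple layout, and the returned labels are illustrative.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose the source for one ALU operand.
    ex_mem and mem_wb are (reg_write, rd) pairs taken from those latches.
    Returns 'EX/MEM', 'MEM/WB', or 'REG' (use the register file value)."""
    ex_write, ex_rd = ex_mem
    mem_write, mem_rd = mem_wb
    # The most recent producer wins, so EX/MEM is checked first.
    if ex_write and ex_rd != 0 and ex_rd == src_reg:
        return "EX/MEM"
    if mem_write and mem_rd != 0 and mem_rd == src_reg:
        return "MEM/WB"
    return "REG"

# sub $2, $1, $3 followed by and $12, $2, $5 and or $13, $6, $2:
and_rs = forward_select(2, ex_mem=(True, 2), mem_wb=(False, 0))
or_rt = forward_select(2, ex_mem=(True, 12), mem_wb=(True, 2))
print(and_rs, or_rt)
```

When `and` is in its ALU stage the result of `sub` sits in EX/MEM; one cycle later, when `or` is in its ALU stage, the same result has moved to MEM/WB.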
Forwarding Paths (ALU instructions)

[Figure: stages IF, ID, RD, ALU, MEM, WB with forwarding paths a, b, and c into the ALU stage, for an instruction i that writes R1 and later instructions that read R1.]

(i → i+1): forwarding via path a
(i → i+2): forwarding via path b
(i → i+3): i writes R1 before i+3 reads R1
Write before Read RF

Register file design
  2-phase clocks common
  Write RF on first phase
  Read RF on second phase
Hence, same cycle:
  Write $1
  Read $1
  No bypass needed
If read before write or DFF-based, need bypass
Can't always forward

Load word can still cause a hazard:
  an instruction tries to read a register following a load instruction that writes to the same register.
Thus, we need a hazard detection unit to “stall” the instruction that follows the load

Program execution order (in instructions), shown over clock cycles CC 1 to CC 9:
  lw $2, 20($1)
  and $4, $2, $5
  or $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
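
The load-use case needs only one check in the decode stage. The sketch below uses the common textbook condition; the function and parameter names are illustrative assumptions modelled on the pipeline register names used in these slides.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True if the instruction in IF/ID must wait one cycle because the
    instruction ahead of it (in ID/EX) is a load into one of its sources."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))

# lw $2, 20($1) is in ID/EX; and $4, $2, $5 is in IF/ID.
stall_and = load_use_stall(True, 2, if_id_rs=2, if_id_rt=5)
# The same register match behind an ALU instruction needs no stall.
stall_alu = load_use_stall(False, 2, if_id_rs=2, if_id_rt=5)
print(stall_and, stall_alu)
```

After the one-cycle stall, the loaded value can be forwarded to the dependent instruction, so no further delay is needed.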
Stalling

We can stall the pipeline by keeping an instruction in the same stage

Program execution order (in instructions), shown over clock cycles CC 1 to CC 10:
  lw $2, 20($1)
  and $4, $2, $5    (repeats its Reg stage)
  or $8, $2, $6     (repeats its IM stage)
  add $9, $4, $2
  slt $1, $6, $7
A bubble travels down the pipeline in the slot opened by the stall.
Forwarding Paths (Load instructions)

[Figure: stages IF, ID, RD, ALU, MEM, WB with load forwarding paths d and e, for a load i that writes R1 from MEM[] and later instructions that read R1.]

(i → i+1): stall i+1
(i → i+1): forwarding via path d
(i → i+2): i writes R1 before i+2 reads R1
Control Flow Hazards

Control flow instructions



branches, jumps, jals, returns
Can’t fetch until branch outcome known
Too late for next IF
Control Flow Hazards

What to do?
Always stall
  Easy to implement
  Performs poorly
  1/6th of instructions are branches, each branch takes 3 cycles
  CPI = 1 + 3 x 1/6 = 1.5 (lower bound)
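
The CPI estimate generalizes to other branch frequencies and penalties. This is a short sketch of the same arithmetic; the function name and the second set of numbers are illustrative assumptions.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles, base_cpi=1.0):
    """CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles

always_stall = cpi_with_branch_stalls(1 / 6, 3)
# Hypothetical variant: resolving branches earlier costs 1 cycle each.
early_resolve = cpi_with_branch_stalls(1 / 6, 1)
print(round(always_stall, 3), round(early_resolve, 3))
```

With the slide's numbers, stalling on every branch makes the pipeline 50% slower than the ideal CPI of 1, before any data hazard stalls are counted.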
Control Flow Hazards

Predict branch not taken
  Send sequential instructions down pipeline
  Kill instructions later if incorrect
  Must stop memory accesses and RF writes
Late flush of instructions on misprediction
  Complex
  Global signal (wire delay)
Branch Hazards

When we decide to branch, other instructions are in the pipeline!

Program execution order (in instructions), shown over clock cycles CC 1 to CC 9:
  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or $13, $6, $2
  52 add $14, $2, $2
  72 lw $4, 50($7)
Control Flow Hazards

Even better but more complex



Predict taken
Predict both (eager execution)
Predict one or the other dynamically


Adapt to program branch patterns
Lots of chip real estate these days


Pentium III, 4, Alpha 21264
Current research topic
Dynamic branch prediction
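
One widely used way to adapt to program branch patterns is a table of 2-bit saturating counters indexed by the branch address. The slides do not specify a scheme, so the sketch below is an assumed illustration; the class name, table size, and indexing are hypothetical choices.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters; values 2 and 3 predict taken."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [1] * entries          # start at weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries     # drop the byte offset bits

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch at address 40: taken 9 times, then not taken, run twice.
bp = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 2
mispredicts = 0
for taken in outcomes:
    if bp.predict(40) != taken:
        mispredicts += 1
    bp.update(40, taken)
print(mispredicts, "of", len(outcomes))
```

After the first pass warms up the counter, only the loop exit is mispredicted; a single not-taken outcome does not flip the prediction for the next run of the loop.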
Improving Performance

Try and avoid stalls! E.g., reorder these instructions:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
Add a “branch delay slot”
  Always execute following instruction
  “delay slot” (later example on MIPS pipeline)
  Put useful instruction there, otherwise ‘nop’
  rely on compiler to “fill” the slot with something useful
Superscalar: start more than one instruction in the same cycle
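
The benefit of reordering the lw/sw sequence above can be counted with a simple model. This sketch assumes one stall cycle whenever an instruction uses the register loaded by the instruction immediately before it, including a store that needs the loaded value; the tuple format and function name are illustrative.

```python
def count_load_use_stalls(program):
    """program is a list of (op, dest, sources) tuples in execution order.
    Counts one stall per instruction that reads the destination of the
    lw directly ahead of it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls

original = [
    ("lw", "$t0", ["$t1"]),           # lw $t0, 0($t1)
    ("lw", "$t2", ["$t1"]),           # lw $t2, 4($t1)
    ("sw", None, ["$t2", "$t1"]),     # sw $t2, 0($t1)
    ("sw", None, ["$t0", "$t1"]),     # sw $t0, 4($t1)
]
# The two stores write different addresses, so they can be swapped.
reordered = [original[0], original[1], original[3], original[2]]

print(count_load_use_stalls(original), count_load_use_stalls(reordered))
```

Moving the store of $t0 ahead of the store of $t2 puts an independent instruction between the second load and its use, which removes the stall under this model.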
Dynamic Scheduling


The hardware performs the “scheduling”

hardware tries to find instructions to execute

out of order execution is possible

speculative execution and dynamic branch prediction
All modern processors are very complicated

DEC Alpha 21264: 9 stage pipeline, 6 instruction issue

PowerPC and Pentium: branch history table

Compiler technology important
Exceptions

Even worse: in one cycle







I/O interrupt
User trap to OS (EX)
Illegal instruction (ID)
Arithmetic overflow
Hardware error
Etc.
Interrupt priorities must be supported