EEF011 Computer Architecture
Appendix A
Pipelining: Basic and
Intermediate Concepts
吳俊興
Department of Computer Science and Information Engineering, National University of Kaohsiung
October 2004
Outline
Basic concept of Pipelining
The Basic Pipeline for MIPS
The Major Hurdles of Pipelining – Pipeline Hazards
What Is Pipelining?
Laundry Example
• Ann, Betty, Cathy, Dave (A, B, C, D) each has one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
What Is Pipelining?
[Figure: sequential laundry. Tasks A, B, C, D are done one after another from 6 PM to midnight; each load takes 30 + 40 + 20 minutes.]
Sequential laundry takes 6 hours for 4 loads
Want to reduce the time? Pipelining!!!
What Is Pipelining?
[Figure: pipelined laundry. Task A starts washing at 6 PM; B, C, and D each start as soon as the washer is free, so the loads overlap and all four finish by 9:30 PM.]
• Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
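To make the numbers on the two laundry slides concrete, here is a small Python check. The 30/40/20-minute stage times and the 4 loads come from the slides; the closed-form pipelined expression is specific to this example, where the 40-minute dryer is the bottleneck stage:

```python
# Minimal sketch of the laundry timing from the slides (times in minutes).
WASH, DRY, FOLD = 30, 40, 20
LOADS = 4

# Sequential: each load finishes completely before the next one starts.
sequential = LOADS * (WASH + DRY + FOLD)               # 4 * 90 = 360 min = 6 hours

# Pipelined: after the first load fills the pipeline, a new load finishes
# every "slowest stage" interval (the 40-minute dryer).
slowest = max(WASH, DRY, FOLD)
pipelined = WASH + DRY + FOLD + (LOADS - 1) * slowest  # 90 + 3*40 = 210 min = 3.5 hours

print(sequential / 60, "hours sequential;", pipelined / 60, "hours pipelined")
```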
What Is Pipelining?
Pipelining is an implementation technique whereby
multiple instructions are overlapped in execution
It takes advantage of parallelism that exists among
instructions => instruction-level parallelism
It is the key implementation technique used to make
fast CPUs
• Pipelining doesn’t help latency of single task; it helps
throughput of entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = Number of pipe stages
– Unbalanced lengths of pipe stages reduce speedup
MIPS Without Pipelining
The execution of instructions is controlled by the CPU clock: one specific
function is performed in each clock cycle.
Every MIPS instruction takes 5 clock cycles, one for each of five different
stages.
Several temporary registers are introduced to implement the 5-stage
structure.
MIPS Functions
Only consider load/store, BEQZ, and integer ALU instructions.
Instruction Fetch (IF):
• Send out the PC and fetch the instruction from memory into the
instruction register (IR); increment the PC by 4 to address the next
sequential instruction and store it in the next program counter (NPC) register.
• IR holds the instruction that will be used in the next stage.
• NPC holds the value of the next PC.
Passed To Next Stage:
IR <- Mem[PC]
NPC <- PC + 4
MIPS Functions
Instruction Decode/Register Fetch (ID):
• Decode the instruction and access the register file to read the registers.
• The outputs of the general-purpose registers are read into two
temporary registers (A & B) for use in later clock cycles.
• We sign-extend the lower 16 bits of the instruction register into another
temporary register, Imm.
Passed To Next Stage:
A <- Regs[rs];
B <- Regs[rt];
Imm <- ((IR16)^48 ## IR16..31)    (sign-extended immediate)
MIPS Functions
Execution/Effective Address Calculation (EX):
• We perform an operation (for an ALU instruction) or an address
calculation (for a load/store or branch instruction).
• For an ALU instruction, actually do the operation. For an address
calculation, figure out the address and store it for the next cycle.
Passed To Next Stage:
Memory reference:       ALUOutput <- A + Imm;
Register-register ALU:  ALUOutput <- A func B;
Register-immediate ALU: ALUOutput <- A op Imm;
Branch:                 ALUOutput <- NPC + (Imm << 2);
                        Cond <- (A == 0);
MIPS Functions
Memory Access/Branch Completion (MEM):
• If it is an ALU instruction, do nothing.
• If it is a load/store instruction, then access memory.
• If it is a branch instruction, update the PC if the branch condition is true.
Passed To Next Stage:
LMD <- Mem[ALUOutput]   or   Mem[ALUOutput] <- B;
If (cond) PC <- ALUOutput
MIPS Functions
Write-back (WB):
• Update the register file from either the ALU result or the loaded data.
Passed To Next Stage:
Register-register ALU:  Regs[rd] <- ALUOutput;
Register-immediate ALU: Regs[rt] <- ALUOutput;
Load:                   Regs[rt] <- LMD;
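As a rough illustration of how the five stages above fit together on an unpipelined machine, here is a minimal Python sketch. The dictionary-based instruction encoding, the combined instruction/data memory, and the tiny test program are invented for the example; only the per-stage register transfers (IR, NPC, A, B, Imm, ALUOutput, LMD) follow the slides.

```python
# Minimal sketch (assumed encoding, not the textbook's) of the five MIPS steps
# IF/ID/EX/MEM/WB on an unpipelined machine handling ALU ops, LW/SW and BEQZ.
regs = [0] * 32
mem = {0:  {"op": "addi", "rt": 1, "rs": 0, "imm": 5},    # r1 <- 5
       4:  {"op": "sw",   "rt": 1, "rs": 0, "imm": 100},  # Mem[100] <- r1
       8:  {"op": "lw",   "rt": 2, "rs": 0, "imm": 100},  # r2 <- Mem[100]
       12: {"op": "beqz", "rs": 2, "imm": 4}}             # not taken (r2 != 0)

pc = 0
while pc in mem:
    # IF: fetch the instruction and compute NPC = PC + 4.
    ir, npc = mem[pc], pc + 4
    # ID: read the register file and pick up the (already sign-extended) immediate.
    a, b, imm = regs[ir.get("rs", 0)], regs[ir.get("rt", 0)], ir.get("imm", 0)
    # EX: ALU operation, effective address, or branch target/condition.
    if ir["op"] == "addi":
        alu_out = a + imm
    elif ir["op"] in ("lw", "sw"):
        alu_out = a + imm                       # effective address
    elif ir["op"] == "beqz":
        alu_out, cond = npc + (imm << 2), (a == 0)
    # MEM: access data memory, or complete the branch.
    if ir["op"] == "lw":
        lmd = mem.get(alu_out, 0)
    elif ir["op"] == "sw":
        mem[alu_out] = b
    pc = alu_out if ir["op"] == "beqz" and cond else npc
    # WB: write the result back to the register file.
    if ir["op"] == "addi":
        regs[ir["rt"]] = alu_out
    elif ir["op"] == "lw":
        regs[ir["rt"]] = lmd

print(regs[1], regs[2])   # both 5
```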
The classic five-stage pipeline for MIPS
We can pipeline the execution with almost no changes by simply starting a
new instruction on each clock cycle.
Each clock cycle becomes a pipe stage, a cycle in the pipeline, which
results in the execution pattern typical of a pipeline structure.
Although each instruction takes 5 clock cycles to complete, the hardware
will initiate a new instruction during each clock cycle and will be executing
some part of each of the five different instructions already in the pipeline.
It may be hard to believe that pipelining is as simple as this.
                         Clock number
Instruction number       1    2    3    4    5    6    7    8    9
Instruction i            IF   ID   EX   MEM  WB
Instruction i+1               IF   ID   EX   MEM  WB
Instruction i+2                    IF   ID   EX   MEM  WB
Instruction i+3                         IF   ID   EX   MEM  WB
Instruction i+4                              IF   ID   EX   MEM  WB
Figure A.2 The pipeline can be thought of as a series of data paths shifted in time
Simple MIPS Pipeline
The MIPS pipeline data path must deal with problems that pipelining introduces in
a real implementation.
It is critical to ensure that instructions at different stages in the pipeline do
not attempt to use the same hardware resource at the same time (in the same
clock cycle), e.g., perform different operations with the same functional unit
such as the ALU on the same clock cycle.
Instruction and data memories are separated into different caches (IM/DM).
The register file is used in two stages: it is read in ID and written in WB. To
handle a read and a write to the same register in the same cycle, we perform the
register write in the first half of the clock cycle and the read in the second half.
Pipeline implementation for MIPS
In order to ensure that instructions in different stages of the pipeline do not
interfere with each other, the data path is pipelined by adding a set of registers,
one between each pair of pipe stages.
The registers serve to convey values and control information from one stage to the
next.
Most of the data paths flow from left to right, which is from earlier in time to later.
The paths flowing from right to left (which carry the register write-back information
and PC information on a branch) introduce complications into the pipeline.
Events on Pipe Stages of the MIPS Pipeline
Figure A.19
IF (any instruction):
  IF/ID.IR <- Mem[PC];
  IF/ID.NPC, PC <- (if ((EX/MEM.opcode == branch) & EX/MEM.cond) {EX/MEM.ALUOutput} else {PC + 4});
ID (any instruction):
  ID/EX.A <- Regs[IF/ID.IR[rs]]; ID/EX.B <- Regs[IF/ID.IR[rt]];
  ID/EX.NPC <- IF/ID.NPC; ID/EX.IR <- IF/ID.IR;
  ID/EX.Imm <- sign-extend(IF/ID.IR[immediate field]);
EX:
  ALU instruction:  EX/MEM.IR <- ID/EX.IR;
                    EX/MEM.ALUOutput <- ID/EX.A func ID/EX.B; or
                    EX/MEM.ALUOutput <- ID/EX.A op ID/EX.Imm;
  Load or store:    EX/MEM.IR <- ID/EX.IR;
                    EX/MEM.ALUOutput <- ID/EX.A + ID/EX.Imm;
                    EX/MEM.B <- ID/EX.B;
  Branch:           EX/MEM.ALUOutput <- ID/EX.NPC + (ID/EX.Imm << 2);
                    EX/MEM.cond <- (ID/EX.A == 0);
MEM:
  ALU instruction:  MEM/WB.IR <- EX/MEM.IR;
                    MEM/WB.ALUOutput <- EX/MEM.ALUOutput;
  Load or store:    MEM/WB.IR <- EX/MEM.IR;
                    MEM/WB.LMD <- Mem[EX/MEM.ALUOutput]; or
                    Mem[EX/MEM.ALUOutput] <- EX/MEM.B;
WB:
  ALU instruction:  Regs[MEM/WB.IR[rd]] <- MEM/WB.ALUOutput; or
                    Regs[MEM/WB.IR[rt]] <- MEM/WB.ALUOutput;
  Load only:        Regs[MEM/WB.IR[rt]] <- MEM/WB.LMD;
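The register transfers in Figure A.19 can be mimicked in a few lines of Python. The sketch below is a deliberately simplified, assumed implementation: it handles only register-register ALU instructions, uses a made-up tuple encoding, and ignores hazards entirely. Its only point is to show the four pipeline latches (IF/ID, ID/EX, EX/MEM, MEM/WB) shifting once per clock cycle, with the stages evaluated from WB back to IF so that each latch is consumed before it is overwritten.

```python
# Sketch (assumed, simplified) of the IF/ID, ID/EX, EX/MEM, MEM/WB latches from
# Figure A.19, for register-register ALU instructions only and with no hazards.
import operator

prog = [("add", 3, 1, 2),       # r3 <- r1 + r2   (op, rd, rs, rt)
        ("sub", 4, 1, 2),
        ("add", 5, 1, 1)]
regs = [0, 10, 20] + [0] * 29
ops = {"add": operator.add, "sub": operator.sub}

if_id = id_ex = ex_mem = mem_wb = None
pc = 0
for cycle in range(len(prog) + 4):              # pipeline drains after 4 extra cycles
    # WB: write the ALU result into the destination register.
    if mem_wb:
        regs[mem_wb["rd"]] = mem_wb["alu_out"]
    # MEM: ALU instructions just pass their result along.
    mem_wb = ex_mem
    # EX: perform the ALU operation on the values read in ID.
    ex_mem = None
    if id_ex:
        ex_mem = {"rd": id_ex["rd"],
                  "alu_out": ops[id_ex["op"]](id_ex["a"], id_ex["b"])}
    # ID: decode and read the register file into A and B.
    id_ex = None
    if if_id:
        op, rd, rs, rt = if_id
        id_ex = {"op": op, "rd": rd, "a": regs[rs], "b": regs[rt]}
    # IF: fetch the next instruction, if any.
    if_id = prog[pc] if pc < len(prog) else None
    pc += 1

print(regs[3], regs[4], regs[5])   # 30 -10 20
```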
Basic Performance Issues for Pipelining
Example: Assume that an unpipelined processor has a 1ns clock cycle
and that it uses 4 cycles for ALU operations and branches and 5 cycles
for memory operations. Assume that the relative frequencies of these
operations are 40%, 20%, and 40%, respectively. Suppose that due to
clock skew and setup, pipelining the processor adds 0.2 ns overhead to
the clock. Ignoring any latency impact, how much speedup in the
instruction execution time will we gain from the pipeline implementation?
Solution:
Avg. instr. exec time (unpipelined) = Clock cycle time x Avg. CPI
= 1 ns x (40% x 4 + 20% x 4 + 40% x 5) = 4.4 ns
In the ideal situation, ignoring any latency impact, the avg. CPI is just 1 cycle
for all kinds of instructions and the clock cycle time is 1.0 ns + 0.2 ns = 1.2 ns,
so Avg. instr. exec time (pipelined) = 1.2 ns x 1 = 1.2 ns
Then, the speedup from pipelining is 4.4 ns / 1.2 ns, or about 3.7 times.
What is the result if there is no overhead when implementing pipelining?
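The arithmetic of the example, including the no-overhead question at the end, can be checked directly in plain Python using the numbers from the slide:

```python
# The arithmetic of the example above, plus the no-overhead question at the end.
cpi_unpipelined = 0.40 * 4 + 0.20 * 4 + 0.40 * 5        # ALU, branch, memory mix
time_unpipelined = 1.0 * cpi_unpipelined                # 1 ns clock -> 4.4 ns

time_pipelined = (1.0 + 0.2) * 1                        # 1.2 ns clock, CPI = 1
print(time_unpipelined / time_pipelined)                # ~3.7x speedup

time_pipelined_ideal = 1.0 * 1                          # no clock overhead
print(time_unpipelined / time_pipelined_ideal)          # 4.4x speedup
```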
A.2 The Major Hurdle of Pipelining –
Pipeline Hazard
Limits to pipelining: there are situations, called hazards, that prevent the next
instruction from executing during its designated clock cycle, thus
reducing the performance from the ideal speedup. The three classes of
hazards are:
– Structural hazards: arise from resource conflicts when the hardware
cannot support all possible combinations of instructions simultaneously
in overlapped execution, i.e., two different instructions use the same
hardware in the same cycle.
– Data hazards: arise when an instruction depends on the result of a prior
instruction still in the pipeline: RAW, WAR and WAW.
– Control hazards: arise from the pipelining of branches and other
instructions that change the PC.
The common solution is to stall the pipeline until the hazard is cleared, i.e.,
to insert one or more “bubbles” in the pipeline.
Performance of Pipelining with Stalls
• The pipelined CPI:
CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction
              = 1 + Pipeline stall cycles per instruction
• Ignoring the cycle time overhead of pipelining, and assuming the stages
are perfectly balanced (all occupy one clock cycle) and all instructions
take the same number of cycles, we have the speedup from pipelining:
Speedup = CPI unpipelined / CPI pipelined
        = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
        = Pipeline depth / (1 + Pipeline stall cycles per instruction)
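Restating the last formula as code (the 5-stage depth and the 0.5 stall cycles per instruction in the second call are just illustrative numbers, not figures from the slides):

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instr):
    """Ideal-case speedup with perfectly balanced stages and a CPI of 1 + stalls."""
    return pipeline_depth / (1 + stall_cycles_per_instr)

print(pipeline_speedup(5, 0.0))   # 5.0: no stalls, speedup equals pipeline depth
print(pipeline_speedup(5, 0.5))   # ~3.3: half a stall cycle per instruction
```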
Structural Hazards
A structural hazard arises when two or more different instructions want to use
the same hardware resource in the same cycle, e.g., MEM uses the same memory
port as IF, as shown in this slide.
[Pipeline diagram, cycles 1-7: a Load followed by Instr 1-4; in cycle 4 the Load's
DMem access and Instr 3's Ifetch need the same memory port.]
Solution: stall
Structural Hazards
This is another way of looking at the effect of a stall.
[Pipeline diagram, cycles 1-7: the same sequence, but Instr 3 is stalled for one
cycle; a bubble propagates through the pipeline and Instr 3's Ifetch moves to
cycle 5.]
Structural Hazards
This is another way to represent the stall.
Dealing With Structural Hazards
• Stall
– low cost, simple
– Increases CPI
– used for rare cases, since stalling hurts performance
• Replicate resource
– good performance
– increases cost (+ maybe interconnect delay)
– useful for cheap or divisible resources
– e.g., we use separate instruction and data memories in the MIPS pipeline
Data Hazards
• Data hazards occur when the pipeline changes the order of
read/write accesses to operands (registers) so that the order
differs from the order seen by sequentially executing
instructions on an unpipelined processor.
• The real trouble is when we have:
instruction A
instruction B,
and B manipulates (reads or writes) data before A does. This
violates the order of the instructions, since the architecture
implies that A completes entirely before B is executed.
Data Hazards
Execution Order is:
InstrI
InstrJ
Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: dadd r1,r2,r3
J: dsub r4,r1,r3
• Caused by a “dependence” (in compiler nomenclature).
This hazard results from an actual need for
communication.
Data Hazards
Execution Order is:
InstrI
InstrJ
Write After Read (WAR)
InstrJ tries to write operand before InstrI reads it
– Gets wrong operand
I: dsub r4,r1,r3
J: dadd r1,r2,r3
K: mul r6,r1,r7
– Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Data Hazards
Execution Order is:
InstrI
InstrJ
Write After Write (WAW)
InstrJ tries to write operand before InstrI writes it
– Leaves wrong result ( InstrI not InstrJ )
I: dsub r1,r4,r3
J: dadd r1,r2,r3
K: mul r6,r1,r7
– Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later more complicated
pipeline implementations
Solutions to Data Hazards
• Simple Solution to RAW
• Hardware detects RAW and stalls until the result is written into
the register
+ low cost to implement, simple
- reduces the number of instructions executed per cycle
• Minimizing RAW stalls: Forwarding (also called bypassing)
• Key insight: the result is not really needed by the current
instruction until after the previous instruction actually produces it.
• The ALU result from both the EX/MEM and MEM/WB pipeline
registers is always fed back to the ALU inputs.
• If the forwarding hardware detects that the previous ALU
operation has written the register corresponding to a source for
the current ALU operation, control logic selects the forwarded
result as the ALU input rather than the value read from the
register file.
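A minimal sketch of the forwarding control just described, assuming hypothetical field names for the EX/MEM and MEM/WB pipeline registers (this is not the textbook's hardware description, just the selection logic written in Python):

```python
# Sketch of the forwarding logic described above, with assumed (not textbook)
# field names for the EX/MEM and MEM/WB pipeline registers.
def forward(src_reg, regfile_value, ex_mem, mem_wb):
    """Choose the ALU input for source register src_reg of the instruction in EX."""
    # Newest value first: the instruction one ahead, whose result sits in EX/MEM.
    if ex_mem and ex_mem["writes_reg"] and ex_mem["rd"] == src_reg and src_reg != 0:
        return ex_mem["alu_out"]
    # Otherwise the instruction two ahead, whose result sits in MEM/WB
    # (an ALU result or, for a load, the loaded data).
    if mem_wb and mem_wb["writes_reg"] and mem_wb["rd"] == src_reg and src_reg != 0:
        return mem_wb["value"]
    # Otherwise the value read from the register file in ID is already up to date.
    return regfile_value

# dadd r1,r2,r3 immediately followed by dsub r4,r1,r3: r1 is taken from EX/MEM.
ex_mem = {"writes_reg": True, "rd": 1, "alu_out": 42}
print(forward(1, 0, ex_mem, None))   # 42, not the stale register-file value 0
```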
Data Hazards
dadd r1,r2,r3
dsub r4,r1,r3
and  r6,r1,r7
or   r8,r1,r9
xor  r10,r1,r11
[Pipeline diagram, cycles CC1-CC9: the five instructions overlapped in the pipeline.]
The use of the result of the DADD instruction in the next two instructions causes a
hazard, since the register is not written until after those instructions read it.
Forwarding to Avoid Data Hazards
dadd r1,r2,r3
dsub r4,r1,r3
and  r6,r1,r7
or   r8,r1,r9
xor  r10,r1,r11
[Pipeline diagram, cycles CC1-CC9: the ALU result of dadd is forwarded from the
EX/MEM and MEM/WB pipeline registers to the ALU inputs of the following instructions.]
Forwarding is the concept of making data available to the input of the ALU
for subsequent instructions, even though the generating instruction hasn't
gotten to WB in order to write the memory or registers.
Data Hazards Requiring Stalls
LD   R1,0(R2)
DSUB R4,R1,R6
AND  R6,R1,R7
OR   R8,R1,R9
[Pipeline diagram, cycles CC1-CC8: the LD result is not available until the end of
its MEM stage, but DSUB needs it at the start of its EX stage.]
There are some instances where hazards occur even with forwarding,
e.g., the data isn't loaded until after the MEM stage.
Data Hazards Requiring Stalls
LD   R1,0(R2)
DSUB R4,R1,R6
AND  R6,R1,R7
OR   R8,R1,R9
[Pipeline diagram, cycles CC1-CC8: a one-cycle bubble is inserted so that DSUB's EX
stage follows the LD's MEM stage, allowing the loaded value to be forwarded.]
The stall is necessary in this case.
Another Representation of the Stall
Without the stall:
LD   R1, 0(R2)     IF   ID   EX   MEM  WB
DSUB R4, R1, R5         IF   ID   EX   MEM  WB
AND  R6, R1, R7              IF   ID   EX   MEM  WB
OR   R8, R1, R9                   IF   ID   EX   MEM  WB

With the stall:
LD   R1, 0(R2)     IF   ID   EX     MEM  WB
DSUB R4, R1, R5         IF   ID     stall  EX   MEM  WB
AND  R6, R1, R7              IF     stall  ID   EX   MEM  WB
OR   R8, R1, R9                     stall  IF   ID   EX   MEM  WB

In the top table, we can see why a stall is needed: the MEM cycle
of the load produces a value that is needed in the EX cycle of the
DSUB, which occurs at the same time. This problem is solved by
inserting a stall, as shown in the bottom table.
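The interlock that inserts this stall can be sketched as a simple check performed during ID: if the instruction currently in EX is a load whose destination matches one of the sources of the instruction in ID, hold the instruction for one cycle. The field names below are assumptions, not the textbook's:

```python
# Sketch (assumed field names) of the load interlock that produces the stall above.
def must_stall(id_ex, if_id_sources):
    """True if the instruction in ID must wait one cycle for a load now in EX."""
    return (id_ex is not None
            and id_ex["op"] == "load"
            and id_ex["rt"] in if_id_sources)     # load result needed next cycle

# LD R1,0(R2) is in EX while DSUB R4,R1,R5 is in ID: stall for one cycle.
print(must_stall({"op": "load", "rt": 1}, {1, 5}))   # True
# One cycle later the loaded value can be forwarded from MEM/WB, so no stall.
print(must_stall(None, {1, 5}))                      # False
```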
Control Hazards
• A control hazard happens when we need to find the
destination of a branch, and can’t fetch any new
instructions until we know that destination.
– If instruction i is a taken branch, then the PC is normally not
changed until the end of MEM in the basic pipeline
• Control hazards can cause a greater performance
loss than do data hazards.
Control Hazard on Branches: Three-Cycle Stall
12: beq r1,r3,36
16: and r2,r3,r5
20: or  r6,r1,r7
24: add r8,r1,r9
36: xor r10,r1,r11
[Pipeline diagram, cycles CC1-CC9: the three instructions after the beq are held up
until the branch is resolved, costing a three-cycle stall.]
Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
• Two solutions to this dramatic increase:
– Determine branch taken or not sooner, AND
– Compute target address earlier
• MIPS branch tests if register = 0 or != 0
• MIPS Solution:
– Move the zero test to the ID stage
– Add an adder to calculate the target address in the ID stage
– 1 clock cycle penalty for a branch versus 3
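The 1.9 figure, and the effect of the revised 1-cycle penalty, follow directly from CPI = base CPI + branch frequency x branch penalty:

```python
# The branch-stall arithmetic from this slide: CPI = 1 + branch freq * penalty.
base_cpi, branch_freq = 1.0, 0.30
print(base_cpi + branch_freq * 3)   # 1.9 with the original 3-cycle penalty
print(base_cpi + branch_freq * 1)   # 1.3 once the branch is resolved in ID
```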
The Pipeline with a 1-Cycle Stall for Branches
Four Solutions to Branch Hazards
#1: Stall until branch direction is clear
– Simple both for software and hardware
– Branch penalty is fixed (1-cycle penalty for revised MIPS)
Branch instr.        IF   ID   EX   MEM  WB
Branch successor          IF   IF   ID   EX   MEM  WB
Branch successor+1             IF   ID   EX   MEM  WB
Branch successor+2                  IF   ID   EX   MEM  WB
Four Solutions to Branch Hazards
#2: Predict Branch Not Taken
– Continue to fetch instructions as if the branch were a normal
instruction.
– If the branch is taken, turn the fetched instruction into a no-op
and restart the fetch at the target address.
Untaken branch instr.   IF   ID   EX   MEM  WB
Branch successor             IF   ID   EX   MEM  WB
Branch successor+1                IF   ID   EX   MEM  WB
Branch successor+2                     IF   ID   EX   MEM  WB
Branch successor+3                          IF   ID   EX   MEM  WB

Taken branch instr.     IF   ID   EX   MEM  WB
Branch successor             IF   idle idle idle idle
Branch target                     IF   ID   EX   MEM  WB
Branch target+1                        IF   ID   EX   MEM  WB
Branch target+2                             IF   ID   EX   MEM  WB
Four Solutions to Branch Hazards
#3: Predict Branch Taken
– As soon as the branch is decoded and the target address is
computed, we assume the branch to be taken and begin
fetching and executing at the target.
– But in MIPS the target address is not known any earlier than
the branch outcome
• MIPS still incurs the 1-cycle branch penalty
• Useful for other machines on which the target address is
known before the branch outcome
Four Solutions to Branch Hazards
#4: Delayed Branch
– The execution sequence with a branch delay of one is
branch instruction
sequential successor
branch target if taken
– The sequential successor is in the branch delay slot.
– The instruction in the branch delay slot is executed whether
or not the branch is taken (for a zero-cycle penalty)
• Where to get instructions to fill the branch delay slot?
– From before the branch instruction
– From the target address: only valuable when the branch is taken
– From the fall-through path: only valuable when the branch is not taken
– Canceling or nullifying branches allow more slots to be filled (nonzero
cycle penalty, whose value depends on the rate of correct prediction):
the delay-slot instruction is turned into a no-op if the branch is
incorrectly predicted
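As a rough way to compare the four schemes on the revised MIPS pipeline (1-cycle branch delay), the sketch below computes the average stall cycles per branch for each one. The 60% taken rate and 50% useful-delay-slot rate are illustrative assumptions, not numbers from the slides:

```python
# Rough comparison (illustrative numbers) of the per-branch penalty of the four
# schemes for the revised MIPS pipeline with a 1-cycle branch delay.
taken_frac = 0.6          # assumed fraction of branches that are taken
slot_useful = 0.5         # assumed fraction of delay slots doing useful work

penalty = {
    "stall":             1.0,                     # always pay the 1-cycle delay
    "predict not taken": 1.0 * taken_frac,        # pay only when the branch is taken
    "predict taken":     1.0,                     # no help on MIPS: target known no earlier
    "delayed branch":    1.0 * (1 - slot_useful)  # pay only when the slot is wasted
}
for scheme, cycles in penalty.items():
    print(f"{scheme:18s} {cycles:.2f} stall cycles per branch")
```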
Pipelining Introduction Summary
• Just overlap tasks, and easy if tasks are independent
• Speed Up vs. Pipeline Depth; if ideal CPI is 1, then:
Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) x (Clock cycle unpipelined / Clock cycle pipelined)
• Hazards limit performance on computers:
– Structural: need more hardware resources
– Data (RAW,WAR,WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction