Transcript: Lectures for 2nd Edition

Chapter Six
Enhancing Performance with Pipelining
2004 Morgan Kaufmann Publishers
Outline
• 6.1 An Overview of Pipelining
• 6.2 A Pipelined Datapath
• 6.3 Pipelined Control
• 6.4 Data Hazards and Forwarding
• 6.5 Data Hazards and Stalls
• 6.6 Branch Hazards
• 6.7 Using a Hardware Description Language to Describe and Model a Pipeline
• 6.8 Exceptions
• 6.9 Advanced Pipelining: Extracting More Performance
• 6.10 Real Stuff: The Pentium 4 Pipeline
• 6.11 Fallacies and Pitfalls
• 6.12 Concluding Remarks
• 6.13 Historical Perspective and Further Reading
6.1 An Overview of Pipelining
Keywords
• Pipelining An implementation technique in which multiple instructions are overlapped in execution, much like an assembly line.
• Structural hazard An occurrence in which a planned instruction cannot execute in the proper clock cycle because the hardware cannot support the combination of instructions that are set to execute in the given clock cycle.
• Data hazard Also called pipeline data hazard. An occurrence in which a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available.
• Forwarding Also called bypassing. A method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory.
• Load-use data hazard A specific form of data hazard in which the data requested by a load instruction has not yet become available when it is requested.
Keywords
• Pipeline stall Also called bubble. A stall initiated in order to resolve a hazard.
• Control hazard Also called branch hazard. An occurrence in which the proper instruction cannot execute in the proper clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.
• Untaken branch One that falls through to the successive instruction. A taken branch is one that causes transfer to the branch target.
• Branch prediction A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome.
• Latency (pipeline) The number of stages in a pipeline or the number of stages between two instructions during execution.
Figure 6.1 The laundry analogy for pipelining.
Figure 6.2 Total time for each instruction calculated from the time for each component.

Instruction class                  Instruction  Register  ALU        Data    Register  Total
                                   fetch        read      operation  access  write     time
Load word (lw)                     200 ps       100 ps    200 ps     200 ps  100 ps    800 ps
Store word (sw)                    200 ps       100 ps    200 ps     200 ps            700 ps
R-format (add, sub, and, or, slt)  200 ps       100 ps    200 ps             100 ps    600 ps
Branch (beq)                       200 ps       100 ps    200 ps                       500 ps
Pipelining
• Improve performance by increasing instruction throughput
[Figure: program execution order for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Without pipelining, each instruction takes 800 ps and the next one starts only after the previous one finishes. With pipelining, each stage takes 200 ps and a new instruction starts every 200 ps. Note: timing assumptions changed for this example.]
Ideal speedup is number of stages in the pipeline. Do we achieve this?
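To make the speedup question concrete, here is a back-of-the-envelope check in Python. It is purely illustrative and assumes the numbers on this slide: 800 ps per instruction on the single-cycle datapath and five pipeline stages rebalanced to 200 ps each.

# Rough speedup check using this slide's timing assumptions.
STAGE_PS, SINGLE_CYCLE_PS, STAGES = 200, 800, 5

def nonpipelined_time(n):
    return n * SINGLE_CYCLE_PS

def pipelined_time(n):
    # the first instruction fills the pipeline, then one completes per stage time
    return (STAGES + (n - 1)) * STAGE_PS

for n in (3, 1_000_000):
    print(n, round(nonpipelined_time(n) / pipelined_time(n), 2))
# 3 instructions         -> speedup about 1.71 (2400 ps vs 1400 ps)
# 1,000,000 instructions -> speedup approaches 800/200 = 4, not the stage count 5,
# because the original 800 ps instruction was not five perfectly balanced 200 ps stages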
Figure 6.4 Graphical representation of the instruction pipeline, similar in spirit to the laundry pipeline in figure 6.1 on page 371.
Figure 6.5 Graphical representation of forwarding
2004 Morgan Kaufmann Publishers
11
Figure 6.6 We need a stall even with forwarding when an
R-format instruction following a load tries to use the data.
2004 Morgan Kaufmann Publishers
12
Figure 6.7 Pipeline showing stalling on every conditional
branch as solution to control hazards.
2004 Morgan Kaufmann Publishers
13
Figure 6.8 Predicting that branches are not taken as a
solution to control hazard.
2004 Morgan Kaufmann Publishers
14
Pipelining
• What makes it easy?
  – all instructions are the same length
  – just a few instruction formats
  – memory operands appear only in loads and stores
• What makes it hard?
  – structural hazards: suppose we had only one memory
  – control hazards: need to worry about branch instructions
  – data hazards: an instruction depends on a previous instruction
• We'll build a simple pipeline and look at these issues
• We'll talk about modern processors and what really makes it hard:
  – exception handling
  – trying to improve performance with out-of-order execution, etc.
6.2 A Pipelined Datapath
Basic Idea
IF: Instruction fetch
ID: Instruction decode / register file read
EX: Execute / address calculation
MEM: Memory access
WB: Write back
[Figure: the single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory) divided into these five sections.]
What do we need to add to actually split the datapath into stages?
Figure 6.10 Instructions being executed using the single-cycle datapath in figure 6.9, assuming pipelined execution.
Pipelined Datapath
[Figure: the datapath of figure 6.9 with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.]
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?
Figure 6.12 IF and ID: first and second pipe stages of an instruction,
with the active portions of the datapath in figure 6.11 highlighted.
2004 Morgan Kaufmann Publishers
20
2004 Morgan Kaufmann Publishers
21
Figure 6.13 EX: the third pipe stage of a load instruction, highlighting the
portions of the datapath in figure 6.11 used in this pipe stage.
2004 Morgan Kaufmann Publishers
22
Figure 6.14 MEM and WB: the fourth and fifth pipe stages of a load instruction,
highlighting the portions of the datapath in figure 6.11 used in this pipe stage.
2004 Morgan Kaufmann Publishers
23
2004 Morgan Kaufmann Publishers
24
Figure 6.15 EX: the third pipe stage of a store instruction.
2004 Morgan Kaufmann Publishers
25
Figure 6.16 MEM and WB: the fourth and fifth pipe stage of a
store instruction.
2004 Morgan Kaufmann Publishers
26
2004 Morgan Kaufmann Publishers
27
Corrected Datapath
2004 Morgan Kaufmann Publishers
28
Graphically Representing Pipelines
[Figure: multiple-clock-cycle pipeline diagram for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), showing each instruction flowing through IM, Reg, ALU, DM, and Reg over clock cycles CC 1 to CC 7.]
• Can help with answering questions like:
  – how many cycles does it take to execute this code?
  – what is the ALU doing during cycle 4?
  – use this representation to help understand datapaths
Figure 6.18 The portion of the datapath in figure 6.17 that is
used in all five stages of a load instruction.
2004 Morgan Kaufmann Publishers
30
Figure 6.19 Multiple-clock-cycle pipeline diagram of five
instructions.
2004 Morgan Kaufmann Publishers
31
Figure 6.20 Traditional multiple-clock-cycle pipeline diagram
of five instructions in figure 6.19.
2004 Morgan Kaufmann Publishers
32
Figure 6.21 The single-clock-cycle diagram corresponding to
clock cycle 5 of the pipeline in figures 6.19 and 6.20.
2004 Morgan Kaufmann Publishers
33
6.3 Pipelined Control
Pipeline Control
Pipeline control
• We have 5 stages. What needs to be controlled in each stage?
  – Instruction Fetch and PC Increment
  – Instruction Decode / Register Fetch
  – Execution
  – Memory Stage
  – Write Back
• How would control be handled in an automobile plant?
  – a fancy control center telling everyone what to do?
  – should we use a finite state machine?
Figure 6.23 A copy of figure 5.12 on page 302.

Instruction   ALUOp  Instruction       Function  Desired ALU       ALU control
opcode               operation         code      action            input
LW            00     load word         XXXXXX    add               0010
SW            00     store word        XXXXXX    add               0010
Branch equal  01     branch equal      XXXXXX    subtract          0110
R-type        10     add               100000    add               0010
R-type        10     subtract          100010    subtract          0110
R-type        10     AND               100100    and               0000
R-type        10     OR                100101    or                0001
R-type        10     set on less than  101010    set on less than  0111
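The two-level decode in this table is easy to state as a function. Below is a small Python sketch of the ALU control mapping (illustrative only; the real unit is combinational logic, not software):

# Sketch of the ALU control logic in Figure 6.23.
def alu_control(alu_op: str, funct: str) -> str:
    """alu_op is the 2-bit ALUOp from main control; funct is the 6-bit function field."""
    if alu_op == "00":            # lw / sw: address calculation
        return "0010"             # add
    if alu_op == "01":            # beq: comparison
        return "0110"             # subtract
    # alu_op == "10": R-type, decode the function code
    return {
        "100000": "0010",         # add
        "100010": "0110",         # subtract
        "100100": "0000",         # and
        "100101": "0001",         # or
        "101010": "0111",         # set on less than
    }[funct]

assert alu_control("10", "100010") == "0110"   # sub -> ALU subtract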
Figure 6.24 A copy of figure 5.16 on page 306.

Signal name | Effect when deasserted (0) | Effect when asserted (1)
RegDst | The register destination number for the Write register comes from the rt field (bits 20:16). | The register destination number for the Write register comes from the rd field (bits 15:11).
RegWrite | None. | The register on the Write register input is written with the value on the Write data input.
ALUSrc | The second ALU operand comes from the second register file output (Read data 2). | The second ALU operand is the sign-extended, lower 16 bits of the instruction.
PCSrc | The PC is replaced by the output of the adder that computes the value of PC + 4. | The PC is replaced by the output of the adder that computes the branch target.
MemRead | None. | Data memory contents designated by the address input are put on the Read data output.
MemWrite | None. | Data memory contents designated by the address input are replaced by the value on the Write data input.
MemtoReg | The value fed to the register Write data input comes from the ALU. | The value fed to the register Write data input comes from the data memory.
Figure 6.25 The values of the control lines are the same as in figure 5.18 on page 308, but they have been shuffled into three groups corresponding to the last three pipeline stages.

             Execution/address calculation       Memory access stage         Write-back stage
             stage control lines                  control lines               control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc       Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format     1       1       0       0            0       0        0          1         0
lw           0       0       0       1            0       1        0          1         1
sw           X       0       0       1            0       0        1          0         X
beq          X       0       1       0            1       0        0          0         X
6.4 Data Hazards and Forwarding
Pipeline Control
• Pass control signals along just like the data
[Table: the same control-line values as in figure 6.25, grouped into execution/address calculation, memory access, and write-back stage control lines for R-format, lw, sw, and beq.]
[Figure: the control values are generated during ID and carried forward through the ID/EX, EX/MEM, and MEM/WB pipeline registers; the EX group is used in EX, the M group in MEM, and the WB group in WB.]
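As one way to read the figure, here is an illustrative Python rendering of how the control values from figure 6.25 are bundled into the EX, M, and WB groups that ride down the pipeline registers. The dictionary itself is just a sketch of the data, not of the hardware.

# Control values from figure 6.25, bundled by the stage that consumes them.
# None marks a don't-care (X) entry.
CONTROL = {
    "R-format": {"EX": {"RegDst": 1,    "ALUOp": 0b10, "ALUSrc": 0},
                 "M":  {"Branch": 0,    "MemRead": 0,  "MemWrite": 0},
                 "WB": {"RegWrite": 1,  "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0,    "ALUOp": 0b00, "ALUSrc": 1},
                 "M":  {"Branch": 0,    "MemRead": 1,  "MemWrite": 0},
                 "WB": {"RegWrite": 1,  "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": None, "ALUOp": 0b00, "ALUSrc": 1},
                 "M":  {"Branch": 0,    "MemRead": 0,  "MemWrite": 1},
                 "WB": {"RegWrite": 0,  "MemtoReg": None}},
    "beq":      {"EX": {"RegDst": None, "ALUOp": 0b01, "ALUSrc": 0},
                 "M":  {"Branch": 1,    "MemRead": 0,  "MemWrite": 0},
                 "WB": {"RegWrite": 0,  "MemtoReg": None}},
}
# Each cycle the not-yet-used groups simply move right with the instruction:
# ID/EX carries {EX, M, WB}, EX/MEM carries {M, WB}, MEM/WB carries {WB}.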
Datapath with Control
Dependencies
• Problem with starting next instruction before first is finished
  – dependencies that "go backward in time" are data hazards
[Figure: pipeline diagram over clock cycles CC 1 to CC 9 for
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
The value of register $2 is 10 until it becomes -20 during CC 5, when sub writes back, so the and and or instructions read the stale value.]
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the "nops"?

  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

• Problem: this really slows us down!
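A toy Python sketch of the compiler-side approach (assumptions: no forwarding, and a register file that writes in the first half of a cycle and reads in the second half, so a reader must be at least three instructions after the producer). Running it on the sequence above inserts two nops right after the sub, since and and or both read $2 too soon, while add and sw are already far enough away.

def insert_nops(program, hazard_distance=2):
    """program: list of (dest_register, source_registers) in original order."""
    scheduled = []
    for dest, srcs in program:
        # how close is the nearest earlier instruction that writes one of srcs?
        for back in range(1, hazard_distance + 1):
            if len(scheduled) >= back and scheduled[-back][0] in srcs:
                scheduled.extend([(None, [])] * (hazard_distance - back + 1))  # nops
                break
        scheduled.append((dest, srcs))
    return scheduled

code = [("$2",  ["$1", "$3"]),     # sub $2,  $1, $3
        ("$12", ["$2", "$5"]),     # and $12, $2, $5
        ("$13", ["$6", "$2"]),     # or  $13, $6, $2
        ("$14", ["$2", "$2"]),     # add $14, $2, $2
        (None,  ["$15", "$2"])]    # sw  $15, 100($2)  (writes memory, not a register)

print(insert_nops(code))           # two nops appear right after the sub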
EX hazard
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
MEM hazard

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
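Reading the EX and MEM conditions together, here is a Python sketch of the forwarding unit as given on these slides. The full version in the text also suppresses the MEM/WB forward when the EX/MEM forward already applies; in this sketch that falls out of the ordering, since the EX case overrides the MEM case.

# Illustrative model of the forwarding unit (field names mirror the slides).
def forward(ex_mem, mem_wb, id_ex):
    """Each argument is a dict of pipeline-register fields.
    Returns (ForwardA, ForwardB) as 2-bit strings for the ALU input muxes."""
    fa, fb = "00", "00"                      # default: operands come from ID/EX

    # MEM hazard: forward from the MEM/WB register (older result)
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0:
        if mem_wb["RegisterRd"] == id_ex["RegisterRs"]:
            fa = "01"
        if mem_wb["RegisterRd"] == id_ex["RegisterRt"]:
            fb = "01"

    # EX hazard: forward from the EX/MEM register (most recent result wins)
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0:
        if ex_mem["RegisterRd"] == id_ex["RegisterRs"]:
            fa = "10"
        if ex_mem["RegisterRd"] == id_ex["RegisterRt"]:
            fb = "10"

    return fa, fb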
Forwarding
• Use temporary results, don't wait for them to be written
  – register file forwarding to handle read/write to same register
  – ALU forwarding
[Figure: the same sub/and/or/add/sw sequence, with the value of register $2 and of the EX/MEM and MEM/WB pipeline registers shown in each clock cycle; the -20 produced by sub is forwarded from EX/MEM and MEM/WB to the ALU inputs of the and and or instructions. What if this $2 was $13?]
Figure 6.30 On the top are the ALU and pipeline registers
before adding forwarding.
2004 Morgan Kaufmann Publishers
48
Forwarding
• The main idea (some details not shown)
Figure 6.31 The control values for the forwarding multiplexors in figure 6.30.

Mux control     Source   Explanation
ForwardA = 00   ID/EX    The first ALU operand comes from the register file.
ForwardA = 10   EX/MEM   The first ALU operand is forwarded from the prior ALU result.
ForwardA = 01   MEM/WB   The first ALU operand is forwarded from data memory or an earlier ALU result.
ForwardB = 00   ID/EX    The second ALU operand comes from the register file.
ForwardB = 10   EX/MEM   The second ALU operand is forwarded from the prior ALU result.
ForwardB = 01   MEM/WB   The second ALU operand is forwarded from data memory or an earlier ALU result.
Figure 6.32 The datapath modified to resolve hazards via
forwarding.
2004 Morgan Kaufmann Publishers
51
Figure 6.33 A close-up of the datapath in figure 6.30 on page 409 shows a 2:1
multiplexor, which has been added to select the signed immediate as an ALU input.
2004 Morgan Kaufmann Publishers
52
6.5 Data Hazards and Stalls
Keywords
• nop An instruction that does no operation to change state.
Can't always forward
• Load word can still cause a hazard:
  – an instruction tries to read a register following a load instruction that writes to the same register.
[Figure: pipeline diagram over clock cycles CC 1 to CC 9 for
  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
The loaded value of $2 is not available until the end of MEM, one cycle too late for the and instruction's EX stage, even with forwarding.]
• Thus, we need a hazard detection unit to "stall" the load instruction
Stalling
• We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram over clock cycles CC 1 to CC 10 for lw $2, 20($1) followed by and $4, $2, $5, or $8, $2, $6, and add $9, $4, $2. A bubble is inserted after the lw: the and is held in the same stage for one extra cycle (it becomes a nop for that cycle), and all later instructions are delayed by one clock cycle.]
Hazard Detection Unit
• Stall by letting an instruction that won't write anything go forward
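The test that unit applies in the ID stage is the standard load-use condition from the text; a Python sketch (field names are illustrative):

# Load-use hazard check performed during ID: if the instruction in EX is a
# load whose destination matches either source of the instruction in ID,
# stall one cycle (hold PC and IF/ID, force the ID/EX control signals to 0).
def must_stall(id_ex, if_id):
    return (id_ex["MemRead"]
            and (id_ex["RegisterRt"] == if_id["RegisterRs"]
                 or id_ex["RegisterRt"] == if_id["RegisterRt"]))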
6.6 Branch Hazards
Keywords
• Flush (instructions) To discard instructions in a pipeline, usually due to an unexpected event.
• Dynamic branch prediction Prediction of branches at runtime using runtime information.
• Branch prediction buffer Also called branch history table. A small memory that is indexed by the lower portion of the address of the branch instruction and that contains one or more bits indicating whether the branch was recently taken or not.
• Branch delay slot The slot directly after a delayed branch instruction, which in the MIPS architecture is filled by an instruction that does not affect the branch.
• Branch target buffer A structure that caches the destination PC or destination instruction for a branch. It is usually organized as a cache with tags, making it more costly than a simple prediction buffer.
Keywords
• Correlating predictor A branch predictor that combines local behavior of a particular branch and global information about the behavior of some recent number of executed branches.
• Tournament branch predictor A branch predictor with multiple predictions for each branch and a selection mechanism that chooses which predictor to enable for a given branch.
Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
[Figure: pipeline diagram over clock cycles CC 1 to CC 9 for
  40 beq $1, $3, 28
  44 and $12, $2, $5
  48 or  $13, $6, $2
  52 add $14, $2, $2
  72 lw  $4, 50($7)
The instructions after the beq are already being fetched and decoded when the branch outcome is known.]
• We are predicting "branch not taken"
  – need to add hardware for flushing instructions if we are wrong
Figure 6.38 The ID stage of clock cycle 3 determines that a branch must be taken, so it
selects 72 as the next PC address and zeros the instruction fetched for the next clock cycle.
2004 Morgan Kaufmann Publishers
62
2004 Morgan Kaufmann Publishers
63
Branches
• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, penalty increases and static branch prediction drastically hurts performance
• Solution: dynamic branch prediction
[Figure: a 2-bit prediction scheme, drawn as a four-state machine (two "Predict taken" states and two "Predict not taken" states) with Taken/Not taken transitions, so the prediction must be wrong twice in a row before it changes.]
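A minimal Python model of the 2-bit scheme above; the saturating-counter encoding and the initial state are illustrative choices.

# 2-bit saturating counter per branch: 0-1 predict not taken, 2-3 predict
# taken, so a single mispredict does not flip a strongly established prediction.
class TwoBitPredictor:
    def __init__(self):
        self.counter = 2                       # start in weak "predict taken"

    def predict(self) -> bool:
        return self.counter >= 2               # True = predict taken

    def update(self, taken: bool) -> None:
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:      # a loop branch: mostly taken
    print("predict taken?", p.predict(), "actually taken?", outcome)
    p.update(outcome)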
Branch Prediction
• Sophisticated Techniques:
  – A "branch target buffer" to help us look up the destination
  – Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
  – Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  – A "branch delay slot" which the compiler tries to fill with a useful instruction (make the one cycle delay part of the ISA)
• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!
• Modern processors predict correctly 95% of the time!
Figure 6.40 Scheduling the branch delay slot.
2004 Morgan Kaufmann Publishers
66
6.7 Using a Hardware Description Language to Describe and Model a Pipeline
Figure 6.41 The final datapath and control for this chapter.
2004 Morgan Kaufmann Publishers
68
6.8 Exceptions
Keywords
• Imprecise interrupt Also called imprecise exception. Interrupts or exceptions in pipelined computers that are not associated with the exact instruction that was the cause of the interrupt or exception.
• Precise interrupt Also called precise exception. An interrupt or exception that is always associated with the correct instruction in pipelined computers.
Figure 6.42 The datapath with controls to handle exceptions.
2004 Morgan Kaufmann Publishers
71
Figure 6.43 The result of an exception due to arithmetic
overflow in the add instruction.
2004 Morgan Kaufmann Publishers
72
2004 Morgan Kaufmann Publishers
73
6.9 Advanced Pipelining: Extracting More Performance
Keywords
• Instruction-level parallelism The parallelism among instructions.
• Multiple issue A scheme whereby multiple instructions are launched in 1 clock cycle.
• Static multiple issue An approach to implementing a multiple-issue processor where many decisions are made by the compiler before execution.
• Dynamic multiple issue An approach to implementing a multiple-issue processor where many decisions are made during execution by the processor.
• Issue slots The positions from which instructions could issue in a given clock cycle; by analogy these correspond to positions at the starting blocks for a sprint.
• Speculation An approach whereby the compiler or processor guesses the outcome of an instruction to remove it as a dependence in executing other instructions.
Keywords
• Issue packet The set of instructions that issues together in 1 clock cycle; the packet may be determined statically by the compiler or dynamically by the processor.
• Loop unrolling A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.
• Register renaming The renaming of registers, by the compiler or hardware, to remove antidependences.
• Antidependence Also called name dependence. An ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions.
• Instruction group In IA-64, a sequence of consecutive instructions with no register data dependences among them.
Keywords
• Stop In IA-64, an explicit indicator of a break between independent and dependent instructions.
• Predication A technique to make instructions dependent on predicates rather than on branches.
• Poison A result generated when a speculative load yields an exception, or an instruction uses a poisoned operand.
• Advanced load In IA-64, a speculative load instruction with support to check for aliases that could invalidate the load.
• Superscalar An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.
• Dynamic pipeline scheduling Hardware support for reordering the order of instruction execution so as to avoid stalls.
• Commit unit The unit in a dynamic or out-of-order execution pipeline that decides when it is safe to release the result of an operation to programmer-visible registers and memory.
Keywords
• Reservation station A buffer within a functional unit that holds the operands and the operation.
• Reorder buffer The buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.
• In-order commit A commit in which the results of pipelined execution are written to the programmer-visible state in the same order that instructions are fetched.
• Out-of-order execution A situation in pipelined execution when an instruction blocked from executing does not cause the following instructions to wait.
Figure 6.44 Static two-issue pipeline in operation.

Instruction type             Pipe stages
ALU or branch instruction    IF  ID  EX  MEM  WB
Load or store instruction    IF  ID  EX  MEM  WB
ALU or branch instruction        IF  ID  EX   MEM  WB
Load or store instruction        IF  ID  EX   MEM  WB
ALU or branch instruction            IF  ID   EX   MEM  WB
Load or store instruction            IF  ID   EX   MEM  WB
ALU or branch instruction                IF   ID   EX   MEM  WB
Load or store instruction                IF   ID   EX   MEM  WB

(Each ALU/branch and load/store pair issues together; successive pairs start one clock cycle later.)
Figure 6.45 A static two-issue datapath.
2004 Morgan Kaufmann Publishers
80
Figure 6.46 The scheduled code as it would look on a two-issue MIPS pipeline.

       ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:                              lw   $t0, 0($s1)            1
       addi $s1, $s1, -4                                       2
       addu $t0, $t0, $s2                                      3
       bne  $s1, $zero, Loop       sw   $t0, 4($s1)            4
Figure 6.47 The unrolled and scheduled code of figure 6.46 as it would look on a static two-issue pipeline.

       ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:  addi $s1, $s1, -16          lw   $t0, 0($s1)            1
                                   lw   $t1, 12($s1)           2
       addu $t0, $t0, $s2          lw   $t2, 8($s1)            3
       addu $t1, $t1, $s2          lw   $t3, 4($s1)            4
       addu $t2, $t2, $s2          sw   $t0, 16($s1)           5
       addu $t3, $t3, $s2          sw   $t1, 12($s1)           6
                                   sw   $t2, 8($s1)            7
       bne  $s1, $zero, Loop       sw   $t3, 4($s1)            8
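Reading throughput off the two tables (a rough comparison, not a model of the whole machine):

# MIPS instructions issued per clock; best case for this two-issue machine is 2.0.
ipc_scheduled = 5 / 4     # figure 6.46: 5 instructions in 4 clock cycles = 1.25
ipc_unrolled  = 14 / 8    # figure 6.47: 14 instructions in 8 clock cycles = 1.75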
Figure 6.48 A summary of the characteristics of the Itanium and Itanium 2, Intel's first two implementations of the IA-64 architecture.

Processor  Max. instr.    Functional units  Max. ops.  Max. clock  Transistors  Power    SPEC     SPEC
           issues/clock                     per clock  rate        (millions)   (watts)  int2000  fp2000
Itanium    6              4 integer/media   9          0.8 GHz     25           130      379      701
                          2 memory
                          3 branch
                          2 FP
Itanium 2  6              6 integer/media   11         1.5 GHz     221          130      810      1427
                          4 memory
                          3 branch
                          2 FP
Improving Performance
• Try and avoid stalls! E.g., reorder these instructions:

  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)

• Dynamic Pipeline Scheduling
  – Hardware chooses which instructions to execute next
  – Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
  – Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect)
• Trying to exploit instruction-level parallelism
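A small, purely illustrative Python check of that exercise. Each tuple is (opcode, destination, source registers), and a stall is counted whenever an instruction uses a register loaded by the immediately preceding lw. Swapping the two sw instructions is one reordering that removes the stall.

def load_use_stalls(seq):
    stalls = 0
    for prev, cur in zip(seq, seq[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

original  = [("lw", "$t0", ["$t1"]),
             ("lw", "$t2", ["$t1"]),
             ("sw", None,  ["$t2", "$t1"]),
             ("sw", None,  ["$t0", "$t1"])]
reordered = [("lw", "$t0", ["$t1"]),
             ("lw", "$t2", ["$t1"]),
             ("sw", None,  ["$t0", "$t1"]),
             ("sw", None,  ["$t2", "$t1"])]

print(load_use_stalls(original))   # 1 -- sw $t2 immediately follows lw $t2
print(load_use_stalls(reordered))  # 0 -- swapping the two sw removes the stall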
Figure 6.49 The three primary units of a dynamically
scheduled pipeline.
2004 Morgan Kaufmann Publishers
85
Advanced Pipelining
• Increase the depth of the pipeline
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• "Superscalar" processors
  – DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
• All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes")
• VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
• This class has given you the background you need to learn more!
6.10 Real Stuff: The Pentium 4 Pipeline
2004 Morgan Kaufmann Publishers
87
Keywords
• Microarchitecture The organization of the processor, including the major functional units, their interconnection, and control.
• Architectural registers The instruction set visible registers of a processor; for example, in MIPS, these are the 32 integer and 32 floating-point registers.
Figure 6.50 The microarchitecture of the Intel Pentium 4.
2004 Morgan Kaufmann Publishers
89
Figure 6.51 The Pentium 4 pipeline showing the pipeline flow for a typical
instruction and the number of clock cycles for the major steps in the pipeline.
2004 Morgan Kaufmann Publishers
90
6.11 Fallacies and Pitfalls
2004 Morgan Kaufmann Publishers
91
• Fallacy: Pipelining is easy.
• Fallacy: Pipelining ideas can be implemented independent of
technology.
• Pitfall: Failure to consider instruction set design can adversely impact
pipelining.
2004 Morgan Kaufmann Publishers
92
6.12 Concluding Remarks
2004 Morgan Kaufmann Publishers
93
Keywords
• Instruction latency The inherent execution time for an instruction.
Chapter 6 Summary
• Pipelining does not improve latency, but does improve throughput
[Figure: two charts comparing the designs of Chapters 5 and 6 — single-cycle (Section 5.4), multicycle (Section 5.5), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10). The first orders them from slower to faster in instructions per clock (IPC = 1/CPI); the second orders them by use latency in instructions, from 1 to several.]