MIPS Pipelining
Chapter 4
Sections 4.5 – 4.8
Dr. Iyad F. Jafar
Outline
Introduction
Why Pipelining?
MIPS Pipelined Datapath
MIPS Pipelined Control
Pipelining Hazards
Structural Hazards
Data Hazards
Control Hazards
Exceptions and Interrupts
Fallacies and Pitfalls
Reading Assignment
Introduction
Single-cycle datapath
Simple!
Hardware replication?
Cycle time?
Multi-cycle datapath
More involved
Less HW replication of major units
Better performance if the delay of major functional
units is balanced!
Can we do any better?
Pipelining!
Pipelining
Introduction
In Multi-cycle, only one major unit is used in each
cycle while other units are idle!
Why not use them to do something else?
Basically, start the next instruction before the
current one is finished!
          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
LW        IFetch   Dec      Exec     Mem      WB
SW                 IFetch   Dec      Exec     Mem      WB
R-Type                      IFetch   Dec      Exec     Mem      WB
Introduction
Pipelining
The time required to execute one instruction
(instruction latency) is not affected!
However, the number of instructions finished per
unit time (Throughput) is increased
Thus, Pipelining improves the throughput not
latency!
Most modern processors are pipelined!
Notes
As in multi-cycle, the cycle time is determined by
the slowest unit!
However, similar to single-cycle, we can get one
instruction done every cycle!
It is assumed that all instructions take the same
number of cycles!
Introduction
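[Figure: clock timing of lw, sw, and an R-type instruction under three implementations]
Single Cycle Implementation: each instruction takes one long clock cycle (Cycle 1, Cycle 2, ...); part of the sw cycle is wasted.
Multiple Cycle Implementation: lw takes IFetch, Dec, Exec, Mem, WB; sw then takes IFetch, Dec, Exec, Mem; the R-type instruction starts after that (Cycle 1 to Cycle 10).
Pipeline Implementation: lw, sw, and the R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, with each instruction starting one cycle after the previous one.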
Why Pipelining?
For Performance!
[Figure: Inst 1 to Inst 5, in instruction order, flowing through IM, Reg, ALU, DM, and Reg over time (clock cycles); the first cycles are the time to fill the pipeline]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1 (similar to single-cycle)
Why Pipelining?
Example 1. Comparing pipelining to single-cycle
Consider a program that consists of a large number of LOAD
instructions only that is executed on a single-cycle CPU and 5-stage
pipelined CPU with the operation time for the major units (memory,
ALU, and register file) to be 200 ps in both cases.
1) Determine the time required to finish executing 1,000,000 LOAD
instructions and compute the speed up of pipelining.
2) Determine the time required to finish executing the first 3 LOAD
instructions
3) Repeat (1) and (2) if the delay of the register file is 100 ps instead
of 200 ps.
Cycle times for the two implementations
CCSC = 200 + 200 + 200 + 200 + 200 = 1000 ps
CCPP = 200 ps
Why Pipelining?
Example 1. Comparing pipelining to single-cycle
1) Determine the time required to finish executing 1,000,000
LOAD instructions and compute the speed up of pipelining.
Single-cycle
Pipelining
TimeSC = 1000 ps x 1000000 = 1,000,000,000 ps
TimePP = 1000 ps + 200 ps x 999999 = 200,000,800 ps
Speedup = 1,000,000,000 / 200,000,800 = 4.99998
(very close to the number of stages)
Why Pipelining?
Example 1. Comparing pipelining to single-cycle
2) Determine the time required to finish executing the first 3
LOAD instructions and compute the speed up of pipelining
Single-cycle
TimeSC = 1000 x 3 = 3000 ps
Pipelining
TimePP = 200 x 5 +200 + 200 = 1400 ps
Speedup = 3000 / 1400 = 2.14
(less than the number of stages)
Why Pipelining?
Example 1. Comparing pipelining to single-cycle
3) Repeat (1) and (2) if the delay of the register file is 100 ps .
CCSC = 200 + 100 + 200 + 200 + 100 = 800 ps
CCPP = 200 ps
For 1,000,000 instructions
TimeSC = 800 x 1,000,000 = 800,000,000 ps
TimePP = 1000+ 200x999,999 = 200,000,800ps
Speedup = 800,000,000 / 200,000,800 = 3.99998 (<5)
For 3 instructions
TimeSC = 800 x 3 = 2400 ps
TimePP = 1000 + 200x 2 = 1400 ps
Speedup = 2400 / 1400 = 1.71 (<5)
Why Pipelining?
Example 1. Summary
Ideally, the pipelined version is n times faster than the single-cycle version, where n is the number of pipeline stages.
In the 5-stage MIPS, the pipelined version would be 5 times
faster.
When the pipeline is full, the throughput will be one instruction
per cycle
Many factors affect pipelining performance
Time to fill and empty the pipeline
Number of instructions to execute
Unbalanced delay of pipeline stages
Instruction mix
Pipeline hazards
Ideally, the number of cycles required to finish M instructions
in an N-stage pipeline is N + M – 1
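The N + M - 1 rule can be checked against the numbers of Example 1 with a short script. This is only a sketch: the function names are made up for illustration, and it assumes an ideal pipeline with no hazards, a single-cycle clock equal to the sum of the unit delays, and a pipelined clock equal to the slowest unit.

```python
def pipeline_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles to finish M instructions on an ideal N-stage pipeline: N + M - 1."""
    return num_stages + num_instructions - 1


def single_cycle_time_ps(num_instructions: int, unit_delays_ps: list[int]) -> int:
    # The single-cycle clock period is the sum of all unit delays used by a load.
    return num_instructions * sum(unit_delays_ps)


def pipelined_time_ps(num_instructions: int, unit_delays_ps: list[int]) -> int:
    # The pipelined clock period is set by the slowest unit.
    cycle_ps = max(unit_delays_ps)
    return pipeline_cycles(num_instructions, len(unit_delays_ps)) * cycle_ps


balanced = [200, 200, 200, 200, 200]     # IF, ID, EX, MEM, WB (all 200 ps)
unbalanced = [200, 100, 200, 200, 100]   # register file takes 100 ps

for delays in (balanced, unbalanced):
    for m in (3, 1_000_000):
        t_sc = single_cycle_time_ps(m, delays)
        t_pp = pipelined_time_ps(m, delays)
        print(f"{m:>9} loads: SC = {t_sc} ps, PP = {t_pp} ps, "
              f"speedup = {t_sc / t_pp:.5f}")
```

The speedup approaches the number of stages only when M is large and the stage delays are balanced.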
Pipelined MIPS Datapath
What do we need to implement pipelining?
We need to consider the following:
1. The execution of instructions is divided into 5 stages
(cycles): Instruction fetch (IF), Instruction decode (ID),
Execute (EX), Memory Access (MEM), Write Back (WB)
2. Instruction flow is from left to right except in two cases
In the write-back stage where the result is written into the register
file in the middle of the datapath
Choosing between the incremented PC and the branch address in
the MEM stage
3. In pipelining, all units are operating in every cycle; thus we
have to duplicate hardware where needed
4. Since the execution is over multiple cycles, we need to add
State (Pipeline) registers between stages to preserve
intermediate data and control for each instruction.
These registers hold the values to be used in later stages as long
as they are needed.
Pipelined MIPS Datapath
[Figure: pipelined MIPS datapath; the IF, ID, EX, MEM, and WB stages are separated by the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB, which are loaded by the system clock]
Any problem?
Pipelined MIPS Datapath
[Figure: the same pipelined MIPS datapath (IF, ID, EX, MEM, WB with the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB registers), annotated with the note below]
Need to preserve the destination register!
Pipelined MIPS Datapath
Example 2. Execution of LW instruction
(1) Instruction Fetch: Put PC and the loaded instruction in the
IF/ID register
(2) Instruction Decode and Read Registers: Store Reg[rs],
Reg[rt], sign-extended offset, rd, rt, and the updated PC (why?) in the
ID/EX register
(3) Execute or Address Calculation: Store branch address,
Reg[rt], result, and zero flag in the EX/MEM register
(4) Memory Access: Store the data from memory into the
MEM/WB register
(5) Write Back: Copy the data loaded in the MEM/WB
register to the register file
Pipelined MIPS Datapath
Required data fields in the pipelining registers
Data fields are moved from one pipeline register to
another every clock cycle until they are no longer
needed
Pipeline Register | Data Fields | Register Size
IF/ID | Instruction and PC | 64 bits
ID/EX | PC, Reg[rs], Reg[rt], sign-extended offset, rt, rd | 138 bits
EX/MEM | Branch address, Zero, ALU result, Reg[rt], Destination register address (rt or rd) | 102 bits
MEM/WB | ALU Result, Data from memory, Destination register address | 69 bits
Pipelined MIPS Control
All control signals can be determined during Decode stage while they
are needed in later stages!
Solution! Expand the pipeline registers to store and move the control
signals between stages until they are needed
Pipelined MIPS Control
Define the control signals and generate them in the decode stage
For the time being, no explicit write signals are required for the
pipeline registers since they are updated every cycle
Pipelined MIPS Control
Control signals needed in each stage
Pipeline Stage | Control signals
IF | None
ID | None
EX | RegDst, ALUOp1, ALUOp0, ALUSrc
MEM | Branch, MemRead, MemWrite
WB | MemtoReg, RegWrite
[Table: control signal values based on instruction type]
MIPS Pipeline
Example 3. Given the code segment and the register
contents below, show the contents of the data and control
fields in the pipeline registers if the sixth instruction has
been fetched (i.e. the beginning of cycle 7)
Address | Instruction
0x00000000 | lw $10, 20($1)
0x00000004 | sub $11,$1,$2
0x00000008 | add $12,$3,$4
0x0000000c | lw $13, 24($1)
0x00000010 | add $3,$2,$1
0x00000014 | sub $1,$5,$6

Register | Contents
$1 | 1
$2 | 5
$3 | 3
$4 | -6
$5 | 2
$6 | 7
$11 | 12
$12 | -15
$13 | 10
MIPS Pipeline
Example 3. Multi-cycle diagram
[Figure: multi-cycle pipeline diagram over time; in instruction order, lw $10, 20($1); sub $11,$1,$2; add $12,$3,$4; lw $13, 24($1); add $3,$2,$1; sub $1,$5,$6 each pass through IM, Reg, ALU, DM, and Reg, one cycle apart]
MIPS Pipeline
Example 3. Single-cycle diagram
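[Figure: single-cycle pipeline diagram showing the instructions occupying the datapath in one clock cycle: sub $1,$5,$6; add $3,$2,$1; lw $13, 24($1); add $12,$3,$4; sub $11,$1,$2]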
MIPS Pipeline
Example 3.
At the beginning of cycle 7, the sixth instruction is stored
in the IF/ID register while the data and control for earlier
instructions are pushed to next pipeline registers and the
register files. Thus,
IF/ID register
No control signals are stored
Store the instruction sub $1,$5,$6 and PC+4
IF/ID.Instruction = 0x00A60822
IF/ID.PC = 0x00000018
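The instruction word stored in IF/ID can be reproduced by packing the R-type fields. The sketch below uses the standard MIPS R-type layout (opcode, rs, rt, rd, shamt, funct); the helper names are hypothetical. It also shows why the ID/EX.SignExtend field listed for add $3,$2,$1 is 0x00001820: the sign-extension unit works on the low 16 bits of every instruction, even when they are not an offset.

```python
FUNCT_ADD = 0x20
FUNCT_SUB = 0x22


def encode_r_type(rs: int, rt: int, rd: int, funct: int, shamt: int = 0) -> int:
    """Pack a MIPS R-type instruction (opcode 0) into a 32-bit word."""
    return (0 << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def sign_extend_16(word: int) -> int:
    """Sign-extend the low 16 bits of an instruction word to 32 bits."""
    low = word & 0xFFFF
    return low | 0xFFFF0000 if low & 0x8000 else low


sub_word = encode_r_type(rs=5, rt=6, rd=1, funct=FUNCT_SUB)   # sub $1,$5,$6
add_word = encode_r_type(rs=2, rt=1, rd=3, funct=FUNCT_ADD)   # add $3,$2,$1

print(f"IF/ID.Instruction = 0x{sub_word:08X}")
print(f"ID/EX.SignExtend  = 0x{sign_extend_16(add_word):08X}")
```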
MIPS Pipeline
Example 3.
ID/EX register
Store the information of add $3,$2,$1 and PC+4
ID/EX.PC = 0x00000014
ID/EX.RegRsContents = 0x00000005
ID/EX.RegRtContents = 0x00000001
ID/EX.RegRt = (00001)2
ID/EX.RegRd = (00011)2
ID/EX.SignExtend = 0x00001820
Control Information
ID/EX.MemToReg = 0
ID/EX.RegWrite = 1
ID/EX.MemRead = 0
ID/EX.MemWrite = 0
ID/EX.Branch = 0
ID/EX.ALUSrc = 0
ID/EX.RegDst = 1
ID/EX.ALUOp = (10)2
MIPS Pipeline
Example 3.
EX/MEM register
Store the information of lw $13,24($1), branch address,
and memory address
EX/MEM.BranchAddress = 0x00000070
EX/MEM.ALUOut = 0x00000019
EX/MEM.Zero = 0
EX/MEM.RegDestination= (01101)2
EX/MEM.RegRtContents = 0x0000000A
Control Information
EX/MEM.MemToReg = 1
EX/MEM.RegWrite = 1
EX/MEM.MemRead = 1
EX/MEM.MemWrite = 0
EX/MEM.Branch = 0
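The EX/MEM data values above follow from the arithmetic done in the EX stage. A small sketch, assuming the branch adder computes PC+4 plus the offset shifted left by 2 for every instruction (as in the datapath figures) and that all values wrap to 32 bits; the function name is illustrative only.

```python
MASK32 = 0xFFFFFFFF


def lw_ex_stage(pc_plus_4: int, base_value: int, offset: int) -> tuple[int, int]:
    """Return (branch address, ALU output) produced in EX for a load."""
    branch_address = (pc_plus_4 + (offset << 2)) & MASK32   # adder + shift left 2
    alu_out = (base_value + offset) & MASK32                 # memory address
    return branch_address, alu_out


# lw $13, 24($1) sits at address 0x0000000c, so PC+4 = 0x10, and $1 = 1.
branch_address, alu_out = lw_ex_stage(pc_plus_4=0x10, base_value=1, offset=24)
print(f"EX/MEM.BranchAddress = 0x{branch_address:08X}")
print(f"EX/MEM.ALUOut        = 0x{alu_out:08X}")

# The same 32-bit wrap explains the add $12,$3,$4 result: 3 + (-6) = -3.
add_result = (3 + (-6)) & MASK32
print(f"add $12,$3,$4 result = 0x{add_result:08X}")
```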
MIPS Pipeline
Example 3.
MEM/WB register
Store the information of add $12, $3,$4, addition
result, and data memory
MEM/WB.RegDestination= (01100)2
MEM/WB.ALUOut = 0xFFFFFFFD
MEM/WB.MemoryData = XXXX
Control Information
MEM/WB.MemToReg = 0
MEM/WB.RegWrite = 1
For sub $11,$1,$2
It will be writing (1 - 5) = -4 to $11
Pipelining Hazards
In general, pipelining is effective!
The MIPS ISA makes it even easier
All instructions are of the same length (32 bits)
Can fetch the next instruction while the current one is being decoded
Few instruction formats with symmetry across them
Can read the register file in the 2nd stage
Memory access is through the Load and Store instructions
Can use the execute stage to compute the address
Each MIPS instruction writes at most one result in the MEM
or WB stage
Is it that easy? Any complications?
YES!
PIPELINING HAZARDS !
Pipelining Hazards
Hazards - problems that might occur during pipeline operation
Three basic sources
Structural Hazards
In pipelining, all functional units are used in any cycle
What if two instructions use the same functional unit in the same cycle?
Data Hazards
In pipelining, execution of instructions is overlapped
What if the operand(s) of some instruction comes from an earlier
instruction that is still in the pipeline?
Control Hazards
In pipelining, an instruction is fetched every cycle
What if an instruction is a jump or a branch instruction that evaluates to
true? The following instruction(s) in the pipeline might not be correct!
Simple Solution?
Wait until the issue is resolved!
Structural Hazards
Single Memory!
Reading from memory twice in the same cycle!
[Figure: lw followed by Inst 1 to Inst 4 over time (clock cycles), each passing through Mem, Reg, ALU, Mem, Reg; the data access of lw and the fetch of a later instruction need the single memory in the same cycle]
Solution: Use two memories; Data and Instruction!
Structural Hazards
Single Register File!
One instruction is writing and the other is reading the register file?
[Figure: add $1, followed by Inst 1, Inst 2, and add $2,$1, over time (clock cycles); one clock edge controls register writing and another controls the loading of the pipeline state registers]
Solution: Design the register file to write in the first half of the cycle and read in the second half!
Data Hazards
[Figure: add $1, followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each passing through IM, Reg, ALU, DM, Reg; all of the following instructions read $1]
• Dependencies backward in time cause hazards
• This is called Read-after-Write (RAW) data hazard
• Register-use data hazard
Solution?
Data Hazards
Simply, wait for the earlier instruction to finish! This is
called stalling the pipeline! However, this affects the CPI?
[Figure: add $1, followed by two stall cycles, then sub $4,$1,$5 and and $6,$1,$7]
Do we need two stalls all the time?
Data Hazards
[Figure: lw $1,5($s1) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each passing through IM, Reg, ALU, DM, Reg; all of the following instructions read $1]
• Dependencies backward in time cause hazards
• It is a Read-after-Write (RAW) data hazard
• Load-use data hazard
Solution?
Data Hazards
Again, wait for the LW instruction to finish by stalling the
pipeline! However, this affects the CPI?
[Figure: lw $1, followed by two stall cycles, then sub $4,$1,$5 and and $6,$1,$7]
Data Hazards
Example 4. how many cycles are actually required to
execute the following code? Assume the pipeline is
already full.
add $1, $2, $5
add $5, $3, $1
sub $10, $7, $8
sub $5, $6, $7
lw $3, 45($9)
add $3, $3, $8
Ideally, and since the pipeline is full, each instruction
requires 1 cycle. Thus, we need 6 cycles (CPI = 6/6 = 1).
However, …
Register-use data hazard (add $1, $2, $5 followed by add $5, $3, $1)
Adds 2 cycles by stalls
Load-use data hazard (lw $3, 45($9) followed by add $3, $3, $8)
Adds 2 cycles by stalls
Thus, 10 cycles are needed.
CPI = 10/6 = 1.667 ??
Performance ??
Can we do any better?
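The stall count of Example 4 can be reproduced with a small model of a pipeline without forwarding. This is a sketch under stated assumptions, not the hardware itself: each instruction is reduced to a destination and a list of sources, a writer reaches WB three cycles after its ID stage, and the register file writes in the first half of a cycle and reads in the second half, so a reader may be in ID in the same cycle the writer is in WB. The function name and the tuple format are made up for illustration.

```python
def count_stalls_no_forwarding(program):
    """program: list of (dest_register, [source_registers]) in program order."""
    id_cycle_of_last_writer = {}   # register -> ID-stage cycle of its latest writer
    prev_id_cycle = -1
    stalls = 0
    for dest, sources in program:
        earliest = prev_id_cycle + 1          # ID cycle if nothing stalls
        id_cycle = earliest
        for src in sources:
            if src in id_cycle_of_last_writer:
                # The reader's ID stage must not come before the writer's WB stage.
                id_cycle = max(id_cycle, id_cycle_of_last_writer[src] + 3)
        stalls += id_cycle - earliest
        if dest is not None:
            id_cycle_of_last_writer[dest] = id_cycle
        prev_id_cycle = id_cycle
    return stalls


example4 = [
    ("$1", ["$2", "$5"]),    # add $1, $2, $5
    ("$5", ["$3", "$1"]),    # add $5, $3, $1
    ("$10", ["$7", "$8"]),   # sub $10, $7, $8
    ("$5", ["$6", "$7"]),    # sub $5, $6, $7
    ("$3", ["$9"]),          # lw  $3, 45($9)
    ("$3", ["$3", "$8"]),    # add $3, $3, $8
]

stalls = count_stalls_no_forwarding(example4)
cycles = len(example4) + stalls              # pipeline assumed already full
print(f"stalls = {stalls}, cycles = {cycles}, CPI = {cycles / len(example4):.3f}")
```

The model ignores the special role of register $0 and treats every listed source as a true dependence.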
Data Hazards
Fixing Register-use Hazard by Forwarding
Note that data produced by an instruction and needed by a
later instruction is pushed through the pipeline registers until
it is saved into the register file !
Why not read the data from the pipeline registers before it
is stored?
This is called forwarding!
What is required?
Need to detect the hazard
Is any of the source registers for the instruction the same as the
destination register for an earlier instruction that is still in the
pipeline?
Need to create a path to pass the data between pipeline stages
Instead of reading the source registers of the instruction from
the register file, read them from the pipeline registers
Data Hazards
Fixing Register-use Hazard by Forwarding
[Figure: add $1, followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, with the result of add forwarded from the pipeline registers to the instructions that need it]
No Stalls!
Data Hazards
Forwarding Hardware implementation
Note that forwarding could be from EX/MEM or from MEM/WB! Why?
Data Hazards
Forwarding Hardware implementation
Inside the forwarding unit
(1) Forwarding from EX/MEM (MEM Stage)
if (EX/MEM.RegWrite
and (EX/MEM.RegRd != 0)
and (EX/MEM.RegRd = ID/EX.RegRs))
then ForwardA = From EX/MEM
if (EX/MEM.RegWrite
and (EX/MEM.RegRd != 0)
and (EX/MEM.RegRd = ID/EX.RegRt))
then ForwardB = From EX/MEM
Why check the RegWrite signal?
Why check for the zero register?
Data Hazards
Forwarding Hardware implementation
Inside the forwarding unit
(2) Forwarding from MEM/WB (WB Stage)
if (MEM/WB.RegWrite
and (MEM/WB.RegRd != 0)
and (MEM/WB.RegRd = ID/EX.RegRs))
then ForwardA = From MEM/WB
if (MEM/WB.RegWrite
and (MEM/WB.RegRd != 0)
and (MEM/WB.RegRd = ID/EX.RegRt))
then ForwardB = From MEM/WB
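The two sets of conditions can be combined into one selection function per ALU operand. The sketch below is hedged: the class and function names are invented for illustration, and giving EX/MEM priority when both pipeline registers match is an assumption (the usual textbook choice, since EX/MEM holds the more recent result) that the conditions above do not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class StageRegs:
    reg_write: bool   # RegWrite control bit carried in the pipeline register
    reg_rd: int       # destination register number carried in the pipeline register


def forward_select(src_reg: int, ex_mem: StageRegs, mem_wb: StageRegs) -> str:
    """Choose the source of one ALU operand for the instruction in EX."""
    # Assumption: EX/MEM is checked first because it holds the newer value.
    if ex_mem.reg_write and ex_mem.reg_rd != 0 and ex_mem.reg_rd == src_reg:
        return "EX/MEM"
    if mem_wb.reg_write and mem_wb.reg_rd != 0 and mem_wb.reg_rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


# add $1,... is in MEM, an unrelated instruction writing $9 is in WB,
# and sub $4,$1,$5 is in EX (ID/EX.RegRs = 1, ID/EX.RegRt = 5).
ex_mem = StageRegs(reg_write=True, reg_rd=1)
mem_wb = StageRegs(reg_write=True, reg_rd=9)
forward_a = forward_select(1, ex_mem, mem_wb)
forward_b = forward_select(5, ex_mem, mem_wb)
print(f"ForwardA = {forward_a}, ForwardB = {forward_b}")
```

Calling the function once with ID/EX.RegRs and once with ID/EX.RegRt gives ForwardA and ForwardB.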
Data Hazards
Can the forwarding hardware be used with Load-use
data hazard?
[Figure: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each passing through IM, Reg, ALU, DM, Reg]
We still need 1 stall for the instruction following the load.
Data Hazards
How to stall the pipeline?
Stall is required when the instruction in the EX stage is Load and
the one in the ID stage depends on the loaded value
The Load instruction moves normally to EX/MEM on the next
cycle
The conflicting instruction (the instruction following the load)
should stay in the decode stage? How?
Don’t write the IF/ID register: need an IF/IDWrite signal
Don’t update the PC: need a PCWrite signal
The control signals of the instruction in the decode stage are stored as
0’s (WHY?) in the ID/EX register: need a multiplexor for the control signals
Controlling the process requires a special unit: the Hazard Detection Unit
Data Hazards
Stall Implementation
Inside hazard detection unit
if (ID/EX.MemRead
and [(ID/EX.RegRt == IF/ID.RegRs) or
(ID/EX.RegRt == IF/ID.RegRt)])
then
PCWrite = 0
IF/IDWrite = 0
Select 0’s as control signals
Any Problem?
Do we need to stall in all cases?
How about j and jal that come immediately after load with rs and/or rt
fields being the same as the rt field of the load?
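The stall condition above translates directly into a small function. This is a sketch with hypothetical names; it compares raw register fields exactly as the condition does, so it also stalls for an instruction (such as j or jal) whose bits in the rs/rt positions happen to match the load's rt, which is the case the questions above point at.

```python
def hazard_detection(id_ex_mem_read: bool, id_ex_rt: int,
                     if_id_rs: int, if_id_rt: int) -> dict:
    """Return the outputs of the hazard detection unit for one cycle."""
    stall = bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))
    return {
        "PCWrite": 0 if stall else 1,        # 0 keeps the PC unchanged
        "IF/IDWrite": 0 if stall else 1,     # 0 keeps the instruction in ID
        "ZeroControls": 1 if stall else 0,   # 1 selects 0's for the control signals
    }


# lw $1,4($2) is in EX (ID/EX.RegRt = 1); sub $4,$1,$5 is in ID (rs = 1, rt = 5).
load_use = hazard_detection(id_ex_mem_read=True, id_ex_rt=1, if_id_rs=1, if_id_rt=5)
# The same sub behind an ALU instruction: no stall, forwarding handles it.
alu_use = hazard_detection(id_ex_mem_read=False, id_ex_rt=1, if_id_rs=1, if_id_rt=5)
print(load_use)
print(alu_use)
```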
Data Hazards
Example 5. Consider the following code segment in C
A=B+E
C=B+F
(1) Generate the MIPS code assuming that variables
A, B, C, E, and F are in memory and addressable with
offsets 0, 4, 8, 12, and 16 from $t0
(2) Find all the data hazards and determine the
number of cycles required to run the code. Assume
forwarding is implemented.
(3) Can you reorder the code to reduce the stalls ?
Data Hazards
Example 5.
lw  $t1, 4($t0)      # loads B
lw  $t2, 12($t0)     # loads E
add $t3, $t1, $t2    # A = B + E
sw  $t3, 0($t0)      # stores A
lw  $t4, 16($t0)     # loads F
add $t5, $t1, $t4    # C = B + F
sw  $t5, 8($t0)      # stores C
Ideally, each instruction requires 1 cycle after the
pipeline is full. Thus, we need (5+7-1) cycles.
CPI = 11/7 = 1.57
Load-use data hazard (lw $t2 followed by add $t3)
Adds 1 cycle as a stall
Load-use data hazard (lw $t4 followed by add $t5)
Adds 1 cycle as a stall
Thus, 13 cycles are needed.
CPI = 13/7 = 1.86 ??
Performance ??
Data Hazards
Example 5. Reducing stalls by instruction reordering
lw  $t1, 4($t0)      # loads B
lw  $t2, 12($t0)     # loads E
lw  $t4, 16($t0)     # loads F (moved up from after the first sw)
add $t3, $t1, $t2    # A = B + E
sw  $t3, 0($t0)      # stores A
add $t5, $t1, $t4    # C = B + F
sw  $t5, 8($t0)      # stores C
Moving this instruction fills the first stall and eliminates
the second one!
Thus, 11 cycles are needed.
CPI = 11/7 = 1.57
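Both cycle counts of Example 5 can be checked with a short script. It is a sketch that assumes forwarding removes every stall except the one between a load and an immediately following instruction that reads the loaded register, and that the pipeline starts empty (hence the 5 + 7 - 1 base). The tuple format and function names are invented for illustration.

```python
def load_use_stalls(program):
    """program: list of (opcode, dest_register_or_None, [source_registers])."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        # With forwarding, only a load followed directly by a user stalls.
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1
    return stalls


def total_cycles(program, num_stages=5):
    return num_stages + len(program) - 1 + load_use_stalls(program)


original = [
    ("lw", "$t1", ["$t0"]),            # loads B
    ("lw", "$t2", ["$t0"]),            # loads E
    ("add", "$t3", ["$t1", "$t2"]),    # A = B + E
    ("sw", None, ["$t3", "$t0"]),      # stores A
    ("lw", "$t4", ["$t0"]),            # loads F
    ("add", "$t5", ["$t1", "$t4"]),    # C = B + F
    ("sw", None, ["$t5", "$t0"]),      # stores C
]
# Reordered version: the load of F is moved up, right after the load of E.
reordered = original[:2] + [original[4]] + original[2:4] + original[5:]

for name, program in (("original", original), ("reordered", reordered)):
    cycles = total_cycles(program)
    print(f"{name}: {cycles} cycles, CPI = {cycles / len(program):.2f}")
```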
Data Hazards
Example 6. Assume that the pipelined MIPS processor
without forwarding is used to run a program with the
following instruction mix: 20% loads, 20% store, and 60%
ALU. Then compute the average CPI given that
10% of the ALU instructions result in load-use hazards.
15% of the ALU instructions result in register-use (read-after-write) hazards.
Solution
Ideally, the average CPI is 1 for each instruction
With no forwarding
Load-use hazards add two cycles
Register-use hazards add two cycles
Average CPI = 0.2 x 1 + 0.2 x 1 + 0.75 x 0.60 x 1 +
0.1 x 0.60 x 3 + 0.15 x 0.60 x 3 = 1.30
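The weighted sum can be written out term by term. A minimal sketch, assuming (as the solution does) that each hazard adds two stall cycles to the one base cycle and that the remaining 75% of ALU instructions incur no hazard; the labels are only for readability.

```python
STALL_CYCLES = 2   # cycles added per hazard when there is no forwarding

mix = [
    # (description, fraction of all instructions, cycles per instruction)
    ("load", 0.20, 1),
    ("store", 0.20, 1),
    ("ALU, no hazard", 0.60 * 0.75, 1),
    ("ALU, load-use hazard", 0.60 * 0.10, 1 + STALL_CYCLES),
    ("ALU, register-use hazard", 0.60 * 0.15, 1 + STALL_CYCLES),
]

total_fraction = sum(fraction for _, fraction, _ in mix)
average_cpi = sum(fraction * cycles for _, fraction, cycles in mix)
print(f"fractions sum to {total_fraction:.2f}, average CPI = {average_cpi:.2f}")
```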
Control Hazards
For the pipelined datapath designed so far, the
branch address and decision are known by the end of
the MEM stage
Instructions following the branch instruction in the
pipeline are not correct if the branch evaluates to true!
If the branch is true, then these instructions should be
removed from the pipeline and execution should
continue from the branch address
Otherwise, no action is required!
This is a dependency backward in time: a Control Hazard
[Figure: a branch instruction followed by Inst1, Inst2, and Inst3 in the pipeline]
Control Hazards
Solution!
Once it is known that the instruction is a branch, stall the pipeline for 3
cycles? Is it actually a stall?
Control Hazards
[Figure: beq followed by three stall cycles and then two further instructions]
Fetching from instruction memory is either from PC+4 or from the branch address,
depending on the branch result
Are these actual stalls? Why not start the execution of the following
instructions normally and, if the branch is true, flush these instructions?!
Control Hazards
Reducing the Cost of Branch Hazard
Note that three cycles are lost if the branch evaluates to
true in order to remove the three instructions following
the branch instruction!
This could affect the performance significantly!
Can we reduce this cost?
Move the branch address computation to the decode stage
Add additional hardware to compare the two registers in the ID
stage!
Whenever there is a branch instruction in the ID/EX register
(ID/EX.branch =1), flush the instruction in the IF/ID register.
The branch penalty in this case will be 1 cycle instead of 3 cycles!
[Figure: beq followed by one stall cycle and then lw]
Modifying the Hazard Detection Unit
IF (ID/EX.Branch) then Flush IF/ID register
Note that we lose one cycle whenever a branch
instruction is encountered!
Can we do any better?
Control Hazards
Reducing the Cost of Branch Hazard
Approach I – Static Branch Prediction
Always predict the branch as Not Taken and start fetching the
instruction following the branch
If the branch evaluates to Not Taken, then the prediction is
correct and no further actions are required!
If the branch evaluates to Taken, then the prediction is not
correct! Remove the fetched instruction and start fetching from
the branch address
In this approach, we only lose one cycle if the prediction is not
correct
Inside the hazard detection unit
IF (ID/EX.Branch) and (ID/EX.ZERO) Then Flush IF/ID register
Control Hazards
Reducing the Cost of Branch Hazard
Approach II – Dynamic Branch Prediction
Prediction could be Taken or Not Taken
If the branch is predicted as Not Taken
Fetch the next instruction
If prediction is false, flush the instruction. One cycle is lost!
If branch is predicted as Taken
Fetch the instruction from the branch address
If prediction is false, flush and fetch from PC+4
How to store branch prediction?
Use Branch History Table or Branch Prediction Buffer
The table is addressable by the lower bits of the branch instruction
address
If branch is predicted as taken, we need to wait for the
branch address to be computed?
Use Branch Target Buffer
Control Hazards
Approach II – Dynamic Branch Prediction
1-bit Branch Predictor
Basically we have two states (Taken and Not Taken)
One bit is used to store the prediction
Prediction state is changed when prediction is wrong
Performance Issues
Consider branching in loops? EXAMPLE?
Control Hazards
Approach II – Dynamic Branch Prediction
2-bit Branch Predictor
Basically we have four states
two bits are used to store the prediction
Prediction state is changed when prediction is wrong twice
Control Hazards
Example 7. Consider a certain program that has a
conditional branch instruction whose actual outcome
is given below when the program is executed.
T-T-N-T-T-N-T
List predictions for the following branch prediction
schemes and find the prediction accuracy.
1. Predict always taken
2. Predict always not taken
3. 1-bit predictor, initialized to predict taken
4. 2-bit predictor, initialized to weakly predict taken
Control Hazards
Example 7.
Actual branch actions : T-T-N-T-T-N-T
Predict as always taken
Predictions : T-T-T-T-T-T-T
Accuracy = 5/7 = 71%
Predict as always not taken
Predictions : N-N-N-N-N-N-N
Accuracy = 2/7 = 29%
1-bit predictor initialized to predict taken
Predictions: T-T-T-N-T-T-N
Accuracy = 3/7 = 43%
2-bit predictor initialized to weakly predict taken
Predictions: T-T-T-T-T-T-T
Accuracy = 5/7 = 71%
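The four prediction sequences can be reproduced by simulating each scheme on the given outcomes. The sketch below makes one assumption that the example does not spell out: the 2-bit predictor is modelled as a saturating counter from 0 to 3, where 2 and 3 predict taken and 2 is the weakly-taken state. All names are illustrative.

```python
def simulate(outcomes, predict, update, state):
    """Run one predictor over the actual outcomes; return (predictions, hits)."""
    predictions = []
    for actual in outcomes:
        predictions.append(predict(state))
        state = update(state, actual)
    hits = sum(p == a for p, a in zip(predictions, outcomes))
    return "-".join(predictions), hits


outcomes = "T-T-N-T-T-N-T".split("-")

schemes = {
    "always taken": (lambda s: "T", lambda s, a: s, None),
    "always not taken": (lambda s: "N", lambda s, a: s, None),
    # 1-bit: the state is simply the last actual outcome.
    "1-bit, init taken": (lambda s: s, lambda s, a: a, "T"),
    # 2-bit saturating counter (assumed encoding): 0, 1 -> N and 2, 3 -> T.
    "2-bit, init weakly taken": (
        lambda s: "T" if s >= 2 else "N",
        lambda s, a: min(s + 1, 3) if a == "T" else max(s - 1, 0),
        2,
    ),
}

results = {name: simulate(outcomes, *scheme) for name, scheme in schemes.items()}
for name, (predictions, hits) in results.items():
    print(f"{name:26s} {predictions}  accuracy = {hits}/{len(outcomes)}")
```

The 1-bit predictor mispredicts twice around every not-taken outcome, which is why it does worse here than always predicting taken.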
Pipelining Performance
Example 8. Let’s compare the performance of single-cycle, multi-cycle, and
pipeline implementation of MIPS processor given the operation times and
instruction mix below.
For the pipelined implementation, assume that:
1) Branch decision is done in the MEM cycle. Branch handling in the pipeline
implementation is done by stalling the pipeline.
2) Half of the load instructions incur load-use hazard.
3) Forwarding is implemented.
4) The jump instruction is completed in the ID stage
Instruction type | Percentage %
ALU | 52
Load | 25
Store | 10
Branch | 11
Jump | 2

Unit | Time (ps)
Memory | 200
ALU and adders | 100
Register File | 50
Pipelining Performance
Example 8.
Clock cycle time
Single-cycle = 200 + 50 + 100 + 50 + 200 = 600 ps
Multi-cycle = 200 ps
Pipeline = 200 ps
CPI
Single-cycle = 1
Multi-cycle = 5x 0.25 + 4x0.52 + 4x0.10 + 3x0.11 + 3x0.02
= 4.12
Pipeline = 0.125x2 + 0.125x1 + 0.52x1 + 0.1x1 + 0.11x4 + 0.02x2
= 1.475
Execution Time per instruction
Single-cycle = 600 ps
Multi-cycle = 4.12 x 200 ps = 824 ps
Pipeline = 1.475 x 200 = 295 ps
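The three results can be recomputed from the instruction mix and the unit times. This is a sketch of the same arithmetic; the per-class cycle counts are those used in the solution above (a load with a load-use hazard costs 2 cycles, a branch resolved in MEM costs 4, a jump completed in ID costs 2), and the dictionary keys are just labels.

```python
mix = {"load": 0.25, "alu": 0.52, "store": 0.10, "branch": 0.11, "jump": 0.02}
unit_ps = {"memory": 200, "alu": 100, "regfile": 50}

# Single-cycle: the clock must cover IF + register read + ALU + MEM + write back.
single_cycle_ps = (unit_ps["memory"] + unit_ps["regfile"] + unit_ps["alu"]
                   + unit_ps["memory"] + unit_ps["regfile"])
# Multi-cycle and pipelined: the clock is set by the slowest unit.
cycle_ps = max(unit_ps.values())

multi_cycles = {"load": 5, "alu": 4, "store": 4, "branch": 3, "jump": 3}
multi_cpi = sum(mix[k] * multi_cycles[k] for k in mix)

pipeline_cpi = (
    0.5 * mix["load"] * 2     # half of the loads: 1 stall for the load-use hazard
    + 0.5 * mix["load"] * 1   # the other half: no stall
    + mix["alu"] * 1
    + mix["store"] * 1
    + mix["branch"] * 4       # decision in MEM, handled by stalling: 3 lost cycles
    + mix["jump"] * 2         # completed in ID: 1 lost cycle
)

print(f"Single-cycle: {single_cycle_ps} ps per instruction")
print(f"Multi-cycle:  CPI = {multi_cpi:.2f}, {multi_cpi * cycle_ps:.0f} ps")
print(f"Pipeline:     CPI = {pipeline_cpi:.3f}, {pipeline_cpi * cycle_ps:.0f} ps")
```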
Pipelining Performance
Example 9. Redo Example 8 assuming that branch
prediction is employed and one quarter of the branch instructions
are mispredicted.
Exceptions & Interrupts
Exceptions and interrupts are unexpected events
that require the change in the flow
The two terms are often used interchangeably,
depending on the ISA
Intel x86 uses the term interrupt only
In MIPS
Exceptions: any internal unexpected change in the flow (undefined
opcode, overflow, system calls)
Interrupts: the event is external (I/O controller request)
Dealing with them
Is a challenging part of processor design
Affects performance
Exceptions & Interrupts
In MIPS, when an exception is generated, the
following sequence of steps is taken
The address of the offending instruction is saved into a
special register called the Exception Program Counter (EPC).
The cause of the exception is saved in a special register
called the Cause Register.
The control is transferred to the operating system by
loading a special address (0x8000 0180) into the PC.
The code loaded starting at this address
Determines what actions will be done by the operating system in
response to the exception based on the value found in the Cause
Register. The operating system may terminate the program or
resume the execution using the value found in the EPC
Overflow Exception
Modifications to the Datapath
Fallacies
Fallacy 1. Pipelining is easy !
Not true ! Hazards complicate the operation
Fallacy 2. Pipelining is independent of
technology!
Why didn’t we have pipelined processors before ?
Advanced technology allowed more transistors and
thus more operations !
Reading Assignment
Read the following from the textbook
Section 4.9 – Exceptions
Section 4.10 – Parallelism and Advanced Instruction
Level Parallelism