CS 2204 Fall 2005 - NYU Polytechnic School of Engineering
CS 2214
Computer Architecture
and Organization
Pipelined EMY CPU
Haldun Hadimioglu
Version 1
Computer Science & Engineering
Spring 2014
Outline
Introduction
Version 1 EMY CPU : Pipelined EMY CPU
It executes only integer instructions
How a memory hierarchy can be attached to the
pipelined EMY CPU is also studied
Version 0, the Unpipelined EMY CPU is described in
another presentation
Handout to use
Pipelined EMY CPU
CS 2214
Haldun Hadimioglu
CSE – Spring 2014
EMY CPU Version 1
2
Introduction
On the microarchitecture layer, a computer is a
collection of at least three interconnected digital
systems
A central processing unit (CPU)
A (main) memory
An I/O controller to control an I/O device, such as the disk
There can be several I/O controllers to control several
different I/O devices
[Figure: the CPU, the memory and the I/O controller (attached to the disk) connected by an interconnection system]
Digital Systems
A digital system performs microoperations
It consists of a datapath (data unit) and a control
unit
The datapath actually performs the microoperations
The control unit determines which microoperation
happens when
[Figure: the datapath (registers, ALUs, buses) sends status signals to the control unit (sequencer), which sends control signals back to the datapath]
Digital Systems
The datapath (data unit) has registers, ALUs
and buses to perform the microoperations
Registers keep information temporarily
ALUs perform arithmetic/logic operations
Buses interconnect the registers and ALUs
Other components used include
Multiplexers (MUXes), decoders, encoders, comparators,
counters, etc.
Digital Systems
The control unit has a sequencer that
determines the sequence of microoperations
The sequencer needs status signals from the data
unit to know what is happening there
Then, based also on the current state, it
determines which microoperations are to be
performed and indicates them to the datapath by means
of control signals
Designing Digital systems
Datapath design is simpler than control unit
design since the datapath has highly regular
(duplicated) circuits
A 64-bit ADDer is composed of 4 16-bit identical
ADDers
A 64-bit comparator consists of 8 8-bit identical
comparators, etc.
Control unit design is more difficult due to
Large amounts of random logic
A substantial amount of effort is needed to make
sure there are no timing problems
Microoperations must start at the right time and end at
the right time !
Designing digital systems
We will use the finite-state machine (FSM)
technique to design the EMY CPU where the
FSM state diagram will have states with
microoperations
The state diagram shows which state follows
which state precisely
Each state indicates which microoperations to perform
The state diagram shows which states are needed
when for which machine language instruction
Designing digital systems
We will design the EMY CPU by using the
finite-state machine (FSM) technique
More specifically, we will obtain the following for
the complete EMY CPU design
A high-level-state diagram to show which microoperation
happens when
The datapath from the high-level state diagram
The low-level state diagram from the high-level state
diagram and the datapath
The control unit from the low-level state diagram
It can be implemented by hardwiring and/or
microprogramming
Designing the microarchitecture level of a
computer
There are two tasks in this design
Develop the CPU and memory digital systems so
that instructions can be run
Develop the memory and I/O controller digital
systems so that I/O can happen
We will concentrate on the CPU and memory
digital systems
Designing the CPU and memory digital systems
First we focus on the CPU digital system while we make a few
design decisions on the memory quickly
We have designed the CPU as a slow CPU running only integer
instructions : No pipelining
This is Version 0
We assumed the memory was fast which is not realistic today
We will see how a memory hierarchy with cache memories, etc. can be
incorporated
This CPU coverage is given in another PowerPoint presentation
Now, we improve the CPU speed by using pipelining, but still running
integer instructions
This is Version 1
We will assume the memory is fast which is again not realistic today
Then, we will see how a memory hierarchy with cache memories, etc. can be
incorporated
For both versions the memory will be a black box with a few
details
Designing the CPU as a Digital System
The unpipelined EMY CPU digital system has
been designed for nine integer instructions
We obtained its
High-level state diagram
Datapath
Low-level state diagram
Control unit
We will design the pipelined EMY CPU digital
system for eight integer instructions
We will obtain its
High-level state diagram
Datapath
Designing the Unpipelined CPU digital system
To design the unpipelined EMY CPU, we started with
the EMY architecture
What is the connection between the architecture and the
CPU?
A computer processes digital information, by running machine
language instructions
A machine language program is a list of instructions each of
which specifies operations on data (arguments)
An instruction specifies architectural operations
Each architectural operation is implemented by microoperations
Designing the Unpipelined CPU Digital System
In order to perform an architectural operation, the
CPU performs a series of microoperations in a
number of clock periods
That is, an architectural operation is broken down into
smaller operations called microoperations
That is, to run a machine language instruction, the
CPU performs microoperations
The CPU performs some microoperations by itself and some
in cooperation with the memory and the I/O controllers
Designing the Unpipelined CPU Digital System
Architectural operations
An architectural operation is what we describe as the
semantics of the instruction, such as
The architectural operation specified by the ADD instruction
Rd <-- Rs + Rt
The architectural operation specified by the SUB instruction
Rd <-- Rs - Rt
The architectural operation specified by the SLT instruction
If Rs < Rt then Rd <-- 1 else Rd <-- 0
The architectural operation specified by the J instruction
PC[27-0] <-- (Address * 4)
It is the CPU that contributes the most to the execution of
an instruction since it performs most of the microoperations
needed for an architectural operation
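As a minimal sketch, the ADD, SUB and SLT architectural operations listed on this slide can be written as functions on a register file. The `regs` dictionary, the function names, the 32-bit wraparound and the signed comparison in SLT are assumptions made for illustration; they are not taken from the EMY definition.

```python
MASK32 = 0xFFFFFFFF  # assumption: registers are 32 bits wide


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x


def add(regs, rd, rs, rt):
    # Rd <-- Rs + Rt
    regs[rd] = (regs[rs] + regs[rt]) & MASK32


def sub(regs, rd, rs, rt):
    # Rd <-- Rs - Rt
    regs[rd] = (regs[rs] - regs[rt]) & MASK32


def slt(regs, rd, rs, rt):
    # If Rs < Rt then Rd <-- 1 else Rd <-- 0 (signed comparison assumed)
    regs[rd] = 1 if to_signed(regs[rs]) < to_signed(regs[rt]) else 0


# Hypothetical register contents, chosen only for the demonstration
regs = {"R1": 0, "R2": 7, "R3": 5}
add(regs, "R1", "R2", "R3")
print(regs["R1"])
```

Each function is a single architectural operation; the CPU would carry it out as several microoperations spread over clock periods.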
Designing the Unpipelined CPU Digital System
Typical CPU digital system microoperations
Add, subtract, multiply
In the past, a 32-bit addition was completed in 1 clock period.
Today, a 32-bit addition is completed in several clock periods
AND, OR, XOR
Shift right, Shift left
Read data from memory, write data to memory
In the past, a memory access was completed in 1 clock period.
Today, it is completed in several clock periods
Read instructions from memory (fetch)
Increment the program counter
Transfer a register to another register
…
Designing the Unpipelined CPU as a Digital
System
Other machines, especially CISC machines, require
other microoperations such as
Reading indirect address(es) from the memory
Effective address calculation for
Indexing
Autoincrement
Autodecrement
Alignment for
Instructions
Data
Addresses
Designing the Unpipelined CPU Digital System
Architecture’s effect on microoperations
The decisions made on the architecture determine
the microoperations needed for the execution of
instructions
General microoperations found on most CPUs
The ones mentioned on previous slides
Specific microoperations for certain CPUs
Specific microoperations for Memory Management Units
(MMUs), caches, I/O controllers
The architecture also determines the characteristics of
each microoperation
If the 26-bit PC-direct addressing mode is used, the rightmost
26 bits of IR are catenated with the leftmost 4 bits of PC (which
form the high-order part) and the resulting 30 bits are shifted to
the left by 2
Thus, each machine language instruction requires a number
of certain microoperations taking a certain time : the CPIi
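The 26-bit PC-direct address formation described above can be sketched as a small function. The function name and the 32-bit `pc` and `ir` integer arguments are assumptions for illustration; the upper 6 bits of `ir` are simply ignored here.

```python
def pc_direct_target(pc, ir):
    """Form a 32-bit target from PC and IR as described on the slide."""
    low26 = ir & 0x03FFFFFF             # rightmost 26 bits of IR
    high4 = (pc >> 28) & 0xF            # leftmost 4 bits of PC
    thirty = (high4 << 26) | low26      # 30-bit catenation
    return (thirty << 2) & 0xFFFFFFFF   # shift left by 2


# Hypothetical values: PC inside the 4002xx code region, address field 0x10008D
print(hex(pc_direct_target(0x00400200, 0x0810008D)))
```

The shift by 2 reflects that instructions are 4 bytes long, so the two low-order address bits are always zero.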
Designing the Unpipelined CPU Digital System
Microoperations
The CPU can perform one or more microoperations per clock
period, depending on the complexity of the microoperation
and the availability of the hardware resources
Most often a microoperation can be completed in one clock
period unless it is a complex microoperation
If a complex microoperation is to be run in one clock
period, the clock period needs to be longer
The more numerous and complex the microoperations are, the
longer it takes to run the machine language instruction
CISC instructions take longer time to execute (larger CPIi)
Designing the Unpipelined CPU Digital System
Calculating CPIi
The time it takes to run an instruction, CPIi, is then
determined by
The number of microoperations needed for it
The complexity of the microoperations
The number of clock periods for an instruction, CPIi,
becomes a matter of figuring out the microoperations and
how to distribute them to individual clock periods
One can come up with 5-10 simple microoperations to be
performed one after another, resulting in a CPIi of 5-10
But, since microoperations are simple, the clock period is short
Alternatively, one can come up with 2-4 complex
microoperations, resulting in a CPIi of 2-4
But, the clock period is longer
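The trade-off between the two options can be sketched with the relation time per instruction = CPIi x clock period. The clock periods below (0.5 ns and 1.5 ns) and the CPIi values picked from the two ranges are hypothetical numbers chosen only to illustrate the comparison.

```python
def instruction_time_ns(cpi_i, clock_period_ns):
    """Time to run one instruction: clock periods needed times period length."""
    return cpi_i * clock_period_ns


# Hypothetical design A: 8 simple microoperations, short clock period
many_short = instruction_time_ns(cpi_i=8, clock_period_ns=0.5)

# Hypothetical design B: 3 complex microoperations, longer clock period
few_long = instruction_time_ns(cpi_i=3, clock_period_ns=1.5)

print(many_short, few_long)
```

With these assumed numbers the two designs end up close to each other, which is why the choice is decided by other factors such as clock frequency and pipelining.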
Designing the Unpipelined CPU Digital System
Calculating CPIi
What can we do ?
Few long clock periods vs. many but shorter clock periods ?
Since increasing the clock frequency is important for marketing
purposes the second option would weigh in substantially
It turns out that if pipelining is implemented, having many shorter
clock periods would be beneficial as we will see
CPIi figures will be large but CPIave will be close to 1 (one) !
Today’s microprocessors have instruction CPIi values in the
range of 10-30, but CPIave figures for their targeted
applications are even less than 1 (one) !
Because they employ advanced pipelining techniques, such as
superscalar execution, hyperthreading, etc.
Designing the Unpipelined CPU Digital System
Determining microoperations for a machine language
instruction
Some microoperations are performed for all the instructions
Usually at the same point in time during the execution of every
instruction
Fetching the instruction is always the first microoperation to
perform for all CPUs
Updating PC (PC <-- PC + 4) so that it points at the next instruction
is also universal
The other microoperations depend on the instruction, the
addressing mode, where the arguments are, the length of
the arguments, etc.
Designing the Unpipelined CPU Digital System
Determining microoperations for a machine language
instruction
We would list all the microoperations for each instruction,
by making sure that we are consistent in terms of
Bus usage
We often decide an approximate number of buses we need for our
datapath
Today’s CPUs have at least three internal buses to complete an
integer arithmetic microoperation in one clock period
Two buses carry the numbers from two registers and the third
bus carries the result to a register
ALU usage
An ALU is expensive and so we try to limit the number of ALUs
Designing the Unpipelined CPU Digital System
Determining microoperations for a machine language
instruction
We would list all the microoperations for each instruction,
by making sure that we are consistent in terms of
Register usage
Additional registers not visible to the architecture level are used
to keep temporary values : microarchitectural registers
Typically, the more registers are used, the more clock periods we
spend for an instruction since temporary values will be passed
from one register in one clock period to another register to be
used the following clock period
But, sometimes we have to use microarchitectural registers, such
as the instruction register that keeps the current instruction
Control unit usage
Designing the Unpipelined CPU Digital System
Determine how each EMY architectural operation is
implemented by microoperations
Most microoperations must be simple enough to be
completed in less than one clock period
A few microoperations may not be completed in a clock
period
For example a memory read may take several clock periods
since the memory is slower
These long microoperations should be accommodated in the
high-level state diagram, the datapath, low-level state diagram
and the control unit
We will assume in the beginning that every microoperation is
completed in one clock period
Designing the Unpipelined CPU Digital System
The EMY microoperations implied by the EMY
machine language instructions include
Instruction fetch, performed always
Update PC for next instruction, performed always
Effective address calculation for Displacement and relative
addressing modes
Sign extension or catenation of 0s for data/addresses
Reading data from the memory
Writing data to the memory
Performing an arithmetic/logic operation
Register transfer
Testing a condition
What is Pipelining ?
The unpipelined MIPS CPU can be thought of
as having five stages that correspond to the
five major cycles
For the unpipelined MIPS CPU, at any time
only one stage is busy and the remaining ones
are idle
[Figure: instructions enter the datapath, pass through the IF, ID, EX, MEM and WB stages under the control unit, and leave]
What is Pipelining ?
The unpipelined CPU works like this :
[Figure: LW R8, 0(R9) moves through IF, ID, EX, MEM and WB in clock periods 1 to 5; the next instruction, an ADD, enters IF in clock period 6, and execution continues this way]
Only one instruction is in the pipeline at a time !
What is Pipelining ?
Pipelining is the simultaneous execution of multiple
instructions in an assembly line fashion in a single
CPU
[Figure: animation of the five stages IF, ID, EX, MEM and WB over clock periods 1 to 5; a new instruction (such as LW, ADD, SW or BEQ) enters IF each clock period while the earlier ones advance one stage]
What is Pipelining ?
Pipelining is a microarchitectural technique
where consecutive instructions are executed
in an overlapped fashion
Each instruction is in a pipeline stage
All stages are busy
What is a Stage ?
Each stage is specialized hardware
corresponding to a specific major cycle
IF, ID, EX, MEM, WB
The hardware for each major cycle can then be easily
identified and is often called a stage
What is Pipelining ?
Pipelined execution of instructions is similar to the
assembly line manufacturing of cars
What is Pipelining ?
There are two differences
On a car assembly line there is only one type of
car assembled
For the CPU the instructions executed are different
Loads, Stores, A/L, Branch instructions
All the cars on an assembly line have the same
requirements : the same pieces are placed on the
cars
For the CPU, even if two back-to-back instructions are
of the same type (for example two back-to-back Loads),
they have different requirements (different effective
addresses hence different memory locations are
accessed)
What is Pipelining ?
Because of these two differences, each stage
has to pass information related to the
instruction it just worked on to the next
stage
Temporary registers (latches, buffers) are used
between two stages to pass the information about
the instruction just leaving one stage and entering
the next one
[Figure: the five stages IF, ID, EX, MEM and WB with latches between adjacent stages]
What is Pipelining ?
Latches are then necessary to pass
information about an instruction from one
stage to the next
Latches are also needed so that partial work
done by one stage is passed to the next stage
so the work continues
What is the Pipe ?
We give the name “pipe” to the set of stages
since the stages are cascaded in a single
dimension forming a pipe where instructions
Enter from one end
Stay in a stage for one clock period
Proceed to the next stage
Finally exit from the other end
By which time the instruction execution is
completed
What is Pipelining ?
Consider a sequence of instructions and a 5-stage pipeline
[Figure: instructions … I9 I8 I7 I6 I5 I4 I3 I2 I1 enter the pipe at IF, pass through ID, EX and MEM, and leave after WB]
Assume that all the instructions use the five
stages
That is, they all take five clock periods to complete
their execution
This is not possible in real life but let’s assume this for
the time being to understand pipelining quickly
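Under the assumption above that every instruction uses all five stages, stage occupancy per clock period can be sketched as follows. The function name, the table layout and the choice of eight instructions are assumptions for illustration; no stalls of any kind are modeled.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def occupancy(num_instructions, num_periods):
    """Return {stage: [instruction label or '' for each clock period]}."""
    table = {stage: [""] * num_periods for stage in STAGES}
    for i in range(num_instructions):
        for s, stage in enumerate(STAGES):
            period = i + s  # instruction i enters IF in period i (0-based)
            if period < num_periods:
                table[stage][period] = f"I{i + 1}"
    return table


table = occupancy(num_instructions=8, num_periods=8)
# Print WB at the top and IF at the bottom, one column per clock period
for stage in reversed(STAGES):
    print(f"{stage:>3} " + " ".join(f"{cell:>3}" for cell in table[stage]))
```

From the fifth clock period on, every stage holds an instruction, which is the "pipeline is full" condition.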
What is Pipelining ?
The execution can be shown as follows
Stage
WB                          I1   I2   I3   I4
MEM                    I1   I2   I3   I4   I5
EX                I1   I2   I3   I4   I5   I6
ID           I1   I2   I3   I4   I5   I6   I7
IF      I1   I2   I3   I4   I5   I6   I7   I8
Time  0    1    2    3    4    5    6    7    8
Pipeline is full ≡ all stages are busy ≡ start-up time = 5 clock periods
What is Pipelining ?
Compared with the unpipelined CPU, the five stages
are more complex to allow overlapped
execution
All stages take the same amount of time, one
clock period
The length of the clock period is determined
by the slowest stage
Because it is difficult to obtain stages with equal
amounts of work, and hence equal time
What is Pipelining ?
If the CPU is unpipelined, the instructions would take
5 clock periods each
Instruction      I1    I2    I3    I4    I5    I6    I7
Time          0     5    10    15    20    25    30    35
CPIi = 5
Since each instruction is taking 5 clock periods
CPIave = 5
Since the number of clock periods divided by the number of
instructions run is 5
CPIave = 35 / 7 = 5 clock periods
What is Pipelining ?
If the CPU is pipelined, after the pipeline
becomes full (the start-up time), every clock
period an instruction is completed as opposed
to completing every 5 clock periods
Instruction completed   I1   I2   I3   I4   I5   I6   I7
Time                     5    6    7    8    9   10   11
CPIi = 5
Since each instruction is taking 5 clock periods
CPIave ≈ 1
Since after the start-up time, we complete one
instruction each clock period
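The CPIave figures on this slide and the previous one can be checked with a short calculation. It assumes an ideal 5-stage pipeline with no stalls, where the first instruction completes after 5 clock periods and each later one completes one clock period after its predecessor; the function names are illustrative.

```python
def pipelined_clock_periods(n, stages=5):
    """Clock periods to complete n instructions in an ideal pipeline."""
    return stages + (n - 1)


def cpi_ave_pipelined(n, stages=5):
    return pipelined_clock_periods(n, stages) / n


def cpi_ave_unpipelined(n, cpi_i=5):
    """Every instruction takes cpi_i clock periods, one after another."""
    return (n * cpi_i) / n


for n in (7, 100, 10000):
    print(n, cpi_ave_unpipelined(n), round(cpi_ave_pipelined(n), 4))
```

For the seven instructions on the slide the pipelined total is 11 clock periods, so the start-up time still shows in the average; as the number of instructions grows, CPIave approaches 1.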
What is Pipelining ?
Once the pipeline is filled, each clock period
an instruction exits the pipeline
Each clock period an instruction is completed
It seems each instruction takes one clock period to
execute
CPIave ≈ 1 !!!
What is Pipelining ?
Assume for the next few slides that the
unpipelined EMY CPU is converted to a
pipelined CPU
CPILW = 5
CPISW = 4
CPIA/L R Format = 4
CPIBEQ = 3
What is Pipelining ?
Consider the following piece of EMY code
---
400200  LW   R8, 0(R9)        ; R8 <-- M[R9 + 0+]
400204  ADD  R10, R11, R12    ; R10 <-- R11 + R12
400208  SUB  R13, R14, R15    ; R13 <-- R14 - R15
40020C  XOR  R16, R17, R18    ; R16 <-- R17 XOR R18
400210  SW   R19, 0(R20)      ; M[R20 + 0+] <-- R19
400214  OR   R21, R22, R23    ; R21 <-- R22 | R23
400218  SLT  R24, R25, R26    ; If R25 < R26, R24 <-- 1, else R24 <-- 0
40021C  BEQ  R27, R28, 5      ; If R27 is equal to R28, branch to 400234
---
This code is not realistic since the instructions are all independent of each other !
But, for the sake of understanding pipelining, we will use this piece of code !
What is Pipelining ?
Let’s see its pipelined execution by using textbook’s notation
and assume that the memory takes one clock period
Clock period                    1    2    3    4    5    6    7    8    9    10
400200  LW   R8, 0(R9)          IF   ID   EX   MEM  WB
400204  ADD  R10, R11, R12           IF   ID   EX   MEM
400208  SUB  R13, R14, R15                IF   ID   EX   MEM
40020C  XOR  R16, R17, R18                     IF   ID   EX   MEM
400210  SW   R19, 0(R20)                            IF   ID   EX   MEM
400214  OR   R21, R22, R23                               IF   ID   EX   MEM
400218  SLT  R24, R25, R26                                    IF   ID   EX   MEM
40021C  BEQ  R27, R28, 5                                           IF   ID   EX
What is Pipelining ?
The textbook's notation is hard to follow if there are more than
a few instructions
Also, the notation requires a lot of space even for a few instructions
From now on, we will use our notation
The execution, assuming that the cache memories take
one clock period and there is no miss, is as follows
                                IF   ID   EX   MEM  WB
400200  LW   R8, 0(R9)          1    2    3    4    5
400204  ADD  R10, R11, R12      2    3    4    5
400208  SUB  R13, R14, R15      3    4    5    6
40020C  XOR  R16, R17, R18      4    5    6    7
400210  SW   R19, 0(R20)        5    6    7    8
400214  OR   R21, R22, R23      6    7    8    9
400218  SLT  R24, R25, R26      7    8    9    10
40021C  BEQ  R27, R28, 5        8    9    10
What is Pipelining ?
What if the EMY CPU was not pipelined ?
The execution timing would be as follows by assuming that the
cache memories take one clock period and there is no miss
                                IF   ID   EX   MEM  WB
400200  LW   R8, 0(R9)          1    2    3    4    5
400204  ADD  R10, R11, R12      6    7    8    9
400208  SUB  R13, R14, R15      10   11   12   13
40020C  XOR  R16, R17, R18      14   15   16   17
400210  SW   R19, 0(R20)        18   19   20   21
400214  OR   R21, R22, R23      22   23   24   25
400218  SLT  R24, R25, R26      26   27   28   29
40021C  BEQ  R27, R28, 5        30   31   32
The execution completes in 32 clock periods !
Pipelined execution takes 10 clock periods !
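The 32 and 10 clock period totals can be reproduced from the CPIi values given earlier (LW 5, SW 4, A/L 4, BEQ 3). The sketch assumes one instruction enters IF per clock period with no stalls; the `CPI` and `PROGRAM` names are illustrative.

```python
# Clock periods each instruction kind needs (CPIi values from the slides)
CPI = {"LW": 5, "SW": 4, "A/L": 4, "BEQ": 3}

# LW, ADD, SUB, XOR, SW, OR, SLT, BEQ from the example code
PROGRAM = ["LW", "A/L", "A/L", "A/L", "SW", "A/L", "A/L", "BEQ"]


def unpipelined_periods(program):
    """Instructions run one after another, so the CPIi values add up."""
    return sum(CPI[kind] for kind in program)


def pipelined_periods(program):
    """Instruction i (0-based) enters IF in period i + 1 and ends in period i + CPIi."""
    return max(i + CPI[kind] for i, kind in enumerate(program))


slow = unpipelined_periods(PROGRAM)
fast = pipelined_periods(PROGRAM)
print(slow, fast, slow / fast)
```

For this short program the speedup is about 3.2, less than the number of stages, because the start-up time is a large share of only ten clock periods.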
What is Pipelining ?
Pipelining decreases the execution time of the
program, CPUtime
The number of instructions run, NI, stays the same
We execute the same number of instructions for a program
The CPIi stays the same
Often the unpipelined CPIi and Pipelined CPIi differ slightly for
efficient pipelining
The Branch CPIi will reduce from 4 to 3
The A/L Format CPIi will go up from 4 to 5
Instructions go through similar stages as in the unpipelined
case
But, we execute several instructions at the same time
All the stages are busy now
The CPU does more per clock period
CPIave decreases
What is Pipelining ?
We execute more instructions per unit time
(a second)
The throughput is increased
The MIPSave figure is increased
The number of instructions executed per second is
increased
The MFLOPSave figure is increased
The number of FP operations performed per second is
increased
That is why companies like to mention the MIPSave and
MFLOPSave figures for their new generations of
microprocessors since each new generation improves the
pipeline which directly improves MIPSave and MFLOPSave.
What is Pipelining ?
Pipelining does not decrease the CPIi of each
individual instruction but increases the clock
period slightly
The execution time of each instruction in terms of
seconds is increased slightly !
This is due to the slightly longer clock period
This is due to overhead of handling several instructions per
clock period
Hardware-related issues to solve
The stages must be precisely timed, synchronized
Each stage must take the same amount of time
Each stage must have about the same amount of work
This is hard to achieve unless it is a RISC architecture
Suppose that we managed to have the same amount
of work per stage so that each stage takes the same
time
What is the clock period ?
Theoretically the clock period can stay the same as the
unpipelined CPU
But the simultaneous execution increases the overhead per
clock period
The clock period duration is increased slightly !
Hardware-related issues to solve
A solution to these two problems today is to break up
stages that are taking too long into several simpler
stages so that the stages are finer
Then, the pipeline is longer ≡ there are many stages
Since each stage is doing simpler work, the clock period is
shorter ≡ the clock frequency is higher
Today, a technique to increase the microprocessor frequency is
exactly this ≡ make stages simpler and simpler and simpler ≡
make pipelines longer and longer and longer
Today’s microprocessor pipelines are typically 15 to 25 stages
long
Clock skew problems can cause timing problems
A signal may arrive too late to play a role in generating another
signal since the pipeline is very long !
Pipelined EMY CPU Design
In CS2214, we design the EMY CPU by going through
two versions : 0 and 1
Version 0 is the unpipelined CPU executing only integer
instructions
Version 1 is the pipelined CPU executing only integer
instructions
Initially, the Version 1 design will not be an acceptable design
New hardware to handle pipelining is not identified
For example, the latches between stages are not identified
The CPU must have latches, so we will quickly change the design
It will not handle certain situations, called hazards, well
There are three types of hazards : structural, data & control
All programs have hazards, so we will quickly change the design
(Slide 53)
Pipelined EMY CPU Design
In CS2214, we design the EMY CPU by going through
two versions : 0 and 1
Version 0 is the unpipelined CPU executing only integer
instructions
Version 1 is the pipelined CPU executing only integer
instructions
Initially, the Version 1 design will not be an acceptable design
Branch instructions take too long causing pipeline startups
Control instructions must take shorter time, so we will quickly change
the design
It will assume ideal memory
All memory accesses take one clock period
We will partially deal with the slower memory and leave the rest to
the Computer Architecture II course
It will have imprecise interrupts
We will leave it to the Computer Architecture II course
We will not discuss the control unit, but we will know that it is
there
So, somehow, the initial design of this version of MIPS
CPU executes the code in a pipelined fashion
(Slide 54)
Pipelined EMY CPU Design Versions
We will design the pipelined MIPS CPU Version 1 in
several steps
As mentioned above, initially, the Version 1 design will not be
an acceptable design
The final design of Version 1 will improve the pipeline by
introducing additional hardware to better handle integer
instructions
New hardware, including latches, to handle pipelining will be
identified
It will better handle the three hazards
Branch instructions will take 2 clock periods
But, we will have delayed branches which is not practical
It will still have some unacceptable features
It will assume slower Level 1 cache memories and misses on
Level 1 cache memories, but not misses on lower level cache
memories
It will still have imprecise interrupts
Pipelining EMY CPU
Consider the mnemonic machine language
discussed before
---
400200  LW   R8, 0(R9)        ; R8 <-- M[R9 + 0+]
400204  ADD  R10, R11, R12    ; R10 <-- R11 + R12
400208  SUB  R13, R14, R15    ; R13 <-- R14 - R15
40020C  XOR  R16, R17, R18    ; R16 <-- R17 XOR R18
400210  SW   R19, 0(R20)      ; M[R20 + 0+] <-- R19
400214  OR   R21, R22, R23    ; R21 <-- R22 | R23
400218  SLT  R24, R25, R26    ; If R25 < R26, R24 <-- 1, else R24 <-- 0
40021C  BEQ  R27, R28, 5      ; If R27 is equal to R28, branch to 400234
---
Pipelining EMY CPU
Here is the execution of the code discussed earlier
                                IF   ID   EX   MEM  WB
400200  LW   R8, 0(R9)          1    2    3    4    5
400204  ADD  R10, R11, R12      2    3    4    5
400208  SUB  R13, R14, R15      3    4    5    6
40020C  XOR  R16, R17, R18      4    5    6    7
400210  SW   R19, 0(R20)        5    6    7    8
400214  OR   R21, R22, R23      6    7    8    9
400218  SLT  R24, R25, R26      7    8    9    10
40021C  BEQ  R27, R28, 5        8    9    10
This EMY CPU pipeline version has problems as
mentioned on slides 53 and 54
This EMY CPU pipeline also makes assumptions that
are not acceptable as mentioned on the next slide
Issues with the Current Design
This program will be executed without difficulty
since all instructions are independent of each other
1) There is no real application where all instructions are
independent of each other
Real-life applications have instruction dependencies
Instruction I1 generates a result that is used by another
instruction, I2, so that I2 depends on I1
2) This code assumes we will always execute in sequence :
even if we execute branch instructions
That is, it assumes branches are never taken
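The claim that the example instructions are all independent can be checked with a small sketch that looks for a register written by one instruction and read by a later one. The tuple layout and the function name are assumptions for illustration; only register dependencies are checked, not dependencies through memory.

```python
# (mnemonic, destination register or None, source registers) for the example code
PROGRAM = [
    ("LW",  "R8",  ["R9"]),
    ("ADD", "R10", ["R11", "R12"]),
    ("SUB", "R13", ["R14", "R15"]),
    ("XOR", "R16", ["R17", "R18"]),
    ("SW",  None,  ["R19", "R20"]),
    ("OR",  "R21", ["R22", "R23"]),
    ("SLT", "R24", ["R25", "R26"]),
    ("BEQ", None,  ["R27", "R28"]),
]


def find_dependencies(program):
    """Return (producer index, consumer index, register) triples."""
    deps = []
    for j, (_, _, sources) in enumerate(program):
        for i in range(j):
            dest = program[i][1]
            if dest is not None and dest in sources:
                deps.append((i, j, dest))
    return deps


print(find_dependencies(PROGRAM))

# Hypothetical variant: the ADD now reads R8, which the LW produces
dependent = list(PROGRAM)
dependent[1] = ("ADD", "R10", ["R8", "R12"])
print(find_dependencies(dependent))
```

The first call finds nothing for the example code, while the variant reports that instruction 1 depends on instruction 0 through R8, which is the I1/I2 situation described above.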
Improving Initial Version 1 Design
The pipelined EMY CPU state diagram and
pipeline stages
We will obtain the final state diagram and final
datapath after several iterations
The initial design of Version 1 will be improved by going
through several designs
First, we will add new hardware, including latches
Second, we will handle hazards better
Third, we will execute Branch instructions faster
Fourth, we will assume slower memory and so Level 1 cache
memories will be used
Level 1 cache memories will take more than one clock period
Level 1 cache memories will have cache misses
CS 2214
Haldun Hadimioglu
CSE – Spring 2014
EMY CPU Version 1
59
Improving Initial Version 1 Design
Version 1 will be improved by going through several
designs
First we will add the hardware overhead, including latches
When we have pipelined execution, it is important not to lose
the information about the execution of each instruction
With pipelining, each stage does some work for the instruction
and by doing so affects the architectural registers and the
memory (the state)
Some piece of this state is needed to execute an instruction in
later stages
So, when we move an instruction from one stage to another, it
is necessary to transfer the information related to the
instruction to the next stage (to make the state of the
instruction available to the next stage) so that correct
execution happens
Latching hardware
Each stage starts with the “sum” of work that has
been done on its instruction in previous stages
Each stage works on the instruction resulting in new work
that will be needed in later stages to complete the
instruction execution
For that purpose stages are provided with latches
In other words, a stage works on an instruction that has left the
previous stage and produces something related to the instruction
and passes it to the next stage to be used in the next clock period
Thus, we need to save the work of a stage in
temporary registers (latches) for the next stage
Latching hardware
So we need the latches (buffers)
The amount of storage (the number of latches)
between two stages is not constant
[Figure: instructions I3 to I8 moving through the IF, ID, EX, MEM and WB stages, with latches between the stages]
Latching hardware
The new hardware
Four IRs
Though not all the bits of the extra IRs are needed in
every stage
Two NPC registers
Two ALUoutput registers
One A register
Two B registers
One Imm register
One TA register
One Zero flip-flop
One MDR register
Latching hardware
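Here is the new look of the MIPS CPU datapath with latches
[Figure: five-stage datapath (IF, ID, EX, MEM, WB) with PC, GPR and latch sets 2 to 5. Latch set 2: NPC, IR. Latch set 3: NPC, A, B, Imm, IR. Latch set 4: TA, Zero, ALUout, B, IR. Latch set 5: ALUout, MDR, IR.]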
The leftmost latch set (with NPC and IR) will be called latch set
2 since these latches are used by the second stage from left
(ID)
The next latch set to the right (NPC, A, B, Imm and IR) is latch
set 3, and so on
Latching hardware
We will identify the registers by using the
latch set number (or the stage number using
the registers)
Latch set 2 registers (Stage 2 uses them)
2.NPC and 2.IR
Used by the second stage from left : ID
Latch set 3 registers (Stage 3 uses them)
3.NPC, 3.A, 3.B, 3.Imm and 3.IR
Used by the third stage from left : EX
Latch set 4 registers (Stage 4 uses them)
4.Zero, 4.ALUout, 4.B and 4.IR
Used by the fourth stage from left : MEM
Latch set 5 registers (Stage 5 uses them)
5.ALUout, 5.MDR and 5.IR
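The latch-set naming can be checked against the list of new hardware with a small sketch. The dictionary layout and function names are illustrative assumptions; TA is placed in latch set 4 because the datapath figure and the hardware list include it, although the register list above does not name 4.TA.

```python
from collections import Counter

# Latch sets keyed by the number of the stage that reads them.
# Assumption: TA belongs to latch set 4, as the datapath figure suggests.
LATCH_SETS = {
    2: ["NPC", "IR"],                        # read by ID
    3: ["NPC", "A", "B", "Imm", "IR"],       # read by EX
    4: ["TA", "Zero", "ALUout", "B", "IR"],  # read by MEM
    5: ["ALUout", "MDR", "IR"],              # read by WB
}

def register_counts(latch_sets):
    """Count how many copies of each register the latch sets contain."""
    return Counter(name for regs in latch_sets.values() for name in regs)

def qualified_names(latch_sets):
    """Return names in the slide notation, such as '2.NPC' or '4.Zero'."""
    return [f"{n}.{name}" for n, regs in sorted(latch_sets.items()) for name in regs]

counts = register_counts(LATCH_SETS)
names = qualified_names(LATCH_SETS)
```

Under these assumptions the counts agree with the hardware list: four IRs, two NPCs, two ALUout registers, two B registers and one each of A, Imm, TA, Zero and MDR.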
Latching hardware
What did we do ?
We identified latches for the pipelined execution
of instructions
The initial implementation of Version 1 does not identify
the latches
The initial implementation of Version 1 does not specify
that there are four IR registers, two NPC registers, two
ALUout registers, etc.
Timing of Microoperations
We need to know about the timing of microoperations
When exactly does the instruction fetch occur for the LW
instruction ?
----
400200  LW   R8, 0(R9)
400204  ADD  R10, R11, R12
400208  SUB  R13, R14, R15
40020C  XOR  R16, R17, R18
400210  SW   R19, 0(R20)
400214  OR   R21, R22, R23
400218  SLT  R24, R25, R26
40021C  BEQ  R27, R28, 5
----
That is, we know the instruction fetch will happen in clock period 1
(one), but exactly when ?
Similarly, when exactly does PC get its value updated to 400204 when
we execute the LW ?
Note : On the unpipelined CPU, this code takes 32 clock periods !
Timing of Microoperations
We clock (store into) our registers at the end of a
clock period and therefore registers change their
values at the beginning of the next clock period
Therefore, IR gets its new value (the LW instruction) at the
beginning of the ID cycle (in clock period 2)
PC gets its new value (400204) at the beginning of the ID cycle
(in clock period 2)
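This timing rule can be illustrated with a minimal sketch. The Register class, its method names and the two-entry memory are hypothetical; they only model "compute during the period, store at the end, see the new value in the next period".

```python
class Register:
    """A register that is clocked at the end of a clock period."""

    def __init__(self, value):
        self.value = value   # value visible during the current clock period
        self._next = value   # value that will be stored at the clock edge

    def load(self, new_value):
        self._next = new_value

    def clock(self):
        self.value = self._next


# Hypothetical instruction memory holding the first two instructions.
memory = {0x400200: "LW R8, 0(R9)", 0x400204: "ADD R10, R11, R12"}

pc = Register(0x400200)
ir = Register(None)      # IR content is unknown in clock period 1
trace = []
for period in (1, 2):
    trace.append((period, pc.value, ir.value))
    # Work done during the period (the IF microoperations)
    ir.load(memory[pc.value])
    pc.load(pc.value + 4)
    # End of the clock period: both registers are stored
    ir.clock()
    pc.clock()
```

In this sketch the trace records PC = 400200 with an unknown IR in clock period 1, and PC = 400204 with the LW in IR in clock period 2.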
[Figure: timing diagram of Clock, PC and IR. PC is 4001FC before clock period 1, 400200 in clock period 1, 400204 in clock period 2 and 400208 afterwards. IR is unknown (?) until it holds LW R8, 0(R9) in clock period 2 and ADD R10, R11, R12 afterwards.]
Instruction fetch (IF) Cycle
Fetch the instruction pointed to by PC into 2.IR
2.IR ← M[PC]
Update PC by adding 4
PC ← PC + 4
How about 2.NPC ? Soon, we will see that !
This cycle will be more complex when we cover BEQ later
Instruction decode/register fetch (ID) Cycle
Prepare temporary registers A, B and Imm in case we need the
GPR registers, an effective address or an immediate operand
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
How about 3.NPC & 3.IR ? Soon, we will see them !
Execute (EX) Cycle for LW/SW Instructions
How do we know we have a LW/SW instruction ?
The IR register for this stage (3.IR) has not been given the value
of the IR register of the previous stage (2.IR)
We need to update the ID stage : 3.IR ← 2.IR
Instruction decode/register fetch (ID) cycle
Prepare temporary registers A, B and Imm and move IR to the
next stage
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
3.IR ← 2.IR
How about 3.NPC ? Soon, we will see that !
Execute (EX) Cycle for LW/SW Instructions
Calculate the effective address
4.ALUout ← 3.A + 3.Imm
We should not forget to move 3.IR to the next stage
4.IR ← 3.IR
How about 4.TA, 4.Zero and 4.B ? Soon, we will see them !
Memory access/branch completion (MEM) Cycle for LW
Instructions
Read the data from memory
5.MDR ← M[4.ALUout]
We should not forget to move 4.IR to the next stage
5.IR ← 4.IR
How about 5.ALUout ? Soon, we will see that !
Write-back (WB) Cycle for LW instructions
Transfer MDR to a GPR register
GPR[5.IR.Rt] ← 5.MDR
The LW takes 5 clock periods to execute : CPILW = 5
Memory access/branch completion (MEM) Cycle for SW
instructions
The effective address is in 4.ALUout
Where is the data to store ?
It is in 3.B
We did not transfer 3.B to 4.B !
Execute (EX) Cycle for LW/SW Instructions
Calculate the effective address
4.ALUout ← 3.A + 3.Imm
We should not forget to move 3.IR to the next stage
4.IR ← 3.IR
Transfer 3.B to 4.B
4.B ← 3.B
How about 4.TA and 4.Zero ? Soon, we will see that !
Memory access/branch completion (MEM) Cycle for SW
Instructions
Write 4.B to the memory pointed by 4.ALUout
M[4.ALUout] ← 4.B
The SW takes 4 clock periods to execute : CPISW = 4
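The LW and SW microoperations can be traced with a short sketch that carries one instruction from latch set to latch set. The function name, the dictionary-based latches and the simplified instruction fields (Rs, Rt, Imm) are assumptions made for illustration; only one instruction is followed, so no overlap between instructions is modeled.

```python
def run_lw_sw(opcode, rs, rt, imm, gpr, mem):
    """Follow one LW or SW through the stages; return its clock period count."""
    ir = {"opcode": opcode, "Rs": rs, "Rt": rt, "Imm": imm}

    # IF: 2.IR <- M[PC] (the fetched instruction is passed in directly)
    latch2 = {"IR": ir}

    # ID: 3.A <- GPR[Rs], 3.B <- GPR[Rt], 3.Imm <- Imm, 3.IR <- 2.IR
    latch3 = {"A": gpr[rs], "B": gpr[rt], "Imm": imm, "IR": latch2["IR"]}

    # EX: 4.ALUout <- 3.A + 3.Imm, 4.B <- 3.B, 4.IR <- 3.IR
    latch4 = {"ALUout": latch3["A"] + latch3["Imm"],
              "B": latch3["B"],
              "IR": latch3["IR"]}

    # MEM for SW: M[4.ALUout] <- 4.B, the instruction is complete
    if opcode == "SW":
        mem[latch4["ALUout"]] = latch4["B"]
        return 4

    # MEM for LW: 5.MDR <- M[4.ALUout], 5.IR <- 4.IR
    latch5 = {"MDR": mem[latch4["ALUout"]], "IR": latch4["IR"]}

    # WB for LW: GPR[5.IR.Rt] <- 5.MDR
    gpr[latch5["IR"]["Rt"]] = latch5["MDR"]
    return 5


# Hypothetical register and memory contents
gpr = {8: 0, 9: 100, 19: 7, 20: 200}
mem = {100: 42}
lw_periods = run_lw_sw("LW", rs=9, rt=8, imm=0, gpr=gpr, mem=mem)    # LW R8, 0(R9)
sw_periods = run_lw_sw("SW", rs=20, rt=19, imm=0, gpr=gpr, mem=mem)  # SW R19, 0(R20)
```

The SW needs 4.B because the value read from GPR in ID would otherwise be lost by the time the instruction reaches MEM.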
Execute (EX) Cycle for A/L R-format instructions
Perform the operation specified by the Function field of 3.IR
4.ALUout ← 3.A op 3.B
We have already moved 3.IR to the next stage
4.IR ← 3.IR
How about 4.TA and 4.Zero ? Soon, we will see that !
Memory access/branch completion (MEM) Cycle for A/L R-format Instructions
We could complete the execution of these instructions in this cycle by
transferring 4.ALUout to a GPR register
But, we decide to complete the execution in the WB cycle to help us handle
data hazards better as we will see later
Memory access/branch completion (MEM) Cycle for A/L R-format Instructions
Transfer 4.ALUout and 4.IR to the next stage
5.ALUout ← 4.ALUout
5.IR ← 4.IR
Write-back (WB) Cycle for A/L R-format instructions
We transfer the result from 5.ALUout to a GPR register
GPR[5.IR.Rd] ← 5.ALUout
A/L R-format instructions take 5 clock periods to execute
CPIA/L R-format = 5
Execute (EX) Cycle for BEQ Instructions
We need to store the result of comparing 3.A with 3.B in 4.Zero
We need to calculate the effective address by adding PC and (4 times
the Offset)
But, is PC changed by the instructions behind the BEQ ? Yes !
We should have saved the PC value for the BEQ in a new register, NPC, in the
IF cycle !
Execute (EX) Cycle for BEQ Instructions
We need to study the execution of Branch instructions more
carefully
400600  BEQ  R8, R9, 4        ; Branch to 400614 if R8 = R9
400604  ADD  R10, R11, R12
400608  SUB  R13, R14, R15
40060C  XOR  R16, R17, R18
400610  SLT  R19, R20, R21
400614  AND  R22, R23, R24
When the BEQ is in its EX stage, PC is 400608
There is a Problem !
We detect that there is a BEQ in the beginning of its ID cycle (clock period 2)
We then immediately stop the IF stage from fetching any instruction and stop adding 4 to PC
[Figure: pipeline occupancy by clock period. Clock period 1 (PC = 400600): BEQ in IF. Clock period 2 (PC = 400604): BEQ in ID. Clock period 3 (PC = 400604): BEQ in EX.]
Execute (EX) Cycle for BEQ Instructions
We know we have a BEQ in the ID stage when we decode it
PC is 400604 when the BEQ is in ID
When the Branch reaches EX, it expects to have PC = 400604
What shall we do ?
We decide to have a new register to keep the PC value for the BEQ :
NPC (New PC)
We save the PC value for the BEQ in NPC in the IF stage
So 400604 moves with the BEQ into the EX stage
When the ID stage detects a BEQ
It stops the IF stage from fetching the next instruction
We also have to stop incrementing PC so that if the condition is not
satisfied, we execute the instruction following the BEQ
This is the instruction in location 400604
We should not execute the instruction 400608 after we execute the BEQ
Execute (EX) Cycle for BEQ instructions
We change the IF and ID stages to include transfers to 2.NPC
and 3.NPC
The EX stage for the BEQ is like this
4.IR ← 3.IR
4.Zero ← If 3.A = 3.B then 1
4.TA ← 3.NPC + (3.Imm * 4)
3.NPC has 400604
Now, we have the correct PC value on 3.NPC in the EX stage
But, when do we write to PC so that we can branch ?
Execute (EX) Cycle for BEQ instructions
We write to PC the clock period after the BEQ is in EX
We write to PC in the IF stage when it is clock period 4
[Figure: pipeline occupancy by clock period. Clock periods 1 to 3 (PC = 400600, 400604, 400604): BEQ in IF, ID and EX. Clock period 4 (PC = 400604): no instruction is fetched. Clock period 5 (PC = 400614): AND in IF.]
The IF stage then changes PC and NPC if 4.Zero is 1
PC ← If (4.Zero) then 4.TA else if (2.IR.opcode ≠ BEQ) then PC + 4
2.NPC ← If (4.Zero) then 4.TA else if (2.IR.opcode ≠ BEQ) then PC + 4
We also need to clear 4.Zero so that a new Branch can be executed
4.Zero ← If (4.Zero) then 0
Execute (EX) Cycle for BEQ Instructions
What shall we do with ADD, SUB and XOR ?
They should not be fetched until we know the BEQ result !
[Figure: pipeline occupancy by clock period. Clock periods 1 to 3 (PC = 400600, 400604, 400604): BEQ in IF, ID and EX. Clock period 4 (PC = 400604): NOP in IF. Clock period 5 (PC = 400614): AND in IF.]
If the ID stage has a BEQ we stop the instruction fetch from
the memory
But, we also have to clear 2.IR if it has a BEQ so that we fetch an
instruction in the next clock period (clock period 5) : 4.IR has the
BEQ in the 4th clock period
2.IR ← If 4.IR.opcode = BEQ then NOP
else if (2.IR.opcode ≠ BEQ) then M[PC]
Execute (EX) Cycle for BEQ Instructions
What if we continued with the ADD, SUB and XOR ?
Would they change any architectural register or memory ?
NO ! Since we arranged the pipeline such that all register
writes and memory writes happen at the end of the pipeline
By the time we know we have a BEQ, we stop them and flush them
out
RISC architectures result in late writes that help the
hardware designer
CISC architectures often require early writes in the pipeline
The hardware designer has to undo these early writes when a
branch is finally recognized
Unnecessary pressure on the hardware designer
Execute (EX) Cycle for BEQ Instructions
Stopping the fetches, how does the execution look ?
[Figure: pipeline occupancy by clock period. Clock periods 1 to 3 (PC = 400600, 400604, 400604): BEQ in IF, ID and EX. Clock period 4 (PC = 400604): NOP in IF. Clock period 5 (PC = 400614): AND in IF.]
The pipeline is almost empty with only one instruction in the WB stage!
There is only one instruction in the pipeline
This is why Control instructions are important to deal with for pipelines
Execute (EX) Cycle for BEQ instructions
Showing the timing in a different way
Address  Instruction         1    2    3    4    5    6    7    8    9
400600   BEQ R8, R9, 4       IF   ID   EX
400604   ADD R10, R11, R12
400608   SUB R13, R14, R15
40060C   XOR R16, R17, R18
400610   SLT R19, R20, R21
400614   AND R22, R23, R24                       IF   ID   EX   MEM  WB
A pipeline bubble is generated
The Branch causes a pipeline start-up !
Execute (EX) Cycle for BEQ Instructions
In the 4th clock period we complete the
execution of the BEQ by writing the effective
address to PC in IF
The control unit knows we are completing the BEQ
instruction and so does not allow an instruction
fetch
Let’s rewrite microoperations for the BEQ
IF stage
2.IR ← If 4.IR.opcode = BEQ then NOP
else if (2.IR.opcode ≠ BEQ) then M[PC]
PC ← If (4.Zero) then 4.TA
else if (2.IR.opcode ≠ BEQ) then PC + 4
2.NPC ← If (4.Zero) then 4.TA
else if (2.IR.opcode ≠ BEQ) then PC + 4
4.Zero ← If (4.Zero) then 0
ID stage
3.NPC ← 2.NPC
EX stage
4.IR ← 3.IR
4.Zero ← If 3.A = 3.B then 1
4.TA ← 3.NPC + (3.Imm * 4)
The BEQ execution completes in the IF stage in the
next clock period
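The IF-stage microoperations for the BEQ can be read as a next-state function. The sketch below is one possible reading, not the actual control logic: the State fields and the fetch callback are hypothetical names, and registers are assumed to hold their old values when no condition of a microoperation applies.

```python
from dataclasses import dataclass

BEQ, NOP = "BEQ", "NOP"

@dataclass
class State:
    pc: int
    ir2_opcode: str   # opcode currently in 2.IR
    ir4_opcode: str   # opcode currently in 4.IR
    npc2: int         # 2.NPC
    zero4: int        # 4.Zero
    ta4: int          # 4.TA

def if_stage(s, fetch):
    """Return the values (2.IR, PC, 2.NPC, 4.Zero) stored at the end of the period."""
    # 2.IR <- If 4.IR.opcode = BEQ then NOP else if (2.IR.opcode != BEQ) then M[PC]
    if s.ir4_opcode == BEQ:
        ir2 = NOP
    elif s.ir2_opcode != BEQ:
        ir2 = fetch(s.pc)
    else:
        ir2 = s.ir2_opcode          # assumption: hold while a BEQ is in 2.IR

    # PC and 2.NPC <- If (4.Zero) then 4.TA else if (2.IR.opcode != BEQ) then PC + 4
    if s.zero4:
        pc, npc2 = s.ta4, s.ta4
    elif s.ir2_opcode != BEQ:
        pc, npc2 = s.pc + 4, s.pc + 4
    else:
        pc, npc2 = s.pc, s.npc2     # assumption: hold, fetching is stopped

    # 4.Zero <- If (4.Zero) then 0 (in both cases the stored value is 0)
    zero4 = 0
    return ir2, pc, npc2, zero4


def fetch(address):
    # Hypothetical instruction memory: only opcodes are modeled
    return {0x400604: "ADD", 0x400614: "AND"}[address]

# Clock period 2: the BEQ is in 2.IR, so the fetch and the PC update are stopped
held = if_stage(State(0x400604, BEQ, "?", 0x400604, 0, 0), fetch)
# Clock period 4: the BEQ is in 4.IR and the branch is taken
taken = if_stage(State(0x400604, BEQ, BEQ, 0x400604, 1, 0x400614), fetch)
```

With these inputs, PC stays at 400604 while the BEQ is in ID, and it receives 400614 together with a NOP in 2.IR once 4.Zero is 1.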
BEQ instructions take 4 clock periods to execute
CPIBranch = 4
Since, the Branch execution is completed in the IF stage
Overall, executing a control instruction first
creates a pipeline bubble and then causes a
pipeline start-up where only one stage, IF, is
busy
It is therefore critical that the number of control
instructions be reduced by having
Better programming styles
Better compilers
Cautions for the Pipelined EMY CPU
With pipelining and memory hierarchies hardware has
become more sensitive to
The number of instructions, NI (due to increased memory
hierarchy delays)
The number of control instructions (due to pipeline and
memory hierarchy delays that can occur)
Now we see why the pipeline is sensitive to control instructions
The order of instructions (due to pipeline delays that can
occur)
Class notes on the remaining versions will show examples why
the pipeline is sensitive to a certain order of instructions
What is Pipelining ?
Before we continue with the evaluation of our
design, a comment :
Pipelining is often invisible to the programmer,
though current architectures allow some visibility
to help improve pipeline performance
For example, knowing the pipeline length and how many
clock periods complex microoperations take helps the
compiler to come up with more efficient code
This is because a better order of instructions can be
obtained
This is a point made earlier
Pipelined Execution Timing
The execution of the code on the Version 1 EMY
pipeline is shown again below by assuming that the
cache memories take one clock period and there is no
miss
Address  Instruction         IF   ID   EX   MEM  WB
400200   LW  R8, 0(R9)       1    2    3    4    5
400204   ADD R10, R11, R12   2    3    4    5    6
400208   SUB R13, R14, R15   3    4    5    6    7
40020C   XOR R16, R17, R18   4    5    6    7    8
400210   SW  R19, 0(R20)     5    6    7    8
400214   OR  R21, R22, R23   6    7    8    9    10
400218   SLT R24, R25, R26   7    8    9    10   11
40021C   BEQ R27, R28, 5     8    9    10
Assume that our integer-instruction pipeline can execute the
XOR, SLT, etc.
It takes 11 clock periods to run the code
Note that the Branch completes in clock period 11 also !
CS 2214
Haldun Hadimioglu
CSE – Spring 2014
EMY CPU Version 1
97
The Speed Comparison
The piece of program takes 11 clock periods on the
pipelined computer as opposed to 33 clock periods on
the unpipelined
$$\text{Speedup}_{overall} = \frac{\text{CPUtime}_{old}}{\text{CPUtime}_{new}} = \frac{33}{11} = 3$$
$$\text{CPI}_{ave\ w/o\ pipe} = \frac{\text{\# of clock periods for program}}{NI} = \frac{32}{8} = 4$$
$$\text{CPI}_{ave\ w/\ pipe} = \frac{\text{\# of clock periods for program}}{NI} = \frac{11}{8} = 1.375$$
In general any pipeline will work fine if
Every instruction is independent of every other instruction in
the pipeline at any moment
Otherwise, we have what we call hazards as we will see soon
The number of control instructions is very small
The order of instructions is good
Otherwise, we have what we call hazards as we will see soon
There is a lot of hardware available
In the ideal case, CPIave ≡ the number of pipeline stages
In the ideal case, NI ≡ # of clock periods for the
program
Speedupoverallideal = pipeline depth (the number of pipeline
stages)
Ideal MIPS
$$\text{MIPS}_{ideal} = \frac{\text{\# of instructions completed per clock period} \times \text{clock frequency}}{10^{6}}$$
If the CPU completes one instruction per clock period
$$\text{MIPS}_{ideal} = \frac{\text{clock frequency}}{10^{6}}$$
We now see why microprocessor companies are eager to
increase the clock frequency !
Pipeline Timing
Due to start-ups and hazards, CPIave is not 1
The net effect of start-ups and hazards is that more
than one clock period is needed to execute an
instruction on average
The additional clock periods are the
average delay cycles per instruction (we will soon call them
stalls)
$$\text{CPI}_{ave} = \text{CPI}_{ave\ ideal} + \text{pipeline delays (stalls) per instruction}$$
Pipeline Timing
Since the ideal CPIave with pipelining is 1, we obtain
the following formula
$$\text{Speedup}_{overall} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
It is clear from the above formula that the speedup
is directly proportional to the number of pipeline
stages
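As a quick numeric check of the formula, here is a small sketch. The function name and the sample stall rate of 0.25 are illustrative assumptions, not measurements from the slides.

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """Speedup_overall = pipeline depth / (1 + stall cycles per instruction)."""
    return depth / (1 + stalls_per_instruction)

ideal = pipeline_speedup(5, 0.0)          # no stalls: speedup equals the depth
with_stalls = pipeline_speedup(5, 0.25)   # hypothetical 0.25 stall cycles per instruction
deeper = pipeline_speedup(10, 0.25)       # same stall rate, twice the depth
```

For a fixed stall rate, doubling the depth doubles the speedup in this formula, which is the proportionality stated above.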
Pipeline Timing
Example : Assume that a program with no control instructions is
run and the following measurements are made on the MIPS
Instruction   CPIi   # of times executed   Unpipelined time
Loads         5      10                    0.25 μsec
A/L           5      90                    2.25 μsec
Calculate CPIave and CPUtime for both unpipelined and pipelined
cases and Speedupoverall, the pipeline efficiency and MIPSideal for
the pipelined case
Assume that clock frequency is 200MHz
Note that this program is an ideal program since there is no Store
instruction !
NI = # of Loads + # of A/L = 10 + 90 = 100
$$\text{Clock period} = \frac{1}{\text{Clock frequency}} = \frac{1}{200 \times 10^{6}} = 5 \times 10^{-9}\ \text{s} = 5\ \text{ns}$$
Pipeline Timing
Example continued
For the unpipelined case :
CPUtimeunpipelined = TimeLoads + TimeA/L = 0.25 + 2.25 = 2.5 μsec
# of clock periods for Loads = # of times executed x CPIi
= 10 x 5 = 50
# of clock periods for A/L = # of times executed x CPIi
= 90 x 5 = 450
# of clock periods for program = # of clock periods for Loads
+ # of clock periods for A/L = 50 + 450 = 500
$$\text{CPI}_{ave\ w/o\ pipe} = \frac{\text{\# of clock periods for program}}{NI} = \frac{500}{100} = 5$$
Pipeline Timing
Example continued
For the pipelined case :
# of clock periods for program = Start-up time + (NI – 1)
= 5 + (100 – 1) = 104
CPUtimepipelined = # of clock periods for program x clock period
= 104 x 5 = 520ns = 0.52 μsec
$$\text{Speedup}_{overall} = \frac{\text{CPUtime}_{old}}{\text{CPUtime}_{new}} = \frac{2.5}{0.52} = 4.81$$
$$\text{Pipeline efficiency} = \frac{\text{Speedup}_{overall}}{\text{Speedup}_{ideal}} = \frac{4.81}{5} = 0.96$$
$$\text{MIPS}_{ideal} = \frac{\text{clock frequency}}{10^{6}} = \frac{200 \times 10^{6}}{10^{6}} = 200$$
Speedupoverall is not 5 because of the startup time....
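The arithmetic of this example can be reproduced with a few lines. This is only a sketch; the variable names are illustrative, and the pipelined CPIave is computed here as # of clock periods / NI.

```python
clock_frequency_hz = 200e6
clock_period_ns = 1e9 / clock_frequency_hz            # 5 ns

counts = {"Loads": 10, "A/L": 90}                     # # of times executed
cpi = {"Loads": 5, "A/L": 5}                          # unpipelined CPIi
ni = sum(counts.values())                             # NI = 100

# Unpipelined case
unpipelined_periods = sum(counts[k] * cpi[k] for k in counts)
cpi_ave_unpipelined = unpipelined_periods / ni
cpu_time_unpipelined_us = unpipelined_periods * clock_period_ns / 1000

# Pipelined case: start-up time (pipeline depth) plus one period per remaining instruction
depth = 5
pipelined_periods = depth + (ni - 1)
cpi_ave_pipelined = pipelined_periods / ni
cpu_time_pipelined_us = pipelined_periods * clock_period_ns / 1000

speedup_overall = cpu_time_unpipelined_us / cpu_time_pipelined_us
pipeline_efficiency = speedup_overall / depth
mips_ideal = clock_frequency_hz / 1e6
```

The results match the hand calculation: 500 and 104 clock periods, 2.5 and 0.52 μsec, a speedup of about 4.81 and an efficiency of about 0.96.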
Improving Initial Version 1 Design
Now, we will make an assessment of pipelining to
prepare ourselves for next set of improvements
Pipelining increases the speed but there are difficulties and
problems associated with pipelining :
The hardware is complicated
Additional temporary registers (latches) are needed between
stages so that later stages can correctly work on an instruction
Some latches are simple duplication of other registers and some
are latches that save the output of a stage.
Improving Initial Version 1 Design
Pipelining increases the speed but there are
difficulties and problems associated with pipelining :
The pressure on the memory is doubled : two memory
accesses per clock period happen
One for instruction in the IF stage
One for data in the MEM stage
For example, for the program execution on slide 97 (Pipelined Execution Timing), the CPU
makes two memory accesses in the 4th clock period
The frequency of simultaneous accesses depends on the number
of Loads and Stores
The number of Loads and Stores depends on the application,
programmer and compiler
Improving Initial Version 1 Design
Pipelining increases the speed but there are
difficulties and problems associated with pipelining :
Not all instructions require all the stages
Some stages are empty, idle, creating a pipeline bubble that
cannot be avoided
RISC instructions require fewer stages, therefore the chance of
having many unneeded stages is reduced
With CISC, the number of stages is larger
Improving Initial Version 1 Design
Pipelining increases the speed but there are
difficulties and problems associated with pipelining :
The startup time slows the system
Its impact is based on
The number of times it occurs (due to control instructions)
The time it takes to fill the pipeline (pipeline depth or latency)
RISC systems perform better here since they have shorter
pipelines
Improving Initial Version 1 Design
Pipelining increases the speed but there are
difficulties and problems associated with pipelining :
Some instructions have complex microoperations that take
longer than one clock period to complete
Overall, it is difficult to have balanced stages
Improving Initial Version 1 Design
Pipelining increases the speed but there are
difficulties and problems associated with pipelining :
The clock period is determined by
The slowest stage which is often the stage with the addition
and the stages with memory accesses
The EX stage
The IF and MEM stages
The latches that need set up time and propagation delays
The clock skew problem
In RISC systems it is easy to distribute the work equally to
stages but with CISC it is more difficult
So, in order not to increase the clock period length in CISC
systems, a stage that has a complex microoperation takes
more than one clock period
But, this creates bubbles !
Improving Initial Version 1 Design
Pipelining increases the speed but there are
difficulties and problems associated with pipelining :
Because of what we call hazards, an instruction in the
pipeline may not be moved to the next stage but forced to
stay in the same stage more than one clock period
The instruction is stalled
The stages to the left of the stalled instruction cannot move
their instruction to the right to keep the strict order of
execution
These stages become idle (do not work on new instruction) but
keep the old instructions
This creates a pipeline bubble : The speed is decreased.
Note that the startups also decrease the speed since there is
a larger bubble in the pipeline
Control instructions result in startups
Pipeline “hazards” also create startups if poorly designed
Pipeline Hazards
They are caused by a number of reasons
forcing the pipeline to stop the execution of
an instruction and the instructions that are
behind
The instructions are stalled
The hazards generate either bubbles or a start-up of
the pipeline.
There are three types of hazards
Structural
Data
Control
Structural Hazards
Structural hazards occur from resource conflicts
that can be solved with more resources, i.e. more or
faster hardware
Examples of structural hazards are
Only one memory port in the CPU which stops the IF stage if
a Load/Store is using this single memory port to access data
in the MEM stage
An L1 cache memory that takes two or more clock periods
If the GPR set has only one write port and several
simultaneous GPR writes are performed, only one GPR write
will happen, the others will write one by one
If a stage performs a complex microoperation taking several
clock periods, such as FP arithmetic, and this microoperation
is not pipelined, then instructions behind it will stay idle in
their stages (these instructions are stalled)
Structural Hazards
Due to a structural hazard, one or more instructions
behind the instruction that caused the hazard are
delayed : they are not allowed to move.
The stages behind the hazard causing instruction become
idle : A bubble is generated
The bubble moves one stage per clock period and eventually
leaves the pipeline.
Structural Hazards
An example
What if there was only one memory port ?
If a Load or Store tries to access a data element in the
memory in the MEM cycle, then, the IF stage is forced
to stay idle by the control unit so that the priority is
given to the instruction already in the pipeline to
complete it as soon as possible
The instruction that was going to be fetched is stalled
A bubble is created in the IF stage
The bubble moves up the pipeline one stage per clock
period
Stalling ends when the Load/Store completes the memory
access
Next slide shows this process
Structural Hazards
What if there was only one memory port ?
Address  Instruction         1     2     3     4     5     6     7     8     9     10    11
400200   LW  R8, 0(R9)       IF    ID    EX    MEM   WB
400204   ADD R10, R11, R12         IF    ID    EX    MEM   WB
400208   SUB R13, R14, R15               IF    ID    EX    MEM   WB
40020C   XOR R16, R17, R18                     Stall IF    ID    EX    MEM   WB
400210   SW  R19, 0(R20)                                   IF    ID    EX    MEM
400214   OR  R21, R22, R23                                       IF    ID    EX    MEM   WB
400218   SLT R24, R25, R26                                             IF    ID    EX    MEM
40021C   BEQ R27, R28, 5                                                     Stall IF    ID
A bubble is created and moves up the pipeline
Structural Hazards
What if there was only one memory port ?
We will avoid using textbook notation of instruction
execution since even for a few instructions, a large space is
needed to show the flow of execution
Rather, we will use our own notation shown below
Address  Instruction         IF   ID   EX   MEM  WB
400200   LW  R8, 0(R9)       1    2    3    4    5
400204   ADD R10, R11, R12   2    3    4    5    6
400208   SUB R13, R14, R15   3    4    5    6    7
40020C   XOR R16, R17, R18   5    6    7    8    9
400210   SW  R19, 0(R20)     6    7    8    9
400214   OR  R21, R22, R23   7    8    9    10   11
400218   SLT R24, R25, R26   8    9    10   11   12
40021C   BEQ R27, R28, 5     10   11   12
XOR is fetched in the 5th clock period, not in the 4th clock period
XOR is delayed, stalled, in clock period 4 by the LW accessing the memory for its data
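The fetch delays caused by a single memory port can be reproduced with a small scheduling sketch. The function name and the list-based model are assumptions; the sketch only models the rule that the IF stage may not use the memory in a clock period in which an earlier Load/Store is in its MEM cycle, and it ignores branch handling and other hazards.

```python
def fetch_periods(opcodes, single_port):
    """Return the clock period in which each instruction is fetched (IF)."""
    mem_busy = set()    # clock periods in which a Load/Store accesses data
    periods = []
    period = 0
    for op in opcodes:
        period += 1
        # With one memory port, the fetch waits while the port is used for data
        while single_port and period in mem_busy:
            period += 1
        periods.append(period)
        if op in ("LW", "SW"):
            mem_busy.add(period + 3)    # MEM is three stages after IF
    return periods

program = ["LW", "ADD", "SUB", "XOR", "SW", "OR", "SLT", "BEQ"]
two_ports = fetch_periods(program, single_port=False)
one_port = fetch_periods(program, single_port=True)
```

With one port, the XOR fetch slips from clock period 4 to 5 because of the LW, and the BEQ fetch slips from 9 to 10 because of the SW.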
Structural Hazards
What if there was only one memory port ?
The control unit stops the IF stage from accessing
the memory to fetch the XOR
The reason is that we want to complete the execution of
the LW that is already in the pipeline
Instructions in the pipeline have higher priority for
completion
The SW instruction will access the memory in the 9th
clock period to write data
There will not be an instruction fetch in the 9th clock
period
Once a stall occurs, a bubble is introduced : not all the
stages are busy
The execution time of the instruction is increased ≡ its CPIi is
increased
CPIave is increased
CPUtime is increased
Structural Hazards
What if there was only one memory port ?
We will not have this structural hazard in our system
It is also clear from the Version 1 datapath diagram that we
have two separate memory ports
Memory Port 1 for instruction fetches
Memory Port 2 for data accesses
Hazards
Structural Hazards
Often, to solve structural hazards more or faster
hardware is needed
However, the solution of the other two
hazards, data and control hazards, requires
More hardware and
Better compilation techniques
To better order instructions
To reduce the number of control instructions
The result is that
Pipeline bubbles are eliminated or reduced
The number of pipeline start-ups is also reduced
Hazards
The overall hardware structure that detects a
hazard and stops (stalls) an instruction or several
instructions until the hazard condition does not exist
is called pipeline interlock
Note that if an instruction is stalled, the instructions behind
it are also stalled as we will see shortly
Thus, it is costly to stall a single instruction in the pipeline
Data Hazards
As mentioned before all previous program examples
had instructions independent of each other
The instructions did not have any register or memory
location in common
For example, an instruction wrote to R10 and the next
instruction did not read R10
The second instruction did not depend on the first instruction
There is no data dependency between them
There are other types of data dependencies as we will see
shortly
If two instructions have data dependency between them and
they are in the pipeline there can be a data hazard
Let’s see the definition on the next slide
Data Hazards
Data hazards occur between two instructions
which are executed close enough in time and
there is writable data shared by them
That is there is a data dependency between
two instructions and the correct result will
occur only if the execution is confined to the
sequential rather than pipelined execution to
enforce the right order of access to the
shared data
The second instruction cannot be executed in a
pipelined fashion
It has to wait, stall !
This is sequential (unpipelined) execution then
Data Hazards
If we change the instruction sequence of the
previous code to include dependency, there will be
data hazards
400200  LW   R8, 0(R9)
400204  ADD  R10, R11, R12
400208  SUB  R13, R14, R10
40020C  XOR  R16, R17, R10
400210  SW   R19, 0(R10)
400214  OR   R21, R22, R10
400218  SLT  R24, R25, R10
40021C  BEQ  R27, R10, 5
We observe that the ADD writes to R10 and the
instructions below the ADD read R10
The ADD and the remaining instructions are executed close
in time
Can there be data hazards among them ?
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
400200  LW   R8, 0(R9)
400204  ADD  R10, R11, R12
400208  SUB  R13, R14, R10    RAW ?
40020C  XOR  R16, R17, R10    RAW ?
400210  SW   R19, 0(R10)      RAW ?
400214  OR   R21, R22, R10    RAW ?
400218  SLT  R24, R25, R10    RAW ?
40021C  BEQ  R27, R10, 5      RAW ?
The data element in R10 is shared by all the instructions below
the ADD and they are executed close in time
An instruction, I1, writes to a register and another instruction, I2,
reads the same register (the data element)
I1 has to write first and then I2 has to read : There is a Read
after Write (RAW) dependency
BUT, if I2 reads before I1 writes then there is a RAW hazard
Can I2 read before I1 writes ? Yes
We have to stop I2 if it tries to read R10 before the ADD writes to R10
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
400200  LW   R8, 0(R9)
400204  ADD  R10, R11, R12
400208  SUB  R13, R14, R10    RAW
40020C  XOR  R16, R17, R10    RAW
400210  SW   R19, 0(R10)      RAW
400214  OR   R21, R22, R10
400218  SLT  R24, R25, R10
40021C  BEQ  R27, R10, 5
There are data dependencies, but are they all data hazards ?
Will all the instructions below the ADD try to read R10 before the
ADD writes ? NO !
Soon we will see that data hazards will happen between the ADD and
SUB, XOR and SW
SUB, XOR and SW will try to read R10 before the ADD writes to R10
The OR, SLT and BEQ will read R10 after the ADD writes to R10
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
All RAW
400200  LW   R8, 0(R9)
400204  ADD  R10, R11, R12
400208  SUB  R13, R14, R10
40020C  XOR  R16, R17, R10
400210  SW   R19, 0(R10)
400214  OR   R21, R22, R10
400218  SLT  R24, R25, R10
40021C  BEQ  R27, R10, 5
SUB, XOR and SW will try to read R10 before the ADD writes
to R10
These data dependencies result in data hazards
This data hazard is one of three types of data hazards
We will stall SUB, XOR and SW when they try to read R10
An instruction, I1, writes to a register and another instruction, I2, reads
the same register (the same data element)
I1 has to write first and then I2 has to read : Read after Write (RAW)
If I2 reads before I1 writes there is a RAW hazard
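The RAW dependencies on the ADD can be found mechanically. The following is a minimal Python sketch, not part of the EMY material: the tuple encoding (mnemonic, destination register or None, source registers) and the function name raw_dependents are assumptions made for illustration. It lists dependencies only; whether each one becomes a hazard depends on the pipeline timing discussed on the slides.

```python
# Hypothetical encoding for this sketch: (mnemonic, destination or None, sources).
# SW and BEQ write no GPR, so their destination is None.
PROGRAM = [
    ("LW",  "R8",  ["R9"]),
    ("ADD", "R10", ["R11", "R12"]),
    ("SUB", "R13", ["R14", "R10"]),
    ("XOR", "R16", ["R17", "R10"]),
    ("SW",  None,  ["R19", "R10"]),
    ("OR",  "R21", ["R22", "R10"]),
    ("SLT", "R24", ["R25", "R10"]),
    ("BEQ", None,  ["R27", "R10"]),
]

def raw_dependents(program, writer_index):
    """Return mnemonics of later instructions that read the writer's destination."""
    dest = program[writer_index][1]
    if dest is None:
        return []
    result = []
    for name, new_dest, sources in program[writer_index + 1:]:
        if dest in sources:
            result.append(name)
        if new_dest == dest:      # a later write ends this dependency chain
            break
    return result

print(raw_dependents(PROGRAM, 1))  # instructions that depend on the ADD
```

For the code on this slide, every instruction below the ADD is reported, which matches the "All RAW" label.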
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R14, R10                IF     ID     Stall  Stall  Stall  EX     MEM    WB
40020C  XOR  R16, R17, R10                       IF     Stall  Stall  Stall  ID     EX     MEM
400210  SW   R19, 0(R10)                                Stall  Stall  Stall  IF     ID     EX
400214  OR   R21, R22, R10                                     Stall  Stall  Stall  IF     ID
400218  SLT  R24, R25, R10                                            Stall  Stall  Stall  IF
40021C  BEQ  R27, R10, 5                                                     Stall  Stall  Stall
Why do we stall the SUB in the ID stage ?
We stall the SUB for 3 clock periods since it needs R10. This creates a
3-clock-period bubble that moves up the pipeline
XOR is fetched and idling in the IF stage
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
We stalled the SUB in the ID stage since it reads its operands in
ID
The SUB reads its operands R14 and R10 in the ID stage
This is clock period 4
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R14, R10                IF     ID     Stall  Stall  Stall  EX     MEM    WB
40020C  XOR  R16, R17, R10                       IF     Stall  Stall  Stall  ID     EX     MEM
400210  SW   R19, 0(R10)                                Stall  Stall  Stall  IF     ID     EX
400214  OR   R21, R22, R10                                     Stall  Stall  Stall  IF     ID
400218  SLT  R24, R25, R10                                            Stall  Stall  Stall  IF
40021C  BEQ  R27, R10, 5                                                     Stall  Stall  Stall
When will the ADD write to R10 ?
In clock period 6 !
When will R10 actually get the new value ?
In the beginning of the 7th clock period !
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
Why does R10 get its new value in the beginning of the 7th clock
period ?
According to the state diagram of Version 1, the ADD writes from
5.ALUout to its destination register in the WB stage
This is clock period 6
Why does R10 get the value in the beginning of the 7th clock period ?
As we discussed before, we clock (store on) our registers at the end of
a clock period and therefore, registers change their values in the
beginning of the next clock period
[Timing diagram: Clock, 5.ALUout and R10 over clock periods 6 and 7. 5.ALUout holds the result of the ADD during clock period 6; R10 holds the result of the ADD from the beginning of clock period 7]
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
In summary, the SUB is stalled in the ID stage for three
clock periods
A 3-clock-period long bubble is created and moves up the pipeline
If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
400200  LW   R8, 0(R9)      1      2      3      4      5
400204  ADD  R10, R11, R12  2      3      4      5      6
400208  SUB  R13, R14, R10  3      4/7    8      9      10
40020C  XOR  R16, R17, R10  4/7    8      9      10     11
400210  SW   R19, 0(R10)    8      9      10     11
400214  OR   R21, R22, R10  9      10     11     12     13
400218  SLT  R24, R25, R10  10     11     12     13     14
40021C  BEQ  R27, R10, 5    11     12     13
Pipeline Interlocks
What we are doing is that we check for
hazard situations in the ID stage and when
we recognize a hazard, we stall the
instruction in the ID stage !
If an instruction does not have a hazard situation,
it is allowed to proceed to the EX stage
That is the instruction is issued to the EX stage
If the instruction has a hazard, it is stalled in the
ID stage by the pipeline interlock to preserve the
execution pattern
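The interlock rule can be turned into a small timing calculation. The sketch below is a hedged Python illustration, not the EMY hardware: the tuple encoding and the name schedule are assumptions. It assumes static issuing, no forwarding, and a GPR value that becomes readable in the clock period after the producer's WB, which is the situation described so far.

```python
# Hypothetical encoding for this sketch: (mnemonic, destination or None, sources).
PROGRAM = [
    ("LW",  "R8",  ["R9"]),
    ("ADD", "R10", ["R11", "R12"]),
    ("SUB", "R13", ["R14", "R10"]),
    ("XOR", "R16", ["R17", "R10"]),
    ("SW",  None,  ["R19", "R10"]),
    ("OR",  "R21", ["R22", "R10"]),
    ("SLT", "R24", ["R25", "R10"]),
    ("BEQ", None,  ["R27", "R10"]),
]

def schedule(program):
    """Return (mnemonic, first IF period, first ID period, EX period) per instruction."""
    readable_from = {}        # register -> first period in which ID can read it
    rows = []
    prev_id, prev_ex = 1, 2   # makes the first instruction fetch in 1 and decode in 2
    for name, dest, sources in program:
        # Static issuing: an instruction follows its predecessor stage by stage.
        if_first, id_first = prev_id, prev_ex
        # The interlock holds the instruction in ID until every source is readable.
        last_id = max([id_first] + [readable_from.get(r, 0) for r in sources])
        ex = last_id + 1
        if dest is not None:
            readable_from[dest] = (ex + 2) + 1   # WB is ex + 2, visible one period later
        rows.append((name, if_first, id_first, ex))
        prev_id, prev_ex = id_first, ex
    return rows

for name, if_first, id_first, ex in schedule(PROGRAM):
    # "a/b" gives the first and last clock period spent in the stage.
    print(f"{name:4} IF {if_first}/{id_first - 1}  ID {id_first}/{ex - 1}  EX {ex}")
```

Under these assumptions the SUB occupies ID in periods 4 through 7 and enters EX in period 8, and the XOR waits in IF in periods 4 through 7, in line with the notation table for this code.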
Pipeline Interlocks
If an instruction is stalled in the ID stage,
then the instruction in the IF stage is stalled
That is the instruction behind the stalled
instruction is not allowed to pass by and continue
with its execution
This is called static issuing
Static issuing reduces hardware since we do not have to
keep track of which instruction changed which part of
the state
Because, if an instruction is stalled, it has to update the
state before all instructions that follow it
Pipeline Interlocks
If dynamic issuing is allowed then an instruction in
the IF stage would pass by the stalled instruction in
the ID stage and start its EX cycle
However, dynamic issuing results in other data hazards,
WAR and WAW, to happen as we will discuss later
We need to have hardware not to allow an instruction behind
a stalled instruction to update the state
Can we somehow allow this instruction to proceed ?
Yes, we can allow it to generate its results
But, we have to buffer the results and write them to the
destination after the stalled instruction is finished for
correct execution pattern
We then need additional hardware to keep temporary
results and keep track of instructions’ progress
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
What if SUB does not have a RAW hazard but XOR has ?
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R14, R15                IF     ID     EX     MEM    WB
40020C  XOR  R16, R17, R10                       IF     ID     Stall  Stall  EX     MEM    WB
400210  SW   R19, 0(R10)                                IF     Stall  Stall  ID     EX     MEM
400214  OR   R21, R22, R10                                     Stall  Stall  IF     ID     EX
400218  SLT  R24, R25, R10                                            Stall  Stall  IF     ID
40021C  BEQ  R27, R10, 5                                                     Stall  Stall  IF
We stall the XOR for 2 clock periods and create a 2-clock-period bubble
that moves up the pipeline
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
What if SUB does not have a RAW hazard but XOR has ?
The XOR is in ID in the 5th clock period but has to wait until the 7th
clock period
If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
400200  LW   R8, 0(R9)      1      2      3      4      5
400204  ADD  R10, R11, R12  2      3      4      5      6
400208  SUB  R13, R14, R15  3      4      5      6      7
40020C  XOR  R16, R17, R10  4      5/7    8      9      10
400210  SW   R19, 0(R10)    5/7    8      9      10
400214  OR   R21, R22, R10  8      9      10     11     12
400218  SLT  R24, R25, R10  9      10     11     12     13
40021C  BEQ  R27, R10, 5    10     11     12
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
What if SUB and XOR do not have a RAW hazard but SW has ?
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R14, R15                IF     ID     EX     MEM    WB
40020C  XOR  R16, R17, R18                       IF     ID     EX     MEM    WB
400210  SW   R19, 0(R10)                                IF     ID     Stall  EX     MEM
400214  OR   R21, R22, R10                                     IF     Stall  ID     EX     MEM
400218  SLT  R24, R25, R10                                            Stall  IF     ID     EX
40021C  BEQ  R27, R10, 5                                                     Stall  IF     ID
We stall the SW for 1 clock period and create a 1-clock-period bubble that
moves up the pipeline
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
What if SUB and XOR do not have a RAW hazard but SW has ?
If we show the pipeline in our notation
RAW
                            IF     ID     EX     MEM    WB
400200  LW   R8, 0(R9)      1      2      3      4      5
400204  ADD  R10, R11, R12  2      3      4      5      6
400208  SUB  R13, R14, R15  3      4      5      6      7
40020C  XOR  R16, R17, R18  4      5      6      7      8
400210  SW   R19, 0(R10)    5      6/7    8      9
400214  OR   R21, R22, R10  6/7    8      9      10     11
400218  SLT  R24, R25, R10  8      9      10     11     12
40021C  BEQ  R27, R10, 5    9      10     11
Eliminating Hazards
We will eliminate delays due to RAW hazards
We will write to GPR registers in the WB stage in the first
half of the clock period and read GPR registers in the ID in
the second half of the same clock period
We will add new hardware to eliminate other RAW delays
We will reduce the amount of delay due to control
hazards
By assuming a certain compiler functionality we will eliminate
the control hazard delays completely
However, this compiler functionality is not acceptable in real
life
It does not allow software compatibility as we will see later
Data Hazards
Writing to a GPR in the first half – reading
the same GPR register in the second half of
the same clock period
Consider the timing diagram of writing to R10 in
the 6th clock period again
What if we clock (store on) R10 in the middle of the 6th
clock period where there is a negative edge !?
That is, what if we do not write at the end of the 6th
clock period, but the middle ?
This is possible by using negative-edge triggered GPR
registers
So, we write from 5.ALUout to R10 in the middle of
the clock period !
Data Hazards
Writing to a GPR in the first half – reading the same
GPR register in the second half of the same clock
period
OK, we write in the first half, can we read the same register
in the second half ?
Yes, reading means getting the value from R10 in the second
half and storing it on the destination register at the end of the
same clock period when there is a positive edge
We read from GPR registers and store on temporary registers
3.A and 3.B in the ID stage
In this specific example R10 is stored on 3.B for the SUB
instruction
This will save one clock period for us
From now on the GPR registers are clocked by
negative edges and the other registers are clocked
at positive edges
Data Hazards
Writing to a GPR in the first half – reading the same GPR
register in the second half of the same clock period
Let’s visualize what happens in clock periods 5, 6 and 7
[Timing diagram: Clock, 5.ALUout, R10 and 3.B over clock periods 5, 6 and 7, showing the result of the ADD moving from 5.ALUout to R10 and then to 3.B]
In the 6th clock period R10 has its new value and is transferred to 3.B
Therefore, the SUB can be in EX in the 7th clock period to use 3.B
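The saving of one clock period can be checked with a little arithmetic. The following Python sketch is an illustration under stated assumptions, not part of the EMY design: the function name raw_stalls and its parameters are hypothetical, the producer is an ALU instruction fetched in clock period 1, there is no forwarding, and no other stall affects the consumer.

```python
def raw_stalls(distance, split_cycle):
    """Stall cycles for a consumer placed `distance` instructions after the producer.

    split_cycle=False: GPRs are written at the end of WB, readable one period later.
    split_cycle=True:  GPRs are written in the first half of WB and can be read
                       in ID in the second half of the same clock period.
    """
    producer_wb = 5                       # IF=1, ID=2, EX=3, MEM=4, WB=5
    consumer_id = 1 + distance + 1        # first clock period the consumer is in ID
    first_readable = producer_wb if split_cycle else producer_wb + 1
    return max(0, first_readable - consumer_id)

for d in (1, 2, 3, 4):
    print(d, raw_stalls(d, split_cycle=False), raw_stalls(d, split_cycle=True))
```

With distance 1 (the SUB right after the ADD) the stall drops from 3 to 2 clock periods, with distance 2 from 2 to 1, and with distance 3 from 1 to 0.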
Data Hazards
We will draw short lines in the WB and ID stages to indicate that
the RAW hazard has been resolved by the write-in-first-half-read-in-the-second-half feature
Writing to a GPR in the first half – reading the same GPR
register in the second half of the same clock period
Let’s see the new execution flow
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R14, R10                IF     ID     Stall  Stall  EX     MEM    WB
40020C  XOR  R16, R17, R10                       IF     Stall  Stall  ID     EX     MEM    WB
400210  SW   R19, 0(R10)                                Stall  Stall  IF     ID     EX     MEM
400214  OR   R21, R22, R10                                     Stall  Stall  IF     ID     EX
400218  SLT  R24, R25, R10                                            Stall  Stall  IF     ID
40021C  BEQ  R27, R10, 5                                                     Stall  Stall  IF
We stall the SUB for 2 clock periods and create a 2-clock-period bubble that
moves up the pipeline
Data Hazards
Writing to a GPR in the first half – reading the same GPR
register in the second half of the same clock period
If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
400200  LW   R8, 0(R9)      1      2      3      4      5
400204  ADD  R10, R11, R12  2      3      4      5      6
400208  SUB  R13, R14, R10  3      4/6    7      8      9
40020C  XOR  R16, R17, R10  4/6    7      8      9      10
400210  SW   R19, 0(R10)    7      8      9      10
400214  OR   R21, R22, R10  8      9      10     11     12
400218  SLT  R24, R25, R10  9      10     11     12     13
40021C  BEQ  R27, R10, 5    10     11     12
We will draw short lines in the WB and ID stages to indicate that the RAW hazard
has been resolved by the write-in-first-half-read-in-the-second-half feature
Data Hazards
Writing to a GPR in the first half – reading the same GPR register in
the second half of the same clock period
Will this help if SUB does not have a RAW hazard but XOR has ? YES !
All RAW
                            IF     ID     EX     MEM    WB
400200  LW   R8, 0(R9)      1      2      3      4      5
400204  ADD  R10, R11, R12  2      3      4      5      6
400208  SUB  R13, R14, R15  3      4      5      6      7
40020C  XOR  R16, R17, R10  4      5/6    7      8      9
400210  SW   R19, 0(R10)    5/6    7      8      9
400214  OR   R21, R22, R10  7      8      9      10     11
400218  SLT  R24, R25, R10  8      9      10     11     12
40021C  BEQ  R27, R10, 5    9      10     11
We saved one clock period !
Note that the GPR registers are always written in the middle of the clock
period ! We show the short lines when this feature helps a RAW hazard !
Data Hazards
Writing to a GPR in the first half – reading the same GPR register in
the second half of the same clock period
Will this help if SUB and XOR do not have a RAW hazard but SW has ?
YES !
RAW
                            IF     ID     EX     MEM    WB
400200  LW   R8, 0(R9)      1      2      3      4      5
400204  ADD  R10, R11, R12  2      3      4      5      6
400208  SUB  R13, R14, R15  3      4      5      6      7
40020C  XOR  R16, R17, R18  4      5      6      7      8
400210  SW   R19, 0(R10)    5      6      7      8
400214  OR   R21, R22, R10  6      7      8      9      10
400218  SLT  R24, R25, R10  7      8      9      10     11
40021C  BEQ  R27, R10, 5    8      9      10
Data Hazards
How will we eliminate the remaining two stall cycles ?
We will use forwarding also known as bypassing to do that
This means we have additional hardware to eliminate the stalls
The additional hardware will be new wires and new MUXes, and MUX3 of the
datapath will be larger
To visualize how we can do this, let’s look at the Version 1 state diagram and
the datapath for the ADD instruction
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R14, R10                IF     ID     Stall  Stall  EX     MEM    WB
40020C  XOR  R16, R17, R10                       IF     Stall  Stall  ID     EX     MEM    WB
400210  SW   R19, 0(R10)                                Stall  Stall  IF     ID     EX     MEM
400214  OR   R21, R22, R10                                     Stall  Stall  IF     ID     EX
400218  SLT  R24, R25, R10                                            Stall  Stall  IF     ID
40021C  BEQ  R27, R10, 5                                                     Stall  Stall  IF
The new value of R10 is calculated in the EX stage in the 4th clock period for the ADD
Data Hazards
Forwarding (Bypassing)
The new value of R10 is stored on 4.ALUout at the end of the 4th clock
period
The new value of R10 is available for use in the MEM stage in the beginning
of the 5th clock period
Why don't we forward the new value of 4.ALUout directly from the MEM stage
to the EX stage in the 5th clock period ?
At the same time, why don't we allow the SUB to read the old value of R10 to
3.B in the ID stage so we do not stall it in the 4th clock period ?
Then, when the SUB enters the EX in the 5th clock period, it uses the forwarded
value from 4.ALUout. It bypasses the value of 3.B
The arrow from MEM to EX indicates forwarding
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R14, R10                IF     ID     EX     MEM    WB
40020C  XOR  R16, R17, R10                       IF     ID     Stall  EX     MEM    WB
400210  SW   R19, 0(R10)                                IF     Stall  ID     EX     MEM
400214  OR   R21, R22, R10                                     Stall  IF     ID     EX     MEM
400218  SLT  R24, R25, R10                                            Stall  IF     ID     EX
40021C  BEQ  R27, R10, 5                                                     Stall  IF     ID
Data Hazards
Forwarding (Bypassing)
What we are doing is that instead of waiting to get the new value of R10
that goes (i) from the ALU to 4.ALUout, then (ii) to 5.ALUout then (iii) to
R10 and then finally (iv) to 3.B, we forward the new value of R10 directly to
the EX stage, to the input of the ALU, bypassing the value in 3.B that has
the old R10 value
[Datapath figure: ID, EX and MEM stages with the forwarding path from 4.ALUout back to MUX3 at the ALU input. Labels shown: 3.A, 3.B, 3.Imm, MUX3, ADD, 4.ALUout]
MUX3 is larger now
Data Hazards
Forwarding (Bypassing)
If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
400200  LW   R8, 0(R9)      1      2      3      4      5
400204  ADD  R10, R11, R12  2      3      4      5      6
400208  SUB  R13, R14, R10  3      4      5      6      7
40020C  XOR  R16, R17, R10  4      5/6    7      8      9
400210  SW   R19, 0(R10)    5/6    7      8      9
400214  OR   R21, R22, R10  7      8      9      10     11
400218  SLT  R24, R25, R10  8      9      10     11     12
40021C  BEQ  R27, R10, 5    9      10     11
The arrow from MEM to EX
indicates forwarding
Data Hazards
Forwarding (Bypassing)
What can we do to eliminate the stall for the XOR ?
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R14, R10                IF     ID     EX     MEM    WB
40020C  XOR  R16, R17, R10                       IF     ID     Stall  EX     MEM    WB
400210  SW   R19, 0(R10)                                IF     Stall  ID     EX     MEM
400214  OR   R21, R22, R10                                     Stall  IF     ID     EX     MEM
400218  SLT  R24, R25, R10                                            Stall  IF     ID     EX
40021C  BEQ  R27, R10, 5                                                     Stall  IF     ID
To eliminate the stall for the XOR we will employ forwarding from the WB stage
to the EX stage (as you will see on the next slide) !
Because we see that if we allow the XOR to read the old value of R10 in clock
period 5, it can get the new value of R10 in the beginning of the 6th clock period
In the 6th clock period, the new value of R10 is with the ADD in the WB stage on
register 5.ALUout
We then forward the value from 5.ALUout to MUX3, bypassing 3.B
Data Hazards
Forwarding (Bypassing)
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R14, R10                IF     ID     EX     MEM    WB
40020C  XOR  R16, R17, R10                       IF     ID     EX     MEM    WB
400210  SW   R19, 0(R10)                                IF     ID     EX     MEM
400214  OR   R21, R22, R10                                     IF     ID     EX     MEM    WB
400218  SLT  R24, R25, R10                                            IF     ID     EX     MEM
40021C  BEQ  R27, R10, 5                                                     IF     ID     EX
Now, there is no stall !
Note the short lines in clock period 6 that indicate that
write-in-first-half-read-in-the-second-half helps eliminate the stall between
the ADD and the SW
Data Hazards
Forwarding (Bypassing)
If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
400200  LW   R8, 0(R9)      1      2      3      4      5
400204  ADD  R10, R11, R12  2      3      4      5      6
400208  SUB  R13, R14, R10  3      4      5      6      7
40020C  XOR  R16, R17, R10  4      5      6      7      8
400210  SW   R19, 0(R10)    5      6      7      8
400214  OR   R21, R22, R10  6      7      8      9      10
400218  SLT  R24, R25, R10  7      8      9      10     11
40021C  BEQ  R27, R10, 5    8      9      10
There is no stall !
Note the short lines in clock period 6 that indicate that
write-in-first-half-read-in-the-second-half helps eliminate the stall between
the ADD and the SW
Data Hazards
Forwarding (Bypassing)
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R14, R10                IF     ID     EX     MEM    WB
40020C  XOR  R16, R17, R10                       IF     ID     EX     MEM    WB
400210  SW   R19, 0(R10)                                IF     ID     EX     MEM    WB
400214  OR   R21, R22, R10                                     IF     ID     EX     MEM    WB
400218  SLT  R24, R25, R10                                            IF     ID     EX     MEM
40021C  BEQ  R27, R10, 5                                                     IF     ID     EX
Till now we considered this code where for the SUB, XOR and SW,
R10 is the second operand register, i.e. register Rt in the R-format
What if R10 is the first operand register, register Rs, in the R-format ?
Data Hazards
Forwarding (Bypassing)
What if the code is that R10 is Rs for the SUB and XOR ?
In this case we forward from 4.ALUout and 5.ALUout to a
new MUX, MUX2, bypassing 3.A
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R12         IF     ID     EX     MEM    WB
400208  SUB  R13, R10, R15                IF     ID     EX     MEM    WB
40020C  XOR  R16, R10, R18                       IF     ID     EX     MEM    WB
400210  SW   R19, 0(R20)                                IF     ID     EX     MEM
400214  OR   R10, R22, R23                                     IF     ID     EX     MEM    WB
400218  SLT  R10, R25, R26                                            IF     ID     EX     MEM
40021C  BEQ  R10, R28, 5                                                     IF     ID     EX
Only the SUB and XOR instructions will have the RAW
hazard and the stall cycles will be eliminated by forwarding
to MUX2
Data Hazards
Forwarding (Bypassing)
All RAW
What if the code is that R10 is Rs for the SUB and XOR ?
If we show the pipeline in our notation
                            IF     ID     EX     MEM    WB
400200  LW   R8, 0(R9)      1      2      3      4      5
400204  ADD  R10, R11, R12  2      3      4      5      6
400208  SUB  R13, R10, R15  3      4      5      6      7
40020C  XOR  R16, R10, R18  4      5      6      7      8
400210  SW   R19, 0(R20)    5      6      7      8
400214  OR   R10, R22, R23  6      7      8      9      10
400218  SLT  R10, R25, R26  7      8      9      10     11
40021C  BEQ  R10, R28, 5    8      9      10
Data Hazards
EMY forwarding (Bypassing) for the general case
By using forwarding (bypassing) results that have not
reached the destination GPR, can be forwarded to the inputs
of
Functional units in the ALU in EX
Memory port 2 in MEM
Bypassing the inputs that are shown in the Version 1 state
diagram and datapath
Remember that we forward a value when it is needed
One exception is the Store instruction since it completes
not in 5 clock periods but in 4 (soon we will see) !
Also, soon we will see that BEQ will complete in 2 clock
periods and we will forward to ID
Data Hazards
What forwarding does is that functional units
in the ALU and memory port 2 bypass GPR
registers
If they cannot get the new value of a GPR register
on time, the new values are forwarded from
4.ALUout
5.ALUout
5.MDR
To the inputs of
Functional units in the ALU
Memory port 2
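The choice among these forwarding sources for one ALU operand can be sketched as a selection function. This Python sketch is illustrative only: the function name forward_source and the (destination, kind) tuples are assumptions, and the real selection is done by the control signals of MUX2 and MUX3, which the slides do not spell out.

```python
def forward_source(src_reg, latched, mem_stage, wb_stage):
    """Pick where an ALU operand for the instruction in EX comes from.

    latched:   "3.A" or "3.B", the value read from the GPRs in ID
    mem_stage: None or (dest_reg, kind) for the instruction now in MEM
    wb_stage:  None or (dest_reg, kind) for the instruction now in WB
    kind is "alu" for an ALU result or "load" for a loaded value.
    """
    # The instruction in MEM is the more recent writer, so it is checked first.
    if mem_stage is not None and mem_stage[0] == src_reg:
        if mem_stage[1] == "load":
            # Loaded data is not available until the end of MEM. The ID
            # interlock is expected to have stalled the consumer already.
            return "stall"
        return "4.ALUout"
    if wb_stage is not None and wb_stage[0] == src_reg:
        return "5.MDR" if wb_stage[1] == "load" else "5.ALUout"
    return latched            # no bypass needed

# SUB R13, R14, R10 in EX while ADD R10 is in MEM and LW R8 is in WB
print(forward_source("R10", "3.B", ("R10", "alu"), ("R8", "load")))
# XOR R16, R17, R10 in EX while SUB R13 is in MEM and ADD R10 is in WB
print(forward_source("R10", "3.B", ("R13", "alu"), ("R10", "alu")))
```

The first call selects 4.ALUout and the second selects 5.ALUout, which are the two forwarding cases worked through for the SUB and the XOR.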
Data Hazards
Forwarding (Bypassing)
We show the changes to the inputs of the ALU below
[Datapath figure: forwarding paths to the ALU inputs in EX. Labels shown: MUX2 with 3.A; MUX3 with 3.B and 3.Imm; ALU; forwarded sources 4.ALUout (MEM), 5.ALUout and 5.MDR (WB)]
Data Hazards
Forwarding (Bypassing)
We show the changes to the inputs of Memory Port 2 below
[Datapath figure: forwarding paths to Memory Port 2 in MEM. Labels shown: 4.ALUout, 4.B, MUX5, AB2, DB2, DB3, Memory Port 2; forwarded sources 5.ALUout and 5.MDR (WB)]
Data Hazards
Forwarding (Bypassing)
We show the exception case for Store instructions where the value to be
written to a memory location has to be passed to a Store in the EX stage
even though it is not needed in EX, but in MEM
[Datapath figure: EX, MEM and WB stages. Labels shown: 3.A, 3.B, 3.NPC, 3.Imm, MUX6, 4.ALUout, 4.B, 5.ALUout, 5.MDR]
We have to have a new MUX in EX that will move data to 4.B either from 3.B or
from 5.ALUout or 5.MDR
Data Hazards
Forwarding (Bypassing)
In summary, we have the following changes to the EMY
datapath for forwarding purposes
Three new multiplexers, MUX2, MUX5 and MUX6
MUX3 is larger
There will be additional forwarding hardware for the BEQ
Data Hazards
As we said before, there are three types of data
hazards
Read after write, RAW
Instruction 1 has to write and then Instruction 2 has to read :
I1W - I2R
We studied it on previous slides
We need to prevent I2R - I1W
So, we stall I2 unless we can forward the value
We can do forwarding and write-in-the-first-half-read-in-the-second-half to avoid
the stall for all cases except one that
involves Load instructions as described below
Data Hazards
There are three types of data hazards
Write after read, WAR
Instruction 1 has to read and then Instruction 2 has to write : I1R - I2W
We need to prevent I2W - I1R
So, we need to stall I2
This hazard cannot occur on EMY since all reads are early and all writes are
late
This will happen when some instructions write early and some others read late
An example is for an instruction that uses the autoincrement addressing mode :
ADD R8, (R9)+
This instruction does the following : R8 ← R8 + M[R9] then R9 ← R9 + 4
Often the CPU writes the new value of R9 in the MEM stage, not in the WB
stage, provided that there is a separate integer ADDer
So, we write to R9 early, perhaps before a previous instruction can read it
This instruction is a typical CISC instruction
The example shows how the architecture complexity affects the hardware
design, in this case pipelining !
This hazard can always be prevented by changing the destination register
of the second instruction !
Data Hazards
There are three types of data hazards
Write after write, WAW
Instruction 1 has to write and then Instruction 2 has to write : I1W - I2W
We need to prevent I2W - I1W
So, we need to stall I2 to prevent a wrong value on the destination
This hazard cannot occur on EMY since all GPR writes happen in one stage, WB,
in program order
This will happen if more than one stage can write
Allowing writes in different stages can result in two writes to a GPR in the same
clock period
The previous example can cause a WAW hazard
ADD R8, (R9)+
R8 ← R8 + M[R9] then R9 ← R9 + 4
The CPU writes the new value of R9 in the MEM stage, not in the WB stage
So, we write to R9 early, perhaps when a previous instruction is also writing to R9
at the same time
This instruction is a typical CISC instruction
The example shows how the architecture complexity affects the hardware
design, in this case pipelining !
This hazard can always be prevented by changing the destination register
of the second instruction !
Data Hazards
There are three types of data hazards
WAR and WAW
The WAR and WAW hazards will also happen when an
instruction is allowed to proceed even though the instruction in
front of it is stalled
For example, with dynamic issuing, an instruction passes by a
stalled instruction and so it writes too soon !
This is a topic to deal with in Computer Architecture II !
The fourth hazard ?
Read after Read, RAR
This is not a hazard since no value is changed by the two
readings
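The four cases can be told apart by comparing the destination and source registers of two instructions. The Python sketch below is an illustration with hypothetical names (classify and the (destination, sources) pairs are assumptions). It reports dependencies in program order; whether a dependency turns into a hazard depends on the pipeline, as discussed above.

```python
def classify(first, second):
    """Return the data dependency types between two instructions.

    first, second: (dest_reg_or_None, [source_regs]); first precedes second
    in program order.
    """
    d1, s1 = first
    d2, s2 = second
    kinds = []
    if d1 is not None and d1 in s2:
        kinds.append("RAW")       # second reads what first writes
    if d2 is not None and d2 in s1:
        kinds.append("WAR")       # second writes what first reads
    if d1 is not None and d1 == d2:
        kinds.append("WAW")       # both write the same register
    return kinds                  # empty for RAR or for unrelated instructions

add = ("R10", ["R11", "R12"])                 # ADD R10, R11, R12
print(classify(add, ("R13", ["R14", "R10"])))  # SUB R13, R14, R10
print(classify(add, ("R11", ["R22", "R23"])))  # a later write to R11
print(classify(add, ("R10", ["R17", "R18"])))  # a later write to R10
print(classify(add, ("R13", ["R11", "R12"])))  # only common reads (RAR)
```

Only the first pair matters on EMY as described so far; the WAR and WAW pairs become hazards when writes can happen early or out of order.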
Data Hazards
Not all RAW hazard stalls can be eliminated via forwarding and
write-in-the-first-half-read-in-the-second-half
Let’s consider our piece of mnemonic machine language program
again where there is now a dependency between the LW and the
instructions that follow it
400200  LW   R8, 0(R9)
400204  ADD  R10, R11, R8
400208  SUB  R13, R14, R8
40020C  XOR  R16, R17, R8
400210  SW   R19, 0(R8)
400214  OR   R21, R22, R8
400218  SLT  R24, R25, R8
40021C  BEQ  R27, R8, 5
We observe that the LW writes to R8 and the instructions
below LW read R8
The LW and the remaining instructions are executed close in
time
Can there be data hazards among them ?
Data Hazards
400200  LW   R8, 0(R9)
400204  ADD  R10, R11, R8     RAW ?
400208  SUB  R13, R14, R8     RAW ?
40020C  XOR  R16, R17, R8     RAW ?
400210  SW   R19, 0(R8)       RAW ?
400214  OR   R21, R22, R8     RAW ?
400218  SLT  R24, R25, R8     RAW ?
40021C  BEQ  R27, R8, 5       RAW ?
The data element in R8 is shared by all the instructions below the LW
and they are executed close in time
Yes, there are data dependencies, but are they all data hazards ?
Will all the instructions below the LW try to read R8 before the LW writes ?
Data hazards will happen between the LW and ADD, SUB and XOR
ADD, SUB and XOR will try to read R8 before the LW writes to R8
This data hazard is the RAW hazard
We might have to stall ADD, SUB and XOR when they try to read R8 ????
The SW, OR, SLT and BEQ will read R8 after the LW writes to R8
They do not have any hazard situation !!!
Data Hazards
Do we have to stall ADD, SUB and XOR when they try
to read R8 ?
All RAW
400200  LW   R8, 0(R9)
400204  ADD  R10, R11, R8
400208  SUB  R13, R14, R8
40020C  XOR  R16, R17, R8
400210  SW   R19, 0(R8)
400214  OR   R21, R22, R8
400218  SLT  R24, R25, R8
40021C  BEQ  R27, R8, 5
If yes, can we eliminate any possible stall by using
forwarding ?
Yes, we can eliminate the data hazard stalls between the LW
and SUB and XOR !
But, we cannot eliminate a stall cycle between the LW
and ADD with forwarding and write-in-the-first-half-read-in-the-second-half
Data Hazards
Why is that we cannot eliminate the stall cycle
between the LW and ADD ?
RAW
Clock period                1      2      3      4      5      6      7
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R8          IF     ID     Stall  EX     MEM    WB
According to our state diagram, the LW reads the data from
the memory in the MEM stage
This is clock period 4
The data will come from the memory at the end of the 4th
clock period since the memory takes one clock period to access
But, the ADD needs that data from the memory in the
beginning of the 4th clock period
We need to stall the ADD and forward the data from WB to
EX in the 5th clock period
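The difference between an ALU producer and a Load producer can be expressed as a short calculation. This Python sketch is a hedged illustration, not the EMY control logic: the name stalls_with_forwarding is hypothetical, the producer is assumed to be fetched in clock period 1, the consumer is assumed to need the value at its ALU input in EX, and full forwarding is assumed.

```python
def stalls_with_forwarding(producer, distance):
    """Stall cycles for an ALU consumer `distance` instructions after the producer."""
    # Period at the end of which the value exists: EX (3) for an ALU result,
    # MEM (4) for a Load, since the memory takes one clock period to access.
    produced_at_end_of = {"alu": 3, "load": 4}[producer]
    consumer_ex = 3 + distance            # consumer's EX period if it never stalls
    return max(0, produced_at_end_of + 1 - consumer_ex)

print(stalls_with_forwarding("alu", 1))   # ADD followed by SUB
print(stalls_with_forwarding("load", 1))  # LW followed by ADD
print(stalls_with_forwarding("load", 2))  # one instruction in between
```

Only the Load followed immediately by a dependent instruction still costs one stall cycle under these assumptions.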
Data Hazards
Why is that we cannot eliminate the stall cycle
between the LW and ADD ?
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
400200  LW   R8, 0(R9)      IF     ID     EX     MEM    WB
400204  ADD  R10, R11, R8          IF     ID     Stall  EX     MEM    WB
400208  SUB  R13, R14, R8                 IF     Stall  ID     EX     MEM    WB
40020C  XOR  R16, R17, R8                               IF     ID     EX     MEM    WB
400210  SW   R19, 0(R8)                                        IF     ID     EX     MEM
400214  OR   R21, R22, R8                                             IF     ID     EX     MEM
400218  SLT  R24, R25, R8                                                    IF     ID     EX
40021C  BEQ  R27, R8, 5                                                             IF     ID
Data Hazards
Why is that we cannot eliminate the stall cycle
between the LW and ADD ?
We see that the ADD is stalled to wait for the LW to read
the memory
Where is the ADD stalled ?
In the ID stage ? YES
Data Hazards
Why is that we cannot eliminate the stall cycle
between the LW and ADD ?
As mentioned before we are checking for hazard situations
in the ID stage and when we recognize a hazard, we stall the
instruction in the ID stage !
We have static issuing
We stall the ADD due to its RAW hazard
We stall the SUB, XOR and the others behind the ADD for
correct execution pattern
Data Hazards
Why is that we cannot eliminate the stall cycle
between the LW and ADD ?
If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
400200  LW   R8, 0(R9)      1      2      3      4      5
400204  ADD  R10, R11, R8   2      3/4    5      6      7
400208  SUB  R13, R14, R8   3/4    5      6      7      8
40020C  XOR  R16, R17, R8   5      6      7      8      9
400210  SW   R19, 0(R8)     6      7      8      9
400214  OR   R21, R22, R8   7      8      9      10     11
400218  SLT  R24, R25, R8   8      9      10     11     12
40021C  BEQ  R27, R8, 5     9      10     11
Note the short lines in clock period 5 that indicate that
write-in-first-half-read-in-the-second-half help
eliminate the stall between the LW and the SUB
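The timing in this notation can be reproduced with a small sketch. It assumes a simplified model that is not part of the handout: one instruction per stage, in-order flow, full forwarding, and a Load interlock that holds a dependent instruction in ID until the LW is in MEM. The names Instr and schedule are illustrative only, and every instruction is given all five stage slots even where the slides leave MEM or WB blank (SW, BEQ).

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Instr:
    op: str                  # mnemonic, e.g. "LW"
    dest: Optional[str]      # register written, None for SW and BEQ
    srcs: Tuple[str, ...]    # registers read


def schedule(program: List[Instr]) -> List[Dict[str, Any]]:
    """Return the clock periods each instruction spends in each stage.

    IF and ID are (first, last) pairs because an instruction can be held
    there; EX, MEM and WB are single clock periods in this model.
    """
    rows: List[Dict[str, Any]] = []
    for i, ins in enumerate(program):
        if i == 0:
            if_first, id_first = 1, 2
        else:
            prev = rows[i - 1]
            # Fetch starts when the previous instruction leaves IF.
            if_first = prev["IF"][1] + 1
            # ID is entered once fetched and once ID has been vacated.
            id_first = max(if_first + 1, prev["ID"][1] + 1)
        id_last = id_first
        if i > 0:
            load = program[i - 1]
            # Load interlock: wait in ID until the LW is in MEM, so the
            # loaded value can be forwarded to EX in the next clock period.
            if load.op == "LW" and load.dest in ins.srcs:
                id_last = max(id_last, rows[i - 1]["MEM"])
        ex = id_last + 1
        rows.append({"op": ins.op, "IF": (if_first, id_first - 1),
                     "ID": (id_first, id_last), "EX": ex,
                     "MEM": ex + 1, "WB": ex + 2})
    return rows


program = [
    Instr("LW", "R8", ("R9",)),
    Instr("ADD", "R10", ("R11", "R8")),
    Instr("SUB", "R13", ("R14", "R8")),
    Instr("XOR", "R16", ("R17", "R8")),
    Instr("SW", None, ("R19", "R8")),
    Instr("OR", "R21", ("R22", "R8")),
    Instr("SLT", "R24", ("R25", "R8")),
    Instr("BEQ", None, ("R27", "R8")),
]
rows = schedule(program)
for row in rows:
    print(row)
```

In this model the ADD occupies ID in clock periods 3 and 4 and the SUB occupies IF in clock periods 3 and 4, which corresponds to the 3/4 entries in the notation.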
Data Hazards
Why is it that we cannot eliminate the stall cycle between the LW
and ADD ?
The stall can be avoided (the Load interlock situation can be
eliminated) if an independent instruction, an instruction
that does not need R8, is placed between the LW and ADD
For the first time we have an example of the importance of
ordering instructions carefully
If we had a compiler that guaranteed to find an independent
instruction that does not depend on the LW, we would never have
the Load interlock !
This is what we call the compiler scheduling an independent instruction
The instruction position following the LW is called the load delay slot, and
the compiler fills the delay slot with an independent instruction
This is called a delayed Load
If the compiler cannot find an independent instruction, it inserts a
NOP in the load delay slot
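The delay-slot filling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the compiler's actual algorithm: it only considers the single instruction right after the dependent one as a candidate, it only moves simple ALU instructions, and the names Instr, can_hoist and fill_load_delay_slots are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]       # register written, None if there is none
    srcs: Tuple[str, ...]     # registers read


NOP = Instr("NOP", None, ())
ALU_OPS = {"ADD", "SUB", "XOR", "OR", "AND", "SLT"}  # assumed movable ops


def can_hoist(cand: Instr, user: Instr, load: Instr) -> bool:
    """True if cand can safely move in front of user, into the load delay slot."""
    return (
        cand.op in ALU_OPS
        and load.dest not in cand.srcs    # cand must not need the loaded value
        and user.dest not in cand.srcs    # no RAW on the instruction it passes
        and cand.dest not in user.srcs    # no WAR on the instruction it passes
        and cand.dest != user.dest        # no WAW on the instruction it passes
    )


def fill_load_delay_slots(code: List[Instr]) -> List[Instr]:
    code = list(code)                     # work on a copy
    out: List[Instr] = []
    i = 0
    while i < len(code):
        ins = code[i]
        out.append(ins)
        i += 1
        # Only a LW directly followed by a dependent instruction needs a filler.
        if ins.op != "LW" or i == len(code) or ins.dest not in code[i].srcs:
            continue
        user = code[i]
        cand = code[i + 1] if i + 1 < len(code) else None
        if cand is not None and can_hoist(cand, user, ins):
            out.append(cand)              # independent instruction fills the slot
            del code[i + 1]
        else:
            out.append(NOP)               # nothing suitable: insert a NOP
    return out


lw = Instr("LW", "R8", ("R9",))
add = Instr("ADD", "R10", ("R11", "R8"))
filled = fill_load_delay_slots([lw, add, Instr("SUB", "R13", ("R14", "R15"))])
padded = fill_load_delay_slots([lw, add, Instr("SUB", "R13", ("R14", "R8"))])
print([x.op for x in filled])
print([x.op for x in padded])
```

In the first call the SUB does not touch R8 or R10, so it moves into the slot; in the second call the SUB also needs R8, so a NOP is inserted instead.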
Data Hazards
Why is it that we cannot eliminate the stall
cycle between the LW and ADD ?
If the compiler changes the order of instructions
to avoid stalls, to fill delay slots, then it is called
pipeline scheduling or instruction scheduling
We will have more examples of how the compiler
arranges the code for better pipeline efficiency
throughout the semester
Data Hazards
Delayed Loads are not practical and not used !
If delayed Loads were used, the Load interlock in hardware
would be removed, since it is guaranteed that a Load is not followed by a
dependent instruction
We can guarantee that removing the interlock works only if the CPU
runs new code compiled for the delayed Load CPU
But, there is a lot of software compiled years ago, and those
compilers did not take this delayed Load feature into account
The old code has a lot of LW instructions followed by
dependent instructions
If we ran them on a CPU with delayed Loads (no Load interlock),
the dependent instructions would get wrong data and the programs would
generate wrong results
This is the legacy software situation !
Our EMY CPU will not have delayed Loads !
Control Hazards
Control hazards occur when a control instruction is executed
Control instructions are
Jump
Jump to a function
Return from function
Branch
Except for the branch instruction, all control instructions always change
the order of execution
The branch may or may not change the order of execution
depending on the condition
If the order of the execution is changed, the pipeline is
emptied
That is, there is a pipeline start-up
This results in a performance loss worse than the data hazard
performance loss
Even worse, we may branch to an instruction that is not in
the memory
This is a page fault that results in millions of clock periods of
delay!
Control Hazards
Branches are especially troublesome
The order of execution may or may not be changed
So, we do not know which instruction to fetch next
Which one to fetch depends on the test : equal to zero or not
equal to zero ?
Note that besides comparing with zero, we also have to
compute the possible branch address, the target address, the
address of the target instruction
If these two are not performed early, there is a large control
hazard penalty of three clock periods.
If the branch instruction does not change the order of
execution, i.e. we continue with the instruction following the
branch, we say the branch is not taken
If the branch instruction changes the order of execution,
i.e. we continue with the instruction pointed to by the target
address, we say the branch is taken
Control Hazards
If we recall what we did earlier
Branch instructions go through stages IF, ID and
EX
They actually complete the execution back in stage IF
Therefore, CPIBranch = 4
Control Hazards
Let’s take a look at the code studied earlier
Assuming that we take the branch !
Clock period                   1     2     3     4     5     6     7     8     9
400600   BEQ  R8, R9, 4        IF    ID    EX
400604   ADD  R10, R11, R12          Stall Stall Stall
400608   SUB  R13, R14, R15
40060C   XOR  R16, R17, R18
400610   SLT  R19, R20, R21
400614   AND  R22, R23, R24                            IF    ID    EX    MEM   WB
A pipeline bubble is generated
The Branch causes a pipeline start-up !
Control Hazards
If we show the pipeline in our notation
Assuming that we take the branch !
Address  Instruction           IF    ID    EX    MEM   WB
400600   BEQ  R8, R9, 4        1     2     3
400604   ADD  R10, R11, R12
400608   SUB  R13, R14, R15
40060C   XOR  R16, R17, R18
400610   SLT  R19, R20, R21
400614   AND  R22, R23, R24    5     6     7     8     9
We see that we have three stall cycles if the branch is
taken
Control Hazards
Let’s take a look at the code studied earlier
Assuming that we do not take the branch !
Clock period                   1     2     3     4     5     6     7     8     9
400600   BEQ  R8, R9, 4        IF    ID    EX
400604   ADD  R10, R11, R12          Stall Stall Stall IF    ID    EX    MEM   WB
400608   SUB  R13, R14, R15                                  IF    ID    EX    MEM
40060C   XOR  R16, R17, R18                                        IF    ID    EX
400610   SLT  R19, R20, R21                                              IF    ID
400614   AND  R22, R23, R24                                                    IF
A pipeline bubble is generated
The Branch causes a pipeline start-up !
Control Hazards
If we show the pipeline in our notation
Assuming that we do not take the branch !
Address  Instruction           IF    ID    EX    MEM   WB
400600   BEQ  R8, R9, 4        1     2     3
400604   ADD  R10, R11, R12    5     6     7     8     9
400608   SUB  R13, R14, R15    6     7     8     9     10
40060C   XOR  R16, R17, R18    7     8     9     10    11
400610   SLT  R19, R20, R21    8     9     10    11    12
400614   AND  R22, R23, R24    9     10    11    12    13
Are we fetching the ADD in the 5th clock period ?
If yes, why ?
Control Hazards
Assuming that we do not take the branch !
Why are we fetching the ADD in the 5th clock
period ?
Can we fetch the ADD in the 2nd clock period ?
The answer is yes, if the control unit allows the
completion of the fetch cycle of the ADD in the 2nd
clock period
Then, the ADD stays in the 2.IR register until the end
of the 4th clock period and then moves to the ID stage, as will be
shown soon
Control Hazards
Assuming that we do not take the branch !
But, if the control unit stops fetching of the ADD
in the 2nd clock period to save itself from a
memory access that might be unnecessary if the
branch is taken, then the ADD must be fetched in
the 5th clock period
Why would the control unit stop fetching the ADD in the
2nd clock period ?
We are asking this question because we know that
decoding an instruction is very quick : Just checking the
Opcode bits is enough for many instructions
Thus, the control unit would know right in the beginning
of the 2nd clock period that there is a Branch in the ID
stage, and we can get the ADD by the end of the 2nd
clock period !
Control Hazards
Assuming that we do not take the branch !
If the CPU designer decides to continue with the fetching of the
ADD in the 2nd clock period
Clock period                   1     2     3     4     5     6     7     8     9
400600   BEQ  R8, R9, 4        IF    ID    EX
400604   ADD  R10, R11, R12          IF    Stall Stall ID    EX    MEM   WB
400608   SUB  R13, R14, R15                            IF    ID    EX    MEM   WB
40060C   XOR  R16, R17, R18                                  IF    ID    EX    MEM
400610   SLT  R19, R20, R21                                        IF    ID    EX
400614   AND  R22, R23, R24                                              IF    ID
Control Hazards
Assuming that we do not take the branch !
If the CPU designer decides to continue with the
fetching of the ADD in the 2nd clock period
If we show the pipeline in our notation
Address  Instruction           IF    ID    EX    MEM   WB
400600   BEQ  R8, R9, 4        1     2     3
400604   ADD  R10, R11, R12    2/4   5     6     7     8
400608   SUB  R13, R14, R15    5     6     7     8     9
40060C   XOR  R16, R17, R18    6     7     8     9     10
400610   SLT  R19, R20, R21    7     8     9     10    11
400614   AND  R22, R23, R24    8     9     10    11    12
We save one clock period !
Control Hazards
Assuming that we do not take the branch !
The CPU designer might decide to design the control unit so
that it aborts the fetch of the ADD in the 2nd clock period
This is a toss-up for the CPU designer !
How often branches are not taken is critical
If branches are often not taken, then the designer can design
the control unit to allow fetching the ADD
BUT, if we continue with the fetch and it
causes a page fault (the instruction is not in the memory),
we read the page of the instruction from disk, and if we then realize
the branch is taken, all this effort is wasted !
The frequency of untaken branches depends on the application,
the programmer, the compiler and the instruction set !
We decide not to fetch the next instruction
We do not fetch the ADD !
Control Hazards
If we summarize : If we have a control
instruction, the time penalty is high
Jump, jump-to-function and return-from-function
instructions require an unconditional
change to the order of execution
The sooner we calculate the target instruction address,
the more stall cycles we can reduce
But, with branches we also need to test the
condition so we need to determine two items
The target address
The condition
The sooner we calculate the target instruction
address and the condition, the more stall cycles
we can save
Control Hazards
Thus, solving the branch execution problem is
more difficult than the others
In fact, one can think of the jump, jump to a
function and return from a function instructions
as a special case of the branch where the
condition is always true, so we have to take the
jump/return
Overall, control hazards, especially branch
instructions, attract a lot of interest in
computer architecture research
Many journal and conference papers have been published over
the last 20-plus years on the topic of branch penalty
reduction !
Control Hazards
Let’s change our earlier code a little
Address  Instruction
400600   LW   R8, 0(R9)
400604   ADD  R10, R11, R12
400608   BEQ  R13, R14, 3
40060C   SUB  R15, R16, R18
400610   XOR  R19, R20, R21
400614   SLT  R22, R23, R24
400618   OR   R25, R26, R27
40061C   SW   R28, 0(R29)
If the Branch is not taken, the next instruction is the
SUB, the instruction that follows the Branch
If the Branch is taken, the target instruction is the OR
instruction, which is three instructions below the instruction
that follows the Branch (SUB), matching the offset of 3
Control Hazards
Assuming that we do take the branch and do not
fetch the SUB !
Clock period                   1     2     3     4     5     6     7     8     9     10    11
400600   LW   R8, 0(R9)        IF    ID    EX    MEM   WB
400604   ADD  R10, R11, R12          IF    ID    EX    MEM   WB
400608   BEQ  R13, R14, 3                  IF    ID    EX
40060C   SUB  R15, R16, R18                      Stall Stall Stall
400610   XOR  R19, R20, R21
400614   SLT  R22, R23, R24
400618   OR   R25, R26, R27                                        IF    ID    EX    MEM   WB
40061C   SW   R28, 0(R29)                                                IF    ID    EX    MEM
A pipeline start-up is created
Control Hazards
For the case where we take the branch, we
have a pipeline start-up created in clock
period 7
That is, the pipeline is emptied !
We need to improve the penalty cycles for
our pipeline
We will modify our state diagram so that
Branch instructions will take two clock
periods
Branch instructions will be in only IF and ID
CPIBranch = 2
There will be only one clock period of stall
after this implementation
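The effect of the shorter penalty on the average CPI can be estimated with a short calculation. This is a sketch under assumptions that are not from the slides: an ideal base CPI of 1, every branch paying the full stall penalty, and a branch fraction of 20% chosen only for illustration. The function name average_cpi is hypothetical.

```python
def average_cpi(branch_fraction: float, stall_cycles: int,
                base_cpi: float = 1.0) -> float:
    """Average CPI when each branch adds stall_cycles stall clock periods."""
    return base_cpi + branch_fraction * stall_cycles


# Assumed workload: 20% of the executed instructions are branches.
cpi_three_stalls = average_cpi(0.2, 3)   # branch resolved in EX
cpi_one_stall = average_cpi(0.2, 1)      # branch resolved in ID
print(cpi_three_stalls, cpi_one_stall)
```

Under these assumptions the average CPI drops from about 1.6 to about 1.2 when the branch is resolved in ID instead of EX.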
Control Hazards
The changes on the state diagram for the Branch
instruction
As we discussed before we need to determine the target
address and the condition as early as possible
We would know we have a branch in the beginning of the ID
cycle
In that case, we determine the target address and the
condition by using the information in the ID stage
The target address calculation requires adding PC and
(4*Offset), for which the ID stage has an ADDer circuit now
We can justify a separate ADDer in the ID stage, besides the
ones in IF and EX, since there is large Branch penalty to pay
The execution of all other non-control instructions is not
affected
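The target address and condition evaluation in the ID stage can be sketched as follows. This is an illustration under assumptions: the function name next_pc and its parameters are hypothetical, register values are passed in directly, and the offset is counted in instructions relative to 2.NPC, which agrees with the example code where the BEQ at 400608 with offset 3 targets the OR at 400618.

```python
from typing import Optional


def next_pc(pc: int, npc: int, opcode_in_id: Optional[str],
            rs_value: int, rt_value: int, offset: int) -> int:
    """PC update when the branch is resolved in the ID stage.

    pc           : current PC (address of the instruction after the one in ID)
    npc          : 2.NPC, the next-PC value kept with the instruction in ID
    opcode_in_id : opcode of the instruction in ID, or None
    """
    if opcode_in_id == "BEQ":
        if rs_value == rt_value:
            # Taken: ADDer in ID computes 2.NPC + 4 * Offset.
            return npc + 4 * offset
        # Not taken: PC already holds the fall-through address.
        return pc
    # Any other instruction in ID: sequential fetch.
    return pc + 4


taken = next_pc(0x40060C, 0x40060C, "BEQ", 7, 7, 3)
not_taken = next_pc(0x40060C, 0x40060C, "BEQ", 7, 8, 3)
sequential = next_pc(0x400604, 0x400604, "ADD", 0, 0, 0)
print(hex(taken), hex(not_taken), hex(sequential))
```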
Control Hazards
The changes on the state diagram for the
Branch instruction
State 0 (IF)
2.IR ← If (2.IR.opcode == BEQ) then NOP else M[PC]
PC ← If ((2.IR.opcode == BEQ) & (GPR[2.IR.Rs] == GPR[2.IR.Rt])) then (2.NPC + (4 * 2.IR.DOImm+))
else if (2.IR.opcode ≠ BEQ) then PC + 4
2.NPC ← If ((2.IR.opcode == BEQ) & (GPR[2.IR.Rs] == GPR[2.IR.Rt])) then (2.NPC + (4 * 2.IR.DOImm+))
else if (2.IR.opcode ≠ BEQ) then PC + 4
State 1 (ID)
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
3.IR ← 2.IR
CPIBranch = 2
Control Hazards
The changes to the IF and ID stages
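[Figure: IF and ID stage datapath. IF stage: PC, an ADDer that adds 4 to the PC, MUX1, 2.IR and 2.NPC. ID stage: a Sign Extend unit (16 to 32 bits) on DOImm followed by *4, an ADDer that adds the result to 2.NPC, the GPR file addressed by the 5-bit Rs and Rt fields, and an Equal ? comparator on GPR[Rs] and GPR[Rt]]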
Control Hazards
Final design of the ID stage with forwarding
[Figure: ID stage with forwarding. An ADDer adds 2.NPC to the sign-extended (16 to 32 bits) DOImm multiplied by 4. MUX7 selects the GPR[Rs] input and MUX8 selects the GPR[Rt] input of the Equal ? comparator from the GPR output, the ALU in EX, 4.ALUout, 5.ALUout and 5.MDR]
Control Hazards
The changes to the IF and ID stages
The ADDer in the IF stage is used by MUX1 in the
IF stage
The Equal circuit has a forwarding circuit with
MUX7 and MUX8
We have forwardings to the ID stage so that we bypass
one or both GPR registers to test
These forwardings are from :
The output of the ALU in EX
4.ALUoutput
5.ALUoutput
5.MDR
Control Hazards
The execution of the BEQ now
Assume that the Branch is taken
Clock period                   1     2     3     4     5     6     7     8     9
400600   LW   R8, 0(R9)        IF    ID    EX    MEM   WB
400604   ADD  R10, R11, R12          IF    ID    EX    MEM   WB
400608   BEQ  R13, R14, 3                  IF    ID
40060C   SUB  R15, R16, R18
400610   XOR  R19, R20, R21
400614   SLT  R22, R23, R24
400618   OR   R25, R26, R27                            IF    ID    EX    MEM   WB
40061C   SW   R28, 0(R29)                                    IF    ID    EX    MEM
A 1-clock period long bubble is created.
The other stall cycle is because the BEQ takes 2 clock periods
Control Hazards
The execution of the Branch now
Assume that the Branch is taken
If we show the pipeline in our notation
Address  Instruction           IF    ID    EX    MEM   WB
400600   LW   R8, 0(R9)        1     2     3     4     5
400604   ADD  R10, R11, R12    2     3     4     5     6
400608   BEQ  R13, R14, 3      3     4
40060C   SUB  R15, R16, R18
400610   XOR  R19, R20, R21
400614   SLT  R22, R23, R24
400618   OR   R25, R26, R27    5     6     7     8     9
40061C   SW   R28, 0(R29)      6     7     8     9     10
It looks like there is 2-clock period long bubble created on the
previous slide
This is because the BEQ does not have its EX cycle anymore !
Overall, there is only one stall cycle now !
Control Hazards
Can we improve the BEQ hardware so that there is
no stall cycle at all ?
YES !
Solution
We will use delayed branches
Control Hazards
Delayed branches
Delayed branch makes use of the compiler and the hardware
In this technique, we continue the execution of the
instruction(s) that follow(s) the Branch in the branch delay
slot no matter what the Branch outcome is
The branch delay slot is the set of instruction positions
following the branch
Branch Rx, Offset
Branch delay slot
One instruction long or more
The length of the branch delay slot is the time penalty paid ≡
the number of stall cycles due to the Branch ≡ the amount of
time we are not sure about the target instruction
For the current design it is 1 clock period
Therefore, the branch delay slot has 1 instruction
Control Hazards
The changes on the state diagram due to delayed
branches
State 0 (IF)
We have to execute the instruction that follows the branch
in any case
2.IR ← M[PC]
PC ← If ((2.IR.opcode == BEQ) & (GPR[2.IR.Rs] == GPR[2.IR.Rt])) then (PC + (4 * 2.IR.DOImm+))
else PC + 4
State 1 (ID)
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
3.IR ← 2.IR
CPIBranch = 2
Control Hazards
Delayed Branches
We execute the instructions in the branch delay
slot no matter what the Branch outcome is
These instructions must be independent of the branch so
that the program execution is correct !
For our EMY CPU the branch delay slot is one instruction
long
Because, we are not sure which instruction is the target
instruction for one clock period
The following clock period we know which instruction is
the target
Thus, we execute the instruction right after the Branch
whether we take the branch or not
Is it easy to find one instruction that can be
executed no matter what the Branch outcome is ?
Control Hazards
Delayed Branches
We execute the instructions in the branch delay
slot no matter what the Branch outcome is
It is the compiler that changes the order of instructions
so that after the Branch there is an independent
instruction
We say the compiler schedules an instruction to the
Branch delay slot
This is another example of how ordering instructions is
important (needed)
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no
matter what the Branch outcome is
How can the compiler find an independent instruction for the
EMY CPU to place in the Branch delay slot ?
There are three possible cases that a compiler looks for
Case 1 : From before branch
If the instruction before the Branch is independent of the Branch
This one always improves the performance :
Original code
ADD R8, R9, R10
BEQ R11, R12, 5
New code
BEQ R11, R12, 6
ADD R8, R9, R10      Branch delay slot
The compiler realizes the ADD is independent of the BEQ ≡ The
ADD can be executed after the BEQ. The compiler moves the
ADD after the BEQ
Note the change of offset for the BEQ
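Case 1 can be sketched as a small transformation. This is an illustration under assumptions that are not from the slides: offsets are counted in instructions relative to the instruction after the branch, no other branch targets the moved instruction, and the names Instr, BRANCH_OPS and fill_from_before are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]            # register written, None if there is none
    srcs: Tuple[str, ...]          # registers read
    offset: Optional[int] = None   # branch offset in instructions


BRANCH_OPS = {"BEQ", "BNE"}        # assumed set of branch mnemonics


def fill_from_before(code: List[Instr], b: int) -> Optional[List[Instr]]:
    """Move the instruction before the branch at index b into its delay slot.

    Returns the new code, or None if the move is not safe.
    """
    if b == 0:
        return None
    before, branch = code[b - 1], code[b]
    target = b + 1 + branch.offset          # index of the branch target
    if (before.op in BRANCH_OPS
            or before.dest in branch.srcs   # branch depends on the instruction
            or target in (b - 1, b)):       # target is one of the swapped pair
        return None
    # The branch moves up by one position, so its offset grows by one.
    moved = branch._replace(offset=branch.offset + 1)
    return code[: b - 1] + [moved, before] + code[b + 1:]


original = [
    Instr("ADD", "R8", ("R9", "R10")),
    Instr("BEQ", None, ("R11", "R12"), 5),
]
scheduled = fill_from_before(original, 1)
dependent = fill_from_before(
    [Instr("ADD", "R8", ("R9", "R10")), Instr("BEQ", None, ("R8", "R11"), 7)], 1)
print(scheduled, dependent)
```

For the code on this slide the sketch produces the BEQ with offset 6 followed by the ADD; when the BEQ tests R8, the register written by the ADD, the move is rejected.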
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no matter what the
Branch outcome is
Case 2 : From target branch
It is used for loops where there is a large probability that the branch will be taken
(many times)
It improves the performance if the branch is taken
Original code
loop : SUB R8, R9, R10
       ----
       ADD R11, R12, R13
       BEQ R11, R14, (-9)₁₀
New code
loop : SUB R8, R9, R10
       ----
       ADD R11, R12, R13
       BEQ R11, R14, (-8)₁₀
       SUB R8, R9, R10
The compiler realizes the ADD is not independent of the BEQ.
But, the SUB is independent of the BEQ ≡ The SUB can be
executed after the BEQ. The compiler copies the SUB to the
Branch delay slot. This will save time if we branch back to the
beginning of the loop. If we exit the loop, it must be OK to
execute the SUB ! Branch offset must be adjusted ! The code is longer !
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no matter
what the Branch outcome is
Case 3 : From fall through
It is used when there is a high probability that the branch will not be
taken
It improves the performance if the branch is not taken
Original code
ADD R8, R9, R10
BEQ R8, R11, 7
SUB R12, R13, R14
New code
ADD R8, R9, R10
BEQ R8, R11, 7
SUB R12, R13, R14
The compiler realizes the ADD is not independent of the BEQ.
But, the SUB is independent of the Branch ≡ The SUB can be
executed right after the BEQ. The compiler moves the SUB to the
Branch delay slot. This will save time if the branch is not taken. It
must be OK to execute the SUB even if we take the branch !
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no
matter what the Branch outcome is
You might have realized that delayed branch is not practical
since it requires the compiler to know that the CPU is
expecting an independent instruction in the Branch delay slot
This means that old code cannot be run on this EMY CPU either,
because
1. The compiler did not generate the code for a CPU with a Branch delay
slot, or
2. The compiler did generate code with a Branch delay slot, but the delay slot
was more than one instruction long since it was for an old-generation EMY CPU
This is the legacy software situation !
Today’s microprocessors do not use delayed branches
because of the compatibility issue
However, academically, it is an interesting idea
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no
matter what the Branch outcome is
Shall we not use delayed branches for the EMY CPU ?
We will use delayed Branches in Version 1 for the
sake of simplifying our discussion
We will not use delayed branches in Computer Architecture
II
Control Hazards
Delayed Branches
Let’s take a look at the execution of the following code with a
taken branch
Notice the SUB is an independent instruction in the branch delay slot
It must be OK to execute the SUB even if we take the branch
Notice we changed the BEQ register to R10 to show forwarding to the
ID stage
The forwarding is from the EX stage to the ID stage where the output
of the ALU is forwarded to the ID stage to bypass GPR[Rs] of the
BEQ which is R10
Address  Instruction           IF    ID    EX    MEM   WB
400600   LW   R8, 0(R9)        1     2     3     4     5
400604   ADD  R10, R11, R12    2     3     4     5     6
400608   BEQ  R10, R14, 3      3     4
40060C   SUB  R15, R16, R18    4     5     6     7     8
400610   XOR  R19, R20, R21
400614   SLT  R22, R23, R24
400618   OR   R25, R26, R27    5     6     7     8     9
40061C   SW   R28, 0(R29)      6     7     8     9
Summary of Version 1
We added hardware to deal with structural, data and
control hazards
Still, it executes integer instructions
It issues instructions statically
Except for the branch, which is not issued and completes in
two clock periods
The branch is not issued to save time !
IF ID | Static instruction issue | EX MEM WB
Because of static issuing, instructions complete in-order,
except for the branch, which can complete before the
instructions that are issued earlier
This results in imprecise interrupts !
Summary of Version 1
We realize we need to modify Version 1 so that it
Executes FP instructions
Works with a memory that is not ideal
Has precise interrupts
All three are difficult problems to solve
FP operations, such as add, subtract, multiply and divide, are
complex and cannot be completed in one clock period as the
integer add operation can
The integer add is done in EX and takes one clock period
The FP add, subtract, multiply and divide will be done in EX and take
multiple clock cycles !
More instructions can complete out-of-order
The interrupt hardware becomes even more complex
We solve one problem (executing FP instructions) but make the other
problem more complex
The complete memory hierarchy must be considered
The cache memories, slower main memory and the virtual memory (disk)
Interrupts can happen randomly
We also need to save the state, which is not easy for a pipelined CPU
Advanced versions in Computer Architecture II will solve them
Test Program
Determine when the execution of the second iteration ends if
L1 cache memories take one clock period and there is no cache
miss
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases
                      First iteration    Second iteration
Instruction           IF ID EX MEM WB    IF ID EX MEM WB
LW   R8, 0(R9)
ADD  R10, R11, R8
SUB  R11, R10, R8
XOR  R9, R11, R10
SLT  R12, R10, R11
OR   R13, R12, R14
BNE  R13, (-7)₁₀
SW   R12, 0(R13)
The answer is on the next slide
Test Program
Determine when the execution of the second iteration ends if
L1 cache memories take one clock period and there is no cache
miss
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases
                      First iteration                    Second iteration
Instruction           IF     ID     EX     MEM    WB     IF     ID     EX     MEM    WB
LW   R8, 0(R9)        1      2      3      4      5      10     11     12     13     14
ADD  R10, R11, R8     2      3/4    5      6      7      11     12/13  14     15     16
SUB  R11, R10, R8     3/4    5      6      7      8      12/13  14     15     16     17
XOR  R9, R11, R10     5      6      7      8      9      14     15     16     17     18
SLT  R12, R10, R11    6      7      8      9      10     15     16     17     18     19
OR   R13, R12, R14    7      8      9      10     11     16     17     18     19     20
BNE  R13, (-7)₁₀      8      9                           17     18
SW   R12, 0(R13)      9      10     11     12            18     19     20     21
All hazards are RAW
The second iteration ends in clock period 21
Test Program
Determine when the execution of the second iteration ends if
L1 cache memories take two clock periods and there is no cache
miss
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases
                      First iteration                    Second iteration
Instruction           IF     ID     EX     MEM    WB     IF     ID     EX     MEM    WB
LW   R8, 0(R9)        1-2    3      4      5-6    7      17-18  19     20     21-22  23
ADD  R10, R11, R8     3-4    5/6    7      8      9      19-20  21/22  23     24     25
SUB  R11, R10, R8     5-6    7      8      9      10     21-22  23     24     25     26
XOR  R9, R11, R10     7-8    9      10     11     12     23-24  25     26     27     28
SLT  R12, R10, R11    9-10   11     12     13     14     25-26  27     28     29     30
OR   R13, R12, R14    11-12  13     14     15     16     27-28  29     30     31     32
BNE  R13, (-7)₁₀      13-14  15                          29-30  31
SW   R12, 0(R13)      15-16  17     18     19-20         31-32  33     34     35-36
There are structural hazards in IF and MEM stages due to
slow cache memories
The second iteration ends in clock period 36
All hazards pointed to by the arrows are data hazards of type RAW
Test Program
Determine when the execution of the second iteration ends if
L1 cache memories have misses
Assume that the memory levels are as described in the unpipelined
CPU case with the following additions and reminders
The physical memory has 4 Bytes per location
The bus width between the physical and lowest level cache is 4 Bytes
The instruction cache is 8KBytes
The data cache is 16KBytes
Both cache block sizes are 32 bytes
Both cache memories use direct mapping
Both caches use write-back with write-allocate
Both cache memories access the needed item first
The Data Cache has two read and two write ports
The Instruction Cache has two read ports
The latency to access the L2 cache is 4 clock periods and transferring
a 4-Byte content is one clock period each
The L2 cache memory can handle one miss per L1 cache memory at a
time
This means that if the instruction cache and the data cache have misses at
the same time, they will be handled at the same time by the L2 cache
This means the L2 cache can handle two hits at the same time
Test Program
Determine when the execution of the second iteration ends if
L1 cache memories have misses
Assume that the L1 instruction and data cache memories and the
physical memory have the following properties
Each Level 1 cache memory can handle only one miss at a time
A Store miss requires that the Store instruction stays in the MEM
stage until the miss is handled
It just cannot store to the write buffer and then proceed
Each Level 1 cache memory can handle up to four hits while it handles a
miss
An instruction that immediately follows a Load or a Store is forced to
stall an extra clock period in the ID stage to make sure the access for
the data element is completed
For the given code, assume the following
Each data element accessed is to a separate data block all of which do
not map to the same area in data cache
It means each Load and Store instruction accesses a different block in
each iteration
This means there will be four data cache misses in two iterations !
This is very unusual but, it is assumed here just to show an extreme case
Test Program
We observe all 8 instructions are in one instruction cache block
There are four data accesses, each one is in one separate data block,
resulting in four data cache misses
Determine when the execution of the second iteration ends
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases
                      First iteration                    Second iteration
Instruction           IF     ID     EX     MEM    WB     IF     ID     EX     MEM    WB
LW   R8, 0(R9)        1/5    6      7      8/12   13     18     19/24  25     26/30  31
ADD  R10, R11, R8     6      7/12   13     14     15     19/24  25/30  31     32     33
SUB  R11, R10, R8     7/12   13     14     15     16     25/30  31     32     33     34
XOR  R9, R11, R10     13     14     15     16     17     31     32     33     35     36
SLT  R12, R10, R11    14     15     16     17     18     32     33     34     35     36
OR   R13, R12, R14    15     16     17     18     19     33     34     35     36     37
BNE  R13, (-7)₁₀      16     17                          34     35
SW   R12, 0(R13)      17     18     19     20/24         35     36     37     38/42
There are structural hazards in IF and MEM stages due to
cache misses
The second iteration ends in clock period 42
All hazards pointed to by the arrows are data hazards of type RAW