CS 2204 Fall 2005 - NYU Polytechnic School of Engineering


Computer Architecture II
Versions 0 & 1 MIPS CPU
Haldun Hadimioglu
Computer Science & Engineering
CS 6143

Outline
 Introduction
 Version 0 MIPS CPU : Unpipelined MIPS CPU

It executes integer instructions
 Version 1 MIPS CPU : Pipelined MIPS CPU


It executes integer instructions
Handout to use
 MIPS CPU
CS 6143
Haldun Hadimioglu
MIPS Versions 0 & 1

Getting ready for CS6143
 The prerequisite for CS6143


CS6133 for graduate students
CS2214 for undergraduate students
 CS6143 students who took the prerequisite course and did not
use the Hennessy & Patterson book must realize that they will
put in more effort than the other CS6143 students

They will have to learn the MIPS assembly language and the MIPS
pipeline by themselves !
 If you are not sure you are ready for CS6143, you can work on
the execution timing of the pipelined MIPS CPU on the next
slide

You learned about it when you took the prerequisite course CS6133
or CS2214
 If you do not understand the timing, you need to take CS6133
 If you understand the timing, then study the remaining slides to
refresh your memory on CPU design and pipelining
Pipelined MIPS CPU Design : Version 1

Test Program
 Determine when the execution of the second iteration ends if
L1 cache memories take one clock period and there is no cache
miss
 Show all forwardings and write-in-the-first-half-read-in-the-second-half cases
LD R1, 500(R8)
DADD R2, R3, R1
DSUB R5, R2, R1
XOR R8, R5, R2
SLT R11, R2, R5
OR R14, R11, R15
BNEZ R14, (-7)₁₀
SD R11, 600(R14)
[Timing table: the clock periods in which each instruction occupies the IF, ID, EX, MEM and WB stages, for the first and second iterations]
The second iteration ends in clock period 21
All data hazards are RAW

Introduction
 On the microarchitecture layer, a computer is a
collection of at least three interconnected digital
systems


A central processing unit (CPU)
A (main) memory
An I/O controller to control an I/O device, such as the disk
 There can be several I/O controllers to control different I/O
devices
[Figure: block diagram with the CPU, the memory, the interconnection system, the I/O controller and the disk]

Digital Systems
 A digital system performs microoperations

It consists of a datapath (data unit) and a control
unit
 The datapath actually performs the microoperations
 The control unit determines which microoperation
happens when
[Figure: the datapath (registers, ALUs, buses) and the control unit (sequencer), connected by status signals and control signals]

Digital Systems
 The datapath (data unit) has registers, ALUs
and buses to perform the microoperations
Registers keep information temporarily
 ALUs perform arithmetic/logic operations
 Buses interconnect the registers and ALUs
 Other components used include
 Multiplexers (MUXes), decoders, encoders, comparators,
counters, etc.

Digital Systems
 The control unit has a sequencer circuit that
determines the sequence of microoperations
The sequencer needs status signals from the data
unit to know what is happening there
 Then, it determines which microoperations are to be
performed and indicates this to the datapath by means
of control signals

Designing Digital systems
 Datapath design is simpler than control unit
design since the datapath has highly regular (duplicated)
circuits
A 64-bit ADDer is composed of four identical 16-bit
ADDers
 A 64-bit comparator consists of eight identical 8-bit
comparators, etc.

 Control unit design is more difficult due to
Large amounts of random logic
 A lot of effort is needed to make sure there are
no timing problems

 Microoperations must start at the right time and end at
the right time !

Designing digital systems
 We will use the finite-state machine (FSM)
technique to design the MIPS CPU where the
FSM state diagram will have states with
microoperations

The state diagram shows which state follows
which state precisely
 Each state indicates which microoperations to perform

The state diagram shows which states are needed
when for which machine language instruction

Designing the microarchitecture level of a
computer
 There are two tasks in this design
Develop the CPU and memory digital systems so
that instructions can be run
 Develop the memory and I/O controller digital
systems so that I/O can happen

 We will concentrate on the CPU and memory
digital systems

Designing the CPU and memory digital
systems
 First we focus on the CPU digital system, making only a few
quick design decisions on the memory hierarchy

We will design the CPU as a slow CPU running only integer
instructions : No pipelining
 This is Version 0

Then, we will improve the CPU speed by using pipelining, but
still running integer instructions
 This is Version 1
 For both versions the memory will be a black box
with a few details

Designing the CPU as a Digital System
 The MIPS CPU digital system

We will concentrate on
 FSM state diagram of the MIPS CPU
 The FSM state diagram describes both the datapath and
the control unit
 Datapath of the CPU
 Datapath hardware for the execution of integer MIPS
instructions will be covered

We will not concentrate on the MIPS CPU control
unit
 It can be implemented by hardwiring and/or
microprogramming

Designing the CPU digital system
 To design the MIPS CPU, we will start with
the MIPS architecture

What is the connection between the architecture
and the CPU?
 A computer processes digital information, by running
machine language instructions
 A program is a list of instructions each of which
specifies operations on data (arguments)
 An instruction specifies architectural operations
 Each architectural operation is implemented by
microoperations

Designing the CPU Digital System
 In order to perform an architectural
operation, the CPU performs a series of
microoperations in a number of clock periods
That is, an architectural operation is broken down
into smaller operations called microoperations
 That is, to run a machine language instruction,
the CPU performs microoperations

The CPU performs some microoperations alone and
some in cooperation with the memory and the I/O
controllers

Designing the CPU Digital System
 Architectural operations

An architectural operation is what we describe as the
semantics of the instruction
 The architectural operation specified by the DADD instruction
 Rd ← Rs + Rt
 The architectural operation specified by the DSLLV instruction
 Rd ← Rs << Rt
 The architectural operation specified by the MOVN instruction
 If Rt ≠ 0 then Rd ← Rs
 The architectural operation specified by the J instruction
 PC[36-63] ← (4 x Offset)

It is the CPU that contributes the most to the execution of
an instruction since it performs most of the microoperations
needed for an architectural operation

Designing the CPU Digital System
 Typical CPU digital system microoperations

Add, subtract, multiply
 In the past, a 32-bit addition was completed in 1 clock period.
 Today, a 64-bit addition is completed in several clock periods



AND, OR, XOR
Shift right, Shift left
Read data from memory, write data to memory
 In the past, a memory access was completed in 1 clock period.
 Today, it is completed in several clock periods




Read instructions from memory (fetch)
Increment the program counter
Transfer a register to another register
…

Designing the CPU as a Digital System
 Other machines, especially CISC machines,
require other microoperations such as
Reading indirect address(es) from the memory
 Effective address calculation for

 Indexing
 Autoincrement
 Autodecrement

Alignment for
 Instructions
 Data
 Addresses

Designing the CPU Digital System
 Architecture’s effect on microoperations

The decisions made on the architecture determine the
microoperations needed for the execution of the
instructions
 General microoperations found on most CPUs
 The ones mentioned on previous slides
 Specific microoperations for certain CPUs
 Specific microoperations for MMUs, caches, I/O controllers

The architecture also determines the characteristics of
each microoperation
 If the autoincrement addressing mode is used, the number to
be automatically added to the base register can be 4 or 8
depending on the memory location length and the word
length
 Whether to attach 16 bits or 32 bits during sign extension

Thus, each machine language instruction requires a number
of certain microoperations taking a certain time : the CPIi

Designing the CPU Digital System
 Microoperations

The CPU can perform one or more microoperations
per clock period, depending on the complexity of
the microoperation and the availability of the
hardware resources
 Most often a microoperation can be completed in one
clock period unless it is a complex microoperation
 If a complex microoperation is to be completed in one clock
period, the clock period needs to be longer

The more numerous and complex the microoperations are,
the longer it takes to run the machine language
instruction
 CISC instructions take longer to execute (larger
CPIi) for this reason

Designing the CPU Digital System
 Calculating CPIi

The time it takes to run an instruction, CPIi, is then
determined by
 The number of microoperations needed for it
 The complexity of the microoperations

The number of clock periods for an instruction, CPIi,
becomes a matter of figuring out the microoperations and
how to distribute them to individual clock periods
 One can come up with 5-10 simple microoperations to be
performed one after another, resulting in a CPIi of 5-10
 But, since microoperations are simple, the clock period is short
 Alternatively, one can come up with 2-4 complex
microoperations, resulting in a CPIi of 2-4
 But, the clock period is longer
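
The trade-off between the two options can be put into numbers. The sketch below multiplies CPIi by the clock period for each option. The CPIi values and clock periods are hypothetical figures chosen only for illustration; they are not taken from the handout or from any real CPU.

```python
def instruction_time_ns(cpi_i: int, clock_period_ns: float) -> float:
    """Time to run one instruction = number of clock periods x clock period length."""
    return cpi_i * clock_period_ns


# Hypothetical option 1: 8 simple microoperations, one per short clock period.
many_simple = instruction_time_ns(cpi_i=8, clock_period_ns=1.0)

# Hypothetical option 2: 3 complex microoperations, each needing a longer clock period.
few_complex = instruction_time_ns(cpi_i=3, clock_period_ns=2.5)

print(many_simple, few_complex)
```

With these made-up figures the two options finish an instruction in similar time, which is why CPIi alone does not decide the design.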

Designing the CPU Digital System
 Calculating CPIi

What can we do ?
 Few long clock periods vs. many but shorter clock periods ?
 Since increasing the clock frequency is important for marketing
purposes the second option would weigh in substantially
 It turns out that if pipelining is implemented, having many shorter
clock periods would not matter as we will see
 CPIi figures will be large but CPIave will be close to 1 (one) !

Today’s microprocessors have instruction CPIi values in the
range of 10-30, but CPIave figures for their targeted
applications are even less than 1 (one) !
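
Why CPIave can approach 1 even when every CPIi is large can be sketched with a small calculation. It assumes an idealized pipeline in which a new instruction starts every clock period, so only the first instruction pays its full CPIi. The function name and the `stall_cycles` parameter are hypothetical and serve only this illustration.

```python
def pipelined_cpi_ave(num_instructions: int, cpi_i: int, stall_cycles: int = 0) -> float:
    """Average clock periods per instruction on an idealized pipeline.

    The first instruction needs cpi_i clock periods; every later instruction
    completes one clock period after its predecessor, unless stalls are added.
    """
    total_clock_periods = cpi_i + (num_instructions - 1) + stall_cycles
    return total_clock_periods / num_instructions


print(pipelined_cpi_ave(num_instructions=1000, cpi_i=5))
print(pipelined_cpi_ave(num_instructions=1000, cpi_i=20))
```

Under these assumptions a CPIi of 5 and a CPIi of 20 both give a CPIave only slightly above 1 for a long instruction stream. A CPIave below 1 requires completing more than one instruction per clock period, which this sketch does not model.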

Designing the CPU Digital System
 Determining microoperations for a machine
language instruction

Some microoperations are performed for all the
instructions
 Usually at the same point in time during the execution of
every instruction
 Fetching the instruction is always the first microoperation
to perform for all CPUs
 Updating PC (PC ← PC + 4) so that it points at the next
instruction is also universal

The other microoperations depend on the
instruction, the addressing mode, where the
arguments are, the length of the arguments, etc.

Designing the CPU Digital System
 Determining microoperations for a machine language
instruction

We would list all the microoperations for each instruction,
by making sure that we are consistent in terms of
 Bus usage
 We often decide an approximate number of buses we need for our
datapath
 Today’s CPUs have at least three internal buses to complete an
integer arithmetic microoperation in one clock period
 Two buses carry the numbers from two registers and the third
bus carries the result to a register
 ALU usage
 An ALU is expensive and so we try to limit the number of them

Designing the CPU Digital System
 Determining microoperations for a machine language
instruction

We would list all the microoperations for each instruction,
by making sure that we are consistent in terms of
 Register usage
 Additional registers not visible to the architecture level are used
to keep temporary values : microarchitecture registers
 Typically, the more registers are used, the more clock periods we
spend for an instruction since temporary values will be passed
from one clock period to another
 But, sometimes we have to use microarchitecture registers, such
as the instruction register that keeps the current instruction
 Control unit usage

Designing the CPU Digital System
 Designing the MIPS CPU digital system
Determine how each MIPS architectural operation
is implemented by microoperations
 Most microoperations must be simple enough to be
completed in less than one clock period

 A few microoperations may not be completed in a clock
period
 For example a memory read may take several clock periods
 These microoperations should be accommodated in the
FSM state diagram, the datapath and the control unit

Designing the CPU Digital System
 The MIPS microoperations implied by the MIPS
machine language instructions are









Instruction fetch, performed always
Update PC for next instruction, performed always
Effective address calculation for Displacement and relative
addressing modes
Sign extension or catenation of 0s for data/addresses
Reading data from the memory
Writing data to the memory
Performing an arithmetic/logic operation
Register transfer
Testing a condition

Unpipelined MIPS CPU : Version 0
 By using the MIPS CPU Handout
Unpipelined MIPS CPU Design : Version 0

The most interesting component of a computer is
the CPU
 We know that the CPU has registers, buses, ALUs and a
sequencer, among others
 Note that whether hardwiring or microprogramming is
used, the datapath stays the same, at least theoretically
 The textbook gives the description of the datapath, not
the control unit
 We will do the same thing
 The datapath performs microoperations on data
 It uses registers, buses and the ALU for that purpose
 The microoperations are in turn controlled by the
control unit.

Overview
 We are now ready for the organizational
design of the MIPS

We know the architecture of MIPS
 We will design

The MIPS CPU that will have
 A control unit with a sequencer
 A datapath containing registers, buses and the ALU

The datapath performs the microoperations and
the control unit determines the timing and
sequence of these microoperations

Overview
 The way the MIPS computer is covered indicates
that the authors organized the computer similar to
the commercial MIPS systems where



There is an integer MIPS CPU
A system control coprocessor (CP0) responsible for memory
management and cache control.
An FP coprocessor (CP1)
 The integer MIPS CPU registers are either
architectural or microarchitectural (temporary
registers)
 There are two other coprocessors, CP2 and CP3 that
are reserved for future use

Overview
 Designing the MIPS CPU for all of the instructions is
prohibitive
 First, we will design a MIPS CPU to execute only
integer instructions that include

LD, SD
DADD, DSUB
DADDI
AND, OR, XOR
ANDI, ORI, XORI
SLT
SLTI
BEQZ, BNEZ

We will not cover the execution of J-format instructions







 All these integer instructions use either the I-format or the R-format
 Their execution hardware can be derived after learning how
the hardware for R-format and I-format instructions is
constructed

Overview
 The MIPS CPU will have all the architectural
registers
32 64-bit GPRs
 64-bit PC

 FP registers are to be added later in the
semester

New Microarchitectural registers
 These (temporary) registers are not a part of
the state (hence architecture)
 32-bit instruction register, IR, to keep the
current instruction

IR contains the instruction until it is completely
executed
 64-bit A and B registers

They keep the content of Rs and Rt registers of
the current instruction
 64-bit register Imm

It contains the sign extended value of the 16-bit
Displacement/Offset/Immediate (DOImm) field
of I-type instructions
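
What the Imm register receives can be sketched in a few lines. The helper below models 64-bit registers as non-negative Python integers; the names `MASK64` and `sign_extend_16` are hypothetical and are not taken from the handout.

```python
MASK64 = (1 << 64) - 1  # keeps values within a 64-bit register


def sign_extend_16(doimm: int) -> int:
    """Sign extend the 16-bit DOImm field to a 64-bit register value."""
    doimm &= 0xFFFF                 # keep only the 16 field bits
    if doimm & 0x8000:              # sign bit set: the value is negative
        doimm |= MASK64 & ~0xFFFF   # copy the sign bit into the upper 48 bits
    return doimm


print(hex(sign_extend_16(500)))     # a small positive field is unchanged
print(hex(sign_extend_16(0xFFF9)))  # the 16-bit pattern of -7
```

Sign extension keeps a negative displacement negative when it is later added to a 64-bit base register.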

New Microarchitectural registers
 64-bit Load Memory Data, LMD, register

It keeps the data read from the memory for Load
instructions
 64-bit ALUoutput register

It keeps the result of the ALU operation
temporarily
 1-bit Cond register

It keeps the result of compare operation between
register A and 0
 This is needed for the BEQZ and BNEZ instructions that
compare register Rs with 0

New Microarchitectural registers
 64-bit A and B registers
R format : Opcode (6) | Rs (5) | Rt (5) | Rd (5) | Shamt (5) | Function (6)
I format : Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
Rs : To register A
Rt : To register B
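
The field widths of the R and I formats translate directly into shifts and masks on the 32-bit instruction word. The sketch below assumes the opcode occupies the most significant 6 bits, following the left-to-right order of the format figure. The function name `decode_fields` and the sample opcode value are hypothetical.

```python
def decode_fields(ir: int) -> dict:
    """Split a 32-bit instruction word into the R-format and I-format fields."""
    return {
        "opcode": (ir >> 26) & 0x3F,    # 6 bits
        "rs": (ir >> 21) & 0x1F,        # 5 bits, selects the GPR copied to A
        "rt": (ir >> 16) & 0x1F,        # 5 bits, selects the GPR copied to B
        "rd": (ir >> 11) & 0x1F,        # 5 bits, R format only
        "shamt": (ir >> 6) & 0x1F,      # 5 bits, R format only
        "function": ir & 0x3F,          # 6 bits, R format only
        "doimm": ir & 0xFFFF,           # 16 bits, I format only
    }


# Illustrative I-format word: opcode value chosen arbitrarily, Rs = 8, Rt = 1, DOImm = 500.
word = (0x37 << 26) | (8 << 21) | (1 << 16) | 500
fields = decode_fields(word)
print(fields["rs"], fields["rt"], fields["doimm"])
```

All fields are extracted unconditionally; which of them are meaningful depends on the instruction format.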

New Microarchitectural registers
 Even if an instruction does not have Rs and Rt fields,
such as a J-format instruction, Rs and Rt field bits
are used to move Rs and Rt content to A and B,
respectively
J format : Opcode (6) | Offset26 (26)
The two 5-bit groups at the Rs and Rt positions inside Offset26 still select the GPRs moved to register A and register B
The values of A and B registers will not be used !
The reason for moving to A and B is to make the common
case fast where we think most instructions are R-format or
I-format and require this move !

New Microarchitectural registers
 64-bit register Imm
I format : Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
Displacement/Offset/Immediate : To register Imm after sign extension
Even if the current instruction is not an I-format
instruction, such as an R-format or J-format instruction,
DOImm field bits are used to move DOImm+ to Imm
 The value of the Imm register will not be used !
 The reason for moving to Imm is to make the common case fast
where we think many instructions are I-format and require this
move !

New Microarchitectural registers
 The textbook implies in Appendix A that the
Displacement used for loads and stores is signed
 Similarly, the textbook sign extends the
immediate data elements of the ANDI, ORI and XORI
instructions

Instead of attaching zeros to the left
 In order not to complicate the coverage of the textbook
CPU design, we will accept these and assume the 16-bit
value is signed for the integer instructions we will
work on

We will use DOImm+ to indicate a sign-extended value from
now on

The MIPS CPU state diagram
 The design of a CPU is very complex




We have to consider the space (hardware) and time (speed)
The design, analysis, description, testing, modification,
optimization, servicing and maintenance can be more
efficient if there are efficient tools around
These include HDLs and CAD tools
The textbook uses a typical register transfer language (RTL)
notation in Appendix A to describe the execution of
instructions
 We will use the same RTL notation which is also used in the
handout
 To quickly see the execution steps of the integer
machine language instructions, an FSM state diagram
and a CPU datapath figure are developed in the
handout

Additionally, timing diagrams and tables are provided to
understand the CPU design

The MIPS CPU state diagram
 An instruction goes through several phases
when executed

We give a name to each phase of an instruction
execution
 A phase is also called a major cycle

Each major cycle will take one or more minor
cycles (clock periods)
 Each minor cycle is a state
 Each minor cycle takes typically one clock period

Each major cycle often has at least one
microoperation
 Often the name of a major cycle is derived from the
major microoperation of the cycle

The MIPS CPU state diagram
 The number of major cycles and their complexity are
small for RISC systems and larger for CISC systems
 Often for RISC systems, the CPIi for most
frequently used instructions is between 4 and 6

However, this number has to be larger to have deep
pipelining and high clock frequencies
 In simple systems like RISC systems sharing of
hardware among different major cycles is not
necessary

A hardware resource is often needed in one major cycle only
 The hardware for each major cycle can then be easily
identified and is often called a stage
 So, the execution of an instruction is the movement of the
instruction through some or all of the stages of the CPU !

The MIPS CPU state diagram
 The MIPS integer instructions go through at
most five major cycles during the execution

However, even for this RISC machine, it is
difficult to come up with five cycle names because not all
instructions do similar things in a major cycle
 Some microoperations will be performed in
advance in anticipation of a frequent
operation

The early operations will not alter the state and
will not cause longer clock periods, but will slightly
increase the hardware

The MIPS CPU state diagram
 The MIPS CPU major cycles for integer instructions
(pages A-27 – A-28)

Instruction fetch cycle
 Abbreviated as IF, standing for instruction fetch
 Same for all MIPS instructions.

Instruction decode/Register fetch cycle
 Abbreviated as ID, standing for instruction decode
 Same for all MIPS instructions.

Execution/effective address cycle
 Abbreviated as EX, standing for execution

Memory access/branch completion cycle
 Abbreviated as MEM, standing for memory

Write-back cycle
 Abbreviated as WB, standing for write-back

The MIPS CPU state diagram
 To emphasize again, designing a CPU means
determining which microoperation happens when for
each architectural operation (the semantics of the
instruction)
 For the MIPS, like many other CPUs, the IF and ID
stages are identical for all instructions


The same microoperations are performed for all instructions
These microoperations implement portions of the
architectural operation
 For the MIPS, the remaining portions of the
architectural operation are performed in the EX,
MEM and WB stages

The MIPS CPU state diagram
 Architectural operations of I-format
instructions among the integer instructions
I format : Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
Load/Store instructions and their architectural operations
 LD Rt, Disp(Rs) : Rt ← M[Rs + Disp+]
 SD Rt, Disp(Rs) : M[Rs + Disp+] ← Rt
Superscript + indicates sign extension
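
The load and store semantics can be tried out with a toy model. The sketch below assumes the GPRs are a Python list of 64-bit values and the memory is a dictionary from byte address to a 64-bit value. These data structures and all function names are hypothetical simplifications, not part of the handout.

```python
MASK64 = (1 << 64) - 1


def signed_16(disp: int) -> int:
    """Interpret a 16-bit field as a signed number (the Disp+ of the slides)."""
    disp &= 0xFFFF
    return disp - 0x10000 if disp & 0x8000 else disp


def effective_address(gpr: list, rs: int, disp: int) -> int:
    """Rs + Disp+, wrapped to 64 bits."""
    return (gpr[rs] + signed_16(disp)) & MASK64


def load_ld(gpr: list, memory: dict, rt: int, rs: int, disp: int) -> None:
    """LD Rt, Disp(Rs): Rt <- M[Rs + Disp+]."""
    gpr[rt] = memory[effective_address(gpr, rs, disp)]


def store_sd(gpr: list, memory: dict, rt: int, rs: int, disp: int) -> None:
    """SD Rt, Disp(Rs): M[Rs + Disp+] <- Rt."""
    memory[effective_address(gpr, rs, disp)] = gpr[rt]


gpr = [0] * 32
gpr[8] = 1000
memory = {1500: 42}
load_ld(gpr, memory, rt=1, rs=8, disp=500)       # reads address 1000 + 500
store_sd(gpr, memory, rt=1, rs=8, disp=0xFFF8)   # writes address 1000 - 8
print(gpr[1], memory[992])
```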

The MIPS CPU state diagram
 Architectural operations of I-format instructions
among the integer instructions
I format : Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
 Arithmetic/Logic instructions
DADDI Rt, Rs, Imm+ : Rt ← Rs + Imm+
ANDI Rt, Rs, Imm+ : Rt ← Rs ∧ Imm+
ORI Rt, Rs, Imm+ : Rt ← Rs ∨ Imm+
XORI Rt, Rs, Imm+ : Rt ← Rs ⊕ Imm+
SLTI Rt, Rs, Imm+ : If Rs < Imm+ then Rt ← 1 else Rt ← 0

The MIPS CPU state diagram
 Architectural operations of I-format instructions
among the integer instructions
I format : Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
 Branch instructions
BEQZ Rs, Offset : If Rs = 0, then PC ← PC + (4 x Offset+)
BNEZ Rs, Offset : If Rs ≠ 0, then PC ← PC + (4 x Offset+)
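
The branch semantics can be checked against the test program shown earlier, where BNEZ uses an offset of -7. The sketch below assumes that PC already points at the instruction after the branch, because PC is incremented by 4 when the branch is fetched. The addresses and function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def signed_16(offset: int) -> int:
    """Interpret a 16-bit field as a signed number (the Offset+ of the slides)."""
    offset &= 0xFFFF
    return offset - 0x10000 if offset & 0x8000 else offset


def next_pc(pc: int, rs_value: int, offset: int, branch_if_zero: bool) -> int:
    """BEQZ (branch_if_zero=True) or BNEZ (branch_if_zero=False).

    pc is assumed to be the already incremented PC (branch address + 4).
    """
    taken = (rs_value == 0) if branch_if_zero else (rs_value != 0)
    if taken:
        return (pc + 4 * signed_16(offset)) & MASK64
    return pc


# Hypothetical layout: LD at 0x1000, so BNEZ (the 7th instruction) is at 0x1018.
pc_after_fetch = 0x1018 + 4
print(hex(next_pc(pc_after_fetch, rs_value=5, offset=0xFFF9, branch_if_zero=False)))
```

With an offset of -7 the taken branch lands on the LD at the top of the loop, seven instructions before the instruction that follows BNEZ.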

The MIPS CPU state diagram
 Architectural operations of R-format instructions
among the integer instructions
R format : Opcode (6) | Rs (5) | Rt (5) | Rd (5) | Shamt (5) | Function (6)
 Arithmetic/Logic instructions
DADD Rd, Rs, Rt : Rd ← Rs + Rt
DSUB Rd, Rs, Rt : Rd ← Rs - Rt
AND Rd, Rs, Rt : Rd ← Rs ∧ Rt
OR Rd, Rs, Rt : Rd ← Rs ∨ Rt
XOR Rd, Rs, Rt : Rd ← Rs ⊕ Rt
SLT Rd, Rs, Rt : If Rs < Rt then Rd ← 1 else Rd ← 0
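
The R-format architectural operations can be collected into a small lookup table. The sketch below assumes 64-bit wraparound arithmetic and ignores overflow exceptions, which is a simplification. The table `R_FORMAT_OPS` and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def to_signed(value: int) -> int:
    """Interpret a 64-bit register value as a two's-complement number."""
    return value - (1 << 64) if value & (1 << 63) else value


# Each entry maps a mnemonic to the operation applied to the Rs and Rt values.
R_FORMAT_OPS = {
    "DADD": lambda rs, rt: (rs + rt) & MASK64,
    "DSUB": lambda rs, rt: (rs - rt) & MASK64,
    "AND": lambda rs, rt: rs & rt,
    "OR": lambda rs, rt: rs | rt,
    "XOR": lambda rs, rt: rs ^ rt,
    "SLT": lambda rs, rt: 1 if to_signed(rs) < to_signed(rt) else 0,
}


def execute_r_format(gpr: list, op: str, rd: int, rs: int, rt: int) -> None:
    """Rd <- Rs op Rt for the integer R-format instructions."""
    gpr[rd] = R_FORMAT_OPS[op](gpr[rs], gpr[rt])


gpr = [0] * 32
gpr[3], gpr[1] = 5, 7
execute_r_format(gpr, "DADD", rd=2, rs=3, rt=1)
execute_r_format(gpr, "DSUB", rd=5, rs=2, rt=1)
execute_r_format(gpr, "SLT", rd=11, rs=2, rt=5)
print(gpr[2], gpr[5], gpr[11])
```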

The MIPS CPU state diagram
 None of the J-format instructions is executed
by the CPU we are designing
J format : Opcode (6) | Offset26 (26)
 However, one can incorporate them into the
CPU design after the design of the R-format
and I-format instructions is completed

The MIPS CPU state diagram
 The major cycles of the MIPS CPU are shown by the state
diagram given in the MIPS CPU handout
 Registers A and B are used to prepare operands for an ALU
operation
 Each state takes 1 clock period

Later, we will change it to one or more clock periods
 Memory accesses and complex arithmetic operations will take more
than one clock period to perform
 The state that has a memory access or a complex arithmetic operation will
take more than one clock period
 All microoperations mentioned in a state are performed in
parallel, so their order does not matter

If a state takes more than one clock period, one has to be careful
about the parallel operations
 We now obtain the state diagram and the datapath hardware of
the MIPS CPU

The MIPS major cycles and states
 The instruction fetch cycle (IF stage)
It is performed for all the instructions
 There are two microoperations performed
 In general, all CPUs, regardless of their
architecture, do these two microoperations
 Read the machine language instruction pointed to by the
program counter (PC) into the instruction register (IR)
 Update the program counter so that it points at the
instruction that follows the instruction being read from
the memory
From now on look at the MIPS
CPU handout to follow the design

The MIPS major cycles and states
 The instruction fetch cycle (IF stage)

Read the machine language instruction pointed by
the program counter (PC) to the instruction
register (IR)
 IR ← M[PC]
 Note the RTL notation: we use an equal sign (=) if the
destination is a wire or a bus and an arrow sign (←) if the
destination is a register, such as IR
 As mentioned before we will make a few design decisions
on the memory hierarchy as we design the CPU : We will
have an instruction cache which will have only instructions
 We will have Memory Port 1 to access the instruction
cache
 To access the instruction cache, Memory Port 1 has a 64-bit
address bus, ADB1, a 32-bit data bus, MDB1, and at
least a read control signal, Read1, to inform the instruction
cache that we want to read now

The MIPS major cycles and states
 The instruction fetch cycle (IF stage)

Read the machine language instruction pointed by
the program counter (PC) to the instruction
register (IR)
 IR ← M[PC]
 Then, the read of the instruction in terms of buses is as
follows :
ADB1 = PC ; Read1 = 1 ; IR ← MDB1
 Note again that the three microoperations implement the
instruction read, they happen at the same time and their
order does not matter
 Note the RTL notation: we use an equal sign (=) if
the destination is a wire or a bus, such as ADB1, and an
arrow sign (←) if the destination is a register, such as
IR

The MIPS major cycles and states
 The instruction fetch cycle (IF stage)

Update the program counter so that it points at
the next instruction
 PC ← PC + 4
 Since an instruction is four bytes long, we need to add 4 to
PC
 We can use the general ALU in the EX stage to do the
addition, at the expense of increasing the complexity of
the ALU input logic : PC must be connected to MUX2, a 4
must be connected to MUX 3 and the output of the ALU
must be connected to MUX1
 The alternative is to have a simple 32-bit integer adder in
the IF stage

The MIPS major cycles and states
 The instruction fetch cycle (IF stage)

Update the program counter so that it points at
the next instruction
 PC ← PC + 4
 We choose the second alternative since we will need to
have an adder in the IF stage when the CPU is pipelined
PC ← PC + 4
 The MUX1 select input is controlled by the Sel circuit in
the EX stage
 The Sel circuit is in turn controlled by the Cond flip-flop
and the control unit
 The control unit in the IF stage instructs the Sel circuit to
generate a SelectMUX1 value so that the output of the
adder in IF is transferred to PC in the IF major cycle

The instruction fetch cycle (IF stage)
 The two microoperations of the IF cycle can
be shown in state 0 as follows
State 0 (IF) : IR ← M[PC] ; PC ← PC + 4 ;
The two microoperations are simply shown without
using buses to save space
 The instruction cache read and PC update
microoperations happen simultaneously and
complete before the end of the clock period
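
State 0 can be mimicked in software to make the simultaneity explicit. The sketch below assumes the CPU registers live in a dictionary and the instruction cache is a dictionary from address to a 32-bit word. These structures and the function name are hypothetical. Both microoperations read the PC value held at the start of the clock period, so their order does not matter.

```python
MASK64 = (1 << 64) - 1


def instruction_fetch(state: dict, icache: dict) -> None:
    """State 0 (IF): IR <- M[PC] ; PC <- PC + 4, both using the old PC value."""
    old_pc = state["PC"]                    # value on ADB1 during this clock period
    state["IR"] = icache[old_pc]            # word delivered on MDB1
    state["PC"] = (old_pc + 4) & MASK64     # output of the adder in the IF stage


state = {"PC": 0x1000, "IR": 0}
icache = {0x1000: 0xDEADBEEF, 0x1004: 0x12345678}   # made-up instruction words
instruction_fetch(state, icache)
print(hex(state["IR"]), hex(state["PC"]))
```

This model completes the fetch in a single step, so it does not cover the case of a cache that takes more than one clock period.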

The instruction fetch cycle (IF stage)
 If the instruction cache happens to take
more than one clock period, then we stay in
this state and update PC in the last clock period
of the memory access so that the address sent to the
instruction cache, the PC value, does not
change during the access

During this state the state register in the control
unit is 0, indicating we are in state 0
 The self-directed arrow in state 0 indicates waiting for
the slow cache for more than 1 clock period
 Also during this clock period the control unit determines
the next state as state 1
 It is the instruction decode cycle, the ID cycle

The MIPS major cycles and states
 The instruction decode cycle (ID stage)
Unpipelined MIPS CPU Design : Version 0

The most important goal in this cycle is to decode
the instruction
 Decoding the instruction means the CPU determines
what the current instruction is
 It is performed for all the instructions regardless of
their architecture
 Decoding is done by the control unit that checks the
opcode and function bits of IR
 They are input as status signals to the control unit

During this time the datapath does not do
anything

The MIPS major cycles and states
 The instruction decode cycle (ID stage)
Unpipelined MIPS CPU Design : Version 0

Instead of doing nothing in the datapath, we decide to
perform three microoperations in order to be prepared
 Transfer GPR register Rs pointed by I-format and R-format
instructions to register A
 Transfer GPR register Rt pointed by I-format and R-format
instructions to register B
 Transfer the DOImm field of IR to register Imm after sign
extension



By doing these in advance, we save time
 But, not all instructions need them : J-format instructions do
not need them and some of I-format instructions do not need
the transfer to register B
 This is fine since A, B and Imm registers are not architectural
registers and so changing them will not result in program
errors
These three microoperations are performed for all the
instructions
In general, RISC CPUs do these three microoperations

The MIPS major cycles and states
Unpipelined MIPS CPU Design : Version 0
 The instruction decode cycle (ID stage)
 A ← GPR[Rs] ; B ← GPR[Rt] ; Imm ← DOImm+
 The GPR register file is designed so that two GPRs can be read
simultaneously, by using the Rs and Rt fields of IR
 This means the GPR register file has two read ports controlled by
Rs and Rt
A ← GPR[Rs] ; B ← GPR[Rt]
 Note that the order of these microoperations does not matter as
they happen simultaneously
 There is also a write port to the GPR register file controlled by Rt
and Rd fields : 10 bits are connected to the GPR file to determine
the destination register
 A simple Sign Extend circuit attaches 48 zeros or 48 ones to
the DOImm field of IR and the result is stored on register
Imm
Imm ← DOImm+
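The Sign Extend step can be sketched in a few lines of Python. This is only an illustration: the function name `sign_extend` and its parameters are hypothetical, and the 16-bit field width and 64-bit register width are taken from the "48 zeros or 48 ones" remark above.

```python
def sign_extend(value: int, from_bits: int = 16, to_bits: int = 64) -> int:
    """Sign-extend the low from_bits bits of value to a to_bits-wide pattern."""
    value &= (1 << from_bits) - 1           # keep only the DOImm field
    sign = value >> (from_bits - 1)         # most significant bit of the field
    if sign:
        # attach (to_bits - from_bits) ones in front of the field
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value                            # otherwise the upper bits stay zero


print(hex(sign_extend(0x0018)))   # positive field: 48 zeros attached
print(hex(sign_extend(0xFFF9)))   # negative field: 48 ones attached
```

The result is kept as an unsigned 64-bit pattern, which mirrors what register Imm would hold.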

The instruction decode cycle (ID stage)
Unpipelined MIPS CPU Design : Version 0
 The three microoperations of the ID cycle can be shown in state 1 as follows
State 1 (ID) :
A ← GPR[Rs] ;
B ← GPR[Rt] ;
Imm ← DOImm+
The GPR read ports are directly connected to
register A and B and so no buses are used
 The Sign Extend circuit is directly connected to
register Imm
 The three microoperations happen simultaneously
and complete before the end of the clock period


The instruction decode cycle (ID stage)
 During this clock period the state register in the
control unit is 1, indicating we are in state 1
 Also during this clock period the control unit
determines what the next state will be based on the
type of the instruction




If it is a memory reference instruction (LD, SD), the next state is state 2 in the EX cycle
If it is an R-format A/L instruction, the next state is state 6 in the EX cycle
If it is an I-format A/L instruction, the next state is state 9 in the EX cycle
If it is a branch instruction, the next state is state 12 in the EX cycle
Completing the execution of LD and SD


The LD instruction

LD Rt, Disp(Rs)
 Rt ← M[Rs + Disp+]
We see that to execute the LD we need to
1) Calculate the effective address, the address of the memory location we want to load from : Rs + Disp+
2) Read the cache memory location pointed to by the effective address
3) Transfer the value to GPR register Rt

The SD instruction


SD Rt, Disp(Rs)
 M[Rs + Disp+] ← Rt
We see that to execute the SD we need to
1) Calculate the effective address, the address of the memory location we want to store to : Rs + Disp+
2) Write to the cache memory location pointed to by the effective address
 Transfer the value from GPR register Rt to the memory location pointed to by the effective address

Completing the execution of LD and SD
Unpipelined MIPS CPU Design : Version 0
 LD and SD both have a microoperation in
common : calculating the effective address

Then their microoperations differ
 In order to calculate the effective address,
we need to sign extend the DOImm field

This has already been done in the ID stage
 We save time !

We also realize that GPR register Rt has been
transferred to register B
 Register B will be written to the memory for the SD
instruction then
 LD requires one more microoperation than SD, as we will soon see

Completing the execution of LD and SD
 We decide to have the effective address
calculation of LD and SD in the
Execution/Effective address cycle
The effective address is stored in a
microarchitectural register called ALUoutput1
 Then, we separate LD and SD execution in the
Memory Access/Branch completion cycle : Both
access the memory

 LD reads the memory location pointed by the effective
address to a microarchitectural register called LMD
 SD writes microarchitectural register B to a memory
location pointed by the effective address and completes
its execution

LD completes its execution by transferring the
data in LMD to GPR register Rt

Completing the execution of LD and SD
 The effective address calculation
Unpipelined MIPS CPU Design : Version 0

Rs + Disp+
 Rs is now in register A
 Sign extended DOImm is in register Imm
ALUoutput1 ← A + Imm
 As we will see shortly, A/L instructions will have their arithmetic/logic operation performed in this cycle as well
They need the ALU in this cycle, in this stage
 Therefore, we decide to use the adder of the ALU
to do the addition for the effective address


Completing the execution of LD and SD
 We make another decision on the memory hierarchy
that data accesses will be made to another cache,
the Data cache with its own address and data buses
and control signals

Reading from the data cache
ADB2 = ALUoutput1 ; Read2 = 1 ; LMD ← MDB3
 Note that the microoperations are performed in parallel and the order does not matter

This microoperation can be stated without giving the bus detail
LMD ← M[ALUoutput1]

Completing the execution of LD and SD
 Note that the cache access can take more
than one clock period and so we may stay in
this state more than one clock period
 The LD instruction completes by transferring
LMD to GPR register Rt
GPR[Rt] ← LMD

The Rt field of IR is used by the GPR register file to select the register that receives the value from LMD
 We then go back to state 0, the IF cycle, to
start executing the next instruction

Completing the execution of LD and SD

Storing to the data memory
Unpipelined MIPS CPU Design : Version 0
ADB2 = ALUoutput1 ; Write2 = 1 ; MDB2 = B
 Note that the microoperations are performed in parallel
and the order does not matter

This microoperation can be stated without giving
the bus detail
M[ALUoutput1] ← B
 Note that the cache access can take more than one clock
period and so we may stay in this state more than one
clock period

SD completes its execution !
 We then go back to state 0, the IF cycle to
start executing the next instruction

Completing the execution of LD and SD
Unpipelined MIPS CPU Design : Version 0
 The portion of the state diagram for LD and
SD
From the ID cycle (LD, SD)
State 2 (EX) : ALUoutput1 ← A + Imm
LD : State 3 (MEM) : LMD ← M[ALUoutput1]
     State 4 (WB) : GPR[Rt] ← LMD
SD : State 5 (MEM) : M[ALUoutput1] ← B
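The LD and SD paths through states 2 to 5 can be mimicked with a small Python sketch. The function names `execute_ld` and `execute_sd`, the list used for the GPR file and the dictionary used for the data cache are all hypothetical modelling choices, and `imm` is assumed to be the already sign-extended Disp held in register Imm.

```python
MASK64 = (1 << 64) - 1  # registers and addresses are treated as 64-bit values


def execute_ld(gpr, mem, rs, rt, imm):
    """LD Rt, Disp(Rs): returns the list of states traced after the ID cycle."""
    a = gpr[rs]                          # register A, loaded in the ID cycle
    alu_output1 = (a + imm) & MASK64     # state 2 (EX)  : ALUoutput1 <- A + Imm
    lmd = mem[alu_output1]               # state 3 (MEM) : LMD <- M[ALUoutput1]
    gpr[rt] = lmd                        # state 4 (WB)  : GPR[Rt] <- LMD
    return [2, 3, 4]


def execute_sd(gpr, mem, rs, rt, imm):
    """SD Rt, Disp(Rs): returns the list of states traced after the ID cycle."""
    a, b = gpr[rs], gpr[rt]              # registers A and B, loaded in the ID cycle
    alu_output1 = (a + imm) & MASK64     # state 2 (EX)  : ALUoutput1 <- A + Imm
    mem[alu_output1] = b                 # state 5 (MEM) : M[ALUoutput1] <- B
    return [2, 5]


gpr = [0] * 32                 # hypothetical register file contents
mem = {0x150: 0xC}             # hypothetical data memory contents
execute_ld(gpr, mem, rs=0, rt=1, imm=0x150)
gpr[2] = 0x3E
execute_sd(gpr, mem, rs=0, rt=2, imm=0x200)
print(hex(gpr[1]), hex(mem[0x200]))
```

The sketch ignores timing; it only shows which microarchitectural register carries the value at each state.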

Completing the execution of I-format A/L
instructions
 The I-format A/L instructions
Unpipelined MIPS CPU Design : Version 0





DADDI  Rt, Rs, Imm+    Rt ← Rs + Imm+
ANDI   Rt, Rs, Imm+    Rt ← Rs ∧ Imm+
ORI    Rt, Rs, Imm+    Rt ← Rs ∨ Imm+
XORI   Rt, Rs, Imm+    Rt ← Rs ⊕ Imm+
SLTI   Rt, Rs, Imm+    If Rs < Imm+ then Rt ← 1 else Rt ← 0
 To execute these instructions we need to perform an
operation specified by the Opcode field of IR
 Then we transfer the result to GPR register Rt
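A minimal Python sketch of how the Opcode could select one of the five I-format operations is given below. The table `I_FORMAT_OPS` and the helper names are hypothetical; the sketch follows the slides in assuming that the immediate has already been sign extended to 64 bits for all five instructions, and SLTI is assumed to compare the values as signed numbers.

```python
MASK64 = (1 << 64) - 1


def to_signed(x):
    """Interpret a 64-bit pattern as a two's complement number."""
    return x - (1 << 64) if x >> 63 else x


# One entry per opcode: inputs are register A and register Imm (64-bit patterns)
I_FORMAT_OPS = {
    "DADDI": lambda a, imm: (a + imm) & MASK64,
    "ANDI": lambda a, imm: a & imm,
    "ORI": lambda a, imm: a | imm,
    "XORI": lambda a, imm: a ^ imm,
    "SLTI": lambda a, imm: 1 if to_signed(a) < to_signed(imm) else 0,
}


def alu_i_format(opcode, a, imm):
    """Return the value that would be written to GPR register Rt."""
    return I_FORMAT_OPS[opcode](a, imm)


print(hex(alu_i_format("DADDI", 0xC, 0x18)))
```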

Completing the execution of I-format A/L
instructions
 We see that we can perform the required
operations for the I-format instructions in
one state

Which one to perform would be determined by the
Opcode field
 The inputs are Rs and sign extended DOImm
 Rs is already transferred to A and sign extended
DOImm is already transferred to register Imm
 We see we save time by moving them in the ID stage !

Completing the execution of I-format A/L
instructions
 We see that we can perform the required
operations for the I-format instructions in
one state
The result would be stored in the microarchitectural register ALUoutput1
 Though, we could store the result of the operation directly in GPR register Rt

 This would require a separate bus from the output of the ALU to the write port of the GPR file
 We decide to store to ALUoutput1 and then transfer the result from there to the GPR write port
 This decision will help pipelining as we will see later !

Completing the execution of I-format A/L
instructions
 The microoperation for the current I-format
A/L operation
ALUoutput1 ← A op Imm

The meaning of “op” is that the type of the
operation is indicated by the Opcode field of IR
 What happens is that the control unit uses the Opcode
field to generate a set of control signals
 These control signals are connected to the ALU, telling
which operation to perform

Completing the execution of I-format A/L instructions
 The result that is in ALUoutput1 is moved to another microarchitectural register, ALUoutput2
ALUoutput2 ← ALUoutput1

This decision increases the CPIi of the I-format instruction by one clock period !
 This decision will also help pipelining as we will see later !
 The microoperation for the transfer of the result to GPR register Rt
GPR[Rt] ← ALUoutput2

The Rt field of IR is used by the GPR register file to select the register that receives the value from ALUoutput2
 We then go back to state 0, the IF cycle to start executing the
next instruction

Completing the execution of I-format A/L instructions
 The portion of the state diagram for I-format A/L instructions
Unpipelined MIPS CPU Design : Version 0
From the ID cycle (I-format A/L instructions)
State 9 (EX) : ALUoutput1 ← A op Imm
State 10 (MEM) : ALUoutput2 ← ALUoutput1
State 11 (WB) : GPR[Rt] ← ALUoutput2

Completing the execution of R-format A/L
instructions
 The R-format A/L instructions
Unpipelined MIPS CPU Design : Version 0






DADD  Rd, Rs, Rt    Rd ← Rs + Rt
DSUB  Rd, Rs, Rt    Rd ← Rs - Rt
AND   Rd, Rs, Rt    Rd ← Rs ∧ Rt
OR    Rd, Rs, Rt    Rd ← Rs ∨ Rt
XOR   Rd, Rs, Rt    Rd ← Rs ⊕ Rt
SLT   Rd, Rs, Rt    If Rs < Rt then Rd ← 1 else Rd ← 0
 We see that to execute these instructions we need
to perform an operation specified by the Opcode and
Function fields
 Then, we transfer the result to GPR register Rd

Completing the execution of R-format
A/L instructions
 We see that we can perform all the required operations for R-format instructions in one state

Which one to perform would be determined by the
Opcode and Function fields
 The inputs are Rs and Rt
 Rs is already transferred to register A and Rt is already
transferred to register B
 We see we save time by moving them in the ID stage !

Completing the execution of R-format
A/L instructions
 We see that we can perform all the required operations for R-format instructions in one state

The result would be stored in the microarchitectural register ALUoutput1
 Though, we could store the result of the operation directly in GPR register Rd
 This would require a separate bus from the output of the ALU to the write port of the GPR file
 We decide to store to ALUoutput1 and then transfer the result from there to the GPR write port
 This decision will help pipelining as we will see later !

Completing the execution of R-format
A/L instructions
 The microoperation for the current R-format
A/L operation
ALUoutput1 ← A func B

The meaning of “func” is that the type of the
operation is indicated by the Opcode and Function
fields of IR
 What happens is that the control unit uses the Opcode
and Function fields to generate a set of control signals
 These control signals are connected to the ALU, telling
which operation to perform

Completing the execution of R-format A/L
instructions
 The result that is in ALUoutput1 is moved to another microarchitectural register, ALUoutput2
ALUoutput2 ← ALUoutput1

This decision increases the CPIi of the R-format instruction by one clock period !
 This decision will also help pipelining as we will see later !
 The microoperation for the transfer of the result to GPR register Rd
GPR[Rd] ← ALUoutput2

The Rd field of IR is used by the GPR register file to select the register that receives the value from ALUoutput2
 We then go back to state 0, the IF cycle to start
executing the next instruction

Completing the execution of R-format A/L instructions
 The portion of the state diagram for R-format A/L instructions
Unpipelined MIPS CPU Design : Version 0
From the ID cycle (R-format A/L instructions)
State 6 (EX) : ALUoutput1 ← A func B
State 7 (MEM) : ALUoutput2 ← ALUoutput1
State 8 (WB) : GPR[Rd] ← ALUoutput2

Completing the execution of Branch instructions
Unpipelined MIPS CPU Design : Version 0
 The BEQZ instruction

BEQZ Rs, Offset : If Rs = 0, then PC ← PC + (4 x Offset+)
 The BNEZ instruction

BNEZ Rs, Offset : If Rs ≠ 0, then PC ← PC + (4 x Offset+)

Completing the execution of Branch
instructions
Unpipelined MIPS CPU Design : Version 0

We see that to execute these instructions
we need to
1) Calculate the effective address, the address to branch to

Add PC to the result of the multiplication of the sign
extended Offset by 4
2) Test if Rs is equal to or not equal to zero and
store the result of the test in the Cond flip-flop
(FF)

Testing Rs and calculating the effective address can be
done at the same time
3) If the Cond FF is 1, transfer the effective
address to PC

Completing the execution of Branch instructions
 In order to calculate the effective address, we need
to sign extend the DOImm field
Unpipelined MIPS CPU Design : Version 0

This has already been done in the ID stage
 We save time !

We then have to multiply it by 4
 We also realize that GPR register Rs has been
transferred to register A

Register A will be tested !

Completing the execution of Branch instructions
 The effective address calculation

PC + (4 x Offset+)
Unpipelined MIPS CPU Design : Version 0
 Sign extended DOImm is in register Imm
ALUoutput1 ← PC + Imm * 4
 We know that shifting a number to the left by two bit positions is multiplying it by four
ALUoutput1 ← PC + (Imm << 2)

We decide to use the adder of the ALU to do the addition
 Before the addition the Imm value is shifted to the left by two bit positions in the ALU
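The branch effective address calculation can be checked with a short Python sketch. The function name `branch_target` is hypothetical. The sketch assumes that PC was already incremented by 4 in the IF cycle (state 0), so the PC value seen in the EX cycle is the address of the instruction after the branch.

```python
MASK64 = (1 << 64) - 1


def branch_target(pc_after_if, offset16):
    """ALUoutput1 <- PC + (Imm << 2), with Imm the sign-extended 16-bit Offset."""
    imm = offset16 & 0xFFFF
    if imm & 0x8000:
        imm -= 1 << 16                  # sign extension, kept as a negative Python int
    return (pc_after_if + (imm << 2)) & MASK64


# A branch stored at address 110 (hex) sees PC = 114 (hex) in the EX cycle
print(hex(branch_target(0x114, 5)))
print(hex(branch_target(0x114, 0xFFF9)))   # Offset = -7, a backward branch
```

With Offset = 5 the sketch yields 128 (hex), and a negative Offset moves the target backwards in the same way.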

Completing the execution of Branch instructions
 Testing if Rs is equal to or not equal to zero
and storing the result of the test in the Cond
FF

The Zero circuit in the EX stage compares
register A with zero
 The result of Zero is stored on the Cond FF
 Note that the Cond bit is initially set to 0 until a branch
changes it

The opcode of the branch instruction executed is
used by the Sel circuit to generate
 1 if the condition is satisfied
 0 if the condition is not satisfied

Completing the execution of Branch instructions
 The control unit sends a control signal to Sel to
indicate how to generate the output
Unpipelined MIPS CPU Design : Version 0


For example, if A = 0 and it is a BEQZ instruction, Sel outputs 1 and Cond is set to 1
But, if A = 0 and it is a BNEZ instruction, Sel outputs 0 and Cond is set to 0
Cond ← A Branchop 0
 The Branchop is the combined effect of the test and
Sel operations

Note that the Sel circuit is also used in the IF cycle so that
it generates the right value for MUX1 so that we transfer
PC+4 to PC
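The combined effect of the Zero test and the Sel circuit (the Branchop) can be written as a small truth function. This is only a sketch: the function name `cond_value` and the use of opcode strings are hypothetical stand-ins for the control signal sent to Sel.

```python
def cond_value(opcode, a):
    """Value stored in the Cond flip-flop: Cond <- A Branchop 0."""
    zero = (a == 0)                    # output of the Zero circuit
    if opcode == "BEQZ":
        return 1 if zero else 0        # branch is taken when A is zero
    if opcode == "BNEZ":
        return 0 if zero else 1        # branch is taken when A is not zero
    raise ValueError("not a branch opcode: " + opcode)


for op in ("BEQZ", "BNEZ"):
    for a in (0, 0x3E):
        print(op, hex(a), cond_value(op, a))
```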

Completing the execution of Branch instructions
 Changing PC if the Cond FF is 1

This means we branch to a memory location
 That is we take the branch
If (Cond) PC ← ALUoutput1

Reset the Cond FF to 0 so that it can be used for
another branch instruction
Cond ← 0
 We then go back to state 0, the IF cycle to
start executing the next instruction

Completing the execution of Branch instructions
Unpipelined MIPS CPU Design : Version 0
 The portion of the state diagram for Branch
instructions
From the ID cycle (Branch)
State 12 (EX) : ALUoutput1 ← PC + Imm * 4 ; Cond ← A Branchop 0
State 13 (MEM) : If (Cond) PC ← ALUoutput1 ; Cond ← 0

The complete state diagram
 The state diagram for integer instructions
and the datapath are given in the MIPS CPU
handout
 They will be modified to implement a
pipelined MIPS CPU

But, the overall CPU structure will be similar

CPIi of Integer Instructions
Unpipelined MIPS CPU Design : Version 0
 With this implementation, the CPIi of the
instructions can be calculated as
CPI_LD = 5 because we trace states 0, 1, 2, 3, 4
 CPI_SD = 4 because we trace states 0, 1, 2, 5
 CPI_A/L = 5 because we trace states

 0, 1, 6, 7, 8 if R-format
 0, 1, 9, 10, 11 if I-format

CPI_Branch = 4 because we trace states 0, 1, 12, 13
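These CPIi values can be derived by walking the state diagram. The Python sketch below encodes the next-state decisions described earlier; the names `next_state` and `trace` and the instruction-kind labels are hypothetical, and the sketch assumes cache accesses take one clock period, so no state loops back on itself.

```python
# Next state chosen in the ID cycle (state 1), by instruction kind
ID_DISPATCH = {"LD": 2, "SD": 2, "R": 6, "I": 9, "BR": 12}


def next_state(state, kind):
    if state == 0:
        return 1                          # IF is always followed by ID
    if state == 1:
        return ID_DISPATCH[kind]          # ID picks the EX state
    if state == 2:
        return 3 if kind == "LD" else 5   # LD and SD separate after state 2
    if state in (3, 6, 7, 9, 10, 12):
        return state + 1
    return 0                              # states 4, 5, 8, 11, 13 go back to IF


def trace(kind):
    """List the states visited by one instruction, starting from state 0."""
    states, s = [0], next_state(0, kind)
    while s != 0:
        states.append(s)
        s = next_state(s, kind)
    return states


for kind in ("LD", "SD", "R", "I", "BR"):
    print(kind, trace(kind), "CPI =", len(trace(kind)))
```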

Control Signals
 The semantics of each state is that a
microoperation is implemented by the control
unit, turning on and off a few MUX select,
register clock inputs, ALU control inputs and
enable control signals

They are connected to MUXes, registers, ALUs
and tri-state buffers (TRBs)
 They are shown as angled signals in the handout

Depending on the type of chips used, tri-state
chips and/or additional MUXes will be used, for
example, for the usage of the constants, in the
datapath

The Clock Signal
 The clock period duration is determined by the
slowest but important microoperation in the CPU
Unpipelined MIPS CPU Design : Version 0

All the signal delays in the datapath and control unit are
added up to calculate the time for this important operation
 It is usually the integer add microoperation
 Though it could be the cache access time if it was a little
longer than the integer addition time
 Usually, the cache is slower than the CPU in commercial
systems now and so we do not consider it when we calculate the
clock period duration
 The loop back lines drawn for states 0, 3 and 5 indicate that the CPU would spend more than one clock period in the state if the cache memory takes more than one clock period for the access
 That is we assume the integer addition takes one clock period !
 For high-performance systems this is not the case though !

Clock Signal
 The clock period duration is determined by the addition of all the delays in the control unit and the delays in the datapath for the integer add microoperation

The delays in the control unit include the delays to generate
the MUX select, register clock input, ALU control and enable
control signals
 Gate networks generate these select and clock control signals
if hardwiring is used
 The micromemory and additional circuits generate these select
and clock control signals if microprogramming is used

The delays in the datapath include
 Delay of data travel from registers to the ALU inputs
 Delay of the adder in the ALU
 Delay of the data travel from the ALU to the destination
register in the datapath : ALUoutput1

Architecture-Microarchitecture Interaction
 An example of how architectural decisions can affect
the microarchitecture design is the following
Unpipelined MIPS CPU Design : Version 0

The Rd and Rt fields of R-format and I-format instructions
are not in the same position
R-format : Opcode (6) | Rs (5) | Rt (5) | Rd (5) | Shamt (5) | Function (6)
I-format : Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
 Therefore, we need to use two separate states to transfer the
result of an A/L operation from ALUoutput to a destination
GPR register : states 8 and 11
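The effect of the two formats on the destination register can be illustrated by extracting the fields from a 32-bit instruction word. This is a sketch only: the function names are hypothetical, the fields are assumed to be packed left to right starting at the most significant bit, and the example word uses made-up field values rather than a real encoding.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into the fields of both formats."""
    return {
        "opcode": (word >> 26) & 0x3F,    # 6 bits
        "rs": (word >> 21) & 0x1F,        # 5 bits
        "rt": (word >> 16) & 0x1F,        # 5 bits
        "rd": (word >> 11) & 0x1F,        # 5 bits, R-format only
        "shamt": (word >> 6) & 0x1F,      # 5 bits, R-format only
        "function": word & 0x3F,          # 6 bits, R-format only
        "imm16": word & 0xFFFF,           # 16 bits, I-format only
    }


def destination_register(word, is_r_format):
    """R-format results go to Rd (state 8), I-format results go to Rt (state 11)."""
    f = decode_fields(word)
    return f["rd"] if is_r_format else f["rt"]


# Made-up word: opcode 0, rs 2, rt 3, rd 4, shamt 0, function 0x2C
word = (2 << 21) | (3 << 16) | (4 << 11) | 0x2C
print(decode_fields(word))
print(destination_register(word, True), destination_register(word, False))
```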

Updating PC
 The execution sequence in the textbook is
not clear since it updates PC for all
instructions in MEM and in MEM it updates
PC again for branch instructions if the
condition is true

To eliminate the confusion, we remove the NPC
register which is redundant
 But, we will use the NPC register when we implement
pipelining

Using the state diagram
Unpipelined MIPS CPU Design : Version 0
 Consider the following piece of program in
the MIPS memory
---
100   LD    R1, 150(R0)     ; R1 <-- M[R0 + 150+] ; M[150] has C
104   DADDI R2, R1, #18     ; R2 <-- R1 + 18+ where 18 is in Hex
108   DADD  R2, R2, R3      ; R2 <-- R2 + R3 ; R3 has 1A
10C   SD    R2, 200(R0)     ; M[R0 + 200+] <-- R2
110   BEQZ  R2, 5           ; If R2 is equal to 0, branch to address 128
---
150   C                     ; The content of this location is C
---
200   ?
When the program is run we execute instructions in 100, 104, 108, 10C and 110
When the program is run we access data in 150 (a read) and 200 (a write)

Using the state diagram
 If the cache memory is not slow (takes one clock period per access) and there is no miss, then this piece of program will take 23 clock periods, as the table below shows
See the MIPS CPU handout for timing
                           IF   ID   EX   MEM   WB
100 LD    R1, 150(R0)       1    2    3     4    5
104 DADDI R2, R1, #18       6    7    8     9   10
108 DADD  R2, R2, R3       11   12   13    14   15
10C SD    R2, 200(R0)      16   17   18    19
110 BEQZ  R2, 5            20   21   22    23

Using the state diagram
Unpipelined MIPS CPU Design : Version 0
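 If the clock frequency is 1 GHz
$$\text{Clock period} = \frac{1}{\text{Clock frequency}} = \frac{1}{10^{9}} = 10^{-9}\ \text{second} = 1\ \text{ns}$$
$$CPI_{ave} = \frac{\text{Number of clock cycles for the program}}{\text{Number of instructions run}} = \frac{23}{5} = 4.6$$
$$\text{CPUtime} = \text{Number of clock periods for program} \times \text{Clock period} = 23 \times 1\ \text{ns} = 23\ \text{ns}$$
$$MIPS_{ave} = \frac{\text{Number of instructions run}}{\text{CPUtime} \times 10^{6}} = \frac{5}{23 \times 10^{-9} \times 10^{6}} \approx 217$$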
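The same three figures can be recomputed with a few lines of Python. The function name `performance` and its parameter names are hypothetical; the formulas are the ones used on this slide.

```python
def performance(clock_cycles, instructions, clock_hz):
    """Return (CPI_ave, CPU time in seconds, MIPS_ave) for one program run."""
    clock_period_s = 1.0 / clock_hz
    cpi_ave = clock_cycles / instructions
    cpu_time_s = clock_cycles * clock_period_s
    mips_ave = instructions / (cpu_time_s * 1e6)
    return cpi_ave, cpu_time_s, mips_ave


cpi_ave, cpu_time_s, mips_ave = performance(clock_cycles=23, instructions=5, clock_hz=1e9)
print(cpi_ave, cpu_time_s * 1e9, "ns", round(mips_ave))
```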

We assumed the instruction and data cache
memories take one clock period each
 What if they took two clock periods each ?
Unpipelined MIPS CPU Design : Version 0

LD would take 7 clock periods since we trace states 0, 0, 1,
2, 3, 3, 4
 States 0 and 3 are repeated twice since the cache memories
take two clock periods each

SD would take 6 clock periods since we trace states 0, 0, 1,
2, 5, 5
 States 0 and 5 are repeated twice since the cache memories
take two clock periods each

DADD would take 6 clock periods since we trace states 0, 0, 1, 6, 7, 8
 State 0 is repeated twice since the cache memory takes two clock periods

BEQZ would take 5 clock periods since we trace states 0, 0,
1, 12, 13
 State 0 is repeated twice since the cache memory takes two
clock periods
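The effect of the cache access time on each instruction, and on the five-instruction test program, can be sketched as follows. The names `TRACES`, `CACHE_STATES`, `cycles` and `program_cycles` are hypothetical; the sketch assumes that only states 0, 3 and 5 access a cache and that every access hits.

```python
# States traced by each instruction kind when caches take one clock period
TRACES = {
    "LD": [0, 1, 2, 3, 4],
    "SD": [0, 1, 2, 5],
    "R": [0, 1, 6, 7, 8],
    "I": [0, 1, 9, 10, 11],
    "BR": [0, 1, 12, 13],
}
CACHE_STATES = {0, 3, 5}   # instruction fetch, data read, data write

# LD, DADDI, DADD, SD, BEQZ
PROGRAM = ["LD", "I", "R", "SD", "BR"]


def cycles(kind, cache_periods=1):
    """Clock periods for one instruction when each cache access takes cache_periods."""
    return sum(cache_periods if s in CACHE_STATES else 1 for s in TRACES[kind])


def program_cycles(cache_periods=1):
    return sum(cycles(kind, cache_periods) for kind in PROGRAM)


print(program_cycles(1), program_cycles(2))
```

Each cache state is simply counted `cache_periods` times, which is what the self-directed arrows in the state diagram express.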

Using the state diagram
 If the cache memories are slow (they take two clock periods per access) and there is no miss, then this piece of program will take 30 clock periods, as the table below shows
See the MIPS CPU handout for timing
                           IF      ID   EX   MEM     WB
100 LD    R1, 150(R0)      1-2      3    4   5-6      7
104 DADDI R2, R1, #18      8-9     10   11   12      13
108 DADD  R2, R2, R3      14-15    16   17   18      19
10C SD    R2, 200(R0)     20-21    22   23   24-25
110 BEQZ  R2, 5           26-27    28   29   30
 We have so far assumed that the cache memories do not have misses !
 What if both instruction and data cache memories result in cache misses ?

That is, there is a cold start !
 What is the new execution time ?
 To calculate the new execution time we have
to study the structure of the cache memories

The size of the physical (main) memory, the size
of the cache memories, the size of cache blocks,
the type of mapping (direct, associative, block-set
associative), the block replacement strategy, etc.

What if both instruction and data cache memories result in cache misses ?

For this semester
 We will concentrate on Level 1 cache memories, i.e. instruction
and data cache memories
 We will assume that there is no Level 2 cache memory miss !
 We will assume that all the addresses shown are physical
addresses unless otherwise specified
 For this presentation assume that
 The physical (main) memory has 256 Mbytes
 The physical memory has 8 Bytes per location
 The bus width between the physical memory and lowest level
cache is 8 Bytes
 The instruction cache is 8KBytes
 The data cache is 16KBytes
 Both cache block sizes are 32 bytes
 Both cache memories use direct mapping
 Both caches use write-back with write-allocate
 Both cache memories access the needed item first
 The physical memory latency is 4 clock periods and each 8-Byte transfer takes one clock period

Instruction and data cache misses ?
Unpipelined MIPS CPU Design : Version 0
 The physical memory has 256 MBytes or 2^28 Bytes
The physical address is 28 bits long
 The physical memory has 2^28/32 = 2^28/2^5 = 2^23 blocks
 The instruction cache has 8KB/32 = 2^13/2^5 = 2^8 = 256 blocks
 The data cache has 16KB/32 = 2^14/2^5 = 2^9 = 512 blocks
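The block counts and the resulting address field widths for a direct-mapped cache can be recomputed with the sketch below. The function names are hypothetical, and all sizes are assumed to be powers of two, as they are in this presentation.

```python
def log2_exact(n):
    """Base-2 logarithm of a power of two."""
    assert n > 0 and n & (n - 1) == 0, "size must be a power of two"
    return n.bit_length() - 1


def direct_mapped_fields(memory_bytes, cache_bytes, block_bytes):
    """Return block count and tag / block number / byte offset widths in bits."""
    blocks = cache_bytes // block_bytes
    address_bits = log2_exact(memory_bytes)
    offset_bits = log2_exact(block_bytes)
    index_bits = log2_exact(blocks)
    tag_bits = address_bits - index_bits - offset_bits
    return {"blocks": blocks, "tag": tag_bits, "index": index_bits, "offset": offset_bits}


MB, KB = 1 << 20, 1 << 10
icache = direct_mapped_fields(256 * MB, 8 * KB, 32)
dcache = direct_mapped_fields(256 * MB, 16 * KB, 32)
print(icache)
print(dcache)
```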


Instruction and data cache misses ?
Unpipelined MIPS CPU Design : Version 0
 The physical address is used by the physical memory and instruction cache as follows
Main memory block number (23 bits) | Byte offset (5 bits)
Address tag (15 bits) | Instruction cache block # (8 bits) | Byte offset (5 bits)
 The physical address is used by the physical memory and data cache as follows
Main memory block number (23 bits) | Byte offset (5 bits)
Address tag (14 bits) | Data cache block # (9 bits) | Byte offset (5 bits)

Instruction and data cache misses ?
 The instruction cache has 32-Byte blocks
Each block contains 8 instructions since each
instruction is 4 Bytes long
 Instructions in physical memory locations 100
through 110 are in one instruction cache block
Instruction cache blocks have 32 bytes and so each holds 8 instructions ! Instructions in 100, 104, 108, 10C, 110, 114, 118 and 11C are in one instruction cache block !
Address     4 bytes               4 bytes
00000100    LD    R1, 150(R0)     DADDI R2, R1, #18
00000108    DADD  R2, R2, R3      SD    R2, 200(R0)
00000110    BEQZ  R2, 5
00000118
Which instruction cache block is this ?

Instruction and data cache misses ?
 The instruction cache has 32-Byte blocks
Unpipelined MIPS CPU Design : Version 0

Instructions in physical memory locations 100
through 110 are in instruction cache memory block
number 8
Address 0000100 (hex), which holds LD R1, 150(R0), in binary :
0000 0000 0000 0000 0001 0000 0000
Address tag (15 bits) : 0000 0000 0000 000
Instruction cache block # (8 bits) : 00001000, which is 8 in decimal
Byte offset (5 bits) : 00000. The LD instruction has 0 offset from the beginning of the block, i.e. it is the first instruction of the block
Instructions in 100, 104, 108, 10C, 110 are in instruction cache block 8 !
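The address split can be reproduced with shifts and masks. In this sketch the function name `split_address` is hypothetical; `index_bits` is 8 for the instruction cache and would be 9 for the data cache described earlier.

```python
def split_address(addr, index_bits, offset_bits=5):
    """Return (address tag, cache block number, byte offset) for a direct-mapped cache."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Instructions at 100 through 110 (hex) all map to instruction cache block 8
for addr in (0x100, 0x104, 0x108, 0x10C, 0x110):
    print(hex(addr), split_address(addr, index_bits=8))
```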

Instruction and data cache misses ?
 How long does it take to access individual instructions ?

Unpipelined MIPS CPU Design : Version 0

Both cache memories access the needed item first
The physical memory latency is 4 clock periods and transferring an 8-Byte
content is one clock period each
Address     4 bytes               4 bytes               Available after
00000100    LD    R1, 150(R0)     DADDI R2, R1, #18     Five clock periods !
00000108    DADD  R2, R2, R3      SD    R2, 200(R0)     Six clock periods !
00000110    BEQZ  R2, 5                                 Seven clock periods !
00000118                                                Eight clock periods !
Block fill time = 8 clock periods
Time line : Start access, Latency (4 clock periods), Transfer M[100] & M[104], Transfer M[108] & M[10C], Transfer M[110] & M[114], Transfer M[118] & M[11C]
M[100] is the needed item and accessed first !
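The block fill timing on a miss can be sketched in Python. The function name `fill_schedule` and its parameters are hypothetical. The sketch assumes the needed 8-Byte item is transferred first and the rest of the block follows in wrap-around order, one 8-Byte transfer per clock period after the 4-clock-period latency.

```python
def fill_schedule(needed_addr, block_bytes=32, bus_bytes=8, latency=4):
    """Map each 8-Byte chunk address of the block to the clock period in which it arrives."""
    block_start = needed_addr - needed_addr % block_bytes
    chunks = block_bytes // bus_bytes
    first = (needed_addr - block_start) // bus_bytes   # chunk holding the needed item
    schedule = {}
    for i in range(chunks):
        chunk = (first + i) % chunks                   # wrap around to the block start
        schedule[block_start + chunk * bus_bytes] = latency + 1 + i
    return schedule


for addr, clock in fill_schedule(0x100).items():
    print(hex(addr), "available after clock period", clock)
```

The last chunk arrives in clock period 8, which is the block fill time.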

Instruction and data cache misses ?
 The data cache has 32-Byte blocks
Each block contains 4 data elements since each
data element is 8 Bytes long
 The data element in physical memory location 150
is in one data cache block
Data cache blocks have 32 bytes and so each holds 4 data elements ! Data elements in 140, 148, 150, and 158 are in one data cache block !
Address     8 bytes
00000140
00000148
00000150    C
00000158
Which data cache block is this ?

Instruction and data cache misses ?
 The data cache has 32-Byte blocks
Unpipelined MIPS CPU Design : Version 0

The data element in physical memory location 150
is in data cache block number 10
Address 0000150 (hex), which holds C, in binary :
0000 0000 0000 0000 0001 0101 0000
Address tag (14 bits) : 0000 0000 0000 00
Data cache block # (9 bits) : 000001010, which is 10 in decimal
Byte offset (5 bits) : 10000. The data element has a 16-Byte offset from the beginning of the block, i.e. it is the third data element of the block
Data element in 150 is in data cache block 10 !

Instruction and data cache misses ?
 How long does it take to access individual data elements ?

Both cache memories access the needed item first
The physical memory latency is 4 clock periods, and each 8-Byte transfer takes one clock period
Address     Available after
00000140    Seven clock periods !
00000148    Eight clock periods !
00000150    Five clock periods !
00000158    Six clock periods !
Block fill time = 8 clock periods
Time line : Start access, Latency, Transfer M[150], Transfer M[158], Transfer M[140], Transfer M[148]
M[150] is the needed item and accessed first !
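The fill order above (needed item first, then wrapping around the block) can be sketched as follows. The function name `fill_schedule` and its parameter names are hypothetical; the latency of 4 clock periods and the 8-Byte transfer per clock period come from the slides.

```python
def fill_schedule(needed, block_bytes=32, bus_bytes=8, latency=4):
    """Map each 8-byte chunk address to the clock period in which it arrives.

    The needed chunk is transferred first; the remaining chunks follow in
    increasing address order, wrapping around to the start of the block.
    """
    base = needed - needed % block_bytes        # first address of the block
    chunks = block_bytes // bus_bytes           # 4 transfers per block
    first = (needed - base) // bus_bytes        # index of the needed chunk
    schedule = {}
    for i in range(chunks):
        address = base + ((first + i) % chunks) * bus_bytes
        schedule[address] = latency + 1 + i     # one transfer per clock period
    return schedule

print({hex(a): t for a, t in fill_schedule(0x150).items()})
# {'0x150': 5, '0x158': 6, '0x140': 7, '0x148': 8}
```

The largest value in the schedule is the block fill time, 8 clock periods.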

Instruction and data cache misses ?
 The data cache has 32-Byte blocks
Each block contains 4 data elements since each
data element is 8 Bytes long
 The data element in physical memory location 200
is in one data cache block
Data cache blocks have 32 bytes and so each holds 4 data elements !
Data elements in 200, 208, 210, and 218 are in one data cache block !
[Figure: one data cache block, 8 bytes per row, holding the data elements at addresses 00000200, 00000208, 00000210 and 00000218]
Which data cache block is this ?

Instruction and data cache misses ?
 The data cache has 32-Byte blocks
The data element in physical memory location 200 is in data cache block number 16
Address (hexadecimal) : 0000200
Address (binary) : 0000 0000 0000 0000 0010 0000 0000
Fields, from left to right : address tag, data cache block #, byte offset
Data cache block # 16 since 000010000 is 16 in decimal
5 bits ! The byte offset is 5 bits long (00000). The data element has 0 offset from the beginning of the block, i.e. it is the first data element of the block
Data element in 200 is in data cache block 16 !

Instruction and data cache misses ?
 How long does it take to access an individual data element ?

Both cache memories access the needed item first
The physical memory latency is 4 clock periods, and each 8-Byte transfer takes one clock period
Address     Available after
00000200    Five clock periods !
00000208    Six clock periods !
00000210    Seven clock periods !
00000218    Eight clock periods !
Time line : Start access, Latency, Transfer M[200], Transfer M[208], Transfer M[210], Transfer M[218]
M[200] is the needed item and accessed first !

Instruction and data cache misses ?
 How long does it take to run the program with a cold start ?
This piece of program takes 35 clock periods. The table below shows its execution with respect to time
See the MIPS CPU handout for timing
                           IF     ID    EX    MEM      WB
100 LD    R1, 150(R0)      1/5    6     7     8/12     13
104 DADDI R2, R1, #18      14     15    16    17       18
108 DADD  R2, R2, R3       19     20    21    22       23
10C SD    R2, 200(R0)      24     25    26    27/31
110 BEQZ  R2, 5            32     33    34    35
When the program is run we execute instructions in 100, 104, 108, 10C and 110
When the program is run we access data in 150 (a read) and 200 (a write)
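The 35 clock periods can be reproduced with a small model. This is a sketch under stated assumptions: a hit costs 1 clock period, a miss costs 5 (4 of latency plus the transfer of the needed item), and a block counts as present for every access after the first miss to it. The names `access` and `cold_start_clocks` are hypothetical.

```python
BLOCK_BYTES = 32   # both caches use 32-byte blocks in the slides
HIT, MISS = 1, 5   # clock periods: 1 on a hit, 4 latency + 1 transfer on a miss

# (instruction address, mnemonic, data address or None, has a WB cycle)
PROGRAM = [
    (0x100, "LD",    0x150, True),
    (0x104, "DADDI", None,  True),
    (0x108, "DADD",  None,  True),
    (0x10C, "SD",    0x200, False),
    (0x110, "BEQZ",  None,  False),
]

def access(cache, address):
    """Return the access time and record the block as present."""
    block = address // BLOCK_BYTES
    if block in cache:
        return HIT
    cache.add(block)
    return MISS

def cold_start_clocks(program):
    icache, dcache = set(), set()      # cold start: both caches are empty
    clock = 0
    for pc, _name, data_address, has_wb in program:
        clock += access(icache, pc)    # IF
        clock += 2                     # ID and EX
        if data_address is not None:
            clock += access(dcache, data_address)   # MEM with a data access
        else:
            clock += 1                 # MEM without a data access
        if has_wb:
            clock += 1                 # WB
    return clock

print(cold_start_clocks(PROGRAM))  # 35
```

Only three accesses miss: the fetch from 100, the read of 150 and the write to 200. The other four instruction fetches hit because 100 through 110 lie in the same 32-Byte block.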

Pipelining
 Pipelining increases the speed of a CPU
Pipelined MIPS CPU Design : Version 1

The CPU executes multiple instructions simultaneously
 The unpipelined MIPS CPU has five stages
that correspond to the five major cycles

For the unpipelined MIPS CPU, at any time only
one stage is busy and all the others are idle
[Figure: instructions enter the datapath stages IF, ID, EX, MEM and WB, which are driven by the control unit, and leave at the other end]

What is Pipelining ?
 The unpipelined CPU works like this :
[Figure: LD R1, 150(R0) is in IF in clock period 1, ID in clock period 2, EX in clock period 3, MEM in clock period 4 and WB in clock period 5; the next instruction, DADDI, enters IF in clock period 6 and execution continues this way]
 Only one instruction is in the CPU at a time !

What is Pipelining ?
 Pipelining is the simultaneous execution of
multiple instructions in an assembly line
fashion in a single CPU
[Figure: LD R1, 150(R0), DADDI R2, R1, #18, DADD R2, R2, R3, SD R2, 200(R0) and BEQZ R2, 5 flow through IF, ID, EX, MEM and WB, one stage per clock period, with a new instruction entering IF every clock period; clock periods 1 to 6 are shown]

What is Pipelining ?
 Pipelining is a microarchitectural technique where consecutive instructions are executed in an overlapped fashion

Each instruction is in a pipeline stage
 All stages are busy

What is a Stage ?
 Each stage is specialized hardware corresponding to a specific major cycle

IF, ID, EX, MEM, WB
 Recall how we defined a stage for the unpipelined CPU
 The hardware for each major cycle can then be easily identified and is often called a stage

What is Pipelining ?
 Pipelined execution of instructions is similar to the
assembly line manufacturing of cars

What is Pipelining ?
 There are two differences

On a car assembly line there is only one type of
car assembled
 For the CPU the instructions executed are different
 Loads, Stores, A/L, Branch instructions

All the cars on an assembly line have the same
requirements : the same pieces are placed on the
cars
 For the CPU, even if two back-to-back instructions are
of the same type (for example two back-to-back Loads),
they have different requirements (different effective
addresses hence different memory locations are
accessed)

What is Pipelining ?
 Because of these two differences, each stage
has to pass information related to the
instruction it just worked on to the next
stage

Additional temporary registers (latches, buffers)
are placed between each pair of stages to pass the
information about the instruction just leaving one
stage and entering the next one
[Figure: latches placed between each pair of the stages IF, ID, EX, MEM and WB]

What is Pipelining ?
 Latches are then necessary to pass
information about an instruction from one
stage to the next
 Latches are also needed so that partial work
done by one stage is passed to the next stage
so the work continues

What is the Pipe ?
 We give the name “pipe” to the set of stages
since the stages are cascaded to each other
in a single dimension forming a pipe where
instructions
 Enter from one end
 Stay in a stage for one clock period
 Proceed to the next stage
 Finally exit from the other end
 By which time the instruction execution is completed

What is Pipelining ?
 Consider a sequence of instructions and a 5-stage pipeline
[Figure: instructions …I9 I8 I7 I6 I5 I4 I3 I2 I1 enter the pipeline IF, ID, EX, MEM, WB in order and leave at the other end]
 Assume that all the instructions use the five
stages

That is they all take five clock periods to complete
their execution
 This is not possible in real life but let’s assume this for
the time being to understand pipelining quickly

What is Pipelining ?
 The execution can be shown as follows
Stage \ Clock period   1    2    3    4    5    6    7    8
WB                                         I1   I2   I3   I4
MEM                                   I1   I2   I3   I4   I5
EX                               I1   I2   I3   I4   I5   I6
ID                          I1   I2   I3   I4   I5   I6   I7
IF                     I1   I2   I3   I4   I5   I6   I7   I8
Pipeline is full ≡ all stages are busy ≡ start-up time = 5 clock periods
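The stage occupancy of an ideal 5-stage pipeline follows a simple rule: in clock period c, stage number s (counting IF as 0) holds instruction c - s. The sketch below prints such a chart under the slide's assumption that every instruction uses all five stages; the function name `occupancy` is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def occupancy(n_instructions, n_clocks):
    """Return {stage: [instruction label or '' for each clock period]}."""
    chart = {}
    for s, stage in enumerate(STAGES):
        row = []
        for clock in range(1, n_clocks + 1):
            i = clock - s                     # instruction number in this stage
            row.append(f"I{i}" if 1 <= i <= n_instructions else "")
        chart[stage] = row
    return chart

chart = occupancy(8, 8)
for stage in reversed(STAGES):                # WB on top, IF at the bottom
    print(f"{stage:>3} " + " ".join(f"{cell:>3}" for cell in chart[stage]))
```

In clock period 5 every row has an entry, which is the point where the pipeline is full.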

What is Pipelining ?
 Compared with the unpipelined CPU, the five stages are more complex to allow overlapped execution
 All stages take the same amount of time, one
clock period
 The length of the clock period is determined
by the slowest stage

However, it is difficult to obtain stages with an equal amount of work and hence equal time

What is Pipelining ?
 If the CPU is unpipelined, the instructions would take 5 clock periods each
[Time line: I1 completes at time 5, I2 at 10, I3 at 15, I4 at 20, I5 at 25, I6 at 30 and I7 at 35]
 CPIi = 5
 Since each instruction is taking 5 clock periods
 CPIave = 5
 Since the number of clock periods divided by the number of instructions run is 5
 35 / 7 = 5 clock periods

What is Pipelining ?
 If the CPU is pipelined, after the pipeline becomes full (the start-up time), an instruction is completed every clock period, as opposed to one every 5 clock periods
[Time line: I1 completes at time 5, I2 at 6, I3 at 7, I4 at 8, I5 at 9, I6 at 10 and I7 at 11]
CPIi = 5
 Since each instruction is taking 5 clock periods

CPIave ≈ 1
 Since after the start-up time, we complete one
instruction each clock period

What is Pipelining ?
 Once the pipeline is filled, each clock period
an instruction exits the pipeline

Each clock period an instruction is completed
 It seems each instruction takes one clock period to
execute
 CPIave ≈ 1 !!!
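The two time lines can be turned into a quick calculation. This is a sketch for the idealized case used so far, where every instruction uses all five stages and nothing stalls; the function names are hypothetical.

```python
def unpipelined_clocks(n, stages=5):
    """Each instruction runs alone for `stages` clock periods."""
    return n * stages

def pipelined_clocks(n, stages=5):
    """Start-up time of `stages` clock periods, then one completion per clock period."""
    return stages + (n - 1)

for n in (7, 1000):
    u, p = unpipelined_clocks(n), pipelined_clocks(n)
    print(n, u, p, u / n, p / n)
```

For the 7 instructions of the slides this gives 35 and 11 clock periods. CPIave for the pipelined case is 11 / 7 here and approaches 1 as the number of instructions grows, since the start-up time is paid only once.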

What is Pipelining ?
 Assume for the next few slides that the unpipelined MIPS CPU is converted to a pipelined CPU with the stages shown above
 CPILoad = 5
 CPIStore = 4
 CPIA/L = 5
 CPIBranch = 4

What is Pipelining ?
 Consider the following piece of MIPS code
---
200 LD   R1, 500(R0)      ; R1 <-- M[R0 + 500+]
204 DADD R2, R3, R4       ; R2 <-- R3 + R4
208 DSUB R5, R6, R7       ; R5 <-- R6 - R7
20C XOR  R8, R9, R10      ; R8 <-- R9 XOR R10
210 SLT  R11, R12, R13    ; If R12 < R13, R11 <-- 1, else R11 <-- 0
214 OR   R14, R15, R16    ; R14 <-- R15 OR R16
218 SD   R17, 600(R0)     ; M[R0 + 600+] <-- R17
21C BEQZ R18, 5           ; If R18 is equal to 0, branch to address 234
---
This code is not realistic since the instructions are all independent of each other !
But, for the sake of understanding pipelining, we will use this piece of code !

What is Pipelining ?
 Let’s see its pipelined execution using the textbook’s notation, assuming that the cache memories take one clock period and there is no miss
Clock period                  1    2    3    4    5    6    7    8    9    10   11
200 LD   R1, 500(R0)          IF   ID   EX   MEM  WB
204 DADD R2, R3, R4                IF   ID   EX   MEM  WB
208 DSUB R5, R6, R7                     IF   ID   EX   MEM  WB
20C XOR  R8, R9, R10                         IF   ID   EX   MEM  WB
210 SLT  R11, R12, R13                            IF   ID   EX   MEM  WB
214 OR   R14, R15, R16                                 IF   ID   EX   MEM  WB
218 SD   R17, 600(R0)                                       IF   ID   EX   MEM
21C BEQZ R18, 5                                                  IF   ID   EX   MEM

What is Pipelining ?
 The textbook’s notation is hard to follow if there are more than a few instructions
Also, the notation requires a lot of space even for a few instructions
 From now on, we will use our notation
The execution, assuming that the cache memories take one clock period and there is no miss
                              IF   ID   EX   MEM  WB
200 LD   R1, 500(R0)          1    2    3    4    5
204 DADD R2, R3, R4           2    3    4    5    6
208 DSUB R5, R6, R7           3    4    5    6    7
20C XOR  R8, R9, R10          4    5    6    7    8
210 SLT  R11, R12, R13        5    6    7    8    9
214 OR   R14, R15, R16        6    7    8    9    10
218 SD   R17, 600(R0)         7    8    9    10
21C BEQZ R18, 5               8    9    10   11

What is Pipelining ?
 What if the MIPS CPU was not pipelined ?
The execution timing would be as follows, assuming that the cache memories take one clock period and there is no miss
                              IF   ID   EX   MEM  WB
200 LD   R1, 500(R0)          1    2    3    4    5
204 DADD R2, R3, R4           6    7    8    9    10
208 DSUB R5, R6, R7           11   12   13   14   15
20C XOR  R8, R9, R10          16   17   18   19   20
210 SLT  R11, R12, R13        21   22   23   24   25
214 OR   R14, R15, R16        26   27   28   29   30
218 SD   R17, 600(R0)         31   32   33   34
21C BEQZ R18, 5               35   36   37   38
The execution completes in 38 clock periods !
Pipelined execution takes 11 clock periods !
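Both timing tables can be generated from the per-type CPI values given earlier (Load 5, Store 4, A/L 5, Branch 4). This is a sketch assuming one-clock-period cache accesses and no hazards, as the slides do for this code; the function names are hypothetical.

```python
# (mnemonic, number of clock periods the instruction spends in the CPU)
PROGRAM = [("LD", 5), ("DADD", 5), ("DSUB", 5), ("XOR", 5),
           ("SLT", 5), ("OR", 5), ("SD", 4), ("BEQZ", 4)]

def unpipelined(program):
    """Each instruction starts only after the previous one has finished."""
    clock, rows = 0, []
    for name, cpi in program:
        rows.append((name, list(range(clock + 1, clock + cpi + 1))))
        clock += cpi
    return rows

def pipelined(program):
    """Instruction number i (from 0) enters IF in clock period i + 1."""
    return [(name, list(range(i + 1, i + 1 + cpi)))
            for i, (name, cpi) in enumerate(program)]

def finish(rows):
    return max(clocks[-1] for _, clocks in rows)

print(finish(unpipelined(PROGRAM)), finish(pipelined(PROGRAM)))  # 38 11
```

Each row lists the clock periods for IF, ID, EX, MEM and, where present, WB, matching the columns of the two tables.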

What is Pipelining ?
 Pipelining decreases the execution time of
the program, CPUtime

The number of instructions run, NI, stays the
same
 We execute the same number of instructions for a
program

Instructions go through the same stages as the
unpipelined case
 But, we execute several instructions at the same time
 All the stages are busy now
 The CPU does more per clock period
 CPIave decreases

What is Pipelining ?
 We execute more instructions per unit time
(a second)

The throughput is increased
 The MIPSave figure is increased
 The number of instructions executed per second is
increased
 That is why companies like to quote the MIPSave figure for each new generation of their microprocessors : they improve the pipeline, which improves MIPSave

Hardware-related issues to solve
 The stages must be precisely timed, synchronized


Each stage must take the same amount of time
Each stage must have about the same amount of work
 This is hard to achieve unless it is a RISC architecture
 Suppose that we managed to have the same amount
of work per stage so that each stage takes the same
time

What is the clock period ?
 Theoretically the clock period can stay the same as the
unpipelined CPU
 But the simultaneous execution increases the overhead per
clock period
 The clock period duration is increased slightly !

Hardware-related issues to solve
 A solution to these two problems today is to break up
stages that are taking too long into several simpler
stages so that the stages are finer

Then, the pipeline is longer ≡ there are many stages
 Since each stage is doing simpler work, the clock period is
shorter ≡ the clock frequency is higher
 Today, a technique to increase the microprocessor frequency is
exactly this ≡ make stages simpler and simpler ≡ make pipelines
longer and longer
 Today’s microprocessor pipelines are typically 15 to 25 stages
long

Clock skew problems can cause timing problems
 A signal may arrive too late to play a role in generating another
signal since the pipeline is very long !

What is Pipelining ?
 Pipelining does not decrease the CPIi of each
individual instruction but increases the clock
period slightly

The execution time of each instruction in terms of
seconds is increased slightly !

Pipelined MIPS CPU Design
 In CS6143, we design the MIPS CPU by going through
eight versions : 0 through 7


Version 0 is the unpipelined CPU executing only integer
instructions
Version 1 is the pipelined CPU executing only integer
instructions
 Initially, the Version 1 design will not be an acceptable design
 New hardware to handle pipelining is not identified
For example, the latches between stages are not identified
 It will not handle certain situations, called hazards, well
There are three types of hazards : structural, data and control
All programs have hazards, so we will quickly change the design
 Branch instructions take a long time, causing pipeline startups
 It will have imprecise interrupts
 It will assume ideal memory
 All memory accesses take one clock period

So, somehow, the initial design of this version of MIPS
CPU executes the code in a pipelined fashion

Pipelined MIPS CPU Design Versions
 We will design the pipelined MIPS CPU Version 1 in
several steps

The final design of Version 1 will improve the pipeline by
introducing additional hardware to better handle integer
instructions
 New hardware to handle pipelining is identified (latches, etc.)
 It will better handle the three hazards
 Branch instructions will take 2 clock periods
 But, we will have delayed branches which is not practical
 It will assume that the cache memories take more than one clock period and there are cache misses

It will still have an unacceptable feature
 It will still have imprecise interrupts

Pipelining MIPS CPU
 Consider the mnemonic machine language
discussed before
---
200 LD   R1, 500(R0)      ; R1 <-- M[R0 + 500+]
204 DADD R2, R3, R4       ; R2 <-- R3 + R4
208 DSUB R5, R6, R7       ; R5 <-- R6 - R7
20C XOR  R8, R9, R10      ; R8 <-- R9 XOR R10
210 SLT  R11, R12, R13    ; If R12 < R13, R11 <-- 1, else R11 <-- 0
214 OR   R14, R15, R16    ; R14 <-- R15 OR R16
218 SD   R17, 600(R0)     ; M[R0 + 600+] <-- R17
21C BEQZ R18, 5           ; If R18 is equal to 0, branch to address 234
---

Pipelining MIPS CPU
 Here is the execution of the code discussed
earlier
                              IF   ID   EX   MEM  WB
200 LD   R1, 500(R0)          1    2    3    4    5
204 DADD R2, R3, R4           2    3    4    5    6
208 DSUB R5, R6, R7           3    4    5    6    7
20C XOR  R8, R9, R10          4    5    6    7    8
210 SLT  R11, R12, R13        5    6    7    8    9
214 OR   R14, R15, R16        6    7    8    9    10
218 SD   R17, 600(R0)         7    8    9    10
21C BEQZ R18, 5               8    9    10   11
 This MIPS CPU pipeline version has problems
as mentioned previously and on the next slide

Issues with the Current Design
 This program will be executed without difficulty
since all instructions are independent of each other

There is no meaningful application where all instructions are
independent of each other
 An instruction, I1, generates a result that is used by another
instruction, I2, so that I2 depends on I1




This code assumes we will always execute in sequence : even
if we execute branch instructions
Latching hardware is not identified
All memory accesses take one clock period
Some instructions could take shorter times
 Such as the BEQZ instruction

The interrupts are imprecise

Improving Initial Version 1 Design
 The pipelined MIPS CPU state diagram and
pipeline stages

We will obtain the final state diagram and final
datapath after several iterations
 The initial design of Version 1 will be improved by going
through several designs
 First, we will add new hardware, including latches
 Second, we will handle hazards better
 Third, we will execute Branch instructions faster
 Fourth, we will have longer L1 cache hit times and cache
misses

Improving Initial Version 1 Design
 Version 1 will be improved by going through several
designs

First we will add the hardware overhead, including latches
 When we have pipelined execution, it is important not to lose
the info about the execution of each instruction
 With pipelining, each stage transforms the instruction and, by doing so, affects the architectural registers and the memory (the state)
 Some piece of this state is needed to execute an instruction in a specific later stage
 So, when we move an instruction from one stage to another, it
is necessary to transfer the information to the next stage (to
make the state of the instruction available to the next stage)
so that correct execution happens

Latching hardware
 Each stage starts with the “sum” of work that has
been done on the instruction in previous stages

Each stage works on the instruction resulting in new work
that will be needed in later stages to complete the
instruction
 For that purpose stages are provided with their own latches
 In other words, a stage works on an instruction that has left the
previous stage and produces something related to the instruction
and passes it to the next stage to be used in the next clock period
 Thus, we need to save the product of a stage in
temporary registers (buffers, latches) for the next
stage

Latching hardware
 So we need the latches (buffers)

The amount of storage between two stages is not
constant :
[Figure: the stages IF, ID, EX, MEM and WB holding instructions I7, I6, I5, I4 and I3 in one clock period and I8, I7, I6, I5 and I4 in the next]
 We will not discuss the control unit, but we
will know that it is there

Latching hardware
 The new hardware

 Three additional IRs
   Though not all the bits of the extra IRs are needed
 Two NPC registers
 Two ALUoutput registers
 One A register
 One B register

Latching hardware
 Here is the new look of the MIPS CPU datapath with buffers
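[Figure: MIPS CPU datapath with PC, GPR and the stages IF, ID, EX, MEM and WB. Buffer set 2 holds NPC and IR; buffer set 3 holds NPC, A, B, Imm and IR; buffer set 4 holds Cond, ALUoutput, B and IR; buffer set 5 holds ALUoutput, LMD and IR]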
 The leftmost buffer set (with NPC and IR) will be called buffer
set 2 since these buffers are used by the second stage from
left (ID)
 The next buffer set to the right (NPC, A, B, Imm and IR) is
buffer set 3, and so on

Latching hardware
 We will identify the registers by using the
buffer set number (or the stage number
using the registers)

Buffer set 2 registers (Stage 2 uses them)
 2.NPC and 2.IR
 Used by the second stage from left : ID

Buffer set 3 registers (Stage 3 uses them)
 3.NPC, 3.A, 3.B, 3.Imm and 3.IR
 Used by the third stage from left : EX

Buffer set 4 registers (Stage 4 uses them)
 4.Cond, 4.ALUoutput, 4.B and 4.IR
 Used by the fourth stage from left MEM

Buffer set 5 registers (Stage 5 uses them)
 5.ALUoutput, 5.LMD and 5.IR

Latching hardware
 What did we do ?

We identified buffers for the pipelined execution
of instructions
 The initial implementation of Version 1 does not identify
the buffers
 The initial implementation of Version 1 does not specify
that there are four IR registers, two NPC registers, two
ALUoutput registers, etc.

Timing of Microoperations
 We need to know about the timing of microoperations

When exactly does the instruction fetch occur for the LD instruction ?
---
200 LD   R1, 500(R0)
204 DADD R2, R3, R4
208 DSUB R5, R6, R7
20C XOR  R8, R9, R10
210 SLT  R11, R12, R13
214 OR   R14, R15, R16
218 SD   R17, 600(R0)
21C BEQZ R18, 5
---
 That is, we know the instruction fetch will happen in clock period 1 (one), but exactly when ?
 Similarly, when exactly does PC get its value updated to 204 when we execute the LD ?

Note : on the unpipelined CPU, this code takes 38 clock periods !

Timing of Microoperations
 We clock (store on) our registers at the end of a
clock period and therefore, registers change their
values in the beginning of the next clock period


Therefore, IR gets its new value (the LD instruction) at the beginning of the ID cycle (in clock period 2)
PC gets its new value (204) at the beginning of the ID cycle (in clock period 2)
        Before    Clock period 1    Clock period 2     Next clock period
PC      1FC       200               204                208
IR      ?         ?                 LD R1, 500(R0)     DADD R2, R3, R4
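The rule that registers are clocked at the end of a clock period can be sketched as a two-step loop: compute the next values from the current ones, then replace all registers at once. The names `MEMORY` and `simulate` are hypothetical, and the instructions are stored as plain strings only for illustration.

```python
MEMORY = {
    0x200: "LD R1, 500(R0)",
    0x204: "DADD R2, R3, R4",
    0x208: "DSUB R5, R6, R7",
}

def simulate(clock_periods, pc=0x200):
    """Return (clock period, PC, IR) as seen during each clock period."""
    state = {"PC": pc, "IR": None}       # IR is unknown ('?') at the start
    history = []
    for period in range(1, clock_periods + 1):
        history.append((period, state["PC"], state["IR"]))
        # During the period, the next values are computed from the current ones.
        next_state = {"IR": MEMORY.get(state["PC"]), "PC": state["PC"] + 4}
        # At the end of the period, all registers are clocked together.
        state = next_state
    return history

for period, pc, ir in simulate(3):
    print(period, hex(pc), ir)
```

During clock period 1 the IR still shows its old content; the LD instruction and the PC value 204 become visible only in clock period 2.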

Instruction fetch (IF) Cycle
 Fetch the instruction pointed by PC to 2.IR

2.IR <-- M[PC]
 Update PC by adding 4
PC <-- PC + 4
How about 2.NPC ? Soon, we will see that !
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Instruction decode/register fetch (ID) Cycle
 Prepare temporary registers A, B and Imm in case we need GPR registers, an effective address or an immediate operand
3.A <-- GPR[2.IR.Rs]
3.B <-- GPR[2.IR.Rt]
3.Imm <-- 2.IR.DOImm+
How about 3.NPC & 3.IR ? Soon, we will see them !
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Execute (EX) Cycle for Load/Store Instructions
 How do we know we have a Load/Store instruction ?
The IR register for this stage (3.IR) did not receive the value of the IR register of the previous stage (2.IR)
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]
 We need to update the ID stage : 3.IR <-- 2.IR

Instruction decode/register fetch (ID) cycle
 Prepare temporary registers A, B and Imm and move IR to the next stage
3.A <-- GPR[2.IR.Rs]
3.B <-- GPR[2.IR.Rt]
3.Imm <-- 2.IR.DOImm+
3.IR <-- 2.IR
How about 3.NPC ? Soon, we will see that !
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Execute (EX) Cycle for Load/Store Instructions
 Calculate the effective address

4.ALUoutput <-- 3.A + 3.Imm
 We should not forget to move 3.IR to the next stage
4.IR <-- 3.IR
How about 4.Cond and 4.B ? Soon, we will see them !
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Memory access/branch completion (MEM) Cycle for Load
Instructions
 Read the data from memory

5.LMD <-- M[4.ALUoutput]
 We should not forget to move 4.IR to the next stage
5.IR <-- 4.IR
How about 5.ALUoutput ? Soon, we will see that !
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Write-back (WB) Cycle for Load instructions
 Transfer LMD to a GPR register

GPR[5.IR.Rt] <-- 5.LMD
The Load takes 5 clock periods to execute : CPILoad = 5
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]
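The five cycles of a Load can be traced with one variable per buffer register. This is a minimal sketch that follows only the microoperations listed so far; representing the instruction as a dictionary with `Rs`, `Rt` and `Imm` keys, and the function name `execute_load`, are assumptions made for illustration.

```python
def execute_load(instruction_memory, data_memory, gpr, pc):
    """Walk one Load through IF, ID, EX, MEM and WB; return the updated PC."""
    # IF : 2.IR <-- M[PC], PC <-- PC + 4
    ir2 = instruction_memory[pc]
    pc = pc + 4
    # ID : 3.A <-- GPR[2.IR.Rs], 3.B <-- GPR[2.IR.Rt], 3.Imm <-- immediate, 3.IR <-- 2.IR
    a3 = gpr[ir2["Rs"]]
    b3 = gpr[ir2["Rt"]]          # prepared in case it is needed; a Load does not use it
    imm3 = ir2["Imm"]
    ir3 = ir2
    # EX : 4.ALUoutput <-- 3.A + 3.Imm, 4.IR <-- 3.IR
    aluoutput4 = a3 + imm3
    ir4 = ir3
    # MEM : 5.LMD <-- M[4.ALUoutput], 5.IR <-- 4.IR
    lmd5 = data_memory[aluoutput4]
    ir5 = ir4
    # WB : GPR[5.IR.Rt] <-- 5.LMD
    gpr[ir5["Rt"]] = lmd5
    return pc

gpr = [0] * 32
imem = {0x200: {"op": "LD", "Rs": 0, "Rt": 1, "Imm": 0x500}}   # LD R1, 500(R0)
dmem = {0x500: 42}                                              # illustrative value
new_pc = execute_load(imem, dmem, gpr, 0x200)
print(gpr[1], hex(new_pc))  # 42 0x204
```

The instruction register is copied from one buffer set to the next at every step, which is why the WB cycle can still find the destination register in 5.IR.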

Memory access/branch completion (MEM) Cycle for Store
instructions
 The effective address is in 4.ALUoutput
Where is the data to store ?
 It is in 3.B
 We did not transfer 3.B to 4.B !
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Execute (EX) Cycle for Load/Store Instructions
 Calculate the effective address

4.ALUoutput <-- 3.A + 3.Imm
 We should not forget to move 3.IR to the next stage
4.IR <-- 3.IR
 Transfer 3.B to 4.B
4.B <-- 3.B
How about 4.Cond ? Soon, we will see that !
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Memory access/branch completion (MEM) Cycle for Store
Instructions
 Write 4.B to the memory pointed by 4.ALUoutput

M[4.ALUoutput] <-- 4.B
The Store takes 4 clock periods to execute : CPIStore = 4
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Execute (EX) Cycle for A/L R-format instructions
 Perform the operation specified by the Function field of 3.IR
4.ALUoutput <-- 3.A func 3.B
 We should not forget to move 3.IR to the next stage
4.IR <-- 3.IR
How about 4.Cond ? Soon, we will see that !
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Memory access/branch completion (MEM) Cycle for A/L R-format Instructions
 We could complete the execution of these instructions in this cycle by transferring 4.ALUoutput to a GPR register
But, we decide to complete the execution in the WB cycle to help us handle data hazards better as we will see later
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Memory access/branch completion (MEM) Cycle for A/L R-format Instructions
 Transfer 4.ALUoutput and 4.IR to the next stage
5.ALUoutput <-- 4.ALUoutput
5.IR <-- 4.IR
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Write-back (WB) Cycle for A/L R-format instructions
 We transfer the result from 5.ALUoutput to a GPR register


GPR[5.IR.Rd] <-- 5.ALUoutput
A/L R-format instructions take 5 clock periods to execute
 CPIA/L R-format = 5
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Execute (EX) Cycle for A/L I-format instructions
 Perform the operation specified by the Opcode field of 3.IR
4.ALUoutput <-- 3.A op 3.Imm
 We should not forget to move 3.IR to the next stage
4.IR <-- 3.IR
How about 4.Cond ? Soon, we will see that !
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Memory access/branch completion (MEM) Cycle for A/L I-format Instructions
 We could complete the execution of these instructions in this cycle by transferring 4.ALUoutput to a GPR register
But, we decide to complete the execution in the WB cycle to help us handle data hazards better as we will see later
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Memory access/branch completion (MEM) Cycle for A/L I-format Instructions
 Transfer 4.ALUoutput and 4.IR to the next stage
5.ALUoutput <-- 4.ALUoutput
5.IR <-- 4.IR
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Write-back (WB) Cycle for A/L I-format Instructions
 We transfer the result from 5.ALUoutput to a GPR register


GPR[5.IR.Rt] <-- 5.ALUoutput
A/L I-format instructions take 5 clock periods to execute
 CPIA/L I-format = 5
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Execute (EX) Cycle for Branch Instructions
 We need to store the result of comparing 3.A with 0 in Cond
 We need to calculate the effective address by adding PC and 4 times the Offset
But, is PC changed by the instructions behind the Branch ? Yes !
 We should have saved the PC value for the Branch in a new register : NPC in the IF cycle !
[Figure: pipelined MIPS datapath with buffer sets 2 to 5]

Execute (EX) Cycle for Branch Instructions
 We need to study the execution of Branch instructions more carefully
600 BEQZ R8, 4            ; Branch to 614 if R8 = 0
604 DADD R9, R19, R11
608 DSUB R12, R13, R14
60C XOR  R15, R16, R17
610 SLT  R18, R19, R20
614 AND  R21, R22, R23
 When the Branch is in its EX stage, PC is 608
There is a Problem !
Clock period, PC    1, 600    2, 604    3, 604
WB                  ?         ?         ?
MEM                 ?         ?         ?
EX                  ?         ?         BEQZ
ID                  ?         BEQZ
IF                  BEQZ
We detect that there is a Branch in the beginning of its ID cycle (clock period 2)
We then immediately stop the IF stage from fetching any instruction and stop adding 4 to PC

Execute (EX) Cycle for Branch Instructions
• We know we have a Branch in the ID stage when we decode it
  - PC is 604 in ID
• When the Branch reaches EX, it expects to have PC = 604
  - What shall we do?
    - We decide to have a new register to keep the PC value for the Branch: NPC (New PC)
    - We save the PC value for the Branch in NPC in the IF stage
    - So 604 moves with the Branch into the EX stage
    - When the ID stage detects a Branch
      - It stops the IF stage from fetching the next instruction
      - It stops the IF stage from adding 4 to PC
    - We also have to stop incrementing PC so that, if the condition is not satisfied, we execute the instruction following the BEQZ
      - This is the instruction in location 604
      - We should not execute the instruction at 608 right after we execute the BEQZ

Execute (EX) Cycle for Branch Instructions
• We change the IF and ID stages to include transfers to 2.NPC and 3.NPC
• The EX stage for the Branch is like this
    4.IR ← 3.IR
    4.Cond ← 3.A op 0
    4.ALUoutput ← 3.NPC + (3.Imm * 4)        (3.NPC has 604)
• Now we have the correct PC value in 3.NPC in the EX stage
• But when do we write to PC so that we branch?

Execute (EX) Cycle for Branch Instructions
• We write to PC in the clock period after the Branch is in EX
  - We write to PC in the IF stage in clock period 4

[Figure: pipeline stage occupancy by clock period. The BEQZ is in IF in clock period 1 (PC = 600), in ID in clock period 2 (PC = 604) and in EX in clock period 3 (PC = 604). PC is still 604 in clock period 4 and becomes 614 in clock period 5, when the AND is fetched.]

• The IF stage then changes PC and NPC if 4.Cond is 1
    PC ← if (4.Cond) then 4.ALUoutput else if (2.IR.opcode ≠ Branch) then PC + 4
    NPC ← if (4.Cond) then 4.ALUoutput else if (2.IR.opcode ≠ Branch) then PC + 4
• We also need to clear 4.Cond so that a new Branch can be executed
    4.Cond ← if (4.Cond) then 0

Execute (EX) Cycle for Branch Instructions
• What shall we do with DADD, DSUB and XOR?
  - They should not be fetched until we know the Branch result!

[Figure: stage occupancy for clock periods 1 to 5 (PC = 600, 604, 604, 604, 614), with a NOP following the BEQZ and the AND fetched in clock period 5.]

• If the ID stage has a Branch, we stop the instruction fetch from memory
• But we also have to clear 2.IR if it has a Branch, so that we fetch an instruction in the next clock period (clock period 5): 4.IR has the Branch in the 4th clock period
    2.IR ← if (4.IR.opcode = Branch) then NOP
           else if (2.IR.opcode ≠ Branch) then M[PC]

Execute (EX) Cycle for Branch Instructions
• What if we continued with the DADD, DSUB and XOR?
  - Would they change any architectural register or memory?
    - NO! We arranged the pipeline so that all register writes and memory writes happen at the end of the pipeline
    - By that time we know we have a Branch, so we stop these instructions and flush them out
    - RISC architectures allow late writes, which helps the hardware designer
    - CISC architectures require early writes in the pipeline
      - The hardware designer has to undo these early writes when a branch is finally recognized
      - This puts unnecessary pressure on the hardware designer

Execute (EX) Cycle for Branch Instructions
• With the fetches stopped, how does the execution look?

[Figure: stage occupancy for clock periods 1 to 5 (PC = 600, 604, 604, 604, 614), with a NOP following the BEQZ and the AND fetched in clock period 5.]

• The pipeline is almost empty, with only one instruction in the WB stage!
• There is only one instruction in the pipeline
• This is why control instructions are important to deal with in pipelines

Execute (EX) Cycle for Branch Instructions
• Stopping the fetches, shown in a different way

  600  BEQZ R8, 4
  604  DADD R9, R19, R11
  608  DSUB R12, R13, R14
  60C  XOR  R15, R16, R17
  610  SLT  R18, R19, R20
  614  AND  R21, R22, R23

[Figure: textbook-style timing diagram over clock periods 1 to 9. A pipeline bubble is generated while the fetches are stopped, and the Branch causes a pipeline start-up.]

Execute (EX) Cycle for Branch Instructions
• In the 4th clock period we complete the execution of the Branch by writing the effective address to PC in IF
  - The control unit knows we are completing the Branch instruction and so does not allow an instruction fetch

Let’s rewrite microoperations for the Branch
• IF stage (for all instructions)
    2.IR ← if (4.IR.opcode = Branch) then NOP
           else if (2.IR.opcode ≠ Branch) then M[PC]
    PC ← if (4.Cond) then 4.ALUoutput
         else if (2.IR.opcode ≠ Branch) then PC + 4
    2.NPC ← if (4.Cond) then 4.ALUoutput
            else if (2.IR.opcode ≠ Branch) then PC + 4
    4.Cond ← if (4.Cond) then 0
• ID stage (for all instructions)
    3.A ← GPR[2.IR.Rs]
    3.B ← GPR[2.IR.Rt]
    3.Imm ← 2.IR.DOImm+
    3.IR ← 2.IR
    3.NPC ← 2.NPC
• EX stage
    4.IR ← 3.IR
    4.Cond ← 3.A op 0
    4.ALUoutput ← 3.NPC + (3.Imm * 4)
• The Branch execution completes in the IF stage in the next clock period
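The PC rule of the IF stage and the EX microoperations of the Branch can be traced with a short program. The following is a minimal Python sketch and not part of the original slides: the function names `branch_ex`, `next_pc` and `trace_pc` are hypothetical, `op` is taken to be "equal to 0" for BEQZ, and the per-period control inputs are entered by hand from the stage-occupancy diagram rather than produced by a full pipeline model.

```python
def branch_ex(npc3, imm3, a3):
    """EX stage of a BEQZ: return (4.Cond, 4.ALUoutput)."""
    cond4 = 1 if a3 == 0 else 0          # 4.Cond <- 3.A op 0, with op taken as "== 0"
    alu4 = npc3 + imm3 * 4               # 4.ALUoutput <- 3.NPC + (3.Imm * 4)
    return cond4, alu4


def next_pc(pc, cond4, alu4, ir2_is_branch):
    """IF-stage rule: the PC value seen in the next clock period."""
    if cond4:
        return alu4                      # taken Branch: load the target
    if not ir2_is_branch:
        return pc + 4                    # normal sequential fetch
    return pc                            # a Branch sits in 2.IR: PC is held


def trace_pc(r8, branch_address=0x600, offset=4):
    """PC in clock periods 1 to 5 for the BEQZ R8, 4 stored at address 600."""
    pc = branch_address
    trace = [pc]                                         # period 1: BEQZ is fetched
    pc = next_pc(pc, 0, None, False)                     # 2.IR holds no Branch yet
    npc = pc                                             # NPC travels with the Branch
    trace.append(pc)                                     # period 2: BEQZ in ID
    pc = next_pc(pc, 0, None, True)
    trace.append(pc)                                     # period 3: BEQZ in EX
    cond4, alu4 = branch_ex(npc, offset, r8)
    pc = next_pc(pc, 0, None, True)
    trace.append(pc)                                     # period 4: IF uses 4.Cond
    pc = next_pc(pc, cond4, alu4, True)
    trace.append(pc)                                     # period 5
    return trace


print([hex(p) for p in trace_pc(r8=0)])   # Branch taken
print([hex(p) for p in trace_pc(r8=7)])   # Branch not taken
```

With R8 = 0 the trace ends at 614, matching the diagram; with a nonzero R8 the PC stays at 604, so the instruction following the Branch is fetched in clock period 5.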

• Branch instructions take 4 clock periods to execute
  - CPI_Branch = 4
  - This is because the Branch execution is completed in the IF stage
• Overall, executing a control instruction first creates a pipeline bubble and then causes a pipeline start-up where only one stage, IF, is busy
• It is therefore critical that the number of control instructions be reduced by having
  - Better programming styles
  - Better compilers

Evaluation of Pipelined MIPS CPU
• With pipelining and memory hierarchies, hardware has become more sensitive to
  - The number of instructions, NI (due to increased memory hierarchy delays)
  - The number of control instructions (due to pipeline and memory hierarchy delays that can occur)
    - Now we see why the pipeline is sensitive to control instructions
  - The order of instructions (due to pipeline delays that can occur)
    - Class notes on the remaining versions will show examples of why the pipeline is sensitive to the order of instructions

What is Pipelining ?
• Before we continue with the evaluation of our design, a comment:
  - Pipelining is often invisible to the programmer, though current architectures allow some visibility to help improve pipeline performance
    - For example, knowing the pipeline length and how many clock periods each stage takes helps the compiler to come up with more efficient code
    - This is because a better order of instructions can be obtained
    - This is a point made earlier

• The execution of the code on the Version 1 MIPS pipeline is shown again below, assuming that the cache memories take one clock period and there is no miss

                              IF   ID   EX   MEM   WB
  200  LD   R1, 500(R0)       1    2    3    4     5
  204  DADD R2, R3, R4        2    3    4    5     6
  208  DSUB R5, R6, R7        3    4    5    6     7
  20C  XOR  R8, R9, R10       4    5    6    7     8
  210  SLT  R11, R12, R13     5    6    7    8     9
  214  OR   R14, R15, R16     6    7    8    9     10
  218  SD   R17, 600(R0)      7    8    9    10
  21C  BEQZ R18, 5            8    9    10

• Assume that our integer-instruction pipeline can execute the XOR, SLT, etc.
• It looks as if it takes 10 clock periods to run the code
  - Though the Branch completes in clock period 11!

Slide 190

The Speed Comparison
• The piece of program takes 11 clock periods on the pipelined computer as opposed to 33 clock periods on the unpipelined one

$$\text{Speedup}_{overall} = \frac{\text{CPUtime}_{old}}{\text{CPUtime}_{new}} = \frac{33}{11} = 3$$

$$\text{CPI}_{ave,\ w/o\ pipe} = \frac{\#\ \text{of clock periods for program}}{NI} = \frac{33}{8} = 4.125$$

$$\text{CPI}_{ave,\ w/\ pipe} = \frac{\#\ \text{of clock periods for program}}{NI} = \frac{11}{8} = 1.375$$
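The three ratios of the speed comparison can be checked with a few lines of code. This is a small Python sketch added for illustration; the helper names `speedup` and `cpi_average` are hypothetical, and the numbers (33 and 11 clock periods, 8 instructions) are the ones quoted on the slide, with the same clock period assumed for both machines.

```python
def speedup(old_time, new_time):
    """Overall speedup: old CPU time divided by new CPU time."""
    return old_time / new_time


def cpi_average(clock_periods, instruction_count):
    """Average CPI: clock periods for the program divided by NI."""
    return clock_periods / instruction_count


NI = 8                    # instructions in the test program
UNPIPELINED_PERIODS = 33  # clock periods on the unpipelined CPU
PIPELINED_PERIODS = 11    # clock periods on the pipelined CPU

print(speedup(UNPIPELINED_PERIODS, PIPELINED_PERIODS))  # 3.0
print(cpi_average(UNPIPELINED_PERIODS, NI))             # 4.125
print(cpi_average(PIPELINED_PERIODS, NI))               # 1.375
```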

In general any pipeline will work fine if
• Every instruction is independent of every other instruction in the pipeline at any moment
  - Otherwise, we have what we call hazards, as we will see soon
• The number of control instructions is very small
• The order of instructions is good
  - Otherwise, we have what we call hazards, as we will see soon
• There is a lot of hardware available
  - In the ideal case, the unpipelined CPIave ≡ the number of pipeline stages
  - In the ideal case, the pipelined # of clock periods for the program ≡ NI
  - Speedupoverall,ideal = pipeline depth (the number of pipeline stages)

Ideal MIPS

$$\text{MIPS}_{ideal} = \frac{\#\ \text{of instructions completed per clock period} \times \text{clock frequency}}{10^{6}}$$

• If the CPU completes one instruction per clock period

$$\text{MIPS}_{ideal} = \frac{\text{clock frequency}}{10^{6}}$$

• We now see why microprocessor companies are eager to increase the clock frequency!
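As a quick check of the ideal MIPS formula, here is a one-function Python sketch. The function name `mips_ideal` and its parameter names are hypothetical; the 200 MHz value is only an example input (it is the clock frequency assumed in the worked example later in these notes).

```python
def mips_ideal(clock_frequency_hz, instructions_per_clock=1.0):
    """Ideal MIPS rating: instructions completed per second, in millions."""
    return instructions_per_clock * clock_frequency_hz / 1e6


print(mips_ideal(200e6))   # 200.0 for a 200 MHz clock, one instruction per clock period
```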

Pipeline Timing
• Due to start-ups and hazards, CPIave is not 1
• The net effect of start-ups and hazards is that more than one clock period is needed to execute an instruction on average
• The additional clock periods are due to the average delay cycles (which we will soon call stalls) per instruction

$$\text{CPI}_{ave} = \text{CPI}_{ave,\ ideal} + \text{pipeline delays (stalls) per instruction}$$

Pipeline Timing
• Since the ideal CPIave with pipelining is 1, we obtain the following formula

$$\text{Speedup}_{overall} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$

• It is clear from the above formula that, for a given number of stall cycles per instruction, the speedup is directly proportional to the number of pipeline stages
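The two timing formulas can be evaluated directly. The Python sketch below is illustrative only: the function names are hypothetical, and the stall value of 0.25 cycles per instruction is an assumed example input, not a measurement from the course.

```python
def cpi_with_stalls(stalls_per_instruction, ideal_cpi=1.0):
    """CPIave = ideal CPIave + pipeline stall cycles per instruction."""
    return ideal_cpi + stalls_per_instruction


def pipeline_speedup(pipeline_depth, stalls_per_instruction):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return pipeline_depth / (1.0 + stalls_per_instruction)


print(pipeline_speedup(5, 0.0))    # 5.0: no stalls, speedup equals the pipeline depth
print(pipeline_speedup(5, 0.25))   # 4.0: an assumed 0.25 stall cycles per instruction
print(cpi_with_stalls(0.25))       # 1.25
```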

Pipeline Timing
• Example: Assume that a program with no control instructions is run and the following measurements are made on the MIPS

  Instruction   CPI_i   # of times executed   Unpipelined time
  Loads         5       10                    0.25 μsec
  A/L           5       90                    2.25 μsec

• Calculate CPIave and CPUtime for both the unpipelined and pipelined cases, and Speedupoverall, the pipeline efficiency and MIPSideal for the pipelined case
  - Assume that the clock frequency is 200 MHz
  - Note that this program is an ideal program since there is no Store instruction!
• NI = # of Loads + # of A/L = 10 + 90 = 100

$$\text{Clock period} = \frac{1}{\text{Clock frequency}} = \frac{1}{200 \times 10^{6}} = 5 \times 10^{-9}\ \text{s} = 5\ \text{ns}$$

Pipeline Timing
• Example continued
  - For the unpipelined case:
    - CPUtime_unpipelined = Time_Loads + Time_A/L = 0.25 + 2.25 = 2.5 μsec
    - # of clock periods for Loads = # of times executed x CPI_i = 10 x 5 = 50
    - # of clock periods for A/L = # of times executed x CPI_i = 90 x 5 = 450
    - # of clock periods for program = # of clock periods for Loads + # of clock periods for A/L = 50 + 450 = 500

$$\text{CPI}_{ave,\ w/o\ pipe} = \frac{\#\ \text{of clock periods for program}}{NI} = \frac{500}{100} = 5$$

Pipeline Timing
• Example continued
  - For the pipelined case:
    - # of clock periods for program = Start-up time + (NI - 1) = 5 + (100 - 1) = 104
    - CPUtime_pipelined = # of clock periods for program x clock period = 104 x 5 = 520 ns = 0.52 μsec

$$\text{Speedup}_{overall} = \frac{\text{CPUtime}_{old}}{\text{CPUtime}_{new}} = \frac{2.5}{0.52} = 4.81$$

$$\text{Pipeline efficiency} = \frac{\text{Speedup}_{overall}}{\text{Speedup}_{ideal}} = \frac{4.81}{5} = 0.96$$

$$\text{MIPS}_{ideal} = \frac{\text{clock frequency}}{10^{6}} = \frac{200 \times 10^{6}}{10^{6}} = 200$$

• Speedupoverall is not 5 because of the start-up time
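The whole worked example can be reproduced in a few lines. This Python sketch is added for illustration; the variable names are hypothetical, and the inputs (200 MHz clock, 10 Loads and 90 A/L instructions, CPI of 5, pipeline depth of 5) are taken from the example above.

```python
CLOCK_FREQUENCY_HZ = 200e6
PIPELINE_DEPTH = 5
counts = {"Loads": 10, "A/L": 90}   # number of times each class is executed
cpi = {"Loads": 5, "A/L": 5}        # unpipelined CPI of each class

clock_period_ns = 1e9 / CLOCK_FREQUENCY_HZ                         # 5 ns
ni = sum(counts.values())                                          # 100 instructions

# Unpipelined case: every instruction takes its full CPI.
unpipelined_periods = sum(counts[k] * cpi[k] for k in counts)      # 500
cpi_unpipelined = unpipelined_periods / ni                         # 5
time_unpipelined_us = unpipelined_periods * clock_period_ns / 1000

# Pipelined case: start-up time plus one clock period per remaining instruction.
pipelined_periods = PIPELINE_DEPTH + (ni - 1)                      # 104
time_pipelined_us = pipelined_periods * clock_period_ns / 1000

speedup_overall = time_unpipelined_us / time_pipelined_us
efficiency = speedup_overall / PIPELINE_DEPTH
mips_ideal = CLOCK_FREQUENCY_HZ / 1e6

print(f"CPUtime unpipelined = {time_unpipelined_us:.2f} usec")
print(f"CPUtime pipelined   = {time_pipelined_us:.2f} usec")
print(f"Speedup overall     = {speedup_overall:.2f}")
print(f"Pipeline efficiency = {efficiency:.2f}")
print(f"MIPS ideal          = {mips_ideal:.0f}")
```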

Improving Initial Version 1 Design
• Now, we will make an assessment of pipelining to prepare ourselves for the next set of improvements
  - Pipelining increases the speed, but there are difficulties and problems associated with pipelining:
    - The hardware is complicated
    - Additional temporary registers (latches or buffers) are needed between stages so that later stages can correctly work on an instruction
    - Some latches are simple duplications of earlier registers and some are latches that save the output of a stage

Improving Initial Version 1 Design
• Pipelining increases the speed, but there are difficulties and problems associated with pipelining:
  - The pressure on the memory is doubled: two memory accesses per clock period happen
    - One for the instruction in the IF stage
    - One for data in the MEM stage
    - For example, for the program execution on slide 190, the CPU makes two memory accesses in the 4th clock period
    - The frequency of simultaneous accesses depends on the number of Loads and Stores
    - The number of Loads and Stores depends on the programmer and compiler

Improving Initial Version 1 Design
• Pipelining increases the speed, but there are difficulties and problems associated with pipelining:
  - Not all instructions require all the stages
    - Some stages are empty, idle, creating a pipeline bubble that cannot be avoided
    - RISC instructions require fewer stages, therefore the chance of having many unneeded stages is reduced
    - With CISC, the number of stages is larger

Improving Initial Version 1 Design
• Pipelining increases the speed, but there are difficulties and problems associated with pipelining:
  - The start-up time slows the system
    - Its impact is through
      - The number of times it occurs (due to control instructions)
      - The time it takes to fill the pipeline (pipeline depth or latency)
    - RISC systems perform better here since they have shorter pipelines

Improving Initial Version 1 Design
• Pipelining increases the speed, but there are difficulties and problems associated with pipelining:
  - Some instructions have complex microoperations that take longer than one clock period to complete
  - Overall, it is difficult to have balanced stages

Improving Initial Version 1 Design
• Pipelining increases the speed, but there are difficulties and problems associated with pipelining:
  - The clock period is determined by
    - The slowest stage, which is often the stage with the addition
      - The EX stage
    - The latches, which need set-up time and have propagation delays
    - The clock skew problem
  - In RISC systems it is easy to distribute the work equally among the stages, but with CISC it is more difficult
  - So, in order not to increase the clock period length in CISC systems, a stage that has a complex microoperation takes more than one clock period
    - But this creates bubbles!

Improving Initial Version 1 Design
• Pipelining increases the speed, but there are difficulties and problems associated with pipelining:
  - Because of what we call hazards, an instruction in the stream may not be moved to the next stage but forced to stay in the same stage for more than one clock period
    - The instruction is stalled
    - The stages to the left of the stalled instruction cannot move their instructions to the right, to keep the strict order of execution
    - These stages become idle (they do not work on a new instruction) but keep the old instructions
    - This creates a pipeline bubble: the speed is decreased
    - Note that start-ups decrease the speed since there is a larger bubble in the pipeline
      - Control instructions result in start-ups
      - Pipeline “hazards” also create start-ups if the pipeline is poorly designed

Pipeline Hazards
• They are caused by a number of reasons that force the pipeline to stop the execution of an instruction and the instructions that are behind it
  - The instructions are stalled
    - The hazards generate either bubbles or a start-up of the pipeline
• There are three types of hazards
  - Structural
  - Data
  - Control

Structural Hazards
• Structural hazards arise from resource conflicts that can be solved with more resources, i.e. more or faster hardware
• Examples of structural hazards are
  - Only one memory port in the CPU, which stops the IF stage if a Load/Store is using this single memory port to access data
  - An L1 cache memory that takes two or more clock periods!
  - If the GPR set has only one write port and several simultaneous GPR writes are attempted, only one GPR write will happen and the others will be written one by one
  - If a stage performs a complex microoperation taking several clock periods, such as FP arithmetic, and this microoperation is not pipelined, then the instructions behind it will stay idle in their stages (these instructions are stalled)

Structural Hazards
• Due to a structural hazard, one or more instructions behind the instruction that caused the hazard are delayed: they are not allowed to move
  - The stages behind the hazard-causing instruction become idle: a bubble is generated
  - The bubble moves one stage per clock period and eventually leaves the pipeline

Structural Hazards
• What if there was only one memory port?
  - If a Load or Store tries to access a data element in the memory in the MEM cycle, then the IF stage is forced to stay idle by the control unit, so that priority is given to the instruction already in the pipeline to complete it as soon as possible
  - The instruction that was going to be fetched is stalled
  - A bubble is created in the IF stage
  - Theoretically, the instructions behind it are also stalled
  - The bubble moves up the pipeline one stage per clock period
  - Stalling ends when the bubble leaves the pipeline
  - The next slide shows this process one more time, with that theoretical stalling of the instructions behind the first stalled instruction

Structural Hazards
• What if there was only one memory port?

  200  LD   R1, 500(R0)
  204  DADD R2, R3, R4
  208  DSUB R5, R6, R7
  20C  XOR  R8, R9, R10
  210  SLT  R11, R12, R13
  214  OR   R14, R15, R16
  218  SD   R17, 600(R0)
  21C  BEQZ R18, 5

[Figure: textbook-style timing diagram over clock periods 1 to 11. The fetch of the XOR is stalled in clock period 4, when the LD is in its MEM stage; a bubble is created and moves up the pipeline.]

Structural Hazards
• What if there was only one memory port?
  - We will avoid using the textbook notation of instruction execution since, even for a few instructions, a large space is needed to show the flow of execution
    - Rather, we will use our own notation shown below

                              IF   ID   EX   MEM   WB
  200  LD   R1, 500(R0)       1    2    3    4     5
  204  DADD R2, R3, R4        2    3    4    5     6
  208  DSUB R5, R6, R7        3    4    5    6     7
  20C  XOR  R8, R9, R10       5    6    7    8     9
  210  SLT  R11, R12, R13     6    7    8    9     10
  214  OR   R14, R15, R16     7    8    9    10    11
  218  SD   R17, 600(R0)      8    9    10   11
  21C  BEQZ R18, 5            9    10   11

• XOR is fetched in the 5th clock period, not in the 4th clock period
• XOR is delayed (stalled) in clock period 4 by the LD accessing the memory for its data
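The fetch times in this notation follow from one rule: the IF stage cannot use the single memory port in a clock period in which a Load or Store is in its MEM stage. The Python sketch below illustrates that rule; it is not from the slides, the function name `fetch_periods_single_port` is hypothetical, and it assumes no other stalls, so that an instruction reaches MEM three clock periods after its fetch.

```python
def fetch_periods_single_port(program):
    """Return (name, IF clock period) for each instruction, in program order.

    program: list of (name, uses_data_memory) tuples.
    """
    port_busy = set()    # clock periods in which a Load/Store holds the port in MEM
    fetches = []
    next_fetch = 1
    for name, uses_data_memory in program:
        fetch = next_fetch
        while fetch in port_busy:       # the data access has priority over the fetch
            fetch += 1
        if uses_data_memory:
            port_busy.add(fetch + 3)    # MEM comes three clock periods after IF
        fetches.append((name, fetch))
        next_fetch = fetch + 1
    return fetches


PROGRAM = [("LD", True), ("DADD", False), ("DSUB", False), ("XOR", False),
           ("SLT", False), ("OR", False), ("SD", True), ("BEQZ", False)]

for name, fetch in fetch_periods_single_port(PROGRAM):
    print(f"{name:5s} IF in clock period {fetch}")
```

The XOR is fetched in clock period 5 because the LD occupies the port in clock period 4; the SD in turn reserves clock period 11 for its own data access.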

Structural Hazards
• What if there was only one memory port?
  - The control unit stops the IF stage from accessing the memory to fetch the XOR
    - The reason is that we want to complete the execution of the LD that is already in the pipeline
    - Instructions in the pipeline have higher priority for completion
    - The SD instruction will access the memory in the 11th clock period to write data
    - There will not be an instruction fetch in the 11th clock period
    - Once a stall occurs, a bubble is introduced: not all the stages are busy
    - The execution time of the stalled instruction is increased ≡ its CPI_i is increased
    - CPIave is increased
    - CPUtime is increased

Structural Hazards
• What if there was only one memory port?
  - We will not have this structural hazard in our system
    - It is also clear from the Version 1 datapath diagram that we have two separate memory ports
      - Memory Port 1 for instruction fetches
      - Memory Port 2 for data accesses

Hazards
• Structural Hazards
  - Often, to solve structural hazards, more or faster hardware is needed
• However, the solution of the other two hazards, data and control hazards, requires
  - More hardware and
  - Better compilation techniques
    - To better order instructions
    - To reduce the number of control instructions
• The result is that
  - Pipeline bubbles are eliminated or reduced
  - The number of pipeline start-ups is also reduced

Hazards
• The overall hardware structure that detects a hazard and stops (stalls) an instruction or several instructions until the hazard condition no longer exists is called the pipeline interlock
  - Note that if an instruction is stalled, the instructions behind it are also stalled, as we will see shortly
  - Thus, it is costly to stall a single instruction in the pipeline

Data Hazards
• As mentioned before, all previous program examples had instructions independent of each other
  - The instructions did not have any register or memory location in common
    - For example, an instruction wrote to R9 and the next instruction did not read R9
    - The second instruction did not depend on the first instruction
    - There was no data dependency between them
    - There are other types of data dependencies, as we will see shortly
  - If two instructions have a data dependency between them and they are in the pipeline, there can be a data hazard
    - Let’s see the definition on the next slide

Data Hazards
• Data hazards occur between two instructions that are executed close enough in time and share writable data
• That is, there is a data dependency between the two instructions, and the correct result is obtained only if they are executed sequentially rather than in a pipelined fashion, which enforces the right order of access to the shared data
  - The second instruction cannot be executed in a pipelined fashion
    - It has to wait, stall!
    - This is then sequential execution

Data Hazards
• If we change the instruction sequence of the previous code to include dependencies, there will be data hazards

  200  LD   R1, 500(R0)
  204  DADD R2, R3, R4
  208  DSUB R5, R6, R2
  20C  XOR  R8, R9, R2
  210  SLT  R11, R12, R2
  214  OR   R14, R15, R2
  218  SD   R2, 600(R0)
  21C  BEQZ R2, 5

• We observe that the DADD writes to R2 and the instructions below the DADD read R2
  - R2 has data that is writable and shared by several instructions
    - The DADD and the remaining instructions are executed close in time
    - Can there be data hazards among them?

Data Hazards
• Let’s concentrate on the DADD and the instructions that follow it

  200  LD   R1, 500(R0)
  204  DADD R2, R3, R4
  208  DSUB R5, R6, R2        RAW ?
  20C  XOR  R8, R9, R2        RAW ?
  210  SLT  R11, R12, R2      RAW ?
  214  OR   R14, R15, R2      RAW ?
  218  SD   R2, 600(R0)       RAW ?
  21C  BEQZ R2, 5             RAW ?

• The data element in R2 is shared by all the instructions below the DADD, and they are executed close in time
  - An instruction, I1, writes to a register and another instruction, I2, reads the same register (the data element)
  - I1 has to write first and then I2 has to read: there is a Read after Write (RAW) dependency
  - BUT, if I2 reads before I1 writes, then there is a RAW hazard
    - Can I2 read before I1 writes? Yes
    - We have to stop I2 if it tries to read R2 before the DADD writes to R2

Data Hazards
• Let’s concentrate on the DADD and the instructions that follow it

  200  LD   R1, 500(R0)
  204  DADD R2, R3, R4
  208  DSUB R5, R6, R2        RAW
  20C  XOR  R8, R9, R2        RAW
  210  SLT  R11, R12, R2      RAW
  214  OR   R14, R15, R2
  218  SD   R2, 600(R0)
  21C  BEQZ R2, 5

• There are data dependencies, but are they all data hazards?
  - Will all the instructions below the DADD try to read R2 before the DADD writes? NO!
    - Soon we will see that data hazards will happen between the DADD and the DSUB, XOR and SLT
    - The DSUB, XOR and SLT will try to read R2 before the DADD writes to R2
    - The OR, SD and BEQZ will read R2 after the DADD writes to R2
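The split between dependencies and hazards can be checked mechanically. The Python sketch below is an illustration, not the course's method: the function name `raw_pairs` and the tuple layout of `PROGRAM` are hypothetical. It assumes the Version 1 timing described in these notes (operands are read in ID, the result is written in WB and becomes readable one clock period later, no forwarding) and an otherwise stall-free flow, under which a reader at most 3 instructions behind the writer would read too early.

```python
# (name, destination register or None, source registers), in program order
PROGRAM = [
    ("LD",   "R1",  ["R0"]),
    ("DADD", "R2",  ["R3", "R4"]),
    ("DSUB", "R5",  ["R6", "R2"]),
    ("XOR",  "R8",  ["R9", "R2"]),
    ("SLT",  "R11", ["R12", "R2"]),
    ("OR",   "R14", ["R15", "R2"]),
    ("SD",   None,  ["R2", "R0"]),
    ("BEQZ", None,  ["R2"]),
]


def raw_pairs(program, hazard_distance=3):
    """Return (writer, reader, register, is_hazard) for every RAW dependency."""
    pairs = []
    for i, (writer, dest, _) in enumerate(program):
        if dest is None:
            continue
        for j in range(i + 1, len(program)):
            reader, later_dest, sources = program[j]
            if dest in sources:
                # The reader is in ID before the value is available when it
                # follows the writer by at most `hazard_distance` instructions.
                pairs.append((writer, reader, dest, j - i <= hazard_distance))
            if later_dest == dest:
                break   # the register is overwritten; later readers depend on the new writer
    return pairs


for writer, reader, reg, hazard in raw_pairs(PROGRAM):
    print(f"{reader:5s} reads {reg} written by {writer}: "
          f"{'RAW hazard' if hazard else 'RAW dependency only'}")
```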

Data Hazards
• Let’s concentrate on the DADD and the instructions that follow it

  200  LD   R1, 500(R0)
  204  DADD R2, R3, R4
  208  DSUB R5, R6, R2        \
  20C  XOR  R8, R9, R2         |  All RAW
  210  SLT  R11, R12, R2      /
  214  OR   R14, R15, R2
  218  SD   R2, 600(R0)
  21C  BEQZ R2, 5

• The DSUB, XOR and SLT will try to read R2 before the DADD writes to R2
  - These data dependencies result in data hazards
  - This data hazard is one of three types of data hazards
    - An instruction, I1, writes to a register and another instruction, I2, reads the same register (the data element)
    - I1 has to write first and then I2 has to read: Read after Write (RAW)
    - If I2 reads before I1 writes, there is a RAW hazard
  - We will stall the DSUB, XOR and SLT when they try to read R2

Data Hazards
• Let’s concentrate on the DADD and the instructions that follow it

  200  LD   R1, 500(R0)
  204  DADD R2, R3, R4
  208  DSUB R5, R6, R2
  20C  XOR  R8, R9, R2
  210  SLT  R11, R12, R2
  214  OR   R14, R15, R2
  218  SD   R2, 600(R0)
  21C  BEQZ R2, 5

[Figure: textbook-style timing diagram over clock periods 1 to 10, all dependencies RAW. The DSUB is stalled for 3 clock periods, creating a 3-clock-period bubble that moves up the pipeline; the XOR is fetched and idles in the IF stage.]

• Why do we stall the DSUB in the ID stage?

Data Hazards
• Let’s concentrate on the DADD and the instructions that follow it
  - We stalled the DSUB in the ID stage since it reads its operands in ID, as we designed in Version 1
    - The DSUB reads its operands R2 and R6 in the ID stage
    - This is clock period 4

[Figure: the same textbook-style timing diagram over clock periods 1 to 10, with the DSUB stalled in ID for 3 clock periods and the instructions behind it stalled as well.]

  - When will the DADD write to R2?
    - In clock period 6!
  - When will R2 actually get the new value?
    - In the beginning of the 7th clock period!

Data Hazards
• Let’s concentrate on the DADD and the instructions that follow it
  - Why does R2 get its new value in the beginning of the 7th clock period?
    - According to the state diagram of Version 1, the DADD writes from 5.ALUoutput to its destination register in the WB stage
      - This is clock period 6
    - As we discussed before, we clock (store into) our registers at the end of a clock period and, therefore, registers change their values in the beginning of the next clock period

[Figure: timing waveform for clock periods 6 and 7. 5.ALUoutput holds the result of the DADD during clock period 6; R2 holds the result of the DADD from the beginning of clock period 7.]

Data Hazards
• Let’s concentrate on the DADD and the instructions that follow it
  - In summary, the DSUB is stalled in the ID stage for three clock periods
  - A 3-clock-period-long bubble is created and moves up the pipeline
    - If we show the pipeline in our notation (all dependencies are RAW)

                              IF    ID    EX   MEM   WB
  200  LD   R1, 500(R0)       1     2     3    4     5
  204  DADD R2, R3, R4        2     3     4    5     6
  208  DSUB R5, R6, R2        3     4/7   8    9     10
  20C  XOR  R8, R9, R2        4/7   8     9    10    11
  210  SLT  R11, R12, R2      8     9     10   11    12
  214  OR   R14, R15, R2      9     10    11   12    13
  218  SD   R2, 600(R0)       10    11    12   13
  21C  BEQZ R2, 5             11    12    13
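The stalled timing can be derived with a small scheduler. The Python sketch below is an illustration under stated assumptions, not the course's hardware description: the names `schedule_with_stalls`, `span` and `HAZARD_PROGRAM` are hypothetical. It assumes in-order (static) issue, operands read in ID, a result that becomes readable one clock period after its WB, and no forwarding. A stage entry such as "4/7" means the instruction occupies that stage from clock period 4 through 7.

```python
def span(first, last):
    """Format a stage occupancy: '4' for one clock period, '4/7' for a stall."""
    return str(first) if first == last else f"{first}/{last}"


def schedule_with_stalls(program):
    """Return (name, IF, ID, EX) rows for a list of (name, dest, sources)."""
    ready = {}                            # register -> first period ID can read the new value
    rows = []
    prev_id_start, prev_id_end = 1, 1     # chosen so the first instruction has IF = 1, ID = 2
    for name, dest, sources in program:
        if_start = prev_id_start          # enter IF when the previous instruction enters ID
        id_start = prev_id_end + 1        # enter ID when the previous instruction leaves ID
        if_end = id_start - 1
        # Stay in ID until every source register holds its new value.
        id_end = max([id_start] + [ready.get(r, 0) for r in sources])
        ex = id_end + 1
        if dest is not None:
            ready[dest] = (id_end + 3) + 1   # WB is three stages after ID; readable one period later
        rows.append((name, span(if_start, if_end), span(id_start, id_end), ex))
        prev_id_start, prev_id_end = id_start, id_end
    return rows


HAZARD_PROGRAM = [
    ("LD",   "R1",  ["R0"]),
    ("DADD", "R2",  ["R3", "R4"]),
    ("DSUB", "R5",  ["R6", "R2"]),
    ("XOR",  "R8",  ["R9", "R2"]),
    ("SLT",  "R11", ["R12", "R2"]),
    ("OR",   "R14", ["R15", "R2"]),
    ("SD",   None,  ["R2", "R0"]),
    ("BEQZ", None,  ["R2"]),
]

for name, if_text, id_text, ex in schedule_with_stalls(HAZARD_PROGRAM):
    print(f"{name:5s} IF {if_text:4s} ID {id_text:4s} EX {ex}")
```

Only the DSUB waits for R2 here; by the time the XOR and SLT reach ID, the DADD result is already in the register, so they inherit the delay without stalling on their own.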

Pipeline Interlocks
• What we are doing is checking for hazard situations in the ID stage and, when we recognize a hazard, stalling the instruction in the ID stage!
  - If an instruction does not have a hazard situation, it is allowed to proceed to the EX stage
    - That is, the instruction is issued to the EX stage
  - If the instruction has a hazard, it is stalled in the ID stage by the pipeline interlock to preserve the execution pattern

Pipeline Interlocks
• If an instruction is stalled in the ID stage, then the instruction in the IF stage is stalled
  - That is, the instruction behind the stalled instruction is not allowed to pass by and continue with its execution
  - This is called static issuing
    - Static issuing reduces hardware since we do not have to keep track of which instruction changed which part of the state
    - This is because, if an instruction is stalled, it has to update the state before all instructions that follow it

Pipeline Interlocks
• If dynamic issuing is allowed, then an instruction in the IF stage would pass by the stalled instruction in the ID stage and start its EX cycle
  - However, dynamic issuing allows other data hazards, WAR and WAW, to happen, as we will discuss later
  - We need hardware that does not allow an instruction behind a stalled instruction to update the state
    - Can we somehow allow this instruction to proceed?
    - Yes, we can allow it to generate its results
    - But we have to buffer the results and write them to the destination after the stalled instruction is finished, to keep the correct execution pattern
    - We then need additional hardware to keep temporary results and keep track of the instructions’ progress

Data Hazards
• Let’s concentrate on the DADD and the instructions that follow it
  - What if the DSUB does not have a RAW hazard but the XOR has?

  200  LD   R1, 500(R0)
  204  DADD R2, R3, R4
  208  DSUB R5, R6, R7
  20C  XOR  R8, R9, R2
  210  SLT  R11, R12, R2
  214  OR   R14, R15, R2
  218  SD   R2, 600(R0)
  21C  BEQZ R2, 5

[Figure: textbook-style timing diagram over clock periods 1 to 10, all dependencies RAW. We stall the XOR for 2 clock periods and create a 2-clock-period bubble that moves up the pipeline.]

Data Hazards
 Let’s concentrate on the DADD and the instructions that follow it
What if DSUB does not have a RAW hazard but XOR has ?
 The XOR is in ID in the 5th clock period but has to wait until the 7th clock period
 If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
200  LD    R1, 500(R0)      1      2      3      4      5
204  DADD  R2, R3, R4       2      3      4      5      6
208  DSUB  R5, R6, R7       3      4      5      6      7
20C  XOR   R8, R9, R2       4      5/7    8      9      10
210  SLT   R11, R12, R2     5/7    8      9      10     11
214  OR    R14, R15, R2     8      9      10     11     12
218  SD    R2, 600(R0)      9      10     11     12
21C  BEQZ  R2, 5            10     11     12

Data Hazards
 Let’s concentrate on the DADD and the instructions that follow it
What if DSUB and XOR do not have a RAW hazard but SLT has ?
We stall the SLT for 1 clock period and create a 1-clock period bubble that moves up the pipeline
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
200  LD    R1, 500(R0)      IF     ID     EX     MEM    WB
204  DADD  R2, R3, R4              IF     ID     EX     MEM    WB
208  DSUB  R5, R6, R7                     IF     ID     EX     MEM    WB
20C  XOR   R8, R9, R10                           IF     ID     EX     MEM    WB
210  SLT   R11, R12, R2                                 IF     ID     Stall  EX     MEM    WB
214  OR    R14, R15, R2                                        IF     Stall  ID     EX     MEM
218  SD    R2, 600(R0)                                                Stall  IF     ID     EX
21C  BEQZ  R2, 5                                                             Stall  IF     ID

Data Hazards
 Let’s concentrate on the DADD and the instructions that follow it
What if DSUB and XOR do not have a RAW hazard but SLT has ?
 If we show the pipeline in our notation
RAW
                            IF     ID     EX     MEM    WB
200  LD    R1, 500(R0)      1      2      3      4      5
204  DADD  R2, R3, R4       2      3      4      5      6
208  DSUB  R5, R6, R7       3      4      5      6      7
20C  XOR   R8, R9, R10      4      5      6      7      8
210  SLT   R11, R12, R2     5      6/7    8      9      10
214  OR    R14, R15, R2     6/7    8      9      10     11
218  SD    R2, 600(R0)      8      9      10     11
21C  BEQZ  R2, 5            9      10     11

Eliminating Hazards
 We will eliminate delays due to RAW hazards
Pipelined MIPS CPU Design : Version 1


We will write GPR registers in the WB stage in the first half
of the clock period and read GPR registers in the ID in the
second half of the same clock period
We will add new hardware to eliminate other delays
 We will reduce the amount of delay due to control
hazards

By assuming a certain compiler functionality we will eliminate
the control hazard delays completely
 However, this compiler functionality is not acceptable in real
life
 It breaks software compatibility, as we will see later in this
lecture
Pipelined MIPS CPU Design : Version 1

Data Hazards
 Writing to a GPR in the first half – reading
the same GPR register in the second half of
the same clock period

Consider the timing diagram of writing to R2 in the
6th clock period again
 What if we clock (store on) R2 in the middle of the 6th
clock period where there is a negative edge ?
 That is, what if we do not write at the end of the 6th
clock period, but the middle ?
 This is possible by using negative-edge triggered GPR
registers
 So, we write from 5.ALUoutput to R2 in the middle of
the clock period !

Data Hazards
Pipelined MIPS CPU Design : Version 1
 Writing to a GPR in the first half – reading the same
GPR register in the second half of the same clock
period

OK, we write in the first half, can we read the same register
in the second half ?
 Yes, reading means getting the value from R2 in the second
half and storing it on the destination register at the end of the
same clock period when there is a positive edge
 We read from GPR registers and store on temporary registers
3.A and 3.B in the ID stage
 In this specific example R2 is stored on 3.B for the DSUB
instruction

This will save one clock period for us
 From now on the GPR registers are clocked by
negative edges and the other registers are clocked
by positive edges

Data Hazards
 Writing to a GPR in the first half – reading the same GPR
register in the second half of the same clock period
Pipelined MIPS CPU Design : Version 1

Let’s visualize what happens in clock periods 5, 6 and 7
               Clock period 5    Clock period 6                                   Clock period 7
5.ALUoutput    ?                 Result of DADD                                   ?
R2             ?                 ? (first half), Result of DADD (second half)     Result of DADD
3.B            ?                 ?                                                Result of DADD
In the 6th clock period R2 has its new value and is transferred to 3.B
Therefore, the DSUB can be in EX in the 7th clock period to use 3.B
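The clocking argument above can be checked with a small sketch. It is only an illustration under simplifying assumptions: three registers are modelled (5.ALUoutput, R2 and 3.B), the stalled DSUB re-reads R2 in every clock period, and the function name `first_ex_cycle` is hypothetical. It compares negative-edge GPR registers with ordinary positive-edge ones.

```python
def first_ex_cycle(gpr_negative_edge):
    """Return the first clock period in which EX can use the DADD result from 3.B."""
    regs = {"5.ALUoutput": "old", "R2": "old", "3.B": "old"}
    for cycle in range(5, 10):
        # Middle of the clock period (negative edge): only the GPRs may be clocked here.
        if gpr_negative_edge:
            regs["R2"] = regs["5.ALUoutput"]
        # End of the clock period (positive edge): every update sees pre-edge values.
        nxt = dict(regs)
        if cycle == 5:
            nxt["5.ALUoutput"] = "DADD result"   # DADD leaves MEM at the end of period 5
        nxt["3.B"] = regs["R2"]                  # ID stage copies R2 into 3.B
        if not gpr_negative_edge:
            nxt["R2"] = regs["5.ALUoutput"]      # conventional write at the end of WB
        regs = nxt
        if regs["3.B"] == "DADD result":
            return cycle + 1                     # EX uses 3.B in the following period
    return None


if __name__ == "__main__":
    print("negative-edge GPRs:", first_ex_cycle(True))
    print("positive-edge GPRs:", first_ex_cycle(False))
```

With negative-edge GPR registers the sketch reports clock period 7 for the DSUB's EX stage, one period earlier than with positive-edge GPR registers, which is the saving the slide describes.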
Data Hazards
 Writing to a GPR in the first half – reading the same GPR register in the second half of the same clock period
Let’s see the new execution flow
We stall the DSUB for 2 clock periods and create a 2-clock period bubble that moves up the pipeline
We will draw short lines in the WB and ID stages to indicate that the RAW hazard
has been resolved by the write-in-first-half-read-in-the-second-half feature
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
200  LD    R1, 500(R0)      IF     ID     EX     MEM    WB
204  DADD  R2, R3, R4              IF     ID     EX     MEM    WB
208  DSUB  R5, R6, R2                     IF     ID     Stall  Stall  EX     MEM    WB
20C  XOR   R8, R9, R2                            IF     Stall  Stall  ID     EX     MEM    WB
210  SLT   R11, R12, R2                                 Stall  Stall  IF     ID     EX     MEM
214  OR    R14, R15, R2                                        Stall  Stall  IF     ID     EX
218  SD    R2, 600(R0)                                                Stall  Stall  IF     ID
21C  BEQZ  R2, 5                                                             Stall  Stall  IF

Data Hazards
 Writing to a GPR in the first half – reading the same GPR register in the second half of the same clock period
If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
200  LD    R1, 500(R0)      1      2      3      4      5
204  DADD  R2, R3, R4       2      3      4      5      6
208  DSUB  R5, R6, R2       3      4/6    7      8      9
20C  XOR   R8, R9, R2       4/6    7      8      9      10
210  SLT   R11, R12, R2     7      8      9      10     11
214  OR    R14, R15, R2     8      9      10     11     12
218  SD    R2, 600(R0)      9      10     11     12
21C  BEQZ  R2, 5            10     11     12
We will draw short lines in the WB and ID stages to indicate that the RAW hazard
has been resolved by the write-in-first-half-read-in-the-second-half feature

Data Hazards
 Writing to a GPR in the first half – reading the same GPR register in
the second half of the same clock period
Will this help if DSUB does not have a RAW hazard but XOR has ? Yes !
All RAW
                            IF     ID     EX     MEM    WB
200  LD    R1, 500(R0)      1      2      3      4      5
204  DADD  R2, R3, R4       2      3      4      5      6
208  DSUB  R5, R6, R7       3      4      5      6      7
20C  XOR   R8, R9, R2       4      5/6    7      8      9
210  SLT   R11, R12, R2     5/6    7      8      9      10
214  OR    R14, R15, R2     7      8      9      10     11
218  SD    R2, 600(R0)      8      9      10     11
21C  BEQZ  R2, 5            9      10     11
We saved one clock period !
Note that the GPR registers are always written in the middle of the
clock period ! We show the short lines when this feature helps a RAW
hazard !

Data Hazards
 Writing to a GPR in the first half – reading the same GPR register in
the second half of the same clock period
Will this help if DSUB and XOR do not have a RAW hazard but SLT has ?
 Yes !
RAW
                            IF     ID     EX     MEM    WB
200  LD    R1, 500(R0)      1      2      3      4      5
204  DADD  R2, R3, R4       2      3      4      5      6
208  DSUB  R5, R6, R7       3      4      5      6      7
20C  XOR   R8, R9, R10      4      5      6      7      8
210  SLT   R11, R12, R2     5      6      7      8      9
214  OR    R14, R15, R2     6      7      8      9      10
218  SD    R2, 600(R0)      7      8      9      10
21C  BEQZ  R2, 5            8      9      10
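The timing tables for the pipeline without forwarding can be reproduced with a short sketch. It is a simplified model, not the course's hardware description: instructions are reduced to a destination register and a tuple of source registers, every instruction is given MEM and WB numbers even if it does not use those stages, and the names `schedule` and `XOR_DEPENDS` are hypothetical. The flag selects between writing GPRs at the end of WB and the write-in-first-half-read-in-the-second-half register file.

```python
def schedule(program, split_cycle_regfile):
    """Clock periods (IF, ID, EX, MEM, WB) for in-order static issuing, no forwarding.

    program: list of (dest, sources); dest is None for stores and branches.
    IF and ID are (first, last) ranges, so a stall shows up as e.g. (5, 7).
    """
    last_wb = {}                       # register -> WB clock period of its latest writer
    prev_id_first, prev_id_last = 1, 1
    rows = []
    for dest, sources in program:
        if_first = prev_id_first       # fetch starts when the previous instruction enters ID
        id_first = max(if_first + 1, prev_id_last + 1)   # ID must be free (static issuing)
        ready = 0                      # earliest period in which ID can read all sources
        for reg in sources:
            if reg in last_wb:
                ready = max(ready, last_wb[reg] + (0 if split_cycle_regfile else 1))
        id_last = max(id_first, ready)
        ex = id_last + 1
        mem, wb = ex + 1, ex + 2
        if dest is not None:
            last_wb[dest] = wb
        rows.append(((if_first, id_first - 1), (id_first, id_last), ex, mem, wb))
        prev_id_first, prev_id_last = id_first, id_last
    return rows


# The code from the slides where only XOR, SLT, OR, SD and BEQZ read R2.
XOR_DEPENDS = [
    ("R1", ("R0",)),            # LD   R1, 500(R0)
    ("R2", ("R3", "R4")),       # DADD R2, R3, R4
    ("R5", ("R6", "R7")),       # DSUB R5, R6, R7
    ("R8", ("R9", "R2")),       # XOR  R8, R9, R2
    ("R11", ("R12", "R2")),     # SLT  R11, R12, R2
    ("R14", ("R15", "R2")),     # OR   R14, R15, R2
    (None, ("R2", "R0")),       # SD   R2, 600(R0)
    (None, ("R2",)),            # BEQZ R2, 5
]

if __name__ == "__main__":
    for split in (False, True):
        print("split-cycle register file:", split)
        for row in schedule(XOR_DEPENDS, split):
            print("  ", row)
```

In this model the XOR sits in ID for clock periods 5 to 7 without the split-cycle register file and for 5 to 6 with it, matching the 5/7 and 5/6 entries in the tables.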

Data Hazards
 How will we eliminate the remaining two stall cycles ?
We will use forwarding, also known as bypassing, to do that
 This means we have additional hardware to eliminate the stalls
 The additional hardware will be new wires; also, MUX2 and MUX3 of the datapath will be larger
To visualize how we can do this, let’s look at the Version 1 state diagram and
the datapath for the DADD instruction
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
200  LD    R1, 500(R0)      IF     ID     EX     MEM    WB
204  DADD  R2, R3, R4              IF     ID     EX     MEM    WB
208  DSUB  R5, R6, R2                     IF     ID     Stall  Stall  EX     MEM    WB
20C  XOR   R8, R9, R2                            IF     Stall  Stall  ID     EX     MEM    WB
210  SLT   R11, R12, R2                                 Stall  Stall  IF     ID     EX     MEM
214  OR    R14, R15, R2                                        Stall  Stall  IF     ID     EX
218  SD    R2, 600(R0)                                                Stall  Stall  IF     ID
21C  BEQZ  R2, 5                                                             Stall  Stall  IF
The new value of R2 is calculated in the EX stage in the 4th clock period for the DADD

Data Hazards
 Forwarding (Bypassing)
The new value of R2 is stored on 4.ALUoutput at the end of the 4th clock period
The new value of R2 is available for use in the MEM stage in the beginning of the 5th clock period
 Why don’t we forward the new value of 4.ALUoutput directly from the MEM stage to the EX stage in the 5th clock period ?
 At the same time, why don’t we allow the DSUB to read the old value of R2 into 3.B in the ID stage, so we do not stall it in the 4th clock period ?
 Then, when the DSUB enters EX in the 5th clock period, it uses the forwarded value from 4.ALUoutput and bypasses the value in 3.B
The arrow from MEM to EX indicates forwarding
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
200  LD    R1, 500(R0)      IF     ID     EX     MEM    WB
204  DADD  R2, R3, R2              IF     ID     EX     MEM    WB
208  DSUB  R5, R6, R2                     IF     ID     EX     MEM    WB
20C  XOR   R8, R9, R2                            IF     ID     Stall  EX     MEM    WB
210  SLT   R11, R12, R2                                 IF     Stall  ID     EX     MEM    WB
214  OR    R14, R15, R2                                        Stall  IF     ID     EX     MEM
218  SD    R2, 600(R0)                                                Stall  IF     ID     EX
21C  BEQZ  R2, 5                                                             Stall  IF     ID

Data Hazards
 Forwarding (Bypassing)
What we are doing is that instead of waiting to get the new value of R2 that
goes (i) from the ALU to 4.ALUoutput, then (ii) to 5.ALUoutput and then
finally (iii) to R2, we forward the new value of R2 directly to the EX stage,
to the input of the ALU, bypassing the value in 3.B that has the old R2 value
 MUX3 is larger now
[Figure: datapath across the ID, EX and MEM stages, with labels 3.B, 3.Imm, MUX2, MUX3, ADD and 4.ALUoutput; 4.ALUoutput is forwarded back to MUX3]

Data Hazards
 Forwarding (Bypassing)
If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
200  LD    R1, 500(R0)      1      2      3      4      5
204  DADD  R2, R3, R4       2      3      4      5      6
208  DSUB  R5, R6, R2       3      4      5      6      7
20C  XOR   R8, R9, R2       4      5/6    7      8      9
210  SLT   R11, R12, R2     5/6    7      8      9      10
214  OR    R14, R15, R2     7      8      9      10     11
218  SD    R2, 600(R0)      8      9      10     11
21C  BEQZ  R2, 5            9      10     11
The arrow from MEM to EX indicates forwarding
Data Hazards
 Forwarding (Bypassing)
What can we do to eliminate the stall for the XOR ?
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
200  LD    R1, 500(R0)      IF     ID     EX     MEM    WB
204  DADD  R2, R3, R2              IF     ID     EX     MEM    WB
208  DSUB  R5, R6, R2                     IF     ID     EX     MEM    WB
20C  XOR   R8, R9, R2                            IF     ID     Stall  EX     MEM    WB
210  SLT   R11, R12, R2                                 IF     Stall  ID     EX     MEM    WB
214  OR    R14, R15, R2                                        Stall  IF     ID     EX     MEM
218  SD    R2, 600(R0)                                                Stall  IF     ID     EX
21C  BEQZ  R2, 5                                                             Stall  IF     ID
 To eliminate the stall for the XOR we will employ forwarding from the WB stage
to the EX stage (as you will see on the next slide) !
 Because we see that if we allow the XOR to read the old value of R2 in clock
period 5, it can get the new value of R2 in the beginning of the 6th clock period
 In the 6th clock period, the new value of R2 is with the DADD in the WB stage on
register 5.ALUoutput
 We then forward the value from 5.ALUoutput to MUX3, bypassing 3.B

Data Hazards
 Forwarding (Bypassing)
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
200  LD    R1, 500(R0)      IF     ID     EX     MEM    WB
204  DADD  R2, R3, R4              IF     ID     EX     MEM    WB
208  DSUB  R5, R6, R2                     IF     ID     EX     MEM    WB
20C  XOR   R8, R9, R2                            IF     ID     EX     MEM    WB
210  SLT   R11, R12, R2                                 IF     ID     EX     MEM    WB
214  OR    R14, R15, R2                                        IF     ID     EX     MEM    WB
218  SD    R2, 600(R0)                                                IF     ID     EX     MEM
21C  BEQZ  R2, 5                                                             IF     ID     EX
Now, there is no stall !
Note the short lines in clock period 6 that indicate that write-in-first-half-read-in-the-second-half
helps eliminate the stall between the DADD and the SLT

Data Hazards
 Forwarding (Bypassing)
If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
200  LD    R1, 500(R0)      1      2      3      4      5
204  DADD  R2, R3, R4       2      3      4      5      6
208  DSUB  R5, R6, R2       3      4      5      6      7
20C  XOR   R8, R9, R2       4      5      6      7      8
210  SLT   R11, R12, R2     5      6      7      8      9
214  OR    R14, R15, R2     6      7      8      9      10
218  SD    R2, 600(R0)      7      8      9      10
21C  BEQZ  R2, 5            8      9      10
There is no stall !
Note the short lines in clock period 6 that indicate that write-in-first-half-read-in-the-second-half
helps eliminate the stall between the DADD and the SLT

Data Hazards
 Forwarding (Bypassing)
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
200  LD    R1, 500(R0)      IF     ID     EX     MEM    WB
204  DADD  R2, R3, R4              IF     ID     EX     MEM    WB
208  DSUB  R5, R6, R2                     IF     ID     EX     MEM    WB
20C  XOR   R8, R9, R2                            IF     ID     EX     MEM    WB
210  SLT   R11, R12, R2                                 IF     ID     EX     MEM    WB
214  OR    R14, R15, R2                                        IF     ID     EX     MEM    WB
218  SD    R2, 600(R0)                                                IF     ID     EX     MEM
21C  BEQZ  R2, 5                                                             IF     ID     EX
Till now we considered this code where for the DSUB, XOR and SLT,
R2 is the second operand register, i.e. register Rt in the R-format
What if R2 is the first operand register, register Rs, in the R-format ?

Data Hazards
 Forwarding (Bypassing)
What if the code is that R2 is Rs for the DSUB, XOR and SLT ?
In this case we forward from 4.ALUoutput and 5.ALUoutput to MUX2, bypassing 3.A
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
200  LD    R1, 500(R0)      IF     ID     EX     MEM    WB
204  DADD  R2, R3, R4              IF     ID     EX     MEM    WB
208  DSUB  R5, R2, R7                     IF     ID     EX     MEM    WB
20C  XOR   R8, R2, R10                           IF     ID     EX     MEM    WB
210  SLT   R11, R2, R12                                 IF     ID     EX     MEM    WB
214  OR    R14, R2, R15                                        IF     ID     EX     MEM    WB
218  SD    R2, 600(R0)                                                IF     ID     EX     MEM
21C  BEQZ  R2, 5                                                             IF     ID     EX
Only the DSUB, XOR and SLT instructions will have the
RAW hazard and the stall cycles will be eliminated by
forwarding to MUX2

Data Hazards
 Forwarding (Bypassing)
What if the code is that R2 is Rs for the DSUB, XOR and SLT ?
If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
200  LD    R1, 500(R0)      1      2      3      4      5
204  DADD  R2, R3, R4       2      3      4      5      6
208  DSUB  R5, R2, R7       3      4      5      6      7
20C  XOR   R8, R2, R10      4      5      6      7      8
210  SLT   R11, R2, R12     5      6      7      8      9
214  OR    R14, R2, R15     6      7      8      9      10
218  SD    R2, 600(R0)      7      8      9      10
21C  BEQZ  R2, 5            8      9      10

Data Hazards
 MIPS forwarding (Bypassing) for the general case
Pipelined MIPS CPU Design : Version 1

With forwarding (bypassing), results that have not yet
reached the destination GPR can be forwarded to the inputs
of
 Functional units in the ALU
 Memory port 2
 The zero detection unit

Bypassing the inputs that are shown in the Version 1 state
diagram and datapath
 Remember that we forward a value when it is needed

Two exceptions are Store and Branch instructions, since they
complete not in 5 but in 4 and 2 (as we will soon see),
respectively !
Pipelined MIPS CPU Design : Version 1

Data Hazards
 What forwarding does is that functional units
in the ALU, memory port 2 and the zero
detection unit bypass registers that originally
supply values

If they cannot get the new value of a register on
time, the new values are forwarded from
 4.ALUoutput
 5.ALUoutput
 5.LMD

To the inputs of
 Functional units in the ALU
 Memory port 2
 The zero detection unit
Data Hazards

 Forwarding (Bypassing)
We show the changes to the inputs of the ALU below
[Figure: ALU input datapath with labels 3.A, 3.NPC, MUX2, 3.B, 3.Imm, MUX3 and ALU in the EX stage, 4.ALUoutput in the MEM stage, and 5.ALUoutput and 5.LMD in the WB stage]

Data Hazards
 Forwarding (Bypassing)

We show the changes to the inputs of Memory Port 2 below
[Figure: Memory Port 2 inputs in the MEM stage with labels 4.ALUoutput, 4.B, MUX5, AB2, DB2 and DB3, and 5.ALUoutput and 5.LMD in the WB stage]

Data Hazards
 Forwarding (Bypassing)
We show the exception case for Store instructions where the value to be
written to a memory location has to be passed to a Store in the EX stage
even though it is not needed in EX, but in MEM
 We have to have a new MUX in EX that will move data to 4.B either from 3.B or
from 5.ALUout or 5.LMD
[Figure: EX-stage datapath with labels 3.A, 3.NPC, 3.B, 3.Imm, MUX6, 4.ALUoutput and 4.B, and 5.ALUoutput and 5.LMD in the WB stage]
Data Hazards
 Forwarding (Bypassing)
We show the changes to the input of the Zero detection circuit below
[Figure: MUX7 selects among 3.A and the values forwarded from 4.ALUoutput, 5.ALUoutput and 5.LMD as the input of the Zero ? circuit, which drives 4.Cond]
Soon, when we cover control hazards we will
see that this circuit is moved to the ID stage

Data Hazards
 Forwarding (Bypassing)
Pipelined MIPS CPU Design : Version 1

In summary, we have the following changes to the MIPS
datapath for forwarding purposes
 Three new multiplexers, MUX5, MUX6 and MUX7
 MUX2 and MUX3 are larger
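The selection that the enlarged MUX2 and MUX3 perform can be sketched as a priority rule. This is an illustrative model only: the class `StageInstr`, the function `alu_operand_source` and the "STALL" return value are hypothetical names, and the real control logic works on instruction fields held in the pipeline registers. The sketch assumes that the instruction currently in MEM is younger than the one in WB, so its result takes priority, and that R0 is never forwarded because it always reads as zero.

```python
from typing import NamedTuple, Optional


class StageInstr(NamedTuple):
    dest: Optional[str]   # destination GPR, or None for stores and branches
    is_load: bool         # True if the result comes from memory (5.LMD)


def alu_operand_source(src, default_latch, in_mem, in_wb):
    """Pick where an ALU operand comes from for the instruction now in EX.

    src           : source register name of the instruction in EX
    default_latch : "3.A" for Rs or "3.B" for Rt
    in_mem, in_wb : StageInstr (or None) currently in the MEM and WB stages
    """
    if src != "R0":
        if in_mem is not None and in_mem.dest == src:
            # A Load in MEM has not produced its data yet: the interlock must
            # already have stalled the consumer, so this case signals a stall.
            return "STALL" if in_mem.is_load else "4.ALUoutput"
        if in_wb is not None and in_wb.dest == src:
            return "5.LMD" if in_wb.is_load else "5.ALUoutput"
    return default_latch


if __name__ == "__main__":
    dadd = StageInstr("R2", False)
    dsub = StageInstr("R5", False)
    # DSUB R5, R6, R2 in EX while the DADD is in MEM
    print(alu_operand_source("R2", "3.B", dadd, None))
    # XOR R8, R9, R2 in EX while the DSUB is in MEM and the DADD is in WB
    print(alu_operand_source("R2", "3.B", dsub, dadd))
```

The first call models the MEM to EX forwarding for the DSUB and the second the WB to EX forwarding for the XOR.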

Data Hazards
 As we said before, there are three types of data
hazards
Pipelined MIPS CPU Design : Version 1

Read after write, RAW
 Instruction 1 has to write and then Instruction 2 has to read :
I1W - I2R
 We studied it on previous slides
 We need to prevent I2R - I1W
 So, we stall I2 unless we can forward the value
 We can use forwarding and write-in-the-first-half-read-in-the-second-half to avoid the stall in all cases except one that
involves Load instructions as described below

Data Hazards
 There are two other types of data hazards
Pipelined MIPS CPU Design : Version 1


Write after read, WAR
 Instruction 1 has to read and then Instruction 2 has to write : I1R - I2W
 We need to prevent I2W - I1R
 So, we need to stall I2
This hazard cannot occur on MIPS since all reads are early and all writes
are late
This will happen when some instructions write early and some others read late
An example is an instruction that uses the autoincrement addressing mode :
ADD R1, (R2)+
This instruction does the following : R1 ← R1 + M[R2] then R2 ← R2 + 8
Often the CPU writes the new value of R2 in the MEM stage, not in the WB
stage, provided that there is a separate integer ADDer
 So, we write to R2 early, perhaps before a previous instruction can read it
 This instruction is a typical CISC instruction
 The example shows how the architecture complexity affects the hardware
design, in this case pipelining !

Data Hazards
 There are two other types of data hazards
Pipelined MIPS CPU Design : Version 1


Write after write, WAW
 Instruction 1 has to write and then Instruction 2 has to write : I1W - I2W
 We need to prevent I2W - I1W
 So, we need to stall I2 to prevent a wrong value on the destination
This hazard cannot occur on MIPS since all GPR writes happen in one stage, WB,
in program order
 This will happen if more than one stage can write
 Allowing writes in different stages can result in two writes to a GPR in the same
clock period
 The previous example can cause a WAW hazard
 ADD R1, (R2)+
 R1 ← R1 + M[R2] then R2 ← R2 + 8
 The CPU writes the new value of R2 in the MEM stage, not in the WB stage
 So, we write to R2 early, perhaps when a previous instruction is also writing to R2
at the same time
 This instruction is a typical CISC instruction
 The example shows how the architecture complexity affects the hardware
design, in this case pipelining !

Data Hazards
 There are two other types of data hazards
Pipelined MIPS CPU Design : Version 1

Write after write, WAW
 The WAW hazard will also happen when an instruction is
allowed to proceed even though the instruction in front of it is
stalled
 For example, with dynamic issuing, an instruction passes by a
stalled instruction, so it can write to a register that perhaps
the stalled instruction will write soon !
 This is a topic to deal with in later versions of the MIPS CPU !
 The fourth hazard ?

Read after Read, RAR
 This is not a hazard since no value is changed by the two
readings
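The four read/write orderings can be told apart mechanically from the register names of two instructions. The sketch below is a simplified illustration: each instruction is reduced to one destination and a tuple of sources, and the function name `dependences` is hypothetical. It only classifies the dependences; whether a dependence turns into a hazard depends on the pipeline, and on this MIPS pipeline only RAW does.

```python
def dependences(first, second):
    """Classify register dependences between two instructions in program order.

    Each instruction is (dest, sources); dest may be None.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 is not None and dest1 in srcs2:
        found.add("RAW")          # second reads what first writes
    if dest2 is not None and dest2 in srcs1:
        found.add("WAR")          # second writes what first reads
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")          # both write the same register
    if set(srcs1) & set(srcs2):
        found.add("RAR")          # both read the same register: not a hazard
    return found


if __name__ == "__main__":
    dadd = ("R2", ("R3", "R4"))   # DADD R2, R3, R4
    dsub = ("R5", ("R6", "R2"))   # DSUB R5, R6, R2
    print(dependences(dadd, dsub))
```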

Data Hazards
Pipelined MIPS CPU Design : Version 1
 Let’s consider our piece of mnemonic machine language program again
where there is now a dependency between the LD and the instructions
that follow it
200  LD    R1, 500(R0)
204  DADD  R2, R3, R1
208  DSUB  R5, R6, R1
20C  XOR   R8, R9, R1
210  SLT   R11, R12, R1
214  OR    R14, R15, R1
218  SD    R1, 600(R0)
21C  BEQZ  R1, 5
 We observe that the LD writes to R1 and the instructions below
LD read R1

The LD and the remaining instructions are executed close in time
 Can there be data hazards among them ?

Data Hazards
Pipelined MIPS CPU Design : Version 1
200  LD    R1, 500(R0)
204  DADD  R2, R3, R1       RAW ?
208  DSUB  R5, R6, R1       RAW ?
20C  XOR   R8, R9, R1       RAW ?
210  SLT   R11, R12, R1     RAW ?
214  OR    R14, R15, R1     RAW ?
218  SD    R1, 600(R0)      RAW ?
21C  BEQZ  R1, 5            RAW ?
 The data element in R1 is shared by all the instructions below the LD
and they are executed close in time
Yes, there are data dependencies, but are they all data hazards ?
Will all the instructions below the LD try to read R1 before the LD writes ?
Data hazards will happen between the LD and DADD, DSUB and XOR
DADD, DSUB and XOR will try to read R1 before the LD writes to R1
This data hazard is the RAW hazard
We might have to stall DADD, DSUB and XOR when they try to read R1 ????
The SLT, OR, SD and BEQZ will read R1 after the LD writes to R1
 They do not have any hazard situation !!!

Data Hazards
Pipelined MIPS CPU Design : Version 1
 Do we have to stall DADD, DSUB and XOR when they
try to read R1 ?
All RAW
200  LD    R1, 500(R0)
204  DADD  R2, R3, R1
208  DSUB  R5, R6, R1
20C  XOR   R8, R9, R1
210  SLT   R11, R12, R1
214  OR    R14, R15, R1
218  SD    R1, 600(R0)
21C  BEQZ  R1, 5
 If yes, can we eliminate any possible stall by using
forwarding ?
Yes, we can eliminate the data hazard stalls between the LD
and DSUB and XOR !
 But, we cannot eliminate a stall cycle between the LD
and DADD with forwarding and write-in-the-first-half-read-in-the-second-half

Data Hazards
 Why is it that we cannot eliminate the stall cycle
between the LD and DADD ?
RAW
Clock period                1      2      3      4      5      6      7
200  LD    R1, 500(R0)      IF     ID     EX     MEM    WB
204  DADD  R2, R3, R1              IF     ID     Stall  EX     MEM    WB
According to our state diagram, the LD reads the data from
the memory in the MEM stage
 This is clock period 4
 The data will come from the memory at the end of the 4th
clock period since the memory takes one clock period to access
 But, the DADD needs that data from the memory in the
beginning of the 4th clock period
 We need to stall the DADD and forward the data from WB to
EX in the 5th clock period

Data Hazards
 Why is it that we cannot eliminate the stall cycle
between the LD and DADD ?
All RAW
Clock period                1      2      3      4      5      6      7      8      9      10
200  LD    R1, 500(R0)      IF     ID     EX     MEM    WB
204  DADD  R4, R3, R1              IF     ID     Stall  EX     MEM    WB
208  DSUB  R5, R6, R1                     IF     Stall  ID     EX     MEM    WB
20C  XOR   R8, R9, R1                                   IF     ID     EX     MEM    WB
210  SLT   R11, R12, R1                                        IF     ID     EX     MEM    WB
214  OR    R14, R15, R1                                               IF     ID     EX     MEM
218  SD    R1, 600(R0)                                                       IF     ID     EX
21C  BEQZ  R1, 5                                                                    IF     ID

Data Hazards
 Why is it that we cannot eliminate the stall cycle
between the LD and DADD ?
We see that the DADD is stalled to wait for the LD to read
the memory
 Where is the DADD stalled ?
 In the ID stage ? YES

Data Hazards
 Why is it that we cannot eliminate the stall cycle
between the LD and DADD ?
As mentioned before, we check for hazard situations
in the ID stage, and when we recognize a hazard, we stall the
instruction in the ID stage !
We have static issuing
 We stall the DADD due to its RAW hazard
 We stall the DSUB, XOR and the others behind the DADD for
correct execution pattern

Data Hazards
 Why is it that we cannot eliminate the stall cycle
between the LD and DADD ?
 If we show the pipeline in our notation
All RAW
                            IF     ID     EX     MEM    WB
200  LD    R1, 500(R0)      1      2      3      4      5
204  DADD  R2, R3, R1       2      3/4    5      6      7
208  DSUB  R5, R6, R1       3/4    5      6      7      8
20C  XOR   R8, R9, R1       5      6      7      8      9
210  SLT   R11, R12, R1     6      7      8      9      10
214  OR    R14, R15, R1     7      8      9      10     11
218  SD    R1, 600(R0)      8      9      10     11
21C  BEQZ  R1, 5            9      10     11
 Note the short lines in clock period 5 that indicate that
write-in-first-half-read-in-the-second-half helps
eliminate the stall between the LD and the DSUB
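The table above can be reproduced with a small timing sketch of the pipeline with forwarding, where the only remaining stall is the Load interlock. It is a simplified model under stated assumptions: an ALU result can be used by a consumer's EX stage one clock period after the producer's EX, a loaded value two clock periods after the Load's EX, every source is treated as needed in EX, and all instructions are given MEM and WB numbers even if they skip those stages. The names `schedule_with_forwarding` and `LD_DEPENDS` are hypothetical.

```python
def schedule_with_forwarding(program):
    """Clock periods (IF, ID, EX, MEM, WB) with forwarding and a Load interlock.

    program: list of (is_load, dest, sources); dest is None for stores and branches.
    IF and ID are (first, last) ranges, so a stall shows up as e.g. (3, 4).
    """
    ready_for_ex = {}                  # register -> first period a consumer may be in EX
    prev_id_first, prev_id_last = 1, 1
    rows = []
    for is_load, dest, sources in program:
        if_first = prev_id_first
        id_first = max(if_first + 1, prev_id_last + 1)    # static issuing: ID must be free
        ex = max([id_first + 1] + [ready_for_ex.get(reg, 0) for reg in sources])
        id_last = ex - 1               # the instruction waits in ID until it can enter EX
        if dest is not None:
            # ALU results are forwarded from MEM; loaded data only from WB.
            ready_for_ex[dest] = ex + (2 if is_load else 1)
        rows.append(((if_first, id_first - 1), (id_first, id_last), ex, ex + 1, ex + 2))
        prev_id_first, prev_id_last = id_first, id_last
    return rows


# The code from the slides where every instruction after the LD reads R1.
LD_DEPENDS = [
    (True, "R1", ("R0",)),             # LD   R1, 500(R0)
    (False, "R2", ("R3", "R1")),       # DADD R2, R3, R1
    (False, "R5", ("R6", "R1")),       # DSUB R5, R6, R1
    (False, "R8", ("R9", "R1")),       # XOR  R8, R9, R1
    (False, "R11", ("R12", "R1")),     # SLT  R11, R12, R1
    (False, "R14", ("R15", "R1")),     # OR   R14, R15, R1
    (False, None, ("R1", "R0")),       # SD   R1, 600(R0)
    (False, None, ("R1",)),            # BEQZ R1, 5
]

if __name__ == "__main__":
    for row in schedule_with_forwarding(LD_DEPENDS):
        print(row)
```

In this model the DADD occupies ID in clock periods 3 and 4 and every later instruction is delayed by that single stall cycle.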

Data Hazards
 Why is it that we cannot eliminate the stall cycle between the LD
and DADD ?
The stall can be avoided (the interlock for the LD situation can be
eliminated) if an independent instruction, one that does not need R1,
is placed between the LD and the DADD
For the first time we have an example of the importance of
ordering instructions carefully
If we had a compiler that guaranteed to find an independent
instruction that does not depend on the LD, we would never have
the Load interlock !
 This is what we call the compiler scheduling an independent instruction
 The instruction position following the LD is called load delay slot and
the compiler fills the delay slot with an independent instruction
 This is called delayed Load
 If the compiler cannot find an independent instruction, it inserts a
NOP in the delayed Load slot

Data Hazards
 Why is it that we cannot eliminate the stall
cycle between the LD and DADD ?
If the compiler changes the order of instructions
to avoid stalls, to fill delay slots, then it is called
pipeline scheduling or instruction scheduling
 We will have more examples of how the compiler
arranges the code for better pipeline efficiency
throughout the semester
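Filling a load delay slot can be sketched as a small scheduling pass. This is a deliberately conservative illustration, not a real compiler algorithm: instructions are (opcode, destination, sources) tuples, only plain ALU instructions are moved, the search stops at a branch, and the names `find_filler` and `fill_load_delay_slots` are hypothetical. A candidate is hoisted into the slot only if it has no RAW, WAR or WAW dependence on the instructions it jumps over; otherwise a NOP is inserted.

```python
MOVABLE = {"DADD", "DSUB", "XOR", "SLT", "OR", "AND"}   # assumption: ALU ops only
BRANCHES = {"BEQZ", "BNEZ"}
NOP = ("NOP", None, ())


def find_filler(code, load_index):
    """Return the index of an instruction that may move into the load delay slot."""
    load_dest = code[load_index][1]
    skipped_dests, skipped_srcs = set(), set()
    for k in range(load_index + 1, len(code)):
        op, dest, srcs = code[k]
        if op in BRANCHES:
            return None                                   # never move code across a branch
        if k > load_index + 1 and op in MOVABLE:
            independent = (
                load_dest not in srcs                     # must not need the loaded value
                and not (set(srcs) & skipped_dests)       # no RAW with a skipped instruction
                and dest not in skipped_srcs              # no WAR
                and dest not in skipped_dests             # no WAW
            )
            if independent:
                return k
        if dest is not None:
            skipped_dests.add(dest)
        skipped_srcs.update(srcs)
    return None


def fill_load_delay_slots(program):
    """Return a copy of the program in which no Load is followed by a dependent."""
    code = list(program)
    i = 0
    while i < len(code) - 1:
        op, dest, _ = code[i]
        if op == "LD" and dest in code[i + 1][2]:
            k = find_filler(code, i)
            code.insert(i + 1, NOP if k is None else code.pop(k))
        i += 1
    return code


if __name__ == "__main__":
    program = [
        ("LD", "R1", ("R0",)),
        ("DADD", "R2", ("R3", "R1")),
        ("DSUB", "R5", ("R6", "R7")),
    ]
    for instr in fill_load_delay_slots(program):
        print(instr)
```

In the usage example the independent DSUB moves between the LD and the DADD; if the DSUB read R2 instead, the pass would fall back to a NOP.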

Data Hazards
 Delayed Loads are not practical and not used !
Pipelined MIPS CPU Design : Version 1

If delayed Loads were used, the Load interlock in hardware
is removed since it is guaranteed a Load is not followed by a
depending instruction
 Removing the interlock is guaranteed to work only if the CPU
runs new code compiled for the delayed Load CPU
 But, there is a lot of software compiled years ago and the
compilers did not take into account this delayed Load feature
 The old code has a lot of LD instructions followed by depending
instructions
 If we run them on a CPU with delayed Loads (no Load interlock)
the depending instructions will get wrong data and programs will
generate wrong results
 This is the legacy software situation !

Our MIPS CPU will not have delayed Loads !

Control Hazards
 Control hazards occur when a control instruction is executed
Control instructions are jump, jump to
procedure, branch and return from procedure
 Except the branch instruction, all control
instructions change the order of execution

The branch may or may not change the order of
execution depending on the condition test
 If the order of the execution is changed, the
pipeline is emptied
That is, there is a pipeline start-up
 This results in a performance loss worse than the
data hazard performance loss


Control Hazards
 Conditional branches are especially troublesome

Pipelined MIPS CPU Design : Version 1

The order of execution may be changed or may not be
changed
So, we do not know which instruction to fetch next
 Which one to fetch depends on the test : equal to zero or not
equal to zero ?
 Note that besides comparing with zero, we also have to
compute the possible branch address, the effective address,
the address of the target instruction
 If these two are not performed early, there is a large control
hazard penalty of three clock periods.


If the branch instruction does not change the order of
execution, i.e. we continue with the instruction following the
branch, we say the branch is not taken
If the branch instruction changes the order of execution,
i.e. we continue with the instruction pointed to by the effective
address, we say the branch is taken

Control Hazards
 If we recall what we did earlier
Pipelined MIPS CPU Design : Version 1

Branch instructions go through stages IF, ID and
EX
 They actually complete the execution back in stage IF
 Therefore, CPIBranch = 4

Control Hazards
 Let’s take a look at the code studied earlier
Assuming that we take the branch !
Clock period                1      2      3      4      5      6      7      8      9
600  BEQZ  R8, 4            IF     ID     EX
604  DADD  R9, R19, R11            Stall  Stall  Stall
608  DSUB  R12, R13, R14
60C  XOR   R15, R16, R17
610  SLT   R18, R19, R20
614  AND   R21, R22, R23                                IF     ID     EX     MEM    WB
A pipeline bubble is generated
The Branch causes a pipeline start-up !

Control Hazards
 If we show the pipeline in our notation
Assuming that we take the branch !
                            IF     ID     EX     MEM    WB
600  BEQZ  R8, 4            1      2      3
604  DADD  R9, R19, R11
608  DSUB  R12, R13, R14
60C  XOR   R15, R16, R17
610  SLT   R18, R19, R20
614  AND   R21, R22, R23    5      6      7      8      9
 We see that we have three stall cycles if the branch is
taken

Control Hazards
 Let’s take a look at the code studied earlier
Assuming that we do not take the branch !
Clock period                1      2      3      4      5      6      7      8      9
600  BEQZ  R8, 4            IF     ID     EX
604  DADD  R9, R19, R11            Stall  Stall  Stall  IF     ID     EX     MEM    WB
608  DSUB  R12, R13, R14                                       IF     ID     EX     MEM
60C  XOR   R15, R16, R17                                              IF     ID     EX
610  SLT   R18, R19, R20                                                     IF     ID
614  AND   R21, R22, R23                                                            IF
A pipeline bubble is generated
The Branch causes a pipeline start-up !

Control Hazards
 If we show the pipeline in our notation
Assuming that we do not take the branch !
                            IF     ID     EX     MEM    WB
600  BEQZ  R8, 4            1      2      3
604  DADD  R9, R19, R11     5      6      7      8      9
608  DSUB  R12, R13, R14    6      7      8      9      10
60C  XOR   R15, R16, R17    7      8      9      10     11
610  SLT   R18, R19, R20    8      9      10     11     12
614  AND   R21, R22, R23    9      10     11     12     13
 Are we fetching the DADD in the 5th clock period ?
 If yes, why ?

Control Hazards
 Assuming that we do not take the branch !

Why are we fetching the DADD in the 5th clock
period ?
 Can we fetch the DADD in the 2nd clock period ?
 The answer is yes, if the control unit allows the
completion of the fetch cycle of the DADD in the 2nd
clock period
 Then, the DADD stays in the 2.IR register until the end
of the 4th clock period and then moves to the ID stage, as
will be shown soon

Control Hazards
 Assuming that we do not take the branch !

But, if the control unit stops fetching of the
DADD in the 2nd clock period to save itself from a
memory access that might be unnecessary if the
branch is taken, then the DADD must be fetched
in the 5th clock period
 Why would the control unit stop fetching the DADD in
the 2nd clock period ?
 We are asking this question because we know that
decoding an instruction is very quick : just checking the
Opcode bits is enough for many instructions
 Thus, the control unit would know right in the beginning
of the 2nd clock period that there is a Branch in the ID
stage, and we can get the DADD by the end of the 2nd
clock period !

Control Hazards
 Assuming that we do not take the branch !

If the CPU designer decides to continue with the fetching of the
DADD in the 2nd clock period

Clock period              1     2     3     4     5     6     7     8     9
600 BEQZ  R8, 4           IF    ID    EX
604 DADD  R9, R19, R11          IF    Stall Stall ID    EX    MEM   WB
608 DSUB  R12, R13, R14                           IF    ID    EX    MEM   WB
60C XOR   R15, R16, R17                                 IF    ID    EX    MEM
610 SLT   R18, R19, R20                                       IF    ID    EX
614 AND   R21, R22, R23                                             IF    ID

Control Hazards
 Assuming that we do not take the branch !

If the CPU designer decides to continue with the
fetching of the DADD in the 2nd clock period
 If we show the pipeline in our notation

                          IF    ID   EX   MEM  WB
600 BEQZ  R8, 4            1     2    3
604 DADD  R9, R19, R11    2/4    5    6     7   8
608 DSUB  R12, R13, R14    5     6    7     8   9
60C XOR   R15, R16, R17    6     7    8     9  10
610 SLT   R18, R19, R20    7     8    9    10  11
614 AND   R21, R22, R23    8     9   10    11  12

 We save one clock period !
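
The difference between the two fetch policies can be reproduced with a short Python sketch. The function name and its arguments are hypothetical; the clock periods are simply the ones read off the two not-taken tables (fetch aborted versus fetch continued), not derived from a hardware model.

```python
def not_taken_schedule(n_after: int, continue_fetch: bool):
    """Return {k: (IF, ID, EX, MEM, WB)} for the n_after instructions
    that follow an untaken branch fetched in clock period 1.

    Timing taken from the slides: the first follower enters ID in
    clock period 5 if its fetch was allowed to complete in period 2,
    otherwise it is only fetched in clock period 5.
    """
    rows = {}
    for k in range(n_after):
        if continue_fetch:
            # First follower is fetched in period 2 and held until period 4.
            if_stage = "2/4" if k == 0 else 4 + k
            id_stage = 5 + k
        else:
            if_stage = 5 + k
            id_stage = 6 + k
        rows[k] = (if_stage, id_stage, id_stage + 1, id_stage + 2, id_stage + 3)
    return rows


aborted = not_taken_schedule(5, continue_fetch=False)
continued = not_taken_schedule(5, continue_fetch=True)
print(aborted[4][-1], continued[4][-1])  # WB of the AND: 13 versus 12
```

The last instruction (the AND) writes back in clock period 12 instead of 13, which is the one clock period saved.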

Control Hazards
 Assuming that we do not take the branch !

The CPU designer might decide to design the control unit so
that it aborts the fetch of the DADD in the 2nd clock period
 This is a toss-up for the CPU designer !
 How often branches are not taken is critical
 If branches are often not taken, then the designer can design
the control unit to allow fetching the DADD
 BUT, if we continue with the fetch and it causes a page fault
(the instruction is not in the memory), we read the page of the
instruction from disk, and if we then realize the branch is taken,
all this effort is wasted !
 The frequency of untaken branches depends on the application,
the programmer, the compiler and the instruction set !

We decide not to fetch the next instruction
 We do not fetch the DADD !
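
The toss-up can be put in numbers with a small sketch. It assumes the stall counts read off the earlier diagrams (three stalls when the branch is taken, three when it is not taken and the fetch is aborted, two when it is not taken and the fetch continues). The function name and the probabilities are hypothetical, and the cost of a wasted page fault, which is the argument against continuing the fetch, is deliberately left out.

```python
def expected_stalls(p_not_taken: float, continue_fetch: bool) -> float:
    """Average stall cycles per branch for one fetch policy."""
    taken_stalls = 3                       # target fetched in period 5
    not_taken_stalls = 2 if continue_fetch else 3
    return (1 - p_not_taken) * taken_stalls + p_not_taken * not_taken_stalls


for p in (0.0, 0.5, 1.0):  # hypothetical fractions of untaken branches
    print(p, expected_stalls(p, False), expected_stalls(p, True))
```

The gain from continuing the fetch grows with the fraction of untaken branches and disappears when every branch is taken.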

Control Hazards
 If we summarize : if we have a control
instruction, the time penalty is high

Jump, jump-to-procedure and return-from-procedure
instructions require an unconditional change to the
order of execution
 The sooner we calculate the target instruction address,
the more stall cycles we can eliminate

But, with branches we also need to test the
condition so we need to determine two items
 The target address
 The condition

The sooner we calculate the target instruction
address and the condition, the more stall cycles
we can save

Control Hazards
 Thus, solving the branch execution problem is
more difficult than the others

In fact, one can think of the jump, jump to a
procedure and return from a procedure
instructions as a special case of the branch where
the condition is always true, so we have to take
the jump/return
 Overall, control hazards, especially branch
instructions, attract a lot of interest in
computer architecture research

Many journal and conference papers have been
published over the last 15 years on the topic of
branch penalty reduction !

Control Hazards
 Let’s change our earlier code a little

200 LD    R1, 500(R0)
204 DADD  R2, R3, R4
208 BEQZ  R18, 2
20C DSUB  R5, R6, R7
210 XOR   R8, R9, R10
214 SLT   R11, R12, R13
218 OR    R14, R15, R16
21C SD    R17, 600(R0)

If the Branch is not taken, the target instruction is the
DSUB, the instruction that follows the Branch
If the Branch is taken, the target instruction is the SLT
instruction that is two instructions below the instruction
that follows the Branch (DSUB)

Control Hazards
 Assuming that we do take the branch and do not
fetch the DSUB !

Clock period              1     2     3     4     5     6     7     8     9     10    11
200 LD    R1, 500(R0)     IF    ID    EX    MEM   WB
204 DADD  R2, R3, R4            IF    ID    EX    MEM   WB
208 BEQZ  R18, 2                      IF    ID    EX
20C DSUB  R5, R6, R7
210 XOR   R8, R9, R10
214 SLT   R11, R12, R13                                       IF    ID    EX    MEM   WB
218 OR    R14, R15, R16                                             IF    ID    EX    MEM
21C SD    R17, 600(R0)                                                    IF    ID    EX

A pipeline start-up is created

Control Hazards
 For the case where we take the branch, we
have a pipeline start-up created in clock
period 7

That is, the pipeline is emptied !
 We need to reduce the penalty cycles for
our pipeline
 We will modify our state diagram so that
Branch instructions will take two clock
periods

Branch instructions will be in only IF and ID
 CPIBranch = 2
 There will be only one clock period of stall
after this implementation
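
To get a feeling for what going from three stall cycles to one is worth, here is a rough sketch. It assumes an otherwise ideal pipeline (CPI of 1 without branches) and a made-up branch frequency; neither number comes from the slides, and the function name is hypothetical.

```python
def effective_cpi(branch_fraction: float, stalls_per_branch: int) -> float:
    """Ideal CPI of 1 plus the stall cycles contributed by branches."""
    return 1.0 + branch_fraction * stalls_per_branch


BRANCH_FRACTION = 0.25  # hypothetical: one instruction in four is a branch

before = effective_cpi(BRANCH_FRACTION, 3)  # branch resolved in EX
after = effective_cpi(BRANCH_FRACTION, 1)   # branch resolved in ID
print(before, after, before / after)        # 1.75 1.25 1.4
```

Under these assumptions the modified Branch hardware would make the pipeline 1.4 times faster; the real gain depends on how frequent branches are in the program.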

Control Hazards
 The changes on the state diagram for the Branch
instruction

As we discussed before, we need to determine the target
address and the condition as early as possible
 We would know we have a branch in the beginning of the ID
cycle
 In that case, we determine the target address and the
condition in the ID stage
 The target address calculation requires adding PC and
(4*Offset), for which the ID stage has an ADDer circuit now
 The ADDer is accessed by the IF stage if the ID stage has a
Branch
 We can justify a separate ADDer in the ID stage, besides the
ones in IF and EX, since there is a large Branch penalty to pay

The execution of all other non-control instructions is not
affected

Control Hazards
 The changes on the state diagram for the
Branch instruction

State 0 (IF)
2.IR ← If (2.IR.opcode == Branch) then NOP else M[PC]
PC ← If ((2.IR.opcode == Branch) & (GPR[2.IR.Rs] op 0)) then (2.NPC + (4 * 2.IR.DOImm+))
     else if (2.IR.opcode ≠ Branch) then PC + 4
2.NPC ← If ((2.IR.opcode == Branch) & (GPR[2.IR.Rs] op 0)) then (2.NPC + (4 * 2.IR.DOImm+))
     else if (2.IR.opcode ≠ Branch) then PC + 4

State 1 (ID)
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
3.IR ← 2.IR

CPIBranch = 2
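
The IF-stage part of this state diagram can be mimicked in software as follows. This is a sketch under assumptions: the class and function names are hypothetical, the generic test "op 0" is modeled as the BEQZ case (register equals zero), and a fetch of `None` stands for the NOP placed in 2.IR.

```python
from typing import NamedTuple, Optional


class BranchInID(NamedTuple):
    """What the ID stage exposes to the IF stage (hypothetical model)."""
    npc: int       # 2.NPC, address of the instruction after the branch
    offset: int    # 2.IR.DOImm+, sign-extended offset in instructions
    rs_value: int  # GPR[2.IR.Rs], possibly forwarded


def if_stage(pc: int, branch: Optional[BranchInID]):
    """Return (fetched_address, next_pc) for one clock period.

    fetched_address is None when a NOP is placed in 2.IR.
    """
    if branch is None:
        return pc, pc + 4      # 2.IR <- M[PC], PC <- PC + 4
    if branch.rs_value == 0:   # BEQZ condition holds: branch taken
        return None, branch.npc + 4 * branch.offset
    return None, pc            # branch not taken: PC is left unchanged


# 208 BEQZ R18, 2 sits in ID while PC already points at 20C (the DSUB)
print(if_stage(0x20C, BranchInID(npc=0x20C, offset=2, rs_value=0)))
```

With R18 equal to zero the call returns `(None, 0x214)` (printed in decimal as `(None, 532)`), so the SLT at 214 is fetched in the following clock period.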

Control Hazards
 The changes to the IF and ID stages

[Figure: the modified IF and ID stages. Labels shown in the IF stage: PC, ADD with constant 4, MUX1 with its Sel input, AB1. Labels shown in the ID stage: 2.NPC, 2.IR, GPR read through the 5-bit Rs field giving the 64-bit GPR[Rs], Sign Extend of the 16-bit DOImm+ to 64 bits, *4, ADD, Zero ?]
Control Hazards
 The changes to the IF and ID stages

The ADDer in the ID stage is used by MUX1 in the IF stage
 This hardware will be correct for the case of GPRs written in
the first half of the clock period where we check the GPR in
the second half of the clock period to determine if it is zero !

The Zero circuit has a forwarding circuit with MUX7 that is
moved to the ID stage
 We have forwardings to the ID stage so that we bypass the
GPR register to test
 These forwardings are from :
 The output of the ALU (this is a new forwarding compared with
slide 256)
 4.ALUoutput
 5.ALUoutput
 5.LMD
Control Hazards
 The execution of the Branch now

Assume that the Branch is taken

Clock period              1     2     3     4     5     6     7     8     9     10    11
200 LD    R1, 500(R0)     IF    ID    EX    MEM   WB
204 DADD  R2, R3, R4            IF    ID    EX    MEM   WB
208 BEQZ  R18, 2                      IF    ID
20C DSUB  R5, R6, R7
210 XOR   R8, R9, R10
214 SLT   R11, R12, R13                           IF    ID    EX    MEM   WB
218 OR    R14, R15, R16                                 IF    ID    EX    MEM   WB
21C SD    R17, 600(R0)                                        IF    ID    EX    MEM   WB

A 1-clock period long bubble is created.
The other stall cycle is because the Branch takes 2 clock periods
Control Hazards
 The execution of the Branch now

Assume that the Branch is taken
 If we show the pipeline in our notation

                          IF   ID   EX   MEM  WB
200 LD    R1, 500(R0)      1    2    3     4   5
204 DADD  R2, R3, R4       2    3    4     5   6
208 BEQZ  R18, 2           3    4
20C DSUB  R5, R6, R7
210 XOR   R8, R9, R10
214 SLT   R11, R12, R13    5    6    7     8   9
218 OR    R14, R15, R16    6    7    8     9  10
21C SD    R17, 600(R0)     7    8    9    10

 It looks like there is a 2-clock period long bubble created on the
previous slide
 This is because the Branch does not have its EX cycle anymore !
 Overall, there is only one stall cycle now !

Control Hazards
 Can we improve the Branch hardware so that even
this one stall cycle is eliminated ?
 We will take a look at three solutions and
decide to go ahead with the last solution
 Solution 3 !
 Solution 1

We can eliminate the one clock period stall when
branches are not taken by continuing the
execution of the already fetched instruction that
follows the branch
 We discussed this before and said this is a toss-up !

Control Hazards
 Solution 2

Adding to Solution 1, we design the Branch hardware such
that it assumes branch-not-taken and continues the
execution
 If, however, the branch is taken (we guessed wrong), we discard
the instruction in the ID stage and fetch from the target
address
 That is, we back out and continue

Note again that Solution 2 includes Solution 1

Branch not taken
IF ID
   IF ID EX MEM….
      IF ID EX…….

Branch taken (we guessed wrong)
IF ID
   IF (discard it)
   ……
   ……
      IF ID EX…….

Control Hazards
 Solution 2

If we guessed wrong, we pay a one-clock period
stall penalty
 Otherwise, there is no stall on the pipeline !

We have to make sure that the state of the
machine is not changed so that backing out is
simple
 For CPUs that have long pipelines this would be difficult
 For the MIPS, it is not a problem

Control Hazards
 Solution 2

What if we design the Branch hardware such that
it assumes branch-taken (instead of assuming
branch-not-taken) ?
 This is not useful for the MIPS since the target address
and the test are known together
 On CPUs where the target address is known before the
test outcome, this technique can be useful
 These CPUs are often CISC CPUs !

Control Hazards
 Solution 3


This is the one we will use for Version 3
The final method we will use is delayed branch which makes
use of the compiler and the hardware
 In this technique, we continue the execution of the
instruction(s) that follow(s) the Branch in the branch delay
slot no matter what the Branch outcome is
 The branch delay slot is the set of instruction positions
following the branch
Branch Rx, Offset
Branch delay slot
One instruction long or more
 The length of the branch delay slot is the time penalty paid ≡
the number of stall cycles due to the Branch ≡ the amount of
time we are not sure about the target instruction
 For the current design it is 1 clock period
 Therefore, the branch delay slot has 1 instruction
Control Hazards

 The changes on the state diagram due to delayed
branches

We have to execute the instruction that follows the branch
in any case

State 0 (IF)
2.IR ← M[PC]
PC ← If ((2.IR.opcode == Branch) & (GPR[2.IR.Rs] op 0)) then (PC + (2.IR.DOImm+ * 4))
     else (PC + 4)

State 1 (ID)
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
3.IR ← 2.IR

CPIBranch = 2

Control Hazards
 Solution 3

Delayed branch means we execute the instructions
in the branch delay slot no matter what the
Branch outcome is
 These instructions must be independent of the branch so
that the program execution is correct !
 For our MIPS CPU the branch delay slot is one
instruction long
 Because, we are not sure which instruction is the target
instruction for one clock period
 The following clock period we know which instruction is
the target
 Then, why don’t we execute the instruction right after
the Branch whether we take the branch or not ?
 It should be easy to find one instruction that can be
executed no matter what the Branch outcome is ????

Control Hazards
 Solution 3

Delayed branch means we execute the instructions
in the branch delay slot no matter what the
Branch outcome is
 It is the compiler that changes the order of instructions
so that after the Branch there is an independent
instruction
 We say the compiler schedules an instruction to the
Branch delay slot
 This is another example of how ordering instructions is
important (needed)

Control Hazards
 Solution 3

Delayed branch means we execute the instructions in the
branch delay slot no matter what the Branch outcome is
 How can the compiler find an independent instruction for the
MIPS CPU to place in the Branch delay slot ?
 There are three possible cases
 Case 1 : From before branch
 If the instruction before the Branch is independent of the Branch
 This one always improves the performance :
Original code
DADD R1, R2, R3
Bxxxx R6, 5

The compiler realizes the DADD is independent of the Bxxxx ≡ The
DADD can be executed after the Bxxxx. The compiler moves the
DADD after the Bxxxx

New code
Bxxxx R6, 5
DADD R1, R2, R3     (Branch delay slot)
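
Case 1 can be sketched as a tiny compiler pass. This is a simplified illustration, not how a real MIPS compiler is structured: instructions are hypothetical `(opcode, destination, sources)` tuples, the branch is assumed to name its target by a label so that no offset has to be recomputed, and only the instruction immediately before the branch is considered.

```python
NOP = ("NOP", None, ())


def fill_from_before(block):
    """Fill the one-instruction delay slot of the branch ending `block`.

    block is a list of (opcode, destination, sources) tuples whose last
    entry is the branch, e.g. ("Bxxxx", None, ("R6",)).
    """
    *body, branch = block
    if not body:
        return [branch, NOP]
    candidate = body[-1]
    tested_registers = set(branch[2])
    if candidate[1] in tested_registers:
        # The branch tests the register the candidate writes, so the
        # candidate is not independent: fall back to a NOP in the slot.
        return block + [NOP]
    # Independent: move the candidate into the delay slot.
    return body[:-1] + [branch, candidate]


original = [("DADD", "R1", ("R2", "R3")), ("Bxxxx", None, ("R6",))]
print(fill_from_before(original))
```

For the code above the DADD ends up after the Bxxxx. If the branch tested R1 instead of R6, the pass would leave the order alone and put a NOP in the delay slot.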

Control Hazards
 Solution 3

Delayed branch means we execute the instructions in the branch delay slot
no matter what the Branch outcome is
 Case 2 : From target branch
 It is used for loops where there is a large probability that the branch will be
taken
 It improves the performance if the branch is taken

Original code
loop : DSUB R7, R8, R9
       DADD R1, R2, R3
       Bxxxx R1, (-9)10

The compiler realizes the DADD is not independent of the Bxxxx.
But, the DSUB is independent of the Branch ≡ The DSUB can be
executed after the Bxxxx. The compiler copies the DSUB to the
Branch delay slot. This will save time if we branch back to the
beginning of the loop. If we exit the loop, it must be OK to execute
the DSUB ! Branch offset must be adjusted ! The code is longer !

New code
loop : DSUB R7, R8, R9
       DADD R1, R2, R3
       Bxxxx R1, (-8)10
       DSUB R7, R8, R9

Control Hazards
 Solution 3

Delayed branch means we execute the instructions in the branch
delay slot no matter what the Branch outcome is
 Case 3 : From fall through
 It is used when there is a high probability that the branch will not be
taken
 It improves the performance if the branch is not taken

Original code
DADD R1, R2, R3
Bxxxx R1, 7
DSUB R12, R13, R14

The compiler realizes the DADD is not independent of the Bxxxx.
But, the DSUB is independent of the Branch ≡ The DSUB can be
executed right after the Bxxxx. The compiler moves the DSUB to
the Branch delay slot. This will save time if the branch is not
taken. It must be OK to execute the DSUB even if we take the
branch !

New code
DADD R1, R2, R3
Bxxxx R1, 7
DSUB R12, R13, R14

Control Hazards
 Solution 3

Delayed branch means we execute the instructions in the branch
delay slot no matter what the Branch outcome is
 You might have realized that delayed branch is not practical
since it requires the compiler to know that the CPU is
expecting an independent instruction in the Branch delay slot
 This means that old code cannot be run on this MIPS CPU, either
because that compiler did not generate the code for a CPU with a
Branch delay slot, or because that compiler did generate code with a
Branch delay slot but the delay slot was more than one instruction
long since it was for an old generation MIPS CPU
 This is the legacy software situation !

Today’s microprocessors do not use delayed branches
because of the compatibility issue
 However, academically, it is an interesting idea

Control Hazards
 Solution 3

Delayed branch means we execute the instructions in the
branch delay slot no matter what the Branch outcome is
 Shall we not use Solution 3 for the MIPS CPU ≡ Shall we not
use delayed Branches ?
 We will use delayed Branches in Version 1 for the
sake of simplifying our discussion

We will eventually not use delayed Branches when we cover
advanced pipelining in more advanced versions of the MIPS
CPU !
 When we cover advanced pipelining, we will discuss features
of today’s microprocessors !

Control Hazards
 Solution 3

Let’s take a look at the execution of the following code with a taken branch
 Notice the DSUB is an independent instruction in the branch delay slot
 It must be OK to execute the DSUB even if we take the branch
 Notice we changed the BEQZ register to R2 to show forwarding to the ID stage
 The forwarding is from the EX stage to the ID stage where the output of the
ALU is forwarded to the ID stage to test the result of the addition that is just
performed in EX to decide to branch

                          IF   ID   EX   MEM  WB
200 LD    R1, 500(R0)      1    2    3     4   5
204 DADD  R2, R3, R4       2    3    4     5   6
208 BEQZ  R2, 2            3    4
20C DSUB  R5, R6, R7       4    5    6     7   8
210 XOR   R8, R9, R10
214 SLT   R11, R12, R13    5    6    7     8   9
218 OR    R14, R15, R16    6    7    8     9  10
21C SD    R17, 600(R0)     7    8    9    10
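
The order in which instructions execute under a delayed branch can be sketched with a small interpreter. It is a hypothetical model that tracks only the program counter, with the branch outcome passed in as a flag instead of being computed from R2; the target rule (address after the branch plus 4 times the offset) is the one used throughout these slides.

```python
def execution_order(program, taken, delay_slots=1):
    """program: list of (address, text, branch_offset_or_None) tuples.

    Returns the opcodes in the order they execute when the instruction
    in the branch delay slot always executes.
    """
    by_addr = {addr: (text, off) for addr, text, off in program}
    order = []
    pc = program[0][0]
    pending = None  # (delay-slot instructions left, target address)
    while pc in by_addr:
        text, off = by_addr[pc]
        order.append(text.split()[0])
        next_pc = pc + 4
        if pending is not None:
            remaining, target = pending
            remaining -= 1
            if remaining == 0:
                next_pc, pending = target, None  # delay slot done: jump
            else:
                pending = (remaining, target)
        elif off is not None and taken:
            pending = (delay_slots, pc + 4 + 4 * off)
        pc = next_pc
    return order


CODE = [
    (0x200, "LD R1, 500(R0)", None),
    (0x204, "DADD R2, R3, R4", None),
    (0x208, "BEQZ R2, 2", 2),
    (0x20C, "DSUB R5, R6, R7", None),
    (0x210, "XOR R8, R9, R10", None),
    (0x214, "SLT R11, R12, R13", None),
    (0x218, "OR R14, R15, R16", None),
    (0x21C, "SD R17, 600(R0)", None),
]
print(execution_order(CODE, taken=True))
```

With the branch taken, the DSUB in the delay slot still executes and only the XOR is skipped, which matches the rows that have clock periods in the table above.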

Summary of Version 1
 We added hardware to deal with structural, data and
control hazards
 Still, it executes integer instructions
 It issues instructions statically

Except for the branch, which is not issued and completes in
two clock periods
 The branch is not issued to save time !

IF ID | Static instruction issue | EX MEM WB

 Because of static issuing, instructions complete in order,
except for the branch, which can complete before the
instructions that are issued earlier

This results in imprecise interrupts !
 Only the L1 cache memories are considered

The L2 cache memories can be slower and there can be L1
cache misses

Summary of Version 1
 We realize we need to modify Version 1 so that it

Executes FP instructions
Considers all levels of the memory hierarchy
Handles interrupts better
 All three are difficult problems to solve

FP operations, such as add, subtract, multiply and divide, are complex
and cannot be completed in one clock period as we can with the integer
add operation
 The integer add is done in EX and takes one clock period
 The FP add, subtract, multiply and divide will be done in EX and take
multiple clock cycles !
 More instructions can complete out-of-order
 The interrupt hardware becomes even more complex
 We solve one problem (executing FP instructions) but make the other
problem more complex

All levels of the memory hierarchy must be considered
 The cache memories, slower main memory and the virtual memory (disk)

Interrupts can happen randomly
 We also need to save the state which is not easy for a pipelined CPU
 Advanced versions will attempt to solve them

Test Program
 Determine when the execution of the second iteration ends if
L1 cache memories take one clock period and there is no cache
miss
 Show all forwardings and write-in-the-first-half-read-in-the-second-half cases

                         IF     ID     EX    MEM     WB     IF     ID     EX    MEM     WB
LD    R1, 500(R8)
DADD  R2, R3, R1
DSUB  R5, R2, R1
XOR   R8, R5, R2
SLT   R11, R2, R5
OR    R14, R11, R15
BNEZ  R14, (-7)10
SD    R11, 600(R14)

The answer is on the next slide

Test Program
 Determine when the execution of the second iteration ends if
L1 cache memories take one clock period and there is no cache
miss
 Show all forwardings and write-in-the-first-half-read-in-the-second-half cases

                     |---------- First iteration ----------|--------- Second iteration ---------|
                         IF     ID     EX    MEM     WB     IF     ID     EX    MEM     WB
LD    R1, 500(R8)         1      2      3      4      5     10     11     12     13     14
DADD  R2, R3, R1          2    3/4      5      6      7     11  12/13     14     15     16
DSUB  R5, R2, R1        3/4      5      6      7      8  12/13     14     15     16     17
XOR   R8, R5, R2          5      6      7      8      9     14     15     16     17     18
SLT   R11, R2, R5         6      7      8      9     10     15     16     17     18     19
OR    R14, R11, R15       7      8      9     10     11     16     17     18     19     20
BNEZ  R14, (-7)10         8      9                          17     18
SD    R11, 600(R14)       9     10     11     12            18     19     20     21

The second iteration ends in clock period 21
All data hazards are RAW

Test Program
 Determine when the execution of the second iteration ends if L1 cache
memories take two clock periods and there is no cache miss
 Show all forwardings and write-in-the-first-half-read-in-the-second-half cases

                     |---------- First iteration ----------|--------- Second iteration ---------|
                         IF     ID     EX    MEM     WB     IF     ID     EX    MEM     WB
LD    R1, 500(R8)       1-2      3      4    5-6      7  17-18     19     20  21-22     23
DADD  R2, R3, R1        3-4    5/6      7      8      9  19-20  21/22     23     24     25
DSUB  R5, R2, R1        5-6      7      8      9     10  21-22     23     24     25     26
XOR   R8, R5, R2        7-8      9     10     11     12  23-24     25     26     27     28
SLT   R11, R2, R5      9-10     11     12     13     14  25-26     27     28     29     30
OR    R14, R11, R15   11-12     13     14     15     16  27-28     29     30     31     32
BNEZ  R14, (-7)10     13-14     15                       29-30     31
SD    R11, 600(R14)   15-16     17     18     19         31-32     33     34     35

We assume there is a write buffer that
allows Stores to complete in one clock period
There are structural hazards in IF and MEM stages due to
slow cache memories
The second iteration ends in clock period 35
All hazards pointed by the arrows are data hazards and type RAW
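
Both answers (clock period 21 with one-clock L1 caches, clock period 35 with two-clock L1 caches) can be cross-checked with a simplified timing model. The sketch below is an assumption-laden approximation, not the Version 1 control unit: it assumes forwarding covers every ALU-to-ALU and ALU-to-branch dependence, that only an operand coming from a Load forces a wait in ID, that branches finish in ID, that Stores spend one clock period in MEM and skip WB, and that the delayed branch lets fetching continue without a bubble. All names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# (opcode, destination register, source registers) for one loop iteration
LOOP = [
    ("LD",   "R1",  ("R8",)),
    ("DADD", "R2",  ("R3", "R1")),
    ("DSUB", "R5",  ("R2", "R1")),
    ("XOR",  "R8",  ("R5", "R2")),
    ("SLT",  "R11", ("R2", "R5")),
    ("OR",   "R14", ("R11", "R15")),
    ("BNEZ", None,  ("R14",)),
    ("SD",   None,  ("R11", "R14")),
]


def last_stage(op):
    if op == "BNEZ":
        return "ID"   # branches complete in ID
    if op == "SD":
        return "MEM"  # stores have nothing to write back
    return "WB"


def schedule(loop, iterations=2, l1_cycles=1):
    """Return [(opcode, {stage: (first, last) clock period})]."""
    free_at = {s: 0 for s in STAGES}  # last clock period each stage was busy
    load_ready = {}                   # register -> period its Load leaves MEM
    rows = []
    for _ in range(iterations):
        for op, dest, srcs in loop:
            used = STAGES[: STAGES.index(last_stage(op)) + 1]
            row, prev_end = {}, free_at["IF"]
            for i, stage in enumerate(used):
                start = prev_end + 1
                slow = stage == "IF" or (stage == "MEM" and op == "LD")
                end = start + (l1_cycles if slow else 1) - 1
                if stage == "ID":
                    # Wait for a loaded operand; ALU results are forwarded.
                    end = max([end] + [load_ready.get(r, 0) for r in srcs])
                if i + 1 < len(used):
                    # Cannot leave until the next stage has been vacated.
                    end = max(end, free_at[used[i + 1]])
                row[stage] = (start, end)
                free_at[stage] = prev_end = end
            if op == "LD":
                load_ready[dest] = row["MEM"][1]
            rows.append((op, row))
    return rows


def finish_time(rows):
    return max(end for _, row in rows for _, end in row.values())


print(finish_time(schedule(LOOP, l1_cycles=1)))  # 21
print(finish_time(schedule(LOOP, l1_cycles=2)))  # 35
```

A pair such as `(3, 4)` in a row corresponds to the 3/4 notation of the tables, that is, the DADD waiting in ID for the loaded R1.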

Test Program
 Determine when the execution of the second iteration ends if
L1 cache memories have misses

Assume that the memory levels are as described in the unpipelined
CPU case with the following additions and reminders
 The bus width between the physical memory and the lowest level cache is 8 Bytes
 The instruction cache is 8 KBytes and the data cache is 16 KBytes
 Both cache block sizes are 32 Bytes
 Both cache memories use direct mapping
 Both caches use write-back with write-allocate
 Both cache memories access the needed item first
 The Data Cache has two read and two write ports
 The Instruction Cache has two read ports
 The latency to access the L2 cache is 4 clock periods and transferring
each 8-Byte content takes one clock period
 The L2 cache memory can handle one miss per L1 cache memory at a
time
 This means that if the instruction cache and the data cache have misses at
the same time, they will be handled at the same time by the L2 cache
 This means the L2 cache can handle two hits at the same time

Test Program
 Determine when the execution of the second iteration ends if
L1 cache memories have misses

Assume that the L1 instruction and data cache memories and the
physical memory have the following properties
 Each Level 1 cache memory can handle only one miss at a time
 A Store miss requires that the Store instruction stays in the MEM
stage until the miss is handled
 It cannot just store to the write buffer and then proceed
 Each Level 1 cache memory can handle up to four hits while it handles a
miss
 An instruction that immediately follows a Load or a Store is forced to
stall an extra clock period in the ID stage to make sure the access for
the data element is completed

For the given code, assume the following
 The first instruction occupies the leftmost 4 bytes of the top position
of an instruction block
 Each data element accessed is in a separate data block, and these blocks do
not map to the same area in the data cache
 It means each Load and Store instruction accesses a different block in
each iteration
This means there will be four data cache misses in two iterations !
This is very unusual, but it is assumed here just to show an extreme case
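
The cache parameters given for this exercise fix a few numbers that are easy to check. The sketch below only does that arithmetic; the function name and the dictionary keys are hypothetical.

```python
def cache_geometry(cache_bytes: int, block_bytes: int, instr_bytes: int = 4):
    """Basic figures for a direct-mapped cache with power-of-two sizes."""
    blocks = cache_bytes // block_bytes
    return {
        "blocks": blocks,
        "index_bits": blocks.bit_length() - 1,
        "offset_bits": block_bytes.bit_length() - 1,
        "instructions_per_block": block_bytes // instr_bytes,
    }


icache = cache_geometry(8 * 1024, 32)    # 8 KByte instruction cache
dcache = cache_geometry(16 * 1024, 32)   # 16 KByte data cache
transfers_per_block = 32 // 8            # 32-Byte block over an 8-Byte bus

print(icache)
print(dcache)
print(transfers_per_block)  # 4
```

A 32-Byte block holds eight 4-Byte instructions, so if the first instruction starts the block, the eight instructions of the loop occupy exactly one instruction cache block.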

Test Program
 We observe all 8 instructions are in one instruction cache block
 There are four data accesses, each one is in one separate data block,
resulting in four data cache misses
 Determine when the execution of the second iteration ends
 Show all forwardings and write-in-the-first-half-read-in-the-second-half cases

                     |---------- First iteration ----------|--------- Second iteration ---------|
                         IF     ID     EX    MEM     WB     IF     ID     EX    MEM     WB
LD    R1, 500(R8)       1/5      6      7   8/12     13     18  19/24     25  26/30     31
DADD  R2, R3, R1          6   7/12     13     14     15  19/24  25/30     31     32     33
DSUB  R5, R2, R1       7/12     13     14     15     16  25/30     31     32     33     34
XOR   R8, R5, R2         13     14     15     16     17     31     32     33     34     35
SLT   R11, R2, R5        14     15     16     17     18     32     33     34     35     36
OR    R14, R11, R15      15     16     17     18     19     33     34     35     36     37
BNEZ  R14, (-7)10        16     17                          34     35
SD    R11, 600(R14)      17     18     19  20/24            35     36     37  38/42

There are structural hazards in IF and MEM stages due to
cache misses
The second iteration ends in clock period 42
All hazards pointed by the arrows are data hazards and type RAW