Computer Organization & Design
Computer Organization and
Architecture (AT70.01)
Comp. Sc. and Inf. Mgmt.
Asian Institute of Technology
Instructor: Dr. Sumanta Guha
Slide Sources: Patterson &
Hennessy COD book website
(copyright Morgan Kaufmann)
adapted and supplemented
COD Ch. 5
The Processor: Datapath and
Control
Implementing MIPS
We're ready to look at an implementation of the MIPS instruction set
Simplified to contain only
arithmetic-logic instructions: add, sub, and, or, slt
memory-reference instructions: lw, sw
control-flow instructions: beq, j
R-Format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
I-Format: op (6 bits) | rs (5 bits) | rt (5 bits) | offset (16 bits)
J-Format: op (6 bits) | address (26 bits)
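The three formats share the same 32-bit word, so field extraction is just shifting and masking. A minimal sketch in Python, assuming the instruction is held as an unsigned 32-bit integer; the function name `decode` and the dictionary layout are illustrative, not part of the slides:

```python
def decode(instr):
    """Split a 32-bit MIPS instruction word into every field of the three formats."""
    return {
        "op":      (instr >> 26) & 0x3F,       # bits 31-26, all formats
        "rs":      (instr >> 21) & 0x1F,       # bits 25-21, R- and I-format
        "rt":      (instr >> 16) & 0x1F,       # bits 20-16, R- and I-format
        "rd":      (instr >> 11) & 0x1F,       # bits 15-11, R-format
        "shamt":   (instr >> 6) & 0x1F,        # bits 10-6,  R-format
        "funct":   instr & 0x3F,               # bits 5-0,   R-format
        "offset":  instr & 0xFFFF,             # bits 15-0,  I-format
        "address": instr & 0x03FFFFFF,         # bits 25-0,  J-format
    }

# add $t1, $t2, $t3: op = 0, rs = $t2 (10), rt = $t3 (11), rd = $t1 (9), funct = 100000
word = (0 << 26) | (10 << 21) | (11 << 16) | (9 << 11) | (0 << 6) | 0b100000
fields = decode(word)
```

Which of the returned fields is meaningful depends on the opcode; the decoder itself does not need to know the format.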
Implementing MIPS: the
Fetch/Execute Cycle
High-level abstract view of fetch/execute implementation
use the program counter (PC) to supply the instruction address
fetch the instruction from memory and increment PC
use fields of the instruction to select registers to read
execute depending on the instruction
repeat…
[Figure: abstract view of the implementation: PC, instruction memory, registers, ALU, data memory]
Overview: Processor
Implementation Styles
Single Cycle
perform each instruction in 1 clock cycle
clock cycle must be long enough for slowest instruction; therefore,
disadvantage: only as fast as slowest instruction
Multi-Cycle
break fetch/execute cycle into multiple steps
perform 1 step in each clock cycle
advantage: each instruction uses only as many cycles as it needs
Pipelined
execute each instruction in multiple steps
perform 1 step / instruction in each clock cycle
process multiple instructions in parallel – assembly line
Functional Elements
Two types of functional elements in the hardware:
elements that operate on data (called combinational elements)
elements that contain data (called state or sequential elements)
Combinational Elements
Work as functions from inputs to outputs, e.g., the ALU
Combinational logic reads input data from one register and
writes output data to another, or same, register
read/write happens in a single cycle – combinational element
cannot store data from one cycle to a future one
[Figure: combinational logic hardware units: within one clock cycle, combinational logic reads from state element 1 and writes to state element 2; the same state element may serve as both source and destination]
State Elements
State elements contain data in internal storage, e.g., registers
and memory
All state elements together define the state of the machine
What does this mean? Think of shutting down and starting up again…
Flipflops and latches are 1-bit state elements, equivalently,
they are 1-bit memories
The output(s) of a flipflop or latch always depends on the bit
value stored, i.e., its state, and can be called 1/0 or high/low
or true/false
The input to a flipflop or latch can change its state depending
on whether it is clocked or not…
Set-Reset (SR-) latch
(unclocked)
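Think of Sbar as the inverse of S (set, which sets Q to 1), and Rbar as the inverse of R (reset).
[Figure: SR-latch built from two cross-coupled nand gates n1 and n2, with inputs Sbar (set) and Rbar (reset) and outputs Q and Qbar; equivalently built with nor gates, with inputs R and S]
See sr_latch.v in Verilog Examples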
A set-reset latch made from two cross-coupled
nand gates is a basic memory unit.
When both Sbar and Rbar are 1, then either one
of the following two states is stable:
a) Q = 1 & Qbar = 0
b) Q = 0 & Qbar = 1
and the latch will continue in the current stable
state.
If Sbar changes to 0 (while Rbar remains at 1),
then the latch is forced to the only possible
stable state, (a). If Rbar changes to 0
(while Sbar remains at 1), the latch is forced to
the only possible stable state, (b).
So, while Sbar and Rbar are both 1, the latch
remembers which of them was last 0.
When both Sbar and Rbar are 0, the only
stable state is Q = Qbar = 1. However, if after
that both Sbar and Rbar return to 1, the latch must
then jump non-deterministically to one of stable
states (a) or (b), which is undesirable behavior.
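The behavior described above can be checked with a small gate-level model. This is only a sketch: it assumes gate n1 computes Q = nand(Sbar, Qbar) and gate n2 computes Qbar = nand(Rbar, Q), and it evaluates the gates in a fixed order until the outputs stop changing. The function names are illustrative. A fixed evaluation order cannot reproduce the non-deterministic jump out of Q = Qbar = 1, so that case is not modeled.

```python
def nand(a, b):
    """Two-input nand gate on 0/1 values."""
    return 0 if (a and b) else 1

def sr_latch(sbar, rbar, q, qbar):
    """Settle the cross-coupled nand gates starting from the stored (q, qbar)."""
    for _ in range(4):                      # a few passes suffice to reach a fixed point
        new_q = nand(sbar, qbar)            # gate n1
        new_qbar = nand(rbar, new_q)        # gate n2
        if (new_q, new_qbar) == (q, qbar):  # outputs stable: done
            break
        q, qbar = new_q, new_qbar
    return q, qbar

state = (0, 1)                        # start in stable state (b)
state = sr_latch(0, 1, *state)        # Sbar = 0 forces state (a)
after_set = state
state = sr_latch(1, 1, *state)        # both inputs 1: latch holds (a)
held = state
state = sr_latch(1, 0, *state)        # Rbar = 0 forces state (b)
after_reset = state
```

With both inputs at 1 the function returns whatever state it was given, which is the "memory" property of the latch.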
Synchronous Logic:
Clocked Latches and Flipflops
Clocks are used in synchronous logic to determine when a state
element is to be updated
in level-triggered clocking methodology either the state changes
only when the clock is high or only when it is low (technology-dependent)
in edge-triggered clocking methodology either the rising edge or
falling edge is active (depending on technology) – i.e., states
change only on rising edges or only on falling edges
[Figure: clock waveform showing the clock period, a rising edge and a falling edge]
Latches are level-triggered
Flipflops are edge-triggered
Clocked SR-latch
State can change only when clock is high
Potential problem : both inputs Sbar = 0 & Rbar = 0
will cause non-deterministic behavior
[Figure: gate-level circuit of the clocked SR-latch, with inputs Sbar, Rbar and clk and outputs Q and Qbar]
See clockedSr_latch.v in Verilog Examples
Clocked D-latch
State can change only when clock is high
Only single data input (compare SR-latch)
No problem with non-deterministic behavior
[Figure: gate-level circuit of the clocked D-latch, with inputs D and clk and outputs Q and Qbar]
See clockedD_latch.v in Verilog Examples
[Figure: timing diagram of D-latch (signals D, C, Q)]
Clocked D-flipflop
Negative edge-triggered
Made from three SR-latches
[Figure: gate-level circuit of the negative edge-triggered D-flipflop, with inputs d, clk and clear and outputs q and qbar]
See edge_dffGates.v in Verilog Examples
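The difference between a level-triggered D-latch and a negative edge-triggered D-flipflop shows up when both are driven by the same sampled waveforms. The sketch below is a behavioral model, not the gate-level circuits of the Verilog examples; the names `d_latch_trace` and `NegEdgeDFF` are illustrative. It assumes the flipflop captures the value of d at the sample where the falling edge is observed.

```python
def d_latch_trace(d_values, clk_values, q=0):
    """Level-triggered D-latch: Q follows D while clk is high, holds while clk is low."""
    out = []
    for d, clk in zip(d_values, clk_values):
        if clk:
            q = d
        out.append(q)
    return out

class NegEdgeDFF:
    """Negative edge-triggered D-flipflop: Q changes only when clk goes from 1 to 0."""
    def __init__(self, q=0):
        self.q = q
        self.prev_clk = 0

    def step(self, d, clk):
        if self.prev_clk == 1 and clk == 0:   # falling edge observed
            self.q = d
        self.prev_clk = clk
        return self.q

D = [0, 1, 1, 1, 0, 0, 1, 1]
C = [1, 1, 0, 0, 1, 1, 0, 0]
latch_q = d_latch_trace(D, C)
ff = NegEdgeDFF()
ff_q = [ff.step(d, c) for d, c in zip(D, C)]
```

The latch output changes in the middle of a high clock phase whenever D changes, while the flipflop output changes only at the two falling edges.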
State Elements on the
Datapath: Register File
Registers are implemented with arrays of D-flipflops
[Figure: register file block: inputs Read register number 1 (5 bits), Read register number 2 (5 bits), Write register (5 bits), Write data (32 bits), the Write control signal and Clock; outputs Read data 1 and Read data 2 (32 bits each)]
Register file with two read ports and
one write port
State Elements on the
Datapath: Register File
Port implementation:
Port implementation:
[Figure: read ports: Read register number 1 and Read register number 2 each select, through a multiplexor over Register 0 ... Register n – 1, the value driven onto Read data 1 and Read data 2]
[Figure: write port: a decoder on the Register number, combined with Write and Clock, drives the C input of each register; Register data feeds the D input of every register]
Read ports are implemented
with a pair of multiplexors – for 32
registers, each is a 32-to-1 multiplexor
with a 5-bit select input
Write port is implemented using
a decoder – 5-to-32 decoder for
32 registers. Clock is relevant to
write as register state may change
only at clock edge
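A behavioral sketch of a register file with two read ports and one write port follows. It assumes reads are combinational (available at any time) and a write takes effect only when a method standing in for the active clock edge is called with the write signal asserted. The class and method names are illustrative, and any special treatment of register 0 is left out.

```python
class RegisterFile:
    """32 registers of 32 bits; two read ports, one clocked write port."""
    def __init__(self, n=32):
        self.regs = [0] * n

    def read(self, rn1, rn2):
        """Read ports: the register numbers act as multiplexor selects."""
        return self.regs[rn1], self.regs[rn2]

    def clock_edge(self, reg_write, wn, wd):
        """Write port: only the decoded register wn is updated, and only if reg_write is 1."""
        if reg_write:
            self.regs[wn] = wd & 0xFFFFFFFF

rf = RegisterFile()
rf.clock_edge(1, 9, 42)          # write 42 into register 9
first = rf.read(9, 0)
rf.clock_edge(0, 9, 7)           # write signal deasserted: no change
second = rf.read(9, 9)
```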
Verilog
All components that we have discussed – and shall discuss –
can be described in Verilog
Refer to our Verilog slides and examples
Single-cycle Implementation
of MIPS
Our first implementation of MIPS will use a single long clock
cycle for every instruction
Every instruction begins on one rising (or falling) clock edge
and ends on the next rising (or falling) clock edge
This approach is not practical as it is much slower than a
multicycle implementation where different instruction
classes can take different numbers of cycles
in a single-cycle implementation every instruction must take
the same amount of time as the slowest instruction
in a multicycle implementation this problem is avoided by
allowing quicker instructions to use fewer cycles
Even though the single-cycle approach is not practical it is
simple and useful to understand first
Note : we shall implement jump at the very end
Datapath: Instruction
Store/Fetch & PC Increment
[Figure: three elements used to store and fetch instructions and increment the PC: (a) instruction memory, with an Instruction address input and an Instruction output; (b) program counter; (c) adder; combined into the fetch datapath, where the PC feeds the Read address of the instruction memory and an adder computes PC + 4]
Animating the Datapath
Instruction <- MEM[PC]
PC <- PC + 4
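[Figure: fetch datapath: the PC drives the ADDR input of the instruction memory, whose RD output is the instruction; an adder computes PC + 4]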
Datapath: R-Type Instruction
[Figure: two elements used to implement R-type instructions: (a) registers, with 5-bit Read register 1, Read register 2 and Write register inputs, a Write data input, Read data 1 and Read data 2 outputs, and the RegWrite control; (b) ALU, with a 3-bit ALU operation control, a Zero output and an ALU result output; combined into the R-type datapath]
Animating the Datapath
add rd, rs, rt
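R[rd] <- R[rs] + R[rt];
[Figure: R-type datapath: the 5-bit rs, rt and rd fields of the instruction drive RN1, RN2 and WN of the register file; RD1 and RD2 feed the ALU, whose result returns to WD; controls are RegWrite and the 3-bit ALU Operation]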
Datapath:
Load/Store Instruction
[Figure: two additional elements used to implement loads/stores: (a) data memory unit, with Address and Write data inputs, a Read data output, and the MemRead and MemWrite controls; (b) sign-extension unit, 16 bits in and 32 bits out; combined with the registers and the ALU into the load/store datapath]
Animating the Datapath
lw rt, offset(rs)
R[rt] <- MEM[R[rs] + sign_extend(offset)];
Animating the Datapath
sw rt, offset(rs)
MEM[R[rs] + sign_extend(offset)] <- R[rt]
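The two register-transfer statements for lw and sw can be sketched directly. The model below assumes the register file is a Python list, data memory is a dictionary keyed by byte address that holds whole words, and the offset is given as the raw 16-bit field of the instruction; the function names are illustrative.

```python
def sign_extend16(x):
    """Interpret the low 16 bits of x as a two's-complement number."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def lw(R, MEM, rt, rs, offset):
    """R[rt] <- MEM[R[rs] + sign_extend(offset)]"""
    R[rt] = MEM[(R[rs] + sign_extend16(offset)) & 0xFFFFFFFF]

def sw(R, MEM, rt, rs, offset):
    """MEM[R[rs] + sign_extend(offset)] <- R[rt]"""
    MEM[(R[rs] + sign_extend16(offset)) & 0xFFFFFFFF] = R[rt]

R = [0] * 32
MEM = {}
R[10] = 100                      # base register
R[9] = 0xABCD                    # value to store
sw(R, MEM, 9, 10, 0xFFFC)        # offset field 0xFFFC is -4, so the address is 96
lw(R, MEM, 11, 10, 0xFFFC)       # load the same word back into register 11
```

Sign extension is what lets a 16-bit field express negative offsets from the base register.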
Datapath: Branch Instruction
No shift hardware required:
simply connect wires from
input to output, each shifted
left 2 bits
[Figure: branch datapath: the 16-bit offset is sign-extended to 32 bits, shifted left 2 and added to PC + 4 (from the instruction datapath) to give the branch target; the ALU compares Read data 1 and Read data 2, and its Zero output goes to the branch control logic]
Animating the Datapath
beq rs, rt, offset
if (R[rs] == R[rt]) then
PC <- PC + 4 + (sign_extend(offset) << 2)
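The next-PC computation for beq can be written as a small function. This sketch assumes the two register values have already been read and that the offset is the raw 16-bit instruction field, sign-extended first and then shifted left by 2; the name `next_pc_beq` is illustrative.

```python
def sign_extend16(x):
    """Interpret the low 16 bits of x as a two's-complement number."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def next_pc_beq(pc, rs_val, rt_val, offset):
    """Return the PC after beq: the branch target if the registers are equal, else PC + 4."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    if rs_val == rt_val:
        # the offset counts instructions (words) relative to PC + 4
        return (pc_plus_4 + (sign_extend16(offset) << 2)) & 0xFFFFFFFF
    return pc_plus_4

taken = next_pc_beq(0x1000, 5, 5, 3)            # forward by 3 instructions
backward = next_pc_beq(0x1000, 5, 5, 0xFFFF)    # offset -1: back to the branch itself
not_taken = next_pc_beq(0x1000, 5, 6, 3)
```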
MIPS Datapath I: Single-Cycle
Input is either register (R-type) or sign-extended
lower half of instruction (load/store)
Data is either
from ALU (R-type)
or memory (load)
Combining the datapaths for R-type instructions
and load/stores using two multiplexors
Animating the Datapath:
R-type Instruction
add rd,rs,rt
[Figure: combined R-type and load/store datapath with the path used by add rd,rs,rt highlighted; controls shown: RegWrite, ALUSrc, Operation, MemWrite, MemRead, MemtoReg]
Animating the Datapath:
Load Instruction
lw rt,offset(rs)
[Figure: combined R-type and load/store datapath with the path used by lw rt,offset(rs) highlighted; controls shown: RegWrite, ALUSrc, Operation, MemWrite, MemRead, MemtoReg]
Animating the Datapath:
Store Instruction
sw rt,offset(rs)
[Figure: combined R-type and load/store datapath with the path used by sw rt,offset(rs) highlighted; controls shown: RegWrite, ALUSrc, Operation, MemWrite, MemRead, MemtoReg]
MIPS Datapath II: Single-Cycle
Separate adder as ALU operations and PC
increment occur in the same clock cycle
Separate instruction memory
as instruction and data read
occur in the same clock cycle
[Figure: adding instruction fetch: the PC, the PC + 4 adder and the instruction memory are placed in front of the combined R-type and load/store datapath; controls shown: RegWrite, ALUSrc, ALU operation, MemWrite, MemRead, MemtoReg]
MIPS Datapath III: Single-Cycle
New multiplexor (PCSrc): instruction address is either
PC+4 or branch target address
Extra adder needed as both
adders operate in each cycle
[Figure: adding branch capability and another multiplexor: a second adder sums PC + 4 and the sign-extended offset shifted left 2, and the PCSrc multiplexor selects between PC + 4 and that branch target]
Important note: in a single-cycle implementation data cannot be stored
during an instruction – it only moves through combinational logic
Question: is the MemRead signal really needed?! Think of RegWrite…!
Datapath Executing add
add rd, rs, rt
[Figure: single-cycle datapath (PC + 4 adder, branch adder, PCSrc multiplexor, register file, ALU, data memory) with the path used by add highlighted]
Datapath Executing lw
lw rt,offset(rs)
[Figure: single-cycle datapath (PC + 4 adder, branch adder, PCSrc multiplexor, register file, ALU, data memory) with the path used by lw highlighted]
Datapath Executing sw
sw rt,offset(rs)
[Figure: single-cycle datapath (PC + 4 adder, branch adder, PCSrc multiplexor, register file, ALU, data memory) with the path used by sw highlighted]
Datapath Executing beq
beq r1,r2,offset
[Figure: single-cycle datapath (PC + 4 adder, branch adder, PCSrc multiplexor, register file, ALU, data memory) with the path used by beq highlighted]
Control
Control unit takes input from
the instruction opcode bits
Control unit generates
ALU control input
write enable (possibly, read enable also) signals for each storage
element
selector controls for each multiplexor
ALU Control
Plan to control ALU: main control sends a 2-bit ALUOp control field
to the ALU control. Based on ALUOp and funct field of instruction the
ALU control generates the 3-bit ALU control field
Recall from Ch. 4
ALU control field | Function
000 | and
001 | or
010 | add
110 | sub
111 | slt
ALU must perform
add for load/stores (ALUOp 00)
sub for branches (ALUOp 01)
one of and, or, add, sub, slt for R-type instructions, depending on the
instruction's 6-bit funct field (ALUOp 10)
[Figure: Main Control sends the 2-bit ALUOp to the ALU Control, which also takes the 6-bit instruction funct field and produces the 3-bit ALU control input to the ALU]
Setting ALU Control Bits
Instruction opcode | ALUOp | Instruction operation | Funct field | Desired ALU action | ALU control input
LW | 00 | load word | xxxxxx | add | 010
SW | 00 | store word | xxxxxx | add | 010
Branch eq | 01 | branch eq | xxxxxx | subtract | 110
R-type | 10 | add | 100000 | add | 010
R-type | 10 | subtract | 100010 | subtract | 110
R-type | 10 | AND | 100100 | and | 000
R-type | 10 | OR | 100101 | or | 001
R-type | 10 | set on less | 101010 | set on less | 111

ALUOp1 ALUOp0 | F5 F4 F3 F2 F1 F0 | Operation
0 0 | X X X X X X | 010
0* 1 | X X X X X X | 110
1 X | X X 0 0 0 0 | 010
1 X | X X 0 0 1 0 | 110
1 X | X X 0 1 0 0 | 000
1 X | X X 0 1 0 1 | 001
1 X | X X 1 0 1 0 | 111
Truth table for ALU control bits
*Typo in text Fig. 5.15: if it is X then there is potential conflict between line 2 and lines 3-7!
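The truth table for the ALU control bits translates into a short lookup. A minimal sketch, assuming ALUOp is passed as a 2-bit integer and funct as the 6-bit field; only F3-F0 are examined for R-type instructions, as in the table. The function name is illustrative.

```python
def alu_control(alu_op, funct):
    """Return the 3-bit ALU control field for a 2-bit ALUOp and 6-bit funct field."""
    if alu_op == 0b00:            # load/store: add
        return 0b010
    if alu_op == 0b01:            # branch equal: subtract
        return 0b110
    # ALUOp = 10: R-type, decided by the low four funct bits F3-F0
    by_funct = {
        0b0000: 0b010,            # add
        0b0010: 0b110,            # subtract
        0b0100: 0b000,            # and
        0b0101: 0b001,            # or
        0b1010: 0b111,            # set on less than
    }
    return by_funct[funct & 0xF]

lw_sw = alu_control(0b00, 0b101010)   # funct is a don't-care for loads and stores
slt = alu_control(0b10, 0b101010)
```

A funct value outside the five listed raises `KeyError`, which is a choice of this sketch; the hardware truth table simply leaves those cases unspecified.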
Designing the Main Control
R-type: opcode (31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
Load/store or branch: opcode (31-26) | rs (25-21) | rt (20-16) | address (15-0)
Observations about MIPS instruction format
opcode is always in bits 31-26
two registers to be read are always rs (bits 25-21) and rt (bits 20-16)
base register for load/stores is always rs (bits 25-21)
16-bit offset for branch equal and load/store is always bits 15-0
destination register for loads is in bits 20-16 (rt) while for R-type
instructions it is in bits 15-11 (rd) (will require multiplexor to select)
Datapath with Control I
[Figure: MIPS Datapath III with control lines RegDst, RegWrite, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg and PCSrc; a new multiplexor controlled by RegDst selects Instruction [20-16] or Instruction [15-11] as the Write register; the ALU control takes Instruction [5-0] and ALUOp]
Adding control to the MIPS Datapath III (and a new multiplexor to select field to
specify destination register): what are the functions of the 9 control signals?
Control Signals
Signal name | Effect when deasserted | Effect when asserted
RegDst | The register destination number for the Write register comes from the rt field (bits 20-16) | The register destination number for the Write register comes from the rd field (bits 15-11)
RegWrite | None | The register on the Write register input is written with the value on the Write data input
ALUSrc | The second ALU operand comes from the second register file output (Read data 2) | The second ALU operand is the sign-extended, lower 16 bits of the instruction
PCSrc | The PC is replaced by the output of the adder that computes the value of PC + 4 | The PC is replaced by the output of the adder that computes the branch target
MemRead | None | Data memory contents designated by the address input are put on the Read data output
MemWrite | None | Data memory contents designated by the address input are replaced by the value of the Write data input
MemtoReg | The value fed to the register Write data input comes from the ALU | The value fed to the register Write data input comes from the data memory
Effects of the seven control signals
Datapath with Control II
[Figure: single-cycle datapath with the Control unit: Instruction [31-26] feeds Control, which drives RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; PCSrc selects the next PC; the ALU control takes Instruction [5-0] and ALUOp]
MIPS datapath with the control unit: input to control is the 6-bit instruction
opcode field, output is seven 1-bit signals and the 2-bit ALUOp signal
Datapath with
Control II (cont.)
PCSrc cannot be
set directly from the
opcode: zero test
outcome is required
[Figure: the same datapath with the Control unit, with the PCSrc signal annotated]
Determining control signals for the MIPS datapath based on instruction opcode
Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0
lw | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
sw | X | 1 | X | 0 | 0 | 1 | 0 | 0 | 0
beq | X | 0 | X | 0 | 0 | 0 | 1 | 0 | 1
Control Signals:
R-Type Instruction
[Figure: datapath annotated with the control signal values for an R-type instruction: RegDst = 1, RegWrite = 1; the ALU Operation value depends on funct]
Control Signals:
lw Instruction
[Figure: datapath annotated with the control signal values for lw: RegDst = 0, RegWrite = 1, ALU Operation = 010]
Control Signals:
sw Instruction
[Figure: datapath annotated with the control signal values for sw: RegDst = X, RegWrite = 0, ALU Operation = 010]
Control Signals:
beq Instruction
[Figure: datapath annotated with the control signal values for beq: RegDst = X, RegWrite = 0, ALU Operation = 110, PCSrc = 1 if Zero = 1]
Datapath with Control III
Jump: opcode (31-26) | address (25-0)
Composing jump target address: Instruction [25-0] (26 bits) is shifted left 2 to give 28 bits, which are joined with PC+4 [31-28] to form Jump address [31-0]
New multiplexor with additional control bit Jump
[Figure: datapath with the Control unit extended with the jump address path and a second PC multiplexor controlled by Jump]
MIPS datapath extended to jumps: control unit generates new Jump control bit
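Composing the jump target is pure bit manipulation. A sketch, assuming the PC and the instruction are unsigned 32-bit integers; the function name is illustrative:

```python
def jump_target(pc, instr):
    """Jump address [31-0] = PC+4 [31-28] joined with (Instruction [25-0] << 2)."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    upper = pc_plus_4 & 0xF0000000              # top 4 bits of PC + 4
    lower = (instr & 0x03FFFFFF) << 2           # 26-bit address field, now 28 bits
    return upper | lower

J_OPCODE = 0b000010                             # opcode of j
low = jump_target(0x00400000, (J_OPCODE << 26) | 0x100000)
high = jump_target(0x90000000, (J_OPCODE << 26) | 1)
```

Because the upper four bits come from PC + 4, a jump stays within the current 256 MB region of the address space.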
Datapath Executing j
R-type Instruction: Step 1
add $t1, $t2, $t3 (active = bold)
Fetch instruction and increment PC
[Figure: single-cycle datapath with the Control unit; the instruction fetch and PC + 4 path is active]
R-type Instruction: Step 2
add $t1, $t2, $t3 (active = bold)
Read two source registers from the register file
[Figure: single-cycle datapath with the Control unit; the register read path is active]
R-type Instruction: Step 3
add $t1, $t2, $t3 (active = bold)
ALU operates on the two register operands
[Figure: single-cycle datapath with the Control unit; the ALU path is active]
R-type Instruction: Step 4
add $t1, $t2, $t3 (active = bold)
Write result to register
[Figure: single-cycle datapath with the Control unit; the register write-back path is active]
Single-cycle Implementation
Notes
The steps are not really distinct as each instruction
completes in exactly one clock cycle – they simply indicate
the sequence of data flowing through the datapath
The operation of the datapath during a cycle is purely
combinational – nothing is stored during a clock cycle
Therefore, the machine is stable in a particular state at the
start of a cycle and reaches a new stable state only at the
end of the cycle
Very important for understanding single-cycle computing:
See our simple Verilog single-cycle computer in the folder
SimpleSingleCycleComputer in Verilog/Examples
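As a rough software analogue, one clock cycle of the single-cycle machine can be modeled as a single function call in which every value is computed from the state at the start of the cycle and state is updated only at the end. This is a sketch under simplifying assumptions: only add, sub, and, or, slt, lw, sw and beq are handled, memories are dictionaries keyed by byte address, and all names (`step`, `CONTROL`, the tiny program) are illustrative rather than taken from the Verilog example.

```python
MASK = 0xFFFFFFFF

def sign_extend16(x):
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def to_signed32(x):
    return x - (1 << 32) if x & 0x80000000 else x

# Main control outputs per opcode (None marks a don't-care).
CONTROL = {
    0x00: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1, MemWrite=0, Branch=0, ALUOp=0b10),
    0x23: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1, MemWrite=0, Branch=0, ALUOp=0b00),
    0x2B: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0, MemWrite=1, Branch=0, ALUOp=0b00),
    0x04: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0, MemWrite=0, Branch=1, ALUOp=0b01),
}

def alu_control(alu_op, funct):
    if alu_op == 0b00:
        return 0b010
    if alu_op == 0b01:
        return 0b110
    return {0x0: 0b010, 0x2: 0b110, 0x4: 0b000, 0x5: 0b001, 0xA: 0b111}[funct & 0xF]

def alu(operation, a, b):
    if operation == 0b000:
        r = a & b
    elif operation == 0b001:
        r = a | b
    elif operation == 0b010:
        r = a + b
    elif operation == 0b110:
        r = a - b
    else:                                   # 111: set on less than (signed)
        r = int(to_signed32(a) < to_signed32(b))
    r &= MASK
    return r, int(r == 0)                   # (ALU result, Zero)

def step(pc, regs, imem, dmem):
    """Execute one instruction (one clock cycle) and return the next PC."""
    instr = imem[pc]
    c = CONTROL[instr >> 26]
    rs, rt, rd = (instr >> 21) & 31, (instr >> 16) & 31, (instr >> 11) & 31
    imm = sign_extend16(instr)
    a, read2 = regs[rs], regs[rt]
    b = (imm & MASK) if c["ALUSrc"] else read2
    result, zero = alu(alu_control(c["ALUOp"], instr & 0x3F), a, b)
    pc_plus_4 = (pc + 4) & MASK
    branch_target = (pc_plus_4 + (imm << 2)) & MASK
    if c["MemWrite"]:
        dmem[result] = read2
    if c["RegWrite"]:
        write_data = dmem.get(result, 0) if c["MemtoReg"] else result
        regs[rd if c["RegDst"] else rt] = write_data
    return branch_target if (c["Branch"] and zero) else pc_plus_4   # PCSrc = Branch AND Zero

def r_type(rs, rt, rd, funct):
    return (rs << 21) | (rt << 16) | (rd << 11) | funct

def i_type(op, rs, rt, imm):
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

imem = {
    0:  i_type(0x23, 8, 9, 0),       # lw  $9, 0($8)
    4:  r_type(9, 9, 10, 0x20),      # add $10, $9, $9
    8:  i_type(0x2B, 8, 10, 4),      # sw  $10, 4($8)
    12: i_type(0x04, 9, 9, -4),      # beq $9, $9, -4  (back to address 0)
}
regs = [0] * 32
regs[8] = 100
dmem = {100: 7}
pc = 0
for _ in range(4):
    pc = step(pc, regs, imem, dmem)
```

Nothing inside `step` is remembered between calls except `regs`, `dmem` and the returned PC, which mirrors the point that the datapath within a cycle is purely combinational.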
Load Instruction Steps
lw $t1, offset($t2)
1. Fetch instruction and increment PC
2. Read base register from the register file: the base register ($t2) is given by bits 25-21 of the instruction
3. ALU computes sum of value read from the register file and the sign-extended lower 16 bits (offset) of the instruction
4. The sum from the ALU is used as the address for the data memory
5. The data from the memory unit is written into the register file: the destination register ($t1) is given by bits 20-16 of the instruction
Load Instruction
lw $t1, offset($t2)
[Figure: single-cycle datapath with the Control unit; the path used by the load instruction is active]
Branch Instruction Steps
beq $t1, $t2, offset
1. Fetch instruction and increment PC
2. Read two registers ($t1 and $t2) from the register file
3. ALU performs a subtract on the data values from the register file; the value of PC+4 is added to the sign-extended lower 16 bits (offset) of the instruction shifted left by two to give the branch target address
4. The Zero result from the ALU is used to decide which adder result (from step 1 or 3) to store in the PC
Branch Instruction
beq $t1, $t2, offset
[Figure: single-cycle datapath with the Control unit; the path used by the branch instruction is active]
Implementation: ALU Control Block
ALUOp1 ALUOp0 | F5 F4 F3 F2 F1 F0 | Operation
0 0 | X X X X X X | 010
0* 1 | X X X X X X | 110
1 X | X X 0 0 0 0 | 010
1 X | X X 0 0 1 0 | 110
1 X | X X 0 1 0 0 | 000
1 X | X X 0 1 0 1 | 001
1 X | X X 1 0 1 0 | 111
Truth table for ALU control bits
[Figure: ALU control block: ALU control logic with inputs ALUOp1, ALUOp0 and F3-F0 of the funct field F (5-0), and outputs Operation2, Operation1, Operation0]
*Typo in text Fig. 5.15: if it is X then there is potential conflict between line 2 and lines 3-7!
Implementation: Main Control
Block
Signal name | R-format | lw | sw | beq
Inputs
Op5 | 0 | 1 | 1 | 0
Op4 | 0 | 0 | 0 | 0
Op3 | 0 | 0 | 1 | 0
Op2 | 0 | 0 | 0 | 1
Op1 | 0 | 1 | 1 | 0
Op0 | 0 | 1 | 1 | 0
Outputs
RegDst | 1 | 0 | x | x
ALUSrc | 0 | 1 | 1 | 0
MemtoReg | 0 | 1 | x | x
RegWrite | 1 | 1 | 0 | 0
MemRead | 0 | 1 | 0 | 0
MemWrite | 0 | 0 | 1 | 0
Branch | 0 | 0 | 0 | 1
ALUOp1 | 1 | 0 | 0 | 0
ALUOp0 | 0 | 0 | 0 | 1
Truth table for main control signals
[Figure: main control PLA with inputs Op5-Op0, one product term each for R-format, lw, sw and beq, and outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]
Main control PLA (programmable
logic array): principle underlying
PLAs is that any logical expression
can be written as a sum-of-products
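The sum-of-products idea can be made concrete for the main control truth table: each instruction class is one product (AND) term over the opcode bits or their complements, and each output is the sum (OR) of the terms for which it is 1. The sketch below assumes don't-care outputs are driven to 0; the function name is illustrative.

```python
def main_control(op):
    """Main control as sum-of-products over the 6 opcode bits Op5..Op0."""
    b = [(op >> i) & 1 for i in range(6)]   # b[i] is Op_i
    n = [1 - x for x in b]                  # complemented inputs
    # AND plane: one product term per instruction class
    r_format = n[5] & n[4] & n[3] & n[2] & n[1] & n[0]   # 000000
    lw       = b[5] & n[4] & n[3] & n[2] & b[1] & b[0]   # 100011
    sw       = b[5] & n[4] & b[3] & n[2] & b[1] & b[0]   # 101011
    beq      = n[5] & n[4] & n[3] & b[2] & n[1] & n[0]   # 000100
    # OR plane: each output sums the terms in which it is asserted
    return {
        "RegDst":   r_format,
        "ALUSrc":   lw | sw,
        "MemtoReg": lw,
        "RegWrite": r_format | lw,
        "MemRead":  lw,
        "MemWrite": sw,
        "Branch":   beq,
        "ALUOp1":   r_format,
        "ALUOp0":   beq,
    }

lw_signals = main_control(0b100011)
```

Choosing 0 for the don't-cares keeps every output a plain OR of existing product terms, so no extra terms are needed.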
Single-Cycle Design Problems
Assuming a fixed-period clock, every instruction uses one
clock cycle, which implies:
CPI = 1
cycle time determined by length of the longest instruction path
(load)
but several instructions could run in a shorter clock cycle: waste of time
consider if we have more complicated instructions like floating point!
resources used more than once in the same cycle need to be
duplicated
waste of hardware and chip area
Example: Fixed-period clock vs.
variable-period clock in a
single-cycle implementation
Consider a machine with an additional floating point unit. Assume
functional unit delays as follows
memory: 2 ns., ALU and adders: 2 ns., FPU add: 8 ns., FPU multiply: 16 ns.,
register file access (read or write): 1 ns.
multiplexors, control unit, PC accesses, sign extension, wires: no delay
Assume instruction mix as follows
all loads take same time and comprise 31%
all stores take same time and comprise 21%
R-format instructions comprise 27%
branches comprise 5%
jumps comprise 2%
FP adds and subtracts take the same time and totally comprise 7%
FP multiplies and divides take the same time and totally comprise 7%
Compare the performance of (a) a single-cycle implementation using a fixed-period clock with (b) one using a variable-period clock where each instruction
executes in one clock cycle that is only as long as it needs to be (not really
practical but pretend it’s possible!)
Solution
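Instruction class | Instr. mem. | Register read | ALU oper. | Data mem. | Register write | FPU add/sub | FPU mul/div | Total time (ns.)
Load word | 2 | 1 | 2 | 2 | 1 | | | 8
Store word | 2 | 1 | 2 | 2 | | | | 7
R-format | 2 | 1 | 2 | 0 | 1 | | | 6
Branch | 2 | 1 | 2 | | | | | 5
Jump | 2 | | | | | | | 2
FP mul/div | 2 | 1 | | | 1 | | 16 | 20
FP add/sub | 2 | 1 | | | 1 | 8 | | 12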
Clock period for fixed-period clock = longest instruction time = 20 ns.
Average clock period for variable-period clock = 8 × 31% +
7 × 21% + 6 × 27% + 5 × 5% + 2 × 2% + 20 × 7% + 12 × 7%
= 8.1 ns.
Therefore, performance(var-period) / performance(fixed-period) = 20/8.1 ≈ 2.5
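The comparison is a weighted average, which is easy to evaluate term by term. The sketch below takes the per-class total times and the instruction mix from the example; the variable names are illustrative.

```python
# Total time per instruction class in ns, from the solution table
times_ns = {
    "load": 8, "store": 7, "R-format": 6, "branch": 5,
    "jump": 2, "FP mul/div": 20, "FP add/sub": 12,
}
# Instruction mix as fractions of all executed instructions
mix = {
    "load": 0.31, "store": 0.21, "R-format": 0.27, "branch": 0.05,
    "jump": 0.02, "FP mul/div": 0.07, "FP add/sub": 0.07,
}

fixed_period = max(times_ns.values())                          # slowest instruction sets the clock
variable_period = sum(times_ns[k] * mix[k] for k in times_ns)  # average cycle length
ratio = fixed_period / variable_period                         # performance(var) / performance(fixed)

print(f"fixed: {fixed_period} ns, variable (average): {variable_period:.2f} ns, ratio: {ratio:.2f}")
```

Since both designs have CPI = 1 and execute the same instructions, the performance ratio reduces to the ratio of clock periods.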
Fixing the problem with single-cycle designs
One solution: a variable-period clock with different cycle
times for each instruction class
infeasible, as implementing a variable-speed clock is technically
difficult
Another solution:
use a smaller cycle time…
…have different instructions take different numbers of cycles
by breaking instructions into steps and fitting each step into one
cycle
feasible: multicycle approach!
Multicycle Approach
Break up the instructions into steps
each step takes one clock cycle
balance the amount of work to be done in each step/cycle so that
they are about equal
restrict each cycle to use at most once each major functional unit
so that such units do not have to be replicated
functional units can be shared between different cycles within one
instruction
Between steps/cycles
At the end of one cycle store data to be used in later cycles of the
same instruction
need to introduce additional internal (programmer-invisible) registers
for this purpose
Data to be used in later instructions are stored in programmer-visible state elements: the register file, PC, memory
Multicycle Approach
Note particularities of multicycle vs. single-cycle diagrams
single memory for data and instructions
single ALU, no extra adders
extra registers to hold data between clock cycles
[Figure: single-cycle datapath, shown for comparison]
[Figure: multicycle datapath (high-level view): PC, a single memory for instructions or data, Instruction register, Memory data register, Registers, A and B registers, ALU, ALUOut]
Multicycle Datapath
[Figure: multicycle datapath: PC, a multiplexor selecting the memory Address, a single Memory, Instruction register, Memory data register, Registers, A and B registers, Sign extend and Shift left 2, multiplexors on both ALU inputs (the second choosing among B, the constant 4, and the sign-extended offset unshifted or shifted), ALU, ALUOut]
Basic multicycle MIPS datapath handles R-type instructions and load/stores:
new internal register in red ovals, new multiplexors in blue ovals
Breaking instructions into steps
Our goal is to break up the instructions into steps so that
each step takes one clock cycle
the amount of work to be done in each step/cycle is about equal
each cycle uses at most once each major functional unit so that
such units do not have to be replicated
functional units can be shared between different cycles within one
instruction
Data at end of one cycle to be used in next must be stored !!
Breaking instructions into steps
We break instructions into the following potential execution steps
– not all instructions require all the steps – each step takes one
clock cycle
1. Instruction fetch and PC increment (IF)
2. Instruction decode and register fetch (ID)
3. Execution, memory address computation, or branch completion (EX)
4. Memory access or R-type instruction completion (MEM)
5. Memory read completion (WB)
Each MIPS instruction takes from 3 – 5 cycles (steps)
Step 1: Instruction Fetch &
PC Increment (IF)
Use PC to get instruction and put it in the instruction register.
Increment the PC by 4 and put the result back in the PC.
Can be described succinctly using RTL (Register-Transfer Language):
IR = Memory[PC];
PC = PC + 4;
Step 2: Instruction Decode and
Register Fetch (ID)
Read registers rs and rt in case we need them.
Compute the branch address in case the instruction is a branch.
RTL:
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);
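The branch-address computation can be checked numerically. Below is a small Python sketch, assuming 32-bit wraparound arithmetic; the function names are hypothetical. Note that by Step 2 the PC already holds the incremented value PC + 4 from Step 1.

```python
def sign_extend16(imm16):
    """Sign-extend a 16-bit field to a (possibly negative) Python int."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def branch_target(pc, instruction):
    """ALUOut = PC + (sign-extend(IR[15-0]) << 2), wrapped to 32 bits.

    `pc` is the already-incremented PC (the value written in Step 1).
    """
    offset = sign_extend16(instruction & 0xFFFF)
    return (pc + (offset << 2)) & 0xFFFFFFFF

# beq $0, $0, -1 encoded as 0x1000FFFF: branches back one word
print(hex(branch_target(0x00400008, 0x1000FFFF)))
```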
Step 3: Execution, Address
Computation or Branch Completion
(EX)
ALU performs one of four functions depending on instruction
type
memory reference:
ALUOut = A + sign-extend(IR[15-0]);
R-type:
ALUOut = A op B;
branch (instruction completes):
if (A==B) PC = ALUOut;
jump (instruction completes):
PC = PC[31-28] || (IR[25-0] << 2);
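The jump-address concatenation can be sketched with bit masks. This is an illustrative Python version; the function name jump_target is an assumption, not from the slides.

```python
def jump_target(pc, instruction):
    """PC = PC[31-28] || (IR[25-0] << 2).

    The upper 4 bits come from the current PC; the lower 28 bits are the
    26-bit address field shifted left by 2.
    """
    upper = pc & 0xF0000000
    lower = (instruction & 0x03FFFFFF) << 2
    return upper | lower

# j with address field 0x0100000, executed while PC = 0x00400004
print(hex(jump_target(0x00400004, 0x08100000)))
```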
Step 4: Memory Access or R-type Instruction Completion
(MEM)
Again depending on instruction type:
Loads and stores access memory
load
MDR = Memory[ALUOut];
store (instruction completes)
Memory[ALUOut] = B;
R-type (instruction completes)
Reg[IR[15-11]] = ALUOut;
Step 5: Memory Read
Completion (WB)
Again depending on instruction type:
Load writes back (instruction completes)
Reg[IR[20-16]]= MDR;
Important: There is no reason from a datapath (or control) point
of view that Step 5 cannot be eliminated by performing
Reg[IR[20-16]]= Memory[ALUOut];
for loads in Step 4. This would eliminate the MDR as well.
The reason this is not done is that, to keep steps balanced in
length, the design restriction is to allow each step to contain
at most one ALU operation, or one register access, or one
memory access.
Summary of Instruction
Execution
Step 1: IF (Instruction fetch)
  All instructions:
    IR = Memory[PC]
    PC = PC + 4
Step 2: ID (Instruction decode/register fetch)
  All instructions:
    A = Reg[IR[25-21]]
    B = Reg[IR[20-16]]
    ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Step 3: EX (Execution, address computation, branch/jump completion)
  Action for R-type instructions: ALUOut = A op B
  Action for memory-reference instructions: ALUOut = A + sign-extend(IR[15-0])
  Action for branches: if (A == B) then PC = ALUOut
  Action for jumps: PC = PC[31-28] || (IR[25-0] << 2)
Step 4: MEM (Memory access or R-type completion)
  Action for R-type instructions: Reg[IR[15-11]] = ALUOut
  Action for memory-reference instructions:
    Load: MDR = Memory[ALUOut]
    or
    Store: Memory[ALUOut] = B
Step 5: WB (Memory read completion)
  Action for memory-reference instructions: Load: Reg[IR[20-16]] = MDR
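The summary can be exercised with a toy interpreter that follows the RTL step by step and reports how many cycles each instruction uses. This is only a sketch under stated assumptions: the state layout (a dict with "PC", "Reg", "Mem"), the function names, and the restriction to add, sub, and, or, lw, sw, beq, and j are choices made for illustration, and register 0 is not modelled as hard-wired to zero.

```python
def sign_extend16(value):
    """Sign-extend the low 16 bits of value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

# funct codes for the R-type operations handled by this sketch
ALU_OPS = {
    0x20: lambda a, b: a + b,   # add
    0x22: lambda a, b: a - b,   # sub
    0x24: lambda a, b: a & b,   # and
    0x25: lambda a, b: a | b,   # or
}

def run_instruction(state):
    """Run one instruction; return the number of steps (cycles) it used."""
    reg, mem = state["Reg"], state["Mem"]
    # Step 1 (IF): IR = Memory[PC]; PC = PC + 4
    ir = mem[state["PC"]]
    state["PC"] = (state["PC"] + 4) & 0xFFFFFFFF
    # Step 2 (ID): read rs and rt, compute the branch target speculatively
    a = reg[(ir >> 21) & 0x1F]
    b = reg[(ir >> 16) & 0x1F]
    alu_out = (state["PC"] + (sign_extend16(ir) << 2)) & 0xFFFFFFFF
    op = (ir >> 26) & 0x3F
    # Step 3 (EX) and later steps depend on the instruction class
    if op == 0x04:                          # beq completes in step 3
        if a == b:
            state["PC"] = alu_out
        return 3
    if op == 0x02:                          # j completes in step 3
        state["PC"] = (state["PC"] & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
        return 3
    if op == 0x00:                          # R-type
        alu_out = ALU_OPS[ir & 0x3F](a, b) & 0xFFFFFFFF
        reg[(ir >> 11) & 0x1F] = alu_out    # Step 4: Reg[rd] = ALUOut
        return 4
    if op in (0x23, 0x2B):                  # lw / sw
        alu_out = (a + sign_extend16(ir)) & 0xFFFFFFFF
        if op == 0x2B:
            mem[alu_out] = b                # Step 4: store completes
            return 4
        mdr = mem[alu_out]                  # Step 4: MDR = Memory[ALUOut]
        reg[(ir >> 16) & 0x1F] = mdr        # Step 5: Reg[rt] = MDR
        return 5
    raise ValueError("opcode not handled by this sketch")
```

Calling run_instruction repeatedly on a state whose "Mem" holds encoded instructions gives the per-instruction cycle counts (3, 4, or 5) listed in the summary.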
Multicycle Execution Step (1):
Instruction Fetch
IR = Memory[PC];
PC = PC + 4;
[Figure: datapath for this step; the ALU adds 4 to the PC to produce PC + 4]
Multicycle Execution Step (2):
Instruction Decode & Register Fetch
A = Reg[IR[25-21]];
(A = Reg[rs])
B = Reg[IR[20-16]];
(B = Reg[rt])
ALUOut = PC + (sign-extend(IR[15-0]) << 2);
[Figure: datapath for this step; annotated values: Reg[rs], Reg[rt], PC + 4, branch target address]
Multicycle Execution Step (3):
Memory Reference Instructions
ALUOut = A + sign-extend(IR[15-0]);
[Figure: datapath for this step; annotated values: Reg[rs], Reg[rt], PC + 4, memory address]
Multicycle Execution Step (3):
ALU Instruction (R-Type)
ALUOut = A op B
[Figure: datapath for this step; annotated values: Reg[rs], Reg[rt], PC + 4, R-type result]
Multicycle Execution Step (3):
Branch Instructions
if (A == B) PC = ALUOut;
[Figure: datapath for this step; annotated values: Reg[rs], Reg[rt], branch target address]
Multicycle Execution Step (3):
Jump Instruction
PC = PC[31-28] concat (IR[25-0] << 2)
[Figure: datapath for this step; annotated values: Reg[rs], Reg[rt], jump address, branch target address]
Multicycle Execution Step (4):
Memory Access - Read (lw)
MDR = Memory[ALUOut];
[Figure: datapath for this step; annotated values: Reg[rs], Reg[rt], PC + 4, memory address, memory data]
Multicycle Execution Step (4):
Memory Access - Write (sw)
Memory[ALUOut] = B;
[Figure: datapath for this step; annotated values: Reg[rs], Reg[rt], PC + 4]
Multicycle Execution Step (4):
ALU Instruction (R-Type)
Reg[IR[15-11]] = ALUOut;
[Figure: datapath for this step; annotated values: Reg[rs], Reg[rt], PC + 4, R-type result]
Multicycle Execution Step (5):
Memory Read Completion (lw)
Reg[IR[20-16]] = MDR;
[Figure: datapath for this step; annotated values: Reg[rs], Reg[rt], PC + 4, memory address, memory data]
Multicycle Datapath with Control I
[Figure: multicycle datapath with control lines IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA, ALUSrcB, ALUOp, MemtoReg, and the ALU control block driven by Instruction [5-0]]
… with control lines and the ALU control block added – not all control lines are shown
Multicycle Datapath with Control II
[Figure: multicycle datapath with the main control block. Additions: new gates combining PCWrite, PCWriteCond, and Zero; a new multiplexor controlled by PCSource; a shift left 2 and concatenation with PC [31-28] for the jump address. Control inputs: Op = Instruction [31-26]. Control outputs: PCWriteCond, PCWrite, PCSource, ALUOp, IorD, ALUSrcB, MemRead, ALUSrcA, MemWrite, RegWrite, MemtoReg, RegDst, IRWrite]
Complete multicycle MIPS datapath (with branch and jump capability)
and showing the main control block and all control lines
Multicycle Control Step (1):
Fetch
IR = Memory[PC];
PC = PC + 4;
[Figure: complete multicycle datapath annotated with the control signal values for this step]
Multicycle Control Step (2):
Instruction Decode & Register Fetch
A = Reg[IR[25-21]];
(A = Reg[rs])
B = Reg[IR[20-16]];
(B = Reg[rt])
ALUOut = PC + (sign-extend(IR[15-0]) << 2);
[Figure: complete multicycle datapath annotated with the control signal values for this step]
Multicycle Control Step (3):
Memory Reference Instructions
ALUOut = A + sign-extend(IR[15-0]);
[Figure: complete multicycle datapath annotated with the control signal values for this step]
Multicycle Control Step (3):
ALU Instruction (R-Type)
ALUOut = A op B;
[Figure: complete multicycle datapath annotated with the control signal values for this step]
Multicycle Control Step (3):
Branch Instructions
if (A == B) PC = ALUOut;
[Figure: complete multicycle datapath annotated with the control signal values for this step; the PC write enable is 1 if Zero = 1]
Multicycle Control Step (3):
Jump Instruction
PC = PC[31-28] concat (IR[25-0] << 2);
[Figure: complete multicycle datapath annotated with the control signal values for this step]
Multicycle Control Step (4):
Memory Access - Read (lw)
MDR = Memory[ALUOut];
[Figure: complete multicycle datapath annotated with the control signal values for this step]
Multicycle Control Step (4):
Memory Access - Write (sw)
Memory[ALUOut] = B;
[Figure: complete multicycle datapath annotated with the control signal values for this step]
Multicycle Control Step (4):
ALU Instruction (R-Type)
Reg[IR[15-11]] = ALUOut;
(Reg[rd] = ALUOut)
[Figure: complete multicycle datapath annotated with the control signal values for this step]
Multicycle Control Step (5):
Memory Read Completion (lw)
Reg[IR[20-16]] = MDR;
[Figure: complete multicycle datapath annotated with the control signal values for this step]
Simple Questions
How many cycles will it take to execute this code?
Label:
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label #assume not equal
add $t5, $t2, $t3
sw $t5, 8($t3)
...
What is going on during the 8th cycle of execution?
Clock time-line
In what cycle does the actual addition of $t2 and $t3 take place?
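One way to work through these questions is to lay the steps out on a clock time-line. The sketch below assumes the per-class cycle counts of the multicycle design (lw 5, sw 4, R-type 4, beq 3) and that instructions run strictly one after another; the names CYCLES and timeline are illustrative.

```python
# Cycles per instruction in the multicycle design (one cycle per step).
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

# The five instructions of the example, in program order.
program = ["lw", "lw", "beq", "add", "sw"]

def timeline(program):
    """Return (cycle, instruction index, step number) for every clock cycle."""
    out = []
    cycle = 0
    for index, op in enumerate(program):
        for step in range(1, CYCLES[op] + 1):
            cycle += 1
            out.append((cycle, index, step))
    return out

t = timeline(program)
print(len(t))   # total number of cycles
print(t[7])     # entry for the 8th cycle
# cycle in which the add (index 3) is in step 3, its execution step
print([c for (c, i, s) in t if i == 3 and s == 3])
```

Under these assumptions the code takes 21 cycles, the 8th cycle is step 3 (address computation) of the second lw, and the add reaches its execution step in cycle 16.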
Implementing Control
Value of control signals is dependent upon:
what instruction is being executed
which step is being performed
Use the information we have accumulated to specify a finite
state machine
specify the finite state machine graphically, or
use microprogramming
Implementation is then derived from the specification
Review: Finite State Machines
Finite state machines (FSMs):
a set of states and
next state function, determined by current state and the input
output function, determined by current state and possibly input
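[Figure: FSM block diagram. Inputs and Current state feed the Next-state function and the Output function; Next state is loaded into Current state on the Clock; the Output function produces the Outputs]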
We’ll use a Moore machine – output based only on current state
Example: Moore Machine
The Moore machine below, given input a binary string
terminated by “#”, will output “even” if the string has an even
number of 0’s and “odd” if the string has an odd number of 0’s
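Even state (Start): no output
  on 0: go to Odd state
  on 1: stay in Even state
  on #: go to Output even state
Odd state: no output
  on 0: go to Even state
  on 1: stay in Odd state
  on #: go to Output odd state
Output even state: output “even”
Output odd state: output “odd”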
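The Moore machine of this example can be written as a transition table plus an output table indexed by state alone. This is a minimal Python sketch; the state names and the function run are illustrative, and it assumes the machine starts in the even state and that the input ends at the first “#”.

```python
# Next-state function: (current state, input symbol) -> next state
TRANSITIONS = {
    ("even", "0"): "odd",
    ("even", "1"): "even",
    ("even", "#"): "output_even",
    ("odd", "0"): "even",
    ("odd", "1"): "odd",
    ("odd", "#"): "output_odd",
}

# Output function of a Moore machine: depends only on the current state
OUTPUTS = {
    "even": None,
    "odd": None,
    "output_even": "even",
    "output_odd": "odd",
}

def run(string):
    """Feed a '#'-terminated binary string to the machine; return its output."""
    state = "even"  # zero 0's seen so far is an even count
    for symbol in string:
        state = TRANSITIONS[(state, symbol)]
    return OUTPUTS[state]

print(run("0100#"))  # three 0's
```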
FSM Control: High-level View
Start
Instruction fetch/decode and register fetch
(Figure 5.37)
Memory access
instructions
(Figure 5.38)
R-type instructions
(Figure 5.39)
Branch instruction
(Figure 5.40)
Jump instruction
(Figure 5.41)
High-level view of FSM control
State 0: Instruction fetch (Start)
  MemRead
  ALUSrcA = 0
  IorD = 0
  IRWrite
  ALUSrcB = 01
  ALUOp = 00
  PCWrite
  PCSource = 00
State 1: Instruction decode/Register fetch
  ALUSrcA = 0
  ALUSrcB = 11
  ALUOp = 00
From state 1:
  (Op = 'LW') or (Op = 'SW'): Memory reference FSM (Figure 5.38)
  (Op = R-type): R-type FSM (Figure 5.39)
  (Op = 'BEQ'): Branch FSM (Figure 5.40)
  (Op = 'JMP'): Jump FSM (Figure 5.41)
Asserted signals shown inside state circles
The instruction fetch and decode steps of every instruction are identical
FSM Control: Memory Reference
From state 1
(Op = 'LW') or (Op = 'SW')
Memory address computation
2
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
(Op = 'LW')
Memory access
3
MemRead
IorD = 1
Write-back step
4
RegWrite
MemtoReg = 1
RegDst = 0
To state 0
(Figure 5.37)
(Op = 'SW')
Memory access
5
MemWrite
IorD = 1
To state 0
(Figure 5.37)
FSM control for memory-reference has 4 states
FSM Control: R-type Instruction
From state 1
(Op = R-type)
Execution
6
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
R-type completion
7
RegDst = 1
RegWrite
MemtoReg = 0
To state 0
(Figure 5.37)
FSM control to implement R-type instructions has 2 states
FSM Control: Branch Instruction
From state 1
(Op = 'BEQ')
Branch completion
8
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 01
PCWriteCond
PCSource = 01
To state 0
(Figure 5.37)
FSM control to implement branches has 1 state
FSM Control: Jump Instruction
From state 1
(Op = 'J')
Jump completion
9
PCWrite
PCSource = 10
To state 0
(Figure 5.37)
FSM control to implement jumps has 1 state
FSM Control: Complete View
State 0 (IF): Instruction fetch, Start
  MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
  next: state 1
State 1 (ID): Instruction decode/register fetch
  ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
  next: state 2 if (Op = 'LW') or (Op = 'SW'); state 6 if (Op = R-type); state 8 if (Op = 'BEQ'); state 9 if (Op = 'J')
State 2 (EX): Memory address computation
  ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
  next: state 3 if (Op = 'LW'); state 5 if (Op = 'SW')
State 3 (MEM): Memory access
  MemRead, IorD = 1
  next: state 4
State 4 (WB): Write-back step
  RegDst = 0, RegWrite, MemtoReg = 1
  next: state 0
State 5 (MEM): Memory access
  MemWrite, IorD = 1
  next: state 0
State 6 (EX): Execution
  ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
  next: state 7
State 7 (MEM): R-type completion
  RegDst = 1, RegWrite, MemtoReg = 0
  next: state 0
State 8 (EX): Branch completion
  ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
  next: state 0
State 9 (EX): Jump completion
  PCWrite, PCSource = 10
  next: state 0
Labels on arcs are conditions
that determine next state
The complete FSM control for the multicycle MIPS datapath:
refer to Multicycle Datapath with Control II
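The next-state part of this FSM is small enough to write out directly. The sketch below uses the state numbers 0 to 9 from the FSM slides; the function names next_state and trace and the opcode labels are illustrative assumptions.

```python
def next_state(state, op):
    """Next-state function of the control FSM (states 0-9).

    `op` is one of "lw", "sw", "R-type", "beq", "j" and only matters
    in states 1 and 2, where the arcs are labelled with opcode conditions.
    """
    if state == 0:                       # instruction fetch
        return 1
    if state == 1:                       # decode: dispatch on opcode
        return {"lw": 2, "sw": 2, "R-type": 6, "beq": 8, "j": 9}[op]
    if state == 2:                       # memory address computation
        return 3 if op == "lw" else 5
    if state == 3:                       # load memory access
        return 4
    if state == 6:                       # R-type execution
        return 7
    return 0                             # states 4, 5, 7, 8, 9 return to fetch

def trace(op):
    """States visited while executing one instruction of class `op`."""
    states = [0]
    while True:
        s = next_state(states[-1], op)
        if s == 0:
            return states
        states.append(s)

for op in ("lw", "sw", "R-type", "beq", "j"):
    print(op, trace(op))
```

The length of each trace is the number of clock cycles that instruction class takes.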
Example: CPI in a multicycle
CPU
Assume
the control design of the previous slide
An instruction mix of 22% loads, 11% stores, 49% R-type
operations, 16% branches, and 2% jumps
What is the CPI assuming each step requires 1 clock cycle?
Solution:
Number of clock cycles from previous slide for each instruction class:
loads 5, stores 4, R-type instructions 4, branches 3, jumps 3
CPI = CPU clock cycles / instruction count
= Σ (instruction count_class i × CPI_class i) / instruction count
= Σ (instruction count_class i / instruction count) × CPI_class i
= 0.22 × 5 + 0.11 × 4 + 0.49 × 4 + 0.16 × 3 + 0.02 × 3
= 4.04
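The weighted sum is easy to verify. A short Python check follows, using the instruction mix and cycle counts stated in the example; the dictionary names are illustrative.

```python
# Instruction mix (fractions) and cycles per class, as given in the example.
MIX = {"loads": 0.22, "stores": 0.11, "R-type": 0.49, "branches": 0.16, "jumps": 0.02}
CYCLES = {"loads": 5, "stores": 4, "R-type": 4, "branches": 3, "jumps": 3}

# CPI is the frequency-weighted average of the per-class cycle counts.
cpi = sum(MIX[c] * CYCLES[c] for c in MIX)
print(round(cpi, 2))
```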
FSM Control:
Implementation
[Figure: control logic block with a state register. Inputs: opcode bits Op5-Op0 from the instruction register opcode field and current-state bits S3-S0 from the state register. Outputs: control signals PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and next-state bits NS3-NS0]
Four state bits are required for 10 states
High-level view of FSM implementation: inputs to the combinational logic block are
the current state number and instruction opcode bits; outputs are the next state
number and control signals to be asserted for the current state
FSM Control: PLA Implementation
[Figure: PLA. Inputs: Op5, Op4, Op3, Op2, Op1, Op0, S3, S2, S1, S0. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst, NS3, NS2, NS1, NS0]
The upper half is the AND plane that computes all the products. The products are carried
to the lower OR plane by the vertical lines. The sum term for each output is given by
the corresponding horizontal line
E.g., IorD = S0.S1.S2'.S3' + S0.S1'.S2.S3' (states 3 and 5), where ' denotes complement
FSM Control: ROM
Implementation
ROM (Read Only Memory)
values of memory locations are fixed ahead of time
A ROM can be used to implement a truth table
if the address is m bits, we can address 2^m entries in the ROM
outputs are the bits of the entry the address points to
[Figure: ROM with an m-bit address input and an n-bit output; example contents for m = 3, n = 4]
address | output
0 0 0 | 0 0 1 1
0 0 1 | 1 1 0 0
0 1 0 | 1 1 0 0
0 1 1 | 1 0 0 0
1 0 0 | 0 0 0 0
1 0 1 | 0 0 0 1
1 1 0 | 0 1 1 0
1 1 1 | 0 1 1 1
The size of an m-input n-output ROM is 2^m x n bits. Such a ROM can
be thought of as an array of size 2^m with each entry in the array being
n bits
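The array view of a ROM translates directly into code. In this Python sketch the contents are the m = 3, n = 4 example entries from the slide's figure; the names ROM, rom_read, and rom_size_bits are illustrative.

```python
# 2^3 = 8 entries of 4 bits each; the index is the 3-bit address.
ROM = [0b0011, 0b1100, 0b1100, 0b1000, 0b0000, 0b0001, 0b0110, 0b0111]

def rom_read(address):
    """Return the n-bit entry the m-bit address points to."""
    return ROM[address]

def rom_size_bits(m, n):
    """Size of an m-input, n-output ROM: 2^m entries of n bits."""
    return (2 ** m) * n

print(format(rom_read(0b110), "04b"))
print(rom_size_bits(3, 4))
```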
FSM Control: ROM vs. PLA
First improve the ROM: break the table into two parts
4 state bits give the 16 output signals: 2^4 x 16 bits of ROM
all 10 input bits give the 4 next state bits: 2^10 x 4 bits of ROM
Total: 4.3K bits of ROM
PLA is much smaller
can share product terms
only need entries that produce an active output
can take into account don't cares
PLA size = (#inputs x #product-terms) + (#outputs x
#product-terms)
FSM control PLA = (10x17)+(20x17) = 510 PLA cells
A PLA cell is usually about the size of a ROM cell (slightly bigger)
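The size comparison is plain arithmetic and can be checked directly. The sketch below assumes the figures given on the slide: 10 inputs (6 opcode bits plus 4 state bits), 20 outputs (16 control signals plus 4 next-state bits), and 17 product terms; the variable names are illustrative. Note that (10 x 17) + (20 x 17) evaluates to 510.

```python
inputs, outputs, product_terms = 10, 20, 17

# Single ROM indexed by all 10 input bits, 20 output bits per entry.
single_rom_bits = 2 ** inputs * outputs

# Split ROM: control outputs depend on the 4 state bits only,
# next-state bits depend on all 10 input bits.
split_rom_bits = 2 ** 4 * 16 + 2 ** 10 * 4

# PLA: one cell per (input, product term) and (output, product term) pair.
pla_cells = inputs * product_terms + outputs * product_terms

print(single_rom_bits, split_rom_bits, pla_cells)
```

The split ROM comes to 4352 bits, which is the roughly 4.3K bits quoted above.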
Microprogramming
Microprogramming is a method of specifying FSM control that
resembles a programming language: textual rather than graphic
this is appropriate when the FSM becomes very large, e.g., if the
instruction set is large and/or the number of cycles per instruction
is large
in such situations graphical representation becomes difficult as
there may be thousands of states and even more arcs joining them
a microprogram is a specification; implementation is by ROM or PLA
A microprogram is a sequence of microinstructions
each microinstruction has eight fields (label + 7 functional)
Label: used to control microcode sequencing
ALU control: specify operation to be done by ALU
SRC1: specify source for first ALU operand
SRC2: specify source for second ALU operand
Register control: specify read/write for register file
Memory: specify read/write for memory
PCWrite control: specify the writing of the PC
Sequencing: specify choice of next microinstruction
Microprogramming
The Sequencing field value determines the execution order
of the microprogram
value Seq : control passes to the sequentially next
microinstruction
value Fetch : branch to the first microinstruction to begin the
next MIPS instruction, i.e., the first microinstruction in the
microprogram
value Dispatch i : branch to a microinstruction based on
control input and a dispatch table entry (called dispatching):
Dispatching is implemented by means of creating a table, called
dispatch table, whose entries are microinstruction labels and
which is indexed by the control input. There may be multiple
dispatch tables – the value Dispatch i in the sequencing field
indicates that the i th dispatch table is to be used
Control Microprogram
The microprogram corresponding to the FSM control shown
graphically earlier:
Label | ALU control | SRC1 | SRC2 | Register control | Memory | PCWrite control | Sequencing
Fetch | Add | PC | 4 | | Read PC | ALU | Seq
 | Add | PC | Extshft | Read | | | Dispatch 1
Mem1 | Add | A | Extend | | | | Dispatch 2
LW2 | | | | | Read ALU | | Seq
 | | | | Write MDR | | | Fetch
SW2 | | | | | Write ALU | | Fetch
Rformat1 | Func code | A | B | | | | Seq
 | | | | Write ALU | | | Fetch
BEQ1 | Subt | A | B | | | ALUOut-cond | Fetch
JUMP1 | | | | | | Jump address | Fetch
Microprogram containing 10 microinstructions
Dispatch ROM 1 (Dispatch Table 1)
Op | Opcode name | Value
000000 | R-format | Rformat1
000010 | jmp | JUMP1
000100 | beq | BEQ1
100011 | lw | Mem1
101011 | sw | Mem1
Dispatch ROM 2 (Dispatch Table 2)
Op | Opcode name | Value
100011 | lw | LW2
101011 | sw | SW2
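The sequencing rules (Seq, Fetch, Dispatch i) and the two dispatch tables can be simulated in a few lines. This Python sketch keeps only the Label and Sequencing fields of the ten microinstructions; the names MICROPROGRAM, DISPATCH, LABEL_INDEX, and microinstruction_trace are illustrative, not from the slides.

```python
# (label, sequencing) for the 10 microinstructions, in microprogram order.
MICROPROGRAM = [
    ("Fetch", "Seq"),
    (None, "Dispatch 1"),
    ("Mem1", "Dispatch 2"),
    ("LW2", "Seq"),
    (None, "Fetch"),
    ("SW2", "Fetch"),
    ("Rformat1", "Seq"),
    (None, "Fetch"),
    ("BEQ1", "Fetch"),
    ("JUMP1", "Fetch"),
]

# Dispatch tables: opcode -> label of the microinstruction to branch to.
DISPATCH = {
    1: {0b000000: "Rformat1", 0b000010: "JUMP1", 0b000100: "BEQ1",
        0b100011: "Mem1", 0b101011: "Mem1"},
    2: {0b100011: "LW2", 0b101011: "SW2"},
}

LABEL_INDEX = {label: i for i, (label, _) in enumerate(MICROPROGRAM) if label}

def microinstruction_trace(opcode):
    """Indices of the microinstructions executed for one MIPS instruction."""
    index, trace = 0, []
    while True:
        trace.append(index)
        sequencing = MICROPROGRAM[index][1]
        if sequencing == "Seq":
            index += 1                      # fall through to the next one
        elif sequencing == "Fetch":
            return trace                    # next MIPS instruction starts
        else:                               # "Dispatch i"
            table = int(sequencing.split()[1])
            index = LABEL_INDEX[DISPATCH[table][opcode]]

print(microinstruction_trace(0b100011))     # lw
```

Each trace has as many entries as the instruction has clock cycles, matching the FSM state sequences.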
Microcode: Trade-offs
Specification advantages
easy to design and write
typically manufacturer designs architecture and microcode in parallel
Implementation advantages
easy to change since values are in memory (e.g., off-chip ROM)
can emulate other architectures
can make use of internal registers
Implementation disadvantages
control is implemented nowadays on same chip as processor so the
advantage of an off-chip ROM does not exist
ROM is no longer faster than on-board cache
there is little need to change the microcode as general-purpose
computers are used far more nowadays than computers designed for
specific applications
Summary
Techniques described in this chapter to design datapaths and
control are at the core of all modern computer architecture
Multicycle datapaths offer two great advantages over single-cycle:
functional units can be reused within a single instruction if they are
accessed in different cycles – reducing the need to replicate
expensive logic
instructions with shorter execution paths can complete quicker by
consuming fewer cycles
Modern computers, in fact, take the multicycle paradigm to a
higher level to achieve greater instruction throughput:
pipelining (next topic) where multiple instructions execute
simultaneously by having cycles of different instructions overlap in
the datapath
the MIPS architecture was designed to be pipelined