Single Cycle Datapath
Lecture notes from MKP, H. H. Lee and S. Yalamanchili
Reading
• Section 4.1-4.4
• Appendices B.7, B.8, B.11, D.2
• Practice Problems: 1, 4, 6, 9
(2)
Introduction
• We will examine two MIPS implementations
  - A simplified version → this module
  - A more realistic pipelined version
• Simple subset, shows most aspects
  - Memory reference: lw, sw
  - Arithmetic/logical: add, sub, and, or, slt
  - Control transfer: beq, j
(3)
Instruction Execution
• PC → instruction memory, fetch instruction
• Register numbers → register file, read registers
• Depending on instruction class
  1. Use ALU to calculate
     o Arithmetic result
     o Memory address for load/store
     o Branch target address
  2. Access data memory for load/store
  3. PC ← an address or PC + 4
An Encoded Program (one 32-bit word per address, selected by the PC):
  8d0b0000
  014b5020
  21080004
  2129ffff
  1520fffc
  000a082a
  …..
(4)
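The words of the encoded program can be split into their instruction fields with shifts and masks. The following is a minimal Python sketch, not part of the original slides; the function name `decode` and the dictionary layout are illustrative assumptions, while the field positions follow the standard MIPS formats.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its fields."""
    imm = word & 0xFFFF
    return {
        "op": (word >> 26) & 0x3F,     # bits 31:26
        "rs": (word >> 21) & 0x1F,     # bits 25:21
        "rt": (word >> 16) & 0x1F,     # bits 20:16
        "rd": (word >> 11) & 0x1F,     # bits 15:11 (R-format only)
        "shamt": (word >> 6) & 0x1F,   # bits 10:6  (R-format only)
        "funct": word & 0x3F,          # bits 5:0   (R-format only)
        # 16-bit constant, interpreted as a signed value
        "imm": imm - 0x10000 if imm & 0x8000 else imm,
    }


# The six words listed on the slide.
program = [0x8D0B0000, 0x014B5020, 0x21080004,
           0x2129FFFF, 0x1520FFFC, 0x000A082A]

for word in program:
    f = decode(word)
    print(f"{word:08x}: op={f['op']} rs={f['rs']} rt={f['rt']} "
          f"rd={f['rd']} funct={f['funct']} imm={f['imm']}")
```

Under this decoding, the first word 8d0b0000 has op = 35 (lw), rs = 8, rt = 11, and a zero offset, and the second word 014b5020 is an R-format add (op = 0, funct = 32).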
Basic Ingredients
• Include the functional units we need for each instruction, both combinational and sequential
[Figure: datapath building blocks.
 Instruction fetch: (a) instruction memory, (b) program counter, (c) adder.
 Register operations: (a) register file with three 5-bit register numbers (Read register 1, Read register 2, Write register), a Write data input, Read data 1 and Read data 2 outputs, and the RegWrite control; (b) ALU with an ALU control input and Zero and ALU result outputs.
 Memory operations: (a) data memory unit with Address and Write data inputs, a Read data output, and the MemRead and MemWrite controls; (b) sign-extension unit, 16 bits in and 32 bits out.]
(5)
Sequential Elements (4.2, B.7, B.11)
• Register: stores data in a circuit
  - Uses a clock signal to determine when to update the stored value
  - Edge-triggered: update when Clk changes from 0 to 1
[Figure: timing diagram of Clk, D, and Q marking the rising and falling edges; D flip-flop built from two D latches.]
(6)
Sequential Elements
• Register with write control
  - Only updates on a clock edge when the write control input is 1
  - Used when the stored value is required later
[Figure: timing diagram of Clk, Write, D, and Q over one cycle time; D flip-flop built from two D latches.]
(7)
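The behavior of a register with write control can be modeled in a few lines of Python. This is a behavioral sketch only, not a gate-level model; the class name `Register` and the method `tick` are hypothetical names chosen for illustration.

```python
class Register:
    """Behavioral model of a rising-edge-triggered register with write control."""

    def __init__(self, value=0):
        self.q = value        # stored value (the Q output)
        self._prev_clk = 0    # clock level seen on the previous call

    def tick(self, clk, d, write):
        # Update only on a 0 -> 1 clock transition while Write is 1.
        if self._prev_clk == 0 and clk == 1 and write == 1:
            self.q = d
        self._prev_clk = clk
        return self.q


r = Register()
r.tick(0, 5, 1)          # clock low: nothing stored yet
print(r.tick(1, 5, 1))   # rising edge with Write = 1: stores 5
print(r.tick(0, 9, 1))   # falling edge: value unchanged
print(r.tick(1, 9, 0))   # rising edge with Write = 0: value unchanged
```

The stored value changes only in the second call, where a rising edge coincides with Write = 1.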
Clocking Methodology
• Combinational logic transforms data during clock cycles
  - Between clock edges
  - Input from state elements, output to state elements
  - Longest delay determines clock period
• Synchronous vs. asynchronous operation
Recall: Critical Path Delay
(8)
Register File (B.8)
• Built using D flip-flops (remember ECE 2030!)
[Figure: register file block (inputs: Read register number 1, Read register number 2, Write register, Write data, Write; outputs: Read data 1, Read data 2) and its read-port implementation, in which registers 0 through n feed two multiplexers, each selected by one read register number.]
(9)
Register File
• Note: we still use the real clock to determine when to write
[Figure: register file write port. The Write signal and a decoder on the register number drive the C inputs of registers 0 through n; the register data drives every D input.]
(10)
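A register file with two read ports and one write port can be sketched behaviorally as follows. This Python model is an illustration rather than the slides' implementation: the class and method names are hypothetical, and the clock is abstracted into a single `clock_edge` call.

```python
class RegisterFile:
    """Behavioral model: two read ports, one write port."""

    def __init__(self, n=32):
        self.regs = [0] * n

    def read(self, read_reg1, read_reg2):
        # Each read port acts as a multiplexer selecting one register.
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, write, write_reg, write_data):
        # On the clock edge, the decoder selects one register,
        # which is updated only when Write is asserted.
        if write:
            self.regs[write_reg] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(write=1, write_reg=8, write_data=42)
rf.clock_edge(write=0, write_reg=9, write_data=7)   # ignored: Write = 0
print(rf.read(8, 9))
```

Reads are combinational in this model (they return the current contents immediately), while writes take effect only through `clock_edge`.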
Building a Datapath (4.3)
• Datapath
  - Elements that process data and addresses in the CPU
    o Registers, ALUs, muxes, memories, …
• We will build a MIPS datapath incrementally
  - Refining the overview design
(11)
High Level Description
[Figure: Flynn's taxonomy grid of instruction streams vs. data streams (SISD, SIMD, MISD, MIMD), and a control loop: fetch instructions, execute instructions, memory operations.]
• Single instruction, single data stream model of execution (remember Flynn’s Taxonomy)
  - Serial execution model
• Commonly known as the von Neumann execution model
  - Stored program model
  - Instructions and data share memory
(12)
Instruction Fetch
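[Figure: instruction fetch datapath. The PC, a 32-bit register clocked by clk, addresses the instruction memory, and an adder increments it by 4 for the next instruction. The fetch starts at one clock edge and completes by the next, one cycle time later.]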
(13)
R-Format Instructions
• Read two register operands
• Perform arithmetic/logical operation
• Write register result
op (6 bits) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
(14)
Executing R-Format Instructions
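[Figure: executing R-format instructions. Three 5-bit register numbers (Read register 1, Read register 2, Write register) address the register file; Read data 1 and Read data 2 feed the ALU, whose operation is selected by the 3-bit ALU control and which outputs Zero and ALU result; RegWrite enables the register write.]
op | rs | rt | rd | shamt | funct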
(15)
Load/Store Instructions
• Read register operands
• Calculate address using 16-bit offset
  - Use ALU, but sign-extend offset
• Load: read memory and update register
• Store: write register value to memory
op | rs | rt | 16-bit constant
(16)
Executing I-Format Instructions
[Figure: executing I-format (load/store) instructions. Register file (Read register 1, Read register 2, Write register, Write data, RegWrite), sign-extend unit (16 bits to 32 bits), and data memory (Address, Write data, Read data, MemRead, MemWrite).]
op | rs | rt | 16-bit constant
(17)
Branch Instructions
• Read register operands
• Compare operands
  - Use ALU, subtract and check Zero output
• Calculate target address
  - Sign-extend displacement
  - Shift left 2 places (word displacement)
  - Add to PC + 4
    o Already calculated by instruction fetch
op | rs | rt | 16-bit constant
(18)
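The target address calculation listed above (sign-extend, shift left 2, add to PC + 4) can be written out directly. This is a minimal Python sketch with hypothetical function names; the taken condition `branch and zero` is an assumption based on the standard single-cycle design, where the Branch control signal is combined with the ALU Zero output.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """Target = (PC + 4) + (sign-extended displacement << 2), as a 32-bit value."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def next_pc(pc, imm16, branch, zero):
    """Select the branch target when the branch is taken, else PC + 4."""
    if branch and zero:
        return branch_target(pc, imm16)
    return (pc + 4) & 0xFFFFFFFF


# A displacement of 0xfffc (-4 words) from PC = 0x00400010:
print(hex(branch_target(0x00400010, 0xFFFC)))
print(hex(next_pc(0x00400010, 0xFFFC, branch=1, zero=0)))
```

The displacement counts words relative to PC + 4, so -4 here means four instructions before the instruction that follows the branch.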
Branch Instructions
[Figure: branch datapath. The shift-left-2 unit just re-routes wires; sign extension replicates the sign-bit wire.]
op | rs | rt | 16-bit constant
(19)
Updating the Program Counter
[Figure: updating the program counter. One adder computes PC + 4; a second adder computes the branch address from PC + 4 and the sign-extended (16 bits to 32 bits), shifted Instruction [15–0]; a multiplexer controlled by Branch selects between the two. The instruction memory outputs Instruction [31–0], with fields [25–21], [20–16], [15–11], and [15–0].]
Example code:
loop: beq  $t0, $0, exit
      addi $t0, $t0, -1
      lw   $a0, arg1($t1)
      lw   $a1, arg2($t2)
      jal  func
      add  $t3, $t3, $v0
      addi $t1, $t1, 4
      addi $t2, $t2, 4
      j    loop
(20)
Composing the Elements
• First-cut datapath executes an instruction in one clock cycle
  - Each datapath element can only do one function at a time
  - Hence, we need separate instruction and data memories
• Use multiplexers where alternate data sources are used for different instructions
[Figure: an encoded program in instruction memory, addressed by the PC (014b5020, 21080004, 2129ffff, 1520fffc, 000a082a, …).]
(21)
Full Single Cycle Datapath
Destination register is “instruction-specific”:
lw $t0, 0($t4) vs. add $t0, $t1, $t2
(22)
The Main Control Unit
• Control signals derived from instruction

Type         31:26      25:21   20:16   15:11   10:6    5:0
R-type       0          rs      rt      rd      shamt   funct
Load/Store   35 or 43   rs      rt      address (15:0)
Branch       4          rs      rt      address (15:0)

Field usage:
  - 31:26: opcode
  - rs (25:21): always read
  - rt (20:16): read, except for load
  - rd (15:11) or rt (20:16): write for R-type and load
  - address (15:0): sign-extend and add
(23)
ALU Control (4.4, D.2)
• ALU used for
  - Load/Store: Function = add
  - Branch: Function = subtract
  - R-type: Function depends on funct field

ALU control   Function
0000          AND
0001          OR
0010          add
0110          subtract
0111          set-on-less-than
1100          NOR
(24)
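The ALU control encoding in the table above can be exercised with a small behavioral model. This Python sketch is an illustration under assumptions: 32-bit operands, a signed comparison for set-on-less-than, and a Zero output that is 1 when the result is 0; the names `alu` and `to_signed` are hypothetical.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath


def to_signed(x):
    """Interpret a 32-bit value as signed two's complement."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for a 4-bit ALU control code."""
    if control == 0b0000:
        result = a & b                                     # AND
    elif control == 0b0001:
        result = a | b                                     # OR
    elif control == 0b0010:
        result = a + b                                     # add
    elif control == 0b0110:
        result = a - b                                     # subtract
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0   # set-on-less-than
    elif control == 0b1100:
        result = ~(a | b)                                  # NOR
    else:
        raise ValueError(f"unsupported ALU control {control:04b}")
    result &= MASK
    return result, int(result == 0)


print(alu(0b0010, 2, 3))   # add
print(alu(0b0110, 7, 7))   # subtract: equal operands set Zero
```

Subtracting equal operands yields Zero = 1, which is how beq compares its two registers.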
ALU Control
• Assume 2-bit ALUOp derived from opcode
  - Combinational logic derives ALU control

opcode   ALUOp   Operation          funct    ALU function       ALU control
lw       00      load word          XXXXXX   add                0010
sw       00      store word         XXXXXX   add                0010
beq      01      branch equal       XXXXXX   subtract           0110
R-type   10      add                100000   add                0010
                 subtract           100010   subtract           0110
                 AND                100100   AND                0000
                 OR                 100101   OR                 0001
                 set-on-less-than   101010   set-on-less-than   0111

• How do we turn this description into gates?
(25)
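Before reducing the table to gates, it can help to express it as a lookup. The following Python sketch mirrors the table row by row; the names `FUNCT_TO_ALU` and `alu_control` are hypothetical, and unsupported encodings raise an error rather than being treated as don't-cares.

```python
# funct field (inst[5:0]) -> 4-bit ALU control, for R-type instructions
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op, funct):
    """Derive the 4-bit ALU control from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:        # lw / sw: add, funct ignored
        return 0b0010
    if alu_op == 0b01:        # beq: subtract, funct ignored
        return 0b0110
    if alu_op == 0b10:        # R-type: depends on funct
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unsupported ALUOp {alu_op:02b}")


print(f"{alu_control(0b00, 0b000000):04b}")  # lw/sw
print(f"{alu_control(0b10, 0b101010):04b}")  # R-type slt
```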
ALU Controller
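ALUOp1  ALUOp0 | F5  F4  F3  F2  F1  F0 | ALU control | Operation
  0       0    | X   X   X   X   X   X  |    010      | add (lw/sw)
  X       1    | X   X   X   X   X   X  |    110      | sub (beq)
  1       X    | X   X   0   0   0   0  |    010      | add (arith)
  1       X    | X   X   0   0   1   0  |    110      | sub (arith)
  1       X    | X   X   0   1   0   0  |    000      | and (arith)
  1       X    | X   X   0   1   0   1  |    001      | or (arith)
  1       X    | X   X   1   0   1   0  |    111      | slt (arith)

  - funct = inst[5:0]; for arithmetic instructions the ALU control is generated from inst[5:0]
  - ALUOp comes from decoding inst[31:26]
  - The 3-bit ALU control values here are the low three bits of the 4-bit codes in the earlier table
[Figure: ALU control block driving the 3-bit ALU control input of the ALU, which outputs Zero and ALU result.]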
(26)
ALU Control
• Simple combinational logic (truth tables)
[Figure: ALU control block. Inputs: ALUOp (ALUOp1, ALUOp0) and the funct bits F (5–0), of which F3, F2, F1, and F0 are used. Outputs: Operation (Operation2, Operation1, Operation0).]
(27)
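One way to see that the ALU control block is simple combinational logic is to write a candidate set of gate equations and check it against the truth table. The equations below are one possible minimization that exploits the don't-care entries; they are offered as an assumption to be verified, not as the circuit drawn on the slide. The function name and the `ROWS` list are hypothetical.

```python
def alu_control_bits(alu_op1, alu_op0, f3, f2, f1, f0):
    """Candidate gate equations for (Operation2, Operation1, Operation0)."""
    operation2 = alu_op0 | (alu_op1 & f1)
    operation1 = (1 - alu_op1) | (1 - f2)   # NOT ALUOp1 OR NOT F2
    operation0 = alu_op1 & (f3 | f0)
    return operation2, operation1, operation0


# (ALUOp1, ALUOp0, F3, F2, F1, F0) -> expected 3-bit ALU control
ROWS = [
    ((0, 0, 0, 0, 0, 0), (0, 1, 0)),  # lw/sw: add
    ((0, 1, 0, 0, 0, 0), (1, 1, 0)),  # beq: subtract
    ((1, 0, 0, 0, 0, 0), (0, 1, 0)),  # add
    ((1, 0, 0, 0, 1, 0), (1, 1, 0)),  # subtract
    ((1, 0, 0, 1, 0, 0), (0, 0, 0)),  # and
    ((1, 0, 0, 1, 0, 1), (0, 0, 1)),  # or
    ((1, 0, 1, 0, 1, 0), (1, 1, 1)),  # slt
]

for inputs, expected in ROWS:
    assert alu_control_bits(*inputs) == expected
print("all truth-table rows match")
```

Because the lw/sw and beq rows ignore the funct bits, the equations must give the same answer for every funct value when ALUOp1 = 0, which the structure of each term guarantees.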
Datapath With Control
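Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    |   1    |   0    |    0     |    1     |    0    |    0     |   0    |   1    |   0
lw          |   0    |   1    |    1     |    1     |    1    |    0     |   0    |   0    |   0
sw          |   X    |   1    |    X     |    0     |    0    |    1     |   0    |   0    |   0
beq         |   X    |   0    |    X     |    0     |    0    |    0     |   1    |   0    |   1

Note: for lw, use rt, not rd, as the destination register (RegDst = 0).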
(28)
Commodity Processors
ARM 7
Single Cycle Datapath
(29)
Control Unit Signals
[Figure: control unit logic to harness the datapath. Inputs: Inst[31:26] as Op5 through Op0, decoded into R-format, lw, sw, and beq. Outputs: RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0.]

Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    |   1    |   0    |    0     |    1     |    0    |    0     |   0    |   1    |   0
lw          |   0    |   1    |    1     |    1     |    1    |    0     |   0    |   0    |   0
sw          |   X    |   1    |    X     |    0     |    0    |    1     |   0    |   0    |   0
beq         |   X    |   0    |    X     |    0     |    0    |    0     |   1    |   0    |   1
(30)
Controller Implementation
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE IEEE.STD_LOGIC_ARITH.ALL;
USE IEEE.STD_LOGIC_SIGNED.ALL;

ENTITY control IS
   PORT( SIGNAL Opcode       : IN  STD_LOGIC_VECTOR( 5 DOWNTO 0 );
         SIGNAL RegDst       : OUT STD_LOGIC;
         SIGNAL ALUSrc       : OUT STD_LOGIC;
         SIGNAL MemtoReg     : OUT STD_LOGIC;
         SIGNAL RegWrite     : OUT STD_LOGIC;
         SIGNAL MemRead      : OUT STD_LOGIC;
         SIGNAL MemWrite     : OUT STD_LOGIC;
         SIGNAL Branch       : OUT STD_LOGIC;
         SIGNAL ALUop        : OUT STD_LOGIC_VECTOR( 1 DOWNTO 0 );
         SIGNAL clock, reset : IN  STD_LOGIC );
END control;
(31)
Controller Implementation (cont.)
ARCHITECTURE behavior OF control IS
   SIGNAL R_format, Lw, Sw, Beq : STD_LOGIC;
BEGIN
   -- Code to generate control signals using opcode bits
   R_format <= '1' WHEN Opcode = "000000" ELSE '0';
   Lw       <= '1' WHEN Opcode = "100011" ELSE '0';
   Sw       <= '1' WHEN Opcode = "101011" ELSE '0';
   Beq      <= '1' WHEN Opcode = "000100" ELSE '0';
   -- Implementation of each table column
   RegDst     <= R_format;
   ALUSrc     <= Lw OR Sw;
   MemtoReg   <= Lw;
   RegWrite   <= R_format OR Lw;
   MemRead    <= Lw;
   MemWrite   <= Sw;
   Branch     <= Beq;
   ALUOp( 1 ) <= R_format;
   ALUOp( 0 ) <= Beq;
END behavior;

Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    |   1    |   0    |    0     |    1     |    0    |    0     |   0    |   1    |   0
lw          |   0    |   1    |    1     |    1     |    1    |    0     |   0    |   0    |   0
sw          |   X    |   1    |    X     |    0     |    0    |    1     |   0    |   0    |   0
beq         |   X    |   0    |    X     |    0     |    0    |    0     |   1    |   0    |   1
(32)
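The claim that each assignment implements one column of the control table can be checked mechanically. The sketch below restates the controller equations in Python and compares them with the table, skipping don't-care (X) entries; it is a cross-check written for illustration, and the names `TABLE`, `control`, and `matches_table` are hypothetical.

```python
X = None  # don't-care entry

SIGNALS = ["RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp1", "ALUOp0"]

# opcode -> one table row, in SIGNALS order
TABLE = {
    0b000000: [1, 0, 0, 1, 0, 0, 0, 1, 0],  # R-format
    0b100011: [0, 1, 1, 1, 1, 0, 0, 0, 0],  # lw
    0b101011: [X, 1, X, 0, 0, 1, 0, 0, 0],  # sw
    0b000100: [X, 0, X, 0, 0, 0, 1, 0, 1],  # beq
}


def control(opcode):
    """Same equations as the controller: one expression per table column."""
    r_format = int(opcode == 0b000000)
    lw = int(opcode == 0b100011)
    sw = int(opcode == 0b101011)
    beq = int(opcode == 0b000100)
    return {
        "RegDst": r_format,
        "ALUSrc": lw | sw,
        "MemtoReg": lw,
        "RegWrite": r_format | lw,
        "MemRead": lw,
        "MemWrite": sw,
        "Branch": beq,
        "ALUOp1": r_format,
        "ALUOp0": beq,
    }


def matches_table(opcode):
    """True when every non-don't-care entry of the row agrees with control()."""
    out = control(opcode)
    return all(want is X or out[name] == want
               for name, want in zip(SIGNALS, TABLE[opcode]))


print(all(matches_table(op) for op in TABLE))
```

In this formulation the don't-care entries resolve to 0, and an opcode outside the four supported classes deasserts every signal.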
R-Type Instruction
(33)
Load Instruction
(34)
Branch-on-Equal Instruction
(35)
Implementing Jumps
Jump format: opcode 2 (bits 31:26) | address (bits 25:0)
• Jump uses word address
• Update PC with concatenation of
  - Top 4 bits of old PC
  - 26-bit jump address
  - 00
• Need an extra control signal decoded from opcode
(36)
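The concatenation that forms the jump target can be sketched as a pair of masks and a shift. In this Python illustration the function name `jump_target` is hypothetical, and the upper four bits are taken from PC + 4, which is an assumption following the usual MIPS convention; it differs from using the PC itself only when PC + 4 crosses a 256 MB boundary.

```python
def jump_target(pc, address26):
    """Concatenate the top 4 bits of PC + 4, the 26-bit address, and 00."""
    upper = (pc + 4) & 0xF0000000              # top 4 bits
    lower = (address26 & 0x03FFFFFF) << 2      # word address -> byte address
    return upper | lower


print(hex(jump_target(0x00400010, 0x0100000)))
```

Shifting the 26-bit field left by two appends the two zero bits, because instructions are word aligned.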
Datapath With Jumps Added
(37)
Energy Behavior
[Figure: datapath annotated with its energy behavior: combinational activity and storage read/write accesses.]
(38)
Recall Hierarchy of Energy Models
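[Figure: hierarchy of energy models, from the bottom up.
 - Switch level (CMOS inverter: PMOS and NMOS transistors between Vdd and ground, with Vin and Vout): switch-level activity (dynamic) and leakage (static) energy costs
 - Gate level (logic gates with inputs a, b, c and outputs x, y; D flip-flop built from D latches): aggregate energy expenditure into gate-level estimates
 - Module level (e.g., ALU): aggregate energy expenditure into higher-level modules]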
(39)
A Simple Architecture Energy Model
• To a first order, we can use the per-access energy of each major component
  - Obtain this for a technology generation
• Use this per-access energy to compute the energy of each instruction
• Note:
  - This is a high-level approximation. The actual physics is more complicated.
  - However, this is useful for several purposes
• What components does each instruction exercise?
(40)
Example: Updating the PC
[Figure: full single-cycle datapath with control signals (Branch, RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg).]
What is the energy cost of this operation?
(41)
Example: Register Instructions
[Figure: full single-cycle datapath with control signals (Branch, RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg).]
What is the energy cost of this operation?
(42)
Example: I-type Instructions
[Figure: full single-cycle datapath with control signals (Branch, RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg).]
What is the energy cost of this operation?
(43)
Example: I-Type for Branches
[Figure: full single-cycle datapath with control signals (Branch, RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg).]
What is the energy cost of this operation?
(44)
Converting Energy to Power
• For this datapath, all components except the data memory are active, and dissipating energy, on every cycle
  - Later we will see how datapaths can be made more energy efficient
• Computing power
  - Compute the total energy consumed over all cycles (instructions)
  - Divide energy by time to get power in watts
Example:
(45)
Example: A Simple Energy Model
• We can use a simple model of per-access energy for the architecture components

Common component                  | Access          | Energy (10^-12 joules)
Inst. Decode                      | Logic switching | 16.78
Inst. Registers                   | Read            | 2.74
                                  | Write           | 4.38
FP. Registers                     | Read            | 1.26
                                  | Write           | 1.98
Other Buffers                     | Read            | 9.74
                                  | Write           | 11.18
ALU + Result Bus (interconnect)   | Logic switching | 123.2
FPU + Result Bus (interconnect)   | Logic switching | 241.02
(values at 16 nm)

• Each unit can be accessed multiple times depending on instruction type
• An Intel/AMD x86 instruction consumes roughly 600 pJ to 4 nJ of dynamic energy.
(46)
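The per-access values in the table can be combined into a per-instruction energy and then into power. The Python sketch below is a rough illustration under stated assumptions: the access counts per instruction (for example, one decode, two register reads, one ALU operation, and one register write for add) are assumptions rather than figures from the slides, the 1 GHz clock is hypothetical, and the PC update and memory accesses are ignored because the table gives no energy for them.

```python
# Per-access energies in picojoules, taken from the table above.
ENERGY_PJ = {
    "decode": 16.78,     # Inst. Decode, logic switching
    "reg_read": 2.74,    # Inst. Registers, read
    "reg_write": 4.38,   # Inst. Registers, write
    "alu": 123.2,        # ALU + result bus, logic switching
}

# ASSUMED access counts per instruction (not from the slides).
ACCESSES = {
    "add": {"decode": 1, "reg_read": 2, "alu": 1, "reg_write": 1},
    "beq": {"decode": 1, "reg_read": 2, "alu": 1},
}


def instruction_energy_pj(name):
    """Sum the per-access energy over every component the instruction uses."""
    return sum(ENERGY_PJ[c] * n for c, n in ACCESSES[name].items())


def average_power_watts(instructions, clock_hz):
    """Total energy divided by time, at one instruction per clock cycle."""
    total_joules = sum(instruction_energy_pj(i) for i in instructions) * 1e-12
    time_seconds = len(instructions) / clock_hz
    return total_joules / time_seconds


print(round(instruction_energy_pj("add"), 2), "pJ per add")
print(round(average_power_watts(["add"] * 10, 1e9), 5), "W at a hypothetical 1 GHz")
```

Under these assumptions an add costs 16.78 + 2 × 2.74 + 123.2 + 4.38 = 149.84 pJ, dominated by the ALU and its result bus.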
ITRS Roadmap for Logic Devices
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge et al., 2008
(47)
Our Simple Control Structure
• All of the logic is combinational
• We wait for everything to settle down, and the right thing to be done
  - ALU might not produce “right answer” right away
  - We use write signals along with the clock to determine when to write
• Cycle time determined by length of the longest path
[Figure: state element 1 → combinational logic → state element 2, within one clock cycle.]
We are ignoring some details like setup and hold times
(48)
Performance Issues
• Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → data memory → register file
• Not feasible to vary period for different instructions
• Violates design principle
  - Making the common case fast
• We will improve performance by pipelining
(49)
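The cycle time follows from summing component delays along each instruction's path and taking the maximum. In the Python sketch below, the delay values are hypothetical numbers chosen for illustration (they are not given in the slides), and the per-instruction paths are simplified to the major components.

```python
# HYPOTHETICAL component delays in picoseconds (illustrative only).
DELAY_PS = {
    "instr_mem": 200,
    "reg_read": 100,
    "alu": 200,
    "data_mem": 200,
    "reg_write": 100,
}

# Components on each instruction class's path through the datapath.
PATHS = {
    "lw": ["instr_mem", "reg_read", "alu", "data_mem", "reg_write"],
    "sw": ["instr_mem", "reg_read", "alu", "data_mem"],
    "R-format": ["instr_mem", "reg_read", "alu", "reg_write"],
    "beq": ["instr_mem", "reg_read", "alu"],
}


def path_delay(name):
    """Total delay of one instruction class's path."""
    return sum(DELAY_PS[c] for c in PATHS[name])


# The slowest instruction sets the clock period for all instructions.
critical = max(PATHS, key=path_delay)
cycle_time_ps = path_delay(critical)
print(critical, cycle_time_ps)
```

With these numbers the load path is the longest, so every instruction, including the shorter beq, pays the load's cycle time.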
Summary
• Single cycle datapath
  - All instructions execute in one clock cycle
  - Not all instructions take the same amount of time
  - Software sees a simple interface
  - Can memory operations really take one cycle?
• Improve performance via pipelining, multicycle operation, parallelism, or customization
• We will address these next
(50)
Study Guide
• Given an instruction, be able to specify the
values of all control signals required to execute
that instruction
• Add new instructions: modify the datapath and
control to effect their execution
  - E.g., jal, jr, shift, etc.
  - Modify the VHDL controller
• Given delays of various components, determine
the cycle time of the datapath
• Distinguish between those parts of the
datapath that are unique to each instruction
and those components that are shared across
all instructions
(51)
Study Guide (cont.)
• Given a set of control signal values determine
what operation the datapath performs
• Given the per-access energies of each component:
  - Compute the energy required by any instruction
  - Given a program and clock rate, compute the power dissipation of the datapath
(52)
Glossary
• Asynchronous
• Clock
• Controller
• Critical path
• Flip Flop
• ITRS Roadmap
• Per-access energy
• Program counter
• Register
• Synchronous
(53)