The Processor: Datapath and Control

• We will design a microprocessor that includes a
subset of the MIPS instruction set:
– Memory access: load/store word (lw, sw)
– Arithmetic-logical (AL) instructions: add, sub, and, or, and slt.
– Branch instructions: beq and jump (j).
• The subset omits some integer instructions and all floating-point (fp)
instructions, but the design principles are the same.
• For every instruction the first two steps are
identical:
– Fetch the instruction from the memory address held in
the PC.
– Decode the instruction and read the registers or memory
contents specified.
Abstract View of the DataPath
[Figure: abstract view of the datapath, showing the PC, instruction memory, register file, ALU, and data memory.]
• The data path contains 2 types of logic elements:
– Combinational: Elements that operate on data values.
Their outputs depend only on their current inputs. The ALU is a
combinational element.
– State: Elements with internal storage. Their state is
defined by the values they contain (memory and
registers).
Clocking Methodology
• A state element has at least two inputs and one
output. The inputs are the data value to be written
into the element and the clock signal which
determines when the value will be written. The
output is the data value stored in the element.
Thus a state element can be read at any time
but is written only when the clock allows it.
• A clocking methodology defines when signals
can be read and written. This is crucial to the
correct design of a computer.
• We will assume an edge-triggered clocking
methodology. Any values stored in the machine are
updated only on a clock edge.
Edge-Triggered Clocking
• Because only state elements can store values, any collection
of combinational logic must have its inputs coming from a set
of state elements and its outputs written to a set of state
elements. The time necessary for the signals to reach state
element 2 defines the length of the clock cycle.
[Figure: state element 1 feeds combinational logic, which feeds state element 2, all within one clock cycle.]
• An edge-triggered methodology allows us to read the contents
of a register, send the value through some combinational
logic and write that register in the same clock cycle. We
assume that state elements have implicit clock signals.
[Figure: a single state element feeding combinational logic whose output is written back into the same element.]
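The read-then-write behaviour within one cycle can be illustrated with a minimal Python sketch. The `Register` class, its `set_input` and `tick` methods, and the `increment` function are hypothetical names chosen for this illustration; they are not part of the slides.

```python
class Register:
    """Edge-triggered state element: readable at any time, updated only on tick()."""

    def __init__(self, value=0):
        self.value = value   # current output, visible to combinational logic
        self._next = value   # data input, latched at the next clock edge

    def set_input(self, value):
        self._next = value   # does not change the visible output yet

    def tick(self):
        self.value = self._next   # the clock edge: the stored value is updated


def increment(x):
    """Stand-in for the combinational logic between state elements."""
    return (x + 1) & 0xFFFFFFFF


counter = Register(0)
history = []
for _ in range(3):
    # Read the register and send the value through combinational logic ...
    counter.set_input(increment(counter.value))
    # ... the old value is still visible until the clock edge arrives.
    history.append(counter.value)
    counter.tick()
```

In this sketch the same register is read and written in every cycle without a race, because the new value only becomes visible after `tick()`.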
Fetching an Instruction
• A memory unit will hold the instructions that are to be
executed. The address of the next instruction is in the PC.
We need an adder (an ALU that performs only addition) to
calculate the address of the next instruction to fetch.
• Thick arrows symbolize 32-bit buses unless specified
differently. Thin arrows specify 1-bit lines, and colored lines
specify control lines.
[Figure: the elements needed to fetch an instruction: a. Instruction memory, b. Program counter, c. Adder (computing PC + 4).]
The Register File
• The R-type instructions (also called the arithmetic-logical
instructions) read the contents of 2 registers, perform an
ALU operation, and write the result back into a third register.
• The 32 registers are stored in the register file. The register
file has three 5-bit inputs to specify the registers, two 32-bit
outputs for the data read, one 32-bit input for the data written
and one control signal (RegWrite) to decide if data should be
written. In addition we will need an ALU to perform the operations.
[Figure: the register file (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2, RegWrite) feeding the ALU (3-bit ALU operation input, Zero and ALU result outputs).]
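The register file interface described above can be sketched in Python as follows. This is a minimal model under stated assumptions: the class name `RegisterFile` and the method names `read` and `clock_edge` are hypothetical, and the special behaviour of MIPS register 0 is not modelled.

```python
class RegisterFile:
    """32 registers, two read ports, one write port gated by RegWrite."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_reg1, read_reg2):
        # Reads are combinational: two 5-bit register numbers in,
        # two 32-bit values out.
        return self.regs[read_reg1 & 0x1F], self.regs[read_reg2 & 0x1F]

    def clock_edge(self, write_reg, write_data, reg_write):
        # The write happens on the clock edge, and only if RegWrite is asserted.
        if reg_write:
            self.regs[write_reg & 0x1F] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(write_reg=10, write_data=7, reg_write=True)
rf.clock_edge(write_reg=11, write_data=5, reg_write=True)
rf.clock_edge(write_reg=9, write_data=99, reg_write=False)  # ignored
a, b = rf.read(10, 11)
```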
Data Memory
• The 2 elements needed to implement load and store
instructions are the data memory and a unit that sign-extends
the 16-bit constant in an I-type instruction. In addition we
use the existing ALU to compute the address to access.
• The data memory has two 32-bit inputs (the address and the
write data) and one 32-bit output (the read data). In addition it has 2
control lines: MemWrite and MemRead.
[Figure: the load/store datapath: register file, sign-extend unit (16 to 32 bits), ALU computing the address, and data memory with MemRead and MemWrite.]
Branch Equal
• The beq instruction has 3 operands: two registers that are
compared for equality and a 16-bit offset used to compute
the branch address relative to the PC. To implement this
instruction we must add the sign-extended offset to the PC.
• There are 2 important details:
1. The base for the address calculation is the address of the
instruction after the current one. But since we compute PC+4
when fetching we already have this address.
2. The offset is in words, not bytes, so we have to shift the
offset left by 2.
[Figure: the branch datapath: PC + 4 from the instruction datapath and the sign-extended offset shifted left by 2 feed an adder that produces the branch target; the ALU compares the two registers and sends Zero to the branch control logic.]
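The two details above combine into one short calculation. The following Python sketch is an illustration only; the function names `sign_extend16` and `branch_target` are hypothetical.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """Branch target = (PC + 4) + (sign-extended offset shifted left by 2)."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    return (pc_plus_4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF
```

For example, a beq at address 0x1000 with offset 3 branches to 0x1010, and the same instruction with offset 0xFFFF (that is, -1) branches back to 0x1000.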
Combining ALU and Memory Instructions
• The ALU datapath (the Register File slide) and the memory datapath
(the Data Memory slide) are similar. The differences are:
– The second input to the ALU is a register (R-type) or the
sign-extended offset (I-type).
– The value stored into the destination register comes from
the ALU (R-type) or from memory (load).
• Using 2 multiplexors (MUXs) we can combine both datapaths.
[Figure: the combined datapath: an ALUSrc MUX selects the second ALU input (Read data 2 or the sign-extended offset) and a MemtoReg MUX selects the register Write data (ALU result or memory Read data).]
The Complete Datapath
[Figure: the complete datapath: the fetch unit (PC, instruction memory, PC + 4 adder), the branch adder with its shift-left-2 unit and the PCSrc MUX, the register file, the ALUSrc MUX, the ALU, the data memory, and the MemtoReg MUX.]
• This simple processor can execute ALU instructions,
access memory or compute the next instruction's address in
a single cycle.
ALU Control
• The ALU has 3 control inputs; we use 5 of the 8 possible
input combinations:
ALU control input   Function
000                 AND
001                 OR
010                 add
110                 subtract
111                 slt
• The ALU control uses as its inputs the funct field of the
instruction and a 2-bit control field called the ALUOp.
• For lw/sw the ALU computes the address using addition
(ALUOp=00). For the R-type instructions the ALU performs
one of 5 operations depending on the funct field of the
instruction (ALUOp=10). For beq the ALU performs a
subtraction (ALUOp=01).
• The ALU control is a truth table that, given the funct
field and ALUOp, outputs the 3-bit control for the ALU.
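The ALU control truth table can be sketched as a small Python function. The 3-bit outputs and the ALUOp values come from the slides; the funct encodings are the standard MIPS values, which the slides do not list, and the names `alu_control` and `FUNCT_TO_ALU` are hypothetical.

```python
# 3-bit ALU control values, as listed in the table on this slide.
AND, OR, ADD, SUB, SLT = 0b000, 0b001, 0b010, 0b110, 0b111

# funct field -> ALU control (standard MIPS funct encodings, an addition
# to the slides, which only say that the funct field selects the operation).
FUNCT_TO_ALU = {
    0b100000: ADD,  # add
    0b100010: SUB,  # sub
    0b100100: AND,  # and
    0b100101: OR,   # or
    0b101010: SLT,  # slt
}


def alu_control(alu_op, funct):
    """Return the 3-bit ALU control for a 2-bit ALUOp and a 6-bit funct field."""
    if alu_op == 0b00:   # lw/sw: add to form the address
        return ADD
    if alu_op == 0b01:   # beq: subtract to compare
        return SUB
    if alu_op == 0b10:   # R-type: the funct field decides
        return FUNCT_TO_ALU[funct & 0x3F]
    raise ValueError("ALUOp value not used in this design")
```

Note that the funct field is ignored unless ALUOp is 10, which is why it can be treated as a don't-care for lw, sw, and beq.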
Main Control
• Look at the formats of the R-type and I-type instructions:
R-type:
Field   opcode   rs      rt      rd      shamt   funct
Bits    31-26    25-21   20-16   15-11   10-6    5-0
I-type:
Field   opcode   rs      rt      address
Bits    31-26    25-21   20-16   15-0
• The following observations can be made:
– The opcode is always in bits 31-26.
– The 2 registers to be read are always the rs (25-21) and
rt (20-16) fields (R-type, beq, and store).
– The base register for load/store instructions is always rs
(25-21).
– The 16-bit offset for beq, lw, and sw is always in bits 15-0.
– The destination register is in one of two places: for a lw
it is rt (20-16), for an R-type it is rd (15-11). Thus we need
a MUX to select which field of the instruction is used as the Write register.
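Because the fields sit at fixed bit positions, decoding reduces to shifts and masks. The sketch below is a Python illustration with hypothetical names (`decode`, `write_register`); the example word 0x014B4820 is the encoding of add $t1,$t2,$t3, where $t1, $t2, and $t3 are registers 9, 10, and 11.

```python
def decode(instruction):
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "opcode":  (instruction >> 26) & 0x3F,  # bits 31-26
        "rs":      (instruction >> 21) & 0x1F,  # bits 25-21
        "rt":      (instruction >> 16) & 0x1F,  # bits 20-16
        "rd":      (instruction >> 11) & 0x1F,  # bits 15-11 (R-type only)
        "shamt":   (instruction >> 6) & 0x1F,   # bits 10-6  (R-type only)
        "funct":   instruction & 0x3F,          # bits 5-0   (R-type only)
        "address": instruction & 0xFFFF,        # bits 15-0  (I-type only)
    }


def write_register(fields, reg_dst):
    """The destination MUX: rd when RegDst is asserted, rt otherwise."""
    return fields["rd"] if reg_dst else fields["rt"]


fields = decode(0x014B4820)  # add $t1, $t2, $t3
```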
The Main Control Signals
• There are 7 control signals in our microprocessor. Let's see
what happens when they are asserted (set to 1) and
deasserted (set to 0):
– RegDst: when deasserted, the Write register is rt; when asserted, the Write register is rd.
– RegWrite: when deasserted, nothing happens; when asserted, the Write register is written with the Write data.
– ALUSrc: when deasserted, the 2nd ALU operand comes from the register file; when asserted, the 2nd ALU operand is the sign-extended 16-bit address field.
– PCSrc: when deasserted, PC = PC + 4; when asserted, PC = Branch target.
– MemRead: when deasserted, nothing happens; when asserted, the memory contents at the address input are put on the Read data output.
– MemWrite: when deasserted, nothing happens; when asserted, the memory contents at the address input are replaced by the Write data input.
– MemtoReg: when deasserted, the value of the register Write data input is from the ALU; when asserted, it is from memory.
Main Control Diagram
[Figure: the datapath with the main control unit. Instruction [31-26] feeds the Control block, which generates RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite. Instruction [5-0] feeds the ALU control. Instruction [25-21], [20-16], and [15-11] supply the register numbers, and Instruction [15-0] feeds the sign-extend unit. PCSrc selects between PC + 4 and the branch target.]
Opcode to Control
• The control lines are determined by the opcodes of
the instructions. The exception is the PCSrc line,
which also depends on the Zero output of the ALU: it is
asserted only when Branch is 1 and the beq comparison finds
the registers equal. In the table, x means don't care.
Line       R-type   lw   sw   beq
RegDst     1        0    x    x
ALUSrc     0        1    1    0
MemtoReg   0        1    x    x
RegWrite   1        1    0    0
MemRead    0        1    0    0
MemWrite   0        0    1    0
Branch     0        0    0    1
ALUOp      10       00   00   01
• At this stage the Control is a black box, which
receives inputs and gives outputs.
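One way to picture the black box is as a lookup table keyed by opcode. The Python sketch below copies the control values from the table on this slide; the opcode encodings (0 for R-type, 35 for lw, 43 for sw, 4 for beq) are the standard MIPS values and are not given on the slides. The names `CONTROL`, `main_control`, and `pc_src` are hypothetical, and `None` stands for a don't-care (x).

```python
CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),      # R-type
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),      # lw
    0b101011: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),      # sw
    0b000100: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),      # beq
}


def main_control(opcode):
    """Return the control lines for a 6-bit opcode."""
    return CONTROL[opcode]


def pc_src(branch, zero):
    """PCSrc is asserted only for a branch whose ALU comparison gave Zero = 1."""
    return branch & zero
```

A real control unit is combinational logic rather than a table lookup, but the input/output behaviour is the same.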
Operation of the Datapath
• Let's see the stages of execution of an R-type instruction,
add $t1,$t2,$t3:
1. An instruction is fetched from memory and the PC is incremented.
2. Two registers, $t2 and $t3, are read from the register file.
3. The ALU operates on the data read from the register file.
4. The result of the ALU is written into the register $t1.
• This doesn't really happen in 4 steps because the
implementation is combinational, but at the end of the clock
cycle the result is written into the destination register.
• Let's look at lw $t1,offset($t2):
1. An instruction is fetched from memory and the PC is incremented.
2. The register $t2 is read from the register file.
3. The ALU computes the sum of $t2 and the sign-extended offset.
4. The sum from the ALU is used as the address for the data memory.
5. The data from memory is written into register $t1.
Adding the Jump Instruction
• The j instruction uses pseudodirect addressing: the upper
4 bits of PC+4 are concatenated with the 26 bits
(shifted left by 2) of the address in the J-type instruction.
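The concatenation can be written with a mask and a shift. This Python sketch uses the hypothetical name `jump_target`; the example instruction word 0x08000004 is a j instruction (opcode 2) with an address field of 4.

```python
def jump_target(pc, instruction):
    """Pseudodirect addressing: PC+4[31-28] concatenated with address << 2."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    upper4 = pc_plus_4 & 0xF0000000               # upper 4 bits of PC + 4
    lower28 = (instruction & 0x03FFFFFF) << 2     # 26-bit field, shifted left by 2
    return upper4 | lower28
```

With the PC at 0x40000000, the example instruction jumps to 0x40000010.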
[Figure: the datapath extended for jump. Instruction [25-0] is shifted left by 2 to give 28 bits, which are concatenated with PC+4 [31-28] to form the jump address [31-0]; an additional MUX controlled by the new Jump control signal selects the jump address as the next PC.]
Performance of Single-Cycle Machines
• Let's assume that the operation times for the following units
are: Memory - 2 nanoseconds (ns), ALU and adders - 2 ns,
Register file - 1 ns. We will assume that MUXs, control,
sign-extension, PC accesses, and wires have no delays.
• Which implementation is faster?
1. Every instruction operates in 1 clock cycle of fixed length.
2. Every instruction operates in 1 clock cycle whose length varies with the instruction.
• Let's look at the time needed by each instruction (in ns):
Inst.    Fetch   Reg. Rd   ALU op   Memory   Reg. Wr   Total
R-type   2       1         2        0        1         6 ns
Load     2       1         2        2        1         8 ns
Store    2       1         2        2        0         7 ns
Branch   2       1         2        0        0         5 ns
Jump     2       0         0        0        0         2 ns
Fixed vs. Variable Cycle Length
• Let's assume a program has the following instruction mix:
24% loads, 12% stores, 44% R-type, 18% branches, 2%
jumps.
• CPU execution time = Instruction count * Cycle time
(every instruction takes exactly one cycle in both implementations).
• For the fixed cycle length the cycle time is 8 ns, long enough
for the longest instruction (load). Thus each instruction takes
8 ns to execute.
• For the variable cycle time the average CPU clock cycle is:
8*24% + 7*12% + 6*44% + 5*18% + 2*2% = 6.34 ns (about 6.3 ns)
• The variable clock implementation is clearly faster,
but it is extremely hard to implement.
• So why not use the single cycle implementation, which is
only 8/6.34 = 1.26 times slower (about 26%)?
• When we add instructions such as multiply and divide, which
can take tens of cycles, this scheme is too slow: every
instruction would have to run at the speed of the slowest one.
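The comparison can be reproduced numerically. The Python sketch below uses the unit delays and instruction mix given on these slides; the dictionary and function names are hypothetical.

```python
# Delay of each functional unit in ns, from the assumptions on the slide.
UNIT_NS = {"fetch": 2, "reg_read": 1, "alu": 2, "memory": 2, "reg_write": 1}

# Which units each instruction class uses.
USES = {
    "R-type": ["fetch", "reg_read", "alu", "reg_write"],
    "load":   ["fetch", "reg_read", "alu", "memory", "reg_write"],
    "store":  ["fetch", "reg_read", "alu", "memory"],
    "branch": ["fetch", "reg_read", "alu"],
    "jump":   ["fetch"],
}

# Instruction mix from the slide.
MIX = {"load": 0.24, "store": 0.12, "R-type": 0.44, "branch": 0.18, "jump": 0.02}


def instruction_time(kind):
    """Total time in ns for one instruction of the given class."""
    return sum(UNIT_NS[unit] for unit in USES[kind])


# Fixed clock: the cycle must fit the slowest instruction.
fixed_cycle = max(instruction_time(kind) for kind in USES)

# Variable clock: the average cycle is weighted by the instruction mix.
average_cycle = sum(MIX[kind] * instruction_time(kind) for kind in MIX)

# How many times slower the fixed clock is.
slowdown = fixed_cycle / average_cycle
```

Since the instruction count is the same for both implementations, the ratio of cycle times is also the ratio of execution times.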
A Multicycle Implementation
• We broke each instruction into several steps; we can use
these steps to build a multicycle implementation. Each
step takes 1 cycle. The multicycle implementation allows a
functional unit to be used more than once per instruction
as long as it is used on different clock cycles.
[Figure: abstract view of the multicycle datapath: PC, a single memory for instructions and data, instruction register, memory data register, register file, registers A and B, a single ALU, and ALUOut.]
We now have only a single memory unit and a single ALU. In
addition we need registers to hold the output of each stage.
New Registers and MUXs
• We have now added several new registers (which are
transparent to the programmer) and some new MUXs:
– Instruction Register (IR) - the instruction fetched
– Memory Data Register (MDR) - data read from memory
– A, B - registers read from the register file
– ALUOut - result of ALU operation
• The new MUXs added are:
– An additional MUX on the 1st ALU input chooses
between the A register and the PC.
– The MUX on the 2nd ALU input is changed from a 2-way
to a 4-way MUX. The additional inputs are the constant 4
(used to increment the PC) and the sign-extended and
shifted offset field (used in beq).
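The two ALU input MUXs can be sketched as a selection function in Python. The select encodings used here (ALUSrcA: 0 = PC, 1 = A; ALUSrcB: 0 = B, 1 = the constant 4, 2 = sign-extended offset, 3 = sign-extended and shifted offset) are an assumption for illustration, since the slides name the inputs but do not state which select value picks which. The function names are hypothetical.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def alu_inputs(alu_src_a, alu_src_b, pc, a, b, offset16):
    """Return the two ALU operands chosen by ALUSrcA and ALUSrcB.

    The select encodings are assumed, not taken from the slides.
    """
    first = a if alu_src_a else pc          # 2-way MUX on the 1st ALU input
    extended = sign_extend16(offset16)
    second = [b, 4, extended, extended << 2][alu_src_b]  # 4-way MUX on the 2nd
    return first, second
```

With this encoding, `alu_inputs(0, 1, ...)` selects PC and 4 (the PC increment), and `alu_inputs(0, 3, ...)` selects PC and the shifted offset (the branch target calculation).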
Multicycle Diagram
[Figure: the multicycle datapath with its control lines: IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, MemtoReg, ALUSrcA, ALUSrcB, and ALUOp. A single memory feeds the instruction register and the memory data register; the register file feeds A and B; the ALU result is held in ALUOut.]
• There are 3 possible sources for the PC value:
1. The output of the ALU, which is PC+4.
2. The register ALUOut, which holds the computed branch target address.
3. The lower 26 bits of the IR shifted left by 2, concatenated with
the 4 upper bits of the PC.
The Instruction Execution Stages (1,2)
1. Instruction Fetch (IF) - Fetch the instruction from
memory and compute the address of the next
sequential instruction:
IR = Memory[PC];
PC = PC + 4;
2. Instruction Decode (ID) and register fetch - Get the
registers from the register file and compute the
potential branch address (even if it turns out not to be
needed):
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extended(IR[15-0]) << 2);
The Instruction Execution Stages (3)
3. Execution (EX), memory address computation or
branch completion - In this stage the operation is
determined by the instruction class:
A. Memory reference:
ALUOut = A + sign-extended(IR[15-0]);
B. R-type:
ALUOut = A op B;
C. Branch:
if (A == B)
PC = ALUOut;
D. Jump:
PC = PC[31-28] cat (IR[25-0] << 2);
The Instruction Execution Stages (4,5)
4. Memory access (Mem) or R-type completion - During this
step the load/store instruction accesses
memory or the AL instruction writes its result.
A. Memory reference:
MDR = Memory[ALUOut]; (load)
Memory[ALUOut] = B; (store)
B. R-type:
Reg[IR[15-11]] = ALUOut;
5. Memory read completion step - The load
completes by writing the value from memory into a
register:
Reg[IR[20-16]] = MDR;
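The five steps can be traced with a small Python sketch that executes one instruction at a time and reports how many cycles it used. This is an illustration under stated assumptions, not the hardware itself: memory is modelled as a dictionary keyed by byte address, the opcode and funct encodings are the standard MIPS values (not listed on the slides), and all names such as `run_instruction` are hypothetical.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def to_signed32(value):
    """Interpret a 32-bit value as a signed number (needed for slt)."""
    return value - (1 << 32) if value & 0x80000000 else value


# funct field -> ALU operation (standard MIPS encodings, assumed here).
R_TYPE_OPS = {
    0x20: lambda a, b: a + b,                                   # add
    0x22: lambda a, b: a - b,                                   # sub
    0x24: lambda a, b: a & b,                                   # and
    0x25: lambda a, b: a | b,                                   # or
    0x2A: lambda a, b: int(to_signed32(a) < to_signed32(b)),    # slt
}


def run_instruction(state):
    """Execute one instruction step by step; return the number of cycles used."""
    mem, reg = state["mem"], state["reg"]
    # Step 1 (IF): IR = Memory[PC]; PC = PC + 4
    ir = mem[state["pc"]]
    state["pc"] = (state["pc"] + 4) & 0xFFFFFFFF
    # Step 2 (ID): read A and B, compute the potential branch target
    opcode = (ir >> 26) & 0x3F
    rs, rt, rd = (ir >> 21) & 0x1F, (ir >> 16) & 0x1F, (ir >> 11) & 0x1F
    a, b = reg[rs], reg[rt]
    alu_out = (state["pc"] + (sign_extend16(ir) << 2)) & 0xFFFFFFFF
    # Step 3 (EX): depends on the instruction class
    if opcode == 0b000010:                       # j
        state["pc"] = (state["pc"] & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
        return 3
    if opcode == 0b000100:                       # beq
        if a == b:
            state["pc"] = alu_out
        return 3
    if opcode == 0b000000:                       # R-type
        alu_out = R_TYPE_OPS[ir & 0x3F](a, b) & 0xFFFFFFFF
        reg[rd] = alu_out                        # Step 4: R-type completion
        return 4
    alu_out = (a + sign_extend16(ir)) & 0xFFFFFFFF   # lw/sw address
    if opcode == 0b101011:                       # sw
        mem[alu_out] = b                         # Step 4: memory write
        return 4
    if opcode == 0b100011:                       # lw
        mdr = mem[alu_out]                       # Step 4: memory read
        reg[rt] = mdr                            # Step 5: write back
        return 5
    raise ValueError("instruction not in the implemented subset")


state = {
    "pc": 0,
    "reg": [0] * 32,
    "mem": {
        0x000: 0x8C0A0100,   # lw  $t2, 0x100($zero)
        0x004: 0x014A4820,   # add $t1, $t2, $t2
        0x008: 0xAC090104,   # sw  $t1, 0x104($zero)
        0x100: 21,           # data word
    },
}
cycles = [run_instruction(state) for _ in range(3)]
```

The cycle counts returned match the per-class counts used in the CPI calculation: 5 for a load, 4 for stores and R-type instructions, 3 for branches and jumps.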
Cycles Per Instruction (CPI)
• The CPI of a program defines how many cycles an average
instruction takes. Assuming an instruction mix (for the gcc
compiler) of 22% loads, 11% stores, 49% R-type, 16%
branches, and 2% jumps, what is the CPI, assuming each
state requires one clock cycle?
• The number of clock cycles for each instruction class is:
Loads: 5; Stores: 4; R-type: 4; Branches: 3; Jumps: 3
• Thus the CPI = 0.22*5 + (0.11 + 0.49)*4 + (0.16 + 0.02)*3
= 4.04
• This is better than the worst-case CPI of 5, in which every
instruction would take as many clock cycles as a load.
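The CPI is a weighted average, which a few lines of Python can confirm. The mix and cycle counts are those given on this slide; the names `MIX`, `CYCLES`, and `cpi` are hypothetical.

```python
# Instruction mix (fraction of executed instructions) and cycles per class.
MIX = {"load": 0.22, "store": 0.11, "R-type": 0.49, "branch": 0.16, "jump": 0.02}
CYCLES = {"load": 5, "store": 4, "R-type": 4, "branch": 3, "jump": 3}


def cpi(mix, cycles):
    """Average cycles per instruction: sum of frequency * cycle count."""
    return sum(mix[kind] * cycles[kind] for kind in mix)


average_cpi = cpi(MIX, CYCLES)
```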
Exceptions
• One of the hardest parts of control is implementing
exceptions and interrupts: events other than branches and
jumps which change the normal flow of instruction
execution.
• An exception is an unexpected event that happens during
program execution, such as an arithmetic overflow or an
illegal instruction (these are the only 2 in our design).
• An interrupt is an event that is external to the processor,
such as requests by I/O devices.
• When an exception occurs the machine must save the
address of the offending instruction in the exception
program counter (EPC), and then transfer execution to the
OS. The OS might service the exception and return control
to the program or terminate execution.
Causes of Exceptions
• In order for the OS to handle the exception it must know the
cause of the exception. MIPS has a register called the
Cause register which holds the reason for the exception.
• A second method is called vectored interrupts. In a
vectored interrupt the address to which control is transferred
is determined by the exception cause. The OS knows the
cause of the exception by the address that is jumped to.
• We need two additional registers: the EPC, which holds the
address of the offending instruction, and the Cause register, which
holds 0 for an undefined instruction and 1 for arithmetic
overflow.
• We will need 2 control signals to write to the EPC and Cause
registers (EPCWrite and CauseWrite) and a signal to set the
LSB of the Cause register (IntCause).
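The register updates involved in taking an exception can be sketched in Python. This is only an illustration of the bookkeeping: the function name `take_exception`, the dictionary-based state, and the handler address 0xC0000000 used in the example are assumptions, not details confirmed by the slides.

```python
# Cause codes as defined on this slide.
UNDEFINED_INSTRUCTION = 0
ARITHMETIC_OVERFLOW = 1


def take_exception(state, offending_address, int_cause, handler_address):
    """Record the exception and transfer control to the OS handler."""
    state["EPC"] = offending_address   # EPCWrite: save the offending instruction's address
    state["Cause"] = int_cause         # CauseWrite, with IntCause supplying the LSB
    state["PC"] = handler_address      # next instruction comes from the handler
    return state


# Hypothetical example: an overflow in the instruction at 0x00400024.
state = take_exception(
    {"PC": 0x00400028, "EPC": 0, "Cause": 0},
    offending_address=0x00400024,
    int_cause=ARITHMETIC_OVERFLOW,
    handler_address=0xC0000000,  # assumed handler address for this example
)
```

The OS can later read EPC to find the offending instruction and Cause to decide how to service it.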
Datapath with Exceptions
[Figure: the multicycle datapath with exception support. The control outputs now include CauseWrite, IntCause, and EPCWrite alongside PCSource, PCWriteCond, PCWrite, IorD, MemRead, MemWrite, IRWrite, MemtoReg, RegDst, RegWrite, ALUSrcA, ALUSrcB, and ALUOp. The EPC and Cause registers are added, and the PC MUX gains an input carrying the constant C0 00 00 00.]
• IntCause is set by the control when it can't decode the
instruction or when the ALU signals an overflow. The next-PC
MUX now has 4 inputs; the exception handler address is added