No Slide Title

Download Report

Transcript No Slide Title

ECE200 – Computer Organization
Chapter 5 – The Processor:
Datapath and Control
Homework 5

5.5, 5.7, 5.9, 5.15, 5.16, 5.22, 5.24, 5.29
What we’ve covered so far

Computer abstractions and technology (Ch 1)

Defining, measuring, evaluating performance
(Ch 2)

Instruction set architecture and assembly
language programming (Ch 3)

Computer arithmetic (Ch 4)

Basic CPU organization (Ch 5)

Advanced CPU organization (Ch 6)

Caches and main memories (Ch 7)

Input/Output (Ch 8 and Motorola HC11
manuals)

Multiprocessors (Ch 9) [if we get to it]
Outline for Chapter 5 lectures

Goals in processor implementation

Brief review of sequential logic design

Pieces of the processor implementation puzzle

A simple implementation of a MIPS integer
instruction subset
 Datapath
 Control

logic design
A multi-cycle MIPS implementation
 Datapath
 Control
logic design

Microcoded control

Exceptions

Some real microprocessor datapath and control
Goals in processor implementation

Balance the rate of supply of instructions and
data and the rate at which the execution core
can consume them and can update memory
instruction supply
execution core
data supply
Goals in processor implementation

Recall from Chapter 2
 CPU
Time = INST x CPI x CT

INST largely a function of the ISA and compiler

Objective: minimize CPI x CT within design
constraints (cost, power, etc.)

Trading off CPI and CT is tricky
multiplier
multiplier
logic
logic
logic
multiplier
Brief review of sequential logic design

State elements are clocked devices
 Flip

flops, etc
Combinatorial elements hold no state
 ALU,

caches, multiplier, multiplexers, etc.
In edge triggered clocking, state elements are
only updated on the (rising) edge of the clock
pulse
Brief review of sequential logic design

The same state element can be read at the
beginning of a clock cycle and updated at the end

Example: incrementing the PC
12
clock
8
Add input
PC
4
clock
Add
8
Add output
PC register
12
8
12
Our processor design progression

(1) Instruction fetch, execute, and operand
reads from data memory all take place in a
single clock cycle

(2) Instruction fetch, execute, and operand
reads from data memory take place in
successive clock cycles

(3) A pipelined design (Chapter 6)
Pieces of the processor puzzle

Instruction fetch

Execution

Data memory
instruction supply
execution core
data supply
Instruction fetch datapath

Memory to hold instructions

Register to hold the instruction memory address

Logic to generate the next instruction address
PC +4
Execution datapath

Focus on only a subset of all MIPS instructions
 add,
sub, and, or
 lw, sw
 slt
 beq, j

For all instructions except j, we
 Read
operands from the register file
 Perform an ALU operation

For all instructions except sw, beq, and j, we
write a result into the register file
Execution datapath

Register file block diagram
 Read
register 1,2: source operand register numbers
 Read data 1,2: source operands (32 bits each)
 Write register: destination operand register number
 Write data: data written into register file
 RegWrite: when asserted, enables the writing of Write Data
Execution datapath

Datapath for R-type (add, sub, and, or, slt)
 R-type
31
instruction format:
26 25
op
21 20
rs
16 15
rt
11 10
rd
shamt
6 5
0
funct
Execution datapath

Datapath for beq instruction
 I-type
31
instruction format:
26 25
op
21 20
rs
16 15
rt
0
immediate
ALU output indicates if rs=rt (branch is taken/not taken)
 Branch target address is the sign extended immediate left
shifted two positions, and added to PC+4
 Zero
Data memory

Used for lw, sw (I-type format)

Block diagram
 Address:
memory location to be read or written
 Read data: data out of the memory on a load
 Write data: data into the memory on a store
 MemRead: indicates a read operation is to be performed
 MemWrite: indicates a write operation is to be performed
Execution datapath + data memory

Datapath for lw, sw
 Address
is the sign-extended immediate added to the source
operand read out of the register file
 sw: data written to memory from specified register
 lw: data written to register file from specified memory
address
Putting the pieces together

Single clock cycle for fetch, execute, and
operand read from data memory

3 MUXes
 Register
file operand or sign extended immediate to ALU
 ALU or data memory output written to register file
 PC+4 or branch target address written to PC register
Datapath for R-type instructions
Example: add $4, $18, $30
Datapath for I-type ALU instructions
Example: slti $7, $4, 100
Datapath for not taken beq instruction
Example: beq $28, $13, EXIT
Datapath for taken beq instruction
Example: beq $28, $13, EXIT
Datapath for load instruction
Example: lw $8, 112($2)
Datapath for store instruction
Example: sw $10, 0($3)
Control signals we need to generate
ALU operation control


ALU control input codes from Chapter 4
ALU control input
ALU operation
Used for
000
and
and
001
or
or
010
add
add, lw, sw
110
subtract
sub, beq
111
set on less than
slt
Two steps to generate the ALU control input
 Use
the opcode to distinguish R-type, lw and sw, and beq
 If R-type, use funct field to determine the ALU control input
ALU operation control

Opcode used to generate a 2-bit signal called
ALUOp with the following encodings
 00:
lw or sw, perform an ALU add
 01: beq, perform an ALU subtract
 10: R-type, ALU operation is determined by the funct field
Funct
Instruction
ALU control input
100000
add
010
100010
sub
110
100100
and
000
100101
or
001
101010
slt
111
Comparing instruction fields
31
26 25
0
R-type
31
beq
lw (sw)
rs
26 25
4
31
21 20
rt
21 20
rs
26 25
35 (43)
16 15
rd
shamt
6 5
0
funct
16 15
rt
21 20
rs
11 10
0
immediate (offset)
16 15
rt
0
immediate (offset)

Opcode, source registers, function code, and
immediate fields always in same place

Destination register is
 bits
15-11 (rd) for R-type
 bits 20-16 (rt) for lw
 MUX to select the right one
Datapath with instr fields and ALU control
Main control unit design
Main control unit design

Truth table
(0)
(34)
(43)
(4)
Adding support for jump instructions

J-type format
31
26 25
2


0
target
Next PC formed by shifting left the 26-bit target
two bits and combining it with the 4 high-order
bits of PC+4
Now the next PC will be one of
 PC+4
 beq
target address
 j target address

We need another MUX and control bit
Adding support for jump instructions
Evaluation of the simple implementation

All instructions take one clock cycle (CPI = 1)

Assume the following worst case delays
 Instruction
memory: 4 time units
 Data memory: 4 time units (read), 2 time units (write)
 ALU: 4 time units
 Adders: 3 time units
 Register file: 2 time units (read), 1 time unit (write)
 MUXes, sign extension, gates, and shifters: 1 time unit

Large disparity in worst case delays among
instruction types
 R-type:
4+2+1+4+1+1 = 13 time units
 beq: 4+2+1+4+1+1+1 = 14 time units
 j: 4+1+1 = 6 time units
 store: 4+2+4+2 = 12 time units
 load: 4+2+4+4+1+1 = 16 time units
Evaluation of the simple implementation

Disparity would be worse in a real machine
 Even
slower integer instructions (e.g., multiply/divide in MIPS)
 Floating point instructions

Simple instructions take as long as complex ones
A multicycle implementation

Instruction fetch, register file access, etc occur
in separate clock cycles

Different instruction types take different
numbers of cycles to complete

Clock cycle time should be faster
High level view of datapath

New registers store results of each step
 Not

programmer visible!
Hardware can be shared
 One
ALU for PC+4, branch target calculation, EA calculation,
and arithmetic operations
 One memory for instructions and data
Detailed multi-cycle datapath
Multi-cycle control
First two cycles for all instructions

Instruction fetch (1st cycle)
 Load

the instruction into the IR register
IR = Memory[PC]
 Increment


the PC
PC = PC+4
Instruction decode and register fetch (2nd cycle)
 Read
register file locations rs and rt, results into the A and B
registers


A=Reg[IR[25-21]]
B=Reg[IR[20-16]]
 Calculate

the branch target address and load into ALUOut
ALUOut = PC+(sign-extend (IR[15-0]) <<2)
Instruction fetch

IR=Mem[PC]
Instruction fetch

PC=PC+4
Instruction decode and register fetch

A=Reg[IR[25-21]], B=Reg[IR[20-16]]
Instruction decode and register fetch

ALUOut = PC+(sign-extend (IR[15-0]) <<2)
Additional cycles for R-type

Execution
 ALUOut

= A op B
Completion
 Reg[IR[15-11]]
= ALUOut
R-type execution cycle

ALUOut = A op B
R-type completion cycle

Reg[IR[15-11]] = ALUOut
Additional cycles for store

Address computation
 ALUOut

= A + sign-extend (IR[15-0])
Memory access
 Memory[ALUOut]
=B
Store address computation cycle

ALUOut = A + sign-extend (IR[15-0])
Store memory access cycle

Memory[ALUOut] = B
Additional cycles for load

Address computation
 ALUOut

Memory access
 MDR

= A + sign-extend (IR[15-0])
= Memory[ALUOut]
Read completion
 Reg[IR[20-16]]
= MDR
Load memory access cycle

MDR = Memory[ALUOut]
Load read completion cycle

Reg[IR[20-16]] = MDR
Additional cycle for beq

Branch completion
 if
(A == B) PC = ALUOut
Branch completion cycle for beq

if (A == B) PC = ALUOut
Additional cycle for j

Jump completion
 PC
= PC[31-28] || (IR[25-0]<<2)
Jump completion cycle for j

PC = PC[31-28] || (IR[25-0]<<2)
Control logic design

Implemented as a Finite State Machine
 Inputs:
6 opcode bits
 Outputs: 16 control signals
 State: 4 bits for 10 states
High-level view of FSM
Instruction fetch cycle
Instruction decode/register fetch cycle
R-type execution cycle
R-type completion cycle
Memory address computation cycle
Store memory access cycle
Load memory access cycle
Load read completion cycle
beq branch completion cycle
j jump completion cycle
Complete FSM
Evaluation of the multi-cycle design

CPI calculated based on the instruction mix
 For





23% loads (5 cycles each)
13% stores (4 cycles each)
19% branches (3 cycles each)
2% jumps (3 cycles each)
43% ALU (4 cycles each)
 CPI

gcc (Figure 4.54)
= 0.23*5+0.13*4+0.19*3+0.02*3+0.43*4=4.02
Cycle time is calculated from the longest delay
path assuming the same timing delays as before
Worst case datapath: branch target

ALUOut = PC+(sign-extend (IR[15-0]) <<2)

Delay = 7 time units (delay of simple = 16)
Evaluation of the multi-cycle design

Time per instruction of simple and multi-cycle
 TPI(simple)
= CPI(simple) x cycle time(simple) = 16
 TPI(multi-cycle) = 4.02 x 7 = 28.1

Simple single-cycle implementation is faster

Multicycle with pipelining will be considerably
faster than single-cycle implementation
Exceptions


An exception is an event that causes a deviation
from the normal execution of instructions
Types of exceptions
 Operating
system call (e.g., read a file, print a file)
 Input/output device request
 Page fault (request for instruction/data not in memory – Ch 7)
 Arithmetic error (overflow, underflow, etc.)
 Undefined instruction
 Misaligned memory access (e.g., word access to odd address)
 Memory protection violation
 Hardware error
 Power failure


An exception is not usually due to an error!
We need to be able to restart the program at
the point where the exception was detected
Handling exceptions

Detect the exception

Save enough information about the exception to
handle it properly

Save enough information about the program to
resume it after the exception is handled

Handle the exception

Either terminate the program or resume
executing it depending on the exception type
Detecting exceptions

Performed by hardware

Overflow: determined from the opcode and the
overflow output of the ALU

Undefined instruction: determined from
 The
opcode in the main control unit
 The function code and ALUop in the ALU control logic
Detecting exceptions
overflow
undefined
instruction
Saving exception information

Performed by hardware

We need the type of exception and the PC of the
instruction when the exception occurred

In MIPS, the Cause register holds the exception
type
 Need
an encoding for each exception type
 Need a signal from the control unit to load it into the Cause
register

and the Exception Program Counter (EPC)
register holds the PC
 Need
to subtract 4 from the PC register to get the correct PC
(since we loaded PC+4 into the PC register during the
Instruction Fetch cycle)
 Need a signal from the control unit to load it into EPC
Saving exception information
Saving program information

Needed in order to restart the program from the
point where the exception occurred

Performed by hardware and software

EPC register holds the PC of the instruction that
had the exception (where we will restart the
program)

The software routine that handles the exception
saves any registers that it will need to the stack
and restores them when it is done
Handling the exception

Performed by hardware and software

Need to transfer control to a software routine to
handle the exception (exception handler)

The exception handler runs in a privileged mode
that allows it to use special instructions and
access all of memory
 Our

programs run in user mode
The hardware enables the privileged mode,
loads PC with the address of the exception
handler, and transitions to the Fetch state
Handling the exception

Loading the PC with exception handler address
Exception handler

Stores the values of the registers that it will
need to the stack

Handles the particular exception



Operating system call: calls the subroutine associated with the call
Underflow: sets register to zero or uses denormalized numbers
I/O: handles the particular I/O request, e.g., keyboard input

Restores registers from the stack (if program is
to be restarted)

Terminates the program, or resumes execution
by loading the PC with EPC and transitioning to
the Instruction Fetch state
FSM modifications
The Intel Pentium processor

Introduced in 1993

Uses a multi-cycle datapath with the following
steps for integer instructions
 Prefetch
(PF): read instruction from the instruction memory
 Decode 1 (D1): first stage of instruction decode
 Decode 2 (D2): second stage of instruction decode
 Execute (E): perform the ALU operation
 Write back (WB): write the result to the register file

Datapath usage varies by instruction type
 Simple
instructions make one pass through the datapath using
state machine control
 Complex instructions make multiple passes, reusing the same
hardware elements under microcode control
The Intel Pentium processor

The Pentium is a 2-way superscalar design as
two instructions can simultaneously execute
PF
D2
E
WB
U pipe
D2
E
WB
V pipe
D1

Ideal CPI for a 2-way superscalar is 0.5

Conditions for superscalar execution
 Both
must be simple instructions
 The result of the first instruction cannot be needed by the
second
 Both instructions cannot write the same register
 The first instruction in program sequence cannot be a jump
The Intel Pentium Pro processor

Introduced in 1995 as the successor to the
Pentium

The basis for the Pentium II and Pentium III

Implements a 14-cycle, 3-way superscalar
integer datapath
 Very

high frequency is the goal
Uses out-of-order execution in that instructions
may execute out of their original program order
 Completely
handled by hardware transparently to software
 Instructions execute as soon as their source operands become
available
 Complicates exception handling

Some instructions before the excepting one may not have executed,
while some after it may have executed
The Intel Pentium Pro processor

Pentium Pro designers (and AMD designers
before them) used innovative engineering to
overcome the disadvantages of CISC ISAs
 Many
complex X86 instructions are internally translated by
hardware into RISC-like micro-ops with state machine control
 Achieves a very low CPI for simple integer operations even
on programs compiled for older implementations

Combination of high frequency and low CPI gave
the Pentium Pro extremely competitive integer
performance versus RISC microprocessors
 Result
has been that RISC CPUs have failed to gain the
desktop market share that had been expected
The Intel Pentium 4 processor

20 cycle superscalar integer pipeline

Extremely high frequency (>3GHz)

Major effort to lower power dissipation
 Clock
gating: clock to a unit is turned off when the unit is not
 Trace
cache: caches micro-ops of previously decoded complex
in use
instructions to avoid power-consuming decode operation
Questions?