Transcript ALU

Department of Computer and IT Engineering
University of Kurdistan
MIPS Datapath and Control (single Cycle)
By: Dr. Alireza Abdollahpouri
A single-cycle MIPS processor
Any instruction set can be implemented in many different ways.
MIPS ISA implementations (CPI = cycles per instruction, CCT = clock cycle time):
• Single-cycle: short CPI, long CCT
• Multi-cycle: long CPI, short CCT
• Pipelined: short CPI, short CCT
All instructions execute in the same amount of time; this determines the clock cycle time for our performance equations.
In this implementation, every instruction needs the same amount of time to execute (one clock cycle), so the clock period must equal the time of the longest instruction.
The Performance Big Picture
• Execution Time = IC × CPI × Cycle Time
• Instruction count (IC) is controlled by the ISA and the compiler design
• Processor design (datapath and control) determines:
  – Clock cycle time
  – Clock cycles per instruction
• Starting today: the single-cycle processor, which executes an entire instruction in one cycle
  – Advantage: CPI = 1
  – Disadvantage: long cycle time
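As a quick illustration of the performance equation, the sketch below evaluates it for a single-cycle machine. The instruction count and the 600 ps clock period are hypothetical values chosen for the example, and `execution_time` is an illustrative helper name.

```python
def execution_time(instruction_count, cpi, cycle_time_s):
    """Execution Time = IC * CPI * Cycle Time, in seconds."""
    return instruction_count * cpi * cycle_time_s

# Hypothetical workload: one million instructions on a single-cycle
# processor (CPI = 1) with an assumed 600 ps clock period.
single_cycle_time = execution_time(1_000_000, 1, 600e-12)
print(f"{single_cycle_time * 1e6:.0f} microseconds")
```

With CPI fixed at 1, the only lever left in this design is the cycle time, which is why the longest instruction matters so much.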
Clocking Methodology
• Needed to prevent simultaneous read/write to state elements
• Edge-triggered methodology: state elements are updated at the rising clock edge
[Figure: State element 1 → Combinational logic → State element 2, with both state elements driven by the clock input]
Data Path & Control Unit
• The data path is the route that determines how data moves between the processor and the other main elements. Its components are:
  – combinational elements
  – state (sequential) elements
• The control unit determines how control and timing signals reach the data path elements.
Instruction Execution Flow in MIPS
• Fetch the instruction from memory
• Decode the instruction
• Read the operands
• Execute the instruction
• Write to (or read from) memory
• Write the result to the destination register (if needed)
In the single-cycle implementation, all of these steps are carried out in one clock cycle.
Instruction and Data Memory
• To start, we use the Harvard architecture (two separate memories).
• There will be only one DRAM main memory, but we can have separate SRAM caches (we'll study caches later).
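Datapath Schematic
[Figure: the PC supplies the address to the instruction memory; the instruction supplies register numbers to the register file; register data feeds the ALU; the ALU result is used as the data memory address; data returns to the register file]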
Instruction Fetch
• An adder adds 4 to the contents of the PC to compute the address of the next instruction.
• The PC value is given to the instruction memory so that the instruction is fetched and sent to the other parts of the data path.
• If the instruction is a jump or a taken branch, the next PC value will be different (discussed later).
[Figure: the PC drives the Read address input of the instruction memory, which outputs the 32-bit instruction; an adder computes PC + 4]
Instruction Decode
• The opcode and the other required fields of the instruction must be sent to the control unit.
• The required registers must also be read from the register file.
[Figure: the instruction feeds the control unit and the register file inputs Read Addr 1, Read Addr 2, Write Addr and Write Data; the register file outputs Read Data 1 and Read Data 2]
The MIPS Instructions
• R-type: add rd, rs, rt (also sub, and, or, slt)
    Bits:   31-26    25-21    20-16    15-11    10-6     5-0
    Field:  op       rs       rt       rd       shamt    funct
    Width:  6 bits   5 bits   5 bits   5 bits   5 bits   6 bits
  1. Read registers rs and rt
  2. Feed them to ALU
  3. Update register file
• LOAD and STORE: lw rt, rs, imm and sw rt, rs, imm
    Bits:   31-26    25-21    20-16    15-0
    Field:  op       rs       rt       immediate
    Width:  6 bits   5 bits   5 bits   16 bits
  1. Read register rs (and rt for store)
  2. Feed rs and immed to ALU
  3. Move data between mem and reg
• BRANCH: beq rs, rt, imm
    Bits:   31-26    25-21    20-16    15-0
    Field:  op       rs       rt       displacement
    Width:  6 bits   5 bits   5 bits   16 bits
  1. Read registers rs and rt
  2. Feed to ALU to compare
  3. Add PC to disp; update PC
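The field boundaries in the three formats can be sketched as shifts and masks on a 32-bit instruction word. This is a minimal illustration; `decode_fields` and the example word are names and values introduced here, not part of the slides.

```python
def decode_fields(instr):
    """Split a 32-bit MIPS instruction word into its format fields."""
    return {
        "op":    (instr >> 26) & 0x3F,   # bits 31-26
        "rs":    (instr >> 21) & 0x1F,   # bits 25-21
        "rt":    (instr >> 16) & 0x1F,   # bits 20-16
        "rd":    (instr >> 11) & 0x1F,   # bits 15-11 (R-type only)
        "shamt": (instr >> 6) & 0x1F,    # bits 10-6  (R-type only)
        "funct": instr & 0x3F,           # bits 5-0   (R-type only)
        "imm16": instr & 0xFFFF,         # bits 15-0  (load/store/branch)
    }

# Example word for "add $8, $9, $10": op = 0, rs = 9, rt = 10, rd = 8,
# shamt = 0, funct = 0x20 (the standard MIPS encoding of add).
word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | (0 << 6) | 0x20
fields = decode_fields(word)
print(fields)
```

All fields are extracted unconditionally here, much as the hardware wires every bit range out of the instruction; the control unit decides which ones are actually used.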
Reading Operands from the Register File
• The processor's 32 registers are kept in a structure called the register file.
• Each register can be read or written by specifying its number (a 5-bit number).
• Register file's I/O structure
  – 3 inputs derived from the current instruction to specify register operands (2 for read and 1 for write)
  – 1 input to write data into a register
  – 2 outputs carrying the contents of the specified registers
• The RegWrite signal controls the write operation.
• The register file's outputs are always available on the output lines.
[Figure: register file with 5-bit register number inputs Read reg 1, Read reg 2 and Write reg, a 32-bit Write data input, 32-bit outputs Read data 1 and Read data 2, and the RegWrite control input]
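The register file interface described above can be modelled in a few lines. This is a sketch under simplifying assumptions: the class and method names are illustrative, the write is modelled as a method call standing in for the rising clock edge, and the MIPS convention that register 0 always reads as zero is not modelled.

```python
class RegisterFile:
    """32 registers of 32 bits: two read ports, one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_reg1, read_reg2):
        # Outputs are always available: reading needs no control signal.
        return self.regs[read_reg1 & 0x1F], self.regs[read_reg2 & 0x1F]

    def clock_edge(self, write_reg, write_data, reg_write):
        # The write happens only at the clock edge and only if RegWrite = 1.
        if reg_write:
            self.regs[write_reg & 0x1F] = write_data & 0xFFFFFFFF

rf = RegisterFile()
rf.clock_edge(write_reg=8, write_data=42, reg_write=0)   # ignored
before = rf.read(8, 9)
rf.clock_edge(write_reg=8, write_data=42, reg_write=1)   # performed
after = rf.read(8, 9)
print(before, after)
```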
Register File
[Figures: register file read circuit and register file write circuit]
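Other Circuits
• Multiplexor: selects one out of 2^n inputs using n select lines
• ALU: two 32-bit inputs, a 3-bit ALU operation control input, a 32-bit result and a zero output

ALU operation   Function
000             and
001             or
010             add
110             sub
111             slt (set less than)
others          don't care

[Figure: MUX with 2^n inputs, n select lines and 1 output; ALU with two 32-bit inputs, 3-bit ALU operation input, 32-bit result and zero output]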
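The ALU operation table can be sketched as a small function. The names `alu`, `to_signed` and `MASK32` are illustrative, and raising an error for the "don't care" codes is a modelling choice made here, not something the slides specify.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(operation, a, b):
    """Return (result, zero) for the 3-bit ALU operation code."""
    a &= MASK32
    b &= MASK32
    if operation == 0b000:
        result = a & b                       # and
    elif operation == 0b001:
        result = a | b                       # or
    elif operation == 0b010:
        result = (a + b) & MASK32            # add
    elif operation == 0b110:
        result = (a - b) & MASK32            # sub
    elif operation == 0b111:
        result = 1 if to_signed(a) < to_signed(b) else 0   # slt
    else:
        raise ValueError("ALU operation code is a don't care")
    return result, int(result == 0)

print(alu(0b010, 2, 3))   # add
print(alu(0b110, 7, 7))   # sub of equal values sets zero
```

The zero output is what a branch-on-equal later relies on: subtracting two equal registers yields a zero result.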
Datapath Building Blocks: R-Type Instruction
R[rd] ← R[rs] op R[rt]
[Figure: register file and ALU (with zero output) connected for R-type instructions]
• For the arithmetic and logical instructions represented by this format, two registers (rs, rt) must be read from the register file and their data passed to the ALU.
• The ALU operation is determined by the type of instruction and is performed on the register contents.
• The result is written to the destination register (rd). (The RegWrite signal must be asserted.)
• Control signals must be generated so that the result is written to the destination register at the clock edge. The ALUop signal must also be generated to select the ALU operation.
    Field:  R-Type opcode   rs       rt       rd       shamt    func
    Width:  6 bits          5 bits   5 bits   5 bits   5 bits   6 bits
I-Type Instruction: load/store
• To compute the address, the 16-bit offset in the instruction must be converted to a signed 32-bit number (sign-extended) and added to the base value in rs.
    Field:  I-Type opcode   rs       rt       immediate
    Width:  6 bits          5 bits   5 bits   16 bits
• Examples: LW R2, 232(R1) and SW R5, -88(R4)
[Figure: sign-extend unit with a 16-bit input and a 32-bit output]
Load Instruction Steps
lw $rt, offset($rs)
R[rt] ← Mem[R[rs] + SignExt[imm16]]
1. Fetch instruction and increment PC
2. Read base register from the register file: the base register ($rs) is given by bits 25-21 of the instruction
3. ALU computes sum of value read from the register file and the sign-extended lower 16 bits (offset) of the instruction
4. The sum from the ALU is used as the address for the data memory
5. The data from the memory unit is written into the register file: the destination register ($rt) is given by bits 20-16 of the instruction
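Steps 2 to 5 of the load can be traced in a short sketch. The helper names are illustrative, and modelling the data memory as a dictionary keyed by byte address, along with the register and memory contents, are assumptions made for the example.

```python
def sign_extend16(imm16):
    """Extend a 16-bit two's-complement value to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def load_word(regs, mem, rs, rt, imm16):
    """R[rt] <- Mem[R[rs] + SignExt[imm16]]; returns the address used."""
    address = (regs[rs] + sign_extend16(imm16)) & 0xFFFFFFFF  # ALU add
    regs[rt] = mem[address]                                   # write back
    return address

# Hypothetical state for "lw $2, 232($1)": base register $1 holds 1000.
regs = [0] * 32
regs[1] = 1000
mem = {1232: 77}
addr = load_word(regs, mem, rs=1, rt=2, imm16=232)
print(addr, regs[2])
print(sign_extend16(0xFFA8))   # the 16-bit pattern of -88
```

A store uses the same address calculation; only the direction of the data transfer changes.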
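Datapath for Load Operations
R[rt] ← Mem[R[rs] + SignExt[imm16]]
    Bits:   31-26    25-21    20-16    15-0
    Field:  op       rs       rt       immediate
    Width:  6 bits   5 bits   5 bits   16 bits
[Figure: datapath for load operations]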
Datapath for Store Operations
Mem[R[rs] + SignExt[imm16]] ← R[rt]
    Bits:   31-26    25-21    20-16    15-0
    Field:  op       rs       rt       immediate
    Width:  6 bits   5 bits   5 bits   16 bits
[Figure: datapath for store operations]
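Combining datapaths
• A multiplexer with the control signal ALUSrc selects the second ALU operand: 0 = register (Read Data 2), 1 = immediate (sign-extended)
[Figure: register file (Read Reg. 1, Read Reg. 2, Write Reg., Write Data, RegWrite) feeding the ALU (ALU operation, zero); the 16-bit immediate is sign-extended to 32 bits and enters the ALUSrc mux together with Read Data 2]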
Datapath for Branch Operations
    Bits:   31-26    25-21    20-16    15-0
    Field:  op       rs       rt       immediate
    Width:  6 bits   5 bits   5 bits   16 bits
Branch target
• The target of a branch instruction is obtained by adding the offset to the PC.
• Since the target must be a multiple of 4, the offset must be shifted left by 2 bits.
Branch Instruction Steps
beq $rs, $rt, offset
1. Fetch instruction and increment PC
2. Read two registers ($rs and $rt) from the register file
3. ALU performs a subtract on the data values from the register file; the value of PC+4 is added to the sign-extended lower 16 bits (offset) of the instruction shifted left by two to give the branch target address
4. The Zero result from the ALU is used to decide which adder result (from step 1 or 3) to store in the PC
• The ALU is used to compare the registers of the beq instruction (by subtraction), so the result of the comparison is indicated by the Zero signal on the ALU output.
• Because the ALU is busy, a separate adder is used to compute the target address.
Branching hardware
• We need a second adder, since the ALU is already doing subtraction for the beq.
• Multiply the constant by 4 (shift left by 2) to get the byte offset.
• PCSrc = 1 branches to PC + 4 + (offset × 4).
• PCSrc = 0 continues to PC + 4.
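The PC selection described above can be sketched as follows. The function names are illustrative, and deriving PCSrc as "branch instruction AND Zero" is the usual arrangement assumed here, with the ALU's Zero output passed in as an argument.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm16):
    """Extend a 16-bit two's-complement value to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def next_pc(pc, imm16, zero, is_branch=True):
    """Choose between PC + 4 and the branch target."""
    pc_plus_4 = (pc + 4) & MASK32                       # first adder
    offset_bytes = sign_extend16(imm16) << 2            # multiply by 4
    branch_target = (pc_plus_4 + offset_bytes) & MASK32 # second adder
    pc_src = 1 if (is_branch and zero) else 0           # assumed: Branch AND Zero
    return branch_target if pc_src else pc_plus_4

print(hex(next_pc(0x1000, 3, zero=1)))       # taken: 0x1004 + 12
print(hex(next_pc(0x1000, 3, zero=0)))       # not taken: 0x1004
print(hex(next_pc(0x1000, 0xFFFF, zero=1)))  # offset -1: back to 0x1000
```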
• The components needed for the different parts are put together, and the required control signals and multiplexers are added.
• In the single-cycle design, all of the fetch, decode and execute steps are done in one clock cycle!
• The clock period equals the time needed to traverse the longest path, which can be long.
• In addition, hardware cannot be shared between identical operations.
All together: the single cycle datapath
[Figure: the complete single-cycle datapath]
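The R-Format (e.g. add) Datapath
ALUSrc = 0, ALUOp = "add", MemWrite = 0, MemtoReg = 0, RegDst = 1, RegWrite = 1 and PCSrc = 0.
(Values follow the signal definitions used in the rest of these slides: ALUSrc 0 = register, RegDst 1 = rd, PCSrc 0 = PC + 4.)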
The Load Datapath
What control signals do we need for load?
The Store Datapath
Mem[R[rs] + SignExt[imm16]] ← R[rt]
Datapath + Control
ALU Control
• Plan to control ALU: main control sends a 2-bit ALUOp control field to the ALU control. Based on ALUOp and the funct field of the instruction, the ALU control generates the 3-bit ALU control field.
• ALU must perform:
  – add for loads/stores (ALUOp 00)
  – sub for branches (ALUOp 01)
  – one of and, or, add, sub, slt for R-type instructions, depending on the instruction's 6-bit funct field (ALUOp 10)

ALU operation   Function
000             and
001             or
010             add
110             sub
111             slt (set less than)

[Figure: the Main Control Unit sends the 2-bit ALUOp to the ALU Control; the ALU Control also receives the 6-bit instruction funct field and sends the 3-bit ALU control input to the ALU]
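The two-level decoding can be sketched as a lookup. The ALUOp values and the 3-bit operation codes are taken from the slide; the 6-bit funct codes are the standard MIPS encodings, which the slides do not list, and `alu_control` is an illustrative name.

```python
# Standard MIPS funct codes (an addition here, not listed in the slides)
# mapped to the 3-bit ALU operation codes from the table above.
FUNCT_TO_ALU_OPERATION = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}

def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and the 6-bit funct field to the ALU operation."""
    if alu_op == 0b00:            # load/store: always add
        return 0b010
    if alu_op == 0b01:            # branch: always subtract
        return 0b110
    if alu_op == 0b10:            # R-type: funct decides
        return FUNCT_TO_ALU_OPERATION[funct & 0x3F]
    raise ValueError("unused ALUOp value")

print(format(alu_control(0b00, 0), "03b"))          # lw/sw
print(format(alu_control(0b10, 0b101010), "03b"))   # slt
```

Splitting the decision this way keeps the main control small: it only has to classify the instruction, while the funct-specific detail stays local to the ALU control.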
Designing the ALU Control Unit
[Figure: the main control unit receives Instr[31-26] (6 bits) and sends the 2-bit ALUop to the ALU Control; the ALU Control also receives the 6-bit Funct field (the six rightmost bits of the instruction) and produces the 3-bit ALU operation]
Designing the Main Control
R-type:
    Field:  opcode   rs       rt       rd       shamt    funct
    Bits:   31-26    25-21    20-16    15-11    10-6     5-0
Load/store or branch:
    Field:  opcode   rs       rt       address
    Bits:   31-26    25-21    20-16    15-0
Observations about MIPS instruction format
• opcode is always in bits 31-26
• two registers to be read are always rs (bits 25-21) and rt (bits 20-16)
• base register for load/stores is always rs (bits 25-21)
• 16-bit offset for branch equal and load/store is always bits 15-0
• destination register for loads is in bits 20-16 (rt) while for R-type instructions it is in bits 15-11 (rd) (will require multiplexor to select)
Control Signals
• RegDst
  – 0: The register destination number for the Write register comes from the rt field (bits 20-16)
  – 1: The register destination number for the Write register comes from the rd field (bits 15-11)
• ALUSrc
  – 0: The second ALU operand comes from the second register file output (Read data 2)
  – 1: The second ALU operand is the sign-extended, lower 16 bits of the instruction
• MemtoReg
  – 0: The value fed to the register Write data input comes from the ALU
  – 1: The value fed to the register Write data input comes from the data memory
• RegWrite
  – 0: None
  – 1: The register on the Write register input is written with the value on the Write data input
• PCSrc
  – 0: The PC is replaced by the output of the adder that computes the value of PC + 4
  – 1: The PC is replaced by the output of the adder that computes the branch target
• MemRead
  – 0: None
  – 1: Data memory contents designated by the address input are put on the first Read data output
• MemWrite
  – 0: None
  – 1: Data memory contents designated by the address input are replaced by the value of the Write data input

Instruction  Opcode  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  PCSrc  ALUOp[1:0]
R-format     000000  1       0       0         1         0        0         0      10
lw           100011  0       1       1         1         1        0         0      00
sw           101011  X       1       X         0         0        1         0      00
beq          000100  X       0       X         0         0        0         1      01
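The settings table can be written down directly as a lookup keyed by opcode. This is a sketch: the dictionary layout and the name `main_control` are illustrative, don't-care entries (X) are stored as None, and the PCSrc value for beq is copied from the table (it applies when the branch is taken).

```python
X = None  # don't care

# One row per instruction class, copied from the control-signal table.
CONTROL_TABLE = {
    0b000000: dict(name="R-format", RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, PCSrc=0, ALUOp=0b10),
    0b100011: dict(name="lw", RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, PCSrc=0, ALUOp=0b00),
    0b101011: dict(name="sw", RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=1, PCSrc=0, ALUOp=0b00),
    0b000100: dict(name="beq", RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=0, PCSrc=1, ALUOp=0b01),
}

def main_control(opcode):
    """Return the control signal settings for a 6-bit opcode."""
    return CONTROL_TABLE[opcode & 0x3F]

lw_signals = main_control(0b100011)
print(lw_signals)
```

Only sw and beq leave RegDst and MemtoReg as don't cares, because neither writes a register (RegWrite = 0).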
Control Signals: R-Type Instruction
[Figure: single-cycle datapath annotated with the control values for an R-type instruction (control signals shown in blue): RegDst = 1, ALUSrc = 0, MemtoReg = 0, RegWrite = 1, MemRead = 0, MemWrite = 0, PCSrc = 0; the 3-bit ALU Operation value depends on funct]
Control Signals: lw Instruction
[Figure: single-cycle datapath annotated with the control values for lw (control signals shown in blue): RegDst = 0, ALUSrc = 1, MemtoReg = 1, RegWrite = 1, MemRead = 1, MemWrite = 0, PCSrc = 0, ALU Operation = 010]
Single-cycle Implementation Notes
• The steps are not really distinct, as each instruction completes in exactly one clock cycle; they simply indicate the sequence of data flowing through the datapath
• The operation of the datapath during a cycle is purely combinational: nothing is stored during a clock cycle
• Therefore, the machine is stable in a particular state at the start of a cycle and reaches a new stable state only at the end of the cycle
• These points are very important for understanding single-cycle computing
Implementation: Main Control Block
Truth table for main control signals

Signal       R-format  lw   sw   beq
Inputs
  Op5        0         1    1    0
  Op4        0         0    0    0
  Op3        0         0    1    0
  Op2        0         0    0    1
  Op1        0         1    1    0
  Op0        0         1    1    0
Outputs
  RegDst     1         0    x    x
  ALUSrc     0         1    1    0
  MemtoReg   0         1    x    x
  RegWrite   1         1    0    0
  MemRead    0         1    0    0
  MemWrite   0         0    1    0
  Branch     0         0    0    1
  ALUOp1     1         0    0    0
  ALUOp0     0         0    0    1

Main control PLA (programmable logic array): the principle underlying PLAs is that any logical expression can be written as a sum-of-products.
[Figure: PLA with inputs Op5 to Op0, product terms R-format, lw, sw and beq, and outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1 and ALUOp0]
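The sum-of-products idea can be sketched with plain AND/OR on the opcode bits. Each instruction class is one product term over Op5 to Op0, and each output is the OR of the terms in which it is 1. The function name is illustrative, and resolving the don't-care entries to 0 is a choice made here.

```python
def main_control_pla(opcode):
    """Main control as a sum of products over the opcode bits Op5..Op0."""
    op = [(opcode >> i) & 1 for i in range(6)]   # op[i] is bit Op_i
    n = [1 - bit for bit in op]                  # complemented inputs

    # AND plane: one product term per instruction class.
    r_format = n[5] & n[4] & n[3] & n[2] & n[1] & n[0]       # 000000
    lw = op[5] & n[4] & n[3] & n[2] & op[1] & op[0]          # 100011
    sw = op[5] & n[4] & op[3] & n[2] & op[1] & op[0]         # 101011
    beq = n[5] & n[4] & n[3] & op[2] & n[1] & n[0]           # 000100

    # OR plane: each output ORs the product terms where it is 1
    # (don't-care entries are treated as 0 in this sketch).
    return {
        "RegDst": r_format,
        "ALUSrc": lw | sw,
        "MemtoReg": lw,
        "RegWrite": r_format | lw,
        "MemRead": lw,
        "MemWrite": sw,
        "Branch": beq,
        "ALUOp1": r_format,
        "ALUOp0": beq,
    }

print(main_control_pla(0b100011))   # lw
```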
Single-Cycle Design Problems
• Assuming a fixed-period clock, every instruction datapath uses one clock cycle, which implies:
  – CPI = 1
  – cycle time determined by length of the longest instruction path (load)
    · but several instructions could run in a shorter clock cycle: waste of time
    · consider if we have more complicated instructions like floating point!
  – resources used more than once in the same cycle need to be duplicated
    · waste of hardware and chip area
Fixing the problem with single-cycle designs
• One solution: a variable-period clock with different cycle times for each instruction class
  – infeasible, as implementing a variable-speed clock is technically difficult
• Another solution:
  – use a smaller cycle time…
  – …have different instructions take different numbers of cycles by breaking instructions into steps and fitting each step into one cycle
  – feasible: the multicycle approach!
Advantages and Disadvantages of the Single-Cycle Architecture
• The clock period is not used efficiently, because it is set by the longest instruction. This can become severe when there are complex instructions such as floating-point instructions.
• It needs more chip area, because it needs more hardware elements.
• It is simple and easy to understand.
[Figure: clock waveform with Cycle 1 running lw and Cycle 2 running sw; the unused part of the sw cycle is marked Waste]
Computing the Longest Instruction

Instruction  Instruction  Register  ALU        Data    Register  Total
class        memory       read      operation  memory  write     (ps)
R-type       200          50        100        0       50        400
lw           200          50        100        200     50        600
sw           200          50        100        200     -         550
Branch       200          50        100        -       -         350

• The clock period must be designed for the time needed by the longest instruction: 600 ps.
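The clock period calculation can be checked with a few lines. The stage delays are the ones from the table; unused stages are entered as 0, and the variable names are illustrative.

```python
# Stage delays in picoseconds, in the order: instruction memory,
# register read, ALU operation, data memory, register write.
# Stages an instruction class does not use are entered as 0.
STAGE_DELAYS_PS = {
    "R-type": [200, 50, 100, 0, 50],
    "lw":     [200, 50, 100, 200, 50],
    "sw":     [200, 50, 100, 200, 0],
    "Branch": [200, 50, 100, 0, 0],
}

totals_ps = {name: sum(delays) for name, delays in STAGE_DELAYS_PS.items()}
clock_period_ps = max(totals_ps.values())        # set by the slowest class
slowest = max(totals_ps, key=totals_ps.get)

print(totals_ps)
print(slowest, clock_period_ps)
```

Every instruction then occupies the full clock period, so a branch that needs 350 ps leaves 250 ps of each of its cycles unused.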
A Multicycle Implementation
[Figure: Single-cycle versus multicycle instruction execution. In the single-cycle timeline, Instr 1 to Instr 4 are each allotted a full clock period regardless of the time needed. In the multicycle timeline, the four instructions take 3, 5, 3 and 4 shorter cycles, and the difference is marked as time saved.]
Questions
Datapath Building Blocks: jump instruction
• The jump instruction is executed using the following elements:
• The 26-bit value in the instruction is shifted left by 2 bits and replaces the low-order 28 bits of the PC.
[Figure: the PC drives the Read Address input of the instruction memory and an adder that adds 4; the 26-bit field of the instruction passes through a shift-left-2 unit to form the 28-bit jump address]
Example: Fixed-period clock vs variable-period clock in a single-cycle implementation
• Consider a machine with an additional floating point unit. Assume functional unit delays as follows:
  – memory: 2 ns, ALU and adders: 2 ns, FPU add: 8 ns, FPU multiply: 16 ns, register file access (read or write): 1 ns
  – multiplexors, control unit, PC accesses, sign extension, wires: no delay
• Assume instruction mix as follows:
  – all loads take the same time and comprise 31%
  – all stores take the same time and comprise 21%
  – R-format instructions comprise 27%
  – branches comprise 5%
  – jumps comprise 2%
  – FP adds and subtracts take the same time and together comprise 7%
  – FP multiplies and divides take the same time and together comprise 7%
• Compare the performance of (a) a single-cycle implementation using a fixed-period clock with (b) one using a variable-period clock where each instruction executes in one clock cycle that is only as long as it needs to be (not really practical but pretend it's possible!)
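The comparison asked for here can be worked through numerically. The sketch below uses the unit delays and the instruction mix stated above; the list of functional units each instruction class passes through is an assumption based on the single-cycle datapath described earlier, and all names are illustrative.

```python
# Functional unit delays in ns, from the problem statement.
UNIT_NS = {"mem": 2, "alu": 2, "reg": 1, "fpu_add": 8, "fpu_mul": 16}

# (fraction of instruction mix, units used in order); the unit lists
# are an assumption derived from the datapath, not given in the problem.
CLASSES = {
    "load":       (0.31, ["mem", "reg", "alu", "mem", "reg"]),
    "store":      (0.21, ["mem", "reg", "alu", "mem"]),
    "r_format":   (0.27, ["mem", "reg", "alu", "reg"]),
    "branch":     (0.05, ["mem", "reg", "alu"]),
    "jump":       (0.02, ["mem"]),
    "fp_add_sub": (0.07, ["mem", "reg", "fpu_add", "reg"]),
    "fp_mul_div": (0.07, ["mem", "reg", "fpu_mul", "reg"]),
}

totals_ns = {name: sum(UNIT_NS[u] for u in units)
             for name, (_, units) in CLASSES.items()}

fixed_period_ns = max(totals_ns.values())              # longest instruction
variable_period_ns = sum(frac * totals_ns[name]        # mix-weighted average
                         for name, (frac, _) in CLASSES.items())
speedup = fixed_period_ns / variable_period_ns

print(totals_ns)
# For the mix listed above the weighted average evaluates to about 8.1 ns.
print(fixed_period_ns, round(variable_period_ns, 2), round(speedup, 2))
```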
Solution

Instruction  Instr.  Register  ALU    Data  Register  FPU      FPU      Total
class        mem.    read      oper.  mem.  write     add/sub  mul/div  time (ns)
Load word    2       1         2      2     1         -        -        8
Store word   2       1         2      2     -         -        -        7
R-format     2       1         2      0     1         -        -        6
Branch       2       1         2      -     -         -        -        5
Jump         2       -         -      -     -         -        -        2
FP mul/div   2       1         -      -     1         -        16       20
FP add/sub   2       1         -      -     1         8        -        12

• Clock period for fixed-period clock = longest instruction time = 20 ns.
• Average clock period for variable-period clock = 8 × 31% + 7 × 21% + 6 × 27% + 5 × 5% + 2 × 2% + 20 × 7% + 12 × 7% = 2.48 + 1.47 + 1.62 + 0.25 + 0.04 + 1.40 + 0.84 = 8.1 ns.
• Therefore, performance(var-period) / performance(fixed-period) = 20/8.1 ≈ 2.5