ECE369: Fundamentals of Computer Architecture


ECE462/562
ISA and Datapath
Review
Ali Akoglu
1
Instruction Set Architecture
• A very important abstraction
  – interface between hardware and low-level software
  – standardizes instructions, machine language bit patterns, etc.
  – advantage: different implementations of the same architecture
• Modern instruction set architectures:
  – IA-32, PowerPC, MIPS, SPARC, ARM, and others
2
MIPS arithmetic
• All instructions have 3 operands
• Operand order is fixed (destination first)
• Example:
  C code:      a = b + c
  MIPS 'code': add a, b, c
3
MIPS arithmetic
• Design Principle: simplicity favors regularity.
• Of course this complicates some things...
  C code:
    a = b + c + d;
  MIPS code:
    add a, b, c
    add a, a, d
• Operands must be registers; only 32 registers provided
• Each register contains 32 bits
4
Registers vs. Memory
• Arithmetic instruction operands must be registers
  — only 32 registers provided
• Compiler associates variables with registers
• What about programs with lots of variables?
[Figure: the five classic components — Processor (Control, Datapath), Memory, Input, Output, I/O]
5
Memory Organization
• Viewed as a large, single-dimension array, with an address.
• A memory address is an index into the array
• "Byte addressing" means that the index points to a byte of memory.
[Figure: memory as an array of bytes — addresses 0, 1, 2, 3, 4, 5, 6, ..., each holding 8 bits of data]
6
Memory Organization
• Bytes are nice, but most data items use larger "words"
• For MIPS, a word is 32 bits or 4 bytes.
[Figure: memory as an array of words — byte addresses 0, 4, 8, 12, ..., each holding 32 bits of data]
• Registers hold 32 bits of data
• 2^32 bytes with byte addresses from 0 to 2^32-1
• 2^30 words with byte addresses 0, 4, 8, ..., 2^32-4
• Words are aligned
  i.e., what are the least 2 significant bits of a word address?
7
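The alignment question above has a concrete answer: a word address is a multiple of 4, so its two least-significant bits are always 00. A quick Python sketch (illustrative, not part of the original slides):

```python
# Word alignment in byte-addressed memory: a MIPS word address is a
# multiple of 4, so its two least-significant bits are always 00.
def is_word_aligned(addr: int) -> bool:
    return addr & 0b11 == 0  # equivalently, addr % 4 == 0

word_addresses = [4 * i for i in range(8)]  # 0, 4, 8, ..., 28
assert all(is_word_aligned(a) for a in word_addresses)
assert not is_word_aligned(5)  # 5 = 0b101: low bits are 01, not aligned
```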
So far we’ve learned:
• MIPS
  — loading words but addressing bytes
  — arithmetic on registers only

  Instruction         Meaning
  add $s1, $s2, $s3   $s1 = $s2 + $s3
  sub $s1, $s2, $s3   $s1 = $s2 – $s3
  lw  $s1, 100($s2)   $s1 = Memory[$s2+100]
  sw  $s1, 100($s2)   Memory[$s2+100] = $s1
8
Instructions
• Load and store instructions
• Example:
  C code: A[12] = h + A[8];
  # $s3 stores base address of A and $s2 stores h
  MIPS code:
    lw  $t0, 32($s3)
    add $t0, $s2, $t0
    sw  $t0, 48($s3)
• Remember arithmetic operands are registers, not memory!
  Can't write:
    add 48($s3), $s2, 32($s3)
9
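The offsets 32 and 48 in the example above come from scaling the array index by the 4-byte word size. A small Python check (illustrative only):

```python
WORD_SIZE = 4  # bytes per MIPS word

def element_offset(index: int) -> int:
    """Byte offset of A[index] from the base address of a word array."""
    return index * WORD_SIZE

assert element_offset(8) == 32   # A[8]  -> lw $t0, 32($s3)
assert element_offset(12) == 48  # A[12] -> sw $t0, 48($s3)
```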
Summary
Name     Register number  Usage
$zero    0                the constant value 0
$v0-$v1  2-3              values for results and expression evaluation
$a0-$a3  4-7              arguments
$t0-$t7  8-15             temporaries
$s0-$s7  16-23            saved
$t8-$t9  24-25            more temporaries
$gp      28               global pointer
$sp      29               stack pointer
$fp      30               frame pointer
$ra      31               return address

instruction  format  op  rs   rt   rd   shamt  funct  address
add          R       0   reg  reg  reg  0      32     na
sub          R       0   reg  reg  reg  0      34     na
lw           I       35  reg  reg  na   na     na     address
sw           I       43  reg  reg  na   na     na     address

A[300] = h + A[300]
# $t1 = base address of A, $s2 stores h
# use $t0 for temporary register
lw  $t0, 1200($t1)   # op,rs,rt,address:        35,9,8,1200
add $t0, $s2, $t0    # op,rs,rt,rd,shamt,funct: 0,18,8,8,0,32
sw  $t0, 1200($t1)   # op,rs,rt,address:        43,9,8,1200
10
Policy of Use Conventions
Name     Register number  Usage
$zero    0                the constant value 0
$v0-$v1  2-3              values for results and expression evaluation
$a0-$a3  4-7              arguments
$t0-$t7  8-15             temporaries
$s0-$s7  16-23            saved
$t8-$t9  24-25            more temporaries
$gp      28               global pointer
$sp      29               stack pointer
$fp      30               frame pointer
$ra      31               return address

Register 1 ($at) reserved for assembler, 26-27 for operating system
11
MIPS Format
12
Machine Language
• Consider the load-word and store-word instructions
• Introduce a new type of instruction format
  – I-type for data transfer instructions
  – other format was R-type for register
• Example: lw $t0, 32($s2)

  op  rs  rt  16 bit number
  35  18  8   32
13
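The field values above (op=35, rs=18, rt=8, offset=32) can be packed into one 32-bit word by shifting each field into place. A Python sketch; the function name is my own, not from the slides:

```python
def encode_i_type(op: int, rs: int, rt: int, imm: int) -> int:
    """Pack an I-type MIPS instruction: op(6) | rs(5) | rt(5) | imm(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# lw $t0, 32($s2): op=35, rs=18 ($s2), rt=8 ($t0), offset=32
word = encode_i_type(35, 18, 8, 32)
assert word == 0b100011_10010_01000_0000000000100000
```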
Shift and Logical Operations
14
Summary of New Instructions
15
Control Instructions
16
Addresses in Branches
• Instructions:
  bne $t4,$t5,Label    Next instruction is at Label if $t4 ≠ $t5
  beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5
• Formats:
  I:  op  rs  rt  16 bit address
17
Addresses in Branches and Jumps
• Instructions:
  bne $t4,$t5,Label    Next instruction is at Label if $t4 != $t5
  beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5
  j   Label            Next instruction is at Label
• Formats:
  I:  op  rs  rt  16 bit address
  J:  op  26 bit address
18
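The 16-bit and 26-bit address fields are interpreted with standard MIPS semantics: the branch offset counts words relative to PC+4, and the jump field replaces the low 28 bits of PC+4. The slides do not spell this out here, so treat this Python sketch as an assumption-laden illustration:

```python
def branch_target(pc: int, offset16: int) -> int:
    """beq/bne target: PC-relative, 16-bit offset counted in words from PC+4."""
    return (pc + 4) + (offset16 << 2)

def jump_target(pc: int, addr26: int) -> int:
    """j target: upper 4 bits of PC+4 concatenated with the 26-bit field << 2."""
    return ((pc + 4) & 0xF0000000) | (addr26 << 2)

assert branch_target(0x1000, 3) == 0x1010    # 0x1004 + 3 words
assert jump_target(0x1000, 0x40) == 0x100    # word 0x40 from segment base
```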
Overview of MIPS
• simple instructions all 32 bits wide
• very structured, no unnecessary baggage
• only three instruction formats

  R:  op  rs  rt  rd  shamt  funct
  I:  op  rs  rt  16 bit address
  J:  op  26 bit address
19
Datapath
add $t1, $s1, $s2    ($t1=9, $s1=17, $s2=18)

  op      rs     rt     rd     shamt  funct
  000000  10001  10010  01001  00000  100000
20
MIPS64 - Instruction Set Architecture (ISA)
• MIPS is a compact RISC architecture (register-to-register)
• Simple 64-bit (register) Load/Store architecture
(data – 64 bits, instructions – 32 bits)
• Designed for pipelining efficiency
• Components –
• Instructions (Types, Syntax)
• Registers (number, function)
• Addressing modes for MIPS Data Transfers
• Data Types (double, float…)
21
MIPS64 Registers
• 32 general-purpose registers (64-bit)
  – R0, R1, … R31
  – work for any instruction that involves integers, including memory addresses
• 32 floating-point registers
  – F0, F1, …, F30, F31
  – for single precision (32 bits) (other 32 bits unused)
  – for double precision (64 bits)
22
MIPS64 Instruction Set Examples
• Arithmetic/Logical
  – Add unsigned:        DADDU R1, R2, R3   -- Regs[R1] ← Regs[R2] + Regs[R3]
  – Shift left logical:  DSLL R1, R2, #5    -- Regs[R1] ← Regs[R2] << 5
• Load/Store
  – Load double word:    LD R1, 30(R2)      -- Regs[R1] ←64 Mem[30+Regs[R2]]
  – Store FP single:     S.S F0, 40(R3)     -- Mem[40+Regs[R3]] ←32 Regs[F0][0..31]
• Control
  – Jump register:       JR R3              -- PC ← Regs[R3]
  – Branch not equal:    BNE R3, R4, name   -- if (Regs[R3] != Regs[R4]) PC ← name;
                         ((PC+4) – 2^17) ≤ name < ((PC+4) + 2^17)
23
MIPS64 Instruction Set Architecture
• For more, please refer to:
  – Appendix A of the book (5th Edition)
24
The simple datapath
25
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format
lw
sw
beq
26
Datapath in Operation for R-Type Instruction
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw
sw
beq
27
Datapath in Operation for Load Instruction
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq
28
Datapath in Operation for Branch Equal Instruction
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1
29
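The completed truth table can be kept as a small lookup structure, which makes the per-instruction signal patterns easy to sanity-check. A Python sketch; the dictionary layout is my own:

```python
# Main-control truth table from the slides, keyed by instruction class.
# Column order: RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite,
#               Branch, ALUOp1, ALUOp0.  'X' marks don't-care values.
CONTROL = {
    "R-format": (1, 0, 0, 1, 0, 0, 0, 1, 0),
    "lw":       (0, 1, 1, 1, 1, 0, 0, 0, 0),
    "sw":       ("X", 1, "X", 0, 0, 1, 0, 0, 0),
    "beq":      ("X", 0, "X", 0, 0, 0, 1, 0, 1),
}

# Only lw asserts MemRead; only sw asserts MemWrite; only beq asserts Branch.
assert [k for k, v in CONTROL.items() if v[4] == 1] == ["lw"]
assert [k for k, v in CONTROL.items() if v[5] == 1] == ["sw"]
assert [k for k, v in CONTROL.items() if v[6] == 1] == ["beq"]
```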
Single Cycle Problems
– Wasteful of area
  • Each unit used once per clock cycle
– Clock cycle equal to the worst-case scenario
  • Will reducing the delay of the common case help?
30
Pipelining: It’s Natural!
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes
31
Sequential Laundry
[Figure: sequential laundry Gantt chart, 6 PM to midnight — each load takes 30 + 40 + 20 minutes, loads A–D run back to back]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
32
Pipelined Laundry: Start work ASAP
[Figure: pipelined laundry Gantt chart, 6 PM to 9:30 PM — a new load enters the washer every 40 minutes, limited by the dryer]
• Pipelined laundry takes 3.5 hours for 4 loads
33
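The two laundry schedules can be checked with simple arithmetic; the pipelined formula below assumes the dryer (the slowest stage) is the bottleneck, which holds for these stage times. A Python sketch, not part of the slides:

```python
# Laundry timing: sequential vs. pipelined, using the stage times from the slides.
WASH, DRY, FOLD = 30, 40, 20  # minutes
loads = 4

sequential = loads * (WASH + DRY + FOLD)
# Pipelined: the dryer (slowest stage) sets the rhythm once the pipe fills.
pipelined = WASH + loads * DRY + FOLD

assert sequential == 360  # 6 hours
assert pipelined == 210   # 3.5 hours
```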
Pipelining Lessons
[Figure: pipelined laundry Gantt chart repeated — 6 PM to 9:30 PM, loads A–D]
• Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
What is the speedup of a pipeline of n stages?
34
Pipelining
• Improve performance by increasing instruction throughput
• Ideal speedup is the number of stages in the pipeline. Do we achieve this?
35
Basic Idea
What do we need to add to actually split the datapath into stages?
36
Pipelined datapath
37
Five Stages (lw)
Memory and registers
Left half: write
Right half: read
38
Five Stages (lw)
39
Five Stages (lw)
40
What is wrong with this datapath?
41
Store Instruction
42
Store Instruction
43
Pipeline control
44
Pipeline control
             Execution/Address Calculation   Memory access        Write-back
             stage control lines             stage control lines  stage control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0       0       0        0         1         0
lw           0       0       0       1       0       1        0         1         1
sw           X       0       0       1       0       0        1         0         X
beq          X       0       1       0       1       0        0         0         X
45
Datapath with control
46
Pipelining is not quite that easy!
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  – Data hazards: instruction depends on result of prior instruction still in the pipeline (missing sock)
  – Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
47
One Memory Port/Structural Hazards
Figure A.4, Page A-14
48
Three Generic Data Hazards
Instruction I comes before instruction J in the program.
• Read After Write (RAW)
  InstrJ tries to read an operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
49
Three Generic Data Hazards
• Write After Read (WAR)
  InstrJ writes an operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
• Can't happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5
50
Three Generic Data Hazards
• Write After Write (WAW)
  InstrJ writes an operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
• Can't happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
51
Representation
52
Dependencies
• Problem with starting the next instruction before the first has finished
  – Dependencies that "go backward in time" are data hazards
53
Hazards
54
Forwarding
• Use temporary results, don't wait for them to be written
  – register file forwarding to handle read/write to same register
  – ALU forwarding
55
Forwarding
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
56
Forwarding
Forward from EX/MEM registers:
  if (EX/MEM.RegWrite)
  and (EX/MEM.Rd != 0)
  and (ID/EX.Rs == EX/MEM.Rd)   [likewise for ID/EX.Rt == EX/MEM.Rd]
Forward from MEM/WB registers:
  if (MEM/WB.RegWrite)
  and (MEM/WB.Rd != 0)
  and (ID/EX.Rs == MEM/WB.Rd)   [likewise for ID/EX.Rt == MEM/WB.Rd]
57
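The forwarding conditions above can be written out as a small function. Field and mux names follow the slide's pipeline-register notation; the Rt input is handled symmetrically. A sketch, with names of my choosing:

```python
# Forwarding-unit conditions for the Rs input of the ALU, following the
# EX/MEM.Rd / MEM/WB.Rd notation from the slide.
def forward_a(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd, id_ex_rs):
    """Return the mux source for ALU input A: 'EX/MEM', 'MEM/WB', or 'reg file'."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"   # the most recent result wins
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"
    return "reg file"

# sub $2,$1,$3 followed by and $12,$2,$5: $2 is forwarded from EX/MEM.
assert forward_a(True, 2, False, 0, 2) == "EX/MEM"
assert forward_a(False, 0, True, 2, 2) == "MEM/WB"
assert forward_a(True, 0, True, 0, 0) == "reg file"  # $zero is never forwarded
```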
Can't always forward
58
Can't always forward
59
Can't always forward
• Load word can still cause a hazard:
  – an instruction tries to read a register following a load instruction that writes to the same register.
60
Stalling
• Hardware detection and no-op insertion is called stalling
• Stall the pipeline by keeping the instruction in the same stage
[Figure: pipeline diagram (IM, Reg, DM stages over clock cycles CC 1–CC 10) with a bubble inserted after the load]
  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
61
62
Pipeline with hazard detection
63
Assume that the register file is written in the first half and read in the second half of the clock cycle.

load r2 <- mem(r1+0)    ; LOAD1
r3 <- r3 + r2           ; ADD
load r4 <- mem(r2+r3)   ; LOAD2
r4 <- r5 - r3           ; SUB

Cycle:  1   2   3   4   5   6   7   8   9   10  11
LOAD1:  IF  ID  EX  ME  WB
ADD:        IF  ID  S   S   EX  ME  WB
LOAD2:          IF  S   S   ID  EX  ME  WB
SUB:                S   S   IF  ID  S   EX  ME  WB
64
Summary
65
Forwarding Case Summary
66
Multi-cycle
67
Multi-cycle
68
Multi-cycle Pipeline
69
Branch Hazards
70
Branch hazards
• When we decide to branch, other instructions are in the pipeline!
• We are predicting "branch not taken"
  – need to add hardware for flushing instructions if we are wrong
71
Branch detection in ID stage
72
Solution to control hazards
• Branch prediction
  – We are predicting "branch not taken"
  – Need to add hardware for flushing instructions if we are wrong
• Reduce branch penalty
  – By advancing the branch decision to the ID stage
  – Compare the data read from the two registers read in the ID stage
  – Comparison for equality is a simpler design! (Why?)
  – Still need to flush the instruction in the IF stage
• Make the hazard into a feature!
  – Delayed branch slot: always execute the instruction following the branch
73
Branch Prediction
• Sophisticated techniques:
  – A "branch target buffer" to help us look up the destination
  – Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
  – Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  – A "branch delay slot" which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!
• Modern processors predict correctly 95% of the time!
74
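The 95% figure typically comes from predictors built on saturating counters. The 2-bit counter below is an illustrative model of one such scheme, not the specific hardware discussed on these slides:

```python
# A 2-bit saturating-counter branch predictor: it takes two wrong outcomes
# in a row to flip a "strong" prediction, which tolerates loop exits well.
class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0,1 = predict not taken; 2,3 = predict taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
for _ in range(3):
    p.update(True)   # saturate at "strongly taken"
p.update(False)
assert p.predict() is True  # a single not-taken outcome doesn't flip a strong state
```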
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear: the branch penalty is fixed and cannot be reduced by software (this is the example of MIPS)
#2: Predict Branch Not Taken (treat every branch as not taken)
  – Execute successor instructions in sequence
  – "Flush" instructions in the pipeline if the branch is actually taken
  – 47% of MIPS branches not taken on average
  – PC+4 already calculated, so use it to get the next instruction
75
Four Branch Hazard Alternatives:
#3: Predict Branch Taken (treat every branch as taken)
  As soon as the branch is decoded and the target address is computed, we assume the branch is taken and begin fetching and executing at the target address.
  – 53% of MIPS branches taken on average
  – Because in our MIPS pipeline we don't know the target address any earlier than we know the branch outcome, there is no advantage in this approach for MIPS.
  – MIPS still incurs a 1-cycle branch penalty
  • Other machines: branch target known before outcome
76
Four Branch Hazard Alternatives
#4: Delayed Branch
  – In a delayed branch, the execution cycle with a branch delay of length n is:
      branch instruction
      sequential successor_1
      sequential successor_2     } branch delay of length n
      ........
      sequential successor_n
      branch target if taken
  These sequential successor instructions are in the branch-delay slots.
  The sequential successors are executed whether or not the branch is taken.
  The job of the compiler is to make the successor instructions valid and useful.
77
Scheduling Branch Delay Slots (Fig A.14)
A. From before branch:
     add $1,$2,$3
     if $2=0 then
       [delay slot]
   becomes:
     if $2=0 then
       add $1,$2,$3

B. From branch target:
     sub $4,$5,$6
     ...
     add $1,$2,$3
     if $1=0 then
       [delay slot]
   becomes:
     add $1,$2,$3
     if $1=0 then
       sub $4,$5,$6

C. From fall through:
     add $1,$2,$3
     if $1=0 then
       [delay slot]
     or  $7,$8,$9
     sub $4,$5,$6
   becomes:
     add $1,$2,$3
     if $1=0 then
       or $7,$8,$9
     sub $4,$5,$6
78
Delayed Branch
• Where to get instructions to fill the branch delay slot?
  – Before the branch instruction: this is the best choice if feasible
  – From the target address: only valuable when branch taken
  – From fall through: only valuable when branch not taken
• Compiler effectiveness for a single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and needs more than one delay slot
  – Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  – Growth in available transistors has made dynamic approaches relatively cheaper
79
Improving Performance
• Try and avoid stalls! E.g., reorder these instructions:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)
• Dynamic Pipeline Scheduling
  – Hardware chooses which instructions to execute next
  – Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
  – Speculates on branches and keeps the pipeline full (may need to roll back if prediction incorrect)
• Trying to exploit instruction-level parallelism
80
Advanced Pipelining
• Increase the depth of the pipeline
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• "Superscalar" processors
  – DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
• All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes")
81
Source:  for ( i=1000; i>0; i=i-1 )
             x[i] = x[i] + s;

Direct translation:
  Loop: LD     F0, 0(R1)     ; R1 points to x[1000]
        ADDD   F4, F0, F2    ; F2 = scalar value
        SD     F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop  ; R2 = last element

  Producer      Consumer           Latency
  FP ALU op     Another FP ALU op  3
  FP ALU op     Store double       2
  Load double   FP ALU op          1
  Store double  Store double       0

Assume 1 cycle latency from unsigned integer arithmetic to a dependent instruction
82
Reducing stalls
• Pipeline implementation:
    Loop: LD     F0, 0(R1)     ; 1
          stall                ; 2
          ADDD   F4, F0, F2    ; 3
          stall                ; 4
          stall                ; 5
          SD     F4, 0(R1)     ; 6
          DADDUI R1, R1, #-8   ; 7
          stall                ; 8
          BNE    R1, R2, Loop  ; 9

• Scheduled to reduce stalls:
    Loop: LD     F0, 0(R1)
          DADDUI R1, R1, #-8
          ADDD   F4, F0, F2
          stall
          stall
          SD     F4, 8(R1)
          BNE    R1, R2, Loop

  Producer      Consumer           Latency
  FP ALU op     Another FP ALU op  3
  FP ALU op     Store double       2
  Load double   FP ALU op          1
  Store double  Store double       0
83
Loop Unrolling
Loop: LD     F0, 0(R1)
      ADDD   F4, F0, F2
      SD     F4, 0(R1)      ; drop SUBI & BNEZ
      LD     F6, -8(R1)
      ADDD   F8, F6, F2
      SD     F8, -8(R1)     ; drop SUBI & BNEZ
      LD     F10, -16(R1)
      ADDD   F12, F10, F2
      SD     F12, -16(R1)   ; drop SUBI & BNEZ
      LD     F14, -24(R1)
      ADDD   F16, F14, F2
      SD     F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop

  Producer      Consumer           Latency
  FP ALU op     Another FP ALU op  3
  FP ALU op     Store double       2
  Load double   FP ALU op          1
  Store double  Store double       0

27 cycles: 14 instructions, plus 1 stall for each LD, 2 for each ADDD, 1 for DADDUI
84
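The 27-cycle count above follows directly from the latency table: one stall after each LD before its dependent ADDD, two after each ADDD before its SD, and one after DADDUI before BNE. A quick check in Python (illustrative):

```python
# Cycle count for the unrolled but unscheduled loop body.
instructions = 14              # 4 x (LD, ADDD, SD) + DADDUI + BNE
stalls = 4 * 1 + 4 * 2 + 1     # per-LD, per-ADDD, and DADDUI stalls
assert instructions + stalls == 27  # matches the slide's 27 cycles
```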
Loop: LD     F0, 0(R1)
      LD     F6, -8(R1)
      LD     F10, -16(R1)
      LD     F14, -24(R1)
      ADDD   F4, F0, F2
      ADDD   F8, F6, F2
      ADDD   F12, F10, F2
      ADDD   F16, F14, F2
      SD     F4, 0(R1)
      SD     F8, -8(R1)
      DADDUI R1, R1, #-32
      SD     F12, -16(R1)
      SD     F16, 8(R1)
      BNE    R1, R2, Loop

Design issues:
• Code size!
• Instruction cache
• Register space
• Iteration dependence
• Loop termination
• Memory addressing

14 instructions (3.5 cycles per element vs. 9 cycles!)
85
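The per-element payoff of unrolling plus scheduling can be checked with a one-line computation (illustrative):

```python
# Scheduled, unrolled loop: 14 instructions, no stalls, 4 elements per trip
# around the loop, vs. 9 cycles per element for the original scheduled loop.
cycles_per_element_unrolled = 14 / 4
cycles_per_element_original = 9 / 1
assert cycles_per_element_unrolled == 3.5
```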
Superscalar architecture - Two instructions executed in parallel
Loop unrolling?
Branch delay slot?
86
Dynamically scheduled pipeline
87
Important facts to remember
• Pipelined processors divide execution into multiple steps
• However, pipeline hazards reduce performance
  – Structural, data, and control hazards
• Data forwarding helps resolve data hazards
  – But all hazards cannot be resolved
  – Some data hazards require bubble or noop insertion
• Effects of control hazards reduced by branch prediction
  – Predict always taken, delayed slots, branch prediction table
  – Structural hazards are resolved by duplicating resources
• Time to execute n instructions depends on
  – # of stages (k)
  – # of control hazards and penalty of each
  – # of data hazards and penalty for each
  – Time = n + k - 1 + (load hazard penalty) + (branch penalty)
• Load hazard penalty is 1 or 0 cycles
  – Depending on data use with forwarding
• Branch penalty is 3, 2, 1, or zero cycles depending on scheme
88
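The Time = n + k - 1 formula above can be sketched directly; the penalty arguments fold in the load and branch terms from the slide:

```python
# Ideal pipeline time: n instructions on a k-stage pipeline take n + k - 1
# cycles, before any hazard penalties are added.
def pipeline_cycles(n: int, k: int, load_penalty: int = 0, branch_penalty: int = 0) -> int:
    return n + k - 1 + load_penalty + branch_penalty

assert pipeline_cycles(n=1, k=5) == 5      # one instruction still takes all 5 stages
assert pipeline_cycles(n=100, k=5) == 104  # throughput approaches 1 per cycle
```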
Design and performance issues with pipelining
• Pipelined processors are not EASY to design
• Technology affects implementation
• Instruction set design affects the performance
  – i.e., beq, bne
• More stages do not lead to higher performance!
89