
CPE 631:
ILP, Dynamic Exploitation
Electrical and Computer Engineering
University of Alabama in Huntsville
Aleksandar Milenković
http://www.ece.uah.edu/~milenka
Outline




Instruction Level Parallelism (ILP)
Recap: Data Dependencies
Extended MIPS Pipeline and Hazards
Dynamic scheduling with a scoreboard
AM
LaCASA
ILP: Concepts and Challenges


ILP (Instruction Level Parallelism): overlap the execution of unrelated instructions
Techniques that increase the amount of parallelism exploited among instructions
  reduce the impact of data and control hazards
  increase the processor's ability to exploit parallelism
Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
  Reducing any term on the right-hand side lowers CPI and thus increases instruction throughput
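A minimal sketch of the CPI equation, to make the bookkeeping concrete. The function name and all stall rates below are hypothetical values chosen for illustration; they are not measurements from the course or the textbook.

```python
def pipeline_cpi(ideal_cpi, structural=0.0, raw=0.0, war=0.0, waw=0.0, control=0.0):
    """Pipeline CPI = ideal CPI plus the average stall cycles per instruction."""
    return ideal_cpi + structural + raw + war + waw + control


# Hypothetical stall rates in cycles per instruction (not measured data).
baseline = pipeline_cpi(1.0, structural=0.05, raw=0.30, control=0.15)
# Assume some technique halves the RAW stalls and changes nothing else.
improved = pipeline_cpi(1.0, structural=0.05, raw=0.15, control=0.15)
# Throughput ratio, assuming the same clock rate and instruction count.
speedup = baseline / improved
print(f"baseline CPI={baseline:.2f}, improved CPI={improved:.2f}, speedup={speedup:.3f}")
```

Each technique in the deck attacks one or more of these terms, so the same function can be reused to compare them under whatever stall rates a workload actually shows.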
Two approaches to exploit parallelism

Dynamic techniques
  depend largely on hardware to locate the parallelism
Static techniques
  rely on software
Techniques to exploit parallelism
Technique (section in the textbook)                  Reduces
Forwarding and bypassing (A.2)                       Data hazard (DH) stalls
Delayed branches (A.2)                               Control hazard (CH) stalls
Basic dynamic scheduling (A.8)                       DH stalls (RAW)
Dynamic scheduling with register renaming (3.2)      WAR and WAW stalls
Dynamic branch prediction (3.4)                      CH stalls
Issuing multiple instructions per cycle (3.6)        Ideal CPI
Speculation (3.7)                                    Data and control stalls
Dynamic memory disambiguation (3.2, 3.7)             RAW stalls with memory
Loop unrolling (4.1)                                 CH stalls
Basic compiler pipeline scheduling (A.2, 4.1)        DH stalls
Compiler dependence analysis (4.4)                   Ideal CPI, DH stalls
Software pipelining and trace scheduling (4.3)       Ideal CPI and DH stalls
Compiler speculation (4.4)                           Ideal CPI, DH and CH stalls
Where to look for ILP?

Amount of parallelism available within a basic block (BB)
  BB: straight-line code sequence with no branches in except at the entry, and no branches out except at the exit
  Example: gcc (GNU C Compiler): 17% control transfer instructions
    5 or 6 instructions + 1 branch
    Dependencies => the amount of parallelism in a basic block is likely to be much less than 5
    => look beyond a single block to get more instruction level parallelism
Simplest and most common way to increase the amount of parallelism among instructions is to exploit parallelism among iterations of a loop => Loop Level Parallelism
  for(i=1; i<=1000; i++)
      x[i]=x[i] + s;
Vector Processing: see Appendix G
Definition: Data Dependencies

Data dependence: instruction j is data dependent on instruction i if either of the following holds
  Instruction i produces a result used by instruction j, or
  Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
If dependent, the instructions cannot execute in parallel
Try to schedule to avoid hazards
Easy to determine for registers (fixed names)
Hard for memory ("memory disambiguation"):
  Does 100(R4) = 20(R6)?
  From different loop iterations, does 20(R6) = 20(R6)?
Examples of Data Dependencies
Loop: L.D     F0, 0(R1)      ; F0 = array element
      ADD.D   F4, F0, F2     ; add scalar in F2
      S.D     0(R1), F4      ; store result and
      DADDUI  R1, R1, #-8    ; decrement pointer
      BNE     R1, R2, Loop   ; branch if R1 != R2
Definition: Name Dependencies

Two instructions use the same name (register or memory location) but don't exchange data
  Antidependence (WAR if a hazard for HW)
    Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first
  Output dependence (WAW if a hazard for HW)
    Instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved
If dependent, the instructions can't execute in parallel
Renaming removes name dependencies (no data flows between the instructions)
Again, name dependencies are hard to detect for memory accesses
  Does 100(R4) = 20(R6)?
  From different loop iterations, does 20(R6) = 20(R6)?
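The definitions of data, anti, and output dependence can be checked mechanically for register operands. The following is a minimal sketch, assuming each instruction is described only by the register it writes and the registers it reads; the `Instr` tuple and the `hazards` helper are hypothetical names introduced here, and memory operands are deliberately ignored because, as noted above, disambiguating them is the hard part.

```python
from collections import namedtuple

# dest is the register written (None for a store); srcs are the registers read.
Instr = namedtuple("Instr", "text dest srcs")


def hazards(i, j):
    """Classify register dependences of a later instruction j on an earlier instruction i."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")   # true data dependence: j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")   # antidependence: j overwrites what i reads
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")   # output dependence: both write the same name
    return found


ld = Instr("L.D F0,0(R1)", "F0", ("R1",))
add = Instr("ADD.D F4,F0,F2", "F4", ("F0", "F2"))
sd = Instr("S.D 0(R1),F4", None, ("F4", "R1"))
dec = Instr("DADDUI R1,R1,#-8", "R1", ("R1",))
ld_next = Instr("L.D F0,-8(R1)", "F0", ("R1",))   # load of the next iteration

for first, second in [(ld, add), (add, sd), (sd, dec), (ld, ld_next), (add, ld_next)]:
    print(f"{first.text:18} -> {second.text:18} {sorted(hazards(first, second))}")
```

Only the RAW results are real data flow; the WAR and WAW results between iterations are name dependencies that exist solely because F0 is reused.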
Where are the name dependencies?
Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     0(R1),F4       ;drop DSUBUI & BNEZ
      L.D     F0,-8(R1)
      ADD.D   F4,F0,F2
      S.D     -8(R1),F4      ;drop DSUBUI & BNEZ
      L.D     F0,-16(R1)
      ADD.D   F4,F0,F2
      S.D     -16(R1),F4     ;drop DSUBUI & BNEZ
      L.D     F0,-24(R1)
      ADD.D   F4,F0,F2
      S.D     -24(R1),F4
      DSUBUI  R1,R1,#32      ;alter to 4*8
      BNEZ    R1,LOOP
      NOP
How can we remove them?
Where are the name dependencies?
Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     0(R1),F4       ;drop DSUBUI & BNEZ
      L.D     F6,-8(R1)
      ADD.D   F8,F6,F2
      S.D     -8(R1),F8      ;drop DSUBUI & BNEZ
      L.D     F10,-16(R1)
      ADD.D   F12,F10,F2
      S.D     -16(R1),F12    ;drop DSUBUI & BNEZ
      L.D     F14,-24(R1)
      ADD.D   F16,F14,F2
      S.D     -24(R1),F16
      DSUBUI  R1,R1,#32      ;alter to 4*8
      BNEZ    R1,LOOP
      NOP
The original "register renaming"
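The renaming done by hand on the unrolled loop can be sketched as a small pass: every write gets a fresh register, and later reads follow the most recent mapping. This is an illustrative sketch only; the `rename` function, the tuple encoding of instructions, and the list of fresh names are assumptions made here (memory offsets are omitted), and the fresh names must not collide with registers that are read but never written, such as F2.

```python
def rename(instrs, fresh_names):
    """Give every register write a fresh name so only true (RAW) dependences remain."""
    fresh = iter(fresh_names)
    current = {}                      # architectural name -> latest renamed name
    renamed = []
    for op, dest, srcs in instrs:
        # Sources read the latest version of each register (or the original name).
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = dest
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# One loop body (offsets dropped), unrolled four times as on the slide.
body = [("L.D", "F0", ("R1",)), ("ADD.D", "F4", ("F0", "F2")), ("S.D", None, ("F4", "R1"))]
unrolled = body * 4
fresh = ["F0", "F4", "F6", "F8", "F10", "F12", "F14", "F16"]

renamed = rename(unrolled, fresh)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

With this choice of fresh names the output uses the same registers as the renamed loop above: F0/F4, F6/F8, F10/F12, and F14/F16 for the four copies.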
Definition: Control Dependencies


Example: if p1 {S1;}; if p2 {S2;};
  S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1
Two constraints on control dependences:
  An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
  An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch
L:
DADDU R5, R6, R7
ADD R1, R2, R3
BEQZ R4, L
SUB R1, R5, R6
OR R7, R1, R8
12
Dynamically Scheduled Pipelines
Overcoming Data Hazards with Dynamic Scheduling
Why in HW at run time?
  Works when real dependences can't be known at compile time
  Simpler compiler
  Code for one machine runs well on another
Example
  DIV.D F0,F2,F4
  ADD.D F10,F0,F8
  SUB.D F12,F8,F12
  SUB.D cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall; yet SUB.D is not data dependent on anything!
Key idea: allow instructions behind a stall to proceed
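A rough model of the example shows what letting instructions behind a stall proceed buys. The sketch below assumes one instruction is fetched per cycle, tracks only RAW dependences through registers, and uses made-up latencies (40 cycles for a divide, 2 for add/subtract); `start_times` is a hypothetical helper, and structural, WAR, and WAW hazards are ignored.

```python
def start_times(instrs, latency, in_order):
    """Cycle in which each instruction starts executing (RAW dependences only)."""
    ready = {}          # register -> cycle in which its new value is available
    starts = []
    prev_start = 0
    for n, (op, dest, srcs) in enumerate(instrs, start=1):
        # Instruction n arrives in cycle n and needs all of its sources.
        t = max([n] + [ready.get(s, 0) for s in srcs])
        if in_order:
            # A stalled instruction blocks everything behind it.
            t = max(t, prev_start + 1)
        starts.append(t)
        prev_start = t
        ready[dest] = t + latency[op]
    return starts


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F10", ("F0", "F8")),
    ("SUB.D", "F12", ("F8", "F12")),
]
latency = {"DIV.D": 40, "ADD.D": 2, "SUB.D": 2}   # assumed values

print("in order:    ", start_times(program, latency, in_order=True))
print("out of order:", start_times(program, latency, in_order=False))
```

Under these assumptions SUB.D starts in cycle 42 when it must wait behind ADD.D, and in cycle 3 when it is allowed to proceed.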
Overcoming Data Hazards with Dynamic Scheduling (cont’d)
Enables out-of-order execution => out-of-order completion
Out-of-order execution divides the ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
Scoreboarding: technique for allowing instructions to execute out of order when there are sufficient resources and no data dependencies (CDC 6600, 1963)
Scoreboarding Implications

Out-of-order completion => WAR, WAW hazards?
  WAR example (SUB.D writes F8, which ADD.D reads):
    DIV.D F0,F2,F4
    ADD.D F10,F0,F8
    SUB.D F8,F8,F12
  WAW example (ADD.D and SUB.D both write F10):
    DIV.D F0,F2,F4
    ADD.D F10,F0,F8
    SUB.D F10,F8,F12
Solutions for WAR
  Queue both the operation and copies of its operands
  Read registers only during the Read Operands stage
For WAW, must detect the hazard: stall until the other instruction completes
Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
Scoreboard keeps track of dependencies and the state of operations
Scoreboard replaces ID, EX, WB with 4 stages
Four Stages of Scoreboard Control
ID1 (Issue): decode instructions & check for structural hazards
ID2 (Read operands): wait until no data hazards, then read operands
EX (Execute): operate on operands; when the result is ready, the functional unit notifies the scoreboard that it has completed execution
WB (Write result): finish execution; the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction
Example:
  DIV.D F0,F2,F4
  ADD.D F10,F0,F8
  SUB.D F8,F8,F12
  Scoreboarding stalls SUB.D in its write result stage until ADD.D reads its operands
Four Stages of Scoreboard Control
1. Issue: decode instructions & check for structural hazards (ID1)
   If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands: wait until no data hazards, then read operands (ID2)
   A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
Four Stages of Scoreboard Control
3. Execution: operate on operands (EX)
   The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result: finish execution (WB)
   Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction.
Example:
  DIV.D F0,F2,F4
  ADD.D F10,F0,F8
  SUB.D F8,F8,F14
  The CDC 6600 scoreboard would stall SUB.D until ADD.D reads its operands
Three Parts of the Scoreboard


1. Instruction status: which of the 4 steps the instruction is in (capacity = window size)
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit
   Busy: indicates whether the unit is busy or not
   Op: operation to perform in the unit (e.g., + or -)
   Fi: destination register
   Fj, Fk: source-register numbers
   Qj, Qk: functional units producing source registers Fj, Fk
   Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register
MIPS with a Scoreboard
[Figure: MIPS with a scoreboard. Labeled blocks: Registers, FP Mult, FP Div, Add1, Add2, Add3, and the Scoreboard, with Control/Status paths between the scoreboard and the datapath.]
Detailed Scoreboard Pipeline Control
Issue
  Wait until:  not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2;
               Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
Read operands
  Wait until:  Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No
Execution complete
  Wait until:  functional unit done
Write result
  Wait until:  ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes);
               Result(Fi(FU)) ← 0; Busy(FU) ← No
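The wait conditions and bookkeeping can be sketched in a few lines of code. This is a minimal illustration, not the CDC 6600 design: the class and method names are hypothetical, the execution stage is left out, and clearing Qj/Qk when a producer writes its result is a choice of this sketch to avoid matching stale entries. The example assumes two adders (Add1, Add2), so that SUB.D can issue while ADD.D is still waiting.

```python
class FU:
    """One functional-unit status entry with the fields named on the slides."""
    def __init__(self, name):
        self.name, self.busy = name, False
        self.op = self.fi = self.fj = self.fk = None
        self.qj = self.qk = None      # FUs that will produce Fj, Fk
        self.rj = self.rk = False     # operand ready and not yet read


class Scoreboard:
    def __init__(self, fu_names):
        self.fus = {n: FU(n) for n in fu_names}
        self.result = {}              # register -> FU that will write it

    def can_issue(self, fu, dest):
        # Wait until: not Busy(FU) and not Result(D)  (structural and WAW check)
        return not self.fus[fu].busy and dest not in self.result

    def issue(self, fu, op, dest, s1, s2):
        u = self.fus[fu]
        u.busy, u.op, u.fi, u.fj, u.fk = True, op, dest, s1, s2
        u.qj, u.qk = self.result.get(s1), self.result.get(s2)
        u.rj, u.rk = u.qj is None, u.qk is None
        self.result[dest] = fu

    def can_read(self, fu):
        # Wait until: Rj and Rk  (RAW check)
        return self.fus[fu].rj and self.fus[fu].rk

    def read_operands(self, fu):
        self.fus[fu].rj = self.fus[fu].rk = False

    def can_write(self, fu):
        # WAR check: no busy unit still has to read the old value of Fi(FU).
        d = self.fus[fu].fi
        return all((f.fj != d or not f.rj) and (f.fk != d or not f.rk)
                   for f in self.fus.values() if f.busy)

    def write_result(self, fu):
        for f in self.fus.values():
            if f.qj == fu:
                f.rj, f.qj = True, None   # clearing Q is a choice of this sketch
            if f.qk == fu:
                f.rk, f.qk = True, None
        del self.result[self.fus[fu].fi]
        self.fus[fu].busy = False


sb = Scoreboard(["Divide", "Add1", "Add2"])    # two adders is an assumption
sb.issue("Divide", "Div", "F0", "F2", "F4")    # DIV.D F0,F2,F4
sb.issue("Add1", "Add", "F10", "F0", "F8")     # ADD.D F10,F0,F8 (RAW on F0)
waw_blocked = not sb.can_issue("Add2", "F10")  # a second writer of F10 would stall
sb.issue("Add2", "Sub", "F8", "F8", "F14")     # SUB.D F8,F8,F14
sb.read_operands("Divide")
sb.read_operands("Add2")
war_stall = not sb.can_write("Add2")           # ADD.D has not read F8 yet
sb.write_result("Divide")                      # wakes up ADD.D
sb.read_operands("Add1")
war_cleared = sb.can_write("Add2")
print(waw_blocked, war_stall, war_cleared)
```

The run follows the DIV.D / ADD.D / SUB.D example: SUB.D finishes early but is held in its write stage until ADD.D has read F8.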
Scoreboard Example
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2
L.D   F2, 45+R3
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Functional unit status (Fi = dest; Fj, Fk = S1, S2; Qj, Qk = FU for j, k; Rj, Rk = Fj?, Fk?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (FU that will write each register)
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
Scoreboard Example: Cycle 1
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1
L.D   F2, 45+R3
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
1                          Integer

Issue 1st L.D!
Scoreboard Example: Cycle 2
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2
L.D   F2, 45+R3
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
2                          Integer

Issue 2nd L.D? Structural hazard!
No further instructions will issue!
Scoreboard Example: Cycle 3
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3
L.D   F2, 45+R3
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
3                          Integer

Issue MUL.D?
Scoreboard Example: Cycle 4
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
4                          Integer

Check for WAR hazards!
If none, write result!
Scoreboard Example: Cycle 5
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
5             Integer

Issue 2nd L.D!
Scoreboard Example: Cycle 6
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6
MUL.D F0, F2, F4   6
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
6      Mult1  Integer

Issue MUL.D!
Scoreboard Example: Cycle 7
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7
MUL.D F0, F2, F4   6
SUB.D F8, F6, F2   7
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
      Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
7      Mult1  Integer               Add

Issue SUB.D!
Scoreboard Example: Cycle 8
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6
SUB.D F8, F6, F2   7
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
8      Mult1  Integer               Add  Divide

Issue DIV.D!
Scoreboard Example: Cycle 9
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9
SUB.D F8, F6, F2   7      9
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
10    Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
2     Add      Yes   Sub   F8   F6   F2            Integer  Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
9      Mult1                        Add  Divide

Read operands for MUL.D and SUB.D!
Assume we can feed the Mult1 and Add units in the same clock cycle.
Issue ADD.D? Structural hazard (unit is busy)!
Scoreboard Example: Cycle 11
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9
SUB.D F8, F6, F2   7      9              11
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
8     Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2            Integer  Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
11     Mult1                        Add  Divide

Last cycle of SUB.D execution.
Scoreboard Example: Cycle 12
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
7     Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2            Integer  Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
12     Mult1                        Add  Divide

Check WAR on F8. Write F8.
Scoreboard Example: Cycle 13
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2   13

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
6     Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
13     Mult1               Add           Divide

Issue ADD.D!
Scoreboard Example: Cycle 14
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2   13     14

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
5     Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
2     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
14     Mult1               Add           Divide

Read operands for ADD.D!
Scoreboard Example: Cycle 15
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2   13     14

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
4     Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
1     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
15     Mult1               Add           Divide
Scoreboard Example: Cycle 16
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2   13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
3     Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
16     Mult1               Add           Divide
Scoreboard Example: Cycle 17
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2   13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
2     Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
17     Mult1               Add           Divide

Why can't ADD.D write F6?
Scoreboard Example: Cycle 19
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9              19
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2   13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
0     Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
19     Mult1               Add           Divide
Scoreboard Example: Cycle 20
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9              19                  20
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8
ADD.D F6, F8, F2   13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
20     Mult1               Add           Divide
Scoreboard Example: Cycle 21
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9              19                  20
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8      21
ADD.D F6, F8, F2   13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             Yes  Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
21                         Add           Divide
Scoreboard Example: Cycle 22
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9              19                  20
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8      21
ADD.D F6, F8, F2   13     14             16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
40    Divide   Yes   Div   F10  F0   F6   Mult1             Yes  Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
22                         Add           Divide

Write F6?
Scoreboard Example: Cycle 61
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9              19                  20
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8      21             61
ADD.D F6, F8, F2   13     14             16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   Yes   Div   F10  F0   F6   Mult1             Yes  Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
61                                       Divide
Scoreboard Example: Cycle 62
Instruction status
Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6, 34+R2    1      2              3                   4
L.D   F2, 45+R3    5      6              7                   8
MUL.D F0, F2, F4   6      9              19                  20
SUB.D F8, F6, F2   7      9              11                  12
DIV.D F10, F0, F6  8      21             61                  62
ADD.D F6, F8, F2   13     14             16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   Yes   Div   F10  F0   F6   Mult1             Yes  Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
62                                       Divide
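The final instruction status table can be sanity-checked with a little arithmetic. The sketch below is only a consistency check on the numbers in the example, not a scoreboard simulator: the latencies for multiply, add/subtract, and divide are read off the countdown timers (10, 2, 40), the one-cycle load latency is inferred from the table, and the variable names are made up for illustration.

```python
# Execution latencies in cycles; the L.D value is inferred from the table.
latency = {"L.D": 1, "MUL.D": 10, "SUB.D": 2, "DIV.D": 40, "ADD.D": 2}

# (instruction, issue, read operands, execution complete, write result)
timeline = [
    ("L.D", 1, 2, 3, 4),
    ("L.D", 5, 6, 7, 8),
    ("MUL.D", 6, 9, 19, 20),
    ("SUB.D", 7, 9, 11, 12),
    ("DIV.D", 8, 21, 61, 62),
    ("ADD.D", 13, 14, 16, 22),
]

# Execution completes `latency` cycles after the operands are read.
consistent = all(done == read + latency[op] for op, _, read, done, _ in timeline)

sub_write = timeline[3][4]
div_read = timeline[4][2]
add_issue, add_write = timeline[5][1], timeline[5][4]
# Structural hazard: ADD.D can only issue once SUB.D has freed the Add unit.
structural_ok = add_issue == sub_write + 1
# WAR hazard: ADD.D can only write F6 after DIV.D has read it.
war_ok = add_write == div_read + 1
print(consistent, structural_ok, war_ok)
```

The two stalls that stand out in the table are exactly the ones the scoreboard cannot remove: the busy Add unit delays the issue of ADD.D, and the antidependence on F6 delays its write.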
Scoreboard Results

For the CDC 6600
  70% improvement for Fortran
  150% improvement for hand-coded assembly language
  cost was similar to one of the functional units
    surprisingly low
    bulk of cost was in the extra busses
Still, this was in ancient times
  no caches & no main semiconductor memory
  no software pipelining
  compilers?
So, why is it coming back?
  performance via ILP
Scoreboard Limitations

Amount of parallelism among instructions
  can we find independent instructions to execute
Number of scoreboard entries
  how far ahead the pipeline can look for independent instructions (we assume a window does not extend beyond a branch)
Number and types of functional units
  avoid structural hazards
Presence of antidependences and output dependences
  WAR and WAW stalls become more important
Things to Remember




Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
Data dependencies
Dynamic scheduling to minimise stalls
Dynamic scheduling with a scoreboard
Tomasulo’s Algorithm



Used in IBM 360/91 FPU (before caches)
Goal: high FP performance without special compilers
Conditions:
  Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
  Long memory accesses and long FP delays
  This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
Why study a 1966 computer?
  The descendants of this have flourished!
  Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …
Tomasulo’s Algorithm (cont’d)

Control & buffers distributed with Function Units (FU)
  FU buffers called "reservation stations" => buffer the operands of instructions waiting to issue
Registers in instructions replaced by values or pointers to reservation stations (RS) => register renaming
  avoids WAR, WAW hazards
  More reservation stations than registers, so can do optimizations compilers can't
Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
Loads and stores treated as FUs with RSs as well
Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
Tomasulo-based FPU for MIPS
[Figure: Tomasulo-based FPU for MIPS. From the instruction unit: FP Op Queue and FP Registers. From memory: load buffers Load1 to Load6. To memory: store buffers Store1 to Store3. Reservation stations Add1 to Add3 feed the FP adders and Mult1 to Mult2 feed the FP multipliers. Results are returned over the Common Data Bus (CDB).]
Reservation Station Components


Op: operation to perform in the unit (e.g., + or -)
Vj, Vk: values of the source operands
  Store buffers have a V field: the result to be stored
Qj, Qk: reservation stations producing the source registers (value to be written)
  Note: Qj/Qk = 0 => source operand is already available in Vj/Vk
  Store buffers only have Qi, for the RS producing the result
Busy: indicates the reservation station or FU is busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers)
2. Execute: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
   Write it on Common Data Bus to all awaiting units; mark reservation station available
Normal data bus: data + destination ("go to" bus)
Common data bus: data + source ("come from" bus)
   64 bits of data + 4 bits of Functional Unit source address
   Write if matches expected Functional Unit (produces result)
   Does the broadcast
Example speed: 2 clocks for FP +, -; 10 for *; 40 clocks for /
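The three stages can be sketched with a tiny model of reservation stations and the Common Data Bus. This is an illustrative sketch under simplifying assumptions, not the IBM 360/91 design: the `RS` and `Tomasulo` classes and their method names are invented here, timing is ignored, results are computed by hand, and the register values are arbitrary.

```python
class RS:
    """Reservation station: operand values (Vj, Vk) or producer tags (Qj, Qk)."""
    def __init__(self, name):
        self.name, self.busy = name, False
        self.op = self.vj = self.vk = self.qj = self.qk = None


class Tomasulo:
    def __init__(self, rs_names, regs):
        self.rs = {n: RS(n) for n in rs_names}
        self.regs = dict(regs)    # register -> value
        self.qi = {}              # register -> RS that will produce its value

    def _operand(self, reg):
        # Return (tag, value): a tag if the value is still being produced.
        if reg in self.qi:
            return self.qi[reg], None
        return None, self.regs[reg]

    def issue(self, name, op, dest, s1, s2):
        r = self.rs[name]
        if r.busy:
            return False          # structural hazard: no free station
        r.busy, r.op = True, op
        r.qj, r.vj = self._operand(s1)
        r.qk, r.vk = self._operand(s2)
        self.qi[dest] = name      # renaming: later readers wait on this station
        return True

    def ready(self, name):
        r = self.rs[name]
        return r.busy and r.qj is None and r.qk is None

    def broadcast(self, name, value):
        # Common Data Bus: (source tag, value) goes to every waiting consumer.
        for r in self.rs.values():
            if r.qj == name:
                r.qj, r.vj = None, value
            if r.qk == name:
                r.qk, r.vk = None, value
        for reg, tag in list(self.qi.items()):
            if tag == name:
                self.regs[reg] = value
                del self.qi[reg]
        self.rs[name].busy = False


t = Tomasulo(["Add1", "Mult2"], {"F0": 9.0, "F2": 1.0, "F6": 3.0, "F8": 5.0, "F10": 0.0})
t.issue("Mult2", "DIVD", "F10", "F0", "F6")   # DIVD F10,F0,F6 copies F6 = 3.0
t.issue("Add1", "ADDD", "F6", "F8", "F2")     # ADDD F6,F8,F2 (WAR on F6)
t.broadcast("Add1", t.rs["Add1"].vj + t.rs["Add1"].vk)    # ADDD finishes first
t.issue("Add1", "SUBD", "F8", "F10", "F2")    # SUBD F8,F10,F2 waits on Mult2
waiting = not t.ready("Add1")
t.broadcast("Mult2", t.rs["Mult2"].vj / t.rs["Mult2"].vk)  # DIVD uses the old F6
print(t.regs["F6"], t.regs["F10"], waiting, t.ready("Add1"))
```

Because DIVD captured the old value of F6 at issue, ADDD can write F6 before DIVD executes; this is the WAR stall that the scoreboard had to enforce and that renaming removes.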
Tomasulo Example

Instruction status (the instruction stream)
Instruction        Issue  Exec complete  Write result
LD    F6, 34+R2
LD    F2, 45+R3
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Load buffers (3 load buffers)
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (3 FP adder RS, 2 FP multiplier RS; Time is the FU countdown)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status (Clock is the clock cycle counter)
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
0
Tomasulo Example: Cycle 1

Instruction status
Instruction        Issue  Exec complete  Write result
LD    F6, 34+R2    1
LD    F2, 45+R3
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
1                        Load1
Tomasulo Example: Cycle 2

Instruction status
Instruction        Issue  Exec complete  Write result
LD    F6, 34+R2    1
LD    F2, 45+R3    2
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
2             Load2      Load1

Note: can have multiple loads outstanding
Tomasulo Example: Cycle 3

Instruction status
Instruction        Issue  Exec complete  Write result
LD    F6, 34+R2    1      3
LD    F2, 45+R3    2
MULTD F0, F2, F4   3
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No

Register result status
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
3      Mult1  Load2      Load1

Note: register names are removed ("renamed") in the reservation stations; MULTD issued
Load1 completing; what is waiting for Load1?
Tomasulo Example: Cycle 4

Instruction status
Instruction        Issue  Exec complete  Write result
LD    F6, 34+R2    1      3              4
LD    F2, 45+R3    2      4
MULTD F0, F2, F4   3
SUBD  F8, F6, F2   4
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Load buffers
Name   Busy  Address
Load1  No
Load2  Yes   45+R3
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   Yes   SUBD   M(A1)                Load2
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No

Register result status
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
4      Mult1  Load2      M(A1)    Add1

Load2 completing; what is waiting for Load2?
Tomasulo Example: Cycle 5

Instruction status
Instruction        Issue  Exec complete  Write result
LD    F6, 34+R2    1      3              4
LD    F2, 45+R3    2      4              5
MULTD F0, F2, F4   3
SUBD  F8, F6, F2   4
DIVD  F10, F0, F6  5
ADDD  F6, F8, F2

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
2     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   No
      Add3   No
10    Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
5      Mult1  M(A2)      M(A1)    Add1   Mult2

Timer starts counting down for Add1, Mult1
Tomasulo Example: Cycle 6

Instruction status
Instruction        Issue  Exec complete  Write result
LD    F6, 34+R2    1      3              4
LD    F2, 45+R3    2      4              5
MULTD F0, F2, F4   3
SUBD  F8, F6, F2   4
DIVD  F10, F0, F6  5
ADDD  F6, F8, F2   6

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
1     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   Yes   ADDD          M(A2)  Add1
      Add3   No
9     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
6      Mult1  M(A2)      Add2     Add1   Mult2

Issue ADDD here despite name dependency on F6?
Tomasulo Example: Cycle 7

Instruction status
Instruction        Issue  Exec complete  Write result
LD    F6, 34+R2    1      3              4
LD    F2, 45+R3    2      4              5
MULTD F0, F2, F4   3
SUBD  F8, F6, F2   4      7
DIVD  F10, F0, F6  5
ADDD  F6, F8, F2   6

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   Yes   ADDD          M(A2)  Add1
      Add3   No
8     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
7      Mult1  M(A2)      Add2     Add1   Mult2

Add1 (SUBD) completing; what is waiting for it?
Tomasulo Example: Cycle 8

Instruction status
Instruction        Issue  Exec complete  Write result
LD    F6, 34+R2    1      3              4
LD    F2, 45+R3    2      4              5
MULTD F0, F2, F4   3
SUBD  F8, F6, F2   4      7              8
DIVD  F10, F0, F6  5
ADDD  F6, F8, F2   6

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
2     Add2   Yes   ADDD   (M-M)  M(A2)
      Add3   No
7     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
8      Mult1  M(A2)      Add2     (M-M)  Mult2
Tomasulo Example Cycle 9

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
1     Add2   Yes   ADDD   (M-M)  M(A2)
      Add3   No
6     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (Clock 9):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    Mult1    M(A2)             Add2     (M-M)    Mult2

Tomasulo Example Cycle 10

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
0     Add2   Yes   ADDD   (M-M)  M(A2)
      Add3   No
5     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (Clock 10):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    Mult1    M(A2)             Add2     (M-M)    Mult2

• Add2 (ADDD) completing; what is waiting for it?

Tomasulo Example Cycle 11

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (Clock 11):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    Mult1    M(A2)             (M-M+M)  (M-M)    Mult2

• Write result of ADDD here?
• All quick instructions complete in this cycle!

Tomasulo Example Cycle 12

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (Clock 12):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    Mult1    M(A2)             (M-M+M)  (M-M)    Mult2

Tomasulo Example Cycle 13

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
2     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (Clock 13):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    Mult1    M(A2)             (M-M+M)  (M-M)    Mult2

Tomasulo Example Cycle 14

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
1     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (Clock 14):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    Mult1    M(A2)             (M-M+M)  (M-M)    Mult2

Tomasulo Example Cycle 15

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
0     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (Clock 15):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    Mult1    M(A2)             (M-M+M)  (M-M)    Mult2

• Mult1 (MULTD) completing; what is waiting for it?

Tomasulo Example Cycle 16

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
40    Mult2  Yes   DIVD   M*F4   M(A1)

Register result status (Clock 16):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    M*F4     M(A2)             (M-M+M)  (M-M)    Mult2

• Just waiting for Mult2 (DIVD) to complete

Tomasulo Example Cycle 55

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
1     Mult2  Yes   DIVD   M*F4   M(A1)

Register result status (Clock 55):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    M*F4     M(A2)             (M-M+M)  (M-M)    Mult2

Tomasulo Example Cycle 56

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5      56
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   DIVD   M*F4   M(A1)

Register result status (Clock 56):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    M*F4     M(A2)             (M-M+M)  (M-M)    Mult2

• Mult2 (DIVD) is completing; what is waiting for it?

Tomasulo Example Cycle 57

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5      56         57
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status (Clock 57):
      F0       F2       F4       F6       F8       F10      F12  ...  F30
FU    M*F4     M(A2)             (M-M+M)  (M-M)    Result

• Once again: in-order issue, out-of-order execution, and out-of-order completion.

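The issue, completion, and write-result cycles in the example can be reproduced with a small timing model. The sketch below is an illustration, not the lecture's own code: it assumes one instruction issues per cycle, that a reservation station or load buffer is always free, that execution is counted from the later of the issue cycle and the cycle in which the last operand is written on the CDB, and that results never compete for the CDB. The latencies (2 for loads and adds, 10 for multiply, 40 for divide) are the ones the slides use; the names `LATENCY`, `Timing`, and `schedule` are made up for this example.

```python
from dataclasses import dataclass

# Latencies in clock cycles, as used in the Tomasulo example slides.
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}


@dataclass
class Timing:
    issue: int
    complete: int
    write: int


def schedule(program):
    """Return one Timing per (op, dest, sources) tuple, in program order."""
    ready_at = {}  # register -> cycle in which its latest producer writes the CDB
    timings = []
    for issue, (op, dest, sources) in enumerate(program, start=1):
        # Sources are looked up at issue time, so a later writer of the same
        # register (e.g. ADDD writing F6) cannot disturb an earlier reader (DIVD).
        operands_ready = max((ready_at.get(reg, 0) for reg in sources), default=0)
        start = max(issue, operands_ready)
        complete = start + LATENCY[op]
        write = complete + 1
        ready_at[dest] = write
        timings.append(Timing(issue, complete, write))
    return timings


PROGRAM = [
    ("LD", "F6", ["R2"]),
    ("LD", "F2", ["R3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]

for (op, dest, _), t in zip(PROGRAM, schedule(PROGRAM)):
    print(f"{op:6}{dest:4} issue={t.issue:2} comp={t.complete:2} write={t.write:2}")
```

Because operands are resolved when an instruction issues, DIVD picks up the F6 produced by the first load even though ADDD later overwrites F6, which is the renaming effect the example illustrates.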
Tomasulo Drawbacks

• Complexity
  - delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
  - Each CDB must go to multiple functional units => high capacitance, high wiring density
  - Number of functional units that can complete per cycle limited to one!
  - Multiple CDBs => more FU logic for parallel assoc stores
• Non-precise interrupts!
  - We will address this later

Tomasulo Loop Example

Loop: LD     F0  0   R1
      MULTD  F4  F0  F2
      SD     F4  0   R1
      SUBI   R1  R1  #8
      BNEZ   R1  Loop

• This time assume Multiply takes 4 clocks
• Assume 1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit)
• To be clear, will show clocks for SUBI, BNEZ
  - Reality: integer instructions ahead of Fl. Pt. instructions
• Show 2 iterations

Loop Example

Instruction status (ITER = iteration count):
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1
1     MULTD  F4    F0   F2
1     SD     F4    0    R1
2     LD     F0    0    R1
2     MULTD  F4    F0   F2
2     SD     F4    0    R1

Load and store buffers (added store buffers):
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   No
Store1  No
Store2  No
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Code (instruction loop):
LD     F0  0   R1
MULTD  F4  F0  F2
SD     F4  0   R1
SUBI   R1  R1  #8
BNEZ   R1  Loop

Register result status (Clock 0, R1 = 80):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu

• R1: value of register used for address, iteration control

Loop Example Cycle 1

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   No
Load3   No
Store1  No
Store2  No
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status (Clock 1, R1 = 80):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load1

Loop Example Cycle 2

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1
1     MULTD  F4    F0   F2   2

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   No
Load3   No
Store1  No
Store2  No
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load1
      Mult2  No

Register result status (Clock 2, R1 = 80):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load1         Mult1

Loop Example Cycle 3

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   No
Load3   No
Store1  Yes   80    Mult1
Store2  No
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load1
      Mult2  No

Register result status (Clock 3, R1 = 80):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load1         Mult1

• Implicit renaming sets up data flow graph

Loop Example Cycle 4

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   No
Load3   No
Store1  Yes   80    Mult1
Store2  No
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load1
      Mult2  No

Register result status (Clock 4, R1 = 80):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load1         Mult1

Loop Example Cycle 5

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   No
Load3   No
Store1  Yes   80    Mult1
Store2  No
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load1
      Mult2  No

Register result status (Clock 5, R1 = 72):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load1         Mult1

Loop Example Cycle 6

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   Yes   72
Load3   No
Store1  Yes   80    Mult1
Store2  No
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load1
      Mult2  No

Register result status (Clock 6, R1 = 72):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load2         Mult1

Loop Example Cycle 7

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6
2     MULTD  F4    F0   F2   7

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   Yes   72
Load3   No
Store1  Yes   80    Mult1
Store2  No
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load1
      Mult2  Yes   MULTD         R(F2)  Load2

Register result status (Clock 7, R1 = 72):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load2         Mult2

Loop Example Cycle 8

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6
2     MULTD  F4    F0   F2   7
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   Yes   72
Load3   No
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load1
      Mult2  Yes   MULTD         R(F2)  Load2

Register result status (Clock 8, R1 = 72):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load2         Mult2

Loop Example Cycle 9

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6
2     MULTD  F4    F0   F2   7
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   Yes   72
Load3   No
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load1
      Mult2  Yes   MULTD         R(F2)  Load2

Register result status (Clock 9, R1 = 72):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load2         Mult2

Loop Example Cycle 10

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6      10
2     MULTD  F4    F0   F2   7
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   Yes   72
Load3   No
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   MULTD  M[80]  R(F2)
      Mult2  Yes   MULTD         R(F2)  Load2

Register result status (Clock 10, R1 = 64):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load2         Mult2

Loop Example Cycle 11

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6      10         11
2     MULTD  F4    F0   F2   7
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   MULTD  M[80]  R(F2)
4     Mult2  Yes   MULTD  M[72]  R(F2)

Register result status (Clock 11, R1 = 64):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load3         Mult2

Loop Example Cycle 12

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6      10         11
2     MULTD  F4    F0   F2   7
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
2     Mult1  Yes   MULTD  M[80]  R(F2)
3     Mult2  Yes   MULTD  M[72]  R(F2)

Register result status (Clock 12, R1 = 64):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load3         Mult2

Loop Example Cycle 13

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6      10         11
2     MULTD  F4    F0   F2   7
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
1     Mult1  Yes   MULTD  M[80]  R(F2)
2     Mult2  Yes   MULTD  M[72]  R(F2)

Register result status (Clock 13, R1 = 64):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load3         Mult2

Loop Example Cycle 14

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2      14
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6      10         11
2     MULTD  F4    F0   F2   7
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
0     Mult1  Yes   MULTD  M[80]  R(F2)
1     Mult2  Yes   MULTD  M[72]  R(F2)

Register result status (Clock 14, R1 = 64):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load3         Mult2

Loop Example Cycle 15

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2      14         15
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6      10         11
2     MULTD  F4    F0   F2   7      15
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    [80]*R2
Store2  Yes   72    Mult2
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   MULTD  M[72]  R(F2)

Register result status (Clock 15, R1 = 64):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load3         Mult2

Loop Example Cycle 16

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2      14         15
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6      10         11
2     MULTD  F4    F0   F2   7      15         16
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    [80]*R2
Store2  Yes   72    [72]*R2
Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   MULTD         R(F2)  Load3
      Mult2  No

Register result status (Clock 16, R1 = 64):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load3         Mult1

Loop Example Cycle 17

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2      14         15
1     SD     F4    0    R1   3
2     LD     F0    0    R1   6      10         11
2     MULTD  F4    F0   F2   7      15         16
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    [80]*R2
Store2  Yes   72    [72]*R2
Store3  Yes   64    Mult1

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load3
      Mult2  No

Register result status (Clock 17, R1 = 64):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load3         Mult1

Loop Example Cycle 18

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2      14         15
1     SD     F4    0    R1   3      18
2     LD     F0    0    R1   6      10         11
2     MULTD  F4    F0   F2   7      15         16
2     SD     F4    0    R1   8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    [80]*R2
Store2  Yes   72    [72]*R2
Store3  Yes   64    Mult1

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load3
      Mult2  No

Register result status (Clock 18, R1 = 64):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load3         Mult1

Loop Example Cycle 19

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2      14         15
1     SD     F4    0    R1   3      18         19
2     LD     F0    0    R1   6      10         11
2     MULTD  F4    F0   F2   7      15         16
2     SD     F4    0    R1   8      19

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  No
Store2  Yes   72    [72]*R2
Store3  Yes   64    Mult1

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load3
      Mult2  No

Register result status (Clock 19, R1 = 56):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load3         Mult1

Loop Example Cycle 20

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD     F0    0    R1   1      9          10
1     MULTD  F4    F0   F2   2      14         15
1     SD     F4    0    R1   3      18         19
2     LD     F0    0    R1   6      10         11
2     MULTD  F4    F0   F2   7      15         16
2     SD     F4    0    R1   8      19         20

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   56
Load2   No
Load3   Yes   64
Store1  No
Store2  No
Store3  Yes   64    Mult1

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F2)  Load3
      Mult2  No

Register result status (Clock 20, R1 = 56):
      F0     F2     F4     F6     F8     F10    F12  ...  F30
Fu    Load1         Mult1

• Once again: in-order issue, out-of-order execution, and out-of-order completion.

Why can Tomasulo overlap iterations of loops?

• Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
• Reservation stations
  - Permit instruction issue to advance past integer control flow operations
  - Also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard
• Other perspective: Tomasulo building data flow dependency graph on the fly

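The renaming step can be illustrated with a few lines of code. This is a sketch under simplifying assumptions, not the hardware algorithm in full: stations are handed out in order and never freed, values are ignored, and the names `POOLS`, `rename`, and `ITERATION` are hypothetical. Each source register is replaced by the tag of the station that will produce it, which is what lets the second iteration use Load2 and Mult2 while the first is still in flight.

```python
# Station names follow the slides; treating each pool as a simple
# first-come list is an assumption of this sketch.
POOLS = {
    "LD": ["Load1", "Load2", "Load3"],
    "MULTD": ["Mult1", "Mult2"],
    "SD": ["Store1", "Store2", "Store3"],
}


def rename(instructions):
    """Replace each source register by the tag of its latest producer."""
    free = {op: list(names) for op, names in POOLS.items()}
    reg_status = {}  # register -> tag of the station that will write it
    renamed = []
    for op, dest, sources in instructions:
        # A register with no pending writer is read directly (keeps its name).
        operands = [reg_status.get(reg, reg) for reg in sources]
        tag = free[op].pop(0)
        if dest is not None:
            reg_status[dest] = tag
        renamed.append((tag, op, operands))
    return renamed


# One loop iteration: LD F0,0(R1); MULTD F4,F0,F2; SD F4,0(R1)
ITERATION = [
    ("LD", "F0", ["R1"]),
    ("MULTD", "F4", ["F0", "F2"]),
    ("SD", None, ["F4", "R1"]),
]

renamed = rename(ITERATION * 2)  # two iterations back to back
for row in renamed:
    print(row)
```

In the output, the second MULTD waits on Load2 rather than on F0, so it cannot be confused with the first iteration's load; this matches the Qj entries shown for Mult1 and Mult2 in the loop example.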
Tomasulo’s scheme offers 2 major advantages

• (1) the distribution of the hazard detection logic
  - distributed reservation stations and the CDB
  - If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by broadcast on the CDB
  - If a centralized register file were used, the units would have to read their results from the registers when register buses are available
• (2) the elimination of stalls for WAW and WAR hazards

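A minimal sketch of the CDB broadcast follows, assuming each reservation station holds either a value (Vj, Vk) or the tag of the producer it waits for (Qj, Qk). The `Station` class and `broadcast` function are hypothetical names for illustration. The scenario mirrors cycle 5 of the earlier example, where the result of Load2 releases both Add1 (SUBD) and Mult1 (MULTD) in the same cycle while Mult2 (DIVD) keeps waiting for Mult1; the numeric values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    name: str
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None  # tag of the producer of operand j, if pending
    qk: Optional[str] = None  # tag of the producer of operand k, if pending

    def ready(self) -> bool:
        return self.qj is None and self.qk is None


def broadcast(stations, tag, value):
    """Deliver (tag, value) to every station; return the names that became ready."""
    woken = []
    for s in stations:
        was_ready = s.ready()
        if s.qj == tag:
            s.vj, s.qj = value, None
        if s.qk == tag:
            s.vk, s.qk = value, None
        if not was_ready and s.ready():
            woken.append(s.name)
    return woken


add1 = Station("Add1", vj=1.0, qk="Load2")    # SUBD F8,F6,F2 waits for F2
mult1 = Station("Mult1", vk=2.0, qj="Load2")  # MULTD F0,F2,F4 waits for F2
mult2 = Station("Mult2", vk=1.0, qj="Mult1")  # DIVD F10,F0,F6 waits for F0

woken = broadcast([add1, mult1, mult2], "Load2", 3.5)
print(woken)
```

Every station compares the broadcast tag against its own Qj and Qk, which is the associative matching that the drawbacks slide identifies as a cost of the CDB.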
Multiple Issue

• Allow multiple instructions to issue in a single clock cycle (CPI < 1)
• Two flavors
  - Superscalar
    - Issue varying number of instructions per clock
    - Can be statically (compiler tech.) or dynamically (Tomasulo) scheduled
  - VLIW (Very Long Instruction Word)
    - Issue a fixed number of instructions formatted as a single long instruction or as a fixed instruction packet

Multiple Issue with Dynamic Scheduling

[Figure: block diagram with the following labeled parts: FP Op Queue (from instruction unit); FP Registers; Load Buffers Load1-Load6 (from memory); Store Buffers Store1-Store3 (to memory); Reservation Stations Add1-Add3 feeding the FP adders and Mult1-Mult2 feeding the FP multipliers.]

• Issue: 2 instructions per clock cycle

Multiple Issue with Dynamic Scheduling: An Example

Loop: L.D     F0, 0(R1)
      ADD.D   F4, F0, F2
      S.D     0(R1), F4
      DADDIU  R1, R1, #-8
      BNE     R1, R2, Loop

Assumptions:
• 2-issue processor: can issue any pair of instructions if reservation stations are available
• Resources: ALU (int + effective address), a separate pipelined FP unit for each operation type, branch prediction hardware, 1 CDB
• 2 cc for loads, 3 cc for FP Add
• Branches single issue, branch prediction is perfect

Execution in Dual-issue Tomasulo Pipeline

Iter.  Inst.                Issue  Exe. (begins)  Mem. Access  Write at CDB  Comment
1      L.D    F0,0(R1)      1      2              3            4             First issue
1      ADD.D  F4,F0,F2      1      5                           8             Wait for L.D
1      S.D    0(R1),F4      2      3              9                          Wait for ADD.D
1      DADDIU R1,R1,#-8     2      4                           5             Wait for ALU
1      BNE    R1,R2,Loop    3      6                                         Wait for DADDIU
2      L.D    F0,0(R1)      4      7              8            9             Wait for BNE
2      ADD.D  F4,F0,F2      4      10                          13            Wait for L.D
2      S.D    0(R1),F4      5      8              14                         Wait for ADD.D
2      DADDIU R1,R1,#-8     5      9                           10            Wait for ALU
2      BNE    R1,R2,Loop    6      11                                        Wait for DADDIU
3      L.D    F0,0(R1)      7      12             13           14            Wait for BNE
3      ADD.D  F4,F0,F2      7      15                          18            Wait for L.D
3      S.D    0(R1),F4      8      13             19                         Wait for ADD.D
3      DADDIU R1,R1,#-8     8      14                          15            Wait for ALU
3      BNE    R1,R2,Loop    9      16                                        Wait for DADDIU

Multiple Issue with Dynamic Scheduling: Resource Usage

Clock  Int ALU    FP ALU    Data Cache  CDB
2      1/L.D
3      1/S.D                1/L.D
4      1/DADDIU                         1/L.D
5                 1/ADD.D               1/DADDIU
6
7      2/L.D
8      2/S.D                2/L.D       1/ADD.D
9      2/DADDIU             1/S.D       2/L.D
10                2/ADD.D               2/DADDIU
11
12     3/L.D
13     3/S.D                3/L.D       2/ADD.D
14     3/DADDIU             2/S.D       3/L.D
15                3/ADD.D               3/DADDIU
16
17
18                                      3/ADD.D
19                          3/S.D

Multiple Issue with Dynamic Scheduling

• DADDIU waits for ALU used by S.D
  - Add one ALU dedicated to effective address calculation
  - Use 2 CDBs
• Draw table for the dual-issue version of Tomasulo’s pipeline

Multiple Issue with Dynamic Scheduling

Iter.  Inst.                Issue  Exe. (begins)  Mem. Access  Write at CDB  Comment
1      L.D    F0,0(R1)      1      2              3            4             First issue
1      ADD.D  F4,F0,F2      1      5                           8             Wait for L.D
1      S.D    0(R1),F4      2      3              9                          Wait for ADD.D
1      DADDIU R1,R1,#-8     2      3                           4             Executes earlier
1      BNE    R1,R2,Loop    3      5                                         Wait for DADDIU
2      L.D    F0,0(R1)      4      6              7            8             Wait for BNE
2      ADD.D  F4,F0,F2      4      9                           12            Wait for L.D
2      S.D    0(R1),F4      5      7              13                         Wait for ADD.D
2      DADDIU R1,R1,#-8     5      6                           7             Executes earlier
2      BNE    R1,R2,Loop    6      8                                         Wait for DADDIU
3      L.D    F0,0(R1)      7      9              10           11            Wait for BNE
3      ADD.D  F4,F0,F2      7      12                          15            Wait for L.D
3      S.D    0(R1),F4      8      10             16                         Wait for ADD.D
3      DADDIU R1,R1,#-8     8      9                           10            Executes earlier
3      BNE    R1,R2,Loop    9      11                                        Wait for DADDIU

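The two execution tables can be compared with a little arithmetic. The sketch below is only a rough metric chosen for illustration: it divides the 15 instructions by the last cycle in each column, using the Issue and Exe. (begins) values copied from the tables in program order. The names `ISSUE`, `EXEC_ONE_ALU_ONE_CDB`, `EXEC_ADDR_ADDER_TWO_CDB`, and `rate` are made up for this example.

```python
# Cycle numbers copied from the two execution tables, in program order
# (three iterations of L.D, ADD.D, S.D, DADDIU, BNE).
ISSUE = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9]
EXEC_ONE_ALU_ONE_CDB = [2, 5, 3, 4, 6, 7, 10, 8, 9, 11, 12, 15, 13, 14, 16]
EXEC_ADDR_ADDER_TWO_CDB = [2, 5, 3, 3, 5, 6, 9, 7, 6, 8, 9, 12, 10, 9, 11]


def rate(cycles):
    """Instructions per clock, measured up to the last cycle in the list."""
    return len(cycles) / max(cycles)


issue_rate = rate(ISSUE)
exec_rate_base = rate(EXEC_ONE_ALU_ONE_CDB)
exec_rate_improved = rate(EXEC_ADDR_ADDER_TWO_CDB)

print(f"issue rate:                  {issue_rate:.2f} instructions/clock")
print(f"execution, 1 ALU + 1 CDB:    {exec_rate_base:.2f} instructions/clock")
print(f"execution, addr adder+2 CDB: {exec_rate_improved:.2f} instructions/clock")
```

Under this metric the issue rate is the same in both configurations, since branches issue alone; the gain from the extra address adder and second CDB shows up only in how soon execution begins.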
Multiple Issue with Dynamic Scheduling: Resource Usage

Clock  Int ALU    Adr. Adder  FP ALU    Data Cache  CDB#1     CDB#2
2                 1/L.D
3      1/DADDIU   1/S.D                 1/L.D
4                                                   1/L.D     1/DADDIU
5                             1/ADD.D
6      2/DADDIU   2/L.D
7                 2/S.D                 2/L.D       2/DADDIU
8                                                   1/ADD.D   2/L.D
9      3/DADDIU   3/L.D       2/ADD.D   1/S.D
10                3/S.D                 3/L.D       3/DADDIU
11                                                  3/L.D
12                            3/ADD.D               2/ADD.D
13                                      2/S.D
14
15                                                  3/ADD.D
16                                      3/S.D

What about Precise Interrupts?

• Tomasulo had: in-order issue, out-of-order execution, and out-of-order completion
• Need to “fix” the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream

Hardware-based Speculation

• With wide issue processors control dependences become a burden, even with sophisticated branch predictors
• Speculation: speculate on the outcome of branches and execute the program as if our guesses were correct => need a mechanism to handle situations when the speculations were incorrect

Relationship between precise interrupts and speculation

• Speculation is a form of guessing
• Important for branch prediction:
  - Need to “take our best shot” at predicting branch direction
• If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:
  - This is exactly same as precise exceptions!
• Technique for both precise interrupts/exceptions and speculation: in-order completion or commit

HW support for precise interrupts

• Need HW buffer for results of uncommitted instructions: reorder buffer (ROB)
  - 4 fields: instr. type, destination, value, ready
  - Use reorder buffer number instead of reservation station when execution completes
  - Supplies operands between execution complete & commit
  - (Reorder buffer can be operand source => more registers like RS)
  - Instructions commit
  - Once instruction commits, result is put into register
  - As a result, easy to undo speculated instructions on mispredicted branches or exceptions

[Figure: block diagram with the following labeled parts: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder.]

Four Steps of Speculative Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   - If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution: operate on operands (EX)
   - When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result: finish execution (WB)
   - Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
   - When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)

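The commit step is what restores program order. The following sketch is a toy model under stated assumptions, not the full algorithm: reservation stations, operand forwarding, and stores to memory are left out, entries carry only the fields needed here, and the names `RobEntry` and `ReorderBuffer` are hypothetical. It shows results being written out of order but reaching the register file only from the head of the buffer, and a mispredicted branch discarding every younger entry.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    kind: str                      # "alu" or "branch" (stores omitted in this sketch)
    dest: Optional[str] = None     # destination register, if any
    value: Optional[float] = None
    ready: bool = False
    mispredicted: bool = False


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()     # head of the buffer is entries[0]
        self.registers = {}        # committed (architectural) register state

    def issue(self, kind, dest=None):
        entry = RobEntry(kind, dest)
        self.entries.append(entry)  # allocated in program order
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        entry.value = value         # may happen in any order
        entry.ready = True
        entry.mispredicted = mispredicted

    def commit(self):
        """Retire ready entries from the head; return the registers updated."""
        committed = []
        while self.entries and self.entries[0].ready:
            head = self.entries.popleft()
            if head.kind == "branch" and head.mispredicted:
                self.entries.clear()  # squash all younger, speculative entries
                break
            if head.dest is not None:
                self.registers[head.dest] = head.value
                committed.append(head.dest)
        return committed


rob = ReorderBuffer()
first = rob.issue("alu", "F0")
second = rob.issue("alu", "F4")
rob.write_result(second, 2.0)           # the younger instruction finishes first
blocked = rob.commit()                  # nothing commits: head is not ready
rob.write_result(first, 1.0)
in_order = rob.commit()                 # both commit, oldest first

branch = rob.issue("branch")
speculative = rob.issue("alu", "F6")
rob.write_result(speculative, 9.0)
rob.write_result(branch, mispredicted=True)
after_flush = rob.commit()              # F6 never reaches the register file
print(blocked, in_order, after_flush, rob.registers)
```

Because the register file changes only at commit, the state visible at any commit point corresponds to a precise position in the instruction stream, which is the same property needed for precise interrupts.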
What are the hardware complexities with reorder buffer (ROB)?

• How do you find the latest version of a register?
  - (As specified by Smith paper) need associative comparison network
  - Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value
• Need as many ports on ROB as register file

[Figure: block diagram with the following labeled parts: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder, Compar network, and a Reorder Table with fields Dest Reg, Result, Exceptions?, Valid, Program Counter.]

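One of the options named above, using the register result status buffer to point at the reorder buffer entry that holds the newest value, can be sketched as a simple lookup. This is an illustrative assumption about how such a table might be consulted; `read_operand` and the dictionary layout are hypothetical, and the associative comparison network and the future file are not modeled.

```python
def read_operand(reg, reg_status, rob, regfile):
    """Return ("value", v) if the operand is available, else ("tag", rob_index)."""
    index = reg_status.get(reg)      # ROB entry that will (or did) produce reg
    if index is None:
        return ("value", regfile[reg])   # no pending writer: use committed value
    entry = rob[index]
    if entry["ready"]:
        return ("value", entry["value"])  # newest value is still in the ROB
    return ("tag", index)                 # wait for this entry on the CDB


regfile = {"F0": 1.0, "F2": 2.0, "F4": 4.0}
rob = {
    3: {"dest": "F0", "value": 7.5, "ready": True},    # finished, not committed
    4: {"dest": "F4", "value": None, "ready": False},  # still executing
}
reg_status = {"F0": 3, "F4": 4}

operands = {reg: read_operand(reg, reg_status, rob, regfile) for reg in regfile}
print(operands)
```

The lookup returns the uncommitted F0 from the reorder buffer rather than the stale register file copy, and a tag for F4, whose producer has not finished yet.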
Summary

• Reservation stations: implicit register renaming to larger set of registers + buffering source operands
  - Prevents registers as bottleneck
  - Avoids WAR, WAW hazards of Scoreboard
  - Allows loop unrolling in HW
• Not limited to basic blocks (integer unit gets ahead, beyond branches)
• Today, helps cache misses as well
  - Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)
• Lasting Contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
• 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264