CSCE430/830 Computer Architecture Instruction-level parallelism: Tomasulo Lecturer: Prof. Hong Jiang Courtesy of Yifeng Zhu (U.

Download Report

Transcript CSCE430/830 Computer Architecture Instruction-level parallelism: Tomasulo Lecturer: Prof. Hong Jiang Courtesy of Yifeng Zhu (U.

CSCE430/830 Computer Architecture
Instruction-level parallelism: Tomasulo
Lecturer: Prof. Hong Jiang
Courtesy of Yifeng Zhu (U. Maine)
Fall, 2006
CSCE430/830
Portions of these slides are derived from:
Dave Patterson © UCB
ILP: Tomasulo
Another Dynamic Algorithm:
Tomasulo’s Algorithm
• For IBM 360/91 (before caches!)
• Goal: High Performance without special compilers
• Small number of floating point registers (4 in 360)
prevented interesting compiler scheduling of operations
– This led Tomasulo to try to figure out how to get more effective registers
— renaming in hardware!
• Why Study 1966 Computer?
• The descendants of this have flourished!
– Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …
CSCE430/830
ILP: Tomasulo
Tomasulo’s Algorithm
• Tracks when operands for instructions are
available.
– Minimizes RAW hazards.
– Introduces register renaming to minimize WAW and WAR
hazards.
• Supports the overlapped execution of
multiple iterations of a loop.
CSCE430/830
ILP: Tomasulo
Avoiding Hazards
• RAW
– Execute an instruction only when its operands are
available.
• WAW and WAR
– Register renaming
CSCE430/830
ILP: Tomasulo
Register Renaming (Example)
• Eliminates WAR and WAW hazards by renaming all
destination registers.
• Can be done by compiler
DIV.D
ADD.D
S.D
SUB.D
MUL.D
F0, F2, F4
F6, F0, F8
F6, 0(R1)
F8, F10, F14
F6, F10, F8
Antidependence
S, T: temporary registers
Output dependence
DIV.D
ADD.D
S.D
SUB.D
MUL.D
CSCE430/830
F0, F2, F4
S, F0, F8
S, 0(R1)
T, F10, F14
F6, F10, T
ILP: Tomasulo
Tomasulo’s Algorithm
• Register renaming is provided by the
reservation stations, which buffer the
operands of instructions waiting to issue, and
by the issue logic.
• A reservation station fetches and buffers an
operand as soon as it is available, eliminating
the need to get the operand from a register.
CSCE430/830
ILP: Tomasulo
Tomasulo Organization
From Mem
FP Op
Queue
FP Registers
Load Buffers
Load1
Load2
Load3
Load4
Load5
Load6
Store
Buffers
Add1
Add2
Add3
Mult1
Mult2
FP adders
Reservation
Stations
To Mem
FP multipliers
Common Data Bus (CDB)
CSCE430/830
Normal data bus: data + destination
Common data bus: data + source
ILP: Tomasulo
Tomasulo Algorithm: A Decentralized One
• Control & buffers distributed with Function Units (FU)
– FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers
to reservation stations(RS);
– form of register renaming ;
– avoids WAR, WAW hazards
– more reservation stations than registers, so can do optimizations
compilers can’t
• Results to FU from RS, not through registers, over
Common Data Bus that broadcasts results to all FUs
• Loads and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing
FP ops beyond basic block in FP queue
CSCE430/830
ILP: Tomasulo
Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instr & sends operands (renames registers).
2. Execute—operate on operands (EX)
When both source operands are ready then execute;
if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast
• Example speed:
3 clocks for Fl .pt. +,-; 10 for * ; 40 clks for /
CSCE430/830
ILP: Tomasulo
Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Value of Source operands
– Store buffers has V field, result to be stored
Qj, Qk: Reservation stations producing source
registers (value to be written)
– Note: Qj,Qk=0 => ready
– Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy
Register result status—Indicates which functional unit
will write each register, if one exists. Blank when no
pending instructions that will write that register.
CSCE430/830
ILP: Tomasulo
Tomasulo Example
Instruction stream
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
Load1
Load2
Load3
Register result status:
Clock
0
No
No
No
3 Load/Buffers
Reservation Stations:
Time Name Busy
Add1
No
Add2
No
FU count
Add3
No
down
Mult1 No
Mult2 No
Busy Address
Op
S1
Vj
S2
Vk
RS
Qj
RS
Qk
3 FP Adder R.S.
2 FP Mult R.S.
F0
F2
F4
F6
F8
F10
F12
...
F30
FU
Clock cycle
counter
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 1
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
Reservation Stations:
Time Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No
Register result status:
Clock
1
CSCE430/830
FU
Busy Address
Load1
Load2
Load3
Op
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F0
F2
F4
F6
F8
Yes
No
No
34+R2
F10
F12
...
F30
Load1
ILP: Tomasulo
Tomasulo Example Cycle 2
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
Reservation Stations:
Time Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No
Register result status:
Clock
2
FU
Busy Address
Load1
Load2
Load3
Op
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F0
F2
F4
F6
F8
Load2
Yes
Yes
No
34+R2
45+R3
F10
F12
...
F30
Load1
Note: Can have multiple loads outstanding
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 3
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
Reservation Stations:
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes MULTD
Mult2 No
Register result status:
Clock
3
FU
F0
Busy Address
3
S1
Vj
Load1
Load2
Load3
S2
Vk
RS
Qj
Yes
Yes
No
34+R2
45+R3
F10
F12
RS
Qk
R(F4) Load2
F2
Mult1 Load2
F4
F6
F8
...
F30
Load1
• Note: registers names are removed (“renamed”) in
Reservation Stations; MULT issued
ILP: Tomasulo
CSCE430/830
• Load1 completing; what is waiting for Load1?
Tomasulo Example Cycle 4
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
Reservation Stations:
Busy Address
3
4
4
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
No
Yes
No
45+R3
F10
F12
Time Name Busy Op
Add1 Yes SUBD M(A1)
Load2
Add2
No
Add3
No
Mult1 Yes MULTD
R(F4) Load2
Mult2 No
Register result status:
Clock
4
FU
F0
Mult1 Load2
...
F30
M(A1) Add1
• Load2 completing; what is waiting for Load2?
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 5
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
2 Add1 Yes SUBD M(A1) M(A2)
Add2
No
Add3
No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
5
FU
F0
Mult1 M(A2)
No
No
No
F10
F12
...
F30
M(A1) Add1 Mult2
• Timer starts down for Add1, Mult1
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 6
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD
M(A2) Add1
Add3
No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
6
FU
F0
Mult1 M(A2)
Add2
No
No
No
F10
F12
...
F30
Add1 Mult2
• Issue ADDD here despite name dependency on F6?
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 7
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
Busy Address
4
5
Load1
Load2
Load3
7
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD
M(A2) Add1
Add3
No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
7
FU
F0
No
No
No
Mult1 M(A2)
Add2
F10
F12
...
F30
Add1 Mult2
• Add1 (SUBD) completing; what is waiting for it?
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 8
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
2 Add2 Yes ADDD (M-M) M(A2)
Add3
No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
8
CSCE430/830
FU
F0
Mult1 M(A2)
No
No
No
F10
F12
...
F30
Add2 (M-M) Mult2
ILP: Tomasulo
Tomasulo Example Cycle 9
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
1 Add2 Yes ADDD (M-M) M(A2)
Add3
No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
9
CSCE430/830
FU
F0
Mult1 M(A2)
No
No
No
F10
F12
...
F30
Add2 (M-M) Mult2
ILP: Tomasulo
Tomasulo Example Cycle 10
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
4
5
7
8
Busy Address
Load1
Load2
Load3
10
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
0 Add2 Yes ADDD (M-M) M(A2)
Add3
No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
10
FU
F0
No
No
No
Mult1 M(A2)
F10
F12
...
F30
Add2 (M-M) Mult2
• Add2 (ADDD) completing; what is waiting for it?
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 11
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
11
FU
F0
Mult1 M(A2)
No
No
No
F10
F12
...
F30
(M-M+M)(M-M) Mult2
• Write result of ADDD here?
• All quick instructions complete in this cycle!
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 12
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
12
CSCE430/830
FU
F0
Mult1 M(A2)
No
No
No
F10
F12
...
F30
(M-M+M)(M-M) Mult2
ILP: Tomasulo
Tomasulo Example Cycle 13
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
13
CSCE430/830
FU
F0
Mult1 M(A2)
No
No
No
F10
F12
...
F30
(M-M+M)(M-M) Mult2
ILP: Tomasulo
Tomasulo Example Cycle 14
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
14
CSCE430/830
FU
F0
Mult1 M(A2)
No
No
No
F10
F12
...
F30
(M-M+M)(M-M) Mult2
ILP: Tomasulo
Tomasulo Example Cycle 15
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
15
7
4
5
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
15
FU
F0
Mult1 M(A2)
No
No
No
F10
F12
...
F30
(M-M+M)(M-M) Mult2
• Mult1 (MULTD) completing; what is waiting for it?
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 16
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
4
5
16
8
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock
16
FU
F0
Busy Address
M*F4 M(A2)
No
No
No
F10
F12
...
F30
(M-M+M)(M-M) Mult2
• Just waiting for Mult2 (DIVD) to complete
CSCE430/830
ILP: Tomasulo
Faster than light computation
(skip a couple of cycles)
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 55
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
4
5
16
8
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock
55
CSCE430/830
FU
F0
Busy Address
M*F4 M(A2)
No
No
No
F10
F12
...
F30
(M-M+M)(M-M) Mult2
ILP: Tomasulo
Tomasulo Example Cycle 56
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
56
10
4
5
16
8
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
56
FU
F0
F2
F4
F6
F8
M*F4 M(A2)
No
No
No
11
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock
Busy Address
F10
F12
...
F30
(M-M+M)(M-M) Mult2
• Mult2 (DIVD) is completing; what is waiting for it?
CSCE430/830
ILP: Tomasulo
Tomasulo Example Cycle 57
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
56
10
4
5
16
8
57
11
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock
56
FU
F0
Busy Address
M*F4 M(A2)
No
No
No
F10
F12
...
F30
(M-M+M)(M-M) Result
• Once again: In-order issue, out-of-order execution
and out-of-order completion.
ILP: Tomasulo
CSCE430/830
Tomasulo Drawbacks
• Complexity
– delays of 360/91, MIPS 10000, Alpha 21264,
IBM PPC 620 in CA:AQA 2/e, but not in silicon!
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
– Each CDB must go to multiple functional units
high capacitance, high wiring density
– Number of functional units that can complete per cycle
limited to one!
» Multiple CDBs  more FU logic for parallel assoc stores
• Non-precise interrupts!
– We will address this later
CSCE430/830
ILP: Tomasulo
Avoiding Hazards
• RAW
– Execute an instruction only when its operands are
available.
• WAW and WAR
– Register renaming
CSCE430/830
ILP: Tomasulo
Tomasulo Organization
From Mem
FP Op
Queue
FP Registers
Load Buffers
Load1
Load2
Load3
Load4
Load5
Load6
Store
Buffers
Add1
Add2
Add3
Mult1
Mult2
FP adders
Reservation
Stations
To Mem
FP multipliers
Common Data Bus (CDB)
CSCE430/830
ILP: Tomasulo
Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instr & sends operands (renames registers).
2. Execute—operate on operands (EX)
When both operands ready then execute;
if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast
• Example speed:
3 clocks for Fl .pt. +,-; 10 for * ; 40 clks for /
CSCE430/830
ILP: Tomasulo
Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Value of Source operands
– Store buffers has V field, result to be stored
Qj, Qk: Reservation stations producing source
registers (value to be written)
– Note: Qj,Qk=0 => ready
– Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy
Register result status—Indicates which functional unit
will write each register, if one exists. Blank when no
pending instructions that will write that register.
CSCE430/830
ILP: Tomasulo
Tomasulo Loop Example
Loop:LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
• This time assume Multiply takes 4 clocks
• Assume 1st load takes 8 clocks
(L1 cache miss), 2nd load takes 1 clock (hit)
• To be clear, will show clocks for SUBI, BNEZ
– Reality: integer instructions ahead of Fl. Pt. Instructions
• Show 2 iterations
CSCE430/830
ILP: Tomasulo
Loop Example
Instruction status:
ITER Instruction
1
1
1
Iter2
ation
2
Count 2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
Name
Add1
Add2
Add3
Mult1
Mult2
Busy
No
No
No
No
No
Op
Vj
Exec Write
Issue Comp Result
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
S1
Vk
S2
Qj
RS
Qk
Code:
LD
MULTD
SD
SUBI
BNEZ
No
No
No
No
No
No
Added Store Buffers
F0
F4
F4
R1
R1
Register result status
Clock
0
F0
R1
80
F2
F4
F6
F8
Fu
F10 F12
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Instruction Loop
Fu
Value of Register used for address, iteration control
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 1
Instruction status:
ITER Instruction
1
LD
F0
j
k
0
R1
1
Vj
S1
Vk
Reservation Stations:
Time
Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No
Exec Write
Issue CompResult
Op
S2
Qj
RS
Qk
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
No
No
No
No
80
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
1
CSCE430/830
R1
80
F0
F2
F4
F6
F8
F10 F12
Fu Load1
ILP: Tomasulo
Loop Example Cycle 2
Instruction status:
ITER Instruction
1
1
LD
MULTD
F0
F4
j
k
0
F0
R1
F2
1
2
Vj
S1
Vk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
S2
Qj
RS
Qk
R(F2) Load1
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
No
No
No
No
80
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
2
CSCE430/830
R1
80
F0
Fu Load1
F2
F4
F6
F8
F10 F12
Mult1
ILP: Tomasulo
Loop Example Cycle 3
Instruction status:
ITER Instruction
1
1
1
LD
MULTD
SD
F0
F4
F4
j
k
0
F0
0
R1
F2
R1
Reservation Stations:
Time
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
Vj
Exec Write
Issue CompResult
1
2
3
S1
Vk
S2
Qj
RS
Qk
R(F2) Load1
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
No
Yes
No
No
80
80
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
3
R1
80
F0
Fu Load1
F2
F4
F6
F8
F10 F12
Mult1
• Implicit renaming sets up data flow graph
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 4
Instruction status:
ITER Instruction
1
1
1
LD
MULTD
SD
F0
F4
F4
j
k
0
F0
0
R1
F2
R1
Reservation Stations:
Time
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
Vj
Exec Write
Issue CompResult
1
2
3
S1
Vk
S2
Qj
RS
Qk
R(F2) Load1
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
No
Yes
No
No
80
80
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
4
R1
80
F0
Fu Load1
F2
F4
F6
F8
F10 F12
Mult1
• Dispatching SUBI Instruction (not in FP queue)
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 5
Instruction status:
ITER Instruction
1
1
1
LD
MULTD
SD
F0
F4
F4
j
k
0
F0
0
R1
F2
R1
Reservation Stations:
Time
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
Vj
Exec Write
Issue CompResult
1
2
3
S1
Vk
S2
Qj
RS
Qk
R(F2) Load1
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
No
Yes
No
No
80
80
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
5
R1
72
F0
Fu Load1
F2
F4
F6
F8
F10 F12
Mult1
• And, BNEZ instruction (not in FP queue)
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 6
Instruction status:
ITER Instruction
1
1
1
2
LD
MULTD
SD
LD
F0
F4
F4
F0
j
k
0
F0
0
0
R1
F2
R1
R1
1
2
3
6
Vj
S1
Vk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
S2
Qj
RS
Qk
R(F2) Load1
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
Yes
No
Yes
No
No
80
72
80
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
6
R1
72
F0
Fu Load2
F2
F4
F6
F8
F10 F12
Mult1
• Notice that F0 never sees Load from location 80
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 7
Instruction status:
ITER Instruction
1
1
1
2
2
LD
MULTD
SD
LD
MULTD
F0
F4
F4
F0
F4
j
k
0
F0
0
0
F0
R1
F2
R1
R1
F2
1
2
3
6
7
Vj
S1
Vk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 Yes Multd
S2
Qj
RS
Qk
R(F2) Load1
R(F2) Load2
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
Yes
No
Yes
No
No
80
72
80
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
7
R1
72
F0
Fu Load2
F2
F4
F6
F8
F10 F12
Mult2
• Register file completely detached from computation
• First and Second iteration completely overlapped
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 8
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
Vj
S1
Vk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 Yes Multd
S2
Qj
RS
Qk
R(F2) Load1
R(F2) Load2
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
Yes
No
Yes
Yes
No
80
72
80
72
Mult1
Mult2
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
8
CSCE430/830
R1
72
F0
Fu Load2
F2
F4
F6
F8
F10 F12
Mult2
ILP: Tomasulo
Loop Example Cycle 9
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
Vj
S1
Vk
S2
Qj
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 Yes Multd
RS
Qk
R(F2) Load1
R(F2) Load2
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
Yes
No
Yes
Yes
No
80
72
80
72
Mult1
Mult2
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
9
R1
72
F0
Fu Load2
F2
F4
F6
F8
F10 F12
Mult2
• Load1 completing: who is waiting?
• Note: Dispatching SUBI
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 10
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
4
Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk
9
10
10
S2
Qj
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd
R(F2) Load2
RS
Qk
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
Yes
No
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
Fu
72
80
72
Mult1
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
10
R1
64
F0
Fu Load2
F2
F4
F6
F8
F10 F12
Mult2
• Load2 completing: who is waiting?
• Note: Dispatching BNEZ
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 11
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
3
4
Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)
9
10
10
11
S2
Qj
RS
Qk
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
Mult1
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
11
R1
64
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult2
• Next load in sequence
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 12
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
2
3
Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)
9
10
10
11
S2
Qj
RS
Qk
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
Mult1
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
12
R1
64
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult2
• Why not issue third multiply?
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 13
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
1
2
Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)
9
10
10
11
S2
Qj
RS
Qk
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
Mult1
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
13
R1
64
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult2
• Why not issue third store?
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 14
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
0
1
Exec Write
Issue CompResult
1
2
3
6
7
8
9
14
10
11
S1
Vk
S2
Qj
RS
Qk
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)
10
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
Mult1
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
14
R1
64
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult2
• Mult1 completing. Who is waiting?
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 15
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
0
Exec Write
Issue CompResult
1
2
3
6
7
8
9
14
10
15
11
S1
Vk
S2
Qj
RS
Qk
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 Yes Multd M[72] R(F2)
10
15
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
[80]*R2
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
15
R1
64
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult2
• Mult2 completing. Who is waiting?
CSCE430/830
ILP: Tomasulo
Loop Example Cycle 16
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
14
10
15
11
16
Vj
S1
Vk
S2
Qj
RS
Qk
Reservation Stations:
Time
4
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
10
15
R(F2) Load3
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
[80]*R2
[72]*R2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
16
CSCE430/830
R1
64
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult1
ILP: Tomasulo
Loop Example Cycle 17
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
14
10
15
11
16
Vj
S1
Vk
S2
Qj
RS
Qk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
10
15
R(F2) Load3
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
Yes
64
80
72
64
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
[80]*R2
[72]*R2
Mult1
Register result status
Clock
17
CSCE430/830
R1
64
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult1
ILP: Tomasulo
Loop Example Cycle 18
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
14
18
10
15
10
15
Vj
S1
Vk
S2
Qj
RS
Qk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
11
16
R(F2) Load3
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
Yes
64
80
72
64
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
[80]*R2
[72]*R2
Mult1
Register result status
Clock
18
CSCE430/830
R1
64
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult1
ILP: Tomasulo
Loop Example Cycle 19
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
14
18
10
15
19
10
15
19
11
16
Vj
S1
Vk
S2
Qj
RS
Qk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
R(F2) Load3
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
No
Yes
Yes
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
Fu
64
72
64
[72]*R2
Mult1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
19
CSCE430/830
R1
56
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult1
ILP: Tomasulo
Loop Example Cycle 20
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
14
18
10
15
19
10
15
19
11
16
20
Vj
S1
Vk
S2
Qj
RS
Qk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
R(F2) Load3
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
Yes
No
No
Yes
56
64
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
64
Register result status
Clock
20
R1
56
F0
Fu Load1
F2
F4
F6
F8
F10 F12
Mult1
• Once again: In-order issue, out-of-order execution
and out-of-order completion.
ILP: Tomasulo
CSCE430/830
Why can Tomasulo overlap iterations
of loops?
• Register renaming
– Multiple iterations use different physical destinations for registers
(dynamic loop unrolling).
• Reservation stations
– Permit instruction issue to advance past integer control flow
operations
– Also buffer old values of registers - totally avoiding the WAR stall
that we saw in the scoreboard.
• Other perspective: Tomasulo builds data flow
dependency graph on the fly.
CSCE430/830
ILP: Tomasulo
Tomasulo’s scheme offers 2 major
advantages
(1) the distribution of the hazard detection logic
– distributed reservation stations and the CDB
– If multiple instructions waiting on a single result, & each
instruction has other operand, then instructions can be
released simultaneously by broadcasting on CDB
– If a centralized register file were used, the units would
have to read their results from the registers when
register buses are available.
(2) the elimination of stalls for WAW and WAR
hazards
CSCE430/830
ILP: Tomasulo
Summary
• Reservations stations: implicit register renaming to
larger set of registers + buffering source operands
– Prevents registers from being bottleneck
– Avoids WAR, WAW hazards of Scoreboard
– Allows loop unrolling in HW
• Not limited to basic blocks
(integer units gets ahead, beyond branches)
• Today, helps cache misses as well
– Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Pentium III; PowerPC 604;
MIPS R10000; HP-PA 8000; Alpha 21264
CSCE430/830
ILP: Tomasulo
What about Precise Interrupts?
• State of machine looks as if no instruction
beyond faulting instructions has issued
• Tomasulo had:
In-order issue, out-of-order execution, and
out-of-order completion
• Need to “fix” the out-of-order completion
aspect so that we can find precise breakpoint
in instruction stream.
CSCE430/830
ILP: Tomasulo
Speculating with Tomasulo
• Modern processors such as PowerPC 603/604,
MIPS R10000, Intel Pentium II/III/4, Alpha 21264
extend Tomasulo’s approach to support
speculation
• Key ideas:
– separate execution from completion: allow instructions to
execute speculatively but do not let instructions update
registers or memory until they are no longer speculative
– therefore, add a final step – after an instruction is no longer
speculative – when it is allowed to make register and memory
updates, called instruction commit
– allow instructions to execute out of order but force them to
commit in order
– add a hardware buffer, called the reorder buffer (ROB), with
registers to hold the result of an instruction between execution
completion and commit
CSCE430/830
ILP: Tomasulo
HW support for precise interrupts
• Need HW buffer for results of
uncommitted instructions:
reorder buffer
– Use reorder buffer number instead
of reservation station when
execution completes
FP
– Supplies operands between
Op
execution completion & commit
Queue
– (Reorder buffer can be operand
source => more registers like RS)
– Instructions commit
– Once an instruction commits,
Res Stations
result is put into register
FP Adder
– As a result, easy to undo
speculated instructions
on mispredicted branches
or exceptions
CSCE430/830
Reorder
Buffer
FP Regs
Res Stations
FP Adder
ILP: Tomasulo
ROB Data Structure
ROB entry fields
• Instruction type: branch, store, register operation (i.e., ALU or load)
• State: indicates if instruction has completed execution and value is
ready
• Destination: where result is to be written – register number for
register operation (i.e. ALU or load), memory address for store
– branch has no destination result
• Value: holds the value of instruction result until time to commit
Additional reservation station field
• Destination: Corresponding ROB entry number
CSCE430/830
ILP: Tomasulo
Four Steps of Speculative Tomasulo
Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send
operands & reorder buffer no. for destination (this stage sometimes
called “dispatch”)
2. Execution—operate on operands (EX)
When both operands ready then execute; if not ready, watch CDB
for result; when both in reservation station, execute; checks RAW
(sometimes called “issue”)
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.
4. Commit—update register with reorder result
When instr. at head of reorder buffer & result present, update
register with result (or store to memory) and remove instr from
reorder buffer. Mispredicted branch flushes reorder buffer
(sometimes called “graduation”)
CSCE430/830
ILP: Tomasulo
Program Counter
Valid
Exceptions?
Result
Reorder Table
FP
Op
Queue
Res Stations
Compar network
Dest Reg
What are the hardware complexities with reorder
buffer (ROB)?
Reorder
Buffer
FP Regs
Res Stations
FP Adder
FP Adder
• How do you find the latest version of a register?
– (As specified by Smith paper) need associative comparison network
– Could use future file or just use the register result status buffer to track which
specific reorder buffer has received the value
• Need as many ports on ROB as register file
CSCE430/830
ILP: Tomasulo
Speculative Tomasulo Example
LD
ADDD
DIVD
BNEZ
LD
ADDD
SD
…
F0
F10
F2
F2
F4
F0
F4
10
F4
F10
Exit
0
F4
0
R2
F0
F6
R3
F9
R3
Exit:
CSCE430/830
ILP: Tomasulo
Tomasulo With Reorder buffer:
Dest. Value
Instruction
Done?
ROB7
FP Op
Queue
ROB6
Newest
ROB5
ROB4
Reorder Buffer
ROB3
ROB2
F0
LD F0,10(R2)
Registers
Dest
CSCE430/830
ROB1
To
Memory
from
Memory
Dest
FP adders
N
Oldest
Reservation
Stations
Dest
1 10+R2
FP multipliers
ILP: Tomasulo
Tomasulo With Reorder buffer:
Dest. Value
Instruction
Done?
ROB7
FP Op
Queue
ROB6
Newest
ROB5
ROB4
Reorder Buffer
ROB3
F10
ADDD F10,F4,F0
N
ROB2
F0
LD F0,10(R2)
N
ROB1
Registers
Dest
2
ADDD
R(F4),ROB1
FP adders
CSCE430/830
Oldest
To
Memory
from
Memory
Dest
Reservation
Stations
Dest
1 10+R2
FP multipliers
ILP: Tomasulo
Tomasulo With Reorder buffer:
Dest. Value
Instruction
Done?
ROB7
FP Op
Queue
ROB6
Newest
ROB5
Reorder Buffer
ROB4
F2
DIVD F2,F10,F6
N
ROB3
F10
ADDD F10,F4,F0
N
ROB2
F0
LD F0,10(R2)
N
ROB1
Registers
Dest
2
ADDD
R(F4),ROB1
FP adders
CSCE430/830
Oldest
To
Memory
Dest
3
Reservation
Stations
DIVD
ROB2,R(F6)
from
Memory
Dest
1 10+R2
FP multipliers
ILP: Tomasulo
• Skip some cycles
CSCE430/830
ILP: Tomasulo
Tomasulo With Reorder buffer:
Dest. Value
Instruction
ROB7
FP Op
Queue
Reorder Buffer
F0
ADDD F0,F4,F6
N
ROB6
F4
LD F4,0(R3)
N
ROB5
--
BNE F2,<…>
N
ROB4
F2
DIVD F2,F10,F6
N
ROB3
F10
ADDD F10,F4,F0
N
ROB2
F0
LD F0,10(R2)
N
ROB1
Registers
Dest
2
6
ADDD
ADDD
R(F4),ROB1
ROB5, R(F6)
FP adders
CSCE430/830
Done?
Newest
Oldest
To
Memory
Dest
3
Reservation
Stations
DIVD
ROB2,R(F6)
from
Memory
Dest
1 10+R2
5 0+R3
FP multipliers
ILP: Tomasulo
Tomasulo With Reorder buffer:
Dest. Value
FP Op
Queue
N
ROB7
F0
ADDD F0,F4,F6
N
ROB6
F4
LD F4,0(R3)
N
ROB5
--
BNE F2,<…>
N
ROB4
F2
DIVD F2,F10,F6
N
ROB3
F10
ADDD F10,F4,F0
N
ROB2
F0
LD F0,10(R2)
N
ROB1
ROB5
Registers
Dest
2
6
ADDD
ADDD
R(F4),ROB1
ROB5, R(F6)
FP adders
CSCE430/830
Done?
SD 0(R3),F4
--
Reorder Buffer
Instruction
Newest
Oldest
To
Memory
Dest
3
Reservation
Stations
DIVD
ROB2,R(F6)
from
Memory
Dest
1 10+R2
5 0+R3
FP multipliers
ILP: Tomasulo
Tomasulo With Reorder buffer:
Dest. Value
FP Op
Queue
Y
ROB7
ADDD F0,F4,F6
N
ROB6
LD F4,0(R3)
Y
ROB5
--
BNE F2,<…>
N
ROB4
F2
DIVD F2,F10,F6
N
ROB3
F10
ADDD F10,F4,F0
N
ROB2
F0
LD F0,10(R2)
N
ROB1
M[10]
F0
F4
M[10]
Registers
Dest
2
6
ADDD
ADDD
R(F4),ROB1
M[10],R(F6)
FP adders
CSCE430/830
Done?
SD 0(R3),F4
--
Reorder Buffer
Instruction
Newest
Oldest
To
Memory
Dest
3
Reservation
Stations
DIVD
ROB2,R(F6)
from
Memory
Dest
1 10+R2
FP multipliers
ILP: Tomasulo
Tomasulo With Reorder buffer:
Dest. Value
FP Op
Queue
Reorder Buffer
Instruction
SD 0(R3),F4
Y
ROB7
ADDD F0,F4,F6
Ex
ROB6
LD F4,0(R3)
Y
ROB5
--
BNE F2,<…>
N
ROB4
F2
DIVD F2,F10,F6
N
ROB3
F10
ADDD F10,F4,F0
N
ROB2
F0
LD F0,10(R2)
N
ROB1
--
M[10]
F0
---
F4
M[10]
Registers
Dest
2
ADDD
R(F4),ROB1
FP adders
CSCE430/830
Done?
Newest
Oldest
To
Memory
Dest
3
Reservation
Stations
DIVD
ROB2,R(F6)
from
Memory
Dest
1 10+R2
FP multipliers
ILP: Tomasulo
Notes
• If a branch is mispredicted, recovery is
done by flushing the ROB of all entries
that appear after the mispredicted
branch
– entries before the branch are allowed to continue
– restart the fetch at the correct branch successor
• When an instruction commits or is
flushed from the ROB then the
corresponding slots become available
for subsequent instructions
CSCE430/830
ILP: Tomasulo
Tomasulo Algorithm vs. Scoreboard
• Control & buffers distributed with Function Units (FU) vs.
centralized in scoreboard;
– FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers to
reservation stations(RS); called register renaming ;
– avoids WAR, WAW hazards
– More reservation stations than registers, so can do optimizations
compilers can’t
• Results to FU from RS, not through registers, over Common
Data Bus that broadcasts results to all FUs
• Load and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing
FP ops beyond basic block in FP queue
CSCE430/830
ILP: Tomasulo
Summary
• Reservations stations: implicit register renaming to
larger set of registers + buffering source operands
– Prevents registers as bottleneck
– Avoids WAR, WAW hazards of Scoreboard
– Allows loop unrolling in HW
• Not limited to basic blocks
(integer units gets ahead, beyond branches)
• Today, helps cache misses as well
– Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Pentium III; PowerPC 604;
MIPS R10000; HP-PA 8000; Alpha 21264
CSCE430/830
ILP: Tomasulo