Tomasulo - A. James Clark School of Engineering

CS 665 Advanced Computer Architecture – Fall 2003
Tomasulo Algorithm
Adapted from DAP Spr.‘98 ©UCB
1
Review: Summary
• Instruction Level Parallelism (ILP) in SW or HW
• Loop level parallelism is easiest to see
• SW parallelism dependencies defined for program,
hazards if HW cannot resolve
• SW dependencies/compiler sophistication determine if
compiler can unroll loops
– Memory dependencies hardest to determine
• HW exploiting ILP
– Works when can’t know dependence at compile time
– Code for one machine runs well on another
• Key idea of Scoreboard: Allow instructions behind stall
to proceed (Decode => Issue instr & read operands)
– Enables out-of-order execution => out-of-order completion
– ID stage checked both for structural & data dependencies
2
Review: Three Parts of the Scoreboard
1. Instruction status—which of 4 steps the instruction is in
2. Functional unit status—Indicates the state of the functional unit
(FU). 9 fields for each functional unit
Busy—Indicates whether the unit is busy or not
Op—Operation to perform in the unit (e.g., + or –)
Fi—Destination register
Fj, Fk—Source-register numbers
Qj, Qk—Functional units producing source registers Fj, Fk
Rj, Rk—Flags indicating when Fj, Fk are ready
3. Register result status—Indicates which functional unit will write
each register, if one exists. Blank when no pending instructions
will write that register
3
Review: Scoreboard Example Cycle 62
Instruction status:
  Instruction    j     k     Issue   Read operands   Execution complete   Write result
  LD     F6      34+   R2      1          2                 3                  4
  LD     F2      45+   R3      5          6                 7                  8
  MULTD  F0      F2    F4      6          9                19                 20
  SUBD   F8      F6    F2      7          9                11                 12
  DIVD   F10     F0    F6      8         21                61                 62
  ADDD   F6      F8    F2     13         14                16                 22
Functional unit status:
                               dest   S1   S2   FU for j   FU for k   Fj?   Fk?
  Time   Name      Busy   Op   Fi     Fj   Fk   Qj         Qk         Rj    Rk
         Integer   No
         Mult1     No
         Mult2     No
         Add       No
  0      Divide    No
Register result status:
  Clock        F0   F2   F4   F6   F8   F10   F12   ...   F30
  62      FU
• In-order issue; out-of-order execute & commit
4
Review: Scoreboard Summary
• Speedup 1.7 from compiler; 2.5 by hand
BUT slow memory (no cache)
• Limitations of 6600 scoreboard
– No forwarding (First write register then read it)
– Limited to instructions in basic block
(small window)
– Number of functional units (structural hazards)
– Wait for WAR hazards
– Prevent WAW hazards
5
Dynamic Issue
Goal: take advantage of multiple function units and
deal with long memory latencies
• Advantages:
– Speed
• Problems: multiple execution latencies
– Result is out of order completion
– Forwarding and hazard control become more difficult
– Precise exceptions would later amplify the problem (non-issue in the
’60s)
• Answer: HW to issue instructions when hazards
clear
6
Dynamic Issue
• Hazards = data, structural, control
– Data: RAW (true data dependence), WAR ( anti-dependence),
WAW (output dependence)
– Structural: Are the required resources available?
– Control: Is this instruction supposed to execute or not?
• Implementation – 2 early approaches
– Control flow – CDC 6600 (scoreboard) (1964)
– Data flow – Tomasulo, IBM 360/91 (1967)
» Simple idea – when opcode and operands are ready, and the
appropriate set of resources are ready, launch the “execution
packet”
» Interesting wrinkle – does not use named registers for
intermediate storage
» Implicit introduction of Register Renaming
7
Register Renaming
• Can eliminate name dependence (and hence WAR
and WAW)
• Static renaming example:
Original               After renaming R2 to R6
ADD R1, R2, R3         ADD R1, R2, R3
SUB R2, R3, R4         SUB R6, R3, R4
AND R5, R1, R2         AND R5, R1, R6
LD  R1, 0(R4)          LD  R1, 0(R4)
• Increases ILP, but also increases register pressure
• Can be done dynamically in hardware
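The effect of renaming on name dependences can be checked mechanically. The following is a minimal sketch, not the 360/91 mechanism: `Instr`, `hazards`, `rename`, and the physical names `P0`, `P1`, ... are hypothetical names introduced here. The hazard check is a conservative pairwise comparison, and the renamer assumes an unlimited supply of fresh destination names.

```python
import itertools
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple

def hazards(program):
    """List (kind, earlier index, later index, register) for every instruction pair."""
    found = []
    for i, early in enumerate(program):
        for j in range(i + 1, len(program)):
            late = program[j]
            if early.dest in late.srcs:      # later instruction reads what earlier writes
                found.append(("RAW", i, j, early.dest))
            if late.dest in early.srcs:      # later instruction overwrites an earlier source
                found.append(("WAR", i, j, late.dest))
            if late.dest == early.dest:      # both write the same register
                found.append(("WAW", i, j, late.dest))
    return found

def rename(program):
    """Give every destination a fresh name; sources follow the latest mapping."""
    mapping = {}                              # architectural name -> current fresh name
    fresh = (f"P{n}" for n in itertools.count())
    renamed = []
    for ins in program:
        srcs = tuple(mapping.get(s, s) for s in ins.srcs)  # read mapping before updating it
        mapping[ins.dest] = next(fresh)
        renamed.append(Instr(ins.op, mapping[ins.dest], srcs))
    return renamed

# The "Original" column of the static renaming example.
original = [
    Instr("ADD", "R1", ("R2", "R3")),
    Instr("SUB", "R2", ("R3", "R4")),
    Instr("AND", "R5", ("R1", "R2")),
    Instr("LD", "R1", ("R4",)),
]
before = hazards(original)
after = hazards(rename(original))
print(before)
print(after)
```

Under these assumptions the original sequence reports RAW, WAR, and WAW pairs, while the renamed sequence keeps only the two true (RAW) dependences feeding the AND.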
8
Another Dynamic Algorithm: Tomasulo Algorithm
• For IBM 360/91 about 3 years after CDC 6600 (1966)
• Goal: High Performance without special compilers – Take
advantage of multiple function units and deal with long memory
latencies
– Advantages: speed via specialization and parallelism
– Problems: multiple execution latencies
» Out of order completion
» Bypass and hazard control difficult
» Precise exceptions
• Why Study?
– led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …
• Main difference from scoreboarding:
– Uses hardwired register renaming to remove WAR and WAW hazards
9
Computing Model
• Data
– Ops view sources and destinations as memory
– Hence registers not explicitly named in the ISA stream
• Control
– Multiple FU
» Each fronted by one or more “reservation stations” (RS)
» When reservation station has all of the source operands
then it issues
– Outputs
» Physically placed on a Common Data Bus (CDB)
» Tagged so the RS know what to look for
– Inputs
» Supplied with instruction or collected by RS
10
Tomasulo Organization
FP Registers
From Mem
FP Op
Queue
Load Buffers
Load1
Load2
Load3
Load4
Load5
Load6
Store
Buffers
Add1
Add2
Add3
Mult1
Mult2
FP adders
Reservation
Stations
To Mem
FP multipliers
Common Data Bus (CDB)
11
Tomasulo Algorithm vs. Scoreboard
• Control & buffers distributed with Function Units (FU) vs.
centralized in scoreboard;
– FU buffers called “reservation stations” to control execution; have pending
operands
– Act as interlock-permits
• Registers in instructions replaced by values or pointers to
reservation stations (RS); called register renaming (implicit virtual
registers);
– avoids WAR, WAW hazards
– More reservation stations than registers, so can do optimizations compilers
can’t
• Results to FU from RS, not through registers, over Common Data
Bus that broadcasts results to all FUs
• Loads and stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond
basic block in FP queue
12
Reservation Stations and Common
Data Bus
• Reservation Stations hold instructions stalled for
RAW hazards, buffers operands until read by
instructions. Pending instructions have their register
specifiers renamed as locations in reservation
stations
• Common Data Bus (CDB) broadcasts results to any
FU that may need them (reservation stations,
register file)
13
Reservation Station Duties
• Snarf sources off CDB when they appear
– CDB results are tagged with where they came from
• When all operands are present, enable the associated
FU to execute
• Since values aren’t really written to registers (until
later): no WAR or WAW hazards are possible
• Structural hazards checked at two points
– At dispatch – a free reservation station of the right type must be
available
– When execution packet is ready – multiple reservation stations may
compete for a shared FU
» Program order used as basis for arbitration if required
14
Virtual Registers
• Tag field associated with data
• Tag field is a virtual register ID
• Corresponds to reservation station and load buffer
names
• Motivation due to the 360’s register weakness
– Had only 4 FP regs
– The 9 renamed regs (reservation station slots) were a significant
bonus
• Intel’s x86 architecture is also register-poor
– With renamed registers they can get around this
15
Reservation Station Components
Op—Operation to perform in the unit (e.g., + or –)
Vj, Vk—Value of Source operands
– Store buffers have a V field: the result to be stored
Qj, Qk—Reservation stations producing source
registers (value to be written)
– Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready
– Store buffers only have Qi for RS producing result
Busy—Indicates reservation station or FU is busy
Register result status—Indicates which functional
unit will write each register, if one exists. Blank when
no pending instructions that will write that register.
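The fields above, together with the snooping duty of a reservation station, can be sketched as a small data structure. This is an illustrative model only: the class name, the method names `ready` and `snoop`, the helper `broadcast`, and the numeric operand values are assumptions made for this sketch. `None` in `qj`/`qk` plays the role of "Qj, Qk = 0 => ready".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first source operand, once known
    vk: Optional[float] = None   # value of second source operand, once known
    qj: Optional[str] = None     # tag of the station that will produce Vj (None = ready)
    qk: Optional[str] = None     # tag of the station that will produce Vk (None = ready)

    def ready(self) -> bool:
        """An entry may start executing when it is busy and waits on no tag."""
        return self.busy and self.qj is None and self.qk is None

    def snoop(self, tag: str, value: float) -> None:
        """Capture a CDB result if its source tag matches a pending operand."""
        if not self.busy:
            return
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

def broadcast(stations, tag, value):
    """Common Data Bus: every station sees (source tag, value) in the same cycle."""
    for rs in stations:
        rs.snoop(tag, value)

# State resembling cycle 4 of the example: both entries wait on Load2.
add1 = ReservationStation("Add1", busy=True, op="SUBD", vj=1.5, qk="Load2")
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk=4.0, qj="Load2")
stations = [add1, mult1]

ready_before = [rs.name for rs in stations if rs.ready()]
broadcast(stations, "Load2", 2.5)   # Load2 writes its result on the CDB
ready_after = [rs.name for rs in stations if rs.ready()]
print(ready_before, ready_after)
```

A single broadcast releases both waiting entries at once, which is the behaviour the slides attribute to the "come from" bus.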
16
Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue (in-order)
If a reservation station is free (no structural hazard), control issues the instruction and sends
the operands from the register file (if available there) to the RS, renaming registers. Loads and
stores dispatch if a buffer is available. Otherwise, stall due to the structural hazard.
2. Execution—operate on operands (EX) (may be out of order)
When both operands ready then execute;
if not ready, watch Common Data Bus for result
Effectively deals with RAW hazards.
3. Write result—finish execution (WB) (may be out of order)
Write on Common Data Bus to all awaiting units;
mark reservation station available
Renaming model prevents WAW and WAR hazards.
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast
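The Issue and Write-result steps can be illustrated with the register result status table alone. The sketch below is a simplification under stated assumptions: `issue` and `write_result` are hypothetical helpers, reservation-station entries are plain dictionaries, operand values are made-up numbers, and forwarding of results into waiting entries is not modelled. It replays the issue order of the example that follows (MULTD, SUBD, DIVD, ADDD).

```python
def issue(op, dest, src_j, src_k, station, regs, reg_status):
    """Fill a reservation-station entry and rename dest to the station's tag."""
    entry = {"name": station, "op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for src, v_field, q_field in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
        if src in reg_status:                  # a pending instruction will produce it
            entry[q_field] = reg_status[src]   # record the producer's tag
        else:
            entry[v_field] = regs[src]         # copy the value now (buffers the operand)
    reg_status[dest] = station                 # later readers of dest wait on this station
    return entry

def write_result(reg, station, value, regs, reg_status):
    """Update the register only if it is still waiting on this station."""
    if reg_status.get(reg) == station:
        regs[reg] = value
        del reg_status[reg]

regs = {"F4": 4.0}                             # assumed initial value of F4
reg_status = {"F6": "Load1", "F2": "Load2"}    # both loads already issued

multd = issue("MULTD", "F0", "F2", "F4", "Mult1", regs, reg_status)
write_result("F6", "Load1", 1.5, regs, reg_status)   # Load1 returns M(A1)
subd = issue("SUBD", "F8", "F6", "F2", "Add1", regs, reg_status)
write_result("F2", "Load2", 2.5, regs, reg_status)   # Load2 returns M(A2)
divd = issue("DIVD", "F10", "F0", "F6", "Mult2", regs, reg_status)
addd = issue("ADDD", "F6", "F8", "F2", "Add2", regs, reg_status)
print(multd, subd, divd, addd, reg_status, sep="\n")
```

DIVD copied the old value of F6 at issue, so the later ADDD can take over F6 (its tag becomes Add2) without a WAR stall.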
17
Tomasulo Example
Instruction status (instruction stream):
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2
  LD     F2      45+   R3
  MULTD  F0      F2    F4
  SUBD   F8      F6    F2
  DIVD   F10     F0    F6
  ADDD   F6      F8    F2
Load buffers (3 load buffers):
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations (3 FP adder R.S., 2 FP mult R.S.; Time = FU countdown):
                                S1      S2      RS      RS
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   No
         Mult2   No
Register result status (Clock = clock cycle counter):
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  0       FU
18
Tomasulo Example Cycle 1
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1
  LD     F2      45+   R3
  MULTD  F0      F2    F4
  SUBD   F8      F6    F2
  DIVD   F10     F0    F6
  ADDD   F6      F8    F2
Load buffers:
  Name    Busy   Address
  Load1   Yes    34+R2
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   No
         Mult2   No
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  1       FU                           Load1
19
Tomasulo Example Cycle 2
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1
  LD     F2      45+   R3      2
  MULTD  F0      F2    F4
  SUBD   F8      F6    F2
  DIVD   F10     F0    F6
  ADDD   F6      F8    F2
Load buffers:
  Name    Busy   Address
  Load1   Yes    34+R2
  Load2   Yes    45+R3
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   No
         Mult2   No
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  2       FU           Load2           Load1
Note: Can have multiple loads outstanding
20
Tomasulo Example Cycle 3
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3
  LD     F2      45+   R3      2
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2
  DIVD   F10     F0    F6
  ADDD   F6      F8    F2
Load buffers:
  Name    Busy   Address
  Load1   Yes    34+R2
  Load2   Yes    45+R3
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    MULTD           R(F4)   Load2
         Mult2   No
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  3       FU   Mult1   Load2           Load1
• Note: registers names are removed (“renamed”) in
Reservation Stations; MULT issued
• Load1 completing; what is waiting for Load1?
21
Tomasulo Example Cycle 4
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4
  DIVD   F10     F0    F6
  ADDD   F6      F8    F2
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   Yes    45+R3
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    Yes    SUBD    M(A1)                   Load2
         Add2    No
         Add3    No
         Mult1   Yes    MULTD           R(F4)   Load2
         Mult2   No
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  4       FU   Mult1   Load2           M(A1)   Add1
• Load2 completing; what is waiting for Load2?
22
Tomasulo Example Cycle 5
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
  2      Add1    Yes    SUBD    M(A1)   M(A2)
         Add2    No
         Add3    No
  10     Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  5       FU   Mult1   M(A2)           M(A1)   Add1    Mult2
• Timer starts down for Add1, Mult1
23
Tomasulo Example Cycle 6
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
  1      Add1    Yes    SUBD    M(A1)   M(A2)
         Add2    Yes    ADDD            M(A2)   Add1
         Add3    No
  9      Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  6       FU   Mult1   M(A2)           Add2    Add1    Mult2
• Issue ADDD here despite name dependency on F6?
24
Tomasulo Example Cycle 7
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4        7
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
  0      Add1    Yes    SUBD    M(A1)   M(A2)
         Add2    Yes    ADDD            M(A2)   Add1
         Add3    No
  8      Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  7       FU   Mult1   M(A2)           Add2    Add1    Mult2
• Add1 (SUBD) completing; what is waiting for it?
25
Tomasulo Example Cycle 8
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
  2      Add2    Yes    ADDD    (M-M)   M(A2)
         Add3    No
  7      Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  8       FU   Mult1   M(A2)           Add2    (M-M)   Mult2
26
Tomasulo Example Cycle 9
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
  1      Add2    Yes    ADDD    (M-M)   M(A2)
         Add3    No
  6      Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  9       FU   Mult1   M(A2)           Add2    (M-M)   Mult2
27
Tomasulo Example Cycle 10
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6       10
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
  0      Add2    Yes    ADDD    (M-M)   M(A2)
         Add3    No
  5      Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  10      FU   Mult1   M(A2)           Add2    (M-M)   Mult2
• Add2 (ADDD) completing; what is waiting for it?
28
Tomasulo Example Cycle 11
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6       10           11
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
  4      Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  11      FU   Mult1   M(A2)           (M-M+M) (M-M)   Mult2
• Write result of ADDD here?
• All quick instructions complete in this cycle!
29
Tomasulo Example Cycle 12
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6       10           11
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
  3      Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  12      FU   Mult1   M(A2)           (M-M+M) (M-M)   Mult2
30
Tomasulo Example Cycle 13
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6       10           11
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
  2      Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  13      FU   Mult1   M(A2)           (M-M+M) (M-M)   Mult2
31
Tomasulo Example Cycle 14
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6       10           11
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
  1      Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  14      FU   Mult1   M(A2)           (M-M+M) (M-M)   Mult2
32
Tomasulo Example Cycle 15
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3       15
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6       10           11
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
  0      Mult1   Yes    MULTD   M(A2)   R(F4)
         Mult2   Yes    DIVD            M(A1)   Mult1
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  15      FU   Mult1   M(A2)           (M-M+M) (M-M)   Mult2
• Mult1 (MULTD) completing; what is waiting for it?
33
Tomasulo Example Cycle 16
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3       15           16
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6       10           11
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   No
  40     Mult2   Yes    DIVD    M*F4    M(A1)
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  16      FU   M*F4    M(A2)           (M-M+M) (M-M)   Mult2
• Just waiting for Mult2 (DIVD) to complete
34
Faster than light computation
(skip a couple of cycles)
35
Tomasulo Example Cycle 55
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3       15           16
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5
  ADDD   F6      F8    F2      6       10           11
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   No
  1      Mult2   Yes    DIVD    M*F4    M(A1)
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  55      FU   M*F4    M(A2)           (M-M+M) (M-M)   Mult2
36
Tomasulo Example Cycle 56
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3       15           16
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5       56
  ADDD   F6      F8    F2      6       10           11
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   No
  0      Mult2   Yes    DIVD    M*F4    M(A1)
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  56      FU   M*F4    M(A2)           (M-M+M) (M-M)   Mult2
• Mult2 (DIVD) is completing; what is waiting for it?
37
Tomasulo Example Cycle 57
Instruction status:
  Instruction    j     k     Issue   Exec comp   Write result
  LD     F6      34+   R2      1        3            4
  LD     F2      45+   R3      2        4            5
  MULTD  F0      F2    F4      3       15           16
  SUBD   F8      F6    F2      4        7            8
  DIVD   F10     F0    F6      5       56           57
  ADDD   F6      F8    F2      6       10           11
Load buffers:
  Name    Busy   Address
  Load1   No
  Load2   No
  Load3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   No
         Mult2   No
Register result status:
  Clock        F0      F2      F4      F6      F8      F10     F12   ...   F30
  57      FU   M*F4    M(A2)           (M-M+M) (M-M)   Result
• Once again: In-order issue, out-of-order execution
and out-of-order completion.
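The issue, execution-complete, and write-result cycles in the walkthrough above can be reproduced with a few lines of arithmetic. This is a sketch under simplifying assumptions that happen to hold for this example: one instruction issues per cycle, a free reservation station or load buffer is always available, no two results compete for the CDB, and execution starts the cycle after the last operand is broadcast. The latencies (2 for loads and adds, 10 for multiply, 40 for divide) are read off the Time column of the example; `schedule` and `LATENCY` are names introduced here.

```python
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, FP source registers); load addresses use integer registers
# that are assumed ready, so loads list no FP sources.
PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]

def schedule(program):
    """Return (op, dest, issue, exec_complete, write_result) per instruction."""
    ready_at = {}   # register -> cycle its newest pending value appears on the CDB
    rows = []
    for index, (op, dest, srcs) in enumerate(program):
        issue = index + 1                          # in-order, one issue per cycle
        # Operands are captured at issue or when their producer broadcasts.
        operands_ready = max([issue] + [ready_at.get(s, 0) for s in srcs])
        exec_complete = operands_ready + LATENCY[op]
        write_result = exec_complete + 1           # assumes the CDB is free
        ready_at[dest] = write_result              # renaming: later readers use this producer
        rows.append((op, dest, issue, exec_complete, write_result))
    return rows

for row in schedule(PROGRAM):
    print(row)
```

Because DIVD resolves its F6 operand before ADDD redefines F6, the model also shows why ADDD can write at cycle 11 while DIVD is still waiting: the WAR hazard never appears.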
38
Compare to Scoreboard Cycle 62
Instruction status:
  Instruction    j     k     Issue   Read operands   Execution complete   Write result
  LD     F6      34+   R2      1          2                 3                  4
  LD     F2      45+   R3      5          6                 7                  8
  MULTD  F0      F2    F4      6          9                19                 20
  SUBD   F8      F6    F2      7          9                11                 12
  DIVD   F10     F0    F6      8         21                61                 62
  ADDD   F6      F8    F2     13         14                16                 22
Functional unit status:
                               dest   S1   S2   FU for j   FU for k   Fj?   Fk?
  Time   Name      Busy   Op   Fi     Fj   Fk   Qj         Qk         Rj    Rk
         Integer   No
         Mult1     No
         Mult2     No
         Add       No
  0      Divide    No
Register result status:
  Clock        F0   F2   F4   F6   F8   F10   F12   ...   F30
  62      FU
• Why does it take longer on the Scoreboard/6600?
39
Tomasulo v. Scoreboard
(IBM 360/91 v. CDC 6600)
                      Tomasulo (IBM 360/91)             Scoreboard (CDC 6600)
  Functional units    Pipelined functional units        Multiple functional units
                      (6 load, 3 store, 3 +, 2 x/÷)     (1 load/store, 1 +, 2 x, 1 ÷)
  Window size         14 instructions                   5 instructions
  Structural hazard   No issue on structural hazard     same
  WAR                 Renaming avoids                   Stall completion
  WAW                 Renaming avoids                   Stall issue
  Results             Broadcast results from FU         Write/read registers
  Control             Reservation stations              Central scoreboard
40
Tomasulo Drawbacks
• Complexity
– delays of 360/91, MIPS 10000, Alpha 21264,
IBM PPC 620 in CA:AQA 2/e, but not in silicon!
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
– Each CDB must go to multiple functional units
high capacitance, high wiring density
– Number of functional units that can complete per cycle limited to one!
» Multiple CDBs => more FU logic for parallel assoc stores
• Non-precise interrupts!
– We will address this later
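The one-result-per-cycle limit of a single CDB can be illustrated with a small arbitration model. This is a sketch, not the 360/91 arbiter: `arbitrate_cdb` is a hypothetical function, results are granted the bus in completion order with ties broken by input (program) order, and a result that loses arbitration simply retries in the next cycle.

```python
def arbitrate_cdb(exec_complete, num_cdbs=1):
    """Map each unit to its write-result cycle.

    exec_complete: list of (unit name, cycle execution completes), in program order.
    """
    used = {}      # cycle -> number of results already granted a bus in that cycle
    granted = {}
    # sorted() is stable, so units finishing in the same cycle keep program order.
    for name, done in sorted(exec_complete, key=lambda item: item[1]):
        cycle = done + 1                       # earliest possible write-back
        while used.get(cycle, 0) >= num_cdbs:  # bus(es) already taken: wait a cycle
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        granted[name] = cycle
    return granted

finishing = [("Add2", 10), ("Mult1", 10), ("Mult2", 11)]
one_bus = arbitrate_cdb(finishing)
two_buses = arbitrate_cdb(finishing, num_cdbs=2)
print(one_bus)
print(two_buses)
```

With one bus the three results serialize over cycles 11 to 13. A second bus removes the delay, at the cost of the extra associative-match logic mentioned above.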
41
Summary: Tomasulo
• Prevents registers from becoming the bottleneck
– Where’s the new bottleneck?
• Avoids WAR, WAW hazards of Scoreboard
• If we assume branch prediction (next subject…)
– Allows loop unrolling in HW
– Not limited to basic blocks
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
» Out of order is OK if addresses don’t match
• 360/91 descendants are PowerPC 604, 620; MIPS R10000;
HP-PA 8000; Intel Pentium Pro
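The load/store disambiguation rule in the list above ("out of order is OK if addresses don't match") amounts to an address comparison against older, still-pending stores. A minimal sketch follows. `load_may_bypass` is a hypothetical helper, and treating a store whose address is not yet known (`None`) as a conflict is a conservative assumption of this sketch rather than something the slides state.

```python
def load_may_bypass(load_addr, pending_store_addrs):
    """True if a load may execute before the older stores that are still pending."""
    for store_addr in pending_store_addrs:
        if store_addr is None:          # address not computed yet: assume a conflict
            return False
        if store_addr == load_addr:     # same location: the load must see the store
            return False
    return True

# Second-iteration load from address 72 while the store to address 80 is pending.
print(load_may_bypass(72, [80]))
# A load from the same address as a pending store must wait.
print(load_may_bypass(80, [80]))
```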
42
Tomasulo Loop Example
Loop:  LD     F0   0    R1
       MULTD  F4   F0   F2
       SD     F4   0    R1
       SUBI   R1   R1   #8
       BNEZ   R1   Loop
• Assume Multiply takes 4 clocks
• Assume first load takes 8 clocks (cache miss?),
second load takes 4 clocks (hit)
• For clarity, will show clocks for SUBI, BNEZ
• In reality, integer instructions run ahead
43
Loop Example
Instruction status (ITER = iteration count):
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1
  1      MULTD  F4     F0   F2
  1      SD     F4     0    R1
  2      LD     F0     0    R1
  2      MULTD  F4     F0   F2
  2      SD     F4     0    R1
Load and store buffers (added store buffers):
  Name     Busy   Addr   Fu
  Load1    No
  Load2    No
  Load3    No
  Store1   No
  Store2   No
  Store3   No
Reservation stations:
                                S1      S2      RS      RS
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   No
         Mult2   No
Code (instruction loop):
  LD     F0   0    R1
  MULTD  F4   F0   F2
  SD     F4   0    R1
  SUBI   R1   R1   #8
  BNEZ   R1   Loop
Register result status (R1 = value of register used for address, iteration control):
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  0       80    Fu
44
Loop Example Cycle 1
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    Yes    80
  Load2    No
  Load3    No
  Store1   No
  Store2   No
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   No
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  1       80    Fu   Load1
45
Loop Example Cycle 2
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1
  1      MULTD  F4     F0   F2      2
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    Yes    80
  Load2    No
  Load3    No
  Store1   No
  Store2   No
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load1
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  2       80    Fu   Load1           Mult1
46
Loop Example Cycle 3
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    Yes    80
  Load2    No
  Load3    No
  Store1   Yes    80     Mult1
  Store2   No
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load1
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  3       80    Fu   Load1           Mult1
• Implicit renaming sets up data flow graph
47
Loop Example Cycle 4
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    Yes    80
  Load2    No
  Load3    No
  Store1   Yes    80     Mult1
  Store2   No
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load1
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  4       80    Fu   Load1           Mult1
• Dispatching SUBI Instruction (not in FP queue)
48
Loop Example Cycle 5
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    Yes    80
  Load2    No
  Load3    No
  Store1   Yes    80     Mult1
  Store2   No
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load1
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  5       72    Fu   Load1           Mult1
• And, BNEZ instruction (not in FP queue)
49
Loop Example Cycle 6
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    Yes    80
  Load2    Yes    72
  Load3    No
  Store1   Yes    80     Mult1
  Store2   No
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load1
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  6       72    Fu   Load2           Mult1
• Notice that F0 never sees Load from location 80
50
Loop Example Cycle 7
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6
  2      MULTD  F4     F0   F2      7
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    Yes    80
  Load2    Yes    72
  Load3    No
  Store1   Yes    80     Mult1
  Store2   No
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load1
         Mult2   Yes    Multd           R(F2)   Load2
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  7       72    Fu   Load2           Mult2
• Register file completely detached from computation
• First and Second iteration completely overlapped
51
Loop Example Cycle 8
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6
  2      MULTD  F4     F0   F2      7
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    Yes    80
  Load2    Yes    72
  Load3    No
  Store1   Yes    80     Mult1
  Store2   Yes    72     Mult2
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load1
         Mult2   Yes    Multd           R(F2)   Load2
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  8       72    Fu   Load2           Mult2
52
Loop Example Cycle 9
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6
  2      MULTD  F4     F0   F2      7
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    Yes    80
  Load2    Yes    72
  Load3    No
  Store1   Yes    80     Mult1
  Store2   Yes    72     Mult2
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load1
         Mult2   Yes    Multd           R(F2)   Load2
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  9       72    Fu   Load2           Mult2
• Load1 completing: who is waiting?
• Note: Dispatching SUBI
53
Loop Example Cycle 10
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6       10
  2      MULTD  F4     F0   F2      7
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    No
  Load2    Yes    72
  Load3    No
  Store1   Yes    80     Mult1
  Store2   Yes    72     Mult2
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
  4      Mult1   Yes    Multd   M[80]   R(F2)
         Mult2   Yes    Multd           R(F2)   Load2
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  10      64    Fu   Load2           Mult2
• Load2 completing: who is waiting?
• Note: Dispatching BNEZ
54
Loop Example Cycle 11
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6       10           11
  2      MULTD  F4     F0   F2      7
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    No
  Load2    No
  Load3    Yes    64
  Store1   Yes    80     Mult1
  Store2   Yes    72     Mult2
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
  3      Mult1   Yes    Multd   M[80]   R(F2)
  4      Mult2   Yes    Multd   M[72]   R(F2)
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  11      64    Fu   Load3           Mult2
• Next load in sequence
55
Loop Example Cycle 12
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6       10           11
  2      MULTD  F4     F0   F2      7
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    No
  Load2    No
  Load3    Yes    64
  Store1   Yes    80     Mult1
  Store2   Yes    72     Mult2
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
  2      Mult1   Yes    Multd   M[80]   R(F2)
  3      Mult2   Yes    Multd   M[72]   R(F2)
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  12      64    Fu   Load3           Mult2
• Why not issue third multiply?
56
Loop Example Cycle 13
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6       10           11
  2      MULTD  F4     F0   F2      7
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    No
  Load2    No
  Load3    Yes    64
  Store1   Yes    80     Mult1
  Store2   Yes    72     Mult2
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
  1      Mult1   Yes    Multd   M[80]   R(F2)
  2      Mult2   Yes    Multd   M[72]   R(F2)
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  13      64    Fu   Load3           Mult2
• Why not issue third store?
57
Loop Example Cycle 14
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2       14
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6       10           11
  2      MULTD  F4     F0   F2      7
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    No
  Load2    No
  Load3    Yes    64
  Store1   Yes    80     Mult1
  Store2   Yes    72     Mult2
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
  0      Mult1   Yes    Multd   M[80]   R(F2)
  1      Mult2   Yes    Multd   M[72]   R(F2)
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  14      64    Fu   Load3           Mult2
• Mult1 completing. Who is waiting?
58
Loop Example Cycle 15
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2       14           15
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6       10           11
  2      MULTD  F4     F0   F2      7       15
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    No
  Load2    No
  Load3    Yes    64
  Store1   Yes    80     [80]*R2
  Store2   Yes    72     Mult2
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   No
  0      Mult2   Yes    Multd   M[72]   R(F2)
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  15      64    Fu   Load3           Mult2
• Mult2 completing. Who is waiting?
59
Loop Example Cycle 16
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2       14           15
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6       10           11
  2      MULTD  F4     F0   F2      7       15           16
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    No
  Load2    No
  Load3    Yes    64
  Store1   Yes    80     [80]*R2
  Store2   Yes    72     [72]*R2
  Store3   No
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load3
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  16      64    Fu   Load3           Mult1
60
Loop Example Cycle 17
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2       14           15
  1      SD     F4     0    R1      3
  2      LD     F0     0    R1      6       10           11
  2      MULTD  F4     F0   F2      7       15           16
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    No
  Load2    No
  Load3    Yes    64
  Store1   Yes    80     [80]*R2
  Store2   Yes    72     [72]*R2
  Store3   Yes    64     Mult1
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load3
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  17      64    Fu   Load3           Mult1
61
Loop Example Cycle 18
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2       14           15
  1      SD     F4     0    R1      3       18
  2      LD     F0     0    R1      6       10           11
  2      MULTD  F4     F0   F2      7       15           16
  2      SD     F4     0    R1      8
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    No
  Load2    No
  Load3    Yes    64
  Store1   Yes    80     [80]*R2
  Store2   Yes    72     [72]*R2
  Store3   Yes    64     Mult1
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load3
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  18      64    Fu   Load3           Mult1
62
Loop Example Cycle 19
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2       14           15
  1      SD     F4     0    R1      3       18           19
  2      LD     F0     0    R1      6       10           11
  2      MULTD  F4     F0   F2      7       15           16
  2      SD     F4     0    R1      8       19
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    No
  Load2    No
  Load3    Yes    64
  Store1   No
  Store2   Yes    72     [72]*R2
  Store3   Yes    64     Mult1
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load3
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  19      56    Fu   Load3           Mult1
63
Loop Example Cycle 20
Instruction status:
  ITER   Instruction   j    k     Issue   Exec comp   Write result
  1      LD     F0     0    R1      1        9           10
  1      MULTD  F4     F0   F2      2       14           15
  1      SD     F4     0    R1      3       18           19
  2      LD     F0     0    R1      6       10           11
  2      MULTD  F4     F0   F2      7       15           16
  2      SD     F4     0    R1      8       19           20
Load and store buffers:
  Name     Busy   Addr   Fu
  Load1    Yes    56
  Load2    No
  Load3    Yes    64
  Store1   No
  Store2   No
  Store3   Yes    64     Mult1
Reservation stations:
  Time   Name    Busy   Op      Vj      Vk      Qj      Qk
         Add1    No
         Add2    No
         Add3    No
         Mult1   Yes    Multd           R(F2)   Load3
         Mult2   No
Register result status:
  Clock   R1         F0      F2      F4      F6      F8      F10     F12   ...   F30
  20      56    Fu   Load1           Mult1
• Once again: In-order issue, out-of-order execution
and out-of-order completion.
64
Why can Tomasulo overlap iterations of
loops?
• Register renaming
– Multiple iterations use different physical destinations for registers
(dynamic loop unrolling).
• Reservation stations
– Permit instruction issue to advance past integer control flow operations
– Also buffer old values of registers - totally avoiding the WAR stall that
we saw in the scoreboard.
• Other perspective: Tomasulo building data flow
dependency graph on the fly.
65
Tomasulo’s scheme offers 2 major
advantages
(1) the distribution of the hazard detection logic
– distributed reservation stations and the CDB
– If multiple instructions waiting on single result, & each instruction
has other operand, then instructions can be released simultaneously
by broadcast on CDB
– If a centralized register file were used, the units would have to read
their results from the registers when register buses are available.
(2) the elimination of stalls for WAW and WAR
hazards
66
Tomasulo Summary
• Reservation stations: renaming to larger set of registers
+ buffering source operands
– Prevents registers as bottleneck
– Avoids WAR, WAW hazards of Scoreboard
– Allows loop unrolling in HW
• Not limited to basic blocks (integer unit gets ahead,
beyond branches)
• Helps cache misses as well
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Pentium II; PowerPC 604; MIPS
R10000; HP-PA 8000; Alpha 21264
67