EE560 Tomasulo Project - University of Southern California

EE457
Out of Order (OoO) Execution
Introduction to Dynamic
Scheduling of Instructions
(The Tomasulo Algorithm)
By
Gandhi Puvvada
References
• EE557 Textbook
• Prof. Dubois’ EE557 Classnotes
• Prof. Annavaram’s slides
• Prof. Patterson’s Lecture slides
2
Programs often have several small fragments of
code, which can be executed in any order.
3
OoO (Out of Order) execution
Io = In order
"Execution" here means producing the results.
Completion means committing results
(writing into the register file or memory).
IoI (IoD) → OoE → IoC
In-order Issue/Dispatch, Out-of-order
Execution, and finally In-order
Completion/Commitment
4
IoC or OoC?
IoI (IoD) → OoE → IoC
IoC (In-order completion) is necessary to
support exceptions (e.g., a page fault).
Here we first present
IoI (IoD) → OoE → OoC
and then (at the end)
IoI (IoD) → OoE → IoC
5
OoC? But branches ..
OoC? We hope you are not executing
instructions beyond a branch and committing
them!
Well, we dispatch a branch, suspend
dispatching, and wait until the branch is
resolved. Then we resume dispatching
instructions beyond the branch, at either the
fall-through area or the target area.
6
Instruction Scheduling
(Re-ordering of instructions)
• Basic block = a straight-line code sequence with
no branches.
• Compiler can perform static instruction
scheduling.
• Tomasulo Algorithm lets us schedule
instructions dynamically (in hardware).
• Branch prediction and speculative execution beyond a
branch (of course with ability to flush wrong-path
instructions on misprediction) will be covered later (and
implemented on FPGA in EE560).
7
Register renaming to allow later
instructions to proceed
Original code:
lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);
lw $8, 60($3);
add $8, $8, $8;
sw $8, 60($3);
After renaming $8 to $48 in the second part:
lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);
lw $48, 60($3);
add $48, $48, $48;
sw $48, 60($3);
8
Static Scheduling (based on Prof. Dubois slide)
• Strengths
-- Hardware simplicity
-- Compiler has a global view of the code (does not help
the hardware much)
• Weaknesses
-- cannot be CPU-implementation specific
-- cannot foresee dynamic events:
   -- cache misses
   -- data-dependent delays
   -- conditional branches
-- can only reschedule instructions in a basic block (basic
block = a straight-line code sequence with no branches)
-- cannot pre-compute memory addresses
9
10
Simple 5-stage pipeline
In-order execution
RAW dependency
Solve it by forwarding,
if not, by stalling
Dependent instructions are stalled in the ID stage
[Figure: five-stage pipeline with stages IF (instruction memory, IM), ID, EX, M (data memory, DM), and WB]
11
Simple 5-stage pipeline:
Dependent instructions are stalled in the ID stage
[Figure: pipeline diagram with the lw and and instructions]
12
Simple 5-stage pipeline:
Dependent instructions can not be stalled in the
EX stage. Why?
[Figure: pipeline diagram with the lw and and instructions]
13
Provide multiple functional units
(for simplicity, we avoid talking about floating point
execution unit and floating point register file)
Stall, after decoding, in queues
[Figure: IF (IM) and ID stages feeding queues and functional units (Integer, Multiply, Divide, and Load/Store with DM), followed by WB]
14
Why do junior instructions carry their source register IDs into
the EX stage? Because they need to get help from Senior #1 or
Senior #2 in the EX stage under the control of the FU.
No more of that. There may be 40 seniors in front of
you. So I, the dispatch unit, will tell you from which
senior you need to get help for which source register.
[Figure label: rs, rt (IDs) are carried into EX]
15
Tomasulo’s plan
• OoO Out of order execution
• Multiple functional units
(say, Integer, DM, Multiplier, Divider)
• Queues between ID and EX stages
(in place of ID/EX register)
16
Out of order execution ?!
Problems all over ??!!
• For the time being, no branch prediction,
no speculative execution beyond
branches;
just stall on a conditional branch
• No support for precise exceptions for
the time being
Even then, …
17
RAW, WAR, and WAW
RAW = Read After Write
lw $8, 40($2);
add $9, $8, $7;
WAR = Write after Read
add $9, $8, $6;
lw $8, 40($2);
WAW = Write after Write
add $9, $8, $6;
lw $9, 40($2);
WAW? How is it possible?
Why would anyone produce a
result in $9 and then, without using
that result, overwrite
it with another result?
Consider a printer or a FIFO.
18
WAW can easily occur!
WAW ? How is it possible?
In out-of-order execution, instructions before the branch
and instructions after the branch can coexist.
For example, multiple iterations of this loop can coexist
in the execution area.
So, what?
Loop: LW   $2, 40($1);
      MULT $4, $2, $3;
      SW   $4, 40($1);
      ADDI $1, $1, -4;
      BNE  $1, $0, Loop;
19
Say a company gives standard bonus to most of the
employees and a higher bonus to the managers.
So you load into $3 standard bonus from the stdbonus
location in memory. And then you check to see if it is a case
of a manager, and then load into $3 again (overwriting the
earlier $3) the special bonus from the special location in
memory.
LW $3, stdbonus($0)
BNE $1, $2, SKIP
LW $3, special($0)
20
RAW, WAR, and WAW
(some terminology to remember)
RAW = Read After Write: a true dependency
lw $8, 40($2);
add $9, $8, $7;
WAR = Write After Read: an anti-dependency (a name dependence)
add $9, $8, $6;
lw $8, 40($2);
WAW = Write After Write: an output dependency (a name dependence)
add $9, $8, $6;
lw $9, 40($2);
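The three cases can be checked mechanically by comparing the destination and source registers of a senior (earlier) and a junior (later) instruction. The sketch below is only an illustration: the `Instr` record and the `hazards` function are hypothetical names, and each instruction is assumed to have at most one destination register.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, a tuple of sources."""
    text: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def hazards(senior: Instr, junior: Instr) -> Set[str]:
    """Return the register hazards the junior has with respect to the senior."""
    found = set()
    if senior.dest is not None and senior.dest in junior.srcs:
        found.add("RAW")   # junior reads what senior writes
    if junior.dest is not None and junior.dest in senior.srcs:
        found.add("WAR")   # junior overwrites what senior still has to read
    if senior.dest is not None and senior.dest == junior.dest:
        found.add("WAW")   # both write the same register
    return found


raw = hazards(Instr("lw $8, 40($2)", "$8", ("$2",)),
              Instr("add $9, $8, $7", "$9", ("$8", "$7")))
war = hazards(Instr("add $9, $8, $6", "$9", ("$8", "$6")),
              Instr("lw $8, 40($2)", "$8", ("$2",)))
waw = hazards(Instr("add $9, $8, $6", "$9", ("$8", "$6")),
              Instr("lw $9, 40($2)", "$9", ("$2",)))
print(raw, war, waw)
```

The three calls use the instruction pairs from the slide, one per hazard type.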
21
RAW, WAR, and WAW
• In-order execution:
We need to deal with RAW only.
• Out of order execution:
Now we need to deal with
WAR and WAW besides RAW.
22
23
Limited Architectural Registers
More Physical Registers
Register Renaming
lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);
lw $8, 60($3);
add $8, $8, $8;
sw $8, 60($3);
It is clear that the compiler is using $8 as
a temporary register.
If there is a delay in obtaining $2, the
first part of the code cannot
proceed.
Unfortunately, the second part of the
code cannot proceed either, because of
the name dependency on $8.
24
If we had 64 registers instead of 32 registers,
then perhaps the compiler might have used $48
instead of $8, and we could have executed the
second part of the code before the first part!
lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);
lw $48, 60($3);
add $48, $48, $48;
sw $48, 60($3);
This is an example of
name dependency.
25
Four different temporary registers can be
used here as shown: $8, $18, $28, and $48
(or called with coded names, LION, TIGER,
CAT, and ANT).
With register numbers:
lw $8, 40($2);
add $18, $8, $8;
sw $18, 40($2);
lw $28, 60($3);
add $48, $28, $28;
sw $48, 60($3);
With coded names:
lw LION, 40($2);
add TIGER, LION, LION;
sw TIGER, 40($2);
lw CAT, 60($3);
add ANT, CAT, CAT;
sw ANT, 60($3);
26
Can a later implementation provide
64 registers (instead of 32) while
maintaining binary compatibility
with previously compiled codes?
Answer: Yes / No
Why?
27
Answer: No. We cannot change the number of
Architectural Registers
Register Renaming Through Tagging
Registers
This solves name dependency
problems (WAR and WAW) while
attending to true dependency (RAW)
through waiting in queues.
28
square_root $2, $10;
lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);
lw $8, 60($3);
add $8, $8, $8;
sw $8, 60($3);
[Figure: the RST and the RF, each with one entry per register ($1 through $31); the code is annotated to mark each destination register and its dependent source registers]
RST = Register Status Table
RF = Register File
29
Dispatch unit decodes and
dispatches instructions.
square_root $2, $10;
lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);
lw $8, 60($3);
add $8, $8, $8;
sw $8, 60($3);
For the destination operand, an
instruction carries a TAG (but
not the actual register name)!
For source operands, an
instruction carries either the
values or the TAGs of the
operands (but not the actual
register names)!
33
Register Renaming
34
TAGs for destinations or sources or
for both?
• A new tag is assigned to the destination register of the
instruction being dispatched.
• For each of the source registers (source operands) of
the instruction being dispatched, either the value of the
source register (if it has not been previously tagged)
or the existing tag associated with the source register (if
it has been tagged already) is conveyed to the
instruction.
• If a tag is conveyed for a source, then the instruction
needs to wait for the original instruction with that
destination tag to go on to the CDB and announce the
value.
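The dispatch rules above can be sketched in a few lines. This is a minimal model under stated assumptions, not the course's hardware design: the class `Dispatcher`, its fields, and its method names are hypothetical, the register file starts at zero, and the issue queues that would capture broadcast values are left out.

```python
from collections import deque


class Dispatcher:
    """Toy dispatch unit: RF, RST, and a pool of free tags (all names hypothetical)."""

    def __init__(self, num_tags=64):
        self.rf = {f"${i}": 0 for i in range(32)}   # register file values
        self.rst = {}                               # register -> tag of its pending producer
        self.free_tags = deque(range(num_tags))     # stand-in for the TAG FIFO

    def dispatch(self, dest, srcs):
        """Return (destination tag, one ('value', v) or ('tag', t) pair per source)."""
        # Read the sources before retagging the destination, so that
        # "add $8, $8, $8" waits on the OLD producer of $8.
        operands = [("tag", self.rst[r]) if r in self.rst else ("value", self.rf[r])
                    for r in srcs]
        dest_tag = None
        if dest is not None:
            dest_tag = self.free_tags.popleft()
            self.rst[dest] = dest_tag
        return dest_tag, operands

    def cdb_broadcast(self, tag, value):
        """A finished instruction announces (tag, value) on the CDB."""
        for reg, pending in list(self.rst.items()):
            if pending == tag:          # only the latest producer updates the RF
                self.rf[reg] = value
                del self.rst[reg]
        self.free_tags.append(tag)      # the tag can now be reused


d = Dispatcher()
t_sqrt, _ = d.dispatch("$2", ["$10"])               # square_root $2, $10
t_lw1, lw1_ops = d.dispatch("$8", ["$2"])           # lw  $8, 40($2)
t_add1, add1_ops = d.dispatch("$8", ["$8", "$8"])   # add $8, $8, $8
_, sw1_ops = d.dispatch(None, ["$8", "$2"])         # sw  $8, 40($2)
t_lw2, lw2_ops = d.dispatch("$8", ["$3"])           # lw  $8, 60($3)
d.cdb_broadcast(t_lw1, 99)                          # stale producer of $8: RF untouched
```

In this run the second `lw` receives a plain value for `$3` and no tag, so it does not have to wait for the first part of the code even though both parts name `$8`.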
35
Unique TAG
• Like SSN, we need a unique TAG
• SSNs are reused.
• Similarly TAGs can be reused.
• TAGs are similar to the number TOKENs.
36
Take a number vs. Take a token
Take a number: helps to create a
virtual queue.
We do not need
that here!
In State Bank of India, the cashier issues brass
tokens to customers trying to draw money as an
identification (and not at all to put them in any
virtual queue). Token numbers are in random
order.
The cashier verifies the signature in the records
room, returns with the money, calls the token
number, and issues the money.
Tokens are reclaimed and reused.
37
TAGs (= Tokens)
• How many tokens should the bank
cashier have to start with?
• What happens if the tokens run out?
• Does he need to have any order in holding
tokens and issuing tokens?
• Does he have to collect tokens back?
38
TAG FIFO
(FIFOs are taught in EE560)
• To issue and collect Tokens (TAGs),
use a circular FIFO (First-in-First-Out) unit.
While the FIFO-order is not important here, a FIFO is the easiest to
implement in hardware compared to a random order in a pile.
• Filled with (say) 64 tokens (in any order) initially on reset.
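• Tokens return out of order anyway.
• Put returned tokens back in the FIFO and issue them again.
[Figure: three snapshots of the circular TAG FIFO (locations 0 to 63) with its write pointer (wp) and read pointer (rp): Full; 2 tokens issued; 1 token returned]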
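The three snapshots (full, 2 tokens issued, 1 token returned) can be reproduced with a small software model of the circular FIFO. This is only a sketch: the class name `TagFifo`, the `count` field, and the method names are assumptions made for illustration, and returning `None` stands in for stalling dispatch when no tag is available.

```python
class TagFifo:
    """Circular FIFO of tags with a read pointer (rp) and a write pointer (wp)."""

    def __init__(self, depth=64):
        self.depth = depth
        self.slots = list(range(depth))  # filled with all tags on reset
        self.rp = 0                      # next location to issue a tag from
        self.wp = 0                      # next location to store a returned tag
        self.count = depth               # number of tags currently in the FIFO

    def issue(self):
        """Hand out one tag, or None if the tags have run out (dispatch stalls)."""
        if self.count == 0:
            return None
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.depth
        self.count -= 1
        return tag

    def give_back(self, tag):
        """Return a tag to the FIFO; tags may come back in any order."""
        if self.count == self.depth:
            raise ValueError("FIFO is already full")
        self.slots[self.wp] = tag
        self.wp = (self.wp + 1) % self.depth
        self.count += 1


fifo = TagFifo()      # Full: rp = 0, wp = 0
a = fifo.issue()      # 2 tokens issued: rp = 2, wp = 0
b = fifo.issue()
fifo.give_back(b)     # 1 token returned (out of order): wp = 1, rp = 2
```

The returned tag lands in location 0 and is issued again only after the tags that were already waiting in the FIFO, which is harmless because the order of tags carries no meaning.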
39
Block Diagram
Simplified for EE457; provided by Prof. Dubois
[Figure: block diagram including the TAG FIFO, the Issue Unit, the Integer Multiplier, and the Integer Divider]
CDB = Common Data Bus (compare it to a public address system)
40
Front-End & Back-End
• IFQ Instruction Fetch Queue (a FIFO structure)
• Dispatch unit (including RST, RF, Tag FIFO)
• Load Store and other Issue Queues
• Issue Unit
• Functional units
• CDB (Common Data Bus)
41
42
Bottleneck in the design
• CDB = Common Data Bus
Do all instructions use CDB?
• sw ?
• j (jump)?
• beq
43
load store queue
• Address calculation
• Memory disambiguation
Mr. Bruin: Let me take a guess!
You will now propose to have an MST (Memory
Status Table) (like the RST).
And you will rename memory locations to solve
WAW and WAR problems among memory
locations, right?!
44
MST (Memory Status Table)?
No way! It is too big!
We will just ask the junior to stall and wait to solve
his WAR and WAW problems with his seniors.
[Figure: the RST beside the RF (registers $1 through $31), and a would-be MST beside the much larger Memory]
45
Address calculation for lw and sw
EE557
approach for address calculation
EE457/560
approach for address calculation
A dedicated adder, attached to the
load-store queue, computes the address.
46
Memory Disambiguation
EE557
47
Memory Disambiguation
RAW
sw $2, 2000($0);
lw $8, 2000($0);
WAW
sw $2, 2000($0);
sw $8, 2000($0);
WAR
lw $2, 2000($0);
sw $8, 2000($0);
48
Memory Disambiguation
RAW
sw $2, 2000($0);
lw $8, 2000($0);
This later lw can proceed
only if there is no store ahead
of it with the same address.
WAW
sw $2, 2000($0);
sw $8, 2000($0);
This later sw can proceed
only if there is no store ahead
of it with the same address.
WAR
lw $2, 2000($0);
sw $8, 2000($0);
This later sw can proceed
only if there is no load ahead
of it with the same address.
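The three rules can be folded into one check over a load-store queue kept in program order. The sketch below is an illustration under assumptions: the function name `may_proceed` and the `(op, address)` tuple layout are hypothetical, and a senior whose address is not yet computed (`None`) is treated conservatively as a possible match, which the slides do not spell out.

```python
def may_proceed(queue, index):
    """Decide whether the memory instruction at queue[index] may access memory.

    queue holds (op, address) pairs in program order, oldest first;
    op is "lw" or "sw", and address is None until it has been computed.
    """
    op, addr = queue[index]
    if addr is None:
        return False                  # own address not computed yet
    for senior_op, senior_addr in queue[:index]:
        if op == "lw" and senior_op == "lw":
            continue                  # a load after a load is never a hazard
        if senior_addr is None or senior_addr == addr:
            return False              # RAW, WAW, or WAR (or an unknown address)
    return True


raw_case = may_proceed([("sw", 2000), ("lw", 2000)], 1)   # store ahead, same address
waw_case = may_proceed([("sw", 2000), ("sw", 2000)], 1)   # store ahead, same address
war_case = may_proceed([("lw", 2000), ("sw", 2000)], 1)   # load ahead, same address
no_clash = may_proceed([("sw", 2000), ("lw", 2004)], 1)   # different addresses
print(raw_case, waw_case, war_case, no_clash)
```

Because the check scans every senior ahead of the instruction, it relies on the queue preserving program order.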
49
Maintaining instructions in the order
of arrival (issue order/program order)
in a queue
Is it necessary or is it desirable?
In the case of L-S Queue ?
In the case of Integer and other queues (mult
queue, div queue)?
50
Maintaining instructions in the order
of arrival (issue order/program order)
in a queue
Is it necessary or is it desirable?
In the case of L-S Queue ?
NECESSARY to enforce memory
disambiguation rules
In the case of Integer and other queues (mult
queue, div queue)?
DESIRABLE, so that an earlier instruction
gets executed whenever possible, thereby
perhaps reducing the number of instructions
waiting on it.
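For the integer, multiply, and divide queues, one simple way to favor earlier instructions is to keep each queue in order of arrival and pick the first entry whose operands are all available. The sketch below assumes a hypothetical entry layout (a dictionary with a `waiting_tags` list that holds a tag for each operand still awaited and `None` for each operand already captured); it is not taken from the slides.

```python
def select_oldest_ready(queue):
    """Return the position of the oldest ready entry, or None if none is ready.

    queue is ordered oldest first. An entry is ready when none of its
    source operands is still waiting on a tag.
    """
    for position, entry in enumerate(queue):
        if all(tag is None for tag in entry["waiting_tags"]):
            return position
    return None


int_queue = [
    {"text": "add $8, $8, $8", "waiting_tags": [1, 1]},         # waits on tag 1
    {"text": "addi $1, $1, -4", "waiting_tags": [None]},        # ready
    {"text": "add $9, $8, $6", "waiting_tags": [None, None]},   # ready, but younger
]
chosen = select_oldest_ready(int_queue)
print(int_queue[chosen]["text"])
```

The oldest entry is skipped because it is still waiting, so execution is out of order, yet among the ready entries the earlier one wins.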
51
Priority (based on the order
of arrival) among instructions ready
to execute
• Is it necessary or is it desirable?
• Local priority within the queues
• Global priority across the queues
52
Issue Unit
CDB
• CDB availability constraint
• Pipelined functional unit
vs.
Multi-cycle functional unit
• Conflict resolution
Is round-robin priority adequate? Well, …
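For reference, a round-robin arbiter among the units requesting the CDB can be sketched as follows. The function name and the request-list layout are assumptions for illustration; the slide itself leaves open whether this policy is adequate, and the sketch does not settle that question.

```python
def round_robin_grant(requests, last_granted):
    """Grant the CDB to the first requester after the one granted last time.

    requests is a list of booleans, one per functional unit.
    Returns the index of the granted unit, or None if nobody is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None


# Hypothetical unit order: 0 = integer, 1 = load/store, 2 = multiplier, 3 = divider
requests = [True, False, True, True]
first = round_robin_grant(requests, last_granted=0)
second = round_robin_grant(requests, last_granted=first)
third = round_robin_grant(requests, last_granted=second)
print(first, second, third)
```

Note that this policy looks only at which unit was served last and ignores the age of the instructions, which is one reason to ask whether it is adequate.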
53
Conditional branches
• The dispatch unit stops dispatching until the
branch is resolved.
• The CDB broadcasts the result of the branch.
• Dispatching continues thereafter at either
the fall-through instruction or the target
instruction.
• A successful (taken) branch causes flushing of
the IFQ, very much like a jump.
54
Conditional branches
• Since we stop dispatching instructions
after a branch, does it mean that this
branch is the last instruction to be
executed in the back-end ?
• Is it possible that the back-end holds
simultaneously (a) some instructions
dispatched before the branch and (b)
some instructions issued after the branch
was resolved?
55
Tomasulo Loop Example
Loop: LW   $2, 40($1);
      MULT $4, $2, $3;
      SW   $4, 40($1);
      ADDI $1, $1, -4;
      BNE  $1, $0, Loop;
• Assume Multiply takes 4 clocks
• Assume first load takes 8 clocks (cache
miss), second load takes 1 clock (hit)
Based on Prof. Annavaram’s lecture slide
56
How could Tomasulo overlap
iterations of loops?
Loop: LW   $2, 40($1);
      MULT $4, $2, $3;
      SW   $4, 40($1);
      ADDI $1, $1, -4;
      BNE  $1, $0, Loop;
The destination registers bear
different TAGs in different
iterations. These tags are
given, in place of the source
operands, to the dependent
instructions following them.
57
Say, only two iterations.
Let us unroll the two iterations.
Loop: LW   $2, 40($1);
      MULT $4, $2, $3;
      SW   $4, 40($1);
      ADDI $1, $1, -4;
      BNE  $1, $0, Loop;
Loop: LW   $2, 40($1);
      MULT $4, $2, $3;
      SW   $4, 40($1);
      ADDI $1, $1, -4;
      BNE  $1, $0, Loop;
[Figure legend: each destination register and its dependent source register(s) are highlighted]
58
Because there is no reorder buffer.
Note: Your EE560 project will use a reorder buffer and much more!
59