Computer Architecture


Advanced Computer
Architecture
Chapter 4
Advanced Pipelining
Ioannis Papaefstathiou
CS 590.25
Easter 2003
(thanks to Hennessy & Patterson)
Chapter Overview
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP
Chap. 4 - Pipelining II
2
Chapter Overview
Technique                                    Reduces                        Section
Loop Unrolling                               Control stalls                 4.1
Basic Pipeline Scheduling                    RAW stalls                     4.1
Dynamic Scheduling with Scoreboarding        RAW stalls                     4.2
Dynamic Scheduling with Register Renaming    WAR and WAW stalls             4.2
Dynamic Branch Prediction                    Control stalls                 4.3
Issue Multiple Instructions per Cycle        Ideal CPI                      4.4
Compiler Dependence Analysis                 Ideal CPI & data stalls        4.5
Software Pipelining and Trace Scheduling     Ideal CPI & data stalls        4.5
Speculation                                  All data & control stalls      4.6
Dynamic Memory Disambiguation                RAW stalls involving memory    4.2, 4.6
Instruction Level
Parallelism
4.1 Instruction Level Parallelism:
Concepts and Challenges

ILP is the observation that many instructions in code don’t depend on each other, which means it’s possible to execute those instructions in parallel.

This is easier said than done. Issues include:
• building compilers to analyze the code,
• building hardware to be even smarter than that code.

This section looks at some of the problems to be solved.
Instruction Level
Parallelism
Pipeline Scheduling and
Loop Unrolling
Terminology
Basic Block - the set of instructions between entry points and branches. A basic block has only one entry and one exit; typically it is about 6 instructions long.
Loop Level Parallelism - that parallelism that exists within a loop. Such
parallelism can cross loop iterations.
Loop Unrolling - replicating the loop body several times so that either the compiler or the hardware can exploit the parallelism inherent in the loop.
Instruction Level
Parallelism
Pipeline Scheduling and
Loop Unrolling
Simple Loop and its Assembler Equivalent
This is a clean and
simple example!
for (i=1; i<=1000; i++)
x(i) = x(i) + s;
Loop:   LD    F0,0(R1)    ;F0=vector element
        ADDD  F4,F0,F2    ;add scalar from F2
        SD    0(R1),F4    ;store result
        SUBI  R1,R1,8     ;decrement pointer 8 bytes (DW)
        BNEZ  R1,Loop     ;branch R1!=zero
        NOP               ;delayed branch slot
Instruction Level
Parallelism
Pipeline Scheduling and
Loop Unrolling
FP Loop Hazards
Loop:   LD    F0,0(R1)    ;F0=vector element
        ADDD  F4,F0,F2    ;add scalar in F2
        SD    0(R1),F4    ;store result
        SUBI  R1,R1,8     ;decrement pointer 8B (DW)
        BNEZ  R1,Loop     ;branch R1!=zero
        NOP               ;delayed branch slot

Where are the stalls?

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0
Instruction Level
Parallelism
Pipeline Scheduling and
Loop Unrolling
FP Loop Showing Stalls
 1 Loop:  LD    F0,0(R1)    ;F0=vector element
 2        stall
 3        ADDD  F4,F0,F2    ;add scalar in F2
 4        stall
 5        stall
 6        SD    0(R1),F4    ;store result
 7        SUBI  R1,R1,8     ;decrement pointer 8 bytes (DW)
 8        stall
 9        BNEZ  R1,Loop     ;branch R1!=zero
10        stall             ;delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

10 clocks: Rewrite code to minimize stalls?
Instruction Level
Parallelism
Pipeline Scheduling and
Loop Unrolling
Scheduled FP Loop Minimizing Stalls
 1 Loop:  LD    F0,0(R1)
 2        SUBI  R1,R1,8
 3        ADDD  F4,F0,F2
 4        stall
 5        BNEZ  R1,Loop     ;delayed branch
 6        SD    8(R1),F4    ;altered when move past SUBI

The stall remains because SD can’t proceed earlier. We swap BNEZ and SD by changing the address of SD.

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

Now 6 clocks. Next, unroll the loop 4 times to make it faster.
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
Unroll Loop Four Times (straightforward way)
 1 Loop:  LD    F0,0(R1)
 2        stall
 3        ADDD  F4,F0,F2
 4        stall
 5        stall
 6        SD    0(R1),F4
 7        LD    F6,-8(R1)
 8        stall
 9        ADDD  F8,F6,F2
10        stall
11        stall
12        SD    -8(R1),F8
13        LD    F10,-16(R1)
14        stall
15        ADDD  F12,F10,F2
16        stall
17        stall
18        SD    -16(R1),F12
19        LD    F14,-24(R1)
20        stall
21        ADDD  F16,F14,F2
22        stall
23        stall
24        SD    -24(R1),F16
25        SUBI  R1,R1,#32
26        BNEZ  R1,LOOP
27        stall
28        NOP

15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per iteration.
Assumes the number of iterations is a multiple of 4.
Rewrite loop to minimize stalls.
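The straightforward unrolling above can be sketched in C (a minimal sketch with hypothetical names; like the slide, it assumes the trip count is a multiple of 4, and uses distinct temporaries `t0..t3` the way the assembly uses F4/F8/F12/F16):

```c
#include <assert.h>

#define N 1000  /* trip count, a multiple of 4, as the slide assumes */

/* Original loop: x[i] = x[i] + s */
static void loop_simple(double *x, double s) {
    for (int i = 0; i < N; i++)
        x[i] = x[i] + s;
}

/* Unrolled 4 times: one copy of the loop-maintenance code per 4 bodies.
   Distinct temporaries avoid reusing one "register" for all 4 results. */
static void loop_unrolled(double *x, double s) {
    for (int i = 0; i < N; i += 4) {
        double t0 = x[i]     + s;
        double t1 = x[i + 1] + s;
        double t2 = x[i + 2] + s;
        double t3 = x[i + 3] + s;
        x[i] = t0; x[i + 1] = t1; x[i + 2] = t2; x[i + 3] = t3;
    }
}
```

Both versions compute the same result; the unrolled one simply gives a scheduler four independent bodies to interleave.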
Instruction Level
Parallelism
Pipeline Scheduling and
Loop Unrolling
Unrolled Loop That Minimizes Stalls
 1 Loop:  LD    F0,0(R1)
 2        LD    F6,-8(R1)
 3        LD    F10,-16(R1)
 4        LD    F14,-24(R1)
 5        ADDD  F4,F0,F2
 6        ADDD  F8,F6,F2
 7        ADDD  F12,F10,F2
 8        ADDD  F16,F14,F2
 9        SD    0(R1),F4
10        SD    -8(R1),F8
11        SD    -16(R1),F12
12        SUBI  R1,R1,#32
13        BNEZ  R1,LOOP
14        SD    8(R1),F16   ; 8-32 = -24

What assumptions were made when the code was moved?
– OK to move the store past SUBI even though it changes the register
– OK to move loads before stores: do we get the right data?
– When is it safe for the compiler to make such changes?

No stalls!! 14 clock cycles, or 3.5 per iteration.
Instruction Level
Parallelism
Pipeline Scheduling and
Loop Unrolling
Summary of Loop Unrolling Example
• Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset.
• Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.
• Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.
• Eliminate the extra tests and branches and adjust the loop maintenance code.
• Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.
• Schedule the code, preserving any dependences needed to yield the same result as the original code.
Instruction Level
Parallelism
Dependencies
Compiler Perspectives on Code Movement
The compiler is concerned about dependencies in the program; whether a dependence becomes a hazard in the hardware depends on the given pipeline.
• Tries to schedule code to avoid hazards.
• Looks for Data dependencies (RAW if a hazard for HW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data
dependent on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
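The "hard for memory" point can be made concrete in C (a hedged sketch with hypothetical names): with plain pointers the compiler must assume the two accesses may refer to the same word, exactly the 100(R4) = 20(R6) question, so it cannot move the load above the store; C99 `restrict` is the programmer's promise that they don't alias:

```c
#include <assert.h>

/* With plain pointers, *p and *q may alias: the compiler must keep
   the store to *p before the load of *q. */
static int may_alias(int *p, int *q) {
    *p = 1;
    return *q;   /* if p == q, this must see the store and return 1 */
}

/* restrict promises no aliasing, so the load could legally be
   scheduled above the store. */
static int no_alias(int *restrict p, int *restrict q) {
    *p = 1;
    return *q;
}
```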
Instruction Level
Parallelism
Data Dependencies
Compiler Perspectives on Code Movement
1 Loop:  LD    F0,0(R1)
2        ADDD  F4,F0,F2
3        SUBI  R1,R1,8
4        BNEZ  R1,Loop    ;delayed branch
5        SD    8(R1),F4   ;altered when move past SUBI

Where are the data dependencies?
Instruction Level
Parallelism
Name Dependencies
Compiler Perspectives on Code Movement
• Another kind of dependence is called name dependence: two instructions use the same name (register or memory location) but don’t exchange data.
• Anti-dependence (WAR if a hazard for HW)
  – Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first.
• Output dependence (WAW if a hazard for HW)
  – Instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
Instruction Level
Parallelism
Name Dependencies
Compiler Perspectives on Code Movement
 1 Loop:  LD    F0,0(R1)
 2        ADDD  F4,F0,F2
 3        SD    0(R1),F4
 4        LD    F0,-8(R1)
 5        ADDD  F4,F0,F2
 6        SD    -8(R1),F4
 7        LD    F0,-16(R1)
 8        ADDD  F4,F0,F2
 9        SD    -16(R1),F4
10        LD    F0,-24(R1)
11        ADDD  F4,F0,F2
12        SD    -24(R1),F4
13        SUBI  R1,R1,#32
14        BNEZ  R1,LOOP
15        NOP

Where are the name dependencies?
No data is passed in F0, but we can’t reuse F0 in line 4.
How can we remove these dependencies?
Instruction Level
Parallelism
Name Dependencies
Compiler Perspectives on Code Movement
• Again, name dependencies are hard for memory accesses:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn’t change then:
  0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
  There were no dependencies between some loads and stores, so they could be moved past each other.
Instruction Level
Parallelism
Control Dependencies
Compiler Perspectives on Code Movement
• The final kind of dependence is called control dependence.
• Example:
  if p1 {S1;};
  if p2 {S2;};
  S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
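One concrete reason S1 cannot simply be hoisted above its guard (a minimal C sketch, hypothetical names): if S1 is a divide and p1 is the test that its divisor is nonzero, executing S1 unconditionally before the branch would fault or change the result when the predicate is false.

```c
#include <assert.h>

/* S1 (the divide) is control dependent on p1 (b != 0): executing it
   before the branch would divide by zero whenever b == 0. */
static int safe_div(int a, int b) {
    int r = 0;
    if (b != 0)       /* p1 */
        r = a / b;    /* S1 */
    return r;
}
```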
Instruction Level
Parallelism
Control Dependencies
Compiler Perspectives on Code Movement
• Two (obvious) constraints on control dependences:
  – An instruction that is control dependent on a branch cannot be moved before the branch in such a way that its execution is no longer controlled by the branch.
  – An instruction that is not control dependent on a branch cannot be moved after the branch in such a way that its execution becomes controlled by the branch.
• Control dependencies can be relaxed to get parallelism; we get the same effect if we preserve the order of exceptions (an address in a register is checked by a branch before use) and the data flow (a value in a register depends on the branch).
Instruction Level
Parallelism
Control Dependencies
Compiler Perspectives on Code Movement
 1 Loop:  LD    F0,0(R1)
 2        ADDD  F4,F0,F2
 3        SD    0(R1),F4
 4        SUBI  R1,R1,8
 5        BEQZ  R1,exit
 6        LD    F0,0(R1)
 7        ADDD  F4,F0,F2
 8        SD    0(R1),F4
 9        SUBI  R1,R1,8
10        BEQZ  R1,exit
11        LD    F0,0(R1)
12        ADDD  F4,F0,F2
13        SD    0(R1),F4
14        SUBI  R1,R1,8
15        BEQZ  R1,exit
....

Where are the control dependencies?
Instruction Level
Parallelism
Loop Level Parallelism
When Safe to Unroll Loop?
• Example: Where are the data dependencies? (A, B, C distinct & non-overlapping)

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];     /* S1 */
    B[i+1] = B[i] + A[i+1];   /* S2 */
}

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

This is a “loop-carried dependence” between iterations.
• Implies that the iterations are dependent and can’t be executed in parallel.
• Not the case for our prior example: there, each iteration was independent.
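The loop-carried dependence can be checked mechanically (a minimal C sketch, hypothetical array size): running S1 serially, so each iteration reads the A value the previous one just wrote, differs from running every iteration against the original A, as a parallel execution that ignored the carried dependence would.

```c
#include <assert.h>
#include <string.h>

#define N 8

/* Serial execution of S1: A[i+1] = A[i] + C[i], i = 1..N
   (each iteration reads the value the previous one wrote). */
static void s1_serial(double *A, const double *C) {
    for (int i = 1; i <= N; i++)
        A[i + 1] = A[i] + C[i];
}

/* What an (incorrect) fully parallel run would compute: every
   iteration reads the ORIGINAL A, ignoring the carried dependence. */
static void s1_parallel_wrong(double *A, const double *C) {
    double old[N + 2];
    memcpy(old, A, sizeof old);
    for (int i = 1; i <= N; i++)
        A[i + 1] = old[i] + C[i];
}
```

The two results disagree, which is exactly what "iterations are dependent" means.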
Instruction Level
Parallelism
Loop Level Parallelism
When Safe to Unroll Loop?
• Example: Where are the data dependencies? (A, B, C, D distinct & non-overlapping)

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];       /* S1 */
    B[i+1] = C[i] + D[i];     /* S2 */
}

1. There is no dependence from S1 to S2. If there were, there would be a cycle in the dependencies and the loop would not be parallel. Since this other dependence is absent, interchanging the two statements will not affect the execution of S2.
2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.
Instruction Level
Parallelism
Loop Level Parallelism
Now Safe to Unroll Loop? (p. 240)
OLD:
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];       /* S1 */
    B[i+1] = C[i] + D[i];     /* S2 */
}

NEW:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];

No circular dependencies. The loop caused a dependence on B; the rewrite has eliminated the loop-carried dependence.
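The p. 240 transformation can be verified directly in C (a small sketch, with S1 in the form A[i] = A[i] + B[i] for which the rewrite is exact; the array contents below are arbitrary test data):

```c
#include <assert.h>

#define M 100

/* OLD: the original loop, with the loop-carried dependence on B. */
static void old_loop(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= M; i++) {
        A[i] = A[i] + B[i];         /* S1 */
        B[i + 1] = C[i] + D[i];     /* S2 */
    }
}

/* NEW: the dependence is pulled out of the loop body, so the
   remaining loop iterations are independent of each other. */
static void new_loop(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= M - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[M + 1] = C[M] + D[M];
}
```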
Dynamic Scheduling
4.2 Overcoming Data Hazards with Dynamic Scheduling

Dynamic Scheduling is when the hardware rearranges the order of instruction execution to reduce stalls.

Advantages:
• Dependencies unknown at compile time can be handled by the hardware.
• Code compiled for one type of pipeline can be run efficiently on another.

Disadvantages:
• The hardware is much more complex.
Dynamic Scheduling
The idea:
HW Schemes: Instruction Parallelism
• Why in HW at run time?
  – Works when real dependences can’t be known at compile time
  – The compiler is simpler
  – Code for one machine runs well on another
• Key Idea: Allow instructions behind a stall to proceed.
• Key Idea: Instructions execute in parallel. There are multiple execution units, so use them.

DIVD  F0,F2,F4
ADDD  F10,F0,F8
SUBD  F12,F8,F14

– Enables out-of-order execution => out-of-order completion
Dynamic Scheduling
The idea:
HW Schemes: Instruction Parallelism
• Out-of-order execution divides the ID stage:
  1. Issue — decode instructions, check for structural hazards
  2. Read operands — wait until no data hazards, then read operands
• Scoreboards allow an instruction to execute whenever 1 & 2 hold, not waiting for prior instructions.
• A scoreboard is a “data structure” that provides the information necessary for all pieces of the processor to work together.
• We will use in-order issue, out-of-order execution, out-of-order commit (also called completion).
• First used in the CDC 6600. Our example is modified here for DLX.
• The CDC had 4 FP units, 5 memory reference units, 7 integer units.
• DLX has 2 FP multipliers, 1 FP adder, 1 FP divider, 1 integer unit.
Dynamic Scheduling
Using A Scoreboard
Scoreboard Implications
• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
  – Queue both the operation and copies of its operands
  – Read registers only during the Read Operands stage
• For WAW, must detect the hazard: stall until the other instruction completes.
• Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units.
• The scoreboard keeps track of dependencies and the state of operations.
• The scoreboard replaces ID, EX, WB with 4 stages.
Dynamic Scheduling
Using A Scoreboard
Four Stages of Scoreboard Control
1. Issue —decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active
instruction has the same destination register (WAW), the
scoreboard issues the instruction to the functional unit and
updates its internal data structure.
If a structural or WAW hazard exists, then the instruction issue
stalls, and no further instructions will issue until these hazards
are cleared.
Dynamic Scheduling
Using A Scoreboard
Four Stages of Scoreboard Control
2. Read operands — wait until no data hazards, then read operands (ID2)

A source operand is available if no earlier issued active instruction is going to write it; equivalently, no currently active functional unit has that register as its destination.

When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
Dynamic Scheduling
Using A Scoreboard
Four Stages of Scoreboard Control
3. Execution — operate on operands (EX)

The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.

4. Write result — finish execution (WB)

Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes the result. If there is a WAR hazard, it stalls the instruction.

Example:
DIVD  F0,F2,F4
ADDD  F10,F0,F8
SUBD  F8,F8,F14

The scoreboard would stall SUBD until ADDD reads its operands.
Using A Scoreboard
Dynamic Scheduling
Three Parts of the Scoreboard
1. Instruction status — which of the 4 steps the instruction is in.
2. Functional unit status — indicates the state of the functional unit (FU); 9 fields for each functional unit:
   Busy — indicates whether the unit is busy or not
   Op — operation to perform in the unit (e.g., + or –)
   Fi — destination register
   Fj, Fk — source-register numbers
   Qj, Qk — functional units producing source registers Fj, Fk
   Rj, Rk — flags indicating when Fj, Fk are ready
3. Register result status — indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register.
Dynamic Scheduling
Using A Scoreboard
Detailed Scoreboard Pipeline Control
Instruction status   Wait until                            Bookkeeping
Issue                Not Busy(FU) and not Result('D')      Busy(FU)←Yes; Op(FU)←op; Fi(FU)←'D';
                                                           Fj(FU)←'S1'; Fk(FU)←'S2';
                                                           Qj←Result('S1'); Qk←Result('S2');
                                                           Rj←not Qj; Rk←not Qk; Result('D')←FU
Read operands        Rj and Rk                             Rj←No; Rk←No
Execution complete   Functional unit done                  (notify the scoreboard)
Write result         ∀f((Fj(f)≠Fi(FU) or Rj(f)=No) &       ∀f(if Qj(f)=FU then Rj(f)←Yes);
                     (Fk(f)≠Fi(FU) or Rk(f)=No))           ∀f(if Qk(f)=FU then Rk(f)←Yes);
                                                           Result(Fi(FU))←0; Busy(FU)←No
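The "wait until" conditions of the first two stages can be written down directly (a minimal C sketch; the data structures are simplified stand-ins for the scoreboard's fields, with -1 meaning "no pending writer"):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified scoreboard fields for one functional unit. */
struct fu {
    bool busy;
    int  fi;          /* destination register */
    bool rj, rk;      /* source operands ready? */
};

/* Issue: the FU must be free (no structural hazard) AND no active
   instruction may have the same destination (no WAW hazard). */
static bool can_issue(const struct fu *u, const int *result, int dest) {
    return !u->busy && result[dest] == -1;
}

/* Read operands: both Rj and Rk must be ready (RAW resolved). */
static bool can_read(const struct fu *u) {
    return u->rj && u->rk;
}
```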
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example
This is the sample code we’ll be working with in the example:
LD    F6,34(R2)
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

What are the hazards in this code?

Latencies (clock cycles):
LD     1
MULTD  10
SUBD   2
DIVD   40
ADDD   2
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op  Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
      F0  F2  F4  F6  F8  F10  F12  ...  F30
FU
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 1
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj  Rk
      Integer  Yes   Load  F6       R2              Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
      F0  F2  F4  F6       F8  F10  F12  ...  F30
FU                Integer

Issue LD #1. The table shows in which cycle each operation occurred.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 2
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj  Rk
      Integer  Yes   Load  F6       R2              Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
      F0  F2  F4  F6       F8  F10  F12  ...  F30
FU                Integer

LD #2 can’t issue since the integer unit is busy.
MULT can’t issue because we require in-order issue.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 3
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj  Rk
      Integer  Yes   Load  F6       R2              Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
      F0  F2  F4  F6       F8  F10  F12  ...  F30
FU                Integer

LD #1 completes execution.
38
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 4
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj  Rk
      Integer  Yes   Load  F6       R2              Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
      F0  F2  F4  F6       F8  F10  F12  ...  F30
FU                Integer

LD #1 writes F6.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 5
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj  Rk
      Integer  Yes   Load  F2       R3              Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
      F0  F2       F4  F6  F8  F10  F12  ...  F30
FU        Integer

Issue LD #2 since the integer unit is now free.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 6
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6
MULTD F0,F2,F4           6
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk  Rj  Rk
      Integer  Yes   Load  F2       R3               	  Yes
      Mult1    Yes   Mult  F0   F2  F4  Integer      No  Yes
      Mult2    No
      Add      No
      Divide   No

Register result status
      F0     F2       F4  F6  F8  F10  F12  ...  F30
FU    Mult1  Integer

Issue MULT.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 7
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7
MULTD F0,F2,F4           6
SUBD  F8,F6,F2           7
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2       R3                         Yes
      Mult1    Yes   Mult  F0   F2  F4  Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6  F2           Integer  Yes  No
      Divide   No

Register result status
      F0     F2       F4  F6  F8   F10  F12  ...  F30
FU    Mult1  Integer          Add

MULT can’t read its operands (F2) because LD #2 hasn’t finished.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 8a
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7
MULTD F0,F2,F4           6
SUBD  F8,F6,F2           7
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2       R3                         Yes
      Mult1    Yes   Mult  F0   F2  F4  Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6  F2           Integer  Yes  No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status
      F0     F2       F4  F6  F8   F10     F12  ...  F30
FU    Mult1  Integer          Add  Divide

DIVD issues. MULT and SUBD are both waiting for F2.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 8b
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6
SUBD  F8,F6,F2           7
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
      Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
      Add      Yes   Sub   F8   F6  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6  F8   F10     F12  ...  F30
FU    Mult1              Add  Divide

LD #2 writes F2.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 9
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9
SUBD  F8,F6,F2           7       9
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
10    Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
2     Add      Yes   Sub   F8   F6  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6  F8   F10     F12  ...  F30
FU    Mult1              Add  Divide

Now MULT and SUBD can both read F2.
How can both instructions do this at the same time??
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 11
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9
SUBD  F8,F6,F2           7       9        11
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
8     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6  F8   F10     F12  ...  F30
FU    Mult1              Add  Divide

ADDD can’t issue because the add unit is busy.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 12
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
7     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
      Add      No
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6  F8  F10     F12  ...  F30
FU    Mult1                  Divide

SUBD finishes. DIVD is waiting for F0.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 13
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2          13

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
6     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6   F8  F10     F12  ...  F30
FU    Mult1          Add      Divide

ADDD issues.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 14
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2          13      14

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
5     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
2     Add      Yes   Add   F6   F8  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6   F8  F10     F12  ...  F30
FU    Mult1          Add      Divide

ADDD reads its operands.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 15
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2          13      14

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
4     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
1     Add      Yes   Add   F6   F8  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6   F8  F10     F12  ...  F30
FU    Mult1          Add      Divide

ADDD is executing.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 16
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2          13      14        16

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
3     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6   F8  F10     F12  ...  F30
FU    Mult1          Add      Divide

ADDD completes execution.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 17
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2          13      14        16

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
2     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6   F8  F10     F12  ...  F30
FU    Mult1          Add      Divide

ADDD can’t write because DIVD has not yet read F6 — a WAR hazard!
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 18
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2          13      14        16

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
1     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6   F8  F10     F12  ...  F30
FU    Mult1          Add      Divide

Nothing happens!!
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 19
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9        19
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2          13      14        16

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
      Integer  No
0     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8  F2             Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status
      F0     F2  F4  F6   F8  F10     F12  ...  F30
FU    Mult1          Add      Divide

MULT completes execution.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 20
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9        19        20
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8
ADDD  F6,F8,F2          13      14        16

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8  F2          Yes  Yes
      Divide   Yes   Div   F10  F0  F6          Yes  Yes

Register result status
      F0  F2  F4  F6   F8  F10     F12  ...  F30
FU                Add      Divide

MULT writes.
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 21
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9        19        20
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8      21
ADDD  F6,F8,F2          13      14        16

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8  F2          Yes  Yes
      Divide   Yes   Div   F10  F0  F6          Yes  Yes

Register result status
      F0  F2  F4  F6   F8  F10     F12  ...  F30
FU                Add      Divide

DIVD reads its operands.
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example Cycle 22
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9        19        20
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8      21
ADDD  F6,F8,F2          13      14        16        22

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
40    Divide   Yes   Div   F10  F0  F6          Yes  Yes

Register result status
      F0  F2  F4  F6  F8  F10     F12  ...  F30
FU                        Divide

Now ADDD can write since the WAR hazard is removed.
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example Cycle 61
Instruction status
Instruction            Issue  Read ops  Exec comp  Write
LD    F6,34(R2)          1       2         3         4
LD    F2,45(R3)          5       6         7         8
MULTD F0,F2,F4           6       9        19        20
SUBD  F8,F6,F2           7       9        11        12
DIVD  F10,F0,F6          8      21        61
ADDD  F6,F8,F2          13      14        16        22

Functional unit status
Time  Name     Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   Yes   Div   F10  F0  F6          Yes  Yes

Register result status
      F0  F2  F4  F6  F8  F10     F12  ...  F30
FU                        Divide

DIVD completes execution.
Dynamic Scheduling Using A Scoreboard
Scoreboard Example, Cycle 62: DIVD writes its result. DONE!!

Instruction status
Instruction          Issue  Read operands  Execution complete  Write result
LD    F6, 34+R2        1         2                 3                4
LD    F2, 45+R3        5         6                 7                8
MULTD F0, F2, F4       6         9                19               20
SUBD  F8, F6, F2       7         9                11               12
DIVD  F10, F0, F6      8        21                61               62
ADDD  F6, F8, F2      13        14                16               22

Functional unit status
Time  Name     Busy  Op   Fi   Fj  Fk  Qj  Qk  Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No

Register result status (clock = 62)
      F0   F2   F4   F6   F8   F10   F12  ...  F30
FU    (all empty: every instruction has written its result)
Dynamic Scheduling Using A Scoreboard
Another Dynamic Algorithm: Tomasulo's Algorithm
• Built for the IBM 360/91, about 3 years after the CDC 6600 (1966)
• Goal: high performance without special compilers
• Differences between the IBM 360 and CDC 6600 ISAs:
  – IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
  – IBM has 4 FP registers vs. 8 in the CDC 6600
• Why study it? It led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, …
Dynamic Scheduling
Using A Scoreboard
Tomasulo's Algorithm vs. the Scoreboard
• Control & buffers are distributed with the functional units (FUs), not centralized in a scoreboard
  – The FU buffers are called “reservation stations”; they hold pending operands
• Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming
  – Avoids WAR and WAW hazards
  – There are more reservation stations than registers, so it can do optimizations compilers can't
• Results go to the FUs from the RSs, not through the registers, over a Common Data Bus that broadcasts results to all FUs
• Loads and stores are treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
Dynamic Scheduling
Using A Scoreboard
Tomasulo Organization
(Block diagram: the FP Op Queue and Load Buffers feed the FP Registers; reservation stations sit in front of the FP Adder and FP Multiplier; results return over the Common Data Bus to the registers, the reservation stations, and the Store Buffers.)
Dynamic Scheduling
Using A Scoreboard
Reservation Station Components
Op—Operation to perform in the unit (e.g., + or –)
Vj, Vk—Values of the source operands
  – Store buffers have a V field: the result to be stored
Qj, Qk—Reservation stations producing the source registers (the values to be written)
  – Note: no ready flags as in the scoreboard; Qj, Qk = 0 => ready
  – Store buffers only have Qi, for the RS producing the result
Busy—Indicates the reservation station or FU is busy
Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Dynamic Scheduling
Using A Scoreboard
Three Stages of Tomasulo's Algorithm
1. Issue—get an instruction from the FP Op Queue.
   If a reservation station is free (no structural hazard), control issues the instruction & sends the operands (renaming the registers).
2. Execute—operate on the operands (EX).
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result—finish execution (WB).
   Write the result on the Common Data Bus to all awaiting units; mark the reservation station available.
• Normal data bus: data + destination (a “go to” bus)
• Common Data Bus: data + source (a “come from” bus)
  – 64 bits of data + 4 bits of functional-unit source address
  – A unit captures the value if the source matches the functional unit it is waiting on
  – The bus does the broadcast
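The three stages above can be sketched in a few lines. This is a hypothetical minimal model, assuming single-cycle functional units; the names (`regs`, `stat`, `rs`) and the two-instruction program are invented for illustration, not taken from the slides:

```python
# Minimal sketch of Tomasulo's issue / execute / write-result stages.
regs = {"F0": 0.0, "F2": 2.0, "F4": 3.0, "F6": 5.0, "F8": 0.0}
stat = {}   # register result status: register -> tag of producing RS
rs = {}     # reservation stations: tag -> entry
program = [("MUL", "F0", "F2", "F4"),   # F0 <- F2 * F4
           ("ADD", "F8", "F0", "F6")]   # F8 <- F0 + F6 (must wait on F0)

# Stage 1 - Issue: rename each source to a value or a producing-RS tag,
# and claim the destination register.
for n, (op, dest, s1, s2) in enumerate(program):
    tag = f"RS{n}"
    entry = {"op": op, "dest": dest, "v": {}, "q": {}}
    for slot, src in (("j", s1), ("k", s2)):
        if src in stat:
            entry["q"][slot] = stat[src]   # operand pending: remember tag
        else:
            entry["v"][slot] = regs[src]   # operand available: copy value
    stat[dest] = tag
    rs[tag] = entry

# Stages 2 & 3 - Execute and Write result: a ready station (no pending
# Q tags) executes; its result is broadcast on the Common Data Bus,
# waking every station waiting on its tag and updating the registers.
ops = {"MUL": lambda a, b: a * b, "ADD": lambda a, b: a + b}
while rs:
    tag, e = next((t, e) for t, e in rs.items() if not e["q"])
    result = ops[e["op"]](e["v"]["j"], e["v"]["k"])
    del rs[tag]
    if stat.get(e["dest"]) == tag:      # still the newest writer of dest
        regs[e["dest"]] = result
        del stat[e["dest"]]
    for waiter in rs.values():          # CDB broadcast: capture by tag
        for slot, q in list(waiter["q"].items()):
            if q == tag:
                waiter["v"][slot] = result
                del waiter["q"][slot]

print(regs["F0"], regs["F8"])   # 6.0 11.0
```

Note how the ADD never names F0 after issue: it holds either F0's value or the tag RS0, which is exactly what removes the WAR and WAW hazards.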
Dynamic Scheduling Using A Scoreboard
Tomasulo Example, Cycle 0

Instruction status
Instruction          Issue  Execution complete  Write result
LD    F6, 34+R2
LD    F2, 45+R3
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Load buffers: Load1, Load2, Load3 - all Busy = No, no addresses yet.

Reservation stations
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (clock = 0)
      F0  F2  F4  F6  F8  F10  F12  ...  F30
FU    (all empty: nothing has issued yet)
Dynamic Scheduling
Using A Scoreboard
Review: Tomasulo
• Prevents the register file from becoming a bottleneck
• Avoids the WAR and WAW hazards of the scoreboard
• Allows loop unrolling in hardware
• Not limited to basic blocks (given branch prediction)
• Lasting contributions:
  – Dynamic scheduling
  – Register renaming
  – Load/store disambiguation
• 360/91 descendants: PowerPC 604 and 620; MIPS R10000; HP PA-8000; Intel Pentium Pro
Dynamic Hardware
Prediction
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go - will the branch be taken or not?
The hardware can look for clues in the instructions, or it can use past history - we will discuss both of these approaches.
Dynamic Hardware
Prediction
Basic Branch Prediction:
Branch Prediction Buffers
Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: the lower bits of the PC address index a table of 1-bit values
  – Each entry says whether or not the branch was taken last time
• Problem: in a loop, a 1-bit BHT causes two mispredictions:
  – At the end of the loop, when it exits instead of looping as before
  – On the first iteration the next time through the loop, when it predicts exit instead of looping
(Figure: low-order PC address bits index a table of prediction bits, entries 0-1023.)
Dynamic Hardware
Prediction
Basic Branch Prediction:
Branch Prediction Buffers
Dynamic Branch Prediction
• Solution: a 2-bit scheme that changes the prediction only after two successive mispredictions (Figure 4.13, p. 264). Each prediction (taken / not taken) has a strong and a weak state; a single misprediction moves from the strong state to the weak one, and only a second misprediction actually flips the prediction.
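The 2-bit scheme above can be sketched as a table of saturating counters. This is a minimal illustrative model (the class name and table size are assumptions, not from the slides): counter values 0-1 predict not taken, 2-3 predict taken.

```python
# A 2-bit branch-history table built from saturating counters.
class TwoBitPredictor:
    def __init__(self, entries=1024):
        # start every counter at "weakly not taken" (an assumption)
        self.table = [1] * entries

    def predict(self, pc):
        # low-order PC bits index the table; >= 2 means "predict taken"
        return self.table[pc % len(self.table)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.table)
        # saturating increment/decrement: one surprise only moves to the
        # weak state; a second surprise is needed to flip the prediction
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = TwoBitPredictor(16)
p.update(0, True)
p.update(0, True)      # counter saturates at 3: strongly taken
p.update(0, False)     # loop exit: drops to 2, prediction stays "taken"
print(p.predict(0))    # True
```

This is exactly why the 2-bit table fixes the loop problem of the 1-bit table: the single not-taken outcome at loop exit no longer flips the prediction for the next execution of the loop.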
Dynamic Hardware
Prediction
Basic Branch Prediction:
Branch Prediction Buffers
BHT Accuracy
• Mispredictions happen because either:
  – The guess was wrong for that branch, or
  – The table was indexed with the history of the wrong branch (aliasing)
• With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• 4096 entries are about as good as an infinite table, but 4096 entries are a lot of hardware
Dynamic Hardware
Prediction
Basic Branch Prediction:
Branch Prediction Buffers
Correlating Branches
Idea: whether recently executed branches were taken or not taken is correlated with the behavior of the next branch (as well as with that branch's own history).
– The behavior of the recent branches selects between, say, four 2-bit predictors for the next branch, and only the selected prediction is updated.
(Figure: the branch address indexes a row of four 2-bit per-branch predictors; 2 bits of global branch history choose which of the four supplies the prediction.)
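The selection mechanism above can be sketched directly. This is a rough (2,2) model; the class name and sizes are illustrative assumptions, not from the slides:

```python
# A (2,2) correlating predictor: 2 bits of global history select one of
# four 2-bit saturating counters per branch-address entry.
class CorrelatingPredictor:
    def __init__(self, entries=64):
        self.table = [[1] * 4 for _ in range(entries)]  # 4 counters/entry
        self.ghist = 0                                  # 2-bit global history

    def predict(self, pc):
        # global history picks which of the four counters supplies the bit
        return self.table[pc % len(self.table)][self.ghist] >= 2

    def update(self, pc, taken):
        ctrs = self.table[pc % len(self.table)]
        h = self.ghist
        ctrs[h] = min(3, ctrs[h] + 1) if taken else max(0, ctrs[h] - 1)
        # shift the outcome into the 2-bit global history
        self.ghist = ((self.ghist << 1) | int(taken)) & 0b11

# An alternating taken / not-taken branch defeats a lone 2-bit counter,
# but here each history context learns its own answer:
p = CorrelatingPredictor()
for taken in [True, False] * 6:
    p.update(7, taken)
print(p.predict(7))   # True: history "taken, not taken" implies taken next
```

After warm-up, the counter selected by history `10` has learned "taken" and the one selected by `01` has learned "not taken", so the alternating pattern is predicted correctly from then on.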
Basic Branch Prediction:
Branch Prediction Buffers
Dynamic Hardware
Prediction
Frequency of Mispredictions
Accuracy of Different Schemes (Figure 4.21, p. 272)
Three schemes are compared across the benchmarks (nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, li):
  – 4,096 entries, 2 bits per entry
  – Unlimited entries, 2 bits per entry
  – 1,024 entries, (2,2): 2 bits of history, 2 bits per entry
Misprediction rates range from 0-1% on the FP codes (nasa7, matrix300, tomcatv) up to 18% (eqntott). The 1,024-entry (2,2) correlating predictor generally matches or beats even the unlimited simple 2-bit table.
Dynamic Hardware
Prediction
Basic Branch Prediction: Branch Target Buffers
Branch Target Buffer
• Branch Target Buffer (BTB): use the address of the branch as an index to get the prediction AND the branch target address (if taken)
  – Note: we must now check that the entry matches this branch, since we can't use the wrong branch's target address (Figure 4.22, p. 273)
  – Each entry holds the predicted PC and the prediction: taken or not taken
• Return-instruction addresses are predicted with a stack
Dynamic Hardware
Prediction
Example

Instruction in buffer  Prediction  Actual branch  Penalty cycles
Yes                    Taken       Taken          0
Yes                    Taken       Not taken      2
No                     -           Taken          2

Example on page 274.
Determine the total branch penalty for a BTB using the above penalties. Assume also the following:
• Prediction accuracy of 90%
• Hit rate in the buffer of 90%
• 60% taken-branch frequency

Branch penalty = (percent buffer hit rate X percent incorrect predictions X 2)
               + (1 - percent buffer hit rate) X percent taken branches X 2
Branch penalty = (90% X 10% X 2) + (10% X 60% X 2)
Branch penalty = 0.18 + 0.12 = 0.30 clock cycles
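The page-274 arithmetic generalizes to a one-line expected-value formula; the function name and parameters below are illustrative, not from the slides:

```python
# Expected BTB branch penalty, in clock cycles per branch, under the
# slide's model: mispredicted buffer hits and taken branches that miss
# in the buffer each cost `miss_cycles`.
def btb_penalty(hit_rate, accuracy, taken_freq, miss_cycles=2):
    wrong_hit = hit_rate * (1 - accuracy) * miss_cycles        # hit, wrong
    buffer_miss = (1 - hit_rate) * taken_freq * miss_cycles    # taken, miss
    return wrong_hit + buffer_miss

print(round(btb_penalty(hit_rate=0.90, accuracy=0.90, taken_freq=0.60), 2))
# 0.3  (= 0.18 + 0.12, matching the worked example)
```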
Multiple Issue
Multiple Issue is the ability of the
processor to start more than one
instruction in a given cycle.
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

Flavor I:
Superscalar processors issue a varying number of instructions per clock (1 to 8) - they can be either statically scheduled (by the compiler) or dynamically scheduled (by the hardware, as in Tomasulo).
Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP PA-8000
Multiple Issue
Issuing Multiple Instructions/Cycle
Flavor II:
VLIW - Very Long Instruction Word - issues a fixed number of instructions, formatted either as one very large instruction or as a fixed packet of smaller instructions.
A fixed number of instructions (4-16) is scheduled by the compiler, which places the operations into wide templates.
– Joint HP/Intel agreement in 1999/2000
– Intel Architecture-64 (IA-64): 64-bit addresses
– Style: “Explicitly Parallel Instruction Computing (EPIC)”
Multiple Issue
Issuing Multiple Instructions/Cycle
Flavor II - continued:
• 3 instructions in 128-bit “groups”; a template field determines whether the instructions are dependent or independent
  – Smaller code size than old VLIW, larger than x86/RISC
  – Groups can be linked to show independence of more than 3 instructions
• 128 integer registers + 128 floating-point registers
  – Not separate register files per functional unit as in old VLIW
• Hardware checks dependencies (interlocks => binary compatibility over time)
• Predicated execution (selects 1 of 64 1-bit flags) => 40% fewer mispredictions?
• IA-64: the name of the instruction set architecture; EPIC is the style
• Merced: the name of the first implementation (1999/2000?)
Multiple Issue
A SuperScalar Version of DLX
Issuing Multiple Instructions/Cycle
– Fetch 64 bits/clock cycle; integer instruction on the left, FP on the right
– Can only issue the 2nd instruction if the 1st instruction issues
– More ports are needed on the FP registers to do an FP load & an FP op as a pair

In our DLX example, we can handle 2 instructions/cycle: one floating point, plus anything else.

Type              Pipe stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB

• A 1-cycle load delay now delays 3 instructions in the superscalar:
  – the instruction in the right half of the pair can't use the result, nor can the instructions in the next issue slot
Multiple Issue
A SuperScalar Version of DLX
Unrolled Loop That Minimizes Stalls for the Scalar Pipeline

 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16    ; 8-32 = -24

Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles.
14 clock cycles, or 3.5 per iteration.
Multiple Issue
A SuperScalar Version of DLX
Loop Unrolling in the Superscalar

Integer instruction     FP instruction      Clock cycle
Loop: LD F0,0(R1)                           1
      LD F6,-8(R1)                          2
      LD F10,-16(R1)    ADDD F4,F0,F2       3
      LD F14,-24(R1)    ADDD F8,F6,F2       4
      LD F18,-32(R1)    ADDD F12,F10,F2     5
      SD 0(R1),F4       ADDD F16,F14,F2     6
      SD -8(R1),F8      ADDD F20,F18,F2     7
      SD -16(R1),F12                        8
      SD -24(R1),F16                        9
      SUBI R1,R1,#40                        10
      BNEZ R1,LOOP                          11
      SD 8(R1),F20                          12

• Unrolled 5 times to avoid delays (+1 versus the scalar case, because of the superscalar pairing)
• 12 clocks, or 2.4 clocks per iteration
Multiple Issue
Multiple Instruction Issue &
Dynamic Scheduling
Dynamic Scheduling in the Superscalar
Code compiled for the scalar version will run poorly on the superscalar; we may want the code to vary depending on how wide the superscalar is.
A simple approach: separate Tomasulo control, with separate reservation stations, for the integer FUs/registers and for the FP FUs/registers.
Multiple Issue
Multiple Instruction Issue &
Dynamic Scheduling
Dynamic Scheduling in the Superscalar
• How do we issue two instructions per cycle and keep in-order instruction issue for Tomasulo?
  – Issue at 2X the clock rate, so that issue remains in order
  – Only FP loads might cause a dependency between integer and FP issue:
    • Replace the load reservation stations with a load queue; operands must be read in the order they are fetched
    • A load checks addresses in the store queue to avoid RAW violations
    • A store checks addresses in the load queue to avoid WAR and WAW violations
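The load-queue check above can be pictured with a toy model (all names invented for illustration): a load searches the pending-store queue for a matching address, so it never reads a value that an earlier, not-yet-committed store is about to overwrite.

```python
# Toy memory disambiguation: forward from the store queue on a RAW match.
store_queue = [(100, 1.5), (100, 3.25)]   # (address, value), oldest first

def disambiguated_load(memory, addr):
    # check the youngest matching pending store first and forward it
    for a, v in reversed(store_queue):
        if a == addr:
            return v
    return memory[addr]          # no pending store: read memory itself

memory = {100: 0.0, 104: 2.0}
print(disambiguated_load(memory, 100))   # 3.25 (forwarded from store queue)
print(disambiguated_load(memory, 104))   # 2.0  (read from memory)
```

Real hardware does the symmetric check too: a store scans the load queue so that a later load which already executed with the old value can be detected.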
Multiple Issue
Multiple Instruction Issue &
Dynamic Scheduling
Performance of the Dynamic Superscalar

Iteration  Instruction     Issues  Executes  Writes result
                           (clock-cycle number)
1          LD F0,0(R1)       1        2           4
1          ADDD F4,F0,F2     1        5           8
1          SD 0(R1),F4       2        9
1          SUBI R1,R1,#8     3        4           5
1          BNEZ R1,LOOP      4        5
2          LD F0,0(R1)       5        6           8
2          ADDD F4,F0,F2     5        9          12
2          SD 0(R1),F4       6       13
2          SUBI R1,R1,#8     7        8           9
2          BNEZ R1,LOOP      8        9

4 clocks per iteration; branches and decrements still take 1 clock cycle.
VLIW
Multiple Issue
Loop Unrolling in VLIW

Memory ref 1     Memory ref 2     FP op 1          FP op 2          Int. op/branch   Clock
LD F0,0(R1)      LD F6,-8(R1)                                                        1
LD F10,-16(R1)   LD F14,-24(R1)                                                      2
LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2    ADDD F8,F6,F2                     3
LD F26,-48(R1)                    ADDD F12,F10,F2  ADDD F16,F14,F2                   4
                                  ADDD F20,F18,F2  ADDD F24,F22,F2                   5
SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                    6
SD -16(R1),F12   SD -24(R1),F16                                     SUBI R1,R1,#48   7
SD -32(R1),F20   SD -40(R1),F24                                     BNEZ R1,LOOP     8
SD -0(R1),F28                                                                        9

• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration
• Need more registers to use VLIW effectively
Multiple Issue
Limitations With Multiple Issue
Limits to Multi-Issue Machines
• Inherent limitations of ILP:
  – 1 branch in 5 instructions => how do we keep a 5-way VLIW busy?
  – Latencies of units => many operations must be scheduled
  – We need about (pipeline depth x number of functional units) independent operations to keep the machine busy
• Difficulties in building the hardware:
  – Duplicate functional units to get parallel execution
  – More ports on the register file (the VLIW example needs 6 read and 3 write ports for the integer registers & 6 read and 4 write ports for the FP registers)
  – More ports to memory
  – Superscalar decode complexity and its impact on clock rate and pipeline depth
Multiple Issue
Limitations With Multiple Issue
Limits to Multi-Issue Machines
• Limitations specific to either the superscalar (SS) or VLIW implementation:
  – Decode/issue complexity in SS
  – VLIW code size: unrolled loops + wasted fields in the VLIW word
  – VLIW lock step => 1 hazard stalls all the instructions
  – VLIW & binary compatibility
Multiple Issue
Limitations With Multiple Issue
Multiple Issue Challenges
• While an integer/FP split is simple for the hardware, we get a CPI of 0.5 only for programs with:
  – Exactly 50% FP operations
  – No hazards
• If more instructions issue at the same time, decode and issue get harder
  – Even a 2-scalar must examine 2 opcodes and 6 register specifiers, & decide whether 1 or 2 instructions can issue
• VLIW: trade instruction space for simple decoding
  – The long instruction word has room for many operations
  – By definition, all the operations the compiler puts in the long instruction word are independent => they execute in parallel
  – E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    • 16 to 24 bits per field => 7*16 = 112 bits to 7*24 = 168 bits wide
  – Needs a compiling technique that schedules across several branches
Compiler Support For ILP
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

How can compilers be smart?
1. Produce good scheduling of code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.

Compilers must be REALLY smart to figure out aliases -- pointers in C are a real problem.

These techniques lead to:
  Symbolic loop unrolling
  Critical path scheduling
Compiler Support For ILP
Symbolic Loop Unrolling
Software Pipelining
• Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations
• Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in software)
(Figure: iterations 0-4 overlap in time; each software-pipelined iteration takes one stage from several of the original iterations.)
Compiler Support For ILP
Symbolic Loop Unrolling
SW Pipelining Example

Before: Unrolled 3 times
 1 LD   F0,0(R1)
 2 ADDD F4,F0,F2
 3 SD   0(R1),F4
 4 LD   F6,-8(R1)
 5 ADDD F8,F6,F2
 6 SD   -8(R1),F8
 7 LD   F10,-16(R1)
 8 ADDD F12,F10,F2
 9 SD   -16(R1),F12
10 SUBI R1,R1,#24
11 BNEZ R1,LOOP

After: Software Pipelined
   LD   F0,0(R1)       ; start-up code
   ADDD F4,F0,F2
   LD   F0,-8(R1)
 1 SD   0(R1),F4       ; stores M[i]
 2 ADDD F4,F0,F2       ; adds to M[i-1]
 3 LD   F0,-16(R1)     ; loads M[i-2]
 4 SUBI R1,R1,#8
 5 BNEZ R1,LOOP
   SD   0(R1),F4       ; finish-up code
   ADDD F4,F0,F2
   SD   -8(R1),F4

(The accompanying pipeline diagram shows why the register reuse is safe: SD reads F4 early in the pipeline, before ADDD writes F4 in WB, and ADDD reads F0 before LD writes F0.)
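The transformation above can be rendered in Python to check that it is behavior-preserving. The function names and the minimum-length guard are invented for illustration; the kernel mirrors the slide's three-instruction body (store M[i], add for M[i+1], load M[i+2]):

```python
# Software pipelining of "M[i] = M[i] + c", with peeled start-up and
# finish-up code, versus the plain loop.
def add_scalar(mem, c):
    return [x + c for x in mem]            # the original loop

def add_scalar_pipelined(mem, c):
    n = len(mem)
    if n < 3:                              # too short to pipeline: fall back
        return add_scalar(mem, c)
    out = list(mem)
    f0 = out[0]                            # start-up: LD
    f4 = f0 + c                            #           ADDD
    f0 = out[1]                            #           LD
    i = 0
    while i < n - 2:                       # the software-pipelined kernel
        out[i] = f4                        # SD:   store result for M[i]
        f4 = f0 + c                        # ADDD: compute for M[i+1]
        f0 = out[i + 2]                    # LD:   load M[i+2]
        i += 1
    out[n - 2] = f4                        # finish-up: SD
    out[n - 1] = f0 + c                    #            ADDD + SD
    return out

print(add_scalar_pipelined([1, 2, 3, 4, 5], 10))   # [11, 12, 13, 14, 15]
```

Inside the kernel, each trip works on three different logical iterations at once, which is exactly what lets the hardware overlap the load, add, and store latencies.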
Compiler Support For ILP
Symbolic Loop Unrolling
SW Pipelining Example
Symbolic loop unrolling (software pipelining) vs. loop unrolling:
– Less code space
– The overhead (start-up and finish-up code) is paid only once, vs. on each pass through the unrolled loop
For example, 100 iterations unrolled 4 times = 25 passes through the unrolled loop body, each paying the loop overhead.
Compiler Support For ILP
Critical Path Scheduling
Trace Scheduling
• Finds parallelism across IF branches, not just LOOP branches
• Two steps:
  – Trace selection
    • Find the likely sequence of basic blocks (the trace) of (statically predicted or profile-predicted) long sequences of straight-line code
  – Trace compaction
    • Squeeze the trace into a few VLIW instructions
    • Need bookkeeping code in case the prediction is wrong
• The compiler undoes a bad guess (discards the values in registers)
• Subtle compiler bugs mean a wrong answer, not just poorer performance; there are no hardware interlocks
Hardware Support For
Parallelism
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

Software support of ILP is best when the code is predictable at compile time. But what if there's no predictability?
Here we'll talk about hardware techniques. These include:
• Conditional or predicated instructions
• Hardware speculation
Hardware Support For
Parallelism
Nullified Instructions
Tell the Hardware To Ignore An Instruction
• Avoid branch prediction by turning branches into conditionally executed instructions:
      IF (x) then A = B op C else NOP
  – If the condition is false, the instruction neither stores its result nor causes an exception
  – The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move. PA-RISC can annul any following instruction.
  – IA-64: any of 64 1-bit condition fields can be selected, so any instruction can be conditionally executed
• Drawbacks of conditional instructions:
  – They still take a clock, even if “annulled”
  – They stall if the condition is evaluated late
  – Complex conditions reduce effectiveness, since the condition becomes known late in the pipeline
This can be a major win because no time is lost taking a branch!!
Hardware Support For
Parallelism
Nullified Instructions
Tell the Hardware To Ignore An Instruction
Suppose we have the code:
    if ( VarA == 0 )
        VarS = VarT;

Previous method (branch):
        LD    R1, VarA
        BNEZ  R1, Label
        LD    R2, VarT
        SD    VarS, R2
Label:

Nullified method (compare, and nullify the next instruction if not zero):
        LD     R1, VarA
        LD     R2, VarT
        CMPNNZ R1, #0
        SD     VarS, R2
Label:

Nullified method (compare, and move if zero):
        LD    R1, VarA
        LD    R2, VarT
        CMOVZ VarS, R2, R1
Hardware Support For
Parallelism
Compiler Speculation
Increasing Parallelism
The idea is to move an instruction across a branch so as to increase the size of a basic block and thus increase parallelism.
The primary difficulty is avoiding exceptions. For example,
    if ( a != 0 ) c = b/a;
may have a divide-by-zero error in some cases.
Methods for increasing speculation include:
1. Using a set of status bits (poison bits) associated with the registers. They signal that an instruction's result is invalid until some later time.
2. Not writing the result of an instruction until it's certain that the instruction is no longer speculative.
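The poison-bit idea can be modeled in a few lines. All the names here are invented for illustration: a faulting speculative load sets the poison bit instead of trapping, and the trap fires only if the value is actually used.

```python
# Toy model of poison bits for speculative loads.
class Reg:
    def __init__(self):
        self.value, self.poison = 0, False

def speculative_load(reg, memory, addr):
    if addr in memory:
        reg.value, reg.poison = memory[addr], False
    else:
        reg.poison = True        # fault deferred: just set the poison bit

def use(reg):
    if reg.poison:               # the exception is raised only on use
        raise RuntimeError("deferred exception: poisoned register used")
    return reg.value

memory = {0: 5}
r14 = Reg()
speculative_load(r14, memory, 0)
print(use(r14))                    # 5: the speculation was safe
speculative_load(r14, memory, 99)  # would fault, but only poisons r14
```

If the branch later shows the speculative path was never needed, the poisoned register is simply overwritten and no exception is ever seen, which is the whole point of deferring it.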
Hardware Support For
Parallelism
Increasing Parallelism
Example on page 305. Code for

    if ( A == 0 )
        A = B;
    else
        A = A + 4;

Assume A is at 0(R3) and B is at 0(R2). Note that only ONE side needs to take a branch!!

Original code:
        LW   R1, 0(R3)     ; load A
        BNEZ R1, L1        ; test A
        LW   R1, 0(R2)     ; if clause (A = B)
        J    L2            ; skip else
L1:     ADDI R1, R1, #4    ; else clause
L2:     SW   0(R3), R1     ; store A

Speculated code:
        LW   R1, 0(R3)     ; load A
        LW   R14, 0(R2)    ; speculative load of B
        BEQZ R1, L3        ; other branch of the if
        ADDI R14, R1, #4   ; else clause
L3:     SW   0(R3), R14    ; non-speculative store
Hardware Support For
Parallelism
Compiler Speculation
Poison Bits
In the example on the last page, if the LW* produces an exception, a poison bit is set on that register. Then, if a later instruction tries to use the register, an exception is raised at that point.

Speculated code:
        LW   R1, 0(R3)     ; load A
        LW*  R14, 0(R2)    ; speculative load of B
        BEQZ R1, L3        ; other branch of the if
        ADDI R14, R1, #4   ; else clause
L3:     SW   0(R3), R14    ; non-speculative store
Hardware Support For
Parallelism
Hardware Speculation
HW Support for More ILP
• Need a hardware buffer for the results of uncommitted instructions: the reorder buffer
  – The reorder buffer can be an operand source
  – Once an instruction commits, its result is found in the register file
  – 3 fields: instruction type, destination, value
  – Use the reorder-buffer number instead of the reservation-station number
  – Discard instructions on mis-predicted branches or on exceptions
(Figure 4.34, page 311: the FP Op Queue feeds reservation stations in front of the FP adders; results pass through the reorder buffer on their way to the FP registers.)
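The commit discipline above can be pictured with a toy reorder buffer (the entries and flags are invented for illustration): results wait in program order, commit updates the architectural registers, and everything from a mis-predicted branch onward is discarded.

```python
# Toy reorder buffer: in-order commit, flush on a mis-predicted path.
rob = [
    ("F0", 6.0,  True),    # (destination, value, on the correct path?)
    ("F8", 11.0, True),
    ("F2", 9.9,  False),   # fetched down a mis-predicted branch
    ("F4", 1.0,  False),   # younger still: also discarded
]
arch_regs = {}
for dest, value, on_correct_path in rob:
    if not on_correct_path:
        break                    # flush this entry and all younger ones
    arch_regs[dest] = value      # in-order commit to architectural state

print(arch_regs)   # {'F0': 6.0, 'F8': 11.0}
```

Because nothing reaches `arch_regs` until commit, the wrong-path results (F2, F4) leave no architectural trace, which is what makes this style of speculation safe.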
Hardware Support For
Parallelism
Hardware Speculation
HW support for More ILP
How is this used in practice?
Rather than predicting the direction of a branch, execute the instructions on both sides!!
We know the target of a branch early on, long before we know if it will be taken or not.
So begin fetching/executing at that new target PC, but also continue fetching/executing as if the branch were NOT taken.
Studies of ILP
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

• Conflicting studies of the amount of improvement available
  – Benchmarks (vectorized FP Fortran vs. integer C programs)
  – Hardware sophistication
  – Compiler sophistication
• How much ILP is available using existing mechanisms with increasing hardware budgets?
• Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?
Studies of ILP
Limits to ILP
Initial hardware model here; MIPS compilers.
Assumptions for the ideal/perfect machine to start:
1. Register renaming - infinite virtual registers, and all WAW & WAR hazards are avoided
2. Branch prediction - perfect; no mispredictions
3. Jump prediction - all jumps perfectly predicted => a machine with perfect speculation & an unbounded buffer of available instructions
4. Memory-address alias analysis - addresses are known, & a store can be moved before a load provided the addresses are not equal
Also: 1-cycle latency for all instructions, and an unlimited number of instructions issued per clock cycle.
Studies of ILP
Upper Limit to ILP: Ideal Machine
This is the amount of parallelism when there are no branch mispredictions and we're limited only by data dependencies. (Figure 4.38, page 319)

Instruction issues per cycle (IPC) - instructions that could theoretically be issued per cycle:
  gcc        54.8
  espresso   62.6
  li         17.9
  fpppp      75.2
  doducd    118.7
  tomcatv   150.1
Integer programs: 18 - 60; FP programs: 75 - 150.
Studies of ILP
Impact of Realistic Branch
Prediction
What parallelism do we get when we don't allow perfect branch prediction, as in the last figure, but assume some realistic model? Possibilities include:
1. Perfect - all branches are perfectly predicted (the last slide)
2. Selective history predictor - a complicated but doable mechanism for choosing between predictors
3. Standard 2-bit history predictor with 512 2-bit entries
4. Static prediction based on the past (profiled) history of the program
5. None - parallelism is limited to the basic block
Studies of ILP
Bonus!!
Selective History Predictor
(Figure: a tournament scheme. An 8K x 2-bit selector table, indexed by the branch address, chooses between two predictors: selector values 11/10 choose the non-correlator, 01/00 choose the correlator. The non-correlating predictor is 8096 x 2 bits, indexed by the branch address alone. The correlating predictor is 2048 x 4 x 2 bits: 2 bits of global taken/not-taken history pick one of four 2-bit counters, where counter values 11/10 mean taken and 01/00 mean not taken.)
Studies of ILP
Impact of Realistic Branch Prediction
Limiting the type of branch prediction. (Figure 4.42, page 325)

Instruction issues per cycle (IPC) for each benchmark (gcc, espresso, li, fpppp, doducd, tomcatv) under five schemes: perfect prediction, the selective history predictor, a standard 2-bit predictor (512 entries), static profile-based prediction, and no prediction. With realistic predictors the FP programs sustain roughly 15 - 45 issues per cycle and the integer programs roughly 6 - 12; with no prediction, all programs fall to single digits, since parallelism is then limited to a basic block.
Studies of ILP
More Realistic HW: Register Impact
Effect of limiting the number of renaming registers. (Figure 4.44, page 328)

Instruction issues per cycle (IPC) for each benchmark (gcc, espresso, li, fpppp, doducd, tomcatv) as the number of renaming registers is reduced: infinite, 256, 128, 64, 32, none. The FP programs range from about 11 to 45 issues per cycle and the integer programs from about 5 to 15; IPC degrades gradually down to 64 registers but drops sharply with only 32 renaming registers or none.
Studies of ILP
More Realistic HW: Alias Impact
What happens when there may be conflicts due to memory aliasing? (Figure 4.46, page 330)

Instruction issues per cycle (IPC) for each benchmark (gcc, espresso, li, fpppp, doducd, tomcatv) under four alias-analysis models: perfect, global/stack perfect (heap references conflict), inspection (as an assembler could do), and none. The FP programs range from about 4 to 45 issues per cycle (Fortran, no heap) and the integer programs from about 4 to 9; without any alias analysis, the parallelism collapses to a handful of issues per cycle for all programs.
Summary
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP