Tomasulo, ILP, Branch prediction

ENGS 116 Lecture 8

Instruction Level Parallelism and Tomasulo’s approach

Vincent H. Berk
October 7, 2005

Reading for today: chapter A.8
Reading for Monday: chapter 3.2 – 3.6
Homework #2: due Friday 14th: 2.8, A.2, A.13, 3.6a&b, 3.10, 4.5, 4.8 (4.13 optional)

Instruction Level Parallelism

• Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
• Reduce stalls, reduce CPI
• Reduce CPI, increase IPC
• Instruction-level parallelism (ILP) seeks to reduce stalls
• Loop-level parallelism is easiest to see:

    for (i=1; i<100; i=i+1) {
        A[i] = B[i] + C[i];
        D[i] = E[i] + F[i];
    }
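The CPI relation above can be checked with a small calculation. This is only a sketch: the stall rates are made-up numbers chosen for illustration, and the function name `pipeline_cpi` is hypothetical.

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Pipeline CPI = ideal CPI + stall cycles per instruction from each source."""
    return ideal_cpi + structural + data_hazard + control


# Hypothetical stall rates in cycles per instruction (not measured data).
before = pipeline_cpi(1.0, structural=0.05, data_hazard=0.30, control=0.15)
# Assume better scheduling removes two thirds of the data hazard stalls.
after = pipeline_cpi(1.0, structural=0.05, data_hazard=0.10, control=0.15)

print(f"CPI {before:.2f} -> {after:.2f}")
print(f"IPC {1 / before:.2f} -> {1 / after:.2f}")
```

Fewer stall cycles lower the CPI, and since IPC is the reciprocal of CPI, IPC rises.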

Instruction Level Parallelism

• ILP in SW (static) or HW (dynamic)
• HW-intensive ILP dominates desktop and server markets
• SW (compiler-intensive) approaches are more likely seen in embedded systems

Dependences

• Two instructions are parallel if they can execute simultaneously in a pipeline without causing any stalls (assuming no structural hazards) and can be reordered
• Two instructions that are dependent are not parallel and cannot be reordered
• Types of dependences
  – Data dependences
  – Name dependences
  – Control dependences

Dependences

• Dependences are properties of programs
• Hazards are properties of the pipeline organization
• Dependence indicates the potential for a hazard
• The compiler is concerned about dependences in the program; whether or not a HW hazard occurs depends on the given pipeline

Review of Hazards

Consider instructions i and j, where i occurs before j.

RAW (read after write): j tries to read a source before i writes it, so j gets the old value.

WAW (write after write): j tries to write an operand before it is written by i (only possible in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled).

WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value (only possible when some instructions can write results early in the pipeline and other instructions can read sources late in the pipeline).

Data Dependences

• (True) data dependences (RAW if a hazard for HW)
  – Instruction i produces a result used by instruction j, or
  – Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
• Easy to determine for registers (fixed names)
• Hard for memory:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
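Why the memory case is hard can be seen by computing effective addresses: whether 100(R4) and 20(R6) refer to the same location depends on register contents that are only known at run time. A minimal sketch, with hypothetical register values and a hypothetical helper `effective_address`:

```python
def effective_address(offset, base, regs):
    """Address of offset(base), e.g. 100(R4) -> 100 + regs['R4']."""
    return offset + regs[base]


# Hypothetical register contents; real values are only known at run time.
aliasing = {"R4": 1000, "R6": 1080}
distinct = {"R4": 1000, "R6": 2000}

for regs in (aliasing, distinct):
    same = effective_address(100, "R4", regs) == effective_address(20, "R6", regs)
    print(regs, "-> 100(R4) and 20(R6) overlap:", same)

# Across loop iterations the same text 20(R6) can name different locations
# if R6 is updated in between (the step of 8 bytes is an assumption).
iter1, iter2 = {"R6": 1080}, {"R6": 1088}
same_across_iterations = (
    effective_address(20, "R6", iter1) == effective_address(20, "R6", iter2)
)
print("20(R6) same in both iterations:", same_across_iterations)
```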

Name Dependences

• Another kind of dependence called name dependence: two instructions use the same name but don’t exchange data
• Antidependence (WAR if a hazard for HW)
  – Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first
• Output dependence (WAW if a hazard for HW)
  – Instruction i and instruction j write the same register or memory location; ordering between the instructions must be preserved
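For registers, the data, anti and output dependences defined so far reduce to comparing names. The sketch below is a simplified pairwise check: the `Instr` tuple and the function `dependences` are hypothetical, only register operands are considered, and redefinitions by instructions between i and j are ignored.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def dependences(i, j):
    """Dependences of j on i, where i occurs before j in program order."""
    found = []
    if i.dest in j.srcs:
        found.append("data (RAW)")    # j reads what i writes
    if j.dest in i.srcs:
        found.append("anti (WAR)")    # j overwrites what i reads
    if i.dest == j.dest:
        found.append("output (WAW)")  # both write the same name
    return found


div = Instr("DIV.D", "F0", ("F2", "F4"))
add = Instr("ADD.D", "F6", ("F0", "F8"))
sub = Instr("SUB.D", "F8", ("F10", "F14"))
mul = Instr("MUL.D", "F6", ("F10", "F8"))

print(dependences(div, add))
print(dependences(add, sub))
print(dependences(add, mul))
```

Only the first pair exchanges data; the other two conflict on a name alone, which is why renaming can remove them.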

Name Dependences

• Hard for memory accesses
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
• Example of renaming:

    Before renaming         After renaming
    DIV.D  F0,F2,F4         DIV.D  F0,F2,F4
    ADD.D  F6,F0,F8         ADD.D  S,F0,F8
    S.D    F6,0(R1)         S.D    S,0(R1)
    SUB.D  F8,F10,F14       SUB.D  T,F10,F14
    MUL.D  F6,F10,F8        MUL.D  F6,F10,T
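Renaming can be mechanized with a table that maps each architectural register to its most recent new name. The sketch below is one simple way to do this, not the method used on the slide: it gives every destination a fresh name (`T0`, `T1`, ...), whereas the slide renames only the two conflicting registers to S and T. The tuple layout and the function `rename` are hypothetical.

```python
from itertools import count


def rename(program):
    """program: list of (op, dest, srcs); dest is None for stores."""
    fresh = (f"T{n}" for n in count())
    current = {}  # architectural name -> latest renamed name
    renamed = []
    for op, dest, srcs in program:
        # Sources read the latest name of each register (preserves RAW).
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)  # a fresh name removes WAR and WAW
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, no two instructions write the same name, and SUB.D no longer writes a register that ADD.D reads, so only the true data dependences remain.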

Control Dependence

• Final kind of dependence called control dependence
• Example:

    if p1 {S1;}
    if p2 {S2;}

  S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
  Note that S2 could be data dependent on S1.

Control Dependences

• Two (obvious) constraints on control dependences:
  – An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
  – An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch
• Control dependences are often relaxed to get parallelism; we get the same effect if we preserve the order of exceptions and the data flow

Hardware Schemes for ILP

• Why in hardware at run time?
  – Works when a dependence is not known at compile time
  – Simplifies compiler
  – Allows code for one machine to run well on another
• Key idea: Allow instructions behind a stall to proceed

    DIVD  F0, F2, F4
    ADDD  F10, F0, F8
    SUBD  F12, F8, F14

  – Enables out-of-order execution, and therefore out-of-order completion
  – ID stage checks for both structural hazards and data dependences
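The key idea can be illustrated by sorting the instructions behind a stalled one into those that must wait and those that may proceed. This is a simplified sketch: the function `split_behind_stall` is hypothetical, it checks only RAW and WAW against results that are still pending, and it assumes WAR hazards are handled elsewhere (for example by renaming).

```python
def split_behind_stall(stalled, following):
    """Return (ops that may proceed, ops that must wait) behind a stalled instruction."""
    pending = {stalled[1]}  # destination registers whose results are not ready
    proceed, wait = [], []
    for op, dest, srcs in following:
        if pending & set(srcs) or dest in pending:
            wait.append(op)
            pending.add(dest)  # its own result is now pending too
        else:
            proceed.append(op)
    return proceed, wait


divd = ("DIVD", "F0", ("F2", "F4"))  # long-latency divide, assumed stalled
behind = [
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F12", ("F8", "F14")),
]

proceed, wait = split_behind_stall(divd, behind)
print("may proceed:", proceed)
print("must wait:  ", wait)
```

ADDD needs F0 from the divide, so it waits; SUBD touches neither F0 nor F10 and can execute ahead of it.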

Hardware Schemes for ILP

Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands

Tomasulo’s Algorithm

For IBM 360/91, about 3 years after CDC 6600
Goal: High performance without special compilers

Differences between IBM 360 & CDC 6600 ISA
  – IBM has only 2 register specifiers/instruction vs. 3 in CDC 6600
  – IBM has 4 FP registers vs. 8 in CDC 6600

Differences between Tomasulo’s Algorithm & Scoreboard
  – Control & buffers (called “reservation stations”) distributed with functional units vs. centralized in scoreboard
  – Registers in instructions replaced by pointers to reservation station buffer
  – HW renaming of registers to avoid WAR, WAW hazards
  – Common data bus (CDB) broadcasts results to functional units
  – Loads and stores treated as functional units as well

Descendants: Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, ...

Three Stages of Tomasulo Algorithm

1. Issue: Get instruction from FP operation queue
   If a reservation station is free, issue the instruction & send operands (renames registers).
2. Execution: Operate on operands (EX)
   When operands are ready, execute; if not ready, watch the common data bus for the result.
3. Write result: Finish execution (WB)
   Write on the common data bus to all awaiting units; mark the reservation station available.

Common data bus: data + source (“come from” bus)

Tomasulo Organization

[Figure: block diagram of the Tomasulo organization. Labeled blocks: From Instruction Unit, FP Op Queue, FP Registers, From Memory, Load Buffers, Store Buffers, To Memory, FP Add Res. Station, FP Mul Res. Station, Reservation Stations, FP Adders, FP Multipliers, Operation Bus, Operand Bus, Common data bus (CDB).]

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)
Qj, Qk: Reservation stations producing source registers
Vj, Vk: Value of source operands
Rj, Rk: Flags indicating when Vj, Vk are ready
Busy: Indicates reservation station and FU is busy

Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
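The fields above map naturally onto a small record type. The sketch below is an illustration under stated assumptions, not the 360/91 design: the names `ReservationStation`, `issue`, `broadcast` and `ready` are hypothetical, operand values are kept as symbolic strings such as `R(F4)`, and the Rj/Rk ready flags are represented implicitly by Qj/Qk being empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None  # value of source operand j, once known
    vk: Optional[str] = None  # value of source operand k, once known
    qj: Optional[str] = None  # station that will produce operand j, if pending
    qk: Optional[str] = None  # station that will produce operand k, if pending


def issue(rs, op, dest, src_j, src_k, reg_status):
    """Fill a free station: pending sources become tags, ready ones become values."""
    assert not rs.busy, "structural hazard: station is occupied"
    rs.busy, rs.op = True, op
    for src, v_field, q_field in ((src_j, "vj", "qj"), (src_k, "vk", "qk")):
        producer = reg_status.get(src)
        if producer is None:
            setattr(rs, v_field, f"R({src})")  # read the register now
            setattr(rs, q_field, None)
        else:
            setattr(rs, v_field, None)
            setattr(rs, q_field, producer)     # wait for this station's result
    reg_status[dest] = rs.name                 # later readers are renamed to this station


def broadcast(tag, value, stations, reg_status):
    """Common data bus: deliver (source tag, value) to every waiting consumer."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            del reg_status[reg]  # register now holds the value; status goes blank


def ready(rs):
    """A busy station can start executing once no operand is pending."""
    return rs.busy and rs.qj is None and rs.qk is None


# Both loads are still outstanding when MULTD F0,F2,F4 issues.
reg_status = {"F6": "Load1", "F2": "Load2"}
mult1 = ReservationStation("Mult1")
issue(mult1, "MULTD", "F0", "F2", "F4", reg_status)
print(mult1, ready(mult1))

broadcast("Load2", "M(45+R3)", [mult1], reg_status)
print(mult1, ready(mult1))
```

Because the station records the producer's name rather than the register name, a later write to F2 cannot disturb the operand that Mult1 is waiting for.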

Tomasulo Example Cycle 1

Instruction status

    Instruction   j     k     Issue   Execution complete   Write result
    LD     F6     34    R2    1
    LD     F2     45    R3
    MULTD  F0     F2    F4
    SUBD   F8     F6    F2
    DIVD   F10    F0    F6
    ADDD   F6     F8    F2

Load buffers

    Name    Busy   Address
    Load1   Yes    34+R2
    Load2   No
    Load3   No

Reservation stations

    Name    Busy   Op      S1 Vj      S2 Vk      RS for j Qj   RS for k Qk
    Add1    No
    Add2    No
    Add3    No
    Mult1   No
    Mult2   No

Register result status (FU)

    Clock   F0      F2         F4   F6         F8     F10     F12   ...   F30
    1                               Load1

Tomasulo Example Cycle 2

Instruction status

    Instruction   j     k     Issue   Execution complete   Write result
    LD     F6     34    R2    1
    LD     F2     45    R3    2
    MULTD  F0     F2    F4
    SUBD   F8     F6    F2
    DIVD   F10    F0    F6
    ADDD   F6     F8    F2

Load buffers

    Name    Busy   Address
    Load1   Yes    34+R2
    Load2   Yes    45+R3
    Load3   No

Reservation stations

    Name    Busy   Op      S1 Vj      S2 Vk      RS for j Qj   RS for k Qk
    Add1    No
    Add2    No
    Add3    No
    Mult1   No
    Mult2   No

Register result status (FU)

    Clock   F0      F2         F4   F6         F8     F10     F12   ...   F30
    2               Load2           Load1

Tomasulo Example Cycle 3

Instruction status

    Instruction   j     k     Issue   Execution complete   Write result
    LD     F6     34    R2    1       3
    LD     F2     45    R3    2
    MULTD  F0     F2    F4    3
    SUBD   F8     F6    F2
    DIVD   F10    F0    F6
    ADDD   F6     F8    F2

Load buffers

    Name    Busy   Address
    Load1   Yes    34+R2
    Load2   Yes    45+R3
    Load3   No

Reservation stations

    Name    Busy   Op      S1 Vj      S2 Vk      RS for j Qj   RS for k Qk
    Add1    No
    Add2    No
    Add3    No
    Mult1   Yes    MULTD              R(F4)      Load2
    Mult2   No

Register result status (FU)

    Clock   F0      F2         F4   F6         F8     F10     F12   ...   F30
    3       Mult1   Load2           Load1

Register names are renamed in reservation stations.
Load1 completing: who is waiting for Load1?

Tomasulo Example Cycle 4

Instruction status

    Instruction   j     k     Issue   Execution complete   Write result
    LD     F6     34    R2    1       3                    4
    LD     F2     45    R3    2       4
    MULTD  F0     F2    F4    3
    SUBD   F8     F6    F2    4
    DIVD   F10    F0    F6
    ADDD   F6     F8    F2

Load buffers

    Name    Busy   Address
    Load1   No
    Load2   Yes    45+R3
    Load3   No

Reservation stations

    Name    Busy   Op      S1 Vj      S2 Vk      RS for j Qj   RS for k Qk
    Add1    Yes    SUBD    M(34+R2)                            Load2
    Add2    No
    Add3    No
    Mult1   Yes    MULTD              R(F4)      Load2
    Mult2   No

Register result status (FU)

    Clock   F0      F2         F4   F6         F8     F10     F12   ...   F30
    4       Mult1   Load2           M(34+R2)   Add1

Load2 completing: who is waiting for it?
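The question "who is waiting?" amounts to a search for the completing unit's name in the Qj/Qk fields and in the register result status. A minimal sketch over the cycle 4 state, using plain dictionaries; the function `waiting_for` and the dictionary layout are hypothetical.

```python
def waiting_for(tag, stations, reg_status):
    """List the station operands and registers that expect a result from `tag`."""
    waiting = []
    for name, rs in stations.items():
        for field in ("Qj", "Qk"):
            if rs.get(field) == tag:
                waiting.append(f"{name}.{field}")
    waiting += [reg for reg, fu in reg_status.items() if fu == tag]
    return waiting


# Busy reservation stations and pending registers at clock 4.
stations = {
    "Add1": {"Op": "SUBD", "Vj": "M(34+R2)", "Qk": "Load2"},
    "Mult1": {"Op": "MULTD", "Vk": "R(F4)", "Qj": "Load2"},
}
reg_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}

print(waiting_for("Load2", stations, reg_status))
# prints ['Add1.Qk', 'Mult1.Qj', 'F2']
```

When Load2 writes its result on the common data bus, these three consumers capture the value in the same cycle.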

Tomasulo Example Cycle 5

Instruction status

    Instruction   j     k     Issue   Execution complete   Write result
    LD     F6     34    R2    1       3                    4
    LD     F2     45    R3    2       4                    5
    MULTD  F0     F2    F4    3
    SUBD   F8     F6    F2    4
    DIVD   F10    F0    F6    5
    ADDD   F6     F8    F2

Load buffers

    Name    Busy   Address
    Load1   No
    Load2   No
    Load3   No

Reservation stations

    Name    Busy   Op      S1 Vj      S2 Vk      RS for j Qj   RS for k Qk
    Add1    Yes    SUBD    M(34+R2)   M(45+R3)
    Add2    No
    Add3    No
    Mult1   Yes    MULTD   M(45+R3)   R(F4)
    Mult2   Yes    DIVD               M(34+R2)   Mult1

Register result status (FU)

    Clock   F0      F2         F4   F6         F8     F10     F12   ...   F30
    5       Mult1   M(45+R3)                   Add1   Mult2

Tomasulo Example Cycle 6

Instruction status

    Instruction   j     k     Issue   Execution complete   Write result
    LD     F6     34    R2    1       3                    4
    LD     F2     45    R3    2       4                    5
    MULTD  F0     F2    F4    3
    SUBD   F8     F6    F2    4
    DIVD   F10    F0    F6    5
    ADDD   F6     F8    F2    6

Load buffers

    Name    Busy   Address
    Load1   No
    Load2   No
    Load3   No

Reservation stations

    Name    Busy   Op      S1 Vj      S2 Vk      RS for j Qj   RS for k Qk
    Add1    Yes    SUBD    M(34+R2)   M(45+R3)
    Add2    Yes    ADDD               M(45+R3)   Add1
    Add3    No
    Mult1   Yes    MULTD   M(45+R3)   R(F4)
    Mult2   Yes    DIVD               M(34+R2)   Mult1

Register result status (FU)

    Clock   F0      F2         F4   F6         F8     F10     F12   ...   F30
    6       Mult1                   Add2       Add1   Mult2

Tomasulo Summary

Reservation stations: renaming to a larger set of registers + buffering source operands
  – Prevents registers from becoming a bottleneck
  – Avoids WAR, WAW hazards of scoreboard
  – Allows loop unrolling in HW

Not limited to basic blocks (integer units get ahead, beyond branches)

Lasting contributions
  – Dynamic scheduling
  – Register renaming
  – Load/store disambiguation

360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP PA 8000; Alpha 21264