PPT - ECE/CS 552 Fall 2010 - University of Wisconsin–Madison

ECE/CS 552: Introduction to Superscalar Processors
Instructor: Mikko H Lipasti
Fall 2010
University of Wisconsin-Madison
Lecture notes partially based on notes by John P. Shen
Limitations of Scalar Pipelines

Scalar upper bound on throughput
– IPC <= 1 or CPI >= 1

Inefficient unified pipeline
– Long latency for each instruction

Rigid pipeline stall policy
– One stalled instruction stalls all newer instructions
© Shen, Lipasti
Parallel Pipelines
[Figure: (a) no parallelism; (b) temporal parallelism; (c) spatial parallelism; (d) parallel pipeline]
Intel Pentium Parallel Pipeline
[Figure: pipeline stages IF, D1, D2, EX, WB, split into the U-pipe and the V-pipe]
Diversified Pipelines
[Figure: IF, ID, and RD stages feed a diversified EX stage (ALU; MEM1, MEM2; FP1, FP2, FP3; BR), followed by WB]
Power4 Diversified Pipelines
[Figure: Power4 block diagram: PC, I-Cache, Fetch Q, BR Scan, BR Predict, Decode, Reorder Buffer; issue queues FP, FX/LD 1, FX/LD 2, BR/CR; units FP1, FP2, FX1, LD1, LD2, FX2, CR, BR; StQ, D-Cache]
Rigid Pipeline Stall Policy
[Figure: a stalled instruction in the pipeline; bypassing of the stalled instruction is not allowed; stalling propagates backward]
Dynamic Pipelines
[Figure: IF, ID, RD → dispatch buffer (entered in order, exited out of order) → EX pipelines (ALU; MEM1, MEM2; FP1, FP2, FP3; BR) → reorder buffer (entered out of order, exited in order) → WB]
Interstage Buffers
[Figure: interstage buffers between Stage i and Stage i + 1: (a) Buffer (1); (b) Buffer (n), entries 1 to n, in order; (c) Buffer (≥ n), entered in order and exited out of order]
Superscalar Pipeline Stages
[Figure: Fetch → Instruction Buffer → Decode → Dispatch Buffer → Dispatch (in program order) → Issuing Buffer → Execute (out of order) → Completion Buffer → Complete → Store Buffer → Retire (in program order)]
Limitations of Scalar Pipelines

Scalar upper bound on throughput
– IPC <= 1 or CPI >= 1
– Solution: wide (superscalar) pipeline

Inefficient unified pipeline
– Long latency for each instruction
– Solution: diversified, specialized pipelines

Rigid pipeline stall policy
– One stalled instruction stalls all newer instructions
– Solution: Out-of-order execution, distributed execution pipelines
Impediments to High IPC
[Figure: instruction flow (I-cache, branch predictor, FETCH, instruction buffer, DECODE); register data flow (EXECUTE: integer, floating-point, media, and memory units; reorder buffer (ROB); COMMIT); memory data flow (store queue, D-cache)]
Superscalar Pipeline Design
Instruction Fetching Issues
 Instruction Decoding Issues
 Instruction Dispatching Issues
 Instruction Execution Issues
 Instruction Completion & Retiring Issues

Instruction Fetch

Objective: Fetch multiple instructions per cycle

Challenges:
– Branches: control dependences
– Branch target misalignment
– Instruction cache misses

Solutions
– Alignment hardware
– Prediction/speculation
[Figure: PC indexes the instruction memory; 3 instructions fetched]
Fetch Alignment
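A minimal sketch of why a misaligned branch target reduces fetch bandwidth. It assumes fixed-size instructions and a fetch group that cannot cross an instruction cache line when there is no alignment hardware. The function name and the parameter values (4-wide fetch, 4-byte instructions, 16-byte lines) are illustrative assumptions, not taken from the slides.

```python
def instructions_fetched(pc, fetch_width=4, instr_bytes=4, line_bytes=16):
    """Return how many instructions one fetch cycle delivers starting at pc.

    Assumes a fetch group may not cross a cache-line boundary and that no
    alignment hardware stitches two lines together.
    """
    offset_in_line = pc % line_bytes
    slots_left_in_line = (line_bytes - offset_in_line) // instr_bytes
    return min(fetch_width, slots_left_in_line)


# An aligned target fills the whole fetch group.
print(instructions_fetched(0x1000))  # 4
# A target one instruction into the line leaves only 3 slots.
print(instructions_fetched(0x1004))  # 3
```

Alignment hardware aims to remove this loss by fetching across the line boundary.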
Branches – MIPS
6 Types of Branches
Jump (uncond, no save PC, imm)
Jump and link (uncond, save PC, imm)
Jump register (uncond, no save PC, register)
Jump and link register (uncond, save PC, register)
Branch (conditional, no save PC, PC+imm)
Branch and link (conditional, save PC, PC+imm)
Disruption of Sequential Control Flow
[Figure: Fetch → Instruction/Decode Buffer → Decode → Dispatch Buffer → Dispatch → Reservation Stations → Issue → Execute (Branch) → Finish → Reorder/Completion Buffer → Complete → Store Buffer → Retire]
Branch Prediction

Target address generation → Target speculation
– Access register: PC, general-purpose register, link register
– Perform calculation: +/- offset, autoincrement

Condition resolution → Condition speculation
– Access register: condition code register, general-purpose register
– Perform calculation: comparison of data register(s)
Target Address Generation
[Figure: pipeline from Fetch to Retire (Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Branch Execute, Finish, Completion Buffer, Complete, Store Buffer), annotated with the target addressing modes: PC-relative, register indirect with offset, register indirect]
Condition Resolution
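[Figure: pipeline from Fetch to Retire (Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Branch Execute, Finish, Completion Buffer, Complete, Store Buffer), annotated with the condition sources: condition code (CC) register, general-purpose (GP) register value comparison]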
Branch Instruction Speculation
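[Figure: a branch predictor (using a BTB) supplies the speculative target and speculative condition; the prediction steers the FA-mux, which selects between PC(seq.) and the speculative target as the fetch address sent to the I-cache; the executed branch sends the BTB update (target address and history); pipeline shown from Fetch through Decode, Dispatch, Reservation Stations, Issue, Branch Execute, Finish, Completion Buffer]
PC(seq.) = FA (fetch address)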
Static Branch Prediction

Single-direction
– Always not-taken: Intel i486

Backwards Taken/Forward Not Taken
– Loop-closing branches have negative offset
– Used as backup in Pentium Pro, II, III, 4
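The backwards-taken/forward-not-taken rule can be sketched in a few lines. This is an illustrative example only; the function name and the use of a signed offset as input are assumptions.

```python
def predict_taken_btfn(branch_offset):
    """Static BTFN rule: predict taken only for backward branches.

    branch_offset is the signed displacement to the target; a negative value
    means the target precedes the branch, as for a loop-closing branch.
    """
    return branch_offset < 0


print(predict_taken_btfn(-32))  # True: backward, loop-closing style branch
print(predict_taken_btfn(16))   # False: forward branch
```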
Static Branch Prediction
Profile-based
1. Instrument program binary
2. Run with representative (?) input set
3. Recompile program
a. Annotate branches with hint bits, or
b. Restructure code to match predict not-taken

Performance: 75-80% accuracy
– Much higher for “easy” cases
Dynamic Branch Prediction

Main advantages:
– Learn branch behavior autonomously: no compiler analysis, heuristics, or profiling
– Adapt to changing branch behavior: program phase changes branch behavior

First proposed in 1980
– US Patent #4,370,711, Branch predictor using random access memory, James E. Smith

Continually refined since then
Smith Predictor Hardware
[Figure: m bits of the branch address index a table of 2^m k-bit saturating counters; the most significant bit of the selected counter is the branch prediction; the branch outcome increments or decrements the counter, and the updated counter value is written back]
Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135-148, May 1981.
Widely employed: Intel Pentium, PowerPC 604, PowerPC 620, etc.
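A minimal sketch of a Smith predictor as described on this slide: 2^m k-bit saturating counters indexed by branch address bits, with the most significant bit as the prediction. The class name, the default sizes, the initial counter value, and the choice to drop the two low address bits (4-byte instructions) are assumptions made for illustration.

```python
class SmithPredictor:
    """Table of 2**m k-bit saturating counters indexed by branch address bits."""

    def __init__(self, m=4, k=2):
        self.m = m
        self.max_count = (1 << k) - 1
        self.msb = 1 << (k - 1)
        # Start every counter at the weakest "taken" state (an arbitrary choice).
        self.counters = [self.msb] * (1 << m)

    def _index(self, branch_addr):
        # Assumes 4-byte instructions, so the two low address bits are dropped.
        return (branch_addr >> 2) & ((1 << self.m) - 1)

    def predict(self, branch_addr):
        # The most significant bit of the counter is the prediction.
        return bool(self.counters[self._index(branch_addr)] & self.msb)

    def update(self, branch_addr, taken):
        # Increment on taken, decrement on not taken, saturating at both ends.
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
```

With 2-bit counters, a single not-taken outcome at a loop exit does not flip a strongly taken counter, which is the hysteresis that makes the scheme work well for loop branches.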
Two-level Branch Prediction
[Figure: PC = 01011010010101 and BHR = 0110 select PHT entry 010110 (entries run from 000000 to 111111); the 2-bit counter in that entry holds 1 0, giving branch prediction 1]

BHR adds global branch history
– Provides more context
– Can differentiate multiple instances of the same static branch
– Can correlate behavior across multiple static branches
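A minimal sketch of a two-level predictor with a global branch history register (BHR) and a pattern history table (PHT) of 2-bit counters. The index here concatenates a few PC bits with the BHR, which is an assumption; other schemes (for example gshare) combine them with XOR. The class name and sizes are illustrative.

```python
class TwoLevelPredictor:
    """Global-history two-level predictor: PHT of 2-bit counters."""

    def __init__(self, history_bits=4, pc_bits=2):
        self.history_bits = history_bits
        self.pc_bits = pc_bits
        self.bhr = 0  # global history, newest outcome in bit 0
        # Counters start weakly not-taken (value 1), an arbitrary choice.
        self.pht = [1] * (1 << (history_bits + pc_bits))

    def _index(self, pc):
        # Assumed index: low PC bits (word address) concatenated with the BHR.
        pc_part = (pc >> 2) & ((1 << self.pc_bits) - 1)
        return (pc_part << self.history_bits) | self.bhr

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(self.pht[i] + 1, 3)
        else:
            self.pht[i] = max(self.pht[i] - 1, 0)
        # Shift the outcome into the global history.
        mask = (1 << self.history_bits) - 1
        self.bhr = ((self.bhr << 1) | int(taken)) & mask
```

A branch that strictly alternates taken and not-taken defeats a single counter per branch, but here each history pattern selects its own counter, so the alternation becomes predictable after a short warm-up.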
Combining or Hybrid Predictors
[Figure: two hybrid organizations, each indexed by the branch address: a meta-predictor M selects between component predictors P0 and P1 to produce the branch prediction; a meta-predictor M selects between gshare and PAp to produce the branch prediction]

Select “best” history
Reduce interference w/partial updates
Scott McFarling. Combining Branch Predictors. TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993.
Branch Target Prediction
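[Figure: the branch address indexes a branch target buffer of (tag, target) entries and, in parallel, a branch direction predictor; tag comparators (=) are ORed to produce “BTB Hit?”; the not-taken target is the branch address + size of instruction; a 0/1 mux selects between the not-taken target and the taken-target to give the branch target]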

Partial tags sufficient in BTB
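A minimal sketch of target prediction with a direct-mapped BTB that stores partial tags. The class and function names, the table sizes, and the direct-mapped organization are assumptions for illustration; the direction prediction is passed in as a plain boolean.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB; each entry holds (partial_tag, target) or None."""

    def __init__(self, index_bits=4, tag_bits=8):
        self.index_bits = index_bits
        self.tag_bits = tag_bits
        self.entries = [None] * (1 << index_bits)

    def _split(self, pc):
        word = pc >> 2  # assumes 4-byte instructions
        index = word & ((1 << self.index_bits) - 1)
        # Only tag_bits of the remaining address are kept (partial tag).
        tag = (word >> self.index_bits) & ((1 << self.tag_bits) - 1)
        return index, tag

    def install(self, pc, target):
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)

    def lookup(self, pc):
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]  # BTB hit
        return None          # BTB miss


def next_fetch_address(pc, btb, predict_taken, instr_bytes=4):
    """Taken-target on a BTB hit with a taken prediction, else fall through."""
    target = btb.lookup(pc)
    if target is not None and predict_taken:
        return target
    return pc + instr_bytes
```

A partial tag can alias two branches to the same entry; the result is only a wrong prediction, which the normal misprediction recovery corrects.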
Return Address Stack
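[Figure: (a) the branch address indexes the BTB to produce the target prediction; (b) a return address stack, fed with branch address + size of instruction, supplies the target prediction instead of the BTB target when the BTB indicates “is this a return?”]
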
For each call/return pair:
– Call: push return address onto hardware stack
– Return: pop return address from hardware stack
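The call/return pairing above can be sketched as a small bounded stack. The class name, the depth, and the overflow policy (discard the oldest entry) are assumptions; the pushed return address is the call address plus the instruction size, as in the figure.

```python
class ReturnAddressStack:
    """Bounded hardware-style stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc, instr_bytes=4):
        # Call: push the address of the instruction after the call.
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed overflow policy: drop the oldest entry
        self.stack.append(call_pc + instr_bytes)

    def on_return(self):
        # Return: pop the predicted target, or None if the stack is empty.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x100)   # outer call
ras.on_call(0x200)   # nested call
print(hex(ras.on_return()))  # 0x204
print(hex(ras.on_return()))  # 0x104
```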
Branch Speculation
[Figure: tree of taken (T) and not-taken (NT) branch outcomes; the speculated path passes through three predicted branches marked TAG 1, TAG 2, and TAG 3]

Leading Speculation
– Typically done during the Fetch stage
– Based on potential branch instruction(s) in the current fetch group

Trailing Confirmation
– Typically done during the Branch Execute stage
– Based on the next Branch instruction to finish execution
Branch Speculation
 Leading Speculation
1. Tag speculative instructions
2. Advance branch and following instructions
3. Buffer addresses of speculated branch instructions

Trailing Confirmation
1. When branch resolves, remove/deallocate speculation tag
2. Permit completion of branch and following instructions
Branch Speculation
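[Figure: the same tree of taken (T) and not-taken (NT) outcomes with branches marked TAG 1, TAG 2, and TAG 3, showing a mispredicted branch and the switch to the alternate path]
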
Start new correct path
– Must remember the alternate (non-predicted) path

Eliminate incorrect path
– Must ensure that the mis-speculated instructions produce no side effects
Mis-speculation Recovery

Start new correct path
1. Update PC with computed branch target (if predicted NT)
2. Update PC with sequential instruction address (if predicted T)
3. Can begin speculation again at next branch

Eliminate incorrect path
1. Use tag(s) to deallocate resources occupied by speculative instructions
2. Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations
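The two halves of recovery can be sketched as follows. This is a simplified model under stated assumptions: instructions are (name, tag) pairs, non-speculative instructions carry a tag of None, and the caller supplies the set of tags to squash (the mispredicted branch's tag plus all younger tags). The function names are hypothetical.

```python
def correct_path_pc(branch_pc, computed_target, predicted_taken, instr_bytes=4):
    """Return the PC at which fetch restarts after a misprediction."""
    if predicted_taken:
        # Predicted T but the branch fell through: use the sequential address.
        return branch_pc + instr_bytes
    # Predicted NT but the branch was taken: use the computed target.
    return computed_target


def squash(in_flight, bad_tags):
    """Keep only instructions whose speculation tag is not being squashed."""
    return [(name, tag) for name, tag in in_flight if tag not in bad_tags]


in_flight = [("i1", None), ("i2", 1), ("i3", 1), ("i4", 2)]
# The branch that owns tag 1 was mispredicted, so tag 1 and the younger tag 2 go.
print(squash(in_flight, {1, 2}))                   # [('i1', None)]
print(hex(correct_path_pc(0x100, 0x200, False)))   # 0x200
```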
Summary: Instruction Fetch


Fetch group alignment
Target address generation
– Branch target buffer
– Return address stack

Target condition generation
– Static prediction
– Dynamic prediction

Speculative execution
– Tagging/tracking instructions
– Recovering from mispredicted branches
Issues in Decoding

Primary Tasks
– Identify individual instructions (!)
– Determine instruction types
– Determine dependences between instructions

Two important factors
– Instruction set architecture
– Pipeline width
Pentium Pro Fetch/Decode
Predecoding in the AMD K5
Dependence Checking
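[Figure: four instructions, each with Dest, Src0, and Src1 fields; twelve comparators (?=) compare the source fields of each trailing instruction with the Dest fields of the instructions ahead of it]
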
Trailing instructions in fetch group
– Check for dependence on leading instructions
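A minimal sketch of the dependence check within one fetch group, assuming each instruction is a (dest, src0, src1) tuple in program order. The function name and return format are illustrative. Each source of a trailing instruction is compared with the destination of every leading instruction, so a group of four needs 2 × (1 + 2 + 3) = 12 comparisons.

```python
def check_dependences(group):
    """Find RAW dependences inside a fetch group.

    group is a list of (dest, src0, src1) tuples in program order.
    Returns (deps, comparisons), where deps holds tuples of
    (consumer_index, source_slot, producer_index).
    """
    deps = []
    comparisons = 0
    for j in range(1, len(group)):
        for slot, src in enumerate(group[j][1:]):
            producer = None
            for i in range(j):
                comparisons += 1
                if src == group[i][0]:
                    producer = i  # the youngest leading writer wins
            if producer is not None:
                deps.append((j, slot, producer))
    return deps, comparisons


group = [("R3", "R2", "R1"), ("R4", "R3", "R1"),
         ("R3", "R4", "R2"), ("R5", "R3", "R4")]
deps, comparisons = check_dependences(group)
print(comparisons)  # 12
print(deps)         # [(1, 0, 0), (2, 0, 1), (3, 0, 2), (3, 1, 1)]
```

The comparator count grows quadratically with the group size, which is one reason pipeline width is listed as an important factor in decoding.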
Instruction Dispatch and Issue

Parallel pipeline
– Centralized instruction fetch
– Centralized instruction decode

Diversified pipeline
– Distributed instruction execution
Necessity of Instruction Dispatch
Centralized Reservation Station
Distributed Reservation Station
Issues in Instruction Execution

Parallel execution units
– Bypassing is a real challenge

Resolving register data dependences
– Want out-of-order instruction execution

Resolving memory data dependences
– Want loads to issue as soon as possible

Maintaining precise exceptions
– Required by the ISA
Bypass Networks
[Figure: Power4 block diagram: PC, I-Cache, Fetch Q, BR Scan, BR Predict, Decode, Reorder Buffer; issue queues FP, FX/LD 1, FX/LD 2, BR/CR; units FP1, FP2, FX1, LD1, LD2, FX2, CR, BR; StQ, D-Cache]

O(n^2) interconnect from/to FU inputs and outputs
Associative tag-match to find operands
Solutions (hurt IPC, help cycle time)
– Use RF only (IBM Power4) with no bypass network
– Decompose into clusters (Alpha 21264)
The Big Picture
[Figure: taxonomy of INSTRUCTION PROCESSING CONSTRAINTS: Resource Contention (Structural Dependences) and Code Dependences; Code Dependences divide into Control Dependences and Data Dependences; Data Dependences divide into True Dependences (RAW) and Storage Conflicts; Storage Conflicts divide into Anti-Dependences (WAR) and Output Dependences (WAW)]
Register Data Dependences

Program data dependences cause hazards
– True dependences (RAW)
– Antidependences (WAR)
– Output dependences (WAW)

When are registers read and written?
– Out of program order!
– Hence, any/all of these can occur

Solution to all three: register renaming
Register Renaming: WAR/WAW


Widely employed (Core i7, Athlon/Phenom, …)
Resolving WAR/WAW:
– Each register write gets unique “rename register”
– Writes are committed in program order at Writeback
– WAR and WAW are not an issue


All updates to “architected state” delayed till writeback
Writeback stage always later than read stage
– Reorder Buffer (ROB) enforces in-order writeback
Add R3 <= …     P32 <= …
Sub R4 <= …     P33 <= …
And R3 <= …     P35 <= …
Register Renaming: RAW

In order, at dispatch:
– Source registers checked to see if “in flight”
Register map table keeps track of this
 If not in flight, can be read from the register file
 If in flight, look up “rename register” tag (IOU)

– Then, allocate new register for register write
Add R3 <= R2 + R1     P32 <= P2 + P1
Sub R4 <= R3 - R1     P33 <= P32 - P1
And R3 <= R4 & R2     P35 <= P33 & P2
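A minimal sketch of the dispatch-time renaming steps above: sources are looked up in the map table first, then the destination is given a fresh physical register. The function name, the tuple format, and the free list contents are assumptions. The free list here hands out registers consecutively, so the third instruction gets P34 rather than the P35 shown on the slide.

```python
def rename(instructions, map_table, free_list):
    """Rename (op, dest, src0, src1) tuples in program order.

    map_table maps architected to physical register names and is updated in
    place; free_list supplies unused physical registers.
    """
    renamed = []
    for op, dest, src0, src1 in instructions:
        # Sources are looked up first, so they see all older writes.
        p0, p1 = map_table[src0], map_table[src1]
        # Then the destination is mapped to a fresh physical register.
        pd = free_list.pop(0)
        map_table[dest] = pd
        renamed.append((op, pd, p0, p1))
    return renamed


map_table = {"R1": "P1", "R2": "P2", "R3": "P3", "R4": "P4"}
program = [("add", "R3", "R2", "R1"),
           ("sub", "R4", "R3", "R1"),
           ("and", "R3", "R4", "R2")]
for line in rename(program, map_table, ["P32", "P33", "P34"]):
    print(line)
# ('add', 'P32', 'P2', 'P1')
# ('sub', 'P33', 'P32', 'P1')
# ('and', 'P34', 'P33', 'P2')
```

The two writes to R3 land in different physical registers, which removes the WAW hazard, while the Sub still reads the Add's result through P32, preserving the RAW dependence.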
Register Renaming: RAW

Advance instruction to reservation station
– Wait for rename register tag to trigger issue

Reservation station enables out-of-order issue
– Newer instructions can bypass stalled instructions
“Dataflow Engine” for Dynamic Execution
[Figure: Dispatch Buffer → Dispatch (allocates reorder buffer entries; reads Reg. File and Ren. Reg.) → Reservation Stations → functional units (Branch, Integer, Integer, Float. Point, Load/Store) → Complete, with Reg. Write Back; results are forwarded to the reservation stations and rename registers]
Reorder Buffer: managed as a queue; maintains sequential order of all instructions in flight
Instruction Processing Steps
•DISPATCH:
– Read operands from Register File (RF) and/or Rename Buffers (RRB)
– Rename destination register and allocate RRB entry
– Allocate Reorder Buffer (ROB) entry
– Advance instruction to appropriate Reservation Station (RS)
•EXECUTE:
– RS entry monitors bus for register tag(s) to latch in pending operand(s)
– When all operands are ready, issue instruction into Functional Unit (FU) and deallocate RS entry (no further stalling in execution pipe)
– When execution finishes, broadcast result to waiting RS entries, RRB entry, and ROB entry
•COMPLETE:
– Update architected register from RRB entry, deallocate RRB entry, and, if it is a store instruction, advance it to Store Buffer
– Deallocate ROB entry; the instruction is then considered architecturally completed
Physical Register File
[Figure: pipeline with stages labeled Fetch, Decode, Rename, Issue, RF Read, Execute, Agen-D$, D$, ALU RF Write, Load RF Write; a map table (R0 => P7, R1 => P3, …, R31 => P39) points into the physical register file]


Used in MIPS R10000, Pentium 4, AMD Bulldozer
All registers in one place
– Always accessed right before EX stage
– No copying to real register file at commit
Managing Physical Registers
Map Table: R0 => P7, R1 => P3, …, R31 => P39

Add R3 <= R2 + R1     P32 <= P2 + P1
Sub R4 <= R3 - R1     P33 <= P32 - P1
…                     …
And R3 <= R4 & R2     P35 <= P33 & P2
Release P32 (previous R3) when this instruction (the And) completes execution
What to do when all physical registers are in use?
– Must release them somehow to avoid stalling
– Maintain free list of “unused” physical registers

Release when no more uses are possible
– Sufficient: next write commits
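A minimal sketch of free-list management under the “next write commits” rule: renaming a destination records the previous mapping, and that previous physical register returns to the free list when the renaming instruction commits. The class and method names are hypothetical, and returning None to signal a dispatch stall is an assumed convention.

```python
class RenameState:
    """Map table plus free list of unused physical registers."""

    def __init__(self, map_table, free_list):
        self.map_table = dict(map_table)
        self.free_list = list(free_list)

    def rename_dest(self, arch_reg):
        """Return (new_phys, previous_phys), or None if dispatch must stall."""
        if not self.free_list:
            return None  # all physical registers are in use
        new_phys = self.free_list.pop(0)
        previous_phys = self.map_table[arch_reg]
        self.map_table[arch_reg] = new_phys
        return new_phys, previous_phys

    def commit(self, previous_phys):
        # The newer write has committed, so no instruction can still need the
        # previous mapping: release it.
        self.free_list.append(previous_phys)


state = RenameState({"R3": "P3"}, ["P32"])
new_phys, previous_phys = state.rename_dest("R3")
print(new_phys, previous_phys)    # P32 P3
print(state.rename_dest("R3"))    # None: free list empty, dispatch stalls
state.commit(previous_phys)       # the write to P32 commits, P3 is released
print(state.rename_dest("R3"))    # ('P3', 'P32')
```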
Memory Data Dependences

WAR/WAW: stores commit in order
– Hazards not possible. Why?

RAW: loads must check pending stores
– Store queue keeps track of pending store addresses
– Loads check against these addresses
– Similar to register bypass logic
– Comparators are 32 or 64 bits wide (address size)

Major source of complexity in modern designs
– Store queue lookup is position-based
– What if store address is not yet known?
[Figure: Load/Store RS → Agen → Mem, alongside the Store Queue and Reorder Buffer]
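A minimal sketch of a position-based store queue lookup for a load. The data layout (dicts with seq, addr, value), the function name, and the return format are assumptions. For the unknown-address case this sketch takes the conservative option and stalls the load; speculating past the store is another option, not shown. Access sizes and partial overlaps are ignored.

```python
def resolve_load(load_addr, load_seq, store_queue, memory):
    """Decide where a load gets its value.

    store_queue lists pending stores in program order; each is a dict with
    'seq' (program-order number), 'addr' (None if not yet computed), 'value'.
    Returns (source, value) with source in {'forward', 'memory', 'stall'}.
    """
    match = None
    for store in store_queue:
        if store["seq"] >= load_seq:
            break  # only stores older than the load matter
        if store["addr"] is None:
            return "stall", None  # conservative: an older address is unknown
        if store["addr"] == load_addr:
            match = store  # a younger matching store overrides an older one
    if match is not None:
        return "forward", match["value"]
    return "memory", memory.get(load_addr, 0)


sq = [{"seq": 1, "addr": 0x10, "value": 5},
      {"seq": 2, "addr": 0x10, "value": 7},
      {"seq": 4, "addr": 0x10, "value": 9}]
print(resolve_load(0x10, 3, sq, {}))          # ('forward', 7)
print(resolve_load(0x20, 3, sq, {0x20: 1}))   # ('memory', 1)
```

The lookup is position-based because the load must take its value from the youngest store that is still older than the load, not from any matching entry.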
Increasing Memory Bandwidth
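[Figure: the dataflow engine with two Load/Store units: Dispatch Buffer → Dispatch (Reg. File, Ren. Reg.) → RS’s → Branch, Integer, Integer, Float.-Point, Load/Store, Load/Store → Reorder Buff. → Complete → Store Buff. → Retire, with Reg. Write Back; a missed-loads buffer and the store buffer connect to the Data Cache]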
Issues in Completion/Retirement

Out-of-order execution
– ALU instructions
– Load/store instructions

In-order completion/retirement
– Precise exceptions

Solutions
– Reorder buffer retires instructions in order
– Store queue retires stores in order
– Exceptions can be handled at any instruction boundary by reconstructing state out of ROB/SQ
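A minimal sketch of in-order retirement from a reorder buffer. Instructions may finish in any order, but only the head of the queue may retire. The class name, the entry fields, and the exception policy shown (stop at the excepting instruction and discard it and everything younger) are simplifying assumptions.

```python
from collections import deque


class ReorderBuffer:
    """Queue of in-flight instructions kept in program order."""

    def __init__(self):
        self.entries = deque()

    def dispatch(self, name):
        entry = {"name": name, "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def retire(self):
        """Retire finished instructions from the head, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            if self.entries[0]["exception"]:
                # Precise exception: everything older has retired; the
                # excepting instruction and all younger ones are discarded.
                self.entries.clear()
                break
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
a, b, c = rob.dispatch("a"), rob.dispatch("b"), rob.dispatch("c")
c["done"] = True      # finishes first, out of order
b["done"] = True
print(rob.retire())   # []: the head 'a' has not finished
a["done"] = True
print(rob.retire())   # ['a', 'b', 'c']
```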
A Dynamic Superscalar Processor
Superscalar Summary
[Figure: instruction flow (I-cache, branch predictor, FETCH, instruction buffer, DECODE); register data flow (EXECUTE: integer, floating-point, media, and memory units; reorder buffer (ROB); COMMIT); memory data flow (store queue, D-cache)]
Landscape of Microprocessor Families
[John DeVale & Bryan Black, 2005]
[Figure: SPECint2000/MHz (0 to 1) plotted against Frequency (MHz, 0 to 3500), with constant-performance curves labeled from 100 to 1900 SpecINT 2000; series Intel-x86, AMD-x86, Itanium, Power; labeled points include PIII, P4, NWD, PSC, DTN, Extreme, Athlon, Opteron, 800 MHz, Itanium, Power 3, Power4, Power5]
Performance_CPU = Frequency / (PathLength × CPI)
** Data source www.spec.org
Superscalar Summary

Instruction flow
– Branches, jumps, calls: predict target, direction
– Fetch alignment
– Instruction cache misses

Register data flow
– Register renaming: RAW/WAR/WAW

Memory data flow
– In-order stores: WAR/WAW
– Store queue: RAW
– Data cache misses: missed load buffers