Lec8-dyn_schedule2 - ECE Users Pages

ECE 4100/6100
Advanced Computer Architecture
Lecture 8 Dynamic Scheduling (II)
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
Modern Processors
• Branch prediction results in speculative execution
• Speculative instructions (if wrongly speculated) must not alter the architectural state
– Architectural registers
– Memory
• Precise exceptions/interrupts are required
2
Modern Out-of-Order Core
• ALLOC: allocates instructions
• RAT (Register Alias Table): renames architectural registers
• RS (Reservation Station): issues instructions to functional units
• ROB (Reorder Buffer): maintains state information (physical registers) for precise interrupts and speculative execution
• ARF: architectural register file
• LSQ (Load Store Queue): maintains memory access ordering
3
Register Renaming
Architected registers: R0 to R7
Physical registers: T0 to Tn-1
Original code (contains WAW and WAR dependencies):
R2 = R1 + R3
R4 = R2 - R6
…
R2 = R7 / R5
BEQ R2, #1
…
R2 = R4 * R1
R6 = Load [R2]
Renamed code (no false dependencies!):
T1 = R1 + R3
R4 = T1 - R6
…
T20 = R7 / R5
BEQ T20, #1
…
T7 = R4 * R1
R6 = Load [T7]
Sandy Bridge: 160 physical registers (PRs) for INT, 144 PRs for FP
Adapted from Prof. G. Loh’s Slides
4
Register Renaming
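For each instruction Dest = Src1 op Src2:
• Look up the sources in the mapping mechanism: Src1 → TagS1, Src2 → TagS2
• Take TagD from the unmapped physical registers
• The renamed instruction is TagD = TagS1 op TagS2
• Update the mapping: Dest → TagD
Repeat for each instruction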
Adapted from Prof. G. Loh’s Slides
5
Register Alias Table (RAT)
• Use a lookup table for renaming
• One entry per architectural register
• Each entry maps to the most recent version of the architectural register, which could be in
– Physical register file
– Architectural register file
P6-style register renaming (also used by HP PA-8000 and PPC604)
[Figure: RAT with entries EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP; each entry points into the ROB (40 entries, with Data and Status fields) or the RRF]
6
RAT Example
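RAT columns are R0 to R7; "-" means no physical register is mapped. Each row shows the RAT and the free list before that instruction is renamed.
Instruction     Renamed            R0 R1 R2 R3 R4 R5 R6 R7   Free PRegs
R1 = R2 + R3    T13 = R2 + R3      -  -  -  -  -  -  -  -    T13, T14, T15, T16
R5 = R4 – R1    T14 = R4 – T13     -  13 -  -  -  -  -  -    T14, T15, T16
R1 = R1 * R5    T15 = T13 * T14    -  13 -  -  -  14 -  -    T15, T16
R2 = R5 / R1    T16 = T14 / T15    -  15 -  -  -  14 -  -    T16
(after all four)                   -  15 16 -  -  14 -  -    (none)
Adapted from Prof. G. Loh’s Slides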
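The RAT example can be replayed in a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `rename`, the tuple encoding `(dest, src1, src2)`, and the use of a dictionary for the RAT are illustrative choices, not part of the slides. A register missing from the dictionary is treated as still living in the ARF.

```python
from collections import deque

def rename(instructions, free_pregs):
    """Rename (dest, src1, src2) triples one at a time using a RAT (illustrative sketch)."""
    rat = {}                   # architectural reg -> physical tag; absent = value is in the ARF
    free = deque(free_pregs)   # unmapped physical registers, handed out in order
    renamed = []
    for dest, src1, src2 in instructions:
        # 1. Sources read the most recent mapping, or stay architectural.
        tag1 = rat.get(src1, src1)
        tag2 = rat.get(src2, src2)
        # 2. The destination takes a fresh physical register.
        tag_d = free.popleft()
        # 3. Later readers of `dest` must see the new tag.
        rat[dest] = tag_d
        renamed.append((tag_d, tag1, tag2))
    return renamed, rat, list(free)

program = [("R1", "R2", "R3"), ("R5", "R4", "R1"),
           ("R1", "R1", "R5"), ("R2", "R5", "R1")]
renamed, rat, free = rename(program, ["T13", "T14", "T15", "T16"])
for tag_d, tag1, tag2 in renamed:
    print(tag_d, "=", tag1, "op", tag2)
```

With the four instructions from the slide, the final RAT maps R1 to T15, R2 to T16, and R5 to T14, and the free list is empty.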
7
Superscalar Rename
R1 = R2 + R3
R4 = R5 – R7
R3 = R0 / R2
R5 = Ld 12[R6]
• Source tags are read from the RAT; destination tags come from the free register pool
• Don’t rename immediates
• For an N-wide superscalar: 2N RAT read ports, N RAT write ports
[Figure: RAT and free-pool tags assigned to the four instructions]
8
Intra-Group Dependencies
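R2 = R2 + R3
R4 = R5 – R7
R3 = R0 / R2
R5 = Ld 12[R6]
• All four instructions read the RAT at the same time, so the third instruction gets the wrong (old) version of R2 from the RAT
• It should be using the version of R2 that the first instruction was just given from the free register pool
[Figure: RAT and free-pool tags assigned to the four instructions]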
9
Intra-Group Dependencies
R1 = R2 + R1
R2 = R1 – R2
R1 = R2 / R1
R1 = R2 >> R1
[Figure: tags taken from the RAT and the free register pool for the four instructions; labels: “Result of sequential renaming” and “Correct final renamed registers”]
10
Resolving Intra-Group Dependencies
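[Figure: Inst 0 to Inst 3 each supply Src L, Src R, and Dest. Source registers are looked up in the RAT; destination tags Pdst0 to Pdst2 come from the free register pool. The Intra-Group Dependency Checker combines both to produce the final source tags T0L, T0R, T1L, T1R, T2L, T2R, T3L, T3R.]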
Adapted from Prof. G. Loh’s Slides
11
Intra-Group Dependency Checking
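[Figure: comparator network. The sources of later instructions (src1L, src1R, src2L, src2R, src3L, src3R) are compared (=) against the destinations of earlier instructions in the group (dst0, dst1, dst2). The comparison results select between the RAT outputs and Pdst0 to Pdst3 to produce T1L, T1R, T2L, T2R, T3L, T3R.]
Adapted from Prof. G. Loh’s Slides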
12
Mapping Selection
R1 = R2 + R1
R2 = R1 – R2
R1 = R2 / R1
R1 = R2 >> R1
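Condition: use an instruction’s mapping only if it is the last writer to the register within the group
• use pdst0, pdst1, or pdst2 only if the instruction’s destination differs (!=) from the destinations of the later instructions in the group
• use pdst3 unconditionally (its enable is tied to 1), since no later instruction exists in the group
• The figure builds this from != comparators over dst0 to dst3 and a priority encoder
In this example, R1 is written by instructions 0, 2, and 3; only the mapping from instruction 3 should be written into the RAT
Adapted from Prof. G. Loh’s Slides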
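The dependency check and the last-writer rule can be sketched together in Python. This is an illustration under stated assumptions, not the hardware design: the function name `rename_group`, the starting RAT contents (R1 in T10, R2 in T31), and the free-pool tags T16, T34, T19, T6 are hypothetical values chosen for the example. All RAT reads use the mappings from before the group, as parallel hardware would.

```python
def rename_group(group, rat, pdsts):
    """Rename a group of (dest, srcL, srcR) instructions in one step (illustrative sketch)."""
    old_rat = dict(rat)   # every RAT read sees the pre-group mappings
    renamed = []
    for i, (dest, src_l, src_r) in enumerate(group):
        tags = []
        for src in (src_l, src_r):
            tag = old_rat.get(src, src)
            # Dependency check: the youngest earlier writer in the group overrides the RAT.
            for j in range(i):
                if group[j][0] == src:
                    tag = pdsts[j]
            tags.append(tag)
        renamed.append((pdsts[i], tags[0], tags[1]))
    # Mapping selection: only the last writer of each register updates the RAT.
    new_rat = dict(old_rat)
    for i, (dest, _, _) in enumerate(group):
        if all(group[k][0] != dest for k in range(i + 1, len(group))):
            new_rat[dest] = pdsts[i]
    return renamed, new_rat

group = [("R1", "R2", "R1"), ("R2", "R1", "R2"),
         ("R1", "R2", "R1"), ("R1", "R2", "R1")]
renamed_group, new_rat = rename_group(
    group, {"R1": "T10", "R2": "T31"}, ["T16", "T34", "T19", "T6"])
```

Under these assumed tags, the last instruction reads R2 from T34 and R1 from T19, and the RAT ends with R1 mapped to T6 and R2 mapped to T34.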
13
Issue with Imprecise Interrupt
• add instructions take one cycle
• E.g.,
Left example (the load causes a data page fault):
lw r5, 8(r10)
add r10, r9, r8
add r12, r10, r7
Right example (instruction page fault; the code straddles the end of non-resident Page X and the start of resident Page X+1):
L1:
add r3, r1, r2
add r4, r1, r4
add r2, r4, r4
– Load (left side) induces a “data page fault”
– Add (right side) induces an “instruction page fault”
• If out-of-order completion is allowed
– r10, r12 (or r2, r4) … will be modified
– Wrong values will be used by the re-issued load
• Interrupt classes
– Program interrupts (exceptions or traps)
– External interrupts (asynchronous)
14
Precise Interrupts
• To reflect a sequential architecture model → serially correct (think of a single-issue, non-pipelined processor)
• Keep the “precise state” of an execution
– All instructions before the interrupted instruction must be completed
– The state should appear as if no instruction after the interrupted instruction has issued
– The interrupted PC should be presented to the interrupt handler (restartable)
• Similar to branch misprediction handling
• Out-of-order execution makes the ordering hard
– Undo what comes after an interrupt
15
Why Support Precise Interrupts
• Need to maintain a precise state (for recovery)
• Software debugging
• I/O or timer interrupts
• Virtual memory (page fault)
• Instruction emulation
• Virtual machines
16
Supporting Precise Interrupts
• Buffer results
• Can reconstruct the scenario (state) as in sequential execution
• Restart from saved PC with saved PC state
17
Reorder Buffer (ROB) [Smith & Pleszkun ’85, ’88]
• Architecture Register File keeps the “in-order state”
• Reorder Buffer (ROB)
– A circular buffer
– Contains all in-flight instructions
– Buffers the “lookahead state”
– In-order allocation/deallocation with head/tail pointers
• When an exception occurs
– Halt instruction issue
– Revert to the in-order state using the RF and discard the ROB results
• Also used for branch misprediction recovery
• Pentium Pro/II/III integrates the physical register file within the ROB
• Pentium 4 decouples the ROB and the physical register file
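A small Python model may help make the in-order retirement rule concrete. It is a sketch under stated assumptions: the class `ReorderBuffer` and its method names are invented for illustration, a `deque` stands in for the circular buffer with head/tail pointers, and the values are made up for the demonstration. Results may complete in any order, but the ARF is only updated from the head.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: in-order allocate and retire, out-of-order completion (illustrative sketch)."""
    def __init__(self, arf):
        self.arf = arf            # in-order (committed) state
        self.entries = deque()    # left end = head (oldest), right end = tail

    def allocate(self, pc, dest):
        entry = {"pc": pc, "dest": dest, "done": False, "value": None, "exc": False}
        self.entries.append(entry)   # in-order allocation at the tail
        return entry

    def complete(self, entry, value=None, exc=False):
        entry.update(done=True, value=value, exc=exc)   # may happen out of order

    def retire(self):
        """Commit finished instructions from the head; stop at an exception."""
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exc"]:
                self.entries.clear()   # discard all lookahead state
                return head["pc"]      # precise PC for the handler
            self.arf[head["dest"]] = head["value"]
            self.entries.popleft()
        return None

arf = {"R1": 1, "R2": 2, "R3": 3, "R4": 4}
rob = ReorderBuffer(arf)
e1 = rob.allocate(0xA000, "R1")
e2 = rob.allocate(0xA004, "R2")
e3 = rob.allocate(0xA008, "FR1")
e4 = rob.allocate(0xA00C, "R3")
e5 = rob.allocate(0xA010, "R4")
rob.complete(e4, 4)           # younger instructions finish early
rob.complete(e5, 8)
rob.complete(e1, 11)
first = rob.retire()          # commits R1 only; R2 is still executing
rob.complete(e3, exc=True)    # the FP divide raises an exception
rob.complete(e2, 4)
fault_pc = rob.retire()       # commits R2, then finds the exception at the head
```

After the second call, the ARF holds R1=11 and R2=4, while the already computed results for R3 and R4 are discarded because they sat behind the excepting instruction.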
18
Reorder Buffer (with physical registers)
• Entry fields: V, Spec?, Done?, PC, Exp event, RegDst, Data (physical register)
• Head: oldest instruction
• Tail: next instruction to be allocated
Sandy Bridge: 168-entry ROB
19
Handling Precise Interrupts
ROB entry fields: V, Spec?, Done?, PC, Exp event, RegDst, Data (physical register). The ARF initially holds R1=1, R2=2, R3=3, R4=4.
Step 1: The ROB holds xA000 R1=R1+10 (Head; Done, Data=11), xA004 R2=R2*2, and xA008 FR1=FR2/0.0.
Step 2: R1=R1+10 retires and the ARF now holds R1=11. The Head moves to xA004; xA00C R3=R3+1 is allocated.
Step 3: R3=R3+1 completes (Data=4); xA010 R4=R4*2 is allocated.
Step 4: R2=R2*2 completes (Data=4) and R4=R4*2 completes (Data=8). FR1=FR2/0.0 records Exp event 0010. xA014 FR4=FR4*2.0 is allocated.
Step 5: R2=R2*2 retires and the ARF now holds R2=4.
Step 6: FR1=FR2/0.0 reaches the Head with Exp event 0010: exception detected. The values for R3 (4) and R4 (8) were not committed into the RF. Back up the “PC” and the current RF.
Depending on the exception, the process will either abort or execution will resume from the excepting instruction.
25
Handling Speculative Execution
ROB entry fields: V, Spec?, Done?, PC, Exp event, RegDst, Data (physical register). The ARF initially holds R1=1, R2=2, R3=3, R4=4.
Step 1: The ROB holds xB000 R1=R1+10 (Head) and xB004 BEQ R1, R0, L1.
Step 2: BEQ R1, R0, L1 is predicted TAKEN. The instructions after it are allocated with Spec?=1: xC100 R2=R3 << 2 (Done, Data=32), xC104 R1=R2*R3, xD2AC BEQ R3, R0, L1, and xD2B0 R1=R7+1 (Done, Data=28).
Step 3: R1=R1+10 retires and the ARF now holds R1=11. BEQ R1, R0, L1 is resolved, actually NOT TAKEN: misprediction.
Step 4: Retire the branch and clear all entries after the mis-speculated branch.
Step 5: Continue execution from the correct path (fall through in this case): xB008 R2=R5 << 4 is allocated.
30
RAT Recovery
• The ARF state corresponds to the state prior to the oldest non-committed instruction
• As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
• On a branch misprediction, wrong-path instructions are flushed from the machine
• The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state
Adapted from Prof. G. Loh’s Slide
31
Solution: Stall and Drain
• Correct-path instructions from fetch can’t rename because the RAT is wrong
• Allow all instructions to execute and commit; the ARF corresponds to the last committed instruction
• The ARF now corresponds to the state right before the next instruction to be renamed (foo)
• Reset the RAT so that all mappings refer to the ARF
• Resume renaming the new correct-path instructions from fetch
Pros: Very simple to implement
Cons: Performance loss due to stalls
32
Another Solution: Checkpointing
• At each branch, make a copy of the RAT (register mapping at the time of the branch)
• RAT copies are taken from a checkpoint free pool
On a misprediction:
1. flush wrong-path instructions
2. deallocate RAT checkpoints
3. recover RAT from checkpoint
4. resume renaming
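The checkpoint steps can be sketched as follows. This is a minimal illustration under stated assumptions: the class `CheckpointedRAT`, its method names, and the branch identifiers `b0` and `b1` are hypothetical, the checkpoint pool is modeled as an unbounded list, and flushing the wrong-path instructions themselves is left outside the sketch.

```python
class CheckpointedRAT:
    """RAT with one saved copy per in-flight branch (illustrative sketch)."""
    def __init__(self, mappings):
        self.rat = dict(mappings)
        self.checkpoints = []   # (branch_id, RAT copy), oldest branch first

    def rename_dest(self, arch_reg, tag):
        self.rat[arch_reg] = tag

    def on_branch(self, branch_id):
        # Copy the register mapping as it stands at the time of the branch.
        self.checkpoints.append((branch_id, dict(self.rat)))

    def on_mispredict(self, branch_id):
        idx = next(i for i, (b, _) in enumerate(self.checkpoints) if b == branch_id)
        restored = self.checkpoints[idx][1]
        # Deallocate this checkpoint and those of all younger (wrong-path) branches.
        del self.checkpoints[idx:]
        # Recover the RAT; renaming can resume from here.
        self.rat = restored

crat = CheckpointedRAT({"R1": "T1"})
crat.on_branch("b0")
crat.rename_dest("R1", "T5")   # wrong-path rename if b0 was mispredicted
crat.on_branch("b1")
crat.rename_dest("R2", "T6")
crat.on_mispredict("b0")
```

After the misprediction of `b0`, the RAT again maps only R1 to T1, and the checkpoint for the younger branch `b1` is gone as well.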
33
Modern Instruction Scheduler
• At dispatch, an instruction reads all available operands from the register files and stores a copy in the scheduler (Tomasulo’s algorithm)
• Unavailable operands will be “captured” from the functional unit outputs (CDB broadcast)
• When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files (wakeup and select)
[Figure: blocks Fetch & Dispatch, PRF/ROB, ARF, Instruction Scheduler, and Functional Units; paths labeled Bypass and Physical register update]
Adapted from Prof. G. Loh’s Slide
34
Instruction Scheduling: Wakeup and Select
• Wakeup Logic
– Notifies waiting instructions when the data dependencies of their input operands are resolved
– Wakes up instructions with zero outstanding input dependencies
• Select Logic
– Chooses and fires ready instructions
– Deals with structural hazards
• Wakeup-select is likely on the critical path
– Associative match
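The wakeup half can be sketched in Python as a tag match over all waiting sources. This is an illustration under stated assumptions: the class `RSEntry`, the entry names `i0` to `i3`, and the pairing of source tags per entry are hypothetical. Hardware performs all comparisons in parallel; the loop here only models the result.

```python
class RSEntry:
    """One reservation-station entry that tracks which source tags are still pending."""
    def __init__(self, name, src_tags, ready_tags):
        self.name = name
        # Sources not yet produced at dispatch time must be captured later.
        self.waiting = {t for t in src_tags if t not in ready_tags}

    def ready(self):
        return not self.waiting

def wakeup(entries, broadcast_tag):
    """Match the broadcast tag against every waiting source (associative match)."""
    for e in entries:
        e.waiting.discard(broadcast_tag)
    return [e.name for e in entries if e.ready()]

available = {"T6", "T17"}   # assumed to be ready at dispatch
entries = [RSEntry("i0", ("T39", "T6"), available),
           RSEntry("i1", ("T17", "T39"), available),
           RSEntry("i2", ("T15", "T39"), available),
           RSEntry("i3", ("T8", "T42"), available)]
after_t39 = wakeup(entries, "T39")
after_t15 = wakeup(entries, "T15")
```

Broadcasting T39 wakes `i0` and `i1`; `i2` still waits for T15, and `i3` waits for both of its sources. The ready list is then handed to the select logic.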
35
Scalar Scheduler (Issue Width = 1)
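[Figure: each RS entry compares (=) its two source tags against the single tag on the Tag Broadcast Bus (T39 in the figure). Ready entries send requests to the Select Logic, which passes the chosen instruction to the Execute Logic. Tags shown: T6, T8, T14, T15, T16, T17, T39, T42.]
From Prof. G. Loh’s Slide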
36
Superscalar Scheduler (Issue Width = 4)
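[Figure: snapshot of the RS (only 4 entries shown). Tag Broadcast Bus [3..0] carries four tags per cycle (T14, T16, T17, T39), so each source tag in an entry needs four comparators (=), one per bus. The Select Logic passes the chosen instructions to the Execute Logic. Source tags shown: T6, T8, T15, T17, T39, T42.]
Adapted from Prof. G. Loh’s Slide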
37
Selection Logic
• Select ready instructions to be issued
• Goal: reduce the height of the data-flow graph (DFG)
• Methods
– Location-based (e.g., leftmost ready first)
• Allows simpler, faster hardware
– Oldest ready first
• Can use location-based (in-order issue) with “compaction”
• Can be slow and complex
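The two selection methods differ only in how they rank the ready entries, which a short Python sketch can show. The function names and the example `ready`/`ages` vectors are assumptions made for illustration; a smaller age value is taken to mean an older instruction.

```python
def select_location_based(ready_flags):
    """Grant the leftmost ready slot (fixed priority by position)."""
    for slot, ready in enumerate(ready_flags):
        if ready:
            return slot
    return None

def select_oldest_first(ready_flags, ages):
    """Grant the ready slot that holds the oldest instruction (smallest age)."""
    candidates = [slot for slot, ready in enumerate(ready_flags) if ready]
    return min(candidates, key=lambda s: ages[s], default=None)

ready = [False, True, False, True]
ages = [7, 9, 3, 4]   # hypothetical ages; slot 2 is oldest but not ready
by_location = select_location_based(ready)
by_age = select_oldest_first(ready, ages)
```

Here the location-based policy grants slot 1, while oldest-first grants slot 3. If the RS is kept compacted so that older instructions always sit further left, the two policies coincide, which is the point of the “compaction” bullet.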
38
Simple Select Logic Implementation
[Figure, repeated over four animation steps: tree-like arbitrated selection logic over the Reservation Station. Each arbiter cell has request inputs Req0 to Req3, an Enable input, grant outputs Grant0 to Grant3, and an AnyQueue output that feeds a request input of its parent cell. The root cell’s Enable is tied to 1. Inside each cell, a priority decoder turns Req0 to Req3 into Grt0 to Grt3.] [Palacharla ISCA’97]
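The Req/Grant/Enable/AnyQueue signals of the arbiter tree can be mimicked in Python. This is a sketch under stated assumptions: the function names are invented, the tree is fixed at two levels with 16 request lines, and each cell uses a simple lowest-index-first priority, which may differ from the actual circuit.

```python
def arbiter_cell(reqs, enable):
    """One cell: grant the lowest-index request if enabled; also report whether any request exists."""
    any_req = any(reqs)
    grants = [False] * len(reqs)
    if enable and any_req:
        grants[reqs.index(True)] = True
    return grants, any_req

def tree_select(reqs, fan_in=4):
    """Two-level arbiter tree over fan_in * fan_in request lines (illustrative sketch)."""
    groups = [reqs[i:i + fan_in] for i in range(0, len(reqs), fan_in)]
    # Upward pass: each leaf cell tells the root whether it holds any request.
    any_reqs = [any(g) for g in groups]
    # The root is always enabled (Enable = 1) and picks one subtree.
    root_grants, _ = arbiter_cell(any_reqs, enable=True)
    # Downward pass: only the granted leaf cell is enabled.
    grants = []
    for g, en in zip(groups, root_grants):
        leaf_grants, _ = arbiter_cell(g, en)
        grants.extend(leaf_grants)
    return grants

reqs = [False] * 16
reqs[5] = reqs[9] = True          # two ready instructions
grants = tree_select(reqs)
granted = [i for i, g in enumerate(grants) if g]
```

With requests in slots 5 and 9, only slot 5 is granted. The tree keeps the delay proportional to the number of levels rather than to the number of RS entries.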
Issuing to Distinct Functional Units
Distributed instruction windows (e.g., MIPS R10000 or Alpha 21264)
[Figure: separate Reservation Stations feeding different functional units]
Faster to have separate instruction schedulers for different instruction types
43
Dual Issues to Multiple Units (e.g., 2 Adders)
[Figure: two select blocks, each with inputs Req0 to Req3 and outputs Grant0 to Grant3, used to issue two instructions per cycle] [Palacharla Dissertation]
Memory Disambiguation
• Can we “undo” stores?
• Stores cannot be committed to memory until they are marked ready to retire
• Completed stores are queued and wait in a store queue or store buffer
• Disambiguate (and resolve) memory dependencies dynamically
45
Memory Ordering
Source: Alpha 21264 HRM
• A load to X bypassing an older load to X violates certain memory consistency models (e.g., sequential consistency)
• A load-load order trap replays the later load
46
47
Load Store Queue (LSQ)
• Memory instructions are allocated into the LSQ in program order (age-ordered)
• The LSQ manages memory reference ordering
• Unified LSQ vs. split LSQ (separate Store Queue and Load Queue)
• Sandy Bridge: 64 load buffers, 36 store buffers
[Figure: ALLOC feeds the RS, the ROB, and a split LSQ made of a Store Queue and a Load Queue]
48
Issuing a Load for Execution
[Figure: Store Queue (fields: Issued?, age, address, data) and Load Queue (fields: Issued?, age, address); a load with no matching older store is issued to memory for execution]
• Each load checks against older stores
– Associative search
– A scalability and performance concern
49
Issuing a Load for Execution
[Figure: Store Queue (fields: Issued?, age, address, data) and Load Queue (fields: Issued?, age, address); a load that matches an older store receives its data through store-to-load forwarding]
• Implementation dependent: comprehensive size matching can be prohibitively expensive
• Simple method: forward when a larger store (word) precedes a smaller load (half)
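The load-side check can be sketched in Python. Several simplifications are assumed and should be read as such: the function name `issue_load` and the dictionary layout are hypothetical, ages are taken to be unique across loads and stores with smaller meaning older, all accesses are full words (no size matching), and a load stalls conservatively when an older store address is still unknown.

```python
def issue_load(load_age, load_addr, store_queue, memory):
    """Decide how a load gets its data (illustrative sketch).

    store_queue: list of dicts with keys age, addr (None if unknown), data; oldest first.
    """
    older = [s for s in store_queue if s["age"] < load_age]
    # Search youngest-to-oldest so the most recent matching store wins.
    for s in reversed(older):
        if s["addr"] is None:
            return ("stall", None)        # conservative: an older store address is unknown
        if s["addr"] == load_addr:
            return ("forward", s["data"])
    return ("memory", memory.get(load_addr))

store_queue = [
    {"age": 1, "addr": "A", "data": 0x00000001},
    {"age": 2, "addr": "B", "data": 0x12340000},
    {"age": 4, "addr": None, "data": 0xFFFF1111},   # address not computed yet
]
memory = {"C": 7}
r_forward = issue_load(3, "A", store_queue, memory)
r_memory = issue_load(3, "C", store_queue, memory)
r_stall = issue_load(5, "B", store_queue, memory)
```

The load of age 3 to A is forwarded from the store of age 1, the load to C goes to memory, and the load of age 5 stalls behind the store whose address is unknown.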
50
Issuing a Load for Execution
[Figure: Store Queue (fields: Issued?, age, address, data) and Load Queue (fields: Issued?, age, address); a load is speculatively issued for execution while an older store address is still unknown (???)]
• Can speculatively issue loads to shorten latency (Alpha 21264, Pentium 4 (Prescott))
– Naively
– Use a memory dependency predictor
• A store, when its address is ready, checks newer loads in the Load Queue
• “Replay” is needed if the speculation turns out to be incorrect (e.g., Alpha’s store-load replay)
51
Store Checks Premature Loads
[Figure: Store Queue (fields: Issued?, age, address, data) and Load Queue (fields: Issued?, age, address); a store whose address resolves to K finds a newer, already issued load to K in the Load Queue. Conflict detected! Replay the load.]
• A store, when its address is ready, checks newer loads in the Load Queue
– Associative search
• “Replay” is needed if the speculation turns out to be incorrect (e.g., Alpha’s store-load replay)
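The store-side check reduces to a filter over the Load Queue, sketched below in Python. The function name `loads_to_replay`, the dictionary layout, and the example queue contents are assumptions for illustration; ages are again taken to be unique with smaller meaning older.

```python
def loads_to_replay(store_age, store_addr, load_queue):
    """Return loads that issued too early: newer than the store, already issued, same address."""
    return [ld for ld in load_queue
            if ld["issued"] and ld["age"] > store_age and ld["addr"] == store_addr]

load_queue = [
    {"age": 1, "addr": "A", "issued": True},
    {"age": 3, "addr": "K", "issued": True},    # speculatively issued before the store resolved
    {"age": 4, "addr": "K", "issued": False},   # not issued yet, so nothing to undo
    {"age": 5, "addr": "M", "issued": True},
]
replay = loads_to_replay(2, "K", load_queue)    # store of age 2 resolves its address to K
```

Only the issued load of age 3 to K has to be replayed. The unissued load of age 4 will see the store through the normal check when it issues.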
52
Issuing a Store for Execution
[Figure, two steps: Store Queue (fields: Issued?, age, address, data) and Load Queue (fields: Issued?, age, address). In the first step a store is issued to memory; in the second, a store cannot issue for execution.]
• The figure shows the basic concept
• Implementation dependent
– Do not allow a store to bypass a load; this restriction has little impact on performance
– Perform associative search
54
Load-Load Ordering
• Needed for
– Multiprocessor support
– Maintaining the memory consistency model
• Load-load trap invoked
– Trap on the later, conflicting instructions
– Replay
Load Queue:
Issued?  age  address
0        4    A
1        5    D
1        5    C
1        6    A
1        6    M
1        6    N
0        7    K
The issued load to A (age 6) has bypassed the older, unissued load to A (age 4): load-load trap
55