Federation:
Repurposing Scalar Cores for Out-of-Order Instruction Issue
David Tarjan*, Michael Boyer, and Kevin Skadron*
University of Virginia
Department of Computer Science
* Currently on internship/sabbatical at NVIDIA Research
Motivation
[Figure: three chip organizations compared: Homogeneous, Heterogeneous, and Adaptive (Federation), each drawn with its L2 cache blocks. Legend: multithreaded scalar IO (in-order) core; 2-way OO (out-of-order) core.]
Basic Insights
A multithreaded in-order core has many registers, which can be reused for a reorder buffer or active list
If cores are small, single-cycle communication between neighbors is feasible
Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible
In-order & Out-of-order Pipelines
In-order: Fetch, Decode, Execute, Mem, Writeback
Out-of-order: Bpred, Fetch, Decode, Allocate, Rename, Issue, Execute, Mem, Writeback, Commit
Issue Queue Example
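[Figure: issue queue wakeup table with columns Ready Bits, Subscriber Slot 1, and Subscriber Slot 2, with entries such as IQ2 and IQ3 in the subscriber slots. The individual cell values are not recoverable from the extracted text.]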
Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002
Sassone et al., Matrix Scheduler Reloaded, ISCA 2007
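The subscription idea in the example above can be sketched in a few lines of Python. This is only an illustration, assuming two ready bits and two subscriber slots per entry as in the table header. The class and method names, and the policy of refusing an insert when a producer has no free subscriber slot, are assumptions made for this sketch and are not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class IQEntry:
    # One ready bit per source operand.
    ready: List[bool]
    # Consumers waiting on this entry's result: (consumer slot, operand index).
    subscribers: List[Tuple[int, int]] = field(default_factory=list)


class ConsumerBasedIQ:
    MAX_SUBSCRIBERS = 2  # two subscriber slots per entry, as in the table header

    def __init__(self, size: int) -> None:
        self.entries: List[Optional[IQEntry]] = [None] * size

    def insert(self, slot: int, producers: List[Optional[int]]) -> bool:
        """Place an instruction in `slot`.

        `producers[i]` is the IQ slot of the un-issued instruction that
        produces operand i, or None if that operand is already available.
        Returns False (assumed stall policy) if a producer has no free
        subscriber slot.
        """
        pending = [p for p in producers if p is not None]
        for p in set(pending):
            free = self.MAX_SUBSCRIBERS - len(self.entries[p].subscribers)
            if pending.count(p) > free:
                return False
        for operand, p in enumerate(producers):
            if p is not None:
                self.entries[p].subscribers.append((slot, operand))
        self.entries[slot] = IQEntry(ready=[p is None for p in producers])
        return True

    def ready_slots(self) -> List[int]:
        """Slots whose operands are all ready."""
        return [i for i, e in enumerate(self.entries)
                if e is not None and all(e.ready)]

    def issue(self, slot: int) -> None:
        """Issue the instruction in `slot` and wake its subscribers directly."""
        entry = self.entries[slot]
        for consumer, operand in entry.subscribers:
            self.entries[consumer].ready[operand] = True
        self.entries[slot] = None
```

Wakeup here is a direct table lookup from producer to consumers, with no tag broadcast or associative compare across the queue.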
Simplified Load-Store Queue
Memory Alias Table (MAT)
No store forwarding
No conservative waiting on stores
Only detect memory order violations after they have occurred, and flush the pipeline when the offending instruction commits
Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005
MAT Example
st 0x13, r5
ld r1, 0x13
MAT entry:  0  1  2  3  4  5  6  7
Counter:    0  0  0  0  0  0  0  0
MAT Example
st 0x13, r5
ld r1, 0x13    (EXE: ld executes and increments counter)
MAT entry:  0  1  2  3  4  5  6  7
Counter:    0  0  0  1  0  0  0  0
MAT Example
st 0x13, r5    (COM: st commits and sets flag)
ld r1, 0x13
MAT entry:  0  1  2  3  4  5  6  7
Counter:    0  0  0  1! 0  0  0  0    (! marks the flag)
MAT Example
ld r1, 0x13    (COM: ld commits, sees flag, and flushes pipeline)
MAT entry:  0  1  2  3  4  5  6  7
Counter:    0  0  0  1! 0  0  0  0    (! marks the flag)
Flush
MAT Example
ld r1, 0x13    (MAT is reset and execution resumes)
MAT entry:  0  1  2  3  4  5  6  7
Counter:    0  0  0  0  0  0  0  0
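The five steps of the MAT example can be replayed with a small Python model. This is a sketch under stated assumptions: the table is indexed by the address modulo the number of entries (which sends 0x13 to entry 3 in an 8-entry table, matching the slides), and a load that commits without seeing a flag decrements its counter. Neither the hash function nor the decrement is stated on the slides, and all names are hypothetical.

```python
class MemoryAliasTable:
    def __init__(self, entries: int = 8) -> None:
        self.counters = [0] * entries
        self.flags = [False] * entries

    def _index(self, addr: int) -> int:
        # Assumed hash: address modulo table size (0x13 -> entry 3 of 8).
        return addr % len(self.counters)

    def load_execute(self, addr: int) -> None:
        """A load executes (possibly ahead of older stores): count it."""
        self.counters[self._index(addr)] += 1

    def store_commit(self, addr: int) -> None:
        """A store commits: if a load to this entry already executed, flag it."""
        i = self._index(addr)
        if self.counters[i] > 0:
            self.flags[i] = True

    def load_commit(self, addr: int) -> bool:
        """A load commits. Returns True if the pipeline must be flushed."""
        i = self._index(addr)
        if self.flags[i]:
            self.reset()  # flush: the MAT is cleared and execution resumes
            return True
        self.counters[i] -= 1  # assumption: clean commit releases the count
        return False

    def reset(self) -> None:
        self.counters = [0] * len(self.counters)
        self.flags = [False] * len(self.flags)


# Replay of the slides: the load executes before the older store commits.
mat = MemoryAliasTable()
mat.load_execute(0x13)          # counter[3] becomes 1
mat.store_commit(0x13)          # flag[3] is set
must_flush = mat.load_commit(0x13)
```

Because different addresses can share an entry, a flag may also be set for loads and stores that do not actually alias, which is the false-positive case.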
Performance Impact
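[Figure: bar chart of average IPC loss (axis 0% to 6%) for four simplifications: consumer-based issue queue, pseudo-random scheduling, MAT, and commit-time branch recovery. Bar labels in the extracted text: 5.46%, 2.67%, 1.71%, 0.00%; which value belongs to which bar is not recoverable.]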
Performance
[Figure: bar chart of average IPC (axis 0 to 1.4) for spec, specint, and specfp across Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO.]
Energy Efficiency
[Figure: bar chart of normalized BIPS^3/Watt (axis 0 to 2.5) for spec, specint, and specfp across Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO.]
Area Efficiency
[Figure: bar chart of normalized BIPS^3/(Watt*mm^2) (axis 0 to 1.2) for spec, specint, and specfp across Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO.]
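For readers unfamiliar with the two efficiency metrics, the following Python sketch shows how they are computed. BIPS is billions of instructions per second, that is, IPC times clock frequency in GHz. The function names are hypothetical, the input values in the usage lines are placeholders rather than measurements from the talk, and the slides do not say which configuration serves as the normalization baseline.

```python
from typing import Dict


def bips3_per_watt(ipc: float, freq_ghz: float, watts: float) -> float:
    """Energy-efficiency metric: BIPS^3 / Watt."""
    bips = ipc * freq_ghz  # billions of instructions per second
    return bips ** 3 / watts


def bips3_per_watt_mm2(ipc: float, freq_ghz: float,
                       watts: float, area_mm2: float) -> float:
    """Combined area- and energy-efficiency metric: BIPS^3 / (Watt * mm^2)."""
    return bips3_per_watt(ipc, freq_ghz, watts) / area_mm2


def normalize(values: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Divide every value by the chosen baseline configuration's value."""
    return {name: v / values[baseline] for name, v in values.items()}


# Placeholder inputs (ipc, GHz, watts) for two made-up cores, illustration only.
raw = {
    "core_a": bips3_per_watt(1.0, 2.0, 4.0),
    "core_b": bips3_per_watt(0.5, 2.0, 1.0),
}
relative = normalize(raw, baseline="core_a")
```

Cubing BIPS weights performance much more heavily than power, so a design must save a large amount of power to make up for even a modest IPC loss.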
Conclusions
Two in-order cores can be federated at run time to form a 2-way OO core
Almost doubling the IPC of a throughput core is possible with very little extra hardware
Traditional OO structures are not wanted, because their performance comes at too high a price
Federation offers the best combined area and energy efficiency
Q&A
Backup
Core Fusion Data
Figure from Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors”, ISCA 2007
Overall Results
Scalar in-order core has 8KB I- and D-caches and a 256KB L2
Base 2-way core has 16KB I- and D-caches, a 256KB L2, a 32-entry ROB, a 16-entry issue queue, a 16-entry LSQ, and a bimodal bpred
4-way core has 32KB I- and D-caches, a 2MB L2, a 128-entry ROB, a 32-entry IQ and LSQ, and a tournament bpred
Branch Prediction
Use only a Next Line and Set (NLS) predictor, a bimodal predictor, and a Return Address Stack (RAS)
NLS is adequate if the instruction working set does not exceed the I$ size
A small bimodal predictor is adequate for a small-window processor
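As a reminder of how small these two structures are, here is a minimal Python sketch of a bimodal predictor built from 2-bit saturating counters and of a return address stack. The table size, stack depth, indexing by word-aligned PC, and the weakly-not-taken initial state are assumptions for illustration; the slides do not give the parameters used.

```python
from typing import List, Optional


class BimodalPredictor:
    def __init__(self, entries: int = 1024) -> None:
        # 2-bit saturating counters: 0,1 predict not taken; 2,3 predict taken.
        self.counters: List[int] = [1] * entries  # start weakly not taken

    def _index(self, pc: int) -> int:
        # Assumes 4-byte instructions, so the low two PC bits are dropped.
        return (pc >> 2) % len(self.counters)

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class ReturnAddressStack:
    def __init__(self, depth: int = 8) -> None:
        self.depth = depth
        self.stack: List[int] = []

    def push(self, return_pc: int) -> None:
        """On a call: remember where the matching return should go."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overwrite the oldest entry when full
        self.stack.append(return_pc)

    def pop(self) -> Optional[int]:
        """On a return: predicted target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None
```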
Fetch
Two I$'s act as an I$ of twice the size and associativity (and random replacement)
More logic and buffers to capture two instructions
Extra cycle to route instructions from the two I$'s to the two decoders
Decode
Cancel the second instruction if the first turns out to be a branch
Extra cycle to route decoded instructions to the new allocate stage
Allocate
New logic and free lists to allocate ROB and IQ entries
Rename
The rename table is a new structure, since it needs too many ports
One centralized rename table, not distributed
Has a separate table (or a field in each RAT entry) holding the IQ slot number of each register's producer instruction (see our new issue queue)
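One way to picture the extra producer field is the following Python sketch of a centralized RAT. It assumes each entry holds a tag for the ROB entry with the newest value plus the IQ slot of the producer while that producer has not yet issued. The field names, the use of None for "architectural value" and "nothing to wait for", and the clearing of the field at issue are assumptions for illustration, not details from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RATEntry:
    # ROB entry holding the newest value; None means the architectural register.
    tag: Optional[int] = None
    # IQ slot of the producer while it is still waiting to issue, else None.
    producer_slot: Optional[int] = None


class RenameTable:
    def __init__(self, num_regs: int = 32) -> None:
        self.entries: List[RATEntry] = [RATEntry() for _ in range(num_regs)]

    def rename(self, srcs: List[int], dest: Optional[int],
               rob_entry: int, iq_slot: int
               ) -> List[Tuple[Optional[int], Optional[int]]]:
        """Rename one instruction.

        Returns, per source register, (tag, producer IQ slot). The producer
        slot tells the issue queue which entry this instruction subscribes to.
        """
        # Read sources before writing the destination (handles r3 = r3 + 1).
        sources = [(self.entries[r].tag, self.entries[r].producer_slot)
                   for r in srcs]
        if dest is not None:
            self.entries[dest] = RATEntry(tag=rob_entry, producer_slot=iq_slot)
        return sources

    def producer_issued(self, iq_slot: int) -> None:
        """Once a producer issues, later consumers need not subscribe to it."""
        for entry in self.entries:
            if entry.producer_slot == iq_slot:
                entry.producer_slot = None
```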
Issue
Uses a simple lookup table as the wakeup structure, where instructions subscribe to their input instructions (explained in detail later)
Centralized: one IQ for the two cores
Register File
Register file is mirrored in the two cores
No extra copy instructions or load-balancing questions
Execute
Add an extra cycle for copying the result to the other core's register file (like EV6)
Memory Access
The two D$s are checked in parallel, each responsible for half of the merged D$'s ways
No standard LSQ, only a Memory Alias Table (details later)
The MAT only detects ordering violations and sends a signal to the pipeline
Commit
Centralized commit, no slippage
Recover from branch mispredictions at commit, since the RAT is not checkpointed on branches
Recover from memory order violations (or false positives) signaled by the MAT
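Commit-time recovery can be sketched as follows in Python. The idea is that when the offending instruction reaches the head of the ROB, everything older has already committed, so squashing all younger entries leaves only architectural state and no RAT checkpoint is needed. The class and field names are hypothetical, and the choice to squash and re-execute a flagged load (rather than commit it) is an assumption based on the MAT example, where the load is still present after the flush.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional, Tuple


@dataclass
class Instr:
    name: str
    mispredicted_branch: bool = False
    mat_flagged_load: bool = False  # MAT reported a (possible) order violation


class CommitStage:
    def __init__(self) -> None:
        self.rob: Deque[Instr] = deque()  # oldest instruction at the left

    def dispatch(self, instr: Instr) -> None:
        self.rob.append(instr)

    def _flush(self) -> List[str]:
        """Squash everything still in the ROB and return the squashed names."""
        squashed = [i.name for i in self.rob]
        self.rob.clear()
        return squashed

    def commit_one(self) -> Tuple[Optional[str], List[str]]:
        """Retire the ROB head. Returns (committed name or None, squashed names)."""
        head = self.rob.popleft()
        if head.mat_flagged_load:
            # The load itself must re-execute, so it is squashed as well.
            return None, [head.name] + self._flush()
        if head.mispredicted_branch:
            # The branch commits; all younger (wrong-path) work is squashed.
            return head.name, self._flush()
        return head.name, []
```

The cost of this simplification is that a mispredicted branch is only repaired when it reaches commit, not when it resolves in the execute stage.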