
Managing Wire Delay in
Large CMP Caches
Bradford M. Beckmann
David A. Wood
Multifacet Project
University of Wisconsin-Madison
MICRO 2004
12/8/04
Overview
• Managing wire delay in shared CMP caches
• Three techniques extended to CMPs
  1. On-chip Strided Prefetching (not in talk – see paper)
     – Scientific workloads: 10% average reduction
     – Commercial workloads: 3% average reduction
  2. Cache Block Migration (e.g. D-NUCA)
     – Block sharing limits average reduction to 3%
     – Dependence on difficult-to-implement smart search
  3. On-chip Transmission Lines (e.g. TLC)
     – Reduce runtime by 8% on average
     – Bandwidth contention accounts for 26% of L2 hit latency
• Combining techniques
  + Potentially alleviates isolated deficiencies
  – Up to 19% reduction vs. baseline
  – Implementation complexity
Beckmann & Wood
Managing Wire Delay in Large CMP Caches
2
Current CMP: IBM Power 5
[Diagram: 2 CPUs (CPU 0 and CPU 1), each with private L1 I$ and L1 D$, sharing 3 L2 cache banks]
CMP Trends
[Diagram: an 8-CPU CMP (CPU 0–7), each CPU with private L1 I$ and L1 D$, surrounding many shared L2 banks; the overlay contrasts the reachable distance per cycle in 2004 technology with the much smaller reachable distance in 2010 technology]
Baseline: CMP-SNUCA
[Diagram: 8 CPUs (CPU 0–7) with private L1 I$/D$ caches placed around the chip perimeter, surrounding the shared, banked L2]
Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
Block Migration: CMP-DNUCA
[Diagram: the CMP-SNUCA layout with two blocks, A and B, migrating through the L2 banks toward their requesting CPUs]
On-chip Transmission Lines
• Similar to contemporary off-chip communication
• Provides a different latency / bandwidth tradeoff
• Wires behave more like transmission lines as frequency increases
  – Utilize transmission-line qualities to our advantage
  – No repeaters – route directly over large structures
  – ~10x lower latency across long distances
• Limitations
  – Require thick wires and dielectric spacing
  – Increase manufacturing cost
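The latency claim above can be illustrated with a toy first-order model (all constants here are illustrative assumptions, not figures from the talk): a conventional repeated RC wire advances at a roughly constant delay per mm set by its repeaters, while a transmission line signals at close to the speed of light in the surrounding dielectric.

```python
# Toy first-order latency model for the on-chip wire comparison.
# All constants are illustrative assumptions, not the talk's numbers.

ER = 3.6                      # assumed relative permittivity of the dielectric
C = 3.0e8                     # speed of light in vacuum, m/s
TL_VELOCITY = C / ER ** 0.5   # transmission-line signal velocity, ~1.6e8 m/s

RC_DELAY_PER_MM = 50e-12      # assumed repeated-RC-wire delay: 50 ps per mm

def rc_delay_ps(length_mm):
    """Delay of a conventional repeated RC wire (linear in length)."""
    return length_mm * RC_DELAY_PER_MM * 1e12

def tl_delay_ps(length_mm):
    """Time-of-flight delay of an on-chip transmission line."""
    return (length_mm * 1e-3) / TL_VELOCITY * 1e12

for mm in (5, 10, 20):        # roughly cross-chip distances
    print(f"{mm} mm: RC {rc_delay_ps(mm):.0f} ps, TL {tl_delay_ps(mm):.0f} ps")
```

Under these assumed constants the transmission line is roughly 8x faster over a cross-chip distance, the same order as the ~10x claim; the cost is that thick wires and dielectric spacing leave room for far fewer links, which is the bandwidth limitation the results return to.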
Transmission Lines: CMP-TLC
[Diagram: 8 CPUs (CPU 0–7) with private L1 I$/D$ caches connected to the centrally banked L2 by 16 8-byte transmission-line links]
Combination: CMP-Hybrid
[Diagram: the CMP-SNUCA layout augmented with 8 32-byte transmission-line links connecting the CPUs to the central L2 banks]
Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
Methodology
• Full system simulation
  – Simics
  – Timing model extensions
    • Out-of-order processor
    • Memory system
• Workloads
  – Commercial
    • apache, jbb, oltp, zeus
  – Scientific
    • Splash: barnes & ocean
    • SpecOMP: apsi & fma3d
System Parameters
Memory System
• L1 I & D caches: 64 KB, 2-way, 3 cycles
• Unified L2 cache: 16 MB, 256 x 64 KB banks, 16-way, 6-cycle bank access
• L1 / L2 cache block size: 64 Bytes
• Memory latency: 260 cycles
• Memory bandwidth: 320 GB/s
• Memory size: 4 GB of DRAM
• Outstanding memory requests / CPU: 16
Dynamically Scheduled Processor
• Clock frequency: 10 GHz
• Reorder buffer / scheduler: 128 / 64 entries
• Pipeline width: 4-wide fetch & issue
• Pipeline stages: 30
• Direct branch predictor: 3.5 KB YAGS
• Return address stack: 64 entries
• Indirect branch predictor: 256 entries (cascaded)
Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
CMP-DNUCA: Organization
[Diagram: the L2 banks grouped into bankclusters – Local (nearest each CPU), Inter. (intermediate), and Center]
Hit Distribution: Grayscale Shading
[Diagram: the chip layout with banks shaded in grayscale – darker shading indicates a greater % of L2 hits]
CMP-DNUCA: Migration
• Migration policy
  – Gradual movement
  – Increases local hits and reduces distant hits
[Diagram: blocks migrate from other bankclusters → my center bankcluster → my inter. bankcluster → my local bankcluster]
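A minimal sketch of this gradual-movement policy (the function and cluster names are illustrative, not the paper's implementation): on each L2 hit, the block is promoted one bankcluster step along the chain other → my center → my inter. → my local for the requesting CPU.

```python
# Hedged sketch of the gradual-movement migration policy;
# names are illustrative, not the paper's implementation.

def promotion_chain(cpu):
    """Bankcluster sequence a block climbs for a given CPU."""
    return ["other", f"center[{cpu}]", f"inter[{cpu}]", f"local[{cpu}]"]

def migrate_on_hit(location, cpu):
    """Return the block's bankcluster after an L2 hit by `cpu`."""
    chain = promotion_chain(cpu)
    if location not in chain:                 # hit in some other CPU's cluster:
        return chain[1]                       # pull the block into my center
    i = chain.index(location)
    return chain[min(i + 1, len(chain) - 1)]  # one step closer, saturating
```

A block hit repeatedly by one CPU climbs to that CPU's local bankcluster, while a block shared by several CPUs keeps being pulled back into different center bankclusters – the hit-clustering behavior seen in the OLTP results below.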
CMP-DNUCA: Hit Distribution
[Figure: Ocean, per-CPU L2 hit distribution for CPU 0–7]
CMP-DNUCA: Hit Distribution
[Figure: Ocean, L2 hit distribution across all CPUs]
Block migration successfully separates the data sets
CMP-DNUCA: Hit Distribution
[Figure: OLTP, L2 hit distribution across all CPUs]
CMP-DNUCA: Hit Distribution
[Figure: OLTP, per-CPU L2 hit distribution for CPU 0–7]
Hit Clustering: Most L2 hits satisfied by the center banks
CMP-DNUCA: Search
• Search policy
  – Uniprocessor DNUCA solution: partial tags
    • Quick summary of the L2 tag state at the CPU
    • No known practical implementation for CMPs
      – Size impact of multiple partial tags
      – Coherence between block migrations and partial tag state
  – CMP-DNUCA solution: two-phase search
    • 1st phase: CPU's local, inter., & 4 center banks
    • 2nd phase: remaining 10 banks
    • Slow 2nd phase hits and L2 misses
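The two-phase lookup can be sketched as follows (the probe interface and cluster bookkeeping are illustrative assumptions, not the hardware design): phase 1 checks the six bankclusters closest to the requester, and only on a miss does phase 2 probe the remaining ten.

```python
# Hedged sketch of CMP-DNUCA's two-phase search; the probe
# interface and cluster bookkeeping are illustrative assumptions.

def two_phase_search(cpu, addr, probe, local, inter, center, all_clusters):
    """probe(cluster, addr) -> True on a tag match in that bankcluster.

    1st phase: the CPU's local and inter. bankclusters plus the
    4 center bankclusters (6 probes). 2nd phase: the remaining 10.
    """
    phase1 = [local[cpu], inter[cpu], *center]
    for cluster in phase1:
        if probe(cluster, addr):
            return cluster, 1            # fast first-phase hit
    for cluster in (c for c in all_clusters if c not in phase1):
        if probe(cluster, addr):
            return cluster, 2            # slow second-phase hit
    return None, 2                       # L2 miss, after probing everything
```

The cost structure falls out directly: first-phase hits stay fast, while second-phase hits and true misses pay for both phases – which is why the evaluation also reports a perfect (oracle) search as the upper bound.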
CMP-DNUCA: L2 Hit Latency
[Chart: L2 hit latency in cycles (0–50) for jbb, oltp, ocean, and apsi under CMP-SNUCA, CMP-DNUCA, and perfect CMP-DNUCA]
CMP-DNUCA Summary
• Limited success
  – Ocean successfully splits
    • Regular scientific workload – little sharing
  – OLTP congregates in the center
    • Commercial workload – significant sharing
• Smart search mechanism
  – Necessary for performance improvement
  – No known implementations
  – Upper bound – perfect search
Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
L2 Hit Latency
[Chart: L2 hit latency with bars labeled D: CMP-DNUCA, T: CMP-TLC, H: CMP-Hybrid]
Overall Performance
[Chart: normalized runtime (0–1) for jbb, oltp, ocean, and apsi under CMP-SNUCA, perfect CMP-DNUCA, CMP-TLC, and perfect CMP-Hybrid]
Transmission lines improve L2 hit and L2 miss latency
Conclusions
• Individual Latency Management Techniques
  – Strided Prefetching: subset of misses
  – Cache Block Migration: sharing impedes migration
  – On-chip Transmission Lines: limited bandwidth
• Combination: CMP-Hybrid
  – Potentially alleviates bottlenecks
  – Disadvantages
    • Relies on smart-search mechanism
    • Manufacturing cost of transmission lines
Backup Slides
Strided Prefetching
• Utilize repeatable memory access patterns
  – Subset of misses
  – Tolerates latency within the memory hierarchy
• Our implementation
  – Similar to Power4
  – Unit and non-unit stride misses
[Diagram: prefetchers placed at the L1–L2 and L2–Mem interfaces]
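A minimal sketch of a stride detector in this spirit (the table layout, confirmation threshold, and prefetch degree are illustrative assumptions, not the Power4 or paper design): per miss PC, remember the last miss address and stride; once the same stride repeats, issue prefetches ahead of the stream.

```python
# Hedged sketch of a stride prefetcher in the Power4 spirit.
# Table layout, threshold, and prefetch degree are illustrative
# assumptions, not the talk's exact design.

CONFIRM = 1   # assumed: repeats of a stride required before prefetching
DEGREE = 2    # assumed: blocks prefetched ahead once a stride is confirmed

class StridePrefetcher:
    def __init__(self):
        # miss PC -> (last miss address, current stride, confidence)
        self.table = {}

    def on_miss(self, pc, addr):
        """Record a miss; return the list of addresses to prefetch."""
        last, stride, conf = self.table.get(pc, (None, 0, 0))
        prefetches = []
        if last is not None:
            new_stride = addr - last
            if new_stride == stride and new_stride != 0:
                conf += 1                     # same stride again (unit or non-unit)
                if conf >= CONFIRM:
                    prefetches = [addr + stride * i
                                  for i in range(1, DEGREE + 1)]
            else:
                stride, conf = new_stride, 0  # retrain on the new stride
        self.table[pc] = (addr, stride, conf)
        return prefetches
```

Misses at 0x100, 0x140, 0x180 from one PC confirm a 0x40 stride, so the third miss issues prefetches for 0x1C0 and 0x200; irregular miss streams never confirm a stride and generate no prefetch traffic – which is why prefetching only covers a subset of misses.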
On and Off-chip Prefetching
[Chart: normalized runtime (0–1.2) for the commercial benchmarks (oltp, zeus, jbb, apache) and scientific benchmarks (barnes, ocean, apsi, fma3d) under CMP-SNUCA with no pf, L2 pf, and L1&L2 pf]
CMP Sharing Patterns
CMP Request Distribution
CMP-DNUCA: Search Strategy
[Diagram: the 1st search phase probes the CPU's Local, Inter., and Center bankclusters; the 2nd search phase probes the remaining bankclusters]
Uniprocessor DNUCA: partial tag array for smart searches
Significant implementation complexity for CMP-DNUCA
CMP-DNUCA: Migration Strategy
[Diagram: bankclusters labeled by their role for one CPU – other local, other inter., other center, my center, my inter., my local]
Uncontended Latency Comparison
CMP-DNUCA: L2 Hit Distribution
[Chart: bankcluster hit distribution as % of total L2 hits (0–100%) per benchmark (apache, jbb, oltp, zeus, barnes, ocean, apsi, fma3d), split into local, inter., center, and other bankclusters]
CMP-DNUCA: L2 Hit Latency
[Chart: L2 hit latency in cycles (0–90) for zeus, barnes, ocean, apsi, fma3d, oltp, jbb, and apache under CMP-SNUCA no pf, CMP-DNUCA, and perfect CMP-DNUCA]
CMP-DNUCA: Runtime
[Chart: normalized runtime (0–1.2) for apsi, fma3d, zeus, barnes, ocean, oltp, jbb, and apache under CMP-SNUCA, CMP-DNUCA, and perfect CMP-DNUCA]
CMP-DNUCA Problems
• Hit clustering
  – Shared blocks move within the center
  – Equally far from all processors
• Search complexity
  – 16 separate clusters
  – Partial tags impractical
    • Distributed information
    • Synchronization complexity
CMP-TLC: L2 Hit Latency
[Chart: L2 hit latency with bars labeled D: CMP-DNUCA, T: CMP-TLC]
Runtime: Isolated Techniques
CMP-Hybrid: Performance
[Chart: normalized runtime (0–1.2) for apsi, fma3d, barnes, ocean, zeus, oltp, jbb, and apache under CMP-SNUCA, perfect CMP-DNUCA, CMP-TLC, and perfect CMP-Hybrid]
Energy Efficiency