
Managing Wire Delay in
Large CMP Caches
Bradford M. Beckmann
David A. Wood
Multifacet Project
University of Wisconsin-Madison
MICRO 2004
12/8/04
Overview
• Managing wire delay in shared CMP caches
• Three techniques extended to CMPs
  1. On-chip Strided Prefetching (not in talk – see paper)
     – Scientific workloads: 10% average reduction
     – Commercial workloads: 3% average reduction
  2. Cache Block Migration (e.g. D-NUCA)
     – Block sharing limits average reduction to 3%
     – Dependence on difficult-to-implement smart search
  3. On-chip Transmission Lines (e.g. TLC)
     – Reduce runtime by 8% on average
     – Bandwidth contention accounts for 26% of L2 hit latency
• Combining techniques
  + Potentially alleviates isolated deficiencies
  – Up to 19% reduction vs. baseline
  – Implementation complexity
Beckmann & Wood
Managing Wire Delay in Large CMP Caches
2
Current CMP: IBM Power 5
[Diagram: 2 CPUs (CPU 0 and CPU 1), each with private L1 I$ and L1 D$, sharing 3 L2 cache banks]
CMP Trends
[Diagram: an 8-CPU CMP (CPU 0–7), each CPU with private L1 I$ and L1 D$, surrounding many shared L2 banks; the overlay contrasts the reachable distance per cycle in 2004 technology with the much smaller reachable distance in 2010 technology]
Baseline: CMP-SNUCA
[Diagram: 8 CPUs (CPU 0–7) with private L1 I$/D$ caches placed around the chip perimeter, surrounding the shared, banked L2]
Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
Block Migration: CMP-DNUCA
[Diagram: the CMP-SNUCA layout with two blocks, A and B, migrating through the L2 banks toward their requesting CPUs]
On-chip Transmission Lines
• Similar to contemporary off-chip communication
• Provides a different latency / bandwidth tradeoff
• Wires behave more like transmission lines as frequency increases
  – Utilize transmission-line qualities to our advantage
  – No repeaters – route directly over large structures
  – ~10x lower latency across long distances
• Limitations
  – Require thick wires and dielectric spacing
  – Increase manufacturing cost
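The latency claim above can be illustrated with a toy first-order model (all constants here are illustrative assumptions, not figures from the talk): a conventional repeated RC wire advances at a roughly constant delay per mm set by its repeaters, while a transmission line signals at close to the speed of light in the surrounding dielectric.

```python
# Toy first-order latency model for the on-chip wire comparison.
# All constants are illustrative assumptions, not the talk's numbers.

ER = 3.6                      # assumed relative permittivity of the dielectric
C = 3.0e8                     # speed of light in vacuum, m/s
TL_VELOCITY = C / ER ** 0.5   # transmission-line signal velocity, ~1.6e8 m/s

RC_DELAY_PER_MM = 50e-12      # assumed repeated-RC-wire delay: 50 ps per mm

def rc_delay_ps(length_mm):
    """Delay of a conventional repeated RC wire (linear in length)."""
    return length_mm * RC_DELAY_PER_MM * 1e12

def tl_delay_ps(length_mm):
    """Time-of-flight delay of an on-chip transmission line."""
    return (length_mm * 1e-3) / TL_VELOCITY * 1e12

for mm in (5, 10, 20):        # roughly cross-chip distances
    print(f"{mm} mm: RC {rc_delay_ps(mm):.0f} ps, TL {tl_delay_ps(mm):.0f} ps")
```

Under these assumed constants the transmission line is roughly 8x faster over a cross-chip distance, the same order as the ~10x claim; the cost is that thick wires and dielectric spacing leave room for far fewer links, which is the bandwidth limitation the results return to.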
Transmission Lines: CMP-TLC
[Diagram: 8 CPUs (CPU 0–7) with private L1 I$/D$ caches connected to the centrally banked L2 by 16 8-byte transmission-line links]
Combination: CMP-Hybrid
[Diagram: the CMP-SNUCA layout augmented with 8 32-byte transmission-line links connecting the CPUs to the central L2 banks]
Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
Methodology
• Full system simulation
  – Simics
  – Timing model extensions
    • Out-of-order processor
    • Memory system
• Workloads
  – Commercial
    • apache, jbb, oltp, zeus
  – Scientific
    • Splash: barnes & ocean
    • SpecOMP: apsi & fma3d
System Parameters
Memory System
• L1 I & D caches: 64 KB, 2-way, 3 cycles
• Unified L2 cache: 16 MB, 256 x 64 KB banks, 16-way, 6-cycle bank access
• L1 / L2 cache block size: 64 Bytes
• Memory latency: 260 cycles
• Memory bandwidth: 320 GB/s
• Memory size: 4 GB of DRAM
• Outstanding memory requests / CPU: 16
Dynamically Scheduled Processor
• Clock frequency: 10 GHz
• Reorder buffer / scheduler: 128 / 64 entries
• Pipeline width: 4-wide fetch & issue
• Pipeline stages: 30
• Direct branch predictor: 3.5 KB YAGS
• Return address stack: 64 entries
• Indirect branch predictor: 256 entries (cascaded)
Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
CMP-DNUCA: Organization
[Diagram: the L2 banks grouped into bankclusters – Local (nearest each CPU), Inter. (intermediate), and Center]
Hit Distribution: Grayscale Shading
[Diagram: the chip layout with banks shaded in grayscale – darker shading indicates a greater % of L2 hits]
CMP-DNUCA: Migration
• Migration policy
  – Gradual movement
  – Increases local hits and reduces distant hits
[Diagram: blocks migrate from other bankclusters → my center bankcluster → my inter. bankcluster → my local bankcluster]
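A minimal sketch of this gradual-movement policy (the function and cluster names are illustrative, not the paper's implementation): on each L2 hit, the block is promoted one bankcluster step along the chain other → my center → my inter. → my local for the requesting CPU.

```python
# Hedged sketch of the gradual-movement migration policy;
# names are illustrative, not the paper's implementation.

def promotion_chain(cpu):
    """Bankcluster sequence a block climbs for a given CPU."""
    return ["other", f"center[{cpu}]", f"inter[{cpu}]", f"local[{cpu}]"]

def migrate_on_hit(location, cpu):
    """Return the block's bankcluster after an L2 hit by `cpu`."""
    chain = promotion_chain(cpu)
    if location not in chain:                 # hit in some other CPU's cluster:
        return chain[1]                       # pull the block into my center
    i = chain.index(location)
    return chain[min(i + 1, len(chain) - 1)]  # one step closer, saturating
```

A block hit repeatedly by one CPU climbs to that CPU's local bankcluster, while a block shared by several CPUs keeps being pulled back into different center bankclusters – the hit-clustering behavior seen in the OLTP results below.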
CMP-DNUCA: Hit Distribution
[Figure: Ocean, per-CPU L2 hit distribution for CPU 0–7]
CMP-DNUCA: Hit Distribution
[Figure: Ocean, L2 hit distribution across all CPUs]
Block migration successfully separates the data sets
CMP-DNUCA: Hit Distribution
[Figure: OLTP, L2 hit distribution across all CPUs]
CMP-DNUCA: Hit Distribution
[Figure: OLTP, per-CPU L2 hit distribution for CPU 0–7]
Hit Clustering: Most L2 hits satisfied by the center banks
CMP-DNUCA: Search
• Search policy
  – Uniprocessor DNUCA solution: partial tags
    • Quick summary of the L2 tag state at the CPU
    • No known practical implementation for CMPs
      – Size impact of multiple partial tags
      – Coherence between block migrations and partial tag state
  – CMP-DNUCA solution: two-phase search
    • 1st phase: CPU's local, inter., & 4 center banks
    • 2nd phase: remaining 10 banks
    • Slow 2nd phase hits and L2 misses
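The two-phase lookup can be sketched as follows (the probe interface and cluster bookkeeping are illustrative assumptions, not the hardware design): phase 1 checks the six bankclusters closest to the requester, and only on a miss does phase 2 probe the remaining ten.

```python
# Hedged sketch of CMP-DNUCA's two-phase search; the probe
# interface and cluster bookkeeping are illustrative assumptions.

def two_phase_search(cpu, addr, probe, local, inter, center, all_clusters):
    """probe(cluster, addr) -> True on a tag match in that bankcluster.

    1st phase: the CPU's local and inter. bankclusters plus the
    4 center bankclusters (6 probes). 2nd phase: the remaining 10.
    """
    phase1 = [local[cpu], inter[cpu], *center]
    for cluster in phase1:
        if probe(cluster, addr):
            return cluster, 1            # fast first-phase hit
    for cluster in (c for c in all_clusters if c not in phase1):
        if probe(cluster, addr):
            return cluster, 2            # slow second-phase hit
    return None, 2                       # L2 miss, after probing everything
```

The cost structure falls out directly: first-phase hits stay fast, while second-phase hits and true misses pay for both phases – which is why the evaluation also reports a perfect (oracle) search as the upper bound.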
CMP-DNUCA: L2 Hit Latency
[Chart: L2 hit latency in cycles (0–50) for jbb, oltp, ocean, and apsi under CMP-SNUCA, CMP-DNUCA, and perfect CMP-DNUCA]
CMP-DNUCA Summary
• Limited success
  – Ocean successfully splits
    • Regular scientific workload – little sharing
  – OLTP congregates in the center
    • Commercial workload – significant sharing
• Smart search mechanism
  – Necessary for performance improvement
  – No known implementations
  – Upper bound – perfect search
Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
L2 Hit Latency
[Chart: L2 hit latency with bars labeled D: CMP-DNUCA, T: CMP-TLC, H: CMP-Hybrid]
Overall Performance
[Chart: normalized runtime (0–1) for jbb, oltp, ocean, and apsi under CMP-SNUCA, perfect CMP-DNUCA, CMP-TLC, and perfect CMP-Hybrid]
Transmission lines improve L2 hit and L2 miss latency
Conclusions
• Individual Latency Management Techniques
  – Strided Prefetching: subset of misses
  – Cache Block Migration: sharing impedes migration
  – On-chip Transmission Lines: limited bandwidth
• Combination: CMP-Hybrid
  – Potentially alleviates bottlenecks
  – Disadvantages
    • Relies on smart-search mechanism
    • Manufacturing cost of transmission lines
Backup Slides
Strided Prefetching
• Utilize repeatable memory access patterns
  – Subset of misses
  – Tolerates latency within the memory hierarchy
• Our implementation
  – Similar to Power4
  – Unit and non-unit stride misses
[Diagram: prefetchers placed at the L1–L2 and L2–Mem interfaces]
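A minimal sketch of a stride detector in this spirit (the table layout, confirmation threshold, and prefetch degree are illustrative assumptions, not the Power4 or paper design): per miss PC, remember the last miss address and stride; once the same stride repeats, issue prefetches ahead of the stream.

```python
# Hedged sketch of a stride prefetcher in the Power4 spirit.
# Table layout, threshold, and prefetch degree are illustrative
# assumptions, not the talk's exact design.

CONFIRM = 1   # assumed: repeats of a stride required before prefetching
DEGREE = 2    # assumed: blocks prefetched ahead once a stride is confirmed

class StridePrefetcher:
    def __init__(self):
        # miss PC -> (last miss address, current stride, confidence)
        self.table = {}

    def on_miss(self, pc, addr):
        """Record a miss; return the list of addresses to prefetch."""
        last, stride, conf = self.table.get(pc, (None, 0, 0))
        prefetches = []
        if last is not None:
            new_stride = addr - last
            if new_stride == stride and new_stride != 0:
                conf += 1                     # same stride again (unit or non-unit)
                if conf >= CONFIRM:
                    prefetches = [addr + stride * i
                                  for i in range(1, DEGREE + 1)]
            else:
                stride, conf = new_stride, 0  # retrain on the new stride
        self.table[pc] = (addr, stride, conf)
        return prefetches
```

Misses at 0x100, 0x140, 0x180 from one PC confirm a 0x40 stride, so the third miss issues prefetches for 0x1C0 and 0x200; irregular miss streams never confirm a stride and generate no prefetch traffic – which is why prefetching only covers a subset of misses.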
On and Off-chip Prefetching
[Chart: normalized runtime (0–1.2) for the commercial benchmarks (oltp, zeus, jbb, apache) and scientific benchmarks (barnes, ocean, apsi, fma3d) under CMP-SNUCA with no pf, L2 pf, and L1&L2 pf]
CMP Sharing Patterns
CMP Request Distribution
CMP-DNUCA: Search Strategy
[Diagram: the 1st search phase probes the CPU's Local, Inter., and Center bankclusters; the 2nd search phase probes the remaining bankclusters]
Uniprocessor DNUCA: partial tag array for smart searches
Significant implementation complexity for CMP-DNUCA
CMP-DNUCA: Migration Strategy
[Diagram: bankclusters labeled by their role for one CPU – other local, other inter., other center, my center, my inter., my local]
Uncontended Latency Comparison
CMP-DNUCA: L2 Hit Distribution
[Chart: bankcluster hit distribution as % of total L2 hits (0–100%) per benchmark (apache, jbb, oltp, zeus, barnes, ocean, apsi, fma3d), split into local, inter., center, and other bankclusters]
CMP-DNUCA: L2 Hit Latency
[Chart: L2 hit latency in cycles (0–90) for zeus, barnes, ocean, apsi, fma3d, oltp, jbb, and apache under CMP-SNUCA no pf, CMP-DNUCA, and perfect CMP-DNUCA]
CMP-DNUCA: Runtime
[Chart: normalized runtime (0–1.2) for apsi, fma3d, zeus, barnes, ocean, oltp, jbb, and apache under CMP-SNUCA, CMP-DNUCA, and perfect CMP-DNUCA]
CMP-DNUCA Problems
• Hit clustering
  – Shared blocks move within the center
  – Equally far from all processors
• Search complexity
  – 16 separate clusters
  – Partial tags impractical
    • Distributed information
    • Synchronization complexity
CMP-TLC: L2 Hit Latency
[Chart: L2 hit latency with bars labeled D: CMP-DNUCA, T: CMP-TLC]
Runtime: Isolated Techniques
CMP-Hybrid: Performance
[Chart: normalized runtime (0–1.2) for apsi, fma3d, barnes, ocean, zeus, oltp, jbb, and apache under CMP-SNUCA, perfect CMP-DNUCA, CMP-TLC, and perfect CMP-Hybrid]
Energy Efficiency