The Migration Prefetcher
HiPEAC 2012, Paris (France) – January 23, 2012
Javier Lira (Intel-UPC, Spain)
Timothy M. Jones (U. of Cambridge, UK)
Carlos Molina (URV, Spain)
Antonio González (Intel-UPC, Spain)
CMPs have become the dominant paradigm, and they incorporate large shared last-level caches: Intel Nehalem (24 MBytes), IBM POWER7 (32 MBytes), Tilera Tile-GX (32 MBytes). Access latency in such large caches is dominated by wire delays.
NUCA [1] divides a large cache into smaller and faster banks. The cache access latency consists of the routing latency plus the bank access latency, so banks close to the cache controller have smaller latencies than farther banks.

[1] Kim et al. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Architectures. ASPLOS'02
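The latency decomposition above can be illustrated with a toy model (not the simulator's actual timing). The per-hop router and wire delays and the bank access latency match the configuration given later in the methodology; the hop counts are hypothetical:

```python
# Toy model of NUCA access latency: routing to the bank and back,
# plus the bank access itself. Delay values (router = 1 cycle,
# wire = 1 cycle, bank = 4 cycles) match the simulated configuration;
# the hop counts used below are illustrative assumptions.

ROUTER_DELAY = 1   # cycles per router traversal
WIRE_DELAY = 1     # cycles per on-chip link
BANK_LATENCY = 4   # cycles to access a NUCA bank

def nuca_access_latency(hops):
    """Round-trip routing latency plus one bank access."""
    routing = hops * (ROUTER_DELAY + WIRE_DELAY)
    return 2 * routing + BANK_LATENCY

# A close bank (1 hop) is much cheaper than a distant one (10 hops):
print(nuca_access_latency(1))   # 8 cycles
print(nuca_access_latency(10))  # 44 cycles
```

This is why concentrating frequently used data in the banks nearest the requesting core pays off.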
Unlike S-NUCA's static mapping, in a D-NUCA organization data can be mapped to multiple banks, and migration allows data to adapt to the application's behaviour. Migration movements are effective, but about 50% of hits still happen in non-optimal banks.
Introduction
Methodology
The Migration Prefetcher
Analysis of results
Conclusions
Baseline D-NUCA organization [2]:
◦ Placement: 16 possible bank positions per data block.
◦ Access: partitioned multicast.
◦ Migration: gradual promotion.
◦ Replacement: LRU + zero-copy.

[2] Beckmann and Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. MICRO'04
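As a rough software sketch of the gradual-promotion policy above (my reading of the Beckmann and Wood scheme, not code from the talk): on a hit, the block swaps places with the block one bank closer to the requesting core, so hot data gradually drifts toward its consumer.

```python
# Sketch of gradual promotion along a column of banks. Assumption:
# banks are ordered from index 0 (closest to the core) to N-1
# (farthest). On a hit, the accessed block swaps with the block in
# the next-closer bank.

def promote(bankcol, hit_index):
    """Move the block hit at `hit_index` one bank closer to the core."""
    if hit_index > 0:
        bankcol[hit_index - 1], bankcol[hit_index] = (
            bankcol[hit_index], bankcol[hit_index - 1])
        return hit_index - 1
    return hit_index  # already in the closest bank

banks = ["A", "B", "C", "D"]
promote(banks, 2)   # a hit on "C" moves it one step closer
print(banks)        # ['A', 'C', 'B', 'D']
```

Repeated hits keep promoting the block one step at a time, which is effective but slow; that latency is exactly what the Migration Prefetcher tries to hide.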
The simulation infrastructure is Simics running Solaris 10 on 8 UltraSPARC IIIi cores, extended with the GEMS toolset (Ruby memory model, Garnet network model, Orion power model), using SPEC CPU2006 and PARSEC workloads.

Number of cores: 8 x UltraSPARC IIIi
Frequency: 1.5 GHz
Main memory size: 4 GBytes
Memory bandwidth: 512 Bytes/cycle
Private L1 caches: 8 x 32 KBytes, 2-way
Shared L2 NUCA cache: 8 MBytes, 128 banks
NUCA bank: 64 KBytes, 8-way
L1 cache latency: 3 cycles
NUCA bank latency: 4 cycles
Router delay: 1 cycle
On-chip wire delay: 1 cycle
Main memory latency: 250 cycles (from core)
The Migration Prefetcher
The Migration Prefetcher applies prefetching principles to data migration. It is not a traditional prefetcher:
◦ It does not bring data from main memory.
◦ Its potential benefits are therefore more restricted.
It requires only simple data correlation.
[Figure: each core has a Next Address Table (NAT). An entry maps an address (@) to the next address accessed after it and the NUCA bank where that block was last found. On an access to block A, the NAT predicts block B in bank 5, and the data block is migrated towards the requesting core.]
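The NAT mechanism above can be modelled in software. This is an illustrative sketch, not RTL, and the field names are mine: each entry records the next address seen after a given address, the bank where that next block last resided, and a 1-bit confidence flag.

```python
# Illustrative model of the Migration Prefetcher's NAT (Next Address
# Table). An entry holds the next address observed after a given
# address, the bank where that block was last found, and a 1-bit
# confidence flag. Field layout and names are assumptions.

class NAT:
    def __init__(self):
        self.table = {}        # addr -> [next_addr, bank, confidence_bit]
        self.last_addr = None

    def access(self, addr, bank):
        """Record an access; return (next_addr, bank) to prefetch, or None."""
        # Learn: link the previously seen address to the current one.
        if self.last_addr is not None:
            entry = self.table.get(self.last_addr)
            if entry and entry[0] == addr:
                entry[1] = bank
                entry[2] = 1          # prediction confirmed: set confidence
            else:
                self.table[self.last_addr] = [addr, bank, 0]
        self.last_addr = addr
        # Predict: if confident, migrate the predicted block toward us.
        entry = self.table.get(addr)
        if entry and entry[2] == 1:
            return entry[0], entry[1]
        return None

nat = NAT()
nat.access(0xA, 5)
nat.access(0xB, 3)                 # learn A -> B (confidence still 0)
nat.access(0xA, 5)
nat.access(0xB, 3)                 # second A -> B confirms the pattern
print(nat.access(0xA, 5))          # (11, 3): prefetch block 0xB from bank 3
```

The first occurrence of a pattern only trains the table; the prefetch is issued once the 1-bit confidence is set, matching the confidence analysis on the next slide.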
• 1 confidence bit is effective.
• More than 1 bit is not worthwhile.
Figure: fraction of prefetching requests that ended up being useful.
• With 12-14 addressable bits, about 25% of prefetches are issued with another address's (aliased) information.
• A NAT with 12 addressable bits takes 232 KBytes in total.
Figure: percentage of prefetching requests submitted with another address's information.
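The storage figures above can be sanity-checked with simple arithmetic. The 58-bit entry width below is back-computed from the quoted 29 KBytes per table; the talk gives no field-by-field breakdown, so that width is an inference:

```python
# Back-of-the-envelope check of the NAT storage figures. The 58-bit
# entry width is inferred from the quoted 29 KBytes per table
# (e.g. tag + next address + bank + confidence); it is not stated
# in the talk.

ADDRESSABLE_BITS = 12
ENTRIES = 2 ** ADDRESSABLE_BITS          # 4096 entries per table
ENTRY_BITS = 58                          # inferred entry width
CORES = 8                                # one NAT per core

table_bytes = ENTRIES * ENTRY_BITS // 8
print(table_bytes // 1024)               # 29 (KBytes per table)
print(CORES * table_bytes // 1024)       # 232 (KBytes for all 8 tables)
```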
• Predicting the data location based on its last appearance provides 50% accuracy.
• Accuracy increases when the requester's local bank is also accessed.
Figure: percentage of prefetching requests that are found in the NUCA cache.
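The "last responder + local" search scheme can be sketched as follows; this is my reading of the slide, and the function and variable names are hypothetical:

```python
# Sketch of the "last responder + local" search scheme. A prefetched
# block is looked up first in the bank that served it last time, then
# in the requester's local bank, and only then via the regular
# (multicast) search. Names are hypothetical.

def find_block(addr, last_bank, local_bank, banks):
    """Return the bank holding `addr`, probing cheap candidates first."""
    for bank in (last_bank, local_bank):
        if addr in banks[bank]:
            return bank
    # Fall back to searching all candidate banks (multicast).
    for bank, contents in enumerate(banks):
        if addr in contents:
            return bank
    return None  # miss: block not in the NUCA cache

banks = [set(), {0xB}, set(), {0xC}]
print(find_block(0xB, 1, 0, banks))   # 1: found in the last-responder bank
print(find_block(0xC, 1, 0, banks))   # 3: fell back to the full search
```

Probing the two cheap candidates first captures the common case (the block has not moved since its last use) without paying the full multicast cost.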
The realistic Migration Prefetcher uses:
◦ 1-bit confidence for data patterns.
◦ A NAT with 12 addressable bits (29 KBytes/table).
◦ Last responder + local bank as the search scheme.
The total hardware overhead is 264 KBytes, with a latency of 2 cycles.
Analysis of results
With the Migration Prefetcher, the NUCA cache is up to 25% faster, and its latency is reduced by 15% on average. This yields overall performance improvements of 4% on average, and up to 17%.
The prefetcher introduces extra traffic into the network, but in case of a hit it reduces the number of messages significantly. Overall, this technique does not increase energy consumption.
Conclusions
Existing migration techniques effectively concentrate the most frequently accessed data in banks that are close to the cores, yet about 50% of hits in the NUCA cache still occur in non-optimal banks. The Migration Prefetcher anticipates migrations based on past access behaviour. It reduces the average NUCA latency by 15% and outperforms the baseline configuration by 4% on average, without increasing energy consumption.
Questions?