
Adaptive Insertion Policies
for Managing Shared Caches
Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi,
Julien Sebot, Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD
[email protected]
International Conference on Parallel Architectures and Compilation Techniques (PACT)
Paper Motivation
[Figure: cache hierarchy evolution: Single Core (SMT) with private FLC and LLC; Dual Core (ST/SMT) with per-core FLCs and a shared LLC; Quad-Core (ST/SMT) with per-core FLCs and MLCs and a shared LLC]
• Shared caches are common, and more so with increasing # of cores
• ↑ # concurrent applications → ↑ contention for the shared cache
• High Performance → manage the shared cache efficiently
2
Problems with LRU-Managed Shared Caches

• Conventional LRU policy allocates resources based on rate of demand
  – Applications that do not benefit from cache cause destructive cache interference

[Figure: Misses Per 1000 Instr (under LRU) for soplex and h264ref, and Cache Occupancy Under LRU Replacement (2MB Shared Cache)]
3
Addressing Shared Cache Performance

• Conventional LRU policy allocates resources based on rate of demand
  – Applications that do not benefit from cache cause destructive cache interference
• Cache Partitioning: reserves cache resources based on application benefit rather than rate of demand
  ✗ HW to detect cache benefit
  ✗ Changes to existing cache structure
  ✗ Not scalable to large # of applications

[Figure: Misses Per 1000 Instr (under LRU) for soplex and h264ref, and Cache Occupancy Under LRU Replacement (2MB Shared Cache)]

Eliminate Drawbacks of Cache Partitioning
4
Paper Contributions
• Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit
• Goals: Design a dynamic hardware mechanism that:
  1. Provides High Performance by Allocating Cache on a Benefit-basis
  2. Is Robust Across Different Concurrently Executing Applications
  3. Scales to Large Number of Competing Applications
  4. Requires Low Design Overhead
• Solution: Thread-Aware Dynamic Insertion Policy (TADIP) that improves average throughput by 12-18% for 2, 4, 8, and 16-core systems with < two bytes of storage per HW-thread
TADIP, Unlike Cache Partitioning, DOES NOT Attempt to Reserve Cache Space
5
Review Insertion Policies
“Adaptive Insertion Policies for High-Performance Caching”
Moinuddin Qureshi, Aamer Jaleel, Yale Patt, Simon Steely Jr., Joel Emer
Appeared in ISCA’07
6
Cache Replacement 101 – ISCA’07
Two components of cache replacement:
• Victim Selection:
  – Which line to replace for the incoming line? (e.g., LRU, Random)
• Insertion Policy:
  – With what priority is the new line placed in the replacement list? (e.g., insert new line into MRU position)

Simple changes to the insertion policy can minimize cache thrashing and improve cache performance for memory-intensive workloads (a minimal code sketch of these two components follows below)
7
Static Insertion Policies – ISCA’07
• Conventional (MRU Insertion) Policy:
  – Choose victim, promote the incoming line to MRU
• LRU Insertion Policy (LIP):
  – Choose victim, DO NOT promote to MRU
  – Unless reused, lines stay at the LRU position
• Bimodal Insertion Policy (BIP):
  – LIP does not age older lines
  – Infrequently insert some misses at MRU
  – Bimodal throttle: b (we used b ≈ 3%)

[Figure: recency stack MRU → LRU holding lines a b c d e f g h; a reference to 'i' inserts at MRU under the conventional LRU policy, at LRU under LIP, and under BIP:]

if( rand() < b )
    Insert at MRU position
else
    Insert at LRU position
Applications Prefer Either Conventional LRU or BIP…
8
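Building on the earlier sketch, here is the slide's pseudocode fleshed out in C: LIP always inserts at the LRU position, and BIP occasionally (with probability b, roughly 3%) inserts at MRU. Using rand() for the bimodal throttle is an assumption for illustration; real hardware would use a cheap pseudo-random or counter-based throttle.

```c
#include <stddef.h>
#include <stdlib.h>

#define WAYS 8
typedef struct { unsigned long tag[WAYS]; } set_t;  /* index 0 = MRU */

/* From the earlier sketch. */
size_t select_victim(const set_t *s);
void insert_at(set_t *s, size_t victim, unsigned long tag, size_t pos);

#define BIP_THROTTLE_PCT 3   /* the slide's b ~= 3% */

/* LIP: choose victim, DO NOT promote to MRU; unless reused (and
 * promoted on a hit), the new line stays at LRU and is evicted next. */
void fill_lip(set_t *s, unsigned long tag) {
    insert_at(s, select_victim(s), tag, WAYS - 1);
}

/* BIP: like LIP, but infrequently insert at MRU so long-resident
 * lines still age out of the cache. */
void fill_bip(set_t *s, unsigned long tag) {
    if (rand() % 100 < BIP_THROTTLE_PCT)
        insert_at(s, select_victim(s), tag, 0);        /* rare MRU insert */
    else
        insert_at(s, select_victim(s), tag, WAYS - 1); /* usual LRU insert */
}
```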
Dynamic Insertion Policy (DIP) via “Set-Dueling” – ISCA’07
HW Required: 10 bits + Combinational Logic

• Set Dueling Monitors (SDMs): dedicated sets that estimate the performance of a predefined policy
• Divide the cache into three:
  – SDM-LRU: dedicated LRU sets
  – SDM-BIP: dedicated BIP sets
  – Follower sets
• PSEL: n-bit saturating counter
  – misses to SDM-LRU: PSEL++
  – misses to SDM-BIP: PSEL--
• Follower sets' insertion policy:
  – Use LRU if PSEL MSB = 0
  – Use BIP if PSEL MSB = 1

[Figure: cache divided into SDM-LRU sets, SDM-BIP sets, and follower sets; SDM misses increment or decrement PSEL, and the MSB of PSEL selects LRU or BIP for the followers]

Based on analytical and empirical studies:
• 32 sets per SDM
• 10-bit PSEL counter

(A code sketch of this mechanism follows below.)
9
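A C sketch of set dueling as the slide describes it: dedicated SDM sets per policy, one saturating PSEL counter, and followers steered by PSEL's MSB. The particular set-index mapping (every 128th set) is an assumption invented for this sketch; the ISCA'07 paper defines its own selection scheme.

```c
#include <stdbool.h>
#include <stdint.h>

#define PSEL_BITS 10
#define PSEL_MAX  ((1 << PSEL_BITS) - 1)

static uint16_t psel = PSEL_MAX / 2;   /* start at the midpoint */

typedef enum { SDM_LRU, SDM_BIP, FOLLOWER } set_kind_t;

/* Illustrative assignment for a 4096-set cache: dedicate 32 sets to
 * each SDM, spread across the cache (4096/128 = 32 per SDM). */
static set_kind_t classify_set(unsigned set) {
    if (set % 128 == 0) return SDM_LRU;
    if (set % 128 == 1) return SDM_BIP;
    return FOLLOWER;
}

/* On a miss to an SDM, nudge PSEL toward the other policy. */
void on_miss(unsigned set) {
    set_kind_t k = classify_set(set);
    if (k == SDM_LRU && psel < PSEL_MAX) psel++;
    if (k == SDM_BIP && psel > 0)        psel--;
}

/* Insertion policy for this fill: SDMs always follow their own
 * policy; followers follow the MSB of PSEL (1 => use BIP). */
bool use_bip(unsigned set) {
    switch (classify_set(set)) {
    case SDM_LRU:  return false;
    case SDM_BIP:  return true;
    default:       return (psel >> (PSEL_BITS - 1)) & 1;
    }
}
```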
Extending DIP to Shared Caches

• DIP uses a single policy (LRU or BIP) for all applications competing for the cache
• DIP cannot distinguish between apps that benefit from cache and those that do not
• Example: soplex + h264ref w/ 2MB cache
  – DIP learns LRU for both apps
  – soplex causes destructive interference
  – Desirable that only h264ref follow LRU and soplex follow BIP

[Figure: Misses Per 1000 Instr (under LRU) for soplex and h264ref]
Need a Thread-Aware Dynamic Insertion Policy (TADIP)
10
Thread Aware Dynamic Insertion Policy (TADIP)
• Assume an N-core CMP running N apps; what is the best insertion policy for each app? (LRU = 0, BIP = 1)
• The insertion policy decision can be thought of as an N-bit binary string: < P0, P1, P2, … PN-1 >
  – If Px = 1, then application x uses BIP, else it uses LRU
  – e.g., 0000 → always use conventional LRU; 1111 → always use BIP
• With an N-bit string, there are 2^N possible string combinations. How do we find the best one?
  – Offline Profiling: input-set/system dependent & impractical with large N
  – Brute-Force Search using SDMs: infeasible with large N
Need a PRACTICAL and SCALABLE Implementation of TADIP
11
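A tiny C illustration of that encoding, assuming thread IDs index the bits of a machine word (all names here are invented for the sketch):

```c
#include <stdbool.h>
#include <stdint.h>

/* The system-wide insertion decision < P0, P1, ... PN-1 > packed into
 * one word: bit x = 1 means thread x uses BIP, 0 means LRU. */
typedef uint32_t policy_string_t;      /* supports up to N = 32 threads */

static inline bool thread_uses_bip(policy_string_t p, unsigned x) {
    return (p >> x) & 1;
}

/* The slide's examples for N = 4: 0000 = all-LRU, 1111 = all-BIP. */
static const policy_string_t ALL_LRU = 0x0;
static const policy_string_t ALL_BIP = 0xF;
```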
Using Set-Dueling As a Practical Approach to TADIP
• It is unnecessary to exhaustively search all 2^N combinations
• Some bits of the best binary insertion string can be learned independently
  – Example: always use BIP for applications that create interference
• Exponential Search Space → Linear Search Space
  – Learn the best policy (BIP or LRU) for each app in the presence of all other apps (see the worked example after this slide)

Use Per-Application SDMs To Decide:
In the presence of other apps, does an app cause destructive interference?
If so, use BIP for this app; else, use the LRU policy
12
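As a worked example of the savings (consistent with the counts on the later summary slide): for N = 16 applications, brute-force set dueling would need an SDM group for every one of the 2^16 = 65,536 candidate strings, whereas learning each bit in the presence of the others needs just one LRU SDM and one BIP SDM per application, i.e. 2 × 16 = 32 SDMs and 16 PSEL counters.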
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3

In the presence of the other apps, does APP0 doing LRU or BIP improve cache performance? Each application owns a pair of SDMs that differ only in its own policy bit:

  APP0: < 0, P1, P2, P3 > (misses: PSEL0++) vs < 1, P1, P2, P3 > (misses: PSEL0--)
  APP1: < P0, 0, P2, P3 > (misses: PSEL1++) vs < P0, 1, P2, P3 > (misses: PSEL1--)
  APP2: < P0, P1, 0, P3 > (misses: PSEL2++) vs < P0, P1, 1, P3 > (misses: PSEL2--)
  APP3: < P0, P1, P2, 0 > (misses: PSEL3++) vs < P0, P1, P2, 1 > (misses: PSEL3--)

Follower Sets: < P0, P1, P2, P3 >, where Pc = MSB( PSELc )

[Figure: set-level and high-level views of the cache, showing each app's SDM pairs and the follower sets]
13
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3
  – LRU SDMs, one per APP: < 0, P1, P2, P3 >, < P0, 0, P2, P3 >, < P0, P1, 0, P3 >, < P0, P1, P2, 0 >
  – BIP SDMs, one per APP: < 1, P1, P2, P3 >, < P0, 1, P2, P3 >, < P0, P1, 1, P3 >, < P0, P1, P2, 1 >
  – Follower sets: < P0, P1, P2, P3 >
• Per-APP PSEL saturating counters (PSEL0 … PSEL3)
  – misses to an APP's LRU SDM: PSELc++
  – misses to an APP's BIP SDM: PSELc--
• Follower sets' insertion policy:
  – SDMs of one thread are follower sets of another thread
  – Let Pc = MSB[ PSELc ]
  – Fill Decision: < P0, P1, P2, P3 >

HW Required: (10*T) bits + Combinational Logic
• 32 sets per SDM
• 10-bit PSEL counter

(A code sketch of this mechanism follows below.)
14
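A C sketch of the TADIP mechanism just described: per-thread PSEL counters, a pair of SDMs per thread that pin only that thread's policy bit, and followers steered by each thread's MSB. The set-to-SDM mapping and cache size are assumptions invented for the sketch; the PSEL update and fill rules follow the slide.

```c
#include <stdbool.h>
#include <stdint.h>

#define T         4                     /* threads sharing the cache */
#define PSEL_BITS 10
#define PSEL_MAX  ((1 << PSEL_BITS) - 1)

static uint16_t psel[T];                /* one saturating counter per thread */

/* Illustrative mapping for a 4096-set cache: slots 0..2T-1 of every
 * 128-set group are SDMs (32 sets per SDM); the rest are followers.
 * Returns the owning thread and whether that SDM pins BIP. */
static bool is_sdm(unsigned set, unsigned *owner, bool *sdm_bip) {
    unsigned slot = set % 128;
    if (slot >= 2 * T) return false;    /* follower set */
    *owner   = slot / 2;
    *sdm_bip = slot & 1;
    return true;
}

/* On a miss, only the owning thread's PSEL is updated: misses in its
 * LRU-pinned SDM increment it, misses in its BIP-pinned SDM decrement. */
void tadip_on_miss(unsigned set) {
    unsigned owner; bool sdm_bip;
    if (!is_sdm(set, &owner, &sdm_bip)) return;
    if (!sdm_bip && psel[owner] < PSEL_MAX) psel[owner]++;
    if ( sdm_bip && psel[owner] > 0)        psel[owner]--;
}

/* Fill decision for thread t filling into 'set': inside t's own SDMs
 * the policy bit is pinned; everywhere else t follows Pt = MSB(PSELt).
 * Note: an SDM of one thread is a follower set for the other threads. */
bool tadip_use_bip(unsigned set, unsigned t) {
    unsigned owner; bool sdm_bip;
    if (is_sdm(set, &owner, &sdm_bip) && owner == t)
        return sdm_bip;
    return (psel[t] >> (PSEL_BITS - 1)) & 1;
}
```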
Summarizing Insertion Policies
Policy           | Insertion Policy Search Space                     | # of SDMs | # Counters
LRU Replacement  | < 0, 0, 0, … 0 >                                  | 0         | 0
DIP              | < 0, 0, 0, … 0 > and < 1, 1, 1, … 1 >             | 2         | 1
Brute Force      | < 0, 0, 0, … 0 > … < 1, 1, 1, … 1 > (all 2^N)     | 2^N       | 2^N
TADIP            | < P0, P1, P2, … PN-1 > and Hamming distance of 1  | 2N        | N
TADIP is SCALABLE with Large N
15
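As a quick check on TADIP's overhead row (an estimate from these parameters, not a number quoted from the paper): N threads need N PSEL counters of 10 bits each, so per-HW-thread storage is 10 bits = 1.25 bytes, within the "< two bytes per HW-thread" claim from the contributions slide; the SDMs are ordinary cache sets, so they add no storage.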
Experimental Setup
• Simulator and Benchmarks:
  – CMP$im – a Pin-based multi-core performance simulator
  – 17 representative SPEC CPU2006 benchmarks
• Baseline Study:
  – 4-core CMP with in-order cores (assuming L1-hit IPC of 1)
  – Three-level cache hierarchy: 32KB L1, 256KB L2, 4MB L3
  – 15 workload mixes of four different SPEC CPU2006 benchmarks
• Scalability Study:
  – 2-core, 4-core, 8-core, 16-core systems
  – 50 workload mixes of 2, 4, 8, & 16 different SPEC CPU2006 benchmarks
16
soplex + h264ref Sharing 2MB Cache

[Figure: per-application APKI, MPKI, cache usage, and % MRU insertions for SOPLEX and H264REF under the baseline LRU policy / DIP, BIP, and TADIP; APKI = accesses per 1000 instructions, MPKI = misses per 1000 instructions]
TADIP Improves Throughput by 27% over LRU and DIP
17
TADIP Results – Throughput

[Figure: throughput normalized to LRU (1.00 to 1.60) for DIP and TADIP across MIX_0 … MIX_14 and GEOMEAN; annotation: No Gains from DIP on some mixes]
DIP and TADIP are ROBUST and Do Not Degrade Performance over LRU
Making Thread-Aware Decisions is 2x Better than DIP
18
TADIP Compared to Offline Best Static Policy

Note: "Best Static" is almost always better because it is the insertion string with the best IPC, while TADIP optimizes for fewer misses. TADIP can be adapted to optimize other metrics (e.g., IPC).

[Figure: throughput normalized to LRU (1.00 to 1.60) for DIP, TADIP, and BEST STATIC across MIX_0 … MIX_14 and GEOMEAN; annotation: TADIP Better Due to Phase Adaptation]
TADIP is within 85% of Best Offline Determined Insertion Policy Decision
19
TADIP Vs. UCP ( MICRO'06 )

Utility Based Cache Partitioning (UCP)

Cost Per Thread (bytes): UCP = 1920, TADIP = 2

[Figure: throughput normalized to LRU (1.00 to 1.60) for UCP and TADIP across MIX_0 … MIX_14 and GEOMEAN]

TADIP Out-Performs UCP Without Requiring Any Cache Partitioning Hardware
Unlike Cache Partitioning Schemes, TADIP Does NOT Reserve Cache Space
TADIP Does Efficient CACHE MANAGEMENT by Changing Insertion Policy
20
TADIP Results – Sensitivity to Cache Size

[Figure: throughput normalized to 4MB LRU (1.00 to 2.00) across MIX_0 … MIX_14 and GEOMEAN for TADIP - 4MB, LRU - 8MB, TADIP - 8MB, and LRU - 16MB]
TADIP Provides Performance Equivalent to Doubling Cache Size
21
TADIP Results – Scalability

[Figure: throughput normalized to the respective baseline system (1.00 to 2.00) across 50 workloads for 2-Thread, 4-Thread, 8-Thread, and 16-Thread systems]
TADIP Scales to Large Number of Concurrently Executing Applications
22
Summary
• The Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit
• Solution: Thread-Aware Dynamic Insertion Policy (TADIP)
  1. Provides High Performance by Allocating Cache on a Benefit-Basis
     - Up to 94%, 64%, 26%, and 16% performance improvement on 2, 4, 8, and 16-core CMPs
  2. Is Robust Across Different Workload Mixes
     - Does not significantly hurt performance when LRU works well
  3. Scales to Large Number of Competing Applications
     - Evaluated up to 16 cores in our study
  4. Requires Low Design Overhead
     - < 2 bytes per HW-thread and NO CHANGES to existing cache structure
23
Q&A
24
Journal of Instruction-Level Parallelism
1st Data Prefetching Championship (DPC-1)
Sponsored by: Intel, JILP, IEEE TC-uARCH
Conjunction with: HPCA-15
Paper & Abstract Due: December 12th, 2008
Notification: January 16th, 2009
Final Version: January 30th, 2009
More Information and Prefetch Download Kit At:
http://www.jilp.org/dpc/
TADIP Results – Weighted Speedup

[Figure: weighted speedup normalized to LRU (1.00 to 1.20) for DIP and TADIP across MIX_0 … MIX_14 and GEOMEAN]

TADIP Provides More Than Twice the Performance Benefit of DIP
TADIP Improves Performance over LRU by 18%
26
TADIP Results – Fairness Metric

[Figure: harmonic mean of normalized IPCs (0.00 to 1.00) for LRU, DIP, and TADIP across MIX_0 … MIX_14 and GEOMEAN]

TADIP Improves Fairness
27
TADIP In The Presence of Prefetching

[Figure: throughput normalized to LRU + prefetching (0.90 to 1.70) across 50 workloads on a 4-core CMP]
TADIP Improves Performance Even In Presence of HW Prefetching
28
Insertion Policy to Control Cache Occupancy (16-Cores)
• Changing the insertion policy directly controls the amount of cache resources provided to an application
• The figure shows only the TADIP-selected insertion policy for xalancbmk & sphinx3
• TADIP improves performance by 28%

[Figure: APKI, MPKI, cache usage, and % MRU insertions for a sixteen-core mix with a 16MB LLC]
Insertion Policy Directly Controls Cache Occupancy
29
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 2 applications: APP0 and APP1

In the presence of the other app, should APP0 do LRU or BIP? Should APP1? Each application owns a pair of SDMs that differ only in its own policy bit:

  APP0: < 0, P1 > (misses: PSEL0++) vs < 1, P1 > (misses: PSEL0--)
  APP1: < P0, 0 > (misses: PSEL1++) vs < P0, 1 > (misses: PSEL1--)

Follower Sets: < P0, P1 >, where Pc = MSB( PSELc )

[Figure: set-level and high-level views of the cache]
• 32 sets per SDM
• 9-bit PSEL counter
30
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 2 applications: APP0 and APP1
  – LRU SDMs, one per APP: < 0, P1 > and < P0, 0 >
  – BIP SDMs, one per APP: < 1, P1 > and < P0, 1 >
  – Follower sets: < P0, P1 >
• PSEL0, PSEL1: per-APP saturating counters
  – misses to an APP's LRU SDM: PSELc++
  – misses to an APP's BIP SDM: PSELc--
• Follower sets' insertion policy:
  – SDMs of one thread are follower sets of the other thread
  – Let Pc = MSB[ PSELc ]
  – Fill Decision: < P0, P1 >

HW Required: (9*T) bits + Combinational Logic
• 32 sets per SDM
• 9-bit PSEL counter
31