FLEXclusion: Balancing Cache Capacity and
On-chip Bandwidth via Flexible Exclusion
Jaewoong Sim, Jaekyu Lee, Moinuddin K. Qureshi, Hyesoon Kim
Outline




Motivation
FLEXclusion
 - Design
 - Monitoring & Operation
 - Extension
Evaluations
Conclusion
Introduction

Today’s processors have multi-level cache hierarchies


Design options include size, inclusion property, number of levels, ...
Design choice for cache inclusion



Inclusion: upper-level cache blocks always exist in the lower-level cache
Exclusion: upper-level cache blocks must not exist in the lower-level cache
Non-inclusion: the lower-level cache may contain upper-level cache blocks
[Figure: upper-level and lower-level cache contents under inclusion, non-inclusion, and exclusion]
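The three definitions can be read as constraints on which blocks the two levels hold at any instant. The sketch below is only an illustration of those constraints: the function names are hypothetical, and each cache level is modeled as a plain set of block addresses. Non-inclusion imposes no constraint, so it needs no check.

```python
def satisfies_inclusion(upper_blocks, lower_blocks):
    """Inclusion: every upper-level block also exists in the lower level."""
    return set(upper_blocks) <= set(lower_blocks)


def satisfies_exclusion(upper_blocks, lower_blocks):
    """Exclusion: no upper-level block exists in the lower level."""
    return set(upper_blocks).isdisjoint(lower_blocks)


# Hypothetical snapshots of L2 (upper) and L3 (lower) contents.
print(satisfies_inclusion({1, 2}, {1, 2, 3}))  # True
print(satisfies_exclusion({1, 2}, {3, 4}))     # True
# Partial overlap: allowed only under non-inclusion.
print(satisfies_inclusion({1, 2}, {2, 3}), satisfies_exclusion({1, 2}, {2, 3}))
```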
Trend of Cache Size Ratio
Trend of the ratio of total non-LLC capacity to LLC capacity

A high ratio indicates more data duplication with inclusion/non-inclusion
Example: L2: 4 x 256KB, L3: 6MB
More than 15% duplication!!
[Figure: Ratio of non-LLC to LLC sizes of Intel's processors over the past 10 years. x-axis: year (2000-2012); y-axis: ratio of non-LLC to LLC (0-0.2). Plotted processors: PIII (Coppermine T), PIII-S (Tualatin), P4 (Northwood), P4 (Prescott), Prescott 2M (90), Core (Woodcrest), Core 2 Duo E4xxx/E6xxx/E7xxx/L7xxx/P9xxx, Core i3-5xx, Core i5-2xxx, Core i7-8xx, Core i7-9xx, Core i7-2600, Core i7-39xxX. An annotation marks where the multi-core era begins.]
For Capacity: Exclusion is a better option
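The duplication figure for the 4 x 256KB L2 / 6MB L3 example can be checked with a few lines of arithmetic. This is a small sketch, not part of the original slides; the helper name is made up, and only the L2 capacities are counted as non-LLC, as in the example.

```python
KB = 1024
MB = 1024 * KB


def non_llc_to_llc_ratio(non_llc_sizes, llc_size):
    """Ratio of total non-LLC capacity to LLC capacity (sizes in bytes)."""
    return sum(non_llc_sizes) / llc_size


# Four private 256KB L2 caches in front of a 6MB L3.
ratio = non_llc_to_llc_ratio([256 * KB] * 4, 6 * MB)
print(f"{ratio:.1%}")  # 16.7%
```

With inclusion or non-inclusion, up to this fraction of the LLC can hold blocks that are already present in the upper-level caches.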
On-Chip Traffic

What about on-chip traffic?
 - Each design also has a different impact on on-chip traffic
[Figure: block flows between L2, L3 (LLC), and DRAM (fill flow, L3 hit, clean victim, dirty victim) in a non-inclusive hierarchy and in an exclusive hierarchy. Non-inclusive hierarchy: clean L2 victims are silently dropped. Exclusive hierarchy: clean L2 victims are also sent to L3, which means more traffic.]
For Bandwidth: Non-Inclusion is a better option
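The traffic difference between the two hierarchies comes from how L2 victims are handled. The sketch below counts L2-to-L3 block transfers for a hypothetical sequence of L2 victims; the function name and the victim sequence are assumptions made for illustration.

```python
def l2_to_l3_transfers(victims, exclusive):
    """Count blocks sent from L2 to L3 for a sequence of L2 victims.

    victims: iterable of booleans, True for a dirty victim, False for a clean one.
    exclusive: True for an exclusive hierarchy, False for a non-inclusive one.
    """
    transfers = 0
    for dirty in victims:
        # Dirty victims go to L3 in both designs; clean victims go to L3
        # only in the exclusive hierarchy (they are silently dropped otherwise).
        if dirty or exclusive:
            transfers += 1
    return transfers


victims = [False, False, True, False]  # three clean victims, one dirty
print(l2_to_l3_transfers(victims, exclusive=False))  # 1
print(l2_to_l3_transfers(victims, exclusive=True))   # 4
```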
Static Inclusion
Question: Which design do we want to choose?
[Figure: scatter plot of benchmarks. x-axis: performance of exclusion relative to non-inclusion (0.95-1.25; further right means more performance benefit from exclusion); y-axis: L2->L3 traffic difference in IPKI (0-60; higher means more bandwidth consumption with exclusion). Benchmarks plotted: mcf, bwaves, leslie3d, soplex, sphinx3, omnetpp, wrf, calculix, bzip2, gcc, h264ref, xalancbmk, hmmer. Annotations mark where we want to go for non-inclusion and where we want to go for exclusion.]
Static Inclusion : Problem


Each policy has its advantages and disadvantages
 - Non-inclusion provides less capacity but higher efficiency on on-chip traffic
 - Exclusion provides more capacity but lower efficiency on on-chip traffic
Workloads have diverse capacity/bandwidth requirements
Problem: No single static cache configuration works best for all workloads
Our Solution: Flexible Exclusion
Dynamically change cache inclusion according to the workload requirement!
Providing both non-inclusion and exclusion
 - Capture the best of capacity/bandwidth requirement
Key Observation
 - Non-inclusion and exclusion require similar hardware
Benefits of FLEXclusion
 - Reducing on-chip traffic compared to exclusion
 - Improving performance compared to non-inclusion
FLEXclusion Overview


Goal: Adapt cache inclusion between non-inclusion and exclusion
Overall Design
 - Monitoring logic
 - A few logic blocks in the hardware to control traffic
Design



EXCL-REG: to control L2 clean victim data flow
NICL-GATE: to control incoming blocks from memory
Monitoring & policy decision logic: to switch operating mode
Monitoring logic is required in many modern cache mechanisms!
[Figure: FLEXclusion design, showing the L2 cache, the last-level cache, the L2 line fill, L2 clean victim, and L3 line fill paths, EXCL-REG, NICL-GATE, and the Policy Decision & Information Collection Logic]
Non-inclusive Mode (PDL signals 0)



Clean L2 victims are silently dropped
Incoming blocks are installed into both L2 and L3
Blocks that hit in L3 stay resident in L3
Non-inclusive mode follows typical non-inclusive behavior
[Figure: FLEXclusion design diagram (L2 cache, last-level cache, EXCL-REG, NICL-GATE, Policy Decision & Information Collection Logic) in non-inclusive mode]
Exclusive Mode (PDL signals 1)



Clean L2 victims are inserted into L3
Incoming blocks are only installed into L2
Blocks that hit in L3 are invalidated
Performs similarly to a typical exclusive design except for L3 insertions from L2
[Figure: FLEXclusion design diagram (L2 cache, last-level cache, EXCL-REG, NICL-GATE, Policy Decision & Information Collection Logic) in exclusive mode]
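The two operating modes can be sketched as a toy two-level model driven by the PDL signal. This is a minimal sketch under stated assumptions, not the hardware design: both levels are small fully associative LRU caches, all accesses are reads (so every victim is clean), and the class and attribute names (LRUCache, FlexHierarchy, l3_insertions) are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache that tracks block addresses only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def __contains__(self, addr):
        return addr in self.blocks

    def touch(self, addr):
        self.blocks.move_to_end(addr)

    def insert(self, addr):
        """Make addr most recently used; return the evicted address or None."""
        if addr in self.blocks:
            self.blocks.move_to_end(addr)
            return None
        victim = None
        if len(self.blocks) >= self.capacity:
            victim, _ = self.blocks.popitem(last=False)
        self.blocks[addr] = True
        return victim

    def invalidate(self, addr):
        self.blocks.pop(addr, None)


class FlexHierarchy:
    def __init__(self, l2_blocks, l3_blocks):
        self.l2 = LRUCache(l2_blocks)
        self.l3 = LRUCache(l3_blocks)
        self.pdl_signal = 0      # 0: non-inclusive mode, 1: exclusive mode
        self.l3_insertions = 0   # counts blocks written into L3

    def _insert_l3(self, addr):
        self.l3.insert(addr)     # a clean L3 victim, if any, is simply dropped
        self.l3_insertions += 1

    def access(self, addr):
        exclusive = self.pdl_signal == 1
        if addr in self.l2:
            self.l2.touch(addr)
            return "L2 hit"
        if addr in self.l3:
            outcome = "L3 hit"
            if exclusive:
                self.l3.invalidate(addr)   # exclusive: hitting block leaves L3
            else:
                self.l3.touch(addr)        # non-inclusive: block stays in L3
        else:
            outcome = "miss"
            if not exclusive:
                self._insert_l3(addr)      # NICL-GATE: fill L3 as well as L2
        victim = self.l2.insert(addr)
        if victim is not None and exclusive:
            self._insert_l3(victim)        # EXCL-REG: clean victim goes to L3
        return outcome


h = FlexHierarchy(l2_blocks=2, l3_blocks=4)
for a in (1, 2, 3):
    h.access(a)
print(h.access(1), h.l3_insertions)  # L3 hit 3
```

Changing `pdl_signal` on a live hierarchy needs no extra step in this model: later accesses simply follow the other branch.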
Requirement Monitoring

Set-dueling method is used to capture performance and traffic behavior of exclusion and non-inclusion
 - Sampling sets follow their original behavior
 - Monitor cache misses and insertions
 - Other sets follow the winning policy
[Figure: L2 and LLC with PDL and ICL. LLC sets 0-7 are marked as non-inclusive sets, exclusive sets, or following sets; cache miss and insertion counters are kept for the non-inclusive and exclusive sampling sets.]
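The monitoring scheme can be sketched as a small bookkeeping class. This is an illustrative sketch only: the class name, the way sets are assigned to roles (every fourth set samples non-inclusion, the next one samples exclusion), and the counter layout are all assumptions, not taken from the slides.

```python
class SetDuelingMonitor:
    """Tracks misses and insertions in the sampling sets of each policy."""

    def __init__(self, num_sets, sample_stride=4):
        # Hypothetical role assignment; the real mapping is a design choice.
        self.roles = []
        for s in range(num_sets):
            if s % sample_stride == 0:
                self.roles.append("non-inclusive")
            elif s % sample_stride == 1:
                self.roles.append("exclusive")
            else:
                self.roles.append("following")
        self.misses = {"non-inclusive": 0, "exclusive": 0}
        self.insertions = {"non-inclusive": 0, "exclusive": 0}

    def record_miss(self, set_index):
        role = self.roles[set_index]
        if role != "following":          # only sampling sets are counted
            self.misses[role] += 1

    def record_insertion(self, set_index):
        role = self.roles[set_index]
        if role != "following":
            self.insertions[role] += 1

    def policy_for(self, set_index, winning_policy):
        """Sampling sets keep their own policy; other sets follow the winner."""
        role = self.roles[set_index]
        return winning_policy if role == "following" else role


monitor = SetDuelingMonitor(num_sets=8)
monitor.record_miss(0)        # set 0 samples non-inclusion
monitor.record_insertion(1)   # set 1 samples exclusion
monitor.record_miss(2)        # set 2 is a following set, so it is not counted
print(monitor.misses, monitor.insertions)
print(monitor.policy_for(2, "exclusive"))  # exclusive
```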
Operating Region

Decision of winning policy is made by Policy Decision Logic (PDL)
 - Basic operating mode is determined by Perf_th
 - Extensions of FLEXclusion use Insertion_th for further performance/traffic optimization
Conditions shown in the figure:
 - Miss(NICL) - Miss(EX) > Perf_th
 - Ins(EX) - Ins(NICL) > Insertion_th
[Figure: operating regions. x-axis: exclusion performance relative to non-inclusion (cache miss), with 1.0 and Perf_th marked; y-axis: L3 IPKI difference, with Insertion_th marked. Regions: Non-Inclusive Region, Non-Inclusive Region (Aggressive), Exclusive Region, Exclusive Region (Bypass).]
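The two threshold comparisons can be written down directly. The sketch below is one reading of the slide, not the actual PDL: it assumes that exclusive mode is chosen when the miss difference exceeds Perf_th, and it only reports whether the insertion difference exceeds Insertion_th, since how the extensions act on that comparison is described in the paper. The function name, counter values, and threshold values are hypothetical.

```python
def pdl_decide(miss_nicl, miss_ex, ins_nicl, ins_ex, perf_th, insertion_th):
    """Return (pdl_signal, insertion_gap_exceeded).

    pdl_signal: 1 for exclusive mode, 0 for non-inclusive mode.
    insertion_gap_exceeded: True when Ins(EX) - Ins(NICL) > Insertion_th.
    """
    # Basic operating mode: compare the miss counts of the sampling sets.
    pdl_signal = 1 if (miss_nicl - miss_ex) > perf_th else 0
    # Extra comparison used by the extensions (mapping to regions not modeled).
    insertion_gap_exceeded = (ins_ex - ins_nicl) > insertion_th
    return pdl_signal, insertion_gap_exceeded


# Exclusion removes many misses: exclusive mode is chosen.
print(pdl_decide(100, 60, 10, 50, perf_th=20, insertion_th=30))  # (1, True)
# Miss counts are close: stay in non-inclusive mode.
print(pdl_decide(100, 95, 10, 20, perf_th=20, insertion_th=30))  # (0, False)
```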
Extensions of FLEXclusion



Per-core policy: to isolate each application's behavior
Aggressive non-inclusion: to improve performance in non-inclusive mode
Bypass on exclusive mode: to reduce traffic in exclusive mode
Detailed explanations are in the paper.
[Figure: block flows between L2 and LLC (hit on LLC, line fill from DRAM, clean victim) in aggressive non-inclusive mode and with bypass on exclusive mode]
FLEXclusion Operation

A FLEXclusive cache changes operating mode at run-time



FLEXclusion does not require any special actions
- On a switch from non-inclusive to exclusive mode
- On a switch from exclusive to non-inclusive mode
[Figure: timeline of a FLEXclusive hierarchy (L2 and LLC) as the FLEXclusion mode goes from non-inclusive to exclusive and back to non-inclusive, showing fills, hits, evictions, and dirty blocks. Callout: Written back into the same position!]
Evaluations



MacSim Simulator
 - A cycle-level in-house simulator (now public)
 - Power results with Orion (Wang+ [MICRO’02])
Baseline Processor
 - 4-core, 4.0GHz, private L1 and L2, shared L3
Workloads
 - Group A: bzip2, gcc, hmmer, h264, xalancbmk, calculix (Low MPKI)
 - Group B: mcf, omnetpp, bwaves, soplex, leslie3d, wrf, sphinx3 (High MPKI)
 - Multi-programmed: 2-MIX-S, 2-MIX-A, 4-MIX-S
Other results in the paper
 - Multi-programmed workloads, per-core, aggressive mode, bypass, threshold sensitivity
Evaluations – Performance/Traffic
[Figure: Performance relative to exclusion for non-inclusion and FLEXclusion, per benchmark (bzip2, gcc, hmmer, h264ref, xalancbmk, calculix, mcf, omnetpp, bwaves, leslie3d, soplex, wrf, sphinx3, AVG.) at 1MB, 2MB, and 4MB. Callouts: AVG. 6.3% loss for 1MB; FLEXclusion performs similar to exclusion; 5.9% improvement over non-inclusion!!]
[Figure: Traffic as L3 IPKI normalized to non-inclusion for exclusion and FLEXclusion, same benchmarks and sizes. Callout: 72.6% reduction over exclusion!!]
Evaluations - Effective Cache Size
Running the same benchmark on 1/2/4 cores (4MB L3)
[Figure: three panels (1 core, 2 cores, 4 cores) showing, per benchmark (bzip2, gcc, hmmer, h264, xalancbmk, calculix, mcf, omnetpp, bwaves, leslie3d, soplex, wrf, sphinx3, AVG.), the share (0%-100%) of exclusive mode versus non-inclusive mode. Callouts: One thread is enjoying the cache!!; Threads are competing for shared caches!!; FLEXclusive cache is configured as exclusive mode more often!!]
FLEXclusion adapts inclusion based on the effective cache size for each workload!!
Evaluations – Traffic & Power
Impact on L3 insertion traffic reduction in total?
FLEXclusion effectively reduces the traffic
[Figure: traffic normalized to exclusion, broken down into address, data (L2->L3), data (L3->L2), and data (MC<->caches), and power consumption normalized to exclusion, each comparing exclusion, FLEXclusion, and non-inclusion. Axis labels include the individual benchmarks (bzip2, gcc, hmmer, h264, xalancbmk, calculix, mcf, omnetpp, bwaves, leslie3d, soplex, wrf, sphinx3, AVG.) and the workload groups single-threaded (A+B), 2-MIX-S, 2-MIX-A, and 4-MIX-S. Callouts: L3 insertion takes up more than 40%!; Reduced to ~10% with FLEXclusion!!; 20% reduction]
Conclusions & Future Work

FLEXclusion balances performance and on-chip bandwidth consumption
 - depending on the workload requirement
 - with negligible hardware changes
5.9% performance improvement over non-inclusion
72.6% L3 insertion traffic reduction over exclusion (20% power reduction)
Future Work
 - More generic FLEXclusion including inclusion property
 - Impact on on-chip network
Q/A

Thank you!