Symbiotic Scheduling for Shared Caches in Multi-Core Systems Using Memory Footprint Signature
Mrinmoy Ghosh, Ripal Nathuji, Min Lee, Karsten Schwan, Hsien-Hsin S. Lee
ARM, Microsoft Research, Georgia Tech
Cache Interference in “Concurrent Processes”
[Diagram: Core A and Core B, each with a private L1 cache, above a shared L2 cache. Labels: P1, P2, "P1 $ Line", "Hit !!!!!!", "Conflict".]
Cache Interference Effect (Concurrent Processes)
[Chart: Relative Run Time (y-axis, 0.96 to 1.10) per benchmark: perlbench, libquantum, gobmk, hmmer, soplex, povray, omnetpp, mcf, astar, bwaves, sphinx3, xalancbmk. Bar labels: libq, mcf, perl.]
Maximum performance degradation less than 10%
Cache Interference in “Shared Cache Multi-Core”
[Diagram: P1 on Core A and P2 on Core B, each core with a private L1 cache, above a shared L2 cache. Labels: "P1 $ Line", "P2", "Conflict !!!".]
Cache Interference Effect (Shared Cache Multi-Core)
[Chart: Relative Run Times (y-axis, 0.0 to 1.8) per benchmark. Benchmark names appearing in the chart: perlbench, libquantum, gobmk, hmmer, soplex, povray, omnetpp, mcf, astar, bwaves, sphinx3, xalancbmk, lbm.]
Performance degraded by as much as 65%
Intelligent Process Management Needed !!
Process (In-)Compatibility in Multi-Cores
• Problem
– Processes in different cores can be incompatible
– Shared resource contention
• Observation
– Incompatible processes contend less when they run on the same core
• Insight
– Process incompatibility severely affects performance
– Compatibility-based scheduling increases throughput
Ideas
• Use Counting Bloom Filter to record memory access signature
• Compatibility test using signature
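Insertion: Counting Bloom Filter
[Diagram: N-bit data address A passes through two N-to-m hash functions, X and Y, which select entries in the filter. Filter columns are labeled Presence Bit and Counter. Values shown: 1, 1.]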
[Diagram, next frame: N-bit data address B is inserted through the same hash functions X and Y. Values shown in the Presence Bit and Counter columns: 1, 1, 1, 2.]
Deletion: Counting Bloom Filter
[Diagram: data address A was evicted and passes through hash functions X and Y to locate its entries. Values shown in the Presence Bit and Counter columns: 1, 1, 1, 2.]
Query: Counting Bloom Filter
[Diagram: a query for data address A passes through hash functions X and Y. Values shown in the Presence Bit and Counter columns: 1, 0, 2, 1.]
Data Not Present !!!
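The insertion, deletion and query slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not specify the hash functions, so the two hashes below (standing in for Hash Func X and Hash Func Y) are hypothetical, as are the class name, the method names and the default filter size.

```python
class CountingBloomFilter:
    """Illustrative counting Bloom filter with two hash functions."""

    def __init__(self, m_bits=10):
        # Each hash maps an address to an m-bit index (m_bits is an assumed size).
        self.m_bits = m_bits
        self.size = 1 << m_bits
        self.counters = [0] * self.size

    def _indexes(self, address):
        # Hypothetical stand-ins for "Hash Func X" and "Hash Func Y".
        x = ((address * 2654435761) & 0xFFFFFFFF) >> (32 - self.m_bits)
        y = ((address ^ (address >> 7)) * 40503 + 1) % self.size
        return x, y

    def insert(self, address):
        # A line is brought into the cache: bump both selected counters.
        for i in self._indexes(address):
            self.counters[i] += 1

    def delete(self, address):
        # A line is evicted: decrement both selected counters.
        for i in self._indexes(address):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def query(self, address):
        # "Not present" is certain if any selected counter is zero;
        # "present" may be a false positive.
        return all(self.counters[i] > 0 for i in self._indexes(address))

    def presence_bits(self):
        # The presence bit of an entry is 1 exactly when its counter is nonzero.
        return [1 if c > 0 else 0 for c in self.counters]


bf = CountingBloomFilter()
addr_a, addr_b = 0x7F3A1000, 0x7F3A2040
bf.insert(addr_a)
bf.insert(addr_b)
bf.delete(addr_a)          # address A was evicted
print(bf.query(addr_b))    # True: B's counters are still nonzero
```

Counters rather than plain bits are what make deletion possible: when two addresses share an entry, evicting one only decrements the counter and the presence bit stays set for the other.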
Bloom Filter Signatures vs. Cache Footprint
[Chart: two series, Cache Footprint and Signature Value. Axis ticks run from 0 to 3500 and from 0 to 1700.]
Strong Correlation !!!
Architectural Support
Bloom Filter Signature Multi-Core Architecture
[Diagram: Core A and Core B, each with an L1 cache, a Last Filter and a Core Filter, above a shared L2 cache with Bloom Filter Counters.]
[Diagram, next frame: the same architecture with processes P1, P2 and P3 added.]
Metric for Execution State
[Diagram: Last Filter "+" Core Filter gives the RBV (Running Bit Vector), which yields the Occupancy Weight (i.e., # of 1s).]
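A minimal sketch of the execution-state metric, with bit vectors held as Python integers. The slide joins Last Filter and Core Filter with a "+" symbol and does not name the operator; it is treated here as bitwise XOR, which is an assumption. The function names and the 8-bit example vectors are also illustrative.

```python
def running_bit_vector(last_filter: int, core_filter: int) -> int:
    # Assumption: the "+" on the slide is read as bitwise XOR.
    return last_filter ^ core_filter


def occupancy_weight(rbv: int) -> int:
    # Occupancy weight is the number of 1s in the RBV.
    return bin(rbv).count("1")


last_filter = 0b00110010   # made-up 8-bit example
core_filter = 0b01111010   # made-up 8-bit example
rbv = running_bit_vector(last_filter, core_filter)
print(format(rbv, "08b"), occupancy_weight(rbv))   # 01001000 2
```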
Interference Metric (Complement of Symbiosis)
[Diagram: a process pool (processes waiting to be scheduled: Proc0, Proc1, Proc2, Proc*, Proc**). The RBV of Proc1 is combined ("+") with a Core Filter, giving Symbiosis = 5 and Interference Metric = N - 5.]
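The relation between symbiosis and the interference metric can be illustrated as follows. Two assumptions are made here that the slide does not state: the "+" combining the process RBV with the core filter is read as bitwise XOR, and N is taken to be the number of bits in the filter. The example vectors are made up so that the symbiosis comes out as 5, matching the slide.

```python
def symbiosis(proc_rbv: int, core_filter: int) -> int:
    # Assumption: combine with XOR, then count the 1s.
    return bin(proc_rbv ^ core_filter).count("1")


def interference_metric(proc_rbv: int, core_filter: int, n_bits: int) -> int:
    # Interference is the complement of symbiosis: N - symbiosis.
    # Assumption: N is the filter length in bits.
    return n_bits - symbiosis(proc_rbv, core_filter)


proc1_rbv = 0b11110000     # made-up 8-bit RBV for Proc1
core_filter = 0b00011100   # made-up 8-bit core filter
print(symbiosis(proc1_rbv, core_filter))               # 5
print(interference_metric(proc1_rbv, core_filter, 8))  # 3
```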
Process-to-Core Mapping Algorithms
• A1: Use Occupancy Weight
• A2: Use Interference Graph
• A3: Use Weighted Interference Graph
A1: Weight Sorted Algorithm
• Sort all processes according to occupancy weight
• Processes form groups using sorted weight
– # of processes in a group = Processes/Cores
• Map processes to cores based on sorting results
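[Diagram: processes listed with their weights in sorted order: P0 = 100, P4 = 99, P2 = 70, P5 = 65, P6 = 43, P3 = 20, P1 = 15. They are mapped onto Core A, Core B, Core C and Core D, each with its own L1 cache.]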
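The weight-sorted steps above can be sketched as below, using the weights from the slide. Several details are assumptions rather than something the slide states: the sort is descending, groups are taken as consecutive runs of the sorted list, and the group size Processes/Cores is rounded up when it does not divide evenly. The function name is illustrative.

```python
import math


def weight_sorted_mapping(weights, cores):
    # Sort processes by occupancy weight, heaviest first (assumed order).
    ordered = sorted(weights, key=weights.get, reverse=True)
    # Group size = Processes / Cores, rounded up (assumed rounding).
    group_size = math.ceil(len(ordered) / len(cores))
    # Consecutive runs of the sorted list share a core (assumed grouping).
    return {
        core: ordered[i * group_size:(i + 1) * group_size]
        for i, core in enumerate(cores)
    }


weights = {"P0": 100, "P1": 15, "P2": 70, "P3": 20, "P4": 99, "P5": 65, "P6": 43}
mapping = weight_sorted_mapping(weights, ["Core A", "Core B", "Core C", "Core D"])
for core, procs in mapping.items():
    print(core, procs)
```

Under these assumptions the two heaviest processes, P0 and P4, share Core A, so they take turns on one core instead of running at the same time on different cores.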
A2: Interference Graph Algorithm
• Form interference graph using interference metric
• Find MAX-CUT of the graph
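[Diagram: each process with its interference metric against Core A (CA) and Core B (CB), and the core it last ran on.]
Process  CA  CB  Was in
P0       20  30  CA
P1       10  45  CA
P2       40  25  CB
P3       15  50  CB
[Interference graph so far: nodes P0 (A), P1 (A), P2 (B), P3 (B); the edge between P0 and P2 carries the labels 30 and 40.]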
[Diagram, next frame: the two labels on the edge between P0 (A) and P2 (B) are combined into a single weight of 70.]
[Diagram, next frame: the complete interference graph over P0 (A), P1 (A), P2 (B) and P3 (B), with edge weights 70, 45, 30, 75, 85 and 60.]
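One way to build the interference graph from the per-process metrics is sketched below. The edge rule is inferred rather than stated on the slides: the weight of an edge is taken as the sum of each endpoint's interference metric against the core the other endpoint last ran on. With the slide's numbers this rule gives the six weights 70, 45, 30, 75, 85 and 60. The data layout and function name are illustrative, and the cut step is not covered here.

```python
from itertools import combinations

# Core each process last ran on, and its interference metric per core
# (values taken from the slide: CA = Core A, CB = Core B).
last_core = {"P0": "A", "P1": "A", "P2": "B", "P3": "B"}
interference = {
    "P0": {"A": 20, "B": 30},
    "P1": {"A": 10, "B": 45},
    "P2": {"A": 40, "B": 25},
    "P3": {"A": 15, "B": 50},
}


def build_interference_graph(last_core, interference):
    graph = {}
    for u, v in combinations(sorted(last_core), 2):
        # Inferred rule: u's metric against v's core plus v's metric
        # against u's core.
        graph[(u, v)] = (interference[u][last_core[v]]
                         + interference[v][last_core[u]])
    return graph


graph = build_interference_graph(last_core, interference)
for edge, weight in graph.items():
    print(edge, weight)
```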
[Diagram, final frame: the interference graph (edge weights 70, 45, 30, 75, 85, 60) next to the result, which shows P1 (A) joined to P2 (B) by the edge of weight 85 and P0 (A) joined to P3 (B) by the edge of weight 45.]
A3: Weighted Interference Graph Algorithm
• To address high interference issues
• Weight the edges of the interference graph
• The rest are the same as A2
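[Diagram: each process with its occupancy weight (OW), its interference metric against Core A (CA) and Core B (CB), and the core it last ran on.]
Process  OW   CA  CB  Was in
P0       90   20  30  CA
P1       85   10  45  CA
P2       50   40  25  CB
P3       100  15  50  CB
[Interference graph: nodes P0 (A), P1 (A), P2 (B), P3 (B); the edge between P0 and P2 carries the labels 90*30 and 50*40.]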
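The edge weighting can be illustrated with the slide's numbers. The slide only shows the two products on the P0 to P2 edge, 90*30 and 50*40; adding them into one edge weight is an assumption made here by analogy with the unweighted graph. The function name and data layout are illustrative.

```python
from itertools import combinations

# Values from the slide: occupancy weight (OW), last core, and the
# interference metric against Core A (CA) and Core B (CB).
occupancy = {"P0": 90, "P1": 85, "P2": 50, "P3": 100}
last_core = {"P0": "A", "P1": "A", "P2": "B", "P3": "B"}
interference = {
    "P0": {"A": 20, "B": 30},
    "P1": {"A": 10, "B": 45},
    "P2": {"A": 40, "B": 25},
    "P3": {"A": 15, "B": 50},
}


def build_weighted_graph(occupancy, last_core, interference):
    graph = {}
    for u, v in combinations(sorted(last_core), 2):
        # Each endpoint's interference term is scaled by its occupancy
        # weight; summing the two products is an assumption.
        graph[(u, v)] = (occupancy[u] * interference[u][last_core[v]]
                         + occupancy[v] * interference[v][last_core[u]])
    return graph


weighted = build_weighted_graph(occupancy, last_core, interference)
print(weighted[("P0", "P2")])   # 90*30 + 50*40 = 4700
```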
Performance Evaluation
Evaluation Methodology
[Diagram labels: processes P1 ... PN; Native x86 Run (Fedora Linux on Intel Core 2); Gather Footprint in Emulator (Simics x86, "magic" interface); Process-to-Core Mapping; VM Run (Linux guests on Xen Hypervisor, Intel Core 2).]
Performance Results
[Chart: performance improvement per benchmark (y-axis, 0% to 60%). Benchmarks: astar, gobmk, hmmer, lbm, libquantum, mcf, omnetpp, perlbench, povray, soplex, sphinx, xalancbmk.]
Maximum performance improvement of up to 54%
[Chart: performance improvement per benchmark (y-axis, 0% to 25%), same benchmarks.]
Average performance improvement of up to 23%
Performance of Virtualized Systems
[Chart: performance improvement per benchmark (y-axis, 0% to 10%). Benchmarks: astar, gobmk, hmmer, lbm, libquantum, mcf, omnetpp, perlbench, povray, soplex, sphinx, xalancbmk.]
Maximum performance improvement of up to 26%
Average performance improvement of up to 9.5%
Performance Sensitivity of 3 Algorithms
[Chart: Performance Benefit (y-axis, 0% to 16%) per Application Mix for three algorithms: Sorted, Graph, Weighted Graph. The mixes are drawn from mcf, gobmk, povray, omnetpp, hmmer, libquantum and perlbench.]
Weighted Interference Graph has the best performance
Conclusion
• Shared Resource (e.g., LLC) Management is Critical
• Capturing Cache Reference Behavior for Processes
• Symbiotic Scheduling with Bloom Filter Signature
• Measured Speedup of 22% (up to 54%) on Intel Core 2
That’s All, Folks!