Transcript Slide 1

SLICC
Self-Assembly of Instruction Cache Collectives
for OLTP Workloads
Islam Atta
Pınar Tözün
Andreas Moshovos
Anastasia Ailamaki
Online Transaction Processing (OLTP)
$100 Billion/Yr, +10% annually
• E.g., banking, online purchases, stock market…
Benchmarking
• Transaction Processing Performance Council (TPC)
• TPC-C: Wholesale retailer
• TPC-E: Brokerage market
OLTP drives innovation for HW and DB vendors
Transactions Suffer from Instruction Misses
Many concurrent transactions
[Figure: each transaction's instruction footprint over time; footprints exceed the L1-I size]
Instruction Stalls due to L1 Instruction Cache Thrashing
Even on a CMP all Transactions Suffer
[Figure: transactions spread across cores over time, with every core's L1-I cache thrashing]
All caches thrashed with similar code blocks
Opportunity
Technology:
• CMP’s aggregate L1 instruction cache capacity is large enough (multiple L1-I caches)
Application Behavior:
• Instruction overlap within and across transactions (multiple threads)
Footprint over Multiple Cores → Reduced Instruction Misses
SLICC Overview
Dynamic Hardware Solution
• How to divide a transaction
• When to move
• Where to go
Performance
• Reduces instruction misses by 44% (TPC-C), 68% (TPC-E)
• Performance improves by 60% (TPC-C), 79% (TPC-E)
Robust:
• Non-OLTP workloads remain unaffected
Talk Roadmap
• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
OLTP Facts
Many concurrent transactions
Few DB operations
• 28–65KB each: R(), U(), I(), D(), IT(), ITP()
[Figure: transactions such as New Order and Payment composed from these operations]
Few transaction types
• TPC-C: 5, TPC-E: 12
Transactions fit in 128-512KB
Overlap within and across different transactions
CMPs’ aggregate L1-I cache is large enough
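A back-of-the-envelope check of that claim, using the 16-core, 32KB-L1-I configuration from the methodology slide:

```cpp
// Back-of-the-envelope check of the capacity claim, using the CMP
// configuration from the methodology slide (16 cores, 32KB L1-I each).
#include <cstdio>

int main() {
    const int cores = 16, l1i_kb = 32;
    const int aggregate_kb = cores * l1i_kb;  // 512KB of aggregate L1-I
    const int footprint_kb = 512;             // largest quoted footprint (128-512KB)
    printf("aggregate L1-I = %d KB, footprint <= %d KB -> fits: %s\n",
           aggregate_kb, footprint_kb,
           aggregate_kb >= footprint_kb ? "yes" : "no");
    return 0;
}
```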
Instruction Commonality Across Transactions
[Figure: instruction-commonality heat maps for TPC-C and TPC-E, across all threads and per transaction type; more yellow = more reuse (scale: single → few → most threads sharing a block)]
Lots of code reuse
Even higher across same-type transactions
Requirements
Enable usage of aggregate L1-I capacity
• Large cache size without increased latency
Exploit instruction commonality
• Localizes common transaction instructions
Dynamic
• Independent of footprint size or cache configuration
Talk Roadmap
• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
Example for Concurrent Transactions
[Figure: control-flow graphs of concurrent transactions T1, T2, and T3, divided into code segments that can each fit into the L1-I]
Scheduling Threads
Threads T1–T3 run on cores 0–3.
[Figure: conventional scheduling (left) pins each thread to one core, refilling that core's L1-I for every new code segment; SLICC (right) migrates threads so segments stay resident across the cores' L1-I caches]
Conventional: cache filled 10 times
SLICC: cache filled 4 times
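A minimal sketch of this counting argument (simplified policy, not the paper's exact mechanism; the segment traces are hypothetical, chosen only to illustrate the contrast):

```cpp
// Minimal sketch of the cache-fill counting above: each L1-I holds exactly
// one code segment, and a thread is a sequence of segment IDs.
#include <cstdio>
#include <vector>

// Conventional: each thread is pinned to its own core.
int fills_conventional(const std::vector<std::vector<int>>& threads) {
    int fills = 0;
    for (const auto& t : threads) {
        int cached = -1;                       // segment in this core's L1-I
        for (int seg : t)
            if (seg != cached) { ++fills; cached = seg; }
    }
    return fills;
}

// SLICC-style: a thread migrates to whichever core already holds its next
// segment; a fill happens only when no core holds it.
int fills_slicc(const std::vector<std::vector<int>>& threads, int ncores) {
    std::vector<int> cache(ncores, -1);        // segment held by each core
    std::vector<size_t> pc(threads.size(), 0); // per-thread position
    int fills = 0, victim = 0;
    for (bool work = true; work; ) {
        work = false;
        for (size_t i = 0; i < threads.size(); ++i) {
            if (pc[i] >= threads[i].size()) continue;
            work = true;
            int seg = threads[i][pc[i]++];
            bool hit = false;
            for (int c : cache) if (c == seg) { hit = true; break; }
            if (!hit) {                        // fill a round-robin victim core
                cache[victim] = seg;
                victim = (victim + 1) % ncores;
                ++fills;
            }
        }
    }
    return fills;
}

int main() {
    // Three transactions over shared segments 0..3 (toy traces).
    std::vector<std::vector<int>> threads = {{0,1,2,3}, {0,1,3}, {0,2,3}};
    printf("conventional: %d fills\n", fills_conventional(threads)); // 10
    printf("SLICC:        %d fills\n", fills_slicc(threads, 4));     // 4
}
```

With these toy traces the pinned schedule triggers 10 fills and the migrating schedule only 4, mirroring the counts on the slide.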
Talk Roadmap
• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
Migration Ingredients
When to migrate?
Step 1: Detect: cache full
Step 2: Detect: new code segment
Where to go?
Step 3: Predict where the next code segment is
Migration Ingredients
[Figure: thread T1 walking across cores over time, annotated: loops keep the thread on its core, a full cache on a new segment triggers a move, the thread may later return to a core, and idle cores serve as migration targets]
Migration Ingredients
[Figure: the same three steps illustrated for a second thread, T2, sharing the cores over time]
Implementation
When to migrate?
Step 1: Detect: cache full → Miss Counter
Step 2: Detect: new segment → Miss Dilution
Where to go?
Step 3: Where is the next segment? → Find signature blocks on remote cores
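A hedged sketch of the two "when to migrate" triggers (my reading of the mechanism; the window width and threshold values are assumptions, with the real Fill-up_t and Dilution_t swept in the backup slides):

```cpp
// Hedged sketch of the migration triggers: a miss counter detects "cache
// full", and a shift vector of recent hit/miss outcomes detects entry into
// a new code segment.
#include <bitset>
#include <cstdint>

struct MigrationTrigger {
    static constexpr uint32_t kFillUpT   = 512; // ~ #blocks in a 32KB, 64B-line L1-I
    static constexpr size_t   kWindow    = 32;  // MSV width (assumed)
    static constexpr size_t   kDilutionT = 8;   // dense-miss threshold (assumed)

    uint32_t miss_counter = 0;                  // MC: misses since last migration
    std::bitset<kWindow> msv;                   // 1 = miss, 0 = hit, LSB most recent

    // Called on every L1-I access; true means "migrate now".
    bool on_access(bool miss) {
        if (miss) ++miss_counter;
        bool cache_full = miss_counter >= kFillUpT;   // Step 1: cache is full
        if (cache_full) {                             // Step 2: track dilution
            msv <<= 1;
            msv.set(0, miss);
        }
        // Migrate when the cache is full AND recent misses are dense
        // (not diluted by hits, so a new segment is streaming in).
        return cache_full && msv.count() >= kDilutionT;
    }

    void reset() { miss_counter = 0; msv.reset(); }   // after migrating
};

int main() {
    MigrationTrigger t;
    bool migrate = false;
    for (int i = 0; i < 600 && !migrate; ++i)
        migrate = t.on_access(true);          // a long streak of misses
    return migrate ? 0 : 1;                   // fires once full + dense misses
}
```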
Boosting Effectiveness
More overlap across transactions of the same type
SLICC: Transaction Type-oblivious
Transaction Type-aware variants:
• SLICC-Pp: Pre-processing to detect similar transactions
• SLICC-SW: Software provides transaction-type information
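A minimal sketch of the type-aware idea (thread IDs and type names are illustrative only): group same-type transactions so they share a core pool and, therefore, code.

```cpp
// Hypothetical sketch: schedule same-type transactions onto the same core
// pool so their common code stays cache-resident.
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // (thread, type) pairs, e.g. as reported by the DBMS under SLICC-SW.
    std::vector<std::pair<int, std::string>> threads =
        {{0, "NewOrder"}, {1, "Payment"}, {2, "NewOrder"}, {3, "Payment"}};

    std::map<std::string, std::vector<int>> pools;    // type -> thread pool
    for (const auto& [tid, type] : threads) pools[type].push_back(tid);

    for (const auto& [type, tids] : pools) {
        printf("%s pool:", type.c_str());
        for (int t : tids) printf(" T%d", t);
        printf("\n");
    }
}
```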
Talk Roadmap
• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
Experimental Evaluation
How does SLICC affect INSTRUCTION misses?
→ Our primary goal
How does it affect DATA misses?
→ Expected to increase; by how much?
Performance impact:
→ Are DATA misses and MIGRATION OVERHEADS amortized?
Methodology
Simulation
• Zesto (x86)
• 16 OoO cores, 32KB L1-I, 32KB L1-D, 1MB per core L2
• QEMU extension
• User and Kernel space
Workloads
• TPC-C, TPC-E on Shore-MT, plus MapReduce
Effect on Misses
Baseline: no effort to reduce instruction misses
[Figure: I-MPKI and D-MPKI for Base, SLICC, and SLICC-SW on TPC-C-10, TPC-E, and MapReduce; lower is better]
Reduce I-MPKI by 58%. Increase D-MPKI by 7%.
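For reference, MPKI normalizes miss counts by instructions executed; a minimal example of the metric (numbers illustrative, not measured):

```cpp
// MPKI = misses per kilo-instruction, the metric on this slide.
#include <cstdio>

double mpki(double misses, double instructions) {
    return misses / (instructions / 1000.0);
}

int main() {
    printf("I-MPKI = %.1f\n", mpki(4.0e6, 100.0e6)); // 4M misses / 100M instr = 40.0
}
```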
Performance
Next-line: always prefetch the next-line
Upper bound for Proactive Instruction Fetch (PIF) [Ferdman et al., MICRO’11]
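For clarity, the Next-Line baseline simply fetches the cache line after the one being accessed; a minimal sketch (64B lines assumed):

```cpp
// Minimal sketch of the Next-Line baseline: on an access to line A,
// also fetch line A+1 (64B lines assumed).
#include <cstdio>

constexpr unsigned long kLineBytes = 64;

unsigned long next_line(unsigned long addr) {
    return (addr / kLineBytes + 1) * kLineBytes;  // start of the following line
}

int main() {
    printf("access 0x%lx -> prefetch 0x%lx\n", 0x1000ul, next_line(0x1000ul));
}
```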
[Figure: speedup over baseline for Next-Line, PIF-No Overhead, SLICC, and SLICC-SW on TPC-C-1, TPC-C-10, TPC-E, and MapReduce; higher is better]
Storage per core: PIF ~40KB, SLICC <1KB
TPC-C: +60%, TPC-E: +79%
Summary
OLTP’s performance suffers due to instruction stalls.
Technology & Application Opportunities:
• Instruction footprint fits in the aggregate L1-I capacity of CMPs.
• Inter- and intra-thread instruction locality.
SLICC:
• Thread migration → spreads the instruction footprint over multiple cores.
• Reduces I-MPKI by 58%.
• Improves performance: +70% vs. baseline, +44% vs. next-line, ±2% to +21% vs. PIF.
Thanks!
Email: [email protected]
Website: http://islamatta.com
Why do data misses increase?
Example: a thread migrates from core A → core B.
• Reads on core B must re-fetch data already fetched on core A.
• Writes on core B invalidate data cached on core A.
• When the thread returns to core A, its cache blocks might have been evicted by other threads.
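A toy illustration of the first bullet (the cache model and addresses are illustrative only): the working set must be re-fetched on the new core.

```cpp
// After migrating, the thread's data working set misses again on the new core.
#include <cstdio>
#include <set>

struct L1D { std::set<long> lines; };          // toy private data cache

bool access(L1D& c, long addr) {               // returns true on a miss
    bool miss = c.lines.count(addr) == 0;
    c.lines.insert(addr);
    return miss;
}

int main() {
    L1D coreA, coreB;
    const long working_set[] = {0x100, 0x140, 0x180};
    int misses = 0;
    for (long a : working_set) misses += access(coreA, a);  // warm up on core A
    for (long a : working_set) misses += access(coreB, a);  // after migrating: all miss
    printf("data misses: %d (vs. 3 without migration)\n", misses); // 6
}
```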
SLICC Agent per Core
[Block diagram, reconstructed:]
• Cache Full Detection: a Miss Counter (MC) accumulates access outcomes (Miss = 1, Hit = 0); when MC ≥ Fill-up_t, remote searching and MSV shifting are enabled.
• Miss Dilution Tracking: outcomes shift into the Miss Shift-Vector (MSV); when the count of "1"s ≥ Dilution_t, migration is enabled.
• Remote Cache Segment Search: a Miss Tag Queue (MTQ) records missed block tags for locating the missed blocks on remote cores; a remote core with ≥ Matched_t matching entries is selected as the migration target.
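A hedged sketch of the "Select Matching Core" step (exact sets stand in for the partial Bloom-filter signatures; Matched_t and all tag values are illustrative):

```cpp
// The migrating thread's recently missed tags (the MTQ) are checked against
// each remote core's signature of resident blocks; the best match wins.
#include <cstdio>
#include <set>
#include <vector>

using Tag = unsigned long;

int select_matching_core(const std::vector<std::set<Tag>>& signatures,
                         const std::vector<Tag>& mtq,
                         int self, int matched_t) {
    int best_core = -1, best = matched_t - 1;     // need >= Matched_t matches
    for (int c = 0; c < (int)signatures.size(); ++c) {
        if (c == self) continue;
        int hits = 0;
        for (Tag t : mtq) hits += (int)signatures[c].count(t);
        if (hits > best) { best = hits; best_core = c; }
    }
    return best_core;                             // -1: no suitable target, stay
}

int main() {
    std::vector<std::set<Tag>> sig = {{1, 2}, {3, 4, 5}, {6, 7}}; // per-core contents
    std::vector<Tag> mtq = {3, 4, 9};             // tags this thread just missed on
    printf("migrate to core %d\n",
           select_matching_core(sig, mtq, /*self=*/0, /*matched_t=*/2)); // core 1
}
```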
Detailed Methodology
• Zesto (x86)
• Qtrace (QEMU extension)
• Shore-MT
Hardware Cost
Larger I-caches?
[Figure: MPKI (conflict, capacity, compulsory breakdown) and speedup vs. instruction and data cache size from 16KB to 512KB, for TPC-C-10, TPC-E, and MapReduce; lower MPKI and higher speedup are better]
Different Replacement Policies?
[Figure: L1 instruction MPKI under LRU, LIP, BIP, DIP, SRRIP, BRRIP, and DRRIP replacement for TPC-C, TPC-E, and MapReduce; lower is better]
Parameter Space (1)
[Figure: I-MPKI, D-MPKI, and speedup vs. Base for TPC-C and TPC-E, sweeping Fill-up_t from 128 to 512 (top) and Matched_t from 2 to 10 (bottom); lower MPKI and higher speedup are better]
Parameter Space (2)
[Figure: I-MPKI, D-MPKI, and speedup for TPC-C and TPC-E, sweeping Dilution_t from 2 to 30; lower MPKI and higher speedup are better]
Cache Signature Accuracy
Partial Bloom Filter
[Figure: BF accuracy (%) for TPC-C and TPC-E with filter sizes from 512 to 8K entries; higher is better]
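A minimal sketch of such a signature (the filter size and hash functions are assumptions, not the paper's design): each core summarizes its resident block tags in a Bloom filter that remote agents can query; false positives are what bound the accuracy plotted above.

```cpp
// Cache signature as a Bloom filter (the talk's "partial Bloom filter").
// Queries can give false positives but never false negatives.
#include <bitset>
#include <cstdint>
#include <cstdio>

struct Signature {
    static constexpr size_t kBits = 2048;        // filter size (assumed)
    std::bitset<kBits> bits;

    // Two cheap hashes of a block tag (illustrative choices).
    static size_t h1(uint64_t t) { return (t * 0x9E3779B97F4A7C15ULL) % kBits; }
    static size_t h2(uint64_t t) { return ((t ^ (t >> 17)) * 0xC2B2AE3D27D4EB4FULL) % kBits; }

    void insert(uint64_t tag)               { bits.set(h1(tag)); bits.set(h2(tag)); }
    bool maybe_contains(uint64_t tag) const { return bits[h1(tag)] && bits[h2(tag)]; }
};

int main() {
    Signature sig;
    for (uint64_t tag = 0; tag < 512; ++tag) sig.insert(tag);  // resident blocks
    int false_pos = 0;
    for (uint64_t tag = 10000; tag < 20000; ++tag)             // 10K absent tags
        false_pos += sig.maybe_contains(tag);
    printf("accuracy on absent tags: %.1f%%\n",
           100.0 * (10000 - false_pos) / 10000);
}
```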