Transcript Slide 1
SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads
Islam Atta, Pınar Tözün, Andreas Moshovos, Anastasia Ailamaki

Transcript Slide 2
Online Transaction Processing (OLTP)
A $100 billion/year industry, growing about 10% annually.
• E.g., banking, online purchases, the stock market…
Benchmarking:
• Transaction Processing Performance Council (TPC)
• TPC-C: models a wholesale retailer
• TPC-E: models a brokerage market
OLTP drives innovation for hardware and database vendors.

Transcript Slide 3
Transactions Suffer from Instruction Misses
Many transactions run concurrently, and each transaction's instruction footprint exceeds the L1-I size.
[Figure: per-transaction instruction footprint over time, compared against the L1-I size]
Transactions stall because the L1 instruction cache thrashes.

Transcript Slide 4
Even on a CMP, All Transactions Suffer
[Figure: transactions spread across the cores of a chip multiprocessor over time; every L1-I cache thrashes]
All caches are thrashed with similar code blocks.

Transcript Slide 5
Opportunity
Technology: a CMP's aggregate L1 instruction cache capacity (multiple L1-I caches) is large enough.
Application behavior: instructions overlap within and across transactions (multiple threads).
Spreading the footprint over multiple cores reduces instruction misses.

Transcript Slide 6
SLICC Overview
A dynamic hardware solution that decides:
• How to divide a transaction
• When to move
• Where to go
Performance:
• Reduces instruction misses by 44% (TPC-C) and 68% (TPC-E)
• Improves performance by 60% (TPC-C) and 79% (TPC-E)
Robust: non-OLTP workloads remain unaffected.

Transcript Slide 7
Talk Roadmap
• Intra-/inter-thread instruction locality is high
• SLICC concept
• SLICC ingredients
• Results
• Summary

Transcript Slide 8
OLTP Facts
Many concurrent transactions are built from a few DB operations, each 28–65KB of code.
[Figure: the New Order and Payment transactions as sequences of DB operations — R(), U(), I(), D(), IT(), ITP()]
Few transaction types: TPC-C defines 5, TPC-E defines 12.
Whole transactions fit in 128–512KB, and their instructions overlap within and across different transactions.
The CMP's aggregate L1-I cache capacity is large enough.

Transcript Slide 9
Instruction Commonality Across Transactions
[Figure: instruction-reuse heat maps for TPC-C and TPC-E, over all threads and per transaction type; shading runs from "single" through "few" to "most" threads — more yellow means more reuse]
There is lots of code reuse, and even more across transactions of the same type.

Transcript Slide 10
Requirements
Enable use of the aggregate L1-I capacity:
• Large effective cache size without increased latency.
Exploit instruction commonality:
• Localize instructions common across transactions.
Stay dynamic:
• Independent of footprint size and cache configuration.

Transcript Slide 11
Talk Roadmap (revisited) — next: the SLICC concept.

Transcript Slide 12
Example of Concurrent Transactions
[Figure: control flow graphs of three concurrent transactions T1, T2, T3, partitioned into code segments that can each fit in an L1-I cache]

Transcript Slide 13
Scheduling Threads
[Figure: threads T1, T2, T3 scheduled over time on cores 0–3, conventional vs. SLICC]
Conventional scheduling: each core runs a whole thread, so each L1-I cycles through the full footprint — the caches are filled 10 times.
SLICC: threads migrate between cores so that each L1-I holds one code segment — the caches are filled only 4 times.

Transcript Slide 14
Talk Roadmap (revisited) — next: the SLICC ingredients.

Transcript Slide 15
Migration Ingredients
When to migrate?
• Step 1 — detect that the cache is full.
• Step 2 — detect a new code segment.
Where to go?
• Step 3 — predict where the next code segment resides.

Transcript Slide 16
Migration Ingredients (illustrated on thread T1)
[Figure: T1 fills one L1-I (Step 1), detects a new code segment (Step 2), and moves to an idle core (Step 3); loops keep it within a segment, and it may return to a previously visited core, at the cost of some idle time]

Transcript Slide 17
Migration Ingredients (illustrated on thread T2)
[Figure: the same three steps traced for thread T2]

Transcript Slide 18
Implementation
• Step 1 — detect cache full: a miss counter.
• Step 2 — detect a new segment: miss dilution.
• Step 3 — find the next segment: search for signature blocks on remote cores.
An illustrative software model of this decision loop follows.
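The three mechanisms above map onto a small per-core decision loop. Below is a minimal software sketch of that loop, illustrative rather than the paper's hardware: the threshold names Fill-up_t, Dilution_t, and Matched_t come from the slides (backup Slide 28 shows the actual agent), while the window length, the concrete threshold values, the class and method names, and the signature lookup are assumptions made for this example.

```python
# Toy software model of SLICC's three migration steps (illustrative only).
# Threshold names follow the slides; all concrete values are assumptions.

FILL_UP_T = 256    # Step 1: misses before the local L1-I counts as full (assumed)
DILUTION_T = 6     # Step 2: recent misses that signal a new segment (assumed)
MATCHED_T = 4      # Step 3: matching blocks needed to pick a remote core (assumed)
WINDOW = 32        # length of the hit/miss history window (assumed)

class SliccAgent:
    def __init__(self, num_cores):
        self.miss_counter = 0   # MC analogue: misses since the last migration
        self.history = []       # MSV analogue: recent hit(0)/miss(1) outcomes
        self.miss_tags = []     # MTQ analogue: tags of recently missed blocks
        self.num_cores = num_cores

    def record_access(self, tag, is_miss):
        # Shift the newest outcome into the bounded history window.
        self.history = (self.history + [1 if is_miss else 0])[-WINDOW:]
        if is_miss:
            self.miss_counter += 1
            self.miss_tags = (self.miss_tags + [tag])[-WINDOW:]

    def should_migrate(self):
        cache_full = self.miss_counter >= FILL_UP_T       # Step 1
        # Step 2: a dense burst of recent misses suggests a new code
        # segment (a simplified stand-in for miss-dilution tracking).
        new_segment = sum(self.history) >= DILUTION_T
        return cache_full and new_segment

    def pick_core(self, remote_signatures):
        # Step 3: count how many recently missed blocks each remote L1-I
        # already holds (per its signature) and pick the best match.
        best_core, best_hits = None, 0
        for core, signature in remote_signatures.items():
            hits = sum(1 for t in self.miss_tags if t in signature)
            if hits > best_hits:
                best_core, best_hits = core, hits
        return best_core if best_hits >= MATCHED_T else None
```

In this toy model a thread would call record_access() on every L1-I access, poll should_migrate(), and on success ask pick_core() for the remote core whose cache signature covers the most recently missed blocks; backup Slide 35 measures how well a partial Bloom filter can stand in for such signatures.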
Transcript Slide 19
Boosting Effectiveness
There is more overlap across transactions of the same type.
• SLICC is transaction type-oblivious.
Transaction type-aware variants:
• SLICC-Pp: pre-processing detects similar transactions.
• SLICC-SW: the software provides the type information.

Transcript Slide 20
Talk Roadmap (revisited) — next: results.

Transcript Slide 21
Experimental Evaluation
How does SLICC affect instruction misses? That is our primary goal.
How does it affect data misses? They are expected to increase — by how much?
Performance impact: are the data misses and migration overheads amortized?

Transcript Slide 22
Methodology
Simulation:
• Zesto (x86): 16 out-of-order cores, 32KB L1-I, 32KB L1-D, 1MB per-core L2.
• A QEMU extension traces both user and kernel space.
Workloads: TPC-C and TPC-E running on Shore-MT, plus MapReduce.

Transcript Slide 23
Effect on Misses
Baseline: no effort to reduce instruction misses.
[Figure: I-MPKI and D-MPKI (misses per kilo-instruction, 0–45) for Base, SLICC, and SLICC-SW on TPC-C-10, TPC-E, and MapReduce; lower is better]
SLICC reduces I-MPKI by 58% and increases D-MPKI by 7%.

Transcript Slide 24
Performance
Next-line: always prefetch the next cache line.
PIF-No Overhead: an upper bound for Proactive Instruction Fetch [Ferdman et al., MICRO'11].
[Figure: speedup (1.0–2.0) of Next-Line, PIF-No Overhead, SLICC, and SLICC-SW on TPC-C-1, TPC-C-10, TPC-E, and MapReduce; higher is better]
Storage per core: PIF needs ~40KB, SLICC less than 1KB.
Speedup: TPC-C +60%, TPC-E +79%.

Transcript Slide 25
Summary
OLTP performance suffers from instruction stalls.
Technology and application opportunities:
• The instruction footprint fits in the aggregate L1-I capacity of a CMP.
• Inter- and intra-thread instruction locality is high.
SLICC:
• Migrates threads to spread the instruction footprint over multiple cores.
• Reduces I-MPKI by 58%.
• Improves performance by +70% over the baseline, +44% over next-line, and ±2% to +21% over PIF.

Transcript Slide 26
Thanks!
Email: [email protected]
Website: http://islamatta.com

Transcript Slide 27 (backup)
Why Do Data Misses Increase?
Example: a thread migrates from core A to core B.
• Data fetched on core A must be fetched again when read on core B.
• Writes on core B invalidate copies cached on core A.
• By the time the thread returns to core A, its blocks may have been evicted by other threads.

Transcript Slide 28 (backup)
SLICC Agent per Core
[Diagram: the per-core agent consumes a hit(0)/miss(1) signal for each L1-I access]
• Cache-full detection: a Miss Counter (MC); when MC ≥ Fill-up_t, searching is enabled.
• Miss-dilution tracking: a Miss Shift-Vector (MSV) shifts in the hit/miss outcomes; when the count of 1s reaches Dilution_t, the thread has entered a new segment.
• Remote cache segment search: a Miss TagQueue (MTQ) records missed blocks, which are located on remote cores; if a core matches at least Matched_t entries, it is selected and migration is enabled.

Transcript Slide 29 (backup)
Detailed Methodology
[Table: the simulation stack — Zesto (x86), Qtrace (a QEMU extension), and Shore-MT]

Transcript Slide 30 (backup)
Hardware Cost
[Table: per-core storage cost of the SLICC structures]

Transcript Slide 31 (backup)
Larger I-caches?
[Figure: instruction and data MPKI split into conflict, capacity, and compulsory misses (0–60), plus speedup (0–1.4), for cache sizes of 16–512KB on TPC-C-10, TPC-E, and MapReduce]
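Slide 31 splits misses into the classic compulsory/capacity/conflict (3C) categories. The sketch below shows the textbook way such a breakdown is computed from an address trace — a standard technique, not the paper's tooling: a miss is compulsory on the first touch of a block, a capacity miss if an equally sized fully associative LRU cache also misses, and a conflict miss otherwise. The cache geometry and function name are assumptions.

```python
from collections import OrderedDict

BLOCK = 64                # cache block size in bytes (assumed)
SETS, WAYS = 64, 8        # 32KB, 8-way set-associative cache (assumed)

seen = set()                                      # blocks ever touched
full_assoc = OrderedDict()                        # fully associative LRU model
set_assoc = [OrderedDict() for _ in range(SETS)]  # per-set LRU model

def classify(addr):
    """Classify one access: 'hit', 'compulsory', 'capacity', or 'conflict'."""
    blk = addr // BLOCK
    first_touch = blk not in seen
    seen.add(blk)

    # Fully associative reference model: it suffers only
    # compulsory and capacity misses, never conflict misses.
    fa_miss = blk not in full_assoc
    full_assoc[blk] = None
    full_assoc.move_to_end(blk)                   # mark most recently used
    if len(full_assoc) > SETS * WAYS:
        full_assoc.popitem(last=False)            # evict the LRU block

    # Set-associative model of the real cache.
    ways = set_assoc[blk % SETS]
    sa_miss = blk not in ways
    ways[blk] = None
    ways.move_to_end(blk)
    if len(ways) > WAYS:
        ways.popitem(last=False)

    if not sa_miss:
        return "hit"
    if first_touch:
        return "compulsory"
    return "capacity" if fa_miss else "conflict"
```

Tallying classify() over an instruction-address trace yields the kind of stacked breakdown plotted above.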
Transcript Slide 32 (backup)
Different Replacement Policies?
[Figure: L1 instruction MPKI (0–40) under the LRU, LIP, BIP, DIP, SRRIP, BRRIP, and DRRIP policies on TPC-C, TPC-E, and MapReduce; lower is better]

Transcript Slide 33 (backup)
Parameter Space (1)
[Figure: I-MPKI, D-MPKI (0–70), and speedup (0–1.6) for TPC-C and TPC-E while sweeping Fill-up_t over 128–512 (top) and Matched_t over 2–10 (bottom)]

Transcript Slide 34 (backup)
Parameter Space (2)
[Figure: I-MPKI, D-MPKI (0–60), and speedup (0–2.0) for TPC-C and TPC-E while sweeping Dilution_t from 2 to 30]

Transcript Slide 35 (backup)
Cache Signature Accuracy
The cache signatures are partial Bloom filters.
[Figure: Bloom-filter accuracy (96–100%) for signature sizes of 512, 1K, 2K, 4K, and 8K entries on TPC-C and TPC-E; higher is better]
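Backup Slide 35 quantifies how accurately a compact Bloom-filter signature summarizes a cache's contents for the remote search in Step 3. Below is a generic Bloom-filter signature sketch; the slides say only that a partial Bloom filter is used, so the size, hash construction, and class name here are illustrative assumptions.

```python
import hashlib

class CacheSignature:
    """Bloom-filter summary of the block tags resident in one L1-I cache."""

    def __init__(self, bits=2048, hashes=2):
        self.bits = bits          # signature size in bits (assumed)
        self.hashes = hashes      # bit positions set per tag (assumed)
        self.bitmap = 0

    def _positions(self, tag):
        # Derive independent bit positions from one cryptographic digest.
        digest = hashlib.sha256(str(tag).encode()).digest()
        for i in range(self.hashes):
            chunk = digest[4 * i : 4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.bits

    def insert(self, tag):
        # Record a cache fill by setting the tag's bit positions.
        for pos in self._positions(tag):
            self.bitmap |= 1 << pos

    def may_contain(self, tag):
        # False positives are possible (hence accuracy below 100% on
        # Slide 35); false negatives are not, as long as no bit is cleared.
        return all(self.bitmap >> pos & 1 for pos in self._positions(tag))
```

A plain Bloom filter cannot clear bits when blocks are evicted, so signatures must be rebuilt or approximated periodically — one reason hardware proposals favor partial or counting variants; the false positives in may_contain() are why the measured accuracy stays just under 100%.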