Brown University Hideaki Kimura* [email protected] (*) Graduates soon. On Job Market. Microsoft Research Vivek Narasayya Manoj Syamala {viveknar,manojsy}@microsoft.com.
Download ReportTranscript Brown University Hideaki Kimura* [email protected] (*) Graduates soon. On Job Market. Microsoft Research Vivek Narasayya Manoj Syamala {viveknar,manojsy}@microsoft.com.
Brown University Hideaki Kimura* [email protected] (*) Graduates soon. On Job Market. Microsoft Research Vivek Narasayya Manoj Syamala {viveknar,manojsy}@microsoft.com Every Major DBMS Supports Saves Storage Consumption Saves I/O Bandwidth DBMS A: 4x! DBMS B: 10x! DBMS C: 12x! Query Process Engine Tables, Indexes Decompress SELECT Compressed Data Compress INSERT 2/28 Dictionary Encoding City Seattle 1 San Jose 2 Seattle 1 .. .. + NULL Suppression Dict. Price 1:Sea ttle 000321 @321 000054 @54 2:San Jose 000015 @15 .. .. ◦ Local dict. (Oracle, SQL Server) ◦ Global dict. (DB2) Prefix Suppression, LZO, RLE… 3/28 B X X Y Y X X Y Y B X X X X Y Y Y Y IBA A 000AA 000AA 00BBB 00BBB 000AA 000AA 00BBB 00BBB IAB A @AA @AA @AA @AA @BBB @BBB @BBB @BBB = IBA A @AA @AA @BBB @BBB @AA @AA @BBB @BBB ◦ NULL-Supp. ◦ Global dict. ◦ … Order Dependent IAB A 000AA ↑ ↑ ↑ 00BBB ↑ ↑ ↑ IBA page IAB A 000AA 000AA 000AA 000AA 00BBB 00BBB 00BBB 00BBB fragmented Order Independent ≠ A 000AA ↑ 00BBB ↑ 000AA ↑ 00BBB ↑ ◦ Run Length Enc. ◦ Local dict. ◦ … 4/28 Saves Storage Space, I/O CPU Overhead to Compress & Decompress Different Compression Scheme = Different Saving ↔ Overhead How Do We Use It? DBA 5/28 Depends 10GB -90% on Data 1GB High Compression Ratio Depends 10GB -10% 9GB Low Compression Ratio on Workload ◦ SELECTs/INSERTs Frequency ◦ CPU bottleneck? IO bottleneck? 6/28 Physical DB Design Tool Q1 Q2 I3 I2 I1 I4 I5 Prune I3 I1 Syntactically Relevant Indexes I5 Select Candidate Configurations Configuration I1 I5 Enumerate Best Configuration DBMS Hypothetical Indexes What-if Analysis Query Optimizer Estimate Runtime 7/28 Run Design Tool to Select Indexes Compress them, then Repeat. Stage 1 Stage 2 Compress! MV MV MV Workload Idx 100 MB Budget 100 MB Idx 50 MB Idx 100 MB 8/28 SELECT SUM(Price*Discount) FROM Sales WHERE State='CA' and Jul 01 < Shipdate < Sep 01 Sales Choice for 100 MB? Shipdate State Price Discount Feb 21 CA $123 10% Jan 9 RI $222 0% Jul 5 TX $213 5% I1 (State, Shipdate): 95 MB → 50 MB I2 (State, Shipdate) Include (Price, Discount): 170 MB → 90 MB Misses an index that makes sense only with compression 9/28 70 Good design: TPC-H, 2ndary Index Only Improvement [%] 60 50 Good 40 design 30 Staged 20 Staged: ? 10 0 0 200 400 600 800 Space Budget [MB] 175MB CREATE COMPRESSED INDEX (L_PARTKEY,L_ORDERKEY,L_ SUPPKEY) INCLUDE (L_QUANTITY,L_EXTENDEDP RICE,L_DISCOUNT) 1000 155MB CREATE INDEX (L_ORDERKEY) INCLUDE(L_SUPPKEY,L_COM MITDATE,L_RECEIPTDATE) 10/28 Choice for 200 MB? I1 (State, Shipdate): 95 MB → 50 MB INSERT INTO Sales … UPDATE Sales SET Price=.. I2 (State, Shipdate) Include (Price, Discount): 170 MB → 90 MB CPU Overheads Result in too high CPU overheads for compression/decompression. 11/28 60 UPDATE Intensive TPC-H, 2ndary Index Only Improvement [%] 50 Good design 40 Staged 30 20 10 0 0 500 1000 1500 2000 Space Budget [MB] 2500 3000 Worse with More Budget! 12/28 How to Estimate Index-size after compression? How to Evaluate benefits/overheads of compression? How Compression affects Candidate Selection/Enumeration? 13/28 Essential Metric of Indexes ◦ To Fit Space Budget ◦ To Estimate I/O cost Need Stats Compression Fraction Col-A Width=8 Col-B Width=4 Col-C Width=10 Clust. Key Width=4 Table #tuple=1M Size (IABC) = (8 + 4 + 10 + 4) * 1M = 26 MB Comp. Size (IABC) = 26 MB * CF (IABC) 14/28 1GB CREATE COMPRESSED INDEX Sample 10MB 20 SampleCF Overheads 15 [min] Table Design Tool Runtime SampleCF [Idreos et al. ICDE'10] 10 5 0 Naïve Implementation Sample Size: Cost ↔ Accuracy Still Expensive for 1,000s of indexes 15/28 Workload Storage bound Candidate Selection Database Engine Tuning Advisor (DTA) Size Estimation Samples SampleCF Temp DB Merging Query Optimizer Enumeration What-if analysis Physical design recommendation (Compression Aware Cost Model) Microsoft SQL Server 16/28 Ia NULL supp. (ORD-IND) Ib Ia,b Sum-up Savings SampleCF Ia Col-Ext Deduction Ib Ia,b Col-Set Deduction Ib,a Local dict. (ORD-DEP) IBA IAB A 000AA / / / 00BBB / / / A 000AA / 00BBB / 000AA / 00BBB / DV I AB , A 1 DV I BA , A 2 LI AB , A 4 LI BA , A 2 Estimate From Run-Length More Details in paper 17/28 Size-Estimation Strategy ◦ Sample Size? ◦ Deduction Path? ◦ Expected Errors? Formulate as Graph Problem Greedy algorithm to solve (details in the paper) 18/28 Query Cost model to consider (De)Compression CPU cost Candidate Selection/Enumeration Key Challenge: Space-Performance Trade-off 19/28 Q1 Q2 IB IA ID IC Compressed Versions IB IA ID IC Add Compressed Indexes Most of them are Ignored! Compressed Indexes are often Select Fastest IA IC Slower-but-Smaller (exception: very high compression ratio) 20/28 Query Cost Slow-small Fast-large Configuration Size Construct Skyline of Configurations Pick Both Fast-Indexes and Small-Indexes 21/28 Greedy picks un-compressed indexes too early Comp. IB 10MB I CB 5MB Seed IA IC IA IA IB 15MB Room IC 10MB IA IB IA IB I CB IC Optimal Design IA I CB IA I CB IC 22/28 Recover oversized configurations Compress indexes in the config. … IA ICC IB IA I CB IC Recover If Oversized IA IA IB IA IB IC 23/28 Implemented on SQL Server 2008 ◦ Modified Database Tuning Advisor (DTA) "DTAc" ◦ Modified Query Cost Model TPC-H Scale-1 (more results in paper) ◦ SELECT-intensive/UPDATE-intensive ◦ Compared Estimated Runtime 24/28 Select Intensive 70 70 60 60 Improvement [%] Improvement [%] 80 50 40 30 20 10 Update Intensive 50 300 700 Budget [MB] 1500 (Both) Skyline 50 40 Backtrack 30 20 DTAc 10 0 0 DTAc 50 300 700 1500 (None) DTA Budget [MB] Clustered/2ndary Indexes Both Skyline & Backtrack are required esp. for tight budget 25/28 Improvement [%] Update Intensive Improvements[%] Select Intensive 80 60 40 DTAc 20 DTA 60 40 DTAc 20 DTA 0 0 0 500 Budget [MB] 1000 0 500 Budget [MB] 1000 Clustered/2ndary/MV Indexes Especially better in tight budget Choose lightly compressed designs in UPDATE-intensive 26/28 Runtime [min] Design Tool MV-Estimate 20 MV-Sample 15 Partial-Estimate 10 Partial-Sample 5 Table-Estimate 0 Table-Sample DTAc w/o DTAc Other Optimization Reduce Size Estimation Overheads for a factor of 3 Mostly <10% Estimation Error 27/28 Opportunities and Challenges Integrated Approach to exploit compression in physical design ◦ Space-Performance Tradeoff ◦ Size Estimation Open Issues ◦ Column-Store 28/28