Brown University Hideaki Kimura* [email protected] (*) Graduates soon. On Job Market. Microsoft Research Vivek Narasayya Manoj Syamala {viveknar,manojsy}@microsoft.com.

Download Report

Transcript Brown University Hideaki Kimura* [email protected] (*) Graduates soon. On Job Market. Microsoft Research Vivek Narasayya Manoj Syamala {viveknar,manojsy}@microsoft.com.

Brown University
Hideaki Kimura*
[email protected]
(*) Graduates soon. On Job Market.
Microsoft Research
Vivek Narasayya
Manoj Syamala
{viveknar,manojsy}@microsoft.com
Every Major DBMS Supports
 Saves Storage Consumption
 Saves I/O Bandwidth

DBMS A: 4x!
DBMS B: 10x!
DBMS C: 12x!
Query Process
Engine
Tables,
Indexes
Decompress
SELECT
Compressed
Data
Compress
INSERT
2/28
Dictionary Encoding
City
Seattle
1
San Jose
2
Seattle
1
..
..
+
NULL Suppression
Dict.
Price
1:Sea
ttle
000321
@321
000054
@54
2:San
Jose
000015
@15
..
..
◦ Local dict. (Oracle, SQL Server)
◦ Global dict. (DB2)
Prefix Suppression,
LZO, RLE…
3/28
B
X
X
Y
Y
X
X
Y
Y
B
X
X
X
X
Y
Y
Y
Y
IBA
A
000AA
000AA
00BBB
00BBB
000AA
000AA
00BBB
00BBB
IAB
A
@AA
@AA
@AA
@AA
@BBB
@BBB
@BBB
@BBB
=
IBA
A
@AA
@AA
@BBB
@BBB
@AA
@AA
@BBB
@BBB
◦ NULL-Supp.
◦ Global dict.
◦ …
Order
Dependent
IAB
A
000AA
↑
↑
↑
00BBB
↑
↑
↑
IBA
page
IAB
A
000AA
000AA
000AA
000AA
00BBB
00BBB
00BBB
00BBB
fragmented
Order
Independent
≠
A
000AA
↑
00BBB
↑
000AA
↑
00BBB
↑
◦ Run Length Enc.
◦ Local dict.
◦ …
4/28
 Saves
Storage Space, I/O
 CPU Overhead to Compress &
Decompress
 Different Compression Scheme
= Different Saving ↔ Overhead
How Do We Use It?
DBA
5/28
 Depends
10GB
-90%
on Data
1GB
High Compression Ratio
 Depends
10GB
-10%
9GB
Low Compression Ratio
on Workload
◦ SELECTs/INSERTs Frequency
◦ CPU bottleneck? IO bottleneck?
6/28
Physical DB Design Tool
Q1
Q2
I3
I2
I1
I4
I5
Prune
I3
I1
Syntactically
Relevant
Indexes
I5
Select
Candidate
Configurations
Configuration
I1
I5
Enumerate
Best
Configuration
DBMS
Hypothetical
Indexes
What-if
Analysis
Query
Optimizer
Estimate Runtime
7/28
 Run
Design Tool to Select Indexes
 Compress them, then Repeat.
Stage 1
Stage 2
Compress!
MV
MV
MV
Workload
Idx
100 MB
Budget
100 MB
Idx
50 MB
Idx
100 MB
8/28
SELECT SUM(Price*Discount) FROM Sales
WHERE State='CA' and Jul 01 < Shipdate < Sep 01
Sales
Choice for 100 MB?
Shipdate
State
Price
Discount
Feb 21
CA
$123
10%
Jan 9
RI
$222
0%
Jul 5
TX
$213
5%

I1 (State, Shipdate):
95 MB → 50 MB
I2 (State, Shipdate)
Include (Price, Discount):
170 MB → 90 MB
Misses an index that makes sense
only with compression
9/28
70
Good design:
TPC-H, 2ndary Index Only
Improvement [%]
60
50
Good
40
design
30
Staged
20
Staged:
?
10
0
0
200
400
600
800
Space Budget [MB]
175MB
CREATE COMPRESSED
INDEX
(L_PARTKEY,L_ORDERKEY,L_
SUPPKEY) INCLUDE
(L_QUANTITY,L_EXTENDEDP
RICE,L_DISCOUNT)
1000
155MB
CREATE INDEX
(L_ORDERKEY)
INCLUDE(L_SUPPKEY,L_COM
MITDATE,L_RECEIPTDATE)
10/28
Choice for 200 MB?
I1 (State, Shipdate):
95 MB → 50 MB
INSERT INTO Sales …
UPDATE Sales SET Price=..
I2 (State, Shipdate)
Include (Price, Discount):
170 MB → 90 MB

CPU Overheads
Result in too high CPU overheads for
compression/decompression.
11/28
60
UPDATE Intensive TPC-H, 2ndary Index Only
Improvement [%]
50
Good
design
40
Staged
30
20
10
0
0
500
1000
1500
2000
Space Budget [MB]
2500
3000
Worse with More Budget!
12/28
 How
to Estimate Index-size after
compression?
 How to Evaluate
benefits/overheads of
compression?
 How Compression affects
Candidate Selection/Enumeration?
13/28
 Essential
Metric of Indexes
◦ To Fit Space Budget
◦ To Estimate I/O cost
 Need
Stats
Compression Fraction
Col-A
Width=8
Col-B
Width=4
Col-C
Width=10
Clust. Key
Width=4
Table
#tuple=1M
Size (IABC) = (8 + 4 + 10 + 4) * 1M = 26 MB
Comp. Size (IABC) = 26 MB * CF (IABC)
14/28
1GB
CREATE
COMPRESSED
INDEX
Sample
10MB
20
SampleCF
Overheads
15
[min]
Table
Design Tool Runtime
 SampleCF [Idreos et al. ICDE'10]
10
5
0
Naïve Implementation
Sample Size: Cost ↔ Accuracy
 Still Expensive for 1,000s of indexes

15/28
Workload
Storage
bound
Candidate Selection
Database
Engine
Tuning
Advisor
(DTA)
Size
Estimation
Samples
SampleCF
Temp DB
Merging
Query Optimizer
Enumeration
What-if
analysis
Physical design
recommendation
(Compression Aware
Cost Model)
Microsoft SQL Server
16/28
Ia
NULL supp. (ORD-IND)
Ib
Ia,b
Sum-up
Savings
SampleCF
Ia
Col-Ext
Deduction
Ib
Ia,b
Col-Set
Deduction
Ib,a
Local dict. (ORD-DEP)
IBA
IAB
A
000AA
/
/
/
00BBB
/
/
/
A
000AA
/
00BBB
/
000AA
/
00BBB
/
DV I AB , A  1
DV I BA , A  2
LI AB , A  4
LI BA , A  2
Estimate From Run-Length
More Details in paper
17/28
 Size-Estimation
Strategy
◦ Sample Size?
◦ Deduction Path?
◦ Expected Errors?
 Formulate
as Graph Problem
 Greedy algorithm to solve
(details in the paper)
18/28
 Query
Cost model to consider
(De)Compression CPU cost
 Candidate Selection/Enumeration
Key Challenge:
Space-Performance Trade-off
19/28
Q1
Q2
IB
IA
ID
IC
Compressed Versions
IB
IA
ID
IC

Add Compressed
Indexes
Most of them
are Ignored!
Compressed Indexes
are often
Select
Fastest
IA
IC
Slower-but-Smaller
(exception: very high
compression ratio)
20/28
Query Cost
Slow-small
Fast-large
Configuration Size
Construct Skyline of Configurations
 Pick Both Fast-Indexes
and Small-Indexes

21/28
 Greedy
picks un-compressed
indexes too early
Comp.
IB 10MB
I CB
5MB
Seed
IA
IC
IA
IA
IB
15MB
Room
IC
10MB
IA
IB
IA
IB
I CB
IC
Optimal Design
IA
I CB
IA
I CB
IC
22/28
 Recover
oversized configurations
 Compress indexes in the config.
…
IA
ICC
IB
IA
I CB
IC
Recover
If Oversized
IA
IA
IB
IA
IB
IC
23/28
 Implemented
on SQL Server 2008
◦ Modified Database Tuning Advisor
(DTA) "DTAc"
◦ Modified Query Cost Model
 TPC-H
Scale-1
(more results in paper)
◦ SELECT-intensive/UPDATE-intensive
◦ Compared Estimated Runtime
24/28
Select Intensive
70
70
60
60
Improvement [%]
Improvement [%]
80
50
40
30
20
10
Update Intensive
50
300
700
Budget [MB]
1500
(Both)
Skyline
50
40
Backtrack
30
20
DTAc
10
0
0
DTAc
50
300
700
1500
(None)
DTA
Budget [MB]
Clustered/2ndary Indexes
 Both
Skyline & Backtrack are
required esp. for tight budget
25/28
Improvement [%]
Update Intensive
Improvements[%]
Select Intensive
80
60
40
DTAc
20
DTA
60
40
DTAc
20
DTA
0
0
0
500
Budget [MB]
1000
0
500
Budget [MB]
1000
Clustered/2ndary/MV Indexes
 Especially
better in tight budget
 Choose lightly compressed
designs in UPDATE-intensive
26/28
Runtime [min]
Design Tool
MV-Estimate
20
MV-Sample
15
Partial-Estimate
10
Partial-Sample
5
Table-Estimate
0
Table-Sample
DTAc w/o
DTAc
Other
Optimization
 Reduce
Size Estimation Overheads
for a factor of 3
 Mostly <10% Estimation Error
27/28
 Opportunities
and Challenges
 Integrated Approach to exploit
compression in physical design
◦ Space-Performance Tradeoff
◦ Size Estimation
 Open
Issues
◦ Column-Store
28/28