CURE for Cubes: Cubing Using a ROLAP Engine
Download
Report
Transcript CURE for Cubes: Cubing Using a ROLAP Engine
VLDB 2006
CURE for Cubes:
Cubing Using a ROLAP Engine
Konstantinos Morfonios
Yannis Ioannidis
University of Athens
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
SELECT region, sum(revenue)
FROM SALES
WHERE month = ‘September’
GROUP BY region
Gray On Data-warehousing:
CUBE
Introduction
SELECT A, B, SUM(M)
C, SUM(M)
FROM R
GROUP BY A, B
B, C
CUBE
SELECT SUM(M)
FROM R
Introduction
Problems
Construction
algorithm
Storage scheme
Focusing on ROLAP techniques (MVs)
Stressed
to limits?
Complete
solution?
Unclear (not finished
with efficient storage)
Unclear (not focused
on hierarchies)
Introduction
Challenges of hierarchies:
D
Number of nodes: i 1 (L i 1) 2 often
Efficient execution plan
D
CURE
Small domains in the higher levels of dimension
hierarchies
New partitioning algorithm
Number of tuples increases
Novel storage scheme
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Execution Plan
Extend BUC (Bottom-Up-Cube) [BR99]
Efficient
pipelining
Cheap identification of some kinds of
redundancy
Inherent support for iceberg cubes and
holistic functions
Existing “BUC-based” methods: BU-BST
[WLFY02] and QC-Tables [LPH02]
Execution Plan
Dimensions: A, B, C
ABC
AB
AC
BC
A
B
C
Execution Plan
Dimensions: A0→A1→A2,
B0→B1,
C0
Execution Plan
Dimensions: A0, A1, A2,
A0B0C0
A0B0
A0B1C0
A0B1
A0
A1B0C0
A0C0
A1B0
B0, B1,
A1B1C0
A1B1
A2B0C0
A1C0
A1
A2B0
C0
A2B1C0
A2B1
A2
A2C0
B0
B0C0
B1
B1C0
C0
Execution Plan
Dimensions: A0, A1, A2,
A0B0C0
A0B0
A0B1C0
A0B1
A0
A1B0C0
A0C0
A1B0
B0, B1,
A1B1C0
A1B1
A2B0C0
A1C0
A1
A2B0
C0
A2B1C0
A2B1
A2
A2C0
B0
B0C0
B1
B1C0
C0
Execution Plan
Dimensions: A0, A1, A2, B0, B1,
Height: 3
A0B0C0
A0B0
A0B1C0
A0B1
A0
A1B0C0
A0C0
A1B0
A1B1C0
A1B1
A2B0C0
A1C0
A1
A2B0
C0
A2B1C0
A2B1
A2
A2C0
B0
B0C0
B1
B1C0
C0
Execution Plan
Dimensions: A0→A1→A2,
B0→B1,
C0
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0
Execution Plan
Dimensions: A0→A1→A2,
B0→B1,
C0
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0
Execution Plan
Dimensions: A0→A1→A2, B0→B1,
ABC
Height: 6
0
0
C0
0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0
Execution Plan
Important properties of BUC-based cubing:
Recursive
calls at higher levels tend to be
cheaper
Benefits from early pruning recursion at some
node N increase with the number of ancestors
of N in the execution plan
ABC
Advantage of taller execution plans
AB
AC
BC
A
B
C
Execution Plan
CURE’s Plan:
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
External Partitioning
R
Memory
External Partitioning
A0B0C0
R
Memory
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0
External Partitioning
R
Memory
External Partitioning
R
Memory
Partitions
External Partitioning
R
Sound
Partitions
Memory
External Partitioning
For sound partitioning |Biggest partition| ≤ |M|
In flat datasets this holds in general
In hierarchical datasets…
External Partitioning
|R| = 500 GB, |M| = 1 GB |R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0
External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0
External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0
External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
B1
C0
A2
External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0
External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A2B1
A2C0
B0
B1C0
A2
B1
C0
A1
External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0
External Partitioning
A0B0C0
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A0
A1B1
A1C0
A1
A1B1C0
A2B0C0
A2B0
A2B1C0
B0C0
A2B1
A2C0
B0
B1C0
A2
B1
C0
External Partitioning
A0B0C0
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A0
A1B1
A1C0
A1
A1B1C0
|A0|/|A2| times
smaller than R
|A2B0C0| ≈ 50 MB
A2B0C0
A2B0
A2B1C0
B0C0
A2B1
A2C0
B0
B1C0
A2
B1
C0
External Partitioning
A0B0C0
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A0
A1B1
A1C0
A1
A1B1C0
A2B0C0
A2B0
A2B1C0
B0C0
A2B1
A2C0
B0
B1C0
A2
B1
C0
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Storage Format
Two types of redundancy
Dimensional
Redundancy (DR)
Aggregational Redundancy (AR)
Storage Format
A0B0C0
A0B0
A0B1
A0
A0B1C0
A0C0
A1B1
A1B0C0 ABC
AB
A1B0
A
A1C0
AC
A2B0C0
B
BC
A1B1C0
C
A2B0
A2B1C0
Example
cubeB
A
A B with flat
AC
only for simplicity
1
2
A2
1
2
B1
0
0
C0
B0C0
B1C0
Storage Format
t1
t2
t’
t
CUBE with DR
CUBE’ without DR
Storage Format
t1
t2
t’
t
CUBE with DR
CUBE’ without DR
Storage Format
t1
t2
t’
t
CUBE with DR
CUBE’ without DR
Storage Format
CUBE with DR
CUBE’ without DR
Storage Format
CUBE with DR
CUBE’ without DR
Storage Format
Classify tuples according
to AR into:
• Normal Tuples (NTs)
• Trivial Tuples (TTs)
• Common Aggregate
Tuples (CATs)
CUBE with DR
CUBE’ without DR
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Purpose of the previous example:
Explanation
of different types of redundancy
Not construction algorithm
Constructing an uncompressed cube and
then compressing it would be inefficient
Instead, CURE classifies tuples during
construction itself (details in the paper)
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Experimental Evaluation
Hierarchical datasets: APB-1
Code (6,500) → Class (435) →
Group (215) → Family (54) → Line (11) →
Division (3)
Customer: Store (640) → Retailer (71)
Time: Month (17) → Quarter (6) → Year (2)
Channel: Base (9)
Product:
Flat datasets: CovType, Sep85L, Synthetic
Experimental Evaluation
Two versions of CURE:
CURE
CURE+
Experimental Evaluation
300
250
CURE
CURE+
Less than
3 hours
Time (min)
200
150
100
50
0
1.E+06
1.E+07
1.E+08
Num ber of Tuples in the Fact Table
1.E+09
Experimental Evaluation
10
Storage Space (GB)
8
CURE
≈ 6.8 GB
CURE+
6
4
2
0
1.E+06
1.E+07
1.E+08
Num ber of Tuples in the Fact Table
1.E+09
Experimental Evaluation
300
250
Time (sec)
200
150
100
50
0
BUC
BU-BST
FCURE
FCURE+
APB 0.4
CURE
CURE+
Experimental Evaluation
450
Storage Space (MB)
400
350
300
250
200
150
100
50
0
BUC
BU-BST
FCURE
FCURE+
APB 0.4
CURE
CURE+
Experimental Evaluation
14
12
Time (sec)
10
8
6
4
2
0
BUC
BU-BST
FCURE
FCURE+
APB 0.4
CURE
CURE+
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Conclusions
Main contribution: CURE
Efficient execution plan
New partitioning algorithm
Novel storage scheme
Main advantages of CURE
Efficient
construction of complete cubes over
large datasets with arbitrary hierarchies
Cube compression
Optimization opportunities for queries and
updates
Easy implementation
Current and Future Work
Study of indexing for queries and updates
Comparison with the most prominent
MOLAP and Tree-based techniques
Questions???
Thank you!
Storage Format
Memory Image
Disk Image
Storage Format
45
65
100
110
150
Memory Image
Disk Image
Storage Format
150
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
20
30
Memory Image
Disk Image
Storage Format
30
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image