CURE for Cubes: Cubing Using a ROLAP Engine

Download Report

Transcript CURE for Cubes: Cubing Using a ROLAP Engine

VLDB 2006
CURE for Cubes:
Cubing Using a ROLAP Engine
Konstantinos Morfonios
Yannis Ioannidis
University of Athens
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
SELECT region, sum(revenue)
FROM SALES
WHERE month = ‘September’
GROUP BY region
Gray On Data-warehousing:
CUBE
Introduction
SELECT A, B, SUM(M)
C, SUM(M)
FROM R
GROUP BY A, B
B, C
CUBE
SELECT SUM(M)
FROM R
Introduction

Problems
 Construction
algorithm
 Storage scheme

Focusing on ROLAP techniques (MVs)
 Stressed
to limits?
 Complete
solution?
Unclear (not finished
with efficient storage)
Unclear (not focused
on hierarchies)
Introduction
Challenges of hierarchies:
D
Number of nodes: i 1 (L i  1)  2 often
Efficient execution plan


D

CURE
Small domains in the higher levels of dimension
hierarchies
New partitioning algorithm
Number of tuples increases
Novel storage scheme

Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Execution Plan

Extend BUC (Bottom-Up-Cube) [BR99]
 Efficient
pipelining
 Cheap identification of some kinds of
redundancy
 Inherent support for iceberg cubes and
holistic functions

Existing “BUC-based” methods: BU-BST
[WLFY02] and QC-Tables [LPH02]
Execution Plan
Dimensions: A, B, C
ABC
AB
AC
BC
A
B
C

Execution Plan
Dimensions: A0→A1→A2,
B0→B1,
C0
Execution Plan
Dimensions: A0, A1, A2,
A0B0C0
A0B0
A0B1C0
A0B1
A0
A1B0C0
A0C0
A1B0
B0, B1,
A1B1C0
A1B1
A2B0C0
A1C0
A1
A2B0
C0
A2B1C0
A2B1
A2

A2C0
B0
B0C0
B1
B1C0
C0
Execution Plan
Dimensions: A0, A1, A2,
A0B0C0
A0B0
A0B1C0
A0B1
A0
A1B0C0
A0C0
A1B0
B0, B1,
A1B1C0
A1B1
A2B0C0
A1C0
A1
A2B0
C0
A2B1C0
A2B1
A2

A2C0
B0
B0C0
B1
B1C0
C0
Execution Plan
Dimensions: A0, A1, A2, B0, B1,
Height: 3
A0B0C0
A0B0
A0B1C0
A0B1
A0
A1B0C0
A0C0
A1B0
A1B1C0
A1B1
A2B0C0
A1C0
A1
A2B0
C0
A2B1C0
A2B1
A2

A2C0
B0
B0C0
B1
B1C0
C0
Execution Plan
Dimensions: A0→A1→A2,
B0→B1,
C0
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0

Execution Plan
Dimensions: A0→A1→A2,
B0→B1,
C0
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0

Execution Plan
Dimensions: A0→A1→A2, B0→B1,
ABC
Height: 6
0
0
C0
0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0

Execution Plan

Important properties of BUC-based cubing:
 Recursive
calls at higher levels tend to be
cheaper
 Benefits from early pruning recursion at some
node N increase with the number of ancestors
of N in the execution plan

ABC
Advantage of taller execution plans
AB
AC
BC
A
B
C

Execution Plan
CURE’s Plan:
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0

Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
External Partitioning
R
Memory
External Partitioning
A0B0C0
R
Memory
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0

External Partitioning
R
Memory
External Partitioning
R
Memory
Partitions
External Partitioning
R
Sound
Partitions
Memory
External Partitioning
For sound partitioning |Biggest partition| ≤ |M|
 In flat datasets this holds in general
 In hierarchical datasets…

External Partitioning
|R| = 500 GB, |M| = 1 GB  |R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0

External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0

External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0

External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
B1
C0

A2

External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0

External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A2B1
A2C0
B0
B1C0
A2
B1
C0
A1


External Partitioning
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0C0
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A1B1C0
A2B0C0
A0
A1B1
A1C0
A2B0
A2B1C0
B0C0
A1
A2B1
A2C0
B0
B1C0
A2
B1
C0

External Partitioning
A0B0C0
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A0
A1B1
A1C0
A1
A1B1C0
A2B0C0
A2B0
A2B1C0
B0C0
A2B1
A2C0
B0
B1C0
A2
B1
C0

External Partitioning
A0B0C0
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A0
A1B1
A1C0
A1
A1B1C0
|A0|/|A2| times
smaller than R
|A2B0C0| ≈ 50 MB
A2B0C0
A2B0
A2B1C0
B0C0
A2B1
A2C0
B0
B1C0
A2
B1
C0

External Partitioning
A0B0C0
|R| = 500 GB, |M| = 1 GB
|R|/|M| = 500
A0 (50,000)→A1 (500)→A2 (5)
A0B0
A0B1C0
A1B0C0
A0B1
A0C0
A1B0
A0
A1B1
A1C0
A1
A1B1C0
A2B0C0
A2B0
A2B1C0
B0C0
A2B1
A2C0
B0
B1C0
A2
B1
C0

Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Storage Format

Two types of redundancy
 Dimensional
Redundancy (DR)
 Aggregational Redundancy (AR)
Storage Format
A0B0C0
A0B0
A0B1
A0
A0B1C0
A0C0
A1B1
A1B0C0 ABC
AB
A1B0
A
A1C0
AC
A2B0C0
B
BC
A1B1C0
C

A2B0
A2B1C0
Example
cubeB
A
A B with flat
AC
only for simplicity
1
2
A2
1
2
B1

0
0
C0
B0C0
B1C0
Storage Format
t1
t2
t’
t
CUBE with DR
CUBE’ without DR
Storage Format
t1
t2
t’
t
CUBE with DR
CUBE’ without DR
Storage Format
t1
t2
t’
t
CUBE with DR
CUBE’ without DR
Storage Format
CUBE with DR
CUBE’ without DR
Storage Format
CUBE with DR
CUBE’ without DR
Storage Format
Classify tuples according
to AR into:
• Normal Tuples (NTs)
• Trivial Tuples (TTs)
• Common Aggregate
Tuples (CATs)
CUBE with DR
CUBE’ without DR
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format
Storage Format

Purpose of the previous example:
 Explanation
of different types of redundancy
 Not construction algorithm
Constructing an uncompressed cube and
then compressing it would be inefficient
 Instead, CURE classifies tuples during
construction itself (details in the paper)

Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Experimental Evaluation

Hierarchical datasets: APB-1
Code (6,500) → Class (435) →
Group (215) → Family (54) → Line (11) →
Division (3)
 Customer: Store (640) → Retailer (71)
 Time: Month (17) → Quarter (6) → Year (2)
 Channel: Base (9)
 Product:

Flat datasets: CovType, Sep85L, Synthetic
Experimental Evaluation

Two versions of CURE:
 CURE
 CURE+
Experimental Evaluation
300
250
CURE
CURE+
Less than
3 hours
Time (min)
200
150
100
50
0
1.E+06
1.E+07
1.E+08
Num ber of Tuples in the Fact Table
1.E+09
Experimental Evaluation
10
Storage Space (GB)
8
CURE
≈ 6.8 GB
CURE+
6
4
2
0
1.E+06
1.E+07
1.E+08
Num ber of Tuples in the Fact Table
1.E+09
Experimental Evaluation
300
250
Time (sec)
200
150
100
50
0
BUC
BU-BST
FCURE
FCURE+
APB 0.4
CURE
CURE+
Experimental Evaluation
450
Storage Space (MB)
400
350
300
250
200
150
100
50
0
BUC
BU-BST
FCURE
FCURE+
APB 0.4
CURE
CURE+
Experimental Evaluation
14
12
Time (sec)
10
8
6
4
2
0
BUC
BU-BST
FCURE
FCURE+
APB 0.4
CURE
CURE+
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Introduction
Execution Plan
External Partitioning
Storage Format
Experimental Evaluation
Conclusions
Conclusions

Main contribution: CURE
 Efficient execution plan
 New partitioning algorithm
 Novel storage scheme

Main advantages of CURE
 Efficient
construction of complete cubes over
large datasets with arbitrary hierarchies
 Cube compression
 Optimization opportunities for queries and
updates
 Easy implementation
Current and Future Work
Study of indexing for queries and updates
 Comparison with the most prominent
MOLAP and Tree-based techniques

Questions???
Thank you!
Storage Format
Memory Image
Disk Image
Storage Format
45
65
100
110
150
Memory Image
Disk Image
Storage Format
150
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
20
30
Memory Image
Disk Image
Storage Format
30
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image
Storage Format
Memory Image
Disk Image