On Synopses for Distinct-Value Estimation Under Multiset Operations
Kevin Beyer
Peter J. Haas
Berthold Reinwald
Yannis Sismanis
IBM Almaden Research Center
Rainer Gemulla
Technische Universität Dresden
Introduction
Estimating the number of distinct values (DV) is crucial for:
Data integration & cleaning
E.g. schema discovery, duplicate detection
Query optimization
Network monitoring
Materialized view selection for datacubes
Exact DV computation is impractical
Sort/scan/count or hash table
Problem: poor scalability
Approximate DV “synopses”
25-year-old literature
Hashing-based techniques
SIGMOD 07
Motivation: A Synopsis Warehouse
[Diagram: a full-scale warehouse of data partitions, each summarized by a synopsis S1,1, ..., Sn,m; synopses can be combined (e.g. S1-2,3-7, S*,*, etc.) to form a warehouse of synopses]
Goal: discover partition characteristics & relationships to other partitions
Keys, functional dependencies, similarity metrics (Jaccard)
Similar to Bellman [DJMS02]
Accuracy challenge: small synopsis sizes, many distinct values
Outline
Background on KMV synopsis
An unbiased low-variance DV estimator
Optimality
Asymptotic error analysis for synopsis sizing
Compound Partitions
Union, intersection, set difference
Multiset Difference: AKMV synopses
Deletions
Empirical Evaluation
K-Min Value (KMV) Synopsis
[Diagram: each of the D distinct values in a partition (a, b, ...) is hashed to a point in [0, 1); the k smallest hash values U(1), U(2), ..., U(k) are retained. With D distinct values, adjacent points are roughly 1/D apart.]
Hashing ≈ dropping the D distinct values uniformly at random on [0, 1]
KMV synopsis: L = {U(1), U(2), ..., U(k)}
Leads naturally to the basic estimator [BJK+02]
Basic estimator: E[U(k)] ≈ k/D, so D̂_BE = k / U(k)
All classic estimators approximate the basic estimator
Expected construction cost: O(N + k log log D)
Space: O(k log D)
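The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the hash function (SHA-256 mapped into [0, 1)) and the function names are assumptions made for the example.

```python
import hashlib

def kmv_synopsis(values, k):
    """Build a KMV synopsis: the k smallest hash values of the
    distinct values in `values`, mapped into [0, 1)."""
    hashes = set()
    for v in values:
        # SHA-256 acts as a pseudo-uniform hash into [0, 1).
        h = int(hashlib.sha256(str(v).encode()).hexdigest(), 16)
        hashes.add(h / 2**256)
    return sorted(hashes)[:k]

def basic_estimate(synopsis, k):
    """Basic estimator D-hat_BE = k / U_(k), where U_(k) is the
    k-th smallest hash value."""
    return k / synopsis[k - 1]
```

A production implementation would keep the k smallest values in a max-heap while scanning, giving the O(N + k log log D) expected construction cost quoted above.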
Contributions: New Synopses & Estimators
Better estimators for classic KMV synopses
Better accuracy: unbiased, low mean-square error
Exact error bounds (in paper)
Asymptotic error bounds for sizing the synopses
Augmented KMV synopsis (AKMV)
Permits DV estimates for compound partitions
Can handle deletions and incremental updates
[Diagram: synopses SA and SB of partitions A and B are combined into a synopsis SA op B of the compound partition A op B]
Unbiased DV Estimator from KMV Synopsis
Exact error analysis based on theory of order statistics
Asymptotically optimal as k becomes large (MLE theory)
Analysis with many DVs
Unbiased estimator [Cohen97]: D̂_UB = (k − 1) / U(k)
Theorem: E[D̂_UB] = D and Var[D̂_UB] ≈ D² / (k − 2) for D ≫ k
Proof:
Show that U(i) − U(i−1) is approximately exponential for large D
Then use [Cohen97]
Use the variance formula above to size synopses a priori
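As a sketch of how these formulas are used in practice: the unbiased estimator is a one-liner, and the a-priori sizing follows from a relative standard error of roughly 1/√(k − 2) for D ≫ k. The function names are illustrative assumptions, not from the paper.

```python
import math

def unbiased_estimate(synopsis):
    """Unbiased estimator D-hat_UB = (k - 1) / U_(k) [Cohen97],
    where U_(k) is the largest value kept in the synopsis."""
    k = len(synopsis)
    return (k - 1) / synopsis[-1]

def size_synopsis(target_rel_error):
    """Pick k a priori so that the relative standard error
    sqrt(Var)/D ~ 1/sqrt(k - 2) meets the target (valid for D >> k)."""
    return math.ceil(1.0 / target_rel_error**2) + 2
```

For example, a 10% target relative standard error needs k on the order of a hundred entries, independent of D.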
Outline
Background on KMV synopsis
An unbiased low-variance DV estimator
Optimality
Asymptotic error analysis for synopsis sizing
Compound Partitions
Union, intersection, set difference
Multiset Difference: AKMV synopses
Deletions
Empirical Evaluation
(Multiset) Union of Partitions
[Diagram: the k-min synopses LA and LB, each a set of hash values on [0, 1], are merged; the k smallest of the combined values, up to U(k), form the new synopsis]
Combine KMV synopses: L = LA ⊕ LB (the k smallest values in LA ∪ LB)
Theorem: L is a KMV synopsis of the multiset union A ∪ B
Can use the previous unbiased estimator: D̂_UB = (k − 1) / U(k)
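The ⊕ combination is just "keep the k smallest of the merged hash values". A minimal sketch (the function name is an assumption for illustration):

```python
def combine_union(syn_a, syn_b, k):
    """Combine two KMV synopses: the k smallest values of the merged
    hash-value sets form a KMV synopsis of the multiset union A ∪ B."""
    merged = sorted(set(syn_a) | set(syn_b))  # distinct hash values
    return merged[:k]
```

The resulting list can be fed directly to the unbiased estimator (k − 1)/U(k), since by the theorem it is itself a valid KMV synopsis.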
(Multiset) Intersection of Partitions
L = LA ⊕ LB as with union (contains k elements)
K = # values in L that are also in D(A ∩ B)
Note: L corresponds to a uniform random sample of the DVs in A ∪ B
Theorem: K can be computed from LA and LB alone
K/k estimates the Jaccard distance: ρ = D(A ∩ B) / D(A ∪ B)
Unbiased estimator of the #DVs in the intersection: D̂∩ = (K/k) · (k − 1) / U(k), which estimates D(A ∩ B)
See paper for the variance of the estimator
Can extend to general compound partitions built from ordinary set operations
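Putting the two estimates together: a value of L lies in D(A ∩ B) exactly when it appears in both input synopses, which is why K is computable from LA and LB alone. A minimal sketch (function name assumed for illustration):

```python
def intersection_estimates(syn_a, syn_b, k):
    """From L = syn_a (+) syn_b (the k smallest merged hash values),
    count K = # values of L present in both synopses, i.e. in D(A ∩ B).
    Returns (K/k, (K/k) * (k - 1)/U_(k)): the Jaccard-ratio estimate
    and the unbiased estimate of D(A ∩ B)."""
    set_a, set_b = set(syn_a), set(syn_b)
    combined = sorted(set_a | set_b)[:k]   # L, the k smallest values
    K = sum(1 for v in combined if v in set_a and v in set_b)
    u_k = combined[-1]                     # U_(k)
    return K / k, (K / k) * (k - 1) / u_k
```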
Multiset Differences: AKMV Synopsis
Augment the KMV synopsis with multiplicity counters: L+ = (L, c)
Space: O(k log D + k log M), where M = max multiplicity
Proceed almost exactly as before, i.e. L+(E \ F) = (LE ⊕ LF, (cE − cF)+)
Unbiased DV estimator: D̂ = (K+/k) · (k − 1) / U(k), where K+ is the # of positive counters
Closure property: for G = E op F, combining the AKMV synopses L+E = (LE, cE) and L+F = (LF, cF) yields L+G = (LE ⊕ LF, h_op(cE, cF))
Can also handle deletions
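A minimal sketch of the multiset-difference case, with an AKMV synopsis represented as a dict mapping hash value → multiplicity counter (representation and function names are assumptions for illustration):

```python
def akmv_difference(akmv_e, akmv_f, k):
    """AKMV synopsis of the multiset difference E \\ F: keep the k
    smallest combined hash values, each with counter (c_E - c_F)^+."""
    kept = sorted(set(akmv_e) | set(akmv_f))[:k]
    return {v: max(akmv_e.get(v, 0) - akmv_f.get(v, 0), 0) for v in kept}

def akmv_estimate(counters):
    """Unbiased DV estimate (K+/k) * (k - 1)/U_(k), where K+ is the
    number of positive counters and U_(k) the largest kept hash value."""
    k = len(counters)
    u_k = max(counters)                    # largest hash value = U_(k)
    k_plus = sum(1 for c in counters.values() if c > 0)
    return (k_plus / k) * (k - 1) / u_k
```

Zero counters are retained rather than dropped: they record that a value was seen but fully cancelled, which is what makes deletions and incremental updates work.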
Accuracy Comparison
[Bar chart: average ARE, on a scale from 0 to 0.1, for the Unbiased-KMV, SDLogLog, Sample-Counting, and Baseline estimators]
Compound Partitions
[Histogram: frequency (0 to 500) of ARE values (0 to 0.15) for Unbiased-KMV and SDLogLog on intersections, unions, and Jaccard estimates]
Conclusions
DV estimation for scalable, flexible synopsis warehouse
Better estimators for classic KMV synopses
DV estimation for compound partitions via AKMV synopses
Closure property
Theoretical contributions
Order statistics for exact/asymptotic error analysis
Asymptotic efficiency via MLE theory
A new spin on an old problem