Probabilistic Relational Model

WILD PROJECT REVIEW
Efficient Allocation Algorithms For OLAP Over Imprecise Data

Doug Burdick, University of Wisconsin – Madison
Prasad Deshpande, IBM India Research Lab, SIRC
T.S. Jayram, IBM Almaden Research Center
Raghu Ramakrishnan, Yahoo! Research
Shivakumar Vaithyanathan, IBM Almaden Research Center

Imprecise Multidimensional Data [BDJ+05]

Dimension hierarchies (level 3 = ALL, level 1 = leaf):

LOCATION     3: ALL    2: Region (East, West)       1: State (MA, NY, TX, CA)
AUTOMOBILE   3: ALL    2: Category (Truck, Sedan)   1: Model (F150, Sierra, Civic, Camry)

[Figure: facts p1–p5 placed on the LOCATION × AUTOMOBILE grid]

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100

[BDJ+05] Burdick et al. OLAP Over Uncertain and Imprecise Data. In VLDB 2005.

Sources of Imprecision

Dimensions extracted from free text
Assume given extractor for Auto dimension values

FactID  Text                                      Auto
p1      Brakes on F150…                           F150
p2      Rotors on the Sierra are…                 Sierra
p3      The F150 has…                             F150
p4      The Sierra is…                            Sierra
p5      cust’s Sierra is … but their F150 has …   {Sierra,F150}

More details for dimensions extracted from text in [BDJ+06]
[BDJ+06] Burdick et al. OLAP Over Uncertain and Imprecise Data. To appear in VLDB Journal.

Sources of Imprecision

Dimensions extracted from free text
Assume given extractor for Auto dimension values

FactID  Text                        Auto
p1      Brakes on F150…             F150
p2      Rotors on the Sierra are…   Sierra
p3      The F150 has…               F150
p4      The Sierra is…              Sierra
p5      cust’s Truck has…           Truck

Sources of Imprecision

Data Integration
  Fact table constructed by integrating multiple data sources
  Different sources record same dimension attribute at different granularities

AUTOMOBILE   3: ALL    2: Category (Truck, Sedan)   1: Model (F150, Sierra, Civic, Camry)

Source: Call Center
FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200

Source: Mailing List
FactID  Auto    Loc  Repair
p5      Truck   NY   100

Imprecision In Real Data

Obtained real-world dataset from auto manufacturer
Fact table entries from several source relations
Integrated fact table contained 798,570 facts
Real data has many imprecise facts

Querying Imprecise Facts

Query: Auto = F150, Loc = MA
SUM(Repair) = ???

[Figure: facts p1–p5 on the grid of Auto (Truck: F150, Sierra) by Loc (East: NY, MA)]

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100

Solution: Allocation

Intuitively: Replace each imprecise fact r with a set of precise facts, one for each possible completion of r
  Each completion is assigned an allocation weight
  Refer to the resulting fact table as the Extended Database (EDB)
Queries operate over this Extended Database

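The replacement step can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: only the Auto dimension is allowed to be imprecise, the names `LEAVES`, `FACTS`, and `build_edb` are hypothetical, and weights are split uniformly across completions, which is only one possible way to assign them.

```python
# Hypothetical sketch: expand imprecise facts into an Extended Database (EDB).
# Assumptions: only Auto may be imprecise; weights are uniform per completion.

# Leaf-level (Model) values under each AUTOMOBILE node used in the example.
LEAVES = {
    "F150": ["F150"],
    "Sierra": ["Sierra"],
    "Truck": ["F150", "Sierra"],
}

# Fact table from the running example: (FactID, Auto, Loc, Repair).
FACTS = [
    ("p1", "F150", "NY", 100),
    ("p2", "Sierra", "NY", 500),
    ("p3", "F150", "MA", 100),
    ("p4", "Sierra", "MA", 200),
    ("p5", "Truck", "MA", 100),
]


def build_edb(facts):
    """Replace each fact by one weighted row per possible completion."""
    edb = []
    for fact_id, auto, loc, repair in facts:
        models = LEAVES[auto]
        weight = 1.0 / len(models)  # uniform allocation (an assumption)
        for model in models:
            edb.append((fact_id, model, loc, repair, weight))
    return edb


EDB = build_edb(FACTS)
for row in EDB:
    print(row)
```

Precise facts keep weight 1.0, while p5 becomes two rows (F150 and Sierra) with weight 0.5 each.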
Handle Imprecision With Allocation

[Figure: imprecise fact p5 <MA,Truck> allocated to cells (MA,F150) and (MA,Sierra)]

ID  FactID  Auto    Loc  Repair  Weight
1   p1      F150    NY   100     1.0
2   p2      Sierra  NY   500     1.0
3   p3      F150    MA   100     1.0
4   p4      Sierra  MA   200     1.0
5   p5      F150    MA   100     0.5
6   p5      Sierra  MA   100     0.5

Querying The Extended Database

Query: Auto = F150, Loc = MA
SUM(Repair) = 150

EDB rows 3 (p3, weight 1.0) and 5 (p5, weight 0.5) match the query, so SUM(Repair) = 100 × 1.0 + 100 × 0.5 = 150

Procedure for assigning allocation weights is referred to as an allocation policy

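The weighted aggregate can be checked with a few lines of Python. This is an illustrative sketch: the EDB rows are copied from the slide, and the function name `weighted_sum` is hypothetical.

```python
# Hypothetical sketch: weighted SUM over the Extended Database (EDB).
# Rows are (ID, FactID, Auto, Loc, Repair, Weight), copied from the slide.
EDB = [
    (1, "p1", "F150", "NY", 100, 1.0),
    (2, "p2", "Sierra", "NY", 500, 1.0),
    (3, "p3", "F150", "MA", 100, 1.0),
    (4, "p4", "Sierra", "MA", 200, 1.0),
    (5, "p5", "F150", "MA", 100, 0.5),
    (6, "p5", "Sierra", "MA", 100, 0.5),
]


def weighted_sum(edb, auto, loc):
    """SUM(Repair) where each matching row contributes Repair * Weight."""
    return sum(
        repair * weight
        for _row_id, _fact_id, row_auto, row_loc, repair, weight in edb
        if row_auto == auto and row_loc == loc
    )


RESULT = weighted_sum(EDB, "F150", "MA")
print(RESULT)  # 150.0
```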
Contributions

Propose generalized template for allocation policies presented in [BDJ+05]
Present operational framework for allocation
  Allocation graph formalism
  Used to derive Independent, Block, Transitive Algorithms
Propose Extended Database Maintenance Algorithm
  Update EDB to reflect changes to given fact table
Experimental Evaluation

Allocation Policy Template

\[
p_{c,r} = \frac{Q(c)}{\sum_{c' \in region(r)} Q(c')} = \frac{Q(c)}{Qsum(r)}
\]

[Figure: imprecise fact r in region (MA, Truck), overlapping cells c1 and c2]

\[
p_{c1,r} = \frac{Q(c1)}{Q(c1) + Q(c2)}, \qquad p_{c2,r} = \frac{Q(c2)}{Q(c1) + Q(c2)}
\]

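A small Python sketch of the template follows. It assumes that Q(c) is the number of precise facts in cell c, which is one possible choice of allocation quantity; the names `PRECISE_CELLS` and `allocation_weights` are hypothetical.

```python
from collections import Counter

# Cells (Loc, Auto) of the precise facts p1..p4 in the running example.
PRECISE_CELLS = [("NY", "F150"), ("NY", "Sierra"), ("MA", "F150"), ("MA", "Sierra")]


def allocation_weights(q, region):
    """Return p_{c,r} = Q(c) / Qsum(r) for each cell c in region(r)."""
    qsum = sum(q[c] for c in region)
    if qsum == 0:
        raise ValueError("region has no allocation quantity to share")
    return {c: q[c] / qsum for c in region}


# Assumption: Q(c) = count of precise facts in c.
Q = Counter(PRECISE_CELLS)
REGION_R = [("MA", "F150"), ("MA", "Sierra")]  # region of r = <MA,Truck>
WEIGHTS = allocation_weights(Q, REGION_R)
print(WEIGHTS)
```

With one precise fact in each of the two cells, both weights are 0.5, matching the weights in the Extended Database example.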
Interactions between overlapping facts

Allocation weights for imprecise fact p6 depend on allocation weights for fact p7 (and vice-versa)
Would like assigned weights to capture these interactions
Idea: Repeatedly allocate p6 and p7 until allocation weights converge

[Figure: facts p1, p2, p4, p5, p6, p7 on the grid of Auto (Truck: F150, Sierra) by Loc (East: NY, MA)]

Iterative Allocation Policies

1) Initialize each Q^0(c) in cell c (using precise facts)
2) For each iteration t until all Q^t(c) converged
     For each imprecise fact r
       \[ Qsum^t(r) = \sum_{c' \in region(r)} Q^{t-1}(c') \]
     For each cell c
       For each imprecise fact r overlapping c
         \[ Q^t(c) \leftarrow Q^t(c) + \frac{Q^{t-1}(c)}{Qsum^t(r)} \]
3) For each imprecise fact r
     For each cell c in region(r)
       \[ p_{c,r} = \frac{Q^t(c)}{Qsum^t(r)} \]

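The three steps can be sketched in Python as follows. This is a sketch under assumptions rather than the authors' code: each Q^t(c) is assumed to restart from Q^0(c) before the shares of overlapping facts are added, every region is assumed to contain at least one cell with Q^0(c) > 0, and the names `iterative_allocation`, `eps`, and `max_iters` are hypothetical.

```python
# Hypothetical sketch of an iterative allocation policy.
# Assumptions: Q^t(c) restarts from Q^0(c) in every iteration, and every
# region contains at least one cell with a positive Q^0(c).


def iterative_allocation(q0, regions, eps=1e-6, max_iters=100):
    """q0: cell -> Q^0(c); regions: imprecise fact -> list of overlapped cells."""
    q_prev = dict(q0)
    qsum = {}
    for _ in range(max_iters):
        # Qsum^t(r): sum of the previous quantities over region(r).
        qsum = {r: sum(q_prev[c] for c in cells) for r, cells in regions.items()}
        # Q^t(c): start from Q^0(c), then add each overlapping fact's share.
        q_next = dict(q0)
        for r, cells in regions.items():
            for c in cells:
                q_next[c] += q_prev[c] / qsum[r]
        delta = max(abs(q_next[c] - q_prev[c]) for c in q0)
        q_prev = q_next
        if delta < eps:  # all Q^t(c) have converged
            break
    # p_{c,r} = Q^t(c) / Qsum^t(r)
    return {
        r: {c: q_prev[c] / qsum[r] for c in cells}
        for r, cells in regions.items()
    }


WEIGHTS = iterative_allocation({"c1": 2, "c2": 1}, {"r": ["c1", "c2"]})
print(WEIGHTS)
```

For a single imprecise fact over cells with Q^0 values 2 and 1, the weights settle at roughly 2/3 and 1/3.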
Benefits of Iterative Allocation

Imprecise facts can be allocated in any order and same allocation weights are obtained
Leverage this idea to obtain scalable allocation algorithms
Leads to Expectation Maximization (EM) framework for allocation
Final allocation weights have pleasing mathematical properties
See [BDJ+05] for details

Allocation Graph

[Figure: facts p1–p6 on the grid of Auto (Truck: F150, Sierra) by Loc (East: NY, MA), with cells c1 and c2 marked]

Precise Cells
  Cell(NY,F150)
  Cell(NY,Sierra)
  Cell(MA,F150)
  Cell(MA,Sierra)
Imprecise Facts
  <MA,Truck>

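One way to derive such a graph is to expand each imprecise fact into its region and keep the cells that hold precise facts. The sketch below assumes that construction and hard-codes only the hierarchy fragments needed for the example; `LEAVES_LOC`, `LEAVES_AUTO`, and `allocation_graph` are hypothetical names, and labelling <MA,Truck> as p5 follows the earlier fact table.

```python
from itertools import product

# Leaf values under each hierarchy node used in the example (assumed fragments).
LEAVES_LOC = {"NY": ["NY"], "MA": ["MA"], "East": ["MA", "NY"]}
LEAVES_AUTO = {"F150": ["F150"], "Sierra": ["Sierra"], "Truck": ["F150", "Sierra"]}

PRECISE_CELLS = {("NY", "F150"), ("NY", "Sierra"), ("MA", "F150"), ("MA", "Sierra")}
IMPRECISE_FACTS = {"p5": ("MA", "Truck")}


def allocation_graph(cells, facts):
    """Map each imprecise fact to the precise cells its region overlaps."""
    edges = {}
    for fact_id, (loc, auto) in facts.items():
        region = product(LEAVES_LOC[loc], LEAVES_AUTO[auto])
        edges[fact_id] = sorted(c for c in region if c in cells)
    return edges


GRAPH = allocation_graph(PRECISE_CELLS, IMPRECISE_FACTS)
print(GRAPH)
```

Each key is an imprecise-fact node and each listed cell is a neighbour, so the dictionary is an adjacency list for the bipartite graph.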
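Processing With Allocation Graph

[Figure: allocation graph for the grid of Auto (Truck: F150, Sierra) by Loc (East: NY, MA), annotated with the allocation steps]

Steps shown: initialize Q^0(c) in each cell c, compute Qsum(r) over the cells c' in region(r), then assign p_{c,r} = Q(c) / Qsum(r)
Values shown: Q = 2 and 1 for the two cells overlapped by <MA,Truck>, Qsum = 3, allocation weights 2/3 and 1/3

Precise Cells
  Cell(NY,F150)
  Cell(NY,Sierra)
  Cell(MA,F150)
  Cell(MA,Sierra)
Imprecise Facts
  <MA,Truck>
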
Efficient Allocation Algorithms

Independent Algorithm
  Requires multiple sorts of precise cells for each iteration
  Optimizations based on re-using each sort as much as possible
Block Algorithm
  Reduces the number of required sorts for precise cells to 1
  Optimizations based on increasing buffer utilization

Precise facts
  p1 <MA,Civic>
  p2 <MA,Sierra>
  p3 <NY,F150>
  p4 <CA,Civic>
  p5 <CA,Sierra>
Imprecise facts, grouped by dimension levels
  S1: <State,Category>    p6 <MA,Sedan>, p7 <MA,Truck>
  S2: <State,ALL>         p8 <CA,ALL>
  S3: <Region,Category>   p9 <East,Truck>, p10 <West,Sedan>
  S4: <ALL,Model>         p11 <ALL,Civic>, p12 <ALL,Sierra>
  S5: <Region,Model>      p13 <West,Civic>, p14 <West,Sierra>

Iteration aware allocation

Optimizations for Independent and Block reduce work for single iteration
Problem: Each iteration of allocation is still expensive
  Involves multiple scans of entire fact table
  Not feasible for real data warehouses!
Can we do better?

Required Data For Allocating A Fact

Cells with precise facts
  c1 <MA,Civic>
  c2 <MA,Sierra>
  c3 <NY,F150>
  c4 <CA,Civic>
  c5 <CA,Sierra>
Imprecise facts
  p6 <MA,Sedan>
  p7 <MA,Truck>
  p8 <CA,ALL>
  p9 <East,Truck>
  p10 <West,Sedan>
  p11 <ALL,Civic>
  p12 <ALL,Sierra>
  p13 <West,Civic>
  p14 <West,Sierra>

Required Data For Allocating A Fact

Connected components in allocation graph can be processed independently

[Figure: allocation graph over cells c1–c5 and imprecise facts p6–p14]

Transitive Algorithm

Transitive Algorithm has two steps:
  1) Connected component identification step
  2) Process each connected component
       Read component into memory
       Perform all iterations of allocation for facts in component
If each component fits into memory, then the number of I/O operations required by Transitive is independent of the number of iterations!
  Components larger than buffer processed using Block algorithm
  In real datasets, all components were memory resident
Use concepts from Transitive Algorithm to develop EDB Maintenance Algorithm

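The component identification step can be sketched with a union-find structure. This is an in-memory illustration only; the slides do not say how the step is implemented, and the input data and the name `connected_components` are hypothetical.

```python
# Hypothetical sketch: group imprecise facts into connected components of the
# allocation graph with union-find. The example data is made up for illustration.


def connected_components(regions):
    """regions: imprecise fact -> cells it overlaps. Returns sorted fact groups."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    # Each edge (fact, cell) of the allocation graph merges two nodes.
    for fact, cells in regions.items():
        for cell in cells:
            union(("fact", fact), ("cell", cell))

    groups = {}
    for fact in regions:
        groups.setdefault(find(("fact", fact)), []).append(fact)
    return sorted(sorted(group) for group in groups.values())


EXAMPLE = {"r1": ["a", "b"], "r2": ["b", "c"], "r3": ["d"]}
COMPONENTS = connected_components(EXAMPLE)
print(COMPONENTS)  # [['r1', 'r2'], ['r3']]
```

Facts r1 and r2 share cell b and land in one component, so their allocation iterations would be run together; r3 touches no shared cell and can be processed on its own.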
Experimental Setup

Algorithms evaluated on several datasets
  Real-world dataset: 798K facts, 4 dimensions
  Used several synthetic datasets
Vary level of imprecision in the data
  Percentage of imprecise facts
  Severity of imprecision
  Scalability (up to 5 million tuples)
Important parameter: Ratio of input table size to available memory
  Memory limited to restricted buffer pool

Experiment 1a: Memory Resident

[Figure: Time (sec) vs. Iterations (until converged) for Independent, Block, and Transitive; Real Dataset]

Experiment: Memory Resident (2)

[Figure: Time (sec) vs. Iterations (until converged) for Independent, Block, and Transitive; Synthetic Dataset (more imprecision)]

Experiment: Algorithm Scalability

ε = 0.1 (3 iterations)

[Figure: Time (sec) vs. Buffer Size (600KB, 1MB, 6MB, 12MB) for Independent, Block, and Transitive]

Experiment 1b: Algorithm Scalability

ε = 0.005 (10 iterations)

[Figure: Time (sec) vs. Buffer Size (600KB, 1MB, 6MB, 12MB) for Independent, Block, and Transitive]

Conclusions

Imprecision is a compelling real-world problem
  Propose allocation as a solution
Allocation graph formalism
  Basis for 3 scalable allocation algorithms
  Independent, Block, Transitive
Transitive algorithm is quite intriguing
  Performance is stable as number of iterations increases
  Connected components the algorithm identifies can be used in proposed EDB maintenance algorithm
