
Query Processing over
Incomplete Autonomous Databases
Presented By Garrett Wolf, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, Yi Chen, Subbarao Kambhampati
Arizona State University
2008-02-04
Summarized By Sungchan Park
Introduction
 More and more data is becoming accessible via web servers
which are supported by backend autonomous databases

E.g., Cars.com, Realtor.com, Google Base, etc.
[Figure: a mediator in front of three autonomous databases]
Center for E-Business Technology
Copyright  2008 by CEBT
Web DBs Are Incomplete!
 Incomplete Entry
 Inaccurate Extraction
 Heterogeneous Schemas
 User-Defined Schemas
Problem
 Current autonomous database systems only return certain
answers, namely those which exactly satisfy all the user query
constraints
 Although there has been work on handling incompleteness in
databases, much of it has been focused on single databases on
which the query processor has complete control.

Modify databases directly by replacing null values with likely values
– Not applicable to autonomous databases
Possible Naïve Approaches
Query Q: (Body Style = Convt)
 CERTAINONLY

Return only certain answers
– Low recall
 ALLRETURNED

Return all answers having Body Style = Convt or Body Style = Null
– Low precision, infeasible
 ALLRANKED

Return all answers having Body Style = Convt. Additionally, rank all
answers having Body Style = Null by predicting the missing values
and return them to the user
– Costly, infeasible
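The three naive approaches can be compared on a toy table. The sketch below is only an illustration: the car listings, the attribute names and the toy_predict function are made up, and a real mediator could not scan a source like this.

```python
# Toy car listings; None stands for a missing (null) value. All data is made up.
CARS = [
    {"id": 1, "model": "Z4", "body_style": "Convt"},
    {"id": 2, "model": "Civic", "body_style": None},
    {"id": 3, "model": "Z4", "body_style": None},
    {"id": 4, "model": "Civic", "body_style": "Sedan"},
]

def certain_only(rows, attr, value):
    """CERTAINONLY: keep only tuples that exactly satisfy the constraint."""
    return [r for r in rows if r[attr] == value]

def all_returned(rows, attr, value):
    """ALLRETURNED: certain answers plus every tuple with a null on attr, unranked."""
    return [r for r in rows if r[attr] == value or r[attr] is None]

def all_ranked(rows, attr, value, predict):
    """ALLRANKED: certain answers first, then all null tuples ordered by the
    predicted probability that the missing value equals `value`."""
    certain = certain_only(rows, attr, value)
    uncertain = [r for r in rows if r[attr] is None]
    uncertain.sort(key=lambda r: predict(r, value), reverse=True)
    return certain + uncertain

def toy_predict(row, value):
    # Hypothetical predictor: pretend a Z4 is very likely to be a convertible.
    return 0.9 if row["model"] == "Z4" else 0.1

def ids(rows):
    return [r["id"] for r in rows]
```

ALLRETURNED and ALLRANKED both need every tuple with a null Body Style, which is why the slide calls them infeasible for sources that do not accept null in a query.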
QPIAD
 Solves the problem by generating rewritten queries according
to a set of mined attribute correlation rules.

Approximate Functional Dependency(AFD)

Naïve Bayesian Classifier
QPIAD Solution
QPIAD Architecture
Overall Process
1. Learn
2. Rewrite
3. Rank
4. Explain
#1. Learn - AFD
 Learn Attribute Correlations

Approximate Functional Dependencies(AFD)

Approximate Keys (AKeys)
– For pruning

Learn by TANE algorithm

Y. Huhtala, et al. Efficient discovery of functional and approximate
dependencies using partitions. 1998.
 Pruning example

AFD {A1, A2} ~> A3

Akey {A1}
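A minimal sketch of how such a pruning step might work. Two assumptions are made here that the slide does not state: confidence is measured as the fraction of tuples that remain after removing the fewest violating tuples, and an AFD is pruned when its determining set contains an AKey. The function names and the toy rows are hypothetical, and this is not the TANE algorithm itself.

```python
from collections import Counter, defaultdict

def afd_confidence(rows, lhs, rhs):
    """Fraction of tuples kept if the minimum number of tuples is removed
    so that lhs -> rhs holds exactly (assumed confidence measure)."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)

def akey_confidence(rows, attrs):
    """Fraction of tuples kept if duplicates on attrs are removed."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return len(distinct) / len(rows)

def prune_afds(afds, akeys):
    """Drop every AFD whose determining set contains an approximate key.
    Assumed reading of 'for pruning': a (near) key determines every
    attribute, so the dependency carries little predictive value."""
    return [(lhs, rhs) for lhs, rhs in afds
            if not any(set(k) <= set(lhs) for k in akeys)]

ROWS = [
    {"A1": 1, "A2": "x", "A3": "p"},
    {"A1": 2, "A2": "x", "A3": "p"},
    {"A1": 3, "A2": "y", "A3": "q"},
    {"A1": 4, "A2": "y", "A3": "p"},
]
AFDS = [(("A1", "A2"), "A3"), (("A2",), "A3")]
AKEYS = [("A1",)]
```

With these toy rows, A1 is unique in every tuple, so under the assumed rule the AFD {A1, A2} ~> A3 is dropped while {A2} ~> A3 is kept.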
#1. Learn - Naïve Bayesian Classifier
 Learn Value distribution by NBC

Use the attributes of the mined AFD as the selected features

E.g.
– AFD {Make, Body} ~> Model
– P(Model = Accord | Make = Honda, Body = Coupe) = ?
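The probability in the example can be estimated with a small Naive Bayes classifier restricted to the AFD's determining set. The sketch below is an illustration under assumptions: it uses add-one (Laplace) smoothing, which the slide does not specify, and the training rows and function names are made up.

```python
from collections import Counter, defaultdict

def train_nbc(rows, features, target):
    """Count class and per-class feature-value frequencies on complete tuples."""
    class_counts = Counter()
    feat_counts = defaultdict(Counter)   # (feature, class) -> value counts
    feat_values = defaultdict(set)       # feature -> observed values
    for r in rows:
        if r[target] is None:
            continue
        class_counts[r[target]] += 1
        for f in features:
            if r[f] is not None:
                feat_counts[(f, r[target])][r[f]] += 1
                feat_values[f].add(r[f])
    return class_counts, feat_counts, feat_values

def predict_dist(model, evidence):
    """Return P(target = c | evidence) for every class c.
    The keys of `evidence` must be features used in training."""
    class_counts, feat_counts, feat_values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total
        for f, v in evidence.items():
            counts = feat_counts[(f, c)]
            # Add-one smoothing (an assumption of this sketch).
            score *= (counts[v] + 1) / (sum(counts.values()) + len(feat_values[f]))
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Made-up training sample.
SAMPLE = [
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Sedan", "model": "Civic"},
    {"make": "Honda", "body": "Coupe", "model": "Civic"},
    {"make": "Toyota", "body": "Sedan", "model": "Camry"},
]
NBC = train_nbc(SAMPLE, ["make", "body"], "model")
DIST = predict_dist(NBC, {"make": "Honda", "body": "Coupe"})
```

Here DIST["Accord"] plays the role of P(Model = Accord | Make = Honda, Body = Coupe). Restricting the features to {Make, Body} is the feature selection step the slide attributes to the AFD.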
#1. Learn - Selectivity
 Estimated selectivity of a rewritten query Q on relation R:
SmplSel(Q) * SmplRatio(R) * PerInc(R)

SmplSel(Q) = Selectivity of the rewritten query issued on the sample

SmplRatio(R) = Ratio of the original database size to the sample size

PerInc(R) = Percentage of incomplete tuples observed while creating the sample
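A worked example of the selectivity estimate, as a sketch. It assumes SmplSel(Q) is the number of sample tuples matched by the rewritten query; the function name and all numbers are hypothetical.

```python
def estimated_selectivity(sample_matches, db_size, sample_size, incomplete_fraction):
    """SmplSel(Q) * SmplRatio(R) * PerInc(R).

    sample_matches      -- SmplSel(Q): tuples the rewritten query returns on the sample
    db_size/sample_size -- SmplRatio(R): how many times larger the database is
    incomplete_fraction -- PerInc(R): share of incomplete tuples seen while sampling
    """
    smpl_ratio = db_size / sample_size
    return sample_matches * smpl_ratio * incomplete_fraction

# Hypothetical numbers: 12 matches on a 5,000-tuple sample of a
# 100,000-tuple database in which 10% of the tuples are incomplete.
EST = estimated_selectivity(12, 100_000, 5_000, 0.10)
```

With these numbers the estimate is about 24 incomplete tuples: 12 matches scaled up by a ratio of 20, then scaled down to the 10% of tuples that are incomplete.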
#2. Rewrite
1. Get the base result (certain answers)
2. Generate rewritten queries from the base result and the learned AFDs
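The two steps above can be sketched as follows, assuming an AFD such as {Model} ~> Body Style and a query on Body Style. The idea shown is one plausible reading of the slide: each distinct value of the determining set found in the base result becomes a rewritten query, and only returned tuples whose constrained attribute is null are kept. The function names and data are hypothetical.

```python
def rewrite_queries(base_result, determining_set):
    """One rewritten query per distinct determining-set value in the base result."""
    seen, queries = set(), []
    for row in base_result:
        key = tuple(row[a] for a in determining_set)
        if None in key or key in seen:
            continue
        seen.add(key)
        queries.append(dict(zip(determining_set, key)))
    return queries

def run_rewritten(source_rows, query, constrained_attr):
    """Simulate issuing a rewritten query, keeping only the tuples whose
    constrained attribute is missing (the uncertain answers)."""
    return [r for r in source_rows
            if all(r[a] == v for a, v in query.items())
            and r[constrained_attr] is None]

# Made-up source and base result for the query Body Style = Convt.
SOURCE = [
    {"id": 1, "model": "Z4", "body_style": "Convt"},
    {"id": 2, "model": "A4", "body_style": "Convt"},
    {"id": 3, "model": "Z4", "body_style": "Convt"},
    {"id": 4, "model": "Z4", "body_style": None},
    {"id": 5, "model": "Civic", "body_style": None},
]
BASE = [r for r in SOURCE if r["body_style"] == "Convt"]
QUERIES = rewrite_queries(BASE, ["model"])
```

Note that the rewritten queries bind only Model, so they never need to ask the source for null values directly.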
#3. Rank
1. Select the top-k rewritten queries based on F-measure
P = learned probability (estimated precision)
R = selectivity (estimated recall)
2. Reorder the selected queries based on P
3. Retrieve tuples
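A sketch of the select-then-reorder steps. The weighted form F = (1 + alpha) * P * R / (alpha * P + R) is an assumption of this example, since the slide only says "F-Measure"; the query names and the P and R values are made up, with R taken as an already normalized selectivity.

```python
def f_measure(p, r, alpha=1.0):
    """Weighted F-measure; alpha = 0 reduces to P, larger alpha favours R."""
    denom = alpha * p + r
    if denom == 0:
        return 0.0
    return (1 + alpha) * p * r / denom

def rank_queries(queries, k, alpha=1.0):
    """queries: list of (name, P, R).
    Step 1: keep the k queries with the highest F-measure.
    Step 2: order the kept queries by precision, highest first."""
    top_k = sorted(queries, key=lambda q: f_measure(q[1], q[2], alpha), reverse=True)[:k]
    return sorted(top_k, key=lambda q: q[1], reverse=True)

# Hypothetical rewritten queries with (precision, recall) estimates.
CANDIDATES = [
    ("Model = Z4", 0.9, 0.1),
    ("Model = A4", 0.6, 0.5),
    ("Model = Civic", 0.1, 0.4),
]
ORDER = [name for name, _, _ in rank_queries(CANDIDATES, k=2, alpha=1.0)]
```

Selecting on F-measure keeps queries that balance precision and recall, while the final ordering by P means the most likely relevant tuples are retrieved first.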
#4. Explain
Other Issues: Correlated Source
Other Issues: Handling Aggregation
Empirical Evaluation: Quality
 QPIAD vs. ALLRETURNED

ALLRETURNED has low precision because not all tuples with
missing values on the constrained attributes are relevant to the
query

QPIAD has a much higher precision than ALLRETURNED as it aims
to retrieve tuples with missing values on the constrained attributes
which are very likely to be relevant to the query
Empirical Evaluation: Efficiency
 QPIAD vs. ALLRANKED

The ALLRANKED approach is often infeasible because direct retrieval of
null values is often not allowed

QPIAD is able to achieve the same level of recall as ALLRANKED while
requiring far fewer tuples to be retrieved
Empirical Evaluation: Robustness
 Robustness w.r.t. Sample Size

QPIAD is robust even when faced with a relatively small data sample
Empirical Evaluation: Extensions
 Aggregates

Prediction of missing values
increases the fraction of queries
that achieve higher levels of
accuracy

Approximately 20% more queries
achieve 100% accuracy when
prediction is used
 Join

As alpha is increased, we obtain a
higher recall without sacrificing
much precision
Related Work




Querying Incomplete Databases

Possible World Approaches – track the completions of incomplete tuples (Codd tables, V-tables, conditional tables)

Probabilistic Approaches – quantify distribution over completions to distinguish between
likelihood of various possible answers
Probabilistic Databases

Each tuple is associated with an attribute describing the probability of its existence

However, in our work, the mediator does not have the capability to modify the underlying
autonomous databases
Query Reformulation / Relaxation

Aims to return similar or approximate answers to the user after returning or in the absence of
exact answers

Our focus is on retrieving tuples with missing values on constrained attributes
Learning Missing Values

Common imputation approaches replace missing values by substituting the mean, most
common value, default value, or using kNN, association rules, etc.

Our work requires schema level dependencies between attributes as well as distribution
information over missing values
Contribution

Efficiently retrieve relevant uncertain answers from autonomous
sources given only limited query access patterns
– Query Rewriting

Retrieve answers with missing values on constrained attributes
without modifying the underlying databases
– AFD-Enhanced Classifiers

Rewriting & ranking consider the natural tension between precision
and recall
– F-Measure based ranking

AFDs play a major role in:

Query Rewriting

Feature Selection

Explanations