WebIQ: Learning from the Web
to Match Deep-Web Query Interfaces
Wensheng Wu
Database & Information Systems Group
University of Illinois, Urbana
Joint work with AnHai Doan & Clement Yu
ICDE, April 2006
Search Problems on the Deep Web
Example query: find round-trip flights from Chicago to New York under $500
[Figure: query interfaces of individual deep-Web sources, e.g., united.com, airtravel.com, delta.com]
2
Solution: Build Data Integration Systems
Example query: find round-trip flights from Chicago to New York under $500
[Figure: a global query interface mediating the source interfaces of united.com, airtravel.com, delta.com]
Comparison shopping systems "on steroids"
3
Current State of Affairs

Very active in both the research community & industry

Research
– multidisciplinary efforts: Database, Web, KDD & AI
– 10+ research groups in the US, Asia & Europe
– focuses:
  – source discovery
  – schema matching & integration
  – query processing
  – data extraction

Industry
– Transformic, Glenbrook Networks, WebScalers, PriceGrabber, Shopping.com, MySimon, Google, …
4
Key Task: Schema Matching
[Figure: example query interfaces illustrating a 1-1 match and a complex match]
5
Schema Matching is Ubiquitous!

Fundamental problem in numerous applications
– data integration
– data warehousing
– peer data management
– ontology merging
– view integration
– personal information management

Schema matching across Web sources
– 30+ papers generated in the past few years
– Washington [AAAI-03, ICDE-05], Illinois [SIGMOD-03, SIGMOD-04, ICDE-06], MSR [VLDB-04], Binghamton [VLDB-03], HKUST [VLDB-04], Utah [WebDB-05], …
6
Schema Matching is Still Very Difficult
[Figure: example interfaces illustrating a 1-1 match and a complex match]

Must rely on properties of attributes, e.g., label & instances
Often there is little in common between matching attributes
Many attributes do not even have instances!
7
Matching Performance Greatly Hampered by
Pervasive Lack of Attribute Instances

28.1% ~ 74.6% of attributes have no instances

Extremely challenging to match these attributes
– e.g., does departure city match from city or departure date?

Also difficult to match attributes with dissimilar instances
– e.g., airline (with American airlines as instances) vs. carrier (with European airlines as instances)
8
Our Solution: Exploit the Web

Discover instances from the Web
– e.g., Chicago, New York, etc. for departure city & from city

Borrow instances from other attributes & validate via Web
– e.g., check with the Web whether Air Canada is an instance of carrier
9
Key Idea: Question-Answering from AI

Search Web via search engines, e.g., Google
… but search engines do not understand natural language questions
Idea: form extraction queries as sentences to be completed
"Trick" search engine to complete sentences with instances

Extraction patterns (L = attribute label, e.g., departure city; NP1, …, NPn = noun phrases):
– Ls such as NP1, …, NPn
– such Ls as NP1, …, NPn
– NP1, …, NPn, and other Ls
– Ls including NP1, …, NPn

Example extraction query: "departure cities such as" (see the sketch below)
10
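For concreteness, here is a minimal Python sketch of how extraction queries could be instantiated from an attribute label using the four patterns above. The pluralize helper and the exact query strings are illustrative assumptions, not the system's actual implementation.

```python
# Illustrative sketch (not WebIQ's actual code): instantiate the four
# extraction patterns for an attribute label such as "departure city".

def pluralize(label: str) -> str:
    # Very naive pluralization of the label (an assumption for illustration);
    # a real system would rely on a lexical resource.
    if label.endswith("y") and not label.endswith(("ay", "ey", "oy", "uy")):
        return label[:-1] + "ies"
    if label.endswith(("s", "ch", "sh", "x")):
        return label + "es"
    return label + "s"

def extraction_queries(label: str) -> list[str]:
    ls = pluralize(label)
    # The open end of each phrase is what the search engine "completes"
    # with candidate instances (the NP1, ..., NPn of the patterns).
    return [
        f'"{ls} such as"',       # Ls such as NP1, ..., NPn
        f'"such {ls} as"',       # such Ls as NP1, ..., NPn
        f'"and other {ls}"',     # NP1, ..., NPn, and other Ls
        f'"{ls} including"',     # Ls including NP1, ..., NPn
    ]

print(extraction_queries("departure city"))
# ['"departure cities such as"', '"such departure cities as"',
#  '"and other departure cities"', '"departure cities including"']
```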
Key Idea: Question-Answering from AI

Search Google & obtain snippets:
– e.g., "other departure cities such as Boston, Chicago and LAX available …"
  (extraction query: "departure cities such as"; completion: "Boston, Chicago and LAX available …")

Extract instance candidates from snippets (see the sketch below):
– Boston, Chicago, LAX
11
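A rough sketch of the candidate-extraction step, assuming snippets have already been fetched. It splits the text that follows the extraction query on commas and conjunctions; the actual system would presumably use noun-phrase chunking rather than this simple regex.

```python
import re

# Sketch: pull candidate instances from a snippet by taking the enumeration
# that completes the extraction query (regex-only; an NP chunker would do better).

def extract_candidates(snippet: str, query_phrase: str) -> list[str]:
    # Locate the text that follows the extraction-query phrase.
    m = re.search(re.escape(query_phrase) + r"\s+(.*)", snippet, re.IGNORECASE)
    if not m:
        return []
    tail = m.group(1)
    # Cut the enumeration off at the first sentence-ending punctuation.
    tail = re.split(r"[.;!?]", tail)[0]
    # Split on commas and the conjunctions "and"/"or".
    parts = re.split(r",|\band\b|\bor\b", tail)
    return [p.strip() for p in parts if p.strip()]

snippet = "other departure cities such as Boston, Chicago and LAX available ..."
print(extract_candidates(snippet, "departure cities such as"))
# ['Boston', 'Chicago', 'LAX available'] -- proper NP chunking would trim the
# trailing word, giving Boston, Chicago, LAX as in the example above.
```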
But Not Every Candidate Is a True Instance

Reason 1: Extraction queries may not be perfect
Reason 2: Web content is inherently noisy

Example:
– attribute: city
– extraction query: "and other cities"
– extracted candidate: 150

⇒ need to perform instance verification
12
Instance Verification: Outlier Detection

Goal: Remove statistical outliers (among candidates)

Step 1: Pre-processing
– recognize types of instances via pattern matching & the 80% rule
– types: numeric & string
– discard all candidates not of the determined type
– e.g., most instance candidates for city are strings, so remove 150

Step 2: Type-specific detection
– perform discordance tests
– test statistics, e.g.,
  – # of words: abnormal if a person name has more than 5 words
  – % of numeric characters: a US zip code contains only digits

(a small sketch of both steps follows below)
13
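The sketch below illustrates both verification steps under simple assumptions: a regex-based notion of "numeric", the 80% rule for picking the type, and a word-count discordance test applied generically rather than only to person names. The numeric pattern is an assumption made for illustration.

```python
import re

# Sketch of the two verification steps: type recognition via the 80% rule,
# then a simple type-specific discordance test (word count).

NUMERIC = re.compile(r"[\d.,/$%-]+")

def infer_type(candidates: list[str]) -> str | None:
    """80% rule: adopt a type if at least 80% of candidates match it."""
    n_numeric = sum(bool(NUMERIC.fullmatch(c)) for c in candidates)
    if n_numeric >= 0.8 * len(candidates):
        return "numeric"
    if len(candidates) - n_numeric >= 0.8 * len(candidates):
        return "string"
    return None

def keep_by_type(candidates: list[str], typ: str) -> list[str]:
    """Step 1: discard candidates that are not of the determined type."""
    is_num = lambda c: bool(NUMERIC.fullmatch(c))
    return [c for c in candidates if is_num(c) == (typ == "numeric")]

def discordance_test(candidates: list[str], max_words: int = 5) -> list[str]:
    """Step 2: drop candidates whose word count is abnormal for the attribute."""
    return [c for c in candidates if len(c.split()) <= max_words]

cands = ["Chicago", "New York", "Los Angeles", "150", "Boston"]
typ = infer_type(cands)                 # -> "string" (4 of 5 candidates)
cands = discordance_test(keep_by_type(cands, typ))
print(typ, cands)                       # "150" has been removed, as in the example
```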
Instance Verification: Web Validation


Goal: Further semantic-level validation
Idea: Exploit co-occurrence statistics of label & instances
– "Make: Honda; Model: Accord"
– "a variety of makes such as Honda, Mitsubishi"

Validation patterns (V + x, where V is a validation phrase built from label L and x is the candidate instance):
– L x
– Ls such as x
– such Ls as x
– x and other Ls
– Ls including x

Form validation queries using the validation patterns
– e.g., "make Honda", "makes such as Honda"
14
Instance Verification: Web Validation

Possible measure: NumHits(V + x)
– e.g., NumHits("cities such as Los Angeles") = 26M

Potential problem: bias towards popular instances
Use PMI(V, x), point-wise mutual information:
    PMI(V, x) = NumHits(V + x) / (NumHits(V) * NumHits(x))

Example (see the sketch below):
– V = "cities such as", candidates: California, Los Angeles
– NumHits(V + California) = 29
– PMI(V, Los Angeles) = 3000 * PMI(V, California)
15
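A minimal sketch of the PMI-based score, assuming a hit-count lookup num_hits(query); here it is stubbed with made-up counts purely so the sketch runs, and the counts are not the figures reported in the talk.

```python
# Sketch of the PMI-based validation score. num_hits stands in for a
# search-engine hit-count lookup (an assumption); the counts below are
# arbitrary stand-ins, not the figures reported in the talk.

FAKE_HITS = {
    "cities such as": 1_000_000,
    "Los Angeles": 50_000_000,
    "California": 60_000_000,
    "cities such as Los Angeles": 20_000,
    "cities such as California": 40,
}

def num_hits(query: str) -> int:
    return FAKE_HITS.get(query, 0)

def pmi(v: str, x: str) -> float:
    """PMI(V, x) = NumHits(V + x) / (NumHits(V) * NumHits(x))."""
    denom = num_hits(v) * num_hits(x)
    return num_hits(f"{v} {x}") / denom if denom else 0.0

# Unlike raw hit counts, PMI discounts the overall popularity of the candidate,
# so with these stub counts the true city (Los Angeles) scores higher than the
# merely popular term (California).
for cand in ["Los Angeles", "California"]:
    print(cand, pmi("cities such as", cand))
```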
Validate Instances from Other Attributes

Method 1: Discover k more instances from the Web
– then check whether the borrowed instance is among them (e.g., Aer Lingus for Airline)
– problem: very likely Aer Lingus is not among the discovered instances

Method 2: Compare the borrowed instance's validation score with that of a known instance
– problem: the score for Aer Lingus may be much lower; how to decide?

Key observation: also compare to the scores of non-instances
– e.g., Economy (with respect to Airline)
16
Train Validation-Based Instance Classifier

Naïve Bayes classifier with validation-based features
– validation phrases: V1 = "Airlines such as", V2 = "Airline"

Validation scores (M1 for V1, M2 for V2):

  Example        M1    M2     +/-
  Air Canada     .5    .3     +
  American       .8    .1     +
  Economy        .4    .03    -
  First Class    .2    .05    -
  Delta          .6    .3     +
  United         .9    .4     +
  Jan            .1    .06    -
  1              .3    .09    -

Binary features f1, f2 from thresholds t1 = .45, t2 = .075:

  Example        f1    f2     +/-
  Delta          1     1      +
  United         1     1      +
  Jan            0     0      -
  1              0     1      -

Naïve Bayes: P(C|X) ~ P(C) P(X|C), with P(+) = P(-) = ½
– e.g., P(f1=1|+) = 3/4, P(f1=1|-) = 1/4, …
(see the sketch below)
17
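The following sketch reproduces the classifier under my reading of the slide: binary features come from thresholding the validation scores, the first four labeled examples serve as training data, and Laplace smoothing yields the P(f1=1|+) = 3/4 and P(f1=1|-) = 1/4 estimates shown. The smoothing choice and the training/test split are assumptions, not necessarily WebIQ's exact procedure.

```python
# Sketch of the validation-based Naive Bayes classifier.

def binarize(scores, thresholds):
    """Turn raw validation scores into binary features f_i = [score_i > t_i]."""
    return tuple(int(s > t) for s, t in zip(scores, thresholds))

def train_nb(examples):
    """examples: list of (binary feature tuple, label '+' or '-')."""
    n = len(examples[0][0])
    counts = {"+": [0] * n, "-": [0] * n}
    totals = {"+": 0, "-": 0}
    for feats, label in examples:
        totals[label] += 1
        for i, f in enumerate(feats):
            counts[label][i] += f
    # Laplace-smoothed estimates of P(f_i = 1 | label)
    return {lab: [(counts[lab][i] + 1) / (totals[lab] + 2) for i in range(n)]
            for lab in ("+", "-")}

def classify(feats, probs, prior_pos=0.5):
    """P(C|X) ~ P(C) * prod_i P(f_i | C); return the more likely class."""
    def score(label):
        p = prior_pos if label == "+" else 1 - prior_pos
        for f, q in zip(feats, probs[label]):
            p *= q if f else 1 - q
        return p
    return "+" if score("+") >= score("-") else "-"

t = (0.45, 0.075)                       # thresholds t1, t2 from the slide
train = [(binarize((.5, .3), t), "+"),  # Air Canada
         (binarize((.8, .1), t), "+"),  # American
         (binarize((.4, .03), t), "-"), # Economy
         (binarize((.2, .05), t), "-")] # First Class
probs = train_nb(train)
print(probs["+"][0], probs["-"][0])     # 0.75 and 0.25, matching the slide
print(classify(binarize((.6, .3), t), probs))   # Delta -> '+'
```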
Validate Instances via Deep Web


Handles attributes that are difficult to validate via the Web, e.g., from (see the sketch below)
Disadvantage: ambiguity when no results are found
18
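A heavily hedged sketch of the deep-Web validation idea: submit the borrowed value to the source's own query interface and see whether any results come back. The URL, the form-field name, and the "no results" marker below are hypothetical placeholders; in practice each source's form and result page must be analyzed first.

```python
import requests  # probing the source's search form over HTTP (sketch only)

def validate_via_interface(value: str, form_url: str, field: str,
                           no_results_marker: str) -> bool:
    """Submit `value` through the source's query interface; return True if the
    result page does not contain the source's 'no results' message."""
    resp = requests.post(form_url, data={field: value}, timeout=10)
    # Note the ambiguity mentioned above: an empty result page is weak evidence,
    # since a valid value can still yield no results for other reasons.
    return no_results_marker not in resp.text

# Hypothetical usage (all names below are placeholders, not a real source):
# ok = validate_via_interface("Chicago", "https://example.com/flights/search",
#                             "from", "No flights found")
```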
Architecture of Assisted Matching System
[Figure: source interfaces → instance acquisition → source interfaces with augmented instances → interface matcher → attribute matches]
19
Empirical Evaluation


Five domains:

  Domain        # schemas   # attributes   % of attributes      Average depth
                            per schema     with no instances    of schemas
  Airfare       20          10.7           28.1                 3.6
  Automobile    20          5.1            38.6                 2.4
  Book          20          5.4            74.6                 2.3
  Job           20          4.6            30.0                 2.1
  Real Estate   20          6.5            32.2                 2.7

Experiments:
– Baseline: IceQ [Wu et al., SIGMOD-04]
– Web assistance

Performance metrics (see the sketch below):
– precision (P), recall (R), & F1 (= 2PR/(P+R))
20
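For completeness, a tiny sketch of the metrics, computed over sets of predicted vs. gold attribute matches; representing each match as a frozenset of attribute names is an assumption made for the example.

```python
# Sketch: precision, recall, and F1 over predicted vs. gold attribute matches.

def prf1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {frozenset({"departure city", "from city"}),
        frozenset({"airline", "carrier"})}
pred = {frozenset({"departure city", "from city"}),
        frozenset({"departure city", "departure date"})}
print(prf1(pred, gold))   # -> (0.5, 0.5, 0.5)
```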
Matching Accuracy

Web assistance boosts accuracy (F1) from 89.5 to 97.5
[Figure: bar chart of F1 (y-axis 80-100) per domain (Airfare, Automobile, Book, Job, Real Estate) for Baseline, Baseline + WebIQ, and Baseline + WebIQ + Threshold]
21
Overhead Analysis
Reasonable overhead: 6~11 minutes across domains
[Figure: bar chart of running time in minutes (y-axis 0-7) per domain (Airfare, Auto, Book, Job, RE) for Baseline, Attr-Surface, Surface, and Attr-Deep]
22
Conclusion

Search problems on the Deep Web are increasingly crucial!

Novel QA-based approach to learning attribute instances

Incorporation into a state-of-the-art matching system

Extensive evaluation over varied real-world domains
More details: search for Wensheng Wu on Google
23