Querying for relations from the
semi-structured Web
Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Contributors
Rahul Gupta
Girija Limaye Prashant Borole
Rakesh Pimplikar
Aditya Somani
Web Search
Mainstream web search
  User: keyword queries
  Search engine: ranked list of documents
  15 glorious years of serving all of the user's search needs through this least common denominator
Structured web search
  User: natural language queries / structured queries
  Search engine: point answers, record sets
  Many challenges in understanding both query and content
  15 years of slow but steady progress
The Quest for Structure
Vertical structured search engines: structure = schema, domain-specific
  Shopping: Shopbot (Etzioni+ 1997): product name, manufacturer, price
  Publications: CiteSeer (Lawrence, Giles+ 1998): paper title, author name, email, conference, year
  Jobs: FlipDog, WhizBang! Labs (Mitchell+ 2000): company name, job title, location, requirements
  People: DBLife (Doan 07): name, affiliations, committees served, talks delivered
Triggered much research on extraction and IR-style search of structured data (BANKS '02).
Horizontal Structured Search
Domain-independent structure: a small, generic set of structured primitives over entities, types, relationships, and properties
  <Entity> IsA <Type>, e.g., Mysore is a city
  <Entity> Has <Property>, e.g., <City> average rainfall <Value>
  <Entity1> <related-to> <Entity2>, e.g., <Person> born-in <City>, <Person> CEO-of <Company>
Types of Structured Search
Web+People Structured databases ( Ontologies)
Created manually (Psyche), or semi-automatically (Yago)
True Knowledge (2009), Wolfram Alpha (2009)
Web annotated with structured elements
Queries: Keywords + structured annotations
Example: <Physicist> +cosmos
Open-domain structure extraction and annotations of web
docs (2005—)
5
Users, Ontologies, and the Web
Users are from Venus
  Bi-syllabic, impatient, believe in mind-reading
Ontologies are from Mars
  One structure to fit all
Web content creators are from some other galaxy
  Ontologies = axvhjizb
  Let search engines bring the users
What is missed in Ontologies
The trivial, the transient, and the textual
  Procedural knowledge: what do I do on an error?
  A huge body of invaluable text of various types: reviews, literature, commentaries, videos
Context
  By stripping knowledge to its skeletal form, the context that is so valuable for search is lost.
  As long as queries are unstructured, the redundancy and variety in unstructured sources is invaluable.
Structured annotations in HTML
  IsA annotations: KnowItAll (2004)
  Open-domain relationships: TextRunner (Banko 2007)
  Ontological annotations: SemTag and Seeker (2003), Wikipedia annotations (Wikify! 2007, CSAW 2009)
All view documents as a sequence of tokens.
Challenging to ensure high accuracy.
WWT: Table queries over the semi-structured web
Queries in WWT
Query by content (sample rows):
  Alan Turing | Turing Machine
  E. F. Codd | Relational Databases

  Desh | Late night
  Bhairavi | Morning
  Patdeep | Afternoon
Query by description (column descriptions):
  Inventor, Computer science concept, Year
  Indian states
  Airport, City
Answer: Table with ranked rows
  Person | Concept/Invention
  Alan Turing | Turing Machine
  Seymour Cray | Supercomputer
  E. F. Codd | Relational Databases
  Tim Berners-Lee | WWW
  Charles Babbage | Babbage Engine
Keyword search to find structured records
Query: computer science concept inventor year
  The correct answer is not one click away.
  Verbose articles, not structured tables.
  Desired records are spread across many documents.
  The only document with an unstructured list of some desired records.
The only list in one of the retrieved pages.
Highly relevant Wikipedia table not retrieved in the top-k
Ideal answer should be integrated from these incomplete sources
Attempt 2: Include samples in query
Query: alan turing machine codd relational database (known examples)
  Retrieves documents relevant only to the keywords.
  The ideal answer is still spread across many documents.
WWT Architecture
[System diagram] Offline: record sources are extracted from the web, annotated against an ontology/type-system hierarchy, and stored in a content+context index; a resolver builder prepares cell and row resolvers. At query time: the query table goes through type inference, the index query builder issues keyword queries to retrieve sources L1,…,Lk, the record labeler (CRF models) extracts tables T1,…,Tk, and the resolver, consolidator, and ranker produce the final consolidated table with row and cell scores for the user.
Offline: Annotating to an Ontology
Annotate table cells with entity nodes and table columns with type nodes.
[Figure: an ontology fragment with types (All, People, Entertainers, movies, Indian_directors, Indian_films, English_films, 2008_films, Terrorism_films) and entities (A_Wednesday, Black&White, Coffee_house (film), Coffee_house (Loc)) linked to table cells]
Challenges
Ambiguity of entity names
  Noisy mentions of entity names: Black&White versus Black and White
  Multiple labels: "Coffee house" is both a movie name and a place name; the Yago ontology has on average 2.2 types per entity
Missing type links in the ontology, so we cannot use the least common ancestor
  Missing link: Black&White to 2008_films
  Not a missing link: 1920 to Terrorism_films
Scale: Yago has 1.9 million entities and 200,000 types
A unified approach
A graphical model jointly labels cells and columns to maximize a sum of scores, where
  ycj = entity label of cell c in column j
  yj = type label of column j
Score(ycj): string similarity between c and ycj
Score(yj): string similarity between the header of column j and yj
Score(yj, ycj):
  Subsumed entity: inversely proportional to the distance between yj and ycj
  Outside entity: fraction of overlapping entities between yj and the immediate parent of ycj
This handles missing links: the overlap of 2008_films with 2007_films is zero, but with Indian_films it is non-zero.
WWT Architecture (recap of the system diagram)
Extraction: Content queries
Extracting query columns from list records.
Query Q:
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York
A source Li:
  New York University (NYU), New York City, founded in 1831.
  Columbia University, founded in 1754 as King's College.
  Binghamton University, Binghamton, established in 1946.
  State University of New York, Stony Brook, New York, founded in 1957
  Syracuse University, Syracuse, New York, established in 1870
  State University of New York, Buffalo, established in 1846
  Rensselaer Polytechnic Institute (RPI) at Troy.
Lists are often human-generated.
Extraction
The query columns are extracted from the source list above, cell by cell.
A rule-based extractor is insufficient, and a statistical extractor needs training data. Generating that is not easy either!
Extraction: Labeled data generation
Lists are unlabeled, but labeled records are needed to train a CRF.
A fast but naïve approach for generating labeled records:
Query about colleges in NY:
  New York University | New York
  Monroe College | Brighton
  State University of New York | Stony Brook
Fragment of a relevant list source:
  New York Univ. in NYC
  Columbia University in NYC
  Monroe Community College in Brighton
  State University of New York in Stony Brook, New York.
In the list, look for matches of every query cell. A cell like "New York" or "New York University" matches at multiple places in the list.
Then greedily map each query row to the best match in the list.
Problems with the naïve approach:
  Some query rows remain unmapped (hurts recall), and unmatched list records are assumed to be 'Other'; other rows are wrongly mapped.
  The hard matching criterion has significantly low recall: it misses segments and does not use natural clues like Univ = University.
  Greedy matching can lead to really bad mappings.
Generating labeled data: Soft approach
Compute a match score for each (query row, source row) pair: the score of the best segmentation of the source row into the query columns (the slide shows edge weights such as 0.9, 0.3, and 1.8).
The score of a segment s for column c is the probability that cell c of the query row is the same as segment s, computed by the Resolver module based on the type of the column.
Compute the maximum-weight matching between query rows and source rows (the slide shows greedy matching in red for contrast).
  Better than greedily choosing the best match for each row.
  Soft string matching increases the labeled candidates significantly.
  This vastly improves recall and leads to better extraction models.
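The maximum-weight matching step can be sketched as follows. The score table is a made-up instance in the spirit of the slide's soft per-row scores, chosen so that greedy matching goes wrong, and brute-force enumeration stands in for a proper assignment algorithm (fine at query-table sizes).

```python
from itertools import permutations

# Made-up soft match scores between query rows (0..2) and source rows (0..3);
# pairs absent from the dict score 0.
scores = {
    (0, 0): 0.9, (0, 3): 1.0,   # query row 0 matches two source rows
    (1, 2): 0.3,
    (2, 0): 0.1, (2, 3): 1.8,
}

def best_matching(n_query, n_source):
    """Maximum-weight one-to-one matching of query rows to source rows,
    by brute force (fine for the handful of rows in a query table)."""
    best_score, best_map = -1.0, None
    for perm in permutations(range(n_source), n_query):
        s = sum(scores.get((q, r), 0.0) for q, r in enumerate(perm))
        if s > best_score:
            best_score, best_map = s, dict(enumerate(perm))
    return best_score, best_map

score, mapping = best_matching(3, 4)
# Greedy would hand source row 3 to query row 0 (score 1.0), leaving query
# row 2 with only 0.1; the matching keeps 0->0 so that 2->3 (1.8) survives.
```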
Extractor
Train a CRF on the generated labeled data.
Feature set:
  Delimiters and HTML tokens in a window around labeled segments
  Alignment features
Collective training over multiple sources.
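A minimal sketch of the kind of token-level features such a CRF might use (delimiter and HTML-tag context within a window); the exact feature templates in WWT are not specified here, so everything below is illustrative.

```python
def token_features(tokens, i, window=2):
    """Features for token i: the lowercased token itself plus any delimiter
    or HTML tokens within +/- `window` positions (relative offset encoded)."""
    feats = {"tok=" + tokens[i].lower()}
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        t = tokens[j]
        if t in {",", ";", ":", "(", ")"}:
            feats.add("delim@%d=%s" % (j - i, t))
        if t.startswith("<") and t.endswith(">"):
            feats.add("html@%d=%s" % (j - i, t))
    return feats

# Example: the token "Cornell" inside a list item.
feats = token_features(["<li>", "Cornell", "University", ","], 1)
```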
Experiments
Aim: reconstruct Wikipedia tables from only a few sample rows.
Sample queries:
  TV series: character name, actor name, season
  Oil spills: tanker, region, time
  Golden Globe Awards: actor, movie, year
  Dadasaheb Phalke Awards: person, year
  Parrots: common name, scientific name, family
Experiments: Dataset
Corpus: 16M lists from 500M pages of a web crawl; 45% of the lists retrieved by an index probe are irrelevant.
Query workload: 65 queries, with ground truth hand-labeled by 10 users over 1300 lists; 27% of the queries are not answerable with one list (difficult). The true consolidated table is 75% Wikipedia table rows plus 25% new rows not present in Wikipedia.
Extraction performance
Benefits of soft training-data generation, alignment features, and staged extraction, measured by F1 score: more than 80% F1 accuracy with just three query records.
Queries in WWT (recap of the earlier slide; the focus now shifts to query by description).
Extraction: Description queries
Tables often have non-informative headers or no headers at all, e.g., a two-column table listing Lithium 3, Sodium 11, Beryllium 4.
Use context to get at the relevant tables. Context is the union of:
  Text around the tables
  Headers
  Ontological annotations when present
[Figure: an ontology fragment over chemical elements (Chemical_elements, Metals/Non_Metals, Alkali/Non-alkali, Gas/Non-gas) used to annotate cells such as Hydrogen, Lithium, Sodium, Aluminium, Carbon]
Joint labeling of table columns
Given candidate tables T1, T2, …, Tn and query columns q1, q2, …, qm.
Task: label the columns of each Ti with {q1, q2, …, qm} (or no label) to maximize the sum of these scores:
  Score(T, j, qk) = ontology type match + header string match with qk
  Score(T, *, qk) = match of the description of T with qk
  Score(T, j, T', j', qk) = content overlap of column j of table T with column j' of table T' when both are labeled qk
Inference in a graphical model, solved via belief propagation.
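The column-labeling objective can be illustrated with a tiny exhaustive search over two hypothetical tables. The unary and overlap scores are invented, and exhaustive enumeration stands in for the belief-propagation inference used at real scale.

```python
from itertools import product

# Invented unary scores (header/type match) for columns of two candidate
# tables; (table, column, label) triples absent from the dict score 0.
unary = {
    ("T1", 0, "element"): 0.8, ("T1", 1, "number"): 0.6,
    ("T2", 0, "element"): 0.4, ("T2", 1, "number"): 0.5,
}
# Pairwise bonus when two content-overlapping columns get the same label.
overlap = {(("T1", 0), ("T2", 0)): 0.5}

LABELS = ["element", "number", None]   # None = column left unlabeled
COLS = [("T1", 0), ("T1", 1), ("T2", 0), ("T2", 1)]

def joint_label():
    """Exhaustively maximize unary + pairwise scores over all labelings."""
    best_score, best_assign = -1.0, None
    for assign in product(LABELS, repeat=len(COLS)):
        a = dict(zip(COLS, assign))
        s = sum(unary.get((t, c, lab), 0.0)
                for (t, c), lab in a.items() if lab is not None)
        for (c1, c2), w in overlap.items():
            if a[c1] is not None and a[c1] == a[c2]:
                s += w
        if s > best_score:
            best_score, best_assign = s, a
    return best_assign

assignment = joint_label()
```

The pairwise overlap term is what makes the labeling collective: content shared between columns of different tables pulls them toward the same query label.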
WWT Architecture (recap of the system diagram)
Step 3: Consolidation
Merging the extracted tables into one. Rows extracted from multiple sources:
  Cornell University | Ithaca
  SUNY | Stony Brook
  State University of New York | Stony Brook
  New York University (NYU) | New York
  New York University | New York City
  RPI | Troy
  Binghamton University | Binghamton
  Columbia University | New York
  Syracuse University | Syracuse
Merged table (duplicates consolidated):
  Cornell University | Ithaca
  State University of New York OR SUNY | Stony Brook
  New York University OR New York University (NYU) | New York City OR New York
  Binghamton University | Binghamton
  RPI | Troy
  Columbia University | New York
  Syracuse University | Syracuse
Consolidation
Challenge: resolving whether two rows are the same in the face of extraction errors and missing columns; the setting is open-domain, with no training data.
Our approach: a specially designed Bayesian network with interpretable and generalizable parameters.
Resolver (Bayesian network):
  P(RowMatch | rows q, r) is composed from cell-level probabilities P(i-th cell match | qi, ri) for each column i.
  Parameters are set automatically using list statistics, derived from user-supplied type-specific similarity functions.
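One plausible reading of this resolver can be sketched as follows: cell-level match probabilities derived from a type-specific similarity (here difflib, with illustrative match/non-match parameters of the kind a resolver builder might estimate from list statistics), combined into a row-match probability that skips missing cells.

```python
from difflib import SequenceMatcher

def sim(a, b):
    """Stand-in for a user-supplied type-specific similarity function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cell_match_prob(a, b, match_mean=0.9, nonmatch_mean=0.3):
    """P(two cells denote the same value) from their similarity, via a
    two-class likelihood ratio; the means are illustrative parameters."""
    s = sim(a, b)
    like_match = match_mean * s + (1 - match_mean) * (1 - s)
    like_non = nonmatch_mean * s + (1 - nonmatch_mean) * (1 - s)
    return like_match / (like_match + like_non)

def row_match_prob(q, r):
    """P(RowMatch): product of cell-level match probabilities, skipping
    missing cells so incomplete rows are not unduly penalized."""
    p = 1.0
    for a, b in zip(q, r):
        if a is None or b is None:
            continue
        p *= cell_match_prob(a, b)
    return p
```

Skipping `None` cells is what lets a row like (-, -, NY) still be compared against fuller rows, matching the missing-columns challenge above.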
Ranking
Factors for ranking:
  Relevance: membership in overlapping sources
  Support from multiple sources
  Completeness: importance of the columns present; penalize records with only common 'spam' columns like City and State
  Correctness: extraction confidence
  School | Location | State | Merged Row Confidence | Support
  - | - | NY | 0.99 | 9
  - | NYC | New York | 0.95 | 7
  New York Univ. OR New York University | New York City OR New York | New York | 0.85 | 4
  University of Rochester OR Univ. of Rochester | Rochester | New York | 0.50 | 2
  University of Buffalo | Buffalo | New York | 0.70 | 2
  Cornell University | Ithaca | New York | 0.76 | 1
Relevance ranking on set membership
Weighted sum approach:
  Score of a table t: s(t) = fraction of query rows in t
  Relevance of a consolidated row r: sum of s(t) over the tables t that contain r
Graph-walk-based approach:
  Random walk from consolidated rows to table nodes, starting from the query rows, with random restarts to the query rows.
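The graph-walk relevance can be sketched as a random walk with restarts on a toy row-table bipartite graph; the node names and edges below are invented for illustration.

```python
# Toy bipartite membership graph: row nodes <-> table (source) nodes.
edges = {
    "r0": ["t0"], "r1": ["t0", "t1"], "r2": ["t1"], "r3": ["t2"],
    "t0": ["r0", "r1"], "t1": ["r1", "r2"], "t2": ["r3"],
}
QUERY_ROWS = ["r0", "r1"]

def relevance(alpha=0.15, iters=200):
    """Random walk with restart: with probability alpha jump back to a
    query row, otherwise move to a uniformly chosen neighbor. The
    stationary mass on a row node is its relevance."""
    restart = {n: (1.0 / len(QUERY_ROWS) if n in QUERY_ROWS else 0.0)
               for n in edges}
    p = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in edges}
        for n, mass in p.items():
            for nb in edges[n]:
                nxt[nb] += (1 - alpha) * mass / len(edges[n])
        p = nxt
    return p

scores = relevance()
# r2 shares table t1 with query row r1; r3 is unreachable from the query.
```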
Ranking Criteria
Score(row r) =
  graph-relevance of r
  × importance of the columns C present in r (high if C functionally determines the others)
  × combined cell extraction confidence (a noisy-OR of the cell extraction confidences from the individual CRFs)
  School | Location | State | Merged Row Confidence | Support
  - | - | NY (0.99) | 0.99 | 9
  - | NYC (0.98) | New York (0.98) | 0.95 | 7
  New York Univ. OR New York University (0.90) | New York City OR New York (0.98) | New York (0.95) | 0.85 | 4
  University of Rochester OR Univ. of Rochester (0.80) | Rochester (0.95) | New York (0.99) | 0.50 | 2
  University of Buffalo (0.88) | Buffalo (0.99) | New York (0.99) | 0.70 | 2
  Cornell University (0.92) | Ithaca (0.95) | New York (0.99) | 0.76 | 1
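The score combination might look like the sketch below; the noisy-OR combiner and the multiplicative form follow the slide, while the numeric inputs in the usage example are illustrative.

```python
def noisy_or(confidences):
    """Combine several CRF extraction confidences for one cell: the cell
    is right if any one of the independent extractions is right."""
    p_all_wrong = 1.0
    for c in confidences:
        p_all_wrong *= (1.0 - c)
    return 1.0 - p_all_wrong

def row_score(graph_relevance, column_importance, cell_confidences):
    """Score(row) = graph relevance x importance of the columns present
    x combined extraction confidence (multiplicative, as on the slide)."""
    conf = 1.0
    for confs in cell_confidences:     # one list of confidences per cell
        conf *= noisy_or(confs)
    return graph_relevance * column_importance * conf

# A row with only 'spam' columns (low column importance) loses to a fuller
# row even when its extraction confidence is higher.
spam_only = row_score(0.9, 0.2, [[0.99]])
full_row = row_score(0.7, 1.0, [[0.9], [0.95]])
```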
Overall performance
Results are reported for all queries and separately for difficult queries.
To justify the sophisticated consolidation and resolution, compare with:
  Processing only the magically known single best list (no consolidation/resolution required)
  Simple consolidation, with no merging of approximate duplicates
WWT achieves more than 55% recall and beats both baselines; the gain is bigger for difficult queries.
Running time
Under 30 seconds with 3 query records; can be improved by processing sources in parallel. Variance is high because the time depends on the number of columns, record length, etc.
Related Work
Google Squared
  Developed independently; launched in May 2009.
  The user provides a keyword query, e.g. "list of Italian joints in Manhattan", and the schema is inferred.
  Technical details are not public.
Prior methods for extraction and resolution assume labeled data or pre-trained parameters. We generate labeled data, and automatically train resolver parameters from the list sources.
Summary
Structured web search and the role of non-text, partially structured web sources.
The WWT system:
  Domain-independent
  Online: structure interpretation at query time
  Relies heavily on unsupervised statistical learning
    Graphical model for table annotation
    Soft approach for generating labeled data
    Collective column labeling for descriptive queries
    Bayesian network for resolution and consolidation
    PageRank-style relevance plus confidence from a probabilistic extractor for ranking
What next?
  Designing plans for non-trivial ways of combining sources
  Better ranking and user-interaction models
  Expanding the query set
    Aggregate queries: tables are rich in quantities
    Point queries: attribute-value and relationship queries
  Interplay between the semi-structured web and ontologies: augmenting one with the other
  Quantifying the information in structured sources vis-à-vis text sources on typical query workloads