Querying for relations from the
semi-structured Web
Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Contributors: Rahul Gupta, Girija Limaye, Prashant Borole, Rakesh Pimplikar, Aditya Somani
Web Search

Mainstream web search
 • User → Keyword queries
 • Search engine → Ranked list of documents
 • 15 glorious years of serving all of the user's search needs through this least common denominator

Structured web search
 • User → Natural language queries / Structured queries
 • Search engine → Point answers, record sets
 • Many challenges in understanding both the query and the content
 • 15 years of slow but steady progress
The Quest for Structure

Vertical structured search engines
 • Structure → Schema → Domain-specific
 • Shopping: Shopbot (Etzioni et al., 1997)
   – Product name, manufacturer, price
 • Publications: CiteSeer (Lawrence, Giles, et al., 1998)
   – Paper title, author name, email, conference, year
 • Jobs: FlipDog, WhizBang Labs (Mitchell et al., 2000)
   – Company name, job title, location, requirements
 • People: DBLife (Doan, 2007)
   – Name, affiliations, committees served, talks delivered

Triggered much research on extraction and IR-style search of structured data (BANKS '02).
Horizontal Structured Search

 • Domain-independent structure
 • Small, generic set of structured primitives over entities, types, relationships, and properties
   – <Entity> IsA <Type>
     e.g. Mysore is a city
   – <Entity> Has <Property>
     e.g. <City> Average rainfall <Value>
   – <Entity1> <related-to> <Entity2>
     e.g. <Person> born-in <City>, <Person> CEO-of <Company>
Types of Structured Search

 • Web + People → Structured databases (Ontologies)
   – Created manually (Cyc) or semi-automatically (Yago)
   – True Knowledge (2009), Wolfram Alpha (2009)
 • Web annotated with structured elements
   – Queries: keywords + structured annotations
   – Example: <Physicist> +cosmos
   – Open-domain structure extraction and annotation of web docs (2005–)
Users, Ontologies, and the Web

 • Users are from Venus
   – Bi-syllabic, impatient, believe in mind-reading
 • Ontologies are from Mars
   – One structure to fit all
 • Web content creators are from some other galaxy
   – Ontologies = axvhjizb
   – Let the search engines bring the users
What is missed in Ontologies

 • The trivial, the transient, and the textual
 • Procedural knowledge
   – What do I do on an error?
 • Huge body of invaluable text of various types
   – Reviews, literature, commentaries, videos
 • Context
   – By stripping knowledge to its skeletal form, the context that is so valuable for search is lost.
 • As long as queries are unstructured, the redundancy and variety of unstructured sources is invaluable.
Structured annotations in HTML

 • IsA annotations
   – KnowItAll (2004)
 • Open-domain relationships
   – TextRunner (Banko 2007)
 • Ontological annotations
   – SemTag and Seeker (2003)
   – Wikipedia annotations (Wikify! 2007, CSAW 2009)
 • All view documents as a sequence of tokens
 • Challenging to ensure high accuracy
WWT: Table queries over the semi-structured web
Queries in WWT

 • Query by content

   Alan Turing | Turing Machine
   E. F. Codd  | Relational Databases

   Desh     | Late night
   Bhairavi | Morning
   Patdeep  | Afternoon

 • Query by description

   Inventor | Computer science concept | Year

   Indian states | Airport City
Answer: Table with ranked rows

 Person          | Concept/Invention
 Alan Turing     | Turing Machine
 Seymour Cray    | Supercomputer
 E. F. Codd      | Relational Databases
 Tim Berners-Lee | WWW
 Charles Babbage | Babbage Engine
Keyword search to find structured records

Query: computer science concept inventor year
 • Correct answer is not one click away.
 • Verbose articles, not structured tables.
 • Desired records spread across many documents.
 • The only document with an unstructured list of some desired records.

The only list in one of the retrieved pages.

 • Highly relevant Wikipedia table not retrieved in the top-k.
 • Ideal answer should be integrated from these incomplete sources.
Attempt 2: Include samples in query

Query: alan turing machine codd relational database
 • Known examples
 • Documents relevant only to the keywords
 • Ideal answer still spread across many documents
WWT Architecture

[Pipeline diagram]
 • Offline: Web → extract record sources → annotate against the Ontology / type-system hierarchy → store in a content + context index.
 • Query time: the user's query table goes through type inference and the index query builder (keyword query); the retrieved sources L1,…,Lk are labeled by the record labeler / extractor (CRF models) into tables T1,…,Tk; the resolver (cell resolver and row resolver, built by the resolver builder) and the consolidator merge them; the ranker uses the row and cell scores to produce the final consolidated table returned to the user.
Offline: Annotating to an Ontology

Annotate table cells with entity nodes and table columns with type nodes.

[Example figure: an ontology fragment rooted at All, with type nodes such as People, Entertainers, Indian_directors, movies, Indian_films, English_films, 2008_films and Terrorism_films; table cells like "A Wednesday", "Black&White" and "Coffee house" link to entity nodes such as A_Wednesday and Coffee_house (film), while "Coffee house" also matches Coffee_house (Loc).]
Challenges

 • Ambiguity of entity names
   – "Coffee house" is both a movie name and a place name
 • Noisy mentions of entity names
   – Black&White versus Black and White
 • Multiple labels
   – The Yago Ontology has on average 2.2 types per entity
 • Missing type links in the Ontology ⇒ cannot use the least common ancestor
   – Missing link: Black&White to 2008_films
   – Not a missing link: 1920 to Terrorism_films
 • Scale
   – Yago has 1.9 million entities and 200,000 types
A unified approach

Graphical model to jointly label cells and columns so as to maximize the sum of scores, where
 • y_cj = entity label of cell c in column j
 • y_j  = type label of column j

 • Score(y_cj): string similarity between cell c and y_cj.
 • Score(y_j): string similarity between the header of column j and y_j.
 • Score(y_j, y_cj):
   – Subsumed entity (y_cj lies under type y_j): inversely proportional to the distance between them.
   – Outside entity (y_cj lies outside y_j): fraction of overlapping entities between y_j and the immediate parent of y_cj.
 • Handles missing links: the overlap of 2008_films with 2007_films is zero, but with Indian_films it is non-zero.

[Figure: a column with type label y_j = movies; candidate cell labels include subsumed entities y_1j, y_2j under English_films / Indian_films / Terrorism_films and an outside entity y_3j.]

(A toy sketch of this joint objective follows.)
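To make the objective concrete, here is a minimal, hypothetical sketch of the joint labeling just described. The three score functions are stand-ins for the string-similarity and ontology-based scores above (not WWT's actual implementation); the search enumerates column types because, once the column type is fixed, each cell can be labeled independently.

```python
# A minimal, hypothetical sketch of the joint cell/column labeling objective:
# choose a type label for the column and an entity label for every cell so that
# Score(y_cj) + Score(y_j) + Score(y_j, y_cj), summed over the column, is maximal.
def best_column_labeling(cells, cand_types, cand_entities,
                         score_cell, score_type, score_compat):
    """cells: list of cell strings of one column.
    cand_types: candidate type labels for the column.
    cand_entities: dict cell -> list of candidate entity labels.
    The three score_* callables are assumed, e.g. string similarity and
    ontology distance/overlap as described above."""
    best = (None, None, float("-inf"))
    for y_col in cand_types:
        total = score_type(y_col)
        assignment = {}
        for c in cells:
            # For a fixed column type, cells decouple and can be chosen greedily.
            e_best = max(cand_entities[c],
                         key=lambda e: score_cell(c, e) + score_compat(y_col, e))
            assignment[c] = e_best
            total += score_cell(c, e_best) + score_compat(y_col, e_best)
        if total > best[2]:
            best = (y_col, assignment, total)
    return best
```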
WWT Architecture (pipeline diagram repeated; next: extraction).
Extraction: Content queries

Extracting the query columns from list records.

Query Q:
 Cornell University           | Ithaca
 State University of New York | Stony Brook
 New York University          | New York

A source L_i:
 • New York University (NYU), New York City, founded in 1831.
 • Columbia University, founded in 1754 as King's College.
 • Binghamton University, Binghamton, established in 1946.
 • State University of New York, Stony Brook, New York, founded in 1957.
 • Syracuse University, Syracuse, New York, established in 1870.
 • State University of New York, Buffalo, established in 1846.
 • Rensselaer Polytechnic Institute (RPI) at Troy.

Lists are often human generated.
Extraction

From the same query Q and source list, the extractor must pull out the table columns (school, location) of each list record.

Rule-based extractors are insufficient. A statistical extractor needs training data, and generating that is not easy either!
Extraction: Labeled data generation

Lists are unlabeled, but labeled records are needed to train a CRF. A fast but naïve approach to generating labeled records:

Query about colleges in NY:
 New York University          | New York
 Monroe College               | Brighton
 State University of New York | Stony Brook

Fragment of a relevant list source:
 • New York Univ. in NYC
 • Columbia University in NYC
 • Monroe Community College in Brighton
 • State University of New York in Stony Brook, New York.
Steps of the naïve approach:
 • In the list, look for matches of every query cell.
   (Note the ambiguity: there is another match for "New York" and another for "New York University".)
 • Greedily map each query row to the best match in the list.
 • Result: some source segments remain unmapped (hurting recall) and are treated as 'Other', and some query rows are wrongly mapped.

Hard matching criteria have significantly low recall:
 • Missed segments.
 • Does not use natural clues like Univ = University.
 • Greedy matching can lead to really bad mappings.
Generating labeled data: Soft approach

[Figure: the same query table and list source; each (query row, source row) pair now gets a soft match score, e.g. 0.9, 0.3, 1.8.]

Match score for each (query row, source row) pair:
 • The score of the best segmentation of the source row into the query columns (sketched below).
 • The score of a segment s for column c is the probability that cell c of the query row is the same as segment s.
 • Computed by the Resolver module based on the type of the column.
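The "best segmentation of a source row into the query columns" can be computed with a simple dynamic program. The sketch below is an assumed formulation, not WWT's code: `seg_score` stands in for the Resolver's segment-vs-query-cell probability, columns are assumed to appear in a fixed order, and unmatched tokens contribute nothing.

```python
# A minimal, assumed sketch of scoring the best segmentation of a source row's
# tokens into the query columns: choose non-overlapping segments, at most one
# per column and in column order, maximizing the sum of segment scores.
def best_segmentation_score(tokens, num_columns, seg_score, max_len=6):
    """seg_score(segment_text, column_index) -> similarity of the segment to
    the query cell of that column (e.g. from the Resolver)."""
    n, m = len(tokens), num_columns
    # best[i][c] = best score using the first i tokens and the first c columns.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(m + 1):
            best[i][c] = best[i - 1][c]                    # leave token i-1 unmatched
            if c > 0:
                best[i][c] = max(best[i][c], best[i][c - 1])   # skip column c
                for j in range(max(0, i - max_len), i):        # match tokens[j:i] to column c-1
                    seg = " ".join(tokens[j:i])
                    best[i][c] = max(best[i][c],
                                     best[j][c - 1] + seg_score(seg, c - 1))
    return best[n][m]
```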
Generating labeled data: Soft approach (continued)

[Figure: bipartite graph between query rows and source rows with match scores such as 0.9, 0.3, 0.7, 1.8 and 2.0; the greedy matching is shown in red.]

Compute the maximum-weight matching between query rows and source rows (see the sketch below):
 • Better than greedily choosing the best match for each row.
 • Soft string-matching increases the labeled candidates significantly.
 • Vastly improves recall and leads to better extraction models.
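A minimal sketch of the maximum-weight matching step. The slides do not prescribe a solver; here the Hungarian algorithm from SciPy is used on an illustrative score matrix (query rows vs. source rows), with scores negated because `linear_sum_assignment` minimizes cost.

```python
# Maximum-weight matching of query rows to source rows, sketched with SciPy.
# scores[i][j] is the assumed soft match score between query row i and source
# row j, e.g. from the Resolver's segmentation scores above.
import numpy as np
from scipy.optimize import linear_sum_assignment

scores = np.array([
    [0.9, 0.3, 0.0, 0.2],   # New York University, New York
    [0.0, 0.0, 0.7, 0.0],   # Monroe College, Brighton
    [0.1, 0.0, 0.0, 2.0],   # State University of New York, Stony Brook
])  # columns: NY Univ. in NYC, Columbia Univ., Monroe CC, SUNY Stony Brook

rows, cols = linear_sum_assignment(-scores)   # negate: maximize total score
for q, s in zip(rows, cols):
    if scores[q, s] > 0:                      # keep only positive-score pairs
        print(f"query row {q} -> source row {s} (score {scores[q, s]:.1f})")
```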
Extractor

 • Train a CRF on the generated labeled data.
 • Feature set:
   – Delimiters and HTML tokens in a window around labeled segments.
   – Alignment features.
 • Collective training over multiple sources.

(A toy feature sketch follows.)
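For illustration, a minimal sketch of the kind of window/delimiter features fed to a sequence labeler over a softly-labeled record. The concrete feature set and the `sklearn_crfsuite` library are assumptions for this sketch; WWT's actual extractor and features (including alignment features and collective training) differ.

```python
# Toy window/delimiter features for training a CRF on one softly-labeled record.
import sklearn_crfsuite

def token_features(tokens, i, window=2):
    # Current word plus a small window of neighbours; flag separators such as
    # punctuation or the word "in" that often delimit fields in list records.
    feats = {"word": tokens[i].lower(), "is_delim": tokens[i] in {",", ";", "in"}}
    for d in range(1, window + 1):
        if i - d >= 0:
            feats[f"prev{d}"] = tokens[i - d].lower()
        if i + d < len(tokens):
            feats[f"next{d}"] = tokens[i + d].lower()
    return feats

tokens = "State University of New York in Stony Brook".split()
labels = ["School"] * 5 + ["Other"] + ["Location"] * 2      # from soft matching
X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```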
Experiments

Aim: reconstruct Wikipedia tables from only a few sample rows.

Sample queries:
 • TV series: Character name, Actor name, Season
 • Oil spills: Tanker, Region, Time
 • Golden Globe Awards: Actor, Movie, Year
 • Dadasaheb Phalke Awards: Person, Year
 • Parrots: Common name, Scientific name, Family
Experiments: Dataset

 • Corpus
   – 16M lists from 500M pages of a web crawl.
   – 45% of the lists retrieved by an index probe are irrelevant.
 • Query workload
   – 65 queries; ground truth hand-labeled by 10 users over 1300 lists.
   – 27% of the queries are not answerable with one list (the difficult queries).
   – True consolidated table = 75% of the Wikipedia table plus 25% new rows not present in Wikipedia.
Extraction performance

 • Benefits of soft training-data generation, alignment features, and staged extraction measured on F1 score.
 • More than 80% F1 with just three query records.
Queries in WWT (recap)

 • Query by content (as above).
 • Query by description, e.g. Inventor | Computer science concept | Year, or Indian states | Airport City.
Extraction: Description queries

Such tables often have non-informative headers or no headers at all:

 Lithium   | 3
 Sodium    | 11
 Beryllium | 4

Context is needed to get at the relevant tables:
 • Ontological annotations
   [Figure: an ontology fragment (All → Chemical_elements → Metals / Non_Metals, Alkali / Non-alkali, Gas / Non-gas; People) with element cells such as Hydrogen, Lithium, Sodium, Aluminium and Carbon linked to it.]
 • Context is the union of
   – the text around tables,
   – the headers, and
   – ontology labels when present.
Joint labeling of table columns

 • Given
   – Candidate tables T1, T2, …, Tn
   – Query columns q1, q2, …, qm
 • Task: label the columns of each Ti with {q1, q2, …, qm, none} so as to maximize the sum of these scores:
   – Score(T, j, qk) = ontology type match + header string match with qk
   – Score(T, *, qk) = match of the description of T with qk
   – Score(T, j, T', j', qk) = content overlap of column j of table T with column j' of table T' when both are labeled qk
 • Inference in a graphical model, solved via Belief Propagation (a sketch of the objective follows).
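A minimal sketch of the objective being maximized, with hypothetical score callables: `node_score` bundles Score(T, j, qk) together with the table-description term Score(T, *, qk), and `pair_score` is the content-overlap term. Only the objective is shown; in WWT the maximizing labeling is found via Belief Propagation rather than enumeration.

```python
# Objective for labeling table columns with query columns (or None).
def total_score(labeling, node_score, pair_score):
    """labeling: dict (table, column) -> query label, or None for unlabeled.
    node_score(t, j, q): ontology-type + header-string (+ description) match.
    pair_score(t, j, t2, j2, q): content overlap of two columns sharing label q."""
    s = 0.0
    items = list(labeling.items())
    for (t, j), q in items:
        if q is not None:
            s += node_score(t, j, q)
    for i, ((t, j), q) in enumerate(items):
        for (t2, j2), q2 in items[i + 1:]:
            # Reward pairs of columns from different tables that agree on a label.
            if q is not None and q == q2 and t != t2:
                s += pair_score(t, j, t2, j2, q)
    return s
```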
WWT Architecture (pipeline diagram repeated; next: consolidation and ranking).
Step 3: Consolidation

Merging the extracted tables into one.

[Figure: two extracted tables, with rows such as Cornell University | Ithaca, SUNY | Stony Brook, State University of New York | Stony Brook, New York University (NYU) | New York, New York University | New York City, RPI | Troy, Binghamton University | Binghamton, Columbia University | New York and Syracuse University | Syracuse, are merged into the consolidated table below.]

 Cornell University                                | Ithaca
 State University of New York OR SUNY              | Stony Brook
 New York University OR New York University (NYU)  | New York City OR New York
 Binghamton University                             | Binghamton
 RPI                                               | Troy
 Columbia University                               | New York
 Syracuse University                               | Syracuse

Merging duplicates.
Consolidation

 • Challenge: deciding when two rows are the same in the face of
   – extraction errors,
   – missing columns,
   – an open domain,
   – and no training data.
 • Our approach: a specially designed Bayesian network with interpretable and generalizable parameters (see the sketch below).

Resolver (Bayesian network):
 • P(RowMatch | rows q, r) is computed from cell-level probabilities P(i-th cell match | q_i, r_i) for i = 1, …, n.
 • Parameters are set automatically using list statistics.
 • Cell-level probabilities are derived from user-supplied, type-specific similarity functions.
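An illustrative sketch, not the paper's exact network, of how per-cell match probabilities can be combined into P(RowMatch): cell matches are treated as soft evidence that is conditionally independent given RowMatch, and the prior and conditional probabilities stand in for the parameters WWT estimates from list statistics.

```python
# Row-match probability from per-cell match probabilities (assumed structure).
def row_match_prob(cell_match_probs, prior=0.5,
                   p_cell_given_match=0.9, p_cell_given_nomatch=0.2):
    """cell_match_probs: per-column probabilities that the two cells match,
    e.g. from type-specific similarity functions; missing columns are omitted.
    Returns P(RowMatch | cell evidence) by Bayes' rule, assuming the cells are
    conditionally independent given RowMatch."""
    like_match, like_nomatch = prior, 1.0 - prior
    for p in cell_match_probs:
        # Soft evidence: mix the two conditional likelihoods by p.
        like_match *= p * p_cell_given_match + (1 - p) * (1 - p_cell_given_match)
        like_nomatch *= p * p_cell_given_nomatch + (1 - p) * (1 - p_cell_given_nomatch)
    return like_match / (like_match + like_nomatch)

# Two rows agreeing strongly on two of three cells:
print(row_match_prob([0.95, 0.9, 0.4]))
```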
Ranking

Factors for ranking:
 • Relevance: membership in overlapping sources.
 • Support from multiple sources.
 • Completeness: importance of the columns present.
   – Penalize records with only common 'spam' columns like City and State.
 • Correctness: extraction confidence.

 School                                         | Location                  | State    | Merged Row Confidence | Support
 -                                              | -                         | NY       | 0.99                  | 9
 -                                              | NYC                       | New York | 0.95                  | 7
 New York Univ. OR New York University          | New York City OR New York | New York | 0.85                  | 4
 University of Rochester OR Univ. of Rochester  | Rochester                 | New York | 0.50                  | 2
 University of Buffalo                          | Buffalo                   | New York | 0.70                  | 2
 Cornell University                             | Ithaca                    | New York | 0.76                  | 1
Relevance ranking on set membership

 • Weighted-sum approach (sketch below)
   – Score of a source table t: s(t) = fraction of query rows present in t.
   – Relevance of a consolidated row r: sum of s(t) over all tables t that contain r, i.e. Σ_{t : r ∈ t} s(t).
 • Graph-walk based approach
   – Random walk from consolidated rows to table nodes, started from the query rows, with random restarts to the query rows.
   [Figure: a graph with query rows, consolidated rows, and table nodes.]
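A minimal sketch of the weighted-sum relevance: s(t) is the fraction of query rows matched in source table t, and a consolidated row's relevance is the sum of s(t) over the tables it appears in. Table and row ids here are illustrative.

```python
# Weighted-sum relevance over set membership of consolidated rows in source tables.
def table_scores(tables, query_rows):
    """tables: dict table_id -> set of row ids it contains (matched query rows
    included). Returns s(t) for every table."""
    return {t: len(rows & query_rows) / len(query_rows) for t, rows in tables.items()}

def row_relevance(tables, query_rows):
    s = table_scores(tables, query_rows)
    relevance = {}
    for t, rows in tables.items():
        for r in rows:
            relevance[r] = relevance.get(r, 0.0) + s[t]
    return relevance

tables = {"t1": {"q1", "q2", "r3"}, "t2": {"q1", "r3", "r4"}, "t3": {"r5"}}
print(row_relevance(tables, {"q1", "q2"}))
# r3 appears in two query-overlapping tables, r5 only in a non-overlapping one.
```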
Ranking Criteria

Score(row r) =
 × graph relevance of r
 × importance of the columns C present in r (high if C functionally determines the others)
 × sum of cell extraction confidences, where each cell's confidence is a noisy-OR of the extraction confidences from the individual CRFs

(A toy sketch of this score follows the table below.)

 School                                                | Location                         | State           | Merged Row Confidence | Support
 New York Univ. OR New York University (0.90)          | New York City OR New York (0.98) | New York (0.95) | 0.85                  | 4
 University of Buffalo (0.88)                          | Buffalo (0.99)                   | New York (0.99) | 0.70                  | 2
 Cornell University (0.92)                             | Ithaca (0.95)                    | New York (0.99) | 0.76                  | 1
 University of Rochester OR Univ. of Rochester (0.80)  | Rochester (0.95)                 | New York (0.99) | 0.50                  | 2
 -                                                     | -                                | NY (0.99)       | 0.99                  | 9
 -                                                     | NYC (0.98)                       | New York (0.98) | 0.95                  | 7
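A minimal sketch of the score as a product of the three factors, under one reading of the last factor: each cell's confidence is a noisy-OR over the confidences that different source CRFs assign to it, and these cell confidences are summed over the row. The relevance and column-importance numbers in the example are made up.

```python
# Row score = graph relevance x column importance x sum of noisy-OR cell confidences.
def cell_confidence(per_source_confs):
    """Noisy-OR over the confidences different source CRFs give the same cell:
    probability that at least one of the extractions is correct."""
    p_all_wrong = 1.0
    for c in per_source_confs:
        p_all_wrong *= (1.0 - c)
    return 1.0 - p_all_wrong

def row_score(graph_relevance, column_importance, cells):
    """cells: one list of per-source confidences for every cell present in the row."""
    return graph_relevance * column_importance * sum(cell_confidence(c) for c in cells)

# e.g. a row whose three cells were each extracted from two sources:
print(row_score(0.6, 0.9, [[0.92, 0.7], [0.95, 0.4], [0.99, 0.5]]))
```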
Overall performance

[Charts: recall on all queries and on difficult queries.]

To justify the sophisticated consolidation and resolution, WWT is compared with:
 • Processing only the (magically known) single best list ⇒ no consolidation or resolution required.
 • Simple consolidation, with no merging of approximate duplicates.

WWT has > 55% recall and beats both alternatives; the gain is bigger for difficult queries.
Running time

 • < 30 seconds with 3 query records.
 • Can be improved by processing sources in parallel.
 • Variance is high because the time depends on the number of columns, record lengths, etc.
Related Work

 • Google Squared
   – Developed independently; launched in May 2009.
   – The user provides a keyword query, e.g. "list of Italian joints in Manhattan"; the schema is inferred.
   – Technical details are not public.
 • Prior methods for extraction and resolution
   – Assume labeled data or pre-trained parameters.
   – We generate labeled data and automatically train the resolver parameters from the list sources.
Summary

 • Structured web search and the role of non-text, partially structured web sources.
 • The WWT system
   – Domain-independent.
   – Online: structure interpretation at query time.
   – Relies heavily on unsupervised statistical learning:
     – Graphical model for table annotation
     – Soft approach for generating labeled data
     – Collective column labeling for descriptive queries
     – Bayesian network for resolution and consolidation
     – PageRank-style graph relevance + confidence from a probabilistic extractor for ranking
What next?

 • Designing plans for non-trivial ways of combining sources.
 • Better ranking and user-interaction models.
 • Expanding the query set
   – Aggregate queries: tables are rich in quantities.
   – Point queries: attribute-value and relationship queries.
 • Interplay between the semi-structured web and Ontologies
   – Augmenting one with the other.
   – Quantifying the information in structured sources vis-à-vis text sources on typical query workloads.