Transcript [Poster]

WebSets: Extracting Sets of Entities from the Web Using
Unsupervised Information Extraction
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, Carnegie Mellon University
Motivation
Application
WebSets Framework
 Many NLP tasks benefit from concept-instance pairs
 Hypothesis 1 : Entities appearing in a table column probably belong to the same concept.
 Hypothesis 2 : Frequent co-occurrence of a set of entities in multiple table columns and distinct web domains indicates that they represent some meaningful concept.
Summarization, Co-reference resolution,
Named entity extraction
 Existing knowledge bases (NELL, Freebase, …) are
incomplete.
 Detecting co-ordinate terms to find term clusters (i ~ j)
Entity-feature file
<Entities, table-columns,
domains>
HTML
Table Corpus
 Using hyponym patterns (“X such as Y”) to name the terms
Experiments
Relational Table
Identification
Hyponym
Concept
Dataset
Entity
Clusters
Bottom-up Entity
Clustering
Labeled
entity sets
<Entities,
hypernym>
Hypernym
Recommendation
 Datasets
Dataset            Description                            #HTML pages   #tables
Toy_Apple          Fruits + companies                     574           2.6K
Delicious_Sports   Links from Delicious w/ tag=sports     21K           146.3K
Delicious_Music    Links from Delicious w/ tag=music      183K          643.3K
CSEAL_Useful       Pages SEAL found NELL entities on      30K           322.8K
ASIA_NELL          ASIA run on NELL categories            112K          676.9K
ASIA_INT           ASIA run on intelligence domain        121K          621.3K
Clueweb_HPR        High pagerank sample of Clueweb        100K          586.9K
Table Identification

Features : #rows, #non-link columns, HTML tags, length(cells), recursive or not

% relational tables : increases from 15-30% to 70-85% after filtering
 Entity vs. Triplet record representation

O(N) triplet records created for tables of size O(N)

Can disambiguate different senses of entities : Toy_Apple dataset
Method    K    FM w/ Entity records   FM w/ Triplet records
WebSets   -    0.11 (K=25)            0.85 (K=34)
K-Means   30   0.09                   0.35
K-Means   25   0.08                   0.38
 Gold standard #clusters : Toy_Apple (27) and Delicious_Sports (29)
Instruments: Flute, Tuba , String

TableId=21 , domain=“wikipedia.org”
Country   Capital City
India     Delhi
China     Beijing
Canada    Ottawa
France    Paris

TableId=34 , domain=“aneki.com”
Country   Capital City
China     Beijing
Canada    Ottawa
France    Paris
England   London

Triplet records
Entities                  Table:Column   Domains
China, Canada, India      21:1           Wikipedia.org
Canada, China, France     21:1, 34:1     Wikipedia.org, aneki.com
Beijing, Delhi, Ottawa    21:2           Wikipedia.org
Beijing, Ottawa, Paris    21:2, 34:2     Wikipedia.org, aneki.com
Canada, England, France   34:1           aneki.com
London, Ottawa, Paris     34:2           aneki.com

Labeled entity sets
Entities                                Table:Column   Domains                    Hypernym
India, China, Canada, France, England   21:1, 34:1     Wikipedia.org, aneki.com   Country
Delhi, Beijing, Ottawa, London, Paris   21:2, 34:2     Wikipedia.org, aneki.com   City, Destinations
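
The two example tables (TableId=21 and TableId=34) and the triplet records derived from them can be reproduced with a short sketch. The poster does not state how triplets are formed; the sliding window over three consecutive rows, the alphabetical sorting inside each triplet, and the function name build_triplet_store are assumptions inferred from the example, not the authors' code.

```python
from collections import defaultdict


def build_triplet_store(tables):
    """Build triplet records from (table_id, domain, columns) tuples.

    Each column is a list of cell strings. Identical triplets found in
    different tables are merged into a single record.
    """
    store = defaultdict(lambda: {"table_columns": set(), "domains": set()})
    for table_id, domain, columns in tables:
        for col_idx, cells in enumerate(columns, start=1):
            # Assumption: one triplet per window of three consecutive rows,
            # which yields O(N) triplets for a column of N cells.
            for i in range(len(cells) - 2):
                key = tuple(sorted(cells[i:i + 3]))
                store[key]["table_columns"].add(f"{table_id}:{col_idx}")
                store[key]["domains"].add(domain)
    return dict(store)


tables = [
    (21, "wikipedia.org", [["India", "China", "Canada", "France"],
                           ["Delhi", "Beijing", "Ottawa", "Paris"]]),
    (34, "aneki.com", [["China", "Canada", "France", "England"],
                       ["Beijing", "Ottawa", "Paris", "London"]]),
]
store = build_triplet_store(tables)
for entities, info in sorted(store.items()):
    print(entities, sorted(info["table_columns"]), sorted(info["domains"]))
```

Sorting the entities inside each triplet is what lets the record "Canada, China, France" collect both 21:1 and 34:1, since the same three countries appear in consecutive rows of both tables.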
Intervals: Whole tone, Major sixth,
Genres: Smooth jazz, Gothic, Metal
International Organizations: United Nations Children Fund UNICEF, Southeast European Cooperative Initiative SECI, World Trade Organization WTO, Indian Ocean Commission INOC, Economic and Social Council ECOSOC, Caribbean Community and Common Market CARICOM, ….
Audio Equipments: Audio editor ,
Brazilian, Surinamese, Burkinabe, Barbadian, Cuban , ….
 Record/cluster : <entity+, tableColumn+, domain+>
 Clusters = { }
 For each triplet record t such that |t.domains| > threshold :
     For each existing cluster C, check whether
         t.entity overlaps with C.entity, OR
         t.tableColumn overlaps with C.tableColumn
     If the overlap is sufficient, add t to C
     If no existing cluster C matches t :
         Create a new cluster C’ = t
         Add C’ to Clusters
 Time complexity : O(N * log N)
 Table corpus : O(N) → Triplet store : O(N)
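
The clustering steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the overlap measure (shared items divided by the size of the smaller set), the default thresholds, and the names overlap and bottom_up_cluster are all assumptions, because the poster only says "sufficient overlap". The sketch also compares each record against every cluster, so it does not reproduce the O(N * log N) bound stated on the poster.

```python
def overlap(a, b):
    """Hypothetical overlap measure: shared items over the smaller set size."""
    return len(a & b) / min(len(a), len(b))


def bottom_up_cluster(records, domain_threshold=0, min_overlap=0.5):
    """Single pass over triplet records, merging each into the first match."""
    clusters = []
    for rec in records:
        if not len(rec["domains"]) > domain_threshold:
            continue  # skip records seen in too few distinct domains
        for c in clusters:
            if (overlap(rec["entities"], c["entities"]) >= min_overlap
                    or overlap(rec["table_columns"], c["table_columns"]) >= min_overlap):
                for key in ("entities", "table_columns", "domains"):
                    c[key] |= rec[key]
                break
        else:
            # No existing cluster matched: start a new one from this record.
            clusters.append({key: set(val) for key, val in rec.items()})
    return clusters


def record(entities, table_columns, domains):
    return {"entities": set(entities), "table_columns": set(table_columns),
            "domains": set(domains)}


records = [
    record(["China", "Canada", "India"], ["21:1"], ["wikipedia.org"]),
    record(["Canada", "China", "France"], ["21:1", "34:1"], ["wikipedia.org", "aneki.com"]),
    record(["Beijing", "Delhi", "Ottawa"], ["21:2"], ["wikipedia.org"]),
    record(["Beijing", "Ottawa", "Paris"], ["21:2", "34:2"], ["wikipedia.org", "aneki.com"]),
    record(["Canada", "England", "France"], ["34:1"], ["aneki.com"]),
    record(["London", "Ottawa", "Paris"], ["34:2"], ["aneki.com"]),
]
clusters = bottom_up_cluster(records)
for c in clusters:
    print(sorted(c["entities"]), sorted(c["table_columns"]))
```

On the six example triplet records this yields two clusters, one of countries and one of capital cities, matching the entity sets shown in the poster's example.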
arg1 such as (w+ (and/or))? arg2
arg1 (w+ )? (and/or) other arg2
arg1 include (w+ (and/or))? arg2
arg1 including (w+ (and/or))? arg2
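
A simplified sketch of how such patterns could be applied with regular expressions is shown below. It is an illustration under stated assumptions: arguments are restricted to single words, the optional "(w+ (and/or))?" part is omitted, and the names PATTERNS and extract_pairs are hypothetical. Real use would need noun-phrase chunking and plural normalization, which are not covered here.

```python
import re

# Each entry: (pattern, group index of the hypernym, group index of the hyponym).
# In "X such as Y" and "X include/including Y" the concept comes first;
# in "Y and/or other X" it comes last.
PATTERNS = [
    (re.compile(r"(\w+) such as (\w+)"), 1, 2),
    (re.compile(r"(\w+) (?:and|or) other (\w+)"), 2, 1),
    (re.compile(r"(\w+) (?:include|including) (\w+)"), 1, 2),
]


def extract_pairs(sentence):
    """Return (hyponym, hypernym) pairs matched by the simplified patterns."""
    pairs = []
    for pattern, hyper_idx, hypo_idx in PATTERNS:
        for match in pattern.finditer(sentence):
            pairs.append((match.group(hypo_idx), match.group(hyper_idx)))
    return pairs


print(extract_pairs("countries such as India"))
print(extract_pairs("Paris and other cities"))
print(extract_pairs("instruments including flute"))
```

Counting how often each (hyponym, hypernym) pair is extracted across a corpus gives the kind of hyponym-concept counts that the hypernym recommendation step relies on.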
 ClueWeb09 dataset : 500M-page sample of the Web
e.g. “Obama is president of USA” → (president of, Obama, USA)
 Evaluation of quality of entity sets produced
Dataset            Method    K    Purity   NMI    RI     FM
Toy_Apple          K-Means   40   0.96     0.71   0.98   0.41
Toy_Apple          WebSets   25   0.99     0.99   1.00   0.99
Delicious_Sports   K-Means   50   0.72     0.68   0.98   0.47
Delicious_Sports   WebSets   32   0.83     0.64   1.00   0.85

Dataset        #Triplets   #Clusters   #Clusters with hypernyms   %Meaningful clusters   MRR of hypernym   %Precision of labeled sets
CSEAL_Useful   165.2K      1090        312                        69.0                   0.56              98.6%
ASIA_NELL      11.4K       448         266                        73.0                   0.59              98.5%
ASIA_INT       15.1K       395         218                        63.0                   0.58              97.4%
Clueweb_HPR    516.0       47          34                         70.5                   0.56              99.0%
with entities in the cluster
Method   K     J     %Accuracy   Yield (#pairs produced)   #Correct pairs (predicted)
DPM      Inf   0.0   34.6        88.6K                     30.7K
DPM      5     0.2   50.0        0.8K                      0.4K
DPMExt   Inf   0.0   21.9        100,828.0K                22,081.3K
DPMExt   5     0.2   44.0        2.8K                      1.2K
WS       -     -     67.7        73.7K                     45.8K
WSExt    -     -     78.8        64.8K                     51.1K
rock, Rock, Pop, Hip hop, Rock n roll,
Country, Folk, Punk rock , ….
General midi synthesizer , Audio
recorder , Multichannel digital audio
workstation , Drum sequencer , Mixers
, Music engraving system , Audio
server , Mastering software ,
Soundfont sample player ….
Languages: Hebrew, Portuguese, Danish,
 Hearst patterns e.g. “X such as Y”
Method
Score(hypernym | cluster) ∝ co-occurrence counts of hypernym with entities in the cluster
Fifth, Perfect fifth, Seventh, Third,
Diminished fifth, Whole step , ….
Hyponym Concept Dataset
Dataset
 Hypernym Recommendation
Orchestra, Chimes, Harmonium,
Bassoon, Woodwinds, Glockenspiel,
French horn, Timpani, Piano, ….
Islamic Republic, Parliamentary Self Governing
Territory, Parliamentary Republic, Constitutional
Republic, Republic Presidential Multiparty System,
….
 Noun-pair context dataset
 Number of clusters is unknown
Method (K) : DPM (Inf), DPM (5), DPMExt (Inf), DPMExt (5), WS (-), WSExt (-)
Religions: Buddhism, Christianity, Islam, Sikhism,
Bottom-Up Clustering Algorithm
 Bottom-up clustering
Delicious_Sports
Music Domain
Government: Monarchy, Limited Democracy,
 We worked on the problem of automatically harvesting concept-instance pairs from a corpus of HTML tables.
Description
Intelligence Domain
Taoism, Zoroastrianism, Jainism, Bahai, Judaism,
Hinduism, Confucianism , .…
 Problem can be divided into :
Dataset
Corpus Summary (columns) : Dataset, #Triplets, #Clusters, #Clusters with hypernyms, %Meaningful clusters, MRR of hypernym, %Precision of labeled sets
Acknowledgements
This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force
Research Laboratory (AFRL) contract number FA8650-10-C-7058.
Hyponym   Concept:count
USA       Country:1000
Paris     City:450, destination:100
Monkey    Animal:100, mammal:30
Sparrow   Bird:40
 X and Y form a (hyponym, hypernym) pair when the context is a Hearst pattern
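
One simple reading of the hypernym recommendation step is to sum, for each candidate concept, its co-occurrence counts with the entities of a cluster and rank concepts by that total. The sketch below follows that reading using the example counts from the hyponym-concept dataset; the summation rule, the top_k cutoff, and the name recommend_hypernyms are assumptions and may differ from the scoring actually used by WebSets.

```python
from collections import Counter

# Example rows from the hyponym-concept dataset shown on the poster.
HYPONYM_CONCEPTS = {
    "USA": {"Country": 1000},
    "Paris": {"City": 450, "destination": 100},
    "Monkey": {"Animal": 100, "mammal": 30},
    "Sparrow": {"Bird": 40},
}


def recommend_hypernyms(cluster, dataset, top_k=2):
    """Rank candidate hypernyms by summed co-occurrence with cluster entities."""
    scores = Counter()
    for entity in cluster:
        # Entities missing from the dataset contribute nothing.
        for concept, count in dataset.get(entity, {}).items():
            scores[concept] += count
    return scores.most_common(top_k)


print(recommend_hypernyms(["Paris"], HYPONYM_CONCEPTS))
print(recommend_hypernyms(["Monkey", "Sparrow"], HYPONYM_CONCEPTS))
```

For the single-entity cluster containing Paris, the ranking is City followed by destination, which lines up with the "City, Destinations" label in the poster's example.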
Conclusions
 We propose an unsupervised IE technique to extract concept-instance pairs from an HTML corpus. It is novel in that it relies solely on HTML tables to detect coordinate terms.
 Our triplet-based data representation helps disambiguate multiple senses of the same noun phrase.
 The WebSets approach is corpus-driven, efficient, and scalable. We presented a method that takes O(N * log N) time to process HTML tables of size O(N) and extract named entity sets from them.
 Labeled entity sets produced by WebSets can act as a summary of an HTML corpus.
 The class-instance pairs thus produced are also being used to populate an existing knowledge base (NELL).
 A future research direction is to extend this method to unsupervised relation extraction.