icdm-2007.ppt


Language-Independent Set Expansion of Named Entities using the Web
Richard C. Wang & William W. Cohen
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
Language-Independent Set Expansion
Richard C. Wang
Outline
- Introduction
- System Architecture
  - Fetcher
  - Extractor
  - Ranker
- Evaluation
- Conclusion

Language Technologies Institute, Carnegie Mellon University
What is Set Expansion?

- For example,
  - Given a query: {“spit”, “boogers”, “ear wax”}
  - Answer is: {“puke”, “toe jam”, “sweat”, ....}
- More formally,
  - Given a small number of seeds: x_1, x_2, …, x_k where each x_i ∈ S_t
  - Answer is a listing of other probable elements: e_1, e_2, …, e_n where each e_i ∈ S_t
- A well-known example of a web-based set expansion system is Google Sets™
  - http://labs.google.com/sets
What is it used for?

- Derive features for…
  - Named Entity Recognition (Settles, 2004) (Talukdar, 2006)
    - Expand true named entities in training set
    - Utilize expanded names to assign features to words
  - Concept Learning (Cohen, 2000)
    - Given a set of instances, look in web pages for tables or lists that contain some of those instances
    - Automatically extract features from those pages
    - Define features over the instances found
  - Relation Learning (Cafarella et al., 2005) (Etzioni et al., 2005)
    - Extract items from tables or lists that contain given seeds
    - Utilize extracted items and their contexts for learning relations
Our Set Expander: SEAL
Set Expander for Any Language

- Features
  - Independent of human/markup language
    - Support seeds in English, Chinese, Japanese, Korean, ...
    - Accept documents in HTML, XML, SGML, TeX, WikiML, …
  - Does not require pre-annotated training data
    - Utilize readily-available corpus: World Wide Web
    - Learns wrappers on the fly
- Based on two research contributions
  1. Automatic construction of wrappers
     - Extracts “lists” of entities on semi-structured web pages
  2. Use of random graph walk
     - Ranks extracted entities so that those most likely to be in the target set are ranked higher
System Architecture
1. Canon
2. Nikon
3. Olympus
4. Pentax
5. Sony
6. Kodak
7. Minolta
8. Panasonic
9. Casio
10. Leica
11. Fuji
12. Samsung
13. …
Fetcher: download web pages from the Web
Extractor: learn wrappers from web pages
Ranker: rank entities extracted by wrappers
The Fetcher
Procedure:
1. Compose a search query using all seeds
2. Use the Google API to request the top N URLs
   - We use N = 100, 200, and 300 for evaluation
3. Fetch URLs by using a crawler
4. Send fetched documents to the Extractor
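The procedure above can be sketched roughly as follows. This is only an illustration: the slides do not say how the query is composed, so quoting each seed as an exact phrase is an assumption, and `compose_query`, `fetch_documents`, `fake_search`, and `fake_download` are hypothetical names. The search and download steps are passed in as functions so that no real search API is implied.

```python
from typing import Callable, Dict, Iterable, List


def compose_query(seeds: Iterable[str]) -> str:
    """Join all seeds into one query string.

    Assumption: each seed is quoted as an exact phrase; the slides only
    say that the query uses all seeds.
    """
    return " ".join(f'"{seed}"' for seed in seeds)


def fetch_documents(
    seeds: Iterable[str],
    search: Callable[[str, int], List[str]],
    download: Callable[[str], str],
    top_n: int = 100,
) -> Dict[str, str]:
    """Return {url: page text} for the top N search results."""
    query = compose_query(seeds)
    documents = {}
    for url in search(query, top_n)[:top_n]:
        try:
            documents[url] = download(url)
        except OSError:
            # Skip pages that fail to download.
            continue
    return documents


# Stand-in search and download functions so the sketch runs offline.
def fake_search(query: str, n: int) -> List[str]:
    return ["http://example.com/a", "http://example.com/b"][:n]


def fake_download(url: str) -> str:
    return f"<html>page at {url}</html>"


docs = fetch_documents(["ford", "nissan", "toyota"], fake_search, fake_download)
```

The resulting `docs` mapping is what would be handed to the Extractor, one document at a time.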
The Extractor

- Learn wrappers from web documents and seeds on the fly
  - Utilize semi-structured documents
  - Wrappers defined at character level
    - No tokenization required; thus language independent
    - However, very specific; thus page-dependent
      - Wrappers derived from document d are applied to d only
<li class="ford"><a href="http://www.curryauto.com/">
…
<li class="honda"><a href="http://www.curryauto.com/">
…
<li class="toyota"><a href="http://www.geisauto.com/">
…
<li class="acura"><a href="http://www.curryauto.com/">
…
<li class="nissan"><a href="http://www.curryauto.com/">
…
<li class="toyota"><a href="http://www.curryauto.com/">
…
- Extractor E1 finds maximally long contexts that bracket all instances of every seed
- It seems to be working… but what if I add one more instance of “toyota”?
- It seems to be working too… but how about a more complex example?
<li class="ford"><a href="http://www.curryauto.com/">
<img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a>
<ul><li class="last"><a href="http://www.curryauto.com/">
<span class="dName">Curry Ford</span>...</li></ul>
</li>
<li class="honda"><a href="http://www.curryauto.com/">
<img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a>
<ul><li><a href="http://www.curryhonda-ga.com/">
<span class="dName">Curry Honda Atlanta</span>...</li>
<li><a href="http://www.curryhondamass.com/">
<span class="dName">Curry Honda</span>...</li>
<li class="last"><a href="http://www.curryhondany.com/">
<span class="dName">Curry Honda Yorktown</span>...</li></ul>
</li>
<li class="acura"><a href="http://www.curryauto.com/">
<img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a>
<ul><li class="last"><a href="http://www.curryacura.com/">
<span class="dName">Curry Acura</span>...</li></ul>
</li>
<li class="nissan"><a href="http://www.curryauto.com/">
<img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a>
<ul><li class="last"><a href="http://www.geisauto.com/">
<span class="dName">Curry Nissan</span>...</li></ul>
</li>
<li class="toyota"><a href="http://www.curryauto.com/">
<img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a>
<ul><li class="last"><a href="http://www.geisauto.com/toyota/">
<span class="dName">Curry Toyota</span>...</li></ul>
</li>
- Can you find common contexts that bracket all instances of every seed?
- I guess not! Let’s try out Extractor E2 and see if it works…
- Extractor E2 finds maximally-long contexts that bracket at least one instance of every seed
- Hooray! It seems like Extractor E2 works! But how do we get rid of those noisy entity mentions?
- Noisy entity mentions (callouts in the page): “I am a noisy entity mention”, “Me too!”
Extractor: Summary

- A wrapper consists of a pair of left (L) and right (R) context strings
  - All strings between (but not containing) L and R are extracted
    - Referred to as “candidate entity mention”
- We compared two versions of wrapper:
  - Maximally-long contextual strings that bracket…
    1. all instances of every seed (Extractor E1)
    2. at least one instance of every seed (Extractor E2)
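To make the two wrapper definitions concrete, here is a minimal character-level sketch. It is not the SEAL implementation: `learn_e1`, `learn_e2`, and `apply_wrapper` are hypothetical names, E2 is done by brute force over one occurrence per seed (an assumption made for brevity), and the three-line document is made up in the style of the car-dealer page.

```python
from itertools import product


def common_prefix(strings):
    """Longest string that every input starts with."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    for i, ch in enumerate(shortest):
        if any(s[i] != ch for s in strings):
            return shortest[:i]
    return shortest


def common_suffix(strings):
    """Longest string that every input ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def occurrences(doc, seed):
    """Start offsets of every occurrence of seed in doc."""
    found, pos = [], doc.find(seed)
    while pos != -1:
        found.append(pos)
        pos = doc.find(seed, pos + 1)
    return found


def contexts(doc, spans):
    """Maximally long (L, R) shared by the given (start, end) spans."""
    left = common_suffix([doc[:start] for start, _ in spans])
    right = common_prefix([doc[end:] for _, end in spans])
    return left, right


def learn_e1(doc, seeds):
    """E1: one wrapper that brackets all instances of every seed."""
    spans = [(p, p + len(s)) for s in seeds for p in occurrences(doc, s)]
    return contexts(doc, spans)


def learn_e2(doc, seeds):
    """E2: wrappers that bracket at least one instance of every seed."""
    per_seed = [[(p, p + len(s)) for p in occurrences(doc, s)] for s in seeds]
    wrappers = set()
    for combo in product(*per_seed):  # one instance per seed (brute force)
        left, right = contexts(doc, combo)
        if left and right:
            wrappers.add((left, right))
    return wrappers


def apply_wrapper(doc, left, right):
    """Extract every string between L and R that contains neither."""
    mentions = []
    if not left or not right:
        return mentions
    pos = doc.find(left)
    while pos != -1:
        start = pos + len(left)
        end = doc.find(right, start)
        if end == -1:
            break
        if left not in doc[start:end]:
            mentions.append(doc[start:end])
        pos = doc.find(left, start)
    return mentions


# Made-up page: each seed occurs twice (class attribute and logo path).
doc = (
    '<li class="ford"><img src="/logos/ford.gif"><span>Curry Ford</span></li>\n'
    '<li class="honda"><img src="/logos/honda.gif"><span>Curry Honda</span></li>\n'
    '<li class="toyota"><img src="/logos/toyota.gif"><span>Curry Toyota</span></li>\n'
)
seeds = ["ford", "toyota"]
e1 = learn_e1(doc, seeds)  # no context is shared by all four occurrences
e2 = learn_e2(doc, seeds)  # contexts shared by one occurrence per seed
```

On this page E1 finds no usable context because the two occurrences of each seed sit in different surroundings, while E2 still yields wrappers such as the pair `<li class="` and `"><img src="/logos/`.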
The Ranker

- Rank candidate entity mentions based on “similarity” to seeds
  - Noisy mentions should be ranked lower
- We compare two methods for ranking
  1. Extracted Frequency (EF)
     - # of times an entity mention is extracted
  2. Random Graph Walk (GW)
     - Probability of an “entity mention” node being reached in a graph (explained in next slide)
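Extracted Frequency is simple enough to sketch in a few lines. The function name and the sample extractions below are made up for illustration; the only idea taken from the slide is counting how many times each mention is extracted.

```python
from collections import Counter


def rank_by_extracted_frequency(extractions):
    """Rank mentions by how many times they were extracted.

    `extractions` holds one list of mentions per wrapper application.
    Returns (mention, count) pairs, most frequent first.
    """
    counts = Counter(m for mentions in extractions for m in mentions)
    return counts.most_common()


# Made-up output of three wrappers.
ranked = rank_by_extracted_frequency(
    [["honda", "acura"], ["honda"], ["honda", "volvo chicago"]]
)
```

A mention that many wrappers agree on rises to the top, but a noisy mention extracted equally often cannot be told apart from a good one, which is the gap the graph walk is meant to address.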
Building a Graph
[Figure: example graph. A seeds node (“ford”, “nissan”, “toyota”) is linked by "find" edges to the documents northpointcars.com and curryauto.com; documents are linked by "derive" edges to Wrappers #1 to #4; wrappers are linked by "extract" edges to mentions, each labeled with a percentage: “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo chicago” 8.4%, “bmw pittsburgh” 8.4%]
- A graph consists of a fixed set of…
  - Node Types: {seeds, document, wrapper, mention}
  - Labeled Directed Edges: {find, derive, extract}
    - Each edge asserts that a binary relation r holds
    - Each edge has an inverse relation r^-1 (graph is cyclic)
Minkov et al. Contextual Search and Name Disambiguation in Email using Graphs. SIGIR 2006
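One way to hold such a graph in memory is a nested mapping from node to relation to target nodes, adding the inverse edge whenever an edge is added. This is a sketch under assumptions: `add_edge` and `build_graph` are hypothetical names, the `^-1` suffix is just a labeling convention, and the wiring of wrappers to mentions below is invented because the slide text does not preserve it.

```python
from collections import defaultdict


def add_edge(graph, source, relation, target):
    """Add a labeled edge and its inverse, so every edge can be walked back."""
    graph[source][relation].append(target)
    graph[target][relation + "^-1"].append(source)


def build_graph(seeds_node, extractions):
    """Build the seeds -> document -> wrapper -> mention graph.

    `extractions` maps document -> wrapper -> list of extracted mentions.
    """
    graph = defaultdict(lambda: defaultdict(list))
    for doc, wrappers in extractions.items():
        add_edge(graph, seeds_node, "find", doc)
        for wrapper, mentions in wrappers.items():
            add_edge(graph, doc, "derive", wrapper)
            for mention in mentions:
                add_edge(graph, wrapper, "extract", mention)
    return graph


# Invented wiring, reusing node names from the figure.
extractions = {
    "curryauto.com": {"wrapper #1": ["honda", "acura"], "wrapper #2": ["honda"]},
    "northpointcars.com": {"wrapper #3": ["chevrolet", "honda"]},
}
graph = build_graph("seeds", extractions)
```

Because every edge has an inverse, a walker that reaches a mention can step back to a wrapper and on to other mentions, which is what makes the graph cyclic.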
Random Graph Walk
Legend
- Node: x, y, z (e.g. “curryauto.com”, “wrapper #1”, “honda”, “acura”, ...)
- Edge Relation: r (find, find^-1, derive, derive^-1, extract, extract^-1)
- An edge from x to y with relation r: x --r--> y
- Stop Probability: λ, the probability of staying at a node (0.5)
Recursive computation of probability, with terms for:
- Probability of reaching any node z from x
- Probability of staying at node x
- Probability of continuing to node z from x
- Probability of picking an edge relation r given a source node x
- Probability of picking a target node y given an edge relation r and source node x
where I(x = z) = 1 if x = z, 0 otherwise
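The equation itself did not survive in this transcript, but the term descriptions are consistent with a recursion of the form Q(z|x) = λ·I(x = z) + (1 − λ)·Σ_r P(r|x) Σ_y P(y|r, x)·Q(z|y). Treat that form, the symbol Q, and the uniform choices for P(r|x) and P(y|r, x) as assumptions. Under them, the sketch below approximates Q(z|start) for every z by pushing probability mass forward for a fixed number of steps; `graph_walk` and the tiny graph are hypothetical.

```python
from collections import defaultdict


def graph_walk(edges, start, stop_prob=0.5, steps=10):
    """Approximate the probability of the walk stopping at each node.

    `edges` maps node -> relation -> list of target nodes.
    Assumptions: relations are picked uniformly given the source node,
    and targets are picked uniformly given the relation and source node.
    """
    reached = defaultdict(float)
    mass = {start: 1.0}
    for _ in range(steps):
        next_mass = defaultdict(float)
        for node, m in mass.items():
            relations = edges.get(node, {})
            if not relations:
                reached[node] += m  # dead end: the walker can only stop
                continue
            reached[node] += stop_prob * m  # stay at this node
            share = (1 - stop_prob) * m / len(relations)  # continue
            for targets in relations.values():
                for target in targets:
                    next_mass[target] += share / len(targets)
        mass = next_mass
    return dict(reached)


# Tiny made-up graph with inverse edges; "honda" is extracted by both wrappers.
graph = {
    "seeds": {"find": ["doc"]},
    "doc": {"find^-1": ["seeds"], "derive": ["w1", "w2"]},
    "w1": {"derive^-1": ["doc"], "extract": ["honda", "acura"]},
    "w2": {"derive^-1": ["doc"], "extract": ["honda"]},
    "honda": {"extract^-1": ["w1", "w2"]},
    "acura": {"extract^-1": ["w1"]},
}
scores = graph_walk(graph, "seeds")
```

Mentions would then be ranked by their score. The mass still in flight after the last step is dropped, so with λ = 0.5 and ten steps the scores sum to slightly less than 1.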
Evaluation Datasets
Evaluation Method

- Mean Average Precision
  - Commonly used for evaluating ranked lists in IR
  - Contains recall and precision-oriented aspects
  - Sensitive to the entire ranking
  - Mean of average precisions for each ranked list
- Average precision of a ranked list is defined in terms of:
  - L = ranked list of extracted mentions, r = rank
  - Prec(r) = precision at rank r
  - NewEntity(r) = 1 if (a) and (b) are true, 0 otherwise
    (a) Extracted mention at r matches any true mention
    (b) There exists no other extracted mention at rank less than r that is of the same entity as the one at r
  - # True Entities = total number of true entities in this dataset
- Evaluation Procedure (per dataset)
  1. Randomly select three true entities and use their first listed mentions as seeds
  2. Expand the three seeds obtained from step 1
  3. Repeat steps 1 and 2 five times
  4. Compute MAP for the five ranked lists
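The terms above suggest the usual average-precision form, AvgPrec(L) = Σ_r Prec(r) × NewEntity(r) / # True Entities, summed over ranks r = 1..|L|. The formula is not preserved in this transcript, so that form is an assumption, as is reading Prec(r) as the fraction of the top r mentions that match any true mention. The function names and sample data below are made up.

```python
def average_precision(ranked_mentions, mention_to_entity, num_true_entities):
    """Average precision of one ranked list.

    `mention_to_entity` maps each true mention to its entity; mentions
    missing from it are treated as incorrect.
    """
    seen_entities = set()
    matches = 0
    total = 0.0
    for rank, mention in enumerate(ranked_mentions, start=1):
        entity = mention_to_entity.get(mention)
        if entity is None:
            continue  # condition (a) fails: not a true mention
        matches += 1
        if entity not in seen_entities:  # condition (b): entity not seen earlier
            seen_entities.add(entity)
            total += matches / rank  # Prec(r) * NewEntity(r)
    return total / num_true_entities


def mean_average_precision(ranked_lists, mention_to_entity, num_true_entities):
    """Mean of the average precisions of several ranked lists."""
    scores = [
        average_precision(ranked, mention_to_entity, num_true_entities)
        for ranked in ranked_lists
    ]
    return sum(scores) / len(scores)


# Made-up data: two entities, one of which has two mentions.
truth = {"honda": "Honda", "honda motor": "Honda", "acura": "Acura"}
run_a = ["honda", "honda motor", "bmw pittsburgh", "acura"]
run_b = ["acura", "honda"]
ap_a = average_precision(run_a, truth, num_true_entities=2)
map_score = mean_average_precision([run_a, run_b], truth, num_true_entities=2)
```

For `run_a`, ranks 1 and 4 introduce new entities with precisions 1/1 and 3/4, so the average precision is (1 + 0.75) / 2 = 0.875; `run_b` scores 1.0, giving a MAP of 0.9375 over the two lists.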
Experimental Results
Overall MAP vs. Various Methods
[Figure: bar charts of overall MAP (%) by method. Methods shown: G.Sets, G.Sets (Eng), E1+EF+100, E2+EF+100, E2+GW+100, E2+GW+200, E2+GW+300. Bar values shown: 14.59%, 43.76%, 82.39%, 87.61%, 93.13%, 94.03%, 94.18%]
Legend
[Extractor] + [Ranker] + [Top N URLs]
Extractor = { E1: Extractor E1, E2: Extractor E2 }
Ranker = { EF: Extracted Frequency, GW: Graph Walk }
N = { 100, 200, 300 }
Conclusion & Future Work

- Conclusion
  - Unsupervised approach for expanding sets of named entities
    - Domain and language independent
  - SEAL performs better than Google Sets
    - Higher Mean Average Precision on our datasets
    - Handles not only English, but also Chinese and Japanese
- Future Work
  - Learn from graphs to re-rank extracted mentions
  - Bootstrap named entities by using extracted mentions in previous expansion as seeds
  - Identify possible class names for expanded sets
    - e.g. car makers, constellations, presidents…
Top three mentions are the seeds
Try it out at http://rcwang.com/seal