Language-Independent
Set Expansion of Named
Entities using the Web
Richard C. Wang & William W. Cohen
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
Language-Independent Set Expansion
Richard C. Wang
Outline
Introduction
System Architecture
Fetcher
Extractor
Ranker
Evaluation
Conclusion
Language Technologies Institute, Carnegie Mellon University
What is Set Expansion?
For example,
Given a query: {“spit”, “boogers”, “ear wax”}
Answer is: {“puke”, “toe jam”, “sweat”, ...}
More formally,
Given a small number of seeds: $x_1, x_2, \dots, x_k$ where each $x_i \in S_t$
Answer is a listing of other probable elements: $e_1, e_2, \dots, e_n$ where each $e_i \in S_t$
A well-known example of a web-based set
expansion system is Google Sets™
http://labs.google.com/sets
What is it used for?
Derive features for…
Named Entity Recognition (Settles, 2004) (Talukdar, 2006)
Expand true named entities in training set
Utilize expanded names to assign features to words
Concept Learning (Cohen, 2000)
Given a set of instances, look in web pages for tables or lists
that contain some of those instances
Automatically extract features from those pages
Define features over the instances found
Relation Learning (Cafarella et al, 2005) (Etzioni et al, 2005)
Extract items from tables or lists that contain given seeds
Utilize extracted items and their contexts for learning
relations
Our Set Expander: SEAL
Set Expander for Any Language
Features
Independent of human/markup language
Support seeds in English, Chinese, Japanese, Korean, ...
Accept documents in HTML, XML, SGML, TeX, WikiML, …
Does not require pre-annotated training data
Utilize readily-available corpus: World Wide Web
Learns wrappers on the fly
Based on two research contributions
1. Automatic construction of wrappers
Extracts “lists” of entities on semi-structured web pages
2. Use of random graph walk
Ranks extracted entities so that those most likely to be in the target set are ranked higher
System Architecture
Example expansion (ranked list):
1. Canon
2. Nikon
3. Olympus
4. Pentax
5. Sony
6. Kodak
7. Minolta
8. Panasonic
9. Casio
10. Leica
11. Fuji
12. Samsung
13. …
Fetcher: download web pages from the Web
Extractor: learn wrappers from web pages
Ranker: rank entities extracted by wrappers
The Fetcher
Procedure:
1. Compose a search query using all seeds
2. Use Google API to request the top N URLs
We use N = 100, 200, and 300 for evaluation
3. Fetch URLs by using a crawler
4. Send fetched documents to the Extractor
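The four fetcher steps can be sketched roughly as follows. This is a minimal illustration, not SEAL's actual code: the slides do not say how the query is composed or which API calls are made, so quoting each seed is an assumption, and `search` and `download` are hypothetical callables supplied by the caller (standing in for the search API and the crawler).

```python
from typing import Callable, Dict, List


def compose_query(seeds: List[str]) -> str:
    """Step 1: build one query containing all seeds (quoting is an assumption)."""
    return " ".join(f'"{seed}"' for seed in seeds)


def fetch_documents(
    seeds: List[str],
    search: Callable[[str, int], List[str]],  # hypothetical: (query, n) -> URLs
    download: Callable[[str], str],           # hypothetical: URL -> page text
    top_n: int = 100,
) -> Dict[str, str]:
    """Steps 2-4: request the top N URLs, fetch them, return url -> document."""
    query = compose_query(seeds)
    urls = search(query, top_n)[:top_n]
    documents: Dict[str, str] = {}
    for url in urls:
        try:
            documents[url] = download(url)
        except OSError:
            # Skip pages that cannot be fetched; the rest still go to the Extractor.
            continue
    return documents
```

Passing the search and download functions in as parameters keeps the sketch independent of any particular search service.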
The Extractor
Learn wrappers from web documents
and seeds on the fly
Utilize semi-structured documents
Wrappers defined at character level
No tokenization required; thus language independent
However, very specific; thus page-dependent
Wrappers derived from document d are applied to d only
Extractor E1 finds maximally-long contexts that bracket all instances of every seed
<li class="ford"><a href="http://www.curryauto.com/">
…
<li class="honda"><a href="http://www.curryauto.com/">
…
<li class="toyota"><a href="http://www.geisauto.com/">
…
<li class="acura"><a href="http://www.curryauto.com/">
…
<li class="nissan"><a href="http://www.curryauto.com/">
…
<li class="toyota"><a href="http://www.curryauto.com/">
…
It seems to be working… but what if I add one more instance of “toyota”?
It seems to be working too… but how about a more complex example?
<li class="ford"><a href="http://www.curryauto.com/">
<img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a>
<ul><li class="last"><a href="http://www.curryauto.com/">
<span class="dName">Curry Ford</span>...</li></ul>
</li>
<li class="honda"><a href="http://www.curryauto.com/">
<img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a>
<ul><li><a href="http://www.curryhonda-ga.com/">
<span class="dName">Curry Honda Atlanta</span>...</li>
<li><a href="http://www.curryhondamass.com/">
<span class="dName">Curry Honda</span>...</li>
<li class="last"><a href="http://www.curryhondany.com/">
<span class="dName">Curry Honda Yorktown</span>...</li></ul>
</li>
<li class="acura"><a href="http://www.curryauto.com/">
<img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a>
<ul><li class="last"><a href="http://www.curryacura.com/">
<span class="dName">Curry Acura</span>...</li></ul>
</li>
<li class="nissan"><a href="http://www.curryauto.com/">
<img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a>
<ul><li class="last"><a href="http://www.geisauto.com/">
<span class="dName">Curry Nissan</span>...</li></ul>
</li>
<li class="toyota"><a href="http://www.curryauto.com/">
<img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a>
<ul><li class="last"><a href="http://www.geisauto.com/toyota/">
<span class="dName">Curry Toyota</span>...</li></ul>
</li>
Callouts on the slide:
Can you find common contexts that bracket all instances of every seed?
I guess not! Let’s try out Extractor E2 and see if it works…
Extractor E2 finds maximally-long contexts that bracket at least one instance of every seed
Hooray! It seems like Extractor E2 works! But how do we get rid of those noisy entity mentions?
“I am a noisy entity mention” / “Me too!”
Extractor: Summary
A wrapper consists of a pair of left (L) and right (R) context strings
All strings between (but not containing) L and R are extracted
Referred to as “candidate entity mentions”
We compared two versions of wrapper:
Maximally-long contextual strings that bracket…
1. all instances of every seed (Extractor E1)
2. at least one instance of every seed (Extractor E2)
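The difference between E1 and E2 can be illustrated with a small character-level sketch. It is a brute-force simplification under stated assumptions, not the algorithm used in SEAL: the function names are hypothetical, matching is case-sensitive, and the E2 variant simply tries every way of picking one occurrence per seed and keeps the context pairs that are non-empty on both sides.

```python
import itertools
import os
import re


def occurrences(doc, seed):
    """Start offsets of every occurrence of seed in doc."""
    return [m.start() for m in re.finditer(re.escape(seed), doc)]


def common_suffix(strings):
    """Longest common suffix, via the common prefix of the reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def wrapper_e1(doc, seeds):
    """E1: contexts must bracket ALL instances of every seed."""
    lefts, rights = [], []
    for seed in seeds:
        for pos in occurrences(doc, seed):
            lefts.append(doc[:pos])
            rights.append(doc[pos + len(seed):])
    return common_suffix(lefts), os.path.commonprefix(rights)


def wrappers_e2(doc, seeds):
    """E2: contexts must bracket AT LEAST ONE instance of every seed."""
    per_seed = [occurrences(doc, seed) for seed in seeds]
    found = set()
    for combo in itertools.product(*per_seed):  # one occurrence per seed
        lefts = [doc[:pos] for pos in combo]
        rights = [doc[pos + len(s):] for pos, s in zip(combo, seeds)]
        left, right = common_suffix(lefts), os.path.commonprefix(rights)
        if left and right:
            found.add((left, right))
    return found


def extract(doc, left, right):
    """All strings between (but not containing) left and right."""
    pattern = re.escape(left) + "(.+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, doc, flags=re.DOTALL)


doc = (
    '<li class="ford"><a href="/f">Ford</a></li>\n'
    '<li class="honda"><a href="/h">Honda</a></li>\n'
    '<li class="acura"><a href="/a">Acura</a></li>\n'
    '<img src="/logos/ford/logo.gif">\n'
)
seeds = ["ford", "honda"]
```

On this toy page the extra "ford" inside the image path leaves E1 with empty contexts, while E2 still finds a usable wrapper around the `class` attribute and applying it with `extract` also yields "acura".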
The Ranker
Rank candidate entity mentions based
on “similarity” to seeds
Noisy mentions should be ranked lower
We compare two methods for ranking
1. Extracted Frequency (EF)
# of times an entity mention is extracted
2. Random Graph Walk (GW)
Probability of an “entity mention” node being reached in a graph (explained in next slide)
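As a rough illustration of the EF baseline, the sketch below counts how often each mention is extracted across wrappers and sorts by that count. The function name and the input shape (one list of mentions per wrapper) are assumptions made for the example.

```python
from collections import Counter


def rank_by_extracted_frequency(extractions):
    """Rank mentions by how many times they were extracted.

    extractions: one list of mentions per wrapper (assumed input shape).
    Returns (mention, count) pairs, most frequent first.
    """
    counts = Counter(m for mentions in extractions for m in mentions)
    return counts.most_common()
```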
Building a Graph
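[Figure: example graph. The seeds node (“ford”, “nissan”, “toyota”) is linked by find edges to the documents northpointcars.com and curryauto.com; documents are linked by derive edges to Wrappers #1 to #4; wrappers are linked by extract edges to mentions. Mention scores shown: “chevrolet” 22.5%, “honda” 26.1%, “acura” 34.6%, “volvo chicago” 8.4%, “bmw pittsburgh” 8.4%]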
A graph consists of a fixed set of…
Node Types: {seeds, document, wrapper, mention}
Labeled Directed Edges: {find, derive, extract}
Each edge asserts that a binary relation r holds
Each edge has an inverse relation $r^{-1}$ (graph is cyclic)
Minkov et al. Contextual Search and Name Disambiguation in Email using Graphs. SIGIR 2006
Random Graph Walk
Legend
Node: x, y, z (e.g. “curryauto.com”, “wrapper #1”, “honda”, “acura”, ...)
Edge Relation: r (e.g. find, find$^{-1}$, derive, derive$^{-1}$, extract, extract$^{-1}$)
An edge from x to y with relation r: $x \xrightarrow{r} y$
Recursive computation of probability
$$Q(x \to z) = \lambda \, I(x = z) + (1 - \lambda) \sum_{r} P(r \mid x) \sum_{y} P(y \mid r, x) \, Q(y \to z)$$
where
$Q(x \to z)$ = probability of reaching any node z from x
$\lambda$ = stop probability: probability of staying at a node (0.5)
$\lambda \, I(x = z)$ = probability of staying at node x
$(1 - \lambda) \sum_{r} \dots$ = probability of continuing to node z from x
$P(r \mid x)$ = probability of picking an edge relation r given a source node x
$P(y \mid r, x)$ = probability of picking a target node y given an edge relation r and source node x
$I(x = z)$ = 1 if x = z, 0 otherwise
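A lazy walk of this kind can be approximated by pushing probability mass forward from the seeds node for a bounded number of steps. The sketch below is only an illustration under assumptions the slide does not state: relations and targets are picked uniformly, the walk is truncated after `max_steps`, and the graph layout and helper names are hypothetical.

```python
from collections import defaultdict


def new_graph():
    """Graph as node -> relation -> list of target nodes."""
    return defaultdict(lambda: defaultdict(list))


def add_edge(graph, x, relation, y):
    """Add the edge x --relation--> y together with its inverse edge."""
    graph[x][relation].append(y)
    graph[y][relation + "^-1"].append(x)


def lazy_walk(graph, start, stop_prob=0.5, max_steps=10):
    """Approximate the probability of the walk stopping at each node."""
    reached = defaultdict(float)
    mass = {start: 1.0}
    for _ in range(max_steps):
        next_mass = defaultdict(float)
        for x, p in mass.items():
            reached[x] += stop_prob * p           # walker stays at x
            relations = graph[x]
            for targets in relations.values():    # uniform choice of relation
                share = (1 - stop_prob) * p / len(relations)
                for y in targets:                 # uniform choice of target
                    next_mass[y] += share / len(targets)
        mass = next_mass
    return dict(reached)


graph = new_graph()
add_edge(graph, "seeds", "find", "doc1")
add_edge(graph, "seeds", "find", "doc2")
add_edge(graph, "doc1", "derive", "wrapper1")
add_edge(graph, "doc2", "derive", "wrapper2")
add_edge(graph, "wrapper1", "extract", "honda")
add_edge(graph, "wrapper1", "extract", "acura")
add_edge(graph, "wrapper2", "extract", "honda")
scores = lazy_walk(graph, "seeds")
```

In this toy graph "honda" is extracted by both wrappers and "acura" by only one, so "honda" ends up with the higher score, which is the behaviour the ranker relies on.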
Evaluation Datasets
Evaluation Method
Mean Average Precision
Commonly used for evaluating ranked lists in IR
Contains recall and precision-oriented aspects
Sensitive to the entire ranking
Mean of average precisions for each ranked list
$$\mathrm{AvgPrec}(L) = \frac{\sum_{r=1}^{|L|} \mathrm{Prec}(r) \times \mathrm{NewEntity}(r)}{\#\,\text{True Entities}}$$
where L = ranked list of extracted mentions, r = rank
Prec(r) = precision at rank r
NewEntity(r) = 1 if (a) and (b) are true, 0 otherwise
(a) Extracted mention at r matches any true mention
(b) There exists no other extracted mention at rank less than r that is of the same entity as the one at r
# True Entities = total number of true entities in this dataset
Evaluation Procedure (per dataset)
1. Randomly select three true entities and use their first listed mentions as seeds
2. Expand the three seeds obtained from step 1
3. Repeat steps 1 and 2 five times
4. Compute MAP for the five ranked lists
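The metric can be sketched in a few lines. The slide does not define Prec(r) beyond "precision at rank r", so this example assumes it is the fraction of the top r mentions that match a true mention; the function names and the `mention_to_entity` mapping are hypothetical.

```python
def average_precision(ranked, mention_to_entity, num_true_entities):
    """Average precision of one ranked list of extracted mentions.

    mention_to_entity maps each true mention to its entity (assumed input);
    mentions missing from the mapping count as incorrect.
    """
    seen_entities = set()
    correct_so_far = 0
    total = 0.0
    for rank, mention in enumerate(ranked, start=1):
        entity = mention_to_entity.get(mention)
        if entity is None:
            continue                      # condition (a) fails
        correct_so_far += 1
        if entity not in seen_entities:   # condition (b): entity is new
            seen_entities.add(entity)
            total += correct_so_far / rank  # Prec(r), under the stated assumption
    return total / num_true_entities


def mean_average_precision(ranked_lists, mention_to_entity, num_true_entities):
    """Mean of the average precisions over several ranked lists."""
    values = [
        average_precision(r, mention_to_entity, num_true_entities)
        for r in ranked_lists
    ]
    return sum(values) / len(values)
```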
Experimental Results
[Figure: bar charts of Overall MAP vs. Various Methods. Y-axis: MAP (%); x-axis: Methods. Methods shown: G.Sets, G.Sets (Eng), E1+EF+100, E2+EF+100, E2+GW+100, E2+GW+200, E2+GW+300. Bar values, in ascending order: 14.59%, 43.76%, 82.39%, 87.61%, 93.13%, 94.03%, 94.18%]
Legend
[Extractor] + [Ranker] + [Top N URLs]
Extractor = { E1: Extractor E1, E2: Extractor E2 }
Ranker = { EF: Extracted Frequency, GW: Graph Walk }
N = { 100, 200, 300 }
Conclusion & Future Work
Conclusion
Unsupervised approach for expanding sets of named entities
Domain and language independent
SEAL performs better than Google Sets
Higher Mean Average Precision on our datasets
Handles not only English, but also Chinese and Japanese
Future Work
Learn from graphs to re-rank extracted mentions
Bootstrap named entities by using extracted mentions in previous expansion as seeds
Identify possible class names for expanded sets
e.g. car makers, constellations, presidents…
References
Top three mentions are the seeds
Try it out at http://rcwang.com/seal