Grouping search-engine returned citations for person-name queries

Reema Al-Kamha, David W. Embley
(Proceedings of the 6th annual ACM
international workshop on Web
information and data management 2004)
Abstract
• They present a technique to group search-engine
returned citations for person-name queries.
• The objective is to put the returned citations in
groups such that each group relates to one person.
• They use a multi-faceted approach that considers
evidence from three facets (attributes, links,
page similarity).
• They construct a relatedness confidence matrix
for pairs of citations.
• They merge pairs whose matching confidence
value is above a threshold.
Related work
• The problem is related to cross-document
coreferencing and object identity.
• G. Mann and D. Yarowsky (2003)
– They use document vectors over biographical
information such as birth year, birth place, and spouse name.
• S. Tejada (2001)
– For object identification, one technique uses vector
space modeling and another uses probabilistic modeling.
A multi-faceted approach
• They use a multi-faceted method to group
relevant citations.
• Each facet represents one aspect of the question
of whether two citations refer to the same person
or to different persons.
• In this paper, they consider attributes about a
person, links within and among sites, and page
similarity as facets.
Facet 1: Attributes
• The attributes, chosen by manual inspection, are
phone number, email address, state, city, and zip
code.
• In order to extract values from a web page, they
write regular expressions for each attribute.
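A minimal sketch of attribute extraction with regular expressions. The patterns, the function name extract_attributes, and the sample text are illustrative assumptions; the slides do not give the authors' actual expressions.

```python
import re

# Hypothetical patterns for illustration only; not the authors' expressions.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.]\d{4}\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def extract_attributes(text):
    """Return the set of values found in the page text for each attribute."""
    return {name: set(p.findall(text)) for name, p in PATTERNS.items()}

# Made-up sample page text.
page = "Contact: jdoe@example.org, (801) 555-0100, Provo, UT 84602"
attrs = extract_attributes(page)
```

Two citations would then count as matching on an attribute when their extracted value sets overlap; that matching rule is also an assumption here.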
Facet 2: Links (1)
• If two URLs share a common host, they may
refer to the same person.
• If the URL of one citation has the same host as
one of the URLs on the web page referenced by
the other citation, they may refer to the same
person.
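The two link checks above can be sketched as follows. The function names and URLs are hypothetical, and comparing full host names (rather than, say, registered domains) is an assumption.

```python
from urllib.parse import urlparse

def host(url):
    """Lower-cased host name of a URL, or None if it has none."""
    return urlparse(url).hostname

def share_host(url_a, url_b):
    """Check 1: the two citation URLs have the same host."""
    h = host(url_a)
    return h is not None and h == host(url_b)

def links_to_host(citation_url, other_page_links):
    """Check 2: some link on the other citation's page has the citation's host."""
    h = host(citation_url)
    return h is not None and any(host(link) == h for link in other_page_links)

# Made-up URLs for illustration.
same = share_host("http://www.example.org/a", "http://WWW.example.org/b/c")
linked = links_to_host(
    "http://people.example.org/kf",
    ["http://other.example.com/", "http://people.example.org/index.html"],
)
```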
Facet 2: Links (2)
• Because many names appear on popular hosts,
two citations that share a popular host give
less confidence that they refer to the same
person.
• They need to find a way to determine whether
the host is popular or not.
• The query link:siteURL in Google shows all pages
that point to that URL.
• A host h is popular for person-name queries if
more than 400 pages point to h.
Facet 3: Page Similarity (1)
• If two different web pages are similar, they may
refer to the same person.
• They use pairs of words that start with a capital
letter and that are either adjacent or separated
by a connector (and, or, but) or by a preposition
which may be followed by an article (a, an, the) or
by a single capital letter followed by dot.
– David Embley, who is a professor of the Data Extraction
Research Group in the Computer Science Department at
Brigham Young University.
Facet 3: Page Similarity (2)
• They construct a stop word list which is a list of
frequently appearing adjacent cap-word pairs
– Home Page, Privacy Policy
• They collected approximately 10,000 web
documents taken at random from the Open
Directory Project.
• They extracted all adjacent cap-word pairs,
sorted them by frequency, and treated every
pair with a frequency greater than two as a
stop word.
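A rough sketch of building the stop list from a document sample. It handles only directly adjacent capitalized words, counts total occurrences across documents (the slides do not say whether frequency is per occurrence or per document), and uses made-up documents.

```python
import re
from collections import Counter

def cap_pairs(text):
    """Adjacent pairs of capitalized words; punctuation breaks adjacency."""
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z\s]+", text)
    return [(a, b) for a, b in zip(tokens, tokens[1:])
            if a[0].isupper() and b[0].isupper()]

def build_stop_pairs(documents, threshold=2):
    """Pairs whose total frequency is greater than the threshold."""
    counts = Counter(pair for doc in documents for pair in cap_pairs(doc))
    return {pair for pair, n in counts.items() if n > threshold}

# Made-up mini corpus standing in for the Open Directory sample.
docs = ["Home Page of Ann Lee", "Home Page", "Visit the Home Page", "Ann Lee"]
stop_pairs = build_stop_pairs(docs)
```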
Facet 3: Page Similarity (3)
• They consider the number of adjacent cap-word
pairs that two web pages share as an indicator of
the similarity between the pages.
• The greater the number of shared adjacent
cap-word pairs, the greater the similarity between
the pages.
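As an illustration of this similarity measure, the sketch below counts the cap-word pairs two pages have in common after removing stop pairs. It is simplified: it covers only directly adjacent capitalized words, not pairs separated by connectors, prepositions, articles, or initials. All names and sample texts are assumptions.

```python
import re

def adjacent_cap_pairs(text, stop_pairs=frozenset()):
    """Set of adjacent capitalized word pairs, minus the stop pairs."""
    # Punctuation runs become their own tokens, so they break adjacency.
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z\s]+", text)
    return {
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a[0].isupper() and b[0].isupper() and (a, b) not in stop_pairs
    }

def shared_pair_count(text_a, text_b, stop_pairs=frozenset()):
    """Number of cap-word pairs that appear on both pages."""
    pairs_a = adjacent_cap_pairs(text_a, stop_pairs)
    pairs_b = adjacent_cap_pairs(text_b, stop_pairs)
    return len(pairs_a & pairs_b)

# Made-up page texts.
page_a = "David Embley teaches at Brigham Young University. Home Page"
page_b = "Home Page of David Embley, Brigham Young University"
shared = shared_pair_count(page_a, page_b, stop_pairs={("Home", "Page")})
```

Here the shared pairs are (David, Embley), (Brigham, Young), and (Young, University); (Home, Page) is excluded as a stop pair.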
Confidence Matrix Construction (1)
• They construct a confidence matrix, one for each
facet.
• First, they construct a training set to compute
the conditional probabilities.
• The training set must satisfy some restrictions.
– It should contain male, female, and gender-neutral names.
– It should contain names whose returned citations fall
into groups of different sizes.
– It should contain names whose returned citations fall
into different numbers of groups.
• They entered each of the 9 names as a Google query
and collected the first 50 returned citations for
each name.
Confidence Matrix Construction (2)
• They use the training set to estimate the
conditional probabilities.
• P( Same Person=“Yes” | Email=“Yes”)
• P( Same Person=“Yes” | City=“Yes” and State=“Yes”)
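One plausible way to estimate such a probability is as a relative frequency over manually labeled citation pairs. The record layout, the function name, and the data below are made up for illustration; the slides do not describe the authors' estimation procedure in detail.

```python
def conditional_probability(pairs, evidence_key):
    """Estimate P(same_person = yes | evidence = yes) as a relative frequency.

    pairs: list of dicts with a boolean "same_person" label and
    boolean evidence fields. Returns None if the evidence never occurs.
    """
    matching = [p for p in pairs if p[evidence_key]]
    if not matching:
        return None
    return sum(p["same_person"] for p in matching) / len(matching)

# Made-up labeled citation pairs, not data from the paper.
training = [
    {"email": True, "same_person": True},
    {"email": True, "same_person": True},
    {"email": True, "same_person": False},
    {"email": False, "same_person": False},
]
p_email = conditional_probability(training, "email")
```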
Final Confidence Matrix
• They generate the final confidence matrix by
combining the confidence matrices for the three
facets using Stanford certainty theory.
• Stanford certainty theory gives the following rule
for combining the evidence from two independent
observations.
• Suppose CF(E1) is the certainty factor associated
with evidence E1 for some observation B, and
CF(E2) is another certainty factor. The
compounded CF of B is calculated by
CF(E1)+CF(E2)-(CF(E1)*CF(E2)).
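The combination rule can be sketched as a pairwise function folded over the facets. This assumes every certainty factor lies in [0, 1]; the function names are illustrative.

```python
from functools import reduce

def combine_cf(cf1, cf2):
    """Compound two certainty factors: CF1 + CF2 - CF1 * CF2."""
    return cf1 + cf2 - cf1 * cf2

def combine_all(cfs):
    """Fold the pairwise rule over any number of certainty factors."""
    # Starting from 0.0 is neutral, since combine_cf(0.0, x) == x.
    return reduce(combine_cf, cfs, 0.0)

# Three facet confidences for one pair of citations.
final_cf = combine_all([0.96, 0.0, 0.78])
```

Applying the rule pairwise to three values expands to the same sum of single, pairwise, and triple products as a direct three-way combination.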
Grouping Algorithm
• If there is high confidence between two citations
Ci, Cj, they are grouped into a set S1.
• If there is high confidence between two citations
Cj, Ck, they are grouped into a set S2.
• Because S1 and S2 share one or more citations,
they are grouped together in one group S3.
• Keep merging any two sets of citations that share
one or more citations until no citation is shared
between any two sets.
• The threshold is 0.8.
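Repeatedly merging sets that share a citation yields the connected components of the graph whose edges are the high-confidence pairs. The sketch below computes those components with a union-find structure, which is a substitute for the set-merging loop described above, not the authors' code. The matrix values are made up.

```python
def group_citations(n, confidence, threshold=0.8):
    """Group citation indices 0..n-1 linked by confidence above the threshold."""
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if confidence[i][j] > threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up final confidence matrix for four citations.
conf = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.85, 0.1],
    [0.1, 0.85, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
groups = group_citations(4, conf)
```

Citations 0 and 2 end up together even though their direct confidence is low, because both are linked to citation 1.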
Example (1)
• They apply their technique to the first 10
returned citations for the person-name query
“Kelly Flanagan”.
• Pages referenced by the two citations C4 and
C7 have the same city and state.
• They have P( Same Person = “Yes” | City =“Yes”
and State =“Yes”)=0.96.
Example (2)
• The final confidence value between citations C1
and C8, computed with Stanford certainty theory, is
0.96 + 0 + 0.78 - 0.96*0 - 0.96*0.78 - 0.78*0
+ 0.96*0*0.78 = 0.9912.
Experimental results (1)
• They chose 10 names by opening an arbitrary page
from a phone book and choosing an arbitrary name
from the page.
• The system returned the grouping result for the
first 50 returned citations for each name.
• The test set contains 500 citations.
Experimental results (2)
• To evaluate the performance of their system,
they use split and merge measures.
• First, they count how many splits are needed
over all the groups so that the citations in each
group relate to one person.
• Then, they count how many merges are needed
between the groups so that no two groups relate
to the same person.
• They normalize the split and merge scores to
range between 0 and 1.
• An evaluation example appears at the end.
Experimental results (3)
Experimental results (4)
• Using a multi-faceted approach gives much better
performance than using each facet separately.
• For groups that should have been merged, no
evidence or only weak evidence was found to group
them.
• A human expert may rely on pictures or on a
deeper understanding of the meaning of
distinguishing phrases.
Concluding remarks
• They designed and implemented a system that can
automatically group the returned citations from a
search engine person-name query.
• They used a multi-faceted approach that
considers three facets.
• They gave experimental evidence to show that
their approach can be successful.
Evaluation example
• Correct grouping result for 8 citations:
– G1: {C1, C2, C4, C6, C7}
– G2: {C3, C8}
– G3: {C5}
• The grouping result of their system:
– G1: {C1, C2, C4}
– G2: {C3, C6, C7}
– G3: {C5, C8}
• The number of splits over all the groups is
0+1+1=2, and the total number of merges is 2.
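The sketch below shows one reading of the split and merge counts that reproduces the numbers in this example. The counting rule is an assumption: each system group is split once per extra person it contains, and the resulting pieces are then merged once per extra piece belonging to the same person. Normalization to the range 0 to 1 is not shown.

```python
def split_and_merge_counts(system_groups, label):
    """Return (splits, merges) needed to turn system groups into correct ones.

    label maps each citation to the id of its correct group.
    """
    pieces = []  # one entry per pure piece left after splitting
    splits = 0
    for group in system_groups:
        persons = {label[c] for c in group}
        splits += len(persons) - 1  # cuts needed to make this group pure
        pieces.extend(persons)
    merges = len(pieces) - len(set(pieces))  # joins needed per person
    return splits, merges

correct = {
    "G1": ["C1", "C2", "C4", "C6", "C7"],
    "G2": ["C3", "C8"],
    "G3": ["C5"],
}
label = {c: g for g, members in correct.items() for c in members}
system = [["C1", "C2", "C4"], ["C3", "C6", "C7"], ["C5", "C8"]]
splits, merges = split_and_merge_counts(system, label)
```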