Author Name Disambiguation in Medline


Vetle I. Torvik and Neil R. Smalheiser
August 31, 2006
A Statistically Based Model:

Hypothesis: an individual tends to publish papers with similar attributes, consistently enough that these attributes suffice to disambiguate the authors.


The case of Dr. Tom Jobe?
Large, automatically generated training sets
of pairs of articles matching (Last Name, First
Initial) written by the same person vs. by
different individuals.
Name Attributes




Suffix if present (III, Jr.)
Middle initial match
First name (if available in Medline, or if it can be scraped from online papers)
Original model got very good performance without using first names at all!


Name spelling variants
Name frequency
Article Attributes



Journal name
Number of co-author names in common
Affiliation words in common





Language of the article
Title words in common
Email addresses, if given


may not be given for all co-authors
Name correlations with affiliations (e.g. Ito and Japan are
correlated)
assign to right author
MeSH headings in common
A Monotone Model



Each pair of papers creates a vector of
10 dimensions, each of which has a
matching score.
Assume monotonicity [more attributes
in common, more likely written by the
same person]
Allows for nonlinear and interactive
effects across dimensions
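To make the match vector and the monotonicity assumption concrete, here is a minimal sketch. It uses only four illustrative dimensions rather than the model's ten, and the field names (`journal`, `coauthors`, `title`, `mesh`), the scoring choices, and the sample papers are assumptions made for illustration, not the authors' actual feature definitions.

```python
from typing import Dict, Tuple


def match_vector(a: Dict, b: Dict) -> Tuple[int, ...]:
    """Toy match vector for a pair of articles (hypothetical 4 dimensions)."""
    same_journal = int(a["journal"] == b["journal"])
    shared_coauthors = len(set(a["coauthors"]) & set(b["coauthors"]))
    shared_title_words = len(
        set(a["title"].lower().split()) & set(b["title"].lower().split())
    )
    shared_mesh = len(set(a["mesh"]) & set(b["mesh"]))
    return (same_journal, shared_coauthors, shared_title_words, shared_mesh)


def dominates(x: Tuple[int, ...], y: Tuple[int, ...]) -> bool:
    """True if x matches at least as well as y on every dimension.

    Under the monotonicity assumption, a dominating vector should never
    be assigned a lower same-author likelihood than the one it dominates.
    """
    return len(x) == len(y) and all(xi >= yi for xi, yi in zip(x, y))


# Invented sample papers, for illustration only.
paper_a = {"journal": "J Neurosci", "coauthors": ["Lee J", "Kim S"],
           "title": "Dopamine receptors in rat striatum",
           "mesh": ["Dopamine", "Rats"]}
paper_b = {"journal": "J Neurosci", "coauthors": ["Lee J"],
           "title": "Dopamine release in rat cortex",
           "mesh": ["Dopamine"]}
paper_c = {"journal": "Lancet", "coauthors": [],
           "title": "Hypertension trial outcomes",
           "mesh": ["Hypertension"]}

vec_ab = match_vector(paper_a, paper_b)
vec_ac = match_vector(paper_a, paper_c)
```

Because the ordering is only partial, two vectors may be incomparable (each better on some dimension); monotonicity constrains only the comparable pairs, which is what leaves room for nonlinear and interactive effects.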
Estimate Pairwise Frequencies


For a given pair of articles, compute the match vector, then look up its frequency in the positive vs. negative training sets: the ratio of the two frequencies is the R value
For a given name, estimate the a priori
probability P that any two papers will be
written by the same person



This is a whole story in itself….
Probability of a match = 1 / [1 + (1 - P) / (P × R)]
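The formula above is Bayes' rule in odds form: prior odds P/(1 - P) multiplied by the likelihood ratio R. A minimal sketch follows; the function names and the counts in the example are hypothetical, and how P is actually estimated for a given name is not covered here.

```python
def likelihood_ratio(pos_count: int, pos_total: int,
                     neg_count: int, neg_total: int) -> float:
    """R = relative frequency of a match vector in the positive training
    set divided by its relative frequency in the negative training set."""
    if neg_count == 0:
        raise ValueError("vector unseen in negative set; R is undefined here")
    return (pos_count / pos_total) / (neg_count / neg_total)


def match_probability(prior: float, ratio: float) -> float:
    """Probability that two papers share an author: 1 / [1 + (1-P)/(P*R)]."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior P must lie strictly between 0 and 1")
    if ratio < 0.0:
        raise ValueError("ratio R must be non-negative")
    if ratio == 0.0:
        return 0.0
    return 1.0 / (1.0 + (1.0 - prior) / (prior * ratio))


# Hypothetical counts: the vector occurs in 90 of 1000 positive pairs
# and in 10 of 1000 negative pairs, so R is about 9.
r_value = likelihood_ratio(90, 1000, 10, 1000)
prob_common_name = match_probability(0.1, r_value)   # low prior P
prob_rare_name = match_probability(0.5, r_value)     # higher prior P
```

With R near 9, a prior of 0.1 gives a match probability of about 0.5, while a prior of 0.5 gives about 0.9: the same evidence counts for less when the name is shared by many individuals.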
The Author-ity Site at
(http://arrowsmith.psych.uic.edu)
Beyond Pairwise Comparisons





A and B share titles, journals
B and C share co-authors, affiliations
But A and C share nothing!
Yet p(AC) must be ≥ p(AB) + p(BC) - 1
Triangle inequality on probabilities: detect and correct anomalies due to missing data or higher-order correlations
Catch un-characteristic papers by an author
Another long story to optimize the methods!
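The detection half of the triplet check can be sketched as follows. This only flags triplets that violate the bound; the slides do not say how the anomalies are corrected, so no correction is attempted. The data layout (a dict keyed by `frozenset` pairs, with every pair present) and the function name are assumptions.

```python
from itertools import permutations
from typing import Dict, FrozenSet, List, Tuple


def triplet_violations(
    p: Dict[FrozenSet[str], float], tol: float = 1e-9
) -> List[Tuple[str, str, str, float]]:
    """Return (a, b, c, bound) wherever p(a,c) < p(a,b) + p(b,c) - 1.

    `p` must contain a probability for every unordered pair of papers.
    """
    papers = sorted({paper for pair in p for paper in pair})
    found = []
    for a, b, c in permutations(papers, 3):
        if a < c:  # (a, b, c) and (c, b, a) describe the same constraint
            bound = p[frozenset((a, b))] + p[frozenset((b, c))] - 1.0
            if p[frozenset((a, c))] < bound - tol:
                found.append((a, b, c, bound))
    return found


# A and B look alike, B and C look alike, but A and C share nothing.
pairwise = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.9,
    frozenset(("A", "C")): 0.1,
}
violations = triplet_violations(pairwise)
```

Here the only violated constraint is the one with B in the middle: p(AC) = 0.1 falls well below the lower bound of about 0.8 implied by the other two pairs.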
Clustering “all” papers in Medline
according to author-individuals


First we compute all pairwise probabilities for each (last name, first initial), modified with the triplet correction
Then we form clusters
at p = 0.95 (high precision)
and at p = 0.5 (high recall), i.e. a paper joins a cluster only if the chance that it belongs there is greater than 0.5; otherwise it stays a singleton
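As a rough illustration of how the threshold trades precision for recall, the sketch below links any two papers whose pairwise probability exceeds the threshold and reads off connected components with union-find. This single-link thresholding is a stand-in chosen for brevity; the slides do not describe the actual clustering procedure, and the names and probabilities are invented.

```python
from typing import Dict, FrozenSet, List


def cluster_at_threshold(
    papers: List[str], prob: Dict[FrozenSet[str], float], threshold: float
) -> List[List[str]]:
    """Group papers whose pairwise probability exceeds `threshold`.

    Pairs missing from `prob` are treated as not linked.
    """
    parent = {paper: paper for paper in papers}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair, p in prob.items():
        if p > threshold:
            a, b = tuple(pair)
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for paper in papers:
        clusters.setdefault(find(paper), []).append(paper)
    return sorted(sorted(members) for members in clusters.values())


papers = ["A", "B", "C", "D"]
prob = {
    frozenset(("A", "B")): 0.97,
    frozenset(("B", "C")): 0.70,
    frozenset(("A", "C")): 0.60,
    frozenset(("C", "D")): 0.05,
}
high_precision = cluster_at_threshold(papers, prob, 0.95)
high_recall = cluster_at_threshold(papers, prob, 0.5)
```

At 0.95 only A and B are merged and C and D stay singletons; at 0.5, C joins the A/B cluster and only D remains a singleton.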
First-Pass Disambiguation
is Complete!

Except for several hundred names having more than ~3000 papers each:
these reach the memory limit;
we will assess whether the model is reliable for the biggest names
For now, proceed for papers giving first names.
Monitoring for over-clustering and under-clustering
Summarizing global statistics
Immediate Next Steps

Evaluate the clustering performance





Old vs. new papers
Importance of missing data
Very frequent names
Singletons, least confident assignments
Update the web interface
[Figure: Baseline 2006 version of MEDLINE. Number of names (last name, first initial) vs. name count (number of papers), logarithmic axes. The most frequent names are labeled: J Wang, J Smith, S Kim, S Lee, J Lee.]
[Figure: A sample of names from baseline 2006 version of MEDLINE. Average number of clusters (0 to 500) vs. name count (number of papers, 0 to 3000). Series: All, Singletons.]
[Figure: A sample of names from baseline 2006 version of MEDLINE. Number of clusters vs. cluster size (number of papers), logarithmic axes.]
Upcoming Grant Renewal
Aim 1: Special Cases



name reversal, hyphenated names,
spelling errors, Gerald vs. Jerry, Rick vs.
A. Rick
Use co-author assignment to help
disambiguate another co-author
Compute confidence level of
assignment for each paper, identify
least confident assignments
Upcoming Grant Renewal
Aim 2: Update the Model






Original model covers 1966-present, but newer papers carry different information: MeSH, emails, online information
Modify training sets with recent papers.
Journal name partial match
Abstract words match?
Affiliations matched to each author in PMC, online papers
References Cited information taken from PMC
Upcoming Grant Renewal
Aim 3: Web Interface




Update the pairwise interface (given name, a
particular paper, list all others in order of
match probability)
Show clusters – given a name, show all
clusters of author-individuals, link to
Community of Science, searchable by
attributes, can summarize and explore further
(Anne O’Tate tool)
Author profile/collaboration finder tools
Data made available for bibliometrics and
collaboration network research
Upcoming Grant Renewal
Aim 4: Curation

Curator to identify errors and least-confident
assignments




manually
machine methods (e.g. wobble in clustering)
change the database and alter the model as
needed
Wiki Authors – will monitor postings to Wiki
and change the database as verified and
warranted (e.g. maiden name to married
name)