Learning name variants from true person resolution

Large scale harvesting of variants
of proper names
Gerrit Bloothooft, UiL-OTS, Utrecht University
Marijn Schraagen, LIACS, Leiden University
The Netherlands
Utrecht Leiden
ICOS 2014 Glasgow
Links
name variants
• different versions of a name,
that can denote the same object
• requires a proof that the same object is
involved (in at least one example)
– not always easy
– rarely explicitly provided
proper names in historical sources
Lots of variation
– spelling variation: Dirk - Dirck
– suffix variation: Willem - Willempje
– abbreviation: Willem - Wim
– translation: Willem - Guillaume, Willem - Wilhelmus
– typos (digitization): Willem - Aillem
– …
variation!
Guljelmus
Wllhelmus
Wlhelmus
WIllem
(Willem)
Wiellem
Wlllem
Gujlelnius
Wllem
WiIllem
Wijllem
Wihelmus
Willemj
Wikllem
Wwillem
Willlem
Guilleam
Willeam
Willem
Wil.lem
Wilem
Guileam
Willelmini
willem
Wiilem
Guillem
Weillem
Guilelmis
Wil;helmus
Wilhlem
Welhelmus
Wiillem
Wiehelmus
Wulhelmus
Willem)
Wilehelmus
Woillem
Wihhelmus
Weijlem
Willelmus
Wi;;em
Wilehlmus
Wuhelm
Guilelmus
Wilhlelmus
Willem(se)
Wilalem
Wullem
Willem.
W#ilhelmus
Guillelmus
Wliiem
Wlihelmus
Wilelmus
Willemm
Wileem
Wìllem
Willemem
Wolhelmus
Wechelmus
Guilllelmus
Wilemm
W.ilhelmus
Willem]
Willemh
\Willem
Wïllem
w8illem
Wilhellmus
Wilhelm.
Wilmhelmus
Wilhelmuns
Wilhelmua
Wilhelmos
wilhelmnus
Wilhelmnus
Wilhelmues
Guilleaumme
Wilhelmum
Guilhelmus
Willeml
Wilhelmanus
Wilhelmjus
Wilhelmes
Guilliaumme
Wilhelmas
Willemn
Wilhelmus
Wilhelmns
Willhelmus
Guiliaume
Willlen
Guiilleaume
Guilliaume
Willenis
Guiliermo
Wilempjen
Willempjen
Willepjen
Guilliermo
Wittem
Willen!
Wilhlenn
Wijlen
Wielen
Willen
wilhem
Willempke
Guilleaume
Wilhellemus
Wilhekmus
Guiileaume
Willeaume
Wilhelmuus
Guylleaume
Guileaumme
Guileaume
Wilhelemus
Guilleauma
Willewm
Guillesmus
Guïllermo
Guilermo
Guiilermo
Guillermo
Guillerlmus
Guijlleaume
Wilheminus
Wilhelhmus
Guillaum
guillaum
Gueillaum
Wilhemus
Guilhemus
Wielhemus
Wilhehmus
Wilhelminus
Wilhelmienus
Wilherlmus
Wilhermus
Weilhim
Wilhiem
Wilheim
Wilhein
Willaum
Guillaim
wilhemus
Wilhelnmus
Woalter
Willhem
Guillhem
Wilheem
Wilhem
Wölhelm
Wilhelimus
Wilhelus
Willaim
Willemerman
Wiechem
Wiloem
Wilhelmius
Wilhelmijs
Guilhelmis
Wilhelmjs
Wilhelmis
Willemhelmus
Willoem
Wilhelnus
Weilhelmus
Wwilhelmus
Wylhelmus
wWilhelmus
(Wilhelmus)
(Wilhelmus
Wilhelmüs
Wilhelmus\
Guiljame
Wilhelmus?
WilhelmusHubertus
Wilheelmus
Wilhelmmus
Wielhelmus
Wilhhelmus
Wiilhelmus
WEilhelmus
wilhelmus
Wilhelmus)
Guillieaume
Wilhelmuss
Wilhwlmus
WilhelmusStephanus Wilhwelmus
WIlhelmus
Willum
Willkem
Guillum
Wilkhelmus
William
Wilhelmiem
Wilhlemus
Wilhelmigs
Wilielmus
Willme
Willielmus
Wilme
Güilielmus
WilhelmusHenricus Guililmus
WilhelmusTheodorus Guileilmus
Wilhelmushenricus Guïllielmus
Wilhelmusn
Guilielmus
Wilhelmuszn
Guillijaam
Eilhelmus
Willemus
ilhelmus
Wiiliam
Ilhelmus
Guilemus
Willemcus
Guillemus
WilhelmusJohannes Willemmus
Wilhelmushubertus Wilehmus
Wilhelmuw
Wilemus
Wwilhwlmus
Willliam
Guilliaam
Wieliam
Guiliam
Guillielmus
Guiliaam
Wilhlmus
Guillieam
Guillmus
Guilliam
Wiliaam
Wiliam
Wilhmus
Wilnelmus
Guiilmus
Willwm
Wilmus
Guilmus
Aillem
JohannesWilhelmus
Johanneswilhelmus
CornelisWilhelmus
Gulliëlmus
Guliëlmus
Gijlliaume
Güliëlmus
Guli?lmus
Guijelmus
Gulielmus
Guiëlmus
Giliaume
Gilliaume
Gilliaumme
Guihelmus
Guikelmus
Gullielmus
Guielmus
Jannwillem
Janwillem
JanWillem
JanWilhelmus
MartinusWilhelmus
Qwillem
challenge
name variation is difficult to model, therefore:
• learn variation in person names from
use of names in real life
(let data speak for itself)
• automatically from big data
required
• big data
– with many references to individuals
• true person resolution
– proof that the same individual is concerned
– even with data that contain name variants
big data
• Dutch vital registration (who-was-who 2011), 1811 - early 20th century
– 4.1 million birth certificates (~30%)
– 3.1 million marriage certificates (~90%)
– 7.6 million death certificates (~65%)
55 million name references to persons
source names
1,052,000 different full first names (composite)
Jan, Johanna Maria Cornelia
111,900 different female first names (singular, Maria)
82,700 different male first names (singular, Jan)
681,000 different surnames (prefixes included)
Bakker, de Vries
600,000 different surnames (prefixes excluded)
Vries
information per person
• first name person (child, bride or groom, deceased)
• first name father
• surname father
• first name mother
• surname mother (always maiden name in The Netherlands)
• age person
person resolution
• assumption: the available information
identifies a person uniquely (if there is exact
matching)
• relaxed assumption: one of the first names
and surnames of the mother or father is not
needed for true person resolution
example
Johanna Endt
• marries in 1858 as the 29-year-old daughter of
Gerrit Endt and Dorothea Kerbert
• dies in 1882 as the 54-year-old daughter of
Gerrit Endt and Doortje Kerbert
~1829, Johanna, Gerrit, Endt, Kerbert, Dorothea
~1828, Johanna, Gerrit, Endt, Kerbert, Doortje
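The two tuples above follow from subtracting the stated age from the year of the event. A minimal Python sketch of that step, assuming a hypothetical `Record` layout and a one-year tolerance on the birth year (neither is specified on the slides):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One certificate reference to a person (hypothetical field layout)."""
    event_year: int
    age: int
    first_name: str
    father_first: str
    father_sur: str
    mother_first: str
    mother_sur: str

    @property
    def approx_birth_year(self) -> int:
        # Age is given in whole years, so this is only an approximation.
        return self.event_year - self.age


def same_birth_year(a: Record, b: Record, tolerance: int = 1) -> bool:
    # The tolerance of one year is an assumption made for this sketch.
    return abs(a.approx_birth_year - b.approx_birth_year) <= tolerance


marriage = Record(1858, 29, "Johanna", "Gerrit", "Endt", "Dorothea", "Kerbert")
death = Record(1882, 54, "Johanna", "Gerrit", "Endt", "Doortje", "Kerbert")

print(marriage.approx_birth_year, death.approx_birth_year)
print(same_birth_year(marriage, death))
```

The two records then agree on everything except the mother's first name, which is the situation the relaxed assumption is meant to cover.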
test of assumption
(of true person resolution)
• consider all matches between birth and death
certificates with exact matching of all
information
• leave out one name per match
• count number of multiple matches
result:
only 85 out of 1,107,162 matches are not unique
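One way to read this test is sketched below in Python: drop one parental name field, group the records on the remaining fields, and count the groups that now contain more than one distinct person. The dictionary keys, the function name `count_ambiguous` and the toy records are invented for illustration; the slides do not describe the actual implementation.

```python
from collections import defaultdict


def count_ambiguous(records, left_out):
    """Count reduced keys (field `left_out` dropped) shared by more than one distinct person."""
    groups = defaultdict(set)
    for rec in records:
        full = tuple(sorted(rec.items()))
        reduced = tuple(item for item in full if item[0] != left_out)
        groups[reduced].add(full)
    return sum(1 for people in groups.values() if len(people) > 1)


# Invented toy records; the first two differ only in the mother's first name.
records = [
    {"first_name": "Jan", "birth_year": 1830, "father_first": "Piet",
     "father_sur": "Bakker", "mother_first": "Maria", "mother_sur": "Vries"},
    {"first_name": "Jan", "birth_year": 1830, "father_first": "Piet",
     "father_sur": "Bakker", "mother_first": "Anna", "mother_sur": "Vries"},
    {"first_name": "Kees", "birth_year": 1841, "father_first": "Dirk",
     "father_sur": "Visser", "mother_first": "Aafje", "mother_sur": "Spruit"},
]

parent_fields = ("father_first", "father_sur", "mother_first", "mother_sur")
ambiguous = {field: count_ambiguous(records, field) for field in parent_fields}
print(ambiguous)
```

In the toy data only leaving out the mother's first name makes two different persons collide; on the real certificates the slide reports that such collisions are very rare.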
harvesting name variant pairs
(procedure)
• identify all record pairs of individuals (over
birth, marriage and death certificates) that
exactly share
– first name of the individual
– approximate year of birth
– three out of four names of parents (first names and surnames)
• collect pairs of the remaining name, if
different
Christiena – Christina
Bloothooft - Bloothoofd
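The procedure can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the record keys, the `year_tolerance` parameter and the second toy person are invented, and the pairwise loop is quadratic, so a real run over millions of records would need some form of indexing.

```python
from collections import Counter
from itertools import combinations

PARENT_FIELDS = ("father_first", "father_sur", "mother_first", "mother_sur")


def harvest_variant_pairs(records, year_tolerance=1):
    """Return a Counter of unordered name pairs, counted once per record pair (token)."""
    pairs = Counter()
    for a, b in combinations(records, 2):
        if a["first_name"] != b["first_name"]:
            continue
        if abs(a["birth_year"] - b["birth_year"]) > year_tolerance:
            continue
        differing = [f for f in PARENT_FIELDS if a[f] != b[f]]
        if len(differing) == 1:  # three of the four parental names match exactly
            field = differing[0]
            pairs[tuple(sorted((a[field], b[field])))] += 1
    return pairs


records = [
    # Johanna Endt, from the example slide.
    {"first_name": "Johanna", "birth_year": 1829, "father_first": "Gerrit",
     "father_sur": "Endt", "mother_first": "Dorothea", "mother_sur": "Kerbert"},
    {"first_name": "Johanna", "birth_year": 1828, "father_first": "Gerrit",
     "father_sur": "Endt", "mother_first": "Doortje", "mother_sur": "Kerbert"},
    # Invented person, only to produce a surname pair.
    {"first_name": "Gerrit", "birth_year": 1850, "father_first": "Jan",
     "father_sur": "Bloothooft", "mother_first": "Maria", "mother_sur": "Bakker"},
    {"first_name": "Gerrit", "birth_year": 1850, "father_first": "Jan",
     "father_sur": "Bloothoofd", "mother_first": "Maria", "mother_sur": "Bakker"},
]

pairs = harvest_variant_pairs(records)
print(pairs)
```

The number of distinct keys in the counter corresponds to variant pairs, and the sum of the counts to tokens.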
harvesting name variant pairs
(results)
                      pairs     tokens
female first names    48,600    246,500
male first names      31,900    183,000
surnames             177,000    374,900
average:
first names: 5 to 6 tokens per variant pair
surnames: 2 tokens per variant pair
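The averages follow from dividing tokens by pairs. A short Python check, assuming the counts are assigned to the three categories in the order in which they are listed on the slide:

```python
# (pairs, tokens) per category, as listed on the slide.
counts = {
    "female first names": (48_600, 246_500),
    "male first names": (31_900, 183_000),
    "surnames": (177_000, 374_900),
}

tokens_per_pair = {
    category: round(tokens / pairs, 1)
    for category, (pairs, tokens) in counts.items()
}

for category, value in tokens_per_pair.items():
    print(f"{category}: {value} tokens per variant pair")
```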
so far so good, but
• the original certificates are not error-free
> found variants can be due to errors in the
source, errors made during transcription, or typos
• theoretical issue:
what is a name variant, and what is an error?
example
in the source documents:
Pieter
born as son of Jacob Houtlosser and Aafje Spruit,
died as son of Jacob Houtlosser and Grietje Spruit
variant Aafje – Grietje ?
variants and errors
distinction is difficult to make
• variants share the same lemma
and errors do not
requires onomastic expertise
(which we would like to avoid, let the data speak for itself)
variants and errors
• Variants
Willem - Wilhelm
Willem - Guillaume
Willem - W8llem (no indication of different lemma)
• Errors
Grietje - Aafje
Fijtje - Sijtje
(understandable reading error but different lemma)
methods for cleaning
• using name dictionaries with lemmas
– to accept name pairs
• using known non-variants
– to reject name pairs
• rules
– to accept name pairs
all with manual intervention (< 2%)
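How the three methods might be combined is sketched below in Python. Everything concrete here is an assumption made for illustration: the toy lemma dictionary, the list of known non-variants, the order of the checks, and in particular the acceptance rule (an edit distance of at most one), which merely stands in for the rules that the slides do not spell out.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Toy stand-ins for a lemma dictionary and a list of known non-variants.
LEMMAS = {"Willem": "Willem", "Wilhelm": "Willem", "Guillaume": "Willem"}
KNOWN_NON_VARIANTS = {frozenset(("Grietje", "Aafje")), frozenset(("Fijtje", "Sijtje"))}


def classify(a, b):
    """Return 'accept', 'reject' or 'manual' for a harvested name pair."""
    if frozenset((a, b)) in KNOWN_NON_VARIANTS:
        return "reject"
    lemma_a, lemma_b = LEMMAS.get(a), LEMMAS.get(b)
    if lemma_a is not None and lemma_a == lemma_b:
        return "accept"
    if edit_distance(a.lower(), b.lower()) <= 1:  # placeholder rule
        return "accept"
    return "manual"


for pair in [("Willem", "Guillaume"), ("Fijtje", "Sijtje"),
             ("Christiena", "Christina"), ("Jan", "Johannes")]:
    print(pair, classify(*pair))
```

Checking the known non-variants first matters in this sketch: Fijtje and Sijtje differ by a single letter, so a purely string-based rule would otherwise accept them.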
cleaning | name dictionaries
• dictionary of Dutch first names (20,000), but
– lemmas too detailed
– names with multiple lemmas
– only 8% of all first name pairs share a lemma in the
dictionary (43% of tokens)
results, in variant pairs
• female first name pairs: 34,800 accepted, 13,900 errors (29%)
• male first name pairs: 22,500 accepted, 9,400 errors (29%)
• surname pairs: 120,100 accepted, 57,100 errors (32%)
very many variant pairs (Willemina)
WILMINA
WILLEMJE
WELLEMTJE
WILMTJE
WILLEMTJE
WILHELMINA
WILLEPMJE
WILLEMPIE
WELLEMTJE
WELLEMTJE
WILLEMIJNTJE
WILLEMIJNTJE
WLLEMIJNTJE
WILLEMIJN
WILHELMINA
WILLEMTIEN
WILLEMTIEN
WILEHELMINA
WILLEMKE
WILLEMKEN
WILLEMINA
WILLEMINA
WILLEMIENA
WILLEMINA
WIHELMINA
WILLEMKE
WILLEMIJNTJE
WILHEMINA
WILLEMKEN
WILLEMPJE
WILLEMIJNTE
WILLEMIJNTJE
WILLEMPTJE
WILLEMIJNTJE
WILLEMIJNTJE
WILLEMYNA
WILLEMPJE
WILEMPJE
WILLEMIJNTJE
WILLEMIINTJE
WILLEMINA
WILLEMINA
WILHELMINA
WILLEMIJN
WILLEMIJN
WILLEMINA
WILLEMIJNTJE
WILLEMIJNTJE
WILLEMIJN
WILHELMINA
-
WILMIJNA
WILLEMPJE
WILLEMTJE
WILLEMPJE
WILEMTJE
WILLEMPJE
WILLEMTJE
WILLEMPJE
WELLIMTJE
WOLLEMTJE
WILLEMPJE
WLLEMIJNTJE
WILLEMPJE
WILLEMIJNA
WILLEMINA
WILMTIEN
WILLEMTJE
WILHELMINE
WILLEMKEN
WILLEKEN
WILLEMINE
WILLIMINA
WILLEMINA
WILLEMPJE
WILHELMINA
WILLENKE
WILEMIJNTJE
WILLEMINA
WILMKEN
WILLEMTJE
WILLEMIJNTJE
WILLEMYNTJE
WILLEMTJE
WILLEMTJE
WILLEMYNA
WILLEMIJNA
WILSJE
WILLEMPJE
WILLEMEINTJE
WILLEMIJNTJE
WILLEMINTJE
WILELMINA
WILHELMINE
WILLEMPJE
WILLEMTJE
WILLEMIJN
WILLEMINTJE
WILLEMEIJNTJE
WILLEMIJNTJE
WILLEMIJNA
WILHELMIMA
WILHELMINA
WILHELMIJNA
WILLEMKE
WILLEPMJE
WILLEPMJE
WILLEMIJNTJE
WILHELMA
WILLEMINA
WILLEINTJE
WILHELMIJNA
WILHELMINA
WILLEMINA
WILHELMIA
WILLEMTIEN
WILLEKE
WILHELMINA
WILHELMINA
WILLEMPTJE
WILLEMIEN
WILLEM
WILLEMINA
WILTIEN
WILMKE
WELHELMINA
GUILLIELMINE
WILLEMTIEN
WILHELMIENA
WILMINA
WILLEMKE
WELLEMTJE
WILLEMIN
WILMTJE
WILLEMINA
WILLELMIN
GUILLIELMINE
WILLEMINA
WILEMIJNA
WILLEMTIJN
WILLEMINA
WILLEMIJNE
WILLEMS
WILLEMINE
WILLEMKE
WILLEMIJNTJE
WILLEMINA
WILLEMA
WILLEMINA
WILHELINA
WILLEMKEN
-
WILHELMINA
WILHLEMINA
WILHELMINA
WILLEMPJE
WILLEMKE
WILLEMPJE
WILLEMINA
WILLEMIJNA
WILLLEMINA
WILLEMPJE
WILLEMIJNA
WILHELMUS
WILHELMUS
WILHELMINA
WILTIEN
WILLEMKE
WILHLMINA
WILHEMINA
WILLEMTJEN
WILLEMTIEN
WILLEMPJE
WILLEMIJNE
WILMTIEN
WILLEMKEN
WILHELMINA
GUILLELMINE
WILLEMPIEN
WILHELMINA
WILMIENA
WILLEMTIEN
WELMTJE
WILHELMINA
WILLEMTJE
WILMINA
WILHELMINA
WILHELMINA
WILLEMKE
WILLEMIJNA
WILLEMTJE
WILLEMMINA
WILLEMIJNA
WILLEMINA
WILLELMINA
WILMKE
WILLEMIENTJE
WILLEMIMA
WILLEMINA
WILLEMEIJNTJE
WILHELMINA
WILLENKE
WILLEMINA
WILLEMIJNTJE
WILHELMINA
WULLEMPJE
WILLEMINA
WILHELMINE
WILLEMIJN
WILLEMIJNE
WILLEMPTJE
WILHELM
WILLEMIEN
WILLEMINA
WILHELMA
WILHELMINE
WILLEMIN
GUILLEMINE
WILLEMIENTJE
WILLMINA
WILLEMIJNA
WILLEMINA
GUILLELMINE
WILLEMIJNTJE
WILLEM
WILHELMINA
WILMPJE
WILLEMINA
WILLEMKE
WILLEMKE
WILLEMIJNTJE
WILLEMIJNTJE
WILLEMPJE
WILLEMINA
WILLEINTJE
WILLEMTJEN
WILLEMTJE
WILLEMINA
GUILLIELMINE
WILLEMPIEN
WILHELMINA
WILLEMINA
WILLEMIEN
WILLEMINA
WILMINE
WILKENS
WILLEMINE
WILLEMTJEN
WIILEMINA
WILEHELMINA
WILHELMINA
WILLEMKEN
-
WILLEMTJE
WILLIMPJE
WILLEMIJNTJE
WILLEMPJE
WELLEMINA
WILLEMINE
WILHELMINA
WILHELMINA
WILMPTJE
WILHELMI
WILHELMINA
WILLEMKEN
WILHELMINA
WILLEMINA
WILLEMINA
WILHELMINE
WILLEMEINTJE
WILHELMINA
WILEMINA
WILLMINA
WILHELMINE
WILMIENA
WILLEMS
WILMINA
WILLEMTJE
WILLEMIENTJE
WILLEMTJE
WILLEMPKE
WILLEMKEN
WILLEMIJNTIE
WILEMTJE
WILMIJNTJE
WILLEMTJE
WILLEMPJE
WILLMEPJE
WILHELMIMA
GUILIELMINE
WILLEMPJE
WILLEMTJE
WILLEMEINTJE
WILLEMIN
WILMPJE
WILLEMINE
WILKES
WILMINA
WILLMEPJE
WILLEMINA
WILHELMINA
WILLEMDINA
WILHELMINA
WILLEMIENTJE
WILLEMA
WILLEMPJEN
WILLEMPIEN
WILHELHERMINA
GUILLEMINE
WILLEMIJNTJE
WILLEMPJE
WILLEMINE
WILLEMINA
WILLEMPKE
GUILLELMINE
WILLEMIENA
WILLEMIJNTIE
WILLELMINA
GUILLEMINE
WILLEMIENA
WILLEMINA
WILELMINA
GUILLEMINA
WILLEMKE
WILLEMKE
WILLEMTJEN
WILLEMPIEN
WILLEMJE
WILLEMKEN
WILEMIJNA
WILHELMINA
WILLEMTJE
WILLEMTIEN
WILLEMTIEN
GUILHELMINE
WILLEMKE
WILHELMINA
WILHELLEMINA
WILEMINA
WILLEMJEN
WILMINE
WILHELMIN
WILLEMPJ
-
WILLEMIJNA
WILLEMS
WILLEMTJEN
WILLEMTJE
WILHELMINA
WILHELMINA
WILMIJNTJE
WILMPJE
WILLEMIENE
WILLEMSEN
WILLEMPJE
GUILLELMINA
WILLEMPJE
WILLEMPJE
WILLEMINA
GUILLELMINA
WILHELMIENA
WILHELMIENA
WILHELMINA
GUILLELMINE
WILEMKE
WILLEM
WILLEMTIJN
WILLEMPJEN
WILLEMTJE
WILLEM
WILMIJNA
WILLEMIENA
WILLEMTJEN
WILLEMS
WILLEMPJE
GUILLELMINE
WIMPKE
WILKELINA
WILHELMINA
WILLEMINA
WILLEMKEN
WILLEMINA
WILHELMINA
WILLEMPJE
and many more
name clusters
• variant pairs (are interconnected)
Jan - Johannes
Jan - Joannes
Jan - Johan
Johannes – Johan, etc
• create cluster Jan {Jan, Johannes, Joannes, Johan}
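If clusters are taken to be the connected components of the graph formed by the accepted variant pairs, which is one plausible reading of the slide, they can be computed with a small union-find structure. A Python sketch under that assumption (the function name `cluster_names` is invented):

```python
def cluster_names(pairs):
    """Group names into clusters: connected components over the variant pairs."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for name in list(parent):
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


pairs = [("Jan", "Johannes"), ("Jan", "Joannes"), ("Jan", "Johan"),
         ("Johannes", "Johan"), ("Dirk", "Dirck")]
clusters = cluster_names(pairs)
print(clusters)
```

With this reading a single erroneous pair is enough to merge two otherwise unrelated clusters, which is one reason to clean the pairs before clustering.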
name clusters
• male first names: 1,221 clusters (16,487 names, 20%)
• female first names: 1,530 clusters (23,816 names, 21%)
compares to number of lemmas in Dutch dictionary of
first names, vd Schaar 1964
• surnames: 11,686 clusters (93,839 names, 17%)
compares to number in Dutch surnames overview (without many
variants), Winkler 1885
conclusions
• person name variants need proof from true
person links
• expert knowledge necessary because errors
cannot be distinguished fully automatically
from true variants (but < 2%)
• final results are promising as a starting point
to create a national repository of proven name
variants