The Problem of Identity
WHY COMPUTERS HAVE TROUBLE TELLING PEOPLE APART
Beau Sharbrough
PO Box 3170
Grapevine TX 76099-3170
INTRODUCTION: What’s the Problem?
SEVERAL TRENDS CONVERGE:
 More people store family history information in
digital form.
 More family history records are available in
digital form.
 Above a certain threshold, we rely more on
computers to help us match people.
This is an intermediate-level lecture about why computers do such a bad job of spotting people.
There will be math.
That silly wasp
Stupid software.
What he said
“Often an identifier may fail to agree precisely on a pair of
records but be obviously similar nevertheless. Examples include
slightly discrepant date of birth components, names that are
sometimes truncated or differ in a few final letters, and
geographical locations that are in proximity but are not quite
the same. The judgement of a human searcher may be
strongly influenced by his or her perception of these partial
similarities and dissimilarities. The machine would make poor
use of the discriminating powers contained in the identifiers if
it could not do as the human does in this matter.”
-- Newcombe
It’s always something
What are the odds that a
computer has a detailed
statistical understanding of
naming patterns in the history
of a population?
What’s the Philosophical Basis for Identity?
Universals and Particulars (Loux 1970)
“What are the conditions which a set of predicates must satisfy for it to count as descriptive of a single individual?” – A.J. Ayer
Is there no difference between things which
cannot be expressed as a difference in
properties? How could one know that a thing
was unique?
More indiscernible ideas
If the number of combinations of perceivable properties is smaller than the number of objects in the set, the principle will fail. Sometimes we simply posit that there are unperceivable properties that individuate the objects.
Put simply, unless there are more than 2000 ways to tell NGS attendees apart, there will be some doubles, some indistinguishable groups.
How do Computers Learn?
 From Knowledge to Knowledge Representation.
 Symbols. To computers, these numbers and strings
represent objects or ideas.
 Procedural Representation. “Knowledge and the
manipulations of that knowledge are inextricably
linked.” Declarative representation is coupled with
Procedural Code to reduce the impact of that
limitation.
Relational Representation. “While relational database
tables are flexible, they are not good at representing
complex relationships between concepts or objects in
the real world.”
Hierarchical Representation. Centers on relationships
and shared attributes between kinds or classes of
objects. “Isa.”
Predicate Logic. Formal logic has its own syntax, which
defines how to make sentences, and its own semantics,
which describe the meaning of the sentences.
 Resolution and Unification. Procedures for resolving
sets of sentences.
 Uncertainty. Unconditional and conditional probabilities. Bayes' Theorem. Fuzzy logic. (A Bayes sketch follows this list.)
 Knowledge Interchange Format. A language that was
expressly designed for the interchange of knowledge
between agents. Based on predicate logic. Supports
the definition of objects, functions, relations, rules,
and metaknowledge.
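To make the uncertainty point concrete, here is a minimal sketch of Bayes' Theorem applied to a match decision, written in Python (the lecture itself uses no code; the prior and the likelihoods below are illustrative, with the agreement frequencies borrowed from the Newcombe-style table later in this talk):

    # Bayes' Theorem for a record-match decision. Prior and likelihoods are
    # illustrative; agreement frequencies come from the table shown later.
    def bayes_update(prior, p_evidence_if_match, p_evidence_if_nonmatch):
        """Return P(match | evidence) from P(match) and the two likelihoods."""
        numerator = p_evidence_if_match * prior
        return numerator / (numerator + p_evidence_if_nonmatch * (1 - prior))

    p_match = 0.01                                   # prior: 1 in 100 candidate pairs is a true match
    p_match = bayes_update(p_match, 0.965, 0.001)    # surnames agree
    p_match = bayes_update(p_match, 0.773, 0.011)    # birth years agree
    print(round(p_match, 3))                         # posterior is now close to 1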
Knowledge Engineering requires a domain expert (the one
who knows what) and a knowledge engineer (the one who
knows how). Knowledge Acquisition turns out to be a
difficult and expensive process:
the experts can’t explain “how” they do it,
experts don’t want to be replaced by tables,
this area becomes a bottleneck.
So What’s the Barrier to Genealogical
Knowledge?
The syntactic-semantic problem in
family history information.
Names, dates, and places have a variety
of syntactic representations, and no
common underlying semantic
representations.
Existing methods of comparing near matches
Names – Soundex (sketched below), initials
Places – hierarchy, exact place-name match, longitude and latitude
Dates – mathematical comparison
Vectors
Problems with granularity of data – full name versus initials or nicknames; city names versus state or county names; exact dates versus approximate dates.
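As a concrete illustration of the name method above, here is a small Python sketch of the standard four-character American Soundex code (a generic implementation, not the code used by any particular genealogy program):

    def soundex(name):
        """American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
        codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}
        name = "".join(c for c in name.upper() if c.isalpha())
        if not name:
            return ""
        digits = [codes.get(name[0], "")]
        for c in name[1:]:
            d = codes.get(c, "")
            if d and d != digits[-1]:
                digits.append(d)               # a new digit
            elif not d and c not in "HW":
                digits.append("")              # vowels separate repeated digits; H and W do not
        return (name[0] + "".join(d for d in digits[1:] if d) + "000")[:4]

    print(soundex("Smith"), soundex("Smyth"))  # S530 S530: near matches collapse to one code

The same example shows the granularity problem: Soundex happily treats Smith and Smyth as the same name, but it also lumps together many unrelated surnames.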
Back to the school for computers
Information Theory, Shannon and Weaver (1949): the unit of information is a bit, and the amount of information in a single binary answer is −log2 P(v), where P(v) is the probability of event v occurring. The information content is based on the prior probabilities of getting the correct answer to a question or classification (Russell and Norvig 1995).
Information gain
 Each subset has its own makeup of p and n outcomes. On average, after testing attribute A, we still need
Remainder(A) = Σ (i = 1 to v) [(pi + ni)/(p + n)] × I(pi/(pi + ni), ni/(pi + ni))
 bits of information, where v is the number of discrete values that attribute A can take. The difference between the information needed before the attribute test and the remainder is called the information gain of the test:
Gain(A) = I(p/(p + n), n/(p + n)) − Remainder(A)
Show your work
Gain(Gender) = 1 − [(500/1000) I(450/500, 50/500) + (500/1000) I(350/500, 150/500)]

= 1 − [(0.5) I(0.9, 0.1) + (0.5) I(0.7, 0.3)]

= 1 − [(0.5)(0.468996) + (0.5)(0.881291)]

= 0.324857
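The arithmetic can be checked in a few lines of Python (a sketch; the 1000-record example with a 500/500 gender split and 450/50 versus 350/150 outcomes is the one used above):

    from math import log2

    def I(*probs):
        """Information, in bits, of a distribution: I(0.5, 0.5) = 1."""
        return -sum(p * log2(p) for p in probs if p > 0)

    remainder = (500/1000) * I(450/500, 50/500) + (500/1000) * I(350/500, 150/500)
    gain = I(500/1000, 500/1000) - remainder
    print(round(I(0.9, 0.1), 6), round(I(0.7, 0.3), 6), round(gain, 6))
    # 0.468996 0.881291 0.324857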
Classifier systems
According to Holland, there are ways to
change a system’s performance as it gains
experience. A system has three components:
sensors, rules, and effectors. Holland
considers the rules as hypotheses that are
constantly undergoing testing and
confirmation. All “facts” in your genealogical
database are alternative, competing
hypotheses. When one hypothesis fails,
competing rules are waiting to be tried.
The basis for resolving competition
between rules is that rule’s usefulness
in the past. We assign each rule a
strength that, over time, comes to
reflect the rule’s usefulness to the
system. The procedure for modifying
strength on the basis of experience is
often called credit assignment.
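A toy sketch of credit assignment (an illustration of the idea only, not Holland's actual bucket-brigade algorithm; the rule names and the learning rate are invented):

    # Each rule carries a strength that drifts up when the rule leads to a
    # confirmed link and down when it leads the system astray.
    rules = {"surname_soundex_agrees": 1.0, "birth_year_within_1": 1.0}

    def assign_credit(rule, succeeded, rate=0.1):
        rules[rule] += rate if succeeded else -rate

    assign_credit("surname_soundex_agrees", True)    # helped confirm a link
    assign_credit("birth_year_within_1", False)      # pointed at a false lead
    print(rules)                                     # strengths now reflect past usefulness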
Demographic procedures
The most common search sequences for
matching are phonetic surname +
alphabetic surname, and alternatively
first given name + year month day of
birth, “these being the most reliable
and discriminating identifiers available“
(emphasis mine).
“The basic idea is very simple. If a NAME, or
an INITIAL, or a MONTH OF BIRTH, or
any other identifier, agrees or disagrees or is
more or less similar or dissimilar in any way,
one simply asks, ‘How typical is that
comparison outcome among LINKED pairs of
records, as compared with UNLINKABLE
pairs brought together at random?’”
FREQUENCY RATIO = (frequency of outcome(x, y) among LINKED pairs) ÷ (frequency of outcome(x, y) among UNLINKABLE pairs)
Where:
 x indicates the identifier and its value on the record from the
file initiating the search (record A);
 y indicates the identifier and its value on the record from the
file being searched (record B);
 LINKED pairs may refer either to all linked pairs, or to a
defined subset of these; and
 UNLINKABLE pairs may refer either to all unlinkable pairs, or
to a defined subset, provided the linked and the unlinkable sets
(or subsets) are otherwise strictly comparable with each other.
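In code, the ratio is a one-liner. A minimal Python sketch with invented counts (965 of 1000 linked pairs agree on surname versus 1 of 1000 unlinkable pairs, matching the table below):

    def frequency_ratio(outcome_in_linked, total_linked,
                        outcome_in_unlinkable, total_unlinkable):
        """How much more often an outcome occurs among LINKED pairs than
        among UNLINKABLE pairs brought together at random."""
        return (outcome_in_linked / total_linked) / (outcome_in_unlinkable / total_unlinkable)

    print(frequency_ratio(965, 1000, 1, 1000))   # 965.0, i.e. a 965/1 ratio for surname agreement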
Examples
 FIRST INITIALS
 AGREEMENT
 DISAGREEMENT
 LETTER “Q”
 YEAR OF BIRTH
 SIMILARITY (difference = 1 year)
 DISSIMILARITY (difference = 11+ years)
 GIVEN NAMES
 SIMILARITY (first 3 letters agree, none disagree – e.g. Sam vs Samuel)
 SIMILARITY + DISSIMILARITY (first 3 letters agree, 4th disagrees – e.g. Samuel vs Sampson)
 DIFFERENT BUT LOGICALLY RELATED IDENTIFIERS
 PLACE of WORK vs PLACE of DEATH (Provo vs Salt Lake City)
Some more examples
Identifier compared       Outcome     Links %   Non-links %   Global frequency ratio (links/non-links)
SURNAME                   Agree         96.5        0.1        965/1
SURNAME                   Disagree       3.5       99.9        1/29
FIRST NAME                Agree         79.0        0.9        88/1
FIRST NAME                Disagree      21.0       99.1        1/5
MIDDLE INITIAL            Agree         88.8        7.5        12/1
MIDDLE INITIAL            Disagree      11.2       92.5        1/8
YEAR OF BIRTH             Agree         77.3        1.1        70/1
YEAR OF BIRTH             Disagree      22.7       98.9        1/4
MONTH OF BIRTH            Agree         93.3        8.3        11/1
MONTH OF BIRTH            Disagree       6.7       91.7        1/14
DAY OF BIRTH              Agree         85.1        3.3        26/1
DAY OF BIRTH              Disagree      14.9       96.7        1/6
STATE/COUNTRY OF BIRTH    Agree         98.1       11.7        8/1
STATE/COUNTRY OF BIRTH    Disagree       1.9       88.3        1/46
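Those global ratios can be combined into a single score for a candidate pair. One common way, sketched here in Python, is to sum the base-2 logarithms of the ratios for the outcomes actually observed (the subset of identifiers chosen is illustrative):

    from math import log2

    # Global frequency ratios (links/non-links) taken from the table above.
    ratios = {("SURNAME", "agree"): 965,        ("SURNAME", "disagree"): 1/29,
              ("FIRST NAME", "agree"): 88,      ("FIRST NAME", "disagree"): 1/5,
              ("YEAR OF BIRTH", "agree"): 70,   ("YEAR OF BIRTH", "disagree"): 1/4}

    def score(outcomes):
        """Sum of log2 frequency ratios: positive favours a link, negative a non-link."""
        return sum(log2(ratios[o]) for o in outcomes)

    # Surname and birth year agree, first name disagrees:
    print(round(score([("SURNAME", "agree"),
                       ("FIRST NAME", "disagree"),
                       ("YEAR OF BIRTH", "agree")]), 2))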
Discrimination
A lookup table containing the
frequencies of values for identifiers, as
they appear in the file being searched.
SURNAMES: Brown (0.39), Aube (0.014), and Skuda (0.00004).
FIRST NAMES: John (5.30), Axel (0.020), and Ulder (0.0045).
Brief Diversion
Utah is the home of GENISYS, c. 1979, as part of the Utah Mormon
Genealogy Project. 1.75 million records, 1 million individuals, on a Data
General Eclipse Model S/250 using the AOS operating system. A system
named DUP (Demographic Utility Package), simplified the interface. The
DUP user was a FORTRAN programmer.
AI Basics
 All cars have 4 wheels.
 Vans are cars.
 Grandpa Smith was born in Virginia.
 Grandma Smith has a van.
Rules - Antecedent and consequent clause. Relationships.
Facts - Existence IS a predicate.
Forward chaining
Conflict resolution procedure
Order of rule selection
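A minimal forward-chaining sketch built on those four statements (the triple representation is invented for illustration):

    # Forward chaining: keep firing rules against known facts until nothing new appears.
    facts = {("isa", "van", "car"),                       # vans are cars
             ("born-in", "Grandpa Smith", "Virginia"),
             ("has", "Grandma Smith", "van")}

    def cars_have_four_wheels(facts):
        # "All cars have 4 wheels": anything that is a car gets 4 wheels.
        return {("wheels", x, 4) for (rel, x, kind) in facts
                if rel == "isa" and kind == "car"}

    changed = True
    while changed:
        derived = cars_have_four_wheels(facts)
        changed = not derived <= facts
        facts |= derived

    print(("wheels", "van", 4) in facts)   # True: Grandma Smith's van has 4 wheels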
More AI basics
 All cars have 4 wheels.
 Vans are cars.
 Grandpa Smith was born in Virginia.
 Grandma Smith has a van.
Backward Chaining
Fuzzy rules can be used to infer probabilities.
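Backward chaining works from the question instead: to answer "does the van have 4 wheels?", the system finds a rule whose consequent matches the goal and tries to prove its antecedents. A tiny propositional sketch (no variables, to keep it short; the statement strings are invented):

    # Rules map a conclusion to the statements that must be proved first.
    rules = {"van has 4 wheels": ["van is a car"],   # all cars have 4 wheels
             "van is a car": []}                      # vans are cars
    facts = {"Grandma Smith has a van", "Grandpa Smith was born in Virginia"}

    def prove(goal):
        """A goal holds if it is a known fact or every antecedent of a rule for it can be proved."""
        if goal in facts:
            return True
        return goal in rules and all(prove(a) for a in rules[goal])

    print(prove("van has 4 wheels"))   # True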
Neural Nets
Neural Networks are not simply another
learning algorithm or regression
technique. They represent a new and
different computing model from the
serial von Neumann computers we all
use.
Types of Neural Nets
Back Propagation
Kohonen maps
Decision Trees
Genetics vs Memetics = Code vs Data
The difference is that you can change the data; the code, by comparison, is fixed and static.
Instinct vs intelligence.
The Current Merging Art
Merging Databases
Merging Individuals
Merging the Rest
Spotting Duplicates
Merge Sources for most popular software
Their own files
GEDCOM
In some cases, files from other programs
In some cases, CD and internet databases
Still, it ends up being like pouring two
cans of paint together.
Merging Individuals
If you want to merge
duplicates, most programs
will make you choose
which “tags” to keep and
throw the rest away.
Merging the Rest
Most programs don’t even import
and merge place tables, source
tables, etc.
I don’t know of any program that
recognizes the same source in
two separate datasets.
Spotting duplicates
Soundex for names (AQ)
Exact spelling or soundex (PAF 3.0)
Exact spelling and exact birth date (FTM)
Many name compares (TMG and UFT)
Soundex surname and user choice of # of
letters in first name (LG)
Warn if duplicate name entered (most)
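As an illustration of the kinds of checks listed above (and only an illustration, not the actual code of any of these programs), here is a Python sketch of two of them: an exact-spelling-plus-exact-birth-date test and a Soundex-surname-plus-first-letters test. It reuses the soundex() function sketched earlier; the records are invented.

    def exact_name_and_birth(a, b):
        """Exact-spelling-and-exact-birth-date style check."""
        return (a["surname"], a["given"], a["birth"]) == (b["surname"], b["given"], b["birth"])

    def soundex_surname_and_given(a, b, letters=1):
        """Soundex surname plus the first N letters of the given name."""
        return (soundex(a["surname"]) == soundex(b["surname"])
                and a["given"][:letters].upper() == b["given"][:letters].upper())

    rec1 = {"surname": "Smith", "given": "John", "birth": "12 Mar 1850"}
    rec2 = {"surname": "Smyth", "given": "J.",   "birth": ""}
    print(exact_name_and_birth(rec1, rec2), soundex_surname_and_given(rec1, rec2))   # False True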
Signs that you can merge better
today than you could before
More formats allowed
Easier individual merging
Identifying routines are becoming more
sophisticated
More storage of conflicting data allowed
More variety in the software marketplace
Signs that we aren’t getting there yet
No formal studies on known datasets to
quantify false positives and false
negatives
No implementation of information
sciences in commercial products
No implementation of AI in commercial
products
No formal discussion of algorithms
Merging tips
Match on parent soundex reduces false
positives (Gaylon Findlay)
If your program won’t let you choose
initials, but has a number-of-letters, try
that with 1.
Beware of people about whom you know
very little.
Beware of blank dates.
The Family History Fingerprint will have ...
A hierarchy of useful comparison algorithms
A method of searching across the Internet
- and paying for it
A method of documenting the source of
that search that satisfies the rules of
preserving intellectual property and
academic research
LeMaster Numbers
The distance between two events
Name, Date, and Place are the axes
Requires a coordinate system
Example of LeMaster Numbers
[Figure: events such as Birth 1, Birth 2, Marriage 1, and Death 1 plotted against the Time, Space, and Name axes.]
A method of coding coordinates
Match the citation to a coordinate value,
don’t convert it and store the converted
value.
Use arbitrary measures and iterate them
with known cases until you get good
correlation.
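A hedged sketch of what such a distance might look like in code. Everything specific here is hypothetical: the lecture only says that Name, Date, and Place are the axes, that citations map to coordinate values, and that the measures should be arbitrary at first and iterated against known cases. The scales (decades, hundreds of miles) and the similarity inputs are invented.

    from math import sqrt

    def event_distance(year1, year2, name_similarity, place_miles):
        """Hypothetical distance between two events on the Name, Time, and Space axes.
        name_similarity is a 0..1 score; the divisors are arbitrary starting measures
        that would be tuned against known cases."""
        name_axis = 1 - name_similarity          # 0 when the names are identical
        time_axis = abs(year1 - year2) / 10      # decades
        space_axis = place_miles / 100           # hundreds of miles
        return sqrt(name_axis**2 + time_axis**2 + space_axis**2)

    # Two birth events five years and thirty miles apart, names fairly similar:
    print(round(event_distance(1850, 1855, name_similarity=0.9, place_miles=30), 3))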
What are we really asking?
How close is close?
How close is close enough for me to look it up?
The future
More on-line records.
More on-line publication by
researchers.
More intelligent software.
More “agents” trawling the networks
looking for what you want.