Deidentification and Reidentification
Salvador Ochoa
Jamie Rasmussen
Christine Robson
Michael Salib
Overview
What is reidentification?
basic mechanisms
examples
Taking a look at the Chicago crime database
some interesting privacy invasions
Legal protections (or lack thereof)
Taking a look at the code
what have we done and how did we do it?
Where do we go from here?
Deidentification and Reidentification 2
3 May 2001
Status Report: Goals Review
Identifying useful databases
done
Finding some surprising results
we’ve got some good stuff
we’re continuing to look into it
Analyzing current and proposed legislation
done
Recommendations for databases and laws
in progress
Paper Progress
Jamie has accepted nomination as editor
Breakdown of paper will be the same as for
this presentation, with the same authorship
divisions
Research and write up:
All datasets identified, description and
critique outlined
Code mostly complete but not yet
translated into English descriptions
Deidentification theory outlined
Paper Progress cont.
Research and write up cont.:
Legal overview researched and outlined
Recommendations for laws in progress
Analysis of deidentification tools in
progress
Recommendations for database
deidentification remains to be done
Overall Status
We have good material for the paper, mostly
complete
We have a coherent structure and work
breakdown
After compiling this presentation, we have a
pretty good idea of what we’re doing
Reidentification
Theory
Reidentification is Scary Stuff
In the (free, publicly available) Chicago
Homicide dataset
4 records are uniquely identifiable by
victim age at death
10,251 records are uniquely identified by
victim age and death date
That's 93.5% of all records that list the
victim's age and death date
Mike was able to reidentify his little brother’s
birth in our hospital outpatient records
Database Basics
What is a database?
Subject No.  Age  Sex  ZIP    Race
1            21   1    02139  1
2            26   0    02138  2
3            19   1    02138  5
4            20   1    02139  3
Entity-specific Data
Person-specific Data
Data Linkage
Table 1
Name              Age  Sex
Salvador Ochoa    21   M
Mike Salib        21   M
Christine Robson  19   F
Jamie Rasmussen   21   M

Table 2
Name              Age  Major
Salvador Ochoa    21   6
Mike Salib        21   6
Christine Robson  19   6, 18
Jamie Rasmussen   21   18

Linked table
Name              Age  Sex  Major
Salvador Ochoa    21   M    6
Mike Salib        21   M    6
Christine Robson  19   F    6, 18
Jamie Rasmussen   21   M    18
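The linkage on this slide can be sketched in a few lines of Python. This is only an illustration: the `link` helper and the dictionary layout are assumptions made for the example, and the join key (name and age) is inferred from the columns the two tables share.

```python
# In-memory copies of the two tables from the slide.
people = [
    {"name": "Salvador Ochoa", "age": 21, "sex": "M"},
    {"name": "Mike Salib", "age": 21, "sex": "M"},
    {"name": "Christine Robson", "age": 19, "sex": "F"},
    {"name": "Jamie Rasmussen", "age": 21, "sex": "M"},
]
majors = [
    {"name": "Salvador Ochoa", "age": 21, "major": "6"},
    {"name": "Mike Salib", "age": 21, "major": "6"},
    {"name": "Christine Robson", "age": 19, "major": "6, 18"},
    {"name": "Jamie Rasmussen", "age": 21, "major": "18"},
]


def link(left, right, keys):
    """Inner-join two lists of dicts on the given key fields (hypothetical helper)."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    linked = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            linked.append({**row, **match})  # merge the columns of both rows
    return linked


linked = link(people, majors, keys=("name", "age"))
for row in linked:
    print(row)
```

Neither table alone pairs sex with major; the linked table does, which is the whole point of data linkage.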
Privacy Concern
Data Explosion
Privacy Protection
Data holders (“data protectors”) need to
ensure the greatest amount of privacy
protection for subjects.
Optimal Release of Data
Balance usefulness with privacy
[Figure: a spectrum running from identifiable data (more useful) to anonymous data (more privacy)]
Classes of Access Policies
Private
Insiders only
Semi-private
Limited access
Semi-public
Deniable access
Public
No restrictions
Deidentification
Explicit Identifiers
Allow for direct communication with
subjects
Deidentification
Removal of all explicit identifiers
Is de-identification enough to ensure
anonymity?
De-identified Data
Definition
Data that results when all explicit
identifiers are removed, generalized, or
replaced with made-up alternatives
Looks anonymous
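A minimal sketch of the three operations in this definition (remove, generalize, replace), assuming a record is a Python dict. The field names, the choice of which fields count as explicit identifiers, and the sample record are all made up for illustration.

```python
from itertools import count


def make_deidentifier():
    """Return a function that de-identifies one record (hypothetical helper)."""
    fake_ids = {}      # real subject id -> made-up replacement
    next_id = count(1)

    def deidentify(record):
        out = dict(record)
        # Remove explicit identifiers outright.
        out.pop("name", None)
        out.pop("address", None)
        # Generalize: keep only the birth year and a 3-digit ZIP prefix.
        out["birth_date"] = record["birth_date"][:4]
        out["zip"] = record["zip"][:3] + "**"
        # Replace the real subject id with a made-up alternative.
        real = record["subject_id"]
        if real not in fake_ids:
            fake_ids[real] = f"S{next(next_id):04d}"
        out["subject_id"] = fake_ids[real]
        return out

    return deidentify


deidentify = make_deidentifier()
record = {
    "subject_id": "123-45-6789",
    "name": "Jane Doe",
    "address": "1 Main St",
    "birth_date": "1980-02-29",
    "zip": "02139",
    "sex": "F",
}
released = deidentify(record)
print(released)
```

The released record looks anonymous, but birth year, ZIP prefix, and sex remain available for linking.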
Anonymous Data
Definition
Data that cannot be manipulated or linked
to identify the entity that is the subject of
the data
De-identified data is NOT anonymous
Reidentification
Ascertaining the identity of individuals who
are the subjects of a study through data linkage
techniques
Possible using Quasi-identifiers
Uniquely (or almost uniquely) map to an
entity
Quasi-Identifier Example
Uniqueness of Cambridge Voters
Birth date alone: 12%
Birth date and gender: 29%
Birth date and 5-digit ZIP: 69%
Birth date and full postal code: 97%
Basically, a few characteristics make a
person unique
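Percentages like these can be computed by counting how many records carry a combination of values that no other record shares. The sketch below assumes records are dicts; the `fraction_unique` helper and the four-row voter list are invented for the example and are not the Cambridge data.

```python
from collections import Counter


def fraction_unique(rows, fields):
    """Fraction of rows whose values for `fields` appear in no other row."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)


# Made-up toy voter list.
voters = [
    {"birth_date": "1979-05-01", "gender": "F", "zip": "02139"},
    {"birth_date": "1979-05-01", "gender": "M", "zip": "02139"},
    {"birth_date": "1979-05-01", "gender": "M", "zip": "02138"},
    {"birth_date": "1980-11-23", "gender": "F", "zip": "02138"},
]

for fields in (
    ("birth_date",),
    ("birth_date", "gender"),
    ("birth_date", "gender", "zip"),
):
    print(fields, fraction_unique(voters, fields))
```

As on the slide, each added field pushes the unique fraction up.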
Arrest Record Database
Fields: Birth Date, Sex, Ethnicity, ZIP, Arrest Date, Violation, Sentence
Voter Registration List
Fields: Name, Address, ZIP, Birth Date, Sex, Date Registered, Party Affiliation
Reidentification
Arrest record database only: Ethnicity, Arrest Date, Violation, Sentence
Shared by both: ZIP, Birth Date, Sex
Voter registration list only: Name, Address, Date Registered, Party Affiliation
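The overlap between the arrest records and the voter list is what makes reidentification possible. The following sketch attaches a name to an arrest record only when exactly one voter shares its ZIP, birth date, and sex. The `reidentify` function, the field names, and every record are hypothetical.

```python
from collections import defaultdict

SHARED = ("zip", "birth_date", "sex")  # quasi-identifiers present in both tables


def reidentify(deidentified, identified, keys=SHARED):
    """Name each de-identified row that matches exactly one identified row."""
    index = defaultdict(list)
    for row in identified:
        index[tuple(row[k] for k in keys)].append(row)
    results = []
    for row in deidentified:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # ambiguous or missing matches are skipped
            results.append({**row, "name": candidates[0]["name"]})
    return results


# Made-up records.
arrests = [
    {"zip": "60601", "birth_date": "1970-01-15", "sex": "M", "violation": "trespass"},
    {"zip": "60601", "birth_date": "1982-07-04", "sex": "F", "violation": "fraud"},
    {"zip": "60615", "birth_date": "1965-03-09", "sex": "M", "violation": "theft"},
]
voters = [
    {"name": "A. Example", "zip": "60601", "birth_date": "1970-01-15", "sex": "M"},
    {"name": "B. Example", "zip": "60601", "birth_date": "1982-07-04", "sex": "F"},
    {"name": "C. Example", "zip": "60601", "birth_date": "1982-07-04", "sex": "F"},
]

matches = reidentify(arrests, voters)
print(matches)
```

Only the first arrest is reidentified here: the second matches two voters and the third matches none.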
Why Reidentify?
Scientific research
Investigative reporting
Marketing
Blackmail
Stalking
Insurance
Political action
Chicago Homicide Data
Chicago Data Overview
Source: Illinois Criminal Justice Information
Authority
Size: 4.8 MB
Dates Covered: 1965-1995 (only 1982-1995
have death date, location code)
Record Count: 23,817 victims (data on
offenders available in separate database)
Covers: Chicago
Cost: FREE!
Nationwide Homicide
[Figure: bar chart "1995 U.S. Deaths: Homicide", number of deaths by age range (years)]
Chicago Homicide
[Figure: bar chart "Chicago Homicides, 1982-1995", number of deaths by age range (years)]
What are we doing?
LaTanya has done lots of theory and some
experiments
Most of that involves medical data
We wanted to try for ourselves to see just how
easy it is and how much we could do
We wanted to look at non-healthcare data - no
one else has
So how did we do it?
We needed two data sets - deidentified data
and a control data set
Jamie will talk about finding a control data
set later
So how did we do it? Cont.
Deidentified data (outside of healthcare) is
hard to find because:
There's tons of identifying data available
(i.e., for marketing purposes)
There's tons of aggregate statistical data
available
Everyone seems to assume that people
who need data are too stupid to use
Excel
There's little need to release individual
level data without names
Database selection criteria
Our first DB to reidentify had to:
Be small, since we don't know what we're
doing
Contain incriminating or useful info to
amuse us
Be easy to verify - so we can tell how
good we are
Be publicly accessible and cheap (or free)
Our Database of Choice
The only non medical data we could find was
crime data
The Bureau of Justice Statistics is your
friend
We love you Louis Freeh!
We selected Homicides in Chicago (1965-1995)
Murder in Chicago
Dataset covered every homicide in Chicago
from 1965 to 1995
Included juicy info on both offender and
victims such as:
Victim-offender relationship and past
criminal history of both
Weapon type, drugs, alcohol, gangs, child
abuse fields
Murder in Chicago cont.
Looked like a candidate for reidentification
because:
It had death dates and ages for victims
It had fine-grained geographical info on
where the homicides took place
It contained gender and race info for
victims and offenders
Everyone listed had to be from Chicago
(almost true)
Uniqueness of Chicago Data
The Chicago Homicide Dataset contains
10,963 records with valid death dates and ages.
10,251, or 93.5% of {death date, age}
events are unique
680, or 6.2% match one other event
8, or .073% match two other events
2, or .018% match three other events
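A breakdown like this one comes from grouping records by {death date, age} and asking, for each record, how many other records share its event. The sketch below runs on made-up events; `match_distribution` is a hypothetical helper, and the last line only re-derives the slide's headline percentage from the slide's own counts.

```python
from collections import Counter


def match_distribution(events):
    """Map 'number of other records sharing this event' to a count of records."""
    group_sizes = Counter(events)
    return Counter(group_sizes[e] - 1 for e in events)


# Made-up (death date, age) events.
events = [
    ("1990-01-01", 30),
    ("1990-01-02", 30),
    ("1990-01-03", 41),
    ("1990-02-10", 25),
    ("1990-02-10", 25),
]

dist = match_distribution(events)
total = len(events)
for others in sorted(dist):
    share = dist[others] / total
    print(f"{dist[others]} records ({share:.1%}) match {others} other event(s)")

# Re-deriving the headline figure from the counts on the slide.
print(f"{10251 / 10963:.1%}")
```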
Dead People, Who Cares?
No one, but . . .
Info about them says a lot about live people
Example: a woman is murdered. The DB
tells us that the relationship between victim
& offender is "spousal"
We could easily do this on live people if you
give us $5K
Actually, the dead make up a sizeable voting
bloc in Chicago.
Potential Embarrassment
Interesting relationships between offenders
and victims:
hired killer, target for contract, cell mates
pimp, prostitute, prostitute client
sexual rivals, homosexual acquaintance,
homosexual couple
gambler, drug pusher, drug buyer
(rival) gang member
Potential Embarrassment cont.
Was the child being abused? Was the killing
domestic violence related?
Were drugs, alcohol or gang violence
involved?
Was the victim killed while committing a
crime?
Dead People Statistics
23,000 victims
Only 10,000 have useful data
If you want to do geographical mapping, you
need to restrict your data set to people who
died at home
so you can guarantee that location of death
equals residence
This cuts you down to about 3,000
Only 2,000 have geographical info that
maps uniquely to zip codes
Look! It's the incredible shrinking database!
Social Security
Death Index
What is the SSDI?
Social Security Death Master File
About 65 million entries
Contains: last name, first name, date of
birth, date of death, zip code of last
residence, zip code of last payment, SSN,
and issuing state
98% are individuals who died after 1962
The SSA began keeping the database on
computers in 1962
What is the SSDI? cont.
Only contains deaths reported to the SSA
Reported if person was getting benefits
Often reported by funeral directors as
part of their services
Available free on the web
http://ssdi.genealogy.rootsweb.com/cgi-bin/ssdi.cgi/
http://www.ancestry.com/search/rectype/vital/ssdi/
But is it useful to join with Chicago data?
SSDI Sample
SSDI Completeness
[Figure: bar chart of recorded deaths per year, 1994-1996, split into "In SSDI" and "Not in SSDI"]
SSDI Records by Year
[Figure: bar chart "SSDI Record Count by Year (for RootsWeb)", number of records per year, 1982-1995]
SSDI Age Breakdown
[Figure: bar chart "SSDI: Chicago 1995 Age Breakdown", number of records by age range (years)]
Census Bureau Deaths
[Figure: bar chart "1995 U.S. Deaths", number of deaths by age range (years)]
Our Original Approach
Only examine people who died at home
We can infer that they are residents, and
not random tourists
We can then get their home zip code by
manipulating murder location
Victims that have a valid death year: 10,981
Victims that died at home: 7,636
Victims that died at home and have a valid
death year: 3,143
Our Original Approach Cont.
Victims that died at home, have a valid death
year, and died in a census tract which maps to
only one zip code: 2,862
Of those, the number that match a single
SSDI record based on (death date, birth year,
county of residence): 1165
Victims that match a single person in the
SSDI based on (death date, birth year, zip
code): about 30
Our New Approach
Our original approach didn't take the birth-year
ambiguity into account
Ignore all geographic data
Look at death date and age then gender and
race
Analyze uniqueness of SSDI records
May give us more false positives
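The birth-year ambiguity arises because the homicide data gives an age at death while the SSDI gives a date of birth: a victim who died at age a in year Y was born in Y - a if the birthday had already passed, and in Y - a - 1 otherwise. A small sketch, with a hypothetical function name:

```python
from datetime import date


def candidate_birth_years(death_date, age_at_death):
    """Birth years consistent with dying at `age_at_death` on `death_date`."""
    latest = death_date.year - age_at_death  # birthday already passed that year
    # Only a death on 31 December pins the birth year down exactly.
    if (death_date.month, death_date.day) == (12, 31):
        return [latest]
    return [latest - 1, latest]  # otherwise the birthday may not have come yet


print(candidate_birth_years(date(1990, 6, 15), 30))
print(candidate_birth_years(date(1990, 12, 31), 30))
```

A match on birth year therefore has to accept either candidate year, which is one reason a join on a single computed birth year misses true matches.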
15 Minute Break
Legal Protections
(or lack thereof)
Are we breaking the law?
As independent private citizen researchers
No, in fact what we are doing is probably
protected by the First Amendment
We could probably even sell the results
What if we were a corporate entity?
No, although if we were a “credit bureau,”
we couldn’t distribute our findings
What if we were the government?
Yes, because government agencies are not
allowed to combine databases with each
other
Why isn’t this illegal?
The easy answer is “because there isn’t a law
against it.”
Publicly available statistics are there to be
used. Combining them is just an extension of
that use.
It is difficult to write laws for this situation
without restricting valid statistical uses of the
data.
Legal Problems
Privacy Act vs. Freedom of Information Act
two important pieces of legislation,
fundamentally at odds with each other
FOIA includes some privacy clauses
Many examples of FOIA and the Privacy Act
clashing in the courtroom
few deal with databases
fewer still deal with reidentification
Southern Illinoisan vs. DPH
The case
The Department of Public Health denied a
newspaper’s request for release of
neuroblastoma statistics
The holding
release of the data does not constitute an
“unwarranted invasion of privacy,” despite
LaTanya’s research
What data is protected?
Medical
distribution protected by HIPAA
LaTanya’s work is stirring up reidentification fears
Criminal
distribution of statistics regulated
recombining databases is not addressed
What data is protected? cont.
All personal data given to a company or
government agency is protected by the Privacy
Act
Some special, additional protections exist for
other data
financial information is protected under
the Fair Credit Reporting Act
driver's license information is protected
under the Driver's Privacy Protection Act
Title 42 Protections
Sec. 3789g, 3732, 10505
Maintenance of crime-related databases
and release of crime statistics must conform
to certain security and privacy restrictions,
in compliance with the Privacy Act
However, Bureau of Justice Statistics is
required to collect these statistics and make
them publicly available
CFR Regulations
42 CFR part 2a
identity protection for research subjects
45 CFR parts 160 and 162 (1999)
guidelines for distributing statistical
medical information.
45 CFR parts 163 and 164 (not finalized)
“Implementation Specifications For De-Identifying Information”
under debate as we speak
Proposed Legislation
Medical Information Protection and Research
Enhancement Act of 2001
Financial Information Privacy Protection Act
of 2001
What data is NOT protected?
Right now, the law concerns itself only with
private information that someone has entrusted
to you (or your company)
Information you discover for yourself (i.e.,
through reidentification) is not well regulated.
German Laws
An example of a real-life deidentification
law, attempting to address these issues
“It is prohibited to match individual data
from federal statistics … for establishing a
reference to persons, enterprises,
establishments or local units for other than
the statistical purposes …”
Datasets and Code
What I'm going to tell you
Requirements
30 Second Intro to RDBMSs & SQL
Snarfing & Loading
Matching!
Doh! Matching, take 2
Curse you and your promises of spatial
invariance!
Doh! Matching, take 3
Sex is good
Verification - does it actually work?
Deidentification Techniques that don't Suck
Requirements
Take two large tables of data (think rows
& columns)
Large means > 10,000 rows
Combine one row from one table to at most
one row in the control table
Combine based on several fields in
individual rows
Not all rows will match - deal with it
Requirements cont.
This had better be fast
Don't you dare hardcode this cause those
table structures WILL change
The system must be cheap (free)
It had better run on cheap hardware
We shouldn't have to learn too much (we're
lazy)
SSDI Spider
Implemented in Perl
RootsWeb only gives you 15 records per
page
We sleep for 15 seconds between page grabs
to avoid hammering their server
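The original spider was written in Perl and is not reproduced here. The following Python sketch only illustrates the page-by-page loop with a pause between requests; `fetch_page` is a hypothetical callable standing in for the real HTTP request, and the example uses a fake in-memory server so nothing touches the network.

```python
import time

PAGE_SIZE = 15       # records per results page, per the slide
DELAY_SECONDS = 15   # pause between page grabs, per the slide


def crawl(fetch_page, delay=DELAY_SECONDS, sleep=time.sleep):
    """Collect records page by page, pausing after every page.

    `fetch_page(page_number)` is a hypothetical callable returning the list of
    records on that page, or an empty list once the results run out.
    """
    records = []
    page = 0
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        records.extend(batch)
        page += 1
        sleep(delay)  # be polite: never hit the server back to back
    return records


# Usage with a fake "server" and no real waiting.
fake_results = [f"record {i}" for i in range(40)]


def fake_fetch(page):
    start = page * PAGE_SIZE
    return fake_results[start:start + PAGE_SIZE]


pauses = []
collected = crawl(fake_fetch, sleep=pauses.append)
print(len(collected), "records,", len(pauses), "pauses")
```

Passing `sleep` as a parameter keeps the delay in the real crawl while letting the example run instantly.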
Spider Flow Chart
At approx. 20,000 records/year for Chicago,
this takes about 6 hours per year
For the 14 years from 1982 to 1995, it took
about 3.5 days of downloading
Totaled 18.3 MB of data
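These timings follow from the spider's settings. The back-of-the-envelope check below counts only the 15-second pauses at 15 records per page, so it gives a lower bound; time spent actually fetching pages would account for the rest of the roughly 6 hours per year and 3.5 days overall.

```python
import math

RECORDS_PER_YEAR = 20_000   # approximate, from the slide
RECORDS_PER_PAGE = 15       # RootsWeb page size, from the spider slide
DELAY_SECONDS = 15          # pause between page grabs
YEARS = 1995 - 1982 + 1     # 14 years, inclusive

pages_per_year = math.ceil(RECORDS_PER_YEAR / RECORDS_PER_PAGE)
hours_per_year = pages_per_year * DELAY_SECONDS / 3600
total_days = hours_per_year * YEARS / 24

print(pages_per_year, "pages per year")
print(round(hours_per_year, 1), "hours of pauses per year")
print(round(total_days, 1), "days of pauses in total")
```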
Database Tools
PostgreSQL
Microsoft SQL Server 7.0
Scripts to parse data in variable-length
format
Uniqueness SQL
-- Find {death date, age} combinations that occur exactly once.
SELECT [ deathyr] AS Year, [ deathmon] AS Month,
       [ deathdte] AS Day, [ vicage] AS Age
FROM [dbo].[victim]
-- Skip values ending in 98 or 999 (apparently codes for missing data).
WHERE ([ deathyr] NOT LIKE '%98')
  AND ([ deathmon] NOT LIKE '%98')
  AND ([ deathdte] NOT LIKE '%98')
  AND ([ vicage] NOT LIKE '%999')
GROUP BY [ deathyr], [ deathmon], [ deathdte], [ vicage]
HAVING COUNT(*) = 1
ORDER BY [ deathyr], [ deathmon], [ deathdte], [ vicage]
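To try the same GROUP BY / HAVING COUNT(*) = 1 pattern without SQL Server, here is a sketch using Python's built-in sqlite3 module. The in-memory table, its text column types, the column names without the leading space, and the five rows are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE victim (deathyr TEXT, deathmon TEXT, deathdte TEXT, vicage TEXT)"
)
conn.executemany(
    "INSERT INTO victim VALUES (?, ?, ?, ?)",
    [
        ("1990", "01", "15", "30"),   # unique event
        ("1990", "01", "16", "22"),   # shared with the next row
        ("1990", "01", "16", "22"),
        ("1991", "98", "98", "40"),   # month/day ends in 98, filtered out
        ("1992", "03", "02", "999"),  # age ends in 999, filtered out
    ],
)

unique_events = conn.execute(
    """
    SELECT deathyr, deathmon, deathdte, vicage
    FROM victim
    WHERE deathyr NOT LIKE '%98'
      AND deathmon NOT LIKE '%98'
      AND deathdte NOT LIKE '%98'
      AND vicage NOT LIKE '%999'
    GROUP BY deathyr, deathmon, deathdte, vicage
    HAVING COUNT(*) = 1
    ORDER BY deathyr, deathmon, deathdte, vicage
    """
).fetchall()
print(unique_events)
```

Changing the HAVING clause to COUNT(*) = 2 would list the events shared by exactly two records.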
Other Deidentified
Datasets
AIDS Patients Data
Source: Centers for Disease Control &
Prevention, Division of HIV/AIDS Prevention
Size: 23.6 MB
Dates Covered: 1981 – 1998
Record Count: 688,200
Covers: Entire United States
Cost: $25
AIDS Patients Data cont.
age - Age group at diagnosis
dxdate - Month of diagnosis
gender - Sexual classification of patient
race - Race of patient
death - Vital status of patient
msa - Region of residence at diagnosis of
AIDS
sexbi - Sex with a bisexual man (women
only)
sexiv - Sex with an injecting drug user
Outpatient Data
Source: National Center for Health Statistics
(NCHS)
Size: Huge
Dates Covered: 1965 - present
Record Count: Huge
Covers: Entire United States
Cost: Varies, depending on provider and
coverage.
Outpatient Data cont.
Age
Race
Gender
Marital Status
Geographic Region
Diagnosis – e.g. Abortion, AIDS
Malpractice Data
Source: U.S. Department of Health and
Human Services
Size: 37 MB
Dates Covered: 1 Sep 1990 – 31 Dec 1999
Record Count: 227,541
Covers: Entire United States
Cost: state slice, $20; entire U.S., $55
Malpractice Data cont.
Practitioner Number – Allows linking
within datasets
Work state
Home State
Field of License – e.g. Dentist or Nuclear
Pharmacist
Age Group
Malpractice Data cont.
Malpractice Code – e.g. Diagnosis,
Unnecessary Tests or Surgery, Wrong
Body Part
Payment
Adverse Actions – e.g. Revocation of
License or Denial of Professional Society
Membership
Robberies Data
Source: Inter-university Consortium for
Political and Social Research (ICPSR)
Size: 759 KB
Dates Covered: 1982-1983
Record Count: 7,216
Covers: Chicago
Cost: FREE!
Robberies Data cont.
Victim Age Range – e.g. Baby, Young Adult,
Old Adult
Victim Race
Victim Gender
Victim Marital Status
Victim Employment – e.g. Unemployed,
Self-Employed, Full Time Student
Victim Area / District of Residence
Robberies Data cont.
Gang Membership – Yes, No, Ex-member,
Probable member, etc.
Victim/Offender Relationship – e.g. Ex
boyfriend-girlfriends, strangers, Drug
dealer/buyer
Victim Dealing Drugs?
Juveniles Data
Source: Arkansas Administrative Office of
the Courts
Size: 7.1 MB
Dates Covered: 1991 – 1994
Record Count: 55,467
Covers: Arkansas (other states publish this
info as well)
Cost: FREE!
Juveniles Data cont.
County Identification
Race
Sex
Date of Birth
Type of Charge – e.g. Felony,
Misdemeanor
Offense – e.g. Capital murder, Rape,
Sodomy, Gaming in house or steamboat,
Use of intoxicating, stupefying substance,
Unlawful packaging of strawberries, or Use of
x-ray shoe-fitting machines
Conclusion
Further Work
Validation/verification of our results
Other datasets
Reidentify the prison record dataset
Analyze different deidentification techniques
(LaTanya’s work)
Determine how effective they are on our
Chicago Homicide dataset
Summary
An anonymous database system makes
individual and entity-specific data available
such that individuals and other entities
contained in the released data cannot be
reliably identified
Most databases are not anonymous
Therefore, reidentification is possible
Motivations for reidentification exist,
whereas barriers do not
Summary cont.
Better, more complete legal policy is needed
More importantly, data holders should employ
technical means to make their data anonymous
Questions?