Deidentification and Reidentification
Salvador Ochoa
Jamie Rasmussen
Christine Robson
Michael Salib
Overview
 What is reidentification?
 basic mechanisms
 examples
 Taking a look at the Chicago crime database
 some interesting privacy invasions
 Legal protections (or lack thereof)
 Taking a look at the code
 what have we done and how did we do it?
 Where do we go from here?
Deidentification and Reidentification 2
3 May 2001
Status Report: Goals Review
 Identifying useful databases
 done
 Finding some surprising results
 we’ve got some good stuff
 we’re continuing to look into it
 Analyzing current and proposed legislation
 done
 Recommendations for databases and laws
 in progress
Paper Progress
 Jamie has accepted nomination as editor
 Breakdown of paper will be the same as for
this presentation, with the same authorship
divisions
 Research and write up:
 All datasets identified, description and
critique outlined
 Code mostly complete but not yet
translated into English descriptions
 Deidentification theory outlined
Paper Progress cont.
 Research and write up cont.:
 Legal overview researched and outlined
 Recommendations for laws in progress
 Analysis of deidentification tools in
progress
 Recommendations for database
deidentification remains to be done
Overall Status
 We have good material for the paper, mostly
complete
 We have a coherent structure and work
breakdown
 After compiling this presentation, we have a
pretty good idea of what we’re doing
Reidentification Theory
Reidentification is Scary Stuff
 In the (free, publicly available) Chicago
Homicide dataset
 4 records are uniquely identifiable by
victim age at death
 10,251 records are uniquely identified by
victim age and death date
 That's 93.5% of all records that list the
victim's age and death date
 Mike was able to reidentify his little brother’s
birth in our hospital outpatient records
Database Basics
 What is a database?
Subject No.  Age  Sex  ZIP    Race
1            21   1    02139  1
2            26   0    02138  2
3            19   1    02138  5
4            20   1    02139  3
Entity-specific Data
 Person-specific Data

Data Linkage
Name              Age  Sex
Salvador Ochoa    21   M
Mike Salib        21   M
Christine Robson  19   F
Jamie Rasmussen   21   M

Name              Age  Major
Salvador Ochoa    21   6
Mike Salib        21   6
Christine Robson  19   6, 18
Jamie Rasmussen   21   18

Name              Age  Sex  Major
Salvador Ochoa    21   M    6
Mike Salib        21   M    6
Christine Robson  19   F    6, 18
Jamie Rasmussen   21   M    18
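The linkage on this slide can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the `link` function and the choice of `name` and `age` as join keys are assumptions, and the rows are copied from the slide.

```python
# Two toy tables that share the "name" and "age" fields (values from the slide).
people = [
    {"name": "Salvador Ochoa", "age": 21, "sex": "M"},
    {"name": "Mike Salib", "age": 21, "sex": "M"},
    {"name": "Christine Robson", "age": 19, "sex": "F"},
    {"name": "Jamie Rasmussen", "age": 21, "sex": "M"},
]
majors = [
    {"name": "Salvador Ochoa", "age": 21, "major": "6"},
    {"name": "Mike Salib", "age": 21, "major": "6"},
    {"name": "Christine Robson", "age": 19, "major": "6, 18"},
    {"name": "Jamie Rasmussen", "age": 21, "major": "18"},
]


def link(left, right, keys):
    """Join two lists of dicts on the given shared fields (hypothetical helper)."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    linked = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            linked.append({**row, **match})  # merged row carries fields of both tables
    return linked


linked = link(people, majors, keys=("name", "age"))
for row in linked:
    print(row)
```

Each linked row now carries Sex and Major together, which neither table held on its own.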
Privacy Concern
 Data Explosion
 Privacy Protection
 Data holders (“data protectors”) need to
ensure the greatest amount of privacy
protection for subjects.
Optimal Release of Data
 Balance usefulness with privacy
[Diagram: a spectrum running from Identifiable (more useful) to Anonymous (more privacy)]
Classes of Access Policies
 Private
 Insiders only
 Semi-private
 Limited access
 Semi-public
 Deniable access
 Public
 No restrictions
Deidentification
 Explicit Identifiers
 Allow for direct communication with
subjects
 Deidentification
 Removal of all explicit identifiers
 Is de-identification enough to ensure
anonymity?
De-identified Data
 Definition
 Data that results when all explicit
identifiers are removed, generalized, or
replaced with made-up alternatives
 Looks anonymous
Anonymous Data
 Definition
 Data that cannot be manipulated or linked
to identify the entity that is the subject of
the data
 De-identified data is NOT anonymous
Reidentification
 Ascertaining the identity of individuals who
are the subjects of a study through data linkage
techniques
 Possible using Quasi-identifiers
 Uniquely (or almost uniquely) map to an
entity
Quasi-Identifier Example
 Uniqueness of Cambridge Voters
Fields                            Unique
Birth date alone                  12%
Birth date and gender             29%
Birth date and 5-digit ZIP        69%
Birth date and full postal code   97%
 Basically, a few characteristics make a
person unique
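A uniqueness figure like the ones in this table can be computed by counting how many records share each combination of field values. The sketch below is illustrative only: the `unique_fraction` helper and the four voter rows are made up, and they are not the Cambridge data.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose values for `fields` occur exactly once."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Made-up rows for illustration; not real voter data.
voters = [
    {"birth_date": "1979-03-01", "gender": "F", "zip": "02139"},
    {"birth_date": "1979-03-01", "gender": "M", "zip": "02139"},
    {"birth_date": "1979-03-01", "gender": "M", "zip": "02138"},
    {"birth_date": "1980-07-15", "gender": "F", "zip": "02138"},
]

by_birth = unique_fraction(voters, ["birth_date"])
by_birth_gender = unique_fraction(voters, ["birth_date", "gender"])
by_all = unique_fraction(voters, ["birth_date", "gender", "zip"])
print(by_birth, by_birth_gender, by_all)
```

As in the table, adding fields to the combination can only keep the unique fraction the same or raise it.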
Arrest Record Database
Fields: Birth Date, Sex, Ethnicity, ZIP, Arrest Date, Violation, Sentence
Voter Registration List
Fields: Name, Address, ZIP, Birth Date, Sex, Date Registered, Party Affiliation
Reidentification
Arrest record only: Ethnicity, Arrest Date, Violation, Sentence
Shared by both: ZIP, Birth Date, Sex
Voter list only: Name, Address, Date Registered, Party Affiliation
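One way to picture the overlap between the arrest records and the voter list is a lookup keyed on the shared fields (ZIP, birth date, sex) that accepts a match only when exactly one voter fits. The sketch below is a hedged illustration: `reidentify`, the field names, and all rows are hypothetical.

```python
from collections import defaultdict

QUASI_IDENTIFIER = ("zip", "birth_date", "sex")  # fields present in both tables


def reidentify(deidentified, identified, quasi=QUASI_IDENTIFIER):
    """Attach a name to each deidentified record that matches exactly one person."""
    by_key = defaultdict(list)
    for person in identified:
        by_key[tuple(person[f] for f in quasi)].append(person)
    results = []
    for record in deidentified:
        candidates = by_key.get(tuple(record[f] for f in quasi), [])
        if len(candidates) == 1:  # ambiguous or missing matches are skipped
            results.append({**record, "name": candidates[0]["name"]})
    return results


# Made-up rows for illustration only.
arrests = [
    {"zip": "60601", "birth_date": "1970-01-02", "sex": "M", "violation": "A"},
    {"zip": "60602", "birth_date": "1965-05-06", "sex": "F", "violation": "B"},
]
voters = [
    {"name": "Voter One", "zip": "60601", "birth_date": "1970-01-02", "sex": "M"},
    {"name": "Voter Two", "zip": "60602", "birth_date": "1965-05-06", "sex": "F"},
    {"name": "Voter Three", "zip": "60602", "birth_date": "1965-05-06", "sex": "F"},
]

matches = reidentify(arrests, voters)
print(matches)
```

The second arrest record stays unnamed because two voters share its quasi-identifier values.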
Why Reidentify?
 Scientific research
 Investigative reporting
 Marketing
 Blackmail
 Stalking
 Insurance
 Political action
Chicago Homicide Data
Chicago Data Overview
 Source: Illinois Criminal Justice Information
Authority
 Size: 4.8 MB
 Dates Covered: 1965-1995 (only 1982-1995
have death date, location code)
 Record Count: 23,817 victims (data on
offenders available in separate database)
 Covers: Chicago
 Cost: FREE!
Nationwide Homicide
[Chart: 1995 U.S. Deaths: Homicide. y-axis: # of Deaths (0 to 7,000); x-axis: Age Range (years), bins [0,10) through [80,110)]
Chicago Homicide
[Chart: Chicago Homicides, 1982-1995. y-axis: # of Deaths (0 to 9,000); x-axis: Age Range (years), bins [0,10) through [80,110)]
What are we doing?
LaTanya has done lots of theory and some
experiments
Most of that involves medical data
We wanted to try for ourselves to see just how
easy it is and how much we could do
We wanted to look at non-healthcare data - no
one else has
So how did we do it?
 We needed two data sets - deidentified data
and a control data set
 Jamie will talk about finding a control data
set later
So how did we do it? Cont.
 Deidentified data (outside of healthcare) is
hard to find because:
 There's tons of identifying data available
(e.g., for marketing purposes)
 There's tons of aggregate statistical data
available
 Everyone seems to assume that people
who need data are too stupid to use
Excel
 There's little need to release individual-level
data without names
Database selection criteria
 Our first DB to reidentify had to:
 Be small, since we don't know what we're
doing
 Contain incriminating or useful info to
amuse us
 Be easy to verify - so we can tell how
good we are
 Be publicly accessible and cheap (or free)
Our Database of Choice
 The only non-medical data we could find was
crime data
 The Bureau of Justice Statistics is your
friend
 We love you Louis Freeh!
 We selected Homicides in Chicago (1965-1995)
Murder in Chicago
 Dataset covered every homicide in Chicago
from 1965 to 1995
 Included juicy info on both offender and
victims such as:
 Victim-offender relationship and past
criminal history of both
 Weapon type, drugs, alcohol, gangs, child
abuse fields
Murder in Chicago cont.
 Looked like a candidate for reidentification
because:
 It had death dates and ages for victims
 It had fine grained geographical info on
where the homicides took place
 It contained gender and race info for
victims and offenders
 Everyone listed had to be from Chicago
(almost true)
Uniqueness of Chicago Data
 The Chicago Homicide Dataset contains
10,963 records with valid death dates and ages.
 10,251, or 93.5%, of {death date, age}
events are unique
 680, or 6.2%, match one other event
 8, or 0.073%, match two other events
 2, or 0.018%, match three other events
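The breakdown above is a histogram of how many other records share each {death date, age} pair. Below is a minimal sketch of that tally on made-up events, followed by a check of the slide's percentages against the stated total of 10,963. The `match_histogram` helper is hypothetical, and counting per record (rather than per group) is an assumption about how the slide's figures were produced.

```python
from collections import Counter


def match_histogram(events):
    """Map 'number of other records with the same key' to a record count."""
    group_sizes = Counter(events)
    return Counter(group_sizes[e] - 1 for e in events)


# Made-up (death date, age) pairs for illustration.
events = [
    ("1990-01-01", 30),
    ("1990-01-01", 30),
    ("1990-01-02", 30),
    ("1991-06-05", 45),
]
hist = match_histogram(events)
print(dict(hist))

# Recompute the slide's percentages from its own counts.
total = 10963
shares = {count: 100 * count / total for count in (10251, 680, 8, 2)}
for count, pct in shares.items():
    print(f"{count:>6} of {total}: {pct:.3f}%")
```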
Dead People, Who Cares?
 No one, but . . .
 Info about them says a lot about live people
 Example: a woman is murdered. The DB
tells us that the relationship between victim
& offender is "spousal", so naming the victim
also names her husband as the offender
 We could easily do this on live people if you
give us $5K
 Actually, the dead make up a sizeable voting
block in Chicago.
Potential Embarrassment
 Interesting relationships between offenders
and victims:
 hired killer, target for contract, cell mates
 pimp, prostitute, prostitute client
 sexual rivals, homosexual acquaintance,
homosexual couple
 gambler, drug pusher, drug buyer
 (rival) gang member
Potential Embarrassment cont.
 Was the child being abused?
 Was the killing domestic violence related?
 Were drugs, alcohol or gang violence
involved?
 Was the victim killed while committing a
crime?
Dead People Statistics
 23,000 victims
 Only 10,000 have useful data
 If you want to do geographical mapping, you
need to restrict your data set to people who
died at home
 so you can guarantee that location of death
equals residence
 This cuts you down to about 3,000
 Only 2,000 have geographical info that
maps uniquely to ZIP codes
 Look! It's the incredible shrinking database!
Social Security Death Index
What is the SSDI?
 Social Security Death Master File
 About 65 million entries
 Contains: last name, first name, date of
birth, date of death, zip code of last
residence, zip code of last payment, SSN,
and issuing state
 98% is individuals who died after 1962
 The SSA began keeping the database on
computers in 1962
What is the SSDI? cont.
 Only contains deaths reported to the SSA
 Reported if person was getting benefits
 Often reported by funeral directors as
part of their services
 Available free on the web
 http://ssdi.genealogy.rootsweb.com/cgi-bin/ssdi.cgi/
 http://www.ancestry.com/search/rectype/vital/ssdi/
 But is it useful to join with Chicago data?
SSDI Sample
SSDI Completeness
[Chart: SSDI Completeness. y-axis: Recorded Deaths (0 to 2,500,000); x-axis: Year, 1994 through 1996; series: In SSDI, Not in SSDI]
SSDI Records by Year
[Chart: SSDI Record Count by Year (for RootsWeb). y-axis: # of Records (0 to 25,000); x-axis: Year, 1982 through 1995]
SSDI Age Breakdown
[Chart: SSDI: Chicago 1995 Age Breakdown. y-axis: # of Records (0 to 9,000); x-axis: Age Range (years), bins [0,10) through [80,110)]
Census Bureau Deaths
[Chart: 1995 U.S. Deaths. y-axis: # of Deaths (0 to 1,000,000); x-axis: Age Range (years), bins [0,10) through [80,110)]
Our Original Approach
 Only examine people who died at home
 We can infer that they are residents, and
not random tourists
 We can then get their home zip code by
manipulating murder location
 Victims that have a valid death year: 10,981
 Victims that died at home: 7,636
 Victims that died at home and have a valid
death year: 3,143
Our Original Approach Cont.
 Victims that died at home, have a valid death
year, and died in a census tract which maps to
only one zip code: 2,862
 Of those, the number that match a single
SSDI record based on (death date, birth year,
county of residence): 1165
 Victims that match a single person in the
SSDI based on (death date, birth year, zip
code): about 30
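The step that keeps only census tracts mapping to a single ZIP code could look roughly like the sketch below. The `single_zip_tracts` helper, the tract identifiers, and the ZIP codes are all hypothetical; the real tract-to-ZIP correspondence would come from an external lookup table.

```python
from collections import defaultdict


def single_zip_tracts(tract_zip_pairs):
    """Return {tract: zip} for tracts that overlap exactly one ZIP code."""
    zips_by_tract = defaultdict(set)
    for tract, zip_code in tract_zip_pairs:
        zips_by_tract[tract].add(zip_code)
    return {
        tract: next(iter(zips))
        for tract, zips in zips_by_tract.items()
        if len(zips) == 1
    }


# Made-up tract/ZIP overlaps for illustration.
pairs = [
    ("tract-A", "60601"),
    ("tract-B", "60602"),
    ("tract-B", "60603"),  # tract-B straddles two ZIP codes, so it is dropped
    ("tract-C", "60601"),
]
usable = single_zip_tracts(pairs)
print(usable)
```

Victims whose tract is not in the resulting mapping have no unambiguous home ZIP code and drop out of the ZIP-based match.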
Our New Approach
 Our original approach didn’t take the birth year
ambiguity effect into account (an age at death
fits two possible birth years)
 Ignore all geographic data
 Look at death date and age then gender and
race
 Analyze uniqueness of SSDI records
 May give us more false positives
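The birth year ambiguity arises because an age at death fits two calendar birth years, depending on whether the birthday had passed before the death date. A small sketch, with hypothetical function names and made-up dates:

```python
from datetime import date


def birth_year_candidates(death_year, age_at_death):
    """The two birth years consistent with a given death year and age."""
    return (death_year - age_at_death - 1, death_year - age_at_death)


def age_on(birth, when):
    """Age in whole years on the date `when`."""
    had_birthday = (when.month, when.day) >= (birth.month, birth.day)
    return when.year - birth.year - (0 if had_birthday else 1)


candidates = birth_year_candidates(1990, 30)
print(candidates)

# Both of these made-up birth dates give age 30 on 1 March 1990.
death = date(1990, 3, 1)
print(age_on(date(1959, 6, 1), death), age_on(date(1960, 2, 1), death))
```

Because the SSDI lists a date of birth, a candidate record can be checked with `age_on` directly instead of comparing a single computed birth year.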
15 Minute Break
Legal Protections (or lack thereof)
Are we breaking the law?
 As independent private citizen researchers
 No, in fact what we are doing is probably
protected by the First Amendment
 We could probably even sell the results
 What if we were a corporate entity?
 No, although if we were a “credit bureau,”
we couldn’t distribute our findings
 What if we were the government?
 Yes, because government agencies are not
allowed to combine databases with each
other
Why isn’t this illegal?
 The easy answer is “because there isn’t a law
against it.”
 Publicly available statistics are there to be
used. Combining them is just an extension of
that use.
 It is difficult to write laws for this situation
without restricting valid statistical uses of the
data.
Legal Problems
 Privacy Act vs. Freedom of Information Act
 two important pieces of legislation,
fundamentally at odds with each other
 FOIA includes some privacy clauses
 Many examples of FOIA and the Privacy Act
clashing in the courtroom
 few deal with databases
 fewer still deal with reidentification
Southern Illinoisan vs. DPH
 The case
 The Department of Public Health denied a
newspaper’s request for release of
neuroblastoma statistics
 The holding
 release of the data does not constitute an
“unwarranted invasion of privacy,” despite
LaTanya’s research
What data is protected?
 Medical
 distribution protected by HIPAA
 LaTanya’s work is stirring up reidentification fears
 Criminal
 distribution of statistics regulated
 recombining databases is not addressed
What data is protected? cont.
 All personal data given to a company or
government agency is protected by the Privacy
Act
 Some special, additional protections exist for
other data
 financial information is protected under
the Fair Credit Reporting Act
 drivers license information is protected
under the Drivers Privacy Protection Act
Title 42 Protections
 Sec. 3789g, 3732, 10505
 Maintenance of crime-related databases
and release of crime statistics must conform
to certain security and privacy restrictions,
in compliance with the Privacy Act
 However, Bureau of Justice Statistics is
required to collect these statistics and make
them publicly available
CFR Regulations
 42 CFR part 2a
 identity protection for research subjects
 45 CFR parts 160 and 162 (1999)
 guidelines for distributing statistical
medical information.
 45 CFR parts 163 and 164 (not finalized)
 “Implementation Specifications For De-Identifying Information”
 under debate as we speak
Proposed Legislation
 Medical Information Protection and Research
Enhancement Act of 2001
 Financial Information Privacy Protection Act
of 2001
What data is NOT protected?
 Right now, the law concerns itself only with
private information that someone has entrusted
to you (or your company)
 Information you discover for yourself (i.e.,
through reidentification) is not well regulated.
German Laws
 An example of a real-live deidentification
law, attempting to address these issues
 “It is prohibited to match individual data
from federal statistics … for establishing a
reference to persons, enterprises,
establishments or local units for other than
the statistical purposes …”
Datasets and Code
What I'm going to tell you
 Requirements
 30 Second Intro to RDBMSs & SQL
 Snarfing & Loading
 Matching!
 Doh! Matching, take 2
 Curse you and your promises of spatial
invariance!
 Doh! Matching, take 3
 Sex is good
 Verification - does it actually work?
 Deidentification Techniques that don't Suck
Requirements
 Take two large tables of data (think rows
& columns)
 Large means > 10,000 rows
 Combine one row from one table to at most
one row in the control table
 Combine based on several fields in
individual rows
 Not all rows will match - deal with it
Requirements cont.
 This had better be fast
 Don't you dare hardcode this cause those
table structures WILL change
 The system must be cheap (free)
 It had better run on cheap hardware
 We shouldn't have to learn too much (we're
lazy)
SSDI Spider
 Implemented in Perl
 RootsWeb only gives you 15 records per
page
 We sleep for 15 seconds between page grabs
to avoid hammering their server
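The spider itself was written in Perl and is not reproduced in these slides. The Python sketch below only illustrates the page-by-page loop with a pause between requests. The `crawl` function and the `fetch_page` callback are hypothetical stand-ins, and the demo uses fake data rather than contacting RootsWeb.

```python
import time


def crawl(fetch_page, delay_seconds=15, sleep=time.sleep):
    """Collect records page by page, pausing between page grabs.

    `fetch_page(page_number)` is a caller-supplied function returning a list
    of records for that page, or an empty list when there are no more pages.
    """
    records = []
    page = 0
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        records.extend(batch)
        page += 1
        sleep(delay_seconds)  # be polite to the remote server
    return records


# Demo with fake data: 40 records served 15 per page, and a no-op sleep.
fake_records = [f"record-{i}" for i in range(40)]
pauses = []
result = crawl(
    lambda page: fake_records[page * 15:(page + 1) * 15],
    sleep=pauses.append,  # record the requested delays instead of waiting
)
print(len(result), pauses)
```

Passing `sleep` as a parameter keeps the demo instant while the default still waits 15 seconds per page.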
Spider Flow Chart
 At approx. 20,000 records/year for Chicago, this takes about 6 hours per year
 For 1982-1995, or 14 years, it took about 3.5 days of downloading
 Totaled 18.3 MB of data
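The download times on this slide follow from the 15 records per page and the 15 second pause between pages. The sketch below counts the pauses only, so it comes out slightly under the quoted 6 hours per year and 3.5 days; the rest is presumably the time spent fetching and parsing each page, which is an assumption. `download_hours` is a hypothetical helper.

```python
import math


def download_hours(records, page_size=15, delay_seconds=15):
    """Hours spent sleeping between page grabs for one year of records."""
    pages = math.ceil(records / page_size)
    return pages * delay_seconds / 3600


hours_per_year = download_hours(20_000)
total_days = 14 * hours_per_year / 24  # 1982 through 1995 is 14 years
print(f"{hours_per_year:.2f} hours per year, {total_days:.2f} days in total")
```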
Database Tools
 PostgreSQL
 Microsoft SQL Server 7.0
 Scripts to parse data in variable-length
format
Uniqueness SQL
-- List every {death date, age} combination that occurs exactly once.
-- Values ending in 98 (dates) or 999 (age) are excluded; these appear to be
-- missing-value codes.
SELECT [ deathyr] AS Year, [ deathmon] AS Month,
       [ deathdte] AS Day, [ vicage] AS Age
FROM [dbo].[victim]
WHERE ([ deathyr] NOT LIKE '%98')
  AND ([ deathmon] NOT LIKE '%98')
  AND ([ deathdte] NOT LIKE '%98')
  AND ([ vicage] NOT LIKE '%999')
GROUP BY [ deathyr], [ deathmon], [ deathdte], [ vicage]
HAVING COUNT(*) = 1
ORDER BY [ deathyr], [ deathmon], [ deathdte], [ vicage]
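To try the uniqueness query without SQL Server, the same GROUP BY / HAVING logic can be run against an in-memory SQLite table from Python. This is a sketch under assumptions: the column names are written without the leading spaces, the rows are made up, and treating 98 and 999 as missing-value codes is a guess based on the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE victim (deathyr TEXT, deathmon TEXT, deathdte TEXT, vicage TEXT)"
)
rows = [
    ("1990", "01", "15", "030"),
    ("1990", "01", "15", "030"),  # same event twice, so not unique
    ("1990", "02", "03", "041"),
    ("1991", "07", "21", "999"),  # age looks like a missing-value code
    ("1992", "98", "98", "025"),  # month and day look like missing-value codes
]
conn.executemany("INSERT INTO victim VALUES (?, ?, ?, ?)", rows)

unique_events = conn.execute(
    """
    SELECT deathyr, deathmon, deathdte, vicage
    FROM victim
    WHERE deathyr NOT LIKE '%98'
      AND deathmon NOT LIKE '%98'
      AND deathdte NOT LIKE '%98'
      AND vicage NOT LIKE '%999'
    GROUP BY deathyr, deathmon, deathdte, vicage
    HAVING COUNT(*) = 1
    ORDER BY deathyr, deathmon, deathdte, vicage
    """
).fetchall()
print(unique_events)
conn.close()
```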
Other Deidentified Datasets
AIDS Patients Data
 Source: Centers for Disease Control &
Prevention, Division of HIV/AIDS Prevention
 Size: 23.6 MB
 Dates Covered: 1981 – 1998
 Record Count: 688,200
 Covers: Entire United States
 Cost: $25
AIDS Patients Data cont.
 age - Age group at diagnosis
 dxdate - Month of diagnosis
 gender - Sexual classification of patient
 race - Race of patient
 death - Vital status of patient
 msa - Region of residence at diagnosis of
AIDS
 sexbi - Sex with a bisexual man (women
only)
 sexiv - Sex with an injecting drug user
Outpatient Data
 Source: National Center for Health Statistics
(NCHS)
 Size: Huge
 Dates Covered: 1965 - present
 Record Count: Huge
 Covers: Entire United States
 Cost: Varies, depending on provider and
coverage.
Outpatient Data cont.
 Age
 Race
 Gender
 Marital Status
 Geographic Region
 Diagnosis – e.g. Abortion, AIDS
Malpractice Data
 Source: U.S. Department of Health and
Human Services
 Size: 37 MB
 Dates Covered: 1 Sep 1990 – 31 Dec 1999
 Record Count: 227,541
 Covers: Entire United States
 Cost: state slice, $20; entire U.S., $55
Malpractice Data cont.
 Practitioner Number – Allows linking
within datasets
 Work state
 Home State
 Field of License – e.g. Dentist or Nuclear
Pharmacist
 Age Group
Malpractice Data cont.
 Malpractice Code – e.g. Diagnosis,
Unnecessary Tests or Surgery, Wrong
Body Part
 Payment
 Adverse Actions – e.g. Revocation of
License or Denial of Professional Society
Membership
Robberies Data
 Source: Inter-university Consortium for
Political and Social Research (ICPSR)
 Size: 759 KB
 Dates Covered: 1982-1983
 Record Count: 7,216
 Covers: Chicago
 Cost: FREE!
Robberies Data cont.
 Victim Age Range – e.g. Baby, Young Adult,
Old Adult
 Victim Race
 Victim Gender
 Victim Marital Status
 Victim Employment – e.g. Unemployed,
Self-Employed, Full Time Student
 Victim Area / District of Residence
Robberies Data cont.
 Gang Membership – Yes, No, Ex-member,
Probable member, etc.
 Victim/Offender Relationship – e.g. Ex
boyfriend-girlfriends, strangers, Drug
dealer/buyer
 Victim Dealing Drugs?
Juveniles Data
 Source: Arkansas Administrative Office of
the Courts
 Size: 7.1 MB
 Dates Covered: 1991 – 1994
 Record Count: 55,467
 Covers: Arkansas (other states publish this
info as well)
 Cost: FREE!
Juveniles Data cont.
 County Identification
 Race
 Sex
 Date of Birth
 Type of Charge – e.g. Felony, Misdemeanor
 Offense – e.g. Capital murder, Rape,
Sodomy, Gaming in house or steamboat,
Use of intoxicating, stupefying substance,
Unlawful packaging of strawberries, or Use of
x-ray shoe-fitting machines
Conclusion
Further Work
 Validation/verification of our results
 Other datasets
 Reidentify the prison record dataset
 Analyze different deidentification techniques
(LaTanya’s work)
 Determine how effective they are on our
Chicago Homicide dataset
Summary
 An anonymous database system makes
individual and entity-specific data available
such that individuals and other entities
contained in the released data cannot be
reliably identified
 Most databases are not anonymous
 Therefore, reidentification is possible
 Motivations for reidentification exist,
whereas barriers do not
Summary cont.
 Better, more complete legal policy is needed
 More importantly, data holders should employ
technical means to make their data anonymous
Questions?