Transcript privacy

Preserving Privacy in Published Data
Dimitris Sacharidis
outline
• introduction
– motivation
– definitions
• k-anonymity
– incognito
– mondrian
– topdown
• l-diversity
– anatomy
• other issues
motivation
• several agencies, institutions, bureaus, organizations make
(sensitive) data involving people publicly available
– termed microdata (vs. aggregated macrodata) used for analysis
– often required and imposed by law
• to protect privacy microdata are sanitized
– explicit identifiers (SSN, name, phone #) are removed
• is this sufficient for preserving privacy?
• no! susceptible to link attacks
– publicly available databases (voter lists, city directories) can reveal
the “hidden” identity
link attack example
• Sweeney [S01a] managed to re-identify the medical record of
the governor of Massachusetts
– MA collects and publishes sanitized medical data for state
employees (microdata; left circle in the figure)
– voter registration list of MA (publicly available data; right circle in the figure)
• looking for governor’s record
• join the tables:
– 6 people had his birth date
– 3 were men
– 1 in his zipcode
• regarding the US 1990 census data
– 87% of the population are unique based on (zipcode, gender, dob)
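A link attack is essentially a join on the quasi-identifier attributes. The sketch below is illustrative only: every row, name, and attribute key in it is hypothetical and none of it comes from the MA data sets. It reports the individuals whose (dob, sex, zip) combination matches exactly one voter.

```python
# Hypothetical sanitized microdata: explicit identifiers already removed.
microdata = [
    {"dob": "1950-01-01", "sex": "M", "zip": "00001", "diagnosis": "condition A"},
    {"dob": "1950-01-01", "sex": "F", "zip": "00001", "diagnosis": "condition B"},
    {"dob": "1960-05-05", "sex": "M", "zip": "00002", "diagnosis": "condition C"},
]

# Hypothetical public voter list: names together with the same three attributes.
voters = [
    {"name": "Voter One", "dob": "1950-01-01", "sex": "M", "zip": "00001"},
    {"name": "Voter Two", "dob": "1950-01-01", "sex": "F", "zip": "00002"},
    {"name": "Voter Three", "dob": "1960-05-05", "sex": "M", "zip": "00002"},
    {"name": "Voter Four", "dob": "1960-05-05", "sex": "M", "zip": "00002"},
]


def link(microdata, voters, qi=("dob", "sex", "zip")):
    """Join the two tables on the quasi-identifier attributes.

    Returns (name, diagnosis) pairs for records that match exactly one voter.
    """
    index = {}
    for voter in voters:
        key = tuple(voter[a] for a in qi)
        index.setdefault(key, []).append(voter["name"])

    reidentified = []
    for record in microdata:
        names = index.get(tuple(record[a] for a in qi), [])
        if len(names) == 1:  # unique match: the "hidden" identity is revealed
            reidentified.append((names[0], record["diagnosis"]))
    return reidentified


reidentified = link(microdata, voters)
```

In this toy input only the first record is re-identified: the second has no matching voter, and the third matches two voters, so it cannot be attributed to either with certainty.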
privacy in microdata
• the role of attributes in microdata
– explicit identifiers are removed
– quasi identifiers can be used to re-identify individuals
– sensitive attributes (may not exist!) carry sensitive information
identifier  quasi identifiers              sensitive
Name        Birthdate   Sex     Zipcode    Disease
Andre       21/1/79     male    53715      Flu
Beth        10/1/81     female  55410      Hepatitis
Carol       1/10/44     female  90210      Bronchitis
Dan         21/2/84     male    02174      Sprained Ankle
Ellen       19/4/72     female  02237      AIDS
goal of privacy preservation (rough definition)
de-associate individuals from sensitive information
k-anonymity
• preserve privacy via k-anonymity, proposed by Sweeney and
Samarati [S01, S02a, S02b]
• k-anonymity: intuitively, hide each individual among k-1 others
– each QI set of values should appear at least k times in the released
microdata
– linking cannot be performed with confidence > 1/k
– sensitive attributes are not considered (going to revisit this...)
• how to achieve this?
– generalization and suppression
– value perturbation is not considered (we should remain truthful to
original values)
• privacy vs utility tradeoff
– do not anonymize more than necessary
k-anonymity example
tools for anonymization
• generalization
– publish more general values, i.e., given a domain hierarchy, roll-up
• suppression
– remove tuples, i.e., do not publish outliers
– often the number of suppressed tuples is bounded
original microdata
Birthdate   Sex     Zipcode
21/1/79     male    53715
10/1/79     female  55410
1/10/44     female  90210
21/2/83     male    02274
19/4/82     male    02237
2-anonymous data
            Birthdate   Sex     Zipcode
group 1     */1/79      person  5****
group 1     */1/79      person  5****
suppressed  1/10/44     female  90210
group 2     */*/8*      male    022**
group 2     */*/8*      male    022**
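The k-anonymity condition can be checked by counting how often each combination of QI values occurs. The following is a minimal sketch using the rows of the example above; the function name `is_k_anonymous` is an assumption made for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, k):
    """True if every combination of QI values appears at least k times."""
    counts = Counter(rows)
    return all(count >= k for count in counts.values())


# Rows are (birthdate, sex, zipcode), taken from the example tables.
original = [
    ("21/1/79", "male", "53715"),
    ("10/1/79", "female", "55410"),
    ("1/10/44", "female", "90210"),
    ("21/2/83", "male", "02274"),
    ("19/4/82", "male", "02237"),
]

# Published table: the outlier born in 1944 is suppressed,
# and the remaining tuples are generalized.
published = [
    ("*/1/79", "person", "5****"),
    ("*/1/79", "person", "5****"),
    ("*/*/8*", "male", "022**"),
    ("*/*/8*", "male", "022**"),
]

original_ok = is_k_anonymous(original, 2)
published_ok = is_k_anonymous(published, 2)
```

Here `original_ok` is false because every original tuple is unique, while `published_ok` is true because each published QI combination occurs twice.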
generalization lattice
assume domain hierarchies exist for all QI attributes
zipcode:    Z0 = {53715, 53710, 53706, 53703} → Z1 = {5371*, 5370*} → Z2 = {537**}
birthdate:  B0 = {26/3/1979, 11/3/1980, 16/5/1978} → B1 = {*}
sex:        S0 = {Male, Female} → S1 = {Person}
construct the generalization lattice for the entire QI set
(nodes with their distance vectors, listed from more to less generalization)
<S1, Z2> [1, 2]
<S1, Z1> [1, 1]    <S0, Z2> [0, 2]
<S1, Z0> [1, 0]    <S0, Z1> [0, 1]
<S0, Z0> [0, 0]
objective
find the minimum generalization that satisfies k-anonymity,
i.e., maximize utility by finding minimum distance vector with k-anonymity
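The lattice over <S, Z> can be enumerated mechanically from the heights of the two hierarchies. This is a small sketch, assuming nodes are represented directly by their distance vectors (sex level, zipcode level); the helper names are hypothetical.

```python
from itertools import product

# Highest generalization level per QI attribute, in the order (sex, zipcode):
# sex has levels S0..S1, zipcode has levels Z0..Z2.
MAX_LEVELS = (1, 2)


def lattice_nodes(max_levels):
    """All nodes of the lattice, each written as its distance vector."""
    return list(product(*(range(m + 1) for m in max_levels)))


def direct_generalizations(node, max_levels):
    """Parents of a node: raise exactly one attribute by one level."""
    parents = []
    for i, level in enumerate(node):
        if level < max_levels[i]:
            parents.append(node[:i] + (level + 1,) + node[i + 1:])
    return parents


def height(node):
    """Total number of generalization steps applied at this node."""
    return sum(node)


nodes = lattice_nodes(MAX_LEVELS)
```

For example, `direct_generalizations((0, 0), MAX_LEVELS)` yields `(1, 0)` and `(0, 1)`, i.e., <S1, Z0> and <S0, Z1>.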
generalization lattice
incognito [LDR05]
exploit monotonicity properties regarding frequency of tuples in lattice
– reminiscent of OLAP hierarchies and frequent itemset mining
[figure: lattice over <S, Z> with nodes <S1, Z2>; <S1, Z1>, <S0, Z2>; <S1, Z0>, <S0, Z1>; <S0, Z0>;
note: the entire lattice, which includes three dimensions <S,Z,B>, is too complex to show]
(I) generalization property (~rollup)
if at some node k-anonymity holds, then it also holds
for any ancestor node
e.g., <S1, Z0> is k-anonymous and, thus, so are <S1, Z1> and <S1, Z2>
(II) subset property (~apriori)
if for a set of QI attributes k-anonymity doesn’t
hold then it doesn’t hold for any of its supersets
e.g., <S0, Z0> is not k-anonymous and, thus <S0, Z0, B0> and <S0, Z0, B1>
cannot be k-anonymous
incognito [LDR05] considers sets of QI attributes of increasing cardinality
(~apriori) and prunes nodes in the lattice using the two properties above
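The generalization property alone already saves table scans. The sketch below is not the full Incognito algorithm (it skips the apriori-style iteration over attribute subsets); it only walks the <S, Z> lattice bottom-up and marks a node as k-anonymous without scanning whenever one of its children is. The seven tuples are hypothetical, and zipcode generalization is assumed to mask trailing digits.

```python
from collections import Counter
from itertools import product


def generalize(row, node):
    """Apply generalization levels node = (sex level, zipcode level) to a tuple."""
    sex, zipcode = row
    s, z = node
    if s == 1:
        sex = "person"
    zipcode = zipcode[: len(zipcode) - z] + "*" * z
    return (sex, zipcode)


def scan_is_k_anonymous(rows, node, k):
    """Full scan: count the generalized QI combinations."""
    counts = Counter(generalize(r, node) for r in rows)
    return min(counts.values()) >= k


def minimal_k_anonymous_nodes(rows, k):
    """Bottom-up search over the <S, Z> lattice using the generalization property."""
    anonymous = set()  # nodes known to be k-anonymous
    minimal = []       # k-anonymous nodes with no k-anonymous child
    for node in sorted(product(range(2), range(3)), key=sum):
        s, z = node
        children = [c for c in ((s - 1, z), (s, z - 1)) if min(c) >= 0]
        if any(c in anonymous for c in children):
            anonymous.add(node)  # inferred from a child, no scan needed
        elif scan_is_k_anonymous(rows, node, k):
            anonymous.add(node)
            minimal.append(node)
    return minimal


# Hypothetical tuples (sex, zipcode).
rows = [
    ("male", "53715"), ("male", "53710"), ("female", "53715"),
    ("female", "53710"), ("male", "53706"), ("female", "53703"),
    ("male", "53703"),
]
minimal = minimal_k_anonymous_nodes(rows, 2)
```

On this input the minimal 2-anonymous nodes are <S0, Z2> and <S1, Z1>; <S1, Z2> is never scanned because it is an ancestor of both.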
seen in the domain space
consider the multi-dimensional domain space
– QI attributes are the dimensions
– tuples are points in this space
– attribute hierarchies partition dimensions
[figure: domain space with two axes;
zipcode hierarchy: Z0 (53703, 53705, 53709, 53711, 53714, 53718) → Z1 (5370*, 5371*) → Z2 (537**);
sex hierarchy: S0 (male, female) → S1 (person);
example points: (53705, female), (53711, male)]
incognito example
2 QI attributes, 7 tuples,
hierarchies shown with bold lines
[figure: sex vs. zipcode grid; the ungeneralized partitioning is not 2-anonymous;
the generalized partitioning is 2-anonymous, with group 1 (2 tuples), group 2 (3 tuples), group 3 (2 tuples)]
taxonomy [LDR05, LDR06]
generalization taxonomy according to groupings allowed
– single dimensional global recoding: incognito [LDR05]
– multi dimensional global recoding: mondrian [LDR06]
– multi dimensional local recoding: topdown [XWP+06]
[figure axis: generalization strength]
mondrian [LDR06]
• define utility measure: discernibility metric (DM)
– penalizes each tuple with the size of the group it belongs to
– intuitively, the ideal grouping is the one in which all groups have size k
• mondrian tries to construct groups of roughly equal size k
• what else (besides Mondrian) does this painting remind you of?
• it’s reminiscent of the kd tree:
– cycle among dimensions
– median splits
[figure: 2-anonymous partitioning]
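The kd-tree analogy suggests a very short recursive sketch. It is a simplification under stated assumptions: QI values are numeric, dimensions are cycled in a fixed order, and tuples with equal values may be separated by a split. The published algorithm [LDR06] handles these details differently, and the points here are hypothetical.

```python
def discernibility(groups):
    """DM: each tuple is penalized with the size of its group."""
    return sum(len(g) ** 2 for g in groups)


def mondrian(points, k, depth=0):
    """kd-tree style partitioning: cycle among dimensions, split at the median.

    A group is split only if both halves keep at least k tuples,
    so every resulting group has between k and 2k-1 tuples.
    """
    if len(points) < 2 * k:
        return [points]
    dim = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    return (mondrian(ordered[:mid], k, depth + 1)
            + mondrian(ordered[mid:], k, depth + 1))


# Hypothetical tuples with two numeric QI attributes.
points = [(1, 5), (2, 1), (3, 7), (4, 3), (5, 8), (6, 2), (7, 6), (8, 4)]
groups = mondrian(points, k=2)
dm = discernibility(groups)
```

With 8 tuples and k = 2 the sketch reaches the ideal grouping of four groups of size 2, so DM is 4 · 2² = 16.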
measuring group quality
• DM depends only on the cardinality of the group
– no measure of how tight the group is
• a good group is one that contains tuples with similar QI values
• define a new metric [XWP+06]: normalized certainty penalty (NCP)
– measures the perimeter of the group
[figure: bad generalization gives long boxes; good generalization gives square-like boxes]
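A perimeter-like penalty can be sketched as the sum of a group's extents along each QI attribute, normalized by the domain range. This is a simplified, equal-weight version for numeric attributes only, not the full definition in [XWP+06]; the domain and both groups are hypothetical.

```python
def ncp(group, domain):
    """Simplified NCP: sum of the group's normalized extents per attribute.

    group:  list of numeric tuples
    domain: list of (low, high) bounds, one pair per attribute
    """
    total = 0.0
    for d, (low, high) in enumerate(domain):
        values = [p[d] for p in group]
        total += (max(values) - min(values)) / (high - low)
    return total


domain = [(0, 10), (0, 10)]

# Two bounding boxes of equal area (9) but different shape.
long_box = [(0, 0), (9, 1)]
square_box = [(0, 0), (3, 3)]

ncp_long = ncp(long_box, domain)
ncp_square = ncp(square_box, domain)
```

The long box gets the larger penalty even though both boxes cover the same area, which is the behavior the slide asks for.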
topdown [XWP+06]
• start with the entire data set
• iteratively split in two
– reminiscent of R-tree quadratic split
• continue until left with groups which contain <2k-1 tuples
split algorithm
• find seeds, 2 points that are furthest away
– heuristic, not complete quadratic search
– the seeds will become the 2 split groups
• examine points randomly (unlike quadratic split)
– assign point to the group whose NCP will increase the least
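One split step might look like the sketch below. It makes several assumptions that are not in the slides: the NCP is the simplified equal-weight version for numeric attributes, the seed heuristic alternates a fixed number of "furthest point" searches, and group sizes are not checked against k. The points are hypothetical.

```python
import random


def ncp(group, domain):
    """Simplified NCP: sum of normalized extents per numeric attribute."""
    return sum(
        (max(p[d] for p in group) - min(p[d] for p in group)) / (high - low)
        for d, (low, high) in enumerate(domain)
    )


def find_seeds(points, domain, rng, rounds=3):
    """Heuristic seed search: repeatedly jump to the point furthest away.

    Distance between two points is the NCP of the pair; this avoids the
    complete quadratic search over all pairs. Returns two indices.
    """
    u = rng.randrange(len(points))
    v = u
    for _ in range(rounds):
        v = max(range(len(points)),
                key=lambda i: ncp([points[u], points[i]], domain))
        u, v = v, u
    return u, v


def split(points, domain, rng):
    """Split points in two, assuming they are not all identical."""
    u, v = find_seeds(points, domain, rng)
    g1, g2 = [points[u]], [points[v]]
    rest = [p for i, p in enumerate(points) if i not in (u, v)]
    rng.shuffle(rest)  # examine the remaining points in random order
    for p in rest:
        inc1 = ncp(g1 + [p], domain) - ncp(g1, domain)
        inc2 = ncp(g2 + [p], domain) - ncp(g2, domain)
        (g1 if inc1 <= inc2 else g2).append(p)
    return g1, g2


domain = [(0, 10), (0, 10)]
points = [(0, 0), (1, 1), (0, 1), (9, 9), (10, 10), (9, 10)]
g1, g2 = split(points, domain, random.Random(0))
```

On these two well-separated clusters the seeds end up in opposite corners and each remaining point joins the cluster it is closest to.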
boosting privacy with external data
• external databases (e.g., voter list) are used by attackers
• can we use them to our benefit?
– try to improve the utility of anonymized data
• join k-anonymity (JKA) [SMP]
[figure: k-anonymity turns the microdata into 3-anonymous microdata;
JKA joins the microdata with public data and produces join 3-anonymous joined microdata; groups marked x3]
k-anonymity problems
• k-anonymity example
– homogeneity attack: in the last group everyone has cancer
– background knowledge: in the first group, Japanese have low chance of heart disease
• we need to consider the sensitive values
microdata
id   Zipcode   Age   National.   Disease
1    13053     28    Russian     Heart Disease
2    13068     29    American    Heart Disease
3    13068     21    Japanese    Viral Infection
4    13053     23    American    Viral Infection
5    14853     50    Indian      Cancer
6    14853     55    Russian     Heart Disease
7    14850     47    American    Viral Infection
8    14850     49    American    Viral Infection
9    13053     31    American    Cancer
10   13053     37    Indian      Cancer
11   13068     36    Japanese    Cancer
12   13068     35    American    Cancer
4-anonymous data
id   Zipcode   Age   National.   Disease
1    130**     <30   ∗           Heart Disease
2    130**     <30   ∗           Heart Disease
3    130**     <30   ∗           Viral Infection
4    130**     <30   ∗           Viral Infection
5    1485*     ≥40   ∗           Cancer
6    1485*     ≥40   ∗           Heart Disease
7    1485*     ≥40   ∗           Viral Infection
8    1485*     ≥40   ∗           Viral Infection
9    130**     3∗    ∗           Cancer
10   130**     3∗    ∗           Cancer
11   130**     3∗    ∗           Cancer
12   130**     3∗    ∗           Cancer
l-diversity
[MGK+06]
• make sure each group contains well represented sensitive values
– protect from homogeneity attacks
– protect from background knowledge
l-diversity (simplified definition)
a group is l-diverse if the most frequent sensitive
value accounts for at most a 1/l fraction of the tuples in the group
• l-diversity definition has monotonicity properties similar to k-anonymity
– simply adapt existing algorithms to count frequencies of sensitive
values
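The simplified definition reduces to one frequency count per group. The sketch below applies it to the three groups of the 4-anonymous example (ids 1-4, 5-8, 9-12); the function name is an assumption for illustration.

```python
from collections import Counter


def is_l_diverse(sensitive_values, l):
    """Simplified l-diversity: the most frequent sensitive value
    covers at most a 1/l fraction of the group."""
    most_frequent = max(Counter(sensitive_values).values())
    return most_frequent * l <= len(sensitive_values)


# Sensitive values of the three groups in the 4-anonymous table.
group_1 = ["Heart Disease", "Heart Disease", "Viral Infection", "Viral Infection"]
group_2 = ["Cancer", "Heart Disease", "Viral Infection", "Viral Infection"]
group_3 = ["Cancer", "Cancer", "Cancer", "Cancer"]

results = [is_l_diverse(g, 2) for g in (group_1, group_2, group_3)]
```

The first two groups are 2-diverse under this definition; the last group is not, which matches the homogeneity attack on it.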
anatomy [XT06]
• fast l-diversity algorithm
• anatomy is not generalization
– separates sensitive values from tuples
– shuffles sensitive values among groups
quasi-identifier table
id   Age   Sex   Zipcode   Group-ID
1    23    M     11000     1
2    27    M     13000     1
3    35    M     59000     1
4    59    M     12000     1
5    61    F     54000     2
6    65    F     25000     2
7    65    F     25000     2
8    70    F     30000     2
sensitive table
Group-ID   Disease      Count
1          dyspepsia    2
1          pneumonia    2
2          bronchitis   1
2          flu          2
2          gastritis    1
algorithm
• assign sensitive values to buckets
• create groups by drawing from l largest buckets
[figure: tuples 1-9 held in buckets by sensitive value;
groups formed: group 1 = {5, 8, 7}, group 2 = {3, 9, 6}, group 3 = {1, 2, 4}]
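The two steps can be sketched as follows. This is a rough illustration rather than the algorithm of [XT06] verbatim: tuples are drawn from the end of each bucket instead of at random, leftover tuples go to the first group that does not yet contain their value, and the input is assumed to admit an l-diverse grouping. The tuple-to-disease assignment is hypothetical.

```python
from collections import Counter, defaultdict


def anatomize(tuples, l):
    """Group tuples so that each group holds distinct sensitive values.

    tuples: list of (tuple_id, sensitive_value)
    Returns a list of groups, each a list of (tuple_id, sensitive_value).
    """
    # Step 1: assign tuples to buckets by sensitive value.
    buckets = defaultdict(list)
    for tid, value in tuples:
        buckets[value].append((tid, value))

    # Step 2: while at least l buckets are non-empty, form a group by
    # drawing one tuple from each of the l largest buckets.
    groups = []
    while sum(1 for b in buckets.values() if b) >= l:
        non_empty = [b for b in buckets.values() if b]
        largest = sorted(non_empty, key=len, reverse=True)[:l]
        groups.append([b.pop() for b in largest])

    # Leftovers: add each remaining tuple to a group lacking its value
    # (raises StopIteration if no such group exists).
    for bucket in buckets.values():
        for tid, value in bucket:
            target = next(g for g in groups if all(v != value for _, v in g))
            target.append((tid, value))
    return groups


# Hypothetical assignment of diseases to tuple ids 1..8.
tuples = [
    (1, "pneumonia"), (2, "dyspepsia"), (3, "dyspepsia"), (4, "pneumonia"),
    (5, "flu"), (6, "gastritis"), (7, "flu"), (8, "bronchitis"),
]
groups = anatomize(tuples, l=2)

# Per-group counts of sensitive values, as published in the sensitive table.
sensitive_table = [Counter(v for _, v in g) for g in groups]
```

Because every group contains each sensitive value at most once and has at least l tuples, each group satisfies the simplified l-diversity definition.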
other issues
• how to preserve privacy in dynamic data
– [XT07]
• protection against minimality attack
– [WFWP07]
• anonymity in location-based services
– spatial cloaking
references
[LDR05] LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Incognito: Efficient Full-domain
k-Anonymity. SIGMOD, 2005.
[LDR06] LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Mondrian Multidimensional
k-Anonymity. ICDE, 2006.
[S01] Samarati, P. Protecting Respondents' Identities in Microdata Release. IEEE
TKDE, 13(6):1010-1027, 2001.
[S02a] Sweeney, L. k-Anonymity: A Model for Protecting Privacy. International
Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.
[S02b] Sweeney, L. k-Anonymity: Achieving k-Anonymity Privacy Protection using
Generalization and Suppression. International Journal on Uncertainty, Fuzziness
and Knowledge-based Systems, 2002.
[SMP] Sacharidis, D., Mouratidis, K., Papadias, D. k-Anonymity in the Presence of
External Databases (submitted)
[WFWP07] Wong, R. C., Fu, A. W., Wang, K., Pei, J. Minimality Attack in Privacy
Preserving Data Publishing. VLDB, 2007.
[XT06] Xiao, X., Tao, Y. Anatomy: Simple and Effective Privacy Preservation. VLDB,
2006.
[XT07] Xiao, X., Tao, Y. m-Invariance: Towards Privacy Preserving Re-publication of
Dynamic Datasets. SIGMOD, 2007.
[XWP+06] Xu, J., Wang, W., Pei, J., Wang, X., Shi, B., Fu, A. Utility-Based
Anonymization Using Local Recoding. SIGKDD, 2006.