Privacy-Preserving Data Publishing

Donghui Zhang
Northeastern University
Acknowledgement: some slides come from Yufei Tao and Dimitris Sacharidis.
motivation
• several agencies, institutions, bureaus,
organizations make (sensitive) data
involving people publicly available
– termed microdata (vs. aggregated macrodata) used for analysis
– often required and imposed by law
• to protect privacy microdata are sanitized
– explicit identifiers (SSN, name, phone #) are removed
• is this sufficient for preserving privacy?
• no! susceptible to link attacks
– publicly available databases (voter lists, city directories) can reveal the
“hidden” identity
link attack example
• [Sweeney01] managed to re-identify the medical
record of the governor of Massachusetts
– MA collects and publishes sanitized medical data for state employees
(microdata): left circle in the figure
– voter registration list of MA (publicly available data): right circle in the figure
• looking for governor’s record
• join the tables:
– 6 people had his birth date
– 3 were men
– 1 in his zipcode
• regarding the US 1990 census data
– 87% of the population are unique based on (zipcode, gender, dob)
Microdata
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis
Inference Attack
Published table
Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000    flu
36   27000    gastritis
37   33000    dyspepsia
40   35000    flu
43   26000    gastritis
52   33000    dyspepsia
56   34000    gastritis
An adversary
Name Age Zipcode
Bob 21 12000
Quasi-identifier (QI) attributes: Age, Zipcode
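The inference attack above amounts to a join on the QI attributes. A minimal sketch, using the published table and the adversary's knowledge from this slide (the function name `link` is a hypothetical helper, not something from the original deck):

```python
# Published table from the slide: (age, zipcode, disease), names removed.
published = [
    (21, 12000, "dyspepsia"), (22, 14000, "bronchitis"), (24, 18000, "flu"),
    (23, 25000, "gastritis"), (41, 20000, "flu"), (36, 27000, "gastritis"),
    (37, 33000, "dyspepsia"), (40, 35000, "flu"), (43, 26000, "gastritis"),
    (52, 33000, "dyspepsia"), (56, 34000, "gastritis"),
]

# External knowledge of the adversary: (name, age, zipcode).
external = [("Bob", 21, 12000)]


def link(published_rows, external_rows):
    """Join the two tables on (age, zipcode) and list candidate diseases per name."""
    matches = {}
    for name, age, zipcode in external_rows:
        matches[name] = [d for a, z, d in published_rows if (a, z) == (age, zipcode)]
    return matches


result = link(published, external)
print(result)
```

Because Bob's (Age, Zipcode) pair is unique in the published table, the join returns a single candidate disease for him.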
k-anonymity
[Samarati and Sweeney02]
• Transform the QI values into less specific forms
Age  Zipcode   generalize   Age       Zipcode      Disease
21   12000     ->           [21, 22]  [12k, 14k]   dyspepsia
22   14000     ->           [21, 22]  [12k, 14k]   bronchitis
24   18000     ->           [23, 24]  [18k, 25k]   flu
23   25000     ->           [23, 24]  [18k, 25k]   gastritis
41   20000     ->           [36, 41]  [20k, 27k]   flu
36   27000     ->           [36, 41]  [20k, 27k]   gastritis
37   33000     ->           [37, 43]  [26k, 35k]   dyspepsia
40   35000     ->           [37, 43]  [26k, 35k]   flu
43   26000     ->           [37, 43]  [26k, 35k]   gastritis
52   33000     ->           [52, 56]  [33k, 34k]   dyspepsia
56   34000     ->           [52, 56]  [33k, 34k]   gastritis
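One way to check the result is to count how many rows share each combination of generalized QI values; the smallest count is the k that the table achieves. A minimal sketch over the generalized table of this slide (the helper name `anonymity_level` is an assumption for illustration):

```python
from collections import Counter

# Generalized table from the slide: (age interval, zipcode interval, disease).
generalized = [
    ("[21, 22]", "[12k, 14k]", "dyspepsia"), ("[21, 22]", "[12k, 14k]", "bronchitis"),
    ("[23, 24]", "[18k, 25k]", "flu"), ("[23, 24]", "[18k, 25k]", "gastritis"),
    ("[36, 41]", "[20k, 27k]", "flu"), ("[36, 41]", "[20k, 27k]", "gastritis"),
    ("[37, 43]", "[26k, 35k]", "dyspepsia"), ("[37, 43]", "[26k, 35k]", "flu"),
    ("[37, 43]", "[26k, 35k]", "gastritis"),
    ("[52, 56]", "[33k, 34k]", "dyspepsia"), ("[52, 56]", "[33k, 34k]", "gastritis"),
]


def anonymity_level(rows):
    """Size of the smallest QI group, i.e. the largest k for which rows are k-anonymous."""
    group_sizes = Counter((age, zipcode) for age, zipcode, _ in rows)
    return min(group_sizes.values())


k = anonymity_level(generalized)
print(k)
```

For the table above every QI group holds two or three rows, so the table is 2-anonymous.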
Generalization
• Transform each QI value into a less specific form
A generalized table
Age       Zipcode      Disease
[21, 22]  [12k, 14k]   dyspepsia
[21, 22]  [12k, 14k]   bronchitis
[23, 24]  [18k, 25k]   flu
[23, 24]  [18k, 25k]   gastritis
[36, 41]  [20k, 27k]   flu
[36, 41]  [20k, 27k]   gastritis
[37, 43]  [26k, 35k]   dyspepsia
[37, 43]  [26k, 35k]   flu
[37, 43]  [26k, 35k]   gastritis
[52, 56]  [33k, 34k]   dyspepsia
[52, 56]  [33k, 34k]   gastritis
An adversary
Name Age Zipcode
Bob 21 12000
Graphically…
[Figure: the records plotted as points, with Age (21 to 56) on the horizontal axis and Zipcode (12000 to 35000) on the vertical axis; the points for Bob and Alice are labeled.]
Why not…
How many people with age in [30, 50] contracted flu?
[Figure: the same plot of the records, Age on the horizontal axis and Zipcode on the vertical axis.]
k-anonymity
How many people with age in [30, 50] contracted flu?
Age       Zipcode      Disease
[21, 56]  [12k, 35k]   dyspepsia
[21, 56]  [12k, 35k]   bronchitis
[21, 56]  [12k, 35k]   flu
[21, 56]  [12k, 35k]   gastritis
[21, 56]  [12k, 35k]   flu
[21, 56]  [12k, 35k]   gastritis
[21, 56]  [12k, 35k]   dyspepsia
[21, 56]  [12k, 35k]   flu
[21, 56]  [12k, 35k]   gastritis
[21, 56]  [12k, 35k]   dyspepsia
[21, 56]  [12k, 35k]   gastritis
generalization with low utility:
answer less accurately: [0..3]
Age       Zipcode      Disease
[21, 22]  [12k, 14k]   dyspepsia
[21, 22]  [12k, 14k]   bronchitis
[23, 24]  [18k, 25k]   flu
[23, 24]  [18k, 25k]   gastritis
[36, 41]  [20k, 27k]   flu
[36, 41]  [20k, 27k]   gastritis
[37, 43]  [26k, 35k]   dyspepsia
[37, 43]  [26k, 35k]   flu
[37, 43]  [26k, 35k]   gastritis
[52, 56]  [33k, 34k]   dyspepsia
[52, 56]  [33k, 34k]   gastritis
generalization with high utility:
answer queries more accurately: 2.
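The two answers can be reproduced by bounding the count from the generalized Age intervals. A minimal sketch, assuming a group counts toward the lower bound only when its interval lies fully inside the query range and toward the upper bound whenever it overlaps the range (`count_bounds` is a hypothetical helper):

```python
def count_bounds(groups, lo, hi, disease):
    """groups: list of ((age_lo, age_hi), [diseases in the QI group]).
    Returns (lower, upper) bounds on the number of matching records."""
    lower = upper = 0
    for (g_lo, g_hi), diseases in groups:
        hits = diseases.count(disease)
        if lo <= g_lo and g_hi <= hi:        # interval fully inside the query range
            lower += hits
            upper += hits
        elif g_lo <= hi and lo <= g_hi:      # partial overlap: matches are possible only
            upper += hits
    return lower, upper


all_diseases = ["dyspepsia", "bronchitis", "flu", "gastritis", "flu", "gastritis",
                "dyspepsia", "flu", "gastritis", "dyspepsia", "gastritis"]

# Low-utility generalization: one group covering every record.
low_utility = [((21, 56), all_diseases)]

# High-utility generalization: the five QI groups of the slide.
high_utility = [
    ((21, 22), ["dyspepsia", "bronchitis"]),
    ((23, 24), ["flu", "gastritis"]),
    ((36, 41), ["flu", "gastritis"]),
    ((37, 43), ["dyspepsia", "flu", "gastritis"]),
    ((52, 56), ["dyspepsia", "gastritis"]),
]

low = count_bounds(low_utility, 30, 50, "flu")
high = count_bounds(high_utility, 30, 50, "flu")
print(low, high)
```

With the single large group the answer can only be bounded between 0 and 3, while the tighter groups pin it to exactly 2.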
k-anonymity with utility
• Among all generalizations that enforce k-anonymity, we should maximize utility by
minimizing the “rectangle” sizes!
• Several measures exist, e.g. minimizing the
maximal perimeter of the rectangles.
Mondrian [LDR06]
Recursive half-plane partitioning, alternating dimensions.
let k=2
Mondrian [LDR06]
Unbounded approximation ratio!
let k=4
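A minimal sketch of the partitioning idea as described on the slide: split at the median, alternate dimensions, and stop when a split would leave fewer than k points on a side. It is a simplification for illustration (splitting by position rather than by value, so ties may be separated) and is not the implementation from [LDR06]; the function name `mondrian` and the use of the running example's records are assumptions.

```python
def mondrian(points, k, depth=0):
    """Recursively halve the point set, alternating the split dimension.
    Returns a list of groups; each group has at least k points when len(points) >= k."""
    if len(points) < 2 * k:
        return [points]                      # cannot split into two groups of size >= k
    dim = depth % len(points[0])             # alternate dimensions with recursion depth
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    return (mondrian(ordered[:mid], k, depth + 1)
            + mondrian(ordered[mid:], k, depth + 1))


# (age, zipcode) pairs of the running example.
records = [(21, 12000), (22, 14000), (24, 18000), (23, 25000), (41, 20000), (36, 27000),
           (37, 33000), (40, 35000), (43, 26000), (52, 33000), (56, 34000)]

groups = mondrian(records, k=2)
for g in groups:
    ages = [a for a, _ in g]
    zips = [z for _, z in g]
    print(len(g), (min(ages), max(ages)), (min(zips), max(zips)))
```

Each printed line gives a group size followed by the Age and Zipcode ranges of its bounding rectangle.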
Our contributions [DXT+07]
• Proved that finding the optimal partitioning
is NP-hard.
• Proved that finding a partitioning with
approximation ratio less than 1.25 is also
NP-hard.
• Provided three algorithms with tradeoffs in
complexity and approximation ratio.
Divide-And-Group (DAG)
• Divide the space into square cells with
proper size
• Find a set of non-overlapping tiles of 2 x 2
cells to cover the points, such that each
tile covers at least k points
• Assign the rest of (uncovered) points to
the nearest tile
Min-MBR-Group (MMG)
• For each point p, find the smallest MBR
which covers at least k points including p
• Find a set of non-overlapping MBRs from
the result of previous step
• Assign the points to the nearest MBR
Nearest-Neighbor-Group (NNG)
• For each point p, find the MBR which
covers p and its k-1 nearest neighbors
• Find a set of non-overlapping MBRs from
the result of previous step
• Assign the points to the nearest MBR
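The three NNG steps can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [DXT+07]: the slide does not say how the non-overlapping MBRs are selected, so the sketch greedily keeps the MBRs with the smallest perimeter first, uses brute-force Euclidean nearest neighbors, and expresses zipcodes in thousands so that both dimensions have comparable scale. No claim is made that it attains the approximation ratio of the original algorithm.

```python
import math


def mbr(points):
    """Minimum bounding rectangle of points, as (mins, maxs) per dimension."""
    dims = range(len(points[0]))
    return (tuple(min(p[d] for p in points) for d in dims),
            tuple(max(p[d] for p in points) for d in dims))


def overlaps(a, b):
    """True if rectangles a and b intersect (touching counts as overlap)."""
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d] for d in range(len(a[0])))


def distance_to_rect(p, rect):
    """Euclidean distance from point p to rectangle rect (0 if p is inside)."""
    return math.sqrt(sum(max(rect[0][d] - p[d], 0, p[d] - rect[1][d]) ** 2
                         for d in range(len(p))))


def nng(points, k):
    # Step 1: the MBR of each point together with its k-1 nearest neighbors.
    candidates = []
    for p in points:
        nearest = sorted(points, key=lambda q: math.dist(p, q))[:k]  # includes p itself
        candidates.append(mbr(nearest))
    # Step 2 (assumed rule): greedily keep small-perimeter MBRs that do not overlap.
    candidates.sort(key=lambda r: sum(hi - lo for lo, hi in zip(*r)))
    chosen = []
    for rect in candidates:
        if not any(overlaps(rect, c) for c in chosen):
            chosen.append(rect)
    # Step 3: assign every point to its nearest chosen MBR.
    groups = [[] for _ in chosen]
    for p in points:
        nearest_idx = min(range(len(chosen)), key=lambda j: distance_to_rect(p, chosen[j]))
        groups[nearest_idx].append(p)
    return groups


# (age, zipcode in thousands) for the running example.
records = [(21, 12), (22, 14), (24, 18), (23, 25), (41, 20), (36, 27),
           (37, 33), (40, 35), (43, 26), (52, 33), (56, 34)]
groups = nng(records, k=2)
print([len(g) for g in groups])
```

Every chosen MBR already contains the k points that defined it, and chosen MBRs are disjoint, so each resulting group has at least k points.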
Analysis
Algorithm  Complexity            Approximation Ratio
DAG        O(3^d d n log^2 n)    8d
MMG        O(d n^(2d+1))         2d+1
NNG        O(d n^2)              6d
Drawback of k-anonymity
• In a QI group, if many records have the same
sensitive attribute value...
Quasi-identifier (QI) attributes: Age, Sex, Zipcode. Sensitive attribute: Disease.
Age       Sex  Zipcode          Disease
[21, 40]  M    [10001, 60000]   pneumonia
[30, 60]  M    [10001, 60000]   dyspepsia
[30, 60]  M    [10001, 60000]   dyspepsia
[21, 40]  M    [10001, 60000]   pneumonia
[61, 65]  F    [10001, 60000]   flu
[63, 70]  F    [10001, 60000]   gastritis
[61, 65]  F    [10001, 60000]   flu
[63, 70]  F    [10001, 60000]   bronchitis
If Bob is in the [21, 40] group, he must have pneumonia.
l-diversity [ICDE06]
• A QI-group with m tuples is l-diverse, iff each sensitive
value appears no more than m / l times in the QI-group.
• A table is l-diverse, iff all of its QI-groups are l-diverse.
Quasi-identifier (QI) attributes: Age, Sex, Zipcode. Sensitive attribute: Disease. 2 QI-groups.
Age       Sex  Zipcode          Disease
[21, 60]  M    [10001, 60000]   pneumonia
[21, 60]  M    [10001, 60000]   dyspepsia
[21, 60]  M    [10001, 60000]   dyspepsia
[21, 60]  M    [10001, 60000]   pneumonia
[61, 70]  F    [10001, 60000]   flu
[61, 70]  F    [10001, 60000]   gastritis
[61, 70]  F    [10001, 60000]   flu
[61, 70]  F    [10001, 60000]   bronchitis
• The above table is 2-diverse.
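The definition can be checked mechanically: in every QI-group of m tuples, the most frequent sensitive value must appear at most m / l times. A minimal sketch over the table of this slide (`is_l_diverse` is a hypothetical helper name):

```python
from collections import Counter, defaultdict

# (age interval, sex, zipcode interval, disease) rows of the slide's table.
table = [
    ("[21, 60]", "M", "[10001, 60000]", "pneumonia"),
    ("[21, 60]", "M", "[10001, 60000]", "dyspepsia"),
    ("[21, 60]", "M", "[10001, 60000]", "dyspepsia"),
    ("[21, 60]", "M", "[10001, 60000]", "pneumonia"),
    ("[61, 70]", "F", "[10001, 60000]", "flu"),
    ("[61, 70]", "F", "[10001, 60000]", "gastritis"),
    ("[61, 70]", "F", "[10001, 60000]", "flu"),
    ("[61, 70]", "F", "[10001, 60000]", "bronchitis"),
]


def is_l_diverse(rows, l):
    """True if, in every QI-group of m rows, no sensitive value occurs more than m / l times."""
    groups = defaultdict(list)
    for *qi, sensitive in rows:
        groups[tuple(qi)].append(sensitive)
    return all(max(Counter(values).values()) <= len(values) / l
               for values in groups.values())


two_diverse = is_l_diverse(table, 2)
three_diverse = is_l_diverse(table, 3)
print(two_diverse, three_diverse)
```

Each group has four tuples and its most frequent disease appears twice, so the table passes for l = 2 and fails for l = 3.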
What l-diversity guarantees
• From an l-diverse generalized table, an adversary
(without any prior knowledge) can infer the sensitive value
of each individual with confidence at most 1/l
Name Age Sex Zipcode
Bob 23 M 11000
A 2-diverse generalized table
Age       Sex  Zipcode          Disease
[21, 60]  M    [10001, 60000]   pneumonia
[21, 60]  M    [10001, 60000]   dyspepsia
[21, 60]  M    [10001, 60000]   dyspepsia
[21, 60]  M    [10001, 60000]   pneumonia
[61, 70]  F    [10001, 60000]   flu
[61, 70]  F    [10001, 60000]   gastritis
[61, 70]  F    [10001, 60000]   flu
[61, 70]  F    [10001, 60000]   bronchitis
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity.
ICDE 2006
Problem with multi-publishing
• A hospital keeps track of the medical records collected in the last
three months.
• The microdata table T(1), and its generalization T*(1), published in
Apr. 2007.
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis
Microdata T(1)
G. ID  Age       Zipcode      Disease
1      [21, 22]  [12k, 14k]   dyspepsia
1      [21, 22]  [12k, 14k]   bronchitis
2      [23, 24]  [18k, 25k]   flu
2      [23, 24]  [18k, 25k]   gastritis
3      [36, 41]  [20k, 27k]   flu
3      [36, 41]  [20k, 27k]   gastritis
4      [37, 43]  [26k, 35k]   dyspepsia
4      [37, 43]  [26k, 35k]   flu
4      [37, 43]  [26k, 35k]   gastritis
5      [52, 56]  [33k, 34k]   dyspepsia
5      [52, 56]  [33k, 34k]   gastritis
2-diverse Generalization T*(1)
Problem with multi-publishing
• Bob was hospitalized in Mar. 2007
Name Age Zipcode
Bob 21 12000
Problem with multi-publishing
• One month later, in May 2007
• Some obsolete tuples are deleted from the microdata.
Problem with multi-publishing
• Bob’s tuple stays.
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Gary   41   20000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Steve  56   34000    gastritis
Microdata T(1) after the deletions
Problem with multi-publishing
• Some new records are inserted.
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000    gastritis
Ray    54   31000    dyspepsia
Steve  56   34000    gastritis
Tom    60   44000    gastritis
Vince  65   36000    flu
Microdata T(2)
Problem with multi-publishing
• The hospital published T*(2).
G. ID  Age       Zipcode      Disease
1      [21, 23]  [12k, 25k]   dyspepsia
1      [21, 23]  [12k, 25k]   gastritis
2      [25, 43]  [21k, 33k]   flu
2      [25, 43]  [21k, 33k]   dyspepsia
2      [25, 43]  [21k, 33k]   gastritis
3      [41, 46]  [20k, 30k]   flu
3      [41, 46]  [20k, 30k]   gastritis
4      [54, 56]  [31k, 34k]   dyspepsia
4      [54, 56]  [31k, 34k]   gastritis
5      [60, 65]  [36k, 44k]   gastritis
5      [60, 65]  [36k, 44k]   flu
2-diverse Generalization T*(2)
Problem with multi-publishing
• Consider the previous adversary.
Name Age Zipcode
Bob 21 12000
Problem with multi-publishing
• What the adversary learns from T*(1).
Name Age Zipcode
Bob 21 12000
G. ID  Age       Zipcode      Disease
1      [21, 22]  [12k, 14k]   dyspepsia
1      [21, 22]  [12k, 14k]   bronchitis
……
• What the adversary learns from T*(2).
Name Age Zipcode
Bob 21 12000
G. ID  Age       Zipcode      Disease
1      [21, 23]  [12k, 25k]   dyspepsia
1      [21, 23]  [12k, 25k]   gastritis
……
• So Bob must have contracted dyspepsia!
• A new generalization principle is needed.
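The adversary's reasoning is a set intersection over the releases. A minimal sketch, assuming Bob's disease is the same in both snapshots (his tuple stays unchanged); `remaining_candidates` is a hypothetical helper:

```python
def remaining_candidates(candidate_sets):
    """Intersect the sets of sensitive values seen in the individual's QI group per release."""
    result = set(candidate_sets[0])
    for s in candidate_sets[1:]:
        result &= set(s)
    return result


# Sensitive values in Bob's QI group in each release, taken from the slide.
bob_in_t1 = {"dyspepsia", "bronchitis"}   # group 1 of T*(1)
bob_in_t2 = {"dyspepsia", "gastritis"}    # group 1 of T*(2)

candidates = remaining_candidates([bob_in_t1, bob_in_t2])
print(candidates)
```

Each release alone leaves two candidates, but together they leave only one, even though both tables are 2-diverse.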
m-invariance [SIGMOD07]
• A sequence of generalized tables T*(1), …, T*(n)
is m-invariant, if and only if
– T*(1), …, T*(n) are m-unique, and
– each individual has the same signature in every
generalized table in which s/he is involved.
• Explanation
– m-unique: every QI group contains at least m tuples,
all with different sensitive values
– signature: the set of sensitive values in the individual’s
QI group.
m-unique
• A generalized table T*(j) is m-unique, if and only if
– each QI-group in T*(j) contains at least m tuples
– all tuples in the same QI-group have different sensitive values.
G. ID  Age       Zipcode      Disease
1      [21, 22]  [12k, 14k]   dyspepsia
1      [21, 22]  [12k, 14k]   bronchitis
2      [23, 24]  [18k, 25k]   flu
2      [23, 24]  [18k, 25k]   gastritis
3      [36, 41]  [20k, 27k]   flu
3      [36, 41]  [20k, 27k]   gastritis
4      [37, 43]  [26k, 35k]   dyspepsia
4      [37, 43]  [26k, 35k]   flu
4      [37, 43]  [26k, 35k]   gastritis
5      [52, 56]  [33k, 34k]   dyspepsia
5      [52, 56]  [33k, 34k]   gastritis
A 2-unique generalized table
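Both conditions of m-uniqueness are easy to test per QI-group. A minimal sketch over the table above, keeping only the group ID and the sensitive value of each tuple (`is_m_unique` is a hypothetical helper name):

```python
from collections import defaultdict

# (group ID, disease) pairs of the 2-unique table on this slide.
t1_star = [
    (1, "dyspepsia"), (1, "bronchitis"),
    (2, "flu"), (2, "gastritis"),
    (3, "flu"), (3, "gastritis"),
    (4, "dyspepsia"), (4, "flu"), (4, "gastritis"),
    (5, "dyspepsia"), (5, "gastritis"),
]


def is_m_unique(rows, m):
    """True if every QI-group has at least m tuples and no repeated sensitive value."""
    groups = defaultdict(list)
    for gid, sensitive in rows:
        groups[gid].append(sensitive)
    return all(len(values) >= m and len(set(values)) == len(values)
               for values in groups.values())


unique_2 = is_m_unique(t1_star, 2)
unique_3 = is_m_unique(t1_star, 3)
print(unique_2, unique_3)
```

The table is 2-unique but not 3-unique, because four of its five groups contain only two tuples.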
Signature
Name   G.ID  Age       Zipcode      Disease
Bob    1     [21, 22]  [12k, 14k]   dyspepsia
Alice  1     [21, 22]  [12k, 14k]   bronchitis
…      …     …         …            …
Jane   4     [37, 43]  [26k, 35k]   dyspepsia
Ken    4     [37, 43]  [26k, 35k]   flu
Linda  4     [37, 43]  [26k, 35k]   gastritis
…      …     …         …            …
T*(1)
• The signature of Bob in T*(1) is {dyspepsia,
bronchitis}
• The signature of Jane in T*(1) is {dyspepsia, flu,
gastritis}
The m-invariance principle
• Lemma: if a sequence of generalized tables
{T*(1), …, T*(n)} is m-invariant, then for any
individual o involved in any of these tables, we
have
risk(o) <= 1/m
The m-invariance principle
• Lemma: let {T*(1), …, T*(n-1)} be m-invariant.
{T*(1), …, T*(n-1), T*(n)} is also m-invariant, if
and only if {T*(n-1), T*(n)} is m-invariant
• Only T*(n - 1) is needed for the generation of
T*(n).
T*(1), T*(2), …, T*(n-2) can be discarded; T*(n-1) is kept to generate T*(n).
Solution idea
• Goal: Given T(n) and T*(n-1), create T*(n)
such that {T*(n-1), T*(n)} is m-invariant.
• Idea: create counterfeits.
• Optimization goal: impose as little
generalization as possible.
Name   Group-ID  Age       Zipcode      Disease
Bob    1         [21, 22]  [12k, 14k]   dyspepsia
c1     1         [21, 22]  [12k, 14k]   bronchitis
David  2         [23, 25]  [21k, 25k]   gastritis
Emily  2         [23, 25]  [21k, 25k]   flu
Jane   3         [37, 43]  [26k, 33k]   dyspepsia
c2     3         [37, 43]  [26k, 33k]   flu
Linda  3         [37, 43]  [26k, 33k]   gastritis
Gary   4         [41, 46]  [20k, 30k]   flu
Mary   4         [41, 46]  [20k, 30k]   gastritis
Ray    5         [54, 56]  [31k, 34k]   dyspepsia
Steve  5         [54, 56]  [31k, 34k]   gastritis
Tom    6         [60, 65]  [36k, 44k]   gastritis
Vince  6         [60, 65]  [36k, 44k]   flu
Counterfeited generalization T*(2)
Group-ID  Count
1         1
3         1
The auxiliary relation R(2) for T*(2)
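The signature condition of m-invariance can be checked by comparing, for every individual present in two releases, the set of sensitive values in his or her QI group. A minimal sketch using T*(1), the plain 2-diverse T*(2), and the counterfeited T*(2) from this deck; names are kept only to track individuals (they are not published), and the helper names are assumptions. The sketch checks signatures only, not m-uniqueness.

```python
def signatures(release):
    """release: list of (name, group ID, disease). Maps each name to its group's value set."""
    by_group = {}
    for _, gid, disease in release:
        by_group.setdefault(gid, set()).add(disease)
    return {name: frozenset(by_group[gid]) for name, gid, _ in release}


def same_signatures(prev, curr):
    """True if everyone present in both releases has the same signature in both."""
    sp, sc = signatures(prev), signatures(curr)
    return all(sp[n] == sc[n] for n in sp.keys() & sc.keys())


t1 = [("Bob", 1, "dyspepsia"), ("Alice", 1, "bronchitis"), ("Andy", 2, "flu"),
      ("David", 2, "gastritis"), ("Gary", 3, "flu"), ("Helen", 3, "gastritis"),
      ("Jane", 4, "dyspepsia"), ("Ken", 4, "flu"), ("Linda", 4, "gastritis"),
      ("Paul", 5, "dyspepsia"), ("Steve", 5, "gastritis")]

# Plain 2-diverse generalization of T(2), grouped by its QI intervals.
t2_plain = [("Bob", 1, "dyspepsia"), ("David", 1, "gastritis"), ("Emily", 2, "flu"),
            ("Jane", 2, "dyspepsia"), ("Linda", 2, "gastritis"), ("Gary", 3, "flu"),
            ("Mary", 3, "gastritis"), ("Ray", 4, "dyspepsia"), ("Steve", 4, "gastritis"),
            ("Tom", 5, "gastritis"), ("Vince", 5, "flu")]

# Counterfeited generalization of T(2); c1 and c2 are the counterfeit tuples.
t2_counterfeit = [("Bob", 1, "dyspepsia"), ("c1", 1, "bronchitis"),
                  ("David", 2, "gastritis"), ("Emily", 2, "flu"),
                  ("Jane", 3, "dyspepsia"), ("c2", 3, "flu"), ("Linda", 3, "gastritis"),
                  ("Gary", 4, "flu"), ("Mary", 4, "gastritis"),
                  ("Ray", 5, "dyspepsia"), ("Steve", 5, "gastritis"),
                  ("Tom", 6, "gastritis"), ("Vince", 6, "flu")]

plain_ok = same_signatures(t1, t2_plain)
counterfeit_ok = same_signatures(t1, t2_counterfeit)
print(plain_ok, counterfeit_ok)
```

The plain release changes Bob's signature from {dyspepsia, bronchitis} to {dyspepsia, gastritis}, while the counterfeits c1 and c2 restore the missing values so that every surviving individual keeps the signature from T*(1).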
Name   G.ID  Age       Zipcode      Disease
Bob    1     [21, 22]  [12k, 14k]   dyspepsia
Alice  1     [21, 22]  [12k, 14k]   bronchitis
Andy   2     [23, 24]  [18k, 25k]   flu
David  2     [23, 24]  [18k, 25k]   gastritis
Gary   3     [36, 41]  [20k, 27k]   flu
Helen  3     [36, 41]  [20k, 27k]   gastritis
Jane   4     [37, 43]  [26k, 35k]   dyspepsia
Ken    4     [37, 43]  [26k, 35k]   flu
Linda  4     [37, 43]  [26k, 35k]   gastritis
Paul   5     [52, 56]  [33k, 34k]   dyspepsia
Steve  5     [52, 56]  [33k, 34k]   gastritis
Generalization T*(1)
Counterfeited Generalization T*(2) and the auxiliary relation R(2) for T*(2): as shown above.
Name Age Zipcode
Bob 21 12000
• A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if
– T*(1), …, T*(n) are m-unique, and
– each individual has the same signature in every generalized
table in which s/he is involved.
In case of corruption…
• If an adversary learns from Alice that she has bronchitis, he can
conclude that Bob has dyspepsia: Bob and Alice are the only two tuples
in QI group 1 of the 2-diverse generalization.
G. ID  Age       Zipcode      Disease
1      [21, 22]  [12k, 14k]   dyspepsia
1      [21, 22]  [12k, 14k]   bronchitis
……
2-diverse Generalization
Anti-corruption publishing [ICDE08]
• We formalized anti-corruption publishing, by
modeling the degree of privacy preservation as
a function of an adversary’s background
knowledge.
• We proposed a solution, by integrating
generalization with
– perturbation: switch selected records’ sensitive
information.
– stratified sampling: sample some records from each
QI group.
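The slides give no further detail on the sampling step, so the following is only a generic sketch of stratified sampling with QI groups as strata. It is not the scheme of [ICDE08]; the parameter names `per_group` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, per_group, seed=0):
    """rows: list of (group ID, record). Draws up to per_group records from each QI group."""
    rng = random.Random(seed)          # fixed seed only to make the sketch repeatable
    groups = defaultdict(list)
    for gid, record in rows:
        groups[gid].append(record)
    return {gid: rng.sample(records, min(per_group, len(records)))
            for gid, records in groups.items()}


# (group ID, disease) pairs of the 2-diverse generalization used in this deck.
rows = [(1, "dyspepsia"), (1, "bronchitis"), (2, "flu"), (2, "gastritis"),
        (3, "flu"), (3, "gastritis"), (4, "dyspepsia"), (4, "flu"), (4, "gastritis"),
        (5, "dyspepsia"), (5, "gastritis")]

sample = stratified_sample(rows, per_group=1)
print(sample)
```

Sampling within each group keeps every QI group represented in the published data while releasing only part of its records.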
Summary
• Introduced the problem of privacy-preserving
publishing.
• Two principles:
– k-anonymity
– l-diversity
• Two extensions:
– multi-publishing
– corruption