Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification
Wenliang (Kevin) Du,
Zhouxuan Teng,
and Zutao Zhu.
Department of Electrical Engineering & Computer Science
Syracuse University, Syracuse, New York.
Introduction
Privacy-Preserving Data Publishing.
The impact of background knowledge:
Integrate background knowledge in privacy
quantification.
How does it affect privacy?
How to measure its impact on privacy?
Privacy-MaxEnt: A systematic approach.
Based on well-established theories.
Evaluation.
Privacy-Preserving Data Publishing
Data disguise methods
Randomization
Generalization (e.g. Mondrian)
Bucketization (e.g. Anatomy)
Our Privacy-MaxEnt method can be applied
to Generalization and Bucketization.
We pick Bucketization in our presentation.
Data Sets
Identifier
Quasi-Identifier (QI)
Sensitive Attribute (SA)
Bucketized Data
Quasi-Identifier (QI)
Sensitive Attribute (SA)
P( Breast cancer | {female, college}, bucket=1 ) = 1/4
P( Breast cancer | {female, junior}, bucket=2 ) = 1/3
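The bucket probabilities above come from simple counting within each bucket. A minimal sketch with hypothetical records (the data and record layout are illustrative, not from the paper):

```python
# Hypothetical bucketized records: (bucket, QI tuple, sensitive value).
# Bucket 1 holds four records whose QI generalizes to (female, college);
# one of them has "Breast cancer", so P(Breast cancer | q, bucket=1) = 1/4.
records = [
    (1, ("female", "college"), "Breast cancer"),
    (1, ("female", "college"), "Flu"),
    (1, ("female", "college"), "Flu"),
    (1, ("female", "college"), "Diabetes"),
]

def p_s_given_q_bucket(records, q, s, bucket):
    """Estimate P(s | q, bucket) by counting within the bucket."""
    group = [r for r in records if r[0] == bucket and r[1] == q]
    if not group:
        return 0.0
    return sum(1 for r in group if r[2] == s) / len(group)

print(p_s_given_q_bucket(records, ("female", "college"), "Breast cancer", 1))
# 0.25
```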
Impact of Background Knowledge
Background Knowledge:
It is rare for males to have breast cancer.
This analysis is hard for large data sets.
Previous Studies
Martin et al., ICDE'07:
First formal study on background knowledge.
Chen, LeFevre, and Ramakrishnan, VLDB'07:
Improves the previous work.
Both deal with rule-based, deterministic knowledge.
Background knowledge can be much more complicated:
Uncertain knowledge.
Complicated Background Knowledge
Rule-based knowledge:
P(s | q) = 1.
P(s | q) = 0.
Probability-based knowledge:
P(s | q) = 0.2.
P(s | Alice) = 0.2.
Vague background knowledge:
0.3 ≤ P(s | q) ≤ 0.5.
Miscellaneous types:
P(s | q1) + P(s | q2) = 0.7.
One of Alice and Bob has “Lung Cancer”.
Challenges
How to analyze privacy in a systematic way
for large data sets and complicated
background knowledge?
What do we want to compute?
P( S | Q ), given the background knowledge and
the published data set.
P(S | Q) is a primitive for most privacy metrics.
Directly computing P( S | Q ) is hard.
Our Approach
Consider P( S | Q ) as variable x (a vector).
Background
Knowledge
Constraints
on x
Solve x
Published Data
Constraints
on x
Most unbiased solution
Public Information
Maximum Entropy Principle
“Information theory provides a constructive
criterion for setting up probability
distributions on the basis of partial
knowledge, and leads to a type of statistical
inference which is called the maximum
entropy estimate. It is the least biased estimate
possible on the given information.”
— E. T. Jaynes, 1957.
The MaxEnt Approach
Background
Knowledge
Constraints
on P( S | Q )
Maximum Entropy Estimate
Estimate P( S | Q )
Published Data
Public Information
Constraints
on P( S | Q )
Entropy
Entropy: $H(S \mid Q, B) = -\sum_{Q,S,B} P(Q,B)\,P(S \mid Q,B)\,\log P(S \mid Q,B)$.
Because $H(S \mid Q, B) = H(Q,S,B) - H(Q,B)$, we can instead maximize
Entropy: $H(Q,S,B) = -\sum_{Q,S,B} P(Q,S,B)\,\log P(Q,S,B)$.
Constraints should use P(Q, S, B) as variables.
Maximum Entropy Estimate
Let vector x = P(Q, S, B).
Find the value for x that maximizes its
entropy H(Q, S, B), while satisfying
h1(x) = c1, …, hu(x) = cu : equality constraints
g1(x) ≤ d1, …, gv(x) ≤ dv : inequality constraints
A special case of Non-Linear Programming.
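Spelled out, with $x = P(Q,S,B)$ as the variable vector (the simplex conditions hold because x is a probability distribution), the program reads:

```latex
\max_{x}\; H(Q,S,B) = -\sum_{q,s,b} x_{q,s,b}\,\log x_{q,s,b}
\quad\text{s.t.}\quad
\begin{aligned}
h_1(x) &= c_1,\ \dots,\ h_u(x) = c_u,\\
g_1(x) &\le d_1,\ \dots,\ g_v(x) \le d_v,\\
x_{q,s,b} &\ge 0, \qquad \textstyle\sum_{q,s,b} x_{q,s,b} = 1.
\end{aligned}
```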
Constraints from Knowledge
Background
Knowledge
Linear model: quite generic.
Conditional probability:
Constraints
on P(Q, S, B)
P (S | Q) = P(Q, S) / P(Q).
Background knowledge has nothing to do with B:
P(Q, S) = P(Q, S, B=1) + … + P(Q, S, B=m).
Constraints from Published Data
Published Data Set
D’
Constraints
on P(Q, S, B)
Constraints
Truth and only the truth.
Absolutely correct for the original data set.
No inference.
Assignment and Constraints
Observation: the original data is one of the assignments
Constraint: true for all possible assignments
QI Constraint
Constraint: $\sum_{j=1}^{h} P(q, s_j, b) = P(q, b)$
Example:
$P(q_1, s_1, 1) + P(q_1, s_2, 1) + P(q_1, s_3, 1) = P(q_1, 1) = 0.2$
SA Constraint
Constraint: $\sum_{i=1}^{g} P(q_i, s, b) = P(s, b)$
Example:
$P(q_1, s_4, 2) + P(q_3, s_4, 2) + P(q_4, s_4, 2) = P(s_4, 2) = 0.1$
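Both constraint families are linear in the variables P(q, s, b), so each one is just a 0/1 coefficient row. A minimal sketch for a hypothetical one-bucket setting (variable indexing and sizes are illustrative):

```python
from itertools import product

# Toy bucket b=1 with QI values q1, q2 and SA values s1, s2.
# The variables are the joint probabilities P(q, s, b).
qs = ["q1", "q2"]
ss = ["s1", "s2"]
var_index = {v: i for i, v in enumerate(product(qs, ss, [1]))}

def qi_constraint_row(q, b):
    """Coefficient row for  sum_j P(q, s_j, b) = P(q, b)."""
    row = [0.0] * len(var_index)
    for s in ss:
        row[var_index[(q, s, b)]] = 1.0
    return row

def sa_constraint_row(s, b):
    """Coefficient row for  sum_i P(q_i, s, b) = P(s, b)."""
    row = [0.0] * len(var_index)
    for q in qs:
        row[var_index[(q, s, b)]] = 1.0
    return row

print(qi_constraint_row("q1", 1))  # 1.0 at every (q1, s_j, 1) variable
print(sa_constraint_row("s1", 1))  # 1.0 at every (q_i, s1, 1) variable
```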
Zero Constraint
P(q, s, b) = 0, if q or s does not appear in
Bucket b.
We can reduce the number of variables.
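The saving from zero constraints is easy to quantify: only the (q, s, b) triples where both q and s occur in bucket b need a variable. A small sketch with hypothetical bucket contents:

```python
# Variables P(q, s, b) exist only when both q and s appear in bucket b;
# the zero constraints let us drop every other triple outright.
# Hypothetical bucket contents:
bucket_qi = {1: {"q1", "q2"}, 2: {"q3"}}
bucket_sa = {1: {"s1", "s2"}, 2: {"s2", "s3"}}

all_q = {"q1", "q2", "q3"}
all_s = {"s1", "s2", "s3"}

full = len(all_q) * len(all_s) * len(bucket_qi)           # naive count
kept = sum(len(bucket_qi[b]) * len(bucket_sa[b]) for b in bucket_qi)

print(full, kept)  # 18 6: two thirds of the variables are forced to zero
```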
Theoretic Properties
Soundness: Are they correct?
Completeness: Have we missed any constraint?
See our theorems and proofs.
Conciseness: Are there redundant constraints?
Easy to prove.
Only one redundant constraint in each bucket.
Consistency: Is our approach consistent with the existing methods (i.e., when the background knowledge is ∅)?
Completeness w.r.t Equations
Have we missed any equality constraint?
Yes!
If F1 = C1 and F2 = C2 are constraints, F1 + F2 =
C1 + C2 is too. However, it is redundant.
Completeness Theorem:
U: our constraint set.
All linear constraints can be written as the linear
combinations of the constraints in U.
Completeness w.r.t Inequalities
Have we missed any inequality constraint?
Yes!
If F = C, then F ≤ C+0.2 is also valid (redundant).
Completeness Theorem:
Our constraint set is also complete in the
inequality sense.
Putting Them Together
Tools: LBFGS,
TOMLAB,
KNITRO, etc.
Background
Knowledge
Constraints
on P( S | Q )
Maximum Entropy Estimate
Estimate P( S | Q )
Published Data
Public Information
Constraints
on P( S | Q )
Inevitable Questions:
Where do we get background knowledge?
Do we have to be very very knowledgeable?
For P (s | q) type of knowledge:
All useful knowledge is in the original data set.
Association rules:
Positive: Q → S
Negative: Q → ¬S, ¬Q → S, ¬Q → ¬S
Bound the knowledge in our study.
Top-K strongest association rules.
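Selecting the top-K strongest positive rules amounts to ranking Q → S pairs by confidence P(s | q) in the original data. A minimal sketch over hypothetical records:

```python
from collections import Counter

# Hypothetical original records: (QI value, sensitive value).
data = [("q1", "s1"), ("q1", "s1"), ("q1", "s2"),
        ("q2", "s2"), ("q2", "s2"), ("q2", "s2")]

def top_k_rules(data, k):
    """Return the k positive rules Q -> S with the highest confidence P(s | q)."""
    q_counts = Counter(q for q, _ in data)
    qs_counts = Counter(data)
    rules = [((q, s), n / q_counts[q]) for (q, s), n in qs_counts.items()]
    return sorted(rules, key=lambda r: -r[1])[:k]

print(top_k_rules(data, 2))  # highest-confidence rules first
```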
Knowledge about Individuals
Alice: (i1, q1)
Bob: (i4, q2)
Charlie: (i9, q5)
Knowledge 1: Alice has either s1 or s4.
Constraint:
$P(i_1, q_1, s_1, 1) + P(i_1, q_1, s_1, 2) + P(i_1, q_1, s_4, 2) = P(i_1, q_1) = 1/N$
Knowledge 2: Two people among Alice, Bob, and Charlie have s4.
Constraint:
$P(i_1, q_1, s_4, 2) + P(i_4, q_2, s_4, 3) + P(i_9, q_5, s_4, 3) = 2/N$
Evaluation
Implementation:
Lagrange multipliers:
Constrained Optimization → Unconstrained Optimization
LBFGS: solving the unconstrained optimization
problem.
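For marginal (QI/SA-style) constraints, coordinate ascent on the Lagrange dual of the MaxEnt problem reduces to iterative proportional fitting, which gives a tiny self-contained illustration of the constrained-to-unconstrained idea (the 2×2 table and target marginals below are hypothetical; the paper's actual solver uses L-BFGS on the full dual):

```python
# x[i][j] plays the role of P(q_i, s_j); each scaling factor corresponds
# to exponentiating one Lagrange multiplier.
row_targets = [0.6, 0.4]   # P(q_i): QI-style constraints
col_targets = [0.5, 0.5]   # P(s_j): SA-style constraints

x = [[0.25, 0.25], [0.25, 0.25]]  # start from the uniform distribution

for _ in range(200):
    # Scale rows to match the QI marginals (one dual variable per row) ...
    for i in range(2):
        r = sum(x[i])
        x[i] = [v * row_targets[i] / r for v in x[i]]
    # ... then scale columns to match the SA marginals.
    for j in range(2):
        c = x[0][j] + x[1][j]
        for i in range(2):
            x[i][j] *= col_targets[j] / c

print(x)  # rows sum to 0.6 / 0.4, columns to 0.5 / 0.5
```

With no other constraints, the fixed point is the maximum-entropy table consistent with both sets of marginals.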
Pentium 3 GHz CPU with 4 GB of memory.
Privacy versus Knowledge
Estimation Accuracy:
KL distance between $P_{MaxEnt}(S \mid Q)$ and $P_{Original}(S \mid Q)$.
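The accuracy metric is the standard KL divergence between the two conditional distributions; a minimal sketch with hypothetical distributions:

```python
from math import log

def kl(p, q):
    """KL distance D(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_original = [0.25, 0.75]   # hypothetical P_Original(S | q)
p_maxent = [0.5, 0.5]       # hypothetical P_MaxEnt(S | q)
print(kl(p_original, p_maxent))  # 0 only when the estimate is exact
```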
Privacy versus # of QI attributes
Performance vs. Knowledge
Running Time vs. Data Size
Iteration vs. Data size
Conclusion
Privacy-MaxEnt is a systematic method
Model various types of knowledge
Model the information from the published data
Based on well-established theory.
Future work
Reducing the # of constraints
Vague background knowledge
Background knowledge about individuals