Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification
Wenliang (Kevin) Du, Zhouxuan Teng, and Zutao Zhu
Department of Electrical Engineering & Computer Science
Syracuse University, Syracuse, New York
Introduction
- Privacy-Preserving Data Publishing.
- The impact of background knowledge:
  - How does it affect privacy?
  - How to measure its impact on privacy?
- Integrate background knowledge into privacy quantification:
  - Privacy-MaxEnt: a systematic approach.
  - Based on well-established theories.
- Evaluation.
Privacy-Preserving Data Publishing
- Data disguise methods:
  - Randomization
  - Generalization (e.g., Mondrian)
  - Bucketization (e.g., Anatomy)
- Our Privacy-MaxEnt method can be applied to both Generalization and Bucketization.
- We pick Bucketization for this presentation.
Data Sets
[Table: the original data set has Identifier, Quasi-Identifier (QI), and Sensitive Attribute (SA) columns.]
Bucketized Data
[Table: the bucketized release keeps the Quasi-Identifier (QI) and Sensitive Attribute (SA) columns, linked only by bucket ID.]
- P( Breast cancer | {female, college}, bucket = 1 ) = 1/4
- P( Breast cancer | {female, junior}, bucket = 2 ) = 1/3
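The published numbers above follow directly from within-bucket counting. A minimal sketch, with hypothetical bucket contents (not the paper's actual table):

```python
# Under bucketization, the QI and SA tables are linked only by bucket ID, so
# P(s | q, bucket b) = (# tuples in bucket b with SA value s) / |bucket b|.
from collections import Counter

# bucket id -> published sensitive values for that bucket (hypothetical)
sa_table = {1: ["Breast cancer", "Flu", "Flu", "Diabetes"],
            2: ["Breast cancer", "Cold", "Diabetes"]}

def p_sa_given_bucket(s, b):
    values = sa_table[b]
    return Counter(values)[s] / len(values)

print(p_sa_given_bucket("Breast cancer", 1))  # 1/4 = 0.25
print(p_sa_given_bucket("Breast cancer", 2))  # 1/3 ≈ 0.333
```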
Impact of Background Knowledge
- Background knowledge: it is rare for a male to have breast cancer.
- With this knowledge, an adversary can discount breast cancer for the male tuples in a bucket, raising the estimated probability for the female tuples above the published value.
- This analysis is hard for large data sets.
Previous Studies
- Martin, et al. ICDE'07:
  - First formal study on background knowledge.
  - Deterministic knowledge.
- Chen, LeFevre, Ramakrishnan. VLDB'07:
  - Improves the previous work.
- Both deal with rule-based knowledge.
- Background knowledge can be much more complicated:
  - Uncertain knowledge.
Complicated Background Knowledge
- Rule-based knowledge:
  - P(s | q) = 1.
  - P(s | q) = 0.
- Probability-based knowledge:
  - P(s | q) = 0.2.
  - P(s | Alice) = 0.2.
- Vague background knowledge:
  - 0.3 ≤ P(s | q) ≤ 0.5.
- Miscellaneous types:
  - P(s | q1) + P(s | q2) = 0.7.
  - One of Alice and Bob has "Lung Cancer".
Challenges
- How to analyze privacy in a systematic way for large data sets and complicated background knowledge?
- What do we want to compute?
  - P( S | Q ), given the background knowledge and the published data set.
  - P( S | Q ) is the primitive for most privacy metrics.
  - Directly computing P( S | Q ) is hard.
Our Approach
- Consider P( S | Q ) as a variable x (a vector).
- Background knowledge → constraints on x.
- Published data (public information) → constraints on x.
- Solve for x: the most unbiased solution satisfying all constraints.
Maximum Entropy Principle
"Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum-entropy estimate. It is the least biased estimate possible on the given information."
— E. T. Jaynes, 1957.
The MaxEnt Approach
- Background knowledge → constraints on P( S | Q ).
- Published data (public information) → constraints on P( S | Q ).
- Maximum entropy estimate → estimate of P( S | Q ).
Entropy
- Conditional entropy: H(S | Q, B) = − Σ_{Q,S,B} P(Q, B) P(S | Q, B) log P(S | Q, B).
- Because H(S | Q, B) = H(Q, S, B) − H(Q, B), we can work with the joint entropy instead:
  - H(Q, S, B) = − Σ_{Q,S,B} P(Q, S, B) log P(Q, S, B).
- Constraints should therefore use P(Q, S, B) as the variables (the identity is checked numerically below).
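A quick numeric check of the identity H(S | Q, B) = H(Q, S, B) − H(Q, B), using an arbitrary random joint distribution (assumed dimensions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((4, 3, 2))    # joint P(Q, S, B): 4 QI values, 3 SA values, 2 buckets
p /= p.sum()

H_qsb = -np.sum(p * np.log(p))               # H(Q, S, B)
p_qb = p.sum(axis=1)                         # marginal P(Q, B)
H_qb = -np.sum(p_qb * np.log(p_qb))          # H(Q, B)
p_s_given = p / p_qb[:, None, :]             # P(S | Q, B)
H_s_given = -np.sum(p * np.log(p_s_given))   # H(S | Q, B)

assert np.isclose(H_s_given, H_qsb - H_qb)   # the identity holds
```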
Maximum Entropy Estimate
- Let vector x = P(Q, S, B).
- Find the value of x that maximizes the entropy H(Q, S, B), while satisfying:
  - h1(x) = c1, …, hu(x) = cu : equality constraints
  - g1(x) ≤ d1, …, gv(x) ≤ dv : inequality constraints
- A special case of non-linear programming; a solver sketch follows below.
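A minimal sketch of this non-linear program using SciPy's SLSQP (not the tools the authors used); the problem size and the single normalization constraint are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

n = 12                                       # hypothetical number of (q, s, b) variables
A_eq = np.ones((1, n)); c = np.array([1.0])  # one h(x) = c constraint: sum(x) = 1

def neg_entropy(x):
    x = np.clip(x, 1e-12, 1.0)               # guard against log(0)
    return np.sum(x * np.log(x))             # minimizing this maximizes H(Q, S, B)

res = minimize(neg_entropy, np.full(n, 1.0 / n), method="SLSQP",
               bounds=[(0.0, 1.0)] * n,      # g(x) <= d constraints: 0 <= x <= 1
               constraints=[{"type": "eq", "fun": lambda x: A_eq @ x - c}])
print(res.x)  # with only the normalization constraint, the MaxEnt answer is uniform
```

With the knowledge and data constraints stacked in as additional rows, the same call applies unchanged.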
Constraints from Knowledge
- Background knowledge → constraints on P(Q, S, B).
- Linear model: quite generic.
- Conditional probability: P( S | Q ) = P(Q, S) / P(Q).
- Background knowledge has nothing to do with B:
  - P(Q, S) = P(Q, S, B=1) + … + P(Q, S, B=m).
- For example, P(s | q) = 0.2 becomes the linear constraint Σ_b P(q, s, b) − 0.2 · Σ_{s′,b} P(q, s′, b) = 0 (see the sketch below).
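A sketch of that linearization with an assumed flat variable layout (the index scheme and domain sizes are illustrative, not the paper's):

```python
import numpy as np

n_q, n_s, n_b = 4, 3, 2                  # hypothetical domain sizes

def idx(q, s, b):
    # flatten the triple (q, s, b) into a position in the vector x = P(Q, S, B)
    return (q * n_s + s) * n_b + b

def knowledge_row(q, s, alpha):
    # P(s | q) = alpha  <=>  sum_b x[q,s,b] - alpha * sum_{s',b} x[q,s',b] = 0
    row = np.zeros(n_q * n_s * n_b)
    for sp in range(n_s):
        for b in range(n_b):
            row[idx(q, sp, b)] -= alpha  # -alpha * P(q)
    for b in range(n_b):
        row[idx(q, s, b)] += 1.0         # +P(q, s)
    return row                           # equality constraint: row @ x = 0

A_knowledge = knowledge_row(q=0, s=1, alpha=0.2)
```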
Constraints from Published Data
- Published data set D′ → constraints on P(Q, S, B).
- The constraints must state the truth and only the truth:
  - Absolutely correct for the original data set.
  - No inference.
Assignment and Constraints
- Observation: the original data set is one of the possible assignments.
- Constraint: must hold for all possible assignments.
QI Constraint
- Constraint: Σ_{j=1}^{h} P(q, s_j, b) = P(q, b).
- Example: P(q1, s1, 1) + P(q1, s2, 1) + P(q1, s3, 1) = P(q1, 1) = 0.2.
SA Constraint
- Constraint: Σ_{i=1}^{g} P(q_i, s, b) = P(s, b).
- Example: P(q1, s4, 2) + P(q3, s4, 2) + P(q4, s4, 2) = P(s4, 2) = 0.1.
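A construction sketch for both constraint families, reading the right-hand sides P(q, b) and P(s, b) off small hypothetical published tables:

```python
from collections import Counter

qi_table = [("q1", 1), ("q1", 1), ("q2", 1), ("q1", 2), ("q3", 2)]  # (QI, bucket)
sa_table = [("s1", 1), ("s2", 1), ("s3", 1), ("s4", 2), ("s4", 2)]  # (SA, bucket)
N = len(qi_table)

qi_rhs = {qb: n / N for qb, n in Counter(qi_table).items()}  # exact P(q, b)
sa_rhs = {sb: n / N for sb, n in Counter(sa_table).items()}  # exact P(s, b)

sa_in_bucket = {b: {s for s, bb in sa_table if bb == b} for _, b in qi_table}
qi_in_bucket = {b: {q for q, bb in qi_table if bb == b} for _, b in qi_table}

# QI constraints: sum_j P(q, s_j, b) = P(q, b); SA constraints: sum_i P(q_i, s, b) = P(s, b)
qi_constraints = [([(q, s, b) for s in sa_in_bucket[b]], rhs)
                  for (q, b), rhs in qi_rhs.items()]
sa_constraints = [([(q, s, b) for q in qi_in_bucket[b]], rhs)
                  for (s, b), rhs in sa_rhs.items()]
print(qi_constraints[0], sa_constraints[-1], sep="\n")
```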
Zero Constraint
- P(q, s, b) = 0 if q or s does not appear in bucket b.
- This lets us reduce the number of variables, as sketched below.
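A sketch of the reduction: rather than feeding zero constraints to the solver, simply omit those variables (bucket contents here are hypothetical):

```python
# keep a variable P(q, s, b) only when q and s both occur in bucket b
qi_in_bucket = {1: {"q1", "q2"}, 2: {"q1", "q3", "q4"}}
sa_in_bucket = {1: {"s1", "s2", "s3"}, 2: {"s4"}}

variables = [(q, s, b) for b in qi_in_bucket
             for q in qi_in_bucket[b] for s in sa_in_bucket[b]]
print(len(variables))   # far fewer than |Q| x |S| x |B| in sparse buckets
```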
Theoretical Properties
- Soundness: are the constraints correct?
  - Easy to prove.
- Completeness: have we missed any constraint?
  - See our theorems and proofs.
- Conciseness: are there redundant constraints?
  - Only one redundant constraint in each bucket.
- Consistency: is our approach consistent with the existing methods (i.e., when the background knowledge is Ø)?
Completeness w.r.t Equations
- Have we missed any equality constraint?
  - Yes, there are other valid ones: if F1 = C1 and F2 = C2 are constraints, then F1 + F2 = C1 + C2 is too. However, it is redundant.
- Completeness Theorem:
  - Let U be our constraint set.
  - All valid linear equality constraints can be written as linear combinations of the constraints in U.
Completeness w.r.t Inequalities
- Have we missed any inequality constraint?
  - Yes, but only redundant ones: if F = C holds, then F ≤ C + 0.2 is also valid (and redundant).
- Completeness Theorem:
  - Our constraint set is also complete in the inequality sense.
Putting Them Together
- Background knowledge → constraints on P( S | Q ).
- Published data (public information) → constraints on P( S | Q ).
- Maximum entropy estimate → estimate of P( S | Q ).
- Tools: LBFGS, TOMLAB, KNITRO, etc.
Inevitable Questions
- Where do we get background knowledge? Do we have to be very knowledgeable?
- For the P(s | q) type of knowledge:
  - All useful knowledge is in the original data set.
  - Association rules:
    - Positive: Q → S
    - Negative: Q → ¬S, ¬Q → S, ¬Q → ¬S
- We bound the knowledge in our study:
  - Top-K strongest association rules.
Knowledge about Individuals
- Alice: (i1, q1); Bob: (i4, q2); Charlie: (i9, q5).
- Knowledge 1: Alice has either s1 or s4.
  - Constraint: P(i1, q1, s1, 1) + P(i1, q1, s1, 2) + P(i1, q1, s4, 2) = P(i1, q1) = 1/N.
- Knowledge 2: two people among Alice, Bob, and Charlie have s4.
  - Constraint: P(i1, q1, s4, 2) + P(i4, q2, s4, 3) + P(i9, q5, s4, 3) = 2/N.
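The same machinery applies; only the variables now carry identifiers. A sketch of Knowledge 2 as a constraint record, with an assumed table size N:

```python
N = 12   # hypothetical number of records in the table

# "two of Alice, Bob, Charlie have s4": the mass on their s4 slots sums to 2/N
terms = [("i1", "q1", "s4", 2),   # Alice's only possible (s4, bucket) slot
         ("i4", "q2", "s4", 3),   # Bob's
         ("i9", "q5", "s4", 3)]   # Charlie's
constraint = (terms, 2 / N)       # equality: sum of P(i, q, s4, b) over terms = 2/N
```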
Evaluation
- Implementation:
  - Lagrange multipliers: constrained optimization → unconstrained optimization.
  - LBFGS: solving the unconstrained optimization problem (a sketch follows below).
- Hardware: Pentium 3 GHz CPU with 4 GB of memory.
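A minimal stand-in for that pipeline (not the authors' implementation): fold the linear equality constraints into the objective, here with a simple quadratic penalty in place of the full Lagrange-multiplier treatment, then hand the unconstrained problem to L-BFGS:

```python
import numpy as np
from scipy.optimize import minimize

def penalized_neg_entropy(x, A, c, rho=1e3):
    p = np.clip(x, 1e-12, None)
    # negative entropy plus a penalty that grows as A x strays from c
    return np.sum(p * np.log(p)) + rho * np.sum((A @ x - c) ** 2)

# hypothetical problem: 6 variables, normalization constraint only
A = np.ones((1, 6)); c = np.array([1.0])
res = minimize(penalized_neg_entropy, np.full(6, 1.0 / 6), args=(A, c),
               method="L-BFGS-B", bounds=[(0.0, 1.0)] * 6)
print(res.x)   # approximately uniform, as MaxEnt predicts
```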
Privacy versus Knowledge
- Estimation accuracy: KL distance between P_MaxEnt( S | Q ) and P_Original( S | Q ); a sketch of the metric follows below.
[Chart slides: Privacy vs. Knowledge; Privacy vs. # of QI Attributes; Performance vs. Knowledge; Running Time vs. Data Size; Iterations vs. Data Size.]
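A sketch of the accuracy metric for a single QI group; the distributions are made-up placeholders:

```python
import numpy as np

def kl_distance(p_orig, p_maxent, eps=1e-12):
    # D( P_Original(S | q) || P_MaxEnt(S | q) ) for one QI value q
    p = np.clip(np.asarray(p_orig, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(p_maxent, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# hypothetical distributions over three sensitive values
print(kl_distance([0.25, 0.50, 0.25], [0.30, 0.40, 0.30]))
```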
Conclusion
- Privacy-MaxEnt is a systematic method:
  - Models various types of background knowledge.
  - Models the information from the published data.
  - Based on well-established theory.
- Future work:
  - Reducing the number of constraints.
  - Vague background knowledge.
  - Background knowledge about individuals.