k-Jump Strategy for Preserving Privacy in Micro-Data Disclosure
Wen Ming Liu (1), Lingyu Wang (1), and Lei Zhang (2)
(1) Concordia University  (2) George Mason University
Center for Secure Information Systems (CSIS) / Concordia Institute for Information Systems Engineering (CIISE)
ICDT 2010, March 23, 2010

Slide 2: Agenda
  Background
  K-Jump Strategy
  Data Utility Comparison
  Conclusion

Slide 3: Agenda
  Background
    Example
    Algorithm anaive and asafe
  K-Jump Strategy
  Data Utility Comparison
  Conclusion

Slide 4: Example: Data Holder's View

Slide 5: Example – Data Holder's View
Micro-data table t0 (Name: identifier. DoB: quasi-identifier. Condition: sensitive attribute.)

  Name     DoB   Condition
  Alice    1990  flu
  Bob      1985  cold
  Charlie  1974  cancer
  David    1962  cancer
  Eve      1953  headache
  Fen      1941  toothache

Goal: release the table so that it satisfies 2-diversity.

Generalization g1(t0) (2-diversity? Checked but unused):

  DoB        Condition
  1980~1999  flu, cold
  1960~1979  cancer, cancer
  1940~1959  headache, toothache

Generalization g2(t0) (2-diversity? Released!):

  DoB        Condition
  1970~1999  flu, cold, cancer
  1940~1969  cancer, headache, toothache

Generalization algorithm: consider generalization function g1 and then g2, in order.

Slide 6: Example (cont.): Adversary's View

Slide 7: Example (cont.) – Adversary's View
The adversary knows: the released generalization, public knowledge, and the privacy property.
Goal: guess what the micro-data table is.
Public knowledge: Name and DoB of Alice (1990), Bob (1985), Charlie (1974), David (1962), Eve (1953), and Fen (1941); every Condition is unknown (???).
Released: generalization g2(t0), as on slide 5.
What can the adversary infer? The three persons in each group may have the three conditions in any given order. This yields the permutation set t1 … t36 (A to F stand for Alice to Fen; col = cold, can = cancer, hac = headache, tac = toothache):

       t1   t2   t3   t4   …  t35  t36
  A    flu  flu  col  col  …  can  can
  B    col  can  flu  can  …  flu  col
  C    can  col  can  flu  …  col  flu
  D    can  can  can  can  …  tac  tac
  E    hac  hac  hac  hac  …  hac  hac
  F    tac  tac  tac  tac  …  can  can

Slide 8: Example (cont.)
The permutation set t1 … t36 (table on slide 7) would be the adversary's best guess of the micro-data table if the released generalization were his/her only knowledge. However …

Slide 9: Example (cont.) – Adversary Simulating the Algorithm
However, the adversary also knows the generalization algorithm and can simulate it to further exclude some invalid guesses.

Slide 10: Example (cont.) – Adversary Simulating the Algorithm
Mental image: the permutation set t1 … t36. Simulating the algorithm: for each possible table ti the adversary asks, "Is this a valid guess of the micro-data table? Let's try to check it using the algorithm!"
  If g1(ti) satisfies privacy, the algorithm would have released g1(ti) rather than g2, so ti is not a valid guess.
  If g1(ti) violates privacy, as g1(t0) did (checked but unused), ti remains a valid guess.
The valid guesses form the disclosure set:

  Name     DoB   t1         t3         t7         t9
  Alice    1990  flu        cold       flu        cold
  Bob      1985  cold       flu        cold       flu
  Charlie  1974  cancer     cancer     cancer     cancer
  David    1962  cancer     cancer     cancer     cancer
  Eve      1953  headache   headache   toothache  toothache
  Fen      1941  toothache  toothache  headache   headache

Slide 11: Decision Process of Safe and Unsafe Algorithms
Most existing generalization algorithms (without considering this problem): anaive evaluates the permutation set, i.e. the adversary's mental image of the micro-data table without knowledge of the algorithm.
[Figure: evaluation path of anaive on t0 through iterations g1, g2, …, gi, …, gn; at iteration i the privacy property is evaluated on per_i; Y releases gi(t0), N moves to the next iteration.]
Safe generalization algorithms (Zhang'07ccs, ….): asafe evaluates the disclosure set instead, i.e. the adversary's mental image of the micro-data table after simulating the algorithm.
[Figure: evaluation path of asafe on t0 through iterations g1, g2, …, gi, …, gn, each labelled with per_i and ds_i; at iteration i the privacy property is evaluated on ds_i; Y releases gi(t0), N moves to the next iteration.]
Legend: box: the ith iteration; diamond: an evaluation of the privacy property; per: permutation set; ds: disclosure set; arrows: evaluation path.

Slide 12: Agenda
  Background
  K-Jump Strategy
    The Algorithm Family ajump(k)
    Properties of ajump(k)
  Data Utility Comparison
  Conclusion

Slide 13: The Algorithm Family ajump(k)
[Figure: decision process of ajump(k) on t0 with boxes g1, g2, …, g2+k, …, gn; each iteration i has two diamonds, per_i and ds_i, with Y/N branches leading to the release of gi(t0) or to a later iteration.]
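The decision process of ajump(k) can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `per_ok` and `ds_ok` are hypothetical oracles that report whether the privacy property holds on the permutation set and on the disclosure set at iteration i, a failed permutation-set check is assumed to move on by one iteration, and a failed disclosure-set check is assumed to jump from i to i+k (skipping the next k-1 iterations). In the real family the disclosure set itself depends on the jump distance, which the fixed oracles below ignore.

```python
from typing import Callable, Optional


def k_jump(n: int, k: int,
           per_ok: Callable[[int], bool],
           ds_ok: Callable[[int], bool]) -> Optional[int]:
    """Return the 1-based index of the released generalization, or None.

    per_ok(i) / ds_ok(i) are hypothetical oracles for the privacy property
    on the permutation set / disclosure set of g_i(t0).
    """
    i = 1
    while i <= n:
        if not per_ok(i):
            i += 1        # permutation set fails: try the next function
        elif ds_ok(i):
            return i      # both checks pass: release g_i(t0)
        else:
            i += k        # penalty: jump over the next k-1 functions
    return None           # nothing can be released


# Toy oracles (assumptions): per passes from g2 on, ds passes only at g3.
per_from_2 = lambda i: i >= 2
ds_only_3 = lambda i: i == 3

released_1_jump = k_jump(5, 1, per_from_2, ds_only_3)
released_2_jump = k_jump(5, 2, per_from_2, ds_only_3)
```

With these toy oracles the 1-jump run stops at g3, while the 2-jump run skips g3 and releases nothing, which illustrates why different jump distances can lead to different data utility.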
naive strategy: evaluate the privacy property on the permutation set only (efficient but unsafe)
safe strategy: evaluate the privacy property on the disclosure set directly (safe but costly)
k-jump strategy: penalize by jumping over the next k-1 iterations

Slide 14: Properties of ajump(k)
[Figure: the ajump(k) decision process with boxes g1, g2, …, g2+k, …, gn and diamonds per_i and ds_i.]
Computation of the disclosure set
  asafe: to compute ds(gi(t0)), one must first compute ds(gj(t)) for all t in per(gi(t0)) and j = 1, 2, …, i-1.
  ajump: to compute ds(gi(t0)) (2 < i < 2+k), there is no longer any need to compute ds(g2(t)) for all t in per(gi(t0)).
ds(g1(t0)) and ds(g2(t0))
  ds(g1(t0)) = per(g1(t0))
  ds(g2(t0)) is independent of the distance vector.
Size of the family
  There are (n-1)! different jump distance vectors.

Slide 15: Agenda
  Background
  K-Jump Strategy
  Data Utility Comparison
    Construction for Theorem 1: 1-jump and i-jump (1<i) incomparable
    Construction for Theorem 2: i-jump and j-jump (1<i<j) incomparable
    Construction for Theorem 3: K1-jump and K2-jump (K1, K2: vectors) incomparable
    Construction for Proposition 2: Reusing generalization functions
    Results on asafe and ajump(1)
  Conclusion

Slide 16: Construction for Theorem 1: 1-jump and i-jump (1<i) incomparable
[Table: QID rows A to O under g1, g2, g3, …; every column lists the sensitive values C0, C1, C2, C3, C4, C5, C6, C6, C6, C7, C7, C8, C8, C9, C9; the group boundaries that distinguish the functions are not preserved in this transcript.]

       S1    S2     S3     S4
  A    C0    C0     C0     C0
  B    C1    C1     C1     C1
  C    C2    C2     C2     C2
  D    C3    C3     C3     C3
  E    C4    C4     C4     C4
  F    C5    C5     C5     C5
  G    C6    C6     C6     C6
  H    C6    C6     C6     C6
  I    C6    C8/C9  C7/C9  C7/C8
  J    C7    C6     C6     C6
  K    C7    C8     C7     C7
  L    C8    C9     C9     C8
  M    C8    C8/C9  C7/C9  C7/C8
  N    C9    C7     C8     C9
  O    C9    C7     C8     C9
  #    4320  1152   1152   1152

Privacy property: the highest ratio of a sensitive value in a group must be no greater than 1/2.
To compute ds_3^k(t0):
  1. Exclude any table t for which p(per1(t)) = true. Each remaining table belongs to one of the four disjoint sets S1 to S4.

Slide 17: Construction for Theorem 1 (cont.): 1-jump and i-jump (1<i)
[Tables: g1, g2, g3, … and the sets S1 to S4 (4320, 1152, 1152, 1152 tables), unchanged from slide 16.]
Privacy property: the highest ratio of a sensitive value in a group must be no greater than 1/2.
To compute ds_3^k(t0):
  1. Exclude any table t for which p(per1(t)) = true.
  2. Consider generalizing the remaining tables using g2.
     S2, S3, and S4 cannot be disclosed under g2.

Slide 18: Construction for Theorem 1 (cont.): 1-jump and i-jump (1<i)
Rows A to I are identical in all four columns (C0, C1, C2, C3, C4, C5, C6, C6, C6); the columns differ in rows J to O:

       S1    S101  S102  S103
  J    C7    C8    C7    C7
  K    C7    C8    C7    C7
  L    C8    C9    C9    C8
  M    C8    C9    C9    C8
  N    C9    C7    C8    C9
  O    C9    C7    C8    C9
  #    4320  288   288   288

Privacy property: the highest ratio of a sensitive value in a group must be no greater than 1/2.
To compute ds_3^k(t0):
  1. Exclude any table t for which p(per1(t)) = true.
  2. Consider generalizing the remaining tables using g2.
     a. The subsets of S1 in which N and O both have C7, both have C8, or both have C9 cannot be disclosed under g2. |S1'| = 864
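The size |S1'| = 864 can be cross-checked against the per-column counts on slide 18. The worked equation below assumes that S1' is the union of the three subsets S101, S102, and S103 shown there; the transcript does not say so explicitly, but the counts agree.

```latex
\[
|S_1'| = |S_{101}| + |S_{102}| + |S_{103}| = 3 \times 288 = 864,
\qquad
|S_1 \setminus S_1'| = 4320 - 864 = 3456 .
\]
```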

Slide 19: Construction for Theorem 1 (cont.): 1-jump and i-jump (1<i)
Rows A to I are identical in all four columns (C0, C1, C2, C3, C4, C5, C6, C6, C6); the columns differ in rows J to O:

       S1    S111  S112  S113
  J    C7    C7    C7    C7
  K    C7    C8    C8    C7
  L    C8    C9    C8    C8
  M    C8    C9    C9    C9
  N    C9    C7    C7    C8
  O    C9    C8    C9    C9
  #    4320  1152  1152  1152

To compute ds_3^k(t0):
  1. Exclude any table t for which p(per1(t)) = true.
  2. Consider generalizing the remaining tables using g2.
     b. For ajump(i), all tables in S1\S1' will be excluded from ds_3^i(t0). |S1\S1'| = 3456
$ds_3^{i}(t_0) = S_1' \cup S_2 \cup S_3 \cup S_4$
Privacy property: the highest ratio of a sensitive value in a group must be no greater than 1/2. Satisfied!

Slide 20: Construction for Theorem 1 (cont.): 1-jump and i-jump (1<i)

       S1    S111  S1111  S1112
  A    C0    C0    C0     C0
  B    C1    C1    C1     C1
  C    C2    C2    C2     C2
  D    C3    C3    C3     C3
  E    C4    C4    C4     C4/C5
  F    C5    C5    C5     C6
  G    C6    C6    C6     C6
  H    C6    C6    C6     C4/C5
  I    C6    C6    C6     C6
  J    C7    C7    C7     C7
  K    C7    C8    C8     C8
  L    C8    C9    C9     C9
  M    C8    C9    C9     C9
  N    C9    C7    C7     C7
  O    C9    C8    C8     C8
  #    4320  1152  576    576

To compute ds_3^k(t0):
  1. Exclude any table t for which p(per1(t)) = true.
  2. Consider generalizing the remaining tables using g2.
     c. For ajump(1), the disclosure sets of all tables in S1\S1' under g2 do not satisfy the privacy property.
$ds_3^{1}(t_0) = S_1 \cup S_2 \cup S_3 \cup S_4$
Privacy property: the highest ratio of a sensitive value in a group must be no greater than 1/2. Violated! The ratio of I being associated with C6 is 5/9.

Slide 21: Construction for Theorem 2: i-jump and j-jump (1<i<j) incomparable
Show the evaluation paths by figures.
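For the Theorem 1 construction, the 5/9 figure can be re-derived from the set sizes on the slides. The sketch below assumes, as the S1 to S4 columns suggest, that I is associated with C6 only in tables from S1 (and therefore from S1'), and it checks row I only, the row the slide singles out; the variable and function names are illustrative.

```python
from fractions import Fraction

# Set sizes read off the Theorem 1 slides.
S1, S1_PRIME = 4320, 864
S2 = S3 = S4 = 1152


def ratio_i_c6(s1_part: int) -> Fraction:
    """Share of tables in the disclosure set in which I has C6.

    Assumption: only the tables contributed by S1 (or S1') have I = C6.
    """
    return Fraction(s1_part, s1_part + S2 + S3 + S4)


ratio_1_jump = ratio_i_c6(S1)        # ds = S1  ∪ S2 ∪ S3 ∪ S4
ratio_i_jump = ratio_i_c6(S1_PRIME)  # ds = S1' ∪ S2 ∪ S3 ∪ S4

violates_1_jump = ratio_1_jump > Fraction(1, 2)
violates_i_jump = ratio_i_jump > Fraction(1, 2)
```

Under these assumptions the 1-jump disclosure set has 7776 tables and gives 4320/7776 = 5/9 for row I, above the 1/2 bound, while the i-jump disclosure set has 4320 tables and gives 864/4320 = 1/5 for the same row.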

Slide 22: Construction for Theorem 2 (cont.): i-jump and j-jump (1<i<j)
[Table: columns g1, g2, g3, …, gj, gj+1, gj+2, …; every column lists the sensitive values C0, C1, C2, C3, C4, S, S, C5, C6, C7, C8, C9, …; group boundaries are not preserved in this transcript.]
The case where i-jump has better utility than j-jump is relatively easy to construct. We only show the construction for the other case.
For this construction, generalization gj+2 will be released for j-jump, while gj+i+1 or a later generalization will be released for i-jump.

Slide 23: Construction for Theorem 3: K1-jump and K2-jump (K1, K2: vectors) incomparable

Slide 24: Construction for Proposition 2: Reusing generalization functions
[Table: QID rows A to G under g1, g2, g3, g2'; every column lists the sensitive values C1, C2, C3, C4, C5, C3, C3; group boundaries are not preserved in this transcript.]

       g2   S1   S2     S3
  A    C1   C1   C1/C2  C1/C2
  B    C2   C2   C3     C3
  C    C3   C3   C1/C2  C1/C2
  D    C4   C4   C3     C4
  E    C5   C5   C3     C5
  F    C3   C3   C4     C3
  G    C3   C3   C5     C3
  #    72   24   8      8

Without reusing g2, the table leads to disclosing nothing:
  It cannot be disclosed under g1(.) or g3(.).
  To compute ds2: each table belongs to one of the three disjoint sets S1, S2, S3 (40 tables in total).
$ds_2 = S_1 \cup S_2 \cup S_3$
The jump distance is 1. Privacy property: the highest ratio of a sensitive value in a group must be no greater than 1/2. Violated!

Slide 25: Construction for Proposition 2 (cont.): Reusing generalization functions
[Tables: g1, g2, g3, g2' and the sets S1, S2, S3 (24, 8, 8 tables; 40 in total) as on slide 24, here shown against g3.]
g2 is reused as g2': to calculate ds2', the tables that can be disclosed under g1, g2, and g3 must be excluded from per2'.
  1. S1, S2, and S3 cannot be disclosed under g2, as mentioned above.
  2. S2 and S3 cannot be disclosed under g3.
The jump distance is 1. Privacy property: the highest ratio of a sensitive value in a group must be no greater than 1/2.

Slide 26: Construction for Proposition 2 (cont.): Reusing generalization functions
[Table: g1, g2, g3, g2' as on slide 24.]

       S1   S11    S12  S13
  A    C1   C1     C1   C1
  B    C2   C2     C2   C2
  C    C3   C3     C3   C3
  D    C4   C3     C3   C4
  E    C5   C4/C5  C3   C5
  F    C3   C3     C4   C3
  G    C3   C4/C5  C5   C3
  #    24   16     4    4

g2 is reused as g2': to calculate ds2', the tables that can be disclosed under g1, g2, and g3 must be excluded from per2'.
  1. S1, S2, and S3 cannot be disclosed under g2, as mentioned above.
  2. S2 and S3 cannot be disclosed under g3.
  3. S1 can be further divided into three disjoint subsets.
     a. S12 and S13 cannot be disclosed under g3.
The jump distance is 1. Privacy property: the highest ratio of a sensitive value in a group must be no greater than 1/2.

Slide 27: Construction for Proposition 2 (cont.): Reusing generalization functions
[Table: g1, g2, g3, g2' as on slide 24.]

       S1   S11    tA   SA1  SA2
  A    C1   C1     C1   C3   C1/C2/C4
  B    C2   C2     C2   C3   C1/C2/C4
  C    C3   C3     C3   C1   C3
  D    C4   C3     C3   C2   C3
  E    C5   C4/C5  C4   C4   C1/C2/C4
  F    C3   C3     C3   C3   C3
  G    C3   C4/C5  C5   C5   C5
  #    24   16     120  12   36

(tA is one instance of S11.)
g2 is reused as g2': to calculate ds2', the tables that can be disclosed under g1, g2, and g3 must be excluded from per2'.
  1. S1, S2, and S3 cannot be disclosed under g2, as mentioned above.
  2. S2 and S3 cannot be disclosed under g3.
  3. S1 can be further divided into three disjoint subsets.
     b. The tables in subset S11 can be disclosed under g3.
To compute ds3(t0 in S11):
  A. Exclude any table t for which p(per1(t)) = true. Each remaining table belongs to one of the two disjoint sets.
  B. These subsets cannot be disclosed under g2 (nor under g2).

Slide 28: Construction for Proposition 2 (cont.): Reusing generalization functions
[Table: g1, g2, g3, g2' as on slide 24.]

       S12  S13  S2     S3
  A    C1   C1   C1/C2  C1/C2
  B    C2   C2   C3     C3
  C    C3   C3   C1/C2  C1/C2
  D    C3   C4   C3     C4
  E    C3   C5   C3     C5
  F    C4   C3   C4     C3
  G    C5   C3   C5     C3
  #    4    4    8      8

g2 is reused as g2':
$ds_2' = S_{12} \cup S_{13} \cup S_2 \cup S_3$
The ratios of D and E being associated with C3 are 0.5, which is the highest ratio.
The jump distance is 1. Privacy property: the highest ratio of a sensitive value in a group must be no greater than 1/2. Satisfied!

Slide 29: Results on asafe and ajump(1)
1. When the privacy property is either set-monotonic or based on the highest ratio of sensitive values:
   Lemma 3: $p(per(t_0)) = \text{false} \Rightarrow p(\text{any of its subsets}) = \text{false}$
   Corollary 1: The algorithm asafe has the same data utility as ajump(1).
2. When the privacy property falls under other cases:
   Lemma 4: The ds3 under asafe is a subset of that under ajump(1).
   Theorem 5: The data utility of asafe and ajump(1) is generally incomparable.

Slide 30: Agenda
  Background
  K-Jump Strategy
  Data Utility Comparison
  Conclusion

Slide 31: Conclusion
We have proposed a novel k-jump strategy for micro-data disclosure.
  Transform a given generalization algorithm into a large number of safe algorithms.
  Show that the data utility is generally incomparable by constructing counter-examples.
  Practical impact: make a secret choice.

Slide 32: Further Results and Future Work
Further results in the extended version of this paper:
  n Computational complexity: O (| max( per ) | k )
  Making a secret choice among unsafe algorithms does not yield a safe solution.
Future studies:
  Study more efficient safe algorithms.
  Employ statistical methods to compare different k-jump algorithms.
  Further investigate the opportunity in reusing generalization functions.

Slide 33: Thank you!

Slide 34: Example – Data Holder View
Micro-data table t0 (Name: identifier. DoB: quasi-identifier. Condition: sensitive attribute.)

  Name     DoB   Condition
  Alice    1990  flu
  Bob      1985  cold
  Charlie  1974  cancer
  David    1962  cancer
  Eve      1953  headache
  Fen      1941  toothache

Goal: release the table so that it satisfies 2-diversity.

Generalization g1(t0) (2-diversity?):

  DoB        Condition
  1980~1999  flu, cold
  1960~1979  cancer, cancer
  1940~1959  headache, toothache

Generalization g2(t0) (2-diversity?):

  DoB        Condition
  1970~1999  flu, cold, cancer
  1940~1969  cancer, headache, toothache

Generalization algorithm: consider generalization function g1 and then g2, in order.

Slide 35: Toy Example
Data holder: micro-data table t0 (Name: identifier. DoB: quasi-identifier. Condition: sensitive attribute.), generalized and released as g2(t0):

  DoB        Condition
  1970~1999  flu, cold, cancer
  1940~1969  cancer, headache, toothache

Attacker knows: the generalization, external data, and the privacy property (2-diversity).
External data: Name and DoB of Alice (1990), Bob (1985), Charlie (1974), David (1962), Eve (1953), and Fen (1941); every Condition is unknown (???).
What can the attacker infer? The permutation set t1 … t36 (A to F stand for Alice to Fen; col = cold, can = cancer, hac = headache, tac = toothache):

       t1   t2   t3   t4   …  t35  t36
  A    flu  flu  col  col  …  can  can
  B    col  can  flu  can  …  flu  col
  C    can  col  can  flu  …  col  flu
  D    can  can  can  can  …  tac  tac
  E    hac  hac  hac  hac  …  hac  hac
  F    tac  tac  tac  tac  …  can  can
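The toy example is small enough to enumerate by brute force. The sketch below rebuilds the permutation set of g2(t0) and then simulates the "g1, then g2" algorithm on every candidate table to obtain the disclosure set. It assumes the simplest reading of 2-diversity (every group holds at least two distinct conditions); the group index tuples and helper names are illustrative and not taken from the slides.

```python
from itertools import permutations

NAMES = ["Alice", "Bob", "Charlie", "David", "Eve", "Fen"]
T0 = ["flu", "cold", "cancer", "cancer", "headache", "toothache"]
G1 = [(0, 1), (2, 3), (4, 5)]   # 1980~1999, 1960~1979, 1940~1959
G2 = [(0, 1, 2), (3, 4, 5)]     # 1970~1999, 1940~1969


def two_diverse(table, groups):
    """Assumed reading: each group has at least two distinct conditions."""
    return all(len({table[i] for i in g}) >= 2 for g in groups)


def permutation_set(table, groups):
    """All tables obtained by permuting the conditions inside each group."""
    tables = {tuple(table)}
    for g in groups:
        values = [table[i] for i in g]
        expanded = set()
        for t in tables:
            for perm in permutations(values):
                new = list(t)
                for idx, val in zip(g, perm):
                    new[idx] = val
                expanded.add(tuple(new))
        tables = expanded
    return tables


per2 = permutation_set(T0, G2)

# Simulate the algorithm: g2 is released only if g1 fails and g2 passes.
ds2 = {t for t in per2 if not two_diverse(t, G1) and two_diverse(t, G2)}

# Conditions that are the same in every table of the disclosure set.
certain = {NAMES[i]: ds2_col.pop()
           for i in range(len(NAMES))
           if len(ds2_col := {t[i] for t in ds2}) == 1}
```

Under these assumptions the permutation set has 36 tables and the disclosure set has 4, and in all four Charlie and David have cancer, which is exactly what the adversary learns by simulating the algorithm.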