
Resisting Structural Re-identification in Anonymized Social Networks

Michael Hay, Gerome Miklau, David Jensen, Don Towsley, Philipp Weis (University of Massachusetts Amherst)
Session: Privacy & Authentication, VLDB 2008
Presented by Yongjin Kwon, 2011-01-21

Outline

- Introduction
- Adversary Knowledge Models
  - Vertex Refinement Queries
  - Subgraph Queries
  - Hub Fingerprint Queries
- Disclosure in Real Networks
- Anonymity in Random Graphs
- Graph Generalization for Anonymization
- Conclusion

Copyright © 2011 by CEBT

Introduction

Large amounts of data are collected in various storages:

- Supermarket transactions
- Web server logs
- Sensor data
- Interactions in social networks: email, Twitter, …

Data owners publish sensitive information to facilitate research.

In personal data, analysts may find valuable information. The goal is to reveal as much important information as possible while preserving the privacy of the individuals in the data.


Introduction (Cont’d)

“A Face Is Exposed for AOL Searcher No. 4417749” [New York Times, August 9, 2006]

AOL collected 20 million Web search queries and published them.

Although the company naïvely anonymized the data, the identity of AOL user No. 4417749 was revealed: “Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.” A serious privacy risk!


Introduction (Cont’d)

Potential privacy risks in network data

- “Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs” [Sexually Trans. Infections, 2002]: a social network representing a set of individuals related by sexual contacts and shared drug injections was published in order to analyze how HIV spreads.

- Enron Email Dataset (http://www.cs.cmu.edu/~enron/): the email collection was released for investigation. It is the only “real” public email collection, due to privacy issues.


Introduction (Cont’d)

Attacks on naïvely anonymized network data

“Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography” [WWW 2007]

Active Attack
- An adversary chooses a set of targets, creates a small number of fake nodes with edges to these targets, and constructs a highly identifiable pattern of links among the new nodes.
- After the network is released, the adversary can recognize the pattern and the fake nodes, and reveal the sensitive information of the targets.

Passive Attack
- Most vertices in network data usually belong to a small, uniquely identifiable subgraph.
- An adversary may collude with other friends to identify additional nodes connected to the coalition’s distinct subgraph.


Introduction (Cont’d)

An adversary may compromise the privacy of some victims with (structural) background knowledge.

Naïve anonymization is NOT sufficient! A new way of resisting malicious attempts to re-identify individuals in published network data must be proposed.

Need to think of:
- Types of adversary knowledge
- A theoretical treatment of privacy risks
- A way of preserving privacy while maintaining high data utility


Adversary Knowledge Models

The adversary’s background knowledge is modeled as “correct” answers to a restricted knowledge query.

The adversary uses the query to refine the feasible candidate set.

Three knowledge models:
- Vertex Refinement Queries
- Subgraph Queries
- Hub Fingerprint Queries

Vertex Refinement Queries

These queries report on the local structure of the graph around the “target” node.

[Figure: a target node B; H_1(B) is the degree of B, H_2(B) the degrees of B’s neighbors]

In general, H_0(x) returns the node’s (empty) label, H_1(x) the degree of x, and H_i(x) the multiset of H_{i-1} values of x’s neighbors.

Vertex Refinement Queries (Cont’d)

Relative Equivalence

[Figure: example graph on nodes A through H, partitioned into equivalence classes under vertex refinement]

If the adversary knows the answer to a vertex refinement query on node G (e.g. H_2(G)), then G can quickly be re-identified in the anonymized graph!
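The refinement hierarchy is easy to compute. The sketch below (plain Python, on a small hypothetical graph, not the slide's exact example) computes H_1 and H_2 and the resulting equivalence classes; any node alone in its class is re-identifiable.

```python
from collections import defaultdict

# Hypothetical example graph as adjacency sets (not the slide's exact graph).
G = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def H1(G, x):
    """H_1(x): the degree of node x."""
    return len(G[x])

def H2(G, x):
    """H_2(x): the multiset (here a sorted tuple) of degrees of x's neighbors."""
    return tuple(sorted(H1(G, y) for y in G[x]))

def equivalence_classes(G, query):
    """Group nodes with identical query answers; a node alone in its class
    is uniquely re-identifiable by an adversary who knows the answer."""
    classes = defaultdict(set)
    for x in G:
        classes[query(G, x)].add(x)
    return list(classes.values())
```

Under H_1 the degree-2 nodes A, C, D fall into one class, but H_2 separates D (neighbor degrees 1 and 3) from A and C, illustrating how stronger knowledge shrinks candidate sets.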

Subgraph Queries

Two drawbacks of vertex refinement queries:
- They always return “correct” (complete) information.
- They depend on the degree of the target node.

These queries assert the existence of a subgraph around the “target” node.

[Figure: subgraphs around node B with 3, 4, and 5 edge facts, respectively]

Assume that the adversary knows a number of edge facts around the target node.

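A minimal sketch (plain Python, hypothetical graph) of the weakest subgraph query, a star of k edge facts around the target: every node of degree at least k remains a candidate.

```python
# Hypothetical example graph as adjacency sets.
G = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def star_candidates(G, num_edge_facts):
    """Candidate set for an adversary asserting a star-shaped subgraph of
    `num_edge_facts` edges around the target: all nodes whose degree is at
    least that count could be the target."""
    return {x for x in G if len(G[x]) >= num_edge_facts}
```

With 3 edge facts only B qualifies, so B is re-identified; richer, non-star subgraphs would prune candidates further via subgraph matching.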

Hub Fingerprint Queries

A hub is a node with high degree and high betweenness centrality. Hubs are easily re-identified by an adversary.

A hub fingerprint for a node is a vector of the node’s distances to the observable hubs.

[Figure: example graph on nodes A through H with a hub]
- Closed world: absent connections are definite (“not reachable within distance 1”).
- Open world: the adversary’s knowledge may be incomplete.

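A sketch of hub fingerprints (plain Python; the graph and the choice of hub are hypothetical): each node's fingerprint is its vector of distances to the hubs, truncated at an observation limit, with 0 meaning "not observed".

```python
from collections import deque

# Hypothetical example graph; suppose the high-degree node "B" is the hub.
G = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def distances(G, source):
    """Breadth-first shortest-path distances from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in G[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def hub_fingerprint(G, x, hubs, limit):
    """Vector of distances from x to each hub; distances beyond `limit`
    (or unreachable hubs) are reported as 0."""
    dist = distances(G, x)
    fp = []
    for h in hubs:
        d = dist.get(h)
        fp.append(d if d is not None and d <= limit else 0)
    return tuple(fp)

def consistent(candidate_fp, known_fp, closed_world):
    """Closed world: fingerprints must match exactly. Open world: a 0 in
    the adversary's knowledge means 'unknown' and matches anything."""
    if closed_world:
        return candidate_fp == known_fp
    return all(k == 0 or k == c for c, k in zip(candidate_fp, known_fp))
```

At limit 1, A's fingerprint to hub B is (1,) while E's is (0,); under closed-world semantics that 0 rules E out as a match for A, while under open-world semantics it asserts nothing.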

Disclosure in Real Networks

Experiments on the impact of external information

Three network data sets:
- Hep-Th: co-author graph, taken from the arXiv archive
- Enron: “real” email dataset, collected by the CALO Project
- Net-trace: IP-level network trace collected at a major university

Procedure:
- Consider each node in turn as a target.
- Compute the candidate set for the target (the smaller the candidate set, the more vulnerable the target).
- Characterize how many nodes are protected and how many are re-identifiable.

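The experimental procedure above can be sketched as follows (plain Python; the toy graph stands in for Hep-Th, Enron, and Net-trace, and the degree query H_1 stands in for the knowledge model):

```python
from collections import Counter

# Toy stand-in for the real datasets.
G = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def degree_query(G, x):
    return len(G[x])

def risk_profile(G, query, k):
    """Treat each node as the target: its candidate set is all nodes with
    the same query answer. Returns (fraction re-identified, fraction
    protected), where re-identified means a candidate set of size 1 and
    protected means a candidate set of size >= k."""
    counts = Counter(query(G, x) for x in G)
    sizes = [counts[query(G, x)] for x in G]
    n = len(sizes)
    return sum(s == 1 for s in sizes) / n, sum(s >= k for s in sizes) / n

reidentified, protected = risk_profile(G, degree_query, k=3)
```

Here B and E have unique degrees (re-identified), while A, C, D share degree 2 and are protected at k = 3.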

Disclosure in Real Networks (Cont’d)

Vertex Refinement Queries

[Results figures omitted]

Disclosure in Real Networks (Cont’d)

Subgraph Queries

Two strategies to build subgraphs:
- Sampled subgraph
- Degree subgraph

[Results figures omitted]

Disclosure in Real Networks (Cont’d)

Hub Fingerprint Queries

Hubs: the five highest-degree nodes (Enron); the ten highest-degree nodes (Hep-Th, Net-trace)

[Results figures omitted]

Anonymity in Random Graphs

Theoretical approach to privacy risk with random graphs

Erdős-Rényi model (ER model): n nodes, each edge present independently with probability p.

Asymptotic analysis of robustness against knowledge attacks:
- Sparse ER graphs: robust against H_i for any i
- Dense ER graphs: robust against H_1, but vulnerable against H_2
- Super-dense ER graphs: vulnerable against H_1

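The random-graph analysis can be explored empirically. The sketch below (plain Python; parameters are illustrative) samples G(n, p) and measures how many nodes are re-identifiable from their degree alone, i.e. by an H_1 attack.

```python
import random
from collections import Counter

def er_graph(n, p, seed=0):
    """Sample an Erdos-Renyi G(n, p) graph: each of the n*(n-1)/2 possible
    edges is included independently with probability p."""
    rng = random.Random(seed)
    G = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                G[i].add(j)
                G[j].add(i)
    return G

def unique_degree_fraction(G):
    """Fraction of nodes whose degree is unique in the graph, i.e. nodes
    an adversary re-identifies from H_1 alone."""
    degree_counts = Counter(len(G[x]) for x in G)
    return sum(degree_counts[len(G[x])] == 1 for x in G) / len(G)
```

In a sparse graph (p = c/n) degrees repeat heavily, so this fraction stays small; the paper's asymptotic results make such statements precise for each density regime.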

Anonymity in Random Graphs (Cont’d)

Anonymity against subgraph queries:
- Depends on the number of nodes in the largest clique.
- When a subgraph query fits inside a clique, every member of the clique is a candidate, so the clique number is a useful lower bound on the disclosure.

Random Graphs with Attributes

Graph Generalization for Anonymization

Generalize a naïvely anonymized graph.

Much uncertainty! (measured by the number of possible worlds)

- Find the partitioning that maximizes the likelihood, subject to every supernode containing at least k nodes.
- Apply simulated annealing to find the partitioning.

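A sketch of the generalized representation (plain Python; the graph and partition are hypothetical): the generalization keeps, for each pair of supernodes, only the count of edges between them, and the number of possible worlds measures the uncertainty this introduces.

```python
import math

# Hypothetical graph and a partition into supernodes of size >= 2 (k = 2).
G = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}
partition = [["A", "B", "C"], ["D", "E"]]

def generalize(G, partition):
    """Edge counts between (and within) supernodes, keyed by block indices."""
    block = {x: i for i, b in enumerate(partition) for x in b}
    counts = {}
    for u in G:
        for v in G[u]:
            if u < v:  # count each undirected edge once
                key = tuple(sorted((block[u], block[v])))
                counts[key] = counts.get(key, 0) + 1
    return counts

def num_possible_worlds(partition, counts):
    """Number of graphs consistent with the generalized graph: for each
    supernode pair, choose which of the available node pairs carry the
    recorded number of edges."""
    total = 1
    for i in range(len(partition)):
        for j in range(i, len(partition)):
            ni, nj = len(partition[i]), len(partition[j])
            slots = ni * (ni - 1) // 2 if i == j else ni * nj
            total *= math.comb(slots, counts.get((i, j), 0))
    return total
```

The annealing search over partitions (not shown) trades this uncertainty against likelihood while keeping every supernode at size k or more.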

Graph Generalization for Anonymization (Cont’d)

How to analyze the generalized graph?

- Construct a synthetic graph using the tagged information.
- Perform standard graph analysis on this synthetic graph.

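Analysis then runs on synthetic graphs sampled from the generalized representation. A sketch (plain Python; the partition and edge counts are hypothetical) that draws one consistent world:

```python
import random
from itertools import combinations

def sample_world(partition, counts, seed=0):
    """Draw one synthetic graph consistent with a generalized graph: for
    each supernode pair, place the recorded number of edges uniformly at
    random among the available node pairs."""
    rng = random.Random(seed)
    G = {x: set() for block in partition for x in block}
    for (i, j), m in counts.items():
        if i == j:
            slots = list(combinations(partition[i], 2))
        else:
            slots = [(u, v) for u in partition[i] for v in partition[j]]
        for u, v in rng.sample(slots, m):
            G[u].add(v)
            G[v].add(u)
    return G

# Hypothetical generalized graph: supernodes {A, B} and {C}; one internal
# edge and two crossing edges.
partition = [["A", "B"], ["C"]]
counts = {(0, 0): 1, (0, 1): 2}
synthetic = sample_world(partition, counts)
```

Repeating the draw gives a sample of possible worlds, over which any standard graph statistic can be averaged.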

Graph Generalization for Anonymization (Cont’d)

How does graph generalization affect network properties?

Examine five properties on the three real-world networks:
- Degree
- Path length
- Transitivity (clustering coefficient)
- Network resilience
- Infectiousness

Perform the experiments on 200 synthetic graphs; repeat for each k.

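One of the five measured properties, transitivity, can be computed directly; a sketch in plain Python (adjacency-set graphs, with two toy inputs for illustration):

```python
def transitivity(G):
    """Global clustering coefficient: closed triples / connected triples.
    Each triangle contributes three closed triples, one per center vertex."""
    closed = 0
    triples = 0
    for v in G:
        neighbors = sorted(G[v])
        d = len(neighbors)
        triples += d * (d - 1) // 2
        for a in range(d):
            for b in range(a + 1, d):
                if neighbors[b] in G[neighbors[a]]:
                    closed += 1
    return closed / triples if triples else 0.0

# A triangle is fully transitive; a path has no closed triples.
triangle = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
```

The paper compares such statistics between each original network and its 200 sampled synthetic graphs, for each value of k.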

Graph Generalization for Anonymization (Cont’d)

[Results figures omitted]

Conclusion

Three contributions:
- Formalize models of adversary knowledge.
- Provide a starting point for the theoretical study of privacy risks in network data.
- Introduce a new anonymization technique based on generalizing the original graph.
