
Resisting Structural Re-identification in Anonymized Social Networks

Michael Hay, Gerome Miklau, David Jensen, Don Towsley, Philipp Weis
University of Massachusetts Amherst
Session: Privacy & Authentication, VLDB 2008
2011-01-21, presented by Yongjin Kwon

Outline

• Introduction
• Adversary Knowledge Models
  – Vertex Refinement Queries
  – Subgraph Queries
  – Hub Fingerprint Queries
• Disclosure in Real Networks
• Anonymity in Random Graphs
• Graph Generalization for Anonymization
• Conclusion

Copyright © 2011 by CEBT

Introduction

• Large amounts of data are kept in various stores.
  – Supermarket transactions
  – Web server logs
  – Sensor data
  – Interactions in social networks: email, Twitter, …
• Data owners publish sensitive information to facilitate research.
  – Reveal as much important information as possible while preserving the privacy of the individuals in the data.
  – Analysts may find valuable information in personal data.

Introduction (Cont’d)

• “A Face Is Exposed for AOL Searcher No. 4417749” [New York Times, August 9, 2006]
  – AOL collected 20 million Web search queries and published them.
  – Although the company naïvely anonymized the data, the identity of AOL user “No. 4417749” was revealed: “Thelma Arnold, a 62-year old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.”
  – A serious privacy risk!

Introduction (Cont’d)

• Potential privacy risks in network data
  – Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs [Sexually Trans. Infections, 2002]
     – A social network, representing a set of individuals related by sexual contacts and shared drug injections, was published in order to analyze how HIV spreads.
  – Enron Email Dataset ( http://www.cs.cmu.edu/~enron/ )
     – The email collection was released for investigation.
     – Because of privacy issues, it is the only “real” email collection publicly available.

Introduction (Cont’d)

• Attacks on naïvely anonymized network data
  – “Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography” [WWW 2007]
  – Active attack
     – An adversary chooses a set of targets, creates a small number of fake nodes with edges to these targets, and constructs a highly identifiable pattern of links among the new nodes.
     – After the network is released, the adversary can recognize the pattern and the fake nodes, and reveal sensitive information about the targets.
  – Passive attack
     – Most vertices in network data belong to a small, uniquely identifiable subgraph.
     – An adversary may collude with friends to identify additional nodes connected to the distinct subset of the coalition.

Introduction (Cont’d)

• An adversary with some (structural) background knowledge may compromise the privacy of some victims.
  – Naïve anonymization is NOT sufficient!
  – A new way is needed to resist attempts to re-identify individuals in published network data.
• Need to think of…
  – Types of adversary knowledge
  – A theoretical approach to privacy risks
  – A way of preserving privacy while maintaining high utility of the data

Adversary Knowledge Models

• The adversary’s background knowledge is modeled as “correct” answers to a restricted knowledge query.
• The adversary uses the query to refine the feasible candidate set.
• Three knowledge models
  – Vertex Refinement Queries
  – Subgraph Queries
  – Hub Fingerprint Queries

Vertex Refinement Queries

• These queries report on the local structure of the graph around the “target” node.
• Example for target node B:
  – Degree of B
  – Degrees of the neighbors of B
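The two example queries for node B (its degree, and the degrees of its neighbors) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the edge list is a hypothetical graph on nodes A to H, and the names `build_adjacency` and `refine` are made up here. It assumes that a query of level i is obtained by collecting the level i - 1 answers of the neighbors.

```python
# Hypothetical example graph on nodes A..H. The edge list is an assumption,
# it is not recovered from the figure on the slide.
EDGES = [
    ("A", "B"), ("B", "C"), ("B", "D"), ("B", "E"), ("D", "E"), ("D", "F"),
    ("D", "G"), ("E", "G"), ("E", "H"), ("F", "G"), ("G", "H"),
]


def build_adjacency(edges):
    """Build undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def refine(adj, node, level):
    """Answer of a vertex refinement query of the given level (level >= 1).

    level 1: the degree of the node.
    level i: the sorted level i - 1 answers of the node's neighbors.
    """
    if level < 1:
        raise ValueError("level must be at least 1")
    if level == 1:
        return len(adj[node])
    return tuple(sorted(refine(adj, nb, level - 1) for nb in adj[node]))


adj = build_adjacency(EDGES)
print("degree of B:", refine(adj, "B", 1))
print("degrees of B's neighbors:", refine(adj, "B", 2))
```

Sorting the neighbor answers makes the result independent of neighbor order, so two nodes can be compared by simple equality of their answers.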

Vertex Refinement Queries (Cont’d)

• Relative equivalence: nodes that give the same answer to a query cannot be told apart by it.
• [Figure: example graph with nodes A to H]
• If the adversary knows the answer to the query for G, then G can be quickly re-identified in the anonymized graph!

Subgraph Queries

• Two drawbacks of vertex refinement queries
  – They always return “correct” information.
  – They depend on the degree of the target node.
• Subgraph queries assert the existence of a subgraph around the “target” node.
• [Figure: example subgraphs around target node B, with 3, 4, and 5 edge facts]
• Assume that the adversary knows the number of edge facts around the target node.
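A subgraph query can be illustrated with a brute-force check: does a small pattern, with one node designated as the target, embed in the graph at a given node? The sketch below is only an illustration under assumptions. The graph is a hypothetical example on nodes A to H, the patterns are made up, and `matches` and `candidate_set` are names invented here. Each pattern edge counts as one edge fact.

```python
from itertools import permutations

# Hypothetical example graph (assumed edge list, not taken from the slides).
EDGES = [
    ("A", "B"), ("B", "C"), ("B", "D"), ("B", "E"), ("D", "E"), ("D", "F"),
    ("D", "G"), ("E", "G"), ("E", "H"), ("F", "G"), ("G", "H"),
]
adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def matches(adj, target, pattern_edges, anchor="x"):
    """True if the pattern embeds in the graph with `anchor` mapped to `target`.

    Pattern nodes must map to distinct graph nodes, and every pattern edge
    must be an edge of the graph. Extra graph edges are allowed.
    """
    others = sorted({n for edge in pattern_edges for n in edge} - {anchor})
    pool = [n for n in adj if n != target]
    for image in permutations(pool, len(others)):
        mapping = dict(zip(others, image))
        mapping[anchor] = target
        if all(mapping[v] in adj[mapping[u]] for u, v in pattern_edges):
            return True
    return False


def candidate_set(adj, pattern_edges):
    """All nodes around which the asserted subgraph exists."""
    return {n for n in adj if matches(adj, n, pattern_edges)}


# 3 edge facts: the target lies on a triangle.
triangle = [("x", "p"), ("x", "q"), ("p", "q")]
# 5 edge facts: the triangle plus two further neighbors of the target.
triangle_plus_two = triangle + [("x", "r"), ("x", "s")]

for pattern in (triangle, triangle_plus_two):
    print(len(pattern), "edge facts:", sorted(candidate_set(adj, pattern)))
```

The exhaustive search is exponential in the pattern size, so it is only suitable for the very small patterns used here. More edge facts can only shrink the candidate set, which is why the number of edge facts is a natural measure of adversary knowledge.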

Hub Fingerprint Queries

• A hub is a node with high degree and high betweenness centrality.
  – Hubs are easily re-identified by an adversary.
• A hub fingerprint for a node is a vector of its distances to the hubs, limited to observable hub connections.
• [Figure: example graph with nodes A to H and marked hubs]
  – Closed world: a hub missing from the fingerprint is not reachable within distance 1.
  – Open world: the adversary’s knowledge is incomplete.
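The closed-world and open-world readings of a hub fingerprint can be compared on a small example. The following is a sketch under assumptions: the graph and the choice of D and E as hubs are hypothetical, `None` is used here to mark a hub with no observable connection within the distance limit, and the function names are invented for illustration.

```python
from collections import deque

# Hypothetical example graph and hub choice (both are assumptions).
EDGES = [
    ("A", "B"), ("B", "C"), ("B", "D"), ("B", "E"), ("D", "E"), ("D", "F"),
    ("D", "G"), ("E", "G"), ("E", "H"), ("F", "G"), ("G", "H"),
]
HUBS = ["D", "E"]
adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def distances_from(adj, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def hub_fingerprint(adj, node, hubs, limit):
    """Distances from node to each hub; None if farther than `limit`."""
    result = []
    for hub in hubs:
        d = distances_from(adj, hub).get(node)
        result.append(d if d is not None and d <= limit else None)
    return tuple(result)


def candidates(adj, known, hubs, limit, closed_world=True):
    """Non-hub nodes consistent with the adversary's fingerprint `known`.

    Closed world: the fingerprint must match exactly.
    Open world: a None entry in `known` means "unknown" and matches anything.
    """
    found = set()
    for node in adj:
        if node in hubs:
            continue
        fp = hub_fingerprint(adj, node, hubs, limit)
        if closed_world:
            ok = fp == known
        else:
            ok = all(k is None or k == f for k, f in zip(known, fp))
        if ok:
            found.add(node)
    return found


known = hub_fingerprint(adj, "F", HUBS, 1)
print("fingerprint of F:", known)
print("closed world:", sorted(candidates(adj, known, HUBS, 1, True)))
print("open world:", sorted(candidates(adj, known, HUBS, 1, False)))
```

In this example the open-world candidate set is larger, because a missing hub connection no longer rules anyone out.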

Disclosure in Real Networks

• Experiments on the impact of external information
  – Three network data sets
     – Hep-Th: co-author graph, taken from the arXiv archive
     – Enron: “real” email dataset, collected by the CALO Project
     – Net-trace: IP-level network trace collected at a major university
  – Consider each node in turn as a target.
  – Compute the candidate set for the target.
     – Smaller candidate set: more vulnerable!
  – Characterize how many nodes are protected and how many are re-identifiable.
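The experimental loop described above (take each node as a target, compute its candidate set, summarize the sizes) can be sketched as follows. The graph is a small hypothetical example rather than one of the three data sets, and the two query functions and `candidate_set_sizes` are names chosen here for illustration.

```python
from collections import Counter

# Hypothetical example graph (assumed edge list, not one of the data sets).
EDGES = [
    ("A", "B"), ("B", "C"), ("B", "D"), ("B", "E"), ("D", "E"), ("D", "F"),
    ("D", "G"), ("E", "G"), ("E", "H"), ("F", "G"), ("G", "H"),
]
adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def degree_query(adj, node):
    """Adversary knows the degree of the target."""
    return len(adj[node])


def neighbor_degree_query(adj, node):
    """Adversary knows the degrees of the target's neighbors."""
    return tuple(sorted(len(adj[nb]) for nb in adj[node]))


def candidate_set_sizes(adj, query):
    """For every node taken as target, the size of its candidate set,
    i.e. the number of nodes giving the same answer to the query."""
    answers = {node: query(adj, node) for node in adj}
    counts = Counter(answers.values())
    return {node: counts[answer] for node, answer in answers.items()}


for query in (degree_query, neighbor_degree_query):
    sizes = candidate_set_sizes(adj, query)
    unique = sorted(n for n, s in sizes.items() if s == 1)
    print(query.__name__, "sizes:", sizes, "re-identified:", unique)
```

A node whose candidate set has size 1 is re-identified outright, while larger candidate sets mean better protection.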

Disclosure in Real Networks (Cont’d)

• Vertex Refinement Queries

Disclosure in Real Networks (Cont’d)

• Subgraph Queries
• Two strategies to build subgraphs
  – Sampled Subgraph
  – Degree Subgraph

Disclosure in Real Networks (Cont’d)

• Hub Fingerprint Queries
• Hubs: the five highest-degree nodes (Enron), the ten highest-degree nodes (Hep-Th, Net-trace)

Anonymity in Random Graphs

• Theoretical approach to privacy risk with random graphs
  – Erdős-Rényi model (ER model) with n nodes and edge connection probability p
• Asymptotic analysis of robustness against knowledge attacks
  – Sparse ER graphs: robust against […] for any […]
  – Dense ER graphs: robust against […], but vulnerable against […]
  – Super-dense ER graphs: vulnerable against […]
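The effect of density on anonymity can be explored empirically by sampling ER graphs and measuring the average candidate set size. The sketch below is an illustration only: the values of n, the constant c, and the three edge probabilities are assumptions chosen for the demo, and the function names are invented here. It makes no claim to reproduce the asymptotic results.

```python
import random
from itertools import combinations
from collections import Counter
from math import log


def er_graph(n, p, seed=None):
    """Sample an Erdős-Rényi graph: each of the n*(n-1)/2 possible edges
    is included independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree_query(adj, node):
    return len(adj[node])


def neighbor_degree_query(adj, node):
    return tuple(sorted(len(adj[nb]) for nb in adj[node]))


def mean_candidate_set_size(adj, query):
    """Average, over all targets, of the number of nodes sharing the
    target's answer to the query."""
    answers = [query(adj, node) for node in adj]
    counts = Counter(answers)
    return sum(counts[a] for a in answers) / len(answers)


n, c = 200, 2.0  # illustrative values, not taken from the slides
regimes = {
    "sparse (p = c/n)": c / n,
    "dense (p = c log n / n)": c * log(n) / n,
    "super-dense (p constant)": 0.5,
}
for name, p in regimes.items():
    g = er_graph(n, p, seed=0)
    print(name,
          "degree:", round(mean_candidate_set_size(g, degree_query), 1),
          "neighbor degrees:",
          round(mean_candidate_set_size(g, neighbor_degree_query), 1))
```

Large average candidate sets indicate robustness against the query, and averages close to 1 indicate that most nodes are re-identifiable.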

Anonymity in Random Graphs (Cont’d)

• Anonymity against subgraph queries
  – Depends on the number of nodes in the largest clique
  – If […] for a subgraph query […], then […]
  – The clique number is a useful lower bound on the disclosure.

Graph Generalization for Anonymization

• Generalize a naïvely anonymized graph by partitioning its nodes into supernodes.
• Much uncertainty! (measured by the number of possible worlds)
  – Find the partitioning that maximizes the likelihood, subject to every supernode being larger than a required minimum size.
  – Apply simulated annealing to find the partitioning.
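The quantity that the search optimizes can be illustrated by counting possible worlds for a given partitioning. This sketch assumes that the generalized graph records, for each supernode, its size and number of internal edges, and for each pair of supernodes, the number of edges between them. It also assumes that every graph consistent with those counts is a possible world. The example graph, the partition, and the function names are hypothetical, and the simulated annealing search itself is not shown.

```python
from math import comb

# Hypothetical example graph and partition (both are assumptions).
EDGES = [
    ("A", "B"), ("B", "C"), ("B", "D"), ("B", "E"), ("D", "E"), ("D", "F"),
    ("D", "G"), ("E", "G"), ("E", "H"), ("F", "G"), ("G", "H"),
]
PARTITION = [{"A", "B", "C"}, {"D", "E"}, {"F", "G", "H"}]


def generalize(edges, partition):
    """Edge counts within and between supernodes, keyed by (i, j), i <= j."""
    block_of = {node: i for i, block in enumerate(partition) for node in block}
    counts = {}
    for u, v in edges:
        key = tuple(sorted((block_of[u], block_of[v])))
        counts[key] = counts.get(key, 0) + 1
    return counts


def possible_worlds(partition, counts):
    """Number of graphs consistent with the supernode sizes and edge counts.

    Within a supernode of size s there are s*(s-1)/2 node pairs, and
    between two supernodes of sizes s and t there are s*t pairs. The
    recorded number of edges is chosen from those pairs independently.
    """
    sizes = [len(block) for block in partition]
    total = 1
    for i, s in enumerate(sizes):
        total *= comb(s * (s - 1) // 2, counts.get((i, i), 0))
        for j in range(i + 1, len(sizes)):
            total *= comb(s * sizes[j], counts.get((i, j), 0))
    return total


counts = generalize(EDGES, PARTITION)
print("edge counts:", counts)
print("possible worlds:", possible_worlds(PARTITION, counts))
```

If the possible worlds are treated as equally likely, a partitioning with fewer possible worlds gives the original graph a higher likelihood, which is one way to read the objective on the slide. A partition into singletons yields exactly one possible world and no anonymity.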

Graph Generalization for Anonymization (Cont’d)

• How to analyze the generalized graph?
  – Construct a synthetic graph using the tagged information.
  – Perform standard graph analysis on this synthetic graph.
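Constructing a synthetic graph can be sketched as drawing one graph at random from those consistent with the generalized graph. The example below assumes that the tagged information consists of the supernode memberships and the edge counts within and between supernodes. The partition, the counts, and the function names are hypothetical.

```python
import random
from itertools import combinations, product
from collections import Counter

# Hypothetical generalized graph: supernodes and edge counts keyed by the
# pair of supernode indices (i, j) with i <= j. All values are assumptions.
PARTITION = [{"A", "B", "C"}, {"D", "E"}, {"F", "G", "H"}]
COUNTS = {(0, 0): 2, (0, 1): 2, (1, 1): 1, (1, 2): 4, (2, 2): 2}


def sample_graph(partition, counts, seed=None):
    """Draw a synthetic graph consistent with the generalized graph.

    For every supernode pair, the recorded number of edges is placed
    uniformly at random among the available node pairs.
    """
    rng = random.Random(seed)
    blocks = [sorted(block) for block in partition]
    edges = []
    for (i, j), m in sorted(counts.items()):
        if i == j:
            slots = list(combinations(blocks[i], 2))
        else:
            slots = list(product(blocks[i], blocks[j]))
        edges.extend(rng.sample(slots, m))
    return edges


def degree_distribution(edges):
    """A standard analysis step: how many nodes have each degree.
    Nodes without any edge are not counted."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


synthetic = sample_graph(PARTITION, COUNTS, seed=0)
print("synthetic edges:", synthetic)
print("degree distribution:", dict(degree_distribution(synthetic)))
```

Repeating the sampling with different seeds gives a set of synthetic graphs over which a measured property can be averaged.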

Graph Generalization for Anonymization (Cont’d)

• How does graph generalization affect network properties?
  – Examine five properties on the three real-world networks.
     – Degree
     – Path Length
     – Transitivity (Clustering Coefficient)
     – Network Resilience
     – Infectiousness
  – Perform the experiments on 200 synthetic graphs.
  – Repeat for each […].

Graph Generalization for Anonymization (Cont’d)

Conclusion

• Three contributions
  – Formalize models of adversary knowledge.
  – Provide a starting point for the theoretical study of privacy risks in network data.
  – Introduce a new anonymization technique based on generalizing the original graph.
