Resisting Structural Re-identification in Anonymized Social Networks

Michael Hay, Gerome Miklau, David Jensen, Don Towsley, Philipp Weis
University of Massachusetts Amherst
Session: Privacy & Authentication, VLDB 2008
Presented by Yongjin Kwon, 2011-01-21


– Introduction
– Adversary Knowledge Models
  – Vertex Refinement Queries
  – Subgraph Queries
  – Hub Fingerprint Queries
– Disclosure in Real Networks
– Anonymity in Random Graphs
– Graph Generalization for Anonymization
– Conclusion



Introduction

Large amounts of data are held in various data stores.
– Supermarket Transactions
– Web Server Logs
– Sensor Data
– Interactions in Social Networks (Email, Twitter, …)

Data owners publish sensitive information to facilitate research.
– In personal data, analysts may find valuable information.
– The goal is to reveal as much important information as possible while preserving the privacy of the individuals in the data.


Introduction (Cont’d)

A Face Is Exposed for AOL Searcher No. 4417749 [New York Times, August 9, 2006]
– AOL collected 20 million Web search queries and published them.
– Although the company naïvely anonymized the data, the identity of AOL user “No. 4417749” was revealed: “Thelma Arnold, a 62-year old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.”
– This is a serious privacy risk!


Introduction (Cont’d)

Potential privacy risks in network data

Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs [Sexually Trans. Infections, 2002]
– A social network, which represents a set of individuals related by sexual contacts and shared drug injections, was published in order to analyze how HIV spreads.

Enron Email Dataset
– The email collection was released for investigation.
– It is the only “real” email collection available, because privacy issues keep others from being published.

Introduction (Cont’d)

Attacks on naïvely anonymized network data

Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography [WWW 2007]

Active Attack
– An adversary chooses a set of targets, creates a small number of fake nodes with edges to these targets, and constructs a highly identifiable pattern of links among the new nodes.
– After the network is released, the adversary can recognize the pattern and the fake nodes, and reveal the sensitive information of the targets.

Passive Attack
– Most vertices in network data usually belong to a small, uniquely identifiable subgraph.
– An adversary may collude with friends to identify additional nodes connected to the distinct subset of the coalition.


Introduction (Cont’d)

An adversary may compromise the privacy of some victims using (structural) background knowledge.
– Naïve anonymization is NOT sufficient!
– A new way to resist attempts to re-identify individuals in published network data is needed.

Need to think of…
– Types of adversary knowledge
– A theoretical approach to privacy risks
– A way of preserving privacy while maintaining high utility of the data


Adversary Knowledge Models

The adversary’s background knowledge is modeled as “correct” answers to a restricted knowledge query Q about the target node.

The adversary uses the query to refine the feasible candidate set: the nodes of the anonymized graph whose answers match the target’s answer.

Three knowledge models
– Vertex Refinement Queries
– Subgraph Queries
– Hub Fingerprint Queries

Copyright 2011 by CEBT
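The candidate-set idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy graph, the function names `candidate_set` and `degree`, and the choice of degree as the knowledge query are all assumptions made for the example.

```python
def candidate_set(nodes, query, target):
    """Return every node whose query answer equals the target's answer."""
    answer = query(target)
    return {y for y in nodes if query(y) == answer}


# Hypothetical toy graph as an undirected adjacency dict.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}


def degree(x):
    """An example knowledge query: the adversary knows the target's degree."""
    return len(graph[x])


print(candidate_set(graph, degree, "a"))  # nodes that share a's degree
print(candidate_set(graph, degree, "d"))  # nodes that share d's degree
```

A node whose candidate set contains only itself is re-identified by that query; larger candidate sets mean more protection.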


Vertex Refinement Queries

These queries report on the local structure of the graph around the “target” node.
– H1(B): degree of B
– H2(B): degrees of the neighbors of B
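Vertex refinement can be sketched as an iterated computation: the first level describes a node by its degree, and each further level describes it by the multiset of its neighbors' previous-level answers. The code below is a sketch under that assumption. The function name `vertex_refinement`, the nested-tuple encoding, and the toy graph are illustrative choices, not taken from the slides.

```python
def vertex_refinement(graph, depth):
    """Return {node: signature} after `depth` rounds of refinement.

    Level 0 carries no information (every node gets the empty tuple).
    Level i is the sorted multiset of level i-1 signatures of the neighbors,
    so the level 1 signature has one entry per neighbor (the degree).
    """
    current = {x: () for x in graph}
    for _ in range(depth):
        current = {
            x: tuple(sorted(current[z] for z in graph[x]))
            for x in graph
        }
    return current


# Hypothetical toy graph: path a-b-c with two extra leaves d and e on c.
toy = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d", "e"},
    "d": {"c"},
    "e": {"c"},
}

level1 = vertex_refinement(toy, 1)
level2 = vertex_refinement(toy, 2)
print({x: len(sig) for x, sig in level1.items()})  # degrees
print(level2["a"] == level2["d"])  # does level 2 still confuse a and d?
```

In this toy graph the leaves a, d, and e all have degree 1, so the first level cannot tell them apart. The second level separates a from d and e, because a's only neighbor has degree 2 while theirs has degree 3.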


Vertex Refinement Queries (Cont’d)

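Relative Equivalence
– Two nodes are equivalent relative to a query if the query returns the same answer for both.
– [Figure: example graph on nodes A, B, C, D, E, F, G, H]
– If the adversary knows the answer to a query that only G satisfies, then G is quickly re-identified in the anonymized graph!
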

Subgraph Queries

Two drawbacks of vertex refinement queries
– They always return “correct” information.
– The amount of information they return depends on the degree of the target node.

Subgraph queries assert the existence of a subgraph around the “target” node.
– [Figure: example subgraphs around node B with 3, 4, and 5 edge facts]
– Assume that the adversary knows a given number of edge facts around the target node.

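A subgraph query can be read as "does this small pattern occur in the graph with its distinguished node placed on the candidate?". The brute-force backtracking sketch below is an assumption made for illustration (the slides do not give an algorithm), and the names `embeds_at` and `subgraph_candidates`, as well as the toy graph, are hypothetical. It only requires the pattern's edges to exist, so extra edges in the graph are allowed.

```python
def embeds_at(graph, pattern_edges, anchor, candidate):
    """Return True if the pattern occurs in graph with anchor mapped to candidate."""
    # Visit the anchor first, then the other pattern nodes in first-seen order.
    order = [anchor]
    for edge in pattern_edges:
        for p in edge:
            if p not in order:
                order.append(p)

    def extend(mapping, index):
        if index == len(order):
            return True
        p = order[index]
        for g in graph:
            if g in mapping.values():
                continue  # the mapping must be one-to-one
            mapping[p] = g
            # Every pattern edge with both endpoints mapped must be a graph edge.
            consistent = all(
                mapping[v] in graph[mapping[u]]
                for u, v in pattern_edges
                if u in mapping and v in mapping
            )
            if consistent and extend(mapping, index + 1):
                return True
            del mapping[p]
        return False

    return extend({anchor: candidate}, 1)


def subgraph_candidates(graph, pattern_edges, anchor):
    """Candidate set of a subgraph query: all nodes where the pattern fits."""
    return {y for y in graph if embeds_at(graph, pattern_edges, anchor, y)}


# Hypothetical toy graph: triangle a-b-c with a tail c-d-e.
g = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
triangle = [("x", "y"), ("y", "z"), ("x", "z")]        # 3 edge facts
triangle_plus = triangle + [("x", "w")]                # 4 edge facts
print(subgraph_candidates(g, triangle, "x"))
print(subgraph_candidates(g, triangle_plus, "x"))
```

Adding a fourth edge fact shrinks the candidate set from the three triangle nodes to the single node c, which shows how each additional edge fact can sharpen the adversary's knowledge.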

Hub Fingerprint Queries

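A hub is a node with high degree and high betweenness centrality.
– Hubs are easily re-identified by an adversary.

A hub fingerprint for a node is a vector of its distances to the hubs, restricted to observable hub connections.
– [Figure: example graph on nodes A, B, C, D, E, F, G, H with hubs marked]
– Closed World: a missing entry means the hub is not reachable within the distance limit (distance 1 in the example).
– Open World: a missing entry only reflects incomplete knowledge.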
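A hub fingerprint can be computed with a breadth-first search that stops at the distance limit. The sketch below assumes that 0 encodes "no hub connection observed within the limit"; that encoding, the function names `hub_fingerprint` and `consistent`, and the toy graph are assumptions for illustration. A side effect of the encoding is that a hub's distance to itself also reads as 0.

```python
from collections import deque


def hub_fingerprint(graph, node, hubs, max_dist):
    """Distances from node to each hub; 0 means 'not reached within max_dist'."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        x = queue.popleft()
        if dist[x] == max_dist:
            continue  # do not explore beyond the distance limit
        for z in graph[x]:
            if z not in dist:
                dist[z] = dist[x] + 1
                queue.append(z)
    return tuple(dist.get(h, 0) for h in hubs)


def consistent(known, actual, open_world):
    """Does a node with fingerprint `actual` match the adversary's `known`?"""
    if not open_world:
        return known == actual  # closed world: a 0 asserts 'no connection'
    # Open world: a 0 in the adversary's knowledge constrains nothing.
    return all(k == 0 or k == a for k, a in zip(known, actual))


# Hypothetical toy graph with two hubs, h1 and h2.
net = {
    "h1": {"a", "b", "h2"},
    "h2": {"h1", "c"},
    "a": {"h1"},
    "b": {"h1"},
    "c": {"h2", "d"},
    "d": {"c"},
}
hubs = ["h1", "h2"]
print(hub_fingerprint(net, "a", hubs, 1))
print(hub_fingerprint(net, "a", hubs, 2))
```

Under the open-world reading, the adversary's partial fingerprint (1, 0) is still compatible with a node whose full fingerprint is (1, 2). Under the closed-world reading it is not.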


Disclosure in Real Networks

Experiments on the impact of external information

Three networked data sets
– Co-author graphs, taken from the arXiv archive
– A “real” email dataset, collected by the CALO Project
– An IP-level network trace collected at a major university

Consider each node in turn as a target.
– Compute the candidate set for the target.
  – Smaller candidate set: more vulnerable!
– Characterize how many nodes are protected and how many are re-identifiable.
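The experimental procedure can be sketched as follows: compute every node's candidate-set size under a query, then count how many nodes fall into each size range. The bucket boundaries, the function names, and the star-shaped toy graph are illustrative assumptions, and degree stands in for whichever knowledge query is being evaluated.

```python
from collections import Counter


def candidate_set_sizes(graph, query):
    """Map each node to the size of its candidate set under `query`."""
    answers = {x: query(graph, x) for x in graph}
    class_sizes = Counter(answers.values())
    return {x: class_sizes[answers[x]] for x in graph}


def bucket(size):
    """Assumed size ranges; 1 means the node is uniquely re-identified."""
    if size == 1:
        return "1"
    if size <= 4:
        return "2-4"
    if size <= 10:
        return "5-10"
    if size <= 20:
        return "11-20"
    return "21+"


def disclosure_profile(graph, query):
    """Count nodes per candidate-set size range."""
    sizes = candidate_set_sizes(graph, query)
    return Counter(bucket(s) for s in sizes.values())


def degree_query(graph, x):
    return len(graph[x])


# Hypothetical toy graph: a star with center c and five leaves.
star = {"c": {"l1", "l2", "l3", "l4", "l5"}}
for leaf in list(star["c"]):
    star[leaf] = {"c"}

print(disclosure_profile(star, degree_query))
```

In the star, the center is the only node of degree 5 and is re-identified, while each leaf hides among the five leaves.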


Disclosure in Real Networks (Cont’d)

Hub Fingerprint Queries
– Hubs: the five highest degree nodes or the ten highest degree nodes, depending on the data set.




Anonymity in Random Graphs

Theoretical approach to privacy risk with random graphs
– Erdős-Rényi Model (ER Model) with n nodes and edge connection probability p

Asymptotic analysis of robustness against knowledge attacks
– Sparse ER Graphs: robust against vertex refinement queries of any depth
– Dense ER Graphs: robust against shallow vertex refinement queries, but vulnerable against deeper ones
– Super-dense ER Graphs: vulnerable against vertex refinement queries
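The asymptotic claims can be explored empirically with a small simulation: generate ER graphs at different densities and measure what fraction of nodes a vertex refinement query singles out. This is a sketch only. The edge probabilities chosen for each regime (a constant over n, a multiple of log n over n, and a constant), the graph size, and the function names are assumptions, and a finite simulation does not prove an asymptotic result.

```python
import math
import random
from collections import Counter


def er_graph(n, p, seed):
    """Sample an undirected ER graph: each pair is linked with probability p."""
    rng = random.Random(seed)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def unique_fraction(graph, depth):
    """Fraction of nodes whose refinement signature is shared with no other node."""
    sig = {x: () for x in graph}
    for _ in range(depth):
        sig = {x: tuple(sorted(sig[z] for z in graph[x])) for x in graph}
    counts = Counter(sig.values())
    return sum(1 for x in graph if counts[sig[x]] == 1) / len(graph)


n = 100
regimes = {
    "sparse": 2 / n,                    # assumed: constant / n
    "dense": 2 * math.log(n) / n,       # assumed: constant * log(n) / n
    "super-dense": 0.5,                 # assumed: constant
}
for name, p in regimes.items():
    graph = er_graph(n, p, seed=0)
    print(name, unique_fraction(graph, 1), unique_fraction(graph, 2))
```

Because each refinement level is at least as fine as the one before it, the fraction of uniquely identified nodes can only stay the same or grow as the depth increases.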


Anonymity in Random Graphs (Cont’d)

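Anonymity Against Subgraph Queries
– Depends on the number of nodes in the largest clique (the clique number).
– If a subgraph query has fewer nodes than the largest clique, then every node of that clique matches the query.
– The clique number is therefore a useful lower bound on the size of the candidate set, which bounds the disclosure.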

Graph Generalization for Anonymization

Generalize a naïvely-anonymized graph by grouping its nodes into supernodes.
– This introduces much uncertainty (measured by the number of possible worlds)!
– Find the partitioning that maximizes the likelihood while ensuring that every supernode meets a required minimum size.
– Apply the simulated annealing method to find the partitioning.
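The number of possible worlds can be computed directly when a generalized graph is described by its supernode sizes and by edge counts within and between supernodes. The sketch below assumes that description and assumes every consistent graph is equally likely, so the likelihood of the original graph is one over the count. The function name and the dictionary layout are hypothetical.

```python
from math import comb


def possible_worlds(sizes, internal_edges, cross_edges):
    """Count the graphs consistent with a generalized graph.

    sizes:          {supernode: number of nodes inside it}
    internal_edges: {supernode: number of edges inside it}
    cross_edges:    {(supernode_x, supernode_y): number of edges between them}
    """
    total = 1
    for name, size in sizes.items():
        slots = size * (size - 1) // 2          # node pairs inside the supernode
        total *= comb(slots, internal_edges.get(name, 0))
    for (x, y), count in cross_edges.items():
        slots = sizes[x] * sizes[y]             # node pairs across the two supernodes
        total *= comb(slots, count)
    return total


# Hypothetical generalized graph with two supernodes of two nodes each.
worlds = possible_worlds(
    sizes={"X": 2, "Y": 2},
    internal_edges={"X": 1, "Y": 0},
    cross_edges={("X", "Y"): 2},
)
print(worlds, 1 / worlds)  # number of worlds and the implied likelihood
```

A search procedure such as simulated annealing would evaluate a score like this for each proposed partition, preferring partitions with fewer possible worlds among those that respect the minimum supernode size.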

Graph Generalization for Anonymization (Cont’d)

 How to analyze the generalized graph?

  Construct the synthetic graph using the tagged information.

Perform standard graph analysis on this synthetic graph.
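One way to build such a synthetic graph is to place the recorded number of edges at random inside each supernode and between each pair of supernodes. The sketch below assumes the tagged information consists of supernode sizes and edge counts. The function name `sample_world`, the placeholder node ids, and the example counts are hypothetical.

```python
import random
from itertools import combinations, product


def sample_world(sizes, internal_edges, cross_edges, seed=None):
    """Sample one graph (as a set of edges) consistent with the edge counts."""
    rng = random.Random(seed)
    # Placeholder node ids such as ("X", 0): the real identities are not published.
    members = {
        name: [(name, i) for i in range(size)]
        for name, size in sizes.items()
    }
    edges = set()
    for name, nodes in members.items():
        slots = list(combinations(nodes, 2))
        edges.update(rng.sample(slots, internal_edges.get(name, 0)))
    for (x, y), count in cross_edges.items():
        slots = list(product(members[x], members[y]))
        edges.update(rng.sample(slots, count))
    return edges


synthetic = sample_world(
    sizes={"X": 3, "Y": 2},
    internal_edges={"X": 2, "Y": 1},
    cross_edges={("X", "Y"): 3},
    seed=0,
)
print(len(synthetic))
```

Repeating the sampling with different seeds gives a collection of synthetic graphs, and a graph measure can then be averaged over that collection.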

Graph Generalization for Anonymization (Cont’d)

How does graph generalization affect network properties?

Examine five properties on the three real-world networks.
– Degree
– Path Length
– Transitivity (Clustering Coefficient)
– Network Resilience
– Infectiousness

Perform the experiments on 200 synthetic graphs.
– Repeat for each setting of the minimum supernode size.


Conclusion

Three contributions
– Formalize models of adversary knowledge.
– Provide a starting point for the theoretical study of privacy risks in network data.
– Introduce a new anonymization technique that generalizes the original graph.
