Resisting Structural Re-identification in Anonymized Social Networks
Michael Hay, Gerome Miklau, David Jensen, Don Towsley, Philipp Weis
University of Massachusetts Amherst
Session : Privacy & Authentication, VLDB 2008
2011-01-21
Presented by Yongjin Kwon
Outline
Introduction
Adversary Knowledge Models
– Vertex Refinement Queries
– Subgraph Queries
– Hub Fingerprint Queries
Disclosure in Real Networks
Anonymity in Random Graphs
Graph Generalization for Anonymization
Conclusion
Copyright 2011 by CEBT
Introduction
Large amounts of data are kept in various stores.
– Supermarket Transactions
– Web Server Logs
– Sensor Data
– Interactions in Social Networks: Email, Twitter …
Data owners publish sensitive information to facilitate research.
Reveal as much important information as possible while preserving the privacy of the individuals in the data.
In personal data, analysts may find valuable information.
Introduction (Cont’d)
A Face Is Exposed for AOL Searcher No. 4417749 [New York Times, August 9, 2006]
AOL collected 20 million Web search queries and published them.
Although the company naïvely anonymized the data, the identity of AOL user “No. 4417749” was revealed: “Thelma Arnold, a 62-year old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.”
A serious privacy risk!
Introduction (Cont’d)
Potential privacy risks in network data
Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs [Sexually Trans. Infections, 2002]
– A social network, which represents a set of individuals related by sexual contacts and shared drug injections, was published in order to analyze how HIV spreads.
Enron Email Dataset (http://www.cs.cmu.edu/~enron/)
– The email collection was released for investigation.
– It is the only “real” email collection that is publicly available, because privacy issues keep others from being released.
Introduction (Cont’d)
Attacks on naïvely anonymized network data
Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography [WWW 2007]
Active Attack
– An adversary chooses a set of targets, creates a small number of fake nodes with edges to these targets, and constructs a highly identifiable pattern of links among the new nodes.
– After the network is released, the adversary can recognize the pattern and the fake nodes, and reveal the sensitive information of the targets.
Passive Attack
– Most vertices in network data usually belong to a small uniquely identifiable subgraph.
– An adversary may collude with other friends to identify additional nodes connected to the distinct subset of the coalition.
Introduction (Cont’d)
An adversary with some (structural) background knowledge may compromise the privacy of some victims.
The naïve anonymization is NOT sufficient!
A new way of resisting attempts to re-identify individuals in published network data must be proposed.
Need to think of…
– Types of adversary knowledge
– A theoretical approach to privacy risks
– A way of preserving privacy while maintaining high utility of the data
Adversary Knowledge Models
The adversary’s background knowledge is modeled as “correct” answers to a restricted knowledge query.
The adversary uses the query to refine the feasible candidate set.
Three knowledge models
– Vertex Refinement Queries
– Subgraph Queries
– Hub Fingerprint Queries
Vertex Refinement Queries
These queries report on the local structure of the graph around the “target” node.
Example for a target node B:
– Degree of B
– Degrees of neighbors of B
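The two example queries above (degree, then degrees of neighbors) can be sketched as successive refinement levels. This is a minimal illustration, not the authors' code: the function names `vertex_refinement` and `candidate_set`, the level numbering, and the five-node graph are all assumptions made for the example.

```python
def vertex_refinement(adj, i):
    """Return {node: signature} for refinement level i (hypothetical helper).

    Level 0 carries no structural information, level 1 is the degree,
    and each later level is the collection of the neighbors' signatures
    from the previous level.
    """
    if i == 0:
        return {v: 0 for v in adj}
    if i == 1:
        return {v: len(adj[v]) for v in adj}
    prev = vertex_refinement(adj, i - 1)
    # A sorted tuple stands in for the multiset of neighbor signatures.
    return {v: tuple(sorted(prev[u] for u in adj[v])) for v in adj}


def candidate_set(adj, i, target):
    """Nodes the adversary cannot tell apart from the target at level i."""
    sig = vertex_refinement(adj, i)
    return {v for v in adj if sig[v] == sig[target]}


# Hypothetical example graph (not the graph from the slides).
adj = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}
print(sorted(candidate_set(adj, 1, "A")))  # ['A', 'C', 'E']
print(sorted(candidate_set(adj, 2, "A")))  # ['A', 'C']
print(sorted(candidate_set(adj, 2, "E")))  # ['E']
```

In this toy graph, knowing only the degree leaves E hidden among three nodes, while knowing the neighbors' degrees singles it out.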
Vertex Refinement Queries (Cont’d)
Relative Equivalence
(Figure: example graph with nodes A, B, C, D, E, F, G, H)
If the adversary knows the answer to a vertex refinement query on G, then G can be quickly re-identified in the anonymized graph!
Subgraph Queries
Two drawbacks of vertex refinement queries
– Always return “correct” information.
– Depend on the degree of the target node.
These queries assert the existence of a subgraph around the “target” node.
(Figure: subgraphs around target node B with 3, 4, and 5 edge facts)
Assume that the adversary knows the number of edge facts around the target node.
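A subgraph query can be sketched as a small pattern of edge facts rooted at the target: a node is a candidate if the pattern can be laid over the graph with the root placed on that node. The brute-force matcher below is only an illustration for tiny graphs; the names `embeds` and `subgraph_candidates` and the example graph are assumptions, and the pattern is matched as a (not necessarily induced) subgraph.

```python
def embeds(adj, query_edges, root, x):
    """True if the query pattern fits in the graph with `root` mapped to x."""
    q_nodes = [root] + sorted({u for e in query_edges for u in e} - {root})

    def extend(mapping, k):
        if k == len(q_nodes):
            return True
        q = q_nodes[k]
        for cand in adj:
            if cand in mapping.values():
                continue  # each pattern node needs its own graph node
            mapping[q] = cand
            # Every edge fact whose endpoints are both placed must exist.
            ok = all(mapping[b] in adj[mapping[a]]
                     for a, b in query_edges
                     if a in mapping and b in mapping)
            if ok and extend(mapping, k + 1):
                return True
            del mapping[q]
        return False

    return extend({root: x}, 1)


def subgraph_candidates(adj, query_edges, root):
    return {x for x in adj if embeds(adj, query_edges, root, x)}


# Hypothetical example graph and patterns (root is "r").
adj = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}
two_facts = [("r", "x"), ("r", "y")]
three_facts = [("r", "x"), ("r", "y"), ("r", "z")]
print(sorted(subgraph_candidates(adj, two_facts, "r")))    # ['B', 'D']
print(sorted(subgraph_candidates(adj, three_facts, "r")))  # ['B']
```

Each extra edge fact can only shrink the candidate set, which is why the number of edge facts is a natural measure of the adversary's knowledge.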
Hub Fingerprint Queries
A hub is a node with high degree and high betweenness centrality.
Hubs are easily re-identified by an adversary.
A hub fingerprint for a node is a vector of distances from observable hub connections.
(Figure: example graph with nodes A to H, one of them marked as a hub)
Closed World : Not reachable within distance 1
Open World : Incomplete knowledge
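A hub fingerprint can be sketched with a bounded breadth-first search. The code below is an illustration under stated assumptions: 0 encodes "no hub connection observed within the distance limit", the closed-world adversary requires an exact match, and the open-world adversary treats every 0 in their own knowledge as "unknown". The function names and the example graph are hypothetical.

```python
from collections import deque


def hub_fingerprint(adj, hubs, x, limit):
    """Distances from x to each hub, with 0 where the hub is beyond `limit`.

    Note that a hub's distance to itself is also 0 in this encoding.
    """
    dist = {x: 0}
    queue = deque([x])
    while queue:
        v = queue.popleft()
        if dist[v] == limit:
            continue  # do not explore past the distance limit
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return tuple(dist.get(h, 0) for h in hubs)


def fingerprint_candidates(adj, hubs, known, limit, open_world=False):
    """Nodes consistent with the adversary's fingerprint `known`."""
    result = set()
    for y in adj:
        fp = hub_fingerprint(adj, hubs, y, limit)
        if open_world:
            # Only the distances the adversary actually knows must agree.
            match = all(k == 0 or k == f for k, f in zip(known, fp))
        else:
            match = fp == known
        if match:
            result.add(y)
    return result


# Hypothetical example: B and D play the role of hubs.
adj = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}
hubs = ("B", "D")
print(sorted(fingerprint_candidates(adj, hubs, (1, 0), 2)))                   # ['D']
print(sorted(fingerprint_candidates(adj, hubs, (1, 0), 2, open_world=True)))  # ['A', 'C', 'D']
```

The same fingerprint identifies fewer nodes in the open world, because missing entries no longer rule anything out.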
Disclosure in Real Networks
Experiments on the impact of external information
Three network data sets
– Hep-Th : co-author graph, taken from the arXiv archive
– Enron : “real” email dataset, collected by the CALO Project
– Net-trace : IP-level network trace collected at a major university
Consider each node in turn as a target.
Compute the candidate set for the target.
– Smaller candidate set : more vulnerable!
Characterize how many nodes are protected and how many are re-identifiable.
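The procedure above can be sketched for any query that assigns each node a signature: nodes sharing a signature form each other's candidate set. The helper names and the threshold parameter `k` below are assumptions for illustration, and the degree signature is just one possible choice of query.

```python
from collections import Counter


def candidate_set_sizes(signature):
    """Map each node to the size of its candidate set.

    `signature` is {node: hashable answer to the adversary's query}.
    """
    counts = Counter(signature.values())
    return {v: counts[s] for v, s in signature.items()}


def fraction_at_risk(signature, k):
    """Fraction of nodes whose candidate set has fewer than k members."""
    sizes = candidate_set_sizes(signature)
    return sum(1 for s in sizes.values() if s < k) / len(sizes)


# Hypothetical example: the signature is the node degree.
adj = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}
degree = {v: len(adj[v]) for v in adj}
print(candidate_set_sizes(degree))  # {'A': 3, 'B': 1, 'C': 3, 'D': 1, 'E': 3}
print(fraction_at_risk(degree, 2))  # 0.4
```

Here B and D are uniquely identified by degree alone, so two of the five nodes are re-identifiable under this query.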
Disclosure in Real Networks (Cont’d)
Vertex Refinement Queries
Disclosure in Real Networks (Cont’d)
Subgraph Queries
Two strategies to build subgraphs
– Sampled Subgraph
– Degree Subgraph
Disclosure in Real Networks (Cont’d)
Hub Fingerprint Queries
Hub : five highest degree nodes (Enron), ten highest degree nodes (Hep-Th, Net-trace)
Anonymity in Random Graphs
Theoretical approach to privacy risk with random graphs
Erdős-Rényi Model (ER Model) with n nodes and edge connection probability p.
– Asymptotic analysis of robustness against knowledge attack
Sparse ER Graphs : robust against […] for any […]
Dense ER Graphs : robust against […], but vulnerable against […]
Super-dense ER Graphs : vulnerable against […]
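The ER model itself is easy to simulate, which gives a rough empirical feel for how edge density affects re-identification. The sketch below generates a graph with n nodes and edge probability p and measures how many nodes are unique by degree; the function names, the seed, and the values of n and p are arbitrary assumptions, and a small simulation like this does not reproduce the asymptotic results.

```python
import random
from collections import Counter


def er_graph(n, p, seed=0):
    """Sample an Erdős-Rényi graph: each of the n*(n-1)/2 pairs is an edge
    independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def unique_degree_fraction(adj):
    """Fraction of nodes whose degree is shared with no other node."""
    degree = {v: len(adj[v]) for v in adj}
    counts = Counter(degree.values())
    return sum(1 for v in adj if counts[degree[v]] == 1) / len(adj)


# Compare a few densities on a small graph (values chosen arbitrarily).
for p in (0.01, 0.1, 0.5):
    g = er_graph(200, p, seed=1)
    print(p, unique_degree_fraction(g))
```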
Anonymity in Random Graphs (Cont’d)
Anonymity Against Subgraph Queries
Depends on the number of nodes in the largest clique
If […] for a subgraph query […], then […]
The clique number is a useful lower bound on the disclosure.
Random Graphs with Attributes
Graph Generalization for Anonymization
Generalize a naïvely-anonymized graph by partitioning its nodes into supernodes.
Much uncertainty! (measured by the number of possible worlds)
Find the partitioning that maximizes the likelihood while ensuring that each supernode contains at least k nodes.
Apply the simulated annealing method to find the partitioning.
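The uncertainty measure can be sketched as a count of possible worlds. The code assumes that the generalized graph records only supernode sizes and the number of edges within and between supernodes, and that a possible world is any graph on the original nodes matching those counts; under a uniform distribution over worlds, the likelihood of the original graph is then 1 divided by this count. The function names and the example partition are hypothetical, and the simulated annealing search itself is not shown.

```python
from collections import Counter
from math import comb


def generalize(edges, partition):
    """Summarize a graph by supernode sizes and edge counts.

    `partition` maps each node to its supernode label (labels must be sortable).
    """
    size = Counter(partition.values())
    weight = Counter()
    for u, v in edges:
        key = tuple(sorted((partition[u], partition[v])))
        weight[key] += 1
    return size, weight


def count_possible_worlds(size, weight):
    """Number of graphs consistent with the supernode sizes and edge counts."""
    total = 1
    names = sorted(size)
    for i, x in enumerate(names):
        for y in names[i:]:
            # Number of node pairs that could hold an edge for this block.
            if x == y:
                slots = size[x] * (size[x] - 1) // 2
            else:
                slots = size[x] * size[y]
            total *= comb(slots, weight[(x, y)])
    return total


# Hypothetical example with k = 2: supernodes X = {A, B} and Y = {C, D, E}.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")]
partition = {"A": "X", "B": "X", "C": "Y", "D": "Y", "E": "Y"}
size, weight = generalize(edges, partition)
print(count_possible_worlds(size, weight))  # 45
```

A search procedure would compare partitions by this count, preferring the one with the fewest possible worlds among those whose supernodes are all large enough.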
Graph Generalization for Anonymization (Cont’d)
How to analyze the generalized graph?
Construct the synthetic graph using the tagged information.
Perform standard graph analysis on this synthetic graph.
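Constructing a synthetic graph can be sketched as random sampling: for each pair of supernodes, place the recorded number of edges uniformly at random among the node pairs they span. This assumes that the "tagged information" consists of supernode memberships and edge counts; the function name `sample_world` and the example data are hypothetical.

```python
import random
from itertools import combinations, product


def sample_world(members, weight, seed=None):
    """Sample one graph consistent with a generalized graph.

    `members` maps each supernode to its list of nodes; `weight` maps a
    pair of supernode labels (x, y) to the number of edges between them,
    with x == y for edges inside a supernode.
    """
    rng = random.Random(seed)
    edges = []
    for (x, y), count in weight.items():
        if x == y:
            slots = list(combinations(members[x], 2))
        else:
            slots = list(product(members[x], members[y]))
        # Choose `count` distinct node pairs uniformly at random.
        edges.extend(rng.sample(slots, count))
    return edges


# Hypothetical generalized graph with two supernodes.
members = {"X": ["A", "B"], "Y": ["C", "D", "E"]}
weight = {("X", "X"): 1, ("X", "Y"): 2, ("Y", "Y"): 1}
synthetic = sample_world(members, weight, seed=1)
print(len(synthetic))  # 4
```

Standard measurements such as degree or path length can then be taken on many sampled graphs and averaged.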
Graph Generalization for Anonymization (Cont’d)
How does graph generalization affect network properties?
Examine five properties on the three real-world networks.
– Degree
– Path Length
– Transitivity (Clustering Coefficient)
– Network Resilience
– Infectiousness
Perform the experiments on the 200 synthetic graphs.
Repeat for each […].
Graph Generalization for Anonymization (Cont’d)
Conclusion
Three contributions
– Formalize models of adversary knowledge.
– Provide a starting point for the theoretical study of privacy risks in network data.
– Introduce a new anonymization technique based on generalizing the original graph.