Privacy-Aware Publishing of Netflix Data

Download Report

Transcript Privacy-Aware Publishing of Netflix Data

A Renewal Theory Approach to Anomaly
Detection in Communication Networks
Tina Eliassi-Rad†‡
[email protected]
‡Lawrence Livermore Lab
Thompson†
Brian
[email protected]
†Rutgers University
Introduction/Motivation
Our Approach
Experimental Results
• A communication network is a time-evolving graph
that models interactions between entities over time
• Consider the weighted graph Gt = (V,E) representing a
communication network at time t, with w(e) = Rec(e,t)
• Experiments on 4 datasets: Enron email, LBNL IP traffic,
Twitter messages, and Reality Mining Bluetooth proximity
• Pervasive in today’s world: phone calls, blog posts,
email, social network messages, IP connections
• For E   E, let XE’,p = # of edges in E’ with w(e) ≤ p
• We define the p-divergence of E’ as follows:
1
, where X ~ Bin(|E’|,p)
Div p (E ) 
P ( X  X E ,p )
• Clear and intuitive visualization reveals anomalous activity
in the Bluetooth dataset at two points in time
• Volatile: static network analysis tools not sufficient
=?
=!
+
4
=
+
1
t=2
t=3
Day 220:
1
0.3
1
2
1
t=1
1
1
t=4
0.1
Summary graph
• Goal: Efficiently identify local or global changes in
communication activity or graph structure over time
0.2
0.9
• Communication across an edge is modeled as a
sequence of time-stamped events, which yields a
distribution of inter-arrival times (IATs)
9:00 am
9:30 am
10:30 am
0.8
0.7
frequency
0.1
0.3
• Let E’ be the set
of thick edges
• |E’| = 6
• XE’,0.3 = 4
• P(X ≥ 4) = 0.07
• Div0.3(E’) = 14.2
• The max-divergence of E’ is: Divmax (E )  max Divp (E ' )
p
• Intuitively, p-divergence of d means that the probability
of at least XE’,p edges occurring p-recently is 1/d
• A (maximal) p-component of G = (V,E) is a connected
subgraph C = (V’,E’) such that (1) w(e) ≤ p for all e in E’
and (2) w(e) > p for all e not in E’ incident to V’
2
1
2:00 pm
11:30 am
0.4
0.2
Model

0.5
0.7
0.5
0.3
<Alice, Bob>
0.6
Day 250:
Sorted by degree
Recency
MCD Analysis
• A simple plot of MCD over time (left) identifies handlabeled scanning activity in the LBNL dataset, as well as
other anomalies overlooked by human analysts
• The plot at right shows scalability using the Twitter dataset
(263k nodes, 308k edges, 1.1 million timestamps)
0
0,5
1
1,5
2
2,5
• The set of p-components partition V, for all p in [0,1]
3
inter-arrival time (hours)
• IATs for human interaction frequently follow a
power-law distribution
Bounded Pareto
• The Bounded Pareto allows
us to model communication
concisely, and make updates in
real-time and constant space
xmindistribution
Runtime for MCD Algorithm
2000
• The p-components of Gt for p = 0.3 are shown in blue
runtime (milliseconds)
+
1
xmax
Algorithm
1
0,8
0,6
1500
1000
500
0
0
15 000
0,4
30 000
45 000
60 000
number of live edges
The MCD Algorithm:
0,2
0
0
1
2
3
4
5
time
• The recency function Rec : 2T x T → [0,1] assigns
a weight to edge e at time t based on its age, i.e. the
time since the last event, subject to the constraints:
• Rec(e,t) = 0 at the time an event occurs, 1
when age = xmax, and is increasing in between
• Rec(e,t) is uniform over [0,1] when sampled
uniformly in time
• Rec is uniquely determined by the constraints
• The uniformity property eliminates time-scale bias
1. Calculate edge weights using the Recency function
Conclusions
2. Gradually increase the edge threshold, updating
components and divergence values as necessary
• Traditional network analysis is inadequate for dealing with
communication networks, which are dynamic and volatile
3. Output: Disjoint components with max divergence
0.1
V1
0.3
0.9
V5
0.5
V2
0.25
V3
0.7
V4
p
0.1
0.25
0.3
0.5
0.7
Component
{V1,V2}
{V1,V2,V3}
{V1,V2,V3}
{V4,V5}
{V1,V2,V3,V4,V5}
Div
2.9
2.7
6.1
1.1
2.4
0.9
{V1,V2,V3,V4,V5} 1.9
• Studying the inter-arrival time distributions of edges is a
novel approach for analyzing communication networks
2.4
• Our algorithms are streaming, and run in O(m) space and
O(m log m) time, where m is the # of edges in the dataset
6.1
V3
2.9
V1
V2
1.1
V4
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
V5
• MCD analysis can be easily visualized and used as a tool
for monitoring activity in a variety of real-world domains
IM Review and Release number<Insert number here>