DAPA-V10: Discovery and Analysis of Patterns and
Anomalies in Volatile Time-Evolving Networks
Brian Thompson, Rutgers University, [email protected]
Tina Eliassi-Rad, Lawrence Livermore Lab, [email protected]
Problem Statement
Given a volatile time-evolving network:
(1) Find persistent patterns.
(2) Detect local and global anomalous activity.
Challenges:
• Volatility: The network changes drastically and frequently.
• Sparsity: A single snapshot is extremely sparse.
• Scalability: Algorithms must be efficient for large networks of potentially millions of members.
Our Algorithm: DAPA-V10
1. Timestamped edges are used to construct a dynamic graph.
2. A cumulative graph is used to measure the average strengths of relationships.
3. Persistent patterns are identified. Substructures are selected to track activity both within and between components.
4. Substructures are monitored, flagging abnormal activity for investigation and analysis.
Source    Dest.     t_start   t_end
v49273    v71192    t = 5     t = 9
v83492    v12987    t = 12    t = 14
v40927    v62198    t = 13    t = 16
v98364    v39872    t = 20    t = 21
v18964    v38719    t = 20    t = 25
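As an illustration of step 1 (not part of the poster), here is a minimal Python sketch that stores timestamped edges like those in the table above and expands them into per-time-step snapshots of the dynamic graph. The names TimestampedEdge and build_dynamic_graph are hypothetical.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class TimestampedEdge:
    source: str
    dest: str
    t_start: int  # first time step the edge is active
    t_end: int    # last time step the edge is active

def build_dynamic_graph(edges):
    """Expand timestamped edges into per-time-step snapshots:
    snapshots[t] holds every (source, dest) pair active at time t."""
    snapshots = defaultdict(set)
    for e in edges:
        for t in range(e.t_start, e.t_end + 1):
            snapshots[t].add((e.source, e.dest))
    return snapshots

# The rows of the table above.
edges = [
    TimestampedEdge("v49273", "v71192", 5, 9),
    TimestampedEdge("v83492", "v12987", 12, 14),
    TimestampedEdge("v40927", "v62198", 13, 16),
    TimestampedEdge("v98364", "v39872", 20, 21),
    TimestampedEdge("v18964", "v38719", 20, 25),
]

snapshots = build_dynamic_graph(edges)
print(sorted(snapshots[13]))  # edges active at time step 13
```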
Experimental Results
• Dataset: a collection of email correspondence among 672 Enron employees from 1997 to 2002.
• Found 6 persistent patterns that represent connected
components of employees with regular communication.
• Substructures of edges within and between persistent
patterns are monitored over time for anomalous behavior.
• Anomalies found by DAPA-V10 correspond with events
surrounding the Enron scandal. The close correspondence
illustrates the effectiveness of our approach.
Network Representation
• Model a network as a dynamic graph G = (V, E_T).
• To capture temporal information, we construct a weighted cumulative graph G' = (V, E_T').
• Edge weights are defined by a decay function f: for each edge e ∈ E_T', w_e = Σ f(e'), summing over the timestamped edges e' ∈ E_T that connect the same pair of vertices as e.
Persistent Patterns
A persistent pattern is a collection of vertices that
(1) form a connected component, and
(2) communicate regularly.
Algorithm:
• Consider only edges with weight above a threshold θ.
• Decrease θ until a component of size |V| appears.
• Remove those edges and iterate on the remaining graph.
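A rough Python sketch of the thresholding loop above. The poster states the loop only at a high level, so the stopping criterion is interpreted here as "a sufficiently large connected component appears", controlled by an assumed min_size knob, and the schedule for lowering θ (theta_step) is likewise an assumption.

```python
from collections import defaultdict

def connected_components(nodes, edge_list):
    """Plain DFS connected components over an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, stack = {n}, [n]
        seen.add(n)
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        components.append(comp)
    return components

def persistent_patterns(weights, min_size=3, theta_step=0.9):
    """Thresholding loop sketch: keep only edges with weight >= theta, lower
    theta until a large-enough connected component appears, record it as a
    persistent pattern, drop its internal edges, and repeat on what is left.
    min_size and theta_step are assumed knobs, not values from the poster."""
    weights = dict(weights)           # {(u, v): weight} in the cumulative graph
    patterns = []
    theta = max(weights.values(), default=0.0)
    while weights and theta > 1e-6:
        kept = [e for e, w in weights.items() if w >= theta]
        nodes = {v for e in kept for v in e}
        found = [c for c in connected_components(nodes, kept) if len(c) >= min_size]
        if not found:
            theta *= theta_step       # decrease the threshold and try again
            continue
        for comp in found:
            patterns.append(comp)
            for e in list(weights):   # remove edges inside the new pattern
                if e[0] in comp and e[1] in comp:
                    del weights[e]
    return patterns
```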
[Figure: Distribution of edge weights and threshold points for Enron data (average edge weight vs. # of edges, both on log scales). Resulting persistent patterns shown in figure at center.]
Timeline of Enron Scandal
Time    Event
2/01    Executives get $1M bonuses; stock is soaring
4/01    Q1 profit $536M; Wall St. analyst suspicious
7/01    Reported earnings $50B; share price dropping
8/01    Public criticism of Enron accounting practices
9/01    9/11 attacks; Enron director sells 500K shares
10/01   Q3 loss of $618M; SEC begins investigation
11/01   Acquisition offer revoked; 'junk' credit rating
12/01   Enron files for bankruptcy, lays off employees
Conclusions
• Goal: identify anomalies on a local and global scale.
• Monitor substructures: sets of edges (1) within each persistent pattern, and (2) between each pair of patterns (see the sketch after this list).
• A substructure is anomalous if recent activity across its
edges differs significantly from what is expected.
• We discover persistent patterns in volatile time-evolving
networks and use them to find and rank anomalous events.
• Previous work focuses on identifying times of higher activity
overall. DAPA-V10 detects local anomalies, pinpointing
sources of unusual behavior for further analysis.
• Our approach is scalable to very large networks.
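To make the substructure bookkeeping concrete, here is a hypothetical Python helper that groups cumulative-graph edges into one "within" set per persistent pattern and one "between" set per pair of patterns. It is an illustration of the bullet above, not code from the authors.

```python
from itertools import combinations

def substructures(patterns, edges):
    """Group cumulative-graph edges into the monitored substructures:
    one edge set per persistent pattern ('within') and one per pair of
    patterns ('between'). patterns is a list of vertex sets; edges are (u, v)."""
    def pattern_of(v):
        for i, p in enumerate(patterns):
            if v in p:
                return i
        return None

    subs = {("within", i): set() for i in range(len(patterns))}
    subs.update({("between", i, j): set()
                 for i, j in combinations(range(len(patterns)), 2)})
    for u, v in edges:
        pu, pv = pattern_of(u), pattern_of(v)
        if pu is None or pv is None:
            continue                    # edge touches a vertex outside every pattern
        if pu == pv:
            subs[("within", pu)].add((u, v))
        else:
            subs[("between", min(pu, pv), max(pu, pv))].add((u, v))
    return subs

# Example: two patterns and a few edges.
patterns = [{"a", "b", "c"}, {"x", "y"}]
edges = [("a", "b"), ("b", "c"), ("a", "x"), ("x", "y")]
print(substructures(patterns, edges))
```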

Anomaly Detection
Algorithm:
• For all e ∈ E_T, define X_e = w_e if e ∈ G_t, and X_e = 1 − w_e otherwise.
• Measure the likelihood of a substructure S by X_S = Π_{e ∈ S} X_e.
• Flag S as anomalous if X_S(t) ≤ μ_S − λσ_S, where μ_S and σ_S are the mean and standard deviation of X_S over time and λ is an anomalicity threshold.
• Substructures are then analyzed in decreasing order of
anomalicity as resources and time allow.
• DAPA-V10 is efficient: run-time complexity is O(|E_T|).
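A Python sketch of how the likelihood and flagging rule above might be computed, assuming edge weights normalized to [0, 1]. The historical mean/standard-deviation baseline and the parameter lam stand in for the anomalicity threshold; they are assumptions, not the authors' exact test.

```python
import math
from statistics import mean, pstdev

def substructure_likelihood(substructure, weights, active_edges):
    """X_S(t) = product over e in S of X_e, where X_e = w_e if e is active
    at time t and 1 - w_e otherwise. Assumes weights lie in [0, 1];
    computed in log space for numerical stability."""
    log_x = 0.0
    for e in substructure:
        w = weights[e]
        x = w if e in active_edges else 1.0 - w
        log_x += math.log(max(x, 1e-12))   # guard against log(0)
    return math.exp(log_x)

def is_anomalous(history, current, lam=3.0):
    """Flag the substructure if the current likelihood falls more than lam
    standard deviations below its historical mean; lam plays the role of
    the anomalicity threshold."""
    mu, sigma = mean(history), pstdev(history)
    return current <= mu - lam * sigma

# Example: a two-edge substructure observed at some time step t.
weights = {("a", "b"): 0.8, ("b", "c"): 0.6}
x_t = substructure_likelihood([("a", "b"), ("b", "c")], weights,
                              active_edges={("a", "b")})
print(x_t, is_anomalous([0.5, 0.48, 0.52, 0.47], x_t))
```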
Future work:
• Conduct experiments on a variety of domains (e.g. cyber).
• Use Enron dataset to evaluate effectiveness of DAPA-V10
as an early predictor of high-impact events.
• Normalize at each time step to find local anomalies
independent of global trends in network activity.
• Incorporate semantic information from complex networks.