Lecture 10: Graph Mining


CS 521
Data Mining Techniques
Instructor: Abdullah Mueen
LECTURE 8: TIME SERIES AND GRAPH MINING
Definition of Time Series Motifs
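1. Length of the motif
2. Support of the motif
3. Similarity of the Pattern
4. Relative Position of the Pattern
[Figure: example time series, x-axis 20 to 200, y-axis -2 to 2]
Given a length, the motif is the most similar (least distant) pair of non-overlapping subsequences.
Subsequences are z-normalized before they are compared:
$\hat{x}_i = \frac{x_i - \mu_x}{\sigma_x}, \quad \hat{y}_i = \frac{y_i - \mu_y}{\sigma_y}$
$d(\hat{x}, \hat{y}) = \sqrt{\sum_i (\hat{x}_i - \hat{y}_i)^2}$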
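The motif definition relies on a distance between z-normalized subsequences. A minimal Python sketch of that distance follows. The function names are hypothetical, and using the population standard deviation (dividing by n) is an assumption, since the slide does not say which variant is meant.

```python
import math

def z_normalize(x):
    """Return (x_i - mean) / std for every element; assumes std > 0."""
    n = len(x)
    mean = sum(x) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / std for v in x]

def znorm_distance(x, y):
    """Euclidean distance between the z-normalized versions of x and y."""
    xh, yh = z_normalize(x), z_normalize(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xh, yh)))
```

Because of the normalization, two subsequences with the same shape but different offset or scale, such as [1, 2, 3] and [10, 20, 30], end up at distance 0.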
Problem Formulation
[Figure: a time series of length 1000 (values roughly -8000 to -7000); its subsequences are numbered 1, 2, ..., 873]
The most similar pair of non-overlapping subsequences
The closest pair of points in high dimensional space
◦ Optimal algorithm in two dimensions: Θ(n log n)
◦ For large dimensionality d, the optimal algorithm is effectively Θ(n²d)
Lower Bound
If P, Q and R are three points in a d-dimensional space
d(P,Q)+d(Q,R) ≥ d(P,R)
d(P,Q) ≥ |d(Q,R) - d(P,R)|
A third point R provides a very inexpensive lower bound on the true distance
If the lower bound is larger than the existing best, skip d(P, Q)
d(P,Q) ≥ |d(Q,R) - d(P,R)| ≥ BestPairDistance
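A small Python sketch of this pruning test, assuming Euclidean distance and that d(P,R) and d(Q,R) have already been computed. The names `euclid` and `can_skip` are hypothetical helpers, not part of the lecture.

```python
import math

def euclid(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def can_skip(d_pr, d_qr, best_pair_distance):
    """True when the lower bound |d(Q,R) - d(P,R)| already rules out the pair (P, Q)."""
    return abs(d_qr - d_pr) >= best_pair_distance

# Illustrative points: the bound never exceeds the true distance.
P, Q, R = (0.0, 0.0), (10.0, 0.0), (0.0, 1.0)
d_pr, d_qr = euclid(P, R), euclid(Q, R)
lower_bound = abs(d_qr - d_pr)
```

The check costs one subtraction, while the true distance costs a pass over all d coordinates.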
Circular Projection
Pick a reference point r
Circularly project all points onto a line passing through the reference point
Equivalent to computing the distance of every point from r and then sorting the points according to that distance
[Figure: 24 numbered points circularly projected around r onto a line]
The Order Line
For k = 1 to n-1:
• Compare every pair having k-1 points in between
• Do k scans of the order line, starting with the 1st to kth point
Skip a pair (P, Q) when |d(Q, r) - d(P, r)| ≥ BestPairDistance
[Figure: points sorted on the order line by distance from r, with the scans for k=1, k=2 and k=3]
Correctness
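If we search all offsets k = 1, 2, …, n-1, then all possible pairs are considered.
◦ n(n-1)/2 pairs
For any offset k, if none of the k scans needs an actual distance computation, then no distance computation will be needed for the remaining offsets k+1, …, n-1. The points are sorted by distance from r, so the lower bound for a pair that is k+1 positions apart is at least the lower bound for a pair that is k positions apart.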
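The order-line search can be sketched in Python as below. This is a simplified illustration for generic points, not the lecture's implementation. It assumes Euclidean distance, uses the first point as the reference r by default, and does not enforce the non-overlap constraint needed for time series subsequences. A single pass over positions replaces the k interleaved scans; the same pairs are visited.

```python
import math

def euclid(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def closest_pair_order_line(points, ref=None):
    """Return (distance, i, j) for the closest pair, with i < j indexing into points."""
    n = len(points)
    if ref is None:
        ref = points[0]  # assumption: any point may serve as the reference r
    # Order line: point indices sorted by their distance from r.
    dist_r = [euclid(p, ref) for p in points]
    order = sorted(range(n), key=lambda idx: dist_r[idx])
    best = (math.inf, -1, -1)
    for k in range(1, n):  # offset between positions on the order line
        any_computed = False
        for pos in range(n - k):
            i, j = order[pos], order[pos + k]
            # Triangle-inequality lower bound on d(points[i], points[j]).
            if dist_r[j] - dist_r[i] >= best[0]:
                continue
            any_computed = True
            d = euclid(points[i], points[j])
            if d < best[0]:
                best = (d, min(i, j), max(i, j))
        if not any_computed:
            break  # bounds only grow with larger offsets, so stop early
    return best
```

The early `break` is the correctness argument above turned into code: once a full offset is pruned, every larger offset is pruned as well.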
Graph Similarity
Edit distance/graph isomorphism:
◦ Tree Edit Distance
Feature extraction
◦ In/out degree
◦ Diameter
Iterative methods
◦ SimRank
Diameter
The largest shortest-path distance in the graph. It can be read from the all-pairs shortest-path distances computed by the Floyd-Warshall algorithm:
let dist be a |V| × |V| array of minimum distances initialized to ∞ (infinity)
for each vertex v
    dist[v][v] ← 0
for each edge (u,v)
    dist[u][v] ← w(u,v) // the weight of the edge (u,v)
for k from 1 to |V|
    for i from 1 to |V|
        for j from 1 to |V|
            if dist[i][j] > dist[i][k] + dist[k][j]
                dist[i][j] ← dist[i][k] + dist[k][j]
            end if
http://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
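A runnable Python sketch of the same idea: Floyd-Warshall followed by taking the largest entry. The edge-list input format and the choice to ignore unreachable pairs are assumptions made for this illustration.

```python
import math

def floyd_warshall(n, edges):
    """All-pairs shortest paths; edges is a list of directed (u, v, w) with vertices 0..n-1."""
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)  # keep the lightest parallel edge
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][j] > dist[i][k] + dist[k][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def diameter(n, edges):
    """Largest finite shortest-path distance (assumption: unreachable pairs are ignored)."""
    dist = floyd_warshall(n, edges)
    return max(d for row in dist for d in row if d != math.inf)

def undirected(pairs):
    """Hypothetical helper: expand undirected unit-weight edges into both directions."""
    return [(u, v, 1) for u, v in pairs] + [(v, u, 1) for u, v in pairs]
```

For example, the path 0-1-2-3 has diameter 3; adding the edge 0-3 turns it into a 4-cycle with diameter 2. The running time is Θ(|V|³), which is reasonable for small graphs only.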
SimRank
For a node v in a graph, we denote by I(v) and O(v) the set of in-neighbors and out-neighbors of v, respectively.
1. A solution s(∗, ∗) ∈ [0, 1] to the n² SimRank equations always exists and is unique.
2. Symmetric: s(a, b) = s(b, a)
3. Reflexive: s(a, a) = 1
http://www-cs-students.stanford.edu/~glenj/simrank.pdf
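The SimRank equations themselves are not reproduced in this transcript. The sketch below assumes the standard recurrence from the linked paper: s(a, a) = 1, s(a, b) = 0 if either node has no in-neighbors, and otherwise s(a, b) = C / (|I(a)| |I(b)|) · Σ s(i, j) over all i in I(a) and j in I(b). It is a naive fixed-point iteration; the decay constant C = 0.8 and the iteration count are illustrative defaults, not values from the lecture.

```python
def simrank(nodes, edges, C=0.8, iterations=10):
    """Naive iterative SimRank; edges is a list of directed (u, v) pairs."""
    in_nbrs = {v: [] for v in nodes}
    for u, v in edges:
        in_nbrs[v].append(u)
    # Start from s_0(a, b) = 1 if a == b, else 0.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not in_nbrs[a] or not in_nbrs[b]:
                    new[a][b] = 0.0
                else:
                    total = sum(sim[i][j] for i in in_nbrs[a] for j in in_nbrs[b])
                    new[a][b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim

# Two nodes pointed to by the same single node are similar with score C.
scores = simrank(["A", "B", "C"], [("A", "B"), ("A", "C")])
```

Each iteration touches every node pair and every pair of their in-neighbors, so this form is only practical for small graphs.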
Tree Edit Distance
http://grfia.dlsi.ua.es/ml/algorithms/references/editsurvey_bille.pdf
Tree Edit Distance
Applications
Find the most frequent tree structure in a phylogenetic tree.
Match a query subtree with a set of XML documents.
Ranking Nodes
PageRank
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PR(A) is the PageRank of page A,
PR(Ti) is the PageRank of pages Ti which link to page A,
C(Ti) is the number of outbound links on page Ti and
d is a damping factor which can be set between 0 and 1.
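Example
Three pages with damping factor d = 0.5. The equations imply that A links to B and C, B links to C, and C links to A:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
Solving this linear system gives the following PageRank values for the individual pages: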
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
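The same values can be approached by repeatedly applying the PageRank formula instead of solving the system directly. The Python sketch below assumes the link structure implied by the example equations (A links to B and C, B links to C, C links to A) and d = 0.5. The function name and the dictionary input format are choices made for this illustration.

```python
def pagerank(links, d=0.5, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)); links maps each page to its out-links."""
    pr = {page: 1.0 for page in links}  # arbitrary starting values
    for _ in range(iterations):
        new = {page: 1.0 - d for page in links}
        for src, targets in links.items():
            for t in targets:
                # src passes an equal share of its rank to every page it links to.
                new[t] += d * pr[src] / len(targets)
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
```

With d = 0.5 the error roughly halves on every iteration, so the values settle on 14/13, 10/13 and 15/13 well before 100 iterations.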
Matlab Script
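Matlab script for the example on the previous slide (requires the Symbolic Math Toolbox)
% x = PR(A), y = PR(B), z = PR(C)
syms x y z;
eqn1 = x == 0.5 + 0.5*z;
eqn2 = y == 0.5 + 0.25*x;
eqn3 = z == 0.5 + 0.25*x + 0.5*y;
% Rewrite the equations in matrix form A*X = B, then solve for X
[A,B] = equationsToMatrix([eqn1, eqn2, eqn3], [x, y, z]);
X = linsolve(A,B)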
HITS: Hyperlink-Induced Topic Search
http://www.cs.cornell.edu/home/kleinber/auth.pdf