Transcript Slide 1
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
Networks of tightly connected groups.
Network communities: sets of nodes with lots of connections inside and few to the outside (the rest of the network).
Communities, clusters, groups, modules.
Laplacian matrix (L): an n × n symmetric matrix
[Figure: example graph on nodes 1-6 with edges (1,2), (1,3), (1,5), (2,3), (3,4), (4,5), (4,6), (5,6)]
What is the trivial eigenvector and eigenvalue?
L =
        1    2    3    4    5    6
   1    3   -1   -1    0   -1    0
   2   -1    2   -1    0    0    0
   3   -1   -1    3   -1    0    0
   4    0    0   -1    3   -1   -1
   5   -1    0    0   -1    3   -1
   6    0    0    0   -1   -1    2
𝑥 = (1, … , 1) with 𝜆 = 0
Eigenvalues are non-negative real numbers
L = D − A (degree matrix minus adjacency matrix)
Now the question is: what is λ2 doing?
We will see that the eigenvector corresponding to λ2 essentially performs community detection.
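For concreteness, here is a minimal NumPy sketch (not part of the slides; the edge list is simply read off the matrix above) that builds L = D − A for the example graph and checks the trivial eigenpair.

```python
# A minimal sketch (not from the slides): build L = D - A for the 6-node example
# graph above and check the trivial eigenpair x = (1, ..., 1), lambda = 0.
import numpy as np

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1   # adjacency matrix
D = np.diag(A.sum(axis=1))                  # degree matrix
L = D - A                                   # Laplacian

print(L @ np.ones(n))                       # ~zero vector: (1, ..., 1) is an eigenvector with eigenvalue 0
print(np.round(np.linalg.eigvalsh(L), 1))   # non-negative eigenvalues: 0, 1, 3, 3, 4, 5
```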
For a symmetric matrix M:
  λ2 = min_x (xᵀ M x) / (xᵀ x)
  where x is a unit vector: ∑i xi² = 1
  and x is orthogonal to the 1st eigenvector: ∑i xi = 0

What is the meaning of min xᵀ L x on G?
  xᵀ · L · x = ∑_(i,j)∈E (xi − xj)²
Think of xi as a numeric value assigned to node i.
Set the xi to minimize ∑_(i,j)∈E (xi − xj)² while ∑i xi² = 1 and ∑i xi = 0.
The constraint ∑i xi = 0 means some xi > 0 and some xi < 0.
So set the values xi such that they differ as little as possible across the edges.
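The identity above is easy to check numerically; the sketch below (illustrative, reusing the 6-node example and rebuilding L so it runs on its own) compares both sides for one particular x.

```python
# Quick numeric check: x^T L x equals the sum of squared differences across the edges.
import numpy as np

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
L = np.diag(A.sum(axis=1)) - A

x = np.array([0.3, 0.6, 0.3, -0.3, -0.3, -0.6])            # sums to 0, roughly unit length
print(x @ L @ x)                                            # 1.08
print(sum((x[i - 1] - x[j - 1]) ** 2 for i, j in edges))    # 1.08 as well
```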
What is
  min_x ∑_(i,j)∈E (xi − xj)²
really doing, under the constraints ∑i xi = 0 and ∑i xi² = 1?
Find sets A and B of about similar size.
Set xA > 0, xB < 0; then the value of λ2 is 2 · (#edges between A and B).
Embed the nodes of the graph on a real line so that the constraints ∑i xi = 0 and ∑i xi² = 1 are obeyed.
Say we want to minimize the cut score (#edges crossing between the two sides).
We can express a partition (A, B) as a vector x with xi = +1 for nodes in A and xi = -1 for nodes in B.
We can then minimize the cut score of the partition by finding a non-trivial vector x (xi ∈ {-1, +1}) that minimizes ∑_(i,j)∈E (xi − xj)².
Looks like our equation for λ2!
Cut(A, B) = ¼ ∑_(i,j)∈E (xi − xj)²,   xi ∈ {-1, +1}
There is a trivial solution to the cut score (put all nodes on one side). How to prevent it?
"Relax" the indicators from {-1, +1} to real numbers:
  min_x ∑_(i,j)∈E (xi − xj)²,   xi ∈ ℝ
This is an approximation to the normalized cut.
The optimal solution for x is given by the eigenvector corresponding to λ2, referred to as the Fiedler vector.
Note: this is even better than the cut score, since it will give nearly balanced partitions (since ∑i xi² = 1 and ∑i xi = 0).
To learn more: A Tutorial on Spectral Clustering by U. von Luxburg
Spectral Clustering
How to define a “good” partition of a graph?
  Minimize a given graph cut criterion
How to efficiently identify such a partition?
  Approximate using information provided by the eigenvalues and eigenvectors of a graph
Three basic stages:
1. Pre-processing
Construct a matrix representation of the graph
2. Decomposition
Compute eigenvalues and eigenvectors of the matrix
Map each point to a lower-dimensional
representation based on one or more eigenvectors
3. Grouping
Assign points to two or more clusters, based on the new representation
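The three stages fit in a few lines of NumPy; the sketch below is illustrative (the function name and the split-at-zero grouping are choices of this sketch, not the course's reference implementation).

```python
# A compact sketch of the three stages for 2-way spectral partitioning; input is an
# adjacency matrix A.
import numpy as np

def spectral_bipartition(A):
    # 1. Pre-processing: matrix representation of the graph (Laplacian)
    L = np.diag(A.sum(axis=1)) - A
    # 2. Decomposition: eigenvalues/eigenvectors; embed each node by its entry in x2
    vals, vecs = np.linalg.eigh(L)      # eigh returns eigenvalues in ascending order
    x2 = vecs[:, 1]                     # eigenvector of lambda_2 (the Fiedler vector)
    # 3. Grouping: split the 1-dimensional embedding (naive split at 0)
    return x2, np.where(x2 >= 0)[0], np.where(x2 < 0)[0]

# Usage on the 6-node example graph: expect nodes {1,2,3} vs {4,5,6} (up to sign).
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
x2, cluster_a, cluster_b = spectral_bipartition(A)
```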
Pre-processing:
Build Laplacian matrix L of the graph:

        1    2    3    4    5    6
   1    3   -1   -1    0   -1    0
   2   -1    2   -1    0    0    0
   3   -1   -1    3   -1    0    0
   4    0    0   -1    3   -1   -1
   5   -1    0    0   -1    3   -1
   6    0    0    0   -1   -1    2

Decomposition:
Find eigenvalues λ and eigenvectors X of the matrix L:

  λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0)

  X =   0.4   0.3  -0.5  -0.2  -0.4  -0.5
        0.4   0.6   0.4  -0.4   0.4   0.0
        0.4   0.3   0.1   0.6  -0.4   0.5
        0.4  -0.3   0.1   0.6   0.4  -0.5
        0.4  -0.3  -0.5  -0.2   0.4   0.5
        0.4  -0.6   0.4  -0.4  -0.4   0.0

  (columns of X are the eigenvectors; rows correspond to vertices 1-6)

Map vertices to the corresponding components of λ2 (the second column of X):

  1:  0.3
  2:  0.6
  3:  0.3
  4: -0.3
  5: -0.3
  6: -0.6
How do we now
find clusters?
Grouping:
Sort components of the reduced 1-dimensional vector
Identify clusters by splitting the sorted vector in two

How to choose a splitting point?
Naïve approaches: split at 0 (or at the mean or median value)
More expensive approaches: attempt to minimize the normalized cut criterion in 1-dim (these give a normalized cut criterion score)

Split at 0:
Cluster A: positive points  ->  A = {1, 2, 3}  (components 0.3, 0.6, 0.3)
Cluster B: negative points  ->  B = {4, 5, 6}  (components -0.3, -0.3, -0.6)
How do we partition a graph into k clusters?
Two basic approaches:
Recursive bi-partitioning [Hagen et al., ’92]
Recursively apply bi-partitioning algorithm in a
hierarchical divisive manner
Disadvantages: Inefficient, unstable
Cluster multiple eigenvectors [Shi-Malik, ’00]
Build a reduced space from multiple eigenvectors
Node i is described by its k eigenvector components (x2,i, x3,i, …, xk,i)
Use k-means to cluster the points
A preferable approach…
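A sketch of this multi-eigenvector approach is below (illustrative; it assumes scikit-learn's KMeans is available, but any k-means implementation would do, and the function name is ad hoc).

```python
# Embed node i as (x_{2,i}, ..., x_{k,i}) and cluster the embedded points with k-means.
import numpy as np
from sklearn.cluster import KMeans   # assumed available

def spectral_kway(A, k):
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)
    embedding = vecs[:, 1:k]                              # columns for x_2 ... x_k
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```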
Eigengap:
The difference between two consecutive
eigenvalues
Most stable clustering is generally given by
the value k that maximizes the eigengap
Example:
[Figure: eigenvalues (y-axis) plotted against k (x-axis, k = 1 ... 20); the largest gap is between λ1 and λ2, so choose k = 2, i.e. the k maximizing |λ2 − λ1|]
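A small sketch of this rule (illustrative; the function name is ad hoc and the input is any sorted list of eigenvalues):

```python
# Pick k as the value maximizing the eigengap |lambda_k - lambda_{k-1}|.
import numpy as np

def choose_k_by_eigengap(eigenvalues):
    vals = np.asarray(eigenvalues, dtype=float)
    gaps = np.abs(np.diff(vals))        # gaps[i] = |lambda_{i+2} - lambda_{i+1}| (1-based lambdas)
    return int(np.argmax(gaps)) + 2     # +2 converts the 0-based gap index to k

# With eigenvalues like those in the figure (a large drop after lambda_1), this returns k = 2.
```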
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
Would like to do prediction:
estimate a function f(x) so that y = f(x)
Where y can be:
Real number: Regression
Categorical: Classification
Complex object:
Ranking of items, Parse tree, etc.
Data is labeled: we have a training set (X, Y) and a test set (X', Y').
Have many pairs {(x, y)}
x … vector of real valued features
y … class ({+1, -1}, or a real number)
We will talk about the following methods:
k-Nearest Neighbor (Instance based learning)
Perceptron algorithm
Support Vector Machines
Decision trees
Main question:
How to efficiently train
(build a model/find model parameters)?
Instance based learning
Example: Nearest neighbor
Keep the whole training dataset: {(x, y)}
A query example (vector) q comes
Find closest example(s) x*
Predict y*
Can be used both for regression and
classification
Collaborative filtering is an example of a k-NN
classifier
To make Nearest Neighbor work we need 4 things:
Distance metric:
Euclidean
How many neighbors to look at?
One
Weighting function (optional):
Unused
How to fit with the local points?
Just predict the same output as the nearest neighbor
Distance metric:
Euclidean
How many neighbors to look at?
k
Weighting function (optional):
Unused
How to fit with the local points?
Just predict the average output among k nearest neighbors
[Figure: k-NN fit with k = 9]
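A minimal k-NN sketch (illustrative, not the course's code): Euclidean distance, no weighting, predict the average output of the k nearest training points; with k = 1 it reduces to plain nearest neighbor.

```python
import numpy as np

def knn_predict(X_train, y_train, q, k=9):
    dists = np.linalg.norm(X_train - q, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest examples
    return y_train[nearest].mean()                # average output of the k nearest neighbors
```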
Distance metric: Euclidean
How many neighbors to look at? All of them (!)
Weighting function: wi = exp(− d(xi, q)² / Kw)
[Figure: weight wi as a function of d(xi, q), peaked at d(xi, q) = 0]
Nearby points to query q are weighted more strongly. Kw … kernel width.
How to fit with the local points? Predict the weighted average: ∑i wi yi / ∑i wi
[Figure: fitted curves for Kw = 10, Kw = 20, Kw = 80]
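A sketch of this weighted, all-neighbors variant (illustrative; the function name is ad hoc):

```python
# w_i = exp(-d(x_i, q)^2 / K_w), prediction = sum_i w_i y_i / sum_i w_i.
import numpy as np

def kernel_regression(X_train, y_train, q, Kw=10.0):
    d2 = np.sum((X_train - q) ** 2, axis=1)   # squared Euclidean distances d(x_i, q)^2
    w = np.exp(-d2 / Kw)                      # nearby points get weights close to 1
    return np.dot(w, y_train) / np.sum(w)     # weighted average prediction
```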
Given: a set P of n points in Rd
Goal: Given a query point q
NN: Find the nearest neighbor p of q in P
Range search: Find one/all points in P within
distance r from q
[Figure: a set of points P with a query point q and its nearest neighbor p]
Main memory:
Linear scan
Tree based:
Quadtree
kd-tree
Hashing:
Locality-Sensitive Hashing
Secondary storage:
R-trees
Simplest spatial structure on Earth!
Split the space into 2^d equal subsquares
Repeat until done:
only one pixel left
only one point left
only a few points left
Variants: split only one dimension at a time (this gives kd-trees)
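A minimal 2-D quadtree construction sketch (illustrative; the class and parameter names are ad hoc, not from the slides): each node covers a square and splits into 4 equal sub-squares until only a few points are left.

```python
class QuadNode:
    def __init__(self, x0, y0, size, points, max_points=1):
        self.x0, self.y0, self.size = x0, y0, size   # lower-left corner and side length
        self.points = points
        self.children = []
        if len(points) > max_points:
            half = size / 2.0
            for dx in (0, 1):
                for dy in (0, 1):
                    cx, cy = x0 + dx * half, y0 + dy * half
                    inside = [p for p in points
                              if cx <= p[0] < cx + half and cy <= p[1] < cy + half]
                    if inside:
                        self.children.append(QuadNode(cx, cy, half, inside, max_points))
            self.points = []                          # points are kept only in the leaves

# Usage: build a quadtree over points that lie in the unit square [0, 1) x [0, 1).
root = QuadNode(0.0, 0.0, 1.0, [(0.1, 0.2), (0.15, 0.22), (0.8, 0.9), (0.5, 0.4)])
```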
Range search:
Put the root node on the stack
Repeat:
  pop the next node T from the stack
  for each child C of T:
    if C is a leaf, examine point(s) in C
    if C intersects with the ball of radius r around q, add C to the stack
Nearest neighbor:
Start range search with r = ∞
Whenever a point is found, update r
Only investigate nodes with respect to the current r
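A sketch of the stack-based range search and nearest-neighbor search described above, assuming the QuadNode class from the previous sketch (illustrative, not from the slides):

```python
import math

def _dist_to_box(node, q):
    # distance from q to the closest point of the node's square
    cx = min(max(q[0], node.x0), node.x0 + node.size)
    cy = min(max(q[1], node.y0), node.y0 + node.size)
    return math.hypot(q[0] - cx, q[1] - cy)

def range_search(root, q, r):
    found, stack = [], [root]
    while stack:
        t = stack.pop()                           # pop the next node T from the stack
        if not t.children:                        # leaf: examine its point(s)
            found += [p for p in t.points if math.hypot(p[0] - q[0], p[1] - q[1]) <= r]
        else:
            for c in t.children:                  # only keep children intersecting the ball
                if _dist_to_box(c, q) <= r:
                    stack.append(c)
    return found

def nearest_neighbor(root, q):
    best, r = None, float("inf")                  # start range search with r = infinity
    stack = [root]
    while stack:
        t = stack.pop()
        if not t.children:
            for p in t.points:
                d = math.hypot(p[0] - q[0], p[1] - q[1])
                if d < r:                         # whenever a point is found, update r
                    best, r = p, d
        else:
            for c in t.children:
                if _dist_to_box(c, q) <= r:       # only investigate nodes w.r.t. current r
                    stack.append(c)
    return best
```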
Quadtrees work great for 2 to 3
dimensions
Problems:
Empty spaces: if the points form
sparse clouds, it takes a while to
reach them
Space exponential in dimension
Time exponential in dimension, e.g.,
points on the hypercube
Example: Spam filtering
Instance space x ∈ X (|X| = n data points)
Binary feature vector x of word occurrences
d features (words + other things, d ~ 100,000)
Class y ∈ Y:
y: Spam (+1), Ham (-1)
Binary classification:
  f(x) = +1  if w1 x1 + w2 x2 + ... + wd xd ≥ θ
         -1  otherwise
Input: Vectors xi and labels yi
Goal: Find vector w = (w1, w2, ..., wd)
  Each wi is a real number
The decision boundary is linear.
[Figure: positive and negative points separated by the hyperplane w · x = θ, with the weight vector w normal to it]
Note: the threshold can be absorbed by adding a constant feature: x → (x, 1), w → (w, -θ), so the boundary becomes w · x = 0.
(very) Loose motivation: Neuron
Inputs are feature values
Each feature has a weight wi
Activation is the sum:
  f(x) = ∑i wi xi = w · x
If f(x) is:
  Positive: predict +1
  Negative: predict -1
[Figure: a neuron summing inputs x1 ... x4 with weights w1 ... w4 and thresholding at 0; spam example with word features "nigeria" and "viagra", decision boundary w · x = 0, Spam = +1, Ham = -1]
Perceptron: y’ = sign(w x)
How to find parameters w?
Start with w0 = 0
Pick training examples xt one by one (from disk)
Predict class of xt using current weights
y’ = sign(wt xt)
If y’ is correct (i.e., yt = y’)
No change: wt+1 = wt
If y’ is wrong: adjust w
  wt+1 = wt + η · yt · xt
[Figure: geometric view of the update; wt+1 is wt shifted by η · yt · xt]
η is the learning rate parameter
xt is the training example
yt is the true class label ({+1, -1})
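A training-loop sketch following the update above (illustrative; assumes a feature matrix X of shape n x d and labels y in {+1, -1}):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, epochs=10):
    w = np.zeros(X.shape[1])                      # start with w_0 = 0
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):                # pick training examples one by one
            y_pred = 1 if np.dot(w, x_t) >= 0 else -1
            if y_pred != y_t:                     # wrong prediction: adjust w
                w = w + eta * y_t * x_t
    return w
```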
Perceptron Convergence Theorem:
If there exists a set of weights that is consistent with the data
(i.e., the data is linearly separable), the perceptron
learning algorithm will converge
How long would it take to converge?
Perceptron Cycling Theorem:
If the training data is not linearly separable the
perceptron learning algorithm will eventually
repeat the same set of weights and therefore
enter an infinite loop
How to provide robustness, more
expressivity?
Separability: some setting of the parameters gets the training set perfectly correct
Convergence: If training set is
separable, perceptron will converge
(binary case)
(Training) Mistake bound:
  Number of mistakes < 1/γ²
  γ = min (w · x) / |x|
  If we scale examples to have Euclidean length 1, then γ is the minimum distance of any example to the plane.
Perceptron won’t converge here (the data is not linearly separable) – use the trick of making η smaller and smaller over time.
If more than 2 classes:
Weight vector wc for each class
Train one class vs. the rest
Example: 3-way classification y = {A, B, C}
Train 3 classifiers: wA: A vs. B,C; wB: B vs. A,C; wC: C vs. A,B
Calculate activation for each class
f(x,c) = i wc,i xi = wc x
Highest activation wins:
c = arg maxc f(x,c)
[Figure: the three weight vectors wA, wB, wC split the plane into regions where wA · x, wB · x, or wC · x is biggest]
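A one-vs-rest sketch of this scheme (illustrative; the helper names are ad hoc):

```python
# One weight vector per class; predict the class with the highest activation w_c . x.
import numpy as np

def train_one_vs_rest(X, y, classes, eta=1.0, epochs=10):
    W = {c: np.zeros(X.shape[1]) for c in classes}
    for c in classes:
        y_c = np.where(y == c, 1, -1)                 # class c vs. the rest
        for _ in range(epochs):
            for x_t, y_t in zip(X, y_c):
                if y_t * np.dot(W[c], x_t) <= 0:      # mistake (or exactly on the boundary)
                    W[c] = W[c] + eta * y_t * x_t
    return W

def predict_multiclass(W, x):
    return max(W, key=lambda c: np.dot(W[c], x))      # highest activation wins
```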
Overfitting
Regularization: if the data is not separable, the weights dance around
Mediocre generalization: finds a “barely” separating solution
Winnow algorithm
Similar to perceptron, just different updates
Initialize: θ = n; wi = 1
Prediction is 1 iff w · x ≥ θ
If no mistake: do nothing
If f(x) = 1 but w · x < θ:  wi ← 2 wi   (if xi = 1)  (promotion)
If f(x) = 0 but w · x ≥ θ:  wi ← wi / 2 (if xi = 1)  (demotion)
x … binary feature vector
w … weights (can never get negative!)
Learns linear threshold functions
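A sketch of the update rule above (illustrative; assumes binary feature rows X in {0,1}^d, labels y in {0,1}, and the common choice θ = d, i.e. the slide's n):

```python
import numpy as np

def train_winnow(X, y, epochs=10):
    d = X.shape[1]
    w, theta = np.ones(d), float(d)
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            pred = 1 if np.dot(w, x_t) >= theta else 0
            if pred == y_t:
                continue                          # no mistake: do nothing
            if y_t == 1:
                w[x_t == 1] *= 2.0                # promotion
            else:
                w[x_t == 1] /= 2.0                # demotion
    return w, theta
```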
Algorithm learns monotone functions
For the general case:
Duplicate variables:
To negate variable xi, introduce a new variable xi’ = -xi
Learn monotone functions over 2n variables
This gives us the Balanced Winnow:
Keep two weights for each variable;
effective weight is the difference
Update Rule:
If f(x) = 1 but (w⁺ − w⁻) · x < θ:  wi⁺ ← 2 wi⁺,  wi⁻ ← ½ wi⁻   where xi = 1  (promotion)
If f(x) = 0 but (w⁺ − w⁻) · x ≥ θ:  wi⁺ ← ½ wi⁺,  wi⁻ ← 2 wi⁻   where xi = 1  (demotion)
Thick Separator (aka Perceptron with Margin)
(Applies both for Perceptron and Winnow)
  Promote if: w · x > θ + γ
  Demote if:  w · x < θ − γ
[Figure: data points with the separating hyperplane and a margin band of width γ on each side]
Note: γ is a functional margin. Its effect could disappear as w grows.
Nevertheless, this has been shown to be a very effective algorithmic addition.
Examples: x ∈ {0,1}^d;  Hypothesis: w ∈ R^d
Prediction is 1 iff w · x ≥ θ

Additive weight update algorithm [Perceptron, Rosenblatt, 1958]
  w ← w + η yj xj
  If Class = 1 but w · x < θ:  wi ← wi + 1  (if xi = 1)  (promotion)
  If Class = 0 but w · x ≥ θ:  wi ← wi − 1  (if xi = 1)  (demotion)

Multiplicative weight update algorithm [Winnow, Littlestone, 1988]
  w ← w · exp{η yj xj}
  If Class = 1 but w · x < θ:  wi ← 2 wi    (if xi = 1)  (promotion)
  If Class = 0 but w · x ≥ θ:  wi ← wi / 2  (if xi = 1)  (demotion)
• Perceptron
  – Online: can adjust to changing target, over time
  – Advantages: Simple; Guaranteed to learn a linearly separable problem
  – Limitations: only linear separations; only converges for linearly separable data; not really “efficient with many features”

• Winnow
  – Online: can adjust to changing target, over time
  – Advantages: Simple; Guaranteed to learn a linearly separable problem; Suitable for problems with many irrelevant attributes
  – Limitations: only linear separations; only converges for linearly separable data; not really “efficient with many features”