Transcript Slide 1

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu

Networks of tightly
connected groups

Network communities:
 Sets of nodes with lots of connections inside and few to the outside (the rest of the network)
 Also known as: communities, clusters, groups, modules

Laplacian matrix (L):
 n × n symmetric matrix

[Figure: example graph on nodes 1–6 with edges 1–2, 1–3, 1–5, 2–3, 3–4, 4–5, 4–6, 5–6]

What is the trivial eigenvector, eigenvalue?
L = D − A =

       1   2   3   4   5   6
  1    3  -1  -1   0  -1   0
  2   -1   2  -1   0   0   0
  3   -1  -1   3  -1   0   0
  4    0   0  -1   3  -1  -1
  5   -1   0   0  -1   3  -1
  6    0   0   0  -1  -1   2

 Trivial eigenpair: x = (1, …, 1) with λ = 0
 Eigenvalues are non-negative real numbers
Now the question is: what is λ₂ doing?
 We will see that the eigenvector that corresponds to λ₂ basically does community detection
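A minimal numpy sketch (numpy assumed available) that builds L = D − A for the example graph above and checks the trivial eigenpair:

import numpy as np

# Edges of the example 6-node graph (read off the Laplacian above)
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6

A = np.zeros((n, n))                     # adjacency matrix
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1

D = np.diag(A.sum(axis=1))               # degree matrix
L = D - A                                # Laplacian

print(L)                                 # matches the matrix above
print(L @ np.ones(n))                    # all zeros: x = (1, ..., 1) has eigenvalue 0
print(np.linalg.eigvalsh(L))             # eigenvalues are non-negative (up to round-off)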

For a symmetric matrix M:

 λ₂ = minₓ (xᵀ M x) / (xᵀ x)

 x is a unit vector: Σᵢ xᵢ² = 1
 x is orthogonal to the 1st eigenvector: Σᵢ xᵢ = 0

What is the meaning of min xᵀLx on G?
 xᵀ · L · x = Σ_{(i,j)∈E} (xᵢ − xⱼ)²

[Figure: the same example graph on nodes 1–6]

 Think of xᵢ as a numeric value of node i.
 Set xᵢ to minimize Σ_{(i,j)∈E} (xᵢ − xⱼ)² while Σᵢ xᵢ² = 1 and Σᵢ xᵢ = 0.
 This means some xᵢ > 0 and some xᵢ < 0.
 Set the values xᵢ such that they don't differ across the edges.

Constraints: Σᵢ xᵢ = 0 and Σᵢ xᵢ² = 1

What is minₓ Σ_{(i,j)∈E} (xᵢ − xⱼ)² really doing?
 Find sets A and B of about similar size.
 Set x_A > 0, x_B < 0; then the value of λ₂ is 2·(#edges A—B)
 Embed the nodes of the graph on a real line so that the constraints Σᵢ xᵢ = 0 and Σᵢ xᵢ² = 1 are obeyed


Say we want to minimize the cut score (#edges crossing between A and B).
We can express the partition (A, B) as a vector x:
[Figure: the example graph split into sets A and B]
We can minimize the cut score of the partition by finding a non-trivial vector x (xᵢ ∈ {−1, +1}) that minimizes Σ_{(i,j)∈E} (xᵢ − xⱼ)².
Looks like our equation for λ₂!

Trivial solution to the cut score — how to prevent it?
Approximation to the normalized cut.


Cut(A, B) = ¼ Σ_{(i,j)∈E} (xᵢ − xⱼ)²,  with xᵢ ∈ {−1, +1}

 "Relax" the indicators from {−1, +1} to real numbers:
  minₓ Σ_{(i,j)∈E} (xᵢ − xⱼ)²,  xᵢ ∈ ℝ

 The optimal solution for x is given by the eigenvector corresponding to λ₂, referred to as the Fiedler vector
  Note: this is even better than the cut score, since it will give nearly balanced partitions (since Σᵢ xᵢ² = 1, Σᵢ xᵢ = 0)

To learn more: A Tutorial on Spectral Clustering by U. von Luxburg
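To see why the ¼ factor gives exactly the cut size, note that with xᵢ ∈ {−1, +1} each edge contributes 4 to the sum if it crosses the cut and 0 otherwise; in LaTeX:

% Edge inside A or inside B: x_i = x_j, so (x_i - x_j)^2 = 0.
% Edge crossing the cut: x_i = -x_j, so (x_i - x_j)^2 = 4.
\mathrm{Cut}(A,B)
  = \#\{(i,j)\in E : x_i \neq x_j\}
  = \tfrac{1}{4}\sum_{(i,j)\in E} (x_i - x_j)^2
  = \tfrac{1}{4}\, x^{\top} L\, x .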

How to define a "good" partition of a graph?
 Minimize a given graph cut criterion
How to efficiently identify such a partition?
 Approximate using information provided by the eigenvalues and eigenvectors of a graph
⇒ Spectral Clustering
Three basic stages:
1. Pre-processing
 Construct a matrix representation of the graph
2. Decomposition
 Compute eigenvalues and eigenvectors of the matrix
 Map each point to a lower-dimensional representation based on one or more eigenvectors
3. Grouping
 Assign points to two or more clusters, based on the new representation

Pre-processing:
 Build the Laplacian matrix L of the graph:

       1   2   3   4   5   6
  1    3  -1  -1   0  -1   0
  2   -1   2  -1   0   0   0
  3   -1  -1   3  -1   0   0
  4    0   0  -1   3  -1  -1
  5   -1   0   0  -1   3  -1
  6    0   0   0  -1  -1   2

Decomposition:
 Find eigenvalues λ and eigenvectors x of the matrix L
  Λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0)
  [Matrix X of eigenvectors shown on the slide; its first column is the constant eigenvector (all components 0.4) and its second column x₂ is listed below]
 Map vertices to the corresponding components of λ₂:

  node:  1     2     3     4     5     6
  x₂:    0.3   0.6   0.3  -0.3  -0.3  -0.6

How do we now find clusters?
Note: give the normalized cut criterion score.

Grouping:
 Sort components of the reduced 1-dimensional vector
 Identify clusters by splitting the sorted vector in two
 How to choose a splitting point?
  Naïve approaches: split at 0 (or at the mean or median value)
  More expensive approaches: attempt to minimize the normalized cut criterion in 1 dimension

Split at 0:
 Cluster A: positive points
 Cluster B: negative points

  node:  1     2     3     4     5     6
  x₂:    0.3   0.6   0.3  -0.3  -0.3  -0.6

  A = {1, 2, 3}  (x₂ > 0)
  B = {4, 5, 6}  (x₂ < 0)
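A compact numpy sketch of the whole two-way pipeline on this example (pre-processing, decomposition, grouping by splitting the second eigenvector at 0); numpy is assumed available and the node numbering follows the slides:

import numpy as np

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6

# 1. Pre-processing: Laplacian L = D - A
A = np.zeros((n, n))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
L = np.diag(A.sum(axis=1)) - A

# 2. Decomposition: eigh returns eigenvalues in ascending order
vals, vecs = np.linalg.eigh(L)
x2 = vecs[:, 1]                       # eigenvector of lambda_2 (the Fiedler vector)

# 3. Grouping: split the components of x2 at 0
cluster_A = [i + 1 for i in range(n) if x2[i] > 0]
cluster_B = [i + 1 for i in range(n) if x2[i] <= 0]
print(vals[1], cluster_A, cluster_B)  # lambda_2 = 1.0; {1,2,3} vs. {4,5,6} (the sign of x2 is arbitrary)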


How do we partition a graph into k clusters?
Two basic approaches:
 Recursive bi-partitioning [Hagen et al., '92]
  Recursively apply the bi-partitioning algorithm in a hierarchical divisive manner
  Disadvantages: inefficient, unstable
 Cluster multiple eigenvectors [Shi-Malik, '00]
  Build a reduced space from multiple eigenvectors
  Node i is described by its k eigenvector components (x2,i, x3,i, …, xk,i)
  Use k-means to cluster the points (a sketch follows below)
  A preferable approach…

Note: do this on a real Laplacian – here the lambdas are greater than 1!
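A hedged sketch of the multiple-eigenvector approach (numpy and scikit-learn's KMeans are assumed available; this simplified version uses the unnormalized Laplacian, whereas Shi-Malik actually work with the normalized cut / generalized eigenproblem):

import numpy as np
from sklearn.cluster import KMeans

def spectral_k_clusters(A, k):
    """Cluster a graph with adjacency matrix A into k clusters (simplified sketch)."""
    L = np.diag(A.sum(axis=1)) - A           # unnormalized Laplacian
    vals, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    X = vecs[:, 1:k]                         # node i -> (x2_i, ..., xk_i)
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)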

Eigengap:
 The difference between two consecutive eigenvalues, Δk = |λk − λk−1|
 The most stable clustering is generally given by the value k that maximizes the eigengap

Example:
 [Plot: eigenvalues λk for k = 1 … 20; the largest gap is between λ1 and λ2]
 max Δk = |λ2 − λ1|  ⇒  choose k = 2
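A small numpy snippet for picking k by the largest eigengap (a sketch; the eigenvalues argument is assumed to be the spectrum sorted in the order plotted above, and the toy numbers are illustrative):

import numpy as np

def choose_k_by_eigengap(eigenvalues):
    """Pick the k that maximizes the gap between consecutive eigenvalues."""
    lam = np.asarray(eigenvalues, dtype=float)
    gaps = np.abs(np.diff(lam))        # gaps[0] = |lambda_2 - lambda_1|, etc.
    return int(np.argmax(gaps)) + 2    # the gap between lambda_1 and lambda_2 corresponds to k = 2

# Shaped like the plot above: a big drop after the first eigenvalue -> k = 2
print(choose_k_by_eigengap([50, 22, 20, 19, 18.5, 18]))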
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu

Would like to do prediction:
 estimate a function f(x) so that y = f(x)

Where y can be:
 Real number: Regression
 Categorical: Classification
 Complex object: Ranking of items, parse tree, etc.

Data is labeled:
 Have many pairs {(x, y)}
  x … vector of real-valued features
  y … class ({+1, -1}, or a real number)
 [Figure: data matrix (X, Y) split into a training set and a test set (X', Y')]

We will talk about the following methods:
 k-Nearest Neighbor (instance based learning)
 Perceptron algorithm
 Support Vector Machines
 Decision trees

Main question:
 How to efficiently train (build a model / find model parameters)?


Instance based learning
Example: Nearest neighbor
 Keep the whole training dataset: {(x, y)}
 A query example (vector) q comes
 Find the closest example(s) x*
 Predict y*

Can be used both for regression and classification
 Collaborative filtering is an example of a k-NN classifier

To make Nearest Neighbor work we need 4 things:
 Distance metric: Euclidean
 How many neighbors to look at? One
 Weighting function (optional): unused
 How to fit with the local points? Just predict the same output as the nearest neighbor

Distance metric: Euclidean

How many neighbors to look at? k (e.g., k = 9)

Weighting function (optional): unused

How to fit with the local points? Just predict the average output among the k nearest neighbors (a sketch follows below)
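A minimal numpy sketch of k-NN prediction under these choices (Euclidean distance, unweighted average of the k nearest training outputs); the function name and toy data are illustrative:

import numpy as np

def knn_predict(X_train, y_train, q, k=9):
    """Predict the output for query q as the average y of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - q, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest examples
    return y_train[nearest].mean()                # k = 1 gives plain nearest neighbor

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_predict(X_train, y_train, np.array([1.6]), k=2))   # average of y at x = 1 and x = 2 -> 2.5

For classification one would take a majority vote over the k neighbors instead of the mean.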

Distance metric: Euclidean

How many neighbors to look at? All of them (!)

Weighting function:
 wi = exp(− d(xi, q)² / Kw)
 Nearby points to the query q are weighted more strongly. Kw … kernel width.
 [Plot: wi as a function of d(xi, q), peaking at d(xi, q) = 0, shown for Kw = 10, 20, 80]

How to fit with the local points?
 Predict the weighted average: Σi wi yi / Σi wi
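A small numpy sketch of this kernel-weighted ("all neighbors") prediction; Kw is the kernel width from the slide, everything else is illustrative:

import numpy as np

def kernel_regression(X_train, y_train, q, Kw=20.0):
    """Weighted average over all training points, w_i = exp(-d(x_i, q)^2 / Kw)."""
    d2 = np.sum((X_train - q) ** 2, axis=1)   # squared Euclidean distances d(x_i, q)^2
    w = np.exp(-d2 / Kw)                      # nearby points get weights close to 1
    return np.sum(w * y_train) / np.sum(w)    # prediction: sum_i w_i y_i / sum_i w_i

# Larger Kw gives smoother predictions (distant points keep noticeable weight)
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(kernel_regression(X_train, y_train, np.array([1.5]), Kw=0.5))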


Given: a set P of n points in Rᵈ
Goal: given a query point q
 NN: find the nearest neighbor p of q in P
 Range search: find one/all points in P within distance r from q
 [Figure: a query point q and its nearest neighbor p in P]

Main memory:
 Linear scan
 Tree based:
  Quadtree
  kd-tree
 Hashing:
  Locality-Sensitive Hashing

Secondary storage:
 R-trees
Skip



Simplest spatial structure on Earth!
Split the space into 2^d equal subsquares
Repeat until done:
 only one pixel left
 only one point left
 only a few points left

Variants:
 split only one dimension at a time → kd-trees
skip

Range search:
 Put the root node on the stack
 Repeat:
  pop the next node T from the stack
  for each child C of T:
   if C is a leaf, examine the point(s) in C
   if C intersects with the ball of radius r around q, add C to the stack

Nearest neighbor:
 Start the range search with r = ∞
 Whenever a point is found, update r
 Only investigate nodes with respect to the current r

(A sketch of the range search follows below.)
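A hedged, 2-D-only sketch of a quadtree and the stack-based range search described above (class and function names are illustrative; boundary handling and the leaf capacity are simplified):

import math

class QuadNode:
    """Quadtree node over the square [x, x+size) x [y, y+size) (2-D only, illustrative)."""
    def __init__(self, x, y, size, points, leaf_capacity=1):
        self.x, self.y, self.size = x, y, size
        self.points = points
        self.children = []
        # Split into 4 equal subsquares until only a few points (or a tiny square) remain
        if len(points) > leaf_capacity and size > 1e-9:
            half = size / 2.0
            for dx in (0.0, half):
                for dy in (0.0, half):
                    sub = [p for p in points
                           if x + dx <= p[0] < x + dx + half
                           and y + dy <= p[1] < y + dy + half]
                    if sub:
                        self.children.append(
                            QuadNode(x + dx, y + dy, half, sub, leaf_capacity))

    def is_leaf(self):
        return not self.children

    def intersects_ball(self, q, r):
        # Distance from q to the closest point of this square, compared with r
        cx = min(max(q[0], self.x), self.x + self.size)
        cy = min(max(q[1], self.y), self.y + self.size)
        return math.hypot(q[0] - cx, q[1] - cy) <= r

def range_search(root, q, r):
    """Return all points within distance r of q, using the stack scheme from the slide."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.is_leaf():
            found.extend(p for p in node.points
                         if math.hypot(p[0] - q[0], p[1] - q[1]) <= r)
        else:
            # Only descend into children whose square intersects the ball of radius r around q
            stack.extend(c for c in node.children if c.intersects_ball(q, r))
    return found

# Illustrative usage: points in the unit square
pts = [(0.1, 0.1), (0.2, 0.15), (0.8, 0.9), (0.5, 0.5)]
root = QuadNode(0.0, 0.0, 1.0, pts)
print(range_search(root, (0.15, 0.1), 0.1))   # -> (0.1, 0.1) and (0.2, 0.15), in some order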
Skip!


Quadtrees work great for 2 to 3 dimensions
Problems:
 Empty spaces: if the points form sparse clouds, it takes a while to reach them
 Space exponential in dimension
 Time exponential in dimension, e.g., points on the hypercube

Example: Spam filtering

Instance space x ∈ X (|X| = n data points)
 Binary feature vector x of word occurrences
 d features (words + other things, d ~ 100,000)

Class y ∈ Y:
 y: Spam (+1), Ham (-1)

Binary classification:
 f(x) = +1 if w1 x1 + w2 x2 + … + wd xd ≥ θ
        -1 otherwise

Input: vectors xi and labels yi
Goal: find the vector w = (w1, w2, …, wd)
 Each wi is a real number
 The decision boundary is linear
 [Figure: positive and negative points separated by the hyperplane w·x = θ, with normal vector w; w·x = 0 passes through the origin]

Note (folding the threshold into the weights): x → (x, 1), w → (w, −θ)



(Very) loose motivation: Neuron
 Inputs are feature values
 Each feature has a weight wi
 Activation is the sum:
  f(x) = Σi wi xi = w · x
 If f(x) is:
  Positive: predict +1
  Negative: predict -1

[Figure: a neuron summing weighted inputs w1 x1 + … + w4 x4 and testing "w·x ≥ 0?"; spam example with features such as "viagra" and "nigeria", classes Spam = +1 and Ham = -1, separated by the hyperplane w·x = 0]


Perceptron: y' = sign(w · x)
How to find parameters w?
 Start with w0 = 0
 Pick training examples xt one by one (from disk)
 Predict the class of xt using the current weights: y' = sign(wt · xt)
 If y' is correct (i.e., yt = y'): no change, wt+1 = wt
 If y' is wrong: adjust w
  wt+1 = wt + η · yt · xt
   η is the learning rate parameter
   xt is the training example
   yt is the true class label ({+1, -1})
 [Figure: a mistake on x rotates wt toward yt·xt, giving wt+1]
(A sketch of the training loop follows below.)
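A minimal numpy sketch of this training loop (η and the number of passes are illustrative choices; the threshold is assumed folded into w via a constant feature, as noted earlier):

import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=10):
    """Online perceptron: update w only on mistakes, w <- w + eta * y_t * x_t."""
    w = np.zeros(X.shape[1])                  # start with w_0 = 0
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            y_pred = 1 if w @ x_t >= 0 else -1
            if y_pred != y_t:                 # mistake: adjust w
                w = w + eta * y_t * x_t
    return w

# Toy usage: the last feature is the constant 1 (threshold folded into w)
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(X @ w))                         # matches y on this separable toy set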
Optimize – join
with the next slide

Perceptron Convergence Theorem:
 If there exists a set of weights that is consistent (i.e., the data is linearly separable), the perceptron learning algorithm will converge

How long would it take to converge?

Perceptron Cycling Theorem:
 If the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop

How to provide robustness, more expressivity?

Separability: some parameters get the training set perfectly correct

Convergence: if the training set is separable, the perceptron will converge (binary case)

(Training) Mistake bound: the number of mistakes is < 1/γ²
 γ = min (w · x) / |x|
 If we scale examples to have Euclidean length 1, then γ is the minimum distance of any example to the plane

Note: the perceptron won't converge here (non-separable data) – use the trick of making η smaller and smaller.

If more than 2 classes:
 Weight vector wc for each class
 Train one class vs. the rest
  Example: 3-way classification y = {A, B, C}
  Train 3 classifiers: wA: A vs. B,C;  wB: B vs. A,C;  wC: C vs. A,B
 Calculate the activation for each class:
  f(x, c) = Σi wc,i xi = wc · x
 Highest activation wins:
  c = arg maxc f(x, c)
 [Figure: the plane divided into three regions where wA·x, wB·x, or wC·x is biggest]
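A short numpy sketch of this one-vs-rest scheme, reusing the perceptron trainer sketched earlier (names are illustrative; any binary trainer that returns a weight vector would do):

import numpy as np

def train_one_vs_rest(X, y, classes, binary_train):
    """Train one weight vector w_c per class: class c vs. the rest."""
    return {c: binary_train(X, np.where(y == c, 1, -1)) for c in classes}

def predict_multiclass(W, x):
    """Highest activation wins: c = argmax_c w_c . x"""
    return max(W, key=lambda c: W[c] @ x)

# Usage, assuming perceptron_train from the earlier sketch:
#   W = train_one_vs_rest(X, y, classes=['A', 'B', 'C'], binary_train=perceptron_train)
#   label = predict_multiclass(W, x_new)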

Overfitting

Regularization: if the data is not separable, the weights dance around

Mediocre generalization:
 Finds a "barely" separating solution

Winnow algorithm
 Similar to the perceptron, just different updates

 Initialize: θ = n; wi = 1
 Prediction is 1 iff w · x ≥ θ
 If no mistake: do nothing
 If f(x) = 1 but w · x < θ:  wi ← 2 wi   (if xi = 1)  (promotion)
 If f(x) = 0 but w · x ≥ θ:  wi ← wi / 2  (if xi = 1)  (demotion)

 x … binary feature vector
 w … weights (can never become negative!)
 Learns linear threshold functions

(A sketch follows below.)
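A small numpy sketch of Winnow as described above (binary features; θ = n, weights start at 1 and are doubled or halved on mistakes); the function name and toy data are illustrative:

import numpy as np

def winnow_train(X, y, epochs=10):
    """Winnow: multiplicative updates on mistakes. X is binary {0,1}, y is {0,1}."""
    n = X.shape[1]
    w, theta = np.ones(n), float(n)           # initialize: w_i = 1, threshold theta = n
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            pred = 1 if w @ x_t >= theta else 0
            if y_t == 1 and pred == 0:        # promotion: double weights of active features
                w[x_t == 1] *= 2.0
            elif y_t == 0 and pred == 1:      # demotion: halve weights of active features
                w[x_t == 1] /= 2.0
    return w, theta

# Toy usage: the label is simply feature 0 (the other features are irrelevant)
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
w, theta = winnow_train(X, y)
print((X @ w >= theta).astype(int))           # matches y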


Algorithm learns monotone functions
For the general case:
 Duplicate variables:
  To negate variable xi, introduce a new variable xi' = -xi
  Learn monotone functions over 2n variables
 This gives us the Balanced Winnow:
  Keep two weights w⁺, w⁻ for each variable; the effective weight is the difference

 Update rule:
  If f(x) = 1 but (w⁺ − w⁻) · x ≤ θ:  wi⁺ ← 2 wi⁺,  wi⁻ ← ½ wi⁻  where xi = 1  (promotion)
  If f(x) = 0 but (w⁺ − w⁻) · x ≥ θ:  wi⁺ ← ½ wi⁺,  wi⁻ ← 2 wi⁻  where xi = 1  (demotion)
• Thick Separator (aka Perceptron with Margin)
 (Applies both for Perceptron and Winnow)
 – Promote if: w · x > θ + γ
 – Demote if: w · x < θ − γ
 [Figure: separating hyperplane w·x = θ with a thick margin band; w·x = 0 and the negative points shown]

Note: γ is a functional margin. Its effect could disappear as w grows.
Nevertheless, this has been shown to be a very effective algorithmic addition.
Examples: x ∈ {0,1}^d;  Hypothesis: w ∈ R^d
Prediction is 1 iff w · x ≥ θ

Additive weight update algorithm [Perceptron, Rosenblatt, 1958]
 w ← w + ηi yj xj
 If Class = 1 but w · x < θ:  wi ← wi + 1  (if xi = 1)  (promotion)
 If Class = 0 but w · x ≥ θ:  wi ← wi − 1  (if xi = 1)  (demotion)

Multiplicative weight update algorithm [Winnow, Littlestone, 1988]
 w ← w · exp{ηi yj xj}
 If Class = 1 but w · x < θ:  wi ← 2 wi   (if xi = 1)  (promotion)
 If Class = 0 but w · x ≥ θ:  wi ← wi / 2  (if xi = 1)  (demotion)
• Perceptron
 • Online: can adjust to a changing target over time
 • Advantages:
  – Simple
  – Guaranteed to learn a linearly separable problem
 • Limitations:
  – only linear separations
  – only converges for linearly separable data
  – not really "efficient with many features"

• Winnow
 • Online: can adjust to a changing target over time
 • Advantages:
  – Simple
  – Guaranteed to learn a linearly separable problem
  – Suitable for problems with many irrelevant attributes
 • Limitations:
  – only linear separations
  – only converges for linearly separable data
  – not really "efficient with many features"