An analysis of single-layer networks in unsupervised feature learning


1
AN ANALYSIS OF SINGLE-LAYER NETWORKS IN UNSUPERVISED FEATURE LEARNING
[1]
Yani Chen
10/14/2014
Outline
2
Introduction
Framework for feature learning
Unsupervised feature learning algorithms
Effect of some parameters
Experiments and analysis on the results
Introduction
3
1. Much prior work has focused on complex unsupervised feature learning algorithms.
2. Simple factors, such as the number of hidden nodes, may be more important for achieving high performance than the learning algorithm or the depth of the model.
3. Even a single-layer network can achieve very good feature learning results.
Unsupervised feature learning framework
4
1>. extract random patches from unlabeled training images (images are used here as the example input)
2>. apply a pre-processing stage to the patches
3>. learn a feature mapping using an unsupervised feature learning algorithm
4>. extract features from equally spaced sub-patches covering the input images
5>. pool features together to reduce the number of feature values
6>. train a linear classifier to predict the labels given the feature vectors (a minimal sketch of the whole pipeline follows below)
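To make these six steps concrete, here is a minimal end-to-end sketch in Python (NumPy + scikit-learn). It only illustrates the shape of the pipeline, not the paper's exact configuration: the toy data, the 6x6 patches, 50 centroids, stride, quadrant sum-pooling, and the choice of k-means with the soft ("triangle") activation are assumptions made for brevity, and whitening is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
images = rng.random((500, 32, 32))        # toy stand-in for 32x32 images
labels = rng.integers(0, 10, size=500)    # toy labels for the supervised step
w, K, stride = 6, 50, 2                   # receptive field, #centroids, stride (assumed)

# 1> extract random w-by-w patches from the (unlabeled) images
def random_patches(imgs, n):
    idx = rng.integers(0, len(imgs), n)
    rows = rng.integers(0, imgs.shape[1] - w, n)
    cols = rng.integers(0, imgs.shape[2] - w, n)
    return np.stack([imgs[i, r:r + w, c:c + w].ravel()
                     for i, r, c in zip(idx, rows, cols)])

# 2> pre-process: per-patch brightness/contrast normalization (whitening omitted here)
def normalize(P):
    return (P - P.mean(axis=1, keepdims=True)) / (P.std(axis=1, keepdims=True) + 1e-8)

patches = normalize(random_patches(images, 10000))

# 3> learn the feature mapping -- here, K-means centroids
km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(patches)

# 4> + 5> extract features at equally spaced sub-patches, then sum-pool over quadrants
def extract(img):
    feats = []
    for r in range(0, img.shape[0] - w + 1, stride):
        for c in range(0, img.shape[1] - w + 1, stride):
            p = normalize(img[r:r + w, c:c + w].ravel()[None, :])[0]
            z = np.linalg.norm(km.cluster_centers_ - p, axis=1)
            feats.append(np.maximum(0.0, z.mean() - z))   # soft ("triangle") activation
    g = int(np.sqrt(len(feats)))
    grid = np.array(feats).reshape(g, g, K)
    h = g // 2
    quads = [grid[:h, :h], grid[:h, h:], grid[h:, :h], grid[h:, h:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quads])

X = np.stack([extract(im) for im in images])

# 6> train a linear (L2) SVM on the pooled feature vectors
clf = LinearSVC(C=1.0).fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```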
Unsupervised learning algorithms
5
1. Sparse autoencoder
2. Sparse restricted Boltzmann machine
3. K-means clustering
4. Gaussian mixture models clustering
Sparse auto-encoder
6

Objective function (minimize):

J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|_2^2 + \frac{\lambda}{2} \sum_{l} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W_{ji}^{(l)} \right)^2 + \beta \sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)

Feature mapping function:

f(x) = g(Wx + b), where g(z) = 1 / (1 + \exp(-z))
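As a concrete reading of this objective, the NumPy sketch below evaluates J(W, b) for a one-hidden-layer autoencoder where the target is the input itself (y^{(i)} = x^{(i)}); the layer sizes and the values of lambda, beta, and rho are illustrative assumptions, not values from the paper.

```python
import numpy as np

def kl(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

def sparse_autoencoder_cost(X, W1, b1, W2, b2, lam=1e-4, beta=3.0, rho=0.05):
    """J(W, b) for inputs X (m x n); the autoencoder's target y is X itself."""
    g = lambda z: 1.0 / (1.0 + np.exp(-z))           # logistic feature mapping g(z)
    A1 = g(X @ W1.T + b1)                            # hidden activations f(x)
    Xhat = g(A1 @ W2.T + b2)                         # reconstruction h_{W,b}(x)
    m = X.shape[0]
    recon = (0.5 / m) * np.sum((Xhat - X) ** 2)      # average squared reconstruction error
    decay = (lam / 2) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))   # weight-decay term
    rho_hat = A1.mean(axis=0)                        # mean activation of each hidden unit
    sparsity = beta * np.sum(kl(rho, rho_hat))       # sparsity penalty
    return recon + decay + sparsity

# toy usage: random 6x6 patches, 25 hidden units (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.random((100, 36))
W1, b1 = 0.01 * rng.standard_normal((25, 36)), np.zeros(25)
W2, b2 = 0.01 * rng.standard_normal((36, 25)), np.zeros(36)
print(sparse_autoencoder_cost(X, W1, b1, W2, b2))
```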
Sparse restricted Boltzmann machine
7

Energy function of an RBM:

E(x, h) = -b^\top x - a^\top h - h^\top W x

The same type of sparsity penalty as in the sparse autoencoder can be added.
Sparse RBMs can be trained using a contrastive divergence approximation.

Feature mapping function:

f(x) = g(Wx + b), where g(z) = 1 / (1 + \exp(-z))   [7]
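A rough sketch of one CD-1 (contrastive divergence) update for a binary RBM is shown below; the sparsity penalty is omitted, and the sizes, learning rate, and number of updates are placeholder assumptions rather than settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, W, a, b, lr=0.01):
    """One CD-1 step for a binary RBM with energy E(x,h) = -b'x - a'h - h'Wx.
    X: batch of visible vectors (m x n_visible); a, b are hidden/visible biases."""
    # positive phase: hidden probabilities given the data
    ph = sigmoid(X @ W.T + a)                        # m x n_hidden
    h = (rng.random(ph.shape) < ph).astype(float)    # sample hidden units
    # negative phase: one Gibbs step down to the visibles and back up
    pv = sigmoid(h @ W + b)                          # reconstruction probabilities
    ph2 = sigmoid(pv @ W.T + a)
    # gradient approximation: <h x'>_data - <h x'>_reconstruction
    m = X.shape[0]
    W = W + lr * (ph.T @ X - ph2.T @ pv) / m
    a = a + lr * (ph - ph2).mean(axis=0)
    b = b + lr * (X - pv).mean(axis=0)
    return W, a, b

# toy usage: 36-dimensional binary inputs, 25 hidden units
X = (rng.random((100, 36)) < 0.3).astype(float)
W, a, b = 0.01 * rng.standard_normal((25, 36)), np.zeros(25), np.zeros(36)
for _ in range(10):
    W, a, b = cd1_update(X, W, a, b)
features = sigmoid(X @ W.T + a)   # feature mapping via the hidden-unit activations
```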
K-means clustering
8

Objective function for learning K centroids:

\arg\min_{S} \sum_{i=1}^{K} \sum_{x \in S_i} \| x - c_i \|_2^2

where S = \{S_1, S_2, \ldots, S_K\} are the cluster sets and c_i is the centroid of cluster S_i.

Feature mapping function:

1> hard-assignment:
f_k(x) = 1 if k = \arg\min_j \| x - c_j \|_2^2, and 0 otherwise

2> soft-assignment ("triangle"):
f_k(x) = \max(0, \mu(z) - z_k), where z_k = \| x - c_k \|_2 and \mu(z) is the mean of the z_k over k
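Both feature mappings are easy to state in code. The sketch below assumes some already-learned centroids (random placeholders here) and shows the hard one-hot encoding and the soft "triangle" encoding side by side.

```python
import numpy as np

def kmeans_features(x, centroids, soft=True):
    """Map a patch x (n,) to K feature values given K centroids (K x n)."""
    z = np.linalg.norm(centroids - x, axis=1)        # z_k = ||x - c_k||_2
    if soft:
        # "triangle" / soft assignment: f_k(x) = max(0, mean(z) - z_k)
        return np.maximum(0.0, z.mean() - z)
    # hard assignment: one-hot vector for the nearest centroid
    f = np.zeros(len(centroids))
    f[np.argmin(z)] = 1.0
    return f

# toy usage with placeholder centroids (e.g. learned from 6x6 patches)
rng = np.random.default_rng(0)
centroids = rng.random((50, 36))
x = rng.random(36)
print(kmeans_features(x, centroids, soft=False).sum())    # 1.0 (one-hot)
print(kmeans_features(x, centroids, soft=True)[:5])       # sparse non-negative values
```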
GMM clustering
9

Gaussian mixture models:
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
GMM (Gaussian mixture models)
10

J = \prod_n P(x_n)   (the overall likelihood of the model)

P(x_n) = \sum_k f_k(x_n \mid c_k, \Sigma_k)\, P(k)

f_k(x \mid c_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - c_k)^\top \Sigma_k^{-1} (x - c_k) \right)

where d is the dimension of x, k = 1, \ldots, K indexes the Gaussians, c_k is the mean of the k-th Gaussian, \Sigma_k is the covariance matrix of the k-th Gaussian, and P(k \mid x_n) gives the posterior membership probabilities of x_n.
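As a worked example of these formulas, the sketch below computes the component densities f_k, the per-point likelihoods P(x_n), and the log of the overall likelihood J for a toy mixture; the means, covariances, and priors are placeholder assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_likelihood(X, means, covs, priors):
    """P(x_n) = sum_k f_k(x_n | c_k, Sigma_k) P(k) for each row x_n of X,
    plus the overall likelihood J = prod_n P(x_n), returned as a log for stability."""
    # densities[n, k] = f_k(x_n | c_k, Sigma_k)
    densities = np.column_stack(
        [multivariate_normal(mean=m, cov=S).pdf(X) for m, S in zip(means, covs)]
    )
    p_x = densities @ priors                  # P(x_n)
    return p_x, np.sum(np.log(p_x))           # per-point likelihoods, log J

# toy usage: a 2-component mixture in 3 dimensions
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
means = [np.zeros(3), np.ones(3)]
covs = [np.eye(3), 2 * np.eye(3)]
priors = np.array([0.6, 0.4])
p_x, log_J = mixture_likelihood(X, means, covs, priors)
print(log_J)
```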
EM algorithm
11

The EM (expectation-maximization) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models.

E-step: assign points to clusters

P(k \mid x_n) = \frac{f_k(x_n \mid c_k, \Sigma_k)\, P(k)}{P(x_n)}

M-step: estimate model parameters

c_k = \frac{\sum_n P(k \mid x_n)\, x_n}{\sum_n P(k \mid x_n)}

\Sigma_k = \frac{\sum_n P(k \mid x_n)\, (x_n - c_k)(x_n - c_k)^\top}{\sum_n P(k \mid x_n)}

P(k) = \frac{1}{N} \sum_n P(k \mid x_n)
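A compact sketch of one EM iteration for a Gaussian mixture follows; the synthetic data, the initialization, the small covariance regularizer, and the fixed number of iterations are assumptions for illustration, not part of the original slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, means, covs, priors):
    """One EM iteration for a Gaussian mixture; returns updated parameters."""
    N, K = X.shape[0], len(priors)
    # E-step: responsibilities P(k | x_n) proportional to f_k(x_n | c_k, Sigma_k) * P(k)
    R = np.column_stack(
        [priors[k] * multivariate_normal(mean=means[k], cov=covs[k]).pdf(X)
         for k in range(K)]
    )
    R /= R.sum(axis=1, keepdims=True)
    # M-step: re-estimate means, covariances, and mixing weights
    Nk = R.sum(axis=0)                                  # effective cluster sizes
    new_means = (R.T @ X) / Nk[:, None]
    new_covs = []
    for k in range(K):
        D = X - new_means[k]
        new_covs.append((R[:, k, None] * D).T @ D / Nk[k] + 1e-6 * np.eye(X.shape[1]))
    new_priors = Nk / N
    return new_means, new_covs, new_priors

# toy usage: fit 2 components to synthetic 2-D data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 1, (150, 2))])
means, covs, priors = [X[0], X[-1]], [np.eye(2), np.eye(2)], np.array([0.5, 0.5])
for _ in range(20):
    means, covs, priors = em_step(X, means, covs, priors)
print(np.round(means, 2), np.round(priors, 2))
```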
Gaussian mixtures
12

Feature mapping function:

f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - c_k)^\top \Sigma_k^{-1} (x - c_k) \right)

where d is the dimension of x, k = 1, \ldots, K indexes the Gaussians, c_k is the mean of the k-th Gaussian, and \Sigma_k is the covariance matrix of the k-th Gaussian.
Feature extraction and classification
13

Convolutional feature extraction and pooling (sum)
Classification: (L2) SVM
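The pooling step can be illustrated in a few lines: the sketch below sum-pools a grid of per-location feature vectors over the four image quadrants, reducing a G x G x K grid to a 4K-dimensional vector (the grid size and K here are arbitrary assumptions).

```python
import numpy as np

def quadrant_sum_pool(feature_grid):
    """feature_grid: (G, G, K) array of K-dimensional feature vectors, one per
    sub-patch location; returns a (4*K,) vector of quadrant sums."""
    G = feature_grid.shape[0]
    h = G // 2
    quadrants = [feature_grid[:h, :h], feature_grid[:h, h:],
                 feature_grid[h:, :h], feature_grid[h:, h:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])

grid = np.random.default_rng(0).random((14, 14, 50))   # e.g. 14x14 locations, K = 50
print(quadrant_sum_pool(grid).shape)                   # (200,)
```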
Data
14

1. CIFAR-10 (this dataset is used to tune the parameters)
2. NORB
3. downsampled STL-10 (96x96 --> 32x32)
CIFAR-10 dataset
15

[3]
The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
NORB dataset
16

[4]
This dataset is intended for experiments in 3D object recognition from shape. It contains images of 50 toys belonging to 5 generic categories: animals, human figures, airplanes, trucks, and cars. There are 24,300 training image pairs (96x96) and 24,300 test image pairs.
STL-10 dataset
17

[5]
The STL-10 dataset consists of 96x96 color images in 10 classes (airplane, bird, car, cat, deer, dog, horse, monkey, ship, and truck), with 500 labeled training images and 800 test images per class, plus 100,000 unlabeled images. For the experiments here the images are downsampled to 32x32.
Effect of some parameters
18

1. with or without whitening
2. number of features
3. stride (spacing between patches)
4. receptive field size
Effect of whitening
19
Result of whitening:
1. the features are less correlated with each other
2. the features all have the same variance

For the sparse autoencoder and sparse RBM:
- when using only 100 features, there is a significant benefit from whitening preprocessing
- as the number of features grows, the advantage disappears

For the clustering algorithms:
- whitening is a necessary step, because these algorithms cannot handle the correlations in the data
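The slides do not spell out the whitening transform itself; a common choice for image patches, and an assumption here, is ZCA whitening with a small regularization constant, sketched below.

```python
import numpy as np

def zca_whiten(X, eps=0.1):
    """ZCA-whiten rows of X (patches) so the features are decorrelated."""
    Xc = X - X.mean(axis=0)                          # zero-mean each feature
    cov = Xc.T @ Xc / Xc.shape[0]                    # covariance of the patch features
    d, V = np.linalg.eigh(cov)                       # eigendecomposition (cov is symmetric)
    W = V @ np.diag(1.0 / np.sqrt(d + eps)) @ V.T    # ZCA whitening matrix
    return Xc @ W

rng = np.random.default_rng(0)
patches = rng.random((1000, 36))                     # e.g. flattened 6x6 patches
white = zca_whiten(patches)
print(np.round(np.cov(white, rowvar=False)[:3, :3], 2))   # approximately diagonal
```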
Effect of number of features
20

Number of features used: 100, 200, 400, 800, 1600.
All algorithms generally achieved higher performance by learning more features.
Effect of stride
21

Stride is the spacing between patches from which feature values are extracted.
Performance decreases as the stride increases.
Effect of receptive field size
22

Receptive field size is the patch size.
Overall, a receptive field size of 6 pixels worked best.
Classification results
23
Table 1: Test recognition accuracy on CIFAR-10

Algorithm                              Accuracy
Raw pixels                             37.3%
3-way factored RBM (3 layers)          65.3%
Mean-covariance RBM (3 layers)         71.0%
Improved Local Coord. Coding           74.5%
Conv. Deep Belief Net (2 layers)       78.9%
Sparse auto-encoder                    73.4%
Sparse RBM                             72.4%
K-means (Hard)                         68.6%
K-means (Triangle, 1600 features)      77.9%
K-means (Triangle, 4000 features)      79.6%

(stride = 1, receptive field = 6 pixels, with whitening, large number of features)
Classification results
24
Table 2: Test recognition accuracy (and error) for NORB (normalized-uniform)

Algorithm                              Accuracy (error)
Conv. Neural Network                   93.4% (6.6%)
Deep Boltzmann Machine                 92.8% (7.2%)
Deep Belief Network                    95.0% (5.0%)
Best result of [6]                     94.4% (5.6%)
Deep neural network                    97.13% (2.87%)
Sparse auto-encoder                    96.9% (3.1%)
Sparse RBM                             96.2% (3.8%)
K-means (Hard)                         96.9% (3.1%)
K-means (Triangle, 1600 features)      97.0% (3.0%)
K-means (Triangle, 4000 features)      97.21% (2.79%)

(stride = 1, receptive field = 6 pixels, with whitening, large number of features)
Classification results
25
Table 3: Test recognition accuracy on STL-10

Algorithm                              Accuracy
Raw pixels                             31.8% (±0.62%)
K-means (Triangle, 1600 features)      51.5% (±1.73%)

The method proposed is strongest when we have large labeled training sets.
Conclusion
26

The best performance is obtained with k-means clustering:
- easy and fast
- no hyperparameters to tune
A single-layer network can achieve good results.
Using more features and dense (small-stride) feature extraction improves performance.
References
27
[1] Coates, Adam, Andrew Y. Ng, and Honglak Lee. "An analysis of single-layer
networks in unsupervised feature learning." International Conference on Artificial
Intelligence and Statistics. 2011.
[2] http://ace.cs.ohio.edu/~razvan/courses/dl6900/index.html
[3] A. Krizhevsky. Learning multiple layers of features from Tiny Images. Master's thesis, Dept. of Comp. Sci., University of Toronto, 2009.
[4] LeCun, Yann, Fu Jie Huang, and Leon Bottou. "Learning methods for generic object
recognition with invariance to pose and lighting." Computer Vision and Pattern
Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society
Conference on. Vol. 2. IEEE, 2004.
[5] http://cs.stanford.edu/~acoates/stl10
[6] Jarrett, Kevin, et al. "What is the best multi-stage architecture for object
recognition?." Computer Vision, 2009 IEEE 12th International Conference on. IEEE,
2009.
[7] Goh, Hanlin, Nicolas Thome, and Matthieu Cord. "Biasing restricted Boltzmann machines
to manipulate latent selectivity and sparsity." NIPS workshop on deep learning and
unsupervised feature learning. 2010.
28
THANK YOU!