
An Analysis of Single-Layer Networks
in Unsupervised Feature Learning
Adam Coates, Honglak Lee and Andrew Y. Ng
AISTATS 2011
The Importance of Encoding Versus Training
with Sparse Coding and Vector Quantization
Adam Coates and Andrew Y. Ng
ICML 2011
Presented by: Mingyuan Zhou
Duke University, ECE
June 17, 2011
An Analysis of Single-Layer Networks
in Unsupervised Feature Learning
Adam Coates, Honglak Lee and Andrew Y. Ng
AISTATS 2011
Outline
• Introduction
• Unsupervised feature learning
• Parameter setting
• Experiments on CIFAR, NORB and STL
• Conclusions
Training/testing pipeline
• Feature learning:
– Extract random patches from unlabeled training images
– Apply a pre-processing stage to the patches
– Learn a feature mapping using an unsupervised learning algorithm
• Feature extraction and classification:
– Extract features from equally spaced sub-patches covering the input image
– Pool features together over regions of the input image to reduce the number of feature values
– Train a linear classifier to predict the labels given the feature vectors
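The extraction and pooling steps above can be sketched in code. This is a minimal illustration with assumed names (`encode` stands in for whatever feature mapping the unsupervised stage learned), not the authors' implementation:

```python
import numpy as np

def extract_features(image, encode, w=6, stride=1):
    """Encode every w x w sub-patch of `image`, stepping by `stride`.

    `encode` maps a flattened patch to a K-dimensional feature vector.
    Returns an (ny, nx, K) array of feature maps.
    """
    H, W = image.shape[:2]
    ys = range(0, H - w + 1, stride)
    xs = range(0, W - w + 1, stride)
    return np.stack([
        np.stack([encode(image[y:y + w, x:x + w].ravel()) for x in xs])
        for y in ys
    ])

def pool_quadrants(feats):
    """Sum-pool the feature maps over the four image quadrants,
    giving a 4K-dimensional vector for the linear classifier."""
    ny, nx, K = feats.shape
    hy, hx = ny // 2, nx // 2
    quads = [feats[:hy, :hx], feats[:hy, hx:],
             feats[hy:, :hx], feats[hy:, hx:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quads])
```

The pooled 4K-dimensional vectors are what the linear classifier is trained on.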
Feature learning
• Pre-processing of patches
– Mean subtraction and scale normalization
– Whitening
• Unsupervised learning
– Sparse auto-encoder
– Sparse restricted Boltzmann machine
– K-means clustering
Hard-assignment: f_k(x) = 1 if k = argmin_j ||c_j − x||², and 0 otherwise
Soft-assignment ("triangle" activation): f_k(x) = max(0, μ(z) − z_k), where z_k = ||x − c_k||₂ and μ(z) is the mean of the elements of z
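A minimal sketch of the pre-processing and the two K-means feature mappings above (illustrative names, assuming patches are stored as rows of a matrix; this is not the authors' code):

```python
import numpy as np

def normalize(X, eps=10.0):
    """Per-patch mean subtraction and scale (contrast) normalization.
    X is (n_patches, n_pixels); eps guards against near-constant patches."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / np.sqrt(X.var(axis=1, keepdims=True) + eps)

def zca_whiten(X, eps=0.01):
    """ZCA whitening: rotate to the PCA basis, rescale, rotate back."""
    cov = np.cov(X, rowvar=False)
    d, V = np.linalg.eigh(cov)
    W = V @ np.diag(1.0 / np.sqrt(d + eps)) @ V.T
    return X @ W

def kmeans_hard(x, C):
    """Hard assignment: 1-of-K indicator of the nearest centroid."""
    z = np.linalg.norm(C - x, axis=1)
    f = np.zeros(len(C))
    f[np.argmin(z)] = 1.0
    return f

def kmeans_soft(x, C):
    """Soft ('triangle') assignment: f_k = max(0, mean(z) - z_k)."""
    z = np.linalg.norm(C - x, axis=1)
    return np.maximum(0.0, z.mean() - z)
```

The triangle activation keeps only the centroids closer than average, so each patch yields a sparse, non-negative feature vector.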
Feature learning
• Unsupervised learning
– Sparse auto-encoder
– Sparse restricted Boltzmann machine
– K-means clustering
– Gaussian mixture model (GMM)
Feature extraction and classification
Experiments and analysis
• Model parameters:
– Whitening?
– Number of features K
– Stride s (all the overlapping patches are used when s = 1)
– Receptive field (patch) size w
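A small hypothetical helper makes the interaction of stride and patch size concrete: the number of patches encoded per image grows rapidly as s shrinks, which is why small strides are expensive but (per the experiments) effective.

```python
# Patches on an n x n image covered by a w x w receptive field at stride s.
def num_patches(n, w, s):
    per_side = (n - w) // s + 1
    return per_side ** 2

# On 32x32 CIFAR images with 6x6 receptive fields:
# stride 1 encodes 27 * 27 = 729 patches, stride 2 only 14 * 14 = 196.
```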
Experiments and analysis
Conclusions
Mean subtraction, scale normalization and whitening
+ Large K
+ Small s
+ Right patch size w
+ Simple feature learning algorithm (soft K-means)
=
State-of-the-art results on CIFAR-10 and NORB
The Importance of Encoding Versus Training
with Sparse Coding and Vector Quantization
Adam Coates and Andrew Y. Ng
ICML 2011
Outline
• Motivations and contributions
• Review of dictionary learning algorithms
• Review of sparse coding algorithms
• Experiments on CIFAR, NORB and Caltech101
• Conclusions
Main contributions
Dictionary learning algorithms
• Sparse coding (SC)
• Orthogonal matching pursuit (OMP-k)
• Sparse RBMs and sparse auto-encoders (RBM, SAE)
• Randomly sampled patches (RP)
• Random weights (R)
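The striking baseline in this list is randomly sampled patches (RP): a dictionary built with no learning at all. A minimal sketch of how such a dictionary might be built (assumed names; the pre-processed patches are rows of X):

```python
import numpy as np

def random_patch_dictionary(X, K, seed=0):
    """Draw K patches at random from the rows of X and normalize each
    to unit length, yielding a (K, n_pixels) dictionary."""
    rng = np.random.default_rng(seed)
    D = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    return D / np.linalg.norm(D, axis=1, keepdims=True)
```

Because natural patches are drawn from the data distribution itself, they already roughly tile the input space, which is all the later experiments suggest a dictionary needs to do.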
Sparse coding algorithms
• Sparse coding (SC)
• OMP-k
• Soft threshold (T)
• “Natural” encoding
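Of these encoders, the soft threshold (T) is the simplest: a single inner product per dictionary element followed by a fixed threshold, with no optimization at encoding time. A sketch under assumed names (D is a (K, n_pixels) dictionary, alpha a fixed threshold):

```python
import numpy as np

def soft_threshold_encode(x, D, alpha=0.25):
    """Soft-threshold encoder T: f_j = max(0, d_j . x - alpha)."""
    return np.maximum(0.0, D @ x - alpha)
```

Despite its simplicity, the experiments below show this encoder can approach sparse coding's accuracy when enough labeled data is available.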
Experimental results
Experimental results
Comments on dictionary learning
• The results show that the main advantage of sparse coding is as an encoder, and that the choice of basis functions has little effect on performance.
• The main value of the dictionary is to provide a highly overcomplete basis on which to project the data before applying an encoder; the exact structure of these basis functions is less critical than the choice of encoding.
• All that appears necessary is to choose the basis to roughly tile the space of the input data. This increases the chances that a few basis vectors will be near an input, yielding a large activation that is useful for identifying the location of the input on the data manifold.
• This explains why vector quantization is quite capable of competing with more complex algorithms: it simply ensures that there is at least one dictionary entry near any densely populated area of the input space. We expect that learning is more crucial with small dictionaries, since we would then need to be more careful to pick basis functions that span the space of inputs equitably.
Conclusions
• The main power of sparse coding is not that it learns better basis functions. In fact, we discovered that any reasonable tiling of the input space (including randomly chosen input patches) is sufficient to obtain high performance on any of the three very different recognition problems that we tested.
• Instead, the main strength of sparse coding appears to arise from its non-linear encoding scheme, which was almost universally effective in our experiments, even with no training at all.
• Indeed, it was difficult to beat this encoding on the Caltech 101 dataset. In many cases, however, it was possible to do nearly as well using only a soft threshold function, provided we have sufficient labeled data.
• Overall, we conclude that most of the performance obtained in our results is a function of the choice of architecture and encoding, suggesting that these are key areas for further study and improvement.