Self-taught Learning
Transfer Learning from Unlabeled Data
Rajat Raina
Honglak Lee, Roger Grosse
Alexis Battle, Chaitanya Ekanadham, Helen Kwong, Benjamin Packer,
Narut Sereewattanawoot
Andrew Y. Ng
Stanford University
The “one learning algorithm” hypothesis
• There is some evidence that the human brain uses essentially the same algorithm to understand many different input modalities.
  – Example: ferret experiments, in which the "input" for vision was plugged into the auditory part of the brain, and the auditory cortex learns to "see." [Roe et al., 1992]
(Roe et al., 1992. Hawkins & Blakeslee, 2004)
If we could find this one learning algorithm, we would be done. (Finally!)
Finding a deep learning algorithm
• If the brain really uses one learning algorithm, it would suffice to just:
  – Find a learning algorithm for a single layer, and
  – Show that it can build a small number of layers.
• We evaluate our algorithms:
  – Against biology (e.g., sparse RBMs for V2: poster yesterday, Lee et al.).
  – On applications (this talk).
Supervised learning
[Slide: labeled training images of cars and motorcycles, and test images to classify.]
Supervised learning algorithms may not work well with limited labeled data.
Learning in humans
• Your brain has 10^14 synapses (connections).
• You will live for 10^9 seconds.
• If each synapse requires 1 bit to parameterize, you need to "learn" 10^14 bits in 10^9 seconds.
• Or, 10^5 bits per second.
Human learning is largely unsupervised, and uses readily available unlabeled data.
(Geoffrey Hinton, personal communication)
"Brain-like" learning
[Slide: the same cars-vs-motorcycles task, but the training data now also includes unlabeled images randomly downloaded from the Internet.]
"Brain-like" learning, i.e., "self-taught learning"
[Slide: further examples of the setting.]
• Labeled digits + unlabeled English characters → ?
• Labeled webpages + unlabeled newspaper articles → ?
• Labeled Russian speech + unlabeled English speech → ?
Recent history of machine learning
• 20 years ago: Supervised learning. [Images: motorcycles, cars]
• 10 years ago: Semi-supervised learning. [Images: motorcycles, cars]
• 10 years ago: Transfer learning. [Images: buses, tractors, aircraft, helicopters, cars]
• Next: Self-taught learning? [Images: cars, motorcycles, natural scenes]
Self-taught Learning
Labeled examples: $\{(x_l^{(i)}, y^{(i)})\}_{i=1}^{m}$, where $x_l^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \{1, \ldots, T\}$.
Unlabeled examples: $\{x_u^{(i)}\}_{i=1}^{k}$, where $x_u^{(i)} \in \mathbb{R}^n$ and $k \gg m$.
The unlabeled and labeled data:
• Need not share labels y.
• Need not share a generative distribution.
Advantage: Such unlabeled data is often easy to obtain.
A self-taught learning algorithm
Overview: Represent each labeled or unlabeled input $x$ as a sparse linear combination of "basis vectors" $\{b_j\}_{j=1}^{s}$:
  $x = \sum_j a_j b_j$, with $b_j \in \mathbb{R}^n$ and $a_j \in \mathbb{R}$.
[Slide figure: an image patch decomposed as $x = 0.8\,b_{87} + 0.3\,b_{376} + 0.5\,b_{411}$.]
Key steps (a minimal sketch follows this list):
1. Learn good bases $b_j$ using the unlabeled data $x_u^{(i)}$.
2. Use these learnt bases to construct "higher-level" features for the labeled data.
3. Apply a standard supervised learning algorithm on these features.
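To make these three steps concrete, here is a minimal end-to-end sketch. It is not the implementation behind the talk (which uses the efficient sparse-coding solvers of Lee et al., NIPS 2006); it assumes scikit-learn's dictionary learning as a stand-in for step 1, and the arrays, names, and parameter values are illustrative placeholders.

# Sketch of the three-step self-taught learning pipeline on synthetic data.
# Assumes scikit-learn and NumPy; parameter values are illustrative only.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
n = 64                                     # input dimension (e.g., an 8x8 image patch)
X_unlabeled = rng.randn(5000, n)           # stand-in for unlabeled patches
X_labeled = rng.randn(100, n)              # stand-in for labeled examples
y_labeled = rng.randint(0, 2, size=100)    # stand-in for labels

# Step 1: learn bases b_j from unlabeled data (sparse coding / dictionary learning).
coder = MiniBatchDictionaryLearning(
    n_components=128,                  # number of bases s
    alpha=1.0,                         # sparsity penalty (beta in the slides)
    transform_algorithm="lasso_lars",  # solve the L1-penalized coding problem
    transform_alpha=1.0,
    random_state=0,
)
coder.fit(X_unlabeled)

# Step 2: compute sparse-code features a for the labeled examples.
features = coder.transform(X_labeled)

# Step 3: train a standard supervised classifier on the new features.
clf = LinearSVC().fit(features, y_labeled)
print("training accuracy:", clf.score(features, y_labeled))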
Learning the bases: Sparse coding
Given only unlabeled data $x_u^{(i)}$, we find good bases $b$ using sparse coding:
  $\min_{b,a} \sum_i \| x_u^{(i)} - \sum_j a_j^{(i)} b_j \|_2^2 + \beta \sum_i \| a^{(i)} \|_1$
The first term is the reconstruction error; the second is the sparsity penalty.
(Efficient algorithms: Lee et al., NIPS 2006)
[Details: An extra normalization constraint on $\|b_j\|_2$ is required.]
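For intuition only, here is a hedged NumPy sketch of this objective optimized by plain alternating minimization: an ISTA loop for the codes a, then least squares plus projection onto the unit ball for the bases b. The function name, data, and hyperparameters are assumptions, and this is far slower than the algorithms of Lee et al. (NIPS 2006).

# Minimal alternating-minimization sketch of the sparse-coding objective above.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_coding(X, n_bases=64, beta=0.1, n_outer=20, n_ista=50, seed=0):
    """X: (k, n) unlabeled examples. Returns bases B (n_bases, n) and codes A (k, n_bases)."""
    rng = np.random.RandomState(seed)
    k, n = X.shape
    B = X[rng.choice(k, n_bases, replace=False)].astype(float)
    B /= np.linalg.norm(B, axis=1, keepdims=True) + 1e-12
    A = np.zeros((k, n_bases))
    for _ in range(n_outer):
        # (1) Codes: minimize ||X - A B||^2 + beta * ||A||_1 with B fixed (ISTA).
        step = 1.0 / (2.0 * np.linalg.norm(B @ B.T, 2) + 1e-12)
        for _ in range(n_ista):
            grad = -2.0 * (X - A @ B) @ B.T
            A = soft_threshold(A - step * grad, step * beta)
        # (2) Bases: least squares with A fixed, then project each row onto the
        #     unit ball (the normalization constraint mentioned on the slide).
        B = np.linalg.lstsq(A, X, rcond=None)[0]
        B /= np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1.0)
    return B, A

if __name__ == "__main__":
    X = np.random.RandomState(1).randn(500, 36)   # stand-in for unlabeled patches
    B, A = sparse_coding(X)
    print("mean nonzeros per code:", (np.abs(A) > 1e-8).sum(axis=1).mean())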
Example bases
• Natural images. Learnt bases: "edges".
• Handwritten characters. Learnt bases: "strokes".
Constructing features
• Using the learnt bases $b$, compute features for the examples $x_l$ from the classification task by solving:
  features($x_l$) $= \arg\min_a \| x_l - \sum_j a_j b_j \|_2^2 + \beta \| a \|_1$
  (again a reconstruction-error term plus a sparsity penalty).
[Slide figure: $x_l$ decomposed as $x_l = 0.8\,b_{87} + 0.3\,b_{376} + 0.5\,b_{411}$.]
• Finally, learn a classifier using a standard supervised learning algorithm (e.g., an SVM) over these features; a sketch follows.
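The per-example problem above is exactly an L1-regularized (Lasso) regression of x_l onto the learnt bases. The sketch below shows that correspondence using scikit-learn's Lasso; the bases, the example, and the penalty value are made-up placeholders, and the alpha rescaling only matches scikit-learn's objective scaling.

# Computing the feature vector for one labeled example x_l, given bases already
# learned from unlabeled data. Names and values below are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, s = 64, 128                      # input dimension, number of bases
B = rng.randn(s, n)                 # stand-in for learnt bases (rows b_j)
B /= np.linalg.norm(B, axis=1, keepdims=True)
x_l = rng.randn(n)                  # stand-in for a labeled example
beta = 0.5                          # sparsity penalty from the slides

# scikit-learn's Lasso minimizes (1/(2*n)) * ||x_l - B.T @ a||^2 + alpha * ||a||_1,
# so alpha = beta / (2*n) matches the objective on the slide up to scaling.
lasso = Lasso(alpha=beta / (2 * n), fit_intercept=False, max_iter=10000)
lasso.fit(B.T, x_l)                 # design matrix columns are the bases
features = lasso.coef_              # sparse activations a, used as the new features
print("nonzero activations:", int((features != 0).sum()), "of", s)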
Image classification
[Slide figures, over several builds: feature visualizations computed on a large platypus image from the Caltech101 dataset.]
Results (15 labeled images per class):
  Baseline: 16%
  PCA: 37%
  Sparse coding: 47% (36.0% error reduction)
Other reported results: Fei-Fei et al., 2004: 16%; Berg et al., 2005: 17%; Holub et al., 2005: 40%; Serre et al., 2005: 35%; Berg et al., 2005: 48%; Zhang et al., 2006: 59%; Lazebnik et al., 2006: 56%.
Character recognition
Handwritten English classification (20 labeled images per handwritten character; bases learnt on digits):
  Raw: 54.8%
  PCA: 54.8%
  Sparse coding: 58.5% (8.2% error reduction)
English font classification (20 labeled images per font character; bases learnt on handwritten English):
  Raw: 17.9%
  PCA: 14.5%
  Sparse coding: 16.6%
  Sparse coding + Raw: 20.2% (2.8% error reduction)
Text classification
Webpage classification (2 labeled documents per class; bases learnt on Reuters newswire):
  Raw words: 62.8%
  PCA: 63.3%
  Sparse coding: 64.3% (4.0% error reduction)
UseNet classification (2 labeled documents per class; bases learnt on Reuters newswire):
  Raw words: 61.3%
  PCA: 60.7%
  Sparse coding: 63.8% (6.5% error reduction)
Shift-invariant sparse coding
[Slide figure: a signal reconstructed as a sum of basis functions convolved with sparse features.]
(Algorithms: Grosse et al., UAI 2007)
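As a small illustration of the model (not of the learning algorithm, which is described in Grosse et al., UAI 2007), the NumPy sketch below builds a signal as a sum of basis functions convolved with sparse activation signals, so each basis can appear at any temporal offset. All sizes and values are made up.

# Shift-invariant model: x(t) = sum_j (a_j * b_j)(t), with sparse a_j.
import numpy as np

rng = np.random.RandomState(0)
T, L, s = 200, 16, 4                      # signal length, basis length, number of bases
bases = rng.randn(s, L)                   # stand-in for learnt basis functions b_j

# Sparse activations a_j: mostly zero, with a few spikes at arbitrary offsets.
activations = np.zeros((s, T - L + 1))
for j in range(s):
    idx = rng.choice(T - L + 1, size=3, replace=False)
    activations[j, idx] = rng.randn(3)

# Reconstruction: a sum of convolutions of activations with basis functions.
x_reconstructed = sum(np.convolve(activations[j], bases[j]) for j in range(s))
print("reconstructed signal length:", x_reconstructed.shape[0])   # equals T = 200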
Audio classification
[Slide figure: spectrogram, MFCC, and sparse-coding feature visualizations.]
Speaker identification (5 labels, TIMIT corpus, 1 sentence per speaker; bases learnt on different dialects):
  Spectrogram: 38.5%
  MFCCs: 43.8%
  Sparse coding: 48.7% (8.7% error reduction)
Musical genre classification (5 labels, 18 seconds per genre; bases learnt on different genres and songs):
  Spectrogram: 48.4%
  MFCCs: 54.0%
  Music-specific model: 49.3%
  Sparse coding: 56.6% (5.7% error reduction)
(Details: Grosse et al., UAI 2007)
Sparse deep belief networks
[Slide figure: a sparse RBM, with visible layer v, hidden layer h, and parameters W, b, c.]
(Details: Lee et al., NIPS 2007. Poster yesterday.)
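A hedged sketch of the idea in NumPy: a standard binary RBM trained with CD-1, plus a simple regularizer that nudges the mean hidden activation toward a small target. This is in the spirit of, but not identical to, the sparsity penalty of Lee et al. (NIPS 2007); the data, sizes, and hyperparameters are placeholders.

# Minimal sparse-RBM sketch: CD-1 updates plus a sparsity nudge on the hidden biases.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.RandomState(0)
n_visible, n_hidden = 64, 100
lr, sparsity_target, sparsity_cost = 0.05, 0.05, 0.5

W = 0.01 * rng.randn(n_visible, n_hidden)   # weights
b = np.zeros(n_visible)                     # visible biases
c = np.zeros(n_hidden)                      # hidden biases

X = (rng.rand(1000, n_visible) > 0.8).astype(float)   # stand-in binary "patches"

for epoch in range(10):
    for start in range(0, len(X), 50):                # mini-batches of 50
        v0 = X[start:start + 50]
        # Positive phase: hidden probabilities given the data.
        h0 = sigmoid(v0 @ W + c)
        # Negative phase (CD-1): sample hiddens, reconstruct visibles, re-infer hiddens.
        h_sample = (rng.rand(*h0.shape) < h0).astype(float)
        v1 = sigmoid(h_sample @ W.T + b)
        h1 = sigmoid(v1 @ W + c)
        # Contrastive-divergence gradients.
        W += lr * ((v0.T @ h0 - v1.T @ h1) / len(v0))
        b += lr * (v0 - v1).mean(axis=0)
        c += lr * (h0 - h1).mean(axis=0)
        # Sparsity regularizer: push mean hidden activation toward the target.
        c += lr * sparsity_cost * (sparsity_target - h0.mean(axis=0))

print("mean hidden activation:", sigmoid(X @ W + c).mean())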
Sparse deep belief networks: image classification (Caltech101 dataset)
  1-layer sparse DBN: 44.5%
  2-layer sparse DBN: 46.6% (3.2% error reduction)
(Details: Lee et al., NIPS 2007. Poster yesterday.)
Summary
• Self-taught learning: unlabeled data does not share the labels of the classification task. [Slide: labeled cars and motorcycles, plus unlabeled images]
• Use unlabeled data to discover features.
• Use sparse coding to construct an easy-to-classify, "higher-level" representation. [Slide: the earlier sparse decomposition, with coefficients 0.8, 0.3, 0.5]
THE END
Related Work
• Weston et al., ICML 2006: make stronger assumptions on the unlabeled data.
• Ando & Zhang, JMLR 2005: for natural language tasks and character recognition, use heuristics to construct a transfer learning task using unlabeled data.