Deep Learning
Why?
Speech recognition
[Figure: word error rate (100.0% down to 6.3%) by year, 1992 to 2012, for read and conversational speech]
Source: Huang et al., Communications ACM 01/2014
Large Scale Visual Recognition Challenge 2012
[Figure: error rate (0% to 35%) by team: ISI, OXFORD_VGG, XRCE/INRIA, University of Amsterdam, LEAR-XRCE, SuperVision]
the 2013 International Conference on Learning
Representations, the 2013 ICASSP’s special session on New
Types of Deep Neural Network Learning for Speech
Recognition and Related Applications, the 2013 ICML
Workshop for Audio, Speech, and Language Processing, the
2012, 2011, and 2010 NIPS Workshops on Deep Learning and
Unsupervised Feature Learning, 2013 ICML Workshop on
Representation Learning Challenges, 2013 Intern. Conf. on
Learning Representations, 2012 ICML Workshop on
Representation Learning, 2011 ICML Workshop on Learning
Architectures, Representations, and Optimization for Speech
and Visual Information Processing, 2009 ICML Workshop on
Learning Feature Hierarchies, 2009 NIPS Workshop on Deep
Learning for Speech Recognition and Related Applications,
2012 ICASSP deep learning tutorial, the special section on
Deep Learning for Speech and Language Processing in IEEE
Trans. Audio, Speech, and Language Processing (January
2012), the special issue on Learning Deep Architectures in
“A fast learning algorithm for deep belief nets”
-- Hinton et al., 2006
“Reducing the dimensionality of data with neural networks”
-- Hinton & Salakhutdinov
Geoffrey Hinton
University of Toronto
How?
Shallow learning
• SVM
• Linear & Kernel Regression
• Hidden Markov Models (HMM)
• Gaussian Mixture Models (GMM)
• Single hidden layer MLP
• ...
Limited modeling capability of concepts
Cannot make use of unlabeled data
Neural Networks
• Machine Learning
• Knowledge from high dimensional data
• Classification
• Input: features of data
• supervised vs unsupervised
• labeled data
• Neurons
Multi Layer Perceptron
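• Multiple Layers
• Feed Forward
• Connected Weights: $z = \sum_i x_i w_{ij}$
• 1-of-N Output
• Activation: $a = \frac{1}{1 + e^{-z}}$
[Figure: network with input layer i ([X1, X2, X3]), hidden layer j, and output layer k ([Y1, Y2]); weights vij connect input to hidden, weights wjk connect hidden to output. Inset: plot of the activation a (0 to 1) against z]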
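The feed-forward computation above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names and the weight values are made up for the example, and bias terms are left out because the slide does not show any.

```python
import math

def sigmoid(z):
    """Logistic activation a = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights):
    """Activations of one layer; weights[i][j] connects input unit i to unit j."""
    n_out = len(weights[0])
    activations = []
    for j in range(n_out):
        z = sum(x * weights[i][j] for i, x in enumerate(inputs))  # z = sum_i x_i w_ij
        activations.append(sigmoid(z))
    return activations

def mlp_forward(x, v, w):
    """Feed forward: input -> hidden (weights v) -> output (weights w)."""
    hidden = layer_forward(x, v)
    output = layer_forward(hidden, w)
    return hidden, output

# Illustrative weights only (assumed values, 3 inputs -> 2 hidden -> 2 outputs)
v = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w = [[0.3, -0.1], [-0.2, 0.4]]
hidden, output = mlp_forward([1.0, 0.0, 1.0], v, w)
print(hidden, output)
```

Each output lies strictly between 0 and 1, so with a 1-of-N target the largest output can be read as the predicted class.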
Backpropagation
• Minimize error of calculated output
• Adjust weights
• Procedure
• Forward Phase
• Backpropagation of errors
• Gradient Descent
• For each sample, multiple epochs
[Figure: network layers i, j, k with weights vij (input to hidden) and wjk (hidden to output)]
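The procedure (forward phase, backpropagation of errors, gradient descent per sample over multiple epochs) might look like the following sketch. Several things here are assumptions for illustration: the squared-error loss, the learning rate, the toy OR data set, and the constant third input that stands in for a bias.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, v, w):
    """Forward phase: input -> hidden (weights v) -> output (weights w)."""
    h = [sigmoid(sum(x[i] * v[i][j] for i in range(len(x)))) for j in range(len(v[0]))]
    y = [sigmoid(sum(h[j] * w[j][k] for j in range(len(h)))) for k in range(len(w[0]))]
    return h, y

def backprop_step(x, target, v, w, lr):
    """One gradient-descent update on a single sample, E = 0.5 * sum (y_k - t_k)^2."""
    h, y = forward(x, v, w)
    # Output deltas: dE/dz_k, using the sigmoid derivative y * (1 - y)
    delta_out = [(y[k] - target[k]) * y[k] * (1.0 - y[k]) for k in range(len(y))]
    # Hidden deltas: output errors propagated back through w (before w changes)
    delta_hid = [h[j] * (1.0 - h[j]) * sum(delta_out[k] * w[j][k] for k in range(len(y)))
                 for j in range(len(h))]
    for j in range(len(h)):
        for k in range(len(y)):
            w[j][k] -= lr * delta_out[k] * h[j]
    for i in range(len(x)):
        for j in range(len(h)):
            v[i][j] -= lr * delta_hid[j] * x[i]

def total_error(samples, v, w):
    return sum(0.5 * sum((yk - tk) ** 2 for yk, tk in zip(forward(x, v, w)[1], t))
               for x, t in samples)

# Toy data (assumed): logical OR, third input fixed at 1.0 as a bias stand-in
samples = [([0.0, 0.0, 1.0], [0.0]), ([0.0, 1.0, 1.0], [1.0]),
           ([1.0, 0.0, 1.0], [1.0]), ([1.0, 1.0, 1.0], [1.0])]
random.seed(0)
v = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]  # random start weights
w = [[random.uniform(-0.5, 0.5)] for _ in range(2)]
error_before = total_error(samples, v, w)
for epoch in range(1000):          # multiple epochs
    for x, t in samples:           # one update per sample
        backprop_step(x, t, v, w, lr=0.5)
error_after = total_error(samples, v, w)
print(error_before, error_after)
```

Both delta lists are computed before any weight is changed, so the hidden-layer gradient uses the same weights as the forward phase.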
Best Practice
• Normalization
• Prevent very high weights, Oscillation
• Overfitting/Generalisation
• Validation Set, Early Stopping
• Mini-Batch Learning
• update weights with multiple
input vectors combined
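A rough sketch of how these practices can fit together, using a single logistic unit to keep it short. The helper names, the toy data, the patience-based stopping rule, and all hyperparameters are assumptions for illustration, not values from the slides.

```python
import math
import random

def zscore_params(rows):
    """Per-feature mean and standard deviation, computed on training data only."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = [math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / n) or 1.0
            for d in range(dims)]
    return means, stds

def normalize(rows, means, stds):
    return [[(r[d] - means[d]) / stds[d] for d in range(len(r))] for r in rows]

def minibatches(samples, size):
    """Yield consecutive chunks of at most `size` samples."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]

def predict(weights, x):
    z = weights[0] + sum(wi * xi for wi, xi in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(weights, samples):
    return sum((predict(weights, x) - t) ** 2 for x, t in samples) / len(samples)

def train(train_set, valid_set, lr=0.5, batch_size=8, max_epochs=200, patience=10):
    weights = [0.0] * (len(train_set[0][0]) + 1)
    best_weights, best_loss, bad_epochs = list(weights), mean_loss(weights, valid_set), 0
    for epoch in range(max_epochs):
        random.shuffle(train_set)
        for batch in minibatches(train_set, batch_size):
            grad = [0.0] * len(weights)
            for x, t in batch:
                y = predict(weights, x)
                delta = (y - t) * y * (1.0 - y)  # gradient of 0.5 * (y - t)^2 w.r.t. z
                grad[0] += delta
                for d, xd in enumerate(x):
                    grad[d + 1] += delta * xd
            # One weight update per mini-batch, using the averaged gradient
            weights = [wi - lr * g / len(batch) for wi, g in zip(weights, grad)]
        loss = mean_loss(weights, valid_set)
        if loss < best_loss:
            best_weights, best_loss, bad_epochs = list(weights), loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # early stopping
                break
    return best_weights, best_loss

# Toy data (assumed): two features on very different scales
random.seed(1)
raw = [[random.uniform(0, 1000), random.uniform(-5, 5)] for _ in range(200)]
labels = [1.0 if r[0] > 500 else 0.0 for r in raw]
means, stds = zscore_params(raw[:150])
features = normalize(raw, means, stds)
data = list(zip(features, labels))
train_set, valid_set = data[:150], data[150:]
weights, valid_loss = train(train_set, valid_set)
print(valid_loss)
```

The normalization statistics come from the training rows only, so the validation set stays an independent check on generalisation.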
Problems with Backpropagation
• Multiple hidden Layers
• Get stuck in local optima
• start weights from random positions
• Slow convergence to optimum
• large training set needed
• Only use labeled data
• most data is unlabeled
Generative Approach
Restricted Boltzmann Machines
• Unsupervised
• Find complex regularities in training data
• Bipartite Graph
• visible, hidden layer
• Binary stochastic units
• On/Off with probability
$p(h_j = 1) = \frac{1}{1 + e^{-\sum_{i \in vis} v_i w_{ij}}}$
• 1 Iteration
• Update Hidden Units
• Reconstruct Visible Units
• Maximum Likelihood of training data
[Figure: bipartite graph with visible units i and hidden units j, connected by weights wij]
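As a small illustration of the binary stochastic units, the sketch below computes the probability that each hidden unit turns on and then samples its on/off state. The function names and the weight values are assumptions for the example; bias terms are omitted, as in the formula on the slide.

```python
import math
import random

def hidden_probabilities(v, w):
    """p(h_j = 1) = 1 / (1 + exp(-sum_i v_i * w_ij)) for every hidden unit j."""
    n_hidden = len(w[0])
    return [1.0 / (1.0 + math.exp(-sum(v[i] * w[i][j] for i in range(len(v)))))
            for j in range(n_hidden)]

def sample_binary(probabilities, rng):
    """Turn each unit on (1) with its probability, otherwise off (0)."""
    return [1 if rng.random() < p else 0 for p in probabilities]

rng = random.Random(0)
v = [1, 0, 1, 1]                                          # visible states
w = [[0.5, -1.0], [0.2, 0.3], [-0.5, 2.0], [0.0, 1.0]]    # w[i][j], illustrative values
p_h = hidden_probabilities(v, w)
h = sample_binary(p_h, rng)
print(p_h, h)
```

Only the active visible units contribute to the sum, so a hidden unit's probability depends on which pixels are on and how strongly they are connected to it.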
Restricted Boltzmann Machines
• find latent factors of data set
• Training Goal: most probable reproduction of the input
• unsupervised data
• Adjust weights to get maximum probability of input data
$p(h_j = 1) = \frac{1}{1 + e^{-\sum_{i \in vis} v_i w_{ij}}}$
[Figure: bipartite graph with visible units i and hidden units j, connected by weights wij]
Training: Contrastive Divergence
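[Figure: at t=0 (data) the visible units i drive the hidden units j, giving <vi hj>0; at t=1 (reconstruction) the reconstructed visible units drive the hidden units again, giving <vi hj>1]
$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \right)$
• Start with a training vector on the visible units.
• Update all the hidden units in parallel.
• Update all the visible units in parallel to get a “reconstruction”.
• Update the hidden units again.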
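The four steps can be sketched as a single-sample update (often called CD-1). This is an illustrative sketch under assumptions: no bias terms, a made-up learning rate and toy data, and the final hidden update uses probabilities rather than sampled states, which is a common variant but not something the slide specifies.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(v, w):
    return [sigmoid(sum(v[i] * w[i][j] for i in range(len(v)))) for j in range(len(w[0]))]

def visible_probs(h, w):
    return [sigmoid(sum(h[j] * w[i][j] for j in range(len(h)))) for i in range(len(w))]

def cd1_update(v0, w, lr, rng):
    """One contrastive-divergence step on a single training vector."""
    h0 = sample(hidden_probs(v0, w), rng)    # t=0: hidden units driven by the data
    v1 = sample(visible_probs(h0, w), rng)   # reconstruction of the visible units
    h1 = hidden_probs(v1, w)                 # t=1: hidden units updated again
    for i in range(len(v0)):
        for j in range(len(h0)):
            # delta w_ij = lr * (<v_i h_j>^0 - <v_i h_j>^1)
            w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
    return v1

# Toy training data (assumed): two binary patterns, 4 visible and 2 hidden units
rng = random.Random(42)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
w = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(4)]
for epoch in range(200):
    for v0 in data:
        cd1_update(v0, w, lr=0.1, rng=rng)
```

The update raises weights between units that are on together in the data and lowers them for units that are on together in the reconstruction, so the weights stop changing once reconstructions match the data on average.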
Example: Handwritten 2s
50 binary neurons that learn features from 16 x 16 pixel images.
• Data (reality): increment weights between an active pixel and an active feature.
• Reconstruction: decrement weights between an active pixel and an active feature.
The final 50 x 256 weights: Each unit grabs a different feature
Example: Reconstruction
• New test image from the digit class that the model was trained on: data, and reconstruction from activated binary features.
• Image from an unfamiliar digit class: data, and reconstruction from activated binary features. The network tries to see every image as a 2.
Deep Architecture
• Backpropagation, RBM as building blocks
• Multiple hidden layers
• Motivation (why go deep?)
• Approximate complex decision boundary
• Fewer computational units for
same functional mapping
• Hierarchical Learning
• Increasingly complex features
• work well in different domains
• Vision, Audio, …
Hierarchical Learning
• Natural progression from
low level to high level
structure as seen in natural
complexity
• Easier to monitor what is
being learnt and to guide
the machine to better
subspaces
Stacked RBMs
[Figure: train the first RBM (v, W1, h1) first; copy the binary state of h1 for each v; then train the second RBM (h1, W2, h2); compose the two RBM models to make a single DBN model (v, W1, h1, W2, h2)]
• First learn one layer at a time by stacking RBMs.
• Treat this as “pre-training” that finds a good initial set of weights which can then be fine-tuned by a local search procedure.
• Backpropagation can be used to fine-tune the model to be better at discrimination.
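The layer-by-layer pre-training can be sketched as a loop that trains one RBM, copies the binary hidden states, and uses them as data for the next RBM. Everything named here (`train_rbm`, `pretrain_stack`, the layer sizes, the toy data, the hyperparameters) is a hypothetical illustration; biases and the later backpropagation fine-tuning are left out.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def up(v, w):
    """Hidden probabilities given visible states."""
    return [sigmoid(sum(v[i] * w[i][j] for i in range(len(v)))) for j in range(len(w[0]))]

def down(h, w):
    """Visible probabilities given hidden states."""
    return [sigmoid(sum(h[j] * w[i][j] for j in range(len(h)))) for i in range(len(w))]

def train_rbm(data, n_hidden, epochs, lr, rng):
    """Train one RBM with single-step contrastive divergence; return its weights."""
    n_visible = len(data[0])
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
    for _ in range(epochs):
        for v0 in data:
            h0 = sample(up(v0, w), rng)
            v1 = sample(down(h0, w), rng)
            h1 = up(v1, w)
            for i in range(n_visible):
                for j in range(n_hidden):
                    w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
    return w

def pretrain_stack(data, hidden_sizes, epochs, lr, rng):
    """Greedy pre-training: each RBM's hidden states become the next RBM's data."""
    weights = []
    layer_input = data
    for n_hidden in hidden_sizes:
        w = train_rbm(layer_input, n_hidden, epochs, lr, rng)
        weights.append(w)
        # Copy the binary hidden state for each input vector
        layer_input = [sample(up(v, w), rng) for v in layer_input]
    return weights

# Toy data (assumed): 6 visible units, then hidden layers of 4 and 2 units
rng = random.Random(0)
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1]]
weights = pretrain_stack(data, hidden_sizes=[4, 2], epochs=50, lr=0.1, rng=rng)
w1, w2 = weights
```

The returned matrices would serve as the initial weights of a feed-forward network with the same layer sizes, which backpropagation could then fine-tune on labeled data.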
Uses
Dimensionality reduction
• Use a stacked RBM as deep auto-encoder
1. Train RBM with images as input & output
2. Limit one layer to few dimensions
→ Information has to pass through middle layer
Dimensionality reduction
Olivetti face data, 25x25 pixel images reconstructed from 30 dimensions (625 → 30)
[Figure: rows of face images: Original, Deep RBM, PCA]
Dimensionality reduction
804’414 Reuters news stories, reduction to 2 dimensions
[Figure: 2-D maps of the stories: PCA vs. Deep RBM]
Uses
Classification
Unlabeled data
Unlabeled data is readily available
Example: Images from the web
1. Download 10’000’000 images
2. Train a 9-layer DNN
3. Concepts are formed by DNN
→ 70% better than previous state of the art
Building High-level Features Using Large Scale Unsupervised Learning
Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng
Uses
AI
Artificial intelligence
Enduro, Atari 2600
Expert player: 368 points
Deep Learning: 661 points
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
Uses
Generative (Demo)
How to use it
• Home page of Geoffrey Hinton
https://www.cs.toronto.edu/~hinton/
• Portal
http://deeplearning.net/
• Accord.NET
http://accord-framework.net/
