Deep Learning: Back To The Future
Hinton NIPS 2012 Talk Slide (More Or Less)
What was hot in 1987? Neural networks
What happened in ML since 1987? Computers got faster; larger data sets became available
What is hot 25 years later? Neural networks
… but they are informed by graphical models!
Brief History Of Machine Learning
1960s Perceptrons
1969 Minsky & Papert book
1985-1995 Neural Nets and Back Propagation
1995- Support-Vector Machines
2000- Bayesian Models
2013- Deep Networks
What My Lecture Looked Like In 1987
The Limitations Of Two Layer Networks
Many problems can’t be learned without a layer of intermediate or hidden units.
Problem: where does the training signal come from?
Teacher specifies target outputs, not target hidden unit activities.
If you could learn input->hidden and hidden->output connections, you could learn new representations!
But how do hidden units get an error signal?
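Back propagation answers this question: the output error is passed backward through the hidden-to-output weights, giving each hidden unit its own error signal. A minimal numpy sketch on the toy XOR task (layer sizes, learning rate, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy problem: XOR, which cannot be learned without hidden units.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)   # hidden -> output

lr, losses = 1.0, []
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                 # hidden activities
    y = sigmoid(h @ W2 + b2)                 # output
    losses.append(float(np.mean((y - t) ** 2)))
    delta_out = (y - t) * y * (1 - y)        # error signal at the output
    # The hidden error signal: delta_out passed back through the
    # hidden->output weights via the chain rule -- back propagation.
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ delta_out; b2 -= lr * delta_out.sum(0)
    W1 -= lr * X.T @ delta_hid; b1 -= lr * delta_hid.sum(0)
```

Note that the hidden units never see target activities; their training signal is derived entirely from the output error.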
Why Stop At One Hidden Layer?
E.g., vision hierarchy for recognizing handprinted text:
Word (output layer)
Character (hidden layer 3)
Stroke (hidden layer 2)
Edge (hidden layer 1)
Pixel (input layer)
Demos
Yann LeCun’s LeNet5 http://yann.lecun.com/exdb/lenet/index.html
Why Deeply Layered Networks Fail
Credit assignment problem
How is a neuron in layer 2 supposed to know what it should output until all the neurons above it do something sensible?
How is a neuron in layer 4 supposed to know what it should output until all the neurons below it do something sensible?
Mathematical manifestation: error gradients get squashed as they are passed back through a deep network
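A small illustration of the squashing (depth, width, and weight scale are arbitrary choices for the demo): each backward step multiplies the gradient by the sigmoid derivative, which is at most 0.25, so the gradient norm collapses as it travels toward the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, depth = 50, 10
x = rng.normal(size=n)
Ws = [rng.normal(0, 1.0 / np.sqrt(n), (n, n)) for _ in range(depth)]

# Forward pass, remembering each layer's activations.
acts, z = [], x
for W in Ws:
    z = sigmoid(W @ z)
    acts.append(z)

# Backward pass: each layer multiplies the gradient by sigma'(z) <= 0.25
# (times the weights), so its norm shrinks layer by layer.
grad, norms = np.ones(n), []
for W, h in zip(reversed(Ws), reversed(acts)):
    grad = W.T @ (grad * h * (1 - h))
    norms.append(float(np.linalg.norm(grad)))
```

Printing `norms` shows the gradient magnitude decaying roughly geometrically with depth.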
Solution
Traditional method of training: random initial weights
Alternative Do unsupervised learning layer by layer to get weights in a sensible configuration for the statistics of the input.
Then when net is trained in a supervised fashion, credit assignment will be easier.
Autoencoder Networks
Self-supervised training procedure
Given a set of input vectors (no target outputs)
Map input back to itself via a hidden layer bottleneck
How to achieve bottleneck?
Fewer neurons
Sparsity constraint
Information transmission constraint (e.g., add noise to units, or shut them off at random, a.k.a. dropout)
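The first bottleneck option, fewer neurons, can be sketched with the classic 8-3-8 encoder: reproduce each 8-bit one-hot input through only 3 hidden units. This is a minimal illustration (sizes, learning rate, and iteration count are illustrative), not a network from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Self-supervised: the input itself is the target; no labels needed.
X = np.eye(8)                                # eight one-hot input vectors
W_enc = rng.normal(0, 0.5, (8, 3)); b_enc = np.zeros(3)
W_dec = rng.normal(0, 0.5, (3, 8)); b_dec = np.zeros(8)

lr, losses = 1.0, []
for _ in range(20000):
    h = sigmoid(X @ W_enc + b_enc)           # encoder: input -> bottleneck code
    y = sigmoid(h @ W_dec + b_dec)           # decoder: code -> reconstruction
    losses.append(float(np.mean((y - X) ** 2)))
    d_out = (y - X) * y * (1 - y)
    d_hid = (d_out @ W_dec.T) * h * (1 - h)
    W_dec -= lr * h.T @ d_out; b_dec -= lr * d_out.sum(0)
    W_enc -= lr * X.T @ d_hid; b_enc -= lr * d_hid.sum(0)
```

With only 3 hidden units for 8 patterns, the network is forced to discover a compact (roughly binary) code for the inputs.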
Autoencoder Combines An Encoder And A Decoder
Stacked Autoencoders
[Figure: autoencoders trained one layer at a time; the trained layers are copied into a deep network]
Note that decoders can be stacked to produce a generative model of the domain
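The greedy layer-by-layer scheme can be sketched as follows: train one autoencoder on the raw input, then train a second autoencoder on the first one's hidden activities, then copy the encoder weights into a deep stack. A hedged sketch (the `train_autoencoder` helper, data, and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, steps=2000, lr=0.5):
    """Train one bottleneck autoencoder (biases omitted for brevity);
    return the encoder weights."""
    W_enc = rng.normal(0, 0.3, (X.shape[1], n_hidden))
    W_dec = rng.normal(0, 0.3, (n_hidden, X.shape[1]))
    for _ in range(steps):
        h = sigmoid(X @ W_enc)
        y = sigmoid(h @ W_dec)
        d_out = (y - X) * y * (1 - y)
        d_hid = (d_out @ W_dec.T) * h * (1 - h)
        W_dec -= lr * h.T @ d_out
        W_enc -= lr * X.T @ d_hid
    return W_enc

# Greedy stacking: each layer models the activities of the layer below.
X = rng.random((100, 16))                  # toy unlabeled data
W1 = train_autoencoder(X, 8)               # layer 1 on raw input
H1 = sigmoid(X @ W1)                       # layer-1 codes
W2 = train_autoencoder(H1, 4)              # layer 2 on layer-1 codes
deep_net = [W1, W2]                        # encoders copied into a deep net,
                                           # ready for supervised fine-tuning
```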
Neural Net Can Be Viewed As A Graphical Model
Deterministic neuron
P(y | x1, x2, x3, x4) = 1 if y = (1 + exp(-Σ_i w_i x_i))^-1, and 0 otherwise

Stochastic neuron
P(y | x1, x2, x3, x4): y = 1 with probability (1 + exp(-Σ_i w_i x_i))^-1, and y = 0 otherwise
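The two neuron types above differ only in how the logistic value is used; a minimal sketch (weights and inputs are made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, -1.0, 2.0, 0.3])       # hypothetical weights w1..w4
x = np.array([1.0, 0.0, 1.0, 1.0])        # hypothetical inputs x1..x4

p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # logistic of the summed input

y_deterministic = p                        # deterministic: output IS p
y_stochastic = float(rng.random() < p)     # stochastic: 1 w.p. p, else 0
```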
Boltzmann Machine (Hinton & Sejnowski, circa 1985)
Undirected graphical model
Each node is a stochastic neuron
Potential function defined on each pair of neurons
Algorithms were developed for doing inference for special cases of the architecture.
E.g., Restricted Boltzmann Machine
2 layers
Completely interconnected between layers
No connections within layer
Punch Line
Deep network can be implemented as a multilayer restricted Boltzmann machine
Sequential layer-to-layer training procedure
Training requires probabilistic inference
Update rule: ‘contrastive divergence’
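One contrastive-divergence update (CD-1) for an RBM can be sketched as below. This is a toy illustration (biases omitted, sizes and learning rate made up), not Hinton's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    """Sample binary states: 1 with the given probabilities, else 0."""
    return (rng.random(p.shape) < p).astype(float)

# RBM: 2 layers, fully connected between layers, none within a layer.
n_vis, n_hid, lr = 6, 4, 0.1
W = rng.normal(0, 0.1, (n_vis, n_hid))

v0 = sample(np.full((10, n_vis), 0.5))     # toy batch of binary data

ph0 = sigmoid(v0 @ W)                      # P(h=1 | v0): inference up
h0 = sample(ph0)
pv1 = sigmoid(h0 @ W.T)                    # P(v=1 | h0): reconstruction down
v1 = sample(pv1)
ph1 = sigmoid(v1 @ W)                      # P(h=1 | v1)

# CD-1 update: <v h>_data minus <v h>_reconstruction, averaged over batch.
W += lr * (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
```

The restricted (bipartite) connectivity is what makes the up and down inference steps a single matrix multiply each.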
Different research groups prefer different neural substrates, but it doesn’t matter much whether you use a deterministic neural net or an RBM
Different Levels of Abstraction
From Ng’s group
Hierarchical Learning
Natural progression from low level to high level structure, as seen in natural complexity
Easier to monitor what is being learnt and to guide the machine to better subspaces
A good lower level representation can be used for many distinct tasks
Sutskever, Martens, Hinton (2011) Generating Text From A Deep Belief Net
Wikipedia The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. In the show’s agreement unanimously resurfaced. The wild pasteured with consistent street forests were incorporated by the 15th century BE. In 1996 the primary rapford undergoes an effort that the reserve conditioning, written into Jewish cities, sleepers to incorporate the .St Eurasia that activates the population. María Nationale, Kelli, Zedlat-Dukastoe, Florendon, Ptu’s thought is. To adapt in most parts of North America, the dynamic fairy Dan please believes, the free speech are much related to the
NYT while he was giving attention to the second advantage of school building a 2-for-2 stool killed by the Cultures saddled with a half- suit defending the Bharatiya Fernall ’s office . Ms . Claire Parters will also have a history temple for him to raise jobs until naked Prodiena to paint baseball partners , provided people to ride both of Manhattan in 1978 , but what was largely directed to China in 1946 , focusing on the trademark period is the sailboat yesterday and comments on whom they obtain overheard within the 120th anniversary , where many civil rights defined , officials said early that forms , ” said Bernard J. Marco Jr. of Pennsylvania , was monitoring New York
2013 News
No need to use unsupervised training or probabilistic models if…
You use clever tricks of the neural net trade, i.e.,
Back propagation with
deep networks
rectified linear units
dropout
weight maxima (an upper limit on each unit’s incoming weight norm)
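The two activation-side tricks, rectified linear units and dropout, can be sketched in one forward pass. A minimal illustration (sizes and keep-probability are made up; the train-time rescaling shown is the common “inverted dropout” variant):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Rectified linear unit: pass positives through, zero out negatives."""
    return np.maximum(0.0, z)

x = rng.normal(size=8)                  # toy input vector
W = rng.normal(0, 0.5, (8, 16))         # toy input -> hidden weights

h = relu(W.T @ x)                       # rectified linear hidden activities

keep = 0.5                              # probability a unit stays on
mask = rng.random(h.shape) < keep       # shut off units at random
h_train = h * mask / keep               # train time: drop and rescale
h_test = h                              # test time: use all units
```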
Krizhevsky, Sutskever, & Hinton
ImageNet competition
15M images in 22k categories
For contest, 1.2M images in 1k categories
Classification: can you name object in 5 guesses?
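The “5 guesses” criterion is top-5 classification: a prediction counts as correct if the true class is among the 5 highest-scoring classes. A minimal sketch with made-up scores:

```python
import numpy as np

# Toy scores over 8 classes for one image (made-up numbers).
scores = np.array([[0.02, 0.30, 0.05, 0.20, 0.15, 0.10, 0.08, 0.10]])
true_label = np.array([4])

top5 = np.argsort(-scores, axis=1)[:, :5]              # 5 best guesses
correct = np.any(top5 == true_label[:, None], axis=1)  # hit within 5 guesses?
```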
2012 Results
2013: Down to 11% error