
Deep Learning Tutorial
Mitesh M. Khapra
IBM Research India
(Ideas and material borrowed from
Richard Socher’s tutorial @ ML Summer School 2014
Yoshua Bengio’s tutorial @ ML Summer School 2014
& Hugo Larochelle’s lecture videos & slides)
Roadmap
• What?
• Why?
• How?
• Where?
Roadmap
• What are Deep Neural Networks?
• Why?
• How?
• Where?
Roadmap
• What are Deep Neural Networks?
• Why should I be interested in Deep Learning?
• How?
• Where?
Roadmap
• What are Deep Neural Networks?
• Why should I be interested in Deep Learning?
• How do I make a Deep Neural Network work?
• Where?
Roadmap
• What are Deep Neural Networks?
• Why should I be interested in Deep Learning?
• How do I train a Deep Neural Network?
• Where?
Roadmap
• What are Deep Neural Networks?
• Why should I be interested in Deep Learning?
• How do I train a Deep Neural Network?
• Where can I find additional material?
the what?
A typical machine learning example
feature extraction: number of positive words, number of negative words, length of review,
author name, bag of words, etc. are collected into a feature vector

data                         label
x1 = (1, 0, 0, 1, 0, 1)      y1 = 1
x2 = (0, 0, 1, 1, 0, 1)      y2 = 0
x3 = (1, 0, 1, 1, 0, 1)      y3 = 1
x4 = (0, 0, 1, 0, 1, 1)      y4 = 0
A typical machine learning example
data                         label
x1 = (1, 0, 0, 1, 0, 1)      y1 = 1
x2 = (0, 0, 1, 1, 0, 1)      y2 = 0
x3 = (1, 0, 1, 1, 0, 1)      y3 = 1
x4 = (0, 0, 1, 0, 1, 1)      y4 = 0

Assume y = f_w(x), for example f_w(x) = sigmoid(Σ_i w_i x_i)
Learn w such that f_w(x_i) is as close to y_i as possible
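To make the assumed model concrete, here is a minimal numpy sketch (not part of the original slides; the weight values are arbitrary placeholders, not learned) of computing f_w(x) = sigmoid(Σ_i w_i x_i) on the toy feature vectors above:

```python
import numpy as np

def f_w(x, w):
    # f_w(x) = sigmoid(sum_i w_i * x_i)
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

# The four toy feature vectors and labels from the slide above.
X = np.array([[1, 0, 0, 1, 0, 1],
              [0, 0, 1, 1, 0, 1],
              [1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]], dtype=float)
y = np.array([1, 0, 1, 0])

# Placeholder weights, just to show the forward computation.
w = np.array([2.0, 0.0, -1.0, 0.5, -1.5, 0.0])

for xi, yi in zip(X, y):
    print(f"f_w(x) = {f_w(xi, w):.2f}   true label = {yi}")
```

Learning w so that these outputs match the labels is exactly the goal stated on the slide.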
So, where does deep learning fit in?
• Machine Learning
– hand crafted features
– optimize weights to improve prediction
• Representation Learning
– automatically learn features
• Deep Learning
– automatically learn multiple levels of features
(from Richard Socher's tutorial @ ML Summer School, Lisbon)
The basic building block
a(x) = b + Σ_i w_i x_i
h(x) = sigmoid(b + Σ_i w_i x_i)

[Figure: a single artificial neuron with inputs x1, x2, x3, weights w1, w2, w3 and bias b (constant input 1), producing a(x).]

w_i = weight
b = bias
a(x) = pre-activation
h(x) = activation

Goal: given N (x, y) pairs, learn w, b such that h(x_j) is as close to y_j as possible
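As an illustration of this goal, the following is a hedged sketch (assuming numpy; the learning rate and step count are arbitrary choices) of learning w and b for a single sigmoid neuron by gradient descent on the toy (x, y) pairs from the earlier slide:

```python
import numpy as np

def forward(X, w, b):
    a = X @ w + b                      # pre-activation: a(x) = b + sum_i w_i x_i
    return 1.0 / (1.0 + np.exp(-a))    # activation:     h(x) = sigmoid(a(x))

def train(X, y, lr=0.5, steps=2000):
    # Gradient descent on the cross-entropy loss; with a sigmoid output the
    # gradient with respect to the pre-activation is simply (h - y).
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        h = forward(X, w, b)
        grad = h - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy (x, y) pairs from the earlier slide.
X = np.array([[1, 0, 0, 1, 0, 1],
              [0, 0, 1, 1, 0, 1],
              [1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)

w, b = train(X, y)
print(np.round(forward(X, w, b), 2))   # h(x_j) should be close to [1, 0, 1, 0]
```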
Okay, so what can I use it for?
• For binary classification problems, by treating h(x) as p(y = 1 | x)
• Works when the data is linearly separable

(image from Hugo Larochelle's slides)

[Figure: the same single neuron (inputs x1, x2, x3, weights w1, w2, w3, bias b) applied to two tasks; in each case, if h(x) > 0.5 predict one class, else the other.]
Example 1: x = features from movie reviews, y = positive/negative
Example 2: x = features from movie reviews, y = male author/female author
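A minimal sketch of the decision rule above, assuming the activation h(x) has already been computed; the pre-activation value 0.8 is just an arbitrary example:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def predict(h, threshold=0.5):
    # Treat h(x) as p(y = 1 | x): predict class 1 when h(x) > 0.5, else class 0.
    return 1 if h > threshold else 0

h = sigmoid(0.8)            # activation of an example whose pre-activation is 0.8
print(h, "->", predict(h))  # roughly 0.69 -> 1
```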
What are its limitations?
• Fails when the data is not linearly separable…
(images from Hugo Larochelle's slides)
• …unless the input is suitably transformed:
x = (x1, x2)  becomes  x' = (AND(x1, NOT x2), AND(NOT x1, x2))
A neural network for XOR
Wait…, are you telling me that I will always have to meditate on the data and then decide the transformation/network?

No, definitely not. The XOR example is only to give the intuition. The key takeaway is that by adding more layers you can make the data separable.

[Figure: a multi-layered neural network for XOR. Inputs x1, x2 feed through weights W(1) into hidden units computing AND(x1, NOT x2) and AND(NOT x1, x2), which feed through W(2) into the output.]

Let's spend some more time in understanding this…
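To make the takeaway concrete, here is a small hedged sketch (assuming numpy; the weights are illustrative hand-set values, not the ones from the slides) of a two-layer network of threshold units that computes XOR, i.e. OR applied to AND(x1, NOT x2) and AND(NOT x1, x2):

```python
import numpy as np

def step(z):
    # Hard threshold, used instead of a sigmoid to keep the arithmetic exact.
    return (np.asarray(z) > 0).astype(int)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: h1 = AND(x1, NOT x2), h2 = AND(NOT x1, x2)
    W1 = np.array([[1, -1],
                   [-1, 1]])
    b1 = np.array([-0.5, -0.5])
    h = step(W1 @ x + b1)
    # Output layer: OR of the two hidden units
    w2 = np.array([1, 1])
    b2 = -0.5
    return step(w2 @ h + b2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))   # 0, 1, 1, 0
```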
Capacity of a multi-layer network
(graphs from Pascal Vincent's slides)

[Figure: a small two-layer network with hand-set weights and biases, mapping inputs x1, x2 through hidden units y1, y2 to an output z, together with graphs of the functions it computes.]
Capacity of a multi-layer network
(image from Pascal Vincent’s slides)
Capacity of a multi-layer network
In particular, we can find a separator for the XOR problem
(images from Pascal Vincent's slides)

Universal Approximation Theorem (Hornik, 1991): "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units"
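The following sketch (assuming numpy; it only illustrates the flavour of the theorem, it does not prove it) approximates a continuous function with a single hidden layer of sigmoid units. The hidden weights are random rather than trained and only the linear output weights are fitted; the error typically shrinks as hidden units are added:

```python
import numpy as np

rng = np.random.default_rng(0)

# A continuous target function on [-3, 3].
x = np.linspace(-3.0, 3.0, 200)[:, None]
y = np.sin(2.0 * x[:, 0]) + 0.5 * x[:, 0]

def approx_error(n_hidden):
    # Hidden layer with random (fixed) weights and sigmoid units; only the
    # linear output weights are fitted, by least squares.
    W = rng.normal(scale=3.0, size=(1, n_hidden))
    b = rng.normal(scale=3.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(x @ W + b)))        # hidden activations
    v, *_ = np.linalg.lstsq(H, y, rcond=None)     # fitted output weights
    return np.max(np.abs(H @ v - y))              # worst-case error

for n_hidden in (2, 10, 100):
    print(f"{n_hidden:3d} hidden units: max |f(x) - y| = {approx_error(n_hidden):.3f}")
```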
Let's take a minute here…

If "a single hidden layer neural network" is enough, then why go deeper?

[Figure: hand-crafted feature representations (x = (x1, x2) transformed into x' = (AND(x1, NOT x2), AND(NOT x1, x2)) before the weights W(1), W(2) are applied) contrasted with automatically learned feature representations (a wide hidden layer learning the features directly from x1, x2).]
Multiple layers = multiple levels of features
But why would I be interested in learning multiple levels of representations?

Let's see where the motivation comes from…

[Figure: a deep network with inputs x1, x2, x3, stacked weight matrices W(1), W(2), W(3), W(4) and output y.]
The brain analogy
(idea from Hugo Larochelle's slides)

[Figure: Layer 1, Layer 2 and Layer 3 representations of a face, built up from parts such as the nose, mouth and eyes to the whole face.]
YAWN!!!! Enough with the brain tampering.
Just tell me: why should I be interested in Deep Learning?
("Show Me the Money")
the why?
Used in a wide variety of applications
(from Y. Bengio's MLSS 2014 slides)
Industrial Scale Success Stories
[Figure: a graph linking industrial applications of deep learning: Speech Recognition, Object Recognition, Face Recognition, Cross-Language Learning, Machine Translation, Text Analytics.]

Disclaimer: some nodes and edges may be missing due to limited public knowledge.
Dramatic improvements reported in some cases.
Some more success stories
(from Y. Bengio's MLSS 2014 slides)
Let me see if I understand this correctly…
• Speech Recognition, Machine Translation, etc. are more than 50 years old
• Single artificial neurons have been around for more than 50 years

[Figure: a single artificial neuron (inputs x1, x2, x3, weights w1, w2, w3, bias b) next to a deep network with weights W(1) to W(4), labelled "50+ years?".]

No, even deep neural networks have been around for many, many years, but prior to 2006 training deep nets was unsuccessful.
So what has changed since 2006?
(from Y. Bengio's MLSS 2014 slides)

• New methods for unsupervised pre-training have been developed
• More efficient parameter estimation methods
• Better understanding of model regularization
• Faster machines and more data help DL more than other algorithms
the how?
recap
a(x) = b + Σ_i w_i x_i
h(x) = sigmoid(b + Σ_i w_i x_i)

[Figure: a single artificial neuron with inputs x1, x2, x3, weights w1, w2, w3 and bias b (constant input 1), producing a(x).]

w_i = weight
b = bias
a(x) = pre-activation
h(x) = activation

Goal: given N (x, y) pairs, learn w, b such that h(x_j) is as close to y_j as possible
Switching to slides corresponding to lecture 2 from Hugo Larochelle's course:
http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
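Lecture 2 covers training by backpropagation. As a bridge, here is a minimal numpy sketch of backpropagation for a network with one hidden layer, trained on the XOR data from earlier; this is not Larochelle's code, and the number of hidden units, learning rate and step count are arbitrary choices for this toy problem:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data: not linearly separable, so a hidden layer is needed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with a few sigmoid units.
W1 = rng.normal(scale=1.0, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1)); b2 = np.zeros(1)

lr = 1.0
for step in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)            # hidden activations
    out = sigmoid(h @ W2 + b2)          # output activation

    # Backward pass (cross-entropy loss with a sigmoid output)
    d_out = out - y                       # gradient w.r.t. output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)    # gradient w.r.t. hidden pre-activation

    # Gradient descent updates
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)

h = sigmoid(X @ W1 + b1)
print(np.round(sigmoid(h @ W2 + b2).ravel(), 2))   # should approach [0, 1, 1, 0]
```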
the where?
Some pointers to additional material
• http://deeplearning.net/
• http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html