Deep Learning
Bing-Chen Tsai
1/21
Outline
Neural networks
Graphical model
Belief nets
Boltzmann machine
DBN
Reference
Neural networks
Supervised learning
The training data consists of inputs together with their corresponding outputs.
Unsupervised learning
The training data consists of inputs without corresponding outputs.
Neural networks
Generative model
Models the joint distribution of inputs and outputs, P(x, y).
Discriminative model
Models the posterior probabilities, P(y | x).
[Figure: class-conditional densities P(x, y1) and P(x, y2) versus posterior probabilities P(y1 | x) and P(y2 | x)]
Neural networks
What is a neuron?
[Figure: a single unit with inputs x1, x2, weights w1, w2, bias b, and output y]
Linear neurons: $y = b + \sum_i x_i w_i$
Binary threshold neurons: $z = b + \sum_i x_i w_i$, $y = 1$ if $z \ge 0$, otherwise $y = 0$
Sigmoid neurons: $z = b + \sum_i x_i w_i$, $y = \dfrac{1}{1 + e^{-z}}$
Stochastic binary neurons: $z = b + \sum_i x_i w_i$, $p(y = 1) = \dfrac{1}{1 + e^{-z}}$
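A minimal NumPy sketch of the four neuron types above; the toy inputs, weights, and bias are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_neuron(x, w, b):
    # y = b + sum_i x_i * w_i
    return b + np.dot(x, w)

def binary_threshold_neuron(x, w, b):
    # z = b + sum_i x_i * w_i ;  y = 1 if z >= 0, else 0
    z = b + np.dot(x, w)
    return 1.0 if z >= 0 else 0.0

def sigmoid_neuron(x, w, b):
    # z = b + sum_i x_i * w_i ;  y = 1 / (1 + e^{-z})
    z = b + np.dot(x, w)
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_binary_neuron(x, w, b):
    # outputs 1 with probability p(y = 1) = 1 / (1 + e^{-z})
    p = sigmoid_neuron(x, w, b)
    return 1.0 if rng.random() < p else 0.0

x = np.array([0.5, -1.0])   # inputs x1, x2
w = np.array([1.0, 2.0])    # weights w1, w2
b = 0.1                     # bias b
print(linear_neuron(x, w, b), sigmoid_neuron(x, w, b), stochastic_binary_neuron(x, w, b))
```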
Neural networks
Two-layer neural networks (sigmoid neurons)
Back-propagation:
Step 1: Randomly initialize the weights and determine the output vector.
Step 2: Evaluate the gradient of an error function.
Step 3: Adjust the weights.
Repeat steps 1-3 until the error is low enough.
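A minimal sketch of these steps for a two-layer network of sigmoid neurons, assuming a squared-error function and a toy XOR-style dataset; the layer sizes, learning rate, and epoch count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 examples, 2 inputs, 1 target (XOR), purely illustrative.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1: randomly initialize the weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (2, 3)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(0, 0.5, (3, 1)), np.zeros(1)   # hidden -> output

lr = 1.0
for epoch in range(5000):
    # Step 1 (continued): determine the output vector (forward pass).
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # Step 2: evaluate the gradient of the error function E = 0.5 * sum (Y - T)^2.
    dY = (Y - T) * Y * (1 - Y)          # error signal at the output layer
    dH = (dY @ W2.T) * H * (1 - H)      # error signal back-propagated to the hidden layer

    # Step 3: adjust the weights; repeat until the error is low enough.
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

print(np.round(Y, 2))  # should move toward the targets T
```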
Neural networks
Back-propagation is not good for deep learning:
It requires labeled training data, but most data is unlabeled.
The learning time is very slow in networks with multiple hidden layers.
It can get stuck in poor local optima, and for deep nets these are often far from optimal.
Instead, learn P(input) rather than P(output | input).
What kind of generative model should we learn?
Graphical model
A graphical model is a probabilistic model for which a graph denotes the conditional dependence structure between random variables.
In this example: D depends on A, D depends on B, D depends on C, C depends on B, and C depends on D.
Graphical model
Directed graphical model (nodes A, B, C, D):
$P(A, B, C, D) = P(A)\, P(B \mid A)\, P(C \mid A)\, P(D \mid B, C)$
Undirected graphical model (cliques {A, B, C} and {B, C, D}):
$P(A, B, C, D) = \dfrac{1}{Z}\, \varphi(A, B, C)\, \varphi(B, C, D)$
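A small sketch of how the directed factorization above assembles the joint distribution, using made-up conditional probability tables for binary A, B, C, D; the numbers are illustrative.

```python
import itertools

# Illustrative conditional probability tables for binary variables.
P_A = {1: 0.6, 0: 0.4}
P_B_given_A = {a: {1: p, 0: 1 - p} for a, p in {0: 0.3, 1: 0.7}.items()}
P_C_given_A = {a: {1: p, 0: 1 - p} for a, p in {0: 0.5, 1: 0.2}.items()}
P_D_given_BC = {(b, c): {1: p, 0: 1 - p}
                for (b, c), p in {(0, 0): 0.1, (0, 1): 0.4,
                                  (1, 0): 0.6, (1, 1): 0.9}.items()}

def joint(a, b, c, d):
    # P(A,B,C,D) = P(A) P(B|A) P(C|A) P(D|B,C)
    return (P_A[a] * P_B_given_A[a][b] * P_C_given_A[a][c]
            * P_D_given_BC[(b, c)][d])

total = sum(joint(*x) for x in itertools.product([0, 1], repeat=4))
print(round(total, 6))  # sums to 1, as a normalized directed model should
```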
Belief nets
A belief net is a directed acyclic graph composed of stochastic variables.
[Figure: a layer of stochastic hidden causes above a layer of visible units]
The units are stochastic binary neurons: $z = b + \sum_i x_i w_i$, $p(y = 1) = \dfrac{1}{1 + e^{-z}}$
This makes it a sigmoid belief net.
Belief nets
We would like to solve two problems:
The inference problem: infer the states of the unobserved variables.
The learning problem: adjust the interactions between variables to make the network more likely to generate the training data.
[Figure: stochastic hidden causes above visible units]
Belief nets
It is easy to generate a sample from P(v | h).
It is hard to infer P(h | v), because of explaining away.
[Figure: stochastic hidden causes above visible units]
Belief nets
Explaining away:
H1 and H2 are independent a priori, but they become dependent when we observe an effect V that they can both influence.
$P(H_1 \mid V)$ and $P(H_2 \mid V)$ are dependent.
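A tiny enumeration, with made-up priors and a noisy-OR-style likelihood, showing explaining away numerically: observing V makes H1 and H2 dependent, and observing H2 = 1 lowers the posterior for H1.

```python
import itertools

# Made-up priors and likelihood, purely illustrative.
P_H1 = {1: 0.1, 0: 0.9}
P_H2 = {1: 0.1, 0: 0.9}
def P_V1_given(h1, h2):           # P(V = 1 | H1, H2)
    return 1 - 0.99 * (1 - 0.8 * h1) * (1 - 0.8 * h2)

# Posterior P(H1, H2 | V = 1) by enumeration.
joint = {(h1, h2): P_H1[h1] * P_H2[h2] * P_V1_given(h1, h2)
         for h1, h2 in itertools.product([0, 1], repeat=2)}
Z = sum(joint.values())
post = {k: v / Z for k, v in joint.items()}

p_h1 = post[(1, 0)] + post[(1, 1)]                            # P(H1=1 | V=1)
p_h1_given_h2 = post[(1, 1)] / (post[(0, 1)] + post[(1, 1)])  # P(H1=1 | H2=1, V=1)
print(round(p_h1, 3), round(p_h1_given_h2, 3))  # differ, so H1 and H2 are dependent given V
```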
Belief nets
Some methods for learning deep belief nets:
Monte Carlo methods, but they are painfully slow for large, deep belief nets.
Learning with samples from the wrong distribution.
Use Restricted Boltzmann Machines.
Boltzmann Machine
It is an undirected graphical model with visible units and hidden units.
The energy of a joint configuration:
$-E(v, h) = \sum_{i \in \text{vis}} v_i b_i + \sum_{k \in \text{hid}} h_k b_k + \sum_{i<j} v_i v_j w_{ij} + \sum_{i,k} v_i h_k w_{ik} + \sum_{k<l} h_k h_l w_{kl}$
$p(v, h) = \dfrac{e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}$
$p(v) = \dfrac{\sum_h e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}$
Boltzmann Machine
An example of how weights define a distribution
[Figure: a small network with visible units v1, v2 and hidden units h1, h2; weights v1-h1 = +2, h1-h2 = -1, v2-h2 = +1; a table lists every configuration (v, h) with its -E, e^{-E}, p(v, h), and p(v)]
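A sketch that enumerates the small example above (weights v1-h1 = +2, h1-h2 = -1, v2-h2 = +1, all biases assumed zero), computes -E and e^{-E} for every configuration, and normalizes to get p(v, h) and p(v).

```python
import itertools
import numpy as np

# The small example: weights v1-h1 = +2, h1-h2 = -1, v2-h2 = +1, biases 0 (assumed).
w_v1h1, w_h1h2, w_v2h2 = 2.0, -1.0, 1.0

def neg_energy(v1, v2, h1, h2):
    # -E(v, h) with zero biases: only the three pairwise terms survive.
    return v1 * h1 * w_v1h1 + h1 * h2 * w_h1h2 + v2 * h2 * w_v2h2

configs = list(itertools.product([0, 1], repeat=4))       # all (v1, v2, h1, h2)
weights = {c: np.exp(neg_energy(*c)) for c in configs}    # e^{-E(v, h)}
Z = sum(weights.values())                                 # partition function

p_vh = {c: w / Z for c, w in weights.items()}             # p(v, h)
p_v = {}
for (v1, v2, h1, h2), p in p_vh.items():
    p_v[(v1, v2)] = p_v.get((v1, v2), 0.0) + p            # p(v) = sum_h p(v, h)

for v, p in sorted(p_v.items()):
    print(v, round(p, 3))
```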
Boltzmann Machine
A very surprising fact:
$\dfrac{\partial \log p(v)}{\partial w_{ij}} = \langle s_i s_j \rangle_v - \langle s_i s_j \rangle_{\text{model}}$
The left-hand side is the derivative of the log probability of one training vector v under the model.
$\langle s_i s_j \rangle_v$ is the expected value of the product of states at thermal equilibrium when v is clamped on the visible units; $\langle s_i s_j \rangle_{\text{model}}$ is the expected value at thermal equilibrium with no clamping.
$\Delta w_{ij} \propto \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$
Boltzmann Machines
Restricted Boltzmann Machine
We restrict the connectivity to make learning easier:
Only one layer of hidden units (we will deal with more layers later).
No connections between hidden units, which makes the updates more parallel.
[Figure: a layer of hidden units connected to the visible units]
Boltzmann Machines
The Boltzmann machine learning algorithm for an RBM:
Start with a training vector on the visible units at t = 0, then alternate between updating all the hidden units in parallel and updating all the visible units in parallel for t = 1, 2, ..., until the chain reaches thermal equilibrium (t = infinity).
$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty \right)$
Boltzmann Machines
Contrastive divergence: a very surprising short-cut
Start with a training vector on the visible units (t = 0, the data), update the hidden units, then update the visible units once to get a reconstruction (t = 1).
$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \right)$
This is not following the gradient of the log likelihood, but it works well.
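A minimal CD-1 sketch for a binary RBM on a toy dataset; the layer sizes, learning rate, and data are illustrative, and practical implementations add mini-batches, momentum, weight decay, and so on.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_vis, n_hid, lr = 6, 4, 0.1
W = rng.normal(0, 0.01, (n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

# Toy binary training data (rows are visible vectors), purely illustrative.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [1, 0, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1],
                 [0, 0, 1, 1, 1, 0]], dtype=float)

for epoch in range(1000):
    v0 = data
    # Positive phase: <v_i h_j>^0 with the data clamped on the visible units (t = 0).
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = sample(p_h0)
    # One step of alternating Gibbs sampling gives the reconstruction (t = 1).
    v1 = sample(sigmoid(h0 @ W.T + b_vis))
    p_h1 = sigmoid(v1 @ W + b_hid)
    # CD-1 update: delta w_ij = eps * (<v_i h_j>^0 - <v_i h_j>^1).
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(data)
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)

# Reconstruction probabilities after training: should resemble the training rows.
print(np.round(sigmoid(sample(sigmoid(data @ W + b_hid)) @ W.T + b_vis), 2))
```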
DBN
It is easy to generate a sample from P(v | h), but it is hard to infer P(h | v) because of explaining away.
[Figure: stochastic hidden causes above visible units]
Using an RBM to initialize the weights lets learning reach a good optimum.
DBN
Combining two RBMs to make a DBN:
Train this RBM first: v -- W1 -- h1.
Then train this RBM: h1 -- W2 -- h2, using as its data the binary state of h1 copied for each v.
Compose the two RBM models to make a single DBN model: v -- W1 -- h1 -- W2 -- h2.
It's a deep belief net!
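A sketch of this greedy layer-wise procedure: train the first RBM on the data, copy the binary hidden state for each v, and train the second RBM on those states. The `train_rbm` helper is a compressed version of the CD-1 sketch above; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def train_rbm(data, n_hid, lr=0.1, epochs=1000):
    """Train one binary RBM with CD-1 (illustrative helper, as in the previous sketch)."""
    n_vis = data.shape[1]
    W = rng.normal(0, 0.01, (n_vis, n_hid))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_hid); h0 = sample(p_h0)
        v1 = sample(sigmoid(h0 @ W.T + b_vis))
        p_h1 = sigmoid(v1 @ W + b_hid)
        W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(data)
        b_vis += lr * (v0 - v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1],
                 [1, 0, 1, 0, 1, 0]], dtype=float)

# Train this RBM first: v -- W1 -- h1.
W1, b_v, b_h1 = train_rbm(data, n_hid=4)
# Copy the binary hidden state for each v ...
h1_data = sample(sigmoid(data @ W1 + b_h1))
# ... then train this RBM: h1 -- W2 -- h2.
W2, _, b_h2 = train_rbm(h1_data, n_hid=3)
# Composing the two RBMs gives a single DBN model: v -- W1 -- h1 -- W2 -- h2.
```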
DBN
Why can we use an RBM to initialize belief-net weights?
An infinite sigmoid belief net is equivalent to an RBM.
[Figure: an infinite stack of layers ..., h2, v2, h1, v1, h0, v0, with the weight matrix W and its transpose W^T replicated between every pair of adjacent layers]
Inference in a directed net with replicated weights is trivial: we just multiply v0 by W transpose.
The model above h0 implements a complementary prior, so multiplying v0 by W transpose gives the product of the likelihood term and the prior term.
DBN
Complementary prior
[Figure: a Markov chain X1 -> X2 -> X3 -> X4]
A Markov chain is a sequence of variables $X_1, X_2, \dots$ with the Markov property
$P(X_t \mid X_1, \dots, X_{t-1}) = P(X_t \mid X_{t-1})$
A Markov chain is stationary if the transition probabilities do not depend on time:
$P(X_t = x' \mid X_{t-1} = x) = T(x \to x')$
$T(x \to x')$ is called the transition matrix.
If a Markov chain is ergodic, it has a unique equilibrium distribution:
$P_t(X_t = x) \to P_\infty(X = x)$ as $t \to \infty$
DBN
Most Markov chains used in practice satisfy detailed balance:
$P_\infty(X)\, T(X \to X') = P_\infty(X')\, T(X' \to X)$
e.g. Gibbs sampling, Metropolis-Hastings, slice sampling, ...
Such Markov chains are reversible: running the chain forward,
$P_\infty(X_1)\, T(X_1 \to X_2)\, T(X_2 \to X_3)\, T(X_3 \to X_4)$
is the same as running it backward,
$T(X_1 \leftarrow X_2)\, T(X_2 \leftarrow X_3)\, T(X_3 \leftarrow X_4)\, P_\infty(X_4)$
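A tiny numeric check of detailed balance for a made-up two-state chain: iterate the chain to its equilibrium distribution and verify that P_inf(X) T(X -> X') = P_inf(X') T(X' -> X); the transition matrix is illustrative.

```python
import numpy as np

# A made-up 2-state transition matrix T[x, x'] = T(x -> x'); rows sum to 1.
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Equilibrium distribution: iterate P_t until it stops changing.
p = np.array([1.0, 0.0])
for _ in range(1000):
    p = p @ T
print(p)  # P_infinity, roughly [0.75, 0.25]

# Detailed balance: P_inf(x) T(x -> x') == P_inf(x') T(x' -> x) for all pairs.
for x in range(2):
    for x_next in range(2):
        lhs = p[x] * T[x, x_next]
        rhs = p[x_next] * T[x_next, x]
        print(x, x_next, np.isclose(lhs, rhs))
```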
DBN
In the infinite sigmoid belief net with tied weights, adjacent layers are related by
$P(Y_k = 1 \mid X_{k+1}) = \sigma(W^T X_{k+1} + c)$
$P(X_k = 1 \mid Y_k) = \sigma(W Y_k + b)$
Reference
Deep Belief Nets, 2007 NIPS tutorial, G. Hinton
https://class.coursera.org/neuralnets-2012001/class/index
Machine learning lecture notes
http://en.wikipedia.org/wiki/Graphical_model