Contrastive Divergence Learning

Geoffrey E. Hinton
A discussion led by Oliver Woodford
Contents
• Maximum Likelihood learning
• Gradient descent based approach
• Markov Chain Monte Carlo sampling
• Contrastive Divergence
• Further topics for discussion:
  – Result biasing of Contrastive Divergence
  – Product of Experts
  – High-dimensional data considerations
Maximum Likelihood learning
• Given:
  – Probability model: p(x; Θ) = (1/Z(Θ)) f(x; Θ)
    • Θ – the model parameters
    • Z(Θ) – the partition function, defined as Z(Θ) = ∫ f(x; Θ) dx
  – Training data: X = {x_k}, k = 1…K
• Aim:
  – Find Θ that maximizes the likelihood of the training data:
      p(X; Θ) = ∏_{k=1}^{K} (1/Z(Θ)) f(x_k; Θ)
  – Or, equivalently, Θ that minimizes the negative log of the likelihood:
      E(X; Θ) = K log Z(Θ) − Σ_{k=1}^{K} log f(x_k; Θ)
• Toy example (known result):
      f(x; Θ) = exp(−(x − μ)²/(2σ²)),  Θ = {μ, σ},  Z(Θ) = σ√(2π)
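The closed-form estimates for the toy example (μ = ⟨x⟩_X, σ = √⟨(x − μ)²⟩_X, derived on the next slide) are easy to verify numerically. A minimal sketch, not from the slides, with illustrative parameter values:

```python
import numpy as np

# Draw synthetic "training data" from a known Gaussian, then recover the
# parameters with the closed-form maximum likelihood estimates.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=100_000)

mu_ml = X.mean()                               # mu = <x>_X
sigma_ml = np.sqrt(((X - mu_ml) ** 2).mean())  # sigma = sqrt(<(x - mu)^2>_X)

print(mu_ml, sigma_ml)  # both should land close to 2.0 and 1.5
```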
Maximum Likelihood learning
• Method:
  – At the minimum, ∂E(X; Θ)/∂Θ = 0:
      ∂E(X; Θ)/∂Θ = ∂log Z(Θ)/∂Θ − (1/K) Σ_{i=1}^{K} ∂log f(x_i; Θ)/∂Θ
                  = ∂log Z(Θ)/∂Θ − ⟨∂log f(x; Θ)/∂Θ⟩_X
    where ⟨·⟩_X is the expectation of · given the data distribution X.
  – Toy example:
      ∂E(X; Θ)/∂Θ = ∂log(σ√(2π))/∂Θ + ∂⟨(x − μ)²/(2σ²)⟩_X/∂Θ
      ∂E(X; Θ)/∂μ = −⟨x − μ⟩_X/σ² = 0  ⇒  μ = ⟨x⟩_X
      ∂E(X; Θ)/∂σ = 1/σ − ⟨(x − μ)²⟩_X/σ³ = 0  ⇒  σ = √⟨(x − μ)²⟩_X
  – Let’s assume that there is no analytical solution…
Gradient descent-based approach
– Move a fixed step size, η, in the direction of the steepest gradient. (Not a line search – see why later.)
– This gives the following parameter update equation:
    Θ_{t+1} = Θ_t − η ∂E(X; Θ_t)/∂Θ_t
            = Θ_t − η (∂log Z(Θ_t)/∂Θ_t − ⟨∂log f(x; Θ_t)/∂Θ_t⟩_X)
Gradient descent-based approach
– Recall Z(Θ) = ∫ f(x; Θ) dx. Sometimes this integral will be algebraically intractable.
– This means we can calculate neither E(X; Θ) nor ∂log Z(Θ)/∂Θ (hence no line search).
– However, with some clever substitution:
    ∂log Z(Θ)/∂Θ = (1/Z(Θ)) ∂Z(Θ)/∂Θ = (1/Z(Θ)) ∂/∂Θ ∫ f(x; Θ) dx
                 = (1/Z(Θ)) ∫ ∂f(x; Θ)/∂Θ dx
                 = (1/Z(Θ)) ∫ f(x; Θ) ∂log f(x; Θ)/∂Θ dx
                 = ∫ p(x; Θ) ∂log f(x; Θ)/∂Θ dx
                 = ⟨∂log f(x; Θ)/∂Θ⟩_{p(x;Θ)}
– So
    Θ_{t+1} = Θ_t − η (⟨∂log f(x; Θ_t)/∂Θ_t⟩_{p(x;Θ_t)} − ⟨∂log f(x; Θ_t)/∂Θ_t⟩_X)
  where ⟨·⟩_{p(x;Θ)} can be estimated numerically.
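The substitution can be checked numerically for the Gaussian toy model, where both sides are known in closed form: ∂log Z/∂σ = 1/σ, while ∂log f/∂σ = (x − μ)²/σ³. A sketch (the sample size and parameter values are illustrative, not from the slides):

```python
import numpy as np

# Verify  d log Z / d sigma  =  < d log f / d sigma >_{p(x; Theta)}
# for f(x; Theta) = exp(-(x - mu)^2 / (2 sigma^2)),  Z(Theta) = sigma * sqrt(2 pi).
mu, sigma = 2.0, 1.5
rng = np.random.default_rng(1)
x = rng.normal(mu, sigma, size=200_000)    # samples from p(x; Theta)

lhs = 1.0 / sigma                          # d log Z / d sigma, known analytically
rhs = ((x - mu) ** 2 / sigma ** 3).mean()  # Monte Carlo estimate of the expectation

print(lhs, rhs)  # the two values should agree to about two decimal places
```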
Markov Chain Monte Carlo sampling
– To estimate ⟨∂log f(x; Θ)/∂Θ⟩_{p(x;Θ)} we must draw samples from p(x; Θ).
– Since Z(Θ) is unknown, we cannot draw samples directly from a cumulative distribution curve.
– Markov Chain Monte Carlo (MCMC) methods turn random samples into samples from a proposed distribution, without knowing Z(Θ).
– Metropolis algorithm:
  • Perturb samples, e.g. x′_k = x_k + randn(size(x_k))
  • Reject x′_k if p(x′_k; Θ)/p(x_k; Θ) < rand(1) – a ratio in which Z(Θ) cancels
  • Repeat the cycle for all samples until the distribution stabilizes.
– Stabilization takes many cycles, and there is no accurate criterion for determining when it has occurred.
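The Metropolis cycle above can be sketched in a few lines of NumPy for the Gaussian toy model (an illustration: the proposal scale and cycle count are arbitrary choices, not from the slides). Note that only the ratio f(x′)/f(x) is ever evaluated, so Z(Θ) never appears:

```python
import numpy as np

def metropolis(f, x0, n_cycles, rng):
    """Run n_cycles of Metropolis updates on an array of samples x0,
    targeting the distribution proportional to the unnormalised f."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_cycles):
        x_prop = x + rng.standard_normal(x.shape)        # perturb every sample
        reject = f(x_prop) / f(x) < rng.random(x.shape)  # Z cancels in the ratio
        x = np.where(reject, x, x_prop)                  # keep old sample on rejection
    return x

# Toy model: f(x) = exp(-(x - mu)^2 / (2 sigma^2)); Z is never computed.
mu, sigma = 2.0, 1.5
f = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
samples = metropolis(f, rng.standard_normal(50_000), n_cycles=200, rng=rng)
print(samples.mean(), samples.std())  # should approach mu = 2.0, sigma = 1.5
```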
Markov Chain Monte Carlo sampling
– Let us use the training data, X^0_Θ, as the starting point for our MCMC sampling.
– Notation: X^0_Θ – the training data; X^n_Θ – the training data after n cycles of MCMC; X^∞_Θ – samples from the proposed distribution with parameters Θ.
– Our parameter update equation becomes:
    Θ_{t+1} = Θ_t − η (⟨∂log f(x; Θ_t)/∂Θ_t⟩_{X^∞_{Θ_t}} − ⟨∂log f(x; Θ_t)/∂Θ_t⟩_{X^0_{Θ_t}})
Contrastive divergence
– Let us make the number of MCMC cycles per iteration small, say even 1.
– Our parameter update equation is now:
    Θ_{t+1} = Θ_t − η (⟨∂log f(x; Θ_t)/∂Θ_t⟩_{X^1_{Θ_t}} − ⟨∂log f(x; Θ_t)/∂Θ_t⟩_{X^0_{Θ_t}})
– Intuition: 1 MCMC cycle is enough to move the data from the
target distribution towards the proposed distribution, and so
suggest which direction the proposed distribution should
move to better model the training data.
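Putting the pieces together, CD-1 learning of the Gaussian toy model can be sketched as follows (a hedged illustration, not from the slides: the learning rate, sample size, and iteration count are arbitrary; ∂log f/∂μ = (x − μ)/σ² and ∂log f/∂σ = (x − μ)²/σ³ follow from the toy f):

```python
import numpy as np

# CD-1: one Metropolis cycle moves the training data X^0 towards the current
# model (giving X^1); the difference of expectations drives the update.
rng = np.random.default_rng(3)
X0 = rng.normal(2.0, 1.5, size=20_000)  # "training data"

mu, sigma = 0.0, 1.0                    # deliberately wrong initial parameters
eta = 0.1
for _ in range(2000):
    # One Metropolis cycle starting from the data: X^0 -> X^1.
    f = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    prop = X0 + rng.standard_normal(X0.shape)
    reject = f(prop) / f(X0) < rng.random(X0.shape)
    X1 = np.where(reject, X0, prop)
    # <d log f / d Theta>_{X^1} - <d log f / d Theta>_{X^0}
    g_mu = (X1.mean() - X0.mean()) / sigma ** 2
    g_sigma = (((X1 - mu) ** 2).mean() - ((X0 - mu) ** 2).mean()) / sigma ** 3
    mu -= eta * g_mu
    sigma -= eta * g_sigma

print(mu, sigma)  # drifts towards the data parameters mu = 2.0, sigma = 1.5
```

Because X^1 is only one step from the data, each gradient is cheap and noisy, but repeated over many iterations it pulls {μ, σ} towards the data statistics (up to the CD bias discussed next).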
Contrastive divergence bias
– We assume:
    ∂E(X; Θ)/∂Θ ≈ ⟨∂log f(x; Θ)/∂Θ⟩_{X^1_Θ} − ⟨∂log f(x; Θ)/∂Θ⟩_{X^0_Θ}
– ML learning is equivalent to minimizing X^0_Θ‖X^∞_Θ, where
    P‖Q = ∫ p(x) log(p(x)/q(x)) dx
  (the Kullback-Leibler divergence).
– CD attempts to minimize X^0_Θ‖X^∞_Θ − X^1_Θ‖X^∞_Θ:
    ∂/∂Θ (X^0_Θ‖X^∞_Θ − X^1_Θ‖X^∞_Θ)
      = ⟨∂log f(x; Θ)/∂Θ⟩_{X^1_Θ} − ⟨∂log f(x; Θ)/∂Θ⟩_{X^0_Θ} − (∂X^1_Θ/∂Θ) · ∂(X^1_Θ‖X^∞_Θ)/∂X^1_Θ
– Usually (∂X^1_Θ/∂Θ) · ∂(X^1_Θ‖X^∞_Θ)/∂X^1_Θ ≈ 0, but it can sometimes bias results.
– See “On Contrastive Divergence Learning”, Carreira-Perpiñán & Hinton, AISTATS 2005, for more details.
Product of Experts
Dimensionality issues