Wake-Sleep algorithm for Representational Learning



Hamid Reza Maei, Physiol. & Neurosci. Program, University of Toronto

Motivation

The brain is able to learn underlying representations of its input data (e.g. images) in an unsupervised manner.

Challenges for neural networks:
1. They need a teacher to supply the desired output.
2. They need to train all the connections.
The Wake-Sleep algorithm avoids both of these problems.

[Diagram: layered belief network over data d, with visible/hidden layers (V/H), bottom-up recognition weights R1, R2 and top-down generative weights G1, G2.]

Logistic belief network

Each unit y_j in a layer Y receives top-down input from the units x_i in the layer X above through generative weights G_ij, and turns on with probability

p(y_j = 1 | x) = σ( Σ_i G_ij x_i + bias_j ), where σ(z) = 1 / (1 + e^(−z)).

Advantage: the conditional distributions are factorial, so given the layer above, every unit in a layer can be sampled independently.
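A minimal NumPy sketch of what the factorial property buys us: given the layer above, the whole layer below can be sampled with a single matrix multiply. The function name, layer shapes, and bias term are illustrative assumptions, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_layer_below(x, G, bias, rng=np.random.default_rng()):
    """Top-down ancestral sampling of one layer in a logistic belief net.

    x    : binary states of the layer above (X), shape (n_above,)
    G    : generative weights G_ij from X to Y, shape (n_above, n_below)
    bias : generative biases of layer Y, shape (n_below,)
    """
    p = sigmoid(x @ G + bias)                       # P(y_j = 1 | x), factorial over j
    return (rng.random(p.shape) < p).astype(float)  # sample each y_j independently
```

Generating a whole fantasy from the network is just a chain of such calls, one per layer, from the top down.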

Learning Generative weights

The inference needed for learning, i.e. computing the true posterior P(h|d; G) over the hidden states, is intractable.

Explaining away:

Sprinkler and Rain are conditionally dependent given that Wet is observed.

[Diagram: Sprinkler → Wet ← Rain]

Though it is very crude, let’s approximate the true posterior P(h|d; G) with a factorial distribution Q(h|d; R) defined by recognition weights R.

Any guarantee for the improvement of learning?

YES!

Using Jensen’s inequality we find a lower bound on the log likelihood:

log P(d; G) = log Σ_h P(h, d; G) ≥ Σ_h Q(h|d; R) log [ P(h, d; G) / Q(h|d; R) ] = −F(d; R, G)

where F(d; R, G) is the free energy. Thus, decreasing the free energy increases the lower bound and therefore the log likelihood. This leads to the wake phase.
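Spelling this out (my reconstruction in the notation above, not verbatim from the slides): the free energy can be written two ways,

F(d; R, G) = Σ_h Q(h|d; R) [ −log P(h, d; G) + log Q(h|d; R) ]
           = KL( Q(h|d; R), P(h|d; G) ) − log P(d; G),

so with Q held fixed, lowering F by changing the generative weights G pushes up the lower bound −F on log P(d; G).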

Wake phase

Reminder: the true posterior P(h|d; G) has been replaced by the factorial approximation Q(h|d; R).

1. Get samples (x°, y°) from the factorial distribution Q(h|d; R) (a bottom-up pass).
2. Use these samples in the generative model to change the generative weights.
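A minimal NumPy sketch of one wake-phase step for a single visible/hidden layer pair; the function name, learning rate eps, and bias handling are illustrative assumptions rather than the slides' exact delta rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wake_phase_update(d, R, G, b_vis, eps=0.05, rng=np.random.default_rng()):
    """One wake-phase step for a network with a single hidden layer.

    d     : binary data vector (visible layer), shape (n_vis,)
    R     : recognition weights, shape (n_vis, n_hid)
    G     : generative weights,  shape (n_hid, n_vis)
    b_vis : generative biases of the visible layer, shape (n_vis,)
    """
    # 1. Bottom-up pass: sample hidden states from the factorial Q(h|d; R).
    h = (rng.random(R.shape[1]) < sigmoid(d @ R)).astype(float)
    # 2. Top-down prediction of the data from those sampled hidden states.
    p = sigmoid(h @ G + b_vis)
    # Delta rule: make the generative model better at reconstructing d
    # from the hidden states the recognition model actually produced.
    G += eps * np.outer(h, d - p)
    b_vis += eps * (d - p)
    return h  # the hidden sample, usable for updating the layer above
```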

Learning recognition weights

The derivative of the free energy with respect to the recognition weights R gives a complicated expression that is computationally intractable. What should be done?!

Switch!

Switch the order of the two distributions inside the KL term (KL is not a symmetric function!). Change the recognition weights to minimize this switched free energy. This leads to the sleep phase.
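To make the switch explicit (again my paraphrase in the document's notation, not a slide formula): in the wake phase we hold R fixed and decrease KL( Q(h|d; R), P(h|d; G) ) − log P(d; G) by changing G; in the sleep phase we hold G fixed and decrease the average, over data d generated by the model itself, of KL( P(h|d; G), Q(h|d; R) ) by changing R. The two KL terms have their arguments in opposite orders, which is exactly why the sleep phase comes with no guarantee.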

Sleep Phase

Sleep phase:

1. Get samples (x●, y●) generated by the generative model, using data coming from nowhere (a top-down "dream" pass)!

2. Change the recognition connections using the corresponding delta rule.
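A matching NumPy sketch of one sleep-phase step, with the same illustrative conventions as the wake-phase sketch; b_hid is an assumed generative bias vector playing the role of the prior over the hidden layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sleep_phase_update(R, G, b_vis, b_hid, eps=0.05, rng=np.random.default_rng()):
    """One sleep-phase step: dream a fantasy (h, v) pair, then train R.

    R, G, b_vis as in the wake-phase sketch; b_hid are the generative
    biases (the prior) of the hidden layer.
    """
    # Top-down pass: generate a "dream" sample from the generative model
    # (data coming from nowhere -- no real data is used).
    h_dream = (rng.random(b_hid.shape) < sigmoid(b_hid)).astype(float)
    v_dream = (rng.random(b_vis.shape) < sigmoid(h_dream @ G + b_vis)).astype(float)
    # Delta rule: make the recognition model recover the hidden states
    # that actually generated the dreamed data.
    q = sigmoid(v_dream @ R)
    R += eps * np.outer(v_dream, h_dream - q)
    return v_dream, h_dream
```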

Any guarantee for improvement? (for sleep phase)

NO!

[Plots: wake-phase approximation vs. sleep-phase approximation, showing how the factorial Q fits the true posterior P in each case.]

- In the sleep phase we are minimizing KL(P, Q), which is the wrong thing to do!
- In the wake phase we are minimizing KL(Q, P), which is the right thing to do.

The wake-sleep algorithm

1. Wake phase:
- Use the recognition weights to perform a bottom-up pass, creating samples for the layers above (starting from the data).
- Train the generative weights using the samples obtained from the recognition model.
2. Sleep phase:
- Use the generative weights to reconstruct data by performing a top-down pass.
- Train the recognition weights using the samples obtained from the generative model.
[Diagram: layered network with recognition weights R1, R2 (bottom-up) and generative weights G1, G2 (top-down).]
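Putting the two phases together, a minimal self-contained sketch of the whole loop for one visible and one hidden layer; the layer sizes, learning rate, and random placeholder data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_vis, n_hid, eps = 16, 8, 0.05
R = 0.1 * rng.standard_normal((n_vis, n_hid))   # recognition weights
G = 0.1 * rng.standard_normal((n_hid, n_vis))   # generative weights
b_vis = np.zeros(n_vis)                         # generative biases, visible layer
b_hid = np.zeros(n_hid)                         # generative biases (prior), hidden layer
data = sample(np.full((500, n_vis), 0.3))       # placeholder binary "images"

for epoch in range(10):
    for d in data:
        # Wake phase: recognize bottom-up, then train the generative weights.
        h = sample(sigmoid(d @ R))
        p_v = sigmoid(h @ G + b_vis)
        G += eps * np.outer(h, d - p_v)
        b_vis += eps * (d - p_v)
        b_hid += eps * (h - sigmoid(b_hid))
        # Sleep phase: dream top-down, then train the recognition weights.
        h_dream = sample(sigmoid(b_hid))
        v_dream = sample(sigmoid(h_dream @ G + b_vis))
        R += eps * np.outer(v_dream, h_dream - sigmoid(v_dream @ R))
```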

What is the wake-sleep algorithm really trying to achieve?!

It turns out that the goal of the wake-sleep algorithm is to learn representations that are economical to describe; we can make this precise using Shannon’s coding theory.
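One way to make the coding-theory reading concrete (my hedged paraphrase of the bits-back argument, not a formula from the slides): the free energy is exactly the expected cost of describing d by first encoding the hidden states chosen by Q and then the data given those states, minus the bits that can be reclaimed from the randomness of Q:

F(d; R, G) = E_Q[ −log P(h; G) ]   (bits to describe h)
           + E_Q[ −log P(d|h; G) ] (bits to describe d given h)
           − H( Q(h|d; R) )        (bits back),

so learning representations that minimize free energy is the same as learning representations that are economical to describe.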

Simple example

Training:

1. For 4x4 images, we use a belief network with one visible layer and two hidden layers (binary neurons):
- The visible layer has 16 neurons.
- The first hidden layer (8 neurons) decides all possible orientations.
- The top hidden layer (1 neuron) decides between vertical and horizontal bars.
2. The network was trained on 2×10^6 random examples.

Hinton et al., Science (1995)

Wake-sleep algorithm on the 20 Newsgroups data set
- Contains about 20,000 articles.
- Many categories fall into overlapping clusters.
- We used the tiny version of this data set, with binary occurrence of 100 words across 16242 postings, which can be divided into 4 classes:
1. comp.*
2. sci.*
3. rec.*
4. talk.*

Training

• Visible layer: 100 visible units.
• First hidden layer: 50 hidden units.
• Second (top) hidden layer: 20 hidden units.

• For training we used 60% of the data (9745 training examples) and kept the rest for testing the model (6497 testing examples).


Just for fun!

Performance of the model for comp.* (class 1):
• 'windows', 'win', 'video', 'card', 'dos', 'memory', 'program', 'ftp', 'help', 'system'
• …
Performance of the model for talk.* (class 4):
• 'god', 'bible', 'jesus', 'question', 'christian', 'israel', 'religion', 'card', 'jews', 'email'
• 'world', 'jews', 'war', 'religion', 'god', 'jesus', 'christian', 'israel', 'children', 'food'
• …

Testing (classification)

1. Train two separate wake-sleep models on the two classes 1 and 4, i.e. comp.* and talk.* respectively.
2. Present the training examples from classes 1 and 4 to each of the two trained models and compute the free energy as a score under each model (a sketch of this scoring appears below).

[Plots: examples from classes 1 and 4 (class 1 marked) presented to the wake-sleep model trained on comp.* and to the model trained on talk.*, with the free-energy score under each model.]
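A minimal sketch of this free-energy scoring, using the same one-hidden-layer conventions as the earlier sketches; the single-sample estimate of F and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(d, R, G, b_vis, b_hid, rng=np.random.default_rng()):
    """Single-sample estimate of F(d) = E_Q[-log P(h,d;G)] - H(Q) for one hidden layer."""
    q = sigmoid(d @ R)                                   # factorial Q(h|d; R)
    h = (rng.random(q.shape) < q).astype(float)          # one recognition sample
    log_p_h = np.sum(h * np.log(sigmoid(b_hid)) + (1 - h) * np.log(sigmoid(-b_hid)))
    p_v = sigmoid(h @ G + b_vis)
    log_p_d = np.sum(d * np.log(p_v) + (1 - d) * np.log(1 - p_v))
    log_q_h = np.sum(h * np.log(q) + (1 - h) * np.log(1 - q))
    return -(log_p_h + log_p_d) + log_q_h

def classify(d, model_comp, model_talk):
    """Assign d to whichever trained model gives it the lower free energy."""
    return "comp.*" if free_energy(d, *model_comp) < free_energy(d, *model_talk) else "talk.*"
```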

Naïve Bayes classifier

Assumptions:
• P(c_j): the frequency of each class in the training examples (9745).
• Conditional independence assumption.
• Use Bayes’ rule.
• Learn the model parameters using maximum likelihood (e.g. for classes 1 and 4).
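A minimal sketch of such a Bernoulli naive Bayes classifier over the binary word-occurrence vectors; the Laplace smoothing and the variable names are my assumptions, not from the slides.

```python
import numpy as np

def train_naive_bayes(X, y):
    """X: binary word-occurrence matrix, shape (n_docs, n_words); y: class labels."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))  # P(c_j): class frequencies
    # P(word = 1 | class) by maximum likelihood, with Laplace smoothing (an assumption).
    p = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0) for c in classes])
    return classes, log_prior, np.log(p), np.log(1.0 - p)

def predict(x, classes, log_prior, log_p1, log_p0):
    """Bayes' rule under the conditional-independence assumption."""
    log_post = log_prior + (x * log_p1 + (1 - x) * log_p0).sum(axis=1)
    return classes[np.argmax(log_post)]
```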

Correct prediction on testing examples:

Present testing examples from classes 1 and 4 to the trained model and predict which class each belongs to.

80% correct prediction. Most probable words in each class:

• comp.*: 'windows', 'help', 'email', 'problem', 'system', 'computer', 'software', 'program', 'university', 'drive'
• talk.*: 'fact', 'god', 'government', 'question', 'world', 'christian', 'case', 'course', 'state', 'jews'

McCallum et al. (1998)

Conclusion

• Wake-sleep is an unsupervised learning algorithm.
• The higher hidden layers store the representations.
• Although we use very crude approximations, it works very well on some realistic data.
• Wake-sleep tries to make the representation economical to describe (Shannon’s coding theory).

Flaws of wake-sleep algorithm

• The sleep phase makes horrible assumptions (although it worked!):
- It minimizes KL(P || Q) rather than KL(Q || P).
- The recognition weights are trained not from the data space but from the dream space!

* Variational approximations.

Using complementary priors to eliminate explaining away

1. Because of explaining away, the posterior over the units in a hidden layer is correlated.
2. Remove the correlations in the hidden layers by using a complementary prior.

Do complementary priors exist? This is a very hard question, and the answer is not obvious!

But it is possible to remove the effect of explaining away using this architecture:

[Diagram: an infinite directed network with tied weights, with layers … H1, V1, H0, V0 (units h_i, v_j) connected by alternating weights G and G^T, continuing upward without end.]

Restricted Boltzmann Machine:

Inference is very easy because the conditional distributions are factorial.
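A minimal sketch of why RBM inference is easy: with no connections within a layer, the posterior over the hidden units factorizes, so it is computed exactly with one matrix multiply. The weight and bias names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_infer(v, W, b_hid, rng=np.random.default_rng()):
    """Exact factorial posterior P(h|v) of an RBM, plus one sample from it.

    v     : binary visible vector, shape (n_vis,)
    W     : weights, shape (n_vis, n_hid)
    b_hid : hidden biases, shape (n_hid,)
    """
    q = sigmoid(v @ W + b_hid)                      # P(h_i = 1 | v), independent per unit
    h = (rng.random(q.shape) < q).astype(float)     # one exact posterior sample
    return q, h
```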

Hinton et al., Neural Computation (2006); Hinton et al., Science (2006)