Transcript: Lecture 20

Today’s Topics
• Read: Chapters 7, 8, and 9 on
Logical Representation and Reasoning
• HW3 due at 11:55pm THURS (ditto for your Nannon Tourney Entry)
• Recipe for using Backprop to Train an ANN
• Adjusting the Learning Rate (η)
• The Momentum Term (β)
• Reducing Overfitting in ANNs
– Early Stopping
– Weight Decay
• Understanding Hidden Units
• Choosing the Number of Hidden Units
• ANNs as Universal Approximators
• Learning what an ANN has Learned
Using BP to Train ANNs
1. Initialize weights & biases to small random values (eg, in [-0.3, 0.3])
[Diagram: a three-layer network with input units indexed k, hidden units indexed j, and output units indexed i]
2. Randomize order of training examples
For each example, do:
a) Propagate activity forward to the output units:
$out_i = F\left( \sum_j w_{i,j} \times out_j \right)$
Using BP to Train ANNs (continued)
b) Compute ‘deviation’ for output units
$\delta_i = F'(net_i) \times (Teacher_i - out_i)$, where $F'(net) \equiv \frac{\partial F(net)}{\partial net}$
c) Compute ‘deviation’ for hidden units
$\delta_j = F'(net_j) \times \left( \sum_i w_{i,j} \times \delta_i \right)$
d) Update weights
$\Delta w_{i,j} = \eta \times \delta_i \times out_j$
$\Delta w_{j,k} = \eta \times \delta_j \times out_k$
Aside: The book (Fig 18.24) uses Δ instead of δ, g instead of F, and α instead of η.
Using BP to Train ANNs (concluded)
3. Repeat until training-set error rate small enough
Actually, should use early stopping
(ie, minimize error on the tuning set; more details later)
Some jargon: Each cycle through all
training examples is called an epoch
4. Measure accuracy on test set to estimate
generalization (future accuracy)
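Below is a minimal sketch of the recipe, assuming one hidden layer of sigmoid units and a tiny made-up dataset (the XOR data, layer sizes, and 5000-epoch limit are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(net):                       # the activation function F
    return 1.0 / (1.0 + np.exp(-net))

# Toy data (XOR) just to make the sketch runnable -- not from the lecture.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])
n_in, n_hid, eta = 2, 4, 0.25

# Step 1: small random initial weights and biases, eg in [-0.3, 0.3].
W_hid = rng.uniform(-0.3, 0.3, size=(n_hid, n_in))   # w_{j,k}: input k -> hidden j
b_hid = rng.uniform(-0.3, 0.3, size=n_hid)
W_out = rng.uniform(-0.3, 0.3, size=n_hid)           # w_{i,j}: hidden j -> single output i
b_out = rng.uniform(-0.3, 0.3)

for epoch in range(5000):                  # Step 3: repeat (each full pass is one epoch)
    for n in rng.permutation(len(X)):      # Step 2: randomize the order of examples
        x, teacher = X[n], T[n]
        # (a) forward pass: out_i = F(sum_j w_{i,j} * out_j)
        out_hid = sigmoid(W_hid @ x + b_hid)
        out_i = sigmoid(W_out @ out_hid + b_out)
        # (b) output deviation: delta_i = F'(net_i) * (teacher - out_i);
        #     for the sigmoid, F'(net) = out * (1 - out)
        delta_i = out_i * (1 - out_i) * (teacher - out_i)
        # (c) hidden deviations: delta_j = F'(net_j) * sum_i w_{i,j} * delta_i
        delta_hid = out_hid * (1 - out_hid) * (W_out * delta_i)
        # (d) weight (and bias) updates: delta_w = eta * delta * out
        W_out += eta * delta_i * out_hid
        b_out += eta * delta_i
        W_hid += eta * np.outer(delta_hid, x)
        b_hid += eta * delta_hid
```

Step 4 would then measure accuracy on a held-out test set; with early stopping (later slides), the epoch loop would instead stop when tuning-set error bottoms out.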
The Need for
Symmetry Breaking (if there are HUs)
Assume all weights are initially the same (the drawing below is a bit more general)
[Diagram: a network in which two hidden units are connected to the same inputs and to the same output unit]
Can the corresponding (mirror-image) weights ever differ?
NO
WHY? By symmetry: the two HUs are in identical environments, so they always receive identical updates.
Solution: randomize the initial weights (in, say, [-0.3, 0.3])
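A tiny sketch of the point above (NumPy, with tanh standing in for the activation F): if the hidden units start with identical weights they produce identical activations and hence identical updates, so only a random draw in [-0.3, 0.3] breaks the tie.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])            # an arbitrary 2-bit input (illustrative)

# Symmetric start: both hidden units have identical incoming weights, so they
# compute identical activations, receive identical deltas during backprop, and
# therefore get identical updates -- they can never become different.
W_symmetric = np.full((2, 2), 0.1)
print(np.tanh(W_symmetric @ x))     # both hidden activations are equal

# Symmetry breaking: small random initial weights, eg in [-0.3, 0.3].
W_random = rng.uniform(-0.3, 0.3, size=(2, 2))
print(np.tanh(W_random @ x))        # activations differ, so the units can specialize
```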
Choosing η (‘the learning rate’)
[Plot: error as a function of position in weight space. If η is too large, the steps overshoot and the error oscillates or increases; if η is too small, the error decreases only very slowly.]
Adjusting η On-the-Fly
0. Let η = 0.25
1. Measure the average error over k examples (call this E_before)
2. Adjust the weights according to the learning algorithm being used
3. Measure the average error on the same k examples (call this E_after)
Adjusting η (cont)
4. If E_after > E_before,
then η ← η × 0.99
else η ← η × 1.01
5. Go to 1
Note: k can be all training examples
but could be a subset
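A sketch of steps 0-5 as a helper; measure_error and take_learning_step are hypothetical stand-ins for whatever error measure and weight-update rule are being used:

```python
def adapt_learning_rate(eta, weights, examples, measure_error, take_learning_step):
    """One pass of the on-the-fly adjustment: shrink eta if the error got worse,
    grow it slightly if the error improved."""
    e_before = measure_error(weights, examples)            # step 1: E_before over k examples
    weights = take_learning_step(weights, examples, eta)   # step 2: apply the learner's update
    e_after = measure_error(weights, examples)             # step 3: E_after on the same k examples
    if e_after > e_before:                                 # step 4: adjust eta
        eta *= 0.99
    else:
        eta *= 1.01
    return eta, weights                                    # step 5: the caller loops back to step 1

eta = 0.25   # step 0
```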
Including a ‘Momentum’ Term
in Backprop
To speed up convergence, another term is often added to the weight-update rule:

$\Delta W_{i,j}(t) = -\eta \frac{\partial E}{\partial W_{i,j}} + \beta \, \Delta W_{i,j}(t-1)$

The second term is β times the previous change in this weight. Typically 0 < β < 1, with 0.9 a common choice.
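A sketch of the momentum update for one weight (or weight matrix); the gradient dE_dW and the stored previous change are assumed to come from the backprop pass:

```python
def momentum_update(W, dE_dW, prev_delta_W, eta=0.25, beta=0.9):
    """Weight change = plain gradient step plus beta times the previous change."""
    delta_W = -eta * dE_dW + beta * prev_delta_W   # Delta W(t) = -eta dE/dW + beta Delta W(t-1)
    return W + delta_W, delta_W                    # keep delta_W around for the next call
```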
Overfitting Reduction
Approach #1: Using Tuning Sets (Known as ‘Early Stopping’)
[Plot: error vs. training epochs for the train, tune, and test sets, each starting near 50% error. Training-set error keeps falling, but tune- and test-set error eventually rise again; the chosen ANN is the one at the epoch where tuning-set error is lowest, which approximates the ideal ANN to choose (where test-set error is lowest).]
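A sketch of early stopping with a tuning set; train_one_epoch, error_rate, and the patience window are illustrative assumptions:

```python
import copy

def train_with_early_stopping(net, train_set, tune_set,
                              train_one_epoch, error_rate,
                              max_epochs=1000, patience=20):
    """Keep the network whose tuning-set error was lowest, and stop once tuning
    error has not improved for `patience` consecutive epochs."""
    best_net, best_err, since_best = copy.deepcopy(net), float("inf"), 0
    for epoch in range(max_epochs):
        net = train_one_epoch(net, train_set)      # one pass over the training examples
        tune_err = error_rate(net, tune_set)       # monitor the tuning set, not the train set
        if tune_err < best_err:
            best_net, best_err, since_best = copy.deepcopy(net), tune_err, 0
        else:
            since_best += 1
            if since_best >= patience:             # tuning error has stopped improving
                break
    return best_net                                # the 'chosen ANN' in the plot above
```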
Overfitting Reduction
Approach #2: Minimizing a Cost Function
– Cost = Error Rate + Network Complexity
– Essentially what SVMs do (later)
$Cost = \text{Train-Set Error Rate} + \lambda \times \frac{1}{2} \sum_{weights} w^2$
Need to tune the parameter lambda
(so still use a tuning set)
Overfitting Reduction:
Weight Decay (Hinton ’86)
$\frac{\partial Cost}{\partial w_{i,j}} = \frac{\partial E}{\partial w_{i,j}} + \lambda \, w_{i,j}$   (the first term is the same backprop gradient as before)

So …

$\Delta w_{i,j} = \eta \times (teacher - output) \times out_j \; - \; \eta \times \lambda \times w_{i,j}$
Weights decay toward zero
Empirically improves generalization
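A sketch of the decayed update for one weight (or weight matrix); backprop_delta stands in for the usual (teacher − output) × out_j term, and the λ value is an arbitrary example:

```python
def weight_decay_update(w, backprop_delta, eta=0.25, lam=1e-4):
    """delta_w = eta * (teacher - output) * out_j  -  eta * lambda * w,
    where backprop_delta stands for the (teacher - output) * out_j part."""
    return w + eta * backprop_delta - eta * lam * w
```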
Four Views of What Hidden Units Do
(Not Necessarily Disjoint)
1. Transform the input space into a new space
where perceptrons suffice (relates to SVMs)
[Diagram: a network in which the output unit, a perceptron, operates on the transformed (hidden-unit) space]
2. Probabilistically represent ‘hidden features’
– constructive induction, predicate
invention, learning representations, etc
(construct new features out of those given;
‘kernels’ do this in SVMs)
Four Views of What Hidden Units Do
(Not Necessarily Disjoint)
3. Divide feature space into many subregions
[Figure: a 2-D feature space scattered with + and − examples; the hidden units carve the space into many small subregions, each of which can be labeled separately]
4. Provide a set of basis functions which can
be linearly combined (relates to SVMs)
How Many Hidden Units?
Historically, one hidden layer is used
– How many units should it contain?
• Too few: can’t learn
• Too many: poor generalization (the ‘conventional view’)
– Use a tuning set or cross-validation to select the number of hidden units (see the sketch below)
• Traditional approach (but no longer recommended)
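A sketch of that traditional recipe; the train and error_rate helpers and the candidate sizes are illustrative assumptions:

```python
def choose_num_hidden_units(train_set, tune_set, train, error_rate,
                            candidates=(1, 2, 5, 10, 25, 50, 100)):
    """Train one network per candidate size and keep the size whose
    tuning-set error rate is lowest."""
    scored = []
    for n_hidden in candidates:
        net = train(train_set, n_hidden)                 # train with this many HUs
        scored.append((error_rate(net, tune_set), n_hidden))
    return min(scored)[1]                                # size with the lowest tuning-set error
```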
Can One Ever Have Too
Many Hidden Units?
Evidence (Weigand, Caruana) suggests
that if ‘early stopping’ is used
– Generalization does not degrade as
number of hidden units  ∞
– Ie, use a tuning set to detect overfitting
(recall the ‘early stopping’ slide)
– Weigand gives an explanation in terms of the
‘effective number’ of HUs (analysis based on
principal components and eigenvectors)
ANNs as
‘Universal Approximators’
• Boolean Functions
– Need one layer of hidden units
to represent exactly
But note, what can be
REPRESENTED is
different from what can
be ‘easily’ LEARNED
• Continuous Functions
– Approximation to arbitrarily small error
with one (possibly quite ‘wide’) layer of hidden units
• Arbitrary Functions
– Any function can be approximated to arbitrary
precision with two layers of hidden units
Looking for Specific Boolean Inputs
(eg, memorize the POS examples)
[Diagram: every node uses a step function with bias (threshold) 0.99. A hidden unit that looks for "1001" has weight 1/2 on each input that must be 1 and weight −∞ on each input that must be 0; a hidden unit that looks for "1101" likewise uses weights of 1/3 and −∞. Each hidden unit therefore fires only on its own positive example. The output unit, with a weight of 1 from every hidden unit, becomes an ‘OR’ of all the positive examples looked for.]
Hence, with enough hidden units an ANN can ‘memorize’ the training data
But what about generalization?
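A sketch of the memorization construction above, using a large negative number in place of −∞ so the arithmetic stays finite:

```python
import numpy as np

def step(net, threshold=0.99):
    """Step activation: fire only if the weighted sum exceeds the 0.99 threshold."""
    return net > threshold

def memorize(pos_examples):
    """One hidden unit per positive example: weight 1/(# of 1-bits) on each input
    that must be 1, and a huge negative weight (standing in for -infinity) on each
    input that must be 0, so the unit fires only on its own example."""
    P = np.asarray(pos_examples, dtype=float)
    ones_per_row = P.sum(axis=1, keepdims=True)
    return np.where(P == 1, 1.0 / ones_per_row, -1e9)

def predict(W_hidden, x):
    hidden = step(W_hidden @ np.asarray(x, dtype=float))
    return float(hidden.any())          # output unit is an 'OR' of the hidden units

W = memorize([[1, 0, 0, 1], [1, 1, 0, 1]])   # the '1001' and '1101' detectors from the slide
print(predict(W, [1, 0, 0, 1]))   # 1.0 -- a memorized positive example
print(predict(W, [1, 0, 1, 1]))   # 0.0 -- anything else is rejected
```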
Understanding What a
Trained ANN has Learned
- Human ‘Readability’ of Trained ANNs is Challenging
Rule Extraction (Craven & Shavlik, 1996)
[Diagram: Training Examples → trained model (could be an ENSEMBLE of models) → Extraction Algorithm (TREPAN) → a human-readable decision tree]
Roughly speaking, train ID3 to learn the I/O behavior of the neural network. Note that we can generate as many labeled training examples as desired by forward-propagating through the trained ANN!
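A sketch of the basic idea (not the actual TREPAN algorithm): sample inputs, label them by forward-propagating through the trained network, and fit a small decision tree to mimic its I/O behavior. scikit-learn's tree learner stands in for ID3, and the uniform input sampling is an assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_tree(trained_ann_predict, n_features, n_samples=10_000, seed=0):
    """Fit a decision tree to examples labeled by the network itself; because the
    ANN is the labeling oracle, we can generate as many examples as we like."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_features))   # assumed input ranges
    y = (trained_ann_predict(X) > 0.5).astype(int)            # forward-prop, then threshold
    tree = DecisionTreeClassifier(max_depth=5)                # small, human-readable surrogate
    return tree.fit(X, y)
```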