Transcript: Lecture 20
Today’s Topics
• Read: Chapters 7, 8, and 9 on Logical Representation and Reasoning
• HW3 due at 11:55pm THURS (ditto for your Nannon Tourney Entry)
• Recipe for using Backprop to Train an ANN
• Adjusting the Learning Rate (η)
• The Momentum Term (β)
• Reducing Overfitting in ANNs
  – Early Stopping
  – Weight Decay
• Understanding Hidden Units
• Choosing the Number of Hidden Units
• ANNs as Universal Approximators
• Learning what an ANN has Learned

Using BP to Train ANNs
1. Initialize weights & biases to small random values (e.g., in [-0.3, 0.3])
   [Figure: a three-layer network, with input units indexed k, hidden units indexed j, and output units indexed i]
2. Randomize the order of the training examples; for each example do:
   a) Propagate activity forward to the output units:
      out_i = F( Σ_j w_i,j × out_j )

Using BP to Train ANNs (continued)
   b) Compute the ‘deviation’ for the output units:
      δ_i = F′(net_i) × (Teacher_i − out_i)
      [Figure: F(net_i) plotted against net_i, with F′ giving its slope]
   c) Compute the ‘deviation’ for the hidden units:
      δ_j = F′(net_j) × Σ_i ( w_i,j × δ_i )
   d) Update the weights:
      Δw_i,j = η × δ_i × out_j
      Δw_j,k = η × δ_j × out_k
   Aside: the book (Fig 18.24) uses Δ instead of δ, g instead of F, and α instead of η.

Using BP to Train ANNs (concluded)
3. Repeat until the training-set error rate is small enough
   Actually, one should use early stopping (i.e., minimize error on the tuning set; more details later)
   Some jargon: each cycle through all the training examples is called an epoch
4. Measure accuracy on the test set to estimate generalization (future accuracy)

The Need for Symmetry Breaking (if HUs)
Assume all weights are initially the same (the drawing on the slide is a bit more general).
Can the corresponding (mirror-image) weights ever differ? NO. Why? By symmetry – the two HUs sit in identical environments.
Solution: randomize the initial weights (in, say, [-0.3, 0.3])

Choosing η (‘the learning rate’)
[Figure: error plotted over weight space – with η too large, the error overshoots the minimum and oscillates; with η too small, the error decreases only very slowly]

Adjusting η On-the-Fly
0. Let η = 0.25
1. Measure the average error over k examples – call this E_before
2. Adjust the weights according to the learning algorithm being used
3. Measure the average error on the same k examples – call this E_after
4. If E_after > E_before, then η ← η × 0.99, else η ← η × 1.01
5. Go to 1
Note: k can be all the training examples, but it could also be a subset
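Below is a minimal runnable sketch (mine, not the lecture’s) of the backprop recipe above, applied to a tiny made-up dataset: weights are initialized in [-0.3, 0.3], example order is shuffled each epoch, deviations are computed for the output and hidden units, and η is adjusted on the fly once per epoch (taking k = all training examples, as the Note allows). The sigmoid choice of F, the layer sizes, and the toy XOR data are illustrative assumptions.

    import math
    import random

    def F(net):                                   # sigmoid activation (input clamped to avoid overflow)
        return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, net))))

    def F_prime(out):                             # sigmoid derivative, written in terms of its own output
        return out * (1.0 - out)

    random.seed(0)
    n_in, n_hid, n_out = 2, 3, 1
    eta = 0.25                                    # step 0 of 'Adjusting eta On-the-Fly'

    # Step 1: initialize weights & biases to small random values in [-0.3, 0.3]
    w_hid = [[random.uniform(-0.3, 0.3) for _ in range(n_in + 1)] for _ in range(n_hid)]
    w_out = [[random.uniform(-0.3, 0.3) for _ in range(n_hid + 1)] for _ in range(n_out)]

    examples = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]   # toy XOR data

    def forward(x):
        # Step 2a: propagate activity forward; the appended 1.0 is the bias input
        hid = [F(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in w_hid]
        out = [F(sum(w * v for w, v in zip(ws, hid + [1.0]))) for ws in w_out]
        return hid, out

    def avg_error():                              # average squared error over the k (= all) examples
        return sum(sum((t - o) ** 2 for t, o in zip(teach, forward(x)[1]))
                   for x, teach in examples) / len(examples)

    for epoch in range(2000):                     # each pass through all the examples is one epoch
        E_before = avg_error()
        random.shuffle(examples)                  # step 2: randomize the order of the examples
        for x, teacher in examples:
            hid, out = forward(x)
            # Step 2b: deviation for output units: delta_i = F'(net_i) * (Teacher_i - out_i)
            delta_out = [F_prime(o) * (t - o) for o, t in zip(out, teacher)]
            # Step 2c: deviation for hidden units: delta_j = F'(net_j) * sum_i w_i,j * delta_i
            delta_hid = [F_prime(h) * sum(w_out[i][j] * delta_out[i] for i in range(n_out))
                         for j, h in enumerate(hid)]
            # Step 2d: update weights: delta_w = eta * delta * (output of the sending unit)
            for i in range(n_out):
                for j, v in enumerate(hid + [1.0]):
                    w_out[i][j] += eta * delta_out[i] * v
            for j in range(n_hid):
                for k, v in enumerate(x + [1.0]):
                    w_hid[j][k] += eta * delta_hid[j] * v
        # 'Adjusting eta On-the-Fly', applied once per epoch
        eta *= 0.99 if avg_error() > E_before else 1.01

    print([(x, round(forward(x)[1][0], 2)) for x, _ in examples])

The clamp inside F is only a numeric-overflow guard for when η grows large; it is an implementation convenience, not part of the recipe.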
Including a ‘Momentum’ Term in Backprop
To speed up convergence, another term is often added to the weight-update rule:
   ΔW_i,j(t) = −η × ∂E/∂W_i,j + β × ΔW_i,j(t−1)
where ΔW_i,j(t−1) is the previous change in that weight. Typically 0 < β < 1; 0.9 is a common choice.

Overfitting Reduction Approach #1: Using Tuning Sets (Known as ‘Early Stopping’)
[Figure: error rate vs. number of training epochs for the train, tune, and test sets – the train-set error keeps falling while the tune- and test-set errors eventually rise again; the tuning set is used to decide when to stop, and the ‘chosen’ ANN (lowest tune-set error) is close to the ‘ideal’ one (lowest test-set error)]

Overfitting Reduction Approach #2: Minimizing a Cost Function
– Cost = Error Rate + Network Complexity
– Essentially what SVMs do (later)
   Cost = Train-Set Error Rate + (λ/2) × Σ_weights w²
Need to tune the parameter lambda (so still use a tuning set)

Overfitting Reduction: Weight Decay (Hinton ’86)
   Cost = E + (λ/2) × Σ w_i,j²   (the same cost function as above)
   ∂Cost/∂W_i,j = ∂E/∂W_i,j + λ × w_i,j
So …
   Δw_i,j = η × (teacher_i − output_i) × out_j − ηλ × w_i,j
Weights decay toward zero; empirically this improves generalization.

Four Views of What Hidden Units Do (Not Necessarily Disjoint)
1. Transform the input space into a new space where perceptrons suffice (relates to SVMs)
   [Figure: a perceptron]
2. Probabilistically represent ‘hidden features’ – constructive induction, predicate invention, learning representations, etc (construct new features out of those given; ‘kernels’ do this in SVMs)
3. Divide feature space into many subregions
   [Figure: feature space scattered with + and − examples, carved into many small subregions]
4. Provide a set of basis functions which can be linearly combined (relates to SVMs)

How Many Hidden Units?
Historically one hidden layer is used – how many units should it contain?
• Too few: can’t learn
• Too many: poor generalization (the ‘conventional view’)
Traditional approach (but no longer recommended): use a tuning set or cross-validation to select the number of hidden units
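As a concrete illustration of the last few slides (hypothetical code, not the lecture’s own), here is how the momentum term and weight decay change a single weight update, plus a skeleton of early stopping on a tuning set. The names net, net.weights, train_one_epoch, and error_on are assumed placeholders for whatever network object and routines you already have (e.g., the sketch after the backprop recipe above).

    import copy

    def weight_change(dE_dw, prev_change, w, eta=0.25, beta=0.9, lam=0.001):
        # Delta_w(t) = -eta * dE/dw             the usual gradient step
        #              + beta * Delta_w(t-1)    momentum: reuse the previous change (0 < beta < 1)
        #              - eta * lam * w          weight decay: pull the weight toward zero
        return -eta * dE_dw + beta * prev_change - eta * lam * w

    def train_with_early_stopping(net, train_set, tune_set, max_epochs=200, patience=10):
        # Keep the weights that did best on the TUNE set rather than the final weights,
        # and stop once the tune-set error has not improved for `patience` epochs.
        # (net.weights, net.train_one_epoch, net.error_on are hypothetical interfaces.)
        best_err = float("inf")
        best_weights = copy.deepcopy(net.weights)
        epochs_since_best = 0
        for epoch in range(max_epochs):
            net.train_one_epoch(train_set)        # one backprop pass over the training set
            err = net.error_on(tune_set)          # measure error on the held-out tuning set
            if err < best_err:
                best_err, best_weights, epochs_since_best = err, copy.deepcopy(net.weights), 0
            else:
                epochs_since_best += 1
                if epochs_since_best >= patience:
                    break                         # tune-set error has stopped improving
        net.weights = best_weights                # the 'chosen' ANN from the early-stopping figure
        return net

The lam value is just a placeholder; as the slide notes, lambda itself must be tuned, which is why a tuning set is still needed even with weight decay.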
Can One Ever Have Too Many Hidden Units?
Evidence (Weigend, Caruana) suggests that if ‘early stopping’ is used:
– Generalization does not degrade as the number of hidden units → ∞
– I.e., use the tuning set to detect overfitting (recall the ‘early stopping’ slide)
– Weigend gives an explanation in terms of the ‘effective number’ of HUs (an analysis based on principal components and eigenvectors)

ANNs as ‘Universal Approximators’
• Boolean functions – need one layer of hidden units to represent exactly
  (But note: what can be REPRESENTED is different from what can be ‘easily’ LEARNED)
• Continuous functions – can be approximated to arbitrarily small error with one (possibly quite ‘wide’) layer of hidden units
• Arbitrary functions – any function can be approximated to arbitrary precision with two layers of hidden units

Looking for Specific Boolean Inputs (e.g., memorize the POS examples)
Use bias = 0.99 for all nodes and assume ‘step function’ activations.
[Figure: one hidden unit has weight 1/2 on each of the two ‘1’ inputs and −∞ on the two ‘0’ inputs, so it fires only on "1001"; another has weight 1/3 on each of the three ‘1’ inputs and −∞ on the ‘0’ input, so it fires only on "1101". Each hidden unit feeds the output unit with weight 1, so the output becomes an ‘OR’ of all the positive examples looked for.]
Hence, with enough hidden units, an ANN can ‘memorize’ the training data – but what about generalization?

Understanding What a Trained ANN has Learned
Human ‘readability’ of trained ANNs is challenging.
Rule Extraction (Craven & Shavlik, 1996):
[Figure: training examples and the trained network – which could be an ENSEMBLE of models – feed into the extraction algorithm (TREPAN), which produces a decision tree]
Roughly speaking, train ID3 to learn the I/O behavior of the neural network – note that we can generate as many labeled training examples as desired by forward-propagating through the trained ANN!
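To make the ‘memorize the positive examples’ construction concrete, here is a small sketch (mine, not from the lecture) that treats the 0.99 bias as a firing threshold for step-function units and uses a large negative number in place of the −∞ weights; the helper names are made up for illustration.

    def step(net, threshold=0.99):                # step-function activation with a 0.99 threshold
        return 1 if net > threshold else 0

    NEG_INF = -1e9                                # stand-in for the -infinity weights in the figure

    def hidden_unit_weights(pattern):
        # weight 1/k on each '1' bit (k = number of 1s), -infinity on each '0' bit,
        # so the unit reaches the 0.99 threshold only when exactly that pattern is present
        ones = pattern.count("1")
        return [1.0 / ones if bit == "1" else NEG_INF for bit in pattern]

    def memorizing_net(positive_examples):
        hidden = [hidden_unit_weights(p) for p in positive_examples]
        def classify(bits):
            hid_out = [step(sum(w * b for w, b in zip(ws, bits))) for ws in hidden]
            return step(sum(hid_out))             # output unit: weight 1 from each hidden unit = an OR
        return classify

    net = memorizing_net(["1001", "1101"])        # the two patterns from the slide's figure
    print(net([1, 0, 0, 1]), net([1, 1, 0, 1]), net([1, 0, 1, 1]))   # prints: 1 1 0

With "1001" and "1101" as the positive examples, this net outputs 1 for exactly those two bit strings and 0 for everything else – the memorization behavior the slide warns may not generalize.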