Transcript: Lecture 19, Week 9 (11/3/15), CS 540 - Fall 2015 (Shavlik©)
Slide 1: Today's Topics
• Midterm class mean: 83.5
• HW3 due Thursday; HW4 out Thursday
• Turn in your BN Nannon player (in a separate, 'dummy' assignment) by a week from Thursday
• Weight Space (for ANNs)
• Gradient Descent and Local Minima
• Stochastic Gradient Descent
• Backpropagation
• The Need to Train the Biases, and a Simple Algebraic Trick
• Perceptron Training Rule and a Worked Example
• Case Analysis of the Delta Rule
• Neural 'Word Vectors'

Slide 2: Back to Prob Reasoning for Two Slides: the Base-Rate Fallacy
(https://en.wikipedia.org/wiki/Base_rate_fallacy)
This same issue arises in ML when we have many more negative than positive examples: the false positives overwhelm the true positives.
• Assume Disease A is rare: one person in a million has it
• Assume the population is 10 billion = 10^10, so about 10^4 people have Disease A
• Assume testForA is 99.99% accurate (ie, it is wrong on 0.01% of the people it is applied to)
• You test positive. What is the prob you have Disease A?
• Someone (not in cs540) might naively think prob = 0.9999
[Figure, not to scale: among the people for whom testForA = true, 9999 actually have Disease A, while about 10^6 do NOT have Disease A]
• So Prob(A | testForA) = 9999 / (9999 + 10^6) ≈ 0.01

Slide 3: A Major Weakness of BNs
(I also copied this and the prev slide to an earlier lecture, for future cs540s)
• If there are many 'hidden' random vars (N binary vars, say), then the marginalization formula leads to many calls to a BN (2^N in our example; for N = 20, 2^20 = 1,048,576)
• Using uniform-random sampling to estimate the result is too inaccurate, since most of the probability might be concentrated in only a few 'complete world states'
• Hence, much research (beyond cs540's scope) on scaling up inference in BNs and other graphical models, eg via more sophisticated sampling (eg, MCMC)

Slide 4: WARNING! Some Calculus Ahead

Slide 5: No Calculus Experience? For HWs and the Final …
• Derivatives generalize the idea of SLOPE
• You only need to know how to calc the SLOPE of a line:
  d(mx + b)/dx = m
  // 'mx + b' is the algebraic form of a line
  // 'm' is the slope
  // 'b' is the y intercept (the value of y when x = 0)
• Two (distinct) points define a line

Slide 6: Weight Space
• Given a neural-network layout, the weights and biases are free parameters that define a space
• Each point in this weight space specifies a network; weight space is a continuous space we search
• Associated with each point is an error rate, E, over the training data
• Backprop performs gradient descent in weight space

Slide 7: Gradient Descent in Weight Space
[Figure: the total error on the training set plotted as a surface over weights W1 and W2; from the current wgt settings, the gradient ∂E/∂W points uphill, so we step in the opposite direction to reach the new wgt settings]

Slide 8: Backprop Seeks LOCAL Minima (in a continuous space)
[Figure: error on the train set plotted over weight space, with several local minima]
• Note: a local min might overfit the training data, so 'early stopping' is often used (later)
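A minimal sketch (not part of the lecture) of the picture on slides 7-8: gradient descent on a made-up one-weight error function E(w) that has several minima, stepping downhill with Δw = -η dE/dw. Different starting weights settle into different local minima.

# Toy illustration of gradient descent reaching a LOCAL minimum of a made-up error function.
import math

def E(w):                       # invented "error" as a function of a single weight
    return math.sin(3 * w) + 0.1 * w * w

def dE(w):                      # its derivative (the 1-D "gradient")
    return 3 * math.cos(3 * w) + 0.2 * w

def gradient_descent(w, eta=0.05, steps=200):
    for _ in range(steps):
        w = w - eta * dE(w)     # finite step downhill: delta_w = -eta * dE/dw
    return w

for start in (-2.0, 0.0, 2.0):  # three different starting points in weight space
    w_min = gradient_descent(start)
    print(f"start w = {start:+.1f}  ->  settles near w = {w_min:+.3f}, E = {E(w_min):.3f}")

Which minimum you end up in depends on where you start, which is one reason ensembles of ANNs trained from different random initial weights often find different local minima (slide 9).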
Slide 9: Local Min are Good Enough for Us!
• ANNs, including Deep Networks, make accurate predictions even though we are likely only finding local min
• The world could have been like this: [Figure: error on the train set over weight space where most minima are poor], ie, most min poor, hard to find a good min
• Note: ensembles of ANNs work well (they often find different local minima)

Slide 10: The Gradient-Descent Rule
The 'gradient' of E(w) is the vector of partial derivatives:
  ∇E(w) = [ ∂E/∂w_0, ∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_N ]
This is an (N+1)-dimensional vector (ie, the 'slope' in weight space).
Since we want to reduce errors, we want to go 'downhill', so we take a finite step in weight space against the gradient:
  Δw = -η ∇E(w)        or, per weight,        Δw_i = -η ∂E/∂w_i
('Δ' = the change to w; η is the step size)
[Figure: the step Δw taken downhill on the error surface over weights W1 and W2]

Slide 11: 'On Line' vs. 'Batch' Backprop
• Technically, we should look at the error gradient for the entire training set before taking a step in weight space ('batch' backprop)
• However, in practice we take a step after each example ('on-line' backprop)
  – Much faster convergence (learn after each example)
  – Called 'stochastic' gradient descent
  – Stochastic gradient descent is very popular at Google, etc, due to easy parallelism

Slide 12: 'On Line' vs. 'Batch' BP (continued)
• BATCH: add the Δw vectors for every training example (Δw_ex1 + Δw_ex2 + Δw_ex3 + …), then 'move' once in weight space
• ON-LINE (aka stochastic gradient descent): 'move' after each example
• Note: w_BATCH ≠ w_ON-LINE after the first step; the final locations in weight space need not be the same for BATCH and ON-LINE

Slide 13: BP Calculations (note: the textbook uses 'loss' instead of 'error')
Assume one layer of hidden units (std, non-deep topology), with output units indexed by i, hidden units by j, and inputs by k:
  1.  Error = ½ (Teacher_i - Output_i)^2
  2.        = ½ (Teacher_i - F( Σ_j W_i,j × Output_j ))^2
  3.        = ½ (Teacher_i - F( Σ_j W_i,j × F( Σ_k W_j,k × Output_k ) ))^2
Determine ∂Error/∂W_i,j (use equation 2) and ∂Error/∂W_j,k (use equation 3).
Recall Δw_x,y = -η (∂E/∂w_x,y).
See Sec 18.7.4 and Fig 18.24 in the textbook for the results (I won't ask you to derive them on the final).

Slide 14: Differentiating the Logistic Function (a 'soft' step-function)
  out_i = 1 / (1 + e^-( Σ_j w_j,i × out_j - Θ_i ))
  F'(wgt'ed in) = out_i (1 - out_i)
[Figure: F(wgt'ed in) rises from 0 to 1, passing through ½; its derivative F'(wgt'ed in) is bell-shaped with maximum ¼]
• Note: differentiating RLUs is easy! (use F' = 0 when the input equals the bias)
• Notice that in the flat tails of the logistic, even if the output is totally wrong, there is no (or very little) change in the weights

Slide 15: Gradient Descent for the Perceptron (for the simple case of linear output units)
  Error = ½ (T - o)^2      // T = teacher's answer (a constant wrt the weights), o = network's output
  ∂E/∂W_a = (T - o) ∂(T - o)/∂W_a = -(T - o) ∂o/∂W_a

Slide 16: Continuation of Derivation
Stick in the formula for the output, o = Σ_k w_k x_k:
  ∂E/∂W_a = -(T - o) ∂(Σ_k w_k x_k)/∂W_a = -(T - o) x_a
So, recalling ΔW_a = -η ∂E/∂W_a:
  ΔW_a = η (T - o) x_a
This is the Perceptron Rule, which we'll use for both LINEAR and STEP-FUNCTION activation. It is also known as the delta rule (and other names, with some variation in the calc).
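A minimal sketch (not provided in the lecture) contrasting the 'batch' and 'on-line' (stochastic) updating of slides 11-12, using the delta rule ΔW_a = η (T - o) x_a just derived for a linear output unit. The tiny two-input training set and the learning rate are made up for illustration.

# Delta rule for a single LINEAR output unit: o = w . x, with Error = 1/2 (T - o)^2.
# One pass of BATCH updating vs one pass of ON-LINE (stochastic) updating.

examples = [([3.0, -2.0], 1.0),          # (input vector x, teacher T), made-up data
            ([6.0,  1.0], 0.0),
            ([5.0, -3.0], 1.0)]
eta = 0.01

def output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))      # linear unit: o = w . x

def delta_w(w, x, T):
    o = output(w, x)
    return [eta * (T - o) * xi for xi in x]          # delta rule: dW_a = eta * (T - o) * x_a

# BATCH: sum the delta-w vectors over ALL examples, then move once in weight space.
w_batch = [0.0, 0.0]
total = [0.0, 0.0]
for x, T in examples:
    total = [t + d for t, d in zip(total, delta_w(w_batch, x, T))]
w_batch = [wi + t for wi, t in zip(w_batch, total)]

# ON-LINE: move after EACH example, so later updates see the already changed weights.
w_online = [0.0, 0.0]
for x, T in examples:
    w_online = [wi + d for wi, d in zip(w_online, delta_w(w_online, x, T))]

print("after one pass, BATCH  :", w_batch)
print("after one pass, ON-LINE:", w_online)          # generally a different point in weight space

As slide 12 notes, the two procedures generally end at different points in weight space, even on the same data.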
Slide 17: Node Biases
• Recall: a node's output is a weighted function of its inputs and a 'bias' term
[Figure: a unit with its inputs, its bias, and its output]
• These biases also need to be learned!

Slide 18: Training Biases (Θ's)
A node's output (assume a 'step function' for simplicity):
  1 if W1 X1 + W2 X2 + … + Wn Xn ≥ Θ
  0 otherwise
Rewriting:
  W1 X1 + W2 X2 + … + Wn Xn - Θ ≥ 0
  W1 X1 + W2 X2 + … + Wn Xn + Θ × (-1) ≥ 0
Here Θ plays the role of a weight and (-1) plays the role of an 'activation'.

Slide 19: Training Biases (cont.)
• Hence, add another input unit whose activation is always -1
• The bias is then just another weight! Eg, the threshold Θ simply becomes the weight on this always -1 input
[Figure: the bias Θ redrawn as the weight on an input fixed at -1]

Slide 20: Perceptron Example (assume a step function and use η = 0.1)
Train Set:
  X1   X2   Correct Output
   3   -2         1
   6    1         0
   5   -3         1
Network: inputs X1 and X2 with weights 1 and -3, plus a bias input fixed at -1 with weight 2.
Perceptron Learning Rule: ΔW_a = η (T - o) x_a
First example:
  Out = StepFunction(3 × 1 + (-2) × (-3) + (-1) × 2) = StepFunction(7) = 1
No wgt changes, since this is correct.

Slide 21: Perceptron Example (cont.)
Second example:
  Out = StepFunction(6 × 1 + 1 × (-3) + (-1) × 2) = StepFunction(1) = 1
The correct output is 0, so we need to update the weights.

Slide 22: Perceptron Example (cont.)
Weight updates for the second example (T = 0, o = 1, η = 0.1):
  W1:        1 + 0.1 × (0 - 1) × 6    = 1 - 0.6  =  0.4
  W2:       -3 + 0.1 × (0 - 1) × 1    = -3 - 0.1 = -3.1
  bias wgt:  2 + 0.1 × (0 - 1) × (-1) = 2 + 0.1  =  2.1

Slide 23: Pushing the Weights and Bias in the Correct Direction when Wrong
Assume TEACHER = 1 and ANN = 0; then some combo of the following holds:
  (a) the wgts on some positively valued inputs are too small
  (b) the wgts on some negatively valued inputs are too large
  (c) the 'bias' is too large
The opposite movements apply when TEACHER = 0 and ANN = 1.
[Figure: a unit's output as a step function of its wgt'ed sum and bias]

Slide 24: Case Analysis: ΔW_k = η (T - o) x_k
Assume Teach = 1, out = 0, η = 1.
(Note: 'bigger' means closer to +infinity and 'smaller' means closer to -infinity.)
Four cases, pos/neg input × pos/neg weight:
  Input vector:    1      -1       1      -1
  Old weights:     2      -4      -3       5
  New weights:    2+1    -4-1    -3+1     5-1
  Old vs new:    bigger  smaller  bigger  smaller
So the weighted sum will be LARGER (-2 vs 2).
Case for the BIAS (its input is always '-1'): a bias weight of 6 becomes 6 - 1, and -6 becomes -6 - 1, so the BIAS weight will be SMALLER.
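A minimal sketch (not provided in the lecture) that replays the worked example on slides 20-22: a step-function perceptron whose bias is handled with the always -1 input trick of slides 18-19, η = 0.1, and starting weights 1, -3, and 2 (the bias weight). On the second example it should print the updated weights 0.4, -3.1, and 2.1 from slide 22.

# Perceptron with a step-function output; the bias is just a weight on an input fixed at -1.
eta = 0.1

train_set = [((3, -2), 1),          # ((X1, X2), correct output), from slide 20
             ((6,  1), 0),
             ((5, -3), 1)]

weights = [1.0, -3.0, 2.0]          # [W1, W2, bias weight], the starting values on slide 20

def step(weighted_sum):             # output 1 iff the weighted sum (bias included) is >= 0
    return 1 if weighted_sum >= 0 else 0

def out(w, x1, x2):
    return step(w[0] * x1 + w[1] * x2 + w[2] * (-1))   # -1 is the bias unit's activation

for (x1, x2), teacher in train_set:
    o = out(weights, x1, x2)
    if o != teacher:                # perceptron rule: W_a <- W_a + eta * (T - o) * x_a
        for a, x_a in enumerate((x1, x2, -1)):
            weights[a] += eta * (teacher - o) * x_a
    print(f"example ({x1:2},{x2:2})  teacher={teacher}  out={o}  ->  weights =",
          [round(w, 4) for w in weights])

The first and third examples are already classified correctly, so only the second example changes the weights, matching the slides.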
Slide 25: Neural Word Vectors, a Current Hot Topic
(see https://code.google.com/p/word2vec/ or http://deeplearning4j.org/word2vec.html)
• Distributional Hypothesis: words can be characterized by the words that appear nearby in a large text corpus (matrix algebra is also used for this task, eg singular-value decomposition, SVD)
• Two possible designs [Figure: the two network designs; CBOW = Continuous Bag of Words]
• Initially assign each word a k-long vector of random numbers in [0,1] or [-1,1] (k is something like 100 or 300), as opposed to the traditional '1-of-N' encoding
• Recall 1-of-N: aardvark = 1,0,0,…,0 and zzzz = 0,0,0,…,1  // N is 50,000 or more, and nothing is 'shared' by related words
• Compute ∂Error/∂Input_i to change the input vector(s), ie, find good word vectors that make it easy to learn to predict the I/O pairs in the figure above

Slide 26: Neural Word Vectors (cont.)
• Surprisingly, one can do 'simple algebra' with these word vectors! [Figure: king - man = queen - ?]
• vectorFrance - vectorParis = vectorItaly - X
• Subtract the vector for Paris from the vector for France, then subtract the vector for Italy; negate, and then find the closest word vectors in one's word 'library' (a small numeric sketch appears after the wrap-up below)
• The web page suggests X = vectorRome, though I got vectorMilan (which is reasonable; vectorRome was 2nd)

Slide 27: Wrapup of Basics of ANN Training
• We differentiate (in the calculus sense) all the free parameters, ∂Error/∂W_a, in an ANN with a fixed structure ('topology')
  – If all else is held constant ('partial derivatives'), what is the impact of changing weight_a?
  – Simultaneously move each weight a small amount in the direction that reduces error
  – Process example-by-example, many times
• This seeks a local minimum, ie, a point where all the derivatives = 0
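A minimal sketch (not part of the lecture) of the word-vector 'algebra' on slide 26: solve vectorFrance - vectorParis = vectorItaly - X for X, ie X = vectorItaly - vectorFrance + vectorParis, then return the library words closest to X by cosine similarity. The 3-dimensional vectors and the tiny word 'library' below are made up purely for illustration; real word2vec vectors are learned from a corpus and have k ≈ 100-300 dimensions.

# Word-vector arithmetic with MADE-UP toy vectors (real ones are learned and ~100-300 dimensional).
import math

library = {                       # hypothetical 3-D word vectors, invented for this sketch
    "France": [0.9, 0.1, 0.2],
    "Paris":  [0.9, 0.8, 0.2],
    "Italy":  [0.1, 0.1, 0.9],
    "Rome":   [0.1, 0.8, 0.9],
    "Milan":  [0.2, 0.7, 0.8],
    "queen":  [0.5, 0.5, 0.5],
}

def cosine(u, v):                 # similarity used to find the 'closest' word vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# X = vector(Italy) - vector(France) + vector(Paris)
x = [i - f + p for f, p, i in zip(library["France"], library["Paris"], library["Italy"])]

candidates = [w for w in library if w not in ("France", "Paris", "Italy")]
for word in sorted(candidates, key=lambda w: cosine(x, library[w]), reverse=True):
    print(word, round(cosine(x, library[word]), 3))

With learned vectors the top answer is usually, but not always, the expected one, which matches the slide's note that vectorMilan can come out ahead of vectorRome.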