Neural Networks
ICS 273A, UC Irvine
Instructor: Max Welling
Neurons

• Neurons communicate by receiving signals on their dendrites, adding these signals, and firing off a new signal along the axon if the total input exceeds a threshold.
• The axon connects to new dendrites through synapses, which can learn how much signal is transmitted.
• McCulloch and Pitts ('43) built the first abstract model of a neuron:

  $y = g\left(\sum_i W_i x_i + b\right)$

  where the $x_i$ are the inputs, the $W_i$ the weights, $b$ the bias, $g$ the activation function, and $y$ the output.

Neurons

• We have about $10^{11}$ neurons, each one connected to $10^4$ other neurons on average.
• Each neuron needs at least $10^{-3}$ seconds to transmit a signal.
• So we have many, slow neurons. Yet we recognize our grandmother in about $10^{-1}$ sec.
• Computers have much faster switching times: $10^{-10}$ sec.
• Conclusion: brains compute in parallel!
• In fact, neurons are unreliable/noisy as well. But since things are encoded redundantly by many of them, their population can do computation reliably and fast.

Classification & Regression

[Figure: single-layer network computing outputs $f_i(x)$ from inputs $x_j$ with weights $W_{ij}$.]

• Neural nets are a parameterized function $Y = f(X;W)$ from inputs (X) to outputs (Y).
• If Y is continuous: regression; if Y is discrete: classification.
• We adapt the weights so as to minimize the error between the data and the model predictions:

  $\mathrm{error} = \sum_{n=1}^{N} \sum_{i=1}^{d_{out}} \left( y_{in} - \sum_{j=1}^{d_{in}} W_{ij} x_{jn} - b_i \right)^2$

• This is just a perceptron with a quadratic cost function.

Optimization

• We use stochastic gradient descent: pick a single data-item, compute the contribution of that data-point to the overall gradient, and update the weights.

Stochastic Gradient Descent

[Figure: optimization trajectories for stochastic updates vs. full updates (averaged over all data-items).]

• Stochastic gradient descent does not converge to the minimum, but "dances" around it.
• To get to the minimum, one needs to decrease the step-size as one gets closer to the minimum.
• Alternatively, one can obtain a few samples and average predictions over them (similar to bagging).

Multi-Layer Nets

[Figure: network with layers $x \to h^1$ (via $W^1, b^1$) $\to h^2$ (via $W^2, b^2$) $\to y$ (via $W^3, b^3$).]

  $\hat{y}_i = g\left(\sum_j W^3_{ij} h^2_j + b^3_i\right)$
  $h^2_i = g\left(\sum_j W^2_{ij} h^1_j + b^2_i\right)$
  $h^1_i = g\left(\sum_j W^1_{ij} x_j + b^1_i\right)$

• Single layers can only do linear things. If we want to learn non-linear decision surfaces, or non-linear regression curves, we need more than one layer.
• In fact, a NN with 1 hidden layer can approximate any boolean and continuous function.
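As a concrete illustration of the layer equations above, here is a minimal NumPy sketch of the upward (forward) pass through a net with two hidden layers and sigmoid activations. The function names, layer sizes, and initialization are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def sigmoid(a):
    # Logistic activation g(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, params):
    """Upward pass x -> h1 -> h2 -> y_hat, matching
    h^1 = g(W1 x + b1), h^2 = g(W2 h^1 + b2), y_hat = g(W3 h^2 + b3)."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    y_hat = sigmoid(W3 @ h2 + b3)
    return h1, h2, y_hat

# Toy network: 4 inputs, hidden layers of 5 and 3 units, 2 outputs.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
params = []
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    params += [0.01 * rng.standard_normal((d_out, d_in)),  # small initial weights
               np.zeros(d_out)]                            # biases start at zero

x = rng.standard_normal(4)
h1, h2, y_hat = forward(x, params)
print(y_hat)
```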
Back-propagation

• How do we learn the weights of a multi-layer network? Answer: stochastic gradient descent. But now the gradients are harder!
• For sigmoid units and a cross-entropy error (log-likelihood):

  $\mathrm{error} = \sum_{in} \left[ y_{in} \log h^3_{in} + (1 - y_{in}) \log(1 - h^3_{in}) \right]$

  $\frac{d\,\mathrm{error}_n}{d h^3_{in}} = \frac{y_{in}}{h^3_{in}} - \frac{1 - y_{in}}{1 - h^3_{in}}$

• Applying the chain rule for a second-layer weight $W^2_{jk}$:

  $\frac{d\,\mathrm{error}}{dW^2_{jk}} = \sum_{in} \frac{d\,\mathrm{error}_n}{dh^3_{in}} \frac{dh^3_{in}}{dW^2_{jk}}$
  $= \sum_{in} \frac{d\,\mathrm{error}_n}{dh^3_{in}}\, h^3_{in}(1-h^3_{in}) \frac{d}{dW^2_{jk}}\left(\sum_s W^3_{is} h^2_{sn} + b^3_i\right)$
  $= \sum_{in} \frac{d\,\mathrm{error}_n}{dh^3_{in}}\, h^3_{in}(1-h^3_{in})\, W^3_{ij} \frac{dh^2_{jn}}{dW^2_{jk}}$
  $= \sum_{in} \frac{d\,\mathrm{error}_n}{dh^3_{in}}\, h^3_{in}(1-h^3_{in})\, W^3_{ij}\, h^2_{jn}(1-h^2_{jn}) \frac{d}{dW^2_{jk}}\left(\sum_s W^2_{js} h^1_{sn} + b^2_j\right)$
  $= \sum_{in} \frac{d\,\mathrm{error}_n}{dh^3_{in}}\, h^3_{in}(1-h^3_{in})\, W^3_{ij}\, h^2_{jn}(1-h^2_{jn})\, h^1_{kn}$

  where $h^1_{kn} = \sigma\left(\sum_l W^1_{kl} x_{ln} + b^1_k\right)$.

Back-propagation

• Upward pass: compute the activations layer by layer,

  $h^1_{in} = \sigma\left(\sum_j W^1_{ij} x_{jn} + b^1_i\right)$, $\quad h^2_{in} = \sigma\left(\sum_j W^2_{ij} h^1_{jn} + b^2_i\right)$, $\quad h^3_{in} = \sigma\left(\sum_j W^3_{ij} h^2_{jn} + b^3_i\right)$

• Downward pass: propagate the errors back,

  $\delta^3_{in} = h^3_{in}(1-h^3_{in}) \frac{d\,\mathrm{error}_{in}}{dh^3_{in}}$
  $\delta^2_{jn} = h^2_{jn}(1-h^2_{jn}) \sum_{i \in \mathrm{upstream}(j)} W^3_{ij}\, \delta^3_{in}$
  $\delta^1_{kn} = h^1_{kn}(1-h^1_{kn}) \sum_{j \in \mathrm{upstream}(k)} W^2_{jk}\, \delta^2_{jn}$

Back-propagation

• For a single data-item n the gradient then factorizes in terms of the deltas:

  $\frac{d\,\mathrm{error}_n}{dW^2_{jk}} = \delta^2_{jn}\, h^1_{kn}$

• Update rules (with learning rate $\eta$):

  $W^2_{jk} \leftarrow W^2_{jk} + \eta\, \delta^2_{jn} h^1_{kn}$
  $b^2_j \leftarrow b^2_j + \eta\, \delta^2_{jn}$

ALVINN

• Learning to drive a car.
• This hidden unit detects a mildly left-sloping road and advises to steer left.
• What would another hidden unit look like?

Weight Decay

• NN can also overfit (of course).
• We can try to avoid this by initializing all weight/bias terms to very small random values and growing them during learning.
• One can now check performance on a validation set and stop early.
• Or one can change the update rule to discourage large weights:

  $W^2_{jk} \leftarrow W^2_{jk} + \eta\, \delta^2_{jn} h^1_{kn} - \lambda W^2_{jk}$
  $b^2_j \leftarrow b^2_j + \eta\, \delta^2_{jn} - \lambda b^2_j$

• Now we need to set $\lambda$ using cross-validation.
• This is called "weight decay" in NN jargon.

Momentum

• In the beginning of learning it is likely that the weights are changed in a consistent manner.
• Like a ball rolling down a hill, we should gain speed if we make consistent changes. It's like an adaptive step-size.
• This idea is easily implemented by changing the update as follows (and similarly for the biases; see the code sketch after the conclusion):

  $\Delta W^2_{jk}(\mathrm{new}) = \eta\, \delta^2_{jn} h^1_{kn} + \gamma\, \Delta W^2_{jk}(\mathrm{old})$
  $W^2_{jk} \leftarrow W^2_{jk} + \Delta W^2_{jk}(\mathrm{new})$

Conclusion

• NN are a flexible way to model input/output functions.
• They are robust against noisy data.
• The results are hard to interpret (unlike decision trees).
• Learning is fast on large datasets when using stochastic gradient descent plus momentum.
• Local minima in the optimization are a problem.
• Overfitting can be avoided using weight decay or early stopping.
• There are also NNs which feed information back (recurrent NNs).
• Many more interesting NNs: Boltzmann machines, self-organizing maps, ...
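To tie the update rules from the back-propagation, weight-decay, and momentum slides together, here is a minimal sketch of one stochastic update for the three-layer sigmoid/cross-entropy network above: an upward pass, a downward pass computing the deltas, and gradient-ascent updates with learning rate, weight decay, and momentum. The function name sgd_step, the layer sizes, and the hyperparameter values eta, lam, gamma are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(x, y, params, velocity, eta=0.1, lam=1e-4, gamma=0.9):
    """One stochastic update for a single data-item (x, y):
    upward pass, downward pass (deltas), then momentum + weight-decay updates."""
    W1, b1, W2, b2, W3, b3 = params

    # Upward pass.
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    h3 = sigmoid(W3 @ h2 + b3)          # network output y_hat

    # Downward pass. For sigmoid units with cross-entropy error,
    # delta3 = h3*(1-h3)*d(error)/d(h3) simplifies to (y - h3).
    d3 = y - h3
    d2 = h2 * (1 - h2) * (W3.T @ d3)
    d1 = h1 * (1 - h1) * (W2.T @ d2)

    # Gradient-ascent updates on the log-likelihood with momentum (gamma)
    # and weight decay (lam), e.g. W2 <- W2 + eta*d2*h1^T - lam*W2.
    grads = [np.outer(d1, x), d1, np.outer(d2, h1), d2, np.outer(d3, h2), d3]
    for i, (p, g) in enumerate(zip(params, grads)):
        velocity[i] = eta * g + gamma * velocity[i]
        params[i] = p + velocity[i] - lam * p
    return params, velocity

# Toy usage: 4 inputs, hidden layers of 5 and 3 units, 2 sigmoid outputs.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
params = []
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    params += [0.01 * rng.standard_normal((d_out, d_in)), np.zeros(d_out)]
velocity = [np.zeros_like(p) for p in params]

x, y = rng.standard_normal(4), np.array([1.0, 0.0])
for _ in range(100):
    params, velocity = sgd_step(x, y, params, velocity)
```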