Transcript: Lecture 19

Today’s Topics
• Midterm class mean: 83.5
• HW3 Due Thursday and HW4 Out Thursday
• Turn in Your BN Nannon Player (in a Separate, ‘Dummy’ Assignment) by a Week from Thursday
• Weight Space (for ANNs)
• Gradient Descent and Local Minima
• Stochastic Gradient Descent
• Backpropagation
• The Need to Train the Biases and a Simple Algebraic Trick
• Perceptron Training Rule and a Worked Example
• Case Analysis of Delta Rule
• Neural ‘Word Vectors’
Back to Prob Reasoning for Two Slides:
Base-Rate Fallacy
https://en.wikipedia.org/wiki/Base_rate_fallacy
(This same issue arises in ML when we have many more neg than pos ex’s: false pos overwhelm true pos)

Assume Disease A is rare (one in 1 million, say – so picture not to scale)
Assume population is 10B = 10^10, so 10^4 people have it
Assume testForA is 99.99% accurate (ie, a 0.01% error rate)
You test positive. What is the prob you have Disease A?
Someone (not in cs540) might naively think prob = 0.9999

[Figure: of the people for whom testForA = true, 9999 actually have Disease A and 10^6 do NOT]

Prob(A | testForA) ≈ 0.01
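To make the arithmetic concrete, here is a minimal Python sketch (my own, not from the lecture) that recomputes Prob(A | testForA) from the numbers above:

```python
# Minimal sketch (not from the lecture): redo the Bayes arithmetic with the slide's numbers.
population = 10**10        # 10 billion people
p_disease  = 1e-6          # Disease A hits one in a million
accuracy   = 0.9999        # testForA is 99.99% accurate (0.01% error rate)

have_A    = population * p_disease                    # 10^4 people with Disease A
true_pos  = have_A * accuracy                         # ~9999 of them test positive
false_pos = (population - have_A) * (1 - accuracy)    # ~10^6 healthy people test positive

print(true_pos / (true_pos + false_pos))              # ~0.0099, ie about 0.01
```

The false positives from the huge healthy population swamp the true positives, which is exactly the base-rate fallacy.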
A Major Weakness of BN’s
(I also copied this and prev slide to an earlier lecture, for future cs540’s)
• If many ‘hidden’ random vars (N binary vars, say),
then the marginalization formula leads to many calls
to a BN (2^N in our example; for N = 20, 2^N = 1,048,576)
• Using uniform-random sampling to estimate the result is too
inaccurate since most of the probability might be concentrated
in only a few ‘complete world states’
• Hence, much research (beyond cs540’s scope) on scaling up
inference in BNs and other graphical models, eg via more
sophisticated sampling (eg, MCMC)
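As a rough illustration of where the 2^N comes from, here is a hypothetical Python sketch (bn_joint_prob is a made-up stand-in for one BN call, not a real library function) of brute-force marginalization over N binary hidden vars:

```python
from itertools import product

# Hypothetical sketch of brute-force marginalization over N binary hidden vars.
# bn_joint_prob stands in for 'one call to the BN'; it is not a real library function.
def marginalize(bn_joint_prob, evidence, n_hidden):
    total = 0.0
    for hidden in product([False, True], repeat=n_hidden):   # 2^n_hidden assignments
        total += bn_joint_prob(evidence, hidden)             # one BN call per assignment
    return total

# For n_hidden = 20 the loop body runs 2^20 = 1,048,576 times, which is the blow-up
# that motivates smarter inference such as MCMC-style sampling.
```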
WARNING!
Some Calculus Ahead
No Calculus Experience?
For HWs and the Final …
• Derivatives generalize the idea of SLOPE
• You only need to know how to calc the SLOPE of a line
d(mx + b) / dx = m
// ‘mx + b’ is the algebraic form of a line
// ‘m’ is the slope
// ‘b’ is the y intercept (value of y when x = 0)

[Figure: two (distinct) points define a line]
Weight Space
• Given a neural-network layout, the weights and
biases are free parameters that define a space
• Each point in this Weight Space specifies a network
– Weight space is a continuous space we search
• Associated with each point is an error rate, E,
over the training data
• Backprop performs gradient descent in weight space
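A minimal sketch (my own illustration, with a made-up tiny topology and toy data) of treating all weights and biases as one flat vector, ie one point in weight space, with an error E measured over the training data at that point:

```python
import numpy as np

# Sketch (my own): all weights and biases of a fixed topology flattened into one
# vector, ie one point in weight space, with an error E measured at that point.
def unpack(w):                              # tiny 2-input, 2-hidden, 1-output net
    W1, b1 = w[:4].reshape(2, 2), w[4:6]    # hidden-layer weights and biases
    W2, b2 = w[6:8], w[8]                   # output-layer weights and bias
    return W1, b1, W2, b2

def error(w, X, T):                         # E associated with this point
    W1, b1, W2, b2 = unpack(w)
    hidden = 1 / (1 + np.exp(-(X @ W1 + b1)))
    out = 1 / (1 + np.exp(-(hidden @ W2 + b2)))
    return 0.5 * np.sum((T - out) ** 2)     # ½ Σ (Teacher - Output)²

X = np.array([[3., -2.], [6., 1.], [5., -3.]])   # toy training inputs
T = np.array([1., 0., 1.])                        # toy target outputs
point = np.random.uniform(-1, 1, size=9)          # one point in 9-dim weight space
print(error(point, X, T))
```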
Gradient Descent in Weight Space
[Figure: total error on the training set plotted over weights W1 and W2; the gradient ∂E / ∂W gives the slope at the current wgt settings, and a step is taken downhill from the current wgt settings to the new wgt settings]
Backprop Seeks LOCAL Minima
(in a continuous space)
[Figure: error on the train set plotted over weight space, with several local minima]

Note: a local min might overfit the training data, so ‘early stopping’ is often used (later)
Local Min are
Good Enough for Us!
• ANNs, including Deep Networks,
make accurate predictions even
though we likely are only finding local min
• The world could have been like this:
[Figure: a hypothetical error-on-train-set surface over weight space in which most minima are poor and a good min is hard to find]
• Note: ensembles of ANNs
work well (often find different local minima)
The Gradient-Descent Rule
∇E(w) ≡ [ ∂E/∂w0, ∂E/∂w1, ∂E/∂w2, …, ∂E/∂wN ]   // the ‘gradient’

This is an N+1 dimensional vector (ie, the ‘slope’ in weight space)
Since we want to reduce errors, we want to go ‘down hill’
We’ll take a finite step in weight space:

Δw = -η ∇E(w)       // ‘delta’ = the change to w
or, per weight:  Δwi = -η ∂E/∂wi

[Figure: the step Δw = -η ∇E(w) drawn in the (W1, W2) plane of weight space]
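A minimal sketch of the rule Δw = -η ∇E(w), using an illustrative quadratic error function (my own example, not from the lecture):

```python
import numpy as np

# Sketch of Δw = -η ∇E(w) on an illustrative quadratic error surface; w is the
# full (N+1)-dimensional weight vector, here with N+1 = 3.
def grad_E(w):                       # gradient of E(w) = ½ Σ (w - target)²
    target = np.array([1.0, -2.0, 0.5])
    return w - target

eta = 0.1                            # learning rate η
w = np.zeros(3)                      # start somewhere in weight space
for _ in range(100):
    w = w - eta * grad_E(w)          # finite step 'down hill'
print(w)                             # approaches [1.0, -2.0, 0.5], where ∇E = 0
```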
‘On Line’ vs. ‘Batch’
Backprop
• Technically, we should look at the error gradient for the
entire training set, before taking a step in weight space
(‘batch’ backprop)
• However, in practice we take a step after each example
(‘on-line’ backprop)
– Much faster convergence (learn after each example)
– Called ‘stochastic’ gradient descent
– Stochastic gradient descent is very popular at Google, etc, due to easy parallelism
‘On Line’ vs. ‘Batch’ BP (continued)
* Note wi,BATCH  wi, ON-LINE, for i > 1
E
BATCH – add w
vectors for every training
example, then ‘move’ in
weight space
wex1
wi
wex3
wex2
w
E
ON-LINE – ‘move’ after
each example
(aka, stochastic gradient
descent)
wex2
wex1
Vector from BATCH
wex3
w
* Final locations in w space need not be the same for BATCH and ON-LINE
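A sketch contrasting the two regimes (my own code; grad_E_example is an assumed per-example gradient function, not a library call):

```python
import numpy as np

# Sketch of BATCH vs ON-LINE (stochastic) updates; grad_E_example(w, x, t) is an
# assumed per-example gradient function, not a real library call.
def batch_epoch(w, examples, eta, grad_E_example):
    # BATCH: add up the Δw vectors for every training example, then 'move' once
    total_step = np.zeros_like(w)
    for x, t in examples:
        total_step += -eta * grad_E_example(w, x, t)
    return w + total_step

def online_epoch(w, examples, eta, grad_E_example):
    # ON-LINE: 'move' in weight space after each example (stochastic gradient descent)
    for x, t in examples:
        w = w - eta * grad_E_example(w, x, t)
    return w
```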
BP Calculations
(note: text uses ‘loss’ instead of ‘error’)
[Figure: network with output units i, a layer of hidden units j, and inputs k]

Assume one layer of hidden units (std. non-deep topology)

1. Error ≡ ½ Σ ( Teacher_i – Output_i )²
2.       = ½ Σ ( Teacher_i – F( Σ [ W_i,j × Output_j ] ) )²
3.       = ½ Σ ( Teacher_i – F( Σ [ W_i,j × F( Σ W_j,k × Output_k ) ] ) )²

Determine ∂Error / ∂W_i,j   (use equation 2)
and       ∂Error / ∂W_j,k   (use equation 3)

Recall Δw_x,y = -η ( ∂E / ∂w_x,y )

See Sec 18.7.4 and Fig 18.24 in textbook for results (I won’t ask you to derive on final)
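For concreteness, here is a minimal sketch (my own code, not the textbook’s) of the resulting gradients for one hidden layer of sigmoid units; biases are omitted here since the upcoming slides fold them in as ordinary weights:

```python
import numpy as np

# Sketch (my own) of the gradients for one hidden layer of sigmoid units with
# Error = ½ Σ (Teacher - Output)²; biases omitted here (folded in later as weights).
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def backprop_gradients(x, teacher, W_jk, W_ij):
    # forward pass: inputs k -> hidden units j -> output units i
    out_j = sigmoid(x @ W_jk)
    out_i = sigmoid(out_j @ W_ij)

    # backward pass via the chain rule; F'(net) = out (1 - out) for the sigmoid
    delta_i = -(teacher - out_i) * out_i * (1 - out_i)    # ∂Error/∂net_i
    dE_dWij = np.outer(out_j, delta_i)                    # ∂Error/∂W_i,j
    delta_j = (delta_i @ W_ij.T) * out_j * (1 - out_j)    # ∂Error/∂net_j
    dE_dWjk = np.outer(x, delta_j)                        # ∂Error/∂W_j,k
    return dE_dWjk, dE_dWij

# toy shapes: 3 inputs, 2 hidden units, 1 output unit
x, teacher = np.array([1.0, -1.0, 0.5]), np.array([1.0])
W_jk = np.random.uniform(-1, 1, (3, 2))   # input-to-hidden weights
W_ij = np.random.uniform(-1, 1, (2, 1))   # hidden-to-output weights
print(backprop_gradients(x, teacher, W_jk, W_ij))
```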
Differentiating the Logistic Function (‘soft’ step-function)
Note: differentiating RLUs is easy! (use F′ = 0 when input = bias)
out_i = 1 / ( 1 + e^-( Σ w_j,i × out_j – Θ_i ) )

F′(wgt’ed in) = out_i ( 1 – out_i )

[Figure: F(wgt’ed in) rises from 0 to 1 as a ‘soft’ step in the weighted input Σ w_j × out_j, crossing ½ where the weighted input equals the bias; F′(wgt’ed in) is a bump that peaks at ¼ there and approaches 0 at both extremes]

Notice that when a unit is saturated (F′ ≈ 0), even if it is totally wrong there is no (or very little) change in its weights
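A quick numeric check of F and F′ (my own sketch):

```python
import numpy as np

# Sketch: the logistic ('soft' step) function and its derivative out (1 - out).
def logistic(weighted_in, bias=0.0):
    return 1 / (1 + np.exp(-(weighted_in - bias)))

out = logistic(np.array([-10.0, 0.0, 10.0]))
print(out)              # ~0, 0.5, ~1
print(out * (1 - out))  # ~0, 0.25, ~0: near-zero slope when saturated, so weights
                        # barely change even if the output is totally wrong
```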
Gradient Descent for the Perceptron
(for the simple case of linear output units)
Error ≡ ½ Σ ( T – o )²     // o = network’s output, T = teacher’s answer (a constant wrt the weights)

∂E/∂Wa = (T – o) × ∂(T – o)/∂Wa = - (T – o) × ∂o/∂Wa
Continuation of Derivation
∂E/∂Wa = - (T – o) × ∂( Σ wk xk ) / ∂Wa     // stick in the formula for the output
       = - (T – o) xa

So, recalling ΔWa ≡ - η ( ∂E / ∂Wa ):

ΔWa = η (T – o) xa     ← The Perceptron Rule

We’ll use this for both LINEAR and STEP-FUNCTION activation
(also known as the delta rule and other names, with some variation in the calc)
Node Biases
Recall: A node’s output is a weighted function
of its inputs and a ‘bias’ term
[Figure: node output (0 to 1) plotted against the input, with the transition located at the bias]
These biases also need to be learned!
Training Biases ( Θ’s )
A node’s output (assume ‘step function’ for simplicity):
  1 if W1 X1 + W2 X2 + … + Wn Xn ≥ Θ
  0 otherwise

Rewriting:
  W1 X1 + W2 X2 + … + Wn Xn – Θ ≥ 0
  W1 X1 + W2 X2 + … + Wn Xn + Θ × (-1) ≥ 0
(here the constant -1 acts as an ‘activation’ and Θ as its weight)
Training Biases (cont.)
Hence, add another unit whose activation is
always -1
The bias is then just another weight!
Eg: [Figure: a node with an extra input unit whose activation is always -1; its weight is Θ, so the bias is learned like any other weight]
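A small sketch of the trick (my own code): the threshold Θ becomes the weight on an extra input that is always -1, and both forms give the same output:

```python
import numpy as np

# Sketch of the bias trick: the threshold Θ becomes the weight on an extra input
# whose activation is always -1, so the same update rule can learn it.
def step_output(weights, theta, x):
    return 1 if np.dot(weights, x) >= theta else 0

def step_output_bias_as_weight(weights_with_theta, x):
    x_aug = np.append(x, -1.0)                   # extra input fixed at -1
    return 1 if np.dot(weights_with_theta, x_aug) >= 0 else 0

w, theta, x = np.array([1.0, -3.0]), 2.0, np.array([3.0, -2.0])
print(step_output(w, theta, x))                            # 1
print(step_output_bias_as_weight(np.append(w, theta), x))  # 1 (same answer)
```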
Perceptron Example
(assume step function and use η = 0.1)
Train Set:
  X1   X2   Correct Output
   3   -2         1
   6    1         0
   5   -3         1

[Figure: perceptron with weight 1 on X1, weight -3 on X2, and weight 2 on the always -1 ‘bias’ input]

Perceptron Learning Rule:  ΔWa = η (T – o) xa

First example:  Out = StepFunction(3 × 1 – 2 × (-3) – 1 × 2) = 1
No wgt changes, since correct
Perceptron Example
(assume step function and use η = 0.1)
(Same train set, network, and rule as on the previous slide.)

Second example:  Out = StepFunction(6 × 1 + 1 × (-3) – 1 × 2) = 1
// but Teacher = 0, so need to update weights
Perceptron Example
(assume step function and use η = 0.1)
(Same train set and rule; now updating the weights after the second example, where T = 0 and o = 1.)

Since (T – o) = -1, each ΔWa = 0.1 × (-1) × xa:
  New W1:        1 – 0.1 × 6    = 0.4
  New W2:       -3 – 0.1 × 1    = -3.1
  New bias wgt:  2 – 0.1 × (-1) = 2.1
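The whole worked example can be replayed with a short sketch (my own code) that applies ΔWa = η (T – o) xa example-by-example:

```python
import numpy as np

# Sketch replaying the worked example: step activation, η = 0.1, and the bias
# handled as a weight on a constant -1 input.
eta = 0.1
w = np.array([1.0, -3.0, 2.0])                  # [wgt on X1, wgt on X2, bias wgt]
train = [(np.array([3.0, -2.0, -1.0]), 1),      # each input vector ends with the -1
         (np.array([6.0,  1.0, -1.0]), 0),
         (np.array([5.0, -3.0, -1.0]), 1)]

for x, teacher in train:
    out = 1 if np.dot(w, x) >= 0 else 0         # step-function output
    w = w + eta * (teacher - out) * x           # ΔWa = η (T - o) xa
    print(out, w)
# Example 1 is correct (no change); example 2 is wrong, giving [0.4, -3.1, 2.1] as above.
```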
Pushing the Weights and Bias
in the Correct Direction when Wrong
Assume TEACHER = 1 and ANN = 0, so some combo of
(a) wgts on some positively valued inputs too small
(b) wgts on some negatively valued inputs too large
(c) ‘bias’ too large

Opposite movement when TEACHER = 0 and ANN = 1

[Figure: a node’s output as a step function of its wgt’ed sum, with the threshold at the bias]
Case Analysis:
ΔWk = η (T – o) xk
Assume Teach = 1, out = 0, η = 1
Note: ‘bigger’ means
closer to +infinity
and ‘smaller’ means
closer to -infinity
Four cases: pos/neg input × pos/neg weight, plus the cases for the BIAS (whose input is ‘-1’)

Input Vector:    1,     -1,      1,     -1        ‘-1’           ‘-1’
Weights:         2,     -4,     -3,      5          6     or      -6    // the BIAS
New Wgts:       2+1,   -4-1,   -3+1,    5-1        6-1           -6-1
              bigger  smaller  bigger  smaller   smaller        smaller

Old vs New Input × Wgt:   2 vs 3,   4 vs 5,   -3 vs -2,   -5 vs -4

So the weighted sum will be LARGER (-2 vs 2), and the BIAS will be SMALLER
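A quick numeric check of this case analysis (my own sketch):

```python
import numpy as np

# Quick check of the case analysis: with Teach = 1, out = 0, η = 1,
# ΔWk = η (T - o) xk = xk, so each weight moves by its own input value.
x = np.array([1.0, -1.0, 1.0, -1.0, -1.0])   # last entry is the constant -1 'bias' input
w = np.array([2.0, -4.0, -3.0, 5.0, 6.0])    # last entry is the bias weight (the Θ = 6 case)

new_w = w + 1.0 * (1 - 0) * x                # apply the rule once
print(np.dot(w[:4], x[:4]), np.dot(new_w[:4], x[:4]))   # -2.0 then 2.0: weighted sum LARGER
print(w[4], new_w[4])                                    # 6.0 then 5.0: bias weight SMALLER
```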
Neural Word Vectors – Current Hot Topic
(see https://code.google.com/p/word2vec/ or http://deeplearning4j.org/word2vec.html)
Distributional Hypothesis: words can be characterized by the words that appear nearby in a large text corpus
(matrix algebra is also used for this task, eg singular-value decomposition, SVD)
[Figure: two possible network designs (CBOW = Continuous Bag of Words)]
Initially assign each word a k-long vector of random #’s in [0,1] or [-1,1]
(k is something like 100 or 300) – as opposed to traditional ‘1-of-N’ encoding
Recall 1-of-N: aardvark = 1,0,0,…,0 // N is 50,000 or more!
zzzz     = 0,0,0,…,1 // And nothing ‘shared’ by related words
Compute ∂Error / ∂Input_i to change the input vector(s)
– ie, find good word vectors so that it is easy to learn to predict the I/O pairs in the fig above
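A tiny sketch (my own, with a toy 3-word vocabulary) contrasting 1-of-N encoding with the dense k-long random starting vectors:

```python
import numpy as np

# Sketch contrasting 1-of-N encoding with a dense k-long random starting vector,
# using a toy 3-word vocabulary (a real N is 50,000 or more).
vocab = ["aardvark", "zebra", "zzzz"]
N, k = len(vocab), 300

one_of_n = np.zeros(N)
one_of_n[vocab.index("aardvark")] = 1                      # 1,0,0: nothing shared between words
dense = {w: np.random.uniform(-1, 1, k) for w in vocab}    # random start, refined by training
print(one_of_n, dense["aardvark"][:5])
```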
Neural Word Vectors
king – man = queen – ?
Surprisingly, one can do ‘simple algebra’
with these word vectors!
vectorFrance – vectorParis = vectorItaly – X
Subtract vector for Paris from vector for France,
then subtract vector for Italy. Negate then find
closest word vectors in one’s word ‘library’
web page suggests X = vectorRome
though I got vectorMilan
(which is reasonable; vectorRome was 2nd)
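A sketch of the mechanics (my own code; this ‘library’ holds random placeholder vectors, so the printed answer is meaningless; real word2vec vectors from the URLs above would be needed to recover Rome or Milan):

```python
import numpy as np

# Sketch of the mechanics only: this 'library' holds random placeholder vectors, so the
# printed answer is meaningless; real word2vec vectors would be needed to recover Rome.
library = {w: np.random.uniform(-1, 1, 300)
           for w in ["France", "Paris", "Italy", "Rome", "Milan"]}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def closest(target, library, exclude=()):
    # return the word whose vector is most similar (by cosine) to the target vector
    return max((w for w in library if w not in exclude),
               key=lambda w: cosine(library[w], target))

# vectorFrance - vectorParis = vectorItaly - X  =>  X = vectorItaly - vectorFrance + vectorParis
X = library["Italy"] - library["France"] + library["Paris"]
print(closest(X, library, exclude=("Italy", "France", "Paris")))
```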
Wrapup of Basics
of ANN Training
• We differentiate (in the calculus sense)
all the free parameters in an ANN with a
fixed structure (‘topology’)
– If all else is held constant (‘partial derivatives’),
what is the impact of changing weight_a? (ie, ∂Error / ∂Wa)
– Simultaneously move each weight a small amount
in the direction that reduces error
– Process example-by-example, many times
• Seeks local minimum,
ie, where all derivatives = 0