Transcript: Lecture 18 (Artificial Neural Networks), CS 540, Fall 2015 (Shavlik), 10/27/15

Today’s Topics
• Artificial Neural Networks (ANNs)
• Perceptrons (1950s)
• Hidden Units and Backpropagation (1980s)
• Deep Neural Networks (2010s)
• ??? (2040s [note the pattern])
• This Lecture: The Big Picture & Forward Propagation
• Next Lecture: Learning Network Weights
Should you?
(Slide I used in CS 760 for 20+ years)
‘Fenwick here is biding his time waiting for neural networks’
Recall:
Supervised ML Systems Differ in
How They Represent Concepts
(Diagram: the same training examples flow into different learners, each with its own concept representation: Backpropagation (a neural network), ID3/CART (a decision tree), and FOIL/ILP (logical rules).)
Advantages of
Artificial Neural Networks
• Provide best predictive accuracy for many problems
• Can represent a rich class of concepts (‘universal approximators’)
(Illustrations: examples labeled Positive / Negative / Positive, and time-series data such as ‘Saturday: 40% chance of rain, Sunday: 25% chance of rain’.)
A Brief Overview of ANNs
(Diagram: input units at the bottom, hidden units in the middle, and output units at the top; links carry weights, error signals are passed back along them, and a recurrent link feeds an output back into the network.)
Recurrent ANNs
(Advanced topic: LSTM models, Schmidhuber group)
State Units
(ie, memory)
Representing Features in ANNs (and SVMs)
- we need NUMERIC values
Nominal: f = {a, b, c}
Use a ‘1 of N’ rep: one input unit per value (f=a, f=b, f=c)
Ex: f=a → 1 0 0

Hierarchical: f = root of a small tree whose nodes are a, b, c, d, e, g
Use one input unit per node (f=a, f=b, f=c, f=d, f=e, f=g)
Ex: 0 1 0 0 1 0 (for f=e)

Linear/Ordered: f = [a, b]
Approach I (use 1 input unit): f = (value - a) / (b - a)
Approach II: Thermometer Rep (next slide)
More on Encoding Datasets
Thermometer Representation
f is an element of { a, b, c }, ie f is ordered
f = a → 1 0 0
f = b → 1 1 0
f = c → 1 1 1
(could also discretize continuous functions this way)
Output Representation
For N categories use a 1-of-N representation
Category 1 → 1 0 0
Category 2 → 0 1 0
Category 3 → 0 0 1
Could also use an error-correcting code
(but we won’t cover that)
For Boolean functions use either 1 or 2 output units
Normalize real-valued functions to [0,1]
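To make these encodings concrete, here is a minimal sketch in Python (my own illustration, not code from the lecture; the function names and example values are invented):

```python
def one_of_n(value, values):
    """'1 of N' rep for a nominal feature: one unit per value."""
    return [1 if value == v else 0 for v in values]

def thermometer(value, ordered_values):
    """Thermometer rep for an ordered feature: a -> 100, b -> 110, c -> 111."""
    idx = ordered_values.index(value)
    return [1 if i <= idx else 0 for i in range(len(ordered_values))]

def normalize(value, lo, hi):
    """Map a real-valued feature in [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

print(one_of_n('b', ['a', 'b', 'c']))     # [0, 1, 0]
print(thermometer('b', ['a', 'b', 'c']))  # [1, 1, 0]
print(normalize(7.0, 0.0, 10.0))          # 0.7
```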
Connectionism History
PERCEPTRONS (Rosenblatt 1957)
• no hidden units
• among the earliest work in machine learning; interest died out in the 1960s (due to the Minsky & Papert book)
(Diagram: unit I receives the outputs of units J, K, and L through weights w_ij, w_ik, w_il.)
Output_i = F( w_ij × output_j + w_ik × output_k + w_il × output_l )
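As a quick illustration of this weighted-sum computation (a sketch only; the weights below and the use of a simple step threshold for F are my own choices):

```python
def perceptron_output(weights, inputs, threshold=0.0):
    """Output_i = F(sum_j w_ij * output_j), with F a step function here."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# Unit I receiving the outputs of units J, K, L through three weights:
print(perceptron_output([0.5, -1.0, 2.0], [1, 1, 1]))  # 0.5 - 1.0 + 2.0 = 1.5 > 0, so 1
```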
Connectionism (cont.)
• Backpropagation Algorithm Overcame the Perceptron’s Weakness
– Major reason for renewed excitement in the 1980s
• ‘Hidden Units’ Important
– Fundamental extension to perceptrons
– Can generate new features
(‘constructive induction’, ‘predicate invention’,
‘learning representations’, ‘derived features’)
Deep Neural Networks
Old: backprop algo does not work well for
more than one layer of hidden units
(‘gradient gets too diffuse’)
New: with a lot of training data,
deep (several layers of hidden units)
neural networks exceed prior
state-of-the-art results
Unassigned, but FYI: http://www.idsia.ch/~juergen/deep-learning-overview.html
Sample Deep Neural Network
A Deeper Network
Old Design: fully connect each input node to each HU
(only one HU layer), then fully connect
each HU to each output node
We’ll cover CONVOLUTION and POOLING later
From http://people.idsia.ch/~juergen/handwriting.html
Digit Recognition:
Influential ANN Testbed
• Deep Networks (Schmidhuber, 2012)
– One YEAR of training on a single CPU
– One WEEK of training on a single GPU that performed 10^9 wgt updates/sec
– 0.2% Error Rate (old record was 0.4%)
• More info on datasets and results at
http://yann.lecun.com/exdb/mnist/
Perceptron: 12% error (7.6% with feature engineering)
k-NN: 2.8% (0.63%)
Ensemble of d-trees: 1.5%
SVMs: 1.4% (0.56%)
One layer of HUs: 1.6% (0.4%; feature engr + ensemble of 25 ANNs)
Activation Units:
Map Weighted Sum to Scalar
Individual Units’ Computation
output_i = F( Σ_j weight_i,j × output_j )

Typically F(input_i) = 1 / (1 + e^-(input_i - bias_i))

Called the ‘sigmoid’ and ‘logistic’ (hyperbolic tangent also used)
Piecewise Linear (and Gaussian) nodes can also be used
(Plot: output vs. input; the sigmoid rises from 0 toward 1, passing through 0.5 where the input equals the bias.)
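A small sketch of this computation in Python (the bias values below are just illustrative):

```python
import math

def sigmoid(weighted_sum, bias=0.0):
    """F(input_i) = 1 / (1 + e^-(input_i - bias_i))."""
    return 1.0 / (1.0 + math.exp(-(weighted_sum - bias)))

print(sigmoid(0.0))            # 0.5: the output is one-half when the input equals the bias
print(sigmoid(4.0, bias=1.0))  # ~0.95: inputs well above the bias push the output toward 1
```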
Rectified Linear Units (ReLUs)
(Nair & Hinton, 2010)
– used for HUs; use ‘pure’ linear for output units, ie F(wgt’edSum) = wgt’edSum

F(wgt’edSum) = max(0, wgt’edSum)

Argued to be more biologically plausible; used in ‘deep networks’
(Plot: output vs. input; the output stays at zero until the input passes the bias, then rises linearly.)
Sample ANN Calculation
(‘Forward Propagation’,
ie, reasoning with weights learned by backprop)
(Diagram: a small network with INPUT units at the bottom and an OUTPUT unit at the top; the links and inputs carry example values such as 3, 4, -2, 1, -7, -1, 0, -8, 9, 5, and 2. Assume bias = 0 for all nodes for simplicity, and ReLU activation units.)
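Since the exact wiring of the diagram is hard to recover from this transcript, here is a hedged sketch of the same kind of calculation on a small made-up network (the weights and inputs below are illustrative, not the ones on the slide), using ReLU units with bias = 0 as the slide assumes:

```python
def relu(x):
    return max(0.0, x)

def forward(layers, inputs):
    """Forward propagation: each layer is a list of weight vectors, one per unit;
    every unit applies ReLU to its weighted sum (bias = 0 for simplicity)."""
    activations = inputs
    for layer in layers:
        activations = [relu(sum(w * a for w, a in zip(unit_weights, activations)))
                       for unit_weights in layer]
    return activations

# Hypothetical 2-input, 2-hidden-unit, 1-output network:
hidden = [[3.0, -2.0],   # weights into hidden unit 1
          [1.0,  4.0]]   # weights into hidden unit 2
output = [[2.0, -1.0]]   # weights into the output unit
print(forward([hidden, output], [1.0, 2.0]))
# hidden: relu(3 - 4) = 0 and relu(1 + 8) = 9; output: relu(2*0 - 1*9) = 0
```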
Perceptron Convergence Theorem
(Rosenblatt, 1957)
Perceptron = no hidden units
If a set of examples is learnable,
the DELTA rule will eventually find the necessary weights
However, a perceptron can only learn/represent linearly separable datasets
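Learning the weights is next lecture’s topic, but as a preview, here is a minimal sketch of the classic perceptron update (the learning rate, epoch count, and AND dataset are my own illustrative choices):

```python
def train_perceptron(examples, n_inputs, rate=0.1, epochs=50):
    """Nudge the weights toward correct outputs (a delta-rule-style update).
    Each example is (inputs, target) with target 0 or 1."""
    w = [0.0] * n_inputs
    theta = 0.0                      # the threshold, treated as a learnable bias
    for _ in range(epochs):
        for x, target in examples:
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0
            err = target - out
            w = [wi + rate * err * xi for wi, xi in zip(w, x)]
            theta -= rate * err
    return w, theta

# AND is linearly separable, so the weights converge to a correct separator:
AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(AND, 2))
```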
Linear Separability
(Plot: X1-X2 feature space with ‘+’ examples on one side and ‘-’ examples on the other, separated by a line.)
Consider a perceptron; its output is
  1 if W1 X1 + W2 X2 + … + Wn Xn > Q
  0 otherwise
In terms of feature space (2 features only):
  W1 X1 + W2 X2 = Q
  X2 = (Q - W1 X1) / W2 = (-W1 / W2) X1 + Q / W2    (the form y = mx + b)
Hence, a perceptron can only classify examples correctly if a ‘line’ (hyperplane) can separate them
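For a concrete instance (illustrative weights of my own, not from the slide): with W1 = W2 = 1 and Q = 1.5, the boundary line is X2 = -X1 + 1.5 and the perceptron computes AND of two Boolean inputs:

```python
def perceptron(x1, x2, w1=1.0, w2=1.0, q=1.5):
    """Output 1 if w1*x1 + w2*x2 > q; the boundary is the line x2 = (q - w1*x1) / w2."""
    return 1 if w1 * x1 + w2 * x2 > q else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), perceptron(x1, x2))  # only (1, 1) lies above the line
```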
The (Infamous) XOR Problem
Not linearly separable
Exclusive OR (XOR)
Input (X1, X2) → Output:
  a) 0 0 → 0
  b) 0 1 → 1
  c) 1 0 → 1
  d) 1 1 → 0
(Plot: the four points a, b, c, d plotted in X1-X2 feature space; no single line separates the 1 outputs from the 0 outputs.)
A Solution with (Sigmoidal) Hidden Units
(Diagram: X1 and X2 feed two hidden units through weights of +10 and -10, and the two hidden units feed the output unit through weights of 10. Let Q = 5 for all nodes.)
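One consistent reading of this network (hedged, since the diagram’s layout does not fully survive in the transcript) has one hidden unit detecting ‘X1 and not X2’, the other detecting ‘X2 and not X1’, and the output unit ORing them. A quick check using step units for clarity (with sigmoids the outputs would be near, rather than exactly, 0 and 1):

```python
def step(weighted_sum, q=5.0):
    """Threshold unit: fire when the weighted sum exceeds Q = 5."""
    return 1 if weighted_sum > q else 0

def xor_net(x1, x2):
    h1 = step(10 * x1 - 10 * x2)    # fires only for x1=1, x2=0
    h2 = step(-10 * x1 + 10 * x2)   # fires only for x1=0, x2=1
    return step(10 * h1 + 10 * h2)  # ORs the two hidden units

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))  # 0, 1, 1, 0
```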
The Need for Hidden Units
If there is one layer of enough hidden units (possibly 2^N for Boolean functions), the input can be recoded (N = number of input units)
This recoding allows any mapping to be
represented (known by Minsky & Papert)
Question: How to provide an error signal
to the interior units?
(backprop is the answer from the 1980s)
Hidden Units
One View: allow a system to create its own internal representation, one for which problem solving is easy
(Diagram: a network with hidden units; part of it is labeled ‘a perceptron’.)
Reformulating XOR
(Diagram: X1 and X2 feed a hidden unit that computes X3 = X1 ∧ X2; alternatively, X3 is supplied directly as a third input alongside X1 and X2.)
So, if a hidden unit can learn to represent X1 ∧ X2, the solution is easy (see the sketch below).
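A quick check of that claim (reading the recoded feature as X3 = X1 ∧ X2; the specific weights and threshold below are my own):

```python
def xor_with_recoding(x1, x2):
    x3 = x1 and x2   # the new feature a hidden unit would learn (X1 AND X2)
    # With X3 available, a single linear threshold unit over (X1, X2, X3) suffices:
    return 1 if (x1 + x2 - 2 * x3) > 0.5 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_with_recoding(x1, x2))  # 0, 1, 1, 0
```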
The Need for Non-Linear
Activation Functions
Claim: For every ANN using only linear activation
functions with depth k, there is an
equivalent perceptron
– Ie, a neural network with no hidden units
– So if using only linear activation units, a ‘deep’ ANN can only learn a separating ‘line’
– Note that ReLUs are non-linear (‘piecewise’ linear)
Can show using linear algebra (but won’t in cs540)
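A sketch of that linear-algebra argument in numpy (arbitrary random weight matrices of my own choosing): stacking two purely linear layers is the same as a single linear layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first linear layer: 3 inputs -> 4 hidden units
W2 = rng.standard_normal((2, 4))   # second linear layer: 4 hidden units -> 2 outputs
x = rng.standard_normal(3)

deep = W2 @ (W1 @ x)               # two linear layers, no nonlinearity
shallow = (W2 @ W1) @ x            # one equivalent perceptron-like linear map
print(np.allclose(deep, shallow))  # True
```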
A Famous Early Application
(http://cnl.salk.edu/Media/nettalk.mp3)
• NETtalk (Sejnowski & Rosenberg, 1987)
• Mapping character strings into phonemes
• ‘Sliding Window’ approach
• Train: 1,000 most common English words (88.5% correct)
• Test: 20,000 word dictionary (72% / 63% correct)
(Diagram: a window slides over the input letters, e.g. ‘_ C A T _’, and the network outputs the phoneme for the center letter, like the phonemes in a dictionary, such as Ă or Ō.)
An Empirical Comparison of
Symbolic and Neural Learning
[Shavlik, Mooney, & Towell, IJCAI 1989 & ML journal 1991]
(Charts: correctness on test data (0-100%) and relative training time (scaled to ID3, log scale from 0.1 to 1000) for ID3, Perceptron, and Backprop on the Soybeans, Chess, Audiology, NetTalk-A, and NetTalk-full datasets.)
Perceptron works quite well!
ANN Wrapup on Non-Learning Aspects
(Photo: Geoff Hinton, 1947-, great-great-grandson of George Boole!)
• Perceptrons can do well, but can
only create linear separators in feature space
• Backprop Algo (next lecture) can
successfully train hidden units
• Historically, only one HU layer was used
• Deep Neural Networks (several HU layers) are highly successful given large amounts of training data, especially for images & text