Linear Classifiers and the Perceptron

Perceptrons and Linear Classifiers
William Cohen
2-4-2008
Announcement: no office hours for William this Friday 2/8
Dave Touretzky’s Gallery of CSS Descramblers
Linear Classifiers
• Let’s simplify life by assuming:
– Every instance is a vector of real numbers, x=(x1,…,xn).
(Notation: boldface x is a vector.)
– There are only two classes, y=(+1) and y=(-1)
• A linear classifier is a vector w of the same dimension as x that is used to make this prediction:
$\hat{y} = \mathrm{sign}(w_1 x_1 + w_2 x_2 + \dots + w_n x_n) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$

$\mathrm{sign}(x) = \begin{cases} +1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$
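As a concrete illustration (not from the original slides), here is a minimal prediction function in Python, with made-up weights and instance:

```python
import numpy as np

def predict(w, x):
    """Linear classifier: +1 if w . x >= 0, else -1."""
    return 1 if np.dot(w, x) >= 0 else -1

w = np.array([0.5, -1.0, 2.0])   # hypothetical weight vector
x = np.array([1.0, 0.0, 0.3])    # hypothetical instance
print(predict(w, x))             # +1, since w . x = 1.1 >= 0
```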
[Figure: a 2-D plot with axes $x_1$ and $x_2$ showing w and -w. Visually, $\mathbf{x} \cdot \mathbf{w}$ is the distance you get if you "project x onto w". The line perpendicular to w divides the vectors classified as positive from the vectors classified as negative. In 3-D the separator is a plane, in 4-D a hyperplane, and so on.]
$\hat{y} = \mathrm{sign}(w_1 x_1 + w_2 x_2 + \dots + w_n x_n) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$

[Figure: illustrations of the separating line/plane defined by w and -w; images credited to Wolfram MathWorld, Mediaboost.com, and Geocities.com/bharatvarsha1947.]
Notice that the separating hyperplane goes through the origin. If we don't want this, we can preprocess our examples by prepending a constant feature:

$\mathbf{x} = \langle x_1, x_2, \dots, x_n \rangle \;\longrightarrow\; \mathbf{x}' = \langle 1, x_1, x_2, \dots, x_n \rangle$

$\hat{y} = \mathrm{sign}(w_1 x_1 + w_2 x_2 + \dots + w_n x_n) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$

$\hat{y} = \mathrm{sign}(w_0 \cdot 1 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}')$
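A minimal sketch of that preprocessing step (the example vector is made up):

```python
import numpy as np

def add_bias_feature(x):
    """Prepend a constant 1 so the separator need not pass through the origin."""
    return np.concatenate(([1.0], x))

x = np.array([2.0, -0.5])
x_prime = add_bias_feature(x)    # array([ 1. ,  2. , -0.5])
# With w = (w0, w1, w2), sign(w . x_prime) now includes the bias term w0.
```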
What have we given up?

Categorical attributes can still be encoded: each (attribute, value) pair becomes a 0/1 feature,

$x_{\text{outlook,sunny}},\; x_{\text{outlook,overcast}},\; x_{\text{outlook,rain}},\; x_{\text{temp,hot}},\; x_{\text{temp,mild}},\; x_{\text{temp,cool}},\; \dots$

and the class label is $y \in \{+1, -1\}$. For example, the instance D7 (Outlook = overcast, Humidity = normal, ...) becomes

$D7 = \langle x_{\text{outlook,sunny}} = 0,\; x_{\text{outlook,overcast}} = 1,\; x_{\text{outlook,rain}} = 0,\; \dots \rangle = \langle 0,1,0,\; 0,0,1,\; 0,1,\; 1,0 \rangle$
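A sketch of this kind of encoding in Python (the attribute lists below are illustrative, taken from the example above):

```python
def one_hot_encode(instance, attribute_values):
    """Turn a dict of categorical attributes into a 0/1 feature vector,
    one feature per (attribute, value) pair."""
    features = []
    for attr, values in attribute_values:
        for v in values:
            features.append(1 if instance.get(attr) == v else 0)
    return features

attribute_values = [
    ("outlook", ["sunny", "overcast", "rain"]),
    ("temp", ["hot", "mild", "cool"]),
]
d7 = {"outlook": "overcast", "temp": "cool"}
print(one_hot_encode(d7, attribute_values))   # [0, 1, 0, 0, 0, 1]
```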
What have we given up?
• Not much!
– Practically, it's a little harder to understand a particular example (or classifier)
– Practically, it's a little harder to debug
• You can still express the same information
• You can analyze things mathematically much more easily
Naïve Bayes as a Linear Classifier

Consider Naïve Bayes with two classes (+1, -1) and binary features (0, 1).

[Derivation on the slides: the "log odds" $\log \frac{P(y=+1 \mid \mathbf{x})}{P(y=-1 \mid \mathbf{x})}$ is written out in terms of the per-feature probabilities $p_i$ and $q_i$, showing that it is linear in $\mathbf{x}$.]
• Summary:
– NB is a linear classifier
  $\hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$
– The weights $w_i$ have a closed form
  • which is fairly simple, expressed in log-odds
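The slides' own equations are not in the transcript; the following is the standard derivation under the usual Naïve Bayes conditional-independence assumption, with $p_i = P(x_i = 1 \mid y = +1)$ and $q_i = P(x_i = 1 \mid y = -1)$:

$$\log \frac{P(y=+1 \mid \mathbf{x})}{P(y=-1 \mid \mathbf{x})} = \log \frac{P(y=+1)}{P(y=-1)} + \sum_{i=1}^{n} \left[ x_i \log \frac{p_i}{q_i} + (1 - x_i) \log \frac{1 - p_i}{1 - q_i} \right]$$

Collecting the terms that multiply each $x_i$ gives a linear function $w_0 + \sum_i w_i x_i$ with

$$w_i = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}, \qquad w_0 = \log \frac{P(y=+1)}{P(y=-1)} + \sum_{i=1}^{n} \log \frac{1 - p_i}{1 - q_i},$$

so choosing the more probable class is exactly $\hat{y} = \mathrm{sign}(w_0 + \mathbf{w} \cdot \mathbf{x})$.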
[Reference shown on slide: Proceedings of ECML-98, 10th European Conference on Machine Learning.]
An Even Older Linear Classifier
• 1957: The perceptron algorithm: Rosenblatt
–
WP: “A handsome bachelor, he drove a classic MGA sports car and was often seen with his
cat named Tobermory. He enjoyed mixing with undergraduates, and for several years taught
an interdisciplinary undergraduate honors course entitled "Theory of Brain Mechanisms" that
drew students equally from Cornell's Engineering and Liberal Arts colleges…this course was
a melange of ideas .. experimental brain surgery on epileptic patients while conscious,
experiments on .. the visual cortex of cats, ... analog and digital electronic circuits that
modeled various details of neuronal behavior (i.e. the perceptron itself, as a machine).”
– Built on work of Hebb (1949); also developed by Widrow-Hoff (1960)
• 1960: Perceptron Mark 1 Computer – hardware implementation
Bell Labs TM 59-1142-11 – Datamation 1961 – April 1, 1984 Special Edition of CACM
• 1969: Minsky & Papert's book shows perceptrons are limited to linearly separable data; Rosenblatt dies in a boating accident
• 1970’s: learning methods for two-layer neural networks
• Mid-late 1980’s (Littlestone & Warmuth): mistake-bounded learning
& analysis of Winnow method; early-mid 1990’s, analyses of
perceptron/Widrow-Hoff
Experimental evaluation of perceptron vs. Widrow-Hoff and Experts (Winnow-like methods) in SIGIR-1996 (Lewis, Schapire, Callan, Papka) and (Cohen & Singer).
Freund & Schapire (1998-1999) showed that the "kernel trick" and averaging/voting worked.
The voted perceptron

[Figure: the on-line protocol. A sends an instance $\mathbf{x}_i$ to B; B replies with a prediction $\hat{y}_i$; A then reveals the true label $y_i$.]

Compute: $\hat{y}_i = \mathrm{sign}(\mathbf{v}_k \cdot \mathbf{x}_i)$
If mistake: $\mathbf{v}_{k+1} = \mathbf{v}_k + y_i \mathbf{x}_i$
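A minimal sketch of this update in Python (nothing here is from the original slides beyond the rule itself; the per-hypothesis survival counts $m_k$ are kept because the voting/averaging step described later needs them):

```python
import numpy as np

def train_perceptron(examples, n_features):
    """One pass of the perceptron update v_{k+1} = v_k + y_i x_i on mistakes.
    Returns the list of (v_k, m_k) pairs, where m_k counts the examples v_k survived."""
    v = np.zeros(n_features)
    hypotheses = []
    count = 0
    for x, y in examples:                     # y is +1 or -1
        y_hat = 1 if np.dot(v, x) >= 0 else -1
        if y_hat == y:
            count += 1                        # v_k survives this example
        else:                                 # mistake: store v_k, then update
            hypotheses.append((v.copy(), count))
            v = v + y * x
            count = 1
    hypotheses.append((v.copy(), count))
    return hypotheses
```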
[Figure: (1) a target $\mathbf{u}$, with the margin $2\gamma$ marked between $\mathbf{u}$ and $-\mathbf{u}$; (2) the guess $\mathbf{v}_1$ after one positive example $+\mathbf{x}_1$.]
[Figure: (3a) the guess $\mathbf{v}_2$ after two positive examples, $\mathbf{v}_2 = \mathbf{v}_1 + \mathbf{x}_2$; (3b) the guess $\mathbf{v}_2$ after one positive and one negative example, $\mathbf{v}_2 = \mathbf{v}_1 - \mathbf{x}_2$.]
I want to show two things:
1. The v's get closer and closer to u: $\mathbf{v} \cdot \mathbf{u}$ increases with each mistake.
2. The v's do not get too large: $\mathbf{v} \cdot \mathbf{v}$ grows slowly.
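The algebra behind these two claims is not preserved in the transcript; the following is the standard argument, assuming each example satisfies $\|\mathbf{x}_i\| \le R$ and the target $\mathbf{u}$ (with $\|\mathbf{u}\| = 1$) has margin $\gamma$, i.e. $y_i(\mathbf{u} \cdot \mathbf{x}_i) \ge \gamma$ for all $i$. On each mistake $\mathbf{v}_{k+1} = \mathbf{v}_k + y_i \mathbf{x}_i$, so

$$\mathbf{v}_{k+1} \cdot \mathbf{u} = \mathbf{v}_k \cdot \mathbf{u} + y_i (\mathbf{x}_i \cdot \mathbf{u}) \ge \mathbf{v}_k \cdot \mathbf{u} + \gamma$$

$$\|\mathbf{v}_{k+1}\|^2 = \|\mathbf{v}_k\|^2 + 2 y_i (\mathbf{v}_k \cdot \mathbf{x}_i) + \|\mathbf{x}_i\|^2 \le \|\mathbf{v}_k\|^2 + R^2$$

(the middle term is $\le 0$ precisely because this example was a mistake). Starting from $\mathbf{v}_1 = \mathbf{0}$, after $k$ mistakes $\mathbf{v}_{k+1} \cdot \mathbf{u} \ge k\gamma$ while $\|\mathbf{v}_{k+1}\| \le R\sqrt{k}$; since $\mathbf{v}_{k+1} \cdot \mathbf{u} \le \|\mathbf{v}_{k+1}\|$, this forces $k \le R^2/\gamma^2$.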
On-line to batch learning

1. Pick a $\mathbf{v}_k$ at random according to $m_k/m$, the fraction of examples it was used for.
2. Predict using the $\mathbf{v}_k$ you just picked.
3. (Actually, use some sort of deterministic approximation to this.)
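A sketch of the deterministic approximations that are typically used instead of the random draw (voting and averaging over the stored $(\mathbf{v}_k, m_k)$ pairs from the training sketch above; the weighting by $m_k$ mirrors step 1):

```python
import numpy as np

def predict_voted(hypotheses, x):
    """Voted prediction: each v_k casts m_k votes for sign(v_k . x)."""
    total = sum(m_k * (1 if np.dot(v_k, x) >= 0 else -1)
                for v_k, m_k in hypotheses)
    return 1 if total >= 0 else -1

def predict_averaged(hypotheses, x):
    """Averaged prediction: predict with the m_k-weighted average of the v_k."""
    v_avg = sum(m_k * v_k for v_k, m_k in hypotheses)
    return 1 if np.dot(v_avg, x) >= 0 else -1
```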
The voted perceptron
Some more comments

Perceptrons are like support vector machines (SVMs):
1. SVMs search for something that looks like u: a vector w where $\|\mathbf{w}\|$ is small and the margin for every example is large.
2. You can use "the kernel trick" with perceptrons: replace $\mathbf{x} \cdot \mathbf{w}$ with $(\mathbf{x} \cdot \mathbf{w} + 1)^d$ (see the sketch below).
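A minimal sketch of a kernelized perceptron, assuming (as is standard, though not spelled out on the slide) that the weight vector is kept implicitly as the sum of the mistake examples, so every inner product can be replaced by a kernel such as the polynomial kernel $(\mathbf{x} \cdot \mathbf{x}' + 1)^d$:

```python
import numpy as np

def poly_kernel(a, b, d=3):
    """Polynomial kernel: (a . b + 1)^d."""
    return (np.dot(a, b) + 1) ** d

def train_kernel_perceptron(examples, kernel=poly_kernel):
    """Store the mistakes (x_j, y_j) instead of an explicit weight vector;
    the implicit weight vector is sum_j y_j * phi(x_j)."""
    mistakes = []
    for x, y in examples:                    # y is +1 or -1
        score = sum(y_j * kernel(x_j, x) for x_j, y_j in mistakes)
        if (1 if score >= 0 else -1) != y:   # mistake: remember this example
            mistakes.append((x, y))
    return mistakes

def predict_kernel(mistakes, x, kernel=poly_kernel):
    score = sum(y_j * kernel(x_j, x) for x_j, y_j in mistakes)
    return 1 if score >= 0 else -1
```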
Experimental Results

Task: classifying hand-written digits for the post office

More Experimental Results (linear kernel, one pass over the data)