Announcements
• See Chapter 5 of Duda, Hart, and Stork.
• Tutorial by Burge linked to on web page.
• “Learning quickly when irrelevant attributes abound,” by Littlestone, Machine Learning 2:285-318, 1988.
Supervised Learning
• Classification with labeled examples.
• Images are represented as vectors in a high-dimensional space.
Supervised Learning
• Labeled examples are called the training set.
• Query examples are called the test set.
• The training and test sets must come from the same distribution.
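A minimal sketch of the training/test split just described, using scikit-learn's train_test_split; the library and the random stand-in data below are assumptions of this example, not part of the lecture. Both splits are drawn from the same pool, matching the same-distribution requirement.

```python
# Illustrative sketch: hold out a test (query) set from labeled examples.
# The data here is random and stands in for image vectors in high-D space.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4096)           # 100 "images" as vectors in high-D space
y = np.random.randint(0, 2, size=100)   # class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)   # training set and test set
```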
Linear Discriminants
• Images represented as vectors, x_1, x_2, ….
• Use these to find a hyperplane defined by a vector w and w_0.
• x is on the hyperplane: w^T x + w_0 = 0.
• Notation: a^T = [w_0, w_1, …], y^T = [1, x_1, x_2, …]. So the hyperplane is a^T y = 0.
• A query, q, is classified based on whether a^T q > 0 or a^T q < 0.
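A small sketch of the augmented notation above: a^T = [w_0, w_1, …] and y^T = [1, x_1, x_2, …], so w^T x + w_0 becomes a^T y. The specific numbers are invented for illustration.

```python
# Classify a query q by the sign of a^T y, with y the augmented query vector.
import numpy as np

w = np.array([2.0, -1.0])        # hyperplane normal w
w0 = 0.5                         # offset w_0
a = np.concatenate(([w0], w))    # a^T = [w_0, w_1, w_2]

def classify(q):
    y = np.concatenate(([1.0], q))   # y^T = [1, x_1, x_2]
    return 1 if a @ y > 0 else 0     # class decided by the sign of a^T y

print(classify(np.array([1.0, 0.0])))   # a^T y = 0.5 + 2.0 = 2.5 > 0 -> class 1
print(classify(np.array([0.0, 2.0])))   # a^T y = 0.5 - 2.0 = -1.5 < 0 -> class 0
```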
Why linear discriminants?
• Optimal if classes are Gaussian with same covariances.
• Linear separators easier to find.
• Hyperplanes have few parameters, which helps prevent overfitting.
– They have lower VC dimension, but we don't have time for this.
Linearly Separable Classes
• For one set of classes, a^T y > 0. For the other set, a^T y < 0.
• It is notationally convenient, for the second class, to make y negative.
• Then, finding a linear separator means finding a such that, for all i, a^T y_i > 0.
• Note, this is a linear program (as sketched below).
– The problem is convex, so descent algorithms converge to a global optimum.
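A hedged sketch of the linear-program view: after negating the second class, a separator exists iff the feasibility problem a^T y_i ≥ 1 has a solution (the ≥ 1 is just a rescaling of > 0). Using scipy's linprog with a zero objective is my choice for illustration.

```python
# Feasibility LP: find a with Y a >= 1, where rows of Y are augmented,
# sign-corrected sample vectors.
import numpy as np
from scipy.optimize import linprog

def find_separator(Y):
    n, d = Y.shape
    # minimize 0 subject to -Y a <= -1, with a unrestricted in sign
    res = linprog(c=np.zeros(d), A_ub=-Y, b_ub=-np.ones(n),
                  bounds=[(None, None)] * d)
    return res.x if res.success else None   # None means not linearly separable
```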
Perceptron Algorithm
Perceptron Error Function
J_P(a) = Σ_{y in Y} (−a^T y), where Y is the set of misclassified vectors.
The gradient is ∇J_P = Σ_{y in Y} (−y), so update a by:
a(k+1) = a(k) + c Σ_{y in Y} y.
Simplify by cycling through the samples and, whenever one y is misclassified, updating a := a + c·y.
This converges after a finite number of steps.
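A minimal sketch of the fixed-increment rule above, in the augmented, sign-corrected notation (so a correct classification always means a^T y > 0); the function and parameter names are mine.

```python
# Cycle through the samples; whenever a^T y <= 0, update a := a + c*y.
import numpy as np

def perceptron(Y, c=1.0, max_epochs=1000):
    """Y: rows are augmented vectors y_i with the second class already negated."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for y in Y:
            if a @ y <= 0:          # y is misclassified
                a = a + c * y       # fixed-increment update
                mistakes += 1
        if mistakes == 0:           # all a^T y > 0: a separator has been found
            return a
    return a                        # may not separate if the classes aren't separable
```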
Perceptron Intuition
Support Vector Machines
• Extremely popular.
• Find linear separator with maximum margin.
– There are some guarantees that this generalizes well.
• Can work in high-dimensional spaces without overfitting.
– Nonlinear map to a high-dimensional space, then find a linear separator.
– Special tricks allow efficiency in ridiculously high-dimensional spaces.
• Can handle non-separable classes also.
– Less important if the space is very high-dimensional.
Maximum Margin
Maximize the minimum distance from hyperplane to points.
Points at this minimum distance are support vectors.
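As a usage sketch (not part of the lecture), scikit-learn's SVC with a linear kernel and a large C finds such a maximum-margin separator and exposes the support vectors; the toy data is invented.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print(clf.support_vectors_)                   # points at the minimum distance
print(clf.coef_, clf.intercept_)              # the separator's w and w_0
```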
Geometric Intuitions
• Maximum margin between points -> Maximum margin between convex sets
A point p on the segment between p_1 and p_2 can be written p = k·p_1 + (1−k)·p_2 with 0 ≤ k ≤ 1, so
a^T p = k·a^T p_1 + (1−k)·a^T p_2,
and therefore either a^T p_1 ≤ a^T p ≤ a^T p_2 or a^T p_2 ≤ a^T p ≤ a^T p_1.
This implies the max margin hyperplane is orthogonal to the vector connecting the nearest points of the convex sets, and halfway between them.
Non-statistical learning
• There is a class of functions that could label the data.
• Our goal is to select the correct function, with as little information as possible.
• Don’t think of data coming from a class described by probability distributions.
• Look at worst-case performance.
– This is the CS-style approach.
– In the statistical model, the worst case is not meaningful.
On-Line Learning
• Let X be a set of objects (e.g., vectors in a high-dimensional space).
• Let C be a class of possible classifying functions (e.g., hyperplanes).
– Each f in C maps X -> {0,1}.
– One of these correctly classifies all the data.
• The learner is asked to classify an item in X, then told the correct class (as sketched below).
• Eventually, learner determines correct f.
– Measure number of mistakes made.
– Worst case bound for learning strategy.
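A sketch of this protocol as a loop; learner and stream are hypothetical stand-ins, and the mistake counter is the quantity bounded in the worst case.

```python
# On-line learning protocol: predict, then be told the correct class.
def run_online(learner, stream):
    mistakes = 0
    for x, true_label in stream:        # items arrive one at a time
        guess = learner.predict(x)      # learner commits to a class
        if guess != true_label:
            mistakes += 1               # number of mistakes (worst-case bound applies here)
        learner.update(x, true_label)   # correct class is then revealed
    return mistakes
```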
VC Dimension
• S, a subset of X, is shattered by C if, for any subset U of S, there exists f in C such that f is 1 on U and 0 on S − U.
• The VC dimension of C is the size of the largest set shattered by C.
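A brute-force illustration of the definition, assuming scikit-learn: try every labeling of a small set S and check whether some linear classifier realizes it (a hard-margin-ish linear SVM stands in for "there exists f in C").

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def is_shattered(S):
    S = np.asarray(S, dtype=float)
    for labels in itertools.product([0, 1], repeat=len(S)):
        y = np.array(labels)
        if y.min() == y.max():
            continue                    # all-0 / all-1 is realizable by a far-away hyperplane
        clf = SVC(kernel="linear", C=1e6).fit(S, y)
        if (clf.predict(S) != y).any():
            return False                # this labeling cannot be realized: S is not shattered
    return True

print(is_shattered([[0, 0], [1, 0], [0, 1]]))           # True: 3 points in general position
print(is_shattered([[0, 0], [1, 0], [0, 1], [1, 1]]))   # False: the XOR labeling fails
```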
VC Dimension and worst case learning
• Any learning strategy makes at least VCdim(C) mistakes in the worst case.
– If S is shattered by C, then for any assignment of labels to S, there is an f in C that makes this assignment.
– So any set of choices the learner makes for S can be entirely wrong.
Winnow
• x's elements have binary values.
• Find weights w. Classify x by whether w^T x > 1 or w^T x < 1.
• The algorithm requires that there exist weights u such that:
– u^T x > 1 when f(x) = 1
– u^T x < 1 − δ when f(x) = 0.
– That is, there is a margin of δ.
Winnow Algorithm
• Initialize w = (1, 1, …, 1).
• Let α = 1 + δ/2.
• Decision: w^T x > θ.
• If the learner is correct, the weights don't change. If wrong:
– Demotion step (prediction 1, correct response 0): w_i := w_i / α if x_i = 1; w_i unchanged if x_i = 0.
– Promotion step (prediction 0, correct response 1): w_i := α·w_i if x_i = 1; w_i unchanged if x_i = 0.
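A minimal sketch of the update table above for binary x, with θ and α as parameters; the class and method names are my own.

```python
import numpy as np

class Winnow:
    def __init__(self, n, theta, alpha):
        self.w = np.ones(n)             # initialize w = (1, 1, ..., 1)
        self.theta = theta
        self.alpha = alpha              # alpha = 1 + delta/2

    def predict(self, x):
        # x: numpy array of 0/1 values
        return 1 if self.w @ x > self.theta else 0

    def update(self, x, true_label):
        guess = self.predict(x)
        if guess == true_label:
            return                                   # correct: weights don't change
        if guess == 1 and true_label == 0:           # demotion step
            self.w[x == 1] /= self.alpha
        else:                                        # promotion step (guess 0, truth 1)
            self.w[x == 1] *= self.alpha
```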
Some intuitions
• Note that this is just like Perceptron, but with multiplicative change, not additive.
• Moves weights in the right direction:
– If w^T x was too big, shrink the weights that affect the inner product.
– If w^T x was too small, make the weights bigger.
• Weights change more rapidly (exponential with mistakes, not linear).
Theorem
The number of errors is bounded by:
(8/δ²)(n/θ) + (14/δ²)(ln θ) Σ_{i=1}^n u_i
Setting θ = n gives:
8/δ² + (14/δ²)(ln n) Σ_{i=1}^n u_i
Note: if x_i is an irrelevant feature, u_i = 0, so errors grow only logarithmically with the number of irrelevant features. Empirically, Perceptron errors grow linearly.
This is optimal for k-monotone disjunctions:
f(x_1, …, x_n) = x_{i1} ∨ x_{i2} ∨ … ∨ x_{ik}
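A small experiment in the spirit of these claims (Winnow's errors growing with the log of the number of irrelevant attributes, Perceptron's roughly linearly, on a monotone disjunction target). The setup below — uniform random binary examples, θ = n, α = 2, and a Perceptron variant with ±1 targets — is my own choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_mistakes(n, k=3, rounds=2000):
    """Online mistakes of Winnow and Perceptron learning x_1 V ... V x_k over n attributes."""
    w_win = np.ones(n); theta = float(n); alpha = 2.0   # Winnow state
    w_per = np.zeros(n); b = 0.0                        # Perceptron state
    m_win = m_per = 0
    for _ in range(rounds):
        x = rng.integers(0, 2, size=n)
        label = 1 if x[:k].any() else 0                 # target: disjunction of first k attributes
        # Winnow: multiplicative update on a mistake
        if (w_win @ x > theta) != (label == 1):
            m_win += 1
            w_win[x == 1] *= alpha if label == 1 else 1.0 / alpha
        # Perceptron: additive update on a mistake (targets +-1)
        t = 1 if label == 1 else -1
        if t * (w_per @ x + b) <= 0:
            m_per += 1
            w_per = w_per + t * x
            b += t
    return m_win, m_per

for n in (20, 100, 500):
    print(n, count_mistakes(n))   # (Winnow mistakes, Perceptron mistakes)
```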
Winnow Summary
• Simple algorithm; variation on Perceptron.
• Great with irrelevant attributes.
• Optimal for monotone disjunctions.