
Announcements

• See Chapter 5 of Duda, Hart, and Stork.

• Tutorial by Burge linked to on web page.

• “Learning quickly when irrelevant attributes abound,” by Littlestone, Machine Learning 2:285-318, 1988.

Supervised Learning

• Classification with labeled examples.

• Images are represented as vectors in a high-dimensional space.
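For instance, each image can be flattened into one long feature vector; a minimal sketch (made-up image size), assuming NumPy:

```python
import numpy as np

# A hypothetical 28x28 grayscale image becomes a single point in R^784.
image = np.random.rand(28, 28)
x = image.reshape(-1)    # flatten to a 784-dimensional feature vector
print(x.shape)           # (784,)
```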

Supervised Learning

• Labeled examples are called the training set.

• Query examples are called the test set.

• Training and test sets must come from the same distribution.

Linear Discriminants

• Images represented as vectors, x_1, x_2, ….

• Use these to find a hyperplane defined by vector w and w_0. x is on the hyperplane when: w^T x + w_0 = 0.

• Notation: a^T = [w_0, w_1, …], y^T = [1, x_1, x_2, …]. So the hyperplane is a^T y = 0.

• A query, q, is classified based on whether a^T q > 0 or a^T q < 0.
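To make the notation concrete, here is a small sketch (illustrative values only, assuming NumPy) of classifying a query by the sign of a^T y:

```python
import numpy as np

# Hyperplane parameters in the augmented notation: a = [w0, w1, w2].
a = np.array([-1.0, 2.0, 0.5])

def classify(q):
    """Classify a query by the sign of a^T y, where y = [1, q1, q2, ...]."""
    y = np.concatenate(([1.0], q))
    return 1 if a @ y > 0 else -1    # which side of the hyperplane a^T y = 0

print(classify(np.array([1.0, 0.2])))   # a^T y = 1.1 > 0, so class +1
print(classify(np.array([0.1, 0.2])))   # a^T y = -0.7 < 0, so class -1
```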

Why linear discriminants?

• Optimal if classes are Gaussian with same covariances.

• Linear separators easier to find.

• Hyperplanes have few parameters, which helps prevent overfitting.

– They have a lower VC dimension, but we don’t have time to cover this.

Linearly Separable Classes

• For one set of classes, a^T y > 0. For the other set, a^T y < 0.

• Notationally convenient if, for the second class, we make y negative (replace y with −y).

• Then, finding a linear separator means finding a such that, for all i, a^T y_i > 0.

• Note, this is a linear program.

– Problem convex, so descent algorithms converge to global optimum.
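As an illustration (not from the slides), the feasibility problem can be posed as a linear program by asking for a margin, a^T y_i ≥ 1, which implies a^T y_i > 0; a minimal sketch assuming SciPy's linprog and hypothetical data:

```python
import numpy as np
from scipy.optimize import linprog

# Rows are augmented, sign-normalized examples y_i; the second class is already negated.
Y = np.array([[ 1.0, 2.0, 1.0],
              [ 1.0, 1.5, 2.0],
              [-1.0, 0.5, 0.5],
              [-1.0, 1.0, 0.2]])

n_dims = Y.shape[1]
res = linprog(c=np.zeros(n_dims),               # feasibility only: minimize 0
              A_ub=-Y, b_ub=-np.ones(len(Y)),   # -Y a <= -1  <=>  Y a >= 1
              bounds=[(None, None)] * n_dims)   # allow negative components of a

if res.success:
    print("separator a =", res.x, " margins:", Y @ res.x)
else:
    print("not linearly separable (with margin 1)")
```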

Perceptron Algorithm

Perceptron Error Function

J_P(a) = Σ_{y ∈ Y} (−a^T y), where Y is the set of misclassified vectors.

∇J_P = Σ_{y ∈ Y} (−y), so update a by:

a(k+1) = a(k) + c(k) Σ_{y ∈ Y} y

Simplify by cycling through the examples: whenever a y is misclassified, update a := a + c y.

This converges after a finite number of steps.
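A minimal sketch (not from the slides, assuming NumPy) of this single-sample rule, with examples already augmented and sign-normalized so that every y should satisfy a^T y > 0:

```python
import numpy as np

def perceptron(Y, c=1.0, max_epochs=1000):
    """Cycle through the rows of Y; whenever a^T y <= 0, update a <- a + c*y."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for y in Y:
            if a @ y <= 0:        # misclassified (or on the boundary)
                a = a + c * y     # single-sample perceptron update
                mistakes += 1
        if mistakes == 0:         # all examples satisfy a^T y > 0
            return a
    return a                      # may not separate non-separable data

# Hypothetical augmented, sign-normalized examples.
Y = np.array([[ 1.0, 2.0, 1.0],
              [ 1.0, 1.5, 2.0],
              [-1.0, 0.5, 0.5],
              [-1.0, 1.0, 0.2]])
print(perceptron(Y))
```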

Perceptron Intuition

Support Vector Machines

• Extremely popular.

• Find linear separator with maximum margin.

– There are some guarantees that this generalizes well.

• Can work in high-dimensional space without overfitting.

– Nonlinear map to a high-dimensional space, then find a linear separator.

– Special tricks allow efficiency in ridiculously high-dimensional spaces.

• Can handle non-separable classes also.

– Not as important if the space is very high-dimensional.

Maximum Margin

Maximize the minimum distance from hyperplane to points.

Points at this minimum distance are support vectors.
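For illustration (not part of the slides), a maximum-margin linear separator and its support vectors can be obtained with scikit-learn, assuming it is installed; the data here are made up:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D training data for two classes.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([1, 1, 1, 0, 0, 0])

# A large C approximates the hard-margin (separable) case.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w  =", clf.coef_[0])           # normal vector of the separating hyperplane
print("w0 =", clf.intercept_[0])      # offset
print("support vectors:\n", clf.support_vectors_)
```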

Geometric Intuitions

• Maximum margin between points -> Maximum margin between convex sets

For a point p = k p_1 + (1 − k) p_2 on the segment between p_1 and p_2, with 0 ≤ k ≤ 1:

a^T p = k a^T p_1 + (1 − k) a^T p_2

so either a^T p_1 ≤ a^T p ≤ a^T p_2 or a^T p_2 ≤ a^T p ≤ a^T p_1.

This implies max margin hyperplane is orthogonal to vector connecting nearest points of convex sets, and halfway between.
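A quick numerical check of the identity above (illustrative only, assuming NumPy): for any convex combination of p_1 and p_2, a^T p lands between a^T p_1 and a^T p_2.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=3)                        # arbitrary hyperplane normal
p1, p2 = rng.normal(size=3), rng.normal(size=3)

for k in np.linspace(0.0, 1.0, 11):
    p = k * p1 + (1 - k) * p2                 # convex combination
    lo, hi = sorted((a @ p1, a @ p2))
    assert lo - 1e-12 <= a @ p <= hi + 1e-12  # a^T p lies between the endpoints
print("a^T p stays between a^T p1 and a^T p2 for all k in [0, 1]")
```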

Non-statistical learning

• There is a class of functions that could label the data.

• Our goal is to select the correct function, with as little information as possible.

• Don’t think of the data as coming from a class described by probability distributions.

• Look at worst-case performance.

– This is the CS-style approach.

– In a statistical model, the worst case is not meaningful.

On-Line Learning

• Let X be a set of objects (e.g., vectors in a high-dimensional space).

• Let C be a class of possible classifying functions (e.g., hyperplanes).

– Each f in C maps X -> {0,1}.

– One of these correctly classifies all the data.

• The learner is asked to classify an item in X, then told the correct class.

• Eventually, learner determines correct f.

– Measure number of mistakes made.

– Worst-case bound for the learning strategy.
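To make the protocol concrete, a small sketch of the mistake-counting loop (my own illustration, with hypothetical learner_predict and learner_update hooks): the learner predicts, is told the correct class, and we count the wrong predictions.

```python
def count_mistakes(learner_predict, learner_update, stream):
    """Run the on-line protocol on (x, correct_label) pairs and count mistakes."""
    mistakes = 0
    for x, correct in stream:
        if learner_predict(x) != correct:    # the learner's guess was wrong
            mistakes += 1
        learner_update(x, correct)           # the learner always sees the true label
    return mistakes
```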

VC Dimension

• S, a subset of X, is shattered by C if, for any U, a subset of S, there exists f in C such that f is 1 on U and 0 on S − U.

• The VC dimension of C is the size of the largest set shattered by C.
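As an illustration (not from the slides), shattering can be checked by brute force for small sets; for example, 1-D threshold classifiers f_t(x) = [x ≥ t] shatter any single point but no pair, so their VC dimension is 1:

```python
from itertools import combinations

def shatters(classifiers, S):
    """True if the classifiers achieve every possible labeling of the points in S."""
    achieved = {tuple(f(x) for x in S) for f in classifiers}
    return len(achieved) == 2 ** len(S)

def vc_dimension(classifiers, points):
    """Size of the largest subset of `points` shattered by `classifiers` (brute force)."""
    best = 0
    for k in range(1, len(points) + 1):
        if any(shatters(classifiers, S) for S in combinations(points, k)):
            best = k
    return best

# 1-D thresholds: f_t(x) = 1 if x >= t else 0, for a grid of thresholds t.
thresholds = [lambda x, t=t: int(x >= t) for t in range(-5, 6)]
print(vc_dimension(thresholds, [-2, 0, 3]))   # prints 1
```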

VC Dimension and worst case learning

• Any learning strategy makes at least VCdim(C) mistakes in the worst case.

– If S is shattered by C, then for any assignment of values to S, there is an f in C that makes this assignment.

– So any set of choices the learner makes for S can be entirely wrong.

Winnow

x’s elements have binary values.

• Find weights w. Classify x by whether w^T x > 1 or w^T x < 1.

• The algorithm requires that there exist weights u such that:

– u^T x > 1 when f(x) = 1

– u^T x < 1 − δ when f(x) = 0.

– That is, there is a margin of δ.

Winnow Algorithm

• Initialize w = (1, 1, …, 1).

• Let α = 1 + δ/2.

• Decision: w^T x > θ.

• If the learner is correct, the weights don’t change. If wrong:

– Learner predicted 1 but the correct response is 0 (demotion step): w_i := w_i / α if x_i = 1; w_i unchanged if x_i = 0.

– Learner predicted 0 but the correct response is 1 (promotion step): w_i := α w_i if x_i = 1; w_i unchanged if x_i = 0.
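A minimal sketch of this update rule in Python (my own illustration, assuming NumPy; θ and α are passed in as parameters):

```python
import numpy as np

def winnow(stream, n, theta=None, alpha=1.5):
    """Run Winnow on (x, label) pairs with binary x in {0,1}^n.
    Demote on false positives, promote on false negatives."""
    theta = n if theta is None else theta   # the theorem below sets theta = n
    w = np.ones(n)                          # initialize w = (1, 1, ..., 1)
    mistakes = 0
    for x, label in stream:
        x = np.asarray(x)
        pred = int(w @ x > theta)           # decision: w^T x > theta
        if pred == label:
            continue                        # correct: weights don't change
        mistakes += 1
        if pred == 1 and label == 0:
            w[x == 1] /= alpha              # demotion step
        else:
            w[x == 1] *= alpha              # promotion step
    return w, mistakes
```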

Some intuitions

• Note that this is just like Perceptron, but with multiplicative change, not additive.

• Moves weights in the right direction:

– If w^T x was too big, shrink the weights that affect the inner product.

– If w^T x was too small, make those weights bigger.

• Weights change more rapidly (exponentially in the number of mistakes, not linearly).

Theorem

Number of errors bounded by:

(8/δ²)(n/θ) + (14/δ²)(ln θ) Σ_{i=1}^{n} u_i

Setting θ = n:

8/δ² + (14/δ²)(ln n) Σ_{i=1}^{n} u_i

Note: if x_i is an irrelevant feature, u_i = 0.

Errors grow logarithmically with irrelevant features. Empirically, Perceptron errors grow linearly.

This is optimal for k-monotone disjunctions:

f(x_1, …, x_n) = x_{i1} ∨ x_{i2} ∨ … ∨ x_{ik}
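For instance (illustrative numbers, using the winnow sketch given earlier), running it on a 2-literal monotone disjunction with many irrelevant attributes keeps the mistake count small:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                                       # mostly irrelevant attributes
target = lambda x: int(x[0] or x[1])          # f(x) = x_1 V x_2, so k = 2

stream = []
for _ in range(2000):
    x = (rng.random(n) < 0.05).astype(int)    # sparse random binary vectors
    stream.append((x, target(x)))

# delta = 1 for disjunctions, so alpha = 1 + delta/2 = 1.5; theta defaults to n.
w, mistakes = winnow(stream, n, alpha=1.5)
print("mistakes:", mistakes)                  # grows only logarithmically with n
```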

Winnow Summary

• Simple algorithm; variation on Perceptron.

• Great with irrelevant attributes.

• Optimal for monotone disjunctions.