Pattern Classification & Decision Theory
How are we doing on the pass sequence?
• Bayesian regression and estimation enables us to track the man in the striped shirt based on labeled data
• Can we track the man in the white shirt? Not very well.
• Regression fails to identify that there really are two classes of solution
[Figure: hand-labeled horizontal coordinate t plotted against feature x]
Decision theory
• We wish to classify an input x as belonging to one of K classes, where class k is denoted C_k
• Example: Buffalo digits, 10 classes, 16x16 images, each image x is a 256-dimensional vector, x_m ∈ [0,1]
Decision theory
• We wish to classify an input x as belonging to one of K classes, where class k is denoted C_k
• Partition the input space into regions R_1, …, R_K, so that if x ∈ R_k, our classifier predicts class C_k
• How should we choose the partition?
• Suppose x is presented with probability p(x) and the distribution over the class labels given x is p(C_k|x)
• Then, p(correct) = Σ_k ∫_{x ∈ R_k} p(x) p(C_k|x) dx
• This is maximized by assigning each x to the region whose class maximizes p(C_k|x) (see the sketch below)
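A minimal sketch of this decision rule, assuming the posteriors p(C_k|x) for one input are already available (the values below are hypothetical):

```python
import numpy as np

# Hypothetical posteriors p(C_k | x) for a single input x and K = 3 classes.
posterior = np.array([0.2, 0.7, 0.1])

# Bayes-optimal decision: assign x to the class with the largest posterior.
predicted_class = np.argmax(posterior)          # -> 1
p_correct_for_x = posterior[predicted_class]    # this x's contribution to p(correct)
print(predicted_class, p_correct_for_x)
```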
Three approaches to pattern classification
1. Discriminative and non-probabilistic
– Learn a discriminant function f(x), which maps x directly to a class label
2. Discriminative and probabilistic
– For each class k, learn the probability model p(C_k|x)
– Use this probability to classify a new input x
[Figure: a discriminant function f(x) taking the values 0, 1, 2, 3]
Three approaches to pattern classification
3. Generative
– For each class k, learn the generative probability model p(x|C_k)
– Also, learn the class probabilities p(C_k)
– Use Bayes’ rule to classify a new input: p(C_k|x) = p(x|C_k) p(C_k) / p(x), where p(x) = Σ_j p(x|C_j) p(C_j)
Three approaches to pattern classification
1. Discriminative and non-probabilistic
– Learn a discriminant function f(x), which maps x directly to a class label
2. Discriminative and probabilistic
– For each class k, learn the probability model p(C_k|x)
– Use this probability to classify a new input x
[Figure: a discriminant function f(x) taking the values 0, 1, 2, 3]
Can we use regression to learn discriminant functions?
[Figure: a discriminant function f(x) taking the values 0, 1, 2, 3]
Can we use regression to learn discriminant functions?
• What do the classification regions look like?
• Is there any sense in which squared error is an appropriate cost function for classification?
• We should be careful not to interpret integer-valued class labels as ordinal targets for regression
[Figure: two candidate regression fits f(x) to the integer class labels 0, 1, 2, 3]
The one-of-K representation
• For K > 2 classes, each class is represented by a binary vector t with a 1 indicating the class:
  t = (1, 0, 0, …, 0)ᵀ,  t = (0, 1, 0, …, 0)ᵀ,  …
• K regression problems: for each class k, fit y_k(x) to the kth element of t
• To classify x, pick the class k with the largest y_k(x) (see the sketch below)
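A rough sketch of this recipe on synthetic data (not the Buffalo digits; all names and values below are illustrative): fit one least-squares regression per column of the one-of-K target matrix, then classify by the largest output y_k(x).

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 300, 2, 3

# Synthetic data: three Gaussian blobs, integer labels 0..K-1.
labels = rng.integers(0, K, size=N)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = centers[labels] + rng.normal(scale=0.7, size=(N, M))

# One-of-K (one-hot) targets T, and a column of ones for the bias weight.
T = np.eye(K)[labels]
Xb = np.hstack([np.ones((N, 1)), X])

# K least-squares regression problems solved at once: W is (M+1) x K.
W, *_ = np.linalg.lstsq(Xb, T, rcond=None)

# Classify by picking the class k with the largest y_k(x).
pred = np.argmax(Xb @ W, axis=1)
print("training accuracy:", np.mean(pred == labels))
```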
Let’s focus on binary classification
• Predict target t ∈ {0,1} from input vector x
• Denote the mth input of training case n by x_nm
Classification boundary for linear regression
• Values of x where y(x) = 0.5 are ambiguous – these form the classification boundary
• For these points, Σ_m w_m x_m = 0.5
• If x is (M+1)-dimensional, this defines an M-dimensional hyperplane separating the classes
How well does linear regression work?
Works well in some cases, but there are two problems:
• Due to linearity, “extreme” x’s cause extreme y(x,w)’s
• Due to squared error, extreme y(x,w)’s dominate learning
[Figure: fits from linear regression vs. logistic regression (more later)]
Clipping off extreme y’s
• To clip off extremes of Σ_m w_m x_m, we can use a sigmoid function:
  y(x,w) = σ(Σ_m w_m x_m), where σ(a) = 1/(1 + e^(−a))
• Now, squared error won’t penalize extreme x’s
• y is now a non-linear function of w, so learning is harder
How the observed error propagates back to the parameters
• E(w) = ½ Σ_n (t_n − σ(Σ_m w_m x_nm))², with y_n = σ(Σ_m w_m x_nm)
• The rate of change of E w.r.t. w_m is
  ∂E(w)/∂w_m = −Σ_n (t_n − y_n) y_n (1 − y_n) x_nm
  – Useful fact: σ′(a) = σ(a)(1 − σ(a))
• Compare with linear regression (both gradients are sketched in code below):
  ∂E(w)/∂w_m = −Σ_n (t_n − y_n) x_nm
The effect of clipping
• Regression with sigmoid:
  ∂E(w)/∂w_m = −Σ_n (t_n − y_n) y_n (1 − y_n) x_nm
• Linear regression:
  ∂E(w)/∂w_m = −Σ_n (t_n − y_n) x_nm
• For these outliers, both (t_n − y_n) ≈ 0 and y_n(1 − y_n) ≈ 0, so the outliers won’t hold back improvement of the boundary
Squared error for learning probabilities
• If t = 0 and y ≈ 1, y is moderately pulled down (grad ≈ 1)
• If t = 0 and y ≈ 0, y is weakly pulled down (grad ≈ 0)
[Figure: E = ½(t − y)² as a function of y for t = 0; dE/dy = 0 at y = 0 and dE/dy = 1 at y = 1]
Problems:
• Being certainly wrong is often undesirable
• Often, tiny differences between small probabilities count a lot
Three approaches to pattern classification
1. Discriminative and non-probabilistic
– Learn a discriminant function f(x), which maps x directly to a class label
2. Discriminative and probabilistic
– For each class k, learn the probability model p(C_k|x)
– Use this probability to classify a new input x
[Figure: a discriminant function f(x) taking the values 0, 1, 2, 3]
Logistic regression: Binary likelihood
• As before, we use y(x) = σ(Σ_m w_m x_m), where σ(a) = 1/(1 + e^(−a))
• Now, use the binary likelihood: p(t|x) = y(x)^t (1 − y(x))^(1−t)
• Data log-likelihood:
  L = Σ_n t_n ln σ(Σ_m w_m x_nm) + (1 − t_n) ln(1 − σ(Σ_m w_m x_nm))
• Unlike linear regression, L is nonlinear in the w’s, so gradient-based optimizers are needed
Binary likelihood for learning probabilities
• If t = 0 and y ≈ 1, y is strongly pulled down (grad → ∞)
• If t = 0 and y ≈ 0, y is moderately pulled down (grad ≈ 1)
[Figure: E = −ln(1 − y) as a function of y for t = 0; dE/dy = 1 at y = 0 and dE/dy → ∞ as y → 1]
How the observed error propagates back to the parameters
• L = Σ_n t_n ln σ(Σ_m w_m x_nm) + (1 − t_n) ln(1 − σ(Σ_m w_m x_nm)), with y_n = σ(Σ_m w_m x_nm)
• The rate of change of L w.r.t. w_m is
  ∂L/∂w_m = Σ_n (t_n − y_n) x_nm
• Compare with sigmoid plus squared error:
  ∂E(w)/∂w_m = −Σ_n (t_n − y_n) y_n (1 − y_n) x_nm
• Compare with linear regression (a training sketch using the gradient above follows):
  ∂E(w)/∂w_m = −Σ_n (t_n − y_n) x_nm
How well does logistic regression work?
[Figure: decision boundaries from linear regression vs. logistic regression]
Multiclass logistic regression
• Create one set of weights per class and define y_k(x) = Σ_m w_km x_m
• The K-class generalization of the sigmoid function is
  p(t|x) = exp(Σ_k t_k y_k(x)) / Σ_k exp(y_k(x))
which is equivalent to
  p(C_k|x) = exp(y_k(x)) / Σ_j exp(y_j(x))
• Learning: Similar to logistic regression (see textbook); a small numerical sketch follows
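A sketch of the K-class posterior p(C_k|x) = exp(y_k(x)) / Σ_j exp(y_j(x)) with made-up weights; subtracting the maximum score before exponentiating is a common numerical precaution, not something required by the formula itself:

```python
import numpy as np

def softmax(a):
    # Subtract the max for numerical stability; the result is unchanged mathematically.
    a = a - np.max(a)
    e = np.exp(a)
    return e / np.sum(e)

# Hypothetical per-class scores y_k(x) = sum_m w_km x_m for K = 3 classes (bias omitted).
W = np.array([[ 0.5, -0.2],
              [ 0.1,  0.8],
              [-0.3,  0.4]])   # one weight vector per class
x = np.array([1.0, 2.0])

y = W @ x                      # y_k(x) for each class k
posterior = softmax(y)         # p(C_k | x)
print(posterior, "predicted class:", np.argmax(posterior))
```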
Three approaches to pattern classification
3. Generative
– For each class k, learn the generative probability model p(x|C_k)
– Also, learn the class probabilities p(C_k)
– Use Bayes’ rule to classify a new input: p(C_k|x) = p(x|C_k) p(C_k) / p(x), where p(x) = Σ_j p(x|C_j) p(C_j)
Gaussian generative models
• We can assume each element of x is independent and Gaussian, given the class:
  p(x|C_k) = Π_m p(x_m|C_k) = Π_m N(x_m | μ_km, σ_km²)
• Contour plot of the density: an axis-aligned ellipse centered at (μ_k1, μ_k2) with widths 2σ_k1 and 2σ_k2; for an isotropic Gaussian, σ_k1² = σ_k2²
Learning a Buffalo digit classifier
(5000 training cases)
• The generative ML estimates of μ_km and σ_km² are just the data means and variances:
  [Images: per-class pixel means; per-class pixel variances (black = low variance, white = high variance)]
• The classes are equally frequent, so p(C_k) = 1/10
• To classify a new input x, compute (in the log-domain!)
  ln p(C_k) + Σ_m ln N(x_m | μ_km, σ_km²)  for each class k, and pick the largest
A problem with the ML estimate
• Some pixels were constant across all training images within a class, so σ_ML² = 0
• This causes numerical problems when evaluating Gaussian densities, but is also an overfitting problem
• Common hack: Add σ_min² to all variances (see the sketch below)
• More principled approaches
  – Regularize σ²
  – Place a prior on σ² and use MAP
  – Place a prior on σ² and use Bayesian learning
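A sketch of the whole generative recipe, assuming synthetic “digit-like” data rather than the actual Buffalo set: per-class ML means and variances, the σ_min² floor, and log-domain classification.

```python
import numpy as np

rng = np.random.default_rng(3)
K, D, N_per = 10, 256, 100
sigma_min2 = 0.01

# Synthetic stand-in for the digit data: one prototype per class plus noise, clipped to [0, 1].
prototypes = rng.uniform(size=(K, D))
labels = np.repeat(np.arange(K), N_per)
X = np.clip(prototypes[labels] + rng.normal(scale=0.1, size=(K * N_per, D)), 0.0, 1.0)

# ML estimates per class, with the sigma_min^2 "hack" added to every variance.
means = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
variances = np.stack([X[labels == k].var(axis=0) for k in range(K)]) + sigma_min2
log_prior = np.log(np.full(K, 1.0 / K))   # classes are equally frequent

def classify(x):
    # log p(C_k) + sum_m log N(x_m | mu_km, sigma_km^2), evaluated in the log domain.
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances)
                            + (x - means) ** 2 / variances, axis=1)
    return np.argmax(log_prior + log_lik)

pred = np.array([classify(x) for x in X])
print("training error rate:", np.mean(pred != labels))
```

The variance floor keeps every σ_km² strictly positive, so the log-densities stay finite even for pixels that were constant within a class.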
Classifying test data
(5000 test cases)
• Adding σ_min² = 0.01 to the variances, we obtain:
  – Error rate on training set = 16.00% (std dev 0.5%)
  – Error rate on test set = 16.72% (std dev 0.5%)
• Effect of the value of σ_min² on error rates:
  [Figure: training and test error rates plotted against log10 σ_min²]
Full-covariance Gaussian models
• Let x = Ly, where y is isotropic Gaussian and L is an M x M rotation and scale matrix
• This generates a full-covariance Gaussian (a numerical check follows below)
• Defining Σ = ((L⁻¹)ᵀ L⁻¹)⁻¹, we obtain
  p(x) = (2π)^(−M/2) |Σ|^(−1/2) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
  (|Σ| is the determinant of Σ)
where Σ is the covariance matrix: Σ_jk = COV(x_j, x_k)
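A quick numerical check of the construction x = Ly (the matrix L below is an arbitrary illustrative choice): samples generated this way have covariance L Lᵀ, which agrees with Σ = ((L⁻¹)ᵀ L⁻¹)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 2, 200000

# An arbitrary invertible "rotation and scale" matrix L (illustrative values).
L = np.array([[2.0, 0.0],
              [1.0, 0.5]])

# y is isotropic Gaussian; x = L y then has full covariance Sigma = L L^T.
y = rng.normal(size=(M, N))
x = L @ y

Sigma_formula = np.linalg.inv(np.linalg.inv(L).T @ np.linalg.inv(L))  # ((L^-1)^T L^-1)^-1
Sigma_sample = np.cov(x)

print(np.round(Sigma_formula, 2))
print(np.round(Sigma_sample, 2))   # close to L @ L.T for large N
```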
Generative models easily induce non-linear
decision boundaries
• The following three-class problem shows how three
axis-aligned Gaussians can induce nonlinear decision
boundaries
Generative models easily induce non-linear
decision boundaries
• Two Gaussians can be used to account for “inliers”
and “outliers”
How are we doing on the pass sequence?
• Bayesian regression and estimation enables us to track the man in the striped shirt based on labeled data
• Can we track the man in the white shirt? Not very well.
• Regression fails to identify that there really are two classes of solution
[Figure: hand-labeled horizontal coordinate t plotted against feature x]
Using classification to improve tracking
• The position of the man in the striped shirt can be used to classify the tracking mode of the man in the white shirt
[Figure: position of man in white shirt vs. feature, and position of man in striped shirt vs. feature]
Using classification to improve tracking
• For the man in the white shirt, hand-label the two regions (classes) and learn two trackers, one per class
• x_s and t_s = feature and position of man in striped shirt
• x_w and t_w = feature and position of man in white shirt
[Figure: p(t_s|x_s) vs. feature x_s; the class posteriors p(class|t_s); and the two class-conditional trackers p(t_w|x_w, class) vs. feature x_w]
Using classification to improve tracking
• The classifier can be obtained using the generative approach, where each class-conditional likelihood is a Gaussian
• Note: Linear classifiers won’t work
[Figure: p(t_s|x_s) vs. feature x_s; the class posteriors p(class|t_s); and the class-conditional likelihoods p(t_s|class)]
Questions?
How are we doing on the pass sequence?
• We can now track both men, provided with
– Hand-labeled coordinates of both men in 30 frames
– Hand-extracted features (stripe detector, white blob
detector)
– Hand-labeled classes for the white-shirt tracker
• We have a framework for how to optimally
make decisions and track the men
How are we doing on the pass sequence?
• We can now track both men, provided with
  – Hand-labeled coordinates of both men in 30 frames
  – Hand-extracted features (stripe detector, white blob detector)
  – Hand-labeled classes for the white-shirt tracker
  (This takes too much time to do by hand!)
• We have a framework for how to optimally make decisions and track the men
Lecture 4 Appendix
Binary classification regions for linear regression
• R_1 is defined by y(x) ≥ 0, and vice versa for R_2
• Values of x satisfying y(x) = 0 are on the decision boundary, which is a D−1 dimensional hyperplane
  – w specifies the orientation of the decision hyperplane
  – −w_0/||w|| specifies the distance from the hyperplane to the origin
  – The distance from input x to the hyperplane is y(x)/||w||
K-ary classification regions for linear regression
• x ∈ R_k if y_k(x) ≥ y_j(x) for all j ≠ k
• Each resulting classification region is contiguous and has linear boundaries
Fisher’s linear discriminant and least squares
• Fisher: Viewing y = wᵀx as a projection, pick w to maximize the distance between the means of the data sets, while also minimizing the variances of the data sets
• This result is also obtained using linear regression, by setting t = N/N_1 for class 1 and t = −N/N_2 for class 2, where N_k = # training cases in class k
In what sense is logistic regression linear?
• The log-odds can be written thus:
  ln( y(x) / (1 − y(x)) ) = Σ_m w_m x_m
• Each input contributes linearly to the log-odds
Gaussian likelihoods and logistic regression
• For two classes, if their covariance matrices are equal, Σ_1 = Σ_2 = Σ, we can write the log-odds as
  ln[ p(C_1|x) / p(C_2|x) ] = ln[ p(x|C_1) p(C_1) / (p(x|C_2) p(C_2)) ] = wᵀx + w_0, where w = Σ⁻¹(μ_1 − μ_2)
• So classifiers using equal-covariance Gaussian generative models form a subset of logistic regression classifiers; both are “linear models”, i.e. classifiers with “linear boundaries” (a sketch of the derivation follows)
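A compact sketch of the algebra behind this equivalence (a standard manipulation for shared-covariance Gaussians; the class priors enter only through the bias term w_0):

```latex
\ln\frac{p(C_1\mid x)}{p(C_2\mid x)}
  = \ln\frac{p(x\mid C_1)\,p(C_1)}{p(x\mid C_2)\,p(C_2)}
  = w^{\top} x + w_0,
\qquad
w = \Sigma^{-1}(\mu_1 - \mu_2),
\qquad
w_0 = -\tfrac{1}{2}\mu_1^{\top}\Sigma^{-1}\mu_1
      + \tfrac{1}{2}\mu_2^{\top}\Sigma^{-1}\mu_2
      + \ln\frac{p(C_1)}{p(C_2)}
```

Hence p(C_1|x) = σ(wᵀx + w_0), which is exactly the logistic-regression form; the quadratic terms xᵀΣ⁻¹x cancel only because the two covariances are equal.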
Don’t be fooled
• Such classifiers can be very hard to learn
• Such classifiers may have boundaries that are highly nonlinear in x (e.g., via basis functions)
• All this means is that in some space the boundaries are linear