Statistical Decision Theory, Bayes Classifier Lecture Notes for CMPUT 466/551 Nilanjan Ray

Supervised Learning Problem
Input (random) variable: X
Output (random) variable: Y
A toy example: $X \sim N(\mu, \sigma^2)$, $Y = f(X)$, where the function $f$ is unknown.
A set of realizations of X and Y is available in (input, output) pairs:
$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$$
This set T is known as the training set.
Supervised learning problem: given the training set, estimate the value of the output variable Y for a new input variable value $X = x_0$.
Essentially, we try to learn the unknown function f from T.
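The setup above can be sketched in a few lines of code. The specific choices below ($f(x) = 2x + 1$, $\mu = 0$, $\sigma = 1$, $N = 100$) are illustrative assumptions, not from the notes:

```python
import random

# Hypothetical "unknown" function f that we will later try to learn from T
def f(x):
    return 2.0 * x + 1.0

random.seed(0)
N = 100
# Draw inputs X ~ N(mu, sigma^2) with mu = 0, sigma = 1 (assumed values)
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
# Training set T = {(x_i, y_i)}; here we observe f exactly (no noise)
T = [(x, f(x)) for x in xs]
```

The learner sees only the pairs in `T`, never `f` itself.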
Basics of Statistical Decision Theory
We want to attack the supervised learning problem from the viewpoint of
probability and statistics. Thus, let’s consider X and Y as random variables
with a joint probability distribution Pr(X,Y).
Assume a loss function $L(Y, f(X))$, such as the squared (L2) loss:
$$L(Y, f(X)) = (Y - f(X))^2$$
Choose f so that the expected prediction error (EPE) is minimized:
$$EPE(f) = E[L(Y, f(X))] = \int (y - f(x))^2 \Pr(dx, dy)$$
The minimizer is the conditional expectation, known as the regression function (also known as (non-linear) filtering in signal processing):
$$\hat{f}(x) = E(Y \mid X = x)$$
So, if we knew $\Pr(Y|X)$, we could readily estimate the output.
See [HTF] for the derivation of this minimization; also see [Bishop] for a different derivation, which uses heavier machinery, the calculus of variations.
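A quick numerical sanity check of this result: among all constant predictions $c$, the empirical squared loss is minimized at the sample mean (the empirical analogue of $E(Y \mid X = x)$). The data distribution and grid below are illustrative assumptions:

```python
import random

random.seed(1)
# Samples of Y at a fixed x (assumed N(3, 1) for illustration)
ys = [random.gauss(3.0, 1.0) for _ in range(1000)]

def epe(c):
    # Empirical expected squared loss of the constant prediction c
    return sum((y - c) ** 2 for y in ys) / len(ys)

mean = sum(ys) / len(ys)
# Grid search over candidate constants in [0, 6] with step 0.01
grid = [i / 100.0 for i in range(0, 601)]
best = min(grid, key=epe)
# best lands on the grid point nearest the sample mean
```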
Loss Functions and Regression Functions
L1 loss: $L(Y, f(X)) = |Y - f(X)|$. The regression function is the conditional median:
$$\hat{f}(x) = \mathrm{Median}(Y \mid X = x)$$
0-1 loss: $L(Y, f(X)) = 0$ if $Y = f(X)$, and $1$ otherwise. The regression function is the conditional mode:
$$\hat{f}(x) = \arg\max_y \Pr(Y = y \mid X = x)$$
Other loss functions:
– Exponential loss function: used in boosting…
– Huber loss function: robust to outliers…
– Your loss function: own student award in …
Observation: the estimator/regression function is always defined in terms of the conditional probability $\Pr(Y|X)$.
$\Pr(Y|X)$ is typically unknown to us, so what can we do?
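The L1 claim can also be checked empirically: on skewed data, the constant minimizing the average absolute loss is the sample median, not the mean. The exponential data distribution and grid below are illustrative assumptions:

```python
import random
import statistics

random.seed(2)
# Skewed samples (exponential), so the mean and median clearly differ
ys = [random.expovariate(1.0) for _ in range(1001)]

def l1_risk(c):
    # Empirical expected absolute loss of the constant prediction c
    return sum(abs(y - c) for y in ys) / len(ys)

med = statistics.median(ys)
mean = sum(ys) / len(ys)
# Grid search over candidates in [0, 5] with step 0.001
grid = [i / 1000.0 for i in range(0, 5001)]
best = min(grid, key=l1_risk)
# best lands on the grid point nearest the sample median
```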
Create Your Own World
• Example strategies (spanning a spectrum from structure free to highly structured):
– Try to estimate the regression function directly from the training data (structure free)
– Assume some model, or structure on the solution, so that the EPE minimization becomes tractable (highly structured)
– Your own award-winning paper
– Etc.
Nearest-Neighbor Method
Use those observations that are neighbors of the (new) input to predict:
$$\hat{Y}(x) = \frac{1}{k} \sum_{i:\, x_i \in N_k(x)} y_i$$
$N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.
This estimates the regression function directly from the training set:
$$\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$$
Nearest neighbor method is also (traditionally) known as a nonparametric method
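A minimal sketch of this averaging rule for 1-D inputs; the quadratic target and uniform design below are illustrative assumptions:

```python
import random

def knn_predict(x0, train, k):
    # Average the outputs of the k training points closest to x0
    neighbors = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in neighbors) / k

random.seed(3)
# Illustrative training set: y = x^2 (no noise) on 200 uniform inputs
train = [(x, x * x) for x in (random.uniform(-2.0, 2.0) for _ in range(200))]
yhat = knn_predict(1.0, train, k=5)   # should be close to 1.0
```

With 200 points on $[-2, 2]$, the 5 nearest neighbors of $x_0 = 1$ all lie very close to 1, so the average of their outputs is near $1^2 = 1$.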
Nearest-Neighbor Fit
Major limitations:
(1) Very inefficient in high dimensions
(2) Could be unstable (wiggly)
(3) If training data are scarce, it may not be the right choice
For data description see
[HTF], Section 2.3.3
How to Choose k: Bias-Variance Trade-off
Let’s try to answer the obvious question: how can we choose k?
One way to get some insight about k is the bias-variance trade-off.
Suppose the input-output model is $Y = f(X) + \varepsilon$, with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$.
Test error:
$$EPE_k(x_0) = E_T[(Y - \hat{f}_k(x_0))^2 \mid X = x_0]
= \sigma^2 + \mathrm{Bias}^2(\hat{f}_k(x_0)) + \mathrm{Var}_T(\hat{f}_k(x_0))
= \sigma^2 + \Big[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_{(l)})\Big]^2 + \frac{\sigma^2}{k}$$
where $x_{(l)}$ denotes the l-th nearest training input to $x_0$. The first term is the irreducible error (the regressor has no control over it); the second (squared bias) will most likely increase as k increases; the third (variance) will decrease as k increases.
So we can find a trade-off between bias and variance to choose k.
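The decomposition can be evaluated directly at one point $x_0$ when $f$ and $\sigma$ are known. Everything below ($f = \sin$, $\sigma = 0.1$, the uniform training design, and the tested k values) is an illustrative assumption:

```python
import math
import random

def f(x):
    return math.sin(x)   # assumed "true" function for the sketch

sigma = 0.1              # assumed noise standard deviation
x0 = 1.0

random.seed(4)
xs = [random.uniform(0.0, 2.0) for _ in range(200)]   # training inputs

def epe_k(k):
    # sigma^2 + Bias^2 + Var for the k-NN average, using the formula above
    nearest = sorted(xs, key=lambda x: abs(x - x0))[:k]
    bias = f(x0) - sum(f(x) for x in nearest) / k
    return sigma ** 2 + bias ** 2 + sigma ** 2 / k

errors = {k: epe_k(k) for k in (1, 20, 200)}
# Small k: high variance; large k: high bias; a middle k balances the two
```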
Bias Variance Trade-off
(Figure: graphical representation of the bias-variance trade-off.)
Assumption on Regression Function:
Linear Models
Structural assumption: the output Y is linear in the inputs $X = (X_1, X_2, X_3, \dots, X_p)$.
Predict the output by:
$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j = X^T \hat{\beta}$$
(vector notation, with a constant 1 included in X).
We estimate β from the training set (by least squares):
$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
where
$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$
So, for a new input $x_0 = (x_{01}, x_{02}, \dots, x_{0p})^T$, the regression output is
$$\hat{Y}(x_0) = x_0^T \hat{\beta} = x_0^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
It can be shown that EPE minimization leads to this estimate under the linear model (see the comments in [HTF] Section 2.4).
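For $p = 1$ the normal equations $(\mathbf{X}^T\mathbf{X})\hat{\beta} = \mathbf{X}^T\mathbf{y}$ are a 2x2 system that can be solved by hand. The tiny data set below (exactly $y = 1 + 2x$) is an illustrative assumption:

```python
# Least squares for one input plus intercept, via the 2x2 normal equations
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]    # exactly y = 1 + 2x (assumed toy data)
n = len(xs)
# Entries of X^T X and X^T y, with rows of X being (1, x_i)
sx, sxx = sum(xs), sum(x * x for x in xs)
sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
det = n * sxx - sx * sx       # determinant of X^T X
b0 = (sxx * sy - sx * sxy) / det   # intercept beta_0
b1 = (n * sxy - sx * sy) / det     # slope beta_1
```

Since the data lie exactly on a line, the fit recovers the intercept 1 and slope 2.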
Example of Linear Model Fit
$X = (X_1, X_2)$; $Y$ = (Red, Green), coded as Red = 1, Green = 0.
Classification rule:
$$\hat{G} = \begin{cases} \text{Red}, & \text{if } \hat{Y} > 0.5 \\ \text{Green}, & \text{otherwise} \end{cases}$$
The linear fit is very stable (not wiggly) and computationally efficient. However, if the linear model assumption fails, the regression fails miserably.
(Figure: scatter of the two classes in the $(X_1, X_2)$ plane with the linear decision boundary.)
More Regressors
• Many regression techniques are based on modifications of linear models and nearest-neighbor techniques:
– Kernel methods
– Local regression
– Basis expansions
– Projection pursuit and neural networks
– …
Chapter 2 of [HTF] gives a quick overview of them.
Bayes Classifier
The key probability distribution $\Pr(Y|X)$ is called the posterior distribution in the Bayesian paradigm.
EPE is also called Risk in Bayesian terminology; however, Risk is more general than EPE.
We can show that the EPE is the average error of classification when the Risk uses a 0-1 penalty.
So the Bayes classifier is the minimum-error classifier.
The corresponding error is called the Bayes risk / Bayes rate, etc.
Materials mostly based on Ch. 2 of the [DHS] book.
Bayes’ Classifier…
We can associate probabilities with the categories (output classes) of the classifier.
This probability is called the prior probability, or simply the prior:
$$P_i = \Pr[H_i], \quad i = 1, 2, \dots, M$$
We assign a cost $C_{ij}$ to each possible decision outcome: $H_i$ is true, and we choose $H_j$.
(Figure: a 2-D input space X partitioned into decision regions $R_1$, $R_2$, $R_3$ for $M = 3$.)
We know the conditional probability densities $p(x|H_i)$, also called the likelihood of the hypothesis $H_i$, or simply the likelihood.
Total risk:
$$\text{Risk} = \sum_{j=1}^{M} \sum_{i=1}^{M} C_{ij} \Pr[H_i \text{ is true}] \Pr[H_j \text{ is chosen} \mid H_i \text{ is true}]
= \sum_{j=1}^{M} \sum_{i=1}^{M} C_{ij} P_i \int_{R_j} p(x|H_i)\, dx
= \sum_{j=1}^{M} \int_{R_j} \Big[ \sum_{i=1}^{M} C_{ij} P_i\, p(x|H_i) \Big] dx$$
Bayes’ Classifier…
Choose the $R_j$'s in such a way that this risk is minimized.
Since the $R_j$'s partition the entire input space, any x belongs to exactly one $R_j$.
So we can minimize the risk via the following rule to construct the partitions:
$$R_k = \Big\{ x : k = \arg\min_j \sum_{i=1}^{M} C_{ij} P_i\, p(x|H_i) \Big\}$$
You should realize that the reasoning is similar to that in [HTF] while minimizing the EPE in Section 2.4.
Minimum Error Criterion
0-1 criterion:
$$C_{ij} = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{if } i \neq j \end{cases}$$
$$R_k = \Big\{ x : k = \arg\min_j \sum_{i=1}^{M} C_{ij} P_i\, p(x|H_i) \Big\}
= \Big\{ x : k = \arg\min_j \sum_{i=1,\, i \neq j}^{M} P_i\, p(x|H_i) \Big\}
= \Big\{ x : k = \arg\max_j P_j\, p(x|H_j) \Big\}$$
(The last step follows because the full sum $\sum_i P_i\, p(x|H_i)$ does not depend on j, so minimizing the sum over $i \neq j$ is the same as maximizing the excluded term.)
We are minimizing the Risk with the 0-1 criterion.
This is also known as the minimum error criterion; in this case the total risk is the probability of error.
Note that
$$\arg\max_j P_j\, p(x|H_j) = \arg\max_j \frac{P_j\, p(x|H_j)}{p(x)} = \arg\max_j P(H_j \mid x)$$
where $P(H_j \mid x)$ is the posterior probability. This is the same result you get by minimizing the EPE with the 0-1 loss function (see [HTF] Section 2.4).
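The rule $\arg\max_j P_j\, p(x|H_j)$ is easy to implement for a concrete model. The two-class Gaussian setup below (equal priors, $H_1 \sim N(0,1)$, $H_2 \sim N(2,1)$) is an illustrative assumption; by symmetry its decision boundary is at $x = 1$:

```python
import math

def gauss_pdf(x, mu, sigma):
    # Univariate normal density p(x | H_j)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Assumed priors and class-conditional parameters (mu, sigma)
priors = {1: 0.5, 2: 0.5}
params = {1: (0.0, 1.0), 2: (2.0, 1.0)}

def classify(x):
    # arg max_j P_j p(x | H_j)  (equivalently, arg max of the posterior)
    return max(priors, key=lambda j: priors[j] * gauss_pdf(x, *params[j]))
```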
Two Class Case: Likelihood Ratio Test
We can write the Risk as:
$$\text{Risk} = C_{11} P_1 \int_{R_1} p(x|H_1)\,dx + C_{12} P_2 \int_{R_1} p(x|H_2)\,dx + C_{21} P_1 \int_{R_2} p(x|H_1)\,dx + C_{22} P_2 \int_{R_2} p(x|H_2)\,dx$$
$$= -\int_{R_1} \big[ (C_{21} - C_{11}) P_1\, p(x|H_1) - (C_{12} - C_{22}) P_2\, p(x|H_2) \big]\, dx + \text{const}$$
We can immediately see the rule of classification from the above: assign x to $H_1$ if
$$(C_{21} - C_{11}) P_1\, p(x|H_1) > (C_{12} - C_{22}) P_2\, p(x|H_2),$$
else assign x to $H_2$. Equivalently, in terms of the likelihood ratio:
$$\frac{p(x|H_1)}{p(x|H_2)} > \frac{P_2 (C_{12} - C_{22})}{P_1 (C_{21} - C_{11})}$$
Special case, the 0-1 criterion:
$$\frac{p(x|H_1)}{p(x|H_2)} > \frac{P_2}{P_1}$$
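A sketch of the likelihood ratio test with general costs. The costs, priors, and the Gaussian class conditionals ($H_1 \sim N(0,1)$, $H_2 \sim N(2,1)$) are illustrative assumptions:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Assumed costs and priors: missing H2 (C12) is twice as costly as missing H1
C11, C22 = 0.0, 0.0
C21, C12 = 1.0, 2.0
P1, P2 = 0.6, 0.4

# Threshold from the rule: P2 (C12 - C22) / (P1 (C21 - C11))
threshold = (P2 * (C12 - C22)) / (P1 * (C21 - C11))

def decide(x):
    # Assign H1 iff the likelihood ratio exceeds the threshold
    lr = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 2.0, 1.0)
    return 1 if lr > threshold else 2
```

For these Gaussians the log-likelihood ratio is $2 - 2x$, so the boundary sits at $x = 1 - \tfrac{1}{2}\ln(4/3) \approx 0.86$, shifted toward $H_1$ because errors against $H_2$ cost more.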
An Example: Bayes Optimal Classifier
We know the posterior distributions of the two classes exactly; the decision boundaries are created from them. (Figure omitted.)
Minimax Classification
Bayes rule:
$$P(Y|X) = \frac{P(X|Y)\, P(Y)}{P(X)}$$
In many real-life applications, the prior probabilities may be unknown, so we cannot have a Bayes optimal classification. However, one may wish to minimize the worst-case overall risk.
We can write the Risk as a function of $P_1$ and $R_1$:
$$\text{Risk}(P_1, R_1) = C_{22} + (C_{12} - C_{22}) \int_{R_1} p(x|H_2)\,dx + P_1 \Big[ (C_{11} - C_{22}) + (C_{21} - C_{11}) \int_{R_2} p(x|H_1)\,dx - (C_{12} - C_{22}) \int_{R_1} p(x|H_2)\,dx \Big]$$
Observation 1: if we fix $R_1$, then Risk is a linear function of $P_1$.
Observation 2: the function $g(P_1) = \min_{R_1} [\text{Risk}(P_1, R_1)]$ is concave in $P_1$.
Minimax…
Let’s arbitrarily fix $P_1 = a$ and compute
$$R_1^a = \arg\min_{R_1} \text{Risk}(a, R_1)$$
By Observation 1, $\text{Risk}(P_1, R_1^a)$, as a function of $P_1$, is the straight line AB. Q: why should it be a tangent to $g(P_1)$?
Claim: $R_1^a$ is not a good classification when $P_1$ is not known. Why?
So, what could be a better classification here? The classification corresponding to the horizontal line MN. Why? This is the minimax solution.
Why the name? We can reach the solution by
$$\max_{P_1} \big[ \min_{R_1} \text{Risk}(P_1, R_1) \big]$$
An aside Q: when can we reverse the order and get the same result?
Another way to solve for $R_1$ in minimax is to make the coefficient of $P_1$ vanish:
$$(C_{11} - C_{22}) + (C_{21} - C_{11}) \int_{R_2} p(x|H_1)\,dx - (C_{12} - C_{22}) \int_{R_1} p(x|H_2)\,dx = 0$$
If you get multiple solutions, choose the one that gives you the minimum Risk.
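With 0-1 costs the vanishing-coefficient condition reduces to equating the two conditional error probabilities, $\int_{R_2} p(x|H_1)\,dx = \int_{R_1} p(x|H_2)\,dx$. The sketch below solves this by bisection for an assumed threshold rule $R_1 = \{x < t\}$ with $H_1 \sim N(0,1)$ and $H_2 \sim N(2,1)$; by symmetry the minimax threshold is $t = 1$:

```python
import math

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def imbalance(t):
    miss_h1 = 1.0 - Phi(t)      # integral of p(x|H1) over R2 = {x >= t}
    miss_h2 = Phi(t - 2.0)      # integral of p(x|H2) over R1 = {x < t}
    return miss_h1 - miss_h2    # minimax condition: this should be zero

# Bisection: imbalance(t) is decreasing in t
lo, hi = -5.0, 7.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if imbalance(mid) > 0:
        lo = mid
    else:
        hi = mid
t_star = 0.5 * (lo + hi)
```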
Neyman-Pearson Criterion
Consider a two-class problem. The following four probabilities can be computed:
– Probability of detection (hit): $P(x \in R_2 \mid H_2)$
– Probability of false alarm: $P(x \in R_2 \mid H_1)$
– Probability of miss: $P(x \in R_1 \mid H_2)$
– Probability of correct rejection: $P(x \in R_1 \mid H_1)$
(Figure: two class-conditional densities $p(x|H_1)$ and $p(x|H_2)$ on a 1-D axis, with decision regions $R_1$ and $R_2$.)
We do not know the prior probabilities, so Bayes optimal classification is not possible.
However, we do know that the probability of false alarm must be below a level $\alpha$.
Based on this constraint (the Neyman-Pearson criterion) we can design a classifier.
Observation: maximizing the probability of detection and minimizing the probability of false alarm are conflicting goals (in general).
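A sketch of the Neyman-Pearson design for an assumed setup: $H_1 \sim N(0,1)$, $H_2 \sim N(2,1)$, decision rule "choose $H_2$ when $x > t$", and $\alpha = 0.05$. We pick the smallest region satisfying the constraint, i.e. set the false alarm probability exactly to $\alpha$ and solve for the threshold by bisection:

```python
import math

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

alpha = 0.05   # allowed probability of false alarm (assumed)

# Solve P(x > t | H1) = 1 - Phi(t) = alpha by bisection
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if 1.0 - Phi(mid) > alpha:
        lo = mid
    else:
        hi = mid
t = 0.5 * (lo + hi)             # approximately 1.645 for alpha = 0.05
p_detect = 1.0 - Phi(t - 2.0)   # resulting probability of detection
```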
Receiver Operating Characteristics
An ROC is a plot of the probability of false alarm vs. the probability of detection.
(Figure: ROC curves for two classifiers, with the probability of false alarm on the horizontal axis and the probability of detection on the vertical axis; Classifier 1 dominates Classifier 2.)
The area under the ROC curve is a measure of performance.
It is also used to find an operating point for the classifier.
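An ROC curve can be traced by sweeping the decision threshold. The setup below ($H_1 \sim N(0,1)$, $H_2 \sim N(2,1)$, rule "choose $H_2$ when $x > t$") is an illustrative assumption; its exact AUC is $\Phi(2/\sqrt{2}) \approx 0.92$:

```python
import math

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Sweep the threshold t over a fine grid
ts = [-10.0 + 0.01 * i for i in range(2001)]
# Each ROC point is (false alarm, detection) at one threshold
roc = [(1.0 - Phi(t), 1.0 - Phi(t - 2.0)) for t in ts]

# Area under the ROC curve by the trapezoidal rule
# (as t increases, the points run from (1,1) toward (0,0))
auc = 0.0
for (fa1, pd1), (fa2, pd2) in zip(roc, roc[1:]):
    auc += (fa1 - fa2) * 0.5 * (pd1 + pd2)
```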