PRML - Chapter 3


Pattern Recognition and
Machine Learning
Linear Models for Regression
Introduction
• Predict a continuous target variable t using a D-dimensional vector x of input variables.
• Linear functions of the adjustable parameters
– Not necessarily linear functions of the input variables.
• Linear combinations of a fixed set of nonlinear
functions, called basis functions.
• We aim to model p(t|x), in order to be able to
minimize expected loss.
– Not just to find a function t=y(x)
Linear models have significant limitations
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Linear Basis Function Models
• The simplest model:
  $y(x, w) = w_0 + w_1 x_1 + \dots + w_D x_D$
• Fixed non-linear basis functions:
  $y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)$
  Usually $\phi_0(x) = 1$; in this case $w_0$ is called the bias parameter.
Linear regression
$x = (x_1, \dots, x_D)^T$
$w = (w_0, \dots, w_{M-1})^T$
$\phi = (\phi_0, \dots, \phi_{M-1})^T$
Can be seen as feature extraction
Examples (1/2)
• Polynomial basis functions: $\phi_j(x) = x^j$
– Limitation: Global functions
• Alternative: Spline functions
• ‘Gaussian’ basis functions:
  $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$
Examples (2/2)
• Sigmoidal basis functions:
  $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
– ‘tanh’ function: $\tanh(a) = 2\sigma(2a) - 1$
• Fourier basis functions
– Infinite spatial extent
– Wavelets: Localized in both space and frequency
In the rest of the chapter we won’t consider a particular
form of basis functions, so it is ok to use even φ(x)=x.
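As an illustration (not part of the original slides), here is a minimal NumPy sketch of the three basis-function families above for scalar inputs; the centres `mu` and width `s` are illustrative parameters. Each function returns an N×M design matrix Φ of the kind used on the following slides.

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x**j for j = 0..M-1 (phi_0(x) = 1 provides the bias column)."""
    return np.vander(x, M, increasing=True)

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)**2 / (2 s**2)) for each centre mu_j."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))
```

For the Gaussian and sigmoidal bases a column of ones can be prepended to obtain the bias term.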
Maximum likelihood and least squares (1/2)
• Suppose that:
  $p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})$
  where y(x, w) is the unknown function.
• If we assume a squared loss function, then the
optimal prediction is the conditional mean of t:
$\mathbb{E}[t \mid x] = \int t\, p(t \mid x)\, dt = y(x, w)$
• Suppose X=(x1,…,xN) and t=(t1,…,tN), then:
$p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})$
x will always be a conditioning
variable, so it can be omitted
Maximum likelihood and least squares (2/2)
• Using maximum likelihood to determine w and β:
  $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t$   (the normal equations for the least squares problem)
• where Φ is the N×M design matrix:
  $\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}$
• Finally:
  $\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \left\{ t_n - w_{ML}^T \phi(x_n) \right\}^2$
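A minimal sketch (assuming NumPy and a precomputed design matrix `Phi`) of the maximum-likelihood solution above; solving the least-squares problem with `np.linalg.lstsq` rather than explicitly inverting Φ^TΦ is a numerical-stability choice, not something the slides prescribe.

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum-likelihood w and noise precision beta for the linear basis function model.

    Phi : (N, M) design matrix, t : (N,) target vector.
    """
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # solves the normal equations
    residuals = t - Phi @ w_ml
    beta_ml = 1.0 / np.mean(residuals ** 2)          # 1/beta_ML = (1/N) * sum of squared errors
    return w_ml, beta_ml
```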
Geometry of least squares
• Consider an N-dimensional space.
– t is a vector in this space.
– Each basis function evaluated over the N data points is also
a vector, denoted by φj.
• If M<N, φj’s span a linear subspace S of dimensionality M.
– Let y be an N-dimensional vector, such that $y_n = y(x_n, w)$.
• y is an arbitrary linear combination of the φj’s.
– Then, the sum-of-squares error is equal to $\frac{1}{2}\|y - t\|^2$.
The solution is obtained by projecting t orthogonally onto S.
Numerical difficulties arise when the φj’s are nearly collinear.
Sequential learning
• Sequential (stochastic) gradient descent:
  $w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n$
  where $E_n$ is the error of the n-th point and η is the learning rate.
• For the sum-of-squares error function:
  $E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2$
• we have:
w( 1)  w( )  (tn  w( )T (xn )) (xn )
Least Mean Squares (LMS) algorithm
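A minimal sketch of the LMS update above; the feature vector `phi_n` = φ(x_n) and the learning rate `eta` are placeholders to be chosen for the problem at hand.

```python
import numpy as np

def lms_step(w, phi_n, t_n, eta=0.01):
    """One LMS step: w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    return w + eta * (t_n - w @ phi_n) * phi_n
```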
Regularized least squares (1/2)
• Regularization term:
$E_D(w) + \lambda E_W(w)$
• Weight decay regularizer:
  $\frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 + \frac{\lambda}{2} w^T w$
• Minimized for:
  $w = \left( \lambda I + \Phi^T \Phi \right)^{-1} \Phi^T t$
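A sketch of the regularized (weight-decay) solution above, assuming NumPy; `lam` plays the role of λ.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized least squares: w = (lam*I + Phi^T Phi)^(-1) Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```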
Regularized least squares (2/2)
• More general regularizer:
  $\frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$
  (q = 1 corresponds to the lasso)
• Contours for M=2:
The regularization term can be seen as a constraint; for q ≤ 1 it drives many coefficients to zero (sparse solutions).
Multiple outputs
• K>1 outputs, denoted with target vector t.
• We can use different basis functions for each output; however, it is common to use the same ones:
$y(x, W) = W^T \phi(x)$
• In this case:
$p(t \mid x, W, \beta) = \mathcal{N}(t \mid W^T \phi(x), \beta^{-1} I)$
• and in the case of N observations $T = (t_1, t_2, \dots, t_N)^T$:
  $W_{ML} = \left( \Phi^T \Phi \right)^{-1} \Phi^T T$
• Decouples for the different output variables into:
  $w_{k,ML} = \left( \Phi^T \Phi \right)^{-1} \Phi^T t_k$
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Expected loss (again)
• Suppose the squared loss function, for which the optimal prediction is:
  $h(x) = \mathbb{E}[t \mid x] = \int t\, p(t \mid x)\, dt$
• Expected squared loss:
(p(t|x) may be computed using any error function)
$\mathbb{E}[L] = \iint \{y(x) - t\}^2 p(x, t)\, dx\, dt = \int \{y(x) - h(x)\}^2 p(x)\, dx + \iint \{h(x) - t\}^2 p(x, t)\, dx\, dt$
– The first term is our goal: it depends on y(x), our estimate of the regression function, which is not known precisely.
– The second term is intrinsic noise: it cannot be reduced and is irrelevant to the choice of y(x).
A frequentist treatment
• Suppose y(x)=y(x,w).
• For a given data set D of size N we make an estimate
of w resulting in y(x;D).
• Suppose we have several i.i.d. data sets of size N; then we can compute $\mathbb{E}_D[y(x; D)]$.
• Then the expected loss can be written:
$\mathbb{E}[L] = \int \{\mathbb{E}_D[y(x; D)] - h(x)\}^2 p(x)\, dx + \int \mathbb{E}_D\left[\{y(x; D) - \mathbb{E}_D[y(x; D)]\}^2\right] p(x)\, dx + \iint \{h(x) - t\}^2 p(x, t)\, dx\, dt$
expected loss = (bias)² + variance + noise
The bias-variance trade-off
$(\text{bias})^2 = \int \{\mathbb{E}_D[y(x; D)] - h(x)\}^2 p(x)\, dx$
• (bias)²: How much the average prediction over all datasets varies from the actual value.
$\text{variance} = \int \mathbb{E}_D\left[ \{y(x; D) - \mathbb{E}_D[y(x; D)]\}^2 \right] p(x)\, dx$
• variance: How much the individual predictions differ
from their average.
• noise: $\text{noise} = \iint \{h(x) - t\}^2 p(x, t)\, dx\, dt$
Example (1/3)
• L=100 sinusoidal data sets, each with N=25 points.
• For each dataset D(l) we fit a model with 24 Gaussian
basis functions (plus a constant term, thus M=25) by
minimizing the regularized error for various λ’s.
Panels: the individual fits y^(l)(x), l=1..100, and their average over l.
Example (2/3)
Example (3/3)
• Approximations:
$\bar{y}(x) = \frac{1}{L} \sum_{l=1}^{L} y^{(l)}(x)$
$(\text{bias})^2 = \frac{1}{N} \sum_{n=1}^{N} \left\{ \bar{y}(x_n) - h(x_n) \right\}^2$
$\text{variance} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{L} \sum_{l=1}^{L} \left\{ y^{(l)}(x_n) - \bar{y}(x_n) \right\}^2$
Optimal value: ln λ = −0.31 (coincides with the minimum of the test set error)
Bias-variance decomposition is of limited practical use;
if we have multiple data sets, we usually merge them…
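A minimal sketch of the approximations on this slide, assuming the predictions y^(l)(x_n) are stored in an (L, N) array `Y` and the true values h(x_n) in an array `h` (the names are illustrative).

```python
import numpy as np

def bias_variance(Y, h):
    """Estimate (bias)^2 and variance from L fits evaluated at N points."""
    y_bar = Y.mean(axis=0)                 # average prediction over the L data sets
    bias2 = np.mean((y_bar - h) ** 2)      # (1/N) sum_n { y_bar(x_n) - h(x_n) }^2
    variance = np.mean((Y - y_bar) ** 2)   # (1/N) sum_n (1/L) sum_l { y^(l)(x_n) - y_bar(x_n) }^2
    return bias2, variance
```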
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Introduction
• Maximum likelihood is prone to over-fitting.
– A regularization term is one way to alleviate the problem.
– Held-out test data is another way, provided we have plenty of data.
• Bayesian learning is an alternative approach.
– No need for test data.
Parameter distribution (1/2)
• Given β and omitting X, the likelihood is:
  $p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) \equiv p(t \mid w)$
• Prior of w:
$p(w) = \mathcal{N}(w \mid m_0, S_0)$
• So, the posterior of w is (using Bayes’ theorem for Gaussian variables):
  $p(w \mid t) = \mathcal{N}(w \mid m_N, S_N)$   (note that $w_{MAP} = m_N$)
• where:
  $m_N = S_N \left( S_0^{-1} m_0 + \beta \Phi^T t \right)$
  $S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi$
Allows for sequential learning.
Parameter distribution (2/2)
• In the following we will consider that:
$p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$   (zero-mean isotropic Gaussian)
• which leads to:
  $m_N = \beta S_N \Phi^T t$
  $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$
– For α = 0: $w_{MAP} = w_{ML}$. For α → ∞: $p(w \mid t, \alpha) = p(w \mid \alpha)$.
• Log of the posterior:
$\ln p(w \mid t) = -\frac{\beta}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 - \frac{\alpha}{2} w^T w + \text{const.}$
(a sum-of-squares error function plus a quadratic regularization term)
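A sketch of the posterior computation above for the zero-mean isotropic prior, assuming NumPy.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """p(w | t) = N(w | m_N, S_N) with S_N^-1 = alpha*I + beta*Phi^T Phi and m_N = beta*S_N Phi^T t."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```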
Example: Straight-line fitting
• Linear model: y(x,w)=w0+w1x.
• Actual function (to be learnt): f(x,a)=a0+a1x, with
a0=-0.3 and a1=0.5.
• Synthetic data: draw $x_n$ from U(x | −1, 1), compute $f(x_n, a)$, and add Gaussian noise with standard deviation 0.2.
• Suppose that the noise variance is known, so β = (1/0.2)² = 25. Let’s also set α = 2.0.
Predictive distribution
• Usually we are not interested in p(w|t) but in:
$p(t \mid x, t, \alpha, \beta) = \int p(t \mid x, w, \beta)\, p(w \mid x, t, \alpha, \beta)\, dw$
• which evaluates to:
$p(t \mid x, t, \alpha, \beta) = \mathcal{N}(t \mid m_N^T \phi(x), \sigma_N^2(x))$
• where:
  $\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)$
  $\sigma_N^2(x) \to \frac{1}{\beta}$ as $N \to \infty$
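Continuing the sketch, the predictive mean and variance at a new input follow directly from m_N and S_N; here `phi_x` is the feature vector φ(x) of the new point.

```python
def predictive(phi_x, m_N, S_N, beta):
    """Predictive distribution p(t | x, t) = N(t | m_N^T phi(x), sigma_N^2(x))."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x   # noise term + parameter-uncertainty term
    return mean, var
```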
Example (1/2)
• Data generated from y = sin(2πx) plus Gaussian noise.
• Model: linear combination of Gaussian basis functions.
Red lines: the means of the predictive distributions. Shaded areas: the standard deviations.
Example (2/2)
• y(x, w) for various samples from p(w|t).
The predictive distribution integrates over such individual y(x, w).
Equivalent kernel (1/2)
• Combining
  $y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)$
  and $w = m_N$:
  $y(x, m_N) = m_N^T \phi(x) = \beta\, \phi(x)^T S_N \Phi^T t = \sum_{n=1}^{N} \beta\, \phi(x)^T S_N \phi(x_n)\, t_n$
• More concisely:
  $y(x, m_N) = \sum_{n=1}^{N} k(x, x_n)\, t_n$   (a linear smoother)
• where:
  $k(x, x') = \beta\, \phi(x)^T S_N \phi(x')$   (the smoother matrix or equivalent kernel)
k(x, x') for 11 Gaussian basis functions over [−1, 1] and 200 values of x equally spaced over the same interval.
Equivalent kernel (2/2)
• Other types of basis functions
– k(x, x’) for x=0, plotted as a function of x’:
General result: the equivalent kernel is a localized function around x, even if the basis functions themselves are not localized. Thus, nearby training points receive higher weight.
(Panels: polynomial and sigmoidal basis functions.)
• Covariance between y(x) and y(x’):
$\operatorname{cov}[y(x), y(x')] = \operatorname{cov}[\phi(x)^T w,\, w^T \phi(x')] = \phi(x)^T S_N \phi(x') = \beta^{-1} k(x, x')$
Nearby points have highly correlated predictions.
Properties of the equivalent kernel
• For every new data point x:
  $\sum_{n=1}^{N} k(x, x_n) = 1$
• It can be written in the form of an inner product:
$k(x, z) = \psi(x)^T \psi(z)$
• where:
  $\psi(x) = \beta^{1/2} S_N^{1/2} \phi(x)$
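A sketch of the equivalent kernel, assuming `Phi_train` holds φ(x_n) for the N training inputs and `phi_x` the features of a query point; the values over the training set should sum to approximately 1, as stated above.

```python
def equivalent_kernel(phi_x, Phi_train, S_N, beta):
    """k(x, x_n) = beta * phi(x)^T S_N phi(x_n) for every training point x_n."""
    k = beta * Phi_train @ S_N @ phi_x   # shape (N,)
    return k                             # k.sum() is approximately 1
```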
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Introduction
• Over-fitting in maximum likelihood can be avoided
by marginalizing over the model parameters (instead
of making point predictions).
• Models can be compared directly on the training data!
Probability of a model
• Suppose we wish to compare a set of L models {Mi}.
– A model refers to a probability distribution over the
observed data D.
• Then:
  $p(M_i \mid D) \propto p(M_i)\, p(D \mid M_i)$
  where $p(D \mid M_i)$ is the model evidence or marginal likelihood.
• Bayes factor for two models:
  $\frac{p(D \mid M_i)}{p(D \mid M_j)}$
Note: A model is defined by the
type/number of the basis functions.
p(D|Mi) marginalizes over the
parameters of the model.
Predictive distribution over models
• Actually, a mixture distribution over models!:
$p(t \mid x, D) = \sum_{i=1}^{L} p(t \mid x, M_i, D)\, p(M_i \mid D)$
• Approximation: Use just the most probable model
– Model selection
• Model evidence: $p(D \mid M_i) = \int p(D \mid w, M_i)\, p(w \mid M_i)\, dw$
Example
• Suppose a single model parameter w.
• Suppose that the prior and the posterior of w are rectangular, with widths $\Delta w_{\text{prior}}$ and $\Delta w_{\text{posterior}}$.
• Then:
  $p(D) = \int p(D \mid w)\, p(w)\, dw \simeq p(D \mid w_{MAP}) \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}$
• and:
  $\ln p(D) \simeq \ln p(D \mid w_{MAP}) + \ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}$
• In the case of M parameters:
  $\ln p(D) \simeq \ln p(D \mid w_{MAP}) + M \ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}$
  The log ratio is negative, so the penalty term grows with M.
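A purely illustrative calculation of the penalty term above, with made-up numbers for the best-fit log likelihood and the rectangular prior/posterior widths.

```python
import numpy as np

log_like_map = -50.0                # ln p(D | w_MAP), illustrative value
dw_prior, dw_post = 10.0, 0.1       # widths of the rectangular prior and posterior
for M in (1, 5, 10):
    log_evidence = log_like_map + M * np.log(dw_post / dw_prior)
    print(M, log_evidence)          # the negative penalty M*ln(0.01) grows in magnitude with M
```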
Models of different complexity
• Model evidence can favor models of intermediate
complexity.
Remarks on Bayesian Model Comparison
• If the correct model is within the compared ones, then
Bayesian Model Comparison favors on average the
correct model.
• Model evidence is sensitive to the prior over the model’s parameters.
• It is always wise to keep aside an independent test set
of data.
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Evidence approximation (1/2)
• Two hyperparameters:
– Noise precision β: $p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})$
– Prior precision α of w: $p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$
• Marginalizing over all parameters (x is omitted):
  $p(t \mid t) = \iiint p(t \mid w, \beta)\, p(w \mid t, \alpha, \beta)\, p(\alpha, \beta \mid t)\, dw\, d\alpha\, d\beta$   (analytically intractable)
• Approximation: $p(\alpha, \beta \mid t)$ is sharply peaked around $\hat{\alpha}$ and $\hat{\beta}$:
  $p(t \mid t) \simeq p(t \mid t, \hat{\alpha}, \hat{\beta}) = \int p(t \mid w, \hat{\beta})\, p(w \mid t, \hat{\alpha}, \hat{\beta})\, dw$
  (the evidence approximation)
Evidence approximation (2/2)
• We need to estimate $\hat{\alpha}$ and $\hat{\beta}$. Obviously, their values will depend on the training data:
  $p(\alpha, \beta \mid t) \propto p(t \mid \alpha, \beta)\, p(\alpha, \beta)$
• If p(α, β) is relatively flat, then we only have to maximize the marginal likelihood p(t | α, β).
• Two approaches for maximizing the log likelihood:
– Analytically (our approach)
– Expectation maximization
Evaluation of the evidence function
• Integrating over the weight parameters w:
$p(t \mid \alpha, \beta) = \int p(t \mid w, \beta)\, p(w \mid \alpha)\, dw$
• and after several steps we find the (log) evidence function:
  $\ln p(t \mid \alpha, \beta) = \frac{M}{2} \ln \alpha + \frac{N}{2} \ln \beta - E(m_N) - \frac{1}{2} \ln |A| - \frac{N}{2} \ln(2\pi)$
• where:
  $E(m_N) = \frac{\beta}{2} \| t - \Phi m_N \|^2 + \frac{\alpha}{2} m_N^T m_N$
  $A = \nabla \nabla E(w) = \alpha I + \beta \Phi^T \Phi$
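A sketch of the log evidence function above, assuming NumPy; it reuses A = αI + βΦ^TΦ and m_N = βA⁻¹Φ^T t from the earlier slides.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the linear basis function model."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (M / 2) * np.log(alpha) + (N / 2) * np.log(beta) \
           - E_mN - 0.5 * logdet_A - (N / 2) * np.log(2 * np.pi)
```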
Example: Polynomial regression (again)
• Model evidence wrt M:
– α = 5×10⁻³
– Sinusoidal underlying function
Maximizing the evidence function wrt α
• By defining:
  $\beta \left( \Phi^T \Phi \right) u_i = \lambda_i u_i$
• we obtain:
  $\alpha = \frac{\gamma}{m_N^T m_N}$
• where:
  $\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}$
Iterative estimation between α, γ and $m_N$.
Maximizing the evidence function wrt β
• Iterative estimation:
  $\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \left\{ t_n - m_N^T \phi(x_n) \right\}^2$
• If both α and β need to be determined, then their values must be re-estimated together after each update of γ.
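A sketch of the joint iterative re-estimation of α and β described above; the starting values and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def estimate_hyperparams(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    """Evidence-approximation updates for alpha and beta."""
    N, M = Phi.shape
    eig = np.linalg.eigvalsh(Phi.T @ Phi)       # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        lam = beta * eig                        # lambda_i of beta * Phi^T Phi
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))     # effective number of parameters
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta, gamma, m_N
```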
Effective number of parameters (1/2)
• The eigenvalues $\lambda_i$ measure the curvature of the likelihood function. For each i:
  $0 \le \frac{\lambda_i}{\lambda_i + \alpha} \le 1$, so $0 \le \gamma \le M$
• Well determined parameters:
  $\lambda_i \gg \alpha \;\Rightarrow\; \frac{\lambda_i}{\lambda_i + \alpha} \simeq 1 \;\Rightarrow\; w_i^{MAP} \simeq w_i^{ML}$
  $\lambda_i \ll \alpha \;\Rightarrow\; \frac{\lambda_i}{\lambda_i + \alpha} \simeq 0 \;\Rightarrow\; w_i^{MAP} \simeq 0$
γ: effective total number of well determined parameters.
(Figure: a rotated parameter space aligned with the eigenvectors $u_i$, with $0 < \lambda_1 < \lambda_2$.)
Effective number of parameters (2/2)
• Compare:
  $\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \left\{ t_n - m_N^T \phi(x_n) \right\}^2$
• and:
  $\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \left\{ t_n - w_{ML}^T \phi(x_n) \right\}^2$
• and recall (Chapter 1) that the unbiased variance estimate in the case of a single Gaussian variable is given by:
  $\frac{1}{\tilde{\beta}} = \tilde{\sigma}^2 = \frac{1}{N - 1} \sum_{n=1}^{N} (x_n - \mu_{ML})^2$
Example: Estimating α
• In the sinusoidal data set with 9 Gaussian basis
functions we have M=10 parameters.
• Let’s set β to its true value (11.1) and determine α:
  $\alpha = \frac{\gamma}{m_N^T m_N}$
Red curve: γ; blue curve: $\alpha\, m_N^T m_N$.
Red curve: $\ln p(t \mid \alpha, \beta)$; blue curve: test set error.
Example: γ versus wi’s
• γ decreases as α increases:
– when α goes from 0 to ∞, γ goes from M to 0.
Large data sets
• For N ≫ M, γ ≃ M.
• In this case the equations for α and β become:
  $\alpha = \frac{M}{m_N^T m_N} = \frac{M}{2 E_W(m_N)}$
  $\beta = \frac{N}{2 E_D(m_N)}$
No need to compute eigenvalues, but iterations are still needed.
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Limitations
• The basis functions φj(x) (form and number) are fixed
before the training data set is observed.
– The number of basis functions typically needs to grow exponentially with the dimensionality D.
• More complex models are needed
– Neural networks
– Support vector machines
• However, usually:
– Data typically lie on non-linear manifolds of lower dimensionality.
– Target values depend on a subset of input variables.