PRML - 3rd chapter
Pattern Recognition and
Machine Learning
Linear Models for Regression
Introduction
• Predict a continuous target variable t from a D-dimensional vector x of input variables.
• Linear functions of the adjustable parameters
– Not necessarily linear functions of the input variables.
• Linear combinations of a fixed set of nonlinear
functions, called basis functions.
• We aim to model p(t|x), in order to be able to
minimize expected loss.
– Not just to find a function t=y(x)
Linear models have significant limitations
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Linear Basis Function Models
• The simplest model:
y(x, w) = w_0 + w_1 x_1 + … + w_D x_D
• Fixed non-linear basis functions:
y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
Usually φ0(x)=1. In this case, w0
is called the bias parameter.
Linear regression
x = (x_1, …, x_D)^T
w = (w_0, …, w_{M−1})^T
φ = (φ_0, …, φ_{M−1})^T
Can be seen as feature extraction
Examples (1/2)
• Polynomial basis functions: φ_j(x) = x^j
– Limitation: Global functions
• Alternative: Spline functions
• ‘Gaussian’ basis functions:
φ_j(x) = exp( −(x − μ_j)² / (2s²) )
Examples (2/2)
• Sigmoidal basis functions:
φ_j(x) = σ( (x − μ_j) / s ),  where σ(a) = 1 / (1 + exp(−a))
– ‘tanh’ function: tanh(a)=2σ(a)-1
• Fourier basis functions
– Infinite spatial extent
– Wavelets: Localized in both space and frequency
In the rest of the chapter we won’t assume a particular form for the basis functions, so it is fine to use even φ(x) = x.
Maximum likelihood and least squares (1/2)
• Suppose that:
p(t | x, w, β) = N( t | y(x, w), β⁻¹ )        (y is the unknown function)
• If we assume a squared loss function, then the
optimal prediction is the conditional mean of t:
E[t | x] = ∫ t p(t | x) dt = y(x, w)
• Suppose X=(x1,…,xN) and t=(t1,…,tN), then:
p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β⁻¹ )
x will always be a conditioning
variable, so it can be omitted
Maximum likelihood and least squares (2/2)
• Using maximum likelihood to determine w and β:
w_ML = ( Φ^T Φ )⁻¹ Φ^T t        (the normal equations for the least squares problem)
• where:
Φ = [ φ_0(x_1)  φ_1(x_1)  …  φ_{M−1}(x_1)
      φ_0(x_2)  φ_1(x_2)  …  φ_{M−1}(x_2)
        ⋮          ⋮               ⋮
      φ_0(x_N)  φ_1(x_N)  …  φ_{M−1}(x_N) ]        (the design matrix)
• Finally:
1/β_ML = (1/N) Σ_{n=1}^{N} ( t_n − w_ML^T φ(x_n) )²
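A minimal NumPy sketch of this maximum-likelihood solution; the synthetic sinusoidal data and the polynomial basis are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum-likelihood fit: w_ML solves the normal equations
    (Phi^T Phi) w = Phi^T t, and 1/beta_ML is the mean squared residual."""
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # numerically safer than an explicit inverse
    beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
    return w_ml, beta_ml

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
Phi = np.vander(x, 4, increasing=True)               # phi_j(x) = x^j, M = 4
w_ml, beta_ml = fit_ml(Phi, t)
print(w_ml, beta_ml)
```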
Geometry of least squares
• Consider an N-dimensional space.
– t is a vector in this space.
– Each basis function evaluated over the N data points is also
a vector, denoted by φj.
• If M<N, φj’s span a linear subspace S of dimensionality M.
– Let y be the N-dimensional vector with elements y_n = y(x_n, w).
• y is an arbitrary combination of φj’s.
– Then the sum-of-squares error equals ½‖y − t‖².
The least-squares solution is obtained by projecting t orthogonally onto S.
Difficulties arise when the φ_j’s are nearly collinear.
Sequential learning
• Sequential gradient descent:
w^(τ+1) = w^(τ) − η ∇E_n
(η: the learning rate; E_n: the error of the nth data point)
• For the case of sum-of-squares error function:
E_D(w) = ½ Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )²
• we have:
w^(τ+1) = w^(τ) + η ( t_n − w^(τ)T φ(x_n) ) φ(x_n)
Least Mean Squares (LMS) algorithm
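A sketch of the LMS updates in NumPy; the learning rate η and the number of passes are illustrative assumptions and must be tuned for convergence.

```python
import numpy as np

def lms(Phi, t, eta=0.05, n_epochs=200):
    """Sequential least-mean-squares updates:
    w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n), one data point at a time."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for phi_n, t_n in zip(Phi, t):
            w += eta * (t_n - w @ phi_n) * phi_n
    return w   # with enough passes this approaches the batch least-squares solution
```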
Regularized least squares (1/2)
• Regularization term:
E_D(w) + λ E_W(w)
• Weight decay regularizer:
½ Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )² + (λ/2) w^T w
• Minimized by:
w = ( λI + Φ^T Φ )⁻¹ Φ^T t
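A one-function NumPy sketch of this closed-form solution (illustrative; it assumes Φ and t have already been built):

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Weight-decay (q = 2) regularized least squares:
    w = (lambda*I + Phi^T Phi)^{-1} Phi^T t, solved without forming the inverse."""
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
```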
Regularized least squares (2/2)
• More general regularizer:
½ Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )² + (λ/2) Σ_{j=1}^{M} |w_j|^q
(q = 1 corresponds to the lasso)
• Contours for M=2:
The regularization term can be seen as a constraint which, for q ≤ 1, drives many coefficients to zero.
Multiple outputs
• K>1 outputs, denoted with target vector t.
• We could use different basis functions for each output; however, it is common to use the same ones:
y(x, W) = W^T φ(x)
• In this case:
p(t | x, W, β) = N( t | W^T φ(x), β⁻¹ I )
• and in case of N observations T=(t1,t2,…, tN)T :
W_ML = ( Φ^T Φ )⁻¹ Φ^T T
which decouples for the different output variables into:
w_k = ( Φ^T Φ )⁻¹ Φ^T t_k
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Expected loss (again)
• Suppose the squared loss function for which:
h(x) = E[t | x] = ∫ t p(t | x) dt        (the optimal prediction)
• Expected squared loss:
E[L] = ∫∫ { y(x) − t }² p(x, t) dx dt
     = ∫ { y(x) − h(x) }² p(x) dx + ∫∫ { h(x) − t }² p(x, t) dx dt
(y(x) is our goal; h(x) is the regression function, which is not known precisely. The second integral is intrinsic noise that cannot be reduced and is irrelevant to the choice of y(x). Note that p(t|x) may be computed using any error function.)
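The decomposition follows by adding and subtracting h(x) inside the square; a short worked step (standard, not shown on the slide):

```latex
\{y(\mathbf{x})-t\}^{2}
  = \{y(\mathbf{x})-h(\mathbf{x})\}^{2}
  + 2\,\{y(\mathbf{x})-h(\mathbf{x})\}\{h(\mathbf{x})-t\}
  + \{h(\mathbf{x})-t\}^{2}
```

Integrating the cross term over t for fixed x gives ∫ { h(x) − t } p(t | x) dt = h(x) − E[t | x] = 0, so only the first and last terms survive, which yields the two integrals above.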
A frequentist treatment
• Suppose y(x)=y(x,w).
• For a given data set D of size N we make an estimate
of w resulting in y(x;D).
• Suppose we have several i.i.d. data sets of size N; then we can compute E_D[y(x; D)].
• Then the expected loss can be written:
E[L] = ∫ { E_D[y(x; D)] − h(x) }² p(x) dx
     + ∫ E_D[ { y(x; D) − E_D[y(x; D)] }² ] p(x) dx
     + ∫∫ { h(x) − t }² p(x, t) dx dt

expected loss = (bias)² + variance + noise
The bias-variance trade-off
(bias)² = ∫ { E_D[y(x; D)] − h(x) }² p(x) dx
• (bias)²: How much the average prediction over all data sets differs from the actual regression function.
variance = ∫ E_D[ { y(x; D) − E_D[y(x; D)] }² ] p(x) dx
• variance: How much the individual predictions differ
from their average.
• noise = ∫∫ { h(x) − t }² p(x, t) dx dt
Example (1/3)
• L=100 sinusoidal data sets, each with N=25 points.
• For each dataset D(l) we fit a model with 24 Gaussian
basis functions (plus a constant term, thus M=25) by
minimizing the regularized error for various λ’s.
(Figure: the individual fits y^(l)(x), l = 1…100, and their average over the 100 data sets.)
Example (2/3)
Example (3/3)
• Approximations:
ȳ(x) = (1/L) Σ_{l=1}^{L} y^(l)(x)
(bias)² ≈ (1/N) Σ_{n=1}^{N} { ȳ(x_n) − h(x_n) }²
variance ≈ (1/N) Σ_{n=1}^{N} (1/L) Σ_{l=1}^{L} { y^(l)(x_n) − ȳ(x_n) }²
Optimal value: ln λ = −0.31 (coincides with the test error)
Bias-variance decomposition is of limited practical use;
if we have multiple data sets, we usually merge them…
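A NumPy sketch of this averaging procedure over L = 100 synthetic data sets; the noise level, basis width s, and evaluation grid are illustrative assumptions, with λ taken from the slide’s ln λ = −0.31.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, lam, s = 100, 25, np.exp(-0.31), 0.1

def gaussian_design(x, mu, s):
    """Constant column plus 24 Gaussian basis functions with centres mu."""
    return np.hstack([np.ones((x.size, 1)),
                      np.exp(-(x[:, None] - mu) ** 2 / (2 * s ** 2))])

x_grid = np.linspace(0, 1, 100)          # points where bias/variance are estimated
mu = np.linspace(0, 1, 24)               # 24 centres -> M = 25 with the bias term
h = np.sin(2 * np.pi * x_grid)           # the true regression function

preds = np.empty((L, x_grid.size))
for l in range(L):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)     # assumed noise level
    Phi = gaussian_design(x, mu, s)
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
    preds[l] = gaussian_design(x_grid, mu, s) @ w

y_bar = preds.mean(axis=0)               # average prediction over the L data sets
bias2 = np.mean((y_bar - h) ** 2)        # (bias)^2
variance = np.mean((preds - y_bar) ** 2) # variance
print(bias2, variance)
```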
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Introduction
• Maximum likelihood is prone to over-fitting.
– A regularization term is one way to alleviate the problem.
– Hold-out (validation) data is another way, provided we have plenty of data.
• Bayesian learning is an alternative approach.
– No need for test data.
Parameter distribution (1/2)
• Given β and omitting X, the likelihood is:
p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β⁻¹ ) ≡ p(t | w)
• Prior of w:
p(w) = N( w | m_0, S_0 )
• So, posterior of w is (using the Bayes’ theorem for
Gaussian variables):
p(w | t) = N( w | m_N, S_N )
(w_MAP = m_N)
where
m_N = S_N ( S_0⁻¹ m_0 + β Φ^T t )
S_N⁻¹ = S_0⁻¹ + β Φ^T Φ
Allows for sequential learning
Parameter distribution (2/2)
• In the following we will consider that:
p(w | α) = N( w | 0, α⁻¹ I )
zero-mean isotropic Gaussian
• which leads to:
m_N = β S_N Φ^T t
S_N⁻¹ = α I + β Φ^T Φ
(α = 0 ⇒ w_MAP = w_ML;  α → ∞ ⇒ p(w | t, α) → p(w | α))
• Log of the posterior:
ln p(w | t) = −(β/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }² − (α/2) w^T w + const.
(sum-of-squares error function + quadratic regularization term)
Example: Straight-line fitting
• Linear model: y(x,w)=w0+w1x.
• Actual function (to be learnt): f(x,a)=a0+a1x, with
a0=-0.3 and a1=0.5.
• Synthetic data: draw x_n from U(x | −1, 1), evaluate f(x_n, a), and add Gaussian noise with standard deviation 0.2.
• Suppose that the noise variance is known, so β = (1/0.2)² = 25. Let’s also set α = 2.0.
Predictive distribution
• Usually we are not interested in p(w|t) but in:
p(t | x, t, α, β) = ∫ p(t | x, w, β) p(w | t, α, β) dw
• which evaluates to:
p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N²(x) )
• where:
σ_N²(x) = 1/β + φ(x)^T S_N φ(x)
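A NumPy sketch of the posterior and predictive computations above, applied to the straight-line example (α = 2.0, β = 25); the data-set size and random seed are illustrative assumptions.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the zero-mean isotropic prior:
    S_N^{-1} = alpha*I + beta*Phi^T Phi,   m_N = beta * S_N Phi^T t."""
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)."""
    return phi_x @ m_N, 1.0 / beta + phi_x @ S_N @ phi_x

# Straight-line example from the slides: alpha = 2.0, beta = 25.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, x.size)
Phi = np.column_stack([np.ones_like(x), x])          # phi(x) = (1, x)
m_N, S_N = posterior(Phi, t, alpha=2.0, beta=25.0)
print(predictive(np.array([1.0, 0.5]), m_N, S_N, beta=25.0))
```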
Example (1/2)
red lines: the means of the
predictive distributions
• Data generated
from y=sin(2πx)
plus Gaussian
noise.
• Model: Linear
combination of
Gaussian-basis
functions.
shaded areas: the standard deviations
Example (2/2)
• y(x,w) for various
samples from
p(w|t).
Predictive distribution
integrates over such
individual y(x,w)
Equivalent kernel (1/2)
• Combining
y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
and w=mN :
y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_{n=1}^{N} β φ(x)^T S_N φ(x_n) t_n
• More concisely:
y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n        (a linear smoother)
• where:
k(x, x′) = β φ(x)^T S_N φ(x′)        (the smoother matrix or equivalent kernel)
(Figure: k(x, x′) for 11 Gaussian basis functions over [−1, 1] and 200 values of x equally spaced over the same interval.)
Equivalent kernel (2/2)
• Other types of basis functions
– k(x, x’) for x=0, plotted as a function of x’:
General result: Equivalent
kernel is a localized
function around x, even if
the basis functions are not
localized. Thus, nearby
“training” points give
higher weights
(Figure panels: polynomial basis, sigmoidal basis.)
• Covariance between y(x) and y(x’):
cov[y (x), y(x)] cov[ (x)T w, wT (x)]
(x) S N (x) k (x, x)
T
1
Predictions at nearby points are highly correlated.
Properties of the equivalent kernel
• For every new data point x:
Σ_{n=1}^{N} k(x, x_n) = 1
• It can be written in the form of an inner product:
k(x, z) = ψ(x)^T ψ(z)
• where:
ψ(x) = β^{1/2} S_N^{1/2} φ(x)
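A NumPy sketch that builds the equivalent kernel for Gaussian basis functions and checks the sum-to-one property numerically; the training data and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Equivalent kernel k(x, x') = beta * phi(x)^T S_N phi(x') for Gaussian basis
# functions; the data set and hyperparameters below are illustrative choices.
rng = np.random.default_rng(0)
alpha, beta, s = 2.0, 25.0, 0.2
mu = np.linspace(-1, 1, 11)                          # 11 Gaussian basis functions on [-1, 1]

def phi(x):
    return np.exp(-(np.atleast_1d(x)[:, None] - mu) ** 2 / (2 * s ** 2))

x_train = rng.uniform(-1, 1, 50)
Phi = phi(x_train)
S_N = np.linalg.inv(alpha * np.eye(mu.size) + beta * Phi.T @ Phi)

x_grid = np.linspace(-1, 1, 200)
K = beta * phi(x_grid) @ S_N @ Phi.T                 # K[i, n] = k(x_grid[i], x_train[n])
print(K.sum(axis=1)[:5])                             # rows sum to (approximately) 1
```

Each row of K holds the weights the linear smoother assigns to the training targets for one query point.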
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Introduction
• Over-fitting in maximum likelihood can be avoided
by marginalizing over the model parameters (instead
of making point predictions).
• Models can be compared directly on the training data!
Probability of a model
• Suppose we wish to compare a set of L models {Mi}.
– A model refers to a probability distribution over the
observed data D.
model evidence or
marginal likelihood
• Then:
p(M_i | D) ∝ p(M_i) p(D | M_i)
• Bayes factor for two models:
p(D | M_i) / p(D | M_j)
Note: A model is defined by the
type/number of the basis functions.
p(D|Mi) marginalizes over the
parameters of the model.
Predictive distribution over models
• Actually, a mixture distribution over models!:
p(t | x, D) = Σ_{i=1}^{L} p(t | x, M_i, D) p(M_i | D)
• Approximation: Use just the most probable model
– Model selection
• Model evidence: p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw
Example
• Suppose a single model parameter w.
• Suppose that the prior and the posterior over w have a roughly rectangular form, with widths Δw_prior and Δw_posterior.
• Then:
p(D) = ∫ p(D | w) p(w) dw ≈ p(D | w_MAP) Δw_posterior / Δw_prior
• and:
ln p(D) ≈ ln p(D | w_MAP) + ln( Δw_posterior / Δw_prior )
• In case of M parameters:
ln p(D) ≈ ln p(D | w_MAP) + M ln( Δw_posterior / Δw_prior )
(The log ratio ln(Δw_posterior/Δw_prior) is negative, so the penalty term grows in magnitude with M.)
Models of different complexity
• Model evidence can favor models of intermediate
complexity.
Remarks on Bayesian Model Comparison
• If the correct model is within the compared ones, then
Bayesian Model Comparison favors on average the
correct model.
• Model evidence is sensitive to the prior over the model’s parameters.
• It is always wise to keep aside an independent test set
of data.
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Evidence approximation (1/2)
• Two hyperparameters:
– Noise precision β:
p(t | x, w, β) = N( t | y(x, w), β⁻¹ )
– w prior precision a:
p(w | α) = N( w | 0, α⁻¹ I )
Analytically
intractable
• Marginalizing over all parameters (x is omitted):
p(t | t) = ∫∫∫ p(t | w, β) p(w | t, α, β) p(α, β | t) dw dα dβ
• Approximation: p(α, β | t) is sharply peaked around α̂ and β̂:
p(t | t) ≈ p(t | t, α̂, β̂) = ∫ p(t | w, β̂) p(w | t, α̂, β̂) dw
Evidence approximation
Evidence approximation (2/2)
• We need to estimate α̂ and β̂. Obviously, their values will depend on the training data:
p(α, β | t) ∝ p(t | α, β) p(α, β)
• If p(α, β) is relatively flat, then we have to maximize the likelihood function p(t | α, β).
• Two approaches for maximizing log likelihood:
– Analytically (our approach)
– Expectation maximization
Evaluation of the evidence function
• Integrating over the weight parameters w:
p(t | α, β) = ∫ p(t | w, β) p(w | α) dw
• and after several steps we find:
ln p(t | α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − ½ ln |A| − (N/2) ln(2π)        (the log evidence function)
• where:
E(m_N) = (β/2) ‖t − Φ m_N‖² + (α/2) m_N^T m_N
A = ∇∇E(w) = α I + β Φ^T Φ
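A NumPy sketch of this log-evidence formula (illustrative; it assumes Φ and t are already available):

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) = M/2 ln(alpha) + N/2 ln(beta) - E(m_N)
       - 1/2 ln|A| - N/2 ln(2*pi), with A = alpha*I + beta*Phi^T Phi."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E_mN
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))
```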
Example: Polynomial regression (again)
• Model evidence wrt M:
– α = 5×10⁻³
Sinusoidal underlying function
Maximizing the evidence function wrt a
• By defining:
( β Φ^T Φ ) u_i = λ_i u_i
• We obtain:
α = γ / ( m_N^T m_N )
• where:
γ = Σ_i λ_i / ( α + λ_i )
(Iterative estimation between α, γ, and m_N.)
Maximizing the evidence function wrt β
• Iterative estimation:
1/β = ( 1 / (N − γ) ) Σ_{n=1}^{N} ( t_n − m_N^T φ(x_n) )²
• If both a and β need to be determined, then their
values must be re-estimated together after each
update on γ.
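A NumPy sketch of the joint re-estimation of α and β described above; the initial values and fixed iteration count are illustrative assumptions (in practice one would iterate until convergence).

```python
import numpy as np

def evidence_hyperparams(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    """Re-estimate alpha and beta by maximizing the evidence:
    gamma  = sum_i lambda_i / (alpha + lambda_i),  lambda_i eigenvalues of beta*Phi^T Phi
    alpha  = gamma / (m_N^T m_N)
    1/beta = sum_n (t_n - m_N^T phi(x_n))^2 / (N - gamma)."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)     # eigenvalues of Phi^T Phi; rescaled by beta each iteration
    for _ in range(n_iter):
        lam = beta * eig0
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta, gamma
```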
Effective number of parameters (1/2)
• Eigenvalues λi measure the curvature of the
likelihood function:
0 ≤ λ_i / ( λ_i + α ) ≤ 1,  so  0 ≤ γ ≤ M
• Well-determined parameters:
λ_i ≫ α  ⇒  λ_i / ( λ_i + α ) ≈ 1  and  w_i^MAP ≈ w_i^ML
λ_i ≪ α  ⇒  λ_i / ( λ_i + α ) ≈ 0  and  w_i^MAP ≈ 0
γ: the effective total number of well-determined parameters
(Figure: new parameter space, aligned with the eigenvectors u_i, with 0 < λ_1 < λ_2.)
Effective number of parameters (2/2)
• Compare:
1/β = ( 1 / (N − γ) ) Σ_{n=1}^{N} ( t_n − m_N^T φ(x_n) )²
• and:
1/β_ML = (1/N) Σ_{n=1}^{N} ( t_n − w_ML^T φ(x_n) )²
• and recall (chapter 1) that the unbiased variance estimate in the case of a single Gaussian variable is given by:
σ̃²_ML = ( 1 / (N − 1) ) Σ_{n=1}^{N} ( x_n − μ_ML )²
Example: Estimating a
• In the sinusoidal data set with 9 Gaussian basis
functions we have M=10 parameters.
• Let’s set β to its true value (11.1) and determine α.
α = γ / ( m_N^T m_N )
(Figure: one panel shows γ (red) and α m_N^T m_N (blue); the other shows ln p(t | α, β) (red) and the test-set error (blue).)
Example: γ versus wi’s
• γ is inversely related to α.
– As α goes from 0 to ∞, γ goes from M down to 0.
Large data sets
• For N ≫ M, γ ≈ M.
• In this case the equations for a and β become:
α = M / ( m_N^T m_N ) = M / ( 2 E_W(m_N) )
β = N / ( 2 E_D(m_N) )
(No need to compute eigenvalues, but iterations are still needed.)
• Linear Basis Function Models
• The Bias-Variance Decomposition
• Bayesian Linear Regression
• Bayesian Model Comparison
• The Evidence Approximation
• Limitations of Fixed Basis Functions
Limitations
• The basis functions φj(x) (form and number) are fixed
before the training data set is observed.
– Number of basis functions grows exponentially with the
dimensionality D.
• More complex models are needed
– Neural networks
– Support vector machines
• However, usually:
– Data lie close to non-linear manifolds of lower intrinsic dimensionality.
– Target values depend on only a subset of the input variables.