Basic principles of probability theory


Regression II
Model Selection
• Model selection based on t-distribution
• Information criteria
• Cross-validation
Introduction
When modelling the relationship between input and output variables we need to make decisions
about the form of the model. If there are several input variables, then one modelling
decision is: which input variables are important in predicting the output? Let us say
that we have several input variables and the proposed model is linear:
y = β0 + β1x1 + β2x2 + ε
If some of the coefficients are 0, then the corresponding input variables have no effect on
predicting the response y.
Another situation may arise when we want to explain the output using polynomials of an input
variable:
y = β0 + β1x + β2x² + ε
In these cases we may want to know whether some of the coefficients are 0. E.g. if β2 is 0, then the
relationship between input and output is linear; otherwise the relationship may be a higher-order
polynomial. In general, if we have functions of increasing complexity gk, then it is
likely that functions of higher complexity will give us better agreement between input
and output. One way of measuring agreement is the residual sum of squares (RSS):
RSS = Σi (yi − gk(xi))²
where i is the observation number and k is the function (model) number.
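As a quick illustration of the RSS formula (sketched here in Python rather than the lecture's R; the data and the two candidate models g1, g2 are made up for the example):

```python
# Minimal illustration of RSS = sum_i (y_i - g_k(x_i))^2 on toy data.
def rss(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds))

xs = [0, 1, 2, 3]
ys = [0.1, 1.9, 4.2, 5.8]       # roughly y = 2x
g1 = [2 * x for x in xs]        # hypothetical candidate model g1(x) = 2x
g2 = [x for x in xs]            # hypothetical candidate model g2(x) = x
# g1 tracks the data more closely, so its RSS is smaller (0.10 vs 13.5)
```

As the text notes, more complex candidate functions will generally achieve smaller RSS, which is why RSS alone cannot be used to select a model.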
Introduction
One very simple way of testing whether a particular coefficient of the linear model is 0 is
testing the hypothesis:
H0: βj = 0 versus H1: βj ≠ 0
As we already know, to do this test we need to know the distribution of the estimated
parameters. We know that for large samples the distribution of the parameters is
asymptotically normal. We also know that the covariance matrix of this distribution is:
cov(β̂) = σ̂² (XᵀX)⁻¹
σ̂² = (1/(n−p)) Σi=1..n ri² = (1/(n−p)) Σi=1..n (yi − Σj=0..p−1 β̂j xij)²
where all xi0 are 1. The number of observations is n and the number of parameters is p.
Under H0 the distribution of β̂j is normal with mean 0 and variance σ̂jj² equal to the j-th
diagonal term of the covariance matrix. Therefore the distribution of
β̂j / σ̂jj
is a t-distribution with n−p degrees of freedom.
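For the simplest case of a single input variable this ratio can be computed by hand. The following sketch (Python for illustration; in the lecture these values are read off from R's summary output) computes the t-statistic of the slope in y = β0 + β1x using the closed-form least-squares solution and σ̂² = RSS/(n−p), with made-up data:

```python
import math

def slope_t_statistic(xs, ys):
    # closed-form least squares for y = b0 + b1*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)            # p = 2 parameters (intercept and slope)
    se = math.sqrt(sigma2 / sxx)      # sqrt of the relevant diagonal term
    return b1 / se                    # ~ t with n-p degrees of freedom under H0

xs = [1, 2, 3, 4, 5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]       # toy data with a strong linear trend
t = slope_t_statistic(xs, ys)        # large |t| (about 20.4) -> reject H0: b1 = 0
```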

Example
Let us take the example of weight vs height and look at the results of linear model fitting using
polynomials of the input variable (height) up to fifth order. Plots show that a second-order
polynomial fits sufficiently well; after the second order, improvements are marginal. A natural
question is which model is better: can we justify a more complex model based on the observations we have?
Model selection using t-distribution: summary of lm method
The distribution of the estimated parameters is used in the summary of the lm command. Let
us say we want to see whether some of the coefficients of the model
y = β0 + β1x + β2x² + β3x³ + β4x⁴ + β5x⁵ + ε
are 0. After fitting the model using
lm5 = lm(weight ~ height + I(height^2) + I(height^3) + I(height^4) + I(height^5))
summary(lm5) will produce a table with information about the significance of each
estimated parameter:
Coefficients:
              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)   9.144e+04  8.300e+04    1.102     0.299
height       -6.947e+03  6.418e+03   -1.083     0.307
I(height^2)   2.107e+02  1.982e+02    1.063     0.315
I(height^3)  -3.187e+00  3.058e+00   -1.042     0.325
I(height^4)   2.403e-02  2.355e-02    1.020     0.334
I(height^5)  -7.224e-05  7.246e-05   -0.997     0.345
Residual standard error: 0.2245 on 9 degrees of freedom
Multiple R-squared: 0.9999, Adjusted R-squared: 0.9998
F-statistic: 1.335e+04 on 5 and 9 DF, p-value: < 2.2e-16
Model selection using t-distribution
More details of this table will be described in the tutorial. Here we are interested in not
rejecting the null hypothesis (that a particular parameter is 0). If we are using the
t-distribution to make such a decision, then we take the largest p-value and remove the
corresponding coefficient from the model. In this case it is the coefficient corresponding
to height⁵. After removing it we can fit the reduced model. We use:
lm4 = lm(formula = weight ~ height + I(height^2) + I(height^3) + I(height^4))
summary(lm4)
It will produce significance levels for the coefficients:
Coefficients:
              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)   8.824e+03  4.550e+03    1.939    0.0812 .
height       -5.552e+02  2.814e+02   -1.973    0.0768 .
I(height^2)   1.319e+01  6.515e+00    2.024    0.0705 .
I(height^3)  -1.389e-01  6.693e-02   -2.076    0.0646 .
I(height^4)   5.507e-04  2.574e-04    2.140    0.0581 .
Now we cannot remove any of the coefficients: the p-values are not large enough to say
that any particular coefficient is insignificant.
Model selection using t-distribution
The summary command after the lm command allows a simple way of model selection. It would
work as follows:
1) Specify the full model (this may be a problem: a model can always be made more complex)
2) Fit the model using lm
3) Use summary and remove the coefficient with the highest p-value (say, values that are
more than 0.1; this threshold is arbitrary and depends on how much one wants to reduce
the model)
4) Repeat 2) and 3) until the model can no longer be reduced
Warning: Using multiple hypothesis tests may produce misleading results. For
example, if we are working at a significance level α and we want to test 10 hypotheses,
then the probability that at least one of the hypotheses will appear significant when it is not
is 1−(1−α)¹⁰. If α=0.05 then this probability is about 0.4, and if we test 100
hypotheses then this probability is about 0.99. There are corrections for multiple
hypothesis testing; we will learn some of them in later lectures. When we used
summary we deliberately did not rely on rejecting the null hypothesis: we used high
p-values not to reject it.
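The family-wise probabilities quoted in the warning are easy to verify directly (a short Python check; α and the hypothesis counts are those from the text):

```python
# Probability that at least one of m independent tests at level alpha
# rejects a true null hypothesis: 1 - (1 - alpha)^m.
def familywise_error(alpha, m):
    return 1 - (1 - alpha) ** m

p10 = familywise_error(0.05, 10)     # about 0.40, as quoted in the text
p100 = familywise_error(0.05, 100)   # about 0.99
```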
Akaike’s information criterion
Another model selection tool is Akaike's information criterion (AIC). This technique is
based on the value of the likelihood at its maximum and uses the number of parameters to
penalise complex models. Let us recall the maximum likelihood technique.
The likelihood is proportional to the conditional distribution of the observations (outputs,
responses) given the parameters of the model:
L(y | β) ∝ P(y | β)
In maximum likelihood estimation we maximise the likelihood function with respect to the
parameters of the model. By making the model more and more complex we can get larger
and larger likelihood values. But is the more complex model justified?
Once we have maximised the likelihood function we can calculate the value of the
likelihood at the maximum. Akaike's information criterion as implemented in R uses
the following formula:
AIC = −2 log(L(y | β̂)) + c·p
where c is a constant (which may depend on the number of observations) that defines the various
forms of the information criterion, and p is the effective number of parameters in the model. In a
linear model p is equal to the number of parameters. When c=2 we have the classic
AIC and when c=log(n) we have the Bayesian information criterion (BIC).
Akaike’s information criterion
To apply AIC to linear models we need to specify the likelihood function and
find its value at the maximum. We can consider linear models with least-squares
minimisation as a special case of maximum likelihood:
L(y | β) = Πi=1..n (1/(√(2π) σi)) exp(−(yi − g(xi, β))² / (2σi²))
         = (1/((2π)^(n/2) Πi=1..n σi)) exp(−Σi=1..n (yi − g(xi, β))² / (2σi²))
where σi is the standard deviation of the i-th observation. When all σi are equal to each
other (denote them σ) and they are unknown, we can write:
L(y | β) → max  ⟺  Σi=1..n (yi − g(xi, β))² → min
Once we have found the maximum with respect to β we can carry on and find the maximum with respect to σ:
σ̂² = (1/n) Σi=1..n (yi − g(xi, β̂))²
Note that the maximum likelihood estimator for the variance is not unbiased. The unbiased
estimator is
σ̂²unbiased = (1/(n−p)) Σi=1..n (yi − g(xi, β̂))²
Akaike’s information criterion
Let us find the value of the likelihood at the maximum. We will have:
L(y | β̂) = (1/((2π)^(n/2) σ̂ⁿ)) e^(−n/2)
If we put this into the expression for AIC we will have:
AIC1 = −2 log(L(y | β̂)) + cp = n + n log(2π) + n log(σ̂²) + cp = n + n log(2π) + n log(RSS/n) + cp
In model selection the number of observations does not change, so AIC can be written:
AIC = n log(RSS/n) + cp
If we use a more complex model then RSS will be smaller. On the other hand, a model with a
smaller number of parameters may have better predictive power. AIC attempts to
combine these two conflicting ideas.
When comparing two models, the model with the smaller AIC is considered to be better.
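The formula AIC = n log(RSS/n) + cp can be checked against the weight~height example: with n = 15 observations, RSS = 0.454 for the fifth-order fit and RSS = 0.504 for the fourth-order fit (the values R reports for these models), and p counted including the intercept, it reproduces the AIC values quoted later up to rounding. A Python sketch:

```python
import math

def aic(rss, n, p, c=2.0):
    # AIC = n*log(RSS/n) + c*p, with constant terms dropped as in the text
    return n * math.log(rss / n) + c * p

aic5 = aic(0.454, n=15, p=6)   # 5th-order polynomial: 6 coefficients incl. intercept
aic4 = aic(0.504, n=15, p=5)   # 4th-order polynomial: 5 coefficients
# aic5 is about -40.47 and aic4 about -40.90, so the reduced model is preferred
```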
Akaike’s information criterion in R
There are a number of functions in R that use AIC. The simplest of them is extractAIC.
This function gives the number of parameters as well as the AIC. Another function is step.
In its simplest form this function removes parameters one at a time and calculates the
AIC. If removal of a parameter decreases the AIC then this parameter is removed from
further consideration. The procedure is applied iteratively. On the following slide the result
of the step function is shown.
Akaike’s information criterion
Start: AIC=-40.48
weight ~ height + I(height^2) + I(height^3) + I(height^4) + I(height^5)

              Df Sum of Sq   RSS     AIC
- I(height^5)  1     0.050 0.504 -40.910
- I(height^4)  1     0.052 0.506 -40.840
- I(height^3)  1     0.055 0.508 -40.773
- I(height^2)  1     0.057 0.510 -40.708
- height       1     0.059 0.513 -40.646
<none>                      0.454 -40.482

Stage 1: The procedure removes one parameter at a time. Removal of height⁵ reduces the AIC.

Step: AIC=-40.91
weight ~ height + I(height^2) + I(height^3) + I(height^4)

              Df Sum of Sq   RSS     AIC
<none>                      0.504 -40.910
- height       1     0.196 0.700 -37.979
- I(height^2)  1     0.206 0.710 -37.759
- I(height^3)  1     0.217 0.721 -37.536
- I(height^4)  1     0.231 0.734 -37.256

Stage 2: The procedure tries to reduce the number of parameters further, but the "best" model is the current model.

Call:
lm(formula = weight ~ height + I(height^2) + I(height^3) + I(height^4))

Coefficients:
(Intercept)       height  I(height^2)  I(height^3)  I(height^4)
  8.824e+03   -5.552e+02    1.319e+01   -1.389e-01    5.507e-04

Final stage: The "best" model is returned.
Akaike’s information criterion: further improvements
There are several additional forms of AIC; they try to correct for small samples. For
example, the corrected AIC has the expression:
AICc = AIC + 2p(p+1) / (n−p−1)
R does not produce AICc directly (as far as I am aware). However it can be calculated
using, for example:
extractAIC(lm4) + 2*p*(p+1)/(n-p-1)
where n is the number of observations and p is the number of parameters. For example,
for the fourth-order polynomial fit of the weight <–> height relationship we can use:
lm4 = lm(weight ~ height + I(height^2) + I(height^3) + I(height^4))
extractAIC(lm4) + 2*5*6/(15-5-1)
It will produce the value -34.24, and for the reduced model
lm3 = lm(weight ~ height + I(height^2) + I(height^3))
extractAIC(lm3) + 2*4*5/(15-4-1)
it will produce -33.26. Since the more complex model has the lower AICc, it is preferred.
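These two AICc values can be checked by hand from the correction term, assuming extractAIC returned AIC values of about -40.910 for lm4 and -37.256 for lm3 (consistent with the quoted results). A Python check:

```python
# Small-sample corrected AIC: AICc = AIC + 2p(p+1)/(n-p-1)
def aicc(aic_value, n, p):
    return aic_value + 2 * p * (p + 1) / (n - p - 1)

a4 = aicc(-40.910, n=15, p=5)   # fourth-order fit -> about -34.24
a3 = aicc(-37.256, n=15, p=4)   # third-order fit  -> about -33.26
```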
AIC, BIC
Both AIC and BIC attempt to produce the minimal model that describes the
observations as well as possible. BIC penalises complexity more strongly than AIC (it prefers
simpler models). In applications it is generally better to use common sense and look at the
parameters very carefully: if you can find an interpretation of all terms and reasons for them being
there, then they may be valid. Even when the "best" model has been selected it is advisable
to come back to model selection again if new observations become available.
Absolute values of AIC (or BIC) are not important; differences between the AICs of
different models are.
AIC (and BIC) can only be used to compare models fitted to the same observations. If the
observations change (for example, a few outliers are removed or new observations
become available) then AIC should be computed again for all models.
Cross-validation
If we have a sample of observations, can we use this sample to choose among given models?
Cross-validation attempts to reduce overfitting and thus helps model selection.
Description of cross-validation: we have a sample of size n, (yi, xi).
1) Divide the sample into K roughly equal-size parts.
2) For the k-th part, estimate the parameters using the K−1 parts excluding the k-th part.
Calculate the prediction error for the k-th part.
3) Repeat this for all k = 1, 2, ..., K and combine all prediction errors to get the
cross-validation prediction error.
If K = n then we have the leave-one-out cross-validation technique.
Let us denote the estimate at the k-th step by β̂k (it is a vector of parameters). Let the k-th subset of
the sample be Ak and the number of points in this subset be Nk. Then the prediction error per
observation is:
PE = (1/K) Σk=1..K (1/Nk) Σi∈Ak (yi − g(xi, β̂k))²
Then we would choose the function that gives the smallest prediction error. We can expect that in
the future, when we have new observations, this function will give the smallest prediction
error.
This technique is widely used in modern statistical analysis. It is not restricted to the least-squares
technique: instead of least squares we could use other techniques such as maximum
likelihood, Bayesian estimation, or M-estimation.
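The K-fold procedure above can be sketched in a few lines (Python rather than the lecture's R, with made-up data and two hypothetical models: a constant model and a straight line fitted by closed-form least squares):

```python
def fit_line(xs, ys):
    # closed-form least squares for y = a + b*x; returns the fitted function g
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_mean(xs, ys):
    # constant model g(x) = mean(y)
    m = sum(ys) / len(ys)
    return lambda x: m

def cv_prediction_error(xs, ys, fit, K):
    # PE = (1/K) sum_k (1/N_k) sum_{i in A_k} (y_i - g(x_i, beta_k))^2
    n = len(xs)
    folds = [list(range(k, n, K)) for k in range(K)]   # K roughly equal parts
    pe = 0.0
    for fold in folds:
        train = [i for i in range(n) if i not in fold]
        g = fit([xs[i] for i in train], [ys[i] for i in train])
        pe += sum((ys[i] - g(xs[i])) ** 2 for i in fold) / len(fold)
    return pe / K

xs = list(range(1, 11))
ys = [2 * x + 1 + 0.1 * (-1) ** x for x in xs]   # near-linear toy data
pe_line = cv_prediction_error(xs, ys, fit_line, K=5)
pe_mean = cv_prediction_error(xs, ys, fit_mean, K=5)
# the linear model has a much smaller cross-validation prediction error
```

Setting K equal to the sample size in cv_prediction_error gives the leave-one-out variant described above.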
Cross-validation in R
There is no cross-validation method for lm in R, but there is a cross-validation function
for glm (generalised linear models). This method requires that the responses are stored in
the output from model fitting. We can change the output from lm a little and then
use cross-validation for glm. The following sequence of commands will allow us to do this
(it is just an example):
lm1 = lm(weight~height)
lm1$y = weight # Add responses to the output from lm
cv.glm(women, lm1)$delta # Do cross-validation
The cv.glm function by default uses leave-one-out cross-validation.
Cross-validation in R
Let us apply cross-validation to the weight~height relationship.
require(boot)
require(MASS)
lm1 = lm(weight~height, data=women)
lm1$y = women$weight
cv.glm(women, lm1)$delta
This sequence of commands will produce the following, so the prediction error is 3.04:
[1] 3.040776 3.002450
lm1 = lm(weight~height+I(height^2), data=women)
lm1$y = women$weight
cv.glm(women, lm1)$delta
It produces:
[1] 0.2198339 0.2156849
The prediction error is reduced to 0.22. Further applications will produce prediction errors 0.15, 0.13
and 0.18. The minimum seems to be when a polynomial of fourth order is used. However, the difference
between the third and fourth order is not as substantial as that between the first and second order.
Depending on circumstances we would choose a polynomial of second, third or fourth order.
R commands
lm – model fit
summary (or summary.lm) – will give a summary of the model, including p-values for each model parameter
extractAIC – extracts Akaike's information criterion
extractAIC – with k=log(n) as an argument it will produce the Bayesian information criterion
stepAIC (or step) – will use AIC for step-by-step model comparison
add1 – will add one parameter at a time and calculate AIC
drop1 – will drop one parameter at a time and calculate AIC
cv.glm – from the library boot. It is suitable for objects produced by glm; however, a slight modification allows it to be used for lm objects too.
References
1. Stuart, A., Ord, K.J. and Arnold, S. (1999) Kendall's advanced theory of statistics, Volume 2A
2. Box, G.E.P., Hunter, W.G. and Hunter, J.S. (1978) Statistics for experimenters
3. Berthold, M.J. and Hand, D.J. Intelligent Data Analysis
4. Dalgaard, P. Introductory statistics with R
Exercise 3
Take the data set cement from the library MASS and analyse it using a linear
model. Write a report.