Transcript Document
Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students - Session 4
Lecture: Regression Analysis
Practical: multiple regression
Lecturer: Lorenzo Marini, DAFNAE, University of Padova, Viale dell'Università 16, 35020 Legnaro, Padova.
E-mail: [email protected]
Tel.: +39 049 8272807   Skype: lorenzo.marini
http://www.biodiversity-lorenzomarini.eu/
Statistical modelling: more than one parameter
Nature of the response variable:
- NORMAL -> General Linear Models
- POISSON, BINOMIAL, ... -> Generalized Linear Models (GLM)
Nature of the explanatory variables (General Linear Models):
- Categorical -> ANOVA (Session 3)
- Continuous -> Regression (Session 4)
- Categorical + continuous -> ANCOVA
REGRESSION
Linear methods:
- Simple linear: one X, linear relation
- Multiple linear: 2 or more Xi, linear relation
- Polynomial: one X but more slopes, non-linear relation
Non-linear methods:
- Non-linear: one X, complex relation
LINEAR REGRESSION
lm()
Regression analysis is a technique used for the modelling and analysis of numerical data consisting of values of a dependent variable (response variable) and of one or more independent continuous variables (explanatory variables).
Assumptions
- Independence: the Y-values and the error terms must be independent of each other.
- Linearity: the relationship between Y and X is linear.
- Normality: the populations of Y-values and the error terms are normally distributed for each level of the predictor variable x.
- Homogeneity of variance: the populations of Y-values and the error terms have the same variance at each level of the predictor variable x.
(Do not test for normality or heteroscedasticity on the raw data; check the residuals instead!)
LINEAR REGRESSION
lm()
AIMS
1. To describe the linear relationship between Y and Xi (EXPLANATORY APPROACH) and to quantify how much of the total variation in Y can be explained by the linear relationship with Xi.
2. To predict new values of Y from new values of Xi (PREDICTIVE APPROACH).
Xi: predictors; Y: response
Yi = α + β*xi + εi
We estimate one INTERCEPT (α) and one or more SLOPES (β)
SIMPLE LINEAR REGRESSION:
Simple LINEAR regression step by step:
I step: -Check linearity
[visualization with plot()]
II step: -Estimate the parameters
(one slope and one intercept)
III step: -Check residuals
(check the assumptions looking at the residuals: normality and homogeneity of variance)
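A minimal R sketch of this workflow (the data frame dat with columns y and x is hypothetical):

# I step: check linearity visually
plot(y ~ x, data = dat)

# II step: estimate intercept and slope
model <- lm(y ~ x, data = dat)
summary(model)                  # parameter estimates, t tests, R-squared

# III step: check the residuals
plot(model)                     # residuals vs fitted, Q-Q plot, scale-location
shapiro.test(residuals(model))  # normality of the residuals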
SIMPLE LINEAR REGRESSION:
Normality
Do not test normality over the whole set of y-values; normality applies to the residuals.
SIMPLE LINEAR REGRESSION
MODEL
yi = α + β*xi
SLOPE: β = Σ[(xi - xmean)(yi - ymean)] / Σ(xi - xmean)^2
INTERCEPT: α = ymean - β*xmean
The model gives the fitted values, but the model does not explain everything: what remains are the RESIDUALS.
Residual = observed yi - fitted yi
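As a sketch, the least-squares estimates can be computed by hand and compared with lm() (the vectors x and y below are hypothetical):

x <- c(1, 2, 3, 4, 5, 6, 7, 8)                      # hypothetical predictor
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.3)  # hypothetical response

beta  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
alpha <- mean(y) - beta * mean(x)                                   # intercept

coef(lm(y ~ x))   # the same estimates obtained with lm()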
SIMPLE LINEAR REGRESSION
Least-squares regression explanation
library(animation)

###########################################
## Slope changing
# save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
            title = "Demonstration of Least Squares",
            description = "We want to find an estimate for the slope in 50 candidate slopes, so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares()
ani.stop()

############################################
## Intercept changing
# save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
            title = "Demonstration of Least Squares",
            description = "We want to find an estimate for the slope in 50 candidate slopes, so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares(ani.type = "i")
ani.stop()
SIMPLE LINEAR REGRESSION
Hypothesis testing Parameter t testing
H0: β = 0 (there is no relation between X and Y)
H1: β ≠ 0
Parameter t testing (test the single parameter!)
We must measure the unreliability associated with each of the estimated parameters, i.e. we need the standard errors:
SE(β) = sqrt[ (residual SS / (n - 2)) / Σ(xi - xmean)^2 ]
t = (β - 0) / SE(β)
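Continuing the hypothetical example above, the standard error and t statistic of the slope can be reproduced by hand and checked against summary():

model <- lm(y ~ x)                        # x, y: the hypothetical data above
rss   <- sum(residuals(model)^2)          # residual SS
se.b  <- sqrt((rss / (length(x) - 2)) / sum((x - mean(x))^2))
t.b   <- coef(model)["x"] / se.b          # t statistic for H0: beta = 0
summary(model)$coefficients               # same values reported by summary()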
SIMPLE LINEAR REGRESSION
If the model is significant (β ≠ 0): how much variation is explained?
Measure of goodness-of-fit (explained variation):
Total SS = Σ(yi observed - ymean)^2
Model SS = Σ(yi fitted - ymean)^2
Residual SS = Total SS - Model SS
R^2 = Model SS / Total SS
R^2 does not provide information about the significance.
SIMPLE LINEAR REGRESSION: example 1
If the model is significant, then model checking:
1. Linearity between X and Y? ok: no patterns in the residuals vs. predictor plot
SIMPLE LINEAR REGRESSION: example 1
2. Normality of the residuals
Q-Q plot + Shapiro-Wilk test on the residuals
> shapiro.test(residuals)
        Shapiro-Wilk normality test
data:  residuals
W = 0.9669, p-value = 0.2461
ok
SIMPLE LINEAR REGRESSION: example 1
3. Homoscedasticity
Call:
lm(formula = abs(residuals) ~ yfitted)
Coefficients:
             Estimate      SE      t      P
(Intercept)   2.17676 2.04315  1.065  0.293
yfitted       0.11428 0.07636  1.497  0.142
ok
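A sketch of these three residual checks in R (model is a hypothetical lm() fit as above):

res <- residuals(model)
fit <- fitted(model)

plot(fit, res)                 # 1. linearity: no patterns in residuals vs fitted
qqnorm(res); qqline(res)       # 2. normality: Q-Q plot of the residuals
shapiro.test(res)              #    Shapiro-Wilk test on the residuals
summary(lm(abs(res) ~ fit))    # 3. homoscedasticity: slope should be n.s.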
SIMPLE LINEAR REGRESSION: example 2
1. Linearity between X and Y?
NO LINEARITY between X and Y
SIMPLE LINEAR REGRESSION: example 2
2. Normality of the residuals
Q-Q plot + Shapiro-Wilk test on the residuals
> shapiro.test(residuals)
        Shapiro-Wilk normality test
data:  residuals
W = 0.8994, p-value = 0.001199
no
SIMPLE LINEAR REGRESSION: example 2
3. Homoscedasticity
[plots comparing heteroscedastic (NO) and homoscedastic (YES) residuals]
SIMPLE LINEAR REGRESSION: example 2
How to deal with non-linearity and non-normality situations?
Transformation of the data
- Box-Cox transformation (power transformation of the response)
- Square-root transformation
- Log transformation
- Arcsin transformation
Polynomial regression
Regression with multiple terms (linear, quadratic, and cubic):
Y = a + b1*X + b2*X^2 + b3*X^3 + error
X is one variable!!!
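In R a polynomial regression is still fitted with lm(); a minimal sketch (dat, y and x are hypothetical):

model.poly <- lm(y ~ x + I(x^2) + I(x^3), data = dat)  # one X, three slopes
summary(model.poly)                                     # test the highest term first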
POLYNOMIAL REGRESSION: one X, n parameters
Hierarchy in the testing (always test the highest)!!!!
X + X^2 + X^3: if the cubic term has P < 0.01, stop; if n.s., drop it and fit
X + X^2: if the quadratic term has P < 0.01, stop; if n.s., drop it and fit
X: if the linear term has P < 0.01, stop; if n.s., there is no relation.
NB: Do not delete lower terms even if non-significant.
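Continuing the model.poly sketch above, the hierarchy can be followed with update():

summary(model.poly)                               # look at the I(x^3) row first
model.quad <- update(model.poly, . ~ . - I(x^3))  # drop the cubic term if n.s.
summary(model.quad)                               # then look at the I(x^2) row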
MULTIPLE LINEAR REGRESSION: more than one x
Multiple regression
Regression with two or more variables:
Y = a + b1*X1 + b2*X2 + ... + bi*Xi
Assumptions: same assumptions as in simple linear regression!!!
The Multiple Regression Model
There are important issues involved in carrying out a multiple regression:
• which explanatory variables to include (VARIABLE SELECTION);
• NON-LINEARITY in the response to the explanatory variables;
• INTERACTIONS between explanatory variables;
• correlation between explanatory variables (COLLINEARITY);
• RELATIVE IMPORTANCE of variables.
MULTIPLE LINEAR REGRESSION: more than one x
Multiple regression MODEL
Regression with two or more variables
Y = a + b1*X1 + b2*X2 + ... + bi*Xi
Each slope (bi) is a partial regression coefficient. The bi are the most important parameters of the multiple regression model: they measure the expected change in the dependent variable associated with a one-unit change in an independent variable, holding the other independent variables constant. This interpretation of partial regression coefficients is very important because independent variables are often correlated with one another.
MULTIPLE LINEAR REGRESSION: more than one x
Multiple regression MODEL EXPANDED
We can add polynomial terms and interactions
Y = a + linear terms + quadratic & cubic terms + interactions
QUADRATIC AND CUBIC TERMS account for NON-LINEARITY.
INTERACTIONS account for non-independent effects of the factors.
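A sketch of how such an expanded model is written in lm() (y, x1, x2 and dat are hypothetical):

model.full <- lm(y ~ x1 * x2 + I(x1^2) + I(x2^2), data = dat)
# x1 * x2 expands to x1 + x2 + x1:x2 (linear terms plus their interaction);
# I(x1^2) and I(x2^2) add the quadratic terms
summary(model.full)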
MULTIPLE LINEAR REGRESSION:
Multiple regression step by step:
I step: Check collinearity (visualization with pairs() and correlation) and check linearity
II step: Variable selection and model building (different procedures to select the significant variables)
III step: Check residuals (check the assumptions looking at the residuals: normality and homogeneity of variance)
MULTIPLE LINEAR REGRESSION: I STEP
I STEP: -Check collinearity -Check linearity
Let’s begin with an example from air pollution studies. How is ozone concentration related to wind speed, air temperature and the intensity of solar radiation?
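A sketch of this first step, assuming a hypothetical data frame ozone.data with columns ozone, wind, temp and rad:

# pairwise scatterplots: look for collinearity among predictors and non-linearity
pairs(ozone.data, panel = panel.smooth)

# correlation matrix of the explanatory variables
cor(ozone.data[, c("wind", "temp", "rad")])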
MULTIPLE LINEAR REGRESSION: II STEP
II STEP: MODEL BUILDING
Start with a complex model with interactions and quadratic and cubic terms, then simplify it down to the Minimum Adequate Model. How to carry out a model simplification in multiple regression:
1. Remove non-significant interaction terms.
2. Remove non-significant quadratic or other non-linear terms.
3. Remove non-significant explanatory variables.
4. Amalgamate explanatory variables that have similar parameter values.
MULTIPLE LINEAR REGRESSION: II STEP
Start with the most complicated model (it is one approach)
model1
               Estimate Std.Error     t Pr(>|t|)
(Intercept)     5.7E+02   2.1E+02  2.74     0.01 **
temp           -1.1E+01   4.3E+00 -2.50     0.01 *
wind           -3.2E+01   1.2E+01 -2.76     0.01 **
rad            -3.1E-01   5.6E-01 -0.56     0.58
I(rad^2)       -3.6E-04   2.6E-04 -1.41     0.16
I(temp^2)       5.8E-02   2.4E-02  2.44     0.02 *
I(wind^2)       6.1E-01   1.5E-01  4.16     0.00 ***
temp:wind       2.4E-01   1.4E-01  1.74     0.09
temp:rad        8.4E-03   7.5E-03  1.12     0.27
wind:rad        2.1E-02   4.9E-02  0.42     0.68
temp:wind:rad  -4.3E-04   6.6E-04 -0.66     0.51
!!! We cannot delete these terms !!!
Delete only the highest interaction: temp:wind:rad
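A sketch of this step in R (the formula for model1 is inferred from the terms in the table above; ozone.data is the hypothetical data frame used earlier):

model1 <- lm(ozone ~ temp * wind * rad + I(rad^2) + I(temp^2) + I(wind^2),
             data = ozone.data)
summary(model1)

# delete only the highest-order interaction and compare the fits
model2 <- update(model1, . ~ . - temp:wind:rad)
anova(model1, model2)   # if the simpler model is not significantly worse, keep it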
MULTIPLE LINEAR REGRESSION: II STEP
Manual model simplification
(It is one of the many philosophies.) Delete the non-significant terms one by one, going from the COMPLEX model to the SIMPLE model.
Hierarchy in the deletion:
1. Highest interactions
2. Cubic terms
3. Quadratic terms
4. Linear terms
At each deletion, test: is the fit of the simpler model worse?
IMPORTANT!!!
If quadratic and cubic terms are significant, you cannot delete the linear or the quadratic term even if they are not significant. If an interaction is significant, you cannot delete the corresponding linear terms even if they are not significant.
MULTIPLE LINEAR REGRESSION: III STEP
III STEP: we must check the assumptions
[diagnostic plots: variance tends to increase with y; non-normal errors]
We can transform the data (e.g. log-transformation of y) and refit the model.
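A sketch of the refit on the log scale (names as in the hypothetical ozone example above):

# refit the current model with a log-transformed response
model.log <- update(model2, log(ozone) ~ .)
plot(model.log)   # re-check the residuals: variance and normality should improve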
MULTIPLE LINEAR REGRESSION: more than one x
The log-transformation has improved our model but maybe there is an outlier
PARTIAL REGRESSION:
With partial regression we can remove the effect of one or more variables (covariates) and test a further factor, which becomes independent from the covariates.
WHEN?
• Would like to hold third variables constant, but cannot manipulate.
• Can use statistical control.
HOW?
• Statistical control is based on residuals. If we regress Y on X1 and take residuals of Y, this part of Y will be uncorrelated with X1, so anything Y residuals correlate with will not be explained by X1.
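A minimal sketch of this residual-based statistical control (dat, y, x1 and x2 are hypothetical):

# remove the effect of the covariate x1 from y
y.res <- residuals(lm(y ~ x1, data = dat))

# test x2 against the part of y that is independent of x1
summary(lm(y.res ~ x2, data = dat))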
PARTIAL REGRESSION: VARIATION PARTITIONING
Relative importance of groups of explanatory variables. R^2 = 76% (TOTAL EXPLAINED VARIATION): what is space and what is environment?
The total variation is partitioned into unexplained variation and explained variation (Full.model), and the explained part into pure Space, Space ∩ Environment, and pure Environment.
Response variable: orthopteran species richness
Explanatory variables: SPACE (latitude + longitude) + ENVIRONMENT (temperature + land-cover heterogeneity)
VARIATION PARTITIONING:
varpart(vegan)
Full.model: Space + Environment together (total explained variation)
Env.model: Environment only; its residuals (env.residuals) are used to isolate the pure Space fraction
Pure.Space.model: VE = 15%
Space.model: Space only; its residuals (space.residuals) are used to isolate the pure Environment fraction
Pure.env.model: VE = 40%
(Each model splits the total variation into unexplained, Space and Environment fractions.)
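A sketch of this analysis with the vegan package (the data frame sites and its column names are hypothetical):

library(vegan)

# richness: orthopteran species richness; two groups of explanatory variables
vp <- varpart(richness,
              ~ latitude + longitude,          # SPACE
              ~ temperature + heterogeneity,   # ENVIRONMENT
              data = sites)
vp          # table of pure and shared variation fractions
plot(vp)    # Venn diagram of the partition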
NON-LINEAR REGRESSION:
nls()
Sometimes we have a mechanistic model for the relationship between y and x, and we want to estimate the parameters and standard errors of the parameters of a specific non-linear equation from data.
We must SPECIFY the exact nature of the function as part of the model formula when we use non-linear modelling. In place of lm() we write nls() (this stands for 'non-linear least squares'). Then, instead of y ~ x + I(x^2) + I(x^3) (polynomial), we write y ~ function to spell out the precise non-linear model we want R to fit to the data.
NON-LINEAR REGRESSION: step by step
1. Plot y against x
2. Get an idea of the family of functions that you can fit
3. Start fitting the different models
4. Specify initial guesses for the values of the parameters
5. [Get the MAM for each by model simplification]
6. Check the residuals
7. Compare PAIRS of models and choose the best
Alternative approach: multimodel inference (minimum deviance + minimum number of parameters)
AIC = scaled deviance + 2k, with k = parameter number + 1
Compare GROUPS of models at a time; model weights and model averaging [see Burnham & Anderson, 2002]
nls():
examples of function families
Asymptotic functions S-shaped functions Humped functions Exponential functions
nls():
Look at the data
[scatterplot of y against age (0-50): which family of functions fits? Asymptotic, S-shaped, humped or exponential?]
Asymptotic exponential: y = a - b*exp(-c*x)
Understand the role of the parameters a, b, and c. Using the data plot, work out sensible starting values. It always helps in cases like this to work out the equation's behaviour at the limits, i.e. find the values of y when x = 0 and when x = ∞.
nls():
Look at the data
Fit the model: y = a - b*exp(-c*x)
1. Estimate a, b, and c (iterative)
2. Extract the fitted values (yi)
3. Check graphically the curve fitting
Can we try another function from the same family? Model choice is always an important issue in curve fitting (particularly for prediction): different behaviour at the limits!
Think about your biological system, not just residual deviance!
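A sketch of the nls() fit for the asymptotic exponential (the data frame jaws with columns age and bone is hypothetical; the start values are guesses read off the data plot):

model.asym <- nls(bone ~ a - b * exp(-c * age), data = jaws,
                  start = list(a = 120, b = 110, c = 0.06))
summary(model.asym)                              # estimates and SEs of a, b, c

plot(bone ~ age, data = jaws)
av <- seq(0, 50, 0.5)
lines(av, predict(model.asym, list(age = av)))   # draw the fitted curve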
nls():
Look at the data
Michaelis-Menten
Fit a second model: y = a*x / (1 + b*x)
1. Extract the fitted values (yi)
2. Check graphically the curve fitting
You can see that the asymptotic exponential (solid line) tends to get to its asymptote first, and that the Michaelis-Menten (dotted line) continues to increase. Model choice would therefore be enormously important if you intended to use the model for prediction to ages much greater than 50 months.
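A corresponding sketch for the Michaelis-Menten fit (same hypothetical jaws data; start values are again rough guesses):

model.mm <- nls(bone ~ a * age / (1 + b * age), data = jaws,
                start = list(a = 8, b = 0.08))
summary(model.mm)

# compare the two candidate models, e.g. by AIC
AIC(model.asym, model.mm)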
Application of regression: prediction
Regression models for prediction: a model can be used to predict values of y in space or in time knowing new xi values, but only within the spatial extent and data range of the original data (YES inside, NO outside). Before using a model for prediction it has to be VALIDATED!!!
2 APPROACHES
VALIDATION
1. In data-rich situation, set aside validation (use one part of data set to fit model, second part for assessing prediction error of final selected model).
(Residual = prediction error between the model fit and the predicted value.)
2. If data are scarce, we must resort to "artificially produced" validation sets:
Cross-validation and Bootstrap
K-FOLD CROSS-VALIDATION
Split the data randomly into K groups of roughly the same size. Take turns using one group as the test set and the other K-1 groups as the training set for fitting the model. With K = 5, each of the 5 groups is held out once as the test set, giving prediction errors 1 to 5. The cross-validation estimate of the prediction error is the average of these.
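A minimal sketch of K-fold cross-validation for a simple lm() (dat, y and x are hypothetical):

set.seed(1)
k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(dat)))   # random fold labels
cv.err <- numeric(k)

for (i in 1:k) {
  train <- dat[folds != i, ]                 # K-1 folds used for fitting
  test  <- dat[folds == i, ]                 # one fold held out as test set
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  cv.err[i] <- mean((test$y - pred)^2)       # prediction error of fold i
}
mean(cv.err)   # cross-validation estimate of the prediction error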
BOOTSTRAP
1. Generate a large number (e.g. n = 10 000) of bootstrap samples.
2. For each bootstrap sample, compute the prediction error (error 1, error 2, error 3, ..., error n).
3. The mean of these estimates is the bootstrap estimate of prediction error.
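A minimal sketch of one simple variant of this idea (again with the hypothetical dat, y and x: each model is fitted on a bootstrap sample and its prediction error assessed on the original data):

set.seed(1)
n.boot   <- 10000
boot.err <- numeric(n.boot)

for (i in 1:n.boot) {
  idx  <- sample(nrow(dat), replace = TRUE)   # bootstrap sample of rows
  fit  <- lm(y ~ x, data = dat[idx, ])
  pred <- predict(fit, newdata = dat)
  boot.err[i] <- mean((dat$y - pred)^2)       # prediction error of this sample
}
mean(boot.err)   # bootstrap estimate of the prediction error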
Application of regression: prediction
1. Do not use your model for prediction without carrying out a validation
If you can, use an independent data set for validating the model. If you cannot, use at least bootstrap or cross-validation.
2. Never extrapolate