Multiple linear regression

Experimental design and analysis
Multiple linear regression
© Gerry Quinn & Mick Keough, 1998
Do not copy or distribute without
permission of authors.
Multiple regression
• One response (dependent) variable:
–Y
• More than one predictor (independent)
variable:
– X1, X2, X3 etc.
– number of predictors = p
• Number of observations = n
Example
• A sample of 51 mammal species (n = 51)
• Response variable:
– total sleep time in hrs/day (y)
• Predictors:
– body weight in kg (x1)
– brain weight in g (x2)
– maximum life span in years (x3)
– gestation time in days (x4)
Regression models
Population model (equation):
• yi = β0 + β1x1 + β2x2 + .... + εi
Sample equation:
• yi = b0 + b1x1 + b2x2 + ....
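The sample equation can be estimated by ordinary least squares. The following is a minimal sketch, assuming NumPy is available; the function name `fit_ols` and the small noise-free data set are made up for illustration and are not the mammal sleep data.

```python
import numpy as np

def fit_ols(X, y):
    """Return [b0, b1, ..., bp] for y = b0 + b1*x1 + ... + bp*xp."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # A leading column of ones estimates the intercept b0.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical data generated without noise from y = 2 + 0.5*x1 - 1.5*x2,
# so the fitted coefficients should recover those three values.
X_demo = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 4]]
y_demo = [2.0 + 0.5 * x1 - 1.5 * x2 for x1, x2 in X_demo]
b_demo = fit_ols(X_demo, y_demo)
print(b_demo)
```

With real data the residual term means the estimates b0, b1, ... will only approximate the population values.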
Example
• Regression model:
sleep = intercept + β1*bodywt +
β2*brainwt + β3*lifespan + β4*gestime
Multiple regression equation
[Figure: total sleep plotted against log body weight and log lifespan]
Partial regression coefficients
• H0: β1 = 0
• Partial population regression coefficient
(slope) for y on x1, holding all other x’s
constant, equals zero
• Example:
– slope of regression of sleep against body
weight, holding brain weight, max. life span
and gestation time constant, is 0.
Partial regression coefficients
• H0: β2 = 0
• Partial population regression coefficient
(slope) for y on x2, holding all other x’s
constant, equals zero
• Example:
– slope of regression of sleep against brain
weight, holding body weight, max. life span
and gestation time constant, is 0.
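One way to see what "holding all other x's constant" means: the partial slope for x1 equals the slope obtained after the linear effect of the other predictors has been removed from both y and x1. The sketch below checks this numerically, assuming NumPy; the data and the helper names `ols` and `residuals` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients [b0, b1, ...]; an intercept is added."""
    D = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    return b

def residuals(X, y):
    """Residuals of y after regressing it on X (with intercept)."""
    D = np.column_stack([np.ones(len(y)), X])
    return y - D @ ols(X, y)

# Made-up data with two predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 7.0, 5.0, 6.0, 9.0])
y = np.array([3.1, 3.9, 6.2, 6.8, 10.1, 9.7, 11.2, 14.0])

# Partial slope for x1 from the full model y ~ x1 + x2.
b_full = ols(np.column_stack([x1, x2]), y)

# Remove the linear effect of x2 from y and from x1, then regress
# the adjusted y on the adjusted x1 (both have mean zero).
y_adj = residuals(x2.reshape(-1, 1), y)
x1_adj = residuals(x2.reshape(-1, 1), x1)
partial_slope = (x1_adj @ y_adj) / (x1_adj @ x1_adj)

print(b_full[1], partial_slope)  # the two slopes agree
```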
Testing H0: βi = 0
• Use partial t-tests:
• t = bi / SEbi
• Compare with t-distribution with n-p-1 df
• Separate t-test for each partial
regression coefficient in model
• Usual logic of t-tests:
– reject H0 if P < 0.05
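The partial t statistics can be computed directly from a least-squares fit. This is a sketch under stated assumptions: NumPy is available, the function name `partial_t_tests` and the data are invented for illustration, and the standard errors come from the usual formula SE = sqrt(MSResidual * diag((X'X)^-1)) with n - p - 1 residual degrees of freedom.

```python
import numpy as np

def partial_t_tests(X, y):
    """Return (b, se, t, df) for each coefficient, intercept first."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    D = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ b
    df = n - p - 1                      # residual degrees of freedom
    ms_residual = (resid @ resid) / df
    se = np.sqrt(ms_residual * np.diag(np.linalg.inv(D.T @ D)))
    return b, se, b / se, df

# Hypothetical data: 8 observations, 2 predictors.
X_demo = np.array([[1, 2], [2, 1], [3, 4], [4, 3],
                   [5, 7], [6, 5], [7, 6], [8, 9]], dtype=float)
y_demo = np.array([3.1, 3.9, 6.2, 6.8, 10.1, 9.7, 11.2, 14.0])

b_demo, se_demo, t_demo, df_demo = partial_t_tests(X_demo, y_demo)
for name, b, se, t in zip(["intercept", "x1", "x2"], b_demo, se_demo, t_demo):
    print(f"{name:9s} b = {b:7.3f}  SE = {se:6.3f}  t = {t:6.2f}")
print("df =", df_demo)
```

A two-sided P value for each t could then be taken from a t distribution with `df_demo` degrees of freedom, for example with `scipy.stats.t.sf` if SciPy is installed.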
Model comparison
• To test H0: β1 = 0
• Fit full model:
– y = β0+β1x1+β2x2+β3x3+…
• Fit reduced model:
– y = β0+β2x2+β3x3+…
• Calculate SSextra:
– SSRegression(full) - SSRegression(reduced)
• F = MSextra / MSResidual(full)
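The full versus reduced comparison can be sketched as follows, assuming NumPy. The function names and data are hypothetical; MSextra is SSextra divided by the number of terms dropped, and MSResidual(full) uses n - p - 1 degrees of freedom.

```python
import numpy as np

def ss_regression_and_residual(X, y):
    """Return (SSRegression, SSResidual) for a fit with intercept."""
    D = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    fitted = D @ b
    ss_reg = np.sum((fitted - y.mean()) ** 2)
    ss_res = np.sum((y - fitted) ** 2)
    return ss_reg, ss_res

def extra_ss_f(X_full, X_reduced, y):
    """F test for the predictors present in X_full but not X_reduced."""
    n, p_full = X_full.shape
    q = p_full - X_reduced.shape[1]          # number of terms dropped
    ss_reg_full, ss_res_full = ss_regression_and_residual(X_full, y)
    ss_reg_red, _ = ss_regression_and_residual(X_reduced, y)
    ms_extra = (ss_reg_full - ss_reg_red) / q
    df_resid = n - p_full - 1
    ms_res_full = ss_res_full / df_resid
    return ms_extra / ms_res_full, q, df_resid

# Hypothetical data; the reduced model drops x1 (the first column).
X_full = np.array([[1, 2], [2, 1], [3, 4], [4, 3],
                   [5, 7], [6, 5], [7, 6], [8, 9]], dtype=float)
y = np.array([3.1, 3.9, 6.2, 6.8, 10.1, 9.7, 11.2, 14.0])
X_reduced = X_full[:, 1:]

f_stat, df_extra, df_resid = extra_ss_f(X_full, X_reduced, y)
print(f_stat, df_extra, df_resid)
```

When a single predictor is dropped, this F equals the square of that predictor's partial t statistic, so the two tests give the same P value.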
Overall regression model
• H0: β1 = β2 = ... = 0 (all population
slopes equal zero).
• Test of whether overall regression
equation is significant.
• Use ANOVA F-test:
– Variation explained by regression
– Unexplained (residual) variation
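The ANOVA F-ratio is MSRegression / MSResidual, with p and n - p - 1 degrees of freedom. Dividing both sums of squares by SSTotal lets the same ratio be written in terms of R2. The sketch below uses that form with R2 = 0.787, n = 20 and p = 2, the values reported in the first collinearity example in these slides; the function name is hypothetical.

```python
def overall_f_from_r2(r2, n, p):
    """Overall regression F statistic from R2, sample size n and p predictors."""
    df_regression = p
    df_residual = n - p - 1
    # (SSRegression / p) / (SSResidual / (n - p - 1)), both scaled by SSTotal.
    f_stat = (r2 / df_regression) / ((1.0 - r2) / df_residual)
    return f_stat, df_regression, df_residual

f_stat, df1, df2 = overall_f_from_r2(0.787, 20, 2)
print(round(f_stat, 2), df1, df2)
```

The result is close to the reported F = 31.38; the small difference comes from R2 being rounded to three decimals.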
Regression diagnostics
• Residual is still observed y - predicted y
– Studentised residuals still work
• Other diagnostics still apply:
– residual plots
– Cook’s D statistics
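These diagnostics can be computed from the hat matrix. The following sketch assumes NumPy and uses internally studentised residuals, e / sqrt(MSResidual * (1 - h)), and Cook's D in its standard leverage form; the function name and data are made up for illustration.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Return residuals, leverages, studentised residuals and Cook's D."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    D = np.column_stack([np.ones(n), X])
    hat = D @ np.linalg.inv(D.T @ D) @ D.T   # hat matrix: fitted = hat @ y
    leverage = np.diag(hat)
    resid = y - hat @ y                      # observed y - predicted y
    ms_residual = (resid @ resid) / (n - p - 1)
    studentised = resid / np.sqrt(ms_residual * (1.0 - leverage))
    cooks_d = studentised ** 2 * leverage / ((p + 1) * (1.0 - leverage))
    return resid, leverage, studentised, cooks_d

# Hypothetical data: 8 observations, 2 predictors.
X_demo = np.array([[1, 2], [2, 1], [3, 4], [4, 3],
                   [5, 7], [6, 5], [7, 6], [8, 9]], dtype=float)
y_demo = np.array([3.1, 3.9, 6.2, 6.8, 10.1, 9.7, 11.2, 14.0])

resid, leverage, studentised, cooks_d = regression_diagnostics(X_demo, y_demo)
print(np.round(studentised, 2))
print(np.round(cooks_d, 3))
```

Plotting `resid` against the fitted values gives the usual residual plot; observations with unusually large Cook's D are the ones worth inspecting.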
Assumptions
• Normality and homogeneity of variance
for response variable
• Independence of observations
• Linearity
• No collinearity
Collinearity
• Collinearity:
– predictors correlated
• Assumption of no collinearity:
– predictor variables are uncorrelated with
(i.e. independent of) each other
• Collinearity makes estimates of βi’s and
their significance tests unreliable:
– low power for individual tests on βi’s
Collinearity
Response (y) and 2 predictors (x1 and x2); n=20
1. x1 and x2 uncorrelated (r = -0.24)
           coeff     se    tol      t        P
intercept  -0.17   1.03           -0.16    0.873
x1          1.13   0.14   0.95    7.86   <0.001
x2          0.12   0.14   0.95    0.86    0.404
R2 = 0.787, F = 31.38, P < 0.001
Collinearity
2. rearrange x2 so x1 and x2 highly correlated (r = 0.99)
           coeff     se    tol      t        P
intercept   0.49   0.72            0.69    0.503
x1          1.55   1.21   0.01    1.28    0.219
x2         -0.45   1.21   0.01   -0.37    0.714
R2 = 0.780, F = 30.05, P < 0.001
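The effect of rearranging x2 can be imitated with simulated data. This is only a sketch, assuming NumPy: the data are random draws (not the values behind the two tables), and x2 is reordered to share the rank order of x1 while y is left unchanged.

```python
import numpy as np

def coef_and_se(X, y):
    """Least-squares coefficients and their standard errors, intercept first."""
    n, p = X.shape
    D = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ b
    ms_residual = (resid @ resid) / (n - p - 1)
    se = np.sqrt(ms_residual * np.diag(np.linalg.inv(D.T @ D)))
    return b, se

rng = np.random.default_rng(1)     # fixed seed so the run is repeatable
n = 20
x1 = rng.normal(10.0, 2.0, n)
x2 = rng.normal(10.0, 2.0, n)      # generated independently of x1
y = 1.0 * x1 + rng.normal(0.0, 1.0, n)

# Rearrange x2: the smallest x2 goes with the smallest x1, and so on.
x2_sorted = np.empty(n)
x2_sorted[np.argsort(x1)] = np.sort(x2)

r_before = np.corrcoef(x1, x2)[0, 1]
r_after = np.corrcoef(x1, x2_sorted)[0, 1]
_, se_before = coef_and_se(np.column_stack([x1, x2]), y)
_, se_after = coef_and_se(np.column_stack([x1, x2_sorted]), y)

print("r before, after:", round(r_before, 2), round(r_after, 2))
print("SE(b1) before, after:", round(se_before[1], 2), round(se_after[1], 2))
```

As in the two tables, the overall fit changes little, but the standard error of the slope for x1 grows once the predictors are strongly correlated.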
Checks for collinearity
• Correlation matrix between predictors
• Tolerance for each predictor:
– 1-R2 for regression of that predictor on all others
– if tolerance is low (<0.1) then collinearity is a
problem
• Variance inflation factor (VIF) for each
predictor:
– 1/tolerance
– if VIF>10 then collinearity is a problem
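Tolerance and VIF follow directly from these definitions: regress each predictor on all the others and take 1 - R2. A minimal sketch, assuming NumPy; the function name and the three-predictor data set are hypothetical.

```python
import numpy as np

def tolerance_and_vif(X):
    """Return (tolerance, VIF) arrays, one value per predictor column."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    tolerances = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        D = np.column_stack([np.ones(n), others])
        b, *_ = np.linalg.lstsq(D, target, rcond=None)
        ss_res = np.sum((target - D @ b) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        tolerances.append(ss_res / ss_tot)   # equals 1 - R2
    tolerances = np.array(tolerances)
    return tolerances, 1.0 / tolerances

# Made-up predictors: 8 observations, 3 columns.
X_demo = np.array([[1, 2, 5], [2, 1, 3], [3, 4, 6], [4, 3, 2],
                   [5, 7, 4], [6, 5, 8], [7, 6, 1], [8, 9, 7]], dtype=float)

tol, vif = tolerance_and_vif(X_demo)
for name, t, v in zip(["x1", "x2", "x3"], tol, vif):
    flag = "  <- check collinearity" if t < 0.1 else ""
    print(f"{name}: tolerance = {t:.2f}, VIF = {v:.2f}{flag}")
```

With only two predictors, the tolerance of each is simply 1 - r2, where r is the correlation between them.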
Explained variance
R2 = SSRegression / SSTotal
proportion of variation in y explained by
linear relationship with x1, x2 etc.
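A small worked example of the ratio, in plain Python. The four y values and their fitted values are made up; the fitted values are the least-squares line for y against x = 1, 2, 3, 4, which matters because SSRegression / SSTotal only equals 1 - SSResidual / SSTotal for a least-squares fit with an intercept.

```python
def r_squared(y, fitted):
    """R2 = SSRegression / SSTotal for a least-squares fit with intercept."""
    y_bar = sum(y) / len(y)
    ss_regression = sum((f - y_bar) ** 2 for f in fitted)
    ss_total = sum((v - y_bar) ** 2 for v in y)
    return ss_regression / ss_total

y_demo = [1.0, 3.0, 2.0, 4.0]
fitted_demo = [1.3, 2.1, 2.9, 3.7]   # from the line y = 0.5 + 0.8*x
r2_demo = r_squared(y_demo, fitted_demo)
print(round(r2_demo, 2))  # SSRegression = 3.2, SSTotal = 5.0
```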
Example
Species            Sleep      Bodywt    Brainwt   Lifespan   Gestime
                   (hrs/day)  (kg)      (g)       (years)    (days)
African elephant    3.3       6654.000  5712.0    38.6       645
Arctic fox         12.5          3.385    44.5    14.0        60
etc.
[Figure: boxplots of variables]
Predictors log transformed
Parameter   Estimate    SE     Tol      t        P
Intercept    18.94     3.11            6.09   <0.001
Bodywt       -0.76     1.31   0.08    -0.58    0.565
Brainwt      -0.84     2.03   0.05    -0.42    0.680
Lifespan      2.60     2.05   0.33     1.27    0.211
Gestime      -5.11     1.81   0.36    -2.82    0.007
R2 = 0.486
Collinearity problem for body weight and brain weight
• low tolerance
• highly correlated
Omit brain weight because body weight and
brain weight are so highly correlated.
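Applying the tolerance < 0.1 and VIF > 10 rules to the tolerances reported for the full model makes the problem explicit. The tolerances below are the ones given in these slides; only the variable names in the code are invented.

```python
# Tolerances reported for the four log-transformed predictors (full model).
tolerances = {"Bodywt": 0.08, "Brainwt": 0.05, "Lifespan": 0.33, "Gestime": 0.36}

# VIF is the reciprocal of tolerance.
vif = {name: 1.0 / tol for name, tol in tolerances.items()}

# Flag predictors whose tolerance falls below the 0.1 rule of thumb.
flagged = [name for name, tol in tolerances.items() if tol < 0.1]

for name in tolerances:
    print(f"{name}: tolerance = {tolerances[name]:.2f}, VIF = {vif[name]:.1f}")
print("Flagged:", flagged)
```

Only body weight and brain weight are flagged, which is why one of the pair is dropped before refitting.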
Parameter   Estimate    SE     Tol      t        P
Intercept    19.06     3.07            6.21   <0.001
Bodywt       -1.25     0.59   0.36    -2.09    0.042
Lifespan      2.19     1.78   0.43     1.23    0.225
Gestime      -5.39     1.67   0.42    -3.23    0.002
R2 = 0.484
No collinearity between any predictors:
• all tolerances OK
• reduced SE and larger slope for body weight
Examples from literature
Lampert (1993)
• Ecology 74:1455-1466
• Response variable:
– Daphnia (water flea) clutch size
• Predictors:
– body size (mm)
– particulate organic carbon (mg/L)
– temperature (°C)
Lampert (1993)
Parameter    Coeff.     SE       t       P
Intercept   -42.34    27.52   -1.54   0.168
Body size    14.76     7.10    2.08   0.076
POC           0.27     0.43    0.61   0.559
Temp          0.73     0.68    1.07   0.321
ANOVA P = 0.052, R2 = 0.684, n = 11
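As a quick consistency check on the Lampert (1993) values, each t can be recomputed as coefficient / SE, and the residual degrees of freedom follow from n = 11 and p = 3. The numbers are those reported above; the variable names are hypothetical.

```python
# (coefficient, SE, reported t) for each term, as reported for Lampert (1993).
rows = {
    "Intercept": (-42.34, 27.52, -1.54),
    "Body size": (14.76, 7.10, 2.08),
    "POC": (0.27, 0.43, 0.61),
    "Temp": (0.73, 0.68, 1.07),
}

n, p = 11, 3
df_residual = n - p - 1   # degrees of freedom for each partial t-test

recomputed_t = {name: coeff / se for name, (coeff, se, _) in rows.items()}
max_gap = max(abs(recomputed_t[name] - t) for name, (_, _, t) in rows.items())

for name, (_, _, t) in rows.items():
    print(f"{name}: reported t = {t:5.2f}, coeff/SE = {recomputed_t[name]:5.2f}")
print("residual df =", df_residual)
```

The recomputed values match the reported ones to within rounding of the coefficients and standard errors.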
Williams et al. (1993)
• Ecology 74:904-918
• Response variable:
– Zostera (seagrass) growth
• Predictors:
– epiphyte biomass
– porewater ammonium
Williams et al. (1993)
Parameter             Coeff.    P
Epiphyte biomass      0.340    >0.05
Porewater ammonium    0.919    <0.05
R2 = 0.71
Tolerance = 0.839 (so no collinearity)
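Because this model has only two predictors, the tolerance is 1 - r2, where r is the correlation between epiphyte biomass and porewater ammonium. The short calculation below recovers the size of that correlation from the reported tolerance; the sign of r cannot be recovered this way.

```python
import math

tolerance = 0.839                      # reported for Williams et al. (1993)
r_squared_between = 1.0 - tolerance    # R2 of one predictor on the other
abs_r = math.sqrt(r_squared_between)   # magnitude of their correlation

print(round(abs_r, 2))
```

A correlation of roughly 0.4 in magnitude between the two predictors is well short of the level at which the tolerance rule of thumb (< 0.1) would signal a problem.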
