
Lecture 9: ANOVA tables and F-tests
BMTRY 701: Biostatistical Methods II
ANOVA
• Analysis of Variance
• Similar in derivation to the ANOVA that generalizes the two-sample t-test
• Partitions the total variance into two parts:
  • that due to the 'model': SSR
  • that due to 'error': SSE
• The sum of the two parts is the total sum of squares: SST
[Figure: Total deviations $Y_i - \bar{Y}$, shown on a scatterplot of data$logLOS vs. data$BEDS]
[Figure: Regression deviations $\hat{Y}_i - \bar{Y}$, shown on a scatterplot of data$logLOS vs. data$BEDS]
[Figure: Error deviations $Y_i - \hat{Y}_i$, shown on a scatterplot of data$logLOS vs. data$BEDS]
Definitions
$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$

$$SST = \sum (Y_i - \bar{Y})^2$$
$$SSR = \sum (\hat{Y}_i - \bar{Y})^2$$
$$SSE = \sum (Y_i - \hat{Y}_i)^2$$

$$SST = SSR + SSE$$
Example: logLOS ~ BEDS
> ybar <- mean(data$logLOS)
> yhati <- reg$fitted.values
> sst <- sum((data$logLOS - ybar)^2)
> ssr <- sum((yhati - ybar)^2)
> sse <- sum((data$logLOS - yhati)^2)
> sst
[1] 3.547454
> ssr
[1] 0.6401715
> sse
[1] 2.907282
> sse + ssr
[1] 3.547454
Degrees of Freedom
• Degrees of freedom for SST: n - 1
  • one df is lost because the sample mean $\bar{Y}$ is used to estimate the mean of Y
• Degrees of freedom for SSR: 1
  • only one df because all fitted values come from the same fitted regression line
• Degrees of freedom for SSE: n - 2
  • two df are lost to estimating the regression line (slope and intercept)
Mean Squares
• "Scaled" version of a Sum of Squares
• Mean Square = SS/df
• MSR = SSR/1
• MSE = SSE/(n - 2)
• Notes:
  • mean squares are not additive! That is, MSR + MSE ≠ SST/(n - 1)
  • MSE is the same estimate of $\sigma^2$ we saw previously (computed below for the running example)
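Continuing the console example above (this assumes reg, ssr, and sse from the earlier session are still in the workspace; n is the number of hospitals, 113 here, so n - 2 = 111):

> n <- nrow(data)          # 113 hospitals
> msr <- ssr / 1           # mean square for regression
> mse <- sse / (n - 2)     # 2.907282 / 111 = 0.02619, matching the
>                          # Residuals Mean Sq in anova(reg) below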
Standard ANOVA Table
Source       SS    df      MS
Regression   SSR   1       MSR
Error        SSE   n - 2   MSE
Total        SST   n - 1
ANOVA for logLOS ~ BEDS
> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS         1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals  111 2.90728 0.02619
Inference?
• What is of interest, and how do we interpret it?
• We would like to know whether BEDS is related to logLOS.
• How do we do that using the ANOVA table?
• We need to know the expected values of the MSR and MSE:

$$E(MSE) = \sigma^2$$
$$E(MSR) = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$$
Implications
$$E(MSE) = \sigma^2$$
$$E(MSR) = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$$

• the mean of the sampling distribution of the MSE is $\sigma^2$ regardless of whether or not $\beta_1 = 0$
• if $\beta_1 = 0$, then E(MSE) = E(MSR)
• if $\beta_1 \neq 0$, then E(MSE) < E(MSR)
• to test the significance of $\beta_1$, we can test whether the MSR and MSE are of the same magnitude
F-test
• Derived naturally from the arguments just made
• Hypotheses:
  • H0: $\beta_1 = 0$
  • H1: $\beta_1 \neq 0$
• Test statistic: $F^* = MSR/MSE$
• Based on the argument above, we expect $F^* > 1$ if H1 is true
• This implies a one-sided test
F-test
• The distribution of $F^*$ under the null has two sets of degrees of freedom (df):
  • numerator degrees of freedom
  • denominator degrees of freedom
• These correspond to the df shown in the ANOVA table:
  • numerator df = 1
  • denominator df = n - 2
• The test is based on

$$F^* = \frac{MSR}{MSE} \sim F(1,\, n-2)$$
Implementing the F-test
• The decision rule (illustrated in R below):
  • if $F^* > F(1-\alpha;\, 1,\, n-2)$, reject H0
  • if $F^* \leq F(1-\alpha;\, 1,\, n-2)$, fail to reject H0
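A minimal sketch of the decision rule in R with α = 0.05, reusing msr and mse from above:

> fstar <- msr / mse             # 24.44 for the logLOS ~ BEDS example
> fstar > qf(0.95, 1, 111)       # TRUE: 24.44 > 3.93, so reject H0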
[Figure: F-distribution densities for F(1,10), F(1,1000), F(5,10), and F(5,1000), plotted for x in (0, 6)]
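A sketch of R code that draws curves like those in the figure (the grid of x values and the line types are choices, not taken from the original plot):

x <- seq(0.01, 6, length = 500)                      # start just above 0: the
plot(x, df(x, 1, 10), type = "l", lty = 1,           # F(1, df2) density is
     ylim = c(0, 0.8), xlab = "x", ylab = "density") # unbounded at x = 0
lines(x, df(x, 1, 1000), lty = 2)
lines(x, df(x, 5, 10),   lty = 3)
lines(x, df(x, 5, 1000), lty = 4)
legend("topright", c("F(1,10)", "F(1,1000)", "F(5,10)", "F(5,1000)"), lty = 1:4)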
ANOVA for logLOS ~ BEDS
> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS         1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals  111 2.90728 0.02619

> qf(0.95, 1, 111)
[1] 3.926607
> 1 - pf(24.44, 1, 111)
[1] 2.739016e-06
More interesting: MLR
• You can test that several coefficients are zero at the same time
• Otherwise, the F-test gives the same result as a t-test
• That is, for testing the significance of ONE covariate in a linear regression model, an F-test and a t-test give the same result (verified below):
  • H0: $\beta_1 = 0$
  • H1: $\beta_1 \neq 0$
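The equivalence is easy to verify for the logLOS ~ BEDS fit, since $t(n-2)^2 = F(1,\, n-2)$. A quick check, assuming reg is the simple linear regression fitted earlier:

> summary(reg)$coefficients["BEDS", "t value"]^2   # t^2 = 24.44
> anova(reg)["BEDS", "F value"]                    # F   = 24.44, identical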
general F testing approach
• The previous test seems simple
• It is in this case, but it can be generalized to be more useful
• Imagine a more general test:
  • H0: small model
  • Ha: large model
• Constraint: the small model must be 'nested' in the large model
• That is, the covariates of the small model must be a 'subset' of those in the large model (the test statistic is given below)
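The resulting statistic is the extra-sum-of-squares F-test; the slide does not state the formula, but the standard form is

$$F^* = \frac{(SSE_{small} - SSE_{large})/(df_{small} - df_{large})}{SSE_{large}/df_{large}} \sim F(df_{small} - df_{large},\; df_{large})$$

where $df_{small}$ and $df_{large}$ are the error degrees of freedom of the two models.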
Example of ‘nested’ models
Model 1:
$$LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$$

Model 2:
$$LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$$

Model 3:
$$LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + e_i$$

Models 2 and 3 are nested in Model 1.
Model 2 is not nested in Model 3, and Model 3 is not nested in Model 2: neither covariate set is a subset of the other.
Testing: Models must be nested!
• To test Model 1 vs. Model 2:
  • we are testing that $\beta_2 = 0$
  • H0: $\beta_2 = 0$ vs. Ha: $\beta_2 \neq 0$
  • if we fail to reject the null hypothesis, we conclude that the smaller Model 2 is adequate
  • if we reject the null hypothesis, we conclude that Model 1 is superior

Model 1:
$$LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$$

Model 2:
$$LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$$
R
reg1 <- lm(LOS ~ INFRISK + ms + NURSE + nurse2, data=data)
reg2 <- lm(LOS ~ INFRISK + NURSE + nurse2, data=data)
reg3 <- lm(LOS ~ INFRISK + ms, data=data)
> anova(reg1)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.4043 8.115e-10 ***
ms          1  12.897  12.897  5.0288   0.02697 *
NURSE       1   1.097   1.097  0.4277   0.51449
nurse2      1   1.789   1.789  0.6976   0.40543
Residuals 108 276.981   2.565
---
R
> anova(reg2)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 44.8865 9.507e-10 ***
NURSE       1   8.212   8.212  3.1653     0.078 .
nurse2      1   1.782   1.782  0.6870     0.409
Residuals 109 282.771   2.594
---

> anova(reg1, reg2)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + NURSE + nurse2
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    109 282.771 -1    -5.789 2.2574 0.1359
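The F statistic here is just the extra-sum-of-squares formula applied to the two RSS values above; checking by hand:

> ((282.771 - 276.981) / 1) / (276.981 / 108)   # ≈ 2.257, matching the F above
>                                               # (up to rounding of the RSS)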
R
> summary(reg1)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.355e+00  5.266e-01  12.068  < 2e-16 ***
INFRISK      6.289e-01  1.339e-01   4.696 7.86e-06 ***
ms           7.829e-01  5.211e-01   1.502    0.136
NURSE        4.136e-03  4.093e-03   1.010    0.315
nurse2      -5.676e-06  6.796e-06  -0.835    0.405
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.601 on 108 degrees of freedom
Multiple R-squared: 0.3231,    Adjusted R-squared: 0.2981
F-statistic: 12.89 on 4 and 108 DF, p-value: 1.298e-08

Note that the t-test p-value for ms (0.136) matches the nested F-test p-value above (0.1359), as expected: the two tests are equivalent for a single coefficient.
Testing more than two covariates
• To test Model 1 vs. Model 3:
  • we are testing that $\beta_3 = 0$ AND $\beta_4 = 0$
  • H0: $\beta_3 = \beta_4 = 0$ vs. Ha: $\beta_3 \neq 0$ or $\beta_4 \neq 0$
  • if we fail to reject the null hypothesis, we conclude that the smaller Model 3 is adequate
  • if we reject the null hypothesis, we conclude that Model 1 is superior

Model 1:
$$LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$$

Model 3:
$$LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + e_i$$
R
> anova(reg3)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.7683 6.724e-10 ***
ms          1  12.897  12.897  5.0691   0.02634 *
Residuals 110 279.867   2.544
---

> anova(reg1, reg3)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + ms
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    110 279.867 -2    -2.886 0.5627 0.5713
R
> summary(reg3)

Call:
lm(formula = LOS ~ INFRISK + ms, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9037 -0.8739 -0.1142  0.5965  8.5568

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4547     0.5146  12.542   <2e-16 ***
INFRISK       0.6998     0.1156   6.054    2e-08 ***
ms            0.9717     0.4316   2.251   0.0263 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.595 on 110 degrees of freedom
Multiple R-squared: 0.3161,    Adjusted R-squared: 0.3036
F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10
Testing multiple coefficients simultaneously
• Region: it is a 'factor' variable with 4 categories

$$LOS_i = \beta_0 + \beta_1 I(R_i = 2) + \beta_2 I(R_i = 3) + \beta_3 I(R_i = 4) + e_i$$
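A sketch of how this simultaneous test can be run with the nested-model machinery above (this assumes the data frame carries a region variable coded 1-4; the name REGION and the factor recoding are illustrative):

> data$region.f <- factor(data$REGION)      # 4-level factor; level 1 is the reference
> regR <- lm(LOS ~ region.f, data = data)   # 'large' model: 3 indicator terms
> reg0 <- lm(LOS ~ 1, data = data)          # 'small' model: intercept only
> anova(reg0, regR)                         # tests H0: beta1 = beta2 = beta3 = 0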