Lecture 8:
Multiple Linear Regression
Interpretation with different types of predictors
BMTRY 701
Biostatistical Methods II
Interaction
• AKA effect modification.
• Allows the association between two variables to differ across levels of a third variable.
• Example: In the model with length of stay as the outcome, is there an interaction between medschool and nurse?
• Note that ‘adjustment’ is a rather weak form of accounting for a variable.
• Allowing an interaction gives the model much greater flexibility.
Interactions
• Interactions can be formed between
    • two continuous variables
    • a binary and a continuous variable
    • two binary variables
    • a binary variable and a categorical variable with >2 categories
    • etc.
• Three-way interaction: interaction between 3 variables
• Four-way, etc.
Example: log(LOS) ~ INFRISK*MS
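$$\log \mathrm{LOS}_i = \beta_0 + \beta_1 \mathrm{INFRISK}_i + \beta_2 \mathrm{MS}_i + \beta_3 \mathrm{MS}_i \times \mathrm{INFRISK}_i + e_i$$
$$E[\log \mathrm{LOS}_i \mid \mathrm{MS}_i = 0] = \beta_0 + \beta_1 \mathrm{INFRISK}_i$$
$$E[\log \mathrm{LOS}_i \mid \mathrm{MS}_i = 1] = \beta_0 + \beta_1 \mathrm{INFRISK}_i + \beta_2 + \beta_3 \mathrm{INFRISK}_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\mathrm{INFRISK}_i$$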
How does this differ from the model without the
interaction? Without the adjustment?
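• Model 1:
$$\log \mathrm{LOS}_i = \beta_0 + \beta_1 \mathrm{INFRISK}_i + e_i$$
• Model 2:
$$\log \mathrm{LOS}_i = \beta_0 + \beta_1 \mathrm{INFRISK}_i + \beta_2 \mathrm{MS}_i + e_i$$
• Model 3:
$$\log \mathrm{LOS}_i = \beta_0 + \beta_1 \mathrm{INFRISK}_i + \beta_2 \mathrm{MS}_i + \beta_3 \mathrm{MS}_i \times \mathrm{INFRISK}_i + e_i$$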
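One way to see how the three models nest is to look at the design-matrix columns that each formula generates. The sketch below uses a tiny made-up data frame (the name `toy` and its values are placeholders, not the hospital data) and assumes `ms` is coded 0/1.

```r
# Toy data frame: values are illustrative placeholders, NOT the lecture data.
toy <- data.frame(
  INFRISK = c(2.1, 3.4, 4.0, 5.2, 6.3, 4.7),
  ms      = c(0, 1, 0, 1, 0, 1)   # assumed 0/1 coding of med school affiliation
)

# Design-matrix column names for the right-hand side of each model.
cols1 <- colnames(model.matrix(~ INFRISK, data = toy))        # Model 1
cols2 <- colnames(model.matrix(~ INFRISK + ms, data = toy))   # Model 2
cols3 <- colnames(model.matrix(~ INFRISK * ms, data = toy))   # Model 3

# INFRISK*ms is shorthand for INFRISK + ms + INFRISK:ms.
cols3_long <- colnames(model.matrix(~ INFRISK + ms + ms:INFRISK, data = toy))

# The interaction column is the elementwise product of the two predictors.
X3 <- model.matrix(~ INFRISK * ms, data = toy)
product_matches <- all(X3[, "INFRISK:ms"] == toy$INFRISK * toy$ms)

print(cols1)
print(cols2)
print(cols3)
```

Each model adds one column to the previous one, so Model 1 is Model 2 with the `ms` coefficient set to zero, and Model 2 is Model 3 with the interaction coefficient set to zero.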
Model 1
> plot(data$INFRISK, data$logLOS, xlab="Infection Risk, %",
+      ylab="Length of Stay, days", pch=16, cex=1.5)
>
> # Model 1:
> reg1 <- lm(logLOS ~ INFRISK, data=data)
> abline(reg1, lwd=2)
> summary(reg1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.93250    0.04794  40.310  < 2e-16 ***
INFRISK      0.07293    0.01053   6.929 2.92e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1494 on 111 degrees of freedom
Multiple R-Squared: 0.302,     Adjusted R-squared: 0.2957
F-statistic: 48.02 on 1 and 111 DF, p-value: 2.918e-10
[Figure: Model 1. Scatterplot of logLOS (y-axis, labeled "Length of Stay, days") against Infection Risk, % (x-axis), with the fitted line from reg1.]
Model 2
> # Model 2:
> reg2 <- lm(logLOS ~ INFRISK + ms, data=data)
> infriski <- seq(1,8,0.1)
> beta <- reg2$coefficients
> # fitted lines for ms=0 and ms=1 (same slope, intercepts differ by beta[3])
> yhat0 <- beta[1] + beta[2]*infriski
> yhat1 <- beta[1] + beta[2]*infriski + beta[3]
> lines(infriski, yhat0, lwd=2, col=2)
> lines(infriski, yhat1, lwd=2, col=2)
> summary(reg2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.94449    0.04709  41.295  < 2e-16 ***
INFRISK      0.06677    0.01058   6.313 5.91e-09 ***
ms           0.09882    0.03949   2.503   0.0138 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1459 on 110 degrees of freedom
Multiple R-Squared: 0.3396,     Adjusted R-squared: 0.3276
F-statistic: 28.28 on 2 and 110 DF, p-value: 1.232e-10
[Figure: Model 2. Scatterplot of logLOS (y-axis, labeled "Length of Stay, days") against Infection Risk, % (x-axis), with the two fitted lines from reg2 (ms=0 and ms=1) added.]
Model 3
> # Model 3:
> reg3 <- lm(logLOS ~ INFRISK + ms + ms:INFRISK, data=data)
> infriski <- seq(1,8,0.1)
> beta <- reg3$coefficients
> # fitted lines for ms=0 and ms=1 (intercept and slope both differ)
> yhat0 <- beta[1] + beta[2]*infriski
> yhat1 <- beta[1] + beta[3] + (beta[2]+beta[4])*infriski
> lines(infriski, yhat0, lwd=2, col=4)
> lines(infriski, yhat1, lwd=2, col=4)
> summary(reg3)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.947942   0.049698  39.195  < 2e-16 ***
INFRISK     0.065950   0.011220   5.878  4.6e-08 ***
ms          0.059514   0.178622   0.333    0.740
INFRISK:ms  0.007856   0.034807   0.226    0.822
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1466 on 109 degrees of freedom
Multiple R-Squared: 0.3399,     Adjusted R-squared: 0.3217
F-statistic: 18.71 on 3 and 109 DF, p-value: 7.35e-10
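As a worked illustration, plugging the Model 3 estimates from this output into the two conditional expectations gives one fitted line per group:

$$\widehat{E}[\log \mathrm{LOS} \mid \mathrm{MS} = 0] = 1.947942 + 0.065950\,\mathrm{INFRISK}$$
$$\widehat{E}[\log \mathrm{LOS} \mid \mathrm{MS} = 1] = (1.947942 + 0.059514) + (0.065950 + 0.007856)\,\mathrm{INFRISK} = 2.007456 + 0.073806\,\mathrm{INFRISK}$$

The two slopes differ by only 0.007856, which is small relative to its standard error of 0.034807, so the fitted lines are close to parallel.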
[Figure: Model 3. Scatterplot of logLOS (y-axis, labeled "Length of Stay, days") against Infection Risk, % (x-axis), with the two fitted lines from reg3 (ms=0 and ms=1) added.]
Conclusions
• There does not appear to be an interaction between MEDSCHOOL and INFRISK (Model 3 interaction term: p = 0.82).
• Both MEDSCHOOL and INFRISK are associated with log(LOS) in the presence of each other (Model 2).
• The association between INFRISK and log(LOS) is positive: for a 1 percentage-point increase in infection risk, logLOS is expected to increase by 0.07, adjusting for Med School affiliation.
• Hospitals with Med School affiliation tend to have longer average length of stay, adjusting for infection risk.
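Because the outcome is on the log scale, the Model 2 coefficients can also be read as multiplicative effects on length of stay. The calculation below assumes that log denotes the natural logarithm, which the slides do not state explicitly but which is consistent with logLOS values between 2 and 3:

$$e^{0.06677} \approx 1.069, \qquad e^{0.09882} \approx 1.104$$

Under that assumption, a 1 percentage-point higher infection risk corresponds to roughly 7% longer typical (geometric mean) length of stay, and Med School affiliation to roughly 10% longer, each adjusting for the other.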
Interactions with continuous variables
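• How to interpret with continuous variables?
$$\log \mathrm{LOS}_i = \beta_0 + \beta_1 \mathrm{INFRISK}_i + \beta_2 \mathrm{NURSE}_i + \beta_3 \mathrm{INFRISK}_i \times \mathrm{NURSE}_i + e_i$$
• Example: Difference between two hospitals with a 1% difference in INFRISK
$$\log \mathrm{LOS}_i = \beta_0 + \beta_1 \mathrm{INFRISK}_i + \beta_2 \mathrm{NURSE}_i + \beta_3 \mathrm{INFRISK}_i \times \mathrm{NURSE}_i + e_i$$
$$\log \mathrm{LOS}_i = \beta_0 + \beta_1 (\mathrm{INFRISK}_i + 1) + \beta_2 \mathrm{NURSE}_i + \beta_3 (\mathrm{INFRISK}_i + 1) \times \mathrm{NURSE}_i + e_i$$
$$\text{Difference} = \beta_1 + \beta_3 \mathrm{NURSE}_i$$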
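Writing out the subtraction term by term shows where the difference comes from. This short derivation compares expected values, so the error terms drop out:

$$\begin{aligned}
&\left[\beta_0 + \beta_1 (\mathrm{INFRISK}_i + 1) + \beta_2 \mathrm{NURSE}_i + \beta_3 (\mathrm{INFRISK}_i + 1)\,\mathrm{NURSE}_i\right] \\
&\quad - \left[\beta_0 + \beta_1 \mathrm{INFRISK}_i + \beta_2 \mathrm{NURSE}_i + \beta_3 \mathrm{INFRISK}_i\,\mathrm{NURSE}_i\right] \\
&= \beta_1 + \beta_3 \mathrm{NURSE}_i
\end{aligned}$$

By the same argument, the expected difference for a 1-unit difference in NURSE at fixed INFRISK is $\beta_2 + \beta_3 \mathrm{INFRISK}_i$, so neither main-effect coefficient has a standalone interpretation once the interaction is in the model.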
Interaction with continuous variables
> reg4 <- lm(logLOS ~ INFRISK*NURSE, data=data)
> summary(reg4)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.067e+00  6.642e-02  31.120  < 2e-16 ***
INFRISK        3.164e-02  1.586e-02   1.995  0.04853 *
NURSE         -1.025e-03  4.657e-04  -2.201  0.02986 *
INFRISK:NURSE  2.696e-04  9.727e-05   2.771  0.00657 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1427 on 109 degrees of freedom
Multiple R-Squared: 0.3739,     Adjusted R-squared: 0.3567
F-statistic: 21.7 on 3 and 109 DF, p-value: 4.284e-11
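As a worked example with these estimates, the expected change in logLOS for a 1-unit increase in INFRISK, $\hat\beta_1 + \hat\beta_3 \mathrm{NURSE}$, can be evaluated at two nurse counts:

$$\mathrm{NURSE} = 100:\quad 0.03164 + 0.0002696 \times 100 = 0.0586$$
$$\mathrm{NURSE} = 300:\quad 0.03164 + 0.0002696 \times 300 = 0.1125$$

The estimated INFRISK slope at 300 nurses is therefore roughly twice the slope at 100 nurses.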
[Figure: data$logLOS (y-axis) against data$INFRISK (x-axis), with fitted lines labeled NURSE=300 and NURSE=100.]
Interaction interpretation
[Figure: change in logLOS for a 1% change in INFRISK (y-axis) as a function of NURSE (x-axis, 0 to 600).]
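The quantity plotted here is the INFRISK slope, beta1 + beta3*NURSE, as a function of NURSE. The sketch below shows one way such a curve and its standard error might be computed from a fitted model. The data are simulated stand-ins (not the hospital data), and the names `sim`, `nurse_grid` and `slope_table` are hypothetical.

```r
set.seed(701)  # arbitrary seed, only for reproducibility

# Simulated stand-in data (NOT the lecture data).
n <- 150
sim <- data.frame(
  INFRISK = runif(n, 1, 8),
  NURSE   = runif(n, 20, 600)
)
sim$logLOS <- 2.0 + 0.03 * sim$INFRISK - 0.001 * sim$NURSE +
  0.0003 * sim$INFRISK * sim$NURSE + rnorm(n, sd = 0.15)

fit <- lm(logLOS ~ INFRISK * NURSE, data = sim)
b <- coef(fit)
V <- vcov(fit)

# Slope of INFRISK at a given NURSE value: b1 + b3 * NURSE, with
# Var = Var(b1) + NURSE^2 * Var(b3) + 2 * NURSE * Cov(b1, b3).
nurse_grid <- seq(0, 600, by = 100)
slope <- unname(b["INFRISK"] + b["INFRISK:NURSE"] * nurse_grid)
slope_se <- sqrt(V["INFRISK", "INFRISK"] +
                 nurse_grid^2 * V["INFRISK:NURSE", "INFRISK:NURSE"] +
                 2 * nurse_grid * V["INFRISK", "INFRISK:NURSE"])

slope_table <- data.frame(NURSE = nurse_grid, slope = slope, se = slope_se)
print(slope_table)
```

At NURSE = 0 the row reproduces the INFRISK main-effect estimate and its standard error from `summary(fit)`, which is a reminder that the main effect describes hospitals with zero nurses.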
Interactions between categorical variables
• Simple with two binary variables.
• More complicated to keep track of when one or more of the variables has more than two categories.
• Example: REGION and MEDSCHOOL
• Question: Is there an interaction between REGION and MEDSCHOOL with regard to logLOS?
• That is: does the association between MEDSCHOOL and logLOS differ by REGION?
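With a four-level factor the interaction uses three coefficients, so the question is naturally a joint one. A common approach, sketched here on simulated stand-in data rather than the hospital data, is a partial F-test comparing the additive model with the interaction model. The names `sim`, `fit_add` and `fit_int` are hypothetical.

```r
set.seed(701)  # arbitrary seed, only for reproducibility

# Simulated stand-in data (NOT the lecture data): 4 regions, binary ms.
n <- 200
sim <- data.frame(
  REGION = sample(1:4, n, replace = TRUE),
  ms     = rbinom(n, 1, 0.3)
)
# Outcome generated with region and ms main effects but no interaction.
region_shift <- c(0, -0.1, -0.15, -0.3)
sim$logLOS <- 2.3 + region_shift[sim$REGION] + 0.1 * sim$ms +
  rnorm(n, sd = 0.15)

# Additive model vs. model with the REGION x ms interaction.
fit_add <- lm(logLOS ~ factor(REGION) + ms, data = sim)
fit_int <- lm(logLOS ~ factor(REGION) * ms, data = sim)

# Partial F-test of all three interaction coefficients at once.
ftest <- anova(fit_add, fit_int)
print(ftest)
```

The single F-test on 3 degrees of freedom addresses the overall question directly, whereas the three individual t-tests each compare one region with the reference region only.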
Interpreting coefficients
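$$\log \mathrm{LOS}_i = \beta_0 + \beta_1 \mathrm{MS}_i + \beta_2 I(R_i = 2) + \beta_3 I(R_i = 3) + \beta_4 I(R_i = 4) + \beta_5 \mathrm{MS}_i I(R_i = 2) + \beta_6 \mathrm{MS}_i I(R_i = 3) + \beta_7 \mathrm{MS}_i I(R_i = 4) + e_i$$
$$E[\log \mathrm{LOS}_i \mid R_i = 1, \mathrm{MS}_i = 0] = \beta_0$$
$$E[\log \mathrm{LOS}_i \mid R_i = 1, \mathrm{MS}_i = 1] = \beta_0 + \beta_1$$
$$E[\log \mathrm{LOS}_i \mid R_i = 2, \mathrm{MS}_i = 0] = \beta_0 + \beta_2$$
$$E[\log \mathrm{LOS}_i \mid R_i = 2, \mathrm{MS}_i = 1] = \beta_0 + \beta_1 + \beta_2 + \beta_5$$
$$E[\log \mathrm{LOS}_i \mid R_i = 3, \mathrm{MS}_i = 0] = \beta_0 + \beta_3$$
$$E[\log \mathrm{LOS}_i \mid R_i = 3, \mathrm{MS}_i = 1] = \beta_0 + \beta_1 + \beta_3 + \beta_6$$
$$E[\log \mathrm{LOS}_i \mid R_i = 4, \mathrm{MS}_i = 0] = \beta_0 + \beta_4$$
$$E[\log \mathrm{LOS}_i \mid R_i = 4, \mathrm{MS}_i = 1] = \beta_0 + \beta_1 + \beta_4 + \beta_7$$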
Regression Results
> reg5 <- lm(logLOS ~ factor(REGION)*ms, data=data)
>
> summary(reg5)
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         2.359927   0.030715  76.833  < 2e-16 ***
factor(REGION)2    -0.122926   0.042560  -2.888   0.0047 **
factor(REGION)3    -0.163065   0.039769  -4.100 8.16e-05 ***
factor(REGION)4    -0.299316   0.049933  -5.994 2.92e-08 ***
ms                  0.125486   0.072685   1.726   0.0872 .
factor(REGION)2:ms -0.007176   0.096181  -0.075   0.9407
factor(REGION)3:ms  0.033145   0.114691   0.289   0.7732
factor(REGION)4:ms  0.082734   0.132974   0.622   0.5352
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1473 on 105 degrees of freedom
Multiple R-Squared: 0.3578,     Adjusted R-squared: 0.3149
F-statistic: 8.356 on 7 and 105 DF, p-value: 4.356e-08
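To connect these estimates to the eight conditional expectations, the fitted mean logLOS in each REGION by ms cell can be assembled by hand. The sketch below hard-codes the estimates printed in the summary(reg5) output; the vector and column names (`b`, `cell_means`, `ms_effect`) are hypothetical labels chosen for this illustration.

```r
# Coefficient estimates copied from the summary(reg5) output.
b <- c(intercept = 2.359927,
       region2 = -0.122926, region3 = -0.163065, region4 = -0.299316,
       ms = 0.125486,
       region2_ms = -0.007176, region3_ms = 0.033145, region4_ms = 0.082734)

# Region 1 is the reference category, so its terms are 0.
region_main <- c(0, b["region2"], b["region3"], b["region4"])
region_int  <- c(0, b["region2_ms"], b["region3_ms"], b["region4_ms"])

# Fitted mean logLOS in each REGION x ms cell.
mean_ms0 <- b["intercept"] + region_main
mean_ms1 <- b["intercept"] + b["ms"] + region_main + region_int

cell_means <- data.frame(
  REGION    = 1:4,
  ms0       = unname(mean_ms0),
  ms1       = unname(mean_ms1),
  ms_effect = unname(mean_ms1 - mean_ms0)  # region-specific MS difference
)
print(cell_means)
```

The `ms_effect` column is the estimated Med School difference within each region: the `ms` coefficient alone for region 1, and the `ms` coefficient plus the matching interaction coefficient for regions 2 to 4.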
Association between MS and REGION
> table(data$REGION, data$ms)
     0  1    (% with ms = 1)
  1 23  5  = 18%
  2 25  7  = 22%
  3 34  3  =  8%
  4 14  2  = 13%
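The percentages are the share of hospitals in each region with ms = 1. A minimal sketch of that calculation, with the counts typed in from the table (the object names `counts` and `pct_ms1` are hypothetical):

```r
# Counts copied from the table: rows are REGION 1-4, columns are ms = 0, 1.
counts <- matrix(c(23, 25, 34, 14,
                    5,  7,  3,  2),
                 nrow = 4,
                 dimnames = list(REGION = c("1", "2", "3", "4"),
                                 ms = c("0", "1")))

# Row proportions: within each region, the share of hospitals by ms status.
row_prop <- prop.table(counts, margin = 1)
pct_ms1 <- 100 * row_prop[, "1"]
print(pct_ms1)
```

Several cells hold only a handful of hospitals (as few as 2), which is likely one reason the interaction estimates in reg5 have large standard errors.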