Lecture 8:
Multiple Linear Regression
Interpretation with different types of predictors
BMTRY 701
Biostatistical Methods II
Interaction
AKA effect modification
Allows there to be a different association
between two variables for differing levels of a
third variable.
Example: In the model with length of stay as an
outcome, is there an interaction between
medschool and nurse?
Note that ‘adjustment’ is a rather weak form of
accounting for a variable.
Including an interaction gives the model much greater
flexibility.
Interactions
Interactions can be formed between
• two continuous variables
• a binary and a continuous variable
• two binary variables
• a binary variable and a categorical variable with >2 categories
• Etc.
Three-way interaction: interaction between 3
variables
Four-way, etc.
Example: log(LOS) ~ INFRISK*MS
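$$\log(\text{LOS}_i) = \beta_0 + \beta_1 \text{INFRISK}_i + \beta_2 \text{MS}_i + \beta_3 \text{MS}_i \times \text{INFRISK}_i + e_i$$
$$E[\log(\text{LOS}_i) \mid \text{MS}_i = 0] = \beta_0 + \beta_1 \text{INFRISK}_i$$
$$E[\log(\text{LOS}_i) \mid \text{MS}_i = 1] = \beta_0 + \beta_1 \text{INFRISK}_i + \beta_2 + \beta_3 \text{INFRISK}_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \text{INFRISK}_i$$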
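The two conditional means above say that the interaction model fits a separate intercept and slope for each MS group. A minimal sketch of that idea, using simulated data rather than the hospital data (the data frame sim, the true coefficient values, and the object names are all assumptions made for illustration):

```r
# Simulated data only: variable names mirror the slides, the numbers are made up.
set.seed(701)
n <- 200
INFRISK <- runif(n, 1, 8)
MS <- rbinom(n, 1, 0.3)
logLOS <- 2 + 0.06 * INFRISK + 0.05 * MS + 0.03 * MS * INFRISK + rnorm(n, sd = 0.15)
sim <- data.frame(logLOS, INFRISK, MS)

# One interaction model, two implied lines
fit_int <- lm(logLOS ~ INFRISK * MS, data = sim)
b <- unname(coef(fit_int))  # order: intercept, INFRISK, MS, INFRISK:MS

line_ms0 <- c(intercept = b[1],        slope = b[2])         # beta0, beta1
line_ms1 <- c(intercept = b[1] + b[3], slope = b[2] + b[4])  # beta0+beta2, beta1+beta3

# Separate simple regressions within each MS group, for comparison
sep0 <- unname(coef(lm(logLOS ~ INFRISK, data = sim, subset = MS == 0)))
sep1 <- unname(coef(lm(logLOS ~ INFRISK, data = sim, subset = MS == 1)))

print(rbind(line_ms0, sep0))
print(rbind(line_ms1, sep1))
```

With a binary variable fully interacted with the continuous one, the implied lines should match the group-specific regressions; only the standard errors differ, because the interaction model pools the residual variance.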
How does this differ from the model without the
interaction? Without the adjustment?
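Model 1:
$$\log(\text{LOS}_i) = \beta_0 + \beta_1 \text{INFRISK}_i + e_i$$
Model 2:
$$\log(\text{LOS}_i) = \beta_0 + \beta_1 \text{INFRISK}_i + \beta_2 \text{MS}_i + e_i$$
Model 3:
$$\log(\text{LOS}_i) = \beta_0 + \beta_1 \text{INFRISK}_i + \beta_2 \text{MS}_i + \beta_3 \text{MS}_i \times \text{INFRISK}_i + e_i$$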
Model 1
> # the y-axis shows logLOS, so it is labeled on the log scale
> plot(data$INFRISK, data$logLOS, xlab="Infection Risk, %",
+      ylab="log(Length of Stay in days)", pch=16, cex=1.5)
>
> # Model 1:
> reg1 <- lm(logLOS ~ INFRISK, data=data)
> abline(reg1, lwd=2)
> summary(reg1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.93250    0.04794  40.310  < 2e-16 ***
INFRISK      0.07293    0.01053   6.929 2.92e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1494 on 111 degrees of freedom
Multiple R-Squared: 0.302,     Adjusted R-squared: 0.2957
F-statistic: 48.02 on 1 and 111 DF, p-value: 2.918e-10
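Model 1
[Figure: scatterplot of logLOS (y-axis, labeled "Length of Stay, days" on the slide, ticks 2.0 to 3.0) against Infection Risk, % (x-axis, ticks 2 to 8), with the fitted line from reg1.]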
Model 2
> # Model 2:
> reg2 <- lm(logLOS ~ INFRISK + ms, data=data)
> infriski <- seq(1,8,0.1)
> beta <- reg2$coefficients
> # fitted lines for ms = 0 and ms = 1: same slope, intercepts differ by beta[3]
> yhat0 <- beta[1] + beta[2]*infriski
> yhat1 <- beta[1] + beta[2]*infriski + beta[3]
> lines(infriski, yhat0, lwd=2, col=2)
> lines(infriski, yhat1, lwd=2, col=2)
> summary(reg2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.94449    0.04709  41.295  < 2e-16 ***
INFRISK      0.06677    0.01058   6.313 5.91e-09 ***
ms           0.09882    0.03949   2.503   0.0138 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1459 on 110 degrees of freedom
Multiple R-Squared: 0.3396,     Adjusted R-squared: 0.3276
F-statistic: 28.28 on 2 and 110 DF, p-value: 1.232e-10
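Model 2
[Figure: the same scatterplot of logLOS (y-axis, labeled "Length of Stay, days" on the slide, ticks 2.0 to 3.0) against Infection Risk, % (x-axis, ticks 2 to 8), with the two parallel fitted lines from reg2 (ms = 0 and ms = 1) added.]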
Model 3
> # Model 3:
> reg3 <- lm(logLOS ~ INFRISK + ms + ms:INFRISK, data=data)
> infriski <- seq(1,8,0.1)
> beta <- reg3$coefficients
> # fitted lines for ms = 0 and ms = 1: both intercept and slope may differ
> yhat0 <- beta[1] + beta[2]*infriski
> yhat1 <- beta[1] + beta[3] + (beta[2]+beta[4])*infriski
> lines(infriski, yhat0, lwd=2, col=4)
> lines(infriski, yhat1, lwd=2, col=4)
> summary(reg3)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.947942   0.049698  39.195  < 2e-16 ***
INFRISK     0.065950   0.011220   5.878  4.6e-08 ***
ms          0.059514   0.178622   0.333    0.740
INFRISK:ms  0.007856   0.034807   0.226    0.822
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1466 on 109 degrees of freedom
Multiple R-Squared: 0.3399,     Adjusted R-squared: 0.3217
F-statistic: 18.71 on 3 and 109 DF, p-value: 7.35e-10
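The t test on the INFRISK:ms row is the test of whether Model 3 improves on Model 2. A minimal sketch of that equivalence on simulated data (not the hospital data; sim, fit_add, fit_int and the simulated coefficient values are illustrative assumptions): when the models differ by a single term, the nested-model F statistic from anova() should equal the square of that term's t statistic.

```r
# Simulated data only: names mirror the slides, the numbers are made up.
set.seed(8)
n <- 113
INFRISK <- runif(n, 1, 8)
ms <- rbinom(n, 1, 0.15)
logLOS <- 1.95 + 0.07 * INFRISK + 0.10 * ms + rnorm(n, sd = 0.15)
sim <- data.frame(logLOS, INFRISK, ms)

fit_add <- lm(logLOS ~ INFRISK + ms, data = sim)               # like Model 2
fit_int <- lm(logLOS ~ INFRISK + ms + ms:INFRISK, data = sim)  # like Model 3

# t statistic for the interaction (4th coefficient) and the nested-model F statistic
t_inter <- summary(fit_int)$coefficients[4, "t value"]
f_inter <- anova(fit_add, fit_int)$F[2]

print(c(t_squared = t_inter^2, F = f_inter))
```

For interactions that add more than one coefficient (for example with a categorical variable with >2 categories), no single t test covers the whole interaction, so a joint comparison like the one from anova() is the usual route.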
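Model 3
[Figure: the same scatterplot of logLOS (y-axis, labeled "Length of Stay, days" on the slide, ticks 2.0 to 3.0) against Infection Risk, % (x-axis, ticks 2 to 8), with the two fitted lines from reg3 (ms = 0 and ms = 1) added.]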
Conclusions
There does not appear to be an interaction
between MEDSCHOOL and INFRISK (interaction term p = 0.82 in Model 3).
Both MEDSCHOOL and INFRISK are associated
with log(LOS), in the presence of each other.
The association between INFRISK and log(LOS)
is positive: for a 1 percentage point increase in infection risk,
logLOS is expected to increase by 0.07,
adjusting for Med School affiliation.
Hospitals with Med School affiliation tend to
have longer average length of stay, adjusting for
infection risk.
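Because the outcome is on the log scale, the Model 2 coefficients can be read as multiplicative effects on length of stay. A worked sketch, assuming logLOS is the natural log of LOS (the slides do not state the base):

```latex
% Model 2 estimates: INFRISK 0.06677, ms 0.09882
\begin{align*}
e^{0.06677} &\approx 1.069 && \text{about 7\% longer expected LOS per 1-point increase in INFRISK, adjusting for MS} \\
e^{0.09882} &\approx 1.104 && \text{about 10\% longer expected LOS for MS hospitals, adjusting for INFRISK}
\end{align*}
```

Strictly, these ratios describe the geometric mean of LOS, since the model is for the mean of log(LOS).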
Interactions with continuous variables
How to interpret with continuous variables?
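$$\log(\text{LOS}_i) = \beta_0 + \beta_1 \text{INFRISK}_i + \beta_2 \text{NURSE}_i + \beta_3 \text{INFRISK}_i \times \text{NURSE}_i + e_i$$
Example: Difference between two hospitals with
a 1% difference in INFRISK and the same NURSE
$$\log(\text{LOS}_i) = \beta_0 + \beta_1 \text{INFRISK}_i + \beta_2 \text{NURSE}_i + \beta_3 \text{INFRISK}_i \times \text{NURSE}_i + e_i$$
$$\log(\text{LOS}_i) = \beta_0 + \beta_1 (\text{INFRISK}_i + 1) + \beta_2 \text{NURSE}_i + \beta_3 (\text{INFRISK}_i + 1) \times \text{NURSE}_i + e_i$$
$$\text{Difference} = \beta_1 + \beta_3 \text{NURSE}_i$$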
Interaction with continuous variables
> reg4 <- lm(logLOS ~ INFRISK*NURSE, data=data)
> summary(reg4)
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.067e+00  6.642e-02  31.120  < 2e-16 ***
INFRISK       3.164e-02  1.586e-02   1.995  0.04853 *
NURSE        -1.025e-03  4.657e-04  -2.201  0.02986 *
INFRISK:NURSE 2.696e-04  9.727e-05   2.771  0.00657 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1427 on 109 degrees of freedom
Multiple R-Squared: 0.3739,     Adjusted R-squared: 0.3567
F-statistic: 21.7 on 3 and 109 DF, p-value: 4.284e-11
[Figure: data$logLOS (y-axis, ticks 2.0 to 3.0) against data$INFRISK (x-axis, ticks 2 to 8), with lines labeled NURSE=300 and NURSE=100.]
Interaction interpretation
[Figure: change in logLOS for a 1% change in INFRISK (y-axis, ticks 0.04 to 0.16) plotted against NURSE (x-axis, ticks 0 to 600).]
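The quantity being interpreted here is the INFRISK slope, beta1 + beta3*NURSE, evaluated at different values of NURSE. A minimal sketch using the estimates reported in the summary(reg4) output (the function name slope_infrisk and the choice of NURSE values are illustrative assumptions):

```r
# Coefficients copied from the summary(reg4) output in the slides
b_infrisk <- 3.164e-02   # INFRISK main effect (beta1)
b_inter   <- 2.696e-04   # INFRISK:NURSE interaction (beta3)

# Change in expected logLOS for a 1-unit increase in INFRISK at a given NURSE
slope_infrisk <- function(nurse) b_infrisk + b_inter * nurse

nurse_values <- c(100, 300)
print(data.frame(NURSE = nurse_values, slope = slope_infrisk(nurse_values)))
```

At NURSE = 100 the slope is about 0.059, and at NURSE = 300 it is about 0.113, so under this model the association between infection risk and logLOS is roughly twice as steep at the larger nurse count.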
Interactions between categorical variables
Simple with two binary variables
More complicated to keep track of when there are more
than two categories in one or more variables.
Example: REGION and MEDSCHOOL
Question: Is there an interaction between
REGION and MEDSCHOOL with regard to
logLOS?
That is: does the association between
MEDSCHOOL and logLOS differ by REGION?
Interpreting coefficients
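$$\log(\text{LOS}_i) = \beta_0 + \beta_1 \text{MS}_i + \beta_2 I(R_i = 2) + \beta_3 I(R_i = 3) + \beta_4 I(R_i = 4) + \beta_5 \text{MS}_i I(R_i = 2) + \beta_6 \text{MS}_i I(R_i = 3) + \beta_7 \text{MS}_i I(R_i = 4) + e_i$$
$$E[\log(\text{LOS}_i) \mid R_i = 1, \text{MS}_i = 0] = \beta_0$$
$$E[\log(\text{LOS}_i) \mid R_i = 1, \text{MS}_i = 1] = \beta_0 + \beta_1$$
$$E[\log(\text{LOS}_i) \mid R_i = 2, \text{MS}_i = 0] = \beta_0 + \beta_2$$
$$E[\log(\text{LOS}_i) \mid R_i = 2, \text{MS}_i = 1] = \beta_0 + \beta_1 + \beta_2 + \beta_5$$
$$E[\log(\text{LOS}_i) \mid R_i = 3, \text{MS}_i = 0] = \beta_0 + \beta_3$$
$$E[\log(\text{LOS}_i) \mid R_i = 3, \text{MS}_i = 1] = \beta_0 + \beta_1 + \beta_3 + \beta_6$$
$$E[\log(\text{LOS}_i) \mid R_i = 4, \text{MS}_i = 0] = \beta_0 + \beta_4$$
$$E[\log(\text{LOS}_i) \mid R_i = 4, \text{MS}_i = 1] = \beta_0 + \beta_1 + \beta_4 + \beta_7$$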
Regression Results
> reg5 <- lm(logLOS ~ factor(REGION)*ms, data=data)
>
> summary(reg5)
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         2.359927   0.030715  76.833  < 2e-16 ***
factor(REGION)2    -0.122926   0.042560  -2.888   0.0047 **
factor(REGION)3    -0.163065   0.039769  -4.100 8.16e-05 ***
factor(REGION)4    -0.299316   0.049933  -5.994 2.92e-08 ***
ms                  0.125486   0.072685   1.726   0.0872 .
factor(REGION)2:ms -0.007176   0.096181  -0.075   0.9407
factor(REGION)3:ms  0.033145   0.114691   0.289   0.7732
factor(REGION)4:ms  0.082734   0.132974   0.622   0.5352
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1473 on 105 degrees of freedom
Multiple R-Squared: 0.3578,     Adjusted R-squared: 0.3149
F-statistic: 8.356 on 7 and 105 DF, p-value: 4.356e-08
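The eight expected values in the "Interpreting coefficients" slide can be filled in from these estimates. A minimal sketch using the coefficients reported in summary(reg5) (the object names b0, b_ms, b_reg, b_int, cell_means and ms_effect are illustrative assumptions):

```r
# Coefficients copied from the summary(reg5) output in the slides
b0    <- 2.359927                               # intercept: Region 1, ms = 0
b_ms  <- 0.125486                               # ms effect in Region 1
b_reg <- c(0, -0.122926, -0.163065, -0.299316)  # region effects (Region 1 is the reference)
b_int <- c(0, -0.007176,  0.033145,  0.082734)  # REGION:ms terms (0 for Region 1)

# Expected logLOS in each REGION x ms cell
cell_means <- cbind(ms0 = b0 + b_reg,
                    ms1 = b0 + b_reg + b_ms + b_int)
rownames(cell_means) <- paste("Region", 1:4)

# The ms effect within each region is ms1 - ms0 = b_ms + b_int
ms_effect <- cell_means[, "ms1"] - cell_means[, "ms0"]
print(round(cbind(cell_means, ms_effect), 3))
```

The estimated ms effect ranges from about 0.118 (Region 2) to about 0.208 (Region 4), but none of the interaction terms is close to significant, so these differences between regions are not well supported by the data.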
Association between MS and REGION
> table(data$REGION, data$ms)

     0  1
  1 23  5   (5/28 = 18% with ms = 1)
  2 25  7   (7/32 = 22% with ms = 1)
  3 34  3   (3/37 = 8% with ms = 1)
  4 14  2   (2/16 = 13% with ms = 1)
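The percentages are the share of hospitals in each region with ms = 1. A minimal sketch that recomputes them from the counts in the table (the matrix counts is typed in by hand here as an assumption, since the original data object is not available):

```r
# Counts copied from the table(data$REGION, data$ms) output in the slides
counts <- matrix(c(23, 5,
                   25, 7,
                   34, 3,
                   14, 2),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(REGION = as.character(1:4), ms = c("0", "1")))

# Row proportions: within each region, the percent of hospitals with ms = 1
pct_ms <- round(100 * prop.table(counts, margin = 1)[, "1"], 1)
print(pct_ms)

# Total number of hospitals
print(sum(counts))
```

One decimal place is kept because 2/16 is exactly 12.5%, which R's round() to a whole number would report as 12 rather than the 13% shown on the slide.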