Lecture 21:
Poisson regression
log-linear regression
BMTRY 701
Biostatistical Methods II
Poisson distribution
Used for count data
generally, rare events
in space or time
upper limit is theoretically infinite
Examples:
• earthquakes, hurricanes
• cancer incidence (spatial)
• absences in school year
• AIDS deaths in a region
Assessing disease in different groups:
• Probability, Risk, Rate, Incidence, Prevalence
The Poisson distribution
Probability mass function
$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots$$
Approximates a binomial for rare events
Notice it has only ONE parameter: λ
Mean = variance = λ
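The two properties above can be checked numerically. The following is a minimal sketch in R; the values of `n` and `p` are hypothetical, chosen only so that the event is rare, and the Poisson mean is set to `n * p`.

```r
# Hypothetical values for illustration: many trials, small event probability
n <- 1000
p <- 0.003
lambda <- n * p            # Poisson mean matched to the binomial mean

k <- 0:10
binom_pmf <- dbinom(k, size = n, prob = p)
pois_pmf  <- dpois(k, lambda = lambda)

# Largest pointwise gap between the two probability mass functions
max_gap <- max(abs(binom_pmf - pois_pmf))
print(round(cbind(k, binom_pmf, pois_pmf), 4))
print(max_gap)

# Mean and variance computed directly from the Poisson pmf
# (the sum is truncated far out in the tail)
kk <- 0:100
pois_mean <- sum(kk * dpois(kk, lambda))
pois_var  <- sum((kk - pois_mean)^2 * dpois(kk, lambda))
print(c(mean = pois_mean, variance = pois_var))
```

Both the mean and the variance come out equal to `lambda`, which is why a single parameter describes the whole distribution.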
Simple Poisson distribution example
The infection rate at a Neonatal Intensive Care Unit (NICU) is
typically expressed as a number of infections per patient days. This
is obviously counting a number of events across both time and
patients.
• Assume that the probability of getting an infection over a short
time period is proportional to the length of the time period. In
other words, a patient who stays one hour in the NICU has twice the
risk of a single infection as a patient who stays 30 minutes.
• Assume that for a small enough interval, the probability of
getting two infections is negligible.
• Assume that the probability of infection does not change over
time or over infants.
• Assume independence.
  • The probability of seeing an infection in one child does not increase or
  decrease the probability of seeing an infection in another child.
  • If an infant gets an infection during one time interval, it doesn't change
  the probability that he or she will get another infection during a later time
  interval.
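Under these assumptions, the number of infections during a stay of length t is Poisson with mean (rate × t). A small sketch in R illustrates the first two assumptions; the infection rate used here is hypothetical and is not taken from any NICU data.

```r
# Hypothetical infection rate, for illustration only:
# 5 infections per 1000 patient-days
rate_per_day <- 5 / 1000

# Lengths of stay in days: 30 minutes and 1 hour
t_half_hour <- 0.5 / 24
t_one_hour  <- 1 / 24

# Under the Poisson assumptions the count over time t has mean rate * t
p_any <- function(t) 1 - dpois(0, lambda = rate_per_day * t)   # P(at least one)
p_two_plus <- function(t) ppois(1, lambda = rate_per_day * t,
                                lower.tail = FALSE)            # P(two or more)

risk_ratio <- p_any(t_one_hour) / p_any(t_half_hour)
print(risk_ratio)                 # close to 2 for short intervals
print(p_two_plus(t_one_hour))     # negligible for short intervals
```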
Poisson regression
Based on the idea that the log of probability of
disease is a linear function of risk factors
The rate ratio (“relative risk”) is modeled
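$$\log(r) = \beta_0 + \beta_1 \,\mathrm{Group}$$
Interpretation of slope (r_1 and r_0 are the rates for Group = 1 and Group = 0):
$$\log(r_1) - \log(r_0) = (\beta_0 + \beta_1 \cdot 1) - (\beta_0 + \beta_1 \cdot 0)$$
$$\log\left(\frac{r_1}{r_0}\right) = \beta_1$$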
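The slope interpretation above can be illustrated with a small sketch in R. The counts are toy values invented for illustration, and every observation is assumed to have the same exposure, so no offset is needed yet.

```r
# Toy counts, invented for illustration; equal exposure for every observation
toy <- data.frame(
  count = c(2, 3, 4, 6, 5, 7),
  group = c(0, 0, 0, 1, 1, 1)
)

fit <- glm(count ~ group, family = poisson, data = toy)

# exp(slope) is the estimated rate ratio r1 / r0
rate_ratio_model <- unname(exp(coef(fit)["group"]))

# With a single binary covariate this matches the ratio of the group means
rate_ratio_raw <- mean(toy$count[toy$group == 1]) /
  mean(toy$count[toy$group == 0])

print(c(model = rate_ratio_model, raw = rate_ratio_raw))
```

Exponentiating the slope moves from the log scale back to a ratio of rates, which is the quantity usually reported.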
Implementation
r_i is the rate
Often we observe
• a number of events
• a geographic region, time, or number of person-years
Need to account for these differences
• rates based on smaller “exposure” are less precise
• adjustment is made
Implementation
Unless there is uniform time, space, etc., the
following is generally implemented:
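$$\log(r_i) = \beta_0 + \beta_1 \,\mathrm{Group}_i$$
$$\log\left(\frac{\mathrm{cases}_i}{\mathrm{total}_i}\right) = \beta_0 + \beta_1 \,\mathrm{Group}_i$$
$$\log(\mathrm{cases}_i) - \log(\mathrm{total}_i) = \beta_0 + \beta_1 \,\mathrm{Group}_i$$
$$\log(\mathrm{cases}_i) = \beta_0 + \beta_1 \,\mathrm{Group}_i + \log(\mathrm{total}_i)$$
The final term, log(total_i), is the “OFFSET”.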
Offset term
Notice: NO COEFFICIENT on offset
Adjusts for population size or space
Example: breast cancer incidence per county in
South Carolina
• cases are the number of women (& men) diagnosed
within a county in SC in one year.
• the offset would be the population size in the county
in the year (probably estimated)
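A minimal sketch of an offset in R follows. The county-level counts and population sizes are invented for illustration; only the `offset = log(...)` usage mirrors the code shown later in these slides.

```r
# Hypothetical counties: cases and population sizes invented for illustration
toy <- data.frame(
  cases = c(12, 30, 7, 45, 20, 9),
  total = c(10000, 28000, 5000, 25000, 12000, 4000),
  group = c(0, 0, 0, 1, 1, 1)
)

# log(total) enters the linear predictor with its coefficient fixed at 1
fit <- glm(cases ~ group, family = poisson, offset = log(total), data = toy)

# Estimated rates per 100,000 in each group
rate0 <- unname(exp(coef(fit)["(Intercept)"])) * 1e5
rate1 <- unname(exp(sum(coef(fit)))) * 1e5

# For this two-group model the fitted rates equal pooled cases / pooled population
pooled0 <- with(toy[toy$group == 0, ], sum(cases) / sum(total)) * 1e5
pooled1 <- with(toy[toy$group == 1, ], sum(cases) / sum(total)) * 1e5

print(rbind(model = c(rate0, rate1), pooled = c(pooled0, pooled1)))
```

Because the offset has no coefficient, counties with larger populations contribute more cases at the same underlying rate, rather than appearing to have a higher rate.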
Caveat
Standard Poisson regression relies on the Poisson
assumption about the variance (variance = mean)
If events tend to occur in clusters, then there is
“overdispersion”
This leads to a more general form of model: log-linear model (later)
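One common informal check for overdispersion is the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below uses simulated clustered counts (all values are assumptions). It also fits a quasi-Poisson model as one possible remedy, which is not necessarily the model this lecture goes on to use.

```r
# Simulated clustered counts (negative binomial), purely for illustration
set.seed(701)
n <- 200
group <- rep(c(0, 1), each = n / 2)
mu <- exp(1 + 0.5 * group)
y <- rnbinom(n, mu = mu, size = 1.5)   # variance = mu + mu^2 / size > mu

fit_pois  <- glm(y ~ group, family = poisson)
fit_quasi <- glm(y ~ group, family = quasipoisson)

# Pearson chi-square / residual df estimates the dispersion
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)
print(dispersion)        # values well above 1 suggest overdispersion

# Quasi-Poisson keeps the same coefficients but scales the
# standard errors by sqrt(dispersion)
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
print(se_quasi / se_pois)
```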
Example: Catheter-Related Bloodstream Infections in the
ICU (Critical Care Medicine, 2004).
Objective: To determine whether a multifaceted systems
intervention would eliminate catheter-related
bloodstream infections (CR-BSIs)
Design: prospective cohort in surgical ICU at JHU
including all patients with central venous catheter in ICU.
Two ICUs
Interventions:
• educating staff
• creating catheter insertion cart
• asking providers daily if catheters could be removed
• implementing checklist to ensure adherence to guidelines
• empowering nurses to stop catheter insertion if violation of
guidelines was observed.
Example: Catheter-Related Bloodstream Infections in the
ICU (Critical Care Medicine, 2004).
Analysis
• Poisson regression
• Outcome is rate of CR-BSIs
• Data structure
  • number of infections per quarter in ICU
  • number of catheter days (counting every patient who has a
  catheter at 12am each day). Patients each counted only
  once
  • indicator of control vs. intervention ICU
• Intervention not implemented until 1st quarter 1999.
Dataset
. list

     +---------------------------------------------------------+
     | quarter   ncase   cathdays    rate   dataset   quartern |
     |---------------------------------------------------------|
  1. | Qtr1-98       6       1057    5.68         1          1 |
  2. | Qtr2-98       4       1018    3.93         1          2 |
  3. | Qtr3-98      10        899   11.12         1          3 |
  4. | Qtr4-98       8        952     8.4         1          4 |
  5. | Qtr1-99       3        952    3.15         1          5 |
     |---------------------------------------------------------|
  6. | Qtr2-99      10        939   10.65         1          6 |
  7. | Qtr3-99       5       1045    4.78         1          7 |
  8. | Qtr4-99       9        927    9.71         1          8 |
  9. | Qtr1-00       7       1060     6.6         1          9 |
 10. | Qtr2-00       7       1094     6.4         1         10 |
     |---------------------------------------------------------|
 11. | Qtr3-00       5        850    5.88         1         11 |
 12. | Qtr4-00      10        822   12.17         1         12 |
 13. | Qtr1-01      11        868   12.67         1         13 |
 14. | Qtr2-01       4        830    4.82         1         14 |
 15. | Qtr3-01       4        603    6.63         1         15 |
     |---------------------------------------------------------|
 16. | Qtr4-01       5        551    9.07         1         16 |
[Figure: Observed Data. Rate of Infection per 1000 catheter days (y-axis) by Quarter (x-axis), Intervention ICU vs. Control ICU]
R code
data <- read.csv("csicu7.csv")
# plot all observed rates (black), then overplot the control ICU
# (dataset==1) in red
plot(data$quartern, data$rate, xlab="Quarter",
     ylab="Rate of Infection per 1000 catheter days", pch=16)
points(data$quartern[data$dataset==1],
       data$rate[data$dataset==1], pch=16, col=2)
# connect the observed rates within each ICU
lines(data$quartern[data$dataset==0],
      data$rate[data$dataset==0], col=1)
lines(data$quartern[data$dataset==1],
      data$rate[data$dataset==1], col=2)
legend(12,22, c("Intervention ICU","Control ICU"),
       col=c(1,2), pch=c(16,16))
# dotted vertical line at quarter 5 (1st quarter 1999, start of intervention)
abline(v=5, lty=3)
Estimating the Poisson regression
Want to model change in rates
However, in the first 4 quarters there was no
intervention.
Based on the observed data and on the data
structure, what model is appropriate?
Poisson regression model
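$$\log(\mathrm{ncase}_i) = \beta_1 + \beta_2 \,\mathrm{IV}_i + \beta_3 \,\mathrm{quarter}_i + \beta_4 \,\mathrm{spline}_i + \beta_5 \,\mathrm{IV}_i \cdot \mathrm{quarter}_i + \beta_6 \,\mathrm{IV}_i \cdot \mathrm{spline}_i + \log(\mathrm{cathdays}_i)$$
where r_i = ncase_i / cathdays_i, so log(r_i) is the same expression without the offset log(cathdays_i).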
What is the model for
• IV=0 and quarter<5?
• IV=0 and quarter≥5?
• IV=1 and quarter<5?
• IV=1 and quarter≥5?
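One way to work through the four cases is sketched below. It assumes the spline term is the hinge function spline_i = max(quarter_i − 5, 0), which matches how `k1` is built in the R code for this example, and it is written on the log-rate scale (the offset is omitted).

```latex
\begin{aligned}
\text{IV}=0,\ \text{quarter}<5:\quad
  & \log(r_i) = \beta_1 + \beta_3\,\text{quarter}_i \\
\text{IV}=0,\ \text{quarter}\ge 5:\quad
  & \log(r_i) = (\beta_1 - 5\beta_4) + (\beta_3 + \beta_4)\,\text{quarter}_i \\
\text{IV}=1,\ \text{quarter}<5:\quad
  & \log(r_i) = (\beta_1 + \beta_2) + (\beta_3 + \beta_5)\,\text{quarter}_i \\
\text{IV}=1,\ \text{quarter}\ge 5:\quad
  & \log(r_i) = (\beta_1 + \beta_2 - 5\beta_4 - 5\beta_6)
     + (\beta_3 + \beta_4 + \beta_5 + \beta_6)\,\text{quarter}_i
\end{aligned}
```

At quarter = 5 the hinge term is zero, so the early and late expressions agree and each fitted line is continuous at the knot.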
R code
ncase <- data$ncase
cathdays <- data$cathdays
control <- data$dataset
intervention <- 1- control
quartern <- data$quartern
# create knot for spline model
k1 <- ifelse(quartern>5,quartern-5,0)
# FIT MODEL WITH INTERACTIONS WITH TIME FOR BOTH GROUPS
reg <- glm(ncase~intervention*quartern+ intervention*k1,
family=poisson, offset=log(cathdays))
summary(reg)
Results
Call:
glm(formula = ncase ~ intervention * quartern + intervention *
    k1, family = poisson, offset = log(cathdays))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6005  -0.8439  -0.2368   0.6349   2.4233

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)           -5.20386    0.37944 -13.715   <2e-16 ***
intervention           0.73339    0.45986   1.595    0.111
quartern               0.07517    0.09148   0.822    0.411
k1                    -0.08774    0.10365  -0.847    0.397
intervention:quartern -0.02874    0.11302  -0.254    0.799
intervention:k1       -0.08355    0.13080  -0.639    0.523
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 108.489  on 39  degrees of freedom
Residual deviance:  61.317  on 34  degrees of freedom
AIC: 213.76
[Figure: Fitted model, rate scale. Rate of Infection per 1000 catheter days (y-axis) by Quarter (x-axis), Intervention ICU vs. Control ICU]
R code
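# coefficient vector (assumed here to be taken from the fitted model reg)
b <- coef(reg)
# linear predictor (log rate) for the control ICU, before and after the knot
fit.early.0 <- b[1] + b[3]*seq(1,5,1)
fit.late.0 <- (b[1]-b[4]*5) + (b[3]+b[4])*seq(5,20,1)
# linear predictor (log rate) for the intervention ICU
fit.early.1 <- (b[1]+b[2]) + (b[3]+b[5])*seq(1,5,1)
fit.late.1 <- (b[1]+b[2]-b[4]*5-b[6]*5) +
  (b[3]+b[4]+b[5]+b[6])*seq(5,20,1)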
fit.early.0
rate.early.0 <- exp(fit.early.0)*1000
rate.early.0
rate.early.1 <- exp(fit.early.1)*1000
rate.late.0 <- exp(fit.late.0)*1000
rate.late.1 <- exp(fit.late.1)*1000
# add lines to plot for fitted control ICU
lines(seq(1,5,1), rate.early.0, col=2)
lines(seq(5,20,1), rate.late.0, col=2)
# add lines to plot for fitted intervention ICU
lines(seq(1,5,1), rate.early.1, col=1)
lines(seq(5,20,1), rate.late.1, col=1)
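The hand-computed fitted lines can be cross-checked against `predict()`. The sketch below uses simulated stand-in data, because the original CSV is not reproduced here. The variable names mirror the slides, but every simulated value is an assumption.

```r
# Simulated stand-in for the ICU data; all values invented for illustration
set.seed(21)
sim <- expand.grid(quartern = 1:20, intervention = c(0, 1))
sim$k1 <- pmax(sim$quartern - 5, 0)          # hinge term, knot at quarter 5
sim$cathdays <- round(runif(nrow(sim), 500, 1100))
sim$ncase <- rpois(nrow(sim),
                   sim$cathdays * exp(-5 - 0.05 * sim$intervention * sim$k1))

fit <- glm(ncase ~ intervention * quartern + intervention * k1 +
             offset(log(cathdays)),
           family = poisson, data = sim)

# Predict at a common exposure of 1000 catheter days,
# so the predictions are rates per 1000 catheter days
newd <- data.frame(quartern = 5:20, intervention = 1,
                   k1 = pmax(5:20 - 5, 0), cathdays = 1000)
rate_predict <- unname(predict(fit, newdata = newd, type = "response"))

# Same quantity from the coefficients, as in the hand calculation
b <- coef(fit)
rate_manual <- unname(exp((b[1] + b[2] - 5 * b[4] - 5 * b[6]) +
                          (b[3] + b[4] + b[5] + b[6]) * (5:20)) * 1000)

print(round(cbind(quarter = 5:20, rate_predict, rate_manual), 3))
```

Writing the offset inside the formula lets `predict()` pick up the exposure from `newdata`, which avoids multiplying by 1000 by hand.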
[Figure: Fitted model, linear predictor scale. Linear Predictor (y-axis) by Quarter (x-axis), Intervention ICU vs. Control ICU]
Real question
Is the change in infection rates different in the
two ICUs?
That is, are the slopes after Q5 different?
How to test that:
• slope in control ICU: β3 + β4
• slope in intervention ICU: β3 + β4 + β5 + β6
What is the hypothesis test?
Linear Combination of Coefficients
> estimable(reg, c(0,0,0,0,1,1))
                Estimate Std. Error X^2 value DF   Pr(>|X^2|)
(0 0 0 0 1 1) -0.1122858 0.03091206  13.19452  1 0.0002807688
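The difference between the two post-knot slopes is β5 + β6, so the test is H0: β5 + β6 = 0. The same Wald test can be sketched in base R from the coefficient vector and its covariance matrix. `lincom_wald` is a hypothetical helper written for this illustration, and the data are simulated stand-ins, not the ICU data.

```r
# Hypothetical helper: Wald test for a linear combination cvec' beta
lincom_wald <- function(fit, cvec) {
  est <- sum(cvec * coef(fit))
  se  <- sqrt(drop(t(cvec) %*% vcov(fit) %*% cvec))
  x2  <- (est / se)^2
  c(estimate = est, se = se, x2 = x2,
    p = pchisq(x2, df = 1, lower.tail = FALSE))
}

# Simulated stand-in data; all values invented for illustration
set.seed(701)
sim <- expand.grid(quartern = 1:20, intervention = c(0, 1))
sim$k1 <- pmax(sim$quartern - 5, 0)
sim$cathdays <- round(runif(nrow(sim), 500, 1100))
sim$ncase <- rpois(nrow(sim),
                   sim$cathdays * exp(-5 - 0.05 * sim$intervention * sim$k1))

fit <- glm(ncase ~ intervention * quartern + intervention * k1 +
             offset(log(cathdays)),
           family = poisson, data = sim)

# H0: beta5 + beta6 = 0 (same post-knot slope in both ICUs)
res <- lincom_wald(fit, c(0, 0, 0, 0, 1, 1))
print(res)
```

As a check against the slide output, (-0.1122858 / 0.03091206)^2 is approximately 13.19, which agrees with the reported X^2 value on 1 degree of freedom.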
Example: Breast Cancer Incidence in SC
Cunningham et al.
Hypothesize that there are differences in
subtypes of breast cancer by race
• ER+ vs. ER-
• Grades 1, 2, 3
• Stage 1, 2, 3, 4
Incidence of breast cancer varies by age
Data:
• Tumor registry data for SC (and Ohio)
• Census data for SC
Poisson modeling
Rate of incidence per cancer type
Modeled as a function of ER, grade and race
> summary(reg1)
Call:
glm(formula = nc ~ age + age2 + age3 + bl + er + gr + age * bl +
age2 * bl + age3 * bl + age * er + age2 * er + age3 * er +
age * gr + age2 * gr + age3 * gr + bl * er + bl * gr + er *
gr, family = poisson, offset = log(9 * popn))
Results
[Figure: Incidence (per 100K) by Age Category (10-14 through 85+), one panel per subtype: er-, er+, and er unk, each for grade 1, grade 2, grade 3, and grade unk]
Confidence Intervals
[Figure: Incidence Ratio for AA vs. EA]