Transcript Slide 1

Lecture 21:
Poisson regression
Log-linear regression
BMTRY 701
Biostatistical Methods II
Poisson distribution
▪ Used for count data
▪ Generally, rare events
▪ In space or time
▪ Upper limit is theoretically infinite
▪ Examples:
  • earthquakes, hurricanes
  • cancer incidence (spatial)
  • absences in a school year
  • AIDS deaths in a region
▪ Assessing disease in different groups:
  • Probability, Risk, Rate, Incidence, Prevalence
The Poisson distribution
▪ Probability mass function:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$
▪ Approximates a binomial for rare events
▪ Notice it has only ONE parameter: λ
▪ Mean = variance = λ
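As a quick numerical check of the mass function, the following R sketch compares the hand-computed formula with R's built-in dpois() and with a binomial that has many trials and a small event probability. The values lambda = 2 and n = 10000 are arbitrary illustrative choices, not taken from the lecture.

```r
# Illustrative values only (assumptions, not from the lecture)
lambda <- 2
k <- 0:10

# Poisson pmf computed directly from the formula
pmf_manual <- lambda^k * exp(-lambda) / factorial(k)

# Built-in Poisson pmf
pmf_builtin <- dpois(k, lambda)

# Binomial with many trials and a rare event, chosen so that n * p = lambda
n <- 10000
pmf_binom <- dbinom(k, size = n, prob = lambda / n)

# Largest absolute differences from the built-in Poisson pmf
max_diff_formula <- max(abs(pmf_manual - pmf_builtin))
max_diff_binom <- max(abs(pmf_binom - pmf_builtin))
print(c(formula = max_diff_formula, binomial = max_diff_binom))
```

The first difference should be at the level of floating-point error, and the second should shrink as n grows with n * p held fixed.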
Simple Poisson distribution example
▪ The infection rate at a Neonatal Intensive Care Unit (NICU) is
  typically expressed as a number of infections per patient-day. This
  counts events across both time and patients.
▪ Assume that the probability of getting an infection over a short
  time period is proportional to the length of the time period. In
  other words, a patient who stays one hour in the NICU has twice the
  risk of a single infection as a patient who stays 30 minutes.
▪ Assume that, for a small enough interval, the probability of
  getting two infections is negligible.
▪ Assume that the probability of infection does not change over
  time or across infants.
▪ Assume independence.
  • The probability of seeing an infection in one child does not increase or
    decrease the probability of seeing an infection in another child.
  • If an infant gets an infection during one time interval, it doesn't change
    the probability that he or she will get another infection during a later time
    interval.
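The first two assumptions can be illustrated numerically. The sketch below assumes a hypothetical constant rate of 5 infections per 1000 patient-days (a made-up value, not from the lecture) and checks that, for short stays, doubling the length of stay roughly doubles the risk, while the chance of two or more infections is negligible.

```r
# Hypothetical infection rate: 5 infections per 1000 patient-days (assumption)
rate_per_day <- 5 / 1000

# Probability of at least one infection during a stay of `days` days,
# assuming a constant rate and independence (Poisson process)
p_any <- function(days) 1 - exp(-rate_per_day * days)

# Probability of two or more infections during the same stay
p_two_plus <- function(days) 1 - ppois(1, rate_per_day * days)

half_hour <- 0.5 / 24
one_hour <- 1 / 24

# Risk ratio for a one-hour stay versus a 30-minute stay
ratio_short <- p_any(one_hour) / p_any(half_hour)

# Share of the one-hour risk that comes from two or more infections
share_multi <- p_two_plus(one_hour) / p_any(one_hour)
print(c(ratio = ratio_short, multi = share_multi))
```

For long stays the ratio drifts below 2, which is why the proportionality assumption is stated for short time periods.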
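Poisson regression
▪ Based on the idea that the log of the probability (rate) of
  disease is a linear function of risk factors
▪ The rate ratio (“relative risk”) is modeled:
$$\log(r_i) = \beta_0 + \beta_1 \text{Group}_i$$
▪ Interpretation of slope, with $r_1$ the rate when Group = 1 and $r_0$ the rate when Group = 0:
$$\log(r_1) - \log(r_0) = (\beta_0 + \beta_1 \cdot 1) - (\beta_0 + \beta_1 \cdot 0)$$
$$\log\left(\frac{r_1}{r_0}\right) = \beta_1$$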
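To make the slope interpretation concrete, here is a minimal R sketch with made-up two-group data (the event counts and person-time are assumptions for illustration). The exposure enters as an offset, which the Implementation slides describe. With a single binary covariate, the exponentiated slope should equal the crude ratio of the two observed rates.

```r
# Hypothetical two-group data (assumption): events and person-time per group
events <- c(12, 30)
persontime <- c(1000, 1500)
group <- c(0, 1)

# Poisson regression with log(person-time) as the offset
fit <- glm(events ~ group, family = poisson, offset = log(persontime))

# exp(slope) is the estimated rate ratio r1 / r0
rr_model <- unname(exp(coef(fit)["group"]))

# Crude rate ratio computed directly from the observed rates
rr_crude <- (events[2] / persontime[2]) / (events[1] / persontime[1])
print(c(model = rr_model, crude = rr_crude))
```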
Implementation
▪ $r_i$ is the rate
▪ Often we observe
  • a number of events
  • a geographic region, time, or number of person-years
▪ Need to account for these differences
  • rates based on smaller “exposure” are less precise
  • adjustment is made
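Implementation
▪ Unless there is uniform time, space, etc., the
  following is generally implemented:
$$\log(r_i) = \beta_0 + \beta_1 \text{Group}_i$$
$$\log\left(\frac{\text{cases}_i}{\text{total}_i}\right) = \beta_0 + \beta_1 \text{Group}_i$$
$$\log(\text{cases}_i) - \log(\text{total}_i) = \beta_0 + \beta_1 \text{Group}_i$$
$$\log(\text{cases}_i) = \beta_0 + \beta_1 \text{Group}_i + \underbrace{\log(\text{total}_i)}_{\text{“OFFSET”}}$$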
Offset term
▪ Notice: NO COEFFICIENT on offset
▪ Adjusts for population size or space
▪ Example: breast cancer incidence per county in
  South Carolina
  • cases are the number of women (& men) diagnosed
    within a county in SC in one year.
  • the offset would be the (log) population size in the county
    in that year (probably estimated)
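A minimal R sketch of the offset, using made-up county counts and populations (assumptions for illustration only). It shows the two equivalent ways glm() accepts an offset and, for contrast, what changes when log(population) is entered as an ordinary covariate with its own estimated coefficient.

```r
# Hypothetical county data (assumption): cases and population in one year
cases <- c(20, 35, 9, 60)
popn <- c(15000, 22000, 8000, 41000)
group <- c(0, 1, 0, 1)

# Offset supplied through the `offset` argument
fit_arg <- glm(cases ~ group, family = poisson, offset = log(popn))

# Equivalent: offset() term inside the formula
fit_formula <- glm(cases ~ group + offset(log(popn)), family = poisson)

# For contrast: log(popn) as a covariate gets an estimated coefficient
fit_free <- glm(cases ~ group + log(popn), family = poisson)

print(coef(fit_arg))
print(coef(fit_formula))
print(coef(fit_free))
```

The first two fits have no coefficient for log(popn) because it is fixed at 1; the third estimates it freely and so is no longer a model for the rate itself.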
Caveat
▪ Standard Poisson regression relies on the Poisson
  assumption about the variance (variance = mean)
▪ If events tend to occur in clusters, then there is
  “overdispersion”
▪ This leads to a more general form of model: the log-linear model (later)
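One common informal check for overdispersion is the Pearson chi-square statistic divided by the residual degrees of freedom, which should be near 1 if the Poisson variance assumption holds. The sketch below uses simulated clustered counts (an assumption for illustration, not lecture data) to show the statistic landing well above 1.

```r
set.seed(701)  # arbitrary seed for reproducibility

# Simulated clustered counts (assumption): negative binomial with mean 10
# and variance 10 + 10^2 / 1 = 110, far above the Poisson variance of 10
y <- rnbinom(200, size = 1, mu = 10)

# Intercept-only Poisson fit
fit <- glm(y ~ 1, family = poisson)

# Pearson chi-square divided by residual degrees of freedom;
# values well above 1 suggest overdispersion
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(c(mean = mean(y), variance = var(y), dispersion = dispersion))
```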
Example: Catheter-Related Bloodstream Infections in the
ICU (Critical Care Medicine, 2004).
▪ Objective: To determine whether a multifaceted systems
  intervention would eliminate catheter-related
  bloodstream infections (CR-BSIs)
▪ Design: prospective cohort in surgical ICU at JHU
  including all patients with a central venous catheter in the ICU.
▪ Two ICUs
▪ Interventions:
  • educating staff
  • creating a catheter insertion cart
  • asking providers daily if catheters could be removed
  • implementing a checklist to ensure adherence to guidelines
  • empowering nurses to stop catheter insertion if a violation of
    guidelines was observed.
Example: Catheter-Related Bloodstream Infections in the
ICU (Critical Care Medicine, 2004).
▪ Analysis
  • Poisson regression
  • Outcome is rate of CR-BSIs
  • Data structure
    ▪ number of infections per quarter in ICU
    ▪ number of catheter days (counting every patient who has a
      catheter at 12 am each day). Patients each counted only
      once
    ▪ indicator of control vs. intervention ICU
  • Intervention not implemented until 1st quarter 1999.
Dataset
. list

     +---------------------------------------------------------+
     | quarter   ncase   cathdays    rate   dataset   quartern |
     |---------------------------------------------------------|
  1. | Qtr1-98       6       1057    5.68         1          1 |
  2. | Qtr2-98       4       1018    3.93         1          2 |
  3. | Qtr3-98      10        899   11.12         1          3 |
  4. | Qtr4-98       8        952     8.4         1          4 |
  5. | Qtr1-99       3        952    3.15         1          5 |
     |---------------------------------------------------------|
  6. | Qtr2-99      10        939   10.65         1          6 |
  7. | Qtr3-99       5       1045    4.78         1          7 |
  8. | Qtr4-99       9        927    9.71         1          8 |
  9. | Qtr1-00       7       1060     6.6         1          9 |
 10. | Qtr2-00       7       1094     6.4         1         10 |
     |---------------------------------------------------------|
 11. | Qtr3-00       5        850    5.88         1         11 |
 12. | Qtr4-00      10        822   12.17         1         12 |
 13. | Qtr1-01      11        868   12.67         1         13 |
 14. | Qtr2-01       4        830    4.82         1         14 |
 15. | Qtr3-01       4        603    6.63         1         15 |
     |---------------------------------------------------------|
 16. | Qtr4-01       5        551    9.07         1         16 |
     +---------------------------------------------------------+
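The listed rows can be typed in directly to confirm how the rate column relates to the counts. The sketch below recomputes the rate per 1000 catheter days and then fits an intercept-only Poisson model with an offset to these 16 rows only (dataset == 1), showing that the exponentiated intercept is the pooled rate. This is an illustration on the listed subset, not the model fitted in the lecture.

```r
# Rows as listed above (dataset == 1 only)
ncase <- c(6, 4, 10, 8, 3, 10, 5, 9, 7, 7, 5, 10, 11, 4, 4, 5)
cathdays <- c(1057, 1018, 899, 952, 952, 939, 1045, 927,
              1060, 1094, 850, 822, 868, 830, 603, 551)
rate_listed <- c(5.68, 3.93, 11.12, 8.4, 3.15, 10.65, 4.78, 9.71,
                 6.6, 6.4, 5.88, 12.17, 12.67, 4.82, 6.63, 9.07)

# Rate per 1000 catheter days, recomputed from the counts
rate_calc <- round(1000 * ncase / cathdays, 2)
print(cbind(rate_listed, rate_calc))

# Intercept-only Poisson model with offset: exp(intercept) is the pooled rate
fit0 <- glm(ncase ~ 1, family = poisson, offset = log(cathdays))
pooled_model <- unname(1000 * exp(coef(fit0)[1]))
pooled_crude <- 1000 * sum(ncase) / sum(cathdays)
print(c(model = pooled_model, crude = pooled_crude))
```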
[Figure: Observed Data. Rate of Infection per 1000 catheter days (y-axis) by Quarter (x-axis), for the Intervention ICU and the Control ICU.]
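R code
# read the quarterly infection data for both ICUs
data <- read.csv("csicu7.csv")
# plot the observed rates for all ICU-quarters (black points);
# ylim is set so that the legend placed at y = 22 falls inside the plot
plot(data$quartern, data$rate, xlab="Quarter",
     ylab="Rate of Infection per 1000 catheter days", pch=16,
     ylim=c(0,22))
# recolor the dataset==1 (control ICU) points in red
points(data$quartern[data$dataset==1],
       data$rate[data$dataset==1], pch=16, col=2)
# connect the points within each ICU
lines(data$quartern[data$dataset==0],
      data$rate[data$dataset==0], col=1)
lines(data$quartern[data$dataset==1],
      data$rate[data$dataset==1], col=2)
legend(12, 22, c("Intervention ICU","Control ICU"),
       col=c(1,2), pch=c(16,16))
# dotted vertical line at quarter 5, when the intervention began
abline(v=5, lty=3)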
Estimating the Poisson regression
▪ Want to model change in rates
▪ However, during the first 4 quarters there was no
  intervention.
▪ Based on the observed data and on the data
  structure, what model is appropriate?
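Poisson regression model
$$\log(r_i) = \beta_1 + \beta_2 IV_i + \beta_3 \text{quarter}_i + \beta_4 \text{spline}_i + \beta_5 IV_i \cdot \text{quarter}_i + \beta_6 IV_i \cdot \text{spline}_i$$
fitted to the counts with $\log(\text{cathdays}_i)$ as the offset:
$$\log(\text{ncase}_i) = \log(r_i) + \log(\text{cathdays}_i)$$
What is the model for
  • IV=0 and quarter<5?
  • IV=0 and quarter≥5?
  • IV=1 and quarter<5?
  • IV=1 and quarter≥5?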
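One way to work through the four cases, assuming the spline term is the hinge $\text{spline}_i = \max(\text{quarter}_i - 5,\, 0)$ (the knot at quarter 5 used in the lecture's R code) and writing the model on the log-rate scale, that is, with the offset moved to the left-hand side:

```latex
\begin{align*}
\text{IV}=0,\ \text{quarter}<5:    &\quad \log(r_i) = \beta_1 + \beta_3\,\text{quarter}_i \\
\text{IV}=0,\ \text{quarter}\ge 5: &\quad \log(r_i) = (\beta_1 - 5\beta_4) + (\beta_3 + \beta_4)\,\text{quarter}_i \\
\text{IV}=1,\ \text{quarter}<5:    &\quad \log(r_i) = (\beta_1 + \beta_2) + (\beta_3 + \beta_5)\,\text{quarter}_i \\
\text{IV}=1,\ \text{quarter}\ge 5: &\quad \log(r_i) = (\beta_1 + \beta_2 - 5\beta_4 - 5\beta_6)
                                      + (\beta_3 + \beta_4 + \beta_5 + \beta_6)\,\text{quarter}_i
\end{align*}
```

At quarter = 5 the early and late expressions agree within each ICU, so each fitted line is continuous at the knot.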
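R code
# pull the model variables out of the data frame
ncase <- data$ncase
cathdays <- data$cathdays
control <- data$dataset          # 1 = control ICU, 0 = intervention ICU
intervention <- 1 - control
quartern <- data$quartern
# create knot for spline model: 0 up to quarter 5, then quartern - 5
k1 <- ifelse(quartern>5, quartern-5, 0)
# FIT MODEL WITH INTERACTIONS WITH TIME FOR BOTH GROUPS
# log(cathdays) enters as an offset (no coefficient is estimated for it)
reg <- glm(ncase ~ intervention*quartern + intervention*k1,
           family=poisson, offset=log(cathdays))
summary(reg)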
Results
Call:
glm(formula = ncase ~ intervention * quartern + intervention *
    k1, family = poisson, offset = log(cathdays))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6005  -0.8439  -0.2368   0.6349   2.4233

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)           -5.20386    0.37944 -13.715   <2e-16 ***
intervention           0.73339    0.45986   1.595    0.111
quartern               0.07517    0.09148   0.822    0.411
k1                    -0.08774    0.10365  -0.847    0.397
intervention:quartern -0.02874    0.11302  -0.254    0.799
intervention:k1       -0.08355    0.13080  -0.639    0.523
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 108.489  on 39  degrees of freedom
Residual deviance:  61.317  on 34  degrees of freedom
AIC: 213.76
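The coefficients are on the log-rate scale, so exponentiating turns them into rates and multiplicative changes. The sketch below copies the reported estimates into a vector (the vector names are labels chosen here for illustration) and computes the post-knot slopes in each ICU. The intercept corresponds to the control ICU at quarter 0, which is an extrapolation just before the first observed quarter.

```r
# Coefficient estimates copied from the output above
b <- c(intercept = -5.20386, intervention = 0.73339, quartern = 0.07517,
       k1 = -0.08774, int_quartern = -0.02874, int_k1 = -0.08355)

# Fitted rate per 1000 catheter days in the control ICU at quarter 0
baseline_rate <- unname(1000 * exp(b["intercept"]))

# Slopes on the log-rate scale after the knot at quarter 5
slope_control <- unname(b["quartern"] + b["k1"])
slope_intervention <- unname(slope_control + b["int_quartern"] + b["int_k1"])

# Multiplicative change in the rate per quarter after the knot
change_per_quarter <- exp(c(control = slope_control,
                            intervention = slope_intervention))
print(baseline_rate)
print(c(control = slope_control, intervention = slope_intervention))
print(change_per_quarter)
```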
[Figure: Fitted model, rate scale. Rate of Infection per 1000 catheter days (y-axis) by Quarter (x-axis), for the Intervention ICU and the Control ICU.]
R code
# estimated coefficients from the fitted model
b <- coef(reg)
# linear predictor (log rate per catheter day) in each ICU,
# before (early) and after (late) the knot at quarter 5
fit.early.0 <- b[1] + b[3]*seq(1,5,1)
fit.late.0 <- (b[1]-b[4]*5) + (b[3]+b[4])*seq(5,20,1)
fit.early.1 <- (b[1]+b[2]) + (b[3]+b[5])*seq(1,5,1)
fit.late.1 <- (b[1]+b[2]-b[4]*5-b[6]*5) +
  (b[3]+b[4]+b[5]+b[6])*seq(5,20,1)
fit.early.0
# convert to rates per 1000 catheter days
rate.early.0 <- exp(fit.early.0)*1000
rate.early.0
rate.early.1 <- exp(fit.early.1)*1000
rate.late.0 <- exp(fit.late.0)*1000
rate.late.1 <- exp(fit.late.1)*1000
# add lines to plot for fitted control ICU
lines(seq(1,5,1), rate.early.0, col=2)
lines(seq(5,20,1), rate.late.0, col=2)
# add lines to plot for fitted intervention ICU
lines(seq(1,5,1), rate.early.1, col=1)
lines(seq(5,20,1), rate.late.1, col=1)
[Figure: Fitted model, linear predictor scale. Linear Predictor (y-axis) by Quarter (x-axis), for the Intervention ICU and the Control ICU.]
Real question
▪ Is the change in infection rates different in the
  two ICUs?
▪ That is, are the slopes after Q5 different?
▪ How to test that:
  • slope in control ICU: β3 + β4
  • slope in intervention ICU: β3 + β4 + β5 + β6
▪ What is the hypothesis test?
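One way to write the test, sketched here as a Wald test on the difference between the two post-knot slopes (the notation $\hat\beta$, $\widehat{\mathrm{Var}}$ and $\widehat{\mathrm{Cov}}$ for the estimates and their estimated variances and covariance is introduced here for illustration):

```latex
\begin{align*}
H_0 &: (\beta_3 + \beta_4 + \beta_5 + \beta_6) - (\beta_3 + \beta_4) = \beta_5 + \beta_6 = 0 \\
H_1 &: \beta_5 + \beta_6 \neq 0 \\
X^2 &= \frac{(\hat\beta_5 + \hat\beta_6)^2}
            {\widehat{\mathrm{Var}}(\hat\beta_5) + \widehat{\mathrm{Var}}(\hat\beta_6)
             + 2\,\widehat{\mathrm{Cov}}(\hat\beta_5, \hat\beta_6)}
       \;\sim\; \chi^2_1 \quad \text{approximately, under } H_0
\end{align*}
```

The covariance term matters: the two interaction estimates are generally correlated, so the standard error of their sum cannot be obtained from the two reported standard errors alone.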
Linear Combination of Coefficients
> estimable(reg, c(0,0,0,0,1,1))
                Estimate Std. Error X^2 value DF   Pr(>|X^2|)
(0 0 0 0 1 1) -0.1122858 0.03091206  13.19452  1 0.0002807688
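The reported statistic can be reproduced from the estimate and its standard error alone. The sketch below copies those two numbers from the output and recomputes the chi-square value and p-value. It also exponentiates the estimate to express the slope difference as a ratio of per-quarter rate multipliers, with an approximate 95% Wald interval (the interval is an addition for illustration, not part of the output shown).

```r
# Values copied from the estimable() output above
estimate <- -0.1122858
std_error <- 0.03091206

# Wald chi-square statistic on 1 degree of freedom and its p-value
x2 <- (estimate / std_error)^2
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

# Ratio of per-quarter rate multipliers (intervention vs. control)
ratio_of_trends <- exp(estimate)

# Approximate 95% Wald interval for that ratio
ci <- exp(estimate + c(-1, 1) * qnorm(0.975) * std_error)
print(c(x2 = x2, p = p_value, ratio = ratio_of_trends))
print(ci)
```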
Example: Breast Cancer Incidence in SC
▪ Cunningham et al.
▪ Hypothesize that there are differences in
  subtypes of breast cancer by race
  • ER+ vs. ER-
  • Grades 1, 2, 3
  • Stage 1, 2, 3, 4
▪ Incidence of breast cancer varies by age
▪ Data:
  • Tumor registry data for SC (and Ohio)
  • Census data for SC
Poisson modeling
▪ Rate of incidence per cancer type
▪ Modeled as a function of age, ER, grade and race
> summary(reg1)
Call:
glm(formula = nc ~ age + age2 + age3 + bl + er + gr + age * bl +
    age2 * bl + age3 * bl + age * er + age2 * er + age3 * er +
    age * gr + age2 * gr + age3 * gr + bl * er + bl * gr + er *
    gr, family = poisson, offset = log(9 * popn))
Results
[Figure: Incidence (per 100K) by Age Category (10-14 through 85+), one panel per combination of ER status and grade: er-, er+ and er unk, each with grade 1, grade 2, grade 3 and grade unk.]
Confidence Intervals
[Figure: Incidence Ratio for AA vs. EA.]