Exact Logistic Regression
Larry Cook
Outline • Review the logistic regression model • Explore an example where model assumptions fail – Brief algebraic interlude • Explore an example with a different issue where logistic regression fails • Computational considerations • Example SAS code
Logistic Regression • Model a binary outcome, Y, with one or more predictors – Success/failure – Disease/not disease • Model the outcome in terms of the log odds of a success • log(odds of Y_i = 1) = α + β x_i, where α is the intercept and β is the log odds ratio for x
Why Log Odds?
• Canonical link function • Makes a binary outcome continuous • Solves this problem – Probability is constrained to [0,1] – Odds are constrained to [0, ∞) • Log odds are in (-∞, ∞) • Exponentiating coefficients gives us estimates of odds ratios
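The mapping from probability to odds to log odds can be sketched in a few lines of Python. The function names `logit` and `inv_logit` are illustrative choices, not something from the slides.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the log odds in (-inf, inf)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Map a log odds value back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Probabilities near 0 or 1 map to large negative or positive log odds.
for p in (0.01, 0.5, 0.99):
    odds = p / (1.0 - p)
    print(f"p={p:.2f}  odds={odds:.4f}  log odds={logit(p):+.4f}")
```

Because the log odds are unbounded, a linear predictor α + β x can take any real value and still map back to a valid probability.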
Example: Motor Vehicle Crash Fatalities • What are the odds of being hospitalized or killed in a motor vehicle crash for drivers using safety restraints vs. those who are not?
– Outcome: Hospitalized/killed or not – Covariate: safety belt use
Hospital/Killed * Restraint Use OR = 0.22, p-value < 0.001
Example: Motor Vehicle Crash Fatalities • What are the odds of being hospitalized or killed in a motor vehicle crash for drivers using safety restraints vs. those who are not?
– Outcome: Hospitalized/killed or not – Covariates: safety belt use, gender, age, alcohol, rural area
Logistic Regression Output
Parameter        Estimate   Odds Ratio   P-value
Intercept         -0.261                 < 0.001
Male              -0.576      0.56       < 0.001
Restraint Use     -1.430      0.24       < 0.001
Alcohol            1.065      2.90       < 0.001
Night              0.194      1.21         0.011
Rural              0.135      1.14       < 0.001
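As a quick check of the "exponentiate the coefficient" rule, the odds ratio column can be reproduced from the estimate column. This is a small sketch; the dictionary below simply copies the estimates from the output above.

```python
import math

# Coefficient estimates copied from the logistic regression output
estimates = {
    "Male": -0.576,
    "Restraint Use": -1.430,
    "Alcohol": 1.065,
    "Night": 0.194,
    "Rural": 0.135,
}

# Odds ratio = exp(coefficient), rounded to two decimals as on the slide
odds_ratios = {name: round(math.exp(b), 2) for name, b in estimates.items()}

for name, value in odds_ratios.items():
    print(f"{name:15s} {value:.2f}")
```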
Assumptions • Conditional probabilities follow a logistic function of the independent variables • Observations are independent • Asymptotics – Sample size is large enough – Minimum of 50 to 100 observations – 10 successes/failures per variable
Corneal Graft Rejections • What if studying a rare disease?
• Data for eight kids in young age group and eight in the older age group • Hypothesis is that rejection is more likely in older children
Graft Rejections
                            No Rejection (Y = 0)   Rejection (Y = 1)   Total
Young (< 4 y.o.) (X = 0)             7                     1              8
Older (> 4 y.o.) (X = 1)             2                     6              8
Total                                9                     7             16
OR = 21, p-value = 0.012, 100% of cells have expected counts < 5!
Fisher’s Exact Test p-value (2-sided) = 0.0406; (1-sided) = 0.0203
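The odds ratio, the expected cell counts, and the Fisher's exact p-values for this table can be reproduced with the standard library alone. This is a minimal sketch: the two-sided p-value below sums the probabilities of all tables no more likely than the observed one, which is one common convention and an assumption about how the slide's value was obtained.

```python
from math import comb

# Observed table: (no rejection, rejection) per age group
young = (7, 1)
older = (2, 6)

odds_ratio = (young[0] * older[1]) / (young[1] * older[0])

# Expected counts under independence: row total * column total / n
n = sum(young) + sum(older)
row_totals = (sum(young), sum(older))
col_totals = (young[0] + older[0], young[1] + older[1])
expected = [[r * c / n for c in col_totals] for r in row_totals]

def pmf(k):
    """Hypergeometric probability of k rejections among the older children."""
    return (comb(row_totals[1], k) * comb(row_totals[0], col_totals[1] - k)
            / comb(n, col_totals[1]))

support = range(0, col_totals[1] + 1)
p_obs = pmf(older[1])
p_one_sided = sum(pmf(k) for k in support if k >= older[1])
# Small relative tolerance guards against floating-point ties
p_two_sided = sum(pmf(k) for k in support if pmf(k) <= p_obs * (1 + 1e-9))

print(odds_ratio, expected)
print(round(p_one_sided, 4), round(p_two_sided, 4))
```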
Let’s Tackle the Graft Rejection Example as Logistic Regression
Graft Rejections
                     No Rejection   Rejection   Total
Young (< 4 y.o.)          7             1          8
Older (> 4 y.o.)          2             6          8
Total                     9             7         16
Sample size << 50!
Don't have 10 successes or 10 failures!
Exact (Conditional) Logistic Regression • Rather than using the unconditional logistic regression, we will condition on nuisance parameters • Use conditional maximum likelihood for estimation and inference
Warning Algebra Ahead Proceed with Caution
Logistic Model
Likelihood of a Sample
Sufficient Statistics
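The equations for these three slides did not survive in the transcript. A standard reconstruction for a single covariate, using the deck's notation (intercept α, slope β, binary outcome y_i, covariate x_i), would look like the following. Treat it as a sketch of the usual derivation, not necessarily the exact form shown on the slides.

```latex
% Logistic model
\pi_i = P(Y_i = 1 \mid x_i) = \frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}}

% Likelihood of a sample of n independent observations
L(\alpha, \beta) = \prod_{i=1}^{n} \pi_i^{\,y_i} (1 - \pi_i)^{1 - y_i}
                 = \frac{\exp\!\left(\alpha \sum_{i} y_i + \beta \sum_{i} x_i y_i\right)}
                        {\prod_{i=1}^{n} \left(1 + e^{\alpha + \beta x_i}\right)}

% Sufficient statistics: the data enter the likelihood only through
t_0 = \sum_{i} y_i \quad (\text{for } \alpha), \qquad
t_1 = \sum_{i} x_i y_i \quad (\text{for } \beta)
```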
Conditioning • If we are only trying to describe the relationship between rejection and age, do we care about the value of the intercept?
• Remove the intercept, α, from the likelihood by conditioning on its sufficient statistic, t_0 = Σ y_i.
• Let S(t_0) = the set of all tables with Σ y_i = t_0 and the observed sample sizes
Conditional Likelihood
Estimation
Inference
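The formulas for these three slides are also missing from the transcript. Under the same notation, and writing c(t_0, u) for the number of outcome vectors in S(t_0) with Σ x_i y_i = u, the usual conditional-likelihood expressions are sketched below. The symbol c(·,·) is introduced here for illustration and is not taken from the slides.

```latex
% Conditional likelihood: alpha cancels once we condition on T_0 = t_0
P(T_1 = t_1 \mid T_0 = t_0;\ \beta)
  = \frac{c(t_0, t_1)\, e^{\beta t_1}}{\sum_{u} c(t_0, u)\, e^{\beta u}}

% Estimation: conditional maximum likelihood estimate (CMLE)
\hat{\beta}_{\mathrm{CMLE}}
  = \arg\max_{\beta} \left[ \beta t_1 - \log \sum_{u} c(t_0, u)\, e^{\beta u} \right]

% Inference: exact one-sided p-value for H_0: \beta = 0 against \beta > 0
p = P(T_1 \ge t_1 \mid T_0 = t_0;\ \beta = 0)
  = \frac{\sum_{u \ge t_1} c(t_0, u)}{\sum_{u} c(t_0, u)}
```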
End of Algebra Back to Example
Graft Rejections
                            No Rejection (Y = 0)   Rejection (Y = 1)   Total
Young (< 4 y.o.) (X = 0)             7                     1              8
Older (> 4 y.o.) (X = 1)             2                     6              8
Total                                9                     7             16
Sufficient Statistics
t_0 = Σ y_i = # of rejections = 7
t_1 = Σ x_i y_i = 0*(# of rejections in young) + 1*(# of rejections in old) = 0*1 + 1*6 = 6
Conditional Distribution for Graft Rejection • Need to calculate all possible tables that have exactly 7 rejections • Calculate how often each of the tables occurs • Calculate the CMLE • Calculate how rare our table is to obtain a p-value
Reference Set
Yng_NR   Yng_R   Old_NR   Old_R   t_0   t_1    Count    P[Table]
   1       7       8        0      7     0         8     0.0007
   2       6       7        1      7     1       224     0.0196
   3       5       6        2      7     2     1,568     0.1371
   4       4       5        3      7     3     3,920     0.3427
   5       3       4        4      7     4     3,920     0.3427
   6       2       3        5      7     5     1,568     0.1371
   7       1       2        6      7     6       224     0.0196   <- observed table
   8       0       1        7      7     7         8     0.0007
Total                                          11,440     1.000
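The reference set above can be regenerated directly from the margins. In this minimal sketch the variable names are illustrative; the only inputs are the two group sizes and the fixed total number of rejections.

```python
from math import comb

n_young, n_old = 8, 8   # group sizes from the slide
t0 = 7                  # total rejections, held fixed by conditioning

rows = []
for old_r in range(0, n_old + 1):
    yng_r = t0 - old_r
    if yng_r < 0 or yng_r > n_young:
        continue
    # Number of ways to choose which children in each group reject the graft
    count = comb(n_young, yng_r) * comb(n_old, old_r)
    rows.append({"t1": old_r, "count": count})

total = sum(r["count"] for r in rows)
for r in rows:
    r["prob"] = r["count"] / total
    print(r["t1"], r["count"], f"{r['prob']:.4f}")
print("total", total)
```

The total equals the binomial coefficient C(16, 7), the number of ways to assign 7 rejections to 16 children.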
Estimate β and Find a p-value
t_1    Count    P[Table]
 0         8     0.0007
 1       224     0.0196
 2     1,568     0.1371
 3     3,920     0.3427
 4     3,920     0.3427
 5     1,568     0.1371
 6       224     0.0196
 7         8     0.0007
Estimate and p-value
t_1    Count    P[Table]
 0         8     0.0007
 1       224     0.0196
 2     1,568     0.1371
 3     3,920     0.3427
 4     3,920     0.3427
 5     1,568     0.1371
 6       224     0.0196
 7         8     0.0007
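Given the counts in the table, the conditional MLE and the exact p-values can be computed numerically. The sketch below assumes the usual conditional score equation (the CMLE is the β at which the conditional mean of t_1 equals its observed value) and solves it by bisection. The function names are illustrative.

```python
import math

# Count column of the reference set, keyed by t_1
counts = {0: 8, 1: 224, 2: 1568, 3: 3920, 4: 3920, 5: 1568, 6: 224, 7: 8}
t_obs = 6

def cond_mean(beta):
    """E[T_1 | T_0 = t_0] under the conditional distribution at this beta."""
    weights = {t: c * math.exp(beta * t) for t, c in counts.items()}
    z = sum(weights.values())
    return sum(t * w for t, w in weights.items()) / z

def cmle(t, lo=-20.0, hi=20.0, tol=1e-10):
    """Solve cond_mean(beta) = t by bisection (the mean increases with beta)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cond_mean(mid) < t:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

beta_hat = cmle(t_obs)

# Exact p-values under H0: beta = 0, where P[Table] = count / total
total = sum(counts.values())
p_one_sided = sum(c for t, c in counts.items() if t >= t_obs) / total
p_two_sided = sum(c for c in counts.values() if c <= counts[t_obs]) / total

print(f"CMLE of beta: {beta_hat:.3f}, conditional OR: {math.exp(beta_hat):.2f}")
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

The p-values agree with the Fisher's exact test results reported earlier in the deck, as expected for a single binary covariate.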
Confidence Interval • Lower bound, β_- – If t_1 = t_1,min, then β_- = -∞ – Otherwise, β_- is the value of β that produces an upper p-value of α/2 • Upper bound, β_+ – If t_1 = t_1,max, then β_+ = ∞ – Otherwise, β_+ is the value of β that produces a lower p-value of α/2 (here α is the significance level, e.g. 0.05)
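The confidence-interval rule can be sketched as a root-finding problem on the tail probabilities of the conditional distribution. This is an illustrative implementation: it assumes "upper p-value" means P(T_1 ≥ t_1) and "lower p-value" means P(T_1 ≤ t_1), and it uses simple bisection, not any particular package's routine.

```python
import math

def tail_prob(counts, beta, t_obs, upper):
    """P(T >= t_obs) if upper else P(T <= t_obs) under the conditional law at beta."""
    m = max(beta * t for t in counts)  # shift exponents for numerical stability
    weights = {t: c * math.exp(beta * t - m) for t, c in counts.items()}
    z = sum(weights.values())
    if upper:
        return sum(w for t, w in weights.items() if t >= t_obs) / z
    return sum(w for t, w in weights.items() if t <= t_obs) / z

def bisect(f, target, increasing, lo=-30.0, hi=30.0, tol=1e-10):
    """Find beta with f(beta) = target for a monotone function f."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(counts, t_obs, alpha=0.05):
    """Exact conditional confidence interval for beta (log odds ratio scale)."""
    t_min, t_max = min(counts), max(counts)
    lower = -math.inf if t_obs == t_min else bisect(
        lambda b: tail_prob(counts, b, t_obs, True), alpha / 2, increasing=True)
    upper = math.inf if t_obs == t_max else bisect(
        lambda b: tail_prob(counts, b, t_obs, False), alpha / 2, increasing=False)
    return lower, upper

graft = {0: 8, 1: 224, 2: 1568, 3: 3920, 4: 3920, 5: 1568, 6: 224, 7: 8}
graft_lo, graft_hi = exact_ci(graft, t_obs=6)
print(f"graft example: ({graft_lo:.3f}, {graft_hi:.3f})")

cspine = {0: 560_211, 1: 571_860, 2: 145_530}
cspine_lo, cspine_hi = exact_ci(cspine, t_obs=0)
print(f"C-spine example: ({cspine_lo}, {cspine_hi:.3f})")
```

For the C-spine counts used later in the deck, the observed statistic sits at its minimum, so the lower bound is -∞ and the upper bound comes out close to the 2.345 reported there.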
Final Stats for Graft Rejection
Example 2: PECARN C-Spine Study
Case Control Study
              Control    Case    Total
Not Present    1,057      540    1,597
Present            2        0        2
Total          1,059      540    1,599
Any problems estimating the odds ratio?
Could exact logistic regression help?
What sufficient statistics are needed?
                      Control (Y = 0)   Case (Y = 1)   Total
Not Present (X = 0)       1,057             540        1,597
Present (X = 1)               2               0            2
Total                     1,059             540        1,599
• Σ y = 2
• Σ xy = 0
Conditional Density
Case P   Case NP   Ctrl P   Ctrl NP   t_0   t_1       Count    P[Table]
  0        540       2       1,057     2     0       560,211     0.438
  1        539       1       1,058     2     1       571,860     0.448
  2        538       0       1,059     2     2       145,530     0.114
Total                                              1,277,601     1.000
One-sided p-value = 0.438
Two-sided p-value = 2*0.438 = 0.876
95% confidence interval: (-∞, 2.345)
Point estimate?
Median Unbiased Estimate
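The slide's formula is missing, so the sketch below assumes one common definition of the median unbiased estimate: solve for the β at which the upper and lower tail probabilities each equal 0.5 and average the two solutions, and use the single finite solution when the observed statistic sits at the edge of its range, as it does here. The helper names are illustrative.

```python
import math

# Conditional density counts from the C-spine example, keyed by t_1
counts = {0: 560_211, 1: 571_860, 2: 145_530}
t_obs = 0

def tail_prob(beta, upper):
    """P(T >= t_obs) if upper else P(T <= t_obs) at this beta."""
    m = max(beta * t for t in counts)  # shift exponents for stability
    weights = {t: c * math.exp(beta * t - m) for t, c in counts.items()}
    z = sum(weights.values())
    if upper:
        return sum(w for t, w in weights.items() if t >= t_obs) / z
    return sum(w for t, w in weights.items() if t <= t_obs) / z

def bisect(f, target, increasing, lo=-30.0, hi=30.0, tol=1e-10):
    """Find beta with f(beta) = target for a monotone function f."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def median_unbiased():
    t_min, t_max = min(counts), max(counts)
    b_lo = None if t_obs == t_min else bisect(lambda b: tail_prob(b, True), 0.5, True)
    b_hi = None if t_obs == t_max else bisect(lambda b: tail_prob(b, False), 0.5, False)
    if b_lo is None:
        return b_hi          # statistic at its minimum: only one equation is solvable
    if b_hi is None:
        return b_lo          # statistic at its maximum
    return (b_lo + b_hi) / 2

mue = median_unbiased()
print(f"median unbiased estimate of beta: {mue:.3f}")
```

Unlike the conditional MLE, which is not finite when the observed statistic is at the boundary, this estimate always exists.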
One More Example Dose Response
Toxicology Experiment • 400 mice randomized to one of four levels of a drug • Drug administered to each animal • Outcome is the number of deaths at each dose level
Dose    Lived   Died   Total
0         99      1     100
1         97      3     100
2         95      5     100
3         90     10     100
Total    381     19     400
Σ y = 19
Σ xy = 3 + 10 + 30 = 43
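For grouped data like this, the two sufficient statistics are simple sums over the dose levels. A small sketch, with the lists copied from the table above:

```python
# Dose level and number of deaths per group, copied from the table
doses = [0, 1, 2, 3]
died = [1, 3, 5, 10]

t0 = sum(died)                                  # sufficient statistic for the intercept
t1 = sum(x * y for x, y in zip(doses, died))    # sufficient statistic for the dose slope

print(t0, t1)
```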
Exact vs. Unconditional
Exact
• Estimate = 0.710
• SE = 0.246
• OR = 2.03
• CI = (1.26, 3.52) • p-value = 0.002
Unconditional
• Estimate = 0.712
• SE = 0.246
• OR = 2.04
• CI = (1.26, 3.30) • p-value = 0.004
Computational Issues
Counting All the Tables • One of the main hurdles for conditional logistic regression is counting all the tables in the sample space – Graft rejections: 11,440 possibilities – PECARN C-Spine: 1,277,601 – Toxicology: 2.79 x 10^33 • Obviously don't want to generate tables one at a time
Network Algorithm • Graphical representation of the sample space • Nodes represent a partial sum of the sufficient statistic • Arcs have combinatorial weighting value • One path through the graph represents a table in the sample space
Example
         Y = 0   Y = 1   Total
X = 1      3       0       3
X = 2      2       1       3
X = 3      2       1       3
X = 4      1       2       3
Total      8       4      12
Sufficient Statistics
t_0 = Σ y_i = 4
t_1 = Σ x_i y_i = 1*0 + 2*1 + 3*1 + 4*2 = 13
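         Y = 0   Y = 1
X = 1      1       2
X = 2      3       0
X = 3      1       2
X = 4      3       0
Total      8       4
[Figure: network with nodes (0,0); (1,0), (1,1), (1,2), (1,3); (2,0), (2,1), (2,2), (2,3), (2,4); (3,1), (3,2), (3,3), (3,4); (4,4). Reading each node as (stage, partial sum of Y = 1), this table follows the path (0,0) → (1,2) → (2,2) → (3,4) → (4,4).]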
         Y = 0   Y = 1
X = 1      3       0
X = 2      2       1
X = 3      2       1
X = 4      1       2
Total      8       4
[Figure: same network. Reading each node as (stage, partial sum of Y = 1), this table follows the path (0,0) → (1,0) → (2,1) → (3,2) → (4,4).]
Network Representation of the Sample Space
[Figure: network with nodes by stage: (0,0); (1,0), (1,1), (1,2), (1,3); (2,0), (2,1), (2,2), (2,3), (2,4); (3,1), (3,2), (3,3), (3,4); (4,4)]
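The idea of walking the network stage by stage, instead of listing tables one at a time, can be sketched as a small dynamic program. This is a simplified illustration and not the network algorithm as implemented in exact-inference software: each state here carries both partial sums (Σ y and Σ x y), and arcs are weighted by binomial coefficients.

```python
from math import comb
from collections import defaultdict

# Example from the slides: four covariate levels with three subjects each
x = [1, 2, 3, 4]
n = [3, 3, 3, 3]
t0_obs, t1_obs = 4, 13

# layer maps (partial sum of y, partial sum of x*y) -> number of outcome vectors
layer = {(0, 0): 1}
for xk, nk in zip(x, n):
    nxt = defaultdict(int)
    for (s, t), ways in layer.items():
        for yk in range(0, nk + 1):
            if s + yk > t0_obs:
                break  # this arc would overshoot the fixed total of successes
            # Arc weight: ways to choose yk successes among the nk subjects
            nxt[(s + yk, t + xk * yk)] += ways * comb(nk, yk)
    layer = dict(nxt)

# Keep only terminal states that hit the conditioning value t_0
dist = {t: c for (s, t), c in layer.items() if s == t0_obs}
total = sum(dist.values())

for t in sorted(dist):
    print(t, dist[t], f"{dist[t] / total:.4f}")
print("total", total)
```

Each stage merges all partial tables that reach the same state, so the work grows with the number of distinct states, not with the number of tables.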
What About Multiple Covariates?
More Conditioning!
Osteogenic Sarcoma (LogXact Manual) • 46 patients surgically treated for osteogenic sarcoma and then observed for disease recurrence within 3 years • Covariates – Sex: Male = 1, Female = 0 – Any Osteoid Pathology (AOP): Present = 1, not = 0 • Interested in the effect of AOP
Osteogenic Sarcoma
Covariate Group   Sex (x_1)   AOP (x_2)   No Recurrence (y = 0)   Recurrence (y = 1)   Group Size (n_i)
1                    0           0                 8                      0                    8
2                    0           1                 5                      2                    7
3                    1           0                 9                      4                   13
4                    1           1                 7                     11                   18
Total                                             29                     17                   46
Estimating the Effect of AOP • New statistics to condition on – Group sizes – Sufficient statistic for the intercept, Σ y = 17 – Sufficient statistic for the coefficient for sex, Σ x_1 y = 15 • Calculate the conditional distribution of Σ x_2 y – Sufficient statistic for the coefficient for AOP – Number of cases with AOP in recurrence (= 13) – Given exactly 17 with recurrence, 15 of which are males
Network Algorithm • The network algorithm uses two passes – First pass conditions on the intercept • Keeps all tables with exactly 17 cases in recurrence – Second pass removes arcs that don't produce the observed sufficient statistic for sex • Drops all tables that don't have 15 males in recurrence • Proceed with estimation & inference as before
P[ Σ x_2 y = t_2 | 17 in recurrence and 15 males ]
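For a problem this small, the conditional distribution can be obtained by brute-force enumeration in place of the network algorithm. The sketch below takes the four covariate groups from the osteogenic sarcoma table, keeps only the outcome configurations with 17 recurrences of which 15 are male, and tabulates the AOP statistic. Variable names are illustrative.

```python
from math import comb, prod
from itertools import product

# (sex, aop, group size) for the four covariate groups
groups = [(0, 0, 8), (0, 1, 7), (1, 0, 13), (1, 1, 18)]
t0_obs, t1_obs, t2_obs = 17, 15, 13   # recurrences, male recurrences, AOP recurrences

counts = {}
for ys in product(*(range(n + 1) for _, _, n in groups)):
    if sum(ys) != t0_obs:
        continue   # condition on the intercept's sufficient statistic
    if sum(sex * y for (sex, _, _), y in zip(groups, ys)) != t1_obs:
        continue   # condition on the sufficient statistic for sex
    t2 = sum(aop * y for (_, aop, _), y in zip(groups, ys))
    ways = prod(comb(n, y) for (_, _, n), y in zip(groups, ys))
    counts[t2] = counts.get(t2, 0) + ways

total = sum(counts.values())
probs = {t2: c / total for t2, c in sorted(counts.items())}
p_upper = sum(p for t2, p in probs.items() if t2 >= t2_obs)

for t2, p in probs.items():
    print(t2, f"{p:.4f}")
print(f"P[t_2 >= {t2_obs}] = {p_upper:.4f}")
```

Estimation and inference for the AOP coefficient then proceed on `probs` exactly as in the single-covariate case.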
Results
LR Test for Both Variables • To test that the coefficients for sex and AOP are both zero simultaneously, need the joint conditional density – All possible combinations of males and patients with AOP in recurrence, given exactly 17 patients in recurrence – Determine how rare it is to have 15 recurrent males AND 13 recurrent AOP patients
SAS Examples
Conclusion • Exact (conditional) logistic regression – Useful method when asymptotic assumptions are not met or with separation – Utilizes conditioning to remove nuisance parameters from the likelihood – Very computationally intensive method – Network algorithm speeds up calculations
Questions?