Exact Logistic Regression

Exact Logistic Regression Larry Cook

Outline
• Review the logistic regression model
• Explore an example where model assumptions fail
  – Brief algebraic interlude
• Explore an example with a different issue where logistic regression fails
• Computational considerations
• Example SAS code

Logistic Regression
• Model a binary outcome, Y, with one or more predictors
  – Success/failure
  – Disease/not disease
• Model the outcome in terms of the log odds of a success
• log(odds of Y_i) = α + β x_i

Why Log Odds?

• Canonical link function
• Makes a binary outcome continuous
• Solves this problem
  – Probability is constrained to [0, 1]
  – Odds are constrained to [0, ∞)
• Log odds are in (-∞, ∞)
• Exponentiating coefficients gives us estimates of odds ratios

Example: Motor Vehicle Crash Fatalities
• What are the odds of being hospitalized or killed in a motor vehicle crash for drivers using safety restraints vs. those who are not?
  – Outcome: hospitalized/killed or not
  – Covariate: safety belt use

Hospitalized/Killed * Restraint Use: OR = 0.22, p-value < 0.001

Example: Motor Vehicle Crash Fatalities
• What are the odds of being hospitalized or killed in a motor vehicle crash for drivers using safety restraints vs. those who are not?
  – Outcome: hospitalized/killed or not
  – Covariates: safety belt use, gender, age, alcohol, rural area

Logistic Regression Output

Parameter        Estimate   Odds Ratio   P-value
Intercept         -0.261                 < 0.001
Male              -0.576       0.56      < 0.001
Restraint Use     -1.430       0.24      < 0.001
Alcohol            1.065       2.90      < 0.001
Night              0.194       1.21        0.011
Rural              0.135       1.14      < 0.001
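The Odds Ratio column is the exponential of the Estimate column. A minimal sketch that recomputes it from the estimates in the output table (the dictionary and function names are illustrative, not from the talk):

```python
import math

# Coefficient estimates copied from the output table (log-odds scale).
estimates = {
    "Male": -0.576,
    "Restraint Use": -1.430,
    "Alcohol": 1.065,
    "Night": 0.194,
    "Rural": 0.135,
}


def odds_ratio(log_odds_coefficient):
    """Exponentiate a log-odds coefficient to get an odds ratio."""
    return math.exp(log_odds_coefficient)


odds_ratios = {name: round(odds_ratio(b), 2) for name, b in estimates.items()}

for name, value in odds_ratios.items():
    print(f"{name:<14} OR = {value:.2f}")
```

An odds ratio below 1 (restraint use) means lower odds of hospitalization or death; one above 1 (alcohol) means higher odds.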

Assumptions
• Conditional probabilities follow a logistic function of the independent variables
• Observations are independent
• Asymptotics
  – Sample size is large enough
  – Minimum of 50 to 100 observations
  – 10 successes/failures per variable

Corneal Graft Rejections
• What if we are studying a rare disease?
• Data for eight children in the younger age group and eight in the older age group
• Hypothesis is that rejection is more likely in older children

Graft Rejections

                            No Rejection (Y = 0)   Rejection (Y = 1)   Total
Young (< 4 y.o.) (X = 0)             7                     1              8
Older (> 4 y.o.) (X = 1)             2                     6              8
Total                                9                     7             16

OR = 21, p-value = 0.012, 100% of cells have expected counts < 5!!!
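The odds ratio, the expected counts, and the asymptotic p-value for this table can be checked by hand. A minimal sketch, assuming the quoted p-value comes from a Pearson chi-square test without continuity correction (the talk does not say which test was used):

```python
import math

# Observed counts from the graft rejection table:
# rows = (young, older), columns = (no rejection, rejection).
observed = [[7, 1],
            [2, 6]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sample odds ratio: cross-product ratio of the 2x2 table.
odds_ratio = (observed[0][0] * observed[1][1]) / (observed[0][1] * observed[1][0])

# Expected counts under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic (assumption: no continuity correction).
chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(odds_ratio, expected, round(p_value, 3))
```

Every expected count is 4.5 or 3.5, which is why the chi-square approximation is not trustworthy here.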

Fisher’s Exact Test p-value (2-sided) = 0.0406; (1-sided) = 0.0203

Let’s Tackle the Graft Rejection Example as Logistic Regression

Graft Rejections

                    No Rejection   Rejection   Total
Young (< 4 y.o.)         7             1          8
Older (> 4 y.o.)         2             6          8
Total                    9             7         16

Sample size << 50!
Don’t have 10 successes or 10 failures!

Exact (Conditional) Logistic Regression
• Rather than using unconditional logistic regression, we condition on the sufficient statistics of the nuisance parameters
• Use conditional maximum likelihood for estimation and inference

Warning Algebra Ahead Proceed with Caution

Logistic Model

Likelihood of a Sample

Sufficient Statistics
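The three headings above correspond to the standard single-covariate derivation. A sketch in the notation of this talk, with α the intercept, β the slope, and x_i, y_i the covariate and binary outcome for subject i (the symbols π_i and n are labels introduced here):

```latex
% Logistic model
\[
\pi_i = P(Y_i = 1 \mid x_i) = \frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}},
\qquad
\log\frac{\pi_i}{1 - \pi_i} = \alpha + \beta x_i
\]

% Likelihood of a sample
\[
L(\alpha, \beta) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}
= \frac{\exp\left(\alpha \sum_i y_i + \beta \sum_i x_i y_i\right)}
       {\prod_{i=1}^{n} \left(1 + e^{\alpha + \beta x_i}\right)}
\]

% Sufficient statistics
\[
t_0 = \sum_i y_i \quad (\text{for } \alpha),
\qquad
t_1 = \sum_i x_i y_i \quad (\text{for } \beta)
\]
```

The data enter the likelihood only through t_0 and t_1, which is what makes conditioning on t_0 possible.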

Conditioning
• If we are only trying to describe the relationship between rejection and age, do we care about the value of the intercept?
• Remove the intercept, α, from the likelihood by conditioning on its sufficient statistic, t_0 = Σ y_i.
• Let S(t_0) = the set of all tables with Σ y_i = t_0 and the observed sample sizes

Conditional Likelihood

Estimation

Inference
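A sketch of the standard forms behind these three headings for the single-covariate case, where c(u) (a label introduced here) is the number of tables in the conditioning set with Σ x_i y_i = u:

```latex
% Conditional likelihood: alpha cancels after conditioning on t_0
\[
P(T_1 = t_1 \mid T_0 = t_0) = \frac{c(t_1)\, e^{\beta t_1}}{\sum_{u} c(u)\, e^{\beta u}}
\]

% Estimation: conditional maximum likelihood estimate (CMLE)
\[
\hat{\beta} = \arg\max_{\beta} \left[ \beta t_1 - \log \sum_{u} c(u)\, e^{\beta u} \right]
\]

% Inference: exact one-sided p-values for H_0: beta = 0
\[
p_{\text{upper}} = \frac{\sum_{u \ge t_1} c(u)}{\sum_{u} c(u)},
\qquad
p_{\text{lower}} = \frac{\sum_{u \le t_1} c(u)}{\sum_{u} c(u)}
\]
```

A two-sided p-value is then taken as twice the smaller one-sided value, which is the convention the numbers in this talk follow.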

End of Algebra Back to Example

Graft Rejections

                            No Rejection (Y = 0)   Rejection (Y = 1)   Total
Young (< 4 y.o.) (X = 0)             7                     1              8
Older (> 4 y.o.) (X = 1)             2                     6              8
Total                                9                     7             16

Sufficient Statistics
t_0 = Σ y_i = # of rejections = 7
t_1 = Σ x_i y_i = 0*(# of rejections in young) + 1*(# of rejections in old) = 0*1 + 1*6 = 6

Conditional Distribution for Graft Rejection
• Need to calculate all possible tables that have exactly 7 rejections
• Calculate how often each of the tables occurs
• Calculate the CMLE
• Calculate how rare our table is to obtain a p-value
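The first two steps can be sketched directly: with 7 rejections fixed, each table is determined by how many rejections fall in the older group, and the number of ways to realize it is a product of two binomial coefficients. A minimal illustration (variable names are not from the talk):

```python
from math import comb

n_young, n_old = 8, 8      # group sizes from the graft rejection table
total_rejections = 7       # t0, the statistic being conditioned on

reference_set = []
for old_rejections in range(total_rejections + 1):   # this value is t1
    young_rejections = total_rejections - old_rejections
    if young_rejections > n_young or old_rejections > n_old:
        continue  # table is impossible with these group sizes
    # Ways to choose which children in each group had a rejection.
    count = comb(n_young, young_rejections) * comb(n_old, old_rejections)
    reference_set.append((old_rejections, count))

total = sum(count for _, count in reference_set)
for t1, count in reference_set:
    print(f"t1 = {t1}  count = {count:>5,}  P[table] = {count / total:.4f}")
print(f"total = {total:,}")
```

The probabilities printed here are those under β = 0, where every arrangement of the 7 rejections among the 16 children is equally likely.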

Reference Set

Yng_NR   Yng_R   Old_NR   Old_R   t_0   t_1    Count   P[Table]
   1       7        8       0      7     0         8     0.0007
   2       6        7       1      7     1       224     0.0196
   3       5        6       2      7     2     1,568     0.1371
   4       4        5       3      7     3     3,920     0.3427
   5       3        4       4      7     4     3,920     0.3427
   6       2        3       5      7     5     1,568     0.1371
   7       1        2       6      7     6       224     0.0196
   8       0        1       7      7     7         8     0.0007
Total                                          11,440     1.000

Estimate β and Find a p-value

t_1    Count   P[Table]
 0         8     0.0007
 1       224     0.0196
 2     1,568     0.1371
 3     3,920     0.3427
 4     3,920     0.3427
 5     1,568     0.1371
 6       224     0.0196
 7         8     0.0007

Confidence Interval
• Lower bound, β−
  – If t_1 = t_1,min, then β− = -∞
  – Otherwise, β− is the value of β that produces an upper p-value of α/2
• Upper bound, β+
  – If t_1 = t_1,max, then β+ = +∞
  – Otherwise, β+ is the value of β that produces a lower p-value of α/2
• Here α is the significance level (0.05 for a 95% interval), not the intercept
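For the graft rejection data, the estimate, p-values, and interval can be sketched from the eight table counts alone. This is an illustration under stated assumptions, not the SAS or LogXact implementation: it uses plain bisection on the bracket [-20, 20], a 95% level, and a two-sided p-value equal to twice the smaller one-sided value.

```python
import math

# Conditional distribution of t1 given t0 = 7 (counts from the reference set).
counts = {0: 8, 1: 224, 2: 1568, 3: 3920, 4: 3920, 5: 1568, 6: 224, 7: 8}
t1_observed = 6


def pmf(beta):
    """P(T1 = u | t0) under slope beta: count(u) * exp(beta * u), normalized."""
    weights = {u: c * math.exp(beta * u) for u, c in counts.items()}
    norm = sum(weights.values())
    return {u: w / norm for u, w in weights.items()}


def mean_t1(beta):
    return sum(u * p for u, p in pmf(beta).items())


def upper_p(beta):
    """P(T1 >= observed), increasing in beta."""
    return sum(p for u, p in pmf(beta).items() if u >= t1_observed)


def lower_p(beta):
    """P(T1 <= observed), decreasing in beta."""
    return sum(p for u, p in pmf(beta).items() if u <= t1_observed)


def solve(func, target, lo=-20.0, hi=20.0, increasing=True):
    """Bisection for func(beta) = target, with func monotone on [lo, hi]."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# CMLE: the conditional score equation sets E[T1 | t0] equal to the observed t1.
beta_hat = solve(mean_t1, t1_observed)

# Exact p-values under H0: beta = 0.
one_sided_p = upper_p(0.0)
two_sided_p = min(1.0, 2 * min(upper_p(0.0), lower_p(0.0)))

# 95% limits: upper p-value = 0.025 at the lower bound, lower p-value = 0.025 at the upper.
ci_lower = solve(upper_p, 0.025)
ci_upper = solve(lower_p, 0.025, increasing=False)

print(f"CMLE beta = {beta_hat:.3f}, conditional OR = {math.exp(beta_hat):.2f}")
print(f"one-sided p = {one_sided_p:.4f}, two-sided p = {two_sided_p:.4f}")
print(f"95% CI for beta: ({ci_lower:.3f}, {ci_upper:.3f})")
```

The exact p-values reproduce the Fisher's exact test values quoted earlier (0.0203 and 0.0406), and the conditional estimate comes out smaller than the sample log odds ratio, log(21).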

Final Stats for Graft Rejection

Example 2: PECARN C-Spine Study

Case Control Study

               Control    Case    Total
Not Present     1,057      540    1,597
Present             2        0        2
Total           1,059      540    1,599

Any problems estimating the odds ratio?

Could exact logistic regression help?

What sufficient statistics are needed?

                       Control (Y = 0)   Case (Y = 1)   Total
Not Present (X = 0)         1,057             540        1,597
Present (X = 1)                 2               0            2
Total                       1,059             540        1,599

• Σ y = 2
• Σ xy = 0

Conditional Density

Case P   Case NP   Ctrl P   Ctrl NP   t_0   t_1       Count   P[Table]
   0       540        2      1,057     2     0      560,211     0.438
   1       539        1      1,058     2     1      571,860     0.448
   2       538        0      1,059     2     2      145,530     0.114
Total                                              1,277,601     1.000

One-sided p-value = 0.438

Two-sided p-value = 2*0.438 = 0.876

95% confidence interval (-∞, 2.345)

Point estimate?

Median Unbiased Estimate
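With the observed statistic at the minimum of its range, the conditional likelihood has no finite maximum, so a median unbiased estimate is reported instead. A minimal sketch for the C-spine table, assuming the usual convention that, at the minimum, the median unbiased estimate is the β at which the lower p-value equals 0.5 (the talk's own formula is not shown, so treat this definition as an assumption):

```python
import math

# Conditional distribution from the C-spine table: counts for t1 = 0, 1, 2.
counts = {0: 560_211, 1: 571_860, 2: 145_530}
t1_observed = 0   # sits at the minimum of the support


def lower_p(beta):
    """P(T1 <= observed) under slope beta; decreasing in beta."""
    weights = {u: c * math.exp(beta * u) for u, c in counts.items()}
    norm = sum(weights.values())
    return sum(w for u, w in weights.items() if u <= t1_observed) / norm


def solve_decreasing(func, target, lo=-20.0, hi=20.0):
    """Bisection for func(beta) = target when func decreases in beta."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


one_sided_p = lower_p(0.0)
two_sided_p = min(1.0, 2 * one_sided_p)

# Upper 95% limit: the beta whose lower p-value is 0.025 (lower limit is -infinity).
ci_upper = solve_decreasing(lower_p, 0.025)

# Assumed convention: median unbiased estimate solves lower p-value = 0.5.
median_unbiased = solve_decreasing(lower_p, 0.5)

print(f"one-sided p = {one_sided_p:.3f}, two-sided p = {two_sided_p:.3f}")
print(f"95% CI upper limit for beta = {ci_upper:.3f}")
print(f"median unbiased estimate of beta = {median_unbiased:.3f}")
```

The upper limit lands within rounding of the 2.345 quoted above, which confirms that the interval is on the log odds (β) scale rather than the odds ratio scale.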

One More Example: Dose Response

Toxicology Experiment
• 400 mice randomized to one of four dose levels of a drug
• Drug administered to each animal
• Outcome is the number of deaths at each dose level

Dose    Lived   Died   Total
0         99      1     100
1         97      3     100
2         95      5     100
3         90     10     100
Total    381     19     400

Σ y = 19
Σ xy = 3 + 10 + 30 = 43

Exact vs. Unconditional

             Exact           Unconditional
Estimate     0.710           0.712
SE           0.246           0.246
OR           2.03            2.04
CI           (1.26, 3.52)    (1.26, 3.30)
p-value      0.002           0.004

Computational Issues

Counting All the Tables
• One of the main hurdles for conditional logistic regression is counting all the tables in the sample space
  – Graft rejections: 11,440 possibilities
  – PECARN C-Spine: 1,277,601
  – Toxicology: 1.46 x 10^32 (19 deaths among 400 mice)
• Obviously we don’t want to generate tables one at a time
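The size of each sample space is a single binomial coefficient: the number of ways to place the observed number of events among all observations. A quick check, using the totals from the three examples (the dictionary layout is illustrative):

```python
from math import comb

# (total observations, total events) for each example in this talk.
examples = {
    "Graft rejections": (16, 7),
    "PECARN C-Spine": (1599, 2),
    "Toxicology": (400, 19),
}

table_counts = {name: comb(n, events) for name, (n, events) in examples.items()}

for name, count in table_counts.items():
    print(f"{name:<18} {count:.3e}")
```

For the toxicology data this evaluates C(400, 19), roughly 1.46 x 10^32, far too many tables to enumerate one at a time.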

Network Algorithm
• Graphical representation of the sample space
• Nodes represent a partial sum of the sufficient statistic
• Arcs have a combinatorial weighting value
• One path through the graph represents a table in the sample space

Example

         Y = 0   Y = 1   Total
X = 1      3       0       3
X = 2      2       1       3
X = 3      2       1       3
X = 4      1       2       3
Total      8       4      12

Sufficient Statistics
t_0 = Σ y_i = 4
t_1 = Σ x_i y_i = 1*0 + 2*1 + 3*1 + 4*2 = 13

Network Representation of the Sample Space

Nodes (stage, partial sum of y):
Stage 0: (0,0)
Stage 1: (1,0) (1,1) (1,2) (1,3)
Stage 2: (2,0) (2,1) (2,2) (2,3) (2,4)
Stage 3: (3,1) (3,2) (3,3) (3,4)
Stage 4: (4,4)

One table in the sample space, path (0,0) → (1,2) → (2,2) → (3,4) → (4,4):

         Y = 0   Y = 1
X = 1      1       2
X = 2      3       0
X = 3      1       2
X = 4      3       0
Total      8       4

The observed table, path (0,0) → (1,0) → (2,1) → (3,2) → (4,4):

         Y = 0   Y = 1
X = 1      3       0
X = 2      2       1
X = 3      2       1
X = 4      1       2
Total      8       4
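The network idea can be sketched as a stage-by-stage dynamic program: each node is a partial sum of y, each arc carries the binomial weight for the group just added, and counts are accumulated per value of t_1 instead of listing tables. This is a simplified illustration of the approach, not the published network algorithm; the function name and data layout are assumptions made here.

```python
from collections import defaultdict
from math import comb

# Example from the slides: four covariate levels, three subjects each.
x_values = [1, 2, 3, 4]
group_sizes = [3, 3, 3, 3]
t0 = 4   # total number of successes being conditioned on


def t1_distribution(x_values, group_sizes, t0):
    """Count tables by t1 = sum(x * y), given sum(y) = t0, one stage at a time."""
    # nodes[s] maps t1 -> number of partial tables that reach partial sum s.
    nodes = {0: {0: 1}}
    remaining = sum(group_sizes)
    for x, n in zip(x_values, group_sizes):
        remaining -= n
        next_nodes = defaultdict(lambda: defaultdict(int))
        for s, t1_counts in nodes.items():
            for y in range(n + 1):
                s_new = s + y
                # Prune nodes that cannot finish at exactly t0 successes.
                if s_new > t0 or s_new + remaining < t0:
                    continue
                arc_weight = comb(n, y)   # ways to pick y successes in this group
                for t1, count in t1_counts.items():
                    next_nodes[s_new][t1 + x * y] += count * arc_weight
        nodes = next_nodes
    return dict(nodes[t0])


distribution = t1_distribution(x_values, group_sizes, t0)
for t1 in sorted(distribution):
    print(f"t1 = {t1:>2}  count = {distribution[t1]}")
```

The pruning rule reproduces the node set listed for the slide example: stage 3 has no node (3,0), because at most three more successes can come from the last group.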

What About Multiple Covariates?

More Conditioning!

Osteogenic Sarcoma (LogXact Manual)
• 46 patients surgically treated for osteogenic sarcoma and then observed for disease recurrence within 3 years
• Covariates
  – Sex: male = 1, female = 0
  – Any Osteoid Pathology (AOP): present = 1, not present = 0
• Interested in the effect of AOP

Osteogenic Sarcoma

Covariate   Sex     AOP     No Recurrence   Recurrence   Group Size
Group       (x_1)   (x_2)   (y = 0)         (y = 1)      (n_i)
1            0       0           8               0            8
2            0       1           5               2            7
3            1       0           9               4           13
4            1       1           7              11           18
Total                           29              17           46

Estimating the Effect of AOP
• New statistics to condition on
  – Group sizes
  – Sufficient statistic for the intercept, Σ y = 17
  – Sufficient statistic for the coefficient for sex, Σ x_1 y = 15
• Calculate the conditional distribution of Σ x_2 y
  – Sufficient statistic for the coefficient for AOP
  – Number of cases with AOP in recurrence (= 13)
  – Given exactly 17 with recurrence, 15 of whom are males

Network Algorithm
• The network algorithm uses two passes
  – First pass conditions on the intercept
    • All tables with exactly 17 cases in recurrence
  – Second pass removes arcs that don’t produce the sufficient statistic for sex
    • All tables that don’t have 15 males in recurrence
• Proceed with estimation & inference as before

P[ Σ x_2 y = t_2 | 17 in recurrence and 15 males ]
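With only four covariate groups, this conditional distribution is small enough to obtain by brute force, which makes the two conditioning steps explicit. A minimal sketch using the group sizes and covariate codes from the osteogenic sarcoma data; it enumerates and filters instead of running the network algorithm, and it gives the distribution at β for AOP equal to 0:

```python
from collections import defaultdict
from itertools import product
from math import comb

# Covariate groups: (sex x1, AOP x2, group size n_i).
groups = [(0, 0, 8), (0, 1, 7), (1, 0, 13), (1, 1, 18)]
t0, t_sex = 17, 15      # conditioned on: total recurrences, male recurrences
t_aop_observed = 13     # observed recurrences among patients with AOP

counts = defaultdict(int)
for ys in product(*(range(n + 1) for _, _, n in groups)):
    # First condition: exactly 17 recurrences in total.
    if sum(ys) != t0:
        continue
    # Second condition: exactly 15 of the recurrences are male.
    if sum(x1 * y for (x1, _, _), y in zip(groups, ys)) != t_sex:
        continue
    t_aop = sum(x2 * y for (_, x2, _), y in zip(groups, ys))
    weight = 1
    for (_, _, n), y in zip(groups, ys):
        weight *= comb(n, y)   # ways to pick y recurrences in this group
    counts[t_aop] += weight

total = sum(counts.values())
conditional_pmf = {t: c / total for t, c in sorted(counts.items())}

for t, p in conditional_pmf.items():
    marker = "  <- observed" if t == t_aop_observed else ""
    print(f"t2 = {t:>2}  P = {p:.5f}{marker}")
```

Estimation and inference for the AOP coefficient then proceed from these counts exactly as in the single-covariate examples.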

Results

LR Test for Both Variables
• To test that the coefficients for both sex and AOP are zero simultaneously, we need the joint conditional density
  – All possible combinations of males and patients with AOP in recurrence, given exactly 17 patients in recurrence
  – Determine how rare it is to have 15 recurrent males AND 13 recurrent AOP patients

SAS Examples

Conclusion
• Exact (conditional) logistic regression
  – Useful method when asymptotic assumptions are not met or with separation
  – Uses conditioning to remove nuisance parameters from the likelihood
  – Very computationally intensive method
  – Network algorithm speeds up calculations

Questions?