#### Transcript Exact Logistic Regression

Exact Logistic Regression Larry Cook

Outline • Review the logistic regression model • Explore an example where model assumptions fail – Brief algebraic interlude • Explore an example with a different issue where logistic regression fails • Computational considerations • Example SAS code

Logistic Regression • Model a binary outcome, Y, with one or more predictors – Success/failure – Disease/not disease • Model outcome in terms of the log odds of a success • $\log(\text{odds of } Y_i = 1) = \alpha + \beta x_i$

Why Log Odds?

• Canonical link function • Makes a binary outcome continuous • Solves this problem – Probability is constrained to [0,1] – Odds are constrained to [0, ∞) • Log odds are in (-∞, ∞) • Exponentiating coefficients gives us estimates of odds ratios

Example: Motor Vehicle Crash Fatalities • What are odds of being hospitalized or killed in a motor vehicle crash for drivers using safety restraints vs. those that are not?

– Outcome: Hospitalized/killed or not – Covariate: safety belt use

Hospitalized/Killed by Restraint Use (crosstab): OR = 0.22, p-value < 0.001

Example: Motor Vehicle Crash Fatalities • What are odds of being hospitalized or killed in a motor vehicle crash for drivers using safety restraints vs. those that are not?

– Outcome: Hospitalized/killed or not – Covariates: safety belt use, gender, age, alcohol, rural area

Logistic Regression Output

| **Parameter** | **Estimate** | **Odds Ratio** | **P-value** |
| --- | --- | --- | --- |
| Intercept | -0.261 | | < 0.001 |
| Male | -0.576 | 0.56 | < 0.001 |
| Restraint Use | -1.430 | 0.24 | < 0.001 |
| Alcohol | 1.065 | 2.90 | < 0.001 |
| Night | 0.194 | 1.21 | 0.011 |
| Rural | 0.135 | 1.14 | < 0.001 |
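The link between the Estimate and Odds Ratio columns can be checked directly. The following is a minimal sketch that exponentiates the coefficients reported above; the dictionary name `estimates` is only an illustration.

```python
import math

# Coefficient estimates copied from the logistic regression output above.
estimates = {
    "Male": -0.576,
    "Restraint Use": -1.430,
    "Alcohol": 1.065,
    "Night": 0.194,
    "Rural": 0.135,
}

# exp(coefficient) is the odds ratio for a one-unit change in the predictor.
odds_ratios = {name: round(math.exp(b), 2) for name, b in estimates.items()}

for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.2f}")
```

The printed values should agree with the Odds Ratio column to two decimals.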

Assumptions • Conditional probabilities follow a logistic function of the independent variables • Observations are independent • Asymptotics – Sample size is large enough – Minimum of 50 to 100 observations – 10 successes/failures per variable

Corneal Graft Rejections • What if studying a rare disease?

• Data for eight kids in young age group and eight in the older age group • Hypothesis is that rejection is more likely in older children

Graft Rejections

| | No Rejection (Y = 0) | Rejection (Y = 1) | Total |
| --- | --- | --- | --- |
| **Young (< 4 y.o.) (X = 0)** | 7 | 1 | 8 |
| **Older (> 4 y.o.) (X = 1)** | 2 | 6 | 8 |
| **Total** | 9 | 7 | 16 |

OR = 21, p-value = 0.012, 100% of cells have expected counts < 5!!!

Fisher’s Exact Test p-value (2-sided) = 0.0406; (1-sided) = 0.0203
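The odds ratio, the expected counts, and the Fisher's exact p-values quoted for the graft table can be reproduced with a few lines of standard-library Python. This is a sketch only; the variable names are illustrative, and the two-sided p-value uses the common "tables no more likely than the observed one" rule, which is an assumption about how the slide's value was obtained.

```python
from math import comb

# 2x2 table from the slide: (no rejection, rejection) for each age group.
young = (7, 1)
older = (2, 6)

# Sample odds ratio of rejection, older vs. young.
odds_ratio = (older[1] * young[0]) / (older[0] * young[1])

# Expected counts under independence: row total * column total / grand total.
row_totals = (sum(young), sum(older))
col_totals = (young[0] + older[0], young[1] + older[1])
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

def prob(k):
    """Hypergeometric probability of k rejections in the older group."""
    return (comb(row_totals[1], k) * comb(row_totals[0], col_totals[1] - k)
            / comb(n, col_totals[1]))

observed = older[1]
support = range(max(0, col_totals[1] - row_totals[0]),
                min(row_totals[1], col_totals[1]) + 1)

# One-sided: tables with at least as many rejections in the older group.
p_one_sided = sum(prob(k) for k in support if k >= observed)
# Two-sided: tables that are no more likely than the observed table.
p_two_sided = sum(prob(k) for k in support
                  if prob(k) <= prob(observed) * (1 + 1e-9))

print(odds_ratio, expected, round(p_one_sided, 4), round(p_two_sided, 4))
```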

Let’s Tackle the Graft Rejection Example as Logistic Regression

Graft Rejections

| | No Rejection | Rejection | Total |
| --- | --- | --- | --- |
| **Young (< 4 y.o.)** | 7 | 1 | 8 |
| **Older (> 4 y.o.)** | 2 | 6 | 8 |
| **Total** | 9 | 7 | 16 |

Sample size << 50!

Don’t have 10 successes or 10 failures!

Exact (Conditional) Logistic Regression • Rather than using the unconditional logistic regression, we will condition on nuisance parameters • Use conditional maximum likelihood for estimation and inference

Warning Algebra Ahead Proceed with Caution

Logistic Model

Likelihood of a Sample

Sufficient Statistics
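For reference, these three quantities can be written in the usual textbook form. The block below is a sketch in standard notation, with $\pi_i = P(Y_i = 1 \mid x_i)$ introduced here as an assumed symbol; it is not necessarily the layout used on the slides.

```latex
\begin{align*}
\log\frac{\pi_i}{1-\pi_i} &= \alpha + \beta x_i
  && \text{(logistic model)} \\
L(\alpha, \beta) &= \prod_{i=1}^{n} \pi_i^{y_i} (1-\pi_i)^{1-y_i}
  = \frac{\exp\left(\alpha \sum_i y_i + \beta \sum_i x_i y_i\right)}
         {\prod_{i=1}^{n} \left(1 + e^{\alpha + \beta x_i}\right)}
  && \text{(likelihood of a sample)} \\
t_0 &= \sum_i y_i, \qquad t_1 = \sum_i x_i y_i
  && \text{(sufficient statistics for } \alpha \text{ and } \beta\text{)}
\end{align*}
```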

Conditioning • If we are only trying to describe the relationship between rejection and age, do we care about the value of the intercept?

• Remove the intercept, $\alpha$, from the likelihood by conditioning on its sufficient statistic, $t_0 = \sum y_i$.

• Let $S(t_0)$ = the set of all tables with $\sum y_i = t_0$ and the observed sample sizes

Conditional Likelihood

Estimation

Inference
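In standard notation, conditioning on $t_0$ leads to the following forms. This is a sketch under assumed notation: $c(u)$ denotes the number of tables in $S(t_0)$ with $\sum x_i y_i = u$, and $E$ denotes the set of values at least as extreme as the observed $t_1$. Neither symbol comes from the slides.

```latex
\begin{align*}
P(T_1 = t_1 \mid T_0 = t_0)
  &= \frac{c(t_1)\, e^{\beta t_1}}{\sum_{u} c(u)\, e^{\beta u}}
  && \text{(conditional likelihood; } \alpha \text{ cancels)} \\
\hat{\beta}_{\mathrm{CMLE}}
  &= \arg\max_{\beta} \frac{c(t_1)\, e^{\beta t_1}}{\sum_{u} c(u)\, e^{\beta u}}
  && \text{(estimation)} \\
p &= \sum_{u \in E} \frac{c(u)}{\sum_{v} c(v)}
  && \text{(inference for } H_0\colon \beta = 0\text{)}
\end{align*}
```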

End of Algebra Back to Example

Graft Rejections

| | No Rejection (Y = 0) | Rejection (Y = 1) | Total |
| --- | --- | --- | --- |
| **Young (< 4 y.o.) (X = 0)** | 7 | 1 | 8 |
| **Older (> 4 y.o.) (X = 1)** | 2 | 6 | 8 |
| **Total** | 9 | 7 | 16 |

Sufficient Statistics • $t_0 = \sum y_i$ = # of rejections = 7 • $t_1 = \sum x_i y_i$ = 0 × (# of rejections in young) + 1 × (# of rejections in old) = 0 × 1 + 1 × 6 = 6

Conditional Distribution for Graft Rejection • Need to calculate all possible tables that have exactly 7 rejections • Calculate how often each of the tables occurs • Calculate the CMLE • Calculate how rare our table is to obtain the p-value

Reference Set

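| **Yng_NR** | **Yng_R** | **Old_NR** | **Old_R** | **$t_0$** | **$t_1$** | **Count** | **P[Table]** |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7 | 8 | 0 | 7 | 0 | 8 | 0.0007 |
| 2 | 6 | 7 | 1 | 7 | 1 | 224 | 0.0196 |
| 3 | 5 | 6 | 2 | 7 | 2 | 1,568 | 0.1371 |
| 4 | 4 | 5 | 3 | 7 | 3 | 3,920 | 0.3427 |
| 5 | 3 | 4 | 4 | 7 | 4 | 3,920 | 0.3427 |
| 6 | 2 | 3 | 5 | 7 | 5 | 1,568 | 0.1371 |
| *7* | *1* | *2* | *6* | *7* | *6* | *224* | *0.0196* |
| 8 | 0 | 1 | 7 | 7 | 7 | 8 | 0.0007 |
| **Total** | | | | | | 11,440 | 1.000 |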
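The reference set is small enough to enumerate directly. The sketch below rebuilds the Count and P[Table] columns from binomial coefficients; the dictionary keys mirror the column labels, and the other names are illustrative.

```python
from math import comb

n_young, n_old = 8, 8   # observed group sizes
t0 = 7                  # total number of rejections, held fixed

rows = []
for old_r in range(max(0, t0 - n_young), min(n_old, t0) + 1):
    yng_r = t0 - old_r
    rows.append({
        "Yng_NR": n_young - yng_r,
        "Yng_R": yng_r,
        "Old_NR": n_old - old_r,
        "Old_R": old_r,
        "t1": old_r,  # sum of x*y, since only the older group has x = 1
        # Number of ways to choose which children reject in each group.
        "count": comb(n_young, yng_r) * comb(n_old, old_r),
    })

total = sum(r["count"] for r in rows)
for r in rows:
    r["prob"] = r["count"] / total
    print(r["t1"], r["count"], round(r["prob"], 4))
```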

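Estimate $\beta$ and Find a p-value

| **$t_1$** | **Count** | **P[Table]** |
| --- | --- | --- |
| 0 | 8 | 0.0007 |
| 1 | 224 | 0.0196 |
| 2 | 1,568 | 0.1371 |
| 3 | 3,920 | 0.3427 |
| 4 | 3,920 | 0.3427 |
| 5 | 1,568 | 0.1371 |
| 6 | 224 | 0.0196 |
| 7 | 8 | 0.0007 |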

Estimate and p-value

| **$t_1$** | **Count** | **P[Table]** |
| --- | --- | --- |
| 0 | 8 | 0.0007 |
| 1 | 224 | 0.0196 |
| 2 | 1,568 | 0.1371 |
| 3 | 3,920 | 0.3427 |
| 4 | 3,920 | 0.3427 |
| 5 | 1,568 | 0.1371 |
| *6* | *224* | *0.0196* |
| *7* | *8* | *0.0007* |
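Given the counts for each value of $t_1$, the conditional maximum likelihood estimate and the one-sided exact p-value follow from the conditional distribution alone. The sketch below uses a plain bisection on the score equation $E_\beta[T_1] = t_1$, which is one simple way to find the CMLE when the observed $t_1$ is not at the edge of its range; it is not necessarily how any particular package does it, and all names are illustrative.

```python
from math import comb, exp

# Counts of tables for each t1 in the graft rejection reference set.
counts = {t1: comb(8, 7 - t1) * comb(8, t1) for t1 in range(8)}
t1_obs = 6

def cond_probs(beta):
    """Conditional distribution of T1 given t0 = 7 at a given beta."""
    weights = {u: c * exp(beta * u) for u, c in counts.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}

def cond_mean(beta):
    return sum(u * p for u, p in cond_probs(beta).items())

# The conditional mean increases with beta, so bisection finds the root.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if cond_mean(mid) < t1_obs:
        lo = mid
    else:
        hi = mid
beta_hat = (lo + hi) / 2

# One-sided exact p-value under beta = 0: tables with t1 >= observed.
p_one_sided = sum(p for u, p in cond_probs(0.0).items() if u >= t1_obs)

print(f"CMLE of beta = {beta_hat:.3f}, one-sided p = {p_one_sided:.4f}")
```

The one-sided p-value is the sum of the two highlighted rows and matches the one-sided Fisher's exact test.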

Confidence Interval • Lower bound, $\beta_-$ – If $t_1 = t_{1,\min}$, then $\beta_- = -\infty$ – Otherwise $\beta_-$ is the value of $\beta$ that produces an upper p-value of $\alpha/2$ • Upper bound, $\beta_+$ – If $t_1 = t_{1,\max}$, then $\beta_+ = \infty$ – Otherwise $\beta_+$ is the value of $\beta$ that produces a lower p-value of $\alpha/2$
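Applied to the graft rejection data ($t_1 = 6$, which is neither the minimum nor the maximum), both bounds are finite and can be found numerically. The following is a sketch using bisection, with a 95% level assumed and helper names (`tail`, `solve`) that are purely illustrative.

```python
from math import comb, exp

counts = {t1: comb(8, 7 - t1) * comb(8, t1) for t1 in range(8)}
t1_obs, alpha = 6, 0.05

def tail(beta, upper):
    """P(T1 >= t1_obs) if upper else P(T1 <= t1_obs), at a given beta."""
    weights = {u: c * exp(beta * u) for u, c in counts.items()}
    total = sum(weights.values())
    keep = (lambda u: u >= t1_obs) if upper else (lambda u: u <= t1_obs)
    return sum(w for u, w in weights.items() if keep(u)) / total

def solve(f, target, increasing, lo=-20.0, hi=20.0):
    """Bisection for f(beta) = target, with f monotone in beta."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Lower bound: the upper p-value (increasing in beta) equals alpha / 2.
beta_lower = solve(lambda b: tail(b, True), alpha / 2, increasing=True)
# Upper bound: the lower p-value (decreasing in beta) equals alpha / 2.
beta_upper = solve(lambda b: tail(b, False), alpha / 2, increasing=False)

print(f"95% exact CI for beta: ({beta_lower:.3f}, {beta_upper:.3f})")
```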

Final Stats for Graft Rejection

Example 2 PECARN C-Spine Study

Case Control Study

| | Control | Case | Total |
| --- | --- | --- | --- |
| Not Present | 1,057 | 540 | 1,597 |
| Present | 2 | 0 | 2 |
| Total | 1,059 | 540 | 1,599 |

Any problems estimating the odds ratio?

Could exact logistic regression help?

What sufficient statistics are needed?

| | Control (Y = 0) | Case (Y = 1) | Total |
| --- | --- | --- | --- |
| **Not Present (X = 0)** | 1,057 | 540 | 1,597 |
| **Present (X = 1)** | 2 | 0 | 2 |
| **Total** | 1,059 | 540 | 1,599 |

• $\sum y = 2$ • $\sum xy = 0$

Conditional Density

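| **Case P** | **Case NP** | **Ctrl P** | **Ctrl NP** | **$t_0$** | **$t_1$** | **Count** | **P[Table]** |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *0* | *540* | *2* | *1,057* | *2* | *0* | *560,211* | *0.438* |
| 1 | 539 | 1 | 1,058 | 2 | 1 | 571,860 | 0.448 |
| 2 | 538 | 0 | 1,059 | 2 | 2 | 145,530 | 0.114 |
| **Total** | | | | | | *1,277,601* | *1.000* |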

One-sided p-value = 0.438

Two-sided p-value = 2*0.438 = 0.876

95% confidence interval $(-\infty, 2.345)$

Point estimate?

Median Unbiased Estimate
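When the observed $t_1$ sits at the edge of its range, as in the PECARN table ($t_1 = 0$), the conditional likelihood has no finite maximum, so a median unbiased estimate is used instead. The sketch below assumes one common convention for that case: the estimate is the $\beta$ at which the lower p-value equals 0.5. That convention is an assumption here, not something stated in the transcript, and all names are illustrative. The same code also recomputes the one-sided p-value and the upper confidence bound.

```python
from math import comb, exp

n_case, n_ctrl = 540, 1059
t0, t1_obs = 2, 0   # two patients with the finding, none of them cases

# Counts for each possible number of cases with the finding present.
counts = {u: comb(n_case, u) * comb(n_ctrl, t0 - u) for u in range(t0 + 1)}
total = sum(counts.values())

# One-sided exact p-value under beta = 0.
p_one_sided = counts[t1_obs] / total

def lower_tail(beta):
    """P(T1 <= t1_obs) at a given beta (decreasing in beta)."""
    weights = {u: c * exp(beta * u) for u, c in counts.items()}
    return (sum(w for u, w in weights.items() if u <= t1_obs)
            / sum(weights.values()))

def solve_decreasing(f, target, lo=-30.0, hi=30.0):
    """Bisection for f(beta) = target when f decreases in beta."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

beta_upper = solve_decreasing(lower_tail, 0.025)  # upper 95% bound
beta_mue = solve_decreasing(lower_tail, 0.5)      # assumed MUE convention

print(f"p = {p_one_sided:.3f}, upper bound = {beta_upper:.3f}, "
      f"MUE = {beta_mue:.3f}")
```

The upper bound should land close to the 2.345 reported for the confidence interval.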

One More Example Dose Response

Toxicology Experiment • 400 mice randomized to one of four levels of a drug • Drug administered to each animal • Outcome is the number of deaths in each dose level

| Dose | Lived | Died | Total |
| --- | --- | --- | --- |
| **0** | 99 | 1 | 100 |
| **1** | 97 | 3 | 100 |
| **2** | 95 | 5 | 100 |
| **3** | 90 | 10 | 100 |
| **Total** | 381 | 19 | 400 |

$\sum y = 19$, $\sum xy = 3 + 10 + 30 = 43$
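With grouped data, both sufficient statistics are simple weighted sums over dose levels. A minimal sketch, using list names chosen for illustration:

```python
# Dose levels and deaths per dose group, from the toxicology table.
doses = [0, 1, 2, 3]
died = [1, 3, 5, 10]

# t0: sufficient statistic for the intercept (total deaths).
t0 = sum(died)
# t1: sufficient statistic for the dose coefficient (dose-weighted deaths).
t1 = sum(x * y for x, y in zip(doses, died))

print(t0, t1)
```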

Exact vs. Unconditional

| | **Exact** | **Unconditional** |
| --- | --- | --- |
| Estimate | 0.710 | 0.712 |
| SE | 0.246 | 0.246 |
| OR | 2.03 | 2.04 |
| CI | (1.26, 3.52) | (1.26, 3.30) |
| p-value | 0.002 | 0.004 |

Computational Issues

Counting All the Tables • One of the main hurdles for conditional logistic regression is counting all the tables in the sample space – Graft rejections – 11,440 possibilities – PECARN C-Spine – 1,277,601 – Toxicology – $2.79 \times 10^{33}$ • Obviously don’t want to generate tables one at a time
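The size of a sample space can be obtained without listing its tables by accumulating binomial coefficients one group at a time. The sketch below does this for the two smaller examples; `count_tables` is a hypothetical helper written for illustration, not a routine from any package.

```python
from math import comb

def count_tables(group_sizes, t0):
    """Count outcome assignments with exactly t0 successes, accumulated
    group by group as a convolution of binomial coefficients."""
    ways = [1] + [0] * t0   # ways[s]: arrangements with s successes so far
    for n in group_sizes:
        new = [0] * (t0 + 1)
        for s, w in enumerate(ways):
            if w == 0:
                continue
            for k in range(min(n, t0 - s) + 1):
                new[s + k] += w * comb(n, k)
        ways = new
    return ways[t0]

graft = count_tables([8, 8], 7)          # young and older groups
pecarn = count_tables([540, 1059], 2)    # cases and controls

print(graft, pecarn)
```

The totals agree with the single binomial coefficients `comb(16, 7)` and `comb(1599, 2)`, which is why the count grows so quickly with sample size.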

Network Algorithm • Graphical representation of the sample space • Nodes represent a partial sum of the sufficient statistic • Arcs have combinatorial weighting value • One path through the graph represents a table in the sample space

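Example

| | Y = 0 | Y = 1 | Total |
| --- | --- | --- | --- |
| **X = 1** | 3 | 0 | 3 |
| **X = 2** | 2 | 1 | 3 |
| **X = 3** | 2 | 1 | 3 |
| **X = 4** | 1 | 2 | 3 |
| **Total** | 8 | 4 | 12 |

Sufficient Statistics • $t_0 = \sum y_i = 4$ • $t_1 = \sum x_i y_i$ = 1 × 0 + 2 × 1 + 3 × 1 + 4 × 2 = 13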

[Figure: network with one path highlighted] The path (0,0) → (1,2) → (2,2) → (3,4) → (4,4) corresponds to the table:

| | Y = 0 | Y = 1 |
| --- | --- | --- |
| **X = 1** | 1 | 2 |
| **X = 2** | 3 | 0 |
| **X = 3** | 1 | 2 |
| **X = 4** | 3 | 0 |
| **Total** | 8 | 4 |

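[Figure: network with the observed table's path highlighted] The path (0,0) → (1,0) → (2,1) → (3,2) → (4,4) corresponds to the observed table:

| | Y = 0 | Y = 1 |
| --- | --- | --- |
| **X = 1** | 3 | 0 |
| **X = 2** | 2 | 1 |
| **X = 3** | 2 | 1 |
| **X = 4** | 1 | 2 |
| **Total** | 8 | 4 |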

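Network Representation of the Sample Space

[Figure: network of nodes, by stage] • Stage 0: (0,0) • Stage 1: (1,0), (1,1), (1,2), (1,3) • Stage 2: (2,0), (2,1), (2,2), (2,3), (2,4) • Stage 3: (3,1), (3,2), (3,3), (3,4) • Stage 4: (4,4)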
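The idea behind the network can be illustrated with a small stage-by-stage recursion over the four-group example. This is a simplified sketch, not the network algorithm as implemented in any package: each state tracks the partial sums of $y$ and $xy$, each arc is weighted by a binomial coefficient, and states that can no longer reach $t_0 = 4$ are pruned. All names are illustrative.

```python
from collections import defaultdict
from math import comb

x_levels = [1, 2, 3, 4]
group_sizes = [3, 3, 3, 3]
t0, t1_obs = 4, 13

# State (partial sum of y, partial sum of x*y) -> number of partial tables.
states = {(0, 0): 1}
nodes_by_stage = [{0}]   # partial sums of y reachable at each stage

for stage, (x, n) in enumerate(zip(x_levels, group_sizes)):
    remaining = sum(group_sizes[stage + 1:])
    nxt = defaultdict(int)
    for (s0, s1), ways in states.items():
        for k in range(min(n, t0 - s0) + 1):
            # Prune nodes from which t0 successes can no longer be reached.
            if s0 + k + remaining < t0:
                continue
            # Arc weight: ways to pick k successes among n in this group.
            nxt[(s0 + k, s1 + x * k)] += ways * comb(n, k)
    states = dict(nxt)
    nodes_by_stage.append({s0 for s0, _ in states})

# Conditional distribution of t1 given t0 = 4 (up to normalization).
counts = {s1: ways for (s0, s1), ways in sorted(states.items())}
total = sum(counts.values())
p_obs = counts[t1_obs] / total

print(nodes_by_stage)
print(total, counts)
```

The reachable partial sums at each stage match the node labels in the figure, and the total number of tables equals `comb(12, 4)`.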

What About Multiple Covariates?

More Conditioning!

Osteogenic Sarcoma (LogXact Manual) • 46 patients surgically treated for osteogenic sarcoma and then observed for disease recurrence within 3 years • Covariates – Sex: Male = 1, Female = 0 – Any Osteoid Pathology (AOP) • Present = 1, not = 0 • Interested in the effect of AOP

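Osteogenic Sarcoma

| Covariate Group | Sex ($x_1$) | AOP ($x_2$) | No Recurrence (y = 0) | Recurrence (y = 1) | Group Size ($n_i$) |
| --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 8 | 0 | 8 |
| 2 | 0 | 1 | 5 | 2 | 7 |
| 3 | 1 | 0 | 9 | 4 | 13 |
| 4 | 1 | 1 | 7 | 11 | 18 |
| Total | | | 29 | 17 | 46 |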

Estimating the Effect of AOP • New statistics to condition on – Group sizes – Sufficient statistic for the intercept, $\sum y = 17$ – Sufficient statistic for the coefficient for sex, $\sum x_1 y = 15$ • Calculate the conditional distribution of $\sum x_2 y$ – Sufficient statistic for the coefficient for AOP – Number of cases with AOP in recurrence (= 13) – Given exactly 17 with recurrence, 15 of whom are males

Network Algorithm • The network algorithm uses two passes – First pass conditions on the intercept • All tables with exactly 17 cases in recurrence – Second pass removes arcs that don’t produce the sufficient statistic for sex • All tables that don’t have 15 males in recurrence • Proceed with estimation & inference as before

$P[\sum x_2 y = t_2 \mid \text{17 in recurrence and 15 males}]$
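For a data set this small, the conditional distribution can be obtained by brute-force enumeration instead of the network algorithm. The sketch below assumes the four covariate groups are (female, no AOP), (female, AOP), (male, no AOP), and (male, AOP) with sizes 8, 7, 13, and 18, which is consistent with the sufficient statistics quoted above; the variable names are illustrative.

```python
from itertools import product
from math import comb

# (sex, AOP, group size) for each covariate group; layout is an assumption.
groups = [(0, 0, 8), (0, 1, 7), (1, 0, 13), (1, 1, 18)]
t0, t_sex, t_aop_obs = 17, 15, 13

dist = {}
for ys in product(*(range(n + 1) for _, _, n in groups)):
    # Keep only tables matching both conditioning statistics.
    if sum(ys) != t0:
        continue
    if sum(y for (sex, _, _), y in zip(groups, ys) if sex == 1) != t_sex:
        continue
    t_aop = sum(y for (_, aop, _), y in zip(groups, ys) if aop == 1)
    weight = 1
    for (_, _, n), y in zip(groups, ys):
        weight *= comb(n, y)   # ways to choose which patients recur
    dist[t_aop] = dist.get(t_aop, 0) + weight

total = sum(dist.values())
probs = {t: w / total for t, w in sorted(dist.items())}
# One-sided exact p-value under a zero AOP coefficient.
p_upper = sum(p for t, p in probs.items() if t >= t_aop_obs)

print(probs)
print(p_upper)
```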

Results

LR Test for Both Variables • To test that the coefficients for sex and AOP are both zero simultaneously, need the joint conditional density – All possible combinations of males and patients with AOP in recurrence, given exactly 17 patients in recurrence – Determine how rare it is to have 15 recurrent males AND 13 recurrent AOP patients

SAS Examples

Conclusion • Exact (conditional) logistic regression – Useful method when asymptotic assumptions are not met or with separation – Utilizes conditioning to remove nuisance parameters from the likelihood – Very computationally intensive method – Network algorithm speeds up calculations

Questions?