
Lecture 17: Regression for Case-control Studies BMTRY 701 Biostatistical Methods II

Old business: Comparing AUCs

• Good reference: Hanley and McNeil, “Comparing AUCs for ROC curves based on the same data”
• See class website for pdf.

Additional Reading in Logistic Regression

• Hosmer and Lemeshow, Applied Logistic Regression
• http://en.wikipedia.org/wiki/Logistic_regression
• http://luna.cas.usf.edu/~mbrannic/files/regression/Logistic.html
• http://www.statgun.com/tutorials/logistic regression.html
• http://www.bus.utk.edu/stat/Stat579/Logistic%20Regression.pdf
• Etc: Google “logistic regression”

Case Control Studies in Logistic Regression

• http://www.oxfordjournals.org/our_journals/tropej/online/ma_chap11.pdf
• How is a case-control study performed?
• What is the outcome and what is the predictor in the regression setting?

Recall the simple 2x2 example

• Odds ratio for 2x2 table can be used in case control studies
• Similarly, the logistic regression model can be used treating ‘case’ status as the outcome.
• It has been shown that the odds ratio estimates do not depend on the sampling (i.e., cohort vs. case-control study); only the intercept does.

Example: Case control study of HPV and Oropharyngeal Cancer

• Gillison et al. (http://content.nejm.org/cgi/content/full/356/19/1944)
• 100 cases with oropharyngeal cancer and 200 controls without
• How was the sampling done?

Data on Case vs. HPV

> table(data$hpv16ser, data$control)

      0   1
  0 186  43
  1  14  57

> epitab(data$hpv16ser, data$control)
$tab
         Outcome
Predictor   0   p0  1   p1 oddsratio   lower    upper      p.value
        0 186 0.93 43 0.43   1.00000      NA       NA           NA
        1  14 0.07 57 0.57  17.61130 8.99258 34.49041 4.461359e-21
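The odds ratio and interval reported by epitab can be checked by hand from the four cell counts. The sketch below is only an illustration in Python (not the R session used in the lecture). It assumes that column 1 of `data$control` holds the 100 cases and column 0 the 200 controls, and that the interval is a Wald interval on the log odds ratio scale. The variable names are made up for the example.

```python
import math

# Cell counts from the 2x2 table (rows: hpv16ser, columns: data$control).
# Assumption: column 1 = cases (n = 100), column 0 = controls (n = 200).
unexposed_controls, unexposed_cases = 186, 43
exposed_controls, exposed_cases = 14, 57

# Cross-product odds ratio: odds of exposure in cases vs. controls.
odds_ratio = (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)

# Wald standard error of log(OR): square root of the summed reciprocal counts.
se_log_or = math.sqrt(
    1 / unexposed_controls + 1 / unexposed_cases
    + 1 / exposed_controls + 1 / exposed_cases
)

# 95% Wald interval, computed on the log scale and then exponentiated.
z = 1.959964
lower = math.exp(math.log(odds_ratio) - z * se_log_or)
upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"OR = {odds_ratio:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The same odds ratio is what an unmatched logistic regression of case status on HPV16 serostatus returns after exponentiating the slope.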

Multiple Logistic Regression

• This is not a ‘randomized’ study
• There are lots of other predictors that may be associated with the cancer
• Examples:
  • smoking
  • alcohol
  • age
  • gender

Fit the model:

• Write down the model
  • assume main effects of tobacco, alcohol and their interaction
• What is the likelihood function?

 What are the MLEs?
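One way to write this down, as a sketch rather than the notation used in class: let $T_i$ and $A_i$ be hypothetical 0/1 indicators for tobacco and alcohol use, $y_i$ the case indicator, and $N$ the total number of subjects. These symbols are assumptions introduced here for illustration.

```latex
\[
\operatorname{logit}(p_i) = \log\frac{p_i}{1-p_i}
  = \beta_0 + \beta_1 T_i + \beta_2 A_i + \beta_3 T_i A_i
\]
\[
L(\beta_0,\beta_1,\beta_2,\beta_3) = \prod_{i=1}^{N} p_i^{\,y_i}\,(1-p_i)^{1-y_i}
\]
```

The MLEs are the values of the coefficients that maximize this likelihood. In general there is no closed form, so software finds them iteratively.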

How do we interpret the results?

 Is there an effect of tobacco?

 Is there an effect of alcohol?

 Is there an interaction?

Interpreting the interaction

 What is the OR for smoker/non-drinker versus a non-smoker/non-drinker?

 What is the OR for a smoker/drinker versus a non-smoker/drinker?
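As a sketch, suppose the model is $\operatorname{logit}(p) = \beta_0 + \beta_1 T + \beta_2 A + \beta_3 T A$, where $T$ (tobacco) and $A$ (alcohol) are hypothetical 0/1 indicators introduced here for illustration. The two odds ratios then follow by subtracting the linear predictors being compared:

```latex
\[
\text{smoker/non-drinker vs. non-smoker/non-drinker:}\quad
\mathrm{OR} = \frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{\beta_1}
\]
\[
\text{smoker/drinker vs. non-smoker/drinker:}\quad
\mathrm{OR} = \frac{e^{\beta_0+\beta_1+\beta_2+\beta_3}}{e^{\beta_0+\beta_2}} = e^{\beta_1+\beta_3}
\]
```

The ratio of these two odds ratios is $e^{\beta_3}$, so a test of $\beta_3 = 0$ is a test of whether the smoking effect is the same in drinkers and non-drinkers.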

How can we assess if the effect of smoking differs by HPV status?

How likely is it that someone who smokes and drinks will get oropharyngeal cancer?

 How can we estimate the chance?
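A caution when answering this: in a case-control study the ratio of cases to controls is fixed by design, so a fitted probability reflects the sampling, not the population risk. Under the usual logistic model only the intercept is affected: the fitted intercept equals the population intercept plus log(sampling fraction of cases / sampling fraction of controls). The sketch below illustrates the correction in Python. It uses the HPV-only coefficients from the unmatched logistic fit in this lecture, since no tobacco/alcohol estimates are given. The sampling fractions are hypothetical numbers chosen for illustration, not values from the Gillison study.

```python
import math

def expit(z):
    """Inverse logit: map a log odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients from the unmatched logistic fit (log odds scale).
intercept_cc = -1.464547   # intercept as estimated from the case-control sample
slope_hpv = 2.868541       # log odds ratio for HPV16 seropositivity

# Naive "risk" for an HPV-positive subject: just the share of cases in the sample.
naive_prob = expit(intercept_cc + slope_hpv)

# HYPOTHETICAL sampling fractions (assumptions for illustration only):
# half of all cases and 1 in 2000 non-cases were enrolled.
frac_cases_sampled = 0.5
frac_controls_sampled = 0.0005

# Remove the design offset from the intercept; the slope is left unchanged.
intercept_pop = intercept_cc - math.log(frac_cases_sampled / frac_controls_sampled)
corrected_prob = expit(intercept_pop + slope_hpv)

print(f"naive: {naive_prob:.4f}, corrected: {corrected_prob:.5f}")
```

Without outside information on the sampling fractions (or the population disease rate), the absolute chance cannot be recovered from case-control data alone; the odds ratios can.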

Matched case control studies

• References:
  • Hosmer and Lemeshow, Applied Logistic Regression
  • http://staff.pubhealth.ku.dk/~bxc/SPE.2002/Slides/mcc.pdf
  • http://staff.pubhealth.ku.dk/~bxc/Talks/Nested Matched-CC.pdf
  • http://www.tau.ac.il/cc/pages/docs/sas8/stat/chap49/sect35.htm
  • http://www.ats.ucla.edu/stat/sas/library/logistic.pdf (beginning page 5)

Matched design

• Matching on important factors is common
• OP cancer:
  • age
  • gender
• Why?
  • forces the distribution to be the same on those variables
  • removes any effects of those variables on the outcome
  • eliminates confounding

1-to-M matching

• For each ‘case’, there is a matched ‘control’
• The process usually dictates that the case is enrolled first, then a control is identified
• For particularly rare diseases or when large N is required, often use more than one control per case

Logistic regression for matched case control studies

• Recall independence

$$ y_i \overset{iid}{\sim} \mathrm{Bern}(p_i), \qquad y_i \overset{iid}{\sim} \mathrm{Bern}\!\left(\frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}}\right) $$

• But, if cases and controls are matched, are they still independent?

Solution: treat each matched set as a stratum

• one-to-one matching: 1 case and 1 control per stratum
• one-to-M matching: 1 case and M controls per stratum
• Logistic model per stratum: within stratum, independence holds.

$$ p_k(x_i) = \frac{e^{\alpha_k+\beta x_i}}{1+e^{\alpha_k+\beta x_i}} $$

• We assume that the OR for x and y is constant across strata

How many parameters is that?

• Assume sample size is 2n and we have 1-to-1 matching:
• n strata + p covariates = n+p parameters
• This is problematic:
  • as n gets large, so does the number of parameters
  • too many parameters to estimate and a problem of precision
• but, do we really care about the strata-specific intercepts?

“NUISANCE PARAMETERS”

Conditional logistic regression

• To avoid estimation of the intercepts, we can condition on the study design.
• Huh?
• Think about each stratum:
  • how many cases and controls?
  • what is the probability that the case is the case and the control is the control?
  • what is the probability that the control is the case and the case the control?
• For each stratum, the likelihood contribution is based on this conditional probability

Conditioning

• For 1 to 1 matching: with two individuals in stratum k where y indicates case status (1 = case, 0 = control), condition on the stratum containing exactly one case:

$$ P(y_{1k}=1,\, y_{2k}=0 \mid y_{1k}+y_{2k}=1) = \frac{P(y_{1k}=1,\, y_{2k}=0)}{P(y_{1k}=1,\, y_{2k}=0) + P(y_{1k}=0,\, y_{2k}=1)} $$

• Write as a likelihood contribution for stratum k:

$$ L_k = \frac{P(y_{1k}=1 \mid x_{1k})\,P(y_{2k}=0 \mid x_{2k})}{P(y_{1k}=1 \mid x_{1k})\,P(y_{2k}=0 \mid x_{2k}) + P(y_{1k}=0 \mid x_{1k})\,P(y_{2k}=1 \mid x_{2k})} $$

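Likelihood function for CLR

Substitute in our logistic representation of p and simplify:

$$
\begin{aligned}
L_k &= \frac{P(y_{1k}=1 \mid x_{1k})\,P(y_{2k}=0 \mid x_{2k})}{P(y_{1k}=1 \mid x_{1k})\,P(y_{2k}=0 \mid x_{2k}) + P(y_{1k}=0 \mid x_{1k})\,P(y_{2k}=1 \mid x_{2k})} \\[6pt]
&= \frac{\dfrac{e^{\alpha_k+\beta x_{1k}}}{1+e^{\alpha_k+\beta x_{1k}}}\cdot\dfrac{1}{1+e^{\alpha_k+\beta x_{2k}}}}
        {\dfrac{e^{\alpha_k+\beta x_{1k}}}{1+e^{\alpha_k+\beta x_{1k}}}\cdot\dfrac{1}{1+e^{\alpha_k+\beta x_{2k}}}
         + \dfrac{1}{1+e^{\alpha_k+\beta x_{1k}}}\cdot\dfrac{e^{\alpha_k+\beta x_{2k}}}{1+e^{\alpha_k+\beta x_{2k}}}} \\[6pt]
&= \frac{e^{\alpha_k+\beta x_{1k}}}{e^{\alpha_k+\beta x_{1k}} + e^{\alpha_k+\beta x_{2k}}} \\[6pt]
&= \frac{e^{\beta x_{1k}}}{e^{\beta x_{1k}} + e^{\beta x_{2k}}}
\end{aligned}
$$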

Likelihood function for CLR

• Now, take the product over all the strata for the full likelihood

$$ L(\beta) = \prod_{k=1}^{n} L_k = \prod_{k=1}^{n} \frac{e^{\beta x_{1k}}}{e^{\beta x_{1k}} + e^{\beta x_{2k}}} $$

• This is the likelihood for the matched case-control design
• Notice:
  • there are no strata-specific parameters
  • cases are defined by subscript ‘1’ and controls by subscript ‘2’
• Theory for 1-to-M follows similarly (but not shown here)
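To make the 1-to-1 conditional likelihood concrete, the sketch below codes its log for a single covariate and maximizes it numerically. It is an illustration in Python, not the software used in the lecture. The pair counts are made-up data, and `clr_loglik` and `maximize` are hypothetical helper names. For a binary exposure, the conditional MLE of the odds ratio is known to equal the ratio of the two kinds of discordant pairs, which gives a check on the numerical answer.

```python
import math

def clr_loglik(beta, pairs):
    """Conditional log-likelihood for 1-to-1 matching.

    pairs is a list of (x_case, x_control) tuples, one per stratum.
    Each stratum contributes log( e^{beta*x1} / (e^{beta*x1} + e^{beta*x2}) ).
    """
    total = 0.0
    for x_case, x_control in pairs:
        total += beta * x_case - math.log(
            math.exp(beta * x_case) + math.exp(beta * x_control)
        )
    return total

def maximize(f, lo=-10.0, hi=10.0, tol=1e-9):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Made-up matched pairs with a binary exposure (illustration only):
# 30 pairs with only the case exposed, 10 with only the control exposed,
# and 60 concordant pairs, which add a constant and carry no information on beta.
pairs = [(1, 0)] * 30 + [(0, 1)] * 10 + [(1, 1)] * 15 + [(0, 0)] * 45

beta_hat = maximize(lambda b: clr_loglik(b, pairs))
closed_form = math.log(30 / 10)  # log of the discordant-pair ratio

print(f"numerical beta = {beta_hat:.4f}, closed form = {closed_form:.4f}")
print(f"conditional OR = {math.exp(beta_hat):.3f}")
```

No stratum intercepts appear anywhere in the code, which mirrors the point that conditioning removes the nuisance parameters.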

Interpretation of β

• Same as in ‘standard’ logistic regression
• β represents the log odds ratio of disease for a one-unit difference in x

When to use matched vs. unmatched?

• Some papers use both for a matched design
• Tradeoffs:
  • bias
  • precision
• Sometimes a matched design is used to ensure balance, but then an unmatched analysis is performed
• They WILL give you different answers
• Gillison paper

Another approach to matched data

• use random effects models
• CLR is elegant and simple
• can identify the estimates using a ‘transformation’ of logistic regression results
• But, with new age of computing, we have other approaches
• Random effects models:
  • allow strata specific intercepts
  • not problematic estimation process
  • additional assumptions: intercepts follow normal distribution
  • Will NOT give identical results

. xi: clogit control hpv16ser, group(strata) or

Iteration 0:   log likelihood = -72.072957
Iteration 1:   log likelihood = -71.803221
Iteration 2:   log likelihood = -71.798737
Iteration 3:   log likelihood = -71.798736

Conditional (fixed-effects) logistic regression   Number of obs   =        300
                                                  LR chi2(1)      =      76.12
                                                  Prob > chi2     =     0.0000
Log likelihood = -71.798736                       Pseudo R2       =     0.3465

------------------------------------------------------------------------------
     control | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    hpv16ser |   13.16616   4.988492     6.80   0.000      6.26541    27.66742
------------------------------------------------------------------------------

. xi: logistic control hpv16ser

Logistic regression                               Number of obs   =        300
                                                  LR chi2(1)      =      90.21
                                                  Prob > chi2     =     0.0000
Log likelihood = -145.8514                        Pseudo R2       =     0.2362

------------------------------------------------------------------------------
     control | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    hpv16ser |    17.6113   6.039532     8.36   0.000     8.992582     34.4904
------------------------------------------------------------------------------

. xi: gllamm control hpv16ser, i(strata) family(binomial)

number of level 1 units = 300
number of level 2 units = 100

Condition Number = 2.4968508

gllamm model

log likelihood = -145.8514

OR = exp(2.868541) = 17.61

------------------------------------------------------------------------------
     control |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    hpv16ser |   2.868541   .3429353     8.36   0.000       2.1964    3.540681
       _cons |  -1.464547   .1692104    -8.66   0.000    -1.796193     -1.1329
------------------------------------------------------------------------------

Variances and covariances of random effects
------------------------------------------------------------------------------

***level 2 (strata)

    var(1): 4.210e-21 (2.231e-11)
------------------------------------------------------------------------------