Dependent samples/ Survival analysis


Dichotomous and survival outcomes

Brian Healy, PhD

Comments from previous class

Book suggestion/practice problems
– Fundamentals of Biostatistics by Bernard Rosner
– Available in Countway
How do I know when I need help?
– When you think your project is more complicated than what we have discussed
   Correlated observations
   Skewed data
– This class is designed to help you do basic analysis, but also to help you communicate with a statistician

Objectives

Dichotomous outcome
– Chi-square test
– Logistic regression
Survival analysis
– Log-rank test

Quick aside

What is the most common proportion to see in the news?
– Political polling
Most polls look like this:
– 50% of people support Scott Brown
– 45% of people support Martha Coakley
– Margin of error +/- 3%

Margin of error

What does the margin of error tell us?
– The plausible values for the true proportion accounting for sampling variability (chance)
What does the margin of error not tell us?
– Sample design
   Who was sampled?
   How was sampling done?
– Was there any missing data?
– Were all people treated the same?
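
As a rough illustration of where a margin of error comes from, here is a minimal sketch using the normal approximation for a proportion. The sample size of 1000 is an assumption for illustration only; the slides do not state how many people were polled.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 50% support among n = 1000 respondents (n is assumed).
moe = margin_of_error(0.50, 1000)
print(f"margin of error: +/- {100 * moe:.1f} percentage points")
```

This only captures sampling variability; none of the design issues listed above enter the formula.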

Statement regarding accuracy

For confidence interval:
– We are 95% confident that the true parameter value lies within our confidence bounds
For polling (from http://www.pollingreport.com/sampling.htm)
– “In theory, with a sample of this size, one can say with 95 percent certainty that the results have a statistical precision of plus or minus __ percentage points of what they would be if the entire adult population had been polled with complete accuracy. Unfortunately, there are several other possible sources of error in all polls or surveys that are probably more serious than theoretical calculations of sampling error. They include refusals to be interviewed (non-response), question wording and question order, interviewer bias, weighting by demographic control data, and screening (e.g., for likely voters). It is difficult or impossible to quantify the errors that may result from these factors.”

Review

Steps for hypothesis test
– How do we set up a null hypothesis?
Choosing the right test
– Continuous outcome/dichotomous predictor: Two sample t-test
– Continuous outcome/categorical predictor: ANOVA
– Continuous outcome/continuous predictor: Correlation or regression

Types of analysis-independent samples

Outcome         Explanatory    Analysis
Continuous      Dichotomous    t-test, Wilcoxon test, linear regression
Continuous      Categorical    ANOVA, linear regression
Continuous      Continuous     Correlation, linear regression
Dichotomous     Dichotomous    Chi-square test, logistic regression
Dichotomous     Continuous     Logistic regression
Time to event   Dichotomous    Log-rank test

Dichotomous outcome

Sustained disease progression in MS is often defined as a one-unit increase on EDSS that lasts for at least six months
This is a common outcome in clinical trials and observational studies
Patients are often classified as progressed or not progressed, which is a dichotomous outcome

Example

MS is known to have a genetic component
Several single nucleotide polymorphisms have been associated with susceptibility to MS
Question: Do patients with susceptibility SNPs experience more sustained progression than patients without susceptibility SNPs?

Data

Initially, we will focus on presence vs. absence of SNPs
Among our 190 treated patients, 74 had the SNP and 116 did not
– 12 patients with the SNP experienced sustained progression: $\hat{p}_{SNP+} = 12/74 = 0.162$
– 13 patients without the SNP experienced sustained progression: $\hat{p}_{SNP-} = 13/116 = 0.112$

Contingency table

A common way to look at this data is a 2x2 table
Does the SNP have an effect on whether or not patients progress?

         Prog   No prog   Total
SNP+       12        62      74
SNP-       13       103     116
Total      25       165     190

Question

In our analysis, we assume that the margins are set
Under the null hypothesis of no relationship between the two variables, what would we expect the values in the table to be?

Example

As an example, use this table

         Prog                No prog             Total
SNP+     50*100/200 = 25     50*100/200 = 25        50
SNP-     150*100/200 = 75    150*100/200 = 75      150
Total    100                 100                   200

Expected table

Expected table for our analysis

         Prog                  No prog                  Total
SNP+     25*74/190 = 9.74      165*74/190 = 64.3           74
SNP-     25*116/190 = 15.3     165*116/190 = 100.7        116
Total    25                    165                        190

How different is our observed data compared to the expected table?
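
The expected counts can be reproduced from the margins alone. The sketch below is illustrative; the function name `expected_counts` is ours, not from the slides.

```python
# Observed 2x2 table from the slides: rows are SNP+ and SNP-, columns are Prog and No prog.
observed = [[12, 62], [13, 103]]

def expected_counts(table):
    """Expected cell counts under no association: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for row in expected:
    print([round(x, 2) for x in row])
```

Note that the expected table has the same row and column totals as the observed table.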

Does our data show an effect?

To test for an association between the outcome and the predictor, we would like to know if our observed table was different from the expected table under the null hypothesis
How could we investigate if our table was different?

\[ \chi^2 = \sum_{\text{cells}} \frac{(O_i - E_i)^2}{E_i} \]

This quantity has a chi-square distribution
If it is large, it implies a large difference from the expected
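
As a worked check, the sum over the four cells can be computed directly from the observed counts and the expected counts given earlier. This is a plain-Python sketch, not the course's Stata output.

```python
# Cells in the order: SNP+/Prog, SNP+/No prog, SNP-/Prog, SNP-/No prog.
observed = [12, 62, 13, 103]
expected = [25 * 74 / 190, 165 * 74 / 190, 25 * 116 / 190, 165 * 116 / 190]

# Sum of (O - E)^2 / E over all cells.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))
```

The result agrees with the summary statistic of 0.99 reported in the hypothesis test for these data.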

Critical information for $\chi^2$

For 1 degree of freedom, cut-off for $\alpha = 0.05$ is 3.84
– For normal distribution, this is 1.96
– Note $1.96^2 = 3.84$
Inherently two-sided since it is squared
Has problems with small cell counts
– Fix: Fisher’s exact test
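
The link between 1.96 and 3.84 also gives a simple way to get a p-value for 1 degree of freedom: a chi-square variable with 1 degree of freedom is a squared standard normal. The sketch below uses only the standard library; the helper name `chi2_sf_1df` is ours and it is valid for 1 degree of freedom only.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail area of a chi-square distribution with 1 degree of freedom.

    If Z is standard normal, Z**2 is chi-square(1), so
    P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2))

print(round(1.96 ** 2, 2))            # square of the normal cut-off
print(round(chi2_sf_1df(3.84), 3))    # area to the right of 3.84
print(round(chi2_sf_1df(0.99), 2))    # p-value for a statistic of 0.99
```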

Chi-square distribution

[Figure: chi-square density with 1 degree of freedom; the area to the right of $\chi^2 = 3.84$ is 0.05]

Hypothesis test with $\chi^2$

1) $H_0$: No association between SNP and progression
2) Dichotomous outcome, dichotomous predictor
3) $\chi^2$ test
4) Summary statistic: $\chi^2 = 0.99$
5) p-value=0.32
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression

[Figure: software output with the $\chi^2$ statistic and the p-value labeled]

Question: Why 1 degree of freedom?

We used a $\chi^2$ distribution with 1 degree of freedom, but there are 4 numbers. Why?
– For our analysis, we assume that the margins are fixed.
– If we pick one number in the table, the rest of the numbers are known

         Prog   No prog   Total
SNP+        3        71      74
SNP-       22        94     116
Total      25       165     190

Estimated effect

When you compare two groups with a dichotomous outcome, there are three common ways to show the difference between the groups
– Risk difference
   Prob of disease in Group 1 - Prob of disease in Group 2
– Relative risk/risk ratio
   Prob of disease in Group 1 / Prob of disease in Group 2
– Odds ratio

Odds ratio

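Odds:

\[ \text{Odds} = \frac{p}{1 - p} \]

Odds ratio:

\[ OR = \frac{\text{Odds}_{\text{Exposure}+}}{\text{Odds}_{\text{Exposure}-}} = \frac{P(\text{Disease} \mid \text{Exposure}+) \,/\, \left(1 - P(\text{Disease} \mid \text{Exposure}+)\right)}{P(\text{Disease} \mid \text{Exposure}-) \,/\, \left(1 - P(\text{Disease} \mid \text{Exposure}-)\right)} \]

– Under the null, what is the OR?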
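
To make the three effect measures concrete, here is a small sketch applied to the SNP data (12 of 74 progressed with the SNP, 13 of 116 without). The function and key names are ours, chosen for illustration.

```python
def effect_measures(events_1, n_1, events_2, n_2):
    """Risk difference, relative risk and odds ratio for group 1 vs. group 2."""
    p1, p2 = events_1 / n_1, events_2 / n_2
    odds1, odds2 = p1 / (1 - p1), p2 / (1 - p2)
    return {
        "risk_difference": p1 - p2,
        "relative_risk": p1 / p2,
        "odds_ratio": odds1 / odds2,
    }

snp = effect_measures(12, 74, 13, 116)
for name, value in snp.items():
    print(f"{name}: {value:.3f}")

# When the two groups have the same risk, the odds ratio is 1.
same_risk = effect_measures(10, 100, 20, 200)
print(same_risk["odds_ratio"])
```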

              Disease
              Y       N       Total
Exposure Y    a       c       m_1
Exposure N    b       d       m_2
Total         n_1     n_2     N

\[ P(D+ \mid E+) = \frac{a}{a + c} \]

\[ \text{Odds}(D+ \mid E+) = \frac{P(D+ \mid E+)}{1 - P(D+ \mid E+)} = \frac{a/(a + c)}{c/(a + c)} = \frac{a}{c} \]

\[ \text{Odds}(D+ \mid E-) = \frac{b}{d} \]

\[ OR = \frac{\text{Odds}(D+ \mid E+)}{\text{Odds}(D+ \mid E-)} = \frac{a/c}{b/d} = \frac{ad}{bc} \]

This is the estimate of the odds ratio from a cohort study

              Disease
              Y       N       Total
Exposure Y    a       c       m_1
Exposure N    b       d       m_2
Total         n_1     n_2     N

\[ P(E+ \mid D+) = \frac{a}{a + b} \]

\[ \text{Odds}(E+ \mid D+) = \frac{a}{b} \]

\[ \text{Odds}(E+ \mid D-) = \frac{c}{d} \]

\[ OR = \frac{\text{Odds}(E+ \mid D+)}{\text{Odds}(E+ \mid D-)} = \frac{a/b}{c/d} = \frac{ad}{bc} \]

This is the estimate of the odds ratio from a case-control study

Amazing!!

 Estimated odds ratio from each kind of study ends up being the same thing!!!

Therefore, we can complete a case-control study and get an estimate that we really care about, which is the effect of the exposure on the disease
This relationship is one reason why the odds ratio is so commonly reported
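
The equality can be checked numerically with exact fractions. In this sketch the cell labels follow the 2x2 layout used in the derivation (a = exposed and diseased, c = exposed and not diseased, b = unexposed and diseased, d = unexposed and not diseased); the SNP counts are used only as example numbers, and the function names are ours.

```python
from fractions import Fraction

def odds(p):
    return p / (1 - p)

def or_disease_given_exposure(a, b, c, d):
    """Cohort view: odds of disease among exposed vs. unexposed."""
    return odds(Fraction(a, a + c)) / odds(Fraction(b, b + d))

def or_exposure_given_disease(a, b, c, d):
    """Case-control view: odds of exposure among diseased vs. non-diseased."""
    return odds(Fraction(a, a + b)) / odds(Fraction(c, c + d))

a, b, c, d = 12, 13, 62, 103
cohort_or = or_disease_given_exposure(a, b, c, d)
case_control_or = or_exposure_given_disease(a, b, c, d)
print(cohort_or, case_control_or, Fraction(a * d, b * c))
```

All three printed values are the same fraction, ad/bc.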

Logistic regression


Linear regression

When we fit linear regression, we used indicator variables to represent dichotomous predictors
– Ex. Effect of gender
– Gender=0 if Female
– Gender=1 if Male

\[ Y_i = \beta_0 + \beta_1 \, \text{Gender}_i + \epsilon_i \]

What is the interpretation of $\beta_1$?

Outcome

What if the outcome is dichotomous?
– Progression
– Y=0 if no progression
– Y=1 if progression
Can we just use linear regression with 0/1 as the outcome?

[Figure: the 0/1 outcome plotted against Age (x-axis from 10 to 50)]

Can we fit a line to this data?
Is there another measure?

Better outcome

Rather than investigating the 0/1 value, we focus our attention on the probability of the event
Therefore, we could use the following regression equation

\[ p_i = \beta_0 + \beta_1 x_i \]

Is there anything wrong with this function?

Technical aside-Probabilities

Probabilities are required to be between 0 and 1
– Does the present equation impose this restriction?
– No

\[ p_i = \beta_0 + \beta_1 x_i \]

We would like a similar equation, but with the restriction that $0 \le p \le 1$
One option

\[ p_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \]

Logistic regression

The previous function is quite complex to deal with, but we can transform the equation

\[ p_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \]

\[ \frac{p_i}{1 - p_i} = e^{\beta_0 + \beta_1 x_i} \]

\[ \ln\left(\frac{p_i}{1 - p_i}\right) = \ln(\text{Odds}) = \beta_0 + \beta_1 x_i \]

Note that the right side of the equation looks EXACTLY like our normal regression
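
A short sketch of the two directions of this transformation. The coefficients `beta0` and `beta1` below are made-up values for illustration, not estimates from the course data.

```python
import math

def logistic(eta):
    """Map a linear predictor to a probability: exp(eta) / (1 + exp(eta))."""
    return 1 / (1 + math.exp(-eta))

def logit(p):
    """Map a probability back to ln(odds)."""
    return math.log(p / (1 - p))

beta0, beta1 = -2.0, 0.5  # hypothetical coefficients
for x in (-10, -5, 0, 5, 10):
    eta = beta0 + beta1 * x        # unbounded linear predictor
    p = logistic(eta)              # always strictly between 0 and 1
    print(x, round(eta, 2), round(p, 4), round(logit(p), 2))
```

The linear predictor ranges well outside 0 to 1, while the probability does not, and taking the logit of the probability returns the linear predictor.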

Parameter interpretation-review

Let’s think about the following linear regression model for the effect of age on BPF

\[ E(\text{BPF}_i \mid \text{age}_i) = \beta_0 + \beta_1 \, \text{age}_i \]

In linear regression, the meaning of $\beta_1$ in this model is that for a one unit increase in age the mean BPF goes up by $\beta_1$.
The meaning of $\beta_0$ is the mean value of BPF when age=0

Parameter interpretation

How does this change for our logistic model?
– Not at all!!!
Logistic model:

\[ \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 \, \text{age}_i \]

The meaning of $\beta_1$ in this model is that for a one unit increase in age, the ln(Odds) goes up by $\beta_1$
The meaning of $\beta_0$ is the value of ln(Odds) when age=0

Results

When we fit our data, the parameter estimates were

\[ \ln\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right) = -4.58 + 0.086 \, \text{age}_i \]

For a one unit increase in age, the estimated log(Odds) increases by 0.086
Is this a statistically significant increase?
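
The fitted coefficients can be turned into an odds ratio per year of age and into predicted probabilities. This sketch assumes the intercept is -4.58 (the sign is inferred from the low overall progression rate) and the slope is 0.086; the ages are arbitrary example values.

```python
import math

B0, B1 = -4.58, 0.086  # intercept sign is an assumption; slope as reported

def predicted_probability(age):
    """Predicted probability of progression from the fitted log-odds."""
    log_odds = B0 + B1 * age
    return 1 / (1 + math.exp(-log_odds))

# A 0.086 increase in ln(odds) multiplies the odds by exp(0.086).
odds_ratio_per_year = math.exp(B1)
print(round(odds_ratio_per_year, 3))

for age in (20, 30, 40, 50):
    print(age, round(predicted_probability(age), 3))
```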

Hypothesis test

If there was no effect of age on the probability of progression, what would the value of $\beta_1$ equal?
How could we test the hypothesis that there is no effect?
– $H_0$: $\beta_1 = 0$
Need an estimate of the variance of the estimated $\beta_1$, but this is provided by STATA
Assume approximate normality

Hypothesis test

1) $H_0$: $\beta_1 = 0$
2) Dichotomous outcome with continuous predictor
3) Logistic regression
4) Summary statistic: z=1.99
5) p-value=0.047
6) Since the p-value is less than 0.05, we reject the null hypothesis
7) We conclude that there is a significant effect of age at symptom onset on probability of progression
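
Under the approximate normality assumption, the two-sided p-value follows from the z statistic alone. A minimal sketch using the standard library (the helper name is ours):

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

p_value = two_sided_p_from_z(1.99)
print(round(p_value, 3))

# The cut-off z = 1.96 corresponds to a two-sided p-value of about 0.05.
print(round(two_sided_p_from_z(1.96), 3))
```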

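[Figure: regression output labeled with the estimated coefficient for age, the p-value for $H_0$: $\beta_1 = 0$, the estimated intercept coefficient, and the p-value for $H_0$: $\beta_0 = 0$]

[Figure: event2yr and Pr(event2yr) plotted against Age (x-axis from 10 to 50)]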

Conclusions

Logistic regression allows us to investigate the relationship between a continuous predictor and a dichotomous outcome
Interpretation of coefficients is the same as linear regression, but on the log(odds) scale
We can calculate the predicted probability just like we could calculate the predicted mean value

Survival analysis


Example

An important marker of disease activity in MS is the occurrence of a relapse
– This is the presence of new symptoms that last for at least 24 hours
Many clinical trials in MS have demonstrated that treatments increase the time until the next relapse
– How does the time to next relapse look in the clinic?
What is the distribution of survival times?

Kaplan-Meier curve

Each drop in the curve represents an event

Survival data

To create this curve, patients placed on treatment were followed and the time of the first relapse on treatment was recorded
– Survival time
If everyone had an event, some of the methods we have already learned could be applied
Often, not everyone has an event
– Loss to follow-up
– End of study

Censoring

The patients who did not have the event are considered censored
– We know that they survived a specific amount of time, but do not know the exact time of the event
– We believe that the event would have happened if we observed them long enough
These patients provide some information, but not complete information

Censoring

How could we account for censoring?
– Ignore it and say event occurred at time of censoring
   Incorrect because this is almost certainly not true
– Remove patient from analysis
   Potential bias and loss of power
– Survival analysis
Our objective is to estimate the survival distribution for patients in the presence of censoring

Comparison of survival curve

One important aspect of survival analysis is the comparison of survival curves
Null hypothesis: survival curve in group 1 is the same as survival curve in group 2
Method: log-rank test

Example

Untreated
Patient    1    2    3    4    5    6    7    8    9    10
Time       3    8+   15   27+  32   46   49   51   55+  70

Treated
Patient    1    2    3    4    5    6    7    8    9
Time       30   38   52+  58   66   73+  77   89   107+

(+ marks a censored time)

[Figure: Kaplan-Meier survival estimates for group = 0 and group = 1; x-axis: analysis time, 0 to 100]
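
The curves can be reproduced by hand with the product-limit (Kaplan-Meier) calculation. The sketch below is a plain-Python illustration, not the Stata code used for the slides; it assumes a trailing + in the tables marks a censored time, encoded here as `False`.

```python
# (time, event) pairs; event=False encodes a censored time (shown as "+" in the tables).
untreated = [(3, True), (8, False), (15, True), (27, False), (32, True),
             (46, True), (49, True), (51, True), (55, False), (70, True)]
treated = [(30, True), (38, True), (52, False), (58, True), (66, True),
           (73, False), (77, True), (89, True), (107, False)]

def kaplan_meier(data):
    """Return [(time, estimated survival)] at each distinct event time."""
    survival = 1.0
    curve = []
    for t in sorted({time for time, event in data if event}):
        at_risk = sum(1 for time, _ in data if time >= t)
        events = sum(1 for time, event in data if time == t and event)
        survival *= 1 - events / at_risk   # multiply by the fraction surviving time t
        curve.append((t, survival))
    return curve

km_untreated = kaplan_meier(untreated)
km_treated = kaplan_meier(treated)
for t, s in km_untreated:
    print("untreated", t, round(s, 3))
for t, s in km_treated:
    print("treated", t, round(s, 3))
```

Censored patients never cause a drop in the curve, but they count in the at-risk total until their censoring time.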

Technical aside-Log-rank test

To compare survival curves, a log-rank test creates 2x2 tables at each event time and combines across the tables
– Similar to MH-test
Provides a $\chi^2$ statistic with 1 degree of freedom (for a two sample comparison) and a p-value
Same procedure for hypothesis testing

Hypothesis test

1) $H_0$: Survival distribution in group 1 = survival distribution in group 2
2) Time to event outcome, dichotomous predictor
3) Log rank test
4) Summary statistic: $\chi^2 = 4.4$
5) p-value=0.036
6) Since the p-value is less than 0.05, we reject the null hypothesis
7) We conclude that there is a significant difference in the survival time in the treated compared to untreated

[Figure: software output with the p-value labeled]
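
For readers who want to see the mechanics, here is a minimal sketch of the two-sample log-rank calculation on the example data: at each event time it compares the observed events in the untreated group with the number expected if both groups shared one survival curve. It assumes a + in the tables marks censoring and uses the standard hypergeometric variance; it is an illustration, not the Stata routine.

```python
import math

# (time, event) pairs; event=False encodes a censored time.
untreated = [(3, True), (8, False), (15, True), (27, False), (32, True),
             (46, True), (49, True), (51, True), (55, False), (70, True)]
treated = [(30, True), (38, True), (52, False), (58, True), (66, True),
           (73, False), (77, True), (89, True), (107, False)]

def log_rank(group0, group1):
    """Return (chi-square statistic with 1 df, p-value) for two groups."""
    observed0 = expected0 = variance = 0.0
    for t in sorted({time for time, event in group0 + group1 if event}):
        n0 = sum(1 for time, _ in group0 if time >= t)   # at risk in group 0
        n1 = sum(1 for time, _ in group1 if time >= t)   # at risk in group 1
        d0 = sum(1 for time, event in group0 if time == t and event)
        d1 = sum(1 for time, event in group1 if time == t and event)
        n, d = n0 + n1, d0 + d1
        observed0 += d0
        expected0 += d * n0 / n
        if n > 1:
            variance += d * (n0 / n) * (n1 / n) * (n - d) / (n - 1)
    chi2 = (observed0 - expected0) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square with 1 df
    return chi2, p_value

chi2, p_value = log_rank(untreated, treated)
print(round(chi2, 2), round(p_value, 3))
```

On these data the sketch agrees with the reported summary statistic and p-value to the precision shown on the slide.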

What we learned

Chi-square test
Logistic regression
Survival analysis