Introduction to logistic regression

Rachid Salmi, Jean-Claude Desenclos, Thomas Grein, Alain Moren, Viviane Bremer

Objectives
• When do we need to use logistic regression
• Principles of logistic regression
• Uses of logistic regression
• What to keep in mind

Chlamorea
• Sexually transmitted infection
  – Virus recently identified
  – Leads to general rash, blush, pimples and feeling of shame
  – Increasing prevalence with age
  – Risk factors unknown so far

Case-control study
• Population of Berlin
• 150 cases, 150 controls
• Hypothesis: Consistent use of condoms protects against chlamorea
• Questionnaire with questions on demographic characteristics, sexual behaviour
• OR, t-test

Results bivariate analysis

                               Cases n=150   Controls n=150   Odds ratio
Used condoms at last sex            40             90            0.17
Did not use condoms                110             60            Ref

                               Cases n=150   Controls n=150   Odds ratio
Single                             125             50            4.7
Currently in a relationship         25            100            Ref

                               Cases n=150   Controls n=150   T-test
Nr partners during last year         4              2            p=0.001
Mean age in years                   39             26            p=0.001

Confounding?

Chlamorea and condom use
[2×2 table with cells a, b, c, d, giving the crude odds ratio (OR raw)]

Stratification
[Figure: the crude 2×2 table is split into stratum-specific tables with cells a_i, b_i, c_i, d_i, each giving its own OR_i. Panels: single status, age group, number of partners.]

Let’s go one step back

Simple linear regression

Table 1 Age and systolic blood pressure (SBP) among 33 adult women

Age   SBP      Age   SBP      Age   SBP
22    131      41    139      52    128
23    128      41    171      54    105
24    116      46    137      56    145
27    106      47    111      57    141
28    114      48    115      58    153
29    123      49    133      59    157
30    117      49    128      63    155
32    122      50    183      67    176
33     99      51    130      71    172
35    121      51    133      77    178
40    147      51    144      81    217

[Figure: scatter plot of SBP (mm Hg) against age (years) with the fitted line SBP = 81.54 + 1.222 × Age; adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974]

Simple linear regression
• Relation between 2 continuous variables (SBP and age)

$y = \alpha + \beta_1 x_1$

[Figure: straight line of y against x with intercept $\alpha$ and slope $\beta_1$]

• Regression coefficient $\beta_1$
  – Measures association between y and x
  – Amount by which y changes on average when x changes by one unit
  – Least squares method
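The least squares estimate can be reproduced from the Table 1 data. The following is a minimal sketch in plain Python; the function name `least_squares` is an illustrative choice, not something from the slides. The result should land close to the intercept and slope shown with the scatter plot (81.54 and 1.222).

```python
# Age (years) and SBP (mm Hg) for the 33 women in Table 1.
ages = [22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40,
        41, 41, 46, 47, 48, 49, 49, 50, 51, 51, 51,
        52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81]
sbp = [131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147,
       139, 171, 137, 111, 115, 133, 128, 183, 130, 133, 144,
       128, 105, 145, 141, 153, 157, 155, 176, 172, 178, 217]


def least_squares(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


alpha, beta1 = least_squares(ages, sbp)
print(f"SBP = {alpha:.2f} + {beta1:.3f} * age")
```

The slope is the average change in SBP (mm Hg) per additional year of age.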

What if we have more than one independent variable?

Multiple risk factors
• Objective: to attribute to each risk factor the respective effect (RR) it has on the occurrence of disease.

Types of multivariable analysis
• Multiple models
  – Linear regression
  – Logistic regression
  – Cox model
  – Poisson regression
  – Loglinear model
  – Discriminant analysis…
• Choice of the tool according to objectives, study design and variables

Multiple linear regression
• Relation between a continuous variable and a set of i variables

$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$

• Partial regression coefficients $\beta_i$
  – Amount by which y changes when $x_i$ changes by one unit and all the other $x_i$ remain constant
  – Measures association between $x_i$ and y adjusted for all other $x_i$
• Example
  – Number of partners in relation to age & income

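Multiple linear regression

$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$

• y: predicted variable, response variable, outcome variable, dependent variable
• $x_i$: predictor variables, explanatory variables, covariables, independent variables
• Example: $y = \alpha + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{income} + \beta_3 \cdot \text{gender}$, with y the number of partners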

What if our outcome variable is dichotomous?

Logistic regression (1)

Table 2 Age and chlamorea

Age   Chlamorea      Age   Chlamorea      Age   Chlamorea
22        0          40        0          54        0
23        0          41        1          55        1
24        0          46        0          58        1
27        0          47        0          60        1
28        0          48        0          60        0
30        0          49        1          62        1
30        0          49        0          65        1
32        0          50        1          67        1
33        0          51        0          71        1
35        1          51        1          77        1
38        0          52        0          81        1

How can we analyse these data?

• Compare mean age of diseased and non-diseased
  – Non-diseased: 26 years
  – Diseased: 39 years (p=0.0001)
• Linear regression?

[Figure: dot plot of the data from Table 2, chlamorea (yes/no) against age (years)]

Logistic regression (2)

Table 3 Prevalence (%) of chlamorea according to age group

Age group   # in group   # diseased   % diseased
20 - 29          5            0             0
30 - 39          6            1            17
40 - 49          7            2            29
50 - 59          7            4            57
60 - 69          5            4            80
70 - 79          2            2           100
80 - 89          1            1           100
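Table 3 is simply Table 2 grouped into ten-year age bands. A short sketch of that grouping, using only the standard library (the variable names are illustrative):

```python
from collections import defaultdict

# Individual ages and disease status (1 = chlamorea) from Table 2.
ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
diseased = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
            0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
            0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

# decade -> [number in group, number diseased]
counts = defaultdict(lambda: [0, 0])
for age, status in zip(ages, diseased):
    decade = (age // 10) * 10
    counts[decade][0] += 1
    counts[decade][1] += status

table3 = {}
for decade in sorted(counts):
    n, cases = counts[decade]
    label = f"{decade}-{decade + 9}"
    table3[label] = (n, cases, round(100 * cases / n))
    print(label, n, cases, table3[label][2])
```

Plotting the resulting percentages against age group gives the S-shaped pattern that motivates the logistic function.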

[Figure: dot plot of the data from Table 3, % diseased against age group]

Logistic function (1)

$P(y \mid x) = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$

[Figure: S-shaped curve of the probability of disease (0.0 to 1.0) against x]
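To get a feel for the shape of the logistic function, it can be evaluated at a few values of x. This is a sketch only: the coefficients `ALPHA` and `BETA` below are hypothetical values picked so that the curve is centred at age 50, and they are not estimates from the slides.

```python
import math


def logistic(x, alpha, beta):
    """Probability of disease at x under the logistic model."""
    z = alpha + beta * x
    # 1 / (1 + e^-z) is algebraically the same as e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only so the curve is centred at age 50.
ALPHA, BETA = -5.0, 0.1

for age in (20, 35, 50, 65, 80):
    print(age, round(logistic(age, ALPHA, BETA), 3))
```

Whatever the coefficients, the output stays between 0 and 1, which is what makes the function suitable for a dichotomous outcome.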

Logistic function
• Logistic regression models the logit of the outcome
  = natural logarithm of the odds of the outcome
  = ln (probability of the outcome (P) / probability of not having the outcome (1-P))

$\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$

Logistic function

$\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$

• $\alpha$ = log odds of disease in unexposed
• $\beta$ = log odds ratio associated with being exposed
• $e^{\beta}$ = odds ratio
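Why the exponentiated coefficient is an odds ratio follows in a few lines. The sketch below assumes the simplest case, a single dichotomous exposure x coded 0 (unexposed) or 1 (exposed):

```latex
\begin{align*}
\ln(\text{odds} \mid x = 0) &= \alpha \\
\ln(\text{odds} \mid x = 1) &= \alpha + \beta \\
\ln(\text{OR}) &= (\alpha + \beta) - \alpha = \beta \\
\text{OR} &= e^{\beta}
\end{align*}
```

With several variables in the model the same subtraction applies to each $\beta_i$, holding the other $x_i$ fixed.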

Multiple logistic regression
• More than one independent variable
  – Dichotomous, ordinal, nominal, continuous …

$\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$

• Interpretation of $\beta_i$
  – Increase in log-odds for a one unit increase in $x_i$ with all the other $x_i$ constant
  – Measures association between $x_i$ and log-odds adjusted for all other $x_i$

Uses of multivariable analysis
• Etiologic models
  – Identify risk factors adjusted for confounders
  – Adjust for differences in baseline characteristics
• Predictive models
  – Determine diagnosis
  – Determine prognosis

Fitting equation to the data
• Linear regression:
  – Least squares
• Logistic regression:
  – Maximum likelihood
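Maximum likelihood has no closed-form solution for logistic regression, so the estimates are found iteratively. The sketch below fits chlamorea on age for the Table 2 data using Newton-Raphson, which is one common way to maximise the likelihood. It is an illustration under that assumption, not necessarily what any particular statistics package does, and the function name `fit_logistic` is hypothetical.

```python
import math

# Individual ages and disease status (1 = chlamorea) from Table 2.
ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
diseased = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
            0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
            0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]


def fit_logistic(x, y, iterations=25):
    """Maximum-likelihood estimates (alpha, beta) by Newton-Raphson."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(alpha + beta * xi))) for xi in x]
        # Gradient of the log-likelihood with respect to alpha and beta.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), symmetric 2x2.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


alpha_hat, beta_hat = fit_logistic(ages, diseased)
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.4f}")
print(f"OR per year of age = {math.exp(beta_hat):.3f}")
```

At the maximum, the fitted probabilities sum to the observed number of diseased, which is a quick check that the iteration has converged.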

Elaborating $e^{\beta}$
• $e^{\beta}$ = OR
  – What if the independent variable is continuous?
  – What’s the effect of a change in x by more than one unit?

The Q fever example
• Distance to farm as independent continuous variable counted in meters
  – $\beta$ in logistic regression was -0.00050013 and statistically significant
• OR for each 1 meter distance is 0.9995
  – Too small to use
• What’s the OR for every 1000 meters?
  – $e^{1000 \beta} = e^{-1000 \times 0.00050013} = 0.6064$
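The rescaling is a one-line calculation: multiply the coefficient by the number of units before exponentiating. A small sketch using the coefficient quoted on the slide:

```python
import math

beta = -0.00050013  # log odds ratio per metre of distance, from the slide

or_per_metre = math.exp(beta)
or_per_1000_metres = math.exp(1000 * beta)

print(f"OR per 1 m:    {or_per_metre:.4f}")
print(f"OR per 1000 m: {or_per_1000_metres:.3f}")
```

The second value comes out at roughly 0.606, in line with the figure on the slide.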

Continuous variables
• Increase in OR for a one unit change in exposure variable
• Logistic model is multiplicative → OR increases exponentially with x
  – If OR = 2 for a one unit change in exposure and x increases from 2 to 5: OR = 2 × 2 × 2 = 2³ = 8
• Verify if OR increases exponentially with x
  – When in doubt, treat as qualitative variable
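The multiplicative behaviour can be checked in two equivalent ways: by raising the per-unit OR to the power of the change in x, or by exponentiating the coefficient times that change. A brief sketch with the numbers from the example (variable names are illustrative):

```python
import math

or_per_unit = 2.0
beta = math.log(or_per_unit)  # coefficient implied by an OR of 2 per unit
delta = 5 - 2                 # x increases from 2 to 5

via_product = or_per_unit ** delta
via_coefficient = math.exp(beta * delta)

print(via_product, round(via_coefficient, 6))
```

Both routes give an OR of 8 for a three-unit increase.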

Coding of variables (2)
• Nominal variables or ordinal with unequal classes:
  – Preferred hair colour of partners:
    » No hair=0, grey=1, brown=2, blond=3
  – Model assumes that OR for blond partners = (OR for grey-haired partners)³
  – Use indicator variables (dummy variables)

Indicator variables: Hair colour

Hair colour        Dummy variables
of partners        blond   brown   grey
grey                 0       0       1
brown                0       1       0
blond                1       0       0
no hair              0       0       0

• Neutralises artificial hierarchy between classes in variable “hair colour of partners”
• No assumptions made
• 3 variables in model using same reference
• OR for each type of hair adjusted for the others in reference to “no hair”
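Creating the indicator variables is mechanical once the reference category is chosen. A minimal sketch, with "no hair" as the reference as on the slide; the function name `indicator_variables` is hypothetical:

```python
CATEGORIES = ["no hair", "grey", "brown", "blond"]
REFERENCE = "no hair"


def indicator_variables(hair_colour):
    """Return one 0/1 indicator per non-reference category."""
    if hair_colour not in CATEGORIES:
        raise ValueError(f"unknown hair colour: {hair_colour!r}")
    return {c: int(hair_colour == c) for c in CATEGORIES if c != REFERENCE}


for colour in CATEGORIES:
    print(colour, indicator_variables(colour))
```

The reference category is the one for which all three indicators are 0, so each fitted OR compares one hair colour with "no hair".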

Classes
• Relationship between number of partners during last year and chlamorea
  – Code number of partners: 0-1 = 1, 2-3 = 2, 4-5 = 3

Code nr partners   Cases   Controls   OR
1                    20       40      1.0
2                    22       30      1.5
3                    12       11      2.2

• 1.5² ≈ 2.2: compatible with assumption of multiplicative model
  – If not compatible, use indicator variables
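The class-specific odds ratios and the check against the multiplicative assumption can be reproduced from the counts. A short sketch, taking class 1 as the reference:

```python
# (cases, controls) per coded class of number of partners, from the slide.
classes = {1: (20, 40), 2: (22, 30), 3: (12, 11)}

ref_cases, ref_controls = classes[1]
ref_odds = ref_cases / ref_controls

# Odds of being a case in each class, relative to class 1.
odds_ratios = {
    code: (cases / controls) / ref_odds
    for code, (cases, controls) in classes.items()
}

# Under a multiplicative model, moving two classes up should square the
# OR for moving one class up.
expected_or_class3 = odds_ratios[2] ** 2

for code, value in odds_ratios.items():
    print(code, round(value, 1))
print("expected for class 3:", round(expected_or_class3, 2))
```

The observed OR for class 3 is close to the square of the OR for class 2, which is why a single coded variable is acceptable here.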

Risk factors for Chlamorea
[Diagram: sex, hair colour, age group, single, visiting bars, number of partners and no condom use, each pointing to chlamorea]

Unconditional Logistic Regression

Term                 Odds Ratio   95% C.I.              Coef.     S.E.    Z Statistic   P Value
# partners             1.2664     0.2634   10.7082     0.2362   0.9452      0.5486      0.5833
Single (Yes/No)        1.0345     0.3277    3.2660     0.0339   0.5866      0.0578      0.9539
Hair colour (1/0)      1.6126     0.2675    9.7220     0.4778   0.9166      0.5213      0.6022
Hair colour (2/0)      0.7291     0.0991    5.3668    -0.3159   1.0185     -0.3102      0.7564
Hair colour (3/0)      1.1137     0.1573    7.8870     0.1076   0.9988      0.1078      0.9142
Visiting bars          1.5942     0.4953    5.1317     0.4664   0.5965      0.7819      0.4343
Used no Condoms        9.0918     3.0219   27.3533     2.2074   0.5620      3.9278      0.0001
Sex (f/m)              1.3024     0.2278    7.4468     0.2642   0.8896      0.2970      0.7665
CONSTANT                  *          *         *      -3.0080   2.0559     -1.4631      0.1434
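The odds ratio, confidence interval and Z statistic in such output all derive from the coefficient and its standard error. The sketch below assumes a Wald-type 95% interval on the log-odds scale and assumes that the coefficient 2.2074 and standard error 0.5620 belong to the "Used no Condoms" term; the function name `odds_ratio_with_ci` is hypothetical.

```python
import math


def odds_ratio_with_ci(coef, se, z=1.96):
    """Return (OR, lower, upper) from a coefficient and its standard error.

    Assumes a Wald interval: coef +/- z * se on the log-odds scale.
    """
    return (
        math.exp(coef),
        math.exp(coef - z * se),
        math.exp(coef + z * se),
    )


# Values read from the output for "Used no Condoms" (assumed pairing).
coef, se = 2.2074, 0.5620
odds_ratio, lower, upper = odds_ratio_with_ci(coef, se)
z_statistic = coef / se

print(f"OR = {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
print(f"Z = {z_statistic:.2f}")
```

An interval that excludes 1, as here, corresponds to a small P value for that term.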

Last but not least

Why do we need multivariable analysis?

• Our real world is multivariable
• Multivariable analysis is a tool to determine the relative contribution of all factors

Sequence of analysis
• Descriptive analysis
  – Know your dataset
• Bivariate analysis
  – Identify associations
• Stratified analysis
  – Confounding and effect modifiers
• Multivariable analysis
  – Control for confounding

What can go wrong
• Small sample size and too few cases
• Wrong coding
• Skewed distribution of independent variables
  – Empty “subgroups”
• Collinearity
  – Independent variables express the same thing

Do not forget
• Rubbish in - rubbish out
• Check for confounders first
• Number of subjects >> variables in the model
• Keep the model simple
  – Statisticians can help with the model but you need to understand the interpretation
• You will need several attempts to find the “best” model

• If in doubt…

Really call a statistician !!!!
