Transcript Title
Controlling Confounders
Logistic Regression
Aya Goto, Nguyen Quang Vinh
SLIDE 1

Key concepts
Confounding: Indicative of a true association. Can be controlled at the design or analysis stage.
Bias: Should be minimized at the design stage.
Random error: Is in the nature of quantitative data.
Non-differential (random) misclassification: Is in the nature of (inaccurate) measurement.
SLIDE 2

Review question
In a cohort study, pregnant women with intended and unintended pregnancies are compared, and it is found that the proportion of mothers who lack confidence in child rearing after birth is higher among the mothers with unintended pregnancy.
SLIDE 3

EXAMPLE OF ( )
A higher proportion of mothers with unintended pregnancy are first-time mothers, and thus they naturally have less confidence in child rearing. We are not sure whether the observed relationship between pregnancy intention and maternal confidence is true or is due to the effect of pregnancy history. This is called ( ) because the observation is correct, but it should be carefully interpreted to discern the truth.
SLIDE 4

EXAMPLE OF ( )
The interview about parenting conducted by a public health nurse after birth is less complete among women with intended pregnancy because they are considered lower-risk mothers. This is called ( ) because the observation itself is in error.
SLIDE 5

EXAMPLE OF ( )
By chance, there are more episodes of losing confidence in the unintended pregnancy group in the study sample.
EXAMPLE OF ( )
Lack of good information on pregnancy intention results in some mothers with intended pregnancy being randomly classified as having unintended pregnancy, and vice versa. If this happens, the study finding underestimates the true RR.
SLIDE 6

“Mothers with unintended pregnancy tend to lose confidence.”
Pregnancy intention
Maternal confidence
SLIDE 7

First-time motherhood
Confounding
It occurs when there is a confounder, which is independently associated with both the exposure and the disease.
Exposure, Disease, Confounder
SLIDE 8

http://www.amazon.co.jp/Coffee-Cigarettes-RobertoBenigni/dp/B0001XAO7U
Does drinking coffee increase the risk of myocardial infarction?
Coffee, MI, Smoking
SLIDE 9

Control confounding at the design stage
Strategy: Specification (“Include only non-smokers.”)
  Advantages
  • Easily understood
  Disadvantages
  • Limits generalizability
  • May limit sample size
Strategy: Matching (“Match smoking status of cases and controls.”)
  Advantages
  • Useful for eliminating the influence of strong constitutional confounders like age and sex
  Disadvantages
  • Decision to match must be made at the design stage and can have irreversible adverse effects on analysis
  • Time consuming
  • Cannot analyze associations of matched variables with the outcome
SLIDE 10

Control confounding at the analysis stage
Strategy: Stratification (“Conduct analysis separately for smokers and non-smokers.”)
  Advantages
  • Easily understood
  • Reversible
  Disadvantages
  • May be limited by sample size for each stratum
  • Difficult to control for multiple confounders
Strategy: Statistical adjustment (“Conduct multivariate analysis controlling (adjusting) for smoking status.”)
  Advantages
  • Multiple confounders can be controlled
  • Reversible
  Disadvantages
  • Needs advanced statistical techniques
  • Results may be difficult to understand
SLIDE 11

“Whichever method you choose, you have to know the potential confounders reported in previous studies.”
Literature searching is important.
SLIDE 12

Correlation and regression
Correlation: “Variables A and B are correlated / associated.”
Regression: “Variable A predicts / explains Variable B.”
Commonly used statistical analyses can only be applied to a linear relationship.
SLIDE 13

Linear relationship? No / Yes
Variable A predicts / explains Variable B? No / Yes
  No: Correlation
    Parametric: Pearson’s correlation
    Non-parametric: Spearman’s correlation
  Yes: Regression
    Parametric: Linear regression
    Non-parametric: Median regression
    Logistic regression
SLIDE 14

UNIVARIATE logistic regression
Y = aX + e
Research question: Does Factor X explain Outcome Y?
Outcome (dependent variable): Binomial (yes or no)
Factors of interest (independent variables): Any type of data
X: Pregnancy intention
Y: Low confidence
SLIDE 15

Assumptions (For advanced learners)
Linearity: Logistic regression does not require a linear relationship between the independent variables (factors or covariates) and the dependent variable, but it does assume a linear relationship between the independent variables and the log odds (logit) of the dependent variable. One strategy for mitigating non-linearity in a continuous variable is to divide it into categories.
Normal distribution: The dependent variable need not be normally distributed.
Homoscedasticity: The dependent variable need not be homoscedastic at each level of the independent variables; that is, there is no homogeneity-of-variance assumption, and variances need not be the same within categories.
SLIDE 16

MULTIVARIATE logistic regression
Y = aX1 + bX2 + cX3 … + e
X1: Pregnancy intention
Y: Low confidence
X2: Pregnancy history
X3: Financial status
etc.
SLIDE 17

Major types of multivariate model
1. Find associated factors
   STEP 1. Univariate analysis
   STEP 2. Multivariate analysis
2. Test a specific hypothesis
   Multivariate analysis controlling for potential confounders at once.
SLIDE 18

MODEL 1. Finding associated factors
X1: Pregnancy intention
X2: Pregnancy history
X3: Financial status
etc.
SLIDE 19

Y: Low confidence
SLIDE 20

MODEL 2. Testing a specific hypothesis
X1: Pregnancy intention
Y: Low confidence
X2: Pregnancy history
X3: Financial status
etc.
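The shorthand Y = aX1 + bX2 + cX3 … + e can be written out on the log-odds scale, which is where the linearity assumption applies. The following is a sketch only: it keeps the slides' coefficients a, b, c and adds an intercept k and the probability p, both of which are notation introduced here and do not appear on the slides.

```latex
\[
\mathrm{logit}(p) = \ln\frac{p}{1-p} = k + aX_1 + bX_2 + cX_3 + \cdots,
\qquad p = \Pr(Y = 1 \mid X_1, X_2, X_3, \ldots)
\]
\[
\mathrm{OR}_{X_1} = e^{a}
\]
```

Read this way, e^a is the odds ratio for a one-unit change in X1 (pregnancy intention) with X2, X3, etc. held fixed, which is what "controlling for" the other factors means in both model types.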
SLIDE 21

SLIDE 22

Influence of pregnancy intention on mother’s attachment towards her baby
             Not confident   Confident   OR (95% CI)
Intended     42%             58%         1.00
Unintended   67%             33%         3.1 (1.1-8.8)
The odds ratio was calculated using intended pregnancy as the reference group and adjusted by multiple logistic regression analysis for six factors: whether she was a first-time mother or not…
SLIDE 23

Technical notes
Sample size and number of independent factors
“As a rule of thumb, there should be 5 to 10 events for each X variable.”
Example: A study was conducted among 457 patients receiving aortic grafts, 86 of whom had a cardiac complication, and factors associated with its occurrence were analyzed. Limit variables to about 8 to 17 (86/10 ≈ 8.6; 86/5 ≈ 17.2).
SLIDE 24

Conceptual framework
“Key for clarity and teamwork”
Socio-demographic characteristics
• Age
• Residence in commune whole life
• Duration of marriage
• Living with husband
• Religion
• Occupation
• Education
• Household assets
Health behaviors
• Frequency of genital washing
• Water used for genital washing
• Material used for genital washing
RTI
Present pregnancy condition
• Gestational week
• Pregnancy related symptoms
• Number of antenatal care
• Current RTI treatment
• Current usage of antibiotics
Medical history
• Past contraceptive use
• Pregnancy history
• Age at first sexual intercourse
SLIDE 25

Categorization of variables
“Important, challenging and rewarding”
Where to “cut”?
1. Defined cut-off point
2. Median
3. Quantile (tertile, quartile, etc.)
How to decide? Tabulate or draw a graph.
SLIDE 26

             11  16  18  24  25  26  27  28  29  30  31  32
Intended      1   1   0   6   2   1   5   8  10   9   8  64
Unintended    0   0   1   5   3   0   0   1   1   2   1  10
[Bar chart: % of Intended and Unintended at each value from 11 to 32; vertical axis 0 to 60%]

Stepwise selection
“Convenient, but do not rely on it.”
Especially when you are testing a hypothesis.
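The “Tabulate or draw a graph” advice on the categorization slide can be tried on the Intended / Unintended table. This is a minimal sketch in Python, assuming that each row of the table holds the number of mothers observed at each value (11 to 32) of the variable being categorized; the slides do not name that variable, and the helper names `expand` and `percent_below` are made up for this illustration.

```python
from statistics import median

# Values and counts as transcribed from the slide's table.
# ASSUMPTION: each row holds the number of mothers observed at each value.
values = [11, 16, 18, 24, 25, 26, 27, 28, 29, 30, 31, 32]
counts = {
    "Intended":   [1, 1, 0, 6, 2, 1, 5, 8, 10, 9, 8, 64],
    "Unintended": [0, 0, 1, 5, 3, 0, 0, 1, 1, 2, 1, 10],
}


def expand(row):
    """Turn one row of counts into a flat list of observations."""
    return [v for v, n in zip(values, row) for _ in range(n)]


def percent_below(row, cut):
    """Percentage of a group's observations that fall below the cut-off."""
    below = sum(n for v, n in zip(values, row) if v < cut)
    return 100 * below / sum(row)


# Candidate cut-off: the median of all observations, both groups pooled.
pooled = expand(counts["Intended"]) + expand(counts["Unintended"])
median_cut = median(pooled)

print("pooled n:", len(pooled), "median:", median_cut)
for group, row in counts.items():
    print(f"{group}: {percent_below(row, median_cut):.1f}% below the median")
```

Under that assumption the pooled median coincides with the highest observed value, so a median split reduces to "below the maximum" versus "at the maximum". Spotting this kind of skew is the reason to tabulate before choosing among a defined cut-off, the median, or quantiles.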
SLIDE 28

STATA: Logistic regression
SLIDE 29

When a categorical variable with multiple levels is included, enter the following in the command window.
Job (1=housewife, 2=office worker, 3=manual worker)
xi: logistic attach pregint age i.job
SLIDE 30

. xi: logistic attach pregint age i.job

Logistic regression                               Number of obs   =        122
                                                  LR chi2(5)      =       9.81
                                                  Prob > chi2     =     0.0809
Log likelihood = -65.300399                       Pseudo R2       =     0.0698

------------------------------------------------------------------------------
    aichaku2 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pregint |   4.268066   2.536309     2.44   0.015     1.331688    13.67917
         age |   1.015464   .0508943     0.31   0.759     .9204558    1.120279
       job_2 |   .7797739   .3511123    -0.55   0.581     .3226223    1.884704
       job_3 |    .683234   .8093097     1.08   0.279     .6559658    1.319243
------------------------------------------------------------------------------
SLIDE 31

Odds ratio = Change in the odds for a one-unit change in the independent variable.
Interpretation:
Dichotomy: The odds ratio is 4.3 for pregnancy intention (1=intended, 2=unintended), which means that the odds of less attachment for unintended pregnancy are 4.3 times those for intended pregnancy.
Continuous variable: Each additional year of age multiplies the odds by 1.02.
Categorical variable: The “job” variable has three levels: 1=housewife, 2=office worker, 3=manual worker. The lowest category, “housewife”, is the default reference category (OR=1), meaning that the odds for office workers are 0.8 times those for housewives (that is, lower).
SLIDE 32

Pseudo R2 = "percent of variance explained".
Interpretation: About 7% of the variance in attachment is explained by the three factors (pregnancy intention, age and job).
SLIDE 33

STATA: Goodness-of-fit
SLIDE 34

. lfit

Logistic model for attach, goodness-of-fit test

       number of observations =       122
 number of covariate patterns =        85
             Pearson chi2(79) =     89.36
                  Prob > chi2 =    0.1996

Not significant = the model fits
If the p-value of the goodness-of-fit test is greater than .05, we fail to reject the null hypothesis that there is no difference between observed and model-predicted values, implying that the model's estimates fit the data at an acceptable level.
SLIDE 35

SPSS: Logistic regression
Statnotes: Topics in Multivariate Analysis, by G. David Garson
http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm
Binary logistic regression with SPSS, by Karl L. Wuensch
http://www.ecu.edu/cs-cas/psyc/wuenschk/index.cfm
SLIDE 36

In SPSS, binomial logistic regression is under Analyze - Regression - Binary Logistic.
Example: Factors associated with “not owning a gun”
SLIDE 37

Categorical independent variables must be declared by clicking the “Categorical” button. The last category becomes the reference by default.
SLIDE 38

SLIDE 39

OR
Most reported
Goodness-of-fit
SLIDE 40
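
The columns of the Stata logistic regression output shown earlier (Odds Ratio, Std. Err., z, P>|z|, 95% Conf. Interval) are tied together by simple arithmetic on the log-odds scale. The sketch below, in Python rather than Stata, re-derives the "pregint" row. It assumes Stata's usual convention that the standard error printed next to an odds ratio is on the odds-ratio scale (the odds ratio times the standard error of the coefficient). The variable names are illustrative only.

```python
from math import exp, log
from statistics import NormalDist

# Values read from the "pregint" row of the Stata output on the slide.
odds_ratio = 4.268066
se_odds_ratio = 2.536309  # ASSUMPTION: reported on the odds-ratio scale

# Move to the log-odds scale, where the Wald test and interval are built.
coef = log(odds_ratio)
se_coef = se_odds_ratio / odds_ratio  # delta method: SE(OR) = OR * SE(coef)

# Wald z statistic and two-sided p-value.
z = coef / se_coef
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval: symmetric in log odds, then exponentiated.
z_crit = NormalDist().inv_cdf(0.975)  # about 1.96
ci_low = exp(coef - z_crit * se_coef)
ci_high = exp(coef + z_crit * se_coef)

print(f"z = {z:.2f}, p = {p_value:.3f}")
print(f"95% CI for the OR: {ci_low:.2f} to {ci_high:.2f}")
```

The results agree with the printed row (z 2.44, P 0.015, interval 1.33 to 13.68). The interval is not symmetric around 4.27 because it is symmetric in log odds, not in the odds ratio itself.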