Control of Confounding Variables

Multiple Regression
Control of Confounding Variables
• Randomization
• Matching
• Adjustment (stratified methods)
– Direct
– Indirect
– Mantel-Haenszel
• Multiple Regression
– Linear
– Logistic
– Poisson
– Cox
Limitations of the Stratified Methods
• Can study only one independent
variable at a time
• Problematic when there are too many
variables to adjust for (too many
strata)
• Limited to categorical variables (if
continuous, can categorize, which
may result in residual confounding)
How to Investigate Associations
Between Variables?
• Between two categorical variables:
– Contingency table, odds ratio, χ² test
• Between a categorical and a
continuous variable:
– Compare means, t test, ANOVA
• Between two continuous variables
– Example:
Relationship between air pollution and health status
Area   Measure of pollution   Measure of health status
1      73                     90
2      52                     74
3      68                     91
4      47                     62
5      60                     63
6      71                     78
7      67                     60
8      80                     89
9      86                     82
10     91                     105
11     67                     76
12     73                     82
13     71                     93
14     57                     73
15     86                     82
16     76                     88
17     91                     97
18     69                     80
19     87                     87
20     77                     95
[Figure: scatter plot of health status (vertical axis) by pollution level (horizontal axis, 0 to 120) in 20 geographic areas]
Suppose we now wish to know whether our
two variables are linearly related
• The question becomes:
– Are the data we observed compatible with the two
variables being linearly related? That is,
– Is the true association between the two variables
defined by a straight line, and the scatter we see just
random error around the truth?
[Figure: the same scatter plot of health status by pollution level in 20 geographic areas, now asking: r = ?]
[Figure: the same scatter plot, annotated with r ≈ 0.7]
• Then, the next practical question in our evaluation
of whether the relationship is linear:
– How can the fit of the data to a straight line be
measured?
• Correlation Coefficient (Pearson): the extent to which the two
variables vary together
• Linear Regression Coefficient: most useful when we wish to
know the strength of the association
Correlation Coefficient (Pearson)
r: ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)
[Figure: three scatter plots illustrating r = 1.0, r = -0.8 and r = 0]
Linear Regression Coefficient of a Straight Line
$y = \beta_0 + \beta_1 x$
[Figure: straight line in the (x, y) plane; $\beta_0$ is the value of y at x = 0, and $\beta_1$ is the rise in y per 1-unit increase in x]
$\beta_1$ = linear regression coefficient
: increase in y per unit increase in x
: expresses strength of the association
: allows prediction of the value of y, given x
The trick is to find the “line” ($\beta_0$, $\beta_1$) that best fits the observed data
In linear regression, the least squares approach estimates the line that minimizes the sum of the squared vertical distances between each point and the line.
[Figure: scatter plot of Y = health status by X = pollution level (0 to 120), with the fitted least squares line]
Health status = $\beta_0$ + $\beta_1$ (pollution)
Health status = 30.8 + 0.71 (pollution)
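The fitted line above can be reproduced with a minimal Python sketch of the least squares formulas. It assumes that the i-th pollution value is paired with the i-th health status value in the data listing; the function name `least_squares_line` is illustrative and not part of the lecture.

```python
# Data from the example: pollution (x) and health status (y) in 20 areas.
# Assumption: values are paired in the order in which they are listed.
pollution = [73, 52, 68, 47, 60, 71, 67, 80, 86, 91,
             67, 73, 71, 57, 86, 76, 91, 69, 87, 77]
health = [90, 74, 91, 62, 63, 78, 60, 89, 82, 105,
          76, 82, 93, 73, 82, 88, 97, 80, 87, 95]


def least_squares_line(x, y):
    """Return (b0, b1) minimizing the sum of squared vertical distances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx             # slope: increase in y per unit increase in x
    b0 = mean_y - b1 * mean_x  # intercept: value of y when x = 0
    return b0, b1


b0, b1 = least_squares_line(pollution, health)
print(f"Health status = {b0:.1f} + {b1:.2f} (pollution)")
```

Under that pairing assumption the sketch gives an intercept close to 30.8 and a slope close to 0.71, in line with the equation on the slide.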
Simple Linear Regression
• The “points” (observations) can be
individuals, or conglomerates of individuals
(e.g., regions, countries, families) in
ecologic studies.
• When X is inversely related to Y, b ($\beta$) is negative.
Note: when estimating $\beta$ from samples, the notation “b” is used instead of $\beta$
• In epidemiologic studies, the value of the intercept (b0 or $\beta_0$) is frequently irrelevant (X=0 is meaningless for many variables)
– E.g. Relationship of weight (X) to systolic blood
pressure (Y):
[Figure: scatter plot of SBP (mmHg, axis from 0 to 200) by weight (lb, axis from 50 to 200) with a fitted line; the value of SBP at weight = 0 is marked “?”]
FUNDAMENTAL ASSUMPTION IN THE LINEAR MODEL: x and y are linearly related, i.e., the increase in y per unit increase of x ($\beta$) is constant across the entire range of x.
E.g., the increase in health status index between pollution level 40 and 50 is the same as that between pollution level 90 and 100
[Figure: scatter plot of Y = health status by X = pollution level (0 to 120) with the fitted straight line]
FUNDAMENTAL ASSUMPTION IN THE LINEAR MODEL:
x and y are linearly related
However…if the data look like this:
[Figure: scatter plot of y by x following a “u-shaped” function, with a straight line fitted through it: wrong model!]
BOTTOM LINE: LOOK AT THE DATA BEFORE YOU DECIDE
ON THE BEST MODEL!
- Plot $y_i$ vs. $x_i$
If non-linear patterns are present:
- Use quadratic terms (e.g., age²), logarithmic terms (e.g., log(x)), etc.
- Categorize and use dummy variables
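As a minimal sketch of why looking at the data matters, the Python example below uses hypothetical, perfectly u-shaped data (y = x²). A straight line in x has slope 0 and r = 0, while the same regression on the quadratic term x² fits exactly. The data and the function name `simple_regression` are illustrative assumptions.

```python
def simple_regression(x, y):
    """Return (slope, pearson_r) for a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


# Hypothetical u-shaped data: y depends on x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

# Straight-line model in x: misses the association entirely.
slope_linear, r_linear = simple_regression(x, y)

# Same model with a quadratic term (x squared) as the predictor.
x_squared = [xi ** 2 for xi in x]
slope_quadratic, r_quadratic = simple_regression(x_squared, y)

print("linear term:   ", slope_linear, r_linear)
print("quadratic term:", slope_quadratic, r_quadratic)
```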
Other important points to keep in mind
• Like any other “sample statistic”, b is subject to
error. Formulas to calculate the standard error of b
are available in most statistics textbooks.
• “Statistical significance” of b (hypothesis testing):
– H0: $\beta = 0$ → no linear association between x and y
– H1: $\beta \neq 0$ → x and y are linearly related
– Test statistic: Wald statistic (z-value) = b/SE(b)
• WARNING: THIS TEST IS ONLY SENSITIVE FOR LINEAR
ASSOCIATIONS. A NON-SIGNIFICANT RESULT DOES NOT
IMPLY THAT x AND y ARE NOT ASSOCIATED, BUT MERELY
THAT THEY ARE NOT LINEARLY ASSOCIATED.
• Confidence interval (precision) for b:
– 95% CI = b ± 1.96 × SE(b)
• The regression coefficient (b) is related to the
correlation coefficient (r), but the former is
generally preferable because:
– It gives some sense of the strength of the association,
not only the extent to which the two variables vary
concurrently in a linear fashion.
– It allows prediction of Y as a function of X.
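The standard error, Wald statistic and 95% confidence interval for b can be sketched in Python for the pollution example. This assumes the usual textbook formula SE(b) = sqrt[residual variance / Σ(x − mean of x)²] with n − 2 degrees of freedom, the in-order pairing of the two data columns, and the 1.96 multiplier used on the slide (with only 20 points, a t-based multiplier would give a slightly wider interval).

```python
import math

# Pollution (x) and health status (y); assumed paired in listing order.
pollution = [73, 52, 68, 47, 60, 71, 67, 80, 86, 91,
             67, 73, 71, 57, 86, 76, 91, 69, 87, 77]
health = [90, 74, 91, 62, 63, 78, 60, 89, 82, 105,
          76, 82, 93, 73, 82, 88, 97, 80, 87, 95]

n = len(pollution)
mean_x = sum(pollution) / n
mean_y = sum(health) / n
sxx = sum((x - mean_x) ** 2 for x in pollution)
syy = sum((y - mean_y) ** 2 for y in health)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pollution, health))

b = sxy / sxx                                # regression coefficient
residual_ss = syy - b * sxy                  # sum of squared residuals
se_b = math.sqrt(residual_ss / (n - 2) / sxx)

z = b / se_b                                 # Wald statistic
ci_low = b - 1.96 * se_b                     # 95% CI = b ± 1.96 × SE(b)
ci_high = b + 1.96 * se_b

print(f"b = {b:.3f}, SE(b) = {se_b:.3f}, z = {z:.2f}")
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")
```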
Correlation Coefficient (Pearson)
r: ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)
[Figure: four scatter plots: two perfectly linear patterns with different slopes (both r = 1.0), a negative association (r = -0.8), and no association (r = 0)]
Note: functions having different slopes may have
the same correlation coefficient
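A minimal Python sketch of this point: two hypothetical straight lines with very different slopes both have r = 1, and the pollution example has r close to 0.7. The function name `pearson_r`, the two hypothetical lines, and the in-order pairing of the pollution and health columns are assumptions for illustration.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical perfectly linear data with different slopes.
x = [1, 2, 3, 4, 5]
steep = [2.0 * xi for xi in x]      # slope 2
shallow = [0.5 * xi for xi in x]    # slope 0.5
r_steep = pearson_r(x, steep)
r_shallow = pearson_r(x, shallow)

# Pollution example (values assumed paired in listing order).
pollution = [73, 52, 68, 47, 60, 71, 67, 80, 86, 91,
             67, 73, 71, 57, 86, 76, 91, 69, 87, 77]
health = [90, 74, 91, 62, 63, 78, 60, 89, 82, 105,
          76, 82, 93, 73, 82, 88, 97, 80, 87, 95]
r_pollution = pearson_r(pollution, health)

print(r_steep, r_shallow, round(r_pollution, 2))
```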
The equation:
$y_i = b_0 + b_1 x_1$
naturally extends to multiple variables (multidimensional space):
$y_i = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$
[Figure: regression plane $y_i = b_0 + b_1 x_1 + b_2 x_2$ in three dimensions, with axes y, x1 and x2]
Multiple regression coefficients:
b1 -- increment in y per unit increment in x1, after the effect of x2 on y and
on x1 has been removed, or
-- effect of x1 on y, adjusted for x2
b2 – increment in y per unit increment in x2, after the effect of x1 on y and
on x2 has been removed, or
-- effect of x2 on y, adjusted for x1
(b0 – value of y when both x1 and x2 are equal to 0)
Exposed: x1 = 1 (smoker)
Unexposed: x1 = 0 (non-smoker)
y: lung cancer
Confounder present: x2 = 1 (drinker)
Confounder absent: x2 = 0 (non-drinker)
Among drinkers (x2 = 1):
$y_{exposed} = b_0 + b_1(1) + b_2(1) + e$
$y_{unexposed} = b_0 + b_1(0) + b_2(1) + e$
$AR_{exp} = y_{exposed} - y_{unexposed} = b_1$
Among non-drinkers (x2 = 0):
$y_{exposed} = b_0 + b_1(1) + b_2(0) + e$
$y_{unexposed} = b_0 + b_1(0) + b_2(0) + e$
$AR_{exp} = y_{exposed} - y_{unexposed} = b_1$
Same! (no interaction)
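$y_i = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$
where $b_3 x_1 x_2$ is the interaction term
[Figure: regression surface for the model with an interaction term, with axes y, x1 and x2]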
Multiple regression coefficients:
b1 -- increment in y per unit increment in x1 in individuals not exposed to
x2
b2 – increment in y per unit increment in x2 in individuals not exposed to
x1
b3 – additional increment in y in the joint presence of x1 and x2, beyond the sum of their separate effects (b1 + b2); compared to individuals not exposed to x1 and x2, those exposed to both differ by b1 + b2 + b3
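The interpretation of the coefficients in the interaction model can be checked with a small Python sketch. The coefficient values are hypothetical and chosen only to make the arithmetic easy to follow; `predicted_y` is an illustrative name.

```python
def predicted_y(x1, x2, b0, b1, b2, b3):
    """Predicted y under the model y = b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


# Hypothetical coefficients (not from the lecture).
coef = dict(b0=10.0, b1=5.0, b2=3.0, b3=4.0)

# Effect of x1 among those not exposed to x2: equals b1.
effect_x1_without_x2 = predicted_y(1, 0, **coef) - predicted_y(0, 0, **coef)

# Effect of x1 among those exposed to x2: equals b1 + b3.
effect_x1_with_x2 = predicted_y(1, 1, **coef) - predicted_y(0, 1, **coef)

# Joint effect of x1 and x2 versus neither: equals b1 + b2 + b3.
joint_effect = predicted_y(1, 1, **coef) - predicted_y(0, 0, **coef)

print(effect_x1_without_x2, effect_x1_with_x2, joint_effect)
```

Because b3 is not zero in this sketch, the absolute effect of x1 depends on the level of x2, which is what additive interaction means.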
Multiple Linear Regression Notes
• To obtain least squares estimates of the b’s, one needs to
use matrix algebra…or computers!
• Important assumptions:
– Linearity
– No (additive) interaction, i.e.,
• The absolute effect of x1 is independent of x2, or
• The effects of x1 and x2 are merely additive (i.e., not “less, or
more than additive”)
• NOTE: if there is interaction, product terms can be introduced in
the model to account for it (it is, however, better to do stratified
analysis)
• Other assumptions:
– Observations (i’s) are independent
– Homoscedasticity: variance of y is constant across x-values
– Normality: for a given value of x, values of y are normally
distributed
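As a sketch of what the computer does, the Python example below solves the normal equations (X'X)b = X'y by Gauss-Jordan elimination. The data are hypothetical and constructed so that y depends only on the confounder x2, while x1 is more frequent when x2 = 1: the crude coefficient for x1 is 2.5, but it drops to 0 after adjustment for x2. The function name and the data are assumptions, not the lecture's own example.

```python
def fit_multiple_regression(rows, y):
    """Least squares estimates [b0, b1, ..., bk] via the normal equations."""
    X = [[1.0, *r] for r in rows]          # add the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical data: (x1, x2) pairs and outcome y = 10 + 5*x2.
x1_x2 = [(0, 0), (0, 0), (0, 0), (1, 0), (0, 1), (1, 1), (1, 1), (1, 1)]
y = [10, 10, 10, 10, 15, 15, 15, 15]

crude = fit_multiple_regression([(x1,) for x1, _ in x1_x2], y)
adjusted = fit_multiple_regression(x1_x2, y)

print("crude b1:   ", crude[1])
print("adjusted b1:", adjusted[1], " b2:", adjusted[2])
```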
In Linear Regression (simple or multiple),
Independent Variables (x’s) can be:
• Continuous
– Pollution level (score)
– BMI (kg/m²)
– Blood pressure (mmHg)
– Age (years)
• Categorical
– Dichotomous (conventionally, one of the values is coded as “1” and the other as “0”)
• Gender (male/female)
• Treatment (yes/no)
• Smoking (yes/no)
– Ordinal
• Any continuous variable categorized in percentiles (tertiles, quartiles, etc.)
In Linear Regression (simple or multiple),
the Dependent Variable (y) can be:
• Discrete (yes/no)
– Incident cancer
– Recurrent cancer
• Continuous
– Systolic blood pressure (mmHg)
– Serum cholesterol (mg/dL)
– BMI (kg/m2)
Example of x as a discrete variable (obesity) and y as a
continuous variable (systolic blood pressure, mmHg)
[Figure: SBP (mmHg, axis from 110 to 160) by obesity (x = 0 if “no”; x = 1 if “yes”); the line joining the two group means rises over one unit of x (from zero to 1), and that average difference is the regression coefficient or slope, b1]
When x = 1: $SBP = b_0 + b_1 \times 1 = b_0 + b_1$
When x = 0: $SBP = b_0 + b_1 \times 0 = b_0$
Subtracting: $\Delta SBP = b_1$
Thus, b1 = increase in SBP per unit increase in obesity = average difference in SBP between “obese” and “non-obese” individuals
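A short Python sketch with hypothetical SBP values shows that, with a 0/1 predictor, the least squares slope is the difference between the two group means and the intercept is the mean in the group coded 0. The eight values below are invented for illustration.

```python
# Hypothetical data: obesity indicator (0 = no, 1 = yes) and SBP in mmHg.
obese = [0, 0, 0, 0, 1, 1, 1, 1]
sbp = [118, 122, 125, 119, 138, 142, 135, 141]

n = len(obese)
mean_x = sum(obese) / n
mean_y = sum(sbp) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(obese, sbp))
sxx = sum((x - mean_x) ** 2 for x in obese)

b1 = sxy / sxx             # least squares slope
b0 = mean_y - b1 * mean_x  # least squares intercept

# Group means computed directly, for comparison.
mean_non_obese = sum(y for x, y in zip(obese, sbp) if x == 0) / obese.count(0)
mean_obese = sum(y for x, y in zip(obese, sbp) if x == 1) / obese.count(1)

print(b0, b1, mean_non_obese, mean_obese - mean_non_obese)
```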
Example of x as a discrete variable with more than 2 categories (e.g., educational level) and y as a continuous variable (systolic blood pressure, mmHg)
• Ordinal variables (x’s) can be entered into the regression equation as single x’s. Example:
$SBP = b_0 + b_1 (educ) + b_2 (age)$
• Where education is categorized into “low”, “medium” and “high”.
• Thus, x1 = 1 when “low”, x1 = 2 when “medium” and x1 = 3 when “high”
[Figure: SBP (mmHg, 110 to 160) by educational level (low, medium, high) under the model $SBP = b_0 + b_1 (educ) + b_2 (age)$; the fitted line changes by the same amount $b_1$ from “low” to “medium” as from “medium” to “high”]
HOWEVER, the model assumes that the difference in SBP (decrease) is the same between “low” (x1 = 1) and “medium” (x1 = 2) as that between “medium” (x1 = 2) and “high” (x1 = 3) → assumption of linearity
Alternative: it’s coming!
Non-ordinal multilevel categorical variable
• Race (Asian, Black, Hispanic, White)
• Treatment (A, B, C, D)
• Smoking (cigarette, pipe, cigar, nonsmoker)
How to include these variables in a multiple
regression model?
“Dummy” or indicator variables: Define the number of
dummy dichotomous variables as the number of categories
minus one
Use of dummy variables
Example: “Race” categorized as Asian, Black, Hispanic and
White. Thus, to model “race”:
$SBP = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$
Where
x1 = 1 if Asian, x1 = 0 otherwise
x2 = 1 if Black, x2 = 0 otherwise
x3 = 1 if Hispanic, x3 = 0 otherwise
$SBP_{Asians} = b_0 + b_1$
$SBP_{Blacks} = b_0 + b_2$
$SBP_{Hispanics} = b_0 + b_3$
$SBP_{Whites} = b_0$
Thus, what is the interpretation of b0, b1, b2, and b3?
Definitions of Dummy Variables
Race       x1   x2   x3
Asian      1    0    0
Black      0    1    0
Hispanic   0    0    1
White      0    0    0
• b0= average value of y in Whites (reference category)
• b1= average difference in y between Asians and Whites
• b2= average difference in y between Blacks and Whites
• b3= average difference in y between Hispanics and Whites
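The dummy coding above can be sketched in Python as follows. The coefficient values are hypothetical, and the names `dummy_code` and `predicted_sbp` are illustrative; only the coding scheme (number of categories minus one indicators, with Whites as the reference) comes from the slides.

```python
CATEGORIES = ["Asian", "Black", "Hispanic", "White"]
REFERENCE = "White"  # reference category: all dummy variables equal 0


def dummy_code(race):
    """Return (x1, x2, x3) for Asian, Black and Hispanic indicators."""
    if race not in CATEGORIES:
        raise ValueError(f"unknown category: {race}")
    return tuple(1 if race == c else 0 for c in CATEGORIES if c != REFERENCE)


def predicted_sbp(race, b0, b):
    """Predicted mean SBP = b0 + b1*x1 + b2*x2 + b3*x3."""
    return b0 + sum(bi * xi for bi, xi in zip(b, dummy_code(race)))


# Hypothetical coefficients (mmHg), for illustration only.
b0 = 120.0
b = (-2.0, 6.0, 1.0)  # b1 (Asian), b2 (Black), b3 (Hispanic)

for race in CATEGORIES:
    print(race, dummy_code(race), predicted_sbp(race, b0, b))
```

With this coding, the predicted value for the reference category is b0, and each b is the difference between its category and the reference.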
Use of dummy variables when the function
is not a straight line
[Figure: mean SBP (mmHg, 110 to 160) by BMI quintile (1 to 5) following a non-linear pattern, with a single straight line fitted through the five quintiles: WRONG MODEL!!!]
[Figure: the same mean SBP by BMI quintile, with a model that uses dummy variables to fit each quintile separately]
Model:
$SBP = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$
Where
x1 = 1 if BMI quintile = 2; x1 = 0 otherwise
x2 = 1 if BMI quintile = 3; x2 = 0 otherwise
x3 = 1 if BMI quintile = 4; x3 = 0 otherwise
x4 = 1 if BMI quintile = 5; x4 = 0 otherwise
Note: each b represents the difference between each quintile (2, 3, 4 and 5) and the
reference quintile (quintile 1). Thus, the difference is negative for 2, slightly negative
for 3, and positive for 4 and 5.
Can also obtain the difference between quintiles: for example, b4 – b3 is the
difference between quintiles 5 and 4
Multiple linear regression models of leukocyte count (thousands/mm³) by selected factors, in never smokers, ARIC study, 1986-89 (Nieto et al, AJE 1992;136:525-37)
                                 Model 1* (R² = 0.09)    Model 2** (R² = 0.21)
Variable                         b         SE(b)         b         SE(b)
Age (5 years)                    -0.066    0.019         -0.066    0.018
Sex (male=1, fem=0)              0.478     0.065         0.030     0.073
Race (W=1, B=0)                  0.495     0.122         0.333     0.117
Work activity score (1 unit)     -0.065    0.021         -0.061    0.020
Subscapular skinfold (10 mm)     0.232     0.018         0.084     0.020
SBP (10 mmHg)                    0.040     0.011         0.020     0.011
FEV1 (1 liter)                   -0.208    0.047         -0.183    0.045
Heart rate (10 beats/minute)     0.206     0.020         0.128     0.019
*Model 1: adjusted for center, education, height, apolipoprotein A-I, glucose and
for the other variables shown in the table.
**Model 2: Adjusted for the same variables included in Model 1 plus hemoglobin,
platelet, uric acid, insulin, HDL, apolipoprotein B, triglycerides, factor VIII,
fibrinogen, antithrombin III, protein C antigen and APTT
Control of Confounding Variables
• Random allocation
• Matching
– Individual
– Frequency
– Restriction
• Adjustment
– Direct
– Indirect
– Mantel-Haenszel
– MULTIPLE REGRESSION
• Linear model
• LOGISTIC MODEL
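AN ALTERNATIVE TO THE LINEAR MODEL
When the dependent variable is dichotomous (1/0)
The probability of disease (y) given exposure (x):
$P(y|x) = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$
Or, simplifying:
$P = \frac{1}{1 + e^{-B}}$, where $B = b_0 + b_1 x$
[Figure: S-shaped (logistic) curve of the probability of response (P, from 0 to 1.0, with 0.5 marked) against dose (x)]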
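The shape of the logistic function can be explored with a minimal Python sketch. The coefficients b0 = -3 and b1 = 0.5 are hypothetical, chosen so that P = 0.5 at a dose of 6; `logistic_probability` is an illustrative name.

```python
import math


def logistic_probability(x, b0, b1):
    """P(y | x) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    B = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-B))


# Hypothetical coefficients: P crosses 0.5 where b0 + b1*x = 0, i.e. x = 6.
b0, b1 = -3.0, 0.5

doses = list(range(0, 13, 2))
probabilities = [logistic_probability(x, b0, b1) for x in doses]

for x, p in zip(doses, probabilities):
    print(f"dose {x:2d}: P = {p:.3f}")
```

Whatever values are chosen for b0, b1 and x, the result stays strictly between 0 and 1, which is why this model suits a dichotomous outcome.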
EXPONENTS AND LOGARITHMS: Brief Review
$\log A = B \Leftrightarrow A = 10^{B}$
E.g., $\log 100 = 2 \Leftrightarrow 100 = 10^{2}$
$\ln A = B \Leftrightarrow A = e^{B}$
E.g., $\ln 5 = 1.609 \Leftrightarrow 5 = e^{1.609}$ (with $e \approx 2.718$)
Notation:
$e^{B} = \exp(B)$
(Note: In most epidemiologic literature, lnA is written as logA)
Logs: Brief Review (Cont.)
$\log A + \log B = \log(A \times B)$
$\log A - \log B = \log\left(\frac{A}{B}\right)$
$\frac{1}{A^{-B}} = A^{B}$
Example:
$100 = \frac{1}{0.01} = \frac{1}{10^{-2}} = 10^{2} = 100$
$\frac{1}{e^{-B}} = e^{B}$
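$P = \frac{1}{1 + e^{-B}}$
$1 - P = 1 - \frac{1}{1 + e^{-B}} = \frac{1 + e^{-B} - 1}{1 + e^{-B}} = \frac{e^{-B}}{1 + e^{-B}}$
$\frac{P}{1 - P} = \frac{\frac{1}{1 + e^{-B}}}{\frac{e^{-B}}{1 + e^{-B}}} = \frac{1}{e^{-B}} = e^{B}$
THUS:
$\frac{P}{1 - P} = ODDS = e^{B} = e^{b_0 + b_1 x}$
$\log\left(\frac{P}{1 - P}\right) = \log(Odds) = b_0 + b_1 x$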
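[Figure: straight line of log(Odds) against x, with intercept b0 and slope b1 per unit increment in x]
b1 = increment in log(Odds) per unit increment in x
Remember that:
$\log(Odds)_{x+1} - \log(Odds)_{x} = \log\left(\frac{(Odds)_{x+1}}{(Odds)_{x}}\right) = \log(\text{Odds Ratio})$
Thus, b1 is the log of the Odds Ratio!!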
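The algebra above can be checked numerically with a small Python sketch. The coefficients and the chosen value of x are hypothetical; the point is only that the odds equal e^(b0 + b1 x) and that the odds ratio for a one-unit increment in x equals e^(b1).

```python
import math


def odds(x, b0, b1):
    """Odds = P / (1 - P), with P from the logistic model."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
    return p / (1.0 - p)


# Hypothetical coefficients, for illustration only.
b0, b1 = -2.0, 0.7
x = 3

odds_at_x = odds(x, b0, b1)
odds_at_x_plus_1 = odds(x + 1, b0, b1)
odds_ratio = odds_at_x_plus_1 / odds_at_x

print("odds at x:        ", odds_at_x, "vs exp(b0 + b1*x) =", math.exp(b0 + b1 * x))
print("OR per unit of x: ", odds_ratio, "vs exp(b1) =", math.exp(b1))
print("log(OR):          ", math.log(odds_ratio), "vs b1 =", b1)
```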
Assume prospective data in which exposure (independent variable) is defined dichotomously (x):
                     Disease (Y = 1)   Non-disease (Y = 0)
Exposed (X = 1)      p1                1 – p1
Unexposed (X = 0)    p0                1 – p0
For exposed (x = 1): $\log\left(\frac{p_1}{1 - p_1}\right) = b_0 + b_1 x = b_0 + b_1 \times 1$
For unexposed (x = 0): $\log\left(\frac{p_0}{1 - p_0}\right) = b_0 + b_1 \times 0 = b_0$
$b_1 = \log\left(\frac{p_1}{1 - p_1}\right) - \log\left(\frac{p_0}{1 - p_0}\right) = \log\left(\frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)}\right) = \log OR$
$OR = e^{b_1}$ = antilog of b1
WITH CASE-CONTROL DATA:
• Intercept (b0) is uninterpretable
• Can obtain unbiased estimates of the
regression coefficient (b1) (See
Schlesselman, pp. 235-7)
The logistic model extends to the multivariate situation:
$P(y|x) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k)}}$
$\log\left(\frac{P}{1 - P}\right) = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$
Interpretation of multiple logistic regression coefficients:
Dichotomous x:
b1: log(OR) for x=1 compared to x=0 after adjustment for
the remaining x’s
Continuous x:
b1: log(OR) for an increment of 1 unit in x, after
adjustment for the remaining x’s
Thus: 10 × b1: log(OR) for an increment of 10 units of x,
after adjustment for the remaining x’s
CAUTION: Assumes linear increase in the log(OR) throughout the entire
range of x values
Logistic Regression Using Dummy Variables: Cross-Sectional Association Between Demographic Factors and Depressive State, NHANES, Mexican-Americans Aged 20-74 Years, 1982-4
Factor                        b           OR      P value
Intercept                     -3.1187     -       -
Sex (female= 1, male= 0)      0.8263      2.28    0.00
Age
  20-24                       Reference   1.00    -
  25-34                       0.1866      1.20    0.11
  35-44                       -0.1112     0.89    0.60
  45-54                       -0.1264     0.88    0.52
  55-64                       -0.1581     0.85    0.32
  65-74                       -0.3555     0.70    0.19
Years of Education
  0-6                         0.8408      2.32    0.00
  7-11                        0.4470      1.56    0.01
  12                          0.2443      1.28    0.21
  ≥ 13                        Reference   1.00    -
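The odds ratios in this table are the antilogs of the coefficients, which a short Python sketch can confirm. The labels are shortened for readability; the b and OR values are copied from the table, and the 0.01 tolerance is an assumption that allows for rounding in the published figures.

```python
import math

# (label, b, OR as printed in the table); reference categories are omitted.
rows = [
    ("Sex (female vs. male)", 0.8263, 2.28),
    ("Age 25-34", 0.1866, 1.20),
    ("Age 35-44", -0.1112, 0.89),
    ("Age 45-54", -0.1264, 0.88),
    ("Age 55-64", -0.1581, 0.85),
    ("Age 65-74", -0.3555, 0.70),
    ("Education 0-6 years", 0.8408, 2.32),
    ("Education 7-11 years", 0.4470, 1.56),
    ("Education 12 years", 0.2443, 1.28),
]

# OR = exp(b) for each dummy variable, relative to its reference category.
computed = {label: math.exp(b) for label, b, _ in rows}

for label, b, table_or in rows:
    print(f"{label}: exp({b}) = {computed[label]:.3f} (table: {table_or})")
```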
Generalized Linear Models
• Linear (simple)
– Equation: $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
– Interpretation: increase in the mean value of outcome y per unit increase in x1, adjusted for all other variables in the model
• Logistic
– Equation: $\log(\text{odds}) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
– Interpretation: increase in log(odds) of outcome per unit increase in x1, adjusted for all other variables in the model
• Cox
– Equation: $\log(\text{hazard}) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
– Interpretation: increase in log(hazard) of outcome per unit increase in x1, adjusted for all other variables in the model
• Poisson
– Equation: $\log(\text{rate}) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
– Interpretation: increase in log(rate) of outcome per unit increase in x1, adjusted for all other variables in the model
Logistic Regression Notes
• Popularity of logistic regression results from its
predictive ability (values above 1.0 or below 0 are
impossible with this model).
• Least squares solution for logistic regression
does not work. Need maximum likelihood
estimates…I.e., computers!
• 95% confidence limits for the Odds Ratio: $e^{b \pm 1.96 \times SE(b)}$
Logistic Regression on 7-Year Follow-Up, Washington County ARIC Cohort, Ages 45-64 Years at Baseline (1987-89)
Factor (x)                          b          Odds Ratio
Intercept                           -4.5670    -
Gender (male=1, female=0)           1.3106     3.71
Smoking (yes=1, no=0)               0.7030     2.02
Age (1 year)                        0.1444     1.16
Systolic Blood Pressure (1 mmHg)    0.5103     1.67
Serum Cholesterol (1 mg/dL)         0.4916     1.63
Body Mass Index (1 kg/m²)           0.1916     1.21
What is the probability (P) that can be predicted from this model for a male smoker less than 55 years old, who is hypertensive, non-hypercholesterolemic and obese?
$P = \frac{Odds}{1 + Odds}$
For this male smoker less than 55 years old, who is hypertensive, non-hypercholesterolemic and obese:
$P = \frac{Odds}{1 + Odds} = \frac{1}{1 + e^{-[(-4.5670) + (1.3106 \times 1) + (0.7030 \times 1) + (0.1444 \times 0) + (0.5103 \times 1) + (0.4916 \times 0) + (0.1916 \times 1)]}} = 0.1357 = 13.57\%$
Example (95% confidence limits for the Odds Ratio):
b1 = 1.1073; SE(b1) = 0.1707
$95\%\ CL = e^{1.1073 \pm 1.96 \times 0.1707} = 2.17,\ 4.23$
When calculating the OR for an increase of more than one unit of a continuous variable, one must multiply both the coefficient and the SE by the number of units to obtain the CLs. E.g., for an increase of 10 units:
$95\%\ CL = e^{10 \times 1.1073 \pm 10 \times (1.96 \times 0.1707)}$
• Hypothesis testing (H0: b = 0)
– Wald statistic: $z = \frac{b}{SE(b)}$
• Example: b = 1.2163, SE(b) = 0.1752
$z = \frac{1.2163}{0.1752} = 6.94$, p < 0.05
(Note that the square of this z value is the $\chi^2$)
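The confidence limits and the Wald test for a logistic regression coefficient can be sketched in Python. The function names `odds_ratio_ci` and `wald_z` and the `units` parameter are illustrative assumptions; the numbers are the ones used in the two examples.

```python
import math


def odds_ratio_ci(b, se, units=1, z=1.96):
    """Return (OR, lower, upper) for an increase of `units` in x.

    Both the coefficient and its SE are multiplied by the number of units.
    """
    point = math.exp(units * b)
    lower = math.exp(units * (b - z * se))
    upper = math.exp(units * (b + z * se))
    return point, lower, upper


def wald_z(b, se):
    """Wald statistic for H0: b = 0."""
    return b / se


# Confidence limits for b1 = 1.1073, SE(b1) = 0.1707.
or_1, low_1, high_1 = odds_ratio_ci(1.1073, 0.1707)
print(f"1 unit:   OR = {or_1:.2f}, 95% CL = {low_1:.2f}, {high_1:.2f}")

# Same coefficient, for an increase of 10 units.
or_10, low_10, high_10 = odds_ratio_ci(1.1073, 0.1707, units=10)
print(f"10 units: OR = {or_10:.1f}, 95% CL = {low_10:.1f}, {high_10:.1f}")

# Wald test for b = 1.2163, SE(b) = 0.1752.
z_value = wald_z(1.2163, 0.1752)
print(f"z = {z_value:.2f}, chi-square = {z_value ** 2:.2f}")
```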
• Assumptions:
1) Linearity in the log(odds) scale
– If not linear: use dummy variables or quadratic terms
2) No multiplicative interaction
– E.g., the relative effect of x1 is independent of x2, or
– The effects of x1 and x2 are merely multiplicative (i.e., not “more, or less than, multiplicative”)
Note:
– This is the same assumption needed to calculate $OR_{MH}$
– If there is interaction, product terms can be introduced in the model to account for it
– Better still: do stratified analysis
3) Observations are independent
Analytic Techniques for Assessment of Relationships Between Exposures (x) and Outcomes (y) - I
Each entry lists the multivariate approach, followed by the adjusted measure of association.
• Type of study: any; type of outcome (y): continuous (e.g., BP)
– ANOVA: difference in means
– Linear regression: linear regression coefficient ($\beta$)
• Type of study: cross-sectional; type of outcome (y): diseased/non-diseased
– Direct adjustment: Prevalence Rate Ratio
– Indirect adjustment: Standardized Prevalence Ratio
– Mantel-Haenszel: Prevalence Odds Ratio (OR)
– Logistic regression: Prevalence OR
• Type of study: case-control; type of outcome (y): diseased/non-diseased
– Mantel-Haenszel: OR
– Logistic regression: OR
(Adapted from Szklo & Nieto, Aspen, 2000, p. 338)
Analytic Techniques for Assessment of Relationships Between Exposures (x) and Outcomes (y) - II
Each entry lists the multivariate approach, followed by the adjusted measure of association.
• Type of study: cohort; type of y: cumulative incidence by the end of follow-up
– Direct adjustment: Incidence Proportion Ratio
– Indirect adjustment: SIR
– Mantel-Haenszel: Probab. OR
– Logistic regression: Probab. OR
• Type of study: cohort; type of y: cumulative incidence, time-to-event data
– Cox Proportional Hazards Model: Hazard Rate Ratio
• Type of study: cohort; type of y: incidence rate per person-time
– Mantel-Haenszel: Rate Ratio
– Poisson regression: Rate Ratio
• Type of study: nested case-control; type of y: time-dependent disease status (time-to-event data taken into account by density sampling)
– Conditional logistic regression: Hazard Rate Ratio
• Type of study: case-cohort; type of y: time-dependent disease status (time-to-event data)
– Cox model with staggered entries: Hazard Rate Ratio
EPILOGUE:
• Stratification vs. Adjustment
– Advantage of stratification: best way to understand the data, and examine the possibility of interaction.
– Disadvantage of stratification: cumbersome if large number of variables.
• If you use multiple regression models, do not let the data make a fool of you: Look at the data!!
– Check the appropriateness of the model (Is it linear?)
– Watch for outliers
• Consider the possibility of residual confounding.
Causes of Residual Confounding
• Variables missing in model
• Categories of the variables included in the
model are too broad
• Confounding variables are misclassified
• Construct validity is not the same in
groups under comparison
Residual Confounding: Relationship Between Natural Menopause and Prevalent CHD, ARIC Study, Ages 45-64 Years, 1987-89
Model                                                                 Odds Ratio (95% CI)
1  Crude                                                              4.54 (2.67, 7.85)
2  Adjusted for age: 45-54 vs. 55+ (Mantel-Haenszel)                  3.35 (1.60, 6.01)
3  Adjusted for age: 45-49, 50-54, 55-59, 60-64 (Mantel-Haenszel)     3.04 (1.37, 6.11)
4  Adjusted for age: continuous (logistic regression)                 2.47 (1.31, 4.63)
EPILOGUE (Cont.)
• Statistical models and adjustment techniques can be used to
explore causal pathways (intermediate variables).
• Statistical models as “tools for science” rather than “laws of
nature”:
…Statistical models are sometimes misunderstood… Statistical
models are never true. The question whether a model is true is
irrelevant. A more appropriate question is whether we obtain the
correct scientific conclusion if we pretend that the process under
study behaves according to a particular statistical model.
(Zeger SL. Statistical reasoning in epidemiology. Am J Epidemiol 1991;134:1062-6)
