Statistical models to control extraneous factors


Confounders and Interactions:
An Introduction
1
An Example
• Data were collected from students in one department of a
university on the following variables:
– Number of theatre visits per month (z)
– Score in the final examination (y)
• The simple correlation coefficient ($r_{yz}$) between y and z was
calculated to be 0.20, which was statistically significant because
the sample size was moderately large.
• The same study was repeated for other departments in
the university. Each time the correlation was positive and significant.
• Interpretation: the more often you visit the theatre, the better
your results will be. This interpretation was hard to
believe.
2
An Example
(Continued)
• Statisticians were puzzled. After a long investigation
they found that the students who visited the theatre more often
were the more intelligent ones. They needed less time to study
and thus spent more time on other things.
• For the same set of students in the department, the IQ of
each student (x) was then measured. The computed
correlations were as follows:
$r_{xy} = 0.8$, $r_{yz} = 0.2$ and $r_{xz} = 0.6$.
• Still the paradox was not solved.
3
Solution - 1
• One statistician suggested the following:
– Let us fix IQ and compute the correlation coefficient
between y and z at each IQ level.
• This was not practicable as such: the sample size was
too small for such an analysis.
• The sample size was increased and the correlation
coefficient between y and z was found at each
IQ level.
• Each time the value was negative, but the values differed.
4
Solution - 2
• The effect of x was removed from both y and z,
and the correlation coefficient between y and z
was then found. It was negative.
• How do we eliminate the effect of x?
– We assume that linear relations exist between these
variables, i.e., $y = a + bx$ and $z = c + dx$ (apart from
the errors in the equations). The two regressions were fitted,
the residuals of y and z were computed, and the correlation
between the two sets of residuals was found. This is
the correlation coefficient between y and z after
eliminating the effect of x, and it was negative.
– This is known as the partial correlation coefficient.
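The residual procedure described above can be sketched in a few lines of Python. The data are synthetic and purely an assumption for illustration (they are not the students' data): IQ (x) raises both scores (y) and theatre visits (z), while visits themselves lower scores. The helper names `pearson` and `residuals` are hypothetical.

```python
import random

def mean(v):
    return sum(v) / len(v)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5

def residuals(v, x):
    """Residuals of the least-squares line v = intercept + slope * x."""
    mv, mx = mean(v), mean(x)
    slope = sum((p - mx) * (q - mv) for p, q in zip(x, v)) / sum((p - mx) ** 2 for p in x)
    intercept = mv - slope * mx
    return [q - (intercept + slope * p) for p, q in zip(x, v)]

# Synthetic data (assumed for illustration only)
random.seed(0)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]                           # IQ (standardised)
z = [0.6 * xi + random.gauss(0, 0.8) for xi in x]                    # theatre visits
y = [xi - 0.5 * zi + random.gauss(0, 0.5) for xi, zi in zip(x, z)]   # exam score

r_yz = pearson(y, z)                                  # simple correlation
r_yz_x = pearson(residuals(y, x), residuals(z, x))    # partial correlation given x
print(f"simple r_yz = {r_yz:.2f}, partial r_yz.x = {r_yz_x:.2f}")
```

With this construction the simple correlation comes out positive while the correlation between the residuals is negative, which mirrors the paradox in the example.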
5
Discussions
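• Fortunately, it is not necessary to do all these steps to find
the partial correlation coefficient. We can use the
following formula:
$$r_{yz.x} = \frac{r_{yz} - r_{xy}\, r_{xz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}}$$
• With $r_{xy} = 0.8$, $r_{yz} = 0.2$ and $r_{xz} = 0.6$, the result is
$r_{yz.x} = (0.2 - 0.48)/\sqrt{0.36 \times 0.64} \approx -0.58$. It is clearly a negative value.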
• Solution 1 gives a different estimate of the
correlation coefficient for each stratum.
• If we assume that the correlation coefficient is the same in
each stratum (i.e., at each fixed value of x), then the estimates will
be close to one another and close to $-0.58$ for this example.
• If (x, y, z) follows a trivariate normal distribution, then
theoretically the correlation between y and z will be the
same at each value of x.
• Thus Solution 1 does not need any distributional
assumptions but gives multiple answers, whereas Solution 2
gives a unique answer but is valid only under restrictive assumptions.
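As a check on the arithmetic, the standard first-order partial correlation formula can be evaluated with the three correlations given in the example (0.8, 0.2 and 0.6). The function name `partial_corr` is only an illustrative choice.

```python
import math

def partial_corr(r_yz, r_xy, r_xz):
    """Partial correlation of y and z given x, from the three pairwise correlations."""
    return (r_yz - r_xy * r_xz) / math.sqrt((1 - r_xy ** 2) * (1 - r_xz ** 2))

r = partial_corr(r_yz=0.2, r_xy=0.8, r_xz=0.6)
print(round(r, 2))  # -0.58
```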
6
Partial Correlation to Regression
• Correlations and regression coefficients are related. In the equation
$y = a + bx$, b is positive if and only if $r_{xy}$ is positive. Testing the
significance of b is the same as testing the significance of $r_{xy}$.
• In the equation $y = a + bx + cz$, c is positive if and only if $r_{yz.x}$ is
positive. Testing the significance of c is the same as testing the
significance of $r_{yz.x}$.
• If we want to find the relation between y and z, and the variable x
has an effect on both, then we should include both x and z as
regressors and proceed.
• This is why the regression coefficients in a multiple linear
regression are known as partial regression coefficients.
• Here x is called the confounding variable. Not every such third variable is a
confounding variable: a confounding variable should be a true
cause of variation in the explained variable.
7
Another Illustration of Confounding
• Diabetes is associated with hypertension.
• Does diabetes cause hypertension?
• Does hypertension cause diabetes?
• Another way in which diabetes and hypertension may be related is when
both variables are caused by FACTOR X. For hypertension and diabetes,
Factor X might be obesity.
• We should not conclude that diabetes causes hypertension; the two may
have no direct causal relationship at all. We should rather say that
the relationship between hypertension and diabetes is confounded by
obesity. Obesity would be termed a confounding variable in this
relationship.
8
Confounders are true causes of disease.
9
Definition of Confounding
• A confounder:
– 1) Is associated with exposure
– 2) Is associated with disease
– 3) Is NOT a consequence of exposure (i.e. not
occurring between exposure and disease)
10
MEDIATING VARIABLE (SYNONYM: INTERVENING VARIABLE)
EXPOSURE → MEDIATOR → DISEASE
AN EXPOSURE THAT PRECEDES A MEDIATOR IN
A CAUSAL CHAIN IS CALLED AN ANTECEDENT
VARIABLE.
11
Mediation
• A mediation effect occurs when the third variable (mediator,
M) carries the influence of a given independent variable (X)
to a given dependent variable (Y).
• Mediation models explain how an effect occurred by
hypothesizing a causal sequence.
12
Confounding Vs. Mediation
• In mediation, the exposure occurs first, then the mediator and
then the outcome (conceptually, mediation follows an
experimental design).
• Confounders are often demographic variables
that typically cannot be changed in an
experimental design. Mediators are by
definition capable of being changed and are
often selected based on flexibility.
13
Another Example: No Confounding
14
A Different Example
• A group of scientists wanted to find the effect of IQ and of the time
spent studying for the examination on the examination result. The
linear model they adopted was
$y_t = \alpha + \beta x_t + \gamma z_t + e_t$.
• They fitted the data and the fit was good. However, one of the
scientists noticed that the residuals did not show a random pattern
when the data were arranged in increasing order of IQ.
They then started investigating the behaviour of the data more
closely. They could do so because the sample size was large.
• They fixed the value of IQ at different points and plotted the scatter
diagram of result against study hours. Every time the scatter
diagram showed linear relation, but the slope changed every time
the value of IQ was changed. And surprisingly, it had a systematic
increasing pattern as the value of IQ increased.
15
The Revised Model
• Now look at the model again:
$y_t = \alpha + \beta x_t + \gamma z_t + e_t$.
• We interpret $\beta$ as the average change in y when
x is increased by one unit with z held fixed. But why should the
value of $\beta$ change when z is moved to some other fixed value?
Ideally the intercept parameter, $\alpha$, should absorb $\gamma z_t$, so the intercept
term should change and not the slope parameter.
• It means that the chosen model was wrong. If $\beta$ changes (increases) as z
increases, then $\beta$ is not a constant. We may replace $\beta$ by $(\beta + \delta z_t)$ and get
$y_t = \alpha + (\beta + \delta z_t) x_t + \gamma z_t + e_t$,
that is,
$y_t = \alpha + \beta x_t + \gamma z_t + \delta x_t z_t + e_t$.
• This phenomenon is known as the interaction effect between x and z. It is
symmetric: one arrives at the same model by letting the coefficient of $z_t$
vary with $x_t$.
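A minimal sketch of how the interaction shows up in data, assuming the model y = α + βx + γz + δxz + e with hypothetical coefficient values (α = 1, β = 2, γ = 0.5, δ = 0.3) and simulated observations. At each fixed z a simple regression of y on x is fitted; the fitted slope should track β + δz rather than stay constant.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

# Hypothetical "true" coefficients, chosen only for illustration
alpha, beta, gamma, delta = 1.0, 2.0, 0.5, 0.3

random.seed(1)
slopes = {}
for z in (0, 1, 2, 3):                        # hold z fixed at several values
    x = [random.uniform(0, 10) for _ in range(2000)]
    y = [alpha + beta * xi + gamma * z + delta * xi * z + random.gauss(0, 1) for xi in x]
    slopes[z] = ols_slope(x, y)

for z, s in slopes.items():
    print(f"z = {z}: fitted slope = {s:.2f}, beta + delta*z = {beta + delta * z:.2f}")
```

The slope on x grows steadily with z, which is the systematic pattern the scientists observed; an additive model without the product term cannot reproduce it.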
16
No interaction Vs. Interaction
• No Interaction: Disease increases with age and this
association is the same for both, male and female.
• Interaction: gender interacts with age if the effect of age
on disease is not the same in each gender.
17
Examples
• Aspirin protects against heart attacks, but only in men
and not in women. We say then that gender moderates
the relationship between aspirin and heart attacks,
because the effect is different in the different sexes. We
can also say that there is an interaction between sex
and aspirin in the effect of aspirin on heart disease.
• In individuals with high cholesterol levels, smoking
produces a higher relative risk of heart disease than it
does in individuals with low cholesterol levels.
Smoking interacts with cholesterol in its effects on
heart disease.
18
The Implications
• In the interaction model $y_t = \alpha + \beta x_t + \gamma z_t + \delta x_t z_t + e_t$, the
implication is that, when x or z is increased, there is an
additional change in the expected value of y apart from the linear
effect.
• If x is increased by one unit for fixed z, then the change in y is $\beta + \delta z_t$
instead of $\beta$ only, and if z is increased by one unit for fixed x, then
the change in y is $\gamma + \delta x_t$. If both x and z are increased by one unit,
then the change in y is $\beta + \gamma + \delta x_t + \delta z_t + \delta$.
• For binary variables taking only the values 0 and 1, the corresponding
changes in y are $\beta$, $\gamma$ and $\beta + \gamma + \delta$ respectively, assuming that x and z
both start at 0. This is clear from the following table:
Expected values of y at different values of x and z
         z = 0                 z = 1
x = 0    $\alpha$              $\alpha + \gamma$
x = 1    $\alpha + \beta$      $\alpha + \beta + \gamma + \delta$
19
The Implications
• Since y measures the effect (disease, say) of the exposures x and/or
z, the number of cases of y at each combination of exposure levels will
reflect this. The odds ratios will therefore differ across levels.
• Interaction between two variables (with respect to a response
variable) is said to exist when the association between one of these
variables (may be called the exposure variable) and the response
variable (generally measured by the odds ratio or relative risk) is
different at different levels of the other exposure variable.
• For example, the odds ratio that measures the association between
cigarette smoking and lung cancer may be smaller among
individuals who consume large quantities of beta carotene in their
food when compared to the analogous odds ratio among persons
who consume little or no beta carotene in their food.
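To make the idea of differing odds ratios concrete, here is a small sketch with entirely hypothetical counts (not taken from any study): a 2x2 table of smoking against lung cancer is given for each level of beta carotene intake, and the odds ratio is computed within each level.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b: diseased / non-diseased among the exposed
    c, d: diseased / non-diseased among the unexposed
    """
    return (a * d) / (b * c)

# Hypothetical counts: (smokers with cancer, smokers without,
#                       non-smokers with cancer, non-smokers without)
tables = {
    "low beta carotene": (60, 40, 20, 80),
    "high beta carotene": (30, 70, 15, 85),
}

ors = {level: odds_ratio(*counts) for level, counts in tables.items()}
for level, value in ors.items():
    print(f"{level}: odds ratio = {value:.2f}")
```

In these made-up tables the odds ratio is 6.0 in the low-intake group and about 2.43 in the high-intake group, so the smoking and lung cancer association depends on the level of the second variable, which is what interaction means here.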
20
THE INTERACTING OR EFFECT-MODIFYING
VARIABLE IS ALSO KNOWN AS A
MODERATOR VARIABLE
[Diagram: MODERATOR acting on the EXPOSURE → DISEASE relationship]
A moderator variable is one that moderates or modifies
the way in which the exposure and the disease are
related. When an exposure has different effects on
disease at different values of a variable, that variable is
called a modifier.
21
Methods to reduce confounding
– during study design:
• Randomization
• Restriction
• Matching
– during study analysis:
• Stratified analysis
• Mathematical regression
22
Randomized controlled trial
• Randomized controlled trial: A method where the study population is
divided randomly in order to mitigate the chances of self-selection by
participants or bias by the study designers. Before the experiment begins,
the testers will assign the members of the participant pool to their groups,
using a randomization process such as the use of a random number
generator.
• For example, in a study on the effects of exercise, the conclusions would be
less valid if participants were allowed to choose between the
control group, which would not exercise, and the intervention group, which
would take part in an exercise program. The study would then
capture other variables besides exercise, such as pre-experiment health
levels and motivation to adopt healthy activities. On the observer’s side,
the experimenter may choose candidates who are more likely to show the
results the study wants to see, or may interpret subjective results (more
energetic, positive attitude) in a way favorable to their desires.
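A minimal sketch of random assignment using a random number generator, as described above. The participant labels, the equal split into two groups and the function name `randomize` are assumptions for illustration.

```python
import random

def randomize(participants, seed=None):
    """Randomly split participants into intervention and control groups."""
    rng = random.Random(seed)      # a seed makes the allocation reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"intervention": shuffled[:half], "control": shuffled[half:]}

participants = [f"P{i:02d}" for i in range(1, 21)]   # hypothetical participant IDs
groups = randomize(participants, seed=42)
print(groups["intervention"])
print(groups["control"])
```

Because neither the participants nor the experimenter picks the groups, characteristics such as prior health or motivation are balanced between the groups on average.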
23
Case-Control Studies
• In a case-control study the researcher retrospectively determines
which individuals were exposed to the agent or treatment or the
prevalence of a variable in each of the study groups. The researcher
assigns confounders to both groups, cases and controls, equally. For
example, if somebody wanted to study the cause of myocardial
infarct and thinks that age is a probable confounding variable,
each 67-year-old infarct patient will be matched with a healthy
67-year-old "control" person. In case-control studies, the matched
variables are most often age and sex.
• Drawback: Case-control studies are feasible only when it is easy to
find controls, i.e., persons whose status vis-à-vis all known potential
confounding factors is the same as that of the case patient. Suppose
a case-control study attempts to find the cause of a given disease in a
person who is 1) 45 years old, 2) African-American, 3) from Alaska,
4) an avid football player, 5) vegetarian, and 6) working in
education. A theoretically perfect control would be a person who, in
addition to not having the disease being investigated, matches all
these characteristics and has no diseases that the patient does not
also have; finding such a control would be an enormous task.
24
A Hypothetical Example
25
Cohort studies
• Cohort studies: A group of people is chosen who do not have the outcome of
interest (for example, myocardial infarction). The investigator then measures a
variety of variables that might be relevant to the development of the condition.
Over a period of time the people in the sample are observed to see whether they
develop the outcome of interest (that is, myocardial infarction).
– Internal Controls: In single cohort studies those people who do not develop the
outcome of interest are used as internal controls.
– External Controls: Where two cohorts are used, one group has been exposed to or
treated with the agent of interest and the other has not, thereby acting as an external
control.
• A degree of matching is also possible in cohort studies: cohorts are formed from people
who share similar characteristics, so that all cohorts are comparable with regard to
the possible confounding variable. For example, if age and sex are thought to be
confounders, only males aged 40 to 50 would be included in a cohort study
that assesses the myocardial infarct risk in cohorts that are either physically
active or inactive.
• Drawback: In cohort studies, the over-exclusion of input data may lead researchers
to define too narrowly the set of similarly situated persons for whom they claim
the study to be useful. Similarly, "over-stratification" of input data within a study
may reduce the sample size in a given stratum to the point where conclusions drawn
from that stratum alone are no longer statistically reliable.
26
Double blinding
• Double blinding conceals from the trial
population and the observers the experiment group
membership of the participants. By preventing the
participants from knowing if they are receiving
treatment or not, the placebo effect should be the
same for the control and treatment groups. By
preventing the observers from knowing of their
membership, there should be no bias from
researchers treating the groups differently or from
interpreting the outcomes differently.
27
Stratification
• Stratification: As in the example above, physical
activity is thought to be a behaviour that protects
against myocardial infarct, and age is assumed to be a
possible confounder. The sampled data are then
stratified by age group; that is, the association
between activity and infarct is analyzed within
each age group. If the risk ratios within the age groups (or age
strata) are similar to one another but differ from the crude
(unstratified) risk ratio, age must be viewed as a confounding
variable; if they differ markedly from one another, age is
instead modifying the effect (interaction). There exist
statistical tools, among them Mantel–Haenszel
methods, that account for stratification of data sets.
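A minimal sketch of a Mantel–Haenszel pooled risk ratio for the activity and infarct example. The counts are hypothetical, and the age strata "younger" and "older" are assumed labels. The pooled estimate is the sum over strata of a·n0/n divided by the sum of c·n1/n, where a and c are the cases among exposed and unexposed, n1 and n0 are the exposed and unexposed totals, and n = n1 + n0.

```python
def risk_ratio(a, n1, c, n0):
    """Risk among the exposed (a / n1) divided by risk among the unexposed (c / n0)."""
    return (a / n1) / (c / n0)

def mantel_haenszel_rr(strata):
    """Mantel-Haenszel pooled risk ratio over strata given as (a, n1, c, n0)."""
    num = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata)
    den = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata)
    return num / den

# Hypothetical counts per age stratum:
# (infarcts among active, active total, infarcts among inactive, inactive total)
strata = {
    "younger": (8, 800, 4, 200),
    "older": (10, 200, 80, 800),
}

for name, counts in strata.items():
    print(f"{name}: risk ratio = {risk_ratio(*counts):.2f}")

# Crude (unstratified) table: add the strata together
totals = [sum(col) for col in zip(*strata.values())]
crude = risk_ratio(*totals)
rr_mh = mantel_haenszel_rr(list(strata.values()))
print(f"crude = {crude:.2f}, Mantel-Haenszel = {rr_mh:.2f}")
```

In these made-up data the risk ratio is 0.5 within each age group, yet the crude ratio is about 0.21, because the active group is mostly young; the pooled estimate recovers 0.5.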
28
Stratification of Confounding Variable
• While ascertaining association between 2 factors, we have Exposure
and disease
– Both Discrete: 2 levels of exposure/disease: 2x2 table
– Both Discrete: More levels of exposure/disease: r x c
– Level of disease continuous and exposure discrete or continuous: Usual
regression
– Level of disease discrete and exposure discrete or continuous:
Regression, but needs special attention
• A 3rd variable is considered: May be considered as an additional
regressor variable or one may use stratification
– Repeat analysis within every level of that variable
– E.g. gender, age, breed, farm etc.
• Stratification solves the problem of confounding as well as
interaction
29
The Problem with Stratification as a
Solution to Confounding
• Stratification sometimes may cause bias. Consider the situation of a
pair of dice, die A and die B. Of course, you know that they must be
independent. In other words, if you roll one, it tells you nothing
about the roll of the other. What if we stratify upon the sum of the
dice?
• What happens if we stratify? Let’s look in the stratum where the
sum is, for example, 7. In this stratum, if we know A (say, 1) then
we know B. If A is 3, B must be 4.
• Earlier, we said that A and B were independent. Now, however,
once we stratify upon the sum, if we know A, we know B. We have
induced a relationship between A and B that otherwise did not exist.
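The dice argument can be verified by enumerating all 36 equally likely outcomes. This is a small sketch; the helper name `pearson` is an illustrative choice.

```python
from itertools import product

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5

rolls = list(product(range(1, 7), repeat=2))          # all (A, B) pairs
r_all = pearson([a for a, _ in rolls], [b for _, b in rolls])

stratum = [(a, b) for a, b in rolls if a + b == 7]    # stratify on sum = 7
r_sum7 = pearson([a for a, _ in stratum], [b for _, b in stratum])

print(f"all rolls: r = {r_all:.2f}")
print(f"sum == 7 : r = {r_sum7:.2f}")
```

Across all rolls the correlation is zero, but inside the stratum where the sum is 7 it is exactly −1: conditioning on a common consequence of A and B manufactures an association between them.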
30
Holding the Extraneous Variable Constant
• For example, if you want to control for gender using
this strategy, you would include only females in your
research study (or only males). If there is still a
relationship between the variables of interest, say
motivation and test grades, you will be
able to tell that the relationship is not due to gender
because you have made it a constant (by
including only one gender in your study).
31
Statistical Control
• Statistical Control: It is based on the following logic:
examine the relationship between the variables at each level
of the control/extraneous variable. In practice the software
does this for you, but that is the underlying idea.
• One type of statistical control is called partial correlation.
This technique shows the correlation between two
quantitative variables after statistically controlling for one
or more quantitative control/extraneous variables.
• A second type of statistical control is called ANCOVA (or
analysis of covariance). This technique shows the
relationship between a categorical independent variable and a
quantitative dependent variable after statistically
controlling for one or more quantitative control/extraneous
variables.
32
Thank you
33