Correlation, Causation and Counterfactual Theory
Using causal graphs to understand bias in the medical literature
About these slides
This presentation was created for the Boston Less Wrong Meetup by Anders
Huitfeldt (Anders_H)
I have tried to optimize for intuitive understanding, not for technical
precision or mathematical rigor
The slides are inspired by courses taught by Miguel Hernan and Jamie
Robins at the Harvard School of Public Health
Directed Acyclic Graphs
This is a Directed Acyclic Graph:
The nodes (letters) are variables
The graph is “Directed” because the arrows have direction
It is “Acyclic” because, if you move in the direction of the arrows, you can never
get back to where you began
We use these graphs to represent the assumptions we are making about the
relationships between the individual variables on the graph
Directed Acyclic Graphs
We can use DAGs to reason about which statements about independence
are logical consequences of other statements about independence
The rules for this type of reasoning are called “D-Separation” (Pearl, 1987)
It is possible to do the same thing using algebra, but D-Separation saves a
lot of time and energy
Directed Acyclic Graphs
This DAG is complete
There is a direct arrow between every pair of variables on the graph
This means we are not making any assumptions about independence
between any two variables
Directed Acyclic Graphs
On this DAG, there are missing arrows
Each missing arrow corresponds to assumptions about independence
Specifically, when arrows are missing, we assume that every variable is independent of
its past (its non-descendants), given its parents
Other independencies may follow automatically
Directed Acyclic Graphs
There is a “path” between two variables if you can get from one variable to
the other by following arrows. The direction of the arrows does not matter
for determining whether a path exists (but does matter for whether it is open).
We can tell whether two variables are independent by checking whether
there is an open path between them
Colliders and Non-Colliders
A path from A to C via B could be of four different types:
A → B → C
A ← B ← C
A ← B → C
A → B ← C
The last of these types is different from the others: We call B a “Collider” on this path
Notice how the arrows from A and C “Collide” in B
On the three other types of paths, B is a “Non-Collider”
Note that the concept of “Collider” is path-dependent: B could be a collider on one path, and a non-collider on another path
Conditioning
If we look at the data within levels of a covariate, that covariate has been
“Conditioned on”
We represent that by drawing a box around the variable on the DAG
The Rules of D-Separation
If variables have not been conditioned on:
Non-Colliders are open
Colliders are closed (unless a downstream consequence is conditioned on)
If variables have been conditioned on:
Non-colliders that have been conditioned on are closed
Colliders that have been conditioned on are open
Colliders that have downstream consequences that have been conditioned on are open
The Rules of D-Separation
A path from A to B is open if all variables between A and B are open on
that path
Two variables are d-separated (independent) if all paths between them are
closed
Two variables A and B are d-separated conditional on a third variable, if
conditioning on the third variable closes all paths between A and B
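The rules above can be checked numerically. The following simulation is my own illustrative sketch (not from the slides): a chain A → B → C, where the non-collider B leaves the path open until we condition on it, and a collider A → B ← C, where the path is closed until we condition on B. Conditioning is approximated by restricting attention to a slice of B.

```python
# Illustrative d-separation check; variable names and coefficients are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Chain A -> B -> C: the path through the non-collider B is open...
a = rng.normal(size=n)
b = a + rng.normal(size=n)
c = b + rng.normal(size=n)
corr_chain = np.corrcoef(a, c)[0, 1]            # clearly nonzero

# ...but conditioning on B (restricting to a narrow slice of B) closes it.
mask = np.abs(b) < 0.1
corr_chain_given_b = np.corrcoef(a[mask], c[mask])[0, 1]  # near zero

# Collider A -> B <- C: the path is closed marginally...
a2 = rng.normal(size=n)
c2 = rng.normal(size=n)
b2 = a2 + c2 + rng.normal(size=n)
corr_collider = np.corrcoef(a2, c2)[0, 1]       # near zero

# ...but conditioning on the collider B opens it, inducing correlation.
mask2 = b2 > 1.0
corr_collider_given_b = np.corrcoef(a2[mask2], c2[mask2])[0, 1]  # negative
```

Conditioning on the collider induces a negative correlation here because, once we know the sum is large, a small A must be compensated by a large C, and vice versa.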
Causal DAGs
We can give the DAG a causal interpretation if we are willing to assume
That the variables are in temporal (Causal) order
And that whenever two variables on our graph share a common ancestor, that
ancestor is also shown on the graph
If we have a Causal DAG, we can use it as a map of the data generating
mechanism:
Causal DAGs can be interpreted to say that if we change the value of A,
that “change” will propagate to other variables in the direction of the
arrows
Causal DAGs
All scientific papers make assumptions about the data generating
mechanism. Causal DAGs are simply a very good way of being explicit
about those assumptions. We use them because:
They assure us that our assumptions correspond to a plausible, logically
consistent causal process
They make it easy to verify that our analysis matches the assumptions
They give us very precise definitions of different types of bias
They make it much easier to think about complicated data generating
mechanisms
Causal DAGs
Note that we can never know what the data generating mechanism
actually looks like
The best we can do is make arguments that our map fits the territory
Sometimes it is very obvious that the map does not match the territory.
Causal Inference
A pathway is causal if every arrow on the path is in the forward direction
If I intervene to change the value of A, this will lead to changes in all
variables that are downstream from A
The goal of causal inference is to predict how much the outcome Y will
change if you change A
In other words, we are quantifying the magnitude of the combination of all
forward-going pathways from A to Y
If we have data from the observed world, and we know that the only open
pathway from exposure to the outcome is in the forward direction, then the
observed correlation is purely due to causation
Bias
However, if there exists any open pathway between the exposure and the
outcome where one or more of the arrows is in the backward direction,
there is bias
Open pathways that have arrows in the backward direction will lead to
correlation in the observed data
But that correlation will not be reproduced in an experiment where you
intervene to change the exposure
The two main types of bias are confounding and selection bias
Confounding is a question of who gets exposed
Selection bias is a question of who gets to be part of the study
Confounding
Confounding bias occurs when there is a common cause of the exposure
and the outcome
You can check for it using the “Backdoor Criterion”
If there exists an open path between A and Y that goes into A (as opposed
to out from A), we call that a “Backdoor path”
A backdoor path between A and Y will always have an arrow in the
backwards direction
Example of a DAG with Confounding
Confounding
Notice that if we had randomized people to be smokers or non-smokers,
the arrow from Sex to Smoking could not exist
We would know it didn’t exist, because the only cause of smoking is our
random number generator
Therefore, there could be no confounding
The best way to abolish confounding is to randomize exposure. However,
this is expensive, and is usually not feasible
Controlling for Confounding
There are many ways to control for confounding if the data is observational
instead of experimental
Standard analyses (stratification, regression, matching) are based on looking
at the effect within levels of a confounder
If we do this, we put a box around the confounder on the DAG
This closes the backdoor path
If we condition on all the confounders, the only open pathways will be in
the forward direction, and all remaining correlation between the exposure
and the outcome is due to causation
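As a sketch of how conditioning on a confounder closes the backdoor path, here is a toy simulation of my own (the structure and numbers are illustrative, not from the slides): a binary confounder L causes both the exposure A and the outcome Y, and the true effect of A on Y is 1.0. The crude contrast is biased by the open path A ← L → Y; the stratified contrast is not.

```python
# Illustrative confounding setup: L -> A, L -> Y, true effect of A is 1.0.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

l = rng.binomial(1, 0.5, n)                       # confounder
a = rng.binomial(1, np.where(l == 1, 0.8, 0.2))   # L -> A (the backdoor arrow)
y = 1.0 * a + 2.0 * l + rng.normal(size=n)        # L -> Y and A -> Y

# Crude (unadjusted) contrast: biased upward by the backdoor path A <- L -> Y.
crude = y[a == 1].mean() - y[a == 0].mean()

# Stratified contrast: the effect within each level of L ("a box around L"),
# then averaged over the distribution of L. This closes the backdoor path.
strata = [
    y[(a == 1) & (l == v)].mean() - y[(a == 0) & (l == v)].mean()
    for v in (0, 1)
]
adjusted = (l == 0).mean() * strata[0] + (l == 1).mean() * strata[1]
```

With these numbers the crude contrast lands near 2.2 while the stratified contrast recovers the true effect of 1.0.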
Controlling for Confounding
An alternative way to control for confounding is to simulate a world where
there are no arrows into treatment
We do this by weighting all observations by the inverse probability of
treatment, given the confounders.
We can represent this on the DAG by abolishing the arrows from the
confounders to the treatment (in contrast to drawing a box around the
confounder)
In this simulated world we can run any analysis we want without
considering confounding
There are situations where this type of analysis is valid, whereas all
conditioning-based methods such as regression are biased.
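The weighting approach can be sketched in the same illustrative setup as above (names and probabilities are mine, not from the slides). Each person is weighted by one over the probability of the treatment they actually received, given the confounder; in the resulting pseudo-population the arrow from L into A is abolished.

```python
# Illustrative inverse probability of treatment weighting (IPW).
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

l = rng.binomial(1, 0.5, n)
p_a = np.where(l == 1, 0.8, 0.2)                 # P(A = 1 | L), known here
a = rng.binomial(1, p_a)
y = 1.0 * a + 2.0 * l + rng.normal(size=n)       # true effect of A is 1.0

# Weight: 1 / P(observed treatment | L).
w = np.where(a == 1, 1 / p_a, 1 / (1 - p_a))

# Weighted means estimate the outcome under "everyone treated" vs
# "no one treated" in the simulated world with no arrows into treatment.
mean_treated = np.average(y[a == 1], weights=w[a == 1])
mean_untreated = np.average(y[a == 0], weights=w[a == 0])
ipw_effect = mean_treated - mean_untreated       # close to 1.0
```

In real data the propensity P(A = 1 | L) would itself have to be estimated, for example with a logistic regression.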
Controlling for Confounding
Before you choose to control for a variable, make sure it actually is a
confounder
If you control for something that is not a confounder, you can introduce
bias
For example, this can happen if you control for a causal intermediate
Controlling for Confounding
Make sure you never control for anything that is causally downstream from
the exposure
For example, in this situation, the investigators want to find the effect of
eating at McDonalds on the risk of Heart Attacks. They have controlled for
BMI
This introduces bias by blocking part of the effect we are interested in
M-Bias
Just because a variable is pre-treatment and correlated with the outcome
does not make it safe to control for
In fact, sometimes controlling for a pre-treatment variable introduces bias.
M-Bias
Consider the following DAG:
You want to estimate the effect of smoking on cancer
Should you control for Coffee Drinking or not?
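One way to build intuition is to simulate the classic M-structure. The following assumes the standard version of this example, with two unmeasured causes U1 and U2 (U1 affects Smoking and Coffee Drinking, U2 affects Coffee Drinking and Cancer) and no true effect of smoking; the names and coefficients are illustrative, not taken from the slides.

```python
# Illustrative M-bias simulation: true effect of smoking on cancer is zero.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

u1 = rng.normal(size=n)                     # unmeasured
u2 = rng.normal(size=n)                     # unmeasured
coffee = u1 + u2 + rng.normal(size=n)       # collider between U1 and U2
smoking = u1 + rng.normal(size=n)
cancer = u2 + rng.normal(size=n)            # smoking has NO effect on cancer

# Unadjusted: no open path from Smoking to Cancer, estimate near zero.
X0 = np.column_stack([np.ones(n), smoking])
unadjusted = np.linalg.lstsq(X0, cancer, rcond=None)[0][1]

# Adjusting for the pre-treatment variable Coffee opens the path
# Smoking <- U1 -> [Coffee] <- U2 -> Cancer and introduces bias.
X1 = np.column_stack([np.ones(n), smoking, coffee])
adjusted = np.linalg.lstsq(X1, cancer, rcond=None)[0][1]
```

Coffee Drinking is pre-treatment and correlated with the outcome, yet controlling for it manufactures a spurious smoking effect out of a true null.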
Selection Bias
Selection bias occurs when the investigators have accidentally
conditioned on a collider
Selection Bias
Imagine you are interested in the effect of Socioeconomic Status on
Cancer
Since it is easier to get an exact diagnosis at autopsy, you decide to enroll
only people who had an autopsy in your study
This means you are looking at the effect within a single level of autopsy:
“Autopsy = 1”
The variable has been conditioned on
People of low socioeconomic status are less likely to have an autopsy
People with cancer are also less likely to have an autopsy.
Selection Bias
There is now an open pathway from Socioeconomic Status to Cancer with a backward arrow:
Socioeconomic Status → [Autopsy] ← Cancer
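The autopsy example can be simulated directly. In this sketch (my own illustrative numbers, not from the slides) socioeconomic status and cancer are truly independent, and each independently lowers the chance of an autopsy; restricting the study to autopsied patients conditions on that collider.

```python
# Illustrative selection-bias simulation for the autopsy example.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

ses = rng.binomial(1, 0.5, n)        # 1 = high socioeconomic status
cancer = rng.binomial(1, 0.2, n)     # no arrow between SES and Cancer

# Both arrows point INTO Autopsy: low SES and cancer each reduce its chance.
p_autopsy = 0.2 + 0.3 * ses - 0.15 * cancer
autopsy = rng.binomial(1, p_autopsy)

# In the full population, SES and cancer are uncorrelated...
corr_full = np.corrcoef(ses, cancer)[0, 1]

# ...but enrolling only Autopsy = 1 conditions on the collider and opens
# the path SES -> [Autopsy] <- Cancer.
sel = autopsy == 1
corr_selected = np.corrcoef(ses[sel], cancer[sel])[0, 1]   # now positive
```

A study run only on the autopsied subgroup would report an SES–cancer association that simply does not exist in the population.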
Evaluating a Scientific Paper
If you are given a paper, and you want to know if the claims are likely to be
true:
1. First, make sure they are addressing a well-defined causal question
2. Look at the analysis section and determine what map the authors have of the data generating mechanism
3. Ask yourself whether you think the implied map captures the important features of the territory
Is there confounding that has not been accounted for? Did the authors accidentally
condition on any variables to cause selection bias?
Evaluating a Scientific Paper
Example:
Prof Yudkowsky wants to estimate the effect of reading HPMOR, on the
probability of defeating dark lords
He controls for sex
Evaluating a Scientific Paper
1. Draw the DAG that Prof Yudkowsky had in mind when he conducted this analysis
2. Do you think this DAG captures the most important aspects of the territory?
Time-Dependent Confounding
In many situations, exposure varies with time
We can picture this as having an exposure variable for every time point,
labelled A0, A1, A2, etc.
There may also be time-dependent confounding by L0, L1 and L2
Time-Dependent Confounding
On this graph, L1 confounds the effect of A1 on Y
However, it is also on the causal pathway from A0 to Y
Do you control for it or not?
Time-Dependent Confounding
In this situation, all stratification-based approaches, such as regression or
matching, are biased.
This is because these methods put a box around the variable L1, blocking
part of the effect we are studying
Methods for controlling for confounding that do not rely on conditioning on L1
are still valid
This includes inverse probability weighting (marginal structural models), the
parametric g-formula and G-Estimation
Time-Dependent Confounding
Time-dependent confounding is very common in real data-generating
mechanisms
Consider the following scenario:
If I don’t take my pills this year, my health is likely to decline next year
If my health has declined, I am more likely to take my pills next year
Health predicts my risk of death
In this situation, it is impossible to obtain the effect of pills on mortality
without using generalized (non-stratification-based) methods
This is true whenever there is a feedback loop like the one described here
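The dilemma around L1 can be made concrete with a small simulation (my own illustrative structure, not from the slides): A0 is randomized, the whole effect of A0 on Y runs through L1, and an unmeasured U affects both L1 and Y. Conditioning on L1 then both blocks the A0 → L1 → Y pathway and opens the collider path A0 → [L1] ← U → Y.

```python
# Illustrative time-dependent-confounding problem: conditioning on L1 biases
# the estimate of the (true, 1.0) effect of the early exposure A0.
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

u = rng.normal(size=n)                        # unmeasured cause of L1 and Y
a0 = rng.binomial(1, 0.5, n).astype(float)    # randomized early exposure
l1 = a0 + u + rng.normal(size=n)              # A0 -> L1 <- U
y = l1 + u + rng.normal(size=n)               # all of A0's effect runs via L1

# Crude contrast: unbiased here, because A0 was randomized.
crude = y[a0 == 1].mean() - y[a0 == 0].mean()

# "Adjusting" for L1 in a regression blocks part of the effect AND opens
# the collider path A0 -> [L1] <- U -> Y; the A0 coefficient is badly biased.
X = np.column_stack([np.ones(n), a0, l1])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted_a0 = coef[1]                         # roughly -0.5 in this setup
```

Because A0 is randomized here, the crude contrast happens to be valid for A0 alone; estimating the joint effect of a treatment sequence like (A0, A1), where L1 also confounds A1, is exactly the situation that requires the weighting and g-formula methods described above.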
Time-Dependent Confounding
There are many alternative “causal” models that do not recognize time-dependent confounding
These models work fine if exposure is truly something that does not vary with
time
However, that is very rarely the case
If we are not trained to draw maps that recognize this important feature of
the territory, we will end up assuming that it does not exist
This is often a bad assumption
Further Reading
If you are a mathematician or computer scientist, and want a very formal
understanding of the theory:
Judea Pearl. Causality: Models, Reasoning, and Inference. (Cambridge
University Press, 2000)
If you are not a mathematician, but want to understand how to apply causal
methods to analyze observational data:
Miguel Hernan and James Robins. Causal Inference. (Chapman & Hall/CRC,
2013)
Most of the book is available for free at http://www.hsph.harvard.edu/miguelhernan/causal-inference-book/