Amy Wagaman's seminar on Interactions in 2004


Interactions:
Types, Tests and Dangers
By Amy Wagaman
Motivation


When trying to find the “right”
treatment for a patient, researchers
want to know if “treatment effects are
homogeneous over various subsets of
patients defined by prognostic factors.”
(Gail and Simon 1985: 361).
So, the logical thing to do is to
investigate potential interactions.
Types of Interactions


Qualitative Interaction: the direction of true
treatment differences varies among subsets
of patients – also called crossover
interaction
Quantitative Interaction: variation in the
magnitude but NOT direction of treatment
effects among patient subgroups – also
called a non-crossover interaction
Illustration of Interactions
2
The deltas are
true treatment
effects/efficacy
by subgroup.
This example is for 2
subgroups, say men and
women. The yellow
regions are regions of
qualitative interaction.
1  0, 2  0
1
1  0, 2  0
Why Qualitative Interactions?


A qualitative interaction indicates that a
treatment is harmful for one subgroup but
beneficial for another. This is very useful
information when deciding on what
treatment to assign a particular patient.
The problem comes in identifying qualitative
interactions.
Continued



Qualitative interactions are less likely to exist
than quantitative interactions.
Qualitative interactions found in one trial
are often not reproduced in similar trials.
“We regard observed qualitative interactions
with skepticism for they are often shown to
be spurious when the same comparison is
made in similar trials.”
(Yusuf 1991: 94)
Why not Quantitative Interactions?


If a treatment is effective (significant
positive treatment effect) for all subgroups,
but some benefit perhaps more than others,
a clinician will still prescribe that treatment
for everyone.
Thus, it is argued that little attention needs
to be paid to this type of interaction.
Continued


“Quantitative interactions are to be
expected, but may not be important
clinically.” (Gail and Simon 1985: 362).
“I am almost certain a priori that a
quantitative interaction will exist between a
treatment and any categorization of
patients which subdivides them into groups
with materially different survival
expectancy.” (Peto 1995: 1043).
Qual. Versus Quan.

“In summary, quantitative interactions are a
priori very plausible, but qualitative
interactions are not and, when the overall
treatment effects are not overwhelming,
trials can be expected to generate a number
of apparent qualitative interactions even if no
interactions at all exist.” (Peto 1995: 1043).
A Testing Hurdle

“All standard statistical tests for interaction
are tests for quantitative interaction and
significant results in them do not constitute
any kind of evidence for the existence of
qualitative interactions, unless in addition
there were strong prior scientific reasons for
anticipating qualitative interactions.”
(Peto 1995: 1043)
A Test for Qual. Interactions



In 1985, Gail and Simon developed a likelihood ratio test (LRT)
for qualitative interactions.
This procedure is often used as the test for
qualitative interactions.
However, it has several assumptions:


The subsets/subgroups ought to be disjoint
The subgroups must be specified in advance

“Unless such a prespecification is made, it is unlikely
that sufficient numbers of patients will be available in
all subsets for a meaningful assessment of
interactions.” (Gail and Simon 1985: 366)
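A minimal sketch of the test, assuming the form of the statistic as
I understand it from Gail and Simon (1985): with k disjoint subsets,
effect estimates D_i and standard errors s_i, compute
Q- = sum of (D_i/s_i)^2 over negative D_i and Q+ likewise over
positive D_i, and refer min(Q-, Q+) to a binomial mixture of
chi-square tails. The function names are mine, and the estimates are
assumed independent and approximately normal with known standard
errors.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Closed form for the smallest case: df = 1 (odd) or df = 2 (even).
    if df % 2 == 1:
        a, tail = 0.5, math.erfc(math.sqrt(y))
    else:
        a, tail = 1.0, math.exp(-y)
    # Step up with Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail


def gail_simon(estimates, std_errors):
    """Return (statistic, p-value) for the qualitative-interaction test."""
    pairs = list(zip(estimates, std_errors))
    q_minus = sum((d / s) ** 2 for d, s in pairs if d < 0)
    q_plus = sum((d / s) ** 2 for d, s in pairs if d > 0)
    q = min(q_minus, q_plus)
    k = len(pairs)
    # Mixture of chi-square tails with Binomial(k - 1, 1/2) weights.
    p = sum(
        math.comb(k - 1, h) * 0.5 ** (k - 1) * chi2_sf(q, h)
        for h in range(1, k)
    )
    return q, p


# Hypothetical estimates for two prespecified, disjoint subgroups.
stat, p_value = gail_simon([0.9, -0.6], [0.3, 0.4])
print(stat, p_value)
```

Note that a large statistic requires strong evidence in BOTH
directions; one clearly positive subgroup and one weakly negative
subgroup gives a small min(Q-, Q+).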
Issues of Statistical Power



Based on work by Cohen, the estimated N
under optimal study conditions is 128 for
80% power to detect a medium-sized
interaction. For a small-sized interaction, the
required sample size is 780.
A review examined 55 studies
that tested for interactions.
Only 18 of the 55 studies had samples large
enough for 80% power to detect a medium-sized
interaction, and only 3 of the 55 for a small-sized one.
(Moyer 2001)
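For a rough feel of where numbers of this size come from, here is a
sketch using the normal approximation for comparing two equal-sized
groups. It is NOT the calculation used in the review: the effect
sizes 0.5 ("medium") and 0.2 ("small") are Cohen's conventional
standardized differences, assumed here for illustration, and the
approximation gives values close to, not identical to, 128 and 780.

```python
import math
from statistics import NormalDist


def total_n(effect_size, alpha=0.05, power=0.80):
    """Approximate total N (two equal groups, two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    per_group = math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)
    return 2 * per_group


for label, d in (("medium", 0.5), ("small", 0.2)):
    print(label, total_n(d))
```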
Another Statistical Issue



It so happens (see later slides) that people
often perform MANY tests for interaction for
any given study.
This helps fuel the suspicion that in many
cases, researchers are finding spurious
interactions – they are capitalizing on type 1
error.
Extreme example: If you ran 567 tests for
interaction, you would almost CERTAINLY
find at least one significant interaction by chance alone.
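The arithmetic behind this can be sketched as follows, assuming
independent tests, each at level 0.05, with no true interaction
anywhere (both are simplifying assumptions; tests on the same data
are usually correlated).

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """P(at least one significant result) when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


ALPHA = 0.05
for n in (1, 10, 567):
    p_any = prob_any_false_positive(n, ALPHA)
    expected = n * ALPHA  # expected number of spurious findings
    print(n, p_any, expected)
```

With 10 tests the chance of at least one spurious finding is already
about 40%; with 567 it is indistinguishable from 1.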
Problems with Subgroups


Very few studies are hypothesis-driven with
prespecified subgroups where a potential
interaction would make sense.
If you use the data to “help” identify
subgroups across which to look for an
interaction, you’re getting into somewhat
“fishy” territory. Why wouldn’t you expect an
interaction across such subgroups?
Subgroup Definitions


Proper subgroup: “a group of patients
characterized by a common set of ‘baseline’
parameters”
Improper subgroup: “a group of patients
characterized by a variable measured after
randomization and potentially affected by
treatment”
(Yusuf 1991: 93)
Another Subgroup Problem


It can be VERY misleading to look for
interactions among improper subgroups.
This is because a treatment effect may have
contributed to assignment to a subgroup.
Misinterpretations



Two types of misinterpretations
Misinterpretation of significant interactions
Misinterpretation of non-significant but
surprising interactions
Example of Abuse of Test

In an analysis of data from the Beta-Blocker
Heart Attack Trial, researchers tested for an
interaction (using the Gail-Simon test)
between “dominant” and “divergent” centers.
There were 31 centers (21 dominant, 10
divergent).



Dominant means mortality rate higher for
placebo.
Divergent means mortality rate higher for the
treatment – propranolol.
Note that the subgroups were chosen using a
study outcome.
(Horwitz 1996)
Ensuing Discussion



Senn and Harrell point out the “error” in the
Horwitz paper.
“The ‘significant’ result … says absolutely
nothing about the trial in question and
everything about the practice of defining
groups on the basis of extreme values after
the results are in.” (Senn and Harrell 1997:
749)
Picking subsets based on observed event
rate differences is a serious violation of
statistical assumptions.
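A small simulation sketch of why this goes wrong. Everything here
is hypothetical: 31 centers whose observed treatment differences are
pure noise (true effect 0, unit variance), split after the fact into
"positive" and "negative" groups. The cutoff 2.71 is the 5% critical
value for two subsets, obtained from 0.5 * P(chi-square(1) > c) = 0.05.

```python
import random


def one_trial(rng, n_centers=31):
    """Naive z-scores for two groups defined by the observed outcome."""
    # Observed center-level differences; the TRUE effect is 0 everywhere.
    diffs = [rng.gauss(0.0, 1.0) for _ in range(n_centers)]
    pos = [d for d in diffs if d > 0]
    neg = [d for d in diffs if d < 0]
    if not pos or not neg:
        return None
    # Group mean divided by the pre-selection SE of 1/sqrt(n).
    z_pos = (sum(pos) / len(pos)) * len(pos) ** 0.5
    z_neg = (sum(neg) / len(neg)) * len(neg) ** 0.5
    return z_pos, z_neg


def share_flagged(n_sims=2000, seed=1, critical=2.71):
    """Fraction of null simulations declared a qualitative interaction."""
    rng = random.Random(seed)
    flagged = total = 0
    for _ in range(n_sims):
        result = one_trial(rng)
        if result is None:
            continue
        total += 1
        z_pos, z_neg = result
        if min(z_pos ** 2, z_neg ** 2) > critical:
            flagged += 1
    return flagged / total


print(share_flagged())
```

A valid 5% test would flag about 1 in 20 of these null data sets;
defining the groups by the outcome flags most of them.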
How Widespread is the Problem?




A review was done to examine 55 studies
that tested for interactions.
30 of those 55 studies found at least one
significant interaction.
The mean number of tests performed by
those 30 studies was 61 tests (median 16,
range 3-567).
The mean number performed by the other 25
studies was 21 tests (median 7, range 1-186).
Widespread Continued

Only TWO out of those 55 studies met the
following criteria:




Hypothesis-driven
Sufficient statistical power to detect medium-size
interactions
Random assignment of patients to treatments
Conducted 10 or fewer tests for interactions
Term Clarification



The term “risk index” in this context is a
misnomer.
Risk indices are used to predict
outcomes, or to look for susceptible groups
where new treatments are needed.
My work involves deciding between treatments
for patients, not predicting an outcome. In
that sense, I am
looking for a “tailoring variable”. It’s
discrimination versus prediction.
Per Danny’s email and discussion with Susan
Implications for Tailoring Variables


Looking for tailoring variables involves looking
for subgroups of patients with similar
characteristics such that the direction of
treatment effect differs across the subgroups,
so that you would want to assign one
treatment to one group and another to a
different group. You could also add a timing
issue, i.e. when to switch.
Problem: This is essentially looking for
qualitative interactions among unprespecified
subgroups.
Consider Quant. Interactions?
Assume the top line is a very
intensive and costly treatment,
while the middle is a less-intensive/cheaper one, with
the bottom being a control
group. The y-axis is treatment
effect, and the x-axis is some
baseline variable.
Based on a talk with Danny
Bibliography
Gail, M. and R. Simon. Testing for Qualitative Interactions between Treatment
Effects and Patient Subsets. Biometrics. Vol. 41 No. 2 (June 1985): 361-372.
Green, Sylvan B. Design of Randomized Trials. Epidemiologic Reviews. Vol. 24 No. 1
(2002): 4-11.
Horwitz, et al. Can Treatment that is Helpful on Average be Harmful to Some
Patients?… Journal of Clinical Epidemiology. Vol. 49 No. 4 (1996): 395-400.
Moyer, et al. Can Methodological Features Account for Patient-Treatment Matching
Findings in the Alcohol Field? Journal of Studies on Alcohol. Vol. 62 No. 1 (Jan.
2001): 62-82.
Peto, R. Clinical Trials. In Treatment of Cancer. Editors: Price, Sikora, and Halnan.
Chapman and Hall Medical: New York. (1995): 1039-1043.
Senn, Stephen and Frank Harrell. On Wisdom after the Event. Journal of Clinical
Epidemiology. Vol. 50 No. 7 (1997): 749-751.
Vach, et al. Neural Networks and Logistic Regression: Part II. Computational
Statistics and Data Analysis. Vol. 21 (1996): 683-701.
Yusuf, et al. Analysis and Interpretation of Treatment Effects in Subgroups of
Patients in Randomized Clinical Trials. Journal of the American Medical
Association. Vol. 266 No. 1 (1991): 93-98.