Introduction of Epidemiology


Analysis and presentation of
Case-control study data
Chihaya Koriyama
February 14 (Lecture 1)
Study design in epidemiology
• Observational study
  – Individual level: case-control study, cohort study
  – Population level: ecological study
• Intervention study
Why case-control study?
• In a cohort study, you need a large number
of subjects to obtain a sufficient number
of cases, especially if you are interested in a
rare disease.
– Gastric cancer incidence in Japanese males:
128.5 per 100,000 person-years
• A case-control study is more efficient in
terms of study operation, time, and cost.
Comparison of the study designs
                      Case-control        Cohort
Rare diseases         suitable            not suitable
Number of diseases    1                   more than 1
Sample size           relatively small    needs to be large
Control selection     difficult           easier
Study period          relatively short    long
Recall bias           yes                 no
Risk difference       not available       available
Case-control study
- Sequence of determining exposure and outcome status
• Step1: Determine and select cases of
your research interest
• Step2: Selection of appropriate controls
• Step3: Determine exposure status in
both cases and controls
Case ascertainment
• What is the definition of the case?
– Cancer (clinically? Pathologically?)
– Virus carriers (Asymptomatic patients)
→ You need to screen for the antibody
– Including deceased cases?
• You have to describe the following points,
– the definition
– when, where & how to select
Who will be controls?
• Control ≠ non-case
– Controls are also at risk of developing the
disease in the future.
– “Controls” are expected to be a
representative sample of the
catchment population from which the
cases arise.
– In a case-control study of gastric
cancer, a person who has undergone
gastrectomy cannot be a control, since
he or she can never develop gastric cancer.
Various types of case-control studies
1)a population-based case-control study
Both cases and controls are recruited from the
population.
2)a case-control study nested in a cohort
Both cases and controls are members of the cohort.
3)a hospital-based case-control study
Both cases and controls are patients who are
hospitalized or outpatients.
Controls with diseases associated with the exposure
of interest should be avoided.
The following points should be
recorded (described in your paper)
• The list (number) of eligible cases
whose medical records were unavailable
• The list (number) of refused subjects,
if possible, with descriptions of the
reasons of refusal
• The length of interview
• The list (number) of subjects lacking
the measurement data, with
descriptions of the reasons
Exploratory or Analytic
• Exploratory case-control studies
– There is no specific a priori
hypothesis about the relationship
between exposure and outcome.
• Analytic case-control studies
– Analytic studies are designed to test
specific a priori hypotheses about
exposure and outcome.
Case-control study - information
• Sources of the information of exposure and
potential confounding factors
– Existing records
– Questionnaires
– Face-to-face / telephone interviews
– Biological specimens
– Tissue banks
– Databases on biochemical and
environmental measurements
Temporality is essential in Hill’s criteria
Timeline: disease onset → initial symptoms → clinical diagnosis
• Between disease onset and the initial symptoms, the study exposure
is unlikely to be altered because of the disease.
• Between the initial symptoms and the clinical diagnosis, the study
exposure is more likely to be altered because of the symptoms.
Source: Essential Epidemiology (WA Oleckno)
Bias should be minimized
• Bias & Confounding
– Selection bias
– Detection bias
– Information bias (recall bias)
– Confounding
Confounding can be controlled
by statistical analyses but we
can do nothing about bias after
data collection.
Case-control studies ・・・
• are potential sources
of many biases
• should be carefully
designed, analyzed,
and interpreted.
How can we solve the problem of
confounding in a case-control study?
“Prevention” at study design
• Limitation (restriction)
• Matching in a cohort study, but
not in a case-control study
Matching in a case-control study
• Matched by confounding
factor(s) to increase the
efficiency of statistical analysis
• Matching alone cannot control confounding
– A conditional logistic regression analysis is
required.
Overmatching
• Matched by factor(s) strongly
related to the exposure which is
your main interest
– CANNOT see the difference in
the exposure status between
cases and controls
How can we solve the problem of
confounding?
“Treatment” at statistical analysis
• Stratification by a confounder
• Multivariate analysis
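Stratification by a confounder can be sketched with the Mantel-Haenszel pooled odds ratio. This is one common stratified estimator and is not named on the slide; the counts below are hypothetical and chosen only to show a crude odds ratio that differs from the stratum-adjusted one. Each stratum uses the a, b, c, d cell notation of this lecture.

```python
# Mantel-Haenszel pooled odds ratio across strata of a confounder.
# Each stratum is (a, b, c, d):
#   a = exposed cases,    b = unexposed cases,
#   c = exposed controls, d = unexposed controls.

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

# HYPOTHETICAL counts, invented for illustration only.
strata = [
    (20, 10, 10, 10),   # stratum 1: stratum-specific OR = 2
    (5, 10, 10, 40),    # stratum 2: stratum-specific OR = 2
]
# Crude table: add the strata together cell by cell.
crude = odds_ratio(*[sum(col) for col in zip(*strata)])
adjusted = mantel_haenszel_or(strata)
print(f"crude OR = {crude:.3f}, Mantel-Haenszel OR = {adjusted:.3f}")
```

In this made-up example both strata have the same odds ratio, so the gap between the crude and the pooled estimate is due to the stratifying factor alone.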
What you should describe in the
materials and methods
1. Study design
2. Definition of eligible cases
and controls
– Inclusion / exclusion criteria of
cases and controls
3. Number of the respondents
and response rate
4. Main exposure and other
factors including potential
confounding factors
5. Sources of the information of
exposure and other factors
6. Matched factors, if any
7. The number of subjects used
in statistical analyses
8. Statistical test(s) and model(s)
9. Name and version of the
statistical software
Assuring adequate study power
• The following information is necessary
– The confidence level desired (usually 95%,
corresponding to a significance level of 0.05)
– The level of power desired (80-95%)
– The ratio of controls to cases
– The expected frequency of the exposure in
the control group
– The smallest odds ratio one would like to be
able to detect (based on practical
significance)
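The inputs listed above can be combined into an approximate sample size. The sketch below assumes an unmatched design and the usual normal approximation for comparing two proportions without continuity correction; the slide does not specify a formula, and the function name and the example values (40% exposure among controls, odds ratio of 2) are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def cases_needed(p0, odds_ratio, controls_per_case=1.0, alpha=0.05, power=0.80):
    """Approximate number of cases for an unmatched case-control study
    (normal approximation for two proportions, no continuity correction)."""
    r = controls_per_case
    # Exposure frequency expected among cases, implied by p0 and the odds ratio.
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    p_bar = (p1 + r * p0) / (1 + r)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt((1 + 1 / r) * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p0 * (1 - p0) / r)) ** 2
    return ceil(numerator / (p1 - p0) ** 2)

# Hypothetical planning values: 40% of controls exposed, smallest OR of interest 2.
n_cases = cases_needed(p0=0.40, odds_ratio=2.0, controls_per_case=1)
print(n_cases, "cases and", n_cases, "controls")
```

Raising `controls_per_case` lowers the number of cases required, which is why the control-to-case ratio is part of the list.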
Statistical analysis
“Matched” vs. “Unmatched” studies
The procedures for analyzing the
results of case-control studies
differ depending on whether the
cases and controls are matched or
unmatched.
Matched
・McNemar’s test
・Conditional logistic
regression analysis
Unmatched
・Chi-square test
・Unconditional logistic
regression analysis
Advantages of pair matching in case-control studies
• Assures comparability between cases and
controls on the selected variables
• May simplify the selection of controls by
eliminating the need to identify a random
sample
• Useful in small studies where obtaining cases
and controls that are similar on potentially
confounding factors may otherwise be difficult
• Can assure adequate numbers of subjects with
specified characteristics so as to permit
statistical comparisons
Source: Essential Epidemiology (WA Oleckno)
Disadvantages of pair matching in case-control studies
• May be difficult or costly to find a sufficient
number of controls
• Eliminates the possibility of examining the effects
of the matched variables on the outcome
• Can increase the difficulty or complexity of
controlling for confounding by the remaining
unmatched variables
• Overmatching
• Can result in a greater loss of data, since a pair
of subjects has to be eliminated even if only one
subject is not responsive
Source: Essential Epidemiology (WA Oleckno)
An example of unmatched case-control study
Lung cancer cases (N=100) and controls (N=100)
Smokers (NOT recently started): 70 cases, 40 controls
               Cases   Controls
Smoker           70       40
Non-smoker       30       60
Odds ratio = (70 × 60) / (30 × 40) = 3.5
Risk measure in a case-control study
Odds = prevalence / (1 - prevalence)
Odds ratio = odds in cases / odds in controls
                 Disease
                 + (case)   - (control)
Exposure +          a           c
Exposure -          b           d
Exposure odds in cases = a / b
Exposure odds in controls = c / d
Odds ratio = (a / b) / (c / d) = (a × d) / (b × c)
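The odds ratio formula can be checked against the unmatched lung cancer example (70 and 30 among cases, 40 and 60 among controls). The confidence interval in this sketch uses Woolf's log-based standard error, which is an addition for illustration and does not appear on the slides.

```python
from math import exp, log, sqrt

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """a, b = exposed and unexposed cases; c, d = exposed and unexposed controls."""
    or_ = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # Woolf's standard error
    lower = exp(log(or_) - z * se_log_or)
    upper = exp(log(or_) + z * se_log_or)
    return or_, lower, upper

# Lung cancer example: 70 smokers and 30 non-smokers among cases,
# 40 smokers and 60 non-smokers among controls.
or_, lower, upper = odds_ratio_with_ci(70, 30, 40, 60)
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```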
An example of matched case-control study
Lung cancer cases (N=100) and controls matched by sex & age (N=100)
Smokers (NOT recently started): 70 cases, 40 controls
                        Control
                        Smoker   Non-smoker
Case   Smoker             30        40
       Non-smoker         10        20
Notice that this is the distribution of 100 matched pairs.
McNemar’s test
                        Control
                        Smoker   Non-smoker
Case   Smoker             30        40
       Non-smoker         10        20
Chi-square (test) statistic
= (40 - 10)^2 / (40 + 10)
= 18
where the degree of freedom is 1.
Odds ratio = 40 / 10 = 4
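The McNemar calculation uses only the discordant pairs (40 pairs where only the case smoked, 10 where only the control smoked). A minimal sketch, with a hypothetical function name and a p-value added for illustration:

```python
from math import erfc, sqrt

def mcnemar(case_only_exposed, control_only_exposed):
    """McNemar's chi-square (no continuity correction) from the discordant pairs."""
    diff = case_only_exposed - control_only_exposed
    chi2 = diff ** 2 / (case_only_exposed + control_only_exposed)
    p_value = erfc(sqrt(chi2 / 2))      # upper tail of chi-square with 1 df
    matched_or = case_only_exposed / control_only_exposed
    return chi2, p_value, matched_or

chi2, p_value, matched_or = mcnemar(40, 10)
print(f"chi-square = {chi2:.1f}, p = {p_value:.2g}, OR = {matched_or:.1f}")
```

The 30 and 20 concordant pairs do not enter the statistic or the matched odds ratio.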
Logistic regression analysis
• Logistic regression is used to
model the probability of a
binary response as a function
of a set of variables thought to
possibly affect the response
(called covariates).
Y = 1: case (with the disease)
Y = 0: control (no disease)
One could imagine trying to fit a linear model
(since this is the simplest model !) for the
probabilities, but often this leads to problems:
In a linear model, fitted probabilities can fall
outside of 0 to 1. Because of this, linear models
are seldom used to fit probabilities.
In a logistic regression analysis, the
logit of the probability is modelled,
rather than the probability itself.
p = probability of getting the disease
logit(p) = log( p / (1 - p) )
As always, we use the natural log. The logit
is therefore the log odds,
since odds = p / (1-p)
Simple logistic regression (with a continuous covariate)
Suppose we give each of several beetles some
dose of a potential toxic agent (x=dose), and we
observe whether the beetle dies (Y=1) or lives
(Y=0). One of the simplest models we can consider
is to assume that the relationship of the logit of the
probability of death and the dose is linear, i.e.,
logit(px) = log( px / (1 - px) ) = a + bx
where px = probability of death for a given dose x,
and a and b are unknown parameters to be
estimated from the data.
The values of a and b will determine whether or
not and how steeply the dose-response curve
rises (or falls) and where it is centered.
If b = 0, px is constant over x
If b > 0, px increases with x
If b < 0, px decreases with x
px = e^(a+bx) / (1 + e^(a+bx))
H0: b = 0 is the null hypothesis in a “test of trend”
when x is a continuous variable. Knowledge of b
would give us insight into the direction and degree
of association between outcome and exposure.
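The role of the sign of b can be seen by evaluating the logistic curve directly. The intercept, slopes, and doses below are hypothetical values chosen for illustration; 1 / (1 + e^-(a+bx)) is algebraically the same as e^(a+bx) / (1 + e^(a+bx)).

```python
from math import exp

def p_x(x, a, b):
    """Probability implied by logit(p_x) = a + b*x."""
    return 1 / (1 + exp(-(a + b * x)))

doses = [0, 1, 2, 3, 4]
for b in (0.0, 1.0, -1.0):                 # hypothetical slopes
    curve = [round(p_x(x, a=-2.0, b=b), 3) for x in doses]
    print(f"b = {b:+.1f}: {curve}")
```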
Simple logistic regression (with a dichotomous covariate)
Suppose we are considering a case-control study
where the response variable is disease (case) /
non-disease (control) and the predictor variable is
exposed / non-exposed, which we “code” as an
indicator variable, or dummy variable.
Y = 1 (D1: disease), 0 (D0: no disease)
x = 1 (E1: exposed), 0 (E0: non-exposed)
And px = Prob (disease given exposure x)
= P (Y = 1 | x) x = 0, 1
Thus, p1 = probability of disease among exposed
p0 = probability of disease among non-exposed
In case of exposure (X=1): logit(PE1)=intercept + b
In case of non-exposure (X=0): logit (PE0) =intercept
If you want to obtain odds ratio of exposure group,
OR=(PE1 / (1-PE1))/ (PE0 / (1-PE0))
Definition of odds ratio
log(OR) = log {(PE1 / (1-PE1))/ (PE0 / (1-PE0))}
= log (PE1 / (1-PE1)) – log(PE0 / (1-PE0))
= logit (P for exposure) – logit (P for non-exposure)
= (intercept + b) – intercept
= b
OR = e^b
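The identity OR = e^b can be checked numerically by fitting the two-parameter model to a 2x2 table. This is a sketch of a Newton-Raphson fit written for illustration, not the routine a statistical package uses internally; it is applied to the lung cancer counts from the unmatched example (70, 30, 40, 60).

```python
from math import exp, log

def fit_logistic_2x2(a, b, c, d, iterations=25):
    """Fit logit(p) = intercept + slope*x by Newton-Raphson.
    a, b = exposed and unexposed cases; c, d = exposed and unexposed controls."""
    groups = [(1, a, a + c), (0, b, b + d)]     # (x, cases, subjects)
    intercept, slope = 0.0, 0.0
    for _ in range(iterations):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for x, cases, n in groups:
            p = 1 / (1 + exp(-(intercept + slope * x)))
            w = n * p * (1 - p)
            u0 += cases - n * p                 # score for the intercept
            u1 += x * (cases - n * p)           # score for the slope
            i00 += w
            i01 += x * w
            i11 += x * x * w
        det = i00 * i11 - i01 * i01
        # Newton step: inverse of the 2x2 information matrix times the score.
        intercept += (i11 * u0 - i01 * u1) / det
        slope += (i00 * u1 - i01 * u0) / det
    return intercept, slope

intercept, slope = fit_logistic_2x2(70, 30, 40, 60)
print(f"b = {slope:.4f}, exp(b) = {exp(slope):.3f}, a*d/(b*c) = {70 * 60 / (30 * 40):.3f}")
```

The fitted intercept equals the log odds of disease among the non-exposed in the sample, and exp(b) matches the cross-product ratio.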
Simple logistic regression
(with a covariate having more than two categories)
Suppose we are considering a case-control study
where the predictor variable is current smoker / ex-smoker /
non-smoker, which we “code” as dummy variables.
Original data              Dummy variables
Case   Smoking status      SMK1 (X1)   SMK2 (X2)
1      Current             1           0
0      Ex-smoker           0           1
1      Non-smoker          0           0
1      Ex-smoker           0           1
0      Non-smoker          0           0
0      Non-smoker          0           0
Logistic regression model of the previous example
logit (P) = a + b1(X1) + b2 (X2)
In case of current smoker (X1=1, X2=0):
logit(Pcurrent) = a + b1
ORcurrent = e^b1
In case of ex-smoker (X1=0, X2=1):
logit(Pex) = a + b2
ORex = e^b2
In case of non-smoker (X1=0, X2=0):
logit(Pnon) = a
ORnon = 1 (referent)
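The dummy coding used in this model can be written as a small helper. The function name is hypothetical; the category labels and the choice of non-smokers as the reference follow the table above.

```python
def dummy_code(smoking_status):
    """Return (SMK1, SMK2) with non-smokers as the reference category."""
    smk1 = 1 if smoking_status == "Current" else 0
    smk2 = 1 if smoking_status == "Ex-smoker" else 0
    return smk1, smk2

statuses = ["Current", "Ex-smoker", "Non-smoker", "Ex-smoker", "Non-smoker", "Non-smoker"]
for status in statuses:
    print(status, dummy_code(status))
```

A variable with k categories needs k - 1 dummy variables; the category coded all zeros becomes the referent with OR = 1.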
Wald’s test for no association
The null hypothesis of no association between
outcome and exposure corresponds to
H0: OR=1 or H0: b =logOR=0
Using logistic regression results, we can test
this hypothesis with the standardized coefficient
(z = b / standard error of b), i.e., Wald’s test.
Note: Stata and SAS present two-sided
Wald’s test p-values.
Likelihood Ratio Test (LRT)
An alternative way of testing hypotheses in a
logistic regression model is with the use of a
likelihood ratio test. The likelihood ratio test
is specifically designed to test between
nested hypotheses.
H0: log (Px / (1-Px)) = a
HA: log (Px / (1-Px)) = a + bx
and we say that H0 is nested in HA.
In order to test H0 vs. HA, we compute the likelihood
ratio test statistic:
G= -2・log(LH0 / LHA) = 2 (log LHA – log LH0)
= (-2log LH0) – (-2log LHA)
Where
LHA is the maximized likelihood under the
alternative hypothesis HA and
LH0 is the maximized likelihood under the null
hypothesis H0.
If the null hypothesis H0 were true, we would expect
the likelihood ratio test statistic to be close to zero.
Wald’s test vs. LRT
• In general, the LRT often works a little better than
the Wald test, in that the test statistic more closely
follows a χ² distribution under H0. But the Wald test
often works very well and usually gives similar
results.
• More importantly, the LRT can more easily be
extended to multivariate hypothesis tests, e.g.,
H0: b1 = b2 = 0 vs. HA: b1 ≠ 0 or b2 ≠ 0
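A test of H0: b1 = b2 = 0 can be worked through by hand when the only covariate is a three-level factor, because the maximized likelihood under HA then uses the observed proportion of cases in each category. The sketch below uses the smoking counts from the gastric cancer data presented later in this lecture (never: 188 controls and 78 cases; ex: 145 and 89; current: 98 and 49); the variable names are hypothetical.

```python
from math import exp, log

# Smoking counts from the gastric cancer data in this lecture: (controls, cases).
counts = {"never": (188, 78), "ex": (145, 89), "current": (98, 49)}

def binomial_loglik(cases, controls, p):
    return cases * log(p) + controls * log(1 - p)

# HA: one disease probability per smoking category (a + b1*X1 + b2*X2).
loglik_ha = sum(binomial_loglik(ca, co, ca / (ca + co)) for co, ca in counts.values())

# H0: b1 = b2 = 0, so a single probability for everyone.
total_cases = sum(ca for _, ca in counts.values())
total_controls = sum(co for co, _ in counts.values())
loglik_h0 = binomial_loglik(total_cases, total_controls,
                            total_cases / (total_cases + total_controls))

g = 2 * (loglik_ha - loglik_h0)
p_value = exp(-g / 2)        # chi-square upper tail, valid for exactly 2 df
print(f"log L(HA) = {loglik_ha:.5f}, G = {g:.2f}, p = {p_value:.4f}")
```

The degrees of freedom equal the number of parameters set to zero under H0, here 2. The shortcut exp(-G/2) for the p-value holds only for 2 degrees of freedom.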
World J. Gastroenterology 2006
Recruitment of cases
Patients newly diagnosed as G.C., Sep. 2000 ~ Dec. 2002: 395
• 81 cases were excluded (65 and 16):
  1. Recurrent cases
  2. Lived in Valle del Cauca less than 5 years
• 3. Refused to participate in the study: 91
• 4. Could not contact: 7
Cases: 216 (formalin-fixed paraffin-embedded blocks: 173)
We could not obtain the information
on tumor location for 23 cases, and
those cases were excluded from the
tumor location specific analysis.
Recruitment of controls
Matched by sex, age (5-year), hospital, and date of admission
Case : control = 1 : 2
Potential controls: 528
Excluded:
  1. Lived in Valle del Cauca less than 5 years: 67
  2. Refused to participate in the study: 29
  3. History of G.C.: 1
Controls: 431
Major diseases of controls
• cardiovascular diseases (208)
• trauma (117)
• infectious diseases (38)
• urological disorders (21)
           |    gastric cancer
   Smoking |         0          1 |     Total
-----------+----------------------+----------
   Never 0 |       188         78 |       266
      Ex 1 |       145         89 |       234
 Current 2 |        98         49 |       147
-----------+----------------------+----------
     Total |       431        216 |       647
xi:logistic casocon i.fumar
i.fumar           _Ifumar_0-2         (naturally coded; _Ifumar_0 omitted)

Logistic regression                               Number of obs   =        647
                                                  LR chi2(2)      =       4.24
                                                  Prob > chi2     =     0.1198
Log likelihood = -409.93333                       Pseudo R2       =     0.0051

------------------------------------------------------------------------------
     casocon | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ifumar_1 |   1.479399   .2817549     2.06   0.040     1.018526    2.148813
   _Ifumar_2 |   1.205128   .2660901     0.85   0.398     .7817889    1.857706
------------------------------------------------------------------------------
P>|z|: Wald’s test p values
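Because smoking is the only covariate in this unconditional model, the odds ratios and Wald statistics in the output can be reproduced from the 3x2 table by hand. This is a cross-check written for illustration (the function name is hypothetical), not a description of how Stata computes them; it uses 1.96 as the normal critical value.

```python
from math import erfc, exp, log, sqrt

def wald_summary(cases_exp, controls_exp, cases_ref, controls_ref, z_crit=1.96):
    odds_ratio = (cases_exp * controls_ref) / (cases_ref * controls_exp)
    # Standard error of log(OR): square root of the sum of reciprocal cell counts.
    se_log = sqrt(1 / cases_exp + 1 / controls_exp + 1 / cases_ref + 1 / controls_ref)
    z = log(odds_ratio) / se_log
    p_value = erfc(abs(z) / sqrt(2))        # two-sided normal p-value
    ci = (exp(log(odds_ratio) - z_crit * se_log), exp(log(odds_ratio) + z_crit * se_log))
    return odds_ratio, z, p_value, ci

# Reference category: never smokers (78 cases, 188 controls).
ex = wald_summary(89, 145, 78, 188)
current = wald_summary(49, 98, 78, 188)
for label, (odds_ratio, z, p_value, ci) in (("ex", ex), ("current", current)):
    print(f"{label}: OR = {odds_ratio:.6f}, z = {z:.2f}, p = {p_value:.3f}, "
          f"95% CI {ci[0]:.6f} to {ci[1]:.6f}")
```

The same shortcut does not apply to the conditional (matched) analysis, which depends on the pairing of individual subjects.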
Results of conditional logistic regression analysis using the same data
                     Case   Control   OR (95%CI)
Fumar=0 (never)        78     188     1.0 (referent)
Fumar=1 (ex)           89     145     1.54 (1.04 - 2.27)
Fumar=2 (current)      49      98     1.22 (0.78 - 1.91)
Stata command: xi:clogit casocon i.fumar, group(identi) or

Conditional (fixed-effects) logistic regression   Number of obs   =        647
                                                  LR chi2(2)      =       4.64
                                                  Prob > chi2     =     0.0982
Log likelihood = -234.5745                        Pseudo R2       =     0.0098

------------------------------------------------------------------------------
     casocon | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ifumar_1 |   1.535023   .3061998     2.15   0.032     1.038295    2.269389
   _Ifumar_2 |   1.219851   .2784042     0.87   0.384        .7799    1.907985
------------------------------------------------------------------------------
P>|z|: Wald’s test p values
GC risk by smoking in Cali, Colombia:
results of tumor-location specific analysis
OR (95%CI)
                      Lower (N=116)*     Middle (N=52)*     Upper (N=24)*
Cigarette smoking
  never               1.0 referent       1.0 referent       1.0 referent
  ex-smoker           1.9 (1.1 - 3.4)    1.2 (0.6 - 2.5)    3.7 (1.1 - 12.5)
  current             1.3 (0.7 - 2.3)    1.3 (0.5 - 3.4)    3.0 (0.6 - 13.9)
P for trend           0.257              0.597              0.083
P for heterogeneity   0.059              0.859              0.070
P = 0.51 (P value by LRT)
This test examines the difference in the magnitude of the
association between smoking and GC risk among 3 tumor sites.