Propensity Scores Methodology for Receiver Operating Characteristic (ROC) Analysis. Marina Kondratovich, Ph.D. U.S. Food and Drug Administration, Center for Devices and Radiological Health No official support or.


1
Propensity Scores
Methodology for Receiver
Operating Characteristic
(ROC) Analysis.
Marina Kondratovich, Ph.D.
U.S. Food and Drug Administration,
Center for Devices and Radiological Health
No official support or endorsement by the Food and Drug
Administration of this presentation is intended or should be
inferred.
September, 2003
2
Outline
• Introduction
• Place for propensity scores
  – Distributions of covariates (details)
  – Distributions of New Test results (details)
• Bias of naïve AUC estimation
• Matching for one covariate
  – Weighted ROC analysis
• Stratification for one covariate
• Relationship between AUC by matching and by stratification
• Propensity score – pre-test risk of disease
• Conjunction of a New Test with other diagnostic tests
3
ROC Analysis
The New Test is quantitative.
New Test variable: X for the Diseased population, Y for the Non-Diseased population.
ROC curve = relationship between sensitivity and specificity of the New Test over all possible cut-off values.
The AUC (area under the ROC curve) is the most common measure of test performance:
• AUC = sensitivity averaged over all values of specificity;
• AUC = specificity averaged over all values of sensitivity;
• AUC = P{X > Y}, the probability that a randomly selected Diseased subject has a larger test value than a randomly selected Non-Diseased subject.
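The interpretation AUC = P{X > Y} can be estimated directly by counting pairs. The following is a minimal sketch, not part of the original slides; the function name `empirical_auc` and the test values are made up for illustration, and ties are counted as 1/2.

```python
def empirical_auc(diseased, non_diseased):
    """Share of (Diseased, Non-Diseased) pairs with X > Y; a tie counts 1/2."""
    total = 0.0
    for x in diseased:
        for y in non_diseased:
            if x > y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(diseased) * len(non_diseased))

# Hypothetical test values, for illustration only.
x = [2.1, 3.4, 1.9, 4.0]          # Diseased subjects
y = [1.2, 2.1, 0.7, 3.0, 1.5]     # Non-Diseased subjects
print(empirical_auc(x, y))
```

A value of 0.5 means the test orders a Diseased and a Non-Diseased subject no better than chance; a value of 1 means every Diseased value exceeds every Non-Diseased value.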
4
• To estimate the diagnostic accuracy of a New Test correctly, we should compare the values of the New Test for Diseased subjects with the values of the New Test for the same subjects in the Non-Diseased state.
Each subject has two potential values of the New Test:
a value X that would be observed if the subject were Diseased, and
a value Y that would be observed if the subject were Non-Diseased.
But X and Y cannot be observed jointly for the same subject.
Subject = {New Test, Covariates (e.g., C1 = Age, C2 = BMI)}
• If we could randomly assign subjects to the Diseased and Non-Diseased clinical states, the Diseased and Non-Diseased groups would be comparable with respect to covariates, and the diagnostic accuracy of the Test would be evaluated correctly.
But such a random assignment is impossible.
5
Biased estimators of AUC occur if
I. the distributions of covariates are different for the Diseased and Non-Diseased study groups;
and
II. the distributions of New Test results are different for different sets of covariates.
Problem: Consider M randomly selected Diseased subjects and N randomly selected Non-Diseased subjects. The naïve estimate of AUC is biased (usually overstated).
Consider these two situations in more detail for one covariate, Age.
6
I. Different Age distributions in Diseased and Non-Diseased study groups.
Target Population
Age distribution (t1, t2, t3) and pre-test risk of Disease by Age (π1, π2, π3):
        Population share ti    Pre-test risk of Disease πi
Age1    0.5                    0.1
Age2    0.3                    0.3
Age3    0.2                    0.5
πpopulation = π1·t1 + π2·t2 + π3·t3 = 0.24
7
I. Study Groups: M randomly selected Diseased subjects, N randomly selected Non-Diseased subjects.
Age distributions:
Diseased: M = m1 + m2 + m3, E[mi/M] = pi; p1 = 0.21; p2 = 0.38; p3 = 0.41
Non-Diseased: N = n1 + n2 + n3, E[ni/N] = qi; q1 = 0.59; q2 = 0.28; q3 = 0.13
8
I. Study Groups: M randomly selected Diseased subjects, N randomly selected Non-Diseased subjects.
The pre-test risk of Disease in the study for age group Agei, mi/(mi + ni), is related to the pre-test risk of Disease in the population:
$$\frac{m_i}{m_i+n_i} \approx \frac{1}{1+\dfrac{1-\pi_i}{\pi_i}\cdot\dfrac{N}{M}\cdot\dfrac{\pi_{population}}{1-\pi_{population}}}$$
This is a monotonic function of πi; it depends on πpopulation and, through the ratio N/M, on πstudy (the share of Diseased subjects in the study).
Pre-test Risk of Disease
        Study (N=M)    Population
Age1    0.26           0.1
Age2    0.58           0.3
Age3    0.76           0.5
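The relation between the study and population pre-test risks can be evaluated numerically. This is a small sketch under the assumption that mi and ni are replaced by their expected values; the function name `study_risk` and its parameters are illustrative, not from the slides.

```python
def study_risk(pi_i, pi_population, n_over_m=1.0):
    """Expected share of Diseased subjects within one age group of the study.

    pi_i          -- pre-test risk of Disease for the age group in the population
    pi_population -- overall pre-test risk in the population
    n_over_m      -- ratio N/M of Non-Diseased to Diseased study subjects
    """
    odds_against = (1.0 - pi_i) / pi_i
    ratio = n_over_m * pi_population / (1.0 - pi_population)
    return 1.0 / (1.0 + odds_against * ratio)

# Population values from the example: pi = (0.1, 0.3, 0.5), pi_population = 0.24.
for pi_i in (0.1, 0.3, 0.5):
    print(pi_i, round(study_risk(pi_i, 0.24), 3))
```

When N/M equals (1 − πpopulation)/πpopulation, the study reproduces the population mix and `study_risk` returns πi itself; with N = M and a rarer disease, every age group's study risk is pushed upward.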
9
II. The distribution of the Test variable depends on Age.
New Test variables of
Diseased subjects: X1, X2, X3 with c.d.f. F1(x), F2(x), F3(x);
Non-Diseased subjects: Y1, Y2, Y3 with c.d.f. G1(y), G2(y), G3(y).
Example. Disease = Fracture, Non-Diseased = No Fracture, New Test = Ultrasound test for body site.
The figure shows a hypothetical relationship between the average ultrasound test value and age.
Usually, ultrasound values become lower with increasing age.
PSA test values (for prostate cancer) increase with increasing age;
BNP test values (for congestive heart failure) increase with increasing age.
10
The figure shows a typical picture of the data (ultrasound test for bone status).
I. The age distributions for Diseased and Non-Diseased subjects are different.
II. The values of the New Test depend on age.
Prostate cancer is more prevalent in older men;
congestive heart failure is more prevalent in older people.
11
PROBLEM: Naïve estimation of AUC is biased (usually overstated).
Indeed, the naïve estimate is the Wilcoxon-Mann-Whitney statistic
$$\widehat{AUC} = \frac{1}{M\,N}\sum_{k=1}^{3}\sum_{s=1}^{3}\sum_{i=1}^{m_k}\sum_{j=1}^{n_s}\Psi(X_{k,i}, Y_{s,j}),$$
where Ψ(A,B) = 1 if A > B; ½ if A = B; 0 if A < B.
Its expected value is
$$E[\widehat{AUC}] = \sum_{k=1}^{3}\sum_{s=1}^{3} p_k\, q_s\, AUC_{k,s} = p^T AUC\, q,$$
where
$$AUC_{k,s} = P\{X_k > Y_s\} = \int G_s(x)\, dF_k(x)$$
is the area under the ROC curve when the Diseased subjects are Agek years old and the Non-Diseased subjects are Ages years old.
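The bilinear structure can be checked on a sample: the pooled Wilcoxon-Mann-Whitney statistic equals the sum of the stratum-pair statistics weighted by the observed shares (mk/M)·(ns/N). The sketch below is an added illustration with made-up test values; `psi` and `wmw` are hypothetical helper names.

```python
def psi(a, b):
    """Kernel of the Wilcoxon-Mann-Whitney statistic."""
    if a > b:
        return 1.0
    return 0.5 if a == b else 0.0

def wmw(xs, ys):
    """Wilcoxon-Mann-Whitney estimate of P{X > Y} for two samples."""
    return sum(psi(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

# Hypothetical test values by age group (Age1, Age2, Age3).
X = [[1.0, 1.4], [2.2, 1.8, 2.5], [3.1, 2.9, 3.3, 2.7]]   # Diseased
Y = [[0.9, 1.2, 1.1, 0.6, 1.3], [2.0, 1.7], [3.0]]        # Non-Diseased
M = sum(len(group) for group in X)
N = sum(len(group) for group in Y)

# Naive estimate: pool all ages together.
pooled = wmw([x for g in X for x in g], [y for g in Y for y in g])

# Same number written as a weighted sum over all (Age_k, Age_s) pairs.
by_age_pairs = sum(
    (len(X[k]) / M) * (len(Y[s]) / N) * wmw(X[k], Y[s])
    for k in range(3) for s in range(3)
)
print(pooled, by_age_pairs)
```

The off-diagonal pairs (Diseased of one age against Non-Diseased of another) enter the naïve estimate with weights set by the two age distributions, which is where the age effect leaks into the AUC.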
12
Example. The New Test has no diagnostic ability: it cannot discriminate Diseased and Non-Diseased subjects within any age group.
X1, Y1 ~ N(1, 1/4)
X2, Y2 ~ N(2, 1/4)
X3, Y3 ~ N(3, 1/4)
The AUC matrix (rows: Diseased age group; columns: Non-Diseased age group) is
                  Non-Diseased
                  Age1    Age2    Age3
Diseased  Age1    0.50    0.16    0.02
          Age2    0.84    0.50    0.16
          Age3    0.98    0.84    0.50
The two groups, Diseased and Non-Diseased, appear different with respect to the values of the New Test.
The age distribution of the Diseased subjects is pT = (0.21; 0.38; 0.41);
the age distribution of the Non-Diseased subjects is qT = (0.59; 0.28; 0.13).
13
Example (continued).
AUC matrix (rows: Diseased age group; columns: Non-Diseased age group):
                  Non-Diseased
                  Age1    Age2    Age3
Diseased  Age1    0.50    0.16    0.02
          Age2    0.84    0.50    0.16
          Age3    0.98    0.84    0.50
If the age distribution of the Diseased subjects is pT = (0.21; 0.38; 0.41) and the age distribution of the Non-Diseased subjects is qT = (0.59; 0.28; 0.13), then the mean value of the Wilcoxon-Mann-Whitney statistic, pT AUC q, is 0.68.
The matrix element AUC3,1 = 0.98, which corresponds to the largest age group of Diseased subjects (p3 = 0.41) and the largest age group of Non-Diseased subjects (q1 = 0.59), makes the largest contribution to the bilinear form pT AUC q.
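The value 0.68 and the dominant term can be reproduced from the matrix and the two age distributions. A short sketch, added for illustration, using only the numbers quoted in the example:

```python
AUC = [[0.50, 0.16, 0.02],    # rows: Diseased Age1..Age3
       [0.84, 0.50, 0.16],    # columns: Non-Diseased Age1..Age3
       [0.98, 0.84, 0.50]]
p = [0.21, 0.38, 0.41]        # age distribution of Diseased subjects
q = [0.59, 0.28, 0.13]        # age distribution of Non-Diseased subjects

# Each term p_k * q_s * AUC[k][s], keyed by the (Diseased, Non-Diseased) age index.
terms = {(k + 1, s + 1): p[k] * q[s] * AUC[k][s]
         for k in range(3) for s in range(3)}
naive = sum(terms.values())
largest = max(terms, key=terms.get)
print(f"{naive:.2f}", largest)    # prints: 0.68 (3, 1)
```

Although every within-age AUC on the diagonal is 0.50, the naïve expectation is 0.68, with about a third of it coming from the single term for Diseased Age3 against Non-Diseased Age1.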
14
Adjustments for one covariate
Three common methods of adjusting for
one confounding covariate:
– Matching
– Stratification
– Covariate adjustment through logistic regression
15
Matching
Matching of Diseased and Non-Diseased subjects means that the age distributions of these subjects are the same.
Let the Diseased and Non-Diseased subjects be matched with common age distribution φT = (φ1, φ2, φ3). Then
$$E[\widehat{AUC}] = \sum_{k=1}^{3}\sum_{s=1}^{3}\varphi_k\,\varphi_s\, AUC_{k,s} = \varphi^T AUC\,\varphi$$
Theorem.
Suppose a New Test cannot discriminate Diseased and Non-Diseased populations within any age group. Then the expected value of the Mann-Whitney statistic is 0.5 for any age distribution in the age-matched samples of Diseased and Non-Diseased subjects.
The Wilcoxon-Mann-Whitney statistic correctly evaluates the test performance (area under the ROC curve) only for age-matched samples.
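The theorem can be checked on the AUC matrix of the earlier no-discrimination example: when the test does not discriminate within any age group, AUCk,s + AUCs,k = 1, so the quadratic form φT AUC φ equals 0.5 whatever φ is. The sketch below is an added illustration; the three φ vectors are arbitrary choices.

```python
AUC = [[0.50, 0.16, 0.02],
       [0.84, 0.50, 0.16],
       [0.98, 0.84, 0.50]]

def bilinear(u, A, v):
    """Compute u^T A v for plain Python lists."""
    return sum(u[k] * A[k][s] * v[s]
               for k in range(len(u)) for s in range(len(v)))

# Arbitrary common age distributions for the matched samples.
for phi in ([1/3, 1/3, 1/3], [0.2, 0.5, 0.3], [0.7, 0.2, 0.1]):
    print(phi, f"{bilinear(phi, AUC, phi):.6f}")    # 0.500000 for each phi
```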
16
Matching (continued)
By matching, we create a “quasi-randomized” experiment.
That is, if we find two subjects, one in the Diseased and one in the Non-Diseased group, with the same pre-test risk of Disease (same age), then we can imagine that there was one subject for whom the value of the New Test was observed both when this subject was Diseased and when this subject was Non-Diseased.
The age-matched study groups are similar with respect to Age (the AUC for the covariate Age is exactly 0.5). Then we are sure that the difference in the New Test distributions for the Diseased and Non-Diseased groups is not due to the difference in age.
Problem: The data of unmatched subjects are not used in the AUC estimate.
In that case, weighted ROC analysis should be used.
17
Weighted ROC Analysis
Data set: Diseased and Non-Diseased subjects are not age-matched.
We want these two samples to be age-matched with the common age distribution φ, where φk = dk/D, dk = min(mk, nk), and D = d1 + d2 + d3.
        Diseased                       Non-Diseased                        dk    φk (age distribution for matching)
Age1    m1=3: X1,1 X1,2 X1,3           n1=5: Y1,1 Y1,2 Y1,3 Y1,4 Y1,5     3     3/7
Age2    m2=4: X2,1 X2,2 X2,3 X2,4      n2=3: Y2,1 Y2,2 Y2,3               3     3/7
Age3    m3=2: X3,1 X3,2                n3=1: Y3,1                         1     1/7
18
Weighted ROC Analysis (continued)
For each age group Agek, we can take
• some set of size dk out of the mk Diseased subjects: there are $\binom{m_k}{d_k}$ different variants;
• some set of size dk out of the nk Non-Diseased subjects: there are $\binom{n_k}{d_k}$ different variants.
For Age1, 10 variants; for Age2, 4 variants; for Age3, 2 variants.
Total number of different matched sets: 80 (= 10 x 4 x 2).
Using a particular age-matched set of D Diseased and D Non-Diseased subjects, we can estimate the age-matched AUC using the Wilcoxon statistic.
Then we consider all possible matched sets, estimate AUC for each set, and take the average of AUC over all these sets.
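The count of matched sets follows from binomial coefficients. A brief sketch, added for illustration, using the group sizes of the example (m = 3, 4, 2 Diseased and n = 5, 3, 1 Non-Diseased subjects):

```python
from math import comb, prod

m = [3, 4, 2]    # Diseased subjects in Age1, Age2, Age3
n = [5, 3, 1]    # Non-Diseased subjects in Age1, Age2, Age3
d = [min(mk, nk) for mk, nk in zip(m, n)]

# Ways to choose d_k of m_k Diseased and d_k of n_k Non-Diseased subjects per age group.
per_age = [comb(mk, dk) * comb(nk, dk) for mk, nk, dk in zip(m, n, d)]
print(d, per_age, prod(per_age))    # prints: [3, 3, 1] [10, 4, 2] 80
```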
19
Weighted ROC Analysis (continued)
This is equivalent to calculating AUC with all M Diseased subjects, with weights dk/mk, and all N Non-Diseased subjects, with weights dk/nk:
$$\widehat{AUC}_{weighted} = \frac{1}{D^2}\sum_{k=1}^{K}\sum_{s=1}^{K}\sum_{i=1}^{m_k}\sum_{j=1}^{n_s}\frac{d_k}{m_k}\,\frac{d_s}{n_s}\,\Psi(X_{k,i}, Y_{s,j})$$
The weighted ROC analysis is equivalent to considering all possible variants of age-matching with common age distribution φ.
The weighted estimate of AUC can also be obtained using the bootstrap technique.
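The equivalence between averaging over all matched sets and weighting all subjects can be verified by brute force on a small data set. The sketch below is an added illustration, not the author's implementation: the test values are hypothetical, the group sizes copy the example (so there are 80 matched sets), every age group is assumed to contain at least one subject of each kind, and exact fractions are used so the two results can be compared with `==`.

```python
from fractions import Fraction
from itertools import combinations, product

def psi(a, b):
    return Fraction(1) if a > b else Fraction(1, 2) if a == b else Fraction(0)

def weighted_auc(X, Y):
    """All subjects, weights d_k/m_k (Diseased) and d_s/n_s (Non-Diseased)."""
    d = [min(len(xs), len(ys)) for xs, ys in zip(X, Y)]
    D = sum(d)
    total = Fraction(0)
    for k, xs in enumerate(X):
        for s, ys in enumerate(Y):
            weight = Fraction(d[k], len(xs)) * Fraction(d[s], len(ys))
            total += weight * sum(psi(x, y) for x in xs for y in ys)
    return total / D**2

def average_over_matched_sets(X, Y):
    """Wilcoxon statistic averaged over every possible age-matched set."""
    d = [min(len(xs), len(ys)) for xs, ys in zip(X, Y)]
    D = sum(d)
    x_choices = [list(combinations(xs, dk)) for xs, dk in zip(X, d)]
    y_choices = [list(combinations(ys, dk)) for ys, dk in zip(Y, d)]
    aucs = []
    for x_sel in product(*x_choices):
        for y_sel in product(*y_choices):
            xs = [x for group in x_sel for x in group]
            ys = [y for group in y_sel for y in group]
            aucs.append(sum(psi(x, y) for x in xs for y in ys) / D**2)
    return sum(aucs) / len(aucs), len(aucs)

# Hypothetical test values; group sizes as in the example (3/5, 4/3, 2/1).
X = [[10, 14, 12], [21, 18, 25, 20], [31, 29]]      # Diseased by age group
Y = [[9, 12, 11, 6, 13], [20, 17, 22], [30]]        # Non-Diseased by age group
average, n_sets = average_over_matched_sets(X, Y)
print(n_sets, average == weighted_auc(X, Y))        # prints: 80 True
```

The weighted form needs one pass over the M·N pairs, whereas the number of matched sets grows combinatorially with the group sizes, so enumeration is only practical for toy examples like this one.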
20
Weighted ROC Analysis (continued)
Age distribution for matching: φ1 = 3/7, φ2 = 3/7, φ3 = 1/7.
Age1 (d1=3)
  Diseased, m1=3:        X1,1 X1,2 X1,3              Weights: 1 1 1
  Non-Diseased, n1=5:    Y1,1 Y1,2 Y1,3 Y1,4 Y1,5    Weights: 3/5 3/5 3/5 3/5 3/5
Age2 (d2=3)
  Diseased, m2=4:        X2,1 X2,2 X2,3 X2,4         Weights: 3/4 3/4 3/4 3/4
  Non-Diseased, n2=3:    Y2,1 Y2,2 Y2,3              Weights: 1 1 1
Age3 (d3=1)
  Diseased, m3=2:        X3,1 X3,2                   Weights: 1/2 1/2
  Non-Diseased, n3=1:    Y3,1                        Weights: 1
21
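Weighted ROC Analysis (continued)
The weighted AUC is an unbiased estimate of the φ-age-matched AUC:
$$E[\widehat{AUC}_{weighted}] = E[\widehat{AUC}_{matched}] = \varphi^T AUC\,\varphi$$
The variance of the weighted estimate is
$$\operatorname{var}(\widehat{AUC}_{weighted}) = \frac{1}{D}\sum_{k=1}^{K}\sum_{s=1}^{K}\sum_{t=1}^{K}\varphi_k\,\varphi_s\,\varphi_t\left(\xi_{10}^{k,s,t}\,\frac{d_k}{m_k} + \xi_{01}^{k,t,s}\,\frac{d_t}{n_t}\right) + \frac{1}{D^4}\sum_{k=1}^{K}\sum_{s=1}^{K}\left(\xi_{11}^{k,s} - \xi_{10}^{k,s,s} - \xi_{01}^{k,s,k}\right)\frac{d_k^2\, d_s^2}{m_k\, n_s},$$
where, in the usual U-statistic notation, $\xi_{10}^{k,s,t} = \operatorname{Cov}[\Psi(X_k, Y_s), \Psi(X_k, Y_t)]$ (one Diseased subject of Agek compared with two different Non-Diseased subjects), $\xi_{01}^{k,t,s} = \operatorname{Cov}[\Psi(X_k, Y_t), \Psi(X_s, Y_t)]$ (one Non-Diseased subject of Aget compared with two different Diseased subjects), and $\xi_{11}^{k,s} = \operatorname{var}\,\Psi(X_k, Y_s)$.
If dk ≤ min(mk, nk) (all weights are at most 1), then this variance is smaller than the variance for a single matched set.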
22
Stratification
The strata are defined, and Diseased and Non-Diseased subjects who are in the same stratum are compared.
        Diseased                       Non-Diseased                        Stratum AUC
Age1    m1=3: X1,1 X1,2 X1,3           n1=5: Y1,1 Y1,2 Y1,3 Y1,4 Y1,5     AUC1,1
Age2    m2=4: X2,1 X2,2 X2,3 X2,4      n2=3: Y2,1 Y2,2 Y2,3               AUC2,2
Age3    m3=2: X3,1 X3,2                n3=1: Y3,1                         AUC3,3
23
Stratification (continued)
The overall diagnostic accuracy of the New Test can be a weighted average of AUC1,1, AUC2,2, and AUC3,3.
We can consider the linear combination
$$\sum_{k=1}^{3}\varphi_k\, AUC_{k,k},$$
where φ is the same as in matching, φk = dk/D (dk = min(mk, nk)).
If AUC1,1 = AUC2,2 = AUC3,3 = AUC, then the weights φk are similar to weights inversely proportional to the variances of the stratum AUC estimates.
Is there a relationship between AUC by matching,
$$\varphi^T AUC\,\varphi = \sum_{i=1}^{3}\sum_{j=1}^{3}\varphi_i\,\varphi_j\, AUC_{i,j},$$
and AUC by stratification,
$$\sum_{k=1}^{3}\varphi_k\, AUC_{k,k}\ ?$$
24
Example. New Test = Ultrasound test for bone status.
The results of the ultrasound test are normal variables with means that differ by age and with the same standard deviation of 130 m/sec.
        Means for Diseased (m/sec)    Means for Non-Diseased (m/sec)
Age1    4,005                         4,027
Age2    3,904                         3,953
Age3    3,885                         3,942
Matrix AUC (rows: Diseased age group; columns: Non-Diseased age group):
        Age1    Age2    Age3
Age1    0.55    0.39    0.37
Age2    0.75    0.65    0.58
Age3    0.83    0.70    0.68
φT = (0.2; 0.5; 0.3)
AUC by matching: φT AUC φ = 0.624
AUC by stratification: φ1·AUC1,1 + φ2·AUC2,2 + φ3·AUC3,3 = 0.639
25
Relationship between AUC by matching and AUC by stratification
Theorem.
Let φT = (φ1, φ2, φ3) be the age distribution in the age-matched Diseased and Non-Diseased groups. Then
$$\sum_{k=1}^{3}\varphi_k\, AUC_{k,k} = \varphi^T AUC\,\varphi + \varphi^T \Delta\,\varphi,$$
where the matrix Δ is a symmetric matrix with elements
$$\Delta_{k,s} = \Delta_{s,k} = (AUC_{k,k} + AUC_{s,s} - AUC_{k,s} - AUC_{s,k})/2.$$
Matrix Δ from the previous Example:
    0        0.030    0.015
    0.030    0        0.025
    0.015    0.025    0
For a broad class of distributions,
$$\varphi^T AUC\,\varphi \le \sum_{k=1}^{3}\varphi_k\, AUC_{k,k},$$
that is, AUC by matching ≤ AUC by stratification.
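The numbers of the ultrasound example and the identity of the theorem can be reproduced from the AUC matrix and φ alone. The sketch below is an added illustration that takes the matrix as quoted in the example rather than recomputing it from the means.

```python
AUC = [[0.55, 0.39, 0.37],    # rows: Diseased Age1..Age3
       [0.75, 0.65, 0.58],    # columns: Non-Diseased Age1..Age3
       [0.83, 0.70, 0.68]]
phi = [0.2, 0.5, 0.3]         # common age distribution of the matched samples
K = len(phi)

matching = sum(phi[k] * phi[s] * AUC[k][s] for k in range(K) for s in range(K))
stratification = sum(phi[k] * AUC[k][k] for k in range(K))

# Symmetric matrix Delta from the theorem.
delta = [[(AUC[k][k] + AUC[s][s] - AUC[k][s] - AUC[s][k]) / 2 for s in range(K)]
         for k in range(K)]
correction = sum(phi[k] * phi[s] * delta[k][s] for k in range(K) for s in range(K))

print(f"{matching:.3f} {stratification:.3f} {correction:.4f}")
# prints: 0.624 0.639 0.0153
```

Here the difference 0.639 − 0.624 is exactly the term φT Δ φ, which is non-negative because every off-diagonal element of Δ is positive in this example.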
26
Covariates (C1, C2, …, CL)
mk  3
nk  5
Matching based on many covariates is difficult.
Stratification: as the number of covariates increases, the number of strata grows exponentially.
Propensity Scores
Replace the collection of confounding covariates with one scalar function of these covariates: the propensity score.
Propensity score (PS): the conditional probability of being in the Diseased group rather than the Non-Diseased group, given a collection of observed covariates.
PS(C1, C2, …, CL) = Pr(Disease | C1, C2, …, CL).
Propensity Score = Pre-test risk of Disease given a collection of covariates, C1, C2, …, CL.
27
28
Construction of propensity score (pre-test risk)
Model: logistic regression or others (neural networks, ...).
Outcome: Disease = 1, Non-Disease = 0.
Predictors: all measured covariates, plus some interaction terms, squared terms, and so on.
The New Test is not included.
The AUC for the combined covariates (the propensity score) is a measure of covariate imbalance between the groups.
The distributions of the X and Y variables, the values of the New Test for the Diseased and Non-Diseased groups, depend on the covariates, but this dependence is approximated well through the pre-test risk:
F (x, C1, C2, …, CL) = F (x, PS(C1, C2, …, CL));
G (y, C1, C2, …, CL) = G (y, PS(C1, C2, …, CL)).
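A propensity score model of this kind can be sketched with a small logistic regression. The code below is an illustration under stated assumptions, not the author's implementation: the covariates are hypothetical standardized Age and BMI values, the model has linear terms only, and the fit uses plain gradient ascent (a statistics package would normally be used instead). The New Test value is deliberately not a predictor.

```python
import math

def propensity(beta, c):
    """Estimated Pr(Disease | covariates c) under the logistic model."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], c))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(covariates, disease, lr=0.5, n_iter=3000):
    """Gradient ascent on the logistic log-likelihood; beta[0] is the intercept."""
    n, n_cov = len(covariates), len(covariates[0])
    beta = [0.0] * (n_cov + 1)
    for _ in range(n_iter):
        grad = [0.0] * (n_cov + 1)
        for c, y in zip(covariates, disease):
            resid = y - propensity(beta, c)
            grad[0] += resid
            for j, value in enumerate(c):
                grad[j + 1] += resid * value
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical (standardized Age, standardized BMI) pairs; 7 Diseased, 9 Non-Diseased.
covariates = [(1.2, -0.3), (0.8, 0.1), (0.3, -0.8), (1.5, 0.4), (-0.2, -0.5),
              (0.6, -1.0), (-0.8, 0.7),
              (-1.0, 0.5), (-0.5, 0.2), (0.4, 0.9), (-1.3, -0.1), (0.1, 0.6),
              (-0.7, 1.1), (0.9, 0.3), (-0.2, -0.4), (1.0, -0.5)]
disease = [1] * 7 + [0] * 9

beta = fit_logistic(covariates, disease)
ps = [propensity(beta, c) for c in covariates]

# AUC of the propensity score itself: a measure of covariate imbalance.
ps_dis = [s for s, y in zip(ps, disease) if y == 1]
ps_non = [s for s, y in zip(ps, disease) if y == 0]
ps_auc = sum(1.0 if a > b else 0.5 if a == b else 0.0
             for a in ps_dis for b in ps_non) / (len(ps_dis) * len(ps_non))
print([round(b, 2) for b in beta], round(ps_auc, 2))
```

A propensity-score AUC near 0.5 would indicate that the two groups are already balanced on the covariates; the further it is from 0.5, the more the naïve AUC of the New Test is at risk of reflecting covariate differences.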
Propensity Scores (continued)
• Calculate estimated propensity scores (pre-test risk) for all subjects using the propensity score model.
• Sort all subjects by propensity score.
• Divide the subjects into strata that have similar PS.
• Estimate AUC by matching (use weighted AUC) or AUC by stratification.
Figure: subjects plotted by Age and BMI, with mk Diseased and nk Non-Diseased subjects in each stratum. Five strata based on a logistic regression model of age and BMI (linear terms).
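The steps above can be sketched as follows, assuming the propensity scores have already been estimated. This is an added illustration with hypothetical data and helper names; strata are formed as near-equal groups of the sorted scores, and a stratum that lacks either Diseased or Non-Diseased subjects gets dk = 0 and so drops out of both estimates.

```python
def psi(a, b):
    return 1.0 if a > b else (0.5 if a == b else 0.0)

def ps_strata(ps, n_strata):
    """Sort subjects by propensity score and cut into n_strata near-equal groups."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    stratum = [0] * len(ps)
    for rank, i in enumerate(order):
        stratum[i] = rank * n_strata // len(ps)
    return stratum

def auc_by_matching_and_stratification(test, diseased, stratum, n_strata):
    X = [[] for _ in range(n_strata)]    # Diseased test values per stratum
    Y = [[] for _ in range(n_strata)]    # Non-Diseased test values per stratum
    for value, is_diseased, k in zip(test, diseased, stratum):
        (X if is_diseased else Y)[k].append(value)
    d = [min(len(xs), len(ys)) for xs, ys in zip(X, Y)]
    D = sum(d)                           # assumed > 0: some stratum has both groups
    used = [k for k in range(n_strata) if d[k] > 0]
    # Weighted (matched) AUC: all pairs, weights d_k/m_k and d_s/n_s.
    weighted = sum(d[k] / len(X[k]) * d[s] / len(Y[s]) * psi(x, y)
                   for k in used for s in used for x in X[k] for y in Y[s]) / D**2
    # Stratified AUC: within-stratum AUCs averaged with weights d_k/D.
    stratified = sum(d[k] / D * sum(psi(x, y) for x in X[k] for y in Y[k])
                     / (len(X[k]) * len(Y[k])) for k in used)
    return weighted, stratified

# Hypothetical propensity scores, disease status and New Test values.
ps = [0.10, 0.15, 0.22, 0.30, 0.35, 0.41, 0.48, 0.55, 0.63, 0.70, 0.82, 0.90]
diseased = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
test = [1.1, 0.8, 1.9, 1.3, 1.6, 2.4, 1.5, 2.0, 2.9, 2.6, 3.1, 2.7]

stratum = ps_strata(ps, 3)
weighted, stratified = auc_by_matching_and_stratification(test, diseased, stratum, 3)
print(stratum, round(weighted, 3), round(stratified, 3))
```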
29
Propensity Scores (continued).
Example: conjunction of a New Test with
other diagnostic tests
30
A New Test is used in conjunction with other clinical tests to detect the clinical state “Disease”. The propensity score technique is a convenient tool for matching based on all available prior information (covariates) about the subjects.
Example: “Disease” = any stenosis during coronary angiography;
New Test;
C1 = Age;
C2 = Gender;
C3 = Total cholesterol;
C4 = HDL (“good” cholesterol);
C5 = LDL (“bad” cholesterol).
To evaluate the diagnostic ability of a New Test correctly, matched AUC analysis should be performed. Matching based on the propensity score is recommended.
31
Use of matched ROC analysis when New Test results do not depend on the covariates.
Suppose the distribution of the New Test results is the same in every stratum (F1=F2=F3=F, G1=G2=G3=G), but we have no information about that and use matched ROC analysis.
How is the matched estimate of AUC related to the usual empirical estimate?
Theorem.
The matched estimate of the AUC is an unbiased estimate of AUC, but the variance of the matched estimate is inflated.
The proof is based on Hölder’s inequality (see [1]).
32
Summary
• If the results of a New Test depend on covariates and the distributions of covariates in the Diseased and Non-Diseased groups are different, then only matched ROC analysis correctly evaluates the diagnostic accuracy of the New Test.
• Matching based on propensity scores (pre-test risk of Disease) reduces bias. The propensity score is seriously degraded when important covariates influencing pre-test risk have not been collected.
• Weighted ROC analysis makes more effective use of all the data.
33
References
The propensity score technique is well developed in the context of observational studies and studies of therapeutic devices. In the context of diagnostic studies, however, there have been few papers.
1. Kondratovich, Marina V. (2000). Methodology of removing the effect of confounding variables in receiver operating characteristic (ROC) analysis. Proceedings of the 2000 Joint Statistical Meetings, Biopharmaceutical Section, Indianapolis, IN.
2. Kondratovich, Marina V. (2002). Matched receiver operating characteristic (ROC) analysis and propensity scores. Proceedings of the 2002 Joint Statistical Meetings, Biopharmaceutical Section, New York, NY.
3. Zweig, M.H. and Campbell, G. (1993). Receiver operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39, p. 561-577.
References for the propensity scores technique
• Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med 1997; 127:757-763
• Grunkemeier GL, et al. Propensity score analysis of stroke after off-pump coronary artery bypass grafting. Ann Thorac Surg 2002; 74:301-305
• Winkelmayer WC, et al. Comparing mortality of elderly patients on hemodialysis versus peritoneal dialysis: A propensity score approach. J Am Soc Nephrol 2002; 13:2353-2362
• Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. JASA 1984; 79:516-524
• Blackstone EH. Comparing apples and oranges. J Thoracic and Cardiovascular Surgery, January 2002; 1:8-15
• D’Agostino RB Jr. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine 1998; 17:2265-2281
34