New Statistical Tests for Method Validation
Beyond MARLAP:
New Statistical Tests
For Method Validation
NAREL – ORIA – US EPA
Laboratory Incident Response Workshop
At the 53rd Annual RRMC
Outline
The method validation problem
MARLAP’s test
And its peculiar features
New approach – testing mean squared
error (MSE)
Two possible tests of MSE
Chi-squared test
Likelihood ratio test
Power comparisons
Recommendations and implications for
MARLAP
2
The Problem
We’ve prepared spiked samples at
one or more activity levels
A lab has performed one or more
analyses of the samples at each level
Our task: Evaluate the results to see
whether the lab and method can
achieve the required uncertainty (uReq)
at each level
3
MARLAP’s Test
In 2003 the MARLAP work group
developed a simple test for MARLAP
Chapter 6
Chose a very simple criterion
Original criterion was whether every
result was within ±3uReq of the target
Modified slightly to keep false
rejection rate ≤ 5 % in all cases
4
Equations
Acceptance range is TV ± k uReq where
TV = target value (true value)
uReq = required uncertainty at TV, and
k = z_p, where p = 0.5 + 0.5 (1 − α)^(1/n)
E.g., for n = 21 measurements (7 reps at
each of 3 levels), with α = 0.05, we get
k = z_0.99878 = 3.03
For smaller n we get slightly smaller k
5
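The multiplier can be reproduced with the Python standard library's NormalDist (a quick sketch; `marlap_k` is our name, not MARLAP's):

```python
from statistics import NormalDist

def marlap_k(n, alpha=0.05):
    """Multiplier k = z_p with p = 0.5 + 0.5*(1 - alpha)**(1/n)."""
    p = 0.5 + 0.5 * (1 - alpha) ** (1 / n)
    return NormalDist().inv_cdf(p)

k = marlap_k(21)      # 7 replicates at each of 3 levels
print(round(k, 2))    # ≈ 3.03
```

As the slide notes, smaller n gives a slightly smaller k, since p shrinks toward 0.975.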
Required Uncertainty
The required uncertainty, uReq, is a function
of the target value
uReq(TV) = uMR, if TV ≤ UBGR
uReq(TV) = φMR × TV, if TV > UBGR
Where uMR is the required method
uncertainty at the upper bound of the gray
region (UBGR)
φMR is the corresponding relative method
uncertainty
6
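A minimal sketch of this piecewise definition, with the example values from later slides (UBGR = 100 pCi/L, uMR = 10 pCi/L, φMR = 0.10) as assumed defaults:

```python
def u_req(tv, ubgr=100.0, u_mr=10.0, phi_mr=0.10):
    """Required uncertainty at target value tv.

    Defaults are the example numbers used later in this deck
    (UBGR = 100 pCi/L, uMR = 10 pCi/L, phiMR = 0.10).
    """
    return u_mr if tv <= ubgr else phi_mr * tv

# Flat below the UBGR, 10 % of TV above it:
print(u_req(50.0), round(u_req(300.0)))   # 10.0 30
```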
Alternatives
We considered a chi-squared (χ2) test as an
alternative in 2003
Accounted for uncertainty of target values
using “effective degrees of freedom”
Rejected at the time because of complexity
and lack of evidence for performance
Kept the simple test that now appears in
MARLAP Chapter 6
But we didn’t forget about the χ2 test
7
Peculiarity of MARLAP’s Test
Power to reject a biased but precise
method decreases with number of
analyses performed (n)
Because we adjusted the acceptance
limits to keep false rejection rates low
Acceptance range gets wider as n
gets larger
8
Biased but Precise
This graphic image was borrowed and edited
for the RRMC workshop presentation. Please
view the original now at despair.com.
http://www.despair.com/consistency.html
Best Use of Data?
It isn’t just about bias
MARLAP’s test uses data inefficiently – even
to evaluate precision alone (its original
purpose)
The statistic – in effect – is just the worst
normalized deviation from the target value
M = max(1≤j≤n) |Zj|
where
Zj = (Xj − TV) / uReq
Wastes a lot of useful information
10
Example: The MARLAP Test
Suppose we perform a level D
method validation experiment
UBGR = AL = 100 pCi/L
uMR = 10 pCi/L
φMR = 10/100 = 0.10, or 10 %
Three activity levels (L = 3)
50 pCi/L, 100 pCi/L, and 300 pCi/L
Seven replicates per level (N = 7)
Allow 5 % false rejections (α = 0.05)
11
Example (continued)
For 21 measurements, calculate
k = z_(0.5 + 0.5(1 − 0.05)^(1/21)) = z_0.99878 = 3.03 ≈ 3.0
When evaluating measurement results for
target value TV, require for each result Xj:
|Zj| = |Xj − TV| / uReq ≤ 3.0
Equivalently, require
M = max(1≤j≤21) |Zj| ≤ 3.0
12
Example (continued)
We’ll work through calculations at just
one target value
Say TV = 300 pCi/L
This value is greater than UBGR
(100 pCi/L)
So, the required uncertainty is 10 %
of 300 pCi/L
uReq = 30 pCi/L
13
Example (continued)
Suppose the lab produces the 7
results Xj shown below
For each result, calculate
the “Z score”
Zj = (Xj − TV) / uReq = (Xj − 300 pCi/L) / (30 pCi/L)

  j    Xj (pCi/L)
  1    256.1
  2    235.2
  3    249.0
  4    258.5
  5    265.2
  6    255.7
  7    254.5

We require |Zj| ≤ 3.0 for each j
14
Example (continued)
Scores and Evaluation

  Xj      Zj       |Zj| ≤ 3.0?
  256.1   −1.463   Yes
  235.2   −2.160   Yes
  249.0   −1.700   Yes
  258.5   −1.383   Yes
  265.2   −1.160   Yes
  255.7   −1.477   Yes
  254.5   −1.517   Yes

Every |Zj| is less than 3.0
M = max(1≤j≤7) |Zj| = 2.160
The method is
obviously biased
(~15 % low)
But it passes the
MARLAP test
15
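The whole worked example fits in a few lines of Python (a sketch using the slide's numbers; variable names are ours):

```python
# Slide scenario: TV = 300 pCi/L, uReq = 30 pCi/L, k = 3.0
results = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]
TV, u_req, k = 300.0, 30.0, 3.0

z = [(x - TV) / u_req for x in results]
M = max(abs(zj) for zj in z)        # worst normalized deviation
passes = M <= k                     # MARLAP acceptance criterion

rel_bias = (sum(results) / len(results) - TV) / TV
print(round(M, 3), passes, round(100 * rel_bias, 1))   # 2.16 True -15.5
```

The method is roughly 15 % low, yet M = 2.16 stays inside the ±3.0 limits, so it passes.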
2007
In early 2007 we were developing the
new method validation guide
Applying MARLAP guidance, including
the simple test of Chapter 6
Someone suggested presenting
power curves in the context of bias
Time had come to reconsider
MARLAP’s simple test
16
Bias and Imprecision
Which is worse: bias or imprecision?
Either leads to inaccuracy
Both are tolerable if not too large
When we talk about uncertainty (à la
GUM), we don’t distinguish between
the two
17
Mean Squared Error
When characterizing a method, we
often consider bias and imprecision
separately
Uncertainty estimates combine them
There is a concept in statistics that
also combines them: mean squared
error
18
Definition of MSE
If X is an estimator for a parameter θ,
the mean squared error of X is
MSE(X) = E[(X − θ)²] by definition
It also equals
MSE(X) = V(X) + Bias(X)² = σ² + δ²
If X is unbiased, MSE(X) = V(X) = σ²
We tend to think in terms of the root
MSE, which is the square root of MSE
19
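A quick Monte Carlo check of the identity MSE = σ² + δ² (an illustrative sketch; the numbers θ = 0, δ = 4, σ = 3 are ours, not from the slides):

```python
import random

random.seed(42)
theta, delta, sigma = 0.0, 4.0, 3.0   # true value, bias, standard deviation

# X is a biased, imprecise estimator: X ~ N(theta + delta, sigma^2)
n = 100_000
sq_err = [(random.gauss(theta + delta, sigma) - theta) ** 2 for _ in range(n)]
mse_est = sum(sq_err) / n     # estimates sigma**2 + delta**2 = 25
rmse_est = mse_est ** 0.5     # root MSE, close to 5
```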
New Approach
For the method validation guide we
chose a new conceptual approach
A method is adequate if its root MSE
at each activity level does not exceed
the required uncertainty at that level
We don’t care whether the MSE is
dominated by bias or imprecision
20
Root MSE v. Standard Uncertainty
Are root MSE and standard
uncertainty really the same thing?
Not exactly, but one can interpret the
GUM’s treatment of uncertainty in
such a way that the two are closely
related
We think our approach – testing
uncertainty by testing MSE – is
reasonable
21
Chi-squared Test Revisited
For the new method validation
document we simplified the χ2 test
proposed (and rejected) in 2003
Ignore uncertainties of target values,
which should be small
Just use a straightforward χ2 test
Presented as an alternative in App. E
But the document still uses MARLAP’s
simple test
22
The Two Hypotheses
We’re now explicitly testing the MSE
Null hypothesis (H0): MSE ≤ u²Req
Alternative hypothesis (H1): MSE > u²Req
In MARLAP the two hypotheses were not
clearly stated
Assumed any bias (δ) would be small
We were mainly testing variance (σ2)
23
A χ2 Test for Variance
Imagine we really tested variance only
H0: σ² ≤ u²Req
H1: σ² > u²Req
We could calculate a χ² statistic
χ² = (1/u²Req) Σ(j=1..N) (Xj − X̄)²
Chi-squared with N − 1 degrees of freedom
Presumes there may be bias but doesn’t
test for it
24
MLE for Variance
The maximum-likelihood estimator (MLE) for
σ² when the mean is unknown is:
σ̂²X = (1/N) Σ(j=1..N) (Xj − X̄)²
Notice similarity to χ² from preceding slide
χ² = (1/u²Req) Σ(j=1..N) (Xj − X̄)²
25
Another χ2 Test for Variance
We could calculate a different χ2 statistic
χ² = (1/u²Req) Σ(j=1..N) (Xj − TV)²
N degrees of freedom
Can be used to test variance if there is no
bias
Any bias increases the rejection rate
26
MLE for MSE
The MLE for the MSE is:
σ̂²X + (X̄ − TV)² = (1/N) Σ(j=1..N) (Xj − TV)²
Notice similarity to χ² from preceding slide
χ² = (1/u²Req) Σ(j=1..N) (Xj − TV)²
In the context of biased measurements, χ2
seems to assess MSE rather than variance
27
Our Proposed χ2 Test for MSE
For a given activity level (TV), calculate a χ2
statistic W:
W = (1/u²Req) Σ(j=1..N) (Xj − TV)² = Σ(j=1..N) Zj²
Calculate the critical value of W as follows:
wC = χ²_(1−α)(N)
N = number of replicate measurements
α = max false rejection rate at this level
28
Multiple Activity Levels
When testing at more than one activity
level, calculate the critical value as follows:
wC = χ²_((1−α)^(1/L))(N)
Where L is the number of levels and N is the
number of measurements at each level
Now α is the maximum overall false
rejection rate
29
Evaluation Criteria
To perform the test, calculate Wi at
each activity level TVi
Compare each Wi to wC
If Wi > wC for any i, reject the method
The method must pass the test at
each spike activity level
Don’t allow bad performance at one
level just because of good
performance at another
30
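These criteria are easy to express as a function. A sketch (the name `evaluate_method` is ours; the Z scores are assumed precomputed as Zj = (Xj − TVi)/uReq(TVi) for each level):

```python
def evaluate_method(z_by_level, w_crit):
    """Chi-squared MSE test across activity levels.

    z_by_level: one list of Z scores per activity level.
    Returns (passed, [W statistics]); the method must pass at every level.
    """
    w_stats = [sum(z * z for z in zs) for zs in z_by_level]
    return all(w <= w_crit for w in w_stats), w_stats
```

Passing requires Wi ≤ wC at every level, so good performance at one level cannot excuse bad performance at another.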
Lesson Learned
Don’t test at too many levels
Otherwise you must choose:
High false acceptance rate at each level,
High overall false rejection rate, or
Complicated evaluation criteria
Prefer to keep error rates low
Need a low level and a high level
But probably not more than three
levels (L = 3)
31
Better Use of Same Data
The χ2 test makes better use of the
measurement data than the MARLAP
test
The statistic W is calculated from all
the data at a given level – not just
the most extreme value
32
Caveat
The distribution of W is not
completely determined by the MSE
Depends on how MSE is partitioned
into variance and bias components
Our test looks like a test of variance
As if we know δ = 0 and we’re testing σ2
only
But we’re actually using it to test MSE
33
False Rejections
If wC < N, the maximum false rejection rate
(100 %) occurs when δ = ±uReq and σ = 0
But you’ll never have this situation in practice
If wC ≥ N + 2, the maximum false rejection
rate occurs when σ = uReq and δ = 0
This is the usual situation
Why we can assume the null distribution is χ2
Otherwise the maximum false rejection rate
occurs when both δ and σ are nonzero
This situation is unlikely in practice
34
To Avoid High Rejection Rates
We must have wC ≥ N + 2
This will always be true if α < 0.08, even if L = N = 1
Ensures the maximum false rejection rate
occurs when δ = 0 and the MSE is just σ2
Not stated explicitly in App. E, because:
We didn’t have a proof at the time
Not an issue if you follow the procedure
Now we have a proof
35
Example: Critical Value
Suppose L = 3 and N = 7
Let α = 0.05
Then the critical value for W is
wC = χ²_((1−α)^(1/L))(N) = χ²_(0.95^(1/3))(7) = χ²_0.98305(7) = 17.1
Since wC ≥ N + 2 = 9, we won’t have
unexpectedly high false rejection rates
Since α < 0.08, we didn’t really have to check
36
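For readers without statistical tables, the same value can be reproduced with only the Python standard library. A sketch using a series expansion of the regularized incomplete gamma function plus bisection for the quantile (`gamma_p` and `chi2_ppf` are our helper names):

```python
import math

def gamma_p(a, x):
    """Regularized lower incomplete gamma P(a, x), by series expansion."""
    if x <= 0:
        return 0.0
    term, total = 1.0 / a, 1.0 / a
    n = 0
    while term > total * 1e-16:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_ppf(p, df):
    """Chi-squared quantile by bisection; bracket adequate for small df."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if gamma_p(df / 2, mid / 2) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha, L, N = 0.05, 3, 7
wc = chi2_ppf((1 - alpha) ** (1 / L), N)
print(round(wc, 1))   # ≈ 17.1
```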
Some Facts about the Power
The power always increases with |δ|
The power increases with σ if |δ| ≤ uReq √(wC/N)
or if σ² ≥ u²Req wC / (N + 2)
For a given bias δ with |δ| > uReq √(wC/N),
there is a positive value of σ that minimizes
the power
If |δ| > uReq √(wC/N), even this minimum
power exceeds 50 %
Power increases with N
37
Power Comparisons
We compared the tests for power
Power to reject a biased method
Power to reject an imprecise method
The χ2 test outperforms the simple
MARLAP test on both counts
Results of comparisons at end of this
presentation
38
False Rejection Rates
[Figure: rejection rate of the test in the (δ, σ) plane when wC ≥ N + 2. The boundary σ² + δ² = u²Req separates H0 (inside) from H1 (outside). The rejection rate equals α at δ = 0, σ = uReq, is less than α elsewhere under H0, and falls to 0 toward the origin; δ axis runs from −uReq to uReq.]
39
Region of Low Power
[Figure: the same (δ, σ) plane with wC ≥ N + 2, highlighting the region of low power: the part of H1 with small σ and |δ| between uReq and uReq √(wC/N), where the rejection rate stays near α; axis marks at ±uReq and uReq √(wC/N).]
40
Region of Low Power (MARLAP)
[Figure: the corresponding plot for MARLAP’s test: the region of low power extends much farther, out to |δ| ≈ k uReq rather than uReq √(wC/N); axis marks at ±uReq and ±k uReq.]
41
Example: Applying the χ2 Test
Return to the scenario used earlier for the
MARLAP example
Three levels (L = 3)
Seven measurements per level (N = 7)
5 % overall false rejection rate (α = 0.05)
Consider results at just one level,
TV = 300 pCi/L, where uReq = 30 pCi/L
wC = χ²_((1−α)^(1/L))(N) = 17.1
42
Example (continued)
Reuse the data from our
earlier example
Calculate the χ2 statistic
W = Σ(j=1..N) Zj² = 17.4
Since W > wC (17.4 > 17.1),
the method is rejected

  j    Xj      Zj
  1    256.1   −1.463
  2    235.2   −2.160
  3    249.0   −1.700
  4    258.5   −1.383
  5    265.2   −1.160
  6    255.7   −1.477
  7    254.5   −1.517

We’re using all the data now – not just the worst
result
43
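As a numeric check, the slide's statistic can be reproduced directly (a sketch; all numbers come from the example):

```python
results = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]
TV, u_req, w_crit = 300.0, 30.0, 17.1   # w_crit from the previous slide

z = [(x - TV) / u_req for x in results]
W = sum(zj * zj for zj in z)            # chi-squared statistic with N = 7

print(round(W, 1), W > w_crit)          # 17.4 True -> the method is rejected
```

The same biased-but-precise data that passed the MARLAP test fails here.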
Likelihood Ratio Test for MSE
We also discovered a statistical test
published in 1999, which directly
addressed MSE for analytical methods
By Danish authors Erik Holst and Poul
Thyregod
It’s a “likelihood ratio” test, which is a
common, well accepted approach to
hypothesis testing
44
Likelihood Ratio Tests
To test a hypothesis about a
parameter θ, such as the MSE
First find a likelihood function L(θ),
which tells how “likely” a value of θ is,
given the observed experimental data
Based on the probability mass function
or probability density function for the
data
45
Test Statistic
Maximize L(θ) on all possible values of θ and
again on all values of θ that satisfy the null
hypothesis H0
Can use the ratio of these two maxima as a
test statistic
Λ = max(θ∈H0) L(θ) / max(θ∈H0∪H1) L(θ)
The authors actually use λ = −2 ln(Λ) as the
statistic for testing MSE
46
Critical Values
It isn’t simple to derive equations for
λ, or to calculate percentiles of its
distribution, but Holst and Thyregod
did both
They used numerical integration to
approximate percentiles of λ, which
serve as critical values
47
Equations
For the two-sided test statistic, λ:
Z̄ = (1/N) Σ(j=1..N) Zj
σ̂²Z = (1/N) Σ(j=1..N) (Zj − Z̄)²
σ̃² = σ̂²Z + (Z̄ − δ̃)²
λ = N [ σ̃²/(1 − δ̃²) − 1 − ln(σ̂²Z/(1 − δ̃²)) ]
Where δ̃ is the unique real root of the cubic
polynomial δ̃³ − Z̄ δ̃² + (σ̂²Z + Z̄²) δ̃ − Z̄
See Holst & Thyregod for details
48
One-Sided Test
We actually need the one-sided test
statistic:
λ* = λ, if σ̂²Z + Z̄² > 1
λ* = 0, otherwise
This is equivalent to:
λ* = λ, if Σ(j=1..N) (Xj − TV)² > N u²Req
λ* = 0, otherwise
49
Issues
The distribution of either λ or λ* is not
completely determined by the MSE
Under H0 with σ² + δ² = u²Req, the percentiles
λ_(1−α) and λ*_(1−α) are maximized as σ → 0 and
|δ| → uReq
To ensure the false rejection rate never
exceeds α, use the maximum value of the
percentile as the critical value
Apparently we improved on the authors’
method of calculating this maximum
50
Distribution Function for λ*
To calculate max values of the percentiles,
use the following “cdf” for λ*:
[Equation: the closed-form cdf F_λ*(x; N), a finite sum of terms in e^(−x/2) and e^(−x/N) with binomial coefficients; see App. E of the guide for the full expression]
From this equation, obtain percentiles of the
null distribution by iteration
Select a percentile (e.g., 95th) as a critical
value
52
The Downside
More complicated to implement
Critical values are not readily
available (unlike percentiles of χ2)
Unless you can program the equation
from the preceding slide in software
53
Power of the Likelihood Ratio Test
More powerful than either the χ2 test
or MARLAP’s test for rejecting biased
methods
Sometimes much more powerful
Slightly less powerful at rejecting
unbiased but imprecise methods
Not so much worse that we wouldn’t
consider it a reasonable alternative
54
Power Comparisons
Same scenario as before:
Level D method validation
3 activity levels: AL/2, AL, 3×AL
7 replicate measurements per level
φMR is 0.10, or 10 %
Constant relative bias at all levels
Assume ratio σ/uReq constant at all
levels
55
Power Curves
[Figure: power (P, 0 to 1) versus relative bias (0 to 20 %) for RSD = 5 % at AL; one curve each for the MARLAP, chi-squared, and likelihood ratio tests.]
56
Power Curves
[Figure: power (P, 0 to 1) versus relative bias (0 to 20 %) for RSD = 7.5 % at AL; one curve each for the MARLAP, chi-squared, and likelihood ratio tests.]
57
Power Curves
[Figure: power (P, 0 to 1) versus relative bias (0 to 20 %) for RSD = 10 % at AL; one curve each for the MARLAP, chi-squared, and likelihood ratio tests.]
58
Power Curves
[Figure: power (P, 0 to 1) versus relative bias (0 to 20 %) for RSD = 12.5 % at AL; one curve each for the MARLAP, chi-squared, and likelihood ratio tests.]
59
Power Contours
Same assumptions as before (Level D
method validation, etc.)
Contour plots show power as a
function of both δ and σ
Horizontal coordinate is bias (δ) at
the action level
Vertical coordinate is the standard
deviation (σ)
Power is shown as color
60
Power
[Figure: contour plot over (δ, σ); δ axis marked at ±uReq, σ axis extends to 2uReq.]
Power of MARLAP’s test
61
Power
[Figure: contour plot over (δ, σ); δ axis marked at ±uReq, σ axis marked at uReq and 2uReq, with an additional mark at uReq √(wC/N).]
Power of the chi-squared test
62
Power
[Figure: contour plot over (δ, σ); δ axis marked at ±uReq, σ axis extends to 2uReq.]
Power of the likelihood ratio test
63
Recommendations
You can still use the MARLAP test
We prefer the χ2 test of App. E.
It’s simple
Critical values are widely available (percentiles
of χ2)
It outperforms the MARLAP test
The likelihood ratio test is a possibility, but
It is somewhat complicated
Our guide doesn’t give you enough information
to implement it
64
Implications for MARLAP
The χ2 test for MSE will likely be
included in revision 1 of MARLAP
So will the likelihood ratio test
Or maybe a variant of it
One or both of these will probably
become the recommended test for
evaluating a candidate method
65
Questions
Power Calculations – MARLAP
For MARLAP’s test, probability of rejecting a
method at a given activity level is
Pr[reject] = 1 − [ Φ((k uReq − δ)/σ) − Φ((−k uReq − δ)/σ) ]^N
Where σ is the method’s standard deviation
at that level, δ is the method’s bias, and k is
the multiplier calculated earlier (k ≈ 3)
Φ(z) is the cdf for the standard normal
distribution
67
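This formula can be evaluated with the standard library alone (a sketch; `marlap_power` is our name):

```python
from statistics import NormalDist

def marlap_power(delta, sigma, u_req, n, alpha=0.05):
    """Pr[reject] = 1 - [Phi((k*u - d)/s) - Phi((-k*u - d)/s)]**n."""
    nd = NormalDist()
    k = nd.inv_cdf(0.5 + 0.5 * (1 - alpha) ** (1 / n))
    inside = nd.cdf((k * u_req - delta) / sigma) - nd.cdf((-k * u_req - delta) / sigma)
    return 1 - inside ** n

# At the null boundary (no bias, sigma = uReq) the rejection rate is alpha:
print(round(marlap_power(0.0, 30.0, 30.0, 21), 3))   # 0.05
```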
Power Calculations (continued)
Same probability is calculated by the
following equation
Pr[reject] = 1 − F_χ²( k² u²Req / σ² ; 1, δ²/σ² )^N
F_χ²(x; ν, λ) is the cdf for the non-central χ²
distribution
In this case, with ν = 1 degree of freedom and
non-centrality parameter λ = δ² / σ²
68
Power Calculations – χ2
For the new χ2 test, the probability of
rejecting a method is
Pr[reject] = 1 − F_χ²( wC u²Req / σ² ; N, N δ²/σ² )
Where again σ is the method’s standard
deviation at that level and δ is the method’s
bias
69
Non-Central χ2 CDF
The cdf for the non-central χ2 distribution is
given by
F_χ²(x; ν, λ) = Σ(j=0..∞) [ e^(−λ/2) (λ/2)^j / j! ] P(ν/2 + j, x/2)
Where P(∙,∙) denotes the incomplete gamma
function
You can find algorithms for P(∙,∙), e.g., in
books like Numerical Recipes
70
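The series can be implemented with the standard library (a sketch; `gamma_p` and `ncx2_cdf` are our names, and truncating at 200 terms is an assumption that is ample for moderate non-centrality):

```python
import math

def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), series form."""
    if x <= 0:
        return 0.0
    term, total = 1.0 / a, 1.0 / a
    n = 0
    while term > total * 1e-16:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def ncx2_cdf(x, nu, lam, terms=200):
    """Non-central chi-squared cdf: Poisson(lam/2) mixture of P(nu/2 + j, x/2)."""
    if lam <= 0:
        return gamma_p(nu / 2, x / 2)
    return sum(
        math.exp(-lam / 2 + j * math.log(lam / 2) - math.lgamma(j + 1))
        * gamma_p(nu / 2 + j, x / 2)
        for j in range(terms)
    )

# With lam = 0 this reduces to the central chi-squared cdf:
print(round(ncx2_cdf(14.067, 7, 0.0), 3))   # ≈ 0.95
```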
Solving the Cubic
To solve the cubic equation for δ̃:
Q = (3 σ̂²Z + 2 Z̄²) / 9
R = −Z̄ (9 σ̂²Z + 7 Z̄² − 27) / 54
T = Q³ + R²
δ̃ = (R + √T)^(1/3) + (R − √T)^(1/3) + Z̄/3
71
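The recipe can be checked numerically: the returned value should satisfy the cubic. A sketch based on our reading of the slides' formulas (`restricted_bias_mle` is a hypothetical name), applied to the example Z scores:

```python
import math

def restricted_bias_mle(z):
    """Real root of d^3 - Zbar*d^2 + (s2 + Zbar^2)*d - Zbar, via Cardano.

    Coefficients Q, R, T as reconstructed from the slides.
    """
    n = len(z)
    zbar = sum(z) / n
    s2 = sum((zj - zbar) ** 2 for zj in z) / n
    Q = (3 * s2 + 2 * zbar ** 2) / 9
    R = -zbar * (9 * s2 + 7 * zbar ** 2 - 27) / 54
    T = Q ** 3 + R ** 2          # Q > 0, so T > 0: exactly one real root
    cbrt = lambda v: math.copysign(abs(v) ** (1 / 3), v)
    rt = math.sqrt(T)
    return cbrt(R + rt) + cbrt(R - rt) + zbar / 3, zbar, s2

z = [-1.463, -2.160, -1.700, -1.383, -1.160, -1.477, -1.517]
d, zbar, s2 = restricted_bias_mle(z)
# The root should satisfy the cubic (residual ~ 0):
residual = d**3 - zbar * d**2 + (s2 + zbar**2) * d - zbar
```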
Variations
Another possibility: Use Holst & Thyregod’s
methodology to derive a likelihood ratio test
for H0: MSE ≤ k² u²Req versus H1: MSE > k² u²Req
There are a couple of new issues to deal
with when k > 1
But the same approach mostly works
Only recently considered – not fully
explored yet
72