New Statistical Tests for Method Validation
Beyond MARLAP:
New Statistical Tests
For Method Validation
NAREL – ORIA – US EPA
Laboratory Incident Response Workshop
At the 53rd Annual RRMC
Outline
The method validation problem
MARLAP’s test
And its peculiar features
New approach – testing mean squared
error (MSE)
Two possible tests of MSE
Chi-squared test
Likelihood ratio test
Power comparisons
Recommendations and implications for
MARLAP
2
The Problem
We’ve prepared spiked samples at
one or more activity levels
A lab has performed one or more
analyses of the samples at each level
Our task: Evaluate the results to see
whether the lab and method can
achieve the required uncertainty (uReq)
at each level
3
MARLAP’s Test
In 2003 the MARLAP work group
developed a simple test for MARLAP
Chapter 6
Chose a very simple criterion
Original criterion was whether every
result was within ±3uReq of the target
Modified slightly to keep false
rejection rate ≤ 5 % in all cases
4
Equations
Acceptance range is TV ± k uReq where
TV = target value (true value)
uReq = required uncertainty at TV, and
k = z_p, where p = 0.5 + 0.5 (1 − α)^(1/n)
E.g., for n = 21 measurements (7 reps at
each of 3 levels), with α = 0.05, we get
k = z_0.99878 = 3.03
For smaller n we get slightly smaller k
5
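The multiplier can be reproduced with the Python standard library's NormalDist (a quick sketch; `marlap_k` is our name, not MARLAP's):

```python
from statistics import NormalDist

def marlap_k(n, alpha=0.05):
    """Multiplier k = z_p with p = 0.5 + 0.5*(1 - alpha)**(1/n)."""
    p = 0.5 + 0.5 * (1 - alpha) ** (1 / n)
    return NormalDist().inv_cdf(p)

k = marlap_k(21)      # 7 replicates at each of 3 levels
print(round(k, 2))    # ≈ 3.03
```

As the slide notes, smaller n gives a slightly smaller k, since p shrinks toward 0.975.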
Required Uncertainty
The required uncertainty, uReq, is a function
of the target value
uReq(TV) = uMR, if TV ≤ UBGR
uReq(TV) = φMR × TV, if TV > UBGR
Where uMR is the required method
uncertainty at the upper bound of the gray
region (UBGR)
φMR is the corresponding relative method
uncertainty
6
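A minimal sketch of this piecewise definition, with the example values from later slides (UBGR = 100 pCi/L, uMR = 10 pCi/L, φMR = 0.10) as assumed defaults:

```python
def u_req(tv, ubgr=100.0, u_mr=10.0, phi_mr=0.10):
    """Required uncertainty at target value tv.

    Defaults are the example numbers used later in this deck
    (UBGR = 100 pCi/L, uMR = 10 pCi/L, phiMR = 0.10).
    """
    return u_mr if tv <= ubgr else phi_mr * tv

# Flat below the UBGR, 10 % of TV above it:
print(u_req(50.0), round(u_req(300.0)))   # 10.0 30
```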
Alternatives
We considered a chi-squared (χ2) test as an
alternative in 2003
Accounted for uncertainty of target values
using “effective degrees of freedom”
Rejected at the time because of complexity
and lack of evidence for performance
Kept the simple test that now appears in
MARLAP Chapter 6
But we didn’t forget about the χ2 test
7
Peculiarity of MARLAP’s Test
Power to reject a biased but precise
method decreases with number of
analyses performed (n)
Because we adjusted the acceptance
limits to keep false rejection rates low
Acceptance range gets wider as n
gets larger
8
Biased but Precise
This graphic image was borrowed and edited
for the RRMC workshop presentation. Please
view the original now at despair.com.
http://www.despair.com/consistency.html
Best Use of Data?
It isn’t just about bias
MARLAP’s test uses data inefficiently – even
to evaluate precision alone (its original
purpose)
The statistic – in effect – is just the worst
normalized deviation from the target value
M = max(1≤j≤n) |Zj|
where
Zj = (Xj − TV) / uReq
Wastes a lot of useful information
10
Example: The MARLAP Test
Suppose we perform a level D
method validation experiment
UBGR = AL = 100 pCi/L
uMR = 10 pCi/L
φMR = 10/100 = 0.10, or 10 %
Three activity levels (L = 3)
50 pCi/L, 100 pCi/L, and 300 pCi/L
Seven replicates per level (N = 7)
Allow 5 % false rejections (α = 0.05)
11
Example (continued)
For 21 measurements, calculate
k = z_(0.5 + 0.5(1 − 0.05)^(1/21)) = z_0.99878 = 3.03 ≈ 3.0
When evaluating measurement results for
target value TV, require for each result Xj:
|Zj| = |Xj − TV| / uReq ≤ 3.0
Equivalently, require
M = max(1≤j≤21) |Zj| ≤ 3.0
12
Example (continued)
We’ll work through calculations at just
one target value
Say TV = 300 pCi/L
This value is greater than UBGR
(100 pCi/L)
So, the required uncertainty is 10 %
of 300 pCi/L
uReq = 30 pCi/L
13
Example (continued)
Suppose the lab produces the 7
results Xj shown below
For each result, calculate
the “Z score”
Zj = (Xj − TV) / uReq = (Xj − 300 pCi/L) / (30 pCi/L)

  j    Xj (pCi/L)
  1    256.1
  2    235.2
  3    249.0
  4    258.5
  5    265.2
  6    255.7
  7    254.5

We require |Zj| ≤ 3.0 for each j
14
Example (continued)
Scores and Evaluation

  Xj      Zj       |Zj| ≤ 3.0?
  256.1   −1.463   Yes
  235.2   −2.160   Yes
  249.0   −1.700   Yes
  258.5   −1.383   Yes
  265.2   −1.160   Yes
  255.7   −1.477   Yes
  254.5   −1.517   Yes

Every |Zj| is less than 3.0
M = max(1≤j≤7) |Zj| = 2.160
The method is
obviously biased
(~15 % low)
But it passes the
MARLAP test
15
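The whole worked example fits in a few lines of Python (a sketch using the slide's numbers; variable names are ours):

```python
# Slide scenario: TV = 300 pCi/L, uReq = 30 pCi/L, k = 3.0
results = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]
TV, u_req, k = 300.0, 30.0, 3.0

z = [(x - TV) / u_req for x in results]
M = max(abs(zj) for zj in z)        # worst normalized deviation
passes = M <= k                     # MARLAP acceptance criterion

rel_bias = (sum(results) / len(results) - TV) / TV
print(round(M, 3), passes, round(100 * rel_bias, 1))   # 2.16 True -15.5
```

The method is roughly 15 % low, yet M = 2.16 stays inside the ±3.0 limits, so it passes.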
2007
In early 2007 we were developing the
new method validation guide
Applying MARLAP guidance, including
the simple test of Chapter 6
Someone suggested presenting
power curves in the context of bias
Time had come to reconsider
MARLAP’s simple test
16
Bias and Imprecision
Which is worse: bias or imprecision?
Either leads to inaccuracy
Both are tolerable if not too large
When we talk about uncertainty (à la
GUM), we don’t distinguish between
the two
17
Mean Squared Error
When characterizing a method, we
often consider bias and imprecision
separately
Uncertainty estimates combine them
There is a concept in statistics that
also combines them: mean squared
error
18
Definition of MSE
If X is an estimator for a parameter θ,
the mean squared error of X is
MSE(X) = E[(X − θ)²] by definition
It also equals
MSE(X) = V(X) + Bias(X)² = σ² + δ²
If X is unbiased, MSE(X) = V(X) = σ²
We tend to think in terms of the root
MSE, which is the square root of MSE
19
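A quick Monte Carlo check of the identity MSE = σ² + δ² (an illustrative sketch; the numbers θ = 0, δ = 4, σ = 3 are ours, not from the slides):

```python
import random

random.seed(42)
theta, delta, sigma = 0.0, 4.0, 3.0   # true value, bias, standard deviation

# X is a biased, imprecise estimator: X ~ N(theta + delta, sigma^2)
n = 100_000
sq_err = [(random.gauss(theta + delta, sigma) - theta) ** 2 for _ in range(n)]
mse_est = sum(sq_err) / n     # estimates sigma**2 + delta**2 = 25
rmse_est = mse_est ** 0.5     # root MSE, close to 5
```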
New Approach
For the method validation guide we
chose a new conceptual approach
A method is adequate if its root MSE
at each activity level does not exceed
the required uncertainty at that level
We don’t care whether the MSE is
dominated by bias or imprecision
20
Root MSE v. Standard Uncertainty
Are root MSE and standard
uncertainty really the same thing?
Not exactly, but one can interpret the
GUM’s treatment of uncertainty in
such a way that the two are closely
related
We think our approach – testing
uncertainty by testing MSE – is
reasonable
21
Chi-squared Test Revisited
For the new method validation
document we simplified the χ2 test
proposed (and rejected) in 2003
Ignore uncertainties of target values,
which should be small
Just use a straightforward χ2 test
Presented as an alternative in App. E
But the document still uses MARLAP’s
simple test
22
The Two Hypotheses
We’re now explicitly testing the MSE
Null hypothesis (H0): MSE ≤ u²Req
Alternative hypothesis (H1): MSE > u²Req
In MARLAP the two hypotheses were not
clearly stated
Assumed any bias (δ) would be small
We were mainly testing variance (σ2)
23
A χ2 Test for Variance
Imagine we really tested variance only
H0: σ² ≤ u²Req
H1: σ² > u²Req
We could calculate a χ² statistic
χ² = (1/u²Req) Σ(j=1..N) (Xj − X̄)²
Chi-squared with N − 1 degrees of freedom
Presumes there may be bias but doesn’t
test for it
24
MLE for Variance
The maximum-likelihood estimator (MLE) for
σ² when the mean is unknown is:
σ̂²X = (1/N) Σ(j=1..N) (Xj − X̄)²
Notice similarity to χ² from preceding slide
χ² = (1/u²Req) Σ(j=1..N) (Xj − X̄)²
25
Another χ2 Test for Variance
We could calculate a different χ2 statistic
χ² = (1/u²Req) Σ(j=1..N) (Xj − TV)²
N degrees of freedom
Can be used to test variance if there is no
bias
Any bias increases the rejection rate
26
MLE for MSE
The MLE for the MSE is:
σ̂²X + (X̄ − TV)² = (1/N) Σ(j=1..N) (Xj − TV)²
Notice similarity to χ² from preceding slide
χ² = (1/u²Req) Σ(j=1..N) (Xj − TV)²
In the context of biased measurements, χ2
seems to assess MSE rather than variance
27
Our Proposed χ2 Test for MSE
For a given activity level (TV), calculate a χ2
statistic W:
W = (1/u²Req) Σ(j=1..N) (Xj − TV)² = Σ(j=1..N) Zj²
Calculate the critical value of W as follows:
wC = χ²_(1−α)(N)
N = number of replicate measurements
α = max false rejection rate at this level
28
Multiple Activity Levels
When testing at more than one activity
level, calculate the critical value as follows:
wC = χ²_((1−α)^(1/L))(N)
Where L is the number of levels and N is the
number of measurements at each level
Now α is the maximum overall false
rejection rate
29
Evaluation Criteria
To perform the test, calculate Wi at
each activity level TVi
Compare each Wi to wC
If Wi > wC for any i, reject the method
The method must pass the test at
each spike activity level
Don’t allow bad performance at one
level just because of good
performance at another
30
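These criteria are easy to express as a function. A sketch (the name `evaluate_method` is ours; the Z scores are assumed precomputed as Zj = (Xj − TVi)/uReq(TVi) for each level):

```python
def evaluate_method(z_by_level, w_crit):
    """Chi-squared MSE test across activity levels.

    z_by_level: one list of Z scores per activity level.
    Returns (passed, [W statistics]); the method must pass at every level.
    """
    w_stats = [sum(z * z for z in zs) for zs in z_by_level]
    return all(w <= w_crit for w in w_stats), w_stats
```

Passing requires Wi ≤ wC at every level, so good performance at one level cannot excuse bad performance at another.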
Lesson Learned
Don’t test at too many levels
Otherwise you must choose:
High false acceptance rate at each level,
High overall false rejection rate, or
Complicated evaluation criteria
Prefer to keep error rates low
Need a low level and a high level
But probably not more than three
levels (L = 3)
31
Better Use of Same Data
The χ2 test makes better use of the
measurement data than the MARLAP
test
The statistic W is calculated from all
the data at a given level – not just
the most extreme value
32
Caveat
The distribution of W is not
completely determined by the MSE
Depends on how MSE is partitioned
into variance and bias components
Our test looks like a test of variance
As if we know δ = 0 and we’re testing σ2
only
But we’re actually using it to test MSE
33
False Rejections
If wC < N, the maximum false rejection rate
(100 %) occurs when δ = ±uReq and σ = 0
But you’ll never have this situation in practice
If wC ≥ N + 2, the maximum false rejection
rate occurs when σ = uReq and δ = 0
This is the usual situation
Why we can assume the null distribution is χ2
Otherwise the maximum false rejection rate
occurs when both δ and σ are nonzero
This situation is unlikely in practice
34
To Avoid High Rejection Rates
We must have wC ≥ N + 2
This will always be true if α < 0.08, even if L = N = 1
Ensures the maximum false rejection rate
occurs when δ = 0 and the MSE is just σ2
Not stated explicitly in App. E, because:
We didn’t have a proof at the time
Not an issue if you follow the procedure
Now we have a proof
35
Example: Critical Value
Suppose L = 3 and N = 7
Let α = 0.05
Then the critical value for W is
wC = χ²_((1−α)^(1/L))(N) = χ²_(0.95^(1/3))(7) = χ²_0.98305(7) = 17.1
Since wC ≥ N + 2 = 9, we won’t have
unexpectedly high false rejection rates
Since α < 0.08, we didn’t really have to check
36
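For readers without statistical tables, the same value can be reproduced with only the Python standard library. A sketch using a series expansion of the regularized incomplete gamma function plus bisection for the quantile (`gamma_p` and `chi2_ppf` are our helper names):

```python
import math

def gamma_p(a, x):
    """Regularized lower incomplete gamma P(a, x), by series expansion."""
    if x <= 0:
        return 0.0
    term, total = 1.0 / a, 1.0 / a
    n = 0
    while term > total * 1e-16:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_ppf(p, df):
    """Chi-squared quantile by bisection; bracket adequate for small df."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if gamma_p(df / 2, mid / 2) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha, L, N = 0.05, 3, 7
wc = chi2_ppf((1 - alpha) ** (1 / L), N)
print(round(wc, 1))   # ≈ 17.1
```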
Some Facts about the Power
The power always increases with |δ|
The power increases with σ if |δ| ≤ uReq √(wC/N)
or if σ² ≥ u²Req wC / (N + 2)
For a given bias δ with |δ| > uReq √(wC/N),
there is a positive value of σ that minimizes
the power
If |δ| > uReq √(wC/N), even this minimum
power exceeds 50 %
Power increases with N
37
Power Comparisons
We compared the tests for power
Power to reject a biased method
Power to reject an imprecise method
The χ2 test outperforms the simple
MARLAP test on both counts
Results of comparisons at end of this
presentation
38
False Rejection Rates
[Figure: rejection rate of the test in the (δ, σ) plane when wC ≥ N + 2. The boundary σ² + δ² = u²Req separates H0 (inside) from H1 (outside). The rejection rate equals α at δ = 0, σ = uReq, is less than α elsewhere under H0, and falls to 0 toward the origin; δ axis runs from −uReq to uReq.]
39
Region of Low Power
[Figure: the same (δ, σ) plane with wC ≥ N + 2, highlighting the region of low power: the part of H1 with small σ and |δ| between uReq and uReq √(wC/N), where the rejection rate stays near α; axis marks at ±uReq and uReq √(wC/N).]
40
Region of Low Power (MARLAP)
[Figure: the corresponding plot for MARLAP’s test: the region of low power extends much farther, out to |δ| ≈ k uReq rather than uReq √(wC/N); axis marks at ±uReq and ±k uReq.]
41
Example: Applying the χ2 Test
Return to the scenario used earlier for the
MARLAP example
Three levels (L = 3)
Seven measurements per level (N = 7)
5 % overall false rejection rate (α = 0.05)
Consider results at just one level,
TV = 300 pCi/L, where uReq = 30 pCi/L
wC = χ²_((1−α)^(1/L))(N) = 17.1
42
Example (continued)
Reuse the data from our
earlier example
Calculate the χ2 statistic
W = Σ(j=1..N) Zj² = 17.4
Since W > wC (17.4 > 17.1),
the method is rejected

  j    Xj      Zj
  1    256.1   −1.463
  2    235.2   −2.160
  3    249.0   −1.700
  4    258.5   −1.383
  5    265.2   −1.160
  6    255.7   −1.477
  7    254.5   −1.517

We’re using all the data now – not just the worst
result
43
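As a numeric check, the slide's statistic can be reproduced directly (a sketch; all numbers come from the example):

```python
results = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]
TV, u_req, w_crit = 300.0, 30.0, 17.1   # w_crit from the previous slide

z = [(x - TV) / u_req for x in results]
W = sum(zj * zj for zj in z)            # chi-squared statistic with N = 7

print(round(W, 1), W > w_crit)          # 17.4 True -> the method is rejected
```

The same biased-but-precise data that passed the MARLAP test fails here.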
Likelihood Ratio Test for MSE
We also discovered a statistical test
published in 1999, which directly
addressed MSE for analytical methods
By Danish authors Erik Holst and Poul
Thyregod
It’s a “likelihood ratio” test, which is a
common, well accepted approach to
hypothesis testing
44
Likelihood Ratio Tests
To test a hypothesis about a
parameter θ, such as the MSE
First find a likelihood function L(θ),
which tells how “likely” a value of θ is,
given the observed experimental data
Based on the probability mass function
or probability density function for the
data
45
Test Statistic
Maximize L(θ) on all possible values of θ and
again on all values of θ that satisfy the null
hypothesis H0
Can use the ratio of these two maxima as a
test statistic
Λ = max(θ∈H0) L(θ) / max(θ∈H0∪H1) L(θ)
The authors actually use λ = −2 ln(Λ) as the
statistic for testing MSE
46
Critical Values
It isn’t simple to derive equations for
λ, or to calculate percentiles of its
distribution, but Holst and Thyregod
did both
They used numerical integration to
approximate percentiles of λ, which
serve as critical values
47
Equations
For the two-sided test statistic, λ:
Z̄ = (1/N) Σ(j=1..N) Zj
σ̂²Z = (1/N) Σ(j=1..N) (Zj − Z̄)²
σ̃² = σ̂²Z + (Z̄ − δ̃)²
λ = N [ σ̃²/(1 − δ̃²) − 1 − ln(σ̂²Z/(1 − δ̃²)) ]
Where δ̃ is the unique real root of the cubic
polynomial δ̃³ − Z̄ δ̃² + (σ̂²Z + Z̄²) δ̃ − Z̄
See Holst & Thyregod for details
48
One-Sided Test
We actually need the one-sided test
statistic:
λ* = λ, if σ̂²Z + Z̄² > 1
λ* = 0, otherwise
This is equivalent to:
λ* = λ, if Σ(j=1..N) (Xj − TV)² > N u²Req
λ* = 0, otherwise
49
Issues
The distribution of either λ or λ* is not
completely determined by the MSE
Under H0 with σ² + δ² = u²Req, the percentiles
λ_(1−α) and λ*_(1−α) are maximized as σ → 0 and
|δ| → uReq
To ensure the false rejection rate never
exceeds α, use the maximum value of the
percentile as the critical value
Apparently we improved on the authors’
method of calculating this maximum
50
Distribution Function for λ*
To calculate max values of the percentiles,
use the following “cdf” for λ*:
[Equation: the closed-form cdf F_λ*(x; N), a finite sum of terms in e^(−x/2) and e^(−x/N) with binomial coefficients; see App. E of the guide for the full expression]
From this equation, obtain percentiles of the
null distribution by iteration
Select a percentile (e.g., 95th) as a critical
value
52
The Downside
More complicated to implement
Critical values are not readily
available (unlike percentiles of χ2)
Unless you can program the equation
from the preceding slide in software
53
Power of the Likelihood Ratio Test
More powerful than either the χ2 test
or MARLAP’s test for rejecting biased
methods
Sometimes much more powerful
Slightly less powerful at rejecting
unbiased but imprecise methods
Not so much worse that we wouldn’t
consider it a reasonable alternative
54
Power Comparisons
Same scenario as before:
Level D method validation
3 activity levels: AL/2, AL, 3×AL
7 replicate measurements per level
φMR is 0.10, or 10 %
Constant relative bias at all levels
Assume ratio σ/uReq constant at all
levels
55
Power Curves
[Figure: power (P, 0 to 1) versus relative bias (0 to 20 %) for RSD = 5 % at AL; one curve each for the MARLAP, chi-squared, and likelihood ratio tests.]
56
Power Curves
[Figure: power (P, 0 to 1) versus relative bias (0 to 20 %) for RSD = 7.5 % at AL; one curve each for the MARLAP, chi-squared, and likelihood ratio tests.]
57
Power Curves
[Figure: power (P, 0 to 1) versus relative bias (0 to 20 %) for RSD = 10 % at AL; one curve each for the MARLAP, chi-squared, and likelihood ratio tests.]
58
Power Curves
[Figure: power (P, 0 to 1) versus relative bias (0 to 20 %) for RSD = 12.5 % at AL; one curve each for the MARLAP, chi-squared, and likelihood ratio tests.]
59
Power Contours
Same assumptions as before (Level D
method validation, etc.)
Contour plots show power as a
function of both δ and σ
Horizontal coordinate is bias (δ) at
the action level
Vertical coordinate is the standard
deviation (σ)
Power is shown as color
60
Power
[Figure: contour plot over (δ, σ); δ axis marked at ±uReq, σ axis extends to 2uReq.]
Power of MARLAP’s test
61
Power
[Figure: contour plot over (δ, σ); δ axis marked at ±uReq, σ axis marked at uReq and 2uReq, with an additional mark at uReq √(wC/N).]
Power of the chi-squared test
62
Power
[Figure: contour plot over (δ, σ); δ axis marked at ±uReq, σ axis extends to 2uReq.]
Power of the likelihood ratio test
63
Recommendations
You can still use the MARLAP test
We prefer the χ2 test of App. E.
It’s simple
Critical values are widely available (percentiles
of χ2)
It outperforms the MARLAP test
The likelihood ratio test is a possibility, but
It is somewhat complicated
Our guide doesn’t give you enough information
to implement it
64
Implications for MARLAP
The χ2 test for MSE will likely be
included in revision 1 of MARLAP
So will the likelihood ratio test
Or maybe a variant of it
One or both of these will probably
become the recommended test for
evaluating a candidate method
65
Questions
Power Calculations – MARLAP
For MARLAP’s test, probability of rejecting a
method at a given activity level is
Pr[reject] = 1 − [ Φ((k uReq − δ)/σ) − Φ((−k uReq − δ)/σ) ]^N
Where σ is the method’s standard deviation
at that level, δ is the method’s bias, and k is
the multiplier calculated earlier (k ≈ 3)
Φ(z) is the cdf for the standard normal
distribution
67
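This formula can be evaluated with the standard library alone (a sketch; `marlap_power` is our name):

```python
from statistics import NormalDist

def marlap_power(delta, sigma, u_req, n, alpha=0.05):
    """Pr[reject] = 1 - [Phi((k*u - d)/s) - Phi((-k*u - d)/s)]**n."""
    nd = NormalDist()
    k = nd.inv_cdf(0.5 + 0.5 * (1 - alpha) ** (1 / n))
    inside = nd.cdf((k * u_req - delta) / sigma) - nd.cdf((-k * u_req - delta) / sigma)
    return 1 - inside ** n

# At the null boundary (no bias, sigma = uReq) the rejection rate is alpha:
print(round(marlap_power(0.0, 30.0, 30.0, 21), 3))   # 0.05
```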
Power Calculations (continued)
Same probability is calculated by the
following equation
Pr[reject] = 1 − F_χ²( k² u²Req / σ² ; 1, δ²/σ² )^N
F_χ²(x; ν, λ) is the cdf for the non-central χ²
distribution
In this case, with ν = 1 degree of freedom and
non-centrality parameter λ = δ² / σ²
68
Power Calculations – χ2
For the new χ2 test, the probability of
rejecting a method is
Pr[reject] = 1 − F_χ²( wC u²Req / σ² ; N, N δ²/σ² )
Where again σ is the method’s standard
deviation at that level and δ is the method’s
bias
69
Non-Central χ2 CDF
The cdf for the non-central χ2 distribution is
given by
F_χ²(x; ν, λ) = Σ(j=0..∞) [ e^(−λ/2) (λ/2)^j / j! ] P(ν/2 + j, x/2)
Where P(∙,∙) denotes the incomplete gamma
function
You can find algorithms for P(∙,∙), e.g., in
books like Numerical Recipes
70
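The series can be implemented with the standard library (a sketch; `gamma_p` and `ncx2_cdf` are our names, and truncating at 200 terms is an assumption that is ample for moderate non-centrality):

```python
import math

def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), series form."""
    if x <= 0:
        return 0.0
    term, total = 1.0 / a, 1.0 / a
    n = 0
    while term > total * 1e-16:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def ncx2_cdf(x, nu, lam, terms=200):
    """Non-central chi-squared cdf: Poisson(lam/2) mixture of P(nu/2 + j, x/2)."""
    if lam <= 0:
        return gamma_p(nu / 2, x / 2)
    return sum(
        math.exp(-lam / 2 + j * math.log(lam / 2) - math.lgamma(j + 1))
        * gamma_p(nu / 2 + j, x / 2)
        for j in range(terms)
    )

# With lam = 0 this reduces to the central chi-squared cdf:
print(round(ncx2_cdf(14.067, 7, 0.0), 3))   # ≈ 0.95
```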
Solving the Cubic
To solve the cubic equation for δ̃:
Q = (3 σ̂²Z + 2 Z̄²) / 9
R = −Z̄ (9 σ̂²Z + 7 Z̄² − 27) / 54
T = Q³ + R²
δ̃ = (R + √T)^(1/3) + (R − √T)^(1/3) + Z̄/3
71
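The recipe can be checked numerically: the returned value should satisfy the cubic. A sketch based on our reading of the slides' formulas (`restricted_bias_mle` is a hypothetical name), applied to the example Z scores:

```python
import math

def restricted_bias_mle(z):
    """Real root of d^3 - Zbar*d^2 + (s2 + Zbar^2)*d - Zbar, via Cardano.

    Coefficients Q, R, T as reconstructed from the slides.
    """
    n = len(z)
    zbar = sum(z) / n
    s2 = sum((zj - zbar) ** 2 for zj in z) / n
    Q = (3 * s2 + 2 * zbar ** 2) / 9
    R = -zbar * (9 * s2 + 7 * zbar ** 2 - 27) / 54
    T = Q ** 3 + R ** 2          # Q > 0, so T > 0: exactly one real root
    cbrt = lambda v: math.copysign(abs(v) ** (1 / 3), v)
    rt = math.sqrt(T)
    return cbrt(R + rt) + cbrt(R - rt) + zbar / 3, zbar, s2

z = [-1.463, -2.160, -1.700, -1.383, -1.160, -1.477, -1.517]
d, zbar, s2 = restricted_bias_mle(z)
# The root should satisfy the cubic (residual ~ 0):
residual = d**3 - zbar * d**2 + (s2 + zbar**2) * d - zbar
```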
Variations
Another possibility: Use Holst & Thyregod’s
methodology to derive a likelihood ratio test
for H0: MSE ≤ k² u²Req versus H1: MSE > k² u²Req
There are a couple of new issues to deal
with when k > 1
But the same approach mostly works
Only recently considered – not fully
explored yet
72