
Bootstrap Methods: Recent
Advances and New Applications
2007 Nonparametrics Conference
Michael Chernick
United BioSource Corporation
October 11, 2007
1
Bootstrap Topics
• Introduction to Bootstrap
• Wide Variety of Applications
• Confidence regions and hypothesis tests
• Examples of bootstrap applications:
(1) P-value adjustment - consulting example,
(2) Bioequivalence - Individual Bioequivalence
• Examples where bootstrap is not consistent and remedies:
(1) infinite variance case for a population mean and
(2) extreme values
• Available Software
2
Introduction
• The bootstrap is a general method for doing
statistical analysis without making strong
parametric assumptions.
• Efron’s nonparametric bootstrap resamples
the original data with replacement.
• It was originally designed to estimate bias and
standard errors for statistical estimates much
like the jackknife.
3
Introduction (continued)
• The bootstrap is similar to earlier
techniques which are also called
resampling methods:
– (1) jackknife,
– (2) cross-validation,
– (3) delta method,
– (4) permutation methods, and
– (5) subsampling.
4
Introduction (continued)
The technique was extended, modified and
refined to handle a wide variety of problems
including:
– (1) confidence intervals and hypothesis tests,
– (2) linear and nonlinear regression,
– (3) time series analysis and other problems
5
Introduction (continued)
Definition of Efron’s nonparametric bootstrap.
Given a sample of n independent identically
distributed (i.i.d.) observations X1, X2, …, Xn from
a distribution F and a parameter θ of the
distribution F with a real-valued estimator
θ̂(X1, X2, …, Xn), the bootstrap estimates the
accuracy of the estimator by replacing F with Fn,
the empirical distribution, where Fn places
probability mass 1/n at each observation Xi.
6
Introduction (continued)
• Let X1*, X2*, …, Xn* be a bootstrap sample, that
is, a sample of size n drawn i.i.d. from Fn
(equivalently, with replacement from the original data).
• The bootstrap estimates the variance of
θ̂(X1, X2, …, Xn) by computing or
approximating the variance of
θ* = θ̂(X1*, X2*, …, Xn*).
7
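The variance estimate described on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name `bootstrap_variance`, the toy data in `sample`, the number of replicates and the seed are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_variance(data, estimator, n_boot=2000, seed=0):
    """Monte Carlo approximation to the bootstrap variance of estimator(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap sample is n draws with replacement from the data,
    # i.e. n i.i.d. draws from the empirical distribution Fn.
    replicates = [estimator(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.variance(replicates)

# Hypothetical data, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
var_mean = bootstrap_variance(sample, statistics.mean)
var_median = bootstrap_variance(sample, statistics.median)
print(var_mean, var_median)
```

For the sample mean no Monte Carlo is needed, since the exact bootstrap variance is the plug-in variance divided by n. That makes the mean a convenient check on the simulation before applying it to an estimator such as the median, where no closed form is available.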
Introduction (continued)
• Statistical Functionals - A functional is a
mapping that takes functions into real
numbers.
• Parameters of a distribution can usually be
expressed as functionals of the population
distribution.
• Often the standard estimate of a
parameter is the same functional applied
to the empirical distribution.
8
Introduction (continued)
• Statistical Functionals and the bootstrap.
• A parameter θ is a functional T(F) where T
denotes the functional and F is a
population distribution.
• An estimator of θ is θ̂ = T(Fn) where Fn is
the empirical distribution function.
• Many statistical problems involve
properties of the distribution of θ − θ̂: its
mean (the bias of θ̂, up to sign), variance, median, etc.
9
Introduction (continued)
• Bootstrap idea: We cannot determine the distribution of
θ − θ̂, but through the bootstrap we can determine, or
approximate through Monte Carlo, the distribution of
θ̂ − θ*, where θ* = T(Fn*) and Fn* is the empirical
distribution for a bootstrap sample X1*, X2*, …, Xn*
(θ* is a bootstrap estimate of θ).
• Based on k bootstrap samples, the Monte Carlo
approximation to the distribution of θ̂ − θ* is used to
estimate bias, variance, etc. for θ̂.
• In bootstrapping, θ̂ substitutes for θ and θ* substitutes for
θ̂. This is called the bootstrap principle.
10
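As a hedged illustration of the bootstrap principle, the sketch below estimates the bias of the plug-in variance. The average of the bootstrap replicates minus the original estimate stands in for the expected value of the estimator minus the true parameter. The helper name `bootstrap_bias`, the toy data and the replicate count are assumptions for the example.

```python
import random
import statistics

def bootstrap_bias(data, estimator, n_boot=4000, seed=1):
    """Estimate the bias of estimator(data) by the bootstrap principle."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = estimator(data)  # plays the role of the true parameter
    # Each replicate plays the role of the estimator.
    replicates = [estimator(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.mean(replicates) - theta_hat

# Hypothetical data, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
bias_hat = bootstrap_bias(sample, statistics.pvariance)
# For the plug-in variance the exact bootstrap bias is known in closed form.
ideal = -statistics.pvariance(sample) / len(sample)
print(bias_hat, ideal)
```

The plug-in variance is used here because its exact bootstrap bias is minus the plug-in variance divided by n, so the Monte Carlo answer can be compared with a known value.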
Introduction (continued)
• Basic Theory: Mathematical results show that
bootstrap estimates are consistent in particular
cases.
• Basic Idea: Empirical distributions behave in
large samples like population distributions.
The Glivenko-Cantelli theorem tells us this.
• A smoothness condition on the functional is needed to transfer
consistency to functionals of Fn, such as the
estimate of the parameter θ.
11
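The Glivenko-Cantelli theorem states that the largest gap between the empirical distribution Fn and the population distribution F, taken over all x, tends to zero almost surely as n grows. The sketch below illustrates this numerically for a standard normal population. The choice of population, the sample sizes and the helper names are assumptions made for the example.

```python
import math
import random

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_distance(sample):
    """Largest gap between the empirical cdf of sample and the N(0, 1) cdf."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x)
        # Fn jumps from i/n to (i + 1)/n at x, so the supremum is attained
        # just before or at one of the jump points.
        gap = max(gap, abs((i + 1) / n - f), abs(i / n - f))
    return gap

rng = random.Random(0)
distances = {
    n: sup_distance([rng.gauss(0.0, 1.0) for _ in range(n)])
    for n in (50, 500, 5000, 50000)
}
for n, d in distances.items():
    print(n, round(d, 4))
```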
Wide Variety of Applications
• Efron and others recognized that through
the power of fast computing the Monte
Carlo approximation could be used to
extend the bootstrap to many different
statistical problems.
12
Wide Variety of Applications
(continued)
• It can estimate process capability indices for
non-Gaussian data.
• It is used to adjust p-values in a variety of
multiple comparison situations.
• It can be extended to problems involving
dependent data including multivariate, spatial
and time series data and in sampling from
finite populations.
13
Wide Variety of Applications
(continued)
• It also has been applied to problems involving
missing data.
• In many cases, the theory justifying the use of
bootstrap (e.g. consistency theorems) has been
extended to these non i.i.d. settings.
• In other cases, the bootstrap has been modified
to “make it work.” The general case of
confidence interval estimation is a notable
example.
14
Confidence regions and hypothesis
tests
• The percentile method and other bootstrap
variations may require 1000 or more
bootstrap replications to be very useful.
• The percentile method only works under
special conditions.
• Bias correction and other adjustments are
sometimes needed to make the bootstrap
“accurate” and “correct” when the sample
size n is small or moderate.
15
Confidence regions and hypothesis
tests (continued)
• Confidence intervals are accurate or nearly
exact when the stated confidence level for the
intervals is approximately the long run
probability that the random interval contains the
“true” value of the parameter.
• Accurate confidence intervals are said to be
correct if they are approximately the shortest
length confidence intervals possible for the given
confidence level.
16
Confidence regions and hypothesis
tests (continued)
• The BCa method, the iterated bootstrap (or
double bootstrap) and the bootstrap t
method are methods for constructing
bootstrap confidence intervals that are
closer to being exact (accurate) and
correct than the percentile method in many
circumstances.
• See Chernick (2007) pp. 57-65 for details
on these methods.
17
Confidence regions and hypothesis
tests (continued)
• Hall and Martin have shown the rates at which the
coverage of various bootstrap confidence intervals approaches the
advertised confidence level as the size n of the
original sample increases.
• They use Edgeworth and Cornish-Fisher
expansions to prove these results.
• See Hall (1992) Chapter 3 or Chernick (2007)
Section 3.1 for more discussion of this.
• See Ewens and Grant (2001) Chapter 12 for
another nice treatment and comparison with
permutation tests.
18
Four Methods for Setting Approximate
Confidence Intervals for a Real-Valued
Parameter θ
1. Standard Normal Approximation
Abbreviation: θS[α]
α-Level Endpoint: θ̂ + σ̂ z(α)
Correct if: θ̂ ~ N(θ, σ²) with σ constant
2. Percentile
Abbreviation: θP[α]
α-Level Endpoint: Ĝ⁻¹(α)
Correct if: There exists a monotone transformation
φ̂ = g(θ̂), where φ = g(θ), such that φ̂ ~ N(φ, τ²) and τ is
constant
3. Bias-corrected
Abbreviation: θBC[α]
α-Level Endpoint: Ĝ⁻¹(Φ{2z0 + z(α)})
Correct if: There exists a monotone transformation
φ̂ = g(θ̂), where φ = g(θ), such that φ̂ ~ N(φ − z0τ, τ²) and τ
and z0 are constant
4. BCa
Abbreviation: θBCa[α]
α-Level Endpoint: Ĝ⁻¹(Φ{z0 + [z0 + z(α)]/[1 − a(z0 + z(α))]})
Correct if: There exists a monotone transformation
φ̂ = g(θ̂), where φ = g(θ), such that φ̂ ~ N(φ − z0τφ, τφ²),
where τφ = 1 + aφ and z0 and a are constant.
19
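The four endpoints on the preceding slide can be computed from one set of bootstrap replicates. The sketch below is an illustration under stated assumptions, not the implementation from Chernick (2007). Here Ĝ is taken to be the empirical distribution of the replicates, Φ is the standard normal distribution function, z(α) is its α quantile, z0 is estimated as Φ⁻¹ of the fraction of replicates below θ̂, σ̂ is taken to be the bootstrap standard error, and the acceleration a comes from the usual jackknife formula. The function names and the toy data are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

STD_NORMAL = NormalDist()

def quantile(sorted_values, p):
    """Empirical quantile of the sorted bootstrap replicates."""
    idx = min(len(sorted_values) - 1, max(0, int(p * len(sorted_values))))
    return sorted_values[idx]

def bootstrap_endpoints(data, estimator, alpha, n_boot=2000, seed=0):
    """alpha-level endpoints for the S, P, BC and BCa methods (data: a list)."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = estimator(data)
    reps = sorted(estimator(rng.choices(data, k=n)) for _ in range(n_boot))
    z_alpha = STD_NORMAL.inv_cdf(alpha)

    # 1. Standard: theta_hat + sigma_hat * z(alpha), with the bootstrap
    #    standard error used as sigma_hat (an assumption of this sketch).
    standard = theta_hat + stdev(reps) * z_alpha
    # 2. Percentile: the alpha quantile of the bootstrap distribution.
    percentile = quantile(reps, alpha)
    # 3. BC: z0 measures the median bias of the bootstrap distribution.
    frac_below = sum(r < theta_hat for r in reps) / n_boot
    frac_below = min(max(frac_below, 1 / n_boot), 1 - 1 / n_boot)
    z0 = STD_NORMAL.inv_cdf(frac_below)
    bc = quantile(reps, STD_NORMAL.cdf(2 * z0 + z_alpha))
    # 4. BCa: acceleration a from leave-one-out (jackknife) estimates.
    jack = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = sum((jbar - j) ** 3 for j in jack) / den if den > 0 else 0.0
    adjusted = z0 + (z0 + z_alpha) / (1 - a * (z0 + z_alpha))
    bca = quantile(reps, STD_NORMAL.cdf(adjusted))
    return {"S": standard, "P": percentile, "BC": bc, "BCa": bca}

# Hypothetical data, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
lower = bootstrap_endpoints(sample, mean, 0.05)
upper = bootstrap_endpoints(sample, mean, 0.95)
print(lower)
print(upper)
```

When z0 = 0 and a = 0 the BC and BCa endpoints reduce to the percentile endpoint, which is one way to read the "Correct if" conditions in the table.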
Hypothesis tests
• Since there is a 1-1 correspondence between
hypothesis tests and confidence intervals, a
hypothesis test about a parameter θ can be
constructed based on a bootstrap confidence
interval for θ.
• See Chernick (1999 or 2007) Section 3.2.
• Examples of hypothesis tests can be found in
Section 3.3 of Chernick (1999 or 2007).
• Advice on which method to use is also given in
Carpenter and Bithell (2000).
20
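A minimal sketch of the test-by-inversion idea on this slide: reject the null value θ0 at level 1 − (confidence level) exactly when θ0 falls outside the bootstrap confidence interval. The percentile interval is used here only for brevity. The function names, the data and the null values are assumptions for the example.

```python
import random
import statistics

def percentile_interval(data, estimator, level=0.95, n_boot=2000, seed=0):
    """Two-sided bootstrap percentile interval for estimator(data)."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(estimator(rng.choices(data, k=n)) for _ in range(n_boot))
    tail = (1 - level) / 2
    lo = reps[int(tail * n_boot)]
    hi = reps[min(n_boot - 1, int((1 - tail) * n_boot))]
    return lo, hi

def reject_null(data, estimator, theta0, level=0.95):
    """Reject H0: theta = theta0 when theta0 lies outside the interval."""
    lo, hi = percentile_interval(data, estimator, level)
    return not (lo <= theta0 <= hi)

# Hypothetical data, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
print(reject_null(sample, statistics.mean, theta0=3.5))
print(reject_null(sample, statistics.mean, theta0=10.0))
```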
Examples of bootstrap applications
• P-value adjustment - a consulting example
• Many problems in the course of a clinical trial
involve multiple comparisons or repeated
significance tests for a key endpoint at various
follow-up times.
• In these cases, the unadjusted p-values from the
individual tests are not appropriate and a p-value
adjustment is needed.
• Conservative estimates based on the Bonferroni
inequality are often used but sometimes may be
too conservative.
21
P-value Adjustment Application
• Westfall and Young (1993) have demonstrated useful
bootstrap and permutation approaches which work in a
wide variety of multiple testing situations.
• Their methods are implemented in the SAS software
package (Version 6.12 or higher) through a procedure
called PROC MULTTEST.
• Chernick has implemented this approach in a number of
clinical trials.
• As a consultant on a particular clinical trial he employed
p-value adjustment to determine if results differed
significantly depending on the country where the patient
was treated.
22
P-value Adjustment Application
(continued)
• This example is presented in Section 8.5.3 of
Chernick (2007).
• A company conducted a clinical trial for a
medical treatment in one country but due to
slow enrollment decided to extend the trial to
other countries.
• The initial country we denote as country E.
• The other four countries are labeled A, B, C
and D.
23
P-value Adjustment Application
(continued)
• Fisher’s exact test, the primary statistical analysis of the
endpoint, was used to compare failure
rates for the treatment with failure rates for the
control.
• In country E, the result showed that the
treatment was superior to the control, but this
was not the case in the other countries.
• The client wanted to show that there were
differences among countries which made the
poolability of the data questionable.
24
P-value Adjustment Application
(continued)
• They wanted to claim that only the data in
country E was relevant to the submission since
they were seeking regulatory approval only in
country E.
• This involved comparing treatment success in
each of the other countries with that in country E.
25
P-value Adjustment Application
(continued)
• There are 4 relevant pairwise
comparisons of other countries with
country E.
• Consequently, the raw p-values from the
individual Fisher tests are not
appropriate.
• The raw p-values were compared with
the Bonferroni adjustment and the
bootstrap adjustment.
26
P-value Adjustment Application
(continued)
TABLE 8.1 from Chernick (2007) page 152
Comparison of Treatment Failure Rates
Country   Failure rate
A         40% (18/45)
B         41% (58/143)
C         29% (20/70)
D         29% (51/177)
E         22% (26/116)
TABLE 8.2 from Chernick (2007) page 153
Comparison of p-value adjustments
Countries   Raw p    Bonf. p   Boot. p
E vs A      0.0307   0.1229    0.0855
E vs B      0.0021   0.0085    0.0062
E vs C      0.3826   1.0000    0.7654
E vs D      0.2776   1.0000    0.6193
27
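The Bonferroni column of Table 8.2 is simple arithmetic: each raw p-value is multiplied by the number of comparisons (4 here) and capped at 1. The sketch below reproduces it from the rounded raw p-values shown in the table. The variable names are illustrative only.

```python
# Raw p-values as printed in Table 8.2 (rounded to four decimals).
raw_p = {"E vs A": 0.0307, "E vs B": 0.0021, "E vs C": 0.3826, "E vs D": 0.2776}

k = len(raw_p)  # number of pairwise comparisons with country E
bonferroni = {pair: min(1.0, k * p) for pair, p in raw_p.items()}

for pair, p in bonferroni.items():
    print(pair, round(p, 4))
```

Because the inputs are the rounded raw p-values, the last digit can differ slightly from the table's Bonferroni column, which was presumably computed from unrounded values.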
P-value Adjustment Application
(continued)
• The raw p-values indicated that the failure
rate for E was statistically significantly
lower than the rates for A and B at the 5%
level.
• But these results are misleading since they
ignore the multiple testing.
• The Bonferroni bound shows only E and
B to be statistically significantly different
at the 10% level.
28
P-value Adjustment Application
(continued)
• But the Bonferroni bound is known to be
excessively conservative in many situations.
• Bootstrap provides an appropriate answer.
• For the bootstrap estimate we again find that E
and B are clearly different but now we find that
the p-value for E and A is below 0.10 and so E
is statistically significantly better than A at the
10% level.
29
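To make the idea of a bootstrap p-value adjustment concrete, here is a rough sketch of a single-step "min-p" adjustment in the spirit of Westfall and Young (1993), applied to the counts in Table 8.1. It is not PROC MULTTEST and will not reproduce the Boot. p column. Two simplifications are assumed: a pooled two-proportion z-test replaces Fisher's exact test, and resampling is done from the pooled outcomes of all five countries, which imposes the complete null hypothesis of equal failure rates. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return 2.0 * (1.0 - PHI(abs(z)))

def minp_adjust(counts, reference, n_boot=2000, seed=0):
    """Single-step min-p adjusted p-values for comparisons with reference."""
    rng = random.Random(seed)
    others = [c for c in counts if c != reference]
    x_ref, n_ref = counts[reference]
    raw = {c: two_prop_p(counts[c][0], counts[c][1], x_ref, n_ref) for c in others}
    # Drawing with replacement from the pooled 0/1 outcomes is the same as
    # drawing each outcome as a failure with the pooled failure rate.
    pooled_rate = sum(x for x, _ in counts.values()) / sum(n for _, n in counts.values())
    min_ps = []
    for _ in range(n_boot):
        star = {c: sum(rng.random() < pooled_rate for _ in range(n))
                for c, (x, n) in counts.items()}
        p_star = [two_prop_p(star[c], counts[c][1], star[reference], n_ref)
                  for c in others]
        min_ps.append(min(p_star))
    # Adjusted p-value: how often the smallest resampled p-value is at least
    # as extreme as the observed raw p-value.
    adjusted = {c: sum(m <= raw[c] for m in min_ps) / n_boot for c in others}
    return raw, adjusted

# (failures, patients) per country, from Table 8.1.
failures = {"A": (18, 45), "B": (58, 143), "C": (20, 70), "D": (51, 177), "E": (26, 116)}
raw_p, boot_p = minp_adjust(failures, reference="E")
for country in boot_p:
    print("E vs", country, round(raw_p[country], 4), round(boot_p[country], 4))
```

The adjustment uses the joint distribution of the four test statistics, which all share country E, and this is why it can be less conservative than the Bonferroni bound.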
Individual Bioequivalence
• The FDA has a Guidance document on how to
conduct bioequivalence (bioavailability) trials.
• Three types of bioequivalence have been
defined (1) average bioequivalence, (2)
population bioequivalence and (3) individual
bioequivalence.
• Currently the FDA only requires average
bioequivalence be shown (a change over past
policy).
• Bootstrap solutions useful in determining
individual bioequivalence and population
bioequivalence have been devised and shown to
be consistent.
30
Individual Bioequivalence: Model
In the model, each subject is crossed over twice, using one of two
sequences: TRR (the new treatment first, then
the reference treatment twice) or RTR (the reference first,
followed by the new treatment and then the reference again).
Consider the following model for the pharmacokinetic
response in a 2-treatment crossover design using
only the sequences RTR and TRR, randomized 1:1:
Yijkl = μ + Fl + Pj + Qk + Wljk + Sikl + εijkl,
where μ is the overall mean, Pj is the fixed effect for the jth
period with the constraint ∑ Pj = 0, Qk is the fixed effect for
the kth sequence with ∑ Qk = 0, and Fl is the fixed effect for the
lth drug.
31
Individual Bioequivalence: Model
(continued)
For these trials we have only two drugs, the new and old
formulations, denoted T for the new treatment and R for the
reference formulation. We also have the constraint that
FT + FR = 0. Sikl is the random effect of the ith subject in
the kth sequence with the lth treatment, Wljk is the fixed
interaction between treatment, sequence and period, and the
εijkl are independent and identically distributed random noise
(error) terms with mean 0 that are independent of
all the fixed and random effects.
32
Individual Bioequivalence:
Definition
Under the linear model given on the previous
slides, individual bioequivalence is accepted if
H0: ΔPB ≤ Δ is rejected in favor of H1: ΔPB > Δ,
where ΔPB = PTR − PRR with
PTR = prob(|YT − YR| ≤ r) and
PRR = prob(|YR − Y’R| ≤ r). Here Δ and r are
predetermined fixed constants, YR is the response the first time
the reference treatment is given and Y’R is the observed
response the second time the reference treatment is
given.
33
Bootstrap Results for this Trial
• See Schall and Luus (1993) for a description of
a bootstrap hypothesis test for this problem.
• Pigeot (2001) in a survey article describes the
Schall and Luus method in detail, shows that
their method is not consistent and modifies it by
constructing a bootstrap percentile method
confidence interval to use in the test.
• In an earlier work Shao, Kübler and Pigeot
(2000) prove that the bootstrap method Pigeot
describes in Pigeot (2001) is consistent.
34
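A heavily simplified sketch of a percentile-interval test of H0: ΔPB ≤ Δ follows. It is not the procedure of Schall and Luus, Pigeot, or Shao, Kübler and Pigeot. It ignores period, sequence and interaction effects, treats each subject as one triple (YT, YR, Y’R), and resamples subjects. The simulated data, the tolerance `r`, the bound `delta_bound` and all function names are assumptions made for illustration.

```python
import random

def delta_pb(subjects, r):
    """Plug-in estimate of P_TR - P_RR from (y_t, y_r, y_r2) triples."""
    n = len(subjects)
    p_tr = sum(abs(yt - yr) <= r for yt, yr, _ in subjects) / n
    p_rr = sum(abs(yr - yr2) <= r for _, yr, yr2 in subjects) / n
    return p_tr - p_rr

def ibe_percentile_test(subjects, r, delta_bound, alpha=0.05, n_boot=2000, seed=0):
    """Reject H0: Delta_PB <= delta_bound if the lower limit exceeds the bound."""
    rng = random.Random(seed)
    n = len(subjects)
    # Resample whole subjects so each triple stays together.
    reps = sorted(delta_pb(rng.choices(subjects, k=n), r) for _ in range(n_boot))
    lower = reps[int(alpha * n_boot)]  # one-sided percentile lower limit
    return {"delta_hat": delta_pb(subjects, r),
            "lower_limit": lower,
            "reject_h0": lower > delta_bound}

# Simulated subjects: T and R share the same distribution, so Delta_PB is 0.
data_rng = random.Random(42)
subjects = []
for _ in range(200):
    s = data_rng.gauss(0.0, 1.0)  # subject effect
    subjects.append(tuple(10.0 + s + data_rng.gauss(0.0, 0.3) for _ in range(3)))

result = ibe_percentile_test(subjects, r=0.5, delta_bound=-0.2)
print(result)
```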
Examples where the bootstrap fails
• Athreya (1987) shows that the bootstrap
estimate of the distribution of the sample mean is
inconsistent when the population
distribution has an infinite variance.
• Angus (1993) provides similar
inconsistency results for the maximum
and minimum of a sequence of
independent identically distributed
observations.
35
Examples where the bootstrap fails
(continued)
We shall describe the inconsistency of the bootstrap in
these two cases and then provide remedies:
(1) sample mean with infinite population variance,
and
(2) maximum term in an i.i.d. sequence of observations.
36
Example where the bootstrap fails - Sample
Mean with Infinite Population Variance
• Singh (1981) and Bickel and Freedman (1981)
showed that in the case of estimating the mean
from an i.i.d. sample with a finite population
variance the bootstrap procedure is consistent.
• In the case of an infinite variance, the population
distribution F might satisfy
1 − F(x) ~ c x^(−α) L(x) as x → ∞, where L is a slowly varying
function, c is a nonnegative constant and
0 < α ≤ 2.
• Under these conditions, the sample mean
appropriately normalized, converges to a stable
distribution.
37
Example where the bootstrap fails - Sample
Mean with Infinite Population Variance
(continued)
• For α = 2 the variance of F is finite and the central
limit theorem applies. For α < 2 the population variance is
infinite.
• Theorem 1 of Athreya (1987) proves the
inconsistency of the bootstrap for the case where
1 < α < 2.
• The result tells us that when we appropriately
normalize the sample mean and apply the bootstrap
substitutions the bootstrap version of the normalized
mean converges to a random probability
distribution and not to the corresponding fixed
stable distribution that the sample mean
converges to.
38
Example where the bootstrap fails - Estimating extreme values
• For i.i.d. random variables Gnedenko’s
theorem usually applies to the maximum or
minimum values.
• Gnedenko’s theorem states that when
appropriately normalized the minimum value
and the maximum value converge to one of
three extreme value distribution families.
• The appropriate family depends on the tail
behavior of the population distribution.
39
Example where the bootstrap fails - Estimating extreme values (continued)
• Angus (1993) showed that using the
appropriate normalization and the bootstrap
substitution, the maximum and minimum
converge to a random probability
distribution and not the fixed extreme value
distribution from Gnedenko’s theorem that the
sample extremes converge to.
40
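One informal way to see the trouble with the maximum, which is a standard textbook illustration rather than Angus's argument, is that a bootstrap sample of size n contains the observed maximum with probability 1 − (1 − 1/n)^n, which tends to 1 − 1/e ≈ 0.632. The bootstrap distribution of the maximum therefore keeps a large point mass at the sample maximum, whereas the sampling distribution of the maximum from a continuous population has none. The sketch below checks this by simulation. The uniform data, sample size and function name are assumptions.

```python
import random

def frac_max_reproduced(data, n_boot=4000, seed=0):
    """Fraction of bootstrap samples whose maximum equals the sample maximum."""
    rng = random.Random(seed)
    n = len(data)
    top = max(data)
    hits = sum(max(rng.choices(data, k=n)) == top for _ in range(n_boot))
    return hits / n_boot

data_rng = random.Random(123)
n = 200
data = [data_rng.random() for _ in range(n)]  # hypothetical uniform sample

observed = frac_max_reproduced(data)
theory = 1 - (1 - 1 / n) ** n
print(observed, theory)
```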
Bootstrap Remedies
• In the past decade, researchers have found
remedies for many of the problems
where the bootstrap is inconsistent,
giving modified bootstrap solutions
that are consistent.
• For both problems described thus far, a
simple procedure called the m-out-of-n
bootstrap has been shown to lead to
consistent estimates.
41
The m-out-of-n Bootstrap
• This idea was proposed by Bickel and Ren (1996) for
handling doubly censored data.
• Instead of sampling n times with replacement from a
sample of size n, they suggest sampling only m times,
where m is much less than n.
• To get the consistency results, both m and n need to get
large but at different rates. We need m = o(n), that is,
m/n → 0 as m and n both → ∞.
• This method leads to consistent bootstrap estimates in
many cases where the ordinary bootstrap has problems,
particularly (1) mean with infinite variance and (2)
extreme value distributions.
42
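A minimal sketch of m-out-of-n resampling: the only change from the ordinary bootstrap is that each resample has size m instead of n. The choice m = √n below is just one rate satisfying m/n → 0 and is an assumption, as are the function name and the uniform data. With the smaller resample size, the probability that a resample reproduces the sample maximum is 1 − (1 − 1/n)^m, which tends to zero instead of 0.632.

```python
import math
import random

def m_out_of_n_replicates(data, estimator, m, n_boot=4000, seed=0):
    """Bootstrap replicates of estimator based on resamples of size m < n."""
    rng = random.Random(seed)
    return [estimator(rng.choices(data, k=m)) for _ in range(n_boot)]

data_rng = random.Random(123)
n = 400
data = [data_rng.random() for _ in range(n)]  # hypothetical uniform sample
m = math.isqrt(n)  # assumed rate: m = sqrt(n), so m/n -> 0

reps = m_out_of_n_replicates(data, max, m)
top = max(data)
frac_at_max = sum(rep == top for rep in reps) / len(reps)
theory = 1 - (1 - 1 / n) ** m
print(m, frac_at_max, theory)
```

In practice the replicates approximate the distribution of the statistic at sample size m, so any normalizing constants would be evaluated at m rather than n.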
Available Software
• Resampling Stats from Resampling Stats Inc.
(provides basic bootstrap tools in easy to use
software and is good as an elementary teaching tool).
• SPlus from Insightful Corporation (good for
advanced bootstrap techniques such as BCa, easy to
use in new Windows based version). The current
module Resample is what I use in my bootstrap class
at statistics.com.
• S functions provided by Tibshirani (see Appendix in
Efron and Tibshirani text or visit Rob Tibshirani’s web
site http:/www.stat-stanford.edu/~tibs)
43
Available Software (continued)
• Stata has a bootstrap algorithm available that
some users rave about.
• Mathworks and other examples (see Susan
Holmes’ web page,
http:/www-stat.stanford.edu/~susan, or contact her
by email).
• SAS macros are available and Proc MULTTEST
does bootstrap sampling.
44
References on confidence intervals
and hypothesis tests
(1) Chernick, M.R. (1999). Bootstrap Methods: A
Practitioner’s Guide. Wiley, New York.
(2) Chernick, M.R. (2007). Bootstrap Methods: A
Guide for Practitioners and Researchers, 2nd
Edition. Wiley, New York.
(3) Hall, P. (1992). The Bootstrap and
Edgeworth Expansion. Springer-Verlag, New
York.
(4) Efron, B. (1982) The Jackknife, the Bootstrap
and Other Resampling Plans. Society for
Industrial and Applied Mathematics CBMS-NSF
Regional Conference Series 38, Philadelphia.
45
References on confidence intervals and
hypothesis tests (continued)
(5) Carpenter, J. and Bithell, J. (2000).
Bootstrap confidence intervals: when, which,
what? A practical guide for medical statisticians.
Statistics in Medicine 19, 1141-1164.
(6) Bahadur, R.R. and Savage, L.J. (1956). The
nonexistence of certain statistical procedures in
nonparametric problems. Annals of
Mathematical Statistics 27, 1115-1122.
(7) Ewens, W.J. and Grant, G.R. (2001).
Statistical Methods in Bioinformatics: An
Introduction. Springer-Verlag, New York.
46
References on p-value adjustment
(1) Chernick, M.R. (2007). Bootstrap Methods: A
Guide for Practitioners and Researchers, 2nd
Edition. Wiley, New York.
(2) Westfall, P. and Young, S. S. (1993).
Resampling-Based Multiple Testing: Examples
of p-Value Adjustment. Wiley, New York.
47
References for Individual Bioequivalence
(1) Chernick, M.R. (2007). Bootstrap Methods: A
Guide for Practitioners and Researchers, 2nd
Edition. Wiley, New York.
(2) Pigeot, I. (2001). The jackknife and bootstrap in
biomedical research – Common principles and
possible pitfalls. Drug Information J. 35, 1431-1443.
(3) Schall, R., and Luus, H. G. (1993). On
population and individual bioequivalence. Statist.
Med. 12, 1109-1124.
(4) Shao, J., Kübler, J., and Pigeot, I. (2000).
Consistency of the bootstrap procedure in
individual bioequivalence. Biometrika 87, 573-585.
48
References on when bootstrap fails
(1) Angus, J. E. (1993). Asymptotic theory for
bootstrapping the extremes. Communs. Statist.
Theory and Methods 22, 15-30.
(2) Athreya, K. B. (1987). Bootstrap estimation
of the mean in the infinite variance case. Ann.
Statist. 15, 724-731.
(3) Bickel, P. J. and Freedman, D. A. (1981).
Some asymptotic theory for the bootstrap. Ann.
Statist. 9, 1196-1217.
49
References on when bootstrap fails
(continued)
(4) Chernick, M.R. (1999). Bootstrap Methods: A
Practitioner’s Guide. Wiley, New York.
(5) Chernick, M.R. (2007). Bootstrap Methods: A
Guide for Practitioners and Researchers, 2nd
Edition. Wiley, New York.
(6) Cochran, W. (1977). Sampling Techniques,
3rd Edition. Wiley, New York.
50
References on when bootstrap fails
(continued)
(7) Knight, K. (1989). On the bootstrap of the
sample mean in the infinite variance case. Ann.
Statist. 17, 1168-1175.
(8) LePage, R., and Billard, L. (editors). (1992).
Exploring the Limits of Bootstrap. Wiley, New
York.
(9) Mammen, E. (1992). When Does the
Bootstrap Work? Asymptotic Results and
Simulations. Springer-Verlag, Heidelberg.
(10) Singh, K. (1981). On the asymptotic
accuracy of Efron’s bootstrap. Ann. Statist. 9,
1187-1195.
51