Resampling Techniques - TIGP Bioinformatics Program


Nonparametric Methods I
Henry Horng-Shing Lu
Institute of Statistics
National Chiao Tung University
[email protected]
http://tigpbp.iis.sinica.edu.tw/courses.htm
1
Parametric vs. Nonparametric
• MLE: probability distribution and likelihood
• Bayes: conditional, prior and posterior distributions
• Distribution free?
• http://en.wikipedia.org/wiki/Non-parametric_statistics
2
Motivation (1)
• In many applications, direct access to a measurement is not possible. However, an estimate of the measurement is needed.
• Most of the time, large-scale repetition of an experiment is not economically feasible.
• What can one do?

3
Motivation (2)
• Q1: What estimator can be used for the problem of interest?
• Q2: Having chosen an estimator, how accurate is it? What are its bias and variance?
• Q3: How do we make inference? What is the confidence interval? What is the p-value for a hypothesis test?

4
References
• B. Efron (1979) Computers and the theory of statistics: thinking the unthinkable, SIAM Review, 21, 460-480.
• B. Efron and R. J. Tibshirani (1993) An Introduction to the Bootstrap. Chapman & Hall.
• J. I. De la Rosa and G. A. Fleury (2006) Bootstrap methods for a measurement estimation problem. IEEE Transactions on Instrumentation and Measurement, 55, 3, 820–827.
• http://en.wikipedia.org/wiki/Resampling_%28statistics%29#Jackknife
• http://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29
5
Resampling Techniques

Data resampling

• PART 1: Jackknife
  • Resampling without replacement

• PART 2: Bootstrap
  • Resampling with replacement
6
PART 1: Jackknife
• Naming
• Illustration
• Math Expression
• Examples
• R codes
• C codes
7
Why the funny name of Jackknife?

Jackknife: a pocket knife
http://en.wikipedia.org/wiki/Jackknife

Mosteller and Tukey (1977, p. 133) described a
predecessor resampling method, the jackknife, in
the following way:
“The name ‘jackknife’ is intended to suggest the
broad usefulness of a technique as a substitute
for specialized tools that may not be available,
just as the Boy Scout’s trustworthy tool serves so
variedly…”
http://mrw.interscience.wiley.com/emrw/9780470013199/esbs/article/bsa321/current/abstract
8
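Illustration of Jackknife
[Figure: a sample $X_1, X_2, \ldots, X_n$ is drawn from the population (parameter $\theta$), and $\theta$ is estimated by $\hat\theta$. Resampling is repeated n times, each time leaving out one observation; a statistic is computed from each reduced sample and used for inference.]
$X_2, \ldots, X_n \rightarrow \hat\theta_{(1)}$
$X_1, X_3, \ldots, X_n \rightarrow \hat\theta_{(2)}$
...
$X_1, X_2, \ldots, X_{n-1} \rightarrow \hat\theta_{(n)}$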
9
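The leave-one-out scheme in the jackknife illustration can be sketched in a few lines. This is a minimal Python sketch rather than the lecture's own R or C code; the function name `leave_one_out_samples` and the toy data are assumptions made for illustration.

```python
def leave_one_out_samples(data):
    """Return the n jackknife samples, each omitting one observation."""
    return [data[:i] + data[i + 1:] for i in range(len(data))]


sample = [2.0, 4.0, 6.0, 8.0]   # toy data, assumed for illustration
for i, reduced in enumerate(leave_one_out_samples(sample), start=1):
    print(f"without X_{i}: {reduced}")
```

Each reduced sample has n - 1 observations, and no observation is ever repeated, which is why the jackknife is described as resampling without replacement.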
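Math Expression
$X_1, \ldots, X_n \overset{iid}{\sim} F(x)$, e.g. $F(x) = N(\mu, \sigma^2)$, $\theta = \theta(F) = (\mu, \sigma^2)$.
We can estimate $\theta$ by any of the following:
$\hat\theta_1(X_1, \ldots, X_n) = (\bar X, S^2)$, where $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X)^2$,
$\hat\theta_2 = \left( \text{median}, \left( \frac{q_3 - q_1}{c_1} \right)^2 \right)$,
$\hat\theta_3 = \left( \frac{q_1 + q_3}{2}, \left( \frac{\frac{1}{n} \sum_{i=1}^{n} |X_i - \text{median}|}{c_2} \right)^2 \right)$,
and so on. Here $q_1$ and $q_3$ are the sample quartiles, and $c_1$, $c_2$ are scaling constants.
Rules of judgement:
$\text{bias} = E(\hat\theta - \theta)$
$\text{Variance} = \text{Var}(\hat\theta)$
$\text{MSE} = \text{bias}^2 + \text{Variance}$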
10
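To make the three estimators and the rules of judgement concrete, here is a hedged Python sketch using only the standard library. Several choices are assumptions not stated on the slide: $c_1 = 1.349$ is taken from the later quartile slide, $c_2 = \sqrt{2/\pi}$ is the mean absolute deviation of a standard normal, the quartiles use the default method of `statistics.quantiles`, and the simulation settings (mu = 5, sigma = 2, 500 samples of size 50) are arbitrary.

```python
import math
import random
import statistics

C1 = 1.349                      # IQR of N(0, 1), from the quartile slide
C2 = math.sqrt(2.0 / math.pi)   # E|Z| for Z ~ N(0, 1); an assumed choice of c_2


def theta1(x):
    """(sample mean, sample variance S^2 with divisor n - 1)."""
    return statistics.mean(x), statistics.variance(x)


def theta2(x):
    """(median, (IQR / c_1)^2)."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    return statistics.median(x), ((q3 - q1) / C1) ** 2


def theta3(x):
    """(quartile midpoint, (mean absolute deviation from the median / c_2)^2)."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    med = statistics.median(x)
    mad = sum(abs(xi - med) for xi in x) / len(x)
    return (q1 + q3) / 2.0, (mad / C2) ** 2


def bias_variance_mse(estimates, truth):
    """Monte Carlo versions of the three rules of judgement."""
    bias = statistics.mean(estimates) - truth
    variance = statistics.pvariance(estimates)
    return bias, variance, bias ** 2 + variance


rng = random.Random(0)
mu, sigma = 5.0, 2.0            # assumed true parameters for the simulation
for name, estimator in [("theta1", theta1), ("theta2", theta2), ("theta3", theta3)]:
    # 500 simulated samples of size 50; keep only the estimate of mu here.
    mu_hats = [estimator([rng.gauss(mu, sigma) for _ in range(50)])[0]
               for _ in range(500)]
    print(name, bias_variance_mse(mu_hats, mu))
```

Such a simulation is only possible when the true distribution is known; the jackknife and bootstrap that follow estimate the same quantities from a single sample.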
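An Example of Jackknife (1)
$X_1, \ldots, X_n \overset{iid}{\sim} F(x)$ (e.g. $N(\mu, \sigma^2)$), $\hat\theta = \bar X$.
Leave out one observation at a time:
$\hat\theta_{(1)} = \frac{X_2 + \cdots + X_n}{n-1}$ (without $X_1$),
$\hat\theta_{(2)} = \frac{X_1 + X_3 + \cdots + X_n}{n-1}$ (without $X_2$),
...
$\hat\theta_{(n)} = \frac{X_1 + \cdots + X_{n-1}}{n-1}$ (without $X_n$).
$\hat\theta_{(\cdot)} = \frac{\hat\theta_{(1)} + \hat\theta_{(2)} + \cdots + \hat\theta_{(n)}}{n} = \bar X$ (HW),
$\text{bias} = (n-1)(\hat\theta_{(\cdot)} - \hat\theta) = (n-1)(\bar X - \bar X) = 0$,
$\text{se} = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} (\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2} = \sqrt{\frac{\hat\sigma^2}{n}}$ (HW), e.g. with $\hat\sigma^2 = s^2$.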
11
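The two homework identities for the sample mean (zero jackknife bias, and a jackknife standard error equal to $s/\sqrt{n}$) can be checked numerically. The following is a minimal Python sketch under assumed toy data; `jackknife_mean` is a hypothetical helper name, not a function from the lecture.

```python
import math
import statistics


def jackknife_mean(x):
    """Jackknife bias and standard error of the sample mean."""
    n = len(x)
    theta_hat = statistics.mean(x)
    # theta_(i): the mean of the sample with X_i left out.
    loo = [(sum(x) - xi) / (n - 1) for xi in x]
    theta_dot = statistics.mean(loo)
    bias = (n - 1) * (theta_dot - theta_hat)
    se = math.sqrt((n - 1) / n * sum((t - theta_dot) ** 2 for t in loo))
    return bias, se


x = [3.1, 4.7, 5.0, 6.2, 7.9, 8.4]   # toy data, assumed for illustration
bias, se = jackknife_mean(x)
print("bias:", bias)
print("se:", se, "s/sqrt(n):", statistics.stdev(x) / math.sqrt(len(x)))
```

Up to floating-point rounding, the printed bias is zero and the two standard errors agree, which is what the derivation predicts.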
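An Example of Jackknife (2)
$X_1, \ldots, X_n \overset{iid}{\sim} F(x)$ (e.g. $N(\mu, \sigma^2)$), $\theta = \text{median}$, $n = 100$, $\hat\theta = \text{sample median} = \frac{X_{(50)} + X_{(51)}}{2}$.
Leaving out one observation gives a sample of size 99, $X_1^*, \ldots, X_{99}^*$, whose median is its 50th order statistic:
without $X_1$: $\hat\theta_{(1)} = X^*_{(50)}$,
without $X_2$: $\hat\theta_{(2)} = X^*_{(50)}$,
...
without $X_n$: $\hat\theta_{(n)} = X^*_{(50)}$.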
12
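The median example can be tried out directly. The sketch below is a Python illustration with simulated standard normal data (the seed and the distribution are assumptions); it computes the leave-one-out medians for n = 100 and shows which order statistics of the full sample they coincide with.

```python
import random
import statistics

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(100)]   # n = 100, as on the slide

theta_hat = statistics.median(x)                 # (X_(50) + X_(51)) / 2
# With 99 values left, the median is the 50th order statistic of the rest.
loo = [statistics.median(x[:i] + x[i + 1:]) for i in range(len(x))]

ordered = sorted(x)
print("theta_hat:", theta_hat)
print("distinct leave-one-out medians:", sorted(set(loo)))
print("X_(50), X_(51):", ordered[49], ordered[50])
```

Deleting one of the 50 smallest observations makes $X_{(51)}$ the new median, and deleting one of the 50 largest makes it $X_{(50)}$, so the 100 leave-one-out estimates take only two distinct values.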
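Summary of the Jackknife Method
$\hat\theta_{(\cdot)} = \frac{\hat\theta_{(1)} + \hat\theta_{(2)} + \cdots + \hat\theta_{(n)}}{n}$,
$\text{bias}_J = (n-1)(\hat\theta_{(\cdot)} - \hat\theta)$,
$\text{se}_J = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} (\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2}$,
$\text{MSE}_J = \text{bias}_J^2 + \text{se}_J^2$.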
13
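The jackknife summary translates almost line for line into code. The following Python function is a sketch under the assumption that the statistic takes a list and returns a number; the name `jackknife` and the toy data are illustrative and are not the interface of the R package used in the lecture.

```python
import math
import statistics


def jackknife(x, statistic):
    """Jackknife bias, standard error and MSE estimates for statistic(x)."""
    n = len(x)
    theta_hat = statistic(x)
    loo = [statistic(x[:i] + x[i + 1:]) for i in range(n)]   # theta_(i)
    theta_dot = sum(loo) / n                                  # theta_(.)
    bias = (n - 1) * (theta_dot - theta_hat)
    se = math.sqrt((n - 1) / n * sum((t - theta_dot) ** 2 for t in loo))
    return {"bias": bias, "se": se, "mse": bias ** 2 + se ** 2}


x = [1.2, 2.9, 3.3, 4.8, 5.1, 7.4, 9.0]     # toy data, assumed for illustration
print(jackknife(x, statistics.mean))
print(jackknife(x, statistics.pvariance))   # divisor-n variance, a biased estimator
```

For the divisor-n variance the jackknife bias estimate works out to $-s^2/n$, so subtracting it from the estimate recovers the usual unbiased $s^2$.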
How do quartiles lead to an estimate?
Note that $q_3 = \Phi^{-1}(0.75) = 0.6745$ and $q_1 = \Phi^{-1}(0.25) = -0.6745$ are the upper and lower quartiles of a standard normal distribution.
Hence, the interquartile range is $\text{IQR} = q_3 - q_1 = 1.349$.
For a normal distribution with standard deviation $\sigma$, the interquartile range is therefore $1.349\,\sigma$, so IQR/1.349 can be used as an estimator of the standard deviation of a normal distribution.
14
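The quartile constants can be reproduced with Python's standard library, and the resulting scale estimator compared with the ordinary sample standard deviation. This is a sketch under assumptions: the helper name `sd_from_iqr`, the simulated N(10, 3^2) data, and the default quartile method of `statistics.quantiles` are illustrative choices.

```python
import random
import statistics

z = statistics.NormalDist()                  # standard normal
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
print(round(q3, 4), round(q1, 4), round(q3 - q1, 3))


def sd_from_iqr(x):
    """Estimate a normal standard deviation as (sample IQR) / 1.349."""
    lower, _, upper = statistics.quantiles(x, n=4)
    return (upper - lower) / 1.349


rng = random.Random(2)
x = [rng.gauss(10.0, 3.0) for _ in range(10_000)]
print(sd_from_iqr(x), statistics.stdev(x))   # both estimate sigma = 3
```

The IQR-based estimate relies on the normality assumption; for other distributions the constant 1.349 would have to change.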
Jackknife by R
1. Open “R”
15
2. Install add-on
packages
16
3. Select a mirror site, such as
Taiwan (Taipeh)
17
4. Select the package
“bootstrap”
18
19
5. Type: library(bootstrap)
20
If you want to see the manual,
you can type “?jackknife”.
21
22
R-package
23
Select the menu to open
the editor in R
24
You can edit your
program in this box
and then store this
program.
25
You can save your
program……
26
main.jackknife.function
27
(1) Use the mouse to select
the R commands you
want to run.
(2) Press “F5” to run them.
28
output
29
Jackknife by C
define functions
30
31
32
An example for jackknife
33
34
35
36
PART 2: Bootstrap
• Naming
• Illustration
• Math Expression
• Examples
• R codes: three approaches
  • Package(bootstrap)
  • Package(boot)
  • Write your own R codes
• C codes
37
The Bootstrap

• The bootstrap technique was proposed by Bradley Efron (1979, 1981, 1982).

• Bootstrapping is an application of intensive computing to traditional inferential methods.
38
Why the funny name of bootstrap?

• Bootstrap: http://www.concurringopinions.com/archives/Bootstrap_1.jpg

• In the book ‘Singular Travels, Campaigns and Adventures of Baron Munchausen’ by R. E. Raspe (1786), the main character, finding himself in a deep hole, extracts himself using only the straps of his boots.
  http://tigger.uic.edu/~slsclove/stathumr.htm
39
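Illustration of Bootstrap
[Figure: a sample $X_1, X_2, \ldots, X_n$ is drawn from the population (parameter $\theta$), and $\theta$ is estimated by $\hat\theta$. Resampling is repeated B times; each resample $X_1^*, X_2^*, \ldots, X_n^*$ yields one statistic, giving $\hat\theta_1^*, \hat\theta_2^*, \ldots, \hat\theta_B^*$, and inference is based on these statistics.]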
40
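The resampling loop in the bootstrap illustration can be sketched as follows. This is a minimal Python sketch, not the lecture's R code; the function name `bootstrap_replicates`, the seed, the toy data and B = 5 are assumptions chosen to keep the output short.

```python
import random
import statistics


def bootstrap_replicates(x, statistic, B, rng):
    """Return B bootstrap replicates statistic(X*_1, ..., X*_n)."""
    n = len(x)
    replicates = []
    for _ in range(B):
        resample = rng.choices(x, k=n)     # n draws with replacement
        replicates.append(statistic(resample))
    return replicates


rng = random.Random(3)
x = [2.3, 3.1, 4.8, 5.0, 6.7, 7.2, 9.9]   # toy data, assumed for illustration
theta_star = bootstrap_replicates(x, statistics.mean, B=5, rng=rng)
print(theta_star)
```

Unlike the jackknife, each resample has the full size n and may contain the same observation several times.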
Math Expression
$X_1, X_2, \ldots, X_n \sim F(x)$ (e.g. $N(\mu, \sigma^2)$),
$\theta = \theta(F)$ (e.g. $\mu = \int x \, dF(x)$), $\hat\theta = \theta(F_n)$ (e.g. $\int x \, dF_n(x) = \bar X$),
where $F_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1\{X_i \le x\}$.
Resampling with replacement:
$X_1^*, X_2^*, \ldots, X_n^* \sim F_n(x)$.
Repeat B times, and each time compute
$\hat\theta_i^* = \theta(F_n^*)$ (e.g. $\int x \, dF_n^*(x) = \frac{X_1^* + X_2^* + \cdots + X_n^*}{n}$), $i = 1, \ldots, B$, where $F_n^*(x) = \frac{1}{n} \sum_{j=1}^{n} 1\{X_j^* \le x\}$.
$\hat\theta^*_{(\cdot)} = \frac{\hat\theta_1^* + \hat\theta_2^* + \cdots + \hat\theta_B^*}{B}$,
$\text{bias}_B = \hat\theta^*_{(\cdot)} - \hat\theta$,
$\text{var}_B = \frac{1}{B-1} \sum_{b=1}^{B} (\hat\theta_b^* - \hat\theta^*_{(\cdot)})^2$.
41
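The empirical distribution function $F_n$ and the plug-in estimate $\theta(F_n)$ are easy to write down explicitly. The sketch below is an illustrative Python version; the names `ecdf` and `plug_in_mean` and the four data values are assumptions.

```python
def ecdf(x):
    """Return F_n, the empirical distribution function of the sample x."""
    n = len(x)
    return lambda t: sum(1 for xi in x if xi <= t) / n


def plug_in_mean(x):
    """theta(F_n) for theta(F) = integral of x dF(x): mass 1/n on each X_i."""
    return sum(xi * (1.0 / len(x)) for xi in x)


x = [1.0, 2.0, 2.0, 5.0]        # toy data, assumed for illustration
F_n = ecdf(x)
print(F_n(0.5), F_n(2.0), F_n(5.0))
print(plug_in_mean(x))
```

Because $F_n$ puts mass 1/n on each observation, integrating against it reduces to averaging, which is why the plug-in estimate of the mean is $\bar X$.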
For example, $X_1, X_2, \ldots, X_n \sim F(x)$ (e.g. $F(x) = N(\mu, 1)$).
Population, $\theta$:
To learn about the population, you can compute quantities such as the mean or variance as expectations, but this is often not easy to do.
For example, if X is sampled from a population with density f and distribution function F, then
$\mu = E(X) = \int x f(x) \, dx = \int x \, dF(x)$;
$\theta = \text{median} = F^{-1}(\frac{1}{2})$;
in general, $\theta = \theta(F)$.
42
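When F is fully known, the functionals $\theta(F)$ on the previous slide can be evaluated numerically. The Python sketch below does this for an assumed population $N(3, 1)$; the value of mu, the integration range and the helper `integrate` are illustrative choices, not part of the lecture.

```python
import statistics

F = statistics.NormalDist(mu=3.0, sigma=1.0)    # population N(mu, 1), mu = 3 assumed


def integrate(g, lower, upper, steps=16_000):
    """Midpoint-rule approximation of the integral of g over [lower, upper]."""
    h = (upper - lower) / steps
    return sum(g(lower + (k + 0.5) * h) for k in range(steps)) * h


# mu = E(X) = integral of x f(x) dx, truncated to mu +/- 8 standard deviations.
mean_value = integrate(lambda t: t * F.pdf(t), -5.0, 11.0)
# median = F^{-1}(1/2)
median_value = F.inv_cdf(0.5)

print(mean_value, median_value)
```

In practice F is unknown, so these integrals cannot be computed; the bootstrap replaces F with $F_n$ and replaces the integral with resampling.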
[Figure, step 1: sampling $X_1, X_2, \ldots, X_n$ from the population.]
STEP 1: Given n observations, how can you estimate the parameter of the population?
43
[Figure, step 2: resampling from $X_1, X_2, \ldots, X_n$, B times, giving resamples $X_1^*, X_2^*, \ldots, X_n^*$.]
STEP 2: Resample the data B times with replacement. This gives many resampled data sets, which are used in place of fresh samples drawn from the population.
44
[Figure, step 3: each resample $X_1^*, X_2^*, \ldots, X_n^*$ yields one statistic, giving $\hat\theta_1^*, \hat\theta_2^*, \ldots, \hat\theta_B^*$.]
STEP 3: Regard $X_1, X_2, \ldots, X_n$ as the new population and resample it B times with replacement, $X_1^*, X_2^*, \ldots, X_n^* \sim F_n(x)$.
Then compute the statistic on each resample.
45
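STEP 4: Make inference from the resampled statistics $\bar X_1^*, \bar X_2^*, \ldots, \bar X_B^*$.
$\bar X^*_{(\cdot)} = \frac{\bar X_1^* + \bar X_2^* + \cdots + \bar X_B^*}{B} \xrightarrow[B \to \infty]{\text{LLN}} E_*(\bar X^*)$, and what is $E_*(\bar X^*)$?
$E_*(\bar X^*) = E_*\left( \frac{X_1^* + X_2^* + \cdots + X_n^*}{n} \right) = \frac{\sum_{i=1}^{n} E_*(X_i^*)}{n} = E_*(X_1^*) = \frac{X_1 + X_2 + \cdots + X_n}{n} = \bar X$.
Hence $\text{bias}_B = \bar X^*_{(\cdot)} - \bar X \xrightarrow[B \to \infty]{\text{LLN}} \bar X - \bar X = 0$.
$\text{var}_B = \frac{1}{B-1} \sum_{b=1}^{B} (\bar X_b^* - \bar X^*_{(\cdot)})^2 \xrightarrow[B \to \infty]{\text{LLN}} E_*\left[ \bar X^* - E_*(\bar X^*) \right]^2 = V_*(\bar X^*)$, and what is $V_*(\bar X^*)$?
Because the $X_i^*$ are drawn independently from $F_n$,
$V_*(\bar X^*) = V_*\left( \frac{X_1^* + X_2^* + \cdots + X_n^*}{n} \right) = \frac{V_*(X_1^*) + V_*(X_2^*) + \cdots + V_*(X_n^*)}{n^2} = \frac{V_*(X_1^*)}{n} = \frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2}{n} = \frac{\frac{n-1}{n} S^2}{n}$.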
46
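For a very small sample the bootstrap expectation and variance of the resampled mean can be computed exactly by enumerating all $n^n$ equally likely resamples, which gives a direct check of the two limits derived in Step 4. The Python sketch below does this for four assumed data values.

```python
import itertools
import statistics

x = [1.0, 4.0, 7.0, 12.0]       # toy data, assumed for illustration
n = len(x)

# Every one of the n^n equally likely resamples under F_n.
means = [statistics.mean(r) for r in itertools.product(x, repeat=n)]

e_star = statistics.mean(means)          # E_*(X-bar*)
v_star = statistics.pvariance(means)     # V_*(X-bar*)

print(e_star, statistics.mean(x))
print(v_star, (n - 1) / n * statistics.variance(x) / n)
```

Each printed pair agrees: the exact bootstrap mean equals $\bar X$, and the exact bootstrap variance equals $\frac{n-1}{n} S^2 / n$. With realistic n the enumeration is infeasible, which is why B random resamples are used instead.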
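Summary of the Bootstrap Method
More generally, we can get $\hat\theta_1^*, \hat\theta_2^*, \ldots, \hat\theta_B^*$ by bootstrap.
$\hat\theta^*_{(\cdot)} = \frac{\hat\theta_1^* + \hat\theta_2^* + \cdots + \hat\theta_B^*}{B} \xrightarrow[B \to \infty]{\text{LLN}} E_*(\hat\theta^*)$,
$\text{bias}_B = \hat\theta^*_{(\cdot)} - \hat\theta$,
$\text{se}_B^2 = \text{var}_B = \frac{1}{B-1} \sum_{b=1}^{B} (\hat\theta_b^* - \hat\theta^*_{(\cdot)})^2$,
$\text{MSE}_B = \text{bias}_B^2 + \text{se}_B^2 = \text{bias}_B^2 + \text{var}_B$.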
47
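The bootstrap summary can be turned into a small general-purpose function. The following Python sketch is an illustration only: the function name `bootstrap`, the defaults B = 2000 and seed = 0, and the toy data are assumptions, and it is not the interface of the R packages discussed next.

```python
import math
import random
import statistics


def bootstrap(x, statistic, B=2000, seed=0):
    """Bootstrap bias, variance, standard error and MSE estimates."""
    rng = random.Random(seed)
    n = len(x)
    theta_hat = statistic(x)
    theta_star = [statistic(rng.choices(x, k=n)) for _ in range(B)]
    theta_dot = sum(theta_star) / B                       # theta*_(.)
    bias = theta_dot - theta_hat
    var = sum((t - theta_dot) ** 2 for t in theta_star) / (B - 1)
    return {"bias": bias, "var": var, "se": math.sqrt(var),
            "mse": bias ** 2 + var}


x = [1.2, 2.9, 3.3, 4.8, 5.1, 7.4, 9.0]     # toy data, assumed for illustration
print(bootstrap(x, statistics.mean))
print(bootstrap(x, statistics.median))
```

For the mean, the reported variance should be close to $\frac{n-1}{n} S^2 / n$ from Step 4, with the remaining gap due to using a finite B.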
Bootstrap by R
• Approach 1: Use package “bootstrap”
• Approach 2: Use package “boot”
• Approach 3: Write your own R codes
48
Approach 1
http://finzi.psych.upenn.edu/R/library/bootstrap/DESCRIPTION
49
1. Install the add-on
package
50
2. Select a mirror site, such as
“Taiwan (Taipeh)”
51
3. Select the package
“bootstrap”
52
53
4. Type: library(bootstrap)
54
If you want to see the manual,
you can type “?bootstrap”.
55
bias
56
Use this package to do
bootstrap
57
58


59
Approach 2
http://finzi.psych.upenn.edu/R/library/boot/DESCRIPTION
60
library(boot)
61
62
Arguments

sim: A character string indicating the type of simulation required. Possible values are "ordinary" (the default), "parametric", "balanced", "permutation", or "antithetic". Importance resampling is specified by including importance weights; the type of importance resampling must still be specified but may only be "ordinary" or "balanced" in this case.
63
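As one illustration of how the simulation types differ, the idea behind a "balanced" bootstrap is that every observation is used exactly B times across the B resamples, whereas in the "ordinary" bootstrap this holds only on average. The Python sketch below shows that general idea; it is an assumption-laden illustration and is not the implementation used inside the R package boot. The name `balanced_resamples` and the toy data are hypothetical.

```python
import random
from collections import Counter


def balanced_resamples(x, B, rng):
    """B resamples of size n; each observation is used exactly B times overall."""
    n = len(x)
    pool = list(range(n)) * B        # B copies of each index
    rng.shuffle(pool)                # random permutation of the n * B indices
    return [[x[i] for i in pool[b * n:(b + 1) * n]] for b in range(B)]


x = ["a", "b", "c", "d"]             # toy data, assumed for illustration
rng = random.Random(4)
resamples = balanced_resamples(x, B=3, rng=rng)
print(resamples)
print(Counter(v for r in resamples for v in r))   # every value counted B = 3 times
```

Individual resamples can still contain repeats; only the totals over all B resamples are forced to be equal.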
R code
Approach 3
64
An example
65
Run
functions


66
Run main function
67
Bootstrap by C
68
69
An example
Hands-on demonstration
70
71
72
73
Exercises

• Write your own programs similar to the examples presented in this talk.

• Write programs for the examples mentioned on the reference web pages.

• Write programs for other examples that you know.

• Prove the theoretical statements in this talk.
74