Basic principles of probability theory


Resampling techniques

• Why resampling?
• Jackknife
• Cross-validation
• Bootstrap
• Exercises

Why resampling?

The purpose of statistics is to estimate some parameter(s) and their reliability. Since estimators are functions of the sample points, they are random variables. If we could find the distribution of this random variable (the sample statistic), then we could estimate the reliability of the estimator.

Unfortunately, apart from the simplest cases, the sampling distribution is not easy to derive.

There are several techniques to approximate these distributions, including the Edgeworth series, the Laplace approximation, saddle-point approximations and others.

These approximations give an analytical form for the approximate distributions. With the advent of computers, more computationally intensive methods are emerging, and in many cases they work satisfactorily.

If we had the sampling distribution of a sample statistic, we could estimate the variance of the estimator, construct interval estimates and even test hypotheses. Examples of the simplest cases where the sampling distribution is known include:

1) The sample mean, when the sample is from a normal distribution: it is normally distributed with mean equal to the population mean and variance equal to the population variance divided by the sample size, if the population variance is known. If the population variance is not known, the variance of the sample mean is estimated by the sample variance divided by n.

2) The sample variance has the distribution of a multiple of the χ² distribution. Again, this is valid if the population distribution is normal.

3) The sample mean divided by the square root of the sample variance has a multiple of the t-distribution – again in the normal case.

4) The ratio of two sample variances has a multiple of the F-distribution.

Jackknife

The jackknife is used for bias removal. As we know, the mean-square error is the sum of the squared bias and the variance of the estimator. If the bias is much larger than the variance, then under some circumstances the jackknife can be used.

Description of the jackknife: Let us assume that we have a sample of size n. We estimate some sample statistic using all the data – t_n. Then, by removing one point at a time, we estimate t_{n-1,i}, where the subscripts indicate the size of the sample and the index of the removed sample point. The new estimator is then derived as:

t_n' = n t_n - (n-1)\,\bar{t}_{n-1}, \qquad \text{where} \quad \bar{t}_{n-1} = \frac{1}{n}\sum_{i=1}^{n} t_{n-1,i}

If the order of the bias of t_n is O(n^{-1}), then after the jackknife the order of the bias becomes O(n^{-2}). The variance is estimated using:

\hat{V}_J = \frac{n-1}{n}\sum_{i=1}^{n}\left(t_{n-1,i} - \bar{t}_{n-1}\right)^2

This procedure can be applied iteratively, i.e. the jackknife can be applied again to the new estimator. The first application of the jackknife can reduce the bias without changing the variance of the estimator, but second and higher-order applications can in general increase the variance of the estimator.

Cross-validation

Cross-validation is a resampling technique to overcome overfitting.

Let us consider the least-squares technique. Assume that we have a sample of size n, y = (y_1, y_2, ..., y_n), and that we want to estimate the parameters θ = (θ_1, θ_2, ..., θ_m). Further assume that the mean value of the observations is a function of these parameters (we may not know the form of this function). We then postulate that the function has a form g, and we can find the values of the parameters using the least-squares technique:

h = \sum_{i=1}^{n}\left(y_i - g(x_{i1}, x_{i2}, \dots, x_{im}, \theta_1, \theta_2, \dots, \theta_m)\right)^2

where x is a fixed matrix or a set of random variables. After applying this technique we have values of the parameters and therefore the form of the function. The form of the function g defines the model we want to use, and we may have several candidate forms. Obviously, with more parameters the fit will be "better". The question is what would happen if we observed new values. Using the estimated values of the parameters we can compute the squared differences for new observations. Suppose we have new observations (y_{n+1}, ..., y_{n+l}). Can our function predict these new observations? Which function predicts better? To answer these questions we can calculate

PE = \sum_{i=1}^{l}\left(y_{n+i} - g(x_{(n+i)1}, \dots, x_{(n+i)m}, \theta_1, \dots, \theta_m)\right)^2

where PE is the prediction error. The function g that gives the smallest value of PE has the higher predictive power. A function that gives a smaller h but a larger PE is called overfitted.
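A minimal sketch of this idea in Python (NumPy assumed): fit two hypothetical polynomial models by least squares on a training sample, then compare their residual sum of squares h with the prediction error PE on fresh observations. The data-generating function and the chosen degrees are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical data: a quadratic trend plus noise
x = np.linspace(0, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

# fresh observations from the same process, used only for PE
x_new = rng.uniform(0, 1, 10)
y_new = 1.0 + 2.0 * x_new - 3.0 * x_new**2 + rng.normal(scale=0.2, size=x_new.size)

for degree in (2, 8):                        # modest model vs. over-parameterised model
    theta = np.polyfit(x, y, degree)         # least-squares fit
    h = np.sum((y - np.polyval(theta, x)) ** 2)            # training residual sum of squares
    pe = np.sum((y_new - np.polyval(theta, x_new)) ** 2)   # prediction error on new data
    print(f"degree {degree}: h = {h:.3f}, PE = {pe:.3f}")
```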

Cross-validation: Cont.

When we choose the function using only the current sample, how can we avoid overfitting? Cross-validation is an approach to deal with this problem.

Description of cross-validation: We have a sample of size n.

1) Divide the sample into K roughly equal-sized parts.

2) For the k-th part, estimate the parameters using the other K-1 parts (excluding the k-th part), and calculate the prediction error for the k-th part.

3) Repeat this for all k = 1, 2, ..., K and combine all the prediction errors to get the cross-validation prediction error.

If K = n then we have the leave-one-out cross-validation technique. Let us denote the estimate at the k-th step by θ_k (we use the vector form), let the k-th subset of the sample be A_k, and let the number of points in this subset be N_k. Then the prediction error calculated per observation is:

PE = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N_k}\sum_{i \in A_k}\left(y_i - g(x_i, \theta_k)\right)^2

We then choose the function that gives the smallest prediction error, and we can expect that this function will also give the smallest prediction error on future observations.
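Below is a minimal sketch of K-fold cross-validation for choosing a polynomial degree, in Python with NumPy only; the fold assignment, the candidate models and the data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def kfold_prediction_error(x, y, degree, K=5, seed=0):
    """Average per-observation squared prediction error over K folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)                        # K roughly equal-sized parts
    pe = 0.0
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        theta = np.polyfit(x[train], y[train], degree)    # fit on the other K-1 parts
        resid = y[test] - np.polyval(theta, x[test])      # predict the k-th part
        pe += np.mean(resid ** 2)
    return pe / K

# hypothetical data, as in the earlier sketch
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

for degree in (1, 2, 5, 8):
    print(degree, kfold_prediction_error(x, y, degree))
```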

This technique is widely used in modern statistical analysis. It is not restricted to the least-squares technique: instead of least squares we can use any other criterion that depends on the distribution of the observations, and in principle it can be applied to various maximum-likelihood and other estimators.

Cross-validation is useful for model selection, i.e. if we have several models, we can use cross-validation to select one of them.

Bootstrap

The bootstrap is a computationally very expensive technique. In its simplest form it works as follows.

We have a sample of size n and want to estimate some parameter θ; the estimator of this parameter gives t. To each sample point we assign a probability (usually 1/n, i.e. all sample points have equal probability). Then from this sample we draw, with replacement, another random sample of size n and estimate θ. Let us denote the estimate of the parameter at the j-th resampling stage by t*_j. The bootstrap estimator of θ and its variance are calculated as:

t_B^* = \frac{1}{B}\sum_{j=1}^{B} t_j^* \qquad \text{and} \qquad V_B(t_B^*) = \frac{1}{B-1}\sum_{j=1}^{B}\left(t_j^* - t_B^*\right)^2

This is a very simple form of bootstrap resampling. For parameter estimation the number of bootstrap resamples B is usually chosen to be around 200.
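Here is a minimal sketch of this simple bootstrap in Python (NumPy assumed); the statistic used (the median), the data and B = 200 are illustrative choices only.

```python
import numpy as np

def bootstrap(data, statistic, B=200, seed=0):
    """Bootstrap estimate of a statistic and of its variance."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # draw B resamples of size n with replacement and evaluate the statistic
    t_star = np.array([statistic(rng.choice(data, size=n, replace=True))
                       for _ in range(B)])
    t_boot = t_star.mean()                 # bootstrap estimator t*_B
    var_boot = t_star.var(ddof=1)          # 1/(B-1) * sum (t*_j - t*_B)^2
    return t_boot, var_boot, t_star

rng = np.random.default_rng(2)
sample = rng.exponential(scale=1.0, size=50)
est, var, replicates = bootstrap(sample, np.median)
print(est, var)
```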

Let us analyse how the bootstrap works in one simple case. Consider a random variable X with sample space x = (x_1, ..., x_M), where each point has probability P(X = x_j) = f_j, i.e. f = (f_1, ..., f_M) represents the distribution of the population. A sample of size n will have relative frequencies for each sample point

\hat{f} = (\hat{f}_1, \dots, \hat{f}_M)

Bootstrap: Cont.

Then the distribution of \hat{f} conditional on f is multinomial:

\hat{f} \mid f \sim Mn(n, f)

The multinomial distribution is the extension of the binomial distribution and is expressed as:

P\left(X = (x_1, x_2, \dots, x_M)\right) = \frac{n!}{x_1! \cdots x_M!}\, f_1^{x_1} \cdots f_M^{x_M}, \qquad \sum_{j=1}^{M} x_j = n, \quad \sum_{j=1}^{M} f_j = 1

The limiting distribution of \hat{f} - f is multivariate normal. If we resample from the given sample, then we should consider the conditional distribution of the following (which is also multinomial):

\hat{f}^* \mid \hat{f} \sim Mn(n, \hat{f})

The limiting distribution of \hat{f}^* - \hat{f} is the same as the conditional distribution of the original sample. Since these two distributions converge to the same distribution, well-behaved functions of them will also have the same limiting distributions. Thus, if we use the bootstrap to derive the distribution of a sample statistic, we can expect that in the limit it converges to the distribution of the sample statistic, i.e. the following two functions have the same limiting distributions:

t(\hat{f}^*, \hat{f}) \qquad \text{and} \qquad t(\hat{f}, f)
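As a rough numerical check of this claim, the following sketch (Python, NumPy assumed; the population, sample size and number of replicates are arbitrary assumptions) compares the spread of the bootstrap distribution of the sample mean with the sampling distribution obtained by repeatedly drawing fresh samples from the population.

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # stand-in "population"
n, B = 50, 2000

sample = rng.choice(population, size=n, replace=False)       # the observed sample

# bootstrap distribution of the mean: resample the observed sample
boot_means = np.array([rng.choice(sample, size=n, replace=True).mean()
                       for _ in range(B)])

# "true" sampling distribution of the mean: fresh samples from the population
true_means = np.array([rng.choice(population, size=n, replace=False).mean()
                       for _ in range(B)])

# the spreads should be comparable (the centres differ by the sample's own offset)
print(boot_means.std(ddof=1), true_means.std(ddof=1))
```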

Bootstrap: Cont.

If we could enumerate all possible resamples from our sample, we could build the "ideal" bootstrap distribution. In practice, even with modern computers, this is impossible to achieve; instead Monte Carlo simulation is used.

Usually it works as follows:

1) Draw a random sample of size n with replacement from the given sample.

2) Estimate the parameter and get the estimate t_j.

3) Repeat this B times and build the frequency and cumulative distributions of t.

Bootstrap: Cont.

How do we build the cumulative distribution (it approximates our distribution function)? Consider a sample of size n, x = (x_1, x_2, ..., x_n). Then the cumulative distribution is:

\hat{F}(x) = \frac{1}{n}\sum_{j=1}^{n} I(x_j \le x), \qquad \text{where} \quad I(x_j \le x) = \begin{cases} 1 & \text{if } x_j \le x \\ 0 & \text{otherwise} \end{cases}

Another way of building the cumulative distribution is to sort the data first, so that:

x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}

Then the cumulative distribution is built as:

\hat{F}(t) = \frac{1}{n}\max\{\, j : x_{(j)} \le t \,\}

We can also build a histogram that approximates the density of the distribution. First we find an interval that contains our data and divide it into equal sub-intervals of length Δt. Assume that the centre of the i-th interval is t_i. Then the histogram can be calculated using the formula:

h(t_i) = \frac{\max\{\, j : x_{(j)} \le t_i + \Delta t/2 \,\} - \min\{\, k : x_{(k)} \ge t_i - \Delta t/2 \,\}}{n}

Once we have the distribution of the statistic we can use it for various purposes. Bootstrap estimation of a parameter and its variance was one possible application; we can also use this distribution for hypothesis testing, interval estimation, etc. For pure parameter estimation we need around 200 resamples, while for interval estimation we might need around 2000 resamples, because interval estimation and hypothesis testing need a more accurate distribution.
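A minimal sketch of these constructions in Python (NumPy assumed), as they might be applied to the bootstrap replicates t*_j; the helper names, the binning choices and the stand-in data are illustrative assumptions.

```python
import numpy as np

def ecdf(values, t):
    """Empirical cumulative distribution: F_hat(t) = (1/n) * #{values <= t}."""
    values = np.sort(values)
    return np.searchsorted(values, t, side="right") / len(values)

def histogram_density(values, n_bins=20):
    """Histogram approximation to the density: counts per bin divided by n*dt."""
    counts, edges = np.histogram(values, bins=n_bins)
    dt = edges[1] - edges[0]
    centres = edges[:-1] + dt / 2
    return centres, counts / (len(values) * dt)

# e.g. applied to bootstrap replicates of some statistic t*
rng = np.random.default_rng(4)
replicates = rng.normal(loc=1.0, scale=0.1, size=2000)   # stand-in for the t*_j
print(ecdf(replicates, 1.0))
centres, density = histogram_density(replicates)
```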

Bootstrap: Cont.

Since we did not use any assumption about the population distribution while resampling, this form of the bootstrap is called the non-parametric bootstrap. If we have some idea about the population distribution, then we can use it in resampling: when we draw randomly, we can draw from the assumed population distribution instead of from the sample itself. For example, if we know that the population distribution is normal, we can estimate its parameters using our sample (the sample mean and variance), approximate the population distribution with this fitted distribution, and use it to draw new samples. As can be expected, if the assumption about the population distribution is correct, the parametric bootstrap will perform better; if it is not correct, the non-parametric bootstrap will outperform its parametric counterpart.
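A minimal sketch of the parametric bootstrap under a normality assumption (Python, NumPy assumed; the statistic, the data and B are illustrative choices):

```python
import numpy as np

def parametric_bootstrap_normal(data, statistic, B=200, seed=0):
    """Parametric bootstrap: fit a normal to the data, then resample from the fit."""
    rng = np.random.default_rng(seed)
    mu, sigma = data.mean(), data.std(ddof=1)        # fitted normal parameters
    t_star = np.array([statistic(rng.normal(mu, sigma, size=len(data)))
                       for _ in range(B)])
    return t_star.mean(), t_star.var(ddof=1)

rng = np.random.default_rng(5)
sample = rng.normal(loc=10.0, scale=3.0, size=40)
print(parametric_bootstrap_normal(sample, np.mean))
```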

Other applications of the bootstrap and cross-validation will be discussed in future lectures.

Exercise 2a

These data have been taken from Box, Hunter & Hunter: Statistics for Experimenters. You should use SPSS or another statistics package.

The following are results from a larger study on the pharmacological effects of nalbuphine. The measured response, obtained from 11 subjects, was the change in pupil diameter (in millimetres) after 28 doses of nalbuphine (B) or morphine (A):

A: 2.4  0.08  0.8  2.0  1.9  1.0
B: 0.4  0.2  -0.3  0.8  0.0

Assume that subjects were randomly allocated to the drugs. Find the sample means and variances for A and B. Test whether the difference between the two drugs is significant. What are the 95% and 90% confidence intervals for the difference? What distribution would you use for testing the difference between the means if we assume that the variances are equal? What conclusion would you draw from these results? Do the treatments have significantly different effects?

Hint: the statistic we want to use for the two-sample test (if the variances are equal) is:

t = \frac{\bar{x} - \bar{y}}{s\sqrt{1/n_x + 1/n_y}}, \qquad \text{where} \quad s^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}

What is the distribution of this statistic? What are the degrees of freedom?

When the variances are different, you should use Welch's two-sample test.

Write a small report summarising your conclusions about these data.
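If you prefer a scripting package to SPSS, a minimal Python sketch using SciPy is given below; it only illustrates the calls and does not give the answers.

```python
import numpy as np
from scipy import stats

# pupil-diameter changes from the exercise
a = np.array([2.4, 0.08, 0.8, 2.0, 1.9, 1.0])    # morphine (A)
b = np.array([0.4, 0.2, -0.3, 0.8, 0.0])         # nalbuphine (B)

# sample means and variances
print(a.mean(), a.var(ddof=1), b.mean(), b.var(ddof=1))

# pooled two-sample t-test (equal variances assumed)
t_pooled, p_pooled = stats.ttest_ind(a, b, equal_var=True)

# Welch's test (unequal variances)
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(t_pooled, p_pooled, t_welch, p_welch)
```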

Exercise 2b

These data have been taken from the book Box, Hunter & Hunter: Statistics for Experimenters. Use SPSS or another statistics package.

Given the following data on egg production from 12 hens randomly allocated to two different diets, estimate the mean difference produced by the diets and obtain 95% and 90% confidence intervals. Are these differences significant?

Diet A: 166  174  150  166  165  178
Diet B: 158  159  142  163  161  157

What distribution would you use if the variances were equal?

Write a small report analysing these data.