STAT 572: Bootstrap Project
Group Members:
Cindy Bothwell, Erik Barry Erhardt, Nina Greenberg, Casey Richardson, Zachary Taylor
Histograms of Complex Population Distribution
Histograms of Population Sampling Distribution of the Median and Estimated Bootstrap Sampling Distributions
What is a Bootstrap
A method of resampling: creating many samples from a single sample. Generally, resampling is done with replacement. Used to develop a sampling distribution of statistics such as the mean, median, proportion, and others.
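As a minimal sketch of the idea, the Python snippet below resamples a single sample with replacement and collects the median of each resample. The data values, the function name `bootstrap_medians`, and the choice of 1000 resamples are illustrative assumptions, not taken from the project.

```python
import random
import statistics

def bootstrap_medians(sample, n_boot=1000, seed=0):
    """Return n_boot medians, each computed from a with-replacement resample."""
    rng = random.Random(seed)
    n = len(sample)
    medians = []
    for _ in range(n_boot):
        resample = rng.choices(sample, k=n)  # draw n values with replacement
        medians.append(statistics.median(resample))
    return medians

# Hypothetical data, for illustration only.
sample = [12, 15, 9, 20, 14, 11, 18, 13, 16, 10]
boot = bootstrap_medians(sample)
print(len(boot), min(boot), max(boot))
```

The list `boot` plays the role of an estimated sampling distribution of the median.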
The Bootstrap and Complex Surveys
Number of bootstrap samples ($n$ = sample size, $N$ = population size):
- Possible resamples: $n^n$ (example: $n = 200$, $200^{200} \approx 1.6 \times 10^{460}$)
- Too many possibilities, $N!/[n!(N-n)!]$; limit to $B$, a large number (example: $B = 1000$): the Monte Carlo approximation
Determine the sampling distribution with parameters.
Calculate the variance in the normal way.
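The size of $n^n$ for $n = 200$ can be checked with a few lines of Python. This is only an arithmetic check of the figure quoted on the slide; it works in logarithms to keep the numbers readable.

```python
import math

n = 200
log10_count = n * math.log10(n)              # log10 of n**n
exponent = math.floor(log10_count)
mantissa = 10 ** (log10_count - exponent)
print(f"{n}^{n} is about {mantissa:.1f} x 10^{exponent}")

# Cross-check with exact integer arithmetic: n**n has exponent + 1 digits.
digits = len(str(n ** n))
```

Enumerating that many resamples is out of the question, which is why only $B$ randomly drawn resamples are used in practice.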
Advantages and Disadvantages
Advantages:
- Avoids the costs of taking new samples (estimates a sampling distribution when only one sample is available)
- Checking parametric assumptions
- Used when parametric assumptions cannot be made or are very complicated
- Estimation of variance in quantiles
Disadvantages:
- Relies on a representative sample
- Variability due to finite replications (Monte Carlo)
Computations
With more computing power available, the bootstrap is possible for a large number of resamples. Possible programs: Matlab, Minitab, SAS, Excel, S-Plus, SPSS, Fathom.
Bootstrap using SURVEY program
The main parameter of interest is the median price that all households in Lockhart City are willing to pay for cable. The price that a household is willing to pay for cable is positively correlated with average district house value.
Districts in Lockhart City are divided into strata based on average house value. Estimate the variance and create a 95% CI.
Lockhart City Strata Characteristics:
Take a stratified random sample of size 200 using proportional allocation. Using the stratified random sample, implement the general bootstrap procedure, BWO, and mirror match.

| Stratum | Districts | House Value ($1,000) | N | n |
|---|---|---|---|---|
| 1 | 53, 54, 55, 59, 60 | 35-55 | 3529 | 36 |
| 2 | 52, 58, 63, 64, 65 | 55-70 | 4775 | 49 |
| 3 | 62, 68, 69, 70, 73 | 70-80 | 4257 | 43 |
| 4 | 57, 67, 72, 74, 75 | 80-85 | 4077 | 41 |
| 5 | 51, 56, 61, 66, 71 | 85-105 | 3026 | 31 |

Table 1: Lockhart City strata based on house value
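The stratum sample sizes in Table 1 can be reproduced from the population sizes with a short Python sketch of proportional allocation. Rounding each share to the nearest integer is an assumption here; it happens to sum to 200 for these strata, but in general the rounded sizes may need a manual adjustment.

```python
# Stratum population sizes N_L from Table 1.
N = {1: 3529, 2: 4775, 3: 4257, 4: 4077, 5: 3026}
n_total = 200

N_total = sum(N.values())
# Proportional allocation: n_L = n * N_L / N, rounded to the nearest integer.
n = {L: round(n_total * N_L / N_total) for L, N_L in N.items()}
print(n, sum(n.values()))
```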
Variations of the Bootstrap in Strata
General Bootstrap: mimic the original sampling method.
BWO (Bootstrap Without Replacement): grow the sample to the size of the population.
Mirror-Match: repeated miniature resamples.
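For the general bootstrap, "mimic the original sampling method" can be read as resampling within each stratum at its original sample size. The Python sketch below follows that reading; the data and the function name `general_stratified_bootstrap` are hypothetical, and pooling the strata without weights assumes a proportionally allocated (self-weighting) sample.

```python
import random
import statistics

def general_stratified_bootstrap(sample_by_stratum, rng):
    """Resample each stratum with replacement at its original size, then pool."""
    pooled = []
    for units in sample_by_stratum.values():
        pooled.extend(rng.choices(units, k=len(units)))
    return pooled

# Hypothetical stratified sample (stratum label -> observed values).
sample_by_stratum = {1: [8, 10, 10, 12], 2: [10, 12, 15, 15, 18]}
rng = random.Random(0)
boot_medians = [
    statistics.median(general_stratified_bootstrap(sample_by_stratum, rng))
    for _ in range(1000)
]
```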
BWO: Bootstrap Without Replacement
Grow the sample to the size of the population. For each stratum $L$, create a pseudo-population by replicating the sample $k_L$ times.
Resample $n'_L$ units from each stratum without replacement to obtain a single bootstrap sample for stratum $L$.
Repeat a large number of times.
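The two BWO steps for a single stratum might look like the following Python sketch. It assumes the replication count and resample size are already integers (passed in as `k` and `n_prime`); the data and the function name `bwo_resample` are hypothetical.

```python
import random
import statistics

def bwo_resample(stratum_sample, k, n_prime, rng):
    """One BWO bootstrap sample for a single stratum (integer k and n_prime assumed)."""
    pseudo_population = list(stratum_sample) * k    # replicate the sample k times
    return rng.sample(pseudo_population, n_prime)   # draw without replacement

rng = random.Random(1)
stratum_sample = [10, 12, 8, 15, 10, 9]             # hypothetical stratum data
medians = [
    statistics.median(bwo_resample(stratum_sample, k=4, n_prime=5, rng=rng))
    for _ in range(500)
]
```

In a full stratified analysis the same step would be run in every stratum and the resulting stratum resamples combined before computing the statistic.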
BWO: Variable Definitions
$$n'_L = n_L - (1 - f_L)$$
where $f_L = n_L / N_L$ = stratum sampling fraction
$$k_L = \frac{N_L}{n_L}\left(1 - \frac{1 - f_L}{n_L}\right)$$
where $n'_L$ and $k_L$ are integers
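To see what these quantities look like in practice, the Python sketch below evaluates them for the Lockhart City strata, reading the definitions as $n'_L = n_L - (1 - f_L)$ and $k_L = (N_L/n_L)\,(1 - (1 - f_L)/n_L)$. The stratum sizes are those of Table 1; the function name `bwo_parameters` is hypothetical.

```python
# Stratum population sizes N_L and sample sizes n_L from Table 1.
N = {1: 3529, 2: 4775, 3: 4257, 4: 4077, 5: 3026}
n = {1: 36, 2: 49, 3: 43, 4: 41, 5: 31}

def bwo_parameters(n_L, N_L):
    """Return (f_L, n'_L, k_L) for one stratum, before any rounding."""
    f_L = n_L / N_L                           # stratum sampling fraction
    n_prime = n_L - (1 - f_L)                 # BWO resample size
    k = (N_L / n_L) * (1 - (1 - f_L) / n_L)   # number of replications
    return f_L, n_prime, k

params = {L: bwo_parameters(n[L], N[L]) for L in N}
for L, (f_L, n_prime, k) in params.items():
    print(f"stratum {L}: f = {f_L:.4f}, n' = {n_prime:.3f}, k = {k:.3f}")
```

None of the computed values is an integer, so some form of rounding or bracketing between neighbouring integers is needed before resampling.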
Disadvantages of extended BWO
$N_L$ must be known. $n'_L$ and $k_L$ are often non-integers. Must bracket between integers if $n'_L$ and $k_L$ are non-integer. Computing time.
Mirror-Match
Repeated miniature resamples. The resample size is chosen so that the ratio of resample size to sample size matches the ratio of the original sample size to the population size ($n_L/N_L$).
Using the resample size $n'_L$, we resample $n'_L$ units (SRSWOR) from each stratum $L$. Repeat the previous step $k_L$ times with replacement to obtain a single bootstrap sample for stratum $L$. Repeat a large number of times.
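A minimal Python sketch of the mirror-match steps for one stratum is given below. It assumes integer values for the miniature resample size and the number of repetitions (passed as `n_prime` and `k`); the data and the function name `mirror_match_resample` are hypothetical.

```python
import random

def mirror_match_resample(stratum_sample, n_prime, k, rng):
    """One mirror-match bootstrap sample for a single stratum."""
    resample = []
    for _ in range(k):
        # Each miniature resample is drawn without replacement (SRSWOR);
        # the k miniature resamples are drawn independently of one another.
        resample.extend(rng.sample(stratum_sample, n_prime))
    return resample

rng = random.Random(2)
stratum_sample = [10, 12, 8, 15, 11, 9, 14, 13]   # hypothetical stratum data
one_bootstrap_sample = mirror_match_resample(stratum_sample, n_prime=2, k=4, rng=rng)
```

A unit can appear at most once within a miniature resample but may reappear across the $k$ repetitions.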
Mirror-Match: Variable Definitions
$$n'_L = \frac{n_L^2}{N_L}$$
$$k_L = \frac{n_L}{n'_L} \cdot \frac{1 - f^*_L}{1 - f_L}$$
where:
$f^*_L = n'_L / n_L$ = stratum resample fraction
$f_L = n_L / N_L$ = original stratum sample fraction
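As a rough numerical check, the Python sketch below evaluates these definitions for the Table 1 strata, reading them as $n'_L = n_L^2/N_L$, $f^*_L = n'_L/n_L$, and $k_L = (n_L/n'_L)(1 - f^*_L)/(1 - f_L)$. The function name `mirror_match_parameters` is hypothetical, and no rounding is applied.

```python
# Stratum population sizes N_L and sample sizes n_L from Table 1.
N = {1: 3529, 2: 4775, 3: 4257, 4: 4077, 5: 3026}
n = {1: 36, 2: 49, 3: 43, 4: 41, 5: 31}

def mirror_match_parameters(n_L, N_L):
    """Return (n'_L, f*_L, k_L) for one stratum, before any rounding."""
    f_L = n_L / N_L                  # original stratum sample fraction
    n_prime = n_L ** 2 / N_L         # resample size that gives f*_L = f_L
    f_star = n_prime / n_L           # stratum resample fraction
    k = (n_L / n_prime) * (1 - f_star) / (1 - f_L)
    return n_prime, f_star, k

mm = {L: mirror_match_parameters(n[L], N[L]) for L in N}
for L, (n_prime, f_star, k) in mm.items():
    print(f"stratum {L}: n' = {n_prime:.3f}, f* = {f_star:.4f}, k = {k:.2f}")
```

With this choice of $n'_L$ the two fractions coincide, so $k_L$ reduces to $N_L/n_L$. For these strata the unrounded $n'_L$ comes out below one unit, so an integer adjustment would be needed before resampling; the slides do not say how this was handled.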
Mirror Match: Disadvantages
$N_L$ must be known. $k_L$ is often non-integer. Must bracket between integers when $k_L$ is non-integer. Computing time.
Estimation of the Population Sampling Distributions
100,000 independent stratified random samples.
Medians computed and plotted to form empirical sampling distributions.
Variables: house value, cable price, and TV hours.
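The empirical sampling distribution described here can be sketched in Python as follows: draw many independent stratified samples from the population and record the median of each. The two-stratum population, the sample sizes, and the 2000 repetitions are hypothetical stand-ins for the Lockhart City data and the 100,000 samples used in the project.

```python
import random
import statistics

def stratified_sample(strata, sizes, rng):
    """Draw an SRSWOR of sizes[L] units from each stratum and pool them."""
    pooled = []
    for L, units in strata.items():
        pooled.extend(rng.sample(units, sizes[L]))
    return pooled

rng = random.Random(42)
# Hypothetical population: two strata with different value ranges.
strata = {
    1: [rng.uniform(35, 55) for _ in range(500)],
    2: [rng.uniform(55, 70) for _ in range(700)],
}
sizes = {1: 10, 2: 14}   # proportional to the stratum sizes 500 and 700
medians = [
    statistics.median(stratified_sample(strata, sizes, rng))
    for _ in range(2000)
]
```

A histogram of `medians` would then approximate the population sampling distribution of the median.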
Estimation of the Population Sampling Distributions
Simulations
Matlab code: General, BWO, and Mirror-match.
Two independent stratified random samples from Lockhart City.
Comparison of the sample bootstrap sampling distributions with the population sampling distributions.
95% confidence intervals were determined from the 2.5 and 97.5 percentiles of the bootstrap distributions.
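A percentile interval of this kind can be sketched in Python with the standard library. The interpolation rule used by `statistics.quantiles` with `method="inclusive"` is an assumption; the project's Matlab code may have used a different percentile convention. The data are hypothetical.

```python
import random
import statistics

def percentile_ci(boot_stats):
    """Return the 2.5th and 97.5th percentiles of the bootstrap statistics."""
    # n=40 gives cut points every 2.5%; the first and last are the ones needed.
    cuts = statistics.quantiles(boot_stats, n=40, method="inclusive")
    return cuts[0], cuts[-1]

rng = random.Random(3)
sample = [rng.gauss(40, 10) for _ in range(50)]   # hypothetical data
boot = [
    statistics.median(rng.choices(sample, k=len(sample)))
    for _ in range(1000)
]
low, high = percentile_ci(boot)
print(f"95% percentile interval for the median: ({low:.1f}, {high:.1f})")
```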
Sampling Distributions 1
Sampling Distributions 2
Confidence Intervals
| Variable | Population Estimate | Empirical CI | Standard Bootstrap | BWO | Mirror Match |
|---|---|---|---|---|---|
| House (1) | 74740 | (72027,75954) | (73092,75616) | (73119,75600) | (73119,75733) |
| Price (1) | 10 | (10,10) | (10,10) | (10,10) | (10,10) |
| Hours (1) | 40 | (28.5,41) | (32.5,47.0) | (32.5,47.0) | (32.0,47.0) |
| House (2) | 74740 | (72027,75954) | (72079,75155) | (71995,75155) | (72010,75155) |
| Price (2) | 10 | (10,10) | (10,10) | (10,10) | (10,10) |
| Hours (2) | 40 | (28.5,41) | (30.5,39.5) | (29.5,39.5) | (30.5,40.0) |
The Empirical versus the Bootstrap Sampling Distributions
Bootstrap sampling distributions are expected to mimic actual sampling distributions. Bootstrap sampling is sensitive to individual samples. The shape of bootstrap sampling distributions may vary, but the statistic of interest and its variance are considered accurate.
Comparison of Bootstrap Methods
Empirical Coverages
The empirical coverages were close to the expected 95%. They differed very little between the different bootstrap procedures.

| | General | BWO | Mirror-Match |
|---|---|---|---|
| House Value | .936 | .933 | .94 |
| Cable Price | 1 | 1 | 1 |
| TV Hours | .957 | .959 | .961 |
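Empirical coverage is the fraction of repeated samples whose confidence interval contains the population value. The Python sketch below illustrates the calculation for a bootstrap percentile interval of the median. It uses simple random sampling from a hypothetical population rather than the stratified design of the project, and small repetition counts to keep the run time short.

```python
import random
import statistics

def percentile_ci(boot_stats):
    """2.5th and 97.5th percentiles of the bootstrap statistics."""
    cuts = statistics.quantiles(boot_stats, n=40, method="inclusive")
    return cuts[0], cuts[-1]

def empirical_coverage(population, n, n_samples, n_boot, rng):
    """Fraction of samples whose percentile CI contains the population median."""
    true_median = statistics.median(population)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(population, n)   # SRSWOR from the population
        boot = [statistics.median(rng.choices(sample, k=n)) for _ in range(n_boot)]
        low, high = percentile_ci(boot)
        hits += low <= true_median <= high
    return hits / n_samples

rng = random.Random(7)
population = [rng.gauss(40, 10) for _ in range(5000)]   # hypothetical population
coverage = empirical_coverage(population, n=30, n_samples=200, n_boot=200, rng=rng)
print(f"empirical coverage: {coverage:.3f}")
```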
Empirical Coverages
Empirical coverages are dependent on the type of confidence interval that was originally selected. Our confidence intervals were calculated from the 2.5 and 97.5 percentiles of each bootstrap distribution. There are many different types of bootstrap confidence intervals. The one we selected, although intuitive in design, is considered generally biased (Bedrick 2006).
Computer Processing Times
Computer processing times varied greatly.
Mean processing time per sample in seconds.

| | General | BWO | Mirror-Match |
|---|---|---|---|
| House Value | .11961 | 45.765 | 35.18 |
| Cable Price | .11502 | 45.769 | 35.164 |
| TV Hours | .12112 | 45.812 | 35.169 |
Computer Processing Times
BWO took 381 times as long as general bootstrapping procedures.
Mirror-match took 293 times as long as general bootstrapping procedures. For our study, the BWO and mirror-match conferred no advantage over general bootstrapping with regard to statistical estimates. However, their vastly greater processing times are a great disadvantage.
CONCLUSIONS: General Bootstrap versus BWO and Mirror-Match
BWO and Mirror-match procedures are designed to mimic complex sampling designs. We only analyzed stratified samples of 200 from a fictitious city. BWO and Mirror-match methods may be advantageous in other complex sampling scenarios.