Ratio estimation under SRS
Assume
Absence of nonsampling error
SRS of size n from a pop of size N
Ratio estimation is an alternative to $\bar{y}$, $\hat{t}$, and $\hat{p}$
under SRS that uses “auxiliary” information ($X$)
Sample data: observe $y_i$ and $x_i$
Population information
Have $y_i$ and $x_i$ on all individual units, or
Have summary statistics from the population distribution
of $X$, such as the population mean or total of $X$
Ratio estimation is also used to estimate a
population parameter called a ratio ($B$)
1
Uses
Estimate a ratio: tree volume or bushels per acre, per capita income, liability to asset ratio
More precise estimator of population parameters: if X and Y are correlated, can improve upon $\bar{y}$, $\hat{t}$, and $\hat{p}$
Estimating totals when pop size N is unknown: avoids need to know N in formula for $\hat{t}$
Incorporate known information into estimates: poststratification, adjust for nonresponse
Obtaining estimates of subsamples: domain estimation
2
Estimating a ratio, B
Population parameter for the ratio: $B = \dfrac{t_y}{t_x} = \dfrac{\bar{y}_U}{\bar{x}_U}$
Examples
Number of bushels harvested (y) per acre (x)
Number of children (y) per single-parent
household (x)
Total usable weight (y) relative to total shipment
weight (x) for chickens
3
Estimating a ratio
$B = \dfrac{t_y}{t_x} = \dfrac{\bar{y}_U}{\bar{x}_U}$
SRS of n observation units
Collect data on y and x for each OU
Natural estimator for B ?
4
Estimating a ratio -2
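Estimator for B
$$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{\hat{t}_y}{\hat{t}_x} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}$$
$\hat{B}$ is a biased estimator for $B$
$E[\hat{B}] \neq B$
$\hat{B}$ is a ratio of random variables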
5
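Bias of $\hat{B}$
$$E[\hat{B}] - B \approx \left(1 - \frac{n}{N}\right) \frac{1}{n \bar{x}_U^2} \left( B S_x^2 - R S_x S_y \right) \neq 0$$
$$R = \mathrm{Corr}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U)}{(N - 1) S_x S_y}$$
$S_x$ = population standard deviation of x
$S_y$ = population standard deviation of y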
6
Bias of $\hat{B}$ – 2
Bias is small if
Sample size n is large
Sampling fraction n/N is large
$\bar{x}_U$ is large
$S_x$ is small (pop std deviation for x)
High positive correlation between X and Y
(see Lohr p. 67)
7
Estimated variance of
estimator for B
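Estimator for $V[\hat{B}]$
$$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n \bar{x}_U^2}$$
where
$$s_e^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \hat{B} x_i)^2 = \frac{1}{n-1} \sum_{i=1}^{n} e_i^2, \qquad e_i = y_i - \hat{B} x_i$$
If $\bar{x}_U$ is unknown?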
8
Variance of $\hat{B}$
$$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n \bar{x}_U^2}$$
Variance is small if
sample size n is large
sampling fraction n/N is large
deviations about the line, $e = y - Bx$, are small
correlation between X and Y is close to 1
$\bar{x}_U$ is large
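The estimator $\hat{B}$ and its variance estimator can be sketched in Python as below. This is a minimal illustration rather than anything from the slides: the function name `ratio_estimate` and the toy data are hypothetical, and the sketch assumes the sample mean of x is substituted when the population mean $\bar{x}_U$ is not supplied.

```python
import math

def ratio_estimate(y, x, N, x_bar_U=None):
    """Return (B_hat, SE(B_hat)) for an SRS of n paired values from N units."""
    n = len(y)
    B_hat = sum(y) / sum(x)
    # residuals about the fitted line through the origin
    e = [yi - B_hat * xi for yi, xi in zip(y, x)]
    s2_e = sum(ei ** 2 for ei in e) / (n - 1)
    # assumption: fall back to the sample mean if the population mean is unknown
    if x_bar_U is None:
        x_bar_U = sum(x) / n
    v_hat = (1 - n / N) * s2_e / (n * x_bar_U ** 2)
    return B_hat, math.sqrt(v_hat)

# hypothetical toy data, for illustration only
y = [2.0, 4.0, 6.0, 9.0]
x = [1.0, 2.0, 3.0, 4.0]
B_hat, se = ratio_estimate(y, x, N=100)
print(B_hat, se)
```

Passing `x_bar_U` explicitly uses the known population mean, as in the variance formula above.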
9
Ag example – 1
Frame: 1987 Agricultural Census
Take SRS of 300 counties from 3078 counties to
estimate conditions in 1992
Collect data on y, have data on x for sample
$y_i$ = total acreage of farms in 1992 in county i
$x_i$ = total acreage of farms in 1987 in county i
Existing knowledge about the population
$\bar{x}_U$ = average acreage of farms per county in US in 1987
= 313,343.283 acres / county
$t_x$ = total acreage of farms in US in 1987
= 964,470,625 acres
10
Ag example – 2
Estimate B = (acres of farms in 1992) / (acres of farms in 1987)
$\sum_{i=1}^{300} y_i = 89{,}369{,}114$ acres
$\sum_{i=1}^{300} x_i = 90{,}586{,}117$ acres
$$\hat{B} = \frac{\sum_{i} y_i}{\sum_{i} x_i} = \frac{89{,}369{,}114}{90{,}586{,}117} = 0.9866$$
farm acres in 1992 relative to 1987 farm acres
11
Ag example – 3
$$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n \bar{x}_U^2}$$
Need to calculate variance of the $e_i$’s
12
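Ag example – 4
For each county i, calculate $e_i = y_i - \hat{B} x_i$
Coffee Co, AL example
$e_i = 175{,}209 - (0.9866)(179{,}311) = -1693.00$ (computed with the unrounded $\hat{B} \approx 0.986565$)
Sum of squares for $e_i$
$$\sum_{i=1}^{n} e_i^2 = e_1^2 + e_2^2 + \cdots + e_{300}^2 = 2.9965166 \times 10^{11}$$
$$s_e^2 = \frac{1}{n-1} \sum_{i=1}^{n} e_i^2 = \frac{1}{299} \left( 2.9965166 \times 10^{11} \right) = 1{,}002{,}179{,}462$$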
13
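Ag example – 5
$$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n \bar{x}_U^2} = \left(1 - \frac{300}{3078}\right) \frac{1{,}002{,}179{,}462}{300\,(313{,}343.283)^2} = 0.000030707$$
$\hat{B}$ = 0.9866 farm acres in 1992 per 1987 farm acre
$SE(\hat{B}) = \sqrt{0.000030707} = 0.0055$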
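The arithmetic in this example can be checked with a few lines of Python. This is only a sketch that plugs in the summary figures quoted on the slides; the variable names are illustrative.

```python
import math

n, N = 300, 3078                 # sample and population sizes (counties)
sum_y = 89_369_114               # total 1992 acreage in the sample
sum_x = 90_586_117               # total 1987 acreage in the sample
x_bar_U = 313_343.283            # known 1987 population mean per county
s2_e = 1_002_179_462             # residual variance quoted on the slide

B_hat = sum_y / sum_x
v_B = (1 - n / N) * s2_e / (n * x_bar_U ** 2)
se_B = math.sqrt(v_B)
print(round(B_hat, 4), round(se_B, 4))
```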
14
Estimating proportions
If denominator variable is random, use ratio
estimator $\hat{B}$ to estimate the proportion $p$
Example (p. 72)
10 plots under protected oak trees used to assess effect of
feral pigs on native vegetation on Santa Cruz Island, CA
Count live seedlings y and total number of seedlings x per
plot
Y and X correlated due to common environmental factors
Estimate proportion of live seedlings to total number of
seedlings
$$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{6.1}{20.6} \approx 0.30 \quad \text{with } SE(\hat{B}) = 0.032$$
15
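Estimating population mean
Estimator for $\bar{y}_U$
$$\hat{\bar{y}}_r = \hat{B} \bar{x}_U = \frac{\bar{y}}{\bar{x}} \bar{x}_U = \frac{\bar{x}_U}{\bar{x}} \bar{y}$$
“Adjustment factor” $\bar{x}_U / \bar{x}$ for sample mean
A measure of discrepancy between sample
and population information, $\bar{x}$ and $\bar{x}_U$
Improves precision if X and Y are positively correlated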
16
Underlying model
$y_i = B x_i$ with $B > 0$
[Figure: line through the origin (0, 0), with x on the horizontal axis and y on the vertical axis]
B is a slope
B > 0 indicates X and Y
are positively correlated
Absence of intercept
implies line must go
through origin (0, 0)
17
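Using population mean of X
to adjust sample mean
$$\hat{\bar{y}}_r = \hat{B} \bar{x}_U = \frac{\bar{x}_U}{\bar{x}} \bar{y}$$
$\bar{x} < \bar{x}_U \Rightarrow \bar{x}_U / \bar{x} > 1 \Rightarrow$ adjust $\bar{y}$ upward to get a better estimate of $\bar{y}_U$
$\bar{x} > \bar{x}_U \Rightarrow \bar{x}_U / \bar{x} < 1 \Rightarrow$ adjust $\bar{y}$ downward
Discrepancy between sample & pop info for X is
viewed as evidence that same relative discrepancy
exists between $\bar{y}$ and $\bar{y}_U$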
18
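Bias of $\hat{\bar{y}}_r$
Ratio estimator for the population mean is
biased
$$E[\hat{\bar{y}}_r] - \bar{y}_U = E[\hat{B} \bar{x}_U] - \frac{\bar{y}_U}{\bar{x}_U} \bar{x}_U = \bar{x}_U \left( E[\hat{B}] - B \right) \neq 0$$
Rules of thumb for bias of $\hat{B}$ apply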
19
Estimator for variance of $\hat{\bar{y}}_r$
$$\hat{V}[\hat{\bar{y}}_r] = \bar{x}_U^2 \hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n}$$
20
Ag example – 6
$\hat{\bar{y}}_r = \hat{B} \bar{x}_U = (0.9866)(313{,}343.283)$
= 309,133.6 farm acres per county in 1992
21
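Ag example - 8
$\hat{\bar{y}}_r \approx 309{,}100$ farm acres / county in 1992
$$\hat{V}[\hat{\bar{y}}_r] = \bar{x}_U^2 \hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n} = (0.9025) \frac{1{,}002{,}179{,}462}{300} = 3{,}014{,}890.67$$
$SE[\hat{\bar{y}}_r] = \sqrt{3{,}014{,}890.67} = 1736 \approx 1700$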
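A short Python check of the ratio estimate of the mean and its standard error, using only the figures quoted in the Ag example. The variable names are illustrative, and the unrounded $\hat{B}$ is used, which is why the mean comes out at about 309,133.6 rather than the value obtained from 0.9866.

```python
import math

n, N = 300, 3078
B_hat = 89_369_114 / 90_586_117   # unrounded ratio estimate
x_bar_U = 313_343.283             # known 1987 population mean per county
s2_e = 1_002_179_462              # residual variance quoted on the slide

y_r = B_hat * x_bar_U                      # ratio estimate of the mean
v_y_r = (1 - n / N) * s2_e / n             # equals x_bar_U**2 * V_hat(B_hat)
se_y_r = math.sqrt(v_y_r)
print(round(y_r, 1), round(se_y_r))
```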
22
Ag example – 9
Expect a linear relationship between
X and Y (Figure 3.1)
Note that sample mean is not equal to
population mean for X
$\bar{x}_U$ = average acreage of farms per county in US
in 1987
= 313,343.283 acres / county
$\bar{x}$ = mean acreage of farms per county in US
in 1987 for the sample
= 301,953.723 acres / county
23
MSE under ratio estimation
Recall …
MSE = Variance + Bias²
SRS estimators are unbiased, so MSE = Variance
Ratio estimators are biased, so MSE > Variance
Use MSE to compare design/estimation
strategies
EX: compare sample mean under SRS with ratio
estimator for pop mean under SRS
24
Sample mean vs. ratio
estimator of mean
$MSE[\hat{\bar{y}}_r]$ is smaller than $MSE[\bar{y}]$
if and only if
$$R > \frac{1}{2} \frac{CV(x)}{CV(y)}$$
For example, if $CV(x) \approx CV(y)$ and
$R = \mathrm{Corr}(x, y) > 1/2$,
ratio estimation will be better than the sample mean under SRS
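The condition can be written as a one-line check. The sketch below is illustrative only: the function name is hypothetical, and it plugs in sample quantities quoted elsewhere in the Ag example (s_x = 344,830, x̄ = 301,954, s_y = 344,552, ȳ = 297,897, r = 0.9958) as stand-ins for the population values, which is an assumption.

```python
def ratio_beats_sample_mean(r, cv_x, cv_y):
    """True when R > (1/2) * CV(x) / CV(y)."""
    return r > 0.5 * cv_x / cv_y

# sample-based stand-ins for the population CVs (assumption)
cv_x = 344_830 / 301_954
cv_y = 344_552 / 297_897
threshold = 0.5 * cv_x / cv_y
better = ratio_beats_sample_mean(0.9958, cv_x, cv_y)
print(round(threshold, 3), better)
```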
25
Estimating the MSE
Estimate MSE with sample estimates of bias
and variance of estimator
This tends to underestimate MSE
$V[\hat{B}]$ and $\mathrm{Bias}[\hat{B}]$ are approximations
Estimated MSE is less biased if
$\mathrm{Bias}[\hat{B}]$ is small (see earlier slide)
Large sample size or sampling fraction
High + correlation for X and Y
$\bar{x}$ is a precise estimate (small CV for $\bar{x}$)
We have a reasonably large sample size
(n > 30)
26
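Ag example – 10
$\bar{y}$ = 297,897 farm acres per county in 1992
$$\widehat{se}[\bar{y}] = \widehat{MSE}[\bar{y}]^{1/2} = \sqrt{\left(1 - \frac{300}{3078}\right) \frac{(344{,}552)^2}{300}} = 18{,}898 \text{ acres}$$
$\hat{V}[\bar{y}] = 357{,}151{,}042$
$\hat{\bar{y}}_r$ = 309,134 farm acres per county in 1992
$se[\hat{\bar{y}}_r]$ = 1,736 acres
$$\widehat{\mathrm{Bias}}[\hat{\bar{y}}_r] = \left(1 - \frac{300}{3078}\right) \frac{1}{300\,(301{,}954)} \left[ 0.9866\,(344{,}830)^2 - 0.9958\,(344{,}552)(344{,}830) \right] \approx -10$$
$$\widehat{MSE}[\hat{\bar{y}}_r] = \hat{V}[\hat{\bar{y}}_r] + \widehat{\mathrm{Bias}}[\hat{\bar{y}}_r]^2 = 3{,}013{,}696 + 100 = 3{,}013{,}796$$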
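The bias and MSE figures for the ratio estimator can be reproduced with the sketch below. It uses only the sample summaries quoted on this slide; the variable names are illustrative.

```python
n, N = 300, 3078
B_hat = 0.9866
s_x, s_y, r = 344_830, 344_552, 0.9958   # sample std deviations and correlation
x_bar = 301_954                          # sample mean of x
se_y_r = 1_736                           # SE of the ratio estimate of the mean

# estimated bias of the ratio estimator of the mean
bias = (1 - n / N) * (B_hat * s_x ** 2 - r * s_x * s_y) / (n * x_bar)
# estimated MSE = variance + bias^2
mse = se_y_r ** 2 + bias ** 2
print(round(bias, 1), round(mse))
```

The squared bias is tiny relative to the variance here, so the estimated MSE and the estimated variance are nearly identical.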
27
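Estimating population total t
Estimator for t
$$\hat{t}_{yr} = \frac{\hat{t}_y}{\hat{t}_x} t_x = \hat{B} t_x = N \hat{\bar{y}}_r$$
Is $\hat{t}_{yr}$ biased?
Estimator for $V[\hat{t}_{yr}]$
$$\hat{V}[\hat{t}_{yr}] = t_x^2 \hat{V}[\hat{B}] = N^2 \hat{V}[\hat{\bar{y}}_r]$$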
28
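Ag example – 11
$\hat{t}_{yr} = \hat{B} t_x = (0.9866)(964{,}470{,}625)$
= 951,513,191 farm acres in US in 1992
$$\hat{V}[\hat{t}_{yr}] = t_x^2 \hat{V}[\hat{B}] = N^2 \hat{V}[\hat{\bar{y}}_r] = N^2 \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n} = (3078)^2 (0.9025) \frac{1{,}002{,}179{,}462}{300} \approx 2.856 \times 10^{13}$$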
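A quick Python check of the ratio estimate of the total, again using only the figures quoted in the Ag example (the variable names are illustrative, and the unrounded $\hat{B}$ is used).

```python
import math

n, N = 300, 3078
B_hat = 89_369_114 / 90_586_117   # unrounded ratio estimate
t_x = 964_470_625                 # known 1987 total acreage
s2_e = 1_002_179_462              # residual variance quoted on the slide

t_yr = B_hat * t_x                             # ratio estimate of the total
v_t_yr = N ** 2 * (1 - n / N) * s2_e / n       # N^2 * V_hat(y_r)
se_t_yr = math.sqrt(v_t_yr)
print(round(t_yr), f"{v_t_yr:.3e}", round(se_t_yr))
```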
29
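Summary of ratio estimation
$$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{\hat{t}_y}{\hat{t}_x} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}$$
$$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n \bar{x}_U^2}$$
(estimate $\bar{x}_U$ with $\bar{x}$ if $\bar{x}_U$ is unknown)
where
$$s_e^2 = \frac{1}{n-1} \sum_{i=1}^{n} e_i^2, \qquad e_i = y_i - \hat{B} x_i$$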
30
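Summary of ratio estn – 2
$$\hat{\bar{y}}_r = \hat{B} \bar{x}_U = \frac{\bar{x}_U}{\bar{x}} \bar{y}$$
$$\hat{V}[\hat{\bar{y}}_r] = \bar{x}_U^2 \hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n}$$
$$\hat{t}_{yr} = \hat{B} t_x = \frac{t_x}{\hat{t}_x} \hat{t}_y$$
$$\hat{V}[\hat{t}_{yr}] = t_x^2 \hat{V}[\hat{B}] = N^2 \hat{V}[\hat{\bar{y}}_r] = N^2 \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n}$$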
31
Regression estimation
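What if the relationship between y and x is
linear, but does NOT pass through the origin?
A better model in this case is $y = B_0 + B_1 x$
[Figure: line with intercept $B_0$ and slope $B_1$, with x on the horizontal axis and y on the vertical axis]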
32
Regression estimation – 2
New estimator is a regression estimator
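To estimate $\bar{y}_U$, $\hat{\bar{y}}_{reg}$ is the predicted value
from the regression of y on x at $x = \bar{x}_U$
$$\hat{\bar{y}}_{reg} = \hat{B}_0 + \hat{B}_1 \bar{x}_U = \bar{y} + \hat{B}_1 (\bar{x}_U - \bar{x})$$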
Adjustment factor for sample mean is
linear, rather than multiplicative
33
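Estimating population mean $\bar{y}_U$
Regression estimator
$$\hat{\bar{y}}_{reg} = \hat{B}_0 + \hat{B}_1 \bar{x}_U = \bar{y} + \hat{B}_1 (\bar{x}_U - \bar{x})$$
Estimating regression parameters
$$\hat{B}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2} = \frac{r s_y}{s_x}$$
$$\hat{B}_0 = \bar{y} - \hat{B}_1 \bar{x}$$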
34
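Estimating pop mean – 2
Sample variances, correlation, covariance
$$s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$$
$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
$$r = \frac{s_{xy}}{s_x s_y}$$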
35
Bias in regression estimator
$$E[\hat{\bar{y}}_{reg}] - \bar{y}_U = -\mathrm{Cov}[\hat{B}_1, \bar{x}] \neq 0$$
36
Estimating variance
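$$\hat{V}[\hat{\bar{y}}_{reg}] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n}$$
where
$$s_e^2 = \frac{1}{n-1} \sum_{i=1}^{n} e_i^2, \qquad e_i = y_i - (\hat{B}_0 + \hat{B}_1 x_i) = y_i - \hat{y}_i$$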
Note: This is a different residual than ratio
estimation (predicted values differ)
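The regression estimator and its variance estimator can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name `regression_estimate` and the toy data are hypothetical, and the residual variance uses the divisor n − 1 as written on the slide.

```python
import math

def regression_estimate(y, x, N, x_bar_U):
    """Return (y_reg, SE(y_reg)) for an SRS of n paired values from N units."""
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s2_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    B1 = s_xy / s2_x                 # slope
    B0 = y_bar - B1 * x_bar          # intercept
    y_reg = B0 + B1 * x_bar_U        # predicted value at the population mean of x
    # residuals from the fitted line with intercept
    e = [yi - (B0 + B1 * xi) for xi, yi in zip(x, y)]
    s2_e = sum(ei ** 2 for ei in e) / (n - 1)   # divisor n - 1, as on the slide
    v_hat = (1 - n / N) * s2_e / n
    return y_reg, math.sqrt(v_hat)

# hypothetical toy data lying exactly on y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
y_reg, se = regression_estimate(y, x, N=50, x_bar_U=3.0)
print(y_reg, se)
```

Because the toy points fall exactly on a line, the residuals vanish and the estimate equals the line's value at $\bar{x}_U$.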
37
Estimating the MSE
Plugging sample estimates into Lohr,
equation 3.13:
$$\widehat{MSE}[\hat{\bar{y}}_{reg}] = \left(1 - \frac{n}{N}\right) \frac{s_y^2}{n} (1 - r^2)$$
38
Estimating population total t
$\hat{t}_{yreg} = N \hat{\bar{y}}_{reg}$
$\hat{V}[\hat{t}_{yreg}] = N^2 \hat{V}[\hat{\bar{y}}_{reg}]$
Is regression estimator for t unbiased?
39
Tree example
Goal: obtain a precise estimate of number of
dead trees in an area
Sample
Select n = 25 out of N = 100 plots
Make field determination of number of dead trees
per plot, $y_i$
Population
For all N = 100 plots, have photo determination
of number of dead trees per plot, $x_i$
Calculate $\bar{x}_U$ = 11.3 dead trees per plot
40
Tree example – 2
Lohr, p. 77-78
Data
Plot of y vs. x
Output from PROC REG
Components for calculating estimators and
estimating the variance of the estimators
We will use PROC SURVEYREG, which will give
you the correct output for regression estimators
41
Tree example – 3
Estimated mean number of dead trees/plot
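$\hat{\bar{y}}_{reg} = 5.059292 + 0.613274\,(11.3) = 11.99$ dead trees/plot
$$\widehat{se}[\hat{\bar{y}}_{reg}] = \sqrt{\left(1 - \frac{25}{100}\right) \frac{5.54834}{25}} = 0.4080 \approx 0.41$$
Estimated total number of dead trees
$\hat{t}_{yreg} = 100\,(11.99) = 1199$ dead trees in area
$\widehat{se}[\hat{t}_{yreg}] = 100\,(0.4080) \approx 41$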
42
Tree example – 4
Due to small sample size, Lohr uses the t distribution with n − 2 degrees of freedom
$t_{\alpha/2,\,df}$ with $\alpha = 0.05$ and $df = n - 2 = 23$,
so $t_{0.025,\,23} = 2.07$
Half-width for 95% CI
$t_{\alpha/2,\,n-2}\; \widehat{se}[\hat{t}_{yreg}] = 2.07\,(40.80) = 84.45$
Approx 95% CI for $t_y$ is (1115, 1283) dead
trees
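The tree example's numbers can be re-derived with the sketch below, which plugs in the regression coefficients, the residual mean square (5.54834), and the t critical value (2.07) quoted on the slides. The variable names are illustrative.

```python
import math

n, N = 25, 100
B0, B1 = 5.059292, 0.613274      # fitted intercept and slope from the slide
x_bar_U = 11.3                   # population mean of photo counts
s2_e = 5.54834                   # residual mean square from the slide
t_crit = 2.07                    # t_{0.025, 23} from the slide

y_reg = B0 + B1 * x_bar_U
se_y_reg = math.sqrt((1 - n / N) * s2_e / n)
t_reg = N * y_reg                # estimated total number of dead trees
se_t_reg = N * se_y_reg
half_width = t_crit * se_t_reg
ci = (t_reg - half_width, t_reg + half_width)
print(round(y_reg, 2), round(se_y_reg, 4), round(half_width, 2))
```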
43
Related estimators
Ratio estimator
$B_0 = 0$: ratio model
Ratio estimator corresponds to a regression estimator
with no intercept
Difference estimation
$B_1 = 1$: slope is assumed to be 1
[Figure: line $y = B_0 + B_1 x$ with intercept $B_0$ and slope $B_1$]
44
Domain estimation under SRS
Usually interested in estimates and inferences
for subpopulations, called domains
If we have not used stratification to set the
sample size for each domain, then we should
use domain estimation
We will assume SRS for this discussion
If we use stratified sampling with strata =
domains, then use stratum estimators (Ch 4)
To use stratification, need to know domain
assignment for each unit in the sampling frame
prior to sampling
45
Stratification vs.
domain estimation
In stratified random sampling
Define sample size in each stratum before
collecting data
Sample size in stratum h is fixed, or known
In other words, the sample size nh is the same for
each sample selected under the specified design
In domain estimation
nd = sample size in domain d is random
Don’t know nd until after the data have been
collected
The value of nd changes from sample to sample
46
Population partitioned
into domains
[Figure: population partitioned into domains d = 1, d = 2, ..., d = D]
Recall U = index set for population = {1, 2, …, N}
Domain index d = 1, 2, …, D
$U_d$ = {1, 2, …, $N_d$}, where $N_d$ = number of OUs in domain d in
the population
In sample of size n
$n_d$ = number of sample units that are from domain d
$S_d$ = index set for sample units belonging to domain d
47
Boat owner example
Population
N = 400,000 boat owners (currently licensed)
Divide universe (population) into 2 domains
d = 1: own open motor boat > 16 ft. (large boat)
d = 2: do not own this type of boat
Sample
n = 1,500 owners selected using SRS
Of the n = 1,500 sample owners:
n1 = 472 owners of open motor boat > 16 ft.
n2 = 1,028 owners who do not own this kind of boat
48
New population parameters
Domain mean
$$\bar{y}_{U_d} = \frac{1}{N_d} \sum_{i \in U_d} y_i$$
Domain total
$$t_{U_d} = \sum_{i \in U_d} y_i$$
$i \in U_d$: “Unit i belongs to domain d”
$i \notin U_d$: “Unit i does NOT belong to domain d”
49
Boat owner example - 2
Estimate population domain mean
Estimate the average number of children
for boat owners from domain 1
Estimate proportion of boat owners from
domain 1 who have children
Estimate population domain total
Estimate the total number of children for
large boat owners (domain 1)
50
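New population parameter – 2
Ratio form of population domain mean
$$\bar{y}_{U_d} = \frac{\sum_{i \in U_d} y_i}{N_d} = \frac{\sum_{i \in U_d} y_i / N}{N_d / N} = \frac{\sum_{i=1}^{N} u_i / N}{\sum_{i=1}^{N} x_i / N} = \frac{\bar{u}_U}{\bar{x}_U} = B$$
Numerator variable
$u_i = y_i$ if $i \in U_d$; $u_i = 0$ if $i \notin U_d$
Denominator variable
$x_i = 1$ if $i \in U_d$; $x_i = 0$ if $i \notin U_d$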
51
Boat owner example - 3
Estimate mean number of children for
owners from domain 1
$y_i$ = number of children for owner i
$u_i = y_i$ if owner $i \in U_1$ (domain 1); $u_i = 0$ otherwise (not in domain 1)
$x_i = 1$ if owner $i \in U_1$; $x_i = 0$ otherwise
$u_i$ and $x_i$ apply to the whole pop, with zero values for OUs that are not in domain 1
52
Boat example – 4
Owner (i)   Domain (d_i)   # Kids (y_i)   Num. (u_i)   Den. (x_i)
1           1              3              3            1
2           1              2              2            1
3           2              5              0            0
4           1              0              0            1
5           2              0              0            0
6           2              1              0            0
7           1              1              1            1
8           2              2              0            0
…
53
Estimator for population
domain mean $\bar{y}_{U_d}$
$$\bar{y}_d = \hat{B} = \frac{\bar{u}}{\bar{x}} = \frac{\frac{1}{n} \sum_{i=1}^{n} u_i}{\frac{1}{n} \sum_{i=1}^{n} x_i} = \frac{\sum_{i \in S_d} y_i}{n_d}$$
= sample mean of observations in domain d
54
Boat example – 5
Domain 1 data
Number of Children   Number of Respondents
0                    76
1                    139
2                    166
3                    63
4                    19
5                    5
6                    3
8                    1
Total                472
55
Boat example – 6
Domain 1 and domain 2 data combined
u_i      Number of Respondents
0        1104
1        139
2        166
3        63
4        19
5        5
6        3
8        1
Total    1500
1104 zeros = 76 zeros from domain 1 + 1028 zeros from domain 2
56
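Boat example – 7
Two ways of estimating mean
Whole data set
$$\bar{u} = \frac{1}{n} \sum_{i=1}^{n} u_i = \frac{1}{1500} \left[ 1028(0) + 76(0) + 139(1) + 166(2) + \cdots + 1(8) \right] = \frac{787}{1500} = 0.524667$$
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{n_1}{n} = \frac{472}{1500} = 0.314667$$
$$\hat{B} = \frac{\bar{u}}{\bar{x}} = \frac{0.524667}{0.314667} = 1.67 \text{ children per large boat owner}$$
Domain 1 data only
$$\bar{y}_1 = \frac{1}{n_1} \sum_{i \in S_1} y_i = \frac{787}{472} = 1.67 \text{ children per large boat owner}$$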
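Both routes to the domain mean can be checked from the frequency table for domain 1. The sketch below is illustrative: the dictionary `counts_d1` simply re-keys the table (number of children mapped to number of respondents), and the variable names are not from the slides.

```python
# frequency table for domain 1: number of children -> number of respondents
counts_d1 = {0: 76, 1: 139, 2: 166, 3: 63, 4: 19, 5: 5, 6: 3, 8: 1}
n = 1500                                   # full SRS sample size

n1 = sum(counts_d1.values())               # sample size in domain 1
total_kids = sum(k * c for k, c in counts_d1.items())

# route 1: whole data set, u_i = y_i in domain 1 and 0 elsewhere, x_i = indicator
u_bar = total_kids / n
x_bar = n1 / n
B_hat = u_bar / x_bar

# route 2: domain 1 data only
y_bar_1 = total_kids / n1
print(n1, total_kids, round(B_hat, 6), round(y_bar_1, 6))
```

The two results agree exactly because the factor 1/n cancels in the ratio.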
57
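Estimator for variance of $\bar{y}_d$
$\bar{y}_d = \hat{B} = \bar{u} / \bar{x}$
$$\hat{V}[\bar{y}_d] = \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(\frac{N}{N_d}\right)^2 \frac{n_d - 1}{n - 1} s_{yd}^2$$
where
$$s_{yd}^2 = \frac{1}{n_d - 1} \sum_{i \in S_d} (y_i - \hat{B} x_i)^2$$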
58
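Boat example – 8
$$\hat{V}[\bar{y}_d] = \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(\frac{N}{N_d}\right)^2 \frac{n_d - 1}{n - 1} s_{yd}^2$$
$1 - \dfrac{n}{N} = 1 - \dfrac{1{,}500}{400{,}000} \approx 1$, so can ignore FPC
$\dfrac{N}{N_1} = \dfrac{400{,}000}{?}$: $N_1$ is unknown, so estimate with $\dfrac{n}{n_1} = \dfrac{1500}{472} = 3.177966$
$\dfrac{n_1 - 1}{n - 1} \approx \dfrac{n_1}{n} = \dfrac{472}{1500} = \dfrac{1}{3.177966}$
$s_{y1}^2 = \dfrac{1}{n_1 - 1} \sum_{i \in S_1} (y_i - \hat{B} x_i)^2 = 94.111078$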
59
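Boat example – 9
$$\hat{V}[\bar{y}_1] = \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(\frac{N}{N_1}\right)^2 \frac{n_1 - 1}{n - 1} s_{y1}^2 \approx \frac{1}{n} \left(\frac{n}{n_1}\right)^2 \frac{n_1}{n} s_{y1}^2 = \frac{s_{y1}^2}{n_1} = \frac{94.111078}{472} = 0.199388$$
$\bar{y}_1 = 1.667373 \approx 1.67$ children per large boat owner
$SE[\bar{y}_1] = \sqrt{0.199388} = 0.4465287 \approx 0.45$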
60
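Approximation for estimator of
variance of $\bar{y}_d$
$$\hat{V}[\bar{y}_d] \approx \left(1 - \frac{n}{N}\right) \frac{s_{yd}^2}{n_d}$$
assuming
$\dfrac{N_d}{N} \approx \dfrac{n_d}{n}$ and $\dfrac{n_d - 1}{n - 1} \approx \dfrac{n_d}{n}$
where
$$s_{yd}^2 = \frac{1}{n_d - 1} \sum_{i \in S_d} (y_i - \hat{B} x_i)^2$$
(computed from the domain's data only)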
61
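Estimated variance of $\hat{B}$
Estimator for $V[\hat{B}]$
$$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n \bar{x}_U^2}$$
where
$$s_e^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \hat{B} x_i)^2 = \frac{1}{n-1} \sum_{i=1}^{n} e_i^2, \qquad e_i = y_i - \hat{B} x_i$$
Domain variance estimator is directly related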
62
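Relationship to estimating a
ratio, with $\bar{y}_d = \hat{B} = \bar{u} / \bar{x}$ and $\bar{y}_{U_d} = B = \bar{u}_U / \bar{x}_U$
Population mean of X
$\bar{x}_U = N_d / N$
Residual
$e_i = u_i - \hat{B} x_i = y_i - \hat{B} x_i$ if $i \in S_d$
$e_i = 0 - \hat{B}(0) = 0$ if $i \notin S_d$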
63
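Relationship to estimating a
ratio, with $\bar{y}_d = \hat{B} = \bar{u} / \bar{x}$ and $\bar{y}_{U_d} = B = \bar{u}_U / \bar{x}_U$ – 2
Residual variance
$$s_e^2 = \frac{1}{n-1} \sum_{i \in S} (u_i - \hat{B} x_i)^2 = \frac{1}{n-1} \left[ \sum_{i \in S_d} (y_i - \hat{B} x_i)^2 + \sum_{i \notin S_d} 0^2 \right] = \frac{n_d - 1}{n - 1} \cdot \frac{1}{n_d - 1} \sum_{i \in S_d} (y_i - \hat{B} x_i)^2 = \frac{n_d - 1}{n - 1} s_{yd}^2$$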
64
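Estimator for variance of $\bar{y}_d$
$$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{1}{n} \frac{1}{\bar{x}_U^2} s_e^2$$
$$\hat{V}[\bar{y}_d] = \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(\frac{N}{N_d}\right)^2 \frac{n_d - 1}{n - 1} s_{yd}^2$$
where
$$s_{yd}^2 = \frac{1}{n_d - 1} \sum_{i \in S_d} (y_i - \hat{B} x_i)^2$$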
65
Estimating a population
domain total $t_d = N_d \bar{y}_{U_d}$
If we know the domain sizes, $N_d$
$\hat{t}_{yd} = N_d \bar{y}_d$ (if $N_d$ known)
$\hat{V}[\hat{t}_{yd}] = N_d^2 \hat{V}[\bar{y}_d]$
66
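Estimating a population
domain total $t_d = N_d \bar{y}_{U_d}$ – 2
If we do NOT know the domain sizes
$\hat{t}_{yd} = N \bar{u}$ (if $N_d$ unknown)
Standard SRS estimator using u as the variable
$$\hat{V}[\hat{t}_{yd}] = N^2 \hat{V}[\bar{u}] = N^2 \left(1 - \frac{n}{N}\right) \frac{s_u^2}{n}$$
where
$$s_u^2 = \frac{1}{n-1} \sum_{i=1}^{n} (u_i - \bar{u})^2$$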
67
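Boat example – 10
Do not know the domain size, $N_1$
$\hat{t}_{yd} = \hat{t}_{y1} = N \bar{u}$
= 400,000 (0.524667) = 209,867 ≈ 210,000 children
$$\hat{V}[\hat{t}_{y1}] = N^2 \hat{V}[\bar{u}] = N^2 \left(1 - \frac{n}{N}\right) \frac{s_u^2}{n} \approx (400{,}000)^2 \, \frac{1.0394178}{1500} = 110{,}871{,}232 \quad (\text{FPC} \approx 1)$$
$SE(\hat{t}_{y1}) = \sqrt{\hat{V}[\hat{t}_{y1}]} = 10{,}530 \approx 10{,}000$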
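The domain total and its standard error can be recomputed from the combined frequency table of $u_i$. The sketch below is illustrative (the names are not from the slides) and, like the slide, assumes the FPC is close enough to 1 to ignore.

```python
import math

# frequency table of u_i over the whole sample: value -> number of respondents
counts_u = {0: 1104, 1: 139, 2: 166, 3: 63, 4: 19, 5: 5, 6: 3, 8: 1}
N = 400_000
n = sum(counts_u.values())

u_bar = sum(u * c for u, c in counts_u.items()) / n
# sample variance of u with divisor n - 1
s2_u = sum(c * (u - u_bar) ** 2 for u, c in counts_u.items()) / (n - 1)

t_hat = N * u_bar                          # estimated total children in domain 1
v_t_hat = N ** 2 * s2_u / n                # assumption: FPC ignored (about 1)
se_t_hat = math.sqrt(v_t_hat)
print(round(t_hat), round(s2_u, 7), round(se_t_hat))
```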
68
Comparing 2 domain means
Suppose we want to test the hypothesis that two
domain means are equal
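$H_0: \bar{y}_{U_1} = \bar{y}_{U_2}$
$H_1: \bar{y}_{U_1} \neq \bar{y}_{U_2}$
Construct a z-test with Type 1 error rate $\alpha$ (for falsely
rejecting null hypothesis)
Test statistic: $z = \dfrac{\bar{y}_1 - \bar{y}_2}{\sqrt{\hat{V}(\bar{y}_1) + \hat{V}(\bar{y}_2)}}$
Critical value: $z_{\alpha/2}$
Reject $H_0$ if $|z| > z_{\alpha/2}$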
69
Boat example - 10
Large boat owners (d = 1)
$\bar{y}_1 = 1.667373$ children per large boat owner
$SE[\bar{y}_1] = 0.4465287$
Other boat owners (d = 2)
$\bar{y}_2 = 2.501059$ children per other boat owner
$SE[\bar{y}_2] = 0.669793$
70
Boat example - 11
Test whether domain means are equal at $\alpha = 0.05$
Calculate z-statistic
$$z = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\hat{V}(\bar{y}_1) + \hat{V}(\bar{y}_2)}} = \frac{1.667373 - 2.501059}{\sqrt{0.446529^2 + 0.669793^2}} = \frac{-0.833686}{0.804991} = -1.04$$
Critical value $z_{\alpha/2} = z_{0.025} = 1.96$
Apply rejection rule
$|z| = |-1.04| = 1.04 < 1.96 = z_{0.025}$
Fail to reject $H_0$
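The test can be reproduced in a few lines of Python. This is a sketch using the domain means and standard errors quoted on the slides and the critical value 1.96; the variable names are illustrative.

```python
import math

y_bar_1, se_1 = 1.667373, 0.4465287   # large boat owners (d = 1)
y_bar_2, se_2 = 2.501059, 0.669793    # other boat owners (d = 2)
z_crit = 1.96                         # z_{0.025} for a two-sided test at alpha = 0.05

# z-statistic for the difference of two domain means
z = (y_bar_1 - y_bar_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)
reject_h0 = abs(z) > z_crit
print(round(z, 2), reject_h0)
```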
71
Overview
Population parameters
Mean
Total
Proportion (w/ fixed denom)
Ratio
Includes proportion w/ random denominator
Domain mean
Domain total
72
Overview – 2
Estimation strategies
No auxiliary information
Auxiliary information X, no intercept
Y and X positively correlated
Linear relationship passes through origin
Auxiliary information X, intercept
Y and X positively correlated
Linear relationship does not pass through origin
73
Overview – 3
Make a table of population parameters (rows)
by estimation strategy (columns)
In each cell, write down
Estimator for population parameter, $\hat{\theta}$
Estimator for variance of estimated parameter, $\hat{V}(\hat{\theta})$
Residual $e_i$
Notes
Some cells will be blank
Look for relationship between mean and total, and
mean and proportion
Look at how the variance formulas for many of the
estimators have essentially the same form