No Slide Title

Ratio estimation under SRS

Assume
  Absence of nonsampling error
  SRS of size n from a population of size N
Ratio estimation is an alternative to $\bar{y}$, $\hat{t}$, and $\hat{p}$ under SRS; it uses "auxiliary" information (X)
  Sample data: observe $y_i$ and $x_i$
  Population information
    Have $y_i$ and $x_i$ on all individual units, or
    Have summary statistics from the population distribution of X, such as the population mean or total of X
Ratio estimation is also used to estimate a population parameter called a ratio (B)
1
Uses

Estimate a ratio
  Tree volume or bushels per acre
  Per capita income
  Liability to asset ratio
More precise estimator of population parameters
  If X and Y are correlated, can improve upon $\bar{y}$, $\hat{t}$, and $\hat{p}$
Estimating totals when pop size N is unknown
  Avoids need to know N in formula for $\hat{t}$
Obtaining estimates of subsamples
  Domain estimation
Incorporate known information into estimates
  Poststratification
  Adjust for nonresponse
2
Estimating a ratio, B

Population parameter for the ratio: B
$B = \frac{t_y}{t_x} = \frac{\bar{y}_U}{\bar{x}_U}$

Examples



Number of bushels harvested (y) per acre (x)
Number of children (y) per single-parent
household (x)
Total usable weight (y) relative to total shipment
weight (x) for chickens
3
Estimating a ratio
$B = \frac{t_y}{t_x} = \frac{\bar{y}_U}{\bar{x}_U}$



SRS of n observation units
Collect data on y and x for each OU
Natural estimator for B ?
4
Estimating a ratio -2

Estimator for B
$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{\hat{t}_y}{\hat{t}_x} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}$

$\hat{B}$ is a biased estimator for B
$E[\hat{B}] \ne B$
$\hat{B}$ is a ratio of random variables
5
Bias of $\hat{B}$

$E[\hat{B}] - B \approx \left(1 - \frac{n}{N}\right)\frac{1}{n\,\bar{x}_U^2}\left(B S_x^2 - R\,S_x S_y\right) \ne 0$

$R = \mathrm{Corr}(x, y) = \frac{\sum_{i=1}^{N}(x_i - \bar{x}_U)(y_i - \bar{y}_U)}{(N - 1)\,S_x S_y}$
$S_x$ = population standard deviation of x
$S_y$ = population standard deviation of y
6
Bias of Bˆ – 2

Bias is small if





Sample size n is large
Sample fraction n/N is large
$\bar{x}_U$ is large
$S_x$ is small (pop std deviation for x)
High positive correlation between X and Y
(see Lohr p. 67)
7
Estimated variance of
estimator for B

Estimator for $V[\hat{B}]$
$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n\,\bar{x}_U^2}$
where
$s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \hat{B}x_i)^2 = \frac{1}{n-1}\sum_{i=1}^{n} e_i^2$
$e_i = y_i - \hat{B}x_i$

If $\bar{x}_U$ is unknown?
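The estimator and its variance can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `ratio_estimate` and the toy data are hypothetical, and the fallback of using the sample mean of x when the population mean is not supplied follows the summary slide's note about estimating $\bar{x}_U$ with $\bar{x}$.

```python
def ratio_estimate(y, x, N, xbar_U=None):
    """Return (B_hat, V_hat) for the ratio B = t_y / t_x under SRS."""
    n = len(y)
    B_hat = sum(y) / sum(x)
    # Residuals about the line through the origin: e_i = y_i - B_hat * x_i
    e = [yi - B_hat * xi for yi, xi in zip(y, x)]
    s2_e = sum(ei ** 2 for ei in e) / (n - 1)
    # Assumption: if the population mean of x is unknown, plug in the sample mean
    if xbar_U is None:
        xbar_U = sum(x) / n
    V_hat = (1 - n / N) * s2_e / (n * xbar_U ** 2)
    return B_hat, V_hat


# Hypothetical toy data (not from the slides): n = 4 units sampled from N = 20
y = [2.0, 4.0, 6.0, 9.0]
x = [1.0, 2.0, 3.0, 4.0]
B_hat, V_hat = ratio_estimate(y, x, N=20)
print(B_hat, V_hat ** 0.5)
```

Note that the residuals always sum to zero here, because $\hat{B}$ is defined as the ratio of the two sample totals.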
8
Variance of $\hat{B}$

$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n\,\bar{x}_U^2}$
Variance is small if
  sample size n is large
  sample fraction n/N is large
  deviations about line e = y - Bx are small
  correlation between X and Y close to 1
  $\bar{x}_U$ is large
9
Ag example – 1

Frame: 1987 Agricultural Census


Take SRS of 300 counties from 3078 counties to
estimate conditions in 1992
Collect data on y; have data on x for the sample
$y_i$ = total acreage of farms in 1992 in county i
$x_i$ = total acreage of farms in 1987 in county i

Existing knowledge about the population
$\bar{x}_U$ = average acreage of farms per county in US in 1987
= 313,343.283 acres / county
$t_x$ = total acreage of farms in US in 1987
= 964,470,625 acres
10
Ag example – 2

Estimate B = (acres of farms in 1992) / (acres of farms in 1987)
$\sum_{i=1}^{300} y_i$ = 89,369,114 acres
$\sum_{i=1}^{300} x_i$ = 90,586,117 acres
$\hat{B} = \frac{\sum y_i}{\sum x_i}$ = 89,369,114 / 90,586,117 = 0.9866 farm acres in 1992 relative to 1987 farm acres
11
Ag example – 3

$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n\,\bar{x}_U^2}$

Need to calculate variance of the $e_i$'s
12
Ag example – 4


For each county i, calculate $e_i = y_i - \hat{B}x_i$
Coffee Co, AL example
$e_i$ = 175,209 - (0.9866)(179,311) = -1693.00 (using unrounded $\hat{B}$)

Sum of squares for $e_i$
$\sum_{i=1}^{n} e_i^2 = e_1^2 + e_2^2 + \dots + e_{300}^2 = 2.9965166 \times 10^{11}$
$s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n} e_i^2 = \frac{1}{299}\left(2.9965166 \times 10^{11}\right)$ = 1,002,179,462
13
Ag example – 5

$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n\,\bar{x}_U^2} = \left(1 - \frac{300}{3078}\right)\frac{1,002,179,462}{300\,(313,343.283)^2}$ = 0.000030707

$\hat{B}$ = 0.9866 farm acres in 1992 per 1987 farm acre
$SE(\hat{B})$ = 0.0055
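The arithmetic on this slide can be checked directly. The sketch below only plugs the quantities quoted in the Ag example into the variance formula; the variable names are illustrative.

```python
import math

n, N = 300, 3078
xbar_U = 313_343.283   # 1987 population mean, acres per county
s2_e = 1_002_179_462   # residual variance of the e_i's

# V_hat[B_hat] = (1 - n/N) * s_e^2 / (n * xbar_U^2)
V_hat_B = (1 - n / N) * s2_e / (n * xbar_U ** 2)
se_B = math.sqrt(V_hat_B)
print(V_hat_B, se_B)
```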
14
Estimating proportions


If denominator variable is random, use ratio
estimator Bˆ to estimate the proportion p
Example (p. 72)




10 plots under protected oak trees used to assess effect of
feral pigs on native vegetation on Santa Cruz Island, CA
Count live seedlings y and total number of seedlings x per
plot
Y and X correlated due to common environmental factors
Estimate proportion of live seedlings to total number of
seedlings
$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{6.1}{20.6}$ = 0.296 with $SE(\hat{B})$ = 0.032
15
Estimating population mean

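Estimator for $\bar{y}_U$
$\hat{\bar{y}}_r = \hat{B}\,\bar{x}_U = \frac{\bar{y}}{\bar{x}}\,\bar{x}_U = \bar{y}\left(\frac{\bar{x}_U}{\bar{x}}\right)$

"Adjustment factor" $\frac{\bar{x}_U}{\bar{x}}$ for sample mean
  A measure of discrepancy between sample and population information, $\bar{x}$ and $\bar{x}_U$
  Improves precision if X and Y are positively correlated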
16
Underlying model

[Figure: straight line through the origin (0, 0), with x on the horizontal axis and y on the vertical axis]

$y_i = B x_i$ with B > 0
B is a slope
B > 0 indicates X and Y are positively correlated
Absence of intercept implies line must go through origin (0, 0)
17
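Using population mean of X to adjust sample mean

$\hat{\bar{y}}_r = \hat{B}\,\bar{x}_U = \bar{y}\left(\frac{\bar{x}_U}{\bar{x}}\right)$
$\bar{x} < \bar{x}_U \Rightarrow \frac{\bar{x}_U}{\bar{x}} > 1 \Rightarrow$ adjust $\bar{y}$ upward to get better estimate of $\bar{y}_U$
$\bar{x} > \bar{x}_U \Rightarrow \frac{\bar{x}_U}{\bar{x}} < 1 \Rightarrow$ adjust $\bar{y}$ downward
Discrepancy between sample & pop info for X is viewed as evidence that same relative discrepancy exists between $\bar{y}$ and $\bar{y}_U$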
18
Bias of $\hat{\bar{y}}_r$

Ratio estimator for the population mean is biased

$E[\hat{\bar{y}}_r] - \bar{y}_U = E[\hat{B}\,\bar{x}_U] - \frac{\bar{y}_U}{\bar{x}_U}\,\bar{x}_U = \bar{x}_U\left(E[\hat{B}] - B\right) \ne 0$
Rules of thumb for bias of $\hat{B}$ apply
19
Estimator for variance of $\hat{\bar{y}}_r$

$\hat{V}[\hat{\bar{y}}_r] = \bar{x}_U^2\,\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}$
20
Ag example – 6
$\hat{\bar{y}}_r = \hat{B}\,\bar{x}_U$ = (0.9866)(313,343.283)
= 309,133.6 farm acres per county in 1992
21
Ag example - 8
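$\hat{\bar{y}}_r$ = 309,100 farm acres / county in 1992 (rounded)
$\hat{V}[\hat{\bar{y}}_r] = \bar{x}_U^2\,\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}$ = (0.9025)(1,002,179,462 / 300) = 3,014,890.67
$SE[\hat{\bar{y}}_r] = \sqrt{3,014,890.67}$ = 1736 ≈ 1700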
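As a check on the numbers above, the following sketch recomputes the ratio estimate of the mean and its standard error from the sample totals and the residual variance quoted in the Ag example. Variable names are illustrative; the variance comes out slightly different from the slide's 3,014,890.67 because the slide rounds the finite population correction to 0.9025.

```python
import math

n, N = 300, 3078
sum_y, sum_x = 89_369_114, 90_586_117   # sample totals, acres
xbar_U = 313_343.283                    # known 1987 population mean
s2_e = 1_002_179_462                    # residual variance

B_hat = sum_y / sum_x
ybar_r = B_hat * xbar_U                 # ratio estimate of the 1992 mean
V_hat = (1 - n / N) * s2_e / n          # equals xbar_U^2 * V_hat[B_hat]
se = math.sqrt(V_hat)
print(ybar_r, se)
```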
22
Ag example – 9


Expect a linear relationship between
X and Y (Figure 3.1)
Note that sample mean is not equal to
population mean for X
$\bar{x}_U$ = average acreage of farms per county in US in 1987
= 313,343.283 acres / county
$\bar{x}$ = mean acreage of farms per county in US in 1987 for the sample
= 301,953.723 acres / county
23
MSE under ratio estimation

Recall …
  MSE = Variance + Bias$^2$
SRS estimators are unbiased so
  MSE = Variance
Ratio estimators are biased so
  MSE > Variance
Use MSE to compare design/estimation strategies
  EX: compare sample mean under SRS with ratio estimator for pop mean under SRS
24
Sample mean vs. ratio
estimator of mean

$MSE[\hat{\bar{y}}_r]$ is smaller than $MSE[\bar{y}]$ if and only if
$R > \frac{1}{2}\,\frac{CV(x)}{CV(y)}$

For example, if $CV(x) \approx CV(y)$ and $R = \mathrm{Corr}(x, y) > 1/2$, ratio estimation will be better than SRS
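The condition is easy to evaluate once the two coefficients of variation are available. The sketch below is only illustrative: the function name `ratio_beats_srs` is hypothetical, and plugging in sample quantities from the Ag example (means, standard deviations, and r = 0.9958 as quoted on the Ag example slides) in place of the population values is an assumption.

```python
def ratio_beats_srs(r, cv_x, cv_y):
    """True when R > (1/2) * CV(x) / CV(y), the condition on this slide."""
    return r > 0.5 * cv_x / cv_y


# Sample summaries quoted in the Ag example, used as plug-in estimates
xbar, ybar = 301_953.723, 297_897
s_x, s_y, r = 344_830, 344_552, 0.9958

cv_x, cv_y = s_x / xbar, s_y / ybar
threshold = 0.5 * cv_x / cv_y
print(threshold, ratio_beats_srs(r, cv_x, cv_y))
```

With the two CVs nearly equal, the threshold sits close to 1/2, which matches the rule of thumb stated on the slide.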
25
Estimating the MSE


Estimate MSE with sample estimates of bias and variance of estimator
This tends to underestimate MSE
  $V[\hat{B}]$ and $Bias[\hat{B}]$ are approximations
Estimated MSE is less biased if
  $Bias[\hat{B}]$ is small (see earlier slide)
    Large sample size or sampling fraction
    High + correlation for X and Y
  $\bar{x}$ is a precise estimate (small CV for $\bar{x}$)
  We have a reasonably large sample size (n > 30)
26
Ag example – 10
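$\bar{y}$ = 297,897 farm acres per county in 1992
$\widehat{se}[\bar{y}] = \sqrt{\widehat{MSE}[\bar{y}]} = \left[\left(1 - \frac{300}{3078}\right)\frac{344,552^2}{300}\right]^{1/2}$ = 18,898 acres
$\widehat{MSE}[\bar{y}] = \hat{V}[\bar{y}]$ = 357,151,042
$\hat{\bar{y}}_r$ = 309,134 farm acres per county in 1992
$se[\hat{\bar{y}}_r]$ = 1,736 acres
$\widehat{Bias}[\hat{\bar{y}}_r] = \left(1 - \frac{300}{3078}\right)\frac{1}{300\,(301,954)}\left[0.9866\,(344,830)^2 - 0.9958\,(344,552)(344,830)\right] \approx -10$
$\widehat{MSE}[\hat{\bar{y}}_r] = \hat{V}[\hat{\bar{y}}_r] + \widehat{Bias}[\hat{\bar{y}}_r]^2$ = 3,013,696 + 100 = 3,013,796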
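The estimated bias and MSE of the ratio estimator can be reproduced from the summary statistics on this slide. In the sketch below the variable names are illustrative, and the variance is taken as the square of the rounded standard error (1,736), which is an assumption about how the slide's figure was obtained.

```python
n, N = 300, 3078
B_hat, r = 0.9866, 0.9958
xbar = 301_954                 # sample mean of x, standing in for xbar_U
s_x, s_y = 344_830, 344_552    # sample standard deviations
V_hat = 1736 ** 2              # squared standard error of the ratio estimator

# Estimated bias of the ratio estimator of the mean:
# (1 - n/N) * (B s_x^2 - r s_x s_y) / (n * xbar)
bias = (1 - n / N) * (B_hat * s_x ** 2 - r * s_x * s_y) / (n * xbar)
mse = V_hat + bias ** 2
print(bias, mse)
```

The squared bias is a negligible share of the MSE here, so the standard error alone describes the accuracy of the ratio estimator well in this example.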
 
 
27
Estimating population total t

Estimator for t
$\hat{t}_{yr} = \frac{\hat{t}_y}{\hat{t}_x}\,t_x = \hat{B}\,t_x = N\,\hat{\bar{y}}_r$

Is $\hat{t}_{yr}$ biased?

Estimator for $V[\hat{t}_{yr}]$
$\hat{V}[\hat{t}_{yr}] =$
28
Ag example – 11
$\hat{t}_{yr} = \hat{B}\,t_x$ = (0.9866)(964,470,625)
= 951,513,191 farm acres in US in 1992
$\hat{V}[\hat{t}_{yr}] = t_x^2\,\hat{V}[\hat{B}] = N^2\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n} = N^2\,\hat{V}[\hat{\bar{y}}_r]$
= $3078^2$ (0.9025)(1,002,179,462 / 300)
≈ $2.856 \times 10^{13}$
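A short check of the total and its variance, using only the figures quoted in the Ag example (variable names are illustrative):

```python
import math

n, N = 300, 3078
t_x = 964_470_625                     # known 1987 total, acres
B_hat = 89_369_114 / 90_586_117       # unrounded ratio estimate
s2_e = 1_002_179_462                  # residual variance

t_hat = B_hat * t_x                           # ratio estimate of the 1992 total
V_hat = N ** 2 * (1 - n / N) * s2_e / n       # N^2 times the variance of the mean
se_total = math.sqrt(V_hat)
print(t_hat, V_hat, se_total)
```

The standard error of the total is N times the standard error of the ratio estimator of the mean, roughly 5.3 million acres.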
29
Summary of ratio estimation
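$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{\hat{t}_y}{\hat{t}_x} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}$

$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n\,\bar{x}_U^2}$   (est. $\bar{x}_U$ w/ $\bar{x}$ if unknown)
where
$s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n} e_i^2$
$e_i = y_i - \hat{B}x_i$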
30
Summary of ratio estn – 2
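$\hat{\bar{y}}_r = \hat{B}\,\bar{x}_U = \bar{y}\left(\frac{\bar{x}_U}{\bar{x}}\right)$
$\hat{V}[\hat{\bar{y}}_r] = \bar{x}_U^2\,\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}$

$\hat{t}_{yr} = \hat{B}\,t_x = \hat{t}_y\left(\frac{t_x}{\hat{t}_x}\right)$
$\hat{V}[\hat{t}_{yr}] = t_x^2\,\hat{V}[\hat{B}] = N^2\,\hat{V}[\hat{\bar{y}}_r] = N^2\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}$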
31
Regression estimation


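What if relationship between y and x is linear, but does NOT pass through the origin?
Better model in this case is $y = B_0 + B_1 x$
[Figure: straight line with intercept $B_0$ and slope $B_1$, with x on the horizontal axis and y on the vertical axis]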
32
Regression estimation – 2

New estimator is a regression estimator
To estimate $\bar{y}_U$, $\hat{\bar{y}}_{reg}$ is the predicted value from the regression of y on x at $x = \bar{x}_U$
$\hat{\bar{y}}_{reg} = \hat{B}_0 + \hat{B}_1\bar{x}_U = \bar{y} + \hat{B}_1(\bar{x}_U - \bar{x})$

Adjustment factor for sample mean is linear, rather than multiplicative
33
Estimating population mean

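Regression estimator of $\bar{y}_U$
$\hat{\bar{y}}_{reg} = \hat{B}_0 + \hat{B}_1\bar{x}_U = \bar{y} + \hat{B}_1(\bar{x}_U - \bar{x})$

Estimating regression parameters
$\hat{B}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2} = \frac{r\,s_y}{s_x}$
$\hat{B}_0 = \bar{y} - \hat{B}_1\bar{x}$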
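The formulas above translate directly into code. The sketch below is illustrative only: the function name `regression_estimate` and the toy data are hypothetical, chosen so that the points lie exactly on y = 1 + 2x and the result is easy to verify by hand.

```python
def regression_estimate(y, x, xbar_U):
    """Return (B0_hat, B1_hat, ybar_reg) for the regression estimator of the mean."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    s2_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    B1 = s_xy / s2_x               # slope
    B0 = ybar - B1 * xbar          # intercept
    return B0, B1, B0 + B1 * xbar_U


# Hypothetical toy data (not from the slides)
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
B0, B1, ybar_reg = regression_estimate(y, x, xbar_U=3.0)
print(B0, B1, ybar_reg)
```

Here the sample mean of x (2.5) falls below the assumed population mean (3.0), so the estimate is the sample mean of y (6.0) shifted up by the slope times the gap.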
34
Estimating pop mean – 2

Sample variances, correlation, covariance
$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$
$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
$r = \frac{s_{xy}}{s_x s_y}$
35
Bias in regression estimator

$E[\hat{\bar{y}}_{reg}] - \bar{y}_U = -\mathrm{Cov}[\hat{B}_1, \bar{x}] \ne 0$
36
Estimating variance
 
$\hat{V}[\hat{\bar{y}}_{reg}] = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}$
where
$s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n} e_i^2$
$e_i = y_i - (\hat{B}_0 + \hat{B}_1 x_i) = y_i - \hat{y}_i$

Note: This is a different residual than ratio estimation (predicted values differ)
37
Estimating the MSE

Plugging sample estimates into Lohr, equation 3.13:
$\widehat{MSE}[\hat{\bar{y}}_{reg}] = \left(1 - \frac{n}{N}\right)\frac{s_y^2}{n}\,(1 - r^2)$
38
Estimating population total t
$\hat{t}_{yreg} = N\,\hat{\bar{y}}_{reg}$
$\hat{V}[\hat{t}_{yreg}] = N^2\,\hat{V}[\hat{\bar{y}}_{reg}]$
Is regression estimator for t unbiased?
39
Tree example


Goal: obtain a precise estimate of number of dead trees in an area
Sample
  Select n = 25 out of N = 100 plots
  Make field determination of number of dead trees per plot, $y_i$
Population
  For all N = 100 plots, have photo determination on number of dead trees per plot, $x_i$
  Calculate $\bar{x}_U$ = 11.3 dead trees per plot
40
Tree example – 2

Lohr, p. 77-78



Data
Plot of y vs. x
Output from PROC REG


Components for calculating estimators and
estimating the variance of the estimators
We will use PROC SURVEYREG, which will give
you the correct output for regression estimators
41
Tree example – 3

Estimated mean number of dead trees/plot
$\hat{\bar{y}}_{reg}$ = 5.059292 + 0.613274(11.3) = 11.99 dead trees/plot
$\widehat{se}[\hat{\bar{y}}_{reg}] = \sqrt{\left(1 - \frac{25}{100}\right)\frac{5.54834}{25}}$ = 0.4080 ≈ 0.41

Estimated total number of dead trees
$\hat{t}_{yreg}$ = 100(11.99) = 1199 dead trees in area
$\widehat{se}[\hat{t}_{yreg}]$ = 100(0.4080) ≈ 41
42
Tree example – 4

Due to small sample size, Lohr uses t distribution w/ n - 2 degrees of freedom
$t_{\alpha/2,\,df}$, $\alpha$ = .05, df = n - 2 = 23
so $t_{.025,\,23}$ = 2.07
Half-width for 95% CI
$t_{\alpha/2,\,n-2}\;\widehat{se}[\hat{t}_{yreg}]$ = 2.07(40.80) = 84.45

Approx 95% CI for $t_y$ is (1115, 1283) dead trees
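The tree example can be retraced step by step from the quantities quoted on the slides (the fitted coefficients, the residual variance 5.54834, and the critical value 2.07). The sketch below uses illustrative variable names and keeps full precision throughout, so its interval endpoints can differ from the slide's by a fraction of a tree, since the slide rounds the total to 1199 before forming the interval.

```python
import math

n, N = 25, 100
B0, B1 = 5.059292, 0.613274    # fitted intercept and slope quoted on the slide
xbar_U = 11.3                  # photo-based population mean, dead trees per plot
s2_e = 5.54834                 # residual variance quoted on the slide
t_crit = 2.07                  # t_{.025, 23} as quoted on the slide

ybar_reg = B0 + B1 * xbar_U
se_mean = math.sqrt((1 - n / N) * s2_e / n)
t_hat = N * ybar_reg
se_total = N * se_mean
half_width = t_crit * se_total
ci = (t_hat - half_width, t_hat + half_width)
print(ybar_reg, se_mean, t_hat, half_width, ci)
```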
43
Related estimators

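Ratio estimator
  $B_0 = 0$ ⟹ ratio model
  Ratio estimator corresponds to regression estimator with no intercept
Difference estimation
  $B_1 = 1$ ⟹ slope is assumed to be 1
[Figure: line $y = B_0 + B_1 x$ with intercept $B_0$ and slope $B_1$, with x on the horizontal axis and y on the vertical axis]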
44
Domain estimation under SRS


Usually interested in estimates and inferences
for subpopulations, called domains
If we have not used stratification to set the
sample size for each domain, then we should
use domain estimation


We will assume SRS for this discussion
If we use stratified sampling with strata =
domains, then use stratum estimators (Ch 4)

To use stratification, need to know domain
assignment for each unit in the sampling frame
prior to sampling
45
Stratification vs.
domain estimation

In stratified random sampling




Define sample size in each stratum before
collecting data
Sample size in stratum h is fixed, or known
In other words, the sample size nh is the same for
each sample selected under the specified design
In domain estimation



nd = sample size in domain d is random
Don’t know nd until after the data have been
collected
The value of nd changes from sample to sample
46
Population partitioned
into domains
[Figure: population partitioned into D domains, labeled d = 1 (Domain #1), d = 2, ..., d = D (Domain D)]


Recall U = index set for population = {1, 2, …, N }
Domain index set for domain d = 1, 2, …, D


Ud = {1, 2, …, Nd } where Nd = number of OUs in domain d in
the population
In sample of size n


nd = number of units from domain d that are in the sample
Sd = index set for sample belonging to domain d
47
Boat owner example

Population
  N = 400,000 boat owners (currently licensed)
Sample
  n = 1,500 owners selected using SRS
Divide universe (population) into 2 domains
  d = 1: own open motor boat > 16 ft. (large boat)
  d = 2: do not own this type of boat
Of the n = 1500 sample owners:
  n1 = 472 owners of open motor boat > 16 ft.
  n2 = 1028 owners do not own this kind of boat
48
New population parameters


Domain mean
$\bar{y}_{U_d} = \frac{1}{N_d}\sum_{i \in U_d} y_i$
Domain total
$t_{U_d} = \sum_{i \in U_d} y_i$
$i \in U_d$ means "Unit i belongs to domain d"
$i \notin U_d$ means "Unit i does NOT belong to domain d"
49
Boat owner example - 2

Estimate population domain mean



Estimate the average number of children
for boat owners from domain 1
Estimate proportion of boat owners from
domain 1 who have children
Estimate population domain total

Estimate the total number of children for
large boat owners (domain 1)
50
New population parameter – 2

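Ratio form of population mean
$\bar{y}_{U_d} = \frac{\sum_{i \in U_d} y_i}{N_d} = \frac{\sum_{i=1}^{N} u_i / N}{\sum_{i=1}^{N} x_i / N} = \frac{\bar{u}_U}{\bar{x}_U} = B$, with $\bar{x}_U = N_d / N$

Numerator variable
$u_i = y_i$ if $i \in U_d$; $u_i = 0$ if $i \notin U_d$
Denominator variable
$x_i = 1$ if $i \in U_d$; $x_i = 0$ if $i \notin U_d$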
51
Boat owner example - 3

Estimate mean number of children for
owners from domain 1
$y_i$ = number of children for owner i
$u_i = y_i$ if owner $i \in U_1$ (domain 1); $u_i = 0$ otherwise (not in domain 1)
$x_i = 1$ if owner $i \in U_1$; $x_i = 0$ otherwise
Both variables apply to the whole population, with zero values for OUs that are not in domain 1
52
Boat example – 4
Owner (i)   Domain (d_i)   # Kids (y_i)   Num. (u_i)   Den. (x_i)
1           1              3              3            1
2           1              2              2            1
3           2              5              0            0
4           1              0              0            1
5           2              0              0            0
6           2              1              0            0
7           1              1              1            1
8           2              2              0            0
…
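Building the numerator and denominator variables is mechanical once each unit's domain is known. The sketch below uses the eight owners listed in the table; the helper name `domain_variables` is hypothetical, and the resulting mean is for these eight rows only, not for the full sample of 1,500.

```python
# (owner i, domain d_i, children y_i) for the eight owners listed in the table
owners = [(1, 1, 3), (2, 1, 2), (3, 2, 5), (4, 1, 0),
          (5, 2, 0), (6, 2, 1), (7, 1, 1), (8, 2, 2)]


def domain_variables(rows, d):
    """Return (u, x): u_i = y_i and x_i = 1 inside domain d, both 0 outside it."""
    u = [y if dom == d else 0 for _, dom, y in rows]
    x = [1 if dom == d else 0 for _, dom, _ in rows]
    return u, x


u, x = domain_variables(owners, d=1)
ybar_d = sum(u) / sum(x)   # ratio of totals = mean of y within domain 1
print(u, x, ybar_d)
```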
53
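Estimator for population domain mean $\bar{y}_{U_d}$
$\bar{y}_d = \hat{B} = \frac{\bar{u}}{\bar{x}} = \frac{\frac{1}{n}\sum_{i=1}^{n} u_i}{\frac{1}{n}\sum_{i=1}^{n} x_i} = \frac{\sum_{i \in S_d} y_i}{n_d}$
since $\sum_{i=1}^{n} u_i = \sum_{i \in S_d} y_i$ and $\sum_{i=1}^{n} x_i = n_d$
= sample mean of observations in domain d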
54
Boat example – 5

Domain 1 data
Number of Children   Number of Respondents
0                    76
1                    139
2                    166
3                    63
4                    19
5                    5
6                    3
8                    1
Total                472
55
Boat example – 6

Domain 1 and domain 2 data combined
u_i     Number of Respondents
0       1104
1       139
2       166
3       63
4       19
5       5
6       3
8       1
Total   1500
1104 zeros = 76 zeros from domain 1 + 1028 zeros from domain 2
56
Boat example – 7

Two ways of estimating mean

Whole data set
$\bar{u} = \frac{1}{n}\sum_{i=1}^{n} u_i = \frac{1}{1500}\left[1028(0) + 76(0) + 139(1) + 166(2) + \dots + 1(8)\right] = \frac{787}{1500}$ = 0.524667
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{n_1}{n} = \frac{472}{1500}$ = 0.314667
$\hat{B} = \frac{\bar{u}}{\bar{x}} = \frac{0.524667}{0.314667}$ = 1.67 children per large boat owner

Domain 1 data only
$\bar{y}_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} y_i = \frac{787}{472}$ = 1.67 children per large boat owner
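Both routes to the domain mean can be verified from the frequency table for domain 1. The sketch below uses illustrative variable names and shows that the ratio of the whole-sample means equals the plain mean of the domain 1 observations.

```python
n = 1500
# Frequency table for domain 1: number of children -> number of respondents
domain1_counts = {0: 76, 1: 139, 2: 166, 3: 63, 4: 19, 5: 5, 6: 3, 8: 1}

n1 = sum(domain1_counts.values())
total_children = sum(k * c for k, c in domain1_counts.items())

# Whole data set: u_i = y_i in domain 1 and 0 elsewhere; x_i is the domain indicator
u_bar = total_children / n
x_bar = n1 / n
B_hat = u_bar / x_bar

# Domain 1 data only
y_bar_1 = total_children / n1
print(n1, total_children, B_hat, y_bar_1)
```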
57
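Estimator for variance of $\bar{y}_d$
$\bar{y}_d = \hat{B} = \frac{\bar{u}}{\bar{x}}$
$\hat{V}(\bar{y}_d) = \left(1 - \frac{n}{N}\right)\frac{1}{n}\left(\frac{N}{N_d}\right)^2\left(\frac{n_d - 1}{n - 1}\right)s_{yd}^2$
where $s_{yd}^2 = \frac{1}{n_d - 1}\sum_{i \in S_d}(y_i - \hat{B}x_i)^2$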
58
Boat example – 8
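$\hat{V}(\bar{y}_d) = \left(1 - \frac{n}{N}\right)\frac{1}{n}\left(\frac{N}{N_d}\right)^2\left(\frac{n_d - 1}{n - 1}\right)s_{yd}^2$
$1 - \frac{n}{N} = 1 - \frac{1,500}{400,000} \approx 1$, so can ignore FPC
$\frac{N}{N_1} = \frac{400,000}{?}$: estimate with $\frac{n}{n_1} = \frac{1500}{472}$ = 3.177966
$\frac{n_1 - 1}{n - 1} \approx \frac{n_1}{n} = \frac{472}{1500} = \frac{1}{3.177966}$
$s_{y1}^2 = \frac{1}{n_1 - 1}\sum_{i \in S_1}(y_i - \hat{B}x_i)^2$ = 94.111078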
59
Boat example – 9
$\hat{V}(\bar{y}_1) = \left(1 - \frac{n}{N}\right)\frac{1}{n}\left(\frac{N}{N_1}\right)^2\left(\frac{n_1 - 1}{n - 1}\right)s_{y1}^2 \approx \frac{1}{n}\left(\frac{n}{n_1}\right)^2\left(\frac{n_1}{n}\right)s_{y1}^2 = \frac{s_{y1}^2}{n_1} = \frac{94.111078}{472}$ = 0.199388
$\bar{y}_1$ = 1.667373 ≈ 1.67 children per large boat owner
$SE(\bar{y}_1)$ = 0.4465287 ≈ 0.45
60
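Approximation for estimator of variance of $\bar{y}_d$
$\hat{V}(\bar{y}_d) \approx \left(1 - \frac{n}{N}\right)\frac{s_{yd}^2}{n_d}$
assuming $\frac{N_d}{N} \approx \frac{n_d}{n}$ and $\frac{n_d - 1}{n - 1} \approx \frac{n_d}{n}$
where $s_{yd}^2 = \frac{1}{n_d - 1}\sum_{i \in S_d}(y_i - \hat{B}x_i)^2$ (uses domain d data only)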
61
Estimated variance of Bˆ

Estimator for $V[\hat{B}]$
$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n\,\bar{x}_U^2}$
where
$s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \hat{B}x_i)^2 = \frac{1}{n-1}\sum_{i=1}^{n} e_i^2$
$e_i = y_i - \hat{B}x_i$

Domain variance estimator is directly related
62
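Relationship to estimating a ratio with $\bar{y}_d = \hat{B} = \frac{\bar{u}}{\bar{x}}$ and $\bar{y}_{U_d} = B = \frac{\bar{u}_U}{\bar{x}_U}$

Population mean of X
$\bar{x}_U = \frac{N_d}{N}$
Residual
$e_i = u_i - \hat{B}x_i = y_i - \hat{B}x_i$ if $i \in S_d$
$e_i = 0 - \hat{B}\cdot 0 = 0$ if $i \notin S_d$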
63
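Relationship to estimating a ratio with $\bar{y}_d = \hat{B} = \frac{\bar{u}}{\bar{x}}$ and $\bar{y}_{U_d} = B = \frac{\bar{u}_U}{\bar{x}_U}$ - 2

Residual variance
$s_e^2 = \frac{1}{n-1}\sum_{i \in S}(u_i - \hat{B}x_i)^2 = \frac{1}{n-1}\left[\sum_{i \in S_d}(y_i - \hat{B}x_i)^2 + \sum_{i \notin S_d} 0^2\right] = \frac{n_d - 1}{n - 1}\cdot\frac{1}{n_d - 1}\sum_{i \in S_d}(y_i - \hat{B}x_i)^2 = \frac{n_d - 1}{n - 1}\,s_{yd}^2$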
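The identity between the whole-sample residual variance and the within-domain variance can be confirmed numerically. The sketch below is illustrative: the function name is hypothetical and the data are the eight owners from the earlier boat example table, with domain 1 as the domain of interest.

```python
def residual_variance_identity(y, in_domain):
    """Return (s_e^2, (n_d - 1)/(n - 1) * s_yd^2); the two should agree."""
    n = len(y)
    u = [yi if d else 0.0 for yi, d in zip(y, in_domain)]
    x = [1.0 if d else 0.0 for d in in_domain]
    n_d = int(sum(x))
    B_hat = sum(u) / sum(x)
    # Whole-sample residuals: zero for units outside the domain
    s2_e = sum((ui - B_hat * xi) ** 2 for ui, xi in zip(u, x)) / (n - 1)
    # Within-domain variance: x_i = 1 there, so the residual is y_i - B_hat
    s2_yd = sum((yi - B_hat) ** 2 for yi, d in zip(y, in_domain) if d) / (n_d - 1)
    return s2_e, (n_d - 1) / (n - 1) * s2_yd


y = [3, 2, 5, 0, 0, 1, 1, 2]
in_domain = [True, True, False, True, False, False, True, False]
lhs, rhs = residual_variance_identity(y, in_domain)
print(lhs, rhs)
```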
64
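Estimator for variance of $\bar{y}_d$
$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right)\frac{1}{n}\,\frac{1}{\bar{x}_U^2}\,s_e^2$
$\hat{V}(\bar{y}_d) = \left(1 - \frac{n}{N}\right)\frac{1}{n}\left(\frac{N}{N_d}\right)^2\left(\frac{n_d - 1}{n - 1}\right)s_{yd}^2$
where $s_{yd}^2 = \frac{1}{n_d - 1}\sum_{i \in S_d}(y_i - \hat{B}x_i)^2$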
65
Estimating a population domain total $t_d = N_d\,\bar{y}_{U_d}$

If we know the domain sizes, $N_d$
$\hat{t}_{yd} = N_d\,\bar{y}_d$ if $N_d$ known
$\hat{V}(\hat{t}_{yd}) = N_d^2\,\hat{V}(\bar{y}_d)$
66
Estimating a population domain total $t_d = N_d\,\bar{y}_{U_d}$ - 2

If we do NOT know the domain sizes
$\hat{t}_{yd} = N\,\bar{u}$ if $N_d$ unknown
Standard SRS estimator using u as the variable
$\hat{V}(\hat{t}_{yd}) = N^2\,\hat{V}(\bar{u}) = N^2\left(1 - \frac{n}{N}\right)\frac{s_u^2}{n}$
where
$s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n}(u_i - \bar{u})^2$
67
Boat example – 10

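Do not know the domain size, $N_1$
$\hat{t}_{yd} = \hat{t}_{y1} = N\,\bar{u}$
= 400,000(0.524667) = 209,867 ≈ 210,000 children
$\hat{V}(\hat{t}_{y1}) = N^2\,\hat{V}(\bar{u}) = N^2\left(1 - \frac{n}{N}\right)\frac{s_u^2}{n} \approx 400,000^2\left(\frac{1.0394178}{1500}\right)$ = 110,871,232 (FPC ignored)
$SE(\hat{t}_{y1}) = \sqrt{\hat{V}(\hat{t}_{y1})}$ = 10,530 ≈ 10,000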
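The figures on this slide follow from the combined frequency table of u. The sketch below uses illustrative variable names and computes the variance both without the finite population correction, as the slide's number does, and with it, to show how little the correction matters at this sampling fraction.

```python
import math

n, N = 1500, 400_000
# Combined frequency table: value of u_i -> number of respondents
u_counts = {0: 1104, 1: 139, 2: 166, 3: 63, 4: 19, 5: 5, 6: 3, 8: 1}

u_bar = sum(k * c for k, c in u_counts.items()) / n
s2_u = sum(c * (k - u_bar) ** 2 for k, c in u_counts.items()) / (n - 1)

t_hat = N * u_bar                       # estimated total children, domain 1
V_no_fpc = N ** 2 * s2_u / n            # FPC ignored
V_fpc = (1 - n / N) * V_no_fpc          # FPC applied
print(t_hat, s2_u, math.sqrt(V_no_fpc), math.sqrt(V_fpc))
```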
68
Comparing 2 domain means

Suppose we want to test the hypothesis that two
domain means are equal

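$H_0: \bar{y}_{U_1} = \bar{y}_{U_2}$
$H_1: \bar{y}_{U_1} \ne \bar{y}_{U_2}$

Construct a z-test with Type 1 error rate $\alpha$ (for falsely rejecting null hypothesis)
Test statistic: $z = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\hat{V}(\bar{y}_1) + \hat{V}(\bar{y}_2)}}$
Critical value: $z_{\alpha/2}$
Reject $H_0$ if $|z| > z_{\alpha/2}$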
69
Boat example - 10

Large boat owners (d = 1)
$\bar{y}_1$ = 1.667373 children per large boat owner
$SE(\bar{y}_1)$ = 0.4465287

Other boat owners (d = 2)
$\bar{y}_2$ = 2.501059 children per other boat owner
$SE(\bar{y}_2)$ = 0.669793
70
Boat example - 11


Test whether domain means are equal at $\alpha$ = 0.05
Calculate z-statistic
$z = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\hat{V}(\bar{y}_1) + \hat{V}(\bar{y}_2)}} = \frac{1.667373 - 2.501059}{\sqrt{0.446529^2 + 0.669793^2}} = \frac{-0.833686}{0.804991}$ = -1.04
Critical value $z_{\alpha/2} = z_{0.025}$ = 1.96
Apply rejection rule
  |z| = |-1.04| = 1.04 < 1.96 = $z_{0.025}$
  Fail to reject $H_0$
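The test can be written as a small function. This is a minimal sketch with a hypothetical helper name, `domain_z_test`; it takes the two domain means and standard errors quoted on the slides and uses 1.96 as the two-sided critical value for α = 0.05.

```python
import math


def domain_z_test(ybar1, se1, ybar2, se2, z_crit=1.96):
    """Return (z, reject) for H0: the two domain means are equal."""
    z = (ybar1 - ybar2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return z, abs(z) > z_crit


z, reject = domain_z_test(1.667373, 0.4465287, 2.501059, 0.669793)
print(round(z, 2), reject)
```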
71
Overview

Population parameters




Mean
Total
Proportion (w/ fixed denom)
Ratio



Includes proportion w/ random denominator
Domain mean
Domain total
72
Overview – 2

Estimation strategies



No auxiliary information
Auxiliary information X, no intercept

Y and X positively correlated

Linear relationship passes through origin
Auxiliary information X, intercept

Y and X positively correlated

Linear relationship does not pass through origin
73
Overview – 3


Make a table of population parameters (rows)
by estimation strategy (columns)
In each cell, write down




Estimator for population parameter $\hat{\theta}$
Estimator for variance of estimated parameter $\hat{V}(\hat{\theta})$
Residual ei
Notes



Some cells will be blank
Look for relationship between mean and total, and
mean and proportion
Look at how the variance formulas for many of the estimators are essentially the same form
74