La Stima della Varianza della Stima dell’Incidenza di
Enhancing Small Area Estimation Methods
Applications to Istat’s Survey Data
Ranalli M.G. ~ Università di Perugia
D’Alo’ M., Di Consiglio L., Falorsi S., Solari F. ~ Istat
Pratesi M., Salvati N. ~ Università di Pisa
Q2008 ~ Rome, July 11th
OUTLINE
Italian Labour Force Survey
Standard small area estimators for LFS
Small area estimators that incorporate spatial information
Model based direct estimator (MBDE)
Semi-parametric models (based on p-splines)
Experimental study
Analysis of results
Final remarks
Labour Force Survey description
The Labour Force Survey (LFS) is a quarterly two-stage survey with partial
overlap of sampling units according to a 2-2-2 rotation scheme.
In each province the municipalities are classified as Self-Representing
Areas (SRAs) or Non Self-Representing Areas (NSRAs).
From each SRA a sample of households is selected.
In NSRAs the sample is based on a stratified two-stage sampling design.
The municipalities are the Primary Sampling Units (PSUs), while the
households are the Secondary Sampling Units (SSUs).
Each quarterly sample involves about 1,350 municipalities and 200,000
individuals.
Small area estimation on LFS
■ Since 2000, ISTAT has disseminated yearly LFS estimates of employed and
unemployed counts for the 784 Local Labour Market Areas (LLMAs).
■ LLMAs are unplanned domains obtained as clusters of municipalities
cutting across provinces, which are the finest planned domains of the LFS.
■ The direct estimates are unstable because LLMA sample sizes are very small
(more than 100 LLMAs have zero sample size), so SAE methods are necessary.
■ Until 2003, a design-based composite-type estimator was adopted.
■ Since 2004, after the redesign of the LFS sampling strategy, a unit-level
EBLUP estimator with spatially autocorrelated random area effects has been
used.
Standard small area estimators – design based
Direct and GREG estimator
The direct estimator is given by
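$$\hat{Y}_d^{D} = \frac{1}{N_d} \sum_{i \in s_d} w_i y_i$$
The GREG estimator is based on the standard linear model:
$$y_{id} = x_{id}^T \beta + \varepsilon_{id}, \qquad E(\varepsilon_{id}) = 0, \qquad \mathrm{var}(\varepsilon_{id}) = \sigma^2$$
and can be expressed as an adjustment of the direct estimator
for differences between the sample and population area means of covariates
$$\hat{Y}_d^{GREG} = \hat{Y}_d^{D} + \left(\bar{X}_d - \hat{\bar{X}}_d^{D}\right)^T \hat{\beta}_w$$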
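As a rough illustration of the two design-based estimators on this slide, the sketch below computes a direct and a GREG estimate for one area on toy data. The function names, the toy numbers and the use of weighted least squares for the regression coefficients are assumptions made for the example, not details taken from the LFS implementation.

```python
import numpy as np

def direct_mean(y, w, N_d):
    """Direct estimator: weighted sample total divided by the area size N_d."""
    return np.sum(w * y) / N_d

def weighted_ls(y, X, w):
    """Weighted least squares coefficients (an assumed choice for beta_w)."""
    XtW = X.T * w  # scales the contribution of each sampled unit by its weight
    return np.linalg.solve(XtW @ X, XtW @ y)

def greg_mean(y, X, w, N_d, Xbar_pop, beta_w):
    """GREG: direct estimate plus a regression adjustment for the gap between
    population covariate means and their direct estimates."""
    y_dir = direct_mean(y, w, N_d)
    x_dir = (w @ X) / N_d  # direct estimate of the covariate means
    return y_dir + (Xbar_pop - x_dir) @ beta_w

# Toy area (hypothetical numbers): intercept plus one covariate, y exactly linear in x.
x = np.array([1.0, 2.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x
w = np.array([10.0, 10.0, 15.0, 15.0])   # sampling weights summing to N_d
N_d = 50.0
Xbar_pop = np.array([1.0, 3.5])          # known population means of (1, x)

beta_w = weighted_ls(y, X, w)
y_direct = direct_mean(y, w, N_d)
y_greg = greg_mean(y, X, w, N_d, Xbar_pop, beta_w)
print(y_direct, y_greg)
```

Because y is exactly linear in x here, the GREG estimate equals the model value at the population covariate means, whatever the sample imbalance in x; the direct estimate does not have this property.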
Standard small area estimators – model based
Unit level Synthetic and EBLUP
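The Synthetic estimator assumes a standard linear mixed model with unit-specific auxiliary variables, random area-specific effects and errors
independently normally distributed
$$y_{id} = x_{id}^T \beta + u_d + e_{id}, \qquad u_d \sim \text{iid } N(0, \sigma_u^2), \qquad e_{id} \sim \text{iid } N(0, \sigma_e^2)$$
and is given by
$$\hat{Y}_d^{SI} = \bar{X}_d^T \hat{\beta}$$
The EBLUP estimator assumes the same model but is given by
$$\hat{Y}_d^{EB} = \bar{X}_d^T \hat{\beta} + \hat{u}_d = \frac{1}{N_d} \sum_{i \in U_d} \hat{y}_i$$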
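The difference between the synthetic and EBLUP predictors can be sketched as follows. The example assumes that the regression coefficients and the two variance components are already available (in practice they would be estimated, for instance by REML), and it uses the standard shrinkage form of the predicted area effect under a random-intercept model; the function names and numbers are illustrative only.

```python
import numpy as np

def synthetic_mean(Xbar_pop, beta):
    """Synthetic predictor: regression prediction at the population covariate means."""
    return Xbar_pop @ beta

def eblup_mean(y_d, X_d, Xbar_pop, beta, sigma2_u, sigma2_e):
    """EBLUP-type predictor: synthetic part plus the predicted area effect.
    beta, sigma2_u and sigma2_e are assumed given (plug-in estimates)."""
    n_d = len(y_d)
    if n_d == 0:
        # Non-sampled area: the predicted area effect is zero.
        return synthetic_mean(Xbar_pop, beta)
    gamma_d = sigma2_u / (sigma2_u + sigma2_e / n_d)        # shrinkage factor
    u_hat = gamma_d * (y_d.mean() - X_d.mean(axis=0) @ beta)
    return Xbar_pop @ beta + u_hat

# Toy area (hypothetical numbers).
beta = np.array([0.5, 0.1])
X_d = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_d = np.array([1.0, 0.0, 1.0, 1.0])
Xbar_pop = np.array([1.0, 2.0])

y_syn = synthetic_mean(Xbar_pop, beta)
y_eb = eblup_mean(y_d, X_d, Xbar_pop, beta, sigma2_u=1.0, sigma2_e=4.0)
y_eb_empty = eblup_mean(np.array([]), np.empty((0, 2)), Xbar_pop, beta, 1.0, 4.0)
print(y_syn, y_eb, y_eb_empty)
```

The shrinkage factor moves towards 1 as the area sample size grows, so the EBLUP leans on the area's own data where there is enough of it and falls back on the synthetic prediction where there is none.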
Enhanced small area estimators
1. Unit level EBLUP with spatial correlation of area effects
The EBLUP-S estimator is based on the following unit level linear mixed model:
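$$y_{id} = x_{id}^T \beta + u_d + e_{id}$$
$$u \sim MN(0, \sigma_u^2 A), \qquad e \sim MN(0, \sigma_e^2 I_N)$$
The matrix A depends on the distances among the areas and on an unknown
parameter $\alpha$ connected to the spatial correlation coefficient among the areas.
$$A = \{a_{dd'}\}, \qquad a_{dd'} = \left[1 + \delta_{dd'} \exp\!\left(\frac{\mathrm{dist}(d, d')}{\alpha}\right)\right]^{-1}, \qquad \delta_{dd'} = \begin{cases} 0 & \text{if } d = d' \\ 1 & \text{otherwise} \end{cases}$$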
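A minimal sketch of how such a matrix could be built, assuming the elements have the form a_dd' = 1 / (1 + delta_dd' * exp(dist(d, d') / alpha)), with delta_dd' equal to 0 on the diagonal and 1 elsewhere. The Euclidean distance between area centroids, the symbol alpha and the coordinates are assumptions made for the example.

```python
import numpy as np

def spatial_matrix(coords, alpha):
    """Matrix A for the area effects: 1 on the diagonal and
    1 / (1 + exp(dist / alpha)) off the diagonal (assumed form)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # Euclidean distances between centroids
    delta = 1.0 - np.eye(len(coords))          # 0 if d == d', 1 otherwise
    return 1.0 / (1.0 + delta * np.exp(dist / alpha))

# Three hypothetical area centroids at distances 5 and 10 from the first one.
coords = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
A = spatial_matrix(coords, alpha=5.0)
print(A)
```

Under this form the off-diagonal elements decay with distance and never exceed 0.5, while alpha controls how quickly the dependence between areas dies out.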
Enhanced small area estimators
2. Model Based Direct Estimator (Chambers & Chandra, 2006)
The MBD estimator is based on a unit level linear mixed model and is given by
$$\hat{Y}_d^{MBD} = \frac{\sum_{i \in s_d} w_i^m y_i}{\sum_{i \in s_d} w_i^m}$$
where the weights are such that $\hat{Y} = \sum_{i \in s} w_i^m y_i$ is the (E)BLUP of $Y = \sum_{i \in U} y_i$ under the model (Royall, 1976).
Calibrated with respect to the total of x.
Reduces bias vs EBLUP
Does not allow estimation for non-sampled areas
Less efficient than EBLUP
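The calibration property listed above can be checked numerically. The sketch below derives sample weights from the usual BLUP of a population total under a linear model with known covariance matrix, then forms the weighted-ratio MBD estimate for one area. The variance components are treated as known, and the tiny two-area population, the function names and the numbers are all hypothetical.

```python
import numpy as np

def blup_weights(X_s, x_tot, V_ss, V_sr):
    """Weights w such that sum(w * y_s) is the BLUP of the population total,
    given the sample covariance V_ss and the sample/non-sample covariance V_sr."""
    n = X_s.shape[0]
    V_inv = np.linalg.inv(V_ss)
    H = np.linalg.solve(X_s.T @ V_inv @ X_s, X_s.T @ V_inv)   # GLS projection, p x n
    t_r = x_tot - X_s.sum(axis=0)                              # covariate totals, non-sampled units
    ones_r = np.ones(V_sr.shape[1])
    return np.ones(n) + H.T @ t_r + (np.eye(n) - H.T @ X_s.T) @ V_inv @ V_sr @ ones_r

def mbd_mean(w, y, area, d):
    """MBD estimate for area d: weighted mean of the sampled values in that area."""
    in_d = area == d
    return np.sum(w[in_d] * y[in_d]) / np.sum(w[in_d])

# Two areas, two sampled and two non-sampled units each (hypothetical data).
area_s = np.array([0, 0, 1, 1])
area_r = np.array([0, 0, 1, 1])
x_s = np.array([1.0, 2.0, 3.0, 4.0])
x_r = np.array([1.5, 2.5, 3.5, 4.5])
X_s = np.column_stack([np.ones(4), x_s])
x_tot = X_s.sum(axis=0) + np.array([4.0, x_r.sum()])   # population totals of (1, x)

# Random-intercept covariance structure with variance components assumed known.
sigma2_u, sigma2_e = 1.0, 4.0
V_ss = sigma2_e * np.eye(4) + sigma2_u * (area_s[:, None] == area_s[None, :])
V_sr = sigma2_u * (area_s[:, None] == area_r[None, :])

y_s = np.array([0.0, 1.0, 1.0, 0.0])
w_m = blup_weights(X_s, x_tot, V_ss, V_sr)
mbd_0 = mbd_mean(w_m, y_s, area_s, 0)
print(w_m, mbd_0)
```

The weights reproduce the population totals of the covariates, including the population size through the intercept column. Since the estimate only uses sampled units of the area, it is undefined for an area with no sample, which is the limitation noted on the slide.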
Enhanced small area estimators
3. Nonparametric EBLUP (Opsomer et al., 2008)
$$y_{id} = x_{id}^T \beta + f(z_{1id}) + f(z_{2id}, z_{3id}) + u_d + e_{id}$$
$$u_d \sim \text{iid } N(0, \sigma_u^2), \qquad e_{id} \sim \text{iid } N(0, \sigma_e^2)$$
The literature offers many nonparametric regression methods (kernel,
local polynomial, wavelets…), but they are difficult to incorporate in a small area model
Methods based on penalized splines (Eilers and Marx, 1996; Ruppert et al.,
2003) can be estimated by means of mixed models, which makes them a promising candidate for
SAE methods
Great flexibility in the definition of the model
Estimable with existing software using REML
Hard to estimate efficiency and to test the significance of terms (via
bootstrap?)
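The link between penalized splines and mixed models can be sketched on a single covariate. The example uses a truncated linear basis, treats the spline coefficients as the "random" part and fits by penalized least squares. It is a simplification: the smoothing parameter lam (playing the role of the ratio between error variance and spline-coefficient variance) is fixed by hand rather than estimated by REML, the area effects are left out, and the knots and data are made up.

```python
import numpy as np

def truncated_linear_basis(z, knots):
    """Spline design matrix with one column (z - knot)_+ per knot."""
    return np.maximum(z[:, None] - knots[None, :], 0.0)

def fit_pspline(y, z, knots, lam):
    """Penalized least squares fit: intercept and slope are unpenalized
    (fixed effects); spline coefficients are shrunk like random effects."""
    X = np.column_stack([np.ones_like(z), z])
    Z = truncated_linear_basis(z, knots)
    C = np.hstack([X, Z])
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))   # penalty acts on spline coefficients only
    coef = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
    return coef, C @ coef

z = np.linspace(0.0, 1.0, 50)
knots = np.linspace(0.1, 0.9, 9)

# A linear signal is reproduced exactly, because the linear part is not penalized.
y_lin = 1.0 + 2.0 * z
_, fit_lin = fit_pspline(y_lin, z, knots, lam=1.0)

# For a curved signal, a small penalty follows the curve; a large one approaches a straight line.
y_sin = np.sin(2.0 * np.pi * z)
_, fit_flex = fit_pspline(y_sin, z, knots, lam=1e-3)
_, fit_stiff = fit_pspline(y_sin, z, knots, lam=1e6)
rss_flex = np.sum((y_sin - fit_flex) ** 2)
rss_stiff = np.sum((y_sin - fit_stiff) ** 2)
print(rss_flex, rss_stiff)
```

Because the penalty has the same form as a normal prior on the spline coefficients, the same fit can be obtained from mixed-model software, with the smoothing parameter estimated alongside the other variance components.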
LFS empirical study
A simulation study on the LFS has been carried out to estimate the
unemployment rate at LLMA level.
500 two-stage LFS samples have been drawn from the 2001 census data set.
The performance of the methods has been evaluated for the estimation of
the unemployment rate in the 127 LLMAs belonging to the geographical area
“Center of Italy”.
GREG, Synthetic and EBLUP small area estimators have been applied
considering two different sets of auxiliary variables:
Case A - LFS real covariates = sex by 14 age classes + employment
indicator at previous census;
Case B - LFS real covariates + geographic coordinates (latitude and
longitude of the municipality the sampling unit belongs to).
Enhanced Small area estimators
■ Spatial EBLUP: a spatial correlation structure in the variance matrix of the random
effects has been considered (EBLUP SP), with Case A covariates
■ MBD: model based direct estimation is performed on sampled LLMAs, while the
synthetic estimator based on a unit level linear mixed model is used for
non-sampled LLMAs (Case A covariates)
■ Nonparametric EBLUP: two semiparametric representations based on
penalized splines (fitted as additional random effects) have been applied:
on the geographical coordinates of the municipality (EBLUP-SPLINE SP): this
allows for a finer representation of the spatial component than EBLUP SP (at
municipality level instead of LLMA level);
on age (EBLUP-SPLINE AGE & EBLUP SP-SPLINE AGE).
Evaluation Criteria
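% Relative Bias:
$$RB_d = \frac{1}{R} \sum_{r=1}^{R} \frac{\hat{Y}_d^{r} - Y_d}{Y_d} \times 100$$
% Relative Root Mean Squared Error:
$$RRMSE_d = \sqrt{\frac{1}{R} \sum_{r=1}^{R} \left(\frac{\hat{Y}_d^{r} - Y_d}{Y_d}\right)^2} \times 100$$
Average Absolute RB:
$$AARB = \frac{1}{D} \sum_{d=1}^{D} |RB_d|$$
Average RRMSE:
$$ARRMSE = \frac{1}{D} \sum_{d=1}^{D} RRMSE_d$$
Maximum Absolute RB:
$$MARB = \max_d |RB_d|$$
Maximum RRMSE:
$$MRRMSE = \max_d RRMSE_d$$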
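The criteria above can be computed from a matrix of simulated estimates, with one row per replicate and one column per area. The sketch assumes the reading of RRMSE as the square root of the mean squared relative error; the function name and the toy numbers are illustrative.

```python
import numpy as np

def evaluation_criteria(est, truth):
    """est: (R, D) array of estimates over R replicates and D areas;
    truth: (D,) array of true area values. Returns the four summary criteria."""
    rel_err = (est - truth) / truth
    rb = 100.0 * rel_err.mean(axis=0)                       # % relative bias per area
    rrmse = 100.0 * np.sqrt((rel_err ** 2).mean(axis=0))    # % relative RMSE per area
    return {
        "AARB": np.abs(rb).mean(),
        "ARRMSE": rrmse.mean(),
        "MARB": np.abs(rb).max(),
        "MRRMSE": rrmse.max(),
    }

# Two replicates, two areas (hypothetical numbers).
truth = np.array([10.0, 20.0])
est = np.array([[11.0, 18.0],
                [11.0, 22.0]])
crit = evaluation_criteria(est, truth)
print(crit)
```

In this toy case the first area is overestimated by 10% in both replicates, while the errors in the second area cancel out, so the second area contributes nothing to the bias criteria but still contributes to the RRMSE criteria.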
Results – A: LFS covariates; B = A + geog. coord. mun.
ESTIMATOR              AARB   ARRMSE   MARB   MRRMSE
DIRECT                  2.9     51.7   20.4     90.7
GREG A                  7.2     40.2   83.3     93.8
GREG B                  6.9     40.0   71.5     82.8
SYNTH A                14.0     15.8   93.0     93.5
SYNTH B                12.4     16.4   79.7     81.0
EBLUP A                13.2     16.2   92.5     93.1
EBLUP B                11.9     16.7   79.5     80.7
EBLUP SP               12.7     16.3   90.9     91.6
MBD                     8.8     35.3   86.3     92.6
EBLUP-SPLINE SP        12.1     16.5   91.1     92.2
EBLUP-SPLINE AGE       13.2     16.5   89.8     90.5
EBLUP SP-SPLINE AGE    12.2     17.3   90.3     90.9
Analysis of results
For GREG, SYNTH and EBLUP, case B, in which geographical
information enters the fixed part of the model, performs better in
terms of bias.
In terms of MSE, the standard estimators in case A outperform those
in case B if ARRMSE is taken as the overall evaluation criterion,
while case B gives better results if MRRMSE is considered.
Area level estimators (not shown here) perform a little better in terms of
bias but much worse in terms of MSE.
Analysis of results
EBLUP SP can be compared with the unit level EBLUP with geographical
information included as covariates (EBLUP B) and with EBLUP-SPLINE SP.
o EBLUP SP shows better performance in terms of MSE, while the unit level
EBLUP outperforms the other estimators in terms of bias.
o EBLUP-SPLINE SP performs in between the other estimators.
EBLUP-SPLINE AGE performs similarly to the unit level EBLUP in Case A
o Using age in a nonparametric way is an alternative use of the
auxiliary information. Compared with case A, the model is more
parsimonious.
As expected, MBDE shows better results in terms of bias but performs
worse in terms of MSE than the other SAE methods
Using the autocorrelation structure together with the spline on the variable
age does not improve performance
Final remarks
The model group is a small portion of Italy (the Center); hence the area-specific
effects are smaller than they could be if an overall model were considered for
the whole country. The introduction of geographical information should be
analyzed considering a larger model group
Sensitivity to the choice of smoothing parameters in the spline approach has to
be investigated.
The introduction of the sampling weights should be considered to try to
achieve benchmarking with the direct estimates produced at regional level
The response is a 0-1 variable: a logistic mixed model is currently being
investigated