Estimating the Variance of the Estimate of the Incidence of


Enhancing Small Area Estimation Methods
Applications to Istat’s Survey Data
Ranalli M.G. ~ Università di Perugia
D’Alo’ M., Di Consiglio L., Falorsi S., Solari F. ~ Istat
Pratesi M., Salvati N. ~ Università di Pisa
Q2008 ~ Rome, July 11th
OUTLINE
■ Italian Labour Force Survey
■ Standard small area estimators for LFS
■ Small area estimators that incorporate spatial information
■ Model based direct estimator (MBDE)
■ Semi-parametric models (based on p-splines)
■ Experimental study
■ Analysis of results
■ Final remarks
Labour Force Survey description
■ The Labour Force Survey (LFS) is a quarterly two-stage survey with partial
overlap of sampling units according to a 2-2-2 rotation scheme.
■ In each province the municipalities are classified as either Self-Representing
Areas (SRAs) or Non-Self-Representing Areas (NSRAs).
■ In each SRA a sample of households is selected.
■ In NSRAs the sample is based on a stratified two-stage sampling design.
The municipalities are the Primary Sampling Units (PSUs), while the
households are the Secondary Sampling Units (SSUs).
■ Each quarterly sample involves about 1350 municipalities and 200,000
individuals.
Small area estimation on LFS
■ Since 2000, ISTAT has disseminated yearly LFS estimates of employed and
unemployed counts for the 784 Local Labour Market Areas (LLMAs).
■ LLMAs are unplanned domains obtained as clusters of municipalities
cutting across provinces, which are the finest planned domains of the LFS.
■ The direct estimates are unstable because LLMA sample sizes are very small
(more than 100 LLMAs have zero sample size), so SAE methods are necessary.
■ Until 2003, a design-based composite-type estimator was adopted.
■ Starting from 2004, after the redesign of the LFS sampling strategy, a
unit-level EBLUP estimator with spatially autocorrelated random area effects
has been used.
Standard small area estimators – design based
Direct and GREG estimator
■ The direct estimator is given by
$$\hat{Y}_d^{D} = \frac{1}{N_d} \sum_{i \in s_d} w_i y_i$$
■ The GREG estimator is based on the standard linear model
$$y_{id} = x_{id}^T \beta + \varepsilon_{id}, \qquad E(\varepsilon_{id}) = 0, \qquad \mathrm{var}(\varepsilon_{id}) = \sigma^2$$
and can be expressed as an adjustment of the direct estimator
for differences between the sample and population area means of covariates:
$$\hat{Y}_d^{GREG} = \hat{Y}_d^{D} + \left(\bar{X}_d - \hat{\bar{X}}_d^{D}\right)^T \hat{\beta}_w$$
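As a rough illustration of the two design-based estimators on this slide, the following Python sketch computes the direct estimate of an area mean and its GREG adjustment. The function names, the array layout and the choice to fit the weighted regression coefficients on whatever sample is passed in are assumptions made for the example; they are not taken from the LFS production system.

```python
import numpy as np

def direct_estimate(y, w, N_d):
    """Direct estimator: weighted sample sum of y divided by the area size N_d."""
    return np.sum(w * y) / N_d

def greg_estimate(y, X, w, N_d, X_bar_pop):
    """GREG: direct estimate plus a regression adjustment.

    y         : (n,) sample values
    X         : (n, p) sample covariates
    w         : (n,) sampling weights
    X_bar_pop : (p,) known population area means of the covariates
    """
    # Weighted least squares coefficients (beta_w), fitted on the sample passed in.
    XtW = X.T * w
    beta_w = np.linalg.solve(XtW @ X, XtW @ y)
    y_dir = direct_estimate(y, w, N_d)
    x_dir = (w @ X) / N_d  # direct estimate of the covariate means
    # Adjust for the gap between population and estimated covariate means.
    return y_dir + (X_bar_pop - x_dir) @ beta_w

# Toy data (hypothetical): y is exactly linear in X, so GREG recovers X_bar_pop @ beta.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = X @ np.array([2.0, 0.5])
w = np.array([10.0, 10.0, 10.0])
greg = greg_estimate(y, X, w, N_d=30, X_bar_pop=np.array([1.0, 2.5]))
```

When the sample covariate means already match the population means, the adjustment term vanishes and the GREG estimate coincides with the direct estimate.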
Standard small area estimators – model based
Unit level Synthetic and EBLUP
■ The Synthetic estimator assumes a standard linear mixed model with
unit-specific auxiliary variables, random area-specific effects and errors
independently normally distributed
$$y_{id} = x_{id}^T \beta + u_d + e_{id}, \qquad u_d \sim iid\ N(0, \sigma_u^2), \qquad e_{id} \sim iid\ N(0, \sigma_e^2)$$
and is given by
$$\hat{Y}_d^{SI} = \bar{X}_d^T \hat{\beta}$$
■ The EBLUP estimator assumes the same model but is given by
$$\hat{Y}_d^{EB} = \bar{X}_d^T \hat{\beta} + \hat{u}_d = \frac{1}{N_d} \sum_{i \in U_d} \hat{y}_i$$
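The difference between the two model-based estimators can be sketched in Python as follows. This is a minimal example under strong assumptions: the regression coefficients and the two variance components are treated as already estimated (in practice by ML or REML), and the area effect is predicted with the usual shrinkage factor of the nested error model, gamma_d = sigma_u^2 / (sigma_u^2 + sigma_e^2 / n_d). All function and argument names are illustrative.

```python
import numpy as np

def synthetic_estimate(X_bar_pop, beta_hat):
    """Synthetic estimator: population covariate means times the fitted coefficients."""
    return X_bar_pop @ beta_hat

def predicted_area_effect(y_d, X_d, beta_hat, sigma2_u, sigma2_e):
    """Predicted random effect for one area, given plug-in variance components."""
    n_d = len(y_d)
    if n_d == 0:
        return 0.0  # non-sampled area: the prediction falls back to the synthetic part
    gamma_d = sigma2_u / (sigma2_u + sigma2_e / n_d)  # shrinkage factor in (0, 1)
    mean_residual = np.mean(y_d) - np.mean(X_d, axis=0) @ beta_hat
    return gamma_d * mean_residual

def eblup_estimate(y_d, X_d, X_bar_pop, beta_hat, sigma2_u, sigma2_e):
    """EBLUP: synthetic part plus the predicted area effect."""
    u_hat = predicted_area_effect(y_d, X_d, beta_hat, sigma2_u, sigma2_e)
    return synthetic_estimate(X_bar_pop, beta_hat) + u_hat

# Toy area (hypothetical values).
beta_hat = np.array([1.0, 0.5])
X_d = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_d = np.array([1.0, 2.0, 3.0])
X_bar_pop = np.array([1.0, 2.0])
synth = synthetic_estimate(X_bar_pop, beta_hat)
eblup = eblup_estimate(y_d, X_d, X_bar_pop, beta_hat, sigma2_u=1.0, sigma2_e=3.0)
```

For an area with no sampled units the predicted effect is zero, so the EBLUP reduces to the synthetic estimate.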
Enhanced small area estimators
1. Unit level EBLUP with spatial correlation of area effects
■ The EBLUP-S estimator is based on the following unit-level linear mixed model:
$$y_{id} = x_{id}^T \beta + u_d + e_{id}, \qquad u \sim MN(0, \sigma_u^2 A), \qquad e \sim MN(0, \sigma_e^2 I_N)$$
The matrix A depends on the distances among the areas and on an unknown
parameter $\alpha$ connected to the spatial correlation coefficient among the areas:
$$A = \{a_{dd'}\} = \left[ 1 + \delta_{dd'} \exp\left( \frac{dist(d, d')}{\alpha} \right) \right]^{-1}, \qquad \delta_{dd'} = \begin{cases} 0 & \text{if } d = d' \\ 1 & \text{otherwise} \end{cases}$$
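To make the structure of A concrete, the sketch below builds the matrix from area coordinates. It assumes the form a_dd' = 1 / (1 + delta_dd' * exp(dist(d, d') / alpha)), with delta_dd' equal to 0 on the diagonal and 1 elsewhere. The parameter name `alpha`, the use of Euclidean distances between area centroids and the function name are all assumptions made for the example.

```python
import numpy as np

def spatial_matrix(coords, alpha):
    """Distance-based correlation matrix for the random area effects.

    coords : (D, 2) array of area centroids (hypothetical input)
    alpha  : positive scale parameter linked to the spatial correlation
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    delta = 1.0 - np.eye(len(coords))         # 0 on the diagonal, 1 elsewhere
    return 1.0 / (1.0 + delta * np.exp(dist / alpha))

# Two areas 5 units apart with alpha = 5: off-diagonal entry is 1 / (1 + e).
A = spatial_matrix([[0.0, 0.0], [3.0, 4.0]], alpha=5.0)
```

Under this form the diagonal is always 1 and the off-diagonal entries decrease as the distance between two areas grows, so distant areas share less of their random effect.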
Enhanced small area estimators
2. Model Based Direct Estimator (Chambers & Chandra, 2006)
■ The MBD estimator is based on a unit-level linear mixed model and is given by
$$\hat{Y}_d^{MBD} = \frac{\sum_{i \in s_d} w_i^{m} y_i}{\sum_{i \in s_d} w_i^{m}}$$
where the weights $w_i^{m}$ are such that $\hat{Y} = \sum_{i \in s} w_i^{m} y_i$ is the (E)BLUP of
$Y = \sum_{i \in U} y_i$ under the model (Royall, 1976).
■ Calibrated with respect to the total of x.
■ Reduces bias vs EBLUP.
■ Does not allow estimation for non-sampled areas.
■ Less efficient than EBLUP.
Enhanced small area estimators
3. Nonparametric EBLUP (Opsomer et al., 2008)
$$y_{id} = x_{id}^T \beta + f(z_{1id}) + f(z_{2id}, z_{3id}) + u_d + e_{id}, \qquad u_d \sim iid\ N(0, \sigma_u^2), \qquad e_{id} \sim iid\ N(0, \sigma_e^2)$$
■ The literature offers many nonparametric regression methods (kernel,
local polynomial, wavelets, ...), but they are difficult to incorporate into
a small area model.
■ Methods based on penalized splines (Eilers and Marx, 1996; Ruppert et al.,
2003) can be estimated by means of mixed models, which makes them a promising
candidate for SAE methods:
o Great flexibility in the definition of the model.
o Estimable with existing software using REML.
o Hard to estimate efficiency and to test for the significance of terms (via
bootstrap?).
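The link between penalized splines and mixed models can be illustrated with a small Python sketch. It uses a truncated linear basis and a ridge-type penalty on the spline coefficients only, which is the penalized least squares problem that corresponds to treating those coefficients as random effects. The smoothing parameter `lam` is fixed by hand here, whereas in the mixed model formulation it would be the ratio of two variance components estimated by REML. The basis choice, knot placement and all names are assumptions made for the example.

```python
import numpy as np

def truncated_linear_basis(z, knots):
    """One column (z - knot)_+ per knot."""
    return np.maximum(z[:, None] - knots[None, :], 0.0)

def design_matrix(z, knots):
    """Unpenalized part [1, z] followed by the penalized spline columns."""
    return np.hstack([np.column_stack([np.ones_like(z), z]),
                      truncated_linear_basis(z, knots)])

def fit_pspline(z, y, knots, lam):
    """Penalized least squares: only the spline coefficients are shrunk."""
    C = design_matrix(z, knots)
    penalty = np.diag([0.0, 0.0] + [1.0] * len(knots))
    return np.linalg.solve(C.T @ C + lam * penalty, C.T @ y)

def predict_pspline(z_new, knots, coef):
    return design_matrix(z_new, knots) @ coef

# Toy data (hypothetical): an exactly linear signal needs no spline correction.
z = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * z
knots = np.linspace(0.1, 0.9, 5)
coef = fit_pspline(z, y, knots, lam=1.0)
pred = predict_pspline(np.array([0.25, 0.75]), knots, coef)
```

Larger values of `lam` pull the fit towards the straight line spanned by the unpenalized columns, while smaller values let the spline follow the data more closely.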
LFS empirical study
A simulation study on the LFS has been carried out to estimate the
unemployment rate at LLMA level.
■ 500 two-stage LFS samples have been drawn from the 2001 census data set.
■ The performance of the methods has been evaluated for the estimation of
the unemployment rate in the 127 LLMAs belonging to the geographical area
"Center of Italy".
■ The GREG, Synthetic and EBLUP small area estimators have been applied
considering two different sets of auxiliary variables:
Case A: LFS real covariates = sex by 14 age classes + employment
indicator at previous census;
Case B: LFS real covariates + geographic coordinates (latitude and
longitude of the municipality the sampling unit belongs to).
Enhanced small area estimators
■ Spatial EBLUP (EBLUP SP): spatial correlation is introduced in the variance
matrix of the random effects; Case A covariates are used.
■ MBD: model-based direct estimation is performed on sampled LLMAs, while the
synthetic estimator based on the unit-level linear mixed model is used for
non-sampled LLMAs (Case A covariates).
■ Nonparametric EBLUP: two semiparametric representations based on
penalized splines (fitted as additional random effects) have been applied:
o on the geographical coordinates of the municipality (EBLUP-SPLINE SP): this
allows for a finer representation of the spatial component than EBLUP SP (at
municipality level instead of LLMA level);
o on age (EBLUP-SPLINE AGE and EBLUP SP-SPLINE AGE).
Evaluation Criteria
■ % Relative Bias:
$$RB_d = \frac{1}{R} \sum_{r=1}^{R} \frac{\hat{Y}_d^{r} - Y_d}{Y_d} \times 100$$
■ % Relative Root Mean Squared Error:
$$RRMSE_d = \sqrt{\frac{1}{R} \sum_{r=1}^{R} \left( \frac{\hat{Y}_d^{r} - Y_d}{Y_d} \right)^2} \times 100$$
■ Average Absolute RB:
$$AARB = \frac{1}{D} \sum_{d=1}^{D} |RB_d|$$
■ Average RRMSE:
$$ARRMSE = \frac{1}{D} \sum_{d=1}^{D} RRMSE_d$$
■ Maximum Absolute RB:
$$MARB = \max_d |RB_d|$$
■ Maximum RRMSE:
$$MRRMSE = \max_d RRMSE_d$$
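The six criteria can be computed in a few lines once the simulated estimates are stored in an array. The sketch below assumes a layout with one row per simulated sample (R rows) and one column per area (D columns); the function name and the dictionary keys are illustrative.

```python
import numpy as np

def evaluation_criteria(estimates, truth):
    """Summary measures over R replications and D areas.

    estimates : (R, D) array, one estimate per replication and area
    truth     : (D,) array of true area values
    """
    rel_err = (estimates - truth) / truth
    rb = 100.0 * rel_err.mean(axis=0)                      # % relative bias per area
    rrmse = 100.0 * np.sqrt((rel_err ** 2).mean(axis=0))   # % relative RMSE per area
    return {
        "RB": rb,
        "RRMSE": rrmse,
        "AARB": np.abs(rb).mean(),
        "ARRMSE": rrmse.mean(),
        "MARB": np.abs(rb).max(),
        "MRRMSE": rrmse.max(),
    }

# Toy example (hypothetical numbers): two replications, two areas.
truth = np.array([10.0, 20.0])
estimates = np.array([[11.0, 18.0],
                      [9.0, 18.0]])
crit = evaluation_criteria(estimates, truth)
```

In the toy example the first area is unbiased on average but still has a non-zero RRMSE, which is the kind of contrast between bias and MSE that the results table reports.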
Results (A: LFS covariates; B: A + geographical coordinates of the municipality)

ESTIMATOR              AARB   ARRMSE   MARB   MRRMSE
DIRECT                  2.9     51.7   20.4     90.7
GREG A                  7.2     40.2   83.3     93.8
GREG B                  6.9     40.0   71.5     82.8
SYNTH A                14.0     15.8   93.0     93.5
SYNTH B                12.4     16.4   79.7     81.0
EBLUP A                13.2     16.2   92.5     93.1
EBLUP B                11.9     16.7   79.5     80.7
EBLUP SP               12.7     16.3   90.9     91.6
MBD                     8.8     35.3   86.3     92.6
EBLUP-SPLINE SP        12.1     16.5   91.1     92.2
EBLUP-SPLINE AGE       13.2     16.5   89.8     90.5
EBLUP SP-SPLINE AGE    12.2     17.3   90.3     90.9
Analysis of results
■ GREG, SYNTH and EBLUP in case B, where geographical information enters
the fixed part of the model, perform better in terms of bias.
■ In terms of MSE, the standard estimators in case A outperform those in
case B when ARRMSE is taken as the overall evaluation criterion, while
case B gives better results when MRRMSE is considered.
■ Area-level estimators (not shown here) perform slightly better in terms of
bias but much worse in terms of MSE.
Analysis of results
■ EBLUP SP can be compared with the unit-level EBLUP with geographical
information included as covariates (EBLUP B) and with EBLUP-SPLINE SP.
o EBLUP SP performs better in terms of MSE, while the unit-level
EBLUP outperforms the other estimators in terms of bias.
o EBLUP-SPLINE SP performs in between the other two estimators.
■ EBLUP-SPLINE AGE performs similarly to the unit-level EBLUP in case A.
o Using age nonparametrically is an alternative way to exploit the
auxiliary information. Compared with case A, the model is more
parsimonious.
■ As expected, MBDE performs better than the other SAE methods in terms of
bias but worse in terms of MSE.
■ Combining the autocorrelation structure with the spline on the variable
age does not improve performance.
Final remarks
■ The model group is a small portion of Italy (the Centre); hence the
area-specific effects are smaller than they would be if a single model were
fitted to the whole country. The introduction of geographical information
should therefore be analyzed with a larger model group.
■ Sensitivity to the choice of smoothing parameters in the spline approach
has to be investigated.
■ The introduction of the sampling weights should be considered in order to
achieve benchmarking with the direct estimates produced at regional level.
■ The response is a 0-1 variable: a logistic mixed model is currently being
investigated.