Small Area Estimation
(in survey research)
Knock! Knock! Who's there? (without opening the door)
"The census taker."
"Go away - I don't want my senses taken."
"No, you don't understand, I just want to survey you."
"A statistical sample of one isn't valid -- go away."
"You aren't the only one."
"So you are bothering a whole bunch of people, go away."
"Look, you are unique and I don't want to miss you in the survey."
"How do you know I'm unique when you haven't surveyed me yet?"
"Ok, I don't know you are unique, but you might be."
"You mean you think I'm an oddball."
"No, maybe more like an outlier."
"Now you are calling me an out and out lier, go away."
"No, I mean you are far from the average Joe."
"I hope so, I'm Sally."
"Look Sally, we are trying to get population data, how many people live here?"
"Gosh, how would I know, I think there are about 15 thousand in Smugville."
"No, I mean in this house!"
"Oh, that's a question of a different nature."
"So, how many?"
"Sometimes one, sometimes two, sometimes four, now -- go away."
"No, I need a precise number."
"Ok, how about 1.34"
"How did you come up with that?"
"I live here sometimes during the week, my sister visits me on weekends, and my mother visits me
every second week, my two cats are sometimes here, and my …. and that’s none of your business".
"Thanks, Sally, have a great day."
(census taker wrote -- "NO PERSONS LIVING HERE - UNOCCUPIED.")
What is SAE?
• Small area estimation is the collective term for several
statistical techniques involving the estimation of
parameters for small sub-populations, generally used
when the sub-population of interest is included in a larger
survey. - Wikipedia (the free encyclopedia)
• Small area: a sub-population for which there is not enough
sample to construct reliable estimates directly based on the
survey sample
– small geographical area, such as LHA
– small domain, such as demographic subgroups
Area with small number of respondents – estimates with low
precision (large standard error)
Area with no respondent – no estimate
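To illustrate why a small number of respondents gives low precision, here is a minimal sketch of a direct estimate of a proportion and its standard error. It assumes simple random sampling and ignores survey weights and design effects, which a real survey estimate would need; the function name and the example counts are hypothetical.

```python
import math

def direct_estimate(n_respondents, n_with_outcome):
    """Direct estimate of a proportion and its standard error.

    Assumes simple random sampling (no weights, no design effect).
    Returns None when the area has no respondents: no estimate is possible.
    """
    if n_respondents == 0:
        return None
    p = n_with_outcome / n_respondents
    se = math.sqrt(p * (1 - p) / n_respondents)
    return p, se

# Same observed proportion (20%), very different precision.
large_area = direct_estimate(2500, 500)
small_area = direct_estimate(25, 5)
empty_area = direct_estimate(0, 0)

print("large area:", large_area)
print("small area:", small_area)
print("area with no respondents:", empty_area)
```

With the same observed proportion, the standard error for the area with 25 respondents is ten times that of the area with 2500.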
Why SAE?
• Growing demand for reliable small area statistics
for policy analysis and planning purposes
– there is “increasing government concern with issues of
distribution, equity and disparity”
– apportionment of government funds
– regional planning
• Constraints of national surveys:
– not designed to produce reliable estimates at the small
area level due to cost constraints.
• Limitation of administrative data sources:
– do not have the necessary information to provide the
detailed statistics needed for small areas.
How to do SAE?
• “Borrow strength” from related or similar small
areas through explicit or implicit models that
connect the small areas via supplementary data
– combine data obtained from large scale surveys
containing measures of interest with a set of covariates
available for all small areas from other sources
• Auxiliary information/covariates
– correlated with the measure of interest
– known for all small areas
– common source: census, administrative registries
SAE Methods
• Simple approaches:
– Demographic methods
local estimation of population in postcensal years
latest census data + administrative registries (e.g., birth, death, etc.)
– Synthetic estimation
derived from direct survey estimate of a large area
the small area is covered by the large area
assumption: the small areas have the same characteristics as the large
area
potential bias
Indirect standardization
– Composite estimation
weighted average of the synthetic and survey direct estimates
balance the potential bias of a synthetic estimator and the instability
of a direct estimator
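The composite estimator can be sketched in a few lines. The weighted-average form follows the description above; the rule used here to choose the weight (more weight on the direct estimate as the area's sample size grows) is only one possible choice and is an assumption, as are the function names and the constant n0.

```python
def composite_estimate(direct, synthetic, weight):
    """Weighted average of the direct and synthetic estimates.

    weight is the share given to the direct estimate (between 0 and 1).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * direct + (1.0 - weight) * synthetic

def sample_size_weight(n_area, n0=50):
    """Hypothetical weighting rule: trust the direct estimate more
    as the area sample size n_area grows relative to a constant n0."""
    return n_area / (n_area + n0)

# Small sample: the composite estimate stays close to the synthetic estimate.
w_small = sample_size_weight(10)
print(composite_estimate(direct=0.30, synthetic=0.20, weight=w_small))

# Large sample: the composite estimate moves toward the direct estimate.
w_large = sample_size_weight(1000)
print(composite_estimate(direct=0.30, synthetic=0.20, weight=w_large))
```

A weight near 1 keeps the instability of the direct estimator; a weight near 0 keeps the potential bias of the synthetic estimator.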
SAE Methods
• Multi-level modeling:
– using individual level covariates only
– combining individual and area-level covariates
– using area level covariates only
• The model-based estimate for a particular small area is the
expected outcome for that area, given its characteristics as
measured by the covariates.
• Example of interpretation: given the characteristics of the local
population, we would expect approximately x% of adults within
LHA X to smoke, be obese, etc.
• The model lets us provide information about the characteristics of
all areas in the population, not just the sampled areas.
Indirect Standardization
• Apply national (large-area) direct survey estimates for each
demographic class to area-level population counts of those
classes to generate expected area estimates.
– intuitively appealing
Mean level of many variables in a population is highly related to the
distribution of such demographic variables as age, sex and social
class.
– easy and inexpensive to apply
local level populations of demographic classes from the Census
+
national estimates from survey
– assumes that the national rates for each subgroup apply uniformly
across all areas.
Differences between areas are due solely to differences in their
demographic composition.
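As a sketch of the calculation: multiply each national class rate by the local population count for that class, sum, and divide by the local population. All rates and counts below are made-up illustrative numbers, and the class labels are hypothetical.

```python
def expected_area_rate(national_rates, area_counts):
    """Indirect standardization: apply national class-specific rates
    to the area's population counts by demographic class."""
    total_population = sum(area_counts.values())
    expected_cases = sum(
        national_rates[cls] * count for cls, count in area_counts.items()
    )
    return expected_cases / total_population

# Illustrative national survey estimates by demographic class.
national_rates = {"age_18_44": 0.30, "age_45_plus": 0.10}

# Illustrative census counts for two small areas.
area_a = {"age_18_44": 800, "age_45_plus": 200}
area_b = {"age_18_44": 200, "age_45_plus": 800}

print("area A:", expected_area_rate(national_rates, area_a))
print("area B:", expected_area_rate(national_rates, area_b))
```

The two areas get different expected rates (about 0.26 and 0.14) only because their demographic composition differs, which is exactly the assumption stated above.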
Models using individual level covariates only
• Model the relationship between the measure of interest and
covariates at the individual level, using survey data.
• Apply the estimated model coefficients to covariates available
as counts for all small areas (e.g. from the Census) to
obtain the expected area estimate for the measure of interest.
• Data requirement
– exact correspondence between the covariates used in the model
and data available from the Census or other administrative data
sources.
– restricts the choice of covariates in these models.
• Within area clustering is ignored.
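A minimal sketch of the second step, applying fitted coefficients to census counts. It assumes a logistic model for a binary outcome and census counts cross-tabulated by the same covariates used in the model; the coefficient values are hypothetical placeholders, not results from any survey.

```python
import math

# Hypothetical coefficients standing in for a logistic model
# fitted to individual-level survey data.
COEF = {"intercept": -1.5, "male": 0.4, "age_65_plus": -0.6}

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def cell_probability(male, age_65_plus):
    """Predicted probability of the outcome for one covariate cell."""
    linear_predictor = (
        COEF["intercept"]
        + COEF["male"] * male
        + COEF["age_65_plus"] * age_65_plus
    )
    return inv_logit(linear_predictor)

def expected_area_rate(cell_counts):
    """cell_counts maps (male, age_65_plus) -> census count in the area."""
    total = sum(cell_counts.values())
    expected = sum(
        cell_probability(male, old) * count
        for (male, old), count in cell_counts.items()
    )
    return expected / total

# Illustrative census cross-tabulation for one small area.
area_counts = {(0, 0): 400, (1, 0): 350, (0, 1): 150, (1, 1): 100}
print(expected_area_rate(area_counts))
```

The dictionary keys show the data requirement in practice: every covariate in the model must also be available, in the same categories, in the census counts.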
Models combining individual and area level covariates
• Multi-level models incorporating random effects
– fixed effects of covariates + small area specific random
effects
– taking into account the clustering within small area
suited to the clustered nature of social surveys
provides more accurate standard error estimates
– enabling exploration of the association of area
differences with individual and area level
characteristics
– stringent data requirements due to inclusion of
individual level covariates
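One way to write such a model, as a sketch for a binary outcome such as smoking status. The random-intercept logistic form and all symbols below are assumptions for illustration; the slides specify only fixed effects of covariates plus small-area-specific random effects.

```latex
\operatorname{logit}(p_{ij})
  = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}
  + \mathbf{z}_{j}^{\top}\boldsymbol{\gamma}
  + u_j,
\qquad u_j \sim N(0, \sigma_u^2)
```

Here p_ij is the probability that individual i in small area j has the outcome, x_ij are individual level covariates, z_j are area level covariates, beta and gamma are the fixed effects, and u_j is the random effect for area j that captures within-area clustering.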
Models using area level covariates only
• The model gives a constant predicted value for all
individuals within an area - the predicted mean of
the area.
– avoids the stringent data requirements
– relatively low cost
– a strong argument: controlling for differences in area
level covariates is all that is needed for predicting area
differences in the study variable.
– does not support subgroup estimates within each small area,
such as gender-specific estimates
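A deliberately simplified sketch of prediction from an area level covariate: fit a straight line to direct estimates from the sampled areas, then predict the mean for any area, sampled or not. Ordinary least squares with a single covariate is an assumption made to keep the example short; a full area-level model would usually also include area random effects and allow for the sampling error of the direct estimates. All numbers are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_area_mean(intercept, slope, covariate):
    """Predicted mean for an area: the same value for everyone in it."""
    return intercept + slope * covariate

# Sampled areas: an area level covariate (e.g. a deprivation score)
# and the direct survey estimate of the outcome.
covariate_sampled = [1.0, 2.0, 3.0, 4.0]
direct_estimates = [0.15, 0.20, 0.25, 0.30]

intercept, slope = fit_line(covariate_sampled, direct_estimates)

# An unsampled area still gets an estimate, from its covariate alone.
print(predict_area_mean(intercept, slope, 5.0))
```

Because the prediction depends only on the area's covariate value, it cannot be broken down into subgroup estimates within the area.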
Data requirements
• survey dataset: holds both the outcome variables (e.g. smoking status),
as well as the individual level covariate data (e.g. age, sex, SEC).
• area-level covariate dataset: contains the estimation area level means
for a set of covariates – usually census, administrative and registration
data – along with the estimation area identifiers, and any higher-level
area covariates and identifiers.
• analysis dataset: the survey and covariate datasets matched on
estimation area identifier. The analysis dataset contains only the areas
sampled in the survey. This dataset is used for modeling.
• implementation dataset: a dataset covering all areas (not just those
sampled) to produce the final estimates. The implementation dataset
will be at the lowest estimation area level, nested within higher-level
geographic identifiers. This will allow the production of higher-level
estimates by aggregating estimates for the component small areas.
• external validation dataset: relevant local and/or national surveys or
other administrative sources to provide direct estimates of relevant
outcomes to compare against the SAE.
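The relationship between these datasets can be sketched as follows: the analysis dataset is the survey matched to the area-level covariates on the area identifier, so it holds only sampled areas, and small area estimates are aggregated to a higher-level geography using population weights. The field names, area codes and values are all hypothetical, and population weighting is an assumed aggregation rule.

```python
# Survey dataset: outcome and individual level covariates.
survey = [
    {"area": "A1", "smoker": 1, "age": 34},
    {"area": "A1", "smoker": 0, "age": 61},
    {"area": "A3", "smoker": 0, "age": 45},
]

# Area-level covariate dataset: covers ALL areas, with higher-level identifiers.
area_covariates = {
    "A1": {"region": "R1", "pct_low_income": 0.22, "population": 1000},
    "A2": {"region": "R1", "pct_low_income": 0.10, "population": 3000},
    "A3": {"region": "R2", "pct_low_income": 0.15, "population": 2000},
}

def build_analysis_dataset(survey_rows, covariates):
    """Match survey rows to area covariates on the area identifier."""
    return [
        {**row, **covariates[row["area"]]}
        for row in survey_rows
        if row["area"] in covariates
    ]

def aggregate_to_region(area_estimates, covariates):
    """Population-weighted average of small area estimates per region."""
    totals = {}
    for area, estimate in area_estimates.items():
        region = covariates[area]["region"]
        population = covariates[area]["population"]
        weighted_sum, pop_sum = totals.get(region, (0.0, 0))
        totals[region] = (weighted_sum + estimate * population, pop_sum + population)
    return {region: s / p for region, (s, p) in totals.items()}

analysis = build_analysis_dataset(survey, area_covariates)
print(sorted({row["area"] for row in analysis}))  # sampled areas only

# Illustrative model-based estimates for all areas, including unsampled A2.
small_area_estimates = {"A1": 0.30, "A2": 0.10, "A3": 0.20}
print(aggregate_to_region(small_area_estimates, area_covariates))
```

Area A2 has no respondents, so it is absent from the analysis dataset but still present in the implementation step that produces and aggregates the final estimates.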
Cautions
• “Indirect estimators should be considered when
better alternatives are not available, but only with
appropriate caution and in conjunction with
statistical research and evaluation efforts. Both
producer and user must not forget that even after
such efforts, indirect estimates may not be
adequate for the intended purpose.”
You Might Be a Statistician if...
• no one wants your job.
• you are right 95% of the time.
• you feel complete and sufficient.
• you found accountancy too exciting.
• you never have to say you are certain.
• you may not be normal but you are transformable.
References
• M Ghosh, J.N.K. Rao. "Small area estimation: An
appraisal", Statistical Science, vol 9, no.1 (1994),
55-76.
• Danny Pfeffermann. "Small area estimation - New
developments and directions", International
Statistical Review, vol 70, no.1 (2002), 125-143.
• Goldstein H (2003) Multilevel statistical models
(New York: Halstead Press).
• Rao JNK (2003). Small Area Estimation. John
Wiley & Sons, Inc., Hoboken, New Jersey.
Top ten reasons to be a statistician
1. Estimating parameters is easier than dealing with real life.
2. Statisticians are significant.
3. I always wanted to learn the entire Greek alphabet.
4. The probability a statistics major will get a job is > .9999.
5. If I flunk out I can always transfer to Engineering.
6. We do it with confidence, frequency, and variability.
7. You never have to be right - only close.
8. We're normal and everyone else is skewed.
9. The regression line looks better than the unemployment line.
10. No one knows what we do so we are always right.