How survey design affects analysis


Ambitious title?
Confidence intervals, design effects
and significance tests for surveys.
How to calculate sample numbers
when planning a survey.
Scot Exec Course Nov/Dec 04
Summary
• Statistical inference
– Design based
– Model based
• Confidence intervals and hypothesis tests in general
• Their modification for survey designs
– Design effects and design factors
• Calculation of sample numbers for studies
– Their modification for complex surveys
Statistical inference
• Making inferences about some aspect of the
population, using observation to draw
conclusions about the population now, or
about how it will evolve in future
• Data are what we are given
• Inference allows us to turn them into
information
Elements needed for statistical
inference – design based
• Want to learn something about a population
• You have
– A model of how the sample was selected from the
population.
– Some data obtained from the sample
– Knowledge of how to estimate!
• E.g. Obtain data on the income of 10,000 people from a population of 5
million.
• Need inference to estimate the income distribution of the whole 5
million and to know how close the estimate is to the population value
Elements needed for statistical
inference – model based
• You have
– A model that could have generated the data for your
population, along with ideas about what current and
future populations this might generalise to.
– Some data that can be assumed to be generated by this
model.
– Knowledge of how to carry out the inference!
• E.g. Obtain data on the income of 10,000 people from a population of 5
million, and assume that the income distribution follows some
mathematical distribution
• Need inference about the assumed model for the income distribution of
the whole 5 million and about how close your estimate will be to the true
value
How do design and model
based inferences differ?
• Conceptually poles apart
• In practice they give the same answers
• Except when numbers are small
• Or when a large proportion of the
population has been sampled
• But it's good to think about what you are
doing and decide which type fits your
problem
Next set of results
• Apply to a simple unstructured sample
– No clustering
– No stratification
– No weighting
• Taken from a population with replacement (not a
problem in model based inference)
• Exactly the same large-sample results apply for
model-based and design-based inferences
[Figure: distribution of individual x values (st. dev. $\sigma$) and of the mean of 9 x's (st. dev. $\sigma/\sqrt{n}$); where is the population mean $\mu$?]
Standard error of the mean
• $\bar{x} - \mu$ is approximately normally distributed with s.d. $\sigma/\sqrt{n}$
• The data are fixed, so this tells us where $\mu$ is likely to be.
• $\sigma/\sqrt{n}$ is called the standard error of the sample mean,
sometimes written s.e.mean or s.e.m. It measures the expected distance of
the “true” mean from the mean of the observed sample.
• A $100(1-\alpha)\%$ confidence interval for $\mu$ from the
normal distribution is
$\bar{x} \pm z_{\alpha/2} \times \text{s.e.m.}$
Values of Z for confidence
intervals
• 95% c.i. gives Z = 1.96
• 99% c.i. gives Z = 2.58
• 68% c.i. gives Z = 1
• 90% c.i. gives Z = 1.64
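The Z values above and the interval "sample mean plus or minus Z times the standard error of the mean" can be sketched in a few lines of Python. This is only an illustration for a simple unstructured sample: the function names and the income figures are made up for the example, and the sample standard deviation is assumed to stand in for the unknown population standard deviation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_value(confidence):
    """Two-sided normal critical value, e.g. 0.95 gives roughly 1.96."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation interval for the mean of a simple sample."""
    n = len(values)
    centre = mean(values)
    # Assumption: the sample s.d. is used in place of the population s.d.
    sem = stdev(values) / sqrt(n)
    half_width = z_value(confidence) * sem
    return centre - half_width, centre + half_width


# Hypothetical incomes (in thousands) from a simple unstructured sample
incomes = [18.2, 22.5, 19.9, 31.0, 25.4, 27.3, 21.1, 24.8, 20.6]
low, high = mean_confidence_interval(incomes, 0.95)
print(f"95% c.i. for the mean: {low:.2f} to {high:.2f}")
```

With only 9 observations a t-based interval would normally be preferred; the normal version is used here because it is the one given on the slides.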
We can use it for proportions too
• Want to estimate a proportion p, e.g. the
proportion of 20 year olds who use the internet
– Then r/n estimates p
– with standard error $\sqrt{p(1-p)/n}$
– to use this formula we replace p with $\hat{p} = r/n$
• A rule of thumb is that this approximation is OK
if the smaller of r and (n-r) is >5.
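As a rough sketch of the proportion formula and the rule of thumb, the following Python function estimates p by r/n, computes the standard error as the square root of p(1-p)/n, and refuses to proceed when the smaller of r and n-r is not above 5. The function name and the counts (160 internet users out of 200 respondents) are hypothetical.

```python
from math import sqrt


def proportion_interval(r, n, z=1.96):
    """Estimate, standard error and normal-approximation interval for r/n."""
    if min(r, n - r) <= 5:
        # Rule of thumb from the slides: approximation not trusted here
        raise ValueError("smaller of r and n - r must be greater than 5")
    p_hat = r / n
    se = sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)


# Hypothetical simple sample: 160 of 200 20 year olds use the internet
p_hat, se, (low, high) = proportion_interval(r=160, n=200)
print(f"estimate {p_hat:.3f}, s.e. {se:.4f}, 95% c.i. {low:.3f} to {high:.3f}")
```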
Are these formulae good enough?
• Yes – unless your survey is too small to be
any use
• They extend easily to differences in means
and proportions
• Similar approximate results apply to
regression models and logistic regressions
• BUT – they only apply to simple samples
But my data are more complicated than this
And nobody will let me put standard errors or
confidence intervals in my report
• A goal of a good statistical report is that it should
not include any tables or graphs where what seems
to be information is just the result of chance
variation (noise).
– Set out your task in terms of an outcome predicted from
other factors
– Carry out a set of regression predictions
– Base the tables to go in the report on the regression
models that are found to be more than chance effects
Inferences for complex surveys
• The usual formulae and regression models
don’t hold
• Most surveys use weighting
• And allowances for clustering and
stratification have to be made
• Software that modifies the results we have
just discussed and calculates them correctly
for complex surveys is now available
Two main methods are used
• Taylor linearisation – theory of this all
worked out in the 1940s and 50s
• Replication methods, jackknives and
bootstraps – 1960s and 1970s
• Only now is software readily available to do
things properly
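To give a flavour of a replication method, here is a minimal delete-one-PSU jackknife for the standard error of a mean. It is a sketch under strong simplifying assumptions (no strata, no weights, made-up data and function names), not what the survey packages actually run: each primary sampling unit is dropped in turn, the mean is recomputed, and the spread of those replicate means gives the variance.

```python
from math import sqrt


def overall_mean(psus):
    """Mean of all observations, pooled across PSUs."""
    total = sum(sum(psu) for psu in psus)
    count = sum(len(psu) for psu in psus)
    return total / count


def jackknife_se(psus):
    """Delete-one-PSU jackknife s.e. (unstratified, unweighted sketch)."""
    k = len(psus)
    full = overall_mean(psus)
    # Recompute the estimate k times, leaving out one PSU each time
    replicates = [overall_mean(psus[:i] + psus[i + 1:]) for i in range(k)]
    variance = (k - 1) / k * sum((rep - full) ** 2 for rep in replicates)
    return sqrt(variance)


# Hypothetical clustered sample: four PSUs with similar values inside each
clustered = [[10, 11, 12], [20, 21, 19], [15, 14, 16], [30, 29, 31]]
print(f"jackknife s.e. of the mean: {jackknife_se(clustered):.3f}")
```

When every PSU holds a single observation, this reduces to the usual s.d. divided by the square root of n, which is a handy check on the arithmetic.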
Getting by without the correct software
• Carry out an analysis using an ordinary computer
package (e.g. the simple procedures in SAS or SPSS)
• But use a weight in the analysis to get results that
will correct the bias in the estimates
• Your weighted analysis will get you the wrong
standard errors and wrong tests, but the estimates
will be about right.
• Use design effect tables to get some idea of the
standard errors
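The point that weights fix the estimate but not the standard error can be illustrated with a weighted mean. In this sketch the values, the weights and the function name are hypothetical; the weighted mean is the estimate to report, while any standard error an ordinary package attaches to it should not be trusted.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Weighted estimate of the mean: sum(w * y) / sum(w)."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Hypothetical data: the last two respondents come from an under-sampled
# group, so each one carries a larger survey weight
values = [12.0, 15.0, 11.0, 30.0, 28.0]
weights = [1.0, 1.0, 1.0, 3.0, 3.0]

print(f"unweighted mean: {mean(values):.2f}")
print(f"weighted mean:   {weighted_mean(values, weights):.2f}")
```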
Using the correct software
• Is not difficult – PEAS web site explains how
• Routines are available in SAS, SPSS, STATA and
R
• But it does mean that you need to get details of the
survey design
• E.g. PSU and stratification variables need to be
available
• Easier for you than for me
Getting by without the correct software
• Use a table of design effects (DE)
• Often published with the surveys
• To get a s.e. from a complex survey
– Calculate the design factor (DF) as the square root of
the DE
– Multiply the s.e. from a simple analysis by DF
• For most household surveys DEs vary from about
0.8 to 2 or 3.
• This is a rough and ready method and will only
work if weights are not too far from 1.0
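The rough-and-ready adjustment above amounts to two lines of arithmetic, sketched here in Python. The estimate, the simple-analysis standard error and the design effect of 1.5 are invented for illustration; in practice the DE would come from the table published with the survey.

```python
from math import sqrt


def complex_se(simple_se, design_effect):
    """Scale a simple-analysis s.e. by the design factor DF = sqrt(DE)."""
    design_factor = sqrt(design_effect)
    return simple_se * design_factor


def adjusted_interval(estimate, simple_se, design_effect, z=1.96):
    """Approximate confidence interval using the adjusted s.e."""
    se = complex_se(simple_se, design_effect)
    return estimate - z * se, estimate + z * se


# Hypothetical figures: proportion 0.80 with simple s.e. 0.028, DE of 1.5
low, high = adjusted_interval(0.80, 0.028, 1.5)
print(f"adjusted s.e. {complex_se(0.028, 1.5):.4f}")
print(f"approximate 95% c.i. {low:.3f} to {high:.3f}")
```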
Disadvantages of this
• DEs are not constant for a survey
• They are also different (usually lower) when
subgroups of a survey are selected
• They may also be lower in complicated
models, like regressions where it is also
very hard to know how to apply them.
• Methods are approximate
Uses of design effects (DEs)
• They tell you about how well your survey
design has worked
• Most survey software produces estimates of
design effects with its output
• A design effect of 2 means your effective
sample size is halved
• It is good to have such estimates when
planning sample numbers for surveys.
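The link between the design effect and the effective sample size can be written out as a small sketch. The function names and numbers are illustrative; the design effect is taken here as the squared ratio of the complex-design standard error to the simple-analysis standard error, consistent with the design factor being its square root.

```python
def design_effect(complex_se, simple_se):
    """DE as the ratio of variances, i.e. the squared ratio of s.e.s."""
    return (complex_se / simple_se) ** 2


def effective_sample_size(n, de):
    """Size of a simple sample that would give the same precision."""
    return n / de


# Hypothetical survey of 10,000 respondents with a design effect of 2
print(effective_sample_size(10_000, 2.0))
# Hypothetical s.e.s from a design-aware and a simple analysis
print(design_effect(complex_se=0.03, simple_se=0.02))
```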
Sample numbers for planning
studies
• Think ahead about the sort of comparisons
you might want to make
• Are you interested in time trends?
• Or in comparisons between certain groups?
– If so, what proportions are in each?
• Do you want to estimate something (eg %
of children in poverty)?
Use spreadsheet sample numbers.xls
To modify these for surveys
• Simply multiply your answer by an estimate
of the design effect
• Or try to do the next survey better by
getting a smaller design effect
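As an illustration of multiplying by the design effect, the sketch below sizes a survey to estimate a proportion within a chosen margin of error. It uses the standard simple-sample formula n = z² p(1-p) / margin², which is an assumption made for this example and not necessarily the calculation in the spreadsheet; the function names and the inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def simple_sample_size(p, margin, confidence=0.95):
    """Simple-sample n so the c.i. half-width for p is about `margin`."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return z ** 2 * p * (1.0 - p) / margin ** 2


def survey_sample_size(p, margin, design_effect, confidence=0.95):
    """Inflate the simple-sample n by the design effect and round up."""
    return ceil(simple_sample_size(p, margin, confidence) * design_effect)


# Hypothetical plan: estimate a proportion near 0.5 to within 5 points
print(survey_sample_size(0.5, 0.05, design_effect=1.0))
print(survey_sample_size(0.5, 0.05, design_effect=2.0))
```

Halving the design effect halves the number of interviews needed for the same precision, which is the case for trying to design the next survey better.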