Introduction to
Secondary Data Analysis
Young Ik Cho, PhD
Research Associate Professor
Survey Research Laboratory
University of Illinois at Chicago
Fall, 2009
What is secondary data?
• Data collected by a person or
organization other than the users
of the data
Advantages of Secondary Data
• Unobtrusive
• Fast & inexpensive
• Avoid data collection problems
• Provide bases for comparison
Disadvantages of Secondary Data
• Data availability
• Level of observation
• Quality of documentation
• Data quality control
• Outdated data
Data Sources
• Inter-university Consortium for Political and Social Research (ICPSR)
http://www.icpsr.umich.edu/icpsrweb/ICPSR/
• National Center for Health Statistics (NCHS)
http://www.cdc.gov/nchs/surveys.htm
• Centers for Medicare & Medicaid Services (CMS)
http://www.cms.hhs.gov/home/rsds.asp
• US Census Bureau
http://www.census.gov/main/www/access.html
Data Sources (cont.)
Examples of Directly Downloadable Data from NCHS:
National Health and Nutrition Examination Survey (NHANES)
National Ambulatory Medical Care Survey (NAMCS)
National Hospital Ambulatory Medical Care Survey (NHAMCS)
National Hospital Discharge Survey (NHDS)
National Home and Hospice Care Survey (NHHCS)
National Nursing Home Survey (NNHS)
National Survey of Ambulatory Surgery (NSAS)
National Employer Health Insurance Survey (NEHIS)
National Vital Statistics System (NVSS)
National Health Interview Survey (NHIS)
Data Sources (cont.)
Data Available for Use with Survey Documentation and Analysis
(SDA):
http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/sda.jsp
Aging Data
National Archive of Computerized Data on Aging (NACDA)
http://www.icpsr.umich.edu/NACDA/
Holds about 160 survey datasets, including:
• Longitudinal Study of Aging, 70 Years and Older, 1984-1990
• National Survey of Self-Care and Aging: Follow-Up, 1994
• National Health and Nutrition Examination Survey II: Mortality Study, 1992
• National Hospital Discharge Survey, 1994-1997
• National Health Interview Survey, 1994, Second Supplement on Aging
Data Sources (cont.)
SDA (continued):
Substance Abuse Data
Substance Abuse and Mental Health Data Archive
(http://www.icpsr.umich.edu/SAMHDA/)
• Drug Abuse Warning Network
• Monitoring the Future
• National Household Survey on Drug Abuse
• National Pregnancy and Health Survey
• National Treatment Improvement Evaluation Study
• Treatment Episode Data Set
• Uniform Facility Data Set
Data Sources (cont.)
SDA (continued):
Criminal Justice Data
National Archive of Criminal Justice Data (NACJD)
(http://www.icpsr.umich.edu/NACJD/)
• International Crime Data
• Homicide Data
• National Crime Victimization Survey Data
• Corrections Data
Evaluation of Data Sources
• Purpose of the study
• Sponsor/collector of the data
• Mode of data collection
• Sampling procedures
• Consistency of data with other sources
Evaluation of Data Sources (cont.)
• Documentation
• Number of observations
• Number of variables
• Coding scheme
• Summary statistics
Types of Survey Sample Design
• Simple Random Sampling
• Systematic Sampling
• Complex sample designs
▪ stratified designs
▪ cluster designs
▪ mixed mode designs
Types of Survey Sample Design
• Simple Random Sampling
▪ Each member of the population has an equal and known chance of being selected
▪ Simple Random Sample With Replacement (SRSWR)
▪ Simple Random Sample Without Replacement (SRSWOR)
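
The two variants can be sketched in a few lines of Python. This is only an illustration: the function names and the frame of 100 unit IDs are made up for the example and are not part of the original slides.

```python
import random

def srs_without_replacement(frame, n, seed=None):
    """SRSWOR: draw n distinct units from the frame."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def srs_with_replacement(frame, n, seed=None):
    """SRSWR: draw n units independently, so a unit can be drawn more than once."""
    rng = random.Random(seed)
    return rng.choices(frame, k=n)

# Hypothetical sampling frame of N = 100 unit IDs.
frame = list(range(1, 101))
wor = srs_without_replacement(frame, 10, seed=1)
wr = srs_with_replacement(frame, 10, seed=1)
```

In the without-replacement draw every selected ID is distinct; in the with-replacement draw repeats are possible, which is why the two designs have slightly different variance formulas.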
Types of Survey Sample Design
• Systematic Random Sampling
▪ Select every kth element from the sampling frame, starting from a randomly chosen position within the first interval; the sampling interval is k = N/n.
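
A minimal sketch of systematic selection, assuming the frame is an ordered list and the interval is rounded down to an integer (k = N // n). The function name and the example frame are hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start in the first interval, then take every k-th unit."""
    N = len(frame)
    k = N // n  # sampling interval, rounded down (assumption for this sketch)
    if k < 1:
        raise ValueError("sample size n cannot exceed the frame size N")
    rng = random.Random(seed)
    start = rng.randrange(k)  # random start between 0 and k - 1
    return [frame[start + i * k] for i in range(n)]

# Hypothetical frame of N = 100 units; n = 10 gives k = 10.
frame = list(range(100))
sample = systematic_sample(frame, 10, seed=0)
```

Because only the starting point is random, the ordering of the frame matters: a frame sorted on a variable related to the outcome behaves like an implicit stratification, while a frame with a periodic pattern can distort the sample.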
Types of Survey Sample Design
• Stratified sample
▪ The population is first divided into nonoverlapping subpopulations (strata), such as gender, race, or SES.
▪ A sample is then drawn from each stratum.
▪ Works most effectively when the variance within the strata is smaller than the variance in the population as a whole.
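
As a rough illustration, the sketch below draws a simple random sample inside each stratum using the same sampling fraction everywhere (proportional allocation). Proportional allocation, the function name, and the toy population are assumptions made for the example; the slides do not specify an allocation rule.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Group units into strata, then draw an SRS of the same fraction in each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for name, members in strata.items():
        n_h = max(1, round(fraction * len(members)))  # at least one unit per stratum
        sample[name] = rng.sample(members, n_h)
    return sample

# Hypothetical population: 60 units in stratum "F" and 40 in stratum "M".
population = [("F", i) for i in range(60)] + [("M", i) for i in range(40)]
sample = stratified_sample(population, lambda unit: unit[0], 0.10, seed=0)
```

With a 10% fraction the sketch selects 6 units from the first stratum and 4 from the second, so the sample mirrors the population split exactly rather than only on average.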
Types of Survey Sample Design
• Cluster sample
▪ Elements are selected in groups or clusters
▪ PSU: Primary Sampling Unit. This is the first unit that is sampled in the design. For example, school districts from Chicago may be sampled and then schools within districts may be sampled.
▪ Homogeneity within cluster: Intracluster Correlation Coefficient (ICC)
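
The districts-then-schools example can be sketched as a two-stage selection. The district and school names below are invented placeholders, and simple random sampling at both stages is an assumption of the sketch; real designs often select PSUs with probability proportional to size.

```python
import random

def two_stage_sample(clusters, n_psu, n_per_psu, seed=None):
    """Stage 1: sample PSUs. Stage 2: sample elements within each selected PSU."""
    rng = random.Random(seed)
    psus = rng.sample(sorted(clusters), n_psu)
    return {
        psu: rng.sample(clusters[psu], min(n_per_psu, len(clusters[psu])))
        for psu in psus
    }

# Hypothetical frame: 5 districts (PSUs), each containing 8 schools.
districts = {
    f"district_{d}": [f"school_{d}_{s}" for s in range(8)] for d in range(5)
}
selected = two_stage_sample(districts, n_psu=2, n_per_psu=3, seed=0)
```

Only the sampled districts need a full school list, which is the practical cost saving of clustering; the price is that schools in the same district tend to resemble each other.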
Why complex survey design?
• Increased efficiency
• Decreased costs
Sample Weights
• Selection weight: Used to adjust for differing probabilities of selection. It is the inverse of a unit's selection probability, which equals N/n when every unit has the same chance of selection.
• In theory, simple random samples are self-weighting
• In practice, simple random samples are also likely to require adjustments for non-response
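
A small worked example of selection weights, assuming a hypothetical design with two strata sampled at different rates. The stratum names and counts are invented for illustration.

```python
# Hypothetical design: the rural stratum is oversampled relative to its size.
strata = {
    "urban": {"N": 8000, "n": 400},  # selection probability 400/8000
    "rural": {"N": 2000, "n": 400},  # selection probability 400/2000
}

# Selection weight = 1 / selection probability = N_h / n_h within each stratum.
weights = {name: s["N"] / s["n"] for name, s in strata.items()}

# Summing the weights over all sampled units recovers the population size.
estimated_N = sum(weights[name] * s["n"] for name, s in strata.items())
```

Here each sampled urban unit stands for 20 population units and each rural unit for 5, so an unweighted analysis would overstate the rural share of the population.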
Types of Sample Weights
• Post-stratification weights:
designed to bring the sample
proportions in demographic
subgroups into agreement with the
population proportion in the
subgroups.
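
One common way to compute such an adjustment is a simple ratio of the known population total to the weighted sample total within each subgroup. The sketch below assumes that approach; the subgroup labels and totals are made-up numbers.

```python
def poststratification_factors(weighted_sample_totals, population_totals):
    """Ratio that scales each subgroup's weights up or down to the known total."""
    return {
        group: population_totals[group] / weighted_sample_totals[group]
        for group in population_totals
    }

# Hypothetical totals: the weighted sample under-represents men.
weighted_sample_totals = {"men": 4000.0, "women": 6000.0}
population_totals = {"men": 4900.0, "women": 5100.0}

factors = poststratification_factors(weighted_sample_totals, population_totals)
# Multiply each respondent's current weight by the factor for their subgroup.
```

Factors above 1 inflate the weights of under-represented subgroups, and factors below 1 shrink the weights of over-represented ones.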
Types of Sample Weights (cont.)
• Non-response weights: designed
to inflate the weights of survey
respondents to compensate for
nonrespondents with similar
characteristics.
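
A weighting-class adjustment is one way to do this: within each class of similar units, respondent weights are scaled by the ratio of the total sampled weight to the total respondent weight. The sketch assumes that method; the record layout and class labels are hypothetical.

```python
def nonresponse_adjusted(records):
    """Scale respondent weights so each class keeps its full sampled weight."""
    sampled, responded = {}, {}
    for r in records:
        c = r["cls"]
        sampled[c] = sampled.get(c, 0.0) + r["weight"]
        if r["responded"]:
            responded[c] = responded.get(c, 0.0) + r["weight"]
    adjusted = []
    for r in records:
        if r["responded"]:
            factor = sampled[r["cls"]] / responded[r["cls"]]
            adjusted.append({**r, "weight": r["weight"] * factor})
    # Note: a class with no respondents at all drops out entirely in this sketch.
    return adjusted

# Hypothetical sample: half of the "young" class responds, all of the "old" class.
records = (
    [{"cls": "young", "weight": 10.0, "responded": i < 2} for i in range(4)]
    + [{"cls": "old", "weight": 10.0, "responded": True} for _ in range(4)]
)
adjusted = nonresponse_adjusted(records)
```

The two young respondents end up carrying the weight of the two young nonrespondents, so the adjusted weights still sum to the original sampled total.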
Types of Sample Weights (cont.)
• “Blow-up” (expansion) weights:
provide estimates for the total
population of interest
Types of Sample Weights (cont.)
• Replicate weights: A series of weight variables that are used instead of PSUs and strata in an effort to protect the respondents' identity. Both the selection weight and the replicate weights are needed: the selection weight gives the point estimate, and the replicate weights give its standard error.
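
The general recipe can be sketched as follows: compute the estimate once with the full-sample weight, recompute it with each replicate weight, and scale the sum of squared deviations. The scale factor depends on the replication method and is an assumption here (for example, (R - 1)/R for a delete-one jackknife with R replicates); the documentation of the dataset in use states the correct value. All names and data are hypothetical.

```python
def weighted_mean(y, w):
    """Weighted mean of y using weights w."""
    return sum(yi * wi for yi, wi in zip(y, w)) / sum(w)

def replicate_variance(y, full_weight, replicate_weights, scale):
    """Point estimate from the full weight; variance from the replicate estimates."""
    theta = weighted_mean(y, full_weight)
    reps = [weighted_mean(y, w) for w in replicate_weights]
    variance = scale * sum((t - theta) ** 2 for t in reps)
    return theta, variance

# Toy data: four observations and delete-one jackknife replicate weights.
y = [1.0, 2.0, 3.0, 4.0]
full_weight = [1.0, 1.0, 1.0, 1.0]
replicate_weights = [
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
R = len(replicate_weights)
theta, variance = replicate_variance(y, full_weight, replicate_weights, (R - 1) / R)
```

For this toy case the jackknife variance matches the textbook variance of a sample mean, s²/n, which is a quick sanity check that the scale factor fits the replicate construction.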
Complex Survey Design Effect
• Complex designs with clustering and unequal selection probabilities generally increase the sampling variance.
• Not accounting for the complex sample design can lead to biased point estimates and understated standard errors.
Complex Survey Design Effect
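• The ratio of the design-based sampling variance to the SRS sampling variance of an estimate:
• Deff = Var(des) / Var(srs)
• The corresponding ratio of standard errors is Deft = SE(des) / SE(srs) = √Deff
• For a cluster sample, Deff = 1 + ρ (n - 1)
where ρ is the intraclass correlation and n is the number of elements sampled in each cluster.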
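
A short numeric sketch of the cluster formula, assuming equal-sized clusters. The values of ρ, the cluster size, and the total sample size are invented for illustration.

```python
import math

def design_effect(rho, cluster_size):
    """Deff = 1 + rho * (n - 1) for clusters of equal size n."""
    return 1 + rho * (cluster_size - 1)

# Hypothetical values: intraclass correlation 0.05, 21 elements per cluster.
deff = design_effect(0.05, 21)

# The square root of the variance ratio is the multiplier for the standard error.
se_multiplier = math.sqrt(deff)

# Effective sample size: how many SRS observations the clustered sample is worth.
total_n = 2100
effective_n = total_n / deff
```

Even a small intraclass correlation roughly doubles the variance here, so 2,100 clustered observations carry about as much information as 1,050 from a simple random sample.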
How can we adjust for
the design effects?
• Find variables identifying the primary
sampling units (psu), the strata, and the
weight(s).
• Use appropriate software to adjust for
the design effect.
Syntax Examples of Design-based
Analysis in SAS, STATA & SUDAAN
SAS
proc surveyreg data=nhanes;
strata strata;
cluster psu;
class sex race;
model fatintk = age sex race;
weight finalwt;
run;
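STATA
svyset strata strata
svyset psu psu
svyset pweight finalwt
svyreg fatintk age male black hispanic
* In newer Stata releases the equivalent is:
* svyset psu [pweight=finalwt], strata(strata)
* svy: regress fatintk age male black hispanic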
Syntax Examples of Design-based
Analysis in STATA, SUDAAN & SAS
SUDAAN
proc regress data="c:\nhanes.sav" filetype=spss design=wr;
nest strata psu;
weight finalwt;
subgroup sex race;
levels 2 3;
model fatintk = age sex race;