
Treasure Trove of Data:
Conducting Research Using
Federal Statistical Surveys
So many unanswered research questions…
Census
Publications
The World of
Printed Reports:
Statistical Abstract,
1902, 580 pages
Cost of Living
Measurement
… but seriously folks… There is a Hierarchy of
Federal Data
 Published aggregates – dating back over a century, and now mostly available electronically
 Some predetermined geography and categories
 The thinner the data “slice,” the more confidentiality protection, i.e. the data’s not there anymore
 Public Use file
 A sub-sample of the data, only feasible for large samples
 …but also with confidentiality protection (see above)
 Synthetic Data (new approach)
 Restricted Use Micro Data
 Proposals for research required
 Special access arrangements, terms of use, etc.
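The point about thin data “slices” can be illustrated with a minimal Python sketch of small-cell suppression. The threshold of 5, the field names, and the records are all made up for illustration; the actual disclosure rules used by the agencies are not stated here and may well differ.

```python
from collections import Counter

# Hypothetical threshold: NOT an official agency rule, just an assumption.
MIN_CELL_SIZE = 5


def tabulate_with_suppression(records, keys, min_cell_size=MIN_CELL_SIZE):
    """Count records per cell and suppress (None) any cell below the threshold."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {cell: (n if n >= min_cell_size else None) for cell, n in counts.items()}


# Made-up records: 12 + 2 in one state, 7 in another.
records = (
    [{"state": "MI", "industry": "auto"}] * 12
    + [{"state": "MI", "industry": "mining"}] * 2
    + [{"state": "OH", "industry": "auto"}] * 7
)

# A coarse slice (state only) keeps every cell.
by_state = tabulate_with_suppression(records, ["state"])

# A thinner slice (state by industry) loses the two-record cell.
by_state_industry = tabulate_with_suppression(records, ["state", "industry"])
```

In this toy table the state totals survive, but the two-record MI/mining cell comes back as None once the slice gets thinner.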
Public use data
Census Research Data centers
Demographic Data
 1970,
1980, 1990 and 2000 Decennial
Long Form (back to 1940 soon)
 American Community Survey
(effectively replacing the long form)
 March CPS Earnings Supplements
 Survey of Income and Program
Participation
 American Housing Survey
Economic Data Sets
Annual Survey of Manufactures
Census of Construction
Census of Finance and Insurance
Census of Manufactures
Census of Mining
Census of Real Estate
Census of Retail
Census of Services
Census of Transportation
Census of Wholesale
Characteristics of Business Owners
Survey
Commodity Flow Survey
Auxiliary Establishment Survey
Longitudinal Business Database
Longitudinal Research Database
Manufacturing Energy Consumption
Survey
Medical Expenditure Panel Survey,
Insurance Component
National Employer Survey
Pollution Abatement Costs and
Expenditures
Quarterly Financial Reports
Research and Development Survey
Survey of Manufacturing Technology
Worker Establishment Characteristics
Database
R&D and Innovation Survey
Read the Forms!
Linked Household / Business data
 Longitudinal Employer Household Dynamics (LEHD)
 Links households to place of employment
 Based on unemployment insurance administrative records
 Covers most states
 Quarterly starting in 1990
 “Tracks” a person based on their place of employment
 Establishment (i.e. the place of work) is exact for single plant companies
 Establishment is assigned for all others (using geography and industry to improve matches)
Google “LEHD on the map”…
How to Apply
 Preliminary Proposal Must Meet Basic Requirements
 Need for Non-Public data
 Maintains Confidentiality
 Feasibility
 Describes Census Benefits (LEGAL REQUIREMENT)
 Scientific Merit
 Work with Census Administrator to Craft Final Proposal
Restricted use Health data
Why is there health data at the Census
RDCs?
 This data is collected by:
 National Center for Health Statistics (NCHS)
 Agency for Healthcare Research and Quality (AHRQ)
 Dual mission: to provide broad access to health
data and statistics, while protecting the privacy of
respondents
 Most Research uses the Public Use file
 NCHS and AHRQ RDCs created to provide
access to restricted use files
 Now available at all Census RDCs
What type of data is it? NCHS Data
National Health Status Surveys
 National Health and Nutrition Examination
Survey (NHANES) I, II, and III
 National Health Interview Survey (NHIS)
 Longitudinal Study on Aging I and II (LSOA)
 National Survey of Family Growth
 National Survey of Children's Health
 National Survey of Early Childhood Health
 National Survey of Children with Special Health Care Needs
 National Asthma Survey
National Health Care Surveys
 National Ambulatory Medical Care Survey
 National Hospital Ambulatory Medical Care
Survey
 National Survey of Ambulatory Surgery
 National Hospital Discharge Survey
o National Nursing Home Survey (NNHS)
o National Home and Hospice Care Survey
o National Employer Health Insurance Survey
o National Health Provider Inventory
o National Immunization Survey
Vital Statistics
o Mortality and Multiple Mortality
o Birth
o Fetal Death
o National Death Index
o Marriage and Divorce
Linked Data Sets
o Linked mortality data: NHIS, NHANES, LSOA II, NNHS
o Linked Medicare Enrollment and Claims
data: NHIS, NHANES, LSOA II
o Linked Social Security Administration
Data: NHIS, NHANES, LSOA II, NNHS
o Linked EPA data
What is restricted in the public use files but
available in the RDC?
 Every survey has at least some data that is
restricted for confidentiality
 Data can be restricted in a number of ways:
 Individual variables:
 Removed
 Top-coded, bottom-coded, coarsened or masked
 Artificial information is substituted
 Pieces of datasets are restricted
 Whole datasets are unavailable (particularly linked files)
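Top-coding, bottom-coding and coarsening are easier to picture with a small Python sketch. Everything specific here is an assumption for illustration: the caps, the five-year age bands, and the choice to coarsen an ICD-9 code to its part before the decimal point are not taken from any NCHS or AHRQ rule.

```python
def top_code(value, cap):
    """Replace any value above the cap with the cap itself."""
    return min(value, cap)


def bottom_code(value, floor):
    """Replace any value below the floor with the floor itself."""
    return max(value, floor)


def coarsen_age(age, width=5):
    """Report an age band instead of an exact age (hypothetical band width)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def coarsen_icd9(code):
    """Keep only the part of an ICD-9 code before the decimal point."""
    return code.split(".")[0]


# Made-up respondent record and made-up caps.
raw = {"income": 250000, "age": 93, "weeks_worked": 0, "dx": "250.01"}
public = {
    "income": top_code(raw["income"], 100000),           # top-coded
    "age": top_code(raw["age"], 85),                     # top-coded
    "weeks_worked": bottom_code(raw["weeks_worked"], 1), # bottom-coded
    "dx": coarsen_icd9(raw["dx"]),                       # coarsened
}
age_band = coarsen_age(37)                               # coarsened
```

The public version of the record keeps the general shape of the respondent but no longer carries the exact income, age, or diagnosis.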

What’s restricted? Variables
Examples of restricted variables:
 Geographic variables (state, county, or metropolitan area)
 Most dates (date of interview, date of death, date of birth)
 Income and employment data (industry codes)
 Specific diagnoses (ICD-9 codes are generally coarsened)
 Details about facilities (accreditation, payments, number of employees)
 Some information about children and adolescents (e.g. height and weight, depression, behavior problems, and drug use)
 Some information about race, ethnicity, and country of origin
 Contextual data (nearest hospital, % of population with diploma)
 Sample design variables (necessary for estimating variances)
What’s restricted? Pieces of datasets
Examples
 Contextual data: data can be linked to information about area (e.g., number of hospitals, education in county, MEPS Area Resource File)
 Medical Expenditure Panel Survey: Provider,
Insurance, and Nursing Home Component
 NHANES III: Youth Conduct Disorder Datasets, Los
Angeles Demographic Dataset, Diagnostic Interview
Schedule for Children
 National Survey of Family Growth: self-report data and interviewer comments

What’s restricted? Datasets
 Linked data sets:
 Mortality files linked to NHANES, NHIS, LSOA
 EPA emissions data linked to NHDS, NHIS, NHANES
 Social Security linked to NHANES, NHIS, LSOA
 Medicare files linked to NHANES, NHIS, LSOA
 Other datasets unavailable:
 National Employer Health Insurance Survey
 National Death Index

How can I access it?
 Submit a proposal to NCHS or AHRQ
 NCHS/AHRQ evaluates for feasibility, availability
of computing resources, and likelihood of
disclosure of confidential info (NOT for scientific
merit)
 If approved, researcher sends public use data and
code
 NCHS/AHRQ staff merges public use data with
restricted data to create a file for use by researcher
 Files are only created by NCHS/AHRQ staff
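The merge step that staff perform can be pictured roughly as a join on a respondent identifier. This is only a sketch under stated assumptions: the key name resp_id, the variable names, and the values are hypothetical, and the actual procedure NCHS/AHRQ staff follow is not described here.

```python
def merge_on_id(public_rows, restricted_rows, id_key="resp_id"):
    """Inner-join two lists of dicts on a shared identifier (hypothetical key)."""
    restricted_by_id = {row[id_key]: row for row in restricted_rows}
    merged = []
    for row in public_rows:
        extra = restricted_by_id.get(row[id_key])
        if extra is None:
            continue  # no restricted record for this respondent; drop the row
        merged.append({**row, **extra})
    return merged


# Made-up extract sent by the researcher (public use variables only).
public_rows = [
    {"resp_id": 1, "age_band": "35-39"},
    {"resp_id": 2, "age_band": "40-44"},
    {"resp_id": 3, "age_band": "60-64"},
]

# Made-up restricted variables held by the agency (e.g. geography).
restricted_rows = [
    {"resp_id": 1, "county": "county_A"},
    {"resp_id": 3, "county": "county_B"},
]

analysis_file = merge_on_id(public_rows, restricted_rows)
```

Here the analysis file ends up with the two respondents found in both sources, each carrying the public variable plus the restricted one.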
How can I access it?
 Proposal must include
 Full research proposal
 Explanation of why public-use files are insufficient
 Data dictionary, which must identify files and years, target
sample, and variables
 Sample code, examples of desired output, and software
requirements
 Resumes of researchers, sources of funding, and proposed
dates when analysis will take place
How can I access it?
(Working through NCHS/AHRQ)
 Working at NCHS or AHRQ RDCs (both in Hyattsville, MD)
 RDC analyst prepares data prior to researcher’s arrival
 Researchers cannot merge own data sets or work with more than one data set at a time
 All output and notes must be reviewed before removal; data files cannot be removed
 Support is available from RDC staff
 Working with NCHS remotely
 Researchers send code via email and receive output back via email
 Only certain SAS/SUDAAN procedures permitted; no access to
micro data
 Working with AHRQ remotely
 AHRQ has no remote server
 Possibility of writing task order for AHRQ