Scottish Social Survey Network:
Master Class 1
Data Analysis with Stata
Dr Vernon Gayle and Dr Paul Lambert
23rd January 2008, University of Stirling
The SSSN is funded under Phase II of the ESRC
Research Development Initiative
Master Class 1: Data Analysis with Stata, 23/1/08
Introductions and generic resources
1030-1100, 2V1 Data Analysis and Data Management with Stata (PL)
1100-1130, 2V1 Introduction to the Stata interface (VG)
1130-1300, 2A21 Computer Lab: Data analysis and data construction for
complex survey data
[Lunch in 2X6]
2V1 and 2A21 Specialist topics and illustrative examples
1400-1445 Handling coefficients (VG)
1445-1515 Sample selected data (VG)
[Coffee in 2X6]
1545-1615 Multilevel data and analysis (PL)
1615-1645 Handling occupational data (PL)
Reminder: Scottish Social Survey Network seminar on ‘Scotland’s Large Scale
Datasets’, 1500-1700 on 24th January 2008, University of Stirling
Data Analysis and Data Management
with Stata
1) Background: Integrating data analysis and
data management
2) Stata and data management
- Lab: Some useful Stata routines / functions
Background: Integrating data
management and data analysis
“A programme like SPSS … has two main components: the
statistical routines, that do the numerical calculations…, and the
data management facilities. Perhaps surprisingly, it was the
latter that really revolutionised quantitative social research”
(Procter, 2001:253).
By Data management we mean:
→ Matching data files together
→ ‘Cleaning’ data
→ Operationalising variables
→ Accessing and reviewing data
Research interests, data analysis and
data management (1)
1) Research-led pressures for large and
complex survey data
– Longitudinal surveys
– Linked data projects
• e.g. administrative data; health data; GIS
– Comparative research
• e.g. x-national, historical
→ Social survey researchers enjoy access to a vast
array of micro-data resources, many of which
have (sometimes hidden) complexity
Check: what is large and complex
social survey data?
1. Array of variables / operationalisations
– Competing measures; interaction effects; latent variables
2. Multiple related data files
– Linked component datasets
– External data (e.g. aggregate and micro-data)
3. {Large volumes of cases}
4. Relations between cases
5. Multiple hierarchies of measurement
6. Multiple points of measurement
– Unbalanced repeated contacts
– {Censored} duration data
– International comparative survey designs
7. Sample collection and weighting data
Example: Multiple measurement points
(BHPS Unbalanced panel)

Wave  Person  Person-level Vars →
1     1       1   38   1   36
1     2       2   34   2    0
1     3       2    6   9
2     1       1   39   1   38
2     2       2   35   1   16
3     1       1   40   1   36
3     2       2   36   1   18
3     3       2    8   9

N_w=3   N_p=3
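A minimal Stata sketch of how an unbalanced panel like the one above might be inspected. The rows reuse the wave, person and second person-level column from the slide; the variable names (wave, pid, age) and the reading of that column as age are assumptions, not BHPS variable names.

```stata
* Toy panel typed in by hand (hypothetical variable names)
clear
input wave pid age
1 1 38
1 2 34
1 3 6
2 1 39
2 2 35
3 1 40
3 2 36
3 3 8
end

* Number of waves each person contributes
bysort pid (wave): generate nwaves = _N

* Flag records that follow a skipped wave
by pid: generate gap = (wave - wave[_n-1] > 1) if _n > 1

* Declare the panel structure and show the participation patterns
xtset pid wave
xtdescribe
list, sepby(pid)
```

Person 3 is the unbalanced case here: present at waves 1 and 3 but not at wave 2.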
E.g.: array of variables and sample selection (BHPS occ data)
Example: Relations between cases
Check: Variable operationalisations?
processes by which survey measures are defined
and subsequently interpreted by research analysts
• Some prescriptive advice (e.g. ONS, EU)
• Variable operationalisations in longitudinal research
– http://www.longitudinal.stir.ac.uk/variables/
• Themes from comparative research
– ‘universality’ and ‘specificity’
– Importance of documentation / metadata
– {See Scottish Social Survey Network seminar tomorrow 24th Jan}
– {See example on occupations this afternoon}
Student’s Law: …In survey data analysis, somebody else has
already struggled through the variable constructions you are
working on right now…
Research interests, data analysis and
data management (2)
2) Availability and advocacy of complex methods of data
analysis
– Complex statistical approaches
• Multi-process models (CQeSS, http://e-science.lancs.ac.uk/cqess/)
• Latent variable and Multilevel analysis
• Missing data analysis (e.g. www.missingdata.org.uk)
• See the SSSN Master Class programme..!!
– Challenging methodological approaches
• Mixed methods research
• See esp. the ESRC NCRM (http://www.ncrm.ac.uk/ )
→ Daily work of survey researchers straddles social science
and statistical traditions
A research capacity shortfall?
• Concern that UK lacks sufficient trained social
researchers with quantitative analytical skills
• Criticism that social scientists don’t sufficiently exploit
empirical survey data
– Insufficient impact of published analyses
– Published analyses are too simple and crude
– {this doesn’t really apply to economics!}
→ This is in some ways a puzzle, given dramatic progress
in the availability of survey data (e.g. www.data-archive.ac.uk) and in resources for statistical analysis
Returning to survey data management…
• Simple survey data management
– Short recodes; selecting cases; one small data file
→ Taught in many textbooks and reasonably widely
understood by most users of SPSS, Stata, etc
• Complex survey data management
– Matching multiple data files; complex variable
operationalisations; complex relations between cases
→ Is rarely taught in textbooks/courses
→ Is usually required at some stage
→ Often puts off non-specialists
• A substantial social science need for improved
standards and resources in data management
[Diagram: resources across three stages of research: Data access / collection; Data Management; Data Analysis. Labels shown: UK Data Archive; Qualidata; Flagship social surveys; Office for National Statistics; Administrative data; Specialist academic outputs; DAMES; ONS support; ESDS support; NCRM workshops; Essex summer school; ESRC RDI initiatives; CQeSS]
→ In practice, social researchers often spend more time on data
management than any other part of the research process
→ A ‘methodology’ of data management is relevant to social science
literatures on ‘harmonisation’, ‘comparability’
Confronting complex data management…
There are two related possibilities
i. Generic resources and services for (survey)
data management
– Format independence
– Computer science research (e-science)
ii. Specialist support for key social survey data
management approaches
– Directed to specific software formats
– Directed to specific example datasets
(i) DAMES – Data Management
through e-Social Science
ESRC National Centre for e-Social Science research Node,
University of Stirling / University of Glasgow, 2008-2011
Case studies, provision and support for data
management in the social sciences
4 social science themes
1) Grid Enabled Specialist Data Environments
• occupations; education; ethnicity
2) Micro-simulation on social care data
3) Linking e-Health and social science databases
4) Training and interfaces for data management support
Underlying computer science research themes
– Linking heterogeneous and distributed data; metadata; data
abstraction and data fusion; workflow modelling; data security
(ii) Specialist support for survey
research communities
– Scottish Social Survey Network
– Focussed advice on smallish range of
• Key surveys
• Key variables
• Stata and survey data management
– Stata combines extensive routines for data analysis
with extensive routines for data management
Data Analysis and Data Management
with Stata
1) Background: Integrating data analysis and data
management
2) Stata and data management
- Lab: Some useful Stata routines / functions
Stata and its competitors (1)
Claim: Stata offers unparalleled convenience in
combining pre-programmed data analytical and data
management functionality
• Ease of data access, manipulation and review
– Conditional processing (‘if’, ‘by’)
– Succinct command syntax
– Ability to read online files
• Exporting / saving results and graphs
– Regression model outputs
– Matrix manipulation of model results
• Development of new analytical routines
– Research community posting new models (researcher driven)
– Complex data estimators (svy; cluster; xt; xtmixed)
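A short sketch of the features claimed above, using the `auto` example dataset that ships with Stata. The choice of dataset and variables is purely illustrative and is not taken from the lab files.

```stata
* Load a dataset installed with Stata
sysuse auto, clear

* 'if' restricts a command to the cases that meet a condition
summarize price if foreign == 1

* 'by' repeats a command within groups (bysort sorts the data first)
bysort foreign: summarize mpg

* Regression results are stored and can be handled as matrices
regress price mpg weight
matrix b = e(b)
matrix list b

* Individual coefficients can also be picked out by name
display _b[mpg]
```

The same `if` and `by` qualifiers work with almost every Stata command, which is much of what keeps the syntax succinct.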
Stata and its competitors (2)
Claim: Stata is ultimately much more powerful, but it is not
always well designed
• Batch files / interactive syntax / programs:
– Stata has more flexibility, but SPSS interactive syntax is easier (e.g. delimiters)
• Direct data entry / browsing
– Stata is clumsy – easier to use SPSS or another package
• Variable and value labels and presenting outputs
– SPSS quicker and better presentation; Stata needs more effort
• Computing / recoding / conditional processing
– Stata more extensive (eg ‘by’ and ‘if’); SPSS easier to use – eg Stata’s ‘generate’ won’t
overwrite an existing variable (‘replace’ is needed)
• Missing values / weighting data
– Stata’s default settings cause more confusion than SPSS
– Stata has some restrictions on its weights / SPSS easier
• Complex data estimators (svy; cluster; xt; xtmixed)
– Unique and advantageous feature of Stata
– But many Stata models are very slow to estimate – e.g. GLLAMM
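Two of the points above, overwriting variables and the handling of missing values, can be sketched as follows. This again assumes the `auto` example dataset installed with Stata; the variable `heavy` is a hypothetical name.

```stata
sysuse auto, clear

* generate creates a new variable
generate heavy = (weight > 3000)

* A second 'generate heavy = ...' would stop with an "already defined" error;
* changing an existing variable needs 'replace'
replace heavy = 0 if foreign == 1

* Numeric missing (.) is treated as larger than any number, so the first
* count below silently includes cars with missing rep78; the second does not
count if rep78 > 3
count if rep78 > 3 & !missing(rep78)
```

Comparing the two counts is a quick way to see how many missing values a bare `>` condition has picked up.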
Some existing resources on data
management
• Stata’s files: http://www.stata.com/support/faqs/data/
• LDA WebCT site www.longitudinal.stir.ac.uk, worked
examples of data management on complex survey
data using SPSS and Stata:
– ‘introductory training in data analysis’
– ‘longitudinal research resources’
– Model – ‘learn by doing’…
• Researcher input:
– Importance of logging your work (‘syntax’ / ‘do’ files)
– Consistent use of file paths / annotation of command files
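One possible layout for a ‘do’ file that follows the logging and file path advice above. The folder and file names are hypothetical placeholders, and this is only a sketch of one convention, not the layout used in the LDA examples.

```stata
* analysis.do : hypothetical do-file skeleton
version 10
clear
set more off

* Close any log left open by an earlier run, then start a fresh one
capture log close
log using "analysis.log", replace text

* One macro holds the data folder, so the path is edited in one place only
* (hypothetical location; forward slashes work on all platforms)
global datadir "C:/data/bhps"

* Hypothetical file name
use "$datadir/file1.dta", clear
describe

log close
```

Running the whole file from top to bottom each time keeps the log a complete record of how the results were produced.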
Stata lab 23/1/08: illustrating integrated
data management and analysis
• Example files from ‘Longitudinal data
analysis’ www.longitudinal.stir.ac.uk
– 4 LDA files with extended examples
– {Data (from UKDA) should be in place on
machines for today}
→ First lab: a selective summary file
→ Concentrates on matching data and
manipulating variables
Variable management in Stata
• Painful text value label processes..
• Recoding data examples
• Use of ‘do’ and ‘ado’ batch files
• Matching with aggregate datasets
• Further resources on operationalising
variables: see talk on ‘Handling
occupational data’
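A brief sketch of the value label and recoding steps listed above, using the `auto` dataset installed with Stata. The label text and the new variable name `rep3` are invented for illustration.

```stata
sysuse auto, clear

* Value labels are defined first, then attached to a variable
* (the label wording here is hypothetical)
label define replbl 1 "Poor" 2 "Fair" 3 "Average" 4 "Good" 5 "Excellent"
label values rep78 replbl

* recode into a new variable, leaving the original untouched;
* the text after each new code becomes its value label
recode rep78 (1 2 = 1 "Low") (3 = 2 "Middle") (4 5 = 3 "High"), generate(rep3)

* Check the recode against the original, including missing values
tabulate rep78 rep3, missing
```

Cross-tabulating old against new is a cheap check that every original category landed where intended.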
Matching files
• Complex data inevitably involves more than
one related data file
– Multiple related files are almost inevitable with
longitudinal data collections
• A vital data analysis skill!!
– Link data between files by connecting them according
to key linking variable(s)
– Eg, ‘person identifier’ variable ‘pid’
– Eg : iserwww.essex.ac.uk/ulsc/bhps/doc/
See SPSS and Stata example command files
within LDA Website
Types of file matching
1. Addition of files
– E.g. two files with same variables for different
people
• Stata: append using file2.dta
• SPSS: add files file="file1.sav" /file="file2.sav" .
2. Case-to-case matching
– One-to-one link, eg two files with different sets of
variables for same people
• Stata: merge pid using file2.dta
• SPSS: match files file="file1.sav" /file="file2.sav" /by=pid.
3. Table distribution
– One-to-many link, eg one file has individuals,
another has households, and match household info
to the individuals
• Stata: merge pid using file2.dta
• SPSS: match files file="file1.sav" /table="file2.sav" /by=pid .
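The three match types above can be sketched on toy data. This assumes Stata 11 or later, where `merge` states the match type explicitly (`1:1`, `m:1`); the slides use the older `merge pid using` form, which expects both files to be sorted on the key. All file contents and the household identifier `hid` are hypothetical.

```stata
* Build small hypothetical files so the sketch is self-contained
clear
input pid hid age
1 10 38
2 10 34
end
tempfile file1
save `file1'

clear
input pid hid age
3 11 6
end
tempfile file2
save `file2'

clear
input pid inc
1 200
2 150
3 0
end
tempfile incomes
save `incomes'

clear
input hid region
10 1
11 2
end
tempfile households
save `households'

* 1. Addition of files: stack the rows of file2 under file1
use `file1', clear
append using `file2'

* 2. Case-to-case matching: one row per pid in both files
merge 1:1 pid using `incomes'
drop _merge

* 3. Table distribution: many persons share one household row
merge m:1 hid using `households'
drop _merge
list
```

Stating `1:1` or `m:1` makes Stata stop with an error if the key is not unique where it should be, which catches many matching mistakes early.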
Types of file matching, ctd.
4. Aggregating
– Summarise over multiple cases
– Stata: collapse (mean) inc , by(pid)
or egen avinc=mean(inc), by(pid)
– SPSS: aggregate outfile="file2.sav" /break=pid
/avinc=mean(inc) .
– Output files from aggregate / collapse are often linked
back into the micro-data from which they are derived
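A sketch of aggregating and then linking the summary back to the micro-data, on a hypothetical toy panel. It assumes Stata 11 or later for the `merge m:1` syntax and the `nogenerate` option.

```stata
* Hypothetical person-wave income records
clear
input pid wave inc
1 1 100
1 2 120
2 1 80
2 2 90
2 3 100
end
tempfile panel
save `panel'

* collapse replaces the data with one row per person
collapse (mean) avinc = inc, by(pid)
tempfile means
save `means'

* Link the person means back onto the wave-level records
use `panel', clear
merge m:1 pid using `means', nogenerate

* egen reaches the same result in one step, without leaving the micro-data
egen avinc2 = mean(inc), by(pid)
list, sepby(pid)
```

The two columns `avinc` and `avinc2` should agree on every row; `collapse` is the better fit when the aggregate file itself is needed, `egen` when it is not.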
5. Related cases matching
– Link info from one related case to another case, eg info
on spouse put on own case
– Stata: merge pid using file2.dta
or joinby …
– SPSS: match files file="file1.sav" /file="file2.sav" /by=pid.
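One way to put spouse information on a person's own record is to build a lookup file keyed on the spouse identifier and merge it back. The sketch below assumes Stata 11 or later and uses hypothetical variables `spid` (spouse's pid) and `spinc` (spouse's income); it is not the approach of the LDA example files.

```stata
* Hypothetical file: person 3 has no spouse in the data
clear
input pid spid inc
1 2 200
2 1 150
3 . 90
end
tempfile own
save `own'

* Lookup file: each person's income, relabelled as spouse information
keep pid inc
rename pid spid
rename inc spinc
tempfile spouse
save `spouse'

* Attach the spouse's income to each person's own record
use `own', clear
merge m:1 spid using `spouse', keep(master match)
drop _merge
list
```

The `keep(master match)` option discards lookup rows that nobody points to, so the result keeps exactly the original cases.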
File matching crib:
Stata:
_merge = indicator of cases present for:
1 = Master file but not input (‘using’) file
2 = Input (‘using’) file but not Master file
3 = Master and input (‘using’) file
Remember to drop auto-generated _merge before
performing next merge command
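A small sketch showing all three `_merge` codes at once. The toy files are hypothetical, and the `merge 1:1` syntax assumes Stata 11 or later.

```stata
* Master file: persons 1 to 3
clear
input pid age
1 38
2 34
3 6
end
tempfile master
save `master'

* Using file: persons 2 to 4
clear
input pid inc
2 150
3 0
4 300
end
tempfile using_file
save `using_file'

use `master', clear
merge 1:1 pid using `using_file'

* pid 1 gets _merge 1, pid 2 and 3 get _merge 3, pid 4 gets _merge 2
tabulate _merge
list

* Keep the matched cases, then drop _merge before any further merge
keep if _merge == 3
drop _merge
```

Tabulating `_merge` after every match is a useful habit: unexpected counts of 1s or 2s usually point to a problem with the key variable.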