
Data Management through e-Social Science

Workshop at the Fourth International Conference on e-Social Science, University of Manchester, 18th June 2008

A workshop organised by the DAMES research Node of the National Centre for e-Social Science
www.dames.org.uk / www.ncess.ac.uk

DAMES, 18 Jun 2008

Data Management through e-Social Science: Workshop Timetable

1400-1440  Introduction to Data Management in the Social Sciences (Paul Lambert, Univ. Stirling)
1440-1500  E-Science Approaches in the DAMES Node (Simon Jones, Univ. Stirling)
1500-1520  Social Science Requirements: Examples of data on social care and on health (Alison Dawson, Univ. Stirling)
1540-1600  Security Approaches and Requirements in Applied Data Projects (John Watt, Univ. Glasgow)
1600-1640  Distributed Data Linking using OGSA-DAI and OGSA-DQP (Ally Hume, Univ. Edinburgh)
1640-1700  Closing panel and discussion


Talk 1

Introduction to Data Management in the Social Sciences

1) The nature of data management
2) Existing resources for social scientists
3) The contributions of…
   • e-Social Science
   • the DAMES Node (www.dames.org.uk)
4) Review: Data Management in Quantitative Social Science Research

‘Data management’ means…

‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […our own poster.., 18.6.08]

- Usually performed by social scientists themselves
- Most overt in quantitative survey data analysis
  • ‘variable constructions’, ‘data manipulations’
- Usually a substantial component of the work process

Some components…

- Manipulating data
  • Recoding categories / ‘operationalising’ variables
- Linking data
  • Linking related data (e.g. longitudinal studies)
  • Combining / enhancing data (e.g. linking micro- and macro-data)
- Secure access to data
  • Linking data with different levels of access permission
  • Detailed access to micro-data cf. access restrictions
- Harmonisation standards
  • Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
  • Recommendations on particular ‘variable constructions’
- Cleaning data
  • ‘missing values’; implausible responses; extreme values

Example – recoding data

Count: Highest educational qualification (rows) by educ4 (columns)
educ4 categories: -9.00; 1.00 Degree; 2.00 Diploma; 3.00 Higher school or vocational; 4.00 School level or below

Highest educational qualification   -9.00   1.00   2.00   3.00   4.00  Total
-9 Missing or wild                    323      0      0      0      0    323
-7 Proxy respondent                   982      0      0      0      0    982
 1 Higher Degree                        0    425      0      0      0    425
 2 First Degree                         0   1597      0      0      0   1597
 3 Teaching QF                          0      0    340      0      0    340
 4 Other Higher QF                      0      0   3434      0      0   3434
 5 Nursing QF                           0      0    161      0      0    161
 6 GCE A Levels                         0      0      0   1811      0   1811
 7 GCE O Levels or Equiv                0      0      0      0   2518   2518
 8 Commercial QF, No O Levels           0      0      0    331      0    331
 9 CSE Grade 2-5, Scot Grade 4-5        0      0      0      0    421    421
10 Apprenticeship                       0      0      0    257      0    257
11 Other QF                           102      0      0      0      0    102
12 No QF                                0      0      0      0   2787   2787
13 Still At School No QF              138      0      0      0      0    138
Total                                1545   2022   3935   2399   5726  15627
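The recode shown in this cross-tabulation can be sketched in a few lines of Python. This is a minimal illustration, not the syntax used to produce the slide: the mapping is read off the non-zero cells and row totals of the table, `educ4` is the variable name from the slide, and the function names are hypothetical.

```python
from collections import Counter

# Original qualification code -> educ4 code, inferred from the non-zero
# cells of the cross-tabulation (an assumption, not the authors' syntax).
EDUC4_MAP = {
    -9: -9, -7: -9, 11: -9, 13: -9,  # missing, proxy, other QF, still at school
    1: 1, 2: 1,                      # 1 = Degree
    3: 2, 4: 2, 5: 2,                # 2 = Diploma
    6: 3, 8: 3, 10: 3,               # 3 = Higher school or vocational
    7: 4, 9: 4, 12: 4,               # 4 = School level or below
}

# Frequencies of the original variable, as given in the table.
ORIGINAL_COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161, 6: 1811,
    7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}


def recode_educ4(code):
    """Return the educ4 category for one original qualification code."""
    if code not in EDUC4_MAP:
        raise ValueError(f"unexpected qualification code: {code!r}")
    return EDUC4_MAP[code]


def recoded_counts(counts):
    """Aggregate original-category frequencies into educ4 frequencies."""
    out = Counter()
    for code, n in counts.items():
        out[recode_educ4(code)] += n
    return dict(out)


educ4_counts = recoded_counts(ORIGINAL_COUNTS)
```

Checking the recoded frequencies against the published totals is a cheap way to confirm that no category was dropped or double-counted during the recode.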

Example – Linking data

Linking via ‘ojbsoc00’:
c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk
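The linking step on this slide (attaching an externally derived variable to survey records via the occupation code ‘ojbsoc00’) can be sketched as a simple lookup merge. Everything except the key name `ojbsoc00` is an assumption for illustration: the `pid` identifier, the occupation codes and the scores are made-up placeholders, not CAMSIS values.

```python
# Hypothetical lookup table: occupation code -> occupational scale score.
# The codes and scores are placeholders, NOT values from www.camsis.stir.ac.uk.
SCALE_LOOKUP = {1111: 70.0, 2311: 65.0, 9233: 30.0}


def link_scores(records, lookup, key="ojbsoc00", new_var="scale_score"):
    """Return copies of records with new_var looked up via key.

    Records whose key has no entry in lookup get None, so unmatched
    cases stay visible instead of being silently dropped.
    """
    linked = []
    for rec in records:
        row = dict(rec)  # copy, so the original data are left untouched
        row[new_var] = lookup.get(row.get(key))
        linked.append(row)
    return linked


survey = [
    {"pid": 1, "ojbsoc00": 1111},
    {"pid": 2, "ojbsoc00": 9233},
    {"pid": 3, "ojbsoc00": -8},  # a non-substantive code: no match expected
]
linked = link_scores(survey, SCALE_LOOKUP)
unmatched = [r["pid"] for r in linked if r["scale_score"] is None]
```

Listing the unmatched cases after every link is a useful habit, since mismatched code versions are a common source of silent data loss.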

Some further considerations

- DM as stumbling block in research conduct
  • UK has ample data, ample analytical resources, but low levels of exploitation (esp. of complex data)
  • Capacity building aims in DAMES
- Lots of previous work in this field
  • ..See below..

‘Data management’ also sometimes means..

- Data distributors supplying and monitoring use of particular datasets (e.g. UK Data Archive DM guides)

DAMES research Node

[Diagram: the research process as three stages: Data access / collection; Data Management; Data Analysis. Labelled resources: UK Data Archive; Qualidata; flagship social surveys; Office for National Statistics; administrative data; specialist academic outputs; DAMES; ONS support; ESDS support; NCRM workshops; Essex summer school; ESRC RDI initiatives; CQeSS]

Social researchers often spend more time on data management than on any other part of the research process

2. Existing resources (i): Data providers

a) Documentation and metadata files

2. Existing resources (i): Data providers

b) Resources for variables
   • CESSDA PPP on key variables, http://www.nsd.uib.no/cessda/project/
   • UK Question Bank, http://qb.soc.surrey.ac.uk/
   • ONS Harmonisation, http://www.statistics.gov.uk/about/data/
c) Resources for datasets
   • UK Census data portal, http://census.ac.uk/
   • IPUMS international census data facilities, www.ipums.org
   • European Social Survey, www.europeansocialsurvey.org
d) Data manipulations prior to data release
   • Missing data imputation / documentation
   • Survey design / weighting information

Influential – most analysts use ‘the archive version’

2. Existing resources (ii)

Resource projects / infrastructures
- UK ESDS, www.esds.ac.uk
  • ESDS International | ESDS Government | ESDS Longitudinal | ESDS Qualidata
  • Helpdesks; online instructions; user support..
- UK ESRC NCRM / NCeSS / RDI initiatives
  • Longitudinal data – www.longitudinal.stir.ac.uk
  • Linking micro/macro – www.mimas.ac.uk/limmd/
- Other resources / projects / initiatives
  • EDACwowe …. http://recwowe.vitamib.com/datacentre

2. Existing resources (iii)

Analytical and software support
- Textbooks featuring data management
  • [Levesque 2008] [Sarantakos 2007]
- Software training covering DM
  • Stata’s ‘data management’ manual
  • SPSS user group course on syntax and data management, www.spssusers.co.uk

But generally, sustained marginalisation of DM as a topic
- Advanced methods texts use simplistic data
- Advanced software for analysis isn’t usually combined with extended DM requirements

2. Existing resources (iv)

Data analysts’ contributions
- Academic researchers often generate and publish their own DM resources
  • e.g. Harry Ganzeboom on education and occupations, http://home.fsw.vu.nl/~ganzeboom/pisa/
  • Provision of whole or partial syntax programming examples
- Analysts often drive wider resource provisions related to DM
  • CAMSIS project on occupational scales, www.camsis.stir.ac.uk
  • CASMIN project on education and social class

2. Existing resources (v)

Literatures on harmonisation and standardisation
- National Statistics Institutes’ principles and practices
  • E.g. ONS, www.statistics.gov.uk/about/data/harmonisation/
- Cross-national organisations
  • E.g. UNSTATS, http://unstats.un.org/unsd/class/
- Academic studies
  • E.g. [Harkness et al. 2003]; [Hoffmeyer-Zlotnik & Wolf 2003]; [Jowell et al. 2007]

3a. The contribution of e-Science

E-Science isn’t essential to good DM, but it has capacity to improve and support conduct of DM…

1) Concern with standards setting in communication and enhancement of data
2) Linking distributed / heterogeneous / dynamic data
   • Coordinating disparate resources; interrogating live resources
3) Contribution of metadata tools / standards for variable harmonisation and standardisation
4) Linking data subject to different security levels
5) The workflow nature of many DM tasks

E.g. of GEODE: Organising and distributing specialist data resources (on occupations)

3b. The contribution of DAMES

8 project themes

1.1) Grid Enabled Specialist Data Environments (‘GE*DE’)
1.2) Data resources for micro simulation on social care data
1.3) Linking e-Health and social science databases
1.4) Training and interfaces for management of complex survey data
2.1) Description, discovery & service use through metadata and data abstraction
2.2) Techniques to handle data from multiple sources
2.3) Workflow modelling for social science
2.4) Security driven data management

DAMES agenda

Useful social science provisions
- Specialist data topics – occupations; education qualifications; ethnicity; social care; health
- Mainstream packages and accessible resources

To exploit / engage with existing DM resources
- In social science – e.g. CESSDA
- In e-Science – e.g. OGSA-DAI; OMII

DAMES – key decisions / debates

- Metadata – DDI 3 approach
- Portals and servers
  • GT and Gridsphere
  • Hosting within e-Infrastructure project
- Software
  • Supported: SPSS, Stata, plain text
  • Not supported (tbc): R, SAS
- Case studies: Social scientists’ requirements statements
- Workflow models: Generic vs specific provisions

….for discussion…

4. Case study: Quantitative data analysis

DAMES’ main focus is on quantitative data analysis
- Social surveys
- Large and complex social surveys
- Other specialist data resources stored in a numerical form

Two themes:
- We’re data rich (but analyst poor)
- We work overwhelmingly through individual analysts’ micro-computing
  o Pressure for simple / accessible packages
  o Specialist development of very complex packages

Access, manipulate & analyse large flat files (‘variable by case matrix’)

Abundance of data

Thousands of datasets available
- Secondary data:
  • Download ‘micro-data’ from UK Data Archive (www.data-archive.ac.uk) or other national or international source
  • Access ‘macro-data’ online, e.g. ESDS (www.esds.ac.uk)
- Primary data: collect our own datasets

What do we mean by ‘large and complex’ surveys?
- A few thousand variables
- Tens of thousands of cases
- Repeated contacts with and/or relations between cases

App. – Selected secondary data sources

UK Data Archive          http://www.data-archive.ac.uk/
CESSDA                   http://www.nsd.uib.no/cessda/home.html
European Social Survey   http://www.europeansocialsurvey.org/
CNEF                     http://cnefusergroup.blogspot.com/
IPUMS                    http://www.ipums.org/
ISSP                     http://www.issp.org/
ESDS Macrodata           http://www.esds.ac.uk/international/

Phase 1 - Data manipulation – i) recoding data

Count: Highest educational qualification (rows) by educ4 (columns)
educ4 categories: -9.00; 1.00 Degree; 2.00 Diploma; 3.00 Higher school or vocational; 4.00 School level or below

Highest educational qualification   -9.00   1.00   2.00   3.00   4.00  Total
-9 Missing or wild                    323      0      0      0      0    323
-7 Proxy respondent                   982      0      0      0      0    982
 1 Higher Degree                        0    425      0      0      0    425
 2 First Degree                         0   1597      0      0      0   1597
 3 Teaching QF                          0      0    340      0      0    340
 4 Other Higher QF                      0      0   3434      0      0   3434
 5 Nursing QF                           0      0    161      0      0    161
 6 GCE A Levels                         0      0      0   1811      0   1811
 7 GCE O Levels or Equiv                0      0      0      0   2518   2518
 8 Commercial QF, No O Levels           0      0      0    331      0    331
 9 CSE Grade 2-5, Scot Grade 4-5        0      0      0      0    421    421
10 Apprenticeship                       0      0      0    257      0    257
11 Other QF                           102      0      0      0      0    102
12 No QF                                0      0      0      0   2787   2787
13 Still At School No QF              138      0      0      0      0    138
Total                                1545   2022   3935   2399   5726  15627

Phase 1 - Data manipulation - ii) Missing data / case selection
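A missing-data and case-selection step of the kind named on this slide might look like the following sketch. It assumes that negative codes such as -9 and -7 (the ‘missing or wild’ and ‘proxy respondent’ codes from the recoding example) mark missing values; the records, variable names and helper functions are hypothetical.

```python
# Codes treated as missing: an assumption based on the -9 / -7 categories
# in the recoding example; other surveys use other conventions.
MISSING_CODES = frozenset({-9, -7})


def clean_missing(records, variables, missing_codes=MISSING_CODES):
    """Copy records, replacing missing-value codes with None in variables."""
    cleaned = []
    for rec in records:
        row = dict(rec)
        for var in variables:
            if row.get(var) in missing_codes:
                row[var] = None
        cleaned.append(row)
    return cleaned


def complete_cases(records, variables):
    """Keep only records with a non-missing value on every listed variable."""
    return [r for r in records
            if all(r.get(v) is not None for v in variables)]


# Made-up illustrative records.
sample = [
    {"pid": 1, "educ": 2, "age": 34},
    {"pid": 2, "educ": -9, "age": 51},
    {"pid": 3, "educ": 12, "age": -7},
    {"pid": 4, "educ": 6, "age": 28},
]
cleaned = clean_missing(sample, ["educ", "age"])
analysis_sample = complete_cases(cleaned, ["educ", "age"])
```

Keeping the cleaning and the selection as separate steps makes it easy to report how many cases were lost on each variable before analysis.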

Phase 1 - Data manipulation – iii) Linking data

Linking via ‘ojbsoc00’:
c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk

Phase 2: Analytical approaches

Suites of statistical techniques involving both DA and DM...

- Simple analysis
  • Well-defined techniques (e.g. tables, graphs, bivariate correlations)
  • Main work is in manipulating data prior to analysis
- Complex analysis
  • Estimation of complex statistical models and their parameters
  • Some argue that only tremendously complex models are adequate for social science description
  • Mplus and structural equation modelling; GRID developments – SabreR at http://e-science.lancs.ac.uk/cqess/
  • Often slow to estimate on PCs: 24hrs+ is common (modest associations; sparse data)

Examples of longitudinal micro-data

- Panel studies (collect data from same subjects on multiple occasions)
- Repeated cross-sections (same type of data from different subjects)
- Complex data manipulations and variable constructions
  • e.g. British Household Panel Survey (BHPS / UKHLS)
    o 16 years of data (‘waves’); multiple data files within waves
- Complex data analysis approaches
  • Panel and Event History model estimators [e.g. Blossfeld & Rohwer 2002]

A convenient example because..
- Longitudinal data especially attractive for substantive research questions
- Requires extensive data management, and integrated DM and DA
- 3 of DAMES Co-Is undertake ESRC training in longitudinal data analysis – www.longitudinal.stir.ac.uk
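The data management burden of panel data comes largely from combining separate wave files into one person-by-wave structure. The sketch below assumes a BHPS-like convention in which every variable except a cross-wave person identifier carries a wave-letter prefix (‘a’ for wave 1, ‘b’ for wave 2, and so on); the variable names and values are made up for illustration.

```python
def stack_waves(wave_files, id_var="pid"):
    """Stack per-wave records into one person-wave ('long') list.

    wave_files maps a wave letter to that wave's records. Every variable
    except id_var is assumed to start with the wave letter, which is
    stripped so that the same variable lines up across waves.
    """
    long_rows = []
    for letter in sorted(wave_files):
        wave_no = ord(letter) - ord("a") + 1  # 'a' -> 1, 'b' -> 2, ...
        for rec in wave_files[letter]:
            row = {id_var: rec[id_var], "wave": wave_no}
            for name, value in rec.items():
                if name != id_var and name.startswith(letter):
                    row[name[len(letter):]] = value  # drop the wave prefix
            long_rows.append(row)
    long_rows.sort(key=lambda r: (r[id_var], r["wave"]))
    return long_rows


# Two hypothetical wave files; person 2 is not observed in wave 'b'.
wave_files = {
    "a": [{"pid": 1, "ajobstat": 2, "aincome": 900.0},
          {"pid": 2, "ajobstat": 3, "aincome": 0.0}],
    "b": [{"pid": 1, "bjobstat": 2, "bincome": 950.0}],
}
person_waves = stack_waves(wave_files)
```

The long format keeps one row per person per wave, so attrition (a person missing from a later wave) shows up as an absent row rather than as a block of empty columns.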

Comment on the research process:

- Tasks of data manipulation and analysis often overlap: analysis results often require responsive manipulation and further analysis
- The practical experience of most applied researchers is dominated by data management, not data analysis

Tools: Software for data analysis and DM

- SPSS, Stata, {SAS, Minitab, Excel}
  • Accessible general purpose packages: combine wide range of DM and DA functionality
  • SPSS is UK market leader; Stata a popular advanced alternative
  • Proprietary; training events; user communities; gurus
- Specialist software packages, often freeware
  • A few for data management tasks
    o Panelwhiz (Stata related) (http://www.panelwhiz.eu/)
  • Mostly for advanced analytical tasks
    o R / S-Plus; Sabre-R; LIMDEP; BUGS; MLwiN; TDA; lEM; AML

“A program like SPSS .. has two main components: the statistical routines, .. and the data management facilities. Perhaps surprisingly, it was the latter that really revolutionised quantitative social research” [Procter, 2001: 253]

“Socio-economic processes require comprehensive approaches as they are very complex (‘everything depends on everything else’). The data and computing power needed to disentangle the multiple mechanisms at work have only just become available.” [Crouchley and Fligelstone 2004]

Working patterns (1)

Occasional examples of remote access to micro-data
- LIS (http://www.lisproject.org/) – email SPSS / Stata / SAS jobs to run on LIS server, outputs returned by email
- NESSTAR (http://www.nsd.uib.no/cessda/extcessda.jsp) and SDA (http://sda.berkeley.edu/) – exploratory online analysis of large micro-data surveys
- MIMAS past (http://www.mimas.ac.uk/, no longer operational) – login to Manchester computing unix server containing relevant datasets, analyse in SPSS, Stata, etc.
- ESRC forthcoming: Secure Data Service for highly confidential data

Working patterns (2):

More usually, we access micro-data by downloading formatted datasets

Processing software

- Drop down GUIs
  • Introductory level use
  • Training books and software manuals
- Command language syntax
  • Favoured by all advanced users
  • Online training (e.g. www.longitudinal.stir.ac.uk)
  • Common to learn software commands by heart
  • Data shared through syntax command files
  • Programme routines shared via syntax files

SPSS syntax example

Stata syntax example (‘do file’)

R and Sabre-R example

A personal view on software for survey research

- Stata is the superior package for secondary survey analysis
  • Advanced data management and data analysis functionality
  • Culture of transparency of programming
- Problems with Stata
  • Proprietary and not available to all users
  • Slow estimation times

So to support social scientists (DAMES):
- Core provisions aimed at Stata, SPSS and plain text
- SPSS compatibility very important
- Exploration of other packages appropriate but unlikely to be central

Some forthcoming trends in quantitative data analysis

- Analytical innovations including:
  • Enhancing data for purposes of analysis
    o More variables, drawn from different data sources
  • Exploitation of more and more complex data
    o International and longitudinal comparisons
  • Further advances in multi-process modelling
    o E.g. ESRC funded: Sabre-R; BUGS; MLwiN
  • Simulation analysis
- Methodological attention to:
  • Documentation of quantitative research [Dale 2006]
  • Replication standards [Freese 2007]

End of Talk 1

Introduction to Data Management in the Social Sciences

1) The nature of data management
2) Existing resources for social scientists
3) The contributions of…
   • e-Social Science
   • the DAMES Node (www.dames.org.uk)
4) Case study: Data Management in Quantitative Social Science Research

References

Blossfeld, H. P., & Rohwer, G. (2002). Techniques of Event History Modelling: New Approaches to Causal Analysis, 2nd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

Crouchley, R., & Fligelstone, R. (2004). The Potential for High End Computing in the Social Sciences. Lancaster: Centre for Applied Statistics, Lancaster University, and http://redress.lancs.ac.uk/document-pool/hecsspotential.pdf.

Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.

Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2).

Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. New York: Wiley.

Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers.

Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross Nationally. London: Sage.

Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS 16.0: A Guide for SPSS and SAS users. Chicago, IL: SPSS Inc.

Procter, M. (2001). Analysing Survey Data. In G. N. Gilbert (Ed.), Researching Social Life, Second Edition (pp. 252-268). London: Sage.

Sarantakos, S. (2007). A Tool Kit for Quantitative Data Analysis Using SPSS. London: Palgrave Macmillan.