Transcript Slide 1

The importance of data management
Paul Lambert, 31st January 2012
Talk to the seminar ‘Data management in the social sciences and the
contribution of the DAMES Node’, a session organised as part of the Data
Management through e-Social Science ESRC research Node
www.dames.org.uk
DAMES, 31/JAN/2012, T1
Today’s session (2V1/2V3)
‘Data Management through e-Social Science’
 DAMES – www.dames.org.uk
 ESRC funded research Node
Funded 2008-11, with ongoing work into 2012 with the NeISS
(www.neiss.org.uk) and ‘eStat’
(www.bristol.ac.uk/cmm/research/estat/) projects
 Aim: Useful social science provisions
Specialist data topics – occupations; education
qualifications; ethnicity; social care; health
Computer science research on secure data models;
metadata and linking data; workflows
Programme of case studies and provisions
‘Data management’ means…
 ‘the tasks associated with linking related data resources, with
coding and re-coding data in a consistent manner, and with
accessing related data resources and combining them within the
process of analysis’ […DAMES Node..]
 Usually performed by social scientists themselves
 Most overt in quantitative survey data analysis
• ‘variable constructions’, ‘data manipulations’
• navigating abundance of data – thousands of variables
 Usually a substantial component of the work process
 Here we differentiate from archiving / controlling data itself
Some components…
 Manipulating data
 Recoding categories / ‘operationalising’ variables
 Linking data
 Linking related data (e.g. longitudinal studies)
 combining / enhancing data (e.g. linking micro- and macro-data)
 Secure access to data
 Linking data with different levels of access permission
 Detailed access to micro-data cf. access restrictions
 Harmonisation standards
 Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
 Recommendations on particular ‘variable constructions’
 Cleaning data
 ‘missing values’; implausible responses; extreme values
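Two of the components listed above, cleaning missing values and linking micro- and macro-data, can be sketched in a few lines. This is a minimal illustration in Python, not a DAMES tool: the records, the variable names (pid, region, age, unemp_rate) and the region codes are all hypothetical, and the missing-data codes -9 and -7 are assumed to follow the convention seen in the recoding example in this talk.

```python
# Hypothetical micro-data: one record per respondent.
people = [
    {"pid": 1, "region": "A", "age": 34},
    {"pid": 2, "region": "B", "age": -9},   # -9 is assumed to mean 'missing'
    {"pid": 3, "region": "C", "age": 51},   # region C has no macro record
]

# Hypothetical macro-data (one record per region), keyed by region code.
region_stats = {
    "A": {"unemp_rate": 5.2},
    "B": {"unemp_rate": 8.1},
}

# Assumed missing-data codes.
MISSING_CODES = {-9, -7}


def clean(value):
    """Return None when the value is a missing-data code."""
    return None if value in MISSING_CODES else value


def link(person, macro):
    """Return a cleaned copy of a person record with macro data attached."""
    linked = dict(person)                 # never modify the source record
    linked["age"] = clean(person["age"])
    stats = macro.get(person["region"])   # None when the region is unmatched
    linked["unemp_rate"] = stats["unemp_rate"] if stats else None
    linked["macro_matched"] = stats is not None
    return linked


linked_people = [link(p, region_stats) for p in people]

# Keep a record of unmatched cases rather than silently dropping them.
unmatched = [p["pid"] for p in linked_people if not p["macro_matched"]]
```

Flagging unmatched records explicitly (macro_matched) is a design choice: it keeps the linkage decision visible in the data rather than hiding it in dropped cases.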
Example – recoding data
Count: Highest educational qualification (rows) by educ4 (columns)

Highest educational               -9.00     1.00     2.00  3.00 Higher  4.00 School   Total
qualification                             Degree  Diploma    school or     level or
                                                            vocational        below
-9 Missing or wild                  323        0        0            0            0     323
-7 Proxy respondent                 982        0        0            0            0     982
1 Higher Degree                       0      425        0            0            0     425
2 First Degree                        0     1597        0            0            0    1597
3 Teaching QF                         0        0      340            0            0     340
4 Other Higher QF                     0        0     3434            0            0    3434
5 Nursing QF                          0        0      161            0            0     161
6 GCE A Levels                        0        0        0         1811            0    1811
7 GCE O Levels or Equiv               0        0        0            0         2518    2518
8 Commercial QF, No O Levels          0        0        0          331            0     331
9 CSE Grade 2-5, Scot Grade 4-5       0        0        0            0          421     421
10 Apprenticeship                     0        0        0          257            0     257
11 Other QF                         102        0        0            0            0     102
12 No QF                              0        0        0            0         2787    2787
13 Still At School No QF            138        0        0            0            0     138
Total                              1545     2022     3935         2399         5726   15627
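The recode behind this cross-tabulation can be written down explicitly, which is what makes it documentable and repeatable. The sketch below is in Python rather than the SPSS or Stata syntax the talk has in mind; the mapping and the counts are read directly off the table, while the names RECODE, SOURCE_COUNTS and recode are illustrative only.

```python
from collections import Counter

# Source category code -> educ4 code, as implied by the cross-tabulation.
RECODE = {
    -9: -9, -7: -9, 11: -9, 13: -9,   # missing, proxy, other QF, still at school
    1: 1, 2: 1,                       # Degree
    3: 2, 4: 2, 5: 2,                 # Diploma
    6: 3, 8: 3, 10: 3,                # Higher school or vocational
    7: 4, 9: 4, 12: 4,                # School level or below
}

# Row totals from the table: source category code -> number of cases.
SOURCE_COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161, 6: 1811,
    7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}


def recode(value):
    """Map a source qualification code to educ4; fail loudly on unknown codes."""
    if value not in RECODE:
        raise ValueError(f"No recode rule for source value {value!r}")
    return RECODE[value]


# Rebuild the educ4 column totals from the source row totals.
educ4_counts = Counter()
for source_code, n in SOURCE_COUNTS.items():
    educ4_counts[recode(source_code)] += n
```

Cross-tabulating the old variable against the new one, as the slide does, is the standard check that a recode behaved as intended; here the rebuilt totals can be compared with the Total row of the table.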
Example - Linking data (on related adults in the BHPS)
                    Used health services in
                    last year (Y=43%)             GHQ score
                       indv     cp     hh   xhid     indv     cp     hh   xhid
Female                 0.63   0.77   0.69   0.65     1.36   1.36   1.36   1.53
Age                    0.02   0.03   0.02   0.02     0.13   0.13   0.14   0.14
Ln(household inc.)    -0.09  -0.14  -0.12  -0.11    -0.63  -0.62  -0.63  -0.62
Constant              -0.65  -0.67  -0.59  -0.55     12.9   12.8   12.6   12.6
ICC L2% (VC)              0    6.3    8.8    7.9        0   22.9   15.8    7.8
Mean cluster size         1    1.4    1.8    4.6        1    1.4    1.8    4.5
L2:sd(cons)                   0.61   0.51   0.53            2.54   1.91   1.15
L2:sd(fem)                    2.00   0.82   0.00            2.81   2.32   1.64
L1:sd(cons)            1.81   1.81   1.81   1.81     5.40   4.30   4.76   5.28
-Log-like (-40k)       9648   9625   9624   9632     3529   3383   3410   3512

Reported for one outcome only on the slide (columns indv, cp, hh, xhid):
Age-squared(*100)     -0.12  -0.13  -0.13  -0.13
Cohabiting            -0.58  -0.58  -0.54  -0.59
‘The significance of data management for
social survey research’
 The data manipulations described above are a major
component of the social survey research workload
 Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories; Dealing with missing records
 Post-release manipulations performed by researchers
• Re-coding measures into simple categories
• All serious researchers perform extended post-release management
(and have the scars to show for it)
 We do have existing tools, facilities and expert experience to
help us…but we don’t make a good job of using them
efficiently or consistently
 So the ‘significance’ of DM is about how much better
research might be if we did things more effectively…
..being more effective probably involves..
 Knowing about, using and citing previous
standard measures/strategies
 Effective documentation/dissemination of
information on the approach used
 Being proactive (not just relying on the most
convenient measure to hand)
 Trying a few alternatives – sensitivity analysis
‘Documentation’ (and its dissemination)
is probably the key…
 By documentation we mean the ‘paper trail’
 (such as data & syntax files during secondary survey research)
 For scientists, this is the log book / journal / laboratory
notebook
 For social sciences, there are few agreed standards
Effective documentation is
possible, but requires some
effort (e.g. Long, 2009)
Image of Alexander Graham Bell’s
1876 notebook, taken from:
http://sandacom.wordpress.com/2010/
03/11/the-face-rings-a-bell/
..good levels of documentation are not
engrained in the social sciences!
 “…Little or nothing is systematically archived from these electronic
sources. How many of us routinely keep copies of our old word-processing files once they are no longer of current relevance for
research or teaching activities. We have been reminded…of the
insecurity and non-survival of departmental and professional files stored
in broom cupboards, but how many electronic files even get into that
cupboard in the first place?” (p142 of Scott, J. (2005) ‘Some principal concerns in
the shaping of sociology’, in Halsey, A.H. and Runciman, W. (eds) British Sociology Seen
from Without and Within. London: British Academy)
...Yet, ‘documentation for replication’ is a reasonable
expectation for a scientific model of research
(e.g. Steuer, Dale, Freese)…
Steuer, M. (2003). The Scientific Study of Society. Boston: Kluwer Academic.
Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social
Research Methodology, 9(2), 143-158.
Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology?
Sociological Methods & Research, 36(2), 153-71.
A bit of focus…

Most of the DAMES applications aim to facilitate
one of two data management activities, their
documentation, and the dissemination of that
documentation:
1) Variable constructions
o Coding and re-coding values
2) Linking datasets
o Internal and external linkages
‘Documentation for replication’
supports replication of..
 Your own analysis
 (in response to comments, revisions, requests for access)
 Others’ analysis
 To build upon – cumulative science
 To critique / cross-examine
 In secondary survey research
 Complex data is often updated (new related records; revised
and re-released; re-weighted or re-standardised; new levels
of access/linkage)
 New analysis feasible - variable operationalisations; new
statistical methods
 Most documentation requirements are achieved by
effective use of software (‘syntax’ programming)
 See our training workshops, www.dames.org.uk/workshops
Keep clear records of your DM activities!
Reproducible (for self)
Replicable (for all)
Paper trail for whole
lifecycle

In survey research,
this means using
clearly annotated
syntax files
(e.g. SPSS/Stata)
Syntax Examples:
www.dames.org.uk/workshops
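As a rough analogue of an annotated SPSS or Stata syntax file, the sketch below shows one way a script can leave its own paper trail. It is written in Python with hypothetical data and names (pid, qual, degree, step, trail), and it assumes that -9 marks a missing qualification and that codes 1 and 2 denote degrees, as in the recoding example in this talk. It is an illustration of the idea, not one of the workshop files.

```python
import hashlib
import json

trail = []  # the 'paper trail': one entry per data management step


def fingerprint(rows):
    """Short hash of the data, so that a re-run can be compared with this run."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def step(description, rows):
    """Record what was done and what the data looked like afterwards."""
    trail.append({
        "step": len(trail) + 1,
        "description": description,
        "n_rows": len(rows),
        "fingerprint": fingerprint(rows),
    })
    return rows


# Step 1: load the data (a hypothetical three-record extract).
raw = step("Load extract", [
    {"pid": 1, "qual": 2},
    {"pid": 2, "qual": -9},
    {"pid": 3, "qual": 7},
])

# Step 2: drop records whose qualification is coded -9 (assumed missing).
valid = step("Drop records with qual == -9",
             [r for r in raw if r["qual"] != -9])

# Step 3: derive a degree indicator (qual codes 1 and 2 assumed to be degrees).
final = step("Derive degree = 1 if qual in (1, 2), else 0",
             [dict(r, degree=int(r["qual"] in (1, 2))) for r in valid])

# The trail can be saved alongside the outputs, e.g. as JSON.
trail_json = json.dumps(trail, indent=2)
```

Because every step is code plus a comment, the whole sequence can be re-run by the author (reproducible) or handed to someone else (replicable).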
We’ve written a guide for researchers...
 ‘Software Session 1: Documentation and workflows with
popular software packages’
(www.dames.org.uk/workshops/stir10/docs_workflows_2010.html)
 Dozens of sample command files in SPSS, Stata and R from
DAMES Node workshops at www.dames.org.uk
For data distributors,
the provision of
systematic metadata
is also beneficial
Example of DDI
format metadata
(see also talk 5)
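To give a flavour of variable-level metadata, the sketch below builds a small XML fragment for the educ4 variable from the recoding example. It is a simplified illustration only: the element names (codeBook, dataDscr, var, labl, catgry, catValu) follow DDI Codebook conventions, but the fragment omits namespaces and the other elements a complete DDI document would carry, so it should not be assumed to validate against the schema. The variable label is hypothetical; the category values and labels are those shown in the recoding table.

```python
import xml.etree.ElementTree as ET

# Category values and labels of educ4, as shown in the recoding example.
CATEGORIES = [
    (1, "Degree"),
    (2, "Diploma"),
    (3, "Higher school or vocational"),
    (4, "School level or below"),
]

code_book = ET.Element("codeBook")
data_dscr = ET.SubElement(code_book, "dataDscr")

# One <var> element per variable; the label text is illustrative.
var = ET.SubElement(data_dscr, "var", {"name": "educ4"})
ET.SubElement(var, "labl").text = "Educational qualification (4 categories)"

# One <catgry> per category, holding its value and its label.
for value, label in CATEGORIES:
    category = ET.SubElement(var, "catgry")
    ET.SubElement(category, "catValu").text = str(value)
    ET.SubElement(category, "labl").text = label

xml_text = ET.tostring(code_book, encoding="unicode")
```

Holding the category labels in machine-readable form, rather than only in a codebook document, is what lets tools read and compare variable definitions automatically.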
NESSTAR
What more is needed for good data
management?
1) Good standards in the operationalisation of
variables
See yesterday’s workshop sessions (www.dames.org.uk)
Most options have already been studied!
Using GEODE/GEMDE/GEEDE to facilitate sensitivity
analysis and comparisons of alternative plausible
measures
• Collect documentation/metadata on specialist records
• Promote more effective measurement options
e.g. effect proportional scaling; replication of measures used
before; derivation of recommended standards
DAMES ‘GESDE’ tools: online services
for data coordination/organisation
Tools for handling variables in
social science data
Recoding measures; standardisation /
harmonisation; Linking; Curating
[Figure: Predictors of ‘poor health’ in Sweden (comparison of different
occupation-based measures, from DAMES, TP 2011-1). Plotted quantities:
‘Increase in R-squared’ and ‘Increase in BIC’ (axis range -.05 to .1).
Measures compared: ES2, E6, E3, G13, G10, G5, G2, R7, WR9, O8, MN, I99,
CF, CF2, ISEI, AWM, WG2, GN1, ES5, E9, E5, E2, G11, G7, G3, K4, WR, O17,
O4, I9, CM, CM2, CG, SIOP, WG1, WG3.]
What more is needed for good data
management?
2) Incentives/disincentives
Arguably, good data management is penalised at
present (‘Don’t get it right, get it published’)
Few formalised requirements of documentation or
data management activity
(cf. metadata publishing standards such as DDI)
Citation rankings might incentivise here (citation of
your do files..)
Prospects are probably rather bleak for good
science..!!
Summary
the ‘significance’ of DM is about how much
better research might be if we did things
more effectively…
 Can (try to) provide data oriented facilities
supporting improved data management
 May also need a cultural change in
expectations…