Transcript Slide 1

DDI – Metadata for social science data
Wolfgang Zenk-Möltgen
GESIS – Leibniz Institute for the Social Sciences
DataCite Summer Meeting 2010 – Making datasets visible and accessible
Hannover, 7-8 June 2010
Topics
• About DDI
• Basic DDI Concepts
• Identification and Citation
• Application Examples
Acknowledgement:
DDI Alliance TIC members, namely Wendy Thomas, Arofan Gregory, Joachim Wackerow
About DDI
DDI – Data Documentation Initiative
“The Data Documentation Initiative (DDI) is an effort to create an
international standard for describing social science data. Expressed
in XML, the DDI metadata specification now supports the entire life
cycle of social science datasets. DDI metadata accompanies and
enables data conceptualization, collection, processing, distribution,
discovery, analysis, repurposing, and archiving.” (Stefan Kramer)
http://www.ddialliance.org/
History of DDI
• Concept of DDI and definition of needs grew out
of the data archival community
• Established in 1995 as a grant funded project,
initiated and organized by ICPSR
• February 2003 – Formation of DDI Alliance
– Membership based alliance
– Formalized development procedures
Members of DDI
• Initial members
– Social science data archives
– Statistical data producers
• Membership has since expanded to include
– Research data centers
– Data producers
– Commercial organizations
University of Alberta, Canada
Australian Bureau of Statistics (ABS)
Australian Social Science Data Archive (ASSDA)
University of California, Berkeley -- Computer-Assisted Survey Methods
Program and UCDATA
University of California, California Digital Library
Centro De Investigaciones Sociologicas (CIS), Spain
CEPS/INSTEAD -- Luxembourg
Cornell University (CISER)
Danish Data Archive
Data Archiving and Networked Services (DANS), The Netherlands
Finnish Social Science Data Archive
German Socio-Economic Panel Study (SOEP)
GESIS - Leibniz Institute for the Social Sciences
University of Guelph
Institute for Quantitative Social Science (IQSS) at Harvard University
Institute for the Study of Labor (IZA)
Inter-university Consortium for Political and Social Research (ICPSR)
Massachusetts Institute of Technology (MIT)
University of Minnesota, Minnesota Population Center
National Opinion Research Center (NORC)
Norwegian Social Science Data Service (NSD)
Open Data Foundation
Princeton University
Research Data Centre of the German Federal Employment Agency,
Institute for Employment Research (IAB)
Roper Center
Stanford University
Survey Research Operations, University of Michigan
Swedish National Data Service (SND)
Swiss Foundation for Research in Social Sciences (FORS)
United Kingdom Data Archive
University of Toronto
University of Wisconsin
U.S. Bureau of Labor Statistics (Associate Member)
World Bank, Development Data Group (DECDG)
Yale University
DDI is being used around the world
• Archives and Data Libraries
• Research Institutes and Data Service Centers
• International Organizations and National Statistical Agencies
DDI Versions
• 2000 – DDI 1.0
– Documentation of simple surveys, microdata only
• 2003 – DDI 2.0 and 2.1
– Extension to aggregate data
– Support for geographic material
• 2008 – DDI 3.0
– Lifecycle model: Shift from the codebook centric / variable centric model to capturing the lifecycle of data
– Focus on metadata creation and re-use
– “Machine-actionable” aspects of DDI to support programming
– CAI instruments supported by expanded description of the questionnaire
– Data series support (longitudinal surveys, panel studies, etc.)
– Support comparison by design and comparison-after-the-fact
– Improved support for describing complex data files
• 2009 – DDI 3.1
– Correction of bugs
– Introduction of final URN structure to ensure persistent URNs for all identified elements
Basic DDI 3 Concepts
• Lifecycle Concept
• Re-usable documentation
– Modules
– Maintainables, versionables, identifiables
– Scheme-based (maintainable lists)
• Relations to other standards
• Controlled Vocabularies
The Data Life Cycle
[Figure: the data life cycle, with the stages Concept, Collection, Processing, Distribution, Discovery, Analysis, Repurposing and Archiving]
DDI 3 versus earlier versions
• Previous versions followed the “codebook” idea of producing documentation of a social science dataset
• DDI 3 with its lifecycle model allows for documentation at all stages, from study conception and data processing to analysis and repurposing of data
• DDI 3 uses XML Schemas instead of XML Document Type Definitions (DTDs) to define metadata types more strongly, to make better reuse of content and to reach the goal of “machine actionability”
• A DDI 3 instance now includes the “simple instance” from previous DDI versions. Multiple data products can be included for a single study.
DDI 3.1 Modules
Contain groups of related documentation elements
Some are related to the Lifecycle model, some are technically grouped
• Archive module
• Comparative module
• Conceptual components module
• Data collection module
• Dataset module
• Dublin Core Elements module
• DDI profile module
• Grouping module
• Instance module
• Logical product module
• Physical data product module
– (plus inline n-cube, normal n-cube, tabular n-cube module and proprietary module)
• Physical instance module
• Reusable module
• Study unit module
Usage of DDI 3 Modules
Study Unit
• Identification
• Coverage
– Topical
– Temporal
– Spatial
• Conceptual Components
– Universe
– Concept
– Representation (optional replication)
• Purpose, Abstract, Proposal, Funding
Data Collection
• Methodology
• Question Scheme
– Question
– Response domain
• Instrument
– using Control Construct Scheme
• Coding Instructions
– question to raw data
– raw data to public file
• Interviewer Instructions
Logical Product
• Category Schemes
• Coding Schemes
• Variables
• NCubes
• Variable and NCube Groups
• Data Relationships
Physical Data Structure
• Links to Data Relationships
• Links to Variable or NCube Coordinate
• Description of physical storage structure
– in-line, fixed, delimited or proprietary
Archive
• Organization or individual which has control over the metadata
• Lifecycle events
• Archive specific information
Physical Instance
• One-to-one relationship with a data file
• Coverage constraints
• Variable and category statistics
etc…
Maintainables, Versionables, Identifiables
• Maintainables (may be maintained separately, need agency)
• Versionables (may be versioned in the form “1.0.0”)
• Identifiables (may be identified and referenced, either by ID or URN)
• Other DDI elements
[Diagram: arrows labelled “Inheritance” connect Maintainables, Versionables and Identifiables]
DDI Schemes
Schemes = Lists of elements of one type
Examples
• archive
– OrganizationScheme
• datacollection
– QuestionScheme
– ControlConstructScheme
– InterviewerInstructionScheme
• conceptualcomponent
– ConceptScheme
– UniverseScheme
– GeographicStructureScheme
– GeographicLocationScheme
• logicalproduct
– CategoryScheme
– CodeScheme
– VariableScheme
– NCubeScheme
• physicaldataproduct
– PhysicalStructureScheme
– RecordLayoutScheme
Relationship to Other Standards
• Dublin Core
– Basic bibliographic citation information
– Basic holdings and format information
• METS
– Upper level descriptive information for managing digital objects
– Provides specified structures for domain specific metadata
• OAIS
– Reference model for the archival lifecycle
• PREMIS
– Supports and documents the digital preservation process
• ISO 19115 – Geography (FGDC)
– Metadata structure for describing geographic feature files such as shape, boundary, or map image files and their associated attributes
• ISO/IEC 11179
– International standard for representing metadata in a Metadata Registry
– Consists of a hierarchy of “concepts” with associated properties for each concept
• SDMX
– Exchange of statistical information (time series/indicators)
– Supports metadata capture as well as implementation of registries
Controlled Vocabularies
• Not part of standard
• Recommendations on:
– LifeCycleEventType
– CommonalityTypeCoded
– TimeMethod
– ResponseUnit
– AggregationMethodsType
– DataType
– SoftwarePackage
– CharacterSet
– CategoryStatistic
– SummaryStatistic
– Date@Calendar
– AnalysisUnit
– Contributor@Role
– Publisher@Role
• Example: TimeMethod may be
– Longitudinal (Cohort or Trend)
– Panel (Continuous or Interval)
– TimeSeries (Continuous or Discrete)
– CrossSectional
– CrossSectionalAdHocFollowUp
– Other
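A recommended vocabulary such as TimeMethod can be checked in software before a value is written into metadata. The sketch below is only an illustration: it covers the six top-level terms listed on the slide, the names TimeMethod and normalize_time_method are hypothetical, and mapping unknown values to Other is an assumption rather than a DDI rule.

```python
from enum import Enum


class TimeMethod(Enum):
    """Top-level TimeMethod terms as listed on the slide."""
    LONGITUDINAL = "Longitudinal"
    PANEL = "Panel"
    TIME_SERIES = "TimeSeries"
    CROSS_SECTIONAL = "CrossSectional"
    CROSS_SECTIONAL_AD_HOC_FOLLOW_UP = "CrossSectionalAdHocFollowUp"
    OTHER = "Other"


def normalize_time_method(value: str) -> TimeMethod:
    """Map a free-text value onto the recommended vocabulary.

    Assumption: anything outside the list is recorded as Other,
    because the vocabulary is a recommendation, not part of the standard.
    """
    try:
        return TimeMethod(value.strip())
    except ValueError:
        return TimeMethod.OTHER
```

A call such as normalize_time_method("Panel") yields TimeMethod.PANEL, while an unlisted term falls back to TimeMethod.OTHER.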
Identification in DDI 3
• Two possibilities to identify an element:
– Specify the <ID> Tag
• Agency and Version are inherited
– Use the specially-structured URN
• Agency and Version must be included
• The structured URN approach is preferred
• These IDs/URNs can be referenced
• Both ways need a resolver service that turns the names into
locations to make effective re-use possible
• The DDI Alliance is currently working on that, based on the DNS
(Domain Name System) infrastructure approach
URN Identification Examples
URN of a maintained object
To identify a variable scheme in DDI 3 via a URN:
urn="urn:ddi:us.icpsr:VariableScheme.V_GENDER_SCHEME.1.0.0"
URN of a versionable object
All versionable objects are contained within maintainable objects. To identify a variable in DDI 3 via a URN:
urn="urn:ddi:us.icpsr:VariableScheme.V_GENDER_SCHEME.1.0.0:Variable.Gender.1.0.0"
URN of an identifiable object
An identifiable object may be a direct child of a maintainable object or be contained by a versionable object
within a maintainable object. The full path should be provided to facilitate locating the item when referenced.
<DataCollection isMaintainable="true" id="DC_5698" version="2.4.0">
  <Methodology isVersionable="true" id="Meth_Type_1" version="1.0.0">
    <TimeMethod isIdentifiable="true" id="TM_1"/>
  </Methodology>
</DataCollection>
To identify the identifiable object in the above hierarchy in DDI 3 via a URN:
urn="urn:ddi:us.icpsr:DataCollection.DC_5698.2.4.0:TimeMethod.TM_1.1.0.0"
(from the DDI Technical Specification Part I)
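The URN layout in these examples (urn:ddi:agency, followed by one or more Type.id.version segments separated by colons) can be taken apart mechanically. The following is a minimal sketch based only on the examples shown here, not on the full DDI specification. It assumes every version has exactly three numeric parts, and the names UrnPart, DdiUrn, parse_ddi_urn and format_ddi_urn are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class UrnPart:
    """One 'Type.id.version' segment of a DDI URN."""
    element_type: str
    element_id: str
    version: str


@dataclass(frozen=True)
class DdiUrn:
    agency: str
    parts: List[UrnPart]


def parse_ddi_urn(urn: str) -> DdiUrn:
    """Split a URN shaped like the slide examples into agency and segments."""
    fields = urn.split(":")
    if len(fields) < 4 or fields[0] != "urn" or fields[1] != "ddi":
        raise ValueError(f"not a DDI URN: {urn!r}")
    agency = fields[2]
    parts = []
    for segment in fields[3:]:
        # Assumption: the version is the last three dot-separated tokens,
        # so split from the right; the type is the first token on the left.
        head, major, minor, revision = segment.rsplit(".", 3)
        element_type, element_id = head.split(".", 1)
        parts.append(UrnPart(element_type, element_id, f"{major}.{minor}.{revision}"))
    return DdiUrn(agency, parts)


def format_ddi_urn(urn: DdiUrn) -> str:
    """Rebuild the URN string from its parsed form."""
    segments = [f"{p.element_type}.{p.element_id}.{p.version}" for p in urn.parts]
    return ":".join(["urn", "ddi", urn.agency, *segments])
```

Splitting on colons first keeps the dots inside an agency such as us.icpsr from being confused with the dots inside a segment.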
Citation in DDI
OtherMaterial
Elements
• Citation: holds full citation information for the external object
• ExternalURLReference: location of the external object
• ExternalURNReference: URN expression for the external object
• MIMEType: the standard internet MIME type for applications
• Relationship: reference to DDI object and description of relation to it
• Segment: specifies part of external object (e.g. with audio/video files)
• UserID: unique ID of other types, e.g. DOI
Attributes
• Action: used for local overrides in case of inheritance ("Add" | "Update" | "Delete")
• id: DDI ID of the element
• isIdentifiable: fixed value of "true"
• objectSource: source name or location
• type: required type code for type of the external object
• urn: DDI URN of the element
• xml:lang: optional identification of the language of the external object
DOIs and DDI URNs
• Relationship still unclear
• DDI URN resolution service still needed
• Every identifiable element could be registered with a DOI, but that would result in a huge number of DOIs
• Alternatively, only the study level could be registered with a DOI, e.g. each StudyUnit
• All registered DOIs should be documented in DDI
• Conversely, the metadata of each DOI should contain the DDI URN
• Diverse software applications will make use of them
Application Examples
• Enhanced Publications
– Providing Information to connect Publications with the
underlying datasets/variables used
– Making retrieval of research with specific
datasets/variables possible
• Version History of Datasets
– Documenting errata and correction history
– Making it easy to cite used data
Supporting Enhanced Publications
Publications with References to Data:
DDI 3.1 URN contains: Agency, Object, Version
[Diagram: a publication with references (URNs); the DDI Alliance; a resolver at http://resolve.gesis.org that finds the object and returns a URL; and the URL of the documentation and/or data, http://www.gesis.org/doc/docxyz]
Example URN: urn:ddi:de.gesis:VariableScheme.ZA3811_VarSch.1.0.0:Variable.V8.1.0.0
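The resolution step in the diagram (take a URN, find the object, return a URL) can be sketched as a two-stage lookup. This is only an illustration under stated assumptions: the in-memory tables stand in for whatever DNS-based or database-backed service would be used in practice, the pairing of the example URN with the example URL is assumed for demonstration, and the names AGENCY_RESOLVERS, RESOLVER_TABLES and resolve are hypothetical.

```python
from typing import Dict, Optional

# Stage 1 (assumed): a central table maps each agency to its resolver.
AGENCY_RESOLVERS: Dict[str, str] = {
    "de.gesis": "http://resolve.gesis.org",
}

# Stage 2 (assumed): each resolver maps full URNs to document URLs.
RESOLVER_TABLES: Dict[str, Dict[str, str]] = {
    "http://resolve.gesis.org": {
        "urn:ddi:de.gesis:VariableScheme.ZA3811_VarSch.1.0.0:Variable.V8.1.0.0":
            "http://www.gesis.org/doc/docxyz",
    },
}


def resolve(urn: str) -> Optional[str]:
    """Return the URL registered for a DDI URN, or None if it is unknown."""
    fields = urn.split(":")
    if len(fields) < 4 or fields[:2] != ["urn", "ddi"]:
        raise ValueError(f"not a DDI URN: {urn!r}")
    # The agency is the third colon-separated field, e.g. 'de.gesis'.
    resolver = AGENCY_RESOLVERS.get(fields[2])
    if resolver is None:
        return None
    return RESOLVER_TABLES.get(resolver, {}).get(urn)
```

Because the version is part of the URN, a reference in a publication keeps pointing at the exact version of the variable that was used.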
Supporting Enhanced Publications
DSDM DDI 3 EPE Simple Export Wizard 1.2.0
Enhancing Publications - DatapluS
A University of Tilburg and Centerdata project, supported by GESIS and the European Values Study
Version History of Datasets
• The GESIS data catalogue holds study descriptions with
links to data access
• GESIS is currently introducing a common versioning policy for datasets
• Versions start at 1.0.0; the major, minor or revision number is increased according to the change in the dataset
• A DOI will be created for each published version
• This gives transparency in the history of data processing
• Citations of used datasets will include the specific version to ease replication
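The three-part version scheme can be illustrated with a small helper. This is a sketch, not the GESIS policy itself: the slide does not say what happens to the lower numbers when a higher one is increased, so resetting them to zero is an assumption, and the function name bump is hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn '1.0.0' into (1, 0, 0); raises ValueError on malformed input."""
    major, minor, revision = (int(part) for part in text.split("."))
    return major, minor, revision


def bump(version: str, level: str) -> str:
    """Increase the major, minor or revision number of a dataset version.

    Assumption: lower-order numbers are reset to zero when a
    higher-order number is increased.
    """
    major, minor, revision = parse_version(version)
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "revision":
        return f"{major}.{minor}.{revision + 1}"
    raise ValueError(f"unknown level: {level!r}")
```

Under these assumptions, bump("1.0.0", "minor") gives "1.1.0", and each resulting version string would be the one that receives its own DOI.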
Data Catalogue
Thank you!