The Data Documentation Initiative

Workshop on Metadata Standards and Best Practices
November 19-20th, 2007
Session 4
The Data Documentation Initiative
Technical Overview
Pascal Heus
Open Data Foundation
http://www.opendatafoundation.org
Outline
• XML refresher
• DDI Background
• DDI 1/2.x
– Status / Tools
• DDI 3.0
– Use cases
– Need for tools
• Conclusions / Q&A
Thanks to the DDI Alliance and GESIS
for slides on DDI
http://www.opendatafoundation.org
Open Data Foundation – IZA 2007/11
XML to the rescue!
• XML is driving today’s web service oriented
architecture of the Internet and Intranets
• Using XML, we can capture, structure,
transform, discover, exchange, query, edit
and secure metadata and data
• XML is platform & language independent
and can be used by everyone
• XML is both machine and human readable
• XML is non-proprietary, public domain and
many open tools exist
• Domain specific standards are available!
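As a minimal sketch of the "structure" and "query" points above, the following Python snippet parses a small XML metadata record with the standard library and reads values back out using ElementTree's limited XPath support. The element names (`survey`, `variable`, `label`) are invented for illustration and do not belong to any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; the element names are illustrative only.
RECORD = """
<survey id="demo-2007">
  <title>Example Household Survey</title>
  <variable name="AGE"><label>Age in years</label></variable>
  <variable name="SEX"><label>Sex of respondent</label></variable>
</survey>
"""


def variable_labels(xml_text: str) -> dict[str, str]:
    """Return a {variable name: label} mapping from the record."""
    root = ET.fromstring(xml_text)
    return {
        var.get("name"): var.findtext("label", default="")
        for var in root.findall("./variable")
    }


labels = variable_labels(RECORD)
print(labels)
```

The same record stays readable in a text editor, which is the "machine and human readable" point in practice.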
XML Technical Overview
[Diagram: XML at the center, surrounded by related technologies]
• Structure: DTD, XSchema
• Capture: XForms
• Manage: Software
• Transform: XSL, XSLT, XSL-FO
• Search: XPath, XQuery
• Discover: Registries, Databases
• Exchange: Web Services, SOAP, REST
What is DDI?
• The Data Documentation Initiative (DDI) is an XML
format for capturing metadata about survey data
and register data
• Data files may remain in their native formats (ASCII
files which may be delimited or fixed-width) or may
be captured as XML
• It was originally designed to describe codebooks, and
was mainly useful for data archives and libraries
– Versions 1.*/2.*
• Now, it can be used for any type of data collection
– Version 3.0
– Focus on survey instrumentation and microdata, but also
can describe aggregates
Background
• Concept of DDI and definition of needs grew
out of the data archival community
• Established in 1995 as a grant funded
project initiated and organized by ICPSR
• Members:
– Social Science Data Archives (US, Canada,
Europe)
– Statistical data producers (including US Bureau
of the Census, the US Bureau of Labor Statistics,
Statistics Canada and Health Canada)
• February 2003 – Formation of DDI Alliance
– Membership-based alliance
– Formalized development procedures
Growth of the DDI Structure
• 2000 – DDI 1.0
– Simple survey
– Archival data formats
– Microdata only
• 2003 – DDI 2.0
– Aggregate data (based on matrix structure)
– Added geographic material to aid geographic
search systems and GIS users
Characteristics of DDI 1.0/2.0
• Focuses on the static object of a codebook
• Designed for limited uses
– End user data discovery via the variable or high
level study identification (bibliographic)
– Only heavily structured content relates to
information used to drive statistical analysis
• Coverage is focused on single study, single
data file, simple survey and aggregate data
files
• Variable contains majority of information
(question, categories, data typing, physical
storage information, statistics)
1/2.x Schema
• Organized in 5 sections
– docDscr: information about the XML document
itself (metadata preparation, version, etc.)
– stdyDscr: detailed information about the survey
• Title, year, coverage, sampling, data collection/cleaning,
quality, contact, access policy, etc.
– fileDscr: describes files in the dataset
– dataDscr: describes the data structure
• Variable: name, label, code, summary statistics,
definitions, literal question, interviewer instructions,
weights, grouping, etc.
• Cubes: aggregated data
– otherMat: additional documentation
• See examples
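To make the five-section layout concrete, here is a hedged sketch of a DDI 1/2.x style codebook skeleton, read with Python's standard library. The content is invented, namespaces and most child elements are left out, and the element names (`titlStmt`, `qstnLit`, `catgry`, `otherMat`, and so on) are written from memory of the 2.x codebook and should be checked against the published DTD before use.

```python
import xml.etree.ElementTree as ET

# Skeleton of a DDI 1/2.x codebook. Content is invented; namespaces and
# many child elements are omitted, so this is not a valid instance.
CODEBOOK = """
<codeBook>
  <docDscr><citation><titlStmt><titl>Demo DDI document</titl></titlStmt></citation></docDscr>
  <stdyDscr><citation><titlStmt><titl>Demo Survey 2007</titl></titlStmt></citation></stdyDscr>
  <fileDscr ID="F1"><fileTxt><fileName>demo.dat</fileName></fileTxt></fileDscr>
  <dataDscr>
    <var ID="V1" name="SEX">
      <labl>Sex of respondent</labl>
      <qstn><qstnLit>What is your sex?</qstnLit></qstn>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
  </dataDscr>
  <otherMat><labl>Questionnaire (PDF)</labl></otherMat>
</codeBook>
"""

root = ET.fromstring(CODEBOOK)

# The five top-level sections, in document order.
sections = [child.tag for child in root]

# In 1/2.x the variable carries most of the information: label,
# literal question and categories all hang off the <var> element.
variables = {}
for var in root.findall("./dataDscr/var"):
    variables[var.get("name")] = {
        "label": var.findtext("labl"),
        "question": var.findtext("qstn/qstnLit"),
        "codes": {
            cat.findtext("catValu"): cat.findtext("labl")
            for cat in var.findall("catgry")
        },
    }

print(sections)
print(variables)
```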
Limitations
• Treated as an “add on” to the data collection
process
• Focus is on the data end product and end
users (static)
• Limited tools for creation or exploitation
• The variable must exist before metadata can
be created
• Producers hesitant to take up DDI creation
because it is a cost and does not support
their development or collection process
Requirements for 3.0
• Improve and expand the machine-actionable
aspects of the DDI to support programming and
software systems
• Support CAI instruments through expanded
description of the questionnaire (content and
question flow)
• Support the description of data series (longitudinal
surveys, panel studies, recurring waves, etc.)
• Support comparison, in particular comparison by
design but also comparison after the fact
(harmonization)
• Improve support for describing complex data files
(record and file linkages)
• Provide improved support for geographic content to
facilitate linking to geographic files (shape files,
boundary files, etc.)
Approach
• Shift from the codebook centric model of early
versions of DDI to a lifecycle model
– providing metadata support from data study conception
through analysis and repurposing of data
• Shift from an XML Document Type Definition (DTD) to an
XML Schema model
– to support the lifecycle model, reuse of content and
increased controls to support programming needs
• Redefine a “single DDI instance” to include both
“simple instances” and “complex instances”
– A simple instance covers a single study, as in DDI 1.*/2.*;
a complex instance covers a group of related studies.
– Allow a single study description to contain multiple data
products (for example, a microdata file and aggregate
products created from the same data collection).
• Incorporate the requested functionality in the first
published edition
Designing to support registries
• Resource package
– structure to publish non-study-specific materials for reuse
• Extracting specified types of information into
schemes
– Universe, Concept, Category, Code, Question, Instrument,
Variable
• Allowing for either internal or external references
• Providing comparison mapping
– Target can be external harmonized structure
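The idea of pulling reusable items into schemes and pointing at them by reference can be sketched in a few lines of Python. This is a toy model, not the DDI 3.0 identification mechanism: the `Ref`, `Scheme` and `Registry` classes, the agency name `example.org` and the scheme identifier are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ref:
    """Hypothetical reference: maintaining agency + scheme id + item id."""
    agency: str
    scheme: str
    item: str


@dataclass
class Scheme:
    """A maintained package of reusable items (e.g. categories)."""
    agency: str
    scheme_id: str
    items: dict[str, str] = field(default_factory=dict)


class Registry:
    """Toy registry holding schemes published by one or more agencies."""

    def __init__(self) -> None:
        self._schemes: dict[tuple[str, str], Scheme] = {}

    def publish(self, scheme: Scheme) -> None:
        self._schemes[(scheme.agency, scheme.scheme_id)] = scheme

    def resolve(self, ref: Ref) -> str:
        scheme = self._schemes[(ref.agency, ref.scheme)]
        return scheme.items[ref.item]


registry = Registry()
registry.publish(
    Scheme("example.org", "CategoryScheme.Sex", {"M": "Male", "F": "Female"})
)

# Two studies reuse the same category by reference instead of copying it.
study_a = {"SEX": Ref("example.org", "CategoryScheme.Sex", "F")}
study_b = {"GENDER": Ref("example.org", "CategoryScheme.Sex", "F")}
resolved = (registry.resolve(study_a["SEX"]), registry.resolve(study_b["GENDER"]))
print(resolved)
```

Because both studies point at the same maintained item, a shared reference is also a simple signal of comparability.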
Relationship to Other Standards
• Dublin Core
– Mapping of citation elements
– Option for DC namespace basic entry
• ISO 19115 – Geography
– Search requirements
– Support for GIS users
• METS
– Designed to support profile development
• OAIS
– Reference model for the archival lifecycle
• SDMX
– Complete mapping to and from DDI NCubes
– Designed to be used with registries
• ISO/IEC 11179
– Variable linking representation to concept and universe
– Optional data element construct in ConceptualComponent that
allows for complete ISO/IEC 11179 structure as a maintained item
Development of DDI 3.0
• 2004 – Acceptance of a new DDI paradigm
– Lifecycle model
– Shift from the codebook centric / variable centric
model to capturing the lifecycle of data
– Agreement on expanded areas of coverage
• 2005
– Presentation of schema structure
– Focus on points of metadata creation and reuse
• 2006
– Presentation of first complete 3.0 model
– Internal and public review
• 2007
– Vote to move to Candidate Version
– Establishment of a set of use cases to test
application and implementation
• 2008
– March: anticipated vote to publish DDI 3.0
XML Schemas and 3.0 Modules (one is not
necessarily the other)
• XML Schemas
– Each .xsd file is an XML schema
– Some XML schemas are modules
– Some XML schemas are substitution sets
or “sub-modules”
– Some XML schemas simply contain
elements that are used by multiple
schemas or may require more frequent
updates
– Some XML schemas are “external”
XML Schemas and 3.0 Modules (one is not
necessarily the other)
• Modules
– Reflect closely related sets of information
similar to the sections of DDI 1.*/2.* DTD
– Modules can be held as separate XML
instances and be included in a large
instance by either inclusion or reference
– All modules are maintainable, identifiable
packages
– Each module has its own XML
namespace
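A rough sketch of what "each module has its own XML namespace" means when reading an instance: elements from different modules sit side by side and are told apart by their namespace. The namespace URIs and the element layout below are placeholders invented for this example, not the ones defined by the DDI 3.0 schemas.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, invented for this sketch.
NS = {
    "i": "urn:example:ddi:instance",
    "s": "urn:example:ddi:studyunit",
    "l": "urn:example:ddi:logicalproduct",
}

INSTANCE = """
<i:Instance xmlns:i="urn:example:ddi:instance"
            xmlns:s="urn:example:ddi:studyunit"
            xmlns:l="urn:example:ddi:logicalproduct">
  <s:StudyUnit id="SU1">
    <l:LogicalProduct id="LP1">
      <l:Variable id="V1" name="AGE"/>
      <l:Variable id="V2" name="SEX"/>
    </l:LogicalProduct>
  </s:StudyUnit>
</i:Instance>
"""

root = ET.fromstring(INSTANCE)

# Namespace-qualified path: prefixes are resolved through the NS mapping.
names = [
    v.get("name")
    for v in root.findall("./s:StudyUnit/l:LogicalProduct/l:Variable", NS)
]

# ElementTree stores tags as "{namespace}local"; collect the namespaces seen.
modules = sorted({el.tag.split("}")[0].lstrip("{") for el in root.iter()})

print(names)
print(modules)
```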
XML SCHEMAS
• archive
• comparative
• conceptualcomponent
• datacollection
• dataset
• dcelements
• DDIprofile
• ddi-xhtml11
• ddi-xhtml11-model-1
• ddi-xhtml11-modules-1
• group
• inline_ncube_recordlayout
• instance
• logicalproduct
• ncube_recordlayout
• organization
• physicaldataproduct
• physicalinstance
• reusable
• simpledc20021212
• studyunit
• tabular_ncube_recordlayout
• xml
• set of xml schemas to support xhtml
Basic Structure/Organization
• DDI 3.0 is divided into “modules”
• Each contains a set of related metadata
• Reusable metadata is divided into schemes
• Modules reflect the steps of the data lifecycle
DDI 3.0 Modules
• Main modules are:
– Study Unit (contains a simple study description)
– Conceptual Component
– Data Collection (survey instruments, questions,
sources)
– Logical Product (concepts, variables, codes,
categories)
– Physical Storage (describes patterns of storage
and physical instances/files)
– Archive (organizations and processing events)
– Group (comparing and grouping study units)
– Comparative (allows for explicit comparisons
between grouped studies)
• See also http://www.ddialliance.org/DDI/ddi3/moduledescriptions.html
Maintainable Schemes
(that’s with an ‘e’ not an ‘a’)
• Concept Scheme
• Universe Scheme
• Question Scheme
• Control Construct Scheme
• Category Scheme
• Code Scheme
• Variable Scheme
Packages of reusable metadata maintained by a single agency
DDI 3.0
• Look at schema (Candidate release 2)
• Look at examples (prototype XML)
DDI Lifecycle View and Use Cases
Our Initial Thinking…
The metadata payload
from version 2.* DDI was
re-organized to cover
these areas.
Wrapper
For later parts of the lifecycle, metadata is reused
heavily from earlier modules.
The discovery and analysis itself creates data and
metadata, reused in future cycles.
Realizations
• Many different organizations and individuals are
involved throughout this process
– This places an emphasis on versioning and exchange
between different systems
• There is potentially a huge amount of metadata
reuse throughout an iterative cycle
– We needed to make the metadata as reusable as possible
• Every organization acts as an “archive” (that is, a
maintainer and disseminator) at some point in the
lifecycle
– When we say “archive” in DDI 3.0, it refers to this function
DDI 3.0 Lifecycle Model
Metadata Reuse
Use Cases
• Study design/survey instrumentation
• Questionnaire generation/data collection and
processing
• Data recoding, aggregation and other processing
• Data dissemination/discovery
• Archival ingestion/metadata value-add
• Question/concept/variable banks
• DDI for use within a research project
• Capture of metadata regarding data use
• Metadata mining for comparison, etc.
• Generating instruction packages/presentations
Study Design/Survey Instrumentation
• This use case concerns how DDI 3.0 can support
the design of studies and survey instrumentation
– Without benefit of a question or concept bank
Questionnaire Generation, Data Collection, and Processing
• This use case concerns how DDI 3.0 can support
the creation of various types of questionnaires/CAI,
and the collection and processing of raw data into
microdata.
• Algenta is working on DDI 3.0 based software
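As an illustration of how question metadata plus question flow could drive a CAI instrument, the sketch below routes a respondent through a short questionnaire using a simple skip rule. The `Question` class and its `ask_if` field are hypothetical simplifications and are not taken from the DDI 3.0 data collection module.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Question:
    qid: str
    text: str
    # Hypothetical skip rule: ask only if an earlier answer equals a value.
    ask_if: Optional[tuple[str, str]] = None


QUESTIONS = [
    Question("Q1", "Did you work last week? (yes/no)"),
    Question("Q2", "How many hours did you work?", ask_if=("Q1", "yes")),
    Question("Q3", "What is your age?"),
]


def route(questions: list[Question], answers: dict[str, str]) -> list[str]:
    """Return the ids of the questions a respondent would be asked."""
    asked = []
    for q in questions:
        if q.ask_if is None or answers.get(q.ask_if[0]) == q.ask_if[1]:
            asked.append(q.qid)
    return asked


worker_path = route(QUESTIONS, {"Q1": "yes"})
non_worker_path = route(QUESTIONS, {"Q1": "no"})
print(worker_path, non_worker_path)
```

Keeping the flow in metadata rather than in instrument code is what lets the same description feed both the questionnaire and its later documentation.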
Data Recoding, Aggregation, etc.
• This use case concerns how DDI 3.0 can
describe recodes, aggregation, and similar
types of data processing.
• Relevant to both producer and researcher
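One way to picture a recode described as metadata is a small record that names the source variable, the derived variable and the bins, and that can be both read as documentation and applied to data. The `Recode` class, the variable names and the age bins below are assumptions made for this sketch only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recode:
    """Hypothetical metadata record describing a derived variable."""
    source: str
    target: str
    bins: tuple[tuple[int, int, str], ...]  # (low, high, label), inclusive

    def apply(self, value: int) -> Optional[str]:
        for low, high, label in self.bins:
            if low <= value <= high:
                return label
        return None  # value falls outside every documented bin


AGE_GROUP = Recode(
    source="AGE",
    target="AGE_GROUP",
    bins=((0, 17, "child"), (18, 64, "adult"), (65, 120, "senior")),
)

records = [{"AGE": 9}, {"AGE": 34}, {"AGE": 70}]
derived = [AGE_GROUP.apply(r[AGE_GROUP.source]) for r in records]

# Aggregation driven by the same metadata: counts per derived category.
counts = {label: derived.count(label) for _, _, label in AGE_GROUP.bins}
print(derived, counts)
```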
Data Dissemination/Data Discovery
• This use case concerns how DDI 3.0 can
support the discovery and dissemination of
data.
• Highly relevant to researchers
Archival Ingestion and
Metadata Value-Add
• This use case concerns how DDI 3.0 can
support the ingest and migration functions of
data archives and data libraries.
Question/Concept/Variable Banks
• This use case describes how DDI 3.0 can support question,
concept, and variable banks. These are often termed
“registries” or “metadata repositories” because they contain
only metadata – links to the data are optional, but provide
implied comparability. The focus is metadata reuse.
• Concept classification very important to researchers
DDI For Use within a Research Project
• This use case concerns how DDI 3.0 can support
various functions within a research project, from the
conception of the study through collection and
publication of the resulting data.
• Direct use in RDC
Capture of Metadata Regarding
Data Use
• This use case concerns how DDI 3.0 can capture information
about how researchers use data, which can then be added to
the overall metadata set about the data sources they have
accessed.
• Data use and user feedback crucial to improve overall quality
and future data production (relevance)
Metadata Mining for Comparison, etc.
• This use case concerns how collections of DDI 3.0 metadata
can act as a resource to be explored, providing further insight
into the comparability and other features of a collection of data.
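A very small example of mining metadata for comparability: given the code lists of what is nominally the same variable in two studies, classify each code as identical, conflicting, or present in one study only. The study names, variable and categories are invented, and real harmonization would need far more than a label comparison.

```python
# Hypothetical code lists for the "same" variable in two studies.
study_2005 = {"MARITAL": {"1": "Single", "2": "Married", "3": "Divorced"}}
study_2007 = {
    "MARITAL": {"1": "Single", "2": "Married", "3": "Widowed", "4": "Divorced"}
}


def compare_codes(a: dict[str, str], b: dict[str, str]) -> dict[str, list[str]]:
    """Classify codes as identical, conflicting, or present in one study only."""
    report: dict[str, list[str]] = {
        "same": [], "conflict": [], "only_a": [], "only_b": []
    }
    for code in sorted(set(a) | set(b)):
        if code not in b:
            report["only_a"].append(code)
        elif code not in a:
            report["only_b"].append(code)
        elif a[code] == b[code]:
            report["same"].append(code)
        else:
            report["conflict"].append(code)
    return report


report = compare_codes(study_2005["MARITAL"], study_2007["MARITAL"])
print(report)
```

Here code 3 changes meaning between the two studies, which is exactly the kind of break a comparison report should surface.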
Generating Instruction Packages/Presentations
• This use case concerns how DDI 3.0 can
support automation around the instruction of
students and others.
Tools
• DDI 1/2.x
– IHSN Microdata Management Toolkit
(http://www.surveynetwork.org/toolkit)
– Nesstar (http://www.nesstar.com)
– http://www.ddialliance.org
– Dextris (http://opendatafoundation.org)
• DDI 3.0
– Foundation Tools Platform
– UKDA DExT
– DDI 3.0 Use case
http://opendatafoundation.org/ddi/use_cases.php
– Algenta SurveyWiz
– Dextris (http://opendatafoundation.org)
DDI for RDC
• A small set of DDI 1/2.x tools is available
today
– Users can generate from internal databases, use
conversion utilities (see DDI web site) or software
like Nesstar Publisher and the IHSN Microdata
Management Toolkit
• DDI 3.0 has a much broader scope and
provides both core and advanced
functionalities that will require management
tools
– Next generation metadata framework is being
built as the standard is being finalized
– The DDI Foundation Tools Program is an
umbrella for an implementers' startup toolkit
Conclusions
• The first generation of DDI is suitable for
data archives interested in the preservation
of metadata and discovery by users
• DDI 3.0 focuses on the entire life cycle of the
survey and is suitable for many different
uses.
– More relevant to RDC environment
• DDI 3.0 calls for coordinated efforts for
building relevant tools for producers,
archives, researchers and other users