Transcript Slide 1

DDI Across the Life Cycle:
One Data Model, Many Products
Inter-university Consortium for Political
and Social Research (ICPSR)
and
Survey Research Operations (SRO)
IASSIST Meeting
Tampere, Finland
May 29, 2009
Presenters
• Mary Vardigan, Assistant Director, ICPSR
• Sanda Ionescu, Documentation Specialist, ICPSR
• Sue Ellen Hansen, Director, SRO Technical Systems Group
• Felicia LeClere, Associate Research Scientist, ICPSR
• Peter Granda, Archivist, ICPSR
The Collaborators
• Both are units of the Institute for Social Research,
University of Michigan
– ICPSR is a large social science data archive
– SRO is a data collection center
Past Collaborations
• Working together on the National Survey of Family
Growth, sponsored by NCHS, to create data and an
interactive codebook
• Partnered on the Collaborative Psychiatric
Epidemiology Surveys, sponsored by NIMH
– This involved a harmonization of three datasets and
interactive documentation featuring question
comparison and five languages –
www.icpsr.umich.edu/CPES
Rationale for Collaboration
• We share a need for rich, high-quality metadata
• We want to comply with metadata standards – in
particular, the Data Documentation Initiative (DDI)
• DDI 3 enables life cycle perspective
• We need to pass data easily from SRO to ICPSR
without information loss
SRO-ICPSR Joint Project
• Shared DDI-compliant data model and database
design for survey metadata
• Challenges:
– Different computing platforms
– Different end products
– Different staff orientations
[Diagram: shared architecture for the SRO-ICPSR joint project, with components labeled Task A through Task D]
• Input files: Blaise datamodel (BMI), Blaise database (BDB), other file types (e.g., SAS, SPSS), DDI 2 or 3 file
• Import tools: SRO Blaise Parsing Tool, ICPSR Import Tool, other importing tool (user specifies files, including location and file type, using an application)
• Databases: client relational database (offline SQL Server Express), SRO relational database (online/networked SQL Server), ICPSR relational database (online/networked Oracle)
• Applications: stand-alone client application, client application with data sync, SRO/ICPSR/other web client and web server (XML/WSDL)
• Functions: edit/review metadata, display metadata, export codebook, export questionnaire, export data
• Layers: metadata and data, transformations, data storage, application logic; offline/local or online
• ICPSR web client: Variable Search, Internal Variable Browser, NSFG Data Management
Products and Benefits
SRO
• Tools to enhance MQDS, which produces XML
documentation from Blaise instruments
• Tool to permit external users to add metadata for NSFG
ICPSR
• Variable-level database that permits users to search across
the ICPSR collection; compare variables; create new
datasets and questionnaires
• Internal variable search for harmonization
Data Life Cycle Coverage
Michigan Questionnaire
Documentation System (MQDS)
Sue Ellen Hansen
Nicole Kirgis
What Does MQDS Do?
• Facilitates automated documentation and
harmonization of Blaise survey instruments
and datasets
– Extracts survey question metadata
– Standardized format
Survey Question Metadata
• Question universe
• Variable name and label
• Question text
• Question variable text (fills)
• Data type
• Code values and code text
• Skip instructions
• etc.
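The metadata items listed above could be held as one record per question. The following is a minimal Python sketch only: the class name `SurveyQuestion`, its field names, and the example values are all hypothetical and are not taken from MQDS, Blaise, or DDI.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SurveyQuestion:
    """One survey question and its metadata (field names are illustrative)."""
    variable_name: str
    variable_label: str
    question_text: str
    data_type: str
    universe: Optional[str] = None                        # who is asked the question
    fills: List[str] = field(default_factory=list)        # variable text inserted at run time
    codes: Dict[str, str] = field(default_factory=dict)   # code value -> code text
    skip_instruction: Optional[str] = None                # routing after this question


# Invented example record, for illustration only.
q = SurveyQuestion(
    variable_name="EVERMARR",
    variable_label="Ever married",
    question_text="Have you ever been married?",
    data_type="enumerated",
    universe="All respondents aged 18 or older",
    codes={"1": "Yes", "5": "No"},
    skip_instruction="If No, go to next section",
)
print(q.variable_name, sorted(q.codes))
```

Keeping code values and code text together in one mapping is just one possible choice; a separate table of categories would work equally well.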
Data Documentation Initiative (DDI)
• Standard specification for technical documentation of social
science data
• eXtensible Markup Language (XML)
– Widely used
– Facilitates sharing of data
• Initial focus on standard dataset codebook
• Ongoing development
http://www.ddialliance.org/
MQDS Version 1
• Extracted metadata from Blaise data model as
XML tagged data
• Provided user interface for selection of
– Blaise files
– Instrument questions and sections
– Types of metadata to extract
– Languages to display
– Style sheet for generation of instrument
documentation or codebook
Using MQDS V1 XML: Codebook in Five Languages
National Latino and Asian American Study
www.icpsr.umich.edu/CPES
MQDS Version 1
• Limitations
– XML not DDI-compliant
• DDI Version 2 did not have XML tags for all metadata
provided by Blaise
• Did not provide easy means of adding XML tags without
becoming noncompliant
– XML files for complex surveys can be very large (text files)
• Entire files had to be processed in computer memory
• Limited ability to fully automate documentation
DDI Version 3
• Released April 2008
• Focus on complete data life cycle, going
beyond the codebook
DDI Version 3
• Included extensions proposed by DDI
working group on instrument design
Persistent Content of Question
• Question text
– Static
– Dynamic or variable
• Multiple-part question
• Response domain
– Open
– Set categories
– Special types (date, time, etc.)
• Definitional text
Use of Question in Instrument
• Order and routing
– Sequence / skip patterns
– Loops
• Universe
• Analysis unit
• Instructions
MQDS Version 3
• Joint SRC and ICPSR venture
• Goals:
– Address version 2 limitations
• Process Blaise instrument of any size
– Exploit new elements and validate to the recently
released DDI version 3 standard
– Move from processing XML metadata in memory
to streaming metadata to a relational database
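The move from in-memory processing to streaming can be illustrated roughly as follows. This is a sketch under stated assumptions, not the MQDS implementation: the XML element names (`instrument`, `question`, `text`) are invented, the function `stream_questions_to_db` is hypothetical, and Python's built-in `sqlite3` stands in for SQL Server.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Invented, simplified XML; real MQDS/DDI markup is richer and namespaced.
SAMPLE_XML = b"""<instrument>
  <question name="AGE"><text>How old are you?</text></question>
  <question name="SEX"><text>Are you male or female?</text></question>
</instrument>"""


def stream_questions_to_db(xml_source, conn):
    """Insert each <question> as soon as it is fully parsed, then clear it."""
    conn.execute("CREATE TABLE IF NOT EXISTS question (name TEXT, text TEXT)")
    count = 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "question":
            conn.execute(
                "INSERT INTO question (name, text) VALUES (?, ?)",
                (elem.get("name"), elem.findtext("text")),
            )
            elem.clear()  # release the element's children and text
            count += 1
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")  # stand-in for SQL Server in this sketch
n = stream_questions_to_db(io.BytesIO(SAMPLE_XML), conn)
rows = conn.execute("SELECT name, text FROM question ORDER BY name").fetchall()
print(n, rows)
```

Because each question is written to the database and cleared as it is read, memory use depends mainly on the size of one question rather than on the size of the whole instrument.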
MQDS Version 3
Relational Database: Import, Export, Transform
[Diagram: three-step workflow around a relational database (SQL Server / SQL Server Express)]
• 1. Import: user specifies input files (location, file type, etc.); sources are the Blaise datamodel (BMI), the Blaise database (BDB), and other file types (e.g., SAS, SPSS)
• 2. Export: user specifies output files (location, language/locale, XML output options, etc.); output is XML (DDI 3)
• 3. Transform: user specifies stylesheet selection criteria, type of output desired (html, rtf, pdf), etc.; outputs are a questionnaire and a codebook
• Also shown: database connection settings; DDI 3 elements not in *.bmi
MQDS Version 3
• Relational database
– DDI compliant standardized tables
– Flexibility for SRC and ICPSR to add extensions that meet
their specific organizational needs
– Allows
• Automated documentation of any Blaise survey
instrument
• Importing and documenting data produced by other
software
• Lower cost development of other tools that facilitate
editing and disseminating data
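One way to picture standardized tables that still leave room for organization-specific extensions is sketched below. The table and column names (`variable`, `category`, `variable_extension`, `ext_key`, `ext_value`) are hypothetical and do not reproduce the actual MQDS schema; `sqlite3` stands in for SQL Server or Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for SQL Server / Oracle
conn.executescript("""
CREATE TABLE variable (
    variable_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    label       TEXT
);
CREATE TABLE category (
    variable_id INTEGER NOT NULL REFERENCES variable(variable_id),
    code        TEXT NOT NULL,
    label       TEXT
);
-- Organization-specific additions live in their own table,
-- so the shared core tables stay unchanged.
CREATE TABLE variable_extension (
    variable_id INTEGER NOT NULL REFERENCES variable(variable_id),
    owner       TEXT NOT NULL,
    ext_key     TEXT NOT NULL,
    ext_value   TEXT
);
""")

# Invented example rows.
conn.execute("INSERT INTO variable VALUES (1, 'AGE', 'Age of respondent')")
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [(1, "98", "Don't know"), (1, "99", "Refused")],
)
conn.execute(
    "INSERT INTO variable_extension VALUES (1, 'ICPSR', 'disposition', 'public-use')"
)

rows = conn.execute("""
    SELECT v.name, x.owner, x.ext_key, x.ext_value
    FROM variable v
    JOIN variable_extension x ON x.variable_id = v.variable_id
""").fetchall()
print(rows)
```

Under this assumed layout, either organization could add its own keys without altering the tables the other relies on.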
MQDS V3 Prototype: Exporting Language XML
MQDS Development
• Expect to release Summer 2009
• Working out a distribution plan for Blaise
users
Data Life Cycle Coverage
Applications:
Customized Editing Tool
Peter Granda
ICPSR
MQDS Version 3
• Relational database
– DDI compliant standardized tables
– Flexibility for SRC and ICPSR to add extensions
that meet their specific organizational needs
– Allows: Development of new tools to deal with the
practical problems involved in transforming data and
documentation derived from Blaise instruments into
public-use products
Features of the Tool
• Loads MQDS output into database tables
• Web interface to permit quick viewing
• Application that permits both internal and external
clients to access and edit variable-level information
• Ability to include disposition codes to designate which
variables to include in public-use files
• Maintain permanent record of decisions made
throughout the editing process
SELECT VARIABLE TO EDIT FROM DATABASE POPULATED
WITH METADATA FROM MQDS WITH POSSIBLE REVISIONS
FROM SUBSEQUENT DATA PROCESSING STEPS
• Variable Name
• Variable Label
• Value Labels
• Question Text
• Universe Statements
• List of Standard Formats
VARIABLE DISPOSITION:
• Place in public-use file
• Place in restricted-use file
• Leave in original file created by the data producer
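The three dispositions, together with a permanent record of editing decisions, might be modeled along these lines. This is an illustrative sketch only: `Disposition`, `DispositionLog`, the variable names, and the editor names are hypothetical and do not describe the actual editing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List


class Disposition(Enum):
    PUBLIC_USE = "Place in public-use file"
    RESTRICTED_USE = "Place in restricted-use file"
    ORIGINAL_ONLY = "Leave in original file created by the data producer"


@dataclass
class DispositionLog:
    """Current disposition per variable plus an append-only decision history."""
    current: Dict[str, Disposition] = field(default_factory=dict)
    history: List[tuple] = field(default_factory=list)

    def decide(self, variable: str, disposition: Disposition,
               editor: str, note: str = "") -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        # History is only ever appended to, so earlier decisions are kept.
        self.history.append((stamp, variable, disposition.name, editor, note))
        self.current[variable] = disposition

    def public_use_variables(self) -> List[str]:
        return sorted(v for v, d in self.current.items()
                      if d is Disposition.PUBLIC_USE)


log = DispositionLog()
log.decide("AGE", Disposition.PUBLIC_USE, editor="editor1")
log.decide("ZIPCODE", Disposition.PUBLIC_USE, editor="editor1")
log.decide("ZIPCODE", Disposition.RESTRICTED_USE, editor="editor2",
           note="geographic detail")
print(log.public_use_variables(), len(log.history))
```

The latest decision determines which file a variable goes to, while the history retains every earlier decision for later review.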
Data Life Cycle Coverage
Social Science Variables
Database:
The Public Search
Sanda Ionescu
ICPSR
SSVD – The Public Search
• ICPSR variables search
– Internal (staff, other authorized users)
– External (public)
SSVD – The Public Search
• Enables ICPSR users to search variables
across datasets
• Assists in data discovery, comparison,
harvesting, and analysis
• Useful in question mining for designing new
research
SSVD – The Public Search
• Concept first tested in a pilot project
completed in 2005
– Good functionality
– Demonstrated benefits of using DDI markup:
easy import; complex, granular searches; user-friendly display
– Limited number of data sets (69 ICPSR studies
included)
SSVD – The Public Search
• Expand the project to ultimately include
most of ICPSR’s holdings
– Generate DDI documentation for most ICPSR
studies
• Need for automated production
– Build a solid, state-of-the-art, DDI compliant
database
• Handle large number of files
• Support multiple applications
SSVD – The Public Search
• The Hermes batch processing system *:
– ASCII data file
– SPSS system / portable file (mandatory)
– Statistical setups: SPSS, SAS, Stata
– Ready-to-go data files: SAS transport, SPSS portable, Stata system
– Question text file in fixed format (optional)
– DDI 2.1 variable-level documentation with frequencies [and question text (optional)]
– (Part of) PDF codebook
*This is a simplified diagram
SSVD – The Public Search
• Hermes:
– Consistent, reliable source of variables
descriptions in DDI
– DDI documentation limited to content of input
files
• Labels may be truncated or may contain
abbreviations
• Question text may be missing although available in
original documentation
SSVD – The Public Search
• Additional quality standards necessary for
DDI documentation, to maximize
effectiveness of Public Search:
– Presence of question text, whenever available
– Increased readability of variable/value labels,
especially if question text is not present
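A check for the first standard, presence of question text, could look roughly like this. The fragment uses DDI Codebook element names (`var`, `labl`, `qstn`, `qstnLit`) but is heavily simplified: the namespace and most attributes are omitted, the values are invented, and the function `needs_review` is hypothetical rather than part of any ICPSR tool.

```python
import xml.etree.ElementTree as ET

# Simplified fragment with DDI Codebook element names; namespace and most
# attributes are omitted, and all values are invented for illustration.
SAMPLE = """<codeBook><dataDscr>
  <var name="V1"><labl>AGE OF RESPONDENT</labl>
    <qstn><qstnLit>How old were you on your last birthday?</qstnLit></qstn></var>
  <var name="V2"><labl>R EVR MARR</labl></var>
</dataDscr></codeBook>"""


def needs_review(xml_text):
    """Return names of variables that have no literal question text."""
    root = ET.fromstring(xml_text)
    flagged = []
    for var in root.iter("var"):
        text = var.findtext("qstn/qstnLit")
        if not (text and text.strip()):
            flagged.append(var.get("name"))
    return flagged


print(needs_review(SAMPLE))
```

Variables flagged this way are the ones where label readability matters most, since no question text is available to explain an abbreviated label.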
SSVD – The Public Search
• Not all ICPSR studies qualify for variable-level searches
• Criteria for selecting studies; not included:
– Aggregate/statistical data (e.g., Census data, Data Books,
Roll Call records)
– Poor documentation
– Some restricted data
SSVD – The Public Search
• Pre-SSVD upload:
– Review of DDI output from Hermes to apply
content quality standards and study selection
criteria
– Additional work to upgrade DDI where necessary
(and feasible)
• Add question text
• Complete truncated text
• Improve readability of labels
• Add frequencies
SSVD – The Public Search
• Preparing studies for SSVD:
– Started end of 2006
– Included DDI produced for previous projects
– Reviewed all variable-level DDI created at
ICPSR, November 2006 to present (new
releases and updates)
SSVD – The Public Search
• New database finalized Fall 2008
• Built to match DDI 3.0 data model
• Both DDI 2.x and DDI 3.0 compliant
– Designed to accept both DDI 2.x and 3.0 input
and produce output in both versions
• ICPSR version currently uploads DDI 2.1 and
generates DDI 3.0 individual variables
descriptions.
SSVD – The Public Search
• First batch of variable-level description files
uploaded into SSVD:
– Approx. 3,500 DDI files (one file per dataset),
representing
• Approx. 1,300 ICPSR studies (approx. 18.5 percent of
total ICPSR holdings, excluding US Census; approx. 30
percent of holdings with data and setups)
– Over 1,000,000 individual variable descriptions;
23,000,000 categories
SSVD – The Public Search
• Currently in Beta-testing phase.
– Email bugs at [email protected]
• Uses Oracle Text.
http://www.icpsr.umich.edu/ICPSR/ssvd/index.html
SSVD – The Public Search
Moving forward…
• Fall 2009: switch to Solr searches (based on
Lucene)
– Faster
– More sophisticated: results filtered by multiple relevant
parameters
• Enable side-by-side/same page display of
selected variables for comparison
• Enable variable search from individual study
page (search within study)
SSVD – The Public Search
Moving forward…
• Adding content:
– Second batch of DDI files ready to upload:
• 900 DDI files, representing 500-600 studies (will bring
total close to 45 percent of ICPSR studies with data
and setups)
– Initiate retrofit project to examine older studies
that were not covered in the first conversion
phase
SSVD – The Public Search
Moving forward…
• Transition to automated DDI upload
– DDI uploaded at the time of study publication
– First quality check performed by study processing
staff
– Acceptable DDI immediately released for public
view
– Problematic DDI suppressed from public view for
further review, and upgrade as appropriate
Data Life Cycle Coverage
Applications: Internal
Variable Search and
Documentation
Felicia LeClere,
ICPSR
The Integrated Fertility Survey
Series
• 5-year grant from NICHD to harmonize data
from 10 large surveys of marriage, fertility, and
child-bearing in the United States
• 10 surveys conducted from 1955 through 2002
Problem of Harmonization
• To make decisions about harmonizing
across all files, we need:
– Question text
– Value labels and categories
• Be able to find and export metadata from all
10 files at the variable level
• Be able to document each variable, recode,
and variable choice
Tools from Variables Database
• Need to be able to do nested searches that are
documented
• Need to be able to search all fields individually and in
sequence
• Need to be able to download results and document
what search terms were used
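The three needs above (nested searches, field-by-field searching, and a download that records the search terms) can be sketched as follows. The records, field names, and the `DocumentedSearch` class are invented for illustration and do not reflect the SSVD schema or interface.

```python
import csv
import io

# Invented records; field names are illustrative, not the SSVD schema.
VARIABLES = [
    {"study": "Survey A", "name": "BIRTHYR", "label": "Year of birth",
     "question": "In what year were you born?"},
    {"study": "Survey B", "name": "DOB_Y", "label": "Birth year",
     "question": "What is your date of birth?"},
    {"study": "Survey B", "name": "MARYR", "label": "Year of marriage",
     "question": "In what year were you married?"},
]


class DocumentedSearch:
    """Each search narrows the previous result set and is recorded."""

    def __init__(self, records):
        self.results = list(records)
        self.steps = []  # (field, term) pairs, in the order applied

    def narrow(self, field, term):
        self.results = [r for r in self.results
                        if term.lower() in r[field].lower()]
        self.steps.append((field, term))
        return self

    def to_csv(self):
        out = io.StringIO()
        # First line documents which search terms produced the download.
        out.write("# search: " + "; ".join(
            f"{f} contains '{t}'" for f, t in self.steps) + "\n")
        writer = csv.DictWriter(
            out, fieldnames=["study", "name", "label", "question"])
        writer.writeheader()
        writer.writerows(self.results)
        return out.getvalue()


search = (DocumentedSearch(VARIABLES)
          .narrow("question", "year")
          .narrow("label", "birth"))
print(search.to_csv())
```

Writing the search terms into the downloaded file keeps the documentation attached to the results it describes.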
ICPSR SSVD Internal Search
• All 10 data sets were loaded into ICPSR’s version
of the shared database
• Designed to capture all of the relevant fields
that were marked up in DDI
Entry screen for internal search
Search results screen
Excel download from search
Can also download value labels and codes
Search Utilities
• Downloaded search fields serve to:
– 1. Identify variables to be harmonized
– 2. Provide metadata for “translation tables” which
are used to harmonize files
Harmonization steps
• Use search results to populate two
intermediate steps in re-forming the data set
• Exploratory comparative tables
» Use this comparative table to make decisions about
harmonization by examining universes, question texts,
and response categories
• Translation tables
» These tables are designed to provide instructions on
recoding the underlying items from the 10 surveys to a
single harmonized item. The table provides instructions
to an automated SAS program that recodes items from 10
surveys.
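The idea of a translation table driving an automated recode can be sketched as below. The project described here uses a SAS program; Python is swapped in purely for illustration, and the table layout, survey identifiers, variable names, and codes are all invented.

```python
# Invented translation table: one row per (survey, source variable, source code).
TRANSLATION_TABLE = [
    # survey, source_var, source_code, harmonized_var, harmonized_code
    ("S1", "BPLACE", "1", "BORN_US", "1"),
    ("S1", "BPLACE", "2", "BORN_US", "0"),
    ("S2", "NATIVITY", "A", "BORN_US", "1"),
    ("S2", "NATIVITY", "B", "BORN_US", "0"),
]


def build_recode_map(table):
    """Index the table so each source value maps to its harmonized value."""
    recode = {}
    for survey, src_var, src_code, new_var, new_code in table:
        recode[(survey, src_var, src_code)] = (new_var, new_code)
    return recode


def harmonize(survey, record, recode):
    """Recode one respondent record; values with no table row are left out."""
    out = {}
    for var, code in record.items():
        target = recode.get((survey, var, code))
        if target is not None:
            new_var, new_code = target
            out[new_var] = new_code
    return out


recode = build_recode_map(TRANSLATION_TABLE)
print(harmonize("S1", {"BPLACE": "2"}, recode))
print(harmonize("S2", {"NATIVITY": "A", "OTHER": "9"}, recode))
```

Because the recode logic reads everything from the table, harmonization decisions can be changed by editing the table alone, and the table doubles as documentation of each recode.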
Comparative table – date of birth
Translation Table for place of birth
Harmonization steps
• After the translation table is complete, the recode
instructions for all 10 files are built into the
SAS file and a new data file is created.
• The underlying metadata provided by the
database allow us to (1) search all 10 files, (2)
explore comparability, and (3) recode to new
variables