Transcript Slide 1
Managing the Metadata Lifecycle
The Future of DDI at GESIS and ICPSR
Peter Granda, ICPSR
Meinhard Moschner, GESIS
Mary Vardigan, ICPSR
Joachim Wackerow, GESIS
Wolfgang Zenk-Möltgen, GESIS

Research Data Life Cycle
Diagram labels: Archiving, Concept, Collection, Processing, Distribution, Discovery, Repurposing, Analysis

Current Uses of DDI
• DDI 2 used for many different purposes by many different archival institutions, e.g., metadata records for data catalogs, export to Web-based information systems such as Nesstar, long-term preservation, and PDF codebooks
• GESIS and ICPSR are developing procedures and systems to extend use of DDI in their institutions

DDI 3 Expands in Scope
• To date use mainly limited to Distribution and Archiving stages of data life cycle
• DDI 3 enables use of new elements and structures to extend markup to other stages of the life cycle - both earlier and later
• Emphasis is on projects and tasks already in process at each institution

DDI 3 Use at GESIS
• Structured Comments – Processing
• Translation of EVS Questionnaire – Collection
• Supporting Enhanced Publications – Analysis
• Continuity Guides: Trends by Concepts – Concept, Discovery, Repurposing

Extracting structured information in current workflow
• Example: building derived variables by SPSS
• SPSS setups contain commands and comments
• Necessary steps for using SPSS setups as information source for DDI
  – Improving comments for automated extraction
    • formalize layout
    • add keywords from a list
  – Extraction of structured comments and related commands by custom tool.
  – Transformation of this information into DDI 3 fragments

Extracting structured information in current workflow
***v* Variables/DerivedVariables
* DESCRIPTION
* This section is on derived variables;
***.
***v* DerivedVariables/w101_new
* NAME
* w101_new
* DESCRIPTION
* w101_new is a derived variable from w101;
* It has the original value from w101
* when w102 is equal 1
* otherwise it has the value 5;
* USED VARIABLES
* w101, w102
* SOURCE
**.
compute w101_new = 5 .
if ( w102 = 1 ) w101_new = w101 .
**
* VERSION
* 2009-04-18
* AUTHOR
* Achim Wackerow
* EMAIL
* [email protected]
***.

Diagram labels: SPSS, Extractor, Report (HTML), DDI 3 fragments (GenerationInstruction, Description, Command), Result

Translation of EVS Questionnaire
DSDM
http://zacat.gesis.org

Supporting Enhanced Publications
DDI Alliance
Publications with References to Data:
DDI 3.1 URN contains: Agency, Object, Version
Publication with References (URNs)
http://resolve.gesis.org: find object, return URL
http://www.gesis.org/doc/docxyz
URL of Documentation and/or Data
<urn:ddi:3_1:VariableScheme.Variable=gesis.de.ddi:ZA3811_VarSch(1_0).V8(1_0)>

Supporting Enhanced Publications
DSDM
DDI 3
EPE Simple Export Wizard 1.2.0

Grouping Trends
• Continuity guides in different contexts
  – Synoptical question / variable lists
  – Documentation of changes in question wording / answer scales
• Systematic organization by conceptual categories
  – CodebookExplorer tool (relational DB)
  – Publication as HTML links on variable level in ZACAT
• Taking advantage of DDI3 in the future
  – Defining the standard and comparison
  – Qualifying relations (e.g.
q-text modified, scale modified, …)

Continuity guides
Literal question text over time
Conceptual categories
Deviations in answer categories

Trends by concepts
Trend variables by study
Conceptual categories
Country 1
Country 2

DDI3
Diagram labels: RESOURCE "Ex-post Standard", Universe, Concept, Comparison map, Equivalency, Relationship, Description, Data Collection, STUDY UNIT 1 … n

DataCollection
<dc:QuestionScheme id="QS">
  <dc:QuestionItem id="Q">
    <dc:QuestionText>
      <dc:LiteralText>
        <dc:Text>Do you …?</dc:Text>
      </dc:LiteralText>
    …
    <dc:CodeDomain>
      <r:CodeSchemeReference>
        <r:ID>CODS1</r:ID>
      </r:CodeSchemeReference>

Logical Product
<l:CategoryScheme id="CATS1">
  <l:Category id="Cat1">
    <r:Label>often</r:Label>
  …
<l:CodeScheme id="CODS1">
  <l:CategorySchemeReference>
    <r:ID>CATS1</r:ID>
  </l:CategorySchemeReference>
  <l:Code isDiscrete="true">
    <l:CategoryReference>
      <r:ID>Cat1</r:ID>
    </l:CategoryReference>
    <l:Value>1</l:Value>
  </l:Code>
  …

<dc:QuestionScheme id="QS">
  <dc:QuestionItem id="Qn">
    …
    <dc:Text>Have you …?</dc:Text>
    …

LogicalProduct
Comparison annotations: Label: identical; Values: different; generation instruction: scale reversed
<l:CategoryScheme id="CATS1">
  <l:Category id="Cat1">
    <r:Label>often</r:Label>
  …
<l:CodeScheme id="CODS1">
  …
  <l:Code isDiscrete="true">
    <l:CategoryReference>
      <r:ID>Cat1</r:ID>
    </l:CategoryReference>
    <l:Value>4</l:Value>
  </l:Code>
  …

GROUP
STUDY UNIT 8-14: DataCollection …
GROUP LogicalProduct
STUDY UNIT 15-x: … DataCollection … LogicalProduct …

DDI 3 Use at ICPSR
• Information collected from data producers in precollection phase – Concept
• Metadata output from CAI applications – Data Collection
• Processor's dashboard – Data Processing
• Metadata mining: New faceted search tool to facilitate discovery through more precise searching – Data Discovery
• Relational database for comparison and harmonization across studies – Repurposing

SMDS Metadata Modules
OAIS
Diagram labels: AIP, Repurposing, SIP
A combination of this information forms a traditional SIP.
The structured metadata combined
with data - sent to the archive - can be understood as dynamic SIP.

Information from each life cycle stage (diagram labels: Concept, Collection, Processing)
• Custom Tools (e.g. Forms-based)
• CAI Tools, MQDS etc.
• Information extracted from SPSS etc.
Self-archiving by web forms can be offered for the different stages.

… metadata forms the core of the archive. It would be organised in a way where metadata can be reused and information can be ingested and distributed in a dynamic way.

An AIP must be specially built, because the … can include just references to other reused metadata. An AIP should include everything of one study; DDI can be also the main structure of the AIP. Data can be inline in DDI. An AIP would exist beside the core structure in the archive. An easy roundtrip should be possible between the core structure and the AIP. The purpose of the AIP is comparable to PDF/A where all fonts are included.

DDI as backbone for structured metadata
The core structure is headed to efficient processing and reuse of metadata.

Archive
Data / Documents outside of DDI

DIP Distribution Packages
• Web information system
• Search engines
• Statistical packages
• Online Analysis

Distribution
Discovery
Analysis

DDI-based archive as collection of reusable components
• Metadata in DDI is structured in small items which can be identified and maintained by one or more institutions
• These parts can be
  – the basis for comparison and metadata mining (discovery of new relationships)
  – a candidate for reuse in other studies or new studies (like standard questions or variables)

Diagram labels:
Study 1: Study-specific information, Items for reuse
Study 1: Study-specific information, Items for reuse
New study
Repository of reusable components: Standard concepts, Standard questions, Standard variables, Harmonized information, Controlled vocabularies

Issues for Discussion
• Advantages and disadvantages of seeking to capture additional metadata throughout the data life cycle
• How much information to make available to funding agencies, data producers, and secondary users?
• Rules for structured documentation and delivery of items to archives for preservation
• An overall DDI tool to capture and curate all metadata and data – the Holy Grail???
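
Illustrative sketch (not part of the original slides): the "Extracting structured information in current workflow" slides describe pulling structured comments and the related commands out of an SPSS setup and turning them into DDI 3 fragments. The Python below is a minimal sketch of that idea, not the GESIS extractor itself. The one-item-per-line comment layout, the keyword list, and the function names parse_block and to_fragment are assumptions made for illustration. The element names GenerationInstruction, Description and Command come from the slide, but they are emitted here without namespaces, so the output is not schema-valid DDI 3.

```python
import xml.etree.ElementTree as ET

# Keywords as they appear in the slide example; the real keyword list is an assumption.
KEYWORDS = {"NAME", "DESCRIPTION", "USED VARIABLES", "SOURCE",
            "VERSION", "AUTHOR", "EMAIL"}

# Structured comment block from the slide, in an assumed one-item-per-line layout.
SETUP = """\
***v* DerivedVariables/w101_new
* NAME
* w101_new
* DESCRIPTION
* w101_new is a derived variable from w101;
* It has the original value from w101
* when w102 is equal 1
* otherwise it has the value 5;
* USED VARIABLES
* w101, w102
* SOURCE
**.
compute w101_new = 5 .
if ( w102 = 1 ) w101_new = w101 .
**
* VERSION
* 2009-04-18
***.
"""


def parse_block(text):
    """Collect the lines under each keyword; SPSS commands go under SOURCE."""
    fields = {}
    current = None       # keyword whose content lines are being collected
    in_source = False    # True while reading SPSS commands between '**.' and '**'
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("***"):   # block header ('***v* ...') or block end ('***.')
            in_source, current = False, None
        elif line == "**.":          # comment closes, SPSS commands follow
            in_source = True
        elif line == "**":           # commands end, comment resumes
            in_source, current = False, None
        elif in_source:
            fields.setdefault("SOURCE", []).append(line)
        elif line.startswith("*"):
            content = line[1:].strip()
            if content in KEYWORDS:
                current = content
                fields.setdefault(current, [])
            elif current is not None and content:
                fields[current].append(content)
    return fields


def to_fragment(fields):
    """Build a simplified, un-namespaced fragment (hypothetical structure)."""
    instruction = ET.Element("GenerationInstruction")
    description = ET.SubElement(instruction, "Description")
    description.text = " ".join(fields.get("DESCRIPTION", []))
    command = ET.SubElement(instruction, "Command")
    command.text = "\n".join(fields.get("SOURCE", []))
    return instruction


fields = parse_block(SETUP)
fragment = to_fragment(fields)
print(ET.tostring(fragment, encoding="unicode"))
```

Keeping the parser keyed to a fixed keyword list mirrors the slide's advice to formalize the comment layout and add keywords from a list: the stricter the comments, the simpler the extraction step can be.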