Data Management Plans: A good idea, but not sufficient
Data Management Plans:
A good idea, but not sufficient
Andreas Rauber
Department of Software Technology and Interactive Systems
Vienna University of Technology
&
Secure Business Austria
[email protected]
http://www.ifs.tuwien.ac.at/~andi
Outline
Why are Data Management Plans good but insufficient?
From Data to Process Management Plans
How to capture process & context?
Summary
Sustainable (e-)Science
Data is key enabler in science
- Basis for evaluation and verification
- Basis for re-use
- Basis for meta-studies
Safeguarding investment made in data
Need to preserve and curate the data
Preservation: keeping useable over time
fighting mostly technical & semantic obsolescence
How to avoid data being lost after projects end?
Sustainable (e-)Science
Data Management Plans
as integral part of research proposals
Need recognized by researchers, funding bodies,…
Focus on
- Data
- Descriptions
- Declarations of activities to ensure long-term availability of data
Data Management Plans are good, but not sufficient!
https://dmp.cdlib.org/
https://data.uni-bielefeld.de/de/datamanagement-plan
https://dmponline.dcc.ac.uk/
Data Management Plans
Short, free-form text, requiring human interpretation
Declarations of intent
Not enforceable, hardly verifiable
(Burden remains with researchers / institutions,
who need to become data management experts)
Focuses solely on data, ignoring the process:
pre-processing, processing, analysis
Limits
- availability of data & results
- verification of results
- re-use and re-purposing
http://rci.ucsd.edu/_files/DMP%20Example%20Cosman.pdf
http://deepblue.lib.umich.edu/bitstream/handle/2027.42/86586/CoE_DMP_template_v1.pdf?sequence=1
From Data to Processes
Excursion: Scientific Processes
From Data to Processes
Rhythm Pattern Feature Set
- extracts numeric descriptors from audio
  basically 2 Fourier Transforms
  some psycho-acoustic modelling
  some filters (Gaussian, gradient) to make features more robust
- Used for
  music genre classification
  clustering of music by similarity
  retrieval
Implemented first in Matlab, then in Java
- both publicly available on website
- same same but different...
From Data to Processes
Excursion: scientific processes
[Figure: Java vs. Matlab output for the test signals set1_freq440Hz_Am11.0Hz, set1_freq440Hz_Am12.0Hz and set1_freq440Hz_Am05.5Hz]
From Data to Processes
Excursion: Scientific Processes
Bug?
Psychoacoustic transformation tables?
Forgetting a transformation?
Different implementation of filters?
Limited accuracy of calculation?
Difference in FFT implementation?
...?
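One of the candidate causes above, limited accuracy of calculation, can be illustrated with a small sketch. It is not the Rhythm Pattern code; it only runs the same naive DFT twice, once in double precision and once with every intermediate result rounded to single precision. The test signal (440 Hz tone, 11 Hz amplitude modulation, 8 kHz sampling rate, 256 samples) and all function names are assumptions made for this illustration.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dft_magnitudes(samples, round_fn=lambda v: v):
    """Naive DFT magnitudes; round_fn is applied to intermediate results."""
    n = len(samples)
    mags = []
    for k in range(n):
        re = im = 0.0
        for t, s in enumerate(samples):
            angle = -2.0 * math.pi * k * t / n
            re = round_fn(re + round_fn(s * round_fn(math.cos(angle))))
            im = round_fn(im + round_fn(s * round_fn(math.sin(angle))))
        mags.append(round_fn(math.hypot(re, im)))
    return mags


# Hypothetical test signal: 440 Hz tone, amplitude-modulated at 11 Hz
RATE = 8000
signal = [
    (1.0 + 0.5 * math.sin(2 * math.pi * 11 * t / RATE))
    * math.sin(2 * math.pi * 440 * t / RATE)
    for t in range(256)
]

double_precision = dft_magnitudes(signal)
single_precision = dft_magnitudes([to_float32(s) for s in signal], to_float32)
max_deviation = max(abs(a - b) for a, b in zip(double_precision, single_precision))
print(f"largest deviation between the two runs: {max_deviation:.3e}")
```

The two runs implement the same formula, yet their outputs are not bit-identical. Deviations of this kind are small on their own, but they can add up across the chained transformations of a longer process.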
From Data to Processes
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234
From Data to Processes
To sum up:
Data
- is the fuel for scientific processes
- is the result of scientific processes
Curation of data thus needs to consider these processes
Data Management Plans
- are data centric
- put too little focus on the processes associated with data
- are written by humans for humans
Outline
Why are Data Management Plans insufficient?
From Data to Process Management Plans
How to capture process & context?
Summary
Process Management Plans
Process Management Plans (PMPs)
Go beyond data to cover research process:
- ideas, steps, tools, documentation, results, …
- data is only one (important) element,
  commonly actually a result of a research (pre-)process
Ensure re-executability, re-usability
Must be machine-actionable & verifiable
Basis for preservation and re-use of research
Similar to “research objects”, “executable papers”, …
Process Management Plans
Need to establish
- Models for representing such process management plans (PMPs)
- Must be machine-readable and machine-actionable
- Identify “minimum set” of information
- Devise means to automate (most of) the activity in creating and maintaining those PMPs
- Establish them to replace (enhance / subsume / …) Data Management Plans
Process Management Plans
Structure of PMPs (following concept of DMPs):
1. Overview and context
2. Description of processes and their implementation
Process description | Process implementation | Data used and produced by process
3. Preservation
Preservation history | Long term storage and funding
4. Sharing and reuse
Sharing | Reuse | Verification | Legal aspects
5. Monitoring and external dependencies
6. Adherence and Review
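A minimal sketch of what "machine-actionable" could mean for the six-part structure above: the plan is held as structured data, so a program can check which sections are still missing. The class, the section keys and the example values are hypothetical and not taken from any PMP specification.

```python
import json
from dataclasses import asdict, dataclass, field

# Keys mirror the six sections listed above; the names are illustrative only.
REQUIRED_SECTIONS = (
    "overview_and_context",
    "processes_and_implementation",
    "preservation",
    "sharing_and_reuse",
    "monitoring_and_dependencies",
    "adherence_and_review",
)


@dataclass
class ProcessManagementPlan:
    title: str
    sections: dict = field(default_factory=dict)

    def missing_sections(self):
        """Return the required sections that are absent or empty."""
        return [s for s in REQUIRED_SECTIONS if not self.sections.get(s)]

    def to_json(self):
        """Serialize the plan so other tools can read and act on it."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


plan = ProcessManagementPlan(
    title="Music genre classification",
    sections={
        "overview_and_context": {"summary": "Classify music into genres"},
        "processes_and_implementation": {
            "process_description": "feature extraction, then classification",
            "process_implementation": "Java",
            "data_used_and_produced": ["MP3 input", "genre labels", "classification"],
        },
        "preservation": {"long_term_storage_and_funding": "unspecified"},
    },
)
print(plan.missing_sections())
# ['sharing_and_reuse', 'monitoring_and_dependencies', 'adherence_and_review']
```

Unlike a free-form text DMP, a plan in this form can be checked automatically, for example as part of a submission or review step.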
Outline
Why are Data Management Plans insufficient?
From Data to Process Management Plans
How to capture process & context?
Summary
Process Capture
Need to establish what forms part of a process:
- analyzing process documentation
- establishing context of process, relationships between elements
- monitoring of process activities
Capture and describe this in a context model
Architectural Concepts
Based on Enterprise Architecture Framework (Zachman), taxonomies (e.g. PREMIS), …
DIO: Domain-Independent Ontology
DSO: Domain-Specific Ontologies
(legal, sensor, multimedia codecs, …)
[Figure: DIO (ArchiMate) connected to DSO-1 and DSO-2 through the DIO-DSO1 and DIO-DSO2 Transformation Maps]
Process Capture
Example: Music Classification Process
Input: music (e.g. MP3 format)
Input: training data, i.e. music with genre labels
Output: classification of music, e.g. into genres
Intermediate steps
extract numeric description (features) from music
combine features with ground truth into specific file format, …
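The monitoring of such intermediate steps can be sketched with a decorator that logs every step together with digests of its inputs and outputs. This is an assumption-laden illustration, not the capture tooling used in the talk: the two step functions are toy stand-ins for feature extraction and for combining features with ground truth, and all names are hypothetical.

```python
import functools
import hashlib
import json

provenance_log = []


def _digest(value):
    """SHA-256 over a canonical JSON rendering of the value."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()


def captured(step):
    """Record name, input digest and output digest for every call of a step."""
    @functools.wraps(step)
    def wrapper(*args):
        result = step(*args)
        provenance_log.append({
            "step": step.__name__,
            "input_sha256": _digest(args),
            "output_sha256": _digest(result),
        })
        return result
    return wrapper


@captured
def extract_features(track):
    # Stand-in for real feature extraction: number of samples and their mean
    return [len(track), sum(track) / len(track)]


@captured
def combine_with_ground_truth(features, label):
    # Stand-in for writing features and genre label into a training file format
    return {"features": features, "genre": label}


record = combine_with_ground_truth(extract_features([0.1, 0.4, 0.3]), "jazz")
print([entry["step"] for entry in provenance_log])
# ['extract_features', 'combine_with_ground_truth']
```

The resulting log lists the steps in execution order and gives each one a fingerprint that a later re-execution can be compared against.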
Process Capture
Taverna
Process Capture
Software setup can be automatically detected in OS with
software packages (e.g. Linux);
allows detection of licenses, dependencies
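The slide refers to OS-level package managers. As a rough analogue, the sketch below captures the same kind of information (name, version, declared license, dependencies) for the Python distributions installed in the current environment, using the standard library's `importlib.metadata`. The function name and the layout of the result are assumptions for illustration.

```python
from importlib import metadata


def capture_python_environment():
    """List installed Python distributions with version, license and dependencies."""
    captured = []
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if not name:
            continue  # skip entries with missing or broken metadata
        captured.append({
            "name": name,
            "version": meta.get("Version") or "unknown",
            "license": meta.get("License") or "not declared",
            "requires": list(dist.requires or []),
        })
    return sorted(captured, key=lambda entry: entry["name"].lower())


environment = capture_python_environment()
print(f"{len(environment)} distributions captured")
```

Stored next to the process description, such a snapshot records which software versions a result depended on and which licenses apply to them.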
Process Capture
Process Capture
Example:
Music Classification Workflow
Process Re-deployment
Preservation and Re-deployment
“Encapsulate” as complex “research objects” (RO)
Re-Deployment beyond original environment
Format migration of elements of ROs
Cross-compilation of code
Emulation-as-a-Service, virtual machines, …
Process Re-deployment
Verification, Validation & Data
Verify correctness of re-execution
validation and verification framework
process instance data
points of capture
Metrics
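A minimal sketch of such a check, assuming that both the original run and the re-execution have stored numeric values at named points of capture. The function, the tolerance and the example values are hypothetical; they do not describe the validation and verification framework itself.

```python
import math


def verify_reexecution(original, rerun, rel_tol=1e-9):
    """Compare values captured at named points in two runs; return the deviations."""
    report = {}
    for point in sorted(set(original) | set(rerun)):
        if point not in original or point not in rerun:
            report[point] = "missing in one run"
            continue
        a, b = original[point], rerun[point]
        if len(a) != len(b):
            report[point] = "length mismatch"
        elif not all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)):
            report[point] = "values differ"
    return report


# Made-up process instance data for two runs of the same process
original_run = {"features": [0.12, 0.57, 0.91], "accuracy": [0.84]}
rerun = {"features": [0.12, 0.57, 0.91], "accuracy": [0.79]}
print(verify_reexecution(original_run, rerun))
# {'accuracy': 'values differ'}
```

An empty report means the re-execution matched the original at every point of capture within the chosen tolerance. Which tolerance is acceptable depends on the process and has to be decided per metric.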
Data and data citation
Identifying subsets of data in large and dynamic databases
Timestamping and versioning of data
Assigning PID (DOI, …) to time-stamped query
[Figure: PID Provider, PID Store, Query, Query Store, Data (Table A, Table B), Subsets]
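The idea of citing a subset through a time-stamped query can be sketched with an in-memory SQLite database. Everything concrete here is an assumption for illustration: the table layout, the logical integer timestamps, the `example-pid:` identifier scheme and the function names. A real deployment would obtain identifiers such as DOIs from a PID provider.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tracks (id INTEGER, genre TEXT, valid_from INTEGER, valid_to INTEGER);
    CREATE TABLE query_store (pid TEXT PRIMARY KEY, sql TEXT, executed_at INTEGER, result_hash TEXT);
""")
# Versioned rows: valid_to IS NULL marks the current version of a record.
conn.executemany("INSERT INTO tracks VALUES (?, ?, ?, ?)", [
    (1, "jazz", 1, None),
    (2, "rock", 1, 5),      # superseded at time 5 ...
    (2, "metal", 5, None),  # ... by a corrected label
])


def run_as_of(sql, timestamp):
    """Execute a query against the data as it was at the given timestamp."""
    rows = conn.execute(sql, {"ts": timestamp}).fetchall()
    return rows, hashlib.sha256(repr(rows).encode()).hexdigest()


def cite(sql, timestamp):
    """Store the time-stamped query and return a made-up persistent identifier."""
    _, digest = run_as_of(sql, timestamp)
    pid = "example-pid:" + hashlib.sha256(f"{sql}|{timestamp}".encode()).hexdigest()[:12]
    conn.execute("INSERT OR IGNORE INTO query_store VALUES (?, ?, ?, ?)",
                 (pid, sql, timestamp, digest))
    return pid


def resolve(pid):
    """Re-execute a cited query and check that it still yields the same subset."""
    sql, timestamp, stored = conn.execute(
        "SELECT sql, executed_at, result_hash FROM query_store WHERE pid = ?",
        (pid,)).fetchone()
    rows, digest = run_as_of(sql, timestamp)
    return rows, digest == stored


SUBSET = ("SELECT id, genre FROM tracks WHERE valid_from <= :ts "
          "AND (valid_to IS NULL OR valid_to > :ts) ORDER BY id")
pid = cite(SUBSET, 3)
print(resolve(pid))
# ([(1, 'jazz'), (2, 'rock')], True)
```

Because the rows are versioned rather than overwritten, the query cited at time 3 still returns the old label for track 2 even though the current data says otherwise, and the stored hash confirms that the subset is unchanged.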
Sustainable (e-)Science
How to get there?
Research infrastructure support
- Versioning systems
- Logging (“virtual lab-book”)
- Virtual machines / pre-configured virtual labs for research
- Data citation support for large, dynamic databases
R&D in process preservation, re-deployment & verification
- Evolving research environments, code migration, …
- Verification of process re-execution
- Financial impact, business models
Summary
Need to move beyond concept of data
Need to move beyond the focus on description
Process Management Plans (PMPs) extending DMPs
Process capture, preservation & verification
Capture “all” elements of a research process
Machine-readable and -actionable
Data and process re-use as basis for data driven science
Thank you!
http://www.ifs.tuwien.ac.at/imp