E-Quality in the Library


The importance of having data-sets
(freely adapted from Oscar Wilde)
Datasets as the crown jewels of an institute's
scientific infrastructure
2006 IATUL CONFERENCE Porto
July 16, 2015
Ronald Dekker
Data-set importance
• Verification of publications
(results = analysis + data)
• Longitudinal research
(long periods, meta-research)
• Interdisciplinary use of data
(reuse/innovation)
• Valorisation
(get new projects based on data set ownership)
A scientific workflow
[Workflow diagram with the elements: Data, Parameters, Model, Result, Article]
Data publication today
Modified after Helly et al. (2003)
[Diagram with the elements: Research, Data, Metadata, Private Files, Manuscript, Publication, Library]
“In archival terms the last quarter of
the 20th century has some
similarities to the dark ages. Only
fragments or written descriptions of
the digital maps produced exist.
The originals have disappeared or
can no longer be accessed.”
Taylor
The hydrological research DARELUX
Data Archiving River Environment LUXemburg
The relation between rainfall and
discharge
Rain
And everything in between
Discharge
Long term
Modeling discharge prediction
Direct
Data Archiving River Environment LUXemburg
Huelerbach: 1.6 km2, Paris basin (Sandstone, Lime)
Maisbich: 1.2 km2, Ardennes Massif (Slate)
Measurements
[Diagram labels: Interception, Transpiration, Rainfall, Surface, Subsurface soil, Deep soil, River]
Rainfall: ASTA, Administration des services techniques de l'Agriculture
Interception: TU Delft, Gabriel Lippmann Inst.
Measurements
TUD: Road run-off
University Utrecht: Soil moisture
Gabriel Lippmann, TUD: Piezometers
University Luxemburg, Gabriel Lippmann, TUD: V Notch
University Luxemburg, Gabriel Lippmann, TUD: Tracers
[Diagram labels: Interception, Transpiration, Rainfall, Surface, Subsurface soil, Deep soil, River]
Research pilot DARELUX
Data Archiving River Environment LUXemburg
Why is archiving important: the researchers' view
Direct:
• organized storage and metadata assignment
• data exchange, closed user groups
• elaborating raw data
Research pilot DARELUX
Data Archiving River Environment LUXemburg
Why is archiving important: the researchers' view
Long term:
• (long) time series
• reuse of data in education and research
• verification
• enhanced continuity
Research pilot DARELUX
Data Archiving River Environment LUXemburg
Why is archiving important: the researchers' view
It can be done otherwise…
Records of floods in Koblenz
The DARELUX approach
Capture, Publish and Preserve
Capture and use
[Diagram: several sensors feed Data acquisition, followed by Data correction and enrichment, Data ingest (OAIS), Data storage (OAIS) and Retrieval (OAIS); ingest, storage and retrieval form the Archive, which serves the users and the Model]
A DARELUX community
Use and re-use of the DARELUX
archive data
• Primary users
• Working with the archive (Delft and Utrecht)
• Secondary users
• Store data in the archive, use data from the archive
(ASTA and Gabriel Lippmann)
• Tertiary users
• Use data from the archive (the world)
Publish
• Data publication today suffers from several flaws in the
publication process:
• Data are not published in journals due to economic
constraints
• There is little merit in data publication for the
author because data are not citeable
• Data are not citeable due to their often transient
web locations (URLs)
Klump et al.
Publish: what needs to be done
• Data publications must be citeable to be “valuable”
• Reputation is the “currency” of science
• Authors will only make this effort if it is easy enough
and worthwhile
• Preparing data for publication takes a lot of effort
• Data must be accessible
• Use of persistent identifiers and long-term storage
Klump et al.
How to make data citeable
• To become persistent, data sets need persistent identifiers (e.g.
DOI, URN).
• Piotrowska et al. (2005): Extraction and AMS radiocarbon
dating of pollen from Lake Baikal Sediments. Scientific Drilling
Database. doi:10.1594/GFZ.SDDB.1014
• This dataset relates to:
• Piotrowska, N., Bluszcz, A., Demske, D., Granoszewski, W. &
Heumann, G. (2004): Extraction and AMS radiocarbon dating
of pollen from Lake Baikal sediments. Radiocarbon, 46 (1),
181-187.
Klump et al.
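A citation like the one above can be assembled mechanically from a few metadata fields plus the persistent identifier. The sketch below is an illustration only: the `DataCitation` class and its field names are hypothetical, and the resolver prefix `https://doi.org/` is the public DOI proxy, which the slide itself does not mention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical record holding the fields of a data citation."""
    authors: str
    year: int
    title: str
    publisher: str
    doi: str

    def resolver_url(self) -> str:
        # The DOI proxy turns the identifier into a stable link,
        # independent of where the data currently lives.
        return f"https://doi.org/{self.doi}"

    def as_text(self) -> str:
        # Same layout as the dataset citation shown on the slide.
        return (f"{self.authors} ({self.year}): {self.title}. "
                f"{self.publisher}. doi:{self.doi}")


citation = DataCitation(
    authors="Piotrowska et al.",
    year=2005,
    title=("Extraction and AMS radiocarbon dating of pollen "
           "from Lake Baikal Sediments"),
    publisher="Scientific Drilling Database",
    doi="10.1594/GFZ.SDDB.1014",
)
print(citation.as_text())
print(citation.resolver_url())
```

The point of the design is that the citation carries the identifier, not a web location, so the reference stays valid when the hosting URL changes.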
Publish: how to make data citeable
• Data publications need more than persistent
identifiers; they also need to be stored in trusted long-term archives.
• Several initiatives are working on criteria to certify
trusted long-term archives.
• Centralised archives stand a better chance to exist for
a long time, but this does not rule out small
specialised repositories.
Klump et al.
And Preserve…..
Take care of the preservation of the data-sets
providing:
eternal access to scientific heritage
Quid aeternis minorem consiliis animum
fatigas?
Why burden your humble mind with plans for
eternity?
Horace
The risks in a nutshell
• Physical decay of storage media
• Loss of descriptive (meta) data: inability to retrieve
data and context
• Loss of “rendering” functionality caused by the inability
to run old software on new computers and operating
systems
• Who pays the ferryman?
Our approach: the e-Archive project
(2000-2002)
Digital preservation: The findings of the e-Archive project,
Ronald Dekker, Kees van der Meer, Eugène Dürr, Project's Final Report,
September 2003, http://durr.dhs.org/EArchive/publications/eArchivefindingsfinal13.pdf
• Preventing physical degradation
• replicate on a new medium every 10 (?) years
• store items as simple bit streams
• Preventing irretrievability
• store metadata and information inseparably
together; pay special attention to context and
provenance metadata
• Provide perpetual rendering
• use emulation or conversion strategies
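Replicating items as simple bit streams only helps if the copy can be shown to be identical to the original. A minimal sketch, assuming a SHA-256 checksum is used for that comparison (the slide does not name a fixity method, and the function names here are hypothetical):

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, target: Path) -> str:
    """Copy the bit stream to a new medium and verify the copy."""
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise IOError(f"replica of {source} does not match the original")
    return after


# Two temporary directories stand in for the old and the new medium.
with tempfile.TemporaryDirectory() as tmp:
    old_medium = Path(tmp) / "old" / "item.xml"
    new_medium = Path(tmp) / "new" / "item.xml"
    old_medium.parent.mkdir()
    new_medium.parent.mkdir()
    old_medium.write_bytes(b"<container>rainfall series</container>")
    checksum = replicate(old_medium, new_medium)
    replica_ok = new_medium.read_bytes() == old_medium.read_bytes()
    print(checksum, replica_ok)
```

Storing the checksum alongside the item would let every later migration be checked against the same reference value.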
Requirements for objects to be stored
• Atomic (indivisible unit) = one file
• Self descriptive via metadata; no references needed for basic
content
• Implemented as an XML container
• Independent of any context
• Data archaeology argument: readable ASCII
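These requirements can be illustrated with a small sketch that packs content and its metadata into one XML file. The element and attribute names (`container`, `metadata`, `content`, `docId`) and the sample values are assumptions made for illustration, not the e-Archive schema.

```python
import xml.etree.ElementTree as ET


def build_container(doc_id: str, metadata: dict, content: str) -> bytes:
    """Pack metadata and content into one self-descriptive XML file."""
    root = ET.Element("container", {"docId": doc_id})
    meta = ET.SubElement(root, "metadata")
    for name, value in metadata.items():
        ET.SubElement(meta, name).text = value
    ET.SubElement(root, "content").text = content
    # ASCII output keeps the file readable with the simplest tools;
    # non-ASCII characters are written as numeric character references.
    return ET.tostring(root, encoding="ascii")


blob = build_container(
    doc_id="example-0001",  # hypothetical identifier
    metadata={"title": "Rainfall series", "author": "Jörg Example"},
    content="2006-01-01,3.2\n2006-01-02,0.0\n",
)
restored = ET.fromstring(blob)
print(blob.decode("ascii"))
print(restored.find("metadata/author").text)
```

Because the metadata travels inside the same file as the content, the object stays interpretable even when it is separated from any catalogue or database.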
Open Archival Information System:
Six Functional Entities
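[Diagram: the six functional entities (Ingest, Archival Storage, Data Management, Access, Administration, Preservation Planning) sit between PRODUCER, CONSUMER and MANAGEMENT. The Producer submits a SIP to Ingest; Ingest passes AIPs to Archival Storage and Descriptive Info. to Data Management; the Consumer sends queries and orders to Access and receives result sets and DIPs.]
SIP = Submission Information Package
AIP = Archival Information Package
DIP = Dissemination Information Package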
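The package flow between the entities can be sketched in a few lines. This is a toy model under stated assumptions: the class and method names are hypothetical, the archive is an in-memory dictionary, and only Ingest, Archival Storage, Data Management and Access are represented.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """Content plus descriptive info; used for SIP, AIP and DIP alike."""
    kind: str               # "SIP", "AIP" or "DIP"
    identifier: str
    content: bytes
    descriptive_info: dict = field(default_factory=dict)


class ToyArchive:
    def __init__(self) -> None:
        self._storage: dict[str, Package] = {}   # Archival Storage
        self._catalogue: dict[str, dict] = {}    # Data Management

    def ingest(self, sip: Package) -> str:
        """Ingest: turn a producer's SIP into an AIP and store it."""
        if sip.kind != "SIP":
            raise ValueError("only SIPs can be ingested")
        aip = Package("AIP", sip.identifier, sip.content,
                      dict(sip.descriptive_info))
        self._storage[aip.identifier] = aip
        self._catalogue[aip.identifier] = aip.descriptive_info
        return aip.identifier

    def query(self, **criteria: str) -> list[str]:
        """Access: return identifiers whose descriptive info matches."""
        return [ident for ident, info in self._catalogue.items()
                if all(info.get(k) == v for k, v in criteria.items())]

    def order(self, identifier: str) -> Package:
        """Access: deliver a DIP derived from the stored AIP."""
        aip = self._storage[identifier]
        return Package("DIP", aip.identifier, aip.content,
                       dict(aip.descriptive_info))


archive = ToyArchive()
archive.ingest(Package("SIP", "ds-1", b"1.2,3.4", {"catchment": "Maisbich"}))
hits = archive.query(catchment="Maisbich")
dip = archive.order(hits[0])
print(hits, dip.kind)
```

Keeping SIP, AIP and DIP as distinct package kinds mirrors the separation between what a producer delivers, what the archive preserves, and what a consumer receives.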
The e-Archive building blocks: XML
containers as a logical storage structure
AIP:
• Use pure character streams ASCII/UTF
• Keep metadata together with content
• Store the original and one or more other representations.
• Use set of files in lightweight hierarchy
• Archive items in containers with XML.
• Archive also viewers in the archive: programs which give meaning to the content representation in the containers

Example article container (schematic):
<XML ... > ... dtd ..
<container docId="NLUBUX...">
  description metadata: <title.. <author <id
  preservation metadata: <lifetime... <rights ..
  viewer info: <viewer 1 ... <viewer 2 ...
  original input files: tex input files, bundled <![CDATA[ ... ]]>
  representation 2: html version <![CDATA[ ... ]]>
  representation 3: pdf version <![CDATA[ ... ]]>
</container>
DARELUX Architecture
Who pays the ferryman?
A business plan for DARELUX
• Costs
• Breakdown of cost
• €6000 per study / 20 datasets / 20 years
• Revenues
• Future use is not perceived as a source of income
• Funding
• Funding by research project (ingest, use, publication): primary
users
• Institutional or governmental funding needed for long term
preservation
• Added value services might create additional income
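Read as 6000 euro for one study of 20 datasets preserved for 20 years (an interpretation of the terse cost figure, not something the slide spells out), the cost per dataset and per dataset-year follows directly:

```python
# Assumption: EUR 6000 covers one study of 20 datasets kept for 20 years.
total_cost_eur = 6000
datasets_per_study = 20
years_preserved = 20

cost_per_dataset = total_cost_eur / datasets_per_study
cost_per_dataset_year = cost_per_dataset / years_preserved

print(f"per dataset: EUR {cost_per_dataset:.2f}")
print(f"per dataset per year: EUR {cost_per_dataset_year:.2f}")
```

Under that reading the figures are 300 euro per dataset and 15 euro per dataset per year.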
The project's status: things to do
• The DARELUX project is at midterm: another year to go (midterm
review resulted in a “green light” for the project)
• Goals for the second year:
• Involve secondary and tertiary users in the project
• Seek collaboration with an OA periodical, e.g. HESS
• Store data from secondary users in the archive (or make data
available via the archive)
• Further work on the business plan
• Upscale DARELUX to a (trusted repository) service
• Embedded in (3) TU (NL) and EU frameworks
And we are very interested in cooperation
Ronald Dekker
[email protected]
Data publication tomorrow?
Modified after Helly et al. (2003)
[Diagram with the elements: Research, Data, Metadata, Manuscript, Data publication, Publication, Scientific Data Network, Library Data Center, Library]
Quid aeternis minorem consiliis animum
fatigas?
Why burden your humble mind with plans for
eternity?
That’s why… Horace