University of California Curation Services (UC3)

DataCite, DataONE, Dryad and UC3
William Michener
DataONE and University of New Mexico
John Kunze and Patricia Cruse
University of California Curation Center (UC3), California Digital Library and
DataONE
Ryan Scherle
Dryad (National Evolutionary Synthesis Center) and DataONE
A Choice
If the scientific record is at risk
– Results can’t be reproduced
– Science fails, global catastrophe ensues
The choice: Better data publishing, sharing, and archiving
OR
Planetary destruction?
Roberto Rizzato
A Vision for Change: DataONE
Providing universal access to data about life on earth and the environment that sustains it
• engaging the scientist in the data curation process
• supporting the full data life cycle
• encouraging data stewardship and sharing
• promoting best practices
• engaging citizens
• developing domain-agnostic solutions
1. Build on existing cyberinfrastructure
2. Create new cyberinfrastructure
3. Support new communities of practice
DataONE Cyberinfrastructure
Coordinating Nodes
• retain complete metadata catalog
• subset of all data
• perform basic indexing
• provide network-wide services
• ensure data availability (preservation)
• provide replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
Flexible, scalable, sustainable network
DataONE Wish List for Data Citation
• Precise identification of a dataset
– At level of version, file, table, cell, etc., or groups thereof
– So that readers can find and understand the data
• Credit to data producers and data publishers
– Vital incentive for data sharing and archiving
• A link from the traditional literature to the data
– Gives intellectual legitimacy to creation of data sets
• Research metrics for datasets
– Sponsors want publication and retention numbers
• Coordinated citation support for local data producers,
regional archives, and global end-users
Identifier Requirements
• To accommodate a diverse set of member nodes that hold a
wide variety of content, the DataONE system must adhere to
the following principles:
– Agnosticism – DataONE supports all identifier schemes where the ID
can be represented as a Unicode string.
– Opacity – DataONE does not attach any meaning or resolution
protocol based on the identifier.
– Authority – The identifier first assigned by a member node is
authoritative. Other identifiers may be assigned by other nodes for
internal use.
Identifier Requirements
• To participate in the DataONE network, a node must be able
to meet the following requirements:
– Uniqueness – Identifiers must be unique across the space of DataONE.
– Granularity – Every item must be assigned an identifier (metadata as
well as data).
– Immutability – The object referenced by an identifier cannot change.
If an object is modified, it must receive a new identifier.
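The three node requirements above can be illustrated with a toy in-memory registry. This is only a sketch under stated assumptions: `IdentifierRegistry`, `IdentifierError`, and their methods are hypothetical names invented for illustration and are not part of any DataONE API.

```python
class IdentifierError(Exception):
    """Raised when a registration would violate a requirement (hypothetical)."""


class IdentifierRegistry:
    """Toy registry illustrating uniqueness, granularity, and immutability.

    An illustrative sketch only, not the DataONE implementation.
    """

    def __init__(self):
        # identifier string -> stored bytes; entries are never overwritten
        self._objects = {}

    def register(self, identifier, content):
        # Agnosticism/opacity: any non-empty Unicode string is accepted
        # and is never parsed for meaning.
        if not isinstance(identifier, str) or not identifier:
            raise IdentifierError("identifier must be a non-empty string")
        # Uniqueness: an identifier may be assigned only once.
        if identifier in self._objects:
            raise IdentifierError(f"{identifier!r} is already assigned")
        self._objects[identifier] = bytes(content)

    def get(self, identifier):
        return self._objects[identifier]

    def revise(self, old_identifier, new_identifier, new_content):
        # Immutability: a modified object is stored under a NEW identifier;
        # the object behind the old identifier is left untouched.
        if old_identifier not in self._objects:
            raise IdentifierError(f"{old_identifier!r} is not registered")
        self.register(new_identifier, new_content)


registry = IdentifierRegistry()
# Granularity: data and metadata each get their own identifier.
registry.register("example:data-1", b"1,2,3")
registry.register("example:meta-1", b"<title>Example</title>")
registry.revise("example:data-1", "example:data-1.v2", b"1,2,3,4")
```

After the `revise` call, `example:data-1` still returns the original bytes, and the modified object is reachable only through its new identifier.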
Think Big, Start Small
CDL leading 2 projects involving DataONE:
1. EZID for simple identifier management
– Creates ids, stores metadata and resolver target URLs
– Supports DataCite DOIs and lower-cost ids (ARKs, URLs)
– First customer is DataONE member, Dryad
2. Excel “add-in” project with MS Research
– Extend Excel to support data sharing, archiving, and access
– E.g., ability to export to data archive in a standard format
with column headings drawn from a shared vocabulary
DataONE/DataCite Example
[Diagram: citation workflow. Only the labels survive; the arrows between components are not recoverable.]
Components:
• Research scientist
• DataONE Member Node data archive (e.g., Dryad)
• DataONE Coordinating Node metadata catalog (e.g., UNM or UCSB)
• DataCite Member (e.g., CDL)
• EZID resolver and registration service
• DOI resolver and TIB registration
• (optional) CDL-hosted EZID id minting service ("get unique id string", shown on two arrows)
Numbered steps:
1. data + metadata
2. metadata + URL + id
3. citation + URL + id
4. save full citation
5. URL plus id
6. full citation
7. full citation
A Repository of Data Underlying Journal Articles
The Goal
• Store all data underlying publications in evolutionary biology,
ecology, and related disciplines, at the time of publication.
[Figure: sample sequence text (ccaattggct gttcttcgat tctggcgagt) with the repository labels GenBank, TreeBASE, and Dryad]
Identifiers and Versioning
• Each “data package” receives a DOI, which refers to the most recent version of the file.
– doi:10.5061/dryad.20
• When repository content is modified, a version indicator will be appended to the original DOI.
– doi:10.5061/dryad.20.2
• To specify a particular file within the data package, a slash is used.
– doi:10.5061/dryad.20.2/3
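The three identifier forms on this slide can be split into their parts with a small parser. This is a minimal sketch: the pattern is inferred only from the three examples shown here, and `DryadDoi` and `parse_dryad_doi` are hypothetical names, not part of any Dryad tool.

```python
import re
from typing import NamedTuple, Optional


class DryadDoi(NamedTuple):
    package: str            # DOI of the data package, without the "doi:" prefix
    version: Optional[int]  # version indicator, if one was appended
    file: Optional[str]     # file designator after the slash, kept as a string


# Pattern inferred from the slide examples only (an assumption):
#   doi:<package>[.<version>][/<file>]
_DRYAD_DOI = re.compile(
    r"doi:(?P<package>10\.5061/dryad\.[^./]+)"
    r"(?:\.(?P<version>\d+))?"
    r"(?:/(?P<file>[^?]+))?"
)


def parse_dryad_doi(text):
    match = _DRYAD_DOI.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised Dryad DOI form: {text!r}")
    version = match.group("version")
    return DryadDoi(
        package=match.group("package"),
        version=int(version) if version is not None else None,
        file=match.group("file"),
    )


latest = parse_dryad_doi("doi:10.5061/dryad.20")
versioned = parse_dryad_doi("doi:10.5061/dryad.20.2")
one_file = parse_dryad_doi("doi:10.5061/dryad.20.2/3")
```

A DOI with no version indicator parses with `version=None`, which matches the slide's statement that the bare DOI refers to the most recent version.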
Identifiers and Versioning
• Metadata and particular formats of the files are not given “true” DOIs. They are reachable by appending a parameter to the DOI.
– doi:10.5061/dryad.20.2/3.1?urlappend=%3fformat=dc
– doi:10.5061/dryad.20.2/3.1?urlappend=%3fformat=xls
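In the two examples above, `%3f` is the percent-encoded form of `?`, so the appended text is `?format=dc` or `?format=xls`. The following sketch builds such strings; `with_format` is a hypothetical helper, and only the format values `dc` and `xls` come from the slide.

```python
from urllib.parse import quote


def with_format(doi, fmt):
    """Append a format request to a DOI, mirroring the slide's examples."""
    # quote() emits uppercase hex ("%3F"); the slide shows lowercase "%3f".
    # Percent-encoded hex digits are case-insensitive, so lowercasing here
    # only serves to reproduce the slide's spelling.
    encoded_question_mark = quote("?", safe="").lower()
    return f"{doi}?urlappend={encoded_question_mark}format={quote(fmt, safe='')}"


dublin_core = with_format("doi:10.5061/dryad.20.2/3.1", "dc")
spreadsheet = with_format("doi:10.5061/dryad.20.2/3.1", "xls")
```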
Citation
• When using data from Dryad, please cite the original article.
– Sidlauskas, B. 2007. Testing for unequal rates of morphological diversification in the
absence of a detailed phylogeny: a case study from characiform fishes. Evolution 61:
299–316.
• Additionally, please cite the Dryad data package. The citation should include the following elements:
– Author(s)
– The date on which the data was deposited
– The name of the data file, if applicable
– The title of the data package, which in Dryad is always "Data from: [Article name]"
– The name "Dryad Digital Repository"
– The data identifier
• For example:
– Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification
in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad
Digital Repository. doi:10.5061/dryad.20
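The example citation can be assembled from the listed elements with a short formatter. This is a sketch under assumptions: `format_dryad_citation` is a hypothetical helper, the element order follows the example rather than the bullet list, and the position of an optional file name is a guess because the slide's example does not include one.

```python
def format_dryad_citation(authors, year, article_title, identifier, file_name=None):
    """Build a data-package citation in the style of the slide's example."""
    # In Dryad the package title is always "Data from: [Article name]".
    parts = [f"{authors} {year}.", f"Data from: {article_title}."]
    if file_name:
        # Placement of the file name is an assumption; the slide gives no example.
        parts.append(f"{file_name}.")
    parts.append("Dryad Digital Repository.")
    parts.append(identifier)
    return " ".join(parts)


citation = format_dryad_citation(
    authors="Sidlauskas, B.",
    year=2007,
    article_title=(
        "Testing for unequal rates of morphological diversification in the "
        "absence of a detailed phylogeny: a case study from characiform fishes"
    ),
    identifier="doi:10.5061/dryad.20",
)
```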
Challenges/Questions
• Dealing with dynamic streaming data?
– How do versions enter into the identifier scheme?
• Resolving to human- or machine-interpretable description of object?
• Need for a registry of name spaces?
• Can metadata standards support multiple globally unique identifiers?