CDL
UC Curation Center
Defining the Data Citation Problem in the DataNet Context
December 2009
John Kunze, University of California Curation Center, Oakland, CA
Robert Cook, Environmental Sciences Division, Oak Ridge National Laboratory, TN
Patricia Cruse, University of California Curation Center, Oakland, CA
Carol Tenopir, School of Information Sciences, University of Tennessee, Knoxville, TN
Todd Vision, Department of Biology, University of North Carolina, Chapel Hill, NC
William Michener, University Libraries, University of New Mexico, Albuquerque, NM
Data’s shameful neglect
“Research cannot flourish if data are not
preserved and made accessible.
All concerned must act accordingly.”
10 September 2009
The scientific record is at risk
• Incompatible formats, models, semantics
• Poor preservation practice
• Dispersed sources
• Science needs this record to verify findings and test new hypotheses
• Record at risk → planet at risk
Collage: J. Callaway, USF
Data preservation is hard;
start small with data publication
The risk is complex, with social and technical
dimensions – can we start small?
• Insight: the data that drive much of the scientific journal
literature are produced in islands of practice, resulting
in unshared, incompatible datasets
• Hypothesis: establishing a system of data publishing
will promote data sharing and re-use by providing
standards and producer incentives
Publishing → Sharing → Use → Preservation
Data publishing challenges
• Datasets encompass everything
– Data plus documents, images, audio, video, etc.
– Tension between standardization and innovation
• Data is similar to software, but even more specialized
– OK to maintain in-house, but tedious to prepare for release
– Technical dependence complicates long-term maintenance
– Internal consistency requirements, plus provenance
• Some built-in instability: long-term value of some data
can depend on change, such as annotation
Data publication is hard;
start small with data citation
Published data, from the outset, will call for citations
• Need links from journal articles to data used
Hypothesis: establishing simple, easy conventions
for data citation will encourage its practice, hence
data publishing, hence data preservation
data citation → data publishing → data preservation
Data citation leads to data set
Luyssaert, S., I. Inglima and M. Jung. 2009. Global Forest Ecosystem Structure
and Function Data for Carbon Balance Research. Data set. Available on-line
[http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active
Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/949

http://dx.doi.org/10.3334/ORNLDAAC/949

http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=949

Often leads to one or more surrogates

If the data set is archived, leads to the data files
[Figure: citation target with small, smaller, and smallest surrogates]
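The first hop in the chain above, from the doi: string in a citation to a click-through URL, can be sketched in a few lines. This is only an illustration and not part of the original slides: the function name doi_to_url and the DOI-shape check are assumptions, and the resolver prefix is simply the one shown in the example URL.

```python
import re
from urllib.parse import quote

# Heuristic shape check (an assumption, not a full DOI validator):
# a "10." prefix, a registrant code, a slash, then a non-empty suffix.
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str, resolver: str = "http://dx.doi.org/") -> str:
    """Turn a citation-style DOI such as 'doi:10.3334/ORNLDAAC/949'
    into an actionable URL by prepending a resolver address."""
    bare = doi.strip()
    if bare.lower().startswith("doi:"):
        bare = bare[len("doi:"):].strip()
    if not _DOI_SHAPE.match(bare):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode anything unusual in the suffix, but keep the slashes.
    return resolver + quote(bare, safe="/")


print(doi_to_url("doi:10.3334/ORNLDAAC/949"))
# prints http://dx.doi.org/10.3334/ORNLDAAC/949
```

The resolver then redirects to whatever landing page the archive has registered, which is why the citation stays valid even if the archive's own URL changes.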
Data citation examples
World Data Center for Paleoclimatology Data (NOAA)
Anderson, D.W., W.L. Prell, and N.J. Barratt. 1989. Estimates of sea
surface temperature in the Coral Sea at the last glacial maximum.
Paleoceanography 4(6):615-627. Data archived at the World Data
Center for Paleoclimatology, Boulder, Colorado, USA.
(callout: no identifier)
Publishing Network for Geoscientific & Environmental Data in Germany
Nishioka, J et al. (2008): Profiles of iron concentration from GoFlow bottles
during the CARUSO-EISENEX experiment,
doi:10.1594/PANGAEA.701305, Supplement to: Nishioka, Jun; Takeda,
Shigenobu; de Baar, Hein JW; Croot, Peter L; Boyé, Marie; Laan,
Patrick; Timmermans, Klaas R (2005): Changes in the concentration of
iron in different size fractions during an iron enrichment experiment in
the open Southern Ocean, Marine Chemistry, 95(1-2), 51-63,
doi:10.1016/j.marchem.2004.06.040
(callout: 2 identifiers: 1 for publication, 1 for data)
More data citation examples
ICPSR
Kessler, Ronald C. National Comorbidity Survey: Baseline (NCS-1), 1990-1992
(Restricted Version) [Computer file]. ICPSR25381-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2009-05-11. doi:10.3886/ICPSR25381
(callout: archival data center?)
Economic Modeling
Figure 3. Change of relative agricultural producer prices since 1998. Middle-income CIS show average for Russia, Kazakhstan, and Ukraine. ….
Source: OECD, 2004 and CIS Statistics, 2003.
(callout: 2 organizations listed, but which of their 100s of datasets were used?)
Contrasting citation styles
Some commonalities (who, when, where), but
• Prose is interspersed with metadata elements
• A standard citation format/recipe would be easier to read
• Not every citation had an actionable identifier
• Name of dataset and data subset used (what) unclear
• Archival commitment unclear
• Date of publication vs date of collection unclear
• One citation contained another citation (for publication)
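One way to picture a standard citation "recipe" is to keep the who, when, what, and where as separate metadata fields and render them in a fixed order. The sketch below is hypothetical: the DataCitation class, its field names, and the output layout are assumptions made for illustration, not a format proposed in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    creators: str                      # who
    year: int                          # when (publication year, not collection date)
    title: str                         # what (name of the dataset)
    publisher: str                     # where (archive or data center)
    identifier: Optional[str] = None   # actionable identifier, if one exists


def format_citation(c: DataCitation) -> str:
    """Render the fields in one fixed order, with no interspersed prose."""
    ident = c.identifier if c.identifier else "[no identifier]"
    return f"{c.creators} ({c.year}): {c.title}. {c.publisher}. {ident}"


example = DataCitation(
    creators="Luyssaert, S., I. Inglima and M. Jung",
    year=2009,
    title="Global Forest Ecosystem Structure and Function Data "
          "for Carbon Balance Research",
    publisher="Oak Ridge National Laboratory Distributed Active Archive Center",
    identifier="doi:10.3334/ORNLDAAC/949",
)
print(format_citation(example))
```

Because a missing identifier is rendered explicitly, a citation without one is visible at a glance instead of being buried in prose.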
What we want from data citation
• Precise identification of dataset
– At level of version, file, table, etc., or groups thereof
– So that readers can find and understand the data
• Credit to data producers and data publishers
– Vital incentive for data sharing and archiving
• A link from the traditional literature to the data
– Gives intellectual legitimacy to creation of data sets
• Research metrics for datasets
– Sponsors want publication and retention numbers
Starter data citation wish list
– Any dataset, database, data file
– All levels of granularity (table, row, cell)
– For any snapshot (version, e.g., in time)
– Any formatted view: XML, HTML, CSV, etc.
– With and without annotations
– Links to older, newer, and latest versions
– Actionability (“Click-through”)
– Persistence (validity into the future)
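To make the wish list concrete, here is a minimal sketch of what a reference to a dataset granule and its version chain might look like as a data structure. Everything in it is an assumption for illustration: the GranuleRef class, the "@", "#", and "?view=" separators, and the version labels are hypothetical and do not correspond to any existing standard.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GranuleRef:
    dataset_id: str                 # any dataset, database, or data file
    version: Optional[str] = None   # snapshot, e.g. "v1" (hypothetical label)
    granule: Optional[str] = None   # table, row, or cell, e.g. "table=sites"
    view: Optional[str] = None      # formatted view, e.g. "csv"

    def to_string(self) -> str:
        """Serialize using made-up separators; a real scheme would differ."""
        s = self.dataset_id
        if self.version:
            s += f"@{self.version}"
        if self.granule:
            s += f"#{self.granule}"
        if self.view:
            s += f"?view={self.view}"
        return s


def neighbours(chain: List[str], version: str) -> Tuple[Optional[str], Optional[str], str]:
    """Given an oldest-to-newest version chain, return the
    (older, newer, latest) versions relative to `version`."""
    i = chain.index(version)
    older = chain[i - 1] if i > 0 else None
    newer = chain[i + 1] if i + 1 < len(chain) else None
    return older, newer, chain[-1]


ref = GranuleRef("doi:10.3334/ORNLDAAC/949", version="v1",
                 granule="table=sites", view="csv")
print(ref.to_string())
print(neighbours(["v1", "v2", "v3"], "v2"))
```

Actionability and persistence are not properties of this structure itself; they depend on a resolver and an archive standing behind dataset_id.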
Datasets and documents have much in common
                         | Data                                  | Document
Systematically organized | yes                                   | yes
Hierarchical             | yes                                   | yes
Machine readable         | Yes, with metadata for semantics help | Sort of, with schema structure (TEI, CCS docs)
Data citation wish list possibilities
We want it all, but might settle for initial partial solutions
• All datasets?
– Well, maybe just archived datasets*
• All levels of granularity? For any snapshot? All views?
– Publisher-defined granules, versions, and views*
• Plus older/newer version, and latest version?
– Surrogate-based pointer to extant version chain*
• With and without annotations?
– Annotation as publication*
• What about actionability and persistence?
– Yes and yes*
(* Standards and archives needed for all)
Initiatives and outfits to watch
• DataCite initiative: to encourage data publishing via
global data citation support: standards, persistent
reference to datasets in regional archives
• Supplemental materials publishing standards for
data, surrogates, and extended descriptions and
methods, e.g., technical data application appendices
• Publishers: increased volume of submission
• Community standards (so many to choose from!):
ORNL DAAC, Pangaea, GCMD, ESIP, GBIF, TDWG,
OECD, NISO/NFAIS, IPYDIS, Dataverse, etc.
Data citation summary
Data citation helps publication and sharing, which helps
preservation and re-use, which saves the planet
• Gives credit to data producers and data publishers
– Vital incentive for data sharing and archiving
• Provides a link from traditional literature to data
– Gives intellectual legitimacy to creation of data
• Research metrics for datasets
– Sponsors want publication and retention numbers
• Need recipes and stuff, i.e., standards and archives