CLADDIER Citation, Location, and* Deposit in Discipline

CLADDIER
Citation, Location, and* Deposit in
Discipline and Institutional Repositories
Bryan Lawrence
(obviously et al.)
*Annotation
CLADDIER workshop, Chilworth, Southampton, UK 15th May 2007
Outline
“Full and open access to scientific data must also be ensured.
Archiving of, and open access to, data will be a major challenge.”
[Statement to the Second Earth Observation Summit, Tokyo, 25 April 2004, by Prof. Thomas
Rosswall, Executive Director, on behalf of the International Council for Science (ICSU)]
• Data Publication, Why, Why Now?
• The CLADDIER use case
– The story
– The consequences
• The Future
Data Publication – Why 1?
• Data provides evidence that supports, vindicates
or disproves scientific theory.
• Data underpins everything.
• We teach school children to record all
experimental results, but in most scientific
disciplines we discard those records after “the
result” is published.
– Even then “the result” is actually “the interpretation”
and “the raw result” is often left to lie fallow and to be
forgotten.
• The (raw and processed) data should be as much a
part of the scientific record as the conclusions.
Data Publication – Why 2?
• In most sciences, data production is expensive,
and interpretation is cheap!
• It is a rare scientist who squeezes all the scientific
fruit from their data, and it is a rare science that
doesn’t benefit from data aggregation.
• “One person’s noise is another person’s signal”
(anyone who can give me a reliable source of the original quote can have a free beer).
BUT: It’s one thing to make data available; it’s
another thing to make it available with quality
control, provenance, and sufficient detail for it to
be used without reference to the original author …
Data Publication – Why now?
Because we can! The technology (in particular the
software) is up to it.
– We have the machinery to describe data adequately.
• AI may not have delivered clever robots yet, but it has delivered much of
what we need for data publication!
– We have the machinery to find it.
– We have the machinery to display it.
Because we should! The chain from data production to
“traditional” publication is now so long that many
good scientists never get to publish “traditional”
papers.
– We need to recognise the excellence of “data scientists” within academia
using metrics understood by their employers (publication and citation).
– Like complex mathematics, complex data interpretation needs to be
repeatable, which means the sources need to be available.
The CLADDIER use case, part 1
Joanna, at the University of Southampton, has done some work on the
biology of seawater at a location off the coast of Cornwall. As part of
her analysis she needs to acquire (from a number of locations):
– Publications and data describing prior or similar work.
– Oceanic profiles of salinity and temperature from the closest cruise in time and space.
– Meteorological data to accompany both her own sampling and the oceanic data.
– Remotely sensed ocean colour imagery (to add additional information on the biota).
When her analysis is complete, she will publish a paper that cites the above datasets and
lodge the paper in her own institutional repository. She will also deposit her datasets in
one or more appropriate data repositories (probably in her case, both the National
Oceanography Centre, Southampton data archive, and the British Oceanographic Data
Centre, BODC).
Ideally, in the process of doing this, the archives holding the datasets and publications she
cites would be notified that a paper citing them had been submitted, and the metadata
associated with those records would be updated to reflect the citations. The metadata in
the publication repository should also link to the data in the data archives and vice versa.
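The notification step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` and `Repository` classes, the `notify_citation` method and all identifiers are hypothetical, and the archives are in-memory stand-ins, not the interfaces of any real repository.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A repository record (paper or dataset) with citation metadata."""
    identifier: str
    title: str
    cites: set = field(default_factory=set)     # identifiers this record cites
    cited_by: set = field(default_factory=set)  # identifiers that cite this record


class Repository:
    """Hypothetical in-memory stand-in for a publication or data archive."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def deposit(self, record):
        self.records[record.identifier] = record

    def notify_citation(self, cited_id, citing_id):
        """Note that citing_id cites cited_id; return True if we hold cited_id."""
        record = self.records.get(cited_id)
        if record is None:
            return False
        record.cited_by.add(citing_id)
        return True


def deposit_with_notification(paper, home_repo, archives):
    """Deposit a paper, then notify every archive that holds a cited item."""
    home_repo.deposit(paper)
    notified = {}
    for cited_id in sorted(paper.cites):
        notified[cited_id] = [archive.name for archive in archives
                              if archive.notify_citation(cited_id, paper.identifier)]
    return notified


# Illustrative identifiers only; none come from a real archive.
bodc = Repository("BODC")
bodc.deposit(Record("data:ctd-profiles", "Salinity and temperature profiles"))
soton = Repository("Southampton IR")
paper = Record("paper:joanna-2007", "Seawater biology off Cornwall",
               cites={"data:ctd-profiles", "data:unknown"})
result = deposit_with_notification(paper, soton, [bodc])
```

The paper keeps its forward links in `cites` and each cited dataset gains a backward link in `cited_by`, which is the "and vice versa" of the use case. A cited item that no archive holds (here `data:unknown`) simply gets an empty notification list, so a persistence or discovery problem shows up rather than being hidden.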
The CLADDIER use case, part 2
It turns out that the work Joanna has done is of significant interest in calibrating a
global earth system model where one might need to compare simulations of
oceanic carbon dioxide production with the scenarios used in the model.
Fred, at Reading University, needs to be able to find Joanna’s paper and data
either via citations or directly from publication repositories.
Having found the paper, the data should be obtainable via the citation and the data
archive.
As part of his work, and before he uses Joanna’s data, he is likely to check back
through the other datasets used and cited as inputs to it, because he suspects
Joanna’s work could be recalibrated by using later, better-quality
meteorological re-analyses.
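Checking back through cited inputs amounts to walking the citation links transitively. A minimal sketch, assuming the citations are available as a simple mapping from an item to the items it cites (the function name and all identifiers below are made up for illustration):

```python
from collections import deque


def upstream_inputs(start, cites):
    """Return every item reachable from `start` by following citations.

    `cites` maps an identifier to the identifiers it cites. Items come back
    in breadth-first order, each listed once even if cited along several paths.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for cited in cites.get(current, []):
            if cited not in seen:
                seen.add(cited)
                order.append(cited)
                queue.append(cited)
    return order


# Illustrative identifiers only; none come from a real archive.
cites = {
    "joanna-data": ["ctd-profiles", "met-reanalysis", "ocean-colour"],
    "ctd-profiles": ["met-reanalysis"],
    "ocean-colour": [],
}
inputs = upstream_inputs("joanna-data", cites)
```

With such a list in hand, Fred could see that the meteorological re-analysis feeds into Joanna’s data both directly and via the oceanic profiles, which is the kind of dependency he would need to know about before recalibrating.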
Meanwhile, Joanna and all the dataset authors will be pleased that the citation of
not only the publication, but also the datasets, will be reflected in the 2012
RAE.
Requirements
1. Location and acquisition of both papers and data. Implies we need a
“discovery engine” (more than Google!).
2. Creation of personal metadata (out of scope).
3. Citation mechanism. How do we cite data? (What does a citation look
like, what exists at the citation target?)
4. What does publishing data mean? What would a referee do to referee
data?
5. How do we deal with persistence of citations? Our expectation is that a
citation should exist in perpetuity.
6. Linking mechanisms between data and publication repositories.
7. Support for annotation.
8. Support for metrics.
The Future, Part 1
There are a number of “data publication”
initiatives under way:
1. Some are represented here, some are not.
2. Two key absentees are:
   1. The Earth System Atlas
      http://www.lehigh.edu/~inesa/
      (initial funding from NSF, still immature, but concentrating thus far
      on refereeing procedures)
   2. “Publication and Citation of Scientific Primary
      Data” http://www.std-doi.de
      (initially funded by the German Research Foundation (DFG), relatively mature,
      delivering persistence via reliable repositories and DOIs, but issues
      of citation and refereeing not fully resolved).
The Future, Part 2
• OJMS: Overlay Journal for Meteorological
Science (or something similar).
– New JISC-funded project (NCAS with the Royal Met Soc) to deliver a
new journal prototype. Success will depend on:
• Availability and quality of data, i.e. on the technology and on the
sociology of the review process.
• Interaction between the “traditional” journal world and the data
publication world.
• Multiple projects a good thing!
– Data Publication is an idea whose time has come!
• Crucial to get critical mass (across projects) on
– Acceptable methods of citing data