Transcript Slide 1

The Geoscience Data Journal: collaboration between data repositories
and publishing houses in data publishing
Fiona Murphy (1), Sarah Callaghan (2), Paul Hardaker (3), Rob Allan (4)
(1) Wiley-Blackwell, Earth Science Journals, Chichester, UK ([email protected]), (2) British Atmospheric Data Centre, STFC, UK ([email protected]), (3) Royal Meteorological Society, Reading,
UK ([email protected]), (4) Met Office, Hadley Centre, Exeter, UK ([email protected])
Is the current situation really so bad?
“the amount of data generated worldwide...is growing by 58% per year; in
2010 the world generated 1250 billion gigabytes of data”
The Digital Universe Decade – Are You Ready?
IDC White Paper, May 2010
Research budgets are under strain as never before – whilst at the same time researchers are
able to produce data at an unprecedented rate. This gives rise to difficult questions of what to
produce, what to throw away, formatting, interoperability, long-term curation, security, and a host
of other key issues. In short, how can we be aware of, locate, use and register what we know,
and thereby ensure its most effective use and re-use? And how do we fit all of this into a
sustainable research ecosystem?
In recognition of this situation, and in response to the general economic squeeze on science
budgets, funding bodies (for example the National Science Foundation and the UK’s Natural
Environment Research Council) are increasingly trying to encourage good data management
practices by requiring data management plans (DMPs). At present, DMPs have not yet been fully
integrated into the research lifecycle, so far as many researchers are concerned. Researchers
may view DMPs with suspicion, either as a potential waste of effort or as a means of compelling
them to share research outputs before they are able to fully exploit those outputs themselves. In short,
the benefits of good data management are not clearly mapped out to everyone’s satisfaction. To
its credit, NSF in particular is engaging in dialogue with other stakeholders on this issue through
outreach at relevant scientific events, an interactive website and so forth.
However, this is not the only problem faced by the current research ecosystem. Crucially,
technical advances have generally not been matched by the funding structures and research
infrastructures needed to exploit them fully. As yet, best practice in data output and management
does not bring the sort of scientific reward needed to encourage compliance. Other key
barriers to progress are silos in terms of geography (where people are or where data are
collected) and also within disciplines (hydrologists versus geologists versus climatologists). The
status of data science itself – and of data scientists – also needs re-examining.
The current situation can be summarised as this...:
Which we want to change into this:
Benefits to the community
It is becoming increasingly important to the scientific and wider non-academic communities that
the data that underpins key scientific results should be made available to allow for the testing
and confirmation of those results. Historically, publishing data has been so difficult as to be
prohibitive, and in those cases where it has been possible, the raw data has had to be converted to
other formats; for example, instead of raw numbers being published in a (lengthy) table, it has
been converted to a graph.
How do we coordinate the peer review of data and data papers?
As scientists’ ability to create and collect new data has been growing, so too has our ability to
store it. A dataset can be stored on any digital medium that is convenient, but future-proofing the
data so that it is readable and understandable in 20 years’ time remains a time-consuming and
difficult job. Yet, if the results drawn from that data are to stand up to scrutiny in the future, the
data must be curated and archived properly.
Openly sharing data is often proposed as a method for ensuring that data underpinning the
scientific record is preserved. However, sharing data in an unstructured way often results in
the provenance of the dataset (and sometimes the dataset itself) being changed as it
passes from one “owner” to another, thereby reducing the chances of using that data to test the
reproducibility of the results originally drawn from it.
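One practical safeguard against such silent changes is to record a fixity checksum when a dataset is deposited, so that any later copy can be verified byte-for-byte against the original. The sketch below is a minimal illustration in Python; the function name and the choice of SHA-256 are assumptions for the example, not part of any particular repository's workflow:

```python
import hashlib


def dataset_checksum(path: str, algorithm: str = "sha256",
                     chunk_size: int = 1 << 20) -> str:
    """Compute a hex digest of a dataset file, reading in chunks
    so that large files do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Recording the digest at publication time lets any later holder of a
# copy detect modification: matching digests mean the bytes are unchanged.
```

A repository would typically publish the digest alongside the dataset's landing page, so that reusers can check their downloaded copy before building new results on it.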
Also, the present mechanism for academic recognition revolves around the production and
publication of peer-reviewed papers. The production of high-quality datasets takes time and
effort, and is often insufficiently recognised as an activity worthy of prestige, even though the
papers that result from that dataset may be considered of significant scientific importance.
Simple sharing of data is unlikely to provide the data creators with the academic recognition they
deserve. A process of data publication, involving peer review of datasets, would be of benefit to
many sectors of the academic community.
Aims of the project
It is for these reasons that a partnership has been developed between the British Atmospheric
Data Centre, the Royal Meteorological Society and the academic publishers Wiley-Blackwell, in
order to develop a mechanism for the formal publication of data in the (soon to be launched)
Geoscience Data Journal. This journal builds on the work funded by JISC in the OJIMS (Overlay
Journal Infrastructure for Meteorological Sciences) project, and parallels work done by the
NERC Science Information Strategy Data Citation and Publication project team, which brings all
the NERC environmental data centres together.
The aim of the Geoscience Data Journal is to provide a platform where scientific data can be
formally published, in a way that includes scientific peer-review. This will provide the dataset
creator with full credit for their efforts, while also improving the scientific record, and allowing
major datasets to be fully described, cited and discovered.
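Formal publication also makes a dataset citable through a persistent identifier such as a DataCite DOI. As a rough illustration, the sketch below assembles a citation string along the lines of the DataCite recommended form (Creator (PublicationYear): Title. Version. Publisher. Identifier); the function and all metadata values are hypothetical, not an actual journal record:

```python
def format_data_citation(creators, year, title, version, publisher, doi):
    """Assemble a dataset citation string roughly following the
    DataCite recommended form:
    Creator (PublicationYear): Title. Version. Publisher. Identifier."""
    authors = "; ".join(creators)
    return (f"{authors} ({year}): {title}. Version {version}. "
            f"{publisher}. https://doi.org/{doi}")

# Hypothetical example metadata:
citation = format_data_citation(
    creators=["Smith, A.", "Jones, B."],
    year=2012,
    title="Surface temperature observations, UK network",
    version="1.0",
    publisher="British Atmospheric Data Centre",
    doi="10.5285/EXAMPLE-DOI",
)
```

Because the DOI resolves to a landing page maintained by the repository, such a citation remains actionable even if the dataset's storage location changes.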
Figures adapted from ‘Opportunities for Data Exchange’ Project, Alliance for Permanent Access http://www.alliancepermanentaccess.org/
The Workflow of Data Publication
Pressure points
• Workflow and landing pages; communication and education among authors, editors, publishers and data centres
• Processes shown in orange boxes are the subject of continual, iterative development driven by
stakeholders and incorporating best practices as they emerge.
Incentives
• Publishing a dataset in a data journal will provide academic credit to data scientists, without
diverting effort from their primary work of ensuring data quality.
• Funders want to get the best possible science for their money. Running measurement campaigns
is expensive, so the more reuse that can be derived from a dataset, the better.
• Publication in a data journal ensures that the dataset is uploaded to a trusted repository where it
will be backed up, archived and curated, and so won’t be vulnerable to bit-rot or to being lost or
stored on obsolete media. The peer-review process also reassures the funder that the published
dataset is of good quality and that the experiment was carried out appropriately.
• Data journals will be a good starting point for researchers outside the immediate field to find out
what sort of data is available and how to access it. This will encourage inter-disciplinary
collaboration, and open up the user base not only for the datasets, but also for the data journal
and the underlying repositories.
• The availability of published datasets will make it easier to validate conclusions through the
reanalysis of those datasets. Data publication will help show transparency in the scientific
process, improving public accountability.
• Opportunities to form partnerships with other organisations that share the goal of data publication,
in order to exploit common activities and achieve wider community buy-in; for example, the
CODATA-ICSTI Task Group on Data Citation Standards and Practices, DataCite and others.