Data Citation in the Earth and Physical Sciences Sarah Callaghan [[email protected]] with thanks and acknowledgement to a lot of other people! Developing Data Attribution.

Data Citation in the Earth and
Physical Sciences
Sarah Callaghan [[email protected]]
with thanks and acknowledgement to a lot of other people!
Developing Data Attribution and Citation Practices and Standards
An International Symposium and Workshop
August 22-23, 2011
VO Sandpit, November 2009
Who are we? And why do we care?
... And what do we know about data?
We’re one of the NERC data centres
Some BADC numbers for context
Dataset: A collection of files sharing some
administrative and/or project heritage.
BADC has approximately 150 real datasets
(and thousands of virtual datasets).
BADC has approx 200 million files
containing thousands of measured or
simulated parameters.
BADC tries to deploy information systems
that describe those data, parameters,
projects and files, along with services
that allow one to manipulate them …
Calendar year 2010: 2800 active users (of
12000 registered), downloaded 64 TB
data in 16 million files from 165
datasets.
Less than half of the BADC data
consumers are “atmospheric science”
users!
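The 2010 usage figures above imply a few derived numbers that are easy to check. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes); the variable names are illustrative only and do not come from any BADC system.

```python
# Figures quoted on the slide for calendar year 2010.
active_users = 2800
registered_users = 12000
downloaded_tb = 64
downloaded_files = 16_000_000

# Assumption: decimal units, 1 TB = 10**12 bytes and 1 MB = 10**6 bytes.
BYTES_PER_TB = 10**12
BYTES_PER_MB = 10**6

# Share of registered users who downloaded anything during the year.
active_fraction = active_users / registered_users

# Mean size of a downloaded file, in megabytes.
mean_file_size_mb = downloaded_tb * BYTES_PER_TB / downloaded_files / BYTES_PER_MB

# Mean download volume per active user, in gigabytes.
mean_gb_per_active_user = downloaded_tb * 1000 / active_users

print(f"Active share of registered users: {active_fraction:.1%}")
print(f"Mean size of a downloaded file: {mean_file_size_mb:.1f} MB")
print(f"Mean volume per active user: {mean_gb_per_active_user:.1f} GB")
```

These are averages only; the slide gives no information about how downloads are distributed across users or datasets.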
What does data mean to us?
Data can be anything from:
• A measurement taken at a single place and time (e.g.
water sample, crystal structure, particle collision)
• Measurements taken at a point over a period of time
(e.g. rain gauge measurements, temperature)
• Measurements taken across an area at multiple times
by a static instrument (e.g. meteorological radar,
satellite radiometer measurements)
• Measurements taken over an area and a time by a
moving instrument (e.g. ocean traces, air quality
measurements taken during an airplane flight,
biodiversity measurements)
• Results from computer models (e.g. climate models,
ocean circulation models)
• Video and images (e.g. cloud camera images, photos
and video from flood events, wildlife camera traps)
• Physical samples (e.g. rock cores, tree ring samples,
ice cores)
Suber cells and mimosa leaves. Robert
Hooke, Micrographia, 1665
Case Study: CMIP5
CMIP5: Fifth Coupled Model
Intercomparison Project
• Global community activity under the
auspices of the World Meteorological
Organisation (WMO) via the World
Climate Research Programme (WCRP)
• Aim:
– to address outstanding scientific
questions that arose as part of
the AR4 process,
– improve understanding of
climate, and
– to provide estimates of future
climate change that will be useful
to those considering its possible
consequences.
Method: standard set of model
simulations in order to:
• evaluate how realistic the models are
in simulating the recent past,
• provide projections of future climate
change on two time scales, near term
(out to about 2035) and long term (out to
2100 and beyond), and
• understand some of the factors
responsible for differences in model
projections, including quantifying some
key feedbacks such as those involving
clouds and the carbon cycle
IPCC Assessment Reports: FAR (1990), SAR (1995), TAR (2001),
AR4 (2007), AR5 (2013)
CMIP5 numbers
Simulations:
~90,000 years
~60 experiments
~20 modelling centres (from around
the world) using
~30 major(*) model configurations
~2 million output “atomic” datasets
~10s of petabytes of output
~2 petabytes of CMIP5 requested
output
~1 petabyte of CMIP5 “replicated”
output
Which will be replicated at a number
of sites (including ours).
Of the replicants:
~ 220 TB decadal
~ 540 TB long term
~ 220 TB atmosphere-only
~80 TB of 3-hourly data
~215 TB of ocean 3D monthly data!
~250 TB for the cloud feedbacks!
~10 TB of land-biochemistry (from
the long term experiments alone).
(May 2011: All these data output
volumes probably a factor of two too
low!)
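As a quick consistency check on the volumes above, here is a minimal sketch. It assumes that the decadal, long-term and atmosphere-only figures are disjoint groups that together make up the replicated output, and that the remaining figures (3-hourly, ocean, cloud feedbacks, land biochemistry) are cross-cutting slices of those groups. The slide does not state this, so treat it as an interpretation.

```python
# Replicated output by experiment group, in terabytes (figures from the slide).
# Assumption: these three groups do not overlap.
replicated_tb = {
    "decadal": 220,
    "long term": 540,
    "atmosphere-only": 220,
}

total_tb = sum(replicated_tb.values())
total_pb = total_tb / 1000  # decimal units: 1 PB = 1000 TB

# The May 2011 note suggests all volumes may be a factor of two too low.
revised_total_pb = 2 * total_pb

print(f"Sum of the three groups: {total_tb} TB (about {total_pb:.2f} PB)")
print(f"With the factor-of-two revision: about {revised_total_pb:.2f} PB")
```

Under that assumption the three groups sum to roughly the ~1 petabyte quoted for the replicated output.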
CMIP5 and Data Citation
CMIP5 will produce a lot of data! It’s an international effort, with everyone involved
wanting to ensure proper citation, attribution and location of the data produced.
From http://cmip-pcmdi.llnl.gov/cmip5/citation.html?submenuheader=3 :
“Digital Object Identifiers will be assigned to various subsets of the CMIP5
multi-model dataset and, when available and as appropriate, users should cite
these references in their publications. These DOI’s will provide a traceable record
of the analyzed model data, as tangible evidence of their scientific value.
Instructions will be forthcoming on how to cite the data using DOI’s.”
There are also plans to work with journal publishers to publish data papers about
various key model runs and ensembles (more about data publication later!)
Earth Sciences: BADC
It is possible to
reference our
datasets using a
specific citation given
on the main dataset
information page.
We’re currently
working on assigning
DOIs to certain
datasets which meet
our technical quality
standards.
Earth Sciences: Pangaea
Physics and Life Science: ISIS
The ISIS pulsed neutron and muon source produces
beams of neutrons and muons that allow
scientists to study materials at the atomic level
using a suite of instruments, often described as
‘super-microscopes’. It supports a national and
international community of more than 2000
scientists who use neutrons and muons for
research in physics, chemistry, materials science,
geology, engineering and biology.
ISIS is now issuing DOIs for experiment data to allow
easy citation. Principal Investigators will be sent
DOIs shortly before their experiment is due to
start.
DOIs issued by ISIS are in the form:
10.5286/ISIS.E.1234567
The recommended format for citation is:
Author, A N. et al; (2010): RB123456, STFC ISIS
Facility, doi:10.5286/ISIS.E.1234567
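The DOI form and citation format above can be sketched in code. This is an illustration only: the regular expression is inferred from the single example DOI on the slide (ISIS may allow other forms), and the function names are hypothetical rather than part of any ISIS tooling. The resolver URL uses the public doi.org proxy.

```python
import re

# Assumption: pattern inferred from the one example shown on the slide,
# i.e. the 10.5286 prefix followed by "ISIS.E." and a run of digits.
ISIS_DOI_PATTERN = re.compile(r"10\.5286/ISIS\.E\.\d+")


def is_isis_experiment_doi(doi: str) -> bool:
    """Return True if the DOI matches the form shown on the slide."""
    return ISIS_DOI_PATTERN.fullmatch(doi) is not None


def format_isis_citation(first_author: str, year: int, rb_number: str, doi: str) -> str:
    """Build a citation string in the recommended format from the slide."""
    if not is_isis_experiment_doi(doi):
        raise ValueError(f"Not an ISIS experiment DOI: {doi!r}")
    return f"{first_author} et al; ({year}): {rb_number}, STFC ISIS Facility, doi:{doi}"


def doi_resolver_url(doi: str) -> str:
    """Return the URL at which the DOI proxy resolves the identifier."""
    return f"https://doi.org/{doi}"


citation = format_isis_citation(
    "Author, A N.", 2010, "RB123456", "10.5286/ISIS.E.1234567"
)
print(citation)
print(doi_resolver_url("10.5286/ISIS.E.1234567"))
```

The placeholder values are the ones used on the slide, not a real experiment record.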
Identifying materials for hydrogen storage
Chemistry: PubChem
Astronomy: Seamless Astronomy and
Dataverse
The Seamless Astronomy Group at the Harvard-Smithsonian
Center for Astrophysics brings together astronomers,
computer scientists, information scientists, librarians and
visualization experts involved in the development of tools and
systems to study and enable the next generation of online
astronomical research.
They are evaluating the Dataverse, an open data archive
hosted by Harvard University and managed by the Institute for
Quantitative Social Science (IQSS), as a project-based
repository for the storage, access, and citation of reduced
astronomical data.
http://projects.iq.harvard.edu/seamlessastronomy/
Dataverse data citation standard:
• offers proper recognition to authors
• permanent identification through the use of
global, persistent identifiers in place of URLs,
• uses universal numerical fingerprints (UNFs)
to guarantee that future researchers will be
able to verify that data retrieved is identical to
that used in a publication decades earlier,
even if it has changed storage media,
operating systems, hardware, and statistical
program format.
Following is an authentic example of a
replication data-set citation (from International
Studies Quarterly, King and Zeng, 2007,
p.209):
Gary King; Langche Zeng, 2006,
"Replication Data Set for 'When Can
History be Our Guide? The Pitfalls of
Counterfactual Inference'"
hdl:1902.1/DXRXCFAWPK
UNF:3:DaYlT6QSX9r0D50ye+tXpA==
Murray Research Archive [distributor]
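The idea behind a fingerprint such as the UNF in the citation above is that it is computed from the data values in a normalised form, not from the bytes of a particular file, so it survives changes of storage format. The sketch below illustrates that idea only. It is NOT the UNF algorithm: the canonical form, the choice of SHA-256, the truncation and the function names are all simplifying assumptions made for this example.

```python
import base64
import hashlib


def canonical(value, digits=7):
    """Render a value as format-independent text (floats to fixed significant digits)."""
    if isinstance(value, float):
        return f"{value:.{digits - 1}e}"
    return str(value)


def content_fingerprint(column, digits=7):
    """Hash the canonical form of each value; a stand-in for a real UNF."""
    hasher = hashlib.sha256()
    for value in column:
        hasher.update(canonical(value, digits).encode("utf-8"))
        hasher.update(b"\n")  # separator so adjacent values cannot run together
    # Truncate and base64-encode to get a short printable fingerprint.
    return base64.b64encode(hasher.digest()[:16]).decode("ascii")


# The same values, as first published and as re-read years later from a
# text file that happens to format the numbers differently.
original = [1.5, 2.25, 3.0]
reloaded = [float(text) for text in ["1.50", "2.250", "3"]]
altered = [1.5, 2.25, 3.1]

print(content_fingerprint(original) == content_fingerprint(reloaded))
print(content_fingerprint(original) == content_fingerprint(altered))
```

A citation that carries such a fingerprint lets a later reader confirm that the data they retrieved match the data used in the publication, whatever file format they arrive in.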
(Scientific) Communication through
the ages
Science, as a process, requires the
exchange of information and ideas.
We can make this exchange face-to-face
(conferences, meetings, seminars) or
through another medium (text, video,
images), or both.
No matter what method we use, we wind
up telling each other stories about
what we’ve discovered.
http://www.intoon.com/#68559
Technology has given us new tools, but
it’s also provided new challenges
The Data Deluge
Journals can’t now communicate
everything we need to know about
a scientific event
- whether that’s an observation,
simulation, development of a
theory, or any combination of
these.
“the amount of data generated
worldwide...is growing by 58% per year; in
2010 the world generated 1250 billion
gigabytes of data”
The Digital Universe Decade – Are You
Ready?
IDC White Paper, May 2010
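To put the quoted figures in perspective, here is a small worked calculation. It assumes decimal units and, purely for illustration, that the 58% annual growth rate stays constant; the white paper quote does not claim this, so the projection is an extrapolation and not a sourced figure.

```python
import math

# Figures from the quote above.
annual_growth = 0.58              # 58% per year
volume_2010_gb = 1250 * 10**9     # 1250 billion gigabytes

# Convert to zettabytes (assumption: 1 GB = 10**9 bytes, 1 ZB = 10**21 bytes).
volume_2010_zb = volume_2010_gb * 10**9 / 10**21

# Time for the volume to double at a constant 58% annual growth rate.
doubling_time_years = math.log(2) / math.log(1 + annual_growth)


def projected_volume_zb(year):
    """Extrapolate the 2010 volume forward, assuming the rate never changes."""
    return volume_2010_zb * (1 + annual_growth) ** (year - 2010)


print(f"2010 volume: {volume_2010_zb:.2f} ZB")
print(f"Doubling time: {doubling_time_years:.2f} years")
print(f"Extrapolated 2015 volume: {projected_volume_zb(2015):.1f} ZB")
```

At this rate the volume doubles roughly every year and a half, which is the sense in which journals alone cannot keep up.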
Data always has been the foundation
of scientific progress – without it,
we can’t test any of our assertions.
Previously data was hard to capture,
but could be (relatively) easily
published in image or table format
We need to publish data – but how?
Serving, Citing and Publishing Data
Citation forms an important part
of the scientific record.
We draw a clear distinction
between:
publishing = making available for
consumption (e.g. on the
web), and
Publishing = publishing after
some formal process which
adds value for the consumer:
• e.g. PLoS ONE type review, or
• EGU journal type public
review, or
• More traditional peer review.
AND
• provides commitment to
persistence
2. Publication of data sets
This involves the peer review of data sets, and gives the
“stamp of approval” associated with traditional journal
publications. It can’t be done without effective
linking/citing of the data sets.
1. Data set citation
This is our first step for this project: formulate and
formalise a way of citing data sets. It will provide
benefits to our users, and a carrot to get them to provide
data to us!
0. Serving of data sets
This is what data centres do as our day job: take in data
supplied by scientists and make it available to other
interested parties.
Final remarks
• There is obviously a need for data
citation, not only for scientists, but
also to provide traceability and
accountability for the general public
(cf. issues surrounding Climategate)
• There is serious pressure in the Earth
and climate sciences to publish data
• but there is also a need to ensure
proper accreditation
• How we communicate scientific
findings is changing – data citation is
a big part of that.
http://www.keepcalm-omatic.co.uk/default.aspx#createposter