Tracking the impact of data – how?
Sarah Callaghan
[email protected]
@sorcha_ni
1st Altmetrics conference, London, 25-26 September 2014
Who are we and why do we care about data?
The UK’s Natural Environment Research Council (NERC) funds six data centres, which between them have responsibility for the long-term management of NERC’s environmental data holdings.
We deal with a variety of environmental measurements, along with the results of model simulations, in:
• Atmospheric science
• Earth sciences
• Earth observation
• Marine science
• Polar science
• Terrestrial & freshwater science, hydrology and bioinformatics
• Space weather
OpenAIRE Portal
Develop an Open Access, participatory infrastructure for scientific information that includes:
• Publications
• Datasets
• Projects
• Interlinking
www.openaire.eu
Data, Reproducibility and Science
Science should be reproducible – other people doing the same experiments in the same way should get the same results.
Observational data is not reproducible (unless you have a time machine!).
Therefore we need to have access to the data to confirm the science is valid!
Image: http://www.flickr.com/photos/31333486@N00/1893012324/sizes/o/in/photostream/
It used to be “easy”…
The Scientific Papers of William Parsons, Third Earl of Rosse, 1800-1867
Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665
…but datasets have gotten so big, it’s not useful to publish them in hard copy anymore.
Hard copy of the Human Genome at the Wellcome Collection
Creating a dataset is hard work!
"Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com
Managing and archiving data so that it’s understandable by other researchers is difficult and time consuming too.
We want to reward researchers for putting that effort in!
Most people have an idea of what a publication is
Some examples of data (just from the Earth Sciences)
1. Time series, some still being updated, e.g. meteorological measurements
2. Large 4D synthesised datasets, e.g. climate, oceanographic, hydrological and numerical weather prediction model data generated on a supercomputer
3. 2D scans, e.g. satellite data, weather radar data
4. 2D snapshots, e.g. cloud camera
5. Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature
6. Datasets consisting of data from multiple instruments as part of the same measurement campaign
7. Physical samples, e.g. fossils
What is a Dataset?
DataCite’s definition (http://www.datacite.org/sites/default/files/Business_Models_Principles_v1.0.pdf):
Dataset: "Recorded information, regardless of the form or medium on which it may be recorded, including writings, films, sound recordings, pictorial reproductions, drawings, designs, or other graphic representations, procedural manuals, forms, diagrams, work flow charts, equipment descriptions, data files, data processing or computer programs (software), statistical records, and other research data."
(From the U.S. National Institutes of Health (NIH) Grants Policy Statement, via DataCite's Best Practice Guide for Data Citation.)
In my opinion a dataset is something that is:
• The result of a defined process
• Scientifically meaningful
• Well-defined (i.e. a clear definition of what is in the dataset and what isn’t)
What metrics do we use for our data?
• Number of discovery dataset records in the DCS (quarterly): NEODC 26, BADC 242, UKSSDC 11. Notes: compliance with NERC data management policy; reflects how many data sets NERC has. This is the number of dataset discovery records visible from the NERC data discovery service.
• Web site visits (quarterly): BADC 61,600; NEODC 10,200. Notes: active use and visibility of the data centre. Site visits come from standard web log analysis systems, such as Webalizer; sensible web crawler filters should have been applied (see the sketch after this list).
• Web site page views (quarterly): BADC 219,900; NEODC 25,800. Notes: see web visits notes.
• Queries closed this period (quarterly): 362 helpdesk queries; 838 dataset applications. Notes: active use and visibility of the data centre. Queries marked as resolved within the quarter; a query is a request for information, a problem, or an ad hoc data request.
• Queries received in period (quarterly): 388 helpdesk queries; 860 dataset applications. Notes: active use and visibility of the data centre. See closed query notes.
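As a rough illustration of where the "web site visits" numbers come from (this is not CEDA's actual tooling, and the log path and bot keywords are assumptions), a minimal Python sketch that counts visits from an Apache-style access log with a simple crawler filter might look like this:

```python
import re

# Very rough stand-in for a tool like Webalizer: count a "visit" as a
# unique (IP address, day) pair, after dropping obvious crawlers.
# The log file name and bot keyword list are assumptions for illustration.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<request>[^"]*)" \S+ \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)
BOT_KEYWORDS = ("bot", "crawler", "spider", "slurp")  # crude crawler filter

def count_visits(log_path):
    visits = set()
    with open(log_path) as log:
        for line in log:
            match = LOG_LINE.match(line)
            if not match:
                continue  # skip lines not in combined log format
            agent = match.group("agent").lower()
            if any(keyword in agent for keyword in BOT_KEYWORDS):
                continue  # skip known crawlers
            visits.add((match.group("ip"), match.group("day")))
    return len(visits)

if __name__ == "__main__":
    print(count_visits("access.log"))
```

Real log analysers apply far more sophisticated session and robot heuristics, which is why the notes above stress that "sensible web crawler filters should have been applied" before trusting the numbers.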
Data centre metrics – produced 15th July 2014
• Percent queries dealt with in 3 working days (quarterly): 84.06% (11.57% resolved after 3 days); 87.67% (10.23% resolved after 3 days). Notes: responsiveness; see closed query notes.
• Queries receiving initial response within 1 working day: helpdesk 93.57%; dataset applications 97.91%.
• Identifiable users actively downloading data: over year to date, BADC 4065, NEODC 362. Notes: use and visibility of the data centre; an estimate of the number of users using data access services over the year.
• Number of metadata records in data centre web site: BADC 240, NEODC 33. Notes: INSPIRE compliance; reflects how many data sets NERC has.
• Number of datasets available to view via the data centre web site: (metric in development). Notes: INSPIRE compliance; usable services.
• Number of datasets available to download via the data centre web site: (metric in development). Notes: INSPIRE compliance; usable services.
Data centre metrics – produced 15th July 2014
• NERC funded data centre staff (FTE): 14 (estimate for FY 14/15). Notes: data management costs; efficiency. The number of full time equivalent posts employed to perform data centre functions.
• Direct costs of Data Stewardship in data centre: (reportable at end of financial year). Notes: data management costs; efficiency; cost to NERC.
• Capital Expenditure directly related to Data Stewardship at data centre: (reportable at end of financial year). Notes: data management costs; efficiency.
• Direct Receipts from Data Licenses and Sales: £0 (CEDA does not charge for data). Notes: commercial value of data products and services.
• Number of projects with Outline Data Management Plans: (metric in development). Notes: a means of tracking projects’ adoption of good DM practice; the Outline DMP is at proposal stage.
• Number of projects with Full Data Management Plans: (metric in development). Notes: a means of tracking projects’ adoption of good DM practice; the Full DMP is at funded stage.
• Users by area: UK 2534 (61%), Europe 494 (12%), Rest of the world 1024 (25%), Unknown 79 (2%). Notes: active use; visibility of the data centre internationally; percentage of the user base in terms of geographical spread.
• Users by institute type: University 2934 (71%), Government 694 (17%), NERC 160 (4%), Other 277 (7%), Commercial 42 (1%), School 35 (1%). Notes: active use; visibility of the data centre sectorally; percentage of the user base in terms of the users’ host institute type.
After the data is downloaded, what happens then?
Short answer: we don’t know!!
Unless the data user comes back to us to tell us, or we stumble across a paper which:
• Cites us
• Or mentions us in a way that we can find
• And tells us which dataset the authors used.
This is why we’re working with other groups (like CODATA, Force11, RDA, DataCite, Thomson Reuters, …) to promote data citation.
The Noble Eight-Fold Path to Citing Data
1. Importance
2. Credit and attribution
3. Evidence
4. Unique Identification
5. Access
6. Persistence
7. Specificity and verifiability
8. Interoperability and flexibility
The principles are supplemented with a glossary, references and examples.
http://force11.org/datacitation
How we (NERC) cite data
We use digital object identifiers (DOIs) as part of our dataset citations because:
• They are actionable, interoperable, persistent links for (digital) objects
• Scientists are already used to citing papers using DOIs (and they trust them)
• Academic journal publishers are starting to require that datasets be cited in a stable way, i.e. using DOIs
• We have a good working relationship with the British Library and DataCite
NERC’s guidance on citing data and assigning DOIs can be found at:
http://www.nerc.ac.uk/research/sites/data/doi.asp
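Because a DOI is an actionable link, a dataset citation can be checked mechanically by asking the doi.org resolver where it points. Here is a minimal Python sketch of that check; the DOI below is a made-up placeholder (10.5285 is a NERC prefix, but this exact suffix is invented and will not resolve), so substitute a real dataset DOI to try it:

```python
import urllib.request

# Placeholder DOI for illustration only: the prefix matches NERC's
# DataCite-minted dataset DOIs, but the suffix is invented.
EXAMPLE_DOI = "10.5285/EXAMPLE-DATASET-DOI"

def resolve_doi(doi):
    """Follow the doi.org redirect chain and return the landing page URL."""
    request = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"User-Agent": "doi-check-sketch/0.1"},  # identify ourselves
    )
    with urllib.request.urlopen(request) as response:
        return response.geturl()  # final URL after all redirects

if __name__ == "__main__":
    print(resolve_doi(EXAMPLE_DOI))
```

With a real NERC dataset DOI, the returned URL is the dataset catalogue page, which doubles as the DOI landing page, as in the example on the next slide.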
[Screenshot: a dataset catalogue page, which doubles as the DOI landing page, showing the dataset citation and a clickable link to the dataset in the archive.]
Another example of a cited dataset
http://www.charme.org.uk/
Data metrics – the state of the art!
Data citation isn’t common practice (unfortunately), and data citation counts don’t exist yet.
To count how often BADC data is used we have to:
1. Search Google Scholar for “BADC” and “British Atmospheric Data Centre”
2. Scan the results and weed out false positives
3. Read the papers to figure out what datasets the authors are talking about (if we can)
4. Count the mentions and citations (if any) (a toy version of steps 2-4 is sketched below)
We’re working with DataCite and Thomson Reuters to get data citation counts.
Image: http://www.lol-cat.org/little-lovely-lolcat-and-big-work/
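To make the labour in steps 2-4 concrete, here is a toy Python sketch (not CEDA's actual workflow) that scans a directory of already-downloaded paper texts for mentions of the data centre and of candidate dataset names. The directory name and dataset names are invented for illustration:

```python
import re
from pathlib import Path

# Names we search for; the dataset names are hypothetical examples.
CENTRE_PATTERN = re.compile(r"\bBADC\b|British Atmospheric Data Centre", re.I)
DATASET_NAMES = ["Example Radiosonde Archive", "Example Cloud Camera Series"]

def count_mentions(paper_dir):
    """Count papers mentioning the centre, and which datasets they name."""
    centre_papers = []
    dataset_counts = {name: 0 for name in DATASET_NAMES}
    for paper in Path(paper_dir).glob("*.txt"):
        text = paper.read_text(errors="ignore")
        if not CENTRE_PATTERN.search(text):
            continue  # weed out papers that never mention the centre
        centre_papers.append(paper.name)
        for name in DATASET_NAMES:
            if name.lower() in text.lower():
                dataset_counts[name] += 1  # paper names this dataset
    return centre_papers, dataset_counts

if __name__ == "__main__":
    papers, counts = count_mentions("papers/")
    print(len(papers), "papers mention the centre")
    print(counts)
```

Even this toy version shows the weakness of the approach: free-text mentions are ambiguous and easy to miss, which is exactly why machine-countable citations using DOIs are needed.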
Altmetrics and social media for data?
We are mainly focussing on citation as a first step, as it’s the approach most commonly accepted by researchers.
We have a social media presence (@CEDAnews), mainly used for announcements about service availability.
We definitely want ways of showing our funders that we provide a good service to our users and the research community.
And we want to be able to tell our depositors what impact their data has had!
RDA Bibliometrics for Data WG – preliminary survey results
• Launched 3rd September
• As of 17th September – 63 responses, 100% completion
• Survey link still live: https://www.surveymonkey.com/s/RDA_bibliometrics_data
Respondents by field:
• Science: 3
• Earth sciences: 16
• Physics: 4
• Scientometrics and bibliometrics: 4
• Engineering: 2
• Chemistry: 1
• Biology (inc. zoology): 2
• STEM: 1
• Medicine & biomedical research: 8
• Energy: 1
• Admin for research: 2
• Computer science: 4
• Social science, policy and economics: 4
• Librarian and digital curation: 11
Current use
Future and missing
In the future, what would you like to use to evaluate the impact of data?
Most popular suggestions:
• Data citations
• Actual use in professional practice
• Download statistics
• Mentions in social media
• DOIs/PIDs
• Altmetrics
• Well regarded indicators
What is currently missing and/or needs to be created for bibliometrics for data to become widely used?
Most popular suggestions:
• Culture change!
• Principles and standards for consistent practice (and enforcement of these)
• Use of PIDs
• Mature tools for data citation, publishing, discovery and impact analysis
• Openness in papers and patents
Also pleas for:
• Easy to use and set up
• Radically different tools
• Whatever tool can provide reliable information
• Best estimate of societal benefit in $$ terms
Also:
• Research on what current metrics actually measure
• Infrastructure
• Free apps
Survey link still live!
https://www.surveymonkey.com/s/RDA_bibliometrics_data
Please help!
Please pass on the link to anyone who might be interested and encourage others to fill in the survey!
Share your experience with altmetrics – join the RDA WG on Publishing Data Bibliometrics:
https://rd-alliance.org/group/rda-wds-publishing-data-bibliometrics-wg.html
Thank you!
Sarah Callaghan
[email protected]
@sorcha_ni
Image: http://weknowmemes.com/generator/meme/379914/
Work funded by the European Commission as part of the project OpenAIREplus (FP7-INFRA-2011-2, Grant Agreement no. 283595).