Datasets Programme


British Library
Datasets Programme
JISC RSP Winter School
February 2011
Max Wilkinson
Today’s Talk
1. The British Library
2. Data in scholarly communication
3. The problem with data
4. The Datasets Programme
   • Vision
   • Strategy
   • Activity (DataCite)
5. Other Projects
2
The British Library

Exists for everyone who wants to do
research – for academic, personal, and
commercial purposes.

Covers all subject areas – sciences,
technology, medicine, arts, humanities,
social sciences…

Receives a copy of every item
published in the UK.

Holds over 150 million items, with 3
million items added each year.

Used by over 16,000 people each day
(on site and online).
3
The British Library: some facts and figures
British Library Act 1972
National centre for reference, study, bibliographical and other information services, in relation both to scientific and
technological matters, and to the humanities.
Science and Innovation Investment Framework 2004-2014, H.M. Treasury (2004)
UK research base must have ready and efficient access to information of all kinds – such as experimental data sets,
journals, theses, conference proceedings and patents. This is the life blood of research and innovation.
• National library of the UK. Serves researchers, business, libraries, education & the general public
• The largest document supply service in the world. Secure e-delivery and ‘just in time’ digitisation enable desktop delivery within 2 hours
• Collection includes over 2m sound recordings, 5m reports, theses and conference papers, and the world’s largest patents collection (c.50m)
• Generates value to the UK economy each year of 4.4 times public funding
• Collection fills over 600km of shelving and grows at 11km per year
• 70 TB of digital material through voluntary deposit
• GIA Funding 08/09: £94.8m operational, £12m capital
• Other funding secured 07/08: c.£33m
• Business and IP Centre: Providing inspiration, and enabling protection of creative capital and business development
• Helping people advance knowledge to enrich lives
• 3 main sites in London and Yorkshire. Circa 2,000 staff
Who do we serve?
The Researcher – We provide access to research level materials to all sectors including academia, industry, government, charities and NGOs.
Business – The British Library also has a critical role supporting businesses of all sizes, from individual entrepreneurs through to major organisations.
The Learner – We have an important role to play in supporting education from primary schools to developing future researchers of any age.
The Library Community – We play a key role in supporting the wider UK Library Community and information network.
The General Public – The services we offer include exhibitions and events, tours and web services which digitally showcase our collection.
5
Modern science relies on good data
6
Scholarly record
(Diagram labels: Exposure, Metadata, Discovery, Permanence, Record, Citation, Access, Trust Fabrics, Copyright)
7
The Foundation for Research

Data is a crucial component of the scholarly record.

Re-acquisition may be impossible

Datasets are essential to the British Library’s mission to advance the world’s knowledge.
8
Current Situation
• No effective way to link between datasets and articles;
• No widely used method to identify datasets;
• No widely used method to cite datasets.
9
As a result…
Datasets are:
• Difficult to discover
• Difficult to access
• In danger of being lost
10
Difficult to Discover.
Good luck finding the data!
“Source: Committee on Climate Change”
11
Data are diverse in the Digital Landscape

Seismic measurements taken by a
geologist.

An audio archive of birdsong created
by an ornithologist.

Genetic data collected by a medical
researcher.

A survey of public opinions collected
by a sociologist.
12
Re-join the gap…
Articles
Underlying
data

(No) effective way to link
between articles and
datasets

(No) widely used method to
identify datasets

(No) widely used method to
cite datasets
13
Datasets – first class citizens?
• Data is difficult to manage after project funding ceases
• Informal networks provide the primary means of sharing
• Only 21% use a national or international facility
• Datasets are not included in impact analysis
• Good luck finding it or getting permission to use it (your discipline may vary)
Source: UKRDS Study: The Data Imperative. Managing the UK’s research data for future use (Feb 2009)
14
Scholarly record
(Diagram labels: Exposure, Metadata, Discovery, Permanence, Record, Citation, Access, Trust Fabrics, Copyright)
15
Research training based on scholarly communication
(Diagram labels: Exposure, Metadata, Discovery, Permanence, Record, Citation, Access, Trust Fabrics, Copyright)
Rarely includes data
16
Scholarly communication requires intellectual exchanges
(Diagram labels: Exposure, Metadata, Discovery, Permanence, Record, Citation, Access, Trust Fabrics, Copyright)
No such data fabric
17
Scholarly discourse requires a record and provenance
(Diagram labels: Exposure, Metadata, Discovery, Permanence, Record, Citation, Access, Trust Fabrics, Copyright)
Almost non-existent for data
18
The Datasets Programme
We envision a future where researchers can:
• Discover, access, reuse, and reference datasets.
• Track the impact of the data that they generate and receive appropriate credit.
Our approach is to:
• Provide a focus for the community to establish needs, requirements and agreement.
• Explore novel technology and creative solutions.
19
Two key concepts
INCENTIVE
SUSTAINABILITY
20
Projects and activities
www.bl.uk/datasets
Follow us on twitter @datasetsBL
21
A Key Component for Many Goals
(Diagram labels: Persistent Identification?; Cite, Make Visible, Reuse, Find, Verify, Access, Track Impact)
22
Citation using Digital Object Identifiers (DOIs)
Published Article (Abstract or full text)
The DOI system offers an easy, internet-actionable way to connect the article with the underlying publication
Article Citation
G. Yancheva, N. R. Nowaczyk et al (2007)
Influence of the intertropical convergence zone on the
East Asian monsoon
Nature 445, 74-77
doi:10.1038/nature05431
But a complete scholarly record would also link to the evidential datasets and their location, e.g. PANGAEA
How to reference?
Dataset
G.Yancheva, N. R. Nowaczyk et al (2007)
Rock magnetism and X-ray flourescence spectrometry
analyses on sediment cores of the Lake Huguang Maar,
Southeast China, PANGAEA
23
doi:10.1038/nature05431
leads to a landing page
24
Connecting an Article with the Underlying Data
URIs are commonly used but can decay
• (e.g. Wren JD: URL decay in MEDLINE – a 4-year follow-up study. Bioinformatics. 2008 Jun 1;24(11):1381-5).
Digital Object Identifiers (DOIs) offer a solution
• Most widely used identifier for scientific articles
• Researchers, authors, publishers know how to use them
• Put datasets on the same playing field as articles


Dataset
Yancheva et al (2007). Analyses
on sediment of Lake Maar.
PANGAEA.
doi:10.1594/PANGAEA.587840
25
doi:10.1594/PANGAEA.587840
26
Dataset citation using Digital Object Identifiers (DOIs)
Dataset
G.Yancheva, N. R. Nowaczyk et al (2007)
Rock magnetism and X-ray flourescence spectrometry
analyses on sediment cores of the Lake Huguang Maar,
Southeast China, PANGAEA
doi:10.1594/PANGAEA.587840
Scholarly record is complete
Data Citation
Article
G. Yancheva, N. R. Nowaczyk et al (2007)
Influence of the intertropical convergence zone on the
East Asian monsoon
Nature 445, 74-77
doi:10.1038/nature05431
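The article and dataset citations on this slide share the same shape: authors, year, title, source, DOI. A minimal Python sketch of that shared shape follows. The `Citation` class and `format_citation` function are hypothetical names introduced here for illustration, and the output layout is an assumption, not a DataCite or publisher format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical record holding the fields shown on the slide."""
    authors: str
    year: int
    title: str
    source: str  # journal reference for an article, data centre for a dataset
    doi: str


def format_citation(c: Citation) -> str:
    """Render a citation; the layout is illustrative only."""
    return f"{c.authors} ({c.year}) {c.title}. {c.source}. doi:{c.doi}"


article = Citation(
    authors="G. Yancheva, N. R. Nowaczyk et al",
    year=2007,
    title="Influence of the intertropical convergence zone on the East Asian monsoon",
    source="Nature 445, 74-77",
    doi="10.1038/nature05431",
)
dataset = Citation(
    authors="G. Yancheva, N. R. Nowaczyk et al",
    year=2007,
    title=("Rock magnetism and X-ray flourescence spectrometry analyses on "
           "sediment cores of the Lake Huguang Maar, Southeast China"),
    source="PANGAEA",
    doi="10.1594/PANGAEA.587840",
)

# Both kinds of output go through the same function.
print(format_citation(article))
print(format_citation(dataset))
```

Treating the dataset as the same kind of record as the article is the point of the slide: once both carry a DOI, neither needs special handling.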
27
Projects – DataCite
DataCite is an international consortium which aims to:
• Establish easier access to scientific research data on the Internet
• Increase acceptance of research data as legitimate, citable contributions to the scientific record
• Support data archiving that will permit results to be verified and re-purposed for future study.
28
DataCite

• Support researchers by enabling them to locate, identify, and cite research datasets with confidence
• Support data centres by providing persistent identifiers for datasets, workflows and standards for data publication
• Support publishers by enabling research articles to be linked to the underlying data
DataCite : Data Centres :: CrossRef : Publishers
29
Digital Object Identifier (DOI)
doi:10.4124 / 0003.569
Prefix
Suffix
30
DOI prefix
doi:10.4124/0003.569
Prefix

Suffix
The British Library provides data centres with a unique prefix for DataCite DOIs.

For example, the Archaeology Data Service uses 10.5284.
31
DOI suffix
doi:10.4124/0003.569
Prefix
Suffix

Suffix generated by the data centre

Guidelines for DOI syntax are being developed
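The prefix/suffix split shown on these slides can be sketched in a few lines of Python. This is an illustration only: the `Doi` type and `split_doi` function are hypothetical names, and the check is a rough sanity test (prefix begins with "10.", suffix non-empty), not the DOI syntax guidelines mentioned above.

```python
from typing import NamedTuple


class Doi(NamedTuple):
    prefix: str  # assigned to the data centre, e.g. "10.4124"
    suffix: str  # generated by the data centre, e.g. "0003.569"


def split_doi(doi: str) -> Doi:
    """Split a DOI string into prefix and suffix at the first slash."""
    text = doi.strip()
    if text.lower().startswith("doi:"):
        text = text[4:]  # drop the "doi:" label used on the slides
    prefix, sep, suffix = text.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return Doi(prefix, suffix)


example = split_doi("doi:10.4124/0003.569")
print(example.prefix)
print(example.suffix)
```

Splitting at the first slash only matters when a suffix itself contains slashes; everything after the first slash is treated as the suffix.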
32
Resolving a DOI
doi:10.4124/0003.569
Prefix
Suffix
Resolving the DOI:

http://dx.doi.org/10.4124/0003.569
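As the slide shows, resolving a DOI amounts to appending it to the resolver address. A minimal Python sketch, assuming the `http://dx.doi.org/` resolver from the slide; `doi_to_url` is a hypothetical helper name, and percent-encoding unusual characters in the suffix is a precaution added here rather than something the slide specifies.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver address shown on the slide


def doi_to_url(doi: str) -> str:
    """Build the resolver URL for a DOI such as 'doi:10.4124/0003.569'."""
    text = doi.strip()
    if text.lower().startswith("doi:"):
        text = text[4:]
    # Keep "/" literal; percent-encode characters that are unsafe in a URL path.
    return RESOLVER + quote(text, safe="/")


print(doi_to_url("doi:10.4124/0003.569"))
# prints http://dx.doi.org/10.4124/0003.569
```

Following the resulting URL redirects to whatever landing page the data centre registered for that DOI; the sketch builds the link and makes no network request.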
33
DOIs resolve to an open landing page
34
DataCite Service
• Built a service for data centres to mint DOIs for datasets and store associated metadata (http://api.datacite.org)
• British Library is trialling the service with several UK data centres, including:
35
Projects and activities
www.bl.uk/datasets
36
For more information on the BL Datasets Programme
Max Wilkinson: Programme Manager; Datasets
Email:[email protected]
Email: [email protected]
Website: www.bl.uk/datasets
Follow us on twitter @datasetsBL
37
Follow On slides
38
SageCite: Data citation in bioinformatics workflow
Sage Bionetworks: Aggregating datasets from contributors to create massive coherent datasets that can be used for systems level analysis of disease
SageCite: Integration of data citation services into multi-contributor bioinformatics workflow. Establishing data attribution and credit mechanisms.
► INCENTIVE
• Sage Bionetworks data capture and analysis workflow (Taverna: myExperiment)
• Data citation service integration points and recommendations
• Benefits analysis
39
Dryad UK: Repository sustainability
Leveraging the Dryad Consortium, which is addressing the acquisition and storage of long-tail supplementary data
Dryad UK: Define a business case and pilot service integrating DataCite DOIs and dataset archiving into publisher workflows
► SUSTAINABILITY
• Expand publisher base
• Seamless integration into publisher workflow
• Sustainability models for datasets supplementary to publication
40
Discovery
Science Technology & Medicine
• Focussing on discovery services in the library’s integration engine
• Based on commissioned consultations
• Data resources
• Selection guidelines
• Making available through library search facilities
41
Dataset Discovery Project
42
Access
SSCR
• Focussing on streamlining access to established and high value data collections
• Resource guides for datasets
• Streamlining access to established data centres
• Raising profiles of high impact datasets
• E.g. 2012 Olympics and 2011 census
• Also piloting dataset surfacing through the Library’s search facilities
43
Projects – British Atmospheric Data Centre
British Atmospheric Data Centre (BADC):

Natural Environment Research Council's
designated data centre for the Atmospheric
Sciences.

Assists researchers to locate, access and
interpret atmospheric data and ensures the
long-term integrity of this data.
A joint project is underway to improve the
citability of BADC datasets

Publications based on the data will underlie the 2013 Intergovernmental Panel on Climate Change (IPCC) Report.
44
Challenges to Explore

Helping people to …

Developing and sustaining…

Providing a…
45
A combination of eight social and technical factors – ideally there would be:
Personal attribution and credit for data publication
An established mechanism for citation of datasets
A generic minimum metadata standard for datasets
A tool to permit the easy creation of well-structured metadata
A standard mechanism for packaging data files and their metadata
Appropriate repositories to archive and publish research datasets
Reciprocal citation links between datasets and research articles
Mechanisms for quality control of data publications
46