Data and Publication Discovery

Brian Matthews,
Information Management Group,
STFC Rutherford Appleton Laboratory
CLADDIER workshop, Chilworth, Southampton, UK
15th May 2007
Microsoft’s Science 2020 Report
Modern scientific communication relies on
both journals and databases. At present these
are not integrated.
By 2020 mutual linking will be commonplace
and publications just containing peer-reviewed
data will become available.
http://research.microsoft.com/towards2020science/downloads.htm
The Use Case
Joanna, at the University of Southampton, has done some work on the biology of
seawater off the coast of Cornwall. As part of her analysis she needs (from a number of
locations):
•Publications and data describing prior or similar work.
•Oceanic profiles of salinity and temperature from the closest cruise in time and space,
•Meteorological data to accompany both her own sampling and the oceanic data,
•Remotely sensed ocean colour imagery (to add additional information on the biota).
She will then publish a paper that cites the datasets, lodge the paper in her own
institutional repository and also deposit her datasets in one or more appropriate data
repositories (e.g. both the NOCS data archive and the BODC).
The work Joanna has done is of interest in calibrating a global earth system model to
compare simulations of oceanic CO2 production with the scenarios used in the model.
Fred, at Reading University, needs to be able to find Joanna’s paper and data either via
citations or directly from publication repositories. Having found the paper, the data
should be obtainable via the citation and the data archive.
As part of his work, before he uses Joanna’s data, he checks back through the other
datasets used and cited as inputs to it, because he suspects Joanna’s work could be
recalibrated by using better-quality meteorological re-analyses.
What does that need?
1. Joanna’s own data acquisition
2. Location and acquisition of prior publications and data
3. Location and acquisition of remote datasets required as part of the analysis
4. Creation of personal metadata for new data
5. Data analysis and paper writing
6. Citation of remote papers and datasets
7. Paper submission to a journal and acceptance
8. Repository submission of paper (maybe a preprint)
9. Repository submission of data
10. Further metadata creation for the data (at the data repository)
11. Further metadata creation for the publication (at the institutional repository)
12. Linking between institutional repositories and the data held at the discipline repository
13. All the datasets and publications cited need to be annotated with the citation information

1. Discovery of Joanna’s work by Fred (either from Joanna’s publication or datasets or citations thereof)
2. Acquisition of all the relevant publications and datasets by Fred
3. Analysis and publication by Fred (and all the same steps from 5 as required by Joanna)
4. External adjudicators need to be able to find and acquire citation information.
So what services do we need?
In order to achieve this scenario we need to
provide a set of key services
• Publishing of Data
• Browsing and searching
– across different repositories
– across data and publication
• Cross-citation of data and publication
– forward and backward citation
– need to maintain currency of citation links
Browsing and Searching
•Browsing and searching
– across different repositories
– across data and publication
CLADDIER has provided a harvesting and search
tool to support cross-repository searching
• Uses OAI-PMH – a conventional approach
– Simple – but it works!
– Simple key-word searching
– Three participating repositories in the pilot
– BADC, STFC ePubs, e-Prints Soton
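A harvesting and keyword-search step of this kind could be sketched roughly as below. This is an illustration only, not the CLADDIER tool: the function names, the sample response and the `example.org` identifiers are assumptions made for the example. The OAI-PMH `ListRecords` verb, the `oai_dc` metadata prefix and the `resumptionToken` paging mechanism come from the OAI-PMH specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if token:
        # A resumption token is an exclusive argument: no metadataPrefix with it.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text, repository):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        ident = rec.findtext(f"{OAI}header/{OAI}identifier", default="")
        titles = [e.text or "" for e in rec.iter(DC + "title")]
        descs = [e.text or "" for e in rec.iter(DC + "description")]
        records.append({
            "repository": repository,
            "id": ident,
            "title": " ".join(titles),
            # Lower-cased text used for simple keyword matching.
            "text": " ".join(titles + descs).lower(),
        })
    token = root.findtext(f"{OAI}ListRecords/{OAI}resumptionToken")
    return records, (token.strip() if token and token.strip() else None)


def keyword_search(records, keyword):
    """Simple keyword search over harvested records from any repository."""
    kw = keyword.lower()
    return [r for r in records if kw in r["text"]]


# Made-up response for illustration; a real harvester would fetch it over HTTP.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <header><identifier>oai:example.org:data-1</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>Oceanic salinity profiles</dc:title>
     <dc:description>Cruise data off Cornwall</dc:description>
    </oai_dc:dc>
   </metadata>
  </record>
  <record>
   <header><identifier>oai:example.org:paper-2</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>Seawater biology off the Cornish coast</dc:title>
    </oai_dc:dc>
   </metadata>
  </record>
  <resumptionToken>page2</resumptionToken>
 </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE, "example-repository")
hits = keyword_search(records, "Cornwall")
```

A harvester would call `list_records_url` again with the returned token until no token comes back, then merge the records from each participating repository into one index.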
Adding cross-citation
The Discovery Service gives a broad-brush search
• Gives you both publications and data sets
– which are indexed on a key word
• A Google across repositories
• Currently, cannot tell whether the data and
publication are actually related
– what data and publications inspire a piece of work
(generating a new data set)
– what publications arise from a data set
We need to exploit the concept of citation to see
whether data and publications are actually related
Traditional Citations
Cross-citation
Adding Citations to the
Metadata Model
Adding Citations has been considered in standard
metadata models.
• e.g. Scholarly Works Application Profile
– JISC funded initiative
– Dublin Core Application Profile
– Describing Scholarly Publications (ePrints)
– Based on the FRBR model
– Does consider citations
– But breaks citations up into small components
– This is highly labour intensive to enter
– Does not have a notion of back citation
FRBR Model
ePubs and Cross-Citations
STFC ePubs has a metadata model based on
FRBR
•Need to extend this to support cross-citation
•Keep it simple
•Can support forward and back links
•Have developed a simple model for citations
Citation Model
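One way to picture a simple citation model with forward and back links is the sketch below. It is an assumption-laden illustration, not the ePubs schema: the class names, field names and identifiers are all invented here. Each information object (publication or dataset) keeps the set of things it cites and the set of things that cite it, and both are updated together.

```python
from dataclasses import dataclass, field


@dataclass
class InfoObject:
    """A publication or dataset (hypothetical record layout)."""
    identifier: str
    kind: str  # "publication" or "dataset"
    cites: set = field(default_factory=set)     # forward links
    cited_by: set = field(default_factory=set)  # back links


class CitationGraph:
    def __init__(self):
        self.objects = {}

    def add(self, identifier, kind):
        self.objects[identifier] = InfoObject(identifier, kind)

    def cite(self, citing_id, cited_id):
        # Record both directions so either end can be browsed.
        self.objects[citing_id].cites.add(cited_id)
        self.objects[cited_id].cited_by.add(citing_id)

    def inputs_of(self, identifier):
        """Everything reachable by following forward links from one object."""
        seen, stack = set(), [identifier]
        while stack:
            for nxt in self.objects[stack.pop()].cites:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


# Illustrative identifiers loosely following the use case.
graph = CitationGraph()
graph.add("joanna-paper", "publication")
graph.add("joanna-data", "dataset")
graph.add("met-data", "dataset")
graph.add("cruise-profiles", "dataset")
graph.cite("joanna-paper", "joanna-data")
graph.cite("joanna-data", "met-data")
graph.cite("joanna-data", "cruise-profiles")
```

Here `graph.inputs_of("joanna-paper")` walks back through the inputs to Joanna's work, as Fred does in the use case, while `graph.objects["met-data"].cited_by` answers the reverse question of what has used a dataset.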
Maintaining Links
Ideally the archives holding the datasets and
publications would be notified that a paper
citing them had been submitted.
Metadata associated with those records would
be updated to reflect the citations.
The metadata in the publication repository
should also link to the data in the data archives
and vice versa.
It would be great if this notification could be
done automatically.
Notification Services
To support this, we need to provide a
notification service.
• Federated repositories register with the service
• Repositories notify the service of citations
• The service informs (via broadcasting or targeting) repositories of citation
• Service provides sufficient information to update metadata
• Still under development.
• Note Blogging software.
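The register, notify and inform steps can be sketched as a small in-memory service. This is only a sketch under stated assumptions: the slides describe the service as still under development, so the class, method names, message fields and repository names below are hypothetical, and a real service would deliver messages over the network rather than through callbacks.

```python
class NotificationService:
    """Toy citation notification hub (hypothetical design)."""

    def __init__(self):
        self.repositories = {}  # repository name -> callback taking a message

    def register(self, name, callback):
        self.repositories[name] = callback

    def notify(self, source, citing_id, cited_id, target=None):
        # The message carries enough information to update metadata.
        message = {"source": source, "citing": citing_id, "cited": cited_id}
        if target is not None:
            recipients = [target]  # targeted delivery
        else:
            # Broadcast to every registered repository except the sender.
            recipients = [n for n in self.repositories if n != source]
        for name in recipients:
            self.repositories[name](message)
        return recipients


# Each repository just stores the messages it receives in this example.
inboxes = {"data-archive": [], "publication-repo": [], "other-repo": []}
service = NotificationService()
for repo_name, inbox in inboxes.items():
    service.register(repo_name, inbox.append)

broadcast_to = service.notify("publication-repo", "joanna-paper", "joanna-data")
targeted_to = service.notify(
    "publication-repo", "joanna-paper", "met-data", target="data-archive"
)
```

On receiving a message, a data archive would add the citing paper to the cited dataset's back links, which is how the citation metadata stays current without manual entry.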
Conclusions
The Use Case supports the scientific process with
repositories
This requires a cross-linked network of information
objects
Which needs to be stored, maintained and searched
Tools and ideas relatively straightforward
Lots of gluing of existing components
Keep it simple – so it will get used