Experiences with Repositories and Blogs in Laboratories or ‘R4L: The Repository for the Laboratory’ http://r4l.eprints.org Leslie Carr, Simon Coles & Jeremy Frey University of Southampton, U.K. [email protected] This.

Experiences with Repositories and Blogs in
Laboratories
or
‘R4L: The Repository for the Laboratory’
http://r4l.eprints.org
Leslie Carr, Simon Coles & Jeremy Frey
University of Southampton, U.K.
[email protected]
This work is licensed under a
Creative Commons Licence
Attribution-ShareAlike 3.0
http://creativecommons.org/licenses/by-sa/3.0/
The Problem: Data Generation
Synthesis
Characterisation
The Problem: Data Management
“Data from experiments conducted as recently as six months ago
might be suddenly deemed important, but those researchers may
never find those numbers – or if they did might not know what those
numbers meant”
“Lost in some research assistant’s computer, the data are often
irretrievable or an undecipherable string of digits”
“To vet experiments, correct errors, or find new breakthroughs,
scientists desperately need better ways to store and retrieve
research data”
“Data from Big Science is … easier to handle, understand and
archive. Small Science is horribly heterogeneous and far more vast.
In time Small Science will generate 2-3 times more data than Big
Science.”
S. Carlson, ‘Lost in a Sea of Science Data’, The Chronicle of Higher Education (23/06/2006)
The Problem: Data Deluge
• 40 years ago a PhD student
would determine about 3 crystal
structures during the course of their
study – this can now be easily
achieved in a day
• There are approx. 30 million known chemical compounds
• Approx. 2 million crystal structures have been determined
• There are fewer than 0.5 million published crystal structures
residing in (licensed) curated databases
• There are just a few thousand ‘open’ crystal structures
• The primary cause of this is the current data publication
process, which is tied to journal articles and peer review
The Problem: Publishing Data
Spectroscopic analysis is often
performed to ensure a reaction is
proceeding according to plan – as a
result <5% are published (via a
process with heavy information loss)
The Problem: Reproducible
Experiments
• Poor availability and description of experiments and
arising data in the current literature
• How can we validate this data?
• Open data will need to be self-explanatory and prove
its own ‘correctness’
• Requirement for an ‘experiment audit trail’
• (Published) Science should be reproducible
• Requirement to provide sufficient data and metadata to
back up an experiment description
The Solution
Intellect &
Interpretation
(Journal
article, report,
etc)
Underlying data
(Institutional
data repository)
Fitting into the Information Environment
Institutional
Data
Sources
High Level Relationships
Repository Design
• Scenarios – assist design team in understanding each other
• Feedback from a SPECTRa-based questionnaire
• First design: build one to throw away (out of the box EPrints)
• Population of disposable repository informed design of
actual repository
• Population informed workflow capture and analysis
• Manufacturer discussions
• Requirements capture with publishing community
Questionnaire Results #1
• The 110 respondents comprised PhD students (55%), postdoctoral
workers (18%) and faculty staff (19%).
• Primary use of computers and the internet is for information
researching, writing papers and reports, working up data and
instrument control.
• Computers are used regularly for everyday work, but much
less so for social networking and other ‘modern’ uses.
• Mostly well-established, community-standard applications,
software and file formats are used, with less use of modern
data sources.
• There is still extensive use of printed paper copies of PDF
files, which are generally stored on personal computers
without any structure or use of reference software. A
researcher will have 100-1000 PDF files on their computer and
prefers to share them by sending PDFs to
collaborators.
Questionnaire Results #2
• About 66% have had to generate electronic supplementary
information to supplement their journal articles.
• Software use is predominantly self-taught, as
opposed to being taught professionally.
• Supplementary information is mainly generated and stored in
proprietary formats, although there is considerable use of
‘popular’ formats (e.g. Microsoft Office).
• There is a preference, or requirement, to keep a hardcopy of
data as well as an electronic one.
• Experimental and analysed data are generally kept on a
group or instrument-controlling computer; however, there is
often a need to keep a hardcopy (e.g. in a lab notebook).
• About 66% have not heard of ‘InChI’, ‘metadata’ or ‘JCAMP’
format, whilst around 50% have not heard of ‘DOI’, ‘Open
Access’, ‘Semantic Web’ or ‘RDF’.
Questionnaire Results #3
• There is a considerable lack of awareness surrounding
repositories and their function.
• There is a requirement for search and discovery to be
based predominantly on structure, formula, author or
keyword.
• The most attractive purpose of a repository would be for the
storage of a ‘permanent record’.
• Most chemists would comply if deposit in a repository were a
mandatory requirement of funding or publication; however,
virtually all are unaware of the position of these
organisations with respect to open access and deposit in a
repository.
The Plan
Workflow Analysis
UV-Vis
Powder XRD
Mass Spec
NMR
Sufficient similarity to design a generic deposit / ingest process
The R4L Repository Deposit / Ingest
Create new compound
(parent record)
Add new experiment type
Add metadata and
upload data files
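The three deposit steps above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names `Compound` and `Experiment`, the `deposit` helper, and the example values are hypothetical and are not taken from the R4L software.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    """One characterisation experiment attached to a compound (hypothetical model)."""
    experiment_type: str                               # e.g. "NMR", "UV-Vis"
    metadata: Dict[str, str] = field(default_factory=dict)
    data_files: List[str] = field(default_factory=list)


@dataclass
class Compound:
    """Parent record: every experiment hangs off exactly one compound."""
    name: str
    experiments: List[Experiment] = field(default_factory=list)

    def add_experiment(self, experiment_type: str) -> Experiment:
        exp = Experiment(experiment_type)
        self.experiments.append(exp)
        return exp


def deposit(compound_name, experiment_type, metadata, files):
    """Walk through the three steps shown on the slide."""
    compound = Compound(compound_name)                 # 1. create new compound (parent record)
    exp = compound.add_experiment(experiment_type)     # 2. add new experiment type
    exp.metadata.update(metadata)                      # 3. add metadata ...
    exp.data_files.extend(files)                       #    ... and upload data files
    return compound
```

Because every experiment type passes through the same three steps, one generic routine of this shape could serve UV-Vis, powder XRD, mass spec and NMR alike.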
The Probity Service
• Process to assert originality of a
piece of work / repository record
• Incorporate into EPrints core
software?
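The slides do not describe how the Probity Service asserts originality. One common way to support such a claim is to register a cryptographic digest of the deposited content together with a timestamp; the sketch below assumes that approach. The function names and the claim layout are hypothetical and do not represent the service's actual design.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 digest of the deposited content as a hex string."""
    return hashlib.sha256(content).hexdigest()


def make_claim(content: bytes, depositor: str, when=None) -> dict:
    """Build a claim record: who registered which digest, and when (assumed layout)."""
    when = when or datetime.now(timezone.utc)
    return {
        "depositor": depositor,
        "digest": fingerprint(content),
        "registered": when.isoformat(),
    }


def verify_claim(content: bytes, claim: dict) -> bool:
    """Check that the content presented later matches the registered digest."""
    return fingerprint(content) == claim["digest"]
```

Registering only the digest would let a researcher assert priority without disclosing the data itself until publication.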
Repository Search / Browse
Crucial record metadata for Data Management,
Search/Browse and Discovery:
Date; Instrument; Location; Compound Name; Experiment
Type; Researcher
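The six metadata fields listed above can be pictured as one flat record that is filtered by exact match. This is a minimal sketch: the `Record` class, the `search` function and the sample values used with them are hypothetical, not the repository's actual schema.

```python
import datetime
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Record:
    """The six fields named on the slide (field names are assumptions)."""
    date: datetime.date
    instrument: str
    location: str
    compound_name: str
    experiment_type: str
    researcher: str


def search(records: Iterable[Record], **criteria) -> List[Record]:
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, key) == value for key, value in criteria.items())
    ]
```

A call such as `search(records, experiment_type="NMR", researcher="A. Chemist")` narrows the list on two fields at once; browsing is the same operation with a single criterion.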
Search / Browse
Report Generation
• Too cumbersome and inflexible (revisit?)
• Requirement for ‘familiar’ software
• Suitable for informatics, but not routine reporting
Report Generation
• Ability to import repository data into software and easily edit
• Need to bring the repository to the researcher’s ‘desktop’
• Demonstrator employed Sharepoint to store templates
• (functionality to be incorporated into repository software?)
• Does this really bring anything new to reporting research?
Analysis & Discussion: Blogging Experiments
A repository can…
• Allow one to put, store and
get digital objects
• Provide minimal search
and browse functions
• NOT provide the
presentation and discussion
functions essential to
working up a scientific study
• Social networking tools
and approaches can
provide a way…
Getting data into Blogs
• Developing relationship between Blog and R4L repository
• Repository back-end, Blog for sharing data with collaborators
and developing ideas / conclusions based on data
• ‘Live copy’ application only – <SWORD /> development?
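The repository-as-back-end idea above can be sketched as a function that turns a repository record into a short blog entry linking back to the data. Everything here is an assumption made for illustration: the `blog_entry` function, its parameters and the plain-text layout are hypothetical, and the record URL is supplied by the caller rather than derived from any real R4L URL scheme.

```python
def blog_entry(compound, experiment_type, researcher, record_url, notes=""):
    """Render a plain-text blog post that points back at the repository record."""
    lines = [
        f"{experiment_type} data for {compound}",
        f"Deposited by: {researcher}",
        f"Repository record: {record_url}",   # the data stays in the repository
    ]
    if notes:
        # Free-text discussion lives in the blog, not in the repository.
        lines += ["", notes]
    return "\n".join(lines)
```

Keeping only a link in the post means the blog carries the discussion while the repository remains the single store for the data files.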
Enabling Research
• Enables ‘geographically distributed collaborative research’
• Useful approach for sharing ‘failed’ experiments?
Open Notebook Science
Automatic Blogging by Machines
Automatic Logging of Sensor Data
• Timeline visualisation – instant detection of erroneous event
• Assists in analysing inconsistencies in datasets
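Spotting an erroneous event in a logged timeline can be illustrated with a simple median check. The slides do not say how detection was done (the timeline appears to have been inspected visually), so the `flag_erroneous` function, the tolerance parameter and the sample readings below are assumptions chosen only to show the idea.

```python
from statistics import median


def flag_erroneous(readings, tolerance):
    """Return the (timestamp, value) pairs lying further than `tolerance` from the median.

    `readings` is a list of (timestamp, value) tuples, as a sensor log might hold.
    """
    values = [value for _, value in readings]
    centre = median(values)
    return [(t, v) for t, v in readings if abs(v - centre) > tolerance]


# Hypothetical temperature log: one reading is clearly out of line.
log = [(0, 20.1), (1, 20.3), (2, 35.0), (3, 20.2)]
suspect = flag_erroneous(log, tolerance=5.0)
```

The median is used rather than the mean so that a single wild reading does not shift the reference value it is compared against.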
Comments and Annotation
• Chemists need to scribble!
• A picture says a thousand words!
• Need for more advanced Blog
tools / technology
R4L End-to-End Overview
Usability
• Low barrier to use; familiarity; flexibility; quick gain
• A specification/requirements for data repository based
software?
Problems Encountered
• Over-ambitious in trying to influence the attitudes of instrument
manufacturers at such an early stage
• Attitudes of chemists towards changes in laboratory working
procedure
• Attitudes of chemists towards change in the publication
system
• Input from journal publishers before demonstrator / prototype
available
• Blog software restrictive
• Extreme diversity and number of file formats employed for
analytical chemistry
Future Directions
• A useful demonstrator to the practising scientist of the value of a
laboratory repository…
• Further advocacy required in preservation and data
management areas
• Feasibility of a departmental data repository?
• An exemplar for:
• The institution - towards an institutional data preservation
policy?
• Publishers – improved handling of supplementary information
• Instrument manufacturers – will respond to the demands of
their customer base
• Follow on funding: eChemistry; myExperiment
• Best practice in validation and reproducibility of experiments
• Develop relationship between data repository & ‘Blog approach’