eCrystals Federation: Open Repositories for Data-driven Science Dr Liz Lyon, UKOLN, University of Bath, UK Dr Simon Coles, University of Southampton, UK Chemical Informatics Workshop,


eCrystals Federation:
Open Repositories for
Data-driven Science
Dr Liz Lyon, UKOLN, University of Bath, UK
Dr Simon Coles, University of Southampton, UK
Chemical Informatics Workshop, Manchester, March 2008
This work is licensed under a
Creative Commons Licence
Attribution-ShareAlike 3.0
http://creativecommons.org/licenses/by-sa/3.0/
Federation
Themes
1. Context: Institutional data repositories, crystallography exemplar
2. Scale: repository federations
3. Longevity: Digital curation and preservation
4. Integration: Semantic challenges
eBank Project – building the eCrystals Data Repository
Started Sept 2003
Scholarly knowledge cycle context
UKOLN-led interdisciplinary team
ePrints platform @ Southampton
Institutional Repository exemplar
Embedded in workflow
http://ecrystals.chem.soton.ac.uk
Scaling Up Report
Phase 3 findings:
Data policy should reflect lab practice & institutional model
Diverse lab practice
LIMS proprietary formats
Data quality criteria/validation
“Prior publication” problem
We need automated assignment of terms for data discovery
No discipline preservation model
The eCrystals Repository
nλ = 2d sin θ
ePrints.org v3.0
Learned society + subject repository support
Repository Foundations
• Using simple Dublin Core
• Crystal structure
• Title (Systematic IUPAC Name)
• Authors
• Affiliation
• Creation Date
• Additional chemical information through Qualified Dublin Core
• Empirical formula
• International Chemical Identifier (InChI)
• Compound Class & Keywords
• Specifies which ‘datasets’ are present in an entry
• Application Profile http://www.ukoln.ac.uk/projects/ebank-uk/schemas/
• DOI links http://dx.doi.org/10.1594/ecrystals.chem.soton.ac.uk/145
• Rights & Citation http://ecrystals.chem.soton.ac.uk/rights.html
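The record structure listed above can be sketched in code. This is a minimal illustration only: the simple Dublin Core namespace is the standard one, but the `ex:` namespace and its element names (affiliation, empiricalFormula, inchi) are hypothetical placeholders and are not taken from the eBank-UK application profile, which is defined at the schema URL above. The benzene values are sample data, not a real eCrystals entry.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"      # simple Dublin Core element set
EX_NS = "http://example.org/ecrystals-sketch/"  # hypothetical, NOT the eBank-UK profile

ET.register_namespace("dc", DC_NS)
ET.register_namespace("ex", EX_NS)


def build_record(title, authors, affiliation, created, formula, inchi, keywords):
    """Build an illustrative metadata record for one crystal structure."""
    record = ET.Element("record")
    # Core description carried by simple Dublin Core elements.
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    for author in authors:
        ET.SubElement(record, f"{{{DC_NS}}}creator").text = author
    ET.SubElement(record, f"{{{DC_NS}}}date").text = created
    for keyword in keywords:
        ET.SubElement(record, f"{{{DC_NS}}}subject").text = keyword
    # Additional chemical information; element names here are assumptions.
    ET.SubElement(record, f"{{{EX_NS}}}affiliation").text = affiliation
    ET.SubElement(record, f"{{{EX_NS}}}empiricalFormula").text = formula
    ET.SubElement(record, f"{{{EX_NS}}}inchi").text = inchi
    return record


def to_xml(record):
    """Serialise the record to an XML string."""
    return ET.tostring(record, encoding="unicode")


example = build_record(
    title="benzene",
    authors=["A. Depositor", "B. Depositor"],
    affiliation="Example University",
    created="2008-03-01",
    formula="C6H6",
    inchi="InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
    keywords=["aromatic hydrocarbon"],
)
print(to_xml(example))
```

Keeping the generic fields in Dublin Core and the chemistry-specific fields in a separate namespace mirrors the split between simple and qualified metadata described on the slide.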
Federation interoperability & linking services
• Roll-out in 2 phases led by University of Southampton
• Establish Federation policies, application profile, mappings
• Bi-directional links with derived articles in “publisher repositories”, IUCr, Royal Society of Chemistry (RSC), Chemistry Central: scholarly knowledge cycle
• StOReLink project - Test linking options: StORe middleware and CLADDIER
• OAI-ORE Testbed
eChemistry project
Laboratory practice & workflow
• Community standard CIF
• Mixed lab practice – central service facility versus single “staff crystallographer” in department
• Achieve end-to-end workflow
• Challenge of instrument manufacturers with proprietary formats
• “Repository Lite” for smaller lab operations?
X-ray diffractometers
eBank-UK Phase 3 Curation &
Preservation Study:
Sustainability issues
http://www.ukoln.ac.uk/projects/ebank-uk/curation/
Examined four main areas
1. Audit and certification (TRAC, DRAMBORA, NESTOR, ISO International repository audit and certification BOF Group)
2. The Open Archival Information System (OAIS) and Representation Information (RI)
3. eBank-UK application profile and preservation metadata
4. ePrints.org repository platform
Recommendations:
Self-assessment using DRAMBORA
Consider Representation Information
in wider context
Develop preservation strategy
Capture preservation metadata PREMIS
Semantic issues
Crystallographic schema underpins CIF (Crystallographic Information Framework), but is limited to data parameters, e.g. cell_length_a
IUCr Acta Cryst 1992
Limited set of keywords describing methods, properties & applications, compounds, attributes
No established crystallography dictionary or controlled vocabulary to give chemistry context
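To make the point about data parameters concrete, here is a minimal sketch of reading items such as the cell length from CIF-style text. In CIF files data names begin with an underscore (`_cell_length_a`) and numeric values may carry a standard uncertainty in parentheses. The sketch only handles simple single-line `_name value` pairs; loops, quoted strings and multi-line text fields are deliberately ignored, so a dedicated CIF parser would be needed for real files. The sample values are invented for illustration.

```python
import re

# A number with an optional standard uncertainty in parentheses, e.g. 10.123(4)
_NUMERIC = re.compile(r"^([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)(?:\((\d+)\))?$")


def parse_cif_items(text):
    """Collect simple single-line `_name value` pairs from CIF-like text."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip data_ headers, comments, loop bodies, blank lines
        parts = line.split(None, 1)
        if len(parts) == 2:  # names alone on a line (e.g. inside loop_) are skipped
            items[parts[0]] = parts[1].strip()
    return items


def numeric_value(raw):
    """Return the numeric part of a CIF value, dropping any uncertainty."""
    match = _NUMERIC.match(raw)
    if match is None:
        raise ValueError(f"not a plain CIF number: {raw!r}")
    return float(match.group(1))


SAMPLE = """\
data_example
_cell_length_a     10.123(4)
_cell_length_b     11.456(5)
_cell_angle_alpha  90
"""

items = parse_cif_items(SAMPLE)
print(numeric_value(items["_cell_length_a"]))
```

The parser recovers numbers but nothing about what the compound is or why it was measured, which is the gap in chemistry context that the slide describes.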
What do we want to do?
• Support depositors’ keyword/term assignment
• Facilitate and improve automated indexing
• Support advanced search / browse
• Allow metadata validation & enhancement
• Apply across a heterogeneous Federation
• Cross search, cross browse functionality
• Link data to all associated digital objects
• Develop domain semantics / vocabulary
• Use domain-specific authority files
• Mine to “discover” rather than “find”
• Achieve full inter-disciplinary integration
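As one illustration of supporting term assignment and metadata validation, the sketch below maps depositor-supplied keywords onto preferred terms and reports anything it cannot match. The vocabulary shown is entirely hypothetical (no established crystallography vocabulary exists, as noted earlier), and exact matching after case and whitespace normalisation is an assumed, deliberately simple strategy.

```python
# Hypothetical mini vocabulary: preferred term -> accepted variant spellings.
VOCABULARY = {
    "single-crystal X-ray diffraction": {"scxrd", "single crystal xrd"},
    "polymorphism": {"polymorph", "polymorphs"},
}


def build_lookup(vocabulary):
    """Map every lower-cased variant (and the preferred term) to the preferred term."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


def assign_terms(keywords, vocabulary):
    """Split keywords into matched preferred terms and unmatched originals."""
    lookup = build_lookup(vocabulary)
    matched, unmatched = [], []
    for keyword in keywords:
        key = " ".join(keyword.lower().split())  # normalise case and whitespace
        term = lookup.get(key)
        if term is None:
            unmatched.append(keyword)  # candidate for review or vocabulary growth
        elif term not in matched:
            matched.append(term)       # avoid duplicate preferred terms
    return matched, unmatched


matched, unmatched = assign_terms(
    ["SCXRD", "Polymorphs", "  polymorph ", "magnetism"], VOCABULARY
)
print(matched)
print(unmatched)
```

Unmatched keywords are returned rather than discarded, so they could feed back into curation of the vocabulary across the Federation.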
Some (semantic) issues…
• How are terms assigned?
• Informal tags and/or structured KOS?
• How is a vocabulary curated and maintained?
• Can a vocabulary be transformed into a (Semantic Web related understanding) ontology?
• Disambiguation, acronyms, IUPAC names
• Persistent identification for data citation
• Granularity of data citation
• Data (and metadata) quality, provenance, validation
• Embedding within complex workflows
• Use collaborative social approaches?
• Community adoption: becomes part of the culture
Questions?
Slides will be available at:
http://wiki.ecrystals.chem.soton.ac.uk/index.php
http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/presentations.html