Transcript Document

UK Electronic Information Group
Image Management in Bio- and Environmental
Sciences: New Directions
John Rylands Library, University of Manchester
Thursday 31st May 2007
Research images as
first class publication objects
David Shotton
Image BioInformatics Research Group
Oxford e-Research Centre and
Department of Zoology
University of Oxford, UK
e-mail: [email protected]
© David Shotton, 2007
Outline

The nature of scientific data and image publication
 What data do we actually publish?

Relationship between publications and databases
 Improving journal authoring
 Data lenses, semantic lenses and live journal content

Integrating distributed data
 Data webs
 ImageWeb
 ImageBLAST

Preserving biological research images
 The ImageStore Project
Characteristics of biological research data
Bottom-up data flow, lacking central control
Very large research community with diverse research topics
Highly distributed research activities and publication structures
Research data heterogeneous and largely unstructured,
often with little by way of semantic mark-up
An open world, where change is as ubiquitous as consensus is elusive
Where to store research data?

Research results may represent ‘universal
truths’, e.g. the sequence of a particular gene
 These form bounded data sets
 The data need only be discovered once
 Such information is typically published in a
large global bioinformatics database

Research data can also be ‘particulars’ rather
than ‘universals’, for example individual assay
results, microscopy images and wildlife photos
 These data form unbounded data sets
 Data collection will never be complete
 Such image information is not (yet) widely available online

It is not appropriate to submit such data to
centralized global databases
 The data are too heterogeneous
 Such activities would not scale
What data do we publish?


A scientific paper does not just report scientific observations
Rather, as Anita de Waard of Elsevier has pointed out, a scientific
paper is an exercise in rhetoric, designed to convince readers of the
truth of a particular scientific hypothesis or belief
 The goal of the article is not to state facts, but rather to convince
 Facts are selected to support the argument, and are embedded in a
rhetorical structure with the purpose of conviction
“These observations support
theories that defects of the
muscle plasma membrane are
important for dystrophic
pathogenesis.”
. . . but what about the original research data?


While selected findings that support hypotheses appear in research
articles, the majority of original research data are never published
Historically, in the paper age, there was no easy method for doing this
 Journals had limited space
 Other publication avenues were not available

Now, in this digital age, ‘supplementary information’ can be put on-line
 However, this facility is not widely used
 Furthermore, such supplementary data are usually poorly structured,
with insufficient metadata, and may not be discoverable by external
search engines
 Depositing data as supplementary information may thus be
consigning them to costly data graveyards, from which resurrection
is difficult
How might we improve on this situation?
(my take home messages !!)


We need to start treating experimental research
data sets as first class publication objects, of
equal value to the journal papers based upon them
We need to work towards better interoperability
between papers and data
 First, two examples of work in progress
 Then my suggestions for new developments
Convergence between papers and databases

Philip Bourne, Editor-in-Chief of PLoS Computational Biology and Co-director of the Protein Data Bank, wrote a stimulating paper:
PLoS Comp. Biol. 2005 1(3) e34


In this, he contends that the distinction between an on-line paper and a
database is diminishing
He calls for “seamless integration” between papers reporting results and
the data used to compute those results
Similar Processes Lead to Similar Resources
Journal publication: author submission via the Web → syntax checking → review by scientists and editors → corrections by author → publish (Web accessible)
Database deposition: depositor submission via the Web → syntax checking → review by annotators → corrections by depositor → release (Web accessible)
Credit: Philip Bourne
My critique of Philip Bourne’s ideas


I agree with his central analysis of the processes involved. However, this
similarity of process should not blind us to essential differences in purpose
We must maintain a clear distinction between the journal publication
 peer reviewed
 a dated record of the authors’ view at the time of publication
 while errata are permitted, the original version should be immutable

and the research database
 should contain the most reliable up-to-date information
 data quality is initially the responsibility of the depositor
 errors subsequently discovered should be corrected by the curator

Thus “seamless integration” is not desirable
 One needs to approach publications and data sets with different presuppositional spectacles: rhetorical for the former, analytical for the latter
 Researchers really want the “seams” to be very clear, not covered over
Improving the authoring process

Richard O’Beirne of Oxford Journals has stressed that, for publishers to
enable their publications to be better used in the digital world, they need
to expose metadata of a higher granularity, identifying component pieces of
papers
 For images, this means figures and their legends




Such mark-up is typically present during the production phases of a paper’s
publication, usually in the form of XML, but is ‘lost’ upon publication as PDF
Such metadata needs exposure to facilitate interoperability with data
resources
Anita de Waard of the Elsevier Advanced Technology Group is currently
developing a system, in conjunction with the editors and authors of Cell,
whereby the authors are enabled to create such mark-up while writing the
paper
What we need is an easy-to-use plug-in for MS-Word, accepted by all
leading publishers, for the creation of suitable text mark-up at the time of
authoring
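To make this concrete, here is a minimal, hypothetical sketch in Python of what exposing figure-level metadata might involve: pulling each figure's label and legend out of a journal-production XML file so that they can be published as stand-alone, harvestable records. The JATS-style fig, label and caption element names and the file name are assumptions for illustration; real publisher DTDs differ.

# Sketch: extract figure labels and legends from JATS-style production XML
# so that they can be exposed as stand-alone, harvestable metadata records.
# The element names assume a JATS-like schema; real publisher DTDs vary.
import xml.etree.ElementTree as ET

def extract_figures(xml_path):
    """Yield (figure label, legend text) pairs from a JATS-like article file."""
    tree = ET.parse(xml_path)
    for fig in tree.iter("fig"):
        label = fig.findtext("label", default="")
        caption = fig.find("caption")
        legend = " ".join(caption.itertext()).strip() if caption is not None else ""
        yield label, legend

if __name__ == "__main__":
    # "article.xml" is a hypothetical file name used only for illustration
    for label, legend in extract_figures("article.xml"):
        print(label, "-", legend[:80])

Records extracted in this way could then be published alongside the paper, rather than being discarded when the article is rendered as PDF.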
Live (or at least lively) journal content

The norm that the online version of a journal article is a PDF file is
antithetical to the spirit of the Web, and ignores its great potential
 PDF is an electronic embodiment of a static printed page

Rather, what we need are on-line journals that include tools to deliver
renderable interactive views of otherwise static images, and interpretive
‘data lenses’ or ‘semantic lenses’ over published data, thereby enabling new
levels of reader comprehension
 Semantic lenses specifically provide viewpoints onto RDF data,
presenting users with information from selected semantic perspectives

This will require Web delivery of information from multiple resources,
involving proper integration of the published paper with research data
archives
A data lens showing tsunami damage
A data lens applying a high-pass filter
A data lens for image analysis

Electron micrograph showing cross sections of microvilli on the surface of intestinal epithelial cells
A live semantic lens demonstration
http://www.cc.gatech.edu/gvu/ui/sub_arctic/sub_arctic/test/sem_lens_test.html
An example from a recent issue of Biochemistry
Report of a crystallographic structure
Figure 1 from the on-line version of the paper, showing the protein structure
The PDB entry for Polo-like Kinase 1
(PDB ID 2OU7)
Interactive Jmol representation of Polo-like Kinase 1
http://molvis.sdsc.edu/fgij/fg.htm?mol=2ou7
Another example, from The Plant Cell
All the images in the paper should be clickable videos !
Fusiform bodies within the ER network of Arabidopsis stem cortical cells
http://www.brookes.ac.uk/schools/lifesci/research/molcell/hawes/gfpmoviepage.htm
Integrating distributed data


The problems of achieving semantic interoperability between distributed
heterogeneous archives of digital data are well known
Previous approaches to solving the problem have involved
 distributed query processing,
 repository federation, or
 portals
All relied on mainstream technologies such as Z39.50, XML and Web Services, some of which might now be considered dated or heavyweight
None has brought to the problems of data integration the Semantic Web and Web 2.0 approaches that I now wish to describe
Web and Semantic Web standards and tools
We favour the World Wide Web Consortium standards:

RDF as the standard format for sharable metadata

SPARQL as the universal query language for RDF


Software such as D2R Server for abstracting RDF from relational
databases in response to SPARQL queries
OWL-DL as the standard web ontology language;
and for software development and integration:

use of agile programming techniques

Ruby or Python to provide a lightweight development environment

loose coupling between the Model, View and Controller software components, based on a simple RESTful approach to component integration (Fielding 2000, Representational State Transfer)
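By way of illustration, the following is a minimal sketch of how a Python client might pose a SPARQL query to a data web resource using the SPARQLWrapper library. The endpoint URL, the 'imgweb' vocabulary and its property names are hypothetical placeholders, not part of any existing service.

# Minimal sketch: querying a hypothetical SPARQL endpoint for image metadata.
# The endpoint URL and the 'imgweb' vocabulary are illustrative placeholders only.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/imageweb/sparql")  # hypothetical endpoint
endpoint.setQuery("""
    PREFIX imgweb: <http://example.org/ontology/imageweb#>
    PREFIX dc:     <http://purl.org/dc/elements/1.1/>
    SELECT ?image ?title ?organism
    WHERE {
        ?image a imgweb:ResearchImage ;
               dc:title ?title ;
               imgweb:depictsOrganism ?organism .
        FILTER regex(str(?organism), "Trypanosoma", "i")
    }
    LIMIT 20
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["image"]["value"], "-", binding["title"]["value"])

The same query could be issued against a D2R Server sitting in front of a provider's existing relational database, which is precisely why we favour SPARQL as the universal query interface.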
publication@source


With the advent of the Semantic Web, the possibility exists to extend
the Web paradigm that anyone can publish to include data publication
We are entering the age of distributed data publication
 Most research data will in future not be submitted to centralized
databases
 Rather, data will be published locally by individual research groups, by
institutional repositories and by journal publishers, complete with
semantically rich metadata that can be harvested and indexed



The database gives way to a distributed ‘data space’
The trick then is to create mechanisms whereby such heterogeneous
distributed data can be integrated and made cross searchable
One mechanism we are now exploring is the data web
Data integration – the lightweight data web approach
The data web is a novel concept for digital information integration involving semantic web technologies

The data are held locally, with metadata published on local Web servers

Separately for each data web serving a particular knowledge domain,
automated lightweight software tools will be used to integrate the
distributed data
 separate metadata schemas will be mapped to a core ontology
 instance metadata describing the distributed data will be made available
for harvesting as RDF by creating a SPARQL endpoint at each resource


This overcomes syntactic and semantic differences between data providers
Resources can then be discovered by distributed SPARQL queries across
the data web
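A minimal sketch of the harvesting and mapping step, written in Python with the rdflib library: RDF metadata from two hypothetical providers is fetched, their local predicates are translated onto an equally hypothetical core ontology, and a single SPARQL query is then run over the merged graph. A real data web would use a more principled schema mapping than the simple one-to-one predicate translation shown here.

# Sketch: harvest RDF metadata from two providers and map their local
# predicates onto a single (hypothetical) core ontology before querying.
from rdflib import Graph, Namespace

CORE = Namespace("http://example.org/core#")            # hypothetical core ontology
PROV_A = Namespace("http://providerA.example.org/terms#")
PROV_B = Namespace("http://providerB.example.org/schema/")

# one-to-one predicate mapping from each local schema to the core ontology
PREDICATE_MAP = {
    PROV_A.subjectGene: CORE.depictsGene,
    PROV_B.gene:        CORE.depictsGene,
    PROV_A.caption:     CORE.caption,
    PROV_B.legend:      CORE.caption,
}

merged = Graph()
for url in ["http://providerA.example.org/images.rdf",
            "http://providerB.example.org/images.rdf"]:   # hypothetical RDF harvests
    local = Graph()
    local.parse(url)                                       # fetch and parse the provider's RDF
    for s, p, o in local:
        merged.add((s, PREDICATE_MAP.get(p, p), o))        # translate known predicates

# a single query now spans both providers' images
for row in merged.query("""
        PREFIX core: <http://example.org/core#>
        SELECT ?image ?caption
        WHERE { ?image core:depictsGene ?g ; core:caption ?caption }"""):
    print(row.image, row.caption)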
Data web services
Web 2.0 aspects of data webs

Use of the Web as the platform

Small pieces, loosely coupled

Programmatic access, giving ‘hackability’ and the right to remix

Tagging:
 Data webs are predicated on a formal core ontology, but we see vital
roles for user annotations to supplement formal metadata

Trusting our users:
 Data providers control their own primary image data and metadata
 Data consumers are free to use the data web service in whatever
way they think fit, including building secondary services, and
providing annotations

The Long Tail:
 Data webs enable discovery of ‘long tails’ of hard-to-find data – this
is particularly true for research particulars rather than research
universals
The ImageWeb Project



Image webs are data webs for research images
We wish to integrate and make cross-searchable the research images held by publishers, research organizations, museums and institutional repositories, which currently sit in isolated data silos
We wish these information resources
 to become a more integral part of day-to-day research, and
 published images to be more fully used than at present, including combination and re-use for meta-research

The same images might be accessed by more than one data web
 For example, cellular images might be accessed by one data web
illustrating confocal microscopy techniques, and alternatively by
another data web concerned with cancer therapy
ImageBLAST – an image web secondary service




I originally imagined that ImageWeb users would directly query the ImageWeb, and from there be led to relevant images
However, I now believe that it might be even more useful for a user to be
able to click on an image within an online paper she is reading, and have
semantically related images from other sources presented as a ranked list
This service would resemble the basic bioinformatics BLAST service for
finding related biological sequences (http://www.ncbi.nlm.nih.gov/BLAST/)
This ‘ImageBLAST’ service would locate images that resemble the original not in visual appearance but in subject matter, images that are about the same thing
 e.g. the same gene expressed in a different organism
 or the same biological concept demonstrated in a different system
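ImageBLAST does not yet exist as software; what follows is only a sketch of the ranking idea in Python, assuming (hypothetically) that each image in the data web carries a set of subject annotations such as gene, organism or concept identifiers. Images that share more annotations with the query image rank higher.

# Sketch of the ImageBLAST ranking idea: given the subject annotations of a
# query image, rank other images by how many annotations they share with it.
# The annotation data below are invented for illustration.

def image_blast(query_annotations, candidate_images):
    """Return (image_id, score) pairs, best matches first."""
    ranked = []
    for image_id, annotations in candidate_images.items():
        score = len(query_annotations & annotations)   # shared subject terms
        if score:
            ranked.append((image_id, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Hypothetical annotations: URIs or controlled-vocabulary terms
query = {"gene:EGFP", "concept:stem-cell-transplantation", "organism:Mus musculus"}
candidates = {
    "img:1001": {"gene:EGFP", "organism:Danio rerio"},
    "img:1002": {"concept:stem-cell-transplantation", "organism:Mus musculus",
                 "gene:EGFP"},
    "img:1003": {"concept:confocal-microscopy"},
}

for image_id, score in image_blast(query, candidates):
    print(image_id, "shares", score, "annotations with the query image")

In practice the annotations would be drawn from the image web's core ontology rather than from literal strings, so that "the same gene in a different organism" can be recognized through ontological relationships rather than exact matches.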
An example – transplanted GFP-labelled stem cells
Related images
Fig. 2. (A and B) Immunohistochemical staining for EGFP on livers of (A) Z/EG x Cre–into-Cre and (B)
Z/EG-into-Cre transplants. (C) Immunofluorescence staining with cytokeratin (green) and Y chromosome
FISH (red) in the same Z/EG-into-Cre transplant, showing the presence of a donor-derived Y-positive
hepatocyte (arrow). (D and E) Immunofluorescence staining of (D) untransplanted positive control (Z/EG
x Cre F1) and (E) experimental (Z/EG into Cre) epidermal sections with antibodies against EGFP (green)
and cytokeratin AE1/AE3 (red). (F) Immunofluorescence staining with cytokeratin AE1/AE3 (red) and Y
chromosome FISH (green), showing the presence of a donor-derived Y-positive keratinocyte (arrow) in
the epidermis of a Z/EG-into-Cre transplant recipient.
How might a data web improve on existing approaches?
It permits access to database information hidden in
the ‘Deep Web’
It involves specific targeting to a particular
knowledge domain, thus achieving a significantly
higher signal-to-noise ratio
It provides integration of information with
ontological underpinning, semantic coherence, and
truth propagation
It permits programmatic access, enabling secondary
services to be built on top of one or more data webs
Our present objective
DW-40 : data webs for frictionless interoperability
between scientific publications and research datasets
References
In addition to the papers shown in my presentation itself, please find further details in:
Presentations by Philip Bourne, Anita de Waard, David Karger and David Shotton given at the
Research Information Network workshop “Data Webs: new visions for research data on the Web”,
28 June 2006, available at http://www.rin.ac.uk/data-webs.
Erika Darling, Chris Newbern and Nikhil Kalghatgi (Mitre Corporation IR&D) (2005) Reducing
visual clutter with semantic lenses. ESRI User Conference July 2005.
http://www.themitrecorporation.org/tech/nlvis/pdf/esri_user_conference.pdf.
Anita de Waard (2006) Semantic authoring for scientific publication. Downloadable from
www.cs.uu.nl/people/anita/talks/deWaardSWDays0410.pdf.
Anita de Waard and H. van Oostendorp (2005). Development of a semantic structure for scientific
articles. Presented at Werkgemeenschap Informatiewetenschap, Antwerp, the Netherlands.
http://www.cs.uu.nl/people/anita/papers/deWvanOWIG2710.pdf.
Anita de Waard, Leen Breure, Joost G. Kircz and Herre van Oostendorp (2006) Modeling rhetoric
in scientific publications. Presented at INSCIT 2006.
http://www.instac.es/inscit2006/papers/pdf/133.pdf.
Roy Thomas Fielding (2000) Architectural styles and the design of network-based software
architectures. Chapter 5: Representational state transfer (REST). Ph. D. thesis. Department of
Information and Computer Science, University of California, Irvine.
http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
Requirements analyses for building a data web for images:
http://imageweb.zoo.ox.ac.uk/wiki/index.php/Defining_Image_Access.
Details of the ImageWeb Consortium:
http://imageweb.zoo.ox.ac.uk/wiki/index.php/BioImageWeb_Consortium.
The Internet and the flow of information

What struck me after compiling that list is
that it did not contain a single journal
publication!
 Why is this?

“The Net interprets censorship as damage and routes around it”
 quote by John Gilmore


Anything that impedes the free flow of information, including journals, will suffer the same fate
Unless journals adapt to provide the quality and depth of information that
users require, they will become increasingly marginalized, as users go
elsewhere on the Web to find it
The ImageStore Project



ImageStore: Curation requirements for legacy analogue and ‘born
digital’ scientific image data
Purpose: To research the requirements for effective digital curation
and re-use of scientific research images from the biological domain
Part of the Digital Curation Centre’s JISC-funded SCARP Project
 To adopt a discipline-specific approach to problems of sharing, curation, re-use and preservation of data
 To determine curation needs by embedding curation staff within
research teams

To give the ImageStore project specific focus, we are investigating
the curation requirements for four distinct types of images, two sets
of historical analogue records and two sets of modern ‘born digital’
images
The history of molecular and cell biology




Molecular and cell biology began as research disciplines in the 1950s,
when the combination of findings from biochemistry, biophysics and
electron microscopy gave us the DNA double helix and the first visions
of cell ultrastructure and function
Many of the pioneers of molecular and cell biology have now retired or
are close to retirement
Their analogue data constitute our scientific cultural heritage, yet most
of it will almost certainly be lost if nothing is done soon to curate and
archive it
The cost of having to repeat these research observations would far
outweigh the cost of preserving the original data
How much data should we save?

It is now technically possible to store as much research data as we
wish
 But how much is enough?
 When is it right not to save data?


For electron microscopy, a good rule of thumb is that for every 1000
EM images taken, 100 will be good, 10 will be superb, and 1 or 2 will
make it into print, as figures in a scientific paper
While we should be happy to discard the 900 poor negatives, what we
should do with the 98 unpublished good images is a pressing question
Electron microscopy of trypanosomes


Trypanosomes are the causative agents of sleeping sickness
Hundreds of electron micrograph negatives – glass photographic plates –
taken over the last 25 years by Professor Keith Gull (Dunn School of
Pathology), during his life-long studies of microtubules in trypanosomes
Tsetse fly
From Broadhead et al.,
Flagellar motility is
required for the viability
of the bloodstream
trypanosome.
Nature 440, 224-227
(9 March 2006)
Wildlife videos



Wildlife videos of British and African mammals,
including badgers and Ethiopian wolves
Created by Professor David Macdonald’s
Wildlife Conservation Research Unit
(Department of Zoology) over the last 20 years
There are hundreds of analogue videotapes in a
variety of formats
Haydon et al.
Low-coverage vaccination
strategies for the conservation
of endangered species.
Nature 443, 692-695
(12th October 2006)
Computer simulations of the human heart



These models, created by Professor Denis Noble and colleagues (Department of Physiology), permit understanding of heart disease
They form part of the OeRC Integrative Biology e-Science Project
Both the computational models and the resulting digital videos recording the simulations are important artefacts that are shared with overseas collaborators and that require long-term curation
In situ images of gene expression

In situ images revealing the time and place of gene expression
in the testes of the fruit fly, Drosophila melanogaster, are
important for understanding male sterility in humans


These images are currently being acquired by my colleague Dr
Helen White-Cooper (Department of Zoology), as part of a
BBSRC project on which I am co-investigator
They are born-digital, true-colour light microscopy images
DNA array images that quantify gene expression also form part of the data
(Figure: image panels labelled aly, cyclinB and Mst87F; photograph of a male fruit fly)
The end
Acknowledgement:
I am indebted to Graham Klyne, with whom my data web ideas have been developed