EMBRACE and EMBOSS
Integrating everything and
Integrated by everything
Peter Rice, EBI ([email protected])
June 2006
Funded by:
EMBRACE and EMBOSS
EMBRACE is an EC-funded Network of Excellence with 18 partners, developing an
integrated set of services for the major bioinformatics data resources and
analysis tools.
The EMB name was selected after two previous names were rejected. It stands for
"European Model for Bioinformatics Research And Community Education"
and has no connection with EMBL.
EMBOSS is now 10 years old, with the project team hosted by EMBL-EBI,
providing open source libraries and over 200 applications for sequence analysis.
EMBOSS has its roots at EMBL Heidelberg, but started at the Sanger Centre and
the UK EMBnet node. The EMB name reflects the EMBL and EMBnet origins as
"European Molecular Biology Open Software Suite".
EMBRACE
Network of Excellence - 18 partners with data resources, analysis tools, expertise
in grid technology and experimental biologists.
Graham Cameron, Peter Rice, Alan Bleasby — EBI, Cambridge, GB
Toby Gibson — EMBL, Heidelberg, DE
Andreas Gisel — Institute of Biomedical Technologies, Section Bari, CNR, IT
Teresa Attwood — University of Manchester, GB
Marco Pagni—Swiss Institute of Bioinformatics, CH
Erik Bongcam-Rudloff — LCB/BMC, Uppsala, SE
Vincent Breton — CNRS, Clermont Ferrand, FR
Søren Brunak — CBS, Lyngby, DK
José-María Carazo — CNB, Madrid, ES
Arne Elofsson — DBB, Stockholm, SE
Daniel Kahn — INRA/CNRS, Toulouse, FR
Ralf Herwig — MPI für Molekulare Genetik, Berlin, DE
Eija Korpelainen — CSC, Espoo, FI
Christine Orengo — University College London, GB
Yitzhak Pilpel — Weizmann Institute of Science, IL
Gert Vriend — CMBI, Nijmegen, NL
Alfonso Valencia — INTA-CAB, Madrid, ES
Christian Bryne — University of Bergen, NO
EMBRACE Overview
This kind of programming is hard to do.
EMBRACE aims to make it easier, and
within the reach of experimental
biologists.
To do this, we need an interoperable set of
services and clients that can both find and
make use of them.
EMBRACE aims to enable ...
•a scientist to invoke the latest and best version of a given program
without any concern for its physical location
•the program to find the most up-to-date data without help from the user
•workflows to automatically take advantage of whatever compute power
is available
•workflows to deliver results in a way which any user can understand
•the scientist to follow connections to other relevant data and tools using
all the straightforward idioms of web browsing and hyperlinks.
EMBRACE: Interconnectivity
[Diagram with layers labelled: User interface, Application, Application interface]
EMBRACE: Approaches
•Defining an application interface
•Design from the view of the user/application
•Browser example
•User provides a query and a data type
•Generate a list of results by data resource
•Expand and browse the list, following links
•Select some or all as input to analysis tools
•Requires human-readable definitions
•Automation
•A similar example, but with a program selecting and launching the
analysis
•Requires machine-readable definitions
EMBRACE Data Content
DNA sequence information
Protein sequence information
Genome annotation
Macromolecular Structure Data
Expression information
Literature
Orthologs
Untranslated regions
Protein Families
Alignments
Protein/protein-associations
Structural domains
Gene3D
ORFandDB
SNPs in regulatory regions
3D Electron Microscopy data
EMBRACE Analysis Tools
EMBOSS
DNA sequence analysis
Protein sequence analysis
Pattern matching
Genome annotation
Expert systems
Hidden Markov Models
Homology searches
Phylogenetic analysis
Protein structure analysis
Protein structure comparison
Protein domain mapping
Microarrays and gene expression
Bioinformatics workflows
Bioinformatics tool environments
Protein structure prediction
Electron microscopy
Electron microscope tomography
Systems biology modelling
Text mining
EMBRACEgrid
[Diagram comparing web services and grid services in the "information world" and the "infrastructure world", with cells marked OK, KO or ??; the cell assignments are not recoverable from the transcript]
Requires:
• Data management
• Data replication
• Service discovery
• Computing
Issues noted on the slide:
• Lack of infrastructure providing low-level services
• Standards still evolving, and implementations lagging behind
• Instability and lack of robustness
EMBRACE: Data Content Services
•Promised deliverables are prototypes
•Webservice technology
•Content provided by EBI and EMBL Heidelberg
•Access to:
•Nucleotide sequence data resources
•Protein sequence data resources
•Protein motif resources
•Technology choices kept flexible
•SOAP webservices from EBI
•BioMart from EBI
•Existing services from other partners
EMBRACE: Analysis Tools Services
•Promised deliverables are prototypes
•Webservice technology
•Content provided by EBI
•Access to:
•Sequence analysis tools (EMBOSS etc.)
•Protein structure analysis tools (EMBOSS/EMBASSY etc.)
•Technology choices kept flexible
•SOAP webservices
•SOAPlab project (EBI/MyGrid)
•Life Science Analysis Engine standard (OMG)
•Integration also implies
•Tools will access data resources via EMBRACE interfaces
EMBRACE: Technology Choice
•Promised deliverable is a survey of webservice and grid technologies
•Will be made publicly available
•To cover:
•European Grids and Bioinformatics (EGEE etc.)
•Webservice standards
•Grid service standards
•Current standards
•Emerging standards
•Recommendations on technology adoption
•Recommendations on further technology watch
•Technology test cases
•Designed to demonstrate technology
•Designed to show improvements in technology
•Designed to highlight problems
EMBRACE: Test Cases
•EMBRACE is driven by biological test cases:
•4 initial test cases in the proposal
•Workshop (Uppsala, 2005) defined new test cases
•Partners illustrating use of their content/tool resources
•Test cases described in detail
•Template adopted from BioMOBY
•Implement template solutions
•Identify missing components
•Set priorities
•... and fill in the gaps
EMBRACE: Outreach
•First workshops have been internal (inreach)
•In 2006, workshops will be mixed with outreach
•EMBRACE is aimed at skilled bioinformaticians
•Need to address needs of biological researchers
•EMBRACE provides a programming interface to services
•Biologists need a simple "browser"
•EMBRACE will need a simple interface to demonstrate utility
•Example interfaces:
•Taverna (EBI/MyGrid/OMII-UK)
•Other workflow systems
•Simple program examples
•Simple script examples
•"The Big Red Button"
EMBRACE Year Two
•Prototype content services to become standard
•Prototype tool services to become standard
•Further prototypes beyond sequence data
•Established technology choice
•Well documented test cases
•Good links to biological research community
•Selected collaborators
•Willing to explore emerging technologies
•Biological (and practical) use cases
EMBOSS: History
• EMBOSS started in March 1996
• First requirements based on a list of long-standing problems in existing commercial
software (GCG), and the need for public source code
• First "ajax" library written August 1996
• 30 potential developer/user sites identified November 1996 (EMBnet Helsinki)
• Wellcome Trust proposal February 1997 (Sanger, HGMP and EBI)
• Accepted August 1997
• Project started November 1997
• EMBOSS 1.0.0 released on 15th July 2000
• EMBOSS 2.0.0 released on 15th July 2002
• EMBOSS 3.0.0 released on 15th July 2005
• EMBOSS 4.0.0 will be released on 15th July 2006
Original Target Users
Each of the following groups had their own special needs which
EMBOSS aimed to satisfy:
• Sanger Centre genomic sequencing and analysis groups
• RFCGR/HGMP registered academic users (about 10,000)
• EMBnet service providers in 30+ other countries with over
30,000 users
• Academic users everywhere
• Pharmaceutical and biotechnology industry
• Bioinformatics developers
Seqret
Seqret is a very simple application
• It reads a sequence USA, a Uniform Sequence Address (in any format, from anywhere)
• It writes a sequence USA (in any format)
If you tell it the sequence has feature annotation:
• It reads the features (in any format)
• It writes the features (in any format)
Seqret has 13 lines of code
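The source code seqret.c
#include "emboss.h"
int main(int argc, char **argv) {
    AjPSeqall seqall;                       /* input sequence stream */
    AjPSeqout outseq;                       /* output sequence stream */
    AjPSeq seq = NULL;                      /* sequence currently being copied */
    embInit("seqret", argc, argv);          /* process the command line for seqret */
    seqall = ajAcdGetSeqall ("sequence");   /* input USA from the "sequence" qualifier */
    outseq = ajAcdGetSeqout ("seqout");     /* output USA from the "seqout" qualifier */
    while (ajSeqallNext (seqall, &seq))     /* read each input sequence in turn */
        ajSeqWrite (outseq, seq);           /* write it in the requested format */
    ajSeqWriteClose (outseq);               /* finish and close the output */
    ajExit();                               /* clean up and exit */
}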
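To make the USA idea more concrete, here is a minimal sketch of splitting an address such as embl:paamir into its parts. It is not the EMBOSS parser: the names UsaParts and usa_split are hypothetical, the buffer sizes are arbitrary, and only the simplified form "format::source:entry" is assumed, with the "format::" prefix and the ":entry" suffix both optional. The real USA syntax covers more cases than this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Parts of a simplified USA such as "embl:paamir" or "fasta::myfile.seq".
   Hypothetical type for illustration; not part of the EMBOSS libraries. */
typedef struct {
    char format[32];   /* empty if no "format::" prefix was given */
    char source[128];  /* database name or file name */
    char entry[128];   /* empty if no ":entry" suffix was given */
} UsaParts;

/* Copy at most dstsize-1 bytes of src[0..len) into dst and terminate it. */
static void copy_span(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split "format::source:entry" into its parts.
   Returns 0 on success, -1 if usa is NULL or empty. */
int usa_split(const char *usa, UsaParts *out)
{
    const char *rest;
    const char *sep;

    assert(out != NULL);
    if (usa == NULL || *usa == '\0')
        return -1;

    out->format[0] = '\0';
    out->source[0] = '\0';
    out->entry[0] = '\0';

    /* Optional "format::" prefix. */
    rest = usa;
    sep = strstr(usa, "::");
    if (sep != NULL) {
        copy_span(out->format, sizeof out->format, usa, (size_t)(sep - usa));
        rest = sep + 2;
    }

    /* Optional ":entry" suffix after the database or file name. */
    sep = strchr(rest, ':');
    if (sep != NULL) {
        copy_span(out->source, sizeof out->source, rest, (size_t)(sep - rest));
        copy_span(out->entry, sizeof out->entry, sep + 1, strlen(sep + 1));
    } else {
        copy_span(out->source, sizeof out->source, rest, strlen(rest));
    }
    return 0;
}
```

Calling usa_split("embl:paamir", &parts) leaves "embl" in parts.source and "paamir" in parts.entry, with parts.format empty. The sketch treats every single colon as a database/entry separator, so it would mishandle file names that themselves contain a colon.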
EMBOSS Quality Control
• Nightly build with no compiler warnings
• 2,000 test runs (including expected fail conditions)
• 150 valgrind memory leak tests
• Code documentation validation and indexing
• ACD file validation
• ACD documentation completeness
• Program documentation: description, command
line qualifiers, example run(s) and input/output
files
• Web site updates
Disaster proof software licences
• 1977 Fred Sanger sequences ΦX174 with computing by Rodger Staden
• 1996 EMBOSS started by Peter Rice (Sanger) and Alan Bleasby (SEQNET Daresbury), in
collaboration with Thure Etzold (EBI)
• 1997 funding approved by the Wellcome Trust
• 1998 SEQNET relocated to Hinxton (HGMP)
• 1999 Thure goes to LION Bioscience
• 2000 Peter leaves Sanger – EMBOSS goes to Alan at HGMP
• 2001 LION (Peter) adds EMBOSS to SRS and updates EMBOSS
• CCP11 funding for EMBOSS development
• 2002 Peter leaves LION
• 2003 Peter joins EBI – integrating EMBOSS in myGrid services
• Medical Research Council terminates funding for Rodger Staden
• MRC still "owns" the Staden package. Rodger Staden retires.
• HGMP is renamed after Rosalind Franklin (by MRC)
• 2004 April 1st: MRC announces RFCGR will be closed within 15 months
• 2005 Alan Bleasby and Jon Ison move to EBI; Tim Carver moves to Sanger
• All the code is still licensed to everyone under (L)GPL.
Users: Are you a Man or a Mouse?
Command Line
EMBOSS has many possible command lines:
• Prompting for required values
% seqret
What sequence []: embl:paamir
Output file [paamir.fasta]:
• Unix style
% seqret embl:paamir -send 100 -auto
% seqret embl:paamir -se 100 -auto
% seqret -se 100 embl:paamir -auto
• GCG style
% seqret embl:paamir -send=100 -auto
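The examples above use -se as a short form of -send. One way such abbreviations can be resolved is by unambiguous prefix matching, sketched below. This is an illustration only and not the EMBOSS implementation: resolve_qualifier is a hypothetical function, and the qualifier table is a small illustrative subset rather than the full list for any program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative subset of qualifier names; not the full EMBOSS list. */
static const char *const QUALIFIERS[] = {
    "auto", "sbegin", "send", "sformat", "sreverse"
};
#define N_QUALIFIERS (sizeof QUALIFIERS / sizeof QUALIFIERS[0])

/* Resolve a possibly abbreviated qualifier name (without the leading '-').
   Returns the full name, or NULL if nothing matches or the abbreviation
   is a prefix of more than one name. An exact match always wins. */
const char *resolve_qualifier(const char *abbrev)
{
    const char *found = NULL;
    size_t matches = 0;
    size_t len;
    size_t i;

    assert(abbrev != NULL);
    len = strlen(abbrev);
    if (len == 0)
        return NULL;

    for (i = 0; i < N_QUALIFIERS; i++) {
        if (strncmp(QUALIFIERS[i], abbrev, len) != 0)
            continue;                 /* not a prefix of this name */
        if (QUALIFIERS[i][len] == '\0')
            return QUALIFIERS[i];     /* exact match */
        found = QUALIFIERS[i];
        matches++;
    }
    /* Accept the abbreviation only if exactly one name starts with it. */
    return matches == 1 ? found : NULL;
}
```

With this table, resolve_qualifier("se") returns "send", while resolve_qualifier("s") returns NULL because several names begin with "s". The same lookup would serve both the Unix style (-send 100) and the GCG style (-send=100) once the value has been split off.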
Web Interface (wEMBOSS)
Web interface (SRS)
GUI Interfaces: Jemboss
GUI Interfaces: Taverna
Where are we now?
New grant vision
• For the new grant we were asked to present a vision:
• Genomics (whole genome analysis)
• Phylogenetics (beyond phylip)
• Gene expression (microarray data standards)
• Biostatistics (R and BioConductor)
• Proteomics (2d gel, MS, etc)
• Genetic linkage
• Chemistry (small molecules)
• All these ideas came from the 2005 User Survey
• We have funding only for core development (so far)
Extending core EMBOSS
• There are many other things we can do:
• Workflows
• Automatic support for the 100+ interfaces
• Generating XML files
• Notification of changes to ACD standard
• Testing
• Ontologies
• Graphics library
• Database indexing
• Non-sequence data access
EMBOSS Books
• Three books are planned after 4.0.0
• Text ownership stays with the EMBOSS team for reuse
• Publisher: Cambridge University Press
• Programmer's guide
• After a major code refactoring effort
• Automated generation of code examples
• Administrator's guide
• Installing and maintaining EMBOSS code
• Managing data resources
• Supporting in-house developers
• User's guide
• Aimed at experimental biologists
EMBOSS and Industry
• Celera were the first industrial users
• And the first to provide funding (for the SRS interface)
• Hardware manufacturers offer machines and compilers
• IBM, HP, Apple
• Our latest partners are SciTegic/Accelrys
• Pipeline Pilot Independent Software Vendor partnership
Pipelining Heterogeneous Tools
Heterogeneous [BioJava, Perl, PROSITE, EMBOSS, (& GCG)]
tools for sequence annotation
The SciTegic Challenge
• Pipeline Pilot runs on Linux
• BioPerl interface to launch EMBOSS
• EMBOSS team to maintain the BioPerl code
• Pipeline Pilot runs on Windows
• EMBOSS team to support EMBOSSWIN
• Why? Because we can do it, and we expect the
GCG development team will find it difficult!
We need help
• Encouraging more developers
• CUP books
• Developer training courses - not in Hinxton
• Course in Indiana May 2005
• Sponsorship offer from Newcastle, UK
• Willing to travel anywhere!!!
• [email protected]
• Henrikki Almusa and Medicel (Helsinki)
• Suggestions for new applications
• Collaborations in proposed new areas.
Acknowledgements
• (HGMP/RFCGR): Gary Williams, Tim Carver, Hugh Morgan, Claude
Beesley, Damian Counsell, Val Curwen, Mark Faller, Sinead O’Leary, Thon
deBoer, Martin Bishop
• LION: (Thomas Laurent), (Bijay Jassal), Thure Etzold
• Sanger: (Ian Longden), (Richard Bruskiewich), Simon Kelley, (Ewan Birney)
• EBI: Peter Rice, Alan Bleasby, Jon Ison, Lisa Mullan, (Martin Senger), Tom
Oinn, Rodrigo Lopez, Mahmut Uludag, Shaun McGlinchey
• EMBnet: UK, Norway, Italy, Germany, Belgium, Argentina, China, Turkey,
Israel, Canada, Manchester
• Others: Don Gilbert, Will Gilbert, Rodger Staden, Bill Pearson, Catherine
Letondal, Luke McCarthy, Susan Jean Johns, David Bauer, Andrew Lyall,
Henrikki Almusa, Melody Clark, ....