EMBL-EBI Powerpoint Presentation

EMBOSS as a DAS Client
Peter Rice
[email protected]
Mahmut Uludag [email protected]
3rd March 2011.
EBI is an Outstation of the European Molecular Biology Laboratory.
EMBOSS: A quick introduction
• European Molecular Biology Open Software Suite
• Open source package for sequence analysis
• ANSI C source code
• GPL licensed applications, LGPL libraries
• 200+ applications
• 100+ third party applications in 15 associated packages
• Project started 1996 at Sanger Centre and HGMP
• Now based at EBI
• Release 6.3.0 15th July 2010
• Funded by UK-BBSRC and EMBL-EBI
16 July, 2015
EMBOSS history
• Project started at Sanger Centre and SEQNET August 1996
• Alan moved from SEQNET 1997 (Wellcome funding)
• Peter moved to Lion Bioscience 2000 (CCP11-BBSRC/MRC)
• Peter moved to EBI 2003
• HGMP closed 2005: Alan+Jon moved to EBI
• BBSRC funding (limited) 2006-2009
• BBSRC BBR funding 2009-2011
  • Major new developments
  • New data types
  • New data sources
  • Built-in ontologies
EMBOSS command line interface
• EMBOSS applications run from the command line
• This is not the only interface
  • There are over 100 interfaces and packaged systems available
    • Web interfaces
    • Graphical user interfaces (GUIs)
    • Web services
• All applications have a command definition file (.acd)
  • Defines all inputs, outputs, and other options
  • Read at startup
  • Contains all command line options with descriptions
  • Template for any other interface
EMBOSS command line example
% antigenic
Input protein sequence(s): uniprot:actb1_fugru
Minimum length of antigenic region [6]:
Output report [actb1_fugru.antigenic]:
% antigenic uniprot:actb1_fugru -auto
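A minimal Python sketch of how a wrapper might assemble the non-interactive form of the command above. The helper name `build_emboss_command` is hypothetical; the only EMBOSS facts it relies on are the ones shown on the slides: the `-auto` flag and the `minlen` qualifier defined for `antigenic`. It only builds the argument list and does not need EMBOSS to be installed.

```python
import shlex


def build_emboss_command(app, usa, auto=True, **qualifiers):
    """Return an argv list for a non-interactive run of an EMBOSS application.

    Hypothetical helper for illustration; it is not part of EMBOSS.
    """
    argv = [app, usa]
    for name, value in qualifiers.items():
        # Each qualifier is passed as "-name value" on the command line.
        argv += [f"-{name}", str(value)]
    if auto:
        # -auto suppresses the prompts shown in the interactive example.
        argv.append("-auto")
    return argv


cmd = build_emboss_command("antigenic", "uniprot:actb1_fugru", minlen=8)
print(shlex.join(cmd))
# With EMBOSS installed, subprocess.run(cmd, check=True) would execute it.
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a sequence address contains characters the shell would interpret.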
EMBOSS ACD File
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
]
section: input [ information: "Input section" type: "page" ]
  seqall: sequence [
    parameter: "Y"
    type: "proteinstandard"
  ]
endsection: input
section: required [ information: "Required section" type: "page" ]
  integer: minlen [
    standard: "Y"
    minimum: "1"
    maximum: "50"
    default: "6"
    information: "Minimum length of antigenic region"
  ]
endsection: required
section: output [ information: "Output section" type: "page" ]
  report: outfile [
    parameter: "Y"
    rformat: "motif"
    multiple: "Y"
    taglist: "int:pos=Max_score_pos"
  ]
endsection: output
EMBOSS ACD File with EDAM Annotation
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
  relations: "EDAM:0000201 topic Immunological analysis"
  relations: "EDAM:0000416 operation Epitope mapping"
]
section: input [ information: "Input section" type: "page" ]
  seqall: sequence [
    parameter: "Y"
    type: "proteinstandard"
    relations: "EDAM:0001219 data Pure protein sequence"
    relations: "EDAM:0000849 data Sequence record"
    relations: "EDAM:0002178 data 1 or more"
  ]
endsection: input
section: required [ information: "Required section" type: "page" ]
  integer: minlen [
    standard: "Y"
    minimum: "1"
    maximum: "50"
    default: "6"
    information: "Minimum length of antigenic region"
    relations: "EDAM:0001249 data Sequence length"
  ]
endsection: required
section: output [ information: "Output section" type: "page" ]
  report: outfile [
    parameter: "Y"
    rformat: "motif"
    multiple: "Y"
    taglist: "int:pos=Max_score_pos"
    relations: "EDAM:0001534 data Peptide immunogenicity report"
  ]
endsection: output
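To illustrate how the EDAM annotation could be consumed by another tool, here is a small Python sketch that pulls the `relations:` attributes out of ACD text. It is not the EMBOSS ACD parser: the regular expression and the function name `edam_relations` are assumptions for illustration, and the sketch expects plain double quotes and one relation per attribute, as in the slide.

```python
import re

# A fragment of the annotated ACD file shown on the slide.
ACD_SNIPPET = '''
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  relations: "EDAM:0000201 topic Immunological analysis"
  relations: "EDAM:0000416 operation Epitope mapping"
]
integer: minlen [
  default: "6"
  relations: "EDAM:0001249 data Sequence length"
]
'''

# relations: "EDAM:<id> <namespace> <term name>"
RELATION = re.compile(r'relations:\s*"EDAM:(\d+)\s+(\w+)\s+([^"]+)"')


def edam_relations(acd_text):
    """Return (edam_id, namespace, term) tuples in order of appearance."""
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in RELATION.finditer(acd_text)]


for edam_id, namespace, term in edam_relations(ACD_SNIPPET):
    print(edam_id, namespace, term)
```

Grouping the tuples by namespace (`topic`, `operation`, `data`) would give a simple index of what an application does and which data it reads and writes.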
Documentation & books
Three books at typesetting stage.
• Administrators’ Manual
• Users’ Manual
• Developers’ Manual
Concomitant major revision of EMBOSS website.
Automation of website content addition.
Books to form basis of new website content.
EMBOSS: Sequences
Uniform Sequence Address (USA): URL-style naming
Derived from the familiar "VMS logical name" syntax used by SRS and GCG.
database : entryname
• embl : ecompa (ID or accession can be used in this way)
• uniprot-id : opsd_bovin (SRS syntax for query by ID)
• embl-acc : x13776 (SRS syntax for query by accession)
format :: filename
• fasta :: /users/pmr/paamir.fa (filename with specific format)
• ecoompa.genbank (with no format, can try all formats)
format :: filename : entryname
• fasta :: unfinished : AH6.1 (most formats allow multiple sequences)
Also @listfile and asis::gctgactgactgatg
Queries: database-field:query (SRS syntax for id, acc, sv, des, key, org)
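The USA forms listed above can be told apart with a few string splits. The Python sketch below is a deliberately simplified classifier, not the EMBOSS implementation: the function name `parse_usa` and the returned dictionary keys are hypothetical, and it assumes that any `name:entry` without a format prefix refers to a database (EMBOSS itself has to check whether the name is a defined database or a file).

```python
def parse_usa(usa):
    """Classify a Uniform Sequence Address into a small dict (simplified)."""
    usa = usa.strip()
    if usa.startswith("@"):
        # @listfile: a file that contains further USAs
        return {"kind": "listfile", "file": usa[1:]}

    fmt = None
    if "::" in usa:
        # format::rest
        fmt, usa = (part.strip() for part in usa.split("::", 1))
        if fmt == "asis":
            # asis::sequence gives the sequence literally
            return {"kind": "asis", "sequence": usa}

    if ":" in usa:
        source, entry = (part.strip() for part in usa.split(":", 1))
    else:
        source, entry = usa, None

    if fmt is None and entry is not None:
        # database:entry or database-field:query (assumption, see lead-in)
        database, _, field = source.partition("-")
        return {"kind": "database", "database": database,
                "field": field or None, "entry": entry}

    # A file, with an optional format and an optional entry name
    return {"kind": "file", "format": fmt, "file": source, "entry": entry}


for example in ["embl : ecompa", "uniprot-id : opsd_bovin",
                "fasta :: unfinished : AH6.1", "asis::gctgactgactgatg"]:
    print(example, "->", parse_usa(example))
```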
New data resources
• Aim to read “all” public data resources
• Follow cross-references (explicit and implied)
  • UniProt
  • EMBL/GenBank/DDBJ
  • Other
• Servers
  • Multiple data resources through a single server definition
  • DAS, Ensembl, BioMart, WsEbeye, DbFetch, SRS
  • Cache files of resource definitions for server
• Data resource catalogue (drcat)
  • 600+ data resources
  • Query terms and URLs
  • EDAM annotation of resources, formats, identifiers, terms
Data resource catalogue (drcat)
ID       ArachnoServer
Acc      DB-0145
Name     ArachnoServer
Desc     Spider toxin database
URL      http://www.arachnoserver.org
Cat      Organism-specific databases
Taxon    6845 | Arachnida
EDAMres  0000621 | Organism-specific
EDAMdat  0002400 | Toxin annotation
EDAMid   0002578 | ArachnoServer ID
Xref     SP_explicit | ArachnoServer ID;Toxin name
Query    Toxin annotation | HTML | ArachnoServer ID | www.arachnoserver.org/toxincard.html?id=%s
Example  ArachnoServer ID | AS000014
CCmisc   BMC Genomics 10:375-375(2009); [Pubmed: 19674480]
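The `Query` line pairs a data type, a format and an identifier type with a URL template in which `%s` stands for the identifier. As a rough Python sketch of how such a line could be turned into a concrete URL, using the `Example` identifier from the same entry: the function names are hypothetical, and prepending `http://` when the template has no scheme is an assumption of this sketch, not documented drcat behaviour.

```python
from urllib.parse import quote

QUERY_LINE = ("Toxin annotation | HTML | ArachnoServer ID | "
              "www.arachnoserver.org/toxincard.html?id=%s")


def parse_query_line(line):
    """Split a drcat Query value into its four '|'-separated fields."""
    data, fmt, id_type, template = (field.strip() for field in line.split("|"))
    return {"data": data, "format": fmt,
            "identifier": id_type, "url": template}


def entry_url(query, identifier, scheme="http://"):
    """Fill the %s placeholder with a URL-escaped identifier."""
    url = query["url"].replace("%s", quote(identifier, safe=""))
    # Assumption: templates without a scheme are meant to be fetched over HTTP.
    return url if "://" in url else scheme + url


query = parse_query_line(QUERY_LINE)
print(entry_url(query, "AS000014"))
```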
EMBOSS Data Types
• Sequences
  • Nucleotide (DNA and RNA)
  • Protein
• Features
  • Attached to sequences
  • Independent data objects
• Bio-Ontologies (OBO)
• Taxonomy (NCBI)
• Data Resources
• Assembled reads
• Text
  • Text, HTML, XML
New data types
• Reuse “USA” syntax
  • [Server:] Dbname : identifier (database has an access method)
  • [Server:] Dbname-field : query (general field names)
• Data types: features, bio-ontologies, taxonomy, etc.
• Access methods: HTTP, DAS, BioMart, Ensembl, ...
• Multiple types and formats for a server/resource
  • type: "sequence features"
  • format: "embl fasta"
EMBOSS Query Language
• Query fields are now made general
  • Any field queryable by the access method (DAS, SRS, …)
  • Any index created by indexing applications
  • Any query term in the data resource catalogue
• Multiple queries combined
  • For one data resource
  • AND, OR, … to combine queries
DAS Server Definitions
SERVER das [
  method: "dassource"
  type: "sequence, features"
  url: "http://www.dasregistry.org/das/"
  comment: "access sequence/feature sources listed on das registry (http://www.dasregistry.org/das/)"
  cachefile: "server.dassource"
]
DAS Server Definitions
SERVER ensembldas [
  method: "dassource"
  type: "sequence, features"
  url: "http://www.ensembl.org/das/"
  comment: "access sequence/feature sources on ensembl das server (http://www.ensembl.org/das/)"
  cachefile: "server.ensembldas"
]
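Server definitions like the two above are simple `name: "value"` blocks, so they are easy to read from other tools. The following Python sketch parses such blocks into dictionaries. It is an illustration only and not how EMBOSS reads its configuration: the regular expressions and the function name `parse_definitions` are assumptions, and the sketch expects each block to end with `]` at the start of a line.

```python
import re

DEFINITION = '''
SERVER ensembldas [
  method: "dassource"
  type: "sequence, features"
  url: "http://www.ensembl.org/das/"
  comment: "access sequence/feature sources on ensembl das server
            (http://www.ensembl.org/das/)"
  cachefile: "server.ensembldas"
]
'''

# A block runs from "SERVER name [" (or "DB name [") to a "]" at line start.
BLOCK = re.compile(r'^(SERVER|DB)[ \t]+(\S+)[ \t]*\[[ \t]*\n(.*?)^\]',
                   re.M | re.S)
# An attribute value is either double-quoted or a bare word.
ATTR = re.compile(r'(\w+):\s*(?:"([^"]*)"|(\S+))')


def parse_definitions(text):
    """Return {name: {"kind": ..., attribute: value, ...}} for each block."""
    result = {}
    for kind, name, body in BLOCK.findall(text):
        attrs = {"kind": kind}
        for key, quoted, bare in ATTR.findall(body):
            # Collapse line breaks inside wrapped quoted values.
            attrs[key] = " ".join((quoted or bare).split())
        result[name] = attrs
    return result


servers = parse_definitions(DEFINITION)
print(servers["ensembldas"]["url"])
```

Anchoring the closing `]` to the start of a line keeps bracketed text inside values, such as a range like `[1:550]`, from ending the block early.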
DAS Example
DB Ensembl_Human_Genes [
  method: das
  type: "Sequence, Features"
  taxon: "9606"
  format: "das, dasgff"
  url: http://www.ebi.ac.uk/das-srv/genedas/das/Homo_sapiens.Gene_ID.reference
  example: "ENSG00000139618"
  comment: "The Ensembl human Gene_ID reference source, serving sequences and non-location features."
  hasaccession: "N"
  identifier: "segment"
  fields: "segment, type, category, categorize, feature_id"
]
Ensembl DAS Example
DB Felis_catus_CAT_prediction_transcript [
  method: das
  type: "Nucfeatures"
  taxon: "9685"
  format: "dasgff"
  url: http://www.ensembl.org/das/Felis_catus.CAT.prediction_transcript
  example: "scaffold_209987[1:550]"
  comment: "Annotation source for Felis_catus prediction_transcript"
  hasaccession: "N"
  identifier: "segment"
  fields: "segment, type, category, categorize, feature_id"
]
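A database definition like this carries what a client needs to address a DAS source: the source `url` and a segment identifier with an optional range. The Python sketch below maps the `example` value onto a `features` request in the style of the DAS 1.5 specification (`features?segment=reference:start,stop`). Whether EMBOSS builds its requests in exactly this way is an assumption; the function name `das_features_url` is hypothetical.

```python
import re
from urllib.parse import quote

# "scaffold_209987[1:550]" -> reference "scaffold_209987", start 1, end 550
SEGMENT = re.compile(r'^\s*([^\[\s]+)\s*(?:\[(\d+):(\d+)\])?\s*$')


def das_features_url(source_url, identifier):
    """Build a DAS 'features' request URL for a segment, with optional range."""
    match = SEGMENT.match(identifier)
    if match is None:
        raise ValueError(f"not a segment identifier: {identifier!r}")
    reference, start, end = match.groups()
    segment = reference if start is None else f"{reference}:{start},{end}"
    return (f"{source_url.rstrip('/')}/features"
            f"?segment={quote(segment, safe=':,')}")


print(das_features_url(
    "http://www.ensembl.org/das/Felis_catus.CAT.prediction_transcript",
    "scaffold_209987[1:550]",
))
```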
EMBOSS Query Language
• das: ensembl_human_genes: ENSG00000139618
• ensembldas: Felis_catus_CAT_prediction_transcript: scaffold_209987 [1:550]
• das: Homo_sapiens_GRCh37_transcript: 10 [32889611:32973347]
• das: uniprot: P00280
• das: cath: 5pti
• das: uniparc: UPI000000000A
• das: Homo_sapiens_GRCh37_reference{segment: 11 & type: supercontig}
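The examples above follow two shapes: `server: database: identifier [start:end]` and `server: database{field: value & field: value}`. The Python sketch below splits both shapes into their parts. It covers only what these examples show: the grammar is inferred from the slide rather than taken from EMBOSS, only `&` is handled as a combining operator, and the name `parse_query` is hypothetical.

```python
import re

QUERY = re.compile(r'''
    ^\s*(?P<server>[\w.]+)\s*:        # server name, e.g. das
    \s*(?P<db>[\w.]+)\s*              # database (DAS source) name
    (?: :\s*(?P<id>[\w.]+)\s*         # either ": identifier"
        (?:\[(?P<start>\d+):(?P<end>\d+)\])?   # with optional [start:end]
      | \{(?P<fields>[^}]*)\}         # or "{field: value & field: value}"
    )\s*$''', re.X)


def parse_query(text):
    """Split a query of the forms shown on the slide into a dict."""
    match = QUERY.match(text)
    if match is None:
        raise ValueError(f"unrecognised query: {text!r}")
    query = {"server": match["server"], "db": match["db"]}
    if match["fields"] is not None:
        fields = {}
        for term in match["fields"].split("&"):
            name, value = term.split(":", 1)
            fields[name.strip()] = value.strip()
        query["fields"] = fields
    else:
        query["id"] = match["id"]
        if match["start"] is not None:
            query["range"] = (int(match["start"]), int(match["end"]))
    return query


print(parse_query("das: Homo_sapiens_GRCh37_transcript: 10 [32889611:32973347]"))
print(parse_query("das: Homo_sapiens_GRCh37_reference{segment: 11 & type: supercontig}"))
```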
EMBOSS Query Language: Future
• Ontology-based searches of data resources
  • Taxonomy
  • EDAM terms
    • Resources
    • Data types
    • Identifiers
  • Descriptions
• Search for applications matching data types
  • Sequences and features
  • Nucleotide and protein
  • …
• Support for DAS advanced query ...
Acknowledgements
• EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Martin Senger, Tom Oinn, Jaina Mistry, Rodrigo Lopez, Sharmilla Pillai, Hamish McWilliam
• RFCGR/HGMP: Alan Bleasby, Jon Ison, Tim Carver, Hugh Morgan, Claude Beazley, Lisa Mullan, Damian Counsell, Gary Williams, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop
• Sanger Institute: Ian Longden, Richard Bruskiewich, Simon Kelley
• LION: Mahmut Uludag, Thomas Laurent, Bijay Jassal, Bren Vaughan, Thure Etzold
• National bioinformatics service providers in: Norway, Spain, Italy, Netherlands, Germany, Belgium, Russia, China, Canada, Australia, Argentina
• Others: Catherine Letondal, Don Gilbert, Rodger Staden, Bill Pearson, Webb Miller, Marie-Laetitia Denayer, Amandine Schurmann, Gabriele Weiler, Luke McCarthy, David Mathog, David Bauer, Henrikki Almusa, Thomas Siegmund, Scott Markel, Darryl Leon, Bastien Chevreux, Ivo Hofacker, ...
• IBM, Hewlett-Packard, (Compaq), Apple, SGI, Sun, LION bioscience, SciTegic, Cambridge University Press
• Open-Bio Foundation, Sourceforge, Debian, Fedora, CEH
... And the British Antarctic Survey
http://emboss.sourceforge.net
http://emboss.open-bio.org/wiki/Latest_developments