The challenge of annotating a complete eukaryotic genome:
A case study in Drosophila melanogaster
Martin G. Reese ([email protected])
Nomi L. Harris ([email protected])
George Hartzell ([email protected])
Suzanna E. Lewis ([email protected])
Drosophila Genome Center
Department of Molecular and Cell Biology
539 Life Sciences Addition
University of California, Berkeley
Reese et al., Tutorial #3, ISMB ‘99
Abstract
Many of the technical issues involved in sequencing complete genomes are essentially solved.
Technologies already exist that provide sufficient solutions for ascertaining sequencing error rates
and for assembling sequence data. Currently, however, standards or rules for the annotation
process are still an outstanding problem.
How shall the genomes be annotated, what shall be annotated, which computational tools are most
effective, how reliable are these annotations, how organism-specific do the tools have to be and
ultimately how should the computational results be presented to the community? All these
questions are unsolved. This tutorial will give an overview and assessment of the current state of
annotation based upon experiences gained at the Drosophila melanogaster genome project.
In the tutorial we will do three things. First, we will break down the annotation process and discuss
the various aspects of the problem. This will serve to clarify the term "annotation", which is often
used to collectively describe a process that has a number of discrete steps. Second, with the
participation of computational biologists from the community we will compare existing tools for
sequence annotation. We will do this by providing a 3 megabase sequence that has already been
well-characterized at our center as a testbed for evaluating other feature-finding algorithms. This is
similar to what has been done for protein structure prediction at the CASP (Critical Assessment of
Techniques for Protein Structure Prediction) conferences (http://predictioncenter.llnl.gov). Third, we
will discuss which annotation problems are essentially solved and which problems remain.
Tutorial goals
Review the algorithms currently used in annotation
Assess existing methods under “field” conditions
Identify open issues in annotation
Tutorial organization
Definitions
Annotation
“Biological” issues
“Engineering” issues
Application of tools within an existing annotation system
Break (20 minutes)
Review of existing tools
Our annotation experiment
Conclusions and outstanding issues
What is a gene?
Definition: An inheritable trait associated with a
region of DNA that codes for a polypeptide chain
or specifies an RNA molecule, which in turn
influences some characteristic phenotype of
the organism.
What are annotations?
Definition: Features on the genome derived
through the transformation of raw genomic
sequences into information by integrating
computational tools, auxiliary biological data, and
biological knowledge.
How does an annotation differ
from a gene?
Many annotations are the same as ‘genes’
The annotation describes an inheritable trait associated
with a region of DNA.
But an annotation may not always correspond in
this way, e.g. an STS, or sequence overlap
The region of genomic DNA or RNA is not translated or
transcribed
Transcription and translation
Schematic gene structure
[Figure: schematic of a gene with three exons and two introns, shown at successive stages.
DNA: promoter, TSS, exons 1-3, introns 1-2, start codon (ATG), splice sites (GT ... AG), stop codon (TAA).
Transcription gives the pre-mRNA: exons 1-3 and introns 1-2.
Splicing gives the mRNA: 5' UTR, ORF (ATG ... TAA), 3' UTR, poly(A) tail.
Translation gives the primary amino acid sequence (MPYCPLTW...GFL); modification (cleavage product, glycosylation site) gives the active protein.]
Sequence feature types
Transcribed region
Structural region
Exon, intron, 5’ UTR, 3’ UTR, ORF, cleavage product
Mutations: insertion, deletion, substitution, inversion, translocation
Functional or signal region
Promoter, enhancer, DNA/RNA binding site, splice site signal, polyadenylation signal
Protein processing: glycosylation, methylation, phosphorylation site
Similarity
mRNA, tRNA, snoRNA, snRNA, rRNA
Homolog, paralog, genomic overlap (syntenic region)
Other feature types
Transposable element, repetitive element
Pseudogene
STS, insertion site
DNA transcription unit features
Promoter elements
Core promoter elements
TATA box
Initiator (Inr)
Downstream promoter element (DPE)
Transcription factor (“TF”) binding sites
CAAT boxes
GC boxes
SP-1 sites
GAGA boxes
Enhancer site(s)
mRNA features
Exon
Initial, internal, terminal
Intron
5’ splice site (“GT”), branchpoint (lariat), 3’ splice site (“AG”)
Repeat elements
“Kozak” rule
5’ UTR
Start codon (translation start site)
UTR (untranslated regions)
Codon usage, preference
Control elements (e.g. splice enhancers)
Translation regulatory elements
RNA binding sites
Control elements (e.g. splice enhancers)
RNA binding sites (cis-acting elements)
Initial, internal, terminal
3’ UTR
Stop codon
Poly-adenylation signal and site
RNA destabilization signal
Definitions for data modeling
Feature: An interval or an ordered set of intervals on a
sequence that describes some biological attribute and is
justified by evidence.
Sequence: A linear molecule of DNA, RNA or amino
acids.
Evidence: A computational or experimental result
produced by an analysis of a sequence.
Annotation: A set of features.
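The data-modeling definitions above can be sketched as a few Python data classes. This is a minimal illustration, not the BDGP schema: the field names (`kind`, `intervals`, `program`, `score`) and the 0-based, half-open coordinate convention are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sequence:
    """A linear molecule of DNA, RNA or amino acids."""
    name: str
    residues: str


@dataclass
class Evidence:
    """A computational or experimental result from analysing a sequence."""
    program: str   # hypothetical field: name of the tool or experiment
    score: float   # hypothetical field: tool-specific score


@dataclass
class Feature:
    """An interval, or ordered set of intervals, on a sequence."""
    kind: str                              # e.g. "exon" or "transcript"
    intervals: List[Tuple[int, int]]       # assumed 0-based, half-open
    evidence: List[Evidence] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Keep the intervals ordered along the sequence.
        self.intervals = sorted(self.intervals)

    def span(self) -> Tuple[int, int]:
        """Smallest single interval covering every part of the feature."""
        return self.intervals[0][0], self.intervals[-1][1]

    def is_justified(self) -> bool:
        """True when at least one piece of evidence supports the feature."""
        return len(self.evidence) > 0


@dataclass
class Annotation:
    """A set of features on one sequence."""
    sequence: Sequence
    features: List[Feature] = field(default_factory=list)
```

Storing a list of intervals rather than a single start and end lets one feature describe a spliced transcript, which is the "ordered set of intervals" case in the definition.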
[Figure: depth of knowledge plotted against breadth of knowledge]
Annotation: detailed analysis (typically biological) of single genes
Annotated genome: large-scale analysis (typically computational) of entire genome
Annotation process overview
[Figure: data and methods (genome sequence, auxiliary data, computational tools, database resources) feed into annotation systems, which lead to an understanding of a genome]
Types of sequence data
Chromosomal sequence
Euchromatic
Heterochromatic
mRNA sequences
Full-length cDNA
5’ EST
3’ EST
Protein sequences
Insertion site flanking sequences
Auxiliary data
Maps
Genetic, physical, radiation hybrid map (RH), deletion, cytogenetic
Expression data
Tissue, stage
Phenotypes
Lethality, sterility
Computational annotation tools
Gene finding
Repeat finding
EST/cDNA alignment
Homology searching
BLAST, FASTA, HMM-based methods, etc.
Protein family searching
PFAM, Prosite, etc.
Database resources
Curated sequence feature data sets
Repeat elements
Transposons
Non-redundant mRNA
STSs and other sequence markers
Genome sequence from related species
D. melanogaster vs. D. virilis, D. hydei
Genome sequence from more distant species
Protein sequences from distant species
Biological issues in annotation
Common
Genes within genes
Alternative splicing
Alternative poly-adenylation sites
Rare
Translational frame shifting
mRNA editing
Eukaryotic operons
Alternative initiation
Engineering issues in annotation
What sequence to start with?
When to annotate?
Because features are intervals on a sequence, problems can be caused by
gaps, frameshifts, and other changes to the sequence. How do you track
these changes over time and model features that span gaps?
Feature identification can aid in sequencing. It may be advisable to carry
out sequencing and annotation in parallel thus enabling them to
complement one another.
What analyses need to be run and how?
What dependencies are there between various analysis programs?
What parameter settings to use?
Engineering issues in annotation
What public sequence data sets are needed?
What are the mechanics of obtaining public sequence databases?
Are curated data sets available or do you need to set up a means of
maintaining your own (for repeats, insertions, organism of interest)?
How do you achieve computational throughput?
Workstation farm, or simply a big, powerful box?
Job flow control
What do you do with the results?
Homogenize results into single format?
Filter results for significance and redundancy
Engineering issues in annotation
Interpreting the results
Is human curation needed?
How can you achieve consistency between curators?
How do you design the user interface so that it is simple enough to get the
task completed speedily but complex enough to deal with biology?
How do you capture curations?
How are annotation translations to be described?
EC terminology
ProSite families
Pfam domains
Is function distinguishable from process?
Engineering issues in annotation
How do you manage data?
What is the appropriate database schema design?
How is the database to be kept up to date? Will it be directly from
programs running user interfaces and analyses or via a middleware layer?
Is a flat file format needed and what should it be?
What query and retrieval support is needed?
How do you distribute data?
For bulk downloads what is the format of the data?
What information is best summarized in tables?
What information requires an integrated graphical view?
Engineering issues in annotation
How do you update the annotations?
How frequently are they re-evaluated?
How can re-evaluation be minimized (only subsets of the
databanks, only modified sequences)?
How can differences between old and new computational results
be detected?
Changes in computational results may need to trigger changes in
curated annotations
Drosophila melanogaster
Drosophila is the most important model organism*
Drosophila genome:
4 chromosomes
180 Mb total sequence
140 Mb euchromatic sequence
12,000-14,000 genes
* source: G.M. Rubin
Drosophila Genome Project
Laboratories working on Drosophila sequencing:
BDGP (Berkeley Drosophila Genome Project)
EDGP (European Drosophila Genome Project)
Celera Genomics Inc.
“Complete” D. melanogaster sequence will be
finished by the end of 1999
Comprehensive database - FlyBase
Goals of the Drosophila Genome
Project
Complete genome sequence
Structure of all transcripts
Expression pattern of all genes
Phenotype resulting from mutation of all ORFs
And more...
Sequencing at the BDGP
Genomic sequence
P1 and BAC clones
24 Mb of completed sequence (as of July 22, 1999)
18 Mb unfinished sequence in process
Complete tiling path in BACs
1.5x-path draft sequencing
ESTs and cDNAs
80,942 ESTs finished (as of March 19, 1999)
Over 800 full-length cDNAs
The BDGP sequence annotation
process
What sequence to start with?
Unit of sequencing at the BDGP
Completed high-quality clone sequences
Reassembling the genomic sequence
Need to place clones in correct genomic positions
Need to integrate genes that span multiple clones
Solved by using genomic overlaps to reconstitute full genomic sequence
Which analyses need to be run?
Similarity searches
BLAST (Altschul et al., 1990)
BLASTN (nucleotide databases)
BLASTX (amino acid databases)
TBLASTX (nucleotide databases, six-frame translation of both query and database)
sim4 (Florea et al., 1998)
Sequence alignment program for finding near-perfect matches
between nucleotide sequences containing introns
Gene predictors
Genefinder (Green, unpublished)
GenScan (Burge and Karlin, 1997)
Genie (Reese et al., 1997)
Other analyses
tRNAscan-SE (Lowe and Eddy, 1996)
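As an illustration of the six-frame translation used by the translated BLAST searches listed above, the sketch below translates a nucleotide sequence in three forward and three reverse-complement frames with the standard genetic code. It is not BLAST code: the function names and frame labels (`+1` to `-3`) are hypothetical, and showing unknown codons as `X` is an assumption.

```python
from typing import Dict

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str, frame: int = 0) -> str:
    """Translate dna starting at offset frame; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> Dict[str, str]:
    """Return translations for frames +1..+3 and -1..-3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames
```

For example, `six_frame_translation("ATGGCCTAA")["+1"]` gives `"MA*"`, where `*` marks the stop codon.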
Which analyses need to be run
and how?
mRNAs
ORFFinder (Frise, unpublished)
Protein translations
HMMPFAM 2.1 (Eddy 1998) against PFAM (v 2.1.1 Sonnhammer
et al. 1997, Bateman et al. 1999)
Ppsearch (Fuchs 1994) against ProSite (release 15.0) filtered with
EMOTIF (Nevill-Manning et al. 1998)
Psort II (Horton and Nakai 1997)
ClustalW (Higgins et al. 1996)
What public sequence data sets are
needed?
Automating updates of public databases:
Genbank, SwissProt, trEMBL, BLOCKS, dbEST, EDGP
Curated data sets
D. melanogaster genes (FlyBase)
Transposable elements (EDGP)
Repeat elements (EDGP)
STSs (BDGP)
Which analyses need to be run
and how?
How do you achieve
computational throughput?
BDGP computing power
Sun Ultra 450 (3 machines, 4 processors each)
Sun Enterprise (1 machine, 8 processors)
Used these directly, without any system for distributed computing.
Job flow control: the Genomic Daemon
Automatic batch analysis of genomic clones
Berkeley Fly Database is used as the queuing system and for storage of results
Many clones can be analyzed simultaneously
Results are processed and saved in XML format for interactive browsing
What do you do with the results?
Berkeley Output Parser (BOP)
Input to BOP:
Genomic sequence
Results of computational analyses
Filtering preferences
Parses results from BLAST, sim4, GeneFinder, GenScan, and
tRNAscan-SE analyses
Filters BLAST and sim4 results
Eliminates redundant or insignificant hits
Merges hits that represent a single region of homology
Homogenizes results into single format
Output: sequence and filtered results in XML format
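The filtering and merging steps listed above might look roughly like the following. This is a sketch under stated assumptions, not BOP itself: the `Hit` record, the score cutoff, and the rule that hits to the same subject are merged when they overlap or lie within `max_gap` bases are all hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str   # name of the database sequence that was hit
    start: int     # start on the genomic sequence (0-based, inclusive)
    end: int       # end on the genomic sequence (exclusive)
    score: float   # tool-specific score; higher is better


def filter_hits(hits: List[Hit], min_score: float) -> List[Hit]:
    """Drop low-scoring hits and exact duplicates, keeping input order."""
    seen = set()
    kept = []
    for hit in hits:
        if hit.score < min_score or hit in seen:
            continue
        seen.add(hit)
        kept.append(hit)
    return kept


def merge_hits(hits: List[Hit], max_gap: int = 0) -> List[Hit]:
    """Merge hits to the same subject that overlap or lie within max_gap."""
    merged: List[Hit] = []
    for hit in sorted(hits, key=lambda h: (h.subject, h.start, h.end)):
        last = merged[-1] if merged else None
        if last and last.subject == hit.subject and hit.start <= last.end + max_gap:
            # Extend the previous region and keep the better score.
            merged[-1] = Hit(last.subject, last.start,
                             max(last.end, hit.end), max(last.score, hit.score))
        else:
            merged.append(hit)
    return merged
```

Filtering before merging keeps weak hits from stretching a merged region; whether that order matches the real parser is not stated in the slides.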
Is human curation needed?
Not for everything
Some features are obvious and can be identified computationally
Known D. melanogaster genes are detected automatically by
GeneSkimmer
Repetitive elements
But still for many things
Annotating complete gene structure is still hard
We use CloneCurator (BDGP’s Java graphical editor) for curation
Gene Skimmer
Quick way of identifying genes in new sequence before
curation
Start with XML output from BOP
Look for sim4 hits with known Drosophila genes
Find gene hits with sequence identity >98%,
coverage >30%
Verify that hits represent real genes
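The identity and coverage cutoffs above can be expressed as a small filter. This is only a sketch, not the Gene Skimmer code: the `Sim4Hit` record is hypothetical, and "coverage" is assumed here to mean the percentage of the known gene's cDNA length that is included in the alignment.

```python
from typing import Dict, List, NamedTuple


class Sim4Hit(NamedTuple):
    gene: str          # name of the known Drosophila gene that was aligned
    identity: float    # percent identity of the alignment, 0-100
    aligned_len: int   # number of cDNA bases included in the alignment
    gene_len: int      # total length of the cDNA


def skim_genes(hits: List[Sim4Hit],
               min_identity: float = 98.0,
               min_coverage: float = 30.0) -> Dict[str, Sim4Hit]:
    """Return the longest qualifying hit for each gene."""
    best: Dict[str, Sim4Hit] = {}
    for hit in hits:
        coverage = 100.0 * hit.aligned_len / hit.gene_len
        # Both cutoffs are strict: identity > 98% and coverage > 30%.
        if hit.identity <= min_identity or coverage <= min_coverage:
            continue
        current = best.get(hit.gene)
        if current is None or hit.aligned_len > current.aligned_len:
            best[hit.gene] = hit
    return best
```

The final step on the slide, verifying that hits represent real genes, is left to a curator and is not modeled here.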
Gene Skimmer
URL: http://www.fruitfly.org/sequence/genomic-clones.html
CloneCurator
Displays computational results and annotations on a
genomic clone
Interactive browsing
Zoom/scroll
Change cutoffs for display of results
Analyze GC content, restriction sites, etc.
Interactive annotation editing
Expert “endorses” selected results
Presents annotations to community via Web site
How do we annotate gene/protein
function?
Gene Ontology Project
Controlled hierarchical vocabulary for multiple-genome
annotations and comparisons
Standardized vocabulary facilitates collaboration
Good data modeling allows better database querying
Ontology browser provides interactive search of hierarchical
terms
“GO” project (http://www.ebi.ac.uk/~ashburn/GO)
Ontology browser
Ontology browser: searching for
terms
How do you distribute the data?
Bulk downloads
FASTA at http://www.fruitfly.org/sequence/download.html
Curated data sets
Tabular data
At http://www.fruitfly.org/sequence/
Sequenced genomic clones
Clone contigs sorted by genomic location
Clone contigs sorted by size
Ribbon provides integrated graphical view of
annotations on physical contigs
Ribbon
Human curator annotates individual clones (~100Kb)
Clones are assembled into physical contigs (regions of
physical map)
Clone annotations are merged and renumbered for
display on whole physical contigs
Ribbon is our Java display tool for displaying curated
annotations on physical contigs
Will soon be available on Web
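Renumbering clone annotations for display on a contig amounts to a coordinate transformation. The sketch below shows one way to do it, assuming 0-based, half-open intervals, a known offset of the clone within the contig, and an optional flip for clones in reverse orientation. The function and parameter names are hypothetical and are not taken from Ribbon.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in clone coordinates


def to_contig_coords(features: List[Interval], clone_offset: int,
                     clone_length: int, reverse: bool = False) -> List[Interval]:
    """Map clone-relative intervals onto contig coordinates."""
    mapped = []
    for start, end in features:
        if reverse:
            # Flip the interval when the clone is in reverse orientation.
            start, end = clone_length - end, clone_length - start
        mapped.append((start + clone_offset, end + clone_offset))
    return sorted(mapped)
```

Features that fall in the overlap between neighboring clones would appear twice after this step, so a real merge also needs a rule for reconciling duplicates.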
Ribbon
How do you manage the data?
Using Informix as our database server
Updated via the Perl DBI module (DBI.pm)
Development underway in:
Schema revisions
GAME DTD (Genome Annotation Markup Entities)
Perl module for annotation objects
http://www.bioxml.org/ (Ewan Birney)
How do you maintain annotations?
Open questions
How frequently are annotations re-evaluated?
How can re-evaluation be minimized (only subsets of
the databanks, only modified sequences)?
How can differences between old and new
computational results be detected?
Changes in computational results may need to trigger
changes in curated annotations
Integrated annotation systems
ACeDB
Genotator
Magpie
GAIA
TIGR
Integrated annotation systems:
ACeDB
Developed for analysis of the C. elegans genome
Sophisticated database designed for storing annotations
and related information
New Java and Web-based versions available
Written by Jean Thierry-Mieg and Richard Durbin
http://www.sanger.ac.uk/Software/Acedb/
ACeDB
Genotator
Back end automates sequence analysis; browser
provides interactive viewing and editing of annotations
Nomi Harris (1997), Genome Research 7(7), 754-762.
http://www-hgc.lbl.gov/inf/annotation.html
Magpie
Expert system based (PROLOG)
Data collection daemon
Data analysis and report daemon
“Intelligent” integration of various individual feature
prediction systems
Allows human interactions
Gaasterland and Sensen (1996), TIG, 12, 76-78.
http://genomes.rockefeller.edu/magpie/magpie.html
GAIA
Web-based system
Results displayed as Java applets
Bailey, L.C., J. Schug, S. Fischer, M. Gibson, J.
Crabtree, D.B. Searls, and G.C. Overton (1998),
Genome Research.
http://daphne.humgen.upenn.edu:1024/gaia/
TIGR Human Gene Index
Gene Indices for various organisms
Databases for transcribed genes linked into
external/internal genomic databases
Internal backend analysis software
http://www.tigr.org/tdb/tdb.html
Computational analysis tools
Gene finding
Repeat finding
EST/cDNA alignment
Homology searching
BLAST, FASTA, HMM-based methods, etc.
Protein family searching
PFAM, Prosite, etc.
Gene finding:
Prokaryotes vs. Eukaryotes
Prokaryotes
Contiguous open reading frames (ORF)
Short intergenic sequences
Good method: detecting large ORFs
Complications:
Partial sequences
Sequencing errors
Start codon prediction
Overlapping genes on both strands
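Detecting large ORFs, the method named above for prokaryotes, can be sketched in a few lines. This is a minimal illustration and not any published gene finder: it scans both strands in all three frames for ATG ... stop stretches, and the `min_codons` threshold and the choice of the first ATG as start are assumptions. It does not address the listed complications such as partial sequences or sequencing errors.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_codons: int = 100) -> List[Tuple[str, int, int]]:
    """Return (strand, start, end) for ORFs with at least min_codons codons.

    Coordinates are 0-based, half-open, relative to the strand scanned.
    The stop codon is included in the interval but not in the codon count.
    """
    orfs = []
    forward = dna.upper()
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs
```

Choosing the first in-frame ATG is only one possible start rule, which is why start codon prediction appears among the complications.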
Gene finding:
Prokaryotes vs. Eukaryotes
Eukaryotes
Complex gene structures (exons/introns)
D. melanogaster has an average of 4 introns/gene
Very long genes (D. melanogaster X gene 160 kb)
Very long introns
Many introns
“Nested”, overlapping, and alternatively spliced genes
5’ UTRs with non-coding exons
Long 3’ UTRs
Complex transcription machinery
ORF-finding alone is not adequate
Integrated gene finding
Assumptions
Signal and content sensors alone are not
sufficient for predicting gene structure
Gene structure is hierarchical
Each component (exon, intron, splice site, etc.) can be
modeled independently
The approach
Generate a list of candidates for each component (with
scores)
Assemble the components into a “gene model”
Integrated gene finding:
Dynamic programming
Determines the best combination of components
Two-part problem:
Develop an “optimal” scoring function
Use dynamic programming to find an “optimal” alignment
through the scoring matrix
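A simplified instance of this dynamic programming step is choosing the highest-scoring set of non-overlapping candidates from a scored list. The sketch below does only that. It assumes each candidate is a `(start, end, score)` tuple with an exclusive end, and it ignores reading-frame compatibility and intron scores, which a real gene finder would have to include in its scoring function.

```python
from bisect import bisect_right
from typing import List, Tuple

Candidate = Tuple[int, int, float]  # (start, end, score), end exclusive


def best_gene_model(candidates: List[Candidate]) -> Tuple[float, List[Candidate]]:
    """Pick the highest-scoring set of non-overlapping candidates."""
    cands = sorted(candidates, key=lambda c: c[1])   # order by end position
    ends = [c[1] for c in cands]
    n = len(cands)
    best = [0.0] * (n + 1)    # best[i]: best score using the first i candidates
    take = [False] * (n + 1)  # take[i]: whether candidate i is used in best[i]
    prev = [0] * (n + 1)      # prev[i]: earlier candidates compatible with i
    for i, (start, _end, score) in enumerate(cands, 1):
        # Count earlier candidates that end at or before this start.
        prev[i] = bisect_right(ends, start, 0, i - 1)
        with_it = best[prev[i]] + score
        if with_it > best[i - 1]:
            best[i], take[i] = with_it, True
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the chosen components.
    chosen, i = [], n
    while i > 0:
        if take[i]:
            chosen.append(cands[i - 1])
            i = prev[i]
        else:
            i -= 1
    return best[-1], chosen[::-1]
```

The two parts of the problem are visible here: the scores come from elsewhere (the scoring function), and the table fill plus traceback is the dynamic program.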
Integrated gene finding:
Dynamic programming
Integrated gene finding:
Linear and Quadratic
Discriminant Analysis (LDA/QDA)
LDA
Deterministic calculation of thresholds
n-class discrimination
Example: HSPL, Solovyev et al. (1997), ISMB, 5, 294-302.
QDA
Can represent a great improvement over LDA
Example: MZEF, Michael Zhang (1997), PNAS, 94, 565-568.
Integrated gene finding:
Feed-forward neural networks
Supervised learning
Training to discriminate between several feature classes
Computing units
Gradient descent optimization
Multi-layer networks
Limitations
Black-box predictions
Local minima
Example:
GRAIL, Uberbacher et al. (1991), PNAS, 88, 11261-11265.
Approaches to gene finding:
Hidden Markov models
Model
Markov
k-order Markov chain: current state dependent on k previous states
The next state in a 1st-order Markov model depends on current state
Hidden
A finite model describing a probability distribution over all possible sequences of
equal length
“Natural” scoring function
(Conditional) Maximum likelihood “training”
Hidden states generate visible symbols
Assumptions
Independence of states
No long range correlation
Example: HMMgene, A. Krogh (1998), In Guide to Human Genome
Computing, 261-274.
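For a first-order HMM, the most probable sequence of hidden states can be recovered with the Viterbi algorithm, which is itself a dynamic program. The sketch below is a generic textbook version and is not taken from HMMgene. The parameter layout (nested dictionaries of probabilities) is an assumption, and all probabilities must be greater than zero because the code works with logarithms.

```python
import math
from typing import Dict, List


def viterbi(obs: str, states: List[str],
            start: Dict[str, float],
            trans: Dict[str, Dict[str, float]],
            emit: Dict[str, Dict[str, float]]) -> List[str]:
    """Most probable hidden-state path for obs under a first-order HMM."""
    # Work in log space to avoid underflow on long sequences.
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back: List[Dict[str, str]] = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            best_prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            new_score[s] = (score[best_prev] + math.log(trans[best_prev][s])
                            + math.log(emit[s][symbol]))
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With a toy two-state model (an AT-rich state `N` and a GC-rich state `C`), the hidden states label each visible base, which is the sense in which hidden states generate visible symbols.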
Approaches to gene finding:
Generalized hidden Markov models
Each HMM state can be a probabilistic sub-model
Complex hierarchical system
Requires care in modeling state overlaps
Examples:
Genie, Kulp et al. (1996), ISMB, 4, 134-142
GenScan, Burge and Karlin (1997), JMB, 268(1), 78-94
Gene finding software
Signal recognition
Promoter prediction
Splice site prediction
Start codon prediction
Poly-adenylation site prediction
Coding potential
Coding exons
Gene structure prediction
Spliced alignment
LDA/QDA
Neural networks
HMMs and GHMMs
Promoter recognition
PromoterScan
Identify potential promoter regions
Based on databases of known TF binding sites
TFD (Ghosh (1991), TIBS, 16, 445-447)
TRANSFAC (Heinemeyer et al. (1999), NAR, 27, 318-322)
Prestridge (1995), JMB, 249, 923-932
http://bimas.dcrt.nih.gov/molbio/proscan/
MatInd and MatInspector
Finding consensus matches to known TF binding sites
Based on TRANSFAC
Heinemeyer et al. (1999), NAR, 27, 318-322
Quandt et al. (1995), NAR, 23, 4878-4884.
http://transfac.gbf.de/TRANSFAC/
Promoter recognition (cont.)
TSSG/TSSW
LDA-based combination of several features (TATA-box, Inr
signal, upstream regions)
Solovyev et al. (1997), ISMB, 5, 294-302.
http://genomic.sanger.ac.uk/gf/gf.shtml
Transcription Element Search Software (TESS)
Identify TF binding sites
Based on TRANSFAC
http://agave.humgen.upenn.edu/tess/index.html
Promoter recognition (cont.)
CBS Promoter 2.0 Prediction Server
Simulated transcription factors
Principles common to neural networks and genetic algorithms
Knudsen (1999), Bioinformatics 15(5), 356-361.
http://genome.cbs.dtu.dk/services/promoter/
CorePromoter
Position dependent 5-tuple
QDA
Michael Zhang (1998), Genome Research, 8, 319-326.
http://scislio.cshl.org/genefinder/CPROMOTER/
Promoter recognition (cont.)
Neural network promoter prediction (NNPP)
Time-delay neural network
Combining TATA box and initiator
Reese (1999), in preparation.
http://www-hgc.lbl.gov/projects/promoter.html
Example: NNPP
Promoter recognition (cont.)
Markov chain promoter finder
Competing interpolated Markov chains for promoters, exons,
introns
Promoter model consists of five states representing the core
promoter parts
Ohler, Reese et al. (1999), Bioinformatics 15(5), 362-369.
Splice site prediction
Nakata, 1985
Nakata (1985), NAR, 13(14), 5327-5340.
BCM GeneFinder
HSPL - Prediction of splice sites in human DNA sequences
Triplet frequencies in various functional parts of splice site
regions
Combined with codon statistics
Solovyev et al. (1994), NAR, 22(24), 5156-5163.
http://genomic.sanger.ac.uk/gf/gf.shtml
Splice site prediction (cont.)
Neural Network splice site predictor (NNSPLICE)
Multi-layered feed-forward neural network
Modeled after Brunak et al. (1991), JMB, 220, 49-65.
Reese et al. (1997), JCB, 4(3), 311-323.
http://www-hgc.lbl.gov/projects/splice.html
NetGene2
Combination of neural networks and rule-based system
Splice site signal neural network combined with coding potential
Hebsgaard et al. (1996), NAR, 24(17), 3439-3452.
Brunak et al. (1991), JMB, 220, 49-65.
http://www.cbs.dtu.dk/services/NetGene2/
Splice site prediction (cont.)
SplicePredictor
Logitlinear models for splice site regions
Degree of matching to the splice site consensus
Local compositional contrast
Brendel and Kleffe (1998), NAR, 26(20), 4748-4757.
http://gnomic.stanford.edu/~volker/SplicePredictor.html
Start codon prediction
NetStart
Trained on cDNA-like sequences
Neural network based
Local start codon information
Global sequence information
Pedersen and Nielsen (1997), ISMB, 5, 226-233.
http://www.cbs.dtu.dk/services/NetStart/
Poly-adenylation signal prediction
BCM GeneFinder
POLYAH - Recognition of 3'-end cleavage and polyadenylation region
Triplet frequencies in various functional parts in polyadenylation regions
LDA
Solovyev et al. (1994), NAR, 22(24), 5156-5163.
http://genomic.sanger.ac.uk/gf/gf.shtml
Prediction of coding potential
Periodicity detection
Coding sequences have an inherent periodicity of three
Especially good on long coding sequences
Auto-correlation
Seeking the strongest response when shifted sequence is compared
with original
Michel (1986), J. Theor. Biol. 120, 223-236.
Fourier transformation: Spectral analysis
Detection of a peak at frequency 1/3, i.e. a period of three bases
Silverman and Linsker (1986), J. Theor. Biol. 118, 295-300.
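The spectral approach can be illustrated by evaluating the discrete Fourier transform at the single frequency 1/3. The sketch below converts the sequence into four 0/1 indicator sequences (one per base) and sums their power at that frequency. This is a generic illustration, not the method of the cited paper: the indicator encoding and the normalisation by sequence length are assumptions.

```python
import cmath


def period3_power(dna: str) -> float:
    """Summed spectral power at frequency 1/3, normalised by length."""
    seq = dna.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the indicator sequence for this base
        # (1 where the base occurs, 0 elsewhere) at frequency 1/3.
        coeff = sum(cmath.exp(-2j * cmath.pi * pos / 3)
                    for pos, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total / n
```

A sequence with perfect three-base periodicity such as `"ACG" * 30` scores high, while a homopolymer scores close to zero. In practice the value would be computed in a sliding window and compared with the background level.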
Prediction of coding potential
(cont.)
Trifonov (1980; 1987)
G-notG-U periodicity
JMB, 194, 643-652.
Fickett (1982)
Position asymmetry in the three codon positions
NAR 10(17), 5303-5318.
Staden (1984)
Codon usage in tables
NAR 12, 551-567.
Prediction of coding potential
(cont.)
Claverie and Bougueleret (1987)
Hexamer frequency differentials
NAR 14, 179-196.
Fichant and Gautier (1987)
Codon usage homogeneity
CABIOS, 3(4), 287-295.
GRAIL I (1991)
Neural network using a shifting fixed size window
7 sensors as input, 2 hidden layers and 1 unit as output
Uberbacher et al. (1991), PNAS, 88(24), 11261-11265.
Prediction of coding potential
(cont.)
GeneMark (1986)
Inhomogeneous Markov chain models
Easily trainable (closed-form solution for maximum likelihood)
Used extensively in prokaryotic genomes
Borodovsky et al. (1993), Computers & Chemistry, 17, 123-133.
Glimmer (1998)
Interpolated Markov chains from first to eighth order
Salzberg et al. (1998), NAR, 26(2), 544-548.
http://www.tigr.org/softlab/glimmer/glimmer.html
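The closed-form training mentioned above amounts to counting: the maximum-likelihood transition probabilities of a Markov chain are normalised transition counts. The sketch below trains a three-periodic (inhomogeneous) chain, with one transition table per codon position. It is not the GeneMark or Glimmer implementation: a first-order chain is used for brevity, the pseudocount is an added assumption to avoid zero probabilities, and input is assumed to contain only the uppercase letters A, C, G and T.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Model = Dict[Tuple[int, str], Dict[str, float]]


def train_periodic_markov(coding_seqs: List[str], pseudocount: float = 1.0) -> Model:
    """Estimate P(base | codon position, previous base) by counting."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in coding_seqs:
        # Position 0 of each training sequence is the first base of a codon.
        for i in range(1, len(seq)):
            counts[(i % 3, seq[i - 1])][seq[i]] += 1
    model: Model = {}
    for phase in range(3):
        for prev in "ACGT":
            row = counts[(phase, prev)]
            total = sum(row[b] + pseudocount for b in "ACGT")
            model[(phase, prev)] = {b: (row[b] + pseudocount) / total for b in "ACGT"}
    return model


def log_likelihood(seq: str, model: Model, frame: int = 0) -> float:
    """Log-probability of seq[1:] given seq[0]; frame is the codon position of seq[0]."""
    return sum(math.log(model[((i + frame) % 3, seq[i - 1])][seq[i]])
               for i in range(1, len(seq)))
```

Scoring the same window in each of the three frames and comparing the results is one simple way to use such a model as a coding-potential sensor.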
Prediction of coding potential
(cont.)
Review by Fickett (1992)
“Assessment of protein coding measures”, NAR, 20, 6441-6450.
Prediction of coding exons
SorFind
Detection of “spliceable” ORFs
Hutchinson, NAR, 20(13), 3453-3462.
BCM GeneFinder
FEXD, FEXN, FEXA, FEXY, FEXH, HEXON
LDA
Solovyev et al. (1994), NAR, 22(24), 5156-5163.
http://genomic.sanger.ac.uk/gf/gf.shtml
GRAIL II
Exon candidates, heuristic integration, learning with neural network
Uberbacher et al., Genet. Eng., 16, 241-253.
http://compbio.ornl.gov/
“Integrated” gene models:
LDA/QDA
FGene
LDA based
Dynamic programming for the integration of LDA output
Solovyev et al. (1995), ISMB, 3, 367-375.
http://genomic.sanger.ac.uk/gf/gf.shtml
“Integrated” gene models: NN
GeneParser
“Gene-parsing” approach
Potential alternative splicing recognized
Neural network and dynamic programming
Snyder and Stormo (1995), JMB, 248, 1-18.
“Integrated” gene models:
Artificial intelligence approaches
GeneID
Rule-based system
Homology integration
Guigó et al. (1992), JMB, 226, 141-157.
http://www1.imim.es/geneid.html
GeneID using DP
DP to combine a set of potential exons
Guigó et al. (1998), JCB, 5, 681-702.
“Integrated” gene models:
Artificial intelligence approaches
GenLang
Syntactic pattern recognition system
Formal grammar
Tools from computational linguistics
Dong and Searls (1994), Genomics, 23, 540-551.
http://cbil.humgen.upenn.edu/~sdong/genlang_home.html
“Integrated” gene models: HMMs
HMMGene
Several genes per sequence possible
User constraints possible
Krogh (1997), ISMB, 5, 179-186.
http://www.cbs.dtu.dk/services/HMMgene/
GeneMark.hmm
Based on GeneMark program for bacterial sequences
Can predict frame shifts
Trained for various organisms
Lukashin and Borodovsky (1998), NAR, 26, 1107-1115.
http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html
“Integrated” gene models:
GHMMs
Genie
Generalized hidden Markov model with length distribution
Integration of multiple content and signal sensors
Content: codon statistics, repeats, intron, intergenic, database
homology hits
Signal: promoter, start codon, splice sites, stop codon
Dynamic programming to find optimal parse
Several genes per sequence possible
Kulp et al. (1996), ISMB, 4, 134-142.
Reese et al. (1997), JCB, 4(3), 311-323.
http://www.cse.ucsc.edu/~dkulp/cgi-bin/genie
Example: Genie
“Integrated” gene models:
GHMMs
GenScan
Multiple content and signal models
Semi-hidden Markov model sensors with length distribution
Takes GC content into account (separate models)
Several genes per sequence possible
Burge and Karlin (1997), JMB, 268(1), 78-94.
http://CCR-081.mit.edu/GENSCAN.html
EST/cDNA alignment for gene
finding: Spliced alignments
PROCRUSTES
Spliced alignment algorithm
Dynamic programming to combine a set of potential exons
Frame conservation
Homologous sequence needed
Gelfand et al. (1996), PNAS, 93, 9061-9066.
http://hto-13.usc.edu/software/procrustes/
EST/cDNA alignment
Sim4
Aligns cDNA to genomic sequence
Uses local similarity
Florea et al. (1998), Genome Research, 8, 967-974.
GeneWise
Dynamic programming
Partial genes allowed
Based on Pfam and statistical splice site models
Birney (1999), unpublished
http://www.sanger.ac.uk/Software/Wise2
EST/cDNA alignment (cont.)
ACEMBLY
Aligns ESTs to genomic sequence
Identifies alternative splicing
Integrated in ACeDB
Jean Thierry-Mieg (unpublished)
Repeat finders
Censor
Uses database of repeat sequences
Jurka et al. (1996), Comp. and Chem., 20(1), 119-122.
BLAST
Integrated masking operations
XBLAST procedure
Claverie (1994), In Automated DNA Sequencing and Analysis
Techniques, M. D. Adams, C. Fields and J. C. Venter, eds., 267-279.
http://www.ncbi.nlm.nih.gov/BLAST
Repeat finders (cont.)
RepeatMasker
Detection of interspersed repeats
Smit and Green, unpublished results
http://ftp.genome.washington.edu/RM/RepeatMasker.html
Homology searching
BLAST suite
BLASTN,
BLASTX, TBLASTX, PSI-BLAST
Altschul et al. (1990), JMB, 215, 403-410.
http://www.ncbi.nlm.nih.gov/BLAST
FASTA suite
FASTA, TFASTA
Pearson and Lipman (1988), PNAS, 85, 2444-2448.
HMM-based searching
SAM (UCSC group)
http://www.cse.ucsc.edu/research/compbio/sam.html
HMMER, Sean Eddy
http://hmmer.wustl.edu/
Gene family searching
BLOCKS
http://www.blocks.fhcrc.org
PROSITE
http://www.expasy.ch/prosite/
PFAM
http://pfam.wustl.edu/
SCOP
http://scop.mrc-lmb.cam.ac.uk/scop/
The genome annotation
experiment (GASP1)
Genome Annotation Assessment Project (GASP1)
Annotation of 2.9 Mb of Drosophila melanogaster
genomic DNA
Open to everybody, announced on several mailing lists
Participants can use any analysis methods they like
(gene finding programs, homology searches, by-eye
assessment, combination methods, etc.) and should
disclose their methods.
“CASP” like
12 participating groups
URL: http://www.fruitfly.org/GASP1
Goals of the experiment
Compare and contrast various genome annotation
methods
Objective assessment of the state of the art in gene
finding and functional site prediction
Identify outstanding problems in computational
methods for the annotation process
Adh contig
2.9 Mb contiguous Drosophila sequence from the Adh
region, one of the best studied genomic regions
From chromosome 2L (34D-36A)
Ashburner et al., (to appear in Genetics)
222 gene annotations (as of July 22, 1999)
375,585 bases are coding (12.95%)
We chose the Adh region because it was thought to be
typical: a representative test bed for evaluating annotation
techniques.
Adh paper (to appear in Genetics)
URL: http://www.fruitfly.org/publications/PDF/ADH.pdf
Raw sequence: Adh.fa (the slide shows an excerpt of the raw nucleotide sequence)
Drosophila data sets provided to
participants
Curated Drosophila nuclear DNA "coding sequences" (CDS)
Curated non-redundant Drosophila genomic DNA data (275
“multi”- and 144 “single”-exon sequence entries from Genbank)
Drosophila 5' and 3' splice sites
Drosophila start codon sites
Drosophila promoter sequences
Drosophila repeat sequences
Drosophila transposon sequences
Drosophila cDNA sequences
Drosophila EST sequences
URL: http://www.fruitfly.org/GASP1/data/data.html
Timetable
May 13, 1999 - June 30, 1999
Distribution of the sample sequence and associated data to the
predictors. Collection of predictions.
June 30, 1999 - July 31, 1999
Evaluation of the predictions by the Drosophila Genome Center.
August 4, 1999
External expert assessment of the prediction results (HUGO
meeting, EMBL)
August 6, 1999
Tutorial #3 at the ISMB ‘99 conference in Heidelberg, Germany
Resources for assessing predictions
80 cDNA sequences NOT in Genbank before
experiment deadline
Sequenced from 5 different cDNA libraries
3 paralogs to other genes in the genome
19 cDNAs with cloning artifacts
2 apparently representing unspliced RNA
Multiple inserts (2 cDNAs cloned in the same vector)
58 “usable” cDNAs
33 cDNA sequences in Genbank during experiment
Annotations from Adh paper
Curated data sets for assessing
predictions
Standard 1 (Adh.std1.gff) “conservative gene set”
43 gene structures (7 single- and 36 multi-coding-exon genes)
Criteria for inclusion:
>=95% (most >=99%) of the cDNA aligned to genomic DNA (using
sim4)
“GT”/“AG” splice site consensus sequences
Splice site score from neural net
• 5’ splice sites: >=0.35 threshold (98% True Positive score)
• 3’ splice sites: >=0.25 threshold (92% True Positive score)
Start codon and stop codon annotations from Standard 3 (derived
from Adh paper)
These 43 genes represent “typical” genes
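The inclusion criteria for Standard 1 can be read as a simple filter over candidate gene structures. The sketch below is only an illustration of that reading: the thresholds (95% aligned, 0.35 and 0.25 splice scores, GT/AG consensus) come from the slide, while the class names, field names and the assumption that the splice scores act as hard cutoffs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

MIN_ALIGNED_FRACTION = 0.95   # >=95% of the cDNA aligned (sim4)
MIN_DONOR_SCORE = 0.35        # 5' splice site neural-net threshold
MIN_ACCEPTOR_SCORE = 0.25     # 3' splice site neural-net threshold


@dataclass
class Intron:
    donor: str             # first two intron bases, expected "GT"
    acceptor: str          # last two intron bases, expected "AG"
    donor_score: float     # hypothetical neural-net score for the 5' site
    acceptor_score: float  # hypothetical neural-net score for the 3' site


@dataclass
class CandidateGene:
    name: str
    aligned_fraction: float            # fraction of the cDNA aligned to genomic DNA
    introns: List[Intron] = field(default_factory=list)


def passes_standard1(gene: CandidateGene) -> bool:
    """Apply the Standard 1 criteria as hard cutoffs (an assumption)."""
    if gene.aligned_fraction < MIN_ALIGNED_FRACTION:
        return False
    for intron in gene.introns:
        if intron.donor.upper() != "GT" or intron.acceptor.upper() != "AG":
            return False
        if intron.donor_score < MIN_DONOR_SCORE:
            return False
        if intron.acceptor_score < MIN_ACCEPTOR_SCORE:
            return False
    return True


example = CandidateGene("hypothetical_gene", 0.99, [Intron("GT", "AG", 0.90, 0.80)])
print(passes_standard1(example))
```

Under this reading, a single-exon gene has no introns to check and is accepted on the alignment criterion alone.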
Curated data sets for assessing
predictions
Standard 2 (Adh.std2.gff)
Superset of Standard 1
15 additional gene structures
Same alignment criteria as Standard 1 but no splice site
consensus requirement
Not used in the experiment
Curated data sets for assessment
Standard 3 (Adh.std3.gff) “more complete gene set”
222 gene structures (39 single- and 183 multi-coding-exon genes)
Criteria:
Annotated as described in Ashburner et al.
cDNA to genomic alignment using sim4
Start codons predicted by ORFFinder (Frise et al., unpublished)
~182 genes have similarity to a homologous protein sequence in
another organism or have a Drosophila EST hit
• Edge verification by partial EST/cDNA alignments
• BLASTX, TBLASTX homology results
• PFAM alignments
• Gene structure verification using GenScan (human)
14 genes had EST/homology hits but no gene finding predictions
~40 genes only have “strong” GenScan predictions
Submission format
GFF (Durbin and Haussler, 1998, unpublished)
http://www.sanger.ac.uk/Software/GFF/
Sample submission
# organism: Drosophila melanogaster
# std1
Gene 1
Adh  std1  TFBS             32002  32006  .  +  .
Adh  std1  TATA_signal      32009  32012  .  +  .  transcript "1"
Adh  std1  TSS              32033  32034  .  +  .  transcript "1"
Adh  std1  prim_transcript  32034  33122  .  +  .  transcript "1"
Adh  std1  exon             32034  32277  .  +  .  transcript "1"
Adh  std1  start_codon      32122  32124  .  +  .  transcript "1"
Adh  std1  CDS              32122  32277  .  +  .  transcript "1"
Adh  std1  splice5          32277  32278  .  +  .  transcript "1"
Adh  std1  splice3          32332  32333  .  +  .  transcript "1"
Adh  std1  exon             32785  32830  .  +  .  transcript "1"
Adh  std1  CDS              32785  32830  .  +  .  transcript "1"
Adh  std1  splice5          32830  32831  .  +  .  transcript "1"
Adh  std1  splice3          32825  32826  .  +  .  transcript "1"
Adh  std1  CDS              32826  33003  .  +  .  transcript "1"
Adh  std1  exon             32826  33122  .  +  .  transcript "1"
Adh  std1  stop_codon       33001  33003  .  +  .  transcript "1"
Adh  std1  polyA_signal     33090  33095  .  +  .  transcript "1"
Adh  std1  polyA_site       33101  33102  .  +  .  transcript "1"
Gene 2
Adh  std1  prim_transcript  38100  41973  .  -  .  transcript "2"
Adh  std1  exon             38100  41973  .  -  .  transcript "2"
Adh  std1  polyA_site       39620  39621  .  -  .  transcript "2"
Adh  std1  polyA_signal     39685  39690  .  -  .  transcript "2"
Adh  std1  stop_codon       40125  40127  .  -  .  transcript "2"
Adh  std1  CDS              40125  40390  .  -  .  transcript "2"
Adh  std1  start_codon      40388  40390  .  -  .  transcript "2"
Adh  std1  TSS              41973  41974  .  -  .  transcript "2"
Adh  std1  TATA_signal      41998  42001  .  -  .  transcript "2"
Adh  std1  TFBS             42187  42193  .  -  .
Adh  std1  TFBS             42211  42216  .  -  .
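A submission in this format is straightforward to read programmatically. The following is a minimal sketch, assuming the nine-column layout of the sample submission (sequence name, source, feature, start, end, score, strand, frame, optional group); the class and function names are hypothetical, and splitting on any whitespace rather than strictly on tabs is a tolerance added here for illustration.

```python
from typing import List, NamedTuple


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    group: str  # empty string when the line has no group field


def parse_gff(text: str) -> List[GffFeature]:
    """Parse GFF-style lines into records, skipping blank and '#' lines."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 8)  # at most 9 fields; group keeps its spaces
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 fields: {raw!r}")
        seqname, source, feature, start, end, score, strand, frame = fields[:8]
        group = fields[8] if len(fields) == 9 else ""
        records.append(GffFeature(seqname, source, feature,
                                  int(start), int(end), score, strand, frame, group))
    return records


SAMPLE = """\
# organism: Drosophila melanogaster
# std1
Adh\tstd1\tTFBS\t32002\t32006\t.\t+\t.
Adh\tstd1\texon\t32034\t32277\t.\t+\t.\ttranscript "1"
Adh\tstd1\tCDS\t32122\t32277\t.\t+\t.\ttranscript "1"
"""

features = parse_gff(SAMPLE)
cds = [f for f in features if f.feature == "CDS"]
print(len(features), cds[0].start, cds[0].end)
```

Keeping score and frame as strings avoids having to special-case the "." placeholder used throughout the sample.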
Submissions
MAGPIE Team
Credit
Terry Gaasterland, Alexander Sczyrba, Elizabeth Thomas, Gulriz
Kurban, Paul Gordon, Christoph Sensen
Laboratory for Computational Genomics, Rockefeller and Institute
for Marine Biosciences, Canada
Method
Automatic genome analysis system integrating Drosophila Genscan
predictions, confirming exon boundaries using database searches,
repeat finding (Calypso, REPuter) and gene function annotations.
Submissions (cont.)
References
“Multigenome MAGPIE” poster at ISMB ‘99.
Gaasterland and Ragan (1998), J. of Microbial and Comparative
Genomics, 3, 305-312.
Gaasterland and Sensen (1996), Biochimie 78, 302-310.
REPuter: Kurtz and Schleiermacher (1999), Bioinformatics 15(5),
426-427.
Submissions (cont.)
Computational Genomics Group, The Sanger Centre
Credit
Victor Solovyev, Asaf Salamov
Method
Discriminant analysis based gene prediction programs FGenes
(trained for Human) and FGenesH (trained for Drosophila);
Combining the output of FGenes, FGenesH and BLAST using
FGenesH+. 3 different “threshold” annotations are submitted.
The program's running time is linear in the sequence length.
Automatic, plus additional user interactive screening.
Non-redundant NCBI database used for BLAST.
URL/References
http://genomic.sanger.ac.uk/gf/gf.shtml
Submissions (cont.)
Genome Annotation Group, The Sanger Centre
Credit
Ewan Birney
Method
Protein family based gene identification using Wise2 (previously
Genewise) and PFAM.
URL
http://www.sanger.ac.uk/Software/Wise2
Submissions (cont.)
Pattern Recognition, The University of Erlangen
Credit
Uwe Ohler, Georg Stemmer, Stefan Harbeck, Heinrich Niemann
Method
Promoter recognition based on interpolated Markov chains;
“Genscan” like promoter model (MCPromoter); maximal mutual
information based estimation of interpolated Markov chains.
Automatic.
Promoter training data set from
http://www.fruitfly.org/data/genesets
Submissions (cont.)
References
Ohler, Harbeck, Niemann, Noeth and Reese (1999), Bioinformatics
15(5), 362-369.
Ohler, Harbeck and Niemann (1999), Proc. EUROSPEECH, to appear.
URL
http://www5.informatik.uni-erlangen/HTML/English/Research/Promoter
Submissions (cont.)
Computational Biosciences, Oak Ridge National
Laboratory
Credit
Richard J. Mural, Douglas Hyatt, Frank Larimer, Manesh Shah,
Morey Parang
Method
Integrated neural network based system including gene assembly
using EST and homology information (GRAILexp).
URL:
http://compbio.ornl.gov/droso
Submissions (cont.)
Center for Biological Sequence Analysis, Technical
University of Denmark
Credit
Anders Krogh
Method
Modular HMM incorporating database hits (proteins and
ESTs/cDNAs) and other “external information” probabilistically
(HMMGene); the HMM has modules for coding regions, splice sites,
translation start/stop, etc.
It will be a fully automated system.
Trained on Drosophila data
• http://www.fruitfly.org/GSAC1/data/data.html
and
• Victor Solovyev (personal communication)
Submissions (cont.)
References
Krogh (1998), In S.L. Salzberg et al., eds., Computational Methods in
Molecular Biology, 45-63, Elsevier.
Krogh (1997), Gaasterland et al., eds., Proc. ISMB 97, 179-186.
http://www.cbs.dtu.dk/krogh/refs.html
URL
http://www.cbs.dtu.dk/services/HMMgene/
Not yet for Drosophila.
Submissions (cont.)
BLOCKS group, Fred Hutchinson Cancer Research
Center in Seattle, Washington
Credit
Jorja Henikoff, Steve Henikoff
Method
DNA translation in 6 frames and search against BLOCKS+ and
against BLOCKS extracted from Smart3.0
(http://coot.embl-heidelberg.de/SMART/) using BLIMPS; automatic
post-processing to join multiple predictions from the same block.
Automatic with some user interactive screening of results.
Submissions (cont.)
References
Henikoff, Henikoff and Pietrokovski (1999), Nucl. Acids Res., 27,
226-228.
Henikoff and Henikoff (1994), Proc. 27th Ann. Hawaii Intl. Conf. On
System Sciences, 265-274.
Henikoff and Henikoff (1994), Genomics, 19, 97-107.
URL
http://blocks.fhcrc.org
http://blocks.fhcrc.org/blocks-bin/getblock.sh?<block name>
Submissions (cont.)
Genome Informatics Team, IMIM, Barcelona, Spain
Credit
Roderic Guigó, Josep F. Abril, Enrique Blanco, Moises Burset, Genis
Parra
Method
Dynamic programming based system to combine potential exon
candidates modeled as a fifth order Markov model and functional
sequence sites modeled as a position weight matrix (Geneid version 3).
Fully automatic, very fast.
Trained on Drosophila data
• http://www.fruitfly.org/GSAC1/data/data.html
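Modeling a functional site as a position weight matrix can be sketched in a few lines. The example below is a generic log-odds matrix built from aligned example sites with a pseudocount and a uniform background; it is not Geneid's actual model or parameters, and the function names and the four toy donor-like sites are hypothetical.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def build_pwm(sites: List[str], pseudocount: float = 1.0,
              background: float = 0.25) -> List[Dict[str, float]]:
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return pwm


def score_site(pwm: List[Dict[str, float]], window: str) -> float:
    """Sum the per-position log-odds for a window of the same length as the matrix."""
    if len(window) != len(pwm):
        raise ValueError("window length must match matrix length")
    return sum(column[base] for column, base in zip(pwm, window.upper()))


# Toy training sites (made up for illustration, not Drosophila data).
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTAAGT"]
pwm = build_pwm(toy_sites)
print(score_site(pwm, "CAGGTAAGT"), score_site(pwm, "CAGCCAAGT"))
```

A window that matches the training sites scores higher than one that disrupts the conserved positions, which is the property a gene finder uses when ranking candidate signal sites.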
Submissions (cont.)
References
Guigó et al. (1998), JCB , 5, 681-702.
URL
Information on training process:
• http://www1.imim.es/~rguigo/AnnotationExperiment/index.html
http://www1.imim.es/geneid.html
Submissions (cont.)
Mark Borodovsky's Lab, School of Biology, Georgia
Institute of Technology
Credit
Mark Borodovsky, John Besemer
Method
Markov chain models combined with HMM technology
(Genemark.hmm).
URL
http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html
Submissions (cont.)
Biodivision, GSF Forschungszentrum für Umwelt und
Gesundheit, Neuherberg, Germany
Credit
Matthias Scherf, Andreas Klingenhoff, Thomas Werner
Method
Universal sequence classifier which is based on a correlated word
analysis to predict initiators and promoter associated TATA boxes
(CoreInspector V1.0 beta). Sequences of 100 bp are classified at once.
Trained on Eukaryotic Promoter Database (EPD version 5.9).
Fully automatic, 2 seconds per 1Kb.
References
Scherf et al. (1999), in preparation.
URL
http://www.gsf.de/biodv/
Submissions (cont.)
The Department of Biomathematical Sciences, Mount
Sinai School of Medicine, New York
Credit
Gary Benson
Method
Tandem repeats finder (TRF v2.02) uses a theoretical model of the
similarity between adjacent copies of a pattern (patterns of 1-500 bp
are recognized); dynamic programming for candidate validation.
Fully automatic; very fast (seconds per 1 Mb).
http://c3.biomath.mssm.edu/trf/Adh.fa.2.7.7.80.10.50.500.1.html
References
Benson (1999), Nucl. Acids Res., 27(2), 573-580.
URL
http://c3.biomath.mssm.edu/trf.html
Submissions (cont.)
Genie, UC Berkeley/UC Santa Cruz/ Neomorphic Inc.
Credit
Martin G. Reese, David Kulp, Hari Tammana, David Haussler
Method
Generalized hidden Markov model with optional integration of EST
hits and homology searches (Genie).
Trained on Drosophila data
• http://www.fruitfly.org/GSAC1/data/data.html
Semi-automatic, in that the overlaps of the analyzed sequence contigs
(110 kb) were manually rerun with Genie to resolve conflicts.
BLAST used for homology searches on non-redundant protein
database (nr).
Submissions (cont.)
References
Reese et al. (1997), JCB, 4(3), 311-323.
Kulp et al. (1997), Biocomputing: Proc. Of the 1997 PSB conference,
232-244.
Kulp et al. (1996), ISMB, 4, 134-142.
URL
http://www.neomorphic.com/genie
Submission classes
Program name
Gene
finding
Mural et al.
Oak Ridge, US
GRAILexp
X
Guigó et al.
Barcelona, ES
GeneID
X
Krogh
Copenhagen, DK
HMMGene
X
Borodovsky et al.
Georgia, US
GeneMark.hmm
X
Henikoff et al.
Fred Hutchinson,
Seattle, US
Solovyev et al.
Sanger, UK
BLOCKS
FGenes/FGenesH
Promoter
EST/cDNA
recognition Alignment
Protein
Repeat
similarity
X
Gene
function
X
X
X
X
Submission classes (cont.)
Program name
Gaasterland et al.
Rockefeller, US MAGPIE
Benson et al.
Mount Sinai, US
TRF
Werner et al.
Munich, GER
CoreInspector
Gene
finding
X
Reese et al.
Berkeley/Santa
Cruz, US
X
X
X
Gene
function
X
X
X
Wise2
Genie
Protein
Repeat
similarity
X
Ohler et al.
Nuremberg, GER MCPromoter
Birney
Sanger, UK
Promoter
EST/cDNA
recognition Alignment
X
X
X
X
Gene finding techniques
Program name
Statistics Promoter EST/cDNA
Alignment
Mural et al.
Oak Ridge, US
GRAILexp
X
Guigó et al.
Barcelona, ES
GeneID
X
Krogh
Copenhagen, DK
HMMGene
X
Borodovsky et al.
Georgia, US
GeneMark.hmm
X
Solovyev et al.
Sanger, UK
FGenes/FGenesH
X
Gaasterland et al.
Rockefeller, US
MAGPIE
X
X
X
Genie
X
X
X
Reese et al.
Berkeley/Santa
Cruz, US
Protein
similarity
X
X
X
X
Measuring success
By nucleotide
Sensitivity/Specificity (Sn/Sp)
By exon
Sn/Sp
Missed exons (ME), wrong exons (WE)
By gene
Sn/Sp
Missed genes (MG), wrong genes (WG)
Average overlap statistics
Based on Burset and Guigó (1996), “Evaluation of gene
structure prediction programs”, Genomics, 34(3), 353-367.
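The exon-level measures can be sketched as follows, assuming the usual conventions of this evaluation style: an exon counts as a true positive only when both boundaries match exactly, a missed exon is an actual exon overlapped by no prediction, and a wrong exon is a predicted exon overlapping no actual exon. The function names and the toy coordinates are hypothetical, and ME/WE are returned as fractions rather than percentages.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates


def overlaps(a: Exon, b: Exon) -> bool:
    """True when the two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_stats(actual: List[Exon], predicted: List[Exon]) -> Dict[str, float]:
    """Compute exon-level Sn, Sp, ME and WE from two exon lists."""
    actual_set, predicted_set = set(actual), set(predicted)
    tp = len(actual_set & predicted_set)  # exact boundary matches only
    missed = sum(1 for a in actual_set
                 if not any(overlaps(a, p) for p in predicted_set))
    wrong = sum(1 for p in predicted_set
                if not any(overlaps(p, a) for a in actual_set))
    n_actual, n_predicted = len(actual_set), len(predicted_set)
    return {
        "Sn": tp / n_actual if n_actual else 0.0,
        "Sp": tp / n_predicted if n_predicted else 0.0,
        "ME": missed / n_actual if n_actual else 0.0,
        "WE": wrong / n_predicted if n_predicted else 0.0,
    }


actual_exons = [(100, 200), (300, 400), (500, 600)]
predicted_exons = [(100, 200), (310, 400), (700, 800)]
stats = exon_level_stats(actual_exons, predicted_exons)
print(stats)
```

In the toy data, the prediction (310, 400) overlaps an actual exon but has a wrong boundary, so it is neither a true positive nor a wrong exon.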
Definitions and formulae
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
TP = True positive
FP = False positive
FN = False negative
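As a small worked example of these two formulae, the sketch below computes Sn and Sp as exact fractions from raw counts. The function names are hypothetical. Note that Sp as defined here, TP/(TP+FP), is the convention of the gene-finding literature and corresponds to what is elsewhere called precision or positive predictive value.

```python
from fractions import Fraction


def sensitivity(tp: int, fn: int) -> Fraction:
    """Sn = TP / (TP + FN); returns 0 when there is nothing to find."""
    return Fraction(tp, tp + fn) if tp + fn else Fraction(0)


def specificity(tp: int, fp: int) -> Fraction:
    """Sp = TP / (TP + FP); returns 0 when nothing was predicted."""
    return Fraction(tp, tp + fp) if tp + fp else Fraction(0)


# Counts for a prediction with 2 true positives, 5 false positives, 1 false negative.
print(sensitivity(2, 1), specificity(2, 5))
```

Using exact fractions keeps results such as 2/3 and 2/7 readable in the same form the toy examples use.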
Genes: True positives (TP)
Genes: False positives (FP)
Genes: False Negatives (FN)
Toy example 1 (1)
Std1    TP  FP  FN  SN   SP
Pred1   2   1   1   2/3  2/3
Pred2   2   5   1   2/3  2/7
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
Genes: Missing Genes (MG)
Genes: Wrong Genes (WG)
Toy example 1 (2)
Std1    TP  FP  FN  SN   SP   MG  WG
Pred1   2   1   1   2/3  2/3  1   1
Pred2   2   5   1   2/3  2/7  0   4
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
Genes: Std 1 versus Std 3
Std1: “conservative gene set”
Std3: “more complete gene set”
Toy example 1 (3)
        TP  FP  FN  SN   SP   MG  WG
Std1
Pred1   2   1   1   2/3  2/3  1   1
Pred2   2   5   1   2/3  2/7  0   4
Std3
Pred1   2   1   2   2/4  2/3  2   1
Pred2   3   4   1   3/4  3/7  0   3
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
Genes: Std1 and Std3 versus
“real” gene structure
Toy example 1 (4)
        TP  FP  FN  SN   SP   MG  WG
Std1
Pred1   2   1   1   2/3  2/3  1   1
Pred2   2   5   1   2/3  2/7  0   4
Std3
Pred1   2   1   2   2/4  2/3  2   1
Pred2   3   4   1   3/4  3/7  0   3
"Real"
Pred1   3   0   1   3/4  3/3  1   0
Pred2   3   4   1   3/4  3/7  0   3
Toy example 1 (5): Exon level
        TP  FP  FN  SN   SP    ME  WE
Std1
Pred1   5   2   1   5/6  5/7   1   2
Pred2   4   8   2   2/3  1/3   1   7
Std3
Pred1   5   2   2   5/7  5/7   2   2
Pred2   5   7   2   5/7  5/12  1   6
"Real"
Pred1   7   0   2   7/9  7/7   1   0
Pred2   6   6   3   2/3  1/2   1   5
Genes: Joined genes (JG)
Genes: Split genes (SG)
Definition: “Joined” and “split”
genes
JG = (# Actual genes that overlap predicted genes) / (# Predicted genes that overlap one or more actual genes)
SG = (# Predicted genes that overlap actual genes) / (# Actual genes that overlap one or more predicted genes)
JG > 1, tendency to join multiple actual genes into one
prediction
SG > 1, tendency to split actual genes into separate
gene predictions
Inspired by Hayes and Guigó (1999), unpublished.
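One way to compute the two ratios is sketched below. It reads each numerator as the number of overlapping (actual, predicted) gene pairs, so that JG is the average number of actual genes hit per overlapping prediction and SG is the average number of predictions per overlapped actual gene. That reading is an assumption, chosen because it yields JG = 2, SG = 1 for a prediction that joins two genes and SG = 1.33 when one of three genes is split in two. The function names and coordinates are hypothetical.

```python
from typing import List, Tuple

Gene = Tuple[int, int]  # (start, end), inclusive coordinates


def overlaps(a: Gene, b: Gene) -> bool:
    """True when the two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def joined_split(actual: List[Gene], predicted: List[Gene]) -> Tuple[float, float]:
    """Return (JG, SG) under the pair-counting interpretation described above."""
    pairs = [(a, p) for a in actual for p in predicted if overlaps(a, p)]
    actual_hit = {a for a, _ in pairs}      # actual genes overlapped by a prediction
    predicted_hit = {p for _, p in pairs}   # predictions overlapping an actual gene
    jg = len(pairs) / len(predicted_hit) if predicted_hit else 0.0
    sg = len(pairs) / len(actual_hit) if actual_hit else 0.0
    return jg, sg


actual_genes = [(100, 200), (300, 400), (600, 700)]
joining_prediction = [(100, 400), (800, 900)]          # one call spans two genes
splitting_prediction = [(100, 200), (300, 340), (360, 400), (600, 650),
                        (10, 50), (220, 260), (450, 500), (800, 900)]
print(joined_split(actual_genes, joining_prediction))
print(joined_split(actual_genes, splitting_prediction))
```

Predictions that overlap no actual gene do not enter either ratio; they are accounted for separately as wrong genes.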
Toy example 2 (1)
Std1    TP  FP  FN  SN   SP   MG  WG  JG  SG
Pred1   0   2   3   0    0    1   1   2   1
Pred2   1   7   2   1/3  1/8  0   4   1   1.33
Annotation experiment results
Results available during tutorial and at
http://www.fruitfly.org/GASP1/results/
Results: Base level
Program         Sn (Std1)  Sp (Std3)
Fgenes CGG1     0.89       0.77
Fgenes CGG2     0.49       0.86
Fgenes CGG3     0.93       0.60
GeneID v1       0.48       0.84
GeneID v2       0.86       0.83
GeneMark.hmm    0.96       0.86
Genie           0.96       0.92
Genie EST       0.97       0.91
Genie EST HOM   0.97       0.83
HMMGene         0.97       0.91
MAGPIE          0.96       0.63
GRAILexp        0.81       0.86
Sensitivity:
Low variability among predictors
~95% coverage of the proteome
Specificity:
~90%
Programs that are more like Genscan (used for original
annotation) might do better?
Results: Exon level
Program         Sn (Std1)  Sp (Std3)  ME(%) (Std1)  WE(%) (Std3)
Fgenes CGG1     0.65       0.49       10.5          31.6
Fgenes CGG2     0.44       0.68       45.5          17.2
Fgenes CGG3     0.75       0.24       5.6           53.3
GeneID v1       0.27       0.29       54.4          47.9
GeneID v2       0.58       0.34       21.1          47.4
GeneMark.hmm    0.70       0.47       8.1           28.9
Genie           0.70       0.57       8.1           17.4
Genie EST       0.77       0.55       4.8           20.1
Genie EST HOM   0.79       0.52       3.2           22.8
HMMGene         0.68       0.53       4.8           20.2
MAGPIE          0.63       0.41       12.1          50.2
GRAILexp        0.42       0.41       24.3          28.7
Higher variability among predictors
Up to ~75% sensitivity (both exon boundaries correct)
55% specificity
Low specificity because partial exon overlaps do not count
Missing exons below 5%
Many wrong exons (~20%)
Results: Gene level
Program         Sn (Std1)  Sp (Std3)  MG(%) (Std1)  WG(%) (Std3)  SG    JG
Fgenes CGG1     0.51       0.36       27.9          50.3          1.10  1.06
Fgenes CGG2     0.16       0.32       81.3          33.8          1.10  1.09
Fgenes CGG3     0.60       0.14       13.9          74.5          2.11  1.08
GeneID v1       0.07       0.07       81.3          85.4          1.06  1.62
GeneID v2       0.35       0.14       46.5          72.2          1.06  1.11
GeneMark.hmm    0.56       0.31       20.9          53.5          1.07  1.11
Genie           0.56       0.37       18.6          39.0          1.17  1.08
Genie EST       0.65       0.38       11.6          41.8          1.15  1.09
Genie EST HOM   0.65       0.34       9.3           45.7          1.16  1.09
HMMGene         0.56       0.39       11.6          42.0          1.04  1.12
MAGPIE          0.47       0.25       27.9          67.0          1.22  1.06
GRAILexp        0.33       0.21       37.2          52.0          1.23  1.08
Results: Gene level
60% of actual genes predicted completely correct
Specificity only 30-40%
5-10% missed genes (comparable to Sanger Center)
40% wrong genes, a lot of short genes over-predicted
(possibly not annotated in Standard 3)
Splitting genes is a bigger problem than joining genes
Results (protein homology):
Base level
Program            Sn (Std1)  Sp (Std3)
BLOCKS             0.04       0.80
Wise2              0.12       0.82
MAGPIE cDNA        0.02       0.55
MAGPIE EST         0.31       0.32
GRAIL Similarity   0.31       0.81
Results (protein homology):
Exon level
Program            Sn (Std1)  Sp (Std3)  ME(%) (Std3)  WE(%) (Std3)
BLOCKS             0.00       0.00       86.1          13.2
Wise2              0.06       0.09       77.2          14.2
MAGPIE cDNA        0.00       0.04       98.3          25.4
MAGPIE EST         0.02       0.00       64.2          56.4
GRAIL Similarity   0.07       0.35       54.4          12.4
Results (protein homology):
Gene level
Program            Sn (Std1)  Sp (Std3)  MG(%) (Std3)  WG(%) (Std3)
BLOCKS             0.00       0.00       95.3          17.5
Wise2              0.00       0.00       90.6          15.7
MAGPIE cDNA        0.00       0.00       97.6          52.6
MAGPIE EST         0.00       0.00       88.3          58.5
GRAIL Similarity   0.07       0.18       74.4          29.7
Transcription Start Site (TSS):
Standard 1
TSS: Standard 3
Results:
TSS recognition
Program         Likely (7.7%)  Unlikely (6.5%)  Possible (86.8%)
MAGPIE          153 (36.3%)    29 (6.8%)        239 (56.7%)
Genie           143 (61.1%)    62 (26.4%)       29 (12.3%)
MCPromoter      80 (9.2%)      170 (19.5%)      619 (71.2%)
CoreInspector   3 (13.0%)      3 (13.0%)        17 (74.0%)
Interesting gene examples:
bubblegum
Adh/Adhr (Alcohol
dehydrogenase/Adh related)
Adh/Adhr (cont.)
osp (outspread)
Contains Adh and Adhr embedded in an intron
cact (cactus)
kuz (kuzbanian)
beat (beaten path)
Idgf1, Idgf2, Idgf3 (Imaginal Disc Growth Factor)
Idgf1, Idgf2, Idgf3 (cont.)
Chitinase-related
Gene function has changed (now a growth factor)
Conclusion of GASP1
95% coverage of the proteome
Base level prediction is easier, exon level prediction is harder
Small genes over-predicted (?)
Long introns
The high number of “wrong genes” indicates possible
incomplete annotation in Standard 3 (Are there more genes?)
HMMs currently seem to be the best approach
Major improvements in multiple gene regions
Conclusion GASP1 (cont.)
Much lower false positive rates
Methods optimized for organism of interest do better
Including homology in gene finding does not always improve
prediction
Split genes are more of a problem than joined genes
No program is perfect
Discussion GASP1
Genes in introns
Alternative splicing
Genomic contamination in cDNA libraries
Translation start prediction
Biological verification of prediction needed
Improve test bed by cDNA sequencing
More regulation data needed to confirm promoter assessment
Combining methods
Better methods needed
GASP 2 ?
Conclusions on annotating
complete eukaryotic genomes
Throughput has to improve dramatically
Not only genes but also their relationships have to be
elucidated
Complete transcript cDNAs are a very powerful tool for
annotation, including alternative transcripts
Comparative genomics as well as expression analysis
improves/completes genome annotation
Standardization efforts needed (ontology working
group, OMG, OiB, NCBI/EBI, Bioxml, etc.)
Standards for description of gene products
Exchange format (GFF, Genbank, EMBL, XML)
Conclusions on annotating complete
eukaryotic genomes (cont.)
Maintenance requires even more effort than the original
development
Automated methods are not good enough
Human curators can cause problems too
Functional assignment by homology is sometimes
unreliable
Discussion on annotating complete
eukaryotic genomes
Re-annotation: updating results and annotations over
time
Genomic sequence changes (indels, point mutations)
Analysis software changes
New entries in public sequence databases
Entries removed from sequence databases
Audit trail for annotations
Master copy of genome annotations should reside in the
model organism databases where the expertise resides
Community collaborative annotation
Acknowledgments
Uwe Ohler (University of Erlangen, Germany)
Gerry Rubin (UC Berkeley)
Sima Misra (UC Berkeley)
Erwin Frise (UC Berkeley)
Roderic Guigó (Barcelona)
GFF team (headed by Richard Bruskiewich, Sanger Centre)
Assessment team: Michael Ashburner (EBI), Peer Bork (EMBL),
Richard Durbin (Sanger), Roderic Guigó (Barcelona), Tim
Hubbard (Sanger)
Annotation experiment participants