Abstract Outline Goals Overview of genome annotation tools

The challenge of annotating a complete eukaryotic genome:
A case study in Drosophila melanogaster
Martin G. Reese
Nomi L. Harris
George Hartzell
Suzanna E. Lewis
Drosophila Genome Center
Department of Molecular and Cell Biology
539 Life Sciences Addition
University of California, Berkeley
Reese et al., Tutorial #3, ISMB ‘99
Abstract
Many of the technical issues involved in sequencing complete genomes are essentially solved.
Technologies already exist that provide sufficient solutions for ascertaining sequencing error rates
and for assembling sequence data. Currently, however, standards or rules for the annotation
process are still an outstanding problem.
How shall the genomes be annotated, what shall be annotated, which computational tools are most
effective, how reliable are these annotations, how organism-specific do the tools have to be and
ultimately how should the computational results be presented to the community? All these
questions are unsolved. This tutorial will give an overview and assessment of the current state of
annotation based upon experiences gained at the Drosophila melanogaster genome project.
In the tutorial we will do three things. First, we will break down the annotation process and discuss
the various aspects of the problem. This will serve to clarify the term "annotation", which is often
used to collectively describe a process that has a number of discrete steps. Second, with the
participation of computational biologists from the community we will compare existing tools for
sequence annotation. We will do this by providing a 3 megabase sequence that has already been
well-characterized at our center as a testbed for evaluating other feature-finding algorithms. This is
similar to what has been done for protein structure prediction at the CASP (Critical Assessment of
Techniques for Protein Structure Prediction) conferences (http://predictioncenter.llnl.gov). Third, we
will discuss which annotation problems are essentially solved and which problems remain.

Tutorial goals
- Review the algorithms currently used in annotation
- Assess existing methods under “field” conditions
- Identify open issues in annotation

Tutorial organization
- Definitions
- Annotation
  - “Biological” issues
  - “Engineering” issues
  - Application of tools within an existing annotation system
- Break (20 minutes)
- Review of existing tools
- Our annotation experiment
- Conclusions and outstanding issues

What is a gene?
- Definition: An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule, which in turn influences some characteristic phenotype of the organism.

What are annotations?
- Definition: Features on the genome derived through the transformation of raw genomic sequences into information by integrating computational tools, auxiliary biological data, and biological knowledge.

How does an annotation differ from a gene?
- Many annotations are the same as ‘genes’
  - The annotation describes an inheritable trait associated with a region of DNA.
- But an annotation may not always correspond in this way, e.g. an STS, or a sequence overlap
  - The region of genomic DNA or RNA is not translated or transcribed

Transcription and translation

Schematic gene structure
[Figure: schematic gene structure, from DNA to active protein]
- DNA: promoter, TSS, exon 1 (ATG ... GT), intron 1, exon 2 (AG ... GT), intron 2, exon 3 (AG ... TAA)
- transcription -> preRNA: exon 1, intron 1, exon 2, intron 2, exon 3 (same ATG, GT, AG and TAA signals)
- splicing -> mRNA: 5' UTR, ORF (ATG ... TAA), 3' UTR, polyA tail (AAAAAAAAA)
- translation -> primary translation: amino acid sequence (MPYCPLTW...GFL)
- modification -> active protein (CPLTW...G), with [cleavage product] and [glycosylation site] marked

Sequence feature types
- Transcribed region
  - mRNA, tRNA, snoRNA, snRNA, rRNA
- Structural region
  - Exon, intron, 5’ UTR, 3’ UTR, ORF, cleavage product
  - Mutations: insertion, deletion, substitution, inversion, translocation
- Functional or signal region
  - Promoter, enhancer, DNA/RNA binding site, splice site signal, polyadenylation signal
  - Protein processing: glycosylation, methylation, phosphorylation site
- Similarity
  - Homolog, paralog, genomic overlap (syntenic region)
- Other feature types
  - Transposable element, repetitive element
  - Pseudogene
  - STS, insertion site

DNA transcription unit features
- Promoter elements
  - Core promoter elements
    - TATA box
    - Initiator (Inr)
    - Downstream promoter element (DPE)
  - Transcription factor (“TF”) binding sites
    - CAAT boxes
    - GC boxes
    - SP-1 sites
    - GAGA boxes
  - Enhancer site(s)

mRNA features
- Exon
  - Initial, internal, terminal
  - Codon usage, preference
  - Control elements (e.g. splice enhancers)
- Intron
  - 5’ splice site (“GT”), branchpoint (lariat), 3’ splice site (“AG”)
  - Repeat elements
  - Control elements (e.g. splice enhancers)
  - RNA binding sites (cis-acting elements)
- Start codon (translation start site)
  - “Kozak” rule
- UTR (untranslated regions)
  - 5’ UTR
    - Translation regulatory elements
    - RNA binding sites
  - 3’ UTR
    - Poly-adenylation signal and site
    - RNA destabilization signal
- Stop codon

Definitions for data modeling
- Feature: An interval or an ordered set of intervals on a sequence that describes some biological attribute and is justified by evidence.
- Sequence: A linear molecule of DNA, RNA or amino acids.
- Evidence: A computational or experimental result coming out of an analysis of a sequence.
- Annotation: A set of features.

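The four data-modeling definitions (feature, sequence, evidence, annotation) map naturally onto a small set of record types. The following is a minimal sketch, not the BDGP schema: the class names mirror the definitions, but every field name (`kind`, `intervals`, `source`, ...) and the 0-based half-open coordinate convention are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Sequence:
    """A linear molecule of DNA, RNA or amino acids."""
    name: str
    residues: str
    molecule: str = "DNA"  # assumed labels: "DNA", "RNA" or "protein"


@dataclass(frozen=True)
class Evidence:
    """A computational or experimental result from analysing a sequence."""
    source: str       # hypothetical field: program or experiment name
    description: str


@dataclass
class Feature:
    """An ordered set of intervals on a sequence, justified by evidence."""
    kind: str
    intervals: List[Tuple[int, int]]  # assumed 0-based, half-open (start, end)
    evidence: List[Evidence]

    def __post_init__(self) -> None:
        if not self.evidence:
            raise ValueError("a feature must be justified by evidence")
        if not self.intervals:
            raise ValueError("a feature needs at least one interval")
        previous_end = None
        for start, end in self.intervals:
            if start >= end:
                raise ValueError("empty or reversed interval")
            if previous_end is not None and start < previous_end:
                raise ValueError("intervals must be ordered and non-overlapping")
            previous_end = end

    def span(self) -> Tuple[int, int]:
        """Outer boundaries of the feature (first start, last end)."""
        return self.intervals[0][0], self.intervals[-1][1]

    def length(self) -> int:
        """Total number of residues covered by the intervals."""
        return sum(end - start for start, end in self.intervals)


@dataclass
class Annotation:
    """A set of features on one sequence."""
    sequence: Sequence
    features: List[Feature] = field(default_factory=list)


clone = Sequence("example_clone", "ACGT" * 50)
exons = Feature("exon_set", [(10, 50), (80, 120)],
                [Evidence("sim4", "cDNA alignment")])
annotation = Annotation(clone, [exons])
print(exons.span(), exons.length())
```

Requiring at least one `Evidence` object in the constructor is one way to enforce the "justified by evidence" clause of the feature definition at the data-model level.
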
[Figure: depth of knowledge versus breadth of knowledge]
- Annotation: detailed analysis (typically biological) of single genes
- Annotated genome: large-scale analysis (typically computational) of entire genome

Annotation process overview
[Figure: flow diagram of the annotation process]
- Data: genome sequence, auxiliary data
- Methods: computational tools, database resources
- Data and methods feed into annotation systems, leading to understanding of a genome

Types of sequence data
- Chromosomal sequence
  - Euchromatic
  - Heterochromatic
- mRNA sequences
  - Full length cDNA
  - 5’ EST
  - 3’ EST
- Protein sequences
- Insertion site flanking sequences

Auxiliary data
- Maps
  - Genetic, physical, radiation hybrid map (RH), deletion, cytogenetic
- Expression data
  - Tissue, stage
- Phenotypes
  - Lethality, sterility

Computational annotation tools
- Gene finding
- Repeat finding
- EST/cDNA alignment
- Homology searching
  - BLAST, FASTA, HMM-based methods, etc.
- Protein family searching
  - PFAM, Prosite, etc.

Database resources
- Curated sequence feature data sets
  - Repeat elements
  - Transposons
  - Non-redundant mRNA
  - STSs and other sequence markers
- Genome sequence from related species
  - D. melanogaster vs. D. virilis, D. hydei
- Genome sequence from more distant species
- Protein sequences from distant species

Biological issues in annotation
- Common
  - Genes within genes
  - Alternative splicing
  - Alternative poly-adenylation sites
- Rare
  - Translational frame shifting
  - mRNA editing
  - Eukaryotic operons
  - Alternative initiation

Engineering issues in annotation
- What sequence to start with?
  - Because features are intervals on a sequence, problems can be caused by gaps, frameshifts, and other changes to the sequence. How do you track these changes over time and model features that span gaps?
- When to annotate?
  - Feature identification can aid in sequencing. It may be advisable to carry out sequencing and annotation in parallel, thus enabling them to complement one another.
- What analyses need to be run and how?
  - What dependencies are there between various analysis programs?
  - What parameter settings to use?

Engineering issues in annotation
- What public sequence data sets are needed?
  - What are the mechanics of obtaining public sequence databases?
  - Are curated data sets available or do you need to set up a means of maintaining your own (for repeats, insertions, organism of interest)?
- How do you achieve computational throughput?
  - Workstation farm, or simply a big, powerful box?
  - Job flow control
- What do you do with the results?
  - Homogenize results into single format?
  - Filter results for significance and redundancy

Engineering issues in annotation
- Interpreting the results
  - Is human curation needed?
  - How can you achieve consistency between curators?
  - How do you design the user interface so that it is simple enough to get the task completed speedily but complex enough to deal with biology?
  - How do you capture curations?
- How are annotation translations to be described?
  - EC terminology
  - ProSite families
  - Pfam domains
  - Is function distinguishable from process?

Engineering issues in annotation
- How do you manage data?
  - What is the appropriate database schema design?
  - How is the database to be kept up to date? Will it be directly from programs running user interfaces and analyses or via a middleware layer?
  - Is a flat file format needed and what should it be?
  - What query and retrieval support is needed?
- How do you distribute data?
  - For bulk downloads what is the format of the data?
  - What information is best summarized in tables?
  - What information requires an integrated graphical view?

Engineering issues in annotation
- How do you update the annotations?
  - How frequently are they re-evaluated?
  - How can re-evaluation be minimized (only subsets of the databanks, only modified sequences)?
  - How can differences between old and new computational results be detected?
  - Changes in computational results may need to trigger changes in curated annotations

Drosophila melanogaster
- Drosophila is the most important model organism*
- Drosophila genome:
  - 4 chromosomes
  - 180 Mb total sequence
  - 140 Mb euchromatic sequence
  - 12,000-14,000 genes
* source: G.M. Rubin

Drosophila Genome Project
- Laboratories working on Drosophila sequencing:
  - BDGP (Berkeley Drosophila Genome Project)
  - EDGP (European Drosophila Genome Project)
  - Celera Genomics Inc.
- “Complete” D. melanogaster sequence will be finished by the end of 1999
- Comprehensive database - FlyBase

Goals of the Drosophila Genome Project
- Complete genome sequence
- Structure of all transcripts
- Expression pattern of all genes
- Phenotype resulting from mutation of all ORFs
- And more...

Sequencing at the BDGP
- Genomic sequence
  - P1 and BAC clones
  - 24 Mb of completed sequence (as of July 22, 1999)
  - 18 Mb unfinished sequence in process
- Complete tiling path in BACs
  - 1.5x-path draft sequencing
- ESTs and cDNAs
  - 80,942 ESTs finished (as of March 19, 1999)
  - Over 800 full-length cDNAs

The BDGP sequence annotation process

What sequence to start with?
- Unit of sequencing at the BDGP
  - Completed high-quality clone sequences
- Reassembling the genomic sequence
  - Need to place clones in correct genomic positions
  - Need to integrate genes that span multiple clones
  - Solved by using genomic overlaps to reconstitute full genomic sequence

Which analyses need to be run?
- Similarity searches
  - BLAST (Altschul et al., 1990)
    - BLASTN (nucleotide databases)
    - BLASTX (amino acid databases)
    - TBLASTX (amino acid databases, six-frame translation)
  - sim4 (Miller et al., 1998)
    - Sequence alignment program for finding near-perfect matches between nucleotide sequences containing introns
- Gene predictors
  - Genefinder (Green, unpublished)
  - GenScan (Burge and Karlin, 1997)
  - Genie (Reese et al., 1997)
- Other analyses
  - tRNAscan-SE (Lowe and Eddy, 1996)

Which analyses need to be run and how?
- mRNAs
  - ORFFinder (Frise, unpublished)
- Protein translations
  - HMMPFAM 2.1 (Eddy 1998) against PFAM (v 2.1.1 Sonnhammer et al. 1997, Bateman et al. 1999)
  - Ppsearch (Fuchs 1994) against ProSite (release 15.0) filtered with EMOTIF (Nevill-Manning et al. 1998)
  - Psort II (Horton and Nakai 1997)
  - ClustalW (Higgins et al. 1996)

What public sequence data sets are needed?
- Automating updates of public databases:
  - Genbank, SwissProt, trEMBL, BLOCKS, dbEST, EDGP
- Curated data sets
  - D. melanogaster genes (FlyBase)
  - Transposable elements (EDGP)
  - Repeat elements (EDGP)
  - STSs (BDGP)

Which analyses need to be run and how?

How do you achieve computational throughput?
- BDGP computing power
  - Sun Ultra 450 (3 machines, 4 processors each)
  - Sun Enterprise (1 machine, 8 processors)
  - Used these directly, without any system for distributed computing.
- Job flow control: the Genomic Daemon
  - Automatic batch analysis of genomic clones
  - Berkeley Fly Database is used for queuing system and storage of results
  - Many clones can be analyzed simultaneously
  - Results are processed and saved in XML format for interactive browsing

What do you do with the results?
- Berkeley Output Parser (BOP)
  - Input to BOP:
    - Genomic sequence
    - Results of computational analyses
    - Filtering preferences
  - Parses results from BLAST, sim4, GeneFinder, GenScan, and tRNAscan-SE analyses
  - Filters BLAST and sim4 results
    - Eliminates redundant or insignificant hits
    - Merges hits that represent single region of homology
  - Homogenizes results into single format
  - Output: sequence and filtered results in XML format

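The filtering step described for BOP (drop insignificant or redundant hits, then merge hits covering one region of homology) can be illustrated with a short interval-merging routine. This is only a sketch of the general idea, not BOP itself: the `Hit` record, the `min_score` cutoff and the `max_gap` parameter are hypothetical names chosen for the example.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str   # name of the database sequence that was hit
    start: int     # 0-based start on the genomic sequence
    end: int       # end on the genomic sequence (exclusive)
    score: float


def filter_and_merge(hits: List[Hit], min_score: float, max_gap: int = 0) -> List[Hit]:
    """Drop low-scoring hits, then merge overlapping hits to the same subject."""
    kept = sorted((h for h in hits if h.score >= min_score),
                  key=lambda h: (h.subject, h.start, h.end))
    merged: List[Hit] = []
    for hit in kept:
        last = merged[-1] if merged else None
        if last and last.subject == hit.subject and hit.start <= last.end + max_gap:
            # Same region of homology: extend the previous hit, keep the best score.
            merged[-1] = Hit(last.subject, last.start,
                             max(last.end, hit.end), max(last.score, hit.score))
        else:
            merged.append(hit)
    return merged


raw_hits = [
    Hit("geneA", 100, 200, 80.0),
    Hit("geneA", 150, 260, 95.0),
    Hit("geneA", 150, 260, 95.0),   # exact duplicate
    Hit("geneA", 500, 600, 10.0),   # below the score cutoff
    Hit("geneB", 120, 180, 60.0),
]
for hit in filter_and_merge(raw_hits, min_score=50.0):
    print(hit)
```

Sorting by subject first keeps hits to different database entries separate even when they overlap on the genomic sequence.
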
Is human curation needed?
- Not for everything
  - Some features are obvious and can be identified computationally
    - Known D. melanogaster genes are detected automatically by GeneSkimmer
    - Repetitive elements
- But still for many things
  - Annotating complete gene structure is still hard
  - We use CloneCurator (BDGP’s Java graphical editor) for curation

Gene Skimmer
- Quick way of identifying genes in new sequence before curation
- Start with XML output from BOP
- Look for sim4 hits with known Drosophila genes
- Find gene hits with sequence identity >98%, coverage >30%
- Verify that hits represent real genes

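The two Gene Skimmer cutoffs (sequence identity >98%, coverage >30%) amount to a simple filter over alignment hits. The sketch below assumes each hit is a dictionary with hypothetical keys `gene`, `identity`, `aligned_length` and `gene_length`, and assumes coverage means the aligned fraction of the known gene; the real tool reads BOP's XML output, which is not reproduced here.

```python
from typing import Dict, List


def skim_genes(hits: List[Dict], min_identity: float = 0.98,
               min_coverage: float = 0.30) -> List[str]:
    """Return the names of known genes with a hit passing both cutoffs."""
    passed = set()
    for hit in hits:
        # Assumed definition: fraction of the known gene covered by the alignment.
        coverage = hit["aligned_length"] / hit["gene_length"]
        if hit["identity"] > min_identity and coverage > min_coverage:
            passed.add(hit["gene"])
    return sorted(passed)


example_hits = [
    {"gene": "geneA", "identity": 0.995, "aligned_length": 400, "gene_length": 1000},
    {"gene": "geneB", "identity": 0.970, "aligned_length": 900, "gene_length": 1000},
    {"gene": "geneC", "identity": 0.999, "aligned_length": 200, "gene_length": 1000},
]
print(skim_genes(example_hits))
```

Only `geneA` passes here: `geneB` fails the identity cutoff and `geneC` fails the coverage cutoff. The final step in the list, verifying that hits represent real genes, is not automated by this sketch.
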
Gene Skimmer
URL: http://www.fruitfly.org/sequence/genomic-clones.html

CloneCurator
- Displays computational results and annotations on a genomic clone
- Interactive browsing
  - Zoom/scroll
  - Change cutoffs for display of results
  - Analyze GC content, restriction sites, etc.
- Interactive annotation editing
  - Expert “endorses” selected results
- Presents annotations to community via Web site

How do we annotate gene/protein function?
- Gene Ontology Project
  - Controlled hierarchical vocabulary for multiple-genome annotations and comparisons
  - Standardized vocabulary facilitates collaboration
  - Good data modeling allows better database querying
  - Ontology browser provides interactive search of hierarchical terms
  - “GO” project (http://www.ebi.ac.uk/~ashburn/GO)

Ontology browser

Ontology browser: searching for terms

How do you distribute the data?
- Bulk downloads
  - FASTA at http://www.fruitfly.org/sequence/download.html
  - Curated data sets
- Tabular data
  - At http://www.fruitfly.org/sequence/
  - Sequenced genomic clones
  - Clone contigs sorted by genomic location
  - Clone contigs sorted by size
- Ribbon provides integrated graphical view of annotations on physical contigs

Ribbon
- Human curator annotates individual clones (~100Kb)
- Clones are assembled into physical contigs (regions of physical map)
- Clone annotations are merged and renumbered for display on whole physical contigs
- Ribbon is our Java display tool for displaying curated annotations on physical contigs
- Will soon be available on Web

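Merging and renumbering clone annotations for display on a physical contig is, at its core, a coordinate shift followed by de-duplication of features that appear in overlapping clones. The sketch below illustrates that idea only; the tuple layout, the `clone_offsets` mapping and the function name are assumptions, and Ribbon's actual implementation (in Java) is not shown in the tutorial.

```python
from typing import Dict, List, Tuple

CloneFeature = Tuple[str, str, int, int]    # (clone, feature name, start, end)
ContigFeature = Tuple[str, int, int]        # (feature name, start, end)


def to_contig_coords(clone_features: List[CloneFeature],
                     clone_offsets: Dict[str, int]) -> List[ContigFeature]:
    """Shift clone-relative intervals onto the contig and merge duplicates."""
    placed = set()
    for clone, name, start, end in clone_features:
        offset = clone_offsets[clone]   # position of the clone start on the contig
        placed.add((name, start + offset, end + offset))
    # A feature annotated in two overlapping clones collapses to one entry.
    return sorted(placed, key=lambda f: (f[1], f[2], f[0]))


offsets = {"cloneA": 0, "cloneB": 90_000}
features = [
    ("cloneA", "gene1", 1_000, 5_000),
    ("cloneA", "gene2", 95_000, 99_000),
    ("cloneB", "gene2", 5_000, 9_000),     # same gene, seen in the overlap
    ("cloneB", "gene3", 20_000, 30_000),
]
for feature in to_contig_coords(features, offsets):
    print(feature)
```

This assumes overlapping clones agree exactly on a shared feature's boundaries; reconciling annotations that differ between clones would need additional curation logic.
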
Ribbon

How do you manage the data?
- Using Informix as our database server
- Updated via the Perl DBI module (DBI.pm)
- Development underway in
  - Schema revisions
  - GAME DTD (Genome Annotation Markup Entities)
  - Perl module for annotation objects
  - http://www.bioxml.org/ (Ewan Birney)

How do you maintain annotations?
- Open questions
  - How frequently are annotations re-evaluated?
  - How can re-evaluation be minimized (only subsets of the databanks, only modified sequences)?
  - How can differences between old and new computational results be detected?
  - Changes in computational results may need to trigger changes in curated annotations

Integrated annotation systems
- ACeDB
- Genotator
- Magpie
- GAIA
- TIGR

Integrated annotation systems: ACeDB
- Developed for analysis of the C. elegans genome
- Sophisticated database designed for storing annotations and related information
- New Java and Web-based versions available
- Written by Jean Thierry-Mieg and Richard Durbin
- http://www.sanger.ac.uk/Software/Acedb/

ACeDB

Genotator
- Back end automates sequence analysis; browser provides interactive viewing and editing of annotations
- Nomi Harris (1997), Genome Research 7(7), 754-762.
- http://www-hgc.lbl.gov/inf/annotation.html

Magpie
- Expert system based (PROLOG)
  - Data collection daemon
  - Data analysis and report daemon
- “Intelligent” integration of various individual feature prediction systems
- Allows human interactions
- Gaasterland and Sensen (1996), TIG, 12, 76-78.
- http://genomes.rockefeller.edu/magpie/magpie.html

GAIA
- Web-based system
- Results displayed as Java applets
- Bailey, L.C., J. Schug, S. Fischer, M. Gibson, J. Crabtree, D.B. Searls, and G.C. Overton (1998), Genome Research.
- http://daphne.humgen.upenn.edu:1024/gaia/

TIGR Human Gene Index
- Gene Indices for various organisms
- Databases for transcribed genes linked into external/internal genomic databases
- Internal backend analysis software
- http://www.tigr.org/tdb/tdb.html

Computational analysis tools
- Gene finding
- Repeat finding
- EST/cDNA alignment
- Homology searching
  - BLAST, FASTA, HMM-based methods, etc.
- Protein family searching
  - PFAM, Prosite, etc.

Gene finding: Prokaryotes vs. Eukaryotes
- Prokaryotes
  - Contiguous open reading frames (ORF)
  - Short intergenic sequences
  - Good method: detecting large ORFs
  - Complications:
    - Partial sequences
    - Sequencing errors
    - Start codon prediction
    - Overlapping genes on both strands

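The "detecting large ORFs" method for prokaryotes can be sketched as a scan of all six reading frames for ATG-to-stop stretches above a length cutoff. This is a deliberately naive illustration under stated assumptions: only ATG is treated as a start codon, the standard three stop codons are used, and `min_codons` is a hypothetical parameter. It does not address the listed complications (partial sequences, sequencing errors, start codon choice).

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_codons: int = 100) -> List[Tuple[str, int, int]]:
    """Return (strand, start, end) for ORFs with at least min_codons codons.

    Coordinates are 0-based, half-open, and relative to the strand that was
    scanned (the reverse complement for "-"). The stop codon is included in
    the interval but not counted towards min_codons.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first in-frame ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None                   # look for the next ORF in this frame
    return orfs


toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))
```

Scanning both strands matters because, as the slide notes, prokaryotic genes can overlap on opposite strands.
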
Gene finding: Prokaryotes vs. Eukaryotes
- Eukaryotes
  - Complex gene structures (exon/introns)
    - D. melanogaster has an average of 4 introns/gene
    - Very long genes (D. melanogaster X gene 160 kb)
    - Very long introns
    - Many introns
    - “Nested”, overlapping, and alternatively spliced genes
    - 5’ UTRs with non-coding exons
    - Long 3’ UTRs
    - Complex transcription machinery
  - ORF-finding alone is not adequate

Integrated gene finding
- Assumptions
  - Signal and content sensors alone are not sufficient for predicting gene structure
  - Gene structure is hierarchical
  - Each component (exon, intron, splice site, etc.) can be modeled independently
- The approach
  - Generate a list of candidates for each component (with scores)
  - Assemble the components into a “gene model”

Integrated gene finding: Dynamic programming
- Determines the best combination of components
- Two-part problem:
  - Develop an “optimal” scoring function
  - Use dynamic programming to find an “optimal” alignment through scoring matrix

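As a rough illustration of assembling scored components by dynamic programming, the sketch below chains candidate exons into the highest-scoring set of non-overlapping exons. It is a simplified stand-in (weighted interval chaining) rather than the algorithm of any particular gene finder: it ignores reading-frame compatibility, intron scores and splice signals, and the scores are made up.

```python
from bisect import bisect_right
from typing import List, Tuple

Exon = Tuple[int, int, float]   # (start, end, score), half-open coordinates


def best_exon_chain(candidates: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return (total score, chain) for the best set of non-overlapping exons."""
    exons = sorted(candidates, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best: List[Tuple[float, List[Exon]]] = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon's start.
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


candidates = [(0, 100, 5.0), (50, 150, 8.0), (160, 300, 4.0), (120, 200, 3.0)]
total, chain = best_exon_chain(candidates)
print(total, chain)
```

The quality of the result depends entirely on the scores, which is why the slide separates the scoring function from the dynamic programming step.
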
Integrated gene finding: Dynamic programming

Integrated gene finding: Linear and Quadratic Discriminant Analysis (LDA/QDA)
- LDA
  - Deterministic calculation of thresholds
  - n-class discrimination
  - Example: HSPL, Solovyev et al. (1997), ISMB, 5, 294-302.
- QDA
  - Can represent a great improvement over LDA
  - Example: MZEF, Michael Zhang (1997), PNAS, 94, 565-568.

Integrated gene finding: Feed-forward neural networks
- Supervised learning
- Training to discriminate between several feature classes
- Computing units
- Gradient descent optimization
- Multi-layer networks
- Limitations
  - Black-box predictions
  - Local minima
- Example: GRAIL, Uberbacher et al. (1991), PNAS, 88, 11261-11265.

Approaches to gene finding: Hidden Markov models
- Model
  - A finite model describing a probability distribution over all possible sequences of equal length
  - “Natural” scoring function
  - (Conditional) Maximum likelihood “training”
- Markov
  - k-order Markov chain: current state dependent on k previous states
  - The next state in a 1st-order Markov model depends on current state
- Hidden
  - Hidden states generate visible symbols
- Assumptions
  - Independence of states
  - No long range correlation
- Example: HMMgene, A. Krogh (1998), In Guide to Human Genome Computing, 261-274.

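To make "hidden states generate visible symbols" concrete, the sketch below decodes the most probable state path of a first-order HMM with the Viterbi algorithm. The two-state model and all of its probabilities are invented for illustration (a GC-rich "coding" state and an AT-rich "noncoding" state); they are not parameters of HMMgene or any other published gene finder.

```python
import math
from typing import Dict, List, Sequence


def viterbi(observations: str, states: Sequence[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Most probable hidden-state path for a first-order HMM (log space)."""
    log = math.log
    scores = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for symbol in observations[1:]:
        row, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda p: scores[-1][p] + log(trans_p[p][s]))
            row[s] = scores[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][symbol])
            pointers[s] = prev
        scores.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

print(viterbi("AATTGCGCGC", STATES, START, TRANS, EMIT))
```

With these toy parameters the path switches from "noncoding" to "coding" exactly where the sequence composition changes, because one state switch is cheaper than emitting mismatched symbols.
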
Approaches to gene finding: Generalized hidden Markov models
- Each HMM state can be a probabilistic sub-model
- Complex hierarchical system
- Requires care in modeling state overlaps
- Example:
  - Genie, Kulp et al. (1996), ISMB, 4, 134-142
  - GenScan, Burge and Karlin (1997), JMB, 268(1), 78-94

Gene finding software
- Signal recognition
  - Promoter prediction
  - Splice site prediction
  - Start codon prediction
  - Poly-adenylation site prediction
- Coding potential
- Coding exons
- Gene structure prediction
  - Spliced alignment
  - LDA/QDA
  - Neural networks
  - HMMs and GHMMs

Promoter recognition
- PromoterScan
  - Identify potential promoter regions
  - Based on databases of known TF binding sites
    - TFD (Ghosh (1991), TIBS, 16, 445-447)
    - TRANSFAC (Heinemeyer et al. (1999), NAR, 27, 318-322)
  - Prestridge (1995), JMB, 249, 923-932
  - http://bimas.dcrt.nih.gov/molbio/proscan/
- MatInd and MatInspector
  - Finding consensus matches to known TF binding sites
  - Based on TRANSFAC
    - Heinemeyer et al. (1999), NAR, 27, 318-322
  - Quandt et al. (1995), NAR, 23, 4878-4884.
  - http://transfac.gbf.de/TRANSFAC/

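Finding matches to known TF binding sites is commonly done by scoring each window of a sequence against a position weight matrix built from aligned example sites. The sketch below shows that general technique with a log-odds matrix and a uniform background; it is not the scoring scheme of MatInspector or PromoterScan, and the example sites, the pseudocount and the threshold are all invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def build_log_odds(sites: List[str], pseudocount: float = 1.0,
                   background: float = 0.25) -> List[Dict[str, float]]:
    """Build a log2-odds weight matrix from equal-length aligned sites."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: math.log2(counts[base] / total / background)
                       for base in "ACGT"})
    return matrix


def scan(sequence: str, matrix: List[Dict[str, float]],
         threshold: float) -> List[Tuple[int, str, float]]:
    """Return (position, window, score) for windows scoring >= threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(matrix[k][base] for k, base in enumerate(window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits


# Hypothetical TATA-box-like training sites.
tata_matrix = build_log_odds(["TATAAA", "TATAAA", "TATATA", "TATAAA"])
for position, window, score in scan("GGCGTATAAAGGC", tata_matrix, threshold=5.0):
    print(position, window, round(score, 2))
```

The threshold trades sensitivity against false positives, which is the main practical difficulty when scanning genomic sequence with short, degenerate motifs.
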
Promoter recognition (cont.)
- TSSG/TSSW
  - LDA based combination of several features (TATA-box, Inr signal, upstream regions)
  - Solovyev et al. (1997), ISMB, 5, 294-302.
  - http://genomic.sanger.ac.uk/gf/gf.shtml
- Transcription Element Search Software
  - Identify TF binding sites
  - Based on TRANSFAC
  - http://agave.humgen.upenn.edu/tess/index.html

Promoter recognition (cont.)
- CBS Promoter 2.0 Prediction Server
  - Simulated transcription factors
  - Principles common to neural networks and genetic algorithms
  - Knudsen (1999), Bioinformatics 15(5), 356-361.
  - http://genome.cbs.dtu.dk/services/promoter/
- CorePromoter
  - Position dependent 5-tuple
  - QDA
  - Michael Zhang (1998), Genome Research, 8, 319-326.
  - http://scislio.cshl.org/genefinder/CPROMOTER/

Promoter recognition (cont.)
- Neural network promoter prediction (NNPP)
  - Time-delay neural network
  - Combining TATA box and initiator
  - Reese (1999), in preparation.
  - http://www-hgc.lbl.gov/projects/promoter.html

Example: NNPP

Promoter recognition (cont.)
- Markov chain promoter finder
  - Competing interpolated Markov chains for promoters, exons, introns
  - Promoter model consists of five states representing the core promoter parts
  - Ohler, Reese et al. (1999), Bioinformatics 15(5), 362-369.

Splice site prediction
- Nakata, 1985
  - Nakata (1985), NAR, 13(14), 5327-5340.
- BCM GeneFinder
  - HSPL - Prediction of splice sites in human DNA sequences
  - Triplet frequencies in various functional parts of splice site regions
  - Combined with codon statistics
  - Solovyev et al. (1994), NAR, 22(24), 5156-5163.
  - http://genomic.sanger.ac.uk/gf/gf.shtml

Splice site prediction (cont.)
- Neural Network splice site predictor (NNSPLICE)
  - Multi-layered feed-forward neural network
  - Modeled after Brunak et al. (1991), JMB, 220, 49-65.
  - Reese et al. (1997), JCB, 4(3), 311-323.
  - http://www-hgc.lbl.gov/projects/splice.html
- NetGene2
  - Combination of neural networks and rule-based system
  - Splice site signal neural network combined with coding potential
  - Hebsgaard et al. (1996), NAR, 24(17), 3439-3452.
  - Brunak et al. (1991), JMB, 220, 49-65.
  - http://www.cbs.dtu.dk/services/NetGene2/

Splice site prediction (cont.)
- SplicePredictor
  - Logitlinear models for splice site regions
    - Degree of matching to the splice site consensus
    - Local compositional contrast
  - Brendel and Kleffe (1998), NAR, 26(20), 4748-4757.
  - http://gnomic.stanford.edu/~volker/SplicePredictor.html

Start codon prediction
- NetStart
  - Trained on cDNA-like sequences
  - Neural network based
    - Local start codon information
    - Global sequence information
  - Pedersen and Nielsen (1997), ISMB, 5, 226-233.
  - http://www.cbs.dtu.dk/services/NetStart/

Poly-adenylation signal prediction
- BCM GeneFinder
  - POLYAH - Recognition of 3'-end cleavage and polyadenylation region
  - Triplet frequencies in various functional parts in polyadenylation regions
  - LDA
  - Solovyev et al. (1994), NAR, 22(24), 5156-5163.
  - http://genomic.sanger.ac.uk/gf/gf.shtml

Prediction of coding potential
- Periodicity detection
  - Coding sequences have an inherent periodicity of three
  - Especially good on long coding sequences
  - Auto-correlation
    - Seeking the strongest response when shifted sequence is compared with original
    - Michel (1986), J. Theor. Biol. 120, 223-236.
  - Fourier transformation: Spectral analysis
    - Detection of a peak at frequency 1/3 (period three)
    - Silverman and Linsker (1986), J. Theor. Biol. 118, 295-300.

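The spectral approach to period-three detection can be illustrated by evaluating the discrete Fourier transform of the four base-indicator sequences at the single frequency 1/3 and summing the squared magnitudes. This is a minimal sketch of that general idea, not necessarily the exact measure used in the cited papers; the normalization by sequence length is an assumption.

```python
import cmath


def period3_power(dna: str) -> float:
    """Spectral power at frequency 1/3, summed over the four bases.

    For each base, an indicator sequence (1 where the base occurs, else 0)
    is Fourier-transformed at frequency 1/3 only. The result is divided by
    the sequence length so that sequences of different length are comparable.
    """
    dna = dna.upper()
    if not dna:
        return 0.0
    total = 0.0
    for base in "ACGT":
        coefficient = sum(cmath.exp(-2j * cmath.pi * i / 3)
                          for i, b in enumerate(dna) if b == base)
        total += abs(coefficient) ** 2
    return total / len(dna)


print(period3_power("ATG" * 30))    # strongly period-three
print(period3_power("ACGT" * 30))   # period four, no period-three signal
```

In practice the measure would be computed in a sliding window along genomic sequence, and real coding regions give a weaker but still detectable peak than the perfectly periodic toy input used here.
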
Prediction of coding potential (cont.)
- Trifonov (1980; 1987)
  - G-notG-U periodicity
  - JMB, 194, 643-652.
- Fickett (1982)
  - Position asymmetry in the three codon positions
  - NAR 10(17), 5303-5318.
- Staden (1984)
  - Codon usage in tables
  - NAR 12, 551-567.

Prediction of coding potential (cont.)
- Claverie and Bougueleret (1987)
  - Hexamer frequency differentials
  - NAR 14, 179-196.
- Fichant and Gautier (1987)
  - Codon usage homogeneity
  - CABIOS, 3(4), 287-295.
- GRAIL I (1991)
  - Neural network using a shifting fixed size window
  - 7 sensors as input, 2 hidden layers and 1 unit as output
  - Uberbacher et al. (1991), PNAS, 88(24), 11261-11265.

Prediction of coding potential (cont.)
- GeneMark (1986)
  - Inhomogeneous Markov chain models
  - Easily trainable (closed solution for Maximum Likelihood)
  - Used extensively in prokaryotic genomes
  - Borodovsky et al. (1993), Computers & Chemistry, 17, 123-133.
- Glimmer (1998)
  - Interpolated Markov chains from first to eighth order
  - Salzberg et al. (1998), NAR, 26(2), 544-548.
  - http://www.tigr.org/softlab/glimmer/glimmer.html

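The "closed solution for Maximum Likelihood" remark refers to the fact that Markov chain parameters are obtained by counting. The sketch below trains two fixed-order, homogeneous Markov chains (one on coding-like, one on non-coding-like sequence) and scores a query by log-odds. It is intentionally simpler than the inhomogeneous chains of GeneMark or the interpolated chains of Glimmer; the training strings, the order and the pseudocount are made up for illustration.

```python
import math
from collections import defaultdict
from typing import Callable, List

Model = Callable[[str, str], float]


def train_markov(sequences: List[str], order: int = 2,
                 pseudocount: float = 1.0) -> Model:
    """Estimate P(base | previous `order` bases) by counting."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context: str, base: str) -> float:
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq: str, coding: Model, noncoding: Model, order: int = 2) -> float:
    """Sum of log(P_coding / P_noncoding) over the sequence; > 0 favours coding."""
    return sum(math.log(coding(seq[i - order:i], seq[i]) /
                        noncoding(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


# Toy training data, invented for this example.
coding_model = train_markov(["ATGGCCGCCGCCGCC" * 3])
noncoding_model = train_markov(["ATATATTTATAATATAT" * 3])

print(log_odds("GCCGCCGCC", coding_model, noncoding_model))
print(log_odds("TATATATAT", coding_model, noncoding_model))
```

The pseudocount keeps unseen contexts from producing zero probabilities; interpolated models address the same sparse-data problem by falling back to lower-order contexts.
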
Prediction of coding potential (cont.)
- Review by Fickett (1992)
  - “Assessment of protein coding measures”, NAR, 20, 6441-6450.

Prediction of coding exons
- SorFind
  - Detection of “spliceable” ORFs
  - Hutchinson, NAR, 20(13), 3453-3462.
- BCM GeneFinder
  - FEXD, FEXN, FEXA, FEXY, FEXH, HEXON
  - LDA
  - Solovyev et al. (1994), NAR, 22(24), 5156-5163.
  - http://genomic.sanger.ac.uk/gf/gf.shtml
- GRAIL II
  - Exon candidates, heuristic integration, learning with neural network
  - Uberbacher et al., Genet. Eng., 16, 241-253.
  - http://compbio.ornl.gov/

“Integrated” gene models: LDA/QDA
- FGene
  - LDA based
  - Dynamic programming for the integration of LDA output
  - Solovyev et al. (1995), ISMB, 3, 367-375.
  - http://genomic.sanger.ac.uk/gf/gf.shtml

“Integrated” gene models: NN
- GeneParser
  - “Gene-parsing” approach
  - Potential alternative splicing recognized
  - Neural network and dynamic programming
  - Snyder and Stormo (1995), JMB, 248, 1-18.

“Integrated” gene models: Artificial intelligence approaches
- GeneID
  - Rule-based system
  - Homology integration
  - Guigó et al. (1992), JMB, 226, 141-157.
  - http://www1.imim.es/geneid.html
- GeneID using DP
  - DP to combine a set of potential exons
  - Guigó et al. (1998), JCB, 5, 681-702.

“Integrated” gene models: Artificial intelligence approaches
- GenLang
  - Syntactic pattern recognition system
  - Formal grammar
  - Tools from computational linguistics
  - Dong and Searls (1994), Genomics, 23, 540-551.
  - http://cbil.humgen.upenn.edu/~sdong/genlang_home.html

“Integrated” gene models: HMMs
- HMMGene
  - Several genes per sequence possible
  - User constraints possible
  - Krogh (1997), ISMB, 5, 179-186.
  - http://www.cbs.dtu.dk/services/HMMgene/
- GeneMark.hmm
  - Based on GeneMark program for bacterial sequences
  - Can predict frame shifts
  - Trained for various organisms
  - Lukashin and Borodovsky (1998), NAR, 26, 1107-1115.
  - http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html

“Integrated” gene models: GHMMs
- Genie
  - Generalized hidden Markov model with length distribution
  - Integration of multiple content and signal sensors
    - Content: codon statistics, repeats, intron, intergenic, database homology hits
    - Signal: promoter, start codon, splice sites, stop codon
  - Dynamic programming to find optimal parse
  - Several genes per sequence possible
  - Kulp et al. (1996), ISMB, 4, 134-142.
  - Reese et al. (1997), JCB, 4(3), 311-323.
  - http://www.cse.ucsc.edu/~dkulp/cgi-bin/genie

Example: Genie

“Integrated” gene models: GHMMs
- GenScan
  - Multiple content and signal models
  - Semi-hidden Markov model sensors with length distribution
  - Takes GC content into account (separate models)
  - Several genes per sequence possible
  - Burge and Karlin (1997), JMB, 268(1), 78-94.
  - http://CCR-081.mit.edu/GENSCAN.html

EST/cDNA alignment for gene finding: Spliced alignments
- PROCRUSTES
  - Spliced alignment algorithm
  - Dynamic programming to combine a set of potential exons
  - Frame conservation
  - Homologous sequence needed
  - Gelfand et al. (1996), PNAS, 93, 9061-9066.
  - http://hto-13.usc.edu/software/procrustes/

EST/cDNA alignment
- Sim4
  - Aligns cDNA to genomic sequence
  - Uses local similarity
  - Florea et al. (1998), Genome Research, 8, 967-974.
- GeneWise
  - Dynamic programming
  - Partial genes allowed
  - Based on Pfam and statistical splice site models
  - Birney (1999), unpublished
  - http://www.sanger.ac.uk/Software/Wise2

EST/cDNA alignment (cont.)

ACEMBLY
 Aligns ESTs to genomic sequence
 Identifies alternative splicing
 Integrated in ACeDB
 Jean Thierry-Mieg (unpublished)
Reese et al., Tutorial #3, ISMB ‘99
Repeat finders

Censor
 Uses database of repeat sequences
 Jurka et al. (1996), Comp. and Chem., 20(1), 119-122.

BLAST
 Integrated masking operations
 XBLAST procedure
 Claverie (1994), In Automated DNA Sequencing and Analysis Techniques, M. D. Adams, C. Fields and J. C. Venter, eds., 267-279.
 http://www.ncbi.nlm.nih.gov/BLAST
Reese et al., Tutorial #3, ISMB ‘99
Repeat finders (cont.)

RepeatMasker
 Detection of interspersed repeats
 Smit and Green, unpublished results
 http://ftp.genome.washington.edu/RM/RepeatMasker.html
Reese et al., Tutorial #3, ISMB ‘99
Homology searching

BLAST suite
 BLASTN, BLASTX, TBLASTX, PSI-BLAST
 Altschul et al. (1990), JMB, 215, 403-410.
 http://www.ncbi.nlm.nih.gov/BLAST

FASTA suite
 FASTA, TFASTA
 Pearson and Lipman (1988), PNAS, 85, 2444-2448.

HMM-based searching
 SAM (UCSC group): http://www.cse.ucsc.edu/research/compbio/sam.html
 HMMER (Sean Eddy): http://hmmer.wustl.edu/
Reese et al., Tutorial #3, ISMB ‘99
Gene family searching

BLOCKS
 http://www.blocks.fhcrc.org

PROSITE
 http://www.expasy.ch/prosite/

PFAM
 http://pfam.wustl.edu/

SCOP
 http://scop.mrc-lmb.cam.ac.uk/scop/
Reese et al., Tutorial #3, ISMB ‘99
The genome annotation experiment (GASP1)

Genome Annotation Assessment Project (GASP1)
Annotation of 2.9 Mb of Drosophila melanogaster genomic DNA
Open to everybody, announced on several mailing lists
Participants can use any analysis methods they like (gene finding programs, homology searches, by-eye assessment, combination methods, etc.) and should disclose their methods.
“CASP”-like
12 participating groups
URL: http://www.fruitfly.org/GASP1
Reese et al., Tutorial #3, ISMB ‘99
Goals of the experiment

Compare and contrast various genome annotation methods

Objective assessment of the state of the art in gene finding and functional site prediction

Identify outstanding problems in computational methods for the annotation process
Reese et al., Tutorial #3, ISMB ‘99
Adh contig

2.9 Mb contiguous Drosophila sequence from the Adh region, one of the best studied genomic regions
 From chromosome 2L (34D-36A)
 Ashburner et al. (to appear in Genetics)
 222 gene annotations (as of July 22, 1999)
 375,585 bases are coding (12.95%)

We chose the Adh region because it was thought to be typical: a representative test bed for evaluating annotation techniques.
Reese et al., Tutorial #3, ISMB ‘99
Adh paper (to appear in Genetics)
URL: http://www.fruitfly.org/publications/PDF/ADH.pdf
Reese et al., Tutorial #3, ISMB ‘99
GAATTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCA
TACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTG
ATCCTGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCT
TGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAG
ACAAACTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCATCGA
AAAGTAACCTGCGGGAATTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAATACTGAGCCCAAATGAGCGA
TAGATAGATAGATCGTGCGGCGATCTCGTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGG
TTCTGGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGG
CCGTGTGTGTGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTG
TCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACT
TCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCC
TGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTT
AAAGTAACCTGCGGGAATTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAATACTGAGCCCAAATGAGCGA
TAGATAGATAGATCGTGCGGCGATCTCGTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGG
TTCTGGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGG
CCGTGTGTGTGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTG
TCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACT
TCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCC
TGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTT
TATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACAA
ACTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAG
CTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGT
GTGTGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGG
TTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTT
CCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTG
ACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCCTT
GGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCAC
TGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGCCAACAGA
CGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACT
GCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACAAACTTGTAAACCCGTTCCCGAACCAGCTGTA
TCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGTAACCTGCGGGAATTCCACGGAAATGTCAG
GAGATAGGAGAAGAAAACAGAACAACAGCAAATACTGTGCGGCGATCTCGTACTGGACGGAAATGTCAGGAGATAGGAGAAGAAAA
Raw sequence:
Adh.fa
Reese et al., Tutorial #3, ISMB ‘99
Drosophila data sets provided to participants

Curated Drosophila nuclear DNA "coding sequences" (CDS)
Curated non-redundant Drosophila genomic DNA data (275 “multi”- and 144 “single”-exon sequence entries from Genbank)
Drosophila 5' and 3' splice sites
Drosophila start codon sites
Drosophila promoter sequences
Drosophila repeat sequences
Drosophila transposon sequences
Drosophila cDNA sequences
Drosophila EST sequences
URL: http://www.fruitfly.org/GASP1/data/data.html
Reese et al., Tutorial #3, ISMB ‘99
Timetable

May 13, 1999 - June 30, 1999
 Distribution of the sample sequence and associated data to the predictors. Collection of predictions.

June 30, 1999 - July 31, 1999
 Evaluation of the predictions by the Drosophila Genome Center.

August 4, 1999
 External expert assessment of the prediction results (HUGO meeting, EMBL)

August 6, 1999
 Tutorial #3 at the ISMB ‘99 conference in Heidelberg, Germany
Reese et al., Tutorial #3, ISMB ‘99
Resources for assessing predictions

80 cDNA sequences NOT in Genbank before experiment deadline
 Sequenced from 5 different cDNA libraries
 3 paralogs to other genes in the genome
 19 cDNAs with cloning artifacts
• 2 apparently representing unspliced RNA
• Multiple inserts (2 cDNAs cloned in the same vector)
 58 “usable” cDNAs
33 cDNA sequences in Genbank during experiment
Annotations from Adh paper
Reese et al., Tutorial #3, ISMB ‘99
Curated data sets for assessing predictions

Standard 1 (Adh.std1.gff) “conservative gene set”
 43 gene structures (7 single- and 36 multi-coding-exon genes)
 Criteria for inclusion:
• >=95% (most >=99%) of the cDNA aligned to genomic DNA (using sim4)
• “GT”/“AG” splice site consensus sequences
• Splice site score from neural net
  5’ splice sites: >=0.35 threshold (98% True Positive score)
  3’ splice sites: >=0.25 threshold (92% True Positive score)
• Start codon and stop codon annotations from Standard 3 (derived from Adh paper)
 These 43 genes represent “typical” genes
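
The splice-site part of the inclusion criteria above can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the function names are hypothetical, coordinates are 1-based inclusive on the forward strand, and the neural-net splice scores are taken as given inputs (the scoring network itself is not reproduced). The cDNA alignment criterion is not covered.

```python
DONOR_THRESHOLD = 0.35     # 5' splice site threshold quoted on the slide
ACCEPTOR_THRESHOLD = 0.25  # 3' splice site threshold quoted on the slide


def introns_from_exons(exons):
    """Derive intron intervals from sorted (start, end) exons, 1-based inclusive."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def has_consensus(genome, intron):
    """True if the intron starts with GT and ends with AG."""
    start, end = intron
    seq = genome[start - 1:end].upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def passes_splice_criteria(genome, exons, donor_scores, acceptor_scores):
    """Check the GT/AG consensus and score thresholds for every intron.

    donor_scores and acceptor_scores hold one externally computed score
    per intron (hypothetical inputs, in intron order).
    """
    introns = introns_from_exons(exons)
    for intron, donor, acceptor in zip(introns, donor_scores, acceptor_scores):
        if not has_consensus(genome, intron):
            return False
        if donor < DONOR_THRESHOLD or acceptor < ACCEPTOR_THRESHOLD:
            return False
    return True
```

A single-exon gene has no introns, so it passes these checks trivially; whether it enters the set then depends only on the alignment criterion.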
Reese et al., Tutorial #3, ISMB ‘99
Curated data sets for assessing predictions

Standard 2 (Adh.std2.gff)
 Superset of Standard 1
 15 additional gene structures
 Same alignment criteria as Standard 1 but no splice site consensus requirement
 Not used in the experiment
Reese et al., Tutorial #3, ISMB ‘99
Curated data sets for assessment

Standard 3 (Adh.std3.gff) “more complete gene set”
 222 gene structures (39 single- and 183 multi-coding-exon genes)
 Criteria: annotated as described in Ashburner et al.
 cDNA to genomic alignment using sim4
 Start codons predicted by ORFFinder (Frise et al., unpublished)
 ~182 genes have similarity to a homologous protein sequence in another organism or have a Drosophila EST hit
• Edge verification by partial EST/cDNA alignments
• BLASTX, TBLASTX homology results
• PFAM alignments
• Gene structure verification using GenScan (human)
  14 genes had EST/homology hits but no gene finding predictions
 ~40 genes only have “strong” GenScan predictions

Reese et al., Tutorial #3, ISMB ‘99
Submission format

GFF (Durbin and Haussler, 1998, unpublished)
 http://www.sanger.ac.uk/Software/GFF/
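
A submission line in this format can be read with a few lines of code. The sketch below assumes the field order seqname, source, feature, start, end, score, strand, frame, followed by an optional group field, and splits on any whitespace to tolerate space-aligned slides as well as tab-delimited files. `GffRecord` and `parse_gff_line` are hypothetical names, not part of any GFF toolkit.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score column is "."
    strand: str
    frame: str
    group: str              # empty string when no group field is given


def parse_gff_line(line):
    """Parse one GFF line; return None for blank lines and '#' comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 8)  # at most 9 fields; the last keeps its spaces
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = fields[8] if len(fields) > 8 else ""
    return GffRecord(seqname, source, feature, int(start), int(end),
                     None if score == "." else float(score),
                     strand, frame, group)
```

For example, `parse_gff_line('Adh std1 exon 32034 32277 . + . transcript "1"')` yields a record whose `group` is `transcript "1"`.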
Reese et al., Tutorial #3, ISMB ‘99
Sample submission
# organism: Drosophila melanogaster
# std1
# Gene 1
Adh  std1  TFBS             32002  32006  .  +  .
Adh  std1  TATA_signal      32009  32012  .  +  .  transcript "1"
Adh  std1  TSS              32033  32034  .  +  .  transcript "1"
Adh  std1  prim_transcript  32034  33122  .  +  .  transcript "1"
Adh  std1  exon             32034  32277  .  +  .  transcript "1"
Adh  std1  start_codon      32122  32124  .  +  .  transcript "1"
Adh  std1  CDS              32122  32277  .  +  .  transcript "1"
Adh  std1  splice5          32277  32278  .  +  .  transcript "1"
Adh  std1  splice3          32332  32333  .  +  .  transcript "1"
Adh  std1  exon             32785  32830  .  +  .  transcript "1"
Adh  std1  CDS              32785  32830  .  +  .  transcript "1"
Adh  std1  splice5          32830  32831  .  +  .  transcript "1"
Adh  std1  splice3          32825  32826  .  +  .  transcript "1"
Adh  std1  CDS              32826  33003  .  +  .  transcript "1"
Adh  std1  exon             32826  33122  .  +  .  transcript "1"
Adh  std1  stop_codon       33001  33003  .  +  .  transcript "1"
Adh  std1  polyA_signal     33090  33095  .  +  .  transcript "1"
Adh  std1  polyA_site       33101  33102  .  +  .  transcript "1"
# Gene 2
Adh  std1  prim_transcript  38100  41973  .  -  .  transcript "2"
Adh  std1  exon             38100  41973  .  -  .  transcript "2"
Adh  std1  polyA_site       39620  39621  .  -  .  transcript "2"
Adh  std1  polyA_signal     39685  39690  .  -  .  transcript "2"
Adh  std1  stop_codon       40125  40127  .  -  .  transcript "2"
Adh  std1  CDS              40125  40390  .  -  .  transcript "2"
Adh  std1  start_codon      40388  40390  .  -  .  transcript "2"
Adh  std1  TSS              41973  41974  .  -  .  transcript "2"
Adh  std1  TATA_signal      41998  42001  .  -  .  transcript "2"
Adh  std1  TFBS             42187  42193  .  -  .
Adh  std1  TFBS             42211  42216  .  -  .
Reese et al., Tutorial #3, ISMB ‘99
Submissions

MAGPIE Team
 Credit
• Terry Gaasterland, Alexander Sczyrba, Elizabeth Thomas, Gulriz Kurban, Paul Gordon, Christoph Sensen
• Laboratory for Computational Genomics, Rockefeller University, and Institute for Marine Biosciences, Canada
 Method
• Automatic genome analysis system integrating Drosophila Genscan predictions, confirmation of exon boundaries using database searches, repeat finding (Calypso, REPuter) and gene function annotations.
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)
 References
• “Multigenome MAGPIE” poster at ISMB ‘99.
• Gaasterland and Ragan (1998), J. of Microbial and Comparative Genomics, 3, 305-312.
• Gaasterland and Sensen (1996), Biochimie, 78, 302-310.
• REPuter: Kurtz and Schleiermacher (1999), Bioinformatics, 15(5), 426-427.

Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

Computational Genomics Group, The Sanger Centre
 Credit

Victor Solovyev, Asaf Salamov
 Method
Discriminant analysis based gene prediction programs FGenes (trained for human) and FGenesH (trained for Drosophila); the output of FGenes, FGenesH and BLAST is combined using FGenesH+. Three different “threshold” annotations were submitted.
 The program's running time is linear in the sequence length.
 Automatic, plus additional user interactive screening.
 Non-redundant NCBI database used for BLAST.

 URL/References

http://genomic.sanger.ac.uk/gf/gf.shtml
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

Genome Annotation Group, The Sanger Centre
 Credit

Ewan Birney
 Method

Protein family based gene identification using Wise2 (previously GeneWise) and PFAM.
 URL

http://www.sanger.ac.uk/Software/Wise2
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

Pattern Recognition, The University of Erlangen
 Credit

Uwe Ohler, Georg Stemmer, Stefan Harbeck, Heinrich Niemann
 Method
Promoter recognition based on interpolated Markov chains; “Genscan”-like promoter model (MCPromoter); maximal mutual information based estimation of interpolated Markov chains.
 Automatic.
 Promoter training data set from http://www.fruitfly.org/data/genesets

Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)
 References
Ohler, Harbeck, Niemann, Noeth and Reese (1999), Bioinformatics, 15(5), 362-369.
 Ohler, Harbeck and Niemann (1999), Proc. EUROSPEECH, to appear.

 URL

http://www5.informatik.uni-erlangen/HTML/English/Research/Promoter
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

Computational Biosciences, Oak Ridge National Laboratory
 Credit

Richard J. Mural, Douglas Hyatt, Frank Larimer, Manesh Shah,
Morey Parang
 Method

Integrated neural network based system including gene assembly
using EST and homology information (GRAILexp).
 URL:

http://compbio.ornl.gov/droso
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

Center for Biological Sequence Analysis, Technical
University of Denmark
 Credit

Anders Krogh
 Method
Modular HMM incorporating database hits (proteins and ESTs/cDNAs) and other “external information” probabilistically (HMMGene); the HMM has modules for coding regions, splice sites, translation start/stop, etc.
 It will be a fully automated system.
 Trained on Drosophila data
• http://www.fruitfly.org/GASP1/data/data.html
and
• Victor Solovyev (personal communication)
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)
 References
Krogh (1998), In S.L. Salzberg et al., eds., Computational Methods in
Molecular Biology, 45-63, Elsevier.
 Krogh (1997), Gaasterland et al., eds., Proc. ISMB 97, 179-186.
 http://www.cbs.dtu.dk/krogh/refs.html

 URL
http://www.cbs.dtu.dk/services/HMMgene/
 Not yet for Drosophila.

Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

BLOCKS group, Fred Hutchinson Cancer Research
Center in Seattle, Washington
 Credit

Jorja Henikoff, Steve Henikoff
 Method
DNA translation in 6 frames and search against BLOCKS+ and against BLOCKS extracted from Smart3.0 (http://coot.embl-heidelberg.de/SMART/) using BLIMPS; automatic post-processing to join multiple predictions from the same block.
 Automatic with some user interactive screening of results.
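
The "DNA translation in 6 frames" step can be sketched as follows. This is a generic illustration using the standard genetic code, not the BLOCKS group's own pipeline; the function names and the frame labels (`+1` to `-3`) are choices made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six protein strings can then be searched against a block or profile database; stop codons appear as `*`.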

Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)
 References
Henikoff, Henikoff and Pietrokovski (1999), Nucl. Acids Res., 27,
226-228.
 Henikoff and Henikoff (1994), Proc. 27th Ann. Hawaii Intl. Conf. On
System Sciences, 265-274.
 Henikoff and Henikoff (1994), Genomics, 19, 97-107.

 URL
http://blocks.fhcrc.org
 http://blocks.fhcrc.org/blocks-bin/getblock.sh?<block name>

Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

Genome Informatics Team, IMIM, Barcelona, Spain
 Credit

Roderic Guigó, Josep F. Abril, Enrique Blanco, Moises Burset, Genis
Parra
 Method
Dynamic programming based system to combine potential exon
candidates modeled as a fifth order Markov model and functional
sequence sites modeled as a position weight matrix (Geneid version 3).
 Fully automatic, very fast.
 Trained on Drosophila data

• http://www.fruitfly.org/GASP1/data/data.html
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)
 References

Guigó et al. (1998), JCB , 5, 681-702.
 URL

Information on training process:
• http://www1.imim.es/~rguigo/AnnotationExperiment/index.html

http://www1.imim.es/geneid.html
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

Mark Borodovsky's Lab, School of Biology, Georgia
Institute of Technology
 Credit

Mark Borodovsky, John Besemer
 Method

Markov chain models combined with HMM technology
(Genemark.hmm).
 URL

http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

Biodivision, GSF Forschungszentrum für Umwelt und
Gesundheit, Neuherberg, Germany
 Credit

Matthias Scherf, Andreas Klingenhoff, Thomas Werner
 Method
Universal sequence classifier which is based on a correlated word
analysis to predict initiators and promoter associated TATA boxes
(CoreInspector V1.0 beta). Sequences of 100 bp are classified at once.
 Trained on Eukaryotic Promoter Database (EPD version 5.9).
 Fully automatic, 2 seconds per 1Kb.

 References

Scherf et al. (1999), in preparation.
 URL

http://www.gsf.de/biodv/
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

The Department of Biomathematical Sciences, Mount
Sinai School of Medicine, New York
 Credit

Gary Benson
 Method
Tandem repeats finder (TRF v2.02) uses a theoretical model of the similarity between adjacent copies of a pattern (patterns of 1-500 bp are recognized); dynamic programming for candidate validation.
 Fully automatic; very fast (seconds per 1Mb).
 http://c3.biomath.mssm.edu/trf/Adh.fa.2.7.7.80.10.50.500.1.html

 References

Benson (1999), Nucl. Acids Res., 27(2), 573-580.
 URL

http://c3.biomath.mssm.edu/trf.html
Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)

Genie, UC Berkeley/UC Santa Cruz/ Neomorphic Inc.
 Credit

Martin G. Reese, David Kulp, Hari Tammana, David Haussler
 Method
Generalized hidden Markov model with optional integration of EST hits and homology searches (Genie).
 Trained on Drosophila data
• http://www.fruitfly.org/GASP1/data/data.html
 Semi-automatic, in that the overlaps of the analyzed sequence contigs (110 kb) were manually rerun with Genie to resolve conflicts.
 BLAST used for homology searches on non-redundant protein database (nr).

Reese et al., Tutorial #3, ISMB ‘99
Submissions (cont.)
 References
Reese et al. (1997), JCB, 4(3), 311-323.
 Kulp et al. (1997), Biocomputing: Proc. Of the 1997 PSB conference,
232-244.
 Kulp et al. (1996), ISMB, 4, 134-142.

 URL

http://www.neomorphic.com/genie
Reese et al., Tutorial #3, ISMB ‘99
Submission classes
Program name
Gene
finding
Mural et al.
Oak Ridge, US
GRAILexp
X
Guigó et al.
Barcelona, ES
GeneID
X
Krogh
Copenhagen, DK
HMMGene
X
Borodovsky et al.
Georgia, US
GeneMark.hmm
X
Henikoff et al.
Fred Hutchinson,
Seattle, US
Solovyev et al.
Sanger, UK
BLOCKS
FGenes/FGenesH
Promoter
EST/cDNA
recognition Alignment
Protein
Repeat
similarity
X
Gene
function
X
X
X
X
Reese et al., Tutorial #3, ISMB ‘99
Submission classes (cont.)
Program name
Gaasterland et al.
Rockefeller, US MAGPIE
Benson et al.
Mount Sinai, US
TRF
Werner et al.
Munich, GER
CoreInspector
Gene
finding
X
Reese et al.
Berkeley/Santa
Cruz, US
X
X
X
Gene
function
X
X
X
Wise2
Genie
Protein
Repeat
similarity
X
Ohler et al.
Nuremberg, GER MCPromoter
Birney
Sanger, UK
Promoter
EST/cDNA
recognition Alignment
X
X
X
X
Reese et al., Tutorial #3, ISMB ‘99
Gene finding techniques
Program name
Statistics Promoter EST/cDNA
Alignment
Mural et al.
Oak Ridge, US
GRAILexp
X
Guigó et al.
Barcelona, ES
GeneID
X
Krogh
Copenhagen, DK
HMMGene
X
Borodovsky et al.
Georgia, US
GeneMark.hmm
X
Solovyev et al.
Sanger, UK
FGenes/FGenesH
X
Gaasterland et al.
Rockefeller, US
MAGPIE
X
X
X
Genie
X
X
X
Reese et al.
Berkeley/Santa
Cruz, US
Protein
similarity
X
X
X
X
Reese et al., Tutorial #3, ISMB ‘99
Measuring success

By nucleotide
 Sensitivity/Specificity (Sn/Sp)
By exon
 Sn/Sp
 Missed exons (ME), wrong exons (WE)
By gene
 Sn/Sp
 Missed genes (MG), wrong genes (WG)
 Average overlap statistics
Based on Burset and Guigó (1996), “Evaluation of gene structure prediction programs”, Genomics, 34(3), 353-367.
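
The nucleotide and exon levels differ in what counts as a hit, which a small sketch makes concrete. The assumptions here: exons are (start, end) pairs, 1-based inclusive, on one strand; a predicted exon is correct only if both boundaries match exactly; a missed exon is an actual exon with no overlapping prediction, and a wrong exon is a predicted exon with no overlapping actual exon. Function names are hypothetical, and inputs are assumed non-empty.

```python
from fractions import Fraction


def coding_positions(exons):
    """Set of all positions covered by the given exons."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def overlaps(a, b):
    """True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def base_level(actual, predicted):
    """Return (Sn, Sp) counted per nucleotide."""
    a, p = coding_positions(actual), coding_positions(predicted)
    tp = len(a & p)
    return Fraction(tp, len(a)), Fraction(tp, len(p))


def exon_level(actual, predicted):
    """Return Sn, Sp, ME and WE as fractions, using exact exon matches."""
    exact = len(set(actual) & set(predicted))
    missed = sum(not any(overlaps(a, p) for p in predicted) for a in actual)
    wrong = sum(not any(overlaps(p, a) for a in actual) for p in predicted)
    return {
        "Sn": Fraction(exact, len(actual)),
        "Sp": Fraction(exact, len(predicted)),
        "ME": Fraction(missed, len(actual)),
        "WE": Fraction(wrong, len(predicted)),
    }
```

A prediction that overlaps an exon but misplaces one boundary still earns credit at the base level while counting as neither correct, missed, nor wrong at the exon level.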
Reese et al., Tutorial #3, ISMB ‘99
Definitions and formulae
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)



TP = True positive
FP = False positive
FN = False negative
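
As a worked illustration of the two formulae, the sketch below computes Sn and Sp from raw counts using exact fractions. Note that Sp as defined here, TP/(TP+FP), is the quantity often called precision or positive predictive value elsewhere. The function names and the example counts are illustrative only.

```python
from fractions import Fraction


def sensitivity(tp, fn):
    """Sn = TP/(TP+FN); None when there are no actual features."""
    return Fraction(tp, tp + fn) if tp + fn else None


def specificity(tp, fp):
    """Sp = TP/(TP+FP); None when nothing was predicted."""
    return Fraction(tp, tp + fp) if tp + fp else None


# Example counts: 2 correct predictions, 5 false ones, 1 actual feature not found.
print(sensitivity(2, 1), specificity(2, 5))  # prints: 2/3 2/7
```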
Reese et al., Tutorial #3, ISMB ‘99
Genes: True positives (TP)
Reese et al., Tutorial #3, ISMB ‘99
Genes: False positives (FP)
Reese et al., Tutorial #3, ISMB ‘99
Genes: False Negatives (FN)
Reese et al., Tutorial #3, ISMB ‘99
Toy example 1 (1)
Std1     TP  FP  FN  SN   SP
 Pred1   2   1   1   2/3  2/3
 Pred2   2   5   1   2/3  2/7
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
Reese et al., Tutorial #3, ISMB ‘99
Genes: Missing Genes (MG)
Reese et al., Tutorial #3, ISMB ‘99
Genes: Wrong Genes (WG)
Reese et al., Tutorial #3, ISMB ‘99
Toy example 1 (2)
Std1     TP  FP  FN  SN   SP   MG  WG
 Pred1   2   1   1   2/3  2/3  1   1
 Pred2   2   5   1   2/3  2/7  0   4
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
Reese et al., Tutorial #3, ISMB ‘99
Genes: Std 1 versus Std 3
Std1: “conservative gene set”
Std3: “more complete gene set”
Reese et al., Tutorial #3, ISMB ‘99
Toy example 1 (3)
         TP  FP  FN  SN   SP   MG  WG
Std1
 Pred1   2   1   1   2/3  2/3  1   1
 Pred2   2   5   1   2/3  2/7  0   4
Std3
 Pred1   2   1   2   2/4  2/3  2   1
 Pred2   3   4   1   3/4  3/7  0   3
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
Reese et al., Tutorial #3, ISMB ‘99
Genes: Std1 and Std3 versus
“real” gene structure
Reese et al., Tutorial #3, ISMB ‘99
Toy example 1 (4)
         TP  FP  FN  SN   SP   MG  WG
Std1
 Pred1   2   1   1   2/3  2/3  1   1
 Pred2   2   5   1   2/3  2/7  0   4
Std3
 Pred1   2   1   2   2/4  2/3  2   1
 Pred2   3   4   1   3/4  3/7  0   3
"Real"
 Pred1   3   0   1   3/4  3/3  1   0
 Pred2   3   4   1   3/4  3/7  0   3
Reese et al., Tutorial #3, ISMB ‘99
Toy example 1 (5): Exon level
         TP  FP  FN  SN   SP    ME  WE
Std1
 Pred1   5   2   1   5/6  5/7   1   2
 Pred2   4   8   2   2/3  1/3   1   7
Std3
 Pred1   5   2   2   5/7  5/7   2   2
 Pred2   5   7   2   5/7  5/12  1   6
"Real"
 Pred1   7   0   2   7/9  7/7   1   0
 Pred2   6   6   3   2/3  1/2   1   5
Reese et al., Tutorial #3, ISMB ‘99
Genes: Joined genes (JG)
Reese et al., Tutorial #3, ISMB ‘99
Genes: Split genes (SG)
Reese et al., Tutorial #3, ISMB ‘99
Definition: “Joined” and “split” genes

JG = (# actual genes that overlap predicted genes) / (# predicted genes that overlap one or more actual genes)

SG = (# predicted genes that overlap actual genes) / (# actual genes that overlap one or more predicted genes)

JG > 1: tendency to join multiple actual genes into one prediction
SG > 1: tendency to split actual genes into separate gene predictions
Inspired by Hayes and Guigó (1999), unpublished.
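
A minimal sketch of the JG and SG ratios follows. One interpretive assumption is made and should be read as such: each numerator counts every overlapping actual/predicted pair, while each denominator counts distinct genes. This reading is chosen because it reproduces the values in Toy example 2 (JG 2, SG 1 and JG 1, SG 1.33); under a distinct-count reading of both terms the two ratios would simply be reciprocals. Genes are reduced to (start, end) spans and the function names are hypothetical.

```python
from fractions import Fraction


def overlaps(a, b):
    """True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def joined_split(actual, predicted):
    """Return (JG, SG), or (None, None) when nothing overlaps."""
    # Every overlapping (actual, predicted) pair is counted once.
    pairs = [(a, p) for a in actual for p in predicted if overlaps(a, p)]
    if not pairs:
        return None, None
    actual_hit = {a for a, _ in pairs}      # actual genes with >= 1 overlap
    predicted_hit = {p for _, p in pairs}   # predicted genes with >= 1 overlap
    jg = Fraction(len(pairs), len(predicted_hit))
    sg = Fraction(len(pairs), len(actual_hit))
    return jg, sg
```

For instance, one prediction spanning two actual genes gives JG = 2 and SG = 1, whereas four predictions covering three actual genes gives JG = 1 and SG = 4/3.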
Reese et al., Tutorial #3, ISMB ‘99
Toy example 2 (1)
Std1     TP  FP  FN  SN   SP   MG  WG  JG  SG
 Pred1   0   2   3   0    0    1   1   2   1
 Pred2   1   7   2   1/3  1/8  0   4   1   1.33
Reese et al., Tutorial #3, ISMB ‘99
Annotation experiment results

Results available during tutorial and at
http://www.fruitfly.org/GASP1/results/
Reese et al., Tutorial #3, ISMB ‘99
Results: Base level
Program         Sn (Std1)  Sp (Std3)
Fgenes CGG1     0.89       0.77
Fgenes CGG2     0.49       0.86
Fgenes CGG3     0.93       0.60
GeneID v1       0.48       0.84
GeneID v2       0.86       0.83
GeneMark HMM    0.96       0.86
Genie           0.96       0.92
Genie EST       0.97       0.91
Genie EST HOM   0.97       0.83
HMMGene         0.97       0.91
MAGPIE          0.96       0.63
Grailexp        0.81       0.86

Sensitivity:
 Low variability among predictors
 ~95% coverage of the proteome

Specificity:
 ~90%
 Programs that are more like Genscan (used for original annotation) might do better?
Reese et al., Tutorial #3, ISMB ‘99
Results: Exon level
Program         Sn (Std1)  Sp (Std3)  ME(%) (Std1)  WE(%) (Std3)
Fgenes CGG1     0.65       0.49       10.5          31.6
Fgenes CGG2     0.44       0.68       45.5          17.2
Fgenes CGG3     0.75       0.24       5.6           53.3
GeneID v1       0.27       0.29       54.4          47.9
GeneID v2       0.58       0.34       21.1          47.4
GeneMark HMM    0.70       0.47       8.1           28.9
Genie           0.70       0.57       8.1           17.4
Genie EST       0.77       0.55       4.8           20.1
Genie EST HOM   0.79       0.52       3.2           22.8
HMMGene         0.68       0.53       4.8           20.2
MAGPIE          0.63       0.41       12.1          50.2
Grailexp        0.42       0.41       24.3          28.7

Higher variability among predictors
Up to ~75% sensitivity (both exon boundaries correct)
55% specificity
Low specificity because partial exon overlaps do not count
Missing exons below 5%
Many wrong exons (~20%)
Reese et al., Tutorial #3, ISMB ‘99
Results: Gene level
Program         Sn (Std1)  Sp (Std3)  MG(%) (Std1)  WG(%) (Std3)  SG    JG
Fgenes CGG1     0.51       0.36       27.9          50.3          1.10  1.06
Fgenes CGG2     0.16       0.32       81.3          33.8          1.10  1.09
Fgenes CGG3     0.60       0.14       13.9          74.5          2.11  1.08
GeneID v1       0.07       0.07       81.3          85.4          1.06  1.62
GeneID v2       0.35       0.14       46.5          72.2          1.06  1.11
GeneMark HMM    0.56       0.31       20.9          53.5          1.07  1.11
Genie           0.56       0.37       18.6          39.0          1.17  1.08
Genie EST       0.65       0.38       11.6          41.8          1.15  1.09
Genie EST HOM   0.65       0.34       9.3           45.7          1.16  1.09
HMMGene         0.56       0.39       11.6          42.0          1.04  1.12
MAGPIE          0.47       0.25       27.9          67.0          1.22  1.06
Grailexp        0.33       0.21       37.2          52.0          1.23  1.08
Reese et al., Tutorial #3, ISMB ‘99
Results: Gene level





60% of actual genes predicted completely correctly
Specificity only 30-40%
5-10% missed genes (comparable to Sanger Centre)
40% wrong genes; many short genes are over-predicted (possibly not annotated in Standard 3)
Splitting genes is a bigger problem than joining genes
Reese et al., Tutorial #3, ISMB ‘99
Results (protein homology):
Base level
Program           Sn (Std1)  Sp (Std3)
BLOCKS            0.04       0.80
Wise2             0.12       0.82
MAGPIE cDNA       0.02       0.55
MAGPIE EST        0.31       0.32
GRAIL Similarity  0.31       0.81
Reese et al., Tutorial #3, ISMB ‘99
Results (protein homology):
Exon level
Program           Sn (Std1)  Sp (Std3)  ME(%) (Std3)  WE(%) (Std3)
BLOCKS            0.00       0.00       86.1          13.2
Wise2             0.06       0.09       77.2          14.2
MAGPIE cDNA       0.00       0.04       98.3          25.4
MAGPIE EST        0.02       0.00       64.2          56.4
GRAIL Similarity  0.07       0.35       54.4          12.4
Reese et al., Tutorial #3, ISMB ‘99
Results (protein homology):
Gene level
Program           Sn (Std1)  Sp (Std3)  MG(%) (Std3)  WG(%) (Std3)
BLOCKS            0.00       0.00       95.3          17.5
Wise2             0.00       0.00       90.6          15.7
MAGPIE cDNA       0.00       0.00       97.6          52.6
MAGPIE EST        0.00       0.00       88.3          58.5
GRAIL Similarity  0.07       0.18       74.4          29.7
Reese et al., Tutorial #3, ISMB ‘99
Transcription Start Site (TSS):
Standard 1
Reese et al., Tutorial #3, ISMB ‘99
TSS: Standard 3
Reese et al., Tutorial #3, ISMB ‘99
Results:
TSS recognition
Program        Likely (7.7%)  Unlikely (6.5%)  Possible (86.8%)
MAGPIE         153 (36.3%)    29 (6.8%)        239 (56.7%)
Genie          143 (61.1%)    62 (26.4%)       29 (12.3%)
MCPromoter     80 (9.2%)      170 (19.5%)      619 (71.2%)
CoreInspector  3 (13.0%)      3 (13.0%)        17 (74.0%)
Reese et al., Tutorial #3, ISMB ‘99
Interesting gene examples:
bubblegum
Reese et al., Tutorial #3, ISMB ‘99
Adh/Adhr (Alcohol
dehydrogenase/Adh related)
Reese et al., Tutorial #3, ISMB ‘99
Adh/Adhr (cont..)
Reese et al., Tutorial #3, ISMB ‘99
osp (outspread)

Contains Adh and Adhr embedded in an intron
Reese et al., Tutorial #3, ISMB ‘99
cact (cactus)
Reese et al., Tutorial #3, ISMB ‘99
kuz (kuzbanian)
Reese et al., Tutorial #3, ISMB ‘99
beat (beaten path)
Reese et al., Tutorial #3, ISMB ‘99
Idgf1, Idgf2, Idgf3 (Imaginal Disc Growth Factor)
Reese et al., Tutorial #3, ISMB ‘99
Idgf1, Idgf2, Idgf3 (cont.)

Chitinase-related
Gene function has changed (now a growth factor)
Reese et al., Tutorial #3, ISMB ‘99
Conclusion of GASP1

95% coverage of the proteome
Base level prediction is easier, exon level prediction is harder
Small genes over-predicted (?)
Long introns
The high number of “wrong genes” indicates possible incomplete annotation in Standard 3 (Are there more genes?)
HMMs currently seem to be the best approach
Major improvements in multiple gene regions
Reese et al., Tutorial #3, ISMB ‘99
Conclusion GASP1 (cont.)

Much lower false positive rates
Methods optimized for organism of interest do better
Including homology in gene finding does not always improve prediction
Split genes are more of a problem than joined genes
No program is perfect
Reese et al., Tutorial #3, ISMB ‘99
Discussion GASP1

Genes in introns
Alternative splicing
Genomic contamination in cDNA libraries
Translation start prediction
Biological verification of prediction needed
 Improve test bed by cDNA sequencing
 More regulation data needed to confirm promoter assessment
Combining methods
Better methods needed
GASP 2 ?
Reese et al., Tutorial #3, ISMB ‘99
Conclusions on annotating complete eukaryotic genomes

Throughput has to improve dramatically
Not only genes but also their relationships have to be elucidated
Complete transcript cDNAs are a very powerful tool for annotation, including alternative transcripts
Comparative genomics as well as expression analysis improves/completes genome annotation
Standardization efforts needed (ontology working group, OMG, OiB, NCBI/EBI, Bioxml, etc.)
 Standards for description of gene products
 Exchange format (GFF, Genbank, EMBL, XML)
Reese et al., Tutorial #3, ISMB ‘99
Conclusions on annotating complete eukaryotic genomes (cont.)

Maintenance requires even more effort than the original development
Automated methods are not good enough
Human curators can cause problems too
Functional assignment by homology is sometimes unreliable
Reese et al., Tutorial #3, ISMB ‘99
Discussion on annotating complete eukaryotic genomes

Re-annotation: updating results and annotations over time
Genomic sequence changes (indels, point mutations)
Analysis software changes
New entries in public sequence databases
Entries removed from sequence databases
Audit trail for annotations
Master copy of genome annotations should reside in the model organism databases where the expertise resides
Community collaborative annotation
Reese et al., Tutorial #3, ISMB ‘99
Acknowledgments

Uwe Ohler (University of Erlangen, Germany)
Gerry Rubin (UC Berkeley)
Sima Misra (UC Berkeley)
Erwin Frise (UC Berkeley)
Roderic Guigó (Barcelona)
GFF team (headed by Richard Bruskiewich, Sanger Centre)
Assessment team: Michael Ashburner (EBI), Peer Bork (EMBL), Richard Durbin (Sanger), Roderic Guigó (Barcelona), Tim Hubbard (Sanger)
Annotation experiment participants
Reese et al., Tutorial #3, ISMB ‘99