Transcript Test
Gene finding pipelines for automatic
annotation of new eukaryotic and
bacterial genomes
Victor Solovyev
Professor of computer science, Royal Holloway,
University of London
Chairman, Softberry Inc.
New genomes sequencing
Human, Mouse, Rat, Cow,
Sheep, Cat, Dog, Pig,
Chicken, Drosophila, Bee,
Zebrafish, Fugu, Nematodes
Arabidopsis, Rice,
Medicago, Soybean,
Barley, Poplar, Tomato,
Oat, Wheat, Corn
S.cerevisiae, S.pombe, Aspergillus
nidulans,
Coprinus cinereus
Cryptococcus neoformans,
Fusarium graminearum
Magnaporthe grisea
Neurospora crassa
Ustilago maydis
Anopheles, P. falciparum, E.
cuniculi, Chlamy, Ciona,
Diatom, White rot, P. sojae
Bacterial & bacterial communities
Expression stages and structural organization of typical
eukaryotic protein-coding gene
5’-non-coding
exon
Start of
transcription
Enhancer
3’-non-coding
exon
Introns
ATGcodon
Stopcodon
Internal exons
Core promoter
5’-
Poly-A
signal
3’-
Transcription, 5’-Capping and 3’-polyadenilation
Pre-mRNA
Splicing (removing of intron sequences)
mRNA
Translation
Protein
Ab initio multiple gene prediction
approaches
Genescan (Burge, Karlin,1997)
Probabilistic
HMMgene (Krogh, 1977)
Fgenesh (Salamov, Solovyev,1998)
Genie (Reese et al., 2000)
Likelihoods of gene
components, HMM
Balanced score as production
of likelihoods, simple features
GeneID (Guigo at al. 1992)
Pattern
recognition
Neural networks
Fgenes (Solovyev,1997)
Discriminant functions
Flexible combinations
of any discriminative features
Hidden Markov model
of
multiple eukaryotic
genes
Used in
Genescan and Fgenesh
programs
Ei and Ii are different exon
and intron states,
respectively (i=0,1,2 reflect
3 possible different ORF).
E 5/3 marks non-coding exons
and
I5/I3 are 5’- and 3’-introns
adjacent to non-coding exons.
I5
E0
E1
E2
I0
I1
I2
I3
EL
E3
EF
E5
5’-
3’-
E0
Pr
PolyA
N
PolyA
3’-
Pr
5’-
E0
E5
EL
I3
I0
I1
I2
E0
E1
E2
EF
E3
I5
Figure 10. Different functional regions of the first, internal, last and single coding exons
corresponding to components of recognition functions. RBS is ribosome binding site, ORF is
open reading frame, A and D are acceptor and donor splice sites, respectively, Stop is (TAA,
TAG, TGA stop codons).
RBS
5’-region
ATG
ORF
D
intron
5’-exon
intron
A
ORF
D
intron
intron
A
ORF
Stop 3’-region
Internal exon
3’-exon
5’-region
Single exon
ATG
ORF
Stop
3’-region
Signal differences: start of translation
Signal differences: donor splice site
Importance of good specific parameters: Rice example
Fgenesh with Monocot gene-finding parametres
.
Strategy to make gene-finding parameters
for new genomes
1. Using GeneBank genes for close organisms
2. Using new genomic sequence
•
a) Having known mRNA/cDNA sequences
•
Map mRNA by EST_MAP program on genomic sequence
•
Extract genes and use them as learning set
•
b) Using ab initio gene-prediction
Predict genes, Select genes with protein support
c) Using a database of known proteins (NR) that can be mapped
on genome by Prot_map program with reconstructing gene-structure
In addition find protein coding ORF by BESTORF program in a set of ESTs
and use them in learning of coding parameters
Learning parameters using GeneBank genes for close organisms
Select GeneBank organism class having enough known genes
Create Infogene database with reconstructed genes running Infog
program (some genes might be described in several GeneBank entries)
Run GetGenes program to extract genes from Infogen to use in learning programs
(with cleaning genes with errors in annotation in ORF and splice sites)
Run Efeature program to create:
set of coding regions (usually significantly bigger than set of genes)
set of non-coding regions
Run scripts/programs of learning coding parameters (might be several GC zones)
Run scripts/programs of learning splice sites parameters
Run scripts/programs to create exon length distributions and other statistics
Check parameters of initial probabilities (exons/introns/noncoding)
depending on gene density in genome and gene structure
Test and edit parameters to select the best variant.
Repeat learning on bigger or smaller organism classes and select the best
learning set.
Developed parameters
for fgenesh group of programs:
•
Human, Mouse, Drosophila, C. elegans, Fish (WUSTL, Baylor, CSHL, JGI)
•
•
Dicots (Arabidopsis), Nicotiana tabacum,
Monocots (Corn, Rice, Wheat, Barley) (TIGR, Rutgers University)
•
Algae, Plasmodium falciparum, Anopheles gambiae
•
•
Schizosaccharomyces pombe, Neurospora crassa,
Aspergillus nidulans, Coprinus cinereus, Cryptococcus neoformans, Fusarium
graminearum, Magnaporthe grisea, Ustilago maydis (MIT/Broad Institute)
•
•
Medicago (University of Minnesota)
Brugie malayi (TIGR)
FGENESH++: AUTOMATIC EUKARYOTIC GENOME
ANNOTATION PIPELINE
1.
RefSeq mRNA mapping by Est_map program - mapped genes are excluded
from further gene prediction process.
2.
Map all known proteins (NR) on genome by Prot_map program with gene
structure reconstruction (find regions occupied by genes)
3.
Run Fgenesh+ using mapped proteins and selected genome sequences
4.
Run ab initio Fgenesh gene prediction on the rest of genome.
5.
Search for protein homologs (by BLAST) of all products of predicted genes in
NR.
6.
Run Fgenesh+ gene prediction on sequences (from stage 4) having protein
homologs.
7.
Second run of Fgenesh in regions free from genes selected on stages 1,3,5.
8.
Run of Fgenesh gene predictions in large introns of known and predicted
genes.
Special variant of FGENESH++ can take into account synteny (human-mouse, for
example) using FGENESH-2 program that predicts genes using 2 similar
genomic sequences from different species.
Components of Fgenesh++ automatic pipeline:
Gene-finding group of program have mostly common
components and working with the same organism-specific
parameters
Fgenesh – ab initio gene prediction. Run on whole chromosomes
(~300MB). FAST: The Human genome of 3 GB sequences is processed
for ~ 4 hours
Fgenesh+ This derivative of Fgenesh uses information on homologous
proteins to improve accuracy of gene prediction, if such homologs can be found.
Fgenesh-2
Variant of Fgenesh that uses homology between two genomic
DNAsequences, such as human and mouse, as an extra factor for more accurate
gene prediction.
Fgenesh_C uses information on homologous mRNA/EST to improve accuracy
of gene prediction. Can be used to reconstruct alternatively spliced genes.
Components of Fgenesh++ automatic pipeline:
Programs for mapping known mRNA/Est or proteins
with reconstruction of gene structure
Est_map a program for fast mapping of a set of mRNAs/ESTs to a
chromosome sequence. It takes into account splice site weight matrices
for accurate mapping (important for accurate mapping very small
exons).
Prot_map is used for fast mapping a database of protein sequences to
genome with accounting for splice sites (useful for genomes with a few
known genes and to search for pseudogenes).
Example of Prot_map mapping of a protein sequence to genome
First sequence Chr19 [cut:1 3000000]
[DD] Sequence:
1(
1), S:
52.623, L:1739 IPI:IPI00170643.1|SWISS-PROT:Q8TEK3-1
Summ of block lengths: 1468, Alignment bounds:
On first
sequence: start
2146727, end
On second sequence: start
263, end
2167939, length 21213
1739, length 1477 Blocks of alignment: 19
1 E: 2146727
70 [ca GT] P: 2146727
263 L:
23, G: 101.313, W:
1160, S:14.1355
2 E: 2147573
107 [AG GT] P: 2147575
287 L:
35, G: 102.892, W:
1810, S:17.7256
3 E: 2148934
42 [AG GT] P: 2148934
322 L:
14, G: 102.539, W:
720, S:11.1699
4 E: 2150399
111 [AG GT] P: 2150399
336 L:
37, G: 101.777, W:
1880, S:18.0157
5 E: 2150620
235 [AG GT] P: 2150620
373 L:
78, G: 101.251, W:
3930, S:26.0143
6 E: 2151098
114 [AG GT] P: 2151100
452 L:
37, G: 105.778, W:
2000, S:18.7669
7 E: 2151750
92 [AG GT] P: 2151752
490 L:
30, G: 101.188, W:
1510, S:16.1227
8 E: 2153538
102 [AG GT] P: 2153538
520 L:
34, G: 100.414, W:
1690, S:17.0246
9 E: 2153848
138 [AG GT] P: 2153848
554 L:
46, G:
99.168, W:
2240, S:19.5414
10 E: 2154470
126 [AG GT] P: 2154470
600 L:
42, G: 101.071, W:
2110, S:19.0531
11 E: 2156280
485 [AG GT] P: 2156280
642 L:
161, G: 102.616, W:
8290, S:37.9091
12 E: 2156954
136 [AG GT] P: 2156955
804 L:
45, G: 103.244, W:
2340, S:20.1719
13 E: 2157771
147 [AG GT] P: 2157771
849 L:
49, G:
98.511, W:
2360, S:20.0267
14 E: 2160107
115 [AG GT] P: 2160107
898 L:
38, G: 100.777, W:
1900, S:18.0672
15 E: 2161975
584 [AG GT] P: 2161977
937 L:
194, G: 101.031, W:
16 E: 2163280
206 [AG GC] P: 2163280
1131 L:
68, G: 103.135, W:
3530, S:24.7691
17 E: 2165387
65 [AG GT] P: 2165388
1200 L:
21, G:
98.427, W:
1010, S:13.0987
18 E: 2166182
945 [AG GT] P: 2166184
1222 L:
314, G: 102.034, W:
16020, S:52.6232
19 E: 2167736
608 [AG ta] P: 2167738
1538 L:
202, G: 104.624, W:
10730, S:43.3437
9740, S:40.932
Prot_map example of alignment
1
11
2146713
2146723
2146739
2146769
gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
...............(..)evdhqlkerfanmke
GGRIVSSKPFAPLNFRINSRNLS-
248
248
249
259
267
277
2146797
2146806
2147558
2147568
2147581
2147611
]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
---------------(..)--------------- -dIGTIMRVVELSPLKGSVSWTGK
286
286
286
286
289
299
2147641
2147671
2147686
2148919
2148926
2148937
PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP
309
319
322
322
322
323
2148967
2148982
2150384
2150391
2150402
2150432
KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP
333
336
2150462
336
2150492
336
2150513
337
2150523
347
2150609
2150619
TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK
357
367
373
373
373
373
2150644
2150674
2150704
2150734
2150764
2150794
AGAATVKKPSPSKARKKKLNKKGRKMAGRKRGRPKKMNTANPERKPKKNQTALDALHAQT
AGAATVKKPSPSKARKKKLNKKGRKMAGRKRGRPKKMNTANPERKPKKNQTALDALHAQT
Analysis of gene-finding accuracy and running time
Test on 83 small (< 20 000 bp) human genes using mouse homolog:
Prot_map: Sne= 73.7 Sn_pe- 93.3 Spe- 71.3 Sn_n= 93.9 Sp_n= 88.6 C=0.9015
Time ~ 1 min
Genewise: Sne= 76.4 Sn_pe- 93.9 Spe- 76.4 Sn_n= 94.9 Sp_n=89.4 C=0.9116
Time ~ 90 min
Fgenesh:
Test on 8 big (> 400 000 bp) human genes using mouse homolog:
Prot_map: Sne= 87.9 Sn_pe- 96.0 Spe- 81.3 Sn_n= 94.3 Sp_n= 96.0 C=0.9514 Time ~ 1 min
Genewise: Sne= 91.9 Sn_pe- 97.0 Spe- 90.1 Sn_n= 95.1 Sp_n= 97.0 C=0.9599
Time ~ 1200 min
Fgenesh:
Prot_map mapping of Human protein set
of 55946 proteins on chromosome 19 (~59 MB)
takes 90 min (best hit for each protein) and
148 min (all significant hits for each protein)
Can be used for fast finding of an initial gene set in new genome mapping all known
proteins
Used for pseudogenes finding as mapping with frameshifts damaging ORFs
New Fgenesh+ and Genewise
1) 700 genes with 6508 exons having similar protein with > 90% similarity
GeneWISE: Sne= 94.1 Sn_pe- 97.8 Spe- 96.0 Sn_n= 98.9 Sp_n= 99.6 C=0.992
FGENESH+: Sne= 96.9 Sn_pe= 98.5 Spe= 97.9 Sn_n= 99.0 Sp_n= 99.5 C=0.992
2) 18 genes with 116 exons having similar Drosophila protein
with identity 28-70%
GeneWISE: Sne= 40.5 Sn_pe- 64.7 Spe- 62.7 Sn_n= 68.3 Sp_n= 99.7 C=0.813
Fgenesh+: Sne= 70.7 Sn_pe= 84.5 Spe= 82.0 Sn_n= 84.8 Sp_n= 96.9 C=0.898
5’-exon Observed - 18 Predicted - 14 Correct - 2 (11 by Fgenesh+)
Intr: Observed - 80 Predicted - 43 Correct - 38 (59 by Fgenesh+)
3’-exon: Observed - 18 Predicted - 14 Correct - 7 (12 by Fgenesh+)
Run time: Fgenesh+ 50 – 1000 times faster than GeneWise
Automated Gene Calling at
Center for Genome Research
MIT
Sequencing: 2003/2004 – 6 new yeast genomes
2004/2005 ~ 20 new yeast genomes
Gene structures are predicted using a combination of
FGENESH, FGENESH+, and GENEWISE (Sanger Institute). the
protein used in the previous had >90% amino acid identity to the translated
genome (cumulative across sub-alignments), then the GENEWISE call, if
valid, was favored over the FGENESH+ call, and was used as the
EVIDENCE_GENE
1.
If this protein had >80% but less than 90% amino acid identity to the
translated genome (cumulative across sub-alignments), then the
FGENESH+ call, if valid, was favored over the GENEWISE call, and was
used as the EVIDENCE_GENE
Examples of usage Fgenesh suit in genome annotations
•
Grimwood J, Gordon LA, Olsen A, .., Salamov A., Solovyev V., ..., Lukas S. (2004) The DNA
sequence and biology of human chromosome 19. Nature, 428(6982), 529-535. Using Fgenesh,
Fgenesh+, est_map to annotate genes in Himan cjromosome 19. annotation.
·
Heiliget al. (2003) The DNA sequence and analysis of human chromosome 14. Nature 421,
601 - 607. FGENESH used for human chromosome 14 annotation.
·
Hillier et al. (2003) The DNA sequence of human chromosome 7. Nature 424, 157 - 164.
Extensive use of FGENESH-2 for human chromosome 7 annotation.
Feng et al. (2002) Sequence and analysis of rice chromosome 4. Nature 420, 316 320. FGENESH used for annotation of rice chromosome 4.
•
Galagan et al. (2003) The genome sequence of the filamentous fungus Neurospora crassa.
Nature 422:859-868. Neurospora genome annotation based on FGENESH and FGENESH+.
Lander et al. (2001) Initial sequencing and analysis of the human genome. Nature
409, 860 - 921. Original paper on sequencing human genome by public consortium also
reports use of FGENESH genefinder for genome annotation.
Deloukas et al. (2001) The DNA sequence and comparative analysis of human
chromosome 20. Nature 414, 865-871. Use of FGENESH for annotation of human
chromosome 20.
Yu et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica).
Science 296:79-92. Rice genome sequencing and annotation project used FGENESH as
primary source of gene predictions.
Holt et al. (2002) The Genome Sequence of the Malaria Mosquito Anopheles
gambiae. Science 298: 129-149. Use of FGENESH for annotation of Anopheles genome.
Canonical and Non-canonical splice sites
SpliceDB
(Burset, Seledtsov, Solovyev, NAR 1999,2000)
GT-AG: 99.24%
a)
GC-AG: 0.69% AT-AC: 0.05%
other sites: 0.02%
GT-AG group (canonical splice sites): 22199 examples
M70A60G80|GTR95A71G81T46
Y73Y75Y78Y79Y80Y79Y78Y81Y86Y86NC71AG|G52
b) GC-AG group: 126 examples
M83A89G98|GCA87A84G97T71
c) AT-AC group: 8 annotated examples + 2 examples
recovered from annotation errors
S90|ATA100T100C100C100T100T90T70
T70G50C70NC60AC|A60T60
Gene prediction is usually done with only standard splice sites
Additional sources of genes
•
•
•
•
•
Identified with synteny data help
Non canonical splice sites
Alternatively spliced
Alternative promoters
Alternative poly-A
Additional studies of the above topics will
update the current gene collections
Exon-based syntheny
1. Run Gene-finding
annotation pipeline for
each genome
2. Select chains of
similar exons between
2 genomes comparing
coding exons by Blast
95% in agreement with filtered genome
alignments
Brudno et al.(2004) Automated Whole-Genome
Multiple Alignment of Rat, Mouse, and Human
Genome Research Journal, 14(4):685-692.
Pseudogene finding using Prot_map
54408308
54408560
54408568
54408581
54408611
54408641
nnnnnnnn(..)nnnnnnnnnnnnnag[KEFDFESANAQFNKEEMGREFHNKLKLKEDKL
--------(..)--------------- KDFDFESANAQFNKEEIDREFHNKLKLKEDKL
134
134
134
136
146
156
54408671
54408701
54408731
54408761
54408791
54408821
EKEEKPVNGEDKGDSGVDTQNSEGHADEEDALGPNCFYDQTKSSFDNISGDDNRERRPTW
EKQEKPVNGEDKGDSGVDTQNSEGNADEEDPLGPNCYYDKTKSFFDNISCDDNRERRPTW
166
176
186
196
206
216
54408851
54408881
54408911
54408939
54408967
54408976
AEGRRLNAETFGIPLCPNRGHGGYRGRGaGLGFHGGRGRg]gtggcagaagtggta(..)
AEERRLNAETFGIPLRPNRGRGGYRGRG-GLGFRGGRGR- ---------------(..)
226
236
246
255
264
264
Pseudogene finder
• Generation pseudogene candidates:
• Run script finding genes having almost identical
coding proteins (or part of them) with lesser
number introns (or without introns).
• Run prot_map mapping Human (mammalian)
proteins and selecting damaged ones
• Selecting pseudogenes using additional
features: like poly_A tail, ratio ks/kn
Development of eukaryotic promoter recognizer
In group of TSS programs
Table 7. Selected characteristics of promoter sequences used by TSSW programs for TATA+
and TATA- promoters.
Characteristiscs
D2 for TATA+ promoters
D2 for TATA- promoters
•Hexaplets -200 - -45
2.6
1.4 (-100 - -1)
•TATA box score
3.4
0.9
•Triplets around TSS
4.1
0.7
•Hexaplets +1 - +40
0.9
•Sp1-motif content
0.9
•TATA fixed location
0.7
•CpG content
1.4
0.7
•Similarity -200 - -100
0.3
0.7
•Motif Density(MD) -200 - +1
4.5
3.2
•Direct/Inverted MD-100 - +1
4.0
3.3 ( -100 - -1)
Total Maxalonobis distance
11.2
4.3
Number promoters/non-promoters
203/4000
193/74000
Results of promoter search on genes with known mRNAs
by different promoter-finding programs. Reproduced from
Liu and States (2002) Genome Research 12:462-469.
Accuracy of prediction by TSSP on plant genomic sequences
Selected known genomic regions upstream of CDS
True positives 92%
Total number of False positives for 40 TATA promoters: 22
(1 per 3648 bp)
True positives 95%
Total number of False positives for 25 TATA –less promoters: 15
(1 per 3300 bp)
For every class (TATA and TATA-less) promoters only one predicted TSS with
highest score in an interval of 300 bp was taken during the search.
tssp Wed Jul 10 02:52:32 EDT 2002
>gi|1902902|dbj|AB001920.1| Oryza sativa (japonica cultivar-group) gene
for phos
Length of sequence5871
Thresholds for TATA+ promoters - 0.02, for TATA-/enhancers - 0.04
2 promoter/enhancer(s) are predicted
Promoter Pos:
1522 LDF- 0.13 TATA box at
1488
18.93
Enhancer Pos:
1597 LDF- 0.12
Transcription factor binding sites/RegSite DB:
for promoter at position 1522
1468 (-) RSP00004
tagaCACGTaga
1459 (+) RSP00010
cACGTG
1456 (+) RSP00011
ctccACGTGgt
1461 (+) RSP00016
caTGCAC
1468 (-) RSP00016
caTGCAC
1256 (-) RSP00026
gcttttgaTGACtTcaaacac
1460 (+) RSP00065
ACGTGgcgc
1460 (+) RSP00066
ACGTGccgc
1459 (+) RSP00069
tACGTG
1341 (+) RSP00071
GACGTC
1346 (-) RSP00071
GACGTC
1452 (-) RSP00096
GGTTT
1432 (+) RSP00129
CACGAC
1281 (+) RSP00148
CGACG
1284 (+) RSP00148
CGACG
1315 (+) RSP00148
CGACG
1335 (+) RSP00148
CGACG
1340 (+) RSP00148
CGACG
1365 (+) RSP00148
CGACG
1434 (+) RSP00148
CGACG
1458 (+) RSP00148
CGACG
1347 (-) RSP00148
CGACG
1474 (+) RSP00162
ACACccGagctaaccacaac
1348 (+) RSP00241
CGGTCA
1387 (+) RSP00339
RTTTTTR
1264 (-) RSP00397
AGTGGCGG
1268 (+) RSP00422
ACCGAC
1459 (+) RSP00423
GACGTG
1464 (-) RSP00424
CACGTC
1369 (-) RSP00431
rdygRCRGTTRs
1278 (-) RSP00432
cVacGGTaGGTgg
1249 (-) RSP00436
TTGACT
1260 (+) RSP00463
atttcatggCCGACctgcttttt
1260 (+) RSP00464
acttgatggCCGACctctttttt
1260 (+) RSP00465
aatatactaCCGACcatgagttct
1265 (+) RSP00466
actaCCGACatgagttccaaaaagc
Genes with predi cted TSS
22
D is t a n c e b e t w e e n P r e d ic t e d a n d a n n o t a t e d T S S
4
-5:-1
-6:-20
0
+1:+5
2
1
-37
-144
1
5
Predicted TSSs relative to annotated TSS (0)
Figure 2. Localization of the predicted nearest TSS in relation to the known TSSs for
35 of 40 genes with the annotated TATA promoters.
PromH with ortologous sequences
753
******
*******
******
866
GTGGGgCTA-TTTTAaGcaCAGCcTCTtGGcCTgCACactcccctggcccccagCCcCCaGCaGcTCaGCtACTGGTcACCTGCC---------ACCgCCTGGAATGCTGATTGGCAGTTgGct
GTGGGaCTAtTTTTAtGtcCAGCtcCTcGGaCTaCAC----------------cCCaCCcCCtGtTCtGCcACTGGTtgCCTGCCtctgcctgtACCtCCTGGAATGCTGATTGGCAGTTaG-987
1094
TATA(-26)
TSS(+1)
867
********
951
GGggtGgGTGGGGGCTGGGAAGACacTaTTATAAAGCtgggAGtG-TtgGGaAGCAGCcGTCcCC-----gTCCaGaGTCCtCTGtggtCC . . .CDS(963)
GGctgGaGTGGGGGCTGGGAAGAC-—TgTTATAAAGCctaaAGgGcTaaGGgAGCAGCtGTCaCCtggagcTCCtGcGTCCcCTGcccaCC
CDS(1182)
1095
1181
TATA(-28)
TSS(-1)
Fgenesb_annotator - Bacterial Gene/Operon
Prediction and Annotation Pipeline
FGENESB is a new complex package
for annotation of bacterial genomes. Its
gene prediction algorithm is based on
Markov chain models of coding regions
and translation and termination sites.
Operon models are based on distances
between ORFs, frequencies of different
genes neighboring each other in known
bacterial genomes, predicted promoters
and terminators
The parameters of gene prediction are
self-learning, so the only input necessary
for annotation of new genome is a
sequence.
Fgenesb accuracy on difficult sets
Sn (exact
predictions)
Sn (exact+overlapping
predictions)
123set:
Glimmer
GeneMarkS
FgenesB
57.0%
82.9
89.3
91.1
91.9
98.4
72set:
Glimmer
GeneMarkS
FgenesB
57.0%
88.9
91.5
91.7
94.4
98.6
51set:
Glimmer
GeneMarkS
FgenesB
51.0%
90.2
92.0
88.2
94.1
98.0
rRNA and tRNA annotation
STEP 1.
Finds all potential ribosomal RRNA genes using BLAST
against bacterial and/or archaeal RRNA databases.
and masks detected RRNA genes.
STEP 2.
Predicts tRNA genes using tRNAscan-SE program.
Inside bactg_ann.pl - run tRNAscan-SE and masks
detected TRNA genes .
Genes and Operon identification
STEP 3.
Initial predictions of long, slightly overlapping ORF that are
used as a starting point for calculating parameters of
predictions. Iterates until stabilizes.
Generates parameters such as 5th-order in-frame Markov
chains for coding regions, 2nd-order Markov models for region
around start codon and upstream RBS site, Stop codon and
probability distributions of ORF lengths.
Protein coding genes prediction
STEP 4.
it predicts operons based only on distances between predicted
genes.
Annotate genes comparing with
databases of known proteins
STEP 5.
Runs blastp for predicted proteins against COG databasecog.pro and annotate by COGs descriptions
STEP 6.
Run blastp against NR for proteins having no COGs hits
And annotate by NR descriptions.
Promoters and Terminators prediction and
improvement of operons assignment
STEP 7.
Uses information about conservation of neighbor gene pairs in known
genomes to improve operon prediction.
STEP 8.
predicts potential promoters (tssb) and terminators (bterm) in the
corresponding 5'-upstream and 3'-downstream regions of predicted genes.
Tssb- bacterial promoter prediction (sigma70), using dicriminant function
with characteristics of sequence features of promoters (such as conserved
motifs, binding sites and etc)
Bterm - prediction of pho-independent terminators as hairpins,
with energy scoring based on discriminant function of hairpin elements.
STEP 9.
refines operon predictions using predicted promoters and terminators as
additional evidences.
Fgenesb_annotator output:
1
+
+
+
+
+
+
+
+
+
CDS
Term
Prom
CDS
Term
Prom
CDS
CDS
CDS
407 1786 1847 1926 3074 3105 3193 3418 4578 -
1747
1823
1906
3065
3122
3164
3405
4545
6506
+
+
+
Term
Prom
CDS
6516 6512 6595 -
6551
6571
9066
4.7
2.3
2957
Term
+ SSU_RRNA
ribosomal RNA # Bacillus cereus
+
TRNA
+
TRNA
+ LSU_RRNA
ribosomal RNA # Bacillus
7
3 Op 1
.
CDS
+ 5S_RRNA
5S ribosomal RNA # Bacillus
8
3 Op 2
.
CDS
9
3 Op 3
.
CDS
Prom
9067 9308 -
9098
10861
3.4
100.0
# AY138279 [D:1..1554] # 16S
10992 11077 11233 -
11068
11152
14154
101.2
93.9
99.0
# Ile GAT 0 0
# Ala TGC 0 0
# AF267882 [D:1..2922] # 23S
14175 14205 -
14363
14315
14353 15170 15373 -
15249
15352
15432
2
1 Op
1 Op
1
2
21/0.000
3/0.019
3
4
2 Op 1
4/0.002
2 Op 2
4/0.002
2 Op 3 16/0.000
(topoisomerase II) B subunit
6
2 Op 4
.
(topoisomerase II) A subunit
+
1311 ## COG0593 ATPase involved in
3.2
10.5
1237 ## COG0592 DNA polymerase
9.1
4.0
278 ## COG2501 Uncharacterized A
899 ## COG1195 Recombinational D
2148 ## COG0187 DNA gyrase
## COG0188 DNA gyrase
158
97.0 # AE017026 [D:165635..165750]
351
99
6.9
## Similar_to_GB
Comparison of 2 bacterial genomes
GenomMatch aligns 2 bacterial genomes 2 MB x 2MB ~ 30 sec
Figure1
Nature (2004) 428 (6978) , p. 37 – 43
Annotation of new bacteria
New drugs
Annotation of bacterial communities DNA from
Specific sources
(not growing in Labs)
Oceans/Acid mines/agriculture
(with mix of 100s species)
New ferments
• Main Collaborators:
• Asaf Salamov, Igor Seledtsov,
• Ilham Shahmuradov