Genome Bioinformatics Tools
Genome Bioinformatics
Tyler Alioto
Center for Genomic Regulation
Barcelona, Spain
Jul-01-08 (06/16/08)
Bioinformatics Workshop - Malaga
Node 1 of the INB
GN1 Bioinformática y Genómica
Genome Bioinformatic Lab, CRG
Roderic Guigó (PI)
Themes
Gene prediction
  ab initio => GeneID
  dual-genome => SGP2
  U12 introns => GeneID v1.3 and U12DB
  combiner => GenePC
Genome feature visualization
  gff2ps
Alternative splicing
  ASTALAVISTA
Gene expression regulatory elements
  meta and mmeta alignment
Eukaryotic gene structure
[Figure: eukaryotic gene structure, showing the upstream regulator, promoter, exons and introns with donor and acceptor splice sites, and downstream regulator]
The Splicing Code
Gene Prediction Strategies
Expressed sequence (cDNA) or protein sequence available?
  Yes: Spliced alignment
    BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
  No: Integrated gene prediction
    Informant genome(s) available?
      Yes: Dual or n-genome de novo predictors:
        SGP2, Twinscan, NSCAN
        (Genomescan: same or cross genome protein blastx)
      No: ab initio predictors
        geneid, genscan, augustus, fgenesh, genemark, etc.
Many newer gene predictors can run in multiple modes depending on the evidence available.
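The decision tree above can be written as a small function. This is only an illustrative sketch: the function name and argument names are made up for this example, and the tool lists are copied from the slide.

```python
def choose_strategy(has_expressed_or_protein_seq, has_informant_genome):
    """Return (strategy, example_tools) following the decision tree on the slide.

    Both arguments are booleans; the names are hypothetical, chosen for this sketch.
    """
    if has_expressed_or_protein_seq:
        return ("spliced alignment",
                ["BLAT", "Exonerate", "est_genome", "spidey", "GMAP", "Genewise"])
    if has_informant_genome:
        return ("dual- or n-genome de novo prediction",
                ["SGP2", "Twinscan", "NSCAN"])
    return ("ab initio prediction",
            ["geneid", "genscan", "augustus", "fgenesh", "genemark"])


strategy, tools = choose_strategy(False, True)
print(strategy, tools)
```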
Frameworks for gene prediction
Hierarchical exon-building and chaining
Hidden Markov Models (many flavors)
  HMM, GHMM, GPHMM, Phylo-HMM
Conditional Random Fields (new!)
  Conrad, Contrast... and, no doubt, more to come
All of them involve parsing the optimal path of exons using dynamic programming (e.g. GenAmic, Viterbi algorithms)
How does GeneID approach gene prediction?
The gene prediction problem
[Figure: predicted sites (acceptors a1-a4, donors d1-d5) define candidate exons (e1-e8), which are assembled into genes (e.g. e1, e4, e8)]
GeneID
Geneid follows a hierarchical structure: signal -> exon -> gene
Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios)
Dynamic programming algorithm: maximize the score of the assembled exons (the assembled gene)
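To make the dynamic programming step concrete, here is a minimal sketch of exon chaining. It is not geneid's actual algorithm: it ignores reading-frame compatibility, exon types, and strand, and it runs in quadratic time. The exon coordinates and scores are invented for illustration.

```python
def chain_exons(exons):
    """Pick the non-overlapping set of exons with the highest total score.

    exons: list of (start, end, score) tuples.
    Returns (best_score, chain), where chain is a list of exons in genomic order.
    """
    exons = sorted(exons, key=lambda e: e[1])  # order by end coordinate
    best = []  # best[i] = (score, chain) of the best chain ending with exon i
    for i, (start, end, score) in enumerate(exons):
        prev_score, prev_chain = 0.0, []
        for j in range(i):
            # exon j may precede exon i only if it ends before exon i starts
            if exons[j][1] < start and best[j][0] > prev_score:
                prev_score, prev_chain = best[j]
        best.append((prev_score + score, prev_chain + [exons[i]]))
    if not best:
        return 0.0, []
    return max(best, key=lambda b: b[0])


# Hypothetical candidate exons: (start, end, score)
candidates = [(1, 100, 2.0), (50, 150, 3.5), (120, 200, 2.5), (160, 300, 1.5)]
best_score, gene = chain_exons(candidates)
print(best_score, gene)
```

Each exon's best chain is built from the best chain of an earlier, compatible exon, so the optimum is found without enumerating every combination of exons.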
Training GeneID
Aligned donor sites:
GAGGTAAAC
TCCGTAAGT
CAGGTTGGA
ACAGTCAGT
TAGGTCATT
TAGGTACTG
ATGGTAACT
CAGGTATAC
TGTGTGAGT
AAGGTAAGT
Position frequencies:
     1    2    3    4    5    6    7    8    9
A  0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C  0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G  0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T  0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6
ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC
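The position frequencies on this slide can be recomputed from the ten aligned donor sites. The sketch below does that and then scores a candidate site as a log-likelihood ratio against a uniform background. The uniform background of 0.25 and the floor of 0.001 for zero frequencies are assumptions of this example, not geneid's actual parameters.

```python
import math
from collections import Counter

# The ten donor sites shown on the slide
DONOR_SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def position_frequencies(sites):
    """Return {base: [frequency at position 1, 2, ...]} for aligned sites."""
    n = len(sites)
    matrix = {base: [] for base in "ACGT"}
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        for base in "ACGT":
            matrix[base].append(counts[base] / n)
    return matrix


def log_likelihood_ratio(site, matrix, background=0.25, floor=0.001):
    """Sum of log2(site frequency / background) over positions.

    `floor` replaces zero frequencies so the logarithm is defined
    (an assumption of this sketch).
    """
    return sum(
        math.log2(max(matrix[base][i], floor) / background)
        for i, base in enumerate(site)
    )


matrix = position_frequencies(DONOR_SITES)
for base in "ACGT":
    print(base, matrix[base])
print(log_likelihood_ratio("AAGGTAAGT", matrix))
```

A site that lacks the invariant GT at positions 4-5 is heavily penalized, which is the behavior one wants from a donor-site model.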
Running GeneID
command line or on geneid server
NAME
geneid - a program to annotate genomic sequences
SYNOPSIS
geneid
[-bdaefitnxszr]
[-DA] [-Z]
[-p gene_prefix]
[-G] [-3] [-X] [-M] [-m]
[-WCF] [-o]
[-j lower_bound_coord]
[-k upper_bound_coord]
[-O <gff_exons_file>]
[-R <gff_annotation-file>]
[-S <gff_homology_file>]
[-P <parameter_file>]
[-E exonweight]
[-V evidence_exonweight]
[-Bv] [-h]
<locus_seq_in_fasta_format>
RELEASE
geneid v 1.3
OPTIONS
-b: Output Start codons
-d: Output Donor splice sites
-a: Output Acceptor splice sites
-e: Output Stop codons
-f: Output Initial exons
-i: Output Internal exons
-t: Output Terminal exons
-n: Output introns
-s: Output Single genes
-x: Output all predicted exons
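For scripted runs, the command line can be assembled and launched from Python. This is a hedged sketch: `-P` is taken from the synopsis above, while `-G` is assumed here to select GFF output (it appears in the synopsis but is not described on this slide). The file names `seq.fa` and `human.param` are placeholders.

```python
import shutil
import subprocess


def build_geneid_command(fasta_path, param_file, gff_output=True):
    """Build the argument list for a geneid run (does not execute anything)."""
    cmd = ["geneid", "-P", param_file]
    if gff_output:
        cmd.append("-G")  # assumed to request GFF output
    cmd.append(fasta_path)
    return cmd


def run_geneid(fasta_path, param_file):
    """Run geneid and return its standard output as text."""
    if shutil.which("geneid") is None:
        raise FileNotFoundError("geneid executable not found on PATH")
    result = subprocess.run(
        build_geneid_command(fasta_path, param_file),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(" ".join(build_geneid_command("seq.fa", "human.param")))
```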
GeneID output
## gff-version 2
## date Mon Nov 26 14:37:15 2007
## source-version: geneid v 1.2 -- [email protected]
# Sequence HS307871 - Length = 4514 bps
# Optimal Gene Structure. 1 genes. Score = 16.20
# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
HS307871	geneid_v1.2	Internal	1710	1860	-0.11	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	1976	2055	0.24	+	2	HS307871_1
HS307871	geneid_v1.2	Internal	2132	2194	0.44	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	2434	2682	4.66	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	2749	2910	3.19	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	3279	3416	0.97	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	3576	3676	3.23	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	3780	3846	-0.96	+	1	HS307871_1
HS307871	geneid_v1.2	Terminal	4179	4340	4.55	+	0	HS307871_1
GFF: a standard annotation format
Stands for: Gene Finding Format -or- General Feature Format
Designed as a single-line record for describing features on DNA sequence -- originally used for gene prediction output
9 tab-delimited fields common to all versions:
seq source feature begin end score strand frame group
The group field differs between versions, but in every case no tabs are allowed
GFF2: group is a unique description, usually the gene name.
  NCOA1
GFF2.5 / GTF (Gene Transfer Format): tag-value pairs introduced, start_codon and stop_codon are required features for CDS
  transcript_id "NM_056789" ; gene_id "NCOA1"
GFF3: Capitalized tags follow Sequence Ontology (SO) relationships, FASTA seqs can be embedded
  ID=NM_056789_exon1; Parent=NM_056789; note="5' UTR exon"
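A record in any of these versions can be split into its nine fields the same way; only the group field needs version-specific handling. The parser below is a simplified sketch: the names `GffRecord`, `parse_gff_line`, and `parse_group` are invented for this example, and it does not handle quoted values that contain `;` or `=`.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seq: str
    source: str
    feature: str
    begin: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    group: str


def parse_gff_line(line):
    """Parse one GFF line; return None for blank lines and comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited fields, got {len(fields)}")
    seq, source, feature, begin, end, score, strand, frame, group = fields
    return GffRecord(seq, source, feature, int(begin), int(end),
                     None if score == "." else float(score),
                     strand, frame, group)


def parse_group(group):
    """Turn the group field into a dict for GFF2, GTF, or GFF3 style input."""
    attrs = {}
    for part in group.split(";"):
        part = part.strip()
        if not part:
            continue
        if "=" in part:        # GFF3 style: tag=value
            tag, value = part.split("=", 1)
        elif " " in part:      # GTF style: tag "value"
            tag, value = part.split(None, 1)
        else:                  # GFF2 style: a bare group name
            tag, value = "group", part
        attrs[tag.strip()] = value.strip().strip('"')
    return attrs


record = parse_gff_line(
    "HS307871\tgeneid_v1.2\tInternal\t1710\t1860\t-0.11\t+\t0\tHS307871_1")
print(record)
print(parse_group('transcript_id "NM_056789" ; gene_id "NCOA1"'))
```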
Visualizing features with gff2ps
generated by Josep Abril
Visualizing features on UCSC genome browser (custom tracks)
If “your” genome is served by UCSC, this is a good option because:
  browsing is dynamic
  access to other annotations
  can view DNA sequence
  can do complex intersections and filtering
gff2ps is good when:
  your genome is not on UCSC
  you want more flexible layout options
  you want to run it ‘offline’
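Predictions in GFF can be turned into a UCSC custom track by adding a `track` line and making sure the sequence name and coordinates match the browser's assembly. The sketch below assumes the predicted sequence is a known chromosomal region; the function name, the chromosome `chr1`, and the offset are hypothetical values chosen for illustration.

```python
def gff_to_custom_track(gff_lines, chrom, offset=0,
                        name="geneid", description="geneid predictions"):
    """Return custom-track text: a track line followed by re-mapped GFF records.

    chrom  : chromosome name as used by the browser (e.g. "chr1")
    offset : value added to begin/end to shift from locus to chromosome coordinates
    """
    out = [f'track name="{name}" description="{description}" visibility=2']
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        fields = line.split("\t")
        fields[0] = chrom
        fields[3] = str(int(fields[3]) + offset)
        fields[4] = str(int(fields[4]) + offset)
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"


lines = [
    "## gff-version 2",
    "HS307871\tgeneid_v1.2\tInternal\t1710\t1860\t-0.11\t+\t0\tHS307871_1",
]
track = gff_to_custom_track(lines, "chr1", offset=1000)
print(track)
```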
Extensions to GeneID
Syntenic Gene Prediction (dual-genome)
Evidence-based (constrained) gene prediction
U12 intron detection
Combining gene predictions
Selenoprotein gene prediction
Syntenic Gene Prediction: SGP2
Minor splicing and U12 introns
U12 introns make up a minor proportion of all introns (~0.33% in human, less in insects)
But they can be found in 2-3% of genes
Normally ignored, but this causes annotation problems
Easy to predict due to highly conserved donor and branch sites
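The two percentages are consistent once you account for genes having several introns each. As a rough check, assume each intron is independently U12-type with probability 0.33% and that a gene has about 8 introns. Both are simplifying assumptions of this sketch, and the figure of 8 is not from the slides. The chance that a gene has at least one U12 intron then comes out at about 2.6%.

```python
def fraction_of_genes_with_u12(u12_intron_fraction, introns_per_gene):
    """P(at least one U12 intron) = 1 - (1 - f)^n, assuming independent introns."""
    return 1.0 - (1.0 - u12_intron_fraction) ** introns_per_gene


# 0.33% of introns (from the slide); 8 introns per gene is an assumed value
p = fraction_of_genes_with_u12(0.0033, 8)
print(f"{p:.1%}")
```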
Splice Signal Profiles: major and minor
Gathering U12 Introns
[Flowchart: for human and fruit fly, candidate U12 introns are gathered from genome predictions, alignments to EST/mRNA, and all annotated introns (ENSEMBL?), then scored and merged; published U12 introns and an ortholog search (17 species) plus spliced alignment also feed into U12 DB. Counts shown in the figure: 2084, 563, 568, 385, 658, 597]
Coming Soon: GenePC, a Gene Prediction Combiner
Tutorial Homepage
http://genome.imim.es/courses/Malaga08/
GBL Homepage
http://genome.imim.es/