Genome Bioinformatics Tools


Genome Bioinformatics
Tyler Alioto
Center for Genomic Regulation
Barcelona, Spain
Jul-01-08
Bioinformatics Workshop - Malaga
Node 1 of the INB

GN1 Bioinformática y Genómica

Genome Bioinformatics Lab, CRG

Roderic Guigó (PI)
Themes

Gene prediction
  ab initio => GeneID
  dual-genome => SGP2
  U12 introns => GeneID v1.3 and U12DB
  combiner => GenePC
Genome feature visualization
  gff2ps
Alternative splicing
  ASTALAVISTA
Gene expression regulatory elements
  meta and mmeta alignment
Eukaryotic gene structure
[Figure: eukaryotic gene structure, labelled with upstream regulator, promoter, exons, introns, donor and acceptor splice sites, and downstream regulator]
The Splicing Code
Gene Prediction Strategies

Expressed sequence (cDNA) or protein sequence available?
  Yes => Spliced alignment
    BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
  No => Integrated gene prediction
    Informant genome(s) available?
      Yes => Dual or n-genome de novo predictors:
        SGP2, Twinscan, NSCAN
        (Genomescan: same or cross genome protein blastx)
      No => ab initio predictors
        geneid, genscan, augustus, fgenesh, genemark, etc.
Many newer gene predictors can run in multiple modes depending on the evidence available.
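
The decision tree on this slide can be written down as a small function. This is only an illustrative sketch: the function name, argument names and the `TOOLS` table are made up for this example, and the tool lists simply repeat the programs named on the slide.

```python
def choose_strategy(has_expressed_or_protein_evidence: bool,
                    has_informant_genome: bool) -> str:
    """Return the strategy family suggested by the decision tree."""
    if has_expressed_or_protein_evidence:
        return "spliced alignment"
    if has_informant_genome:
        return "dual or n-genome de novo prediction"
    return "ab initio prediction"


# Example programs named on the slide for each branch.
TOOLS = {
    "spliced alignment": ["BLAT", "Exonerate", "est_genome", "spidey", "GMAP", "Genewise"],
    "dual or n-genome de novo prediction": ["SGP2", "Twinscan", "NSCAN"],
    "ab initio prediction": ["geneid", "genscan", "augustus", "fgenesh", "genemark"],
}

if __name__ == "__main__":
    strategy = choose_strategy(has_expressed_or_protein_evidence=False,
                               has_informant_genome=True)
    print(strategy, "->", ", ".join(TOOLS[strategy]))
```

In practice the branches are not exclusive, since many newer predictors combine several kinds of evidence in one run.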
Frameworks for gene prediction

Hierarchical exon-building and chaining
Hidden Markov Models (many flavors)
  HMM, GHMM, GPHMM, Phylo-HMM
Conditional Random Fields (new!)
  Conrad, Contrast... and, no doubt, more to come
All of them involve parsing the optimal path of exons using dynamic programming (e.g. GenAmic, Viterbi algorithms)
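
As a reminder of what "parsing the optimal path" means, here is a minimal Viterbi sketch for a toy two-state HMM. Everything in it is an assumption made for illustration: the two states (N = non-coding, C = coding), the probabilities, and the idea that coding sequence is simply GC-rich. Real gene-finding HMMs have many more states and trained parameters.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy state labels)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(seq: str) -> str:
    """Return the most probable state path for a non-empty DNA string."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("AAAAGGGG"))
```

Working in log space avoids numerical underflow on long sequences, which is why the probabilities are added rather than multiplied.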
How does GeneID approach
gene prediction?
The gene prediction problem
[Figure: predicted sites (acceptors a1-a4, donors d1-d5) define candidate exons (e1-e8); compatible exons are then assembled into genes (in the example, e1, e4 and e8)]
GeneID

Geneid follows a hierarchical structure:
  signal => exon => gene
Exon score:
  Score of exon-defining signals + protein-coding potential (log-likelihood ratios)
Dynamic programming algorithm:
  maximize score of assembled exons => assembled gene
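
The assembly step can be illustrated with a simplified exon-chaining sketch. This is not GeneID's own algorithm: it is a plain quadratic-time dynamic program over hypothetical candidate exons, and it ignores reading-frame and exon-type compatibility, which a real gene assembler has to enforce. The coordinates and scores below are invented.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int, float]  # (start, end, score), coordinates inclusive


def best_exon_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest-scoring chain of non-overlapping exons."""
    ordered = sorted(exons, key=lambda e: e[1])  # process by end coordinate
    best: List[float] = []          # best[i]: best chain score ending at ordered[i]
    prev: List[Optional[int]] = []  # prev[i]: previous exon in that chain, if any
    for i, (start, _end, score) in enumerate(ordered):
        best_prev, best_prev_score = None, 0.0
        for j in range(i):
            # Exon j may precede exon i only if it ends before i starts.
            if ordered[j][1] < start and best[j] > best_prev_score:
                best_prev, best_prev_score = j, best[j]
        best.append(score + best_prev_score)
        prev.append(best_prev)
    if not best:
        return 0.0, []
    i: Optional[int] = max(range(len(best)), key=best.__getitem__)
    total, chain = best[i], []
    while i is not None:  # follow back-pointers to recover the chain
        chain.append(ordered[i])
        i = prev[i]
    return total, chain[::-1]


if __name__ == "__main__":
    candidates = [(10, 50, 2.0), (40, 90, 3.5), (100, 150, 1.0),
                  (120, 200, -0.5), (160, 220, 2.5)]
    print(best_exon_chain(candidates))
```

Because each exon's best chain only depends on exons that end before it starts, sorting by end coordinate lets every subproblem be solved once.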
Training GeneID
Example donor sites (aligned, positions 1-9):
GAGGTAAAC
TCCGTAAGT
CAGGTTGGA
ACAGTCAGT
TAGGTCATT
TAGGTACTG
ATGGTAACT
CAGGTATAC
TGTGTGAGT
AAGGTAAGT

Base frequency at each position:
     1   2   3   4   5   6   7   8   9
A  0.3 0.6 0.1 0.0 0.0 0.6 0.7 0.2 0.1
C  0.2 0.2 0.1 0.0 0.0 0.2 0.1 0.1 0.2
G  0.1 0.1 0.7 1.0 0.0 0.1 0.1 0.5 0.1
T  0.4 0.1 0.1 0.0 1.0 0.1 0.1 0.2 0.6

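
The frequency table on this slide can be recomputed from the ten 9-mers shown with it, as in the sketch below. The `frequency_matrix` function reproduces the table; `log_odds_score` then scores a candidate site as a sum of log-likelihood ratios. The uniform background (0.25 per base) and the pseudocount are assumptions chosen for this example, not GeneID's actual parameters.

```python
import math

DONOR_SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]
BASES = "ACGT"


def frequency_matrix(sites):
    """Per-position base frequencies for equal-length aligned sites."""
    length = len(sites[0])
    return [
        {b: sum(s[pos] == b for s in sites) / len(sites) for b in BASES}
        for pos in range(length)
    ]


def log_odds_score(site, sites, background=0.25, pseudocount=0.5):
    """Sum over positions of log2(P(base | site model) / P(base | background)).

    The pseudocount keeps unseen bases from giving a score of minus infinity;
    both it and the uniform background are illustrative assumptions.
    """
    n = len(sites)
    total = 0.0
    for pos, base in enumerate(site):
        count = sum(s[pos] == base for s in sites)
        prob = (count + pseudocount) / (n + pseudocount * len(BASES))
        total += math.log2(prob / background)
    return total


if __name__ == "__main__":
    matrix = frequency_matrix(DONOR_SITES)
    for base in BASES:
        print(base, " ".join(f"{column[base]:.1f}" for column in matrix))
    print(log_odds_score("AAGGTAAGT", DONOR_SITES))
```

A site that matches the invariant GT at positions 4-5 scores far higher than one that does not, which is the behaviour a splice-site model needs.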
ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC
Running GeneID
command line or on geneid server
NAME
geneid - a program to annotate genomic sequences
SYNOPSIS
geneid  [-bdaefitnxszr]
        [-DA] [-Z]
        [-p gene_prefix]
        [-G] [-3] [-X] [-M] [-m]
        [-WCF] [-o]
        [-j lower_bound_coord]
        [-k upper_bound_coord]
        [-O <gff_exons_file>]
        [-R <gff_annotation-file>]
        [-S <gff_homology_file>]
        [-P <parameter_file>]
        [-E exonweight]
        [-V evidence_exonweight]
        [-Bv] [-h]
        <locus_seq_in_fasta_format>
RELEASE
geneid v 1.3
OPTIONS
-b: Output Start codons
-d: Output Donor splice sites
-a: Output Acceptor splice sites
-e: Output Stop codons
-f: Output Initial exons
-i: Output Internal exons
-t: Output Terminal exons
-n: Output introns
-s: Output Single genes
-x: Output all predicted exons
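
A small helper for assembling a geneid command line from the synopsis above might look like the sketch below. The file names (`seq.fa`, `human.param`) are hypothetical, and the reading of `-G` as "print predictions in GFF" is an assumption not stated on this slide, so check `geneid -h` on your installation. The helper only builds the argument list; it does not run the program.

```python
import shlex
from typing import List, Optional


def geneid_command(fasta: str, param_file: str, gff: bool = True,
                   output_flags: str = "",
                   gene_prefix: Optional[str] = None) -> List[str]:
    """Build an argument list for geneid from the options in the synopsis."""
    cmd = ["geneid", "-P", param_file]
    if gff:
        cmd.append("-G")  # assumed meaning: GFF output
    if output_flags:
        cmd.append("-" + output_flags)  # e.g. "da" for donor and acceptor sites
    if gene_prefix:
        cmd += ["-p", gene_prefix]
    cmd.append(fasta)
    return cmd


if __name__ == "__main__":
    # Hypothetical input files; replace them with your own.
    print(shlex.join(geneid_command("seq.fa", "human.param", output_flags="da")))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` once geneid and a parameter file are available locally.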
GeneID output
## gff-version 2
## date Mon Nov 26 14:37:15 2007
## source-version: geneid v 1.2 -- [email protected]
# Sequence HS307871 - Length = 4514 bps
# Optimal Gene Structure. 1 genes. Score = 16.20
# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
HS307871	geneid_v1.2	Internal	1710	1860	-0.11	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	1976	2055	0.24	+	2	HS307871_1
HS307871	geneid_v1.2	Internal	2132	2194	0.44	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	2434	2682	4.66	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	2749	2910	3.19	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	3279	3416	0.97	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	3576	3676	3.23	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	3780	3846	-0.96	+	1	HS307871_1
HS307871	geneid_v1.2	Terminal	4179	4340	4.55	+	0	HS307871_1
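
The output above is easy to check with a few lines of code. The sketch below parses the nine exon records and recomputes the summary figures: the exon lengths add up to 1173 bp, that is 391 codons, matching the reported 391 aa, and the exon scores add up to 16.21, within rounding of the reported 16.20. The `Feature` class and function name are illustrative; fields are split on any whitespace, which works here because no field contains a space (real GFF is tab-delimited).

```python
from typing import List, NamedTuple

GENEID_OUTPUT = """\
# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
HS307871 geneid_v1.2 Internal 1710 1860 -0.11 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 1976 2055 0.24 + 2 HS307871_1
HS307871 geneid_v1.2 Internal 2132 2194 0.44 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2434 2682 4.66 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2749 2910 3.19 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3279 3416 0.97 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3576 3676 3.23 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3780 3846 -0.96 + 1 HS307871_1
HS307871 geneid_v1.2 Terminal 4179 4340 4.55 + 0 HS307871_1
"""


class Feature(NamedTuple):
    seq: str
    source: str
    feature: str
    begin: int
    end: int
    score: float
    strand: str
    frame: str
    group: str


def parse_gff2(text: str) -> List[Feature]:
    """Parse GFF2 records, skipping comment and blank lines."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split(None, 8)  # at most 9 fields; the group keeps its spaces
        features.append(Feature(f[0], f[1], f[2], int(f[3]), int(f[4]),
                                float(f[5]), f[6], f[7], f[8]))
    return features


if __name__ == "__main__":
    exons = parse_gff2(GENEID_OUTPUT)
    coding_bp = sum(e.end - e.begin + 1 for e in exons)  # coordinates are inclusive
    print(len(exons), "exons,", coding_bp, "bp,", coding_bp // 3, "codons")
    print("summed exon score:", round(sum(e.score for e in exons), 2))
```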
GFF: a standard annotation format

Stands for:
  Gene Finding Format -or- General Feature Format
Designed as a single line record for describing features on DNA sequence; originally used for gene prediction output
9 tab-delimited fields common to all versions
  seq source feature begin end score strand frame group
The group field differs between versions, but in every case no tabs are allowed
  GFF2: group is a unique description, usually the gene name.
    NCOA1
  GFF2.5 / GTF (Gene Transfer Format): tag-value pairs introduced, start_codon and stop_codon are required features for CDS
    transcript_id "NM_056789" ; gene_id "NCOA1"
  GFF3: Capitalized tags follow Sequence Ontology (SO) relationships, FASTA seqs can be embedded
    ID=NM_056789_exon1; Parent=NM_056789; note="5' UTR exon"
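
Since the ninth field is where the versions differ, a parser usually has to look at its shape. The sketch below handles the three example values from this slide; the function name and the detection rules are illustrative assumptions and do not amount to a complete implementation of any of the specifications (for instance, it does not decode escaped characters in GFF3).

```python
import re
from typing import Dict


def parse_group(group: str) -> Dict[str, str]:
    """Parse the ninth GFF field in GFF2, GTF or GFF3 style."""
    group = group.strip()
    if re.match(r"^\w+=", group):
        # GFF3 style: tag=value pairs separated by semicolons.
        pairs = (item.split("=", 1) for item in group.split(";") if item.strip())
        return {tag.strip(): value.strip().strip('"') for tag, value in pairs}
    if re.match(r'^\w+\s+"', group):
        # GTF style: tag "value" pairs separated by semicolons.
        return dict(re.findall(r'(\w+)\s+"([^"]*)"', group))
    # GFF2 style: a free-text group name.
    return {"group": group}


if __name__ == "__main__":
    for example in (
        "NCOA1",
        'transcript_id "NM_056789" ; gene_id "NCOA1"',
        "ID=NM_056789_exon1; Parent=NM_056789; note=\"5' UTR exon\"",
    ):
        print(parse_group(example))
```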
Visualizing features with gff2ps
generated by Josep Abril
Visualizing features on UCSC genome browser (custom tracks)

If “your” genome is served by UCSC, this is a good option because:
  browsing is dynamic
  access to other annotations
  can view DNA sequence
  can do complex intersections and filtering
gff2ps is good when:
  your genome is not on UCSC
  you want more flexible layout options
  you want to run it ‘offline’
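
To load predictions as a UCSC custom track, the GFF records need a `track` header line and sequence names that match the browser's chromosome names. The sketch below shows one way to prepare such a file. The function name, track name and the `chr1` chromosome are hypothetical, and it assumes the coordinates are already relative to that chromosome; predictions made on an extracted region would need their coordinates shifted first.

```python
from typing import Iterable, Optional


def to_custom_track(gff_lines: Iterable[str], name: str, description: str,
                    chrom: Optional[str] = None) -> str:
    """Return custom-track text: a track line followed by the GFF records."""
    out = [f'track name="{name}" description="{description}" visibility=2']
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # drop comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if chrom is not None:
            fields[0] = chrom  # rename the sequence to a browser chromosome
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    records = [
        "# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20",
        "HS307871\tgeneid_v1.2\tInternal\t1710\t1860\t-0.11\t+\t0\tHS307871_1",
    ]
    print(to_custom_track(records, "geneid", "geneid predictions", chrom="chr1"),
          end="")
```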
Extensions to GeneID

Syntenic Gene Prediction (dual-genome)
Evidence-based (constrained) gene prediction
U12 intron detection
Combining gene predictions
Selenoprotein gene prediction
Syntenic Gene Prediction: SGP2
Minor splicing and U12 introns

U12 introns make up a minor proportion of all introns (~0.33% in human, less in insects)
But they can be found in 2-3% of genes
Normally ignored, but this causes annotation problems
Easy to predict due to highly conserved donor and branch sites
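
The two percentages on this slide are consistent with each other, as a back-of-the-envelope calculation shows. The sketch assumes about 8 introns per gene and treats introns as independent; both are simplifying assumptions made for this example, not figures from the slide.

```python
def fraction_of_genes_with_u12(p_u12_intron: float, introns_per_gene: float) -> float:
    """Probability that a gene has at least one U12 intron.

    Assumes every intron is U12-type independently with the same probability.
    """
    return 1.0 - (1.0 - p_u12_intron) ** introns_per_gene


if __name__ == "__main__":
    # 0.33% of human introns are U12-type (from the slide);
    # 8 introns per gene is an assumed round number.
    fraction = fraction_of_genes_with_u12(0.0033, 8)
    print(f"{fraction:.1%} of genes expected to contain a U12 intron")
```

With these inputs the result falls inside the 2-3% range quoted on the slide, which is why a rare intron class can still affect a noticeable share of gene annotations.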
Splice Signal Profiles:
major and minor
Gathering U12 Introns
[Figure: flowchart for gathering U12 introns in human and fruit fly. Box labels in each species: predict (from genome), aln to EST/mRNA, all annotated introns (ENSEMBL?), score, merge; followed by ortholog search (17 species) + spliced alignment, leading to the published U12 DB. Intron counts shown in the figure: 2084, 563, 568, 385, 658, 597.]
Coming Soon: GenePC
a Gene Prediction Combiner
Tutorial Homepage

http://genome.imim.es/courses/Malaga08/
GBL Homepage

http://genome.imim.es/