GENE FINDING - Cayetano Heredia University

GENE FINDING
Module 3
• Gene Finding:
• Sequence properties
• Gene finding software
• Homology
• Protein domains
• Pfam
• Prosite
• hmmer
• Annotation
• Database flat files
• Artemis
• Gene ontologies
• Genome comparisons
• Synteny
• ACT
Gene finding
• Artemis genome viewer
• Coding sequence vs non-coding sequence
• Gene finding software
• Homology between species
• ESTs
Diagram: the DNA sequence is run through analysis tools to locate features.
Tools: RepeatMasker, Blastn, Fasta, BlastP, gene finders, Pfam, Blastx, Halfwise, Prosite, Psort, tRNA scan, SignalP, TMHMM
Features: repeats, promoters, rRNA, tRNA, pseudo-genes, genes
The Annotation Process
ANALYSIS SOFTWARE
DNA SEQUENCE
Useful Information
Annotator
Artemis
• Artemis is a free DNA sequence viewer and
annotation tool that allows visualization of
sequence features and the results of
analyses within the context of the sequence,
and its six-frame translation.
• http://www.sanger.ac.uk/Software/Artemis/
atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca
tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg
cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat
ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt
atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca
tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg
agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa
ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat
tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa
ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa
taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat
taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat
atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgt
attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta
ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata
tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga
atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata
tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt
ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg
taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc
aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa
taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata
tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat
tctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt
ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa
tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt
tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta
agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata
aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa
ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct
ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa
cacataacaattataatgacatatcaaataataataataataataataatattaatggggtgaaagaccatataaataataacactctggaaaataatga
tgaaccaatcttatctatatataatgaagatcttaatgttttatatatatgccaaaatatgtataacgtcctttttgttttgaatttaaataacctaagt
DNA in Artemis
AT content
Forward translations
Reverse translations
DNA and amino acids
Gene structure
• IN TRYPANOSOMATIDS
– Polycistronic structure
– Genes occur on a single strand at a time.
– Inflection points
– No splicing
Trypanosome gene structure
GENE STRUCTURE IN MALARIA
• Splicing
• No polycistronic units
• Can have small exons
• Low complexity regions
Gene Structure
DONORS AND ACCEPTORS
ta
nnnnnnnnnnnnnnnn GTAT
g
Exon
Any size
TTTT
AaaCAG
nnnnnnnnnnnnnnnnn
Ccca
gggg
Intron
~50-500bp
Exon
Any size
AT content
• Coding regions have higher GC content
in AT rich genomes
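A minimal sketch of how the composition difference can be measured: GC content computed in sliding windows along a sequence. The function names and the default window and step sizes are illustrative assumptions, not values taken from Artemis.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (case-insensitive)."""
    seq = seq.lower()
    if not seq:
        return 0.0
    return (seq.count("g") + seq.count("c")) / len(seq)


def sliding_gc(seq, window=120, step=60):
    """Return (start, GC fraction) for every full window along seq.

    window and step are hypothetical defaults chosen for illustration.
    """
    return [(start, gc_content(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]
```

In an AT-rich genome, windows with a clearly higher GC fraction than their neighbours are candidates for coding sequence and worth inspecting.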
AT content
CODON USAGE
• Codon bias differs between organisms.
• Base composition is constrained in coding
regions but not in non-coding regions.
• The codon usage of a particular gene can
influence its expression.
Codon usage
• All organisms have a preferred set of
codons.
Valine codons (fraction of use)
Codon   Malaria   Trypanosoma
GUU     0.41      0.28
GUC     0.06      0.19
GUA     0.42      0.14
GUG     0.11      0.39
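Fractions like those shown for the valine codons (GUU, GUC, GUA, GUG) can be obtained by counting codons in known coding sequences. The sketch below assumes the input is a single in-frame coding sequence; the function name is hypothetical.

```python
from collections import Counter

VALINE = ("GUU", "GUC", "GUA", "GUG")


def synonymous_fractions(cds, codons=VALINE):
    """Fraction of each synonymous codon among those found in an in-frame CDS.

    Assumes cds starts at codon position 1; T is converted to U so that
    DNA and RNA input give the same result.
    """
    rna = cds.upper().replace("T", "U")
    counts = Counter(rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}
```

For example, `synonymous_fractions("GTTGTTGTAGTG")` counts two GUU, one GUA and one GUG. A real usage table would pool the counts over many genes from the same organism.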
Codon Usage
• http://www.kazusa.or.jp/codon/
Codon Usage Table
Fields: codon, frequency per thousand, (number of occurrences)
UUU  34.3(26847)  UUC   7.3( 5719)  UUA  49.2(38527)  UUG  10.1( 7911)
UCU  15.3(11956)  UCC   5.3( 4141)  UCA  18.2(14239)  UCG   2.8( 2154)
UAU  45.6(35709)  UAC   5.5( 4340)  UAA   1.0(  813)  UAG   0.2(  123)
UGU  15.3(11942)  UGC   2.4( 1872)  UGA   0.2(  188)  UGG   5.2( 4066)
CUU   8.7( 6776)  CUC   1.7( 1354)  CUA   5.4( 4217)  CUG   1.3( 1044)
CCU   9.1( 7148)  CCC   2.5( 1982)  CCA  13.1(10221)  CCG   0.9(  742)
CAU  19.5(15287)  CAC   3.9( 3020)  CAA  25.1(19650)  CAG   3.3( 2598)
CGU   3.3( 2561)  CGC   0.5(  354)  CGA   2.4( 1878)  CGG   0.2(  184)
AUU  34.0(26611)  AUC   5.9( 4636)  AUA  44.7(34976)  AUG  20.9(16326)
ACU  12.8(10050)  ACC   5.5( 4312)  ACA  22.8(17822)  ACG   3.8( 2951)
AAU 105.5(82591)  AAC  18.5(14518)  AAA  90.5(70863)  AAG  19.2(15056)
AGU  21.6(16899)  AGC   3.8( 2994)  AGA  16.9(13213)  AGG   3.9( 3091)
GUU  18.1(14200)  GUC   2.6( 2063)  GUA  18.2(14258)  GUG   4.9( 3806)
GCU  12.5( 9811)  GCC   3.2( 2541)  GCA  12.6( 9871)  GCG   1.1(  890)
GAU  55.5(43424)  GAC   8.6( 6696)  GAA  65.8(51505)  GAG  10.1( 7878)
GGU  16.6(12960)  GGC   1.6( 1269)  GGA  16.7(13043)  GGG   2.9( 2243)
Codon Usage in Artemis
Forward frames
Reverse frames
GC frame plot
• Plots the third position GC content of
each frame of a DNA sequence.
• In coding DNA the GC content of the 3rd
base is often higher.
• A good predictor of coding regions in
malaria and trypanosomes.
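The idea behind the plot can be sketched as follows: for each of the three forward frames, take every third base starting at that frame's third codon position and compute the GC fraction in a sliding window. This is an illustrative sketch only; the function names, window size and step are assumptions, and Artemis's own implementation may differ.

```python
def gc3(seq, frame):
    """GC fraction at third codon positions of seq read in forward frame 0, 1 or 2."""
    thirds = seq.upper()[frame + 2::3]
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


def gc_frame_plot(seq, window=120, step=30):
    """Return [(start, (gc3_frame0, gc3_frame1, gc3_frame2)), ...].

    step should be a multiple of 3 so that frame numbering stays
    consistent from one window to the next.
    """
    points = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        points.append((start, tuple(gc3(chunk, f) for f in range(3))))
    return points
```

A frame whose value stays well above the other two over a long stretch suggests a coding region in that frame. The reverse strand can be handled by applying the same functions to the reverse complement.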
Genefinding programs
• Genefinding software packages use
hidden Markov models.
• Predict coding, intergenic and intron
sequences
• Need to be trained on a specific
organism.
• Never perfect!
Phat
Cawley et al. (2001) Mol. Biochem. Parasitol. 118 p167
http://www.stat.berkeley.edu/users/scawley/Phat/
• Based on a generalised hidden Markov
model (GHMM)
• Free, easily installed and run.
• Good at predicting multi-exon genes,
but in some cases misses genes
altogether, and it also over-predicts.
What is an HMM?
• A statistical model that represents a
gene.
• Similar to a “weight matrix” that can
recognise gaps and treat them in a
systematic way.
• Has different “states” that represent
introns, exons and intergenic regions.
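To make the notion of states concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is a sketch under loud assumptions: only two states are modelled (no introns), and every probability below is invented for illustration rather than trained on any organism. Real genefinders such as Phat use far richer models.

```python
import math

# Hypothetical toy parameters: an AT-rich "intergenic" state and an
# "exon" state with uniform base composition.
STATES = ("intergenic", "exon")
START = {"intergenic": 0.5, "exon": 0.5}
TRANS = {"intergenic": {"intergenic": 0.9, "exon": 0.1},
         "exon": {"intergenic": 0.1, "exon": 0.9}}
EMIT = {"intergenic": {"a": 0.4, "t": 0.4, "g": 0.1, "c": 0.1},
        "exon": {"a": 0.25, "t": 0.25, "g": 0.25, "c": 0.25}}


def viterbi(seq):
    """Return the most probable state path for seq (a/c/g/t only)."""
    seq = seq.lower()
    if not seq:
        return []
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per base after the first
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][base]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these toy numbers, a run of A bases followed by a GC-rich run is labelled intergenic and then exon, because the emission gain outweighs the one-off cost of switching state.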
GlimmerM
Salzberg et al. (1999) Genomics 59 24-31
• Adaptation of the prokaryotic genefinder
Glimmer.
Delcher et al. (1999) NAR 27 4636-4641
• Based on an interpolated Markov model (IMM).
• Only uses short chains of bases (Markov
chains) to generate probabilities.
• Trained identically to Phat.
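As an illustration of scoring with short Markov chains, the sketch below trains a fixed-order chain on example sequences and scores a query by its log-probability. Note the simplification: this is a plain fixed-order Markov chain with pseudocounts, not the interpolated model that Glimmer and GlimmerM use, and all names and defaults are hypothetical.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences.

    Returns a function prob(context, base). Pseudocounts keep unseen
    contexts and bases from getting zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.lower()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_score(seq, prob, order=2):
    """Sum of log-probabilities of each base given its preceding context."""
    seq = seq.lower()
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))
```

In practice one model would be trained on known coding sequence and another on non-coding sequence, and the difference between the two log-scores used as evidence for or against coding.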
GlimmerM
• Under-predicts splicing.
• Hardly ever misses a gene
completely.
• Does over-predict.
• Free with licence.
Homology Data
• Coding regions are more conserved than
non coding regions due to selective
pressure.
• Comparing all possible translations
against all known proteins will give clues
to known genes.
• Blastx
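"All possible translations" means the six reading frames: three on the forward strand and three on the reverse complement. The sketch below produces them using the standard genetic code; the function names are illustrative, and organisms with a non-standard code would need a different table.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame(seq):
    """Return the six translations keyed "+1".."+3" and "-1".."-3"."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Blastx performs this translation internally before searching the protein database, so the sketch is only meant to show what is being compared.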
The Gene Prediction Process
ESTs
ANALYSIS SOFTWARE
DNA SEQUENCE
FASTA
BlastX
Good Gene Models
Phat
GlimmerM
DNA Plots
Annotator