Transcript Lecture

Automated Function Prediction
Outline
• Goal
• How function is defined
• Why Gene Ontology
• Methods for protein function prediction
• End points
GOAL
• A) You find a new protein
• B) You sequence the whole genome of your
favorite organism
• The gene(s) obtained need to be annotated
• A can be solved manually; B needs automatic
tools
How function is defined
• Functional description as text
• Linking gene to Key Words (Uniprot)
• Linking gene to Gene Ontology
• Linking gene to Signalling Pathways or
Biochemical Pathways (KEGG)
Why Gene Ontology (GO)
• GO is currently a popular standard in
gene annotation
• GO defines categories that represent gene
function
• Groups together the genes in the same process
• Easy summary for genes with similar function
Why Gene Ontology (GO)
• 3 sub-parts: Biological Process, Molecular
Function, Cellular Component
– Molecular Function => chemical activity
– Biological Process => biology, cellular process
– Cellular Component => location of the gene product
• Hierarchical structure
– Categories with very precise function
– Categories with less precise function
– Categories with very broad function
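The hierarchy means that a gene annotated with a precise category also belongs to every broader category above it. A minimal sketch of that idea follows. The four-term chain is a simplified, hand-written fragment used only for illustration (real GO is a graph in which a term can have several parents), and the function names are hypothetical.

```python
# Toy fragment of the hierarchy: child category -> list of parent categories.
# Simplified for illustration; real GO terms can have several parents.
PARENTS = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every broader category reachable from `term` via parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents=PARENTS):
    """Extend a set of annotations with all of their broader categories."""
    result = set(annotations)
    for term in annotations:
        result |= ancestors(term, parents)
    return result


if __name__ == "__main__":
    # A precise annotation implies the less precise and the very broad ones.
    print(sorted(propagate({"protein kinase activity"})))
```

Propagating annotations upward like this is what lets precise and broad categories be summarised together.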
How GO helps
• End user: Summary categories for genes with
various functions
• Computer programs: Classifier algorithms can
be taught to predict the categories for genes
Understanding GO
• AmiGO server (http://amigo.geneontology.org/cgibin/amigo/go.cgi)
Function Prediction: What can we use to predict function
• Sequence homology (BLAST result list)
• Phylogenetic tree of sequences
• Protein Domains (PFAM domains)
• Short sequence patterns – motifs
• Sequence features (sec. struct., low compl.
regions)
Sequence Homology Methods
• Do a BLAST search with a query sequence
• Collect GO classes for the genes in the BLAST
result list
• Give a weight to each BLAST hit
– often -log(E-value)
• Combine the scores from the genes that
belong to the same GO class
• Report the top-scoring / significant GO classes
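The steps above can be sketched in a few lines. This is only an illustration of the general recipe, not the scoring used by any particular program: the input format (a list of hit/E-value pairs plus a dictionary of annotations), the function name, the E-value floor and the placeholder class labels are all assumptions.

```python
import math
from collections import defaultdict


def score_go_classes(hits, hit_annotations, min_evalue=1e-180):
    """Sum -log10(E-value) weights per GO class.

    hits: list of (hit_id, evalue) pairs from a BLAST result list.
    hit_annotations: dict mapping hit_id -> set of GO class labels.
    min_evalue: assumed floor so that an E-value of 0 gets a finite weight.
    """
    scores = defaultdict(float)
    for hit_id, evalue in hits:
        weight = -math.log10(max(evalue, min_evalue))
        if weight <= 0:
            continue  # E-value >= 1: the hit carries no evidence
        for go_class in hit_annotations.get(hit_id, ()):
            scores[go_class] += weight
    # Best-scoring GO classes first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical hits and placeholder GO class labels.
    hits = [("hitA", 1e-50), ("hitB", 1e-20), ("hitC", 1e-5)]
    annotations = {
        "hitA": {"GO_class_1", "GO_class_2"},
        "hitB": {"GO_class_1"},
        "hitC": {"GO_class_2", "GO_class_3"},
    }
    for go_class, score in score_go_classes(hits, annotations):
        print(go_class, round(score, 1))
```

Summing the weights rewards GO classes supported by several strong hits; deciding which scores count as "significant" needs a separate cut-off or statistical test.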
Sequence Homology Methods
• Simple methods
• Programs
– BLAST2GO (http://www.blast2go.com/b2ghome)
– GOTCHA (http://www.compbio.dundee.ac.uk/gotcha/gotcha.php)
– ARGOT (http://www.medcomp.medicina.unipd.it/Argot2/form.php)
– PFP (http://kiharalab.org/web/pfp.php)
Phylogenetic tree methods
• Create the pair-wise distances for the set of
genes
• Do a hierarchical clustering of genes
• Map the known GO functions to the cluster tree
• Look for unknown genes in a cluster with many
genes from the same GO class
• Report the top-scoring / significant GO classes
• More => http://genome.cshlp.org/content/8/3/163.full
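A rough sketch of the cluster-and-map idea is given below. It uses plain average-linkage clustering with a distance cut-off as a stand-in for a real phylogenetic tree, so it shows the principle rather than any published method. The distance values, gene names, threshold and GO class labels are made up for illustration.

```python
from collections import Counter


def average_linkage_clusters(names, dist, threshold):
    """Merge the closest clusters until the closest pair is farther than threshold.

    dist: dict mapping frozenset({a, b}) -> pair-wise distance between genes.
    """
    clusters = [[name] for name in names]

    def cluster_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def suggest_go(clusters, known_go, query):
    """Count the GO classes of annotated genes sharing a cluster with `query`."""
    for cluster in clusters:
        if query in cluster:
            counts = Counter(
                go for gene in cluster if gene != query
                for go in known_go.get(gene, ())
            )
            return counts.most_common()
    return []


if __name__ == "__main__":
    # Hypothetical data: two tight groups that are far from each other.
    groups = {"query": "a", "g1": "a", "g2": "a", "g3": "b", "g4": "b"}
    names = list(groups)
    dist = {
        frozenset((x, y)): 0.1 if groups[x] == groups[y] else 0.9
        for i, x in enumerate(names) for y in names[i + 1:]
    }
    known_go = {"g1": {"class_A"}, "g2": {"class_A", "class_B"},
                "g3": {"class_C"}, "g4": {"class_C"}}
    clusters = average_linkage_clusters(names, dist, threshold=0.5)
    print(suggest_go(clusters, known_go, "query"))
```

The all-pairs search in each merge step is slow for large gene sets, which hints at why tree-based methods need much heavier calculations than a single BLAST search.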
Phylogenetic tree methods
• These should outperform sequence homology
methods (CAFA 2011?)
• Require a set of related genes
• Often much heavier calculations
• Programs:
– SIFTER
(http://genome.cshlp.org/content/early/2011/07/22/gr.104687.109)
Prediction with Protein domains
• Identify which protein domains the query
protein contains (PFAM)
• Map the functions linked to those domains to
your query sequence
– PFAM2GO
• Programs: InterProScan + PFAM2GO
• Drawbacks:
– This mapping is the same in plant, mammal, bacteria
– Many domains to specific function
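The domain-based approach boils down to a table lookup. The sketch below assumes the domain hits have already been produced by a domain search tool; the domain identifiers, GO class labels and mapping table are placeholders and do not come from the real PFAM2GO file.

```python
def predict_from_domains(domains, domain_to_go):
    """Map the domains found in a query protein to GO classes.

    domains: list of domain identifiers found in the query protein.
    domain_to_go: dict mapping domain identifier -> set of GO class labels.
    Returns a dict: GO class -> list of domains that support it.
    """
    predicted = {}
    for domain in domains:
        for go_class in sorted(domain_to_go.get(domain, ())):
            predicted.setdefault(go_class, []).append(domain)
    return predicted


if __name__ == "__main__":
    # Placeholder mapping table (hypothetical identifiers and labels).
    domain_to_go = {
        "DOMAIN_A": {"GO_class_1", "GO_class_2"},
        "DOMAIN_B": {"GO_class_2"},
    }
    # DOMAIN_X has no entry in the table, so it contributes nothing.
    print(predict_from_domains(["DOMAIN_A", "DOMAIN_B", "DOMAIN_X"], domain_to_go))
```

Because the lookup ignores which organism the query comes from, the same domain yields the same prediction in a plant, a mammal or a bacterium, which is the first drawback listed above.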
Prediction with Protein domains
• Benefits:
– Can create annotation from separate domains
– Similar sequences do not have to be in the database
• Programs (?): InterProScan
(http://www.ebi.ac.uk/InterProScan/)
• Drawbacks:
– The mapping is the same in plant, mammal, bacteria
– Many domains to specific function
Prediction with patterns and motifs
• Same principle as before, but we look for
sequence patterns and motifs
• Map the functions that are linked to the patterns
to your query sequence
• Programs:
– InterProScan
– IBM BioDictionary (http://cbcsrv.watson.ibm.com/Tpa.html)
• Drawbacks and benefits approximately the same as before
Prediction with sequence features
• Again the same principle as before
• We look at sequence features (see picture)
• These are given as input to a classifier
algorithm (Support Vector Machine)
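To make "sequence features" concrete, here is a small sketch that turns a protein sequence into a numeric feature vector of the kind a classifier could take as input. The chosen features, the residue groupings and the crude low-complexity proxy are assumptions for illustration, not the feature set of any existing tool. Secondary-structure features would need a separate predictor and are left out.

```python
HYDROPHOBIC = set("AVILMFWC")   # assumed grouping, for illustration only
CHARGED = set("DEKR")


def longest_single_residue_run(seq):
    """Length of the longest run of one repeated residue (crude low-complexity proxy)."""
    longest = current = 0
    previous = None
    for residue in seq:
        current = current + 1 if residue == previous else 1
        longest = max(longest, current)
        previous = residue
    return longest


def sequence_features(seq):
    """Return a dict of simple numeric features for a protein sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    return {
        "length": n,
        "hydrophobic_fraction": sum(r in HYDROPHOBIC for r in seq) / n,
        "charged_fraction": sum(r in CHARGED for r in seq) / n,
        "longest_single_residue_run": longest_single_residue_run(seq),
    }


if __name__ == "__main__":
    # Hypothetical toy sequence.
    print(sequence_features("AAAAKD"))
```

In a full pipeline, one such vector per protein, together with the known GO classes of annotated proteins, would form the training data for the classifier.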
Prediction with sequence features
• Benefits:
– No actual seq. similarity needed
– Info collected from vague similarities
– Use of classifier => feature weighting
• Program: FFPred (http://bioinf.cs.ucl.ac.uk/ffpred/)
• Drawbacks:
– Calculations probably quite heavy
– No use of nearby sequence similarities (domains etc.)
Our contribution: PANNZER
• Use BLAST result list
• Add Taxonomic information
• Score GO classes using a score that takes the
frequency of each GO class in the sequence database into account
• Method is used to predict:
– GO Classes
– Description line
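Why the background frequency matters can be shown with a standard hypergeometric tail probability. This is a generic enrichment test used here purely as an illustration. It is not PANNZER's actual scoring function, and all the counts in the example are invented.

```python
import math


def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of seeing a GO class in at least k of n hits
    when K of the N database sequences carry that class."""
    upper = min(n, K)
    total = math.comb(N, n)
    favourable = sum(
        math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Invented numbers: 10 BLAST hits from a database of 1000 sequences.
    # Both GO classes appear in 8 of the 10 hits.
    p_common = hypergeom_tail(k=8, n=10, K=500, N=1000)  # class on half of the DB
    p_rare = hypergeom_tail(k=8, n=10, K=10, N=1000)     # class on 1% of the DB
    print("common class:", p_common)
    print("rare class:  ", p_rare)
```

A class that is common in the database shows up in many hit lists by chance alone, so the same count is far more surprising for a rare class, and a frequency-aware score can rank the rare class higher.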
Our contribution: PANNZER
• Benefits:
– Taking the species taxonomy into account
– Improved use of statistics
• Not public yet
Our contribution: No Name Yet
• Take PFAM domain predictions, BLAST
similarities and Taxonomic information
• Feed this to feature selection and to classifier
algorithm
• …Wait…
• Method is used to predict GO-classes
• Not public + testing is ongoing
Conclusion
• These methods are increasingly needed
• Some methods exist
• Unfortunately, there is no clear evaluation of them (my
opinion)
• Remember: These are predictions. Nothing is certain
until they are tested in the wet lab…