Transcript Slide 1

Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 6
Gene Function Prediction
Quaid Morris
http://morrislab.med.utoronto.ca
Outline
• Functional interaction networks
• Concepts in gene function prediction:
– Guilt-by-association
– Gene recommender systems
• Scoring interactions by guilt-by-association
• GeneMANIA
• GeneMANIA demo
• Explanation of network weighting schemes
• STRING
Module 6
bioinformatics.ca
Using genome-wide data in the lab
[Figure: genome-wide data types facing the lab researcher (“?!?”): protein domain similarity network, protein-protein interaction data, genetic interaction data, microarray expression data]
Two types of function prediction
• “What does my gene do?”
– Goal: determine a gene’s function based on who it interacts
with: “guilt-by-association”
• “Give me more genes like these”
– e.g. find more genes in the Wnt signaling pathway, find more
kinases, find more members of a protein complex
Guilt-by-association principle
[Figure: microarray expression data (genes × conditions; Eisen et al, PNAS 1998) turned into a co-expression network. Genes shown: CDC3, CLB4, CDC16, UNK1, RPT1, RPN3, RPT6, UNK2. Clusters labelled: Cell cycle, Protein degradation.]
A useful reference: Fraser AG, Marcotte EM - A probabilistic
view of gene function - Nat Genet. 2004 Jun;36(6):559-64
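A minimal sketch of how a co-expression network could be derived from expression profiles, assuming Pearson correlation and a fixed cut-off. The expression values, the 0.8 threshold, and the function names are illustrative assumptions, not taken from the slides or from Eisen et al.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def coexpression_network(profiles, threshold=0.8):
    """Return {(gene_a, gene_b): r} for gene pairs with correlation >= threshold."""
    genes = sorted(profiles)
    edges = {}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if r >= threshold:
                edges[(a, b)] = r
    return edges

# Made-up expression values (one list per gene, one entry per condition).
profiles = {
    "CDC3": [1.0, 2.0, 3.0, 4.0],
    "CLB4": [1.1, 2.1, 2.9, 4.2],
    "UNK1": [0.9, 2.2, 3.1, 3.9],
    "RPT1": [4.0, 3.0, 2.0, 1.0],
    "RPN3": [4.1, 2.9, 2.2, 0.9],
    "UNK2": [3.9, 3.1, 1.8, 1.1],
}
network = coexpression_network(profiles)
for (a, b), r in sorted(network.items()):
    print(f"{a} - {b}: r = {r:.2f}")
```

In this toy data the unannotated gene UNK1 ends up linked to CDC3 and CLB4, and UNK2 to RPT1 and RPN3, which is the kind of link that guilt-by-association uses to suggest a function.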
“What does my gene do?”
[Figure: report of a GeneMANIA search, dated 6/12/13]
Input: query list (SLM4, YGR151C) plus network and profile data
Method: gene recommender system, then enrichment analysis
Output: network image of the query genes with related genes (BEM1, CDC42, RGA1, NCS6, RDI1, PXL1, SSK2, IQG1, RSR1, BZZ1, HUR1, CDC48, BEM4, CLA4, PEA2, SNC1, GIC2, SWF1, GIC1, SKM1)
Functions legend: small GTPase mediated signal transduction; query genes
Networks legend: Co-expression, Co-localization, Genetic interactions, Physical interactions, Predicted, Shared protein domains, Other
Recommender Systems
• Memphis, Knoxville, Nashville…
– Chattanooga, Morristown
• Memphis, Alexandria, Cairo…
– Luxor, Giza, Aswan
“Give me more genes like these”
[Figure: report of a GeneMANIA search (www.genemania.org/print). Created on: 12 June 2013 07:18:01. Last database update: 19 July 2012 20:00:00. Application version: 3.1.2]
Input: query list plus network and profile data
Query list: GNAQ, NOSIP, NPR2, LYPLA1, NOS2, POR, NOS1, GNAS, NOS3, NDOR1, GUCY1B3, MTRR, ZDHHC21, GUCY1A2, GUCY1A3, TYW1
Method: gene recommender system
Output: network image of the query genes with related genes (PDE4A, GSTO1, PDE7A, PDE4D, ACTA1, PDE4B, MYL2, PPP1R1B, NPR1, CNN3, CNN2, CNN1, MYLK2, TAGLN, PLN, ATP2A3, ATP2A2, ARGLU1, DGKZ, CALD1, LSP1)
Functions legend: muscle contraction; cyclic nucleotide metabolic process; query genes
Networks legend: Co-expression, Co-localization, Genetic interactions, Pathway, Physical interactions, Shared protein domains
Search results generated by the GeneMANIA algorithm (genemania.org)
Demo of GeneMANIA
GeneMANIA: Selecting networks I
Click the links to select all, none, or a predefined (default) set of networks.
GeneMANIA: Selecting networks II
Click the check boxes to select all (or no) networks of that type.
The fraction indicates the # of networks selected out of the total available (for this organism).
GeneMANIA: Selecting networks III
Click on a network type to view the list of networks of that type in the right panel.
Click on a check box to select (or deselect) a network.
Click on a network name to expand the entry and get more information on the network. The HTML link points to the Pubmed abstract.
Query-independent composite networks
Pre-combine networks, e.g. by simple addition or by pre-determined weights.
[Figure: Genetic (e.g. Tong et al. 2001) + Co-complexed (e.g. Jeong et al 2002) + Co-expression = composite network. Genes shown: CDC27, CDC23, APC11, UNK1, RAD54, XRS2, MRE11, UNK2. Clusters labelled: Cell cycle, DNA repair.]
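A minimal sketch of pre-combining networks, assuming each network is stored as a dictionary of undirected weighted edges. The edge lists, edge strengths, and fixed weights below are made up for illustration; `combine_networks` is a hypothetical helper, not a GeneMANIA function.

```python
def combine_networks(networks, weights=None):
    """Sum edge strengths across networks; each network's weight defaults to 1.0."""
    if weights is None:
        weights = {name: 1.0 for name in networks}
    composite = {}
    for name, edges in networks.items():
        for pair, strength in edges.items():
            key = tuple(sorted(pair))  # treat edges as undirected
            composite[key] = composite.get(key, 0.0) + weights[name] * strength
    return composite

# Made-up edges for three network types.
networks = {
    "genetic": {("CDC27", "CDC23"): 1.0, ("RAD54", "XRS2"): 1.0},
    "co-expression": {("CDC23", "CDC27"): 0.5, ("APC11", "UNK1"): 0.5},
    "co-complexed": {("XRS2", "MRE11"): 1.0, ("CDC27", "APC11"): 1.0},
}

# Simple addition: every network counts equally.
composite = combine_networks(networks)

# Pre-determined weights: chosen once, independent of any query list.
fixed = {"genetic": 2.0, "co-expression": 1.0, "co-complexed": 0.5}
weighted = combine_networks(networks, fixed)

print(composite[("CDC23", "CDC27")])
print(weighted[("CDC23", "CDC27")])
```

Either way the weights are fixed before any query is seen, which is what makes the composite query-independent.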
Composite networks: One size doesn’t fit all
• Gene function could be a/the:
– Biological process,
– Biochemical/molecular function,
– Subcellular/Cellular localization,
– Regulatory targets,
– Temporal expression pattern,
– Phenotypic effect of deletion.
Some networks may be better for some types of gene function than others.
Two rules for network weighting
Relevance
The network should be relevant to predicting the function of interest
• Test: Are the genes in the query list more often connected to one
another than to other genes?
Redundancy
The network should not be redundant with other datasets – particularly a
problem for co-expression
• Test: Do the two networks share many interactions?
• Caveat: Shared interactions also provide more confidence that the
interaction is real.
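The two tests can be sketched as simple edge counts. This is a rough illustration under assumed definitions: relevance is taken as the fraction of edges touching a query gene that connect two query genes, and redundancy as the Jaccard overlap of two edge sets. Neither definition comes from the slides, and the edges are made up.

```python
def relevance(edges, query):
    """Fraction of edges touching a query gene that link two query genes."""
    query = set(query)
    touching = [pair for pair in edges if query & set(pair)]
    if not touching:
        return 0.0
    within = [pair for pair in touching if set(pair) <= query]
    return len(within) / len(touching)

def redundancy(edges_a, edges_b):
    """Jaccard overlap between the (undirected) edge sets of two networks."""
    a = {frozenset(pair) for pair in edges_a}
    b = {frozenset(pair) for pair in edges_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

query = ["CDC27", "CDC23", "APC11"]
net_a = [("CDC27", "CDC23"), ("CDC23", "APC11"), ("APC11", "UNK1")]
net_b = [("CDC23", "CDC27"), ("RAD54", "XRS2")]

print(relevance(net_a, query))   # 2 of the 3 edges touching the query stay inside it
print(redundancy(net_a, net_b))  # 1 shared edge out of 4 distinct edges
```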
Solution: Query-specific weights
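[Figure: each network (Genetic, e.g. Tong et al. 2001; Co-complexed, e.g. Jeong et al 2002; Co-expression) is multiplied by a query-specific weight (w1, w2, w3) before the networks are summed into the composite. Example weights shown: 54%, 33%, 13%. Genes shown: CDC27, CDC23, APC11, UNK1, RAD54, XRS2, MRE11, UNK2. Clusters labelled: Cell cycle, DNA repair.]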
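A minimal sketch of query-specific weighting. The slides do not say how the weights are computed, so the heuristic here (weight each network by the share of its edges that link two query genes, then normalise the weights to sum to 1) is an assumption for illustration only, as are the function names and edges.

```python
def query_specific_weights(networks, query):
    """Heuristic weights: share of each network's edges inside the query, normalised."""
    query = set(query)
    raw = {}
    for name, edges in networks.items():
        within = sum(1 for pair in edges if set(pair) <= query)
        raw[name] = within / len(edges) if edges else 0.0
    total = sum(raw.values())
    if total == 0:
        # No network links any query genes: fall back to equal weights.
        return {name: 1.0 / len(networks) for name in networks}
    return {name: value / total for name, value in raw.items()}

def weighted_composite(networks, weights):
    """Sum edge strengths across networks, scaling each network by its weight."""
    composite = {}
    for name, edges in networks.items():
        for pair, strength in edges.items():
            key = tuple(sorted(pair))
            composite[key] = composite.get(key, 0.0) + weights[name] * strength
    return composite

# Made-up edges; all strengths are 1.0 for simplicity.
networks = {
    "genetic": {("CDC27", "CDC23"): 1.0, ("RAD54", "XRS2"): 1.0},
    "co-complexed": {("CDC27", "APC11"): 1.0, ("CDC23", "APC11"): 1.0},
    "co-expression": {
        ("APC11", "UNK1"): 1.0, ("XRS2", "MRE11"): 1.0,
        ("CDC23", "CDC27"): 1.0, ("RAD54", "UNK2"): 1.0,
    },
}
query = ["CDC27", "CDC23", "APC11"]

weights = query_specific_weights(networks, query)
composite = weighted_composite(networks, weights)
for name, w in weights.items():
    print(f"{name}: {w:.0%}")
```

A different query list (for example the DNA repair genes) would give different weights for the same three networks, which is the point of the query-specific scheme.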
Network weighting schemes I
By default, GeneMANIA chooses between the GO-dependent and query-specific weighting schemes based on the size of your list. We recommend using the default scheme in most cases.
Click a radio button to change the network weighting scheme.
Network weighting schemes II
- GO-based weighting assigns network weights based on how well the networks reproduce patterns of GO co-annotation (“Are genes that interact in the network more likely to have the same annotation?”).
- You can choose any of the three GO hierarchies.
- It ignores the query list when assigning network weights.
Network weighting schemes III
You can force query-list-based weighting by selecting this option.
Select these options and either all networks or all data types get the same weight.
Scoring nodes by guilt-by-association
Query list (“positive examples”): MCA1, CDC48, CPR3, TDH2
Each node gets a score, from high to low.
Two main algorithms:
• Direct neighborhood
• Label propagation
[Figure: the same network around CDC48, MCA1, CPR3 and TDH2, scored once with each algorithm]
Node scoring algorithm details
• Direct neighbour node score depends on:
– Strength of links to query genes
– # of query gene neighbours
• GeneMANIA label propagation node score depends on:
– Strength of links and # of query gene neighbours
– # of shared neighbours with positive examples
– Whether or not the node is in a cluster of nodes with query gene(s)
(i.e. # of shared neighbours with query genes)
– Take home: indirect links to query genes can affect scores, so label propagation often brings up clusters of nodes
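The difference between the two scoring approaches can be sketched on a toy graph. The label propagation below is a simplified iterative variant (each node repeatedly mixes its starting label with the weighted average of its neighbours' scores); it is not GeneMANIA's exact algorithm, and `alpha`, the iteration count, the edges, and the GENE_A/B/C names are assumptions for illustration.

```python
def build_graph(edges):
    """Build an undirected weighted adjacency dict from (a, b, weight) triples."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

def direct_neighbour_scores(graph, query):
    """Score each non-query node by the summed strength of its links to query genes."""
    query = set(query)
    return {
        node: sum(w for nb, w in nbrs.items() if nb in query)
        for node, nbrs in graph.items()
        if node not in query
    }

def label_propagation_scores(graph, query, alpha=0.5, iterations=50):
    """Simplified label propagation: scores spread along links, including indirect ones."""
    query = set(query)
    initial = {n: (1.0 if n in query else 0.0) for n in graph}
    degree = {n: sum(nbrs.values()) for n, nbrs in graph.items()}
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for n, nbrs in graph.items():
            if degree[n] == 0:
                updated[n] = initial[n]
                continue
            spread = sum(w * scores[nb] for nb, w in nbrs.items()) / degree[n]
            updated[n] = (1 - alpha) * initial[n] + alpha * spread
        scores = updated
    return scores

# Hypothetical chain: two query genes -> GENE_A -> GENE_B -> GENE_C.
graph = build_graph([
    ("MCA1", "GENE_A", 1.0),
    ("CDC48", "GENE_A", 1.0),
    ("GENE_A", "GENE_B", 1.0),
    ("GENE_B", "GENE_C", 1.0),
])
query = ["MCA1", "CDC48"]

direct = direct_neighbour_scores(graph, query)
propagated = label_propagation_scores(graph, query)
for gene in ("GENE_A", "GENE_B", "GENE_C"):
    print(gene, direct[gene], round(propagated[gene], 4))
```

GENE_B and GENE_C have no direct link to a query gene, so the direct neighbour score is zero for both, while label propagation still gives them small, distance-ordered scores.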
Label propagation example
[Figure: the network before and after label propagation]
Three parts of GeneMANIA:
• A large, automatically updated collection of interaction networks.
• A query algorithm to find genes and networks that are functionally associated with your query gene list.
• An interactive, client-side network browser with extensive link-outs.
GeneMANIA data sources
- Gene ID mappings from Ensembl and Ensembl Plant
- Network/gene descriptors from Entrez-Gene and Pubmed
- Gene annotations from Gene Ontology, GOA, and model organism databases
- Sources shown also include iRefIndex and interologs, plus some organism-specific datasets (click around to see what’s available)
Gene identifiers
• All unique identifiers within the selected organism, e.g.:
– Entrez-Gene ID
– Gene symbol
– Ensembl ID
– Uniprot (primary)
– also, some synonyms & organism-specific names
• We use the Ensembl database for gene mappings (but we mirror it once every 3 months, so sometimes we are out of date)
Current status
• Seven organisms:
– Human, mouse, yeast, worm, fly, A. thaliana, rat
• ~1,250 networks (about 50% co-expression, 35% physical interaction)
• Web network browser
Cytoscape plugin
http://www.genemania.org/plugin/
QueryRunner
- Runs GO function prediction from the command line.
- Does cross-validation to assess the predictive performance of a set of networks.
- Can assess the “added predictive value of new data” (Michaut et al, in press).
[Figure: area under the curve for genetic interaction networks]
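The cross-validation idea can be sketched as follows: hide some known positives, score all genes from the remaining positives, and measure with the area under the ROC curve (AUC) how well the hidden positives rank above the negatives. This is not QueryRunner's implementation or command-line interface; the direct-neighbour scoring, the fold scheme, and the toy networks are assumptions for illustration.

```python
def build_graph(edges):
    """Build an undirected, unweighted adjacency dict from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, {})[b] = 1.0
        graph.setdefault(b, {})[a] = 1.0
    return graph

def direct_scores(graph, query):
    """Score every node by the summed strength of its links to query genes."""
    query = set(query)
    return {n: sum(w for nb, w in nbrs.items() if nb in query)
            for n, nbrs in graph.items()}

def auc(scores, positives, negatives):
    """Probability that a positive outscores a negative (ties count as 0.5)."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if scores[p] > scores[n]:
                wins += 1.0
            elif scores[p] == scores[n]:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def cross_validated_auc(graph, positives, folds=3):
    """Hold out each fold of positives in turn and average the resulting AUCs."""
    positives = sorted(positives)
    positive_set = set(positives)
    negatives = [n for n in graph if n not in positive_set]
    aucs = []
    for k in range(folds):
        held_out = positives[k::folds]
        training = [g for g in positives if g not in held_out]
        scores = direct_scores(graph, training)
        aucs.append(auc(scores, held_out, negatives))
    return sum(aucs) / len(aucs)

positives = ["P1", "P2", "P3"]  # hypothetical genes sharing one GO annotation

# A network in which the positives are linked to one another.
informative = build_graph([("P1", "P2"), ("P2", "P3"), ("P1", "P3"),
                           ("N1", "N2"), ("N2", "N3"), ("N1", "P1")])
# A network in which the positives are linked only to negatives.
uninformative = build_graph([("P1", "N1"), ("P2", "N2"), ("P3", "N3"),
                             ("N1", "N2")])

auc_informative = cross_validated_auc(informative, positives)
auc_uninformative = cross_validated_auc(uninformative, positives)
print(auc_informative, auc_uninformative)
```

Running the same procedure with and without a new network added to the set is one way to read “added predictive value of new data”.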
STRING: http://string-db.org/
STRING results
GeneMANIA vs STRING
• STRING (2003-present)
– Large organism coverage
– Protein focused
– Uses eight pre-computed networks
– Heavy use of phylogeny to infer functional interactions; also contains text-mining-derived interactions
– Uses “direct interaction” to score nodes
– Link weights are “Probability of functional interaction”
• GeneMANIA webserver (2010-present)
– Covers 6 major model organisms (but can add more with plugin)
– Gene focused
– Thousands of networks; weights are not pre-computed; can upload your own network
– Relies heavily on functional genomic data, so has genetic interactions, phenotypic info, chemical interactions
– Allows enrichment analysis
– Uses “label propagation” to score nodes
GeneMANIA future directions
• Other organisms – next: E. coli, zebrafish
• Non-coding genes (miRNAs!)
• Regulatory networks (ChIP, RNA-protein, miRNA-mRNAs)
• More phenotypic information (OMIM)
• Orthology mapping for inferring interologs
GeneMANIA URLs
Main site (stable but still fun):
http://www.genemania.org
Beta site (new and edgy but possibly unreliable):
http://beta.genemania.org