Pathway and Network Analysis Workshop December 8, 2011, Montreal Quaid Morris Scooter Morris Piet Molenaar Gary Bader Tero Aittokallio Boris Steipe Module #: Title of Module Canadian Bioinformatics Workshops.

Pathway and Network Analysis
Workshop
December 8, 2011, Montreal
Quaid Morris
Scooter Morris
Piet Molenaar
Gary Bader
Tero Aittokallio
Boris Steipe
Module #: Title of Module
Canadian Bioinformatics Workshops
Workshop outline
• Introduction to gene lists, gene attributes, and interaction
networks
• Pathway enrichment analysis
– Theory: Fisher’s exact test, background and multiple test
correction.
– Practical: DAVID
• Gene recommender systems:
– function prediction and gene-centered network browsing using
GeneMANIA
User account
• Username: csuser51
• B1o1nf51
Introduction to Gene Lists,
Attributes and Networks
Outline
• Gene lists and gene attributes
– Where they come from
– What do they mean
• Networks
– What are they
– Analysis
– Use in Biology
– Biological questions/applications
Interpreting Gene Lists
• My cool new screen worked and produced 1000 hits! …Now what?
• Genome-Scale Analysis (Omics)
– Genomics, Proteomics
• Tell me what’s interesting about these genes
– Are they enriched in known pathways, complexes, functions
Analysis
tools
Ranking or
clustering
Prior knowledge about
cellular processes
Eureka! New
heart disease
gene!
Where Do Gene Lists Come From?
• Molecular profiling e.g. mRNA, protein
– Identification → Gene list
– Quantification → Gene list + values
– Ranking, Clustering (biostatistics)
• Interactions: Protein interactions, microRNA
targets, transcription factor binding sites (ChIP)
• Genetic screen e.g. of knock out library
• Association studies (Genome-wide)
– Single nucleotide polymorphisms (SNPs)
– Copy number variants (CNVs)
What Do Gene Lists Mean?
• Biological system: complex, pathway,
physical interactors
• Similar gene function e.g. protein kinase
• Similar cell or tissue location
• Chromosomal location (linkage, CNVs)
Data
Gene Attributes
Available in databases:
• Function annotation
– Biological process, molecular function, cell location
• Chromosome position
• Disease association
• DNA properties
– TF binding sites, gene structure (intron/exon), SNPs
• Transcript properties
– Splicing, 3’ UTR, microRNA binding sites
• Protein properties
– Domains, secondary and tertiary structure, PTM sites
• Interactions with other genes
What is the Gene Ontology (GO)?
• Set of biological phrases (terms) which are
applied to genes:
– protein kinase
– apoptosis
– membrane
• Dictionary: term definitions
• Ontology: A formal system for describing
knowledge
Jane Lomax @ EBI
www.geneontology.org
GO Structure
• Terms are related
within a hierarchy
– is-a
– part-of
• Describes multiple
levels of detail of
gene function
• Terms can have more
than one parent or
child
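The hierarchy described above can be sketched as a small graph walk. This is a minimal illustration only: the term names and relations in `PARENTS` are a hypothetical toy fragment, not real GO content, and `ancestors` is a helper name invented here. It shows how one term can reach several parents through is-a and part-of links.

```python
# Toy ontology fragment (hypothetical, for illustration only; not real GO data).
# Each term maps to a list of (relation, parent term) pairs.
PARENTS = {
    "nuclear pore": [("part-of", "nuclear envelope"), ("is-a", "protein complex")],
    "nuclear envelope": [("part-of", "nucleus")],
    "nucleus": [("is-a", "cellular component")],
    "protein complex": [("is-a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following is-a / part-of links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("nuclear pore")))
```

A gene annotated to a specific term is implicitly annotated to all of that term's ancestors, which is why enrichment tools usually count annotations at every level.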
What GO Covers?
• GO terms divided into three aspects:
– cellular component
– molecular function
– biological process (important pathway source)
glucose-6-phosphate isomerase activity
Cell division
Terms
• Where do GO terms come from?
– GO terms are added by editors at EBI and gene
annotation database groups
– Terms added by request
– Experts help with major development
– 32029 terms, >99% with definitions (as of July 15, 2010):
• 19639 biological_process
• 2859 cellular_component
• 9531 molecular_function
Annotations
• Genes are linked, or associated, with GO
terms by trained curators at genome
databases
– Known as ‘gene associations’ or GO annotations
– Multiple annotations per gene
• Some GO annotations created automatically
(without human review)
Annotation Sources
• Manual annotation
– Curated by scientists
• High quality
• Small number (time-consuming to create)
– Reviewed computational analysis
• Electronic annotation
– Annotation derived without human validation
• Computational predictions (accuracy varies)
• Lower ‘quality’ than manual codes
• Key point: be aware of annotation origin
For your information
Evidence Types
Experimental Evidence Codes
• EXP: Inferred from Experiment
• IDA: Inferred from Direct Assay
• IPI: Inferred from Physical Interaction
• IMP: Inferred from Mutant Phenotype
• IGI: Inferred from Genetic Interaction
• IEP: Inferred from Expression Pattern
Computational Analysis Evidence Codes
• ISS: Inferred from Sequence or Structural Similarity
• ISO: Inferred from Sequence Orthology
• ISA: Inferred from Sequence Alignment
• ISM: Inferred from Sequence Model
• IGC: Inferred from Genomic Context
• RCA: Inferred from Reviewed Computational Analysis
Author Statement Evidence Codes
• TAS: Traceable Author Statement
• NAS: Non-traceable Author Statement
Curator Statement Evidence Codes
• IC: Inferred by Curator
• ND: No biological Data available
• IEA: Inferred from Electronic Annotation
See http://www.geneontology.org
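One practical way to stay aware of annotation origin is to filter annotations by evidence code before analysis. The sketch below assumes annotations are available as simple (gene, GO term, evidence code) tuples; the gene names and the helper functions are hypothetical, and real annotation files carry more columns.

```python
# Experimental evidence codes, as listed above.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical annotations: (gene, GO term, evidence code).
ANNOTATIONS = [
    ("GENE_A", "GO:0000724", "IDA"),
    ("GENE_B", "GO:0000724", "IEA"),
    ("GENE_C", "GO:0000724", "ISS"),
]


def filter_by_evidence(annotations, allowed_codes):
    """Keep only annotations whose evidence code is in allowed_codes."""
    return [a for a in annotations if a[2] in allowed_codes]


def drop_electronic(annotations):
    """Remove annotations inferred electronically without human review (IEA)."""
    return [a for a in annotations if a[2] != "IEA"]


print(filter_by_evidence(ANNOTATIONS, EXPERIMENTAL_CODES))
print(drop_electronic(ANNOTATIONS))
```

Dropping IEA annotations raises average quality but can remove most annotations for less-studied genomes, so the choice depends on the organism and the question.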
Wide & Variable Species Coverage
Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.
Accessing GO: QuickGO
http://www.ebi.ac.uk/ego/
See also AmiGO: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
Sources of Gene Attributes
• Ensembl BioMart (eukaryotes)
– http://www.ensembl.org
• Entrez Gene (general)
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
• Model organism databases
– E.g. SGD: http://www.yeastgenome.org/
• Many others: discuss during lab
Biomart 0.7
Use this one 
Ensembl BioMart
• Convenient access to gene list annotation
Select genome
Select filters
Select attributes
to download
BioMART demo
http://www.biomart.org
What Have We Learned?
• Many gene attributes in databases
– Gene Ontology (GO) provides gene function annotation
• GO is a classification system and dictionary for biological
concepts
• Annotations are contributed by many groups
• More than one annotation term allowed per gene
• Some genomes are annotated more than others
• Annotation comes from manual and electronic sources
• GO can be simplified for certain uses (GO Slim)
• Many gene attributes available from Ensembl and
Entrez Gene
Gene Lists Overview
• Interpreting gene lists
• Gene function attributes
– Gene Ontology
• Ontology Structure
• Annotation
– BioMart + other sources
• Gene identifiers and mapping
Gene and Protein Identifiers
• Identifiers (IDs) are ideally unique, stable
names or numbers that help track database
records
– E.g. Social Insurance Number, Entrez Gene ID 41232
• Gene and protein information stored in many databases
– → Genes have many IDs
• Records for: Gene, DNA, RNA, Protein
– Important to recognize the correct record type
– E.g. Entrez Gene records don’t store sequence.
They link to DNA regions, RNA transcripts and
proteins e.g. in RefSeq, which stores sequence.
For your information
Common Identifiers
Gene
• Ensembl: ENSG00000139618
• Entrez Gene: 675
• Unigene: Hs.34012
RNA transcript
• GenBank: BC026160.1
• RefSeq: NM_000059
• Ensembl: ENST00000380152
Protein
• Ensembl: ENSP00000369497
• RefSeq: NP_000050.2
• UniProt: BRCA2_HUMAN or A1YBP1_HUMAN
• IPI: IPI00412408.1
• EMBL: AF309413
• PDB: 1MIU
Species-specific
• HUGO HGNC: BRCA2
• MGI: MGI:109337
• RGD: 2219
• ZFIN: ZDB-GENE-060510-3
• FlyBase: CG9097
• WormBase: WBGene00002299 or ZK1067.1
• SGD: S000002187 or YDL029W
Annotations
• InterPro: IPR015252
• OMIM: 600185
• Pfam: PF09104
• Gene Ontology: GO:0000724
• SNPs: rs28897757
Experimental Platform
• Affymetrix: 208368_3p_s_at
• Agilent: A_23_P99452
• CodeLink: GE60169
• Illumina: GI_4502450-S
Red = Recommended
Identifier Mapping
• So many IDs!
– Mapping (conversion) is a headache
• Four main uses
– Searching for a favorite gene name
– Link to related resources
– Identifier translation
• E.g. Genes to proteins, Entrez Gene to Affy
– Unification during dataset merging
• Equivalent records
ID Mapping Services
• Synergizer
– http://llama.med.harvard.edu/synergizer/translate/
• Ensembl BioMart
– http://www.ensembl.org
• PICR (proteins only)
– http://www.ebi.ac.uk/Tools/picr/
Synergizer demo
(http://llama.med.harvard.edu/synergizer/translate/)
also see BioMART
ID Mapping Challenges
• Avoid errors: map IDs correctly
• Gene name ambiguity – not a good ID
– e.g. FLJ92943, LFS1, TRP53, p53
– Better to use the standard gene symbol: TP53
• Excel error-introduction
– OCT4 is changed to October-4
• Problems reaching 100% coverage
– E.g. due to version issues
– Use multiple sources to increase coverage
Zeeberg BR et al. Mistaken identifiers: gene name
errors can be introduced inadvertently when
using Excel in bioinformatics BMC Bioinformatics.
2004 Jun 23;5:80
Recommendations
• For proteins and genes
– (doesn’t consider splice forms)
• Map everything to Entrez Gene IDs using a
spreadsheet
• If 100% coverage desired, manually curate
missing mappings
• Be careful of Excel auto conversions –
especially when pasting large gene lists!
– Format cells as ‘text’
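The recommendations above can also be followed outside a spreadsheet. The sketch below is a minimal illustration, assuming a small hand-made lookup: the two tables are hypothetical stand-ins for what a mapping service would return, and `map_to_entrez` is a name invented here. IDs are kept as text so nothing is auto-converted, and unmapped names are reported so coverage is visible.

```python
# Hypothetical lookup tables standing in for a mapping service's output.
# Entrez Gene IDs are stored as strings so they are never reinterpreted as numbers or dates.
ALIAS_TO_SYMBOL = {"p53": "TP53", "TRP53": "TP53", "LFS1": "TP53", "FLJ92943": "TP53"}
SYMBOL_TO_ENTREZ = {"TP53": "7157", "BRCA2": "675"}


def map_to_entrez(names):
    """Map gene names to Entrez Gene IDs; return (mapped dict, list of unmapped names)."""
    mapped = {}
    missing = []
    for name in names:
        symbol = ALIAS_TO_SYMBOL.get(name, name)  # resolve ambiguous aliases first
        entrez_id = SYMBOL_TO_ENTREZ.get(symbol)
        if entrez_id is None:
            missing.append(name)  # candidates for manual curation
        else:
            mapped[name] = entrez_id
    return mapped, missing


mapped, missing = map_to_entrez(["p53", "BRCA2", "OCT4"])
coverage = len(mapped) / 3
print(mapped, missing, f"coverage = {coverage:.0%}")
```

The `missing` list is the part to curate by hand, or to retry against a second mapping source, when 100% coverage is required.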
What Have We Learned?
• Genes and their products and attributes have
many identifiers (IDs)
• Genomics work often requires converting (mapping) IDs
from one type to another
• ID mapping services are available
• Use standard, commonly used IDs to reduce
ID mapping challenges
Networks
• Represent relationships
– Physical, regulatory, genetic, functional
interactions
• Useful for discovering relationships in large
data sets
– Better than tables in Excel
• Visualize multiple data types together
– See interesting patterns
Mapping Biology to a Network
• A simple mapping
– one compound/node, one interaction/edge
• A more realistic mapping
– Cell localization, cell cycle, cell type, taxonomy
– Only represent physiologically relevant interaction
networks
• Edges can represent other relationships
• Critical: understand what nodes and edges mean
Protein Sequence Similarity Network
http://apropos.icmb.utexas.edu/lgl/
Six Degrees of Separation
• Many people in N America are
connected by at most six links
• Which path should we take?
• Shortest path by breadth first search
– If two nodes are connected, will find the shortest
path between them
• Are two proteins connected? If so, how?
• Biologically relevant?
http://www.time.com/time/techtime/200406/community.html
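Breadth-first search, mentioned above, can be sketched in a few lines. The graph here is a hypothetical toy network with placeholder node names; `shortest_path` is an illustrative helper and returns one shortest path (by number of links) or `None` when the two nodes are not connected.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over an unweighted graph given as {node: [neighbours]}."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                if neighbour == goal:
                    # Walk back from the goal to the start, then reverse.
                    path = [goal]
                    while previous[path[-1]] is not None:
                        path.append(previous[path[-1]])
                    return path[::-1]
                queue.append(neighbour)
    return None  # not connected


# Hypothetical toy network (undirected edges listed in both directions).
TOY_GRAPH = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "E"],
    "D": ["A", "E"],
    "E": ["D", "C"],
    "F": [],
}

print(shortest_path(TOY_GRAPH, "A", "C"))
print(shortest_path(TOY_GRAPH, "A", "F"))
```

Whether the shortest path is biologically relevant is a separate question: in a dense interaction network, short paths exist between most protein pairs.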
Biological Questions
• Step 1: What do you want to accomplish with
your list or network (hopefully part of experiment design!)
– Summarize biological processes or other aspects of
gene function
– Find a controller for a process (TF, miRNA)
– Find new pathways or new pathway members
– Discover new gene function
– Correlate with a disease or phenotype (candidate
gene prioritization)
– Perform differential analysis – what’s different
between samples?
Other
Questions?
Applications of Network Biology
• Gene Function Prediction – shows connections to sets of genes/proteins involved in same biological process
• Detection of protein complexes/other modular structures – discover modularity & higher order organization (motifs, feedback loops)
• Network evolution – biological process(es) conservation across species
• Prediction of new interactions and functional associations – statistically significant domain-domain correlations in protein interaction network to predict protein-protein or genetic interaction
jActiveModules, UCSD
PathBlast, UCSD
MCODE, University of Toronto
DomainGraph, Max Planck Institute
humangenetics-amc.nl
Applications of Network Informatics in Disease
• Identification of disease subnetworks – identification of disease network subnetworks that are transcriptionally active in disease.
• Subnetwork-based diagnosis – source of biomarkers for disease classification, identify interconnected genes whose aggregate expression levels are predictive of disease state
• Subnetwork-based gene association – map common pathway mechanisms affected by collection of genotypes
Agilent Literature Search
PinnacleZ, UCSD
Mondrian, MSKCC
humangenetics-amc.nl
June 2009
http://cytoscape.org
Network
visualization
and analysis
Pathway comparison
Literature mining
Gene Ontology analysis
Active modules
Complex detection
Network motif search
UCSD, ISB, Agilent,
MSKCC, Pasteur, UCSF,
Unilever, UToronto, U
Texas
Network Analysis using Cytoscape
Find biological processes
underlying a phenotype
Databases
Literature
Network
Analysis
Network
Information
Expert knowledge
Experimental Data
Manipulate Networks
Automatic Layout
Filter/Query
Interaction Database Search
Active Community
• Help
http://www.cytoscape.org
– 8 tutorials, >10 case studies
– Mailing lists for discussion
– Documentation, data sets
Cline MS et al. Integration of
biological networks and gene
expression data using Cytoscape
Nat Protoc. 2007;2(10):2366-82
• Annual Conferences
• 10,000s of users, 2500 downloads/month
• >80 Plugins Extend Functionality
– Build your own, requires programming
Where to start? Cytoscape tutorials
http://opentutorials.cgl.ucsf.edu/index.php/Portal:Cytoscape
Pathway enrichment analysis
Enrichment Analysis Intro Outline
• What is Gene Set Enrichment Analysis?
• Theory: Fisher’s exact test and multiple test
correction
• DAVID enrichment analysis tool
What is Gene Set Enrichment Analysis?
• Break down cellular function into gene sets
- Every set of genes is associated to a specific
cellular function, process, component or pathway
Nuclear Pore
Gene.AAA
Gene.ABA
Gene.ABC
Cell Cycle
Gene.CC1
Gene.CC2
Gene.CC3
Gene.CC4
Gene.CC5
Ribosome
Gene.RP1
Gene.RP2
Gene.RP3
Gene.RP4
P53 signaling
Gene.CC1
Gene.CK1
Gene.PPP
Daniele Merico
What is Gene Set Enrichment Analysis?
• Find known gene sets (e.g. pathways)
enriched in a gene list (e.g. from gene
expression)
– Look for significant enrichment (more on how this works later)
Nuclear Pore
Gene.AAA
Gene.ABA
Gene.ABC
Cell Cycle
Gene.CC1
Gene.CC2
Gene.CC3
Gene.CC4
Gene.CC5
Ribosome
Gene.RP1
Gene.RP2
Gene.RP3
Gene.RP4
P53 signaling
Gene.CC1
Gene.CK1
Gene.PPP
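Before any statistics, the first step of enrichment analysis is counting how many genes from the list fall in each gene set. The sketch below uses the toy gene sets shown above; the query gene list and the `overlap_table` helper are hypothetical, chosen only to illustrate the counting.

```python
# Toy gene sets from the slides above.
GENE_SETS = {
    "Nuclear Pore": {"Gene.AAA", "Gene.ABA", "Gene.ABC"},
    "Cell Cycle": {"Gene.CC1", "Gene.CC2", "Gene.CC3", "Gene.CC4", "Gene.CC5"},
    "Ribosome": {"Gene.RP1", "Gene.RP2", "Gene.RP3", "Gene.RP4"},
    "P53 signaling": {"Gene.CC1", "Gene.CK1", "Gene.PPP"},
}

# Hypothetical gene list, e.g. hits from an expression experiment.
GENE_LIST = ["Gene.CC1", "Gene.CC2", "Gene.CC4", "Gene.RP1"]


def overlap_table(gene_list, gene_sets):
    """Return (set name, overlap size, set size) rows, largest overlap first."""
    genes = set(gene_list)
    rows = [(name, len(genes & members), len(members)) for name, members in gene_sets.items()]
    # Sort by decreasing overlap, breaking ties alphabetically by set name.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


for name, overlap, size in overlap_table(GENE_LIST, GENE_SETS):
    print(f"{name}: {overlap} of {size} genes in list")
```

Raw overlap counts favour large gene sets, which is why a statistical test against a background population is needed to call an overlap significant.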
Enrichment Test
• Inputs: experimental data (microarray experiment, gene expression table) and gene-set databases (a priori knowledge + existing experimental data)
• Output: enrichment table, e.g. Spindle 0.00001, Apoptosis 0.00025
• Followed by interpretation & hypotheses
http://david.abcc.ncifcrf.gov/
DAVID demo
http://david.abcc.ncifcrf.gov/tools.jsp
Step 1: Define your gene list
• Either
– (a) Copy and paste your list in
– (b) Upload a gene list file
– (c) Choose an example gene list (so, click “demolist1”
on next slide)
Step 1: Define your gene list
If you choose the wrong identifier, you
may need to use the conversion tool
Click here to define type of list (now
we are doing a gene list, next slide
we will define the background)
Step 2: Choose background
You can either upload a background list
(in the upload tab, see previous
step) or choose one of the
background sets (shown here)
Example list is from the Human U95A
array, select it here.
Step 3: Check list & background
Step 4: Run Enrichment Analysis
Click here
Step 5: Select categories of gene lists to test
Click here to expand
Red indicates default
selections, click
check marks if you
want to change
Click here once you’ve
selected your sets
Step 6: View enrichment
Gene set name
# of genes in set
with annotation
EASE P-value
4.0E-4 means:
4.0 x 10^-4
Step 7: Change parameters if desired
Set “count” to 1 and “EASE” to 1 if you
want maximum # of categories.
Beware, only corrects within category.
Step 8: Download results
Download spreadsheet (in tab-delimited text format).
Beware: if you click you may
get text in a browser window,
if this happens, “right-click” to
save as a file.
Step 8a: What can happen if no right-click
Step 8b: How it looks in a spreadsheet
Outline of theory component
• Fisher’s Exact Test for calculating enrichment P-values (also used for calculating EASE score)
• Multiple test corrections:
– Bonferroni
– Benjamini-Hochberg FDR
• Other enrichment tests widely used but not
covered here:
– GSEA for ranked lists
– See: http://www.broadinstitute.org/gsea/index.jsp
Fisher’s exact test
a.k.a., the hypergeometric test
Gene list
RRP6
MRD1
RRP7
RRP43
RRP42
Null hypothesis: List is a random sample from population
Alternative hypothesis: More black genes than expected
Background population:
500 black genes,
4500 red genes
Fisher’s exact test
a.k.a., the hypergeometric test
Gene list
RRP6
MRD1
RRP7
RRP43
RRP42
Null distribution
P-value
Answer = 4.6 x 10^-4
Background population:
500 black genes,
4500 red genes
Newly added slide, not in your binder
2x2 contingency table for Fisher’s Exact Test
Gene list
RRP6
MRD1
RRP7
RRP43
RRP42
In gene list
Not in gene list
In gene set
4
496
Not in gene set
1
4499
e.g.: http://www.graphpad.com/quickcalcs/contingency1.cfm
Background population:
500 black genes,
4500 red genes
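The one-sided P-value for this example can be reproduced with the hypergeometric distribution. The sketch below uses only the Python standard library; `hypergeom_enrichment_p` is a name invented here, and the numbers are those of the example (4 of the 5 listed genes are "black", drawn from a background of 500 black and 4500 red genes).

```python
from math import comb


def hypergeom_enrichment_p(list_hits, list_size, set_size, population_size):
    """One-sided P-value: chance of seeing at least `list_hits` gene-set members
    in a random draw of `list_size` genes from the background population."""
    total_draws = comb(population_size, list_size)
    max_hits = min(list_size, set_size)
    tail = sum(
        comb(set_size, k) * comb(population_size - set_size, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total_draws


# 2x2 table from the example: 4 in list and in set, 1 in list but not in set,
# 496 in set but not in list, 4499 in neither.
p_value = hypergeom_enrichment_p(list_hits=4, list_size=5, set_size=500, population_size=5000)
print(f"P = {p_value:.2e}")
```

This should land close to the 4.6 x 10^-4 quoted for the example. Changing `population_size` shows how strongly the result depends on the choice of background.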
Important details
• To test for under-enrichment of “black”, test for over-enrichment of “red”.
• The EASE score used by DAVID subtracts one from the
observed overlap between gene list and gene set, so that more than one gene
from the list must be in the gene set for it to score as enriched.
• Need to choose “background population” appropriately, e.g.,
if only a portion of the total gene complement is queried (or
available for annotation), only use that population as
background.
• To test for enrichment of more than one independent type of
annotation (red vs black and circle vs square), apply Fisher’s
exact test separately for each type. ***More on this later***
Multiple test corrections
How to win the P-value lottery, part 1
Random draws
… 7,834 draws later …
Expect a random draw
with observed enrichment
once every 1 / P-value
draws
Background population:
500 black genes,
4500 red genes
How to win the P-value lottery, part 2
Keep the gene list the same, evaluate different annotations
Observed draw
RRP6
MRD1
RRP7
RRP43
RRP42
Different annotations
RRP6
MRD1
RRP7
RRP43
RRP42
Simple P-value correction: Bonferroni
If M = # of annotations tested:
Corrected P-value = M x original P-value
Corrected P-value is greater than or equal to the probability that one or
more of the observed enrichments could be due to random draws. The
jargon for this correction is “controlling for the Family-Wise Error Rate
(FWER)”
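The Bonferroni correction is a one-line calculation. In the sketch below, `bonferroni` is an illustrative helper and the P-values are made up; capping the result at 1 is a common convention added here, since a probability bound above 1 carries no information.

```python
def bonferroni(p_values):
    """Multiply each P-value by the number of tests M, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical P-values from testing three annotations.
corrected = bonferroni([0.001, 0.01, 0.5])
print(corrected)
```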
Bonferroni correction caveats
• Bonferroni correction is very stringent and can
“wash away” real enrichments.
• Often one is willing to accept a less stringent
condition, the “false discovery rate” (FDR),
which leads to a gentler correction when
there are real enrichments.
False discovery rate (FDR)
• FDR is the expected proportion of the
observed enrichments due to random
chance.
• Compare to Bonferroni correction which is a bound on the
probability that any one of the observed enrichments
could be due to random chance.
• Typically FDR corrections are calculated using the
Benjamini-Hochberg procedure.
• FDR threshold is often called the “q-value”
For your information
Benjamini-Hochberg example
Sort P-values of all tests in increasing order (smallest first)
Rank  Category                     P-value  Adjusted P-value  FDR / Q-value
1     Transcriptional regulation   0.001
2     Transcription factor         0.01
3     Initiation of transcription  0.02
…     …                            …
50    Nuclear localization         0.04
51    RNAi activity                0.05
52    Cytoplasmic localization     0.06
53    Translation                  0.07
For your information
Benjamini-Hochberg example
Rank  Category                     P-value  Adjusted P-value      FDR / Q-value
1     Transcriptional regulation   0.001    0.001 x 53/1 = 0.053
2     Transcription factor         0.01     0.01 x 53/2 = 0.27
3     Initiation of transcription  0.02     0.02 x 53/3 = 0.35
…     …                            …        …
50    Nuclear localization         0.04     0.04 x 53/50 = 0.042
51    RNAi activity                0.05     0.05 x 53/51 = 0.052
52    Cytoplasmic localization     0.06     0.06 x 53/52 = 0.061
53    Translation                  0.07     0.07 x 53/53 = 0.07
Adjusted P-value = P-value x [# of tests] / Rank
For your information
Benjamini-Hochberg example
Rank  Category                     P-value  Adjusted P-value      FDR / Q-value
1     Transcriptional regulation   0.001    0.001 x 53/1 = 0.053  0.042
2     Transcription factor         0.01     0.01 x 53/2 = 0.27    0.042
3     Initiation of transcription  0.02     0.02 x 53/3 = 0.35    0.042
…     …                            …        …                     …
50    Nuclear localization         0.04     0.04 x 53/50 = 0.042  0.042
51    RNAi activity                0.05     0.05 x 53/51 = 0.052  0.052
52    Cytoplasmic localization     0.06     0.06 x 53/52 = 0.061  0.061
53    Translation                  0.07     0.07 x 53/53 = 0.07   0.07
Q-value = minimum adjusted P-value at given rank or below
For your information
Benjamini-Hochberg example
Rank  Category                     P-value  Adjusted P-value      FDR / Q-value  FDR < 0.05?
1     Transcriptional regulation   0.001    0.001 x 53/1 = 0.053  0.042          Y
2     Transcription factor         0.01     0.01 x 53/2 = 0.27    0.042          Y
3     Initiation of transcription  0.02     0.02 x 53/3 = 0.35    0.042          Y
…     …                            …        …                     …              …
50    Nuclear localization         0.04     0.04 x 53/50 = 0.042  0.042          Y   ← P-value threshold for FDR < 0.05
51    RNAi activity                0.05     0.05 x 53/51 = 0.052  0.052          N
52    Cytoplasmic localization     0.06     0.06 x 53/52 = 0.061  0.061          N
53    Translation                  0.07     0.07 x 53/53 = 0.07   0.07           N
P-value threshold is highest ranking P-value for which
corresponding Q-value is below desired significance threshold
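The Benjamini-Hochberg steps in this example (rank the P-values, scale each by [# of tests] / rank, then take the running minimum from the bottom of the table upward) can be sketched as follows. The function name and the four P-values are hypothetical, chosen so that the running minimum visibly changes a value.

```python
def benjamini_hochberg(p_values):
    """Return Q-values in the same order as the input P-values."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest P-value (rank 1..m).
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value (rank m) up to the smallest (rank 1).
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        adjusted = p_values[index] * m / rank  # P-value x [# of tests] / rank
        running_min = min(running_min, adjusted)
        q_values[index] = running_min
    return q_values


# Hypothetical P-values. The adjusted value at rank 2 is 0.03 x 4/2 = 0.06,
# but its Q-value becomes 0.04 because a lower-ranked test has adjusted value 0.04.
example_q = benjamini_hochberg([0.001, 0.03, 0.035, 0.04])
print([round(q, 4) for q in example_q])
```

A test is then called significant at a chosen FDR when its Q-value falls below that threshold.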
Reducing multiple test correction stringency
• The correction to the P-value threshold
depends on the # of tests that you do, so, no
matter what, the more tests you do, the stricter the
threshold becomes and the more sensitive each test needs to be
• Can control the stringency by reducing the
number of tests: e.g. use GO slim; restrict
testing to the appropriate GO annotations; or
select only larger GO categories.
Multiple test correction in DAVID
• In DAVID, the “Benjamini-Hochberg” column
corresponds to the false discovery rate as it is
typically defined. It is unclear what DAVID's separate
“FDR” column means.
• DAVID does multiple test correction separately
within each category of gene sets, so adding
more categories does not change the FDRs or
P-values. Be careful how you report these
numbers.
Summary
• Enrichment analysis:
– Statistical tests
• Gene list: Fisher’s Exact Test
• Gene rankings: GSEA, also see Wilcoxon rank-sum,
Mann-Whitney U-test, Kolmogorov-Smirnov test
– Multiple test correction
• Bonferroni: stringent, controls probability of at least
one false positive*
• FDR: more forgiving, controls expected proportion of
false positives* -- typically uses Benjamini-Hochberg
* Type 1 error, i.e. an enrichment observed when there is no true association
Gene function prediction with
GeneMANIA
Outline
• Concepts in gene function prediction:
– Guilt-by-association
– Gene recommender systems
• GeneMANIA demo
• Gene function prediction use cases
• Scoring interactions by guilt-by-association
Using genome-wide data in the lab
Pathway-based networks
Protein-protein
interaction data
Genetic interaction data
?!?
Microarray expression data
Genomics revolution, the bad news
Genomics datasets are:
• noisy,
• redundant,
• incomplete,
• mysterious,
• massive
Google can’t do biology
Guilt-by-association principle
Microarray expression data
Conditions
Co-expression network
Cell cycle
CDC3
CLB4
CDC16
Genes
UNK1
RPT1
RPN3
RPT6
Eisen et al (PNAS 1998)
UNK2
Protein degradation
Fraser AG, Marcotte EM - A probabilistic view of gene
function - Nat Genet. 2004 Jun;36(6):559-64
GeneMANIA Demo
Main site (stable but still fun):
http://www.genemania.org
Beta site (new and edgy but possibly unreliable):
http://beta.genemania.org
Two types of functional prediction
• “Give me more genes like these”,
– e.g. find more genes in the Wnt signaling pathway,
find more kinases, find more members of a
protein complex
• “What does my gene do?”
– Goal: determine a gene’s function based on who it
interacts with: “guilt-by-association”.
“Give me more genes like these”
Input
Network and profile data
Output
from GeneMANIA
Query list
CDC48
CPR3
MCA1
TDH2
Gene
recommender
system
e.g., GeneMANIA,
STRING http://www.string-db.org,
bioPIXIE http://pixie.princeton.edu/pixie/
“What does my gene do?”
Input
Network and profile data
Output
Query list
CDC48
Gene
recommender
system then
enrichment
analysis
e.g., GeneMANIA, bioPIXIE
Composite functional
interaction/linkage/association networks
Pathway-based networks
Protein-protein
interaction data
Genetic interaction data
Microarray expression data
Composite
functional
association
network
Pre-computed functional interaction networks
Cell
cycle
CDC23
CDC27
Pre-combine networks e.g. by simple addition or
Naïve Bayes
APC11
+
UNK1
RAD54
XRS2
+
Genetic
Co-complexed
Tong et al. 2001
Jeong et al 2002
DNA
repair
MRE11
UNK2
Co-expression
Pavlidis et al, 2002,
Marcotte et al, 1999
bioPIXIE
Composite networks: One size doesn’t fit all
• Gene function could be a/the:
– Biological process,
– Biochemical/molecular function,
– Subcellular/Cellular localization,
– Regulatory targets,
– Temporal expression pattern,
– Phenotypic effect of deletion.
Some networks may be better for some types
of gene function than others
Query-specific composite networks
weights
w1 x
Cell
cycle
CDC23
w3 x
CDC27
APC11
UNK1
RAD54
w2 x
+
+
Genetic
Co-complexed
Tong et al. 2001
Jeong et al 2002
XRS2
DNA
repair
MRE11
UNK2
=
Co-expression
Pavlidis et al, 2002,
Lanckriet et al, 2004
Mostafavi et al, 2008
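The idea of a weighted composite network can be sketched as a weighted sum of edge strengths. Everything specific below is hypothetical: the edges, their strengths and the three weights are made up for illustration, and `combine_networks` is not GeneMANIA code. How query-specific weights are actually chosen is described in the cited papers and is not shown here.

```python
def combine_networks(networks, weights):
    """Weighted sum of networks. Each network maps a (gene, gene) pair to an
    edge strength; the result maps frozenset({gene, gene}) to the combined strength."""
    composite = {}
    for name, edges in networks.items():
        weight = weights.get(name, 0.0)
        for pair, strength in edges.items():
            key = frozenset(pair)  # undirected: (a, b) and (b, a) are the same edge
            composite[key] = composite.get(key, 0.0) + weight * strength
    return composite


# Hypothetical edges, using gene names from the figure.
NETWORKS = {
    "genetic": {("CDC23", "CDC27"): 1.0, ("RAD54", "XRS2"): 1.0},
    "co-complexed": {("CDC27", "CDC23"): 1.0, ("CDC27", "APC11"): 1.0},
    "co-expression": {("UNK1", "CDC23"): 0.5, ("MRE11", "XRS2"): 0.8},
}
# Hypothetical weights; a query-specific method would choose these per query.
WEIGHTS = {"genetic": 0.2, "co-complexed": 0.5, "co-expression": 0.3}

COMPOSITE = combine_networks(NETWORKS, WEIGHTS)
print(COMPOSITE[frozenset(("CDC23", "CDC27"))])
```

An edge supported by several networks accumulates weight from each, which matches the caveat that shared interactions add confidence. Setting all weights equal gives the "simple addition" form of a pre-computed composite.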
Two rules for network weighting
Relevance
The network should be relevant to predicting the
function of interest
• Test: Are the genes in the query list more often
connected to one another than to other genes?
Redundancy
The network should not be redundant with other
datasets – particularly a problem for co-expression
• Test: Do the two networks share many interactions?
• Caveat: Shared interactions also provide more
confidence that the interaction is real.
Scoring nodes by guilt-by-association
Query list: “positive
examples” MCA1
CDC48
CPR3
TDH2
Scoring nodes by guilt-by-association
Query list: “positive
examples” MCA1
Score
CDC48
high
CPR3
TDH2
low
Direct neighborhood
CDC48
MCA1
CPR3
TDH2
Two main
algorithms
Label propagation
CDC48
MCA1
CPR3
TDH2
Node scoring algorithm details
• Direct neighbour node score depends on:
– Strength of links to positive examples
– # of positive neighbors
• Label propagation node score depends on:
– Strength of links and # of positive direct neighbors
– # of shared neighbors with positive examples
– “modular structure” of network
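The difference between the two scoring approaches can be seen on a toy network. The sketch below is a simplified illustration, not GeneMANIA's algorithm: the label propagation shown is a basic iterative variant with an assumed mixing parameter `alpha`, and GENE_X, GENE_Y, GENE_Z and all edges are hypothetical.

```python
def build_network(edges):
    """Build a symmetric {node: {neighbour: weight}} map from (a, b, weight) triples."""
    network = {}
    for a, b, weight in edges:
        network.setdefault(a, {})[b] = weight
        network.setdefault(b, {})[a] = weight
    return network


def direct_neighbour_scores(network, positives):
    """Score = summed link strength to positive examples that are direct neighbours."""
    return {
        node: sum(w for neighbour, w in links.items() if neighbour in positives)
        for node, links in network.items()
    }


def label_propagation_scores(network, positives, alpha=0.5, iterations=50):
    """Repeatedly mix each node's own label with the weighted mean score of its neighbours."""
    labels = {node: (1.0 if node in positives else 0.0) for node in network}
    scores = dict(labels)
    for _ in range(iterations):
        updated = {}
        for node, links in network.items():
            total = sum(links.values())
            spread = sum(w * scores[nb] for nb, w in links.items()) / total if total else 0.0
            updated[node] = (1 - alpha) * labels[node] + alpha * spread
        scores = updated
    return scores


# Hypothetical chain: two positives -> GENE_X -> GENE_Y -> GENE_Z.
EDGES = [
    ("CDC48", "GENE_X", 1.0),
    ("MCA1", "GENE_X", 1.0),
    ("GENE_X", "GENE_Y", 1.0),
    ("GENE_Y", "GENE_Z", 1.0),
]
NETWORK = build_network(EDGES)
POSITIVES = {"CDC48", "MCA1"}
DIRECT = direct_neighbour_scores(NETWORK, POSITIVES)
PROPAGATED = label_propagation_scores(NETWORK, POSITIVES)
print(DIRECT["GENE_Y"], round(PROPAGATED["GENE_Y"], 3))
```

GENE_Y has no positive direct neighbour, so the direct-neighbour method scores it zero, while label propagation still gives it a small score through GENE_X.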
Label propagation example
Before
After
Three parts of GeneMANIA:
• A large, automatically updated collection of
interaction networks.
• A query algorithm to find genes and networks
that are functionally associated to your query
gene list.
• An interactive, client-side network browser
with extensive link-outs
GeneMANIA data sources
Network types
Legend
* minor curation
** major curation
Co-expression*
-Gene ID
mappings from
Ensembl and
Ensembl Plant
Co-localization**
Pathways
-Network/gene
descriptors from
Entrez-Gene and
Pubmed
Physical interactions
Genetic interactions*
Shared domains
Predicted interactions**
- Gene
annotations from
Gene Ontology,
GOA, and model
org. databases
Other
MGI, Chemogenomics
Gene identifiers
• All unique identifiers within the selected
organism: e.g.
– Entrez-Gene ID
– Gene symbol
– Ensembl ID
– Uniprot (primary)
– also, some synonyms & organism-specific names
• We use Ensembl database for gene mappings
(but we mirror it once every 3 months, so sometimes
we are out of date)
Cytoscape plugin
http://www.genemania.org/plugin/
+ QueryRunner
http://cytoscapeweb.cytoscape.org/
Other resources for GeneMANIA info
• “About” page on GeneMANIA interface
– http://www.genemania.org/pages/about.jsf
• OpenHelix tutorial (not available everywhere)
– http://www.openhelix.com/
– http://www.openhelix.com//cgi/tutorialInfo.cgi?id=113
• Our papers:
– GeneMANIA website: Warde-Farley et al, NAR 2010
– GeneMANIA algorithm: Mostafavi et al, GB 2008
– Cytoscape plugin: Montojo et al, Bioinfo 2010
– Cytoscape Web: Lopes et al, Bioinfo 2010
Principal
Investigators
Quaid Morris
Gary Bader
Outreach
David Warde-Farley
Students
Who?
Sylva Donaldson
Sara Mostafavi
Ovi Comes
Christian Lopes Max Franz
Harold Rodriguez
Developers
Khalid Zuberi Jason Montojo
Farzana Kazi