
Systems Biology: The inference of networks from high-dimensional genomics data
Ka Yee Yeung
Nov 3, 2011
1
Systems Biology
• “Systems biology is the study of the
interactions between the components of a
biological system, and how these
interactions give rise to the function and
behavior of that system” (Nir Friedman)
• The goal is to construct models of complex
biological systems and diseases (Trey
Ideker)
2
An iterative approach
[Figure: an iterative cycle linking experiments, high-throughput assays, data handling, integration of multiple forms of experiments and knowledge, and mathematical modeling.]
3
Multi-disciplinary Science
• Biology
• Biotechnology
• Computer Science
• Mathematics and Statistics
• Physics and chemistry
• Engineering…
4
Networks as a universal language
We are caught in an inescapable network of
mutuality. ... Whatever affects one directly, affects
all indirectly. —Martin Luther King Jr.
[Figure: examples of networks: the Internet, an electronic circuit, a social network, a gene regulatory network.]
Science Special Online Collection: Complex systems and
networks: http://www.sciencemag.org/complexity/
5
Road Map
• Definitions: graphical representation of
networks
• Different types of molecular networks
• What can we do with networks?
• Network construction methods
– Co-expression networks
– Bayesian networks
– Regression-based methods
6
Graphical Representation of Gene
Networks
• G=(V,E) where
– V: set of nodes (vertices)
– E: set of edges between the nodes that
represent the relationships between nodes
• Directed vs. undirected
• Network topology: connectivity structure
• Modules: subsets of nodes that are more highly interconnected with each other than with the rest of the network
[Figure: an undirected graph and a directed graph.]
7
Degree
• The degree k of a node is the number of edges
connected to it.
• In a directed graph, each node has an in-degree
and an out-degree.
8
Courtesy of Bill Noble
Degree distribution
• The degree distribution plots the number of
nodes that have a given degree k as a function
of k.
• The shape of the degree distribution allows
us to distinguish among types of networks.
9
Courtesy of Bill Noble
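To make the definition concrete, here is a minimal Python sketch (standard library only; the toy edge list is invented) that computes the degree of each node and then the degree distribution, i.e., how many nodes have each degree k.

from collections import Counter

# Toy undirected edge list (hypothetical); each pair is one edge.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "F")]

# Degree k of a node = number of edges connected to it.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Degree distribution: number of nodes with a given degree k, as a function of k.
distribution = Counter(degree.values())
for k in sorted(distribution):
    print(f"degree k={k}: {distribution[k]} node(s)")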
Scale-free networks
• Most nodes have only one connection; a few
hub nodes are highly connected.
• The degree distribution follows a power law, which yields a straight line on a log-log plot.
• Most biological networks are scale-free.
10
Courtesy of Bill Noble
Molecular or Biochemical Pathways
• A set of coupled chemical
reactions or signaling events.
• Nodes are molecules (often
substrates) and edges
represent chemical reactions.
• Represent decades of work
in which the underlying
chemical reactions are
validated.
Example: KEGG (Kyoto Encyclopedia of Genes and Genomes)
• Contains 410 “pathways” that represent molecular interaction and
reaction networks that are manually curated from 149,937 published
references (as of 10/25/2011).
11
Molecular Networks Constructed
from High-throughput assays (1)
Physical interaction network:
• A graphical representation of molecular
binding interactions such as a protein-protein
interaction (PPI) network.
• Nodes are molecules; edges represent
physical interactions between molecules.
• Example: a yeast PPI network in which most interactions are derived from large-scale experiments such as yeast two-hybrid data (high false-positive and false-negative rates)
12
Molecular Networks Constructed
from High-throughput assays (2)
Correlation or co-expression network:
A graphical representation that averages
over observed expression data. Nodes are
mRNA molecules, edges represent
correlations between expression levels of
connected nodes.
Bayesian networks:
A directed, graphical representation of the
probabilities of one observation given another.
Nodes represent mRNA molecules; edges
represent the probability of a particular
expression value given the expression values
of the parent nodes.
13
What can we do with
these molecular
networks?
Using the position in
networks to describe
function
Guilt by association
Finding the causal
regulator
(the "Blame Game")
14
Courtesy of Mark Gerstein
What can we do with these molecular networks?
Hubs tend to be essential!
[Figure: power-law degree distribution, log(frequency) vs. log(degree).]
Success stories:
• Network modeling links breast cancer susceptibility and centrosome
dysfunction. Pujana et al. Nature Genetics 2007
• Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies
ASPM as a Novel Molecular Target. PNAS 2006
15
Courtesy of Mark Gerstein
What can we
do with these
molecular
networks?
Network-based
drug discovery
Success stories:
• Variations in DNA elucidate molecular networks that cause disease. Chen
et al. Nature 2008.
• Genetics of gene expression and its effect on disease. Emilsson et al.
Nature 2008.
16
Fig 4 Schadt et al. Nature Reviews Drug Discovery 2009
Motivation of gene network
inference
• Using biochemical methods, it takes thousands of person-years to assign genes to pathways.
• Even for well-studied genomes, the majority of genes
are not mapped to known pathways.
• As more genomes are being sequenced, and more
genes are discovered, we need systematic methods to
assign genes to pathways.
17
A gene-regulation
function describes
how inputs such as
transcription
factors and
regulatory
elements are
transformed into a
gene’s mRNA
level.
18
Kim et al. Science 2009
A gene-regulation
function describes
how inputs such as
transcription
factors and
regulatory
elements are
transformed into a
gene’s mRNA
level.
Modeling DNA sequence-based cis-regulatory gene networks. Bolouri & Davidson. 2002.
19
Kim et al. Science 2009
Network construction methods
• Co-expression networks
• Bayesian networks
• Regression-based methods
20
Goal: construct gene networks
21
Early inference of transcriptional
regulation: Clustering
• Clustering: extract groups of genes that
are tightly co-expressed over a range of
different experiments.
• Pattern discovery
• No prior knowledge required
• Applications:
– Guilt by association (functional annotations)
– Extraction of regulatory motifs
– Molecular signatures for tissue sub-types
22
Correlation: pairwise similarity
[Figure: the raw data matrix (n genes × p experiments) is converted into an n × n similarity matrix of pairwise correlations between genes.]
[Figure: expression profiles of genes X, Y, Z, and W across 4 experiments:
Correlation(X, Y) = 1
Correlation(X, Z) = -1
Correlation(X, W) = 1]
23
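As a small illustration of the raw-matrix-to-similarity-matrix step, the sketch below builds the pairwise Pearson correlation matrix with NumPy; the toy expression values are invented, but chosen so that Correlation(X,Y) = 1, Correlation(X,Z) = -1, and Correlation(X,W) = 1, as in the figure.

import numpy as np

# Toy raw matrix: 4 genes (X, Y, Z, W) x 4 experiments; values are invented.
genes = ["X", "Y", "Z", "W"]
raw = np.array([
    [1.0, 2.0, 3.0, 4.0],   # X
    [2.0, 3.0, 4.0, 5.0],   # Y: X shifted up   -> Correlation(X, Y) = 1
    [4.0, 3.0, 2.0, 1.0],   # Z: X reversed     -> Correlation(X, Z) = -1
    [2.0, 4.0, 6.0, 8.0],   # W: X scaled by 2  -> Correlation(X, W) = 1
])

# n x n similarity matrix of pairwise Pearson correlations between gene profiles.
similarity = np.corrcoef(raw)
print(np.round(similarity, 2))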
Clustering algorithms
• Inputs:
– Similarity matrix
– Number of clusters or some other
parameters
• Many different classifications of
clustering algorithms:
– Hierarchical vs partitional
– Heuristic-based vs model-based
– Soft vs hard
24
Hierarchical Clustering
• Agglomerative (bottom-up)
• Algorithm:
  – Initialize: each item is its own cluster
  – Iterate:
    • select the two most similar clusters
    • merge them
  – Halt: when the required number of clusters is reached
[Figure: dendrogram]
25
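A minimal sketch of agglomerative clustering with SciPy; the data are simulated, and the method argument ("single", "complete", or "average") corresponds to the linkage variants on the next three slides.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: 6 genes x 4 experiments (values invented).
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))

# Agglomerative (bottom-up): start with each gene as its own cluster and
# iteratively merge the two most similar clusters; the result is a dendrogram.
Z = linkage(data, method="average", metric="correlation")

# Halt (cut the dendrogram) when the required number of clusters is reached.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster assignment for each of the 6 genes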
Hierarchical: Single Link
• cluster similarity = similarity of two
most similar members
- Potentially long and skinny clusters
+ Fast
26
Hierarchical: Complete Link
• cluster similarity = similarity of two least
similar members
+ tight clusters
- slow
27
Hierarchical: Average Link
• cluster similarity = average similarity of
all pairs
+ tight clusters
- slow
28
Co-expression Networks
• Co-expression networks
– Aka: Correlation networks, association networks
– Use microarray data only
– Nodes are connected if they have a significant
pairwise expression profile association across
environmental perturbations
• References:
  – A general framework for weighted gene co-expression network analysis. Zhang & Horvath, SAGMB 2005.
  – WGCNA: an R package for weighted correlation network analysis. Langfelder & Horvath, BMC Bioinformatics 2008.
29
Steps for constructing a
co-expression network
A) Microarray gene expression data
B) Measure concordance of gene
expression with correlation
C) The Pearson correlation matrix is either thresholded to arrive at an adjacency matrix (→ unweighted network), or transformed continuously with the power adjacency function (→ weighted network)
30
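A minimal sketch of step C, assuming the Pearson correlation matrix has already been computed; the threshold t and the soft power beta below are illustrative choices, not values from the talk.

import numpy as np

def unweighted_adjacency(corr, t):
    """Hard threshold: connect genes i and j whenever |corr(i, j)| >= t."""
    adj = (np.abs(corr) >= t).astype(int)
    np.fill_diagonal(adj, 0)   # no self-edges
    return adj

def weighted_adjacency(corr, beta):
    """Power adjacency function (as in WGCNA): a_ij = |corr(i, j)|^beta."""
    adj = np.abs(corr) ** beta
    np.fill_diagonal(adj, 0.0)
    return adj

# Example with a toy 3 x 3 correlation matrix (values invented).
corr = np.array([[1.00, 0.91, 0.30],
                 [0.91, 1.00, 0.75],
                 [0.30, 0.75, 1.00]])
print(unweighted_adjacency(corr, t=0.9))
print(weighted_adjacency(corr, beta=6))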
Example: co-expression network
[Figure: a correlation matrix over 10 genes (A–J) with pairwise correlations ranging from 0.72 to 0.98, thresholded at t = 1.]
At t = 1, there are no edges, so all nodes have degree k = 0 (P(k = 0) = 10/10).
31
Example: co-expression network
[Figure: the same correlation matrix thresholded at t = 0.9.]
At t = 0.9, there are 4 edges, so 8 nodes have degree k = 1 and 2 nodes have degree k = 0 (P(k = 1) = 8/10, P(k = 0) = 2/10).
32
Example: co-expression network
[Figure: the same correlation matrix thresholded at t = 0.7; the resulting degree distribution is P(k = 1) = 7/10, P(k = 3) = 2/10, P(k = 5) = 1/10, plotted as log(P(k)) vs. log(k).]
33
Bayesian networks
• A directed acyclic graph (DAG) such that the nodes
represent mRNA expression levels and the edges
represent the probability of observing an expression
value given the values of the parent nodes.
• The probability distribution for a gene depends only on
its regulators (parents) in the network.
Example: G4 and G5 share a common regulator G2, i.e., they are conditionally independent given G2.
→ This allows factorization of the full joint probability distribution into component conditional distributions.
34
Needham et al. PLOS Comp Bio 2007
Independent Events
If G1, …, G5 are independent, then
the joint probability
p(G1, G2, G3, G4, G5)
= p(G1) p(G2) p(G3) p(G4) p(G5)
G1
G2
G4
G3
G5
Example:
K=“KaYee gives the lecture today”.
R=“It is raining outside today”
Whether it is raining or sunny outside doesn't affect whether Ka Yee is giving the lecture today.
p(K,R) = p(K) * p(R)
35
Conditional Probability Distributions
• Conditional probability distributions: p(B|A) = the
probability of B given A.
Example:
K=“KaYee gives the lecture today”.
E=“today’s lecture contains equations”
P(E, K) = Probability that Ka Yee gives the lecture today and
today’s lecture contains equations = 0.05.
P(K)=1/10 = 0.1.
P(E|K) = P(E, K) / P(K) = 0.05/0.1 = 0.5.
• Score a network (fit) in light of the data: p(M|D), where D = data and M = network structure → infer how well a particular network explains the observed data.
36
Conditional Independence
Example:
K=“KaYee gives the lecture today”.
E=“today’s lecture contains equations”
C=“today’s slides are in Comic Sans font”
K
E
C
If Ka Yee is giving the lecture today, then whether today’s lecture contains
equations doesn’t affect whether today’s slides are in Comic Sans.
P(E|K,C) = P(E|K)
E and C are conditionally independent given K.
• In Bayesian networks, each node is independent of its
non-descendants, given its parents in the DAG.
• Using conditional independence between variables, the
joint probability distribution of the models may be
represented in a compact manner.
37
Joint Probability Distribution
p(G1, G2, G3, G4, G5)
= p(G1) p(G2|G1) p(G3|G1) p(G4|G2) p(G5|G1, G2, G3)
38
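To make the factorization concrete, here is a small Python sketch with binary genes and made-up conditional probability tables; the parent sets follow the factorization on this slide (G1 → G2, G1 → G3, G2 → G4, and G1, G2, G3 → G5).

from itertools import product

# Made-up conditional probability tables for binary genes (1 = expressed).
# The structure follows the slide's factorization:
# p(G1) p(G2|G1) p(G3|G1) p(G4|G2) p(G5|G1,G2,G3)
p_G1 = {1: 0.6, 0: 0.4}
p_G2_given_G1 = {(1,): {1: 0.8, 0: 0.2}, (0,): {1: 0.3, 0: 0.7}}
p_G3_given_G1 = {(1,): {1: 0.5, 0: 0.5}, (0,): {1: 0.1, 0: 0.9}}
p_G4_given_G2 = {(1,): {1: 0.9, 0: 0.1}, (0,): {1: 0.2, 0: 0.8}}
p_G5_given_G123 = {
    parents: {1: p, 0: 1 - p}
    for parents, p in {
        (0, 0, 0): 0.05, (0, 0, 1): 0.2, (0, 1, 0): 0.3, (0, 1, 1): 0.6,
        (1, 0, 0): 0.1, (1, 0, 1): 0.4, (1, 1, 0): 0.5, (1, 1, 1): 0.9,
    }.items()
}

def joint(g1, g2, g3, g4, g5):
    """p(G1,...,G5) as the product of the component conditional distributions."""
    return (p_G1[g1]
            * p_G2_given_G1[(g1,)][g2]
            * p_G3_given_G1[(g1,)][g3]
            * p_G4_given_G2[(g2,)][g4]
            * p_G5_given_G123[(g1, g2, g3)][g5])

# Sanity check: the factorized joint distribution sums to 1.
print(sum(joint(*vals) for vals in product([0, 1], repeat=5)))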
Constructing a Bayesian network
• Variables (nodes in the graph)
• Add edges to the graph by computing conditional
probabilities that characterize the distribution of states
of each node given the state of its parents.
• The number of possible network structures grows
exponentially with the number of nodes, so an exhaustive
search of all possible structures to find the one best
supported by the data is not feasible.
• Markov chain Monte Carlo (MCMC) algorithm:
– Start with a random network.
– Small random changes are then made to the network by flipping,
adding, or deleting individual edges.
– Accept changes that improve the fit of the network to the data.
39
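The sketch below is a compact, simplified version of the search loop described above: propose a single-edge change, keep the graph acyclic, and accept the change if it improves the score. The score used here is a penalized linear-Gaussian fit standing in as a proxy for a proper Bayesian network score, and the expression data are simulated, so this illustrates the idea rather than the exact method used in practice.

import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 5, 100
data = rng.normal(size=(n_samples, n_genes))  # toy expression data (simulated)

def is_acyclic(adj):
    """Check that the directed graph (adj[u, v] = edge u -> v) has no cycles."""
    color = [0] * n_genes  # 0 = unvisited, 1 = in progress, 2 = done
    def dfs(u):
        color[u] = 1
        for v in range(n_genes):
            if adj[u, v]:
                if color[v] == 1 or (color[v] == 0 and not dfs(v)):
                    return False
        color[u] = 2
        return True
    return all(color[u] == 2 or dfs(u) for u in range(n_genes))

def score(adj):
    """Penalized linear-Gaussian fit; higher is better (stand-in for a BN score)."""
    total = 0.0
    for g in range(n_genes):
        parents = np.flatnonzero(adj[:, g])
        X = np.column_stack([np.ones(n_samples)] + [data[:, p] for p in parents])
        resid = data[:, g] - X @ np.linalg.lstsq(X, data[:, g], rcond=None)[0]
        total += -n_samples * np.log(resid @ resid / n_samples) \
                 - 2 * len(parents) * np.log(n_samples)
    return total

adj = np.zeros((n_genes, n_genes), dtype=bool)  # start from an empty network
best = score(adj)
for _ in range(2000):
    i, j = rng.choice(n_genes, size=2, replace=False)
    proposal = adj.copy()
    proposal[i, j] = not proposal[i, j]        # add or delete a single edge
    if is_acyclic(proposal):
        s = score(proposal)
        if s > best:                           # accept changes that improve the fit
            adj, best = proposal, s
print(adj.astype(int))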
Bayesian networks
• Advantages:
– Compact and intuitive representation
– Integration of prior knowledge
– Probabilistic framework for data integration
• Limitation: no feedback loops → dynamic Bayesian networks (variables are indexed by time and replicated in the network)
• References:
– Using Bayesian Networks to Analyze Expression Data. Friedman et al. J. Computational Biology 7:601-620, 2000.
– A Primer on Learning in Bayesian Networks for Computational
Biology. Needham et al. PLOS Computational Biology 2007.
40
What kinds of data contain potential
information about gene networks?
Large expression sets
• Co-expression (correlation of expression levels) implies connectivity
• But correlation ≠ causality
[Figure: an observed correlation between A and B is consistent with A → B, with B → A, or with a common regulator C of both A and B.]
Adding causality
• Genetic perturbation:
DNA variation at A
influences RNA variation
at B.
• Time series: A goes up
prior to B.
• Prior knowledge
41
Adding genetics data
• Quantitative trait locus (QTL): a region of DNA that is associated with a particular trait (e.g., height)
• QTL mapping (linkage analysis): correlate the
genotypic and phenotypic variation
42
Our data
Our experimental design:
• Time dependencies: ordering of regulatory events.
• Genotype data: correlate DNA variations in the segregants to measured expression levels.
[Figure: a BY (lab) × RM (wild) yeast cross yields 95 segregants; each segregant is genotyped (DNA genotype), and its phenotype (RNA levels in response to drug perturbation) is measured at 6 time points. Genetics of global gene expression. Rockman & Kruglyak. 2006.]
43
Experimental design: Roger Bumgarner, Kenneth Dombek, Eric Schadt, Jun Zhu.
[Figure: overview of the inference pipeline. Expression data, genome-wide binding data, literature, and other data (e.g., protein-protein interaction, genetic interaction, genotype) are integrated by supervised learning to estimate, for each regulator R and gene g, the probability that R regulates g (e.g., 0.95, 0.23, 0.78, …). These probabilities constrain the candidate regulators considered during variable selection on the time series expression data, which yields the gene regulatory network.]
Yeung et al. To appear in PNAS.
44
Integration of external data
[Figure: from expression data, genome-wide binding data, literature, and other data, compute variables Xi that capture evidence of regulation for each (TF, gene) pair. Using positive (Y = 1) and negative (Y = 0) training examples, apply logistic regression to determine the weights (ai's) of the Xi's, producing for each regulator R and gene g the probability that R regulates g (e.g., 0.95, 0.23, 0.78, …).]
45
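A minimal sketch of the logistic-regression step, assuming each (TF, gene) pair has already been summarized into evidence variables Xi; the two evidence variables and the training labels below are invented.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: each row is one (TF, gene) pair.
# Columns (invented): X1 = genome-wide binding evidence, X2 = literature evidence.
X = np.array([[0.90, 3], [0.80, 1], [0.70, 2], [0.10, 0], [0.20, 1], [0.05, 0]])
y = np.array([1, 1, 1, 0, 0, 0])   # Y = 1: known regulation, Y = 0: negative example

# Fit the weights (a_i's) of the evidence variables.
model = LogisticRegression().fit(X, y)

# For new (TF, gene) pairs, output the probability that the TF regulates the gene.
new_pairs = np.array([[0.85, 2], [0.15, 0]])
print(model.predict_proba(new_pairs)[:, 1])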
Constraining candidate regulators
[Figure: regulators R1, R2, R3 shown as parent nodes of gene g; a graphical representation of the network as a set of nodes and edges.]
Goal: to infer the parent nodes (regulators) of each gene g using the time series expression data.
• Without prior knowledge, every gene is a potential regulator of every other gene. We want to restrict the search to the most likely regulators.
• For each gene g, we estimated a priori how likely it is that each regulator R regulates g, using the supervised framework and the external data sources.
46
Regression-based approach
Use the expression levels at time (t-1) to predict the expression levels at time t in the same segregant.
[Figure: potential regulators R at time t-1 point to gene g at time t.]
Let X(g,t,s) = expression level of gene g at time t in segregant s. Then

X(g,t,s) ≈ Σ_R β(g,R) · X(R, t-1, s) + ε

where the sum runs over the potential regulators R, and variable selection determines which regulators enter the model.
47
Yeung et al. To appear in PNAS.
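A minimal sketch of the per-gene regression step; the lasso is used here purely as a stand-in variable-selection device (the talk's method uses Bayesian Model Averaging, described on the next slide), and the segregant data are simulated.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_segregants, n_regulators = 95, 50

# X_prev[s, R]: expression of candidate regulator R at time t-1 in segregant s.
X_prev = rng.normal(size=(n_segregants, n_regulators))

# Simulate gene g at time t as a sparse combination of two regulators plus noise.
y_t = 1.5 * X_prev[:, 3] - 2.0 * X_prev[:, 17] + 0.3 * rng.normal(size=n_segregants)

# Variable selection: regulators with nonzero coefficients are the inferred parents of g.
fit = Lasso(alpha=0.1).fit(X_prev, y_t)
parents = np.flatnonzero(fit.coef_)
print("selected regulators:", parents)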
Bayesian Model Averaging (BMA)
[Raftery 1995], [Hoeting et. al. 1999]
• BMA takes model uncertainty into account by
averaging over the posterior distributions of a
quantity of interest based on multiple models,
weighted by their posterior model
probabilities.
Pr( | D) 
K
Pr( | D, Mk )*Pr(Mk | D)

k 1
• Output: Posterior probabilities for selected
 genes and selected models
48
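A minimal sketch of the BMA idea for one gene: enumerate candidate models M_k (subsets of a few candidate regulators), approximate Pr(M_k | D) with BIC weights, and average to get a posterior inclusion probability for each regulator. The data are simulated, and the BIC approximation is an assumption, not necessarily the exact implementation used in the talk.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, candidates = 95, 4
X = rng.normal(size=(n, candidates))
y = 1.2 * X[:, 0] - 0.8 * X[:, 2] + 0.5 * rng.normal(size=n)

def bic(cols):
    """BIC of the linear model of y on the chosen regulators (smaller is better)."""
    Xk = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    resid = y - Xk @ np.linalg.lstsq(Xk, y, rcond=None)[0]
    return n * np.log(resid @ resid / n) + (len(cols) + 1) * np.log(n)

# Enumerate all models M_k (subsets of candidate regulators) and approximate
# Pr(M_k | D) with BIC weights.
models = [cols for r in range(candidates + 1) for cols in combinations(range(candidates), r)]
bics = np.array([bic(m) for m in models])
weights = np.exp(-0.5 * (bics - bics.min()))
post = weights / weights.sum()           # approximate Pr(M_k | D)

# Posterior inclusion probability of regulator j:
# sum of Pr(M_k | D) over all models that contain j.
for j in range(candidates):
    p_incl = sum(post[k] for k, m in enumerate(models) if j in m)
    print(f"regulator {j}: posterior inclusion probability = {p_incl:.2f}")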
Assessment
• Recovery of known regulatory relationships:
– We showed significant enrichment between our
inferred network and the assessment criteria.
• Lab validation of selected sub-networks
[Figure: child nodes of selected TFs in the inferred network are compared with genes that respond to deletion of the TF (vs. wild type, WT) under rapamycin perturbation.]
• Comparison to other methods in the literature.
49
Systematic name | Common name | # references in SGD | # child nodes in network | Expression pattern over time | Known binding site in JASPAR? | Description from SGD
YDR421W | ARO80 | 19 | 51 | Increasing over time | yes | Zinc finger transcriptional activator of the Zn2Cys6 family; activates transcription of aromatic amino acid catabolic genes in the presence of aromatic amino acids
YML113W | DAT1 | 17 | 57 | Decreasing over time | no | DNA binding protein that recognizes oligo(dA).oligo(dT) tracts; Arg side chain in its N-terminal pentad Gly-Arg-Lys-Pro-Gly repeat is required for DNA binding; not essential for viability
YBL103C | RTG3 | 83 | 47 | Increasing over time | yes | Basic helix-loop-helix-leucine zipper (bHLH/Zip) transcription factor that forms a complex with another bHLH/Zip protein, Rtg1p, to activate the retrograde (RTG) and TOR pathways
50
Comparing our networks to the deletion data
Deleted TF | # child nodes (our inferred network) | Genes that respond to the deletion (validation experiment) | # overlap | Fisher's test p-value
ARO80 | 51 | 10 | 4 | 9.3 × 10^-6
DAT1 | 57 | 784 | 20 | 0.04
RTG3 | 47 | 2288 | 39 | 0.03
51
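A minimal sketch of the enrichment test behind the p-values above, using the ARO80 row; the total number of genes in the background (6,000 here) is an assumption not given on the slide, so the printed p-value is only illustrative.

from scipy.stats import fisher_exact

total_genes = 6000     # assumed background size; not stated on the slide
child_nodes = 51       # genes downstream of ARO80 in the inferred network
responders = 10        # genes that respond to ARO80 deletion
overlap = 4            # genes in both sets

# 2 x 2 contingency table over (in inferred network?, responds to deletion?).
table = [[overlap, child_nodes - overlap],
         [responders - overlap, total_genes - child_nodes - responders + overlap]]
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(p_value)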
Legend:
Green: Genes that respond to deletion of
ARO80 under rapamycin in BY at 50
minutes.
Aro80p is a known
regulator of ARO9 and
ARO10. (Iraqui et al. Molecular
and Cellular Biology 1999,
19:3360-3371).
52
Legend:
Green: Genes that respond to
deletion of ARO80 under rapamycin
in BY at 50 minutes.
Magenta: Target genes with known
ARO80 binding site.
Amazingly, all 4 genes that
respond to deletion (ARO9,
ARO10, NAF1, ESBP6) contain the
known ARO80 binding site
upstream!
53
[Figure: the same inference pipeline as before. Expression data, genome-wide binding data, literature, and other data (e.g., protein-protein interaction, genetic interaction, genotype) are integrated by supervised learning into probabilities that R regulates g, which constrain the regulators considered during variable selection on the time series expression data to produce the gene regulatory network.]
Goal: incorporate prior probabilities in the variable selection step.
54
Revisiting our Road Map
• Definitions: graphical representation of
networks
• Different types of molecular networks
• What can we do with networks?
• Network construction methods
– Co-expression networks
– Bayesian networks
– Regression-based methods
– Assessment
55
Thank you’s
Data + Biological
interpretation
Roger Bumgarner
Kenneth Dombek
Eric Schadt
Jun Zhu
Method development
Adrian Raftery
Kenneth Lo
John Mittler
Special thanks
Dr. Rachel Brem
Dr. Su-In Lee
R01GM084163
R01GM084163-02S2
56