Transcript Presentation
Slide 1
Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem
Central Dogma
Gene -> (Transcription) -> mRNA -> (Translation) -> Protein
Cells express different subsets of their genes in different tissues and under different conditions
Microarrays (aka “DNA chips”)
New technological breakthrough:
Measure RNA expression levels of thousands of genes in one experiment
Measure expression on a genomic scale
Opens up new experimental designs
Many major labs are using, or will use, this technology in the near future
The Problem
[Figure: data matrix A indexed by experiments i and genes j]
Aij - the mRNA level of gene j in experiment i
Goal:
Learn regulatory/metabolic networks
Identify causal sources of the biological phenomena of interest
Our Approach
Characterize statistical relationships between expression patterns of different genes
Beyond pair-wise interactions
Many interactions are explained by intermediate factors
Regulation involves combined effects of several gene products
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process:
Example: Pedigree
A node represents an individual's genotype
[Figure: pedigree graph with nodes Homer, Marge, Bart, Lisa, Maggie]
Modeling assumptions:
Ancestors can affect descendants' genotype only by passing genetic material through intermediate generations
Network Structure
Generalizing to DAGs:
A child is conditionally independent of its non-descendants, given the value of its parents
Often a natural assumption for causal processes if we believe that we capture the relevant state of each intermediate stage.
[Figure: DAG around node X with nodes labeled Ancestor, Parent, Y1, Y2, Non-descendant, Descendant]
Local Probabilities
Associated with each variable Xi is a conditional probability distribution P(Xi | Pai), where Pai are the parents of Xi
Discrete variables:
Multinomial distribution
X        P(Y | X)
x        0.9  0.1
not x    0.3  0.7
Continuous variables:
Choice: for example linear Gaussian
[Figure: plot of P(Y | X) over X and Y]
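The two kinds of local model named on this slide can be sketched in a few lines of Python. This is only an illustration: the multinomial rows reuse the 0.9/0.1 and 0.3/0.7 values from the table above, while the outcome names, the label for the second value of X, and the linear Gaussian coefficients a, b and sigma are assumptions made for the example.

```python
import random

# Multinomial CPD for a binary child Y given a binary parent X.
# Row values come from the slide; the labels "x", "not_x", "y0", "y1"
# are illustrative names, not taken from the source.
CPT_Y_GIVEN_X = {
    "x": {"y0": 0.9, "y1": 0.1},
    "not_x": {"y0": 0.3, "y1": 0.7},
}


def sample_discrete(x_value, rng):
    """Draw Y from the multinomial row selected by the parent value."""
    row = CPT_Y_GIVEN_X[x_value]
    outcomes = list(row)
    weights = [row[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_linear_gaussian(x, rng, a=0.0, b=1.0, sigma=0.5):
    """Draw Y ~ Normal(a + b * x, sigma^2).

    The coefficients a, b and sigma are hypothetical defaults.
    """
    return rng.gauss(a + b * x, sigma)


rng = random.Random(0)
discrete_draw = sample_discrete("x", rng)
continuous_draw = sample_linear_gaussian(1.5, rng)
```

In the discrete case the parent value selects a row of the table; in the continuous case the parent value shifts the mean of a Gaussian.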
Bayesian Network Semantics
[Figure: network over nodes B, E, R, A, C]
Qualitative part: DAG specifies conditional independence statements
+
Quantitative part: local probability models
=
Unique joint distribution over domain
Compact & efficient representation:
k parents: O(2^k n) vs. O(2^n) params
parameters pertain to local interactions
P(C,A,R,E,B) = P(B) * P(E|B) * P(R|E,B) * P(A|R,B,E) * P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B) * P(E) * P(R|E) * P(A|B,E) * P(C|A)
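To make the compactness claim concrete, the factored form can be evaluated directly. The sketch below assumes all five variables are binary and uses made-up probabilities; only the parent sets are taken from the factorization on this slide.

```python
from itertools import product

# Parent sets read off the factored form:
# P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
PARENTS = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

# Hypothetical values of P(node = True | parent values).
# These numbers are invented for illustration only.
CPT = {
    "B": {(): 0.01},
    "E": {(): 0.02},
    "R": {(True,): 0.9, (False,): 0.001},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "C": {(True,): 0.7, (False,): 0.05},
}


def joint(assignment):
    """Joint probability of a full True/False assignment, as a product of local terms."""
    p = 1.0
    for node, parents in PARENTS.items():
        key = tuple(assignment[q] for q in parents)
        p_true = CPT[node][key]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p


def n_params_factored():
    """One free parameter per parent configuration of each binary node."""
    return sum(2 ** len(ps) for ps in PARENTS.values())


def n_params_full():
    """Free parameters of an unrestricted joint table over the binary nodes."""
    return 2 ** len(PARENTS) - 1


# The local models define a proper joint distribution: the terms sum to 1.
total = sum(
    joint(dict(zip(PARENTS, values)))
    for values in product([True, False], repeat=len(PARENTS))
)
```

With these parent sets the factored form needs 1 + 1 + 2 + 4 + 2 = 10 parameters, against 2^5 - 1 = 31 for the full joint table.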
Why Bayesian Networks?
Bayesian Networks:
Flexible representation of dependency structure of multivariate distributions
Natural for modeling processes with local interactions
Learning of Bayesian Networks
Can learn dependencies from observations
Handles stochastic processes:
“true” stochastic behavior
noise in measurements
Modeling Regulatory Interactions
Variables of interest:
Expression levels of genes
Concentration levels of proteins (proteomics!)
Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
Phenotype information
…
Bayesian Network Structure:
Capture dependencies among these variables
Examples
Measured expression level of each gene -> Random variables
Gene interaction -> Probabilistic dependencies
Interactions are represented by a graph:
Each gene is represented by a node in the graph
Edges between the nodes represent direct dependency
[Figure: example graphs over genes A and B]
More Complex Examples
Dependencies can be mediated through other nodes
Common cause
Intermediate gene
Common effects can imply conditional dependence
[Figure: three-node graphs over A, B, C illustrating each case]
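The last point, that a common effect can induce conditional dependence, can be checked by brute-force enumeration on a three-node network A -> C <- B. The sketch below assumes binary variables and invented probabilities; it is meant only to illustrate the effect, not to model any real gene.

```python
from itertools import product

# Hypothetical priors and conditional table for the structure A -> C <- B.
P_A = 0.1
P_B = 0.1
P_C_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.9,
    (False, False): 0.01,
}


def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) P(b) P(c | a, b)."""
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = P_C_GIVEN[(a, b)] if c else 1 - P_C_GIVEN[(a, b)]
    return pa * pb * pc


def prob(query, evidence):
    """P(query | evidence) by enumerating all eight joint states.

    Both arguments are dicts over the names 'a', 'b', 'c'.
    """
    num = den = 0.0
    for a, b, c in product([True, False], repeat=3):
        world = {"a": a, "b": b, "c": c}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(a, b, c)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den


# Marginally, learning B tells us nothing about A.
p_a = prob({"a": True}, {})
p_a_given_b = prob({"a": True}, {"b": True})

# Once the common effect C is observed, B changes our belief in A.
p_a_given_c = prob({"a": True}, {"c": True})
p_a_given_c_and_b = prob({"a": True}, {"c": True, "b": True})
```

Here A and B are independent a priori, yet after observing C the additional observation of B lowers the probability of A, which is the conditional dependence the slide refers to.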
Outline of Our Approach
Expression data -> Bayesian Network Learning Algorithm -> learned network
[Figure: example network over nodes B, E, R, A, C]
Use learned network to make predictions about structure of the interactions between genes
Experiment
Data from Spellman et al. (Mol. Bio. of the Cell, 1998)
Contains 76 samples covering the whole yeast genome:
Different methods for synchronizing the cell cycle in yeast
Time series at intervals of a few minutes (5-20 min)
Spellman et al. identified 800 cell-cycle regulated genes.
Methods
Treat samples as IID (ignoring temporal order)
Experiment 1:
Discretized into three levels of expression, by Log(ratio to control):
Level -: below -0.5
Level 0: between -0.5 and 0.5
Level +: above 0.5
Learn multinomial probabilities
Experiment 2:
Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
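The three-level discretization used in Experiment 1 could look roughly like the following. The thresholds -0.5 and 0.5 are from the slide; how values exactly on a threshold are treated and the function name are assumptions, and the base of the logarithm is not stated here.

```python
def discretize(log_ratio, threshold=0.5):
    """Map a log(ratio to control) value to one of three expression levels.

    Values exactly at +/- threshold are assigned to level "0";
    this boundary rule is an assumption, not taken from the slides.
    """
    if log_ratio < -threshold:
        return "-"
    if log_ratio > threshold:
        return "+"
    return "0"


# Example: one gene measured across a few samples (made-up numbers).
levels = [discretize(v) for v in [-1.2, -0.3, 0.0, 0.4, 0.9]]
```

Each gene then becomes a three-valued discrete variable, which is what the multinomial local models operate on.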
Network Learned
Challenge: Statistical Significance
Sparse Data
Small number of samples
“Flat posterior” -- many networks fit the data
Solution
estimate confidence in network features
Two types of features
Markov neighbors: X directly interacts with Y
Order relations: X is an ancestor of Y
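Both feature types are simple graph queries once a network is in hand. The sketch below uses the five-node example network from the Bayesian Network Semantics slide (edges read off P(R|E), P(A|B,E), P(C|A)). It reads "directly interacts" as membership in the Markov blanket (parents, children, and other parents of shared children), which is an assumption about the slide's intent; the function names are illustrative.

```python
# Directed edges (parent, child) of the example network.
EDGES = [("B", "A"), ("E", "A"), ("E", "R"), ("A", "C")]


def is_ancestor(edges, x, y):
    """Order relation: True if there is a directed path from x to y."""
    stack = [c for p, c in edges if p == x]
    seen = set()
    while stack:
        node = stack.pop()
        if node == y:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(c for p, c in edges if p == node)
    return False


def markov_blanket(edges, x):
    """Markov relation: parents, children, and co-parents of x's children."""
    parents = {p for p, c in edges if c == x}
    kids = {c for p, c in edges if p == x}
    coparents = {p for p, c in edges if c in kids and p != x}
    return parents | kids | coparents


b_precedes_c = is_ancestor(EDGES, "B", "C")
neighbors_of_b = markov_blanket(EDGES, "B")
```

In this example B is an ancestor of C through A, and the Markov blanket of B contains its child A and A's other parent E.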
Confidence Estimates
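Bootstrap approach [FGW, UAI99]
Resample the data D into datasets D1, D2, ..., Dm
Learn a network Gi from each Di
[Figure: networks over nodes B, E, R, A, C learned from each resampled dataset]
Estimate:
C(f) = (1/m) * sum_{i=1}^{m} 1{f in Gi}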
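The bootstrap estimate can be sketched as a short loop: resample with replacement, learn, and count how often each feature appears. In the sketch below the learner is a deliberately simple stand-in (it reports gene pairs whose correlation exceeds a threshold) and is not Bayesian network structure learning; all function names, the threshold and the value of m are assumptions for illustration.

```python
import random
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def toy_learn(samples, threshold=0.6):
    """Stand-in learner (NOT network structure search).

    Returns the set of gene pairs with |correlation| above the threshold.
    Any function mapping a dataset to a set of features could be used here.
    """
    names = sorted(samples[0])
    features = set()
    for x, y in combinations(names, 2):
        xs = [s[x] for s in samples]
        ys = [s[y] for s in samples]
        if abs(pearson(xs, ys)) > threshold:
            features.add((x, y))
    return features


def bootstrap_confidence(samples, learn, m=50, seed=0):
    """C(f) = (1/m) * number of resampled datasets whose learned model contains f."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        # Resample with replacement, keeping the original dataset size.
        resampled = [rng.choice(samples) for _ in samples]
        for f in learn(resampled):
            counts[f] = counts.get(f, 0) + 1
    return {f: c / m for f, c in counts.items()}
```

A call such as `bootstrap_confidence(samples, toy_learn)` takes a list of dicts mapping gene name to value and returns a confidence between 0 and 1 for every feature that appeared at least once.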
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
Histograms of number of Markov features at each confidence level
[Figure: two histograms, Original Data and Randomized Data]
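The randomization described here amounts to permuting each gene's values independently across experiments, which keeps every gene's own distribution but breaks dependencies between genes. A minimal sketch, assuming the data is a list of rows (one per experiment) with one value per gene; the function name is illustrative.

```python
import random


def shuffle_each_gene(matrix, seed=0):
    """Return a copy of the matrix with every column permuted independently.

    matrix: list of rows (experiments), each a list of per-gene values.
    The input is left unmodified.
    """
    rng = random.Random(seed)
    n_genes = len(matrix[0])
    columns = []
    for j in range(n_genes):
        column = [row[j] for row in matrix]
        rng.shuffle(column)  # permute this gene's values across experiments
        columns.append(column)
    # Transpose the shuffled columns back into rows.
    return [list(row) for row in zip(*columns)]


data = [[1, 10], [2, 20], [3, 30], [4, 40]]
randomized = shuffle_each_gene(data)
```

Running the same learning and confidence procedure on `randomized` gives a baseline for how many high-confidence features arise by chance.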
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
Markov w/ Gaussian Models
[Figure: two panels showing the number of features with confidence above t, for t from 0 to 1, Random vs. Real data]
Testing for Significance
Markov w/ Multinomial Models
[Figure: two panels showing the number of features with confidence above t, for t from 0 to 1, Random vs. Real data]
Local Map
Finding Key Genes
Key gene: a gene that precedes many other genes
YLR183C
MCD1 Mitotic Chromosome Determinant
RAD27 DNA repair protein
CLN2 role in cell cycle START
SRO4 involved in cellular polarization during budding
YOX1 Homeodomain protein that binds leu-tRNA gene
POL30 required for DNA replication and repair
YLR467W
CDC5
MSH6 Homolog of the human GTBP protein
YML119W
CLN1 role in cell cycle START
Future Work
Finding suitable local distribution models
Correct handling of hidden variables
Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
Combine with cluster analysis of higher confidence conclusions
Slide 6
Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem
.
Central Dogma
Transcription
Translation
mRNA
Gene
Cells express different subset of the genes
In different tissues and under different conditions
Protein
Microarrays (aka “DNA chips”)
New
technological breakthrough:
Measure RNA expression levels of thousands
of genes in one experiment
Measure expression on
a genomic scale
Opens up new
experimental designs
Many major labs are using,
or will use this technology
in the near future
The Problem
Experiments
j
Genes
i
Goal:
Aij - the mRNA level of gene j in experiment i
Learn regulatory/metabolic networks
Identify causal sources of the biological
phenomena of interest
Our Approach
Characterize
statistical relationships between
expression patterns of different genes
Beyond pair-wise interactions
Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process:
Example: Pedigree Homer
A node represents
an individual’s
genotype
Bart
Modeling
Marge
Lisa
Maggie
assumptions:
Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations
Network Structure
Generalizing to DAGs:
A child is conditionally independent from its non-descendants, given the value of its parents
Often a natural assumption for causal processes if we believe that we capture the relevant state of each intermediate stage.
(Diagram labels: Ancestor, Parent, X, Y1, Y2, Non-descendant, Descendant)
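As a rough illustration of the independence statement on this slide, the Python sketch below lists, for one node of a small DAG, the non-descendants it is independent of given its parents. The graph itself and the names `parents`, `descendants`, and `markov_independencies` are illustrative assumptions; the slide's actual diagram is not recoverable from the transcript.

```python
# Hypothetical DAG given as child -> list of parents (illustrative only).
parents = {
    "Ancestor": [],
    "Parent": ["Ancestor"],
    "Y1": ["Ancestor"],
    "X": ["Parent"],
    "Y2": ["X"],
}

def descendants(node, parents):
    """Return every node reachable from `node` by following child edges."""
    children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}
    found, stack = set(), [node]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def markov_independencies(node, parents):
    """Non-descendants (other than the parents) that `node` is independent of,
    given the values of its parents."""
    excluded = descendants(node, parents) | set(parents[node]) | {node}
    return set(parents) - excluded

print(markov_independencies("X", parents))
```

In this toy graph, X is independent of Ancestor and Y1 once Parent is known, while Y2 (a descendant) is not covered by the statement.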
Local Probabilities
Associated with each variable Xi is a conditional probability distribution P(Xi | Pai)
Discrete variables: Multinomial distribution
  X     P(Y | X)
  x     0.9   0.1
  x     0.3   0.7
  (one row per value of X)
Continuous variables:
Choice: for example linear Gaussian
(Plot: P(Y | X) over X and Y)
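The two kinds of local model mentioned on this slide can be sketched in a few lines of Python. The table entries are the numbers shown on the slide; the value names (`x0`, `x1`, `y0`, `y1`) and the linear Gaussian parameters `a`, `b`, `sigma` are assumptions made up for the example.

```python
import random

# Multinomial (table) CPD: one row of probabilities over Y per value of X.
# Probabilities are from the slide; the value names are assumptions.
cpt = {
    "x0": {"y0": 0.9, "y1": 0.1},
    "x1": {"y0": 0.3, "y1": 0.7},
}

def sample_discrete(x, rng):
    """Draw Y from the row of the table selected by X = x."""
    values = list(cpt[x])
    weights = [cpt[x][v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

def sample_linear_gaussian(x, rng, a=0.0, b=1.0, sigma=0.5):
    """Draw Y ~ Normal(a + b * x, sigma^2); a, b, sigma are made-up parameters."""
    return rng.gauss(a + b * x, sigma)

rng = random.Random(0)
print(sample_discrete("x0", rng), sample_linear_gaussian(2.0, rng))
```

The table form needs one row per parent configuration, whereas the linear Gaussian form needs only an intercept, one coefficient per parent, and a variance.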
Bayesian Network Semantics
(Diagram nodes: B, E, R, A, C)
Qualitative part: DAG specifies conditional independence statements
+
Quantitative part: local probability models
=
Unique joint distribution over domain
Compact & efficient representation:
k parents: O(2^k n) vs. O(2^n) params
parameters pertain to local interactions
P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)
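To make the parameter saving concrete, here is a minimal Python sketch that counts free parameters for the sparse factorization above and checks that the product of local terms is a valid joint distribution. It assumes all five variables are binary, and the probabilities in `p_true` are made-up values, not taken from the talk.

```python
from itertools import product

# Parent sets read off the sparse factorization on the slide:
# P(B) P(E) P(R|E) P(A|B,E) P(C|A). All variables are assumed binary.
parents = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

def num_free_parameters(parents):
    """One free parameter per parent configuration for a binary variable."""
    return sum(2 ** len(ps) for ps in parents.values())

def full_joint_parameters(n):
    """A full table over n binary variables has 2^n - 1 free parameters."""
    return 2 ** n - 1

# Made-up CPDs: P(var = 1 | parent values), keyed by the tuple of parent values.
p_true = {
    "B": {(): 0.01},
    "E": {(): 0.02},
    "R": {(0,): 0.001, (1,): 0.9},
    "A": {(0, 0): 0.01, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95},
    "C": {(0,): 0.05, (1,): 0.8},
}

def joint(assignment):
    """Probability of a full assignment as the product of the local terms."""
    prob = 1.0
    for var, ps in parents.items():
        p1 = p_true[var][tuple(assignment[p] for p in ps)]
        prob *= p1 if assignment[var] == 1 else 1.0 - p1
    return prob

names = list(parents)
total = sum(joint(dict(zip(names, vals)))
            for vals in product([0, 1], repeat=len(names)))
print(num_free_parameters(parents), full_joint_parameters(len(names)), total)
```

Under the binary assumption the sparse network needs 10 free parameters against 31 for the unrestricted joint, and the local terms still sum to one over all 32 assignments.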
Why Bayesian Networks?
Bayesian Networks:
Flexible representation of dependency structure of multivariate distributions
Natural for modeling processes with local interactions
Learning of Bayesian Networks
Can learn dependencies from observations
Handles stochastic processes:
“true” stochastic behavior
noise in measurements
Modeling Regulatory Interactions
Variables of interest:
Expression levels of genes
Concentration levels of proteins (proteomics!)
Exogenous variables: nutrient levels, metabolite levels, temperature,
Phenotype information
…
Bayesian Network Structure:
Capture dependencies among these variables
Examples
Measured expression level of each gene -> Random variables
Gene interaction -> Probabilistic dependencies
Interactions are represented by a graph:
Each gene is represented by a node in the graph
Edges between the nodes represent direct dependency
(Diagram labels: X, A, B)
More Complex Examples
Dependencies can be mediated through other nodes
Intermediate gene (diagram nodes: A, B, C)
Common cause (diagram nodes: A, B, C)
Common effects can imply conditional dependence (diagram nodes: A, B, C)
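The last point, that a common effect can induce conditional dependence, is easy to check numerically. The Python sketch below assumes A and B are independent causes of a common effect C; every probability in it is a made-up value chosen only for illustration.

```python
from itertools import product

# Hypothetical numbers: A and B are independent causes, C is their common effect.
p_a, p_b = 0.1, 0.1

def p_c_given(a, b):
    """CPD for the common effect C: likely if either cause is on (made-up values)."""
    return 0.95 if (a or b) else 0.01

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b) * P(c | a, b)."""
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pc = p_c_given(a, b) if c else 1 - p_c_given(a, b)
    return pa * pb * pc

def prob_a(**evidence):
    """P(A = 1 | evidence) by brute-force enumeration of the joint."""
    num = den = 0.0
    for a, b, c in product([0, 1], repeat=3):
        world = {"a": a, "b": b, "c": c}
        if all(world[k] == v for k, v in evidence.items()):
            den += joint(a, b, c)
            num += joint(a, b, c) * a
    return num / den

print(prob_a(), prob_a(b=1))            # equal: A and B are marginally independent
print(prob_a(c=1), prob_a(c=1, b=1))    # differ: dependent once C is observed
```

Observing B changes nothing about A on its own, but once C is observed, learning that B is on lowers the probability of A, because B already accounts for C.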
Outline of Our Approach
Expression data -> Bayesian Network Learning Algorithm -> learned network (diagram nodes: B, E, R, A, C)
Use learned network to make predictions about structure of the interactions between genes
Experiment
Data from Spellman et al. (Mol. Bio. of the Cell 1998)
Contains 76 samples covering the whole yeast genome:
Different methods for synchronizing the cell cycle in yeast
Time series at intervals of a few minutes (5-20 min)
Spellman et al. identified 800 cell-cycle regulated genes.
Methods
Treat samples as IID (ignoring temporal order)
Experiment 1:
Discretized into three levels of expression, by Log(ratio to control):
"-" below -0.5, "0" between -0.5 and 0.5, "+" above 0.5
Learn multinomial probabilities
Experiment 2:
Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
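The three-level discretization used in Experiment 1 can be sketched as follows. The cutoffs of -0.5 and 0.5 come from the slide; the base of the logarithm (2 here), the handling of values exactly on a cutoff, and the function names are assumptions.

```python
import math

def discretize(log_ratio, threshold=0.5):
    """Map a log expression ratio to '-', '0', or '+' using symmetric cutoffs.

    Values exactly on a cutoff are assigned to '0' (an assumption).
    """
    if log_ratio < -threshold:
        return "-"
    if log_ratio > threshold:
        return "+"
    return "0"

def discretize_sample(expression, control):
    """Discretize one measurement relative to its control level.

    The log base (2 here) is an assumption; the slide only says Log(ratio to control).
    """
    return discretize(math.log2(expression / control))

levels = [discretize(v) for v in (-1.2, -0.5, 0.0, 0.49, 0.8)]
print(levels)
```

Each gene then becomes a three-valued variable, which is what the multinomial local models are learned over.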
Network Learned
Challenge: Statistical Significance
Sparse Data
Small number of samples
“Flat posterior” -- many networks fit the data
Solution: estimate confidence in network features
Two types of features
Markov neighbors: X directly interacts with Y
Order relations: X is an ancestor of Y
Confidence Estimates
Bootstrap approach [FGW, UAI99]
Resample D to obtain D1, D2, ..., Dm
Learn a network Gi from each Di (diagram nodes: B, E, R, A, C)
Estimate:
C(f) = \frac{1}{m} \sum_{i=1}^{m} 1\{f \in G_i\}
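The bootstrap estimate can be sketched in Python as below. The learner here is a deliberately simple stand-in (it reports gene pairs with high absolute correlation) and is not the Bayesian network structure learner used in the talk; the toy data, the `cutoff`, and all function names are assumptions for illustration.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def learn_features(data, cutoff=0.6):
    """Stand-in learner (an assumption): the set of gene pairs whose absolute
    correlation exceeds `cutoff`. A real run would learn a network instead."""
    genes = sorted(data[0])
    feats = set()
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson([s[g] for s in data], [s[h] for s in data])
            if abs(r) > cutoff:
                feats.add((g, h))
    return feats

def bootstrap_confidence(data, m=200, seed=0):
    """C(f): fraction of the m resampled datasets whose learned model contains f."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        resampled = [rng.choice(data) for _ in data]  # sample with replacement
        for f in learn_features(resampled):
            counts[f] = counts.get(f, 0) + 1
    return {f: c / m for f, c in counts.items()}

# Toy data: gene B tracks gene A, gene C is unrelated noise.
rng = random.Random(1)
data = []
for _ in range(40):
    a = rng.gauss(0, 1)
    data.append({"A": a, "B": a + rng.gauss(0, 0.3), "C": rng.gauss(0, 1)})
conf = bootstrap_confidence(data)
print(conf)
```

Features that recur in almost every resampled run receive confidence near 1, while features that appear only by chance in a few runs receive low confidence.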
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
Histograms of number of Markov features at each confidence level
(Histogram panels: Original Data, Randomized Data)
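The randomization described on this slide, reshuffling each gene's values across samples, can be sketched as follows. The data layout (a list of samples, each a dict from gene name to value) and the function name are assumptions made for the example.

```python
import random

def shuffle_each_gene(data, seed=0):
    """Return a copy of `data` in which every gene's values are permuted across
    samples independently of the other genes.

    Each gene keeps its own set of values, but any association between genes
    is broken.
    """
    rng = random.Random(seed)
    genes = list(data[0])
    columns = {}
    for g in genes:
        column = [sample[g] for sample in data]
        rng.shuffle(column)
        columns[g] = column
    return [{g: columns[g][i] for g in genes} for i in range(len(data))]

data = [{"A": i, "B": 10 * i} for i in range(6)]
randomized = shuffle_each_gene(data)
print(randomized)
```

Running the same learning and confidence procedure on such shuffled data gives a baseline for how many high-confidence features arise by chance.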
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
Markov w/ Gaussian Models
(Plot, two panels: Features with Confidence above t, plotted against t from 0 to 1, for Random and Real data)
Testing for Significance
Markov w/ Multinomial Models
(Plot, two panels: Features with Confidence above t, plotted against t from 0 to 1, for Random and Real data)
Local Map
Finding Key Genes
Key gene: a gene that precedes many other genes
YLR183C
MCD1 Mitotic Chromosome Determinant
RAD27 DNA repair protein
CLN2 role in cell cycle START
SRO4 involved in cellular polarization during budding
YOX1 Homeodomain protein that binds leu-tRNA gene
POL30 required for DNA replication and repair
YLR467W
CDC5
MSH6 Homolog of the human GTBP protein
YML119W
CLN1 role in cell cycle START
Future Work
Finding suitable local distribution models
Correct handling of hidden variables
Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
Combine with cluster analysis of higher confidence conclusions
Slide 11
Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem
.
Central Dogma
Transcription
Translation
mRNA
Gene
Cells express different subset of the genes
In different tissues and under different conditions
Protein
Microarrays (aka “DNA chips”)
New
technological breakthrough:
Measure RNA expression levels of thousands
of genes in one experiment
Measure expression on
a genomic scale
Opens up new
experimental designs
Many major labs are using,
or will use this technology
in the near future
The Problem
Experiments
j
Genes
i
Goal:
Aij - the mRNA level of gene j in experiment i
Learn regulatory/metabolic networks
Identify causal sources of the biological
phenomena of interest
Our Approach
Characterize
statistical relationships between
expression patterns of different genes
Beyond pair-wise interactions
Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process:
Example: Pedigree Homer
A node represents
an individual’s
genotype
Bart
Modeling
Marge
Lisa
Maggie
assumptions:
Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations
Ancestor
Network Structure
Parent
Generalizing to DAGs:
A child is conditionally
independent from its
non-descendents, given the
value of its parents
Y1
Y2
X
Often a natural assumption
for causal processes
if we believe that we capture
the relevant state of each
intermediate stage.
Non-descendent
Descendent
Local Probabilities
Associated
with each variable Xi is a conditional
probability distribution P(Xi|Pai:)
X P(Y |X)
Discrete variables:
X
x 0.9 0.1
Multinomial distribution
x
Y
variables:
Choice: for example linear gaussian
P(Y | X)
Continuous
X
Y
0.3 0.7
Bayesian Network Semantics
B
E
R
A
C
Qualitative part
DAG specifies
conditional
independence
statements
Quantitative part
+
local
probability
models
=
Unique joint
distribution
over domain
Compact
& efficient representation:
k parents O(2kn) vs. O(2n) params
parameters pertain to local interactions
P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)
Why Bayesian Networks?
Bayesian Networks:
Flexible representation of dependency structure
of multivariate distributions
Natural for modeling processes with local
interactions
Learning of Bayesian Networks
Can learn dependencies from observations
Handles stochastic processes:
“true” stochastic behavior
noise in measurements
Modeling Regulatory Interactions
Variables of interest:
Expression levels of genes
Concentration levels of proteins (proteomics!)
Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
Phenotype information
…
Bayesian Network Structure:
Capture dependencies among these variables
Examples
Measured expression level of each gene: Random variables
Gene interaction: Probabilistic dependencies
Interactions are represented by a graph:
Each gene is represented by a node in the graph
Edges between the nodes represent direct dependency
[Figure: example graphs over nodes X, A, B]
More Complex Examples
Dependencies can be mediated through other nodes
Common cause
Intermediate gene
Common effects can imply conditional dependence
[Figure: three-node graphs over A, B, C illustrating each case]
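The last point, that a common effect can make its causes conditionally dependent, can be verified by exact enumeration. The sketch below rests on assumptions chosen only for illustration: A and C are independent fair binary variables and B is their common effect, defined as B = A OR C. The slide does not specify which node plays which role.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)

# Joint distribution over (a, c, b) with A, C independent and B = A OR C.
joint = {}
for a, c in product((0, 1), repeat=2):
    b = 1 if (a or c) else 0
    joint[(a, c, b)] = HALF * HALF


def prob(predicate):
    """Total probability of the configurations satisfying predicate(a, c, b)."""
    return sum(p for (a, c, b), p in joint.items() if predicate(a, c, b))


# Marginally, A and C are independent.
p_a = prob(lambda a, c, b: a == 1)
p_c = prob(lambda a, c, b: c == 1)
p_ac = prob(lambda a, c, b: a == 1 and c == 1)

# Conditioned on the common effect B = 1, they are not.
p_b = prob(lambda a, c, b: b == 1)
p_a_given_b = prob(lambda a, c, b: a == 1 and b == 1) / p_b
p_c_given_b = prob(lambda a, c, b: c == 1 and b == 1) / p_b
p_ac_given_b = prob(lambda a, c, b: a == 1 and c == 1 and b == 1) / p_b
```

Here P(A=1, C=1) equals P(A=1) * P(C=1), but P(A=1, C=1 | B=1) = 1/3 differs from P(A=1 | B=1) * P(C=1 | B=1) = 4/9.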
Outline of Our Approach
Expression data
Bayesian Network Learning Algorithm
[Figure: learned network over nodes B, E, R, A, C]
Use learned network to make predictions about structure of the interactions between genes
Experiment
Data from Spellman et al. (Mol. Bio. of the Cell 1998)
Contains 76 samples covering the whole yeast genome:
Different methods for synchronizing cell-cycle in yeast
Time series at intervals of a few minutes (5-20 min)
Spellman et al. identified 800 cell-cycle regulated genes.
Methods
Treat samples as IID (ignoring temporal order)
Experiment 1:
Discretized into three levels of expression, by Log(ratio to control): "-" below -0.5, "0" between -0.5 and 0.5, "+" above 0.5
Learn multinomial probabilities
Experiment 2:
Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
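The three-level discretization used in Experiment 1 can be sketched as below. The thresholds of -0.5 and 0.5 on the log ratio come from the slide. The function name and the choice to put values exactly on a threshold into the middle level are assumptions of this sketch.

```python
def discretize(log_ratio, threshold=0.5):
    """Map a log(ratio to control) value to one of three expression levels.

    Values exactly at +/- threshold are assigned to "0"; the slide does not
    say how boundary values were treated, so this is an assumption.
    """
    if log_ratio < -threshold:
        return "-"   # under-expressed relative to control
    if log_ratio > threshold:
        return "+"   # over-expressed relative to control
    return "0"       # roughly unchanged


levels = [discretize(v) for v in (-1.2, -0.5, 0.0, 0.49, 0.8)]
```

The discretized levels can then serve as the values of the multinomial variables in the network.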
Network Learned
Challenge: Statistical Significance
Sparse Data
Small number of samples
“Flat posterior” -- many networks fit the data
Solution
estimate confidence in network features
Two types of features
Markov neighbors: X directly interacts with Y
Order relations: X is an ancestor of Y
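Confidence Estimates
Bootstrap approach [FGW, UAI99]
Resample the data D to obtain datasets D1, D2, ..., Dm
Learn a network Gi from each Di
Estimate: C(f) = \frac{1}{m} \sum_{i=1}^{m} 1\{f \in G_i\}
[Figure: D is resampled into D1, D2, ..., Dm, and a network over B, E, R, A, C is learned from each]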
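The bootstrap estimate on this slide can be sketched as follows: resample the samples with replacement m times, learn from each resample, and report for each feature the fraction of runs in which it appears. The learner here, toy_learn, is a deliberately simple stand-in that links pairs of variables whose absolute correlation passes a threshold. It is not the Bayesian network learning procedure used in the talk, and the toy data, threshold and m are all hypothetical.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def toy_learn(samples, threshold=0.6):
    """Stand-in learner: the set of variable pairs with |correlation| >= threshold."""
    n_vars = len(samples[0])
    cols = [[row[j] for row in samples] for j in range(n_vars)]
    return {(i, j) for i, j in combinations(range(n_vars), 2)
            if abs(pearson(cols[i], cols[j])) >= threshold}


def bootstrap_confidence(samples, learn, m=200, seed=0):
    """C(f) = (1/m) * number of bootstrap runs whose learned features contain f."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        resampled = [rng.choice(samples) for _ in samples]  # with replacement
        for feature in learn(resampled):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: count / m for feature, count in counts.items()}


# Toy data: variable 1 depends on variable 0, variable 2 is independent noise.
data_rng = random.Random(1)
data = []
for _ in range(76):
    x0 = data_rng.gauss(0, 1)
    data.append((x0, x0 + data_rng.gauss(0, 0.3), data_rng.gauss(0, 1)))

confidence = bootstrap_confidence(data, toy_learn)
```

Features that never appear in any run are simply absent from the result, which corresponds to a confidence of zero.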
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
Histograms of number of Markov features at each confidence level
[Figure: two histograms, Original Data and Randomized Data]
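The randomization described here can be sketched as a per-gene shuffle. The code assumes the data is a list of rows (experiments) with one column per gene, and that each gene is shuffled independently of the others. The function name and data layout are hypothetical.

```python
import random


def shuffle_each_gene(matrix, seed=0):
    """Return a copy of matrix with every column (gene) shuffled independently.

    Each gene keeps its own set of values, but the pairing of values across
    genes within an experiment is destroyed.
    """
    rng = random.Random(seed)
    n_rows = len(matrix)
    columns = [list(col) for col in zip(*matrix)]
    for col in columns:
        rng.shuffle(col)
    return [[col[i] for col in columns] for i in range(n_rows)]


original = [
    [0.1, 1.0, -0.3],
    [0.4, 2.0, -0.6],
    [0.7, 3.0, -0.9],
    [1.0, 4.0, -1.2],
]
randomized = shuffle_each_gene(original)
```

Any feature that reaches high confidence on data shuffled this way reflects chance rather than a relationship between genes, which is what makes the comparison with the original data informative.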
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
Markov w/ Gaussian Models
[Figure: two panels of Features with Confidence above t, plotted against t from 0 to 1, Random vs. Real]
Testing for Significance
Markov w/ Multinomial Models
[Figure: two panels of Features with Confidence above t, plotted against t from 0 to 1, Random vs. Real]
Local Map
Finding Key Genes
Key gene: a gene that precedes many other genes
YLR183C
MCD1 Mitotic Chromosome Determinant;
RAD27 DNA repair protein
CLN2 role in cell cycle START
SRO4 involved in cellular polarization during budding
YOX1 Homeodomain protein that binds leu-tRNA gene
POL30 required for DNA replication and repair
YLR467W
CDC5
MSH6 Homolog of the human GTBP protein
YML119W
CLN1 role in cell cycle START
Future Work
Finding suitable local distribution models
Correct handling of hidden variables
Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
Combine with cluster analysis of higher confidence conclusions
Slide 16
Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem
.
Central Dogma
Transcription
Translation
mRNA
Gene
Cells express different subset of the genes
In different tissues and under different conditions
Protein
Microarrays (aka “DNA chips”)
New
technological breakthrough:
Measure RNA expression levels of thousands
of genes in one experiment
Measure expression on
a genomic scale
Opens up new
experimental designs
Many major labs are using,
or will use this technology
in the near future
The Problem
Experiments
j
Genes
i
Goal:
Aij - the mRNA level of gene j in experiment i
Learn regulatory/metabolic networks
Identify causal sources of the biological
phenomena of interest
Our Approach
Characterize
statistical relationships between
expression patterns of different genes
Beyond pair-wise interactions
Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process:
Example: Pedigree Homer
A node represents
an individual’s
genotype
Bart
Modeling
Marge
Lisa
Maggie
assumptions:
Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations
Ancestor
Network Structure
Parent
Generalizing to DAGs:
A child is conditionally
independent from its
non-descendents, given the
value of its parents
Y1
Y2
X
Often a natural assumption
for causal processes
if we believe that we capture
the relevant state of each
intermediate stage.
Non-descendent
Descendent
Local Probabilities
Associated
with each variable Xi is a conditional
probability distribution P(Xi|Pai:)
X P(Y |X)
Discrete variables:
X
x 0.9 0.1
Multinomial distribution
x
Y
variables:
Choice: for example linear gaussian
P(Y | X)
Continuous
X
Y
0.3 0.7
Bayesian Network Semantics
B
E
R
A
C
Qualitative part
DAG specifies
conditional
independence
statements
Quantitative part
+
local
probability
models
=
Unique joint
distribution
over domain
Compact
& efficient representation:
k parents O(2kn) vs. O(2n) params
parameters pertain to local interactions
P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)
Why Bayesian Networks?
Bayesian Networks:
Flexible representation of dependency structure
of multivariate distributions
Natural for modeling processes with local
interactions
Learning of Bayesian Networks
Can learn dependencies from observations
Handles stochastic processes:
“true” stochastic behavior
noise in measurements
Modeling Regulatory Interactions
Variables of interest:
Expression levels of genes
Concentration levels of proteins (proteomics!)
Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
Phenotype information
…
Bayesian Network Structure:
Capture dependencies among these variables
Examples
Measured expression
level of each gene
Gene interaction
Random variables
Probabilistic
dependencies
Interactions are represented by a graph:
Each gene is represented by a node in the graph
Edges between the nodes represent direct
dependency
X
A
B
A
B
More Complex Examples
Dependencies
can be mediated through other
nodes
B
A
A
C
C
Common cause
Common
B
Intermediate gene
effects can imply conditional dependence
A
B
C
Outline of Our Approach
Bayesian Network
Learning Algorithm
Expression data
B
E
R
A
C
Use learned network to make predictions about
structure of the interactions between genes
Experiment
Data from Spellman et al.
(Mol.Bio. of the Cell 1998)
Contains 76 samples of all the
yeast genome:
Different methods for
synchronizing cell-cycle in
yeast
Time series at few minutes
(5-20min) intervals
Spellman et al. identified 800
cell-cycle regulated genes.
Methods
Treat
samples as IID (ignoring temporal order)
Experiment 1:
Discretized into three levels of expression
0
-
-0.5
+
0.5
Log(ratio to control)
Learn
multinomial probabilities
Experiment 2:
Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
Network Learned
Challenge: Statistical Significance
Sparse Data
Small number of samples
“Flat posterior” -- many networks fit the data
Solution
estimate confidence in network features
Two types of features
Markov neighbors: X directly interacts with Y
Order relations: X is an ancestor of Y
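Confidence Estimates
Bootstrap approach [FGW, UAI99]
Resample datasets D1, D2, ..., Dm from the data D
Learn a network Gi from each Di
[Figure: networks over nodes B, E, R, A, C learned from each resampled dataset]
Estimate:
C(f) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{f \in G_i\}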
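The bootstrap estimate can be sketched in a few lines. This is a sketch under stated assumptions, not the authors' implementation: `learn_features` is a hypothetical stand-in that declares a feature between two columns when their absolute Pearson correlation is at least 0.8, where the talk uses a Bayesian network learning algorithm, and the data are synthetic.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def learn_features(rows, threshold=0.8):
    """Hypothetical stand-in learner: strongly correlated column pairs."""
    cols = list(zip(*rows))
    return {
        (i, j)
        for i, j in combinations(range(len(cols)), 2)
        if abs(pearson(cols[i], cols[j])) >= threshold
    }


def bootstrap_confidence(rows, m=200, seed=0):
    """C(f) = fraction of the m resampled datasets whose learned model has f."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        # Resample the experiments (rows) with replacement.
        resample = [rng.choice(rows) for _ in rows]
        for f in learn_features(resample):
            counts[f] = counts.get(f, 0) + 1
    return {f: c / m for f, c in counts.items()}


# Synthetic data: column 1 is exactly twice column 0; column 2 is unrelated.
data = [(i, 2 * i, (i * 7) % 5) for i in range(20)]
conf = bootstrap_confidence(data)
print(conf)
```

Swapping `learn_features` for a real structure learner that returns Markov or order features leaves `bootstrap_confidence` unchanged.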
Testing for Significance
We run our procedure on randomized data in which we reshuffled the order of values for each gene
Histograms of the number of Markov features at each confidence level
[Figure: histograms for Original Data and Randomized Data]
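The randomization step can be sketched as below. The function names are hypothetical, and the layout (one row per experiment, one column per gene) is an assumption. Shuffling each column independently keeps every gene's own distribution of values but breaks any association between genes.

```python
import random


def shuffle_each_gene(rows, seed=0):
    """Independently permute the values of every gene (column)."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(r) for r in zip(*columns)]


def count_above(confidences, t):
    """Number of features whose confidence exceeds the threshold t."""
    return sum(1 for c in confidences.values() if c > t)


# Hypothetical 4 experiments x 3 genes.
data = [[1, 10, 100], [2, 20, 200], [3, 30, 300], [4, 40, 400]]
randomized = shuffle_each_gene(data)
print(randomized)
```

Running the same confidence procedure on `randomized` and on `data`, then comparing `count_above` at each threshold, gives the kind of Random versus Real comparison described here.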
Testing for Significance
Markov w/ Gaussian Models
[Figure: number of features with confidence above t, plotted against t, for Random and Real data]
Testing for Significance
Markov w/ Multinomial Models
[Figure: number of features with confidence above t, plotted against t, for Random and Real data]
Local Map
Finding Key Genes
Key gene: a gene that precedes many other genes
YLR183C
MCD1 Mitotic Chromosome Determinant
RAD27 DNA repair protein
CLN2 role in cell cycle START
SRO4 involved in cellular polarization during budding
YOX1 Homeodomain protein that binds leu-tRNA gene
POL30 required for DNA replication and repair
YLR467W
CDC5
MSH6 Homolog of the human GTBP protein
YML119W
CLN1 role in cell cycle START
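One simple way to rank candidate key genes is to count, for each gene, how many other genes it precedes with high confidence. The sketch below is an assumption-laden illustration rather than the scoring used in the talk: the function name, the 0.8 threshold, the gene labels g1 to g4, and the confidence values are all made up.

```python
def key_gene_scores(order_confidence, t=0.8):
    """Rank genes by how many confident order relations they head.

    order_confidence maps (x, y) to the confidence that x is an ancestor of y.
    """
    scores = {}
    for (x, _y), c in order_confidence.items():
        if c >= t:
            scores[x] = scores.get(x, 0) + 1
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up confidences for placeholder genes g1..g4.
example = {
    ("g1", "g2"): 0.95,
    ("g1", "g3"): 0.90,
    ("g1", "g4"): 0.40,
    ("g2", "g3"): 0.85,
    ("g3", "g4"): 0.30,
}
ranking = key_gene_scores(example)
print(ranking)
```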
Future Work
Finding suitable local distribution models
Correct handling of hidden variables
Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
Combine with cluster analysis of higher confidence conclusions
Slide 2
Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem
.
Central Dogma
Transcription
Translation
mRNA
Gene
Cells express different subset of the genes
In different tissues and under different conditions
Protein
Microarrays (aka “DNA chips”)
New
technological breakthrough:
Measure RNA expression levels of thousands
of genes in one experiment
Measure expression on
a genomic scale
Opens up new
experimental designs
Many major labs are using,
or will use this technology
in the near future
The Problem
Experiments
j
Genes
i
Goal:
Aij - the mRNA level of gene j in experiment i
Learn regulatory/metabolic networks
Identify causal sources of the biological
phenomena of interest
Our Approach
Characterize
statistical relationships between
expression patterns of different genes
Beyond pair-wise interactions
Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process:
Example: Pedigree
A node represents an individual’s genotype
[Figure: pedigree network over Homer, Marge, Bart, Lisa, and Maggie]
Modeling assumptions:
Ancestors can affect descendants' genotype only by passing genetic material through intermediate generations
Network Structure
Generalizing to DAGs:
A child is conditionally independent of its non-descendants, given the value of its parents
Often a natural assumption for causal processes if we believe that we capture the relevant state of each intermediate stage.
[Figure: DAG around node X with nodes Y1 and Y2; the roles Ancestor, Parent, Non-descendant, and Descendant are labeled]
Local Probabilities
Associated with each variable Xi is a conditional probability distribution P(Xi | Pai)
Discrete variables:
Multinomial distribution
X    P(Y | X)   (one row per value of X)
x    0.9  0.1
x    0.3  0.7
Continuous variables:
Choice: for example linear Gaussian
[Figure: P(Y | X) plotted over X and Y]
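The two kinds of local model can be made concrete with a small sampling sketch. It reuses the table values 0.9/0.1 and 0.3/0.7 from this slide; the row labels `x0` and `x1`, the function names, and the linear-Gaussian coefficients `a`, `b`, `sigma` are illustrative assumptions, not values from the talk.

```python
import random

# Multinomial CPT: each row is P(Y | X = value). The numbers come from the
# slide; the labels "x0" and "x1" for the two values of X are assumed.
CPT_Y_GIVEN_X = {"x0": [0.9, 0.1], "x1": [0.3, 0.7]}

def sample_discrete(x, rng):
    """Draw Y in {0, 1} from the CPT row selected by X = x."""
    p_y0 = CPT_Y_GIVEN_X[x][0]
    return 0 if rng.random() < p_y0 else 1

def sample_linear_gaussian(x, rng, a=0.0, b=1.0, sigma=0.5):
    """Draw Y ~ Normal(a + b * x, sigma^2); a, b, sigma are made-up values."""
    return rng.gauss(a + b * x, sigma)

rng = random.Random(0)
draws = [sample_discrete("x0", rng) for _ in range(10000)]
frac_y0 = draws.count(0) / len(draws)      # should be close to 0.9
y_continuous = sample_linear_gaussian(2.0, rng)
```

In the discrete case the model has one free parameter per row of the table; in the linear Gaussian case it has an intercept, one coefficient per parent, and a noise variance.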
Bayesian Network Semantics
[Figure: example network over the nodes B, E, R, A, C]
Qualitative part: DAG specifies conditional independence statements
+
Quantitative part: local probability models
=
Unique joint distribution over domain
Compact & efficient representation:
k parents: O(2^k n) vs. O(2^n) params
parameters pertain to local interactions
P(C,A,R,E,B) = P(B) * P(E|B) * P(R|E,B) * P(A|R,B,E) * P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B) * P(E) * P(R|E) * P(A|B,E) * P(C|A)
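The saving in parameters can be checked directly for the five-node example. The sketch below assumes all variables are binary and reads the parent sets off the second factorization; the CPT numbers are invented purely so that the joint can be evaluated.

```python
from itertools import product

# Parent sets implied by P(B) P(E) P(R|E) P(A|B,E) P(C|A).
PARENTS = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

def n_params_factored(parents):
    # A binary node with k parents needs one free parameter per parent
    # configuration, i.e. 2**k.
    return sum(2 ** len(pa) for pa in parents.values())

def n_params_full(n):
    # An unconstrained joint over n binary variables has 2**n - 1 free parameters.
    return 2 ** n - 1

# Hypothetical CPTs: P(node = 1 | parent values), keyed by the parent tuple.
CPT = {
    "B": {(): 0.01},
    "E": {(): 0.02},
    "R": {(0,): 0.001, (1,): 0.9},
    "A": {(0, 0): 0.01, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95},
    "C": {(0,): 0.05, (1,): 0.8},
}

def joint(assignment):
    """Joint probability as the product of the local conditional probabilities."""
    p = 1.0
    for node, pa in PARENTS.items():
        p1 = CPT[node][tuple(assignment[q] for q in pa)]
        p *= p1 if assignment[node] == 1 else 1.0 - p1
    return p

total = sum(joint(dict(zip(PARENTS, vals))) for vals in product([0, 1], repeat=5))
```

Under these assumptions the factored form needs 10 parameters against 31 for the full joint, and the product of local models still sums to 1 over all 32 assignments.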
Why Bayesian Networks?
Bayesian Networks:
Flexible representation of dependency structure of multivariate distributions
Natural for modeling processes with local interactions
Learning of Bayesian Networks
Can learn dependencies from observations
Handles stochastic processes:
“true” stochastic behavior
noise in measurements
Modeling Regulatory Interactions
Variables of interest:
Expression levels of genes
Concentration levels of proteins (proteomics!)
Exogenous variables: nutrient levels, metabolite levels, temperature, phenotype information, …
Bayesian Network Structure:
Capture dependencies among these variables
Examples
Measured expression level of each gene: random variables
Gene interaction: probabilistic dependencies
Interactions are represented by a graph:
Each gene is represented by a node in the graph
Edges between the nodes represent direct dependency
[Figure: example graphs over the nodes A, B, and X]
More Complex Examples
Dependencies can be mediated through other nodes
[Figure: two three-node graphs over A, B, and C, labeled "Common cause" and "Intermediate gene"]
Common effects can imply conditional dependence
[Figure: three-node graph over A, B, and C]
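The last point can be checked numerically. The sketch below assumes the common-effect structure A → C ← B with binary variables and invented probabilities; A and B are independent on their own but become dependent once C is observed.

```python
from itertools import product

# Hypothetical numbers for a common-effect network A -> C <- B.
P_A1 = 0.3
P_B1 = 0.4
P_C1_GIVEN_AB = {(0, 0): 0.05, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b) * P(c | a, b)."""
    pa = P_A1 if a else 1 - P_A1
    pb = P_B1 if b else 1 - P_B1
    pc1 = P_C1_GIVEN_AB[(a, b)]
    pc = pc1 if c else 1 - pc1
    return pa * pb * pc

def prob_a1(**evidence):
    """P(A = 1 | evidence), where evidence may fix b and/or c."""
    num = den = 0.0
    for a, b, c in product([0, 1], repeat=3):
        values = {"a": a, "b": b, "c": c}
        if any(values[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, c)
        den += p
        if a == 1:
            num += p
    return num / den

p_marginal = prob_a1()               # P(A=1)
p_given_b = prob_a1(b=1)             # equals P(A=1): B alone tells nothing about A
p_given_c = prob_a1(c=1)             # observing the effect raises belief in A
p_given_c_and_b = prob_a1(c=1, b=1)  # learning B "explains away" part of that
```

With these numbers, learning B changes the probability of A only when C is also known, which is the conditional dependence the slide refers to.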
Outline of Our Approach
Expression data → Bayesian Network Learning Algorithm → learned network
[Figure: example learned network over the nodes B, E, R, A, C]
Use learned network to make predictions about structure of the interactions between genes
Experiment
Data from Spellman et al. (Mol. Bio. of the Cell, 1998)
Contains 76 samples covering the whole yeast genome:
Different methods for synchronizing the cell cycle in yeast
Time series at intervals of a few minutes (5-20 min)
Spellman et al. identified 800 cell-cycle regulated genes.
Methods
Treat samples as IID (ignoring temporal order)
Experiment 1:
Discretized into three levels of expression (-, 0, +), with thresholds at -0.5 and 0.5 on log(ratio to control)
Learn multinomial probabilities
Experiment 2:
Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
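The three-level discretization used in Experiment 1 can be sketched as follows. The thresholds of -0.5 and 0.5 are from the slide; the base of the logarithm is not stated there, so base 2 is an assumption, and the function name is hypothetical.

```python
import math

def discretize(ratio, threshold=0.5):
    """Map an expression ratio (sample / control) to -1, 0, or +1.

    Uses log base 2 (an assumption; the slide only says "Log(ratio to control)").
    """
    log_ratio = math.log2(ratio)
    if log_ratio < -threshold:
        return -1   # under-expressed relative to control
    if log_ratio > threshold:
        return 1    # over-expressed relative to control
    return 0        # roughly unchanged

levels = [discretize(r) for r in (0.5, 0.9, 1.0, 1.2, 2.0)]
```

Each gene then becomes a three-valued variable, which is what the multinomial local models in Experiment 1 are fitted to.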
Network Learned
Challenge: Statistical Significance
Sparse Data
Small number of samples
“Flat posterior” -- many networks fit the data
Solution: estimate confidence in network features
Two types of features
Markov neighbors: X directly interacts with Y
Order relations: X is an ancestor of Y
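Both feature types can be read off a candidate DAG. The sketch below uses the five-node example network (B, E, R, A, C) and assumes that "Markov neighbors" means the usual Markov blanket relation: two nodes joined by an edge, or two parents of a common child. The function names are hypothetical.

```python
# Parent sets of the example network: P(B) P(E) P(R|E) P(A|B,E) P(C|A).
PARENTS = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

def ancestors(node, parents):
    """All nodes from which `node` can be reached by following edges."""
    seen = set()
    stack = list(parents[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen

def markov_pairs(parents):
    """Unordered pairs that are parent/child or co-parents of a common child."""
    pairs = set()
    for child, pa in parents.items():
        for p in pa:
            pairs.add(frozenset((p, child)))
        for i, p in enumerate(pa):
            for q in pa[i + 1:]:
                pairs.add(frozenset((p, q)))
    return pairs

def order_pairs(parents):
    """Ordered pairs (X, Y) such that X is an ancestor of Y."""
    return {(a, d) for d in parents for a in ancestors(d, parents)}

markov = markov_pairs(PARENTS)
order = order_pairs(PARENTS)
```

A feature in this sense is just a yes/no property of a network, which is what makes it possible to count how often it appears across many learned networks.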
Confidence Estimates
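Bootstrap approach [FGW, UAI99]
Resample the data D to obtain datasets D1, D2, ..., Dm
Learn a network Gi from each Di
[Figure: networks over the nodes B, E, R, A, C learned from each resampled dataset]
Estimate:
$$C(f) = \frac{1}{m} \sum_{i=1}^{m} 1\{f \in G_i\}$$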
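The bootstrap loop itself is short. In the sketch below the learner is a deliberately simple stand-in (it reports gene pairs whose absolute correlation exceeds a threshold) and is not the Bayesian network structure learning used in the talk; the synthetic data, the threshold, and all function names are assumptions for illustration.

```python
import random
from collections import Counter
from itertools import combinations

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def toy_learn(samples, threshold=0.6):
    """Stand-in learner: the 'features' are strongly correlated gene pairs."""
    cols = list(zip(*samples))
    return {(i, j) for i, j in combinations(range(len(cols)), 2)
            if abs(pearson(cols[i], cols[j])) > threshold}

def bootstrap_confidence(samples, learn, m=200, seed=0):
    """C(f) = fraction of the m resampled datasets whose learned model has f."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(m):
        resampled = [rng.choice(samples) for _ in samples]  # with replacement
        counts.update(learn(resampled))
    return {f: c / m for f, c in counts.items()}

# Synthetic data: gene 1 tracks gene 0, gene 2 is independent noise.
rng = random.Random(1)
data = []
for _ in range(76):
    g0 = rng.gauss(0, 1)
    data.append((g0, g0 + rng.gauss(0, 0.3), rng.gauss(0, 1)))

confidence = bootstrap_confidence(data, toy_learn)
```

Any procedure that maps a dataset to a set of features can be passed as `learn`, so the same loop applies unchanged when the stand-in is replaced by a real structure learner.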
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
Histograms of number of Markov features at each confidence level
[Figure: two histograms, "Original Data" and "Randomized Data"]
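The randomization step can be sketched as an independent shuffle of each gene's values across samples, which keeps every gene's own distribution but breaks any relationship between genes. The function names and the toy data are hypothetical; `count_above` mirrors the "features with confidence above t" summary.

```python
import random

def shuffle_each_gene(samples, seed=0):
    """Permute each gene (column) independently across the samples (rows)."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*samples)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(row) for row in zip(*columns)]

def count_above(confidences, t):
    """Number of features whose confidence is above the threshold t."""
    return sum(1 for c in confidences if c > t)

samples = [(1, 10, 100), (2, 20, 200), (3, 30, 300), (4, 40, 400)]
randomized = shuffle_each_gene(samples)
n_confident = count_above([0.2, 0.5, 0.9, 0.95], 0.8)
```

Running the same learning and confidence procedure on the real and on the shuffled data, and comparing the counts at each threshold, gives the real-versus-random comparison described on these slides.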
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
Markov w/ Gaussian Models
[Figure: number of features with confidence above t, for t from 0 to 1, with curves for random and real data]
Testing for Significance
Markov w/ Multinomial Models
[Figure: number of features with confidence above t, for t from 0 to 1, with curves for random and real data]
Local Map
Finding Key Genes
Key gene: a gene that precedes many other genes
YLR183C
MCD1 Mitotic Chromosome Determinant
RAD27 DNA repair protein
CLN2 role in cell cycle START
SRO4 involved in cellular polarization during budding
YOX1 Homeodomain protein that binds leu-tRNA gene
POL30 required for DNA replication and repair
YLR467W
CDC5
MSH6 Homolog of the human GTBP protein
YML119W
CLN1 role in cell cycle START
Future Work
Finding suitable local distribution models
Correct handling of hidden variables
Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
Combine with cluster analysis of higher confidence conclusions
Slide 7
Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem
.
Central Dogma
Transcription
Translation
mRNA
Gene
Cells express different subset of the genes
In different tissues and under different conditions
Protein
Microarrays (aka “DNA chips”)
New
technological breakthrough:
Measure RNA expression levels of thousands
of genes in one experiment
Measure expression on
a genomic scale
Opens up new
experimental designs
Many major labs are using,
or will use this technology
in the near future
The Problem
Experiments
j
Genes
i
Goal:
Aij - the mRNA level of gene j in experiment i
Learn regulatory/metabolic networks
Identify causal sources of the biological
phenomena of interest
Our Approach
Characterize
statistical relationships between
expression patterns of different genes
Beyond pair-wise interactions
Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process:
Example: Pedigree Homer
A node represents
an individual’s
genotype
Bart
Modeling
Marge
Lisa
Maggie
assumptions:
Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations
Ancestor
Network Structure
Parent
Generalizing to DAGs:
A child is conditionally
independent from its
non-descendents, given the
value of its parents
Y1
Y2
X
Often a natural assumption
for causal processes
if we believe that we capture
the relevant state of each
intermediate stage.
Non-descendent
Descendent
Local Probabilities
Associated
with each variable Xi is a conditional
probability distribution P(Xi|Pai:)
X P(Y |X)
Discrete variables:
X
x 0.9 0.1
Multinomial distribution
x
Y
variables:
Choice: for example linear gaussian
P(Y | X)
Continuous
X
Y
0.3 0.7
Bayesian Network Semantics
B
E
R
A
C
Qualitative part
DAG specifies
conditional
independence
statements
Quantitative part
+
local
probability
models
=
Unique joint
distribution
over domain
Compact
& efficient representation:
k parents O(2kn) vs. O(2n) params
parameters pertain to local interactions
P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)
Why Bayesian Networks?
Bayesian Networks:
Flexible representation of dependency structure
of multivariate distributions
Natural for modeling processes with local
interactions
Learning of Bayesian Networks
Can learn dependencies from observations
Handles stochastic processes:
“true” stochastic behavior
noise in measurements
Modeling Regulatory Interactions
Variables of interest:
Expression levels of genes
Concentration levels of proteins (proteomics!)
Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
Phenotype information
…
Bayesian Network Structure:
Capture dependencies among these variables
Examples
Measured expression level of each gene     Random variables
Gene interaction                           Probabilistic dependencies
Interactions are represented by a graph:
Each gene is represented by a node in the graph
Edges between the nodes represent direct dependency
[Figure: example graphs over genes A and B]
More Complex Examples
Dependencies can be mediated through other nodes
[Figure: graphs over A, B, C labelled "Common cause" and "Intermediate gene"]
Common effects can imply conditional dependence
[Figure: graph over A, B, C]
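The statement about common effects can be illustrated with a small worked example. This is a sketch under assumed numbers: a toy network A -> C <- B with binary variables and a made-up conditional table for C. It computes conditional probabilities by brute-force enumeration and shows that A and B are independent on their own but become dependent once C is observed.

```python
from itertools import product

# Toy common-effect network A -> C <- B; all numbers are illustrative.
P_A = 0.5
P_B = 0.5
P_C_GIVEN_AB = {(0, 0): 0.05, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}


def joint(a, b, c):
    """P(A, B, C) = P(A) * P(B) * P(C | A, B)."""
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = P_C_GIVEN_AB[(a, b)] if c else 1 - P_C_GIVEN_AB[(a, b)]
    return pa * pb * pc


def prob(query, evidence):
    """P(query | evidence) by enumeration; both are dicts over 'a', 'b', 'c'."""
    num = den = 0.0
    for a, b, c in product([0, 1], repeat=3):
        world = {"a": a, "b": b, "c": c}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(a, b, c)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den


p_a = prob({"a": 1}, {})                       # marginal P(A=1)
p_a_given_b = prob({"a": 1}, {"b": 1})         # equal to p_a: A and B independent
p_a_given_c = prob({"a": 1}, {"c": 1})         # observing C raises belief in A
p_a_given_bc = prob({"a": 1}, {"b": 1, "c": 1})  # B "explains away" part of C

print(p_a, p_a_given_b, p_a_given_c, p_a_given_bc)
```

Learning B changes nothing about A until C is known; once C is observed, learning B lowers the probability of A. This is the conditional dependence the slide refers to.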
Outline of Our Approach
[Figure: expression data is fed to a Bayesian network learning algorithm, which outputs a network over nodes B, E, R, A, C]
Use learned network to make predictions about structure of the interactions between genes
Experiment
Data from Spellman et al. (Mol.Bio. of the Cell 1998)
Contains 76 samples covering the whole yeast genome:
Different methods for synchronizing cell-cycle in yeast
Time series at intervals of a few minutes (5-20 min)
Spellman et al. identified 800 cell-cycle regulated genes.
Methods
Treat samples as IID (ignoring temporal order)
Experiment 1:
Discretized into three levels of expression: "-", "0", "+", with thresholds at -0.5 and 0.5 on Log(ratio to control)
Learn multinomial probabilities
Experiment 2:
Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
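The three-level discretization of Experiment 1 can be sketched as follows. The function name and the treatment of values exactly at the thresholds are assumptions. The input is taken to be an already-computed log ratio to control, since the slide does not state the base of the logarithm.

```python
def discretize(log_ratio, threshold=0.5):
    """Map a log(ratio to control) value to -1 ("-"), 0 ("0") or +1 ("+").

    Values strictly below -threshold are under-expressed, values strictly
    above +threshold are over-expressed; boundary values map to 0
    (an assumption, the slides do not say which side they fall on).
    """
    if log_ratio < -threshold:
        return -1
    if log_ratio > threshold:
        return 1
    return 0


def discretize_profile(log_ratios, threshold=0.5):
    """Discretize one gene's expression profile across all samples."""
    return [discretize(v, threshold) for v in log_ratios]


# Hypothetical log-ratio values for one gene over six samples.
example = [-1.2, -0.5, -0.1, 0.3, 0.5, 0.9]
print(discretize_profile(example))
```

The discretized levels are what the multinomial local models are then fitted to; the linear Gaussian models of Experiment 2 use the continuous values directly.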
Network Learned
Challenge: Statistical Significance
Sparse Data
Small number of samples
“Flat posterior” -- many networks fit the data
Solution
estimate confidence in network features
Two types of features
Markov neighbors: X directly interacts with Y
Order relations: X is an ancestor of Y
Confidence Estimates
Bootstrap approach [FGW, UAI99]
[Figure: dataset D is resampled into D1, D2, ..., Dm; a network over nodes B, E, R, A, C is learned from each resampled dataset]
Estimate:
C(f) = \frac{1}{m} \sum_{i=1}^{m} 1\{f \in G_i\}
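The bootstrap estimate can be sketched in a few lines. The resampling loop follows the slide: draw m datasets with replacement, learn from each, and report the fraction of learned models that contain a feature. The learner used here is only a stand-in (it flags gene pairs with high absolute correlation) so that the sketch runs on its own; it is not a Bayesian network structure learner and not the authors' method. The toy data and all names are assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def learn_features(samples, cutoff=0.6):
    """Stand-in learner: the set of gene pairs (i, j) with |corr| > cutoff."""
    n_genes = len(samples[0])
    features = set()
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            xi = [s[i] for s in samples]
            xj = [s[j] for s in samples]
            if abs(pearson(xi, xj)) > cutoff:
                features.add((i, j))
    return features


def bootstrap_confidence(samples, learn, m=200, seed=0):
    """C(f) = (1/m) * number of resampled datasets whose model contains f."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        resampled = [rng.choice(samples) for _ in samples]  # with replacement
        for f in learn(resampled):
            counts[f] = counts.get(f, 0) + 1
    return {f: c / m for f, c in counts.items()}


# Toy data: gene 1 tracks gene 0, gene 2 is independent noise.
rng = random.Random(1)
data = []
for _ in range(76):
    g0 = rng.gauss(0, 1)
    data.append((g0, g0 + rng.gauss(0, 0.3), rng.gauss(0, 1)))

conf = bootstrap_confidence(data, learn_features)
print(conf)
```

Any function that maps a dataset to a set of features (Markov neighbors, order relations) can be passed as `learn`, which is why the confidence routine is kept separate from the learner.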
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
[Figure: histograms of the number of Markov features at each confidence level, for Original Data and Randomized Data]
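The randomization step can be sketched as below, assuming the data is stored as one row per experiment and one column per gene. Each gene's column is permuted independently, which keeps every gene's own distribution of values but breaks the relationships between genes. The helper that counts features above a confidence threshold is a hypothetical name for the quantity compared between real and randomized data.

```python
import random


def shuffle_each_gene(samples, seed=0):
    """Return a copy of samples with each gene's values permuted independently.

    samples: list of rows (experiments), each a tuple of per-gene values.
    """
    rng = random.Random(seed)
    n_genes = len(samples[0])
    columns = []
    for j in range(n_genes):
        column = [row[j] for row in samples]
        rng.shuffle(column)  # in-place permutation of this gene only
        columns.append(column)
    return [tuple(col[i] for col in columns) for i in range(len(samples))]


def count_above(confidences, t):
    """Number of features whose confidence is above threshold t."""
    return sum(1 for c in confidences.values() if c > t)


# Hypothetical 4-sample, 2-gene dataset.
data = [(1, 10), (2, 20), (3, 30), (4, 40)]
shuffled = shuffle_each_gene(data)
print(shuffled)
```

Running the same learning and confidence procedure on the shuffled data gives a baseline: features that reach high confidence on the real data but not on the randomized data are less likely to be artifacts of the method.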
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
Markov w/ Gaussian Models
[Figure: two panels plotting the number of features with confidence above t against t (0 to 1), each with curves for Random and Real data]
Testing for Significance
Markov w/ Multinomial Models
[Figure: two panels plotting the number of features with confidence above t against t (0 to 1), each with curves for Random and Real data]
Local Map
Finding Key Genes
Key gene: a gene that precedes many other genes
YLR183C
MCD1 Mitotic Chromosome Determinant
RAD27 DNA repair protein
CLN2 role in cell cycle START
SRO4 involved in cellular polarization during budding
YOX1 Homeodomain protein that binds leu-tRNA gene
POL30 required for DNA replication and repair
YLR467W
CDC5
MSH6 Homolog of the human GTBP protein
YML119W
CLN1 role in cell cycle START
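One simple way to turn "precedes many other genes" into a ranking is to count, for each gene, how many other genes it is an ancestor of with high confidence. The slides do not say how the key genes were scored, so the counting rule, the threshold, and the toy confidences below are all assumptions made for illustration (placeholder gene names g1 to g4 are used instead of real genes).

```python
def dominance_counts(order_conf, t=0.8):
    """Rank genes by how many others they precede with confidence >= t.

    order_conf maps (x, y) to the confidence that x is an ancestor of y.
    Returns (gene, count) pairs sorted by descending count, then by name.
    """
    counts = {}
    for (x, y), c in order_conf.items():
        counts.setdefault(x, 0)
        counts.setdefault(y, 0)
        if c >= t:
            counts[x] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical order-relation confidences.
toy = {
    ("g1", "g2"): 0.95,
    ("g1", "g3"): 0.90,
    ("g2", "g3"): 0.85,
    ("g3", "g1"): 0.10,
    ("g1", "g4"): 0.40,
}
ranking = dominance_counts(toy)
print(ranking)
```

The order-relation confidences would come from the bootstrap procedure, using "X is an ancestor of Y" as the feature.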
Future Work
Finding suitable local distribution models
Correct handling of hidden variables
Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
Combine with cluster analysis of higher confidence conclusions
Slide 12
Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem
.
Central Dogma
Transcription
Translation
mRNA
Gene
Cells express different subset of the genes
In different tissues and under different conditions
Protein
Microarrays (aka “DNA chips”)
New
technological breakthrough:
Measure RNA expression levels of thousands
of genes in one experiment
Measure expression on
a genomic scale
Opens up new
experimental designs
Many major labs are using,
or will use this technology
in the near future
The Problem
Experiments
j
Genes
i
Goal:
Aij - the mRNA level of gene j in experiment i
Learn regulatory/metabolic networks
Identify causal sources of the biological
phenomena of interest
Our Approach
Characterize
statistical relationships between
expression patterns of different genes
Beyond pair-wise interactions
Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process:
Example: Pedigree Homer
A node represents
an individual’s
genotype
Bart
Modeling
Marge
Lisa
Maggie
assumptions:
Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations
Ancestor
Network Structure
Parent
Generalizing to DAGs:
A child is conditionally
independent from its
non-descendents, given the
value of its parents
Y1
Y2
X
Often a natural assumption
for causal processes
if we believe that we capture
the relevant state of each
intermediate stage.
Non-descendent
Descendent
Local Probabilities
Associated
with each variable Xi is a conditional
probability distribution P(Xi|Pai:)
X P(Y |X)
Discrete variables:
X
x 0.9 0.1
Multinomial distribution
x
Y
variables:
Choice: for example linear gaussian
P(Y | X)
Continuous
X
Y
0.3 0.7
Bayesian Network Semantics
B
E
R
A
C
Qualitative part
DAG specifies
conditional
independence
statements
Quantitative part
+
local
probability
models
=
Unique joint
distribution
over domain
Compact
& efficient representation:
k parents O(2kn) vs. O(2n) params
parameters pertain to local interactions
P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)
Why Bayesian Networks?
Bayesian Networks:
Flexible representation of dependency structure
of multivariate distributions
Natural for modeling processes with local
interactions
Learning of Bayesian Networks
Can learn dependencies from observations
Handles stochastic processes:
“true” stochastic behavior
noise in measurements
Modeling Regulatory Interactions
Variables of interest:
Expression levels of genes
Concentration levels of proteins (proteomics!)
Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
Phenotype information
…
Bayesian Network Structure:
Capture dependencies among these variables
Examples
Measured expression
level of each gene
Gene interaction
Random variables
Probabilistic
dependencies
Interactions are represented by a graph:
Each gene is represented by a node in the graph
Edges between the nodes represent direct
dependency
X
A
B
A
B
More Complex Examples
Dependencies
can be mediated through other
nodes
B
A
A
C
C
Common cause
Common
B
Intermediate gene
effects can imply conditional dependence
A
B
C
Outline of Our Approach
Bayesian Network
Learning Algorithm
Expression data
B
E
R
A
C
Use learned network to make predictions about
structure of the interactions between genes
Experiment
Data from Spellman et al.
(Mol.Bio. of the Cell 1998)
Contains 76 samples of all the
yeast genome:
Different methods for
synchronizing cell-cycle in
yeast
Time series at few minutes
(5-20min) intervals
Spellman et al. identified 800
cell-cycle regulated genes.
Methods
Treat
samples as IID (ignoring temporal order)
Experiment 1:
Discretized into three levels of expression
0
-
-0.5
+
0.5
Log(ratio to control)
Learn
multinomial probabilities
Experiment 2:
Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
Network Learned
Challenge: Statistical Significance
Sparse Data
Small number of samples
“Flat posterior” -- many networks fit the data
Solution
estimate confidence in network features
Two types of features
Markov neighbors: X directly interacts with Y
Order relations: X is an ancestor of Y
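An order relation can be read directly off a network structure. The sketch below is illustrative: it assumes a DAG stored as a dictionary from each node to its parents, reuses the five-node example network, and uses a helper name (`ancestor_pairs`) invented here.

```python
def ancestor_pairs(parents):
    """Return the set of (x, y) pairs such that x is an ancestor of y.

    parents maps each node of a DAG to the tuple of its parents.
    """
    def collect(node, seen):
        # Walk upward through parents, recording every node reached.
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                collect(p, seen)
        return seen

    return {(a, y) for y in parents for a in collect(y, set())}

# Example network: P(B) P(E) P(R|E) P(A|B,E) P(C|A)
network = {
    "B": (),
    "E": (),
    "R": ("E",),
    "A": ("B", "E"),
    "C": ("A",),
}

pairs = ancestor_pairs(network)
print(sorted(pairs))
```

Here B is an ancestor of C through A, while R is not an ancestor of C.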
Confidence Estimates
Bootstrap approach [FGW, UAI99]
Resample D to obtain datasets D1, D2, ..., Dm
Learn a network Gi from each Di
Estimate: $C(f) = \frac{1}{m} \sum_{i=1}^{m} 1\{f \in G_i\}$
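The bootstrap estimate can be sketched in a few lines. The sketch below is not the authors' implementation: `toy_learn` is a hypothetical stand-in for the network learner (it simply reports variable pairs with high absolute correlation), and the data are synthetic. Only the resample, learn, and count loop reflects the procedure described above.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def toy_learn(data, cutoff=0.8):
    """Hypothetical stand-in for structure learning: the 'features' are
    variable pairs whose absolute correlation exceeds cutoff."""
    names = sorted(data)
    return {(a, b) for a, b in combinations(names, 2)
            if abs(pearson(data[a], data[b])) > cutoff}

def bootstrap_confidence(data, learn, m=100, seed=0):
    """Estimate C(f) as the fraction of m resampled datasets in which
    the learner reports feature f."""
    rng = random.Random(seed)
    n = len(next(iter(data.values())))
    counts = {}
    for _ in range(m):
        # Resample whole samples (columns of the data) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = {g: [vals[i] for i in idx] for g, vals in data.items()}
        for f in learn(resampled):
            counts[f] = counts.get(f, 0) + 1
    return {f: c / m for f, c in counts.items()}

# Synthetic data: Y is a linear function of X, Z is unrelated.
data = {
    "X": [float(i) for i in range(20)],
    "Y": [2.0 * i + 1.0 for i in range(20)],
    "Z": [float((7 * i) % 5) for i in range(20)],
}
conf = bootstrap_confidence(data, toy_learn, m=50)
print(conf)
```

Any learner that returns a set of hashable features can be passed in place of `toy_learn`.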
Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
[Histograms: number of Markov features at each confidence level; panels: Original Data, Randomized Data]
[Plots: Markov w/ Gaussian Models; features with confidence above t, plotted against t; series: Random, Real]
[Plots: Markov w/ Multinomial Models; features with confidence above t, plotted against t; series: Random, Real]
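The randomization step can be sketched as an independent permutation of each gene's values, which keeps every gene's marginal distribution but breaks dependencies between genes. The function names below are invented for this illustration, and counting features strictly above the threshold t is an assumption.

```python
import random

def shuffle_each_gene(data, seed=0):
    """Return a copy of data in which each gene's values are permuted
    independently across samples."""
    rng = random.Random(seed)
    shuffled = {}
    for gene, values in data.items():
        permuted = list(values)
        rng.shuffle(permuted)
        shuffled[gene] = permuted
    return shuffled

def count_above(confidences, t):
    """Number of features whose confidence is strictly above t."""
    return sum(1 for c in confidences.values() if c > t)

data = {
    "g1": [0.1, 0.4, -0.3, 0.8, -0.9],
    "g2": [1.0, 0.0, -1.0, 0.5, -0.5],
}
randomized = shuffle_each_gene(data)
print(randomized)
```

Running the same confidence procedure on the real and the randomized data, then comparing `count_above` at each t, gives the kind of comparison the plots describe.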
Local Map
Finding Key Genes
Key gene: a gene that precedes many other genes
YLR183C
MCD1 Mitotic Chromosome Determinant
RAD27 DNA repair protein
CLN2 role in cell cycle START
SRO4 involved in cellular polarization during budding
YOX1 Homeodomain protein that binds leu-tRNA gene
POL30 required for DNA replication and repair
YLR467W
CDC5
MSH6 Homolog of the human GTBP protein
YML119W
CLN1 role in cell cycle START
Future Work
Finding suitable local distribution models
Correct handling of hidden variables
Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
Combine with cluster analysis of higher confidence
conclusions
Slide 17
Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem
.
Central Dogma
Transcription
Translation
mRNA
Gene
Cells express different subset of the genes
In different tissues and under different conditions
Protein
Microarrays (aka “DNA chips”)
New
technological breakthrough:
Measure RNA expression levels of thousands
of genes in one experiment
Measure expression on
a genomic scale
Opens up new
experimental designs
Many major labs are using,
or will use this technology
in the near future
The Problem
Experiments
j
Genes
i
Goal:
Aij - the mRNA level of gene j in experiment i
Learn regulatory/metabolic networks
Identify causal sources of the biological
phenomena of interest
Our Approach
Characterize
statistical relationships between
expression patterns of different genes
Beyond pair-wise interactions
Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process:
Example: Pedigree Homer
A node represents
an individual’s
genotype
Bart
Modeling
Marge
Lisa
Maggie
assumptions:
Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations
Ancestor
Network Structure
Parent
Generalizing to DAGs:
A child is conditionally
independent from its
non-descendents, given the
value of its parents
Y1
Y2
X
Often a natural assumption
for causal processes
if we believe that we capture
the relevant state of each
intermediate stage.
Non-descendent
Descendent
Local Probabilities
Associated
with each variable Xi is a conditional
probability distribution P(Xi|Pai:)
X P(Y |X)
Discrete variables:
X
x 0.9 0.1
Multinomial distribution
x
Y
variables:
Choice: for example linear gaussian
P(Y | X)
Continuous
X
Y
0.3 0.7
Bayesian Network Semantics
B
E
R
A
C
Qualitative part
DAG specifies
conditional
independence
statements
Quantitative part
+
local
probability
models
=
Unique joint
distribution
over domain
Compact
& efficient representation:
k parents O(2kn) vs. O(2n) params
parameters pertain to local interactions
P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)
Why Bayesian Networks?
Bayesian Networks:
Flexible representation of dependency structure
of multivariate distributions
Natural for modeling processes with local
interactions
Learning of Bayesian Networks
Can learn dependencies from observations
Handles stochastic processes:
“true” stochastic behavior
noise in measurements
Modeling Regulatory Interactions
Variables of interest:
Expression levels of genes
Concentration levels of proteins (proteomics!)
Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
Phenotype information
…
Bayesian Network Structure:
Capture dependencies among these variables
Examples
Measured expression
level of each gene
Gene interaction
Random variables
Probabilistic
dependencies
Interactions are represented by a graph:
Each gene is represented by a node in the graph
Edges between the nodes represent direct
dependency
X
A
B
A
B
More Complex Examples
Dependencies
can be mediated through other
nodes
B
A
A
C
C
Common cause
Common
B
Intermediate gene
effects can imply conditional dependence
A
B
C
Outline of Our Approach
Bayesian Network
Learning Algorithm
Expression data
B
E
R
A
C
Use learned network to make predictions about
structure of the interactions between genes
Experiment
Data from Spellman et al.
(Mol.Bio. of the Cell 1998)
Contains 76 samples of all the
yeast genome:
Different methods for
synchronizing cell-cycle in
yeast
Time series at few minutes
(5-20min) intervals
Spellman et al. identified 800
cell-cycle regulated genes.
Methods
Treat
samples as IID (ignoring temporal order)
Experiment 1:
Discretized into three levels of expression
0
-
-0.5
+
0.5
Log(ratio to control)
Learn
multinomial probabilities
Experiment 2:
Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
Network Learned
Challenge: Statistical Significance
Sparse Data
Small number of samples
“Flat posterior” -- many networks fit the data
Solution
estimate confidence in network features
Two types of features
Markov neighbors: X directly interacts with Y
Order relations: X is an ancestor of Y
Confidence Estimates
B
E
Bootstrap approach
[FGW, UAI99]
D1
Learn
R
A
C
E
D
resample
D2
Learn
R
A
C
...
Dm
Estimate:
B
C (f )
1
m
Learn
E
R
B
A
C
m
1f
i 1
Gi
Testing for Significance
We
run our procedure on randomized data where
we reshuffled the order of values for each gene
Histograms of number of Markov features at each
confidence level
Original Data
Randomized Data
Testing for Significance
We
run our procedure on randomized data where
we reshuffled the order of values for each gene
Markov w/ Gaussian Models
Features with Confidence above t
4000
500
450
3500
Random
Real
Random
Real
400
350
3000
300
2500
250
200
2000
150
1500
100
50
1000
500
0
0.1
0.2
0.3
0.4
0.5
0.6
t
0
0.1
0.2
0.7
0.8
0.3
0.9
0.4
0.5
1
0.6
0.7
0.8
0.9
1
Testing for Significance
Markov w/ Multinomial Models
Features with Confidence above t
250
1400
Random
Real
Random
Real
200
1200
1000
150
800
100
600
50
400
0
0.1
200
0.2
0.3
0.4
0
0.1
0.2
0.3
0.4
0.5
0.6
t
0.7
0.8
0.9
1
0.5
0.6
0.7
0.8
0.9
1
Local Map
Finding Key Genes
Key gene: a gene that preceeds many other genes
YLR183C
MCD1 Mitotic Chromosome Determinant;
RAD27 DNA repair protein
CLN2 role in cell cycle START
SRO4 involved in cellular polarization during budding
YOX1 Homeodomain protein that binds leu-tRNA gene
POL30 required for DNA replication and repair
YLR467W
CDC5
MSH6 Homolog of the human GTBP protein
YML119W
CLN1 role in cell cycle START
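One simple way to rank candidate key genes is to count, for each gene, how many order relations it heads with high confidence. The slide does not say how "precedes many other genes" is scored, so the thresholded count below is an assumption rather than the authors' method. The gene names and confidence values are placeholders.

```python
def dominance_counts(order_confidence, t=0.8):
    """Count, per gene, the order relations X -> Y with confidence >= t.

    `order_confidence` maps (ancestor, descendant) pairs to confidence
    values. Returns (gene, count) pairs, highest count first.
    """
    counts = {}
    for (ancestor, _descendant), conf in order_confidence.items():
        if conf >= t:
            counts[ancestor] = counts.get(ancestor, 0) + 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Placeholder genes and invented confidence values, for illustration only.
order_confidence = {
    ("geneA", "geneC"): 0.92,
    ("geneA", "geneD"): 0.85,
    ("geneA", "geneE"): 0.40,
    ("geneB", "geneE"): 0.88,
}
ranking = dominance_counts(order_confidence)
print(ranking)  # [('geneA', 2), ('geneB', 1)]
```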
Future Work
Finding suitable local distribution models
Correct handling of hidden variables
Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
Combine with cluster analysis of higher confidence conclusions
Slide 23
Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem
.
Central Dogma
Transcription
Translation
mRNA
Gene
Cells express different subset of the genes
In different tissues and under different conditions
Protein
Microarrays (aka “DNA chips”)
New
technological breakthrough:
Measure RNA expression levels of thousands
of genes in one experiment
Measure expression on
a genomic scale
Opens up new
experimental designs
Many major labs are using,
or will use this technology
in the near future
The Problem
Experiments
j
Genes
i
Goal:
Aij - the mRNA level of gene j in experiment i
Learn regulatory/metabolic networks
Identify causal sources of the biological
phenomena of interest
Our Approach
Characterize statistical relationships between expression patterns of different genes
Beyond pair-wise interactions
Many interactions are explained by intermediate factors
Regulation involves combined effects of several gene products
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process. Example: pedigree
A node represents an individual’s genotype
[Figure: pedigree network over Homer, Marge, Bart, Lisa, and Maggie]
Modeling assumptions:
Ancestors can affect descendants' genotypes only by passing genetic material through intermediate generations
Network Structure
Generalizing to DAGs:
A child is conditionally independent of its non-descendants, given the value of its parents
[Figure: DAG around node X, with nodes labeled Ancestor, Parent, Y1, Y2, Non-descendant, Descendant]
Often a natural assumption for causal processes, if we believe that we capture the relevant state of each intermediate stage.
Local Probabilities
Associated with each variable X_i is a conditional probability distribution P(X_i | Pa_i)
Discrete variables: multinomial distribution
[Table: P(Y | X), one row per value of X; rows: 0.9 0.1 and 0.3 0.7]
Continuous variables: choice of distribution, for example linear Gaussian
[Figure: plot of P(Y | X)]
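The two kinds of local model named on this slide can be sketched in a few lines of Python. This is only an illustration: the table rows 0.9/0.1 and 0.3/0.7 are taken from the slide, but the value labels ("x0", "x1", "y0", "y1") and the linear Gaussian parameters a, b, and sigma are assumptions made for the example.

```python
import random

# Multinomial (table) model: one row of probabilities per parent value.
# The rows 0.9/0.1 and 0.3/0.7 come from the slide; the labels are hypothetical.
CPT_Y_GIVEN_X = {
    "x0": {"y0": 0.9, "y1": 0.1},
    "x1": {"y0": 0.3, "y1": 0.7},
}

def sample_discrete(x, rng):
    """Draw Y from the table row selected by the parent value x."""
    row = CPT_Y_GIVEN_X[x]
    values = list(row)
    weights = [row[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

def sample_linear_gaussian(x, rng, a=0.0, b=1.0, sigma=0.5):
    """Draw Y ~ Normal(a + b*x, sigma^2); a, b, sigma are assumed values."""
    return rng.gauss(a + b * x, sigma)

rng = random.Random(0)
discrete_draws = [sample_discrete("x0", rng) for _ in range(1000)]
gaussian_draws = [sample_linear_gaussian(2.0, rng) for _ in range(1000)]
```

In the discrete case the parent value selects a row of the table; in the linear Gaussian case the parent value shifts the mean of the child while the noise level stays fixed.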
Bayesian Network Semantics
[Figure: DAG over nodes B, E, R, A, C]
Qualitative part: DAG specifies conditional independence statements
+ Quantitative part: local probability models
= Unique joint distribution over domain
Compact & efficient representation:
With at most k parents per node: O(2^k n) vs. O(2^n) parameters, for n binary variables
Parameters pertain to local interactions
P(C,A,R,E,B) = P(B) * P(E|B) * P(R|E,B) * P(A|R,B,E) * P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B) * P(E) * P(R|E) * P(A|B,E) * P(C|A)
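As a rough check on the factored form, the sketch below builds the joint distribution as a product of local terms and counts parameters for both representations. The parent sets are read off the factorization on this slide; all variables are assumed binary, and every probability in the table is made up for illustration.

```python
from itertools import product

# Parents of each node, read off the factored form:
# P(B) * P(E) * P(R|E) * P(A|B,E) * P(C|A)
PARENTS = {"B": (), "E": (), "R": ("E",), "A": ("B", "E"), "C": ("A",)}

# P(node = 1 | parent values); every number here is illustrative only.
P_TRUE = {
    "B": {(): 0.01},
    "E": {(): 0.02},
    "R": {(0,): 0.001, (1,): 0.9},
    "A": {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95},
    "C": {(0,): 0.05, (1,): 0.7},
}

def joint(assignment):
    """Probability of one full assignment as a product of local terms."""
    p = 1.0
    for node, parents in PARENTS.items():
        key = tuple(assignment[q] for q in parents)
        p_true = P_TRUE[node][key]
        p *= p_true if assignment[node] == 1 else 1.0 - p_true
    return p

nodes = list(PARENTS)
total = sum(
    joint(dict(zip(nodes, vals))) for vals in product((0, 1), repeat=len(nodes))
)

# One free parameter per parent configuration of each binary node.
factored_params = sum(2 ** len(parents) for parents in PARENTS.values())
# A full joint table over five binary variables.
full_table_params = 2 ** len(nodes) - 1
```

With these parent sets the factored form needs 10 free parameters, against 31 for an unrestricted table over five binary variables, and the products still sum to 1 over all assignments.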
Why Bayesian Networks?
Bayesian Networks:
Flexible representation of dependency structure of multivariate distributions
Natural for modeling processes with local interactions
Learning of Bayesian Networks:
Can learn dependencies from observations
Handles stochastic processes: “true” stochastic behavior, noise in measurements
Modeling Regulatory Interactions
Variables of interest:
Expression levels of genes
Concentration levels of proteins (proteomics!)
Exogenous variables: nutrient levels, metabolite levels, temperature
Phenotype information
…
Bayesian Network Structure:
Capture dependencies among these variables
Examples
Measured expression level of each gene -> random variables
Gene interaction -> probabilistic dependencies
Interactions are represented by a graph:
Each gene is represented by a node in the graph
Edges between the nodes represent direct dependency
[Figure: example graphs over nodes A and B]
More Complex Examples
Dependencies can be mediated through other nodes
[Figures: intermediate gene and common cause, each over nodes A, B, C]
Common effects can imply conditional dependence
[Figure: common effect over nodes A, B, C]
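The last point, that a common effect can induce conditional dependence, is easy to verify numerically. The sketch below assumes a three-node network A -> C <- B with binary variables; the structure labels and all probabilities are hypothetical and chosen only to make the effect visible.

```python
from itertools import product

# Common effect: A -> C <- B, all binary. Numbers are illustrative only.
P_A = 0.3
P_B = 0.4
P_C_GIVEN_AB = {(0, 0): 0.05, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}

def joint(a, b, c):
    """P(A=a, B=b, C=c) under the factorization P(A) * P(B) * P(C|A,B)."""
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = P_C_GIVEN_AB[(a, b)] if c else 1 - P_C_GIVEN_AB[(a, b)]
    return pa * pb * pc

def prob(**fixed):
    """Sum the joint over all assignments consistent with the fixed values."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        point = {"a": a, "b": b, "c": c}
        if all(point[k] == v for k, v in fixed.items()):
            total += joint(a, b, c)
    return total

# Marginally, knowing A does not change the probability of B.
p_b = prob(b=1)
p_b_given_a = prob(a=1, b=1) / prob(a=1)

# Once C is observed, A and B become dependent.
p_b_given_c = prob(b=1, c=1) / prob(c=1)
p_b_given_a_c = prob(a=1, b=1, c=1) / prob(a=1, c=1)
```

Here `p_b` and `p_b_given_a` coincide, while `p_b_given_a_c` is lower than `p_b_given_c`: once the common effect C is observed, learning that A is active makes B a less necessary explanation.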
Outline of Our Approach
Expression data -> Bayesian Network Learning Algorithm -> learned network
[Figure: learned network over nodes B, E, R, A, C]
Use learned network to make predictions about structure of the interactions between genes
Experiment
Data from Spellman et al. (Mol. Bio. of the Cell 1998)
Contains 76 samples covering the whole yeast genome:
Different methods for synchronizing the cell cycle in yeast
Time series at intervals of a few minutes (5-20 min)
Spellman et al. identified 800 cell-cycle regulated genes.
Methods
Treat samples as IID (ignoring temporal order)
Experiment 1:
Discretized into three levels of expression: -, 0, +
[Figure: axis of log(ratio to control) with thresholds at -0.5 and 0.5]
Learn multinomial probabilities
Experiment 2:
Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
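The three-level discretization used in Experiment 1 might look like the following sketch. The thresholds -0.5 and 0.5 on log(ratio to control) are from the slide; the function name and the treatment of values that fall exactly on a threshold are assumptions.

```python
def discretize(log_ratio, threshold=0.5):
    """Map a log(ratio to control) value to -1, 0, or +1.

    Treating values exactly at a threshold as unchanged (0) is an
    assumption; the slide does not say how boundaries are handled.
    """
    if log_ratio < -threshold:
        return -1  # under-expressed relative to control
    if log_ratio > threshold:
        return 1   # over-expressed relative to control
    return 0       # roughly unchanged

levels = [discretize(v) for v in (-1.2, -0.5, 0.0, 0.3, 0.5, 2.0)]
```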
Network Learned
Challenge: Statistical Significance
Sparse Data
Small number of samples
“Flat posterior” -- many networks fit the data
Solution
estimate confidence in network features
Two types of features
Markov neighbors: X directly interacts with Y
Order relations: X is an ancestor of Y
Confidence Estimates
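Bootstrap approach [FGW, UAI99]
Resample the data D to obtain D1, D2, ..., Dm
Learn a network G_i from each D_i
[Figure: networks over nodes B, E, R, A, C learned from each resampled dataset]
Estimate:
C(f) = \frac{1}{m} \sum_{i=1}^{m} 1\{f \in G_i\}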
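The bootstrap loop itself is short; the expensive part is the learner. The sketch below shows only the resample-and-count estimate of C(f). The learner is deliberately a stand-in (it reports gene pairs whose absolute Pearson correlation exceeds a cutoff) and is not the Bayesian network structure learning used in the talk; the function names, the cutoff, and the synthetic data are all assumptions.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def learn_features(samples, cutoff=0.8):
    """Placeholder learner (NOT structure learning): strongly correlated pairs."""
    genes = sorted(samples[0])
    features = set()
    for g, h in combinations(genes, 2):
        r = pearson([s[g] for s in samples], [s[h] for s in samples])
        if abs(r) > cutoff:
            features.add((g, h))
    return features

def bootstrap_confidence(samples, m, rng):
    """C(f) = (1/m) * sum_i 1{f in G_i} over m resampled datasets."""
    counts = {}
    for _ in range(m):
        # Resample with replacement, keeping the dataset size fixed.
        resampled = [rng.choice(samples) for _ in samples]
        for f in learn_features(resampled):
            counts[f] = counts.get(f, 0) + 1
    return {f: c / m for f, c in counts.items()}

rng = random.Random(0)
# Synthetic data: g2 tracks g1 closely, g3 is independent noise.
samples = []
for _ in range(50):
    x = rng.gauss(0, 1)
    samples.append({"g1": x, "g2": x + rng.gauss(0, 0.1), "g3": rng.gauss(0, 1)})
confidence = bootstrap_confidence(samples, m=100, rng=rng)
```

Swapping `learn_features` for a real structure learner that returns Markov or order features would leave `bootstrap_confidence` unchanged.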
Testing for Significance
We run our procedure on randomized data in which we reshuffled the order of values for each gene
[Figure: histograms of the number of Markov features at each confidence level; panels: Original Data, Randomized Data]
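The randomization described here amounts to permuting each gene's values independently across experiments, which keeps every gene's own distribution of values but breaks any association between genes. A minimal sketch, assuming the data is a list of rows with one row per experiment and one column per gene (the function name and the toy matrix are hypothetical):

```python
import random

def shuffle_each_gene(matrix, rng):
    """Permute each gene's column independently across experiments.

    matrix[i][j] is the level of gene j in experiment i.
    """
    n_genes = len(matrix[0])
    columns = []
    for j in range(n_genes):
        col = [row[j] for row in matrix]
        rng.shuffle(col)  # in-place permutation of this gene's values
        columns.append(col)
    # Reassemble rows from the independently shuffled columns.
    return [[columns[j][i] for j in range(n_genes)] for i in range(len(matrix))]

rng = random.Random(1)
data = [[0.1, 1.0, -0.3], [0.4, 0.9, 0.2], [-0.5, 1.2, 0.0], [0.7, 0.8, -0.1]]
randomized = shuffle_each_gene(data, rng)
```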
Testing for Significance
We run our procedure on randomized data in which we reshuffled the order of values for each gene
[Figure: Markov w/ Gaussian Models; number of features with confidence above t, plotted against t from 0 to 1, for Random vs. Real data; two panels]
Testing for Significance
[Figure: Markov w/ Multinomial Models; number of features with confidence above t, plotted against t from 0 to 1, for Random vs. Real data; two panels]
Local Map
Finding Key Genes
Key gene: a gene that precedes many other genes
YLR183C
MCD1 Mitotic Chromosome Determinant;
RAD27 DNA repair protein
CLN2 role in cell cycle START
SRO4 involved in cellular polarization during budding
YOX1 Homeodomain protein that binds leu-tRNA gene
POL30 required for DNA replication and repair
YLR467W
CDC5
MSH6 Homolog of the human GTBP protein
YML119W
CLN1 role in cell cycle START
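One simple way to turn "precedes many other genes" into a ranking is to count, for each gene, how many order relations it heads with confidence above some threshold. The slide does not say how the authors scored key genes, so the counting rule, the threshold, the function name, and the placeholder gene labels and confidences below are all assumptions.

```python
def rank_key_genes(order_confidence, threshold=0.8):
    """Rank genes by how many others they precede with high confidence.

    order_confidence[(x, y)] is the estimated confidence that x is an
    ancestor of y.
    """
    scores = {}
    for (x, y), conf in order_confidence.items():
        scores.setdefault(x, 0)
        scores.setdefault(y, 0)
        if conf >= threshold:
            scores[x] += 1
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Placeholder confidences for hypothetical genes g1..g4.
order_confidence = {
    ("g1", "g2"): 0.95, ("g1", "g3"): 0.9, ("g1", "g4"): 0.85,
    ("g2", "g3"): 0.82, ("g3", "g4"): 0.4, ("g4", "g2"): 0.1,
}
ranking = rank_key_genes(order_confidence)
```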
Future Work
Finding suitable local distribution models
Correct handling of hidden variables
Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
Combine with cluster analysis of higher confidence conclusions