Using Bayesian Networks to Analyze Expression Data
N. Friedman, M. Linial, I. Nachman, D. Pe’er
Hebrew University, Jerusalem

Central Dogma

Gene -> (Transcription) -> mRNA -> (Translation) -> Protein

Cells express different subsets of the genes in different tissues and under different conditions

Microarrays (aka “DNA chips”)
• New technological breakthrough:
• Measure RNA expression levels of thousands of genes in one experiment
• Measure expression on a genomic scale
• Opens up new experimental designs
• Many major labs are using, or will use, this technology in the near future

The Problem

Aij - the mRNA level of gene j in experiment i
[Figure: expression matrix with axes labeled Experiments and Genes]

Goal:
• Learn regulatory/metabolic networks
• Identify causal sources of the biological phenomena of interest

Our Approach
• Characterize statistical relationships between expression patterns of different genes
• Beyond pair-wise interactions
  - Many interactions are explained by intermediate factors
  - Regulation involves combined effects of several gene products

We build on the language of Bayesian networks

Network: Example
Noisy stochastic process. Example: Pedigree
• A node represents an individual’s genotype
• Modeling assumptions:
  - Ancestors can affect descendants' genotype only by passing genetic material through intermediate generations
[Figure: pedigree with nodes Homer, Marge, Bart, Lisa, Maggie]

Network Structure

Generalizing to DAGs:
  - A child is conditionally independent of its non-descendants, given the value of its parents

Often a natural assumption for causal processes
  - if we believe that we capture the relevant state of each intermediate stage

[Figure: DAG around node X with nodes Y1 and Y2; labels Ancestor, Parent, Non-descendant, Descendant]
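
The independence statement on this slide can be checked numerically on a tiny example. The sketch below assumes a three-node chain (ancestor -> parent -> child) with made-up binary probability tables; none of the numbers come from the talk. It computes P(child | ancestor, parent) from the joint and shows that the ancestor's value no longer matters once the parent is known, although the child does depend on the ancestor when the parent is unobserved.

```python
from itertools import product

# Hypothetical tables for a chain: ancestor -> parent -> child (all binary).
p_anc = {0: 0.6, 1: 0.4}
p_par_given_anc = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}    # [ancestor][parent]
p_child_given_par = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.25, 1: 0.75}}  # [parent][child]


def joint(a, p, c):
    """Joint probability from the chain factorization."""
    return p_anc[a] * p_par_given_anc[a][p] * p_child_given_par[p][c]


def child_given(a, p):
    """P(child = 1 | ancestor = a, parent = p), computed from the joint."""
    return joint(a, p, 1) / (joint(a, p, 0) + joint(a, p, 1))


def child_given_ancestor(a):
    """P(child = 1 | ancestor = a), with the parent summed out."""
    num = sum(joint(a, p, 1) for p in (0, 1))
    den = sum(joint(a, p, c) for p, c in product((0, 1), repeat=2))
    return num / den


for a, p in product((0, 1), repeat=2):
    print(f"P(child=1 | ancestor={a}, parent={p}) = {child_given(a, p):.3f}")
for a in (0, 1):
    print(f"P(child=1 | ancestor={a}) = {child_given_ancestor(a):.3f}")
```
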

Local Probabilities
• Associated with each variable Xi is a conditional probability distribution P(Xi | Pai)
• Discrete variables: Multinomial distribution

  X     P(Y | X)
  x     0.9  0.1
  x̄     0.3  0.7

• Continuous variables: Choice, for example linear Gaussian
[Figure: P(Y | X) plotted over X and Y]
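
As a rough illustration of the two kinds of local model named on this slide, the sketch below samples Y given X from a table (multinomial) model and from a linear Gaussian model. The table rows reuse the numbers shown on the slide; the row labels `x` / `not_x`, the function names, and the coefficients `a`, `b`, `sigma` are assumptions made for the example.

```python
import random

# Table (multinomial) model for Y given a binary X, using the slide's numbers.
# The row labels are an assumed reading of the two table rows.
cpt_y_given_x = {"x": [0.9, 0.1], "not_x": [0.3, 0.7]}


def sample_discrete(x_value, rng):
    """Draw an index for Y from the table row selected by the value of X."""
    probs = cpt_y_given_x[x_value]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_linear_gaussian(x, rng, a=0.5, b=2.0, sigma=0.1):
    """Y | X = x ~ Normal(a + b * x, sigma^2); a, b, sigma are made-up values."""
    return rng.gauss(a + b * x, sigma)


rng = random.Random(0)
discrete_draws = [sample_discrete("x", rng) for _ in range(1000)]
gaussian_draws = [sample_linear_gaussian(1.0, rng) for _ in range(2000)]

print("fraction of draws in first column:", discrete_draws.count(0) / 1000)
print("mean of Y at X = 1.0:", sum(gaussian_draws) / len(gaussian_draws))
```
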

Bayesian Network Semantics

Qualitative part (DAG specifies conditional independence statements)
+ Quantitative part (local probability models)
= Unique joint distribution over domain
[Figure: example DAG over nodes B, E, R, A, C]

• Compact & efficient representation:
  - k parents ⇒ O(2^k n) vs. O(2^n) params
  - parameters pertain to local interactions

P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E)*P(R|E)*P(A|B,E)*P(C|A)
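
The second factorization can be written out directly. The sketch below uses the parent sets read off that formula (B and E have no parents, R depends on E, A on B and E, C on A) and assumes binary variables with invented probabilities. It checks that the product of local terms sums to one over all assignments and compares the number of free parameters with that of a full joint table.

```python
from itertools import product

# Parent sets from P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A).
parents = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

# Hypothetical numbers: P(node = 1 | parent values), keyed by parent-value tuple.
p_one = {
    "B": {(): 0.01},
    "E": {(): 0.02},
    "R": {(0,): 0.001, (1,): 0.9},
    "A": {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95},
    "C": {(0,): 0.05, (1,): 0.8},
}


def joint(assignment):
    """Probability of a full assignment as a product of local terms."""
    prob = 1.0
    for node, pa in parents.items():
        p1 = p_one[node][tuple(assignment[p] for p in pa)]
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob


nodes = list(parents)
total = sum(
    joint(dict(zip(nodes, values)))
    for values in product((0, 1), repeat=len(nodes))
)

# One free parameter per configuration of a binary node's parents.
network_params = sum(2 ** len(pa) for pa in parents.values())
full_table_params = 2 ** len(nodes) - 1

print("sum over all assignments:", round(total, 10))
print("network parameters:", network_params)
print("full joint parameters:", full_table_params)
```
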

Why Bayesian Networks?
Bayesian Networks:
• Flexible representation of dependency structure of multivariate distributions
• Natural for modeling processes with local interactions
Learning of Bayesian Networks:
• Can learn dependencies from observations
• Handles stochastic processes:
  - “true” stochastic behavior
  - noise in measurements

Modeling Regulatory Interactions
Variables of interest:
• Expression levels of genes
• Concentration levels of proteins (proteomics!)
• Exogenous variables: nutrient levels, metabolite levels, temperature
• Phenotype information
• …
Bayesian Network Structure:
• Capture dependencies among these variables

Examples
Measured expression level of each gene -> Random variables
Gene interaction -> Probabilistic dependencies

Interactions are represented by a graph:
• Each gene is represented by a node in the graph
• Edges between the nodes represent direct dependency
[Figure: example graphs over nodes A and B]

More Complex Examples
• Dependencies can be mediated through other nodes
[Figure: two graphs over nodes A, B, C, labeled "Common cause" and "Intermediate gene"]
• Common effects can imply conditional dependence
[Figure: graph over nodes A, B, C]
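
The last point, that a common effect can induce conditional dependence, is easy to see on a small example. The sketch below assumes a structure A -> C <- B with invented probabilities: A and B are independent when C is unobserved, but once C is known, learning B changes the probability of A.

```python
from itertools import product

# Hypothetical model: A and B are independent causes of a common effect C.
p_a = 0.3
p_b = 0.3


def p_c_given(a, b):
    """Made-up table: C is likely on when either parent is on."""
    return 0.95 if (a or b) else 0.05


def joint(a, b, c):
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pc = p_c_given(a, b) if c else 1 - p_c_given(a, b)
    return pa * pb * pc


def prob_a_given(**evidence):
    """P(A = 1 | evidence), summing the joint over consistent assignments."""
    num = den = 0.0
    for a, b, c in product((0, 1), repeat=3):
        world = {"a": a, "b": b, "c": c}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(a, b, c)
        den += p
        if a == 1:
            num += p
    return num / den


print("P(A=1 | B=0)      =", round(prob_a_given(b=0), 3))
print("P(A=1 | B=1)      =", round(prob_a_given(b=1), 3))
print("P(A=1 | C=1, B=0) =", round(prob_a_given(c=1, b=0), 3))
print("P(A=1 | C=1, B=1) =", round(prob_a_given(c=1, b=1), 3))
```
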

Outline of Our Approach

Expression data -> Bayesian Network Learning Algorithm -> learned network
[Figure: example learned network over nodes B, E, R, A, C]

Use learned network to make predictions about structure of the interactions between genes

Experiment
Data from Spellman et al. (Mol. Bio. of the Cell, 1998)
• Contains 76 samples covering the whole yeast genome:
  - Different methods for synchronizing the cell cycle in yeast
  - Time series at intervals of a few minutes (5-20 min)
• Spellman et al. identified 800 cell-cycle regulated genes.

Methods
• Treat samples as IID (ignoring temporal order)

Experiment 1:
• Discretized into three levels of expression (-, 0, +), with thresholds at -0.5 and 0.5 on Log(ratio to control)
• Learn multinomial probabilities
Experiment 2:
• Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
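
The three-level discretization in Experiment 1 might look like the sketch below. The thresholds -0.5 and 0.5 are taken from the slide; the base of the logarithm (base 2 here), the treatment of values exactly at a threshold, and the function name are assumptions.

```python
import math


def discretize(log_ratio, threshold=0.5):
    """Map a log expression ratio to -1 (under), 0 (baseline) or +1 (over).

    Values exactly at +/- threshold are assigned to the baseline level,
    which is an arbitrary choice made for this sketch.
    """
    if log_ratio < -threshold:
        return -1
    if log_ratio > threshold:
        return 1
    return 0


# Hypothetical expression ratios (sample / control) for one gene.
ratios = [0.5, 1.0, 1.2, 2.0]
levels = [discretize(math.log2(r)) for r in ratios]
print(levels)
```
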

Network Learned

Challenge: Statistical Significance
Sparse Data
• Small number of samples
• “Flat posterior”: many networks fit the data

Solution
• Estimate confidence in network features
• Two types of features
  - Markov neighbors: X directly interacts with Y
  - Order relations: X is an ancestor of Y

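Confidence Estimates
Bootstrap approach [FGW, UAI99]
• Resample the data D to obtain datasets D1, D2, ..., Dm
• Learn a network Gi from each Di
• Estimate:

C(f) = \frac{1}{m} \sum_{i=1}^{m} 1\{ f \in G_i \}

[Figure: D is resampled into D1, D2, ..., Dm, and a network over nodes B, E, R, A, C is learned from each]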

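The bootstrap estimate can be sketched in a few lines. In the code below, `bootstrap_confidence` resamples rows with replacement, calls a learner on each resampled dataset, and returns the fraction of runs in which each feature appeared. The learner used here (`toy_learner`, which reports variable pairs whose absolute correlation exceeds a cutoff) is only a stand-in so the example runs; it is not the Bayesian network structure learner used in the talk, and the data are synthetic.

```python
import math
import random


def bootstrap_confidence(data, learn, m=200, seed=0):
    """Fraction of m bootstrap runs in which each feature is returned by learn."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        resampled = [rng.choice(data) for _ in data]  # rows, with replacement
        for feature in learn(resampled):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: count / m for feature, count in counts.items()}


def pearson(xs, ys):
    """Sample correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def toy_learner(rows, cutoff=0.6):
    """Stand-in for structure learning: pairs with |correlation| > cutoff."""
    n_vars = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n_vars)]
    return {
        (i, j)
        for i in range(n_vars)
        for j in range(i + 1, n_vars)
        if abs(pearson(cols[i], cols[j])) > cutoff
    }


# Synthetic data: variable 1 tracks variable 0, variable 2 is independent noise.
data_rng = random.Random(1)
data = []
for _ in range(76):
    x0 = data_rng.gauss(0, 1)
    data.append((x0, x0 + 0.3 * data_rng.gauss(0, 1), data_rng.gauss(0, 1)))

confidence = bootstrap_confidence(data, toy_learner)
for feature, value in sorted(confidence.items()):
    print(feature, round(value, 3))
```

Passing the learner in as a function keeps the resampling loop independent of how networks and features are actually produced.
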
Testing for Significance
• We run our procedure on randomized data where we reshuffled the order of values for each gene
• Histograms of number of Markov features at each confidence level
[Figure: two histograms, Original Data and Randomized Data]
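
The randomization described here, reshuffling the order of values for each gene, can be sketched as follows. The code assumes the data are stored as a list of rows (one per experiment) with one column per gene; the function names and the example confidence values are hypothetical. Shuffling each column independently keeps every gene's own distribution of values but breaks the association between genes. A small helper counts how many features exceed a confidence threshold t, so that real and randomized runs can be compared.

```python
import random


def shuffle_each_gene(matrix, seed=0):
    """Return a copy of an experiments-by-genes matrix in which every gene's
    column is independently permuted across experiments."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        rng.shuffle(column)
        columns.append(column)
    return [[columns[j][i] for j in range(n_cols)] for i in range(n_rows)]


def count_above(confidences, t):
    """Number of features whose confidence is above the threshold t."""
    return sum(1 for value in confidences.values() if value > t)


# Tiny made-up matrix: 4 experiments (rows) by 3 genes (columns).
original = [
    [0.1, 1.0, -0.3],
    [0.4, 0.8, 0.2],
    [-0.2, 1.5, 0.9],
    [0.7, 0.6, -1.1],
]
randomized = shuffle_each_gene(original)

# Hypothetical confidence values, only to show the helper.
example_confidences = {("g1", "g2"): 0.92, ("g1", "g3"): 0.35, ("g2", "g3"): 0.61}
print(randomized)
print(count_above(example_confidences, 0.5))
```
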

Testing for Significance
• We run our procedure on randomized data where we reshuffled the order of values for each gene
[Figure: Markov w/ Gaussian Models. Features with Confidence above t, plotted against t for Random and Real data (two panels)]

Testing for Significance
[Figure: Markov w/ Multinomial Models. Features with Confidence above t, plotted against t for Random and Real data (two panels)]

Local Map

Finding Key Genes
Key gene: a gene that precedes many other genes
• YLR183C
• MCD1: Mitotic Chromosome Determinant
• RAD27: DNA repair protein
• CLN2: role in cell cycle START
• SRO4: involved in cellular polarization during budding
• YOX1: Homeodomain protein that binds leu-tRNA gene
• POL30: required for DNA replication and repair
• YLR467W
• CDC5
• MSH6: Homolog of the human GTBP protein
• YML119W
• CLN1: role in cell cycle START


Future Work
• Finding suitable local distribution models
• Correct handling of hidden variables
  - Can we recognize hidden causes of coordinated regulation events?
• Incorporating prior knowledge
  - Incorporate large mass of biological knowledge, and insight from sequence/structure databases
• Abstraction
  - Combine with cluster analysis of higher confidence conclusions

Slide 5

Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem

.

Central Dogma

Transcription

Translation

mRNA
Gene

Cells express different subset of the genes
In different tissues and under different conditions

Protein

Microarrays (aka “DNA chips”)
 New

technological breakthrough:
 Measure RNA expression levels of thousands
of genes in one experiment
 Measure expression on
a genomic scale
 Opens up new
experimental designs
 Many major labs are using,
or will use this technology
in the near future

The Problem

Experiments

j

Genes

i

Goal:
Aij - the mRNA level of gene j in experiment i
 Learn regulatory/metabolic networks
 Identify causal sources of the biological
phenomena of interest

Our Approach
 Characterize

statistical relationships between
expression patterns of different genes
 Beyond pair-wise interactions



Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts

We build on the language of Bayesian networks

Network: Example
Noisy stochastic process:
Example: Pedigree Homer
 A node represents
an individual’s
genotype
Bart
 Modeling


Marge

Lisa

Maggie

assumptions:

Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations

Ancestor

Network Structure

Parent

Generalizing to DAGs:


A child is conditionally
independent from its
non-descendents, given the
value of its parents

Y1

Y2

X

Often a natural assumption
for causal processes


if we believe that we capture
the relevant state of each
intermediate stage.

Non-descendent
Descendent

Local Probabilities
 Associated

with each variable Xi is a conditional
probability distribution P(Xi|Pai:)
X P(Y |X)
 Discrete variables:
X
x 0.9 0.1
Multinomial distribution
x

Y
variables:
Choice: for example linear gaussian

P(Y | X)

 Continuous

X

Y

0.3 0.7

Bayesian Network Semantics
B

E
R

A
C

Qualitative part
DAG specifies
conditional
independence
statements

Quantitative part
+

local
probability
models

=

Unique joint
distribution
over domain

 Compact



& efficient representation:
 k parents  O(2kn) vs. O(2n) params
parameters pertain to local interactions

P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)

Why Bayesian Networks?
Bayesian Networks:
 Flexible representation of dependency structure
of multivariate distributions
 Natural for modeling processes with local
interactions
Learning of Bayesian Networks
 Can learn dependencies from observations
 Handles stochastic processes:



“true” stochastic behavior
noise in measurements

Modeling Regulatory Interactions
Variables of interest:
 Expression levels of genes
 Concentration levels of proteins (proteomics!)
 Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
 Phenotype information
 …
Bayesian Network Structure:
 Capture dependencies among these variables

Examples
Measured expression
level of each gene

Gene interaction

Random variables

Probabilistic
dependencies

Interactions are represented by a graph:
 Each gene is represented by a node in the graph
 Edges between the nodes represent direct
dependency
X
A

B

A

B

More Complex Examples
 Dependencies

can be mediated through other

nodes
B
A
A

C

C

Common cause
 Common

B
Intermediate gene

effects can imply conditional dependence
A

B
C

Outline of Our Approach
Bayesian Network
Learning Algorithm

Expression data
B

E
R

A
C

Use learned network to make predictions about
structure of the interactions between genes

Experiment
Data from Spellman et al.
(Mol.Bio. of the Cell 1998)
 Contains 76 samples of all the
yeast genome:
 Different methods for
synchronizing cell-cycle in
yeast
 Time series at few minutes
(5-20min) intervals
 Spellman et al. identified 800
cell-cycle regulated genes.

Methods
 Treat

samples as IID (ignoring temporal order)

Experiment 1:
 Discretized into three levels of expression
0

-

-0.5

+

0.5

Log(ratio to control)

 Learn

multinomial probabilities
Experiment 2:
 Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used

Network Learned

Challenge: Statistical Significance
Sparse Data
 Small number of samples
 “Flat posterior” -- many networks fit the data

Solution
 estimate confidence in network features
 Two types of features
 Markov neighbors: X directly interacts with Y
 Order relations: X is an ancestor of Y

Confidence Estimates
B

E

Bootstrap approach
[FGW, UAI99]

D1

Learn

R

A
C

E

D

resample

D2

Learn

R

A
C

...
Dm
Estimate:

B

C (f ) 

1

m

Learn

E
R

B
A
C

m

 1f
i 1

 Gi 

Testing for Significance
 We

run our procedure on randomized data where
we reshuffled the order of values for each gene
 Histograms of number of Markov features at each
confidence level

Original Data

Randomized Data

Testing for Significance
 We

run our procedure on randomized data where
we reshuffled the order of values for each gene
Markov w/ Gaussian Models

Features with Confidence above t

4000

500
450

3500

Random
Real

Random
Real

400
350

3000

300

2500

250

200

2000

150

1500

100

50

1000
500
0
0.1

0.2

0.3

0.4

0.5

0.6

t

0
0.1

0.2

0.7

0.8

0.3

0.9

0.4

0.5

1

0.6

0.7

0.8

0.9

1

Testing for Significance

Markov w/ Multinomial Models
Features with Confidence above t

250
1400

Random
Real

Random
Real
200

1200
1000

150

800

100

600

50

400
0
0.1

200

0.2

0.3

0.4

0
0.1

0.2

0.3

0.4

0.5

0.6

t

0.7

0.8

0.9

1

0.5

0.6

0.7

0.8

0.9

1

Local Map

Finding Key Genes
Key gene: a gene that precedes many other genes
• YLR183C
• MCD1: Mitotic Chromosome Determinant
• RAD27: DNA repair protein
• CLN2: role in cell cycle START
• SRO4: involved in cellular polarization during budding
• YOX1: Homeodomain protein that binds leu-tRNA gene
• POL30: required for DNA replication and repair
• YLR467W
• CDC5
• MSH6: Homolog of the human GTBP protein
• YML119W
• CLN1: role in cell cycle START
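
One simple way to rank candidate key genes from order-relation confidences is sketched below. It is an assumption-laden simplification: it only counts, for each gene, how many other genes it precedes with confidence at or above a threshold, and the scoring actually used by the authors may differ. The gene names and confidence values are placeholders.

```python
from collections import Counter

def rank_key_genes(order_confidence, threshold=0.8):
    """order_confidence maps (x, y) to the confidence that x is an ancestor of y."""
    counts = Counter()
    for (ancestor, _descendant), conf in order_confidence.items():
        if conf >= threshold:
            counts[ancestor] += 1   # one more gene confidently preceded
    return counts.most_common()     # highest count first

# Placeholder genes and confidences, for illustration only.
order_confidence = {
    ("geneA", "geneB"): 0.90,
    ("geneA", "geneC"): 0.85,
    ("geneB", "geneC"): 0.95,
    ("geneC", "geneA"): 0.10,
}
ranking = rank_key_genes(order_confidence)
```

Genes at the top of the ranking are those that confidently precede the most other genes.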

Future Work
• Finding suitable local distribution models
• Correct handling of hidden variables
  • Can we recognize hidden causes of coordinated regulation events?
• Incorporating prior knowledge
  • Incorporate large mass of biological knowledge, and insight from sequence/structure databases
• Abstraction
  • Combine with cluster analysis of higher confidence conclusions

Slide 10

Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem

.

Central Dogma

Transcription

Translation

mRNA
Gene

Cells express different subset of the genes
In different tissues and under different conditions

Protein

Microarrays (aka “DNA chips”)
 New

technological breakthrough:
 Measure RNA expression levels of thousands
of genes in one experiment
 Measure expression on
a genomic scale
 Opens up new
experimental designs
 Many major labs are using,
or will use this technology
in the near future

The Problem

Experiments

j

Genes

i

Goal:
Aij - the mRNA level of gene j in experiment i
 Learn regulatory/metabolic networks
 Identify causal sources of the biological
phenomena of interest

Our Approach
 Characterize

statistical relationships between
expression patterns of different genes
 Beyond pair-wise interactions



Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts

We build on the language of Bayesian networks

Network: Example
Noisy stochastic process:
Example: Pedigree Homer
 A node represents
an individual’s
genotype
Bart
 Modeling


Marge

Lisa

Maggie

assumptions:

Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations

Ancestor

Network Structure

Parent

Generalizing to DAGs:


A child is conditionally
independent from its
non-descendents, given the
value of its parents

Y1

Y2

X

Often a natural assumption
for causal processes


if we believe that we capture
the relevant state of each
intermediate stage.

Non-descendent
Descendent

Local Probabilities
 Associated

with each variable Xi is a conditional
probability distribution P(Xi|Pai:)
X P(Y |X)
 Discrete variables:
X
x 0.9 0.1
Multinomial distribution
x

Y
variables:
Choice: for example linear gaussian

P(Y | X)

 Continuous

X

Y

0.3 0.7

Bayesian Network Semantics
B

E
R

A
C

Qualitative part
DAG specifies
conditional
independence
statements

Quantitative part
+

local
probability
models

=

Unique joint
distribution
over domain

 Compact



& efficient representation:
 k parents  O(2kn) vs. O(2n) params
parameters pertain to local interactions

P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)

Why Bayesian Networks?
Bayesian Networks:
 Flexible representation of dependency structure
of multivariate distributions
 Natural for modeling processes with local
interactions
Learning of Bayesian Networks
 Can learn dependencies from observations
 Handles stochastic processes:



“true” stochastic behavior
noise in measurements

Modeling Regulatory Interactions
Variables of interest:
 Expression levels of genes
 Concentration levels of proteins (proteomics!)
 Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
 Phenotype information
 …
Bayesian Network Structure:
 Capture dependencies among these variables

Examples
Measured expression
level of each gene

Gene interaction

Random variables

Probabilistic
dependencies

Interactions are represented by a graph:
 Each gene is represented by a node in the graph
 Edges between the nodes represent direct
dependency
X
A

B

A

B

More Complex Examples
 Dependencies

can be mediated through other

nodes
B
A
A

C

C

Common cause
 Common

B
Intermediate gene

effects can imply conditional dependence
A

B
C

Outline of Our Approach
Bayesian Network
Learning Algorithm

Expression data
B

E
R

A
C

Use learned network to make predictions about
structure of the interactions between genes

Experiment
Data from Spellman et al.
(Mol.Bio. of the Cell 1998)
 Contains 76 samples of all the
yeast genome:
 Different methods for
synchronizing cell-cycle in
yeast
 Time series at few minutes
(5-20min) intervals
 Spellman et al. identified 800
cell-cycle regulated genes.

Methods
 Treat

samples as IID (ignoring temporal order)

Experiment 1:
 Discretized into three levels of expression
0

-

-0.5

+

0.5

Log(ratio to control)

 Learn

multinomial probabilities
Experiment 2:
 Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used

Network Learned

Challenge: Statistical Significance
Sparse Data
 Small number of samples
 “Flat posterior” -- many networks fit the data

Solution
 estimate confidence in network features
 Two types of features
 Markov neighbors: X directly interacts with Y
 Order relations: X is an ancestor of Y

Confidence Estimates
B

E

Bootstrap approach
[FGW, UAI99]

D1

Learn

R

A
C

E

D

resample

D2

Learn

R

A
C

...
Dm
Estimate:

B

C (f ) 

1

m

Learn

E
R

B
A
C

m

 1f
i 1

 Gi 

Testing for Significance
 We

run our procedure on randomized data where
we reshuffled the order of values for each gene
 Histograms of number of Markov features at each
confidence level

Original Data

Randomized Data

Testing for Significance
 We

run our procedure on randomized data where
we reshuffled the order of values for each gene
Markov w/ Gaussian Models

Features with Confidence above t

4000

500
450

3500

Random
Real

Random
Real

400
350

3000

300

2500

250

200

2000

150

1500

100

50

1000
500
0
0.1

0.2

0.3

0.4

0.5

0.6

t

0
0.1

0.2

0.7

0.8

0.3

0.9

0.4

0.5

1

0.6

0.7

0.8

0.9

1

Testing for Significance
[Figure: "Markov w/ Multinomial Models". Number of features with confidence above t, plotted against t (0.1 to 1), for Random and Real data]

Local Map

Finding Key Genes
Key gene: a gene that precedes many other genes
• YLR183C
• MCD1 Mitotic Chromosome Determinant
• RAD27 DNA repair protein
• CLN2 role in cell cycle START
• SRO4 involved in cellular polarization during budding
• YOX1 Homeodomain protein that binds leu-tRNA gene
• POL30 required for DNA replication and repair
• YLR467W
• CDC5
• MSH6 Homolog of the human GTBP protein
• YML119W
• CLN1 role in cell cycle START
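
One way to make "precedes many other genes" concrete is to count, for each gene, how many order relations it heads with high confidence. The slide does not state the scoring rule, so the function name, the 0.7 threshold, and the simple count below are all assumptions for illustration, and the gene labels X, Y, Z are placeholders rather than real data.

```python
def key_gene_scores(order_confidence, threshold=0.7):
    """Rank genes by how many other genes they precede with confidence >= threshold.

    order_confidence maps (x, y) to the confidence that x is an ancestor of y.
    """
    scores = {}
    for (x, _y), conf in order_confidence.items():
        if conf >= threshold:
            scores[x] = scores.get(x, 0) + 1
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

toy = {("X", "Y"): 0.9, ("X", "Z"): 0.8, ("Y", "Z"): 0.75, ("Z", "Y"): 0.2}
ranking = key_gene_scores(toy)
```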


Future Work
• Finding suitable local distribution models
• Correct handling of hidden variables
  - Can we recognize hidden causes of coordinated regulation events?
• Incorporating prior knowledge
  - Incorporate large mass of biological knowledge, and insight from sequence/structure databases
• Abstraction
  - Combine with cluster analysis of higher confidence conclusions

Slide 11

Bayesian Network Semantics
[Figure: example network over nodes B, E, R, A, C]
Qualitative part (DAG specifies conditional independence statements) + Quantitative part (local probability models) = Unique joint distribution over domain
• Compact & efficient representation:
  - k parents ⇒ $O(2^k n)$ vs. $O(2^n)$ params
  - parameters pertain to local interactions

P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)
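
The second factorization can be checked numerically with a small sketch. All variables are assumed binary, and every probability value below is made up purely for illustration (none comes from the slides). The code builds the joint from the five local tables, confirms it sums to 1, and compares the parameter count with that of a full joint table.

```python
from itertools import product

# Hypothetical local models: each entry is P(variable = 1 | parent values).
p_b = 0.01
p_e = 0.02
p_r_given_e = {0: 0.001, 1: 0.9}
p_a_given_be = {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95}
p_c_given_a = {0: 0.05, 1: 0.8}

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(c, a, r, e, b):
    """P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)."""
    return (bern(p_b, b) * bern(p_e, e) * bern(p_r_given_e[e], r)
            * bern(p_a_given_be[(b, e)], a) * bern(p_c_given_a[a], c))

total = sum(joint(*values) for values in product((0, 1), repeat=5))

# One free parameter per parent configuration of each binary variable.
factored_params = 1 + 1 + len(p_r_given_e) + len(p_a_given_be) + len(p_c_given_a)
full_table_params = 2 ** 5 - 1
```

With five binary variables the factored form needs 10 numbers where an unrestricted joint table needs 31, and the gap widens quickly as variables are added while the number of parents per node stays small.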

Why Bayesian Networks?
Bayesian Networks:
• Flexible representation of dependency structure of multivariate distributions
• Natural for modeling processes with local interactions
Learning of Bayesian Networks
• Can learn dependencies from observations
• Handles stochastic processes:
  - “true” stochastic behavior
  - noise in measurements

Modeling Regulatory Interactions
Variables of interest:
• Expression levels of genes
• Concentration levels of proteins (proteomics!)
• Exogenous variables: nutrient levels, metabolite levels, temperature
• Phenotype information
• …
Bayesian Network Structure:
• Capture dependencies among these variables

Examples
• Measured expression level of each gene → Random variables
• Gene interaction → Probabilistic dependencies

Interactions are represented by a graph:
• Each gene is represented by a node in the graph
• Edges between the nodes represent direct dependency
[Figure: small example graphs over nodes A and B]

More Complex Examples
• Dependencies can be mediated through other nodes
[Figure: two three-node graphs over A, B, C, labeled "Common cause" and "Intermediate gene"]
• Common effects can imply conditional dependence
[Figure: three-node graph over A, B, C]
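
The last point, that a common effect can make its causes conditionally dependent, can be illustrated by exact enumeration on a tiny hypothetical model. Here A and B are independent binary causes of C; the structure A → C ← B and all probability values are assumptions chosen for illustration, not taken from the slides.

```python
from itertools import product

P_A1, P_B1 = 0.5, 0.5          # hypothetical priors for the two causes

def p_c1(a, b):
    """Hypothetical P(C = 1 | A, B): C is likely on when either cause is on."""
    return 0.9 if (a or b) else 0.1

def joint(a, b, c):
    pa = P_A1 if a else 1 - P_A1
    pb = P_B1 if b else 1 - P_B1
    pc = p_c1(a, b) if c else 1 - p_c1(a, b)
    return pa * pb * pc

def prob(event, given=lambda a, b, c: True):
    """P(event | given), computed by summing the joint over all 8 states."""
    states = list(product((0, 1), repeat=3))
    den = sum(joint(*s) for s in states if given(*s))
    num = sum(joint(*s) for s in states if given(*s) and event(*s))
    return num / den

a_on = lambda a, b, c: a == 1
# Without observing C, learning B tells us nothing about A.
p_a = prob(a_on)
p_a_given_b = prob(a_on, lambda a, b, c: b == 1)
# Once C is observed, learning B changes our belief about A.
p_a_given_c = prob(a_on, lambda a, b, c: c == 1)
p_a_given_bc = prob(a_on, lambda a, b, c: b == 1 and c == 1)
```

In this toy model `p_a` equals `p_a_given_b`, while `p_a_given_c` and `p_a_given_bc` differ: observing the common effect couples its otherwise independent causes.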

Outline of Our Approach
Expression data → Bayesian Network Learning Algorithm → learned network
[Figure: learned network over nodes B, E, R, A, C]
Use learned network to make predictions about structure of the interactions between genes

Slide 15

Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem

.

Central Dogma

Transcription

Translation

mRNA
Gene

Cells express different subset of the genes
In different tissues and under different conditions

Protein

Microarrays (aka “DNA chips”)
 New

technological breakthrough:
 Measure RNA expression levels of thousands
of genes in one experiment
 Measure expression on
a genomic scale
 Opens up new
experimental designs
 Many major labs are using,
or will use this technology
in the near future

The Problem

Experiments

j

Genes

i

Goal:
Aij - the mRNA level of gene j in experiment i
 Learn regulatory/metabolic networks
 Identify causal sources of the biological
phenomena of interest

Our Approach
 Characterize

statistical relationships between
expression patterns of different genes
 Beyond pair-wise interactions



Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts

We build on the language of Bayesian networks

Network: Example
Noisy stochastic process:
Example: Pedigree Homer
 A node represents
an individual’s
genotype
Bart
 Modeling


Marge

Lisa

Maggie

assumptions:

Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations

Ancestor

Network Structure

Parent

Generalizing to DAGs:


A child is conditionally
independent from its
non-descendents, given the
value of its parents

Y1

Y2

X

Often a natural assumption
for causal processes


if we believe that we capture
the relevant state of each
intermediate stage.

Non-descendent
Descendent

Local Probabilities
 Associated

with each variable Xi is a conditional
probability distribution P(Xi|Pai:)
X P(Y |X)
 Discrete variables:
X
x 0.9 0.1
Multinomial distribution
x

Y
variables:
Choice: for example linear gaussian

P(Y | X)

 Continuous

X

Y

0.3 0.7

Bayesian Network Semantics
B

E
R

A
C

Qualitative part
DAG specifies
conditional
independence
statements

Quantitative part
+

local
probability
models

=

Unique joint
distribution
over domain

 Compact



& efficient representation:
 k parents  O(2kn) vs. O(2n) params
parameters pertain to local interactions

P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)

Why Bayesian Networks?
Bayesian Networks:
 Flexible representation of dependency structure
of multivariate distributions
 Natural for modeling processes with local
interactions
Learning of Bayesian Networks
 Can learn dependencies from observations
 Handles stochastic processes:



“true” stochastic behavior
noise in measurements

Modeling Regulatory Interactions
Variables of interest:
 Expression levels of genes
 Concentration levels of proteins (proteomics!)
 Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
 Phenotype information
 …
Bayesian Network Structure:
 Capture dependencies among these variables

Examples
Measured expression
level of each gene

Gene interaction

Random variables

Probabilistic
dependencies

Interactions are represented by a graph:
 Each gene is represented by a node in the graph
 Edges between the nodes represent direct
dependency
X
A

B

A

B

More Complex Examples
 Dependencies

can be mediated through other

nodes
B
A
A

C

C

Common cause
 Common

B
Intermediate gene

effects can imply conditional dependence
A

B
C

Outline of Our Approach
Bayesian Network
Learning Algorithm

Expression data
B

E
R

A
C

Use learned network to make predictions about
structure of the interactions between genes

Experiment
Data from Spellman et al.
(Mol.Bio. of the Cell 1998)
 Contains 76 samples of all the
yeast genome:
 Different methods for
synchronizing cell-cycle in
yeast
 Time series at few minutes
(5-20min) intervals
 Spellman et al. identified 800
cell-cycle regulated genes.

Methods
 Treat

samples as IID (ignoring temporal order)

Experiment 1:
 Discretized into three levels of expression
0

-

-0.5

+

0.5

Log(ratio to control)

 Learn

multinomial probabilities
Experiment 2:
 Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used

Network Learned

Challenge: Statistical Significance
Sparse Data
 Small number of samples
 “Flat posterior” -- many networks fit the data

Solution
 estimate confidence in network features
 Two types of features
 Markov neighbors: X directly interacts with Y
 Order relations: X is an ancestor of Y

Confidence Estimates
B

E

Bootstrap approach
[FGW, UAI99]

D1

Learn

R

A
C

E

D

resample

D2

Learn

R

A
C

...
Dm
Estimate:

B

C (f ) 

1

m

Learn

E
R

B
A
C

m

 1f
i 1

 Gi 

Testing for Significance
 We

run our procedure on randomized data where
we reshuffled the order of values for each gene
 Histograms of number of Markov features at each
confidence level

Original Data

Randomized Data

Testing for Significance
• We run our procedure on randomized data where we reshuffled the order of values for each gene
Markov w/ Gaussian Models
[Figure: number of features with confidence above t, plotted against t, comparing Random and Real data]
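
The curves compared in these plots can be sketched as a count of features whose confidence exceeds each threshold t, computed once for the real data and once for the randomized data. The function name `count_above` and all confidence values below are invented for illustration; they are not results from the experiment.

```python
def count_above(confidences, thresholds):
    """For each threshold t, count the features whose confidence is strictly above t."""
    return [sum(1 for c in confidences if c > t) for t in thresholds]

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
# Hypothetical confidence values, invented purely for illustration.
real = [0.95, 0.9, 0.85, 0.6, 0.45, 0.3, 0.1]
random_run = [0.5, 0.35, 0.3, 0.2, 0.1, 0.1, 0.05]

real_counts = count_above(real, thresholds)
random_counts = count_above(random_run, thresholds)
print(real_counts)
print(random_counts)
```

A threshold at which the randomized count has dropped to near zero while the real count has not is a natural place to start trusting features.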

Testing for Significance

Markov w/ Multinomial Models
[Figure: number of features with confidence above t, plotted against t, comparing Random and Real data]

Local Map

Finding Key Genes
Key gene: a gene that precedes many other genes
• YLR183C
• MCD1 Mitotic Chromosome Determinant
• RAD27 DNA repair protein
• CLN2 role in cell cycle START
• SRO4 involved in cellular polarization during budding
• YOX1 Homeodomain protein that binds leu-tRNA gene
• POL30 required for DNA replication and repair
• YLR467W
• CDC5
• MSH6 Homolog of the human GTBP protein
• YML119W
• CLN1 role in cell cycle START
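
One way to make "precedes many other genes" concrete is to count, for each gene, how many other genes it is a confident ancestor of. This is only a sketch: the slide does not give the scoring rule, so the simple count, the 0.5 threshold, the function name `key_gene_scores`, and the placeholder genes g1..g4 with their confidence values are all assumptions.

```python
def key_gene_scores(order_confidence, threshold=0.5):
    """Count, for each gene X, the genes Y with confidence(X is an ancestor of Y) > threshold."""
    scores = {}
    for (x, y), conf in order_confidence.items():
        scores.setdefault(x, 0)
        scores.setdefault(y, 0)
        if conf > threshold:
            scores[x] += 1
    # Highest score first; ties broken alphabetically for a stable ranking.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical order-relation confidences among placeholder genes g1..g4.
order_confidence = {
    ("g1", "g2"): 0.9,
    ("g1", "g3"): 0.8,
    ("g1", "g4"): 0.7,
    ("g2", "g3"): 0.6,
    ("g3", "g4"): 0.2,
}
ranking = key_gene_scores(order_confidence)
print(ranking)
```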


Future Work
• Finding suitable local distribution models
• Correct handling of hidden variables
  ◦ Can we recognize hidden causes of coordinated regulation events?
• Incorporating prior knowledge
  ◦ Incorporate large mass of biological knowledge, and insight from sequence/structure databases
• Abstraction
  ◦ Combine with cluster analysis of higher confidence conclusions

Slide 20

Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem

.

Central Dogma

Transcription

Translation

mRNA
Gene

Cells express different subset of the genes
In different tissues and under different conditions

Protein

Microarrays (aka “DNA chips”)
 New

technological breakthrough:
 Measure RNA expression levels of thousands
of genes in one experiment
 Measure expression on
a genomic scale
 Opens up new
experimental designs
 Many major labs are using,
or will use this technology
in the near future

The Problem

Experiments

j

Genes

i

Goal:
Aij - the mRNA level of gene j in experiment i
 Learn regulatory/metabolic networks
 Identify causal sources of the biological
phenomena of interest

Our Approach
 Characterize

statistical relationships between
expression patterns of different genes
 Beyond pair-wise interactions



Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts

We build on the language of Bayesian networks

Network: Example
Noisy stochastic process:
Example: Pedigree Homer
 A node represents
an individual’s
genotype
Bart
 Modeling


Marge

Lisa

Maggie

assumptions:

Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations

Ancestor

Network Structure

Parent

Generalizing to DAGs:


A child is conditionally
independent from its
non-descendents, given the
value of its parents

Y1

Y2

X

Often a natural assumption
for causal processes


if we believe that we capture
the relevant state of each
intermediate stage.

Non-descendent
Descendent

Local Probabilities
 Associated

with each variable Xi is a conditional
probability distribution P(Xi|Pai:)
X P(Y |X)
 Discrete variables:
X
x 0.9 0.1
Multinomial distribution
x

Y
variables:
Choice: for example linear gaussian

P(Y | X)

 Continuous

X

Y

0.3 0.7

Bayesian Network Semantics
B

E
R

A
C

Qualitative part
DAG specifies
conditional
independence
statements

Quantitative part
+

local
probability
models

=

Unique joint
distribution
over domain

 Compact



& efficient representation:
 k parents  O(2kn) vs. O(2n) params
parameters pertain to local interactions

P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)

Why Bayesian Networks?
Bayesian Networks:
 Flexible representation of dependency structure
of multivariate distributions
 Natural for modeling processes with local
interactions
Learning of Bayesian Networks
 Can learn dependencies from observations
 Handles stochastic processes:



“true” stochastic behavior
noise in measurements

Modeling Regulatory Interactions
Variables of interest:
 Expression levels of genes
 Concentration levels of proteins (proteomics!)
 Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
 Phenotype information
 …
Bayesian Network Structure:
 Capture dependencies among these variables

Examples
Measured expression
level of each gene

Gene interaction

Random variables

Probabilistic
dependencies

Interactions are represented by a graph:
 Each gene is represented by a node in the graph
 Edges between the nodes represent direct
dependency
X
A

B

A

B

More Complex Examples
 Dependencies

can be mediated through other

nodes
B
A
A

C

C

Common cause
 Common

B
Intermediate gene

effects can imply conditional dependence
A

B
C

Outline of Our Approach
Bayesian Network
Learning Algorithm

Expression data
B

E
R

A
C

Use learned network to make predictions about
structure of the interactions between genes

Experiment
Data from Spellman et al.
(Mol.Bio. of the Cell 1998)
 Contains 76 samples of all the
yeast genome:
 Different methods for
synchronizing cell-cycle in
yeast
 Time series at few minutes
(5-20min) intervals
 Spellman et al. identified 800
cell-cycle regulated genes.

Methods
 Treat

samples as IID (ignoring temporal order)

Experiment 1:
 Discretized into three levels of expression
0

-

-0.5

+

0.5

Log(ratio to control)

 Learn

multinomial probabilities
Experiment 2:
 Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used

Network Learned

Challenge: Statistical Significance
Sparse Data
 Small number of samples
 “Flat posterior” -- many networks fit the data

Solution
 estimate confidence in network features
 Two types of features
 Markov neighbors: X directly interacts with Y
 Order relations: X is an ancestor of Y

Confidence Estimates
B

E

Bootstrap approach
[FGW, UAI99]

D1

Learn

R

A
C

E

D

resample

D2

Learn

R

A
C

...
Dm
Estimate:

B

C (f ) 

1

m

Learn

E
R

B
A
C

m

 1f
i 1

 Gi 

Testing for Significance
• We run our procedure on randomized data in which the order of values for each gene is reshuffled
• Histograms of the number of Markov features at each confidence level
[Figure: histograms for "Original Data" and "Randomized Data"]
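
The randomization step can be sketched as below, assuming the data are stored as a list of samples (rows) with one value per gene (columns); the function name `shuffle_each_gene` is hypothetical. Each gene keeps its own distribution of values, but any dependence between genes is destroyed, so features found with high confidence in the shuffled data indicate how many spurious features to expect.

```python
import random

def shuffle_each_gene(data, seed=0):
    """Return a copy of `data` (rows = samples, columns = genes) in which the
    values of every gene are permuted independently across samples."""
    rng = random.Random(seed)
    n_genes = len(data[0])
    columns = []
    for j in range(n_genes):
        col = [row[j] for row in data]
        rng.shuffle(col)  # independent permutation for this gene
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(len(data))]

data = [(0.1, 1.0), (0.2, 2.0), (0.3, 3.0), (0.4, 4.0)]
randomized = shuffle_each_gene(data)
print(randomized)
```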

Testing for Significance
• We run our procedure on randomized data in which the order of values for each gene is reshuffled
Markov w/ Gaussian Models
[Figure: number of features with confidence above t, plotted against t, for random and real data]

Testing for Significance
Markov w/ Multinomial Models
[Figure: number of features with confidence above t, plotted against t, for random and real data]

Local Map

Finding Key Genes
Key gene: a gene that precedes many other genes
• YLR183C
• MCD1: Mitotic Chromosome Determinant
• RAD27: DNA repair protein
• CLN2: role in cell cycle START
• SRO4: involved in cellular polarization during budding
• YOX1: Homeodomain protein that binds leu-tRNA gene
• POL30: required for DNA replication and repair
• YLR467W
• CDC5
• MSH6: Homolog of the human GTBP protein
• YML119W
• CLN1: role in cell cycle START


Future Work
• Finding suitable local distribution models
• Correct handling of hidden variables
  - Can we recognize hidden causes of coordinated regulation events?
• Incorporating prior knowledge
  - Incorporate large mass of biological knowledge, and insight from sequence/structure databases
• Abstraction
  - Combine with cluster analysis of higher confidence conclusions

Slide 25

Using Bayesian Networks to
Analyze Expression Data
N. Friedman M. Linial I. Nachman D. Pe’er
Hebrew University, Jerusalem

.

Central Dogma

Transcription

Translation

mRNA
Gene

Cells express different subset of the genes
In different tissues and under different conditions

Protein

Microarrays (aka “DNA chips”)
 New

technological breakthrough:
 Measure RNA expression levels of thousands
of genes in one experiment
 Measure expression on
a genomic scale
 Opens up new
experimental designs
 Many major labs are using,
or will use this technology
in the near future

The Problem

Experiments

j

Genes

i

Goal:
Aij - the mRNA level of gene j in experiment i
 Learn regulatory/metabolic networks
 Identify causal sources of the biological
phenomena of interest

Our Approach
 Characterize

statistical relationships between
expression patterns of different genes
 Beyond pair-wise interactions



Many interactions are explained by intermediate factors
Regulation involves combined effects of several geneproducts

We build on the language of Bayesian networks

Network: Example
Noisy stochastic process:
Example: Pedigree Homer
 A node represents
an individual’s
genotype
Bart
 Modeling


Marge

Lisa

Maggie

assumptions:

Ancestors can effect descendants' genotype only by
passing genetic materials through intermediate
generations

Ancestor

Network Structure

Parent

Generalizing to DAGs:


A child is conditionally
independent from its
non-descendents, given the
value of its parents

Y1

Y2

X

Often a natural assumption
for causal processes


if we believe that we capture
the relevant state of each
intermediate stage.

Non-descendent
Descendent

Local Probabilities
 Associated

with each variable Xi is a conditional
probability distribution P(Xi|Pai:)
X P(Y |X)
 Discrete variables:
X
x 0.9 0.1
Multinomial distribution
x

Y
variables:
Choice: for example linear gaussian

P(Y | X)

 Continuous

X

Y

0.3 0.7

Bayesian Network Semantics
B

E
R

A
C

Qualitative part
DAG specifies
conditional
independence
statements

Quantitative part
+

local
probability
models

=

Unique joint
distribution
over domain

 Compact



& efficient representation:
 k parents  O(2kn) vs. O(2n) params
parameters pertain to local interactions

P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)

Why Bayesian Networks?
Bayesian Networks:
 Flexible representation of dependency structure
of multivariate distributions
 Natural for modeling processes with local
interactions
Learning of Bayesian Networks
 Can learn dependencies from observations
 Handles stochastic processes:



“true” stochastic behavior
noise in measurements

Modeling Regulatory Interactions
Variables of interest:
 Expression levels of genes
 Concentration levels of proteins (proteomics!)
 Exogenous variables: Nutrient levels, Metabolite
Levels, Temperature,
 Phenotype information
 …
Bayesian Network Structure:
 Capture dependencies among these variables

Examples
Measured expression
level of each gene

Gene interaction

Random variables

Probabilistic
dependencies

Interactions are represented by a graph:
 Each gene is represented by a node in the graph
 Edges between the nodes represent direct
dependency
X
A

B

A

B

More Complex Examples
 Dependencies

can be mediated through other

nodes
B
A
A

C

C

Common cause
 Common

B
Intermediate gene

effects can imply conditional dependence
A

B
C

Outline of Our Approach
Bayesian Network
Learning Algorithm

Expression data
B

E
R

A
C

Use learned network to make predictions about
structure of the interactions between genes

Experiment
Data from Spellman et al.
(Mol.Bio. of the Cell 1998)
 Contains 76 samples of all the
yeast genome:
 Different methods for
synchronizing cell-cycle in
yeast
 Time series at few minutes
(5-20min) intervals
 Spellman et al. identified 800
cell-cycle regulated genes.

Methods
 Treat

samples as IID (ignoring temporal order)

Experiment 1:
 Discretized into three levels of expression
0

-

-0.5

+

0.5

Log(ratio to control)

 Learn

multinomial probabilities
Experiment 2:
 Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used

Network Learned

Challenge: Statistical Significance
Sparse Data
 Small number of samples
 “Flat posterior” -- many networks fit the data

Solution
 Estimate confidence in network features
 Two types of features
   Markov neighbors: X directly interacts with Y
   Order relations: X is an ancestor of Y
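
Both feature types can be read off a learned graph. The sketch below assumes the network is given as a hypothetical dictionary `parents` that maps every node to a list of its parents. It also interprets "Markov neighbors" as Markov-blanket membership (parent, child, or co-parent of a shared child); the slide only says "directly interacts", so that reading is an assumption.

```python
from itertools import combinations


def ancestors_of(node, parents):
    """Return every node that has a directed path into `node`."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def order_relations(parents):
    """Ordered pairs (X, Y) such that X is an ancestor of Y."""
    return {(a, n) for n in parents for a in ancestors_of(n, parents)}


def markov_relations(parents):
    """Unordered pairs {X, Y} linked by an edge or sharing a child."""
    pairs = set()
    for child, pa in parents.items():
        for p in pa:
            pairs.add(frozenset((p, child)))   # parent-child edge
        for p, q in combinations(pa, 2):
            pairs.add(frozenset((p, q)))       # co-parents of `child`
    return pairs


# Toy network: A -> C <- B, C -> D
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
print(sorted(order_relations(parents)))
print(sorted(tuple(sorted(p)) for p in markov_relations(parents)))
```

In the toy network A is an ancestor of D but not a Markov neighbor of D, which shows that the two feature types capture different information.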

Confidence Estimates
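Bootstrap approach [FGW, UAI99]
[Figure: the dataset D is resampled into D1, D2, ..., Dm; a network over nodes B, E, R, A and C is learned from each resampled dataset]
Estimate:
$C(f) = \frac{1}{m} \sum_{i=1}^{m} 1\{f \in G_i\}$
where $G_i$ is the network learned from $D_i$.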
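
The bootstrap estimate can be sketched in a few lines. The learner here is a deliberately simple stand-in, not Bayesian network structure learning: `toy_learn_features` (a hypothetical name) reports a gene pair as a feature when the absolute Pearson correlation exceeds an arbitrary threshold. Only `bootstrap_confidence` corresponds to the formula for C(f); any real learner returning a set of features could be plugged in.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def toy_learn_features(samples, threshold=0.6):
    """Stand-in learner: gene pairs (i, j) with |correlation| >= threshold."""
    n_genes = len(samples[0])
    cols = [[row[g] for row in samples] for g in range(n_genes)]
    return {(i, j) for i, j in combinations(range(n_genes), 2)
            if abs(pearson(cols[i], cols[j])) >= threshold}


def bootstrap_confidence(samples, learn, m=100, seed=0):
    """C(f) = fraction of the m bootstrap runs whose result contains f."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        # Resample with replacement to the original sample size.
        resampled = [rng.choice(samples) for _ in samples]
        for f in learn(resampled):
            counts[f] = counts.get(f, 0) + 1
    return {f: c / m for f, c in counts.items()}


# Made-up data: gene 1 closely tracks gene 0, gene 2 does not.
samples = [(i, 2 * i + (-1) ** i, (7 * i) % 5) for i in range(20)]
conf = bootstrap_confidence(samples, toy_learn_features, m=50, seed=1)
print(conf)
```

Features that never appear in any resampled run are simply absent from the result, which corresponds to a confidence of 0.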

Testing for Significance
 We run our procedure on randomized data where we reshuffled the order of values for each gene
 Histograms of number of Markov features at each confidence level
[Figure: histograms for "Original Data" and "Randomized Data"]
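
The randomization step can be sketched as below, assuming the data is a list of experiments, each a tuple with one value per gene. The function name `shuffle_within_genes` is hypothetical. Each gene keeps its own set of values, but the pairing of values across genes within an experiment is destroyed, so any dependency found afterwards is spurious.

```python
import random


def shuffle_within_genes(samples, seed=0):
    """Permute each gene's values across experiments, independently per gene."""
    rng = random.Random(seed)
    n_genes = len(samples[0])
    columns = []
    for g in range(n_genes):
        col = [row[g] for row in samples]
        rng.shuffle(col)  # in-place permutation of this gene's values
        columns.append(col)
    # Reassemble the rows (experiments) from the shuffled columns.
    return [tuple(col[i] for col in columns) for i in range(len(samples))]


# Made-up data: 6 experiments, 3 genes.
data = [(i, 10 * i, 100 * i) for i in range(6)]
randomized = shuffle_within_genes(data, seed=42)
print(randomized)
```

Running the same learning and confidence procedure on `randomized` gives the baseline against which features from the original data are compared.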

Testing for Significance
Markov w/ Gaussian Models
[Figure: number of features with confidence above t, plotted against the threshold t, for Random and Real data]

Testing for Significance

Markov w/ Multinomial Models
[Figure: number of features with confidence above t, plotted against the threshold t, for Random and Real data]

Local Map

Finding Key Genes
Key gene: a gene that precedes many other genes
 YLR183C
 MCD1 Mitotic Chromosome Determinant
 RAD27 DNA repair protein
 CLN2 role in cell cycle START
 SRO4 involved in cellular polarization during budding
 YOX1 Homeodomain protein that binds leu-tRNA gene
 POL30 required for DNA replication and repair
 YLR467W
 CDC5
 MSH6 Homolog of the human GTBP protein
 YML119W
 CLN1 role in cell cycle START
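
One simple way to turn "precedes many other genes" into a ranking is to count, for each gene, how many high-confidence order relations it heads. This is an illustrative assumption; the slide does not give the authors' scoring rule. The function name `rank_key_genes`, the 0.8 threshold, and the gene names and confidence values below are all made up.

```python
def rank_key_genes(order_confidence, threshold=0.8):
    """Rank genes by how many order relations they head with confidence >= threshold.

    `order_confidence` maps (ancestor, descendant) pairs to a confidence in [0, 1].
    """
    counts = {}
    for (ancestor, _descendant), conf in order_confidence.items():
        if conf >= threshold:
            counts[ancestor] = counts.get(ancestor, 0) + 1
    # Highest count first; ties broken alphabetically for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical confidences for four placeholder genes.
order_confidence = {
    ("geneA", "geneB"): 0.95,
    ("geneA", "geneC"): 0.90,
    ("geneA", "geneD"): 0.40,
    ("geneB", "geneC"): 0.85,
    ("geneD", "geneA"): 0.30,
}
print(rank_key_genes(order_confidence))  # [('geneA', 2), ('geneB', 1)]
```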
Future Work
 Finding suitable local distribution models
 Correct handling of hidden variables
   Can we recognize hidden causes of coordinated regulation events?
 Incorporating prior knowledge
   Incorporate large mass of biological knowledge, and insight from sequence/structure databases
 Abstraction
   Combine with cluster analysis of higher confidence conclusions
