
Rich Probabilistic Models for Gene Expression
Eran Segal (Stanford)
Ben Taskar (Stanford)
Audrey Gasch (Berkeley)
Nir Friedman (Hebrew University)
Daphne Koller (Stanford)
Our Goals
• Find patterns in gene expression data
Data Organization
[Figure: expression matrix with genes as rows (index i) and experiments as columns (index j); entries are marked as induced or repressed.]
A_ij: mRNA level of gene i in experiment j
Standard Clustering Organization
[Figure: expression matrix with Genes and Experiments reordered by standard clustering.]
Bi-Clustering Organization
[Figure: expression matrix with Genes and Experiments reordered by bi-clustering; one region is labeled "Undetected Similarity".]
Desired Organization
Detect similarities over subsets of genes and experiments
Note: rows and columns no longer correspond to genes and experiments
Incorporate Heterogeneous Data
• Find correlations directly
• Focus on novel discoveries
[Figure: data sources shown alongside the expression data: DNA sequence (CGCTA...), clinical information, annotations (GO, MIPS, YPD) and experimental details.]
Our Approach
[Figure: DNA sequence (CGCTA...), clinical information, experimental details and annotations (GO, MIPS, YPD) feed a LEARNER. Its output is a model over Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic, Exp. type, Exp. cluster and Level, together with hypotheses.]
Probabilistic Relational Models
(Koller & Pfeffer 98; Friedman, Getoor, Koller & Pfeffer 99)
[Figure: PRM schema with three classes: Gene (attribute Gene Cluster), Experiment (attribute Exp. cluster) and Expression (attribute Level).]
Resulting Bayesian Network
[Figure: the PRM schema (Gene: Gene Cluster; Experiment: Exp. cluster; Expression: Level) plus a concrete set of genes and experiments yields a ground Bayesian network over Gene Cluster1 to Gene Cluster3, Exp. Cluster1, Exp. Cluster2, and Level1,1 to Level3,2.]
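The unrolling shown on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `unroll` and the node-naming scheme are assumptions, and the only dependency encoded is the one the schema shows (each Level depends on its gene's cluster and its experiment's cluster).

```python
from itertools import product

def unroll(genes, experiments):
    """Ground the class-level dependencies for concrete genes and experiments.

    Returns a dict mapping each ground node to the list of its parents.
    """
    parents = {}
    for g in genes:
        parents[f"GeneCluster{g}"] = []      # one latent cluster per gene
    for e in experiments:
        parents[f"ExpCluster{e}"] = []       # one latent cluster per experiment
    for g, e in product(genes, experiments):
        # Level of gene g in experiment e depends on both cluster variables.
        parents[f"Level{g},{e}"] = [f"GeneCluster{g}", f"ExpCluster{e}"]
    return parents

network = unroll(genes=[1, 2, 3], experiments=[1, 2])
for node, pa in network.items():
    print(node, "<-", pa)
```

With three genes and two experiments this produces the 3 + 2 + 6 nodes named on the slide; the schema itself stays fixed no matter how many genes or experiments are supplied.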
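Probabilistic Relational Models
[Figure: PRM schema (Gene: Gene Cluster; Experiment: Exp. cluster; Expression: Level) with the CPD of Level shown as a table; the P(Level) curves for the two rows are drawn over Level values 0.8 and -0.7.]
CPD:
GCluster  ECluster  P(Level): μ    σ
1         1                   0.8  1.2
1         2                  -0.7  0.6
…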
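A table CPD of this kind can be sketched as a dictionary lookup. The two rows below are the ones on the slide; reading each pair as the mean and standard deviation of a Gaussian over Level is an assumption, and the function names are hypothetical.

```python
import math

# Hypothetical CPD table: (gene cluster, experiment cluster) -> (mean, std).
# Interpreting the slide's number pairs as Gaussian parameters is an assumption.
CPD = {
    (1, 1): (0.8, 1.2),
    (1, 2): (-0.7, 0.6),
}

def level_log_density(level, gene_cluster, exp_cluster, cpd=CPD):
    """Log density of an expression level under the Gaussian for its cluster pair."""
    mean, std = cpd[(gene_cluster, exp_cluster)]
    z = (level - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def most_likely_exp_cluster(level, gene_cluster, cpd=CPD):
    """Experiment cluster (among those tabulated) that best explains one level."""
    candidates = [ec for (gc, ec) in cpd if gc == gene_cluster]
    return max(candidates,
               key=lambda ec: level_log_density(level, gene_cluster, ec, cpd))

print(most_likely_exp_cluster(0.8, gene_cluster=1))
print(most_likely_exp_cluster(-0.7, gene_cluster=1))
```

Because every gene and experiment shares the same table, adding more genes grows the data but not the number of parameters.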
Adding Heterogeneous Data
• Annotations
• Binding sites
• Experimental details
[Figure: PRM schema extended with the gene attributes GCN4, HSF, Lipid and Endoplasmatic and the experiment attribute Exp. type, alongside Gene Cluster, Exp. cluster and Level.]
Resulting Bayesian Network
[Figure: the extended PRM schema (Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic; Experiment: Exp. type, Exp. cluster; Expression: Level) plus concrete data (DNA sequence, experimental details, annotations (GO, MIPS, YPD)) yields a ground Bayesian network. Its nodes are Gene Cluster, GCN4, HSF, Lipid and Endoplasmatic for genes 1 to 3, Exp. type and Exp. cluster for experiments 1 and 2, and Level1,1 to Level3,2.]
Problem: Exponential Blowup
[Figure: PRM schema in which Level has six parents: Gene Cluster (GC), Lipid (LP), Endoplasmatic (END), HSF, Exp. cluster (EC) and Exp. type (TYP).]
CPD table for Level:
GC  LP  END  HSF  EC  TYP  μ    σ
1   No  No   No   1   1    0.8  1.2
1   No  No   No   1   2    0.7  0.6
…
6 parents: 2^6 cases
k parents: 2^k cases!
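The count on this slide is easy to reproduce. A minimal sketch, assuming binary parents as the slide's 2^k figure does; `table_cpd_rows` is a hypothetical helper name.

```python
def table_cpd_rows(num_parents, values_per_parent=2):
    """Number of parent configurations a full table CPD must enumerate."""
    return values_per_parent ** num_parents

for k in (2, 6, 10, 20):
    print(f"{k:2d} binary parents -> {table_cpd_rows(k):,} cases")
```

Each case needs its own distribution over Level, so the amount of data available per case shrinks just as quickly as the table grows.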
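Solution: Context Specificity
[Figure: the Experiment attribute UV (Ultra Violet Light) and the Gene attribute DNA repair are parents of Level in Expression. Annotations: "DNA Damage" and "DNA repair genes transcribed".]
Table CPD for Level, one distribution per row, each drawn on an axis marked at 0:
UV = Yes, Repair = Yes
UV = Yes, Repair = No
UV = No, Repair = Yes
UV = No, Repair = No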
Solution: Context Specificity
[Figure: same schema (Experiment: UV; Gene: DNA repair; Expression: Level), annotated "DNA Damage" and "DNA repair genes transcribed". Distributions over Level are shown for UV = Yes and UV = No, each on an axis marked at 0.]
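Solution: Context Specificity
[Figure: same schema (Experiment: UV; Gene: DNA repair; Expression: Level), annotated "DNA Damage" and "DNA repair genes transcribed". The CPD for Level is now a tree with two tests, UV = Yes and Repair = Yes, each with a true and a false branch, and three leaf distributions, each drawn on an axis marked at 0.]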
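A tree CPD like the one on this slide can be sketched as a small recursive structure. This is an illustration under stated assumptions: the class names, the `lookup` function and all leaf parameters are invented, and placing the Repair test under the UV = Yes branch is a reading of the figure rather than something the transcript states outright.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    mean: float   # hypothetical Gaussian parameters for P(Level) at this leaf
    std: float

@dataclass
class Split:
    test: Callable[[dict], bool]      # question asked about the parent values
    if_true: Union["Split", Leaf]
    if_false: Union["Split", Leaf]

def lookup(node, context):
    """Walk from the root to the leaf that applies in this context."""
    while isinstance(node, Split):
        node = node.if_true if node.test(context) else node.if_false
    return node

# Ask about UV first, and about Repair only when UV = Yes.
# The leaf parameters are made up for illustration.
tree = Split(
    test=lambda c: c["UV"] == "Yes",
    if_true=Split(
        test=lambda c: c["Repair"] == "Yes",
        if_true=Leaf(mean=2.0, std=0.5),   # UV light and a DNA repair gene
        if_false=Leaf(mean=0.0, std=0.5),
    ),
    if_false=Leaf(mean=0.0, std=0.5),      # Repair is never consulted without UV
)

print(lookup(tree, {"UV": "Yes", "Repair": "Yes"}))
print(lookup(tree, {"UV": "No"}))
```

The tree has three leaves where the full table has four rows; with more parents the saving grows, because a parent is only consulted in the contexts where it matters.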
Modeling Context Specificity
[Figure: PRM schema (Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic; Experiment: Exp. type, Exp. cluster; Expression: Level) with a tree CPD for Level. The root tests Exp. Cluster = 2; lower nodes test Lipid = Yes, GCN4 = Yes and HSF = Yes, each with true and false branches. Each leaf holds its own P(Level); the Level axes are marked at -3, 0 and 3.]
Grouping = a leaf in the tree
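The statement "Grouping = a leaf in the tree" can be made concrete by partitioning measurements according to the leaf they reach. A minimal sketch: the tree shape only loosely follows the splits named on the slide, and the field names, leaf labels and toy records are all invented for illustration.

```python
from collections import defaultdict

def leaf_of(record):
    """Return a label for the tree leaf this (gene, experiment) pair reaches.

    The split order (Exp. Cluster = 2, then Lipid or GCN4) is an assumption.
    """
    if record["exp_cluster"] == 2:
        if record["lipid"]:
            return "ExpCluster=2, Lipid=Yes"
        return "ExpCluster=2, Lipid=No"
    if record["gcn4"]:
        return "ExpCluster!=2, GCN4=Yes"
    return "ExpCluster!=2, GCN4=No"

def groupings(records):
    """Partition expression measurements by leaf; each part is one grouping."""
    groups = defaultdict(list)
    for r in records:
        groups[leaf_of(r)].append((r["gene"], r["experiment"]))
    return dict(groups)

# Toy records with invented gene and experiment names.
records = [
    {"gene": "g1", "experiment": "e1", "exp_cluster": 2, "lipid": True, "gcn4": False},
    {"gene": "g2", "experiment": "e1", "exp_cluster": 2, "lipid": True, "gcn4": True},
    {"gene": "g1", "experiment": "e2", "exp_cluster": 1, "lipid": True, "gcn4": False},
    {"gene": "g2", "experiment": "e2", "exp_cluster": 1, "lipid": True, "gcn4": True},
]
result = groupings(records)
for leaf, members in result.items():
    print(leaf, "->", members)
```

Note that g1 and g2 fall into the same grouping in experiment e1 but into different groupings in e2: a grouping covers a subset of genes in a subset of experiments, rather than whole rows or columns.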
How do I learn these models?
Learning the Models
[Figure: DNA sequence (CGCTA...), annotations (GO, MIPS, YPD) and experimental details feed a LEARNER. Shown with it are the PRM schema (Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic; Experiment: Exp. type, Exp. cluster; Expression: Level), a tree with tests Exp. Cluster = 2, Lipid = Yes, HSF = Yes and GCN4 = Yes, and a table indexed by GC and EC (rows 1,1; 1,2; 2,1; 2,2; …) with the values 0.8, -0.7, 0.8, -0.7, … and 1.2, 0.6, 1.2, 0.6, ….]
Automatic Induction
• Structure Learning:
  - Dependency structure
  - Tree structure
  - Bayesian score
  - Heuristic search
• Missing Data:
  - Gene cluster & experiment cluster never observed
  - Expectation Maximization (EM)
Learning Algorithm
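The EM idea for cluster variables that are never observed can be illustrated on a much simpler model than the one in this talk. The sketch below fits a plain two-component one-dimensional Gaussian mixture to a handful of made-up expression levels; it is a stand-in that shows the E-step/M-step pattern, not the PRM learning procedure, and every name and number in it is an assumption.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def em_two_clusters(levels, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; cluster labels are never observed."""
    means = [min(levels), max(levels)]      # crude initialisation
    stds = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability of each cluster for each level.
        resp = []
        for x in levels:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, levels)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, levels)) / nk
            stds[k] = max(math.sqrt(var), 1e-3)   # floor avoids a collapsed component
            weights[k] = nk / len(levels)
    return means, stds, weights

# Toy levels: a repressed group near -2 and an induced group near 2.
levels = [-2.1, -1.9, -2.0, -2.2, 1.8, 2.1, 2.0, 2.2]
means, stds, weights = em_two_clusters(levels)
print(means, weights)
```

In the setting described on the slide, the E-step would instead compute posteriors over each gene's cluster and each experiment's cluster, and structure search with a Bayesian score would be interleaved with the parameter updates.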
Learning Process
[Figure: PRM schema (Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic; Experiment: Exp. type, Exp. cluster; Expression: Level).]
Learning Process
Experiment Similarity
[Figure: PRM schema (Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic; Experiment: Exp. type, Exp. cluster; Expression: Level), with a tree for Level that tests Exp. Cluster = 2.]
Learning Process
Gene Similarity
[Figure: PRM schema (Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic; Experiment: Exp. type, Exp. cluster; Expression: Level), with a tree for Level that tests Exp. Cluster = 2 and then Gene Cluster.]
Learning Process
Separability by binding site
[Figure: PRM schema (Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic; Experiment: Exp. type, Exp. cluster; Expression: Level), with a tree for Level that tests Exp. Cluster = 2, Gene Cluster and HSF = Yes.]
Learning Process
Attribute dependencies: induce cluster changes
[Figure: PRM schema (Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic; Experiment: Exp. type, Exp. cluster; Expression: Level), with a tree for Level that tests Exp. Cluster = 2, Gene Cluster and HSF = Yes.]
Learning Process
Achieved desired clustering
[Figure: PRM schema (Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic; Experiment: Exp. type, Exp. cluster; Expression: Level), with a tree for Level that tests Exp. Cluster = 2, Gene Cluster, HSF = Yes and, on two branches, GCN4 = Yes.]
Yeast Stress Data (Gasch et al 2001)
• Measured response to stress conditions
• 92 arrays
• We selected ~900 genes
• Added data: TRANSFAC, MIPS
Results:
• 15 significant TFs
• 7 significant function categories
• 793 Groupings
Context Specific Groupings
• Down in nitrogen depletion
• Transporter genes
• Metabolism of amino acids
Context Specific Groupings
• Up in Starvation, Nitrogen depletion & DTT
• Transporter genes
• Metabolism of nitrogen
Example Biological Finding
• Discovered grouping of 17 genes
  - All induced in diauxic shift
  - All have ≥ 2 binding sites for MIG1 transcription factor
  - Many not known to be regulated by MIG1
• Context-sensitive groupings were key to finding cluster
Compendium Data (Hughes et al 2000)
• 300 samples of yeast deletion mutants
[Figure: PRM schema with classes Gene (GCluster, GCN4, HSF, Endoplasmatic, Lipid), Array/Mutated Gene (ACluster, GCluster of mutated gene, Lipid of mutated gene) and Expression (Level).]
Resulting Bayesian Network
[Figure: ground Bayesian network for Gene 1 to Gene 4 and the arrays "Gene 1 mutant" and "Gene 3 mutant". Nodes: Gene Cluster1 to Gene Cluster4, HSF1 to HSF4, Lipid1, Lipid3, Array cluster1, Array cluster3, and a Level node for each gene-array pair.]
Experimental Setup
• Goal: predict the effect of mutating specific genes without performing the experiment (!)
• Example: predicting the effect of mutating gene 4
• Available information:
  - Attributes of gene 4
  - Gene Cluster of gene 4 as a gene
[Figure: the "Gene 4 mutant" array with Lipid4, Gene Cluster4 and HSF4 known and its Array cluster marked "?".]
Experimental Setup
[Figure: ground Bayesian network for the arrays "Gene 1 mutant", "Gene 3 mutant" and "Gene 4 mutant". Nodes: Lipid1, Lipid3, Lipid4, Gene Cluster1 to Gene Cluster4, HSF1 to HSF4, Array cluster1, Array cluster3, and Level nodes for each gene-array pair. The Array cluster of the Gene 4 mutant is marked "?".]
Results
Training set: 180 mutants
Test set: 20 mutants
• 44 arrays predicted at 99% confidence and 95% accuracy
• Relational model is key to prediction
[Figure: PRM schema (Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic, Exp. type, Exp. cluster, Level) and a bar chart of Accuracy (%) on a 0 to 100 axis; the PRMs bar is labeled "95% accuracy".]
Conclusions
• Presented a unified probabilistic framework:
  - Models complex biological domains
  - Expressive data organization
  - Incorporates heterogeneous data
• Future directions:
  - Incorporate DNA and protein sequence data
  - Discover regulatory networks
Thank You!
• Paper: http://www.cs.stanford.edu/~eran
• Software (soon): http://dags.stanford.edu/bio
• Contact: [email protected]