Rich Probabilistic Models for Gene Expression
Eran Segal (Stanford), Ben Taskar (Stanford), Audrey Gasch (Berkeley), Nir Friedman (Hebrew University), Daphne Koller (Stanford)

Our Goals
  Find patterns in gene expression data

Data Organization
  [Figure: expression matrix with genes i as rows and experiments j as columns; entries marked Induced or Repressed]
  Aij - mRNA level of gene i in experiment j

Standard Clustering Organization
  [Figure: Genes x Experiments matrix]

Bi-Clustering Organization
  [Figure: Genes x Experiments matrix, with a region marked "Undetected Similarity"]

Desired Organization
  Detect similarities over subsets of genes and experiments
  Note: rows and columns no longer correspond to genes and experiments

Incorporate Heterogeneous Data
  [Figure: data sources: sequence graphic (CG CTA A C), clinical information, annotations (GO, MIPS, YPD), experimental details]
  Find correlations directly
  Focus on novel discoveries

Our Approach
  [Figure: the data sources (sequence graphic, clinical information, experimental details, annotations (GO, MIPS, YPD)) feed a LEARNER, which outputs a model over Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic, Exp. type, Exp. cluster and Level, together with hypotheses]

Probabilistic Relational Models (Koller & Pfeffer 98; Friedman, Getoor, Koller & Pfeffer 99)
  Gene: Gene Cluster
  Experiment: Exp. cluster
  Expression: Level

Resulting Bayesian Network
  [Figure: the schema (Gene: Gene Cluster; Experiment: Exp. cluster; Expression: Level) plus the data yields a ground network with nodes Gene Cluster1 to Gene Cluster3, Exp. Cluster1 and Exp. Cluster2, and Level1,1 through Level3,2]

Probabilistic Relational Models
  Gene: Gene Cluster
  Experiment: Exp. cluster
  Expression: Level
  CPD for Level:
    GCluster  ECluster  P(Level)
    1         1         0.8, 1.2   (density plotted at Level 0.8)
    1         2         -0.7, 0.6  (density plotted at Level -0.7)
    …

Adding Heterogeneous Data
  Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic
  Experiment: Exp. type, Exp. cluster
  Expression: Level
  Sources: annotations, binding sites, experimental details

Resulting Bayesian Network
  [Figure: the extended schema plus sequence data, experimental details and annotations (GO, MIPS, YPD) yields a ground network with nodes GCN4i, HSFi, Lipidi, Endoplasmatici and Gene Clusteri for genes i = 1 to 3, Exp. typej and Exp. clusterj for experiments j = 1 to 2, and Level1,1 through Level3,2]

Problem: Exponential Blowup
  [Table: CPD for Level with one row per joint assignment to its parents GC, LP, END, HSF, EC, TYP]
  6 parents: 2^6 cases
  k parents: 2^k cases!

Solution: Context Specificity
  Gene: DNA repair
  Experiment: UV Light
  Expression: Level
  Ultra violet light -> DNA damage -> DNA repair genes transcribed
  [Table: one distribution over Level for each combination of UV = Yes / No and Repair = Yes / No]

Solution: Context Specificity
  [Figure: same model; the distribution over Level is split on UV = Yes / UV = No only]

Solution: Context Specificity
  [Figure: same model as a tree; the root tests UV = Yes (true / false), and the true branch tests Repair = Yes (true / false)]

Modeling Context Specificity
  Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic
  Experiment: Exp. type, Exp. cluster
  Expression: Level
  [Figure: tree CPD for Level; the root tests Exp. Cluster = 2, inner nodes test Lipid = Yes, GCN4 = Yes and HSF = Yes, and each leaf holds a distribution P(Level), with densities plotted at Level -3, 0 and 3]
  Grouping = a leaf in the tree

How do I learn these models?

Learning the Models
  [Figure: sequence data, annotations (GO, MIPS, YPD) and experimental details feed a LEARNER, which outputs the schema, the CPD table and the tree CPD with tests such as Exp. Cluster = 2, Lipid = Yes, HSF = Yes and GCN4 = Yes]
  CPD table:
    GC  EC  P(Level)
    1   1   0.8, 1.2
    1   2   -0.7, 0.6
    2   1   0.8, 1.2
    2   2   -0.7, 0.6
    …
  Automatic Induction
  Structure Learning:
    Dependency structure
    Tree structure
    Bayesian score
    Heuristic search
  Missing Data:
    Gene cluster & experiment cluster never observed
    Expectation Maximization (EM)
  Learning Algorithm

Learning Process
  [Figure: initial schema (Gene: Gene Cluster, GCN4, HSF, Lipid, Endoplasmatic; Experiment: Exp. type, Exp. cluster; Expression: Level)]

Learning Process
  Experiment Similarity
  [Figure: tree tests Exp. Cluster = 2]

Learning Process
  Gene Similarity
  [Figure: tree tests Exp. Cluster = 2, Gene Cluster = Yes]

Learning Process
  Separability by binding site
  [Figure: tree tests Exp. Cluster = 2, Gene Cluster = Yes, HSF = Yes, ...]

Learning Process
  Attribute dependencies: induce cluster changes
  [Figure: tree tests Exp. Cluster = 2, Gene Cluster = Yes, HSF = Yes, ...]

Learning Process
  Achieved desired clustering
  [Figure: tree tests Exp. Cluster = 2, Gene Cluster = Yes, HSF = Yes, GCN4 = Yes, GCN4 = Yes, ...]

Yeast Stress Data (Gasch et al 2001)
  Measured response to stress conditions: 92 arrays
  We selected ~900 genes
  Added data: TRANSFAC, MIPS
  Results:
    15 significant TFs
    7 significant function categories
    793 groupings

Context Specific Groupings
  Down in nitrogen depletion
    Transporter genes
    Metabolism of amino acids

Context Specific Groupings
  Up in starvation, nitrogen depletion & DTT
    Transporter genes
    Metabolism of nitrogen

Example Biological Finding
  Discovered grouping of 17 genes
  All induced in diauxic shift
  All have 2 binding sites for MIG1 transcription factor
  Many not known to be regulated by MIG1
  Context-sensitive groupings were key to finding cluster

Compendium Data (Hughes et al 2000)
  300 samples of yeast deletion mutants
  Gene: GCluster, GCN4, HSF, Endoplasmatic, Lipid
  Array/Mutated Gene: GCluster (of mutated gene), Lipid (of mutated gene), ACluster
  Expression: Level

Resulting Bayesian Network
  [Figure: ground network with genes 1 to 4 (Gene Clusteri, HSFi), arrays for the gene 1 mutant and the gene 3 mutant (Lipid1, Lipid3, Array. cluster1, Array. cluster3), and one Level node per gene and array]

Experimental Setup
  Goal: predict the effect of mutating specific genes without performing the experiment (!)
  Example: predicting the effect of mutating gene 4
  Available information:
    Attributes of gene 4
    Gene Cluster of gene 4 as a gene
  [Figure: gene 4 mutant array with Lipid4, Gene Cluster4 and HSF4 known; Array. cluster unknown (?)]

Experimental Setup
  [Figure: ground network for the gene 1, gene 3 and gene 4 mutants; Array. cluster and the Level nodes of the gene 4 mutant are unknown (?)]

Results
  Training set: 180 mutants
  Test set: 20 mutants
  44 arrays predicted at 99% confidence and 95% accuracy
  Relational model is key to prediction
  [Chart: Accuracy (%), axis from 0 to 100; PRMs at 95% accuracy]

Conclusions
  Presented a unified probabilistic framework:
    Models complex biological domains
    Expressive data organization
    Incorporates heterogeneous data
  Future directions:
    Incorporate DNA and protein sequence data
    Discover regulatory networks

Thank You!
  Paper: http://www.cs.stanford.edu/~eran
  Software (soon): http://dags.stanford.edu/bio
  Contact: [email protected]
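
The contrast between the "Exponential Blowup" and "Context Specificity" slides can be illustrated with a small Python sketch. It is not the authors' implementation: the class names `Leaf` and `Split`, the helper functions, the tree shape and the leaf parameters are all assumptions made up for illustration, and the parents are treated as binary for the row count. It shows a tree-structured CPD for Level in which each leaf is one grouping, next to the number of rows a full table over k binary parents would need.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """One grouping: a Gaussian over Level (parameters are made up)."""
    mean: float
    std: float


@dataclass
class Split:
    """Inner node: tests whether an attribute equals a given value."""
    attribute: str
    value: object
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Split]


def lookup(node: Node, context: Dict[str, object]) -> Leaf:
    """Follow the tests from the root down to the leaf that applies."""
    while isinstance(node, Split):
        matches = context[node.attribute] == node.value
        node = node.if_true if matches else node.if_false
    return node


def count_leaves(node: Node) -> int:
    """Number of distributions the tree CPD has to store."""
    if isinstance(node, Leaf):
        return 1
    return count_leaves(node.if_true) + count_leaves(node.if_false)


def full_table_rows(num_binary_parents: int) -> int:
    """Rows in a full table CPD over k binary parents: 2^k."""
    return 2 ** num_binary_parents


# Hypothetical tree: the attribute names follow the slides, but the
# structure and the numbers are illustrative only.
tree = Split(
    "exp_cluster", 2,
    if_true=Split("gcn4", True, Leaf(3.0, 1.0), Leaf(0.0, 1.0)),
    if_false=Split("hsf", True, Leaf(-3.0, 1.0), Leaf(0.0, 1.0)),
)

context = {"exp_cluster": 2, "gcn4": True, "hsf": False}
print(lookup(tree, context))   # leaf reached for this gene/experiment pair
print(count_leaves(tree))      # distributions stored by the tree
print(full_table_rows(6))      # rows a full table over 6 parents would need
```

In this sketch the tree only tests the attributes that matter in a given context, so it stores one distribution per leaf instead of one per joint assignment to all parents.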