Dependency networks
Sushmita Roy
BMI/CS 576
www.biostat.wisc.edu/bmi576
[email protected]
Nov 25th, 2014

RECAP
• Probabilistic graphical models provide a natural way to represent biological networks
• So far we have seen Bayesian networks:
– Sparse candidates
– Module networks
• Today we will focus on dependency networks

What you should know
• What are dependency networks?
• How do they differ from Bayesian networks?
• The GENIE3 algorithm for learning a dependency network from expression data
• Different ways to represent conditional distributions
• Evaluation of various network inference methods

Graphical models for representing regulatory networks
• Bayesian networks
• Dependency networks
• Random variables encode expression levels
• Edges correspond to some form of statistical dependencies
[Figure: regulators Msb2 (X1) and Sho1 (X2) and target Ste20 (Y3); the structure is the graph with edges X1 → Y3 and X2 → Y3, and the function is Y3 = f(X1, X2)]

Dependency network
• A type of probabilistic graphical model
• As in Bayesian networks, it has
– A graph component describing the dependency structure between random variables
– For each variable Xj, a prediction function fj to predict Xj from the state of its neighbors
• Unlike Bayesian networks
– Can have cyclic dependencies
Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000

Notation
• Xi: the ith random variable
• X = {X1, .., Xp}: set of p random variables
• xik: an assignment of Xi in the kth sample
• x-ik: set of assignments to all variables other than Xi in the kth sample

Learning dependency networks
[Figure: a set of candidate regulators feeding into a prediction function fj for target Xj]
• fj can be of different types
• Learning requires estimation of each of the fj functions
• In all cases, learning requires us to minimize an error of predicting Xj from its neighborhood, e.g. the squared error Σk (xjk − fj(x-jk))²

Different representations of the fj function
• If Xj is continuous
– fj can be a linear function
– fj can be a regression tree
– fj can be a random forest (an ensemble of trees)
• If Xj is discrete
– fj can be a conditional probability table
– fj can be a conditional probability tree

GENIE3: GEne Network Inference with Ensemble of trees
• Solves a set of regression problems
– One per random variable
• Uses an ensemble of regression trees to represent fj
– Models non-linear dependencies
• Outputs a directed, cyclic graph with a confidence for each edge
• Focuses on generating a ranking over edges rather than a graph structure and parameters
Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. Van Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, Pierre Geurts, PLoS ONE 2010

Recall our very simple regression tree example
[Figure: a small regression tree whose internal nodes test X2 against thresholds e1 and e2, with YES/NO branches leading to leaf predictions based on X2 and X3]

An ensemble of trees
• A single tree is prone to "overfitting"
• Instead of learning a single tree, ensemble models make use of a collection of trees

A Random forest: an ensemble of trees
[Figure: a random forest of trees t1, ..., tT, each with split nodes and leaf nodes; the input x-j is passed down every tree and the prediction is averaged over the T trees. Taken from the ICCV09 tutorial by Kim, Shotton and Stenger: http://www.iis.ee.ic.ac.uk/~tkkim/iccv09_tutorial]

GENIE3 algorithm sketch
• For each Xj, generate a learning sample of input/output pairs
– LSj = {(x-jk, xjk), k = 1..N}
• On each LSj, learn fj to predict the value of Xj
– fj is either a Random forest or Extra-Trees
• Estimate wij for all genes i ≠ j
– wij quantifies the confidence of the edge between Xi and Xj
• Generate a global ranking of edges based on the wij (see the sketch below)
Note that, depending on the interpretation of the weights wij, their aggregation to get a global ranking of regulatory links is not trivial; in the context of tree-based methods it requires normalizing each expression vector appropriately.
GENIE3 algorithm sketch: predictor ranking
Each split node tests one input variable (selected in x-j), trying to reduce as much as possible the variance of the output variable (xj) across the samples. Candidate splits for numerical variables compare the input variable's values with a threshold determined during tree growing.
Figure 1. GENIE3 procedure. For each gene j = 1, ..., p, a learning sample LSj is generated with the expression levels of gene j as output values and the expression levels of all other genes as input values. A function fj is learned from LSj and a local ranking of all genes except j is computed. The p local rankings are then aggregated to get a global ranking of all regulatory links. Figure from Huynh-Thu et al., doi:10.1371/journal.pone.0012776.g001

Learning fj in GENIE3
• Random forests or Extra-Trees are used to represent fj
• Learning the Random forest
– Generate M = 1000 bootstrap samples
– At each node to be split, search for the best split among K randomly selected variables
– K was set to p−1 or sqrt(p−1), where p is the number of regulators/parents
• Learning the Extra-Trees
– Learn 1000 trees
– Each tree is built from the original learning sample
– At each test node, the best split is determined among K random splits, each obtained by randomly selecting one input (without replacement) and a threshold

Computing the importance weight of a predictor
• Importance is computed at each interior node
• Remember there can be multiple interior nodes per regulator
• For an interior node N, the importance is the reduction in variance achieved by splitting at that node:
I(N) = #S · Var(S) − #St · Var(St) − #Sf · Var(Sf)
where
– S: set of data samples that reach node N
– #S: size of the set S
– Var(S): variance of the output variable in set S
– St: subset of S for which the test at N is true
– Sf: subset of S for which the test at N is false
• For a single tree, the overall importance of a predictor is the sum over all interior nodes of the tree where that predictor is used to split
• For an ensemble, the importance is averaged over all trees
• To avoid a bias towards highly variable genes, normalize the expression profiles of all genes to unit variance

Computational complexity of GENIE3
• Complexity per variable: O(T·K·N·log N)
– T is the number of trees
– K is the number of random attributes selected per split
– N is the learning sample size

Evaluation of network inference methods
• Assume we know what the "right" network is
• One can use Precision-Recall (PR) curves to evaluate the predicted network
• The area under the PR curve (AUPR) quantifies performance
• Precision = (# of correct edges) / (# of predicted edges)
• Recall = (# of correct edges) / (# of true edges)
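A small sketch of this AUPR evaluation, assuming scikit-learn; the variable names scores and truth are illustrative, with one entry per candidate edge in a consistent order.

```python
# Sketch: evaluate a ranked edge list against a known gold-standard network
# via the precision-recall curve and its area (AUPR).
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def aupr(truth, scores):
    """truth: 1 if the candidate edge is in the true network, else 0.
    scores: predicted confidence for the same candidate edge."""
    precision, recall, _ = precision_recall_curve(truth, scores)
    return auc(recall, precision)

# Toy example: 6 candidate edges, 3 of them true.
truth  = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
print(aupr(truth, scores))  # 1.0 = perfect ranking; random ~ fraction of true edges
```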
AUPR-based performance comparison
[Figure: comparison of network inference methods by AUPR]

Some comments about expression-based network inference methods
• We have seen two types of algorithms to learn these networks
– Per-gene methods
• Sparse candidates: learn regulators for individual genes
• GENIE3
– Per-module methods
• Module networks: learn regulators for sets of genes/modules
• Other implementations of module networks exist
– LIRNET: Learning a Prior on Regulatory Potential from eQTL Data. Su-In Lee et al., PLoS Genetics 2009 (http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000358)
– LeMoNe: Learning Module Networks. Michoel et al. 2007 (http://www.biomedcentral.com/1471-

Many implementations of per-gene methods
• Mutual information
– Context Likelihood of Relatedness (CLR) (see the sketch at the end of this section)
– ARACNE
• Probabilistic methods
– Bayesian networks: Sparse Candidates
• Regression
– TIGRESS
– GENIE3

DREAM: Dialogue for Reverse Engineering Assessments and Methods
• Community effort to assess regulatory network inference
• DREAM5 challenge
• Previous challenges: 2006, 2007, 2008, 2009, 2010
• Marbach et al. 2012, Nature Methods; Marbach et al. 2010

Where do different methods rank?
[Figure: DREAM5 ranking of inference methods, including a "Community" aggregate predictor and a "Random" baseline; from Marbach et al. 2012, Nature Methods]

Comparing module (LeMoNe) and per-gene (CLR) methods
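As a point of comparison with the regression-based GENIE3, here is a minimal sketch of the CLR scoring idea from the per-gene methods listed above: each mutual-information value MIij is z-scored against the background MI distributions of genes i and j, and the two z-scores are combined. It assumes a precomputed symmetric MI matrix and is an illustration, not the reference implementation.

```python
# Minimal sketch of CLR (Context Likelihood of Relatedness) edge scoring.
import numpy as np

def clr_scores(mi):
    """mi: (p x p) symmetric mutual-information matrix with zero diagonal.
    Returns a (p x p) matrix of CLR edge scores."""
    mean = mi.mean(axis=1, keepdims=True)
    std = mi.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                      # guard against constant rows
    z = np.maximum((mi - mean) / std, 0.0)   # clip negative z-scores at 0
    return np.sqrt(z ** 2 + z.T ** 2)        # combine the contexts of both genes
```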