#### Transcript ppt

**Latent Variables**
Naman Agarwal, Michael Nute. May 1, 2013.

**Contents**
- Definition & example of latent variables
- EM algorithm refresher
- Structured SVM with latent variables
- Learning under semi-supervision or indirect supervision
  - CoDL
  - Posterior Regularization
  - Indirect Supervision

**Latent Variables: General Definition & Examples**
A latent variable in a machine learning algorithm is one which is assumed to exist (or have a null value) but which is not observed and is instead inferred from other, observed variables.
- It generally corresponds to some meaningful element of the problem for which direct supervision is intractable.
- Latent variable methods typically imagine the variable either as part of the input/feature space (e.g. PCA, factor analysis) or as part of the output space (e.g. EM).
  - This distinction is only illustrative, though, and can be blurred, as we will see with indirect supervision.

**Latent Input Variables**: $x^* \in \mathcal{X}^*$ (unobserved). As part of the input space, the input $x \in \mathcal{X}$ affects the output $y \in \mathcal{Y}$ only through the unobserved variable $x^* \in \mathcal{X}^*$. This formulation is only helpful if the dimension of $\mathcal{X}^*$ is smaller than the dimension of $\mathcal{X}$, so latent variables here are essentially an exercise in dimension reduction.

**Latent Output Variables**: $\mathcal{X}$ is observed, $\mathcal{Y}^*$ is unobserved. When we think of a latent variable as part of the output space, the method becomes an exercise in unsupervised or semi-supervised learning.

**Example: Paraphrase Identification**
Problem: given sentences A and B, determine whether they are paraphrases of each other.
- Note that if they are paraphrases, then there will exist a mapping between the named entities and predicates of the two sentences.
- The mapping is not directly observed, but it is a latent variable in the decision problem of determining whether the sentences say the same thing.

A: Druce will face murder charges, Conte said.
B: Conte said Druce will be charged with murder.
(The mapping between them is latent.)
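The latent-input view above (PCA as dimension reduction) can be sketched in a few lines. This is a toy illustration, not part of the lecture: the function name is our own, and it simply projects observed inputs onto a low-dimensional latent space via SVD.

```python
import numpy as np

def pca_latent(X, k):
    """Project observed inputs X (n x d) onto a k-dimensional latent space X*.
    A toy sketch of the 'latent input variable' view: a downstream model
    would consume X_star instead of X."""
    Xc = X - X.mean(axis=0)                     # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_star = Xc @ Vt[:k].T                      # latent coordinates (n x k)
    return X_star, Vt[:k]                       # also return the components
```

If the data truly lies near a k-dimensional subspace, reconstructing from `X_star` recovers (the centered) `X` almost exactly, which is the sense in which $x$ affects $y$ "only through" $x^*$.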
**Revised Problem**: given sentences A and B, determine the mapping of semantic elements between A and B.
- Now we are trying to learn specifically the mapping between them, so we can use the Boolean question from the previous problem as a latent variable.
- In practice the Boolean question is easy to answer, so we can use it to guide the semi-supervised task of mapping semantic elements.
- This is called indirect supervision (more on that later).

(Example taken from a talk by D. Roth: Constraints Driven Structured Learning with Indirect Supervision. Language Technologies Institute Colloquium, Carnegie Mellon University, Pittsburgh, PA. April 2010.)

**The EM Algorithm: Refresher**
In practice, many algorithms that use latent variables have a structure similar to the Expectation-Maximization algorithm (even though EM is not discriminative and some of the others are). So let's review.

**The EM Algorithm (formally)**
Setup:
- Observed data: $X$
- Unobserved data: $Y$
- Unknown parameters: $\Theta$
- Log-likelihood function: $L(\Theta \mid X, Y) = \log P(X, Y \mid \Theta)$

Algorithm: initialize $\Theta = \Theta^{(0)}$, then iterate:
- E-step: find the expected value of the log-likelihood over the unobserved data $Y$, given the current estimate of the parameters (this takes the expectation over the possible "labels" of $Y$):
  $$Q(\Theta \mid \Theta^{(t)}) = \mathbf{E}_{Y \mid X, \Theta^{(t)}}\big[\log P(X, Y \mid \Theta)\big]$$
- M-step: find the parameters that maximize the expected log-likelihood:
  $$\Theta^{(t+1)} = \arg\max_{\Theta} \; Q(\Theta \mid \Theta^{(t)})$$

**Hard EM vs. Soft EM**
- The algorithm above is often called Soft EM because it computes the expectation of the log-likelihood function in the E-step.
- An important variation is called Hard EM: instead of computing the expectation, we simply choose the MAP value of $Y$ and proceed with the likelihood function conditioned on that value.
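As a concrete sketch of the E-step and M-step above, here is a minimal soft-EM loop for a two-component 1D Gaussian mixture. It is a toy example (the function name and the percentile-based initialization are our own choices, not from the lecture):

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Soft EM for a 2-component 1D Gaussian mixture (toy sketch).
    E-step: posterior responsibilities over the unobserved component labels Y.
    M-step: re-estimate means, variances, and mixing weights."""
    mu = np.percentile(x, [25, 75])          # Theta^(0): crude initial means
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: expected "labels" for Y given the current Theta
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: parameters maximizing the expected log-likelihood
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi
```

Replacing the responsibilities `r` with a one-hot argmax over components would turn this into the Hard EM variant described above.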
- Hard EM is a simpler procedure, which many latent variable methods essentially resemble: label $Y$, train $\Theta$, repeat until convergence.

**Yu & Joachims: Learning Structural SVMs with Latent Variables**
Model formulation. The general structured SVM solves:
$$\min_w \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left[ \max_{y \in \mathcal{Y}} \big( \Delta(y_i, y) + w'\Phi(x_i, y) \big) - w'\Phi(x_i, y_i) \right]$$
where $(x_i, y_i)$ are the input and output structure for training example $i$, $\Phi(x_i, y_i)$ is the feature vector, $\Delta(y_i, y)$ is the loss function in the output space, and $w$ is the weight vector.

Structured SVM formulation with a latent variable: let $h \in \mathcal{H}$ be an unobserved variable. Since the predicted $y$ now depends on $h$, the predicted value of the latent variable, the loss between the actual $y_i$ and the predicted $y$ may become a function of $h$ as well: $\Delta(y_i, y) \Rightarrow \Delta(y_i, y, h)$. So the new optimization problem becomes:
$$\min_w \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max_{(y,h) \in \mathcal{Y} \times \mathcal{H}} \big[ \Delta(y_i, y, h) + w'\Phi(x_i, y, h) \big] - C \sum_{i=1}^{n} \max_{h \in \mathcal{H}} w'\Phi(x_i, y_i, h)$$
The problem is now a difference of two convex functions, so we can solve it using the concave-convex procedure (CCCP).

**Optimization Methodology & Notes**
The CCCP:
1. Compute $h_i^* = \arg\max_{h \in \mathcal{H}} w'\Phi(x_i, y_i, h)$ for each $i$.
2. Update $w$ by solving the standard structured SVM formulation, treating each $h_i^*$ as though it were an observed value.
3. Repeat until convergence.

Note the similarity to the simple way we looked at Hard EM earlier: first we label the unlabeled values, then we re-train the model on the newly labeled values.

Notes:
- Technically the loss function would compare the true values $(y_i, h_i)$ to the predicted $(y, h)$, but since we do not observe $h_i$, we are restricted to loss functions that reduce to the form shown.
- It is not strictly necessary that the loss function depend on $h$; in NLP it often does not.
- In the absence of latent variables, the optimization problem reduces to the general structured SVM formulation.
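The two-step CCCP loop can be sketched as follows. This is a toy illustration under loud assumptions: `phi`, and the candidate sets `Y` and `H`, are hypothetical helpers, and a structured-perceptron update stands in for the convex SVM step (the real algorithm solves a QP there).

```python
import numpy as np

def cccp_latent_train(examples, Y, H, phi, n_outer=5, n_inner=20, lr=0.1):
    """Sketch of the CCCP loop for a latent structured SVM.
    examples: list of (x, y); Y: candidate outputs; H: candidate latent
    values; phi(x, y, h): joint feature vector (all hypothetical names)."""
    x0, y0 = examples[0]
    w = np.zeros_like(phi(x0, y0, H[0]), dtype=float)
    for _ in range(n_outer):
        # Step 1: impute h_i* = argmax_h  w . phi(x_i, y_i, h)  (the labeling)
        h_star = [max(H, key=lambda h: w @ phi(x, y, h)) for x, y in examples]
        # Step 2: retrain w with each h_i* treated as observed (convex step;
        # a perceptron update stands in for the structured SVM solver)
        for _ in range(n_inner):
            for (x, y), h in zip(examples, h_star):
                yp, hp = max(((yc, hc) for yc in Y for hc in H),
                             key=lambda t: w @ phi(x, t[0], t[1]))
                if (yp, hp) != (y, h):
                    w += lr * (phi(x, y, h) - phi(x, yp, hp))
    return w
```

Step 1 is exactly the Hard-EM-style labeling noted above; step 2 is the convex retraining with the imputed latent values held fixed.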
**Learning under Semi-Supervision**
A labeled dataset is hard to obtain; we generally have a small labeled dataset and a large unlabeled dataset.

Naive algorithm (a kind of EM):
1. Train on the labeled dataset [initialization].
2. Run inference on the unlabeled set [expectation].
3. Include the predictions in your training set [maximization].
4. Repeat.

Can we do better? Ideas: indirect supervision, constraints, binary decision problems.

**Constraint-Driven Learning (CoDL)**
Proposed by Chang et al. [2007].
- Uses constraints obtained from domain knowledge to streamline semi-supervision.
- The constraints are quite general, and soft constraints can be incorporated.

Why are constraints useful? Consider citation parsing. The correct segmentation is:

[AUTHOR Lars Ole Anderson.] [TITLE Program analysis and specification for the C programming language.] [TECH-REPORT PhD thesis,] [INSTITUTION DIKU, University of Copenhagen,] [DATE May 1994.]

An HMM trained on 30 examples produces:

[AUTHOR Lars Ole Anderson. Program analysis and] [TITLE specification for the] [EDITOR C] [BOOKTITLE programming language.] [TECH-REPORT PhD thesis,] [INSTITUTION DIKU, University of Copenhagen, May] [DATE 1994.]

This leads to noisy predictions. The simple constraint that state transitions occur only at punctuation marks produces the correct output.

**CoDL Framework**
Notation:
- $L = (X_L, Y_L)$ is the labeled dataset; $U = (X_U, Y_U)$ is the unlabeled dataset.
- $\phi(X, Y)$ is a feature vector.
- Structured learning task: learn $w$ such that $Y_i = \arg\max_y \; w^T \phi(X_i, y)$.
- $C_1, \ldots, C_K$ is the set of constraints, where each $C_i : X \times Y \to \{0, 1\}$.

CoDL objective. If the constraints are hard:
$$\arg\max_{y \in 1_{C(x)}} \; w^T \phi(x, y)$$
If the constraints are soft, they define a notion of violation via a distance function $d$, e.g.
$$d(y, 1_{C_i(x)}) = \min_{y' \in 1_{C_i(x)}} H(y, y')$$
where $H$ is the Hamming distance. The objective in this "soft" formulation is then
$$\arg\max_y \; w^T \phi(x, y) - \sum_{i=1}^{K} \rho_i \, d(y, 1_{C_i(x)})$$

**CoDL Learning Algorithm**
Divided into four steps:
1. Initialization: $w_0 = \mathrm{learn}(L)$.
2. Expectation: for all $x \in U$,
   $$\{(x, y_1), \ldots, (x, y_K)\} = \mathrm{TopKInference}(x, w, \{C_i\}, \{\rho_i\}), \qquad T = T \cup \{(x, y_1), \ldots, (x, y_K)\}$$
   TopKInference generates the best $K$ "valid" assignments to $Y$ using beam-search techniques. It can be thought of as assigning a uniform posterior distribution over these $K$ assignments and 0 everywhere else.
3. Maximization: $w = \gamma w_0 + (1 - \gamma) \cdot \mathrm{learn}(T)$, where $\gamma$ is a smoothing parameter that keeps the model from drifting too far from the supervised model.
4. Repeat.

**Posterior Regularization [Ganchev et al. '09]**
Sits between hard and soft EM: it imposes constraints, in expectation, on the posterior distribution of the latent variables. There are two components of the objective function:
- The log-likelihood: $\ell(\theta) = \log p_\theta(X, Y)$.
- The deviation between the predicted posterior over the latent variables and the nearest distribution satisfying the constraints:
  $$\min_{q} \; KL\big(q \,\|\, p_\theta(Y \mid X)\big) \quad \text{subject to} \quad C_i(q) = 1$$
  where $q$ ranges over the set of all posterior distributions and each constraint is specified in terms of an expectation over $q$.

**The PR Algorithm**
1. Initialization: estimate the parameters $\theta$ from the labeled dataset.
2. E-step: compute the closest constraint-satisfying distribution
   $$q^{(t)} = \arg\min_{q} \; KL\big(q \,\|\, p_{\theta^{(t)}}(Y \mid X)\big) \quad \text{subject to} \quad C_i(q) = 1$$
3. M-step: $\theta^{(t+1)} = \arg\max_\theta \; \mathbf{E}_{q^{(t)}}[\ell(\theta)]$.
4. Repeat.

**Indirect Supervision: Motivation**
Paraphrase identification again:
S1: Druce will face murder charges, Conte said.
S2: Conte said Druce will be charged with murder.
There exists some latent structure $H$ between S1 and S2. $H$ acts as a justification for the binary decision, and it can be used as an intermediate step in learning the model.

**Supervision through Binary Problems**
Now we ask the previous question in the reverse direction: given answers to the binary problem, can we improve our latent structure identification?
Example: the structured prediction problem is field identification in advertisements (size, rent, etc.); the companion binary problem, for which a labeled dataset is easy to obtain, is
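The four-step CoDL loop above can be sketched compactly. This is a toy illustration: `learn` and `top_k_inference` are hypothetical helpers passed in as functions (in the real algorithm, inference is a constrained beam search), not the lecture's implementation.

```python
import numpy as np

def codl(learn, top_k_inference, L, U, gamma=0.9, n_rounds=5):
    """Sketch of the CoDL loop (hypothetical helper functions):
    learn(dataset) -> weight vector;
    top_k_inference(x, w) -> K constraint-satisfying outputs y for input x."""
    w0 = learn(L)                                    # 1. initialization
    w = w0.copy()
    for _ in range(n_rounds):
        # 2. Expectation: label unlabeled data with the K best valid outputs
        T = [(x, y) for x in U for y in top_k_inference(x, w)]
        # 3. Maximization: interpolate with the purely supervised model,
        #    so the model cannot drift too far from w0
        w = gamma * w0 + (1 - gamma) * learn(T)
    return w                                         # 4. repeat handled above
```

The `gamma` interpolation is what distinguishes this from naive self-training: the supervised model $w_0$ always anchors the update.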
whether the text is a well-formed advertisement.

**The Model [Chang et al. 2010]**
Notation:
- $S = (X_L, Y_L)$ is the labeled dataset.
- $B = B^+ \cup B^-$, with $B = (X_B, Y_B)$, is the binary-labeled dataset ($Y_B \in \{1, -1\}$).
- $\phi(X, Y)$ is a feature vector.

Structured learning task: learn $w$ minimizing
$$F(w, X, Y) = \frac{\|w\|^2}{2} + C_1 \sum_{i \in S} L_S(x_i, y_i, w)$$
Additionally we require:
- $\forall (x, -1) \in B^-, \; \forall y: \; w^T \phi(x, y) \le 0$ (the weight vector scores all structures badly on negative examples);
- $\forall (x, +1) \in B^+, \; \exists y: \; w^T \phi(x, y) > 0$ (the weight vector scores some structure well on positive examples).

Loss function. The previous "constraint" can be captured by the following loss:
$$L_B(x_i, y_i, w) = \ell\Big(1 - y_i \max_{Y} \; w^T \phi(x_i, Y)\Big)$$
We then wish to optimize $F(w, X, Y) + \sum_{B} L_B(x_i, y_i, w)$: structured prediction over the labeled dataset plus the binary loss over $B$.

**Indirect Supervision: Model Specification**
Setup:
- Fully-labeled training data: $S = \{(x_i, \mathbf{h}_i)\}_{i=1}^{l}$.
- Binary-labeled training data: $B = B^+ \cup B^- = \{(x_i, y_i)\}_{i=l+1}^{l+m}$, where $y_i \in \{-1, +1\}$.

Two conditions are imposed on the weight vector:
- $\forall (x, -1) \in B^-, \; \forall \mathbf{h} \in \mathcal{H}(x): \; w'\Phi(x, \mathbf{h}) \le 0$ (i.e. there is no good predicted structure for the negative examples);
- $\forall (x, +1) \in B^+, \; \exists \mathbf{h} \in \mathcal{H}(x): \; w'\Phi(x, \mathbf{h}) \ge 0$ (i.e. there is at least one good predicted structure for the positive examples).

So the optimization problem becomes:
$$\min_w \; \frac{\|w\|^2}{2} + C_1 \sum_{i \in S} L_S(x_i, \mathbf{h}_i, w) + C_2 \sum_{i \in B^-} L_{B^-}(x_i, y_i, w) + C_2 \sum_{i \in B^+} L_{B^+}(x_i, y_i, w)$$
where
$$L_S(x_i, \mathbf{h}_i, w) = \ell\Big( \max_{\mathbf{h}} \big[ \Delta(\mathbf{h}, \mathbf{h}_i) - w'\big( \Phi(x_i, \mathbf{h}_i) - \Phi(x_i, \mathbf{h}) \big) \big] \Big)$$
$$L_B(x_i, y_i, w) = \ell\Big( 1 - y_i \max_{\mathbf{h} \in \mathcal{H}(x)} \frac{w'\Phi(x_i, \mathbf{h})}{\kappa(x_i)} \Big)$$
$\ell$ is a common loss function such as the hinge loss, and $\kappa$ is a normalization constant. The $B^+$ term is non-convex and must be optimized CCCP-style: fix $\mathbf{h} = \arg\max_{\mathbf{h} \in \mathcal{H}} w'\Phi(x_i, \mathbf{h})$, solve the resulting convex problem in the remaining terms, and repeat.
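As a concrete reading of the objective above, here is a sketch that evaluates it for a candidate $w$ with hinge loss for $\ell$. It is a toy under loud assumptions: `phi` and the candidate set `H` are hypothetical helpers, $\Delta$ is a simple 0/1 loss between latent structures, and the normalizer $\kappa$ is taken to be 1.

```python
import numpy as np

def hinge(z):
    """l(z) = max(0, z), one common choice for the loss l."""
    return max(0.0, z)

def indirect_objective(w, S, B, phi, H, C1=1.0, C2=1.0):
    """Sketch of the Chang et al.-style objective (hypothetical phi, H).
    S: [(x, h_true)] fully labeled; B: [(x, y)] with y in {-1, +1}."""
    obj = 0.5 * (w @ w)                              # ||w||^2 / 2
    for x, h_true in S:
        # structural loss L_S with 0/1 Delta between latent structures
        obj += C1 * hinge(max((h != h_true) - w @ (phi(x, h_true) - phi(x, h))
                              for h in H(x)))
    for x, y in B:
        # binary loss L_B: does the best-scoring structure land on the
        # right side of the margin for this binary label? (kappa = 1 here)
        best = max(w @ phi(x, h) for h in H(x))
        obj += C2 * hinge(1 - y * best)
    return obj
```

Note how the `B` term only ever touches the latent structures through `max` over `H(x)`, which is exactly why the positive-example term is non-convex and requires the CCCP-style fix-and-solve loop described above.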
**Latent Variables in NLP: Overview of the Three Methods**

| Method | Two-second description | Latent variable | EM analogue | Key advantage |
|---|---|---|---|---|
| Latent Structural SVM [1] | Structured SVM with latent variables and EM-like training | Separate from and independent of the output variable | Hard EM; latent value found by $\arg\max_{h \in \mathcal{H}} w'\Phi(x_i, y_i, h)$ | Enables a structured SVM to be learned with a latent variable |
| CoDL [2] | Train on labeled data, generate the K best structures for the unlabeled data and train on those; average the two models | Output variable for the unlabeled training examples | Soft EM with a uniform distribution on the top-K predicted outputs | Efficient semi-supervised learning when constraints are difficult to guarantee for predictions but easy to evaluate |
| Indirect Supervision [3] | Take a small number of labeled examples plus many where we only know whether a label exists; train a model on both at once | 1. Companion binary decision variable; 2. Output structure on positive, unlabeled examples | Hard EM where a label is applied only to examples the binary classifier marks positive | Combines the information gain of indirect supervision (on lots of data) with direct supervision |

References:
1. Learning Structural SVMs with Latent Variables. Chun-Nam John Yu and T. Joachims. ICML 2009.
2. Guiding Semi-Supervision with Constraint-Driven Learning. M. Chang, L. Ratinov and D. Roth. ACL 2007.
3. Structured Output Learning with Indirect Supervision. M. Chang, V. Srikumar, D. Goldwasser and D. Roth. ICML 2010.