Parameter Estimation Using Likelihood Functions: Tutorial #1

This class has been cut and slightly edited from Nir Friedman's full course of 12 lectures, which is available at www.cs.huji.ac.il/~pmai. Changes made by Dan Geiger and Ydo Wexler.

Example: Binomial Experiment

When tossed, a thumbtack can land in one of two positions: Head or Tail. We denote by \theta the (unknown) probability P(H). Estimation task: given a sequence of toss samples x[1], x[2], ..., x[M], we want to estimate the probabilities P(H) = \theta and P(T) = 1 - \theta.

Statistical Parameter Fitting

Consider instances x[1], x[2], ..., x[M] such that:
- the set of values that x can take is known;
- each is sampled from the same distribution;
- each is sampled independently of the rest.
Such samples are called i.i.d. (independent and identically distributed). The task is to find a vector of parameters \Theta that has generated the given data. This parameter vector can then be used to predict future data.

The Likelihood Function

How good is a particular \Theta? It depends on how likely it is to generate the observed data:

L_D(\Theta) = P(D | \Theta) = \prod_m P(x[m] | \Theta)

The likelihood for the sequence H, T, T, H, H is

L_D(\theta) = \theta (1-\theta)(1-\theta)\theta\theta = \theta^3 (1-\theta)^2

[Figure: plot of L_D(\theta) for \theta in [0, 1].]

Sufficient Statistics

To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

L_D(\theta) = \theta^{N_H} (1-\theta)^{N_T}

N_H and N_T are sufficient statistics for the binomial distribution.

Sufficient Statistics (cont.)

A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood. Formally, s(D) is a sufficient statistic if for any two datasets D and D':

s(D) = s(D')  implies  L_D(\theta) = L_{D'}(\theta)

[Figure: datasets mapped to their statistics.]

Maximum Likelihood Estimation

MLE principle: choose the parameters that maximize the likelihood function. This is one of the most commonly used estimators in statistics and is intuitively appealing. One usually maximizes the log-likelihood function, defined as l_D(\theta) = \log_e L_D(\theta).

Example: MLE in Binomial Data

Applying the MLE principle we get

l_D(\theta) = N_H \log \theta + N_T \log(1 - \theta)

which is maximized at

\hat{\theta} = \frac{N_H}{N_H + N_T}

Example: for (N_H, N_T) = (3, 2) the MLE estimate is 3/5 = 0.6, which coincides with what one would expect.

[Figure: plot of L(\theta) for \theta in [0, 1].]

From Binomial to Multinomial

Now suppose X can take the values 1, 2, ..., K (for example, a die has 6 sides). We want to learn the parameters \theta_1, \theta_2, ..., \theta_K. Sufficient statistics: N_1, N_2, ..., N_K, the number of times each outcome is observed.

Likelihood function: L_D(\theta) = \prod_{k=1}^{K} \theta_k^{N_k}

MLE: \hat{\theta}_k = \frac{N_k}{N}, where N = \sum_k N_k is the total number of observations.

Example: Multinomial

Let x_1 x_2 ... x_n be a protein sequence. We want to learn the parameters q_1, q_2, ..., q_{20} corresponding to the frequencies of the 20 amino acids. N_1, N_2, ..., N_{20} are the number of times each amino acid is observed in the sequence.

Likelihood function: L_D(q) = \prod_{k=1}^{20} q_k^{N_k}

MLE: \hat{q}_k = \frac{N_k}{n}

Is MLE All We Need?

Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss? Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin. Would you place the same bet? Solution: the Bayesian approach, which incorporates your subjective prior knowledge. For example, you may know a priori that some amino acids have the same frequencies and some have low frequencies. How would one use this information?

Bayes' Rule

Bayes' rule:

P(y | x) = \frac{P(x | y) P(y)}{P(x)}, where P(x) = \sum_y P(x | y) P(y)

It holds because

P(x, y) = P(x | y) P(y) and P(x) = \sum_y P(x, y) = \sum_y P(x | y) P(y)

Example: Dishonest Casino

A casino uses two kinds of dice: 99% are fair, and 1% are loaded, meaning a 6 comes up 50% of the time. We pick a die at random and roll it 3 times. We get 3 consecutive sixes. What is the probability that the die is loaded?
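The next slides work this calculation out by hand; as a cross-check, here is a minimal Python sketch of the same Bayes'-rule computation. The function name posterior_loaded and its default arguments are illustrative, not part of the original slides.

```python
# Minimal sketch: posterior probability that the picked die is loaded,
# given three observed sixes, via Bayes' rule.

def posterior_loaded(num_sixes: int = 3,
                     p_loaded: float = 0.01,
                     p_six_loaded: float = 0.5,
                     p_six_fair: float = 1 / 6) -> float:
    """P(loaded | num_sixes consecutive sixes) using Bayes' rule."""
    # Likelihoods: probability of the observed rolls under each die type.
    lik_loaded = p_six_loaded ** num_sixes      # P(rolls | loaded)
    lik_fair = p_six_fair ** num_sixes          # P(rolls | fair)

    # Evidence: P(rolls) = sum over die types of likelihood * prior.
    evidence = lik_loaded * p_loaded + lik_fair * (1 - p_loaded)

    # Bayes' rule: posterior = likelihood * prior / evidence.
    return lik_loaded * p_loaded / evidence


print(posterior_loaded())  # about 0.21, matching the hand calculation on the next slides
```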
Dishonest Casino (cont.)

The solution is based on Bayes' rule and the fact that, while P(loaded | 3 sixes) is not known, the other three terms in Bayes' rule are known, namely:

P(3 sixes | loaded) = (0.5)^3
P(loaded) = 0.01
P(3 sixes) = P(3 sixes | loaded) P(loaded) + P(3 sixes | not loaded) (1 - P(loaded))

so that

P(loaded | 3 sixes) = \frac{P(3 sixes | loaded) P(loaded)}{P(3 sixes)}

Dishonest Casino (cont.)

P(loaded | 3 sixes) = \frac{P(3 sixes | loaded) P(loaded)}{P(3 sixes | loaded) P(loaded) + P(3 sixes | fair) P(fair)} = \frac{0.5^3 \cdot 0.01}{0.5^3 \cdot 0.01 + (1/6)^3 \cdot 0.99} \approx 0.21

Biological Example: Proteins

Extracellular proteins have a slightly different amino acid composition than intracellular proteins. From a large enough protein database (SWISS-PROT), we can obtain the following:

- q^{int}_{a_i} = p(a_i | int), the frequency of amino acid a_i in intracellular proteins
- q^{ext}_{a_i} = p(a_i | ext), the frequency of amino acid a_i in extracellular proteins
- p(int), the probability that any new sequence is intracellular
- p(ext), the probability that any new sequence is extracellular

Biological Example: Proteins (cont.)

What is the probability that a given new protein sequence x = x_1 x_2 ... x_n is extracellular? Assuming that every sequence is either extracellular or intracellular (but not both), we can write p(ext) = 1 - p(int). Thus,

p(x) = p(ext) p(x | ext) + p(int) p(x | int)

Biological Example: Proteins (cont.)

Assuming the positions of the sequence are independent given the class, the conditional probabilities factor as

p(x | ext) = \prod_i p(x_i | ext),   p(x | int) = \prod_i p(x_i | int)

By Bayes' theorem,

P(ext | x) = \frac{p(ext) \prod_i p(x_i | ext)}{p(ext) \prod_i p(x_i | ext) + p(int) \prod_i p(x_i | int)}

The probabilities p(int) and p(ext) are called the prior probabilities. The probability P(ext | x) is called the posterior probability.
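To tie the MLE and Bayes parts of the tutorial together, here is a minimal Python sketch of this classifier: amino acid frequencies are estimated from labeled training sequences essentially by the MLE counts q_k = N_k / n (with a small pseudocount added, beyond plain MLE, so that unseen amino acids do not get zero probability), and the posterior P(ext | x) is computed in log space to avoid underflow on long sequences. The function names, the default prior p_ext = 0.5, and the toy sequences are illustrative only; real frequencies would come from a database such as SWISS-PROT.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def estimate_frequencies(sequences, pseudocount=1.0):
    """Frequencies q_k = N_k / n (MLE counts), smoothed by a small pseudocount."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def posterior_extracellular(x, q_ext, q_int, p_ext=0.5):
    """P(ext | x) via Bayes' rule, computed in log space for stability."""
    log_ext = math.log(p_ext) + sum(math.log(q_ext[a]) for a in x)
    log_int = math.log(1 - p_ext) + sum(math.log(q_int[a]) for a in x)
    # exp(log_ext) / (exp(log_ext) + exp(log_int)), rearranged to avoid overflow.
    return 1.0 / (1.0 + math.exp(log_int - log_ext))


# Toy usage with made-up training sequences (illustrative only).
q_ext = estimate_frequencies(["CCWYRMKC", "WCYWNCC"])
q_int = estimate_frequencies(["ALKAVLEK", "KKAELVAL"])
print(posterior_extracellular("CWCYK", q_ext, q_int))
```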