Parameter Estimation using likelihood functions
Tutorial #1

This class has been cut and slightly edited from Nir Friedman's full course of 12 lectures, which is available at www.cs.huji.ac.il/~pmai. Changes made by Dan Geiger and Ydo Wexler.

Example: Binomial Experiment

(Figure: a thumbtack, with its two landing positions labeled Head and Tail)

When tossed, it can land in one of two positions: Head or Tail.
We denote by θ the (unknown) probability P(H).

Estimation task: Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.

Statistical Parameter Fitting

Consider instances x[1], x[2], …, x[M] such that:
• The set of values that x can take is known
• Each is sampled from the same distribution
• Each is sampled independently of the rest
(i.e., the samples are i.i.d.)

The task is to find a vector of parameters θ that has generated the given data. This parameter vector θ can be used to predict future data.

The Likelihood Function

How good is a particular θ? It depends on how likely it is to generate the observed data:

L_D(θ) = P(D | θ) = ∏_m P(x[m] | θ)

The likelihood for the sequence H, T, T, H, H is

L_D(θ) = θ · (1 − θ) · (1 − θ) · θ · θ

(Figure: plot of L(θ) as a function of θ over [0, 1])

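As a concrete illustration (not part of the original slides), here is a minimal Python sketch that evaluates this likelihood for the sequence H, T, T, H, H at a few values of θ; the function name and the grid of θ values are our own choices:

```python
def likelihood(theta, data):
    """L_D(theta) = product over the samples of P(x[m] | theta),
    where P(H) = theta and P(T) = 1 - theta."""
    result = 1.0
    for x in data:
        result *= theta if x == 'H' else (1 - theta)
    return result

data = ['H', 'T', 'T', 'H', 'H']
for theta in (0.2, 0.4, 0.6, 0.8):
    print(f"theta = {theta:.1f}   L_D(theta) = {likelihood(theta, data):.5f}")
```

Of the values printed, θ = 0.6 gives the largest likelihood, matching the peak of the curve on the slide.
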
Sufficient Statistics

To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

L_D(θ) = θ^N_H · (1 − θ)^N_T

N_H and N_T are sufficient statistics for the binomial distribution.

Sufficient Statistics

A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.

Formally, s(D) is a sufficient statistic if for any two datasets D and D':

s(D) = s(D')  ⇒  L_D(θ) = L_D'(θ)

(Figure: many different datasets map to the same sufficient statistic)

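A small follow-up sketch (ours, same setting as the likelihood code above) illustrating the definition: two different datasets with the same counts (N_H, N_T) induce exactly the same likelihood function:

```python
def likelihood_from_counts(theta, n_h, n_t):
    """L_D(theta) = theta**N_H * (1 - theta)**N_T."""
    return theta ** n_h * (1 - theta) ** n_t

# Two different datasets with the same sufficient statistic (N_H, N_T) = (3, 2)
d1 = ['H', 'T', 'T', 'H', 'H']
d2 = ['T', 'H', 'H', 'H', 'T']

for theta in (0.3, 0.6, 0.9):
    l1 = likelihood_from_counts(theta, d1.count('H'), d1.count('T'))
    l2 = likelihood_from_counts(theta, d2.count('H'), d2.count('T'))
    assert l1 == l2  # s(D) = s(D') implies L_D(theta) = L_D'(theta)
```
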
Maximum Likelihood Estimation

MLE Principle: Choose parameters that maximize the likelihood function.

• This is one of the most commonly used estimators in statistics
• Intuitively appealing
• One usually maximizes the log-likelihood function, defined as l_D(θ) = log_e L_D(θ)

Example: MLE in Binomial Data

Applying the MLE principle we get

l_D(θ) = N_H log θ + N_T log(1 − θ)

Setting the derivative with respect to θ to zero gives

θ̂ = N_H / (N_H + N_T)

(which coincides with what one would expect).

Example: (N_H, N_T) = (3, 2). The MLE estimate is 3/5 = 0.6.

(Figure: plot of L(θ) over [0, 1], peaking at θ = 0.6)

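A minimal sketch of this estimate in Python (ours; numpy is used only for the sanity check), computing the closed-form MLE for (N_H, N_T) = (3, 2) and confirming it against a grid search over the log-likelihood:

```python
import numpy as np

def mle_binomial(n_h, n_t):
    """Closed-form MLE for the binomial parameter: theta_hat = N_H / (N_H + N_T)."""
    return n_h / (n_h + n_t)

n_h, n_t = 3, 2
theta_hat = mle_binomial(n_h, n_t)            # 0.6, as on the slide

# Sanity check: maximize the log-likelihood l_D(theta) on a grid of theta values
grid = np.linspace(0.01, 0.99, 99)
log_lik = n_h * np.log(grid) + n_t * np.log(1 - grid)
print(theta_hat, grid[np.argmax(log_lik)])    # both are (approximately) 0.6
```
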
From Binomial to Multinomial

For example, suppose X can have the values 1, 2, …, K (e.g., a die has 6 sides).
We want to learn the parameters θ_1, θ_2, …, θ_K.

Sufficient statistics:
N_1, N_2, …, N_K - the number of times each outcome is observed.

Likelihood function: L_D(θ) = ∏_{k=1..K} θ_k^N_k

MLE: θ̂_k = N_k / Σ_j N_j

Example: Multinomial

Let x_1 x_2 … x_n be a protein sequence.
We want to learn the parameters q_1, q_2, …, q_20 corresponding to the frequencies of the 20 amino acids.
N_1, N_2, …, N_20 - the number of times each amino acid is observed in the sequence.

Likelihood function: L_D(q) = ∏_{k=1..20} q_k^N_k

MLE: q_k = N_k / n

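A hedged sketch of this count-and-normalize estimate (our own code; the protein fragment below is made up purely for illustration):

```python
from collections import Counter

def multinomial_mle(sequence):
    """MLE for multinomial parameters: q_k = N_k / n, where N_k is the number of
    times symbol k occurs and n is the length of the sequence."""
    counts = Counter(sequence)
    n = len(sequence)
    return {symbol: count / n for symbol, count in counts.items()}

# Hypothetical short protein fragment, used here only for illustration
fragment = "MKVLAAGLLALLA"
print(multinomial_mle(fragment))  # q_L = 5/13, q_A = 4/13, q_M = q_K = q_V = q_G = 1/13
```
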
Is MLE all we need?

• Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss?
• Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin. Would you place the same bet?

Solution: the Bayesian approach, which incorporates your subjective prior knowledge. E.g., you may know a priori that some amino acids have the same frequencies and some have low frequencies. How would one use this information?

Bayes' rule

Bayes' rule:

P(y | x) = P(x | y) P(y) / P(x)

where

P(x) = Σ_y P(x | y) P(y)

It holds because:

P(x, y) = P(x | y) P(y)
P(x) = Σ_y P(x, y) = Σ_y P(x | y) P(y)

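As a computational sketch (ours, not from the slides), Bayes' rule amounts to normalizing likelihood × prior over all hypotheses; the dishonest-casino example on the next slide is exactly one call to a helper like this:

```python
def posterior(likelihoods, priors):
    """Bayes' rule: P(y | x) = P(x | y) P(y) / sum over y' of P(x | y') P(y').
    `likelihoods` maps each hypothesis y to P(x | y); `priors` maps y to P(y)."""
    evidence = sum(likelihoods[y] * priors[y] for y in priors)
    return {y: likelihoods[y] * priors[y] / evidence for y in priors}
```
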
Example: Dishonest Casino

A casino uses 2 kinds of dice:
• 99% are fair
• 1% are loaded: 6 comes up 50% of the time

We pick a die at random and roll it 3 times. We get 3 consecutive sixes.
What is the probability the die is loaded?

Dishonest Casino (cont.)

The solution is based on using Bayes' rule and the fact that while P(loaded | 3 sixes) is not known, the other three terms in Bayes' rule are known, namely:
• P(3 sixes | loaded) = (0.5)^3
• P(loaded) = 0.01
• P(3 sixes) = P(3 sixes | loaded) P(loaded) + P(3 sixes | not loaded) (1 − P(loaded))

P(loaded | 3 sixes) = P(3 sixes | loaded) · P(loaded) / P(3 sixes)

Dishonest Casino (cont.)

P(loaded | 3 sixes) = P(3 sixes | loaded) · P(loaded) / P(3 sixes)
= P(3 sixes | loaded) · P(loaded) / [P(3 sixes | loaded) P(loaded) + P(3 sixes | fair) P(fair)]
= (0.5^3 · 0.01) / (0.5^3 · 0.01 + (1/6)^3 · 0.99)
≈ 0.21

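The same arithmetic as a few lines of Python (the probabilities are taken directly from the slide):

```python
# Probabilities from the slide
p_loaded = 0.01
p_fair = 0.99
p_3sixes_given_loaded = 0.5 ** 3
p_3sixes_given_fair = (1 / 6) ** 3

# Total probability of the evidence, then Bayes' rule
p_3sixes = p_3sixes_given_loaded * p_loaded + p_3sixes_given_fair * p_fair
p_loaded_given_3sixes = p_3sixes_given_loaded * p_loaded / p_3sixes
print(round(p_loaded_given_3sixes, 2))  # 0.21
```
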
Biological Example: Proteins

Extracellular proteins have a slightly different amino acid composition than intracellular proteins.

From a large enough protein database (SWISS-PROT), we can get the following:
• q_ai^int = p(ai | int) - the frequency of amino acid ai for intracellular proteins
• q_ai^ext = p(ai | ext) - the frequency of amino acid ai for extracellular proteins
• p(int) - the probability that any new sequence is intracellular
• p(ext) - the probability that any new sequence is extracellular

Biological Example: Proteins (cont.)

What is the probability that a given new protein sequence x = x_1 x_2 … x_n is extracellular?

Assuming that every sequence is either extracellular or intracellular (but not both), we can write p(ext) = 1 − p(int).

Thus,

p(x) = p(ext) · p(x | ext) + p(int) · p(x | int)

Biological Example: Proteins (cont.)

Using conditional probability (and treating the positions of the sequence as independent) we get

p(x | ext) = ∏_i p(x_i | ext),   p(x | int) = ∏_i p(x_i | int)

By Bayes' theorem,

P(ext | x) = [p(ext) · ∏_i p(x_i | ext)] / [p(ext) · ∏_i p(x_i | ext) + p(int) · ∏_i p(x_i | int)]

• The probabilities p(int), p(ext) are called the prior probabilities.
• The probability P(ext | x) is called the posterior probability.

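A sketch of this posterior computation (our own code, not from the course; the frequency tables q_ext and q_int and the prior p_ext would come from a database such as SWISS-PROT, as described above). Working in log space is an addition of ours to avoid numerical underflow on long sequences:

```python
import math

def p_ext_given_x(sequence, q_ext, q_int, p_ext):
    """Posterior P(ext | x) for a sequence x = x1 x2 ... xn.
    q_ext[a] = p(a | ext) and q_int[a] = p(a | int) are the amino acid
    frequency tables; p_ext is the prior P(ext). Computed in log space."""
    log_ext = math.log(p_ext) + sum(math.log(q_ext[a]) for a in sequence)
    log_int = math.log(1 - p_ext) + sum(math.log(q_int[a]) for a in sequence)
    # P(ext | x) = exp(log_ext) / (exp(log_ext) + exp(log_int)), computed stably
    m = max(log_ext, log_int)
    return math.exp(log_ext - m) / (math.exp(log_ext - m) + math.exp(log_int - m))
```
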