Parameter Estimation using
likelihood functions
Tutorial #1
This class has been cut and slightly edited from Nir Friedman’s full course of 12 lectures which is
available
at www.cs.huji.ac.il/~pmai. Changes made by Dan Geiger and Ydo Wexler.
Example: Binomial Experiment
When tossed, a thumbtack can land in one of two positions: Head or Tail.
We denote by θ the (unknown) probability P(H).
Estimation task:
Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities
P(H) = θ and P(T) = 1 − θ
3
Statistical Parameter Fitting
Consider instances x[1], x[2], …, x[M] such that:
- The set of values that x can take is known
- Each is sampled from the same distribution
- Each is sampled independently of the rest
Such instances are called i.i.d. samples.
The task is to find a vector of parameters that has generated the given data. This parameter vector can then be used to predict future data.
4
The Likelihood Function
How good is a particular θ?
It depends on how likely it is to generate the observed data:
L_D(θ) = P(D | θ) = ∏_m P(x[m] | θ)
The likelihood for the sequence H, T, T, H, H is
L_D(θ) = θ · (1 − θ) · (1 − θ) · θ · θ
[Plot: L(θ) versus θ on the interval [0, 1]]
5
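The likelihood above can be computed numerically. A minimal sketch, assuming nothing beyond the slide's definition; the function name is illustrative, not from the lecture:

```python
def likelihood(theta, sequence):
    """L_D(theta): product over the tosses of P(x[m] | theta)."""
    result = 1.0
    for toss in sequence:
        # Each Head contributes theta, each Tail contributes (1 - theta).
        result *= theta if toss == "H" else (1.0 - theta)
    return result

sequence = ["H", "T", "T", "H", "H"]
print(likelihood(0.6, sequence))  # theta = 0.6 gives theta^3 (1-theta)^2 ≈ 0.03456
```

Evaluating this function on a grid of θ values reproduces the likelihood curve sketched on the slide.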
Sufficient Statistics
To compute the likelihood in the thumbtack example we only require N_H and N_T
(the number of heads and the number of tails):
L_D(θ) = θ^{N_H} · (1 − θ)^{N_T}
N_H and N_T are sufficient statistics for the binomial distribution.
6
Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
Formally, s(D) is a sufficient statistic if for any two datasets D and D′:
s(D) = s(D′) ⇒ L_D(θ) = L_{D′}(θ)
[Diagram: many datasets mapping to the same sufficient statistics]
7
Maximum Likelihood Estimation
MLE Principle: Choose parameters that maximize the likelihood function.
This is one of the most commonly used estimators in statistics, and it is intuitively appealing.
One usually maximizes the log-likelihood function, defined as ℓ_D(θ) = log_e L_D(θ).
8
Example: MLE in Binomial Data
Applying the MLE principle we get
ℓ_D(θ) = N_H log θ + N_T log(1 − θ)
Setting the derivative to zero and solving yields
θ̂ = N_H / (N_H + N_T)
(which coincides with what one would expect)
Example: (N_H, N_T) = (3, 2)
MLE estimate is 3/5 = 0.6
[Plot: L(θ) versus θ on [0, 1], peaking at θ = 0.6]
9
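The closed-form estimate θ̂ = N_H / (N_H + N_T) can be checked directly from a sample. A minimal sketch; the function name is illustrative:

```python
from collections import Counter

def mle_binomial(tosses):
    """theta_hat = N_H / (N_H + N_T), the maximizer of the log-likelihood."""
    counts = Counter(tosses)
    n_h, n_t = counts["H"], counts["T"]
    return n_h / (n_h + n_t)

print(mle_binomial(["H", "H", "H", "T", "T"]))  # (N_H, N_T) = (3, 2) -> 0.6
```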
From Binomial to Multinomial
For example, suppose X can have the values 1, 2, …, K (for example, a die has 6 sides).
We want to learn the parameters θ_1, θ_2, …, θ_K.
Sufficient statistics:
N_1, N_2, …, N_K - the number of times each outcome is observed
Likelihood function: L_D(θ) = ∏_{k=1}^{K} θ_k^{N_k}
MLE: θ̂_k = N_k / N
10
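The multinomial MLE θ̂_k = N_k / N is again just normalized counting. A minimal sketch with a die as on the slide; the function name and the sample rolls are illustrative:

```python
from collections import Counter

def mle_multinomial(outcomes, values):
    """theta_hat_k = N_k / N for each possible value k (Counter returns 0 for unseen k)."""
    n = len(outcomes)
    counts = Counter(outcomes)
    return {k: counts[k] / n for k in values}

rolls = [1, 3, 6, 6, 2, 6, 4, 6]              # 8 made-up die rolls
estimates = mle_multinomial(rolls, range(1, 7))
print(estimates[6])                            # N_6 / N = 4/8 = 0.5
```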
Example: Multinomial
Let x_1 x_2 … x_n be a protein sequence.
We want to learn the parameters q_1, q_2, …, q_20 corresponding to the frequencies of the 20 amino acids.
Sufficient statistics:
N_1, N_2, …, N_20 - the number of times each amino acid is observed in the sequence
Likelihood function: L_D(q) = ∏_{k=1}^{20} q_k^{N_k}
MLE: q̂_k = N_k / n
11
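Applied to a protein sequence, the same estimate q̂_k = N_k / n is the relative frequency of each amino acid. A minimal sketch; the peptide string is made up for illustration, not a real database entry:

```python
from collections import Counter

def amino_acid_mle(sequence):
    """q_hat_k = N_k / n: relative frequency of each amino acid in the sequence."""
    n = len(sequence)
    counts = Counter(sequence)
    return {aa: counts[aa] / n for aa in counts}

q = amino_acid_mle("MKVLAAGLLK")  # hypothetical 10-residue peptide
print(q["L"])                      # L appears 3 times in 10 residues -> 0.3
```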
Is MLE all we need?
Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack.
Would you bet on heads for the next toss?
Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin.
Would you place the same bet?
Solution: the Bayesian approach, which incorporates your subjective prior knowledge. E.g., you may know a priori that some amino acids have the same frequencies and some have low frequencies. How would one use this information?
12
Bayes’ rule
Bayes’ rule:
P(y | x) = P(x | y) P(y) / P(x)
where
P(x) = Σ_y P(x | y) P(y)
It holds because:
P(x, y) = P(x | y) P(y)
P(x) = Σ_y P(x, y) = Σ_y P(x | y) P(y)
13
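Bayes’ rule for a discrete set of hypotheses can be sketched in a few lines, with the evidence P(x) computed by summing over hypotheses exactly as above. The function name and toy numbers are illustrative:

```python
def bayes_posterior(likelihoods, priors):
    """P(y | x) = P(x | y) P(y) / sum over y' of P(x | y') P(y')."""
    evidence = sum(likelihoods[y] * priors[y] for y in priors)
    return {y: likelihoods[y] * priors[y] / evidence for y in priors}

# Toy two-hypothesis example (numbers are made up):
post = bayes_posterior({"a": 0.8, "b": 0.2}, {"a": 0.5, "b": 0.5})
print(post["a"])  # 0.8*0.5 / (0.8*0.5 + 0.2*0.5) ≈ 0.8
```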
Example: Dishonest Casino
A casino uses 2 kinds of dice:
99% are fair
1% are loaded: a 6 comes up 50% of the time
We pick a die at random and roll it 3 times
We get 3 consecutive sixes
What is the probability the die is loaded?
14
Dishonest Casino (cont.)
The solution is based on using Bayes’ rule and the fact that while P(loaded | 3 sixes) is not known, the other three terms in Bayes’ rule are known, namely:
P(3 sixes | loaded) = (0.5)^3
P(loaded) = 0.01
P(3 sixes) = P(3 sixes | loaded) P(loaded) + P(3 sixes | not loaded) (1 − P(loaded))
P(loaded | 3 sixes) = P(3 sixes | loaded) P(loaded) / P(3 sixes)
15
Dishonest Casino (cont.)
P(loaded | 3 sixes)
= P(3 sixes | loaded) P(loaded) / P(3 sixes)
= P(3 sixes | loaded) P(loaded) / [P(3 sixes | loaded) P(loaded) + P(3 sixes | fair) P(fair)]
= 0.5^3 · 0.01 / [0.5^3 · 0.01 + (1/6)^3 · 0.99]
≈ 0.21
16
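The arithmetic above can be checked directly; the variable names are illustrative, the numbers are from the slide:

```python
# Posterior probability that the die is loaded after seeing 3 sixes.
p_loaded = 0.01
p_fair = 0.99
lik_loaded = 0.5 ** 3        # P(3 sixes | loaded): each six has probability 0.5
lik_fair = (1 / 6) ** 3      # P(3 sixes | fair): each six has probability 1/6

posterior = (lik_loaded * p_loaded) / (lik_loaded * p_loaded + lik_fair * p_fair)
print(round(posterior, 2))   # 0.21
```

Note that even 3 sixes in a row leave the loaded die a minority hypothesis, because its prior is only 1%.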
Biological Example: Proteins
Extracellular proteins have a slightly different amino acid composition than intracellular proteins.
From a large enough protein database (SWISS-PROT), we can get the following:
q_ai^int = p(a_i | int) - the frequency of amino acid a_i for intracellular proteins
q_ai^ext = p(a_i | ext) - the frequency of amino acid a_i for extracellular proteins
p(int) - the probability that any new sequence is intracellular
p(ext) - the probability that any new sequence is extracellular
17
Biological Example: Proteins (cont.)
What is the probability that a given new protein sequence x = x_1 x_2 … x_n is extracellular?
Assuming that every sequence is either extracellular or intracellular (but not both),
we can write p(ext) = 1 − p(int).
Thus,
p(x) = p(ext) p(x | ext) + p(int) p(x | int)
18
Biological Example: Proteins (cont.)
Assuming the positions are independent given the class, we get
p(x | ext) = ∏_i p(x_i | ext),  p(x | int) = ∏_i p(x_i | int)
By Bayes’ theorem,
P(ext | x) = p(ext) ∏_i p(x_i | ext) / [p(ext) ∏_i p(x_i | ext) + p(int) ∏_i p(x_i | int)]
• The probabilities p(int), p(ext) are called the prior probabilities.
• The probability P(ext|x) is called the posterior probability.
19
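The posterior above can be sketched in code. For long sequences the products of per-residue probabilities underflow, so the sketch works in log space; the function name, the two-letter alphabet, and all frequencies are made up for illustration:

```python
import math

def posterior_ext(sequence, q_ext, q_int, p_ext):
    """P(ext | x) via Bayes' rule, computed in log space for numerical stability."""
    log_ext = math.log(p_ext) + sum(math.log(q_ext[a]) for a in sequence)
    log_int = math.log(1.0 - p_ext) + sum(math.log(q_int[a]) for a in sequence)
    # P(ext | x) = exp(log_ext) / (exp(log_ext) + exp(log_int))
    #            = 1 / (1 + exp(log_int - log_ext))
    return 1.0 / (1.0 + math.exp(log_int - log_ext))

# Toy two-letter "alphabet" with made-up class-conditional frequencies:
q_ext = {"A": 0.7, "G": 0.3}
q_int = {"A": 0.4, "G": 0.6}
print(posterior_ext("AAG", q_ext, q_int, p_ext=0.5))  # ≈ 0.605
```

With a uniform prior (p_ext = 0.5), the posterior is driven entirely by the likelihood ratio of the two amino acid composition models.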