P416 Lecture 5


The Maximum Likelihood Method
(Taylor: “Principle of maximum likelihood”)
• Suppose we are trying to measure the true value of some quantity, $x_T$.
  – We make repeated measurements of this quantity: $\{x_1, x_2, \ldots, x_n\}$.
  – The standard way to estimate $x_T$ from our measurements is to calculate the mean value:
$$\mu_x = \frac{1}{n}\sum_{i=1}^{n} x_i$$
and set $x_T = \mu_x$.
DOES THIS PROCEDURE MAKE SENSE? The maximum likelihood method (MLM) answers this question and provides a general method for estimating parameters of interest from data.
• Statement of the Maximum Likelihood Method
  – Assume we have made $n$ measurements of $x$: $\{x_1, x_2, \ldots, x_n\}$.
  – Assume we know the probability distribution function that describes $x$: $f(x, a)$.
  – Assume we want to determine the parameter $a$ (e.g., the mean of a Gaussian).
MLM: pick the $a$ that maximizes the probability of getting the measurements (the $x_i$'s) that we got!
• How do we use the MLM?
  – The probability of measuring $x_1$ is $f(x_1, a)\,dx$.
  – The probability of measuring $x_2$ is $f(x_2, a)\,dx$.
  – The probability of measuring $x_n$ is $f(x_n, a)\,dx$.
  – If the measurements are independent, the probability of getting the measurements we got is:
$$L = f(x_1, a)\,dx \cdot f(x_2, a)\,dx \cdots f(x_n, a)\,dx = f(x_1, a)\, f(x_2, a) \cdots f(x_n, a)\,[dx]^n$$
We can drop the $[dx]^n$ term since it is only a proportionality constant. $L$ is called the likelihood function:
$$L = \prod_{i=1}^{n} f(x_i, a)$$
  – We want to pick the $a$ that maximizes $L$:
$$\left.\frac{\partial L}{\partial a}\right|_{a=a^*} = 0$$
  – It is often easier to maximize $\ln L$ instead. Both $L$ and $\ln L$ are maximum at the same location, and we maximize $\ln L$ rather than $L$ itself because $\ln L$ converts the product into a sum:
$$\ln L = \sum_{i=1}^{n} \ln f(x_i, a)$$
The new maximization condition is:
$$\left.\frac{\partial \ln L}{\partial a}\right|_{a=a^*} = \sum_{i=1}^{n} \left.\frac{\partial}{\partial a} \ln f(x_i, a)\right|_{a=a^*} = 0$$
• $a$ could be an array of parameters (e.g., slope and intercept) or just a single variable. The equations that determine $a$ range from simple linear equations to coupled non-linear equations; a numerical sketch of the general recipe follows below.
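To make the recipe concrete, here is a minimal numerical sketch (not from the lecture; the data and the choice of an exponential pdf $f(x, a) = (1/a)e^{-x/a}$, whose mean is $a$, are illustrative assumptions). A crude grid search over $a$ maximizes $\ln L = \sum_i \ln f(x_i, a)$; for this pdf the analytic MLM answer is again the sample average, which the search should reproduce:

```python
import numpy as np

# Hypothetical data (illustrative, not from the lecture).
x = np.array([1.2, 0.7, 3.1, 2.2, 0.9, 1.8])

def log_likelihood(a):
    """ln L(a) = sum_i ln f(x_i, a) for the assumed exponential pdf
    f(x, a) = (1/a) exp(-x/a), whose mean is the parameter a."""
    return np.sum(-np.log(a) - x / a)

# Crude grid search for the a* that maximizes ln L.
a_grid = np.linspace(0.1, 10.0, 10000)
a_star = a_grid[np.argmax([log_likelihood(a) for a in a_grid])]

print(f"grid-search MLE a* = {a_star:.3f}")   # ~1.650
print(f"sample average     = {x.mean():.3f}") # 1.650
```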
• Example: Gaussian distribution
  – Let $f(x, a)$ be given by a Gaussian distribution function:
$$f(x_i, a) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x_i - a)^2/2\sigma^2} \qquad \text{(Gaussian pdf)}$$
  – Let $a = \mu$ be the mean of the Gaussian. We want to use our data and the MLM to find the mean, $\mu$.
  – We want the best estimate of $a$ from our set of $n$ measurements $\{x_1, x_2, \ldots, x_n\}$.
  – Let's assume that $\sigma$ is the same for each measurement.
  – The likelihood function for this problem is:
$$L = \prod_{i=1}^{n} f(x_i, a) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x_i - a)^2/2\sigma^2} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-(x_1 - a)^2/2\sigma^2}\, e^{-(x_2 - a)^2/2\sigma^2} \cdots e^{-(x_n - a)^2/2\sigma^2} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\sum_{i=1}^{n}(x_i - a)^2/2\sigma^2}$$
Taking the logarithm:
$$\ln L = \ln\!\left(\left[\frac{1}{\sigma\sqrt{2\pi}}\right]^{n} e^{-\sum_{i=1}^{n}(x_i - a)^2/2\sigma^2}\right) = n \ln\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n} \frac{(x_i - a)^2}{2\sigma^2}$$
We want to find the $a$ that maximizes the log likelihood function:
$$\frac{\partial \ln L}{\partial a} = \frac{\partial}{\partial a}\left[n \ln\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n} \frac{(x_i - a)^2}{2\sigma^2}\right] = 0$$
The first term does not depend on $a$, and we can factor out $2\sigma^2$ since it is a constant:
$$\sum_{i=1}^{n} 2(x_i - a)(-1) = 0$$
$$\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} a = 0$$
$$\sum_{i=1}^{n} x_i = na \qquad \text{(don't forget the factor of $n$)}$$
$$a = \frac{1}{n}\sum_{i=1}^{n} x_i$$
  – Average! The MLM says that calculating the average is the best thing we can do in this situation.
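As a quick numerical cross-check (a sketch with made-up numbers, not lecture data), scanning the Gaussian $\ln L$ confirms that its maximum sits at the sample average:

```python
import numpy as np

# Assumed data; sigma is taken to be the same for every measurement.
x = np.array([10.1, 9.7, 10.4, 9.9, 10.2])
sigma = 0.3

def lnL(a):
    """Gaussian log likelihood, dropping terms that do not depend on a."""
    return -np.sum((x - a)**2) / (2 * sigma**2)

a_grid = np.linspace(9.0, 11.0, 20001)
a_star = a_grid[np.argmax([lnL(a) for a in a_grid])]
print(a_star, x.mean())  # both ~10.06
```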
• If the $\sigma$'s are different for each data point, then $a$ is just the weighted average:
$$a = \frac{\sum_{i=1}^{n} x_i/\sigma_i^2}{\sum_{i=1}^{n} 1/\sigma_i^2} \qquad \text{(weighted average)}$$
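In code the weighted average is one line; the values below are illustrative assumptions:

```python
import numpy as np

# Hypothetical measurements with different uncertainties.
x     = np.array([10.1, 9.7, 10.4])
sigma = np.array([0.1, 0.3, 0.2])

w = 1.0 / sigma**2             # weights 1/sigma_i^2
a = np.sum(w * x) / np.sum(w)  # weighted average
print(f"weighted average a = {a:.3f}")
```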
• Example: Poisson distribution
  – Let $f(x, a)$ be given by a Poisson distribution:
$$f(x, a) = \frac{e^{-a} a^{x}}{x!}$$
  – Let $a = \mu$ be the mean of the Poisson.
  – We want the best estimate of $a$ from our set of $n$ measurements $\{x_1, x_2, \ldots, x_n\}$.
  – The likelihood function for this problem is:
$$L = \prod_{i=1}^{n} f(x_i, a) = \prod_{i=1}^{n} \frac{e^{-a} a^{x_i}}{x_i!} = \frac{e^{-a} a^{x_1}}{x_1!} \cdot \frac{e^{-a} a^{x_2}}{x_2!} \cdots \frac{e^{-a} a^{x_n}}{x_n!} = \frac{e^{-na}\, a^{\sum_{i=1}^{n} x_i}}{x_1!\, x_2! \cdots x_n!}$$
  – Find the $a$ that maximizes the log likelihood function:
$$\frac{d \ln L}{da} = \frac{d}{da}\left[-na + \ln a \sum_{i=1}^{n} x_i - \ln(x_1!\, x_2! \cdots x_n!)\right] = -n + \frac{1}{a}\sum_{i=1}^{n} x_i = 0$$
$$a = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{Average!}$$
• Some general properties of the maximum likelihood method:
  – For large data samples (large $n$) the likelihood function, $L$, approaches a Gaussian distribution.
  – Maximum likelihood estimates are usually consistent: for large $n$ the estimates converge to the true values of the parameters we wish to determine.
  – Maximum likelihood estimates are usually unbiased: for all sample sizes the parameter of interest is calculated correctly.
  – The maximum likelihood estimate is efficient: the estimate has the smallest variance.
  – The maximum likelihood estimate is sufficient: it uses all the information in the observations (the $x_i$'s).
  – The solution from the MLM is unique.
  – Bad news: we must know the correct probability distribution function for the problem at hand!
Maximum Likelihood Fit of Data to a Function (essentially Taylor Chapter 8)
• Suppose we have a set of $n$ measurements:
$$x_1,\ y_1 \pm \sigma_1$$
$$x_2,\ y_2 \pm \sigma_2$$
$$\ldots$$
$$x_n,\ y_n \pm \sigma_n$$
  – Assume each measurement error ($\sigma$) is a standard deviation from a Gaussian pdf.
  – Assume that for each measured value $y$ there is an $x$ which is known exactly.
  – Suppose we know the functional relationship between the $y$'s and the $x$'s:
$$y = q(x, a, b, \ldots)$$
where $a, b, \ldots$ are parameters that we are trying to determine from our data. The MLM gives us a method to determine $a, b, \ldots$ from our data.
• Example: Fitting data points to a straight line. We want to determine the slope ($b$) and intercept ($a$):
$$q(x, a, b) = a + bx$$
$$L = \prod_{i=1}^{n} f(x_i, a, b) = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-(y_i - q(x_i, a, b))^2/2\sigma_i^2} = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-(y_i - a - bx_i)^2/2\sigma_i^2}$$
  – Find $a$ and $b$ by maximizing the likelihood function $L$:
$$\frac{\partial \ln L}{\partial a} = \frac{\partial}{\partial a} \sum_{i=1}^{n} \left[\ln\!\left(\frac{1}{\sigma_i\sqrt{2\pi}}\right) - \frac{(y_i - a - bx_i)^2}{2\sigma_i^2}\right] = -\sum_{i=1}^{n} \frac{2(y_i - a - bx_i)(-1)}{2\sigma_i^2} = 0$$
$$\frac{\partial \ln L}{\partial b} = \frac{\partial}{\partial b} \sum_{i=1}^{n} \left[\ln\!\left(\frac{1}{\sigma_i\sqrt{2\pi}}\right) - \frac{(y_i - a - bx_i)^2}{2\sigma_i^2}\right] = -\sum_{i=1}^{n} \frac{2(y_i - a - bx_i)(-x_i)}{2\sigma_i^2} = 0$$
These are two linear equations with two unknowns.
For the sake of simplicity, assume that all the $\sigma$'s are the same ($\sigma_1 = \sigma_2 = \cdots = \sigma_n \equiv \sigma$):
$$\frac{\partial \ln L}{\partial a}:\quad \sum_{i=1}^{n} \frac{-2(y_i - a - bx_i)(-1)}{2\sigma^2} = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a - b\sum_{i=1}^{n} x_i = 0$$
$$\frac{\partial \ln L}{\partial b}:\quad \sum_{i=1}^{n} \frac{-2(y_i - a - bx_i)(-x_i)}{2\sigma^2} = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i x_i - a\sum_{i=1}^{n} x_i - b\sum_{i=1}^{n} x_i^2 = 0$$
We now have two equations that are linear in the unknowns $a$, $b$:
$$\sum_{i=1}^{n} y_i = na + b\sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} y_i x_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2$$
In matrix form:
$$\begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} y_i x_i \end{pmatrix} = \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}$$
Solving for the intercept and slope:
$$a = \frac{\sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} y_i x_i \sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \qquad \text{and} \qquad b = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
These equations correspond to Taylor's Eq. 8.10-8.12.
• EXAMPLE: A trolley moves along a track at constant speed. Suppose the following measurements of time vs. distance were made. From the data, find the best value for the speed ($v$) of the trolley.

Time t (s):       1.0  2.0  3.0  4.0  5.0  6.0
Distance d (mm):  11   19   33   40   49   61

The variables in this example are time ($t$) and distance ($d$) instead of $x$ and $y$.
  – Our model of the motion of the trolley tells us that:
$$d = d_0 + vt$$
  – We want to find $v$, the slope ($b$) of the straight line describing the motion of the trolley.
  – We need to evaluate the sums listed in the above formula:
$$\sum_{i=1}^{n} x_i = \sum_{i=1}^{6} t_i = 21\ \mathrm{s} \qquad \sum_{i=1}^{n} y_i = \sum_{i=1}^{6} d_i = 213\ \mathrm{mm}$$
$$\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{6} t_i d_i = 919\ \mathrm{s \cdot mm} \qquad \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{6} t_i^2 = 91\ \mathrm{s^2}$$
$$v = \frac{n\sum x_i y_i - \sum y_i \sum x_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{6 \cdot 919 - 21 \cdot 213}{6 \cdot 91 - 21^2} = 9.9\ \mathrm{mm/s} \qquad \text{(best estimate of the speed)}$$
$$d_0 = 0.8\ \mathrm{mm} \qquad \text{(best estimate of the starting point)}$$
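Plugging the trolley data into the closed-form solution (a self-contained sketch) reproduces these numbers:

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # time (s)
d = np.array([11., 19., 33., 40., 49., 61.])  # distance (mm)

n = len(t)
St, Sd   = t.sum(), d.sum()              # 21 s, 213 mm
Std, Stt = (t * d).sum(), (t * t).sum()  # 919 s*mm, 91 s^2

delta = n * Stt - St**2             # 6*91 - 21^2 = 105
v  = (n * Std - Sd * St) / delta    # slope:     9.9 mm/s
d0 = (Sd * Stt - Std * St) / delta  # intercept: 0.8 mm
print(f"v = {v:.1f} mm/s, d0 = {d0:.1f} mm")
```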
[Plot: the MLM fit to the data for $d = d_0 + vt$.]
The line in the above plot "best" represents our data, even though not all the data points lie on the line. The line minimizes the sum of the squares of the deviations ($\delta_i$) between the line and our data points ($d_i$):
$$\delta_i = \text{data} - \text{prediction} = d_i - (d_0 + vt_i) \quad\rightarrow\quad \text{minimize } \sum_i \delta_i^2 \quad\rightarrow\quad \text{same as MLM!}$$
We often call this technique "least squares" (LSQ). LSQ is more general than the MLM but is not always justified.
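As an illustrative check (a sketch using the trolley numbers), the fitted line does give a smaller $\sum \delta_i^2$ than nearby lines:

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
d = np.array([11., 19., 33., 40., 49., 61.])

def sum_sq(d0, v):
    """Sum of squared deviations between the data and the line d0 + v*t."""
    return np.sum((d - (d0 + v * t))**2)

print(sum_sq(0.8, 9.9))   # fitted line: smallest sum of squares
print(sum_sq(0.8, 10.5))  # steeper line: larger
print(sum_sq(2.0, 9.9))   # shifted line: larger
```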