Stats for astronomers - LUTH


Statistical inference for
astrophysics
A short course for astronomers
Cardiff University 16-17 Feb 2005
Graham Woan, University of Glasgow
Cardiff Feb 16-17 2005
1
Lecture Plan
Lectures 1 & 2
• Why statistical astronomy?
• What is probability?
• Estimating parameter values
  o the Bayesian way
  o the Frequentist way
Lectures 3 & 4
• Assigning Bayesian probabilities
• Testing hypotheses
  o the Bayesian way
  o the Frequentist way
• Monte-Carlo methods
Cardiff Feb 16-17 2005
2
Why statistical astronomy?
Generally, statistics has a bad reputation:
“There are three types of lies:
lies, damned lies and
statistics”
Mark Twain (attributing it to Benjamin Disraeli)
Often for good reason:
The Economist
Jun 3rd 2004
… two researchers at the University of Girona in Spain, have found that 38% of a sample
of papers in Nature contained one or more statistical errors…
Cardiff Feb 16-17 2005
3
Why statistical astronomy?
Data analysis methods are often
regarded as simple recipes…
http://www.numerical-recipes.com/
Cardiff Feb 16-17 2005
4
Why statistical astronomy?
Data analysis methods are often
regarded as simple recipes…
…but in analysis as in life,
sometimes the recipes don’t
work as you expect:
o Low number counts
o Distant sources
o Correlated ‘residuals’
o Incorrect assumptions
→ Systematic errors and/or
misleading results
Cardiff Feb 16-17 2005
5
Why statistical astronomy?
…and the tools can be clunky:
" The trouble is that what we [statisticians] call
modern statistics was developed under strong
pressure on the part of biologists. As a result, there is
practically nothing done by us which is directly
applicable to problems of astronomy."
Jerzy Neyman, founder of frequentist hypothesis testing.
Cardiff Feb 16-17 2005
6
Why statistical astronomy?
For example, we can observe only the one Universe:
(From Bennett et al 2003)
Cardiff Feb 16-17 2005
7
The Astrophysicist’s Shopping List
We want tools capable of:
o dealing with very faint sources
o handling very large data sets
o correcting for selection effects
o diagnosing systematic errors
o avoiding unnecessary assumptions
o estimating parameters and testing models
Cardiff Feb 16-17 2005
8
Why statistical astronomy?
Key question:
How do we infer properties of the Universe from
incomplete and imprecise astronomical data?
Our goal:
To make the best inference, based on our observed data and any
prior knowledge, reserving the right to revise our position if new
information comes to light.
Let’s come to this problem afresh with an astrophysicist’s eye, and
bypass some of the jargon of orthodox statistics, going right back to
plain probability:
Cardiff Feb 16-17 2005
9
Right-thinking gentlemen #1
“A decision was wise, even though it led to
disastrous consequences, if the evidence at
hand indicated it was the best one to make;
and a decision was foolish, even though it
led to the happiest possible consequences,
if it was unreasonable to expect those
consequences”
Herodotus, c.500 BC
We should do the best with what we have,
not what we wished we had.
Cardiff Feb 16-17 2005
10
Right-thinking gentlemen #2
“Probability theory is nothing but
common sense reduced to calculation”
Pierre-Simon Laplace
(1749 – 1827)
Cardiff Feb 16-17 2005
11
Right-thinking gentlemen #3
Occam’s Razor
“Frustra fit per plura, quod fieri
potest per pauciora.”
“It is vain to do with more what can
be done with less.”
William of Occam
(1288 – 1348 AD)
Everything else being equal, we
favour models which are simple.
Cardiff Feb 16-17 2005
12
The meaning of probability
Definitions, algebra and useful distributions
Cardiff Feb 16-17 2005
13
But what is “probability”?
• There are three popular interpretations of the word, each with an
interesting history:
– Probability as a measure of our degree of belief in a statement
– Probability as a measure of limiting relative frequency of outcome of a set
of identical experiments
– Probability as the fraction of favourable (equally likely) possibilities
• We will call these the Bayesian, Frequentist and Combinatorial
interpretations.
• Note there are signs of trouble here:
– How do you quantify “degree of belief”?
– How do you define “relative frequency” without using ideas of probability?
– What does “equally likely” mean?
• Thankfully, at some level, all three interpretations agree on the
algebra of probability, which we will present in Bayesian terms:
Cardiff Feb 16-17 2005
14
Algebra of (Bayesian) probability
• Probability [of a statement, such as “y = 3”, “the mass of a neutron star
is 1.4 solar masses” or “it will rain tomorrow”] is a number between 0
and 1, such that
  p(\text{true}) = 1, \quad p(\text{false}) = 0, \quad 0 < p(\text{not sure}) < 1
• For some statement X,
  p(X) + p(\bar{X}) = 1
where the bar denotes the negation of the statement -- The Sum Rule
• If there are two statements X and Y, then the joint probability is
  p(X, Y) = p(Y)\, p(X \mid Y)
where the vertical line denotes the conditional statement “X given Y is
true” – The Product Rule
Cardiff Feb 16-17 2005
15
Algebra of (Bayesian) probability
• From these we can deduce that
  p(X + Y \mid I) = p(X \mid I) + p(Y \mid I) - p(X, Y \mid I)
where “+” denotes “or” and I represents common background
information -- The Extended Sum Rule
• …and, because p(X,Y) = p(Y,X), we get
  p(X \mid Y, I) = \frac{p(X \mid I)\, p(Y \mid X, I)}{p(Y \mid I)}
which is called Bayes’ Theorem
• Note that these results are also applicable
in Frequentist probability theory, with a
suitable change in meaning of “p”.
Cardiff Feb 16-17 2005
Thomas Bayes
(1702 – 1761 AD)
16
Algebra of (Bayesian) probability
Bayes’ theorem is the appropriate rule for updating our degree
of belief when we have new data:
  p(X \mid Y, I) = \frac{p(Y \mid X, I)\, p(X \mid I)}{p(Y \mid I)}
or, in terms of models and data,
  \underbrace{p(\mathrm{model} \mid \mathrm{data}, I)}_{\text{Posterior}} =
  \frac{\overbrace{p(\mathrm{data} \mid \mathrm{model}, I)}^{\text{Likelihood}}\;
        \overbrace{p(\mathrm{model} \mid I)}^{\text{Prior}}}
       {\underbrace{p(\mathrm{data} \mid I)}_{\text{Evidence}}}
[note that the word evidence
is sometimes used for
something else (the ‘log
odds’). We will stick to the
p(d|I) definition here.]
We can usually calculate all these terms
Cardiff Feb 16-17 2005
17
Algebra of (Bayesian) probability
• We can also deduce the marginal probabilities. If X and Y are
propositions that can take on values drawn from X \in \{x_1, x_2, \ldots, x_n\}
and Y \in \{y_1, y_2, \ldots, y_m\}, then
  p(x_i) = p(x_i) \underbrace{\sum_{j=1..m} p(y_j \mid x_i)}_{=1}
         = \sum_{j=1..m} p(x_i)\, p(y_j \mid x_i)
         = \sum_{j=1..m} p(x_i, y_j)
This gives us the probability of X when we don’t care about Y. In these
circumstances, Y is known as a nuisance parameter.
• All these relationships can be smoothly extended from discrete
probabilities to probability densities, e.g.
  p(x) = \int p(x, y)\, dy
where “p(y)dy” is the probability that y lies in the range y to y+dy.
(A short numerical sketch of marginalisation follows below.)
Cardiff Feb 16-17 2005
18
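As an illustration (not from the slides), here is a minimal numerical sketch of discrete marginalisation over a nuisance parameter, using made-up joint probabilities:

```python
# Marginalising a nuisance parameter: p(x_i) = sum_j p(x_i, y_j).
# The joint probabilities below are hypothetical, for illustration only.
import numpy as np

# Joint probabilities p(x_i, y_j): rows are the x values, columns the y values.
p_xy = np.array([[0.10, 0.20],
                 [0.15, 0.25],
                 [0.05, 0.25]])

p_x = p_xy.sum(axis=1)   # y marginalised out
print(p_x)               # [0.3, 0.4, 0.3]; note p_x.sum() == 1
```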
Algebra of (Bayesian) probability
These equations are the key to
Bayesian Inference – the
methodology upon which (much)
astronomical data analysis is now
founded.
Clear introduction by Devinder
Sivia (OUP).
(fuller bibliography tomorrow)
See also the free book by Prasenjit Saha (QMW, London).
http://ankh-morpork.maths.qmw.ac.uk/%7Esaha/book
Cardiff Feb 16-17 2005
19
Example…
• A gravitational wave detector may have seen a type II supernova as a burst of
gravitational radiation. Burst-like signals can also come from instrumental glitches, and
only 1 in 10,000 bursts is really a supernova, so the data are checked for glitches by
examining veto channels. The test is expected to confirm the burst is astronomical in 95%
of cases in which it truly is, and in 1% when it truly is not.
The burst passes the veto test!! What is the probability we have seen a supernova?
Answer: Denote
  S ≡ “burst really is a supernova”
  G ≡ “burst really is a glitch”
  + ≡ “test says it’s a supernova”
  − ≡ “test says it’s a glitch”
Let I represent the information that the burst seen is typical of those used to deduce the
information in the question. Then we are told that:
  p(S | I) = 0.0001,   p(G | I) = 0.9999     (prior probabilities for S and G)
  p(+ | S, I) = 0.95,   p(+ | G, I) = 0.01    (likelihoods for S and G)
Example cont…
• But we want to know the probability that it’s a supernova, given it passed the veto test:
p(S | +, I).
By Bayes’ theorem,
  p(S \mid +, I) = p(S \mid I)\, \frac{p(+ \mid S, I)}{p(+ \mid I)},
and we are directly told everything on the rhs except p(+ | I), the probability that any
burst candidate would pass the veto test. If the burst is either a supernova or a hardware
glitch then we can marginalise over these alternatives:
  p(+ \mid I) = p(+, S \mid I) + p(+, G \mid I)
             = p(G \mid I)\, p(+ \mid G, I) + p(S \mid I)\, p(+ \mid S, I)
             = 0.9999 \times 0.01 + 0.0001 \times 0.95
so
  p(S \mid +, I) = \frac{0.0001 \times 0.95}{0.9999 \times 0.01 + 0.0001 \times 0.95} \approx 0.01
Cardiff Feb 16-17 2005
21
Example cont…
• So, despite passing the test there is only a 1% probability that the burst is a supernova!
Veto tests have to be blisteringly good if supernovae are rare.
• Why? Because most bursts that pass the test are just instrumental glitches – it really is
just common sense reduced to calculation.
• Note however that by passing the test, the probability that this burst is from a supernova
has increased by a factor of 100 (from 0.0001 to 0.01).
• Moral:
  p(+ \mid S, I) \neq p(S \mid +, I)
The left-hand side is the probability that a supernova burst gets through the veto (0.95) –
this is the likelihood: how consistent is the data with a particular model?
The right-hand side is the probability that it’s a supernova burst if it gets through the veto
(0.01) – this is the posterior: how probable is the model, given the data?
(A short numerical check of this example follows below.)
Cardiff Feb 16-17 2005
22
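A short numerical check of the veto example, using the numbers quoted on the slides (Python, for illustration):

```python
# Supernova-versus-glitch veto example, numbers from the slides.
p_S = 1e-4          # prior p(S|I): burst really is a supernova
p_G = 1 - p_S       # prior p(G|I): burst really is a glitch
p_pass_S = 0.95     # likelihood p(+|S,I): veto passes a real supernova
p_pass_G = 0.01     # likelihood p(+|G,I): veto passes a glitch

p_pass = p_S * p_pass_S + p_G * p_pass_G    # evidence p(+|I), by marginalisation
p_S_given_pass = p_S * p_pass_S / p_pass    # Bayes' theorem
print(p_S_given_pass)                       # ~0.0094, i.e. about 1%
```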
the basis of frequentist probability
Cardiff Feb 16-17 2005
23
Basis of frequentist probability
• Bayesian probability theory is simultaneously a very old
and a very young field:
  • Old: the original interpretation of Bernoulli, Bayes, Laplace…
  • Young: ‘state of the art’ in (astronomical) data analysis
• But BPT was rejected for several centuries because
probability as degree of belief was seen as too subjective
→ the Frequentist approach
Cardiff Feb 16-17 2005
24
Basis of frequentist probability
Probability = ‘long run relative frequency’ of an event
it appears at first that this can be measured objectively
e.g. rolling a die.
What is p(1)?
If the die is ‘fair’ we expect
  p(1) = p(2) = \cdots = p(6) = \tfrac{1}{6}
Can imagine rolling die M times.
Number rolled is a random variable – different outcome each time.
Cardiff Feb 16-17 2005
25
Basis of frequentist probability
• We define
  p(1) = \lim_{M \to \infty} \frac{n(1)}{M}
and the die is ‘fair’ if p(1) = 1/6.
• But objectivity is an illusion: the definition assumes each outcome is
equally likely (i.e. equally probable).
• It also assumes an infinite series of identical trials.
• What can we say about the fairness of the die after
(say) 5 rolls, or 10, or 100?
Cardiff Feb 16-17 2005
26
Basis of frequentist probability
In the frequentist approach, a lot of mathematical machinery is
specifically defined to let us address frequency questions:
•We take the data as a random sample of size M , drawn from an
assumed underlying pdf
•Sampling distribution, derived from the underlying pdf and M
•Define an estimator – function of the sample data that is used to
estimate properties of the pdf
But how do we decide what makes an acceptable estimator?
Cardiff Feb 16-17 2005
27
Basis of frequentist probability
Example: measuring a galaxy redshift
• Let the true redshift be z₀ -- a fixed but unknown parameter. We use two
telescopes to estimate z₀ and, by modelling the errors, compute sampling
distributions p(ẑ₁) and p(ẑ₂) for the two estimators.
1. Small telescope, low-dispersion spectrometer.
Unbiased: repeating the observation a large number of times, the average
estimate is equal to z₀:
  E(\hat{z}_1) = \int \hat{z}_1\, p(\hat{z}_1 \mid z_0)\, d\hat{z}_1 = z_0
BUT var(ẑ₁) is large due to the low dispersion.
Cardiff Feb 16-17 2005
28
Basis of frequentist probability
Example: measuring a galaxy redshift
• Let the true redshift be z₀ -- a fixed but unknown parameter. We use two
telescopes to estimate z₀ and, by modelling the errors, compute sampling
distributions p(ẑ₁) and p(ẑ₂) for the two estimators.
2. Large telescope, high-dispersion spectrometer,
but faulty astronomer! (e.g. wrong calibration)
Biased:
  E(\hat{z}_2) = \int \hat{z}_2\, p(\hat{z}_2 \mid z_0)\, d\hat{z}_2 \neq z_0
BUT var(ẑ₂) is small. Which is the better estimator?
Cardiff Feb 16-17 2005
29
Basis of frequentist probability
What about the sample mean?
• Let x1,, x M  be a random sample from pdf p(x) with mean 
and variance  2 . Then



Can show that
1 M
x i = sample mean

M i 1

E(  )  
-- an unbiased estimator
But bias is defined formally in terms of an infinite set of randomly
chosen samples, each of size M.
What can we say with a finite number of samples, each of finite
size?
Before that…
Cardiff Feb 16-17 2005
30
Some important
probability distributions
quick definitions
Cardiff Feb 16-17 2005
31
Some important pdfs: discrete case
1) Poisson pdf
e.g. number of photons / second counted by a CCD,
number of galaxies / degree² counted by a survey
n = number of detections
  p(n) = \frac{\lambda^n e^{-\lambda}}{n!}
The Poisson pdf assumes detections are independent, and
that there is a constant rate \lambda.
Can show that
  \sum_{n=0}^{\infty} p(n) = 1
Cardiff Feb 16-17 2005
32
Some important pdfs: discrete case
1) Poisson pdf
e.g. number of photons / second counted by a CCD,
number of galaxies / degree² counted by a survey
[Figure: the Poisson pdf p(n) = \lambda^n e^{-\lambda}/n! plotted against n]
Cardiff Feb 16-17 2005
33
Some important pdfs: discrete case
2) Binomial pdf
number of ‘successes’ from N observations, for two mutually
exclusive outcomes (‘Heads’ and ‘Tails’)
e.g. number of binary stars, Seyfert galaxies, supernovae…
r = number of ‘successes’
\theta = probability of ‘success’ for a single observation
  p_N(r) = \frac{N!}{r!\,(N-r)!}\, \theta^r (1-\theta)^{N-r}
Can show that
  \sum_{r=0}^{N} p_N(r) = 1
Cardiff Feb 16-17 2005
34
Some important pdfs: continuous case
1) Uniform pdf
  p(x) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}
[Figure: p(x) = 1/(b−a) between x = a and x = b, zero outside]
Cardiff Feb 16-17 2005
35
Some important pdfs: continuous case
2) Central, or Normal pdf
(also known as Gaussian)
  p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]
[Figure: Normal pdfs centred on \mu with \sigma = 0.5 and \sigma = 1, plotted against x]
Cardiff Feb 16-17 2005
36
Cumulative distribution function (CDF)
  P(a) = \int_{-\infty}^{a} p(x)\, dx = \mathrm{Prob}(x < a)
[Figure: the CDF P(x) plotted against x]
Cardiff Feb 16-17 2005
37
Measures and moments of a pdf
The nth moment of a pdf is defined as
  \langle x^n \rangle = \sum_{x=0}^{\infty} x^n\, p(x \mid I)    (discrete case)
  \langle x^n \rangle = \int_{-\infty}^{\infty} x^n\, p(x \mid I)\, dx    (continuous case)
Cardiff Feb 16-17 2005
38
Measures and moments of a pdf
The 1st moment is called the mean or expectation value
  E(x) = \langle x \rangle = \sum_{x=0}^{\infty} x\, p(x \mid I)    (discrete case)
  E(x) = \langle x \rangle = \int_{-\infty}^{\infty} x\, p(x \mid I)\, dx    (continuous case)
Cardiff Feb 16-17 2005
39
Measures and moments of a pdf
The 2nd moment is called the mean square
  \langle x^2 \rangle = \sum_{x=0}^{\infty} x^2\, p(x \mid I)    (discrete case)
  \langle x^2 \rangle = \int_{-\infty}^{\infty} x^2\, p(x \mid I)\, dx    (continuous case)
Cardiff Feb 16-17 2005
40
Measures and moments of a pdf
The variance is defined as
  \mathrm{var}(x) = \sum_{x=0}^{\infty} (x - \langle x \rangle)^2\, p(x \mid I)    (discrete case)
  \mathrm{var}(x) = \int_{-\infty}^{\infty} (x - \langle x \rangle)^2\, p(x \mid I)\, dx    (continuous case)
and is often written \sigma^2.
  \sigma = \sqrt{\mathrm{var}(x)} is called the standard deviation.
In general
  \mathrm{var}(x) = \langle x^2 \rangle - \langle x \rangle^2
Cardiff Feb 16-17 2005
41
Measures and moments of a pdf
pdf                                                                     mean            variance
Poisson (discrete):    p(r) = \lambda^r e^{-\lambda} / r!               \lambda         \lambda
Binomial (discrete):   p_N(r) = \frac{N!}{r!(N-r)!}\theta^r(1-\theta)^{N-r}   N\theta   N\theta(1-\theta)
Uniform (continuous):  p(x) = \frac{1}{b-a}                             \frac{1}{2}(a+b)   \frac{1}{12}(b-a)^2
Normal (continuous):   p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]   \mu   \sigma^2
Cardiff Feb 16-17 2005
42
Measures and moments of a pdf
The Median divides the CDF into two equal halves:
  P(x_{\mathrm{med}}) = \int_{-\infty}^{x_{\mathrm{med}}} p(x')\, dx' = 0.5
  \mathrm{Prob}(x < x_{\mathrm{med}}) = \mathrm{Prob}(x > x_{\mathrm{med}}) = 0.5
[Figure: the CDF P(x) against x, with the median marked where P = 0.5]
Cardiff Feb 16-17 2005
43
Measures and moments of a pdf
The Mode is the value of x for which the pdf is a maximum.
[Figure: Normal pdfs with \sigma = 0.5 and \sigma = 1, both peaking at the same x]
For a Normal pdf, mean = median = mode = \mu
Cardiff Feb 16-17 2005
44
Parameter estimation
The Bayesian way
Cardiff Feb 16-17 2005
45
Bayesian parameter estimation
In the Bayesian approach, we can test our model, in the light of our
data (i.e. rolling the die) and see how our degree of belief in its
‘fairness’ evolves, for any sample size, considering only the data
that we did actually observe.
  \underbrace{p(\mathrm{model} \mid \mathrm{data}, I)}_{\text{Posterior: what we know now}} \propto \underbrace{p(\mathrm{data} \mid \mathrm{model}, I)}_{\text{Likelihood: influence of our observations}} \times \underbrace{p(\mathrm{model} \mid I)}_{\text{Prior: what we knew before}}
Cardiff Feb 16-17 2005
46
Bayesian parameter estimation
Astronomical example #1:
Probability that a galaxy is a Seyfert 1
We want to know the fraction of Seyfert galaxies that are type 1.
How large a sample do we need to reliably measure this?
Model as a binomial pdf:
  \theta = global fraction of Seyfert 1s
Suppose we sample N Seyferts, and observe r Seyfert 1s:
  p_N(r \mid \theta, N) \propto \theta^r (1 - \theta)^{N - r}
-- the probability of obtaining the observed data, given the model: the likelihood of \theta
Cardiff Feb 16-17 2005
47
Bayesian parameter estimation
• But what do we choose as our prior? This has been the source of much
argument between Bayesians and frequentists, though it is often not that
important.
• We can sidestep this for a moment, realising that if our data are good
enough, the choice of prior shouldn’t matter!
  \underbrace{p(\mathrm{model} \mid \mathrm{data}, I)}_{\text{Posterior}} \propto \underbrace{p(\mathrm{data} \mid \mathrm{model}, I)}_{\text{Likelihood (dominates)}} \times \underbrace{p(\mathrm{model} \mid I)}_{\text{Prior}}
Cardiff Feb 16-17 2005
48
Bayesian parameter estimation
Consider a simulation of this problem using two different priors p(\theta | I):
  o a flat prior: all values of \theta equally probable
  o a Normal prior, peaked at \theta = 0.5
[Figure: the two priors plotted against \theta]
(A short numerical sketch of this simulation follows below.)
Cardiff Feb 16-17 2005
49
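A minimal sketch of how such a simulation might be coded (Python; the grid, the width of the Normal prior and the example counts are assumptions for illustration):

```python
# Posterior updating for the Seyfert-1 fraction theta, with two different priors.
import numpy as np

theta = np.linspace(0, 1, 1001)                        # grid of theta values (assumed)
prior_flat = np.ones_like(theta)                       # flat prior
prior_peak = np.exp(-0.5 * ((theta - 0.5) / 0.1)**2)   # Normal prior peaked at 0.5 (width 0.1 assumed)

def posterior(prior, r, N):
    """Binomial likelihood theta^r (1-theta)^(N-r) times the prior, normalised on the grid."""
    post = prior * theta**r * (1 - theta)**(N - r)
    return post / np.trapz(post, theta)

# e.g. after observing 100 galaxies, 32 of which are Seyfert 1s (as on the slides)
post_flat = posterior(prior_flat, 32, 100)
post_peak = posterior(prior_peak, 32, 100)
# The flat-prior posterior peaks at ~0.32; the Normal prior pulls its peak slightly towards 0.5.
print(theta[np.argmax(post_flat)], theta[np.argmax(post_peak)])
```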
[Figure sequence: the posterior p(\theta | data, I) plotted against \theta for both priors, updated as the sample grows --
after observing 0 galaxies; 1 galaxy (S1); 2 galaxies (S1 + S1); 3 galaxies (S1 + S1 + S2); 4 galaxies (S1 + S1 + S2 + S2);
5 galaxies (S1 + S1 + S2 + S2 + S2); 10 galaxies (5 S1 + 5 S2); 20 galaxies (7 S1 + 13 S2); 50 galaxies (17 S1 + 33 S2);
100 galaxies (32 S1 + 68 S2); 200 galaxies (59 S1 + 141 S2); 500 galaxies (126 S1 + 374 S2); 1000 galaxies (232 S1 + 768 S2).]
Cardiff Feb 16-17 2005
62
Bayesian parameter estimation
What do we learn from all this?
•As our data improve (e.g. our sample increases), the posterior
pdf narrows and becomes less sensitive to our choice of prior.
• The posterior conveys our (evolving) degree of belief in
different values of \theta, in the light of our data
• If we want to express our result as a single number we
could perhaps adopt the mean, median, or mode
• We can use the variance of the posterior pdf to assign an
uncertainty for \theta
• It is very straightforward to define confidence intervals
Cardiff Feb 16-17 2005
63
Bayesian parameter estimation
Bayesian confidence intervals
[Figure: the posterior p(\theta | data, I) with 95% of the area under the pdf lying between \theta_1 and \theta_2]
We are 95% sure that \theta lies between \theta_1 and \theta_2.
Note: the confidence interval is not unique, but we can define the shortest C.I.
Cardiff Feb 16-17 2005
64
Bayesian parameter estimation
Astronomical example #2:
Flux density of a GRB
Take Gamma Ray Bursts to be equally luminous events,
distributed homogeneously in the Universe. We see three
gamma ray photons from a GRB in an interval of 1 s. What is the
flux of the source, F?
• The seat-of-the-pants answer is F = 3 photons/s, with an uncertainty of about √3, but
we can do better than that by including our prior information on luminosity and
homogeneity. Call this background information I:
Homogeneity implies that the probability the source is in any particular
volume of space is proportional to the volume, so the prior probability that
the source is in a thin shell of radius r is
  p(r \mid I)\, dr \propto 4\pi r^2\, dr
Cardiff Feb 16-17 2005
65
Bayesian parameter estimation
• But the sources have a fixed luminosity, L, so r and F are directly related by
  F = \frac{L}{4\pi r^2}, \qquad \text{hence} \qquad \frac{dF}{dr} \propto r^{-3}
• The prior on F is therefore
  p(F \mid I) = p(r \mid I)\left|\frac{dr}{dF}\right| \propto F^{-5/2}
[Figure: p(F|I) falling steeply with F]
Interpretation: low flux sources are intrinsically more probable, as there is more
space for them to sit in.
• We now apply Bayes’ theorem to determine the posterior for F after seeing n
photons:
  p(F \mid n, I) \propto p(F \mid I)\, p(n \mid F, I)
Cardiff Feb 16-17 2005
66
Bayesian parameter estimation
• The Likelihood for F comes from the Poisson nature of photons:
  p(n \mid F, I) = F^{\,n} \exp(-F) / n!
so finally,
  p(F \mid n, I) \propto F^{\,n - 5/2} \exp(-F)
[Figure: the posterior p(F | n=3, I) against F, with the most probable value F_most prob marked, and F = n/T marked for comparison]
For n = 3 we get this, with the most probable value of F equalling 0.5 photons/sec.
• Clearly it is more probable this is a distant source
from which we have seen an unusually high number of photons than it is an
unusually nearby source from which we have seen an expected number of
photons. (The most probable value of F is n − 5/2, approaching n for n >> 1.)
(A short numerical sketch follows below.)
Cardiff Feb 16-17 2005
67
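A short sketch reproducing this posterior numerically (Python; the grid is an assumption):

```python
# GRB flux posterior p(F|n,I) proportional to F^(n-5/2) exp(-F), for n = 3 photons.
import numpy as np

n = 3
F = np.linspace(1e-3, 10, 10000)      # flux grid (assumed range)
post = F**(n - 2.5) * np.exp(-F)      # unnormalised posterior
post /= np.trapz(post, F)             # normalise numerically

print(F[np.argmax(post)])             # most probable flux, ~0.5 = n - 5/2 photons/s
```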
Parameter estimation
The frequentist way
Cardiff Feb 16-17 2005
68
Frequentist parameter estimation
• Recall that in frequentist (orthodox) statistics, probability is limiting
relative frequency of outcome, so :
only random variables can have frequentist probabilities
as only these show variation with repeated measurement. So we can’t
talk about the probability of a model parameter, or of a hypothesis.
E.g., a measurement of a mass is a RV, but the mass itself is not.
• So no orthodox probabilistic statement can be interpreted as directly
referring to the parameter in question! For example, orthodox
confidence intervals do not indicate the range in which we are
confident the parameter value lies. That’s what Bayesian intervals do.
• So what do they mean? …
Cardiff Feb 16-17 2005
69
Frequentist parameter estimation
• Orthodox parameter estimation proceeds as follows: imagine we have
some data, {di}, that we want to use to estimate the value of a
parameter, a, in a model. These data must depend on the true value
of a in a known way, but as they are random variables all we can say
is that we know, or can estimate, p({di}) for a given a.
1. Use the data to construct a statistic (i.e. a function of the data), called â, that can
be used as an estimator for a. A good estimator will have a pdf
that depends heavily on a, and which is sharply peaked at, or very close
to, a.
[Figure: the sampling pdf p(â | a) against â, with the measured value of â marked]
Cardiff Feb 16-17 2005
70
Frequentist parameter estimation
2. One such estimator is the maximum likelihood (ML) estimator, constructed
in the following way: given the distribution from which the data are
drawn, p(d|a), construct the sampling distribution p({di}|a), which is just
p(d1|a)·p(d2|a)·p(d3|a)… if the data are independent.
Interpret this as a function of a, called the Likelihood of a, and call
the value of a that maximises it for the given data â_ML. The
corresponding sampling distribution, p({di} | â_ML), is the one from which the
data were ‘most likely’ drawn.
[Figure: the likelihood L(a) plotted against a, peaking at â_ML]
Cardiff Feb 16-17 2005
71
Frequentist parameter estimation
…but remember that this is just one value of â_ML, albeit drawn from a
population that has an attractive average behaviour:
[Figure: the sampling pdf p(â_ML | a) against â_ML, with the measured value marked]
And we still haven’t said anything about a.
3. Define a confidence interval around a enclosing, say, 95% of the expected
values of â_ML from repeated experiments:
[Figure: the sampling pdf p(â_ML | a), with a region of width 2\sigma about a enclosing 95% of the area]
Cardiff Feb 16-17 2005
72
Frequentist parameter estimation
\sigma may be known from the sampling pdf or it may have to be estimated too.
4. We can now say that Prob(a - 2\sigma < \hat{a}_{ML} < a + 2\sigma) \simeq 0.95. Note this is a
probabilistic statement about the estimator, not about a. However this
expression can be restated: 0.95 is the relative frequency with which the
statement ‘a lies in the region \hat{a}_{ML} \pm 2\sigma’ is true, over many repeated
experiments.
[Figure: the intervals \hat{a}_{ML} \pm 2\sigma from different experiments, each marked ‘Yes’ or ‘No’ according to whether it contains a]
The disastrous shorthand for this is
  "a = \hat{a}_{ML} \pm 2\sigma \text{ with 95% confidence}"
Note that this is not a statement
about our degree of belief that a lies
in the numerical interval generated
in our particular experiment.
Cardiff Feb 16-17 2005
73
Example: B and F compared
• Two gravitational wave (GW) detectors see 7 simultaneous burst-like
signals in the data from one hour, consisting of GW signals and spurious
signals. When the two sets of data are offset in time by 10 minutes
there are 9 simultaneous signals. What is the true GW burst rate?
(Note: no need to be an expert to realise there is not much GW activity
here!)
• A frequentist solution:
Take the events as Poisson, characterised by the background rate, r_b, and
the GW rate, r_g. We get
  \hat{r}_b + \hat{r}_g = 7, \quad \sigma_{b+g} = \sqrt{7}
  \hat{r}_b = 9, \quad \sigma_b = \sqrt{9}
so
  \hat{r}_g = 7 - 9 = -2 \quad \text{and} \quad \sigma_g = \sqrt{9 + 7} = 4
where \sigma^2 is the variance of the sampling distribution.
Cardiff Feb 16-17 2005
74
Example: B and F compared
• So we would quote our result as:
  r_g = -2 \pm 4
• But burst rates can’t be negative! What does this mean? In fact it is
quite consistent with our definition of a frequentist confidence interval.
Our value of \hat{r}_g is drawn from its sampling distribution, which will look
something like:
[Figure: the sampling distribution p(\hat{r}_g | r_g), roughly 4 units wide, with our particular sample \hat{r}_g marked below zero]
So, in the shorthand of the subject, this result is quite correct.
Cardiff Feb 16-17 2005
75
Example: B and F compared
• The Bayesian solution:
We’ll go through this carefully. Our job is to determine rg. In Bayesian
terms that means we are after p(rg|data). We start by realising there
are two experiments here: one to determine the background rate and
one to determine the joint rate, so we will also determine p(rb|data).
If the background count comes from a Poisson process of mean r_b then
the probability of n events is
  p(n \mid r_b) = \frac{r_b^{\,n} \exp(-r_b)}{n!}
which is our Bayesian likelihood for r_b. We will choose a prior probability
for r_b proportional to 1/r_b. This is the scale invariant prior that
encapsulates total ignorance of the non-zero rate (of course in reality
we may have something that constrains r_b more tightly a priori). See
later…
Cardiff Feb 16-17 2005
76
Example: B and F compared
• So our posterior for r_b, based on the background counts, is
  p(r_b \mid n) \propto \frac{1}{r_b} \cdot \frac{r_b^{\,n} \exp(-r_b)}{n!}
which, normalising and setting n = 9, gives
  p(r_b \mid n = 9) = \frac{1}{8!}\, r_b^{\,8} \exp(-r_b)
[Figure: the background-rate posterior p(r_b | n=9) against r_b, peaking near r_b = 8]
• The probability of seeing m coincident bursts, given the two rates, is
  p(m \mid r_b, r_g) = \frac{(r_b + r_g)^m \exp[-(r_b + r_g)]}{m!}
Cardiff Feb 16-17 2005
77
Example: B and F compared
• And our joint posterior for the rates is, by Bayes’ theorem,
  p(r_b, r_g \mid m) \propto p(r_b, r_g)\, p(m \mid r_b, r_g)
The joint prior in the above can be split into a probability for r_b, which
we have just evaluated, and a prior for r_g. This may be zero, so we will
say that we are equally ignorant over whether to expect 0, 1, 2, … counts
from this source. This translates into a uniform prior on r_g, so our joint
prior is
  p(r_g, r_b) \propto \frac{1}{8!}\, r_b^{\,8} \exp(-r_b)
Finally we get the posterior on r_g by marginalising over r_b:
  p(r_g \mid m = 7) \propto \int \underbrace{\frac{1}{8!}\, r_b^{\,8} \exp(-r_b)}_{\text{prior}} \cdot \underbrace{\frac{1}{7!}\, (r_b + r_g)^7 \exp[-(r_b + r_g)]}_{\text{likelihood}}\, dr_b
Cardiff Feb 16-17 2005
78
Example: B and F compared
...giving our final answer to the problem:
[Figure: the posterior p(r_g | n=9, m=7) against r_g. Compare with our frequentist result r_g = -2 \pm 4 (to 1-sigma).]
Note that p(r_g < 0) = 0, due to the prior, and that the true value
of r_g could very easily be as high as 4 or 5.
(A numerical sketch of this marginalisation follows below.)
Cardiff Feb 16-17 2005
79
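A sketch of this marginalisation done numerically (Python; the grids are assumptions, the priors and likelihoods are those given on the slides):

```python
# Bayesian burst-rate calculation: marginalise the joint posterior over r_b numerically.
import numpy as np
from scipy.special import factorial

n, m = 9, 7                                     # background counts and coincident counts
rb = np.linspace(1e-3, 30, 3000)                # grid for the background rate (assumed)
rg = np.linspace(0, 15, 751)                    # grid for the GW rate (assumed)

# posterior for r_b from the background experiment (1/r_b prior, n = 9 counts)
p_rb = rb**(n - 1) * np.exp(-rb) / factorial(n - 1)

# likelihood of m coincident counts given both rates, on an (rg, rb) grid
RG, RB = np.meshgrid(rg, rb, indexing='ij')
like = (RB + RG)**m * np.exp(-(RB + RG)) / factorial(m)

post_rg = np.trapz(p_rb * like, rb, axis=1)     # marginalise over r_b (uniform prior on r_g)
post_rg /= np.trapz(post_rg, rg)
print(rg[np.argmax(post_rg)])                   # posterior mode (zero for these counts); p(r_g<0)=0 by construction
```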
Example: B and F compared
Let’s see how this result would change if the background count
was 3, rather than 9 (joint count still 7):
[Figure: the posterior p(r_g | n=3, m=7) against r_g]
Again, this looks very reasonable
Cardiff Feb 16-17 2005
80
Bayesian hypothesis testing
And why we already know how to do it
Cardiff Feb 16-17 2005
81
Bayesian hypothesis testing
• In Bayesian analysis, hypothesis testing can be performed as a generalised
application of Bayes’ theorem. Generally a hypothesis differs from a parameter
in that many values of the parameter(s) are consistent with one hypothesis.
Hypotheses are models that depend on parameters.
• Note however that we cannot define the probability of one hypothesis given
some data, d, without defining all the alternatives in its class, i.e.
  \mathrm{prob}(H_1 \mid d, I) = \frac{\mathrm{prob}(H_1 \mid I)\, \mathrm{prob}(d \mid H_1, I)}{\sum_{\text{all possible hypotheses}} \mathrm{prob}(H_i \mid I)\, \mathrm{prob}(d \mid H_i, I)}
and often this is impossible. So questions like “do the data fit a gaussian?” are
not well-formed until you list all the curves the data could fit. A well-formed
question would be “do the data fit a gaussian better than these other curves?”,
or more usually “do the data fit a gaussian better than a lorentzian?”.
• Simple comparisons can be expressed as an odds ratio, O:
  O_{12} = \frac{\mathrm{prob}(H_1 \mid d, I)}{\mathrm{prob}(H_2 \mid d, I)} = \frac{\mathrm{prob}(H_1 \mid I)}{\mathrm{prob}(H_2 \mid I)} \cdot \frac{\mathrm{prob}(d \mid H_1, I)}{\mathrm{prob}(d \mid H_2, I)}
Cardiff Feb 16-17 2005
82
Bayesian hypothesis testing
• The odds ratio can be divided into the prior odds and the Bayes’ factor:
  O_{12} = \frac{\mathrm{prob}(H_1 \mid d, I)}{\mathrm{prob}(H_2 \mid d, I)} = \underbrace{\frac{\mathrm{prob}(H_1 \mid I)}{\mathrm{prob}(H_2 \mid I)}}_{\text{prior odds}} \cdot \underbrace{\frac{\mathrm{prob}(d \mid H_1, I)}{\mathrm{prob}(d \mid H_2, I)}}_{\text{Bayes' factor}}
• The prior odds simply express our prior preference for H1 over H2, and are set to 1 if you
are indifferent.
• The Bayes’ factor is just the ratio of the evidences, as defined in the earlier lectures.
Recall that for a model that depends on a parameter a
  p(a \mid d, H_1, I) = \frac{p(a \mid H_1, I)\, p(d \mid a, H_1, I)}{p(d \mid H_1, I)}
so the evidence is simply the joint probability of the parameter(s) and the data,
marginalised over all hypothesis parameter values:
  p(d \mid H_1, I) = \int p(a \mid H_1, I)\, p(d \mid a, H_1, I)\, da
Cardiff Feb 16-17 2005
83
Bayesian hypothesis testing
• Example: A spacecraft is sent to a moon of Saturn and, using a penetrating
probe, detects a liquid sea deep under the surface at 1 atmosphere pressure
and a temperature of -3°C. However, the thermometer has a fault, so that the
temperature reading may differ from the true temperature by as much as ±5°C,
with a uniform probability within this range.
Determine the temperature of the liquid, assuming it is water (liquid within
0<T<100°C) and then assuming it is ethanol (liquid within -80<T<80°C). What
are the odds of it being ethanol?
[Figure: temperature axis from -80 to 100 °C, with the measurement near -3 °C; ethanol is liquid between -80 and 80 °C, water between 0 and 100 °C]
[based loosely on a problem by John Skilling]
Cardiff Feb 16-17 2005
84
Bayesian hypothesis testing
• Call the water hypothesis H1 and the ethanol hypothesis H2.
For H1:
The prior on the temperature is
  p(T \mid H_1) = \begin{cases} 0.01 & 0 \le T \le 100 \\ 0 & \text{otherwise} \end{cases}
[Figure: the flat prior p(T | H1) between 0 and 100 °C]
The likelihood of the temperature is the probability of the data d, given the
temperature:
  p(d \mid T, H_1) = \begin{cases} 0.1 & |d - T| < 5 \\ 0 & \text{otherwise} \end{cases}
Thought of as a function of T, for d = -3, this is a box of width 10 centred on T = -3.
[Figure: the likelihood p(d = -3 | T, H1) against T]
Cardiff Feb 16-17 2005
85
Bayesian hypothesis testing
The posterior for T is the normalised product of the prior and the likelihood, giving
  p(T \mid d, H_1) = \begin{cases} 0.5 & 0 \le T \le 2 \\ 0 & \text{otherwise} \end{cases}
[Figure: the posterior p(T | d, H1) between 0 and 2 °C]
H1 only allows the temperature to be between 0 and 2 °C.
• The evidence for water (as we defined it) is
  p(d \mid H_1) = \int p(d \mid T, H_1)\, p(T \mid H_1)\, dT = 0.002
For H2:
By the same arguments
  p(T \mid d, H_2) = \begin{cases} 0.1 & -8 \le T \le 2 \\ 0 & \text{otherwise} \end{cases}
[Figure: the posterior p(T | d, H2) between -8 and 2 °C]
and the evidence for ethanol is
  p(d \mid H_2) = \int p(d \mid T, H_2)\, p(T \mid H_2)\, dT = 0.00625
Cardiff Feb 16-17 2005
86
Bayesian hypothesis testing
• So under the water hypothesis we have a tighter possible range for the liquid’s
temperature, but it may not be water. In fact, the odds of it being water rather
than ethanol are
  O_{12} = \underbrace{\frac{\mathrm{prob}(H_1 \mid I)}{\mathrm{prob}(H_2 \mid I)}}_{\text{prior odds}} \cdot \underbrace{\frac{\mathrm{prob}(d \mid H_1, I)}{\mathrm{prob}(d \mid H_2, I)}}_{\text{Bayes' factor}} = 1 \times \frac{0.002}{0.00625} = 0.32
which means 3:1 in favour of ethanol. Of course this depends on our prior odds
too, which we have set to 1. If the choice was between water and whisky under
the surface of the moon the result would be very different, though the Bayes’
factor would be roughly the same!
• Why do we prefer the ethanol option? Because too much of the prior for
temperature, assuming water, falls at values that are excluded by the data. In
other words, the water hypothesis is unnecessarily complicated . Bayes’ factors
naturally implement Occam’s razor in a quantitative way.
Cardiff Feb 16-17 2005
87
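A short numerical check of the two evidences and the Bayes' factor (Python; the grid spacing is an assumption):

```python
# Water-versus-ethanol example: compute the two evidences and their ratio numerically.
import numpy as np

d = -3.0                                         # measured temperature, deg C
T = np.linspace(-100, 120, 220001)               # fine temperature grid (assumed)

def box(x, lo, hi):
    """Uniform pdf on [lo, hi], zero elsewhere."""
    return np.where((x >= lo) & (x <= hi), 1.0 / (hi - lo), 0.0)

prior_water   = box(T, 0, 100)                   # p(T|H1) = 0.01 on 0..100 C
prior_ethanol = box(T, -80, 80)                  # p(T|H2) = 0.00625 on -80..80 C
likelihood    = box(d - T, -5, 5)                # p(d|T,H) = 0.1 for |d - T| < 5

Z_water   = np.trapz(likelihood * prior_water, T)     # ~0.002
Z_ethanol = np.trapz(likelihood * prior_ethanol, T)   # ~0.00625
print(Z_water, Z_ethanol, Z_water / Z_ethanol)        # Bayes' factor ~0.32, i.e. ~3:1 for ethanol
```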
Bayesian hypothesis testing
  \int p(d \mid T, H_1)\, p(T \mid H_1)\, dT = 2 \times 0.1 \times 0.01 = 0.002
[Figure: the likelihood p(d | T, H1) (height 0.1) overlapping the prior p(T | H1) (height 0.01) only between 0 and 2 °C]
  \int p(d \mid T, H_2)\, p(T \mid H_2)\, dT = 10 \times 0.1 \times 0.00625 = 0.00625
[Figure: the likelihood p(d | T, H2) (height 0.1) lying entirely within the prior p(T | H2) (height 0.00625)]
The overlap integral (= evidence) is greater for H2 (ethanol) than for H1 (water).
Cardiff Feb 16-17 2005
88
Bayesian hypothesis testing
• To look at this a bit more generally, we can split the evidence into two
approximate parts, the maximum of the likelihood and an “Occam factor”:
[Figure: a sharply peaked likelihood of height L_max and width likelihood_range sitting inside a broad prior of width prior_range]
  p(d \mid H) = \int p(x \mid H)\, p(d \mid x, H)\, dx \approx L_{\max} \times \underbrace{\frac{\text{likelihood\_range}}{\text{prior\_range}}}_{\text{the 'Occam factor'}}
i.e., evidence = maximum likelihood × Occam factor
The Occam factor penalises models that include wasted parameter space, even
if they show a good ML fit.
Cardiff Feb 16-17 2005
89
Bayesian hypothesis testing
• This is a very powerful property of the Bayesian method.
Example: you’re given a time series of 1000 data points comprising a number of
sinusoids embedded in gaussian noise. Determine the number of sinusoids and
the standard deviation of the noise.
• We could think of this as comparing hypotheses Hn that there are n sinusoids in
the data, with n ranging from 0 to nmax. Equivalently, we could consider it as a
parameter fitting problem, with n an unknown parameter within the model.
The joint posterior for the n signals, with amplitudes \{A\}_n and frequencies \{\omega\}_n
and noise variance \sigma^2, given the overall model and data \{D\}, is
  p(n, \{A\}_n, \{\omega\}_n, \sigma \mid \{D\}, I) \propto p(n, \{A\}_n, \{\omega\}_n, \sigma \mid I)\, p(\{D\} \mid n, \{A\}_n, \{\omega\}_n, \sigma)
The likelihood term, based on gaussian noise, is
  p(\{D\} \mid n, \{A\}_n, \{\omega\}_n, \sigma) \propto \frac{1}{\sigma^N} \exp\!\left[-\frac{1}{2\sigma^2} \sum_j \Bigl(D_j - \sum_{i=1}^{n} A_i \sin\omega_i t_j\Bigr)^2\right]
and we can set the priors as independent and uniform over sensible ranges.
Cardiff Feb 16-17 2005
90
Bayesian hypothesis testing
• Our result for n is just its marginal probability:
  p(n \mid \{D\}, I) = \int p(n, \{A\}_n, \{\omega\}_n, \sigma \mid \{D\}, I)\, d\{A\}_n\, d\{\omega\}_n\, d\sigma
and similarly we could marginalise for \sigma. Recent work, in collaboration with
Umstatter and Christensen, has explored this:
• 1000 data points with 50 embedded signals, \sigma = 2.6
Result: around 33 signals can be recovered from the data;
the rest are indistinguishable from noise, and \sigma is
consequently higher.
Cardiff Feb 16-17 2005
91
Bayesian hypothesis testing
• This has implications for the analysis of LISA data, which is expected to contain
many (perhaps 50,000) signals from white dwarf binaries. The data will contain
resolvable binaries and binaries that just contribute to the overall noise (either
because they are faint or because their frequencies are too close together).
Bayes can sort these out without having to introduce ad hoc acceptance and
rejection criteria, and without needing to know the “true noise level”
(whatever that means!).
Cardiff Feb 16-17 2005
92
Frequentist hypothesis
testing
And why we should tread carefully, if at all
Cardiff Feb 16-17 2005
93
Frequentist hypothesis testing – significance tests
• A note on why these should really be avoided!
• The method goes like this:
– To test a hypothesis H1 consider another hypothesis, called the null
hypothesis, H0, the truth of which would deny H1. Then argue against H0…
– Use the data you have gathered to compute a test statistic Tobs (eg, the
chisquared statistic) which has a calculable pdf if H0 is true. This can be
calculated analytically or by Monte Carlo methods.
– Look where your observed value of the statistic lies in the pdf, and reject H0
based on how far in the wings of the distribution you have fallen.
p(T|H0)
Reject H0 if your result lies in here
T
Cardiff Feb 16-17 2005
94
Frequentist hypothesis testing – significance tests
• H0 is rejected at the x% level if x% of the probability lies to the right of
the observed value of the statistic (or is ‘worse’ in some other sense):
p(T|H0)
X% of the
area
Tobs
T
and makes no reference to how improbable the value is under any
alternative hypothesis (not even H1!). So…
An hypothesis [H0] that may be true is rejected because it has failed to
predict observable results that have not occurred. This seems a
remarkable procedure. On the face of it, the evidence might more
reasonably be taken as evidence for the hypothesis, not against it.
Harold Jeffreys Theory of Probability 1939
Cardiff Feb 16-17 2005
95
Frequentist hypothesis testing – significance tests
• E.g., you are given a data point Tobs affected by gaussian noise, drawn
either from N(mean = -1, \sigma = 0.5) or from N(+1, 0.5).
H0: The datum comes from \mu = -1
H1: The datum comes from \mu = +1
Test whether H1 is true.
• Here our statistic is simply the value of the observed data. We will
choose a critical region of T > 0, so that if Tobs > 0 we reject H0 and
therefore accept H1.
H0
H1
-1
+1
Cardiff Feb 16-17 2005
T
96
Frequentist hypothesis testing
[Figure: the two sampling distributions against T; the Type II error area lies under H1 below the critical value, the Type I error area under H0 above it]
• Formally, this can go wrong in two ways:
–
–
A Type I error occurs when we reject the null hypothesis when it is true
A Type II error occurs when we accept the null hypothesis when it is false
both of which we should strive to minimise
• The probabilities (‘p-values’) of these are shown as the coloured areas above
(about 5% for this problem)
• If Tobs>0 then we ‘reject the null hypothesis at the 5% level’ and therefore
accept H1.
Cardiff Feb 16-17 2005
97
Frequentist hypothesis testing
• But note that we do not consider the relative probabilities of the hypotheses
(we can’t! Orthodox statistics does not allow the idea of hypothesis
probabilities), so the results can be misleading.
• For example, let Tobs=0.01. This lies just on the boundary of the critical region,
so we reject H0 in favour of H1 at the 5% level, despite the fact that we know a
value of 0.01 is just as likely under both hypotheses (actually just as unlikely,
but it has happened).
• Note that the Bayes’ factor for this same result is ~1, reflecting the intuitive
answer that you can’t decide between the hypotheses based on such a datum.
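A short sketch comparing the tail probability with the Bayes' factor for this datum (Python, using scipy's normal distribution; the exact tail area comes out nearer 2% than 5%):

```python
# p-value versus Bayes' factor for T_obs = 0.01, with the two sampling distributions from the slide.
from scipy.stats import norm

T_obs = 0.01
H0 = norm(loc=-1, scale=0.5)    # null: datum drawn from N(-1, 0.5)
H1 = norm(loc=+1, scale=0.5)    # alternative: datum drawn from N(+1, 0.5)

p_value = H0.sf(T_obs)                        # P(T > T_obs | H0), ~0.02: "reject H0"
bayes_factor = H1.pdf(T_obs) / H0.pdf(T_obs)  # likelihood ratio, ~1.08: the datum barely prefers either
print(p_value, bayes_factor)
```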
Moral
The subtleties of p-values are rarely reflected in papers that quote them, and
many errors of Type III (misuse!) occur. Always take them with a pinch of salt,
and avoid them if possible – there are better tools available.
Cardiff Feb 16-17 2005
98
Assigning Bayesian probabilities
(you have to start somewhere!)
Cardiff Feb 16-17 2005
99
Assigning Bayesian probabilities
• We have done much on the manipulation of probabilities, but not much on the
initial assignments of likelihoods and priors. Here are a few words…
The principle of insufficient reason (Bernoulli, 1713) helps us out: If we have N
mutually exclusive possibilities for an outcome, and no reason to believe one
more than another, then each should be assigned the probability 1/N.
• So the probability of throwing a 6 on a die is 1/6, if all you know is it has 6
faces.
• The key is to realise that this is one example of a more general principle. Your
state of knowledge is such that the probabilities are invariant under some
transformation group (here, the exchange of numbers on faces).
• Using this idea we can extend the principle of insufficient reason to continuous
parameters.
Cardiff Feb 16-17 2005
100
Assigning Bayesian probabilities
• So, a parameter that is not known to within a change of origin (a ‘location
parameter’) should have a uniform probability
  p(x) = p(x + a) = \text{constant}
• A parameter for which you have no knowledge of scale (a ‘scale parameter’) is a
location parameter in its log, so
  p(x) \propto \frac{1}{x}
the so-called ‘Jeffreys prior’.
• Note that both these priors are improper – you can’t normalise them, so their
unfettered use is restricted. However, you are rarely that ignorant about the
value of a parameter.
Cardiff Feb 16-17 2005
101
Assigning Bayesian probabilities
• More generally, given some information, I, that we wish to use to assign
probabilities {p1…pn} to n different possibilities, then the most honest way of
assigning the probabilities is the one that maximises
  H = -\sum_{i=1}^{n} p_i \log p_i
subject to the constraints imposed by I. H is traditionally called the information
entropy of the distribution and measures the amount of uncertainty in the
distribution in the sense of Shannon (though there are several routes to the
same result). Honesty demands we maximise this, otherwise we are implicitly
imposing further constraints not contained in I.
• For example, the maximum entropy solution for the probability of seeing k
events given only the information that they are characterised by a single rate
constant  is the Poisson distribution. If you are told the first and second
moments of a continuous distribution the maxent solution is a gaussian etc…
Cardiff Feb 16-17 2005
102
Monte Carlo methods
How to do those nasty marginalisations
Cardiff Feb 16-17 2005
103
Monte Carlo Methods
1. Uniform random number, U[0,1]
See Numerical Recipes!
http://www.numerical-recipes.com/
Cardiff Feb 16-17 2005
104
Monte Carlo Methods
2. Transformed random variables
Suppose we have x ~ U[0,1] and let y = y(x). Then
  p(y)\,|dy| = p(x)\,|dx|
(the probability of a number between y and y+dy equals the probability of a
number between x and x+dx), so
  p(y) = p(x(y)) \left|\frac{dx}{dy}\right|
where the modulus appears because probability must be positive.
Cardiff Feb 16-17 2005
105
Monte Carlo Methods
2. Transformed random variables
Suppose we have x ~ U[0,1]. Let
  y = a + (b - a)x
Then y ~ U[a, b].
[Figure: the pdf p(y) = 1/(b-a) between y = a and y = b]
Numerical Recipes uses the transformation method to provide y ~ N(0,1)
Cardiff Feb 16-17 2005
106
Monte Carlo Methods
3. Probability integral transform
Suppose we can compute the CDF, P, of some desired random variable. Then:
1) Draw y ~ U[0,1]
2) Compute x = P^{-1}(y)
3) Then x ~ p(x)
Cardiff Feb 16-17 2005
107
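A minimal sketch of the probability integral transform (Python), using an exponential pdf as an illustrative target since its CDF inverts analytically:

```python
# Probability integral transform: sample p(x) = exp(-x) on x > 0
# via its inverse CDF, x = P^{-1}(y) = -ln(1 - y).
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=100000)   # step 1: y ~ U[0,1]
x = -np.log(1.0 - y)                     # step 2: x = P^{-1}(y)
print(x.mean())                          # step 3: x ~ p(x); the mean of this exponential is 1
```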
Monte Carlo Methods
4. Rejection Sampling
Suppose we want to sample from some pdf p(x) and we know that
  p(x) \le q(x) \quad \forall x
where q(x) is a simpler function called the proposal distribution.
[Figure: p(x) lying everywhere below the envelope q(x), with a trial point (x1, y)]
1) Sample x1 from q(x)
2) Sample y ~ U[0, q(x1)]
3) If y < p(x1), ACCEPT; otherwise REJECT
The set of accepted values {xi} is a sample from p(x).
Cardiff Feb 16-17 2005
108
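A minimal sketch of rejection sampling (Python), using an illustrative triangular target with a flat envelope of height 2:

```python
# Rejection sampling from p(x) = 2(1 - x) on [0, 1], under a flat envelope q(x) = 2.
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    return 2.0 * (1.0 - x)             # target pdf on [0, 1]

samples = []
while len(samples) < 10000:
    x1 = rng.uniform(0.0, 1.0)         # 1) sample x1 from the (flat) proposal
    y = rng.uniform(0.0, 2.0)          # 2) sample y ~ U[0, q(x1)], with q(x1) = 2
    if y < p(x1):                      # 3) accept if y falls under the target pdf
        samples.append(x1)

print(np.mean(samples))                # the mean of p(x) is 1/3
```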
Monte Carlo Methods
4. Rejection Sampling
The method can be very slow if the shaded
region is too large – particularly in high-N
problems, such as the LISA problem
considered earlier.
LISA
Cardiff Feb 16-17 2005
109
Monte Carlo Methods
5. Metropolis-Hastings Algorithm
Speed this up by letting the proposal distribution depend on the current sample:
o Sample an initial point x^{(1)}
o Sample a tentative new state x' from Q(x'; x^{(1)}) (e.g. a Gaussian centred on x^{(1)})
o Compute
  a = \frac{p(x')\, Q(x^{(1)}; x')}{p(x^{(1)})\, Q(x'; x^{(1)})}
o If a \ge 1, accept; otherwise accept with probability a:
  Acceptance:  x^{(2)} = x'
  Rejection:   x^{(2)} = x^{(1)}
The sequence {x^{(i)}} is a Markov chain. Note that rejected samples appear in the chain as
repeated values of the current state.
Cardiff Feb 16-17 2005
110
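A minimal sketch of a Metropolis-Hastings sampler (Python), targeting an illustrative Gaussian with a symmetric proposal, for which the Q factors cancel:

```python
# Metropolis-Hastings with a symmetric Gaussian proposal, targeting p(x) proportional to exp(-x^2/2).
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    return np.exp(-0.5 * x**2)           # target, up to normalisation

x = 0.0                                  # initial point x^(1)
chain = []
for _ in range(50000):
    x_new = x + rng.normal(0.0, 1.0)     # tentative new state from Q(x'; x), symmetric
    a = p(x_new) / p(x)                  # acceptance ratio (Q terms cancel for a symmetric proposal)
    if a >= 1 or rng.uniform() < a:
        x = x_new                        # accept
    chain.append(x)                      # a rejected proposal repeats the current state

print(np.mean(chain), np.std(chain))     # ~0 and ~1: the chain histogram approximates p(x)
```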
Monte Carlo Methods
A histogram of the contents of the
chain for a parameter converges to
the marginal pdf for that
parameter.
In this way, high-dimensional
posteriors can be explored and
marginalised to return parameter
ranges and interrelations inferred
from complex data sets (e.g., the
WMAP results).
http://www.statslab.cam.ac.uk/~mcmc/pages/links.html
Cardiff Feb 16-17 2005
111
Cardiff Feb 16-17 2005
112
The end
nearly
Cardiff Feb 16-17 2005
113
Final thoughts
• Things to keep in mind:
– Priors are fundamental, but not always influential. If you
have good data most reasonable priors return very similar
posteriors.
– Degree of belief is subjective but not arbitrary. Two people
with the same degree of belief agree about the value of the
corresponding probability.
– Write down the probability of everything, and marginalise
over those parameters that are of no interest.
– Don’t pre-filter if you can help it. Work with the data, not
statistics of the data.
Cardiff Feb 16-17 2005
114
Bayesian bibliography for astronomy
There are an increasing number of good books and articles on Bayesian methods.
Here are just a few:
• E.T. Jaynes, Probability theory: the logic of science, CUP 2003
• D.J.C. Mackay, Information theory, inference and learning algorithms, CUP, 2004
• D.S. Sivia, Data analysis – a Bayesian tutorial, OUP 1996
• T.J. Loredo, From Laplace to Supernova SN 1987A: Bayesian Inference in
Astrophysics, 1990, copy at http://bayes.wustl.edu/gregory/articles.pdf
• G.L. Bretthorst, Bayesian Spectrum Analysis and Parameter Estimation, 1988, copy
at http://bayes.wustl.edu/glb/book.pdf
• G. D'Agostini, Bayesian reasoning in high-energy physics: principles and
applications, http://preprints.cern.ch/cgi-bin/setlink?base=cernrep&categ=Yellow_Report&id=99-03
• Soon to appear: P. Gregory, Bayesian Logical Data Analysis for the Physical
Sciences, 2005, CUP
Review sites:
• http://bayes.wustl.edu/
• http://www.bayesian.org/
Cardiff Feb 16-17 2005
115