Transcript Document

1
Review of Bayesian Analysis
and Modeling
James B. Elsner
Department of Geography, Florida State University
2
Bayesian analysis and modeling using MCMC is good for you because…
- It has a solid decision-theoretic framework. Output from a probability model can be linked naturally to a utility model.
- It is intuitive. It combines the prior distribution (prior beliefs and/or experience) with the likelihood (the experiment) to obtain the posterior distribution (accumulated information). In this context, knowledge is equivalent to a distribution, and the precision of that knowledge can be quantified by the precision parameter.
- It is exact. Asymptotic approximations are not used, and the "plug-in" principle is avoided.
- Computations are tractable for practically all models.
- The focus shifts from model estimation to model correctness.
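As a concrete illustration of combining a prior with a likelihood to obtain a posterior, here is a conjugate Beta-Binomial update in Python. All numbers are invented for illustration; no MCMC is needed in this conjugate case.

```python
# Conjugate Beta-Binomial update: a Beta(a, b) prior combined with a
# binomial likelihood (k successes in n trials) yields the posterior
# Beta(a + k, b + n - k).  All numbers are invented for illustration.
a, b = 2.0, 2.0      # prior beliefs about a success probability
k, n = 7, 10         # observed data (the "experiment")

a_post, b_post = a + k, b + n - k          # accumulate information
post_mean = a_post / (a_post + b_post)     # point summary of knowledge
post_precision = a_post + b_post           # larger = sharper knowledge

print(a_post, b_post)        # 9.0 5.0
print(round(post_mean, 3))   # 0.643
```

Note how the posterior parameters simply accumulate the data, so "knowledge precision" grows with every observation.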
3
Anticipating Florida Hurricanes
James B. Elsner and Thomas H. Jagger, Florida State University
This past season of unprecedented Florida hurricane activity was not entirely a consequence of bad luck. A notable preseason weakening of the sea-level pressure gradient over the eastern North Atlantic foreshadowed the onslaught of storms, consistent with our research showing a negative relationship between springtime values of the North Atlantic oscillation (NAO) and hurricane activity over the U.S. coast (Elsner et al. 2001; Jagger et al. 2001), especially along the southeast coast that includes Florida (Elsner 2003). Here we design a hierarchical Bayesian model for annual Florida hurricane counts that includes a preseason May-June value for the NAO as a covariate and that assigns greater uncertainty to the counts prior to 1900. Using an average index value for May-June of 2004, the model hindcasts an active season for Florida. The model, showing a 45% chance of at least one Florida hurricane for 2004, is in bold contrast to the most recent 53-year historical probability of 30%.
Introduction
Doodle
Code
Data
Initial Conditions
Results
http://garnet.fsu.edu/~jelsner/www/research.html
Anticipating Florida Hurricanes, November 2004.
4
Frequentist vs. Bayesian

Definition of probability
  Frequentist: Long-run expected frequency in repeated (actual or hypothetical) experiments.
  Bayesian: Relative degree of belief in the state of the world.

Point estimate
  Frequentist: Maximum likelihood estimate.
  Bayesian: Mean, mode, or median of the posterior probability distribution.

Confidence intervals of parameters
  Frequentist: Based on the likelihood ratio test, which is based on the expected distribution of the maximum likelihood over many experiments.
  Bayesian: Credible intervals based on the posterior probability distribution.

Confidence intervals of non-parameters
  Frequentist: Based on the likelihood profile, or by resampling from the sampling distribution of the parameter.
  Bayesian: Calculated directly from the distribution of parameters.

Model selection
  Frequentist: Discard terms that are not significantly different from a null model (nested) at a previously set confidence level.
  Bayesian: Retain terms in models on the argument that processes are not absent simply because they are not statistically significant.

Difficulties
  Frequentist: Confidence intervals are confusing (the range that will contain the true value in a proportion α of repeated experiments); rejection of model terms for non-significance.
  Bayesian: Subjectivity; need to specify priors.
5
Why use Bayesian analysis and models?
When you have good (informative) prior information that you
want to use (from previous experiments, or reliable information from
literature). Frequentist meta-analysis can also be used in this case, but the
Bayesian approach is natural.
When you want to specify the distribution of non-parameters
(future values), Bayesian models work well. In particular, if you are
going to try to make a practical decision based on your model (using
decision theory), for example picking a management regime that
minimizes expected risks or maximizes expected returns, having the
posterior distribution of all the parameters makes the decision calculations
easier.
Complicated models, such as hierarchical models or models with missing or unobserved data, can be fit particularly conveniently using Bayesian methods. Frequentists call these random-effects or mixed models and have their own estimation procedures.
6
How Gibbs Sampling Works
The contours in the plot represent the joint distribution of θ, and the labels (0), (1), etc., denote the simulated values. Note that one iteration of the algorithm is complete after both components are revised. Also notice that each component is revised along the direction of the coordinate axes. This feature can be a source of problems if the two components are correlated, because then the contours get compressed and movements along the coordinate axes tend to produce small moves.
[Figure: Gibbs sampling algorithm in two dimensions, starting from an initial point (0) and completing three iterations (1)-(3); axes θ1 and θ2.]
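The coordinate-wise updates described above can be sketched in Python for a standard bivariate normal target with correlation ρ, each component drawn from its full conditional in turn. All settings (target, correlation, starting point, chain length) are illustrative, not from the slides.

```python
import numpy as np

# Minimal Gibbs sampler for a standard bivariate normal with
# correlation rho: update theta1 | theta2, then theta2 | theta1,
# moving only along the coordinate axes as in the figure.
rng = np.random.default_rng(42)
rho = 0.6
theta1, theta2 = -3.0, 3.0          # deliberately poor starting point
samples = []
for _ in range(5000):
    # full conditionals of a standard bivariate normal
    theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho**2))
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho**2))
    samples.append((theta1, theta2))
samples = np.array(samples)[500:]   # discard burn-in

print(np.round(samples.mean(axis=0), 1))   # both means near 0
```

With a larger ρ the conditional standard deviation sqrt(1 − ρ²) shrinks, the axis-aligned moves get smaller, and mixing slows, which is exactly the correlation problem noted above.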
7
After the model has converged, samples from the conditional distributions are used to
summarize the posterior distribution of parameters of interest.
Convergence refers to the idea that eventually the Gibbs samples will reach a stationary
distribution.
The question is:
1) How many samples are needed before convergence? The samples discarded
before convergence are called “burn-in”.
2) After convergence, how many samples are needed to summarize the posterior
distribution?
Answers vary from model to model and diagnostics are used to examine the
samples for convergence.
Possible convergence problems.
The assumed model may not be realistic from a substantive point of view or may
not fit.
Errors in calculation or programming. Often, simple syntax mistakes may be
responsible; however, it is possible that the algorithm may not converge to a
proper distribution.
Slow convergence: this is the problem we are most likely to run into. The
simulation can remain for many iterations in a region heavily influenced by the
starting distribution. If the iterations are used to summarize the target distribution,
they can yield falsely precise estimates.
8
Diagnostics for examining convergence.
One intuitive and easily implemented diagnostic tool is a trace plot (or history plot), which plots the parameter value as a function of sample number. If the model has converged, the trace plot will move up and down around the mode of the distribution. A clear sign of non-convergence occurs when we observe trending in the trace plot.
In WinBUGS, you can set up trace plots to monitor parameters while the program runs.
[Figure: trace plots illustrating convergence (samples fluctuating in a stable band) and non-convergence (samples trending).]
9
An autocorrelation plot will reveal whether or not there is correlation between successive
samples. The presence of correlation indicates that the samples are not effective in
moving through the entire posterior distribution. WinBUGS plots the autocorrelation out
to 50 lags.
[Figure: autocorrelation plots out to lag 50. Top, betaD[1]: no large autocorrelation. Bottom, mu.b[1] (chains 1:2): substantial autocorrelation.]
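The sample autocorrelation function WinBUGS plots can be computed directly from a chain. A sketch in Python; the AR(1) chain below is a stand-in for real sampler output, with lag-1 autocorrelation φ by construction.

```python
import numpy as np

# Sample autocorrelation of an MCMC chain out to `max_lag` lags
# (WinBUGS shows 50).
def acf(chain, max_lag=50):
    x = chain - chain.mean()
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / len(x) / var
                     for k in range(max_lag + 1)])

# AR(1) chain standing in for Gibbs output; phi sets its persistence.
rng = np.random.default_rng(0)
phi = 0.9
chain = np.empty(20000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = phi * chain[t - 1] + rng.normal()

r = acf(chain)
print(round(r[1], 2))   # close to phi = 0.9
```

Autocorrelation that decays slowly across the 50 lags, as here, indicates the chain moves through the posterior inefficiently.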
10
If the model has converged, additional samples from a parameter’s posterior
distribution should not influence the calculation of the mean. Running means of the
samples will reveal if the posterior mean has settled to a particular value.
A “lumpy” posterior may indicate non-convergence. It may be necessary to let the
sampler generate additional samples.
[Figure: posterior density of mu.b[8] (chains 1:2, sample: 128).]
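Running means of the samples can be computed with a cumulative sum. A sketch in Python; the chain below is simulated noise standing in for real sampler output.

```python
import numpy as np

# Running mean of posterior samples: if the chain has converged, the
# running mean flattens out as more samples arrive.
rng = np.random.default_rng(1)
samples = rng.normal(loc=0.35, scale=0.05, size=4000)  # simulated chain

running_mean = np.cumsum(samples) / np.arange(1, len(samples) + 1)

# After enough samples the running mean barely moves.
print(round(running_mean[-1], 2))
```

A running mean that keeps drifting after many samples suggests more iterations are needed before summarizing the posterior.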
Geweke Time-Series Diagnostic: If a model has converged, then the series of samples from the first half of the chain will have the same "time-series" properties as the series of samples from the second half of the chain.
11
Gelman-Rubin Diagnostic: Another way to identify convergence is to simulate
multiple sequences from different starting points. The behavior of the sequence of
chains should be the same. That is, the variance within the chains should be the same
as the variance across the chains.
[Figure: trace plots of alpha0 (chains 1:2) over iterations 101-600, showing convergence (chains overlapping) and non-convergence (chains remaining separated).]
12
If more than one chain is initialized, WinBUGS reports the Gelman-Rubin statistic.
The statistic is based on the following procedure:
1) Estimate your model from a variety of different initial values, iterating for an n-iteration burn-in followed by an n-iteration monitored period.
2) Take the n monitored draws from each of the m chains and calculate the following statistics:
Within-chain variance:
  W = \frac{1}{m(n-1)} \sum_{j=1}^{m} \sum_{i=1}^{n} \left( \theta_{ij} - \bar{\theta}_j \right)^2

Between-chain variance:
  B = \frac{n}{m-1} \sum_{j=1}^{m} \left( \bar{\theta}_j - \bar{\theta} \right)^2

Estimated variance:
  \hat{V}(\theta) = \left( 1 - \frac{1}{n} \right) W + \frac{1}{n} B

The Gelman-Rubin statistic:
  R = \sqrt{ \hat{V}(\theta) / W }
Before convergence, W underestimates the total posterior variance in θ because the chains have not fully explored the target distribution. V̂(θ), on the other hand, overestimates the variance in θ because the starting points are over-dispersed relative to the target.
13
Once convergence is reached, W and V̂(θ) should be almost equivalent because variation within the chains and variation between the chains should coincide, so R should approximately equal one.
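The statistics above translate directly into code. A minimal sketch using simulated chains in place of WinBUGS output:

```python
import numpy as np

# Gelman-Rubin statistic for m chains of n monitored draws each:
# within-chain variance W, between-chain variance B, pooled variance
# estimate V, and R = sqrt(V / W).
def gelman_rubin(chains):
    chains = np.asarray(chains)                 # shape (m, n)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    B = n * chain_means.var(ddof=1)             # between-chain variance
    V = (1 - 1 / n) * W + B / n                 # pooled estimate
    return np.sqrt(V / W)

rng = np.random.default_rng(3)
mixed = rng.normal(size=(2, 2000))              # chains agree
stuck = np.stack([rng.normal(0, 1, 2000),
                  rng.normal(5, 1, 2000)])      # chains disagree

print(round(gelman_rubin(mixed), 2))   # close to 1
print(gelman_rubin(stuck) > 1.1)       # True
```

When the chains never mix (the `stuck` case), B dominates, V̂(θ) far exceeds W, and R stays well above one.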
[Figure: BGR diagnostic plot for b[6] (chains 1:2) over the first 500 iterations.]
The normalized width of the central 80% interval of the pooled runs is green.
The normalized average width of the 80% intervals within the individual runs is blue.
R is red. R would generally be expected to be greater than 1 if the starting values are suitably over-dispersed.
Brooks and Gelman (1998) emphasize that one should be concerned both with convergence
of R to 1, and with convergence of both the pooled and within interval widths to stability.
14
Speeding up the sampling
Convergence sometimes requires generating tens or hundreds of thousands of samples. If this is the case, it is important to speed up the sampling.
Standardize your input variables. Subtract the mean and divide by the
standard deviation. This decreases the correlation between model variables
which can decrease the autocorrelation between samples.
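Standardization is a one-liner. The NAO-like covariate series below is made up for illustration.

```python
import numpy as np

# Standardize a covariate before passing it to the sampler: subtract
# the mean and divide by the standard deviation.
rng = np.random.default_rng(4)
nao = rng.normal(-0.5, 1.2, size=100)     # illustrative values only

nao_std = (nao - nao.mean()) / nao.std()

print(round(abs(nao_std.mean()), 6))   # 0.0 (up to float rounding)
print(round(nao_std.std(), 6))         # 1.0
```

Remember that regression coefficients fit to standardized covariates are on the standardized scale and must be back-transformed for interpretation on the original scale.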
Use the WinBUGS over-relaxation algorithm. This generates multiple samples at
each iteration and then selects one that is negatively correlated with the
current value. The time per iteration will increase, but the within-chain
correlations should be reduced and hence fewer iterations may be necessary.
However, this method is not always effective and should be used with caution.
The autocorrelation function should be used to check whether the mixing of the
chain improves.
Pick good initial values. If your initial values are near their posterior modes,
then convergence should occur relatively quickly, especially if there is not a big
problem with autocorrelation.
Just wait. With models involving many parameters, it may take hundreds of
thousands of samples. Using the "thin" option, you do not need to save all the
sample values for all the parameters.