Transcript: 01 Bayes (large print version)

My First Bayes:
Why and How to Run Your
First Bayesian Model
Rens van de Schoot
[email protected]
rensvandeschoot.wordpress.com
As stated by Kruschke (2012) in a recent
special issue of Perspectives on Psychological
Science
“[...] whereas the 20th century
was dominated by null hypothesis significance
testing, the 21st century is becoming
Bayesian.”
NHST vs. Bayes
Pr(Data | H0) ≠ Pr(Hi | Data)
Bayes Theorem
Pr(Hi | Data) = Pr(Data | Hi) × Pr(Hi) / Pr(Data)
Posterior ∝ prior × likelihood
The posterior probability is proportional to the product of the prior probability and the likelihood.
Bayes theorem: prior, data and posterior
p(A) = ¼
Bayes theorem: prior, data and posterior
p(B) = ¼
p(A∩B) = 1/16
Bayes theorem: prior, data and posterior
What is the probability of A within B?
Stated otherwise: what is the probability of A given B → p(A|B)?
Bayes theorem: prior, data and posterior
(1) p(A|B) = p(A∩B) / p(B)
(2) p(B|A) = p(A∩B) / p(A)
(3) rewriting (2): p(A∩B) = p(B|A) × p(A)
(4) Bayes Theorem: p(A|B) = p(B|A) × p(A) / p(B)
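As a concrete check of this derivation, here is a minimal Python sketch using the probabilities from the slides above; computing p(A|B) directly via (1) and via Bayes Theorem (4) gives the same answer.

# Numeric check of the derivation, using p(A) = 1/4, p(B) = 1/4,
# and p(A∩B) = 1/16 from the preceding slides.
p_A, p_B, p_AB = 1/4, 1/4, 1/16

p_A_given_B = p_AB / p_B             # (1) definition of conditional probability
p_B_given_A = p_AB / p_A             # (2)
via_bayes = p_B_given_A * p_A / p_B  # (4) Bayes Theorem
print(p_A_given_B, via_bayes)        # both print 0.25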
Bayes theorem: prior, data and posterior
Replace A and B with a hypothesis (H) and data (D).
Bayes Theorem:
Pr(Hi | Data) = Pr(Data | Hi) × Pr(Hi) / Pr(Data)
Posterior ∝ prior × likelihood
The posterior probability is proportional to the product of the prior probability and the likelihood.
Intelligence (IQ)
[Figure: the IQ axis, running from -∞ to ∞]
Prior Knowledge 1
[Figure: a prior on the IQ axis, -∞ to ∞]
Intelligence Interval   Cognitive Designation
40-54                   Severely challenged (<1% of test takers)
55-69                   Challenged (2.3% of test takers)
70-84                   Below average
85-114                  Average (68% of test takers)
115-129                 Above average
130-144                 Gifted (2.3% of test takers)
145-159                 Genius (less than 1% of test takers)
160-175                 Extraordinary genius
Prior Knowledge 1
[Figure: a prior on the IQ axis, 40 to 180]
Prior Knowledge 2
[Figure: a prior on the IQ axis, 40 to 180]
Prior Knowledge 3
[Figure: a prior on the IQ axis, 40 to 180, marked at 100]
Prior Knowledge 4
[Figure: a prior on the IQ axis, 40 to 180, marked at 100]
Prior Knowledge 5
[Figure: a prior on the IQ axis, 40 to 180, marked at 100]
Prior Knowledge
[Figure: priors 1-5 overlaid on the IQ axis, -∞ to ∞]
Prior
[Figure: a prior distribution on the IQ axis, -∞ to ∞]
Data
[Figure: the data (likelihood) shown together with the prior on the IQ axis]
Posterior
[Figure: the posterior shown together with the prior and data on the IQ axis]
Prior - Data
[Figure: a prior and data on the IQ axis, 40 to 180]
Posterior
[Figure: the resulting posterior between prior and data, 40 to 180]
Prior - Data
[Figure: a second prior-data combination on the IQ axis, 40 to 180]
Posterior
[Figure: the resulting posterior between prior and data, 40 to 180]
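The compromise shown in these figures can be computed exactly for a normal prior combined with normal data of known variance. Here is a minimal Python sketch; the numbers are assumptions loosely matching the IQ example later in this talk, not values read off the slides.

# Conjugate normal update: the posterior mean is a precision-weighted
# average of the prior mean and the data mean.
n = 20                                    # assumed sample size
prior_mean, prior_var = 100.0, 10.0 ** 2  # assumed prior
data_mean, data_var = 102.0, 15.0 ** 2    # assumed data summary, known variance

post_prec = 1 / prior_var + n / data_var  # precisions add
post_mean = (prior_mean / prior_var + n * data_mean / data_var) / post_prec
post_sd = (1 / post_prec) ** 0.5
print(round(post_mean, 2), round(post_sd, 2))  # posterior lies between prior and data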
How to obtain the posterior?
• In complex models, the posterior is often intractable (impossible to compute exactly)
• Solution: approximate the posterior by simulation
  – Simulate many draws from the posterior distribution
  – Compute the mode, median, mean, 95% interval, et cetera from the simulated draws
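Once simulated draws are available, these summary measures are one-liners. A minimal Python sketch with stand-in draws (hypothetical numbers, not output from a real model):

import numpy as np

# Stand-in posterior draws; in practice these come from an MCMC sampler.
draws = np.random.default_rng(0).normal(102, 3, size=10_000)

print("mean:  ", round(float(draws.mean()), 2))
print("median:", round(float(np.median(draws)), 2))
print("95% interval:", np.percentile(draws, [2.5, 97.5]).round(2))

# A simple histogram-based approximation of the mode
counts, edges = np.histogram(draws, bins=100)
k = counts.argmax()
print("mode (approx):", round((edges[k] + edges[k + 1]) / 2, 2))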
ANOVA example
Four unknown parameters μj (j = 1,...,4) and one common but unknown σ2.
Statistical model:
Y = μ1*D1 + μ2*D2 + μ3*D3 + μ4*D4 + E, with E ~ N(0, σ2)
The Gibbs sampler
Specify the priors:
Pr(μ1, μ2, μ3, μ4, σ2)
Prior: μj ~ N(μ0, var0), for example μj ~ N(0, 10000)
Prior: σ2 ~ IG(0.001, 0.001)
The prior for σ2 is an Inverse Gamma with shape a and scale b.
The Gibbs sampler
Combining the prior with the likelihood yields the posterior:
Post(μ1, μ2, μ3, μ4, σ2 | data)
...this is a 5-dimensional distribution...
The Gibbs sampler
Iterative evaluation via conditional distributions:
Post(μ1 | μ2, μ3, μ4, σ2, data) ~ N(M1, S1)
Post(μ2 | μ1, μ3, μ4, σ2, data) ~ N(M2, S2)
Post(μ3 | μ1, μ2, μ4, σ2, data) ~ N(M3, S3)
Post(μ4 | μ1, μ2, μ3, σ2, data) ~ N(M4, S4)
Post(σ2 | μ1, μ2, μ3, μ4, data) ~ IG(a, b)
The Gibbs sampler
1. Assign starting values
2. Sample μ1 from its conditional distribution
3. Sample μ2 from its conditional distribution
4. Sample μ3 from its conditional distribution
5. Sample μ4 from its conditional distribution
6. Sample σ2 from its conditional distribution
7. Go to step 2 until there are enough iterations
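As a concrete illustration of these seven steps, here is a minimal Python sketch of the Gibbs sampler for the ANOVA model above, run on simulated data (the group sizes, true means, and residual SD are assumptions). It uses the conjugate conditional posteriors listed on the previous slide.

import numpy as np

rng = np.random.default_rng(1)

# Simulated example data: 4 groups of 25 observations (assumed setup)
true_means = [3.0, 4.0, 5.0, 6.5]
y = [rng.normal(m, 1.5, size=25) for m in true_means]

# Priors from the earlier slide: mu_j ~ N(0, 10000), sigma2 ~ IG(0.001, 0.001)
mu0, var0 = 0.0, 10000.0
a0, b0 = 0.001, 0.001

n_iter, burn_in = 2000, 1000
mu = np.zeros(4)                                # step 1: starting values
sigma2 = 1.0
draws = np.empty((n_iter, 5))

for t in range(n_iter):
    # steps 2-5: each group mean from its normal conditional posterior
    for j in range(4):
        prec = y[j].size / sigma2 + 1.0 / var0
        m_j = (y[j].sum() / sigma2 + mu0 / var0) / prec
        mu[j] = rng.normal(m_j, np.sqrt(1.0 / prec))
    # step 6: sigma2 from its inverse-gamma conditional posterior
    resid = np.concatenate([y[j] - mu[j] for j in range(4)])
    a_n = a0 + resid.size / 2.0
    b_n = b0 + (resid ** 2).sum() / 2.0
    sigma2 = 1.0 / rng.gamma(a_n, 1.0 / b_n)    # IG(a, b) draw via 1/Gamma(a, 1/b)
    draws[t] = [*mu, sigma2]                    # step 7: record and repeat

post = draws[burn_in:]                          # discard the burn-in (see below)
print("posterior means:", post.mean(axis=0).round(2))
print("95% intervals:", np.percentile(post, [2.5, 97.5], axis=0).round(2))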
The Gibbs sampler

Iteration   μ1     μ2     μ3     μ4     σ2
1           3.00   5.00   8.00   3.00   10
2           3.75   4.25   7.00   4.30   8
3           3.65   4.11   6.78   5.55   5
...
15          4.45   3.19   5.08   6.55   1.1
...
199         4.59   3.75   5.21   6.36   1.2
200         4.36   3.45   4.65   6.99   1.3
Trace plot
[Figures: trace plots showing the starting value, the start of a stable pattern, the stable pattern, the chain with the burn-in omitted, the posterior, and a kernel density plot of the posterior distribution]
Convergence

Burn-in
• The Gibbs sampler must run t 'burn-in' iterations before we reach the target distribution f(Z)
  – How many iterations are needed to converge on the target distribution?
• Diagnostics
  – Examine a graph of the burn-in
  – Try different starting values
  – Run several chains in parallel
Convergence
[Figures: a series of trace plots illustrating chains that have and have not converged]
What went wrong?
• Mplus automatically monitors convergence using the potential scale reduction factor (PSRF)
• The PSRF criterion essentially requires the between-chain variation to be small relative to the total of the between- and within-chain variation
• The PSRF is checked every 100th iteration for each parameter separately, and the highest value is displayed on your computer screen while Mplus is running
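For intuition, here is a minimal Python sketch of the PSRF computation described above. It implements the textbook Gelman-Rubin formula; Mplus' exact implementation may differ in details such as degrees-of-freedom corrections.

import numpy as np

def psrf(chains):
    """chains: array-like of shape (m, n) = m chains of n draws each."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return float(np.sqrt(var_plus / W))       # values near 1 suggest convergence

# Usage: two toy chains sampling the same target distribution
rng = np.random.default_rng(0)
print(round(psrf(rng.normal(0, 1, size=(2, 500))), 3))  # close to 1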
Convergence
[Figures: further convergence examples]
Conclusions about convergence
• Burn-in: Mplus deletes the first half of each chain
• Run multiple chains (Mplus default: 2)
  – The PSR statistic compares the variance within chains to the variance between chains
  – It must be close to 1
  – Decrease BCONVERGENCE: the default is .05, but it is better to use .01
• ALWAYS do a graphical evaluation of each and every parameter
Summing up
• Probability: degree of belief
• Prior: what is known before observing the data
• Posterior: what is known after observing the data
• Informative prior: a tool to include subjective knowledge
• Non-informative prior: tries to express absence of prior knowledge; the posterior is mainly determined by the data
• MCMC methods: simulation (sampling) techniques to obtain the posterior distribution and all posterior summary measures
• Convergence: important to check
Uncertainty in Classical Statistics
• Uncertainty = the sampling distribution
  – Estimate the population parameter θ by θ̂
  – Imagine drawing an infinity of samples
  – Distribution of θ̂ over samples
• Problem: we have only one sample
  – Estimate θ̂ and its sampling distribution
  – Estimate the 95% confidence interval
Inference in Classical Statistics
• What does a 95% confidence interval actually mean?
  – Over an infinity of samples, 95% of these contain the true population value θ
  – But we have only one sample
  – We never know whether our present estimate θ̂ and confidence interval are among those 95% or not
Inference in Classical Statistics
• What does a 95% confidence interval NOT mean?
  – It does NOT mean that we have a 95% probability that the true population value θ is within the limits of our confidence interval
  – We only have an aggregate assurance that in the long run 95% of our confidence intervals contain the true population value
Uncertainty in Bayesian Statistics
• Uncertainty = a probability distribution for the population parameter
• In classical statistics the population parameter θ has one single true value
• In Bayesian statistics we imagine a distribution of possible values of the population parameter θ
Inference in Bayesian Statistics
• What does a 95% central credibility interval mean?
• We have a 95% probability that the population value θ is within the limits of our credibility interval
Software
• WinBUGS / OpenBUGS
  – Bayesian inference Using Gibbs Sampling
  – Very general; the user must set up the model
• R packages
  – LearnBayes, R2WinBUGS, MCMCpack
• AMOS
  – Special implementation for SEM
• MLwiN
  – Special implementation for multilevel regression
• Mplus
  – Very general (SEM + ML + many other models)
MPLUS - ML
DATA: FILE IS data.dat;
VARIABLE:
NAMES ARE x1-x5 y1-y5;
ANALYSIS:
ESTIMATOR IS ML;
MODEL:
y1-y5 ON x1-x5;
MPLUS - BAYES
DATA: FILE IS data.dat;
VARIABLE:
NAMES ARE x1-x5 y1-y5;
ANALYSIS:
ESTIMATOR IS BAYES;
MODEL:
y1-y5 ON x1-x5;
MPLUS - BAYES
• Single-level, multilevel, and mixture models
• Continuous and categorical outcomes
• Default non-informative priors or user-specified informative priors (MODEL PRIORS)
• Multiple chains using parallel processing (CHAIN)
• Convergence assessment using Gelman-Rubin potential scale reduction factors
• Posterior parameter distributions with means, medians, modes, and credibility intervals (POINT)
MPLUS - BAYES
• Posterior parameter trace plots
• Autocorrelation plots
• Posterior predictive checking plots
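For intuition about the posterior predictive checking plots, here is a minimal Python sketch of the underlying idea for a toy normal model (all numbers are assumptions): simulate replicated data sets from posterior draws and compare a test statistic with its observed value.

import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(102, 15, size=20)          # stand-in observed data
sd = y.std(ddof=1)

# Crude stand-in posterior draws for the mean, then replicated data sets
mus = rng.normal(y.mean(), sd / np.sqrt(y.size), size=1000)
y_rep = rng.normal(mus[:, None], sd, size=(1000, y.size))

# Test statistic: the sample maximum
ppp = (y_rep.max(axis=1) >= y.max()).mean()   # posterior predictive p-value
print("posterior predictive p:", round(float(ppp), 2))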
MPLUS - BAYES
ANALYSIS options:
– POINT (mean, median, mode; default is median)
– CHAINS (default is 2) + PROCESSORS = 32
– BSEED
– STVALUES (= ML, PERTURBED, UNPERTURBED)
– MEDIATOR (observed, latent; default is latent)
– ALGORITHM (GIBBS, MH; default is GIBBS)
– BCONVERGENCE (related to the PSR)
– BITERATIONS (to go beyond 50K iterations)
– FBITERATIONS (fixed number of iterations)
– THIN (every kth iteration recorded; default is 1)
– DISTRIBUTION (how many iterations are used for the MODE)
MPLUS - BAYES
MODEL PRIOR:
– Informative priors are used to incorporate prior knowledge
– Flexible in Mplus, but (Win)BUGS is more flexible and offers more tools
– In models with small variance parameters and small samples, the posterior is often sensitive to the choice of prior
– This holds especially for the variance estimates
– Do a sensitivity analysis (try different priors)
MPLUS - BAYES
MODEL PRIOR defaults:
– Intercepts, regression slopes, loadings: N(0, infinity), unless these parameters are in a probit regression, in which case N(0, 5) is used
– Variances: IG(-1, 0)
– Covariance matrices: IW(0, -p-1), unless the elements include parameters from a probit regression, in which case IW(I, p+1) is used
– Thresholds: N(0, infinity)
– Class proportions: Dirichlet prior D(10, 10, ..., 10)
Let's Analyze IQ
• N = 20
• Data are generated
• Mean = 102
• SD = 15
[Figures: the generated IQ data]
IQ

Prior type          Prior variance used       Posterior Mean IQ   95% C.I./C.C.I.
ML                  -                         102.00              94.42 - 109.57
Prior 1 (A)         -                         101.99              94.35 - 109.62
Prior 2a (M or A)   large variance, SD=100    101.99              94.40 - 109.42
Prior 2b (M or A)   medium variance, SD=10    101.99              94.89 - 109.07
Prior 2c (M or A)   small variance, SD=1      102.00              100.12 - 103.87
Prior 3 (A)         -                         102.03              94.22 - 109.71
Prior 4 (W)         medium variance, SD=10    102.00              97.76 - 106.80
Prior 5 (W)         small variance, SD=1      102.00              100.20 - 103.90
Prior 6a (W)        large variance, SD=100    99.37               92.47 - 106.10
Prior 6b (W)        medium variance, SD=10    86.56               80.17 - 92.47
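The pattern in this table can be mimicked with the conjugate normal update shown earlier. A minimal Python sketch follows; the prior means below are assumptions for illustration (102 for an 'accurate' prior, 80 for a deliberately 'wrong' one), not values taken from the slides.

# Sensitivity analysis in the spirit of the table above: the same data
# (n = 20, mean 102, SD 15) combined with normal priors that vary in
# location and spread.
n, data_mean, data_sd = 20, 102.0, 15.0
data_prec = n / data_sd ** 2

for prior_mean in (102.0, 80.0):
    for prior_sd in (100.0, 10.0, 1.0):
        prec = 1 / prior_sd ** 2 + data_prec
        mean = (prior_mean / prior_sd ** 2 + data_mean * data_prec) / prec
        half = 1.96 * prec ** -0.5
        print(f"prior N({prior_mean:5.1f}, {prior_sd:5.1f}^2): "
              f"posterior mean {mean:6.2f}, "
              f"95% interval ({mean - half:6.2f}, {mean + half:6.2f})")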
An example where Bayesian analysis is an improvement
The intervention program ATLAS (Adolescent Training and Learning to Avoid Steroids) was administered to high school football players to prevent the use of anabolic steroids. Data are from 861 high school football players.
Example from Bengt Muthén, Bayesian Analysis in Mplus: A Brief Introduction (incomplete draft, Version 3, May 17, 2010). Data from MacKinnon, D.P., Lockwood, C.M., & Williams, J. (2004). Confidence limits for the indirect effect: Distribution of the product and resampling methods. Multivariate Behavioral Research, 39, 99-128.
ATLAS Example, cont'd
• The indirect effect of the intervention on nutrition via the perceived severity of using steroids is the focus.
• The ML estimate is 0.020 (SE = 0.011), p = 0.056, ns.
ATLAS Example, setup
DATA: file = mbr2004atlas.dat;
VARIABLE: names = obs group severity nutrit;
usevariables = group - nutrit;
ANALYSIS: estimator = bayes;
processors = 2; chains = 4; bconvergence = 0.01;
MODEL:
severity on group (a);
nutrit on severity (b) group;
MODEL CONSTRAINT:
new(indirect);
indirect = a*b;
OUTPUT: tech8 standardized;
PLOT: type = plot2;
(With Bayes there is currently no INDIRECT or VIA command, hence the MODEL CONSTRAINT.)
ATLAS Example, cont'd
MODEL RESULTS (default settings)

            Estimate   Posterior S.D.   One-Tailed P-Value   95% C.I. (Lower 2.5%, Upper 2.5%)
New/Additional Parameters
INDIRECT    0.018      0.012            0.010                (0.002, 0.045)

MODEL RESULTS (chains = 4; bconvergence = 0.01)

INDIRECT    0.019      0.011            0.011                (0.002, 0.043)
ATLAS Example, cont'd
• Why is the Bayes estimate significant and the ML estimate not?
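One plausible explanation, in line with the MacKinnon et al. (2004) reference above: the distribution of a product a*b is skewed rather than normal, so a symmetric ML interval (estimate ± 1.96 SE) and a quantile-based Bayesian credibility interval can disagree about whether zero is covered. A minimal Python sketch with hypothetical draws (not the actual ATLAS posterior):

import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(0.30, 0.12, size=50_000)  # hypothetical posterior draws, path a
b = rng.normal(0.07, 0.02, size=50_000)  # hypothetical posterior draws, path b
ab = a * b                               # draws of the indirect effect; skewed

print("mean:", round(float(ab.mean()), 3))
print("95% credibility interval:", np.percentile(ab, [2.5, 97.5]).round(3))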
Two-level CFA, Sibling Data
• 37 families, 187 children
• Scores on 6 intelligence tests
• Multilevel structure: children nested in families
  – Example from Hox, Multilevel Analysis, 1st Ed., 2002
  – Problematic because of the small family-level sample size
• Source: Van Peet, A.A.J. (1992). De potentieeltheorie van intelligentie [The potentiality theory of intelligence]. Amsterdam: University of Amsterdam, unpublished Ph.D. thesis.
Path Diagram CFA Sibling Data
[Figure: path diagram of the two-level CFA, with a between part and a within part]
ML estimation 2-level CFA
TITLE: two level factor analysis Van Peet data, using ML estimators;
DATA: FILE IS ggkind.dat;
VARIABLE:
NAMES ARE famnr wordlist cards figures matrices animals occup;
CLUSTER IS famnr;
ANALYSIS:
TYPE IS TWOLEVEL;
ESTIMATOR IS ML;
MODEL:
%BETWEEN%
general by wordlist cards figures matrices animals occup;
%WITHIN%
numeric by wordlist cards matrices;
percept by occup animals figures;
OUTPUT: SAMPSTAT STANDARDIZED;
ML estimation 2-level CFA
MODEL RESULTS

Residual Variances   Estimate   S.E.    Est./S.E.   Two-Tailed P-Value
WORDLIST             1.598      1.323   1.208       0.227
CARDS                3.871      1.769   2.188       0.029
FIGURES              2.315      1.496   1.548       0.122
MATRICES             -0.160     0.673   -0.237      0.813
ANIMALS              1.085      1.400   0.775       0.438
OCCUP                5.705      1.988   2.870       0.004

Note the inadmissible negative residual variance estimate for MATRICES.
Bayes estimation 2-level CFA
TITLE: two level factor analysis Van Peet data, using the Bayes estimator;
DATA: FILE IS ggkind.dat;
VARIABLE:
NAMES ARE famnr wordlist cards figures matrices animals occup;
CLUSTER IS famnr;
ANALYSIS:
TYPE IS TWOLEVEL;
ESTIMATOR IS Bayes;
MODEL:
%BETWEEN%
general by wordlist cards figures matrices animals occup;
%WITHIN%
numeric by wordlist cards matrices;
percept by occup animals figures;
OUTPUT: SAMPSTAT STANDARDIZED;
Bayes estimation 2-level CFA
MODEL RESULTS

Residual Variances   Estimate   Posterior S.D.   One-Tailed P-Value   95% C.I. (Lower 2.5%, Upper 2.5%)
WORDLIST             4.618      2.235            0.000                (1.322, 9.820)
CARDS                3.991      2.369            0.000                (0.434, 9.468)
FIGURES              2.150      1.625            0.000                (0.201, 6.299)
MATRICES             0.712      0.669            0.000                (0.101, 2.604)
ANIMALS              3.104      1.891            0.000                (0.676, 7.825)
OCCUP                6.389      2.511            0.000                (2.888, 12.652)
What have we learned so far?
• Results are a compromise of prior & data
• However, pay attention to:
  – non/low-informative priors
  – informative priors
  – misspecification of the prior
  – convergence
• Results are easier to communicate (e.g., the CCI compared to a confidence interval)
Missing Data
Yet another advantage: missing data are easily handled with Bayesian estimation.

Missing Data
• Carried out using Bayesian estimation to create several data sets in which the missing values have been imputed
• The multiple-imputation data sets can be used for subsequent model estimation using ML or WLSMV
• The imputed data sets can be saved for subsequent analysis, or the analysis can be carried out in the same run
The Gibbs sampler
1. Assign starting values
2. Sample μ1 from its conditional distribution
3. Sample μ2 from its conditional distribution
4. Sample μ3 from its conditional distribution
5. Sample μ4 from its conditional distribution
6. Sample σ2 from its conditional distribution
7. Go to step 2 until there are enough iterations
The Gibbs sampler: now including missing data
• Data: y(obs) and y(mis)

The Gibbs sampler: now including missing data
1. Assign starting values, including y(mis)
2. ... 6. (as before)
7. Sample y(mis) from y(obs) and the current parameter values
8. Go to step 2 until there are enough iterations
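A minimal Python sketch of steps 1-8 for a toy normal model with some values missing completely at random (all numbers are assumptions). The point is the mechanics of step 7: y(mis) is redrawn from the model, given the current parameter values, at every iteration.

import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(102, 15, size=40)
miss = rng.random(40) < 0.25                 # ~25% missing (assumption)
y_obs = y[~miss]

# Step 1: starting values, including y(mis)
mu, sigma2 = 0.0, 1.0
y_fill = y.copy()
y_fill[miss] = y_obs.mean()

mu_draws = []
for t in range(2000):
    n = y_fill.size
    # Steps 2-6 (collapsed here): parameters given the "completed" data
    mu = rng.normal(y_fill.mean(), np.sqrt(sigma2 / n))
    ss = ((y_fill - mu) ** 2).sum()
    sigma2 = 1.0 / rng.gamma(n / 2.0, 2.0 / ss)      # inverse-gamma draw
    # Step 7: sample y(mis) from the model given the current parameters
    y_fill[miss] = rng.normal(mu, np.sqrt(sigma2), size=miss.sum())
    mu_draws.append(mu)                              # step 8: repeat

print("posterior mean of mu:", round(float(np.mean(mu_draws[1000:])), 2))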
Missing data: an example (n = 467)
1. Listwise deletion
2. FIML
3. FIML with auxiliary variables
4. Bayes
5. Bayes with auxiliary variables
Inadmissible parameters
WARNING: THE LATENT VARIABLE COVARIANCE MATRIX (PSI) IS NOT POSITIVE DEFINITE. THIS COULD INDICATE A NEGATIVE VARIANCE/RESIDUAL VARIANCE FOR A LATENT VARIABLE, A CORRELATION GREATER OR EQUAL TO ONE BETWEEN TWO LATENT VARIABLES, OR A LINEAR DEPENDENCY AMONG MORE THAN TWO LATENT VARIABLES. CHECK THE TECH4 OUTPUT FOR MORE INFORMATION.
Prior is Inverse Gamma: a (shape) = -1, b (scale) = 0
Inadmissible parameters
[Figure: two-level path diagram with between-level factors fb1 and fb2 and within-level factors fw1 and fw2, each measured by y1-y8; the between-level factor correlation is r = 1.05]

Inadmissible parameters
[Figure: the same two-level path diagram after adjusting the prior; the between-level factor correlation is now r = .94]
More on priors
• Specifying the type of distribution, e.g., normal versus inverse chi-square (inv.chi2)
• Non-informative versus informative priors
Conclusions I/II
• Excellent tool to include prior knowledge if available
• Estimates (including intervals) always lie in the sample space if the prior is chosen wisely
• Results are easier to communicate
• Missing data are easily handled
• BUT: Bayes doesn't solve misspecification of the model
Conclusions II/II
• Better small-sample performance; large-sample theory is not needed
• Analyses can be made less computationally demanding: Bayesian analysis can be used when ML is slow or intractable due to many dimensions of numerical integration