Monte Carlo methods - Applied Biomathematics Inc

Download Report

Transcript Monte Carlo methods - Applied Biomathematics Inc

Answer to #1
L
i
K
n
BD
foc
Koc
source-receptor distance
hydraulic gradient
hydraulic conductivity
effective soil porosity
soil bulk density
fraction organic carbon
organic partition coefficient
Porosity and conductivity would likely be positively related
Porosity and density would likely be inversely related
Thus, conductivity and density should be inversely related too
Fraction organic carbon and the partition coefficient would likely be
positively related
Answer to #2
L = [80,120] m
i = [0.0003,0.0008] m per m
K = [300,3000] m per yr
n = [0.2,0.35]
BD = [1500,1750] kg per m3
foc = [0.0001,0.005]
Koc = [5 ,20] m3 per kg
Kd = foc * Koc
R = 1 + BD * Kd / n
V = K * i / (n * R)
T = L/V
T
[ 20.952380952, 408800] yr
// source-receptor distance
// hydraulic gradient
// hydraulic conductivity
// effective soil porosity
// soil bulk density
// fraction organic carbon
// organic partition coefficient
func TT() return L/(K * i / ($1 * (1 + BD * foc * Koc / $1)))
b = left(n)
e = TT(b)
w = width(n)/100
for k=1 to 100 do begin
s = [ (k-1), k] * w + b
e = env(e, TT(s))
end
e
[ 31.480562448, 235352] yr
Answer to #3
Iwater = [1.5, 2.5] liters per day
Ifish = [0, 8] g per day
B = [0.9, 2.1] liters per g
W = [60, 90] kg
// water intake
// dietary ingestion of fish tissue
// bioaccumulation factor
// receptor biomass
D = [0, 6] mg per kg per day
// tolerable dose
// forward equation
// D = (Iwater * C) / W + (Ifish * B * C) / W
// condense repeated parameters
// D = C * (Iwater + Ifish * B) / W
// solve for C by factoring the quotient (Iwater+Ifish*B)/W out of D
C = factor((Iwater + Ifish * B) / W, D)
C
[ 0, 18.652849741] mg liters1
// put C back into original forward equation (and ensure correct units)
d = (Iwater * C) / W + (Ifish * C * B) / W + 0 mg per kg per day
d
[ 0, 6] mg kg1 day1
Answer to #5
3.5105
3.499105, 3.504105]
[3.50105, 1.35104]
[3105, 1.4104]
[2.9105, 4.1105]
[ 2.5105, 1.905 104]
105
Vesely et al. (all independent)
Mixed dependencies
Frank to Fréchet
All Fréchet
Mixed dependence with intervals
All Fréchet
104
Probability of tank rupturing under pumping
103
Plan of attack for #6
• Use the events from the slide “Snakes on a plane”
• Write as Boolean expression
– Represent ANDs as conjunctions &
– Represent ORs as disjunctions V
(plane V ship V raft V pet V zoo) & (pregnant V (malefemale &
repro) V (female & parthenog)) & (dogs & cats & pigs) & (eggs V
rodents) & (hides V quick V disperse)
• Specify the temporal context, e.g., a year
• Estimate marginal event probabilities
– Maybe as submodels, e.g., plane = flights * stowaway per flight
• Decide which events are independent
Answer to #7
Assuming a = 0.29, b = 0.22, and r = 1.0, yields
a * b + r * sqrt(a*(1-a)*b*(1-b))
0.2517692528
Pearson AND
[ max(0, a+b–1), min(a, b) ]
[ 0, 0.22]
Fréchet AND
The probability assuming Pearson correlation r =1 is larger than the largest possible
probability. This happens because the Pearson correlation for two events with these
marginals cannot be as large as one (so we are making a false assumption which
leads to the error). The extreme Pearson correlations for them are
(-a*b)/sqrt(a*b*(1-a)*(1-b)) // when they don’t overlap
-0.33941721344
(b-a*b)/sqrt(a*b*(1-a)*(1-b)) // when b’s inside a
0.83098697084
We can check this with this calculation
r = 0.83098697084
a * b + r * sqrt(a*(1-a)*b*(1-b))
0.22
Probability and
Monte Carlo simulation
Scott Ferson, [email protected]
18 September 2007, Stony Brook University, MAR 550, Challenger 165
Outline
•
•
•
•
•
Probability
Ensembles and distributions
Pseudorandom numbers and sampling
Convolutions via Monte Carlo simulation
Selecting distributions
–
–
–
•
•
Dependence
Other issues
–
–
•
Multiple instantiation
Model uncertainty
Number of replications to use
–
–
–
•
•
Default distributions
Fitting distributions (MoM, ML, regression)
Empirical distributions
Confidence interval for percentile
Confidence interval for whole distribution
Latin hypercube sampling to accelerate simulation
Summary of Monte Carlo: steps, limits and pitfalls
Related methods
–
–
Second-order Monte Carlo
Probability bounds analysis
Traditional practice
• Most assessments are deterministic
• Some are deliberately conservative, but
– degree of conservatism is opaque,
unquantified, and can be inconsistent
– difficult to characterize risks, except in
extreme situations
What’s needed
An assessment should tell us
• How likely the various consequences are
• How reliable the conclusions are
Probabilistic models
• The same as deterministic models
except that point values are replaced
by probability distributions
• Well, almost
•
•
•
•
Ensembles
Distribution shapes and tails
Dependencies
Backcalculation
What is probability?
• The frequentist view is that the probability
of an event is its frequency given a long
sequence of independent trials
• The subjectivist view is that probability of
an event is the degree of belief that a
person has that the event will occur
• Don’t confuse or mix these ideas
Bayesian probability
• Asserting A means you’ll pay $1 if not A
• If the probability of A is P, then a Bayesian
agrees to assert A for a fee of $(1-P), and
assert not-A for a fee of $P
• For Bayesians, belief implies action: bets
• People will have different Ps for an A
Frequentist probabilty
• Let NA be the number of times event A
occurs out of N trials
• The probability of A is NA/N, as N
• If the probability of event A is P(A) then
event A will occur NP(A) times, on
average, out of every N times
Probability
• Several ways to express how often the
same value appears
– Counts (most concrete conception)
– Relative frequencies (finite populations)
– Probabilities (even more abstract)
• Probability distributions used to model
both variability and incertitude
Probability distributions
• Distribution: bunch of values (ignoring order)
• Random distribution: order unpredictable
• Stationary distribution: unchanging shape
Probability distribution display
p(x)
• Density distribution or
mass distributions
x
• Cumulative distribution
function (CDF, integral
of density)
1
Pr(x<X)
0
• Complementary CDF
(exceedance risk)
x
1
Pr(x>X)
0
x
Continuous distributions
normal
1
uniform
0.8
0.6
exponential
0.4
lognormal
0.2
triangular
0
Weibull
0
beta
Laplace
many, many more
Cumulative probability
•
•
•
•
•
•
•
•
•
2
4
6
8
Values less than 0, or larger than
6, are impossible. Most values are
around 3. 80% of values are less
than 4.
•
•
•
•
binominal
Poisson
discrete uniform
reciprocal
Cumulative probability
Discrete distributions
1
0.8
0.6
0.4
0.2
0
0
2
4
Distributions can also be “mixed”
(neither continuous nor discrete)
6
8
Notation and language
• A “deviate” or “realization” is a particular real value of a
random variable or a value drawn from a distribution
• U(a, b) denotes a uniform distribution with minimum a
and maximum b, or sometimes a deviate from it
• N(, ) denotes a normal distribution with mean  and
standard deviation , or a deviate from it
• A tilda ~ is read “is distributed as”
• A distribution that describes the variation of a random
variable is sometimes called its “law”, so if X ~ N(0,1)
the law for X is normal
Ensemble
• Statisticians call it the “population” or
“reference class”
• E.g., dysfunction rates in prostate patients
• 50% of attempts at sex fail, or
• 50% of men are totally impotent
• If the analyst can't tell you what ensemble
the distribution represents, the analysis is
probably meaningless
Computational methods
• Analytical approaches
– Laplace and Mellin transforms
– Delta method (just means and variances)
• Discrete probability distributions
• Monte Carlo simulation
– Sampling values from input distributions
– Computing function of inputs
– Accumulating output histograms
Pseudorandom numbers
Sampling from a uniform
• Computerized eenie-meanie-miney-moe
•
•
•
•
Neumann’s middle-square method
Linear congruential generators
Mersenne twister (bulkier but faster and better)
Blum-Blum-Shub
• Smart to repeat simulations with different
generators
• The quantity a + (b − a)u varies uniformly
over [a, b] if u varies uniformly over [0,1]
Sampling continuous distributions
0.8
0.6
0.4
0.2
0
50
100
Value of random variable
0
150
Cumulative probability
uniform deviate
1
Sampling discrete distributions
0.8
0.6
0.4
0.2
0
50
100
Value of random variable
0
150
Cumulative probability
uniform deviate
1
Generating random deviates
• Inverse transform sampling can make
realizations from an arbitrary distribution
using uniform deviates
• Acceptance-rejection method can be used
when the CDF has no closed form
• Many distribution shapes have special
algorithms that are faster or better
Box-Muller algorithm
N(0,1) ~ 2 ln(u)  sin(2 v)
where u, v ~ U(0,1), and u  v
• Specialized generator for normal deviates
• Faster than inverting the cumulative normal
distribution which has no closed form
• General deviates N(, ) ~  N(0,1) + 
Rough and ready normals
• Add 12 independent uniforms to 6
• Algorithm easy to remember
• Not so great at extreme tails
– Bounded inside [6, +6]
Evans et al. describe
many distributions and
algorithms to generate
samples from them
Convolution
(arithmetic on distributions)
Convolution
+
X
=
Y
X+Y
Convolution is getting the output distribution
from the input distributions. Monte Carlo is
one way to do this.
Integrated calculations
X
Sample realizations from each input
distribution, then combine them all
to compute a (scalar) output value.
Repeat many times to accumulate a
distribution of output values.
Y
W
Z
f(W,X,Y,Z)
This is unlike interval analysis which
solved operations sequentially
Travel time (Lobascio)

n  BD  foc  Koc L
T
K i
Parameter
L source-receptor distance
i
hydraulic gradient
K hydraulic conductivity
n
effective soil porosity
BD soil bulk density
foc fraction organic carbon
Koc organic partition coefficient
Units
m
m/m
m/yr
kg/m3
m3/kg
Min
80
0.0003
300
0.2
1500
0.0001
5
Max
120
0.0008
3000
0.35
1750
0.005
20
Mean
100
0.00055
1000
0.25
1650
0.00255
10
Stdv
11.55
0.0001443
750
0.05
100
0.001415
3
Shape
uniform
uniform
lognormal
lognormal
lognormal
uniform
normal
Implementation in R
rlognorm < function(many,a, b)
return(rlnorm(many,log(a^2/sqrt(a^2+b^2)),sqrt(log((a^2+b^2)/a^2))))
many < 2000
L < runif(many, 80, 120)
i < runif(many, 0.0003 ,0.0008)
K < pmax(300, pmin(3000, rlognorm(many, 1000, 750)))
n < pmax(0.2, pmin(0.35, rlognorm(many, 0.25, 0.05)))
BD < pmax(1500, pmin(1750, rlognorm(many, 1650, 100)))
foc < runif(many, 0.0001, 0.005)
Koc < pmax(5, pmin(20, rnorm(many, 10, 3)))
T < (n + BD * foc * Koc) * L / (K * i)
summary(T)
Min.
1st Qu.
100.4
3853.0
Median
8111.0
Mean
12460.0
3rd Qu.
16950.0
< is assignment
rdist functions
rlnorm’s arguments
pmin and pmax
Max.
108800.0
plot(sort(T), 1:many/many, t='s', xlab='T', ylab='Probability')
D = T[T<500]
plot(sort(D), 1:length(D)/many, t='s', xlab='T', ylab='Probability', ylim=c(0,1))
length(D)/many
0.0195
Result of the simulation
1
Probability
0.8
0.6
0.4
0.2
0
0
50,000
Travel time [years]
100,000
Interpreting the result
• What is the minimum time? Maximum?
» 100 years, 109,000 years
• What happens when reps are increased?
» Smaller min, larger max, approaching interval range
• Why is the maximum time so much
smaller than the 234,000 years?
» Distribution highly skewed
• If the random variables weren’t truncated,
what would the min and max be?
» Negative travel times (!), larger max time
Detail
1
Probability
0.8
0.6
0.4
0.2
2%
0
0
100
200
300
400
Travel time [years]
500
Selecting input distributions
How to specify distributions
•
•
•
•
•
Default distribution
Expert opinion
Fitted distribution (goodness of fit)
Empirical distribution
Maximum entropy
Default distributions
•
•
•
•
•
•
Physicists overuse normal distributions
Risk analysts overuse uniform distributions
Reliability engineers overuse Weibulls
Ecologists overuse lognormals
Statisticians overuse beta distributions
Food safety overuses triangular distributions
Cooper’s quip
Arguments (frequently abused)
• Sum of many independent comparable terms (normal)
• Product of many independent factors (lognormal)
• Largest of many values (Gumbel, Weibull, Fréchet)
• Lifetimes with neither aging nor burn-in (exponential)
• Symmetry (uniform)
• Rare discrete events (Poisson)
• Independent trials (binomial)
Fitting distributions
• Assumes you know the distribution family
• Different methods give different results
– Method of moments
– Maximum likelihood
– Regression methods
Method of matching moments
• Pick the distribution family
– Mechanistic argument or mathematical convenience
• Compute moments of the data
• Select distribution from family with same moments
– E.g., exponential (1), normal (2), triangular (3)
• Other issues
– Check the goodness of fit of the answer
– Software can check many shapes at once
– MoM not always available, lacks optimality properties
Example
1
Second moment  x 2 = 0.2334
n
normal(0.439, 0.2334)
sd  0.2
1
Cumulative probability
2
0.8
Probability density
Data
0.653
0.178
0.263
0.424
0.284
0.438
0.471
0.852
0.480
0.375
0.148
0.185
0.320
0.642
0.247
0.784
0.643
0.261
0.636
0.487
1
Mean  x = 0.439
n
0.6
1
0.4
0.2
0
0
0
0.2 0.4 0.6 0.8 1
0
0.2 0.4 0.6 0.8 1
Maximum likelihood (ML)
• Find the distribution that maximizes the
likelihood of having observed the data
• Likelihood is the probability of seeing the
data assuming the distribution actually
has some form
datum
Maximum likelihood in practice
• When data are sampled independently,
just multiply the likelihoods
• Formulas for many shapes known
• But they can sometimes be scary
• Maximum likelihood agrees with MoM for
some distributions (e.g., normal and
exponential)
Graphical/regression methods
• Rank the data
• Transform ranks by probability distribution
• Regress data against the transformed ranks
x
10
8
6
4
2
0
- 1.5
The slope and
intercept give the
parameters of the
fitted distribution
-1
- 0.5
0
0.5
1
1.5
Normal score of datum rank
Goodness of fit (GoF)
• Goodness of fit measures
– Chi squared
– Anderson-Darling (focuses on tails)
– Kolmogorov-Smirnov (largest vertical difference)
– Specialized tests for particular families
(N,L : Shapiro-Wilks, Shapiro-Frances, D’Agostino, Lillefors)
• EPA says it’s less important than sensibleness
– Recommend bypassing a better fitting distribution for
one that is plausible under some mechanism
Empirical distributions
• Distribution characterized by observed data
F(x) = 1/N  #(Xi < x)
• Assumes data were randomly collected
• Doesn’t need a shape assumption
Data
0.653
0.178
0.263
0.424
0.284
0.438
0.471
0.852
0.480
0.375
0.148
0.185
0.320
0.642
0.247
0.784
0.643
0.261
0.636
0.487
Emp irical cumulative probability
Example
1
0.8
0.6
0.4
0.2
0
0
0.2
0.4
0.6
0.8
Value of random variable
1
1.2
Drawbacks of EDFs
• Need a fair amount of data to yield
reliable estimates of distributions
• Tails are likely to be poorly modeled
• Data usually underestimate the range
• Depends on “plotting position”
Surrogacy
• Data from a “related” species
• Data on spatial variation used for temporal
• QSAR, TEFs, allometries, regressions, etc.
• Analyst should model this extrapolation
• But it’s often not clear how to fix values
Maximum entropy
• Information theory
n
  pi ln pi
i 1
• All possibilities are equiprobable
• Laplace's “principle of insufficient reason”
• Specified by what is known
• Mathematically more defensible than
arbitrary assignment of distributions
• Much cheaper than expert elicitation
Maximum entropy solutions
When you know
{minimum, maximum}
{mean, standard deviation}
{minimum, maximum, mode}
{minimum, mean}
{min, max, mean, stddev}
{minimum = 0, some quantile}
{minimum > 0, some quantile}
{minimum, maximum, mean}
{mean, geometric mean}
Use this shape
uniform
normal
beta
exponential
beta
exponential
gamma
beta
gamma
1
1
1
mean,
std
min,max
0
0
2
4
6
8
10
0
-10
1
0
10
20
1
0
4
6
8
10
4
6
8
30
40
2
4
min,max,
mean,std
0
20
50
40
50
6
8
10
1
min,
mean
0
30
sample
data,
range
10
1
10
20
0
2
1
0
10
min,
max,
mean
0
2
0
1
integral,
min,max
Cumulative probability
positive,
quantile
percentiles
0
2
4
6
8
10
2
4
6
8
10
Maximum entropy’s problem
• Depends on the choice of scale
• For instance, knowing the possible
range for degradation rate yields one
distribution, but knowing the possible
range for half life yields a very different
one even though the information is
exactly the same
How to specify distributions
• Default distributions
come right out of the book
• Fitted or empirical distributions
usually not enough data available
• Extrapolations and surrogate data
requires professional judgment
• Elicitation from experts
expensive, controversial when experts disagree
• Maximum entropy criterion
inconsistent through changes of scale
All distributions should be reality-checked
Example: mercury in wild mink
Location: Bayou d’Inde, Louisiana
Receptor: generic piscivorous small mammal
Contaminant: mercury
Exposure route: diet (fish and invertebrates)
Based loosely on the assessment described in “Appendix I2: Assessment of Risks to Piscivorus
[sic] Mammals in the Calcasieu Estuary”, Calcasieu Estuary Remedial Investigation/Feasibility
Study (RI/FS): Baseline Ecological Risk Assessment (BERA), prepared October 2002 for the U.S.
Environmental Protection Agency. See http://www.epa.gov/earth1r6/6sf/pdffiles/appendixi2.pdf.
Total daily intake from diet
 Cfish |  | Pfish
Cinverts |  | Pinverts 

TDI  FMR |  | 


AE

GE
AE
|

|
GE
fish
fish
inverts
inverts 

FMR
normalized free metabolic rate of the mink
AEfish assimilation efficiency of dietary fish in the mink
AEinverts assimilation efficiency of dietary invertebrates in the mink
GEfish gross energy of fish tissue
GEinverts gross energy of invertebrate tissue
Cfish
mercury concentration in fish tissue
Cinverts mercury concentration in invertebrate tissue
Pfish
proportion of fish in the mink’s diet
Pinverts proportion of invertebrates in the mink’s diet
What’s known about the inputs
FMR, Regression on mink body mass
FMR = a BW b, where a and b are intervals
BW, Empirically well studied
Normal with known mean and dispersion
AEfish, AEinverts, Some field data
Mean and upper and lower values
GEfish, GEinverts, Many measurements
Normal with known mean and dispersion
Cfish, Cinverts, Dictated by EPA policy
95% upper confidence limit on concentration
Pfish, Pinverts, Assumed by the analyst
Constants
Input assignments
BW
~ normal( 608, 66.9) gram
a
~ uniform(0.354,0.47)
b
~ uniform(0.836,0.888)
FMR ~ a * BW ^ b
AEfish ~ MEminmaxmean(0.77, 0.98, 0.91)
AEinverts ~ MEminmaxmean(0.72, 0.96, 0.87)
GEfish ~ normal(1200, 240)
GEinverts~ normal(1050, 225)
Cfish
= 0.3
Cinverts = 0.06
Pfish
= 0.9
Pinverts = 0.1
Kcal per kg per day
Kcal per kg
Kcal per kg
mg per kg
mg per kg
Input distributions
BW
400
0
600
800
FMR
0
100
200
GE2
C1
1000 2000
0.1 0.2 0.3 0.4
AE1
0.7 0.8 0.9 1
AE2
0.7 0.8 0.9 1
C2
0.02 0.04 0.06
GE1
P1
0.8
0.9
0
1000 2000
P2
1
0.1
0.12
Subscript 1 denotes fish, 2 denotes inverts
Results
1
Exceedance risk
mean
median
95th percentile
standard deviation
0.019
0.017
0.032
0.007
0
0
0.1
TDI, mg kg1 day1
0.2
Wrinkles
•
•
•
•
•
Number of replications
Model uncertainty
Backcalculation
Multiple instantiations
Dependence among variables
Dependence
Dependence
• Most variables assumed independent
• But some variables clearly aren’t
– Density and porosity
– Rainfall and temperature
– Body size and skin surface area
Dependence can be complex
Y
X
Dependencies
•
•
•
•
•
•
•
•
Independence (knowing X tells nothing about Y)
Perfect dependence
F(X) = G(Y)
Opposite dependence
F(X) = 1G(Y)
Complete dependence
Y = z(X)
Linearly correlated
Y = mX + b + 
Ranks linearly correlated G(Y) = mF(X) + b + 
Functional modeling
Y = z(X) + 
Complex dependence
(anything else!)
Uncorrelatedness is not independence
Perfect dependence
uniform deviate
1
0.8
0.6
0.4
0.2
0
50
100
150
0
800
0
1600
Cumulative probability
• May be better than assuming independence
• Very easy to simulate
Specified correlations
Zi 
• Scheuer and Stoller (1962)
• Pearson, not rank, correlation
• Only for normal marginals
i
cij Y j ,

j 1
i  1,...,k
 σ11  σ1k 
        CC T
σ k 1  σ kk 
Correlated deviates are weighted sums of independent
standard normal deviates Yj ~ normal(0,1) where
weights cij are the elements of a lower triangular matrix
C solving  = CCT (by Cholesky decomposition) where
 is the intended variance-covariance matrix (in which
diagonals are variances off-diagonals are covariances)
Bivariate case
If Y1, Y2 ~ normal(0,1) are independent, then
Z1  Y1σ1  μ1


Z 2  rY1  1  r Y2 σ 2  μ 2
2
will produce deviates such that
Z1 ~ normal(1, 1)
Z2 ~ normal(2, 2)
corr(Z1, Z2) = r
Modeling dependencies
• Need to check that correlations are not
infeasible (e.g., corr(A,B) = 0.9, corr(A,C)=0.9,
corr(B,C)=0.9)
• Monte Carlo simulations can model each of the
special cases, but not complex dependence
• Many dependence patterns yield the same
correlation coefficient (most MC software
arbitrarily selects one of these patterns)
• It can be difficult to specify the dependence if
empirical information is sparse
But what if we don’t know r ?
Cumulative probability
1
0.8
0.6
X ~ normal(5,1)
Y ~ normal(10,2)
various correlations
0.4
0.2
0
0
20
40
60
80
X Y
100
120 140
A ~ normal(2,1), B ~ lognormal(5,2)
Cumulative probability
A+B (1000 Trials with independence)
A+B (1000 Trials with 0.7 correlation)
1.000
.750
.500
.250
.000
-5.00
0.00
5.00
10.00
15.00
Differences in tail risks could be much greater if the dependence
between A and B is nonlinear, even if their correlation is very small.
Each variable gets a single value
• A variable can't be independent of itself,
e.g., a multiple-route exposure model
cairiair cwateriwater csoil isoil


BW
BW
BW
• All BW’s should be the same value within a
single Monte Carlo replicate
• Be wary of “libraries” of partial results
Microevent modeling
Monte Carlo simulation
Simulate many anglers
Simulate an angler
Sample BW, ED
Simulate ED years
Simulate a year
Sample EF
Simulate EF meals
Simulate a meal
Sample Cfish, IR, LOSS
slug = Cfish  (1LOSS)  IR  CF
acute = slug / BW
total = total + slug
chronic = total / (BW  AT)
Example: PCBs and duck hunters
Location: Berkshire County, Massachusetts
Receptor: Adult human hunters of waterfowl
Contaminant: PCBs (polychorinated biphenyls)
Exposure route: dietary consumption of
contaminated waterfowl
Loosely based on the assessment for non-cancer risks from PCB to adult hunters who consume
contaminated waterfowl described in Human Health Risk Assessment: GE/Housatonic River Site: Rest
of River, Volume IV, DCN: GE-031203-ABMP, April 2003, Weston Solutions (West Chester,
Pennsylvania), Avatar Environmental (Exton, Pennsylvania), and Applied Biomathematics (Setauket,
New York).
Hazard quotient
EF |  | IR |  | C |  | 1 |  | LOSS 
HQ 
AT |  | BW |  | RfD
EF = EDF(n = 23) meals per year
// exposure frequency, censored data
IR = lognormal(188, 113) grams per meal
// poultry ingestion from EPA’s EFH
C = 9.73 mg per kg
// exposure point (mean) concentration
LOSS = 0
// loss due to cooking
AT = 365.25 days per year
// averaging time (not just units conversion)
BW = mixture(BWfemale, BWmale)
// Brainard and Burmaster (1992)
BWmale = lognormal(171, 30) pounds
// adult male n = 9,983
BWfemale = lognormal(145, 30) pounds
// adult female n = 10,339
RfD = 0.00002 mg per kg per day
// reference dose (EPA considers tolerable)
Inputs
1
1
EF
IR
0
0
0
10
20
30
40
meals per year
50
60
200
400
grams per meal
600
1
Exceedance probability
1
0
males
females
C
BW
0
Always a point
value, unless
microevents are
modeled
0
0
100
200
pounds
300
0
10
mg per kg
20
Results
1
HQ
mean
standard deviation
median
95th percentile
range
18.2
35.2
6.8
69
[0.01, 1385]
0
0
1000
2000
Backcalculation
Backcalculation
• Cleanup and remediation planning
requires backcalculation
• How can we untangle the expression
A+B = C
when we know A and C, and need B?
Can’t just invert the equation
conc  intake
dose =
body mass
dose  body mass
conc =
intake
0.06
0.4
0.04
0.2
0.02
0.00
0
2
4
6
Planned dose
8
0.0
0
0.06
0.03
0.04
0.02
0.02
0.01
0.00
0
2
4
Intake
6
0.00
0
8
2
4
6
Body mass
8
2
4
6
Concentration
8
0.06
Large doses in the realized distribution (arising
from the allowed distribution of concentrations)
are more common than were originally planned
0.05
0.04
planned
0.03
0.02
realized
0.01
0
0
2
4
Dose
6
8
Normal approximation
• If A+B=C, compute B as CA under the
assumption that the correlation
between A and C is r = sA/sC
• Needs Pearson (not rank) correlation
• Usually okay if multivariate normal
Normal approximation with non-normal distributions
1 .0 0
.7 5
.5 0
.2 5
.0
1 .0 0
3 .0 0
5 .0 0
7 .0 0
9 .0 0
A=lognormal(5,1.5)
0 .0 0
4 .3 8
8 .7 5
13 .1 3
17 .5 0
C=triangular(0,12,15.5)
1 .0 0
realiz ed
.7 5
.5 0
planned
.2 5
.0
-5 .0 0
-0 .6 3
3 .7 5
8 .1 3
CA, with r =sA/sC
12 .5 0
0
5
10
C
15
20
Iterative (trial & error) approach
•
•
•
•
•
•
Initialize B with CA
This distribution is too wide
Transform density p(x) to p(x)m
Rescale so that area remains one
Whenever m > 1 dispersion decreases
Repeat until you get an acceptable fit
0.06
By trial and error, you may be able to find a distribution
of concentrations that yields something close to the
planned distribution of doses. This is not always
possible however.
0.05
0.04
conc int ake
 dose
body mass
0.03
 int ake 
  ln(conc)  ln(dose)
ln
body
mass


 

 


A
B
C
0.02
0.01
0
0
2
4
Dose
6
8
Model uncertainty
Model uncertainty
• Doubt about the structural form of the model
model
uncertainty
Parameters
Distribution shape
Intervariable dependence
Arithmetic equation
Level of abstraction
• Usually incertitude rather than variability
• Usually considerable in ecosystems models
• Often the elephant in the middle of the room
Monte Carlo strategy
• Introduce a new discrete variable
• Let the value of the variable dictate which
model will be used in each rep
• Wiggle the value from rep to rep
• Only works for short, explicit list of models
(you have to list the models)
• Many theorists object to this strategy
Model uncertainty as a mixture
PDF
If u>0.5 then model=I else model=II
II
I
1
CDF
I or II
0
How many replicates
How many replications
•
•
•
•
•
•
•
More is always better
Tails are especially hard to nail down
Curse of dimensionality
Latin hypercube sampling can help
Repeat the simulation as a check  best
Confidence intervals on fractiles
Kolmogorov-Smirnov limits on distributions
100 versus 10,000 replicates
1.000
0.02
100 Trials:
close-up of
left hand tail
.750
.500
0.01
.250
0.00
.000
0.02
1.000
10,000 Trials:
close-up of left
hand tail
.750
0.01
.500
.250
0.00
.000
0.00
125.00
250.00
375.00
Time to contamination (yr)
500.00
Confidence interval for a fractile
The a100% confidence interval for the pth
fractile can be estimated by [Yi, Yj ], where
Yk is the (n  k + 1)th largest value from the
Monte Carlo simulation, i = floor(np  b), j =
ceiling(np + b), b = z1((1  a)/2)(np(1  p))
• Vary n to get the precision you desire
• But remember this represents only sampling
error, not measurement error
Kolmogorov-Smirnov bounds
Bounds on the distribution as a whole
100 replications
replications
200
1000 replications
2000
replications
1
1
0. 8
0. 8
0. 6
0. 6
0. 4
0. 4
0. 2
0. 2
0
0
1
2
3
4
5
6
7
8
1
2
3
4
5
6
7
95% of the time, the entire distribution
will lie within the bounds
8
9
Latin hypercube sampling
• Stratify the sampling of distributions
– Divide each distribution into n equal-probability
regions, where n = #reps
– For each replicate, select one deviate from
each distribution
– Select each region only once
• Often much better for moments, sometimes
better for tails
• May reduce the number of replicated
needed
Conclusions
Contrast with interval analysis
There are fundamentally two approaches to
calculation under uncertainty:
Bounding
• Say where you know the answer isn’t
• Tighten range
Approximation
• Say where you think the answer is
• Minimize bias
Monte Carlo simulation
• How?
– replace point estimates with distributions
– repeatedly sample from each distribution and compute
– tally answers in histogram
• Why?
– simple to implement
– fairly simple to explain
– summarizes entire distribution of risk
• Why not?
– requires a lot of empirical information (or guesses)
– routine assumptions may be “non-protective”
Why do probabilistic analysis?
•
•
•
•
•
The only way to get at likelihoods
Enhances understanding of risks
Improves decision making
EPA guidance available
NASA guidance in preparation
Steps in a Monte Carlo PRA
• Clarify the questions and gather data
• Formulate the model (identify variables
and their interrelationships)
• Specify distributions for each input
• Specify dependencies and correlations
• Run simulation
• Conduct sensitivity studies
• Present results
• Discuss limitations of the assessment
Pitfalls
•
•
•
•
•
•
•
Muddled model
Infeasible correlation matrices
Inconsistent independence assumptions
Multiply instantiated variables
Too few replications
No sensitivity studies
Confounding of different populations
Limitations of Monte Carlo
•
•
•
•
•
•
•
•
Thought by some to be too difficult
Needs precise, stationary distributions
Nonlinear dependencies often ignored
Unknown dependencies often ignored
Assumes model is well understood
Can be computationally expensive
Backcalculations difficult
Treats incertitude like variability
Kinds of uncertainty
• Variability
–
–
–
–
Stochasticity
Heterogeneity
Spatial variation
Individuality and genetics
• Incertitude
– Measurement imprecision
– Ignorance
– Model uncertainty
• Vagueness, confusion, etc.
Why distinguish them?
• Empirical effort can reduce incertitude
• Usually can't reduce variability
• Managers need to distinguish the two
kinds of uncertainty to plan remediation
efforts effectively
Two-dimensional simulation
•
•
•
•
•
Monte Carlo nested inside Monte Carlo
Inner loop for variability
Outer loop for uncertainty (incertitude)
Squared number of replications
Integrated sensitivity analysis
Meta-distribution is the output
Second-order tail risks
1
Exceedance risk
(Complementary)
cumulative version of
the meta-distribution
0
0
Probability boxes
•
•
•
•
Bounds about the cumulative distribution
Rigorous (rather than approximate)
Analytical (not from simulations)
Can account for lack of knowledge about
distribution shapes and dependencies
1
1
1
0
0
0
0
20
0
40
0
80
References
Abbas, A.E. 2003. Entropy methods for univariate distributions in decision analysis. Bayesian Inference
and Maximum Entropy Methods in Science and Engineering: 22nd International Workshop, editied
by C.J. Williams, American Institute of Physics.
Cullen, A.C., and H.C. Frey. (1999). Probabilistic Techniques in Exposure Assessment: A Handbook for
Dealing with Variability and Uncertainty in Models and Inputs. Plenum Press: New York.
Evans, M., N. Hastings, B. Peacock. 2000. Statistical Distributions. John Wiley and Sons, New York.
Ferson, S., R.B. Nelsen, J. Hajagos, D.J. Berleant, J. Zhang, W.T. Tucker, L.R. Ginzburg and W.L.
Oberkampf. 2004. Dependence in probabilistic modeling, Dempster-Shafer theory, and probability
bounds analysis. Sandia National Laboratories, Albuquerque, NM.
Hansen, F. (1997). Policy for Use of Probabilistic Analysis in Risk Assessment at the U.S.
Environmental Protection Agency. EPA/ORD/NCEA May 15:1-4. Available on-line at
http://www.epa.gov/osp/spc/probpol.htm.
Jaynes, E.T. (edited by G. Larry Bretthorst). 2003. Probability Theory: The Logic of Science. Cambridge
University Press.
Lee, R.C. and W.E. Wright. 1994. Development of human exposure-factor distributions using
maximum-entropy inference. Journal of Exposure Analysis and Environmental Epidemiology 4:329341.
Morgan, M.G. and M. Henrion. (1990). Uncertainty: A Guide to Dealing with Uncertainty in Quantitative
Risk and Policy Analysis. Cambridge University Press, Cambridge, UK.
Scheuer, E.M. and D.S. Stoller. (1962). On the generation of normal random vectors. Technometrics
4:278-281.
Solana, V. and N.C. Lind. 1990. Two principles for data based on probabilistic system analysis.
Proceedings of ICOSSAR '89, 5th International Conferences on Structural Safety and Reliability.
American Society of Civil Engineers, New York.
Exercises for next time
1. How could you sample deviates from a
normal and a lognormal distribution that
have opposite dependence?
2. Why is the concentration generally modeled
as a point value in PRA’s? Under what
conditions should a distribution be used?
3. What are the implications of using a uniform
distribution to model an interval range?
4. How could model uncertainty be
characterized if we cannot list the possible
models?