Transcript Document

Probability bounds analysis
Scott Ferson, [email protected]
2 October 2007, Stony Brook University, MAR 550, Challenger 165
Outline
• P-boxes: marrying intervals and probability
• Calculations via the Cartesian product
• Getting p-boxes
– Moment and shape constraints
– From empirical data
• Dependence
• Model uncertainty
• Updating
• Backcalculation
• Sensitivity analysis
• Differential equations
• Conclusions
• Examples
– Dike reliability
– PCBs and duck hunters
– Dioxin inhalation
– Hydrocarbon plume
Typical risk analysis problems
• Sometimes little or even no data
– Updating is rarely used
• Very simple arithmetic (or logical) calculations
– Occasionally, finite element meshes or differential equations
• Usually a small number of inputs
– Nuclear power plants are the exception
• Results are important and often high profile
– But the approach is being used ever more widely
Probability vs. intervals
• Probability theory
– Can handle likelihoods and dependence well
– Has an inadequate model of ignorance
– Lying: saying more than you really know
• Interval analysis
– Can handle epistemic uncertainty (ignorance) well
– Inadequately models frequency and dependence
– Cowardice: saying less than you know
What’s needed
• Reliable, conservative assessments of tail risks
• Using available information but without forcing
analysts to make unjustified assumptions
• Neither computationally expensive nor
intellectually taxing
Deterministic
calculation
Probabilistic
convolution
Second-order
probability
Interval
analysis
Probability
bounds analysis
Probability box (p-box)
Interval bounds on a cumulative distribution function (CDF)
[Figure: a p-box plotted as cumulative probability (0 to 1) against X (0.0 to 3.0)]
Generalizes an “uncertain number”
[Figure: three uncertain numbers plotted as cumulative probability curves: a probability distribution, a probability box, and an interval (note: an interval is not a uniform distribution)]
Probability bounds analysis
• Marries intervals with probability theory
• Distinguishes variability and incertitude
• Solves many problems in uncertainty analysis
– Input distributions unknown
– Imperfectly known correlation and dependency
– Large measurement error, censoring, small sample sizes
– Model uncertainty
Calculations
• All standard mathematical operations
– Arithmetic operations (+, −, ×, ÷, ^, min, max)
– Logical operations (and, or, not, if, etc.)
– Transformations (exp, ln, sin, tan, abs, sqrt, etc.)
– Backcalculation (deconvolutions, updating)
– Magnitude comparisons (<, ≤, >, ≥, =)
– Other operations (envelope, mixture, etc.)
• Faster than Monte Carlo
• Guaranteed to bound the answer
• Good solutions often easy to compute
Faithful generalization
• When inputs are distributions, its answers
conform with probability theory
• When inputs are intervals, it agrees with
interval analysis
Probability bounds arithmetic
[Figure: p-boxes for A (plotted over 0 to 6) and B (plotted over 0 to 14), cumulative probability on the vertical axis]
What's the sum of A+B?
Cartesian product (independence)

A+B          A = [1,3]       A = [2,4]       A = [3,5]
             p1 = 1/3        p2 = 1/3        p3 = 1/3
B = [2,8]    A+B = [3,11]    A+B = [4,12]    A+B = [5,13]
q1 = 1/3     prob = 1/9      prob = 1/9      prob = 1/9
B = [6,10]   A+B = [7,13]    A+B = [8,14]    A+B = [9,15]
q2 = 1/3     prob = 1/9      prob = 1/9      prob = 1/9
B = [8,12]   A+B = [9,15]    A+B = [10,16]   A+B = [11,17]
q3 = 1/3     prob = 1/9      prob = 1/9      prob = 1/9
A+B under independence
[Figure: the reassembled p-box for A+B, cumulative probability 0.00 to 1.00 against A+B from 0 to 18]
✓ Rigorous
✓ Best possible
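A minimal R sketch of this Cartesian-product convolution for the discrete p-boxes A and B above (the variable names are illustrative, not from any package):

a.lo <- c(1, 2, 3); a.hi <- c(3, 4, 5); a.p <- rep(1/3, 3)
b.lo <- c(2, 6, 8); b.hi <- c(8, 10, 12); b.p <- rep(1/3, 3)
# every cell of the Cartesian product gets the interval sum and the product of the marginal probabilities
sum.lo <- as.vector(outer(a.lo, b.lo, '+'))
sum.hi <- as.vector(outer(a.hi, b.hi, '+'))
cell.p <- as.vector(outer(a.p, b.p, '*'))
# reassemble CDF bounds: the left (upper) bound accumulates probability at the cell minima,
# the right (lower) bound at the cell maxima
cdf.left  <- approxfun(sort(sum.lo), cumsum(cell.p[order(sum.lo)]),
                       method = 'constant', yleft = 0, yright = 1, ties = max)
cdf.right <- approxfun(sort(sum.hi), cumsum(cell.p[order(sum.hi)]),
                       method = 'constant', yleft = 0, yright = 1, ties = max)
c(cdf.right(9), cdf.left(9))   # bounds on P(A+B <= 9), here [0, 7/9]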
Case study: dike reliability
[Figure: cross-section of the dike, with revetment blocks of thickness D on a slope of angle α, waves, and sea level]
Z = ΔD − (H tan α) / (cos α · M · √s)
(all variables uncertain)
Inputs
• relative density of the revetment blocks
  Δ ∈ [1.60, 1.65]
• revetment block thickness
  D ∈ [0.68, 0.72] meters
• slope of the revetment
  α ∈ arctan([0.32, 0.34]) = [0.309, 0.328] radians
• model parameter
  M ∈ [3.0, 5.2]
• significant wave height
  H ~ Weibull(scale = [1.2, 1.5] meters, shape = [10, 12])
• offshore peak wave steepness
  s ~ normal(mean = [0.039, 0.041], stdev = [0.005, 0.006])
[Figure: the six inputs plotted as cumulative probability: intervals for Δ, D (meters), α (radians), and M, and p-boxes for H (meters) and s]
Discretize each p-box
[Figure: the p-box for H (meters) discretized into 100 probability slices]
100 × 100 Cartesian product
Z(Δ, D, M, α, H, s), with H and s independent
Each H slice and each s slice carries probability 0.01, so each cell carries probability 0.0001.

s \ H                [0, 1.02]      [0.76, 1.08]    [0.81, 1.12]    …  [1.34, 1.75]      [1.36, 1.77]
[0.0235, 0.0294]     [0.290, 1.19]  [0.242, 0.903]  [0.213, 0.882]  …  [−0.275, 0.682]   [−0.294, 0.675]
[0.0250, 0.0307]     [0.314, 1.19]  [0.268, 0.909]  [0.239, 0.889]  …  [−0.233, 0.693]   [−0.252, 0.686]
  ⋮                    ⋮               ⋮               ⋮                  ⋮                  ⋮
[0.0493, 0.0550]     [0.536, 1.19]  [0.503, 0.980]  [0.483, 0.965]  …  [0.145, 0.818]    [0.132, 0.813]
[0.0506, 0.0565]     [0.544, 1.19]  [0.511, 0.983]  [0.491, 0.968]  …  [0.158, 0.823]    [0.145, 0.818]
Each cell is an interval calculation
Z₁,₁ = ΔD − (H₁ tan α) / (cos α · M · √s₁)
     = [1.60, 1.65] × [0.68, 0.72] meters − ([0, 1.02] meters × tan([0.309, 0.328] radians)) / (cos([0.309, 0.328] radians) × [3.0, 5.2] × √[0.0235, 0.0294])
     = [0.290, 1.19] meters
Note that the variable α is uncertain and repeated, but it does not lead to an inflation of the uncertainty because tan(α)/cos(α) is monotone increasing over the range of α's uncertainty.
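A small R sketch of this single-cell interval evaluation, using only endpoints because, as noted, tan(α)/cos(α) is monotone increasing over α's range (an illustration, not a general interval-arithmetic routine):

delta <- c(1.60, 1.65); D <- c(0.68, 0.72); M <- c(3.0, 5.2)
alpha <- atan(c(0.32, 0.34))
R <- tan(alpha) / cos(alpha)                  # increasing in alpha, so endpoints suffice
H1 <- c(0, 1.02); s1 <- c(0.0235, 0.0294)     # first slices of H and s
Z.lo <- delta[1] * D[1] - (H1[2] * R[2]) / (M[1] * sqrt(s1[1]))
Z.hi <- delta[2] * D[2] - (H1[1] * R[1]) / (M[2] * sqrt(s1[2]))
c(Z.lo, Z.hi)                                 # about [0.29, 1.19] meters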
Reliability function
Reassembled from 10,000 intervals
[Figure: the p-box for Z plotted as cumulative probability]
PBA says the risk that Z is less than zero is less than 0.044
Monte Carlo said this risk is about 0.000014
Implementation in R
levels <- 100
# Lower endpoints
delta.1 <- 1.60; D.1 <- 0.68; alpha.1 <- atan(0.32); M.1 <- 3.0
R.1 <- tan(alpha.1) / cos(alpha.1)
# Upper endpoints
delta.2 <- 1.65; D.2 <- 0.72; alpha.2 <- atan(0.34); M.2 <- 5.2
R.2 <- tan(alpha.2) / cos(alpha.2)
slice <- function(p, df, a1, a2, b1, b2) {
  w <- do.call(df, list(p, a1, b1)); x <- do.call(df, list(p, a1, b2))
  y <- do.call(df, list(p, a2, b1)); z <- do.call(df, list(p, a2, b2))
  list(one = pmin(w, x, y, z)[1:levels], two = pmax(w, x, y, z)[2:(levels + 1)])
}
p <- c(0.005, 1:(levels - 1) / levels, 0.995)
H <- slice(p, 'qweibull', 10, 12, 1.2, 1.5)
s <- slice(p, 'qnorm', 0.039, 0.041, 0.005, 0.006)
# Cartesian product
Hs.1 <- as.vector(outer(H$one, sqrt(s$two), '/'))
Hs.2 <- as.vector(outer(H$two, sqrt(s$one), '/'))
Zb.1 <- delta.1 * D.1 - (Hs.2 * R.2) / M.1
Zb.2 <- delta.2 * D.2 - (Hs.1 * R.1) / M.2
Zb.1 <- sort(Zb.1); Zb.2 <- sort(Zb.2)
Zb <- c(Zb.1, Zb.2); L <- length(Zb) / 2
plot(Zb, rep(1:L / L, 2), t = 's', xlab = 'Z', ylab = 'CDF', ylim = c(0, 1))
lines(c(Zb.2[1], Zb.2[1]), c(1, 0), col = 'white')
Monte Carlo simulation
# Intervals replaced by uniform distributions or, if they're parameters, by their midpoint values
many <- 1000000
delta <- runif(many, 1.60, 1.65)
D <- runif(many, 0.68, 0.72)
alpha <- atan(runif(many, 0.32, 0.34))
M <- runif(many, 3.0, 5.2)
H <- rweibull(many, 11, 1.35)   # parameter order reversed (shape, then scale)
s <- rnorm(many, 0.04, 0.0055)
Zp <- delta * D - (H * tan(alpha) / cos(alpha)) / (M * sqrt(s))
risk <- Zp[Zp < 0]
length(risk) / many
# [1] 1.6e-05
Zp <- Zp[1:2000]   # else plotting takes too long
plot(sort(Zp), 1:length(Zp) / length(Zp), t = 's', xlab = 'Z', ylab = 'Probability', ylim = c(0, 1))
Selecting p-box inputs
Where do we get p-boxes?
• Assumption
• Modeling
• Robust Bayesian analysis
• Constraint propagation
• Data with incertitude
– Measurement error
– Sampling error
– Censoring
Constraint propagation
[Figure: p-boxes (CDF bounds) implied by various constraint sets: {min, max}; {min, median, max}; {min, mean, max}; {min, mode, max}; {min, max, mean = mode}; {min, max, median = mode}; {mean, sd}; {min, max, mean, sd}; {symmetric, mean, sd}]
(Maximum entropy erases uncertainty)
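As one concrete instance, the {min, max, mean} box can be written down from Markov's inequality; a hedged R sketch (the function name and example values are illustrative):

# CDF bounds for a quantity known only to lie in [a, b] with mean m
pbox.minmaxmean <- function(x, a, b, m) {
  upper <- ifelse(x >= b, 1, pmin(1, (b - m) / (b - x)))       # left bound of the box
  lower <- ifelse(x <= a, 0, pmax(0, 1 - (m - a) / (x - a)))   # right bound of the box
  cbind(lower = lower, upper = upper)
}
pbox.minmaxmean(seq(0, 10, by = 2), a = 0, b = 10, m = 3)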
Data of poor repute
Data sets whose values are intervals
All measurements are actually intervals
Generalize empirical distributions to p-boxes
Censoring is sometimes significant
Model intervals as uniform distributions
"Interval uncertainty doesn't exist in real life" – Tony O'Hagan
Ignore intervals or model them as uniforms
Incertitude is common in data
• Periodic observations
When did the fish in my aquarium die during the night?
• Plus-or-minus measurement uncertainties
Coarse measurements, measurements from digital readouts
• Non-detects and data censoring
Chemical detection limits, studies prematurely terminated
• Privacy requirements
Epidemiological or medical information, census data
• Theoretical constraints
Concentrations, solubilities, probabilities, survival rates
• Bounding studies
Presumed or hypothetical limits in what-if calculations
A tale of two data sets

Skinny          Puffy
[1.00, 1.52]    [3.5, 6.4]
[2.68, 2.98]    [6.9, 8.8]
[7.52, 7.67]    [6.1, 8.4]
[7.73, 8.35]    [2.8, 6.7]
[9.44, 9.99]    [3.5, 9.7]
[3.66, 4.58]    [6.5, 9.9]
                [0.15, 3.8]
                [4.5, 4.9]
                [7.1, 7.9]

[Figure: the Skinny and Puffy intervals plotted along the x-axis from 0 to 10]
Empirical p-boxes
[Figure: empirical p-boxes for Skinny and Puffy, cumulative probability against x from 0 to 10]
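A hedged R sketch of how such an empirical p-box is assembled from interval data (the Skinny values above):

skinny.lo <- c(1.00, 2.68, 7.52, 7.73, 9.44, 3.66)
skinny.hi <- c(1.52, 2.98, 7.67, 8.35, 9.99, 4.58)
F.left  <- ecdf(skinny.lo)   # empirical CDF of left endpoints = upper bound on the CDF
F.right <- ecdf(skinny.hi)   # empirical CDF of right endpoints = lower bound on the CDF
c(F.right(4), F.left(4))     # bounds on P(X <= 4), here [1/3, 1/2]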
Many possible precise data sets
[Figure: several precise data sets consistent with the Puffy intervals, all lying inside the Puffy p-box]
Statistics for data that are intervals
• Some statistics are easy to compute
– Empirical distribution, mean, median, percentiles, etc.
• Some are tricky, but easy for a computer
– Variance, upper confidence limit, correlation, etc.
• Tradeoff between more versus better data
• Review just published as a Sandia report
P-box parameter fitting
• Method of matching moments
• Regression approaches
• Maximum likelihood
Method of matching moments
• Mean, variance, etc. are now intervals
• Very easy for one-parameter distributions
• In general, moments are dependent
(The largest mean cannot be associated with the largest variance)
• Envelope distributions that extremize the moments
• Resulting p-boxes represent the assumptions about
the shape family as informed by the available data
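For the mean, the interval arithmetic is direct: the smallest possible mean averages the left endpoints and the largest averages the right endpoints (the interval variance needs optimization, as noted above). A minimal R sketch fitting an exponential shape to the interval mean of the Skinny data; enveloping the two extreme fits gives the p-box, since the exponential CDF is monotone in its single parameter:

skinny.lo <- c(1.00, 2.68, 7.52, 7.73, 9.44, 3.66)
skinny.hi <- c(1.52, 2.98, 7.67, 8.35, 9.99, 4.58)
m <- c(mean(skinny.lo), mean(skinny.hi))   # interval-valued sample mean, about [5.34, 5.85]
x <- seq(0, 40, by = 0.1)
F.upper <- pexp(x, rate = 1 / m[1])        # smaller mean: CDF lies higher (left bound)
F.lower <- pexp(x, rate = 1 / m[2])        # larger mean: CDF lies lower (right bound)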
[Figure: p-boxes fitted by matching interval moments, four panels: Exponential/Skinny, Exponential/Puffy, Normal/Skinny, Normal/Puffy]
Sampling uncertainty
• When you’ve measured only part of a population
• Kolmogorov-Smirnov confidence intervals
– Distribution-free
– Assumes only random sampling
• Parametric versions
– normal, lognormal, exponential, Weibull, etc.
• Bound the whole distribution
“95% of the time, the true distribution falls entirely inside the limits”
Distributional confidence limits
[Figure: confidence limits on a distribution of volumetric heat capacity, Cp (J/m³ °C), from about 300,000 to 500,000: Kolmogorov-Smirnov (distribution-free) limits and limits assuming a normal distribution]
Confidence limits on p-boxes
[Figure: 95% KS confidence limits around the empirical p-box for Skinny and Puffy pooled (n = 15), x from 0 to 10]
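A hedged R sketch of a Kolmogorov-Smirnov confidence band around an empirical distribution, using the large-sample critical value (illustrative data; exact small-sample critical values come from tables):

ks.band <- function(x, conf = 0.95) {
  n <- length(x)
  d <- sqrt(-log((1 - conf) / 2) / (2 * n))    # asymptotic two-sided KS critical value
  Fn <- ecdf(x)
  list(lower = function(t) pmax(Fn(t) - d, 0),
       upper = function(t) pmin(Fn(t) + d, 1))
}
set.seed(1)
b <- ks.band(rnorm(25, 400000, 40000))         # made-up heat-capacity-like sample
c(b$lower(400000), b$upper(400000))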
Uncertainties expressible with p-boxes
• Sampling uncertainty
– Distributional confidence limits
• Measurement incertitude
– Intervals
• Uncertainty about distribution shape
– Constraints (non-parametric p-boxes)
• Surrogacy uncertainty
– Modeling
Example: PCBs and duck hunters
Location: Massachusetts and Connecticut
Receptor: Adult human hunters of waterfowl
Contaminant: PCBs (polychlorinated biphenyls)
Exposure route: dietary consumption of
contaminated waterfowl
Based on the assessment for non-cancer risks from PCB to adult hunters who consume
contaminated waterfowl described in Human Health Risk Assessment: GE/Housatonic River
Site: Rest of River, Volume IV, DCN: GE-031203-ABMP, April 2003, Weston Solutions (West
Chester, Pennsylvania), Avatar Environmental (Exton, Pennsylvania), and Applied
Biomathematics (Setauket, New York).
Hazard quotient
HQ = (EF × IR × C × (1 − LOSS)) / (AT × BW × RfD)
mmms(min, max, mean, std)
EF = mmms(1, 52, 5.4, 10) meals per year        // exposure frequency, censored data, n = 23
IR = mmms(1.5, 675, 188, 113) grams per meal    // poultry ingestion rate from EPA's EFH
C = [7.1, 9.73] mg per kg                       // exposure point (mean) concentration
LOSS = 0                                        // loss due to cooking
AT = 365.25 days per year                       // averaging time (not just a units conversion)
BW = mixture(BWfemale, BWmale)                  // Brainard and Burmaster (1992)
BWmale = lognormal(171, 30) pounds              // adult male, n = 9,983
BWfemale = lognormal(145, 30) pounds            // adult female, n = 10,339
RfD = 0.00002 mg per kg per day                 // reference dose considered tolerable
Exceedance risk = 1 - CDF
Inputs
[Figure: input p-boxes plotted as exceedance risk: EF (meals per year), IR (grams per meal), C (mg per kg), and BW (pounds, males and females)]
Results
[Figure: exceedance risk for HQ, from 0 to about 1000]
mean                 [3.8, 31]
standard deviation   [0, 186]
median               [0.6, 55]
95th percentile      [3.5, 384]
range                [0.01, 1230]
Dependence
Dependence
• Not all variables are independent
– Body size and skin surface area
– “Common-cause” variables
• Known dependencies should be modeled
• What can we do when we don’t know them?
How to do other dependencies?
• Independent
• Perfect (comonotonic)
• Opposite (countermonotonic)
Perfect dependence

A+B          A = [1,3]       A = [2,4]       A = [3,5]
             p1 = 1/3        p2 = 1/3        p3 = 1/3
B = [2,8]    A+B = [3,11]    A+B = [4,12]    A+B = [5,13]
q1 = 1/3     prob = 1/3      prob = 0        prob = 0
B = [6,10]   A+B = [7,13]    A+B = [8,14]    A+B = [9,15]
q2 = 1/3     prob = 0        prob = 1/3      prob = 0
B = [8,12]   A+B = [9,15]    A+B = [10,16]   A+B = [11,17]
q3 = 1/3     prob = 0        prob = 0        prob = 1/3

perfect positive dependence
Opposite dependence

A+B          A = [1,3]       A = [2,4]       A = [3,5]
             p1 = 1/3        p2 = 1/3        p3 = 1/3
B = [2,8]    A+B = [3,11]    A+B = [4,12]    A+B = [5,13]
q1 = 1/3     prob = 0        prob = 0        prob = 1/3
B = [6,10]   A+B = [7,13]    A+B = [8,14]    A+B = [9,15]
q2 = 1/3     prob = 0        prob = 1/3      prob = 0
B = [8,12]   A+B = [9,15]    A+B = [10,16]   A+B = [11,17]
q3 = 1/3     prob = 1/3      prob = 0        prob = 0

opposite (negative) dependence
Perfect and opposite dependencies
[Figure: p-boxes for A+B under perfect and under opposite dependence, cumulative probability against A+B from 0 to 18]
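For precise distributions, perfect and opposite dependence have a simple quantile recipe: pair same-rank quantiles for perfect dependence and opposite-rank quantiles for opposite dependence (for p-boxes the same pairing is applied to the interval slices, as in the tables above). A minimal R sketch with illustrative marginals:

p <- (1:999) / 1000
qx <- qnorm(p, 5, 1)            # quantiles of X
qy <- qunif(p, 2, 5)            # quantiles of Y
sum.perfect  <- qx + qy         # comonotonic pairing: equally weighted sample of X+Y
sum.opposite <- qx + rev(qy)    # countermonotonic pairing
c(quantile(sum.perfect, 0.95), quantile(sum.opposite, 0.95))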
Uncertainty about dependence
• Sensitivity analyses usually used
– Vary correlation coefficient between −1 and +1
• But this underestimates the true uncertainty
– Example: suppose X, Y ~ uniform(0,24) but we don't know the dependence between X and Y
Varying the correlation coefficient
[Figure: distributions of X+Y for correlations between −1 and +1, X, Y ~ uniform(0,24), plotted against X+Y from 0 to 50]
Counterexample
[Figure: a scatter plot of one particular dependence between X and Y on [0,30] × [0,30], and the resulting distribution of X+Y from 0 to 50]
What about other dependencies?
• Independent
• Perfectly positive
• Opposite
• Positively or negatively associated
• Specified correlation coefficient
• Nonlinear dependence (copula)
• Unknown dependence
Fréchet inequalities
They make no assumption about dependence (Fréchet 1935)
max(0, P(A) + P(B) − 1) ≤ P(A & B) ≤ min(P(A), P(B))
max(P(A), P(B)) ≤ P(A ∨ B) ≤ min(1, P(A) + P(B))
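A tiny R sketch of these event-level bounds (scalar inputs for simplicity; the arguments could themselves be interval probabilities):

frechet.and <- function(pa, pb) c(max(0, pa + pb - 1), min(pa, pb))
frechet.or  <- function(pa, pb) c(max(pa, pb), min(1, pa + pb))
frechet.and(0.3, 0.8)   # P(A & B) is in [0.1, 0.3]
frechet.or(0.3, 0.8)    # P(A or B) is in [0.8, 1.0]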
Fréchet case (no assumption)

A+B          A = [1,3]          A = [2,4]          A = [3,5]
             p1 = 1/3           p2 = 1/3           p3 = 1/3
B = [2,8]    A+B = [3,11]       A+B = [4,12]       A+B = [5,13]
q1 = 1/3     prob = [0, 1/3]    prob = [0, 1/3]    prob = [0, 1/3]
B = [6,10]   A+B = [7,13]       A+B = [8,14]       A+B = [9,15]
q2 = 1/3     prob = [0, 1/3]    prob = [0, 1/3]    prob = [0, 1/3]
B = [8,12]   A+B = [9,15]       A+B = [10,16]      A+B = [11,17]
q3 = 1/3     prob = [0, 1/3]    prob = [0, 1/3]    prob = [0, 1/3]
Naïve Fréchet case
[Figure: the resulting p-box for A+B, cumulative probability against A+B from 0 to 18]
This p-box is not best possible
Fréchet can be improved
• Interval estimates of probabilities don’t reflect
the fact that the sum must equal one
– Resulting p-box is too fat
• Linear programming needed to get the
optimal answer using this approach
• Frank, Nelsen and Schweizer gave a way to
compute the optimal answer directly
Frank, Nelsen and Schweizer (1987)
If X ~ F and Y ~ G, then the distribution of X + Y is
σ_C(F, G)(z) = ∫∫_{x+y<z} dC(F(x), G(y))
where C is the copula between X and Y. In any case, and irrespective of this dependence, this distribution is bounded by
sup_{x+y=z} max(F(x) + G(y) − 1, 0)  ≤  σ_C(F, G)(z)  ≤  inf_{x+y=z} min(F(x) + G(y), 1)
This formula can be generalized to work with bounds on F and G.
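A hedged R sketch of these bounds on the CDF of X+Y, evaluated by brute force on a grid (a coarse numerical approximation of the sup and inf, not a validated implementation; the uniform marginals echo the figures that follow):

F <- function(x) punif(x, 0, 24)
G <- function(y) punif(y, 0, 24)
sum.bounds <- function(z, F, G, grid) {
  fx <- F(grid); gz <- G(z - grid)
  c(lower = max(pmax(fx + gz - 1, 0)),   # sup over x+y=z of max(F+G-1, 0)
    upper = min(pmin(fx + gz, 1)))       # inf over x+y=z of min(F+G, 1)
}
grid <- seq(-10, 60, length.out = 2001)
t(sapply(seq(0, 48, by = 8), sum.bounds, F = F, G = G, grid = grid))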
Best possible bounds
[Figure: the best-possible p-box for A+B under unknown dependence, cumulative probability against A+B from 0 to 18]
Unknown dependence
[Figure: Fréchet bounds on the distribution of X+Y for X, Y ~ uniform(1,24), plotted against X+Y from 0 to 50]
Between independence and Fréchet
• May know something about dependence that tightens the bounds beyond Fréchet
• Dependence is positive (PQD)
P(X ≤ x, Y ≤ y) ≥ P(X ≤ x) P(Y ≤ y) for all x and y
sup_{x+y=z} F(x)G(y)  ≤  P(X + Y ≤ z)  ≤  inf_{x+y=z} [1 − (1 − F(x))(1 − G(y))]
• Variables are uncorrelated
Pearson correlation r is zero
• Dependence is a particular kind
Unknown but positive dependence
[Figure: bounds on the distribution of X+Y under positive (PQD) dependence, X, Y ~ uniform(1,24), X+Y from 0 to 50]
Uncorrelated variables
[Figure: bounds on the distribution of X+Y for uncorrelated X, Y ~ uniform(1,24), X+Y from 0 to 50]
Varying correlation between −1 and +1
[Figure: distributions of X+Y for Pearson correlations between −1 and +1 (normal copula), X, Y ~ uniform(1,24), X+Y from 0 to 50]
Can model dependence exactly too
[Figure: distributions of X+Y for X ~ normal(5,1) and Y ~ uniform(2,5) under various correlations and dependence functions (copulas): Frank (medial), Mardia (Kendall), and Clayton families, X+Y from 4 to 12]
Example: dioxin inhalation
Location: Superfund site in California
Receptor: adults in neighboring community
Contaminant: dioxin
Exposure route: inhalation of windborne soil
Modified from Tables II and IV in Copeland, T.L., A.M. Holbrow, J.M. Otani, K.T. Conner and D.J. Paustenbach 1994. Use of probabilistic methods to understand the conservatism in California's approach to assessing health risks posed by air contaminants. Journal of the Air and Waste Management Association 44: 1399-1413.
Total daily intake from inhalation
R = normal(20, 2)             // respiration rate, m3/day
CGL = 2                       // concentration at ground level, mg/m3
Finh = uniform(0.46, 1)       // fraction of particulates retained in lung [unitless]
ED = exponential(11)          // exposure duration, years
EF = uniform(0.58, 1)         // exposure frequency, fraction of a year
BW = normal(64.2, 13.19)      // receptor body weight, kg
AT = gumbel(70, 8)            // averaging time, years
Input distributions
[Figure: the six input distributions R, Finh, EF, ED, BW, and AT plotted as cumulative probabilities]
Results
[Figure: exceedance risk for total daily intake (TDI, mg kg⁻¹ day⁻¹, 0 to 2): one curve with all variables mutually independent, and a p-box with no assumptions about dependencies]
Uncertainty about dependence
• Impossible with sensitivity analysis since it’s
an infinite-dimensional problem
• Kolmogorov-Fréchet bounding lets you be sure
• Can be a large or a small consequence
Model uncertainty
Model uncertainty
• Doubt about the structural form of the model
• Usually incertitude rather than variability
• Often the elephant in the middle of the room
Uncertainty in probabilistic analyses
• Parameters
• Data surrogacy
• Distribution shape
• Intervariable dependence
• Arithmetic expression   }
• Level of abstraction    } model uncertainty
General strategies
• Sensitivity (what-if) studies
• Probabilistic mixture
• Bayesian model averaging
• Enveloping and bounding analyses
Sensitivity (what-if) studies
• Simply re-computes the analysis with
alternative assumptions
– Intergovernmental Panel on Climate Change
• No theory required to use or understand
Example
The function f is one of two possibilities. Either
f(A,B) = fPlus(A,B) = A + B
or
f(A,B) = fTimes(A,B) = A × B
is the correct model, but the analyst does not know which. Suppose that A ~ triangular(−2.6, 0, 2.6) and B ~ triangular(2.4, 5, 7.6).
[Figure: cumulative distributions of f(A,B) under the Plus and Times models, X from −15 to 15]
Drawbacks of what-if
• Consider a long-term model of the economy under global warming stress:
3 baseline weather trends × 3 emission scenarios × 3 population models × 3 mitigation plans = 81 analyses to compute, and to document
• Combinatorially complex as more model components are considered
• Cumbersome to summarize results
Probabilistic mixture
• Identify all possible models
• Translate model uncertainty into choices
about distributions
• Vertically average probability densities
(or cumulative probabilities)
• Can use weights to account for credibility
Example
[Figure: the probabilistic mixture of the Plus and Times distributions with weights P(f+) = 2 P(f×), compared with the what-if curves, X from −15 to 15]
Probabilistic mixture
• State of the art in probabilistic risk analysis
– Nuclear power plant risk assessments
• Need to know what all the possibilities are
• If you don't know the weights, assume equality
Drawbacks of mixture
• If you cannot enumerate the possible models,
you can’t use this approach
• Averages together incompatible theories and
yields an answer that neither theory supports
• Can underestimate tail risks
Bayesian model averaging
• Similar to the probabilistic mixture
• Updates prior probabilities to get weights
• Takes account of available data
Bayesian model averaging
• Assume one of the two models is correct
• Compute probability distribution for f(A,B)
• Read off probability density of observed data
– That's the likelihood of the model
• Repeat above steps for each model
• Compute posterior ∝ prior × likelihood
– This gives the Bayes factors
• Use the posteriors as weights for the mixture
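A hedged R sketch of these steps for the two-model example above, with equal priors and likelihoods read off kernel density estimates of simulated f(A,B) values (the triangular sampler is written out because base R has none):

set.seed(1)
rtri <- function(n, a, m, b) {               # triangular(min, mode, max) sampler
  u <- runif(n)
  ifelse(u < (m - a) / (b - a),
         a + sqrt(u * (b - a) * (m - a)),
         b - sqrt((1 - u) * (b - a) * (b - m)))
}
A <- rtri(1e5, -2.6, 0, 2.6); B <- rtri(1e5, 2.4, 5, 7.6)
lik.plus  <- approx(density(A + B), xout = 7.59)$y   # density of the datum under fPlus
lik.times <- approx(density(A * B), xout = 7.59)$y   # density of the datum under fTimes
c(plus = lik.plus, times = lik.times) / (lik.plus + lik.times)   # posterior weights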
[Figure: given the datum f(A,B) = 7.59, the Bayes-weighted mixture compared with the what-if curves and the equal-weight mixture, X from −15 to 15]
Bounding probabilities
• Translate model uncertainties to a choice
among distributions
• Envelope the cumulative distributions
[Figure: the envelope of the candidate distributions compared with the Bayes, what-if, and mixture answers, X from −15 to 15]
Strategy for enumerable models
• What-if analysis isn’t feasible in big problems
• Probabilistic mixture is, at best, ad hoc
• For abundant data, Bayesian approach is best
• Otherwise, it’s probably just wishful thinking
• Bounding is reliable, but may be too wide
Uncertainty about distribution shape
Suppose the correct model is known to be f(A,B) = A + B, but the distributions for the independent variables A and B are not precisely known.
[Figure: p-boxes for A (roughly −10 to 10) and B (roughly −10 to 20)]
Strategy for distribution shape
• Very challenging for sensitivity analysis
since infinite-dimensional problem
• Bayesians usually fall back on a maximum
entropy approach, which erases uncertainty
rather than propagates it
• Bounding seems most reasonable, but
should reflect all available information
Uncertainty about dependence
• Neither sensitivity studies nor Monte Carlo
simulation can comprehensively assess it
• Bayesian model averaging can’t even begin
• Only bounding strategies work
What-if sensitivity analysis
• Simple theory
• Straightforward to implement
• Doesn’t confuse variability and incertitude
• Must enumerate all possible models
• Combinatorial complexity
• Hard to summarize
Probabilistic mixture
• Produces single distribution as answer
• Can account for differential credibility
• Must enumerate all possible models
• Confounds variability and incertitude
• Averages together incompatible theories
• Underestimates tail risks
Bayesian model averaging
• Produces single distribution as answer
• Can account for differential prior credibility
• Takes account of available data
• Must enumerate all possible models
• May be computationally challenging
• Confounds variability and incertitude
• Averages together incompatible theories
• Underestimates tail risks
Bounding probability
• Straightforward theoretically
• Yields single mathematical object as answer
• Doesn't confuse variability and incertitude
• Doesn't underestimate tail risks
• Cannot account for differential credibility
• Cannot take account of available data
• Optimality may be computationally expensive
Strategy for model uncertainty
• Bounding seems best when data are sparse
• When the answer is clear, strong assurance
for the conclusion
• If the answer is too wide, need more
information to tighten it
– Unless it’s okay to mislead people about the
reliability of your conclusions
Updating
Updating
• Using knowledge of how variables are
related to tighten their estimates
• Removes internal inconsistency and
explicates unrecognized knowledge
• Also called constraint updating or editing
• Also called natural extension
Example
• Suppose
W = [23, 33]
H = [112, 150]
A = [2000, 3200]
• Does knowing W × H = A let us say any more?
Answer
• Yes, we can infer that
W = [23, 28.57]
H = [112, 139.13]
A = [2576, 3200]
• The formulas are just W = intersect(W, A/H), etc.
To get the largest possible W, for instance, let A be as large as possible and H as small as possible, and solve for W = A/H.
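A minimal R sketch of this interval updating (interval endpoints as two-element vectors; the multiplication and division rules below are written for positive intervals only):

intersect.i <- function(a, b) c(max(a[1], b[1]), min(a[2], b[2]))
mul.i <- function(a, b) c(a[1] * b[1], a[2] * b[2])   # positive intervals
div.i <- function(a, b) c(a[1] / b[2], a[2] / b[1])   # positive intervals
W <- c(23, 33); H <- c(112, 150); A <- c(2000, 3200)
intersect.i(W, div.i(A, H))   # [23, 28.57]
intersect.i(H, div.i(A, W))   # [112, 139.13]
intersect.i(A, mul.i(W, H))   # [2576, 3200]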
Bayesian strategy
Prior
Pr(W, H, A) = [I(W ∈ [23, 33]) / (33 − 23)] × [I(H ∈ [112, 150]) / (150 − 112)] × [I(A ∈ [2000, 3200]) / (3200 − 2000)]
Likelihood
L(A = W H | W, H, A) = δ(A − W H)
Posterior
f(W, H, A | A = W H) ∝ δ(A − W H) × Pr(W, H, A)
Bayes’ rule
• Concentrates mass onto the manifold of
feasible combinations of W, H, and A
• Answers have the same supports as intervals
• Computationally complex
• Needs specification of priors
• Yields distributions that are not justified
(come from the choice of priors)
• Expresses less uncertainty than is present
Updating with p-boxes
[Figure: p-boxes for W (20 to 40), H (120 to 160), and A (2000 to 4000)]
Answers
[Figure: the updated p-boxes, intersect(W, A/H), intersect(H, A/W), and intersect(A, WH)]
Calculation with p-boxes
• Agrees with interval analysis whenever
inputs are intervals
• Relaxes Bayesian strategy when precise
priors are not warranted
• Produces more reasonable answers when
priors not well known
• Much easier to compute than Bayes’ rule
Backcalculation
Backcalculation
• Needed for cleanup and remediation planning
• Untangles an equation in uncertain numbers
when we know all but one of the variables
• For instance, backcalculation finds B such
that A+B = C, from estimates for A and C
Hard with probability distributions
• Inverting the equation doesn’t work
• Available analytical algorithms are unstable
for almost all problems
• Except in a few special cases, Monte Carlo
simulation cannot compute backcalculations;
trial and error methods are required
Can't just invert the equation
Dose = Concentration × Intake
(prescribed)   (unknown)   (known)
Concentration = Dose / Intake
When the concentration is put back into the forward equation, the resulting dose is wider than planned.
Backcalculation with p-boxes
Suppose A + B = C, where
A = normal(5, 1)
C = {0 ≤ C, median ≤ 1.5, 90th %ile ≤ 35, max ≤ 50}
[Figure: the distribution for A, plotted over 2 to 8, and the p-box for C, plotted over 0 to 60]
Getting the answer
• The backcalculation algorithm basically reverses the forward convolution
• Not hard at all … but a little messy to show
• Any distribution totally inside B is sure to satisfy the constraint … it's a "kernel"
[Figure: the backcalculated p-box B, plotted over −10 to 50]
Check it by plugging it back in
A + B = C* ⊆ C
[Figure: C* falls entirely inside the prescribed p-box C, plotted over −10 to 60]
Precise distributions don’t work
• Precise distributions can’t express the target
• A concentration distribution giving a prescribed
distribution of doses seems to say we want
some doses to be high
• Any distribution to the left would be better
• A p-box on the dose target expresses this idea
Backcalculation algebra
• Can define untanglings for all basic operations
e.g., if A × B = C, then B = exp(backcalc(ln A, ln C))
• Can chain them together for big problems
• Assuming independence widens the result
• Repeated terms need special strategies
Conclusion
• Planning cleanup requires backcalculation
• Monte Carlo methods don’t generally work
except in a trial-and-error approach
• Can express the dose target as a p-box
Sensitivity analysis
Sensitivity analysis with p-boxes
• Local sensitivity via derivatives
• Explored macroscopically over the
uncertainty in the input
• Describes the ensemble of tangent slopes to
the function over the range of uncertainty
Monotone function
[Figure: tangent slopes to a monotone function over the range of the input]
Nonlinear function
[Figure: tangent slopes to a nonlinear function over the range of the input]
Sensitivity analysis of p-boxes
• Quantifies the reduction in uncertainty of a
result when an input is pinched
• Pinching is hypothetically replacing it by a
less uncertain characterization
Pinching to a point value
[Figure: the p-box for X (0 to 3) replaced by a single point value]
Pinching to a (precise) distribution
[Figure: the p-box for X replaced by a precise distribution]
Pinching to a zero-variance interval
[Figure: the p-box for X replaced by an interval]
Assumes the value is constant, but unknown
There's no analog of this in Monte Carlo
Using sensitivity analyses
There is only one take-home message:
“Shortlisting” variables for treatment is bad
– Reduces dimensionality, but erases uncertainty
Differential equations
Uncertainty usually explodes
The explosion can be traced to numerical instabilities
[Figure: bounds on a state variable x widening explosively over time]
Uncertainty
• Artifactual uncertainty
– Too few polynomial terms
– Numerical instability
– Can be reduced by a better analysis
• Authentic uncertainty
– Genuine unpredictability due to input uncertainty
– Cannot be reduced by a better analysis
Only by more information, data or assumptions
Uncertainty propagation
• We want the prediction to 'break down' if that's what should happen
• But we don't want artifactual uncertainty
– Numerical instabilities
– Wrapping effect
– Dependence problem
– Repeated parameters
Problem
• Nonlinear ordinary differential equation (ODE)
dx/dt = f(x, θ)
with uncertain θ and uncertain initial state x0
• Information about θ and x0 comes as
– Interval ranges
– Probability distributions
– Probability boxes
Model
Initial states (range) and parameters (range) go into VSPODE, which returns a list of constants plus a remainder.
VSPODE (Mark Stadtherr et al., Notre Dame): Taylor models and interval Taylor series code; see also COSY and VNODE.
Example ODE
dx1/dt = θ1 x1 (1 − x2)
dx2/dt = θ2 x2 (x1 − 1)
What are the states at t = 10?
x0 = (1.2, 1.1)ᵀ
θ1 ∈ [2.99, 3.01]
θ2 ∈ [0.99, 1.01]
VSPODE settings
– Constant step size h = 0.1, order of Taylor model q = 5
– Order of interval Taylor series k = 17, QR factorization
VSPODE tells how to compute x1
1.916037656181642  10  21 + 0.689979149231081  11  20 +
-4.690741189299572  10  22 + -2.275734193378134  11  21 +
-0.450416914564394  12  20 + -29.788252573360062  10  23 +
-35.200757076497972  11  22 + -12.401600707197074  12  21 +
-1.349694561113611  13  20 + 6.062509834147210  10  24 +
-29.503128650484253  11  23 + -25.744336555602068  12  22 +
-5.563350070358247  13  21 + -0.222000132892585  14  20 +
218.607042326120308  10  25 + 390.260443722081675  11  24 +
256.315067368131281  12  23 + 86.029720297509172  13  22 +
15.322357274648443  14  21 + 1.094676837431721  15  20 +
[ 1.1477537620811058, 1.1477539164945061 ]
where  ’s are centered forms of the parameters; 1 = 1  3, 2 = 2  1
Input p-boxes
[Figure: p-boxes for θ1 (2.99 to 3.01) and θ2 (0.99 to 1.01): uniform, normal, precise, and {min, max, mean, var} cases]
Results
[Figure: p-boxes for x1 (1.12 to 1.18) and x2 (0.87 to 0.90): uniform, normal, precise, and {min, max, mean, var} cases]
Still repeated uncertainties
(The same Taylor polynomial shown above: the centered parameters θ1 and θ2 each appear in many terms.)
Subinterval reconstitution
• Subinterval reconstitution (SIR)
– Partition the inputs into subintervals
– Apply the function to each subinterval
– Form the union of the results
• Still rigorous, but often tighter
– The finer the partition, the tighter the union
– Many strategies for partitioning
• Apply to each cell in the Cartesian product
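A hedged R sketch of subinterval reconstitution for a toy expression with a repeated variable, x(1 - x) on x in [0, 1]; on each subinterval [a, b] the naive interval product of [a, b] and [1-b, 1-a] is [a(1-b), b(1-a)], and the union of the pieces tightens toward the true range [0, 0.25]:

sir <- function(k) {
  cuts <- seq(0, 1, length.out = k + 1)
  a <- cuts[-(k + 1)]; b <- cuts[-1]        # k subintervals [a, b]
  lo <- a * (1 - b); hi <- b * (1 - a)      # naive interval product on each piece
  c(min(lo), max(hi))                       # union (hull) of the pieces
}
rbind(sir(1), sir(10), sir(100))            # [0,1], [0,0.3], [0,0.255]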
Discretizations
[Figure: the p-box for θ1 (2.99 to 3.01) partitioned into subintervals]
Contraction from SIR
[Figure: tightened p-boxes for x1 (1.12 to 1.16) and x2 (0.87 to 0.90)]
Best possible bounds reveal the authentic uncertainty
Monte Carlo is more limited
• Monte Carlo cannot propagate incertitude
• Monte Carlo cannot produce validated results
– Though can be checked by repeating simulation
• Validated results from distributions can be obtained
by modeling inputs with (narrow) p-boxes and
applying probability bounds analysis
• Results converge to narrow p-boxes obtained from
infinitely many Monte Carlo replications
Conclusions
• VSPODE is useful for bounding solutions of parametric nonlinear ODEs
• Probability bounds analysis is useful when distributions are known imprecisely
• Together, they rigorously propagate uncertainty through a nonlinear ODE, whether the initial states and parameters are intervals, distributions, or p-boxes
Conclusions
Rigorousness
• “Automatically verified calculations”
• The computations are guaranteed to enclose
the true results (so long as the inputs do)
• You can still be wrong, but the method
won’t be the reason if you are
How to use the results
When uncertainty makes no difference
(because results are so clear), bounding gives
confidence in the reliability of the decision
When uncertainty obscures the decision
(i) use results to identify inputs to study better, or
(ii) use other criteria within probability bounds
Can uncertainty swamp the answer?
• Sure, if uncertainty is huge
• This should happen (it’s not “unhelpful”)
• If you think the bounds are too wide, then
put in whatever information is missing
• If there isn’t any such information, do you
want to mislead your readers?
Monte Carlo is problematic
• Probability doesn’t accumulate gross
uncertainty in an intuitive way
• Precision of the answer (measured as cv)
depends strongly on the number of inputs
• The more inputs, the tighter the answer,
irrespective of the distribution shape
A few grossly uncertain inputs
[Figure: four grossly uncertain (uniform) inputs and the resulting probabilistic output]
A lot of grossly uncertain inputs...
[Figure: forty grossly uncertain (uniform) inputs and the resulting, much tighter, probabilistic output]
Where does this surety come from? What justifies it?
Smoke and mirrors certainty
• P-boxes give a vacuous answer if all you
provide are vacuous inputs
• Conventional probability theory, at least as
it’s naively applied, seems to manufacture
certainty out of nothing
• This is why some critics say probabilistic
risk analyses are “smoke and mirrors”
Uncertain numbers
• P-boxes are very crude, but they can
express the two main forms of uncertainty
• Despite their limitations, p-boxes may be
useful for modeling uncertain numbers
• Simple arithmetic and logical expressions
are easy to compute and understand
What p-boxes can’t do
• Give best-possible bounds on non-tail risks
• Conveniently get best-possible bounds
when dependencies are subtle or
calculations are very complex
• Show what’s most likely within the box
References
• Goodman, L. 1960. On the exact variance of products. Journal of the American Statistical Association 55: 708-713.
• Ferson, S., V. Kreinovich, L. Ginzburg, K. Sentz and D.S. Myers. 2003. Constructing probability boxes and Dempster-Shafer structures. Sandia National Laboratories, SAND2002-4015, Albuquerque, New Mexico.
• Ferson, S., V. Kreinovich, J. Hajagos, W. Oberkampf and L. Ginzburg. 2007. Experimental uncertainty estimation and statistics for data having interval uncertainty. Sandia National Laboratories, SAND2007-0939, Albuquerque, New Mexico.
• Frank, M.J., R.B. Nelsen and B. Schweizer. 1987. Best-possible bounds for the distribution of a sum – a problem of Kolmogorov. Probability Theory and Related Fields 74: 199-211.
(CIs for normals, etc.)
• Kolmogorov [Kolmogoroff], A. 1941. Confidence limits for an unknown distribution function. Annals of Mathematical Statistics 12: 461-463.
• Walley, P. 1991. Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London.
Web-accessible reading
http://www.sandia.gov/epistemic/Reports/SAND2002-4015.pdf
(introduction to p-boxes and related structures)
http://www.ramas.com/depend.zip
(handling dependencies in probabilistic uncertainty modeling)
http://www.ramas.com/bayes.pdf
(introduction to Bayesian and robust Bayesian methods in risk analysis)
http://www.ramas.com/intstats.pdf
(statistics for data that may contain interval uncertainty)
http://maths.dur.ac.uk/~dma31jm/durham-intro.pdf
(Gert de Cooman’s gentle introduction to imprecise probabilities)
http://www.cs.cmu.edu/~qbayes/Tutorial/quasi-bayesian.html
(Fabio Cozman’s introduction to imprecise probabilities)
http://idsia.ch/~zaffalon/events/school2004/school.htm
(notes from a week-long summer school on imprecise probabilities)
Software
• Dan Berleant (Iowa)
– Statool (free)
• Applied Biomathematics
– PBDemo (free)
– Risk Calc (commercial)
– S3 and S4 packages for R (request beta version)
• Didier Dubois et al. (France)
Exercise
What distribution inside the p-box has the largest mean? Which has the largest variance? Smallest variance?
[Figure: a p-box plotted as cumulative probability against x from 0 to 20]
End
2-D MC simulations don't fill p-boxes
• 2-D Monte Carlo is not comprehensive
– Inadequate model of ignorance
– Dependence among parameters of a distribution
– Uncertainty about dependence (Fréchet)
– Non-denumerable model uncertainty
• Probability bounds analysis is not optimal
– Independence between parameters of a distribution
– Ternary (and higher) Fréchet operations
Maturing methodology
• Arithmetic
• Logical computations (and, or, not)
• Backcalculation, updating, deconvolution
• Decision analysis
• Statistics of data with interval uncertainty
• Sensitivity analysis
• Validation
• Non-linear ordinary differential equations
• Black-box strategies (Cauchy, quadratic, etc.)
Slide shows and/or papers on these topics are available on request
Moment propagation
How?
– Taylor expansion, or decomposition into binary operations

Operation   Mean                          Variance
X + Y       EX + EY                       VX + VY
X × Y       EX EY                         VX (EY)² + VY (EX)² + VX VY
X / Y       EX (1/EY + VY/(EY)³)          (VX + (EX)² VY/(EY)²) / (EY)²

– formulas are slightly messier without the independence assumption
– Chebyshev inequality gives limits on tail probabilities
Why?
– fully probabilistic with minimal data requirements
Why not?
– repeated variables can complicate propagation
– may not use all of the available information
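A tiny R sketch of the sum and product rows of this table, assuming independence as the table does (the function names are illustrative):

mom.sum  <- function(mx, vx, my, vy) c(mean = mx + my, var = vx + vy)
mom.prod <- function(mx, vx, my, vy)
  c(mean = mx * my, var = vx * my^2 + vy * mx^2 + vx * vy)   # Goodman (1960)
mom.sum(10, 4, 5, 1)
mom.prod(10, 4, 5, 1)
# Chebyshev then bounds the tails, e.g. P(|Z - EZ| >= k * sqrt(VZ)) <= 1/k^2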
Interval probability
How?
– bound estimates, a = [a1, a2], where a1 ≤ a2
– addition: [a1, a2] + [b1, b2] = [a1+b1, a2+b2]
– subtraction: [a1, a2] – [b1, b2] = [a1–b2, a2–b1]
– conjunctions, disjunctions, negations also easy to compute
– dependency relations among events can also be represented
Why?
– natural for scientists and easy to explain to others
– works no matter where the uncertainty comes from
Why not?
– paradoxical: can't give an exact value but can give exact bounds
– ranges can grow quickly, yielding very wide results
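A minimal R sketch of this interval arithmetic, with an interval as a two-element vector c(lo, hi):

int.add <- function(a, b) c(a[1] + b[1], a[2] + b[2])
int.sub <- function(a, b) c(a[1] - b[2], a[2] - b[1])
a <- c(0.2, 0.4); b <- c(0.1, 0.3)
int.add(a, b)   # [0.3, 0.7]
int.sub(a, b)   # [-0.1, 0.3]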
Dempster-Shafer theory
How?
– express measurement error or censoring as intervals
– develop uncertain numbers as collections of such intervals
– do arithmetic (under independence) with a Cartesian product
Why?
– measurement error, censoring, etc. is ubiquitous
– propagates non-negligible incertitude through calculations
– simple to implement
Why not?
– doesn’t use information about inter-variable dependencies
– uncertainty only in x-values, and not in probability values
– Dempster’s rule is weird and controversial
Probability bounds analysis
How?
– specify what you are sure about
– establish bounds on probability distributions
– pick dependencies (no assumption, indep., perfect, etc.)
Why?
– account for uncertainty better than maximum entropy
– puts bounds on Monte Carlo results
– bounds get narrower with better empirical information
Why not?
– does not yield second-order probabilities
– best-possible results can sometimes be expensive to compute
Imprecise probabilities
How?
– avoid sure loss: P̲(A) ≤ P̄(A)
– be coherent, e.g. P̲(A) + P̲(B) ≤ P̲(A ∨ B) for disjoint A and B
– use natural extension (mathematical programming) to find consequences
Why?
– most expressive language for uncertainty of all kinds
– can provide expectations and conditional probabilities
– provides best possible results that do not lose information
Why not?
– requires mathematical programming
– can strain the mathematical ability of the analyst
History of the speed of light
[Figure: reported measurements of the speed of light (m/sec) from 1880 to 1980, ranging from about 299,700,000 to 300,000,000; the currently accepted value is 299,792,458 m/sec]
History of overconfidence
• About 70% should enclose true value (fewer than half do)
• Overconfidence is “almost universal in all measurements of
physical quantities” (Morgan and Henrion 1990)
• Humans (expert and otherwise) routinely grossly overconfident
90% confidence intervals typically enclose their true values only
about 30 to 50% of the time
• Shlyakhter suggested we automatically widen all bounds
Everyone makes assumptions
• But not all sets of assumptions are equal!
Point value            Interval range           Entire real line
Linear function        Monotonic function       Any function
Normal distribution    Unimodal distribution    Any distribution
Independence           Known correlation        Any dependence
• Like to discharge unwarranted assumptions
"Certainties lead to doubt; doubts lead to certainty"
Two paths
• What assumptions are needed to get an answer?
It's always possible to find some
• What's the quantitative answer that doesn't depend on any unjustified assumptions?
Recognizing when you've made an unjustified assumption may take some discipline
Sometimes, "I don't know" is the right answer
Example: endangered species
• Northern spotted owl Strix occidentalis caurina
• Olympic Peninsula, Washington State
• Leslie matrix model (with composite age)
• Environmental and demographic stochasticity
• Density dependence (territorial, Allee effects)
• Catastrophic windstorms
IUCN threat criteria
Extinct (not sighted in the wild for 50 years)
Critical (50% risk of extinction in 18 years)
Endangered (20% risk of extinction in 89 years)
Vulnerable (10% risk of extinction in 100 years)
Nonthreatened (better than any of the above)
Leslie matrix model

| juveniles t+1 |   | 0            Fsubadults   Fadults | | juveniles t |
| subadults t+1 | = | Sjuveniles   0            0       | | subadults t |
| adults t+1    |   | 0            Ssubadults   Sadults | | adults t    |

What kind of information might be available about these variables?
Risk of quasi-extinction
[Figure: cumulative probability of quasi-extinction against time (0 to 100 years), with the critical, endangered, and vulnerable thresholds marked]
Regression of x on probability scores
[Figure: the Skinny and Puffy interval data plotted against normal probability scores z from −1.5 to 1.5]
Different kinds of interval regression could be used
Least squares (bounds given all possible points within intervals)
Interval principal components (model II regression)
"All-constraints" regression
All-constraints interval regression
[Figure: all-constraints regression bounds through the interval data, z from −1.5 to 1.5]
Dependence between p-boxes
• Random-set independent
  o Epistemically independent
  o Strongly independent
  o Repetition independent
  (For precise probabilities, these are all the same.)
• Perfectly associated
• Oppositely associated
• Known copula
• Specified copula family and correlation
• Known functional relationship
  (These cases yield precise distributions from precise input distributions.)
• Positively quadrant dependent (PQD)
• Negatively quadrant dependent (NQD)
• Known or interval-bounded correlation
• Fréchet case
Dependence tree
Fréchet (no assumption)
– NQD (r ≤ 0), which includes opposite, W
– r = 0, which includes independence, Π
– PQD (0 ≤ r), which includes perfect, M
How else to do shape uncertainty?
• Very challenging for sensitivity analysis
since it’s an infinite-dimensional problem
• Bayesians usually fall back on a maximum
entropy approach, which erases uncertainty
rather than propagates it
• Bounding seems most reasonable, but
should reflect all available information
Moment-range propagation
Simultaneous moment propagation
• Just means and variances, and ranges
• Makes use of general formulas, and also
special formulas for named distribution shapes
• Finite ranges imply moments always exist and
often improve bounds formulas substantially
• Intersects bounds from formulas and inferred
from distribution bounds
What do bounds say about moments?
[Figure: a p-box plotted as exceedance risk over 0 to 400]
These bounds imply the mean can't be any smaller than 30 or larger than 75. Likewise, the variance has to be within [600, 6000].
What do moments say about bounds?
If we know the mean is 10 and the variance is 2, these are the best possible bounds on the chance the variable is bigger than any value (Chebyshev inequality).
[Figure: the corresponding p-box plotted as exceedance risk over −10 to 30]
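A hedged R sketch of the bounds behind such a mean-variance box, using Chebyshev's inequality in its one-sided (Cantelli) form (the function name is illustrative):

# CDF bounds for a quantity with known mean m and variance v
pbox.meanvar <- function(x, m, v) {
  upper <- ifelse(x < m, v / (v + (m - x)^2), 1)            # left bound of the box
  lower <- ifelse(x >= m, (x - m)^2 / (v + (x - m)^2), 0)   # right bound of the box
  cbind(lower = lower, upper = upper)
}
pbox.meanvar(seq(-10, 30, by = 5), m = 10, v = 2)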
Range-moment propagation
Consider multiplication, for instance
L(X × Y) = min(LX LY, LX GY, GX LY, GX GY)
G(X × Y) = max(LX LY, LX GY, GX LY, GX GY)
E(X × Y) = EX EY ± √(VX VY)
V(X × Y) = Goodman's (1960) formula
where L = min, G = max, E = mean, V = variance
Repetitions are okay because all quantities are positive.
Assuming independence,
E(X × Y) = EX EY
V(X × Y) = (EX)² VY + (EY)² VX + VX VY
Range and moments together
[Figure: families of p-boxes combining a range with moments]
{min = 0, max = 100, mean = 50, stdev = s} for s = 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 49, 50
{min = 0, max = 100, mean = 10, stdev = s} for s = 1, 2, 3, 4, 5, 6, 8, 10, 15, 20, 25, 29
Boundaries, interval statistics, moment-range constraints, and moments of the inputs are each carried through the mathematical operation (Williamson-Frank-Nelsen-Sklar for the boundaries; Rowe, inter alia, for the moments) to yield boundaries, interval statistics, moment-range constraints, and moments of the output.
This auxiliary effort often substantially improves (tightens) the output p-boxes.
Travel time (Lobascio)

T = (n + BD × foc × Koc) × L / (K × i)

Parameter                            Units    Min      Max     Mean      Stdv        Shape
L    source-receptor distance        m        80       120     100       11.55       uniform
i    hydraulic gradient              m/m      0.0003   0.0008  0.00055   0.0001443   uniform
K    hydraulic conductivity          m/yr     300      3000    1000      750         lognormal
n    effective soil porosity         –        0.2      0.35    0.25      0.05        lognormal
BD   soil bulk density               kg/m3    1500     1750    1650      100         lognormal
foc  fraction organic carbon         –        0.0001   0.005   0.00255   0.001415    uniform
Koc  organic partition coefficient   m3/kg    5        20      10        3           normal
Inputs as mmms p-boxes
[Figure: p-boxes for the seven inputs L (m), i, K (m yr⁻¹), n, BD (kg m⁻³), foc, and Koc (m³ kg⁻¹), each constrained by its min, max, mean, and stdev]
Output p-box
[Figure: the p-box for travel time (years), 0 to 100,000, plotted as cumulative probability]
Detail of left tail
[Figure: the left tail of the travel-time p-box, 0 to 500 years]
Is independence reasonable?
• Soil porosity and soil bulk density
• Hydraulic conductivity and soil porosity
• Hydraulic gradient and hydraulic conductivity
• Organic carbon and partition coefficient
Remember: independence is a much stronger assumption than uncorrelatedness
Assumptions no longer needed
• A decade ago, you had to assume all variables
were mutually independent
• Software tools now allow us to relax any pesky
independence assumption
• No longer necessary to make independence
assumptions for mathematical convenience
• But do the assumptions make any difference?
Without dependence assumptions
[Figure: the travel-time p-box without any dependence assumptions, 0 to 100,000 years]
Left tails
[Figure: the left tail of the travel-time p-box without dependence assumptions, 0 to 500 years]
Dependence bounds
• Guaranteed to enclose results no matter
what correlation or dependence there may
be between the variables
• Best possible (couldn’t be any tighter
without saying more about the dependence)
• Can be combined with independence
assumptions between other variables
Conclusions
• The model is a cartoon, but it illustrates the
use of methods to relax independence and
precise distribution assumptions
• Relaxing these assumptions can have a big
impact on quantitative conclusions from an
assessment
Take-home message
• Whatever assumptions you make about dependencies and the shapes of distributions are between you and your spreadsheet
• There are methods now available that don't force you to make assumptions you're not comfortable with