Kriging - University of Florida

Cost of surrogates
• In linear regression, fitting involves solving a set of linear equations once.
• With radial basis neural networks we also have to optimize the selection of neurons, which entails multiple solutions of the linear system. We may find the best spread by minimizing cross-validation errors.
• Kriging, our next surrogate, is even more expensive: there is a spread constant in every direction, and we have to perform optimization to calculate the best set of constants (hyperparameters).
– With many hundreds of data points this can become a significant computational burden.
Introduction to Kriging
• Method invented in the 1950s by the South African geologist Danie G. Krige (1919-2013) for predicting the distribution of minerals.
– Formalized by the French engineer Georges Matheron in the 1960s.
– Statisticians refer to the more general Gaussian process regression.
• Became very popular for fitting surrogates to expensive computer simulations in the 21st century.
• It is one of the best surrogates available.
• It probably became popular late mostly because of the high computational cost of fitting it to data.
Kriging philosophy
• We assume that the data are sampled from an unknown function that obeys simple correlation rules.
• The value of the function at a point is correlated with the values at neighboring points based on their separation in different directions.
• The correlation is strong with nearby points and weak with faraway points, but its strength does not change based on location.
• Normally kriging is used with the assumption that there is no noise, so that it interpolates the function values exactly.
Reminder: Covariance and Correlation
• Covariance of two random variables X and Y:
$\operatorname{cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[XY] - \mu_X \mu_Y$
• The covariance of a random variable with itself is the square of the standard deviation:
$\operatorname{cov}(X,X) = \operatorname{Var}(X) = [\sigma(X)]^2$
• The covariance matrix for a vector contains the covariances of its components:
$\begin{pmatrix} \operatorname{Var}(X) & \operatorname{cov}(X,Y) \\ \operatorname{cov}(X,Y) & \operatorname{Var}(Y) \end{pmatrix}$
• Correlation:
$\operatorname{cor}(X,Y) = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}, \qquad -1 \le \operatorname{cor}(X,Y) \le 1$
• The correlation matrix has 1 on the diagonal.
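These definitions can be checked numerically; a minimal MATLAB sketch (the variables are our own illustration), using the built-in cov and corrcoef:

% Verify the covariance/correlation definitions on random data
X = randn(1000,1);           % random variable X
Y = 0.5*X + randn(1000,1);   % Y partially correlated with X
C = cov(X,Y)                 % 2x2 covariance matrix [Var(X) cov; cov Var(Y)]
R = corrcoef(X,Y)            % correlation matrix, ones on the diagonal
% R(1,2) should equal C(1,2)/sqrt(C(1,1)*C(2,2))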
Correlation between function values at nearby points for sin(x)
• Generate 10 random numbers, translate them by a bit (0.1), and by more (1.0):
x=10*rand(1,10)
8.147 9.058 1.267 9.134 6.324 0.975 2.785 5.469 9.575 9.649
xnear=x+0.1; xfar=x+1;
• Calculate the sine function at the three sets:
y=sin(x)
0.9573 0.3587 0.9551 0.2869 0.0404 0.8279 0.3491 -0.7273 -0.1497 -0.2222
ynear=sin(xnear)
0.9237 0.2637 0.9799 0.1899 0.1399 0.8798 0.2538 -0.6551 -0.2477 -0.3185
yfar=sin(xfar)
0.2740 -0.5917 0.7654 -0.6511 0.8626 0.9193 -0.5999 0.1846 -0.9129 -0.9405
• Compare correlations: r=corrcoef(y,ynear) gives 0.9894; rfar=corrcoef(y,yfar) gives 0.4229.
• The correlation decays to about 0.4 over one sixth of the wavelength.
– The wavelength of the sine function is 2π ≈ 6.
Gaussian correlation function
• Correlation between point x and point s:
$C\big(Z(\mathbf{x}), Z(\mathbf{s}); \boldsymbol{\theta}\big) = \exp\left(-\sum_{i=1}^{N_v} \theta_i (x_i - s_i)^2\right)$
• We would like the correlation to decay to about 0.4 at one sixth of the wavelength $l_i$.
• Approximately $\theta_i (l_i/6)^2 = 1$, or $\theta_i = 36/l_i^2$.
• For the function $y = \sin(x_1)\sin(5x_2)$ we would like to estimate $\theta_1 \approx 1$, $\theta_2 \approx 25$.
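As a quick numeric check of the $\theta = 36/l^2$ rule (a sketch; variable names are ours):

% Gaussian correlation decay at one sixth of the wavelength
l = 2*pi;             % wavelength of sin(x)
theta = 36/l^2;       % so that theta*(l/6)^2 = 1
d = l/6;              % separation of one sixth of a wavelength
c = exp(-theta*d^2)   % correlation exp(-1) ~ 0.37, i.e., about 0.4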
Universal Kriging
$\hat{y}(\mathbf{x}) = \sum_{i=1}^{N_v} \beta_i \xi_i(\mathbf{x}) + Z(\mathbf{x})$
• The trend function $\xi(\mathbf{x})$ is most often a low-order polynomial.
• We will cover Ordinary Kriging, where the trend is just a constant $\mu$ to be estimated from the data.
• There is also Simple Kriging, where the constant is assumed to be known (often as 0).
• Assumption: the systematic departures Z(x) are correlated.
• The kriging prediction comes with a normal distribution of the uncertainty in the prediction.
[Figure: sampled data points with a linear trend model, the systematic departure from the trend, and the kriging fit, plotted against x.]
Notation
• The function values are given at $n_y$ points $\mathbf{x}^{(i)}$, $i = 1, \ldots, n_y$, with the point $\mathbf{x}^{(i)}$ having components $x_k^{(i)}$, $k = 1, \ldots, n$ (the number of dimensions).
• The function value at the $i$th point is $y_i = y(\mathbf{x}^{(i)})$, and the vector of $n_y$ function values is denoted $\mathbf{y}$.
• Given decay rates $\theta_k$, we form the covariance matrix of the data:
$\operatorname{Cov}(y_i, y_j) = \sigma^2 \exp\left(-\sum_{k=1}^{n} \theta_k \big(x_k^{(i)} - x_k^{(j)}\big)^2\right) = \sigma^2 R_{ij}, \qquad i, j = 1, \ldots, n_y$
• The correlation matrix R above is formed from the covariance matrix, assuming a constant standard deviation $\sigma$, which measures the uncertainty in the function values.
• For dense data, $\sigma$ will be small; for sparse data, $\sigma$ will be large.
• How do you decide whether the data is sparse or dense?
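A sketch of forming R (the data, θ values, and variable names are our own illustration):

% Build the correlation matrix R for data points X (rows = points)
% and decay rates theta (1 x n)
X = [0 0; 0.5 0.2; 1 1];   % example: ny = 3 points in n = 2 dimensions
theta = [1 25];            % decay rates theta_k
ny = size(X,1);
R = ones(ny);
for i = 1:ny
  for j = i+1:ny
    R(i,j) = exp(-sum(theta.*(X(i,:)-X(j,:)).^2));
    R(j,i) = R(i,j);       % symmetric, with ones on the diagonal
  end
end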
Prediction and shape functions
• Ordinary Kriging prediction formula:
$\hat{y}(\mathbf{x}) = \hat{\mu} + \mathbf{r}^T \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu}) = \hat{\mu} + \mathbf{b}^T \mathbf{r}, \qquad r_i = \exp\left(-\sum_{k=1}^{n} \theta_k \big(x_k - x_k^{(i)}\big)^2\right)$
• The equation is linear in r, which means that the exponentials $r_i$ may be viewed as basis functions.
• The equation is linear in the data y, in common with linear regression, but b is not calculated by minimizing the RMS error.
• Note that far away from the data $\hat{y}(\mathbf{x}) \approx \hat{\mu}$ (not good for substantial extrapolation).
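Continuing the matrix sketch above, a minimal prediction at a new point x0 (muhat is derived on the next slide; the data values here are illustrative):

% Ordinary kriging prediction at a point x0
y = [0.1; 0.9; 0.3];                      % example function values at the points
one = ones(ny,1);
muhat = (one'*(R\y)) / (one'*(R\one));    % estimated mean (see next slide)
b = R\(y - one*muhat);                    % b = R^{-1}(y - 1*muhat)
x0 = [0.25 0.1];                          % prediction point
r = exp(-sum(theta.*(X - x0).^2, 2));     % r_i for each data point
yhat = muhat + b'*r                       % kriging prediction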
Fitting the data
• Fitting refers to finding the parameters $\theta_k$.
• We fit by maximizing the likelihood that the data come from a Gaussian process defined by $\theta_k$:
$\mathcal{L}(\boldsymbol{\theta} \mid Y, X) = P(Y \mid X, \boldsymbol{\theta})$
• Once they are found, the estimates of the mean and standard deviation are obtained as
$\hat{\mu} = \frac{\mathbf{1}^T \mathbf{R}^{-1} \mathbf{y}}{\mathbf{1}^T \mathbf{R}^{-1} \mathbf{1}}, \qquad \hat{\sigma}^2 = \frac{(\mathbf{y} - \mathbf{1}\hat{\mu})^T \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu})}{n_y}$
• Maximum likelihood is a tough optimization problem.
• Some kriging codes minimize the cross-validation error instead.
• Stationary covariance assumption: the smoothness of the response is quite uniform in all regions of the input space.
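For a given θ, the estimates and the concentrated log-likelihood can be computed directly; a sketch continuing the example (an optimizer such as fminsearch would maximize logL over θ):

% Mean/variance estimates and concentrated log-likelihood for current theta
muhat = (one'*(R\y)) / (one'*(R\one));
res = y - one*muhat;
sigma2 = (res'*(R\res)) / ny;                % sigma^2 estimate
logL = -0.5*(ny*log(sigma2) + log(det(R)))   % log-likelihood up to a constant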
Prediction variance
$V[\hat{y}(\mathbf{x})] = \sigma^2 \left[ 1 - \mathbf{r}^T \mathbf{R}^{-1} \mathbf{r} + \frac{\big(1 - \mathbf{1}^T \mathbf{R}^{-1} \mathbf{r}\big)^2}{\mathbf{1}^T \mathbf{R}^{-1} \mathbf{1}} \right]$
• The square root of the variance is called the standard error.
• The uncertainty at any x is normally distributed.
• $\hat{y}(\mathbf{x})$ represents the mean of the kriging prediction.
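Continuing the sketch, the variance and standard error at x0:

% Prediction variance and standard error at x0
Rinv_r = R\r;
V = sigma2*(1 - r'*Rinv_r + (1 - one'*Rinv_r)^2/(one'*(R\one)));
SE = sqrt(V)                     % standard error at x0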
Kriging fitting issues
• The maximum likelihood or cross-validation optimization problem solved to obtain the kriging fit is often ill-conditioned, leading to a poor fit or a poor estimate of the prediction variance.
• A poor estimate of the prediction variance can be checked by comparing it to the cross-validation error.
• Poor fits are often characterized by the kriging surrogate having large curvature near data points (see the example on the next slide).
• It is recommended to visualize the fit by plotting the kriging surrogate and its standard error.
Example of poor fits
True function: $y = x^2 + 5x - 10$
Kriging model: trend model $\xi(x) = 0$; covariance function $\operatorname{cov}(x_i, x_j) = \exp\big(-\theta (x_i - x_j)^2\big)$
[Figure: the true function and the data points over $-8 \le x \le 8$.]
θ selection:
[Figure: two kriging interpolations with 2SE bounds, plotted against the true function and test points over $-8 \le x \le 8$. Left panel: poor fit and poor standard error. Right panel: good fit and poor standard error. SE: standard error.]
Kriging with nuggets
• Nuggets refer to the inclusion of noise at the data points.
• The more general Gaussian process surrogates, or kriging with nuggets, can handle data with noise (e.g., experimental results with noise).
[Figure: the true function value y(x), the surrogate prediction, and the noisy data over $0 \le x \le 1$.]
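In implementation terms, a nugget is typically added to the diagonal of the correlation matrix, so the surrogate smooths rather than interpolates; a minimal sketch (the nugget value is an assumed illustration):

% Kriging with a nugget: add a noise term to the diagonal of R
nugget = 1e-2;             % assumed noise level (problem dependent)
Rn = R + nugget*eye(ny);   % use Rn in place of R in fitting and prediction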
Introduction to optimization with surrogates
• Based on cycles. Each cycle consists of sampling design points by simulations, fitting surrogates to the simulations, and then optimizing an objective.
• Construct the surrogate, optimize the original objective, then refine the region and the surrogate.
– Typically a small number of cycles with a large number of simulations in each cycle.
• Adaptive sampling (see the sketch below):
– Construct the surrogate, then add points by taking into account not only the surrogate prediction but also the uncertainty in the prediction.
– Most popular: Jones's EGO (Efficient Global Optimization).
– Easiest with one added sample at a time.
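EGO selects the next sample by maximizing the expected improvement, which combines the kriging prediction and its standard error; a sketch using yhat and SE from the earlier examples (minimization of y assumed):

% Expected improvement at a candidate point, from kriging mean yhat
% and standard error SE
ymin = min(y);                        % best observed value so far
u = (ymin - yhat)/SE;
Phi = 0.5*(1 + erf(u/sqrt(2)));       % standard normal CDF
phi = exp(-u^2/2)/sqrt(2*pi);         % standard normal PDF
EI = (ymin - yhat)*Phi + SE*phi       % expected improvement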
Problems: Kriging
• Fit the quadratic function of the 'Example of poor fits' slide with kriging using different options, such as different covariances (different θs) and trend functions, and compare the accuracy of the fits.
• For this problem, compare the standard error with the actual error.