
Local surrogates
• To model a complex, wavy function we need a lot of data.
• Modeling a wavy function with high-order polynomials is
inherently ill-conditioned.
• With a lot of data we normally predict function values using
only nearby values, so we may fit several local surrogates, as in
the figure.
• For example, if you have the
price of gasoline every first of
the month from 2000 through
2009, how many values would
you use to estimate the price
on June 15, 2007?
Popular local surrogates
• Moving least squares: weights points near the
prediction location more heavily.
• Radial basis neural network: regression with
local functions that decay away from data
points.
• Kriging: radial basis functions, but the fitting
philosophy is based not on the error at data points
but on the correlation between function values at
nearby and distant points.
Review of Linear Regression
• Surrogate is a linear combination of $n_b$ given shape functions:
$\hat{y} = \sum_{i=1}^{n_b} b_i \xi_i(\mathbf{x})$
• For a linear approximation: $\xi_1 = 1,\ \xi_2 = x$
• Difference (residual) between the $n_y$ data points and the surrogate:
$r_j = y_j - \sum_{i=1}^{n_b} b_i \xi_i(\mathbf{x}_j)$, or in matrix form $\mathbf{r} = \mathbf{y} - X\mathbf{b}$
• Minimize the square residual $\mathbf{r}^T\mathbf{r} = (\mathbf{y} - X\mathbf{b})^T(\mathbf{y} - X\mathbf{b})$
• Differentiate to obtain $X^T X\mathbf{b} = X^T\mathbf{y}$
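As a concrete illustration, here is a minimal MATLAB sketch of this fit (the data and variable names are illustrative, not from the lecture):

% Linear least-squares fit via the normal equations X'Xb = X'y
x = linspace(0, 10, 11)';             % n_y data locations
y = 2 + 0.5*x + 0.3*randn(size(x));   % noisy data, illustrative only
X = [ones(size(x)), x];               % shape functions xi_1 = 1, xi_2 = x
b = (X'*X) \ (X'*y);                  % surrogate coefficients
r = y - X*b;                          % residual vector r = y - Xb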
Moving least squares
• Instead of fitting the surrogate ahead of time, we fit it at
the time of prediction.
• We assign a weight to each data point based on its
distance from the prediction point.
• A popular weight is $w = e^{-(d/\lambda)^2} = e^{-\theta d^2}$, with $\theta = 1/\lambda^2$, where $d$ is the distance from the prediction point.
Weighted least squares
• Weighted least squares was developed to allow us to
assign weights to data based on confidence or relevance.
• Here we use it for moving least squares, but we also have a
lecture on using it to identify outliers.
• Error measure: $e_{wrms}^2 = \frac{1}{n_y}\sum_{i=1}^{n_y} w_i e_i^2 = \frac{1}{n_y}\mathbf{e}^T W \mathbf{e}$
• Surrogate coefficients found from $X^T W X\,\mathbf{b} = X^T W\,\mathbf{y}$
• Coefficients need to be recalculated at every prediction
point, which can be expensive if we have many prediction points,
such as in a Monte Carlo simulation.
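A minimal sketch of a single such prediction, using MATLAB's lscov with a Gaussian weight (the data and names are illustrative, not from the lecture):

% Moving least squares prediction at a single point (1-D, linear basis)
xdata  = linspace(0, 1, 21)';           % data locations
ydata  = sin(2*pi*xdata);               % data values, illustrative function
xpred  = 0.37;                          % prediction point
lambda = 0.1;                           % weight decay length
w = exp(-((xdata - xpred)/lambda).^2);  % heavier weights near xpred
X = [ones(size(xdata)), xdata];         % shape functions xi_1 = 1, xi_2 = x
b = lscov(X, ydata, w);                 % solves X'WXb = X'Wy with W = diag(w)
ypred = [1, xpred]*b;                   % surrogate value at the prediction point

The coefficients b depend on xpred through the weights, which is why the solve must be repeated for every prediction point.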
Six-hump camelback function
• Definition:
$F(x_1, x_2) = \left(4 - 2.1x_1^2 + x_1^4/3\right)x_1^2 + x_1 x_2 + \left(-4 + 4x_2^2\right)x_2^2$
$-2 \le x_1 \le 2, \quad -1 \le x_2 \le 1$
• Function fit with moving least squares using quadratic
polynomials.
• Fitting in normalized domain with
each variable in [0,1].
• Each quadratic piece should use
data from about one-third of the range in
each variable.
• 𝜆=0.1, corresponding to 𝜃=100,
would do that.
• We will need about 10 points in that
range to fit a quadratic (a full quadratic in
two variables has six coefficients).
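A minimal sketch of the function and of the normalization to [0,1] (the grid and names are my own, not from the lecture):

% Six-hump camelback function and normalization of the design domain
camel = @(x1,x2) (4 - 2.1*x1.^2 + x1.^4/3).*x1.^2 + x1.*x2 + (-4 + 4*x2.^2).*x2.^2;
[x1, x2] = meshgrid(linspace(-2, 2, 20), linspace(-1, 1, 20));  % illustrative grid
F = camel(x1, x2);
% Normalize each variable to [0,1] before the moving least squares fit
x1n = (x1 + 2) / 4;
x2n = (x2 + 1) / 2;
% With lambda = 0.1 (theta = 100), the weight falls to exp(-9) ~ 1e-4 at a
% normalized distance of 0.3, so each local quadratic effectively uses data
% from about one-third of the range in each variable.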
Effect of number of points and decay rate
Radial basis neural networks
• Surrogate is a weighted sum of $N_{RBF}$ neurons (radial basis functions):
$\hat{y}(\mathbf{x}) = \sum_{i=1}^{N_{RBF}} w_i a_i(\mathbf{x}) = \sum_{i=1}^{n_b} b_i \xi_i(\mathbf{x})$
$a_i = \mathrm{radbas}\left(\|\mathbf{x}_i - \mathbf{x}\|/b\right), \quad \mathrm{radbas}(n) = e^{-n^2}$
• Neurons (radial basis functions) placed at some of the data points are used
to approximate the response; the other points are used to estimate the error.
• User-defined constants:
– Spread constant: radius of influence, $\lambda = b$
– Error goal: the mean squared error at which the fit is accepted
(Figure: the radial basis function radbas(n), which equals 0.5 at n = ±0.833, and a network diagram with input x, neurons a1, a2, a3, weights W1, W2, W3, and output ŷ(x).)
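A minimal sketch of evaluating the neuron outputs for a one-dimensional input (not the toolbox's internal implementation; centers and spread are illustrative):

% Evaluate the radial basis neuron outputs a_i at a point x
radbas = @(n) exp(-n.^2);          % Gaussian radial basis function
xc = [1 3 5 7 9]';                 % neuron centers, a subset of the data points
b  = 1;                            % spread constant (radius of influence)
x  = 4.2;                          % point where the surrogate is evaluated
a  = radbas(abs(xc - x)/b);        % neuron outputs, largest for centers near x
% yhat = w'*a, with weights w found by fitting the data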
Radial basis functions
In regression notation
• Radial basis functions: $\xi_i(\mathbf{x}) = e^{-\left(\|\mathbf{x}-\mathbf{x}_i\|/\lambda\right)^2}$
• The basis functions are defined at a subset of data
points.
• For a given spread constant (b or 𝜆 )
– Can perform the fit and calculate all error measures.
– Can select a subset of functions (neurons) to maximize
predictive accuracy (akin to stepwise regression).
• Because of the similarity to nonlinear neural networks, the
similarity to polynomial response surfaces is obscured.
• Number of basis functions (neurons) is selected to
achieve a desired value of rms error.
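In this notation the fit is an ordinary least-squares regression with the radial basis functions as shape functions. A minimal sketch (centers, spread, and data are illustrative):

% RBF fit written as linear regression with xi_i(x) = exp(-((x - x_i)/lambda)^2)
xdata  = linspace(1, 9, 21)';            % all data points
ydata  = xdata + 0.5*sin(5*xdata);
xc     = xdata(1:2:end);                 % subset of points used as basis centers
lambda = 1;                              % spread constant
Xi = exp(-((xdata - xc')/lambda).^2);    % n_y-by-n_b matrix of basis values
b  = Xi \ ydata;                         % least-squares coefficients
xt   = linspace(0, 10, 101)';            % prediction points
yhat = exp(-((xt - xc')/lambda).^2) * b; % surrogate prediction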
Example
• Evaluate the function y = x + 0.5sin(5x) at 21 points in
the interval [1,9], fit an RBF to it, and compare the
surrogate to the function over the interval [0,10].
• Fitting with the default options in
Matlab achieves zero rms
error by using all data
points as basis functions
(neurons).
• Very good interpolation, but
even mild extrapolation is
horrible.
(Figure: the RBF fit compared with the true function over [0,10].)
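A sketch of how this default fit can be reproduced (assumes MATLAB's newrb from the Deep Learning Toolbox; the plotting details are my own):

% RBF fit of y = x + 0.5*sin(5x) with newrb defaults: goal = 0, spread = 1
x = linspace(1, 9, 21);                 % 21 training points in [1,9]
y = x + 0.5*sin(5*x);
net = newrb(x, y, 0, 1);                % goal = 0 forces interpolation of all points
xt = linspace(0, 10, 201);              % evaluation range includes extrapolation
yt = net(xt);                           % surrogate prediction
plot(x, y, 'o', xt, xt + 0.5*sin(5*xt), '-', xt, yt, '--')
legend('data', 'true function', 'RBF surrogate')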
Accept 0.1 mean squared error
• net=newrb(x,y,0.1,1,20,1); spread set to 1
(11 neurons were used).
• With about half of the data
points used as basis functions,
the fit is more like polynomial
regression.
• Interpolation is not as good,
but the trend is captured, so
that extrapolation is not as
disastrous.
• Obviously, if we just wanted to
capture the trend, we would
have been better off with a
polynomial.
(Figure: the RBF fit with error goal 0.1 and spread 1 over [0,10].)
Too narrow a spread
• net=newrb(x,y,0.1,0.2,20,1); (17 neurons
used)
• With a spread of 0.2 and the
points being 0.4 apart (21
points in [1,9]), the shape
functions decay to less than
0.02 at the nearest point:
$e^{-(0.4/0.2)^2} = e^{-4} \approx 0.018$.
• This means that each data point
is fitted nearly individually, so we
get spikes at the data points.
• A rule of thumb is that the
spread should not be smaller
than the distance to the nearest
point.
(Figure: the RBF fit with spread 0.2, showing spikes at the data points.)
Problems
1. Fit the example with weighted least squares.
You can use Matlab’s lscov to perform the fit.
Compare the fit to the one obtained with the
neural network fit.
2. Repeat the example with 41 points,
experimenting with the parameters of
newrb. How much of what you see did you
expect?