4.2.1 Interpolation, extrapolation and prediction variance
Prediction variance in Linear Regression
• Assumptions on noise in linear regression allow
us to estimate the prediction variance due to the
noise at any point.
• Prediction variance is usually large when you are
far from a data point.
• We distinguish between interpolation, when we
are in the convex hull of the data points, and
extrapolation where we are outside.
• Extrapolation is associated with larger errors, and
in high dimensions it usually cannot be avoided.
Linear Regression
• Surrogate is a linear combination of $n_b$ given shape functions: $\hat{y} = \sum_{i=1}^{n_b} b_i \xi_i(x)$
• For linear approximation: $\xi_1 = 1$, $\xi_2 = x$
• Difference (error) between $n_y$ data and surrogate: $e_j = y_j - \sum_{i=1}^{n_b} b_i \xi_i(x_j)$, or in matrix form $e = y - Xb$
• Minimize square error: $e^T e = (y - Xb)^T (y - Xb)$
• Differentiate to obtain $X^T X b = X^T y$
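The normal equations $X^T X b = X^T y$ can be solved directly for small problems. The sketch below is only an illustration: the four data values and the function name `fit_linear_regression` are made up for this example, and the shape functions are taken to be 1 and x.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations X^T X b = X^T y for the coefficients b."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data (not from the slides): a straight line plus small scatter.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.0])

# One row per data point, one column per shape function (1 and x).
X = np.column_stack([np.ones_like(x), x])

b = fit_linear_regression(X, y)
e = y - X @ b  # residual vector e = y - Xb
```

At the least-squares solution the residual is orthogonal to the columns of X, so `X.T @ e` is zero up to round-off. For ill-conditioned X, `np.linalg.lstsq` is usually a safer choice than forming $X^T X$.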
Model based error for linear regression
• The common assumptions for linear regression
– The true function is described by the functional form
of the surrogate.
– The data is contaminated with normally distributed
error with the same standard deviation at every point.
– The errors at different points are not correlated.
• Under these assumptions, the noise standard deviation (called standard error) is estimated from $\hat{\sigma}^2 = \dfrac{e^T e}{n_y - n_b}$
• $\hat{\sigma}$ is used as an estimate of the prediction error.
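As a minimal sketch of the standard error estimate, the function below fits the coefficients, forms the residual, and divides the squared error by $n_y - n_b$. The data and the name `noise_std_estimate` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def noise_std_estimate(X, y):
    """Estimate the noise standard deviation from the fit residuals."""
    b = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares coefficients
    e = y - X @ b                          # residuals
    n_y, n_b = X.shape                     # number of data points, coefficients
    return np.sqrt(e @ e / (n_y - n_b))

# Hypothetical data (not from the slides).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.0])
X = np.column_stack([np.ones_like(x), x])

sigma_hat = noise_std_estimate(X, y)
```

The estimate needs more data points than coefficients. With $n_y = n_b$ the fit passes through every point, the residual is zero, and the denominator vanishes.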
Prediction variance
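• Linear regression model: $\hat{y} = \sum_{i=1}^{n_b} b_i \xi_i(x)$
• Define $x^{(m)}_i = \xi_i(x)$, then $\hat{y} = x^{(m)T} b$
• With some algebra: $\mathrm{Var}[\hat{y}(x)] = x^{(m)T} \Sigma_b \, x^{(m)} = \sigma^2 x^{(m)T} (X^T X)^{-1} x^{(m)}$
• Standard error: $s_y = \hat{\sigma} \sqrt{x^{(m)T} (X^T X)^{-1} x^{(m)}}$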
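The standard error formula can be evaluated at any prediction point once X is known. The sketch below assumes a hypothetical one-dimensional data set with shape functions 1 and x, and compares a point inside the data range with one well outside it. The helper name `prediction_standard_error` is made up for this illustration.

```python
import numpy as np

def prediction_standard_error(X, x_m, sigma_hat=1.0):
    """Return s_y = sigma_hat * sqrt(x_m^T (X^T X)^{-1} x_m)."""
    # Solve (X^T X) z = x_m rather than forming the inverse explicitly.
    z = np.linalg.solve(X.T @ X, x_m)
    return sigma_hat * np.sqrt(x_m @ z)

# Hypothetical data locations (not from the slides).
x_data = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x_data), x_data])

# x_m holds the shape functions (1, x) evaluated at the prediction point.
s_inside = prediction_standard_error(X, np.array([1.0, 1.5]))   # x = 1.5
s_outside = prediction_standard_error(X, np.array([1.0, 6.0]))  # x = 6.0
```

With `sigma_hat=1.0` the function returns the ratio $s_y / \hat{\sigma}$, which depends only on where the data points are and not on the measured values.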
Interpolation, extrapolation and regression
• Interpolation is often contrasted to regression or
least-squares fit
• As important is the contrast between interpolation
and extrapolation
• Extrapolation occurs when we are outside the convex
hull of the data points
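  A point $x$ lies in the convex hull when it can be written as $x = \sum_i \alpha_i x_i$ with $\sum_i \alpha_i = 1$ and $\alpha_i \ge 0$, where the $x_i$ are data points.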
• For high dimensional spaces we must have
extrapolation!
2D example of convex hull
• By generating 20 points at random in the unit square we end up with a substantial region near the origin where we will need to use extrapolation.
• Using the data in the notes, give a couple of alternative sets of alphas, approximately, for the point (0.4, 0.4).
[Figure: 2D example of convex hull for 20 random points in the unit square; both axes run from 0 to 1]
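One way to look for sets of alphas in 2D is to try every triangle of data points: solving a 3x3 linear system gives the alphas for that triangle, and the triangle contains the point when all of them are non-negative. In the plane, a point in the convex hull always lies in at least one such triangle. The sketch below is a brute-force illustration under that assumption. It uses the corners of the unit square as hypothetical data, not the points from the notes, and the function name is made up.

```python
import itertools
import numpy as np

def convex_combinations(points, x, tol=1e-12):
    """Yield (indices, alphas) for each triangle of data points that contains x.

    The alphas solve sum_i alpha_i x_i = x together with sum_i alpha_i = 1;
    the triangle contains x when every alpha_i is non-negative.
    """
    rhs = np.append(x, 1.0)
    for idx in itertools.combinations(range(len(points)), 3):
        # Rows 1-2: coordinates of the three points; row 3: sum of alphas.
        A = np.vstack([points[list(idx)].T, np.ones(3)])
        if abs(np.linalg.det(A)) < tol:
            continue  # skip collinear triples
        alphas = np.linalg.solve(A, rhs)
        if np.all(alphas >= -tol):
            yield idx, alphas

# Hypothetical data: the four corners of the unit square.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

inside = list(convex_combinations(points, np.array([0.4, 0.4])))
outside = list(convex_combinations(points, np.array([1.5, 1.5])))
```

Several triangles can contain the same point, which is why the alphas are not unique. An empty result means the point is outside the convex hull, so a prediction there is an extrapolation.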
Example of prediction variance
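• For a linear polynomial RS $y = b_1 + b_2 x_1 + b_3 x_2$ find the prediction variance in the region $-1 \le x_1 \le 1$, $-1 \le x_2 \le 1$
• (a) For data at three vertices (omitting (1,1)):
  $x_1^T = (-1, -1)$, $x_2^T = (-1, 1)$, $x_3^T = (1, -1)$, $x^{(m)} = (1, x_1, x_2)^T$
  $X = \begin{bmatrix} 1 & -1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \end{bmatrix}$, $X^T X = \begin{bmatrix} 3 & -1 & -1 \\ -1 & 3 & -1 \\ -1 & -1 & 3 \end{bmatrix}$, $(X^T X)^{-1} = 0.25 \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}$
  $s_y^2 = \hat{\sigma}^2 x^{(m)T} (X^T X)^{-1} x^{(m)} = 0.5\,\hat{\sigma}^2 (1 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2)$
[Figure: the region $-1 \le x_1, x_2 \le 1$; both axes run from -1 to 1]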
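The algebra for the three-vertex case is easy to check numerically. The sketch below builds X for the three data points, inverts $X^T X$, and compares the quadratic form with the closed-form polynomial. The function names are made up for this illustration, and the result is the ratio $s_y^2 / \hat{\sigma}^2$.

```python
import numpy as np

# Rows are (1, x1, x2) at the vertices (-1,-1), (-1,1), (1,-1).
X3 = np.array([[1.0, -1.0, -1.0],
               [1.0, -1.0,  1.0],
               [1.0,  1.0, -1.0]])
XtX_inv = np.linalg.inv(X3.T @ X3)

def variance_factor(x1, x2):
    """Quadratic form x_m^T (X^T X)^{-1} x_m, i.e. s_y^2 / sigma_hat^2."""
    x_m = np.array([1.0, x1, x2])
    return x_m @ XtX_inv @ x_m

def closed_form(x1, x2):
    """Polynomial expression for the same quantity."""
    return 0.5 * (1 + x1 + x2 + x1**2 + x2**2 + x1 * x2)

# Compare the two expressions on a coarse grid over the region.
grid = np.linspace(-1.0, 1.0, 5)
max_diff = max(abs(variance_factor(a, b) - closed_form(a, b))
               for a in grid for b in grid)
```

The factor is 0.5 at the origin, 1 at each of the three data points, and 3 at the omitted vertex (1,1).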
Interpolation vs. Extrapolation
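• At origin $s_y = \hat{\sigma}/\sqrt{2}$. At 3 vertices $s_y = \hat{\sigma}$. At (1,1) $s_y = \sqrt{3}\,\hat{\sigma}$.
• These follow from $s_y^2 = \hat{\sigma}^2 x^{(m)T} (X^T X)^{-1} x^{(m)} = 0.5\,\hat{\sigma}^2 (1 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2)$
[Figure: the region $-1 \le x_1, x_2 \le 1$; both axes run from -1 to 1]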
Standard error contours
• Minimum error $s_y = \hat{\sigma}/\sqrt{3}$ at $x_1 = x_2 = -1/3$, obtained by setting to zero the derivative of the prediction variance with respect to $x_1$ and $x_2$.
• What is special about this point?
• Contours of prediction variance provide more detail.
[Figure: standard error contours for the three-point case, contour labels from 0.6 to 1.6; both axes run from -1 to 1]
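The location of the minimum can be worked out by hand from the three-vertex result $s_y^2 = 0.5\,\hat{\sigma}^2 (1 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2)$. The derivation below is a sketch of that step:

```latex
\begin{align*}
\frac{\partial s_y^2}{\partial x_1} &= 0.5\,\hat{\sigma}^2 (1 + 2x_1 + x_2) = 0, \\
\frac{\partial s_y^2}{\partial x_2} &= 0.5\,\hat{\sigma}^2 (1 + 2x_2 + x_1) = 0
  \quad\Longrightarrow\quad x_1 = x_2 = -\tfrac{1}{3}, \\
s_y^2\Big|_{x_1 = x_2 = -1/3}
  &= 0.5\,\hat{\sigma}^2 \left(1 - \tfrac{2}{3} + \tfrac{3}{9}\right)
   = \tfrac{1}{3}\,\hat{\sigma}^2
  \quad\Longrightarrow\quad s_y = \frac{\hat{\sigma}}{\sqrt{3}}.
\end{align*}
```

The point $(-1/3, -1/3)$ is also the average of the three data points $(-1,-1)$, $(-1,1)$ and $(1,-1)$.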
Data at four vertices
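• Now $x_1^T = (-1, -1)$, $x_2^T = (-1, 1)$, $x_3^T = (1, -1)$, $x_4^T = (1, 1)$
  $X = \begin{bmatrix} 1 & -1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \\ 1 & 1 & 1 \end{bmatrix}$, $X^T X = 4 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$
• And $x^{(m)T} (X^T X)^{-1} x^{(m)} = 0.25 (1 + x_1^2 + x_2^2)$
• Error at vertices: $s_y = \dfrac{\sqrt{3}}{2}\,\hat{\sigma}$
• At the origin the minimum is $s_y = \dfrac{1}{2}\,\hat{\sigma}$
• How can we reduce error without adding points?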
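The three-point and four-point cases can be compared side by side with a few lines of code. The sketch below evaluates the ratio $s_y / \hat{\sigma}$ at the origin and at the vertex (1,1) for both cases. The helper name `standard_error_ratio` is made up for this illustration.

```python
import numpy as np

def standard_error_ratio(X, x1, x2):
    """Return s_y / sigma_hat = sqrt(x_m^T (X^T X)^{-1} x_m) at (x1, x2)."""
    x_m = np.array([1.0, x1, x2])
    return np.sqrt(x_m @ np.linalg.solve(X.T @ X, x_m))

# Rows are (1, x1, x2) at the data points.
X3 = np.array([[1.0, -1.0, -1.0],
               [1.0, -1.0,  1.0],
               [1.0,  1.0, -1.0]])       # three vertices, (1,1) omitted
X4 = np.vstack([X3, [1.0, 1.0, 1.0]])    # all four vertices

three_origin = standard_error_ratio(X3, 0.0, 0.0)  # 1/sqrt(2)
three_corner = standard_error_ratio(X3, 1.0, 1.0)  # sqrt(3), extrapolation
four_origin = standard_error_ratio(X4, 0.0, 0.0)   # 1/2
four_corner = standard_error_ratio(X4, 1.0, 1.0)   # sqrt(3)/2
```

Adding the fourth vertex halves the standard error at (1,1), which changes from an extrapolation point to a data point, and also lowers the error at the origin.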
Graphical Comparison of Standard Errors
[Figure: standard error contours in two panels, "Three points" and "Four points"; both axes run from -1 to 1]
Homework
• Redo the four-point example when the data points are not at the corners but inside the domain, at ±0.8. What does the difference in the results tell you?
• For a grid of 3x3 data points, compare the standard errors for linear and quadratic fits.