Lecture 3 Review of Linear Algebra Simple least-squares


Lecture 3
Review of Linear Algebra
Simple least-squares
9 things you need to remember
from Linear Algebra
Number 1
rule for vector and matrix multiplication
$u = Mv$:   $u_i = \sum_{k=1}^{N} M_{ik} v_k$
$P = QR$:   $P_{ij} = \sum_{k=1}^{N} Q_{ik} R_{kj}$
Name of the index in the sum is irrelevant. You can call it anything (as long as you're consistent).
The sum runs over the nearest-neighbor (adjacent) indices.
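As a quick check, a minimal Python/numpy sketch (my own illustration, not part of the lecture; the array sizes and names are arbitrary) showing that the component-notation sum is exactly what the built-in matrix product computes:

```python
import numpy as np

N = 4
M = np.random.randn(N, N)
v = np.random.randn(N)

# explicit loop over the summation index k: u_i = sum_k M_ik v_k
u_loop = np.zeros(N)
for i in range(N):
    for k in range(N):
        u_loop[i] += M[i, k] * v[k]

print(np.allclose(u_loop, M @ v))   # True: same as u = Mv
```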
Number 2
transposition
rows become columns and columns become rows
$(A^T)_{ij} = A_{ji}$
and the rule for transposition of products
$(AB)^T = B^T A^T$
Note the reversal of order.
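A small numerical check of the reversal-of-order rule (again my own sketch, assuming numpy; the matrix shapes are arbitrary):

```python
import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 2)

# (AB)^T equals B^T A^T -- note the reversal of order
print(np.allclose((A @ B).T, B.T @ A.T))   # True
# A^T B^T would not even have compatible shapes here (4x3 times 2x4)
```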
Number 3
rule for the dot product
$a \cdot b = a^T b = \sum_{i=1}^{N} a_i b_i$
Note that $a \cdot a$ is the sum of the squared elements of $a$, the square of "the length of a".
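A sketch of the same rule in numpy (illustrative values only), showing the dot product and its connection to the length of a vector:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a @ b, np.dot(a, b), np.sum(a * b))   # three ways to write sum_i a_i b_i
print(a @ a, np.linalg.norm(a) ** 2)        # a.a is the squared length of a
```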
Number 4
the inverse of a matrix
$A^{-1} A = I$ and $A A^{-1} = I$
where $I$ is the identity matrix, e.g.
$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
(the inverse exists only when $A$ is square and non-singular)
Number 5
solving $y = Mx$ using the inverse
$x = M^{-1} y$
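A minimal sketch (my own, assuming numpy; the matrix and right-hand side are made up) of solving $y = Mx$. As a practical aside, `np.linalg.solve` gives the same answer without forming the inverse explicitly:

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0])

x_inv   = np.linalg.inv(M) @ y    # x = M^{-1} y, exactly as on the slide
x_solve = np.linalg.solve(M, y)   # same result, computed without the explicit inverse

print(x_inv, x_solve, np.allclose(M @ x_solve, y))
```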
Number 6
multiplication by the identity matrix
$M = IM = MI$
In component notation $I_{ij} = \delta_{ij}$ (the Kronecker delta; just a name):
$\sum_{k=1}^{N} \delta_{ik} M_{kj} = M_{ij}$
To evaluate such a sum: cross out the sum, cross out $\delta_{ik}$, and change $k$ to $i$ in the rest of the equation.
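A one-line numerical confirmation (illustration only, assuming numpy) that multiplying by the identity leaves a matrix unchanged:

```python
import numpy as np

M = np.random.randn(3, 3)
I = np.eye(3)                    # I_ij = delta_ij

print(np.allclose(I @ M, M), np.allclose(M @ I, M))   # M = IM = MI
```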
Number 7
inverse of a $2 \times 2$ matrix
$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$
$A^{-1} = \dfrac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$
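The formula is easy to check numerically; here is a sketch (my own helper `inv2x2`, a hypothetical name, assuming numpy) comparing it against the general-purpose inverse:

```python
import numpy as np

def inv2x2(A):
    """Inverse of a 2x2 matrix via the (ad - bc) formula from the slide."""
    a, b = A[0, 0], A[0, 1]
    c, d = A[1, 0], A[1, 1]
    det = a * d - b * c            # must be nonzero for the inverse to exist
    return np.array([[d, -b],
                     [-c, a]]) / det

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.allclose(inv2x2(A), np.linalg.inv(A)))   # True
```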
Number 8
inverse of a diagonal matrix
$A = \begin{pmatrix} a & 0 & 0 & \cdots & 0 \\ 0 & b & 0 & \cdots & 0 \\ 0 & 0 & c & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & z \end{pmatrix}$
$A^{-1} = \begin{pmatrix} 1/a & 0 & 0 & \cdots & 0 \\ 0 & 1/b & 0 & \cdots & 0 \\ 0 & 0 & 1/c & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1/z \end{pmatrix}$
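In code the rule is just element-by-element reciprocals of the diagonal; a small sketch (my own, assuming numpy, with made-up diagonal entries):

```python
import numpy as np

diag = np.array([2.0, 4.0, 5.0])    # the diagonal entries a, b, c, ...
A = np.diag(diag)
A_inv = np.diag(1.0 / diag)         # invert element-by-element: 1/a, 1/b, 1/c, ...

print(np.allclose(A @ A_inv, np.eye(3)))   # True
```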
Number 9
rule for taking a derivative
use component notation
treat every element as an independent variable
and remember that, since the elements are independent,
$\partial x_i / \partial x_j = \delta_{ij}$ (the components of the identity matrix)
Example: Suppose $y = Ax$. How does $y_i$ vary as we change $x_j$?
(That's the meaning of the derivative $\partial y_i / \partial x_j$.)
First write the $i$-th component of $y$: $y_i = \sum_{k=1}^{N} A_{ik} x_k$
(We're already using $i$ and $j$, so use a different letter, say $k$, in the summation!)
$\dfrac{\partial y_i}{\partial x_j} = \dfrac{\partial}{\partial x_j} \sum_{k=1}^{N} A_{ik} x_k = \sum_{k=1}^{N} A_{ik} \dfrac{\partial x_k}{\partial x_j} = \sum_{k=1}^{N} A_{ik} \delta_{kj} = A_{ij}$
So the derivative $\partial y_i / \partial x_j$ is just $A_{ij}$. This is analogous to the case for scalars, where the derivative $dy/dx$ of the scalar expression $y = ax$ is just $dy/dx = a$.
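A finite-difference check of this rule (my own sketch, assuming numpy; the matrix, vector, and step size are arbitrary) showing that the numerical Jacobian of $y = Ax$ is simply $A$:

```python
import numpy as np

N = 3
A = np.random.randn(N, N)
x = np.random.randn(N)
h = 1e-6

# Numerically estimate dy_i/dx_j for y = Ax by perturbing each x_j in turn.
J = np.zeros((N, N))
for j in range(N):
    dx = np.zeros(N)
    dx[j] = h
    J[:, j] = (A @ (x + dx) - A @ x) / h

print(np.allclose(J, A, atol=1e-4))   # True: dy_i/dx_j = A_ij
```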
best-fitting line
the combination of $a^{pre}$ and $b^{pre}$ that has the smallest sum-of-squared-errors
find it by exhaustive search ('grid search')
Fitting line to noisy data
$y^{obs} = a + bx$
Observations: the vector $y^{obs}$
Guess values for $a$, $b$:   $y^{pre} = a^{guess} + b^{guess} x$   (here $a^{guess} = 2.0$, $b^{guess} = 2.4$)
Prediction error = observed minus predicted:   $e = y^{obs} - y^{pre}$
Total error: the sum of squared prediction errors,   $E = \sum_i e_i^2 = e^T e$
Systematically examine combinations of $(a, b)$ on a $101 \times 101$ grid, as in the sketch below.
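A sketch of this grid search in Python/numpy (my own illustration: the synthetic data, the noise level, and the grid ranges are made up, not the lecture's dataset):

```python
import numpy as np

# Synthetic noisy data around the line y = 1 + 2x
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 30)
y_obs = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

# 101 x 101 grid of candidate (a, b) values
a_grid = np.linspace(0.0, 4.0, 101)
b_grid = np.linspace(0.0, 4.0, 101)
E = np.zeros((101, 101))
for i, a in enumerate(a_grid):
    for j, b in enumerate(b_grid):
        e = y_obs - (a + b * x)     # prediction error
        E[i, j] = e @ e             # total error E = e^T e

i_min, j_min = np.unravel_index(np.argmin(E), E.shape)
print("best a, b on grid:", a_grid[i_min], b_grid[j_min], "E_min =", E[i_min, j_min])
```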
[Figure: error surface $E(a^{pre}, b^{pre})$ and the corresponding best-fitting line]
The minimum total error marks the best-fitting $a$, $b$. Note that the error at the minimum, $E_{min}$, is not zero.
Note also that there is some range of values around the minimum where the error is about the same as the minimum value, $E_{min}$:
all $a$'s in this range and all $b$'s in this range have pretty much the same error.
moral
the shape of the error surface controls the accuracy with which $(a, b)$ can be estimated
What controls the shape of the error surface?
Let's examine the effect of increasing the error in the data.
The minimum error increases, but the shape of the error surface is pretty much the same.
[Figure: error in data = 0.5 gives $E_{min}$ = 0.20; error in data = 5.0 gives $E_{min}$ = 23.5]
What controls the shape of the error surface?
Let's examine the effect of shifting the x-position of the data.
There is a big change from simply shifting the x-values of the data (here to the range 0 to 10):
the region of low error is now tilted.
(High $b$, low $a$) has low error and (low $b$, high $a$) has low error,
but (high $b$, high $a$) and (low $a$, low $b$) have high error.
Meaning of the tilted region of low error:
the errors in $(a^{pre}, b^{pre})$ are correlated.
Uncorrelated estimates of intercept and slope:
when the data straddle the origin, if you tweak the intercept up, you can't compensate by changing the slope.
Negative correlation of intercept and slope:
when the data are all to the right of the origin, if you tweak the intercept up, you must lower the slope to compensate.
Positive correlation of intercept and slope:
when the data are all to the left of the origin, if you tweak the intercept up, you must raise the slope to compensate.
Data near the origin (x from -5 to 5): possibly good control on the intercept, but lousy control on the slope.
Data far from the origin (x from 0 to 100): lousy control on the intercept, but possibly good control on the slope.
Set up for standard Least Squares
$y_i = a + b x_i$, or in matrix form $d = Gm$:
$\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}}_{d} = \underbrace{\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}}_{G} \underbrace{\begin{pmatrix} a \\ b \end{pmatrix}}_{m}$
Standard Least-squares Solution
$m^{est} = [G^T G]^{-1} G^T d$
Derivation: use the fact that the minimum is at $\partial E / \partial m_i = 0$.
$E = \sum_k e_k e_k = \sum_k \left( d_k - \sum_p G_{kp} m_p \right)\left( d_k - \sum_q G_{kq} m_q \right)$
$\;\;= \sum_k d_k d_k - 2 \sum_k d_k \sum_p G_{kp} m_p + \sum_k \sum_p G_{kp} m_p \sum_q G_{kq} m_q$
$\partial E / \partial m_i = 0 - 2 \sum_k d_k \sum_p G_{kp} \dfrac{\partial m_p}{\partial m_i} + \sum_k \sum_p G_{kp} \dfrac{\partial m_p}{\partial m_i} \sum_q G_{kq} m_q + \sum_k \sum_p G_{kp} m_p \sum_q G_{kq} \dfrac{\partial m_q}{\partial m_i}$
$= -2 \sum_k d_k \sum_p G_{kp} \delta_{pi} + \sum_k \sum_p G_{kp} \delta_{pi} \sum_q G_{kq} m_q + \sum_k \sum_p G_{kp} m_p \sum_q G_{kq} \delta_{qi}$
$= -2 \sum_k d_k G_{ki} + \sum_k G_{ki} \sum_q G_{kq} m_q + \sum_k \sum_p G_{kp} m_p G_{ki}$
$= -2 \sum_k G_{ki} d_k + 2 \sum_q \left[ \sum_k G_{ki} G_{kq} \right] m_q = 0$
or $-2 G^T d + 2 [G^T G] m = 0$, so $m = [G^T G]^{-1} G^T d$.
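A sketch of this solution in Python/numpy (my own, with made-up data; the lecture does not prescribe an implementation), computing $m^{est}$ from the normal equations and comparing it to numpy's built-in least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 30)
d = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)   # observed data

G = np.column_stack([np.ones_like(x), x])   # rows of G are [1, x_i]

# m_est = [G^T G]^{-1} G^T d, exactly as derived above
m_est = np.linalg.inv(G.T @ G) @ (G.T @ d)

# np.linalg.lstsq solves the same least-squares problem (more stably in practice)
m_lstsq, *_ = np.linalg.lstsq(G, d, rcond=None)
print(m_est, m_lstsq)
```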
Why least-squares?
Why not least absolute value?
Or something else?
[Figure: the same data fit two ways]
Least-Squares: a = 0.94, b = 2.02
Least Absolute Value: a = 1.00, b = 2.02
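The lecture only reports the two sets of fitted coefficients. As a rough illustration of how the two criteria can be compared, here is a sketch of my own (made-up data with one outlier; iteratively reweighted least squares is one common way to obtain a least-absolute-value fit, not something the lecture specifies):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
d = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
d[5] += 8.0                                   # one outlier, to show the contrast

G = np.column_stack([np.ones_like(x), x])

# Least squares: minimize the sum of squared errors
m_ls, *_ = np.linalg.lstsq(G, d, rcond=None)

# Least absolute value: minimize sum |e_i| by iteratively reweighted least squares
m = m_ls.copy()
for _ in range(50):
    w = 1.0 / np.maximum(np.abs(d - G @ m), 1e-8)    # weights ~ 1/|residual|
    W = np.diag(w)
    m = np.linalg.solve(G.T @ W @ G, G.T @ W @ d)

print("least squares   a, b:", m_ls)
print("least abs value a, b:", m)
```

The least-absolute-value fit is less sensitive to the outlier, which is the usual motivation for asking "why least-squares?" in the first place.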