Simple View on Simple Interval Calculation (SIC)

Download Report

Transcript Simple View on Simple Interval Calculation (SIC)

Simple View on Simple Interval
Calculation (SIC)
Alexey Pomerantsev,
Oxana Rodionova
Institute of Chemical Physics, Moscow
and Kurt Varmuza
Vienna Technical University
© Kurt Varmuza
10.02.05
WSC-4
1
CAC, Lisbon, September 2004
10.02.05
WSC-4
2
Leisured Agenda
1. Why errors are limited?
2. Simple calculations, indeed!
Univariate case
3. Complicated SIC.
Bivariate case
4. Conclusions
10.02.05
WSC-4
3
Part I. Why errors are limited?
10.02.05
WSC-4
4
Water in wheat. NIR spectra by Lumex Co
2
1
0
-1
-2
9058.
10.02.05
9290.
9521.
9753.
9984.
WSC-4
10216
10447
10679
5
Histogram for Y (water contents)
40
141 samples
30
20
10
0
8
10.02.05
9
10
11
WSC-4
12
13
14
6
Normal Probability Plot for Y
99.65
89.01
78.37
67.73
57.09
42.91
32.27
38%
21.63
10.99
21%
3%
0.35
8
10.02.05
9
10
11
WSC-4
12
13
14
7
PLS Regression. Whole data set
10.02.05
WSC-4
8
PLS Regression. Marked “outliers”
10.02.05
WSC-4
9
PLS Regression. Revised data set
10.02.05
WSC-4
10
Histogram for Y. Revised data set
40
124 samples
30
20
10
0
8
10.02.05
10
12
WSC-4
14
11
Normal Probability Plot. Revised data set
99.60
89.92
80.24
70.56
60.89
39.11
29.44
96%
19.76
10.08
81%
31%
0.40
9
10.02.05
10
11
WSC-4
12
13
14
12
Histogram for Y. Revised data set
40
30
20
10
0
10
m-3s
10.02.05
12
m-2s
m-s
m
WSC-4
14
m+s
m+2s
m+3s
13
Error Distribution
Normal distribution
-b
 s  s  s  s  s
Truncated normal distribution 3.5s
+b
s
s
s
s
s
-b
s
+b
 s  s  s  s  s
s
s
s
s
s
s
Both distributions
-b
+b
 s  s  s  s  s
10.02.05
s
WSC-4
s
s
s
s
s
14
Main SIC postulate
All errors are limited!
There exists Maximum Error Deviation,
b, such that for any error e
Prob{| e | > b }= 0
E r r o r d is tr ib u tio n
b
10.02.05
b
WSC-4
15
Part 2. Simple calculations
10.02.05
WSC-4
16
Case study. Simple Univariate Model
Data
7
T2
Test
y
C1
1.0
1.28
C2
2.0
1.68
C3
4.0
4.25
C4
5.0
5.32
T1
3.0
3.35
T2
4.5
6.19
T3
5.5
5.40
6
C4
5
R esponse , y
Training
x
T3
4
T1
C3
3
2
C1
C2
1
Error distribution
V ariable, x
0
0
1
2
3
4
5
6
Model
b
10.02.05
y=ax+e
b
WSC-4
17
OLS calibration
OLS Calibration is minimizing the Sum of Least Squares
7
T2
6
C4
R esponse , y
5
S um of S quares
T3
4
T1
C3
3
2
C1
C2
1
V ariable, x
a
0
0 .5
10.02.05
0
0.7
0 .7 0.8 0 .9 1.044
1 .1 1.2 1 .3 1.4 1 .5
WSC-4
1
2
3
4
5
6
18
Uncertainties in OLS
7
T2
6
C4
R esponse , y
5
t3(P) is quantile of Student's
t-distribution for probability
P with 3 degrees of freedom
T3
4
T1
C3
3
2
C1
C2
1
V ariable, x
0
0
10.02.05
WSC-4
1
2
3
4
5
6
19
SIC calibration
|e|<b
7
Maximum Error Deviation
is known:
6
b = 0.7 (=2.5s)
C4
2b
R esponse , y
5
C3
2b
4
3
2
C 2 2b
2b
C1
1
V ariable, x
0
0
10.02.05
WSC-4
1
2
3
4
5
6
20
SIC calibration
7
6
C4
R esponse , y
5
C3
4
3
2
C2
Training
10.02.05
amin amax
x
y
C1
1.0
1.28
0.58
1.98
C2
2.0
1.68
0.49
1.19
C1
1
V ariable, x
0
0
C3
4.0
4.25
0.89
1.24
C4
5.0
5.32
0.92
1.20
WSC-4
1
2
3
4
5
6
21
Region of Possible Values
amin amax
Training
x
y
C1
1.0
1.28
0.58
1.98
C2
2.0
1.68
0.49
1.19
C3
4.0
4.25
0.89
1.24
C4
5.0
5.32
0.92
1.20
C4
C3
C2
C1
RPV
a
10.02.05
m in = 0 .9 2
a
aa
m ax = 1 .1 9
WSC-4
22
SIC prediction
7
T2
6
T3
C4
R esponse , y
5
Test
x
y
v-
v+
T1
3.0
3.35
2.77
3.57
T2
4.5
6.19
4.16
5.36
T3
5.5
5.40
5.08
6.55
C3
4
T1
3
2
C2
C1
1
V ariable, x
0
0
10.02.05
WSC-4
1
2
3
4
5
6
23
Object Status. Calibration Set
amin amax
y
C1
1.0
1.28
0.58
1.98
C2
2.0
1.68
0.49
1.19
C3
4.0
4.25
0.89
1.24
C4
5.0
5.32
0.92
1.20
Samples C2 & C4 are the boundary
objects. They form RPV.
7
6
C4
5
R esponse , y
Training
x
C3
4
3
2
C2
Samples C1 & C3 are insiders.
They could be removed from the
calibration set and RPV doesn’t
C1
1
Variable,
ariable, xx
V
0
change.
10.02.05
0
WSC-4
1
2
3
4
5
6
24
Object Status. Test Set
Let’s consider what happens when a
7
new sample is added to the calibration
set.
6
C4
R esponse , y
5
4
3
2
C2
C4
1
C2
a
0
RPV
a m i n = 0.92
10.02.05
V ariable, x
a m a x = 1.19
0
WSC-4
1
2
3
4
5
6
25
Object Status. Insider
If we add sample T1,
RPV doesn’t change.
7
This object is an insider.
6
C4
Prediction interval lies
5
R esponse , y
inside error interval
4
T1
3
2
T1
C2
C4
1
C2
a
a m i n = 0.92
10.02.05
RPV
V ariable, x
0
a m a x = 1.19
0
WSC-4
1
2
3
4
5
6
26
Object Status. Outlier
If we add sample T2,
RPV disappears.
7
This object is an outlier.
T2
6
C4
Prediction Interval
5
R esponse , y
lies out error interval
4
3
2
T2
C2
C4
1
C2
a
V ariable, x
0
a m i n = 0.92
10.02.05
a m a x = 1.19
0
WSC-4
1
2
3
4
5
6
27
Object Status. Outsider
If we add sample T3,
7
RPV becomes smaller.
6
This object is an outsider.
T3
C4
5
R esponse , y
Prediction interval overlaps
error interval
4
3
2
T3
C2
C4
1
C2
a
a m i n = 0.92
10.02.05
RPV
V ariable, x
0
a m a x = 1.11
0
WSC-4
1
2
3
4
5
6
28
SIC-Residual and SIC-Leverage
They characterize interactions between prediction and error intervals
Definition 1.
SIC-residual is defined as –
v+
bh
This is a characteristic of bias
y–b
Definition 2.
SIC-leverage is defined as –
br
v–
y
y+b
This is a normalized precision
10.02.05
WSC-4
29
Object Status Plot
Using simple algebraic calculus one can prove the following statements
T2
Statement 1
An object (x, y) is an insider, iff
B
1
Statement 2
An object (x, y) is an outlier, iff
S IC -residu al, r
| r (x, y) |  1 – h (x)
Presented by triangle BCD
C1
T1
C2
C
C4
C3
-1
1
S IC -Leverage, h
T3
D
| r (x, y) | > 1 + h (x)
Presented by lines AB and DE
10.02.05
A
E
WSC-4
30
Object Status Classification
Insiders
Absolute
outsiders
Outliers
Outsiders
10.02.05
WSC-4
31
OLS Confidence versus SIC Prediction
7
True response value, y, is always
= 0.9
59
PP=0
=0
.999
located within the SIC prediction
T2
6
C4
interval. This has been confirmed
5
times. Thus
Prob{ v- < y < v+ } = 1.00
Confidence intervals tends to
infinity when P is increased.
R esponse , y
by simulations repeated 100,000
T3
4
T1
C3
3
2
C1
C2
1
Confidence intervals are
unreasonably wide!
10.02.05
V ariable, x
0
0
WSC-4
1
2
3
4
5
6
32
Beta Estimation. Minimum b
0.6
0.5
0.4
0.3
bb==0.7
CC4
C4
CC
444
CC3C3C3C33
C 2C 2C 2C 2C 2
C1 C1 C1 C1 C1
RPV
RPV
RPV
RPV
a
b > bmin = 0.3
10.02.05
WSC-4
33
Beta Estimation from Regression Residuals
e = ymeasured – ypredicted
bOLS= max {|e1|, |e2|, ... , |en |}
bOLS = 0.4
bSIC= bOLS C(n)
Prob{b < bSIC}=0.90
10.02.05
WSC-4
bSIC = 0.8
34
1-2-3-4 Sigma Rule
10.02.05
1s  RMSEC
RMSEC = 0.2 = 1s
2s  bmin
bmin
= 0.3 = 1.5s
3s  bOLS
bOLS
= 0.4 = 2s
4s  bSIC
bSIC
= 0.8 = 4s
WSC-4
35
Part 3. Complicated SIC. Bivariate case
10.02.05
WSC-4
36
Octane Rating Example (by K. Esbensen)
Test set =13 samples
0.6
0.4
0.2
Training set = 24 samples
0.6
0
0.4
1100
1200
1300
1400
1500
X-values are NIR-measurements
over 226 wavelengths
0.2
Y-values are reference
measurements of octane number.
0
1100
10.02.05
1200
1300
1400
1500
WSC-4
37
Calibration
PC2
Scores
U Scor es
0.4
4
Elements:
0.2
2
0
0
-0.2
37
Slope:
9.643866
Offset:
0.006391
Corr elation:
0.991227
-2
PC1
-0.4
RESULT4,
-0.2
X- expl: 85%,12%
RMSE
0
T Scor es
0.2
-0.1
Y- expl: 85%,13%
0
RESULT4, PC( X- expl,Y-expl):
Root Mean Square Err or
0.1
0.2
0.3
0.4
2(12%,13%)
Predi cted Y
4
94
Slope
Offset
Corr .
0.981975
1.608816
0.990947
0.919002
7.082160
0.972058
92
2
90
88
0
86
PCs
PC_01
RESULT4, Variable:
10.02.05
PC_02
c.octane
v.octane
PC_03
Measured Y
PC_04
86
RESULT4, ( Y- var, PC) :
WSC-4
88
(octane,2)
90
92
(octane,2)
38
PLS Decomposition
1
p
PLS
= y
n
b
p
n
X
1
2PC
1
1
10.02.05
WSC-4
= y – y0 1
n
a
n
n
T
2
1
39
1-2-3-4 Sigma Rule for Octane Example
RMSEC = 0.27 = 1s
bmin
= 0.48 = 1.8s
bOLS
= 0.58 = 2.2s
bSIC
= 0.88 = 3.3s
b = bSIC = 0.88
10.02.05
WSC-4
40
RPV in Two-Dimensional Case
y1 – y0– b  t11a1 + t12a2  y1 – y0 + b
y2 – y0– b  t21a1 + t22a2  y2 – y0 + b
...
yn – y0– b  tn1a1 + tn2a2  yn – y0 + b
We have a system of 2n =48 inequalities
regarding two parameters a1 and a2
10.02.05
WSC-4
41
Region of Possible Values
40
a2
35
30
25
RPV
20
15
10
5
5
aa 11
0
0
0
0
10.02.05
5
5
10
10
15
15
20
20
WSC-4
25
25
30
30
35
35
40
40
42
Close view on RPV. Calibration Set
RPV in parameter space
a
18
2
1
+
3
24
23
4
9
–
RPV
9
20
13
–
2
+
2
0
3
14
11
–
7
5
16
18
20
Samples
1
20
18
24
+
a
7 13
1
S I C -L e ve rag e
-1
22
Boundary Samples
C7
10.02.05
4
23
24
12
14
6
22
10
1
6
12
1
8 91 5
21
5 1 61 7
0
1
16
14
12
S IC -R e sid u a l
28
Object Status Plot
C9
C13
C14
C18
C23
—— —— —— —— —— ——
WSC-4
43
SIC Prediction with Linear Programming
v+
v–
Vertex #
a1
a2
t ta
y
1
13.91 16.36
-0.40
88.86
2
14.22 18.36
-0.35
88.90
3
16.79 26.66
-0.24
89.01
4
19.91 26.61
-0.46
88.79
5
20.41 13.16
-0.96
88.30
6
17.44 13.52
-0.74
88.52
Linear Programming Problem
10.02.05
WSC-4
44
Octane Prediction. Test Set
Prediction intervals: SIC & PLS
Object Status Plot
Reference values
PLS 2RMSEP
SIC prediction
94
2
13
S IC -R es idual
O ctane N um ber
92
90
7
8
9
10
4
3
0
2
1
88
-1
11
6
1
12
1
2
3
S IC -Leverage
5
86
1
2
3
4
5
6
7
8
9
10 11 12 13
Test S amples
10.02.05
-2
WSC-4
45
Conclusions
• Real errors are limited. The truncated normal distribution is a
much more realistic model for the practical applications than
unlimited error distribution.
 s  s  s  s  s s
s
s
s
s
s
• Postulating that all errors are limited we can draw out a new
concept of data modeling that is the SIC method. It is based on
this single assumption and nothing else.
• SIC approach let us a new view on the old chemometrics
problems, like outliers, influential samples, etc. I think that
this is interesting and helpful view.
10.02.05
WSC-4
46
OLS versus SIC
SIC-Residuals vs. OLS-Residuals
SIC-Leverages vs. OLS-Leverages
O LS -Leverage
O LS -R es idual
T2
C1
T2
S IC -R es idual
T1
C4
0.0
0.0
C2
T3
C3
0.4
C3
- 1.0
C4
0.6
1.0
1.0
2.0
0.2
C2
T1
C1
T3
S IC -Leverage
0.0
0.0
- 1.0
0.5
SIC Object Status Plot
S IC -res idual
1.0
OLS/PLS Influence Plot
T2
O LS -varianc e
T2
2
1.0
C1
T1
0.0
0.5
C4
C3
S IC -Leverage
1.0
1
C2
- 1.0
C2
T3
C1
T3
T1
C4
C3
O LS -Leverage
0
0.0
10.02.05
WSC-4
0.5
1.0
47
Statistical view on OLS & SIC
Let’s have a sampling {x1,...xn} from a distribution with finite support [-1,+1].
The mean value a is unknown!
OLS
SIC
Statistics
-1
+1
Deviation
a =?
2 .5 s tr u n c a te d n o r m a l d is tr ib u tio n , n = 1 0 0
0 .3
OLS
0 .2
S IC
0 .1
0 .0
-0 .1
-0 .2
-0 .3
1
10.02.05
20
40
60
WSC-4
80
100
48