Simple Method for Outlier Detection in Fitting

On Sequential Experimental Design
for Empirical Model-Building
under Interval Error
Sergei Zhilin,
[email protected]
Altai State University,
Barnaul, Russia
Outline
• Regression under interval error
• Experimental design: refining context
• Classical and “interval” design optimality criteria
• Sequential experimental design for regression
models under interval error
• Comparative simulation study of classical and
“interval” sequential design procedures
• Conclusions
Regression under Interval Error
• Model structure: $y = x^T \beta + \xi$
– Input variables $x = (x_1, \dots, x_p)^T$, measured without error
– Linear-parameterized modeling function $x^T \beta$
– Model parameters $\beta = (\beta_1, \dots, \beta_p)^T$, to be estimated
– Output variable $y$, measured with error
– Measurement error $\xi$
• “Interval” error means “unknown but bounded”: $\xi \in [-\varepsilon, \varepsilon]$
Regression under Interval Error
• Each row $(x_j, y_j, \varepsilon_j)$ of the measurements table constrains possible values of the parameter $\beta$ with the set
$$S_j = \{\beta \in \mathbb{R}^p : y_j - \varepsilon_j \le x_j^T \beta \le y_j + \varepsilon_j\}, \quad j = 1, \dots, n.$$
• Values of the parameter $\beta$ consistent with all constraints form an uncertainty set
$$A = \bigcap_{j=1}^{n} S_j$$
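The definition of the uncertainty set can be checked directly: a parameter vector belongs to A exactly when it satisfies every two-sided constraint of the measurements table. The sketch below is only an illustration; the function name `in_uncertainty_set` and the data are hypothetical and do not come from the slides.

```python
def in_uncertainty_set(beta, X, Y, E, tol=1e-12):
    """Return True if y_j - eps_j <= x_j^T beta <= y_j + eps_j holds for every row j."""
    for x, y, eps in zip(X, Y, E):
        pred = sum(xi * bi for xi, bi in zip(x, beta))
        if abs(y - pred) > eps + tol:
            return False
    return True


# Hypothetical measurements for the model y = beta1 + beta2 * x,
# i.e. each design row is (1, x); the data are close to beta = (1, 2).
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
Y = [1.1, 2.9, 5.2]
E = [0.4, 0.4, 0.4]

print(in_uncertainty_set((1.0, 2.0), X, Y, E))  # consistent with all rows
print(in_uncertainty_set((0.0, 2.0), X, Y, E))  # violates the first row
```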
Regression under Interval Error
• Fitting data with the model $y = \beta_1 + \beta_2 x$
[Figure: two panels. In the $(x, y)$ domain: the set of feasible models. In the $(\beta_1, \beta_2)$ domain: the uncertainty set $A$.]
• Uncertainty set $A$ is unbounded = not enough data to build the model
Regression under Interval Error
• Problems that may be stated with respect
to uncertainty set A
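• Model parameters estimation
– Interval estimates of $\beta$
$$I_A = [\underline{\beta}_1, \overline{\beta}_1] \times \dots \times [\underline{\beta}_p, \overline{\beta}_p]: \quad \underline{\beta}_i = \min_{\beta \in A} \beta_i, \quad \overline{\beta}_i = \max_{\beta \in A} \beta_i, \quad i = 1, \dots, p.$$
– Point estimates of $\beta$
$$\hat{\beta} = (\hat{\beta}_1, \dots, \hat{\beta}_p): \quad \hat{\beta}_i = \tfrac{1}{2}(\underline{\beta}_i + \overline{\beta}_i), \quad i = 1, \dots, p.$$
[Figure: $I_A$ as the bounding box of $A$ in the $(\beta_1, \beta_2)$ plane, with $\hat{\beta}$ at its center.]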
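For two parameters and a bounded uncertainty set, the interval and point estimates can be sketched without an LP solver: the minimum and maximum of each coordinate over a polygon are attained at its vertices, and the vertices are feasible intersections of constraint boundary lines. This is a minimal sketch under those assumptions; the helper names and the data are hypothetical, and in higher dimensions a linear programming solver would be the usual tool.

```python
from itertools import combinations


def vertices_2d(X, Y, E, tol=1e-9):
    """Vertices of A = {b : |y_j - x_j^T b| <= eps_j} for p = 2 (A assumed bounded)."""
    # Each measurement contributes two boundary lines x_j^T b = y_j -/+ eps_j.
    lines = [(x, y + s * e) for x, y, e in zip(X, Y, E) for s in (-1.0, 1.0)]
    verts = []
    for (a, c), (b, d) in combinations(lines, 2):
        det = a[0] * b[1] - a[1] * b[0]
        if abs(det) < tol:  # parallel boundary lines do not intersect
            continue
        # Cramer's rule for the intersection point of the two lines.
        v = ((c * b[1] - a[1] * d) / det, (a[0] * d - c * b[0]) / det)
        if all(abs(y - (x[0] * v[0] + x[1] * v[1])) <= e + tol
               for x, y, e in zip(X, Y, E)):
            verts.append(v)
    return verts


def interval_and_point_estimates(verts):
    """Bounding box I_A of the vertex set and its center (the point estimate)."""
    lo = [min(v[i] for v in verts) for i in range(2)]
    hi = [max(v[i] for v in verts) for i in range(2)]
    center = [(l + h) / 2 for l, h in zip(lo, hi)]
    return lo, hi, center


# Hypothetical data: two orthogonal design points, error bound 0.4.
X = [(1.0, 0.0), (0.0, 1.0)]
Y = [1.0, 2.0]
E = [0.4, 0.4]

lo, hi, center = interval_and_point_estimates(vertices_2d(X, Y, E))
print(lo, hi, center)
```

With these data A is the box [0.6, 1.4] × [1.6, 2.4], so the interval estimate coincides with A itself and the point estimate is its center.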
Regression under Interval Error
• Problems that may be stated with respect
to uncertainty set A
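• Prediction of the output variable value for fixed values of input variables
– Interval estimate of $y$
$$\mathbf{y}(x) = [\underline{y}(x), \overline{y}(x)]: \quad \underline{y}(x) = \min_{\beta \in A} \beta^T x, \quad \overline{y}(x) = \max_{\beta \in A} \beta^T x$$
– Point estimate of $y$
$$\hat{y}(x) = \tfrac{1}{2}\left(\underline{y}(x) + \overline{y}(x)\right)$$
[Figure: in the $(x, y)$ plane, the bounds $\underline{y}(x)$ and $\overline{y}(x)$ with the point estimate $\hat{y}(x)$ between them.]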
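When A is a bounded polytope, the linear function β^T x attains its minimum and maximum over A at vertices, so the prediction interval can be sketched as a scan over a vertex list. The function name `predict_interval` and the vertex list are hypothetical and serve only as an illustration.

```python
def predict_interval(verts, x):
    """Interval and point prediction of y at x from the vertices of a bounded A."""
    values = [sum(b * xi for b, xi in zip(beta, x)) for beta in verts]
    y_lo, y_hi = min(values), max(values)
    return y_lo, y_hi, 0.5 * (y_lo + y_hi)


# Hypothetical uncertainty set: the box [0.6, 1.4] x [1.6, 2.4] given by its corners.
verts = [(0.6, 1.6), (0.6, 2.4), (1.4, 1.6), (1.4, 2.4)]

y_lo, y_hi, y_hat = predict_interval(verts, (1.0, 0.5))
print(y_lo, y_hi, y_hat)
```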
Experimental Design:
Refining Context
• Product or process optimization
• Model quality optimization
– Simultaneous experimental design: Begin → Design for N observations → Experiment → Analysis → End
– Sequential experimental design: Begin → Design for ~1 observation → Experiment → Analysis (Is the model quality satisfactory?) → End; if not, return to the design step
Experimental Design for
Regression under Interval Error
• Notations
$y = x^T \beta + \xi$ – model
$\mathcal{X} \subseteq \mathbb{R}^p$ – design space, $x \in \mathcal{X}$
$X = (x_1, \dots, x_n)^T$ – design matrix
$Y = (y_1, \dots, y_n)^T$ – measurements
$E = (\varepsilon_1, \dots, \varepsilon_n)^T$ – error bounds
$M = X^T X$ – information matrix
$D = M^{-1}$ – $\beta$ covariance matrix
$d(x) = x^T D x$ – standardized variance function of $\hat{y}(x, \beta)$
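The three classical quantities follow mechanically from the design matrix. The sketch below computes them for p = 2 in plain Python; the function names and the example design are hypothetical.

```python
def information_matrix(X):
    """M = X^T X for a design given as a list of rows."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]


def inverse_2x2(M):
    """D = M^{-1} for a nonsingular 2 x 2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def variance_function(D, x):
    """d(x) = x^T D x."""
    return sum(x[i] * D[i][j] * x[j] for i in range(2) for j in range(2))


# Hypothetical design: the first direction is measured twice, the second once.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
M = information_matrix(X)
D = inverse_2x2(M)

print(M)
print(variance_function(D, (1.0, 0.0)), variance_function(D, (0.0, 1.0)))
```

The direction that was measured twice has the smaller value of d(x), which is why a variance-based criterion would place the next point along the less explored direction.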
Experimental Design for
Regression under Interval Error
• Design optimality criteria
– Classical, with $D = (X^T X)^{-1}$ and $d(x) = x^T D x$
D-optimality minimizes $\det D$ (volume of joint confidence interval)
G-optimality minimizes $\max_{x \in \mathcal{X}} d(x)$ (maximal variance of prediction)
Both depend only on X, hence are applicable for interval error as well
– Interval (by M.P. Dyvak)
ID-optimality minimizes the squared volume of A
IE-optimality minimizes the squared maximal diagonal of A
IG-optimality minimizes the maximal prediction error
IE- and IG-optimality are equivalent for spherical design space and n > p
Experimental Design for
Regression under Interval Error
• Motivation
– Classical methods of experimental design use only the information that X brings, not Y or E
– Interval methods of experimental design developed by Dyvak work for saturated designs (p = n) and use X and E, but not Y
– Does using the information contained in Y make it possible to improve the quality of the constructed model or to increase the “speed” of the sequential experimental design procedure?
Experimental Design for
Regression under Interval Error
• How to use the information which Y brings?
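$x_{next} = \mathrm{IEDesign}(\mathcal{X}, X, Y, E)$
1. Find out the direction $a$ of maximal spread of the uncertainty set $A(X, Y, E)$:
$$\{\beta^{1*}, \beta^{2*}\} = \arg\max_{\beta^1, \beta^2 \in A} \|\beta^1 - \beta^2\|, \quad a = \beta^{1*} - \beta^{2*}$$
2. Next experimental point $x_{next}$ is selected in such a way that it
• induces the constraint orthogonal to $a$
• has maximal norm (width of constraint $w = 2\varepsilon / \|x_{next}\|$)
$$x_{next} = k^* a, \quad k^* = \max_{k \in \mathbb{R},\ ka \in \mathcal{X}} |k|$$
[Figure: uncertainty set $A(X, Y, E)$ in the $(\beta_1, \beta_2)$ plane with the direction $a$ and a new constraint of width $w$.]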
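For a bounded polygonal uncertainty set, the pair of most distant points can be taken among the vertices, so both steps can be sketched as a brute-force search. The sketch assumes the design space is the unit ball, in which case the largest admissible |k| is 1/‖a‖; the function name `ie_design_unit_ball` and the vertex list are hypothetical.

```python
from itertools import combinations
from math import hypot


def ie_design_unit_ball(verts):
    """Next design point for a 2-D uncertainty set given by its vertices.

    Assumes the design space is the unit ball {x : x^T x <= 1}.
    """
    # Step 1: the most distant pair of vertices gives the direction a.
    b1, b2 = max(combinations(verts, 2),
                 key=lambda p: hypot(p[0][0] - p[1][0], p[0][1] - p[1][1]))
    a = (b1[0] - b2[0], b1[1] - b2[1])
    # Step 2: x_next = k* a with the largest |k| that keeps k a inside the ball.
    k = 1.0 / hypot(a[0], a[1])
    return (k * a[0], k * a[1])


# Hypothetical uncertainty set with a unique longest diagonal (3, 0) - (0, 2).
verts = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0), (0.0, 2.0)]
x_next = ie_design_unit_ball(verts)
eps = 0.4
width = 2 * eps / hypot(x_next[0], x_next[1])  # width of the induced constraint

print(x_next, width)
```

The sign of a is arbitrary: x and −x induce the same two-sided constraint. The brute-force pair search is quadratic in the number of vertices and is meant only for small illustrative cases.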
Experimental Design for
Regression under Interval Error
• IE-optimal sequential design
$(X_0, Y_0, E_0)$ – initial dataset
i = 0;
repeat
    x = IEDesign($\mathcal{X}$, $X_i$, $Y_i$, $E_i$);
    y = measurement in x with error $\varepsilon$;
    $X_{i+1} = \begin{pmatrix} X_i \\ x^T \end{pmatrix}$; $Y_{i+1} = \begin{pmatrix} Y_i \\ y \end{pmatrix}$; $E_{i+1} = \begin{pmatrix} E_i \\ \varepsilon \end{pmatrix}$;
    i = i + 1;
until i > N or $I_A(X_i, Y_i, E_i)$ is small;
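The loop can be sketched end to end for two parameters. Everything below is an illustrative assumption rather than the author's implementation: the uncertainty set is handled through brute-force vertex enumeration, the design space is the unit ball, "I_A is small" is read as a threshold on the volume of the bounding box, and the initial design and helper names are hypothetical.

```python
from itertools import combinations
from math import hypot


def vertices_2d(X, Y, E, tol=1e-9):
    """Vertices of the bounded polygon {b : |y_j - x_j^T b| <= eps_j}, p = 2."""
    lines = [(x, y + s * e) for x, y, e in zip(X, Y, E) for s in (-1.0, 1.0)]
    verts = []
    for (a, c), (b, d) in combinations(lines, 2):
        det = a[0] * b[1] - a[1] * b[0]
        if abs(det) < tol:
            continue
        v = ((c * b[1] - a[1] * d) / det, (a[0] * d - c * b[0]) / det)
        if all(abs(y - (x[0] * v[0] + x[1] * v[1])) <= e + tol
               for x, y, e in zip(X, Y, E)):
            verts.append(v)
    return verts


def box_volume(verts):
    """Volume of the interval hull I_A of the vertex set."""
    w1 = max(v[0] for v in verts) - min(v[0] for v in verts)
    w2 = max(v[1] for v in verts) - min(v[1] for v in verts)
    return w1 * w2


def ie_design(verts):
    """Unit-norm point along the longest diagonal of A (unit-ball design space)."""
    b1, b2 = max(combinations(verts, 2),
                 key=lambda p: hypot(p[0][0] - p[1][0], p[0][1] - p[1][1]))
    a = (b1[0] - b2[0], b1[1] - b2[1])
    n = hypot(a[0], a[1])
    return (a[0] / n, a[1] / n)


def sequential_ie_design(X, Y, E, measure, eps, n_max, vol_tol=0.0):
    """Add points one at a time until n_max points are added or I_A is small."""
    X, Y, E = list(X), list(Y), list(E)
    history = [box_volume(vertices_2d(X, Y, E))]
    for _ in range(n_max):
        if history[-1] <= vol_tol:
            break
        x = ie_design(vertices_2d(X, Y, E))
        X.append(x)
        Y.append(measure(x))
        E.append(eps)
        history.append(box_volume(vertices_2d(X, Y, E)))
    return X, Y, E, history


# Hypothetical setting: true parameters (1, 2), zero measurement error, bound 0.4.
beta_true = (1.0, 2.0)
eps = 0.4
X0 = [(1.0, 0.0), (0.0, 1.0), (0.6, -0.8)]
measure = lambda x: x[0] * beta_true[0] + x[1] * beta_true[1]
Y0 = [measure(x) for x in X0]

X, Y, E, history = sequential_ie_design(X0, Y0, [eps] * 3, measure, eps, n_max=5)
print(history)
```

Because every new measurement only adds a constraint, the uncertainty set can only shrink, so the recorded bounding-box volumes never increase; how quickly they decrease depends on the data.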
Experimental Design for
Regression under Interval Error
• Simulation study 1. Comparison of IE- and D-optimal sequential designs under zero errors
$\mathcal{X} = \{x \in \mathbb{R}^2 \mid x^T x \le 1\}$, $\beta = (1, 2)^T$, $\varepsilon = 0.4$,
$$X_0 = \begin{pmatrix} 0.26 & 0.61 \\ 0.59 & 0.24 \\ 0.49 & 0.31 \end{pmatrix}$$
IE-optimal sequential design
i = 0
repeat
    $Y_i = X_i \beta$
    $x_{next}$ = IEDesign($\mathcal{X}$, $X_i$, $Y_i$, $\varepsilon$)
    $X_{i+1} = \begin{pmatrix} X_i \\ x_{next}^T \end{pmatrix}$
    i = i + 1
until i > 9
D-optimal sequential design
i = 0
repeat
    $x_{next}$ = DDesign($\mathcal{X}$, $X_i$)
    $X_{i+1} = \begin{pmatrix} X_i \\ x_{next}^T \end{pmatrix}$
    i = i + 1
until i > 9
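The slides do not spell out DDesign. One common reading, assumed here, is a greedy step that maximizes det(M + x x^T); by the matrix determinant lemma this equals det(M)(1 + d(x)), so the step picks the point of the design space with the largest d(x). For the unit disk the maximum lies on the boundary, which the sketch below searches on a grid of angles. The function name `d_design_unit_disk` and the example design are hypothetical.

```python
from math import cos, sin, pi


def d_design_unit_disk(X, n_angles=360):
    """Greedy D-optimal step on the unit disk: maximize d(x) = x^T D x over the circle."""
    m11 = sum(x[0] * x[0] for x in X)
    m12 = sum(x[0] * x[1] for x in X)
    m22 = sum(x[1] * x[1] for x in X)
    det = m11 * m22 - m12 * m12
    if det <= 0.0:
        raise ValueError("design matrix must have full column rank")
    # Entries of D = M^{-1} for the symmetric 2 x 2 information matrix.
    d11, d12, d22 = m22 / det, -m12 / det, m11 / det

    def d(x):
        return d11 * x[0] * x[0] + 2 * d12 * x[0] * x[1] + d22 * x[1] * x[1]

    # x and -x carry the same information, so half of the circle is enough.
    candidates = [(cos(pi * k / n_angles), sin(pi * k / n_angles))
                  for k in range(n_angles)]
    return max(candidates, key=d)


# Hypothetical current design: the first axis is measured twice, the second once.
X = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
x_next = d_design_unit_disk(X)
print(x_next)
```

The step uses only X, which matches the remark that classical criteria ignore Y and E.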
Experimental Design for
Regression under Interval Error
• Simulation study 1. D-optimal sequential design results
[Figure: two panels. “Variables domain”: the selected points, labelled in order of selection; the labels 1,5,9; 3,7; 2,6,10; 4,8 mark four repeatedly selected locations. “Parameters domain”: the resulting uncertainty set $A$.]
Volume(A) = 0.6400 = $4\varepsilon^2$
IA = [0.45, 1.55] × [1.45, 2.55]
Volume(IA) = 1.21
Experimental Design for
Regression under Interval Error
• Simulation study 1. IE-optimal sequential design results
[Figure: two panels. “Variables domain”: the selected points. “Parameters domain”: the resulting uncertainty set $A$.]
Volume(A) = 0.5077 ≈ $\pi\varepsilon^2$
IA = [0.59, 1.41] × [1.60, 2.40]
Volume(IA) = 0.66
Experimental Design for
Regression under Interval Error
• Simulation study 2. Comparison of IE- and D-optimal sequential designs under error which follows truncated normal distribution
$\mathcal{X} = \{x \in \mathbb{R}^d \mid x^T x \le 1\}$, $\beta = (1, 2)^T$, $\varepsilon = 0.4$,
$X_0$ = { 3 uniformly distributed points from $\mathcal{X}$ }
Errors are simulated by $\xi \sim N_T(\varepsilon)$ – truncated normal distribution
[Figure: density of $N_T(\varepsilon)$, truncated at $3\sigma = \varepsilon$.]
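A bounded error of this kind can be simulated by rejection sampling. The sketch assumes a zero-mean normal distribution with standard deviation ε/3 whose draws outside [−ε, ε] are discarded; the zero mean, the value of the standard deviation and the function name `truncated_normal` are assumptions made for illustration.

```python
import random


def truncated_normal(eps, rng):
    """Draw one error from a zero-mean normal with sigma = eps / 3, truncated to [-eps, eps]."""
    sigma = eps / 3.0  # assumed relation between the bound and the spread
    while True:
        xi = rng.gauss(0.0, sigma)
        if abs(xi) <= eps:  # reject the rare draws beyond three sigma
            return xi


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [truncated_normal(0.4, rng) for _ in range(1000)]
print(min(samples), max(samples))
```

Rejection is cheap here because only a small fraction of normal draws falls outside three standard deviations.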
Experimental Design for
Regression under Interval Error
Simulation study 2
k = 0;
for r = 1 to 1500 do
    i = 0; $\Xi_0$ = { 3 random values from $N_T(\varepsilon)$ };
    $X_0$ = { 3 uniformly distributed points from $\mathcal{X}$ }; $Y_0 = X_0 \beta + \Xi_0$;
    $X_0^D = X_0$; $Y_0^D = Y_0$;
    $X_0^I = X_0$; $Y_0^I = Y_0$;
    repeat
        $\xi$ = random value from $N_T(\varepsilon)$;
        $x^I$ = IEDesign($\mathcal{X}$, $X_i^I$, $Y_i^I$, $\varepsilon$); $x^D$ = DDesign($\mathcal{X}$, $X_i^D$);
        $y^I = (x^I)^T \beta + \xi$; $y^D = (x^D)^T \beta + \xi$;
        $X_{i+1}^I = \begin{pmatrix} X_i^I \\ (x^I)^T \end{pmatrix}$; $Y_{i+1}^I = \begin{pmatrix} Y_i^I \\ y^I \end{pmatrix}$;
        $X_{i+1}^D = \begin{pmatrix} X_i^D \\ (x^D)^T \end{pmatrix}$; $Y_{i+1}^D = \begin{pmatrix} Y_i^D \\ y^D \end{pmatrix}$;
        i = i + 1;
    until i > N
    if Volume($I_A(X_N^I, Y_N^I, \varepsilon)$) < Volume($I_A(X_N^D, Y_N^D, \varepsilon)$) then k = k + 1;
end for
Experimental Design for
Regression under Interval Error
• Simulation study 2. Results for $\mathcal{X} = \{x \in \mathbb{R}^2 \mid x^T x \le 1\}$
[Figure: number of winnings, k for IE-Design and (1500 – k) for D-Design, out of 1500 runs (left axis 0 to 1500, right axis 0% to 100%) versus the number of selected points N (0 to 25).]
Experimental Design for
Regression under Interval Error
• Simulation study 2. Results for $\mathcal{X} = \{x \in \mathbb{R}^3 \mid x^T x \le 1\}$
[Figure: number of winnings, k for IE-Design and (1500 – k) for D-Design, out of 1500 runs (left axis 0 to 1500, right axis 0% to 100%) versus the number of selected points N (0 to 25).]
Experimental Design for
Regression under Interval Error
• The “cost” of IE-optimal design
– The problem of finding the maximal spread direction of A
$$\{\beta^{1*}, \beta^{2*}\} = \arg\max_{\beta^1, \beta^2 \in A} \|\beta^1 - \beta^2\|$$
is a concave quadratic programming problem (CQPP)
– CQPP is proved to be NP-hard, so the solving time of known exact methods grows exponentially with the problem dimension (the number of input variables p)
– To overcome this difficulty we need either special computational means (such as parallel computers) or we can limit ourselves to near-optimal solutions
Conclusions
• The interval model of error makes it possible to use the information about measured values of the output variable for effective sequential experimental design
• The results of the performed simulation study motivate a careful analytical investigation of the properties of IE-optimal sequential design procedures
• IE-optimal sequential design for high-dimensional models demands special computational techniques