Transcript Document
NAMP Module 17: “Introduction to Multivariate Analysis” (Tier 1, Part 1, Rev.: 0)

The Least Squares Principle
• Regression tries to produce a “best fit” equation, but what is “best”?
• Criterion: minimize the sum of squared deviations of the data points from the regression line.
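As an aside (not part of the original slides), here is a minimal Python sketch of the least-squares criterion for a straight line; the data values are hypothetical:

import numpy as np

# Hypothetical data points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit minimizes the sum of squared deviations of the points
# from the fitted line y = b*x + a
b, a = np.polyfit(x, y, deg=1)

residuals = y - (b * x + a)
print("slope =", b, "intercept =", a)
print("sum of squared deviations =", np.sum(residuals ** 2))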
How Good is the Regression (Part 1)?
How well does the regression equation represent our original
data?
The proportion (percentage) of the variance in y that is explained by the
regression equation is denoted by the symbol R².

R² = 1 - (Sum of squares about regression line) / (Sum of squares about mean of y)
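For illustration (a sketch, not from the original module), R² can be computed directly from the two sums of squares; here y_hat denotes the values predicted by the regression:

import numpy as np

def r_squared(y, y_hat):
    # Sum of squares about the regression line (residual variation)
    ss_about_line = np.sum((y - y_hat) ** 2)
    # Sum of squares about the mean of y (total variation)
    ss_about_mean = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_about_line / ss_about_mean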
Explained Variability - illustration
• High R² - good explanation
• Low R² - poor explanation
How Good is the Regression (Part 2)?
How well would this regression equation predict NEW data
points?
• Remember you used a sample from the population of potential data
points to determine your regression equation.
– e.g. one value every 15 minutes, 1-2 weeks of operating data
• A different sample would give you a different equation with different
coefficients bi
• As illustrated on the next slide, the sample can greatly affect the
regression equation…
Sampling Variability of Regression Coefficients - illustration

Sample 1: y = a’x + b’ + e
Sample 2: y = a’’x + b’’ + e
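A small simulation makes this concrete (illustrative only; the “true” model y = 2x + 1 and the noise level are assumptions):

import numpy as np

rng = np.random.default_rng(0)

def draw_sample(n=20):
    # Hypothetical population: y = 2x + 1 plus random error e
    x = rng.uniform(0, 10, n)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 2.0, n)
    return x, y

# Two different samples give two different fitted coefficients
for label in ("Sample 1", "Sample 2"):
    x, y = draw_sample()
    a, b = np.polyfit(x, y, deg=1)  # a = slope, b = intercept
    print(f"{label}: y = {a:.2f}x + {b:.2f} + e")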
Confidence Limits
• Confidence limits (x%) are upper and lower bounds which have an
x% probability of enclosing the true population value of a given
variable
• Often shown as bars above and below a predicted data point.
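As a sketch (not in the original slides, and for a sample mean rather than a regression prediction), a classical t-based 95% confidence interval can be computed with scipy; the measurement values are hypothetical:

import numpy as np
from scipy import stats

data = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 10.4, 9.7])  # hypothetical sample

mean = data.mean()
sem = stats.sem(data)  # standard error of the mean
lower, upper = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(f"95% confidence limits: [{lower:.2f}, {upper:.2f}]")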
Normalisation of Data
• Data used for regression are usually normalised to have mean zero
and variance one.
• Otherwise the calculations would be dominated (biased) by variables
having:
– numerically large values
– large variance
• This means that the MVA software never sees the original data, just
the normalised version
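A minimal sketch of this normalisation, assuming the data sit in a NumPy array with one column per variable:

import numpy as np

def normalise(X):
    # Scale each column (variable) to mean zero and variance one
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical data: one variable with large values, one with small values
X = np.array([[1000.0, 0.1],
              [1100.0, 0.3],
              [ 900.0, 0.2],
              [1050.0, 0.4]])
Xn = normalise(X)
print(Xn.mean(axis=0))         # ~ [0, 0]
print(Xn.std(axis=0, ddof=1))  # [1, 1]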
Normalisation of Data - illustration
Each variable is represented by a variance bar and its mean (centre).
[Figure: four panels - Raw data; Mean-centred only; Variance-scaled only; Normalised.]
Requirements for Regression
• Data Requirements
– Normalised data
– Errors normally distributed with mean zero
– Independent variables uncorrelated
• Implications if Requirements Not Met
– Larger confidence limits around
regression coefficients (bi)
– Poorer prediction on new data
Multivariate Analysis
Now we are ready to start talking about multivariate analysis (MVA)
itself. There are two main types of MVA:
1. Principal Component Analysis (PCA)
   • X’s only
2. Projection to Latent Structures (PLS)
   • a.k.a. “Partial Least Squares”
   • X’s and Y’s

(The X’s and Y’s can be from the same dataset, i.e., you can do PCA on the whole thing, X’s and Y’s together.)
Let’s start with PCA. Note that the European food example at the
beginning was PCA, because all the food types were treated as
equivalent.
Purpose of PCA
The purpose of PCA is to project a data space with a large number of
correlated dimensions (variables) into a second data space with a
much smaller number of independent (orthogonal) dimensions.
This is justified scientifically because of Ockham’s Razor. Deep down,
Nature IS simple. Often the lower-dimensional space corresponds
more closely to what is actually happening at a physical level.
The challenge is interpreting the MVA
results in a scientifically valid way.
(Reminder: “Ockham’s Razor”)
Advantages of PCA
Among the advantages of PCA:
• Uncorrelated variables lend themselves to traditional statistical
analysis
• Lower-dimensional space easier to work with
• New dimensions often represent more clearly the underlying
structure of the set of variables (our friend Ockham)
(Reminder: “Latent Attributes”)
How PCA works (Concept)
PCA is a step-wise process. This is how it works conceptually:
• Find a component (dimension vector) which explains as much x-variation as possible
• Find a second component which:
  – is orthogonal to (uncorrelated with) the first
  – explains as much as possible of the remaining x-variation
• The process continues until the researcher is satisfied or the increase in explanation is judged minimal
How PCA Works (Math)
This is how PCA works mathematically:
• Consider an (n × k) data matrix X (n observations, k variables)
• PCA models this as (assuming normalised data):

  X = T * P’ + E

  (like linear regression, only using matrices)

• where:
  – T is the scores of each observation on the new components
  – P is the loadings of the original variables on the new components
  – E is the residual matrix, containing the noise
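To make the decomposition concrete, here is a sketch of X = T * P’ + E computed via the singular value decomposition, one standard way of carrying out PCA (the data are random placeholders):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))                      # n = 50 observations, k = 5 variables
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # normalised data

U, S, Vt = np.linalg.svd(X, full_matrices=False)  # X = U S V'

A = 2                   # number of components retained
T = U[:, :A] * S[:A]    # scores of each observation on the new components
P = Vt[:A].T            # loadings of the original variables
E = X - T @ P.T         # residual matrix, containing the noise

print(np.allclose(X, T @ P.T + E))  # True: X = T P' + E by construction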
How PCA Works (Visually)
The way PCA works visually is by projecting the multidimensional data
cloud onto the “hyperplane” defined by the first two components. The
image below shows this in 3-D, for ease of understanding, but in reality
there can be dozens or even hundreds of dimensions:
[Figure: a 3-D data cloud on axes X1, X2, X3 (3 original variables); the data cloud (in red) is projected onto the plane defined by the first 2 components.]
Number of Components
Components are simply the new axes which are created to explain the
most variance with the least dimensions. The PCA methodology ensures
that components are extracted in decreasing order of explained variance.
In other words, the first component always explains the most variance,
the second component explains the next most variance, and so forth:
[Figure: components 1, 2, 3, 4, 5, 6, … shown in decreasing order of explained variance.]
Eventually, the higher-level components represent mainly noise. This is a
good thing, and in fact one of the reasons we use PCA in the first place.
Because noise is relegated to the higher-level components, it is largely
absent from the first few components. This works because all components
are orthogonal to each other, i.e., mutually uncorrelated.
The Eigenvalue Criterion
There are two ways to determine when to stop creating new
components:
–Eigenvalue criterion
–Scree test
The first of these uses the following mathematical definition:
• Eigenvalues λ of a matrix A:
  – Mathematically defined by det(A - λI) = 0
  – Useful as an “importance measure” for components
Usually, components with eigenvalues less than one are
discarded, since they have less explanatory power than the
original variables did in the first place.
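A sketch of the eigenvalue criterion in Python, assuming normalised data so that each original variable has variance (eigenvalue) one:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))                     # placeholder data
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # normalised

# Eigenvalues of the correlation matrix A, i.e. solutions of det(A - lambda*I) = 0
A = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(A))[::-1]  # largest first

# Eigenvalue criterion: keep components with eigenvalue greater than one
n_keep = int(np.sum(eigenvalues > 1.0))
print(eigenvalues, "-> keep", n_keep, "components")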
The Inflection Point Criterion (Scree Test)

The second method is a simple graphical technique:
• Plot eigenvalues vs. number of components
• Extract components up to the point where the plot “levels off”
• The right-hand tail of the curve is “scree” (like the lower part of a rocky slope)

[Figure: scree plot of Eigenvalue (y-axis, 1 to 8) vs. Component # (x-axis, 1 to 6), levelling off after the first few components.]
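A scree plot takes only a few lines to draw (the eigenvalues below are hypothetical, e.g. taken from the previous sketch):

import matplotlib.pyplot as plt

# Hypothetical eigenvalues, largest first
eigenvalues = [7.1, 2.3, 1.1, 0.6, 0.4, 0.3]

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, "o-")
plt.xlabel("Component #")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()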
Interpretation of the PCA
Components
As with any type of MVA, the most difficult part of PCA is interpreting the
components. The software is 100% mathematical, and gives the same
outputs whether the data relates to diesel fuel composition or last night’s
horse racing results. It is up to the engineer to make sense of the
outputs. Generally, you have to:
• Look at strength and direction of loadings
• Look for clusters of variables which may be physically related or
have a common origin
– e.g., In papermaking, strength properties such as tear, burst,
breaking length in the paper are all related to the length and
bonding propensity of the initial fibres.
PCA vs. PLS
What is the difference between PCA and PLS?
PLS is the multivariate version of regression. It uses two different PCA
models, one for the X’s and one for the Y’s, and finds the links between
the two.
Mathematically, the difference is as follows:
In PCA, we are maximising the variance that is explained by the model (X variables only).

In PLS, we are maximising the covariance (between the X and Y variables).
How PLS works (Concept)
PLS is also a step-wise process. This is how it works conceptually:
• PLS finds a set of orthogonal components that:
– maximize the level of explanation of both X and Y
– provide a predictive equation for Y in terms of the X’s
• This is done by:
– fitting a set of components to X (as in PCA)
– similarly fitting a set of components to Y
– reconciling the two sets of components so as to maximize
explanation of X and Y
How PLS works (Math)
This is how PLS works mathematically:
• X = TP’ + E     (outer relation for X, like PCA)
• Y = UQ’ + F     (outer relation for Y, like PCA)
• u_h = b_h t_h   (inner relation for the components, h = 1, …, number of components)
• Weighting factors w are used to make sure the dimensions are orthogonal
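A sketch using scikit-learn’s PLSRegression, one common implementation (the process and quality data below are random placeholders):

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))  # placeholder process variables
Y = X[:, :2] @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(100, 2))

pls = PLSRegression(n_components=3)  # scales (normalises) X and Y by default
pls.fit(X, Y)

T = pls.x_scores_    # scores on the X components (the t_h)
U = pls.y_scores_    # scores on the Y components (the u_h)
P = pls.x_loadings_  # X loadings
Q = pls.y_loadings_  # Y loadings

Y_pred = pls.predict(X)  # the predictive equation for Y in terms of the X's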
PLS – the “Inner Relation”
The way PLS works visually is by “tweaking” the two PCA models (X and
Y) until their covariance is optimised. It is this “tweaking” that led to the
name partial least squares.
All three equations are solved simultaneously via numerical methods.
Interpretation of the PLS
Components
Interpretation of the PLS results has all the difficulties of PCA, plus one
more: making sense of the individual components in both X and Y space.
In other words, for the results to make sense, the first component in X
must be related somehow to the first component in Y.
Note that throughout this course, the words “cause” and “effect” are
absent. MVA determines correlations ONLY. The only exception is
when a proper design-of-experiment has been used.
Here is an example of a false correlation: the seed in your birdfeeder
remains untouched all winter, then suddenly disappears in the spring. You
conclude that the warm weather made the seeds disintegrate…
Types of MVA Outputs
MVA software generates two types of outputs: results, and diagnostics.
We have already seen the Score plot and Loadings plot in the food
example. Some others are shown on the next
few slides.
• Results
  – Score Plots (already seen)
  – Loadings Plots (already seen)
• Diagnostics
  – Plot of Residuals
  – Observed vs. Predicted
  – …(many more)
Residuals

• Also called “Distance to Model” (DModX)
  – Contains all the noise
  – Definition: DModX = ( Σ e_ik² / D.F. )^(1/2)   (next slide)
• Used to identify moderate outliers
  – Extreme outliers visible on the Score Plot

[Figure: DModX[1](Norm) plotted for each daily observation over 32 months of PLS data (“32-months of 1 day.M2 (PLS)”), with the critical limit D-Crit(0.05) = 1.157 marked; most of the original observations fall below the limit.]
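A sketch of the DModX calculation from a model’s residual matrix, using k - A as the degrees of freedom (an assumption; commercial MVA packages apply further corrections):

import numpy as np

def dmodx(E, A):
    # E: (n x k) residual matrix from a PCA/PLS model with A components
    # DModX_i = sqrt( sum over k of e_ik^2 / D.F. )
    n, k = E.shape
    dof = k - A  # assumed degrees of freedom per observation
    return np.sqrt(np.sum(E ** 2, axis=1) / dof)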
“Distance to Model”

In the DModX formula, e_ik denotes the residual for observation i and variable k (an element of the residual matrix E).
Observed vs. Predicted
This graph plots the Y values predicted by the model, against the original
Y values. A perfect model would only have points along the diagonal line.
[Figure: Observed vs. Predicted plot for Y variable 53AI034.AI from the 32-month PLS model (“32-months of 1 day.M3 (PLS)”): observed values YVar(53AI034.AI) plotted against predicted values YPred[14](53AI034.AI), with the diagonal marked “IDEAL MODEL”. RMSEE = 24.6664.]
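A sketch of an observed-vs-predicted plot with a simple RMSEE calculation (the y values below are hypothetical, and the exact RMSEE degrees of freedom vary by software):

import numpy as np
import matplotlib.pyplot as plt

y_obs = np.array([152.0, 175.0, 188.0, 201.0, 224.0, 237.0])   # hypothetical
y_pred = np.array([160.0, 170.0, 195.0, 198.0, 215.0, 230.0])  # hypothetical

rmsee = np.sqrt(np.sum((y_obs - y_pred) ** 2) / len(y_obs))

plt.scatter(y_pred, y_obs)
lims = [y_obs.min(), y_obs.max()]
plt.plot(lims, lims, "k--", label="ideal model (diagonal)")
plt.xlabel("Predicted Y")
plt.ylabel("Observed Y")
plt.legend()
plt.show()
print("RMSEE =", round(rmsee, 2))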
MVA Challenges
Here is a list of some of the main challenges you will encounter when
doing MVA. You have been warned!
• Difficulty interpreting the plots (“like reading tea leaves”)
• Data pre-processing
• Control loops can disguise real correlations
• Discrete vs. averaged vs. interpolated data
• Determining lags to account for flowsheet residence times
• Time increment issues
  – e.g., second-by-second values, or daily averages?
Some typical sensitivity variables for the application of MVA to real
process data are shown on the next slide…
Typical Sensitivity Variables
• MVA Calculations
  – Time step / averages
  – Which variables are used
  – How many components?
  – Data pre-processing
  – Treatment of noise/outliers
  – PCA vs. PLS
• Physical reality
  – Which are the X’s and Y’s?
  – Sub-sections of flowsheet
  – Time lags, mixing & recirculation
  – Process/equipment changes
  – Seasonal effects
• Unmeasured variables
  – Known but not measured
  – Unknown and unmeasured
End of Tier 1
Congratulations!
Assuming that you have done all the reading, this is the end of Tier 1.
No doubt much of this information seems confusing, but things will
become clearer when we look at real-life examples in Tier 2.
All that is left is to complete the short quiz that follows…
Tier 1 Quiz
Question 1:
Looking at one or two variables at a time is not recommended,
because often variables are correlated. What does this mean
exactly?
a) These variables tend to increase and decrease in unison.
b) These variables are probably measuring the same thing,
however indirectly.
c) These variables reveal a common, deeper variable that is
probably unmeasured.
d) These variables are not statistically independent.
e) All of the above.
Tier 1 Quiz
Question 2:
What is the difference between “information” and “knowledge”?
a) Information is in a computer or on a piece of paper, while
knowledge is inside a person’s head.
b) Only scientists have “true” knowledge.
c) Information is mathematical, while knowledge is not.
d) Information includes relationships between variables, but
without regard for the underlying scientific causes.
e) Knowledge can only be acquired through experience.
Tier 1 Quiz
Question 3:
Why does MVA never reveal cause-and-effect, unless a designed
experiment is used?
a) Cause-and-effect can only be determined in a laboratory.
b) Designed experiments eliminate error.
c) MVA without a designed experiment is only inductive,
whereas a cause-and-effect relationship requires deduction.
d) Only effects are measurable.
e) Scientists design experiments to work perfectly the first time.
Tier 1 Quiz
Question 4:
What is the biggest disadvantage to using a “black-box” model
instead of one based on first principles?
a) There are no unit operations.
b) The model is only as good as the data used to create it.
c) Chemical reactions and thermodynamic data are not used.
d) A black-box model can never take into account the entire
flowsheet.
e) MVA models are linear only.
Tier 1 Quiz
Question 5:
What does a confidence interval tell you?
a) How widely your data are spread out around a regression line.
b) The range within which a certain percentage of sample
values can be expected to lie.
c) The area within which your regression line should fall.
d) The level of believability of the results of a specific analysis.
e) The number of times you should repeat your analysis to be sure
of your results.
Tier 1 Quiz
Question 6:
When your data were being recorded, one of the mill sensors was
malfunctioning and giving you wildly inaccurate readings. What
are the implications likely to be for statistical analysis?
a) More square and cross-product terms in the model you fit to the
data.
b) Higher mean values than would normally be expected.
c) Higher variance values for the variables associated with the
malfunctioning sensor.
d) Different selection of variables to include in the analysis.
e) Bigger residual term in your model.
Tier 1 Quiz
Question 7:
Why does reducing the number of dimensions (more variables to
fewer components) make sense from a scientific point of view?
a) The new components might correspond to underlying
physical phenomena that can’t be measured directly.
b) Fewer dimensions are easier to view on a graph or computer
output.
c) Ockham’s Razor limits scientists to less than five dimensions.
d) The real world is limited to just three dimensions.
e) All of the above.
Tier 1 Quiz
Question 8:
If two points on a score plot are almost touching, does that mean that
these two observations are nearly identical?
a) Yes, because they lie in the same position within the same
quadrant.
b) No, because of experimental error.
c) Yes, because they have virtually the same effect on the MVA
model.
d) No, because the score plot is only a projection.
e) Answers (a) and (c).
Tier 1 Quiz
Question 9:
Looking at the food example, what countries appear to be correlated
with high consumption of olive oil?
a) Italy and Spain, and to a lesser degree Portugal and Austria.
b) Italy and Spain only.
c) Just Italy.
d) Ireland and Italy.
e) All the countries except Sweden, Denmark and England.
Tier 1 Quiz
Question 10:
Why does error get relegated to higher-order components when
doing PCA?
a) Because Ockham’s Razor says it will.
b) Because the real world has only three dimensions.
c) Because noise is false information.
d) Because MVA is able to correct for poor data.
e) Because noise is uncorrelated to the other variables.