Multivariate Statistics
Principal Component Analysis
W. M. van der Veld
University of Amsterdam
Overview
• Eigenvectors and eigenvalues
• Principal Component Analysis (PCA)
• Visualization
• Example
• Practical issues
Eigenvectors and eigenvalues
• Let A be a square matrix of order n x n.
• It can be shown that vectors exist such that Ak = λk, with λ some scalar, where k is the eigenvector and λ the eigenvalue.
• The eigenvectors k and eigenvalues λ have many applications, but in this course we will only use them for principal component analysis.
• But they also play a role in cluster analysis, canonical
correlations, and other methods.
Eigenvectors and eigenvalues
• So, for the system of equations Ak=λk, only A is known.
• We have to solve for k and λ to find the eigenvector and
eigenvalue.
• It is not possible to solve this set of equations straightforwardly with the method described last week, since m < n.
• The trivial solution k = 0 is excluded.
• A solution can however be found under certain conditions.
• First an example, to get a feeling for the equation.
Eigenvectors and eigenvalues
3 2
• An example: Ak=λk. Let A be 

1 4
1
• One solution for k is: k1   
1
 3 2 1  5
Ak  
      k    5
 1 4 1  5
2
• One solution for k is: k1   
  1
 3 2  2   4 
Ak  
      k    2
 1 4   1   2 
Eigenvectors and eigenvalues
• How did we find the eigenvectors?
• Before that, we first have to find the eigenvalues!
• From Ak = λk it follows that Ak − λk = 0, which is: (A − λI)k = 0.
• Since k = 0 is excluded, there seems to be no solution.
• However, for a homogeneous system a non-trivial solution can be found when the rank of its matrix is smaller than n; here that matrix is A − λI, and this is only the case when |A − λI| = 0, which can be rewritten as the characteristic equation:
$$|A - \lambda I| = a_n\lambda^n + a_{n-1}\lambda^{n-1} + \dots + a_1\lambda + a_0 = 0$$
• This (|A − λI| = 0) is what I meant by certain conditions!
• We can now easily solve for λ.
Eigenvectors and eigenvalues
$$|A - \lambda I| = 0$$
• This determinant gives an equation in λ.
$$A - \lambda I = \begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix} - \lambda\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 3-\lambda & 2 \\ 1 & 4-\lambda \end{pmatrix}$$
$$(3-\lambda)(4-\lambda) - 2 \cdot 1 = 0$$
$$12 - 3\lambda - 4\lambda + \lambda^2 - 2 = \lambda^2 - 7\lambda + 10 = 0$$
$$(\lambda - 5)(\lambda - 2) = 0 \;\Rightarrow\; \lambda_1 = 5,\ \lambda_2 = 2$$
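A minimal sketch that solves the same characteristic polynomial λ² − 7λ + 10 = 0 numerically and checks it against the eigenvalues of A:

```python
import numpy as np

# Coefficients of lambda^2 - 7*lambda + 10, highest power first
print(np.roots([1.0, -7.0, 10.0]))   # roots 5 and 2

# Or obtain the eigenvalues of A directly
A = np.array([[3.0, 2.0], [1.0, 4.0]])
print(np.linalg.eigvals(A))          # also 5 and 2 (order may differ)
```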
Eigenvectors and eigenvalues
• It is now a matter of substitution of λ; start with λ1 = 5.
$$Ak = \lambda k \;\Leftrightarrow\; (A - \lambda I)k = 0$$
$$(A - \lambda I)k = \left(\begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix} - \lambda\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right)\begin{pmatrix} k_1 \\ k_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
$$(A - \lambda_1 I)k = \begin{pmatrix} 3-5 & 2 \\ 1 & 4-5 \end{pmatrix}\begin{pmatrix} k_1 \\ k_2 \end{pmatrix} = \begin{pmatrix} -2k_1 + 2k_2 \\ k_1 - k_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} k_1 \\ k_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$
• Note that any multiple of k would also satisfy the equation!
Eigenvectors and eigenvalues
• The same for λ2 = 2.
$$Ak = \lambda k \;\Leftrightarrow\; (A - \lambda I)k = 0$$
$$(A - \lambda I)k = \left(\begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix} - \lambda\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right)\begin{pmatrix} k_1 \\ k_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
$$(A - \lambda_2 I)k = \begin{pmatrix} 3-2 & 2 \\ 1 & 4-2 \end{pmatrix}\begin{pmatrix} k_1 \\ k_2 \end{pmatrix} = \begin{pmatrix} k_1 + 2k_2 \\ k_1 + 2k_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} k_1 \\ k_2 \end{pmatrix} = \begin{pmatrix} 2 \\ -1 \end{pmatrix}$$
Eigenvectors and eigenvalues
• This was the 2 x 2 case, but in general the matrix A is of order
n x n.
• In that case we will find
– n different eigenvalues, and
– n different eigenvectors.
• The eigenvectors can be collected in a matrix K, with k1, k2, …, kn as its columns.
• The eigenvalues can be collected in a diagonal matrix Λ, with the eigenvalues λ1, λ2, …, λn on the diagonal.
• Hence the generalized form of Ak = λk is: AK = KΛ.
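A minimal sketch of this generalized form, assuming (as above) that the eigenvectors are collected as the columns of K; np.linalg.eig returns exactly that layout:

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [1.0, 4.0]])

eigenvalues, K = np.linalg.eig(A)   # columns of K are the eigenvectors k_i
Lambda = np.diag(eigenvalues)       # Lambda: eigenvalues on the diagonal

# The generalized form AK = K*Lambda holds up to floating-point rounding
print(np.allclose(A @ K, K @ Lambda))   # True
```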
Principal Component Analysis
Harold Hotelling (1895-1973)
• PCA was introduced by Harold Hotelling (1933).
• Harold Hotelling was appointed as a professor of
economics at Columbia.
• But he was a statistician first, economist second.
• His work in mathematical statistics included his
famous 1931 paper on the Student's t distribution for
hypothesis testing, in which he laid out what has
since been called "confidence intervals".
• In 1933 he wrote "Analysis of a Complex of Statistical Variables with Principal Components" in the Journal of Educational Psychology.
Principal Component Analysis
• Principal components analysis (PCA) is a technique that can be used to simplify a dataset.
• More formally it is a linear transformation that chooses a new
coordinate system for the data set such that the greatest variance
by any projection of the data set comes to lie on the first axis
(then called the first principal component), the second greatest
variance on the second axis, and so on.
• PCA can be used for reducing dimensionality in a dataset while
retaining those characteristics of the dataset that contribute most
to its variance by eliminating the later principal components (by
a more or less heuristic decision). These characteristics may be
the "most important", but this is not necessarily the case,
depending on the application.
Principal Component Analysis
• The data reduction is accomplished via a linear transformation of the observed variables, such that:
yi = ai1x1 + ai2x2 + … + aipxp; where i = 1..p
• The y’s are the principal components, which are uncorrelated with each other.
Digression
• The equation yi = ai1x1 + ai2x2 + … + aipxp: what does it say?
• Let’s assume that for a certain respondent his answers to items x1, x2, …, xp are known, and that we also know the coefficients aij.
• What does that imply for yi? The equation then fully determines yi as a weighted sum of the x’s (a prediction of y).
• What does the path model for this equation look like?
• What is so special about this equation?
• In multiple regression we have an observed y and observed x’s; this allows the estimation of the constants ai.
• PCA is a different thing, although the equation is the same!
Principal Component Analysis
• The equation: yi = ai1x1 + ai2x2 + … + aipxp
• In the PCA case:
– The y variables are not observed (unknown), and
– The constants a are unknown.
• So there are too many unknowns to solve the system.
• We can do PCA, so we must be able to solve it. But how?
• The idea is straightforward.
– Choose a’s such that each principal component has maximum variance.
– Express the variance of y in terms of the observed (x) and unknown (a).
– Make a constraint to limit the number of solutions.
– Then we must set the derivative to zero, to find a maximum of the function.
Principal Component Analysis
• The basic equation: yi = ai1x1 + ai2x2 + … + aipxp;
• Let x be a column vector with p random x variables
– The x variables are, without loss of generality, expressed as deviations
from the mean.
• We usually worked with the data matrix, now suddenly a
vector with p random variables?
Digression
• The vector x containing random variables is directly related to
the data matrix X.
$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}
\qquad
X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$
• Random variable x1 has its observed values in column 1 of X, …, random variable xp has its observed values in column p of X.
Principal Component Analysis
• The basic equation: yi = ai1x1 + ai2x2 + … + aipxp;
• Let x be a column vector with p random x variables
– The x variables are, without loss of generality, expressed as deviations
from the mean.
• Let a be a p component column vector,
• Then y = a’x.
• Because this function is unbounded, we can always find a vector a for which the variance of the principal component is larger while the equation is still satisfied; hence:
• Make a constraint on the unknowns so that a’a = 1.
– This (= 1) is an arbitrary choice,
– but it will turn out that this makes the algebra simpler.
• Now the number of solutions for y is constrained (bounded).
Principal Component Analysis
• The variance of y is var(y) = var(a’x) = E((a’x)(a’x)’).
• Which is: E((a’x)(x’a)) = a’E(xx’)a = a’Σa;
– because E(xx’) = X’X/n = the variance-covariance matrix (the x’s are deviations from the mean).
• Thus f: a’Σa.
• We have to find a maximum of this function.
Principal Component Analysis
• So, we have to set the derivative of the function equal to zero and solve.
• Don’t forget the constraint (a’a = 1); it should be accounted for when finding the maximum.
• This can be achieved using the Lagrange multiplier, a mathematical shortcut.
• h: f – λg, where g: a’a – 1
• ∂h/∂a’ = 2Σa – 2λa
• This is the derivative which we need to find the maximum of
the variance function.
• 2Σa – 2λa = 0 => divide both sides by 2
• Σa – λa = 0 => get the factor a out.
• (Σ – λI)a = 0 => this should look familiar!
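To see where this derivation leads, here is a minimal sketch that maximizes a’Σa under the constraint a’a = 1 with a numerical optimizer and compares the result with the eigendecomposition; the matrix Σ below is a made-up example:

```python
import numpy as np
from scipy.optimize import minimize

# A small hypothetical covariance matrix Sigma
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

# Maximize f = a' Sigma a subject to g = a'a - 1 = 0 (minimize -f)
res = minimize(lambda a: -(a @ Sigma @ a),
               x0=np.array([1.0, 0.0]),
               method="SLSQP",
               constraints=[{"type": "eq", "fun": lambda a: a @ a - 1.0}])

# The Lagrange result: the maximizer is the eigenvector of the largest
# eigenvalue, and the maximum value a' Sigma a is that eigenvalue itself.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)      # ascending order
print(res.x, -res.fun)                       # approx. [0.707, 0.707] and 3.0
print(eigenvectors[:, -1], eigenvalues[-1])  # same direction (up to sign), lambda = 3.0
```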
Principal Component Analysis
• (Σ – λI)a = 0 can be solved via |Σ – λI| = 0, and a ≠ 0.
– Here λ is the eigenvalue of the eigenvector a.
• Rewrite (Σ – λI)a = 0 so that: Σa = λIa ⇔ Σa = λa
• If we premultiply both sides with a’, then
• a’Σa = a’λa ⇔ a’Σa = λa’a = λ;
– Because a’a = 1.
• It follows that var(y) = λ;
– because var(y) = a’Σa,
• So the eigenvalues are the variances of the principal components.
• And the largest eigenvalue is the variance of the first principal
component, etc.
• The elements of the eigenvector a, which are found by
substitution of the largest λ, are called loadings of y.
Principal Component Analysis
• The next principal component is found after taking away the variance of the first principal component, which is:
– var(y1) = λ1, where y1 = a1’x
• In order to find y2 we assume that it is uncorrelated with y1, in addition to the other constraint that a2’a2 = a’a = 1.
• Therefore:
cor(y2, y1) = 0
E(y2y1’) = 0
E((a2’x)(a1’x)’) = 0
E(a2’xx’a1) = 0
a2’E(xx’)a1 = 0
a2’Σa1 = 0
a2’λ1a1 = λ1a2’a1 = 0, because Σa1 = λ1a1, since (Σ – λ1I)a1 = 0
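A minimal sketch, on simulated data with a made-up covariance matrix, of the two properties derived so far: the variances of the component scores equal the eigenvalues, and the scores are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[1.0, 0.6, 0.3],
                                 [0.6, 1.0, 0.4],
                                 [0.3, 0.4, 1.0]],
                            size=2000)
X = X - X.mean(axis=0)                  # deviations from the mean
Sigma = X.T @ X / X.shape[0]            # sample covariance matrix

eigenvalues, A = np.linalg.eigh(Sigma)  # columns of A are the weight vectors a_i
Y = X @ A                               # component scores y_i = a_i'x for every row

# The variances of the scores equal the eigenvalues ...
print(Y.var(axis=0))
print(eigenvalues)
# ... and the scores are uncorrelated: off-diagonal correlations are (numerically) zero
print(np.round(np.corrcoef(Y, rowvar=False), 3))
```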
Principal Component Analysis
• Since y2 = a2’x,
• the variance of y2 is f2: a2’Σa2.
• So, we have to set the derivative of this function equal to zero and solve.
• Don’t forget to take the constraints into account when finding the maximum:
– a2’a2 = 1, and
– a2’Σa1 = λ1a2’a1 = 0.
• This can be achieved using the Lagrange multiplier, a mathematical shortcut.
Principal Component Analysis
• The result: ∂h2/∂a2’ = 2Σa2 – 2λ2a2 – 2ν2Σa1 = 0
• ν2 = 0 as a consequence of the constraints.
• Thus:
– 2Σa2 – 2λ2a2 = 0 => (Σ – λ2I) a2 = 0
– Which can be solved via |Σ – λ2I| = 0, and a2 ≠ 0
• We solve this equation, then take the largest remaining eigenvalue (λ2), and solve for the eigenvector (a2) that corresponds to this eigenvalue.
• Et cetera, for the other components.
Principal Component Analysis
• Estimation is now rather straightforward.
• We use the ML estimate of Σ, which is Σ̂.
• Then we simply have to solve for λ̂ in |Σ̂ – λ̂I| = 0, where λ̂ is the ML estimate of λ.
• Then we simply have to solve for â in (Σ̂ – λ̂I)â = 0 by substitution of the solution for λ̂, where â is the ML estimate of a.
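A minimal sketch of this estimation recipe (center the data, take Σ̂ = X'X/n, solve the eigenproblem); the data matrix in the usage example is random and purely hypothetical:

```python
import numpy as np

def pca(X):
    """Minimal PCA sketch: eigendecomposition of the ML covariance estimate."""
    Xc = X - X.mean(axis=0)                 # deviations from the mean
    Sigma_hat = Xc.T @ Xc / Xc.shape[0]     # ML estimate of Sigma
    lambdas, A = np.linalg.eigh(Sigma_hat)  # solves |Sigma_hat - lambda*I| = 0
    order = np.argsort(lambdas)[::-1]       # largest eigenvalue (variance) first
    return lambdas[order], A[:, order]      # eigenvalues and eigenvectors a_i

# Hypothetical usage with a random 100 x 4 data matrix
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 4))
lambdas_hat, A_hat = pca(X)
print(lambdas_hat)   # estimated variances of the four principal components
```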
Visualization
Visualization
• Let’s consider R, which is the standardized Σ.
• And start simple with only two variables, v1 and v2.
$$\Sigma = E(\mathbf{v}\mathbf{v}') = \begin{pmatrix} \sum v_1 v_1 / n & \sum v_2 v_1 / n \\ \sum v_1 v_2 / n & \sum v_2 v_2 / n \end{pmatrix},
\qquad \text{let's assume } \Sigma = \begin{pmatrix} 1.0 & 0.7 \\ 0.7 & 1.0 \end{pmatrix}$$
• How can this situation be visualized?
• In a 2-dimensional space, one dimension for each variable.
• Each variable is a vector with length 1.
• The angle between the vectors represents the correlation.
Visualization
• Intuitive proof that “the angle between the vectors represents the correlation”: cos(angle) = cor(v1, v2).
• If there is no angle, then they are actually the same (except for
a constant).
– In that case cos(0)=1, and cor(v1, v2) = 1
• Now if they are uncorrelated, then the correlation is zero.
– In that case cor(v1, v2) = 0,
– and if cos(angle)=cor(v1, v2), then cos(angle) = 0,
– So angle = ½π, since cos(½π) = 0
• So we can visualize the correlation matrix.
• The correlation between v1 and v2 = 0.7; thus angle ≈ ¼π (since cos(¼π) ≈ 0.707 ≈ 0.7).
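A quick numerical check of that last step: the exact angle is arccos(0.7), which is only approximately ¼π:

```python
import numpy as np

r = 0.7                              # correlation between v1 and v2
angle = np.arccos(r)                 # cos(angle) = cor(v1, v2)
print(angle, np.degrees(angle))      # approx. 0.795 rad, about 45.6 degrees
print(np.pi / 4, np.cos(np.pi / 4))  # 1/4*pi is approx. 0.785 rad, cos = 0.707
```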
Visualization
[Figure: Variable 1 and Variable 2 drawn as vectors, together with a first principal component axis.]
Visualization
[Figure: The projections of V1 and V2 on the 1st PC; the projection of V2 equals the constant a12 in the equation, and the projection of V1 equals the constant a11.]
Visualization
[Figure: The total projection on the 1st PC, which is the variance of the 1st PC, and thus λ1.]
Visualization
[Figure: The projections on the 2nd PC; V2 has a projection on the 2nd PC, while the projection of V1 on the 2nd PC is 0.]
Visualization
[Figure: The total projection on the 1st PC, which is the variance of the 1st PC (λ1), and the total projection on the 2nd PC, which is the variance of the 2nd PC (λ2).]
Visualization
• Of course, PCA is concerned with finding the largest variance
of the first component, etc.
• In this example, there is possibly a better alternative.
• So, what I presented as solutions for the λ’s and a were in fact non-optimal solutions.
• Let’s find an optimal solution.
Visualization
[Figure: Variable 1 and Variable 2 drawn as vectors, with the optimal first principal component axis placed between them.]
Visualization
[Figure: The maximized projections of V1 and V2 on the 1st PC (the constants a11 and a12 in the equation).]
Visualization
[Figure: The total maximum projection on the 1st PC, which is the variance of the 1st PC (λ1), and the ‘minimized’ projections of V1 and V2 on the 2nd PC.]
Visualization
[Figure: The total maximum projection on the 1st PC, which is its variance (λ1), and the total projection on the 2nd PC, which is its variance (λ2).]
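A minimal sketch of this optimal solution for the correlation matrix assumed on the earlier visualization slides: the first PC has equal weights on both variables (direction [1, 1]) with variance λ1 = 1.7, and the second has weights [1, −1] with λ2 = 0.3:

```python
import numpy as np

R = np.array([[1.0, 0.7],
              [0.7, 1.0]])

eigenvalues, A = np.linalg.eigh(R)   # ascending order
print(eigenvalues)                   # approx. [0.3, 1.7]
print(A)                             # columns proportional to [1, -1] and [1, 1], up to sign

# The variances of the two PCs add up to the total variance (2 for two
# standardized variables); the first PC absorbs 1.7 / 2 = 85% of it.
print(eigenvalues.sum(), eigenvalues[-1] / eigenvalues.sum())
```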
Example
An example
• What does a PCA solution look like?
The eight items below are each scored on a 5-point scale with response categories CD, D, M, A, CA, coded 1–5:
15.1 I feel excluded from others
15.2 I feel that nobody really knows me well
15.3 I feel that there is nobody I can turn to
15.4 I feel that there are people who really understand me well
15.5 I feel alone
15.6 I feel that I don’t really belong with anyone
15.7 I feel connected with people
15.8 I feel that there are people I can turn to
• It might make sense to say that the weighted sum of these items is something that we could call loneliness.
An example
• The loneliness items not only seem to be related at ‘face value’, but the variables are also correlated.

Correlations between the loneliness items (n = 679):

        V15_1  V15_2  V15_3  V15_4  V15_5  V15_6  V15_7  V15_8
V15_1   1.000
V15_2    .523  1.000
V15_3    .432   .554  1.000
V15_4    .217   .348   .346  1.000
V15_5    .606   .530   .434   .265  1.000
V15_6    .520   .567   .464   .258   .596  1.000
V15_7    .199   .282   .252   .374   .212   .304  1.000
V15_8    .186   .342   .341   .476   .237   .271   .566  1.000
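The correlation matrix above is enough to reproduce the PCA solution discussed on the following slides; a minimal NumPy sketch (the printed eigenvalues and proportions should roughly match the 46% and 17% reported there):

```python
import numpy as np

# Correlation matrix of the eight loneliness items (symmetrized from the table)
R = np.array([
    [1.000, 0.523, 0.432, 0.217, 0.606, 0.520, 0.199, 0.186],
    [0.523, 1.000, 0.554, 0.348, 0.530, 0.567, 0.282, 0.342],
    [0.432, 0.554, 1.000, 0.346, 0.434, 0.464, 0.252, 0.341],
    [0.217, 0.348, 0.346, 1.000, 0.265, 0.258, 0.374, 0.476],
    [0.606, 0.530, 0.434, 0.265, 1.000, 0.596, 0.212, 0.237],
    [0.520, 0.567, 0.464, 0.258, 0.596, 1.000, 0.304, 0.271],
    [0.199, 0.282, 0.252, 0.374, 0.212, 0.304, 1.000, 0.566],
    [0.186, 0.342, 0.341, 0.476, 0.237, 0.271, 0.566, 1.000],
])

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # largest component first
print(np.round(eigenvalues, 2))             # variances of the eight PCs
print(np.round(eigenvalues / 8, 3))         # proportion of total variance per PC
print(eigenvalues > 1)                      # Kaiser criterion: keep PCs with eigenvalue > 1
```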
An example
• What can we expect with PCA?
– There are 8 items, so there will be 8 PC’s.
– On face value the items are related to one thing: loneliness.
– So there should be one PC (interpretable as loneliness) that accounts for most variance in the eight observed variables.
An example
[SPSS output: variance explained by the principal components.] This is the variance explained by the principal components. Note that the total explained variance is never 1 (100%), because only the PC’s with an eigenvalue > 1 are used. The first 2 PC’s have an eigenvalue > 1.
An example
[SPSS output: total variance explained.] The complete PCA solution, with all 8 variables and 8 PC’s. The ‘practical’ solution has thrown away all PC’s with an eigenvalue smaller than 1; together the retained PC’s absorbed 63.6% of all variance. Cutting at an eigenvalue of 1 is an arbitrary choice, which is called the Kaiser criterion.
An example
[SPSS output: component matrix.] These are the constants ai from the matrix A’. They are also the projections on the principal component axes. The square of a loading is the variance contributed to the component by the observed variable. When you add the squared loadings of a component, you obtain its eigenvalue: Σ(squared loadings) = 3.7 for the first component, which accounts for 46% of the variance; the second component accounts for 17%.
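A minimal sketch of those two statements, on a small made-up 3-variable correlation matrix, assuming the usual convention that the tabulated loadings are the eigenvector elements rescaled by the square root of the eigenvalue (which is exactly what makes the squared column sums return the eigenvalue):

```python
import numpy as np

# A small hypothetical correlation matrix (3 standardized variables)
R = np.array([[1.0, 0.5, 0.4],
              [0.5, 1.0, 0.3],
              [0.4, 0.3, 1.0]])

eigenvalues, A = np.linalg.eigh(R)
eigenvalues, A = eigenvalues[::-1], A[:, ::-1]    # largest component first

# Loadings = eigenvector * sqrt(eigenvalue); their squares are the variance
# each variable contributes to the component.
loadings = A * np.sqrt(eigenvalues)

# Column sums of squared loadings give back the eigenvalues of the components;
# row sums (over all components) give 1, the variance of each standardized item.
print(np.round((loadings ** 2).sum(axis=0), 3), np.round(eigenvalues, 3))
print(np.round((loadings ** 2).sum(axis=1), 3))
```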
An example
• What can we expect with PCA?
– There are 8 items, so there will be 8 PC’s.
– On face value the items are related to one thing: loneliness.
– So there should be one PC (interpretable as loneliness) that accounts for most variance in the eight observed variables.
• We find 1 PC that absorbs almost 50% of the variance; that one might be called loneliness.
• However, we lose 50% of the variance. The second PC hardly does anything, let alone the other components.
• So the number of variables can be reduced from 8 to 1.
• However, at a huge loss.
Practical issues
Practical issues
• PCA is NOT factor analysis!!
– neither exploratory nor confirmatory.
– In factor analysis it is assumed that the structure in the data is the result of an underlying factor structure. So, there is a theory.
– In PCA the original data are linearly transformed into a set of new uncorrelated variables with the maximum-variance property. This is a mathematical optimization procedure that lacks a theory about the data.
• Many people think they use PCA. However, they use a rotated
version of the PCA solution, for which the maximum variance
property does not necessarily hold any more.
• The advantage is that such rotated solutions are often easier to interpret, because the PCA solution itself too often has no substantive meaning.
Practical issues
• In PCA the PC’s are uncorrelated; be aware of that when interpreting the PC’s.
• I have often seen the PC’s interpreted as related constructs, e.g. loneliness and shyness; but since such constructs are presumably related while the PC’s are not, a different interpretation should be found.
• Many times the solutions are rotated, to obtain results that are easier to interpret.
Practical issues
• Method of rotation:
– No rotation is the default in SPSS; unrotated solutions are hard to interpret because variables tend to load on multiple factors.
– Varimax rotation is an orthogonal rotation of the factor axes to maximize
the variance of the squared loadings of a factor (column) on all the
variables (rows) in a factor matrix. Each factor will tend to have either
large or small loadings of any particular variable. A varimax solution
yields results which make it as easy as possible to identify each variable
with a single factor. This is the most common rotation option.
– Quartimax rotation is an orthogonal alternative which minimizes the
number of factors needed to explain each variable.
– Direct oblimin rotation is the standard method when one wishes a non-orthogonal solution -- that is, one in which the factors are allowed to be correlated. This will result in higher eigenvalues but diminished interpretability of the factors.
– Promax rotation is an alternative non-orthogonal rotation method which
is computationally faster than the direct oblimin method.
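For reference, a common textbook formulation of Kaiser's varimax algorithm as a sketch (this is not SPSS's exact routine, and the 4 x 2 loading matrix in the usage example is made up):

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a loading matrix (common sketch, gamma = 1)."""
    p, k = loadings.shape
    R = np.eye(k)                         # rotation matrix, start with no rotation
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient-like term of the varimax criterion
        B = loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(B)
        R = U @ Vt                        # nearest orthogonal rotation
        d_new = s.sum()
        if d_new < d * (1.0 + tol):       # criterion no longer improving
            break
        d = d_new
    return loadings @ R

# Example: rotate a hypothetical 4 x 2 loading matrix
L = np.array([[0.8, 0.3],
              [0.7, 0.4],
              [0.3, 0.8],
              [0.2, 0.7]])
print(np.round(varimax(L), 2))
```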
Practical issues
We had this one solution (the unrotated component matrix).
[SPSS output: Varimax rotated component matrix.] This is the Varimax rotated solution. Notice that the loadings are now high for either component 1 or component 2. Because the loadings are used to interpret the PC’s, this should make interpretation easier.
Although it now seems that there are clearly two PC’s, they can be interpreted as the positively formulated items (4, 7, 8) and the negatively formulated items (the others).
An example
(The loneliness items 15.1–15.8 shown earlier are repeated here for reference.)
Practical issues
• PCA is useful when you have constructs by definition.
• You can force there to be one component.
• You can calculate the PC scores, which are weighted sums of the observed variables using the constants aij.
• And these scores can then be used in further analysis.
• Use it with care, and think about and look at your data before
doing any analysis.
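A minimal sketch of that last workflow on a hypothetical data matrix: force a single component and compute its scores as the plain weighted sums y = a'x used in the formulas above (note that statistical packages often additionally standardize the saved component scores):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))      # hypothetical data matrix (n x p)
Xc = X - X.mean(axis=0)                # deviations from the mean

Sigma_hat = Xc.T @ Xc / Xc.shape[0]    # ML estimate of Sigma
eigenvalues, A = np.linalg.eigh(Sigma_hat)
a1 = A[:, -1]                          # weights a1j of the single forced component

pc1_scores = Xc @ a1                   # y = a11*x1 + ... + a1p*xp for every respondent
print(pc1_scores[:5])                  # scores that can be used in further analysis
```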