Principal Components Analysis

Principal Components Analysis
• Objectives:
– Understand the principles of principal components analysis (PCA)
– Recognize conditions under which PCA may be useful
– Use the SAS procedure PRINCOMP to
  • perform a principal components analysis
  • interpret PRINCOMP output.
Xuhua Xia
Slide 1
Typical Form of Data
A data set in an 8x3 matrix. The rows could be species and the columns sampling sites.

        | 100  97  99 |
        |  96  90  90 |
        |  80  75  60 |
  X  =  |  75  85  95 |
        |  62  40  28 |
        |  77  80  78 |
        |  92  91  80 |
        |  75  85 100 |

A matrix is often referred to as an nxp matrix (n for number of rows and p for number of columns). Our matrix has 8 rows and 3 columns, and is an 8x3 matrix. A variance-covariance matrix has n = p, and is called an n-dimensional square matrix.
Xuhua Xia
Slide 2
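For readers who want to check the dimensions numerically, here is a small pure-Python sketch (an illustration, not part of the original slides) that builds the 3x3 variance-covariance matrix from the 8x3 data matrix above:

```python
# Illustrative sketch: an n x p data matrix yields a p x p covariance matrix.
# X is the 8 x 3 matrix from the slide, read row by row.
X = [
    [100, 97, 99],
    [96, 90, 90],
    [80, 75, 60],
    [75, 85, 95],
    [62, 40, 28],
    [77, 80, 78],
    [92, 91, 80],
    [75, 85, 100],
]
n, p = len(X), len(X[0])          # n = 8 rows (species), p = 3 columns (sites)

# Column means
means = [sum(row[j] for row in X) / n for j in range(p)]

# Sample covariance: Cov(j, k) = sum((x_ij - mean_j)(x_ik - mean_k)) / (n - 1)
def cov(j, k):
    return sum((row[j] - means[j]) * (row[k] - means[k]) for row in X) / (n - 1)

S = [[cov(j, k) for k in range(p)] for j in range(p)]
# S is p x p (3 x 3) and symmetric: the variance-covariance matrix has n = p.
```
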
What are Principal Components?
Y = b1X1 + b2X2 + … + bnXn
• Principal components are linear combinations of the observed variables. The coefficients of these principal components are chosen to meet three criteria.
• What are the three criteria?
Xuhua Xia
Slide 3
What are Principal Components?
• The three criteria:
– There are exactly p principal components (PCs), each being a linear combination of the observed variables;
– The PCs are mutually orthogonal (i.e., perpendicular and uncorrelated);
– The components are extracted in order of decreasing variance.
Xuhua Xia
Slide 4
A Simple Data Set

  Obs        X            Y
   1    -1.264911064  -1.788854
   2    -0.632455532  -0.894427
   3     0             0
   4     0.632455532   0.894427
   5     1.264911064   1.788854
  Mean   0.0000        0.0000
  Var    1             2

[Scatter plot of Y against X for the five points; both axes run from about -2 to 2.]

  r(X,Y) = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² · Σ(Y − Ȳ)²] = 5.6569 / √(4 × 8) = 1

  Cov(x, y) = Σ(xi − x̄)(yi − ȳ) / (n − 1)

Correlation matrix:          Covariance matrix:
       X    Y                       X      Y
  X    1    1                  X    1      1.414
  Y    1    1                  Y    1.414  2

Xuhua Xia
Slide 5
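The slide's numbers can be verified with a few lines of pure Python (an illustrative sketch, not part of the SAS deck):

```python
import math

# Verify the correlation, covariance, and variances of the 5-point data set.
X = [-1.264911064, -0.632455532, 0.0, 0.632455532, 1.264911064]
Y = [-1.788854382, -0.894427191, 0.0, 0.894427191, 1.788854382]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))   # cross-product: 5.6569
sxx = sum((x - mx) ** 2 for x in X)                    # 4
syy = sum((y - my) ** 2 for y in Y)                    # 8

r = sxy / math.sqrt(sxx * syy)          # correlation: 1
cov_xy = sxy / (n - 1)                  # covariance: sqrt(2), i.e. 1.414
var_x = sxx / (n - 1)                   # 1
var_y = syy / (n - 1)                   # 2
```
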
General Patterns
• The total variance is 3 (= 1 + 2).
• The two variables, X and Y, are perfectly correlated, with all points falling on the regression line.
• The spatial relationship among the 5 points can therefore be represented by a single dimension.
• PCA is a dimension-reduction technique. What would happen if we apply PCA to the data?
Xuhua Xia
Slide 6
Graphic PCA
[Scatter plot of Y against X for the five data points, illustrating the PCA rotation graphically; both axes run from about -2 to 2.]
Xuhua Xia
Slide 7
SAS Program

data pca;
input x y;
cards;
-1.264911064 -1.788854382
-0.632455532 -0.894427191
0 0
0.632455532 0.894427191
1.264911064 1.788854382
;
proc princomp cov out=pcscore;
proc print;
var prin1 prin2;
proc princomp data=pca out=pcscore;
proc print;
var prin1 prin2;
run;

The COV option requests that the PCA be carried out on the covariance matrix rather than the correlation matrix. Without the COV option, the PCA is carried out on the correlation matrix.

Xuhua Xia
Slide 8
A positive definite matrix
• When you run the SAS program, the log file will warn that "The Correlation Matrix is not positive definite." What does that mean?
• A symmetric matrix M (such as a correlation matrix or a covariance matrix) is positive definite if z′Mz > 0 for all non-zero vectors z with real entries, where z′ is the transpose of z.
• Given our correlation matrix with all entries being 1, it is easy to find z that leads to z′Mz = 0, so the matrix is not positive definite:

  [z1 z2] | 1  1 | | z1 |  =  0        Solution: z1 = −z2
          | 1  1 | | z2 |

Exercise: replace the correlation matrix with the covariance matrix and solve for z.

Xuhua Xia
Slide 9
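A quick numeric check of this claim (an illustrative Python sketch, with the quadratic form z′Mz written out directly; the covariance-matrix case answers the slide's exercise):

```python
import math

# For M = [[1,1],[1,1]], the vector z = (1, -1) (i.e., z1 = -z2) gives
# z'Mz = 0, so M is not positive definite.
M = [[1.0, 1.0], [1.0, 1.0]]

def quad_form(M, z):
    """Compute the quadratic form z' M z for a 2x2 matrix."""
    return sum(z[i] * M[i][j] * z[j] for i in range(2) for j in range(2))

z = (1.0, -1.0)            # z1 = -z2, the solution given on the slide
print(quad_form(M, z))     # 0.0 -> not positive definite

# The slide's exercise: the covariance matrix [[1, sqrt(2)], [sqrt(2), 2]]
# is singular too, so z = (-sqrt(2), 1) also gives z'Cz = 0 (up to rounding).
C = [[1.0, math.sqrt(2)], [math.sqrt(2), 2.0]]
z2 = (-math.sqrt(2), 1.0)
print(quad_form(C, z2))    # ~0 -> not positive definite either
```
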
SAS Output

Eigenvalues of the Covariance Matrix
         Eigenvalue   Difference   Proportion   Cumulative
PRIN1     3.00000      3.00000      1.00000      1.00000
PRIN2     0.00000      .            0.00000      1.00000

(The Proportion column is the variance accounted for by each principal component.)

Eigenvectors
        PRIN1      PRIN2
X      0.577350   0.816497
Y      0.816497   -.577350

PC1 = 0.57735*X1 + 0.816497*X2

Principal component scores:
OBS     PRIN1     PRIN2
 1     -2.19089     0
 2     -1.09545     0
 3      0.00000     0
 4      1.09545     0
 5      2.19089     0

What's the variance in PC1? How are the values computed?

Xuhua Xia
Slide 10
SAS Output

[Plot of PC2 against PC1 for the five observations; all points lie on the PC1 axis (PC2 = 0), with PC1 running from -3 to 3.]

OBS     PRIN1     PRIN2
 1     -2.19089     0
 2     -1.09545     0
 3      0.00000     0
 4      1.09545     0
 5      2.19089     0

Xuhua Xia
Slide 11
SAS Output

Eigenvalues of the Correlation Matrix
         Eigenvalue   Difference   Proportion   Cumulative
PRIN1     2.00000      2.00000      1.00000      1.00000
PRIN2     0.00000      .            0.00000      1.00000

(The Proportion column is the variance accounted for by each principal component.)

Eigenvectors
        PRIN1      PRIN2
X      0.707107    0.70710
Y      0.707107   -0.70711

Principal component scores:
OBS     PRIN1     PRIN2
 1     -1.78885     0
 2     -0.89443     0
 3      0.00000     0
 4      0.89443     0
 5      1.78885     0

What's the variance in PC1?

Xuhua Xia
Slide 12
Steps in a PCA
• Have at least two variables
• Generate a correlation or variance-covariance matrix
• Obtain eigenvalues and eigenvectors (this is called an eigenvalue problem, and will be illustrated with a simple numerical example)
• Generate principal component (PC) scores
• Plot the PC scores in the space with reduced dimensions
• All these steps can be automated by using SAS.
Xuhua Xia
Slide 13
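The steps above can be sketched in pure Python for the 5-point data set used earlier (an illustration of what PROC PRINCOMP automates, not the procedure's own code; the 2x2 eigenvalues come from the quadratic formula):

```python
import math

# Steps of a PCA on the slide's 5-point (x, y) data set, on the covariance matrix.
data = [(-1.264911064, -1.788854382),
        (-0.632455532, -0.894427191),
        (0.0, 0.0),
        (0.632455532, 0.894427191),
        (1.264911064, 1.788854382)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Step 2: variance-covariance matrix entries
sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

# Step 3: eigenvalues of a 2x2 symmetric matrix via the quadratic formula
trace, det = sxx + syy, sxx * syy - sxy ** 2
disc = math.sqrt(trace ** 2 - 4 * det)
lam1, lam2 = (trace + disc) / 2, (trace - disc) / 2   # 3 and 0

# Eigenvector for lam1 (satisfies (sxx - lam1)*v1 + sxy*v2 = 0), unit length
v1 = (sxy, lam1 - sxx)
norm = math.hypot(*v1)
v1 = (v1[0] / norm, v1[1] / norm)                     # (0.5774, 0.8165)

# Step 4: PC1 scores; their variance equals lam1
pc1 = [(x - mx) * v1[0] + (y - my) * v1[1] for x, y in data]
```
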
Covariance or Correlation Matrix?
[Line plot of abundance for two species, Sp1 and Sp2; the abundance axis runs from 0 to 40.]
Xuhua Xia
Slide 14
Covariance or Correlation Matrix?
[Line plot of abundance for species Sp2 and Sp3; the abundance axis runs from 0 to 35.]
Xuhua Xia
Slide 15
Covariance or Correlation Matrix?
[Line plot of abundance for species Sp1, Sp2, and Sp3; the abundance axis runs from 0 to 35.]
Xuhua Xia
Slide 16
The Eigenvalue Problem

The covariance matrix:

        | 1   √2 |
  A  =  | √2   2 |

Setting the determinant |A − λI| = 0 gives the characteristic equation:

  λ² − 3λ = 0

The eigenvalues are the values of λ that satisfy this condition:

  λ1 = 0,  λ2 = 3

There are n eigenvalues for n variables. The sum of the eigenvalues is equal to the sum of the variances in the covariance matrix.

Finding the eigenvalues and eigenvectors is called an eigenvalue problem (or a characteristic value problem).
Xuhua Xia
Slide 17
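The characteristic equation can be checked numerically (an illustrative Python sketch, not part of the slides):

```python
import math

# Check the characteristic equation for A = [[1, sqrt(2)], [sqrt(2), 2]]:
# det(A - lam*I) = (1 - lam)(2 - lam) - 2 = lam^2 - 3*lam, so lam = 0 or 3.
a11, a12, a22 = 1.0, math.sqrt(2), 2.0

def char_poly(lam):
    return (a11 - lam) * (a22 - lam) - a12 * a12   # det(A - lam*I)

# Roots of lam^2 - (trace)*lam + det = 0
trace = a11 + a22                                   # 3
det = a11 * a22 - a12 * a12                         # 0
lam1 = (trace + math.sqrt(trace ** 2 - 4 * det)) / 2   # 3
lam2 = (trace - math.sqrt(trace ** 2 - 4 * det)) / 2   # 0

# The eigenvalues sum to the total variance: 1 + 2 = 3.
```
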
Get the Eigenvectors

• An eigenvector is a vector (x) that satisfies the following condition:

  A x = λ x

• In our case A is a variance-covariance matrix of the order of 2, and x is a vector specified by x1 and x2.

        | 1   √2 |
  A  =  | √2   2 |        λ² − 3λ = 0,   λ1 = 0, λ2 = 3

For λ = 0:

  A x  =  | 1   √2 | | x1 |  =  | 0 |
          | √2   2 | | x2 |     | 0 |

which is equivalent to

  x1 + √2·x2 = 0
  √2·x1 + 2·x2 = 0
  ⟹  x2 = −x1/√2

For λ = 3:

  A x  =  | 1   √2 | | x1 |  =  3 | x1 |
          | √2   2 | | x2 |       | x2 |

which is equivalent to

  x1 + √2·x2 = 3·x1
  √2·x1 + 2·x2 = 3·x2
  ⟹  x2 = √2·x1

Xuhua Xia
Slide 18
Get the Eigenvectors

• We want to find an eigenvector of unit length, i.e., x1² + x2² = 1.
• We therefore have, using the results from the previous slide:

For λ = 0:  x2 = −x1/√2, so
  x1² + x1²/2 = 1  ⟹  x1 = 0.8165, x2 = −0.5774

For λ = 3:  x2 = √2·x1, so
  x1² + 2·x1² = 1  ⟹  x1 = 0.5774, x2 = 0.8165

The first eigenvector is the one associated with the largest eigenvalue.

Xuhua Xia
Slide 19
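The normalization can be verified directly (an illustrative Python sketch, not part of the slides):

```python
import math

# Unit-length eigenvectors of the covariance matrix [[1, sqrt(2)], [sqrt(2), 2]].
# For lam = 0: x2 = -x1/sqrt(2)  ->  x1^2 + x1^2/2 = 1  ->  x1 = sqrt(2/3)
x1 = math.sqrt(2.0 / 3.0)          # 0.8165
x2 = -x1 / math.sqrt(2)            # -0.5774

# For lam = 3: x2 = sqrt(2)*x1  ->  x1^2 + 2*x1^2 = 1  ->  x1 = 1/sqrt(3)
y1 = 1.0 / math.sqrt(3)            # 0.5774
y2 = math.sqrt(2) * y1             # 0.8165

# Both vectors have unit length; the one for lam = 3 (the largest
# eigenvalue) is the first eigenvector.
```
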
Get the PC Scores

The original data (x and y) multiplied by the eigenvector matrix gives the PC scores:

  | -1.26491106  -1.78885438 |                             | -2.19089   0 |
  | -0.63245553  -0.89442719 |   | 0.577350   0.816497 |   | -1.09545   0 |
  |  0            0          | × | 0.816497  -0.577350 | = |  0.00000   0 |
  |  0.63245553   0.89442719 |                             |  1.09545   0 |
  |  1.26491106   1.78885438 |                             |  2.19089   0 |

The first column of the result holds the first PC scores; the second column holds the second PC scores. The original data in a two-dimensional space are reduced to one dimension.
Xuhua Xia
Slide 20
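The matrix product above can be reproduced in a few lines (an illustrative Python sketch, not part of the slides):

```python
# Scores (5x2) = data (5x2) @ eigenvector matrix (2x2).
# E uses the SAS layout: rows = variables (x, y), columns = (PRIN1, PRIN2).
data = [(-1.26491106, -1.78885438),
        (-0.63245553, -0.89442719),
        (0.0, 0.0),
        (0.63245553, 0.89442719),
        (1.26491106, 1.78885438)]
E = [(0.577350, 0.816497),      # loadings of x on PRIN1, PRIN2
     (0.816497, -0.577350)]     # loadings of y on PRIN1, PRIN2

scores = [(x * E[0][0] + y * E[1][0],   # PC1 score
           x * E[0][1] + y * E[1][1])   # PC2 score
          for x, y in data]
# PC1: -2.19089, -1.09545, 0, 1.09545, 2.19089;  PC2: all ~0
```
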
What Are Principal Components?
• Principal components are a new set of variables, which are linear combinations of the observed ones, with these properties:
– Because of the decreasing-variance property, much of the variance (information in the original set of p variables) tends to be concentrated in the first few PCs. This implies that we can drop the last few PCs without losing much information. PCA is therefore considered a dimension-reduction technique.
– Because PCs are orthogonal, they can be used instead of the original variables in situations where having orthogonal variables is desirable (e.g., regression).
Xuhua Xia
Slide 21
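The dimension-reduction claim can be made concrete with the earlier 5-point example: because PC2 carries zero variance, the original two variables are recovered from the PC1 scores alone (an illustrative Python sketch, not part of the slides):

```python
# Reconstructing the original (x, y) data from PC1 alone.
# pc1: the PC1 scores from the earlier slides; v1: the first eigenvector.
pc1 = [-2.19089, -1.09545, 0.0, 1.09545, 2.19089]
v1 = (0.577350, 0.816497)

# Each observation is (PC1 score) * (first eigenvector), since PC2 = 0 here.
recon = [(s * v1[0], s * v1[1]) for s in pc1]
# recon[0] ~ (-1.26491, -1.78885), the original first observation
```
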
Index of hidden variables
• The ranking of Asian universities by the Asian Week
– HKU is ranked second in financial resources, but seventh in academic research
– How did HKU get ranked third?
– Is there a more objective way of ranking?
• An illustrative example:

School   Math   English   Physics   Chemistry   Chinese
  1       60      55        65         64          67
  2       70      65        69         71          77
  3       80      75        72         85          82
  4       90      85        85         88          88
  5      100      95        95         95          93
  6       …       …         …          …           …

Xuhua Xia
Slide 22
A Simple Data Set
School   Math   English
  1       60      55
  2       70      65
  3       80      75
  4       90      85
  5      100      95
Mean     80.0    75.0
Var      250     250

[Scatter plot of English against Math scores for the five schools; both axes run from 50 to 100.]

• School 5 is clearly the best school
• School 1 is clearly the worst school
Xuhua Xia
Slide 23
Graphic PCA
Xuhua Xia
Slide 24
Crime Data in 50 States
STATE         MURDER  RAPE  ROBBE  ASSAU  BURGLA  LARCEN   AUTO
ALABAMA        14.2   25.2   96.8  278.3  1135.5  1881.9  280.7
ALASKA         10.8   51.6   96.8  284.0  1331.7  3369.8  753.3
ARIZONA         9.5   34.2  138.2  312.3  2346.1  4467.4  439.5
ARKANSAS        8.8   27.6   83.2  203.4   972.6  1862.1  183.4
CALIFORNIA     11.5   49.4  287.0  358.0  2139.4  3499.8  663.5
COLORADO        6.3   42.0  170.7  292.9  1935.2  3903.2  477.1
CONNECTICUT     4.2   16.8  129.5  131.8  1346.0  2620.7  593.2
DELAWARE        6.0   24.9  157.0  194.2  1682.6  3678.4  467.0
FLORIDA        10.2   39.6  187.9  449.1  1859.9  3840.5  351.4
GEORGIA        11.7   31.1  140.5  256.5  1351.1  2170.2  297.9
HAWAII          7.2   25.5  128.0   64.1  1911.5  3920.4  489.4
IDAHO           5.5   19.4   39.6  172.5  1050.8  2599.6  237.6
ILLINOIS        9.9   21.8  211.3  209.0  1085.0  2828.5  528.6
.
.
PROC PRINCOMP OUT=CRIMCOMP;
Xuhua Xia
Slide 25
DATA CRIME;
TITLE 'CRIME RATES PER 100,000 POP BY STATE';
INPUT STATENAME $1-15 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
CARDS;
Alabama         14.2 25.2 96.8 278.3 1135.5 1881.9 280.7
Alaska          10.8 51.6 96.8 284.0 1331.7 3369.8 753.3
Arizona         9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
Arkansas        8.8 27.6 83.2 203.4 972.6 1862.1 183.4
California      11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
Colorado        6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
Connecticut     4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
Delaware        6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
Florida         10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
Georgia         11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
Hawaii          7.2 25.5 128.0 64.1 1911.5 3920.4 489.4
Idaho           5.5 19.4 39.6 172.5 1050.8 2599.6 237.6
Illinois        9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
Indiana         7.4 26.5 123.2 153.5 1086.2 2498.7 377.4
Iowa            2.3 10.6 41.2 89.8 812.5 2685.1 219.9
Kansas          6.6 22.0 100.7 180.5 1270.4 2739.3 244.3
Kentucky        10.1 19.1 81.1 123.3 872.2 1662.1 245.4
Louisiana       15.5 30.9 142.9 335.5 1165.5 2469.9 337.7
Maine           2.4 13.5 38.7 170.0 1253.1 2350.7 246.9
Maryland        8.0 34.8 292.1 358.9 1400.0 3177.7 428.5
Massachusetts   3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan        9.3 38.9 261.9 274.6 1522.7 3159.0 545.5
Minnesota       2.7 19.5 85.9 85.8 1134.7 2559.3 343.1
Mississippi     14.3 19.6 65.7 189.1 915.6 1239.9 144.4
Missouri        9.6 28.3 189.0 233.5 1318.3 2424.2 378.4
Montana         5.4 16.7 39.2 156.8 804.9 2773.2 309.2
Nebraska        3.9 18.1 64.7 112.7 760.0 2316.1 249.1
Nevada          15.8 49.1 323.1 355.0 2453.1 4212.6 559.2
New Hampshire   3.2 10.7 23.2 76.0 1041.7 2343.9 293.4
New Jersey      5.6 21.0 180.4 185.1 1435.8 2774.5 511.5
New Mexico      8.8 39.1 109.6 343.4 1418.7 3008.6 259.5
New York        10.7 29.4 472.6 319.1 1728.0 2782.0 745.8
North Carolina  10.6 17.0 61.3 318.3 1154.1 2037.8 192.1
North Dakota    0.9 9.0 13.3 43.8 446.1 1843.0 144.7
Ohio            7.8 27.3 190.5 181.1 1216.0 2696.8 400.4
Oklahoma        8.6 29.2 73.8 205.0 1288.2 2228.1 326.8
Oregon          4.9 39.9 124.1 286.9 1636.4 3506.1 388.9
Pennsylvania    5.6 19.0 130.3 128.0 877.5 1624.1 333.2
Rhode Island    3.6 10.5 86.5 201.0 1489.5 2844.1 791.4
South Carolina  11.9 33.0 105.9 485.3 1613.6 2342.4 245.1
South Dakota    2.0 13.5 17.9 155.7 570.5 1704.4 147.5
Tennessee       10.1 29.7 145.8 203.9 1259.7 1776.5 314.0
Texas           13.3 33.8 152.4 208.2 1603.1 2988.7 397.6
Utah            3.5 20.3 68.8 147.3 1171.6 3004.6 334.5
Vermont         1.4 15.9 30.8 101.2 1348.2 2201.0 265.2
Virginia        9.0 23.3 92.1 165.7 986.2 2521.2 226.7
Washington      4.3 39.6 106.2 224.8 1605.6 3386.9 360.3
West Virginia   6.0 13.2 42.2 90.9 597.4 1341.7 163.3
Wisconsin       2.8 12.9 52.2 63.7 846.9 2614.2 220.7
Wyoming         5.4 21.9 39.7 173.9 811.6 2772.2 282.0
;
PROC PRINCOMP out=crimcomp;
run;
PROC PRINT;
ID STATENAME;
VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
run;
PROC GPLOT;
PLOT PRIN2*PRIN1=STATENAME;
TITLE2 'PLOT OF THE FIRST TWO PRINCIPAL COMPONENTS';
run;
PROC PRINCOMP data=CRIME COV OUT=crimcomp;
run;
PROC PRINT;
ID STATENAME;
VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
run;

/* Add to have a map view */
proc sort data=crimcomp out=crimcomp;
by STATENAME;
run;
proc sort data=maps.us2 out=mymap;
by STATENAME;
run;
data both;
merge mymap crimcomp;
by STATENAME;
run;
proc gmap data=both;
id _map_geometry_;
choro PRIN1 PRIN2/levels=15;
/* choro PRIN1/discrete; */
run;
Correlation Matrix

          MURDER    RAPE  ROBBERY  ASSAULT  BURGLARY  LARCENY    AUTO
MURDER    1.0000  0.6012   0.4837   0.6486    0.3858   0.1019  0.0688
RAPE      0.6012  1.0000   0.5919   0.7403    0.7121   0.6140  0.3489
ROBBERY   0.4837  0.5919   1.0000   0.5571    0.6372   0.4467  0.5907
ASSAULT   0.6486  0.7403   0.5571   1.0000    0.6229   0.4044  0.2758
BURGLARY  0.3858  0.7121   0.6372   0.6229    1.0000   0.7921  0.5580
LARCENY   0.1019  0.6140   0.4467   0.4044    0.7921   1.0000  0.4442
AUTO      0.0688  0.3489   0.5907   0.2758    0.5580   0.4442  1.0000

If variables are not correlated, there would be no point in doing PCA.
The correlation matrix is symmetric, so we only need to inspect either the upper or the lower triangular matrix.
Xuhua Xia
Slide 28
Eigenvalues

        Eigenvalue  Difference  Proportion  Cumulative
PRIN1    4.11496     2.87624     0.587851    0.58785
PRIN2    1.23872     0.51291     0.176960    0.76481
PRIN3    0.72582     0.40938     0.103688    0.86850
PRIN4    0.31643     0.05846     0.045205    0.91370
PRIN5    0.25797     0.03593     0.036853    0.95056
PRIN6    0.22204     0.09798     0.031720    0.98228
PRIN7    0.12406     .           0.017722    1.00000

Xuhua Xia
Slide 29
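The Proportion and Cumulative columns follow directly from the eigenvalues: each eigenvalue is divided by their sum, which for a correlation matrix equals the number of variables p (here 7). An illustrative Python check, using the eigenvalues above:

```python
# Proportion and Cumulative columns computed from the eigenvalues.
eig = [4.11496, 1.23872, 0.72582, 0.31643, 0.25797, 0.22204, 0.12406]
total = sum(eig)                        # ~7, the number of variables

prop = [e / total for e in eig]         # 0.5879, 0.1770, ...
cum = [sum(prop[:i + 1]) for i in range(len(prop))]
# The first two PCs already account for about 76% of the total variance.
```
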
Eigenvectors

           PRIN1   PRIN2   PRIN3   PRIN4   PRIN5   PRIN6   PRIN7
MURDER    0.3002  -.6291  0.1782  -.2321  0.5381  0.2591  0.2675
RAPE      0.4317  -.1694  -.2441  0.0622  0.1884  -.7732  -.2964
ROBBERY   0.3968  0.0422  0.4958  -.5579  -.5199  -.1143  -.0039
ASSAULT   0.3966  -.3435  -.0695  0.6298  -.5066  0.1723  0.1917
BURGLARY  0.4401  0.2033  -.2098  -.0575  0.1010  0.5359  -.6481
LARCENY   0.3573  0.4023  -.5392  -.2348  0.0300  0.0394  0.6016
AUTO      0.2951  0.5024  0.5683  0.4192  0.3697  -.0572  0.1470

• Do these eigenvectors mean anything?
– All crimes are positively correlated with the first eigenvector, which is therefore interpreted as a measure of overall crime rate.
– The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE. It is interpreted to measure the preponderance of property crime over violent crime.
Xuhua Xia
Slide 30
PC Plot: Crime Data
[Scatter plot of PC2 against PC1 for the 50 states, labeled by state abbreviation; PC1 runs from -5 to 7 and PC2 from -3 to 3. Labeled clusters in the original figure: Nevada, New York, and California at high PC1; North and South Dakota at low PC1; Massachusetts and Maryland near the top (high PC2); Mississippi, Alabama, Louisiana, and South Carolina at the bottom (low PC2).]
Xuhua Xia
Slide 31
Plot of PC1
[Choropleth US map of the PRIN1 scores by state, drawn with PROC GMAP using 15 levels; the legend bins range from about -3.96 to 5.27.]
Plot of PC2
[Choropleth US map of the PRIN2 scores by state, drawn with PROC GMAP using 15 levels; the legend bins range from about -2.55 to 2.63.]