Principal Component Analysis (PCA)

Introduction to Multivariate Analysis
Biology 4605/7220
Chih-Lin Wei
Canadian Healthy Oceans Network Postdoc Fellow
Ocean Sciences Centre, MUN
My Background
• Benthic ecologist:
  Community ecology
  How environments control macroecological patterns in the deep sea
  Interested in R but “NOT a statistician”.
• Education: BS in Zoology in Taiwan; MS & PhD in Biological Oceanography, Texas A&M University
• Current project: Scaling up regional benthic diversity and standing stock patterns using ecological modeling approaches
Lecture Contents
• Visualization
• Resemblance index
• Cluster analysis
• Ordination
• Correlation
• Testing for difference
• Other stuff
Clarke & Warwick (2001)
Front Matter
• Mostly non-parametric, permutation-based techniques
• Start with the graphical concept
• Followed by examples in simple R code
• No more than 3 lines of code for each example
• Most functions are in base R or the package “vegan”
• All analyses are available in commercial software (PRIMER-E) [demo version]
R packages
# Install and load R Packages
install.packages( c("vegan", "scatterplot3d",
"reshape2", "lattice", "clustsig") )
library( vegan )
library( scatterplot3d )
library( reshape2 )
library( lattice )
library( clustsig)
First things first: plot the data
# Violent Crime Rates by US State
plot( USArrests[,1:2] )
[Figure: scatter plot of USArrests; x-axis Murder, y-axis Assault]
3D Scatter Plot
scatterplot3d( USArrests[,1:3] )
[Figure: 3D scatter plot; axes Murder, Assault, UrbanPop]
Scatterplot Matrices
pairs( USArrests )
[Figure: scatterplot matrix of Murder, Assault, UrbanPop, Rape]
Lattice Graphs
# Melt dataframe to flat format
m = melt( USArrests, id.vars = "Assault" )
m
# Multipanel scatter plot
xyplot( value ~ Assault | variable, data = m )
[Figure: multipanel scatter plot of value vs. Assault; panels Murder, UrbanPop, Rape]
Resemblance/distance Indices
* Not good for data with lots of zeros (e.g. species abundance)
Clarke & Warwick (2001)
Resemblance/distance Indices
• D = 0 if the species are identical in the 2 samples
• D = 1 if the 2 samples have no species in common
• Better for species abundance data (with lots of zeros)
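The two end points above can be checked by hand. The following is a minimal sketch in base R, assuming a small made-up abundance table; the names `abund`, `euclid` and `bray_curtis` are illustrative and not from the lecture. The Bray-Curtis form used here, sum|x - y| / sum(x + y), is the standard definition and is assumed to be the index this slide refers to.

```r
# Hypothetical abundance table: 3 samples (rows) x 4 species (columns)
abund <- rbind(
  s1 = c(10, 5, 0, 0),
  s2 = c(10, 5, 0, 0),   # identical to s1
  s3 = c(0, 0, 8, 2)     # shares no species with s1
)

# Euclidean distance between two samples
euclid <- function(x, y) sqrt(sum((x - y)^2))

# Bray-Curtis dissimilarity between two samples
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

bray_curtis(abund["s1", ], abund["s2", ])  # 0: identical samples
bray_curtis(abund["s1", ], abund["s3", ])  # 1: no species in common
euclid(abund["s1", ], abund["s3", ])       # unbounded, depends on abundances
```

Bray-Curtis is bounded between 0 and 1 and ignores species that are absent from both samples, whereas the Euclidean distance treats joint absences as agreement.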
Resemblance/distance Indices
# Euclidean Distance:
dist( USArrests )
# Bray-Curtis Dissimilarity
# Vegetation in lichen pastures
data( varespec )
varespec
vegdist( varespec )
Hierarchical Clustering
• Patterns in a distance or dissimilarity matrix are difficult to detect.
• Find natural groupings by successive fusing of samples
[Figure: Cluster Dendrogram of samples 1 to 4; y-axis Dissimilarity]
Hierarchical Clustering
Linkage Options:
• Single linkage (nearest neighbour clustering)
• Complete linkage (furthest neighbour clustering)
• Group-average linkage
• Ward’s minimum variance
[Figure: Group 1 and Group 2 plotted on axes Sp 1 and Sp 2, showing the Single Link and Complete Link distances between the two groups]
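The first three linkage options differ only in how the distance between two groups is summarised. A minimal sketch in base R, assuming two made-up groups in a 2-species space (the coordinates and object names are hypothetical):

```r
# Hypothetical samples in a 2-species space (Sp 1, Sp 2)
group1 <- rbind(c(1, 1), c(2, 1))
group2 <- rbind(c(5, 4), c(6, 6))

# Pairwise Euclidean distances between members of group 1 (rows)
# and members of group 2 (columns)
between <- as.matrix(dist(rbind(group1, group2)))[1:2, 3:4]

single_link   <- min(between)   # nearest neighbours
complete_link <- max(between)   # furthest neighbours
average_link  <- mean(between)  # group average

c(single = single_link, average = average_link, complete = complete_link)
```

Ward's method is not a distance summary of this kind; it fuses the pair of groups that gives the smallest increase in within-group variance.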
Hierarchical Clustering
# Normalization
arrest = scale( USArrests, center = FALSE )
# Euclidean Distance
d = dist( arrest )
# Dendrograms
plot( hclust( d, "single" ) )
plot( hclust( d, "complete" ) )
plot( hclust( d, "average" ) )
# "ward" was renamed "ward.D" in R 3.1.0
plot( hclust( d, "ward.D" ) )
[Figure: four dendrograms of US states (Single Linkage, Complete Linkage, Group-Average Linkage, Ward's Minimum Variance); y-axes Dissimilarity / Height]
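The normalization step matters because the USArrests variables are on very different scales (Assault runs into the hundreds, Murder stays below 20). A short sketch of what `scale( ..., center = FALSE )` does, written out by hand in base R; the names `rms` and `by_hand` are illustrative:

```r
# scale(center = FALSE) divides each column by its root mean square,
# computed with n - 1 in the denominator
rms <- apply(USArrests, 2, function(x) sqrt(sum(x^2) / (length(x) - 1)))
by_hand <- sweep(as.matrix(USArrests), 2, rms, "/")

arrest <- scale(USArrests, center = FALSE)

# TRUE if the manual version matches scale()
all.equal(by_hand, arrest, check.attributes = FALSE)
```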
Determine Numbers of Clusters
# Using Ward's method ("ward" was renamed "ward.D" in R 3.1.0)
clus = hclust( d, "ward.D" )
plot( clus )
# Cut into 3 groups
rect.hclust( clus, k = 3 )
[Figure: two Cluster Dendrograms of US states (y-axis Height), cut into K=3 and K=6 groups]
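The group membership behind such a cut can be extracted with `cutree()`. A small self-contained sketch, assuming the same normalization, Euclidean distance and Ward clustering as in the lecture code (the names `group3` and `group6` are illustrative):

```r
arrest <- scale(USArrests, center = FALSE)
clus <- hclust(dist(arrest), "ward.D")

# Membership vectors for 3 and 6 clusters
group3 <- cutree(clus, k = 3)
group6 <- cutree(clus, k = 6)

table(group3)          # cluster sizes at K = 3
table(group6, group3)  # each K = 6 cluster falls inside one K = 3 cluster
```

Because the clustering is hierarchical, cutting lower in the tree only subdivides the existing groups; the choice of K itself remains a judgment call.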
Determine Significant Clusters
Clarke et al. (2008, JEMBE 366:56-69)
# 999 permutations
# Group-average clustering
# alpha = 0.05
clus2 = simprof( arrest )
simprof.plot( clus2 )
[Figure: Similarity Profile Test dendrogram of US states]
* Colors = significant clusters
Motivations for Ordination
• A dendrogram is still difficult to understand
• Clustering forces samples into groups even though the compositional changes may be continuous.
• Ordination reduces the dimensionality of multivariate data (the data cloud, so to speak)
• Preferably, it captures the majority of the information in two dimensions, so the multivariate patterns can be shown on a scatter plot.
Principal Component Analysis (PCA)
2 species example
Clarke & Warwick (2001)
Principal Component Analysis (PCA)
3 species example
• PC1 maximizes variance of points projected on it.
• PC2 is perpendicular to PC1
• PC3 is perpendicular to PC1 and PC2
• New orthogonal axes are linear combinations of old data:
PC1 = 0.62 Sp1 + 0.52 Sp2 + 0.58 Sp3
PC2 = -0.73 Sp1 + 0.65 Sp2 + 0.20 Sp3
PC3 = 0.28 Sp1 + 0.55 Sp2 - 0.79 Sp3
Clarke & Warwick (2001)
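The coefficients of such linear combinations are the eigenvectors of the covariance matrix. A minimal sketch in base R, assuming simulated abundances for three correlated species (the data and the names `sp`, `eig`, `coef` and `scores` are made up for illustration and will not reproduce the coefficients on the slide):

```r
# Hypothetical abundances of 3 correlated species in 30 samples
set.seed(1)
sp1 <- rnorm(30, 10, 3)
sp <- cbind(Sp1 = sp1,
            Sp2 = sp1 + rnorm(30, 0, 1),
            Sp3 = sp1 + rnorm(30, 0, 2))

# Eigenvectors of the covariance matrix are the PC axes
eig <- eigen(cov(sp))
coef <- eig$vectors              # columns = PC1, PC2, PC3
round(crossprod(coef), 10)       # identity matrix: axes are orthogonal

# Scores = centred data projected onto the new axes
scores <- scale(sp, scale = FALSE) %*% coef
var(scores[, 1])                 # equals the largest eigenvalue
eig$values[1]
```

The variance of the scores on each axis equals the corresponding eigenvalue, which is why PC1 is the direction of maximum projected variance.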
Principal Component Analysis (PCA)
# PCA
pca = princomp( arrest )
# New orthogonal axes
pairs( pca$scores )
[Figure: scatterplot matrix of PCA scores, Comp.1 to Comp.4]
Principal Component Analysis (PCA)
# Variance of PC axes
plot( pca )
# Total variance explained
summary( pca )
# Variable contributions
# PC1 = -0.65 Murder -0.6 Assault -0.46 Rape
pca$loadings
[Figure: bar plot of Variances for Comp.1 to Comp.4]
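The proportions reported by `summary( pca )` can be reproduced from the component standard deviations. A short sketch, assuming the same normalized USArrests data as the lecture code (`prop_var` is an illustrative name):

```r
arrest <- scale(USArrests, center = FALSE)
pca <- princomp(arrest)

# Proportion of total variance carried by each axis
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 3)
round(cumsum(prop_var), 3)   # cumulative proportion

# Coefficients of PC1 (one column of the loadings matrix)
pca$loadings[, 1]
```

The sign of a component is arbitrary, so the PC1 coefficients may come out with all signs flipped on another platform without changing the interpretation.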
Principal Component Analysis (PCA)
# Cut dendrogram into 6 clusters
group = cutree( clus, 6 )
plot( pca$scores, type = "n" )
text( pca$scores, names( group ), col = group )
[Figure: PCA ordination of US states (Comp.1 vs. Comp.2), labels coloured by cluster]
Principal Component Analysis (PCA)
# Add variable contributions
biplot( pca, scale = 0 )
[Figure: PCA biplot of US states (Comp.1 vs. Comp.2) with arrows for Murder, Assault, UrbanPop, Rape]
Non-Metric Multidimensional Scaling (nMDS)
• Ordination based on a ranked resemblance (or distance) matrix
• Robust and flexible for all kinds of resemblance indices
• Uses an iterative procedure to successively refine the locations of ordination points according to the ranked dissimilarities of samples
• Better choice for species abundance data (compared to PCA)
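The rank-based idea can be tried without vegan. The sketch below swaps in `isoMDS()` from the MASS package (a recommended package shipped with R) rather than the `metaMDS()` used in this lecture, and assumes Euclidean distance on the normalized USArrests data; it is an illustration of the technique, not a reproduction of the lecture's result.

```r
library(MASS)  # provides isoMDS(), Kruskal's non-metric MDS

arrest <- scale(USArrests, center = FALSE)
d <- dist(arrest)

# Iteratively refine a 2-D configuration, starting from metric scaling
fit <- isoMDS(d, k = 2, trace = FALSE)
fit$stress   # stress in percent; lower means a better rank fit

# Rank agreement between observed dissimilarities and ordination distances
cor(as.vector(d), as.vector(dist(fit$points)), method = "spearman")
```

Only the rank order of the dissimilarities is fitted, which is why the method works with any resemblance index.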
Multidimensional Scaling (nMDS)
mds = metaMDS( arrest )
stressplot( mds )
[Figure: stress plot; x-axis Observed Dissimilarity, y-axis Ordination Distance; Non-metric fit, R2 = 0.995; Linear fit, R2 = 0.98]
Multidimensional Scaling (nMDS)
# Ordination with 6 clusters
plot( mds$points, type = "n" )
text( mds$points, names( group ), pch = group, col = group )
# Add variable score
# Weighted average
biplot( mds$points, mds$species )
[Figure: two nMDS ordinations of US states (MDS1 vs. MDS2): labels coloured by cluster, and a biplot with variable scores for Murder, Assault, UrbanPop, Rape]
Correlation between Matrices
# Vegetation and environment
# in lichen pastures
data( varespec )
data( varechem )
# Bray-Curtis Dissimilarity
veg.dist = vegdist( varespec )
# Euclidean distance
env.dist = dist( scale( varechem ) )
[Figure: scatter plot of veg.dist (y-axis) vs. env.dist (x-axis)]
Mantel Test
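[Diagram: Species × Sites table → BC (Bray-Curtis) Sites × Sites matrix → Rank (1, 2, 3, ...); Environ. × Sites table → ED (Euclidean distance) Sites × Sites matrix → Rank (1, 2, 3, ...); ρ = correlation between the two]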
Mantel Test
# Mantel test
# Based on 999 permutations
# Pearson's correlation
man = mantel( veg.dist, env.dist )
man
# Distribution of permuted r
hist( man$perm )
[Figure: histogram of permuted Pearson Correlation (r), y-axis Frequency; observed r = 0.3]
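The permutation logic behind the test can be written out in a few lines. This is a minimal sketch in base R, not vegan's implementation; the function name `mantel_sketch` and the split of USArrests into a "crime" and an "urban" distance matrix are assumptions made purely for illustration.

```r
# Permutation test for the correlation between two distance matrices
mantel_sketch <- function(d1, d2, nperm = 999) {
  m1 <- as.matrix(d1)
  v2 <- as.vector(as.dist(d2))
  r_obs <- cor(as.vector(as.dist(m1)), v2)
  r_perm <- replicate(nperm, {
    i <- sample(nrow(m1))                   # shuffle sample labels
    cor(as.vector(as.dist(m1[i, i])), v2)   # permute rows and columns together
  })
  # One-sided p-value; the observed value counts as one permutation
  p <- (sum(r_perm >= r_obs) + 1) / (nperm + 1)
  list(r = r_obs, perm = r_perm, p = p)
}

set.seed(42)
crime <- dist(scale(USArrests[, c("Murder", "Assault", "Rape")]))
urban <- dist(scale(USArrests["UrbanPop"]))
res <- mantel_sketch(crime, urban)
res$r
res$p
```

Rows and columns must be permuted together because the entries of a distance matrix are not independent: each sample contributes to many pairs.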
Best Environmental Subsets
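[Diagram: Species × Sites table → BC (Bray-Curtis) Sites × Sites matrix → Rank (1, 2, 3, ...); Environ. × Sites table → ED (Euclidean distance) Sites × Sites matrix → Rank (1, 2, 3, ...); ρ = correlation between the two]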
BIOENV
# 16383 possible subsets
# Subset of environmental variables with best correlation to community data
bioenv( varespec, varechem )
[Figure: scatter plot of veg.dist (y-axis) vs. env.dist (N + P + Al + Mn + Baresoil) (x-axis)]
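The subset count follows from the 14 variables in varechem: 2^14 - 1 = 16383 non-empty subsets. The search itself can be sketched by brute force in base R. The example below assumes simulated data with 4 environmental variables and uses Euclidean distance for the community matrix to stay free of package dependencies, so it illustrates the search rather than reproducing `bioenv()`; all object names are hypothetical.

```r
set.seed(7)
n <- 20
env <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))

# Hypothetical community that responds to variables a and b only
comm <- cbind(sp1 = exp(env$a), sp2 = exp(env$b), sp3 = exp(-env$a))
comm_dist <- as.vector(dist(comm))

# All non-empty subsets of the environmental variables: 2^4 - 1 = 15
subsets <- unlist(lapply(seq_along(env), function(k)
  combn(names(env), k, simplify = FALSE)), recursive = FALSE)

# Rank correlation between community and environmental distances
rho <- sapply(subsets, function(vars)
  cor(comm_dist, as.vector(dist(scale(env[vars]))), method = "spearman"))

best <- subsets[[which.max(rho)]]
best
max(rho)
```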
Testing Group Difference for Community Data
data( dune ) # Vegetation in Dutch Dune Meadows
dune
# More species (variables) than samples
# Dominance of zero values
# Violates multivariate normality and constant variance across the groups
# A robust, permutation-based test is needed for community data.
Analysis of Similarity (ANOSIM)
[Diagram: Species × Sites table → BC (Bray-Curtis) Sites × Sites matrix → Rank (1, 2, 3, ...)]
• R = 1: samples within groups are more similar to each other than to samples in other groups
• R = 0: between-group and within-group similarities are the same on average
• R is an absolute measure of group separation
$$R = \frac{r_B - r_W}{n(n-1)/4}$$
r_B = Avg. rank between groups
r_W = Avg. rank within groups
n = sample size
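The statistic can be computed directly from its definition. A minimal sketch in base R, assuming six made-up samples in two clearly separated groups; `anosim_R`, `x` and `g` are illustrative names, and this is not vegan's `anosim()`, which additionally runs the permutation test.

```r
# ANOSIM R from a distance matrix and a grouping vector
anosim_R <- function(d, grouping) {
  d <- as.dist(d)
  n <- attr(d, "Size")
  r <- rank(as.vector(d))                 # ranked dissimilarities
  # For every pair (same order as the dist object): same group or not?
  same <- outer(grouping, grouping, "==")
  within <- same[lower.tri(same)]
  (mean(r[!within]) - mean(r[within])) / (n * (n - 1) / 4)
}

# Hypothetical samples: two tight, well-separated groups
x <- rbind(c(0, 0), c(0, 1), c(1, 0),
           c(10, 10), c(10, 11), c(11, 10))
g <- c("A", "A", "A", "B", "B", "B")

anosim_R(dist(x), g)   # 1: every within-group pair ranks below every between-group pair
```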
Analysis of Similarity (ANOSIM)
# Does moisture have an effect on vegetation?
# Environment factors in Dutch Dune Meadows
data( dune.env )
Moisture = as.numeric( dune.env$Moisture )
# Run a MDS on dune vegetation
mds = metaMDS( dune )
# MDS plot seems to suggest moisture effect
plot( mds$points, pch = 21, bg = Moisture, cex = Moisture )
[Figure: nMDS of Vegetation in Dutch Dune Meadows (MDS1 vs. MDS2); points coloured and sized by Moisture level (1 to 4)]
Analysis of Similarity (ANOSIM)
aos = anosim( dune, Moisture )
aos
# Distribution of permuted R
hist( aos$perm )
[Figure: histogram of permuted ANOSIM R-statistics, y-axis Frequency; observed R = 0.43]
Other Useful Functions
Clustering:
• pam() for clustering around medoids and clara() for clustering large data (both in “cluster”)
• pvclust() in “pvclust” for assessing the uncertainty in hierarchical cluster analysis
Ordination:
• Great PCA video explanation on YouTube
• imputePCA() in “missMDA” for handling missing data
• cca() and rda() in “vegan” for constrained ordinations
Testing difference:
• mrpp() in “vegan” for ANOSIM-type analysis, but using the original dissimilarities instead of their ranks.
• adonis() in “vegan” (adonis2() in recent versions) for robust and flexible multivariate permutational analysis of variance (e.g. factorial & nested design, mixed model, etc.)
• betadisper() in “vegan” for testing constant multivariate variance (or dispersion)