Hal's agglomerative clustering presentation
Hierarchical Clustering in R
Hamilton Elkins
November 14, 2013
Agglomerative Clustering [1]
Data analysis tool used to group items
Builds a binary tree showing items in similar groups
Provides visual representation of the process
Based on distances between items
1. Blei, 2008
Agglomerative Clustering [2]
Works from the bottom up
Each observation starts in its own group
The two closest groups are merged
The process repeats until all observations are merged into a single group
A visual example in Blei (2008) starts on page 15 of the PDF
2. Blei, 2008
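The bottom-up process described above can be traced on a tiny data set. This is only a sketch: the five values are made up for illustration, and single linkage is chosen so the merge heights are easy to check by hand.

```r
# Five made-up points on a line (illustrative values, not from the presentation)
x <- c(a = 1, b = 2, c = 6, d = 7.5, e = 15)

d  <- dist(x)                        # pairwise Euclidean distances
hc <- hclust(d, method = "single")   # merge the two closest groups at each step

# One row per merge: negative entries are single observations,
# positive entries refer to groups formed at an earlier step
print(hc$merge)

# Distance at which each merge happened; n observations give n - 1 merges
print(hc$height)
```

With five observations the tree has four merges, and the last one joins everything into a single group.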
R Packages
stats package
Function hclust
Cluster package
Function agnes
Function daisy
hclust [3, 4]
Requires a pre-calculated distance matrix
Plotting the result produces a dendrogram
The dendrogram is a visual representation of the agglomerative clustering process
Groups that merge at a much greater height than their subgroups merged may be natural clusters (Tibshirani et al., 2001)
3. stat.berkeley 4. Blei, 2008
[Figure: dendrogram of 50 observations]
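A minimal sketch of the hclust workflow follows. It assumes the built-in USArrests data set (which happens to have 50 rows) as a stand-in; it is not the data used in the presentation.

```r
# USArrests ships with R; used here only as a stand-in data set
d  <- dist(USArrests)     # hclust needs a distance matrix, not raw data
hc <- hclust(d)           # default linkage is "complete"

# Plotting an hclust object draws the dendrogram
plot(hc, labels = FALSE, main = "Complete-linkage dendrogram, 50 observations")
```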
Distance Matrix [5, 6, 7]
Agglomerative clustering depends entirely on the distances between observations
Working from a distance matrix is more general than clustering the raw observations, since any dissimilarity measure can be supplied
Function dist calculates the distance matrix
Options for the distance between observations:
Euclidean, Manhattan, maximum, Canberra, binary, Minkowski
5. Blei, 2008 6. stat.berkeley 7. astrostatistics.psu
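To see how the choice of measure changes the result, the sketch below applies several dist methods to two made-up points. The points and variable names are illustrative only.

```r
# Two made-up points, (0, 0) and (3, 4)
m <- rbind(a = c(0, 0), b = c(3, 4))

d_euc <- dist(m, method = "euclidean")   # sqrt(3^2 + 4^2) = 5
d_man <- dist(m, method = "manhattan")   # |3| + |4| = 7
d_max <- dist(m, method = "maximum")     # max(|3|, |4|) = 4
d_mink <- dist(m, method = "minkowski", p = 3)  # (3^3 + 4^3)^(1/3)

print(c(euclidean = as.numeric(d_euc), manhattan = as.numeric(d_man),
        maximum = as.numeric(d_max), minkowski3 = as.numeric(d_mink)))
```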
Distance Matrix [8]
Function daisy in the cluster package offers more options
Provides the Gower distance, which can handle non-numerical variables
Allows individual treatment of variables through the type argument:
ordratio: treats ratio-scaled variables as ordinal
logratio: log-transforms ratio-scaled variables
asymm: asymmetric binary
symm: symmetric binary
8. Maechler, 2013
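A small sketch of daisy on mixed data follows. The data frame and its column names are invented for illustration, and the cluster package is assumed to be installed.

```r
library(cluster)

# Made-up mixed-type data: one ratio-scaled, one nominal, one 0/1 variable
df <- data.frame(
  income   = c(20, 35, 50, 120),
  region   = factor(c("north", "south", "south", "west")),
  owns_car = c(1, 0, 1, 1)
)

# Gower dissimilarity; income is log-transformed before use and
# owns_car is treated as an asymmetric binary variable
g <- daisy(df, metric = "gower",
           type = list(logratio = "income", asymm = "owns_car"))

print(g)
```

The result is a dissimilarity object that can be passed to hclust or to agnes with diss = TRUE.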
Missing Values in hclust [10]
Function dist accepts missing values: for a pair of rows, variables with an NA are skipped and the sum is scaled up to compensate
10. astrostatistics.psu
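The sketch below shows how dist treats an NA, using two made-up rows. Note that hclust itself still needs a distance matrix free of NA values, so a pair of rows with no variables in common remains a problem.

```r
# Two made-up rows; the second has a missing value in column 2
m <- rbind(a = c(1, 2, 3), b = c(4, NA, 7))

d <- dist(m)   # Euclidean by default

# Only columns 1 and 3 are usable: 3^2 + 4^2 = 25.
# The sum is scaled by 3/2 (3 columns in total, 2 used) before the square root.
by_hand <- sqrt(25 * 3 / 2)

print(c(dist = as.numeric(d), by_hand = by_hand))
```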
Number of Observations
Hierarchical clustering is visual in nature [11]
The dendrogram shows the entire tree, from single observations to one all-inclusive cluster
Splits and heights matter for interpretation
Being able to read the dendrogram is vital
11. Blei, 2008
Standardizing Data [12]
Puts all variables on a common scale so that each contributes equally to the clusters
12. stat.berkeley
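As a sketch of why this matters, the example below compares distances before and after standardizing with scale. The salary and years values are made up for illustration.

```r
# Made-up data with two variables on very different scales
raw <- data.frame(salary = c(30000, 45000, 60000, 90000),
                  years  = c(1, 3, 5, 9))

std <- scale(raw)   # subtract each column mean, divide by its standard deviation

print(colMeans(std))       # approximately 0 for every column
print(apply(std, 2, sd))   # 1 for every column

# Unstandardized distances are dominated by salary; standardized ones are not
print(dist(raw))
print(dist(std))
```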
Linkage [13, 14]
The method that joins observations and groups to form clusters
Linkage is a measure of inter-cluster distance
hclust defaults to complete linkage but offers other options:
average, ward (named ward.D in R 3.1.0 and later), single, mcquitty, median, centroid
Different linkages produce different dendrograms
13. ecology.msu 14. stat.ethz
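The effect of the linkage choice can be sketched by fitting several methods to the same distance matrix. USArrests is an assumed stand-in data set, and ward.D2 is used here because current versions of R no longer list a method called plain "ward".

```r
# Same standardized distances, four different linkage methods
d <- dist(scale(USArrests))
methods <- c("complete", "average", "single", "ward.D2")

trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Height of the final merge differs from method to method
print(sapply(trees, function(hc) max(hc$height)))

# Side-by-side dendrograms make the structural differences visible
op <- par(mfrow = c(2, 2))
for (m in methods) plot(trees[[m]], labels = FALSE, main = m)
par(op)
```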
agnes (Agglomerative Nesting) [16, 17]
Can use either a pre-calculated distance matrix or raw values
Offers options for how the distance is calculated
Results differ depending on whether the data are standardized
Produces an agglomerative coefficient
Measures the strength of the cluster structure
Average over observations of 1 - (dissimilarity at the observation's first merge / dissimilarity at the final merge)
Increases with sample size
16. Maechler, 2013 17. Glynn, 2005
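The sketch below runs agnes on raw values and then recomputes the agglomerative coefficient by hand from the definition above. USArrests is an assumed stand-in data set, and the by-hand loop is an illustration of the formula, not the package's own code.

```r
library(cluster)

# Raw values in, with agnes doing the standardizing and distance calculation
ag <- agnes(USArrests, stand = TRUE, method = "average")
print(ag$ac)   # agglomerative coefficient reported by agnes

# Recompute it: for each observation, find the height of its first merge
hc <- as.hclust(ag)
first_height <- numeric(nrow(USArrests))
for (k in seq_len(nrow(hc$merge))) {
  singles <- -hc$merge[k, hc$merge[k, ] < 0]   # observations merged at step k
  first_height[singles] <- hc$height[k]
}
manual_ac <- mean(1 - first_height / max(hc$height))
print(manual_ac)
```

A coefficient close to 1 suggests strong structure, but since it grows with the number of observations it should not be compared across data sets of very different sizes.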
Reading a Dendrogram [19]
Height: how far apart two clusters are when they merge
Height is determined by the linkage method
Small height jumps between successive merges indicate poorly differentiated clusters
A large height jump between a group's last internal merge and the merge that absorbs it indicates a well-differentiated cluster
19. stat.berkeley
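One way to act on the height jumps is to look for the largest gap between successive merge heights and cut the tree there. This is a heuristic sketch on the assumed stand-in data set USArrests, not a rule stated in the presentation.

```r
hc <- hclust(dist(scale(USArrests)), method = "average")

n    <- length(hc$height) + 1   # number of observations
gaps <- diff(hc$height)         # jump in height from one merge to the next
j    <- which.max(gaps)         # the largest jump follows merge j

# After j merges there are n - j groups left, so cut the tree there
k      <- n - j
groups <- cutree(hc, k = k)

print(k)
print(table(groups))
```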
Pitfalls [20, 21]
Choices matter and produce different results
Different distance measures can produce vastly different distance matrices
Different linkage choices can lead to vastly different clusters
The algorithm finds clusters and groupings even if there are none in reality
Better suited to descriptive purposes
20. Blei, 2008 21. stat.berkeley
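The last pitfall is easy to demonstrate: the sketch below clusters uniform random noise, which has no group structure by construction, and still gets back as many groups as requested. The seed and sample size are arbitrary.

```r
set.seed(1)                               # arbitrary seed, for reproducibility
noise <- matrix(runif(200), ncol = 2)     # 100 points with no real clusters

hc     <- hclust(dist(noise))
groups <- cutree(hc, k = 4)               # ask for four groups anyway

print(table(groups))                      # four groups are returned regardless
plot(noise, col = groups, pch = 19, main = "Clusters found in pure noise")
```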
References
astrostatistics.psu.edu. “Distance Matrix Computation”. http://www.astrostatistics.psu.edu/su07/R/stats/html/dist.html
Blei, D. 2008. “Hierarchical clustering, COS424”. Princeton University. http://www.cs.princeton.edu/courses/archive/spr08/cos424/slides/clustering-2.pdf
ecology.msu.montana.edu. “Lab 13 - Cluster Analysis”. http://ecology.msu.montana.edu/labdsv/R/labs/lab13/lab13.html
Glynn, E.F. 2005. “Correlation ‘Distances’ and Hierarchical Clustering”. Stowers Institute for Medical Research. http://research.stowers-institute.org/efg/R/Visualization/cor-cluster/index.htm
Jain, A.K., Murty, M.N., & Flynn, P.J. 1999. Data Clustering: A Review. ACM Computing Surveys 31 (3): 264-323.
Maechler, M. 2013. Package ‘cluster’. http://cran.r-project.org/web/packages/cluster/cluster.pdf
stat.berkeley.edu. “Performing and Interpreting Cluster Analysis”. http://www.stat.berkeley.edu/users/spector/s133/Clus.html
stat.ethz.ch. “Hierarchical Clustering”. http://stat.ethz.ch/R-manual/Rdevel/library/stats/html/hclust.html
Tibshirani, R., Walther, G., & Hastie, T. 2001. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2): 411-423.
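R code to run the simulation shown during the presentation
# The data set must be read in as the object compsetr. A CSV version is available from the ISQS6348 library.
library(cluster)  # provides agnes()
# Keep the variables of interest and drop rows with missing values
salary <- compsetr[, c(1:5, 7, 8, 10)]
sc <- salary[complete.cases(salary), ]
# Standardize every column to mean 0 and standard deviation 1
sal_st <- as.data.frame(scale(sc))
# Complete-linkage clustering on the unstandardized data
scd <- dist(as.matrix(sc))
scc <- hclust(scd)
plot(scc, labels = FALSE)
# Complete-linkage clustering on the standardized data
stdis <- dist(as.matrix(sal_st))
st_cl <- hclust(stdis)
plot(st_cl, labels = FALSE)
# Other linkage methods on the standardized distances
st_cl2 <- hclust(stdis, method = "average")
st_cl4 <- hclust(stdis, method = "single")
st_cl3 <- hclust(stdis, method = "ward.D")  # called "ward" before R 3.1.0
plot(st_cl2, labels = FALSE)
plot(st_cl4, labels = FALSE)
plot(st_cl3, labels = FALSE)
# agnes on the same distance matrix; stand is ignored when diss = TRUE
snc <- agnes(stdis, diss = TRUE, stand = TRUE, method = "ward")
plot(snc, labels = FALSE)
# Cut the Ward tree into 3, 5, 7 and 10 groups
st_clg3 <- cutree(st_cl3, 3)
st_clg5 <- cutree(st_cl3, 5)
st_clg7 <- cutree(st_cl3, 7)
st_clg10 <- cutree(st_cl3, 10)
# Number of observations in each group
table(st_clg3)
table(st_clg5)
table(st_clg7)
table(st_clg10)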