Doing hierarchical clustering with a precalculated dissimilarity index
Hierarchical clustering functionality in R is great, right? Between `dist` and `vegdist` it is possible to base your clustering on almost any method you want, from cosine to Canberra. However, what if you want to use a different or custom method, and you've already calculated the distances separately? All of the documentation for the `hclust` function asks you to start with raw data from which R can calculate the distances between pairs. How do you get `hclust` to read in pre-calculated distance scores?
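For contrast, here is a minimal sketch of the workflow the documentation assumes, where `dist` calculates the distances for you (Euclidean by default; the `scale` call is my own addition so that no one variable dominates):

```r
# The usual hclust workflow: raw data in, dist() calculates Euclidean
# distances between the (scaled) rows, and hclust() clusters from there.
standardClusters <- hclust(dist(scale(mtcars)))
plot(standardClusters)
```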
In this example, let's say I've already manually calculated the Jaccard dissimilarity between each pair of cars from the `mtcars` dataset, and put them in a data.frame called `carJaccard`:

```
##                car1      car2 jaccardDissimilarity
## 1         Mazda RX4 Mazda RX4          0.000000000
## 2     Mazda RX4 Wag Mazda RX4          0.002471232
## 3        Datsun 710 Mazda RX4          0.237474920
## 4    Hornet 4 Drive Mazda RX4          0.251866514
## 5 Hornet Sportabout Mazda RX4          0.461078746
## 6           Valiant Mazda RX4          0.211822414
```
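(If you want to build a similar long-format table yourself, here is one hypothetical way to do it. The min-max scaling and the weighted-Jaccard formula below are my own assumptions for illustration, not necessarily the metric behind the scores above.)

```r
# A toy weighted-Jaccard dissimilarity between two non-negative vectors.
jaccard_dissim <- function(x, y) 1 - sum(pmin(x, y)) / sum(pmax(x, y))

# Min-max scale each column of mtcars to [0, 1] so the metric is well-behaved.
scaled <- apply(mtcars, 2, function(col) (col - min(col)) / (max(col) - min(col)))

# Every pair of cars, one row per pairing.
pairs <- expand.grid(car1 = rownames(mtcars), car2 = rownames(mtcars),
                     stringsAsFactors = FALSE)
pairs$jaccardDissimilarity <- mapply(
  function(a, b) jaccard_dissim(scaled[a, ], scaled[b, ]),
  pairs$car1, pairs$car2
)
head(pairs)
```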
So how do we get `hclust` to recognise this as a distance matrix? The first step is to reshape this data.frame into a regular matrix. We can do this using `acast` from the `reshape2` package:

```r
library(reshape2)
regularMatrix <- acast(carJaccard, car1 ~ car2, value.var = "jaccardDissimilarity")
```
We now need to convert this into a distance matrix. The first step is to make sure that we don't have any NAs in the matrix, which happens when pairings are created in your matrix that don't exist in your data.frame of dissimilarity scores. If there are any, these should be replaced with whatever value indicates complete dissimilarity for your chosen metric (in the case of Jaccard dissimilarity, this is 1).

```r
regularMatrix[is.na(regularMatrix)] <- 1
```
We are now able to convert our matrix into a distance matrix like so:

```r
distanceMatrix <- as.dist(regularMatrix)
```
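Before clustering, it can be worth checking that your matrix really is a valid dissimilarity matrix: `as.dist` only reads the lower triangle, so an asymmetric matrix would silently lose information. A small self-contained illustration of the checks (swap your own matrix in for the stand-in `m`):

```r
# Stand-in 3x3 dissimilarity matrix; use your own regularMatrix here.
m <- matrix(c(0.0, 0.2, 0.5,
              0.2, 0.0, 0.3,
              0.5, 0.3, 0.0), nrow = 3)
stopifnot(isSymmetric(m))      # as.dist() keeps only the lower triangle
stopifnot(all(diag(m) == 0))   # each item should be 0 away from itself
d <- as.dist(m)
```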
Let's throw it into `hclust` and see how we went!

```r
clusters <- hclust(distanceMatrix, method = "ward.D2")
plot(clusters)
```
Great! As you can see from the dendrogram above, we've ended up with what look to be some fairly sensible clusters (well, at least to someone like me who knows nothing about cars - the ones with the same names are together, so that looks good!).
We can also use the results of our clustering exercise to create groups based on a selected cutoff using the `cutree` function from the `stats` package. Looking at the dendrogram above, 0.5 looks like it will get us a good number of groups:

```r
group <- cutree(clusters, h = 0.5)
```
(As an aside, you can also specify the number of groups you want using the `k` argument in `cutree` rather than the `h`, or height, argument. Check out `help(cutree)` for more details on this.)
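For example, here is a self-contained sketch of the two ways of cutting; the `k = 4` is an arbitrary choice for illustration:

```r
# Cutting the same tree by height versus by a requested number of groups.
hc <- hclust(dist(scale(mtcars)))
byHeight <- cutree(hc, h = 5)   # every merge below height 5 stays grouped
byCount  <- cutree(hc, k = 4)   # exactly 4 groups, wherever the cut lands
table(byCount)                  # how many cars ended up in each group
```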
The output of `cutree` looks … weird. Where are our groups? Never fear, we just need to take one final step in order to connect up our groupings with our car names.

```r
groups <- as.data.frame(group)
groups$cars <- rownames(groups)
rownames(groups) <- NULL
groups <- groups[order(groups$group), ]
head(groups)
```

```
##    group          cars
## 1      1     Mazda RX4
## 2      1 Mazda RX4 Wag
## 3      1    Datsun 710
## 8      1     Merc 240D
## 9      1      Merc 230
## 10     1      Merc 280
```
As you can see, we just need to convert the results of `cutree` to a data.frame, which has the car names as its row names. In order to make it a bit neater, I've pulled the car names out into a column and ordered the rows by cluster.
As a final note, you may have noticed that I kept referring to dissimilarity scores in this post - this is for good reason! `hclust` is based on the dissimilarity between pairs, rather than their similarity. I made this mistake the first time I used it and ended up with, to my puzzlement, a set of groups containing the most disparate pairs in the whole dataset! So learn from my foolishness and make sure you are using dissimilarity scores from the outset.
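If what you have are similarity scores bounded by 1 (as Jaccard similarity is), the fix is simple: subtract them from 1 before clustering. A toy sketch with made-up similarity values:

```r
# Made-up similarity matrix for three items.
sim <- matrix(c(1.0, 0.8, 0.2,
                0.8, 1.0, 0.4,
                0.2, 0.4, 1.0), nrow = 3,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
dissim <- 1 - sim               # similarity -> dissimilarity
clust <- hclust(as.dist(dissim))
```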