Doing hierarchical clustering with a precalculated dissimilarity index


Hierarchical clustering functionality in R is great, right? Between dist (from the stats package) and vegdist (from the vegan package), it is possible to base your clustering on almost any method you want, from cosine to Canberra. However, what if you want to use a different or custom method, and you’ve already calculated the distances separately? Most examples of the hclust function start with raw data from which R calculates the distances between pairs. How do you get hclust to read in pre-calculated distance scores?

In this example, let’s say I’ve already manually calculated the Jaccard dissimilarity between each pair of cars from the mtcars dataset, and put them in a data.frame called carJaccard.

##                car1      car2 jaccardDissimilarity
## 1         Mazda RX4 Mazda RX4          0.000000000
## 2     Mazda RX4 Wag Mazda RX4          0.002471232
## 3        Datsun 710 Mazda RX4          0.237474920
## 4    Hornet 4 Drive Mazda RX4          0.251866514
## 5 Hornet Sportabout Mazda RX4          0.461078746
## 6           Valiant Mazda RX4          0.211822414
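
If you want something to follow along with, here is a minimal sketch of one way to build a long-format table with the same three columns. The jaccardDiss helper and the carJaccardSketch name are made up for illustration, and the quantitative Jaccard formula used here (1 minus the sum of pairwise minima over the sum of pairwise maxima) is an assumption about how such scores might be calculated, not necessarily how the table above was produced.

```r
# Hypothetical helper: quantitative Jaccard dissimilarity between two
# non-negative numeric vectors (an assumed formula, for illustration only)
jaccardDiss <- function(x, y) 1 - sum(pmin(x, y)) / sum(pmax(x, y))

# Every ordered pairing of car names, including each car with itself
cars <- rownames(mtcars)
carJaccardSketch <- expand.grid(car1 = cars, car2 = cars,
                                stringsAsFactors = FALSE)

# Score each pairing from the corresponding rows of mtcars
carJaccardSketch$jaccardDissimilarity <- mapply(
  function(a, b) jaccardDiss(unlist(mtcars[a, ]), unlist(mtcars[b, ])),
  carJaccardSketch$car1, carJaccardSketch$car2,
  USE.NAMES = FALSE
)

head(carJaccardSketch)
```

Any pairwise score works the same way, as long as you end up with one row per pairing and a single numeric column of dissimilarities.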

So how do we get hclust to recognise this as a distance matrix? The first step is to reshape this data.frame into a regular matrix. We can do this using acast from the reshape2 package.

library(reshape2)
regularMatrix <- acast(carJaccard, car1 ~ car2, value.var = "jaccardDissimilarity")

We now need to convert this into a distance matrix. Before doing so, we need to make sure that we don’t have any NAs in the matrix, which appear when the matrix contains pairings that don’t exist in your data.frame of dissimilarity scores. If there are any, these should be replaced with whatever value indicates complete dissimilarity for your chosen metric (in the case of Jaccard dissimilarity, this is 1).

regularMatrix[is.na(regularMatrix)] <- 1
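
To see where those NAs come from, here is a small sketch with made-up data in which two pairings are deliberately left out. It uses base R matrix indexing instead of acast so that it runs without extra packages; the scores data.frame, its column names and the toyMatrix name are all hypothetical.

```r
# Hypothetical long-format scores; the (a, c) and (c, a) pairings are missing
scores <- data.frame(
  item1 = c("a", "b", "a", "b", "c", "c", "b"),
  item2 = c("a", "a", "b", "b", "b", "c", "c"),
  diss  = c(0, 0.2, 0.2, 0, 0.6, 0, 0.6),
  stringsAsFactors = FALSE
)

# Start from an all-NA square matrix with one row and column per item
items <- sort(unique(c(scores$item1, scores$item2)))
toyMatrix <- matrix(NA_real_, nrow = length(items), ncol = length(items),
                    dimnames = list(items, items))

# Fill in the pairings we actually have scores for
toyMatrix[cbind(scores$item1, scores$item2)] <- scores$diss

# Count the pairings that were never scored
missingPairs <- sum(is.na(toyMatrix))

# Treat unscored pairings as completely dissimilar
toyMatrix[is.na(toyMatrix)] <- 1
```

Filling with 1 only makes sense for metrics bounded at 1, so check the maximum of your own metric before borrowing this.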

We are now able to convert our matrix into a distance matrix like so:

distanceMatrix <- as.dist(regularMatrix)
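
One thing worth knowing here: when given a matrix, as.dist only uses the lower triangle and ignores the rest, so it will not warn you if your matrix is asymmetric. The sketch below uses a toy matrix with made-up values (the names m and d are just for illustration) to show a quick check you could run first.

```r
# Toy asymmetric matrix: the upper triangle holds 0.9 everywhere,
# the lower triangle holds different values
m <- matrix(c(0,   0.9, 0.9,
              0.2, 0,   0.9,
              0.3, 0.4, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# A quick sanity check before converting
symmetric <- isSymmetric(m)

# as.dist keeps the lower triangle only, so the 0.9 values are dropped
d <- as.dist(m)
as.matrix(d)
```

If the check fails on your own matrix, it is worth finding out why before clustering, because the two halves disagree about how far apart some pairs are.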

Let’s throw it into hclust and see how we went!

clusters <- hclust(distanceMatrix, method = "ward.D2")
plot(clusters)

Great! As you can see from the dendrogram above, we’ve ended up with some fairly sensible-looking clusters (well, at least to someone like me who knows nothing about cars - the ones with the same names are together, so that looks good!).

We can also use the results of our clustering exercise to create groups based on a selected cutoff using the cutree function from the stats package. Looking at the dendrogram above, 0.5 looks like it will get us a good number of groups:

group <- cutree(clusters, h = 0.5)

(As an aside, you can also specify the number of groups you want using the k argument in cutree rather than the h, or height, argument. Check out help(cutree) for more details on this.)
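
Here is a minimal sketch of the two ways of cutting the tree side by side. To keep it self-contained it clusters Euclidean distances on scaled mtcars, which is a stand-in and not the Jaccard scores used in this post, and the cutoff values (k = 4 and h = 5) are arbitrary picks for illustration.

```r
# Stand-in distances so the example runs on its own
exampleDist <- dist(scale(mtcars))
exampleClusters <- hclust(exampleDist, method = "ward.D2")

# Ask for a fixed number of groups...
byCount <- cutree(exampleClusters, k = 4)

# ...or cut at a fixed height and take however many groups result
byHeight <- cutree(exampleClusters, h = 5)

table(byCount)
table(byHeight)
```

Either way, the result is a vector of group numbers named by car, so the rest of the workflow is the same.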

The output of cutree looks … weird. Where are our groups? Never fear, we just need to take one final step in order to connect up our groupings with our car names.

groups <- as.data.frame(group)
groups$cars <- rownames(groups)
rownames(groups) <- NULL
groups <- groups[order(groups$group), ]
head(groups)
##    group          cars
## 1      1     Mazda RX4
## 2      1 Mazda RX4 Wag
## 3      1    Datsun 710
## 8      1     Merc 240D
## 9      1      Merc 230
## 10     1      Merc 280

As you can see, we just need to convert the results of cutree to a data.frame. cutree returns a vector of group numbers named by car, so the car names become the row names of the data.frame. In order to make it a bit neater, I’ve pulled the car names out into a column and ordered the rows by cluster.

As a final note, you may have noticed that I kept referring to dissimilarity scores in this post - this is for good reason! hclust is based on dissimilarity between pairs, rather than their similarity. I made this mistake the first time I used it and ended up with, to my puzzlement, a set of groups containing the most disparate pairs in the whole dataset! So learn from my foolishness and make sure you are using dissimilarity scores from the outset.
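
If what you have is a similarity score, you will need to flip it first. The sketch below uses a made-up 3 x 3 similarity matrix bounded between 0 and 1, and assumes that 1 minus the similarity is a sensible dissimilarity for your metric (that holds for Jaccard, but check your own). All of the object names are hypothetical.

```r
# Made-up similarities: a and b are very alike, c is unlike both
similarity <- matrix(c(1.0, 0.9, 0.1,
                       0.9, 1.0, 0.2,
                       0.1, 0.2, 1.0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Flip to a dissimilarity before clustering
dissimilarity <- 1 - similarity
rightWay <- cutree(hclust(as.dist(dissimilarity), method = "average"), k = 2)

# Clustering the similarities directly treats alike items as far apart
wrongWay <- cutree(hclust(as.dist(similarity), method = "average"), k = 2)

rightWay
wrongWay
```

With the flipped scores, a and b land in the same group; with the raw similarities, a is grouped with c instead, which is the same mistake I described above.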