# Doing hierarchical clustering with a precalculated dissimilarity index

Hierarchical clustering functionality in R is great, right? Between `dist` and `vegdist` it is possible to base your clustering on almost any method you want, from cosine to Canberra. However, what if you do want to use a different or custom method, and you’ve already calculated the distances separately? All of the documentation for the `hclust` function asks you to start with raw data from which R can calculate the distances between pairs. How do you get `hclust` to read in pre-calculated distance scores?

In this example, let’s say I’ve already manually calculated the Jaccard dissimilarity between each pair of cars from the `mtcars` dataset, and put them in a data.frame called `carJaccard`.

```
head(carJaccard)
```

```
##                car1      car2 jaccardDissimilarity
## 1         Mazda RX4 Mazda RX4          0.000000000
## 2     Mazda RX4 Wag Mazda RX4          0.002471232
## 3        Datsun 710 Mazda RX4          0.237474920
## 4    Hornet 4 Drive Mazda RX4          0.251866514
## 5 Hornet Sportabout Mazda RX4          0.461078746
## 6           Valiant Mazda RX4          0.211822414
```

So how do we get `hclust` to recognise this as a distance matrix? The first step is to reshape this data.frame into a regular matrix. We can do this using `acast` from the `reshape2` package.

```
library(reshape2)
regularMatrix <- acast(carJaccard, car1 ~ car2, value.var = "jaccardDissimilarity")
```
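If you would rather not depend on `reshape2`, roughly the same reshape can be sketched in base R. The example below is only a minimal sketch on a made-up three-item table (`toyPairs`, its column names and its values are all hypothetical, not the real `carJaccard` data). It fills an empty square matrix by indexing cells with the two name columns.

```r
# Hypothetical long-format table: one row per ordered pair of items.
toyPairs <- data.frame(
  item1 = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
  item2 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  dissimilarity = c(0, 0.2, 0.7, 0.2, 0, 0.5, 0.7, 0.5, 0),
  stringsAsFactors = FALSE
)

# Start from an all-NA square matrix labelled with every item name.
items <- sort(unique(c(toyPairs$item1, toyPairs$item2)))
toyMatrix <- matrix(NA_real_, nrow = length(items), ncol = length(items),
                    dimnames = list(items, items))

# A two-column character matrix indexes cells by (row name, column name).
toyMatrix[cbind(toyPairs$item1, toyPairs$item2)] <- toyPairs$dissimilarity

# A dissimilarity matrix should be symmetric.
isSymmetric(toyMatrix)
```

Any pair that is absent from the table stays `NA`, which makes the gaps easy to spot afterwards.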

We now need to convert this into a distance matrix. Before doing so, make sure there are no NAs in the matrix. These appear when the matrix contains pairings that don’t exist in your data.frame of dissimilarity scores. If there are any, they should be replaced with whatever value indicates complete dissimilarity for your chosen metric (in the case of Jaccard dissimilarity, this is 1).

```
regularMatrix[is.na(regularMatrix)] <- 1
```
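One caveat about this step: `as.dist` reads only the lower triangle of the matrix, so if your table stores each pair just once, some of the NAs may simply be mirrored pairs whose value is already known. The following is a minimal sketch, on a made-up matrix (`toyMatrix` and its values are hypothetical), of copying known values across the diagonal before falling back to 1.

```r
items <- c("a", "b", "c")
toyMatrix <- matrix(NA_real_, 3, 3, dimnames = list(items, items))
diag(toyMatrix) <- 0
toyMatrix["a", "b"] <- 0.2   # stored only as (a, b), not (b, a)
toyMatrix["c", "a"] <- 0.7   # stored only as (c, a), not (a, c)
# the pair (b, c) is missing entirely

# Where a cell is NA but its mirror image is known, copy the mirror value.
mirror <- t(toyMatrix)
fillable <- is.na(toyMatrix) & !is.na(mirror)
toyMatrix[fillable] <- mirror[fillable]

# Anything still missing has no score at all, so treat it as fully dissimilar.
toyMatrix[is.na(toyMatrix)] <- 1
toyMatrix
```

Whether this extra step is needed depends on how your own table of scores was built, so treat it as a check and not a required part of the recipe.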

We are now able to convert our matrix into a distance matrix like so:

```
distanceMatrix <- as.dist(regularMatrix)
```
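It can be worth checking the result before clustering. This is a small sketch on a hypothetical matrix (the names and values are made up) showing that a `dist` object keeps the item labels and can be turned back into a full matrix for inspection.

```r
items <- c("a", "b", "c")
toyMatrix <- matrix(c(0,   0.2, 0.7,
                      0.2, 0,   0.5,
                      0.7, 0.5, 0),
                    nrow = 3, byrow = TRUE, dimnames = list(items, items))
toyDist <- as.dist(toyMatrix)

class(toyDist)          # a "dist" object, which is what hclust expects
labels(toyDist)         # item names carried over from the matrix
attr(toyDist, "Size")   # number of items
as.matrix(toyDist)      # back to a full symmetric matrix for inspection
```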

Let’s throw it into `hclust` and see how we went!

```
clusters <- hclust(distanceMatrix, method = "ward.D2")
plot(clusters)
```

Great! As you can see from the dendrogram above, we’ve ended up with what look to be some fairly sensible clusters (well, at least to someone like me who knows nothing about cars - the ones with the same names are together, so that looks good!).

We can also use the results of our clustering exercise to create groups based on a selected cutoff using the `cutree` function from the `stats` package. Looking at the dendrogram above, 0.5 looks like it will get us a good number of groups:

```
group <- cutree(clusters, h = 0.5)
```

(As an aside, you can also specify the number of groups you want using the `k` argument in `cutree` rather than the `h`, or height, argument. Check out `help(cutree)` for more details on this.)
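To make the difference between the two arguments concrete, here is a minimal sketch on a made-up dissimilarity matrix (the four items and their values are hypothetical). Cutting by `h` gives however many groups exist below that height, while `k` fixes the number of groups directly.

```r
items <- c("a", "b", "c", "d")
toyMatrix <- matrix(c(0,   0.1, 0.8, 0.9,
                      0.1, 0,   0.7, 0.8,
                      0.8, 0.7, 0,   0.2,
                      0.9, 0.8, 0.2, 0),
                    nrow = 4, byrow = TRUE, dimnames = list(items, items))
toyClusters <- hclust(as.dist(toyMatrix), method = "ward.D2")

byHeight <- cutree(toyClusters, h = 0.5)  # cut the tree at height 0.5
byCount  <- cutree(toyClusters, k = 2)    # ask for exactly two groups

byHeight
byCount
```

In this toy case both calls put `a` with `b` and `c` with `d`, but on real data the two arguments only agree if the chosen height happens to leave exactly `k` branches.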

The output of `cutree` looks … weird. Where are our groups? Never fear, we just need to take one final step in order to connect up our groupings with our car names.

```
groups <- as.data.frame(group)
groups$cars <- rownames(groups)
rownames(groups) <- NULL
groups <- groups[order(groups$group), ]
head(groups)
```

```
##    group          cars
## 1      1     Mazda RX4
## 2      1 Mazda RX4 Wag
## 3      1    Datsun 710
## 8      1     Merc 240D
## 9      1      Merc 230
## 10     1      Merc 280
```

As you can see, we just need to convert the results of `cutree` to a data.frame, which has the car names as the row names. In order to make it a bit neater, I’ve pulled the car names out into a column and ordered them based on the clusters.
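If you only want to see which items fall into each group, `split` offers a shorter route. This is just a sketch on a hypothetical named vector shaped like `cutree` output (`toyGroup` and its labels are made up):

```r
# Hypothetical cutree-style result: a named integer vector of group labels.
toyGroup <- c(a = 1L, b = 1L, c = 2L, d = 2L, e = 3L)

# One list element per group, each holding the names of its members.
members <- split(names(toyGroup), toyGroup)
members

# Group sizes at a glance.
table(toyGroup)
```

The data.frame approach is still the better fit when you want to join the groups back onto other columns.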

As a final note, you may have noticed that I kept referring to **dissimilarity** scores in this post - this is for good reason! `hclust` is based on dissimilarity between pairs, rather than their similarity. I made this mistake the first time I used it and ended up with, to my puzzlement, a set of groups containing the most disparate pairs in the whole dataset! So learn from my foolishness and make sure you are using dissimilarity scores from the outset.
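If all you have are similarity scores bounded between 0 and 1 (as Jaccard similarity is), one common conversion is to subtract them from 1. A minimal sketch with made-up values (`toySimilarity` is hypothetical):

```r
# Hypothetical similarity scores, where 1 means identical and 0 means nothing shared.
toySimilarity <- c(1, 0.8, 0.25, 0)

# Flip the scale so that 0 means identical and 1 means completely dissimilar.
toyDissimilarity <- 1 - toySimilarity
toyDissimilarity
```

This only makes sense for measures with a known upper bound of 1; other similarity measures may need a different transformation.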