Optimising the Minkowski distance, part 1: vector subtraction
In the last blog post, we discussed how to calculate the Manhattan and Euclidean distances from first principles. However, in that post, we did a very manual implementation for a single pair of vectors, which would not generalise well to more than one pair and would become cumbersome for more than a few dimensions. As such, in this blog post we will discuss how to implement the Minkowski distance calculation for vectors of any number of dimensions and for any number of pairs. As doing pairwise calculations can become very computationally expensive, we will also be exploring how to optimise our distance calculations using some further knowledge from linear algebra in both this blog post and the coming two.
Our inital attempt
Let’s start by revisiting our manual implementation of the Euclidean distance:
import numpy as np
X = [5, 7, 3, 9]
Y = [1, 6, 2, 4]
i = len(X)
p = 2
diff_1 = np.abs(X[0] - Y[0]) ** p
diff_2 = np.abs(X[1] - Y[1]) ** p
diff_3 = np.abs(X[2] - Y[2]) ** p
diff_4 = np.abs(X[3] - Y[3]) ** p
euclidean_distance_sq = diff_1 + diff_2 + diff_3 + diff_4
euclidean_distance = euclidean_distance_sq ** (1 / p)
euclidean_distance
6.557438524302
Right away, we can see one possibility for generalising our code over as many dimensions as we like: we can run all of our difference calculations in a loop. Let’s see how this looks:
diffs = []
for element in np.arange(0, i):
diffs.append(np.abs(X[element] - Y[element]) ** p)
We now have a list containing the squared absolute differences, which we can then add.
euclidean_distance_sq = sum(diffs)
Let’s bundle this up into our first Minkowski distance function.
def calculate_minkowski_distance(X, Y, p):
"""Calculates the Minkowski distance between two vectors, X and Y. When p = 1, calculates the Manhattan distance, when p = 2, calculates the Euclidean distance."""
i = len(X)
diffs = []
for element in np.arange(0, i):
diffs.append(np.abs(X[element] - Y[element]) ** p)
abs_diffs = sum(diffs)
return abs_diffs ** (1 / p)
If we apply it to our two test vectors, we can see we get the exact same answer as when we calculated it step-by-step:
calculate_minkowski_distance(X, Y, 2)
6.557438524302
We now need a way of applying our newly minted Minkowski distance function to multiple pairs at the same time. What we need to do is compare every vector all other vectors pairwise, and the simplest way of doing this is within a nested loop. We’ll create an additional vector, \(Z\) to compare to \(X\) and \(Y\). We then combine these three vectors into a list of lists.
Z = [8, 8, 3, 1]
vectors = [X, Y, Z]
What we’ll now do is loop over each vector in this list, comparing \(X\) with itself, \(X\) with \(Y\), \(X\) with \(Z\), etc., using two nested for loops.
distances = []
for a in vectors:
tmp_distances = []
for b in vectors:
tmp_distances.append(calculate_minkowski_distance(a, b, 2))
distances.append(tmp_distances)
Finally, we can output the pairwise distance matrix to a Pandas DataFrame.
import pandas as pd
pd.DataFrame(distances)
0 | 1 | 2 | |
---|---|---|---|
0 | 0.000000 | 6.557439 | 8.602325 |
1 | 6.557439 | 0.000000 | 7.937254 |
2 | 8.602325 | 7.937254 | 0.000000 |
We can see that everything makes sense: vectors compared with themselves have a distance of 0, and that the distance we have already calculated between \(X\) and \(Y\) is still returned as 6.56. Let’s now package this up into a function and see how it performs when calculating pairwise differences on our beans dataset.
def apply_minkowski_distance(vectors: list, p: int) -> pd.DataFrame:
distances = []
for a in vectors:
tmp_distances = []
for b in vectors:
tmp_distances.append(calculate_minkowski_distance(a, b, p))
distances.append(tmp_distances)
return pd.DataFrame(distances)
As a refresher for those of you who may not have read the last blog post, the dry bean dataset describes the characteristics of 13,611 images of dried beans across 7 different types. There are 16 features describing the beans, such as their area, aspect ratio and roundness. We’ll prepare three samples from this dataset for testing: a sample of 100 observations using only three of the features (MajorAxisLength
, MinorAxisLength
and roundness
); all observations using only the above three features; and the entire dataset with all 16 features.
For each of our tests, we’ll use the Euclidean distance (so, \(p = 2\)).
As our function above takes in a list of lists containing each of the vectors, we convert our Pandas DataFrames to this format by first extracting the values of each row using values
and then converting the values to a list using tolist()
.
beans = pd.read_excel("data/Dry_Bean_Dataset.xlsx")
list_sample_1 = beans[["MajorAxisLength", "MinorAxisLength", "roundness"]][:100].values.tolist()
list_sample_2 = beans[["MajorAxisLength", "MinorAxisLength", "roundness"]].values.tolist()
list_sample_3 = beans.drop(columns=["Class"]).values.tolist()
We then run the pairwise Euclidean distance calculation over each of our samples.
%time list_min_1 = apply_minkowski_distance(list_sample_1, 2)
CPU times: user 84.5 ms, sys: 34.2 ms, total: 119 ms
Wall time: 89.4 ms
Our small sample with three features finishes relatively quickly.
%time list_min_2 = apply_minkowski_distance(list_sample_2, 2)
CPU times: user 16min 47s, sys: 18.2 s, total: 17min 5s
Wall time: 16min 56s
However, when we move up to just over 100 times that many observations, our operation takes more than 10,000 times as long to complete, clocking in at an eyewatering 17 minutes. It’s easy to see why when we look at the code: in our nested for loop we’re multiplying every single element by every other element (including itself), meaning that the number of operations we need to complete is the square of the number of vectors. You can confirm this by looking at our example using \(X\), \(Y\) and \(Z\): we had three vectors, but ended up with 9 calculations in our distance matrix.
%time list_min_3 = apply_minkowski_distance(list_sample_3, 2)
CPU times: user 1h 5min 48s, sys: 2min 15s, total: 1h 8min 4s
Wall time: 1h 6min 37s
When using all 16 features, we can see that our processing time has just increased approximately fourfold. This makes sense given that we’re looping over each element in the vector to get the difference scores. Since we’ve jumped from 3 to 16 features, we’d expect a corresponding increase in the processing time for each distance calculation. This calculation of the difference score element-by-element is the first inefficiency we’ll tackle. To understand how to do so, let’s have a discussion about some of the arithmetic we can do with vectors.
Vector subtraction
Just like with numbers, we can do some simple arithmetic operations with vectors. Relevant to our case, we can take two vectors with the same number of elements and subtract them. This results in the corresponding elements being lined up and the second subtracted from the first. Let’s see an example.
Xm = np.array([5, 7, 3, 9])
Ym = np.array([1, 6, 2, 4])
Ym - Xm
array([-4, -1, -1, -5])
We can see that by subtracting \(Xm\) from \(Ym\), we ended up with pairwise subtraction of each element of each vector, with \(Ym_1 - Xm_1 = 1 - 5 = -4\), \(Ym_2 - Xm_2 = 6 - 7 = -1\), and so on. You can probably see already that if we can replace the for loop to calculate the difference in our Minkowski function with this vectorised solution, we could save quite a lot of processing time.
In order to fully replace this functionality, we need to know about two other things we can do with vectors. Using NumPy, we can apply an absolute difference function to each element in the vector. In addition, we can raise each element in a vector to a power (also known as the Hadamard power) using NumPy’s power
method. Let’s see these both in action.
np.abs(Ym - Xm)
array([4, 1, 1, 5])
You can see it has worked, with all values now converted to positive. Let’s now apply the power
function:
np.power(np.abs(Ym - Xm), 2)
array([16, 1, 1, 25])
We’ve now replicated all the functionality we were including in the loop in our Minkowski function. Let’s update it and see whether our performance has improved.
def calculate_minkowski_distance(X, Y, p):
"""Calculates the Minkowski distance between two vectors, X and Y. When p = 1, calculates the Manhattan distance, when p = 2, calculates the Euclidean distance."""
diffs = np.power(np.abs(X - Y), p)
abs_diffs = sum(diffs)
return abs_diffs ** (1 / p)
As a quick sanity check, let’s confirm that we get the same answer as with our previous implementation of the function.
calculate_minkowski_distance(Xm, Ym, 2)
6.557438524302
Having confirmed this, let’s test it out with the same samples as we used earlier. We now need to convert these into NumPy arrays to be able to exploit the vector functions we’ve used in our function.
array_sample_1 = beans[["MajorAxisLength", "MinorAxisLength", "roundness"]][:100].to_numpy()
array_sample_2 = beans[["MajorAxisLength", "MinorAxisLength", "roundness"]].to_numpy()
array_sample_3 = beans.drop(columns=["Class"]).to_numpy()
Let’s now test out how it performs compared to our last implementation.
%time list_min_1 = apply_minkowski_distance(array_sample_1, 2)
CPU times: user 52.5 ms, sys: 2.74 ms, total: 55.3 ms
Wall time: 53.6 ms
%time list_min_2 = apply_minkowski_distance(array_sample_2, 2)
CPU times: user 13min 1s, sys: 6.9 s, total: 13min 8s
Wall time: 13min 10s
%time list_min_3 = apply_minkowski_distance(array_sample_3, 2)
CPU times: user 17min 16s, sys: 7.35 s, total: 17min 23s
Wall time: 17min 26s
Whoo hoo, we managed to knock our calculation using all 16 vectors down to around a quarter of the original time! It seems that eliminating the loop to calculate the difference between vector elements has definitely sped things up. However, ~15 minutes is still a long time, especially given that we’re only dealing with around 13,000 observations. This current implementation is still being slowed down by the nested for loop we’re using to apply the Minkowski distance formula to all pairs. Luckily, NumPy has another solution for us here called broadcasting, which we will discuss in the next blog post in this series.