Distance as Metrics
No Pun Intended
In machine learning, being able to calculate distances is essential in many cases. The most obvious use is when dealing with spatial data, but it is also one of the easiest ways to assess membership or similarity in supervised and unsupervised techniques. K-Nearest Neighbours relies on distances between points, the Radial Basis Function kernel in Support Vector Machines does too, and even the cosine similarity used in NLP can be seen as a measure of distance (in its angular form, at least).
Distance measurement is surprisingly straightforward when dealing with vectors. It becomes a little more complex when we want to measure “real” distance in a given space (the distance between two cities, for example).
In this article, we will have a look at the different methods available to measure distance. We will work our way up from one-dimensional space to multi-dimensional spaces, and finish with the more complex “geographical” distance.
One Dimension
In one dimension, the “distance” between two data points is simply the difference between the two values. For example, looking at item prices, we can calculate the distance between two items as follows: item1_price - item2_price. It seems trivial, but this value already tells us which item is more expensive (is the distance positive or negative?) and by how much (how big is the distance?).
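As a quick illustration (the prices below are made up for the example):

item1_price = 12.50
item2_price = 9.99

# Positive result: item1 is more expensive, and by 2.51
distance = item1_price - item2_price  # 2.51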
Two Dimensions
Manhattan Distance
In the previous image, any non-green coloured path shows the Manhattan distance between the top-right corner of the square and the bottom-left corner. All these routes have a Manhattan distance equal to 12.
To calculate the Manhattan distance, we need to calculate the differences between the two points along each axis of reference (x and y). As we are summing those differences, we need to ensure the values are positive by taking their absolute values.
One of the ways to compute that distance in Python would be as follows:
def manhattan(a, b):
    manhattan_distance = 0
    for i in [0, 1]:
        manhattan_distance += abs(a[i] - b[i])
    return manhattan_distance

a = (4, 2)
b = (0, 1)
manhattan(a, b)  # 5
Euclidean Distance
It differs slightly from the Manhattan distance in that it uses the Pythagorean theorem to calculate the distance “as the crow flies”.
The Euclidean distance is shown in green in our initial image and is equal to √(6² + 6²) = √72.
Once again, the calculation steps are pretty simple: we measure the differences between our points along each axis of reference, square those values, add them together, then take the square root of the sum.
One way to compute the Euclidean distance in Python is:
from math import sqrt

def euclidean(a, b):
    euclidean_distance = 0
    for i in [0, 1]:
        euclidean_distance += (a[i] - b[i]) ** 2
    return sqrt(euclidean_distance)

a = (4, 2)
b = (0, 1)
euclidean(a, b)  # 4.123105625617661
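For reference, since Python 3.8 the standard library ships this exact calculation as math.dist, so we do not have to write it by hand:

from math import dist

a = (4, 2)
b = (0, 1)
dist(a, b)  # 4.123105625617661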
Three Dimensions or More
Manhattan Distance
The Manhattan distance calculation scales up really well, as we only need to calculate the differences between the two points in the extra dimensions, then add those (absolute) values:
def manhattanXD(a, b, dimensions=2):
    manhattan_XD = 0
    for i in range(dimensions):
        manhattan_XD += abs(a[i] - b[i])
    return manhattan_XD

a = (4, 2, 2)
b = (0, 0, 1)
manhattanXD(a, b, 3)  # 7
Euclidean Distance
Once again, scaling up the Euclidean distance formula is very simple:
from math import sqrt

def euclideanXD(a, b, dimensions=2):
    euclidean_XD = 0
    for i in range(dimensions):
        euclidean_XD += (a[i] - b[i]) ** 2
    return sqrt(euclidean_XD)

a = (4, 2, 2)
b = (0, 0, 1)
euclideanXD(a, b, 3)  # 4.58257569495584
Minkowski Distance
This distance is a generalisation of the previous calculation methods we have seen in this article. Manhattan and Euclidean distances are applicable in a cartesian coordinate system (or real vector space). The Minkowski distance applies to a normed vector space.
The formula for the Minkowski distance is:

D(a, b) = (Σᵢ |aᵢ - bᵢ|^p)^(1/p)

We can see that when p = 1, the formula reduces to the Manhattan distance and, when p = 2, we recognise the equation of the Euclidean distance.
import numpy as np

def minkowski(a, b, dimensions=2, power=1):
    minkowski_XD = 0
    for i in range(dimensions):
        minkowski_XD += abs(a[i] - b[i]) ** power
    return np.power(minkowski_XD, 1/power)

a = (4, 2, 2)
b = (0, 0, 1)
minkowski(a, b, dimensions=3, power=3)  # 4.179339196381232
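We can verify the claim above by checking that our Minkowski implementation reproduces the Manhattan and Euclidean results for the same pair of points:

minkowski(a, b, dimensions=3, power=1)  # 7.0, same as manhattanXD(a, b, 3)
minkowski(a, b, dimensions=3, power=2)  # 4.58257569495584, same as euclideanXD(a, b, 3)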
If you are familiar with KNeighborsClassifier or KNeighborsRegressor from sklearn.neighbors, you might have seen that the Minkowski distance can be passed as the metric parameter. It goes hand in hand with the p parameter that defines which formula to use: Manhattan (p = 1), Euclidean (p = 2), etc. Minkowski is actually the default metric throughout most of the sklearn.neighbors module.
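As a minimal sketch (the training data below is made up purely for illustration), this is how the metric and p parameters fit together in KNeighborsClassifier:

from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data, for illustration only
X = [[0, 0], [1, 1], [4, 4], [5, 5]]
y = [0, 0, 1, 1]

# metric="minkowski" is the default; p=1 is Manhattan, p=2 is Euclidean
knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=1)
knn.fit(X, y)
knn.predict([[4, 3]])  # array([1])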
Cosine Distance
The cosine distance is a little more complex as it involves normalisation and a dot product but, altogether, the equation is straightforward. The cosine similarity is:

cos(θ) = (a · b) / (‖a‖ ‖b‖)

The cosine distance is then calculated using this formula:

cosine distance = 1 - cos(θ)
import numpy as np

def cosine_distance(a, b):
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    dot_product = np.dot(a, b)
    return 1 - (dot_product / (norm_a * norm_b))

a = (4, 2, 2)
b = (0, 0, 1)
cosine_distance(a, b)  # 0.5917517095361369
Unlike the previous functions, this implementation of the cosine distance does not loop over individual coordinates: it relies on vectorised matrix operations, which means it scales up nicely to any number of dimensions.
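To illustrate, the function works unchanged on longer vectors, and (assuming SciPy is available) we can cross-check the result against scipy.spatial.distance.cosine, which implements the same formula:

from scipy.spatial import distance

# Arbitrary 5-dimensional vectors, for illustration only
a5 = (1, 0, 2, 3, 0)
b5 = (0, 1, 1, 3, 2)

cosine_distance(a5, b5)  # ~0.2409
distance.cosine(a5, b5)  # same value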
Haversine Distance
Finally, let us have a look at something a bit more tangible.
Distance calculation gets slightly more complex when the route is the projection of the chord (the straight line between two points on a sphere) onto that sphere's surface. One of the ways to calculate such a distance (also called the great-circle distance) is the Haversine formula. It comes with a few caveats (mainly due to rounding approximations), but it is still a useful formula to get familiar with:
d = 2r · arcsin(√(sin²((φ₂ - φ₁)/2) + cos(φ₁) · cos(φ₂) · sin²((λ₂ - λ₁)/2)))

where (λ, φ) are the (lon, lat) coordinates of a point. Note that unlike the (lon, lat) pair, which is in degrees, (λ, φ) are in radians. Last, r is the radius of the sphere. The average Earth radius is 6371 km. The nice side effect of using a radius in kilometres is that our distance will be in kilometres as well, happy days!
import numpy as np

def haversine_distance(lat1, lon1, lat2, lon2):
    p1 = np.radians(lat1)
    p2 = np.radians(lat2)
    l1 = np.radians(lon1)
    l2 = np.radians(lon2)
    hav = np.sin((p2 - p1) / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin((l2 - l1) / 2) ** 2
    dist = 2 * 6371 * np.arcsin(np.sqrt(hav))
    return np.round(dist, 2)

london = (51.509865, -0.118092)
paris = (48.864716, 2.349014)
haversine_distance(*london, *paris)  # 342.54
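As a sanity check (assuming scikit-learn is installed), sklearn.metrics.pairwise ships its own implementation. It expects (lat, lon) pairs already converted to radians and returns distances on a unit sphere, so we multiply by the Earth's radius to get kilometres:

from sklearn.metrics.pairwise import haversine_distances

coords = np.radians([london, paris])
haversine_distances(coords) * 6371  # off-diagonal entries ≈ 342.54 km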
Last Words
As we have seen, it is fairly easy to calculate distances between points, even in the special case of geographical distances. There are a few other methods to assess neighbourhood (or assumed membership), but hopefully this article has given you enough information to understand the most common “metrics”.