Photo by Bruno Wolff on Unsplash

Distance as Metrics

No Pun Intended


In Machine Learning, being able to calculate distances is essential in many cases. The most obvious use is when dealing with spatial data, but it is also one of the easiest ways to assess membership or similarity in supervised and unsupervised techniques. K-Nearest Neighbours relies on distances between points, as does the Radial Basis Function kernel in Support Vector Machines, and even the cosine similarity used in NLP can be seen as a measure of distance (in its angular similarity form, at least).

Distance measurement is surprisingly straightforward when dealing with vectors. It becomes a little more complex when we want to measure “real” distance in a given space (the distance between two cities, for example).

In this article, we will have a look at the different methods available to measure distance. We will work our way up from one-dimensional space to multi-dimensional spaces, and finish with the more complex “geographical” distance.

Photo by Tom Barrett on Unsplash

One Dimension

In one dimension, the “distance” between two data points is simply the difference between their two values. For example, looking at item prices, we calculate the distance between two items as follows: item1_price - item2_price. It seems trivial, but this feature already tells us which item is more expensive (is the distance positive or negative?) and by how much (how big is the distance?).
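As a quick illustration, with made-up prices:

item1_price = 12.50
item2_price = 9.00

item1_price - item2_price
# 3.5 -> positive, so item1 is the more expensive of the two, by 3.50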

Two Dimensions

Taxicab

Manhattan Distance

In the previous image, any non-green coloured path shows the Manhattan distance between the top-right corner of the square and the bottom-left corner. All of these routes have a Manhattan distance equal to 12.

To calculate the Manhattan distance, we compute the difference between the two points along each axis of reference. As we are summing those differences, we need to ensure the values are positive by taking their absolute values.

One of the ways to compute that distance in Python is as follows:

def manhattan(a, b):
    manhattan_distance = 0
    for i in [0, 1]:
        manhattan_distance += abs(a[i] - b[i])
    return manhattan_distance

a = (4, 2)
b = (0, 1)
manhattan(a, b)
# 5

Euclidean Distance

The Euclidean distance is slightly different from the Manhattan distance, as it uses the Pythagorean theorem to measure the distance “as the crow flies”.

d(a, b) = √((a₁ − b₁)² + (a₂ − b₂)²)

The Euclidean distance is shown in green in our initial image and is equal to √(6² + 6²) = √72.

Once again, the calculation steps are pretty simple: we measure the difference between our points along each axis of reference, square those values, add them together, and take the square root of that sum.

One way to compute the Euclidean distance in Python is:

from math import sqrt

def euclidean(a, b):
    euclidean_distance = 0
    for i in [0, 1]:
        euclidean_distance += (a[i] - b[i]) ** 2
    return sqrt(euclidean_distance)

a = (4, 2)
b = (0, 1)
euclidean(a, b)
# 4.123105625617661
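As an aside, the standard library offers a shortcut for the two-dimensional case: math.hypot computes the same value from the coordinate differences.

from math import hypot

hypot(a[0] - b[0], a[1] - b[1])
# 4.123105625617661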

Three Dimensions or More

Manhattan Distance

The Manhattan distance calculation scales up really well, as we only need to compute the difference between the two points in each extra dimension, then add those (absolute) values:

d(a, b) = Σᵢ |aᵢ − bᵢ|
def manhattanXD(a, b, dimensions=2):
    manhattan_XD = 0
    for i in range(dimensions):
        manhattan_XD += abs(a[i] - b[i])
    return manhattan_XD

a = (4, 2, 2)
b = (0, 0, 1)
manhattanXD(a, b, 3)
# 7

Euclidean Distance

Once again, scaling up the Euclidean distance formula is very simple:

d(a, b) = √( Σᵢ (aᵢ − bᵢ)² )
from math import sqrt

def euclideanXD(a, b, dimensions=2):
    euclidean_XD = 0
    for i in range(dimensions):
        euclidean_XD += (a[i] - b[i]) ** 2
    return sqrt(euclidean_XD)

a = (4, 2, 2)
b = (0, 0, 1)
euclideanXD(a, b, 3)
# 4.58257569495584
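Similarly, numpy can compute this value directly: the norm of the difference vector is exactly the Euclidean distance (we will reuse np.linalg.norm in the cosine section below).

import numpy as np

np.linalg.norm(np.subtract(a, b))
# 4.58257569495584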

Minkowski Distance

This distance is a generalisation of the previous calculation methods we have seen in this article. Manhattan and Euclidean distances are applicable in a Cartesian coordinate system (or real vector space), while the Minkowski distance applies to any normed vector space.

The formula for the Minkowski distance:

d(a, b) = ( Σᵢ |aᵢ − bᵢ|^p )^(1/p)

We can see that when p = 1 we recover the Manhattan distance formula and, when p = 2, we recognise the equation of the Euclidean distance.

import numpy as np

def minkowski(a, b, dimensions=2, power=1):
    minkowski_XD = 0
    for i in range(dimensions):
        minkowski_XD += abs(a[i] - b[i]) ** power
    return np.power(minkowski_XD, 1 / power)

a = (4, 2, 2)
b = (0, 0, 1)
minkowski(a, b, dimensions=3, power=3)
# 4.179339196381232
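As a quick sanity check of the claim above, setting power=1 and power=2 on the same points reproduces the Manhattan and Euclidean results from the previous sections:

minkowski(a, b, dimensions=3, power=1)
# 7.0 -> the Manhattan result
minkowski(a, b, dimensions=3, power=2)
# 4.58257569495584 -> the Euclidean result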

If you are familiar with KNeighborsClassifier or KNeighborsRegressor from sklearn.neighbors, you might have seen that the Minkowski distance can be passed as the metric parameter. It goes hand in hand with the p parameter, which defines which formula is used: Manhattan (p=1), Euclidean (p=2), and so on. The Minkowski distance is in fact the default metric for most estimators in the sklearn.neighbors module.
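For example, here is a minimal sketch, using a made-up toy dataset, of how the metric and p parameters can be combined:

from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [4, 2], [5, 3]]  # toy features
y = [0, 0, 1, 1]                      # toy labels

# p=1 -> Manhattan distance, p=2 -> Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=1)
knn.fit(X, y)
knn.predict([[4, 3]])
# array([1])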

Cosine Distance

The cosine distance is a little more complex, as it involves normalisation and a dot product but, altogether, the equation is straightforward. The cosine similarity is:

cos(θ) = (a · b) / (‖a‖ ‖b‖)

The cosine distance is then calculated using this formula:

cosine distance = 1 − cos(θ)
a = (4, 2, 2)
b = (0, 0, 1)

def cosine_distance(a, b):
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    dot_product = np.dot(a, b)
    return 1 - (dot_product / (norm_a * norm_b))

cosine_distance(a, b)
# 0.5917517095361369

Unlike the previous functions, this implementation of the cosine distance does not loop over the coordinates explicitly: it relies on vector operations (a dot product and norms), which means it scales up nicely with the number of dimensions.
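If scipy is installed, the result can be cross-checked against scipy.spatial.distance.cosine, which computes the same quantity:

from scipy.spatial.distance import cosine

cosine(a, b)
# 0.5917517095361369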

Haversine Distance

Photo by Amy Humphries on Unsplash

Finally, let us have a look at something a bit more tangible.

Distance calculation gets slightly more complex when the route is the projection onto a sphere of the chord (the straight line between two points on that sphere). One of the ways to calculate such a distance (also called the great-circle distance) is by using the Haversine formula. It comes with a few caveats (mainly due to rounding approximations), but it is still a useful formula to get familiar with:

d = 2r · arcsin( √( sin²((φ₂ − φ₁) / 2) + cos(φ₁) · cos(φ₂) · sin²((λ₂ − λ₁) / 2) ) )

where (λ, φ) are the (lon, lat) coordinates of a point. Note that unlike the (lon, lat) pair, which is in degrees, (λ, φ) are in radians. Last, r is the radius of the sphere; the average Earth radius is 6371 km. A pleasant side effect of using a radius in kilometres is that our distance will be in kilometres as well, happy days!

def haversine_distance(lat1, lon1, lat2, lon2):
    p1 = np.radians(lat1)
    p2 = np.radians(lat2)
    l1 = np.radians(lon1)
    l2 = np.radians(lon2)
    hav = np.sin((p2 - p1) / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin((l2 - l1) / 2) ** 2
    dist = 2 * 6371 * np.arcsin(np.sqrt(hav))
    return np.round(dist, 2)

london = (51.509865, -0.118092)
paris = (48.864716, 2.349014)
haversine_distance(*london, *paris)
# 342.54
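Because the function only relies on numpy operations, it also accepts arrays of coordinates, so several distances can be computed in one call. A small sketch, using approximate coordinates for Madrid:

madrid = (40.416775, -3.703790)

lats = np.array([paris[0], madrid[0]])
lons = np.array([paris[1], madrid[1]])

# One call returns both the London-Paris and London-Madrid distances, in km
# (the first entry matches the 342.54 obtained above)
haversine_distance(london[0], london[1], lats, lons)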

Last Words

As we have seen, it is fairly easy to calculate distances between points, even in the special case of geographical distances. There are a few other methods to assess neighbourhood (or assumed membership), but hopefully this article has given you enough information to understand the most common “metrics”.
