Photo by THE 9TH Coworking on Unsplash

It is estimated that 175 trillion gigabytes of data will be created in 2025 (and that is for the year 2025 alone! source). On top of this, a growing number of companies and institutions are making an effort to increase the availability of data to the public: they develop APIs or create open data portals such as London, UK, or Google Mobility Data, to name a few.

Accessing data has never been easier, and it will only get better!

That is the theory, at least! The question you are trying to answer can be crystal clear, and you might have the perfect metric to prove it, but finding the right dataset can still be a mission. There are many reasons why finding a suitable dataset is hard and, more often than not, being creative is necessary. …


Photo by Brendan Church on Unsplash

I only recently learnt how to code in Python. Or should I say to code at all? I still consider myself a Python newbie, and I even doubt I will ever fully master it when I see how many things I learn every time I code. But on the other hand, that is what makes it fun, and I find it very motivating!

Throughout the last year, I have encountered a few cases with results I was not expecting. Sometimes I clearly did not know how Python processes things under the hood and made a mistake; other times I thought I understood what was happening under the hood but had misjudged it. …


Photo by Susan Holt Simpson on Unsplash

0 — The Data

import numpy as np
import pandas as pd
import seaborn as sns

df = sns.load_dataset('tips')
df

Photo by Bruno Wolff on Unsplash

In Machine Learning, being able to calculate distances is essential in a lot of cases. The most obvious use is with spatial data, but distance is also one of the easiest ways to assess membership/similarity in supervised and unsupervised techniques. K-Nearest Neighbours relies on distances between points, as does the Radial Basis Function kernel in Support Vector Machines, and even the cosine similarity used in NLP can be seen as a measure of distance (in its angular form, at least).

Distance measurement is surprisingly straightforward when dealing with vectors. It becomes a little more complex when we want to measure “real” distance in a given space (the distance between two cities, for example). …
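To make the contrast concrete, here is a minimal sketch, not taken from the article itself. It assumes that the "vector" case means a plain Euclidean distance and that the "real" distance between two cities is approximated by the haversine (great-circle) formula on a spherical Earth of radius 6371 km. The function names and the city coordinates are illustrative assumptions.

```python
import math

import numpy as np


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees.

    Assumes a perfectly spherical Earth, so it is only an approximation.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


# Vector case: a 3-4-5 right triangle.
print(euclidean([0, 0], [3, 4]))

# "Real" distance: approximate coordinates for London and Paris
# (illustrative values, not from the article).
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)
print(haversine_km(*london, *paris))
```

The Euclidean version treats coordinates as a flat plane, which is why it cannot be applied directly to latitude and longitude; the haversine version accounts for the curvature of the sphere.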


Photo by Pars Sahin on Unsplash

There are a lot of decisions to make during an A/B test; most of them are made during the conception stage. In this article, I am going to focus on the steps following the data collection stage. You will not hear about sample size or power in this article (or maybe just a little).

This means our starting point is a dataset. The story behind this dataset will vary a lot depending on the product/experiment but, generally speaking, it is quite simple. It contains observations within two randomly assigned populations: A and B. …


Photo by Lorenzo Herrera on Unsplash

Jupyter notebooks are incredibly powerful for developing ideas quickly and then sharing them if need be. The notebooks run in a web browser and support many languages out of the box (Jupyter actually stands for Julia, Python and R, when it does not refer to Galileo’s notebook).

You can code dynamically in your favourite language and include explanations in rich text (LaTeX, Markdown, HTML, you choose). The output is generated straight away (after hitting Shift+Enter), and you can then publish your notebook on GitHub, nbviewer, or as a PDF/HTML page.

Developing does not get much easier than this, to be honest!

Here are a few tips I use daily to make the most of Jupyter notebooks. …


Photo by veeterzy on Unsplash

Git is a lot of things. But one of the things I like the most about it is that it can be so painful to use that it is funny (as long as I am not the victim). From the classic XKCD comic to the Git Koans, I find those very entertaining. Some are even useful: Oh Shit, Git!

When I started using git, one of the things that was instrumental to my understanding was the visual representation of git commands. Articles such as this one and this one are good examples.

Do not get me wrong; I am far from being a git power-user! I use it to share my work with a small team of fellow data scientists, to back up my work, and as a project portfolio. I am nowhere near a software developer juggling between developing features, fixing bugs, and managing production code. I am, however, still literate enough to laugh at the websites listed above. …


Photo by Denise Jans on Unsplash

Recently, I have been doing a lot more work on data that has a spatial meaning. Luckily for me, I have always been interested in GIS and, in my professional life, I have used a lot of GIS packages such as ArcGIS, MapInfo, and QGIS. But this time, as the data needs more processing, I did not go back to my old friends and, instead, focussed on using Python to handle this information.

Because Python is open-source, there are lots of packages available to work with spatial data. It has never been easier to draw a map in Python. Folium, Bokeh, Plotly, or Altair achieve such a task with relative ease and, more often than not, the result is quite convincing. But these libraries are visualisation libraries, not really designed to analyse the data in a geospatial way. If we have a look at the more “back-end” libraries, or at least not the visualisation ones, we see that there are even more libraries available for spatial data handling. …


Photo by Edu Grande on Unsplash

Imagine you want to know which candidate will win the next election. Ideally, you conduct a census, and you ask every single person in the country up to two questions:

  • Will you vote in the next election? And if the answer is yes:
  • Who are you voting for?

You expect some people to change their mind between the survey and the vote. Maybe they were too ashamed to tell you which candidate they would vote for and gave you another name. …


Photo by Julia Kadel on Unsplash

You might not realise this, but recursion is very common. A lot of well-known acronyms are recursive, for example:

  • Do you know what VISA (as in the VISA card in your wallet) stands for? It is the acronym for “Visa International Service Association”.
  • Maybe you are familiar with YAML files. Do you know what YAML stands for? “YAML Ain’t Markup Language”.
  • You might even be listening to music while reading this article; if so, your computer’s audio could be handled by JACK, the “JACK Audio Connection Kit” sound server.

With these examples in mind, even if you have never heard of recursion before, you can get an idea of what recursion is. It is just something that refers to itself. …
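As a small illustration of "something that refers to itself", here is a hedged sketch of a function that calls itself to unfold a recursive acronym a given number of times. The function name `expand` and its `depth` parameter are hypothetical, introduced only for this example.

```python
def expand(acronym: str, expansion: str, depth: int) -> str:
    """Unfold a recursive acronym `depth` times.

    `expansion` is the full name and must itself contain `acronym`.
    """
    if depth == 0:
        # Base case: stop unfolding and leave the acronym as it is.
        return acronym
    # Recursive case: the function calls itself to unfold the inner acronym,
    # then substitutes the result into the first occurrence in the full name.
    inner = expand(acronym, expansion, depth - 1)
    return expansion.replace(acronym, inner, 1)


print(expand("YAML", "YAML Ain't Markup Language", 1))
# YAML Ain't Markup Language
print(expand("YAML", "YAML Ain't Markup Language", 2))
# YAML Ain't Markup Language Ain't Markup Language
```

Without the base case (`depth == 0`), the function would keep calling itself until Python raises a `RecursionError`, just as the acronym could in principle be unfolded forever.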

About

Antoine Ghilissen

Data Scientist
