# Feature Engineering for Large Datasets

It is estimated that 175 trillion gigabytes of data will be created in 2025 (and that is for 2025 alone!). On top of this, a growing number of companies and institutions are making an effort to make data more available to the public; they develop APIs or create open data portals such as those for London, the UK, or Google Mobility Data, to name a few.

Accessing data has never been easier, and it will only get better!

That is the theory, at least! The question you are trying to answer can be crystal clear, and you might have the perfect…

I only recently learnt how to code in Python. Or should I say to code at all? I still consider myself a Python newbie, and I even doubt I will ever fully master it when I see how many things I learn every time I code. But on the other hand, that is what makes it fun, and I find it very motivating!

Throughout the last year, I have encountered a few cases with results I was not expecting. Sometimes, I clearly did not know how Python processes things under the hood and made a mistake; other times I thought…

# 0 — The Data

```python
import numpy as np
import pandas as pd
import seaborn as sns

# `df` is not defined in this excerpt; it is the dataset displayed at this point
df
```

# Distance as Metrics

In Machine Learning, being able to calculate distances is essential in many cases. The most obvious use is with spatial data, but distance is also one of the easiest ways to assess membership or similarity in supervised and unsupervised techniques. K-Nearest Neighbours relies on distances between points, as does the Radial Basis Function kernel in Support Vector Machines; even the cosine similarity used in NLP can be seen as a measure of distance (in its angular similarity form, at least).
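To make these measures a little more concrete, here is a minimal sketch using NumPy. The function names and the `gamma` parameter are my own choices for illustration, not taken from any particular library, and the angular distance shown is one common convention (the angle between the vectors divided by π).

```python
import numpy as np


def euclidean_distance(u, v):
    """Straight-line distance between two vectors (what K-Nearest Neighbours typically uses)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))


def rbf_kernel(u, v, gamma=1.0):
    """Radial Basis Function kernel: a similarity that decays with the squared distance.

    `gamma` is a hypothetical default chosen for illustration.
    """
    return float(np.exp(-gamma * euclidean_distance(u, v) ** 2))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def angular_distance(u, v):
    """Angle between two vectors, scaled to [0, 1]; angular similarity is 1 minus this."""
    # Clip to guard against tiny floating-point overshoots outside [-1, 1].
    cos = np.clip(cosine_similarity(u, v), -1.0, 1.0)
    return float(np.arccos(cos) / np.pi)


if __name__ == "__main__":
    a, b = [0.0, 0.0], [3.0, 4.0]
    print(euclidean_distance(a, b))
    print(rbf_kernel(a, b, gamma=0.1))
    print(angular_distance([1.0, 0.0], [0.0, 1.0]))
```

Note that cosine similarity is undefined for a zero vector, so this sketch assumes both inputs are non-zero.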

Distance measurement is surprisingly straightforward when dealing with vectors. It becomes a little more complex…

# Hypothesis Testing Applied to A/B Testing

There are a lot of decisions to make during an A/B test; most of them are made during the conception stage. In this article, I am going to focus on the steps following the data collection stage. You will not hear about sample size or power in this article (or maybe just a little).

This means our starting point is a dataset. The story behind this dataset will vary a lot depending on the product/experiment but, generally speaking, it is quite simple. It contains observations within two randomly assigned populations: A and B. …
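As a rough illustration of what a test on such a dataset can look like, here is a minimal sketch of a two-sided, two-proportion z-test. It assumes each observation is a binary outcome (for example, converted or not), which may not match the dataset used in the article; the function name and the counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns the z statistic and the p-value under the null hypothesis
    that populations A and B share the same underlying rate.
    Assumes the pooled rate is strictly between 0 and 1.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate: the best estimate of the common rate if the null is true.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Standard normal CDF evaluated at |z|, written with the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 100/1000 successes in A, 150/1000 in B.
    z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The normal approximation behind this test is only reasonable when both groups are fairly large; for small samples or non-binary metrics, another test would be more appropriate.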

# Getting More Out of Your Jupyter Notebooks

Jupyter notebooks are incredibly powerful for developing ideas quickly, then sharing them if need be. The notebooks run in a web browser and support many languages through kernels (Jupyter actually stands for Julia, Python and R, when it does not refer to Galileo’s notebook).

You can dynamically code in your favourite language and include some explanation in rich text (LaTeX, Markdown, HTML, you choose). The output is generated straight away (after hitting Shift+Enter), then you can publish your notebook on GitHub, nbviewer, or as a PDF/HTML page.

Developing does not get much easier than this, to be honest!

Here…

# Speed Up Your Git Workflow

Git is a lot of things. But one of the things I like the most about it is that it can be so painful to use that it is funny (as long as I am not the victim). From the classic XKCD comic to the Git Koans, I find those very entertaining. Some are even useful: Oh Shit, Git!

When I started using git, one of the things that was instrumental in my understanding was the visual representation of git commands. Articles such as this one and this one are good examples.

Do not get me wrong; I am far from…

# Geopandas: Accessible, Yet Powerful GIS With Python

Recently, I have been doing a lot more work on data that has a spatial meaning. Lucky for me, I have always been interested in GIS and, in my professional life, I have used a lot of GIS packages such as ArcGIS, MapInfo, and QGIS. But this time, as the data needed more processing, I did not go back to my old friends and, instead, I focussed on using Python to handle this information.

Because Python is open-source, there are lots of packages available to work with spatial data. It has never been easier to draw a map in Python…

# Sampling

Imagine you want to know which candidate will win the next election. Ideally, you conduct a census, and you ask every single person in the country up to two questions:

• Will you vote in the next election? And if the answer is yes:
• Who are you voting for?

You expect some people to change their mind between the survey and the vote. Maybe they were too ashamed to tell you which candidate they would vote for and gave you another name. …

# Recursive Functions in Python

You might not realise this, but recursions are very common. A lot of well-known acronyms are recursive, for example:

• Do you know what VISA (as in the VISA card in your wallet) stands for? It is the acronym for “Visa International Service Association”.
• Maybe you are familiar with YAML files, do you know what YAML stands for? YAML Ain’t Markup Language
• You might even be listening to music reading this article, and if you are not using Bluetooth headphones, you could be using the “JACK Audio Connection Kit” plug of your computer.
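The self-reference in these acronyms can be sketched as a small recursive function. This is only an illustration: `expand_acronym` is a hypothetical helper, and it assumes the acronym appears verbatim (same capitalisation) inside its own expansion.

```python
def expand_acronym(acronym: str, expansion: str, depth: int) -> str:
    """Expand a recursive acronym `depth` times.

    Base case: at depth 0 the acronym is left as it is.
    Recursive case: the first occurrence of the acronym inside its own
    expansion is replaced by the result of expanding one level less.
    """
    if depth <= 0:
        return acronym
    inner = expand_acronym(acronym, expansion, depth - 1)
    if depth > 1:
        # Brackets make the nesting visible once the inner part is itself a phrase.
        inner = f"({inner})"
    return expansion.replace(acronym, inner, 1)


if __name__ == "__main__":
    for level in range(4):
        print(level, expand_acronym("YAML", "YAML Ain't Markup Language", level))
```

Without the base case (`depth <= 0`), the function would keep calling itself until Python raised a `RecursionError`, which is exactly what makes these acronyms impossible to expand fully.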

With these examples in mind, even if you…

## Antoine Ghilissen

Data Scientist
