Photo by Edu Grande on Unsplash

Sampling

Focussing your efforts to understand the bigger picture

Imagine you want to know which candidate will win the next election. Ideally, you conduct a census, and you ask every single person in the country up to two questions:

  • Will you vote in the next election? And if the answer is yes:
  • Who are you voting for?

You expect some people to change their mind between the survey and the vote. Others may have been too ashamed to tell you which candidate they would vote for and gave you another name. Even so, this survey would result in a dataset allowing you to predict the outcome of the election with a reasonable amount of confidence.

However, it is a practical nightmare:

  • How would you make sure individuals are only surveyed once?
  • How long is it going to take to query everyone?
  • How do you poll everyone: in person, via phone, via mail, via email?

This “method” does not sound practical at all, and it actually seems more tedious than organising the election itself! As impractical as it sounds, a census is a surveying tool used periodically across the world.

Thankfully, with a little bit of planning, there is a much quicker, more convenient (and cheaper) way to achieve such a task with acceptable results: sampling!

Before talking about sampling techniques, we need to make the distinction between the population, which includes all the members of the group studied, and a sample, which is a limited selection of the population.

If you want to generalise your findings to the entire population, a fundamental aspect of sampling is that the sample MUST represent the population. To that end, a few sampling techniques are available, and carefully choosing one will maximise your chance of accurately representing the population.

The Sampling Process

The sampling procedure in itself is fairly consistent. No matter which technique you plan on using, the following steps should be followed:

First Step: Sample Frame Definition

Photo by Martin Péchy on Unsplash

It is a good idea to start by defining the target population (or sample frame). In our country-wide survey, we want to know people's voting intentions, but there is no point asking pupils, for example, who they would elect. We only care about people who have the power to elect someone. Therefore our sample frame is “people who will vote in the next election”. We apply this filter by asking the first question: “Will you vote?”.

It might not always be easy to establish the list from which the samples will be drawn but, just like spending time on defining the question, this time will be well spent.
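As a rough illustration of this filtering step, here is a minimal Python sketch. The `Person` record, the `VOTING_AGE` threshold of 18 and the toy population are all assumptions made up for the example; a real frame would come from an electoral register or the survey answers themselves.

```python
from dataclasses import dataclass

VOTING_AGE = 18  # assumed threshold, varies by country


@dataclass
class Person:
    name: str
    age: int
    intends_to_vote: bool  # answer to "Will you vote?"


def build_sample_frame(population):
    """Keep only the people who are old enough and who say they will vote."""
    return [p for p in population if p.age >= VOTING_AGE and p.intends_to_vote]


# Hypothetical toy population
population = [
    Person("Ana", 34, True),
    Person("Ben", 12, False),    # a pupil: too young to vote
    Person("Chloe", 51, False),  # eligible, but will not vote
    Person("Dev", 19, True),
]

frame = build_sample_frame(population)
print([p.name for p in frame])  # ['Ana', 'Dev']
```

Everything that follows (method, size, collection) then operates on `frame` rather than on the whole population.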

Second Step: Sampling Method Selection

Now that you know what characteristic makes a part of the population valuable to your study, you can select the sampling technique best suited to your case. A lot of parameters will have an impact on which method will work for you. How much time do you have? How much money? How well do you know the topic? How well do you know the population?

There are two principal categories of sampling techniques:

Probability Sampling:

  • Random Sampling:
    Every member has the same chance to be selected.
  • Systematic Sampling:
    Every member is assigned a number, and the selection follows a numerical rule: all the odd numbers; every 10th, starting from the 3rd position; etc. It is vital to ensure there is no hidden pattern in the initial list that would skew the sample. Imagine our list of voters alternates female-male. Sampling the even numbers would then select people of a single gender, which is far from a representative sample.
  • Stratified Sampling:
    The population is divided into meaningful sub-categories, called strata. Each stratum is then sampled, either randomly or systematically. The sample size per stratum depends on the size of that stratum. In our example, the electors could be stratified by age group or ethnicity.
  • Cluster Sampling:
    Unlike stratified sampling, where the members within each group are homogeneous, here the creation of the groups is arbitrary. The homogeneity is between the groups: ideally, each group looks like the others. The sampling is achieved by selecting a random set of groups. We could group our voters by county, for example. Note that because the population density varies significantly between counties, this would not be the best approach.
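To make the four probability techniques above more concrete, here is a minimal sketch using only Python's standard library. The frame of 1,000 voters, its field names (`id`, `gender`, `county`) and the function names are all hypothetical. The frame deliberately alternates female-male to reproduce the hidden-pattern problem described under systematic sampling.

```python
import random
from collections import defaultdict


def simple_random_sample(frame, n, rng):
    """Every member has the same chance of being selected."""
    return rng.sample(frame, n)


def systematic_sample(frame, step, start=0):
    """Take every `step`-th member, beginning at index `start`."""
    return frame[start::step]


def group_by(frame, key):
    """Split the frame into groups according to `key(member)`."""
    groups = defaultdict(list)
    for member in frame:
        groups[key(member)].append(member)
    return groups


def stratified_sample(frame, key, fraction, rng):
    """Randomly sample the same fraction of every stratum."""
    sample = []
    for members in group_by(frame, key).values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(frame, key, n_clusters, rng):
    """Pick whole groups at random and keep all of their members."""
    clusters = group_by(frame, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


# Hypothetical frame: genders alternate, 10 counties of 100 voters each
frame = [
    {"id": i, "gender": "F" if i % 2 == 0 else "M", "county": f"county-{i // 100}"}
    for i in range(1000)
]

rng = random.Random(42)  # seeded so the sketch is reproducible
srs = simple_random_sample(frame, 100, rng)
systematic = systematic_sample(frame, step=10, start=2)  # every 10th from the 3rd
stratified = stratified_sample(frame, lambda m: m["gender"], 0.1, rng)
clustered = cluster_sample(frame, lambda m: m["county"], 2, rng)

# The alternating list makes the systematic sample single-gender
print({m["gender"] for m in systematic})  # {'F'}
```

In this toy frame the stratified sample keeps the 50/50 gender split by construction, while the systematic sample with an even step only ever lands on women, which is exactly the kind of hidden pattern to look for before sampling.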

Probability sampling has a few advantages:

  • Quick and easy to implement.
  • Does not require a high level of expertise in the studied field.
  • Reduces sample bias and systematic error.
  • The sample tends to represent the population accurately, which allows inferences to be generalised to the whole sample frame.
  • Works well with a diverse population.

Non-probability Sampling:

  • Convenience Sampling:
    The sample is whatever is convenient to collect, without too much effort. In our example, you could ask your friends who they will vote for, or publish a survey online and wait for (willing) people to answer.
  • Snowball Sampling:
    You ask 10 of your friends, each one of them asks 10 of theirs, and so on.
  • Purposive Sampling:
    As a scientist, you choose the sample because, based on your expertise, you believe it is representative.
  • Quota Sampling:
    Like stratified sampling, the population is segmented by characteristics, but an arbitrary number of elements is then selected from each segment. In our example, you establish the sub-groups, then you select 500 men and 500 women between 25 and 45 years old.

All these non-probability sampling methods have specific advantages and disadvantages but, generally speaking, they are:

  • Quicker and cheaper to implement.
  • Used when some parameters, such as the sample frame, are unknown.
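As an illustration of one of these methods, quota sampling could be sketched as below. This is only a sketch under assumptions: the `respondents` list, its fields and the quota of 5 per gender (instead of 500) are invented for the example. Respondents are accepted in whatever order they turn up, with no random selection, which is what makes this a non-probability method.

```python
def quota_sample(respondents, key, quotas):
    """Accept respondents in arrival order until every quota is filled."""
    counts = {group: 0 for group in quotas}
    sample = []
    for person in respondents:
        group = key(person)  # None means the person fits no quota group
        if group in counts and counts[group] < quotas[group]:
            sample.append(person)
            counts[group] += 1
        if counts == quotas:  # all quotas reached: stop surveying
            break
    return sample


def gender_if_25_to_45(person):
    """Quota group: the person's gender, but only for ages 25 to 45."""
    return person["gender"] if 25 <= person["age"] <= 45 else None


# Hypothetical stream of people answering an online survey
respondents = [
    {"id": i, "gender": "F" if i % 3 == 0 else "M", "age": 20 + i % 40}
    for i in range(60)
]

sample = quota_sample(respondents, gender_if_25_to_45, {"F": 5, "M": 5})
print(len(sample))  # 10
```

Because the first people to answer fill the quotas, the result depends entirely on who shows up first, not on chance.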

Third Step: Sample Size Definition

Photo by National Cancer Institute on Unsplash

Time to determine the sample size! How many elements do we need in our sample to be able to make inferences about the population? Intuitively, the bigger the sample, the more representative it will be, but also the less convenient. There are many ways to determine the sample size, but I will not expand on this matter, as my next article will be about ascertaining the optimal sample size (for probability sampling methods).

In a nutshell, you can pick the size from a published table, or you can tweak parameters, such as the confidence level (which drives the width of the confidence interval) or the significance threshold for p-values, to optimise the sampling technique (to give you an intuition on this, head to R Psychologist’s brilliant visualisation).
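To give a feel for how such parameters drive the sample size, here is a sketch based on Cochran's formula for estimating a proportion, n = z² · p(1 − p) / e², with an optional finite population correction. This is one common approach among several and not necessarily the one a given table or study uses; the function name and the default values (95% confidence, 5% margin of error, p = 0.5 as the most conservative guess) are assumptions for the example.

```python
import math
from statistics import NormalDist


def sample_size(confidence=0.95, margin_of_error=0.05, proportion=0.5,
                population_size=None):
    """Cochran's sample size for a proportion, under simple random sampling."""
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    if population_size is not None:
        # Finite population correction: small populations need fewer answers
        n = n / (1 + (n - 1) / population_size)
    return math.ceil(n)


print(sample_size())                       # 385
print(sample_size(margin_of_error=0.03))   # 1068
print(sample_size(population_size=10000))  # 370
```

Note how tightening the margin of error from 5% to 3% almost triples the required sample, while the size of the population barely matters once it is large.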

Fourth Step: Data Collection

Now that you know what information you need and how many times it needs to be collected, it is time to start collecting the data and put this sampling strategy into practice!

Final Words

Inadequate sampling induces a bias, which leads to erroneous generalisations. It is essential to try to understand the sampling process, especially if you are designing the study from end to end.
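A small simulation can show how sampling bias distorts a result. The population below is entirely made up: 10,000 voters, of which 6,000 live in urban areas (70% for candidate A) and 4,000 in rural areas (30% for candidate A). The sketch compares a random sample of the whole population with a convenience sample drawn only from urban voters.

```python
import random

rng = random.Random(0)  # seeded so the sketch is reproducible

# Hypothetical population: 54% of all voters support candidate A
population = (
    [{"area": "urban", "vote": "A"}] * 4200
    + [{"area": "urban", "vote": "B"}] * 1800
    + [{"area": "rural", "vote": "A"}] * 1200
    + [{"area": "rural", "vote": "B"}] * 2800
)


def share_for_a(voters):
    """Fraction of the given voters who support candidate A."""
    return sum(v["vote"] == "A" for v in voters) / len(voters)


true_share = share_for_a(population)

# Probability sampling: every voter has the same chance of being picked
random_estimate = share_for_a(rng.sample(population, 1000))

# Convenience sampling: we only survey the voters who are easy to reach
urban_only = [v for v in population if v["area"] == "urban"]
convenience_estimate = share_for_a(rng.sample(urban_only, 1000))

print(f"true share:           {true_share:.2f}")
print(f"random sample:        {random_estimate:.2f}")
print(f"urban-only sample:    {convenience_estimate:.2f}")
```

Both samples have the same size, yet the urban-only estimate lands near the urban support rate rather than the true share: a larger sample would not fix it, because the error comes from who was asked, not from how many.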

Most of the time, approximate sampling is due to a restriction on time and/or money. However, steps 2 and 3 provide ways to mitigate the impact of such constraints on the study.

Sometimes, you already have a dataset, and you were not involved in the sampling process. Thinking about how this data was collected and identifying where bias could have been introduced will allow you to carry out a much more meaningful analysis.

