Introduction

When I started learning about linear regression, my first thought was that this kind of technique is too simple to be commonly used. I thought the purpose of that linear regression lesson was educational more than practical.

It appears I was completely wrong! Not only is it used frequently, it is also a fundamental concept a lot of techniques are built on (as detailed on this website).

As with any tool or technique, there are circumstances where using linear regression is a good idea and others where it simply will not work. But how do you know when to use linear regression? Is there anything to be aware of before considering it?

But let us go over a few concepts before answering those questions.

Key concepts

Dependent Variable

It is a variable whose value depends on the value of other variables. That dependence is established by the hypothesis tested in the experiment. It is the $Y$ in $Y = A+BX$, and is also called the “target variable”.

Independent Variable

On the other hand, an independent variable does not depend on any other variable (in the scope of the experiment). It is the $X$ in $Y = A+BX$, and is also called the “predictor variable”.

Covariance

It is a measurement of how two variables vary together, that is, of the direction of the linear relationship between them.

$$\sigma_{XY} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)$$

  • $\sigma_{XY}$ = Covariance between $X$ and $Y$
  • $x_i$ = $i^{th}$ element of $X$
  • $y_i$ = $i^{th}$ element of $Y$
  • $n$ = number of data points ($X$ and $Y$ must have the same number of data points)
  • $\mu_x$ = mean of the independent variable $X$
  • $\mu_y$ = mean of the dependent variable $Y$

When $\sigma_{XY}$

  • is positive, the variables are positively related.
  • is negative, they are negatively related.
  • is zero, there is no linear relationship between those variables.

Unfortunately, covariance is not standardised: it ranges from $-\infty$ to $+\infty$ and its magnitude depends on the units of the variables. Therefore $\sigma_{XY} > \sigma_{AB}$ does not mean the relationship between $X$ and $Y$ is stronger than the relationship between $A$ and $B$. To compare pairs of variables, we need another measurement: correlation.
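To make the scale problem concrete, here is a minimal sketch of the covariance formula above in plain Python. The `covariance` function and the height and weight figures are made up for illustration; they are not taken from a real dataset.

```python
def covariance(xs, ys):
    """Population covariance, following the 1/n formula above."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must have the same number of data points")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n


# Hypothetical data: heights in metres and weights in kilograms.
heights_m = [1.60, 1.70, 1.80, 1.90]
weights_kg = [55.0, 65.0, 72.0, 85.0]

cov_m = covariance(heights_m, weights_kg)

# The same heights expressed in centimetres: the relationship is unchanged,
# but every deviation from the mean is 100 times larger.
heights_cm = [h * 100 for h in heights_m]
cov_cm = covariance(heights_cm, weights_kg)

print(cov_m, cov_cm)
```

The second value comes out about 100 times larger than the first, even though both describe exactly the same people. The sign is informative, the magnitude on its own is not.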

Correlation

By dividing the covariance by a measure of the variability of each variable, we get a unitless metric which is comparable across pairs of variables, and which can be intuitively interpreted. For example, using the Pearson correlation coefficient:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i - \mu_y)^2}}$$

  • $r$ = Pearson Correlation Coefficient
  • $x_i$ = $i^{th}$ element of $X$
  • $y_i$ = $i^{th}$ element of $Y$
  • $n$ = number of data points ($X$ and $Y$ must have the same number of data points)
  • $\mu_x$ = mean of the independent variable $X$
  • $\mu_y$ = mean of the dependent variable $Y$

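$r$ varies between -1 and 1. A value of 1 indicates a perfect positive linear relationship, -1 a perfect negative one, and 0 no linear relationship.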
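As a rough illustration, the Pearson formula above can be written in a few lines of plain Python. The `pearson_r` function and the data are hypothetical, chosen only to show that $r$ does not change when one variable is rescaled.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, term by term as in the formula above."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must have the same number of data points")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    numerator = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mu_x) ** 2 for x in xs) * sum((y - mu_y) ** 2 for y in ys)
    )
    if denominator == 0:
        raise ValueError("r is undefined when one of the variables is constant")
    return numerator / denominator


# Hypothetical data: heights in metres, then centimetres, and weights in kilograms.
heights_m = [1.60, 1.70, 1.80, 1.90]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [55.0, 65.0, 72.0, 85.0]

r_m = pearson_r(heights_m, weights_kg)
r_cm = pearson_r(heights_cm, weights_kg)

print(r_m, r_cm)
```

Both calls return the same value (up to floating-point rounding), close to 1 for this nearly linear made-up data, which is what makes $r$ comparable from one pair of variables to another.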

Coefficient of Determination

$R^2$, the coefficient of determination, is a measure that assesses the goodness of fit of a regression model.

$$ \large R^2 = 1 - \dfrac{\sum_i(y_i - \hat y_i)^2}{\sum_i(y_i - \overline y)^2} $$

  • $\sum_i(y_i - \hat y_i)^2$ is the residual sum of squares. It is the sum of the squared differences between each observed value $y_i$ and its predicted value $\hat y_i$. The model does not explain this part of the variation.
  • $\sum_i(y_i - \overline y)^2$ is the total sum of squares. It is the sum of the squared differences between each observed value $y_i$ and the mean $\overline y$.

$R^2$ can be seen as 1 minus the proportion of variance that cannot be explained by the model. A perfect fit would therefore have an $R^2$ of 1.
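Here is a small sketch of how $R^2$ could be computed by hand. It assumes the line $Y = A+BX$ is fitted by ordinary least squares, which this article does not spell out; the `fit_line` and `r_squared` functions and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(ys, y_hat):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, y_hat))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical, nearly linear data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
predictions = [a + b * x for x in xs]
r2 = r_squared(ys, predictions)

print(a, b, r2)
```

With predictions that match the observations exactly, the residual sum of squares is 0 and `r_squared` returns 1, the perfect fit described above.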

Assumptions

Linear regression is a parametric technique: it estimates a fixed set of parameters from the data. Therefore, the data must fulfil several criteria. If those assumptions are violated, the predictions might be biased or unreliable.

  • The variables have a linear relationship. A scatter plot of the data will quickly tell you if this is the case.
  • Residuals are normally distributed. A histogram or a Q-Q plot of the residuals will be able to tell you more about this.
  • Homoscedasticity. The residuals have a constant variance. A scatter plot of the residuals will reveal if this assumption holds.
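Plots remain the best way to check these assumptions, but the residuals themselves are easy to compute. The sketch below assumes an ordinary least squares fit and uses a crude heuristic of my own (comparing the mean squared residual in the lower and upper halves of $X$); it is not a formal statistical test, and the function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def residuals(xs, ys):
    """Observed value minus fitted value for every data point."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def spread_ratio(xs, res):
    """Mean squared residual of the larger-x half over that of the smaller-x half."""
    ordered = [r for _, r in sorted(zip(xs, res))]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[-half:]

    def mean_square(values):
        return sum(v * v for v in values) / len(values)

    return mean_square(high) / mean_square(low)


# Hypothetical data whose noise grows with x (heteroscedastic on purpose).
xs = list(range(1, 11))
ys = [2 * x + (0.1 * x if x % 2 else -0.1 * x) for x in xs]

res = residuals(xs, ys)
ratio = spread_ratio(xs, res)

print(res)
print(ratio)
```

A ratio far from 1, as with this deliberately misbehaving data, suggests the variance of the residuals is not constant. The `res` list is also what you would feed to a histogram, a Q-Q plot or a scatter plot.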

Final notes

Even when linear regression is the perfect tool for the job and all the assumptions are met, extrapolating the regression equation to values outside the range of the data analysed might be a bad idea. Predictions will be more reliable within the range of the data used to fit the regression.

Correlation does not imply causation.

Trust your intuition! Some correlations are spurious, or due to chance. Look at this website to see a few of these improbable correlations.

And, sometimes, there is a hidden factor. For example, this “study” links chocolate consumption with Nobel Prizes. One may argue that wealth implies better education and research, hence more Nobel Prizes, and that wealth also implies a higher consumption of luxury goods like chocolate.

Finally, keep Anscombe’s Quartet in mind when analysing data. You might be comfortable enough to crunch the numbers and understand the data from the resulting tables, but do yourself a favour and plot the data. It might go a long way!
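To see why, here is a short sketch that computes the mean of $Y$ and the Pearson correlation for each of the four datasets of the quartet. The values are the ones commonly published for Anscombe's 1973 datasets; the helper names are mine.

```python
import math

# Anscombe's quartet: datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    mu_x, mu_y = mean(xs), mean(ys)
    numerator = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mu_x) ** 2 for x in xs) * sum((y - mu_y) ** 2 for y in ys)
    )
    return numerator / denominator


# Map each dataset name to (mean of y, correlation), rounded for display.
summary = {
    name: (round(mean(ys), 2), round(pearson_r(xs, ys), 3))
    for name, (xs, ys) in quartet.items()
}

for name, (mean_y, r) in summary.items():
    print(name, mean_y, r)
```

The four lines printed are almost identical, yet a scatter plot of each dataset looks completely different: a noisy line, a curve, a line with one outlier, and a vertical cluster with a single distant point.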
