# Key concepts

## Covariance

• $\sigma_{XY}$ = Covariance between $X$ and $Y$
• $x_i$ = $i^{th}$ element of $X$
• $y_i$ = $i^{th}$ element of $Y$
• $n$ = number of data points ($X$ and $Y$ must have the same number of data points)
• $\mu_x$ = mean of the independent variable $X$
• $\mu_y$ = mean of the dependent variable $Y$
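• If $\sigma_{XY}$ is positive, the variables are positively related: they tend to increase together.
• If $\sigma_{XY}$ is negative, they are negatively related: one tends to decrease as the other increases.
• If $\sigma_{XY}$ is zero, there is no linear relationship between the variables (a non-linear relationship may still exist).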
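
The symbols above fit the population form of the covariance, which divides by $n$. Treating that as the intended form is an assumption here; a sample estimate would divide by $n - 1$ instead:

$$\sigma_{XY} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)$$

A minimal Python sketch of that formula follows. The function name `covariance` and the toy data are illustrative only.

```python
def covariance(x, y):
    """Population covariance of two equal-length sequences (divides by n)."""
    if len(x) != len(y):
        raise ValueError("X and Y must have the same number of data points")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Average product of the deviations from each mean.
    return sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / n


# Toy data (made up for illustration).
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises with x
y_down = [10, 8, 6, 4, 2]  # falls as x rises

print(covariance(x, y_up))    # 4.0  -> positively related
print(covariance(x, y_down))  # -4.0 -> negatively related
```

The size of the result depends on the units of $X$ and $Y$, so only its sign is easy to interpret directly.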

## Correlation

• $r$ = Pearson Correlation Coefficient
• $x_i$ = $i^{th}$ element of $X$
• $y_i$ = $i^{th}$ element of $Y$
• $n$ = number of data points ($X$ and $Y$ must have the same number of data points)
• $\mu_x$ = mean of the independent variable $X$
• $\mu_y$ = mean of the dependent variable $Y$
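
With the same symbols, the Pearson correlation coefficient is the covariance rescaled by the spread of each variable, so it always lies between $-1$ and $1$:

$$r = \frac{\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \mu_y)^2}}$$

As a rough sketch, it could be computed like this. The name `pearson_r` and the data are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("X and Y must have the same number of data points")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    numerator = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mu_x) ** 2 for xi in x)
    ss_y = sum((yi - mu_y) ** 2 for yi in y)
    # Undefined (division by zero) if either variable is constant.
    return numerator / math.sqrt(ss_x * ss_y)


# Toy data (made up for illustration).
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # perfectly linear, increasing
print(pearson_r(x, [2, 1, 4, 3, 5]))   # increasing on the whole, with noise
```

Unlike the covariance, $r$ has no units, which makes values comparable across different pairs of variables.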

## Coefficient of Determination

• $\sum_i(y_i - \hat y_i)^2$ is the residual sum of squares. It is the sum of the squared differences between each observed value $y_i$ and its prediction $\hat y_i$. The model does not explain this part of the variation.
• $\sum_i(y_i - \overline y)^2$ is the total sum of squares. It is the sum of the squared differences between each observed value $y_i$ and the mean $\overline y$.
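
These two sums combine into the coefficient of determination, the share of the total variation that the model explains:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \overline y)^2}$$

A small sketch of the calculation follows. The function name `r_squared` and the numbers are made up for illustration.

```python
def r_squared(y, y_hat):
    """Coefficient of determination from observed and predicted values."""
    if len(y) != len(y_hat):
        raise ValueError("y and y_hat must have the same length")
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)                 # total
    return 1 - ss_res / ss_tot


# Toy data (made up for illustration).
y = [3, 5, 7, 9]
y_hat = [2.5, 5.5, 6.5, 9.5]

print(r_squared(y, y_hat))  # close to 1: the predictions track y well
print(r_squared(y, y))      # 1.0: perfect predictions leave no residual
```

Predicting the mean $\overline y$ for every point gives $R^2 = 0$, which is the baseline the model is compared against.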

# Assumptions

• The variables have a linear relationship. A scatter plot of the data will quickly tell you if this is the case.
• Residuals are normally distributed. A histogram or a Q-Q plot of the residuals shows whether this holds.
• Homoscedasticity. The residuals have constant variance. A scatter plot of the residuals against the fitted values reveals whether this assumption holds: the spread should look about the same everywhere.
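
The plots above are the usual way to check these assumptions. As a crude numeric stand-in for the residual scatter plot, the sketch below fits a least-squares line, computes the residuals, and compares their spread in the lower and upper halves of $X$. The helper names (`fit_line`, `residuals`, `spread_ratio`) and the data are hypothetical, and this is not a formal statistical test. It also says nothing about normality.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x)
    s_xy = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return slope, mu_y - slope * mu_x


def residuals(x, y):
    """Observed y minus the fitted line, one value per data point."""
    slope, intercept = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def spread_ratio(x, y):
    """Mean squared residual for the upper half of x over the lower half.

    A value near 1 is consistent with constant variance; a value far
    from 1 hints at heteroscedasticity.
    """
    pairs = sorted(zip(x, residuals(x, y)))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    mean_sq = lambda values: sum(r * r for r in values) / len(values)
    return mean_sq(high) / mean_sq(low)


# Toy data (made up): y = 2x + 1 plus noise.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y_even = [3.5, 4.5, 7.5, 8.5, 11.5, 12.5, 15.5, 16.5]  # same noise size throughout
y_fan = [3.1, 4.9, 7.1, 8.9, 12.0, 12.0, 16.0, 16.0]   # noise grows with x

print(spread_ratio(x, y_even))  # about 1
print(spread_ratio(x, y_fan))   # well above 1
```

On real data, a plot is still worth looking at: a single ratio can hide curved patterns that also signal a violated linearity assumption.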
