Data Analysis with Pandas
From A to Z
0 — The Data
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
df
A — Apply
This method applies a function along an axis of the DataFrame. reference
It can also apply a function to the elements of a Series (reference), as per the following example:
# function to binarise the data
def smoke_bin(text):
    if text == 'Yes':
        return 1
    else:
        return 0

df['smoker'].apply(smoke_bin)
0      0
1      0
2      0
3      0
4      0
      ..
239    0
240    1
241    1
242    0
243    0
Name: smoker, Length: 244, dtype: category
Categories (2, int64): [1, 0]
B — Boolean Criterion
Using a boolean criterion, also called boolean indexing, is a great way to split a DataFrame into subsets. In the following case, the criterion is on the smoker field (the possible values being “Yes” or “No”). The criterion object itself is a Series of True/False values: where pandas reads True, the row is selected; where it reads False, the row is discarded. reference
is_smoker = df['smoker'] == 'Yes'
df[is_smoker]
# ~ is the boolean operator "not"
df[~is_smoker]
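Multiple criteria can be combined with the element-wise operators & (and) and | (or); a minimal sketch, where each criterion needs its own parentheses because of operator precedence:
# rows for smokers who dined on a weekend
weekend_smokers = (df['smoker'] == 'Yes') & (df['day'].isin(['Sat', 'Sun']))
df[weekend_smokers]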
C — Contains
This method only works on Series. It tests the content of a text field (hence the .str prefix) against a pattern or a regular expression; the result of this test is a boolean criterion. reference
For example, one way to filter for a specific day of the week could be:
sundays = df['day'].str.contains('Sun')
df[sundays]
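Since .str.contains() accepts regular expressions, the same idea extends to fuzzier matches; a quick sketch using alternation to match either weekend day:
# regex alternation: matches 'Sat' or 'Sun'
weekend = df['day'].str.contains('Sat|Sun', regex=True)
df[weekend]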
D — Describe
Describe is usually the first line typed after defining the DataFrame. It gives a quick summary of the DataFrame. By default, it will only return statistics about numerical fields, but using the parameter include='all' will force all the fields to be returned. Note that the statistics on non-numerical fields are somewhat concise. reference
df.describe(percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99], include='all')
E — Explode
Explode is a routine that expands sequences (lists, tuples, arrays, sets, etc.) so that each element becomes its own row. reference
# creating a DataFrame with sequences as observations
df_explode = pd.DataFrame({'Col1': [(2,4), [i for i in range(5)], []], 'Col2': [1, 2, 3]})
df_explode.explode('Col1')
F — Fillna
When it comes to handling missing data, .fillna() is of great help. Missing values can easily be replaced by a specific value (see example), by the last valid observation, or by the next valid observation (a sketch of both follows the output below). reference
# altering the DataFrame
df['tip_na'] = df['tip'].replace(to_replace=[2.00, 3.00, 4.00], value=np.nan)
df['tip_na'].fillna(df['tip_na'].mean())
0      1.010000
1      1.660000
2      3.500000
3      3.310000
4      3.610000
         ...
239    5.920000
240    3.116932
241    3.116932
242    1.750000
243    3.116932
Name: tip_na, Length: 244, dtype: float64
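To fill with the last or the next valid observation instead of a fixed value, .ffill() and .bfill() can be chained in the same spot; a minimal sketch:
# propagate the last valid observation forward
df['tip_na'].ffill()
# or take the next valid observation backward
df['tip_na'].bfill()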
G — Groupby
Groupby folds the data according to the values of the specified column. On its own, it returns a groupby object, so do not forget to chain a function specifying how to handle the data in other columns. reference
df.groupby(by=['day', 'time']).sum()
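To apply a different aggregation to each column, .agg() can be chained instead of .sum(); a minimal sketch:
# average tip and party count per day/time group
df.groupby(by=['day', 'time']).agg({'tip': 'mean', 'size': 'count'})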
H — Head
This function returns the first n rows of the DataFrame. A great way to get familiar with a new dataset! .tail(n=nrow) is a similar function that returns the last n rows of a dataset. reference - head / reference - tail
df.head(n=10)
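And the symmetric call for the end of the dataset:
df.tail(n=10)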
I — Isna
Checks for missing values. It returns True if the value is missing, and False if there is a value. Once again, it can be used as a boolean criterion to filter a dataset, or it can be chained with other functions such as value_counts() to compile a handy summary of the state of the DataFrame. reference
# altering the DataFrame
df['tip_na'] = df['tip'].replace(to_replace=[2.00, 3.00, 4.00], value=np.nan)
df.isna().value_counts()
total_bill  tip    sex    smoker  day    time   size   tip_na
False       False  False  False   False  False  False  False     176
                                                       True       68
dtype: int64
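A per-column tally of missing values is often even handier; a quick sketch:
# number of missing values in each column
df.isna().sum()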
J — Join
Just like tables in SQL, DataFrames can be joined based on specific keys. reference
# creating DataFrames to join
df_num = df[['total_bill', 'tip', 'size']]
df_cat = df[['sex', 'smoker', 'day', 'time']]

df_num.join(df_cat)
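By default, .join() matches rows on the index. To join on an explicit key column, pd.merge is the usual tool; a self-contained sketch with toy data:
# joining two DataFrames on a shared key column
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'val_left': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'val_right': [4, 5, 6]})
pd.merge(left, right, on='key', how='inner')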
K — Kurtosis
Routine calculating normalised and unbiased kurtosis (using Fisher’s definition of kurtosis). A negative number indicates a platykurtic distribution, 0 indicates a normally distributed population, and a positive number indicates a leptokurtic distribution: the higher the number, the heavier the tails. reference
df['total_bill'].plot(kind='kde')
df['tip'].plot(kind='kde')

df.kurtosis()
total_bill    1.218484
tip           3.648376
size          1.731700
tip_na        2.494184
dtype: float64
L — Loc / Iloc
.loc[] is used to access specific values of the DataFrame based on its index. .loc[] is label based (it also accepts a boolean criterion), whilst .iloc[] is position based (it also works with a boolean array). reference - loc / reference - iloc
# Creating a DataFrame with an alphabetical index
df_loc = df.loc[[i for i in range(26)]]
df_loc['new_index'] = list('qwertyuiopasdfghjklzxcvbnm')
df_loc.set_index('new_index', inplace=True)

df_loc.head(5)
df_loc.loc[['w','a','s','d']]
df_loc.iloc[[0, 1, 2, 3, 4]]
M — Melt
Melt is used to “unpivot” a DataFrame from wide format to long format. reference
pd.melt(df, id_vars=['time'], value_vars=['total_bill', 'tip'])
N — NaN
Not a Number is a value coming from the numpy library (np.nan). It means there is no value recorded for that observation/parameter. Depending on how much data is missing, it can be problematic to perform certain tasks; for example, you will not be able to train most models with NaN values. Thankfully there are ways to alleviate missing data, such as fillna() (see F — Fillna). Read more about dealing with missing data.
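When filling is not appropriate, rows containing NaN can simply be dropped; a quick sketch:
# drop every row containing at least one NaN
df.dropna()
# drop rows only when a specific column is missing
df.dropna(subset=['tip_na'])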
O — Ordered
This routine checks whether the categories of a categorical column have an ordered relationship (it only works on categorical data; check with df.dtypes if necessary). reference
df['time'].cat.ordered
False
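An order can be declared explicitly when building or converting a categorical; a minimal sketch using the existing 'day' categories:
# declare an explicit order on the 'day' categories
ordered_day = df['day'].cat.reorder_categories(['Thur', 'Fri', 'Sat', 'Sun'], ordered=True)
ordered_day.cat.ordered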
P — Plot
Using matplotlib as a backend, pandas can directly plot DataFrames/Series. .plot() can be added to the DataFrame itself, or to an alteration of it: df['column_1'].plot(), df['column_1'].cumsum().plot(). It allows several kinds of plots out of the box: ‘area’, ‘barh’, ‘bar’, ‘box’, ‘density’, ‘hexbin’, ‘hist’, ‘kde’, ‘line’, ‘pie’, ‘scatter’, and accepts all the matplotlib plotting arguments. reference
df.plot()
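The kind parameter picks the plot type; a small sketch plotting tips against the bill:
# scatter plot of tip vs total bill
df.plot(kind='scatter', x='total_bill', y='tip')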
Q — Qcut
With this routine, it becomes very easy to discretise the data (at least if you are interested in a quantile-based categorisation). reference
pd.qcut(df['tip'], 3, labels=['Cheap', 'Decent', 'Generous'])
0         Cheap
1         Cheap
2      Generous
3      Generous
4      Generous
         ...
239    Generous
240       Cheap
241       Cheap
242       Cheap
243      Decent
Name: tip, Length: 244, dtype: category
Categories (3, object): ['Cheap' < 'Decent' < 'Generous']
R — Read_…
Pandas can read a lot of different file formats out of the box:
- pandas.read_clipboard
- pandas.read_csv
- pandas.read_excel
- pandas.read_feather
- pandas.read_fwf
- pandas.read_gbq
- pandas.read_hdf
- pandas.read_html
- pandas.read_json
- pandas.read_orc
- pandas.read_parquet
- pandas.read_pickle
- pandas.read_sas
- pandas.read_spss
- pandas.read_sql
- pandas.read_sql_query
- pandas.read_sql_table
- pandas.read_stata
- pandas.read_table
The most common ones will be “read_csv”, “read_excel”, and “read_json”, but there are some very interesting ones such as “read_parquet”, which can load a file from a URL into memory, with only certain columns if desired (handy with big data):
pd.read_parquet('s3://bucket_name', columns=['column_1', 'column_3', 'column_11'])
S — Sample
This is another great way to get familiar with the data. Sample will randomly pick rows from the DataFrame. It is also an easy way to perform undersampling (frac < 1) and oversampling (frac > 1, which requires replace=True). reference
df.sample(frac=.1, replace=True)
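For a reproducible sample, a random seed can be fixed; a quick sketch (42 is an arbitrary seed):
# same random sample on every run
df.sample(frac=.1, replace=True, random_state=42)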
T — To_…
This one is more a category of functions than a function per se, but pandas offers a lot of exporting functions:
- to_clipboard
- to_csv
- to_dict
- to_excel
- to_feather
- to_gbq
- to_hdf
- to_html
- to_json
- to_latex
- to_markdown
- to_parquet
- to_pickle
- to_records
- to_sql
- to_stata
- to_string
- to_xarray
For example, to export a DataFrame as a .csv and read it back:
df.to_csv('filename.csv')
pd.read_csv('filename.csv', index_col=[0])
U — Unique
This routine will return an array of unique values encountered in a Series. reference
pd.unique(df['day'])
['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Sun', 'Sat', 'Thur', 'Fri']
V — Value Counts
This routine counts the occurrences of the unique values in the DataFrame/Series. reference
df['tip'].value_counts().head(20)
2.00    33
3.00    23
4.00    12
5.00    10
2.50    10
3.50     9
1.50     9
1.00     4
1.25     3
3.48     3
2.01     2
4.08     2
2.23     2
2.03     2
3.18     2
2.31     2
2.24     2
6.50     2
4.30     2
2.20     2
Name: tip, dtype: int64
W — Where
Similar to numpy.where, this routine filters the data based on a condition. It returns the initial DataFrame, but where the condition is False, the values are replaced with NaN (or with a replacement passed via the other parameter). reference
df.where(df['smoker'] == 'Yes')
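Rather than leaving NaN behind, a replacement value can be supplied through the other parameter; a minimal sketch on a single column:
# keep tips above 3, replace the rest with 0
df['tip'].where(df['tip'] > 3, other=0)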
X — XS
In a multi-indexed DataFrame, xs takes a cross-section of the DataFrame. reference
# creating a DataFrame with multi-index
df_xs = df.set_index(['day', 'size'])
df_xs.xs('Fri')
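The level parameter lets the cross-section target an inner level of the index; a quick sketch:
# all parties of size 2, regardless of the day
df_xs.xs(2, level='size')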
Y — Year
This attribute can be run on datetime, timestamp, etc. It extracts the year from time data. reference
# creating a Series of month-end dates
df_time = pd.Series(pd.date_range('2000-01-01', periods=20, freq='M'))

df_time
0    2000-01-31
1 2000-02-29
2 2000-03-31
3 2000-04-30
4 2000-05-31
5 2000-06-30
6 2000-07-31
7 2000-08-31
8 2000-09-30
9 2000-10-31
10 2000-11-30
11 2000-12-31
12 2001-01-31
13 2001-02-28
14 2001-03-31
15 2001-04-30
16 2001-05-31
17 2001-06-30
18 2001-07-31
19 2001-08-31
dtype: datetime64[ns]
df_time.dt.year
0     2000
1 2000
2 2000
3 2000
4 2000
5 2000
6 2000
7 2000
8 2000
9 2000
10 2000
11 2000
12 2001
13 2001
14 2001
15 2001
16 2001
17 2001
18 2001
19 2001
dtype: int64
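The .dt accessor exposes many sibling attributes (month, day, dayofweek, etc.); a quick sketch:
# extract the month number and the day name from the same Series
df_time.dt.month
df_time.dt.day_name()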
Z — Zfill
This method pads strings to a minimum length, filling in with 0. If the data are heterogeneous, non-string data will be converted to NaN. reference
df['smoker'].str.zfill(4)
0      00No
1 00No
2 00No
3 00No
4 00No
...
239 00No
240 0Yes
241 0Yes
242 00No
243 00No
Name: smoker, Length: 244, dtype: object