Data Analysis with Pandas

From A to Z

--

0 — The Data

import numpy as np
import pandas as pd
import seaborn as sns

df = sns.load_dataset('tips')
df
png

A — Apply

This method applies a function along an axis of the DataFrame. reference

It can also apply a function to the elements of a Series (reference), as per the following example:

# function to binarise the data
def smoke_bin(text):
if text == 'Yes':
return 1
else:
return 0


df['smoker'].apply(lambda x : smoke_bin(x))
0 0
1 0
2 0
3 0
4 0
..
239 0
240 1
241 1
242 0
243 0
Name: smoker, Length: 244, dtype: category
Categories (2, int64): [1, 0]

B — Boolean Criterion

Using a boolean criterion, also called boolean indexing, is a great way to split a DataFrame into subsets. In the following case, the criterion is on the smoker field (the possible values being: “Yes” or “No”). The criterion object itself is a list of True/False values. When pandas reads True, the value is selected, when the value is False, the value is discarded. reference

is_smoker = df['smoker'] == 'Yes'
df[is_smoker]
png
# ~ is the boolean operator "not"
df[~is_smoker]
png

C — Contains

This method only works on Series. It will test the content of a text field (hence the .str prefix) against a pattern or a regex expression. The result of this test will be a boolean criterion. reference

For example, one way to filter for a specific day of the week could be:

sundays = df['day'].str.contains('Sun')
df[sundays]
png

D — Describe

Describe usually is the first line typed after defining the DataFrame. It gives a quick summary of the DataFrame. By default, it will only return statistics about numerical fields, but using the parameter include='all' will force all the fields to be returned. Note the statistics on non-numerical fields are somewhat concise. reference

df.describe(percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99], include='all')
png

E — Explode

Explode is a routine that turns sequences (lists, tuples, arrays, sets, etc.) into one-line observations. reference

# creating a DataFrame with sequences as observations
df_explode = pd.DataFrame({'Col1': [(2,4), [i for i in range(5)], []], 'Col2': [1, 2, 3]})
png
df_explode.explode('Col1')
png

F — Fillna

When it comes to handling missing data, .fillna() is of great help. Missing values can easily be replaced by a specific value (see example), by the last valid observation, or by the next valid observation. reference

# altering the DataFrame
df['tip_na'] = df['tip'].replace(to_replace=[2.00, 3.00, 4.00], value=np.nan)

df['tip_na'].fillna(df['tip_na'].mean())
0 1.010000
1 1.660000
2 3.500000
3 3.310000
4 3.610000
...
239 5.920000
240 3.116932
241 3.116932
242 1.750000
243 3.116932
Name: tip_na, Length: 244, dtype: float64
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
df
png

G — Groupby

Groupby folds the data according to the values of the specified column. On its own, it returns a groupby object, so do not forget to chain a function specifying how to handle the data in other columns. reference

df.groupby(by=['day', 'time']).sum()
png

H — Head

This function returns the first n rows of the DataFrame. A great way to get familiar with a new dataset!

.tail(n=nrow) is a similar function that will return the last n rows of a dataset reference - head / reference - tail

df.head(n=10)
png

I — Isna

Checks for missing values. It will return True if the value is missing, and False if there is a value. Once again, it can be used as a boolean criterion to filter a dataset. Or it cn be chained with other functions such as value_counts() to compile a handy summary of the state of the DataFrame. reference

# altering the DataFrame
df['tip_na'] = df['tip'].replace(to_replace=[2.00, 3.00, 4.00], value=np.nan)

df.isna().value_counts()
total_bill tip sex smoker day time size tip_na
False False False False False False False False 176
True 68
dtype: int64

J — Join

Just like tables in SQL, DataFrames can be joined based on specific keys. reference

# creating DataFrames to join
df_num = df[['total_bill', 'tip', 'size']]
df_cat = df[['sex', 'smoker', 'day', 'time']]
df_num.join(df_cat)
png

K — Kurtosis

Routine calculating normalised and unbiaised kurtosis (using Fisher’s definition of kurtosis). A negative number indicates platykurtosis, 0 indicates a normally distributed population, and the higher the number is, the more the data distribution is platykurtic. reference

df['total_bill'].plot(kind='kde')<AxesSubplot:ylabel='Density'>
png
df['tip'].plot(kind='kde')<AxesSubplot:ylabel='Density'>
png
df.kurtosis()total_bill    1.218484
tip 3.648376
size 1.731700
tip_na 2.494184
dtype: float64

L — Loc/ Iloc

.loc[] is used to access specific values of the DataFrame, based on its index. .loc[] is label based (or a boolean criterion), whilst .iloc[] is position based (also works with a boolean criterion). reference - loc / reference - iloc

# Creating a DataFrame with alpha index
df_loc = df.loc[[i for i in range(26)]]
df_loc['new_index'] =[i for i in 'qwertyuiopasdfghjklzxcvbnm']
df_loc.set_index('new_index', inplace=True)
df_loc.head(5)
png
df_loc.loc[['w','a','s','d']]
png
df_loc.iloc[[0, 1, 2, 3, 4]]
png

M -Melt

Melt is used to “unpivot” a DataFrame from wide format to long format. reference

pd.melt(df, id_vars=['time'], value_vars=['total_bill', 'tip'])
png

N — NaN

Not a Number is a value coming from the numpy library (np.nan). It means there is no value recorded for that observation/parameter. Depending on how much data is missing, it can be problematic to perform certain tasks. For example, you will not be able to train most models with NaN. Thankfully there are ways to alliviate these missing data, such as [fillna()](#"F - Fillna"). Read more about dealing with missing data.

O — Ordered

This routine checks if a categories have an ordered relationship. (only workds on categorical data, check with df.dtypes if necessary). reference

df['time'].cat.orderedFalse

P — Plot

Using matplotlib as a backend, pandas can directly plot DataFrames/Series. .plot() can be added to the DataFrame itself, or to an alteration of it: df['column_1'].plot(), df['column_1'].cumsum().plot(). It allows several kind of plots out the box: ‘area’, ‘barh’, ‘bar’, ‘box’, ‘density’, ‘hexbin’, ‘hist’, ‘kde’, ‘line’, ‘pie’, ‘scatter’ and accepts all the matplotlib plotting arguments. reference

df.plot()<AxesSubplot:>
png

Q — Qcut

With this routine, it becomes very easy to discretise the data (at least if you are interested in a quantile-based categorisation). reference

pd.qcut(df['tip'], 3, labels=['Cheap', 'Decent', 'Generous'])0         Cheap
1 Cheap
2 Generous
3 Generous
4 Generous
...
239 Generous
240 Cheap
241 Cheap
242 Cheap
243 Decent
Name: tip, Length: 244, dtype: category
Categories (3, object): ['Cheap' < 'Decent' < 'Generous']

R — Read_…

Pandas can read a lot of different file formats out of the box:

  • pandas.read_clipboard
  • pandas.read_csv
  • pandas.read_excel
  • pandas.read_feather
  • pandas.read_fwf
  • pandas.read_gbq
  • pandas.read_hdf
  • pandas.read_html
  • pandas.read_json
  • pandas.read_orc
  • pandas.read_parquet
  • pandas.read_pickle
  • pandas.read_sas
  • pandas.read_spss
  • pandas.read_sql
  • pandas.read_sql_query
  • pandas.read_sql_table
  • pandas.read_stata
  • pandas.read_table

reference

The most common ones wil be “read_csv”, “read_excel”, “read_json” bue there ar some very interesting ones such as “read_parquet” which allows to load a url, with only certain columns if desired, into memory (handy with big data):

pd.read_parquet('s3://bucket_name', columns=['column_1', 'column_3', 'column_11'])

S -Sample

This is another great way to get familiar with the data. Sample will randomly pick data from the DataFrame. It is also an easy way to perform undersampling (frac < 1 ) and oversampling (frac > 1). reference

df.sample(frac=.1, replace=True)
png

T — To_…

This one is more a category of functions than a function per se, but pandas offers a lot of exporting functions:

  • to_clipboard
  • to_csv
  • to_dict
  • to_excel
  • to_feather
  • to_gbq
  • to_hdf
  • to_html
  • to_json
  • to_latex
  • to_markdown
  • to_parquet
  • to_pickle
  • to_records
  • to_sql
  • to_stata
  • to_string
  • to_xarray

reference

For example, to export a DataFrame as a .csv:

df.to_csv('filename.csv')pd.read_csv('filename.csv', index_col=[0])
png

U — Unique

This routine will return an array of unique values encountered in a Series. reference

pd.unique(df['day'])['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Sun', 'Sat', 'Thur', 'Fri']

V — Value Counts

This routine counts the occurence of the unique values in the DataFrame/Series reference

df['tip'].value_counts().head(20)2.00    33
3.00 23
4.00 12
5.00 10
2.50 10
3.50 9
1.50 9
1.00 4
1.25 3
3.48 3
2.01 2
4.08 2
2.23 2
2.03 2
3.18 2
2.31 2
2.24 2
6.50 2
4.30 2
2.20 2
Name: tip, dtype: int64

W — Where

Similar to numpy.where, this routine will filter the data based on a condition. It returns the initial DataFrame, but where the condition is False, values have been removed. reference

df.where(df['smoker'] == 'Yes')
png

X — XS

In a multi-indexed DataFrame, xs takes a cross-section of the DataFrame reference

# creating a DataFrame with multi-index
df_xs = df.set_index(['day', 'size'])
png
df_xs.xs('Fri')
png

Y — Year

This attribute can be run on datetime, timestamp, etc. It will extract the year of time data. reference

# creating DataFrame 
df_time = pd.Series(pd.date_range('2000-01/01', periods=20, freq='M'))
0 2000-01-31
1 2000-02-29
2 2000-03-31
3 2000-04-30
4 2000-05-31
5 2000-06-30
6 2000-07-31
7 2000-08-31
8 2000-09-30
9 2000-10-31
10 2000-11-30
11 2000-12-31
12 2001-01-31
13 2001-02-28
14 2001-03-31
15 2001-04-30
16 2001-05-31
17 2001-06-30
18 2001-07-31
19 2001-08-31
dtype: datetime64[ns]
df_time.dt.year0 2000
1 2000
2 2000
3 2000
4 2000
5 2000
6 2000
7 2000
8 2000
9 2000
10 2000
11 2000
12 2001
13 2001
14 2001
15 2001
16 2001
17 2001
18 2001
19 2001
dtype: int64

Z — Zfill

This method pads strings to a minimum length, filling in with 0. If data are heterogenous, non-string data will be converted to NaN. reference

df['smoker'].str.zfill(4)0      00No
1 00No
2 00No
3 00No
4 00No
...
239 00No
240 0Yes
241 0Yes
242 00No
243 00No
Name: smoker, Length: 244, dtype: object

--

--