Data Analysis with Pandas
From A to Z
0 — The Data
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
df
A — Apply
This method applies a function along an axis of the DataFrame. reference
It can also apply a function to the elements of a Series (reference), as per the following example:
# function to binarise the data
def smoke_bin(text):
    if text == 'Yes':
        return 1
    else:
        return 0

df['smoker'].apply(smoke_bin)
0      0
1      0
2      0
3      0
4      0
      ..
239    0
240    1
241    1
242    0
243    0
Name: smoker, Length: 244, dtype: category
Categories (2, int64): [1, 0]
B — Boolean Criterion
Using a boolean criterion, also called boolean indexing, is a great way to split a DataFrame into subsets. In the following case, the criterion is on the smoker field (the possible values being “Yes” or “No”). The criterion object itself is a Series of True/False values: where pandas reads True, the row is selected; where it reads False, the row is discarded. reference
is_smoker = df['smoker'] == 'Yes'
df[is_smoker]
# ~ is the boolean operator "not"
df[~is_smoker]
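Multiple criteria can be combined with the element-wise operators & (and) and | (or); a minimal sketch, where each criterion needs its own parentheses because of operator precedence:
# rows for smokers who dined on a weekend
weekend_smokers = (df['smoker'] == 'Yes') & (df['day'].isin(['Sat', 'Sun']))
df[weekend_smokers]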
C — Contains
This method only works on Series. It tests the content of a text field (hence the .str prefix) against a pattern or a regular expression; the result of this test is a boolean criterion. reference
For example, one way to filter for a specific day of the week could be:
sundays = df['day'].str.contains('Sun')
df[sundays]
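Since .str.contains() accepts regular expressions, the same idea extends to fuzzier matches; a quick sketch using alternation to match either weekend day:
# regex alternation: matches 'Sat' or 'Sun'
weekend = df['day'].str.contains('Sat|Sun', regex=True)
df[weekend]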
D — Describe
Describe is usually the first line typed after defining the DataFrame. It gives a quick summary of the DataFrame. By default, it will only return statistics about numerical fields, but using the parameter include='all' will force all the fields to be returned. Note that the statistics on non-numerical fields are somewhat concise. reference
df.describe(percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99], include='all')
E — Explode
Explode is a routine that expands sequences (lists, tuples, arrays, sets, etc.) so that each element becomes its own row. reference
# creating a DataFrame with sequences as observations
df_explode = pd.DataFrame({'Col1': [(2,4), [i for i in range(5)], []], 'Col2': [1, 2, 3]})
df_explode.explode('Col1')
F — Fillna
When it comes to handling missing data, .fillna() is of great help. Missing values can easily be replaced by a specific value (see example), by the last valid observation, or by the next valid observation (a sketch of both follows the output below). reference
# altering the DataFrame
df['tip_na'] = df['tip'].replace(to_replace=[2.00, 3.00, 4.00], value=np.nan)
df['tip_na'].fillna(df['tip_na'].mean())
0      1.010000
1      1.660000
2      3.500000
3      3.310000
4      3.610000
         ...
239    5.920000
240    3.116932
241    3.116932
242    1.750000
243    3.116932
Name: tip_na, Length: 244, dtype: float64
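To fill with the last or the next valid observation instead of a fixed value, .ffill() and .bfill() can be chained in the same spot; a minimal sketch:
# propagate the last valid observation forward
df['tip_na'].ffill()
# or take the next valid observation backward
df['tip_na'].bfill()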
G — Groupby
Groupby folds the data according to the values of the specified column. On its own, it returns a groupby object, so do not forget to chain a function specifying how to handle the data in other columns. reference
df.groupby(by=['day', 'time']).sum()
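To apply a different aggregation to each column, .agg() can be chained instead of .sum(); a minimal sketch:
# average tip and party count per day/time group
df.groupby(by=['day', 'time']).agg({'tip': 'mean', 'size': 'count'})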
H — Head
This function returns the first n rows of the DataFrame. A great way to get familiar with a new dataset! .tail(n=nrow) is a similar function that returns the last n rows of a dataset. reference - head / reference - tail
df.head(n=10)
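And the symmetric call for the end of the dataset:
df.tail(n=10)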
I — Isna
Checks for missing values. It returns True if the value is missing, and False if there is a value. Once again, it can be used as a boolean criterion to filter a dataset, or it can be chained with other functions such as value_counts() to compile a handy summary of the state of the DataFrame. reference
# altering the DataFrame
df['tip_na'] = df['tip'].replace(to_replace=[2.00, 3.00, 4.00], value=np.nan)
df.isna().value_counts()
total_bill  tip    sex    smoker  day    time   size   tip_na
False       False  False  False   False  False  False  False     176
                                                       True       68
dtype: int64
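A per-column tally of missing values is often even handier; a quick sketch:
# number of missing values in each column
df.isna().sum()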
J — Join
Just like tables in SQL, DataFrames can be joined based on specific keys. reference
# creating DataFrames to join
df_num = df[['total_bill', 'tip', 'size']]
df_cat = df[['sex', 'smoker', 'day', 'time']]

df_num.join(df_cat)
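By default, .join() matches rows on the index. To join on an explicit key column, pd.merge is the usual tool; a self-contained sketch with toy data:
# joining two DataFrames on a shared key column
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'val_left': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'val_right': [4, 5, 6]})
pd.merge(left, right, on='key', how='inner')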
K — Kurtosis
Routine calculating normalised and unbiased kurtosis (using Fisher’s definition of kurtosis). A negative number indicates a platykurtic distribution, 0 indicates a normally distributed population, and a positive number indicates a leptokurtic distribution: the higher the number, the heavier the tails. reference
df['total_bill'].plot(kind='kde')
df['tip'].plot(kind='kde')

df.kurtosis()
total_bill    1.218484
tip           3.648376
size          1.731700
tip_na        2.494184
dtype: float64
L — Loc / Iloc
.loc[] is used to access specific values of the DataFrame based on its index. .loc[] is label based (it also accepts a boolean criterion), whilst .iloc[] is position based (it also works with a boolean array). reference - loc / reference - iloc
# Creating a DataFrame with an alphabetical index
df_loc = df.loc[[i for i in range(26)]]
df_loc['new_index'] = list('qwertyuiopasdfghjklzxcvbnm')
df_loc.set_index('new_index', inplace=True)

df_loc.head(5)
df_loc.loc[['w','a','s','d']]
df_loc.iloc[[0, 1, 2, 3, 4]]
M — Melt
Melt is used to “unpivot” a DataFrame from wide format to long format. reference
pd.melt(df, id_vars=['time'], value_vars=['total_bill', 'tip'])
N — NaN
Not a Number is a value coming from the numpy library (np.nan). It means there is no value recorded for that observation/parameter. Depending on how much data is missing, it can be problematic to perform certain tasks; for example, you will not be able to train most models with NaN values. Thankfully there are ways to alleviate missing data, such as fillna() (see F — Fillna). Read more about dealing with missing data.
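When filling is not appropriate, rows containing NaN can simply be dropped; a quick sketch:
# drop every row containing at least one NaN
df.dropna()
# drop rows only when a specific column is missing
df.dropna(subset=['tip_na'])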
O — Ordered
This routine checks whether the categories of a categorical column have an ordered relationship (it only works on categorical data; check with df.dtypes if necessary). reference
df['time'].cat.ordered
False
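An order can be declared explicitly when building or converting a categorical; a minimal sketch using the existing 'day' categories:
# declare an explicit order on the 'day' categories
ordered_day = df['day'].cat.reorder_categories(['Thur', 'Fri', 'Sat', 'Sun'], ordered=True)
ordered_day.cat.ordered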
P — Plot
Using matplotlib as a backend, pandas can directly plot DataFrames/Series. .plot() can be added to the DataFrame itself, or to an alteration of it: df['column_1'].plot(), df['column_1'].cumsum().plot(). It allows several kinds of plots out of the box: ‘area’, ‘barh’, ‘bar’, ‘box’, ‘density’, ‘hexbin’, ‘hist’, ‘kde’, ‘line’, ‘pie’, ‘scatter’, and accepts all the matplotlib plotting arguments. reference
df.plot()
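The kind parameter picks the plot type; a small sketch plotting tips against the bill:
# scatter plot of tip vs total bill
df.plot(kind='scatter', x='total_bill', y='tip')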
Q — Qcut
With this routine, it becomes very easy to discretise the data (at least if you are interested in a quantile-based categorisation). reference
pd.qcut(df['tip'], 3, labels=['Cheap', 'Decent', 'Generous'])
0         Cheap
1         Cheap
2      Generous
3      Generous
4      Generous
         ...
239    Generous
240       Cheap
241       Cheap
242       Cheap
243      Decent
Name: tip, Length: 244, dtype: category
Categories (3, object): ['Cheap' < 'Decent' < 'Generous']
R — Read_…
Pandas can read a lot of different file formats out of the box:
- pandas.read_clipboard
- pandas.read_csv
- pandas.read_excel
- pandas.read_feather
- pandas.read_fwf
- pandas.read_gbq
- pandas.read_hdf
- pandas.read_html
- pandas.read_json
- pandas.read_orc
- pandas.read_parquet
- pandas.read_pickle
- pandas.read_sas
- pandas.read_spss
- pandas.read_sql
- pandas.read_sql_query
- pandas.read_sql_table
- pandas.read_stata
- pandas.read_table
The most common ones will be “read_csv”, “read_excel”, and “read_json”, but there are some very interesting ones such as “read_parquet”, which can load a file from a URL into memory, with only certain columns if desired (handy with big data):
pd.read_parquet('s3://bucket_name', columns=['column_1', 'column_3', 'column_11'])
S — Sample
This is another great way to get familiar with the data. Sample will randomly pick rows from the DataFrame. It is also an easy way to perform undersampling (frac < 1) and oversampling (frac > 1, which requires replace=True). reference
df.sample(frac=.1, replace=True)
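For a reproducible sample, a random seed can be fixed; a quick sketch (42 is an arbitrary seed):
# same random sample on every run
df.sample(frac=.1, replace=True, random_state=42)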
T — To_…
This one is more a category of functions than a function per se, but pandas offers a lot of exporting functions:
- to_clipboard
- to_csv
- to_dict
- to_excel
- to_feather
- to_gbq
- to_hdf
- to_html
- to_json
- to_latex
- to_markdown
- to_parquet
- to_pickle
- to_records
- to_sql
- to_stata
- to_string
- to_xarray
For example, to export a DataFrame as a .csv and read it back:
df.to_csv('filename.csv')
pd.read_csv('filename.csv', index_col=[0])
U — Unique
This routine will return an array of unique values encountered in a Series. reference
pd.unique(df['day'])
['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Sun', 'Sat', 'Thur', 'Fri']
V — Value Counts
This routine counts the occurrences of the unique values in the DataFrame/Series. reference
df['tip'].value_counts().head(20)
2.00    33
3.00    23
4.00    12
5.00    10
2.50    10
3.50     9
1.50     9
1.00     4
1.25     3
3.48     3
2.01     2
4.08     2
2.23     2
2.03     2
3.18     2
2.31     2
2.24     2
6.50     2
4.30     2
2.20     2
Name: tip, dtype: int64
W — Where
Similar to numpy.where, this routine filters the data based on a condition. It returns the initial DataFrame, but where the condition is False, the values are replaced with NaN (or with a replacement passed via the other parameter). reference
df.where(df['smoker'] == 'Yes')
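Rather than leaving NaN behind, a replacement value can be supplied through the other parameter; a minimal sketch on a single column:
# keep tips above 3, replace the rest with 0
df['tip'].where(df['tip'] > 3, other=0)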
X — XS
In a multi-indexed DataFrame, xs takes a cross-section of the DataFrame. reference
# creating a DataFrame with multi-index
df_xs = df.set_index(['day', 'size'])
df_xs.xs('Fri')
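The level parameter lets the cross-section target an inner level of the index; a quick sketch:
# all parties of size 2, regardless of the day
df_xs.xs(2, level='size')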
Y — Year
This attribute can be run on datetime, timestamp, etc. It extracts the year from time data. reference
# creating a Series of month-end dates
df_time = pd.Series(pd.date_range('2000-01-01', periods=20, freq='M'))

df_time
0    2000-01-31
1 2000-02-29
2 2000-03-31
3 2000-04-30
4 2000-05-31
5 2000-06-30
6 2000-07-31
7 2000-08-31
8 2000-09-30
9 2000-10-31
10 2000-11-30
11 2000-12-31
12 2001-01-31
13 2001-02-28
14 2001-03-31
15 2001-04-30
16 2001-05-31
17 2001-06-30
18 2001-07-31
19 2001-08-31
dtype: datetime64[ns]
df_time.dt.year
0     2000
1 2000
2 2000
3 2000
4 2000
5 2000
6 2000
7 2000
8 2000
9 2000
10 2000
11 2000
12 2001
13 2001
14 2001
15 2001
16 2001
17 2001
18 2001
19 2001
dtype: int64
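The .dt accessor exposes many sibling attributes (month, day, dayofweek, etc.); a quick sketch:
# extract the month number and the day name from the same Series
df_time.dt.month
df_time.dt.day_name()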
Z — Zfill
This method pads strings to a minimum length, filling in with 0. If the data are heterogeneous, non-string data will be converted to NaN. reference
df['smoker'].str.zfill(4)
0      00No
1 00No
2 00No
3 00No
4 00No
...
239 00No
240 0Yes
241 0Yes
242 00No
243 00No
Name: smoker, Length: 244, dtype: object