I’m a product data analyst (currently working for leboncoin.fr).
This is my peripherical-brain for python / pandas, it’s not exhaustive, but I try to update it with cool stuff I learned along the way.

Started: ~2015
Last Update: 2020.04

Python 🐍



Create on the fly list for loop:

for info in df.column.unique().to_list():
     do stuff

Enumerate() for loop:

env_list = [responsive, ios, android]
env_name = ["responsive", "ios", "android"]
for idx, env in enumerate(env_list):
    current_date_name = 'df_{}_{}'.format(env_name[idx], current_date.replace('-', '_'))
    previous_date_name = 'df_{}_{}'.format(env_name[idx], previous_date.replace('-', '_'))
    print ('getting {} information for {}'.format(env_name[idx], current_date))

Python has a HTTP server built into the# standard library. This is super handy for# previewing websites.
Python 3

$ python3 -m http.server

Python 2

# (This will serve the current directory at#  http://localhost:8000)
$ python -m SimpleHTTPServer 8000

Backward range() 350 –0

range(350, 0, -1)

Execute a python file in terminal

python filename.py

How to document a function

def function():
     """ this function aims to..."""
     """ this function aims to..."""

File concatenation and folder cleaning

os.system("sh /opt/insight-repositories/pyLBC/reporting/concat_and_clean.sh " + current_date_name)

Loop over a date range and get first and last day of the week (or the month with rrule.MONTHLY)

from datetime import datetime, timedelta, date
from dateutil import rrule

start = date(2018, 11, 5) 
end = date(2018, 11, 19)

for dt in rrule.rrule(rrule.WEEKLY, dtstart=start, until=end):
    start_date = dt.date()
    end_date = start_date + timedelta(6)

.format() method for variable: Template Strings

from string import Template
name = 'Bob'
t = Template('Hey, $name!')
>>'Hey, Bob!'


from datetime import datetime, timedelta
yesterday = datetime.strftime(datetime.now() - timedelta(1), '%Y-%m-%d')

How to easily get same week day y-1

52*7 = 364
day n - 364 = same week day y-1

Iterating through a range of dates in Python

from datetime import timedelta, date

def daterange(start_date, end_date):
    for n in range(int ((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2013, 1, 1)
end_date = date(2015, 6, 2)
for single_date in daterange(start_date, end_date):

Find any text between β€˜<’ and β€˜>

import re
text = '<me@email.com>'
re.findall(r'<(.*?)>', text)


Pandas 🐼


groupby nomenclature

df[['metric_1', 'metric_2’]].groupby('df.dimension_1', 'df.dimension_2').df_metrics.mean()

Put index as column


Append data to csv with mode=”a”

# create tmp dataframe
data_tmp = {'retention': retention, 'platform' : env_name[idx]}
df_tmp = pd.DataFrame(index=[current_date], data=data_tmp)
# append date and retention value to the csv
df_tmp.to_csv('daily_retention.csv', mode='a', header=False)

How to handle automation date with manual date capability

# date
if len(sys.argv) 1:
    ref_date = toStrDateIso(sys.argv[1])
    ref_date = date.today()
current_date = toStrDateIso(ref_date + timedelta(days = -2))
previous_date = toStrDateIso(ref_date + timedelta(days = -3))

How to load csv files and interpreting: date + decimal with comma instead of dot

df_data = pd.read_csv("/path/file.csv", sep=";", parse_dates=True, decimal=",")
or df_data = pd.read_csv("/path/file.csv", parse_dates=['col1', 'col2'])

Index manipulation

  • set_index moves columns to left index
  • reset_index moves the index to the right out of index
  • unstack moves line to the top index (β€˜up’) / stack

Most efficient way to select part of a data frame

df.set_index(['a', 'b']).sort_index()

How to add thousand separator to a .plot() graph

ax = plt.gca()
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

Merge on right and left index

df_global = pd.merge(df_1, df_2, left_index=True, right_index=True)

When importing csv, make sure dd/mm/yyyy object format translate in yyy-mm-dd datetime object

df = pd.read_csv()
df.column_date = pd.to_datetime(df.column_date, format='%d/%m/%Y')

Drop a specific column

del df['column_name']

Aggregate on specific column and get a dataframe (double brackets)


Aggregate on specific column and get a serie (single bracket)


Methods to rename columns

column_names = {
     β€˜col_name_1’ : β€˜new_col_1’,
pd.rename(column=column_names, inplace = True)


df.column(β€˜new_col_1’, […])
or for one specific columns (and more if needed)
df = df.rename(columns=({'col_name' : 'new_col_name'}))

but best method

df.columns = ['col_1', 'col_2', ...]

Convert specific column in datetime object (time consuming method)

df.column_date = pd.to_datetime(df.column_date)

In order to improve it, specify format!

df.column_date = pd.to_datetime(df.column_date, format="%Y-%m-%d")

How to use different agg method in the same groupby

df.groupby('col1').agg({'col2' : 'count', 'col3' : 'sum'})

how to return the number of unique element of a serie


From an object column containing number, clean β€˜-β€˜ and convert it to int (with pandas version 0.17)

df.ad_price = df.ad_price.replace({'-' : 0})
df.ad_price = pd.to_numeric(df.ad_price)

From an object column containing number, replace β€˜,’ with β€˜.’

df.ca_ht.str.replace(',' , '.') # use the comma!

Get column percentage occurencies / number

df.col.value_counts(normalize=True)/ df.col.value_counts(normalize=False) or df.col.value_counts()

Filter datetime columns that have X days in difference

df[df.col_datetime1 - df.col_datetime2 <= 'X days']

int64 serie to object

df.int_col = df.int_col.apply(str)

Filter values of a column based on conditions from another set of columns

df.loc[(df["Gender"]=="Female") & (df["Education"]=="Not Graduate") & (df["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]
df.loc[(df.user_weight == 'big') | (df.user_weight == 'medium') & (df.user_type == 'browsers')]
# must use | & not or and

Skip first raw when importing excel

df_3 = pd.read_excel('contact_3.xlsx', skiprows=1)

YOY evolution over time

df.resample('M')[['metric ']].sum().pct_change(12)

Datetime format apply the format and allow us to manipulate it as we want after!

df_data.Date = pd.to_datetime(df_data.Date, format="%d/%m/%Y")    

Handy matplotlib funtions to get thousands or percentage

def plt_thousand():
    # avoid scientific thousand notation + add comma between thousands for better readability
    ax = plt.gca()
    ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

def plt_percentage(df):
    # transform ylabel decimal in percentage
    ax = df
    ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))

# or (↓ might be a better option) 

def plt_percentage():
    # transform ylabel decimal in percentage
    ax = plt.gca()
    ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda y, _: '{:.0%}'.format(y)))

No index column when creating a csv

df.to_csv('my_csv.csv', sep=';', index=False)

How to read files in a folder

import os

Delete several columns (will delete columns 1, 7, 9, 10, 11)

df = df.drop(df.columns[[1, 7, 9, 10, 11]], axis=1)

Split column text with an argument and select the second part of the split

df['str_obj_column'] = df['str_obj_column'].str.split('argument', expand=True)[1]

Renaming multiple columns

df.columns = ['colunm_1_name', ..., 'column_n_name']

Index to datetime

df.index = pd.to_datetime(df.index)

Insert image in text cell


Remove the first blank row when assigning index (this is caused by the name assigned to the index)


For loops to concat df

list_nb = ['0', '1', '2']
for i in list_nb:
    if i == '0':
        data_agg = pd.DataFrame(data=data)
        data_agg = data_agg.append(data, ignore_index=True)
list_nb = ['0', '1', '2']
for idx, i in enumerate(list_nb):
    if idx == 0:
        data_agg = pd.DataFrame(data=data)
        data_agg = data_agg.append(data)

How to create a new empty dataframe

df = pd.DataFrame()

How to calculate % of grouped df

grouped_df = grouped_df.groupby('desired_column').apply(lambda x: x/x.sum() *100)
# (or float(x.sum()) )

How to calculate D3/D1 % from a multiIndex DF?

df['d3vsd1'] =  df[df.columns[1]] / df[df.columns[0]] *100
# use df.columns[level] !

How to sort df by a specific column

df = df.sort_values('column_to_sort')

When returning a dataframe from a function, we need to assign it to a variable in order to use it

def function():
    do things
     return df
df_var = function()

Reload packages without restarting the kernel

%reload_ext autoreload
%autoreload 2

Flatten hierarchical index in columns

df.columns = df.columns.get_level_values(0)
# or
pd.DataFrame(df.to_records()) # multiindex become columns

Combine multiple index into one index

df.columns = [' '.join(col).strip() for col in df.columns.values]
# even better:
pd.DataFrame(df.to_records()) # multiindex become columns and new index is integers only

How to reference a link to another cell

[link to the cell](#name-of-the-markdown-cell)



To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:


Combine multiple conditions with &:

df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]

To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:



UnicodeDecodeError when reading CSV file β€˜ISO-8859-1’ is the solution

df = pd.read_csv('file.csv', sep=';', encoding = 'ISO-8859-1')

How to put last column first

cols = list(df.columns)
cols = [cols[-1]] + cols[:-1]
df = df[cols]

How to replace negative value with 0 (clip / clip_lower)

df[['neg_col_1', 'neg_col_2']] = df[['neg_col_1', 'neg_col_2']].clip_lower(0)

How to keep only date from datetime full format (dt.date)

df.d_time_date = df.d_time_date.dt.date

Set graph size when starting a new notebook

plt.rcParams['figure.figsize'] = (17, 5)

How to create a Y-1 column? With .shift(364) !

df['visits_y1'] = df.visits.shift(364)

.plot() line width


How to sort a multi-index

df.sort_values([('level_0', 'level_1')], ascending=False)

Group by on index level


Retention - Color scoping vmin/vmax + fix graph

import seaborn as sns
plt.figure(figsize=(15, 8))
plt.title('Cohorts: weekly Buyers Retention')
ax = sns.heatmap(weekly_buyer_retention, mask=weekly_buyer_retention.isnull(), annot=True, fmt='.0%', vmin=0, vmax=0.15);

# fix the wrong diplay
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

Function on several columns with .loc

Pandas Plot - Rotate X Axis

df.plot(figsize=(15,7), kind='bar', rot=0)

Graph function project #wip

def graph(dataframe, **kwargs):
    Function to improve matplotlib graphs readability
    β†’ import matplotlib as mpl
    β†’ import matplotlib.pyplot as plt
    β†’ from matplotlib.ticker import FuncFormatter
    β†’ import matplotlib.ticker as mtick
    β†’ from cycler import cycler
    β†’ functions plt_thousand() + plt_percentage()
    Optional informations:
    - kind: ['line', 'bar', 'barh', 'area'...]
    - rot: handle xaxis text rotation
    - figsize: per default to (15,7)
    - title: optional title for the graph
    - stacked: True to be stacked
    - thousand: rely on the custom function plt_thousand()
    - percentage: rely on the custom function plt_percentage()
    - legend: yes|no, location, custom values, understand why several column referenced
    - title position (put it a bit higher)
    # kwargs mapping
    kind = kwargs.get('kind', 'line')
    rot = kwargs.get('rot', None)
    figsize = kwargs.get('figsize', (15,7))
    title = kwargs.get('title', None)
    stacked = kwargs.get('stacked', False)
    thousand = kwargs.get('thousand', False)
    percentage = kwargs.get('percentage', False)
    legend = kwargs.get('legend', True)
    # plotting
    dataframe.plot(kind=kind, rot=rot, figsize=figsize, title=title, stacked=stacked, legend=legend)

    # legend
    plt.rcParams['legend.frameon'] = True
    plt.rcParams['legend.loc'] = 'upper right'
    # add an horizonal label for the y axis 
    #plt.text(-0.23, 0.96, 'Transaction Type', fontsize=15, fontweight='black', color = '#333F4B')

    # set the spines position
    #ax.spines['bottom'].set_position(('axes', -0.04))
    #ax.spines['left'].set_position(('axes', 0.015))

    # style
    # font
    plt.rcParams['font.family'] = 'sans-serif'
    plt.rcParams['font.serif'] = 'Arial'
    plt.rcParams['font.size'] = 12

    # axis style
    plt.rcParams['axes.edgecolor'] = '#E6EBEF'
    plt.rcParams['axes.linewidth'] = 0.8
    plt.rcParams['axes.facecolor'] = 'white'
    plt.rcParams['axes.titleweight'] = 'heavy'
    # change the style of the axis spines
    plt.rcParams['axes.spines.top'] = False
    plt.rcParams['axes.spines.right'] = False
    plt.rcParams['axes.labelcolor'] = '#3C3C3C'
    plt.rcParams['xtick.top'] = False
    plt.rcParams['ytick.right'] = False
    # line style
    plt.rcParams['lines.linewidth'] = 1.5
    # color
    plt.rcParams['axes.prop_cycle'] = cycler('color', ['#FF6E14', '#3C78C8', '#FFBE00', '#55B950', '#DC002D', '#7346AA', '#BEBEBE', '#A8B4C0'])
    # label
    #ax.set_xlabel('Percentage', fontsize=15, fontweight='black', color = '#333F4B')
    # Handle Thousand notation on y-axis
    if thousand == True:
    # Handle Percentage notation on y-axis
    if percentage == True:

How to load a tsv file

df = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv', sep='\t')

How to make a dataframe with a single date column

df_period_index = pd.period_range(start='1/1/2018', end='31/12/2018', freq='D')
df_date = pd.DataFrame()
df_date['date'] = df_period_index

↑ the date column is going to be a period object.
From period object to datetime:

df_date.date = df_date['date'].apply(lambda d: pd.to_datetime(str(d)))



  • CoolStuffILearn - unix / bash
  • CoolStuffILearn - SQL