I’m a product data analyst (currently working for leboncoin.fr).
This is my peripherical-brain for python / pandas, it’s not exhaustive, but I try to update it with cool stuff I learned along the way.

Started: ~2015


Python 🐍

Ressources


___


Create on the fly list for loop:

for info in df.column.unique().to_list():
     do stuff


Enumerate() for loop:

env_list = [responsive, ios, android]
env_name = ["responsive", "ios", "android"]
for idx, env in enumerate(env_list):
    current_date_name = 'df_{}_{}'.format(env_name[idx], current_date.replace('-', '_'))
    previous_date_name = 'df_{}_{}'.format(env_name[idx], previous_date.replace('-', '_'))
    print ('getting {} information for {}'.format(env_name[idx], current_date))


Python has a HTTP server built into the# standard library. This is super handy for# previewing websites.
Python 3

$ python3 -m http.server


Python 2

# (This will serve the current directory at#  http://localhost:8000)
$ python -m SimpleHTTPServer 8000


Backward range() 350 –0

range(350, 0, -1)


Execute a python file in terminal

python filename.py


How to document a function

def function():
     """ this function aims to..."""
function?
[output]
Docstring:
     """ this function aims to..."""


File concatenation and folder cleaning

os.system("sh /opt/insight-repositories/pyLBC/reporting/concat_and_clean.sh " + current_date_name)


Loop over a date range and get first and last day of the week (or the month with rrule.MONTHLY)

from datetime import datetime, timedelta, date
from dateutil import rrule

start = date(2018, 11, 5) 
end = date(2018, 11, 19)

for dt in rrule.rrule(rrule.WEEKLY, dtstart=start, until=end):
    start_date = dt.date()
    end_date = start_date + timedelta(6)
    
    print(start_date)
    print(end_date)


.format() method for variable: Template Strings

from string import Template
name = 'Bob'
t = Template('Hey, $name!')
t.substitute(name=name)
>>'Hey, Bob!'


yesterday

from datetime import datetime, timedelta
yesterday = datetime.strftime(datetime.now() - timedelta(1), '%Y-%m-%d')


today

from datetime import datetime
today = datetime.strftime(datetime.now(), '%Y-%m-%d')


How to easily get same week day y-1

52*7 = 364
day n - 364 = same week day y-1


How to create a Y-1 column? With .shift(364) !

df['visits_y1'] = df.visits.shift(364)


Iterating through a range of dates in Python

from datetime import timedelta, date

def daterange(start_date, end_date):
    for n in range(int ((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2013, 1, 1)
end_date = date(2015, 6, 2)
for single_date in daterange(start_date, end_date):
    print(single_date.strftime("%Y-%m-%d"))


Find any text between ‘<’ and ‘>

import re
text = '<me@email.com>'
re.findall(r'<(.*?)>', text)





Pandas 🐼

Ressources


groupby nomenclature

df[['metric_1', 'metric_2’]].groupby('df.dimension_1', 'df.dimension_2').df_metrics.mean()


Put index as column

df.reset_index(inplace=True)


Append data to csv with mode=”a”

# create tmp dataframe
data_tmp = {'retention': retention, 'platform' : env_name[idx]}
df_tmp = pd.DataFrame(index=[current_date], data=data_tmp)
# append date and retention value to the csv
df_tmp.to_csv('daily_retention.csv', mode='a', header=False)


How to handle automation date with manual date capability

# date
if len(sys.argv) 1:
    ref_date = toStrDateIso(sys.argv[1])
else:
    ref_date = date.today()
current_date = toStrDateIso(ref_date + timedelta(days = -2))
previous_date = toStrDateIso(ref_date + timedelta(days = -3))


How to load csv files and interpreting: date + decimal with comma instead of dot

df_data = pd.read_csv("/path/file.csv", sep=";", parse_dates=True, decimal=",")
or df_data = pd.read_csv("/path/file.csv", parse_dates=['col1', 'col2'])


Index manipulation

  • set_index moves columns to left index
  • reset_index moves the index to the right out of index
  • unstack moves line to the top index (‘up’) / stack


Most efficient way to select part of a data frame

df.set_index(['a', 'b']).sort_index()
df.loc[('v','u')]


How to add thousand separator to a .plot() graph

ax = plt.gca()
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))


Merge on right and left index

df_global = pd.merge(df_1, df_2, left_index=True, right_index=True)


When importing csv, make sure dd/mm/yyyy object format translate in yyy-mm-dd datetime object

df = pd.read_csv()
df.column_date = pd.to_datetime(df.column_date, format='%d/%m/%Y')


Drop a specific column

del df['column_name']


Aggregate on specific column and get a dataframe (double brackets)

df.groupby('column_name')[['column_name']].count()


Aggregate on specific column and get a serie (single bracket)

df.groupby('column_name')['column_name'].count()


Methods to rename columns

column_names = {
     col_name_1 : new_col_1,
     [...]
}
pd.rename(column=column_names, inplace = True)

or

df.column(new_col_1, [])
or for one specific columns (and more if needed)
df = df.rename(columns=({'col_name' : 'new_col_name'}))

but best method

df.columns = ['col_1', 'col_2', ...]


Convert specific column in datetime object (time consuming method)

df.column_date = pd.to_datetime(df.column_date)

In order to improve it, specify format!

df.column_date = pd.to_datetime(df.column_date, format="%Y-%m-%d")


How to use different agg method in the same groupby

df.groupby('col1').agg({'col2' : 'count', 'col3' : 'sum'})


how to return the number of unique element of a serie

df.col.nunique()


From an object column containing number, clean ‘-‘ and convert it to int (with pandas version 0.17)

df.ad_price = df.ad_price.replace({'-' : 0})
df.ad_price = pd.to_numeric(df.ad_price)


From an object column containing number, replace ‘,’ with ‘.’

df.ca_ht.str.replace(',' , '.') # use the comma!


Get column percentage occurencies / number

df.col.value_counts(normalize=True)/ df.col.value_counts(normalize=False) or df.col.value_counts()


Filter datetime columns that have X days in difference

df[df.col_datetime1 - df.col_datetime2 <= 'X days']


int64 serie to object

df.int_col = df.int_col.apply(str)


Filter values of a column based on conditions from another set of columns

df.loc[(df["Gender"]=="Female") & (df["Education"]=="Not Graduate") & (df["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]
df.loc[(df.user_weight == 'big') | (df.user_weight == 'medium') & (df.user_type == 'browsers')]
# must use | & not or and


Skip first raw when importing excel

df_3 = pd.read_excel('contact_3.xlsx', skiprows=1)


YOY evolution over time

df.resample('M')[['metric ']].sum().pct_change(12)


Datetime format apply the format and allow us to manipulate it as we want after!

df_data.Date = pd.to_datetime(df_data.Date, format="%d/%m/%Y")    


Handy matplotlib funtions to get thousands or percentage

def plt_thousand():
    # avoid scientific thousand notation + add comma between thousands for better readability
    ax = plt.gca()
    ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

def plt_percentage(df):
    # transform ylabel decimal in percentage
    ax = df
    ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))

# or (↓ might be a better option) 

def plt_percentage():
    # transform ylabel decimal in percentage
    ax = plt.gca()
    ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda y, _: '{:.0%}'.format(y)))


No index column when creating a csv

df.to_csv('my_csv.csv', sep=';', index=False)


How to read files in a folder

import os
os.listdir(os.curdir)


Delete several columns (will delete columns 1, 7, 9, 10, 11)

df = df.drop(df.columns[[1, 7, 9, 10, 11]], axis=1)


Split column text with an argument and select the second part of the split

df['str_obj_column'] = df['str_obj_column'].str.split('argument', expand=True)[1]


Renaming multiple columns

df.columns = ['colunm_1_name', ..., 'column_n_name']


Index to datetime

df.index = pd.to_datetime(df.index)


Insert image in text cell

![title](img/image.png)


Remove the first blank row when assigning index (this is caused by the name assigned to the index)

df.index.name=None


For loops to concat df

list_nb = ['0', '1', '2']
for i in list_nb:
    if i == '0':
        data_agg = pd.DataFrame(data=data)
    else:
        data_agg = data_agg.append(data, ignore_index=True)
or
list_nb = ['0', '1', '2']
for idx, i in enumerate(list_nb):
    if idx == 0:
        data_agg = pd.DataFrame(data=data)
    else:
        data_agg = data_agg.append(data)


How to create a new empty dataframe

df = pd.DataFrame()


How to calculate % of grouped df

grouped_df = grouped_df.groupby('desired_column').apply(lambda x: x/x.sum() *100)
# (or float(x.sum()) )


How to calculate D3/D1 % from a multiIndex DF?

df['d3vsd1'] =  df[df.columns[1]] / df[df.columns[0]] *100
df['d3vsd1'].plot(figsize=(15,8))
# use df.columns[level] !


How to sort df by a specific column

df = df.sort_values('column_to_sort')


When returning a dataframe from a function, we need to assign it to a variable in order to use it

def function():
    do things
     return df
df_var = function()


Reload packages without restarting the kernel

%reload_ext autoreload
%autoreload 2


Flatten hierarchical index in columns

df.columns = df.columns.get_level_values(0)
# or
pd.DataFrame(df.to_records()) # multiindex become columns


Combine multiple index into one index

df.columns = [' '.join(col).strip() for col in df.columns.values]
# even better:
pd.DataFrame(df.to_records()) # multiindex become columns and new index is integers only


How to reference a link to another cell

[link to the cell](#name-of-the-markdown-cell)


___

Selection

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]


To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]


Combine multiple conditions with &:

df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]


To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]


isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]


___

UnicodeDecodeError when reading CSV file ‘ISO-8859-1’ is the solution

df = pd.read_csv('file.csv', sep=';', encoding = 'ISO-8859-1')


How to put last column first

cols = list(df.columns)
cols = [cols[-1]] + cols[:-1]
df = df[cols]


How to replace negative value with 0 (clip / clip_lower)

df[['neg_col_1', 'neg_col_2']] = df[['neg_col_1', 'neg_col_2']].clip_lower(0)


How to keep only date from datetime full format (dt.date)

df.d_time_date = df.d_time_date.dt.date


Set graph size when starting a new notebook

plt.rcParams['figure.figsize'] = (17, 5)


.plot() line width

df.plot(lw=1)


How to sort a multi-index

df.sort_values([('level_0', 'level_1')], ascending=False)


Group by on index level

df.groupby(level=0).sum()


Retention - Color scoping vmin/vmax + fix graph

import seaborn as sns
#sns.set(style='white')
plt.figure(figsize=(15, 8))
plt.title('Cohorts: weekly Buyers Retention')
ax = sns.heatmap(weekly_buyer_retention, mask=weekly_buyer_retention.isnull(), annot=True, fmt='.0%', vmin=0, vmax=0.15);

# fix the wrong diplay
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.yticks(rotation='horizontal')


Function on several columns with .loc


Pandas Plot - Rotate X Axis

df.plot(figsize=(15,7), kind='bar', rot=0)


Graph function project #wip

def graph(dataframe, **kwargs):
    '''
    Function to improve matplotlib graphs readability
    
    Prerequisites:
    → import matplotlib as mpl
    → import matplotlib.pyplot as plt
    → from matplotlib.ticker import FuncFormatter
    → import matplotlib.ticker as mtick
    → from cycler import cycler
    → functions plt_thousand() + plt_percentage()
    
    Optional informations:
    - kind: ['line', 'bar', 'barh', 'area', 'pie'...]
    - rot: handle xaxis text rotation
    - figsize: per default to (15,7)
    - title: optional title for the graph
    - stacked: True to be stacked
    - thousand: rely on the custom function plt_thousand()
    - percentage: rely on the custom function plt_percentage()
    
    todo:
    - legend: yes|no, location, custom values, understand why several column referenced
    - title position (put it a bit higher)
    
    '''
    
    # kwargs mapping
    kind = kwargs.get('kind', None) # type du graphique
    rot = kwargs.get('rot', 0) # rotation des x labels (0 = horizontal)
    figsize = kwargs.get('figsize', (15,7)) # taille du graphique
    title = kwargs.get('title', None) 
    stacked = kwargs.get('stacked', False)
    thousand = kwargs.get('thousand', False) # ajout d'une virgule pour séparer les milliers
    percentage = kwargs.get('percentage', False) # multiplié par 100 et ajout du % 
    legend = kwargs.get('legend', True)
    subplots = kwargs.get('subplots', False) # Doit être à True si kind='pie'
    xlabel = kwargs.get('xlabel', None) # Pas de titre pour l'axe des abscisses par défaut
    ylabel = kwargs.get('ylabel', None) # Pas de titre pour l'axe des abscisses par défaut
    ylim = kwargs.get('ylim', None) # Démarrage de l'axe des ordonnés
    xlim = kwargs.get('xlim', None) # Démarrage de l'axe des abscisses
    legend_loc =  kwargs.get('legend.loc', 'best')
    legend = kwargs.get('legend', None)
    color =  kwargs.get('color', 'set_1') # set_1 per default
    text =  kwargs.get('text', None)
    
    # Nuancier leboncoin : https://adevinta.frontify.com/document/2#/basics/couleurs/le-nuancier
    # remap colors to 'set_1' per default
    plt.rcParams['axes.prop_cycle'] = cycler('color', ['#FF6E14', '#3C78C8', '#FFBE00', '#55B950', '#DC002D', '#7346AA', '#BEBEBE', '#A8B4C0'])
    if color == 'set_1':
        set_1  = cycler('color', ['#FF6E14', '#3C78C8', '#FFBE00', '#55B950', '#DC002D', '#7346AA', '#BEBEBE', '#A8B4C0'])
        color =  kwargs.get('color', set_1) 
    if color == 'set_2': 
        set_2 = cycler('color', ['#FFD74B', '#87CD82', '#F05069', '#234678', '#870019', '#412864', '#3C3C3C', '#647382'])
        plt.rcParams['axes.prop_cycle'] = set_2
    if color == 'set_3':
        set_3  = cycler('color', ['#9B7DC3', '#D7E1F5', '#FFF0CD', '#DCF0DC', '#FACDD2', '#E6D7F0', '#E6EBEF', '#CAD1D9'])
        plt.rcParams['axes.prop_cycle'] = set_3
    
    # plotting main function
    (dataframe.plot(kind=kind
                  , rot=rot 
                  , figsize=figsize
                  , title=title
                  , stacked=stacked
                  , legend=legend
                  , subplots=subplots
                  , ylim=ylim
                  , xlim=xlim)
    )
    
    # legend
    plt.rcParams['legend.frameon'] = True
    
    if legend_loc != 'best':
        plt.rcParams['legend.loc'] = legend_loc
    
    # add an horizonal label for the y axis 
    # plt.text(-0.23, 0.96, 'Transaction Type', fontsize=15, fontweight='black', color = '#333F4B')

    # set the spines position
    #ax.spines['bottom'].set_position(('axes', -0.04))
    #ax.spines['left'].set_position(('axes', 0.015))

    # style
    
    # font
    plt.rcParams['font.family'] = 'sans-serif'
    plt.rcParams['font.serif'] = 'Arial'
    plt.rcParams['font.size'] = 12

    # axis style
    plt.rcParams['axes.edgecolor'] = '#E6EBEF'
    plt.rcParams['axes.linewidth'] = 0.8
    plt.rcParams['axes.facecolor'] = 'white'
    plt.rcParams['axes.titleweight'] = 'heavy'
    
    # change the style of the axis spines
    plt.rcParams['axes.spines.top'] = False
    plt.rcParams['axes.spines.right'] = False
    plt.rcParams['axes.labelcolor'] = '#3C3C3C'
    
    plt.rcParams['xtick.top'] = False
    plt.rcParams['ytick.right'] = False
    #ax.spines['right'].set_color('none')
    #ax.spines['left'].set_smart_bounds(True)
    #ax.spines['bottom'].set_smart_bounds(True)
    
    # line style
    plt.rcParams['lines.linewidth'] = 1.5
    
    # label
    #ax.set_xlabel(xlabel, fontsize=15, fontweight='black', color = '#333F4B')
    plt.xlabel(xlabel, fontsize=15, fontweight='black', color='#333F4B')
    plt.ylabel(ylabel, fontsize=15, fontweight='black', color='#333F4B')
    plt.ylim(ylim)
    plt.xlim(xlim)
    
    # handle Thousand notation on y-axis
    if thousand == True:
        plt_thousand()
        
    # handle Percentage notation on y-axis
    if percentage == True:
        plt_percentage()

    # custom legend    
    if legend != None:
        plt.legend(legend)
    


How to load a tsv file

df = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv', sep='\t')


How to make a dataframe with a single date column

df_period_index = pd.period_range(start='1/1/2018', end='31/12/2018', freq='D')
df_date = pd.DataFrame()
df_date['date'] = df_period_index

↑ the date column is going to be a period object.
From period object to datetime:

df_date.date = df_date['date'].apply(lambda d: pd.to_datetime(str(d)))


How to drop empty rows (when noting in the cell)
stackoverflow

df = df[df.column_name.astype(bool)]
# df.column_name.astype(bool) = True
# and empty row return False ;) !


Create new column with “Year-Quarter” from datetime object
stackoverflow

df_op['order_quarter'] = df_op.event_datetime.apply(lambda x: x.strftime('%Y' + '-' + str((x.month-1)//3))) 


From entire date to year-month only

df['year_month'] = df['complete_date'].dt.to_period('M')


From unix timestamp to datetime

pd.to_datetime(df['timestamp'], unit='s')


Outer join to merge and to see differences between 2 dataframes thanks to a ‘merge’ column (_indicator=True)

df_merge = df_old.merge(df_new, how='outer', on=['user_id', 'hour_event'], indicator=True)

df_merge.groupby('_merge')[['user_id']].nunique()
_merge user_id
left_only 4017
right_only 2315
both 41400



A Cool SQL Problem: Avoiding For-Loops
I decided to solve the problem presented in this post. The SQL is not that hard, but it took me more time to find a ‘set-based’ solution in Pandas without the use of a for loop.
Here is what I ended up with:

# Dataframe creation
d = {'trading_date': ['2015-06-01', '2015-06-02', '2015-06-03', '2015-06-04', '2015-06-05', '2015-06-08', '2015-06-09', '2015-06-10'], 'price': [41, 43, 47, 42, 45, 39, 38, 41] }
df = pd.DataFrame(data=d)

# new column that gives the max price from previous rows 
df['rolling_max'] = df.price.expanding().max()
# new column that gives the min price from previous rows
df['rolling_min'] = df.price.expanding().min()

# Max profit for each row
df['max_profit'] = df.price - df.rolling_min
# Min profit for each row
df['min_profit'] = df.rolling_min - df.rolling_max


# Overall Max Profit
print(max(df.price - df.rolling_min))

# Overall Min Profit
print(min(df.rolling_min - df.rolling_max))

→ The difficulty for me lied in the translation of the sql:

max(price) over (order by trading_date between unbounded preceding and current row)

→ This stackoverflow post and in particular this sentence, helped me a lot:

# If the goal is to get the function applied from the beginning of  the DataFrame down to the current line, then the object to use is Expanding:
df.expanding().std()



___



#WIP

  • CoolStuffILearn - unix / bash
  • CoolStuffILearn - SQL