
Federico Trotta

Posted on • Updated on • Originally published at federicotrotta.com

How Python’s Argparse Can Be Useful in Data Science

When I first approached Python’s Argparse, I had great difficulty understanding how it works because I had never programmed before.

I also asked myself: “How can a command-line interface be useful in Data Science?” Well, I’ll show you with a practical example.

But first, let’s explain what Argparse is.

What is Python’s Argparse?

Python’s Argparse is a standard-library module that lets you pass arguments to a script via the command-line interface. It is not the only option (you can also read the raw arguments from the sys.argv list), but it offers much more: type conversion, automatic help messages, and error handling.
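To make the comparison concrete, here is a minimal sketch of the same task done both ways. The function names are hypothetical, and the argument name `multiple` anticipates the example used later in this article; the command lines are simulated with plain lists so the sketch runs without a terminal.

```python
import argparse


def factor_from_argv(argv):
    """Manual parsing: argv mimics sys.argv, so argv[0] is the script name."""
    if len(argv) != 2:
        raise ValueError("usage: exercise.py FACTOR")
    return float(argv[1])  # raises ValueError on non-numeric input


def factor_from_argparse(arguments):
    """Same job with argparse: conversion, help and error messages are built in."""
    parser = argparse.ArgumentParser(prog="exercise.py")
    parser.add_argument("multiple", type=float, help="multiplication factor")
    return parser.parse_args(arguments).multiple


# Simulated command lines; a real script would use sys.argv and parse_args().
manual = factor_from_argv(["exercise.py", "0.85"])
parsed = factor_from_argparse(["0.85"])
print(manual, parsed)
```

With sys.argv you have to write the length check, the conversion, and the usage message yourself; with argparse they follow from the argument declaration.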

As we can see in its documentation:

The argparse module makes it easy to write user-friendly command-line interfaces. […]. The argparse module also automatically generates help and usage messages and issues errors when users give the program invalid arguments.
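The error handling mentioned in the documentation can be observed directly. The sketch below is only an illustration: `try_parse` is a hypothetical helper, and the arguments are passed as lists instead of being typed in a terminal. When the input is invalid, argparse prints a usage message to stderr and exits with status 2, which Python surfaces as a SystemExit exception.

```python
import argparse

parser = argparse.ArgumentParser(prog="exercise.py")
parser.add_argument("multiple", type=float,
                    help="multiplication factor (0.85 is typical)")


def try_parse(arguments):
    """Return the parsed factor, or None if argparse rejects the input."""
    try:
        return parser.parse_args(arguments).multiple
    except SystemExit:
        # argparse has already printed its usage/error message to stderr
        return None


print(try_parse(["0.85"]))  # valid float
print(try_parse(["abc"]))   # not a float: rejected
print(try_parse([]))        # missing required argument: rejected
```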

How Python’s Argparse can be useful in your Data Science projects: a practical example

Let’s say you have an empirical method to calculate a parameter, and this method requires an input value that you have to tune iteratively until you reach a result that is considered good. If you work in Jupyter Notebooks, each time you’ll need to find the exact line of code where the value is set and modify it.

For the purpose of this article, I’ve created a dataset of simulated data that reflects the distributions typically found in real cases. Let’s say that our data are measured times in minutes; let’s import the data and see the data frame:

import pandas as pd

# Import data and show head
df = pd.read_excel('example.xlsx')
df.head(10)

Here’s the data frame:

A data frame by Federico Trotta

The purpose of the exercise is to find the measured time that best fits the distribution.

Let’s say that those measurements are times related to athletes running a fixed distance; let’s say 1 km.

We want to evaluate the athletes based on the time they need to run 1 km. But how can we fix a reasonable target time? Is one minute a good time? Can the majority of the athletes run 1 km in one minute? When should an athlete be considered too slow, and when too fast?

Answering these questions is the purpose of this study.

As often happens in these cases, the mean value is far from being a good value because the data are often not normally distributed. So we need a different metric, but this metric can still be based on the mean time.

To find the metric, we have to empirically find a factor that, multiplied by the mean time, gives a value close to the most frequent values.

Let’s show a plot for a better understanding:

Frequencies to describe Python's Argparse by Federico Trotta

As you can see, the mean time (4.4 min) is not a good value for evaluating the athletes because the majority of them run 1 km in 3 or 4 minutes. In similar cases, I found that a good value is “0.85*mean time”; but this “0.85” factor is empirical, and sometimes it needs to be higher, sometimes lower (depending on how skewed the data distribution is). So the goal of using Argparse is to modify just the multiplication factor until we get a good final result (a time against which to evaluate your athletes on running 1 km).
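The arithmetic behind this reasoning can be checked on a tiny sample. The numbers below are hypothetical (they are not the article’s dataset); they are chosen only to be right-skewed and to have the same mean of 4.4 minutes.

```python
import statistics

# Hypothetical, right-skewed sample of 1 km times in minutes
times = [3, 3, 3, 3, 4, 4, 4, 5, 6, 9]

mean_time = statistics.mean(times)      # pulled upwards by the slow runs
most_frequent = statistics.mode(times)  # the most common measurement
adjusted = 0.85 * mean_time             # empirical factor from the text

print(f"mean: {mean_time:.1f}, most frequent: {most_frequent}, adjusted: {adjusted:.1f}")
# prints "mean: 4.4, most frequent: 3, adjusted: 3.7"
```

The adjusted value falls between 3 and 4 minutes, where most of the sample lies, while the mean sits above it.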

So, let’s see a bit of code:

import argparse

# Create parser
parser = argparse.ArgumentParser()

# Specify the arguments that have to be passed
parser.add_argument('multiple', type=float, help='multiplication factor (0.85 is typical)')

# Parse and validate the arguments
args = parser.parse_args()

# Define the multiplication factor
fac = args.multiple

With the above code, I’ve created the parser, specified the arguments to parse (in this case, just one: the multiplication factor), and, after parsing and validating the arguments, stored the value in a variable (fac = args.multiple). With that done, we can calculate the mean time and the adjusted time (the mean time multiplied by the factor):

# Calculate mean values
mean = df['measures [min]'].mean(axis=0) #mean

# Define adjusted value
adj = mean*fac
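The parsing and adjustment steps above could also be arranged as follows. This is a sketch of one possible variant, not the code used in this article: the factor becomes an optional `--multiple` flag with a default of 0.85, a hypothetical `positive_float` converter rejects non-positive factors, and a short list of made-up measurements stands in for df['measures [min]'].

```python
import argparse
import statistics


def positive_float(text):
    """Convert text to float and reject non-positive factors."""
    value = float(text)  # a ValueError here is reported by argparse as invalid
    if value <= 0:
        raise argparse.ArgumentTypeError(f"factor must be positive, got {value}")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Adjust the mean time by a factor")
    # Hypothetical variant: an optional flag with a default instead of a positional
    parser.add_argument("--multiple", type=positive_float, default=0.85,
                        help="multiplication factor (default: 0.85)")
    return parser


def adjusted_mean(values, factor):
    """Mean of the measurements multiplied by the empirical factor."""
    return statistics.mean(values) * factor


# Made-up measurements in minutes, standing in for df['measures [min]']
sample = [3.0, 3.5, 4.0, 4.5, 7.0]

default_args = build_parser().parse_args([])                     # no flag given
custom_args = build_parser().parse_args(["--multiple", "0.9"])  # explicit factor
print(adjusted_mean(sample, default_args.multiple))
print(adjusted_mean(sample, custom_args.multiple))
```

Passing an explicit list to parse_args() makes the logic easy to try in a notebook or a test; calling parse_args() with no arguments reads the real command line instead.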

We can now plot the graph:

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# Define figure size in inches and font scale
plt.rcParams['figure.figsize'] = 15, 10
sns.set(font_scale=1)

# Plot the frequencies
sns.histplot(df, x="measures [min]", binwidth=1, color='red')

# Add the mean time and the adjusted time
plt.axvline(x=adj, color="blue") # Vertical line at the "adjusted" value
plt.axvline(x=mean, color="green") # Vertical line at the "mean" value

# Create labels
plt.title("FREQUENCIES OF THE MEASURED VALUES", fontsize=18)
plt.xlabel("VALUES", fontsize=12)
plt.ylabel("FREQUENCIES", fontsize=12)

# Define mean and adjusted legend
blue_line = mpatches.Patch(color="blue", label=f"adjusted value: {adj:.1f}")
green_line = mpatches.Patch(color="green", label=f"mean value: {mean:.1f}")
plt.legend(handles=[blue_line, green_line], prop={"size":15})

# Show the figure (needed when the code runs as a script instead of in a notebook)
plt.show()

The result is:

Frequencies to describe Python's Argparse by Federico Trotta

As you can see, the adjusted time (3.7 min) is a better value than the mean time (4.4 min) for evaluating the athletes, since it lies closer to the tallest bars, that is, to the most frequently measured times. And how can we use Argparse to arrive here?

First of all, save your Jupyter Notebook as a Python script with the .py extension. Let’s call it exercise.py and save it in a directory. Open a terminal in that directory and type:

python3 exercise.py -h

This shows the help:

Python's Argparse help by Federico Trotta
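The same help text can also be inspected from Python, which is handy while you are still working in a notebook. The sketch below rebuilds the parser from this article (with `prog` set explicitly, which is an assumption made here so the usage line shows the script name) and calls format_help(), which returns the text as a string without exiting.

```python
import argparse

parser = argparse.ArgumentParser(prog="exercise.py")
parser.add_argument("multiple", type=float,
                    help="multiplication factor (0.85 is typical)")

# format_help() returns the text that the help option prints, without exiting
help_text = parser.format_help()
print(help_text)
```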

So, if you want to experiment with the multiplication factor, starting from “0.85”, you just need to type this in the terminal:

python3 exercise.py 0.85

With the above command, Python displays the plot. If “0.85” is not a good fit, you can rerun the script with a different value very quickly, without searching your whole Notebook for the exact line of code to modify!
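If you find yourself rerunning the script many times, one possible extension is to accept several candidate factors in a single run. The sketch below is a hypothetical variant, not part of the original script: `multiples` and `nargs="+"` collect one or more values into a list, and the measurements are made-up numbers standing in for the article’s dataset.

```python
import argparse
import statistics

parser = argparse.ArgumentParser(prog="exercise.py")
# Hypothetical variant: nargs="+" collects one or more factors into a list
parser.add_argument("multiples", type=float, nargs="+",
                    help="one or more multiplication factors to compare")

# Simulates: python3 exercise.py 0.80 0.85 0.90
args = parser.parse_args(["0.80", "0.85", "0.90"])

# Made-up measurements in minutes, standing in for the article's dataset
times = [3, 3, 3, 3, 4, 4, 4, 5, 6, 9]
mean_time = statistics.mean(times)

# Map each candidate factor to its adjusted time, rounded for display
adjusted = {factor: round(mean_time * factor, 2) for factor in args.multiples}
for factor, value in adjusted.items():
    print(f"factor {factor:.2f} -> adjusted time {value:.2f} min")
```

Each line of output can then be compared against the histogram to pick the factor that lands closest to the most frequent times.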

Conclusions

This article shows how Python’s Argparse can be useful even in Data Science projects.

Sometimes, in fact, when analyzing data you may need to adjust some parameters: in that case, Python’s Argparse can be a very convenient choice.


The article "How Python’s Argparse Can Be Useful in Data Science" was originally created for my blog.
