Last week we covered the fundamentals of the matplotlib package and what we can do with it. Now we'll see how to combine it with pandas and numpy to use it effectively.
Table of contents:
- Pandas dataframes
- Plotting a dataframe
- The object-oriented method
- Customizing and saving the plot
- Final thoughts
Pandas dataframes
Pandas dataframes are the most common data structure that we'll use, and we can of course plot them with matplotlib. When we introduced pandas we used a sample of baseball players with just ten rows, because I had deleted the others to keep the learning simple. Now we can use the complete data set, which you can find here.
But we have a problem: the column names are badly formatted. We can fix this in a few steps: first we look at the column names with the keys function, and then we rename them with the rename function:
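As a minimal sketch of these two steps: the raw column names and the values below are invented for illustration (the real file may be formatted differently), while the cleaned names match the ones used in the rest of this post.

```python
import pandas as pd

# Hypothetical stand-in for the full data set; the badly formatted
# column names here are assumptions made for this example.
bb_players = pd.DataFrame({
    ' "Height(inches)"': [74, 74, 72],
    ' "Weight(lbs)"': [180, 215, 210],
    ' "Age"': [22.99, 34.69, 30.78],
})

# Step 1: look at the current column names
print(bb_players.keys())

# Step 2: map every old name to a clean one
bb_players = bb_players.rename(columns={
    ' "Height(inches)"': 'Height (inches)',
    ' "Weight(lbs)"': 'Weight (lbs)',
    ' "Age"': 'Age',
})
print(bb_players.keys())
```

Note that rename returns a new dataframe, which is why the result is assigned back to bb_players.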
Plotting a dataframe
We can quickly plot our dataframe by calling its plot function. This is not the recommended way, but for quick plots it works just fine:
This gives us a quick overview of our data. Another example is the hist function:
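A quick plot of this kind could look like the sketch below. The small dataframe is a made-up stand-in for bb_players, and the Agg backend line is only there so the example also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import pandas as pd

# Made-up stand-in for the baseball data set
bb_players = pd.DataFrame({
    'Height (inches)': [74, 74, 72, 72, 73, 69],
    'Weight (lbs)': [180, 215, 210, 210, 188, 176],
    'Age': [22.99, 34.69, 30.78, 35.43, 35.71, 29.39],
})

# With no arguments, plot draws one line per numeric column,
# using the dataframe index as the x-axis.
quick_ax = bb_players.plot()
```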
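One possible use of hist, again on a made-up stand-in for bb_players, is to look at the distribution of a single column:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import pandas as pd

# Made-up stand-in for the baseball data set
bb_players = pd.DataFrame({
    'Weight (lbs)': [180, 215, 210, 210, 188, 176],
    'Age': [22.99, 34.69, 30.78, 35.43, 35.71, 29.39],
})

# Histogram of one column; every bar counts the players in one age range
age_ax = bb_players['Age'].hist()
```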
The object-oriented method
Now we know how to get a quick look at the data, but in most cases we'll need well-structured, customized plots, and for those the object-oriented (OO) method is more effective. Let's see an example:
Now that we have our plot, let's take a moment to understand what's going on here. We create a fig and an ax and give the figure a size, then we call the plot function on our dataframe, passing:
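A minimal sketch of this approach might look like the following. The dataframe is a made-up stand-in for bb_players, and the figure size is an arbitrary choice for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the baseball data set
bb_players = pd.DataFrame({
    'Height (inches)': [74, 74, 72, 72, 73, 69],
    'Weight (lbs)': [180, 215, 210, 210, 188, 176],
    'Age': [22.99, 34.69, 30.78, 35.43, 35.71, 29.39],
})

# Create the figure and the axes first (size chosen arbitrarily) ...
fig, ax = plt.subplots(figsize=(10, 6))

# ... then let pandas draw on that axes
bb_players.plot(kind='scatter',
                x='Age',
                y='Weight (lbs)',
                c='Height (inches)',
                ax=ax)
```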
- the type of plot that we want (scatter)
- the x-axis (the age of every player)
- the y-axis (the weight of our players)
- the (c)olor of the points, which varies for every element based on a third value (their height in inches)
- the ax that we are plotting on. In this case there is only one axes, but if there were more we could have passed ax = ax[0, 0] (indexing into the grid of axes) or ax = axis_name.
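To illustrate the last point, here is a sketch with a 2x2 grid of axes, where the dataframe plot is sent to the top-left one. The grid shape, the variable name axes, and the data are all assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the baseball data set
bb_players = pd.DataFrame({
    'Weight (lbs)': [180, 215, 210, 210, 188, 176],
    'Age': [22.99, 34.69, 30.78, 35.43, 35.71, 29.39],
})

# With nrows and ncols both greater than 1, subplots returns a 2D array of axes
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

# Pick the target axes by its [row, column] position
bb_players.plot(kind='scatter', x='Age', y='Weight (lbs)', ax=axes[0, 0])
```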
But we are still not fully using the OO method. To do that, we create the figure and axes with the subplots function and then call the plotting methods directly on the axes object it returns:
figure, axis = plt.subplots(figsize=(10, 7))
# creating the plot, same as before
first_plot = axis.scatter(x = bb_players['Age'],
                          y = bb_players['Weight (lbs)'],
                          c = bb_players['Height (inches)'])
# adding information
axis.set(title="correlations between baseball players weight and age",
         xlabel="players age",
         ylabel="players weight")
# adding the legend; the color encodes height, so the title refers to height
axis.legend(*first_plot.legend_elements(), title = 'Height in inches');
# the "*" unpacks the (handles, labels) pair returned by legend_elements()
What we are doing here is simply creating a figure and customizing its plot.
The result will be:
Note that the figure contains some typos; the most obvious is the legend title, which refers to weight where it should refer to height.
It should now be clear how to build a figure with the OO method. The same approach works with more than one plot in the figure. For example:
# creating a figure with more than 1 subplots
figure, (first_plot, second_plot) = plt.subplots(nrows = 2,
                                                 ncols = 1,
                                                 figsize = (12, 20))
scatter_plot = first_plot.scatter(x = bb_players['Age'],
                                  y = bb_players['Weight (lbs)'],
                                  c = bb_players['Height (inches)'])
first_plot.set(title="correlations between baseball players weight and age",
               xlabel="age of players",
               ylabel="weight of players")
first_plot.legend(*scatter_plot.legend_elements(),
                  title = 'Height in inches');
# putting a line on the average data
first_plot.axhline(y = bb_players['Weight (lbs)'].mean(),
                   color = 'r',
                   linestyle = '-')
# second plot
second_plot.hist(x = bb_players['Age'],
                 bins = 20)
second_plot.set(title = "Numbers of players with a certain age",
                xlabel="age of players",
                ylabel="number of players");
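As you can see from the code, we have set the rows and the columns of the figure and set the bins of the histogram. The bins are the equal-width intervals into which the x-axis range is divided; each bar shows how many values fall into one interval. A histogram with seven bins, for example, splits the range of the data into seven such intervals.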
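To see what the bins do in practice, here is a small sketch with invented ages. The hist method returns the count for every bin and the bin edges, so seven bins give seven counts and eight edges.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt

# Invented ages, only for illustration
ages = [22.9, 24.1, 25.3, 27.8, 28.0, 29.4, 30.8, 33.2, 34.7, 35.7]

fig, ax = plt.subplots()
counts, edges, bars = ax.hist(ages, bins=7)

print(counts)  # number of values in every bin
print(edges)   # boundaries of the bins, from the minimum to the maximum age
```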
axhline is a simple function, explained here, that draws a horizontal line across the plot. In this case, we pass the mean value of the weight column as the "y".
Now that the code is clear, we can look at the result:
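As a small, isolated sketch with invented weights, the line sits at the mean of the values and spans the whole width of the axes:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented data, only for illustration
weights = pd.Series([180, 215, 210, 188])
ages = [22.99, 34.69, 30.78, 35.71]

fig, ax = plt.subplots()
ax.scatter(x=ages, y=weights)

# Horizontal red line at the average weight
mean_weight = weights.mean()
mean_line = ax.axhline(y=mean_weight, color='r', linestyle='-')
```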
Customizing and saving the plot
There are various ways to customize our plot in matplotlib. The simplest is to start from one of the built-in styles and then adjust the details we want.
The basic command, which lists the available styles, is:
plt.style.available
Matplotlib styles help make a graph easier to read. Here is the difference between the default style and one that you can choose:
But even with a different style the plot may still be confusing. Matplotlib offers, among other options, the cmap parameter to change details of the plot such as the color map:
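Choosing a style could look like the sketch below. The style name ggplot is just one example taken from the list; any other name in plt.style.available works the same way.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt

# List of the style names that can be passed to plt.style.use
print(plt.style.available)

# Apply one of them (an arbitrary pick for this example);
# figures created afterwards use it
plt.style.use('ggplot')

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])
```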
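A sketch of cmap on a scatter plot follows. The color map winter is an arbitrary choice and the data is invented; cmap only has an effect when c holds numeric values, as it does here.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt

# Invented data, only for illustration
ages = [22.99, 34.69, 30.78, 35.43]
weights = [180, 215, 210, 210]
heights = [74, 74, 72, 69]

fig, ax = plt.subplots()

# c gives the value behind every point's color, cmap the palette used to map it
scatter = ax.scatter(x=ages, y=weights, c=heights, cmap='winter')

# Color bar that shows which color corresponds to which height
fig.colorbar(scatter, ax=ax, label='Height (inches)')
```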
The full reference is available at the official documentation page
The last step is to create an image of your figure with the savefig function that we saw in the last post. Then everything is ready to show our data to others.
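As a reminder, saving could look like this sketch. The file name, the temporary folder, and the dpi value are assumptions made for the example, and the data is invented.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(x=[22.99, 34.69, 30.78], y=[180, 215, 210])  # invented data
ax.set(title="players weight and age", xlabel="age", ylabel="weight (lbs)")

# Hypothetical output path; the format is taken from the file extension
out_path = Path(tempfile.gettempdir()) / "bb_players_scatter.png"
fig.savefig(out_path, dpi=150)
```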
Final thoughts
Today we covered the last part of the matplotlib Python package. Now that we know pandas, numpy, and matplotlib well enough, we can see how scikit works and start working on some models.
If you have any questions, feel free to leave a comment.