dev_neil_a
Introduction to Data Analysis with Python Part 3: Analysing Data Using Pandas and NumPy with Python

Introduction

In this multi-part series, I'll be going over some of the basics of Pandas, NumPy and Matplotlib that I've learned over the past few weeks.

In part one, I covered the following topics:

  • Importing data from a CSV file into Pandas
  • Cleaning up the data and removing unusable data.
  • Converting between datatypes.
  • Exporting and importing data to and from Excel files.

In part two, I covered performing mathematical operations against the data stored in a dataframe using both Pandas and NumPy.

In this part, I will be covering how to perform analytical operations against data in a Pandas dataframe to produce figures that could be used for reporting, such as totals.

As in the previous parts, there is a Jupyter notebook, along with all the other required files, in a GitHub repo that is linked in the Resources section.

Let's get started on part three.

Step 1. Importing Pandas and NumPy

First of all, the Pandas and NumPy libraries need to be imported.

import pandas as pd
import numpy as np

Step 2. Import From Excel

Once the libraries have been imported, the next step is to get the data imported. The import will use strict datatype enforcement, just as it did in part two the second time the Excel sheet was imported.

The only difference this time is that there are more columns to import and there will be no rows to skip.

sales_data = pd.read_excel(io = "data/order_data_with_totals.xlsx",
                           sheet_name = "order_data_with_totals",
                           dtype      = {"order_id": np.int64,
                                         "order_date": "datetime64[ns]",
                                         "customer_id": np.int64, 
                                         "customer_first_name": str,
                                         "customer_last_name": str,
                                         "customer_gender": str,
                                         "customer_city": str,
                                         "customer_country": str,
                                         "item_description": str,
                                         "item_qty": np.int64,
                                         "item_price": np.float64,
                                         "order_currency": str,
                                         "order_vat_rate": np.float64,
                                         "order_total_ex_vat_local_currency": np.float64,
                                         "order_total_vat_local_currency": np.float64,
                                         "order_total_inc_vat_local_currency": np.float64,
                                         "order_currency_conversion_rate": np.float64,
                                         "order_total_ex_vat_converted_gbp": np.float64,
                                         "order_total_vat_converted_gbp": np.float64,
                                         "order_total_inc_vat_converted_gbp": np.float64})

Step 3. Validating the Data

Now that the data has been imported from the Excel file into the sales_data dataframe, let's take a look at the data it contains.

Step 3.1. What the Data Looks Like

First, let's have a look at the first five rows of the data in the sales_data dataframe.

sales_data.head(n = 5)

The main difference this time is that there are more columns to the right of the sales_data dataframe. Only some of the columns are shown, as an image containing every column in the sales_data dataframe wouldn't fit.


Step 3.2. Check the Columns DataTypes

Next, let's have a look at the datatypes that have been assigned to each column in the sales_data dataframe.

sales_data.dtypes

As expected, all the datatypes match what they were specified to be when the data was imported.

Step 3.3. Check for NaN (Null) Values

sales_data.isna().sum()

There are no NaN values in the sales_data dataframe, as it was cleaned up in part one and the new columns that were created in part two were checked for NaN values before the data was exported to a new Excel file.

Step 4. Basic Analysis Functions with Pandas

First up, let's take a look at some of the basic analysis functions that can be performed with Pandas. This is just a small sample of what can be done so I would recommend looking at the Pandas documentation if you need to find out how to perform a specific function against a dataframe.

Step 4.1. Total Number of Orders

Although this isn't strictly a Pandas-specific function, it is useful for showing how many rows there are in the sales_data dataframe.

print(f"Total Number of Orders: {len(sales_data)}")

Step 4.2. Show Orders with a Total Greater than 50 GBP

Now let's take a look at narrowing down some of the data in the sales_data dataframe. To start with, let's see the first five orders that have a value greater than (>) 50 GBP in the order_total_inc_vat_converted_gbp column.

sales_data[sales_data["order_total_inc_vat_converted_gbp"] > 50].head(n = 5)

You can substitute the > for < (less than) if you need to find values under 50 GBP.
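As a quick standalone sketch of that reversal (using a small made-up dataframe with just the relevant column, not the real sales data), a less-than filter looks like this:

```python
import pandas as pd

# Hypothetical stand-in for sales_data with just the column being filtered
demo = pd.DataFrame({"order_total_inc_vat_converted_gbp": [12.50, 75.00, 49.99, 150.00]})

# Keep only the orders under 50 GBP
under_50 = demo[demo["order_total_inc_vat_converted_gbp"] < 50]

print(under_50)  # rows with 12.50 and 49.99
```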

Step 4.3. Show the Highest Value Order

Next, let's use the max() function to find the order in the sales_data dataframe with the highest value in the order_total_inc_vat_converted_gbp column.

sales_data[sales_data["order_total_inc_vat_converted_gbp"] == sales_data["order_total_inc_vat_converted_gbp"].max()]

Step 4.4. Show the Lowest Value Order

Finally, let's use the min() function to find the order in the sales_data dataframe with the lowest value in the order_total_inc_vat_converted_gbp column.

sales_data[sales_data["order_total_inc_vat_converted_gbp"] == sales_data["order_total_inc_vat_converted_gbp"].min()]

Step 5. Grouping Data by Criteria

In this section, I'll be demonstrating how to use the groupby function, along with a few others that can be used with it.

The groupby function is used to group together rows that share a value in a given column, or list of columns, that the data needs to be grouped by. The main purpose is to display a set of results by that grouped criteria, rather than showing every matching row individually.

Step 5.1. Total Number of Orders by Currency

To start this section, let's begin by getting a list of the currencies that the orders were placed in and then showing the total number of orders for each currency.

sales_data.groupby(["order_currency"]).size()

In the above, groupby groups the rows by the entries found in the order_currency column (GBP and EUR in this case) and size counts the rows in each group.

Now, let's make the output a little bit more presentable by putting it into a frame using the to_frame function.

Whilst we are at it, let's sort the orders by the currency with the lowest number of orders first using the sort_values function.

sales_data.groupby(["order_currency"])\
          .size()\
          .to_frame("total_number_of_orders")\
          .sort_values("total_number_of_orders", 
                       ascending = True)  

That looks better!

Step 5.2. Total Number of Orders by Gender

Using the same method as the above, let's take a look at the total number of orders by the gender of the customers.

sales_data.groupby(["customer_gender"])\
          .size()\
          .to_frame("no_of_orders")\
          .sort_values("no_of_orders", 
                       ascending = False)

Step 5.3. Total Value of Orders by Gender

Now that the total number of orders by gender is known, let's see what the total order value is for each gender. This step uses the agg function to perform the aggregation, passing a dictionary that maps each column to the function to use on it. In this example, it will only be the order_total_inc_vat_converted_gbp column.

As part of this, NumPy will be used to perform a sum (np.sum) against the orders that are found for each gender.

Finally, the round function is used to round the results to two decimal places.

sales_data.groupby(["customer_gender"])\
          .agg({"order_total_inc_vat_converted_gbp": np.sum})\
          .sort_values(["order_total_inc_vat_converted_gbp"], 
                       ascending = False)\
          .round(2)

As a side note, you can pass multiple columns and functions with the agg function, such as sum, mean and median. There is an example of how to do this in step 6.2.
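To sketch the multiple-functions case specifically (against a tiny made-up dataframe, since the real sales data isn't needed to show the shape), each column can map to a list of functions and the output gains one column per function:

```python
import pandas as pd

# Hypothetical mini-dataframe standing in for sales_data
demo = pd.DataFrame({"customer_gender": ["F", "M", "F", "M"],
                     "order_total_inc_vat_converted_gbp": [10.0, 20.0, 30.0, 40.0]})

# One column, three aggregation functions
summary = demo.groupby(["customer_gender"])\
              .agg({"order_total_inc_vat_converted_gbp": ["sum", "mean", "median"]})

print(summary)  # F: 40.0 / 20.0 / 20.0, M: 60.0 / 30.0 / 30.0
```

The resulting columns are a MultiIndex of (column, function) pairs.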

Step 6. Using Mean

In this section, the mean function, specifically the np.mean (NumPy) function will be used to aggregate data in the sales_data dataframe.

What is mean?

The mean is the sum of the numbers in a given column divided by how many numbers there are, whether that's every row in the column or just the rows left after filtering.
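As a quick worked example of that definition:

```python
import numpy as np

values = [2, 4, 9]

# (2 + 4 + 9) / 3 = 5.0
mean_value = np.mean(values)

print(mean_value)  # prints 5.0
```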

Step 6.1. Get the Mean of all the Orders

First, let's run mean against the order_total_inc_vat_converted_gbp column.

print(f"Mean of all orders (Converted to GBP): £{np.mean(sales_data['order_total_inc_vat_converted_gbp']):.2f}")

Just a quick note, :.2f will only show two decimal places in the output.
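A quick standalone illustration of that format specifier:

```python
value = 1234.5678

# :.2f formats the number to two decimal places
formatted = f"£{value:.2f}"

print(formatted)  # prints £1234.57
```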

Step 6.2. Get the Mean of Orders by Country

Next, let's use agg to get the mean of the order_total_inc_vat_local_currency and order_total_inc_vat_converted_gbp columns and use groupby to group up the results by country and the currency used.

sales_data.groupby(["customer_country", "order_currency"])\
          .agg({"order_total_inc_vat_local_currency": np.mean,
                "order_total_inc_vat_converted_gbp":  np.mean},)\
          .sort_values(["order_total_inc_vat_converted_gbp"], 
                       ascending = False)\
          .round(2)

From the above, the results are grouped up by country and currency.

Step 6.3. Get the Mean of Orders by Country, City and Gender

Following on from the previous example, let's take it a little bit further. This time, let's use three columns for the grouping and do a mean of just the order_total_inc_vat_converted_gbp column.

sales_data.groupby(["customer_country", "customer_city", "customer_gender"])\
          .agg({"order_total_inc_vat_converted_gbp": np.mean})\
          .sort_values(["order_total_inc_vat_converted_gbp"], 
                       ascending = False)\
          .round(2)

The results this time are grouped up by country, city and gender. This will allow for a deeper dive into the customers by each and can be further expanded to see what they buy.

Step 7. Using Median

Now that mean has been covered, let's take a look at using median. The process for using median is very much the same as for mean; you simply substitute mean with median.

What is median?

The median is the middle number found by sorting all the data in a column (lowest to highest) and taking the value in the middle. If there are two middle numbers (which happens with an even number of entries), the mean of those two numbers gives the median.
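A quick worked example covering both the odd and even cases:

```python
import numpy as np

# Odd number of entries: the middle value is the median
odd_median = np.median([1, 3, 7])      # -> 3.0

# Even number of entries: the mean of the two middle values
even_median = np.median([1, 3, 7, 9])  # mean of 3 and 7 -> 5.0

print(odd_median, even_median)  # prints 3.0 5.0
```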

The two examples below are the same as the examples used in step 6, only they use np.median (NumPy median) instead of np.mean (NumPy mean).

Step 7.1. Get the Median of Orders by Country

sales_data.groupby(["customer_country"])\
          .agg({"order_total_inc_vat_converted_gbp": np.median},)\
          .sort_values(["order_total_inc_vat_converted_gbp"], 
                       ascending = False)\
          .round(2)

Step 7.2. Get the Median of Orders by Country, City and Gender

sales_data.groupby(["customer_country", "customer_city", "customer_gender"])\
          .agg({"order_total_inc_vat_converted_gbp": np.median})\
          .sort_values(["order_total_inc_vat_converted_gbp"], 
                       ascending = False)\
          .round(2)

Step 8. Working with Dates

This final section will focus on using dates with Pandas dataframes. Using dates can be done a number of ways but I'll be covering just a couple. Feel free to leave a comment with ways that you would typically use dates with Pandas dataframes.

Step 8.1. Recreate the Index for the Dataframe with Dates for the Index

First, we will recreate the index for the sales_data dataframe using the values stored in the order_date column. This allows the sales_data dataframe to be treated as a time series-based dataframe, which makes operations involving dates and/or times faster and more convenient.

sales_data.set_index([sales_data["order_date"]], 
                     inplace = True)

sales_data.index.rename("date", 
                        inplace = True)

sales_data.head(n = 2)

The index (first column on the left) is now showing the same date per row as order_date.

Step 8.2. Work out the Orders Totals per Day in a Specific Range

Now that the sales_data dataframe is indexed with dates, let's make use of it. To do this, the loc function is used. This function will find entries in the index that fall within the criteria supplied.

In the example below, loc will search for values in the index that fall between the first of April 2020 and the tenth of April 2020.

From there, it will group up the orders found by the order_date column and perform a sum on the order_total_inc_vat_converted_gbp column for any orders placed on each day. The output will show each day and the total for that day.

start_date = "2020-04-01"
end_date   = "2020-04-10"

sales_data.loc[start_date : end_date]\
          .groupby(["order_date"])\
          .agg({"order_total_inc_vat_converted_gbp": np.sum})\
          .sort_values("order_date", 
                       ascending = True)\
          .round(2)

Step 8.3. Show Total Order Values and Mean for Each Year

Next up, let's take a look at getting the total and the mean for each year found in the sales_data dataframe index. To do this, the index will be resampled using the resample function, which groups the entries in the sales_data dataframe by year ("Y").

This won't impact the sales_data dataframe as it isn't getting reassigned with the resampled data.

In addition, the agg function will be used in a slightly different way. This time, named aggregation will be used, which allows us to specify the name of the output column for each aggregation. For example, instead of order_total_inc_vat_converted_gbp, the column name in the output will be year_total_gbp.

sales_data.resample("Y")\
          .agg(year_total_gbp = ("order_total_inc_vat_converted_gbp", np.sum),
               year_mean_gbp  = ("order_total_inc_vat_converted_gbp", np.mean))\
          .round(2)

Step 8.4. Show Total Order Values and Mean for a Single Year

Now that you have seen it done for each year, let's do it again but only show the results for one year instead.

To do this, loc needs to be used first, which will filter the index for the year that is specified. In this case, the year will be 2020, but there is data for 2021 as well if you want to try it.

year_to_use = 2020

sales_data.loc[sales_data.index.year == year_to_use]\
          .resample("Y")\
          .agg(year_total_gbp = ("order_total_inc_vat_converted_gbp", np.sum),
               year_mean_gbp  = ("order_total_inc_vat_converted_gbp", np.mean))

Step 8.5. Show Total Order Values and Mean per Month for Each Year

Lastly, let's get the order total and mean for each month ("M") that is in the sales_data dataframe index.

sales_data.resample("M")\
          .agg(month_total_gbp = ("order_total_inc_vat_converted_gbp", np.sum),
               month_mean_gbp  = ("order_total_inc_vat_converted_gbp", np.mean))\
          .round(2)

If you change resample("M") to resample("Q"), the results for each quarter of the year will be shown.
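A minimal sketch of the quarterly variant, using a small made-up monthly series in place of the real sales data (the column name matches the article; the values are invented):

```python
import pandas as pd

# Hypothetical monthly order totals standing in for sales_data
dates = pd.date_range("2020-01-01", periods=6, freq="MS")
demo = pd.DataFrame({"order_total_inc_vat_converted_gbp": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]},
                    index=dates)

# Resample by calendar quarter and total each quarter's orders
quarterly = demo.resample("Q")\
                .agg(quarter_total_gbp=("order_total_inc_vat_converted_gbp", "sum"))

print(quarterly)  # Q1 2020: 60.0, Q2 2020: 150.0
```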

Resources

GitHub files for part 3
