DEV Community

Cover image for Introduction to Data Analysis with Python Part 2: Mathematical Operations Using Pandas and NumPy
dev_neil_a
dev_neil_a

Posted on • Updated on

Introduction to Data Analysis with Python Part 2: Mathematical Operations Using Pandas and NumPy

Introduction

In this multi-part series, I'll be going over some of the basics of Pandas, NumPy and Matplotlib that I've learned over the past few weeks.

In part one I covered the following topics:

  • Importing data from a CSV file into Pandas
  • Cleaning up the data and removing unusable data.
  • Converting between datatypes.
  • Exporting and importing data to and from Excel files.

In this part, I will be covering how to perform mathematical operations against data in a Pandas Dataframe to create additional data, such as a total for example. I will also be covering using NumPy to perform mathematical operations against data instead of using methods in Pandas, along with comparing the performance of each against one of the operations being performed.

There is a Jupyter notebook, along with all the other required files located in a GitHub repo that is linked in the Resources section.

Let's get started.

Step 1. Importing Pandas and NumPy

First of all, the Pandas and NumPy libraries need to be imported.

import pandas as pd
import numpy as np
Enter fullscreen mode Exit fullscreen mode

Step 2. Import From Excel

Once the libraries have been imported, the next step is to get the data imported.

sales_data = pd.read_excel(io = "data/orders-export.xlsx",
                           sheet_name = "orders-export",
                           skiprows   = 1)
Enter fullscreen mode Exit fullscreen mode

The above will look for the Excel file named orders-export.xlsx in a folder called data and use the worksheet named orders-export for its data source.

The first row will be skipped as that row has a title on it. If the row wasn't skipped, the data would be imported incorrectly.

Step 3. Validating the Data

Now that the data has been imported from an Excel file, let's take a look at the data it contains.

Step 3.1. What the Data Looks Like

First, let's have a look at the first five rows of the data in the sales_data dataframe.

sales_data.head(n = 5)
Enter fullscreen mode Exit fullscreen mode

output from above]

The data is a collection of sales for a number of customers, with each order being a single line.

Step 3.2. Check the Columns DataTypes

Next, let's have a look at the datatypes that have been assigned to each column in the sales_data dataframe.

sales_data.dtypes
Enter fullscreen mode Exit fullscreen mode

output from above

As you can see, the datatypes are different from when the CSV file was imported in part one. The reason for this is when you use pd.read_excel it will check what datatype is assigned to each column that it has been told to import and assign the nearest equivalent to it. You can override the datatype that is assigned when you import the data from the file if you so wish. For example, change a number (say an integer) column in Excel to an object (string) datatype in the dataframe.

Step 4. Reimport From Excel, Setting and Converting Data Types During Import

Step 4.1 Reimporting From Excel File

In the previous import of the Excel file, pd.read_excel was allowed to import the data and use the datatype that was best matched to what each column was set to in Excel. For the most part this worked well but there was a notable exception and that was the order_date column, which was assigned the object datatype.

To solve that, the column can be imported and then converted to a datetime format that is compatible with Pandas or the column can be converted as part of the import process.

In addition, the item_description column has a number of ',' in there that need to be removed in case the data needs to be exported to a CSV file. Again, this can be done after importing the data into the dataframe but instead it can be converted during the import.

In addition to converting the above two columns, let's hard-code the datatypes for each column. This can be useful to pickup issues in the data, such as strings being found in a column that is set to a float or an integer.

sales_data = pd.read_excel(io = "data/orders-export.xlsx",
                           sheet_name = "orders-export",
                           skiprows   = 1,
                           dtype      = {"order_id": np.int64,
                                         "customer_id": np.int64, 
                                         "customer_first_name": str,
                                         "customer_last_name": str,
                                         "customer_gender": str,
                                         "customer_city": str,
                                         "customer_country": str,
                                         "item_qty": np.int64,
                                         "item_price": np.float64,
                                         "order_currency": str,
                                         "order_vat_rate": np.float64},
                           converters = {"order_date": lambda date: pd.to_datetime(date, dayfirst=True),
                                         "item_description": lambda description: description.replace(",", " -")})
Enter fullscreen mode Exit fullscreen mode

To go over what just happened, the same Excel file and worksheet were used for the data source, along with skipping the first line of the worksheet. As part of the import the datatypes for each column were specified and used NumPy on any numerical field (more about NumPy in the next section) to do the datatype setting. After that (converters), the order date was then converted using a lambda function that converts the date from DD/MM/YYYY to YYYY-MM-DD. Finally, another lambda function was used to replace any ',' in the item description to ' -'.

Step 4.2 What the Data Now Looks Like

sales_data.head()
Enter fullscreen mode Exit fullscreen mode

Image description

From the above, the date column now has a different format to the previous import and any ',' have been replace with ' -' (see the item_description for order_id 200003 (index number 2)).

The rest of the data in the sales_data dataframe remains much the same as before.

Step 4.3 Recheck the Columns DataTypes

sales_data.dtypes
Enter fullscreen mode Exit fullscreen mode

output from above

As per the previous import, the datatype are correctly set, with the only change being the order_date datatype which is now set to datetime64.

Step 4.4 Check for NaN (Null) Values

sales_data.isna().sum()
Enter fullscreen mode Exit fullscreen mode

output from above

As you can see, there is no NaN values in the data as it was cleaned prior to use. If you need to know how to perform some basic cleaning up of data, see part one of the series where I go over some techniques for data cleaning and datatype conversions.

Step 5. Adding Custom Columns

In this section I will cover adding additional columns of data, along with also removing some columns as well.

For one of the columns, I will show you three different methods to multiply two numbers together, with the rest of the additional columns using the most performant method of the three. These three methods are:

  • Create two NumPy arrays, one for item_qty and another for the price_per_unit.
    • Multiply the two NumPy arrays together using np.multiply.
  • Multiply the item_qty by the price_per_unit directly from the sales_data dataframe using NumPy.
  • Multiply the item_qty by the price_per_unit directly from the sales_data dataframe.

Why would you use NumPy to do this when you can just directly do multiplication from the sales_data dataframe? One word: SPEEEEEEEEEEEEEED!

Step 5.1 What is NumPy?

NumPy is a library that is specifically built for performing mathematical operations against numerical data.

NumPy at it's base is written in C and C++ that has some Python code on top of it to interact with it as a Python library. The main reason it is so much faster versus Python lists (~50 times faster in some cases) is that it stores the data in an array, which occupies one continuous memory space so it can be accessed quickly (also called locality of reference).

Step 5.2 Creating an Order Total (Excluding VAT) Column for the Currency That the Order was Bought With

This column will take the item_price and order_qty columns and multiply them together. Now, for this particular part, we will do it three times using the previously mentioned methods (NumPy only, Pandas with NumPy and Pandas only).

Note: Each of these methods will take a number of seconds to complete, some more so than others. This is due to the %timeit magic function. It is used to time how long it takes to perform the creation of each array or column. It will loop through a number of times to find what the best time was as there may have been processes running at the time that may impact the performance. This is why it takes longer.

Normally, each would be done quickly that you wouldn't notice much of a difference but if the dataset was larger, that could be a different story.

Method 1: Pandas-Based Multiplication

This method will create a column that uses Pandas alone to perform the multiplication. It does use NumPy a little but that is to round (np.around) the result to two decimal places. This will be the same for the other two methods.

Note: The '\' is used as a break character. It was needed to be able to put a line break in when using %timeit, otherwise it would fail.

%timeit sales_data["order_total_ex_vat_local_currency_pandas"] =  np.around(sales_data["item_qty"] * sales_data["item_price"], \
                                                                            decimals = 2)
Enter fullscreen mode Exit fullscreen mode

output from above

As you can see from the result, the best time to complete it was 220 µs (µs is microseconds) or 0.00022 Seconds.

Now, onto the next method.

Method 2: NumPy Multiplication using Data from the sales_data Dataframe Directly

The process for this method is similar to that used in method one but instead it will use NumPy to perform the multiplication using data directly from the sales_data dataframe.

%timeit sales_data["order_total_ex_vat_local_currency_mixed"] =  np.around(np.multiply(sales_data["item_qty"], \
                                                                                       sales_data["item_price"]), \
                                                                                       decimals = 2)
Enter fullscreen mode Exit fullscreen mode

output from above

As you can see from the result, the best time to complete it was 247 µs or 0.000247 Seconds.

Now, onto the final method.

Method 3: NumPy-Based Multiplication

In this method, there will be four steps to be done.

  1. Create a NumPy array (ndarray) that is created from the data in the sales_data["item_qty"] column. This array will be called sales_data_order_qty.
  2. Create a NumPy array (ndarray) that is created from the data in the sales_data["item_price"] column. This array will be called sales_data_item_price.
  3. Create a NumPy array (ndarray) that is created by multiplying the two NumPy arrays together. This array will be called sales_data_order_total_ex_vat_local_currency.
  4. Insert the results from the sales_data_order_total_ex_vat_local_currency array back into the sales_data dataframe.

This does have more steps to do than the previous two methods but the results will speak for themselves.

Note: %timeit has this annoying issue of not being able to perform variable assignments when it is used. This is why the array creations are shown twice for each array. This does not however though impact the insertion or reassignment of columns to a dataframe.

%timeit sales_data_item_qty = np.array(sales_data["item_qty"])
sales_data_item_qty = np.array(sales_data["item_qty"])

%timeit sales_data_item_price = np.array(sales_data["item_price"])
sales_data_item_price = np.array(sales_data["item_price"])

%timeit sales_data_order_total_ex_vat_local_currency = np.around(np.multiply(sales_data_item_qty, \
                                                                 sales_data_item_price), \
                                                                 decimals = 2)

sales_data_order_total_ex_vat_local_currency = np.around(np.multiply(sales_data_item_qty, \
                                                         sales_data_item_price), \
                                                         decimals = 2)

%timeit sales_data["order_total_ex_vat_local_currency"] = sales_data_order_total_ex_vat_local_currency
Enter fullscreen mode Exit fullscreen mode

output from above

As you can see from the result, the best time to complete all four steps is 54.78µs or 0.00005478 Seconds. This is significantly faster than using the previous two methods as all the data from both sources were stored in NumPy arrays.

The sales_data dataframe now looks like this:

sales_data.head(n = 1)
Enter fullscreen mode Exit fullscreen mode

output from above

You might be wondering how NumPy works out the order for which number from each array to multiply together.

When NumPy performs operations, such as multiplication or addition on two arrays, it will use what is known as element-wise matching. What this does is match the arrays like-for-like and perform whatever operation is asked.

For example, if we had an two arrays with ten numbers in that we need to add together, NumPy will take the first item (element) in the first array and then add that to the first item in the second array. Then the second to the second, third to the third and so on.

For the remaining columns that need to be created, this method will be used. But first though, some columns need to be removed from the sales_data dataframe as they will not be used.

Step 5.3 Removing Two Unused Columns

As we only need one copy of the order_total_ex_vat_local_currency column, we can remove two of them from the sales_data dataframe using the Pandas drop function.

sales_data.drop(labels = ["order_total_ex_vat_local_currency_pandas","order_total_ex_vat_local_currency_mixed"], 
                axis = 1,
                inplace = True)

sales_data.head(n = 1)
Enter fullscreen mode Exit fullscreen mode

output from above

As you can see, the two columns have now been removed. So what did it do?

The drop function was passed a list consisting of two columns that were duplicates of order_total_ex_vat_local_currency but with slightly different names. The column axis (axis = 1) was specified to imply it should drop a column matching each name in the list and commit those changes to the sales_data dataframe (inplace = True).

Step 5.4 Creating an Order Total VAT Column for the Currency That the Order was Bought With

sales_data_order_vat_rate = np.array(sales_data["order_vat_rate"])
sales_data_order_total_vat_local_currency = np.around(np.multiply(sales_data_order_total_ex_vat_local_currency,
                                                                  sales_data_order_vat_rate) / 100,  
                                                                  decimals = 2)

sales_data["order_total_vat_local_currency"] = sales_data_order_total_vat_local_currency
Enter fullscreen mode Exit fullscreen mode

Step 5.5 Creating an Order Total (Including VAT) Column for the Currency That the Order was Bought With

In this section, the values that are stored in the sales_data_order_total_ex_vat_local_currency and sales_data_order_total_vat_local_currency arrays will be added together and added to the sales_data dataframe a new column named order_total_inc_vat_local_currency.

sales_data_order_total_inc_vat_local_currency = np.around(np.add(sales_data_order_total_ex_vat_local_currency, 
                                                          sales_data_order_total_vat_local_currency), 
                                                          decimals = 2)

sales_data["order_total_inc_vat_local_currency"] = sales_data_order_total_inc_vat_local_currency
Enter fullscreen mode Exit fullscreen mode

Step 5.6 Creating an Currency Conversion Column

As part of the reporting that a company would do, it can be required to have all order totals be shown in one currency. For this, we need a column that will show the currency conversion rate that is to be used. For this, GBP (British pounds) will be used as the currency to report on.

Let us first see what currencies are present in the sales_data dataframe.

sales_data["order_currency"].value_counts()
Enter fullscreen mode Exit fullscreen mode

output from above

There are two currencies listed, GBP (British pounds) and EUR (Euros). For the conversion values, GBP will have a conversion rate of 1 and Euros will have a conversion rate of 0.84.

sales_data["order_currency_conversion_rate"] = sales_data["order_currency"].apply(lambda currency: 1.00 if currency == "GBP" else\
                                                                                  (0.84 if currency == "EUR" else 1.00))
Enter fullscreen mode Exit fullscreen mode

What the above did was create a column called order_currency_conversion_rate which would then have a value of 1 (GBP) or 0.84 (Euro) set per row that was determined by what was in the order_currency column for that row. A lambda function was used with if statements to determine which value to use.

Let ud take a look at an entry that has GBP for it's currency:

sales_data.head(n = 1)
Enter fullscreen mode Exit fullscreen mode

output from above

As you can see, the order_currency_conversion_rate value for the first row is set to 1 as it has GBP in the order_currency column.

Now, let's take a look at an entry that has EUR for it's currency:

sales_data.tail(n = 1)
Enter fullscreen mode Exit fullscreen mode

output from above

As you can see, the order_currency_conversion_rate value for the row is set to 0.84 as it has EUR in the order_currency column.

Now that the order_currency_conversion_rate column has been created, let's create a NumPy array from it for use in the next few steps.

sales_data_currency_conversion_rate = np.array(sales_data["order_currency_conversion_rate"])
Enter fullscreen mode Exit fullscreen mode

Step 5.7 Creating an Order Total (Excluding VAT) Column for the Converted Currency

Now that the currency conversion rates have been added for each row, the total values that were previously created using the local currency the order was paid for in can now be converted to GBP.

To do this, simply multiply the previous total by the currency conversion rate.

sales_data["order_total_ex_vat_converted_gbp"] = np.around(np.multiply(sales_data_order_total_ex_vat_local_currency, 
                                                                       sales_data_currency_conversion_rate), 
                                                                       decimals = 2)
Enter fullscreen mode Exit fullscreen mode

Step 5.8 Creating an Order Total VAT Column for the Converted Currency

The same process will be used for converting the total VAT to GBP.

sales_data["order_total_vat_converted_gbp"] = np.around(np.multiply(sales_data_order_total_vat_local_currency, 
                                                                    sales_data_currency_conversion_rate), 
                                                                    decimals = 2)
Enter fullscreen mode Exit fullscreen mode

Step 5.9 Creating an Order Total (Including VAT) Column for the Converted Currency

And finally, the same for converting the total (including) VAT to GBP.

sales_data["order_total_inc_vat_converted_gbp"] = np.around(np.multiply(sales_data_order_total_inc_vat_local_currency, 
                                                                        sales_data_currency_conversion_rate), 
                                                                        decimals = 2)
Enter fullscreen mode Exit fullscreen mode

Step 6. Examining the Completed sales_data Dataframe

Now that the sales_data dataframe has the totals needed, let's take a look at it. First, let's take a look at a row that was paid for in GBP and was converted (or rather not but was) to GBP.

sales_data.head(n = 1)
Enter fullscreen mode Exit fullscreen mode

output from above

Next, let's look at a row that had its totals converted from EUR to GBP.

sales_data.tail(n = 1)
Enter fullscreen mode Exit fullscreen mode

output from above

The values look correct so they were converted as expected.

Now let's check the datatypes for all of the columns, specifically the newly created columns.

sales_data.dtypes
Enter fullscreen mode Exit fullscreen mode

output from above

They all have a datatype of float64, which is what is required.

Finally, are there any missing (NaN) values.

sales_data.isna().sum()
Enter fullscreen mode Exit fullscreen mode

output from above

No missing values have been reported so the sales_data dataframe is now complete, clean and all the columns have the correct datatypes.

Step 7. Exporting data back to Excel

In the last section of part two, the data in the sales_data dataframe will be exported to a new Excel file that will contain all the data, including all of the totals that were created in the previous section.

This file will be used in future parts of this series to perform some analysis against and view some statistics.

with pd.ExcelWriter(path = "data/order_data_with_totals.xlsx", 
                    engine = "xlsxwriter",
                    date_format = "YYYY-MM-DD",
                    datetime_format = "YYYY-MM-DD") as writer:

    sales_data.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data_with_totals")
Enter fullscreen mode Exit fullscreen mode

A sample of the contents of the Excel file are shown below.

output from above

Resources

GitHub files for part 2

Top comments (0)