Data Analysis with Python

Wanjala Simiyu

A Short Introduction

Before 1938, the Middle Eastern Kingdom of Saudi Arabia was a dusty desert country dotted with poor migratory populations of people and animals. However, the region and its people witnessed an enormous transformation when oil was struck. The Saudis may have known about the fossil fuel, but it took Max Steineke, an American petroleum geologist, to discover the resource in commercial quantities.

As with oil in Saudi Arabia, many businesses today generate loads of data, and most have little idea how precious a resource that data is. Many have called data the new oil of the digital economy because it can turn a business around if harnessed properly. Some businesses are pre-1938 Saudi Arabia -- they have an idea of the data sitting in their numerous Excel sheets and (perhaps) cloud-based databases, but are completely unaware of the resource's commercial benefits.

Data analysis is a huge field today, largely due to the business analytics movement and, most importantly, the rise of Python as a handy tool for data analysis. Python is a popular scripting language for data analysis nerds and data scientists. In fact, the TIOBE Index for January 2023 indicates that Python trounces hundreds of rivals to be the most popular programming language overall, and it was language of the year in 2007, 2010, 2018, 2020, and 2021.

My first encounter with Python was in 2019. I was looking through YouTube for the best tutorial on R, another important language for Data Analysts. The love for Python grew just minutes into the first video -- I found the syntax appealing and the IDEs easy to use. By contrast, RStudio was giving me panic attacks and its syntax felt off.

I dug deeper into Python, and the love affair continues to grow from strength to strength. During this period, I realized that the best way to learn Python, as with any other technical skill, is through regular projects. I also learnt that Kaggle is a massive repository of datasets I could use to polish my data analysis and other Python skills.

This project is the first of many to come. It uses a dataset downloadable from Kaggle (via this link). Also, a huge shoutout to Keith Galli, from whose YouTube video I drew crucial lessons.

Project Description:

The dataset's objective is to enable Data Analysts to practice basic exploratory data analysis (EDA) and cleaning.

We should answer the following questions by the end of this presentation:

  1. What was the best month for sales and how much did the online tech store earn?
  2. In what city did the tech store sell the most products?
  3. When should the store display ads if it wanted to maximize purchases?
  4. What products should the store recommend to buyers to purchase as a group?
  5. What product did the store sell the most over the 12 months of the data?

Data Wrangling

Preliminaries

Import the necessary libraries

import pandas as pd # for data wrangling 
import glob # for collecting files from system into the program
import matplotlib.pyplot as plt # for visualization
plt.style.use("fivethirtyeight") # default theme for plots

Loading the dataset into the notebook

As you may have noticed, the data is spread across 12 files (one for each month of 2019). Working with such fragmented files is impractical, so we must merge them into a single dataset.
Hint:

  • Use glob.glob() to create an object with all the files then concatenate them using pandas.concat()
# if you created the notebook in the same directory 
# as the dataset, the dataset's root path is:
root_path = "./Sales_Data"

# assign all files to one variable
files = glob.glob(f'{root_path}/*.csv', recursive=True)


Load the files into the current workspace and assign them to a variable.

# I use list comprehensions to iterate through
# the 'files' variable created earlier.
# read more on list comprehensions if you have
# trouble understanding what is happening here.
all_files = [pd.read_csv(file) for file in files]

# Concatenate the files into one huge file
# and assign it to 'df' variable
df = pd.concat(all_files, axis=0)

# to be sure all is well, print the first five 
# rows of the data frame
df.head()

You should see something like this after running df.head():
A screenshot showing the output of df.head()

The dataset is now successfully loaded into the notebook.

Data Cleaning

Data cleaning is the most taxing phase in the data analysis process. It entails removing incorrectly formatted, duplicate, incomplete, or corrupted data within the dataset. This step is necessary primarily because most datasets are messy.

A typical data cleaning job involves two primary tasks:

  • Investigate the dataset for missing values.
  • Check the data type of each column.

This article discusses these steps in great detail.

Let's step through each data cleaning task:

Investigating the dataset for missing values

Once you confirm that the dataset has missing values, you can choose to either purge the rows containing the missing values or fill them with arbitrary values.

Let's see if our dataset has any missing values:

# Check for the total number of null values
# per column
df.isnull().sum()

The output should show the following:

Image showing the output of df.isnull().sum()
The uniform counts across columns imply that the missing values come from rows that are empty across the board, so we will be fine with simply dropping all of the NaNs.

Drop the NaN values with df.dropna(), remembering to reassign the result (or pass inplace=True), since dropna() returns a new DataFrame by default.
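
As a quick check (a minimal sketch, assuming the all-empty rows are the only missing data), drop the NaNs and confirm nothing is left behind:

# drop rows containing missing values and keep the result
df = df.dropna()

# confirm that no missing values remain
df.isnull().sum()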

Investigating the data type

Next, let's check the data type for each column and see if we may need to make some changes. To do this, use the .info() function. It provides a summary of the dataset, i.e., the total number of entries (rows), number of columns, and the data type of the values in each column.

# Check the data type for each column
df.info()

The output should look like this:

Image showing the output of df.info()

Notice that the data type for all columns is object. In pandas, the object data type denotes values stored as generic Python objects, most commonly strings or a mix of text and numbers held as text. In other words, pandas is currently treating these values as strings. Read more here.

Looking at the data once more, intuition should tell you that columns like Quantity Ordered and Price Each should have numeric values. Let's make the changes as follows:

# Convert 'Quantity Ordered' and 'Price Each' column values
# to numeric (either float or int)
df['Quantity Ordered'] = pd.to_numeric(df['Quantity Ordered'])
df['Price Each'] = pd.to_numeric(df['Price Each'])

If you were coding along with me, the interpreter may have thrown an error after running the above code. Mine looks like this:

Image showing the ValueError traceback

The beauty of interpreter errors is that they point you to the problem, which you can fix and then carry on with the analysis. Looking at the error message, you'll notice that it is a ValueError, meaning the interpreter encountered unexpected values while converting to numeric. The most important part of the error message is the one reading ValueError: Unable to parse string "Quantity Ordered" at position 519.

Let's solve this issue. First, head straight to the position in question (position 519). Note: The position might be different for you; just head to the position the error message points you to.
Checking the entries at position 519 yields the following results:

Image showing the entries at pos 519
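
If you want to pull up that stretch of the data yourself, a quick check along these lines should work (a sketch; the exact position may differ on your machine):

# peek at the rows around the position the error reported
df.iloc[515:525]

# or list every row whose first column repeats the header text
df[df['Order ID'] == 'Order ID'].head()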

Clearly, some stray header rows ended up in the merged DataFrame -- rows whose values simply repeat the column names, i.e., Order ID, Product, etc. We should purge these rows because they're unneeded.

Various methods can be applied to remove the rows; I prefer creating a mask (a boolean filter that flags the rows we want to keep) and then selecting only the rows that pass the filter. This way:

mask = df.iloc[: , 0] != 'Order ID' # all rows in the first
                                    # column that do not contain 
                                    # the value 'Order ID'
# slice off the unwanted values
df = df[mask]

Now, if you attempt to convert the data types as before, the interpreter shouldn't complain.
Running df.info() again should show that the data types have changed. The new data type for each column should be as follows:
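
For completeness, the conversion is the same code as before, followed by the summary:

# re-run the conversion now that the stray header rows are gone
df['Quantity Ordered'] = pd.to_numeric(df['Quantity Ordered'])
df['Price Each'] = pd.to_numeric(df['Price Each'])

# confirm the new data types
df.info()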

Image showing the output for df.info()

Notice that the data types for the two columns are now numeric. Success!

So, we have gotten rid of missing values, converted data types to appropriate ones, and removed unwanted rows. This critical work ensures our dataset is ready for further manipulation so that we can begin answering the burning questions. As a side note, we may need to do more wrangling in the course of answering the questions, because new issues tend to surface the more we work with the data; adding new columns, for instance, can introduce fresh problems.

Answering the Questions

Question 1: What was the best month for sales and how much did the online tech store earn?

The best data analysts can spot key words or phrases in a question that point them to the specific actions needed to work out a solution. In this question, the key phrases are best month and sales earned. This means we should look at the total sales earned for each month.

But we may have a problem. Our dataset does not have a Sales column or Month column. Thus, we must create them from existing columns. Let's deal with the Sales column first.

Normally, you determine the sales value by multiplying the quantity sold by the price for each unit. We'll do the same in this case, only that we'll be multiplying two columns, Quantity Ordered and Price Each.

df['Sales'] = df['Quantity Ordered'] * df['Price Each']

Executing df.head() will show that the dataset now has an extra column called Sales.

Image showing dataset with new column, 'Sales'

Creating the Month column will require more energy and perhaps a few Google searches. But worry not because I already did that.

Looking at the Order Date column, you'll notice that each entry has the month, day, year, and time when the order was made. We can extract the month by splitting each entry at the forward slash symbol, /. We use the pandas.Series.str.split method as shown (ensure expand is set to True).

df['Order Date'].str.split('/', expand=True)

The output after executing the code will be:

Image showing the output of splitting the 'Order Date' column
However, we are only interested in the month column, which is at index 0. So,

df['Order Date'].str.split('/', expand=True)[0]

# assign it to the 'Month' column
df['Month'] = df['Order Date'].str.split('/', expand=True)[0]

Executing df.head() again shows we have a new column, Month.

Image showing the dataset with the new 'Month' column
Now that we have the columns we need, shall we answer the question?
We are required to determine the best month for sales and the total sales amount the online tech store earned that month. In pandas, there is a groupby() method with which data values can be grouped by a specific column (or feature). In this case, we'll group the data by month and then find the total sales for each month as follows:

df.groupby('Month')['Sales'].sum()

Executing the code yields the following output:

Image showing total sales per month
While the result answers the question, the reader cannot immediately tell which month recorded the best sales without straining through the output. The primary goal of a data analyst is to make things as simple as possible for the end user (i.e., the consumer of the analysis). The best way to keep things simple while still conveying sufficient detail is through charts. So, let's create our first chart displaying the output of the above code.

# data
months = range(1, 13)
sales = df.groupby('Month')['Sales'].sum().values

# plot
fig, ax = plt.subplots(figsize=(10,6))
ax = plt.plot(months, sales)
plt.xticks(months)

# prevent the y axis from displaying data with scientific notation
plt.ticklabel_format(useOffset=False, style='plain')

plt.title('Total Sales per Month', pad=20)
plt.ylabel('Sales', labelpad=20)
plt.xlabel('Month', labelpad=20)

# save the plot to 'plots' directory
plt.savefig('plots/sales_per_month.png', dpi=600, bbox_inches='tight')

plt.show()

The output is as shown below:
Line chart showing total sales per month
Clearly, the online tech store recorded the highest sales value in December ($4,613,443).
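
If you prefer reading the answer straight off the numbers rather than the chart, the same groupby result gives it directly (a small sketch):

# month with the highest total sales, and the amount earned
monthly_sales = df.groupby('Month')['Sales'].sum()
print(monthly_sales.idxmax(), monthly_sales.max())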

Question 2: In what city did the tech store sell the most products?

Again, the keywords here are city and product sales or quantity ordered, to be specific. We are expected to determine the city in which the tech store sold the highest number of products over the 12 months of data.

Similar to the previous question, our dataset does not have a City column. Let's create it.

In the dataset, the city from which a product was ordered is part of the Purchase Address column. The catch is that we have to split the entries to extract the city. Note: It is safer to extract the city together with the state code, because different US states can have cities with the same name.

# inspect the city and state parts of the address
df['Purchase Address'].str.split(',', expand=True)[[1,2]]

# extract the state code (e.g., 'CA') and the city, then combine them
state_code = df['Purchase Address'].str.split(',', expand=True)[2].str.strip().str.split(" ", expand=True)[0]
city = df['Purchase Address'].str.split(',', expand=True)[1]
df["City"] = city + " " + state_code

On executing the code block and running df.head(), you'll notice a City column at the right end of the dataset.

Image showing the dataset with the new 'City' column
With the City column ready, let's once again see what the question demands of us. It asks in which city the tech store sold the most products. This means we should group the dataset by city and then sum the quantity ordered. Like this:

df.groupby('City')['Quantity Ordered'].sum()

The output should look like this:

Image showing the total quantity ordered per city
But again, the consumer of this analysis may find it inconvenient to pinpoint the answer from the output. As before, let's create a chart for clarity.

# data
cities = [city.lstrip() for city, group in df.groupby('City')]  # city names, minus the leading space
quantity_ordered = df.groupby('City')['Quantity Ordered'].sum().values

# plot
fig, ax = plt.subplots(figsize=(10,6))
ax= plt.bar(cities, quantity_ordered)
plt.xticks(cities, rotation=90)
plt.title('Quantity Ordered by City', pad=20)
plt.ylabel('Quantity Ordered', labelpad=20)
plt.xlabel('City', labelpad=20)
plt.grid(False)

plt.savefig('plots/quantity_ordered_city.png', dpi=600, bbox_inches='tight')

plt.show()

The resulting chart looks as below:

Bar chart showing quantity ordered by city
From the chart, the tech store sold more products in San Francisco, CA than elsewhere.

Question 3: When should the store display ads if it wanted to maximize purchases?

What do you think we should do here? Well, the question is talking about time, meaning the column of focus is Order Date. Unlike previously when we extracted the month, this task involves extracting the hour. We do not need the minute because it plays an insignificant role in answering the question at hand.

The most interesting aspect of the Python programming community is the array of techniques developed by coders for tackling specific issues. For example, we used the pandas.Series.str.split method to extract the month in Question 1. Alternatively, we can use a pandas method, pandas.to_datetime, to convert the entries under Order Date to a datetime object then extract the hour. See below:

# convert 'Order Date' to datetime
df['Order Date'] = pd.to_datetime(df['Order Date'])

# extract the hour then assign it to 'Hour' column
df['Hour'] = df['Order Date'].dt.hour

Running df.head() shows a new column, Hour.

Image showing the dataset with the new 'Hour' column
With the Hour column created, we can proceed with examining the question. It is asking for the best time the tech store should display ads to maximize the likelihood of customers buying its products. How can we approach this task?

One approach could be to examine the product quantity ordered by hour. Examining the relationship between hour and quantity ordered might point us towards the right answer to this question. Let's create a plot of quantity ordered vs hour:

# data
hour = [str(h).zfill(2) for h, _ in df.groupby('Hour')]  # zero-pad single-digit hours
quantity_ordered = df.groupby('Hour')['Quantity Ordered'].sum().values

# plot
fig, ax = plt.subplots(figsize=(10, 6))

ax= plt.plot(hour, quantity_ordered)
plt.xticks(hour)
plt.title('Quantity Ordered by Hour', pad=20)
plt.ylabel('Quantity Ordered', labelpad=20)
plt.xlabel('Hour', labelpad=20)

plt.savefig('plots/quantity_ordered_hour.png', dpi=600, bbox_inches='tight')

plt.show()

The output of this code block appears as below:

Line chart showing quantity ordered by hour
The chart shows an interesting relationship -- the number of orders made increases steeply from six in the morning, peaking between 11am and 12 noon. The trend dips after noon and then establishes another peak at 7pm.

Normally, a business would want potential customers to see its products just before they make the purchase order. This is to say the tech store should display ads just before the number of orders made peaks, at 11am and 6pm.
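
To double-check those peaks programmatically, a short sketch on top of the same grouping pulls out the busiest hours:

# the two hours with the most units ordered
hourly_orders = df.groupby('Hour')['Quantity Ordered'].sum()
print(hourly_orders.nlargest(2))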

Question 4: What products should the store recommend to buyers to purchase as a group?

The best way to approach this question is to ask ourselves: how can the tech store know what products to recommend? Looking at the Order ID column, you'll notice several duplicate order IDs. In other words, several products share the same order ID, meaning they were ordered together. Online stores often bundle different products ordered together under a single order ID to keep track of sales.
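
A quick way to see those shared order IDs for yourself (a sketch; the exact rows shown will vary):

# rows that share an Order ID with at least one other row
df[df['Order ID'].duplicated(keep=False)].sort_values('Order ID').head(10)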

How can we use this information to solve the question? Well, we can group the products bought together under a new column called Grouped. Follow the code below to accomplish this:

# You may want to create a copy of the dataset before
# you proceed
new_df = df.copy()

# Slice the new dataset to keep only 
# the products with duplicated Order ID
new_df = new_df[new_df['Order ID'].duplicated(keep=False)]

# Store the grouped products into a column called _Grouped_
new_df['Grouped'] = new_df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))

On executing the code block, run new_df.head() and the output should look something like this:

Output for new_df.head()
We can now remove duplicated entries in the Grouped column. Also, we are only interested in the Order ID and Grouped columns, so select only those columns.

new_df = new_df[['Order ID', 'Grouped']].drop_duplicates()

The output for new_df.head() should be:

Image showing the output of new_df.head() after dropping duplicates
Now, let's find the most common products sold as a group. In the code block below, we determine the 10 most common three-product combinations.
Note: We'll import some new libraries:

from itertools import combinations
from collections import Counter

# initialize an empty Counter object
count = Counter() 

for row in new_df['Grouped']:
    # split the grouped products
    row_list = row.split(',')

    # the most common products sold as group of three
    count.update(Counter(combinations(row_list, 3))) 

# ten most common groups of three
for key, value in count.most_common(10): 
    print(key, value) 

The output should be:

Image showing the ten most common three-product combinations
The output indicates that the Google Phone, USB-C Charging Cable, and Wired Headphones were sold together 87 times. Thus, the store should recommend them to its customers as a bundle. Other notable combinations are the iPhone, Lightning Charging Cable, and Wired Headphones, bought as a group 62 times, and the iPhone, Lightning Charging Cable, and Apple Airpods Headphones, bought as a group 47 times.

Question 5: What product did the store sell the most over the 12 months of the data?

As we have done before, understanding this question requires us to determine the keywords, which are product and quantity ordered. Thus, we'll group our dataset by the Product column and then find the total quantity ordered per product as follows:

df.groupby('Product')['Quantity Ordered'].sum()

Running the code yields:

Image showing the total quantity ordered per product
Let's see how the output looks on a chart for better understanding of the insights it provides.

# data
products = [product for product, _ in df.groupby('Product')]
quantity_ordered = df.groupby('Product')['Quantity Ordered'].sum().values

# plot
fig, ax = plt.subplots(figsize=(12, 6))
ax = plt.bar(products, quantity_ordered)
plt.xticks(products, rotation=90)
plt.title('Number of Orders per Product', pad=20)
plt.ylabel('Quantity Ordered', labelpad=20)
plt.xlabel('Product', labelpad=20)
plt.grid(False)

plt.savefig('plots/quantity_ordered_product.png', dpi=600, bbox_inches='tight')

plt.show()

The output of the code block should look like the chart below:

quantity ordered per product
The chart indicates that AAA Batteries (4-pack) was the product the tech store sold the most over the 12 month period of the data.

Conclusion

Data analysis is the backbone of business analytics. As we have seen, Python is an important tool in a Data Analyst's arsenal. Most importantly, you must be able to present the insights from data in a way that readers will understand easily, using charts.
