DEV Community

Sydney Andre

Introduction to Data Analysis with Python

In the world of big business, decisions are rarely left to someone's gut feeling. When making plans for the future, most business owners and executives want evidence to inform their decisions. This is where data analysis comes into the picture. In this article we will examine, at a high level, what the process of analyzing data looks like and explore how Python libraries like NumPy, Pandas, and Matplotlib enable us to follow this process.

What is Data Analysis?

Data analysis is the process of transforming and organizing collected data to provide insights into its relationships and patterns. Professionals can then make informed decisions based on evidence from the data. Data analysis is a tool to aid professionals in their decision making, but it should not be the only factor involved. Interrogating the data's limitations is just as important as discovering the patterns and relationships the data reveals. When handling a large amount of data, it is important to follow a process to maintain focus on the primary purpose of the analysis.

Data analysis includes the following steps:
1. Ask/Specify Data Requirements
In this step we ask ourselves or the intended audience (stakeholders, supervisors, etc.): What question are we trying to answer by analyzing this data set? What problem are we solving, or what upcoming event are we planning for? Answering these questions keeps the analysis focused and defines its scope. If we are trying to forecast how much of product X we should order for November, we probably are not very concerned with the sales of product Y in July.

2. Prepare or Collect Data
Here, the questions are: How are we currently collecting data? Does it include the data points we need to answer the questions in the scope of this analysis? If not, what are the gaps and how can we fill them? These questions allow us to take stock of the information at our disposal and decide whether it will be enough to meet the needs of our scope. If not, we will need to add steps to get the information we need, and knowing this early on can help avoid future headaches.

3. Clean and Process Data
In this step we think about how the data needs to be organized to best meet our needs. We ask ourselves: Are there any potential errors, corruptions, or duplications that could skew or misrepresent the data? If so, how do we account for these and clean them? How do we store this data so it is easily accessible and manageable? Keep in mind that your analysis is only as good as your raw data, so you must be aware of the integrity of your data to ensure the integrity of your findings.

4. Analyze
Now for the fun part! Here is where you will be looking for those relationships and patterns we were talking about earlier. What is the data showing you? Where are there meaningful relationships between variables? What trends are revealing themselves? This is where we are trying to answer our question from step one by using data science and computing tools.

5. Interpret
Now that the data has been collected, organized, and then transformed into insightful information, we can interpret our findings. Did you answer the original question? Were there limitations that the analysis was not able to account for? Do those limitations affect your results? This line of questioning is important in understanding how we can report on or make decisions based on the results.

6. Act/Report
Lastly, it is time to either make a plan based on your results or provide stakeholders with the information. At this point, you must turn your results into something that is easily digestible using visualization tools and summarization.
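
As a rough illustration of the cleaning questions in step 3, the sketch below uses Pandas (introduced in the next section) to remove duplicated rows, flag unparseable dates, and drop incomplete records. The data and the column names ORDER_ID, ORDERDATE, and SALES are made up for this example, and whether dropping bad rows is the right choice depends on your own data.

```python
import pandas as pd

# Hypothetical raw sales records; the values and column names are assumptions.
raw = pd.DataFrame({
    "ORDER_ID": [1, 2, 2, 3, 4],
    "ORDERDATE": ["2022-11-01", "2022-11-03", "2022-11-03", "not a date", "2022-12-05"],
    "SALES": [120.0, 80.5, 80.5, 45.0, None],
})

# Remove exact duplicate rows (order 2 was recorded twice).
clean = raw.drop_duplicates()

# Parse the dates; values that cannot be parsed become NaT instead of raising an error.
clean = clean.assign(ORDERDATE=pd.to_datetime(clean["ORDERDATE"], errors="coerce"))

# Drop rows that are missing a usable date or a sales amount.
clean = clean.dropna(subset=["ORDERDATE", "SALES"]).reset_index(drop=True)

print(clean)
```

Only the two complete, unique orders survive here. In a real analysis you would also record how many rows were removed and why, since that is one of the limitations to revisit in step 5.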

Python Libraries

In this article we will be discussing three Python libraries that streamline data analysis by providing structures and operations to manipulate the raw data.

NumPy is a general-purpose package for processing multidimensional arrays, and it provides operations for working with these array structures. In machine learning, for example, NumPy can be used to represent images as arrays for deep learning. More broadly, it is the base for many other scientific computing libraries in Python, which use its functionality to improve their own performance and capabilities.
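
To make the idea of array operations concrete, here is a minimal sketch with made-up numbers. The variable names and values are assumptions chosen for illustration only.

```python
import numpy as np

# A 2 x 3 array: rows are products, columns are months (made-up numbers).
units_sold = np.array([
    [10, 12, 30],
    [5, 7, 9],
])

print(units_sold.shape)  # (2, 3)

# One price per product, stored as a column of shape (2, 1).
unit_price = np.array([[2.0], [10.0]])

# Element-wise multiplication with no explicit loop; NumPy broadcasts
# each product's price across that product's row.
revenue = units_sold * unit_price

# Aggregate along an axis: axis=0 collapses the rows, giving one total per month.
revenue_per_month = revenue.sum(axis=0)
print(revenue_per_month)
```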

Pandas is one of these libraries built on top of NumPy. It is used to work with relational data and offers multiple structures and operations to transform numerical data and time series.
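
As a small sketch of those structures, the example below builds a Series indexed by date, converts it to a DataFrame, and totals it by month. The dates and amounts are invented for illustration.

```python
import pandas as pd

# Made-up daily sales indexed by date (a small time series).
dates = pd.to_datetime(["2022-10-30", "2022-10-31", "2022-11-01", "2022-11-02"])
sales = pd.Series([100.0, 150.0, 300.0, 250.0], index=dates, name="SALES")

# A Series is a single labeled column; a DataFrame is a table of labeled columns.
df = sales.to_frame()

# Group the rows by calendar month and total each group.
monthly = df.groupby(df.index.month)["SALES"].sum()
print(monthly)
```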

Matplotlib is a data visualization tool built on top of NumPy that makes it simple to create line plots, bar charts, scatter plots, and other graphs.

Using Python to Analyze Sales Data

Now that we understand a bit more about the process of analyzing data and the libraries that help us do that, let's look at an example. Your company is looking to order more products for the holidays. You think you may need to order more than usual, but are unsure how much more. Let's follow the steps outlined above to see how we should plan this upcoming product order.

  1. The question we are trying to answer through analyzing this data is: Based on sales from previous years, should we order more product than usual in November? This means we will want to group our data by month and year and get the sum of the sales and number of products sold.

  2. Now, we need to review the data we have access to. Is there somewhere we have access to the raw sales data? If so, let's move on to the next step.

  3. What state is this data in? Does it need to be cleaned of duplications or errors before analyzing it? If not, let's get to the fun part.

  4. Now that we have clean, relevant data, let's use Pandas and Matplotlib to analyze and interpret our data.

#import libraries
import pandas as pd
import matplotlib.pyplot as plt

#create a data frame from .csv data
#set the index to the column you will use to select rows
dfYear = pd.read_csv("/file/path/to/sales_data_sample.csv", encoding= 'unicode_escape', index_col="YEAR_ID")

#make sure the date field in your data is converted to a readable date 
dfYear['ORDERDATE']=pd.to_datetime(dfYear['ORDERDATE'])

#extract the data for each year
year2021 = dfYear.loc[2021]
year2022 = dfYear.loc[2022]

#group each year by month and get the sum of all sales for each month
byMonth2021 = year2021.groupby([year2021['ORDERDATE'].dt.month]).agg({
  'SALES': 'sum'
})
byMonth2022 = year2022.groupby([year2022['ORDERDATE'].dt.month]).agg({
  'SALES': 'sum'
})

#plot each table
plt.plot(byMonth2021.index, byMonth2021['SALES'], label=2021)
plt.plot(byMonth2022.index, byMonth2022['SALES'], label=2022)

#plot styling
#change the format of the y axis to represent money
#get the axes that the lines above were drawn on
ax = plt.gca()
def formatter(x, pos):
    return '$' + format(x, ',')
ax.yaxis.set_major_formatter(formatter)

#make sure all months are displayed on the x axis
plt.xticks(byMonth2021.index)

#label your axes 
plt.ylabel('Sales')
plt.xlabel('Months')

#title your graph
plt.title('Very Important Business Sales by Month')

#make sure your line labels appear in the legend
plt.legend()

#display the plot
plt.show()

Run python <file name> in your terminal to see your graph.


Output
[Figure: line graph of sales by month for 2021 and 2022]

  5. Based on this graph, we can see that sales are much higher in November than in the rest of the year, but what does this tell us about how much product we need to order? One might assume we need to order far more product for November. That may be true, but total sales for the month may not be the best data point for deciding how much product to order. Higher sales could also mean that people simply bought more expensive items in November.
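
One way to tell those two explanations apart is to total the number of units sold alongside sales, as step 1 anticipated. The sketch below is a minimal illustration on made-up numbers. The QUANTITYORDERED column name and all of the values are assumptions, not taken from the article's data file.

```python
import pandas as pd

# Made-up order lines; QUANTITYORDERED is an assumed column name.
orders = pd.DataFrame({
    "ORDERDATE": pd.to_datetime(["2022-10-05", "2022-10-20", "2022-11-03", "2022-11-18"]),
    "SALES": [200.0, 300.0, 900.0, 600.0],
    "QUANTITYORDERED": [20, 30, 30, 30],
})

# Total both revenue and units for each month.
by_month = orders.groupby(orders["ORDERDATE"].dt.month).agg(
    sales=("SALES", "sum"),
    units=("QUANTITYORDERED", "sum"),
)

# Average revenue per unit shows whether a sales spike comes from volume or from price.
by_month["sales_per_unit"] = by_month["sales"] / by_month["units"]
print(by_month)
```

In this invented data, November sales are three times October's while units sold rise by only 20 percent, so ordering three times the usual stock would overshoot. Real data may tell a different story, which is why the unit count is worth checking.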

Conclusion

Although we looked at a fairly simple example, understanding how to use data analysis tools effectively is an incredibly important skill that many businesses need. Libraries like Pandas and Matplotlib make transforming data that much easier.

Sources
Data Analysis General: https://www.geeksforgeeks.org/what-is-data-analysis/?ref=lbp
Pandas General: https://www.geeksforgeeks.org/introduction-to-pandas-in-python/?ref=lbp
Extracting rows in Pandas: https://www.geeksforgeeks.org/python-pandas-extracting-rows-using-loc/?ref=lbp
Matplotlib General: https://www.geeksforgeeks.org/matplotlib-tutorial/?ref=lbp#line
Matplotlib styling: https://www.w3resource.com/graphics/matplotlib/basic/matplotlib-basic-exercise-6.php
