DEV Community

Cover image for Stop Using Excel for Data Analytics: Upgrade to Python
Amanda Fawcett for Educative

Posted on • Edited on • Originally published at educative.io

Stop Using Excel for Data Analytics: Upgrade to Python

This article was written by Tyler Folkman and was originally publised on Educative. Tyler is the Head of AI at Branded Entertainment Network and author of Educative's course Python Data Analysis and Visualization. Tyler is passionate about data science and artificial intelligence. He contributes to the community in many ways, including being a top writer in artificial intelligence on Medium. Check out his blog Learning with Data.

In 2017, it was estimated that 750 million people worldwide used Excel. The population of the world in 2017 was about 7.6 billion. That means roughly 10% of the population was using Excel and I would guess mostly for data analytics. That is insane.

There is no doubt that Excel has been an incredibly important tool for companies and still has a place in the toolkit of every data analyst and scientist, but for most of your work, you need to stop using Excel and upgrade to Python. I’m going to show you why.

So, if you still have not taken the leap to learn Python and take your data analytics and visualization skills to the next level, I present you with 5 reasons why you need to learn Python right now. By the end, I’m confident you’ll be looking forward to replacing most of your Excel work with Python.

Today, we will go over:

Scale and Automation

Excel is great when you want to do quick, ad-hoc analysis with small data, but once you want to move to a larger scale, it just doesn’t work. Excel can support data up to 1,048,576 rows by 16,384 columns. On the other hand, Python can scale to the size of your memory and also has many tools that support out-of-memory computations.

For example, the Dask library allows you to scale your computations to run on a cluster of machines and not just your laptop. In fact, if you are familiar with Pandas, it's almost exactly the same code to read in a CSV:

import dask.dataframe as dd

# Load the data with Dask instead of Pandas.

df = dd.read_csv(<file_path>)
Enter fullscreen mode Exit fullscreen mode

One line of code and you are now reading in data larger than your computer’s memory. I’d show you how to do that in Excel, but it’s not even possible.

If this code looks foreign, check out my course that teaches you everything you need to know to get started with Pandas.

Also, Python can scale when it comes to multiple data sources. While Excel is both the data store and the computation engine, Python is completely data agnostic. If you can find a way to read your data into Python, you can use it. And since Python has so many great libraries, it is trivial to read in data from many sources such as CSV, Excel, JSON, and SQL databases.

Lastly, Python is an amazing programming language when it comes to automation. Since you can connect Python directly to any data source, it is easy to schedule a job that will re-pull your data with any updates, run your calculations, and even create a report or dynamic dashboard saving you tons of time. Excel, on the other hand, requires far too much manual labor and cannot automate updates.

Alt Text

Python can connect directly to your database for automated updates. Source: NobleDesktop

Reproducibility

Reproducibility is the concept that any analytics or visualizations you create should be easy and straightforward to reproduce for someone else. Not only does someone need to be able to re-run your process and end up with the same result, but they should also be able to walk through your steps to ensure accuracy. This concept is extremely important as you start relying on automation. Automation is amazing when working correctly, but when incorrect, automated reports can be a nightmare.

Reproducibility with Excel is very challenging. Excel calculations in cells are all but impossible to check at any type of scale. Data types are extremely confusing because what you see isn’t always what is represented in the raw data, and while VBA makes reproducibility slightly better, at that point, you’re much better off investing in learning Python.

Take a look at this Excel document:

Alt Text

You know the sum column should be the sum of a and b, but how do you confirm that? You could check one of the formulas and see that it is in fact the sum, but since every cell can be its own formula, what if they all are not correct? If you were not paying attention, you might have missed that row x was wrong.

But in Python, your sum would look like this:

a = [1,2,3,4]
b = [5,6,7,8]
sum = []
for i in range(a):
    sum.append(a[i] + b[i])
Enter fullscreen mode Exit fullscreen mode

This code is clear and easy to confirm that the sum is always calculated correctly.

With Python, you get all the tools developed to make reproducibility and collaboration better for software engineers. On top of that, Python is superior for data connectivity, allowing us to analyze data in the cloud and repeat a process instantly. Git, unit testing, documentation, and code formatting standards are all prevalent in the Python community. With Python 3, you can even add static typing to make your code even more clear. All of these tools make it easier to ensure your code is written well and correctly. So that the next time you look at your code or someone else picks it up, it's easy to reproduce and understand.

Keep the learning going.

Learn Python for Data Analytics and big data without scrubbing through videos or documentation. Educative's text-based courses are easy to skim and feature live coding environments, making learning quick and efficient.

Python Data Analysis and Visualization

Transferable Skills

If you know Excel, you know Excel. While the skills you learn are useful, they are not transferable to anything else. One of the things I like most about Python is that not only is it an amazing tool for data analytics and visualization, but it’s a solid programming language that can be used for so many other things.

Want to do machine learning or even deep learning? You can do that with Python. Want to build a website? Python can do that. Want to automate your smart home? Python can do that as well.

Also, Python is much closer to other programming languages than Excel. That makes it much easier to pick up other languages you might encounter along the way. Learning Python opens up far more doors than Excel ever could.

Lastly, the demand for Python is incredibly high. According to StackOverflow, in 2019, it was ranked as the world’s 4th most popular programming language among professional software developers as well as the first most wanted programming language. And Indeed says the average Python developer salary in the US in 2020 is $120K per year. Not bad.

Advanced capabilities

Excel has a lot of built-in formulas, but it pales in comparison with Python’s capabilities. Not only does Python offer hundreds of libraries to make advanced statistics and analytics easier, but it can also take your visualizations to another level. With tools like Matplotlib, Plotly, Streamlit, and Seaborn, you can create beautiful visualizations of data as well as interactive dashboards and plots.

Numpy and scipy have amazing support for scientific computations, linear algebra, and vectorized calculations. And scikit-learn lets you train machine learning algorithms from decision trees to gradient boosting machines. I think xkcd said it best:

Alt Text

Python is easy to learn

Given all the amazing benefits of Python over Excel, it must be hard to learn, right? No!
Check out this comparison of Hello World, the simplest program, in various languages:

Alt Text

Alt Text

Alt Text

Python is literally one line: print(“Hello World!”). It doesn’t get any simpler. Python is one of the easiest programming languages to pick up and has one of the most active communities, especially in the data analytics world. Python is one of the most intuitive programming languages out there, and it's even understandable for someone with little background in computer science! While the learning curve Excel might seem preferable, there is far less payoff on the other end. Python's learning curve is worth the time and effort in a way that Excel will never match due to its one-size-fits-all design.

Python is easy to use, and with a huge community of support, it's never been easier to learn. To help make it even easier for you to make the switch, I’ve developed a course that assumes no prior Python experience and can take you from zero to confidently being able to extract knowledge and answers from data.

To show you just how easy Python is, in the next section I will introduce you to some of the foundations for data analytics and visualizations in Python.

Get started with Python for data analytics

To get you started, I want to walk you through some basic commands and operations in Python that will be essential to your data analysis skills. Let's start with the foundations.

The first thing you will notice is that Python makes use of whitespace and does not use a use semicolon ; like other languages. Here is a very simple example:

Alt Text

Importing functionalities

We will make use of many libraries, some that are pre-installed with Python and some we will have to install ourselves. To get a library use an import statement:

from collections import Counter
Enter fullscreen mode Exit fullscreen mode

This command imports the class Counter from the collections library. Counter is a very useful tool for data scientists; it can count the number of times items appear in collections such as lists. For example, in the code below we will create a list of marriage ages. Using Counter we can quickly count the number of times each unique age appears.

Lists in Python

Lists are a useful data structure to store data. They will be studied in more detail during the next lesson. For example:

Alt Text

You can see that we created a list containing marriage ages using the [] at line 2. We then fed that list into the Counter function at line 4 to print out the most common values as a list of tuples, at line 5.

A tuple is a collection inside the (). These tuples contain two elements: the value and then the number of times that value appeared in your list. The frequency orders the list of tuples. The value with the most occurrence appears first.

Functions in Python

Functions are also useful. Functions in Python start with the keyword def and the function name followed by the inputs the function expects within brackets. Here is a function that takes in 2 inputs, x and y, and returns the sum:

Alt Text

Functions can also be anonymous, meaning that you don’t have to declare them with the above structure. Instead, you can use the lambda keyword. Here is the same function as above, but as an anonymous function:

Alt Text

Wrapping up

It's time to make the switch to Python. There are no more excuses! To become a data analyst, you'll need to cover the following concepts:

  • Python basics for analytics
  • Reading data
  • Describing data
  • Cleaning data
  • Visualizing data

Get started will all these concepts and more in my course Python Data Analysis and Visualization, which uses hands-on practice and code examples to propel your career in data analytics.

Go and take your skills to the next level with Python!

Top comments (2)

Collapse
 
aza profile image
Aza

Wonderful post, Amanda!

Unfortunately there is a deep and unconditional love for excel.
With some teams we reached only the level of OLAP cubes :) (I know...). Just because the data was still in excel, even though they haven't seen the calculations behind. Learning and shifting the gears takes time for many teams (especially non technical).

Collapse
 
fenchu profile image
fenchu

You do not need to stop using Excel. It is great at what it does. Creating dataframes (
Just save your spreadsheet as a .csv file with headers, or use Pandas read_excel).
Another issue is that python is slow. You may remedy this by using modules like pytorch and numpy with CUDA support and a decent Nvidia RTX2060 or higher card. This can speed up analysis 100 times. Excel is optimized for matrix computation and do this way faster than native python.