<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kevin White</title>
    <description>The latest articles on DEV Community by Kevin White (@kevindeepnote).</description>
    <link>https://dev.to/kevindeepnote</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1018541%2Fcdf6ae76-454b-440c-a6e4-6dc987ea569e.png</url>
      <title>DEV Community: Kevin White</title>
      <link>https://dev.to/kevindeepnote</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kevindeepnote"/>
    <language>en</language>
    <item>
      <title>How to speed up pandas with NumPy methods</title>
      <dc:creator>Kevin White</dc:creator>
      <pubDate>Tue, 28 Feb 2023 17:00:58 +0000</pubDate>
      <link>https://dev.to/deepnote/how-to-speed-up-pandas-with-numpy-methods-34pa</link>
      <guid>https://dev.to/deepnote/how-to-speed-up-pandas-with-numpy-methods-34pa</guid>
      <description>&lt;p&gt;You'd be hard-pressed to find a data scientist who doesn't use pandas for their day-to-day work, but sometimes it pays to go from pandas to NumPy.&lt;/p&gt;

&lt;p&gt;Pandas boasts great accessibility and a wide range of tools for small and medium-scale projects, but not all pandas methods are vectorized, so they miss out on the runtime savings vectorization provides. Take the &lt;code&gt;apply()&lt;/code&gt; method, for instance. It's essentially a glorified Python "for" loop, and it comes with severe speed constraints, especially on large data sets.&lt;/p&gt;

&lt;p&gt;On the bright side, you can speed up many pandas methods by pulling from NumPy. Let's look at how to speed up pandas with NumPy methods using a data notebook.&lt;/p&gt;

&lt;h2&gt;Use cases for using NumPy methods to speed up pandas&lt;/h2&gt;

&lt;p&gt;Runtime optimization is crucial in a business context. It directly impacts the performance of programs --- especially bigger, more complex ones.&lt;/p&gt;

&lt;p&gt;Just consider the importance of customer experience. If your company's application is powered by a program that runs faster and more efficiently, your end users are bound to be more satisfied. Hence, increased customer loyalty and lifetime value. On the other hand, a sluggish app runs the risk of sending customers to your competitors.&lt;/p&gt;

&lt;p&gt;There are also cost savings and scalability to consider. Optimizing the runtime of a program means it will require fewer resources to execute, helping to reduce costs. And as applications grow in size and complexity, runtime optimization becomes even more critical to making sure performance doesn't suffer.&lt;/p&gt;

&lt;p&gt;The issue with pandas is that although it supports vectorization, some of its operations don't use it. This is true for the &lt;code&gt;apply()&lt;/code&gt; method and for custom Python functions passed to &lt;code&gt;groupby()&lt;/code&gt;. In both cases, execution falls back to a native Python "for" loop, which slows pandas down.&lt;/p&gt;

&lt;p&gt;But NumPy can help improve the performance of pandas in several ways. For instance, if you're performing numerical operations, NumPy offers a suite of numerical functions, including element-wise operations and linear algebra. By using these functions instead of equivalent pandas methods, you can get a performance boost.&lt;/p&gt;

&lt;p&gt;Plus, since NumPy provides a high-performance array data structure that is optimized for numerical computations, you can often achieve faster computations when compared to pandas DataFrames. There's also the fact that NumPy arrays are stored in a contiguous block of memory, making them slightly more memory-efficient than DataFrames, which store data in a more complex structure and form.&lt;/p&gt;
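&lt;p&gt;As a rough sketch of the difference (using a hypothetical three-row DataFrame, not the article's data set), the same element-wise operation can run on a pandas Series or on the contiguous NumPy array underneath it, retrieved with &lt;code&gt;to_numpy()&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical mini-example: apply a 10% markup to a column of prices
df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# Element-wise arithmetic on the pandas Series
series_result = df["price"] * 1.1

# The same operation on the underlying contiguous NumPy array
array_result = df["price"].to_numpy() * 1.1
```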

&lt;h2&gt;Using NumPy methods to speed up pandas in a data notebook&lt;/h2&gt;

&lt;p&gt;Pandas and NumPy work in both interactive Python environments and standalone Python scripts, so you can apply these runtime optimizations in either setting.&lt;/p&gt;

&lt;p&gt;For this tutorial, we'll explore how to go from pandas to NumPy methods in a Python data notebook.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;👉 Keep in mind that you can skip the setup process below if you're using a Deepnote notebook. Deepnote comes complete with the most popular &lt;a href="https://deepnote.com/docs/pre-installed-packages"&gt;data analytics and machine learning Python libraries&lt;/a&gt; already installed, so you can seamlessly import them into your project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;Installing pandas &amp;amp; NumPy&lt;/h3&gt;

&lt;p&gt;To install pandas and NumPy in a Jupyter notebook, run the following command:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zotQS9gH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/18enskv7cqfev4xdosjy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zotQS9gH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/18enskv7cqfev4xdosjy.png" alt="Image description" width="880" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Importing pandas &amp;amp; NumPy&lt;/h3&gt;

&lt;p&gt;Next, to use their methods, we must import them. The commonly used alias for pandas is &lt;code&gt;pd&lt;/code&gt; and for NumPy it's &lt;code&gt;np&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mfRS_oov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ptml461aek2jfkkjid9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mfRS_oov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ptml461aek2jfkkjid9i.png" alt="Image description" width="880" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Pandas vs. NumPy runtime comparison&lt;/h2&gt;

&lt;p&gt;First, let's create a dummy data set for experimentation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AGRurUoX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/goc4nsx075iezz3t2f1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AGRurUoX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/goc4nsx075iezz3t2f1e.png" alt="Image description" width="880" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, let's create a NumPy array with two columns and 10⁸ rows. From this array, we create a pandas DataFrame using the &lt;code&gt;pd.DataFrame&lt;/code&gt; constructor.&lt;/p&gt;

&lt;p&gt;The first five rows are shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yw_Bztoj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xq2y5pa2vurkf1kealy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yw_Bztoj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xq2y5pa2vurkf1kealy1.png" alt="Image description" width="880" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Pandas apply() vs. NumPy where() for conditional columns&lt;/h3&gt;

&lt;p&gt;We mostly use the &lt;code&gt;apply()&lt;/code&gt; method to create conditional columns in pandas. The implementation is shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WFPraJsD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hao2bmz8sa61aehaguux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WFPraJsD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hao2bmz8sa61aehaguux.png" alt="Image description" width="880" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We label each entry "Class A" if it is less than 45 and "Class B" otherwise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9kD6Lujo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gan3v63tw80b206us8ym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9kD6Lujo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gan3v63tw80b206us8ym.png" alt="Image description" width="880" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As discussed above, the &lt;code&gt;apply()&lt;/code&gt; method is basically a glorified "for" loop. As a result, it misses the whole point of vectorization.&lt;/p&gt;

&lt;p&gt;Using NumPy's &lt;a href="https://numpy.org/doc/stable/reference/generated/numpy.where.html"&gt;&lt;code&gt;where()&lt;/code&gt;&lt;/a&gt; method, we can create conditional columns while also having vectorization in place.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yVUwYkNt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zawb2btyuvpf9geqbspg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yVUwYkNt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zawb2btyuvpf9geqbspg.png" alt="Image description" width="880" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this method, the condition is passed as the first argument, followed by the value to use where the condition is True (second argument) and where it is False (third argument). As you can see above, the &lt;code&gt;np.where()&lt;/code&gt; method is approximately five times faster.&lt;/p&gt;
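&lt;p&gt;Here is a minimal sketch of the two approaches side by side, assuming a placeholder column name and the article's threshold of 45:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.uniform(1, 100, size=(10_000, 2)), columns=["col1", "col2"])

# apply(): a Python-level loop over every entry of the column
df["class_apply"] = df["col1"].apply(lambda x: "Class A" if x < 45 else "Class B")

# np.where(): condition, value where True, value where False
df["class_where"] = np.where(df["col1"] < 45, "Class A", "Class B")
```

Both columns come out identical; the difference is purely in how the work is executed under the hood.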

&lt;h3&gt;Pandas apply() vs. NumPy select() for conditional columns&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;np.where()&lt;/code&gt; method allows us to choose between two results (i.e., it's for binary conditions). If there are multiple conditions, &lt;a href="https://numpy.org/doc/stable/reference/generated/numpy.select.html"&gt;&lt;code&gt;np.select()&lt;/code&gt;&lt;/a&gt; should be used.&lt;/p&gt;

&lt;p&gt;Suppose we have the following function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S9BfJmir--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bnyf6rcq32uxksdenybf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S9BfJmir--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bnyf6rcq32uxksdenybf.png" alt="Image description" width="880" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the &lt;code&gt;apply()&lt;/code&gt; method, we get the following runtime:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b-WDntF9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tktrczqsf9g6l0mrwrxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b-WDntF9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tktrczqsf9g6l0mrwrxf.png" alt="Image description" width="880" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But if we use &lt;code&gt;np.select()&lt;/code&gt;, we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j178oax6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ybn9q5fm9bmnn42j53dd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j178oax6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ybn9q5fm9bmnn42j53dd.png" alt="Image description" width="880" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The conditions and the corresponding results are passed as the first two arguments. The last argument is the default result.&lt;/p&gt;

&lt;p&gt;You can see that the &lt;code&gt;np.select()&lt;/code&gt; method is five times faster than the &lt;code&gt;apply()&lt;/code&gt; method. This is because it's using vectorization, while the &lt;code&gt;apply()&lt;/code&gt; method is looping through the column to create a new column based on the conditions.&lt;/p&gt;
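&lt;p&gt;As a sketch of the &lt;code&gt;np.select()&lt;/code&gt; pattern (the article's exact function is only shown as a screenshot, so the thresholds and labels below are hypothetical):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.uniform(1, 100, size=(10_000, 2)), columns=["col1", "col2"])

# Hypothetical multi-way banding; conditions are evaluated in order,
# and the first matching condition wins
conditions = [df["col1"] < 25, df["col1"] < 50, df["col1"] < 75]
choices = ["Class A", "Class B", "Class C"]

# Rows matching no condition fall through to the default
df["class"] = np.select(conditions, choices, default="Class D")
```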

&lt;h3&gt;Pandas sort_values() vs. NumPy sort()&lt;/h3&gt;

&lt;p&gt;Sorting is a common operation to arrange data in a specific order. In pandas, we use the &lt;code&gt;df.sort_values()&lt;/code&gt; method to sort a DataFrame.&lt;/p&gt;

&lt;p&gt;With the earlier DataFrame, we get the following runtime:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zCsYpkpx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lp9scleen6r2689qzy28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zCsYpkpx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lp9scleen6r2689qzy28.png" alt="Image description" width="880" height="128"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But NumPy also provides a sorting method. More specifically, it's the &lt;code&gt;np.sort()&lt;/code&gt; method:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--npWC_qr4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xmklh1xv7iln3n2z4idl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--npWC_qr4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xmklh1xv7iln3n2z4idl.png" alt="Image description" width="880" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using NumPy, we first convert the column we want to sort into a NumPy array. Next, we sort the obtained array. Lastly, we create a new DataFrame from the sorted array and reorder the entries of the other column to match.&lt;/p&gt;

&lt;p&gt;The NumPy approach is roughly 33% faster than pandas.&lt;/p&gt;
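&lt;p&gt;One way to implement the steps described above is with &lt;code&gt;argsort()&lt;/code&gt;, which yields the sorting order so the second column can be reordered to match (column names here are placeholders):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.uniform(1, 100, size=(10_000, 2)), columns=["col1", "col2"])

# Pandas route
sorted_pd = df.sort_values("col1")

# NumPy route: argsort the target column, then reorder both columns
order = df["col1"].to_numpy().argsort()
sorted_np = pd.DataFrame({
    "col1": df["col1"].to_numpy()[order],
    "col2": df["col2"].to_numpy()[order],
})
```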

&lt;h2&gt;Best practices for runtime optimization with pandas &amp;amp; NumPy&lt;/h2&gt;

&lt;p&gt;Optimization is important, but it's also important to proceed with caution.&lt;/p&gt;

&lt;p&gt;It's always a good practice to profile your code first so you're aware of the bottlenecks. If you optimize your code before profiling it, you may end up optimizing the wrong parts, making it even slower.&lt;/p&gt;
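&lt;p&gt;A minimal way to profile a suspected hotspot is the standard library's &lt;code&gt;timeit&lt;/code&gt; module (in a notebook, you'd typically reach for the &lt;code&gt;%timeit&lt;/code&gt; magic instead). The data and threshold below are placeholders:&lt;/p&gt;

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.default_rng(seed=0).uniform(1, 100, size=(10_000, 2)),
    columns=["col1", "col2"],
)

# Measure before optimizing: is apply() actually a bottleneck here?
t_apply = timeit.timeit(
    lambda: df["col1"].apply(lambda x: "A" if x < 45 else "B"), number=20
)
t_where = timeit.timeit(lambda: np.where(df["col1"] < 45, "A", "B"), number=20)

print(f"apply(): {t_apply:.4f}s   np.where(): {t_where:.4f}s")
```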

&lt;p&gt;The biggest bottlenecks in pandas usually come from looping through a DataFrame. Thus, look for parts that can be replaced with vectorization.&lt;/p&gt;

&lt;p&gt;Additionally, getting support from external libraries can offer many benefits as well. They're often optimized for performance and can be faster than custom implementations.&lt;/p&gt;

&lt;p&gt;Lastly, ensure that you don't compromise the readability of your code for optimization. Optimizing your code can sometimes make it less readable, which makes it harder for other people to understand and maintain it.&lt;/p&gt;

&lt;p&gt;In general, you should only move forward with optimization if you have evidence from profiling that your code is slow and that optimization will lead to a significant performance boost.&lt;/p&gt;

&lt;h2&gt;Combine pandas &amp;amp; NumPy with Deepnote&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://deepnote.com/sign-up"&gt;Get started for free&lt;/a&gt; to explore, collaborate on, and share your data.&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>numpy</category>
      <category>python</category>
      <category>jupyter</category>
    </item>
    <item>
      <title>The good, the bad, &amp; the ugly: how to share Jupyter notebooks</title>
      <dc:creator>Kevin White</dc:creator>
      <pubDate>Wed, 22 Feb 2023 13:53:18 +0000</pubDate>
      <link>https://dev.to/deepnote/the-good-the-bad-the-ugly-how-to-share-jupyter-notebooks-k6d</link>
      <guid>https://dev.to/deepnote/the-good-the-bad-the-ugly-how-to-share-jupyter-notebooks-k6d</guid>
      <description>&lt;p&gt;Why do data teams need to share Jupyter notebooks? Let us count the ways.&lt;/p&gt;

&lt;p&gt;Code reviews, team presentations, delivering insights to non-programmers, general hotdogging (we all need to show off now and then) --- the list goes on and on.&lt;/p&gt;

&lt;p&gt;But &lt;a href="https://deepnote.com/blog/jupyter-notebook-alternative"&gt;sharing a Jupyter notebook is problematic&lt;/a&gt; (i.e., painful). You end up jumping through annoying, time-consuming hoops just to share the results of your analysis. And the workarounds you use often defeat the purpose of sharing your notebook in the first place.&lt;/p&gt;

&lt;p&gt;As with most things in data work, there's a good, bad, and ugly way of getting it done. Let's look at the different ways you can share a Jupyter notebook and weigh the advantages and disadvantages of each approach.&lt;/p&gt;

&lt;h2&gt;What it means to share&lt;/h2&gt;

&lt;p&gt;Before we start comparing and contrasting different Jupyter notebook sharing options, let's be clear on what we mean by "share." There's a whole spectrum of sharing, ranging from "I literally just want your eyeballs on this output" to "Go ahead and edit my code."&lt;/p&gt;

&lt;p&gt;Sometimes you want people to run your notebook. Maybe you even want them to be able to toggle a few dropdown menus so they can actually explore the data. Other times it's more of a look-but-don't-touch situation.&lt;/p&gt;

&lt;p&gt;At the end of the day, we share to collaborate. And successful data collaboration is dependent on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Speed (our ability to share work quickly)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reproducibility (our ability to duplicate it)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactivity (our ability to work together on it)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are the criteria we'll use to evaluate different sharing options. So, without further ado, here are your choices for sharing a data notebook.&lt;/p&gt;

&lt;h2&gt;Ugly: the file option&lt;/h2&gt;

&lt;p&gt;Downloading and emailing IPYNB files --- the bane of many a data professional's existence.&lt;/p&gt;

&lt;p&gt;Since Jupyter files run on your local machine, you can't simply send someone a link to your notebook. Instead, you have to download the file (which takes way too long) and send it off to a teammate so they can fire it up on their machine (which also takes way too long).&lt;/p&gt;

&lt;p&gt;Unfortunately, none of your assets are included. Maybe you add your database password to help speed things along (the exact kind of behavior that keeps your security team up at night), but now your colleague has to set up their environment from scratch. Only after the necessary Python packages have been installed and the environment configuration is complete can they run your notebook.&lt;/p&gt;

&lt;p&gt;It's not exactly what you'd call high-speed --- but it's better than exporting your notebook as a static file or copying and pasting screenshots. With those options, reproducibility and interactivity go right out the window.&lt;/p&gt;

&lt;p&gt;Say you need to share your notebook with a non-technical stakeholder --- they're not interested in running it, just getting to the insights you uncovered. You download it as a PDF and send it over. Then they spot a mistake that needs to be fixed. Or they have a follow-up request not covered in your analysis. Or your results lead to another question that needs to be answered.&lt;/p&gt;

&lt;p&gt;No matter the issue, you'll have to go back to your notebook, rerun it, re-export it, resend it, and repeat until the stakeholder is satisfied. On and on and on. You (and your teammates) get stuck in a never-ending loop of busy work that makes exploring and collaborating on data a grind.&lt;/p&gt;

&lt;h2&gt;Bad: the view option&lt;/h2&gt;

&lt;p&gt;The challenge of sharing a Jupyter notebook is nothing new --- that's why there's a cottage industry built around making it easier to view them.&lt;/p&gt;

&lt;p&gt;GitHub repositories are a great way to organize static data notebooks and make them accessible to teammates, but therein lies the rub: They're static. Notebooks are rendered on GitHub, not run. Reproducing the work, collaborating in real time, commenting --- none of it is an option. Ditto for tools like nbviewer.&lt;/p&gt;

&lt;p&gt;There's also Binder, which claims to help you "turn a Git repo into a collection of interactive notebooks." But "interactive" is a stretch. Once you get past the time it takes to load a repository and the custom libraries you need, you realize notebooks are displayed in isolated environments.&lt;/p&gt;

&lt;p&gt;After you close that URL, the notebook goes poof. You may feel like you're sharing the real deal, but as soon as you want to actually do anything collaborative, it instantly falls apart. Tools like this are good for quickly reproducing a notebook, but interactivity is ultimately an illusion.&lt;/p&gt;

&lt;h2&gt;Good: the cloud option&lt;/h2&gt;

&lt;p&gt;Cloud-based technology allows you to share fully executable notebooks with a link --- no muss, no fuss.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gPashRIj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://media.graphassets.com/UgtSogJwQ6DMPU1BFOJt" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gPashRIj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://media.graphassets.com/UgtSogJwQ6DMPU1BFOJt" alt="invite-members.png" title="invite-members.png" width="880" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isn't the same as JupyterHub, which is a DIY option that requires organizations to install and manage their own Jupyter notebook servers (i.e., you have to manage everything and deal with computing power and storage limits). Cloud-based notebooks are hosted for you, giving you an easy and scalable way to quickly share and reproduce projects.&lt;/p&gt;

&lt;p&gt;But keep in mind that not all cloud-based notebooks are created equal --- the ability to easily share your notebook doesn't always mean you can collaborate on it. Take Google Colab, for instance. You and your teammates can't share the same execution environment simultaneously or leave comments for each other. Each time a person makes an edit and saves the notebook, it reverts to their copy and destroys whatever their colleagues were working on.&lt;/p&gt;

&lt;p&gt;Then there's the question of permissions. Not all cloud-based notebooks have the same level of granularity when it comes to who can access notebooks and how they can use them (e.g., run a notebook but not alter the code).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I05W8QD0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://media.graphassets.com/aazt9Q1rTF2OakFwOOsq" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I05W8QD0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://media.graphassets.com/aazt9Q1rTF2OakFwOOsq" alt="access-controls.png" title="access-controls.png" width="880" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your best bet is a cloud-based data notebook that's truly collaborative by design (hint: That's what we made).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DCEJyZYk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://media.graphassets.com/poNLmd1T0Em2TRK0sOAJ" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DCEJyZYk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://media.graphassets.com/poNLmd1T0Em2TRK0sOAJ" alt="sql_python.png" title="sql_python.png" width="880" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Sharing the same environment with collaborators at the same time, complete with database connections and environment configuration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Editing code with collaborators in real time and leaving comments for each other&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assigning granular access levels to collaborators, from view-only to full code access and everything in between&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Giving collaborators a shared workspace where they can easily store, organize, and find their teammates' notebooks to view, work on, or duplicate a project&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Publishing shareable notebooks as articles, dashboards, and interactive apps with just a click to make sharing insights with stakeholders that much easier&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sharing data notebooks seems like such a simple task, but the truth is it's as complicated as any machine learning model. Combining shared environments and access controls --- and spinning it all up in a package that's fast and accessible to people of all technical levels --- is not easy. Some sharing options do a passable imitation, but the devil's in the details.&lt;/p&gt;

&lt;p&gt;Data collaboration is a computational puzzle every team has to solve on its own. Next time you need to share a Jupyter notebook, consider a solution that's built for real teamwork.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;Simplify sharing Jupyter notebooks with Deepnote&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://deepnote.com/sign-in"&gt;Get started for free&lt;/a&gt; to see how easy it is to share and collaborate on data notebooks.&lt;/p&gt;

</description>
      <category>jupyter</category>
      <category>notebooks</category>
      <category>collaboration</category>
    </item>
    <item>
      <title>3 data smells that mean you need a Jupyter notebook alternative</title>
      <dc:creator>Kevin White</dc:creator>
      <pubDate>Mon, 06 Feb 2023 18:39:24 +0000</pubDate>
      <link>https://dev.to/deepnote/3-data-smells-that-mean-you-need-a-jupyter-notebook-alternative-4llj</link>
      <guid>https://dev.to/deepnote/3-data-smells-that-mean-you-need-a-jupyter-notebook-alternative-4llj</guid>
      <description>&lt;p&gt;Jupyter notebooks are one of the world's most popular interactive computing tools (approximately &lt;a href="https://github.com/parente/nbestimate/blob/master/estimate.ipynb" rel="noopener noreferrer"&gt;10 million Jupyter notebooks on GitHub&lt;/a&gt; and counting), but that hasn't stopped businesses from seeking out Jupyter notebook alternatives.&lt;/p&gt;

&lt;p&gt;Why? Because despite Jupyter's omnipresence in the data science community, it simply isn't always the right tool for the job --- especially for data teams.&lt;/p&gt;

&lt;p&gt;We're all familiar with code smells, those pesky warning signs that all is not right with your source code. Let's look at three common circumstances --- what we call data smells --- that indicate it's time to seek out a replacement for your Jupyter notebook.&lt;/p&gt;

&lt;h2&gt;1. Your team wastes valuable time trying to reproduce analyses&lt;/h2&gt;

&lt;p&gt;The results of your analysis don't count if they can't be replicated, but that's not easy with a Jupyter notebook.&lt;/p&gt;

&lt;p&gt;Essential elements of your analysis, such as the environment and required files, are typically bound to your local computer. For your team to reliably reproduce it, they'll need everything --- not just the notebook file itself.&lt;/p&gt;

&lt;p&gt;Team members will have to tediously configure their environments to match yours and make sure they have exactly the same assets if they want to reproduce your analysis.&lt;/p&gt;

&lt;p&gt;Ditto for getting access to the same data sources. That process usually goes something like this: Your teammate hunts down the proper credentials, searches online for how to connect their notebook to a specific data source, reads through the documentation, installs the necessary Python packages, works through the command line, and so on.&lt;/p&gt;

&lt;p&gt;And when it's time for a colleague to duplicate the same setup? Back to step one. Those tight feedback loops you're after remain out of reach with out-of-the-box Jupyter.&lt;/p&gt;

&lt;p&gt;But that's not the case with today's cloud-based notebooks. These Jupyter notebook alternatives allow data teams to share the same execution environment, file system, and data connection simultaneously for real-time collaboration.&lt;/p&gt;

&lt;p&gt;The ability to &lt;a href="https://deepnote.com/blog/how-webflow-optimizes-data-workflows" rel="noopener noreferrer"&gt;run reproducible environments in the cloud&lt;/a&gt; is enough to turn skeptics into advocates.&lt;/p&gt;

&lt;p&gt;"I was personally skeptical about the performance of a collaborative, hosted platform, but when I used Deepnote to collaborate with a colleague and we were able to remotely troubleshoot and try different plots, I immediately saw its value," said Webflow's Senior Manager of Data Science &amp;amp; Analytics Allie Russell.&lt;/p&gt;

&lt;p&gt;Team members can connect to any data source, work in their browser, and&lt;a href="https://deepnote.com/customers/gusto" rel="noopener noreferrer"&gt; share connections and environments&lt;/a&gt; with teammates who have been granted the proper user permissions. It's as easy as adding your credentials or API keys, which are then encrypted and securely stored. What was once a frustratingly convoluted process now only takes a few clicks.&lt;/p&gt;

&lt;p&gt;"We'd had a lot of technical issues when trying to pair up on other notebooks during remote interviews," said Gusto's Product Analytics Lead Becca Carter. "Deepnote was incredibly easy to set up and allows us to start new notebooks in seconds."&lt;/p&gt;

&lt;h2&gt;2. Your team resorts to screenshots and PDFs to communicate findings&lt;/h2&gt;

&lt;p&gt;Data teams are tasked with uncovering business insights, validating them with their peers, and making them actionable for business stakeholders.&lt;/p&gt;

&lt;p&gt;In other words: Data collaboration is critical.&lt;/p&gt;

&lt;p&gt;But sharing insights isn't so straightforward with Jupyter notebooks. You're usually forced to grab screenshots and port them to a document or download notebook files as static PDFs. These quickly become outdated, and the whole process must be repeated any time there's a follow-up question or data refresh.&lt;/p&gt;

&lt;p&gt;But modern data notebooks can be shared via email invitation or simply by sharing a link --- no different from Google Docs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fp7Xx6G6eTdWv8yKvSreP" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fp7Xx6G6eTdWv8yKvSreP" title="share-project.png" alt="share-project.png" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This helps&lt;a href="https://deepnote.com/customers/vantai" rel="noopener noreferrer"&gt; support faster, more collaborative iteration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;"Working in Deepnote is like code review and rapid prototyping at the same time, saving valuable time in the iteration cycles," said VantAI CTO Luca Naef. "But as opposed to code review via GitHub, you have direct access to the runtime and program state, which makes understanding complex models much easier and leads to much more spontaneous creative ideas."&lt;/p&gt;

&lt;p&gt;And when it's time to share your analysis with non-technical teammates, you can &lt;a href="https://deepnote.com/explore" rel="noopener noreferrer"&gt;publish notebooks as code-free articles or interactive applications&lt;/a&gt;. You and your teammates can tag one another and leave each other comments to collaborate in real time or asynchronously.&lt;/p&gt;

&lt;p&gt;It's this ability to better&lt;a href="https://deepnote.com/blog/how-slido-brings-a-product-mindset-to-analytics-engineering" rel="noopener noreferrer"&gt; facilitate collaboration with business stakeholders&lt;/a&gt; that drives many companies to look beyond Jupyter notebooks.&lt;/p&gt;

&lt;p&gt;"Since metrics require a lot of input from subject matter experts, data consumers, and business stakeholders to define and align on definitions, we needed a collaborative layer where we could get immediate feedback," said Slido's Head of Analytics Engineering Michal Koláček.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Your team struggles to work together on interrelated projects
&lt;/h2&gt;

&lt;p&gt;Data notebooks are synonymous with &lt;a href="https://deepnote.com/blog/exploratory-programming" rel="noopener noreferrer"&gt;exploratory programming&lt;/a&gt;, but quick experiments and prototypes shouldn't be used once and then forgotten. Projects often grow over time, changing hands and evolving alongside the needs of the business. This makes discoverability key.&lt;/p&gt;

&lt;p&gt;But Jupyter notebooks create silos. There's no single place where your analysis is stored, organized, version-controlled, and made easily accessible to the company.&lt;/p&gt;

&lt;p&gt;This is another area where modern notebooks excel. Teams can create workspaces where data science and analytics teams can share analysis (and narratives) with both technical team members and business stakeholders. Notebooks, articles, and applications can all be organized into a customizable folder structure that's tailored to how different teams and companies operate.&lt;/p&gt;

&lt;p&gt;These workspaces act as searchable databases that scale as teams and their projects grow, complete with granular permissions that dictate who can access which projects and what actions they're allowed to take, from viewing to commenting to editing. A project's entire history can be tracked and reviewed, and older versions of notebooks can be previewed and instantly restored.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2F2HxdAqrXRoaUW1Ov87WJ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2F2HxdAqrXRoaUW1Ov87WJ" title="version-history.png" alt="version-history.png" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This allows data teams to ensure sensitive information stays secure while democratizing data access for both technical and non-technical teams. That's why one Deepnote customer --- a hedge fund with a large data science team --- sought out a Jupyter notebook alternative that would&lt;a href="https://deepnote.com/customers/hedge-fund" rel="noopener noreferrer"&gt; act as a living knowledge base&lt;/a&gt;, not a siloed tool.&lt;/p&gt;

&lt;p&gt;"Notebooks are often used as a quick prototyping tool, but we don't want to create one-off work," the customer said. "We want to invest in ideas that compound over time. Deepnote gives our team one place to create, store ideas, and build on top of the work of others. Visibility goes up over time."&lt;/p&gt;

&lt;p&gt;These data smells aren't deal-breakers for everyone. Plenty of Jupyter notebook users aren't operating in a traditional business environment. Perhaps your work is strictly solo, which makes collaboration challenges a moot point. Or maybe you see the technical limitations as a fair tradeoff for a free tool.&lt;/p&gt;

&lt;p&gt;But every busted workflow and lost second counts for business users. If you work on a team where time is of the essence and collaboration is the norm, consider these smells your cue to find a Jupyter notebook alternative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore Deepnote as a Jupyter notebook alternative
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://deepnote.com/sign-up" rel="noopener noreferrer"&gt;Get started for free&lt;/a&gt; to see how Deepnote compares to a Jupyter notebook.&lt;/p&gt;

</description>
      <category>announcement</category>
      <category>devto</category>
      <category>web3</category>
      <category>blockchain</category>
    </item>
    <item>
      <title>How to query pandas DataFrames with SQL</title>
      <dc:creator>Kevin White</dc:creator>
      <pubDate>Wed, 01 Feb 2023 20:53:51 +0000</pubDate>
      <link>https://dev.to/deepnote/how-to-query-pandas-dataframes-with-sql-2m3o</link>
      <guid>https://dev.to/deepnote/how-to-query-pandas-dataframes-with-sql-2m3o</guid>
      <description>&lt;p&gt;&lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt; is a go-to tool for tabular data management, processing, and analysis in Python, but sometimes you may want to go from pandas to SQL.&lt;/p&gt;

&lt;p&gt;Why? Perhaps you find pandas' syntax intimidating and less intuitive than SQL, which is more beginner-friendly. Or maybe you miss some of the functionalities SQL offers.&lt;/p&gt;

&lt;p&gt;The good news is you can work in Python and still use SQL on a tabular pandas DataFrame.&lt;/p&gt;

&lt;p&gt;Let's look at how to query a pandas DataFrame with SQL using either a Jupyter notebook or Deepnote.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases for using SQL with pandas
&lt;/h2&gt;

&lt;p&gt;Pandas comes with many complex tabular data operations. And, since it exists in a Python environment, it can be coupled with lots of other powerful libraries, such as &lt;a href="https://pypi.org/project/requests/" rel="noopener noreferrer"&gt;Requests&lt;/a&gt; (for connecting to other APIs), &lt;a href="https://matplotlib.org/" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt; (for plotting data), &lt;a href="https://keras.io/" rel="noopener noreferrer"&gt;Keras&lt;/a&gt; (for training machine learning models), and many more.&lt;/p&gt;

&lt;p&gt;But when it comes to talking to databases, it's not always as powerful as SQL.&lt;/p&gt;

&lt;p&gt;For example, an enterprise may have a massive database where many tables need to be joined together based on different conditions. Conditional (non-equi) joins like these aren't directly supported by pandas' &lt;code&gt;merge()&lt;/code&gt;; you'd typically have to cross-join and filter instead.&lt;/p&gt;

&lt;p&gt;Additionally, SQL allows you to enforce different data constraints --- such as unique, not null, primary key, etc. --- to make sure your data is consistent and accurate. You can't do that with pandas.&lt;/p&gt;
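&lt;p&gt;As a quick sketch of this point (using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and a hypothetical &lt;code&gt;users&lt;/code&gt; table), the database itself rejects any row that violates a declared constraint, something pandas won't enforce for you:&lt;/p&gt;

```python
import sqlite3

# Hypothetical users table with NOT NULL and UNIQUE constraints
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    )
""")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")

try:
    # A duplicate email violates the UNIQUE constraint
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

&lt;p&gt;The second insert raises &lt;code&gt;sqlite3.IntegrityError&lt;/code&gt;, so inconsistent data never reaches the table.&lt;/p&gt;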

&lt;p&gt;And if you need to perform advanced aggregations of data (e.g., calculating running totals, moving averages, and percentiles), it can be challenging to write solutions for these in pandas.&lt;/p&gt;
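&lt;p&gt;For instance, a running total is a one-line window function in SQL. A minimal sketch, assuming a small made-up DataFrame copied into an in-memory SQLite database:&lt;/p&gt;

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"colA": ["A", "B", "A"], "colB": [1, 3, 5]})
conn = sqlite3.connect(":memory:")
df.to_sql("data", conn, index=False)

# SUM(...) OVER (ORDER BY ...) computes the running total
running = pd.read_sql(
    "SELECT colA, colB, "
    "SUM(colB) OVER (ORDER BY colB) AS running_total "
    "FROM data ORDER BY colB",
    conn,
)
print(running["running_total"].tolist())  # [1, 4, 9]
```

&lt;p&gt;The pandas equivalent here is a simple &lt;code&gt;cumsum()&lt;/code&gt;, but more elaborate windows (partitioned moving averages, percentiles) get verbose in pandas quickly.&lt;/p&gt;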

&lt;h2&gt;
  
  
  Using SQLAlchemy to query pandas DataFrames in a Jupyter notebook
&lt;/h2&gt;

&lt;p&gt;There are multiple ways to run SQL queries in a Jupyter notebook, but this tutorial will focus on using &lt;a href="https://www.sqlalchemy.org/" rel="noopener noreferrer"&gt;SQLAlchemy&lt;/a&gt; --- a Python library that provides an API for connecting to and interacting with different relational databases, including SQLite, MySQL, and PostgreSQL.&lt;/p&gt;

&lt;p&gt;Using SQLAlchemy, you can retrieve, manipulate, and analyze large sets of data using SQL syntax directly from a Jupyter notebook.&lt;/p&gt;

&lt;p&gt;Other options include the &lt;a href="https://pypi.org/project/pandasql/" rel="noopener noreferrer"&gt;PandaSQL&lt;/a&gt; library, but it's lacking in recent updates and &lt;a href="https://towardsdatascience.com/the-downsides-of-pandasql-that-no-one-talks-about-9b63c664bef4" rel="noopener noreferrer"&gt;has some significant run-time issues&lt;/a&gt; that make it less than ideal. There's also the &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html" rel="noopener noreferrer"&gt;.query()&lt;/a&gt; method, but this is mostly limited to filtering DataFrames and doesn't provide the exact SQL syntax you may want.&lt;/p&gt;
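&lt;p&gt;To see that limitation, here's a minimal sketch of &lt;code&gt;.query()&lt;/code&gt; on a made-up DataFrame: it covers SQL-style row filtering, but little beyond that:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"colA": ["A", "B", "A"], "colB": [1, 3, 5]})

# .query() handles simple SQL-like filters...
subset = df.query("colA == 'A' and colB > 1")
print(subset["colB"].tolist())  # [5]

# ...but joins, aggregations, and window functions still
# require regular pandas method calls.
```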

&lt;p&gt;Compared to these options, SQLAlchemy is more straightforward and intuitive. The workflow below runs in an IPython environment, so all you need to get started is a Jupyter notebook.&lt;/p&gt;

&lt;h3&gt;
  
  
Installing SQLAlchemy
&lt;/h3&gt;

&lt;p&gt;To install SQLAlchemy, plus the &lt;code&gt;ipython-sql&lt;/code&gt; package that provides the &lt;code&gt;%sql&lt;/code&gt; magic used below, run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install sqlalchemy ipython-sql&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
Importing SQLAlchemy
&lt;/h3&gt;

&lt;p&gt;To connect to a SQLite database, run the following commands inside Jupyter cells:&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 1: Import SQLAlchemy
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;import sqlalchemy&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 2: Load the SQL extension
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;%load_ext sql&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
Step 3: Create a SQLite engine
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;engine = sqlalchemy.create_engine('sqlite:///mydatabase.db')&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 4: Connect to the SQLite database
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;%sql sqlite:///mydatabase.db&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;
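&lt;p&gt;Outside of IPython, the same four steps can be sketched with plain SQLAlchemy calls and no &lt;code&gt;%sql&lt;/code&gt; magic. Note that &lt;code&gt;sqlite://&lt;/code&gt; below is a truly in-memory database, whereas the &lt;code&gt;sqlite:///mydatabase.db&lt;/code&gt; URL above persists to a file:&lt;/p&gt;

```python
from sqlalchemy import create_engine, text

# "sqlite://" creates an in-memory SQLite database;
# "sqlite:///mydatabase.db" would persist to a file instead
engine = create_engine("sqlite://")

with engine.connect() as conn:
    greeting = conn.execute(text("SELECT 'hello' AS greeting")).scalar()
print(greeting)  # hello
```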
&lt;h3&gt;
  
  
  Running queries with SQLAlchemy
&lt;/h3&gt;

&lt;p&gt;After connecting to the database, you should store your data as tables. To do this, first create a dummy DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.DataFrame([["A",1,2],
                   ["B",3,4],
                   ["A",5,6],
                   ["C",7,8],
                   ["A",9,10]], columns = ["colA", "colB", "colC"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, store the DataFrame in the SQLite database:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df.to_sql('data', con=engine, if_exists='replace')&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
In the statement above, we created the table &lt;code&gt;data&lt;/code&gt; in the SQLite database. The &lt;code&gt;if_exists='replace'&lt;/code&gt; argument drops and recreates the table if it already exists.&lt;/p&gt;

&lt;p&gt;Lastly, we can query the &lt;code&gt;data&lt;/code&gt; table using SQL as follows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;results = %sql SELECT * FROM data WHERE colA = 'A';&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
If we view the results, we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(results)
+-------+------+------+------+
| index | colA | colB | colC |
+-------+------+------+------+
|   0   |  A   |  1   |  2   |
|   2   |  A   |  5   |  6   |
|   4   |  A   |  9   |  10  |
+-------+------+------+------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also convert the results to a pandas DataFrame as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;results.DataFrame()

  index colA colB colC
0   0     A    1    2
1   2     A    5    6
2   4     A    9    10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
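&lt;p&gt;Alternatively, pandas' &lt;code&gt;read_sql()&lt;/code&gt; function runs the query and returns a DataFrame in a single step. A sketch that recreates the setup above with an in-memory engine:&lt;/p&gt;

```python
import pandas as pd
from sqlalchemy import create_engine

# Recreate the tutorial's table with an in-memory engine
engine = create_engine("sqlite://")
df = pd.DataFrame([["A", 1, 2], ["B", 3, 4], ["A", 5, 6]],
                  columns=["colA", "colB", "colC"])
df.to_sql("data", con=engine, if_exists="replace")

# read_sql returns a DataFrame directly, no %sql magic needed
out = pd.read_sql("SELECT * FROM data WHERE colA = 'A'", con=engine)
print(out["colB"].tolist())  # [1, 5]
```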



&lt;h2&gt;
  
  
  Using Deepnote to query pandas DataFrames with SQL
&lt;/h2&gt;

&lt;p&gt;Deepnote comes complete with SQL support for pandas DataFrames --- no downloading, installing, or importing required.&lt;/p&gt;

&lt;p&gt;To start querying a pandas DataFrame using SQL, create a DataFrame as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5ec9oqx5435kw4e4eix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5ec9oqx5435kw4e4eix.png" alt="Image description" width="800" height="224"&gt;&lt;/a&gt;&lt;br&gt;
Then create a SQL block:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FWlmwf7sBRcC02NSoGgCL" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FWlmwf7sBRcC02NSoGgCL" title="sql block create.gif" alt="sql block create.gif" width="1020" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can write any SQL query:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02mec2c3ciy47ro0wir6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02mec2c3ciy47ro0wir6.png" alt="Image description" width="800" height="391"&gt;&lt;/a&gt;&lt;br&gt;
Similar to storing the results in a variable in a Jupyter notebook, you can store the results in Deepnote as shown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2F1hP7hegvT2qoUpBPdOpo" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2F1hP7hegvT2qoUpBPdOpo" title="df variable.gif" alt="df variable.gif" width="944" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since Deepnote uses &lt;a href="https://github.com/sripathikrishnan/jinjasql" rel="noopener noreferrer"&gt;jinjasql&lt;/a&gt; templating, you can pass Python variables, functions, and control structures (e.g., "if" statements and "for" loops) into your SQL queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best practices for using SQL with pandas
&lt;/h2&gt;

&lt;p&gt;Using SQL with pandas can make data analysis easier, but there are a few caveats.&lt;/p&gt;

&lt;p&gt;Switching to SQL may mean missing out on optimization techniques that could have been applied in Python. And remember that copying a DataFrame into a database with &lt;code&gt;to_sql()&lt;/code&gt; duplicates your data, and an in-memory SQLite database holds the entire data set in RAM, so make sure you have sufficient memory.&lt;/p&gt;

&lt;p&gt;Moreover, unlike pandas, which infers the data types by itself, SQL requires explicit specification when creating new tables. To make sure your data is stored and retrieved correctly, it's important to use the appropriate data types for the columns in your DataFrame.&lt;/p&gt;
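&lt;p&gt;With &lt;code&gt;to_sql()&lt;/code&gt;, you can pass explicit SQLAlchemy types through its &lt;code&gt;dtype&lt;/code&gt; argument instead of relying on inference. A minimal sketch with made-up columns:&lt;/p&gt;

```python
import pandas as pd
import sqlalchemy
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # in-memory database for this sketch
df = pd.DataFrame({"colA": ["A", "B"], "colB": [1.5, 2.5]})

# Map each column to an explicit SQL type rather than letting
# pandas infer one from the DataFrame's dtypes
df.to_sql("data", con=engine, if_exists="replace", index=False,
          dtype={"colA": sqlalchemy.types.Text(),
                 "colB": sqlalchemy.types.Float()})

print(pd.read_sql("SELECT * FROM data", con=engine).shape)  # (2, 2)
```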

&lt;p&gt;Lastly, keep in mind that SQL is a powerhouse for databases, not tabular DataFrames sitting in a Python environment. If you can complete your tasks using pandas, you may want to avoid leveraging SQL unnecessarily (unless it offers some major run-time benefits).&lt;/p&gt;

&lt;p&gt;So there you have it --- you're ready to query pandas DataFrames with SQL.&lt;/p&gt;

&lt;p&gt;As this tutorial demonstrated, using SQL with a Jupyter notebook requires multiple steps, which can be quite time-consuming. If you want to integrate SQL with an interactive Python environment minus the extra dependencies and effort, give Deepnote a try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combine pandas &amp;amp; SQL with Deepnote
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://deepnote.com/sign-up" rel="noopener noreferrer"&gt;Get started for free&lt;/a&gt; to explore, collaborate on, and share your data.&lt;/p&gt;

</description>
      <category>blockchain</category>
      <category>web3</category>
      <category>crypto</category>
      <category>offers</category>
    </item>
  </channel>
</rss>
