Kevin White for Deepnote

Originally published at deepnote.com

How to query pandas DataFrames with SQL

Pandas is a go-to tool for tabular data management, processing, and analysis in Python, but sometimes you may want to go from pandas to SQL.

Why? Perhaps you find pandas' syntax intimidating and less intuitive than SQL, which is more beginner-friendly. Or maybe you miss some of the functionalities SQL offers.

The good news is you can work in Python and still use SQL on a tabular pandas DataFrame.

Let's look at how to query a pandas DataFrame with SQL using either a Jupyter notebook or Deepnote.

Use cases for using SQL with pandas

Pandas comes with many complex tabular data operations. And, since it exists in a Python environment, it can be coupled with lots of other powerful libraries, such as Requests (for connecting to other APIs), Matplotlib (for plotting data), Keras (for training machine learning models), and many more.

But when it comes to talking to databases, it's not always as powerful as SQL.

For example, an enterprise company may have a massive database where many tables need to be joined together based on different conditions. pandas' merge() only matches rows on equality keys, so conditional (non-equi) joins require clunky workarounds.
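
For illustration, here's roughly what that looks like. The orders and discount_tiers tables/DataFrames below are invented for this sketch and aren't part of the tutorial's data:

import pandas as pd

# Hypothetical data, purely for illustration
orders = pd.DataFrame({"order_id": [1, 2], "amount": [40, 250]})
discount_tiers = pd.DataFrame({
    "min_amount": [0, 100],
    "max_amount": [99, 999],
    "discount":   [0.0, 0.1],
})

# In SQL, the range condition lives in the join itself:
#   SELECT o.order_id, t.discount
#   FROM orders o
#   JOIN discount_tiers t
#     ON o.amount BETWEEN t.min_amount AND t.max_amount;

# In pandas, merge() only matches on equality, so the usual workaround
# is a cross merge followed by a filter
matched = orders.merge(discount_tiers, how="cross")
in_range = (matched["amount"] >= matched["min_amount"]) & (matched["amount"] <= matched["max_amount"])
print(matched[in_range][["order_id", "discount"]])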

Additionally, SQL allows you to enforce different data constraints --- such as unique, not null, primary key, etc. --- to make sure your data is consistent and accurate. You can't do that with pandas.

And if you need to perform advanced aggregations of data (e.g., calculating running totals, moving averages, and percentiles), it can be challenging to write solutions for these in pandas.
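
As a rough sketch (the sales DataFrame below is invented for illustration), a per-group running total is a single window function in SQL, while the pandas version takes a sort, a groupby, and a cumulative sum:

import pandas as pd

# Hypothetical sales data, purely for illustration
sales = pd.DataFrame({
    "region":   ["East", "East", "West", "West"],
    "order_id": [1, 2, 3, 4],
    "amount":   [100, 50, 200, 25],
})

# In SQL, roughly:
#   SELECT region, order_id, amount,
#          SUM(amount) OVER (PARTITION BY region ORDER BY order_id) AS running_total
#   FROM sales;

# The pandas equivalent: sort, group, then take a cumulative sum
sales["running_total"] = (
    sales.sort_values("order_id")
         .groupby("region")["amount"]
         .cumsum()
)
print(sales)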

Using SQLAlchemy to query pandas DataFrames in a Jupyter notebook

There are multiple ways to run SQL queries in a Jupyter notebook, but this tutorial will focus on using SQLAlchemy --- a Python library that provides an API for connecting to and interacting with different relational databases, including SQLite, MySQL, and PostgreSQL.

Using SQLAlchemy, you can retrieve, manipulate, and analyze large sets of data using SQL syntax directly from a Jupyter notebook.

Other options include the PandaSQL library, but it hasn't seen recent updates and has some significant run-time issues that make it less than ideal. There's also pandas' built-in .query() method, but it's mostly limited to filtering DataFrames and doesn't give you full SQL syntax.
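
For comparison, here's roughly what .query() covers; anything beyond row filtering (joins, aggregations, window functions) still has to be written with regular pandas methods:

import pandas as pd

df = pd.DataFrame({"colA": ["A", "B", "A"], "colB": [1, 3, 5]})

# .query() accepts a SQL-ish filtering expression...
filtered = df.query("colA == 'A' and colB > 1")
print(filtered)

# ...but there is no equivalent for JOINs, GROUP BY, or window functions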

Compared to these options, SQLAlchemy is more straightforward and intuitive. Combined with the ipython-sql extension's %sql magic, all you need to get started is a Jupyter notebook.

Installing SQLAlchemy and ipython-sql

The %load_ext sql and %sql magics used below come from the ipython-sql extension, so install it alongside SQLAlchemy:

pip install sqlalchemy ipython-sql

Importing SQLAlchemy and connecting to SQLite

To connect to a SQLite database (stored as a local file named mydatabase.db; SQLite also supports purely in-memory databases via sqlite:///:memory:), run the following commands inside Jupyter cells:

Step 1: Import SQLAlchemy

import sqlalchemy

Step 2: Load the SQL extension

%load_ext sql

Step 3: Create a SQLAlchemy engine for the SQLite database

engine = sqlalchemy.create_engine('sqlite:///mydatabase.db')

Step 4: Connect to the SQLite database

%sql sqlite:///mydatabase.db

Running queries with SQLAlchemy

After connecting to the database, you need some data stored in it as tables. To do this, first create a dummy DataFrame:

import pandas as pd

df = pd.DataFrame([["A",1,2],
                   ["B",3,4],
                   ["A",5,6],
                   ["C",7,8],
                   ["A",9,10]], columns = ["colA", "colB", "colC"])

Next, store the DataFrame in the SQLite database:

df.to_sql('data', con=engine, if_exists='replace')

In the above statement, we created a table named data in the SQLite database. The if_exists='replace' argument drops and recreates the table if it already exists.
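
If you want to double-check that the table was written, one option (a quick sketch using SQLAlchemy's inspector) is:

from sqlalchemy import inspect

# List the tables currently present in the SQLite database
print(inspect(engine).get_table_names())  # expected: ['data']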

Lastly, we can query the data table using SQL as follows:

results = %sql SELECT * FROM data WHERE colA = 'A';

If we view the results, we get:

print(results)
+-------+------+------+------+
| index | colA | colB | colC |
+-------+------+------+------+
|   0   |  A   |  1   |  2   |
|   2   |  A   |  5   |  6   |
|   4   |  A   |  9   |  10  |
+-------+------+------+------+

We can also convert the results to a pandas DataFrame as follows:

results.DataFrame()

  index colA colB colC
0   0     A    1    2
1   2     A    5    6
2   4     A    9    10
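
If you prefer to skip the %sql magic entirely, the same query can go through pandas and the SQLAlchemy engine directly (equivalent to the steps above):

# Query the SQLite table straight into a DataFrame via the engine
results_df = pd.read_sql("SELECT * FROM data WHERE colA = 'A'", con=engine)
print(results_df)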

Using Deepnote to query pandas DataFrames with SQL

Deepnote comes complete with SQL support for pandas DataFrames --- no downloading, installing, or importing required.

To start querying a pandas DataFrame using SQL, create a DataFrame as follows:

[Image: creating a pandas DataFrame in a Python block]

Then create a SQL block:

[GIF: creating a SQL block in Deepnote]

You can write any SQL query:

[Image: writing a SQL query in a SQL block]

Just as you stored the query results in a variable in the Jupyter notebook, you can store a SQL block's results in a DataFrame variable in Deepnote:

[GIF: storing the SQL block's results in a DataFrame variable]

Since Deepnote uses jinjasql templating, you can pass Python variables, functions, and control structures (e.g., "if" statements and "for" loops) into your SQL queries.
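
For example, assuming you defined a Python variable such as category = "A" in a code block, a SQL block could reference it along these lines (a sketch of the templating syntax; df and category are the names from this example, and control structures such as {% if %} ... {% endif %} follow the same Jinja style):

SELECT *
FROM df
WHERE colA = {{ category }}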

Best practices for using SQL with pandas

Using SQL with pandas can make data analysis easier, but there are a few caveats.

Switching to SQL may mean missing out on different optimization techniques that could have been applied using Python. And with the workflow shown above, query results are pulled into your notebook's memory, so make sure you have enough memory for the result sets you bring back.
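
If a result set might be too big to hold comfortably, one mitigation (a sketch reusing the engine from the Jupyter example; the chunk size and process function are placeholders) is to stream it in chunks with pandas:

# Stream the query result in chunks instead of loading everything at once
for chunk in pd.read_sql("SELECT * FROM data", con=engine, chunksize=50_000):
    process(chunk)  # hypothetical placeholder for your own processing logic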

Moreover, unlike pandas, which infers data types by itself, SQL tables have explicitly declared column types. To make sure your data is stored and retrieved correctly, it's important to specify appropriate SQL types for the columns in your DataFrame rather than relying on inference.
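
For example, to_sql accepts a dtype argument that maps columns to SQLAlchemy types, so you can pin the types explicitly instead of relying on inference (a sketch reusing the df and engine from earlier):

from sqlalchemy.types import Integer, Text

# Explicitly declare SQL column types instead of letting them be inferred
df.to_sql(
    "data",
    con=engine,
    if_exists="replace",
    dtype={"colA": Text(), "colB": Integer(), "colC": Integer()},
)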

Lastly, keep in mind that SQL is a powerhouse for databases, not tabular DataFrames sitting in a Python environment. If you can complete your tasks using pandas, you may want to avoid leveraging SQL unnecessarily (unless it offers some major run-time benefits).

So there you have it --- you're ready to query pandas DataFrames with SQL.

As this tutorial demonstrated, using SQL with a Jupyter notebook requires multiple steps, which can be quite time-consuming. If you want to integrate SQL with an interactive Python environment minus the extra dependencies and effort, give Deepnote a try.

Combine pandas & SQL with Deepnote

Get started for free to explore, collaborate on, and share your data.
