Pandas DataFrames are my favorite way to manipulate data in Python. In fact, the end product of many of my small analytics projects is just a data frame containing my results.
I used to dump my dataframes to CSV files and save them to GitHub. But recently, I've been using Beneath, a data sharing service I'm building, to save my dataframes and simultaneously turn them into a full-blown API with a website. It's great when I need to hand off a dataset to clients or integrate the data into a frontend.
In this post, I'll show you how that works! I'm going to fetch GitHub commits, analyze them, and use Beneath to turn the result into an API.
Setup Beneath
To get started, you need to install the Beneath pip module and log in with a free Beneath account. It's pretty easy and the docs already cover it. Just follow these steps.
Make sure to remember your username as you'll need it in a minute!
Let's analyze some data
I think GitHub activity is a fascinating, underexplored data source. Let's scratch the surface and look at commits to... Pandas! Here's a quick script to fetch the pandas source code and aggregate some daily stats on the number of commits and contributors:
import io
import pandas as pd
import subprocess
# Get all Pandas commit timestamps
repo = "pandas-dev/pandas"
cmd = f"""
if [ -d "repo" ]; then rm -Rf "repo"; fi;
git clone https://github.com/{repo}.git repo;
cd repo;
echo "timestamp,contributor";
git log --pretty=format:"%ad,%ae" --date=iso
"""
res = subprocess.run(cmd, capture_output=True, shell=True).stdout.decode()
# Group by day and count number of commits and contributors
df = (
pd.read_csv(
io.StringIO(res),
parse_dates=["timestamp"],
date_parser=lambda col: pd.to_datetime(col, utc=True),
)
.resample(rule="d", on="timestamp")["contributor"]
.agg(commits="count", contributors="nunique")
.rename_axis("day")
.reset_index()
)
Now, the df variable contains our insights. If you're following along, you can change the repo variable to scrape another GitHub project. Just beware that some major repos can take a long time to analyze (I'm looking at you, torvalds/linux).
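If you want to sanity-check the aggregation logic without cloning a large repo, you can run the same pipeline on a tiny synthetic commit log. This is just a sketch with made-up timestamps and email addresses; it also parses the dates up front rather than via read_csv, which behaves the same:

```python
import io
import pandas as pd

# A tiny synthetic commit log: two commits by one author on May 1,
# one commit by another author on May 2 (made-up data)
csv = (
    "timestamp,contributor\n"
    "2021-05-01 10:00:00+00:00,a@example.com\n"
    "2021-05-01 12:00:00+00:00,a@example.com\n"
    "2021-05-02 09:00:00+00:00,b@example.com\n"
)

raw = pd.read_csv(io.StringIO(csv))
raw["timestamp"] = pd.to_datetime(raw["timestamp"], utc=True)

# Same daily aggregation as the script above
df = (
    raw.resample(rule="d", on="timestamp")["contributor"]
    .agg(commits="count", contributors="nunique")
    .rename_axis("day")
    .reset_index()
)

print(df)
```

You should see two rows: May 1 with 2 commits from 1 contributor, and May 2 with 1 commit from 1 contributor.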
Save the DataFrame to Beneath
First, we'll create a new project to store our results. I'll do that from the command-line, but you can also use the web console:
beneath project create USERNAME/github-fun
Just replace USERNAME with your own username.
Now, we're ready to publish the dataframe. We do it with a simple one-liner directly in Python (well, I split it over multiple lines, but it's still just one call):
import beneath
await beneath.write_full(
table_path="USERNAME/github-fun/pandas-commits",
records=df,
key=["day"],
description="Daily commits to https://github.com/pandas-dev/pandas",
)
There are a few things going on here. Let's go through them:
- The table_path gives the full path for the output table, including our username and project.
- We use the records parameter to pass our DataFrame.
- We provide a key for the data. The auto-generated API uses the key to index the data so we can quickly filter records. By default, Beneath will use our DataFrame's index as the key, but I prefer setting it manually.
- The description parameter adds some documentation to the dataset that will be shown at the top of the table's page.
And that's it! Now let's explore the results.
Explore your data
You can now head over to the web console and browse the data and its API docs. Mine's at https://beneath.dev/epg/github-fun/table:pandas-commits (if you used the same project and table names, you can just replace my username, epg, with your own).
You can also share or publish the data. Permissions are managed at the project layer, so just head over to the project page and add members or flip the project settings to public.
Use the API
Now that the data is in Beneath, anyone with access can use the API. On the "API" tab of the table page, we get auto-generated code snippets for integrating the dataset.
For example, we can load the dataframe back into Python:
import beneath
df = await beneath.load_full("USERNAME/github-fun/pandas-commits")
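Once loaded, it behaves like any other pandas dataframe. As a quick sketch (with made-up numbers standing in for the real table), here's how you might find the busiest day:

```python
import pandas as pd

# A stand-in for the loaded dataframe: same columns as the table
# above, but with made-up numbers for illustration
df = pd.DataFrame({
    "day": pd.to_datetime(["2021-05-01", "2021-05-02", "2021-05-03"], utc=True),
    "commits": [12, 30, 7],
    "contributors": [5, 9, 3],
})

# Find the day with the most commits
busiest = df.loc[df["commits"].idxmax()]
print(busiest["day"].date(), busiest["commits"])  # 2021-05-02 30
```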
Or we can query the REST API to get the commit stats for each day in May 2021:
curl https://data.beneath.dev/v1/USERNAME/github-fun/pandas-commits \
-d type=index \
-d filter='{"day":{"_gte":"2021-05-01","_lt":"2021-06-01"}}' \
-G
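If curl isn't handy, the same request can be built as a plain URL from Python. This sketch uses only the standard library and just constructs the query string (it doesn't send the request); the endpoint and filter mirror the curl snippet above, where -G moves the -d pairs into the query string:

```python
import json
from urllib.parse import urlencode

# JSON-encode the index filter, then URL-encode the query parameters
filter_ = {"day": {"_gte": "2021-05-01", "_lt": "2021-06-01"}}
params = urlencode({"type": "index", "filter": json.dumps(filter_)})
url = f"https://data.beneath.dev/v1/USERNAME/github-fun/pandas-commits?{params}"
print(url)
```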
Or use the React hook to read data directly into the frontend:
import { useRecords } from "beneath-react";
const App = () => {
const { records, loading, error } = useRecords({
table: "USERNAME/github-fun/pandas-commits",
query: {
type: "index",
filter: '{"day":{"_gte":"2021-05-01","_lt":"2021-06-01"}}'
}
})
...
}
Check out the API tab of my dataframe in the Beneath console to see all the ways to use the data.
That's it
That's it! We used Beneath to turn a Pandas DataFrame into an API. If you have any questions, I'm online most of the time in Beneath's Discord (I love to chat about data science, so you're also welcome to just say hi 👋). And let me know if you publish a cool dataset that I can spotlight in the featured projects!
Top comments (7)
Hi Eric... (First, sorry for my English.)
I followed the example I found in the git repo (github.com/beneath-hq/beneath/blob...), and when I run it, I see this error:
SyntaxError: 'await' outside function
Can you tell me if I'm doing something wrong? Thanks in advance.
Hey Emanuel,
Sorry about that! We'll add some info to the docs today to make this more clear. I know you figured it out, but I'll leave this comment here for others to find.
The keyword "await" runs an asyncio coroutine (described in depth here: realpython.com/async-io-python/), and there are a few options for executing these:
- In a standard Python script, as you mentioned, you have to wrap each coroutine in an async function, and start the execution with asyncio.run(). This is a good resource: docs.python.org/3/library/asyncio-...
- In a Jupyter notebook, which I was using for the blog post and the quick start, it's actually much easier. An asyncio context is automatically included, and you can directly run any await ... code block.
- You can spawn an asyncio-enabled Python REPL by running python -m asyncio in a command line. In this case too, you can directly run any await ... code block.
Let me know if anything else comes up and I'd be happy to help!
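To illustrate the first option, here's a minimal standalone-script skeleton; asyncio.sleep stands in for an awaitable Beneath call:

```python
import asyncio

# Wrap the await in an async function and start it with asyncio.run()
async def main():
    # In a real script this would be e.g. df = await beneath.load_full(...)
    await asyncio.sleep(0)
    return "done"

result = asyncio.run(main())
print(result)  # done
```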
I guess you need to update the docs... all beneath functions must be called with asyncio.run(). For example, to read a stream and load it into a pandas dataframe, the code is:
df = asyncio.run(beneath.load_full("USERNAME/financial-reference-data/s-and-p-500-constituents"))
print(df)
At last, this works for me. Regards Eric, Beneath is a great solution; I will try to use it with Streamlit.
Hey Emanuel (and anyone else following along here!),
I know await/async (asyncio) can be pretty confusing if you haven't used it before – it's perfectly normal to use Python without it, but we think it's such an awesome feature that we wanted to use it for Beneath.
To help explain it better, I added a new docs page to explain the what/how/why of it: python.docs.beneath.dev/misc/async...
We want to keep the example snippets concise, but we'll try to link to this helper from the examples to avoid confusion :)
Thanks for the input!
I resolved calling beneath this way:
import beneath
import asyncio
asyncio.run(
beneath.write_full(
stream_path="USERNAME/financial-reference-data/s-and-p-500-constituents",
records=df,
key=["day"],
description="Daily commits to github.com/pandas-dev/pandas",
))
Beneath is a great initiative, really appreciate it. 🥰🥰
Thanks Adwaith!