Pandas DataFrames are my favorite way to manipulate data in Python. In fact, the end product of many of my small analytics projects is just a data frame containing my results.
I used to dump my dataframes to CSV files and save them to GitHub. But recently, I've been using Beneath, a data sharing service I'm building, to save my dataframes and simultaneously turn them into a full-blown API with a website. It's great when I need to hand off a dataset to clients or integrate the data into a frontend.
In this post, I'll show you how that works! I'm going to fetch GitHub commits, analyze them, and use Beneath to turn the result into an API.
Setup Beneath
To get started, you need to install the Beneath pip module and log in with a free Beneath account. It's pretty easy and the docs already cover it. Just follow these steps.
Make sure to remember your username as you'll need it in a minute!
Let's analyze some data
I think GitHub activity is a fascinating, underexplored data source. Let's scratch the surface and look at commits to... Pandas! Here's a quick script to fetch the pandas source code and aggregate some daily stats on the number of commits and contributors:
import io
import pandas as pd
import subprocess
# Get all Pandas commit timestamps
repo = "pandas-dev/pandas"
cmd = f"""
if [ -d "repo" ]; then rm -Rf "repo"; fi;
git clone https://github.com/{repo}.git repo;
cd repo;
echo "timestamp,contributor";
git log --pretty=format:"%ad,%ae" --date=iso
"""
res = subprocess.run(cmd, capture_output=True, shell=True).stdout.decode()
# Group by day and count number of commits and contributors
df = (
pd.read_csv(
io.StringIO(res),
parse_dates=["timestamp"],
date_parser=lambda col: pd.to_datetime(col, utc=True),
)
.resample(rule="d", on="timestamp")["contributor"]
.agg(commits="count", contributors="nunique")
.rename_axis("day")
.reset_index()
)
Now, the df variable contains our insights. If you're following along, you can change the repo variable to scrape another GitHub project. Just beware that some major repos can take a long time to analyze (I'm looking at you, torvalds/linux).
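If you want to sanity-check the aggregation logic without cloning a large repo, you can run the same pipeline on a tiny synthetic commit log. This is just a sketch with made-up timestamps and email addresses; it also parses the dates up front rather than via read_csv, which behaves the same:

```python
import io
import pandas as pd

# A tiny synthetic commit log: two commits by one author on May 1,
# one commit by another author on May 2 (made-up data)
csv = (
    "timestamp,contributor\n"
    "2021-05-01 10:00:00+00:00,a@example.com\n"
    "2021-05-01 12:00:00+00:00,a@example.com\n"
    "2021-05-02 09:00:00+00:00,b@example.com\n"
)

raw = pd.read_csv(io.StringIO(csv))
raw["timestamp"] = pd.to_datetime(raw["timestamp"], utc=True)

# Same daily aggregation as the script above
df = (
    raw.resample(rule="d", on="timestamp")["contributor"]
    .agg(commits="count", contributors="nunique")
    .rename_axis("day")
    .reset_index()
)

print(df)
```

You should see two rows: May 1 with 2 commits from 1 contributor, and May 2 with 1 commit from 1 contributor.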
Save the DataFrame to Beneath
First, we'll create a new project to store our results. I'll do that from the command-line, but you can also use the web console:
beneath project create USERNAME/github-fun
Just replace USERNAME with your own username.
Now, we're ready to publish the dataframe. We do it with a simple one-liner directly in Python (well, I split it over multiple lines, but it's still just one call):
import beneath
await beneath.write_full(
table_path="USERNAME/github-fun/pandas-commits",
records=df,
key=["day"],
description="Daily commits to https://github.com/pandas-dev/pandas",
)
There are a few things going on here. Let's go through them:
- The table_path gives the full path for the output table, including our username and project.
- We use the records parameter to pass our DataFrame.
- We provide a key for the data. The auto-generated API uses the key to index the data so we can quickly filter records. By default, Beneath will use our DataFrame's index as the key, but I prefer setting it manually.
- The description parameter adds some documentation to the dataset that will be shown at the top of the table's page.
And that's it! Now let's explore the results.
Explore your data
You can now head over to the web console and browse the data and its API docs. Mine's at https://beneath.dev/epg/github-fun/table:pandas-commits (if you used the same project and table names, you can just replace my username, epg, with your own).
You can also share or publish the data. Permissions are managed at the project layer, so just head over to the project page and add members or flip the project settings to public.
Use the API
Now that the data is in Beneath, anyone with access can use the API. On the "API" tab of the table page, we get auto-generated code snippets for integrating the dataset.
For example, we can load the dataframe back into Python:
import beneath
df = await beneath.load_full("USERNAME/github-fun/pandas-commits")
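Once loaded, it behaves like any other pandas dataframe. As a quick sketch (with made-up numbers standing in for the real table), here's how you might find the busiest day:

```python
import pandas as pd

# A stand-in for the loaded dataframe: same columns as the table
# above, but with made-up numbers for illustration
df = pd.DataFrame({
    "day": pd.to_datetime(["2021-05-01", "2021-05-02", "2021-05-03"], utc=True),
    "commits": [12, 30, 7],
    "contributors": [5, 9, 3],
})

# Find the day with the most commits
busiest = df.loc[df["commits"].idxmax()]
print(busiest["day"].date(), busiest["commits"])  # 2021-05-02 30
```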
Or we can query the REST API to get the commit stats for each day in May 2021:
curl https://data.beneath.dev/v1/USERNAME/github-fun/pandas-commits \
-d type=index \
-d filter='{"day":{"_gte":"2021-05-01","_lt":"2021-06-01"}}' \
-G
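If curl isn't handy, the same request can be built as a plain URL from Python. This sketch uses only the standard library and just constructs the query string (it doesn't send the request); the endpoint and filter mirror the curl snippet above, where -G moves the -d pairs into the query string:

```python
import json
from urllib.parse import urlencode

# JSON-encode the index filter, then URL-encode the query parameters
filter_ = {"day": {"_gte": "2021-05-01", "_lt": "2021-06-01"}}
params = urlencode({"type": "index", "filter": json.dumps(filter_)})
url = f"https://data.beneath.dev/v1/USERNAME/github-fun/pandas-commits?{params}"
print(url)
```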
Or use the React hook to read data directly into the frontend:
import { useRecords } from "beneath-react";
const App = () => {
const { records, loading, error } = useRecords({
table: "USERNAME/github-fun/pandas-commits",
query: {
type: "index",
filter: '{"day":{"_gte":"2021-05-01","_lt":"2021-06-01"}}'
}
})
...
}
Check out the API tab of my dataframe in the Beneath console to see all the ways to use the data.
That's it
That's it! We used Beneath to turn a Pandas DataFrame into an API. If you have any questions, I'm online most of the time in Beneath's Discord (I love to chat about data science, so you're also welcome to just say hi 👋). And let me know if you publish a cool dataset that I can spotlight in the featured projects!
Top comments (7)
Hi Eric... (First, sorry for my English.)
I followed the example I found in the git repo (github.com/beneath-hq/beneath/blob...), and when I run it, I see this error:
SyntaxError: 'await' outside function
Can you tell me if I'm doing something wrong? Thanks in advance.
Hey Emanuel,
Sorry about that! We'll add some info to the docs today to make this more clear. I know you figured it out, but I'll leave this comment here for others to find.
The keyword "await" runs an asyncio coroutine (described in depth here: realpython.com/async-io-python/), and there are a few options for executing these:
- In a standard Python script, as you mentioned, you have to wrap each coroutine in an async function, and start the execution with asyncio.run(). This is a good resource: docs.python.org/3/library/asyncio-...
- In a Jupyter notebook, which I was using for the blog post and the quick start, it's actually much easier. An asyncio context is automatically included, and you can directly run any await ... code block.
- You can spawn an asyncio-enabled Python REPL by running python -m asyncio in a command line. In this case too, you can directly run any await ... code block.
Let me know if anything else comes up and I'd be happy to help!
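To illustrate the first option, here's a minimal standalone-script skeleton; asyncio.sleep stands in for an awaitable Beneath call:

```python
import asyncio

# Wrap the await in an async function and start it with asyncio.run()
async def main():
    # In a real script this would be e.g. df = await beneath.load_full(...)
    await asyncio.sleep(0)
    return "done"

result = asyncio.run(main())
print(result)  # done
```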
I guess you need to update the docs... all beneath functions must be called with asyncio.run(). For example, to read a stream and load it into a pandas dataframe, the code is:
df = asyncio.run(beneath.load_full("USERNAME/financial-reference-data/s-and-p-500-constituents"))
print(df)
At last, this works for me. Regards Eric, Beneath is a great solution; I will try to use it with Streamlit.
Hey Emanuel (and anyone else following along here!),
I know await/async (asyncio) can be pretty confusing if you haven't used it before – it's perfectly normal to use Python without it, but we think it's such an awesome feature that we wanted to use it for Beneath.
To help explain it better, I added a new docs page to explain the what/how/why of it: python.docs.beneath.dev/misc/async...
We want to keep the example snippets concise, but we'll try to link to this helper from the examples to avoid confusion :)
Thanks for the input!
I resolved calling beneath this way:
import beneath
import asyncio
asyncio.run(
beneath.write_full(
stream_path="USERNAME/financial-reference-data/s-and-p-500-constituents",
records=df,
key=["day"],
description="Daily commits to github.com/pandas-dev/pandas",
))
Beneath is a great initiative, really appreciate it. 🥰🥰
Thanks Adwaith!