DEV Community

Cover image for Exploring data with Jupyter Notebook, Pandas and CrateDB
Carlota Soto for CrateDB

Posted on

Exploring data with Jupyter Notebook, Pandas and CrateDB

In a previous post, I showed you how to pair CrateDB (an open-source database specialized in machine data) and Jupyter Notebook, an application that allows you to create and share documents containing live code. I also told you how to load two different sample datasets into CrateDB, one recording the position of the ISS and another containing information about taxi trips in NYC.

In this post, I'll play a little bit with the NYC taxi dataset using Pandas and Matplotlib. This will be a simple example, but you can take it from here and explore further what you can do with these tools.

Before continuing, make sure you take it from where we left it, i.e. with Jupyter Notebook running and the NYC taxi dataset loaded into CrateDB.

To start, let's import our libraries. We will be importing pandas and matplotlib. Open a new Jupyter Notebook and run the following code:

import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import rcParams
plt.style.use ('ggplot')

Enter fullscreen mode Exit fullscreen mode

If there's no error, you will see no return.

Now, let's query CrateDB from Jupyter. This is how you can do it:

limit = 1000000
sql_query = """
  SELECT total_amount, trip_distance FROM nyc_taxi_sample
LIMIT 1000000;
""".format(limit)
try:
    df_data = pd.read_sql(
    sql_query, 'crate://localhost:4200')
    display(df_data.head(5))
except Exception as error:
    print(error)
Enter fullscreen mode Exit fullscreen mode

This query asks CrateDB for a table containing the price of each taxi trip and its distance, and with the help of pandas, we're defining a function called df_data with this result. Besides, we're asking to display a table with the first 5 values in our notebook.

If everything goes well, you will see such table showing up:

Alt Text

Next, let's see how our data looks like by building our first plot:

df_data.plot.scatter(x='trip_distance',y='total_amount')
plt.show()
Enter fullscreen mode Exit fullscreen mode

After running the command above, a price vs distance plot will display in your notebook:

Alt Text

We notice some interesting things; for example, there are some anomalies, most notably what appears to be negative trip distances. Datasets are rarely perfect... Especially those with millions of data points.

We can change our original query, so it only includes positive values in the return:

limit = 1000000
sql_query = """
  SELECT total_amount, trip_distance FROM nyc_taxi_sample
  WHERE total_amount> '0' AND trip_distance> '0'
LIMIT 1000000;
""".format(limit)
try:
    df_data = pd.read_sql(
    sql_query, 'crate://localhost:4200')
    display(df_data.head(5))
except Exception as error:
    print(error)
Enter fullscreen mode Exit fullscreen mode

Again, if it's successful, you will see a table with the first five data points.

Let's plot the data again:

df_data.plot.scatter(x='trip_distance',y='total_amount')
plt.show()
Enter fullscreen mode Exit fullscreen mode

Alt Text

Now, let's zoom in a little, limiting our x axis to 75 and our y axis to 1000. You can do that like this:

df_data.plot.scatter(x='trip_distance',y='total_amount')
plt.xlim([0,75])
plt.ylim([0,1000])
plt.show()
Enter fullscreen mode Exit fullscreen mode

Alt Text

Let's zoom in even more:

df_data.plot.scatter(x='trip_distance',y='total_amount')
plt.xlim([0,50])
plt.ylim([0,300])
plt.show()
Enter fullscreen mode Exit fullscreen mode

Alt Text

Nice!

To finish with this little exploration, let's estimate the linear function describing the relationship both variables.

In order to do this, let's use SciPy:

import scipy.stats
slope, intercept, r, p, stderr = scipy.stats.linregress(df_data.trip_distance, df_data.total_amount)
Enter fullscreen mode Exit fullscreen mode

Next, run the command below. It will return you a string with the equation of the regression line and the value of the Pearson correlation coefficient:

line = f'Regression line: y={intercept:.2f}+{slope:.2f}x, r={r:.2f}'
line
Enter fullscreen mode Exit fullscreen mode

Something like this:

'Regression line: y=10.27+3.19x, r=0.75'

To end, if you want to show the regression line in a plot it together with our data, this is how you can do it:

fig, ax = plt.subplots()
ax.plot(df_data.trip_distance, df_data.total_amount, linewidth=0, marker='s', label='Data points')
ax.plot(df_data.trip_distance, intercept + slope * df_data.trip_distance, label=line)
ax.set_xlabel('trip_distance')
ax.set_ylabel('total_amount')
ax.legend(facecolor='white')
plt.show()
Enter fullscreen mode Exit fullscreen mode

Alt Text

Hope this was fun!

Top comments (0)