Lucas Barret

Posted on Mar 20, 2023

Polars for Analytics an Introduction

#python #sql

Introduction

Sometimes you reach a point where you connect dots between different things. Indeed I am currently digging into SQL, and for each new concept that I learned I tried to illustrate it with a ruby project.

Here is the thing, I love ruby it is a cool scripting language. But I have to admit it is not the best language for analytical stuff and so on. Plus it is cool sometimes to learn new stuff and practice our skill another way. So I have decided to use Polars and Python to have another point of view.

Wait what? Polars is not for SQL. What am I talking about? How can I learn SQL using Polars and Python?

My second point is what is important are the concepts, and there are a lot of concepts (not all of them I suppose) that are common to both SQL and Polars. For example the relational algebra or window function.

This article is an introduction to the use of Polars and Python in order to learn the concepts of SQL.

Baby step

That said before diving into really complex subjects. We should first learn the abc of our API.

So as a first project, I wanted to do some basic analytics on github data that I found on Kaggle. If you are interested here is the dataset.

Like I said let's take Polars gently and see what we can do. Let's discover our dataset by reading the CSV and displaying the first 5 repository name.

import polars as pl

df = pl.read_csv("path/to/github_dataset.csv")

first_five_repo = df.select(
  pl.col("repositories").head(5)
)

print(first_five_repo)

This is the equivalent of this in SQL, if we suppose that our csv is let's say a table name git_repos.

SELECT repositories FROM git_repos LIMIT 5

And it gives us this result :

shape: (5, 1)
┌───────────────────────────┐
│ repositories              │
│ ---                       │
│ str                       │
╞═══════════════════════════╡
│ octocat/Hello-World       │
│ EddieHubCommunity/support │
│ ethereum/aleth            │
│ localstack/localstack     │
│ education/classroom       │
└───────────────────────────┘

It is a good beginning, so let's dive a bit more into polars and let's do a bit more of analytics and reporting.

A little more advanced stuff

Let's see what the top 5 languages in our csv file, I think you are going to be really surprised :

top_5 = df.groupby("language").agg(
    [pl.count().alias("count")]
  ).sort("count",descending=True).head(5)

print(top_5)

This equivalent to this in PostgreSQL would be something like this

SELECT COUNT(*) AS count FROM git_repos GROUP BY language ORDER BY count DESC;

And eventually we end up with this result :

shape: (5, 2)
┌────────────┬───────┐
│ language   ┆ count │
│ ---        ┆ ---   │
│ str        ┆ u32   │
╞════════════╪═══════╡
│ JavaScript ┆ 253   │
│ Python     ┆ 155   │
│ NULL       ┆ 145   │
│ HTML       ┆ 72    │
│ Java       ┆ 44    │
└────────────┴───────┘

Yes, it seems that NULL is the 3rd favorite language of people around the world 😅.

Let's just filtering out the null values to have a better idea of the language distribution, and it would be ok then :

filtered_language = df.filter(pl.col("language") != "NULL")
top_5 = filtered_language.groupby("language").agg(
    [pl.count().alias("count")]
  ).sort("count",descending=True).head(5)

print(top_5)

Which pretty much I would say is not exactly equivalent to something like one of the following answers.

SELECT COUNT(*) AS count FROM git_repos WHERE language IS NOT NULL GROUP BY language ORDER BY count DESC;

--OR something like that in the idea but a bit more advanced in SQL

WITH filtered_language AS (SELECT * FROM git_repos WHERE language IS NOT NULL)
SELECT COUNT(*) AS count FROM filtered_language GROUP BY count ORDER BY DESC

What is cool about using python and Polars to create all of this is that you have some cool concepts of SQL or something near for free. For example, here we have declared a filtered_language value which contains all the columns of our DataFrame but without any NULL.

It is close to the Common Table Expressions in SQL, like with seen in the second way to write this query.

Conclusion

I have to say that I am pleased by the flexibility of Polars, which is close to Pandas of course. It is adding a layer of abstraction above our loveable Relational Algebra. I am not throwing Ruby away or else, but it's true that when you want to do analytics on data it is less natural to use Ruby than Python.

Nevertheless, python and pandas will never replace SQL since it is not for the same usage and keep increasing my SQL skills is still a way to go for me along with Ruby. I will continue my quest in SQL and data, trying to improve my skills and share all this with you people

Keep in Touch

On Twitter : @yet_anotherdev

Top comments (2)

krlz • Mar 20 '23

Hi Lucas, thanks for sharing, I have some years of experience with SQL but never thought of using Python and Polars to work with a CSV file, sounds like a great way to even sketch some projects without setting up a whole database service, I will play with this to learn something new as well 😀

Lucas Barret • Mar 20 '23

Yes I find it really cool too. If you want to play with data with a lot of flexibility.
Moreover polars seems to have great performance.
You can use different types of files like parquet file for example.
And if you want to real complex stuff with data it is totally the way to go like heavy data science, statistical stuff or AI it is the way to go. :)