DEV Community: Juan Luis Cano Rodríguez

How to sign your git commits with SSH when doing remote development

Juan Luis Cano Rodríguez — Wed, 03 Dec 2025 11:42:59 +0000

Do you want your commits to appear as "verified" on GitHub?

The easiest way is to sign them with the SSH key you're already using. For that, you can run:

$ git config --global gpg.format ssh

In my case, though, I do all my development inside an LXD virtual machine. This is very nice because it isolates my environment and I can nuke it and rebuild it with cloud-init if something goes wrong.

Since I'm working inside a VM, I actually don't have any SSH keys inside the VM!

$ ls ~/.ssh
authorized_keys  known_hosts  known_hosts.old

Instead, I use SSH agent forwarding:

$ ssh-add -L
ssh-ed25519 AAAAC3NzaC1lZD... comment

So you can tell git to use forwarded keys as follows:

$ git config --global gpg.ssh.defaultKeyCommand "ssh-add -L"
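One caveat: `ssh-add -L` prints every key the agent holds, one per line. As far as I understand the git documentation, `gpg.ssh.defaultKeyCommand` reads the key from the first line of the command's output and prefers it prefixed with `key::`. If your agent holds several keys, a small wrapper can make that choice explicit. This is only a sketch: the function name and the sample keys are made up for illustration.

```python
def first_signing_key(ssh_add_output: str) -> str:
    """Return the first public key found in `ssh-add -L` output, in git's `key::` form."""
    for line in ssh_add_output.splitlines():
        line = line.strip()
        # Skip blank lines and messages such as "The agent has no identities."
        if line.startswith(("ssh-", "ecdsa-", "sk-")):
            return f"key::{line}"
    raise ValueError("no public key found in ssh-add output")


# Hypothetical agent output with two keys (truncated on purpose)
sample = "ssh-ed25519 AAAAC3NzaC1lZD... comment\nssh-rsa AAAAB3NzaC1yc2E... other"
print(first_signing_key(sample))  # key::ssh-ed25519 AAAAC3NzaC1lZD... comment
```

In a real script you would feed it the output of `ssh-add -L` (for example through `subprocess.run`) and point `gpg.ssh.defaultKeyCommand` at that script.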

One last thing: you will probably want to verify your own commits locally. But if you don't do anything else, you will see this:

$ git show --show-signature --stat
error: gpg.ssh.allowedSignersFile needs to be configured and exist for ssh signature verification
commit 7bbebcb0b65ae704cdf8b54361f1287c9b95d1f0 (HEAD -> juanlu/...)
No signature
Author: ...

So the last step is configuring that file:

$ mkdir -p ~/.config/git
$ echo "$(git config user.email) $(ssh-add -L)" >> ~/.config/git/allowed_signers
$ git config --global gpg.ssh.allowedSignersFile ~/.config/git/allowed_signers
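Note that the `echo` one-liner assumes the agent holds exactly one key. Each line of an allowed signers file is a principal (here, the email) followed by a public key, so with several keys only the first line would get the email prefix. A minimal sketch of building one entry per key, with a hypothetical function name and made-up sample values:

```python
def allowed_signers_entries(email: str, ssh_add_output: str) -> list[str]:
    """Build one `<email> <public key>` line per key listed by `ssh-add -L`."""
    entries = []
    for line in ssh_add_output.splitlines():
        line = line.strip()
        # Keep only lines that look like public keys
        if line.startswith(("ssh-", "ecdsa-", "sk-")):
            entries.append(f"{email} {line}")
    return entries


# Hypothetical values, truncated on purpose
keys = "ssh-ed25519 AAAAC3NzaC1lZD... laptop\nssh-rsa AAAAB3NzaC1yc2E... desktop"
entries = allowed_signers_entries("user@domain", keys)
print("\n".join(entries))
```

Each resulting line can be appended to `~/.config/git/allowed_signers`.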

And now, finally:

$ git show --show-signature --stat
commit 7bbebcb0b65ae704cdf8b54361f1287c9b95d1f0 (HEAD -> juanlu/...)
Good "git" signature for user@domain with ED25519 key SHA256:4RdE/O/mv3Y/YjC07RatbWtmak5tzx9HUdYR3RZFjNg
Author: ...

And that's it! You can now push and your commits will be verified ✨

If you discovered this in the middle of writing a pull request, well, you can sign all the commits with a rebase:

$ git rebase --exec 'git commit --amend --no-edit -n -S' main
...

Happy coding!

Python Packaging is Great Now: `uv` is all you need

Juan Luis Cano Rodríguez — Sat, 10 Aug 2024 12:17:10 +0000

The title of this post is a reference to Glyph's Python Packaging is Good Now. I think it's safe to say that, in these 8 years, we've gone from "Good" to "Great". Keep reading for my reasoning.

What makes Python packaging hard for beginners?

I contend that the two main difficulties for Python packaging are

Bootstrapping, i.e. how to even get started!
Activation, i.e. how venvs in Python work.

Bootstrapping was an often neglected problem. Should we tell people to install Python from https://python.org? The Anaconda distribution? How do we stop folks from using their system package manager and risk breaking everything?

And don't forget the whole virtual environment lifecycle. It's so crazy how numb I've become to it as a long-time Python user, but every time I have to explain it I see my students' faces and I think "this is not okay".

Sure, there are other problems, like how to build and publish distributable packages. But I contend these don't affect most Python beginners. Plus, they are in the process of being addressed as well. Read on.

Enter `uv`

On February 15th, Astral released uv and I jumped ship immediately. As part of my job I routinely have to install lots of potentially conflicting dependencies, and uv was an immediate relief.

But the interesting thing is that now uv has gone well beyond its initial "faster pip" phase and it's fulfilling its promise of being "a comprehensive Python project and package manager that's fast, reliable, and easy to use".

Going back to the bootstrapping and activation problems that I mentioned at the very beginning, how does uv solve them? Consider this:

uv does not depend on Python itself. Precompiled, standalone binaries can be easily installed on Linux, macOS and Windows.
uv python manages Python versions! No need to resort to OS-specific mechanisms, like pyenv, deadsnakes, or to heavyweight tools like conda.
uv tool manages tools in centralized environments! No more need for pipx or fades.
uv init creates a barebones pyproject.toml using hatchling as build backend and a working src-layout with an empty README and a dummy module.
- If you need something more sophisticated, you could always use copier or cookiecutter with some more sophisticated template.
uv add adds dependencies to pyproject.toml, creates a venv if one didn't exist, and installs them!
uv lock creates a lock file with all your dependencies, which you can then use in uv sync.
- And if you want a good old requirements.txt, uv pip compile does it for you, just like pip-tools!
uv run executes scripts and commands, again without explicitly activating environments!

Essentially, this:

$ mkdir uv-playground
$ cd uv-playground
$ uv init
warning: `uv init` is experimental and may change without warning
Initialized project `uv-playground`
$ uv add click
warning: `uv add` is experimental and may change without warning
Using Python 3.12.3 interpreter at: /usr/bin/python3
Creating virtualenv at: .venv
Resolved 3 packages in 66ms
   Built uv-playground @ file:///tmp/uv-playground
Prepared 2 packages in 430ms
Installed 2 packages in 0.62ms
 + click==8.1.7
 + uv-playground==0.1.0 (from file:///tmp/uv-playground)
$ tree
.
├── pyproject.toml
├── README.md
├── src
│   └── uv_playground
│       └── __init__.py
└── uv.lock

3 directories, 4 files
$ uv run python -c "from uv_playground import hello; print(hello())"
warning: `uv run` is experimental and may change without warning
Hello from uv-playground!
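The last command implies that the generated package exposes a `hello` function. Judging only from the output above, `src/uv_playground/__init__.py` presumably contains something along these lines; this is my reconstruction, and the exact file that uv generates may differ between versions:

```python
def hello() -> str:
    # Dummy function, inferred from the output of `uv run` above
    return "Hello from uv-playground!"


print(hello())  # Hello from uv-playground!
```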

Therefore, to the question "how do I get started learning Python on my computer", now you can universally respond: "install uv".

Some reflections

On the topic of virtual environments, I essentially agree with Armin when he says

npm got away without any equivalent of "activation" and I think a future Python ecosystem will also no longer find much use in virtualenv activation.

I also notice that uv init chose hatchling. I always had a slight preference towards PDM, but I think this might be a point of no return.

It took Leah and contributors a lot of work to come up with this decision diagram for the PyOpenSci packaging guide. But the fact that now there's a baseline that folks can change in case they have more specific needs (for example, a Meson or scikit-build capable build backend) again provides for a much better Developer Experience.

On conda

The topic of conda vs pip is another common source of confusion. I was a conda user and fan since day 1, and it effectively saved Python from a very clear death at a time when it was very difficult to just install stuff on Windows.

In the years that followed, I often referred to the old blog post by Jake VanderPlas explaining the differences, but it looks like a lost cause by now.

The interoperability problems between pip and conda were never fully addressed, and while I think the Pixi folks are doing a fantastic job, I think in the long run uv will win.

I fully acknowledge that conda packages are better structured around the notion of non-Python code, and that the current world of "fat wheels on PyPI" is clearly a suboptimal solution. But the whole ecosystem has moved in that direction: most packages now publish precompiled wheels for a rich variety of platforms.

In other words: conda might not be as useful in 2024 as it was in 2014, and it might be time to stop teaching it to beginners and deem it an advanced tool.

Conclusion

If it is still a bit too early to call Python packaging great, the reason is that some of these uv commands are still experimental and might evolve in the future. But for the first time ever, I clearly see a workflow tool that is standards-compliant, comprehensive, free of bootstrapping problems, carefully designed, and that can win.

Update 2024-08-20: uv 0.3.0 introduced the project, tool, script, and python interfaces, so they're not experimental anymore!

Which is what many Python packaging critics wanted all along, right? Not having to choose from many different tools. But I think uv went well beyond that and solved other Developer Experience issues, for which I'm happy and thankful.

I am effectively using uv for everything and I am not looking back. I will continue recommending this tool to everyone, continue talking about it, and hope that it becomes more widespread.

The simplicity of DuckDB

Juan Luis Cano Rodríguez — Mon, 20 Nov 2023 12:37:12 +0000

This post is an adaptation of the one I originally published in the Orchest blog. Enjoy!

This post is the first part of our series “SQL on Python”, in which we will explore different Python libraries that help you manipulate and query your data using SQL or a SQL-inspired syntax.

Why SQL, after all?

SQL (the initials for Structured Query Language, also known as ISO/IEC 9075-1:2016) was originally designed in the 70s for managing relational databases, but nowadays, it is being used for analytics workloads as well.

SQL has lots of benefits for analytics, to name a few:

It’s easy to pick up: SQL is a domain-specific language, rather than a general-purpose language, and as such it has more limited scope and fewer syntax elements to learn.
It’s everywhere: SQL is a family of query languages available in many systems, and all of them share some core common characteristics. When you learn a particular SQL dialect (PostgreSQL, SQL Server, Google Standard SQL, others), you can easily transfer your skills to other dialects.
It’s fast: SQL is a statically typed language, which allows query planning systems to perform sophisticated optimizations. This, along with the decades of accumulated knowledge about relational databases, allows SQL implementations to deliver hard-to-beat performance.

However, if you are used to the Python or R ecosystems (pandas, Polars, data.table, dplyr), you are probably spoiled by how easy it is to download a CSV or Parquet file from somewhere, launch a Python or R process, read it, and start querying and manipulating it.

Comparatively, this bootstrapping process is a bit more tedious with SQL: assuming you have, say, a local PostgreSQL database up and running and a CSV file, you would need to create a table with the appropriate schema, import the data using COPY, and hope that there are no inconsistencies, missing data, or weird date formats. If the file happened to be Parquet, you would need to work a bit more.

To try to make the process a bit more lightweight, you could try to convert your CSV or Parquet to SQLite, a widely available, in-process SQL database. However, SQLite was designed with transactional use cases in mind, and therefore might not scale well with some analytical workloads.
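To make that boilerplate concrete, here is a small sketch of loading a CSV into an in-memory SQLite database using only the Python standard library. The three-row CSV, the table name, and the column names are invented for the example:

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a downloaded CSV file
raw_csv = io.StringIO(
    "subreddit,sentiment\n"
    "science,0.5\n"
    "collapse,-0.4\n"
    "science,0.1\n"
)

conn = sqlite3.connect(":memory:")

# Step 1: declare the schema by hand
conn.execute("CREATE TABLE comments (subreddit TEXT, sentiment REAL)")

# Step 2: parse the file and convert the types ourselves
rows = [(r["subreddit"], float(r["sentiment"])) for r in csv.DictReader(raw_csv)]
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# Step 3: only now can we query the data
counts = conn.execute(
    "SELECT subreddit, COUNT(*) FROM comments GROUP BY subreddit ORDER BY subreddit"
).fetchall()
print(counts)  # [('collapse', 1), ('science', 2)]
```

Three steps for two columns is manageable, but the schema and the type conversions have to be maintained by hand as the file grows.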

In summary: SQL is appealing, but the boilerplate not so much. What if you could run SQL for your analytics workloads without having to configure a database, just by importing a module in your Python or R process, and make your queries blazing fast? What if, rather than having to choose between Python or SQL, you could use both?

https://twitter.com/anyfactor/status/1551650476651081729

Enter DuckDB

DuckDB is an open source (MIT) high-performance, in-process SQL database for analytics. It is a relatively new project (the first public release was in June 2019), but got tremendously popular in a short period of time.

DuckDB popularity is growing (we like this image so much)

DuckDB can read data from different sources:

From CSV or Parquet files
From pandas DataFrame or Arrow Table objects in the process memory
From PostgreSQL tables (by reading the binary data directly!)

Some of the DuckDB operations have out-of-core capabilities (similar to Vaex or the new streaming mode of Polars), which means that it can read data that is larger than RAM!

Finally, DuckDB offers some additions on top of standard SQL that make it very pleasant to use, for example friendlier error messages or, behold, trailing commas!

Trying out DuckDB

For this example, we will use a dataset containing all mentions of climate change on Reddit before September 2022 obtained from Kaggle. Our generic goal is to understand the sentiment of these mentions.

I have published an Orchest pipeline that contains all the necessary files so you can run these code snippets on JupyterLab easily: the first step downloads the data using your Kaggle API key, and the second step performs some exploratory analysis.

First steps with DuckDB

You can install DuckDB with conda/mamba or pip:

mamba install -y "python-duckdb=0.5.1"  
# Or, alternatively, with pip  
# pip install "duckdb==0.5.1"

The first step to start using DuckDB is creating a connection object. This mimics the Python Database API 2.0, also implemented by other projects like SQLite and psycopg2:

import duckdb  
conn = duckdb.connect()

By default, duckdb.connect() will return a connection to an in-memory database, which will be perfectly fine for reading data from external files. In fact, you can run a SQL query directly on the CSV file straight away!

In [3]: conn.execute("""  
  ...: SELECT COUNT(*)  
  ...: FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv'  
  ...: """).fetchall()  
Out[3]: [(4600698,)]

As you can see, the comments CSV file contains 4.6+ million rows. This took about 50 seconds on an Orchest instance though, which is not very impressive for just a COUNT(*) operation. What about converting the CSV to Parquet, as we did in my blog post about Arrow? This time, we can use DuckDB for that:

import os  

csv_files = !ls /data/reddit-climate/*.csv

for filename in csv_files:  
   print(f"Reading {filename}...")  
   destination_file = os.path.splitext(filename)[0] + ".parquet"  
   if os.path.isfile(destination_file):  
       continue  
   conn.execute(f"""  
   COPY (SELECT * FROM '{filename}')  
   TO '{destination_file}' (FORMAT 'parquet')  
   """)

And now, let’s repeat the query on the Parquet file:

In [8]: %%timeit  
  ...: conn.execute("""  
  ...: SELECT COUNT(*)  
  ...: FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'  
  ...: """).fetchall()  
234 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Approximately a 200x speedup over the same operation using CSV! That is a better baseline for running the rest of the queries.
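A side note on the conversion loop: the `!ls` line only works inside IPython or Jupyter. Here is a sketch of the same logic in plain Python using pathlib. To keep it runnable without DuckDB, it only builds the COPY statements, each of which could then be passed to conn.execute(). The function name and the temporary files are made up for the example:

```python
import tempfile
from pathlib import Path


def parquet_copy_statements(data_dir: Path) -> list[str]:
    """Build one COPY statement per CSV file that has no Parquet twin yet."""
    statements = []
    for csv_path in sorted(data_dir.glob("*.csv")):
        parquet_path = csv_path.with_suffix(".parquet")
        if parquet_path.is_file():
            continue  # already converted, skip it
        statements.append(
            f"COPY (SELECT * FROM '{csv_path}') "
            f"TO '{parquet_path}' (FORMAT 'parquet')"
        )
    return statements


# Hypothetical directory: two CSV files, one of them already converted
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "comments.csv").touch()
    (data_dir / "posts.csv").touch()
    (data_dir / "posts.parquet").touch()
    statements = parquet_copy_statements(data_dir)

print(len(statements))  # 1
```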

Querying Parquet files with DuckDB

Since you will be referring to the same file several times, it’s a good moment to create a view for it. This will allow you to query the Parquet file without copying all the data to memory:

conn.execute("""  
CREATE VIEW comments AS  
SELECT * FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
""")

Next, let’s find out which subreddits had the largest number of comments about climate change:

In [11]: conn.execute("""
   ...: SELECT  
   ...:   "subreddit.name" AS subreddit_name,
   ...:   COUNT(*) AS num_comments,  
   ...: FROM comments  
   ...: GROUP BY subreddit_name  
   ...: ORDER BY num_comments DESC  
   ...: LIMIT 10  
   ...: """).fetchall()  
Out[11]:  
[('politics', 370018),  
('worldnews', 351195),  
('askreddit', 259848),  
('collapse', 94696),  
('news', 94558),  
('futurology', 89945),  
('science', 71453),  
('environment', 70444),  
('canada', 66813),  
('australia', 60239)]

Unsurprisingly, /r/politics, /r/worldnews, and /r/collapse were among the subreddits with the largest number of comments about climate change.

What about the overall sentiment of those comments?

In [12]: conn.execute("""
   ...: SELECT  
   ...:   AVG(sentiment) AS average_sentiment,  
   ...:   STDDEV(sentiment) AS stddev_sentiment,  
   ...: FROM comments  
   ...: """).fetchall()  
Out[12]: [(-0.005827451977706203, 0.6581439484369691)]  

In [13]: conn.execute("""
   ...: SELECT  
   ...:   "subreddit.name" AS subreddit_name,  
   ...:   COUNT(*) AS num_comments,  
   ...:   AVG(sentiment) AS average_sentiment,  
   ...:   STDDEV(sentiment) AS stddev_sentiment,  
   ...: FROM comments  
   ...: WHERE subreddit_name IN (  
   ...:   SELECT "subreddit.name" AS subreddit_name  
   ...:   FROM comments  
   ...:   GROUP BY subreddit_name  
   ...:   ORDER BY COUNT(*) DESC  
   ...:   LIMIT 10  
   ...: )  
   ...: GROUP BY subreddit_name  
   ...: ORDER BY num_comments DESC  
   ...: """).fetchall()  
Out[13]:  
[('politics', 370018, -0.018118589649651674, 0.6600297061408),  
('worldnews', 351195, -0.058001587387908435, 0.6405990095462681),  
('askreddit', 259848, -0.068637218639235, 0.6089748718101456),  
('collapse', 94696, -0.1332661626390419, 0.6667106776062662),  
('news', 94558, -0.09367126059175682, 0.6276134461239258),  
('futurology', 89945, 0.0018637489115630797, 0.6506820198836241),  
('science', 71453, 0.04588216852922973, 0.6248484283076333),  
('environment', 70444, -0.015670189810189843, 0.6467846578160414),  
('canada', 66813, 0.021118244331091468, 0.6408319443539487),  
('australia', 60239, -0.021869519296548085, 0.6405803819103508)]

While the overall sentiment is slightly negative (with a large standard deviation), some subreddits like /r/askreddit and /r/collapse exhibited a sentiment more negative than average. Others like /r/science and /r/canada were slightly positive.
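To double-check that reading, here is a quick comparison in plain Python. The numbers are copied from the two outputs above and rounded to four decimals:

```python
# Overall average sentiment (from Out[12]), rounded
overall_average = -0.0058

# Average sentiment per subreddit (from Out[13]), rounded
by_subreddit = {
    "politics": -0.0181,
    "worldnews": -0.0580,
    "askreddit": -0.0686,
    "collapse": -0.1333,
    "news": -0.0937,
    "futurology": 0.0019,
    "science": 0.0459,
    "environment": -0.0157,
    "canada": 0.0211,
    "australia": -0.0219,
}

below_average = [name for name, avg in by_subreddit.items() if avg < overall_average]
positive = [name for name, avg in by_subreddit.items() if avg > 0]
most_negative = min(by_subreddit, key=by_subreddit.get)

print(most_negative)       # collapse
print(positive)            # ['futurology', 'science', 'canada']
print(len(below_average))  # 7
```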

Most interestingly, all these queries ran in about 2 seconds!

DuckDB also has integration with Jupyter through the ipython-sql extension and the DuckDB SQLAlchemy driver, which allows you to query your data using an even more compact syntax:

In [1]: %load_ext sql  
In [2]: %sql duckdb:///:memory:  
In [3]: %%sql  
  ...: CREATE VIEW comments AS  
  ...: SELECT * FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'  
  ...:  
  ...:  
* duckdb:///:memory:  
Done.  
Out[3]: []  

In [4]: %sql SELECT COUNT(*) FROM comments  
* duckdb:///:memory:  
Done.  
Out[4]: [(4600698,)]

Interoperability with Python dataframe libraries

Did you notice how we were using conn.execute() all the time? As we said above, this method follows the widely used Python DBAPI 2.0. However, DuckDB can return richer objects by using conn.query() instead:

rel = conn.query("""  
SELECT  
 "subreddit.name" AS subreddit_name,  
 COUNT(*) AS num_comments,  
FROM comments  
GROUP BY subreddit_name  
ORDER BY num_comments DESC  
LIMIT 10  
""")

This method returns an instance of DuckDBPyRelation, which can be pretty printed in Jupyter:

In [5]: type(rel)  
Out[5]: duckdb.DuckDBPyRelation  

In [6]: rel  
Out[6]:  
---------------------  
--- Relation Tree ---  
---------------------  
Subquery  

---------------------  
-- Result Columns  --  
---------------------  
- subreddit_name (VARCHAR)  
- num_comments (BIGINT)  

---------------------  
-- Result Preview  --  
---------------------  
subreddit_name  num_comments  
VARCHAR BIGINT  
[ Rows: 10]  
politics        370018  
worldnews       351195  
askreddit       259848  
collapse        94696  
news    94558  
futurology      89945  
science 71453  
environment     70444  
canada  66813  
australia       60239

Moreover, you can efficiently retrieve the data from this relation and convert it to several Python objects:

A dictionary of masked NumPy arrays using .fetchnumpy()
A pandas DataFrame using .df() or its aliases (.fetchdf(), .fetch_df())
An Arrow Table using .arrow() or .fetch_arrow_table()
An Arrow record batch reader using .fetch_record_batch(chunk_size)

Therefore, you can easily convert query results to a pandas DataFrame, and also a Polars one (since you can pass an Arrow table directly):

In [6]: rel.df()  # pandas  
Out[6]:  
  subreddit_name  num_comments  
0       politics        370018  
1      worldnews        351195  
2      askreddit        259848  
3       collapse         94696  
4           news         94558  
5     futurology         89945  
6        science         71453  
7    environment         70444  
8         canada         66813  
9      australia         60239  

In [7]: import polars as pl  

In [8]: data = rel.arrow()  # Arrow data  

In [9]: pl.DataFrame(data)  # Polars  
Out[9]: shape: (10, 2)  
┌────────────────┬──────────────┐  
│ subreddit_name ┆ num_comments │  
│ ---            ┆ ---          │  
│ str            ┆ i64          │  
╞════════════════╪══════════════╡  
│ politics       ┆ 370018       │  
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ worldnews      ┆ 351195       │  
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ askreddit      ┆ 259848       │  
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ collapse       ┆ 94696        │  
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ ...            ┆ ...          │  
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ science        ┆ 71453        │  
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ environment    ┆ 70444        │  
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ canada         ┆ 66813        │  
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ australia      ┆ 60239        │  
└────────────────┴──────────────┘

Note: Result objects returned by conn.execute() also have these methods, but they consume the data after they are called and therefore are not so convenient.
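The "consume the data" part is typical of DB-API style results. Here is an illustration that uses sqlite3 from the standard library as a stand-in, on the assumption that DuckDB's result objects behave in a similar way, as the note describes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

result = conn.execute("SELECT n FROM t ORDER BY n")
first = result.fetchall()   # reads every row from the result
second = result.fetchall()  # nothing left to read

print(first)   # [(1,), (2,), (3,)]
print(second)  # []
```

To fetch the rows again you would have to run the query a second time.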

Conversely, you can transfer data from pandas or Arrow to DuckDB. Or, more precisely: you can use DuckDB to query pandas or Arrow objects that live in memory! Moreover, DuckDB can read local variables without having to do anything:

In [13]: df_most_comments = rel.df()  

In [14]: df_most_comments.head()  # pandas  
Out[14]:  
  subreddit_name  num_comments  
0       politics        370018  
1      worldnews        351195  
2      askreddit        259848  
3       collapse         94696  
4           news         94558  

In [15]: conn.execute("""  
   ...: SELECT subreddit_name  
   ...: FROM df_most_comments  -- Sorcery!  
   ...: LIMIT 5  
   ...: """).fetchall()  
Out[15]: [('politics',), ('worldnews',), ('askreddit',), ('collapse',), ('news',)]

You can also manually register a compatible object with a given name:

In [17]: conn.register("most_comments_arrow", data)  
Out[17]: <duckdb.DuckDBPyConnection at 0x7f9be41434f0>  

In [18]: conn.execute("""  
   ...: SELECT subreddit_name  
   ...: FROM most_comments_arrow  
   ...: LIMIT 5  
   ...: """).fetchall()  
Out[18]: [('politics',), ('worldnews',), ('askreddit',), ('collapse',), ('news',)]

Or, using the %sql magic as before:

In [21]: %sql output << SELECT subreddit_name FROM df_most_comments LIMIT 5  
* duckdb:///:memory:  
Done.  
Returning data to local variable output  

In [22]: output.DataFrame()  # pandas  
Out[22]:  
  subreddit_name  
0       politics  
1      worldnews  
2      askreddit  
3       collapse  
4           news

In other words: you can transparently go back and forth between DuckDB and your favourite Python dataframe library. Cool!

Other features

Without going into too much detail, there are a few extra interesting things about DuckDB you should check out:

Relational API: Apart from executing your SQL queries, DuckDBPyRelation objects have some basic filtering and aggregation methods. For example, you can do things like:

In [27]: rel.filter("num_comments > 100000").order("subreddit_name").df()  
Out[27]:  
  subreddit_name  num_comments  
0      askreddit        259848  
1       politics        370018  
2      worldnews        351195

The documentation for this API is still a work in progress, but hopefully the DuckDB team will expand it in the future!

The CLI: DuckDB has a command-line client you can use directly from your terminal, without even launching a Python or Jupyter interpreter:

$ ./duckdb -c '  
>   SELECT "subreddit.name" AS subreddit_name,  
>   COUNT(*) AS num_comments  
> FROM "/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet"  
> GROUP BY subreddit_name  
> ORDER BY num_comments DESC  
> LIMIT 10  
> '  
┌────────────────┬──────────────┐  
│ subreddit_name │ num_comments │  
├────────────────┼──────────────┤  
│ politics       │ 370018       │  
│ worldnews      │ 351195       │  
│ askreddit      │ 259848       │  
│ collapse       │ 94696        │  
│ news           │ 94558        │  
│ futurology     │ 89945        │  
│ science        │ 71453        │  
│ environment    │ 70444        │  
│ canada         │ 66813        │  
│ australia      │ 60239        │  
└────────────────┴──────────────┘

Should you use DuckDB?

Although DuckDB is a wonderful piece of technology, “there is no silver bullet” and there might be cases in which you might want to use something else. The project homepage itself hints at some of these cases:

For transactional workloads, you might want to use SQLite, or a more sophisticated transactional database like PostgreSQL. Remember, DuckDB was created for analytics!
When several people are reading or writing the same data, using a warehouse might make more sense.

Other than that, as you saw above, if you are looking for a lightweight and fast solution for in-process analytics, and you want to leverage both your general-purpose language of choice (Python, R, others) as well as SQL, DuckDB might be exactly what you want.

In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!

Thanks to Alex Monahan and Elliana May for reviewing early drafts of this blog post. All remaining errors are my own.

How to integrate Kedro and Databricks Connect

Juan Luis Cano Rodríguez — Thu, 21 Sep 2023 14:41:03 +0000

In recent months we've updated Kedro documentation to illustrate three different ways of integrating Kedro with Databricks.

You can choose a workflow based on Databricks jobs to deploy a project that finished development.
For faster iteration on changes, the workflow documented in "Use a Databricks workspace to develop a Kedro project" is for those who prefer to develop and test their projects directly within Databricks notebooks, to avoid the overhead of setting up and syncing a local development environment with Databricks.
Alternatively, you can work locally in an IDE as described by the workflow documented in "Use an IDE, dbx and Databricks Repos to develop a Kedro project". You can use your IDE’s capabilities for faster, error-free development, while testing on Databricks. This is ideal if you’re in the early stages of learning Kedro, or if your project requires constant testing and adjustments. However, the experience is still not perfect: you must sync your work inside Databricks with dbx and run the pipeline inside a notebook. Debugging has a lengthy setup for each change and there is less flexibility than inside an IDE.

In this blog post, Diego Lira, a Specialist Data Scientist and a client-facing member of QuantumBlack, AI by McKinsey, explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE. He recommends this as a solution where the data-heavy parts of your pipelines are in PySpark. If part of your workflow is in Python (e.g. Pandas) and not Spark (using PySpark), then you will find that Databricks Connect will download your data frame to your local environment to continue running your workflow. This might cause performance issues and introduce compliance risks because the data has left the Databricks workspace.

What is Databricks Connect?

Databricks Connect is Databricks' official method of interacting with a remote Databricks instance while using a local environment.

To configure Databricks Connect for use with Kedro, follow the official setup to create a .databrickscfg file containing your access token. You can install it with pip install databricks-connect, and it will substitute your local SparkSession:

from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()

Spark commands are sent and executed on the cluster, and results are returned to the local environment as needed. In the context of Kedro, this has an amazing effect: as long as you don’t explicitly ask for the data to be collected in your local environment, operations will be executed only when saving the outputs of your node. If you use datasets saved to a Databricks path, there will be no performance hit for transferring data between environments.

This tool was recently made available as a thin client for Spark Connect, one of the highlights of Spark 3.4, and configuration is now easier than in earlier versions. If your cluster doesn’t support the current Connect, please refer to the documentation, as previous versions had different limitations.

How can I use a Databricks Connect workflow with Kedro?

Databricks Connect (and Spark Connect) enables us to have a completely local development flow, while all artifacts can be remote objects. Using Delta tables for all our datasets and MLflow for model objects and tracking, nothing needs to be saved locally. Developers can take full advantage of the Databricks stack while maintaining their full IDE usage.

How to use Databricks as your PySpark engine

Kedro supports integration with PySpark through the use of Hooks. To configure and enable your Databricks session through Spark Connect, simply set up your SPARK_REMOTE environment variable with your Databricks configuration. Here is an example implementation:

import configparser
import os
from pathlib import Path

from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession

class SparkHooks:
    @hook_impl
    def after_context_created(self) -> None:
        """Initialises a SparkSession using the config
        from Databricks.
        """
        set_databricks_creds()
        _spark_session = SparkSession.builder.getOrCreate()

def set_databricks_creds():
    """
    Pass Databricks credentials as OS variables if using the local machine.
    If you set the DATABRICKS_PROFILE env variable, it will choose the desired profile in .databrickscfg,
    otherwise it will use the DEFAULT profile in .databrickscfg.
    """
    DEFAULT = os.getenv("DATABRICKS_PROFILE", "DEFAULT")
    if os.getenv("SPARK_HOME") != "/databricks/spark":
        config = configparser.ConfigParser()
        config.read(Path.home() / ".databrickscfg")

        host = (
            config[DEFAULT]["host"].split("//", 1)[1].strip().rstrip("/")
        )  # remove "https://" and any trailing "/" from the host
        cluster_id = config[DEFAULT]["cluster_id"]
        token = config[DEFAULT]["token"]

        os.environ[
            "SPARK_REMOTE"
        ] = f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"

This example will populate SPARK_REMOTE from your local .databrickscfg file. It doesn't set up the remote connection if the project is being run from inside Databricks (that is, if SPARK_HOME points to Databricks), so you're still able to run it in the usual hybrid development flow. Notice that you don’t need to set up a spark.yml file as is common in other PySpark templates; you’re not passing any configuration, just using the cluster that is in Databricks. You also don’t need to load any extra Spark files (e.g. JARs), as you are using a thin Spark Connect client.
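To see what ends up in SPARK_REMOTE, here is a standalone sketch of the same string construction that reads the configuration from a string instead of a file. The host, token, and cluster ID are placeholders, the function name is hypothetical, and the host is cleaned with rstrip("/") so that it works with or without a trailing slash:

```python
import configparser

# Placeholder contents of a .databrickscfg file (not real credentials)
SAMPLE_CFG = """
[DEFAULT]
host = https://example.cloud.databricks.com/
token = dapi-EXAMPLE
cluster_id = 0123-456789-abcdefgh
"""


def spark_remote_from_config(text: str, profile: str = "DEFAULT") -> str:
    """Build a Spark Connect connection string from .databrickscfg contents."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config[profile]
    # Drop the "https://" scheme and any trailing "/" from the host
    host = section["host"].split("//", 1)[1].strip().rstrip("/")
    return (
        f"sc://{host}:443/;token={section['token']}"
        f";x-databricks-cluster-id={section['cluster_id']}"
    )


remote = spark_remote_from_config(SAMPLE_CFG)
print(remote)
```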

Now all your Spark calls in your pipelines will automatically use the remote cluster. There's no need to change anything in your code. However, notebooks might be part of the project. To use your remote cluster without needing to use environment variables, you can use the DatabricksSession:

from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()

When using the remote cluster, it's preferred to avoid data transfers between the environments, with all catalog entries referencing remote locations. Using kedro_datasets.databricks.ManagedTableDataSet as your dataset type in the catalog also allows you to use Delta table features.

How to enable MLflow on Databricks

Using MLflow to save all your artifacts directly to Databricks leads to a powerful workflow. For this you can use kedro-mlflow. Note that kedro-mlflow is built on top of the mlflow library, and although the Databricks configuration is not covered in its documentation, you can read more about it directly in the MLflow documentation.

After doing the basic setup of the library in your project, you should see a mlflow.yml configuration file. In this file, change the following to set up your URI:

server:
    mlflow_tracking_uri: databricks # if null, will use mlflow.get_tracking_uri() as a default
    mlflow_registry_uri: databricks # if null, mlflow_tracking_uri will be used as mlflow default

Set up your experiment name (this should be a valid Databricks path):

experiment:
    name: /Shared/your_experiment_name

By default, all your parameters will be logged, and objects such as models and metrics can be saved as MLflow objects referenced in the catalog.

Limitations of this workflow

Databricks Connect, built on top of Spark Connect, supports only recent versions of Spark. I recommend looking at the detailed limitations in the official documentation for specific guidance, such as the upload limit of only 128MB for dataframes.

Users also need to be conscious that .toPandas() will move the data to your local pandas environment. Saving results back as MLflow objects is the preferred way to avoid local objects. Examples can be seen in the kedro-mlflow documentation for all types of supported objects.

In the pipeline: September 2023

Juan Luis Cano Rodríguez — Wed, 06 Sep 2023 08:49:16 +0000

This month: a roundup of the summer’s Kedro news, some release updates, and our top picks from recent articles.

Kedro team news

Over the last few months, we’ve been happy to welcome some new team members to the Kedro and Kedro-Viz teams, who have also joined our Technical Steering Committee. Welcome Dmitry Sorokin, Jitendra Gundaniya, Laura Couto, Ravi Kumar Pilla, and Vladimir Nikolic!

We are also pleased to announce a Kedro baby, delivered safely by one of the team, at the end of July!

Contributors news

We reworked the Kedro contributors guide in August, and moved it to the Kedro wiki. There are loads of different ways to contribute to Kedro and if you want to get involved, we encourage you to look at the table that introduces the Kedro contributor guide.

If you spot an article, podcast or video that discusses Kedro, you can also contribute by adding it to the “Awesome Kedro” repository, or letting us know on Slack.

There have been some amazing contributions in recent weeks, including the kedro-vineyard plugin for efficient intermediate sharing in Kedro pipelines, kedro-graphql for serving Kedro projects as a GraphQL API, and kedro-pandera to bring data validation to your Kedro projects.

Release news

August 2023 saw a set of releases to introduce Python 3.11 support across Kedro, Kedro-Viz and Kedro datasets.

Kedro version 0.18.13 included these major features and improvements:

Added support for Python 3.11.
Added new OmegaConfigLoader features: registering of custom resolvers through CONFIG_LOADER_ARGS and support for global variables.
Added kedro catalog resolve CLI command that resolves dataset factories in the catalog with any explicit entries in the project pipeline.
Simplified the conf folder structure for modular pipelines and updated kedro pipeline create and kedro catalog create accordingly.
Made various updates to the Kedro project template and Kedro starters: use of OmegaConfigLoader, transition from setup.py to pyproject.toml, and updated for the simplified conf structure.

Kedro-Viz version 6.5 added support for Python 3.11, while Kedro-Viz version 6.4 added two new features: feature hint cards to highlight key features of Kedro-Viz and support for displaying dataset statistics in the metadata panel for further investigation.

Kedro Datasets version 1.7.0 added polars.GenericDataSet, a dataset backed by polars, a lightning fast dataframe package built entirely using Rust. Kedro Datasets version 1.6.0 added support for Python 3.11.

Recently on the Kedro blog

In the last few weeks we’ve published the following on the Kedro blog:

We’re always looking for collaborators to write about their experiences using Kedro, particularly if you’re working with Kedro datasets or converting an existing project to use Kedro. Get in touch with us on our Slack workspace to tell us your story.

What we’ve learned

We really enjoyed reading more on Medium about the Kedro Vineyard plugin, a cloud-native data manager for in-memory data sharing in data science pipelines on Kubernetes.

Quix published an interesting article called “Bridging the gap between data scientists and engineers in machine learning workflows” which is something we regularly discuss within the team.

We found a super-interesting project about font recognition that uses Kedro.

And finally, we enjoyed reading more about data streaming with Kedro over on the QuantumBlack Medium channel.

That’s it for this edition!

🐍 Best resources on Python packaging 📖

Juan Luis Cano Rodríguez — Thu, 17 Aug 2023 12:29:35 +0000

Are you confused by the various names that float around the Python packaging ecosystem? Have you ever asked a colleague to help you with an installation issue, only for them to reply "use {other_tool} instead" and make the problem worse? Have you seen the infamous XKCD comic on Python environments but you're still wondering how to solve your mess?

This short blog post is not a guide that will help you troubleshoot everything, but instead a list of resources that I consider up to date, modern, informative, and free of "hot takes" or unnecessary hate towards maintainers.

1. "The Basics of Python Packaging in Early 2023"

https://drivendata.co/blog/python-packaging-2023

This blog post by Jay Qi, Lead Data Scientist at DrivenData, is an informative take on how to produce your own Python package, specifically writing your pyproject.toml (modern replacement of setup.py). It neatly explains all the concepts involved, including PEP 517 build backends, PEP 621 project metadata, and some extra stuff.

2. "An unbiased evaluation of environment management and packaging tools"

https://youtu.be/MsJjzVIVs6M

Anna-Lena Popkes, Senior Machine Learning Engineer at inovex, delivered this talk at PyConDE and EuroPython. She offers a neat categorization of the different aspects or facets of Python packaging:

The chart does not include all tools, but it's an excellent starting point.

One thing I'd change is to recommend rtx over pyenv: it works in a very similar way, but it's written in Rust (so it's super fast) and avoids the typical problems with shims.

3. The pyOpenSci Python packaging guide

https://www.pyopensci.org/python-package-guide/package-structure-code/intro.html

This long guide was written by Leah Wasser, Executive Director at pyOpenSci, with the help of many contributors (including myself). It goes into more depth on how to choose a specific build backend or workflow tool when developing and creating packages. Whether you're creating a complex Python package with compiled extensions or a plain, pure Python one, this guide will help you navigate the ecosystem quite effectively.

4. "Why not tell people to 'simply' use pyenv, poetry or anaconda"

https://www.bitecode.dev/p/why-not-tell-people-to-simply-use

Okay, this one is a bit more loaded and opinionated than the others. But I believe this blog post on Bite code! is very necessary: it's often too tempting to tell someone to "just" install yet another tool that will fix their problems, but this creates a massive amount of collective pain, and Python packaging is particularly affected. Please refrain from doing that!

(Which reminds me of this fantastic piece by Ned Batchelder, "How to be helpful online" https://nedbatchelder.com/blog/202009/how_to_be_helpful_online.html ❤️)

More often than not, installing yet another tool won't fix the user's original problem. So, instead, try to spend some time helping them debug it.

Would you like me to write a guide on how to debug Python installation issues? Leave a comment saying "yes" or, much better, a situation that has affected you recently, or even right now.

5. "Thoughts on Python packaging"

https://pradyunsg.me/blog/2023/01/21/thoughts-on-python-packaging/

If you've made it this far, it's because you're another Python packaging nerd, like myself. Congratulations! (Or... sorry?)

To end on a positive note, I'd highly recommend checking out this blog post by Pradyun Gedam, maintainer of pip and many other packaging projects, which offers a thoughtful perspective on where we stand now, how we got here, and where we go next.

Hope you liked this list of resources! If you'd like to read more about this, comment, recommend, follow, and spread the love.

Also, remember to always thank your open source maintainers, they will appreciate it 💖

How to use Databricks managed Delta tables in a Kedro project

Juan Luis Cano Rodríguez — Thu, 17 Aug 2023 08:55:07 +0000

In this blog post, we'll guide you through the specifics of building a Kedro project that uses managed Delta tables in Databricks using the newly-released ManagedTableDataSet.

What is Kedro?

Kedro is a toolbox for production-ready data science. It's an open-source Python framework that enables the development of clean data science code, borrowing concepts from software engineering and applying them to machine-learning projects. A Kedro project provides scaffolding for complex data and machine-learning pipelines. It enables developers to spend less time on tedious "plumbing" and focus on solving new problems.

What is Databricks?

Databricks is a unified data analytics platform designed for simplifying big data processing and free-form data exploration at any scale. Based on Apache Spark, an open-source distributed computing system, Databricks provides a collaborative cloud-based environment where users can process large amounts of data.

The platform provides collaborative workspaces (notebooks) and computational resources (clusters) to run code with. Clusters are groups of nodes that run Apache Spark. Notebooks are collaborative web-based interfaces where users can write and execute code on an attached cluster.

Why use Kedro on Databricks?

As we've described, Kedro offers a framework for building modular and scalable data pipelines, while Databricks provides a platform for running Spark jobs and managing data. You can combine Kedro and Databricks to build and deploy data pipelines and get the best of both worlds. Kedro's open-source framework will help you to build well-organised and maintainable pipelines, while Databricks' platform will provide you with the scalability you need to run your pipeline in production. Check out the recently-updated Kedro documentation for a set of workflow options for integrating Kedro projects and Databricks. (Additionally, the third-party kedro-mlflow plugin integrates mlflow capabilities inside Kedro projects to enhance reproducibility for machine learning experimentation).

What are Kedro datasets?

Kedro datasets are abstractions for loading and saving data, designed to decouple these operations from your business logic. These datasets manage reading and writing data from a variety of sources, while also ensuring consistency, tracking, and versioning. They allow users to maintain focus on core data processing, leaving data I/O tasks to Kedro.

What is managed data in Databricks?

To understand the concept of managed data in Databricks, it is first necessary to outline how Databricks organises data. At the highest level, Databricks uses metastores to store the metadata associated with data objects. Databricks Unity Catalog is one such metastore. It provides data governance and management across multiple Databricks workspaces. The metastore organises tables (where your data is stored) in a hierarchical structure.

The highest level of organisation in this hierarchy is the catalog. Catalogs are a collection of databases (also referred to as schemas in Databricks' terminology). A database is the second level of organisation in the Unity Catalog namespacing model. Databases are a collection of tables. The tables in a database are the third level of organisation in this hierarchy.

A table is structured data, stored as a directory of files on cloud object storage. By default, Databricks creates tables as Delta tables, which store data using the Delta Lake format. Delta Lake is an open-source storage format that offers ACID transactions, time travel and audit history.

Databricks tables belong to one of two categories: managed and unmanaged (external) tables. Databricks manages both the data and associated metadata of managed tables. If you drop a managed table, you will delete the underlying data. The data of a managed table resides in the location of the database to which it is registered.

On the other hand, for unmanaged tables, Databricks only manages the metadata. If you drop an unmanaged table, you will not delete the underlying data. These tables require a specified location during creation.

How to work with managed Delta tables using Kedro

Let's demonstrate how to use the ManagedTableDataSet with a simple example on Databricks. You'll need to open a new Databricks notebook and attach it to a cluster to follow along with the rest of this example, which runs on a workspace using a Hive metastore. We'll create a dataset containing weather readings, save it to a managed Delta table on Databricks, append some data, and access a specific table version to showcase Delta Lake's time travel capabilities.

Run every separate code snippet in this section in a new notebook cell.

The first steps are to set up your workspace by creating a weather database in your metastore and installing Kedro. Run the following SQL code to create the database:

%sql
create database if not exists weather;

To install Kedro and the ManagedTableDataSet, use the %pip magic:

%pip install kedro kedro-datasets[databricks.ManagedTableDataSet]

The first part of our program will create some weather data. We'll create a Spark DataFrame with four columns: date, location, temperature, and humidity to store our weather data. Then, we'll use a new instance of ManagedTableDataSet to save our DataFrame to a Delta table called 2023_06_22 (the day of the readings) in the weather database.

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructField, StringType, IntegerType, StructType)
from kedro_datasets.databricks import ManagedTableDataSet

spark_session = SparkSession.builder.getOrCreate()

# Define schema
schema = StructType([
    StructField("date", StringType(), True),
    StructField("location", StringType(), True),
    StructField("temperature", IntegerType(), True),
    StructField("humidity", IntegerType(), True),
])

# Create DataFrame
data = [
    ('2023-06-22', 'London', 27, 39),
    ('2023-06-22', 'Warsaw', 28, 40),
    ('2023-06-22', 'Bucharest', 32, 38),
]
spark_df = spark_session.createDataFrame(data, schema)

# Create a ManagedTableDataSet instance using a new table named '2023_06_22'
weather = ManagedTableDataSet(database="weather", table="2023_06_22")

# Save the DataFrame to the table
weather.save(spark_df)

To load our data back into a dataframe, we use the load method on ManagedTableDataSet:

# Load the table data into a DataFrame
reloaded = weather.load()

# Print the first 3 rows of the DataFrame
display(reloaded.take(3))

This code loads the data from the 2023_06_22 table in the weather database back into a Spark DataFrame and shows the first three rows of the data:

|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |

Let's say we take some more weather readings later in the day and want to add them to our Delta table. To do this, we can write to it using a new instance of ManagedTableDataSet initialised with "append" passed in as an argument to write_mode:

# Append new rows to the data
new_rows = [
    ('2023-06-22', 'Cairo', 35, 25),
    ('2023-06-22', 'Lisbon', 28, 44),
]
spark_df = spark_session.createDataFrame(new_rows, schema)

weather = ManagedTableDataSet(
    database="weather",
    table="2023_06_22",
    write_mode="append"
)
weather.save(spark_df)

The code above adds new rows for Cairo and Lisbon to our Delta table, which creates a new version of the table.

The ManagedTableDataSet class allows for saving data with three different write modes: overwrite, append, and upsert:

overwrite mode will completely replace the current data in the table with the new data.
append mode will add new data to the existing table.
upsert mode updates existing rows and inserts new rows, based on a specified primary key. Notably, if the table doesn't exist at save, the upsert mode behaves similarly to append, inserting data into a new table.
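To make the difference between the three modes concrete, here is a plain-Python toy model that works on lists of dictionaries. The save_rows function and its arguments are hypothetical and only mimic the behaviour described above; this is not how ManagedTableDataSet or Delta Lake implement these modes.

```python
def save_rows(existing, new_rows, write_mode="overwrite", primary_key=None):
    """Return the table contents after saving new_rows with the given mode."""
    if write_mode == "overwrite":
        # Replace everything with the new data
        return list(new_rows)
    if write_mode == "append":
        # Keep the existing rows and add the new ones at the end
        return list(existing) + list(new_rows)
    if write_mode == "upsert":
        if primary_key is None:
            raise ValueError("upsert needs a primary key")
        merged = {row[primary_key]: row for row in existing}
        for row in new_rows:
            # Update the row if the key exists, insert it otherwise
            merged[row[primary_key]] = row
        return list(merged.values())
    raise ValueError(f"unknown write mode: {write_mode}")


table = [
    {"location": "London", "temperature": 27},
    {"location": "Warsaw", "temperature": 28},
]
update = [
    {"location": "Warsaw", "temperature": 30},
    {"location": "Cairo", "temperature": 35},
]

overwritten = save_rows(table, update, "overwrite")
appended = save_rows(table, update, "append")
upserted = save_rows(table, update, "upsert", primary_key="location")
print(len(overwritten), len(appended), len(upserted))
```

With these sample rows, overwrite keeps only the two new rows, append ends up with four rows (Warsaw twice), and upsert ends up with three rows, with the Warsaw reading updated in place.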

Suppose we later want to access our data as it appeared earlier in the day when we had only taken three readings. The ManagedTableDataSet class supports accessing different versions of the Delta table. We can access a specific version by defining a Kedro Version and passing it into a new instance of ManagedTableDataSet:

from kedro.io import Version

# Load version 0 of the table
weather = ManagedTableDataSet(
    database="weather",
    table="2023_06_22",
    version=Version(load=0, save=None)
)
reloaded = weather.load()
display(reloaded)

# Load version 1 of the table
weather = ManagedTableDataSet(
    database="weather",
    table="2023_06_22",
    version=Version(load=1, save=None)
)
reloaded = weather.load()
display(reloaded)

You will see two rendered tables as the output of running this code. The first corresponds to version 0 of the 2023_06_22 table, while the second corresponds to version 1:

|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |

|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |
|2023-06-22|  Lisbon  |     28      |   44     |
|2023-06-22|  Cairo   |     35      |   25     |
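The way each save produces a new, numbered version can be pictured with another plain-Python toy model. The VersionedTable class is hypothetical and simply stores a snapshot per save; Delta Lake tracks versions through its transaction log rather than full copies.

```python
class VersionedTable:
    """Toy table that keeps a snapshot of its rows after every save."""

    def __init__(self):
        self._versions = []

    def save(self, rows, write_mode="overwrite"):
        current = self._versions[-1] if self._versions else []
        if write_mode == "append":
            snapshot = current + list(rows)
        else:
            snapshot = list(rows)
        # Every save adds a new version; older versions stay untouched
        self._versions.append(snapshot)

    def load(self, version=None):
        # No version means "latest", like a load without time travel
        index = len(self._versions) - 1 if version is None else version
        return list(self._versions[index])


table = VersionedTable()
table.save(["Bucharest", "London", "Warsaw"])  # creates version 0
table.save(["Cairo", "Lisbon"], write_mode="append")  # creates version 1

print(len(table.load(version=0)), len(table.load(version=1)))
```

Version 0 still holds the three original readings, while version 1 holds all five, which matches the two tables shown above.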

And that's it! We've put together a simple program to show some of the usual tasks that ManagedTableDataSet facilitates, making it easy to save, load, and manage versions of your data in Delta tables on Databricks.

Conclusion

Databricks is a fast-growing deployment vector for Kedro projects. This blog post has demonstrated how to combine the power of both Kedro and Databricks with an open-source ManagedTableDataSet that enables streamlined data I/O operations when deploying a Kedro project on Databricks. ManagedTableDataSet empowers you to spend more time implementing the business logic of your data pipeline or machine learning workflow and less time manually handling data.

A new Kedro dataset for Spark Structured Streaming

Juan Luis Cano Rodríguez — Wed, 12 Jul 2023 07:36:25 +0000

This article guides data practitioners on how to set up a Kedro project to use the new SparkStreaming Kedro dataset, with example use cases, and a deep-dive on some design considerations. It's meant for data practitioners familiar with Kedro so we'll not be covering the basics of a project, but you can familiarise yourself with them in the Kedro documentation.

What is Kedro?

Kedro is an open-source Python toolbox that applies software engineering principles to data science code. It makes it easier for a team to follow these principles, which reduces the time spent rewriting data science experiments so that they are fit for production.

Kedro was born at QuantumBlack to solve the challenges faced regularly in data science projects and promote teamwork through standardised team workflows. It is now hosted by the LF AI & Data Foundation as an incubating project.

What are Kedro datasets?

What is Spark Structured Streaming?

Spark Structured Streaming is built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data, and the Spark SQL engine will run it incrementally and continuously and update the final result as streaming data continues to arrive.

Integrating Kedro and Spark Structured Streaming

Kedro is easily extensible for your own workflows and this article explains one of the ways to add new functionality. To enable Kedro to work with Spark Structured Streaming, a team inside QuantumBlack Labs developed a new Spark Streaming Dataset, as the existing Kedro Spark dataset was not compatible with Spark Streaming use cases. To ensure seamless streaming, the new dataset has a checkpoint location specification to avoid data duplication in streaming use cases and it uses .start() at the end of the _save method to initiate the stream.

Set up a project to integrate Kedro with Spark Structured Streaming

The project uses a Kedro dataset to build a structured data pipeline that can read and write data streams with Spark Structured Streaming and process data streams in real time. You need to add two separate Hooks to the Kedro project to enable it to function as a streaming application.

Integration involves the following steps:

Create a Kedro project.
Register the necessary PySpark and streaming related Hooks.
Configure the custom dataset in the catalog.yml file, defining the streaming sources and sinks.
Use Kedro’s new dataset for Spark Structured Streaming to store intermediate dataframes generated during the Spark streaming process.

Create a Kedro project

Ensure you have installed a version of Kedro greater than version 0.18.9 and kedro-datasets greater than version 1.4.0.

pip install "kedro~=0.18.9" "kedro-datasets~=1.4.0"

Create a new Kedro project using the Kedro pyspark starter:

kedro new --starter=pyspark

Register the necessary PySpark and streaming related Hooks

To work with multiple streaming nodes, two hooks are required. The first is for integrating PySpark: see Build a Kedro pipeline with PySpark for details. You will also need a Hook for running a streaming query without termination unless an exception occurs.

Add the following code to src/$your_kedro_project_name/hooks.py:

from pyspark import SparkConf
from pyspark.sql import SparkSession

from kedro.framework.hooks import hook_impl


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialises a SparkSession using the config
        defined in project's conf folder.
        """

        # Load the spark configuration in spark.yaml using the config loader
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session
        spark_session_conf = (
            SparkSession.builder.appName(context._package_name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )

        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")


class SparkStreamsHook:
    @hook_impl
    def after_pipeline_run(self) -> None:
        """Starts a spark streaming await session
        once the pipeline reaches the last node.
        """

        spark = SparkSession.builder.getOrCreate()
        spark.streams.awaitAnyTermination()

Then register the Hooks in src/$your_kedro_project_name/settings.py:

"""Project settings. There is no need to edit this file unless you want to change values
from the Kedro defaults. For further information, including these default values, see
https://kedro.readthedocs.io/en/stable/kedro_project_setup/settings.html."""

from .hooks import SparkHooks, SparkStreamsHook

HOOKS = (SparkHooks(), SparkStreamsHook())

# Instantiated project hooks.
# from streaming.hooks import ProjectHooks
# HOOKS = (ProjectHooks(),)

# Installed plugins for which to disable hook auto-registration.
# DISABLE_HOOKS_FOR_PLUGINS = ("kedro-viz",)

# Class that manages storing KedroSession data.
# from kedro.framework.session.shelvestore import ShelveStore
# SESSION_STORE_CLASS = ShelveStore
# Keyword arguments to pass to the `SESSION_STORE_CLASS` constructor.
# SESSION_STORE_ARGS = {
#     "path": "./sessions"
# }

# Class that manages Kedro's library components.
# from kedro.framework.context import KedroContext
# CONTEXT_CLASS = KedroContext

# Directory that holds configuration.
# CONF_SOURCE = "conf"

# Class that manages how configuration is loaded.
# CONFIG_LOADER_CLASS = ConfigLoader
# Keyword arguments to pass to the `CONFIG_LOADER_CLASS` constructor.
# CONFIG_LOADER_ARGS = {
#       "config_patterns": {
#           "spark" : ["spark*/"],
#           "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
#       }
# }

# Class that manages the Data Catalog.
# from kedro.io import DataCatalog
# DATA_CATALOG_CLASS = DataCatalog

How to set up your Kedro project to read data from streaming sources

Once you have set up your project, you can use the new Kedro Spark streaming dataset. To read from a streaming JSON file, configure the data catalog in conf/base/catalog.yml as follows:

raw_json:
  type: spark.SparkStreamingDataSet
  filepath: data/01_raw/stream/inventory/
  file_format: json

Additional options can be configured via the load_args key.

int.new_inventory:
   type: spark.SparkStreamingDataSet
   filepath: data/02_intermediate/inventory/
   file_format: csv
   load_args:
     header: True

How to set up your Kedro project to write data to streaming sinks

All the additional arguments can be kept under the save_args key:

processed.sensor:
   type: spark.SparkStreamingDataSet
   file_format: csv
   filepath: data/03_primary/processed_sensor/
   save_args:
     output_mode: append
     checkpoint: data/04_checkpoint/processed_sensor
     header: True

Note that when you use the Kafka format, the respective packages should be added to the spark.yml configuration as follows:

spark.jars.packages: org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1

Design considerations

Pipeline design

In order to benefit from Spark's internal query optimisation, we recommend that any interim datasets are stored as memory datasets.

All streams start at the same time, so any nodes that have a dependency on another node that writes to a file sink (i.e. the input to that node is the output of another node) will fail on the first run. This is because there are no files in the file sink for the stream to process when it starts.

We recommend that you either keep intermediate datasets in memory or split out the processing into two pipelines and start by triggering the first pipeline to build up some initial history.

Feature creation

Be aware that windowing operations only allow windowing on time columns.

Watermarks must be defined for joins. Only certain types of joins are allowed, and these depend on the file types (stream-stream, stream-static) which makes joining of multiple tables a little complex at times. For further information or advice about join types and watermarking, take a look at the PySpark documentation or reach out on the Kedro Slack workspace.

Logging

When initiated, the Kedro pipeline will download the JAR required for Spark Kafka. After the first run, it won't download the file again but will simply retrieve it from where the previously downloaded file was stored.

For each node, the logs for the following will be shown: Loading data, Running nodes, Saving data, Completed x out of y tasks.

The completed log doesn't mean that the stream processing in that node has stopped. It means that the Spark plan has been created, and if the output dataset is being saved to a sink, the stream has started.

Once Kedro has run through all the nodes and the full Spark execution plan has been created, you'll see INFO Pipeline execution completed successfully.

This doesn't mean the stream processing has stopped as the post run hook keeps the Spark Session alive. As new data comes in, new Spark logs will be shown, even after the "Pipeline execution completed" log.

If there is an error in the input data, the Spark error logs will come through and Kedro will shut down the SparkContext and all the streams within it.

In summary

In this article, we explained how to take advantage of one of the ways to extend Kedro by building a new dataset to create streaming pipelines. We created a new Kedro project using the Kedro pyspark starter and illustrated how to work with Hooks, adding them to the Kedro project to enable it to function as a streaming application. The dataset was then easy to configure through the Kedro data catalog, where we defined the streaming sources and sinks.

There are currently some limitations because it is not yet ready for use with a service broker, e.g. Kafka, as an additional JAR package is required.

If you want to find out more about the ways to extend Kedro, take a look at the advanced Kedro documentation for more about Kedro plugins, datasets, and Hooks.

Contributors

This post was created by Tingting Wan, Tom Kurian, and Haris Michailidis, who are all Data Engineers in the London office of QuantumBlack, AI by McKinsey.

Get up to speed: how to build a custom Kedro runner

Juan Luis Cano Rodríguez — Thu, 22 Jun 2023 09:46:25 +0000

In Kedro, runners are the execution mechanism for data science and machine learning pipelines. The default behaviour of all of Kedro’s built-in runners is to halt pipeline execution if an error occurs that is significant enough to cause any of the nodes to fail, as shown in the following diagram:

In the diagram, the entire run aborts when it encounters a node that it cannot run, terminating all other sections or branches of the pipeline, even those that it could have run.

The custom runner described in this article was specifically developed for a top player in the mining industry that uses Kedro to construct data pipelines for BI dashboards essential for operational excellence.

The client’s pipeline is designed to be resilient towards node failures. Certain nodes operate independently of each other, and especially during the development and exploration stages, the failure of a single node does not necessitate the termination of the entire Kedro run. The desired behaviour is as shown below:

In the diagram, the runner meets a node that cannot run but finds other sections or branches that it can execute.

The client relies on Kedro to execute a substantial pipeline that retrieves data from various sources. Some of the input datasets are manually created, which introduces the possibility of errors if entries are mistyped or omitted. By allowing the pipeline to continue and bypass nodes as they encounter failures, it becomes possible to compile a comprehensive list of data issues during a single run and address them collectively.

In comparison, the default Kedro approach is considerably more time-consuming as it halts the pipeline upon the failure of a single node, leading to a repetitive cycle of fixing one issue, rerunning the pipeline to encounter the next issue, fixing that, and so on.

Executing all feasible nodes within the pipeline provides an additional advantage. In cases where no data issues arise, completing the pipeline allows the available metrics to be displayed on a BI dashboard, ensuring service continuity. For instance, if only one data source is corrupted, the BI metrics that depend on that specific data need to be withheld, but all others can be showcased. In contrast, the default Kedro behaviour would render all metrics unavailable until the single dataset issue is resolved.

The solution: a customised Kedro runner

As an open-source project, Kedro enables you to define a custom runner for your project. The team took the open-source code for Kedro’s sequential runner and extended it, since the code didn’t need any parallelisation.

“One of the reasons we selected Kedro is that it is open source and highly extensible. We knew from the outset that we could make our own customisations”.

The team created a soft-fail runner to transform errors into warnings, allowing the pipeline to continue executing to the best of its ability while providing a report of any nodes that failed, so that data issues can be addressed. At that point, the pipeline run can be finalised by executing only those missing nodes separately, using appropriate Kedro syntax.

The resulting SoftFailRunner is an implementation of AbstractRunner that runs a pipeline sequentially using a topological sort of provided nodes. Unlike the built-in SequentialRunner, this runner does not terminate the pipeline but runs any remaining nodes as long as their dependencies are fulfilled. The SoftFailRunner implementation adds two arguments: --from-nodes and --runner. The essential code for the SoftFailRunner is shown below and the full code can be found on GitHub.

The logic behind the runner is as follows:

Addition of a new skip_nodes variable to keep track of which nodes should be skipped.
Every time a node is about to run - the skip_nodes list is checked.
When a node fails, all of its descendants are added into skip_nodes with Breadth-first search (BFS).
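The three steps above can be sketched in plain Python. This is a minimal illustration, not the actual SoftFailRunner code: the function names `descendants` and `soft_fail_run`, and the representation of the pipeline as a dictionary of parent nodes, are assumptions made for this example.

```python
from collections import deque
from graphlib import TopologicalSorter


def descendants(children, start):
    """Collect every node reachable from `start` using breadth-first search."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def soft_fail_run(funcs, parents):
    """Run every node whose ancestors succeeded; return (done, failed, skipped)."""
    # Invert the parent map so that descendants can be looked up on failure.
    children = {}
    for node, deps in parents.items():
        for dep in deps:
            children.setdefault(dep, []).append(node)

    done, failed, skip_nodes = [], [], set()
    for node in TopologicalSorter(parents).static_order():
        if node in skip_nodes:
            continue  # an ancestor failed, so this node cannot run
        try:
            funcs[node]()
            done.append(node)
        except Exception as exc:
            # Turn the error into a warning and keep going.
            print(f"WARNING: node {node!r} failed: {exc}")
            failed.append(node)
            skip_nodes |= descendants(children, node)
    return done, failed, sorted(skip_nodes)


def ok():
    pass


def broken():
    raise ValueError("mistyped entry")


# Two independent branches; the first one fails at its very first node.
parents = {"clean_a": ["load_a"], "report_a": ["clean_a"], "clean_b": ["load_b"]}
funcs = {"load_a": broken, "clean_a": ok, "report_a": ok, "load_b": ok, "clean_b": ok}
done, failed, skipped = soft_fail_run(funcs, parents)
```

In this toy pipeline the failing `load_a` node causes `clean_a` and `report_a` to be skipped, while the unrelated `load_b` and `clean_b` nodes still run.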

In summary

The customised Kedro runner was straightforward to create and a satisfactory solution to enable maximum efficiency when handling this particular pipeline and dataset.

“These results could certainly be achieved with an orchestrator, but using an open-source project with customisation is a quick win for delivering business value”.

Collaborative experiment tracking in Kedro-Viz

Juan Luis Cano Rodríguez — Fri, 02 Jun 2023 14:09:58 +0000

When training a model in machine learning, the goal is to determine the optimal configuration of attributes such as hyper-parameters, metrics, and training data. The process of identifying the best combinations requires running a lot of experiments and comparing them. As I mentioned in my previous article, experiment tracking is a way to record all the metadata you need to compare machine-learning experiments and recreate them for your project.

What is Kedro-Viz?

Kedro-Viz is an interactive development tool for building and visualising data science pipelines with Kedro. It enables you to monitor the status of your ML project, present it to stakeholders, and easily bring new team members up to speed. You can try it out using our hosted demo.

“There's no better method to give an overview of a pipeline's structure in such an engaging, interactive, and thorough way. Our asset's pipelines are very complex, but are structured with modular pipelines, so being able to show the overall structure at the modular pipeline level, before jumping into each individual pipeline helps prevent the audience from getting overwhelmed by the number of nodes and datasets shown”.

Senior Data Scientist at Consultancy

What is experiment tracking in Kedro-Viz?

Experiment tracking on Kedro-Viz enables users to select, plot, and compare how multiple metrics change over time, and identify the best-performing ML experiment, with no additional dependencies to manage or infrastructure needed.

The video below demonstrates experiment tracking on Kedro-Viz:

During a project with multiple team members, you could end up with a scenario where the results of your experiments are spread across many machines because people are iterating on their individual computers. This makes the tracking process difficult to manage at a team level, as suggested by this feedback from our users.

"You might train one model locally on your computer. You might train another one in the cloud. Joe might run another pipeline or another experiment. Having all of those experiments in one place as a single source of truth is really powerful.

"If we could write our metrics files to an S3 bucket and then run experiment tracking pointing at that S3 bucket, that simplifies our workflow in many different ways and would be really helpful. And it would make Kedro experiment tracking just as easy, if not easier, than MLFlow for us."

"Can you use an existing database so that we can keep track of runs happening in different places?"

We have found a way to address this pain point and enable you to collaborate more easily. We are excited to announce that we've launched collaborative experiment tracking in Kedro-Viz 6.2.0. The new feature enables a team of users to log their experiments to a shared cloud storage service and view and compare each other's experiments in their own experiment tracking view. This simplifies their workflow, provides a single ‘source of truth’, and encourages multi-user collaboration.

We are releasing this feature in stages across different versions, and the first phase is Kedro-Viz 6.2.0. This version enables users to read experiments of other users that are stored on Amazon S3 or similar storage solutions on other cloud providers, as long as they are supported by fsspec. Future versions of collaborative experiment tracking aim to improve the user experience through automatic reloading and optimisation by caching.

Get started with collaborative experiment tracking

Follow these steps to set up collaborative experiment tracking in Kedro-Viz:

Step 1: Update Kedro-Viz

Ensure you have the latest version of Kedro-Viz (6.2.0 or later).

pip install kedro-viz --upgrade

Step 2: Set up cloud storage

Kedro-Viz uses fsspec to save and read session_store files from a variety of data stores, including local file systems, network file systems, cloud object stores (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage), and HDFS.

Set up a central cloud storage repository, such as an AWS S3 bucket, to store all your team's experiments.

Step 3: Configure your Kedro project

Locate the settings.py file in your Kedro project directory and add the following:

from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore
from pathlib import Path

SESSION_STORE_CLASS = SQLiteStore
SESSION_STORE_ARGS = {
    "path": str(Path(__file__).parents[2] / "data"),
    "remote_path": "s3://my-bucket-name/path/to/experiments",
}

Step 4: Set up a unique username

Kedro-Viz saves your experiments as SQLite database files on the central cloud storage. To ensure that all users have unique filenames, you need to set up your **KEDRO_SQLITE_STORE_USERNAME** in the environment variables. By default, Kedro-Viz will take your computer username if this is not specified.

export KEDRO_SQLITE_STORE_USERNAME="your_unique_username"
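The fallback described in this step can be pictured with a small sketch. It is not Kedro-Viz's actual code: the helper names `resolve_store_username` and `remote_store_name`, as well as the `<username>.db` naming scheme, are assumptions made purely for illustration.

```python
import getpass
import os


def resolve_store_username(env=None):
    """Prefer KEDRO_SQLITE_STORE_USERNAME, else fall back to the OS login name."""
    env = os.environ if env is None else env
    return env.get("KEDRO_SQLITE_STORE_USERNAME") or getpass.getuser()


def remote_store_name(env=None):
    # Hypothetical naming scheme: one SQLite file per user on the shared storage.
    return f"{resolve_store_username(env)}.db"


# Passing a dictionary instead of the real environment keeps the example reproducible.
name = remote_store_name({"KEDRO_SQLITE_STORE_USERNAME": "huong"})
```

Whatever the real file naming looks like, the point is the same: two team members with the same computer username would collide unless at least one of them sets the variable explicitly.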

Step 5: Configure cloud storage credentials

From Kedro-Viz version 6.2, the only way to set up credentials for accessing your cloud storage is through environment variables, as shown below for Amazon S3 cloud storage.

export AWS_ACCESS_KEY_ID="your_access_key_id"
export AWS_SECRET_ACCESS_KEY="your_secret_access_key"
export AWS_REGION="your_aws_region"

In the screenshot below we show an example of the session store and Kedro-Viz output for three team members (Huong, Tynan, and Rashida):

Session store showing the 3 objects for Huong, Tynan, and Rashida.

Three separate Kedro-Viz runs by Huong, Tynan, and Rashida.

This tutorial offers a quick run-through of the configuration process. For further information, check out the documentation on the experiment tracking feature and keep up-to-date with the latest news about Kedro and Kedro-Viz on our Slack workspace.

Many thanks to the Kedro-Viz team especially @Rashida Kanchwala for contributing to this post.

Lightning-fast queries with Polars

Juan Luis Cano Rodríguez — Thu, 25 May 2023 12:33:46 +0000

This post is an adaptation of the one I originally published on the Orchest blog. Lots of things have changed in Polars since I wrote it, but at the time of writing these lines the post still has value. Enjoy!

Polars is an open-source project that provides in-memory dataframes for Python and Rust. Despite its young age (its first commit was a mere two years ago, in the middle of the COVID-19 pandemic) it has already gained lots of popularity due to its "lightning-fast" performance and the expressiveness of its API.

One of the most interesting things about Polars is that it offers two modes of operation:

The eager mode is somewhat similar to how pandas works: operations are executed immediately and their result is available in memory. Every operation in a chain would need to allocate a DataFrame however, which is less than ideal.
The lazy mode, on the other hand, builds an optimized query plan that exploits parallelism as much as possible: Polars applies several simplification techniques and pushes computations to accelerate the running time as much as possible.
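To make the difference between the two modes more tangible, here is a toy model of lazy evaluation in plain Python. It is only a sketch of the general idea and not how Polars is implemented: the `LazyRows` class and its methods are invented for this example, and it performs none of the optimizations of a real query planner.

```python
class LazyRows:
    """Toy lazy query: records steps and only runs them on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Nothing is computed here; the step is only recorded.
        return LazyRows(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyRows(self._rows, self._steps + (("select", columns),))

    def collect(self):
        # Single pass over the data: no intermediate table per step.
        out = []
        for row in self._rows:
            keep = True
            for kind, arg in self._steps:
                if kind == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:
                    row = {name: row[name] for name in arg}
            if keep:
                out.append(row)
        return out


rows = [
    {"Id": 1, "Score": 5, "Title": "a"},
    {"Id": 2, "Score": -1, "Title": "b"},
]
query = LazyRows(rows).filter(lambda r: r["Score"] > 0).select("Id", "Score")
result = query.collect()  # the work happens only here
```

An eager version of the same chain would build one full intermediate list after `filter` and another after `select`, which is the allocation cost mentioned above.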

These ideas are not new: in fact, in my blog post about Vaex we covered its lazy computation capabilities. However, Polars takes them one step further by offering a functional API that is delightful to use.

The other secret sauce of Polars is Apache Arrow. While other libraries use Arrow for things like reading Parquet files, Polars is tightly coupled with it: by using a Rust-native implementation of the Arrow memory format for its columnar storage, Polars can leverage the highly optimized Arrow data structures and focus on the data manipulation operations.

Interested? Read on!

Polars popularity is growing fast (https://twitter.com/braaannigan/status/1526901314978029568)

First steps with Polars

For this example, we will use a sample of Stack Overflow questions and their tags obtained from Kaggle. Our generic goal is to display the most highly voted Python questions.

You can install Polars with conda/mamba or pip:

mamba install -y "polars=0.13.37"  
pip install "polars==0.13.37"

Even though Polars is written in Rust, it distributes precompiled binary wheels on PyPI, so pip install will just work on all major Python versions from 3.6 onwards.

Let's load the Questions and Tags CSV files using pl.read_csv:

import polars as pl  

df = pl.read_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")  
tags = pl.read_csv("/data/stacksample/Tags.csv")

The type of both objects is `polars.internals.frame.DataFrame`, "a two-dimensional data structure that represents data as a table with rows and columns" (reference docs). Both dataframes have millions of rows, and the Questions one takes almost 2 GB of memory:

In [7]: len(df), len(tags)  
Out[7]: (1264216, 3750994)  

In [8]: print(f"Estimated size: {df.estimated_size() >> 20} MiB")  
Estimated size: 1865 MiB
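The `>> 20` in the snippet above is a right bit shift: it integer-divides the byte count by 2**20, which converts bytes to MiB. A quick check in plain Python, using a made-up byte count rather than the real size of the dataset:

```python
size_bytes = 1_955_594_240  # hypothetical byte count, for illustration only

size_mib = size_bytes >> 20  # same as size_bytes // 2**20
assert size_mib == size_bytes // (1024 * 1024)
print(f"Estimated size: {size_mib} MiB")
```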

Polars dataframes have some typical methods we know from pandas to inspect the data. Notice that calling the print function on a DataFrame produces a tidy ASCII representation, in addition to the fancy HTML representation available in Jupyter:

In [9]: print(df.head(3))  # No `print` needed on Jupyter  
shape: (3, 7)  
┌─────┬─────────────┬─────────────────┬─────────────────┬───────┬─────────────────┬────────────────┐  
│ Id  ┆ OwnerUserId ┆ CreationDate    ┆ ClosedDate      ┆ Score ┆ Title           ┆ Body           │  
│ --- ┆ ---         ┆ ---             ┆ ---             ┆ ---   ┆ ---             ┆ ---            │  
│ i64 ┆ str         ┆ str             ┆ str             ┆ i64   ┆ str             ┆ str            │  
╞═════╪═════════════╪═════════════════╪═════════════════╪═══════╪═════════════════╪════════════════╡  
│ 80  ┆ 26          ┆ 2008-08-01T13:5 ┆ NA              ┆ 26    ┆ SQLStatement.ex ┆ <p>I've        │  
│     ┆             ┆ 7:07Z           ┆                 ┆       ┆ ecute() -       ┆ written a      │  
│     ┆             ┆                 ┆                 ┆       ┆ multipl...      ┆ database       │  
│     ┆             ┆                 ┆                 ┆       ┆                 ┆ gener...       │  
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 90  ┆ 58          ┆ 2008-08-01T14:4 ┆ 2012-12-26T03:4 ┆ 144   ┆ Good branching  ┆ <p>Are there   │  
│     ┆             ┆ 1:24Z           ┆ 5:49Z           ┆       ┆ and merging     ┆ any really     │  
│     ┆             ┆                 ┆                 ┆       ┆ tutor...        ┆ good tut...    │  
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 120 ┆ 83          ┆ 2008-08-01T15:5 ┆ NA              ┆ 21    ┆ ASP.NET Site    ┆ <p>Has anyone  │  
│     ┆             ┆ 0:08Z           ┆                 ┆       ┆ Maps            ┆ got experience │  
│     ┆             ┆                 ┆                 ┆       ┆                 ┆ cre...         │  
└─────┴─────────────┴─────────────────┴─────────────────┴───────┴─────────────────┴────────────────┘  

In [10]: print(df.describe())
shape: (5, 8)  
┌──────────┬─────────────┬─────────────┬──────────────┬────────────┬───────────┬───────┬──────┐  
│ describe ┆ Id          ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score     ┆ Title ┆ Body │  
│ ---      ┆ ---         ┆ ---         ┆ ---          ┆ ---        ┆ ---       ┆ ---   ┆ ---  │  
│ str      ┆ f64         ┆ str         ┆ str          ┆ str        ┆ f64       ┆ str   ┆ str  │  
╞══════════╪═════════════╪═════════════╪══════════════╪════════════╪═══════════╪═══════╪══════╡  
│ mean     ┆ 2.1327e7    ┆ null        ┆ null         ┆ null       ┆ 1.781537  ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ std      ┆ 1.1514e7    ┆ null        ┆ null         ┆ null       ┆ 13.663886 ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ min      ┆ 80.0        ┆ null        ┆ null         ┆ null       ┆ -73.0     ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ max      ┆ 4.014338e7  ┆ null        ┆ null         ┆ null       ┆ 5190.0    ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ median   ┆ 2.1725415e7 ┆ null        ┆ null         ┆ null       ┆ 0.0       ┆ null  ┆ null │  
└──────────┴─────────────┴─────────────┴──────────────┴────────────┴───────────┴───────┴──────┘  

In [11]: print(tags["Tag"].value_counts().head())
shape: (5, 2)  
┌────────────┬────────┐  
│ Tag        ┆ counts │  
│ ---        ┆ ---    │  
│ str        ┆ u32    │  
╞════════════╪════════╡  
│ javascript ┆ 124155 │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ java       ┆ 115212 │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ c#         ┆ 101186 │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ php        ┆ 98808  │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ android    ┆ 90659  │  
└────────────┴────────┘

Following a terminology similar to pandas, Polars dataframes contain several columns of type polars.internals.series.Series, each of them with a different data type:

In [12]: df["Title"].head(5)  
Out[12]: shape: (5,)  
Series: 'Title' [str]  
[  
"SQLStatement.e...  
"Good branching...  
"ASP.NET Site M...  
"Function for c...  
"Adding scripti...  
]  

In [13]: df.dtypes  
Out[13]: [polars.datatypes.Int64,  
polars.datatypes.Utf8,  
polars.datatypes.Utf8,  
polars.datatypes.Utf8,  
polars.datatypes.Int64,  
polars.datatypes.Utf8,  
polars.datatypes.Utf8]

Expressions as chained operations on columns

The essential building blocks in Polars are expressions: functions that receive a Series and transform it into another Series. Expressions start with a root, and then you can chain more operations:

(  
   pl.col("Score")  # Root of the Expression (a single column)  
   .mean()  # Returns another Expression  
)

The most interesting feature is that expressions are not bound to a specific object, but instead they are generic. Chains of expressions define the computation, which is materialized by a DataFrame method (acting as an execution context).

Sounds too abstract? See it in action:

In [20]: print(df.select(pl.col("Score").mean()))  
shape: (1, 1)  
┌──────────┐  
│ Score    │  
│ ---      │  
│ f64      │  
╞══════════╡  
│ 1.781537 │  
└──────────┘
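The idea of an expression that is not bound to any particular dataframe can also be modelled in a few lines of plain Python. This is a toy sketch, not the Polars implementation: the `Col` class and the `select` function are invented for this example, and the tables are plain dictionaries of lists.

```python
from statistics import mean


class Col:
    """Toy expression: a column name plus a chain of pending operations."""

    def __init__(self, name, ops=()):
        self.name = name
        self.ops = tuple(ops)

    def mean(self):
        # Returns a new expression; nothing is computed yet.
        return Col(self.name, self.ops + (lambda values: [mean(values)],))

    def evaluate(self, table):
        values = list(table[self.name])
        for op in self.ops:
            values = op(values)
        return values


def select(table, expr):
    """Execution context: materialises an expression against a concrete table."""
    return {expr.name: expr.evaluate(table)}


expr = Col("Score").mean()  # defined once, not tied to any table

questions = {"Score": [26, 144, 21]}
answers = {"Score": [1, 2, 3]}
mean_questions = select(questions, expr)
mean_answers = select(answers, expr)
```

The same `expr` object is reused against two different tables, which mirrors how a Polars expression only produces a result once a method such as `df.select` gives it data to work on.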

The df.select method can do much more than just selecting columns: it can execute any column-wise expression. In fact, when passed a list of such expressions, it can broadcast them automatically if the dimensions are coherent, and it will execute them in parallel:

In [21]: print(df.select([  
   ...:   pl.col("Id").n_unique().alias("num_unique_users"),  
   ...:   pl.col("Score").mean().alias("mean_score"),  
   ...:   pl.col("Title").str.lengths().max().alias("max_title_length"),  
   ...:   # To run the above in all text columns,  
   ...:   # you can filter by data type:  
   ...:   # pl.col(pl.Utf8).str.lengths().max().suffix("_max_length"),  
   ...: ]))  
shape: (1, 3)  
┌──────────────────┬────────────┬──────────────────┐  
│ num_unique_users ┆ mean_score ┆ max_title_length │  
│ ---              ┆ ---        ┆ ---              │  
│ u32              ┆ f64        ┆ u32              │  
╞══════════════════╪════════════╪══════════════════╡  
│ 1264216          ┆ 1.781537   ┆ 204              │  
└──────────────────┴────────────┴──────────────────┘

The power of laziness

It is now time to start narrowing down the analysis a bit and focus on the questions that are related to Python. Notice that Polars algorithms require all the data to live in memory, and therefore when using the eager API you have to apply the usual caveats about large datasets. As a result, since the questions dataset is already quite big, performing a .join operation with the tags data can crash the kernel:

# Don't try this at home unless you have enough RAM!  
# (  
#     df  
#     .join(tags, on="Id")  
#     .filter(pl.col("Tag").str.contains(r"(?i)python"))  
#     .sort("Id")  
# )

But fear not, because Polars has the perfect solution: switching to lazy mode! By prefixing our chain of operations with .lazy() and calling .collect() at the end, you can leverage Polars' optimization capabilities to their fullest potential, and perform operations that would otherwise be impossible:

In [22]: q_python = (  
   ...:   df.lazy()  # Notice the .lazy() call  
   ...:   # The input of a lazy join needs to be lazy  
   ...:   # We use a 'semi' join, like 'inner' but discarding extra columns  
   ...:   .join(tags.lazy(), on="Id", how="semi")  
   ...:   .filter(pl.col("Tag").str.contains(r"(?i)python"))  
   ...:   .sort("Id")  
   ...: ).collect()  # Call .collect() at the end  
   ...: print(q_python.head(3))  
shape: (3, 7)  
┌───────┬─────────────┬──────────────────┬────────────┬───────┬──────────────────┬─────────────────┐  
│ Id    ┆ OwnerUserId ┆ CreationDate     ┆ ClosedDate ┆ Score ┆ Title            ┆ Body            │  
│ ---   ┆ ---         ┆ ---              ┆ ---        ┆ ---   ┆ ---              ┆ ---             │  
│ i64   ┆ str         ┆ str              ┆ str        ┆ i64   ┆ str              ┆ str             │  
╞═══════╪═════════════╪══════════════════╪════════════╪═══════╪══════════════════╪═════════════════╡  
│ 11060 ┆ 912         ┆ 2008-08-14T13:59 ┆ NA         ┆ 18    ┆ How should I     ┆ <p>This is a    │  
│       ┆             ┆ :21Z             ┆            ┆       ┆ unit test a      ┆ difficult and   │  
│       ┆             ┆                  ┆            ┆       ┆ code-ge...       ┆ open-...        │  
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 17250 ┆ 394         ┆ 2008-08-20T00:16 ┆ NA         ┆ 24    ┆ Create an        ┆ <p>I'm creating │  
│       ┆             ┆ :40Z             ┆            ┆       ┆ encrypted ZIP    ┆ an ZIP file     │  
│       ┆             ┆                  ┆            ┆       ┆ file in ...      ┆ with...         │  
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 19030 ┆ 745         ┆ 2008-08-20T22:50 ┆ NA         ┆ 2     ┆ How to check set ┆ <p>I have a     │  
│       ┆             ┆ :55Z             ┆            ┆       ┆ of files         ┆ bunch of files  │  
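│       ┆             ┆                  ┆            ┆       ┆ confor...        ┆ (TV e...        │
└───────┴─────────────┴──────────────────┴────────────┴───────┴──────────────────┴─────────────────┘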

In fact, if your raw CSV is so big that it doesn't fit in RAM to begin with, Polars offers a lazy way of reading the file too, using scan_csv:

# We create the query plan separately  
plan = (  
   # scan_csv returns a lazy dataframe already  
   pl.scan_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")  
   .join(tags.lazy(), on="Id", how="semi")  
   .filter(pl.col("Tag").str.contains(r"(?i)python"))  
   .sort("Score", reverse=True)  
   .limit(1_000)  
)  
top_voted_python_qs = plan.collect()
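Since the queries above rely on a "semi" join, it may help to see how it differs from an inner join. The sketch below models the semantics in plain Python with made-up rows; the `inner_join` and `semi_join` helpers are written for this example and say nothing about how Polars implements joins internally.

```python
questions = [
    {"Id": 1, "Title": "How to sort a list?"},
    {"Id": 2, "Title": "Spring beans"},
]
tags = [
    {"Id": 1, "Tag": "python"},
    {"Id": 1, "Tag": "sorting"},
    {"Id": 3, "Tag": "java"},
]


def inner_join(left, right, on):
    # One output row per matching pair, carrying columns from both sides.
    return [
        {**lrow, **rrow}
        for lrow in left
        for rrow in right
        if lrow[on] == rrow[on]
    ]


def semi_join(left, right, on):
    # Left rows that have at least one match; right columns are not added.
    keys = {rrow[on] for rrow in right}
    return [lrow for lrow in left if lrow[on] in keys]


inner = inner_join(questions, tags, on="Id")
semi = semi_join(questions, tags, on="Id")
```

Here the inner join returns question 1 twice (once per tag) with a "Tag" column attached, whereas the semi join returns it once and only with the columns of the left table.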

If you are curious about how Polars is doing all this work under the hood, notice that you can visualize the query plan!

Polars visualization of a query plan (not optimized)

Working with columns of lists

Notice that, in the previous section, we did a "semi" join to filter the questions, but we still don't have the list of tags associated with such questions. To achieve that, we will use one of the most surprisingly pleasant features of Polars: its list-column handling.

In [30]: tag_list_lazy = (  
   ...:   tags.lazy()  
   ...:   .groupby("Id").agg(  
   ...:     pl.col("Tag")  
   ...:     .list()  # Convert to a list of strings  
   ...:     .alias("TagList")  
   ...:   )  
   ...: )  
   ...: print(tag_list_lazy.limit(5).collect())  
shape: (5, 2)  
┌──────────┬─────────────────────────────────────┐  
│ Id       ┆ TagList                             │  
│ ---      ┆ ---                                 │  
│ i64      ┆ list [str]                          │  
╞══════════╪═════════════════════════════════════╡  
│ 994990   ┆ ["spring"]                          │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 29087440 ┆ ["android", "android-intent"]       │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 12093870 ┆ ["asp.net", ".net", "sqldatasour... │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 32889780 ┆ ["c", "extern", "function-declar... │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 22436290 ┆ ["mysql", "sql", ... "multiple-t... │  
└──────────┴─────────────────────────────────────┘

After grouping by "Id" and collecting the tags of each question into a list, it's time to add a boolean column "ContainsPython" that signals whether any of the tags in the list contains the substring "python". For that, let's use the .arr.eval context (also known as the List context):

tag_list_extended_lazy = tag_list_lazy.with_column(  
   pl.col("TagList")  
   .arr.eval(  
       pl.element()  
       .str.contains(r"(?i)python")  
       .any()  
   ).flatten().alias("ContainsPython")  
)
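For readers who want to see the logic without the Polars machinery, the two steps (collect the tags of each question into a list, then check whether any element of the list matches) can be written in plain Python. The rows below are made up, and a simple case-insensitive substring test stands in for the regular expression:

```python
from collections import defaultdict

# (question Id, tag) pairs, mimicking the layout of the Tags table.
tag_rows = [
    (1, "python"),
    (1, "pandas"),
    (2, "java"),
    (3, "ipython"),
    (3, "jupyter"),
]

# Step 1: group by Id and collect the tags of each question into a list.
tag_lists = defaultdict(list)
for question_id, tag in tag_rows:
    tag_lists[question_id].append(tag)

# Step 2: evaluate a per-element test inside each list and reduce it with any().
contains_python = {
    question_id: any("python" in tag.lower() for tag in tag_list)
    for question_id, tag_list in tag_lists.items()
}
```

The dictionary comprehension plays the role of the List context: the inner generator runs once per list, not once per table.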

The final join will provide the answer we are looking for:

top_python_questions = (  
   pl.scan_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")  
   .join(tag_list_extended_lazy, on="Id")  
   .filter(pl.col("ContainsPython"))  
   .sort("Score", reverse=True)  
).limit(1_000).collect()

And the result:

Joining two dataframes in Polars

Very neat!

Some differences with pandas

Similarly to what happens with Vaex, Polars DataFrames don't have an index. The user guide goes as far as saying this:

Indexes are not needed! Not having them makes things easier - convince us otherwise!

The discussion of this contentious stance will be the subject of a future blog post. In any case, this allows Polars to simplify indexing operations, since strings will always refer to column names, and numbers in the first axis will always refer to row numbers:

In [36]: print(df[0])  # First row  
shape: (1, 7)  
┌─────┬─────────────┬───────────────────┬────────────┬───────┬──────────────────┬──────────────────┐  
│ Id  ┆ OwnerUserId ┆ CreationDate      ┆ ClosedDate ┆ Score ┆ Title            ┆ Body             │  
│ --- ┆ ---         ┆ ---               ┆ ---        ┆ ---   ┆ ---              ┆ ---              │  
│ i64 ┆ str         ┆ str               ┆ str        ┆ i64   ┆ str              ┆ str              │  
╞═════╪═════════════╪═══════════════════╪════════════╪═══════╪══════════════════╪══════════════════╡  
│ 80  ┆ 26          ┆ 2008-08-01T13:57: ┆ NA         ┆ 26    ┆ SQLStatement.exe ┆ <p>I've written  │  
│     ┆             ┆ 07Z               ┆            ┆       ┆ cute() -         ┆ a database       │  
│     ┆             ┆                   ┆            ┆       ┆ multipl...       ┆ gener...         │  
└─────┴─────────────┴───────────────────┴────────────┴───────┴──────────────────┴──────────────────┘  

In [37]: df[0, 0]  # First row, first column  
Out[37]: 80  

In [38]: df[0, "Id"]  # First row, column by name  
Out[38]: 80  

In [39]: df["Id"].head(5)  # Column by name  
Out[39]: shape: (5,)  
Series: 'Id' [i64]  
[  
80  
90  
120  
180  
260  
]

On the other hand, even though indexing with boolean masks is supported in Polars as a way to bridge the gap with pandas users, its use is discouraged in favor of select and filter, and "the functionality may be removed in the future". However, as you could see in the examples above, direct indexing is not needed as often as in pandas.

Should you use Polars?

Beyond this short introduction, Polars has much more to offer, from window functions and complex aggregations to time-series processing.

As a downside, since it is a young project and it's evolving quite fast, you will notice that some areas of the documentation are a bit lacking, or that there are no comprehensive release notes yet. Fortunately, Ritchie Vink, the Polars creator and current maintainer, quickly answers Stack Overflow questions and GitHub issues, and releases with bug fixes and new features are frequent.

On the other hand, if you are looking for an ultimate solution for your larger-than-RAM datasets, Polars might not be for you. Its lazy processing capabilities can take you quite far, but at some point you will have to confront the fact that Polars is an in-memory dataframe library, similar to pandas.

In summary:

Use Polars if you are willing to learn a different but powerful new API, if your data fits in memory, if your workflow involves lots of list-column manipulation, and in general if you want to explore a much faster alternative to pandas.
Don't use Polars if your data is much larger than RAM, if you are looking for solutions to quickly migrate a large pandas codebase, or if you are looking for an old, battle-tested library.

A Polars exploration into Kedro

Juan Luis Cano Rodríguez — Wed, 17 May 2023 14:50:58 +0000

One year ago I travelled to Lithuania for the first time to present at PyCon/PyData Lithuania, and I had a great time there. The topic of my talk was an evaluation of some alternative dataframe libraries, including Polars, the one that I ended up enjoying the most.

I enjoyed it so much that this week I’m in Vilnius again, and I’ll be delivering a workshop at PyCon Lithuania 2023 called “Analyze your data at the speed of light with Polars and Kedro”.

In this blog post you will learn how using Polars in Kedro can make your data pipelines much faster, what’s the current status of Polars in Kedro, and what can be expected in the near future. In case it’s the first time you’ve heard about Polars, I have included a short introduction at the beginning.

Let’s dive in!

What is the Polars library?

Polars is an open-source library for Python, Rust, and NodeJS that provides in-memory dataframes, out-of-core processing capabilities, and more. It is based on the Rust implementation of the Apache Arrow columnar data format (you can read more about Arrow on my earlier blog post “Demystifying Apache Arrow”), and it is optimised to be blazing fast.

The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.

I started experimenting with Polars one year ago, and it has now become my go-to data manipulation library. I gave several talks about it, for example at PyData NYC, and the room was full.

How do Polars and Kedro get used together?

If you want to learn more about Kedro, you can watch a video introduction on our YouTube channel:

Traditionally Kedro has favoured pandas as a dataframe library because of its ubiquity and popularity. This means that, for example, to read a CSV file, you would add a corresponding entry to the catalog:

openrepair-0_3-categories:
  type: pandas.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv

And then, you would use that dataset as input for your node functions, which would, in turn, receive pandas DataFrame objects:

import pandas as pd


def join_events_categories(
    events: pd.DataFrame,
    categories: pd.DataFrame,
) -> pd.DataFrame:
    ...

(This is just one of the formats supported by Kedro datasets of course! You can also load Parquet, GeoJSON, images… have a look at the kedro-datasets reference for a list of datasets maintained by the core team, or the #kedro-plugin topic on GitHub for some contributed by the community!)

The idea of this blog post is to teach you how you can use Polars instead of pandas for your catalog entries, which in turn allows you to write all your data transformation pipelines using Polars dataframes. For that, I crafted some examples that use the Open Repair Alliance dataset, containing more than 80 000 records of repair events across Europe.

And if you’re ready to start, let’s go!

Get started with Polars for Kedro

First of all, you will need to add kedro-datasets[polars.CSVDataSet] to your requirements. At the time of writing (May 2023), the code below requires development versions of both kedro and kedro-datasets, which you can declare on your requirements.txt or pyproject.toml as follows:

# requirements.txt

kedro @ git+https://github.com/kedro-org/kedro@3ea7231
kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets

# pyproject.toml

[project]
dependencies = [
    "kedro @ git+https://github.com/kedro-org/kedro@3ea7231",
    "kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets",
]

If you are using the legacy setup.py files, the syntax is very similar:

setup(
    install_requires=[
        "kedro @ git+https://github.com/kedro-org/kedro@3ea7231",
        "kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets",
    ]
)

After you install these dependencies, you can start using the polars.CSVDataSet by using the appropriate type in your catalog entries:

openrepair-0_3-categories:
  type: polars.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv

and that’s it!

Reading real world CSV files with `polars.CSVDataSet`

It turns out that reading CSV files is not always that easy. The good news is that you can use the load_args parameter of the catalog entry to pass extra options to the polars.CSVDataSet, which mirror the function arguments of polars.read_csv. For example, if you want to attempt parsing the date columns in the CSV, you can set the try_parse_dates option to true:

openrepair-0_3-categories:
  type: polars.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv
  load_args:
    # Doesn't make much sense in this case,
    # but serves for demonstration purposes
    try_parse_dates: true
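Conceptually, the dataset just forwards whatever is under load_args to the underlying reader as keyword arguments. The sketch below illustrates that pattern with the standard library only: `MiniCSVDataset` is a hypothetical stand-in rather than the Kedro class, and `csv.DictReader` takes the place of `polars.read_csv` so that the example runs without extra dependencies.

```python
import csv
import tempfile
from pathlib import Path


class MiniCSVDataset:
    """Hypothetical stand-in for a catalog dataset; not the Kedro class."""

    def __init__(self, filepath, load_args=None):
        self._filepath = Path(filepath)
        self._load_args = dict(load_args or {})

    def load(self):
        # Forward the catalog's load_args to the reader as keyword arguments.
        with self._filepath.open(newline="") as handle:
            return list(csv.DictReader(handle, **self._load_args))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "categories.csv"
    path.write_text("id;name\n1;Laptop\n2;Kettle\n")
    # Equivalent to a catalog entry with `load_args: {delimiter: ";"}`.
    rows = MiniCSVDataset(path, load_args={"delimiter": ";"}).load()
```

This is also why the accepted options mirror those of the reader function: any key that the reader does not understand would fail at load time.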

Some of these parameters are required to be Python objects: for example, polars.read_csv takes an optional dtypes parameter that can be used to specify the dtypes of the columns, as follows:

pl.read_csv(
    "data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv",
    dtypes={
        "product_age": pl.Float64,
        "group_identifier": pl.Utf8,
    }
)

Kedro catalog files only support primitive types. But fear not! You can use more sophisticated configuration loaders in Kedro that allow you to tweak how such files are parsed and loaded.

To pass the appropriate dtypes to read this CSV file, you can use the TemplatedConfigLoader, or alternatively the shiny new OmegaConfigLoader with a custom omegaconf resolver. Such a resolver takes care of parsing the strings in the YAML catalog and transforming them into the objects Polars needs. Place this code in your settings.py:

# settings.py

import polars as pl
from omegaconf import OmegaConf
from kedro.config import OmegaConfigLoader

if not OmegaConf.has_resolver("polars"):
    OmegaConf.register_new_resolver("polars", lambda attr: getattr(pl, attr))

CONFIG_LOADER_CLASS = OmegaConfigLoader

And now you can use the special OmegaConf syntax in the catalog:

openrepair-0_3-events-raw:
  type: polars.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv
  load_args:
    dtypes:
      # Notice the OmegaConf resolver syntax!
      product_age: ${polars:Float64}
      group_identifier: ${polars:Utf8}
    try_parse_dates: true

Now you can access Polars data types with ease from the catalog!
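
To make the substitution less magical, here is a rough, dependency-free sketch of what happens to those strings. It is not how OmegaConf is implemented: the `resolve` helper and the `fake_pl` namespace are hypothetical stand-ins, used so that the example runs without omegaconf or Polars installed. The lookup itself is the same `getattr` call registered in settings.py.

```python
import re
from types import SimpleNamespace

# Hypothetical stand-in for the polars module; with Polars installed,
# the resolver in settings.py looks names up on `pl` instead.
fake_pl = SimpleNamespace(Float64="<Float64 dtype>", Utf8="<Utf8 dtype>")

# Matches strings of the exact form ${polars:Name}
PATTERN = re.compile(r"^\$\{polars:(\w+)\}$")


def resolve(value, namespace=fake_pl):
    """Swap ${polars:Name} strings for namespace.Name, recursing into dicts."""
    if isinstance(value, dict):
        return {key: resolve(item, namespace) for key, item in value.items()}
    if isinstance(value, str):
        match = PATTERN.match(value)
        if match:
            # Same lookup as the lambda registered in settings.py
            return getattr(namespace, match.group(1))
    # Anything else (booleans, numbers, ordinary strings) passes through
    return value


# load_args from the catalog entry above, as plain parsed YAML
load_args = {
    "dtypes": {
        "product_age": "${polars:Float64}",
        "group_identifier": "${polars:Utf8}",
    },
    "try_parse_dates": True,
}

resolved = resolve(load_args)
print(resolved)
```

In the real setup, the resolved values are the Polars data type objects themselves, so the dtypes mapping reaches polars.read_csv in the same shape as in the Python snippet shown earlier.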

Future plans for Polars integration in Kedro

This all looks very promising, but it’s only the tip of the iceberg. First of all, these changes need to land in stable versions of kedro and kedro-datasets. More importantly, we are working on a generic Polars dataset that will be able to read other file formats, for example Parquet, which is faster, more compact, and easier to use than CSV.

Polars makes me so excited about the future of data manipulation in Python, and I hope that all Kedro users will be able to leverage this amazing project in their data pipelines very soon!
