<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Juan Luis Cano Rodríguez</title>
    <description>The latest articles on DEV Community by Juan Luis Cano Rodríguez (@astrojuanlu).</description>
    <link>https://dev.to/astrojuanlu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F921385%2Fbf873438-c675-4317-a842-0cd981578726.jpg</url>
      <title>DEV Community: Juan Luis Cano Rodríguez</title>
      <link>https://dev.to/astrojuanlu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/astrojuanlu"/>
    <language>en</language>
    <item>
      <title>How to sign your git commits with SSH when doing remote development</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Wed, 03 Dec 2025 11:42:59 +0000</pubDate>
      <link>https://dev.to/astrojuanlu/how-to-sign-your-git-commits-with-ssh-when-doing-remote-development-4pg1</link>
      <guid>https://dev.to/astrojuanlu/how-to-sign-your-git-commits-with-ssh-when-doing-remote-development-4pg1</guid>
      <description>&lt;p&gt;Do you want your commits to appear as "verified" on GitHub?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ool78udf4w5unpvy1c7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ool78udf4w5unpvy1c7.png" alt="Screenshot of a signed commit as seen on GitHub" width="541" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The easiest way is to sign them with the SSH key you're already using. To tell git to produce SSH signatures instead of GPG ones, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git config --global gpg.format ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case, though, I do all my development inside an &lt;a href="https://canonical.com/lxd" rel="noopener noreferrer"&gt;LXD&lt;/a&gt; virtual machine. This is very nice because it isolates my environment and I can nuke it and rebuild it with &lt;a href="https://cloud-init.io/" rel="noopener noreferrer"&gt;cloud-init&lt;/a&gt; if something goes wrong.&lt;/p&gt;

&lt;p&gt;Since I'm working inside a VM, I actually don't have any SSH keys &lt;em&gt;inside&lt;/em&gt; the VM!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls ~/.ssh
authorized_keys  known_hosts  known_hosts.old
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead, I use &lt;a href="https://askubuntu.com/q/1008052" rel="noopener noreferrer"&gt;SSH agent forwarding&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh-add -L
ssh-ed25519 AAAAC3NzaC1lZD... comment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So you can tell git to use forwarded keys as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git config --global gpg.ssh.defaultKeyCommand "ssh-add -L"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One last thing: you will probably want to verify your own commits locally. If you don't configure anything else, though, you will see this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git show --show-signature --stat
error: gpg.ssh.allowedSignersFile needs to be configured and exist for ssh signature verification
commit 7bbebcb0b65ae704cdf8b54361f1287c9b95d1f0 (HEAD -&amp;gt; juanlu/...)
No signature
Author: ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the last step is to configure that file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mkdir -p ~/.config/git
$ echo "$(git config user.email) $(ssh-add -L)" &amp;gt;&amp;gt; ~/.config/git/allowed_signers
$ git config --global gpg.ssh.allowedSignersFile ~/.config/git/allowed_signers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
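&lt;p&gt;One caveat: if the agent holds more than one key, &lt;code&gt;ssh-add -L&lt;/code&gt; prints one line per key, and the &lt;code&gt;echo&lt;/code&gt; above would only prefix the first line with the email. As a rough sketch of how to build one entry per key, here is a small Python helper. The function name &lt;code&gt;allowed_signers_lines&lt;/code&gt; and the list of key type prefixes are illustrative assumptions, not part of git or OpenSSH:&lt;/p&gt;

```python
# Hypothetical helper: turns the output of `ssh-add -L` into allowed_signers entries.
# The prefixes below are an assumption about what public key lines start with.
KEY_PREFIXES = ("ssh-", "ecdsa-", "sk-")


def allowed_signers_lines(email, ssh_add_output):
    """Return one 'email keytype base64 [comment]' entry per public key line."""
    entries = []
    for raw_line in ssh_add_output.splitlines():
        fields = raw_line.split()
        # A public key line needs at least a key type and the base64 blob.
        if len(fields) in (0, 1):
            continue
        # Skip messages such as "The agent has no identities."
        if not fields[0].startswith(KEY_PREFIXES):
            continue
        entries.append(f"{email} {' '.join(fields)}")
    return entries


if __name__ == "__main__":
    # Sample input with placeholder keys, mirroring the truncated key shown earlier.
    sample = "ssh-ed25519 AAAAC3NzaC1lZD... comment\nssh-rsa AAAAB3NzaC1yc2E... other\n"
    for entry in allowed_signers_lines("user@domain", sample):
        print(entry)
```

&lt;p&gt;You would feed it the real output of &lt;code&gt;ssh-add -L&lt;/code&gt; and your &lt;code&gt;git config user.email&lt;/code&gt;, then write the resulting lines to the allowed signers file.&lt;/p&gt;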



&lt;p&gt;And now, finally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git show --show-signature --stat
commit 7bbebcb0b65ae704cdf8b54361f1287c9b95d1f0 (HEAD -&amp;gt; juanlu/...)
Good "git" signature for user@domain with ED25519 key SHA256:4RdE/O/mv3Y/YjC07RatbWtmak5tzx9HUdYR3RZFjNg
Author: ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vftor9d4s3cb7iva42z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vftor9d4s3cb7iva42z.png" alt="Locally verifying a commit, in full color" width="800" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that's it! You can now push and your commits will be verified ✨&lt;/p&gt;

&lt;p&gt;If you discovered this in the middle of writing a pull request, well, you can sign all the commits with a rebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git rebase --exec 'git commit --amend --no-edit -n -S' main
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
      <category>git</category>
      <category>github</category>
      <category>ssh</category>
    </item>
    <item>
      <title>Python Packaging is Great Now: `uv` is all you need</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Sat, 10 Aug 2024 12:17:10 +0000</pubDate>
      <link>https://dev.to/astrojuanlu/python-packaging-is-great-now-uv-is-all-you-need-4i2d</link>
      <guid>https://dev.to/astrojuanlu/python-packaging-is-great-now-uv-is-all-you-need-4i2d</guid>
      <description>&lt;p&gt;&lt;em&gt;The title of this post is a reference to Glyph's &lt;a href="https://blog.glyph.im/2016/08/python-packaging.html" rel="noopener noreferrer"&gt;Python Packaging is Good Now&lt;/a&gt;. I think it's safe to say that, in these 8 years, we've gone from "Good" to "Great". Keep reading for my reasoning.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes Python packaging hard &lt;em&gt;for beginners&lt;/em&gt;?
&lt;/h2&gt;

&lt;p&gt;I contend that the two main difficulties for Python packaging are&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bootstrapping, i.e. how to even get started!&lt;/li&gt;
&lt;li&gt;Activation, i.e. how venvs in Python work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bootstrapping was an often neglected problem. Should we tell people to install Python from &lt;code&gt;https://python.org&lt;/code&gt;? The Anaconda distribution? How do we stop folks from using their system package manager and risk breaking everything?&lt;/p&gt;

&lt;p&gt;And don't forget the whole virtual environment lifecycle. It's so crazy how numb I've become to it as a long-time Python user, but &lt;a href="https://social.juanlu.space/@astrojuanlu/111901822426625396" rel="noopener noreferrer"&gt;every time I have to explain it&lt;/a&gt; I see my students' faces and I think "this is not okay".&lt;/p&gt;

&lt;p&gt;Sure, there are other problems, like how to build and publish distributable packages. But I contend these don't affect most Python &lt;em&gt;beginners&lt;/em&gt;. Plus, they are in the process of being addressed as well. Read on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter &lt;code&gt;uv&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;On February 15th, 2024, &lt;a href="https://astral.sh/blog/uv" rel="noopener noreferrer"&gt;Astral released &lt;code&gt;uv&lt;/code&gt;&lt;/a&gt; and I jumped ship immediately. As part of my job I routinely have to install lots of potentially conflicting dependencies, and &lt;code&gt;uv&lt;/code&gt; was an immediate relief.&lt;/p&gt;

&lt;p&gt;But the interesting thing is that now &lt;code&gt;uv&lt;/code&gt; has gone well beyond its initial "faster pip" phase and it's fulfilling its promise of being "a comprehensive Python project and package manager that's fast, reliable, and easy to use".&lt;/p&gt;

&lt;p&gt;Going back to the bootstrapping and activation problems that I mentioned at the very beginning, how does &lt;code&gt;uv&lt;/code&gt; solve them? Consider this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;uv&lt;/code&gt; does not depend on Python itself. Precompiled, standalone binaries can be &lt;a href="https://github.com/astral-sh/uv/blob/0.2.35/docs/getting-started/installation.md" rel="noopener noreferrer"&gt;easily installed&lt;/a&gt; on Linux, macOS and Windows.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;uv python&lt;/code&gt; manages Python versions! No need to resort to OS-specific mechanisms, like &lt;code&gt;pyenv&lt;/code&gt;, &lt;code&gt;deadsnakes&lt;/code&gt;, or to heavyweight tools like &lt;code&gt;conda&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;uv tool&lt;/code&gt; manages tools in centralized environments! No more need for &lt;code&gt;pipx&lt;/code&gt; or &lt;code&gt;fades&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;uv init&lt;/code&gt; creates a barebones &lt;code&gt;pyproject.toml&lt;/code&gt; using &lt;code&gt;hatchling&lt;/code&gt; as build backend and a working src-layout with an empty README and a dummy module.

&lt;ul&gt;
&lt;li&gt;If you need something more sophisticated, you could always use &lt;code&gt;copier&lt;/code&gt; or &lt;code&gt;cookiecutter&lt;/code&gt; with a more elaborate template.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;uv add&lt;/code&gt; adds dependencies to &lt;code&gt;pyproject.toml&lt;/code&gt;, &lt;em&gt;creates a &lt;code&gt;venv&lt;/code&gt; if one didn't exist&lt;/em&gt;, and installs them!&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;uv lock&lt;/code&gt; creates a lock file with all your dependencies, which you can then use in &lt;code&gt;uv sync&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;And if you want a good old &lt;code&gt;requirements.txt&lt;/code&gt;, &lt;code&gt;uv pip compile&lt;/code&gt; does it for you, just like &lt;code&gt;pip-tools&lt;/code&gt;! &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;uv run&lt;/code&gt; executes scripts and commands, again &lt;em&gt;without explicitly activating environments&lt;/em&gt;!&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Essentially, this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mkdir uv-playground
$ cd uv-playground
$ uv init
warning: `uv init` is experimental and may change without warning
Initialized project `uv-playground`
$ uv add click
warning: `uv add` is experimental and may change without warning
Using Python 3.12.3 interpreter at: /usr/bin/python3
Creating virtualenv at: .venv
Resolved 3 packages in 66ms
   Built uv-playground @ file:///tmp/uv-playground
Prepared 2 packages in 430ms
Installed 2 packages in 0.62ms
 + click==8.1.7
 + uv-playground==0.1.0 (from file:///tmp/uv-playground)
$ tree
.
├── pyproject.toml
├── README.md
├── src
│   └── uv_playground
│       └── __init__.py
└── uv.lock

3 directories, 4 files
$ uv run python -c "from uv_playground import hello; print(hello())"
warning: `uv run` is experimental and may change without warning
Hello from uv-playground!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
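&lt;p&gt;The last command works because &lt;code&gt;uv init&lt;/code&gt; generated a dummy module. Judging only from the output above, &lt;code&gt;src/uv_playground/__init__.py&lt;/code&gt; presumably contains something roughly equivalent to the sketch below. The real generated file may differ, for instance by including type annotations:&lt;/p&gt;

```python
# Assumed contents of src/uv_playground/__init__.py, inferred from the output above.
def hello():
    return "Hello from uv-playground!"


if __name__ == "__main__":
    print(hello())
```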



&lt;p&gt;Therefore, to the question "how do I get started learning Python on my computer", now you can universally respond: "install &lt;code&gt;uv&lt;/code&gt;".&lt;/p&gt;

&lt;h2&gt;
  
  
  Some reflections
&lt;/h2&gt;

&lt;p&gt;On the topic of virtual environments, I essentially agree with Armin &lt;a href="https://github.com/astral-sh/uv/issues/1910#issuecomment-2256407301" rel="noopener noreferrer"&gt;when he says&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;npm got away without any equivalent of "activation" and I think a future Python ecosystem will also no longer find much use in virtualenv activation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I also notice that &lt;a href="https://github.com/astral-sh/uv/issues/5461" rel="noopener noreferrer"&gt;&lt;code&gt;uv init&lt;/code&gt; chose &lt;code&gt;hatchling&lt;/code&gt;&lt;/a&gt;. I always had a slight preference towards PDM, but I think this might be a point of no return.&lt;/p&gt;

&lt;p&gt;It took Leah and contributors a lot of work to come up with &lt;a href="https://www.pyopensci.org/python-package-guide/package-structure-code/intro.html" rel="noopener noreferrer"&gt;this decision diagram&lt;/a&gt; for the PyOpenSci packaging guide. But the fact that now there's a &lt;em&gt;baseline&lt;/em&gt; that folks can change in case they have more specific needs (for example, a Meson or scikit-build capable build backend) again provides for a much better Developer Experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  On conda
&lt;/h2&gt;

&lt;p&gt;The topic of conda vs pip is another common source of confusion. I was a conda user and fan since day 1, and it effectively saved Python from a very clear death at a time when it was very difficult to just install stuff on Windows.&lt;/p&gt;

&lt;p&gt;In the years that followed, I often referred to &lt;a href="https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/" rel="noopener noreferrer"&gt;the old blog post by Jake VanderPlas explaining the differences&lt;/a&gt;, but it looks like a lost cause by now.&lt;/p&gt;

&lt;p&gt;The interoperability problems between pip and conda were never fully addressed, and while I think the &lt;a href="https://pixi.sh" rel="noopener noreferrer"&gt;Pixi&lt;/a&gt; folks are doing a fantastic job, I think in the long run &lt;code&gt;uv&lt;/code&gt; will win.&lt;/p&gt;

&lt;p&gt;I fully acknowledge that conda packages are better structured around the notion of non-Python code, and that the current world of "fat wheels on PyPI" is clearly a suboptimal solution. But the whole ecosystem has moved in that direction: most packages now publish precompiled wheels for a rich variety of platforms.&lt;/p&gt;

&lt;p&gt;In other words: conda might not be as useful in 2024 as it was in 2014, and it might be time to stop teaching it to beginners and deem it an advanced tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It might still be a bit too early to call it, since some of these &lt;code&gt;uv&lt;/code&gt; commands are still experimental and might evolve in the future. But for the first time ever, I clearly see a workflow tool that is standards-compliant, comprehensive, free of bootstrapping problems, carefully designed, and that can &lt;em&gt;win&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update 2024-08-20: &lt;code&gt;uv&lt;/code&gt; 0.3.0 introduced the &lt;code&gt;project&lt;/code&gt;, &lt;code&gt;tool&lt;/code&gt;, &lt;code&gt;script&lt;/code&gt;, and &lt;code&gt;python&lt;/code&gt; interfaces, so they're not experimental anymore!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Which is what many Python packaging critics wanted all along, right? Not having to choose from many different tools. But I think &lt;code&gt;uv&lt;/code&gt; went well beyond that and solved other Developer Experience issues, for which I'm happy and thankful.&lt;/p&gt;

&lt;p&gt;I am effectively using &lt;code&gt;uv&lt;/code&gt; for everything and I am not looking back. I will continue recommending this tool to everyone, continue talking about it, and hope that it becomes more widespread.&lt;/p&gt;

</description>
      <category>python</category>
      <category>packaging</category>
      <category>uv</category>
    </item>
    <item>
      <title>The simplicity of DuckDB</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Mon, 20 Nov 2023 12:37:12 +0000</pubDate>
      <link>https://dev.to/astrojuanlu/the-simplicity-of-duckdb-3lad</link>
      <guid>https://dev.to/astrojuanlu/the-simplicity-of-duckdb-3lad</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post is an adaptation of the one I originally published &lt;a href="https://www.orchest.io/blog/sql-on-python-part-1-the-simplicity-of-duckdb" rel="noopener noreferrer"&gt;in the Orchest blog&lt;/a&gt;. Enjoy!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This post is the first part of our series “SQL on Python”, in which we will explore different Python libraries that help you manipulate and query your data using SQL or a SQL-inspired syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why SQL, after all?
&lt;/h2&gt;

&lt;p&gt;SQL (the initials for Structured Query Language, also known as &lt;a href="http://web.archive.org/web/20230320155204/https://www.iso.org/standard/63555.html" rel="noopener noreferrer"&gt;ISO/IEC 9075-1:2016&lt;/a&gt;) was originally designed in the 70s for managing relational databases, but nowadays, it is being used for analytics workloads as well.&lt;/p&gt;

&lt;p&gt;SQL has lots of benefits for analytics, to name a few:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;It’s easy to pick up:&lt;/strong&gt; SQL is a domain-specific language, rather than a general-purpose language, and as such it has more limited scope and fewer syntax elements to learn.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;It’s everywhere:&lt;/strong&gt; SQL is a family of query languages available in many systems, and all of them share some core common characteristics. When you learn a particular SQL dialect (PostgreSQL, SQL Server, Google Standard SQL, others), you can easily transfer your skills to other dialects.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;It’s fast:&lt;/strong&gt; SQL is a statically typed language, which allows query planning systems to perform sophisticated optimizations. This, along with the decades of accumulated knowledge about relational databases, allows SQL implementations to achieve hard-to-beat performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, if you are used to the Python or R ecosystems (pandas, &lt;a href="https://dev.to/astrojuanlu/lightning-fast-queries-with-polars-1bp3"&gt;Polars&lt;/a&gt;, data.table, dplyr), you are probably spoiled by how easy it is to download a CSV or Parquet file from somewhere, launch a Python or R process, read it, and start querying and manipulating it.&lt;/p&gt;

&lt;p&gt;Comparatively, this bootstrapping process is a bit more tedious with SQL: assuming you have, say, a local PostgreSQL database up and running and a CSV file, you would need to create a table with the appropriate schema, import the data using COPY, and hope that there are no inconsistencies, missing data, or weird date formats. If the file happened to be Parquet, you would need to work a bit more.&lt;/p&gt;

&lt;p&gt;To try to make the process a bit more lightweight, you could try to convert your CSV or Parquet to SQLite, a widely available, in-process SQL database. However, SQLite was designed with transactional use cases in mind, and therefore might not scale well with some analytical workloads.&lt;/p&gt;
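&lt;p&gt;To make that boilerplate concrete, here is a minimal sketch of the manual route using only the Python standard library. The sample data, the table name and the column names are made up for illustration:&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = "subreddit,sentiment\nclimate,0.5\nscience,-0.25\nclimate,0.25\n"


def load_csv_into_sqlite(csv_text):
    """Load CSV text into an in-memory SQLite table called 'comments'."""
    conn = sqlite3.connect(":memory:")
    # Step 1: declare the schema by hand, including column types.
    conn.execute("CREATE TABLE comments (subreddit TEXT, sentiment REAL)")
    # Step 2: parse the CSV and convert each value ourselves.
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row["subreddit"], float(row["sentiment"])) for row in reader]
    # Step 3: insert the rows with a parameterized statement.
    conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)
    conn.commit()
    return conn


conn = load_csv_into_sqlite(SAMPLE_CSV)
counts = conn.execute(
    "SELECT subreddit, COUNT(*) FROM comments GROUP BY subreddit ORDER BY subreddit"
).fetchall()
print(counts)
```

&lt;p&gt;Three manual steps (schema, parsing, insertion) before the first query can run, and any malformed value would make the &lt;code&gt;float()&lt;/code&gt; conversion fail.&lt;/p&gt;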

&lt;p&gt;In summary: SQL is appealing, but the boilerplate not so much. What if you could run SQL for your analytics workloads without having to configure a database, just by importing a module in your Python or R process, and make your queries blazing fast? What if, rather than having to choose between Python or SQL, you could use &lt;em&gt;both&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbo1lg4e829ts0dobt9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbo1lg4e829ts0dobt9q.png" alt="" width="607" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://twitter.com/anyfactor/status/1551650476651081729" rel="noopener noreferrer"&gt;https://twitter.com/anyfactor/status/1551650476651081729&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter DuckDB
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; is an open source (MIT) high-performance, in-process SQL database for analytics. It is a relatively new project (the first public release was in June 2019), but it became tremendously popular in a short period of time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlslx1ylwkkp4f5p0pdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlslx1ylwkkp4f5p0pdf.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;DuckDB popularity is growing (we like this image so much)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;DuckDB can read data from different sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  From &lt;a href="https://duckdb.org/docs/data/csv" rel="noopener noreferrer"&gt;CSV&lt;/a&gt; or &lt;a href="https://duckdb.org/docs/data/parquet" rel="noopener noreferrer"&gt;Parquet&lt;/a&gt; files&lt;/li&gt;
&lt;li&gt;  From pandas DataFrame or Arrow Table objects in the process memory&lt;/li&gt;
&lt;li&gt;  From &lt;a href="https://duckdb.org/2022/09/30/postgres-scanner.html" rel="noopener noreferrer"&gt;PostgreSQL tables&lt;/a&gt; (by reading the binary data directly!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of the DuckDB operations have out-of-core capabilities (similar to &lt;a href="https://dev.to/astrojuanlu/out-of-core-processing-with-vaex-3724"&gt;Vaex&lt;/a&gt; or &lt;a href="https://twitter.com/RitchieVink/status/1579827660142051328" rel="noopener noreferrer"&gt;the new streaming mode of Polars&lt;/a&gt;), which means that it can process data that is larger than RAM!&lt;/p&gt;

&lt;p&gt;Finally, DuckDB offers &lt;a href="https://duckdb.org/2022/05/04/friendlier-sql.html" rel="noopener noreferrer"&gt;some additions on top of standard SQL&lt;/a&gt; that make it very pleasant to use, for example friendlier error messages or, behold, trailing commas!&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying out DuckDB
&lt;/h2&gt;

&lt;p&gt;For this example, we will use a dataset containing &lt;a href="https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset" rel="noopener noreferrer"&gt;all mentions of climate change on Reddit before September 2022&lt;/a&gt; obtained from Kaggle. Our generic goal is to understand the sentiment of these mentions.&lt;/p&gt;

&lt;p&gt;I have published &lt;a href="https://github.com/astrojuanlu/orchest-duckdb" rel="noopener noreferrer"&gt;an Orchest pipeline&lt;/a&gt; that contains all the necessary files so you can run these code snippets on JupyterLab easily: the first step downloads the data using your Kaggle API key, and the second step performs some exploratory analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  First steps with DuckDB
&lt;/h3&gt;

&lt;p&gt;You can install DuckDB with conda/mamba or pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mamba install -y "python-duckdb=0.5.1"  
# Or, alternatively, with pip  
# pip install "duckdb==0.5.1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first step to start using DuckDB is creating a connection object. This mimics the &lt;a href="https://peps.python.org/pep-0249/" rel="noopener noreferrer"&gt;Python Database API 2.0&lt;/a&gt;, also implemented by other projects like SQLite and psycopg2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;  
&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, &lt;code&gt;duckdb.connect()&lt;/code&gt; will return a connection to an in-memory database, which will be perfectly fine for reading data from external files. In fact, you can run a SQL query directly on the CSV file straight away!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
  ...: SELECT COUNT(*)  
  ...: FROM &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;  
  ...: &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;4600698&lt;/span&gt;&lt;span class="p"&gt;,)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the comments CSV file contains 4.6+ million rows. This took about 50 seconds on an Orchest instance though, which is not very impressive for just a COUNT(*) operation. What about converting the CSV to Parquet, as we did in my &lt;a href="https://dev.to/astrojuanlu/demystifying-apache-arrow-5b0a"&gt;blog post about Arrow&lt;/a&gt;? This time, we can use DuckDB for that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  

&lt;span class="n"&gt;csv_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ls&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;reddit&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;climate&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;  

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;csv_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reading &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;destination_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splitext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;destination_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="k"&gt;continue&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
   COPY (SELECT * FROM &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)  
   TO &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;destination_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; (FORMAT &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)  
   &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
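&lt;p&gt;The &lt;code&gt;!ls&lt;/code&gt; syntax only works inside IPython or Jupyter. If you want the same file discovery logic in a plain Python script, a sketch with &lt;code&gt;glob&lt;/code&gt; could look like this. The helper names &lt;code&gt;parquet_destination&lt;/code&gt; and &lt;code&gt;pending_conversions&lt;/code&gt; are illustrative, and the actual conversion would still be the DuckDB &lt;code&gt;COPY&lt;/code&gt; statement shown above:&lt;/p&gt;

```python
import glob
import os


def parquet_destination(csv_path):
    """Return the sibling .parquet path for a CSV file path."""
    return os.path.splitext(csv_path)[0] + ".parquet"


def pending_conversions(directory):
    """List (csv, parquet) pairs whose Parquet file does not exist yet."""
    pairs = []
    for csv_path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        destination = parquet_destination(csv_path)
        # Same skip rule as the loop above: do not redo finished conversions.
        if not os.path.isfile(destination):
            pairs.append((csv_path, destination))
    return pairs


if __name__ == "__main__":
    print(parquet_destination(
        "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"
    ))
```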



&lt;p&gt;And now, let’s repeat the query on the Parquet file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
  ...: SELECT COUNT(*)  
  ...: FROM &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;  
  ...: &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="mi"&gt;234&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;12.3&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="nf"&gt;loop &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Approximately a 200x speedup over the same operation using CSV! That is a better baseline for running the rest of the queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Querying Parquet files with DuckDB
&lt;/h3&gt;

&lt;p&gt;Since you will be referring to the same file several times, it’s a good moment to &lt;a href="https://duckdb.org/docs/sql/statements/create_view" rel="noopener noreferrer"&gt;create a view&lt;/a&gt; for it. This will allow you to query the Parquet file without copying all the data to memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
CREATE VIEW comments AS  
SELECT * FROM &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;  
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let’s find out which subreddits had the largest number of comments about climate change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
   ...: SELECT  
   ...:   &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subreddit.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; AS subreddit_name,  
   ...:   COUNT(*) AS num_comments,  
   ...: FROM comments  
   ...: GROUP BY subreddit_name  
   ...: ORDER BY num_comments DESC  
   ...: LIMIT 10  
   ...: &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;politics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;370018&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;worldnews&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;351195&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;askreddit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;259848&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;collapse&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;94696&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;94558&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;futurology&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;89945&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;science&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;71453&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;70444&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;canada&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;66813&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;australia&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60239&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unsurprisingly, &lt;em&gt;/r/politics&lt;/em&gt;, &lt;em&gt;/r/worldnews&lt;/em&gt;, and &lt;em&gt;/r/collapse&lt;/em&gt; were among the subreddits with the largest number of comments about climate change.&lt;/p&gt;

&lt;p&gt;What about the overall sentiment of those comments?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
   ...: SELECT  
   ...:   AVG(sentiment) AS average_sentiment,  
   ...:   STDDEV(sentiment) AS stddev_sentiment,  
   ...: FROM comments  
   ...: &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.005827451977706203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6581439484369691&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
   ...: SELECT  
   ...:   &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subreddit.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; AS subreddit_name,  
   ...:   COUNT(*) AS num_comments,  
   ...:   AVG(sentiment) AS average_sentiment,  
   ...:   STDDEV(sentiment) AS stddev_sentiment,  
   ...: FROM comments  
   ...: WHERE subreddit_name IN (  
   ...:   SELECT &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subreddit.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; AS subreddit_name  
   ...:   FROM comments  
   ...:   GROUP BY subreddit_name  
   ...:   ORDER BY COUNT(*) DESC  
   ...:   LIMIT 10  
   ...: )  
   ...: GROUP BY subreddit_name  
   ...: ORDER BY num_comments DESC  
   ...: &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;politics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;370018&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.018118589649651674&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6600297061408&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;worldnews&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;351195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.058001587387908435&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6405990095462681&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;askreddit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;259848&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.068637218639235&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6089748718101456&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;collapse&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;94696&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1332661626390419&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6667106776062662&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;94558&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.09367126059175682&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6276134461239258&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;futurology&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;89945&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0018637489115630797&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6506820198836241&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;science&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;71453&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.04588216852922973&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6248484283076333&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;70444&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.015670189810189843&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6467846578160414&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;canada&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;66813&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.021118244331091468&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6408319443539487&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;australia&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60239&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.021869519296548085&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6405803819103508&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the overall sentiment is slightly negative (with a large standard deviation), some subreddits like &lt;em&gt;/r/askreddit&lt;/em&gt; and &lt;em&gt;/r/collapse&lt;/em&gt; exhibited a sentiment more negative than average. Others like &lt;em&gt;/r/science&lt;/em&gt; and &lt;em&gt;/r/canada&lt;/em&gt; were slightly positive.&lt;/p&gt;
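
&lt;p&gt;To put those differences in perspective, the short sketch below reuses the figures printed in the query outputs above. It expresses each subreddit's average as an offset from the overall average, measured in units of the overall standard deviation. The variable names are illustrative, and this is only a rough comparison, not a statistical test:&lt;/p&gt;

```python
# Figures copied from the query outputs above
overall_mean = -0.005827451977706203
overall_std = 0.6581439484369691

subreddit_means = {
    "politics": -0.018118589649651674,
    "worldnews": -0.058001587387908435,
    "askreddit": -0.068637218639235,
    "collapse": -0.1332661626390419,
    "news": -0.09367126059175682,
    "futurology": 0.0018637489115630797,
    "science": 0.04588216852922973,
    "environment": -0.015670189810189843,
    "canada": 0.021118244331091468,
    "australia": -0.021869519296548085,
}

# Offset of each subreddit mean from the overall mean,
# expressed in units of the overall standard deviation
offsets = {
    name: (mean - overall_mean) / overall_std
    for name, mean in subreddit_means.items()
}

# Print from most negative to most positive
for name, offset in sorted(offsets.items(), key=lambda item: item[1]):
    print(f"{name:12s} {offset:+.3f}")
```

&lt;p&gt;Every offset is well under one standard deviation, so the per-subreddit differences are small compared with the spread of individual comments.&lt;/p&gt;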

&lt;p&gt;Most interestingly, all these queries ran in about 2 seconds!&lt;/p&gt;

&lt;p&gt;DuckDB also &lt;a href="https://duckdb.org/docs/guides/python/jupyter" rel="noopener noreferrer"&gt;integrates with Jupyter&lt;/a&gt; through the &lt;a href="https://pypi.org/project/ipython-sql/" rel="noopener noreferrer"&gt;ipython-sql&lt;/a&gt; extension and &lt;a href="https://pypi.org/project/duckdb-engine/" rel="noopener noreferrer"&gt;the DuckDB SQLAlchemy driver&lt;/a&gt;, which lets you query your data with an even more compact syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;load_ext&lt;/span&gt; &lt;span class="n"&gt;sql&lt;/span&gt;  
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="n"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt; &lt;span class="n"&gt;AS&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;...:&lt;/span&gt;  
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt;  
&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="nc"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt;  
&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;4600698&lt;/span&gt;&lt;span class="p"&gt;,)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Interoperability with Python dataframe libraries
&lt;/h3&gt;

&lt;p&gt;Did you notice how we were calling &lt;code&gt;.fetchall()&lt;/code&gt; all the time? That gives you a plain list of tuples, in the style of the widely used Python DBAPI 2.0 that, as we said above, &lt;code&gt;conn.execute()&lt;/code&gt; follows. However, DuckDB can return richer objects if you keep the result of &lt;code&gt;conn.query()&lt;/code&gt; instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
SELECT  
 &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subreddit.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; AS subreddit_name,  
 COUNT(*) AS num_comments,  
FROM comments  
GROUP BY subreddit_name  
ORDER BY num_comments DESC  
LIMIT 10  
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method returns an instance of &lt;code&gt;DuckDBPyRelation&lt;/code&gt;, which is pretty-printed in Jupyter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DuckDBPyRelation&lt;/span&gt;  

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
&lt;span class="o"&gt;---------------------&lt;/span&gt;  
&lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="n"&gt;Relation&lt;/span&gt; &lt;span class="n"&gt;Tree&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;  
&lt;span class="o"&gt;---------------------&lt;/span&gt;  
&lt;span class="n"&gt;Subquery&lt;/span&gt;  

&lt;span class="o"&gt;---------------------&lt;/span&gt;  
&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt; &lt;span class="n"&gt;Columns&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;  
&lt;span class="o"&gt;---------------------&lt;/span&gt;  
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;subreddit_name &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;num_comments &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="o"&gt;---------------------&lt;/span&gt;  
&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt; &lt;span class="n"&gt;Preview&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;  
&lt;span class="o"&gt;---------------------&lt;/span&gt;  
&lt;span class="n"&gt;subreddit_name&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;num_comments&lt;/span&gt;  
&lt;span class="n"&gt;VARCHAR&lt;/span&gt; &lt;span class="n"&gt;BIGINT&lt;/span&gt;  
&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;Rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
&lt;span class="n"&gt;politics&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;370018&lt;/span&gt;  
&lt;span class="n"&gt;worldnews&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;351195&lt;/span&gt;  
&lt;span class="n"&gt;askreddit&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;259848&lt;/span&gt;  
&lt;span class="n"&gt;collapse&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;94696&lt;/span&gt;  
&lt;span class="n"&gt;news&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;94558&lt;/span&gt;  
&lt;span class="n"&gt;futurology&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;89945&lt;/span&gt;  
&lt;span class="n"&gt;science&lt;/span&gt; &lt;span class="mi"&gt;71453&lt;/span&gt;  
&lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;70444&lt;/span&gt;  
&lt;span class="n"&gt;canada&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;66813&lt;/span&gt;  
&lt;span class="n"&gt;australia&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;60239&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moreover, you can efficiently retrieve the data from this relation and convert it to several Python objects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A dictionary of masked NumPy arrays using &lt;code&gt;.fetchnumpy()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  A pandas DataFrame using &lt;code&gt;.df()&lt;/code&gt; or its aliases (&lt;code&gt;.fetchdf()&lt;/code&gt;, &lt;code&gt;.fetch_df()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;  An Arrow Table using &lt;code&gt;.arrow()&lt;/code&gt; or &lt;code&gt;.fetch_arrow_table()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  An Arrow record batch reader using &lt;code&gt;.fetch_record_batch(chunk_size)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, you can easily convert query results to a pandas DataFrame, and also to a Polars one, since &lt;code&gt;pl.DataFrame&lt;/code&gt; accepts an Arrow table directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# pandas  
&lt;/span&gt;&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;subreddit_name&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;num_comments&lt;/span&gt;  
&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;politics&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;370018&lt;/span&gt;  
&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;worldnews&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;351195&lt;/span&gt;  
&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;askreddit&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;259848&lt;/span&gt;  
&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;collapse&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;94696&lt;/span&gt;  
&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;94558&lt;/span&gt;  
&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;futurology&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;89945&lt;/span&gt;  
&lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;science&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;71453&lt;/span&gt;  
&lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;70444&lt;/span&gt;  
&lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;canada&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;66813&lt;/span&gt;  
&lt;span class="mi"&gt;9&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;australia&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;60239&lt;/span&gt;  

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;  

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Arrow data  
&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Polars  
&lt;/span&gt;&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt;┌────────────────┬──────────────┐&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;subreddit_name&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;num_comments&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;╞════════════════╪══════════════╡&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;politics&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;370018&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;worldnews&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mi"&gt;351195&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;askreddit&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mi"&gt;259848&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;collapse&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;94696&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;science&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mi"&gt;71453&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mi"&gt;70444&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;canada&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;66813&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;australia&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mi"&gt;60239&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;└────────────────┴──────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: Result objects returned by &lt;code&gt;conn.execute()&lt;/code&gt; also have these methods, but calling one of them consumes the data, which makes them less convenient.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Conversely, you can transfer data from pandas or Arrow to DuckDB. Or, more precisely: you can use DuckDB to query pandas or Arrow objects that live in memory! Moreover, DuckDB can read them straight from local variables, referenced by name in the query, with no extra setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;df_most_comments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;df_most_comments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# pandas  
&lt;/span&gt;&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;subreddit_name&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;num_comments&lt;/span&gt;  
&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;politics&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;370018&lt;/span&gt;  
&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;worldnews&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;351195&lt;/span&gt;  
&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;askreddit&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;259848&lt;/span&gt;  
&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;collapse&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;94696&lt;/span&gt;  
&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="mi"&gt;94558&lt;/span&gt;  

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
   ...: SELECT subreddit_name  
   ...: FROM df_most_comments  -- Sorcery!  
   ...: LIMIT 5  
   ...: &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;politics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;worldnews&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;askreddit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;collapse&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also &lt;a href="https://duckdb.org/docs/api/python/overview" rel="noopener noreferrer"&gt;manually register a compatible object&lt;/a&gt; with a given name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;most_comments_arrow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DuckDBPyConnection&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mh"&gt;0x7f9be41434f0&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
   ...: SELECT subreddit_name  
   ...: FROM most_comments_arrow  
   ...: LIMIT 5  
   ...: &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;politics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;worldnews&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;askreddit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;collapse&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, using the &lt;code&gt;%sql&lt;/code&gt; magic as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;subreddit_name&lt;/span&gt; &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;df_most_comments&lt;/span&gt; &lt;span class="n"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  
&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;  
&lt;span class="n"&gt;Returning&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;local&lt;/span&gt; &lt;span class="n"&gt;variable&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;  

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# pandas  
&lt;/span&gt;&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;subreddit_name&lt;/span&gt;  
&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;politics&lt;/span&gt;  
&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;worldnews&lt;/span&gt;  
&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;askreddit&lt;/span&gt;  
&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;collapse&lt;/span&gt;  
&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In other words: you can transparently go back and forth between DuckDB and your favourite Python dataframe library. Cool!&lt;/p&gt;

&lt;h3&gt;
  
  
  Other features
&lt;/h3&gt;

&lt;p&gt;Without going into too much detail, there are a few more interesting things about DuckDB you should check out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Relational API&lt;/strong&gt;: Apart from executing your SQL queries, DuckDBPyRelation objects have &lt;a href="https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyRelation" rel="noopener noreferrer"&gt;some basic filtering and aggregation methods&lt;/a&gt;. For example, you can do things like:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_comments &amp;gt; 100000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subreddit_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;subreddit_name&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;num_comments&lt;/span&gt;  
&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;askreddit&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;259848&lt;/span&gt;  
&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;politics&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;370018&lt;/span&gt;  
&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;worldnews&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="mi"&gt;351195&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://duckdb.org/docs/guides/python/relational_api_pandas" rel="noopener noreferrer"&gt;The documentation is still in progress&lt;/a&gt;, but hopefully the DuckDB team will expand it in the future!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The CLI&lt;/strong&gt;: DuckDB has &lt;a href="https://duckdb.org/docs/api/cli" rel="noopener noreferrer"&gt;a command-line client&lt;/a&gt; you can use directly from your terminal, without even launching a Python or Jupyter interpreter:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./duckdb -c '  
&amp;gt;   SELECT "subreddit.name" AS subreddit_name,  
&amp;gt;   COUNT(*) AS num_comments  
&amp;gt; FROM "/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet"  
&amp;gt; GROUP BY subreddit_name  
&amp;gt; ORDER BY num_comments DESC  
&amp;gt; LIMIT 10  
&amp;gt; '  
┌────────────────┬──────────────┐  
│ subreddit_name │ num_comments │  
├────────────────┼──────────────┤  
│ politics       │ 370018       │  
│ worldnews      │ 351195       │  
│ askreddit      │ 259848       │  
│ collapse       │ 94696        │  
│ news           │ 94558        │  
│ futurology     │ 89945        │  
│ science        │ 71453        │  
│ environment    │ 70444        │  
│ canada         │ 66813        │  
│ australia      │ 60239        │  
└────────────────┴──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
   Should you use DuckDB?
&lt;/h2&gt;

&lt;p&gt;Although DuckDB is a wonderful piece of technology, “there is no silver bullet”, and there are cases in which you might want to use something else. The project homepage itself hints at some of these cases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8piwp74fiux63zfmm3e5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8piwp74fiux63zfmm3e5.png" alt="When not to use DuckDB" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  For transactional workloads, you might want to use SQLite, or a more sophisticated transactional database like PostgreSQL. Remember, DuckDB was created for analytics!&lt;/li&gt;
&lt;li&gt;  When several people are reading or writing the same data, using a warehouse might make more sense.&lt;/li&gt;
&lt;/ul&gt;
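
&lt;p&gt;To make "transactional workload" concrete, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;accounts&lt;/code&gt; table and the &lt;code&gt;transfer&lt;/code&gt; helper are hypothetical and only serve as illustration: the point is many small writes that must succeed or fail together, which is a different access pattern from the large scans and aggregations shown earlier.&lt;/p&gt;

```python
import sqlite3

# Hypothetical schema, for illustration only (not part of the original article).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move `amount` between two rows; both updates apply, or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, receiver),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, sender),
        )


transfer(conn, "alice", "bob", 30)  # both updates are committed together

try:
    transfer(conn, "alice", "bob", 500)  # would overdraw alice, so the CHECK fails
except sqlite3.IntegrityError:
    pass  # the credit to bob made in the same transaction was rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

&lt;p&gt;After both calls, alice holds 70 and bob holds 80: the failed transfer left no partial update behind.&lt;/p&gt;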

&lt;p&gt;Other than that, as you saw above, if you are looking for a lightweight and fast solution for in-process analytics, and you want to leverage both your general-purpose language of choice (Python, R, others) and SQL, DuckDB might be exactly what you want.&lt;/p&gt;

&lt;p&gt;In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to &lt;a href="https://github.com/Alex-Monahan" rel="noopener noreferrer"&gt;Alex Monahan&lt;/a&gt; and &lt;a href="https://github.com/Mause" rel="noopener noreferrer"&gt;Elliana May&lt;/a&gt; for reviewing early drafts of this blog post. All remaining errors are my own.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>duckdb</category>
      <category>sql</category>
    </item>
    <item>
      <title>How to integrate Kedro and Databricks Connect</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 21 Sep 2023 14:41:03 +0000</pubDate>
      <link>https://dev.to/kedro/how-to-integrate-kedro-and-databricks-connect-3ep7</link>
      <guid>https://dev.to/kedro/how-to-integrate-kedro-and-databricks-connect-3ep7</guid>
      <description>&lt;p&gt;In recent months we've updated the Kedro documentation to illustrate &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/index.html" rel="noopener noreferrer"&gt;three different ways of integrating Kedro with Databricks&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You can choose a &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html" rel="noopener noreferrer"&gt;workflow based on Databricks jobs&lt;/a&gt; to deploy a project that has finished development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For faster iteration on changes, the workflow documented in &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_notebooks_development_workflow.html" rel="noopener noreferrer"&gt;"Use a Databricks workspace to develop a Kedro project"&lt;/a&gt; is for those who prefer to develop and test their projects directly within Databricks notebooks, to avoid the overhead of setting up and syncing a local development environment with Databricks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alternatively, you can work locally in an IDE as described by the workflow documented in &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_ide_development_workflow.html" rel="noopener noreferrer"&gt;"Use an IDE, dbx and Databricks Repos to develop a Kedro project"&lt;/a&gt;. You can use your IDE’s capabilities for faster, error-free development, while testing on Databricks. This is ideal if you’re in the early stages of learning Kedro, or if your project requires constant testing and adjustments. However, the experience is still not perfect: you must sync your work inside Databricks with dbx and run the pipeline inside a notebook. Debugging has a lengthy setup for each change and there is less flexibility than inside an IDE.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this blog post, Diego Lira, a Specialist Data Scientist and a client-facing member of &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/how-we-help-clients" rel="noopener noreferrer"&gt;QuantumBlack, AI by McKinsey&lt;/a&gt;, explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE. He recommends this approach when the data-heavy parts of your pipelines are in PySpark. If part of your workflow runs in plain Python (e.g. pandas) rather than in Spark (via PySpark), Databricks Connect will download your data frame to your local environment to continue running your workflow. This might cause performance issues and introduce compliance risks because the data has left the Databricks workspace.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Databricks Connect?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/dev-tools/databricks-connect-ref.html" rel="noopener noreferrer"&gt;Databricks Connect&lt;/a&gt; is Databricks' official method of interacting with a remote Databricks instance while using a local environment.&lt;/p&gt;

&lt;p&gt;To configure Databricks Connect for use with Kedro, follow the official setup to create a &lt;code&gt;.databrickscfg&lt;/code&gt; file containing your access token. You can install Databricks Connect with &lt;code&gt;pip install databricks-connect&lt;/code&gt;, and its &lt;code&gt;DatabricksSession&lt;/code&gt; takes the place of your local SparkSession:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.connect&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DatabricksSession&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DatabricksSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
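
&lt;p&gt;The &lt;code&gt;.databrickscfg&lt;/code&gt; file mentioned above is an INI-style file, so it can be inspected with Python's standard &lt;code&gt;configparser&lt;/code&gt;. The sketch below assumes a profile with &lt;code&gt;host&lt;/code&gt;, &lt;code&gt;token&lt;/code&gt; and &lt;code&gt;cluster_id&lt;/code&gt; fields. All values are placeholders and &lt;code&gt;read_profile&lt;/code&gt; is a hypothetical helper, so check the official setup guide for the exact fields your workspace needs.&lt;/p&gt;

```python
import configparser

# Placeholder values for illustration; a real file lives at ~/.databrickscfg
# and holds your own workspace URL, personal access token and cluster ID.
EXAMPLE_CFG = """
[DEFAULT]
host = https://example-workspace.cloud.databricks.com
token = dapi-REPLACE-ME
cluster_id = 0000-000000-example
"""


def read_profile(cfg_text, profile="DEFAULT"):
    """Return the host, token and cluster_id stored under `profile`."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser[profile]
    return {key: section[key] for key in ("host", "token", "cluster_id")}


creds = read_profile(EXAMPLE_CFG)
print(creds["host"])
```

&lt;p&gt;To read your real file instead of the inline string, you could call &lt;code&gt;parser.read()&lt;/code&gt; with the path to &lt;code&gt;.databrickscfg&lt;/code&gt; in your home directory.&lt;/p&gt;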



&lt;p&gt;Spark commands are sent and executed on the cluster, and results are returned to the local environment as needed. In the context of Kedro, this has an amazing effect: as long as you don’t explicitly ask for the data to be collected in your local environment, operations will be executed only when saving the outputs of your node. If you use datasets saved to a Databricks path, there will be no performance hit for transferring data between environments.&lt;/p&gt;

&lt;p&gt;This tool was recently made available as a thin client for &lt;a href="https://spark.apache.org/docs/latest/spark-connect-overview.html" rel="noopener noreferrer"&gt;Spark Connect&lt;/a&gt;, one of the highlights of Spark 3.4, and configuration is now easier than in earlier versions. If your cluster doesn’t support the current version of Databricks Connect, please refer to the &lt;a href="https://docs.databricks.com/en/dev-tools/databricks-connect-legacy.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;, as previous versions had different limitations.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/p9IRFSjuLBE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  How can I use a Databricks Connect workflow with Kedro?
&lt;/h2&gt;

&lt;p&gt;Databricks Connect (and Spark Connect) enables us to have a completely local development flow, while all artifacts can be remote objects. Using Delta tables for all our datasets and MLflow for model objects and tracking, nothing needs to be saved locally. Developers can take full advantage of the Databricks stack while keeping full use of their IDE.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use Databricks as your PySpark engine
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.kedro.org/en/stable/integrations/pyspark_integration.html" rel="noopener noreferrer"&gt;Kedro supports integration with PySpark&lt;/a&gt; through the use of Hooks. To configure and enable your Databricks session through Spark Connect, simply set up your &lt;code&gt;SPARK_REMOTE&lt;/code&gt; environment variable with your Databricks configuration. Here is an example implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.framework.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hook_impl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@hook_impl&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_context_created&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Initialises a SparkSession using the config
        from Databricks.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;set_databricks_creds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;_spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Builder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_databricks_creds&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Pass Databricks credentials as OS variables if using the local machine.
    If you set the DATABRICKS_PROFILE env variable, it will choose the desired profile in .databrickscfg,
    otherwise it will use the DEFAULT profile in .databrickscfg.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;DEFAULT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATABRICKS_PROFILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/databricks/spark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConfigParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;home&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.databrickscfg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# remove "https://" and final "/" from path
&lt;/span&gt;        &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cluster_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SPARK_REMOTE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:443/;token=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;;x-databricks-cluster-id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example populates &lt;code&gt;SPARK_REMOTE&lt;/code&gt; from your local &lt;code&gt;.databrickscfg&lt;/code&gt; file. The remote connection is not set up if the project is being run from inside Databricks (that is, if &lt;code&gt;SPARK_HOME&lt;/code&gt; points to Databricks), so you're still able to run it in the usual &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_ide_development_workflow.html" rel="noopener noreferrer"&gt;hybrid development flow&lt;/a&gt;. Notice that you don’t need to set up a &lt;code&gt;spark.yml&lt;/code&gt; file as is common in other PySpark templates; you’re not passing any configuration, just using the cluster that is in Databricks. You also don’t need to load any extra Spark files (e.g. JARs), as you are using a thin Spark Connect client.&lt;/p&gt;
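
&lt;p&gt;To make the connection string concrete, here is a minimal, standalone sketch of the same parsing step. The &lt;code&gt;build_spark_remote&lt;/code&gt; helper and every value in &lt;code&gt;EXAMPLE_CFG&lt;/code&gt; (host, cluster ID, token) are made up for illustration; only the URL layout is taken from the hook above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import configparser

# Hypothetical profile contents; real values live in ~/.databrickscfg.
EXAMPLE_CFG = """
[DEFAULT]
host = https://example.cloud.databricks.com/
cluster_id = 0000-000000-example
token = dapi-example-token
"""


def build_spark_remote(cfg_text, profile="DEFAULT"):
    """Build a Spark Connect URL from the text of a .databrickscfg file."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    section = config[profile]
    # Drop the scheme ("https://") and any trailing slash from the host.
    host = section["host"].split("//", 1)[-1].strip().rstrip("/")
    return (
        f"sc://{host}:443/;token={section['token']}"
        f";x-databricks-cluster-id={section['cluster_id']}"
    )


print(build_spark_remote(EXAMPLE_CFG))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Removing the trailing slash with &lt;code&gt;rstrip("/")&lt;/code&gt; keeps the host intact even when the profile omits the final slash.&lt;/p&gt;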

&lt;p&gt;Now all your Spark calls in your pipelines will automatically use the remote cluster. There's no need to change anything in your code. However, notebooks might be part of the project. To use your remote cluster without needing to use environment variables, you can use the &lt;code&gt;DatabricksSession&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using the remote cluster, it's best to avoid data transfers between the environments by having all catalog entries reference remote locations. Using &lt;code&gt;kedro_datasets.databricks.ManagedTableDataSet&lt;/code&gt; as your dataset type in the catalog also allows you to use Delta table features.&lt;/p&gt;
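
&lt;p&gt;As a rough sketch, a catalog entry for such a table could look like the following. The entry name, table and database are hypothetical placeholders, and the available options may differ between &lt;code&gt;kedro-datasets&lt;/code&gt; versions, so check the dataset's documentation before relying on them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# conf/base/catalog.yml (hypothetical entry; all names are placeholders)
model_input_table:
  type: databricks.ManagedTableDataSet
  table: model_input_table
  database: my_database
  write_mode: overwrite  # assumption: replaces the table contents on each save
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;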

&lt;h1&gt;
  
  
  How to enable MLflow on Databricks
&lt;/h1&gt;

&lt;p&gt;Using &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt; to save all your artifacts directly to Databricks leads to a powerful workflow. For this you can use &lt;a href="https://github.com/Galileo-Galilei/kedro-mlflow" rel="noopener noreferrer"&gt;kedro-mlflow&lt;/a&gt;. Note that &lt;code&gt;kedro-mlflow&lt;/code&gt; is built on top of the MLflow library. Although its documentation does not cover the Databricks configuration, you can read more about it in the &lt;a href="https://mlflow.org/docs/latest/index.html" rel="noopener noreferrer"&gt;MLflow documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After doing the &lt;a href="https://kedro-mlflow.readthedocs.io/en/stable/source/02_installation/02_setup.html#activate-kedro-mlflow-in-your-kedro-project" rel="noopener noreferrer"&gt;basic setup of the library&lt;/a&gt; in your project, you should see a &lt;code&gt;mlflow.yml&lt;/code&gt; configuration file. In this file, change the following to set up your URI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;server:
    mlflow_tracking_uri: databricks # if null, will use mlflow.get_tracking_uri() as a default
    mlflow_registry_uri: databricks # if null, mlflow_tracking_uri will be used as mlflow default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up your experiment name (this should be a valid Databricks path):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;experiment:
    name: /Shared/your_experiment_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, all your parameters will be logged, and objects such as models and metrics can be saved as MLflow objects referenced in the catalog.&lt;/p&gt;

&lt;h2&gt;
  
  
   Limitations of this workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/dev-tools/databricks-connect-ref.html" rel="noopener noreferrer"&gt;Databricks Connect&lt;/a&gt;, built on top of Spark Connect, supports only recent versions of Spark. I recommend looking at the detailed limitations in the official documentation for specific guidance, such as the upload limit of only 128 MB for DataFrames.&lt;/p&gt;

&lt;p&gt;Users also need to be conscious that &lt;code&gt;.toPandas()&lt;/code&gt; will move the data to your local pandas environment. Saving results back as MLflow objects is the preferred way to avoid local objects. Examples can be seen in the &lt;a href="https://kedro-mlflow.readthedocs.io/en/stable/source/04_experimentation_tracking/index.html" rel="noopener noreferrer"&gt;kedro-mlflow documentation&lt;/a&gt; for all types of supported objects.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>In the pipeline: September 2023</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Wed, 06 Sep 2023 08:49:16 +0000</pubDate>
      <link>https://dev.to/kedro/in-the-pipeline-september-2023-14ek</link>
      <guid>https://dev.to/kedro/in-the-pipeline-september-2023-14ek</guid>
      <description>&lt;p&gt;This month: a roundup of the summer’s Kedro news, some release updates, and our top picks from recent articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kedro team news
&lt;/h2&gt;

&lt;p&gt;Over the last few months, we’ve been happy to welcome some new team members to the Kedro and Kedro-Viz teams, who have also joined our &lt;a href="https://docs.kedro.org/en/stable/contribution/technical_steering_committee.html" rel="noopener noreferrer"&gt;Technical Steering Committee&lt;/a&gt;. Welcome &lt;a href="https://github.com/DimedS" rel="noopener noreferrer"&gt;Dmitry Sorokin&lt;/a&gt;, &lt;a href="https://github.com/jitu5" rel="noopener noreferrer"&gt;Jitendra Gundaniya&lt;/a&gt;, &lt;a href="https://github.com/lrcouto" rel="noopener noreferrer"&gt;Laura Couto&lt;/a&gt;, &lt;a href="https://github.com/ravi-kumar-pilla" rel="noopener noreferrer"&gt;Ravi Kumar Pilla&lt;/a&gt;, and &lt;a href="https://github.com/vladimir-mck" rel="noopener noreferrer"&gt;Vladimir Nikolic&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;We are also pleased to announce a Kedro baby, delivered safely by one of the team, at the end of July!&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributors news
&lt;/h2&gt;

&lt;p&gt;We reworked the Kedro contributors guide in August, and moved it to the &lt;a href="https://github.com/kedro-org/kedro/wiki" rel="noopener noreferrer"&gt;Kedro wiki&lt;/a&gt;. There are loads of different ways to contribute to Kedro and if you want to get involved, we encourage you to look at the &lt;a href="https://github.com/kedro-org/kedro/wiki/Contribute-to-Kedro#how-to-contribute" rel="noopener noreferrer"&gt;table that introduces the Kedro contributor guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwstew1ucbg0zamtlgukn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwstew1ucbg0zamtlgukn.png" alt="These are some of the ways to contribute to Kedro" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you spot an article, podcast or video that discusses Kedro, you can also contribute by adding it to the “&lt;a href="https://github.com/kedro-org/awesome-kedro" rel="noopener noreferrer"&gt;Awesome Kedro&lt;/a&gt;” repository, or letting us know on &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There have been some amazing contributions in recent weeks, including the &lt;a href="https://pypi.org/project/vineyard-kedro/" rel="noopener noreferrer"&gt;kedro-vineyard plugin&lt;/a&gt; for efficient intermediate sharing in Kedro pipelines, &lt;a href="https://pypi.org/project/kedro-graphql/#data" rel="noopener noreferrer"&gt;kedro-graphql&lt;/a&gt; for serving Kedro projects as a GraphQL API, and &lt;a href="https://pypi.org/project/kedro-pandera/" rel="noopener noreferrer"&gt;kedro-pandera&lt;/a&gt; to bring data validation to your Kedro projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release news
&lt;/h2&gt;

&lt;p&gt;August 2023 saw a set of &lt;a href="https://linen-slack.kedro.org/t/15611709/hi-channel-we-are-excited-to-announce-several-new-releases-m#5fa69a60-84b7-4b82-adca-a16f87fac6b1" rel="noopener noreferrer"&gt;releases to introduce Python 3.11&lt;/a&gt; support across Kedro, Kedro-Viz and Kedro datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zzibh5l4xenccdt4zt8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zzibh5l4xenccdt4zt8.jpg" alt="All the Kedro things support Python 3.11" width="667" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.13" rel="noopener noreferrer"&gt;Kedro version 0.18.13&lt;/a&gt; included these major features and improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Added support for Python 3.11.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added new &lt;code&gt;OmegaConfigLoader&lt;/code&gt; features: registering of custom resolvers through &lt;code&gt;CONFIG_LOADER_ARGS&lt;/code&gt; and support for global variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added &lt;code&gt;kedro catalog resolve&lt;/code&gt; CLI command that resolves dataset factories in the catalog with any explicit entries in the project pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplified the &lt;code&gt;conf&lt;/code&gt; folder structure for modular pipelines and updated &lt;code&gt;kedro pipeline create&lt;/code&gt; and &lt;code&gt;kedro catalog create&lt;/code&gt; accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Made various updates to the Kedro project template and Kedro starters: use of &lt;code&gt;OmegaConfigLoader&lt;/code&gt;, transition from &lt;code&gt;setup.py&lt;/code&gt; to &lt;code&gt;pyproject.toml&lt;/code&gt;, and updates for the simplified &lt;code&gt;conf&lt;/code&gt; structure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro-viz/releases/tag/v6.5.0" rel="noopener noreferrer"&gt;Kedro Viz version 6.5&lt;/a&gt; added support for Python 3.11, while &lt;a href="https://github.com/kedro-org/kedro-viz/releases/tag/v6.4.0" rel="noopener noreferrer"&gt;Kedro Viz version 6.4&lt;/a&gt; added two new features: feature hint cards to highlight key features of Kedro Viz and support for displaying dataset statistics in the metadata panel for further investigation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro-plugins/releases/tag/kedro-datasets-1.7.0" rel="noopener noreferrer"&gt;Kedro Datasets version 1.7.0&lt;/a&gt; added &lt;code&gt;polars.GenericDataSet&lt;/code&gt;, a dataset backed by &lt;a href="https://www.pola.rs/" rel="noopener noreferrer"&gt;polars&lt;/a&gt;, a lightning fast dataframe package built entirely using Rust. &lt;a href="https://github.com/kedro-org/kedro-plugins/releases/tag/kedro-datasets-1.6.0" rel="noopener noreferrer"&gt;Kedro Datasets version 1.6.0&lt;/a&gt; added support for Python 3.11.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recently on the Kedro blog
&lt;/h2&gt;

&lt;p&gt;In the last few weeks we’ve published the following on the Kedro blog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/how-to-integrate-kedro-and-databricks-connect" rel="noopener noreferrer"&gt;How to integrate Kedro and Databricks Connect&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/managed-delta-tables-kedro-dataset" rel="noopener noreferrer"&gt;How to use Databricks managed Delta tables in a Kedro project&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/kedro-dataset-for-spark-structured-streaming" rel="noopener noreferrer"&gt;A new Kedro dataset for Spark Structured Streaming&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/collaborative-experiment-tracking-in-kedro-viz" rel="noopener noreferrer"&gt;Collaborative experiment tracking in Kedro-Viz&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/build-a-custom-kedro-runner" rel="noopener noreferrer"&gt;Get up to speed: How to build a custom Kedro runner&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re always looking for collaborators to write about their experiences using Kedro, particularly if you’re working with Kedro datasets or converting an existing project to use Kedro. Get in touch with us on our &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Slack workspace&lt;/a&gt; to tell us your story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nnuhluwy5em5oqbh0gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nnuhluwy5em5oqbh0gb.png" alt="Powered by Kedro badge" width="526" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What we’ve learned
&lt;/h2&gt;

&lt;p&gt;We really enjoyed reading more on Medium about the &lt;a href="https://medium.com/cncf-vineyard/efficient-data-sharing-in-data-science-pipelines-on-kubernetes-bb42d36c739" rel="noopener noreferrer"&gt;Kedro Vineyard plugin&lt;/a&gt;, which is a cloud-native data manager, for data sharing using memory in data science pipelines on Kubernetes.&lt;/p&gt;

&lt;p&gt;Quix published an interesting article called “&lt;a href="https://www.notion.so/In-the-pipeline-September-2023-39eeb4c7219442b3b0dfc7df9d854b4d?pvs=21" rel="noopener noreferrer"&gt;Bridging the gap between data scientists and engineers in machine learning workflows&lt;/a&gt;” which is something we regularly discuss within the team.&lt;/p&gt;

&lt;p&gt;We found a &lt;a href="https://github.com/madziejm/project-fontr" rel="noopener noreferrer"&gt;super-interesting project about font recognition&lt;/a&gt; that uses Kedro.&lt;/p&gt;

&lt;p&gt;And finally, we enjoyed reading more about &lt;a href="https://medium.com/quantumblack/kedro-goes-streaming-34e1094c354c" rel="noopener noreferrer"&gt;data streaming with Kedro&lt;/a&gt; over on the QuantumBlack Medium channel.&lt;/p&gt;

&lt;p&gt;That’s it for this edition!&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>datascience</category>
      <category>news</category>
    </item>
    <item>
      <title>🐍 Best resources on Python packaging 📖</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 17 Aug 2023 12:29:35 +0000</pubDate>
      <link>https://dev.to/astrojuanlu/best-resources-on-python-packaging-fea</link>
      <guid>https://dev.to/astrojuanlu/best-resources-on-python-packaging-fea</guid>
      <description>&lt;p&gt;Are you confused by the various names that float around the Python packaging ecosystem? Have you ever asked a colleague to help you with an installation issue, only for them to reply "use {other_tool} instead" and make the problem worse? Have you seen &lt;a href="https://xkcd.com/1987/" rel="noopener noreferrer"&gt;the infamous XKCD comic on Python environments&lt;/a&gt; but you're still wondering how to solve your mess?&lt;/p&gt;

&lt;p&gt;This short blog post is not a guide that will help you troubleshoot everything, but instead a list of resources that I consider up to date, modern, informative, and free of "hot takes" or unnecessary hate towards maintainers.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. "The Basics of Python Packaging in Early 2023"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://drivendata.co/blog/python-packaging-2023" rel="noopener noreferrer"&gt;https://drivendata.co/blog/python-packaging-2023&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This blog post by Jay Qi, Lead Data Scientist at DrivenData, is an informative take on how to &lt;em&gt;produce&lt;/em&gt; your own Python package, specifically writing your &lt;code&gt;pyproject.toml&lt;/code&gt; (modern replacement of &lt;code&gt;setup.py&lt;/code&gt;). It neatly explains all the concepts involved, including PEP 517 build backends, PEP 621 project metadata, and some extra stuff.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. "An unbiased evaluation of environment management and packaging tools"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://youtu.be/MsJjzVIVs6M" rel="noopener noreferrer"&gt;https://youtu.be/MsJjzVIVs6M&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anna-Lena Popkes, Senior Machine Learning Engineer at inovex, delivered this talk at PyConDE and EuroPython. She offers a neat categorization of the different aspects or facets of Python packaging:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89zunsp2yreqzlffr2ai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89zunsp2yreqzlffr2ai.png" alt="Python packaging categorization" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart does not include &lt;em&gt;all&lt;/em&gt; tools, but it's an excellent starting point.&lt;/p&gt;

&lt;p&gt;One thing I'd change is recommending &lt;a href="https://github.com/jdxcode/rtx" rel="noopener noreferrer"&gt;rtx&lt;/a&gt; over pyenv: it works in a very similar way, but it's written in Rust (so it's super fast) and avoids the typical problems with shims.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The pyOpenSci Python packaging guide
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.pyopensci.org/python-package-guide/package-structure-code/intro.html" rel="noopener noreferrer"&gt;https://www.pyopensci.org/python-package-guide/package-structure-code/intro.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This long guide was written by Leah Wasser, Executive Director at pyOpenSci, with the help of many contributors (including myself). It goes into more depth on how to choose a specific build backend or workflow tool when developing and creating packages. Whether you're creating a complex Python package with compiled extensions or a plain, pure Python one, this guide will help you navigate the ecosystem quite effectively. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsa800lgj8jil8j6pb695.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsa800lgj8jil8j6pb695.png" alt="How to choose tooling" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. "Why not tell people to 'simply' use pyenv, poetry or anaconda"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.bitecode.dev/p/why-not-tell-people-to-simply-use" rel="noopener noreferrer"&gt;https://www.bitecode.dev/p/why-not-tell-people-to-simply-use&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, this one is a bit more loaded and opinionated than the others. But I believe this post on the Bite code! blog is very necessary: it's often too tempting to tell someone to "just" install yet another tool that will fix their problems, but this creates a massive amount of collective pain, and Python packaging is particularly affected. Please refrain from doing that!&lt;/p&gt;

&lt;p&gt;(Which reminds me of this fantastic piece by Ned Batchelder, "How to be helpful online" &lt;a href="https://nedbatchelder.com/blog/202009/how_to_be_helpful_online.html" rel="noopener noreferrer"&gt;https://nedbatchelder.com/blog/202009/how_to_be_helpful_online.html&lt;/a&gt; ❤️)&lt;/p&gt;

&lt;p&gt;More often than not, installing yet another tool won't fix the user's original problem. So, instead, try to spend some time helping them debug their problem.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Would you like me to write a guide on how to debug Python installation issues? Leave a comment saying "yes" or, much better, a situation that has affected you recently, or even right now.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. "Thoughts on Python packaging"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://pradyunsg.me/blog/2023/01/21/thoughts-on-python-packaging/" rel="noopener noreferrer"&gt;https://pradyunsg.me/blog/2023/01/21/thoughts-on-python-packaging/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've made it this far, it's because you're another Python packaging nerd, like myself. Congratulations! (Or... sorry?)&lt;/p&gt;

&lt;p&gt;To end on a positive note, I'd highly recommend checking out this blog post by Pradyun Gedam, maintainer of pip and many other packaging projects, which offers a thoughtful perspective on where we stand now, how we got here, and where we go next.&lt;/p&gt;




&lt;p&gt;Hope you liked this list of resources! If you'd like to read more about this, comment, recommend, follow, and spread the love.&lt;/p&gt;

&lt;p&gt;Also, remember to always thank your open source maintainers, they will appreciate it 💖&lt;/p&gt;

</description>
      <category>python</category>
      <category>packaging</category>
      <category>pip</category>
    </item>
    <item>
      <title>How to use Databricks managed Delta tables in a Kedro project</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 17 Aug 2023 08:55:07 +0000</pubDate>
      <link>https://dev.to/kedro/how-to-use-databricks-managed-delta-tables-in-a-kedro-project-jj</link>
      <guid>https://dev.to/kedro/how-to-use-databricks-managed-delta-tables-in-a-kedro-project-jj</guid>
      <description>&lt;p&gt;In this blog post, we'll guide you through the specifics of building a Kedro project that uses managed Delta tables in Databricks using the newly-released &lt;a href="https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets/databricks" rel="noopener noreferrer"&gt;ManagedTableDataSet&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kedro?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kedro.org" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt; is a toolbox for production-ready data science. It's an open-source Python framework that enables the development of clean data science code, borrowing concepts from software engineering and applying them to machine-learning projects. A Kedro project provides scaffolding for complex data and machine-learning pipelines. It enables developers to spend less time on tedious "plumbing" and focus on solving new problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Databricks?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.databricks.com/" rel="noopener noreferrer"&gt;Databricks&lt;/a&gt; is a unified data analytics platform designed for simplifying big data processing and free-form data exploration at any scale. Based on &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, an open-source distributed computing system, Databricks provides a collaborative cloud-based environment where users can process large amounts of data.&lt;/p&gt;

&lt;p&gt;The platform provides collaborative workspaces (notebooks) and computational resources (clusters) to run code with. Clusters are groups of nodes that run Apache Spark. Notebooks are collaborative web-based interfaces where users can write and execute code on an attached cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Kedro on Databricks?
&lt;/h2&gt;

&lt;p&gt;As we've described, Kedro offers a framework for building modular and scalable data pipelines, while Databricks provides a platform for running Spark jobs and managing data. You can combine Kedro and Databricks to build and deploy data pipelines and get the best of both worlds. Kedro's open-source framework will help you to build well-organised and maintainable pipelines, while Databricks' platform will provide you with the scalability you need to run your pipeline in production. Check out the &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/index.html" rel="noopener noreferrer"&gt;recently-updated Kedro documentation&lt;/a&gt; for a set of workflow options for integrating Kedro projects and Databricks. (Additionally, the third-party &lt;a href="https://github.com/Galileo-Galilei/kedro-mlflow" rel="noopener noreferrer"&gt;kedro-mlflow&lt;/a&gt; plugin integrates &lt;a href="https://mlflow.org/docs/latest/index.html" rel="noopener noreferrer"&gt;mlflow&lt;/a&gt; capabilities inside Kedro projects to enhance reproducibility for machine learning experimentation).&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Kedro datasets?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.kedro.org/en/stable/data/data_catalog.html" rel="noopener noreferrer"&gt;Kedro datasets&lt;/a&gt; are abstractions for loading and saving data, designed to decouple these operations from your business logic. These datasets manage reading and writing data from a variety of sources, while also ensuring consistency, tracking, and versioning. They allow users to maintain focus on core data processing, leaving data I/O tasks to Kedro.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is managed data in Databricks?
&lt;/h2&gt;

&lt;p&gt;To understand the concept of managed data in Databricks, it is first necessary to outline how Databricks organises data. At the highest level, Databricks uses metastores to store the metadata associated with data objects. Databricks Unity Catalog is one such metastore. It provides data governance and management across multiple Databricks workspaces. The metastore organises tables (where your data is stored) in a hierarchical structure.&lt;/p&gt;

&lt;p&gt;The highest level of organisation in this hierarchy is the catalog. Catalogs are a collection of databases (also referred to as schemas in Databricks' terminology). A database is the second level of organisation in the Unity Catalog namespacing model. Databases are a collection of tables. The tables in a database are the third level of organisation in this hierarchy.&lt;/p&gt;
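
&lt;p&gt;As a small illustration of how these levels combine into a table name, the sketch below joins them with dots, which is how tables are usually addressed in SQL. The &lt;code&gt;qualified_table_name&lt;/code&gt; helper and the &lt;code&gt;main&lt;/code&gt; and &lt;code&gt;readings&lt;/code&gt; names are hypothetical and not part of Kedro or Databricks.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def qualified_table_name(table, database="default", catalog=None):
    """Join the levels of the hierarchy into a dotted table name."""
    levels = [catalog, database, table]
    # Skip the catalog level when it is not set.
    return ".".join(level for level in levels if level)


# Two levels: database and table.
print(qualified_table_name("readings", database="weather"))
# Three levels: catalog, database and table.
print(qualified_table_name("readings", database="weather", catalog="main"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a workspace without Unity Catalog there is no catalog level, so tables are addressed with the two-level form.&lt;/p&gt;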

&lt;p&gt;A table is structured data, stored as a directory of files on cloud object storage. By default, Databricks creates tables as Delta tables, which store data using the &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; format. &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; is an open-source storage format that offers ACID transactions, time travel and audit history.&lt;/p&gt;

&lt;p&gt;Databricks tables belong to one of two categories: managed and unmanaged (external) tables. Databricks manages both the data and associated metadata of managed tables. If you drop a managed table, you will delete the underlying data. The data of a managed table resides in the location of the database to which it is registered.&lt;/p&gt;

&lt;p&gt;On the other hand, for unmanaged tables, Databricks only manages the metadata. If you drop an unmanaged table, you will not delete the underlying data. These tables require a specified location during creation.&lt;/p&gt;
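
&lt;p&gt;The difference in drop behaviour can be pictured with a toy model in plain Python. This is only a sketch: &lt;code&gt;ToyMetastore&lt;/code&gt; and its methods are hypothetical names made up for illustration, local directories stand in for cloud storage, and nothing here calls a Databricks API.&lt;/p&gt;

```python
import shutil
import tempfile
from pathlib import Path


class ToyMetastore:
    """Toy stand-in for a metastore: maps table names to storage locations."""

    def __init__(self, warehouse):
        self.warehouse = Path(warehouse)
        self.tables = {}  # name -> (location, is_managed)

    def create_managed(self, name):
        # Managed: the metastore chooses the location inside its own warehouse
        location = self.warehouse / name
        location.mkdir(parents=True)
        self.tables[name] = (location, True)
        return location

    def create_external(self, name, location):
        # Unmanaged (external): the caller has to supply the location
        location = Path(location)
        location.mkdir(parents=True, exist_ok=True)
        self.tables[name] = (location, False)
        return location

    def drop(self, name):
        # The metadata entry is always removed
        location, is_managed = self.tables.pop(name)
        if is_managed:
            # Only managed tables lose their underlying data
            shutil.rmtree(location)


root = Path(tempfile.mkdtemp())
store = ToyMetastore(root / "warehouse")
managed_dir = store.create_managed("readings")
external_dir = store.create_external("archive", root / "elsewhere" / "archive")

store.drop("readings")
store.drop("archive")

managed_exists = managed_dir.exists()
external_exists = external_dir.exists()
print(managed_exists, external_exists)
```

&lt;p&gt;After both drops the metadata is gone for both tables, but only the external table's directory is still on disk.&lt;/p&gt;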

&lt;h2&gt;
  
  
  How to work with managed Delta tables using Kedro
&lt;/h2&gt;

&lt;p&gt;Let's demonstrate how to use the &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.databricks.ManagedTableDataSet.html" rel="noopener noreferrer"&gt;ManagedTableDataSet&lt;/a&gt; with a simple example on Databricks. You'll need to open a new Databricks notebook and attach it to a cluster to follow along with the rest of this example, which runs on a workspace using a Hive metastore. We'll create a dataset containing weather readings, save it to a managed Delta table on Databricks, append some data, and access a specific table version to showcase Delta Lake's time travel capabilities.&lt;/p&gt;

&lt;p&gt;Run every separate code snippet in this section in a new notebook cell.&lt;/p&gt;

&lt;p&gt;The first steps are to set up your workspace by creating a &lt;code&gt;weather&lt;/code&gt; database in your metastore and installing Kedro. Run the following SQL code to create the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%sql
create database if not exists weather;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To install Kedro and the &lt;code&gt;ManagedTableDataSet&lt;/code&gt;, use the &lt;code&gt;%pip&lt;/code&gt; magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install kedro kedro-datasets[databricks.ManagedTableDataSet]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first part of our program will create some weather data. We'll create a Spark DataFrame with four columns: &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;location&lt;/code&gt;, &lt;code&gt;temperature&lt;/code&gt;, and &lt;code&gt;humidity&lt;/code&gt; to store our weather data. Then, we'll use a new instance of &lt;code&gt;ManagedTableDataSet&lt;/code&gt; to save our DataFrame to a Delta table called &lt;code&gt;2023_06_22&lt;/code&gt; (the day of the readings) in the &lt;code&gt;weather&lt;/code&gt; database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro_datasets.databricks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ManagedTableDataSet&lt;/span&gt;

&lt;span class="n"&gt;spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Define schema
&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;humidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Create DataFrame
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Warsaw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bucharest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;spark_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a ManagedTableDataSet instance using a new table named '2023_06_22'
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save the DataFrame to the table
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To load our data back into a dataframe, we use the &lt;code&gt;load&lt;/code&gt; method on &lt;code&gt;ManagedTableDataSet&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load the table data into a DataFrame
&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Print the first 3 rows of the DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code loads the data from the &lt;code&gt;2023_06_22&lt;/code&gt; table in the &lt;code&gt;weather&lt;/code&gt; database back into a Spark DataFrame and shows the first three rows of the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's say we take some more weather readings later in the day and want to add them to our Delta table. To do this, we can write to it using a new instance of &lt;code&gt;ManagedTableDataSet&lt;/code&gt; initialised with &lt;code&gt;"append"&lt;/code&gt; passed in as an argument to &lt;code&gt;write_mode&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Append new rows to the data
&lt;/span&gt;&lt;span class="n"&gt;new_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cairo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Lisbon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;spark_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;write_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above adds new rows for Cairo and Lisbon to our Delta table, which creates a new version of the table.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ManagedTableDataSet&lt;/code&gt; class allows for saving data with three different write modes: &lt;code&gt;overwrite&lt;/code&gt;, &lt;code&gt;append&lt;/code&gt;, and &lt;code&gt;upsert&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;overwrite&lt;/code&gt; mode will completely replace the current data in the table with the new data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;append&lt;/code&gt; mode will add new data to the existing table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;upsert&lt;/code&gt; mode updates existing rows and inserts new rows, based on a specified primary key. Notably, if the table doesn't exist at save, the &lt;code&gt;upsert&lt;/code&gt; mode behaves similarly to append, inserting data into a new table.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
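
&lt;p&gt;To make the three modes concrete, here is a minimal sketch of their semantics on plain Python lists of dictionaries. It is not how &lt;code&gt;ManagedTableDataSet&lt;/code&gt; is implemented: the &lt;code&gt;save&lt;/code&gt; function and its &lt;code&gt;primary_key&lt;/code&gt; argument are hypothetical stand-ins chosen for illustration.&lt;/p&gt;

```python
def save(table, new_rows, write_mode, primary_key=None):
    """Return the rows a table would hold after saving new_rows in the given mode."""
    if write_mode == "overwrite":
        # Existing rows are discarded
        return list(new_rows)
    if write_mode == "append":
        # New rows are added after the existing ones
        return list(table) + list(new_rows)
    if write_mode == "upsert":
        if primary_key is None:
            raise ValueError("upsert needs a primary key")
        # Rows with a matching key are updated, the rest are inserted
        merged = {row[primary_key]: row for row in table}
        for row in new_rows:
            merged[row[primary_key]] = row
        return list(merged.values())
    raise ValueError(f"unknown write mode: {write_mode}")


table = [
    {"location": "London", "temperature": 27},
    {"location": "Warsaw", "temperature": 28},
]
update = [
    {"location": "Warsaw", "temperature": 30},
    {"location": "Cairo", "temperature": 35},
]

overwritten = save(table, update, "overwrite")
appended = save(table, update, "append")
upserted = save(table, update, "upsert", primary_key="location")

print(len(overwritten), len(appended), len(upserted))
```

&lt;p&gt;With &lt;code&gt;location&lt;/code&gt; as the key, the upsert updates the Warsaw row and inserts the Cairo row, while the append keeps both Warsaw rows. Upserting into an empty table simply inserts every row, which mirrors the append-like behaviour described above.&lt;/p&gt;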

&lt;p&gt;Suppose we later want to access our data as it appeared earlier in the day when we had only taken three readings. The &lt;code&gt;ManagedTableDataSet&lt;/code&gt; class supports accessing different versions of the Delta table. We can access a specific version by defining a Kedro &lt;code&gt;Version&lt;/code&gt; and passing it into a new instance of &lt;code&gt;ManagedTableDataSet&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt;

&lt;span class="c1"&gt;# Load version 0 of the table
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reloaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load version 1 of the table
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reloaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see two rendered tables as the output of running this code. The first corresponds to version 0 of the &lt;code&gt;2023_06_22&lt;/code&gt; table, while the second corresponds to version 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |

|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |
|2023-06-22|  Lisbon  |     28      |   44     |
|2023-06-22|  Cairo   |     35      |   25     |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
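
&lt;p&gt;One way to picture what the version numbers refer to is a toy version log in plain Python, where every save records a new snapshot. This is a deliberately simplified sketch: &lt;code&gt;VersionedTable&lt;/code&gt; is a hypothetical class, and Delta Lake itself keeps a transaction log rather than full copies of the data.&lt;/p&gt;

```python
class VersionedTable:
    """Toy version log: every save records a new snapshot of the rows."""

    def __init__(self):
        self._snapshots = []

    def save(self, rows, write_mode="overwrite"):
        if write_mode == "append" and self._snapshots:
            # Appending builds on the latest snapshot
            rows = self._snapshots[-1] + list(rows)
        self._snapshots.append(list(rows))
        # Return the version number that was just written
        return len(self._snapshots) - 1

    def load(self, version=None):
        # Without a version, return the latest snapshot
        index = -1 if version is None else version
        return list(self._snapshots[index])


table = VersionedTable()
v0 = table.save(["Bucharest", "London", "Warsaw"])
v1 = table.save(["Cairo", "Lisbon"], write_mode="append")

first_readings = table.load(version=0)
all_readings = table.load(version=1)
print(v0, len(first_readings), v1, len(all_readings))
```

&lt;p&gt;Loading version 0 still returns the three original readings after the append, which is the behaviour the two rendered tables show.&lt;/p&gt;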



&lt;p&gt;And that's it! We've put together a simple program to show some of the usual tasks that &lt;code&gt;ManagedTableDataSet&lt;/code&gt; facilitates, making it easy to save, load, and manage versions of your data in Delta tables on Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Databricks is a fast-growing deployment target for Kedro projects. This blog post has demonstrated how to combine the power of both Kedro and Databricks with an open-source &lt;code&gt;ManagedTableDataSet&lt;/code&gt; that enables streamlined data I/O operations when deploying a Kedro project on Databricks. &lt;code&gt;ManagedTableDataSet&lt;/code&gt; empowers you to spend more time implementing the business logic of your data pipeline or machine learning workflow and less time manually handling data.&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>databricks</category>
      <category>deltalake</category>
    </item>
    <item>
      <title>A new Kedro dataset for Spark Structured Streaming</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Wed, 12 Jul 2023 07:36:25 +0000</pubDate>
      <link>https://dev.to/kedro/a-new-kedro-dataset-for-spark-structured-streaming-n39</link>
      <guid>https://dev.to/kedro/a-new-kedro-dataset-for-spark-structured-streaming-n39</guid>
      <description>&lt;p&gt;This article shows data practitioners how to set up a Kedro project to use the new &lt;code&gt;SparkStreaming&lt;/code&gt; Kedro dataset, with example use cases and a deep dive into some design considerations. It's aimed at readers already familiar with Kedro, so we won't cover the basics of a project, but you can familiarise yourself with them in the &lt;a href="https://docs.kedro.org/en/stable/get_started/install.html" rel="noopener noreferrer"&gt;Kedro documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kedro?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kedro.org" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt; is an open-source Python toolbox that applies software engineering principles to data science code. It makes it easier for a team to apply software engineering principles to data science code, which reduces the time spent rewriting data science experiments so that they are fit for production.&lt;/p&gt;

&lt;p&gt;Kedro was born at QuantumBlack to solve the challenges faced regularly in data science projects and promote teamwork through standardised team workflows. It is now hosted by the &lt;a href="https://lfaidata.foundation/" rel="noopener noreferrer"&gt;LF AI &amp;amp; Data Foundation&lt;/a&gt; as an incubating project.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are Kedro datasets?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.kedro.org/en/stable/data/data_catalog.html" rel="noopener noreferrer"&gt;Kedro datasets&lt;/a&gt; are abstractions for reading and loading data, designed to decouple these operations from your business logic. These datasets manage reading and writing data from a variety of sources, while also ensuring consistency, tracking, and versioning. They allow users to maintain focus on core data processing, leaving data I/O tasks to Kedro.&lt;/p&gt;

&lt;h2&gt;
  
  
   What is Spark Structured Streaming?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" rel="noopener noreferrer"&gt;Spark Structured Streaming&lt;/a&gt; is built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data, and the Spark SQL engine will run it incrementally and continuously and update the final result as streaming data continues to arrive.&lt;/p&gt;

&lt;h2&gt;
  
  
   Integrating Kedro and Spark Structured Streaming
&lt;/h2&gt;

&lt;p&gt;Kedro is easily extensible for your own workflows and this article explains one of the ways to add new functionality. To enable Kedro to work with Spark Structured Streaming, a team inside &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/labs" rel="noopener noreferrer"&gt;QuantumBlack Labs&lt;/a&gt; developed a new &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.spark.SparkStreamingDataSet.html" rel="noopener noreferrer"&gt;Spark Streaming Dataset&lt;/a&gt;, as the existing Kedro Spark dataset was not compatible with Spark Streaming use cases. To ensure seamless streaming, the new dataset has a checkpoint location specification to avoid data duplication in streaming use cases and it uses &lt;code&gt;.start()&lt;/code&gt; at the end of the &lt;code&gt;_save&lt;/code&gt; method to initiate the stream.&lt;/p&gt;
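
&lt;p&gt;The role of the checkpoint location can be illustrated with a small sketch in plain Python. It assumes a list as the stream source and a JSON file as the checkpoint; &lt;code&gt;process_stream&lt;/code&gt; is a hypothetical helper, and a real Spark checkpoint stores considerably more than a single offset.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def process_stream(source, checkpoint_path, sink):
    """Process records not yet covered by the checkpoint, then advance it."""
    checkpoint = Path(checkpoint_path)
    # Resume from the recorded offset, or from the start on the first run
    start = json.loads(checkpoint.read_text())["offset"] if checkpoint.exists() else 0
    for record in source[start:]:
        sink.append(record)
    # Record how far we got so that a restart does not reprocess old records
    checkpoint.write_text(json.dumps({"offset": len(source)}))


checkpoint_file = Path(tempfile.mkdtemp()) / "checkpoint.json"
source = ["a", "b", "c"]
sink = []

process_stream(source, checkpoint_file, sink)  # first run: processes a, b, c
process_stream(source, checkpoint_file, sink)  # restart: nothing new to process
source.append("d")
process_stream(source, checkpoint_file, sink)  # only the new record is processed

print(sink)
```

&lt;p&gt;Without the checkpoint file, the second call would write the first three records to the sink again, which is the kind of duplication the checkpoint location is there to prevent.&lt;/p&gt;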

&lt;h2&gt;
  
  
   Set up a project to integrate Kedro with Spark Structured Streaming
&lt;/h2&gt;

&lt;p&gt;The project uses a Kedro dataset to build a structured data pipeline that reads and writes data streams with Spark Structured Streaming and processes them in real time. You need to add two separate Hooks to the Kedro project to enable it to function as a streaming application.&lt;/p&gt;

&lt;p&gt;Integration involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Kedro project.&lt;/li&gt;
&lt;li&gt;Register the necessary PySpark and streaming-related Hooks.&lt;/li&gt;
&lt;li&gt;Configure the custom dataset in the &lt;code&gt;catalog.yml&lt;/code&gt; file, defining the streaming sources and sinks. &lt;/li&gt;
&lt;li&gt;Use Kedro’s new &lt;a href="https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets/spark" rel="noopener noreferrer"&gt;dataset for Spark Structured Streaming&lt;/a&gt; to store intermediate dataframes generated during the Spark streaming process.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Create a Kedro project
&lt;/h3&gt;

&lt;p&gt;Ensure you have installed a version of Kedro greater than version 0.18.9 and &lt;code&gt;kedro-datasets&lt;/code&gt; greater than version 1.4.0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install kedro~=0.18.0 kedro-datasets~=1.4.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a new Kedro project using the Kedro &lt;code&gt;pyspark&lt;/code&gt; starter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kedro new --starter=pyspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Register the necessary PySpark and streaming-related Hooks
&lt;/h3&gt;

&lt;p&gt;To work with multiple streaming nodes, two Hooks are required. The first integrates PySpark: see &lt;a href="https://docs.kedro.org/en/stable/integrations/pyspark_integration.html" rel="noopener noreferrer"&gt;Build a Kedro pipeline with PySpark&lt;/a&gt; for details. The second keeps the streaming queries running once the pipeline has finished, so that the application only stops when a query terminates or raises an exception.&lt;/p&gt;

&lt;p&gt;Add the following code to &lt;code&gt;src/$your_kedro_project_name/hooks.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkConf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.framework.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hook_impl&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@hook_impl&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_context_created&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Initialises a SparkSession using the config
        defined in project&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s conf folder.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="c1"&gt;# Load the spark configuration in spark.yaml using the config loader
&lt;/span&gt;        &lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config_loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark*/**&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;spark_conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkConf&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;setAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Initialise the spark session
&lt;/span&gt;        &lt;span class="n"&gt;spark_session_conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_package_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enableHiveSupport&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;spark_conf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;_spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session_conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;_spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sparkContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLogLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SparkStreamsHook&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@hook_impl&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_pipeline_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Starts a spark streaming await session
        once the pipeline reaches the last node.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitAnyTermination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Register the Hooks in &lt;code&gt;src/$your_kedro_project_name/settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Project settings. There is no need to edit this file unless you want to change values
from the Kedro defaults. For further information, including these default values, see
https://kedro.readthedocs.io/en/stable/kedro_project_setup/settings.html.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkStreamsHook&lt;/span&gt;

&lt;span class="n"&gt;HOOKS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;SparkStreamsHook&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiated project hooks.
# from streaming.hooks import ProjectHooks
# HOOKS = (ProjectHooks(),)
&lt;/span&gt;
&lt;span class="c1"&gt;# Installed plugins for which to disable hook auto-registration.
# DISABLE_HOOKS_FOR_PLUGINS = ("kedro-viz",)
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages storing KedroSession data.
# from kedro.framework.session.shelvestore import ShelveStore
# SESSION_STORE_CLASS = ShelveStore
# Keyword arguments to pass to the `SESSION_STORE_CLASS` constructor.
# SESSION_STORE_ARGS = {
#     "path": "./sessions"
# }
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages Kedro's library components.
# from kedro.framework.context import KedroContext
# CONTEXT_CLASS = KedroContext
&lt;/span&gt;
&lt;span class="c1"&gt;# Directory that holds configuration.
# CONF_SOURCE = "conf"
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages how configuration is loaded.
# CONFIG_LOADER_CLASS = ConfigLoader
# Keyword arguments to pass to the `CONFIG_LOADER_CLASS` constructor.
# CONFIG_LOADER_ARGS = {
#       "config_patterns": {
#           "spark" : ["spark*/"],
#           "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
#       }
# }
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages the Data Catalog.
# from kedro.io import DataCatalog
# DATA_CATALOG_CLASS = DataCatalog
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
   How to set up your Kedro project to read data from streaming sources
&lt;/h2&gt;

&lt;p&gt;Once you have set up your project, you can use the new &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.spark.SparkStreamingDataSet.html" rel="noopener noreferrer"&gt;Kedro Spark streaming dataset&lt;/a&gt;. To read from a streaming JSON file, configure the data catalog in &lt;code&gt;conf/base/catalog.yml&lt;/code&gt; as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;raw_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spark.SparkStreamingDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/stream/inventory/&lt;/span&gt;
  &lt;span class="na"&gt;file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additional options can be configured via the &lt;code&gt;load_args&lt;/code&gt; key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;int.new_inventory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spark.SparkStreamingDataSet&lt;/span&gt;
   &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/02_intermediate/inventory/&lt;/span&gt;
   &lt;span class="na"&gt;file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;csv&lt;/span&gt;
   &lt;span class="na"&gt;load_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
   How to set up your Kedro project to write data to streaming sinks
&lt;/h2&gt;

&lt;p&gt;All the additional arguments can be kept under the &lt;code&gt;save_args&lt;/code&gt; key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;processed.sensor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spark.SparkStreamingDataSet&lt;/span&gt;
   &lt;span class="na"&gt;file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;csv&lt;/span&gt;
   &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/03_primary/processed_sensor/&lt;/span&gt;
   &lt;span class="na"&gt;save_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;output_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;append&lt;/span&gt;
     &lt;span class="na"&gt;checkpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/04_checkpoint/processed_sensor&lt;/span&gt;
     &lt;span class="na"&gt;header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that when you use the Kafka format, the respective packages should be added to the &lt;code&gt;spark.yml&lt;/code&gt; configuration as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.jars.packages: org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
   Design considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pipeline design
&lt;/h3&gt;

&lt;p&gt;In order to benefit from Spark's internal query optimisation, we recommend that any interim datasets are stored as memory datasets.&lt;/p&gt;

&lt;p&gt;All streams start at the same time, so any nodes that have a dependency on another node that writes to a file sink (i.e. the input to that node is the output of another node) will fail on the first run. This is because there are no files in the file sink for the stream to process when it starts.&lt;/p&gt;

&lt;p&gt;We recommend that you either keep intermediate datasets in memory or split the processing into two pipelines, triggering the first pipeline on its own to build up some initial history.&lt;/p&gt;
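&lt;p&gt;To spot the nodes affected by this first-run problem, one option is to check which nodes read a dataset that another node writes to a file sink. The sketch below is plain Python and does not use the Kedro API; the function name, the dictionaries, and the two-node pipeline are hypothetical and only reuse dataset names from the catalog examples above.&lt;/p&gt;

```python
from typing import Dict, List, Set


def nodes_reading_file_sinks(
    node_inputs: Dict[str, List[str]],
    node_outputs: Dict[str, List[str]],
    file_sinks: Set[str],
) -> List[str]:
    """Return nodes that read a file sink written by some node in the pipeline."""
    # Datasets that are both produced by a node and backed by a file sink.
    written_sinks = {
        dataset
        for outputs in node_outputs.values()
        for dataset in outputs
        if dataset in file_sinks
    }
    # A node is at risk on the first run if it reads any of those datasets.
    return sorted(
        node
        for node, inputs in node_inputs.items()
        if any(dataset in written_sinks for dataset in inputs)
    )


# Hypothetical pipeline: "clean" writes a CSV sink that "aggregate" reads.
node_inputs = {"clean": ["raw_json"], "aggregate": ["int.new_inventory"]}
node_outputs = {"clean": ["int.new_inventory"], "aggregate": ["processed.sensor"]}
file_sinks = {"int.new_inventory", "processed.sensor"}

at_risk = nodes_reading_file_sinks(node_inputs, node_outputs, file_sinks)
print(at_risk)
```

&lt;p&gt;Any node reported here is a candidate for either an in-memory intermediate dataset or a move into a second pipeline.&lt;/p&gt;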

&lt;h3&gt;
  
  
  Feature creation
&lt;/h3&gt;

&lt;p&gt;Be aware that windowing operations only allow windowing on time columns.&lt;/p&gt;

&lt;p&gt;Watermarks must be defined for joins. Only certain types of joins are allowed, and these depend on the input types (stream-stream, stream-static), which can make joining multiple tables a little complex. For further information or advice about join types and watermarking, take a look at the &lt;a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#join-operations" rel="noopener noreferrer"&gt;PySpark documentation&lt;/a&gt; or reach out on the &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Kedro Slack workspace&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
   Logging
&lt;/h2&gt;

&lt;p&gt;When initiated, the Kedro pipeline will download the JAR required for the Spark Kafka integration. After the first run, it won't download the file again but will reuse the previously downloaded copy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttg7xtyy9c59x6zlnn74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttg7xtyy9c59x6zlnn74.png" alt="Spark logging" width="800" height="754"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For each node, the logs for the following will be shown: Loading data, Running nodes, Saving data, Completed x out of y tasks.&lt;/p&gt;

&lt;p&gt;The completed log doesn't mean that the stream processing in that node has stopped. It means that the Spark plan has been created, and if the output dataset is being saved to a sink, the stream has started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsakexf7ctormpsi1ifq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsakexf7ctormpsi1ifq2.png" alt="Spark logging" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once Kedro has run through all the nodes and the full Spark execution plan has been created, you'll see &lt;code&gt;INFO Pipeline execution completed successfully&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This doesn't mean the stream processing has stopped, because the post-run Hook keeps the Spark session alive. As new data comes in, new Spark logs will be shown, even after the "Pipeline execution completed" log.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiocc9hj7jh8o05xe3ji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiocc9hj7jh8o05xe3ji.png" alt="Spark logging" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If there is an error in the input data, the Spark error logs will come through and Kedro will shut down the SparkContext and all the streams within it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwultqq2h553hem0dbfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwultqq2h553hem0dbfa.png" alt="Spark logging" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
   In summary
&lt;/h2&gt;

&lt;p&gt;In this article, we explained how to extend Kedro by building a new dataset to create streaming pipelines. We created a new Kedro project using the Kedro &lt;code&gt;pyspark&lt;/code&gt; starter and illustrated how to work with Hooks, adding them to the Kedro project to enable it to function as a streaming application. The new dataset was then easy to configure through the Kedro data catalog to define the streaming sources and sinks.&lt;/p&gt;

&lt;p&gt;There are currently some limitations because it is not yet ready for use with a service broker, e.g. Kafka, as an additional JAR package is required.&lt;/p&gt;

&lt;p&gt;If you want to find out more about the ways to extend Kedro, take a look at the &lt;a href="https://docs.kedro.org/en/stable/extend_kedro/index.html" rel="noopener noreferrer"&gt;advanced Kedro documentation&lt;/a&gt; for more about Kedro plugins, datasets, and Hooks.&lt;/p&gt;

&lt;h2&gt;
  
  
   Contributors
&lt;/h2&gt;

&lt;p&gt;This post was created by &lt;a href="https://www.linkedin.com/in/tingting-w-93b32516a/" rel="noopener noreferrer"&gt;Tingting Wan&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/chivo369/" rel="noopener noreferrer"&gt;Tom Kurian&lt;/a&gt;, and &lt;a href="https://uk.linkedin.com/in/harismichailidis" rel="noopener noreferrer"&gt;Haris Michailidis&lt;/a&gt;, who are all Data Engineers in the London office of &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/how-we-help-clients" rel="noopener noreferrer"&gt;QuantumBlack, AI by McKinsey&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>kedro</category>
      <category>spark</category>
      <category>streaming</category>
    </item>
    <item>
      <title>Get up to speed: how to build a custom Kedro runner</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 22 Jun 2023 09:46:25 +0000</pubDate>
      <link>https://dev.to/kedro/get-up-to-speed-how-to-build-a-custom-kedro-runner-2dj3</link>
      <guid>https://dev.to/kedro/get-up-to-speed-how-to-build-a-custom-kedro-runner-2dj3</guid>
      <description>&lt;p&gt;In Kedro, &lt;a href="https://docs.kedro.org/en/stable/nodes_and_pipelines/run_a_pipeline.html" rel="noopener noreferrer"&gt;runners are the execution mechanism&lt;/a&gt; for data science and machine learning pipelines. The default behaviour of all of Kedro’s built-in runners is to halt pipeline execution if an error occurs that is significant enough to cause any of the nodes to fail, as shown in the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzzcpeuxofk4qta7mzeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzzcpeuxofk4qta7mzeg.png" alt="Sequential runner when a node fails" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram, the entire run aborts when it encounters a node that it cannot run, terminating all other sections or branches of the pipeline, even those that it could have run.&lt;/p&gt;

&lt;p&gt;The custom runner described in this article was specifically developed for a top player in the mining industry that uses Kedro to construct data pipelines for BI dashboards essential for operational excellence.&lt;/p&gt;

&lt;p&gt;The client’s pipeline is designed to be resilient towards node failures. Certain nodes operate independently of each other, and especially during the development and exploration stages, the failure of a single node does not necessitate the termination of the entire Kedro run. The desired behaviour is as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytrrj37ihr5o3u8f45uv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytrrj37ihr5o3u8f45uv.png" alt="Custom runner that does not halt all nodes when a failure is encountered" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram, the runner meets a node that cannot run but finds other sections or branches that it can execute.&lt;/p&gt;

&lt;p&gt;The client relies on Kedro to execute a substantial pipeline that retrieves data from various sources. Some of the input datasets are manually created, which introduces the possibility of errors if entries are mistyped or omitted. By allowing the pipeline to continue and bypass nodes as they encounter failures, it becomes possible to compile a comprehensive list of data issues during a single run and address them collectively.&lt;/p&gt;

&lt;p&gt;In comparison, the default Kedro approach is considerably more time-consuming as it halts the pipeline upon the failure of a single node, leading to a repetitive cycle of fixing one issue, rerunning the pipeline to encounter the next issue, fixing that, and so on.&lt;/p&gt;

&lt;p&gt;Executing all feasible nodes within the pipeline provides an additional advantage. In cases where no data issues arise, completing the pipeline allows the available metrics to be displayed on a BI dashboard, ensuring service continuity. For instance, if only one data source is corrupted, the BI metrics that depend on that specific data need to be withheld, but all others can be showcased. In contrast, the default Kedro behaviour would render all metrics unavailable until the single dataset issue is resolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: a customised Kedro runner
&lt;/h2&gt;

&lt;p&gt;As an open-source project, Kedro enables you to define a custom runner for your project. The team took the open-source code for Kedro’s sequential runner and extended it, since the code didn’t need any parallelisation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“One of the reasons we selected Kedro is that it is open source and highly extensible. We knew from the outset that we could make our own customisations”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The team created a soft-fail runner to transform errors into warnings, allowing the pipeline to continue executing to the best of its ability while providing a report of any nodes that failed, so that data issues can be addressed. At that point, the pipeline run can be finalised by executing only those missing nodes separately, using appropriate Kedro syntax.&lt;/p&gt;

&lt;p&gt;The resulting &lt;code&gt;SoftFailRunner&lt;/code&gt; is an implementation of &lt;a href="https://docs.kedro.org/en/stable/kedro.runner.AbstractRunner.html" rel="noopener noreferrer"&gt;&lt;code&gt;AbstractRunner&lt;/code&gt;&lt;/a&gt; that runs a pipeline sequentially using a topological sort of provided nodes. Unlike the built-in &lt;a href="https://docs.kedro.org/en/stable/kedro.runner.SequentialRunner.html" rel="noopener noreferrer"&gt;&lt;code&gt;SequentialRunner&lt;/code&gt;&lt;/a&gt;, this runner does not terminate the pipeline but runs any remaining nodes as long as their dependencies are fulfilled. The &lt;code&gt;SoftFailRunner&lt;/code&gt; implementation adds two arguments: &lt;code&gt;--from-nodes&lt;/code&gt; and &lt;code&gt;--runner&lt;/code&gt;. The essential code for the &lt;code&gt;SoftFailRunner&lt;/code&gt; is shown below and the full code &lt;a href="https://github.com/kedro-org/kedro/blob/feat/softfail-runner/kedro/runner/softfail_runner.py" rel="noopener noreferrer"&gt;can be found on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxn3477gb3px8sk16fhf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxn3477gb3px8sk16fhf.png" alt="Code for the soft-fail runner" width="800" height="979"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The logic behind the runner is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Addition of a new &lt;code&gt;skip_nodes&lt;/code&gt; variable to keep track of which nodes should be skipped.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every time a node is about to run, the &lt;code&gt;skip_nodes&lt;/code&gt; list is checked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When a node fails, all of its descendants are added to &lt;code&gt;skip_nodes&lt;/code&gt; using a &lt;a href="https://en.wikipedia.org/wiki/Breadth-first_search" rel="noopener noreferrer"&gt;breadth-first search (BFS)&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
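&lt;p&gt;The three steps above can be sketched in plain Python. This is a simplified illustration, not the actual &lt;code&gt;SoftFailRunner&lt;/code&gt; code: it does not use the Kedro API, and the function name &lt;code&gt;soft_fail_run&lt;/code&gt;, the &lt;code&gt;children&lt;/code&gt; mapping, and the toy pipeline are assumptions made for the example.&lt;/p&gt;

```python
from collections import deque
from typing import Callable, Dict, List, Set, Tuple


def soft_fail_run(
    order: List[str],
    children: Dict[str, List[str]],
    funcs: Dict[str, Callable[[], None]],
) -> Tuple[List[str], List[str], Set[str]]:
    """Run nodes in topological order, skipping descendants of failed nodes."""
    done: List[str] = []
    failed: List[str] = []
    skip_nodes: Set[str] = set()  # step 1: nodes that must not run

    for name in order:
        if name in skip_nodes:  # step 2: check before each node runs
            continue
        try:
            funcs[name]()
            done.append(name)
        except Exception:
            # Turn the error into a recorded failure instead of aborting.
            failed.append(name)
            # Step 3: breadth-first search over the failed node's descendants.
            queue = deque(children.get(name, []))
            while queue:
                child = queue.popleft()
                if child not in skip_nodes:
                    skip_nodes.add(child)
                    queue.extend(children.get(child, []))
    return done, failed, skip_nodes


def ok() -> None:
    pass


def boom() -> None:
    raise ValueError("bad input data")


# Hypothetical pipeline: a -> b -> c, plus an independent branch d -> e.
order = ["a", "d", "b", "e", "c"]
children = {"a": ["b"], "b": ["c"], "d": ["e"]}
funcs = {"a": boom, "b": ok, "c": ok, "d": ok, "e": ok}

done, failed, skipped = soft_fail_run(order, children, funcs)
print(done, failed, sorted(skipped))
```

&lt;p&gt;In this toy run, node &lt;code&gt;a&lt;/code&gt; fails, so &lt;code&gt;b&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt; are skipped while the independent branch still executes. The list of failed nodes is the report that would be used to fix the data and rerun only what is missing.&lt;/p&gt;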

&lt;h2&gt;
  
  
   In summary
&lt;/h2&gt;

&lt;p&gt;The customised Kedro runner was straightforward to create and a satisfactory solution to enable maximum efficiency when handling this particular pipeline and dataset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“These results could certainly be achieved with an orchestrator, but using an open-source project with customisation is a quick win for delivering business value”.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>datascience</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Collaborative experiment tracking in Kedro-Viz</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Fri, 02 Jun 2023 14:09:58 +0000</pubDate>
      <link>https://dev.to/kedro/collaborative-experiment-tracking-in-kedro-viz-3697</link>
      <guid>https://dev.to/kedro/collaborative-experiment-tracking-in-kedro-viz-3697</guid>
      <description>&lt;p&gt;When training a model in machine learning, the goal is to determine the optimal configuration of attributes such as hyper-parameters, metrics, and training data. The process of identifying the best combinations requires running a lot of experiments and comparing them. As I mentioned in my &lt;a href="https://kedro.org/blog/experiment-tracking-with-kedro" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, experiment tracking is a way to record all the metadata you need to compare machine-learning experiments and recreate them for your project.&lt;/p&gt;

&lt;h2&gt;
  
  
   What is Kedro-Viz?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro-viz" rel="noopener noreferrer"&gt;Kedro-Viz&lt;/a&gt; is an interactive development tool for building and visualising data science pipelines with &lt;a href="https://github.com/kedro-org/kedro" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt;. It enables you to monitor the status of your ML project, present it to stakeholders, and easily bring new team members up to speed. You can try it out using our &lt;a href="https://demo.kedro.org/" rel="noopener noreferrer"&gt;hosted demo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;“&lt;em&gt;There's no better method to give an overview of a pipeline's structure in such an engaging, interactive, and thorough way. Our asset's pipelines are very complex, but are structured with modular pipelines, so being able to show the overall structure at the modular pipeline level, before jumping into each individual pipeline helps prevent the audience from getting overwhelmed by the number of nodes and datasets shown&lt;/em&gt;”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Senior Data Scientist at Consultancy&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is experiment tracking in Kedro-Viz?
&lt;/h2&gt;

&lt;p&gt;Experiment tracking on Kedro-Viz enables users to select, plot, and compare how multiple metrics change over time, and identify the best-performing ML experiment, with no additional dependencies to manage or infrastructure needed.&lt;/p&gt;

&lt;p&gt;The video below demonstrates experiment tracking on Kedro-Viz:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/odXhTEa50PU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;During a project with multiple team members, you could end up with a scenario where the results of your experiments are spread across many machines because people are iterating on their individual computers. This makes the tracking process difficult to manage at a team level, as suggested by this feedback from our users.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You might train one model locally on your computer. You might train another one in the cloud. Joe might run another pipeline or another experiment. Having all of those experiments in one place as a single source of truth is really powerful.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"If we could write our metrics files to an S3 bucket and then run experiment tracking pointing at that S3 bucket, that simplifies our workflow in many different ways and would be really helpful. And it would make Kedro experiment tracking just as easy, if not easier, than MLFlow for us."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Can you use an existing database so that we can keep track of runs happening in different places?&lt;/em&gt;"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We have found a way to address this pain point and enable you to collaborate more easily. We are excited to announce that we've &lt;a href="https://www.linen.dev/s/kedro/t/12096327/kedro-kedro-kedro-kedro-kedro-kedro-viz-6-2-0-is-out-kedro-k#ba733439-8aac-46f5-9c37-d015287835cc" rel="noopener noreferrer"&gt;launched collaborative experiment tracking&lt;/a&gt; in Kedro-Viz 6.2.0. The new feature enables a team of users to log their experiments to a shared cloud storage service and view and compare each other's experiments in their own experiment tracking view. This simplifies their workflow, provides a single ‘source of truth’, and encourages multi-user collaboration.&lt;/p&gt;

&lt;p&gt;We are releasing this feature in stages across different versions, and the first phase is &lt;a href="https://github.com/kedro-org/kedro-viz/releases" rel="noopener noreferrer"&gt;Kedro-Viz 6.2.0&lt;/a&gt;. This version enables users to read experiments of other users that are stored on Amazon S3 or similar storage solutions on other cloud providers, as long as they are supported by &lt;a href="https://filesystem-spec.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;fsspec&lt;/a&gt;. Future versions of collaborative experiment tracking aim to improve the user experience through automatic reloading and optimisation by caching.&lt;/p&gt;

&lt;h2&gt;
  
  
   Get started with collaborative experiment tracking
&lt;/h2&gt;

&lt;p&gt;Follow these steps to set up collaborative experiment tracking in Kedro-Viz:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Update Kedro-Viz
&lt;/h3&gt;

&lt;p&gt;Ensure you have the latest version of Kedro-Viz (6.2.0 or later).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;kedro-viz &lt;span class="nt"&gt;--upgrade&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Set up cloud storage
&lt;/h3&gt;

&lt;p&gt;Kedro-Viz uses &lt;a href="https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=s3#other-known-implementations" rel="noopener noreferrer"&gt;fsspec&lt;/a&gt; to save and read &lt;code&gt;session_store&lt;/code&gt; files from a variety of data stores, including local file systems, network file systems, cloud object stores (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage), and HDFS.&lt;/p&gt;

&lt;p&gt;Set up a central cloud storage repository, such as an AWS S3 bucket, to store all your team's experiments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Configure your Kedro project
&lt;/h3&gt;

&lt;p&gt;Locate the &lt;code&gt;settings.py&lt;/code&gt; file in your Kedro project directory and add the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro_viz.integrations.kedro.sqlite_store&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLiteStore&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;SESSION_STORE_CLASS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLiteStore&lt;/span&gt;
&lt;span class="n"&gt;SESSION_STORE_ARGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remote_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket-name/path/to/experiments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Set up a unique username
&lt;/h3&gt;

&lt;p&gt;Kedro-Viz saves your experiments as SQLite database files on the central cloud storage. To ensure that all users have unique filenames, set the &lt;code&gt;KEDRO_SQLITE_STORE_USERNAME&lt;/code&gt; environment variable. If it is not specified, Kedro-Viz uses your computer username by default.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;KEDRO_SQLITE_STORE_USERNAME &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_unique__username"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
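&lt;p&gt;The fallback described above can be pictured with a few lines of Python. This is only a sketch of the idea, not Kedro-Viz code: the helper names are hypothetical, and the one-&lt;code&gt;.db&lt;/code&gt;-file-per-user naming scheme is an assumption made for illustration.&lt;/p&gt;

```python
from typing import Mapping


def resolve_store_username(env: Mapping[str, str], default_user: str) -> str:
    """Prefer KEDRO_SQLITE_STORE_USERNAME, else fall back to the default user."""
    return env.get("KEDRO_SQLITE_STORE_USERNAME") or default_user


def remote_db_name(env: Mapping[str, str], default_user: str) -> str:
    # Assumed naming scheme for illustration: one SQLite file per user.
    return f"{resolve_store_username(env, default_user)}.db"


# With the variable set, the explicit username wins.
with_var = remote_db_name({"KEDRO_SQLITE_STORE_USERNAME": "huong"}, "laptop-user")
# Without it, the computer username (passed in here as a plain string) is used.
without_var = remote_db_name({}, "laptop-user")
print(with_var, without_var)
```

&lt;p&gt;In a real session you would pass &lt;code&gt;os.environ&lt;/code&gt; and something like &lt;code&gt;getpass.getuser()&lt;/code&gt; instead of the literal values. The point is that two team members with the same computer username would collide unless at least one of them sets the variable.&lt;/p&gt;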



&lt;h3&gt;
  
  
  Step 5: Configure cloud storage credentials
&lt;/h3&gt;

&lt;p&gt;As of Kedro-Viz version 6.2, the only way to set up credentials for accessing your cloud storage is through environment variables, as shown below for &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html" rel="noopener noreferrer"&gt;Amazon S3 cloud storage&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_access_key_id"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_secret_access_key"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_aws_region"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
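&lt;p&gt;Because the credentials come only from environment variables, it can help to check that they are present before starting a run. The snippet below is a small, hypothetical pre-flight check in plain Python; the function name and the example values are assumptions, and only the three variable names come from the commands above.&lt;/p&gt;

```python
from typing import List, Mapping

# The three variables exported in the shell example above.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


# Example environment with the secret key forgotten.
example_env = {
    "AWS_ACCESS_KEY_ID": "your_access_key_id",
    "AWS_REGION": "your_aws_region",
}
missing = missing_aws_vars(example_env)
print(missing)
```

&lt;p&gt;Passing &lt;code&gt;os.environ&lt;/code&gt; instead of the example dictionary would run the same check against your current shell session.&lt;/p&gt;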



&lt;p&gt;In the screenshot below we show an example of the session store and Kedro-Viz output for three team members (Huong, Tynan, and Rashida):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt9sgfuto8gzhbf11h6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt9sgfuto8gzhbf11h6v.png" alt="Session store showing the 3 objects for Huong, Tynan, and Rashida" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Session store showing the 3 objects for Huong, Tynan, and Rashida.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjuydxwja7hlf4r61lwe0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjuydxwja7hlf4r61lwe0.png" alt="Three separate Kedro-Viz runs by Huong, Tynan, and Rashida" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three separate Kedro-Viz runs by Huong, Tynan, and Rashida.&lt;/p&gt;

&lt;p&gt;This tutorial offers a quick run-through of the configuration process. For further information, check out the &lt;a href="https://docs.kedro.org/en/stable/experiment_tracking/index.html" rel="noopener noreferrer"&gt;documentation on the experiment tracking feature&lt;/a&gt; and keep up-to-date with the latest news about Kedro and Kedro-Viz on our &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Slack workspace&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Many thanks to the Kedro-Viz team, especially &lt;a href="https://github.com/rashidakanchwala" rel="noopener noreferrer"&gt;@Rashida Kanchwala&lt;/a&gt;, for contributing to this post.&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Lightning-fast queries with Polars</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 25 May 2023 12:33:46 +0000</pubDate>
      <link>https://dev.to/astrojuanlu/lightning-fast-queries-with-polars-1bp3</link>
      <guid>https://dev.to/astrojuanlu/lightning-fast-queries-with-polars-1bp3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is an adaptation of the one I originally published &lt;a href="https://www.orchest.io/blog/the-great-python-dataframe-showdown-part-3-lightning-fast-queries-with-polars" rel="noopener noreferrer"&gt;in the Orchest blog&lt;/a&gt;. Lots of things have changed in Polars since I wrote it, but at the time of writing these lines the post still has value. Enjoy!&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Polars is an open-source project that provides in-memory dataframes for Python and Rust. Despite its young age (&lt;a href="https://github.com/pola-rs/polars/commit/2714893dd8061644a7aa0fe0e983c2faf17d18c1" rel="noopener noreferrer"&gt;its first commit was a mere two years ago&lt;/a&gt;, in the middle of the COVID-19 pandemic) it has already gained lots of popularity due to its "lightning-fast" performance and the expressiveness of its API.&lt;/p&gt;

&lt;p&gt;One of the most interesting things about Polars is that it offers two modes of operation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The &lt;strong&gt;eager&lt;/strong&gt; mode is somewhat similar to how pandas works: operations are executed immediately and their result is available in memory. However, every operation in a chain needs to allocate a new DataFrame, which is less than ideal.&lt;/li&gt;
&lt;li&gt;  The &lt;strong&gt;lazy&lt;/strong&gt; mode, on the other hand, builds an optimized query plan that exploits parallelism as much as possible: Polars applies several simplification techniques and pushes filters and column selections down toward the data source, so that less data has to be processed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These ideas are not new: in fact, in &lt;a href="https://dev.to/astrojuanlu/out-of-core-processing-with-vaex-3724"&gt;my blog post about Vaex&lt;/a&gt; we covered its lazy computation capabilities. However, Polars takes them one step further by offering a functional API that is delightful to use.&lt;/p&gt;

&lt;p&gt;The other secret sauce of Polars is Apache Arrow. While other libraries use Arrow for things like reading Parquet files, Polars is tightly coupled with it: by using a Rust-native implementation of &lt;a href="https://dev.to/astrojuanlu/demystifying-apache-arrow-5b0a"&gt;the Arrow memory format&lt;/a&gt; for its columnar storage, Polars can leverage the highly optimized Arrow data structures and focus on the data manipulation operations.&lt;/p&gt;

&lt;p&gt;Interested? Read on!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgr5v3u0z6ll74ql8zv0j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgr5v3u0z6ll74ql8zv0j.png" alt="Polars popularity is growing fast" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Polars popularity is growing fast (&lt;a href="https://twitter.com/braaannigan/status/1526901314978029568" rel="noopener noreferrer"&gt;https://twitter.com/braaannigan/status/1526901314978029568&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  First steps with Polars
&lt;/h3&gt;

&lt;p&gt;For this example, we will use &lt;a href="https://www.kaggle.com/datasets/stackoverflow/stacksample" rel="noopener noreferrer"&gt;a sample of Stack Overflow questions and their tags&lt;/a&gt; obtained from Kaggle. Our overall goal is to display the most highly voted Python questions.&lt;/p&gt;

&lt;p&gt;You can install Polars with conda/mamba or pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mamba install -y "polars=0.13.37"  
pip install "polars==0.13.37"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though Polars is written in Rust, it distributes precompiled binary wheels on PyPI, so pip install will just work on all major Python versions from 3.6 onwards.&lt;/p&gt;

&lt;p&gt;Let's load the Questions and Tags CSV files using &lt;em&gt;pl.read_csv&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;  

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/data/stacksample/Questions.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf8-lossy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/data/stacksample/Tags.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The type of both objects is &lt;em&gt;polars.internals.frame.DataFrame&lt;/em&gt;, "a two-dimensional data structure that represents data as a table with rows and columns" (&lt;a href="https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.html#polars.DataFrame" rel="noopener noreferrer"&gt;reference docs&lt;/a&gt;). Both dataframes have millions of rows, and the Questions one takes almost 2 GB of memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1264216&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3750994&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated size: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;estimated_size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; MiB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;Estimated&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1865&lt;/span&gt; &lt;span class="n"&gt;MiB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Polars dataframes have some typical methods we know from pandas to inspect the data. Notice that calling the print function on a DataFrame produces a tidy ASCII representation, in addition to the fancy HTML representation available in Jupyter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# No `print` needed on Jupyter  
&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt;┌─────┬─────────────┬─────────────────┬─────────────────┬───────┬─────────────────┬────────────────┐&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Id&lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;OwnerUserId&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;CreationDate&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;ClosedDate&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;Score&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;Title&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;Body&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;╞═════╪═════════════╪═════════════════╪═════════════════╪═══════╪═════════════════╪════════════════╡&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mi"&gt;2008&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="n"&gt;T13&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;NA&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;SQLStatement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ve        │  
│     ┆             ┆ 7:07Z           ┆                 ┆       ┆ ecute() -       ┆ written a      │  
│     ┆             ┆                 ┆                 ┆       ┆ multipl...      ┆ database       │  
│     ┆             ┆                 ┆                 ┆       ┆                 ┆ gener...       │  
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 90  ┆ 58          ┆ 2008-08-01T14:4 ┆ 2012-12-26T03:4 ┆ 144   ┆ Good branching  ┆ &amp;lt;p&amp;gt;Are there   │  
│     ┆             ┆ 1:24Z           ┆ 5:49Z           ┆       ┆ and merging     ┆ any really     │  
│     ┆             ┆                 ┆                 ┆       ┆ tutor...        ┆ good tut...    │  
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 120 ┆ 83          ┆ 2008-08-01T15:5 ┆ NA              ┆ 21    ┆ ASP.NET Site    ┆ &amp;lt;p&amp;gt;Has anyone  │  
│     ┆             ┆ 0:08Z           ┆                 ┆       ┆ Maps            ┆ got experience │  
│     ┆             ┆                 ┆                 ┆       ┆                 ┆ cre...         │  
└─────┴─────────────┴─────────────────┴─────────────────┴───────┴─────────────────┴────────────────┘  
In [10]: print(df.describe())  
shape: (5, 8)  
┌──────────┬─────────────┬─────────────┬──────────────┬────────────┬───────────┬───────┬──────┐  
│ describe ┆ Id          ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score     ┆ Title ┆ Body │  
│ ---      ┆ ---         ┆ ---         ┆ ---          ┆ ---        ┆ ---       ┆ ---   ┆ ---  │  
│ str      ┆ f64         ┆ str         ┆ str          ┆ str        ┆ f64       ┆ str   ┆ str  │  
╞══════════╪═════════════╪═════════════╪══════════════╪════════════╪═══════════╪═══════╪══════╡  
│ mean     ┆ 2.1327e7    ┆ null        ┆ null         ┆ null       ┆ 1.781537  ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ std      ┆ 1.1514e7    ┆ null        ┆ null         ┆ null       ┆ 13.663886 ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ min      ┆ 80.0        ┆ null        ┆ null         ┆ null       ┆ -73.0     ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ max      ┆ 4.014338e7  ┆ null        ┆ null         ┆ null       ┆ 5190.0    ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ median   ┆ 2.1725415e7 ┆ null        ┆ null         ┆ null       ┆ 0.0       ┆ null  ┆ null │  
└──────────┴─────────────┴─────────────┴──────────────┴────────────┴───────────┴───────┴──────┘  
In [11]: print(tags[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;].value_counts().head())  
shape: (5, 2)  
┌────────────┬────────┐  
│ Tag        ┆ counts │  
│ ---        ┆ ---    │  
│ str        ┆ u32    │  
╞════════════╪════════╡  
│ javascript ┆ 124155 │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ java       ┆ 115212 │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ c#         ┆ 101186 │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ php        ┆ 98808  │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ android    ┆ 90659  │  
└────────────┴────────┘
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following a terminology similar to pandas, Polars dataframes contain several columns of type &lt;em&gt;polars.internals.series.Series&lt;/em&gt;, each of them with a different &lt;a href="https://pola-rs.github.io/polars-book/user-guide/datatypes.html" rel="noopener noreferrer"&gt;data type&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;  
&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
&lt;span class="p"&gt;[&lt;/span&gt;  
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SQLStatement.e...  
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="n"&gt;Good&lt;/span&gt; &lt;span class="n"&gt;branching&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;  
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ASP.NET Site M...  
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="n"&gt;Function&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;  
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Adding scripti...  
]  

In [13]: df.dtypes  
Out[13]: [polars.datatypes.Int64,  
polars.datatypes.Utf8,  
polars.datatypes.Utf8,  
polars.datatypes.Utf8,  
polars.datatypes.Int64,  
polars.datatypes.Utf8,  
polars.datatypes.Utf8]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Expressions as chained operations on columns
&lt;/h3&gt;

&lt;p&gt;The essential building blocks in Polars are &lt;strong&gt;expressions&lt;/strong&gt;: functions that receive a Series and transform it into another Series. Expressions &lt;a href="https://stackoverflow.com/a/72121352/554319" rel="noopener noreferrer"&gt;start with a root&lt;/a&gt;, and then you can chain more operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(  
   pl.col("Score")  # Root of the Expression (a single column)  
   .mean()  # Returns another Expression  
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most interesting feature is that expressions are not bound to a specific object, but instead they are generic. Chains of expressions define the computation, which is materialized by a DataFrame method (acting as an &lt;strong&gt;execution context&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;Sounds too abstract? See it in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;  
&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt;┌──────────┐&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Score&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;╞══════════╡&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;1.781537&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;└──────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;em&gt;df.select&lt;/em&gt; method can do much more than just selecting columns: it can execute any column-wise expression. In fact, when passed a list of such expressions, it can broadcast them automatically if the dimensions are coherent, and it will execute them in parallel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;n_unique&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_unique_users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lengths&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_title_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="c1"&gt;# To run the above in all text columns,  
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="c1"&gt;# you can filter by data type:  
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="c1"&gt;# pl.col(pl.Utf8).str.lengths().max().suffix("_max_length"),  
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="p"&gt;]))&lt;/span&gt;  
&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt;┌──────────────────┬────────────┬──────────────────┐&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;num_unique_users&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;mean_score&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;max_title_length&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;╞══════════════════╪════════════╪══════════════════╡&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;1264216&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mf"&gt;1.781537&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;204&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;└──────────────────┴────────────┴──────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The power of laziness
&lt;/h3&gt;

&lt;p&gt;It is now time to narrow down the analysis and focus on the questions related to Python. Polars algorithms require all the data to live in memory, so the usual caveats about large datasets apply when using the eager API. Since the questions dataset is already quite big, performing a .join operation with the tags data can crash the kernel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Don't try this at home unless you have enough RAM!  
# (  
#     df  
#     .join(tags, on="Id")  
#     .filter(pl.col("Tag").str.contains(r"(?i)python"))  
#     .sort("Id")  
# )
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But fear not, because Polars has the perfect solution: switching to lazy mode! By prefixing the chain of operations with .lazy() and calling .collect() at the end, you let Polars optimize the whole query before running it, which makes it possible to perform operations that would otherwise be impossible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;q_python&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Notice the .lazy() call  
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="c1"&gt;# The input of a lazy join needs to be lazy  
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="c1"&gt;# We use a 'semi' join, like 'inner' but discarding extra columns  
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(i?)python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Call .collect() at the end  
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q_python&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt;┌───────┬─────────────┬──────────────────┬────────────┬───────┬──────────────────┬─────────────────┐&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Id&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;OwnerUserId&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;CreationDate&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;ClosedDate&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;Score&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;Title&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;Body&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;╞═══════╪═════════════╪══════════════════╪════════════╪═══════╪══════════════════╪═════════════════╡&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;11060&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;912&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;2008&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="n"&gt;T13&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;59&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;NA&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;How&lt;/span&gt; &lt;span class="n"&gt;should&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;This&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;difficult&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;17250&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;394&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;2008&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="n"&gt;T00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;NA&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;Create&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m creating │  
│       ┆             ┆ :40Z             ┆            ┆       ┆ encrypted ZIP    ┆ an ZIP file     │  
│       ┆             ┆                  ┆            ┆       ┆ file in ...      ┆ with...         │  
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 19030 ┆ 745         ┆ 2008-08-20T22:50 ┆ NA         ┆ 2     ┆ How to check set ┆ &amp;lt;p&amp;gt;I have a     │  
│       ┆             ┆ :55Z             ┆            ┆       ┆ of files         ┆ bunch of files  │  
│       ┆             ┆                  ┆            ┆       ┆ confor...        ┆ (TV e...        │
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
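
&lt;p&gt;To make the "semi" join in the query above more concrete, here is a minimal plain-Python sketch of its semantics. It is not how Polars implements the join, and the sample rows and the semi_join helper are made up for illustration: the right-hand table is only used to decide which left-hand rows survive, so no columns are added and no rows are duplicated.&lt;/p&gt;

```python
# Plain-Python sketch of "semi" join semantics (not the Polars implementation).
# The sample rows below are made up for illustration.
questions = [
    {"Id": 10, "Title": "How do I sort a list?"},
    {"Id": 20, "Title": "What is a monad?"},
    {"Id": 30, "Title": "Why is my loop slow?"},
]
tags = [
    {"Id": 10, "Tag": "python"},
    {"Id": 10, "Tag": "sorting"},
    {"Id": 30, "Tag": "python"},
]


def semi_join(left, right, on):
    """Keep the rows of `left` whose key appears in `right`, adding no columns."""
    right_keys = {row[on] for row in right}
    return [row for row in left if row[on] in right_keys]


# Filter the right-hand side first, then use it only to select left-hand rows
python_tags = [row for row in tags if "python" in row["Tag"]]
python_questions = semi_join(questions, python_tags, on="Id")
print(python_questions)
```

&lt;p&gt;An inner join on the unfiltered tags would return question 10 twice, once per tag, and would carry the Tag column along. The semi join returns it once, with only the original question columns.&lt;/p&gt;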



&lt;p&gt;In fact, if your raw CSV is so big that it doesn't fit in RAM to begin with, Polars also offers a lazy way of reading the file, scan_csv:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We create the query plan separately  
&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# scan_csv returns a lazy dataframe already  
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/data/stacksample/Questions.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf8-lossy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(i?)python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;top_voted_python_qs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
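
&lt;p&gt;The idea of describing the work first and running it later can be sketched with nothing but the standard library. The snippet below is only an analogy, not what Polars does internally (Polars builds and optimizes a query plan). The in-memory CSV, its contents, and the scan_rows helper are made up for illustration.&lt;/p&gt;

```python
import csv
import heapq
import io

# Made-up stand-in for a large CSV file; a real file object works the same way
raw_csv = io.StringIO(
    "Id,Score,Title\n"
    "1,5,Sorting in Python\n"
    "2,42,Python decorators explained\n"
    "3,7,Rust lifetimes\n"
    "4,13,Python generators\n"
)


def scan_rows(handle):
    """Yield one parsed row at a time instead of loading the whole file."""
    for row in csv.DictReader(handle):
        row["Score"] = int(row["Score"])
        yield row


# Nothing is read yet: generators only describe the work to be done
rows = scan_rows(raw_csv)
python_rows = (row for row in rows if "python" in row["Title"].lower())

# Consuming the pipeline plays the role of .collect(); nlargest keeps only
# the best `n` rows in memory while the rest stream through
top_two = heapq.nlargest(2, python_rows, key=lambda row: row["Score"])
print([row["Title"] for row in top_two])
```

&lt;p&gt;Because the filter and the "top N" step are known before any row is read, only a handful of rows are ever held in memory at once. That is the same reason a sort followed by a limit is cheap to express lazily.&lt;/p&gt;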



&lt;p&gt;If you are curious about how Polars does all this work under the hood, you can visualize the query plan!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vhayr6jhnv7fx9ebx77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vhayr6jhnv7fx9ebx77.png" alt="Polars visualization of a query plan (not optimized)" width="679" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Polars visualization of a query plan (not optimized)&lt;/p&gt;

&lt;h3&gt;
  
  
  Working with columns of lists
&lt;/h3&gt;

&lt;p&gt;Notice that, in the previous section, we did a "semi" join to filter the questions, but we still don't have the list of tags associated with those questions. To achieve that, we will use one of the most surprisingly pleasant features of Polars: its list-column handling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;tag_list_lazy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Convert to a list of strings  
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TagList&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;...:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag_list_lazy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  
&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt;┌──────────┬─────────────────────────────────────┐&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Id&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;TagList&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;╞══════════╪═════════════════════════════════════╡&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;994990&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;29087440&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;android&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;android-intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;12093870&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;asp.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqldatasour... │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 32889780 ┆ [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="n"&gt;extern&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;declar&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;22436290&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mysql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiple-t... │  
└──────────┴─────────────────────────────────────┘
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After grouping by "Id" and turning each row into a list of tags, it's time to add a boolean column "ContainsPython" that signals whether any of the tags in the list contains the substring "python". For that, let's use the &lt;code&gt;.arr.eval&lt;/code&gt; context (also known as the &lt;a href="https://pola-rs.github.io/polars-book/user-guide/dsl/list_context.html" rel="noopener noreferrer"&gt;List context&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tag_list_extended_lazy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tag_list_lazy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TagList&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;element&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(i?)python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ContainsPython&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
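
&lt;p&gt;To make the logic of the list context concrete, here is a minimal plain-Python sketch of what the expression computes, without using Polars at all. The question Ids 1000001 and 1000002 and their tags are made up for illustration, and the dictionary is only a stand-in for the "TagList" column:&lt;/p&gt;

```python
import re

# Hypothetical sample standing in for the "TagList" column: question Id -> tags.
# Only the first entry appears in the output above; the others are invented.
tag_lists = {
    994990: ["spring"],
    1000001: ["Python", "numpy"],
    1000002: ["ironpython", ".net"],
}

# (?i) turns on case-insensitive matching for the whole pattern.
pattern = re.compile(r"(?i)python")

# For each list, test every element and reduce the results with any(),
# which is the role .any() plays inside the list context.
contains_python = {
    question_id: any(pattern.search(tag) is not None for tag in tags)
    for question_id, tags in tag_lists.items()
}

print(contains_python)
```

&lt;p&gt;Each list is reduced to a single boolean, which is why the result can become a regular boolean column.&lt;/p&gt;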



&lt;p&gt;The final join will provide the answer we are looking for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;top_python_questions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/data/stacksample/Questions.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf8-lossy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag_list_extended_lazy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ContainsPython&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
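
&lt;p&gt;If the chain of operations is hard to follow, this plain-Python sketch walks through the same four steps (inner join on "Id", filter, sort by descending score, limit) on a few invented rows. The function name &lt;code&gt;top_questions&lt;/code&gt; and the data are hypothetical, and the sketch says nothing about how Polars executes the query internally:&lt;/p&gt;

```python
# Hypothetical rows standing in for Questions.csv and for the tag table.
questions = [
    {"Id": 1, "Score": 10, "Title": "a"},
    {"Id": 2, "Score": 50, "Title": "b"},
    {"Id": 3, "Score": 30, "Title": "c"},
    {"Id": 4, "Score": 99, "Title": "d"},
]
contains_python = {1: True, 2: True, 3: False}  # Id 4 has no tags at all


def top_questions(questions, contains_python, limit):
    # Inner join on "Id": keep only the questions present in both tables.
    joined = [
        {**question, "ContainsPython": contains_python[question["Id"]]}
        for question in questions
        if question["Id"] in contains_python
    ]
    # Filter on the boolean column.
    matching = [row for row in joined if row["ContainsPython"]]
    # Sort by score, highest first, and keep at most `limit` rows.
    matching.sort(key=lambda row: row["Score"], reverse=True)
    return matching[:limit]


top = top_questions(questions, contains_python, limit=1_000)
print([row["Id"] for row in top])
```

&lt;p&gt;Question 4 has the highest score but is dropped by the inner join, since it has no entry in the tag table.&lt;/p&gt;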



&lt;p&gt;And the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fri6s01ra6k9kknmclemn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fri6s01ra6k9kknmclemn.png" alt="Joining two dataframes in Polars" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Joining two dataframes in Polars&lt;/p&gt;

&lt;p&gt;Very neat!&lt;/p&gt;

&lt;h3&gt;
  
  
  Some differences with pandas
&lt;/h3&gt;

&lt;p&gt;Similarly to what happens with Vaex, Polars DataFrames don't have an index. The user guide goes as far as saying &lt;a href="https://pola-rs.github.io/polars-book/user-guide/coming_from_pandas.html#polars-does-not-have-an-index" rel="noopener noreferrer"&gt;this&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Indexes are not needed! Not having them makes things easier - convince us otherwise!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The discussion of this contentious stance will be the subject of a future blog post. In any case, this allows Polars to &lt;a href="https://pola-rs.github.io/polars-book/user-guide/indexing.html" rel="noopener noreferrer"&gt;simplify indexing operations&lt;/a&gt;, since strings will always refer to column names, and numbers in the first axis will always refer to row numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# First row  
&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="err"&gt;┌─────┬─────────────┬───────────────────┬────────────┬───────┬──────────────────┬──────────────────┐&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Id&lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;OwnerUserId&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;CreationDate&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;ClosedDate&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;Score&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;Title&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;Body&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; │&lt;/span&gt;  
&lt;span class="err"&gt;╞═════╪═════════════╪═══════════════════╪════════════╪═══════╪══════════════════╪══════════════════╡&lt;/span&gt;  
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="mi"&gt;2008&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="n"&gt;T13&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;NA&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; ┆&lt;/span&gt; &lt;span class="n"&gt;SQLStatement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exe&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ve written  │  
│     ┆             ┆ 07Z               ┆            ┆       ┆ cute() -         ┆ a database       │  
│     ┆             ┆                   ┆            ┆       ┆ multipl...       ┆ gener...         │  
└─────┴─────────────┴───────────────────┴────────────┴───────┴──────────────────┴──────────────────┘  

In [37]: df[0, 0]  # First row, first column  
Out[37]: 80  

In [38]: df[0, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]  # First row, column by name  
Out[38]: 80  

In [39]: df[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;].head(5)  # Column by name  
Out[39]: shape: (5,)  
Series: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; [i64]  
[  
80  
90  
120  
180  
260  
]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the other hand, even though indexing with boolean masks is supported in Polars as a way to bridge the gap with pandas users, its use is discouraged in favor of &lt;em&gt;select&lt;/em&gt; and &lt;em&gt;filter&lt;/em&gt;, and &lt;a href="https://pola-rs.github.io/polars-book/user-guide/indexing.html#anti-pattern" rel="noopener noreferrer"&gt;"the functionality may be removed in the future"&lt;/a&gt;. However, as you can see in the examples above, direct indexing is not needed as often as in pandas.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should you use Polars?
&lt;/h3&gt;

&lt;p&gt;Beyond this short introduction, Polars has much more to offer, from &lt;a href="https://pola-rs.github.io/polars-book/user-guide/dsl/window_functions.html" rel="noopener noreferrer"&gt;window functions&lt;/a&gt; and &lt;a href="https://pola-rs.github.io/polars-book/user-guide/dsl/groupby.html" rel="noopener noreferrer"&gt;complex aggregations&lt;/a&gt; to &lt;a href="https://pola-rs.github.io/polars-book/user-guide/timeseries/intro.html" rel="noopener noreferrer"&gt;time-series processing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a downside, since it is a young project and it's evolving quite fast, you will notice that some areas of the documentation are a bit lacking, or that there are &lt;a href="https://github.com/pola-rs/polars/issues/1423" rel="noopener noreferrer"&gt;no comprehensive release notes yet&lt;/a&gt;. Fortunately, Ritchie Vink, the Polars creator and current maintainer, quickly answers Stack Overflow questions and GitHub issues, and releases with bug fixes and new features are frequent.&lt;/p&gt;

&lt;p&gt;On the other hand, if you are looking for an ultimate solution for your larger-than-RAM datasets, Polars might not be for you. Its lazy processing capabilities can take you quite far, but at some point you will have to confront the fact that Polars is an in-memory dataframe library, similar to pandas.&lt;/p&gt;

&lt;p&gt;In summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Use Polars if you are willing to learn a different but powerful new API, if your data fits in memory, if your workflow involves lots of list-column manipulation, and in general if you want to explore a much faster alternative to pandas.&lt;/li&gt;
&lt;li&gt;  Don't use Polars if your data is much larger than RAM, if you are looking for solutions to quickly migrate a large pandas codebase, or if you are looking for an old, battle-tested library.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>polars</category>
      <category>dataframe</category>
    </item>
    <item>
      <title>A Polars exploration into Kedro</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Wed, 17 May 2023 14:50:58 +0000</pubDate>
      <link>https://dev.to/kedro/a-polars-exploration-into-kedro-3cab</link>
      <guid>https://dev.to/kedro/a-polars-exploration-into-kedro-3cab</guid>
      <description>&lt;p&gt;One year ago I travelled to Lithuania for the first time to present at PyCon/PyData Lithuania, and I had a great time there. The topic of my talk was an evaluation of some alternative dataframe libraries, including Polars, the one that I ended up enjoying the most. &lt;/p&gt;

&lt;p&gt;I enjoyed it so much that this week I’m in Vilnius again, and I’ll be delivering a workshop at PyCon Lithuania 2023 called &lt;a href="https://pycon.lt/2023/activities/talks/KAJGPU" rel="noopener noreferrer"&gt;“Analyze your data at the speed of light with Polars and Kedro”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this blog post you will learn how using &lt;a href="https://www.pola.rs/" rel="noopener noreferrer"&gt;Polars&lt;/a&gt; in Kedro can make your data pipelines much faster, what’s the current status of Polars in Kedro, and what can be expected in the near future. In case it’s the first time you’ve heard about Polars, I have included a short introduction at the beginning.&lt;/p&gt;

&lt;p&gt;Let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Polars library?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.pola.rs/" rel="noopener noreferrer"&gt;Polars&lt;/a&gt; is an open-source library for Python, Rust, and NodeJS that provides in-memory dataframes, out-of-core processing capabilities, and more. It is based on the Rust implementation of the &lt;a href="https://arrow.apache.org/" rel="noopener noreferrer"&gt;Apache Arrow&lt;/a&gt; columnar data format (you can read more about Arrow on my earlier blog post &lt;a href="https://dev.to/astrojuanlu/demystifying-apache-arrow-5b0a/"&gt;“Demystifying Apache Arrow”&lt;/a&gt;), and it is optimised to be blazing fast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9e8a9ozp51rfaqed0puj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9e8a9ozp51rfaqed0puj.png" alt="Snippet of Polars code" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interesting thing about Polars is that it does not try to be a drop-in replacement to pandas, like &lt;a href="https://www.dask.org/" rel="noopener noreferrer"&gt;Dask&lt;/a&gt;, &lt;a href="https://rapids.ai/" rel="noopener noreferrer"&gt;cuDF&lt;/a&gt;, or &lt;a href="https://modin.readthedocs.io/" rel="noopener noreferrer"&gt;Modin&lt;/a&gt;, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.&lt;/p&gt;

&lt;p&gt;I started experimenting with Polars one year ago, and it has now become my go-to data manipulation library. I have given several talks about it, for example &lt;a href="https://youtu.be/LGAHTp4DYZY" rel="noopener noreferrer"&gt;at PyData NYC&lt;/a&gt;, where the room was full.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do Polars and Kedro get used together?
&lt;/h2&gt;

&lt;p&gt;If you want to learn more about Kedro, you can watch a video introduction on &lt;a href="https://www.youtube.com/@kedro-python" rel="noopener noreferrer"&gt;our YouTube channel&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/qClSGY6B0r0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Traditionally Kedro has favoured &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;pandas&lt;/a&gt; as a dataframe library because of its ubiquity and popularity. This means that, for example, to read a CSV file, you would add a corresponding entry to &lt;a href="https://docs.kedro.org/en/stable/data/data_catalog.html" rel="noopener noreferrer"&gt;the catalog&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pandas.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_Product_Categories.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then, you would use that dataset as input for &lt;a href="https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html" rel="noopener noreferrer"&gt;your node functions&lt;/a&gt;, which would, in turn, receive pandas &lt;code&gt;DataFrame&lt;/code&gt; objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;join_events_categories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(This is just one of the formats supported by Kedro datasets of course! You can also load Parquet, GeoJSON, images… have a look at &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.html" rel="noopener noreferrer"&gt;the &lt;code&gt;kedro-datasets&lt;/code&gt; reference&lt;/a&gt; for a list of datasets maintained by the core team, or &lt;a href="https://github.com/topics/kedro-plugin" rel="noopener noreferrer"&gt;the &lt;code&gt;#kedro-plugin&lt;/code&gt; topic on GitHub&lt;/a&gt; for some contributed by the community!)&lt;/p&gt;

&lt;p&gt;The idea of this blog post is to teach you how you can use Polars instead of pandas for your catalog entries, which in turn allows you to write all your data transformation pipelines using Polars dataframes. For that, I crafted some examples that use &lt;a href="https://openrepair.org/open-data/downloads/" rel="noopener noreferrer"&gt;the Open Repair Alliance dataset&lt;/a&gt;, containing more than 80 000 records of repair events across Europe.&lt;/p&gt;

&lt;p&gt;And if you’re ready to start, let’s go!&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started with Polars for Kedro
&lt;/h2&gt;

&lt;p&gt;First of all, you will need to add &lt;code&gt;kedro-datasets[polars.CSVDataSet]&lt;/code&gt; to your requirements. At the time of writing (May 2023), the code below requires development versions of both &lt;code&gt;kedro&lt;/code&gt; and &lt;code&gt;kedro-datasets&lt;/code&gt;, which you can declare on your &lt;code&gt;requirements.txt&lt;/code&gt; or &lt;code&gt;pyproject.toml&lt;/code&gt; as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# requirements.txt

kedro @ git+https://github.com/kedro-org/kedro@3ea7231
kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# pyproject.toml&lt;/span&gt;

&lt;span class="nn"&gt;[project]&lt;/span&gt;
&lt;span class="py"&gt;dependencies&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"kedro @ git+https://github.com/kedro-org/kedro@3ea7231"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are using the legacy &lt;code&gt;setup.py&lt;/code&gt; files, the syntax is very similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;requires&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kedro @ git+https://github.com/kedro-org/kedro@3ea7231&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you install these dependencies, you can start using the &lt;code&gt;polars.CSVDataSet&lt;/code&gt; by using the appropriate &lt;code&gt;type&lt;/code&gt; in your catalog entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;polars.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_Product_Categories.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and that’s it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading real world CSV files with &lt;code&gt;polars.CSVDataSet&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;It turns out that reading CSV files is not always that easy. The good news is that you can use the &lt;code&gt;load_args&lt;/code&gt; parameter of the catalog entry to pass extra options to the &lt;code&gt;polars.CSVDataSet&lt;/code&gt;, which mirror the function arguments of &lt;code&gt;polars.read_csv&lt;/code&gt;. For example, if you want to attempt parsing the date columns in the CSV, you can set the &lt;code&gt;try_parse_dates&lt;/code&gt; option to &lt;code&gt;true&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;polars.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_Product_Categories.csv&lt;/span&gt;
  &lt;span class="na"&gt;load_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Doesn't make much sense in this case,&lt;/span&gt;
    &lt;span class="c1"&gt;# but serves for demonstration purposes&lt;/span&gt;
    &lt;span class="na"&gt;try_parse_dates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
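
&lt;p&gt;The reason the option names mirror the reader's parameters is that the dictionary is forwarded as keyword arguments. The following is a rough sketch of that idea only: &lt;code&gt;SketchCSVDataSet&lt;/code&gt; is a hypothetical class, not Kedro's implementation, and the standard library &lt;code&gt;csv.DictReader&lt;/code&gt; stands in for &lt;code&gt;polars.read_csv&lt;/code&gt; so that the example runs without extra dependencies:&lt;/p&gt;

```python
import csv
import io


class SketchCSVDataSet:
    """Hypothetical stand-in for a catalog dataset; not Kedro's implementation."""

    def __init__(self, source, load_args=None):
        self._source = source
        # Options from the catalog entry are kept as a plain dictionary.
        self._load_args = dict(load_args or {})

    def load(self):
        # The dictionary is unpacked into keyword arguments of the reader,
        # which is why the option names mirror the reader's parameters.
        reader = csv.DictReader(io.StringIO(self._source), **self._load_args)
        return list(reader)


# Invented semicolon-separated data, read with a non-default delimiter.
raw = "id;category\n1;Laptop\n2;Toaster\n"
dataset = SketchCSVDataSet(raw, load_args={"delimiter": ";"})
rows = dataset.load()
print(rows)
```

&lt;p&gt;In the catalog, the same idea means that any key under &lt;code&gt;load_args&lt;/code&gt; has to be a valid argument of the underlying reader function.&lt;/p&gt;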



&lt;p&gt;Some of these parameters are required to be Python objects: for example, &lt;code&gt;polars.read_csv&lt;/code&gt; takes an optional &lt;code&gt;dtypes&lt;/code&gt; parameter that can be used to specify the dtypes of the columns, as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtypes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;group_identifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Utf8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



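&lt;p&gt;However, Kedro catalog files are written in YAML, so they only support primitive types such as strings, numbers, and booleans, and a Python object like &lt;code&gt;pl.Float64&lt;/code&gt; cannot be written there directly. But fear not! You can use more sophisticated configuration loaders in Kedro that allow you to tweak how such files are parsed and loaded.&lt;/p&gt;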

&lt;p&gt;To pass the appropriate &lt;code&gt;dtypes&lt;/code&gt; to read this CSV file, you can use the &lt;code&gt;TemplatedConfigLoader&lt;/code&gt;, or alternatively &lt;a href="https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader" rel="noopener noreferrer"&gt;the shiny new &lt;code&gt;OmegaConfigLoader&lt;/code&gt;&lt;/a&gt; with a custom &lt;code&gt;omegaconf&lt;/code&gt; resolver. Such a resolver takes care of parsing the strings in the YAML catalog and transforming them into the objects Polars needs. Place this code in your &lt;code&gt;settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;omegaconf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OmegaConf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OmegaConfigLoader&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;OmegaConf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has_resolver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;OmegaConf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_new_resolver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;CONFIG_LOADER_CLASS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OmegaConfigLoader&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now you can use the special OmegaConf syntax in the catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-events-raw&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;polars.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv&lt;/span&gt;
  &lt;span class="na"&gt;load_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dtypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Notice the OmegaConf resolver syntax!&lt;/span&gt;
      &lt;span class="na"&gt;product_age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${polars:Float64}&lt;/span&gt;
      &lt;span class="na"&gt;group_identifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${polars:Utf8}&lt;/span&gt;
    &lt;span class="na"&gt;try_parse_dates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can access Polars data types with ease from the catalog!&lt;/p&gt;
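
&lt;p&gt;For intuition about what happens to a string such as &lt;code&gt;${polars:Float64}&lt;/code&gt;, here is a simplified, dependency-free emulation of the resolver mechanism. It is only a sketch: the &lt;code&gt;RESOLVERS&lt;/code&gt; registry, the &lt;code&gt;resolve&lt;/code&gt; helper, and the &lt;code&gt;SimpleNamespace&lt;/code&gt; standing in for the &lt;code&gt;polars&lt;/code&gt; module are hypothetical names made up for illustration, and the real OmegaConf interpolation grammar is considerably richer:&lt;/p&gt;

```python
import re
from types import SimpleNamespace

# Stand-in for the polars module so the sketch runs without dependencies.
# In a real project this would be `import polars as pl`.
pl = SimpleNamespace(Float64="Float64 dtype", Utf8="Utf8 dtype")

# Resolver name mapped to a function, mimicking the idea behind
# OmegaConf.register_new_resolver("polars", lambda attr: getattr(pl, attr))
RESOLVERS = {"polars": lambda attr: getattr(pl, attr)}

# Matches strings that consist of exactly one "${name:argument}" expression
PATTERN = re.compile(r"^\$\{(\w+):(\w+)\}$")


def resolve(value):
    """Recursively replace '${name:arg}' strings with RESOLVERS[name](arg)."""
    if isinstance(value, dict):
        return {key: resolve(item) for key, item in value.items()}
    if isinstance(value, str):
        match = PATTERN.match(value)
        if match:
            name, arg = match.groups()
            return RESOLVERS[name](arg)
    # Anything else (booleans, numbers, ordinary strings) is left untouched
    return value


load_args = {
    "dtypes": {
        "product_age": "${polars:Float64}",
        "group_identifier": "${polars:Utf8}",
    },
    "try_parse_dates": True,
}
resolved = resolve(load_args)
```

&lt;p&gt;After this step, the values under &lt;code&gt;dtypes&lt;/code&gt; are no longer strings but whatever &lt;code&gt;getattr(pl, attr)&lt;/code&gt; returned, which is the behaviour the &lt;code&gt;settings.py&lt;/code&gt; snippet above asks OmegaConf for.&lt;/p&gt;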

&lt;h2&gt;
  
  
  Future plans for Polars integration in Kedro
&lt;/h2&gt;

&lt;p&gt;This all looks very promising, but it’s only the tip of the iceberg. First of all, these changes need to land in stable versions of &lt;code&gt;kedro&lt;/code&gt; and &lt;code&gt;kedro-datasets&lt;/code&gt;. More importantly, we are working on &lt;a href="https://github.com/kedro-org/kedro-plugins/pull/170" rel="noopener noreferrer"&gt;a generic Polars dataset&lt;/a&gt; that will be able to read other file formats, for example Parquet, which is faster, more compact, and easier to use.&lt;/p&gt;

&lt;p&gt;Polars makes me so excited about the future of data manipulation in Python, and I hope that all Kedro users are able to leverage this amazing project in their data pipelines very soon!&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>polars</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
