<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: LETSQL</title>
    <description>The latest articles on DEV Community by LETSQL (@letsql).</description>
    <link>https://dev.to/letsql</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F9199%2F7efafa02-8461-4b41-b0d1-2d9fcac23a56.png</url>
      <title>DEV Community: LETSQL</title>
      <link>https://dev.to/letsql</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/letsql"/>
    <language>en</language>
    <item>
      <title>Embrace the Power of Nix for Your Python + Rust Workflow</title>
      <dc:creator>Hussain Sultan</dc:creator>
      <pubDate>Wed, 24 Jul 2024 18:01:43 +0000</pubDate>
      <link>https://dev.to/letsql/embrace-the-power-of-nix-for-your-python-rust-workflow-2ho9</link>
      <guid>https://dev.to/letsql/embrace-the-power-of-nix-for-your-python-rust-workflow-2ho9</guid>
      <description>&lt;p&gt;Hello Dev.to community!&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://www.letsql.com/" rel="noopener noreferrer"&gt;LETSQL&lt;/a&gt;, we're passionate about delivering the best tools for data scientists and ML engineers. As part of our journey, we leverage Nix to streamline our development workflows. If you're working with Python and Rust, or simply curious about why Nix is a game-changer, read on!&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Nix?
&lt;/h3&gt;

&lt;p&gt;Nix is a powerful package manager that offers reproducibility, reliability, and isolation. Here’s why we think it’s great:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reproducible Builds&lt;/strong&gt;: Nix ensures that your development environment is consistent across different machines. With Nix, you can avoid the infamous "works on my machine" problem. Every developer and CI pipeline gets the same environment, leading to fewer surprises and smoother collaborations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Declarative Configuration&lt;/strong&gt;: Nix uses a declarative approach to specify dependencies. This means you can define your environment in a single configuration file, making it easy to version control and share. Whether you're working on a small script or a large project, managing dependencies becomes straightforward.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Nix provides isolated environments, preventing conflicts between dependencies. This is particularly useful when working on multiple projects with different requirements. No more worrying about version clashes or dependency hell!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
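
&lt;p&gt;As an illustration, a minimal &lt;code&gt;flake.nix&lt;/code&gt; for a declarative development shell might look like this (a hedged sketch, not our actual configuration; the package list and system are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let
      pkgs = nixpkgs.legacyPackages.x86_64-linux;
    in {
      # every developer who enters this shell gets exactly this toolchain
      devShells.x86_64-linux.default = pkgs.mkShell {
        packages = [ pkgs.python3 pkgs.rustc pkgs.cargo ];
      };
    };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the whole environment lives in this one file, it can be version-controlled and shared like any other source file.&lt;/p&gt;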

&lt;h3&gt;
  
  
  Nix for Python + Rust Workflow
&lt;/h3&gt;

&lt;p&gt;Combining Python and Rust in a single project can be challenging, but Nix simplifies this process. Here’s how:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single Source of Truth&lt;/strong&gt;: With Nix, you can manage both Python and Rust dependencies in a unified way. Define your dependencies in a &lt;code&gt;flake.nix&lt;/code&gt; file and let Nix handle the rest. This ensures that your Python packages and Rust crates are compatible and work seamlessly together.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Easy Setup&lt;/strong&gt;: Setting up a development environment with Nix is as simple as running a single command. Once your configuration is in place, you can quickly spin up environments with all the necessary dependencies, saving valuable setup time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistent Development and Deployment&lt;/strong&gt;: By using Nix, you ensure that your development, testing, and production environments are identical. This reduces the risk of bugs caused by environmental discrepancies and makes deployments more predictable and reliable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
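
&lt;p&gt;Concretely, once a &lt;code&gt;flake.nix&lt;/code&gt; is in place and flakes are enabled in your Nix configuration, setup really is a single command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# drop into a shell with every declared dependency available
nix develop

# or build the project's default package
nix build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;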

&lt;h3&gt;
  
  
  Crane Lib: Incremental Artifact Caching for Rust Projects
&lt;/h3&gt;

&lt;p&gt;We use &lt;a href="https://github.com/ipetkov/crane" rel="noopener noreferrer"&gt;Crane&lt;/a&gt;, a Nix library for building Cargo projects. One of Crane’s standout features is incremental artifact caching, which ensures you never build the same artifact twice. This greatly speeds up the build process and reduces redundant work.&lt;/p&gt;

&lt;p&gt;Here's how Crane fits into our &lt;code&gt;flake.nix&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Crane Library Integration&lt;/strong&gt;: We use Crane to manage our Rust builds. It allows for fine-grained control over the build process and ensures efficient use of cached artifacts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cargo Configuration&lt;/strong&gt;: The &lt;code&gt;Cargo.toml&lt;/code&gt; and &lt;code&gt;Cargo.lock&lt;/code&gt; files define our Rust dependencies. Crane leverages these files to manage the build process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build Command&lt;/strong&gt;: We use a custom build command for Maturin, integrated into our Nix build script to create Python wheels from our Rust code. This command ensures that all necessary artifacts are built and cached.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
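
&lt;p&gt;To sketch the idea (an abbreviated, hypothetical excerpt rather than our exact configuration), Crane lets you build dependency artifacts once and reuse them across crate rebuilds:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;craneLib = crane.mkLib pkgs;
src = craneLib.cleanCargoSource ./.;

# build third-party dependencies once; Nix caches the result
cargoArtifacts = craneLib.buildDepsOnly { inherit src; };

# rebuilds of the crate itself reuse the cached dependency artifacts
myCrate = craneLib.buildPackage { inherit src cargoArtifacts; };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;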

&lt;h3&gt;
  
  
  Maturin Build: Bridging Rust and Python
&lt;/h3&gt;

&lt;p&gt;Maturin is a fantastic tool for building and publishing Rust crates as Python packages. With Nix, integrating Maturin into our workflow is straightforward. Here's a snippet from our &lt;code&gt;flake.nix&lt;/code&gt; demonstrating the Maturin build setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;buildPhaseCargoCommand&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;''&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;pkgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;maturin&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/bin/maturin build \&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    --offline \&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    --target-dir target \&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    --manylinux off \&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    --strip \&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    --release&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command ensures that our Rust crate is built and packaged as a Python wheel, ready for distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Poetry2nix: Seamless Python Dependency Management
&lt;/h2&gt;

&lt;p&gt;We also use Poetry2nix to manage our Python dependencies. Poetry2nix translates &lt;code&gt;pyproject.toml&lt;/code&gt; and &lt;code&gt;poetry.lock&lt;/code&gt; files into Nix expressions, allowing us to maintain our Python dependencies declaratively.&lt;/p&gt;

&lt;p&gt;Here’s how Poetry2nix is configured in our &lt;code&gt;flake.nix&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;poetry2nix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"github:nix-community/poetry2nix"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nv"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;nixpkgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;follows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"nixpkgs"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="nv"&gt;commonPoetryArgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;projectDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sx"&gt;./.&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nv"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;pySrc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nv"&gt;preferWheels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nv"&gt;python&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nv"&gt;groups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"dev"&lt;/span&gt; &lt;span class="s2"&gt;"test"&lt;/span&gt; &lt;span class="s2"&gt;"docs"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nv"&gt;myapp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;mkPoetryApplication&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;commonPoetryArgs&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;buildInputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;pkgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;optionals&lt;/span&gt; &lt;span class="nv"&gt;pkgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;stdenv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;isDarwin&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nv"&gt;pkgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;libiconv&lt;/span&gt;
  &lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;overridePythonAttrs&lt;/span&gt; &lt;span class="nv"&gt;maturinOverride&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Explore Our Configuration
&lt;/h2&gt;

&lt;p&gt;For a full example of how we set up our development environment using Nix, check out our &lt;a href="https://github.com/letsql/letsql/blob/main/flake.nix" rel="noopener noreferrer"&gt;flake.nix file&lt;/a&gt;. This configuration includes everything from Crane for Rust builds to Poetry2nix for Python dependencies, ensuring a seamless and reproducible development workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Join Us on GitHub!
&lt;/h2&gt;

&lt;p&gt;We're always looking for contributors and feedback. If you find Nix as exciting as we do, check out our &lt;a href="https://github.com/letsql/letsql" rel="noopener noreferrer"&gt;LETSQL GitHub repository&lt;/a&gt; and give us a star! 🌟 Your support helps us continue building great tools for the community.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Connect!
&lt;/h2&gt;

&lt;p&gt;Have questions or want to share your experience with Nix? Drop a comment below, or reach out to us on Twitter. We love hearing from you!&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
      <category>nix</category>
      <category>python</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to build a new Harlequin adapter with Poetry</title>
      <dc:creator>Hussain Sultan</dc:creator>
      <pubDate>Wed, 17 Jul 2024 18:59:00 +0000</pubDate>
      <link>https://dev.to/letsql/how-to-build-a-new-harlequin-adapter-with-poetry-2ld5</link>
      <guid>https://dev.to/letsql/how-to-build-a-new-harlequin-adapter-with-poetry-2ld5</guid>
      <description>&lt;p&gt;Welcome to the first post in LETSQL's tutorial series!&lt;/p&gt;

&lt;p&gt;In this blog post, we take a detour from our usual theme of data pipelines to demonstrate how to create and publish a Python package with Poetry, using a Harlequin adapter for DataFusion as the example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://harlequin.sh/" rel="noopener noreferrer"&gt;Harlequin&lt;/a&gt; is a TUI client for SQL databases known for its light-weight extensive support for SQL databases. It is a versatile tool for data exploration and analysis workflows. Harlequin provides an interactive SQL editor with features like autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets. However, Harlequin did not have a &lt;a href="https://datafusion.apache.org/" rel="noopener noreferrer"&gt;DataFusion&lt;/a&gt; adapter before. Thankfully, it was really easy to add one. &lt;/p&gt;

&lt;p&gt;In this post, we'll demonstrate these concepts by building a Harlequin adapter for DataFusion. Along the way, we'll cover Poetry's essential features, project setup, and the steps to publish your package on PyPI.&lt;/p&gt;

&lt;p&gt;To get the most out of this guide, you should have a basic understanding of &lt;a href="https://docs.python.org/3/library/venv.html" rel="noopener noreferrer"&gt;virtual environments&lt;/a&gt;, Python &lt;a href="https://docs.python.org/3/tutorial/modules.html" rel="noopener noreferrer"&gt;packages and modules&lt;/a&gt;, and &lt;a href="https://pip.pypa.io/en/stable/" rel="noopener noreferrer"&gt;pip&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our objectives are to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduce Poetry and its advantages&lt;/li&gt;
&lt;li&gt;Set up a project using Poetry&lt;/li&gt;
&lt;li&gt;Develop a Harlequin adapter for DataFusion&lt;/li&gt;
&lt;li&gt;Prepare and publish the package to PyPI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end, you'll have practical experience with Poetry and an understanding of modern Python package management.&lt;/p&gt;

&lt;p&gt;The code implemented in this post is available on &lt;a href="https://github.com/mesejo/datafusion-adapter" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and on &lt;a href="https://pypi.org/project/harlequin-datafusion/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Harlequin
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://harlequin.sh" rel="noopener noreferrer"&gt;Harlequin&lt;/a&gt; is a SQL IDE that runs in the terminal. It provides a powerful and feature-rich alternative to traditional command-line database tools, making it versatile for data exploration and analysis workflows.&lt;/p&gt;

&lt;p&gt;Some key things to know about Harlequin:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Harlequin supports multiple &lt;a href="https://harlequin.sh/docs/adapters" rel="noopener noreferrer"&gt;database adapters&lt;/a&gt;, connecting you to DuckDB, SQLite, PostgreSQL, MySQL, and more.&lt;/li&gt;
&lt;li&gt;Harlequin provides an interactive SQL editor with features like autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets.&lt;/li&gt;
&lt;li&gt;Harlequin replaces traditional terminal-based database tools with a more powerful and user-friendly interface.&lt;/li&gt;
&lt;li&gt;Harlequin uses adapter plug-ins as a generic interface to any database.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  DataFusion
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.&lt;/p&gt;

&lt;p&gt;DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It ships with its own CLI; more information can be found &lt;a href="https://datafusion.apache.org/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Poetry
&lt;/h2&gt;

&lt;p&gt;Poetry is a modern, feature-rich tool that streamlines dependency management and packaging for Python projects, making development more deterministic and efficient.&lt;br&gt;
From the &lt;a href="https://python-poetry.org/docs/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on, and it will manage (install/update) them for you.&lt;br&gt;
Poetry offers a lockfile to ensure repeatable installs and can build your project for distribution.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Creating New Adapters for Harlequin
&lt;/h2&gt;

&lt;p&gt;A Harlequin adapter is a Python package that allows Harlequin to work with a database system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An adapter is a Python package that declares an entry point in the harlequin.adapters group. That entry point should reference a subclass of the &lt;code&gt;HarlequinAdapter&lt;/code&gt; abstract base class.&lt;br&gt;
This allows Harlequin to discover installed adapters and instantiate a selected adapter at run-time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In addition to the &lt;code&gt;HarlequinAdapter&lt;/code&gt; class, the package must also provide implementations of &lt;code&gt;HarlequinConnection&lt;/code&gt; and &lt;code&gt;HarlequinCursor&lt;/code&gt;. A more detailed description can be found in this&lt;br&gt;
&lt;a href="https://harlequin.sh/docs/contributing/adapter-guide" rel="noopener noreferrer"&gt;guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Harlequin Adapter Template
&lt;/h3&gt;

&lt;p&gt;The first step in developing a Harlequin adapter is to generate a new repo from the existing &lt;a href="https://github.com/tconbeer/harlequin-adapter-template" rel="noopener noreferrer"&gt;&lt;code&gt;harlequin-adapter-template&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="(https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template)"&gt;GitHub templates&lt;/a&gt; are repositories that serve as starting points for new projects. They provide pre-configured files, structures, and settings that are copied to new repositories, allowing for quick project setup without the overhead of forking.&lt;br&gt;
This feature streamlines the process of creating consistent, well-structured projects based on established patterns.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;harlequin-adapter-template&lt;/code&gt; comes with a &lt;code&gt;poetry.lock&lt;/code&gt; file and a &lt;code&gt;pyproject.toml&lt;/code&gt; file, in addition to some boilerplate code for defining the required classes. &lt;/p&gt;
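
&lt;p&gt;One way to generate your repository from the template is with the GitHub CLI (the repository name below is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gh repo create harlequin-datafusion \
  --template tconbeer/harlequin-adapter-template \
  --public --clone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
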
&lt;h2&gt;
  
  
  Coding the Adapter
&lt;/h2&gt;

&lt;p&gt;Let's explore the essential files needed for package distribution before we get into the specifics of coding. &lt;/p&gt;
&lt;h3&gt;
  
  
  Package configuration
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;pyproject.toml&lt;/code&gt; file is now the standard for configuring Python packaging and related tooling. Introduced in &lt;a href="https://peps.python.org/pep-0518/" rel="noopener noreferrer"&gt;PEP 518&lt;/a&gt; and extended by &lt;a href="https://peps.python.org/pep-0621/" rel="noopener noreferrer"&gt;PEP 621&lt;/a&gt;, this &lt;a href="https://toml.io/" rel="noopener noreferrer"&gt;TOML&lt;/a&gt;-formatted file consolidates multiple configuration files into one, making dependency management more robust and standardized.&lt;/p&gt;

&lt;p&gt;Poetry utilizes &lt;code&gt;pyproject.toml&lt;/code&gt; to handle the project's virtual environment, resolve dependencies, and create packages.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pyproject.toml&lt;/code&gt; of the template is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.poetry]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"harlequin-myadapter"&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.1.0"&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"A Harlequin adapter for &amp;lt;my favorite database&amp;gt;."&lt;/span&gt;
&lt;span class="py"&gt;authors&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Ted Conbeer &amp;lt;tconbeer@users.noreply.github.com&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;license&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"MIT"&lt;/span&gt;
&lt;span class="py"&gt;readme&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"README.md"&lt;/span&gt;
&lt;span class="py"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;include&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"harlequin_myadapter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;from&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"src"&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nn"&gt;[tool.poetry.plugins."harlequin.adapter"]&lt;/span&gt;
&lt;span class="py"&gt;my-adapter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"harlequin_myadapter:MyAdapter"&lt;/span&gt;

&lt;span class="nn"&gt;[tool.poetry.dependencies]&lt;/span&gt;
&lt;span class="py"&gt;python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="py"&gt;"&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.8&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="py"&gt;harlequin&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^1.7"&lt;/span&gt;

&lt;span class="nn"&gt;[tool.poetry.group.dev.dependencies]&lt;/span&gt;
&lt;span class="py"&gt;ruff&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^0.1.6"&lt;/span&gt;
&lt;span class="py"&gt;pytest&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^7.4.3"&lt;/span&gt;
&lt;span class="py"&gt;mypy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^1.7.0"&lt;/span&gt;
&lt;span class="py"&gt;pre-commit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^3.5.0"&lt;/span&gt;
&lt;span class="py"&gt;importlib_metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="py"&gt;"&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;4.6&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;", python = "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;3.10&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;" }&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;build-system&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"poetry-core"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;build-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"poetry.core.masonry.api"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As can be seen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;[tool.poetry]&lt;/code&gt; section of the &lt;code&gt;pyproject.toml&lt;/code&gt; file is where you define the metadata for your Python package, such as the name, version, description, authors, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;[tool.poetry.dependencies]&lt;/code&gt; subsection is where you declare the runtime dependencies your project requires. Running &lt;code&gt;poetry add &amp;lt;package&amp;gt;&lt;/code&gt; will automatically update this section.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;[tool.poetry.group.dev.dependencies]&lt;/code&gt; subsection is where you declare development-only dependencies, like testing frameworks, linters, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;[build-system]&lt;/code&gt; section stores build-related data. In this case, it specifies the &lt;code&gt;build-backend&lt;/code&gt; as &lt;code&gt;"poetry.core.masonry.api"&lt;/code&gt;. In a narrow sense, the core responsibility of a build backend is to build wheels and sdists.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repository also includes a &lt;code&gt;poetry.lock&lt;/code&gt; file, a Poetry-specific component generated by running &lt;code&gt;poetry install&lt;/code&gt; or &lt;code&gt;poetry update&lt;/code&gt;. This lock file specifies the exact versions of all dependencies and sub-dependencies for your project, ensuring reproducible installations across different environments.&lt;/p&gt;

&lt;p&gt;It's crucial to avoid manual edits to the &lt;code&gt;poetry.lock&lt;/code&gt; file, as this can cause inconsistencies and installation issues. Instead, make changes to your &lt;code&gt;pyproject.toml&lt;/code&gt; file and allow Poetry to automatically update the lock file by running &lt;code&gt;poetry lock&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Poetry
&lt;/h3&gt;

&lt;p&gt;Per Poetry's &lt;a href="https://python-poetry.org/docs/#installation" rel="noopener noreferrer"&gt;installation warning&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Poetry should always be installed in a dedicated virtual environment to isolate it from the rest of your system. It should in no case be installed in the environment of the project that is to be managed by Poetry.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here we will presume you have installed Poetry by running &lt;code&gt;pipx install poetry&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Developing in the virtual environment
&lt;/h3&gt;

&lt;p&gt;With our file structure clarified, let's begin the development process by setting up our environment. Since our project already includes &lt;code&gt;pyproject.toml&lt;/code&gt; and &lt;code&gt;poetry.lock&lt;/code&gt; files, we can initiate our environment using the &lt;code&gt;poetry shell&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;This command activates the virtual environment linked to the current Poetry project, ensuring all subsequent operations occur within the project's dependency context. If no virtual environment exists, &lt;code&gt;poetry shell&lt;/code&gt; automatically creates and activates one.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;poetry shell&lt;/code&gt; detects your current shell and launches a new instance within the virtual environment. As Poetry centralizes virtual environments by default, this command eliminates the need to locate or recall the specific path to the activate script.&lt;/p&gt;

&lt;p&gt;To verify which Python environment is currently in use with Poetry, you can use the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry &lt;span class="nb"&gt;env &lt;/span&gt;list &lt;span class="nt"&gt;--full-path&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will show all the virtual environments associated with your project and indicate which one is currently active.&lt;br&gt;
As an alternative, you can get the full path of only the current environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry &lt;span class="nb"&gt;env &lt;/span&gt;info &lt;span class="nt"&gt;-p&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the environment activated, use &lt;code&gt;poetry install&lt;/code&gt; to install the required dependencies. The command works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If a &lt;code&gt;poetry.lock&lt;/code&gt; file is present, &lt;code&gt;poetry install&lt;/code&gt; will use the exact versions specified in that file rather than resolving the dependencies dynamically. This ensures consistent, repeatable installations across different environments.
i. If you run &lt;code&gt;poetry install&lt;/code&gt; and it doesn't seem to be progressing, you may need to run &lt;code&gt;export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring&lt;/code&gt; in the shell you're installing in.&lt;/li&gt;
&lt;li&gt;Otherwise, it reads the &lt;code&gt;pyproject.toml&lt;/code&gt; file in the current project, resolves the dependencies listed there, and installs them.&lt;/li&gt;
&lt;li&gt;If no &lt;code&gt;poetry.lock&lt;/code&gt; file exists, &lt;code&gt;poetry install&lt;/code&gt; will create one after resolving the dependencies.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To complete the environment setup, we need to add the &lt;code&gt;datafusion&lt;/code&gt; library to our dependencies. Execute the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry add datafusion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command updates your &lt;code&gt;pyproject.toml&lt;/code&gt; file with the &lt;code&gt;datafusion&lt;/code&gt; package and installs it. If you don't specify a version, Poetry will automatically select an appropriate one based on available package versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementing the Interfaces
&lt;/h3&gt;

&lt;p&gt;To create a Harlequin Adapter, you need to implement three interfaces defined as abstract classes in the &lt;code&gt;harlequin.adapter&lt;/code&gt; module.&lt;/p&gt;

&lt;p&gt;The first one is the &lt;code&gt;HarlequinAdapter&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#| eval: false
#| code-fold: false
#| code-summary: implementation of HarlequinAdapter
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataFusionAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HarlequinAdapter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conn_str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn_str&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFusionConnection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataFusionConnection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second one is the &lt;code&gt;HarlequinConnection&lt;/code&gt;, particularly the methods &lt;code&gt;execute&lt;/code&gt; and &lt;code&gt;get_catalog&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#| eval: false
#| code-fold: false
#| code-summary: implementation of execution of HarlequinConnection
&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;HarlequinCursor&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
     &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
         &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# type: ignore
&lt;/span&gt;         &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;logical_plan&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EmptyRelation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
             &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
     &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
         &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HarlequinQueryError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
             &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
             &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Harlequin encountered an error while executing your query.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
     &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
         &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
             &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;DataFusionCursor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
             &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For brevity, we've omitted the implementation of the &lt;code&gt;get_catalog&lt;/code&gt; function. You can find the full code in the &lt;a href="https://github.com/mesejo/datafusion-adapter/blob/f51ca365f13a83356d31de257ca3107ce3ce3544/src/harlequin_datafusion/adapter.py#L123"&gt;&lt;code&gt;adapter.py&lt;/code&gt;&lt;/a&gt; file within our GitHub repository.&lt;/p&gt;
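&lt;p&gt;For orientation, here is a heavily simplified sketch of the shape of a &lt;code&gt;get_catalog&lt;/code&gt; implementation: walk the catalog → schema → table → column hierarchy and nest the results. The real adapter builds Harlequin catalog items from DataFusion's session context; in the sketch below plain dictionaries stand in for both sides, and every name is illustrative rather than the adapter's actual API:&lt;/p&gt;

```python
# A runnable, simplified stand-in for get_catalog: the real adapter walks
# DataFusion's catalog -> schema -> table -> column hierarchy and returns
# Harlequin catalog items; here plain dicts model both sides, and all
# names below are illustrative.
def get_catalog(metadata: dict) -> list:
    """metadata: {catalog: {schema: {table: [(column, type), ...]}}}"""
    catalog_items = []
    for catalog, schemas in metadata.items():
        schema_items = []
        for schema, tables in schemas.items():
            table_items = []
            for table, columns in tables.items():
                column_items = [
                    {"label": name, "type_label": typ, "children": []}
                    for name, typ in columns
                ]
                table_items.append({"label": table, "children": column_items})
            schema_items.append({"label": schema, "children": table_items})
        catalog_items.append({"label": catalog, "children": schema_items})
    return catalog_items


catalog = get_catalog(
    {"datafusion": {"public": {"users": [("id", "#"), ("name", "s")]}}}
)
print(catalog[0]["children"][0]["children"][0]["label"])  # users
```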

&lt;p&gt;Finally, a &lt;code&gt;HarlequinCursor&lt;/code&gt; implementation must be provided as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#| eval: false
#| code-fold: false
#| code-summary: implementation of HarlequinCursor
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataFusionCursor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HarlequinCursor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFusionCursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AutoBackendType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_limit&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_arrow_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_limit&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_arrow_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HarlequinQueryError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Harlequin encountered an error while executing your query.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Making the plugin discoverable
&lt;/h3&gt;

&lt;p&gt;Your adapter must register an entry point in the &lt;code&gt;harlequin.adapter&lt;/code&gt; group using the build tool you use to package your project. &lt;br&gt;
If you use Poetry, you can define the entry point in your &lt;code&gt;pyproject.toml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.poetry.plugins."harlequin.adapter"]&lt;/span&gt;
&lt;span class="py"&gt;datafusion&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"harlequin_datafusion:DataFusionAdapter"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An &lt;a href="https://packaging.python.org/en/latest/specifications/entry-points/" rel="noopener noreferrer"&gt;entry point&lt;/a&gt; is a mechanism for code to advertise components it provides to be discovered and used by other code.&lt;/p&gt;
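&lt;p&gt;The mechanism is easy to see in isolation: an entry point is just a &lt;code&gt;name = "module:attribute"&lt;/code&gt; pair inside a named group, and loading it imports the module and returns the attribute. In the sketch below, the &lt;code&gt;demo&lt;/code&gt; group and &lt;code&gt;sqrt&lt;/code&gt; name are made up for illustration:&lt;/p&gt;

```python
import math
from importlib.metadata import EntryPoint

# Construct an entry point by hand, exactly as one parsed from package
# metadata would look: a name, a "module:attribute" value, and a group.
ep = EntryPoint(name="sqrt", value="math:sqrt", group="demo")

# load() imports the module and resolves the attribute.
loaded = ep.load()
assert loaded is math.sqrt
print(loaded(9.0))  # 3.0
```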

&lt;p&gt;Notice that registering a plugin with Poetry is equivalent to the following &lt;code&gt;pyproject.toml&lt;/code&gt; &lt;a href="https://packaging.python.org/en/latest/specifications/pyproject-toml/#entry-points" rel="noopener noreferrer"&gt;specification&lt;/a&gt; for entry points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[project.entry-points."harlequin.adapter"]&lt;/span&gt;
&lt;span class="py"&gt;datafusion&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"harlequin_datafusion:DataFusionAdapter"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;p&gt;The template provides a set of pre-configured tests, some of which are applicable to DataFusion while others may not be relevant. One particularly useful test checks that the plugin can be discovered, which is crucial for ensuring proper integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#| eval: false
#| code-fold: false
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version_info&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;importlib_metadata&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;entry_points&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;importlib.metadata&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;entry_points&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_plugin_discovery&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;PLUGIN_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datafusion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;entry_points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;harlequin.adapter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PLUGIN_NAME&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;adapter_cls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PLUGIN_NAME&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;issubclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adapter_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HarlequinAdapter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;adapter_cls&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;DataFusionAdapter&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make sure the tests are passing, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry run pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://python-poetry.org/docs/cli/#run" rel="noopener noreferrer"&gt;run command&lt;/a&gt; executes the given command inside the project’s virtualenv.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building and Publishing to PyPI
&lt;/h2&gt;

&lt;p&gt;With the tests passing, we're nearly ready to publish our project. Let's enhance our &lt;code&gt;pyproject.toml&lt;/code&gt; file to make our package more discoverable and appealing on PyPI. We'll add key metadata including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A link to the GitHub repository&lt;/li&gt;
&lt;li&gt;A path to the README file&lt;/li&gt;
&lt;li&gt;A list of relevant classifiers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These additions will help potential users find and understand our package more easily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;classifiers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"Development Status :: 3 - Alpha"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Intended Audience :: Developers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Topic :: Software Development :: User Interfaces"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Topic :: Database :: Database Engines/Servers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"License :: OSI Approved :: MIT License"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Programming Language :: Python :: Implementation :: CPython"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;readme&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"README.md"&lt;/span&gt;
&lt;span class="py"&gt;repository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://github.com/mesejo/datafusion-adapter"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For reference:    &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The complete list of classifiers is available on &lt;a href="https://pypi.org/classifiers/" rel="noopener noreferrer"&gt;PyPI's website&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;For a detailed guide on writing &lt;code&gt;pyproject.toml&lt;/code&gt;, check out this &lt;a href="https://packaging.python.org/en/latest/guides/writing-pyproject-toml/" rel="noopener noreferrer"&gt;resource&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;The formal, technical specification for &lt;code&gt;pyproject.toml&lt;/code&gt; can be found on &lt;a href="https://packaging.python.org/en/latest/specifications/pyproject-toml/#" rel="noopener noreferrer"&gt;packaging.python.org&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building
&lt;/h2&gt;

&lt;p&gt;We're now ready to build our library and verify its functionality by installing it in a clean virtual environment. Let's start with the build process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will create distribution packages (both source and wheel) in the &lt;code&gt;dist&lt;/code&gt; directory. &lt;/p&gt;

&lt;p&gt;The wheel file should have a name like &lt;code&gt;harlequin_datafusion-0.1.1-py3-none-any.whl&lt;/code&gt;. This follows the standard &lt;a href="https://packaging.python.org/en/latest/specifications/binary-distribution-format/#file-name-convention" rel="noopener noreferrer"&gt;naming convention&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;harlequin_datafusion&lt;/code&gt; is the package (or distribution) name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0.1.1&lt;/code&gt; is the version number&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;py3&lt;/code&gt; indicates it's compatible with Python 3&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;none&lt;/code&gt; indicates it does not depend on a specific &lt;a href="https://docs.python.org/3/c-api/stable.html" rel="noopener noreferrer"&gt;ABI&lt;/a&gt; (pure Python)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;any&lt;/code&gt; indicates it runs on any platform or CPU architecture&lt;/li&gt;
&lt;/ul&gt;
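&lt;p&gt;The convention above is mechanical enough to parse. As a sketch, the helper below splits a wheel filename into those five tags; note that it only handles the common five-part case and ignores the optional build tag the specification also allows:&lt;/p&gt;

```python
def parse_wheel_name(filename: str) -> dict:
    # {name}-{version}-{python tag}-{abi tag}-{platform tag}.whl,
    # ignoring the optional build tag for simplicity.
    stem = filename.removesuffix(".whl")
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


tags = parse_wheel_name("harlequin_datafusion-0.1.1-py3-none-any.whl")
print(tags["abi_tag"])  # none
```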

&lt;p&gt;To test the installation, create a new directory called &lt;code&gt;test_install&lt;/code&gt; and navigate into it. Then, set up a fresh virtual environment with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To activate the virtual environment on MacOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this command, you should see the name of your virtual environment &lt;code&gt;(.venv)&lt;/code&gt; prepended to your command prompt, indicating that the virtual environment is now active.&lt;/p&gt;

&lt;p&gt;To install the wheel file we just built, use pip as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;/path/to/harlequin_datafusion-0.1.1-py3-none-any.whl&lt;/code&gt; with the actual path to the wheel file you want to install.&lt;/p&gt;

&lt;p&gt;If everything works fine, you should see a few dependencies being installed, and you should then be able to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;harlequin &lt;span class="nt"&gt;-a&lt;/span&gt; datafusion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congrats! You have built a Python library. Now it is time to share it with the world.&lt;/p&gt;

&lt;h3&gt;
  
  
  Publishing to PyPI
&lt;/h3&gt;

&lt;p&gt;The best practice before publishing to PyPI is to first publish to the Test Python Package Index (TestPyPI).&lt;/p&gt;

&lt;p&gt;To publish a package to TestPyPI using Poetry, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create an account at &lt;a href="https://test.pypi.org/" rel="noopener noreferrer"&gt;TestPyPI&lt;/a&gt;  if you haven't already.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate an &lt;a href="https://test.pypi.org/manage/account/token/" rel="noopener noreferrer"&gt;API token&lt;/a&gt; on your TestPyPI account page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Register the TestPyPI repository with Poetry by running:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry config repositories.test-pypi https://test.pypi.org/legacy/
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To publish your package, run:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry publish &lt;span class="nt"&gt;-r&lt;/span&gt; testpypi &lt;span class="nt"&gt;--username&lt;/span&gt; __token__ &lt;span class="nt"&gt;--password&lt;/span&gt; &amp;lt;token&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Replace &lt;code&gt;&amp;lt;token&amp;gt;&lt;/code&gt; with the actual token value you generated in step 2. To verify the publishing process, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://test.pypi.org/simple/ &lt;span class="nt"&gt;--extra-index-url&lt;/span&gt; https://pypi.org/simple &amp;lt;package-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command uses two key arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--index-url&lt;/code&gt;: Directs pip to find your package on TestPyPI.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--extra-index-url&lt;/code&gt;: Allows pip to fetch any dependencies from the main PyPI repository.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Replace &lt;code&gt;&amp;lt;package-name&amp;gt;&lt;/code&gt; with your specific package name (e.g., &lt;code&gt;harlequin-datafusion&lt;/code&gt; if following this post). For additional details, consult the information provided in this &lt;a href="https://stackoverflow.com/questions/34514703/pip-install-from-pypi-works-but-from-testpypi-fails-cannot-find-requirements" rel="noopener noreferrer"&gt;post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To publish to the actual Python Package Index (PyPI) instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create an account at &lt;a href="https://pypi.org/" rel="noopener noreferrer"&gt;https://pypi.org/&lt;/a&gt; if you haven't already.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate an API token on your PyPI account page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry publish &lt;span class="nt"&gt;--username&lt;/span&gt; __token__ &lt;span class="nt"&gt;--password&lt;/span&gt; &amp;lt;token&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The default repository is PyPI, so there's no need to specify it. &lt;/p&gt;

&lt;p&gt;It's worth noting that Poetry only supports the &lt;a href="https://warehouse.pypa.io/api-reference/legacy.html#upload-api" rel="noopener noreferrer"&gt;Legacy Upload API&lt;/a&gt; when publishing your project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Publishing on GitHub release
&lt;/h3&gt;

&lt;p&gt;Manually publishing each release is repetitive and error-prone, so let's create a GitHub Actions workflow that publishes the package every time we create a release.&lt;/p&gt;

&lt;p&gt;Here are the key steps to publish a Python package to PyPI using GitHub Actions and Poetry:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set up PyPI authentication&lt;/strong&gt;: You must provide your PyPI API token as a GitHub secret so the GitHub Actions workflow can access it. Name this secret something like &lt;code&gt;PYPI_TOKEN&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create a GitHub Actions workflow file&lt;/strong&gt;: In your project's &lt;code&gt;.github/workflows&lt;/code&gt; directory, create a new file like &lt;code&gt;publish.yml&lt;/code&gt; with the following content:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and publish python package&lt;/span&gt;

   &lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;published&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;

   &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;publish-package&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
       &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
       &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
         &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v4&lt;/span&gt;
           &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
             &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.10'&lt;/span&gt;

         &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Poetry&lt;/span&gt;
           &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snok/install-poetry@v1&lt;/span&gt;

         &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;poetry config pypi-token.pypi "${{ secrets.PYPI_TOKEN }}"&lt;/span&gt;

         &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Publish package&lt;/span&gt;
           &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;poetry publish --build --username __token__&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is to leverage GitHub Actions to automate the publishing process and use Poetry to manage your package's dependencies and metadata.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Poetry is a user-friendly Python package management tool that simplifies project setup and publication. Its intuitive command-line interface streamlines environment management and dependency installation. It supports plugin development, integrates with other tools, and emphasizes testing for robust code. With straightforward commands for building and publishing packages, Poetry makes it easier for developers to share their work with the Python community.&lt;/p&gt;

&lt;p&gt;At LETSQL, we're committed to contributing to the developer community. We hope this blog post serves as a straightforward guide to developing and publishing Python packages, emphasizing best practices and providing valuable resources. &lt;br&gt;
To subscribe to our newsletter, visit &lt;a href="https://letsql.com/" rel="noopener noreferrer"&gt;letsql.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;As we continue to refine the adapter, we would like to provide better autocompletion and direct reading from files (Parquet, CSV), as in the DataFusion CLI. This requires tighter integration with the Rust library, without going through the Python bindings.&lt;/p&gt;

&lt;p&gt;Your thoughts and feedback are invaluable as we navigate this journey. Share your experiences, questions, or suggestions in the comments below or on our community forum. Let's redefine the boundaries of data science and machine learning integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Acknowledgements
&lt;/h2&gt;

&lt;p&gt;Thanks to Dan Lovell and Hussain Sultan for the comments and the thorough review.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>tui</category>
      <category>datafusion</category>
      <category>python</category>
    </item>
    <item>
      <title>Declarative Multi-Engine Data Stack with Ibis</title>
      <dc:creator>Hussain Sultan</dc:creator>
      <pubDate>Wed, 17 Jul 2024 18:26:24 +0000</pubDate>
      <link>https://dev.to/letsql/declarative-multi-engine-data-stack-with-ibis-3015</link>
      <guid>https://dev.to/letsql/declarative-multi-engine-data-stack-with-ibis-3015</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I recently came across the &lt;a href="https://substack.com/@juhache" rel="noopener noreferrer"&gt;Ju Data Engineering Newsletter&lt;/a&gt; by Julien Hurault on the &lt;a href="https://juhache.substack.com/p/multi-engine-data-stack-v0" rel="noopener noreferrer"&gt;multi-engine data stack&lt;/a&gt;. The idea is simple: we'd like to easily port our code across backends while retaining the flexibility to grow our pipeline as new backends and features are developed. This entails at least the following high-level workflows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Offloading parts of a SQL query to serverless engines such as DuckDB, Polars, DataFusion, &lt;a href="https://github.com/chdb-io/chdb" rel="noopener noreferrer"&gt;chdb&lt;/a&gt;, etc.&lt;/li&gt;
&lt;li&gt;Right-sizing the pipeline for various development and deployment scenarios; for example, developers can work locally and ship to production with confidence.&lt;/li&gt;
&lt;li&gt;Applying &lt;a href="https://www.letsql.com/posts/xgboost-end-to-end/" rel="noopener noreferrer"&gt;database-style optimizations&lt;/a&gt; to your pipelines automatically.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this post, we dive into how we can implement the multi-engine pipeline from a programming language: instead of SQL, we propose using a DataFrame API that can be used for both interactive and batch use cases. Specifically, we show how to break up our pipeline into smaller pieces and execute them across DuckDB, pandas, and Snowflake. We also discuss the advantages of a multi-engine data stack and highlight emerging trends in the field.&lt;/p&gt;

&lt;p&gt;The code implemented in this post is available on &lt;a href="https://github.com/hussainsultan/multi-engine-stack-ibis" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; (to quickly try out the repo, I also provide a Nix flake). The reference implementation from the newsletter is &lt;a href="https://github.com/hachej/multi-engine-data-stack/tree/experientation-duck" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;The multi-engine data stack pipeline works as follows: data lands in an S3 bucket, gets preprocessed to remove duplicates, and is then loaded into a Snowflake table, where it is transformed further with ML or Snowflake-specific functions (note that we do not go into implementing the kinds of transformations possible in Snowflake and simply assume them as a requirement for the workflow). The pipeline takes orders as Parquet files that are saved to a &lt;code&gt;landing&lt;/code&gt; location, preprocessed, and then stored at the &lt;code&gt;staging&lt;/code&gt; location in an S3 bucket. The staging data is then loaded into Snowflake so that downstream BI tools can connect to it. The pipeline is tied together by &lt;code&gt;dbt&lt;/code&gt; SQL models, one for each backend, and the newsletter chooses Dagster as the orchestration tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_webp%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F3bdaaee0-67d5-4e6c-bfac-824adb46d901_1822x1100.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_webp%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F3bdaaee0-67d5-4e6c-bfac-824adb46d901_1822x1100.png" alt="Multi-Engine Data Stack (V0)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today, we are going to dive into how we can convert our pandas code to Ibis expressions, reproducing the complete example from Julien Hurault's multi-engine stack post&lt;sup id="fnref1"&gt;1&lt;/sup&gt;. Instead of using &lt;code&gt;dbt&lt;/code&gt; models and SQL, we use &lt;code&gt;ibis&lt;/code&gt; and some Python to compile and orchestrate SQL engines from a shell. By rewriting our code as Ibis expressions, we can declaratively build our data pipelines with deferred execution. Moreover, Ibis supports over 20 backends, so we can write code once and port our &lt;code&gt;ibis.expr&lt;/code&gt;s to multiple backends. To simplify further, we leave scheduling and task orchestration&lt;sup id="fnref2"&gt;2&lt;/sup&gt;, which Dagster provides, up to the reader.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concept of Multi-Engine Data Stack
&lt;/h2&gt;

&lt;p&gt;Here are the core concepts of the multi-engine data stack as outlined in Julien's newsletter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.letsql.com%2Fposts%2Fmulti-engine-data-stack-ibis%2Fimages%2Fmulti-engine-data-stack.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.letsql.com%2Fposts%2Fmulti-engine-data-stack-ibis%2Fimages%2Fmulti-engine-data-stack.png" alt="multi-engine data stack"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Engine Data Stack:&lt;/strong&gt; The concept involves combining different data engines like Snowflake, Spark, DuckDB, and BigQuery. This approach aims to reduce costs, limit vendor lock-in, and increase flexibility. Julien mentions that for certain benchmark queries, using DuckDB could achieve a significant cost reduction compared to Snowflake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Development of a Cross-Engine Query Layer:&lt;/strong&gt; The newsletter highlights advancements in technology that allow data teams to transpile their SQL or Dataframe code from one engine to another seamlessly. This development is crucial for maintaining efficiency across different engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use of Apache Iceberg and Alternatives:&lt;/strong&gt; While Apache Iceberg is seen as a potential unified storage layer, its integration is not yet mature enough to be used in a &lt;code&gt;dbt&lt;/code&gt; project. Instead, Julien has opted to use Parquet files stored in S3, accessed by both DuckDB and Snowflake, in his Proof of Concept (PoC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration and Engines in PoC:&lt;/strong&gt; For the project, Julien used Dagster as the orchestrator, which simplifies the job scheduling of different engines within a &lt;code&gt;dbt&lt;/code&gt; project. The engines combined in this PoC were DuckDB and Snowflake.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why DataFrames and Ibis?
&lt;/h3&gt;

&lt;p&gt;While the pipeline above is nice for ETL and ELT, sometimes we want the power of a full programming language instead of a query language like SQL, e.g. for debugging, testing, complex &lt;a href="https://en.wikipedia.org/wiki/User-defined_function" rel="noopener noreferrer"&gt;UDFs&lt;/a&gt;, etc. For scientific exploration, interactive computing is essential, as data scientists need to quickly iterate on their code, visualize the results, and make decisions based on the data.&lt;/p&gt;

&lt;p&gt;DataFrames are such a data structure: they are used to process ordered data and apply compute operations on it interactively. They provide the flexibility to process large data with SQL-style operations, while also offering lower-level control to make cell-level edits, à la Excel sheets. Typically, the expectation is that all data is processed, and fits, in memory. Moreover, DataFrames make it easy to go back and forth between deferred/batch and interactive modes.&lt;/p&gt;

&lt;p&gt;DataFrames excel (no pun intended) at enabling users to apply user-defined functions, freeing them from the &lt;a href="https://www.scattered-thoughts.net/writing/against-sql" rel="noopener noreferrer"&gt;limitations of SQL&lt;/a&gt;: you can reuse code, test your operations, and easily extend the relational machinery for complex operations. DataFrames also make it easy to quickly go from a tabular representation of data to the arrays and tensors expected by machine learning libraries.&lt;/p&gt;
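&lt;p&gt;For instance, a DataFrame "UDF" is just a Python function, so it can be reused and unit-tested like any other code (a toy example with an illustrative &lt;code&gt;order_value&lt;/code&gt; function, not from the original repo):&lt;/p&gt;

```python
import pandas as pd

# a reusable, testable transformation: plain Python instead of dialect-specific SQL
def order_value(quantity: pd.Series, price: pd.Series) -> pd.Series:
    return quantity * price

df = pd.DataFrame({"quantity": [1, 2], "purchase_price": [10.0, 5.0]})
df["value"] = order_value(df["quantity"], df["purchase_price"])
```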

&lt;p&gt;Specialized, in-process databases, e.g. DuckDB for OLAP&lt;sup id="fnref3"&gt;3&lt;/sup&gt;, are blurring the boundary between a heavyweight remote database like Snowflake and an ergonomic library like pandas. We believe this is an opportunity to let DataFrames process larger-than-memory data while maintaining the interactivity and developer feel of a local Python shell, making larger-than-memory data feel small.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Deep Dive
&lt;/h2&gt;

&lt;p&gt;Our implementation focuses on the four concepts presented earlier:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Engine Data Stack&lt;/strong&gt;: We will use DuckDB, pandas, and Snowflake as our engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Engine Query Layer&lt;/strong&gt;: We will use Ibis to write our expressions and compile them to run on DuckDB, pandas, and Snowflake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Iceberg and Alternatives&lt;/strong&gt;: We will use Parquet files stored locally as our storage layer, with the expectation that it's trivial to extend to S3 using the &lt;code&gt;s3fs&lt;/code&gt; package.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration and Engines in PoC&lt;/strong&gt;: We will focus on fine-grained scheduling for engines and leave orchestration to the reader. Fine-grained scheduling is more aligned with Ray, Dask, or PySpark than with orchestration frameworks like Dagster or Airflow.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Implementing with &lt;code&gt;pandas&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.letsql.com%2Fposts%2Fmulti-engine-data-stack-ibis%2Fimages%2Fpandas-parquet-snowflake.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.letsql.com%2Fposts%2Fmulti-engine-data-stack-ibis%2Fimages%2Fpandas-parquet-snowflake.png" alt="pandas, parquet and Snowflake"&gt;&lt;/a&gt;&lt;br&gt;
pandas is the quintessential DataFrame library and perhaps provides the simplest way to implement the above workflow. First, we generate random data, borrowing from the implementation in the newsletter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#| echo: false
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;multi_engine_stack_ibis.generator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_random_data&lt;/span&gt;
&lt;span class="nf"&gt;generate_random_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;landing/orders.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;landing/orders.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deduped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pandas implementation is imperative in style and assumes the data fits in memory. The pandas API, with all its nuances, is hard to compile down to SQL; it occupies its own special place, bridging Python visualization, plotting, machine learning, AI, and complex processing libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;deduped&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T_ORDERS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auto_create_table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quote_identifiers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temporary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After de-duplicating with pandas operators, we are ready to send the data to Snowflake. Snowflake's Python connector provides a &lt;code&gt;write_pandas&lt;/code&gt; helper that comes in handy for our use case.&lt;/p&gt;
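&lt;p&gt;For completeness, the snippet above assumes a &lt;code&gt;conn&lt;/code&gt; object and a &lt;code&gt;pt&lt;/code&gt; alias; a hedged sketch of that setup (placeholder credentials, assuming the &lt;code&gt;snowflake-connector-python&lt;/code&gt; package) might look like:&lt;/p&gt;

```python
import snowflake.connector
from snowflake.connector import pandas_tools as pt  # provides write_pandas

# placeholder credentials: substitute your own account details
conn = snowflake.connector.connect(
    account="...",
    user="...",
    password="...",
    database="MULTI_ENGINE",
    schema="PUBLIC",
    warehouse="COMPUTE_WH",
)
```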

&lt;h3&gt;
  
  
  Implementing with &lt;code&gt;Ibis&lt;/code&gt; aka Ibisify
&lt;/h3&gt;

&lt;p&gt;One limitation of pandas is that its API does not quite &lt;a href="https://vldb.org/pvldb/vol13/p2033-petersohn.pdf" rel="noopener noreferrer"&gt;map back to relational algebra&lt;/a&gt;. Ibis, built by the same people who built pandas, provides a sane expression system that can be mapped to multiple SQL backends. Ibis takes inspiration from the &lt;a href="https://dplyr.tidyverse.org/" rel="noopener noreferrer"&gt;dplyr&lt;/a&gt; R package to build an expression system that maps cleanly to relational algebra and thus compiles to SQL. It is also declarative in style, enabling us to apply database-style optimizations to the complete logical plan, i.e. the expression. Ibis is a key component for enabling composability, as highlighted in the excellent &lt;a href="https://voltrondata.com/codex/a-new-frontier#what-is-a-composable-data-system" rel="noopener noreferrer"&gt;composable codex&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#| echo: false
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ibis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ibis.backends.pandas.executor&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ibis.expr.types.relations&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ibis&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;multi_engine_stack_ibis.generator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_random_data&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;multi_engine_stack_ibis.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MyExecutor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checkpoint_parquet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                           &lt;span class="n"&gt;create_table_snowflake&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                           &lt;span class="n"&gt;replace_unbound&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;multi_engine_stack_ibis.connections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_ibis_snowflake_connection&lt;/span&gt;



&lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backends&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PandasExecutor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyExecutor&lt;/span&gt;
&lt;span class="nf"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpoint_parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checkpoint_parquet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_table_snowflake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;create_table_snowflake&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;p_staging&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging/staging.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;p_landing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;landing/orders.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;snow_backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_ibis_snowflake_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MULTI_ENGINE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PUBLIC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;warehouse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COMPUTE_WH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_landing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;order_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_number&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;checkpoint_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_staging&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_table_snowflake&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T_ORDERS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;expr&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An Ibis expression prints itself as a plan akin to a traditional logical plan in databases. A logical plan is a tree of relational algebra operators that describes the computation to be performed. The plan is optimized by a query optimizer and converted into a physical plan that is executed by the query executor. Ibis expressions are similar to logical plans in that they describe the computation to be performed, but they are not executed immediately; instead, they are compiled into SQL and executed on the backend when needed. A logical plan is generally at a higher level of granularity than a DAG produced by a task-scheduling framework like Dask. In theory, this plan could be compiled down to Dask's DAG.&lt;/p&gt;

&lt;p&gt;While pandas is embedded and just a pip install away, it still has well-documented limitations, with plenty of performance improvements left on the table. This is where recent embedded databases like DuckDB fill the gap: they pack the full punch of a SQL engine, with all its optimizations and years of research behind it, yet are as easy to import as pandas. In this world, we can at minimum delegate all the relational and SQL parts of our pipeline to DuckDB and only hand the processed data to complex user-defined Python.&lt;/p&gt;

&lt;p&gt;Now, we are ready to take our Ibisified code and compile the expression above to execute on arbitrary engines, truly realizing the write-once-run-anywhere paradigm: we have successfully decoupled our compute engine from the expression system describing our computation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Engine Data Stack w/ Ibis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DuckDB + pandas + Snowflake
&lt;/h3&gt;

&lt;p&gt;Let's break our expression above into smaller parts and run them across DuckDB, pandas, and Snowflake. Note that once the data lands in Snowflake we do nothing further with it beyond showing that we can select it; what is possible with Snowflake-native features is left to the reader's imagination.&lt;/p&gt;

&lt;p&gt;Notice that our expression above is bound to the pandas backend. First, let's create an &lt;a href="https://ibis-project.org/how-to/extending/unbound_expression" rel="noopener noreferrer"&gt;UnboundTable&lt;/a&gt; expression so that we don't have to depend on a backend when writing our expressions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.letsql.com%2Fposts%2Fmulti-engine-data-stack-ibis%2Fimages%2Fbreaking-up-ibis-expressions.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.letsql.com%2Fposts%2Fmulti-engine-data-stack-ibis%2Fimages%2Fbreaking-up-ibis-expressions.png" alt="Decomposed Ibis Expressions"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int64&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int64&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purchase_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float64&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int64&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;first_expr_for&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;order_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_number&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;first_expr_for&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
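&lt;p&gt;To make the window-function filter concrete before binding it to any backend: a minimal plain-pandas sketch of the same keep-first-row-per-order de-duplication (the rows and values here are made up for illustration, and only a subset of the columns above is used):&lt;/p&gt;

```python
import pandas as pd

# Toy orders data with a duplicated order_id (hypothetical values)
orders = pd.DataFrame(
    {
        "order_id": [1, 1, 2],
        "dt": pd.to_datetime(["2024-01-02", "2024-01-01", "2024-01-01"]),
        "sku": ["a", "b", "c"],
    }
)

# Equivalent of row_number().over(group_by=order_id, order_by=dt) == 0:
# sort by dt, then keep only the earliest row per order_id.
deduped = orders.sort_values("dt").drop_duplicates(subset="order_id", keep="first")
print(deduped)
```

&lt;p&gt;The Ibis expression above does the same thing declaratively, so the de-duplication can later run on whichever engine the expression is bound to.&lt;/p&gt;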



&lt;p&gt;Next, we replace the UnboundTable expression with the DuckDB backend and execute it with the &lt;code&gt;to_parquet&lt;/code&gt; method&lt;sup id="fnref4"&gt;4&lt;/sup&gt;. This step is covered by the &lt;code&gt;checkpoint_parquet&lt;/code&gt; operator that we added to the pandas backend above. Here is an &lt;a href="https://ibis-project.org/posts/into-snowflake/" rel="noopener noreferrer"&gt;excellent blog post&lt;/a&gt; that discusses inserting data into Snowflake from any Ibis backend with the &lt;code&gt;to_pyarrow&lt;/code&gt; functionality.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;landing/orders.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;duck_backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;duck_backend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE TABLE orders as SELECT * from data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;bind_to_duckdb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;replace_unbound&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_expr_for&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duck_backend&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;bind_to_duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_staging&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;to_sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bind_to_duckdb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the above step creates the de-duplicated table, we can send the data to Snowflake using the pandas backend. This functionality is covered by the &lt;code&gt;create_table_snowflake&lt;/code&gt; operator that we added to the pandas backend above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;second_expr_for&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T_ORDERS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# nothing special just a reading the data from orders table
&lt;/span&gt;&lt;span class="n"&gt;snow_backend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T_ORDERS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;second_expr_for&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pandas_backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T_ORDERS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_staging&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;snow_backend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T_ORDERS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pandas_backend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_pyarrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;second_expr_for&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we can select the data from the Snowflake table to verify that the data has been loaded successfully.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;third_expr_for&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T_ORDERS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# add you Snowflake ML functions here
&lt;/span&gt;&lt;span class="n"&gt;third_expr_for&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.letsql.com%2Fposts%2Fmulti-engine-data-stack-ibis%2Fimages%2Fmulti-engine-data-stack-ibisified.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.letsql.com%2Fposts%2Fmulti-engine-data-stack-ibis%2Fimages%2Fmulti-engine-data-stack-ibisified.png" alt="Ibisified!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We successfully broke our computation into pieces, albeit manually, and executed them across DuckDB, pandas, and Snowflake. This demonstrates the flexibility and power of a multi-engine data stack, allowing users to leverage the strengths of different engines to optimize their data processing pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Acknowledgments
&lt;/h2&gt;

&lt;p&gt;I'd like to thank Neal Richardson, Dan Lovell, and Daniel Mesejo for providing initial feedback on the post. I greatly appreciate the early review and encouragement from Wes McKinney.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://wesmckinney.com/blog/looking-back-15-years/" rel="noopener noreferrer"&gt;The Road to Composable Data Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://voltrondata.com/codex.html" rel="noopener noreferrer"&gt;The Composable Codex&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arrow.apache.org/" rel="noopener noreferrer"&gt;Apache Arrow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Multi-Engine Data Stack Newsletter &lt;a href="https://juhache.substack.com/p/multi-engine-data-stack-v0" rel="noopener noreferrer"&gt;v0&lt;/a&gt; &lt;a href="https://juhache.substack.com/p/multi-engine-data-stack-v1?utm_source=profile&amp;amp;utm_medium=reader2" rel="noopener noreferrer"&gt;v1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ibis-project.org/" rel="noopener noreferrer"&gt;Ibis, the portable dataframe library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getdbt.com/docs/collaborate/documentation" rel="noopener noreferrer"&gt;dbt Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.dagster.io/getting-started" rel="noopener noreferrer"&gt;Dagster Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lancedb.com/" rel="noopener noreferrer"&gt;LanceDB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuzudb.com/" rel="noopener noreferrer"&gt;KuzuDB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;In this post, we have primarily focused on v0 of the multi-engine data stack. The latest version adds Apache Iceberg as a storage and data format layer, and uses NYC Taxi data instead of the random Orders data used here and in v0. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;Orchestration vs. fine-grained scheduling: ↩&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orchestration is left to the reader; it can be done with a tool like Dagster, Prefect, or Apache Airflow.&lt;/li&gt;
&lt;li&gt;Fine-grained scheduling can be done with a tool like Dask, Ray, or Spark.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Some examples of in-process databases are described in &lt;a href="https://thedataquarry.com/posts/embedded-db-1/" rel="noopener noreferrer"&gt;this post&lt;/a&gt;, which extends the DuckDB example above to newer purpose-built databases like LanceDB and KuzuDB. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;The Ibis &lt;a href="https://ibis-project.org/how-to/extending/unbound_expression" rel="noopener noreferrer"&gt;docs&lt;/a&gt; use &lt;code&gt;backend.to_pandas(expr)&lt;/code&gt; to bind and run an expression in one go. Instead, we use the &lt;code&gt;replace_unbound&lt;/code&gt; method to show a generic way to compile an expression against a backend without executing it. This is just for illustration; from here on, the code uses the &lt;code&gt;backend.to_pyarrow&lt;/code&gt; method. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>pandas</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
