Developing in Dagster

tl;dr: Use Poetry, Docker, and a sensible folder structure to create a streamlined dev experience for building dagster pipelines. This technical blog post dives into how this was accomplished. This post is about environment management more than it is about writing actual dagster ops, jobs, etc. The goal is to make your life easier while you do those things :)

The associated code repo can be found here

Fixing containerized code in (2x) real-time

I’ve been exploring dagster for some of Mile Two’s data orchestration needs and have been absolutely loving it. It hits all of the sweet spots for gradually developing data pipelines, but I found myself in a familiar situation: trying to logically structure my code such that it can easily be containerized and thrown into a CI/CD process. To that end, I’ve open-sourced a boilerplate project that enhances the dagster development experience with these valuable features:

  • Uses one multi-stage Dockerfile for development & deployment which can easily integrate with CI/CD processes
  • Containerized environment picks up code changes immediately (just hit Reload in dagit); *no more waiting for containers to spin down and up!*
  • Uses poetry for virtual environment creation and tractable package management
  • Dependencies are specified according to PEP 518 using pyproject.toml instead of setup.py, which means no more hideous pip freeze > requirements.txt

Below, I start with a brief comparison of dagster new-project and my project structure. Then, I walk through some features & configuration of poetry. Finally, I dive into the multi-stage Dockerfile and how it bridges the gap from development to deployment.

Improvements to New Projects

dagster comes with the ability to create template projects. Even though it’s currently marked experimental, it’s an excellent starting point for the project structure.

$ dagster new-project fresh-user-code
ExperimentalWarning: "new_project_command" is an experimental function. 
Creating a new Dagster repository in fresh-user-code...
Done.

And the resulting project structure:

Overall, it’s lovely! Code is organized into appropriate submodules and has auto-generated environment setup instructions (as long as you’re using conda or virtualenv). It even configures user code as an editable package and creates setup.py for packaging.

Let’s compare it against the enhanced project structure (differences highlighted on the left)

Our enhanced dagster user code boilerplate. The photo above contains the entire setup process! :)

Change #1: pyproject.toml and the generated poetry.lock replace setup.py
Change #2: .venv contains our virtual environment, including the installed dependencies (it exists only after running poetry)
Change #3: Notice the nested folder! This allows poetry to auto-resolve & package our code. Also, this project doesn’t have subdirectories for job, op, etc. for demonstration purposes, but they could easily be added
Change #4: Docker-related files
Change #5: I like to use a convention where each job has a corresponding default YAML configuration named job_name.yaml so configs can be loaded programmatically; each of those configs lives in this directory (see the sketch below)
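
To illustrate Change #5, here’s a rough sketch of how that convention makes configs loadable programmatically. The job import and helper names are hypothetical, not the repo’s actual module paths:

# Hypothetical sketch of the job_name.yaml convention: each job's default
# run config lives at job_configs/<job_name>.yaml.
from pathlib import Path

import yaml
from dagster_example_pipeline.jobs import hacker_news_job  # hypothetical import


def default_run_config(job_name: str) -> dict:
    with (Path("job_configs") / f"{job_name}.yaml").open() as f:
        return yaml.safe_load(f)


result = hacker_news_job.execute_in_process(
    run_config=default_run_config(hacker_news_job.name)
)
assert result.success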

The first three changes are poetry- and PEP 517/518-related and are discussed in the next section. In the section after that, I’ll dive into the contents of the Dockerfile and docker-compose file and how they support both local development and deployment.

Managing Via Poetry

Poetry is a great choice when working exclusively in a python ecosystem because it allows us to distinguish between

  • specified dependencies—packages we explicitly include in pyproject.toml
  • resolved dependencies—any package in poetry.lock

If we were using conda’s environment.yml or a more traditional requirements.txt, the specified dependencies would not be tracked, so we would lose the context of which packages are actually desired. When managing packages later in a project’s lifecycle, it’s helpful to understand which packages are intended to be included and which ones can be pruned.
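
To make the distinction concrete, here’s a small illustration (my own sketch, not part of the repo) that diffs the two files. It assumes the tomli package, since the project targets Python 3.9:

# Illustration only: compare specified dependencies (pyproject.toml) with
# resolved dependencies (poetry.lock). Assumes `pip install tomli` on 3.9.
import tomli

with open("pyproject.toml", "rb") as f:
    poetry_cfg = tomli.load(f)["tool"]["poetry"]

specified = {
    name.lower()
    for section in ("dependencies", "dev-dependencies")
    for name in poetry_cfg.get(section, {})
} - {"python"}

with open("poetry.lock", "rb") as f:
    resolved = {pkg["name"].lower() for pkg in tomli.load(f)["package"]}

print(f"{len(specified)} specified, {len(resolved)} resolved")
print("pulled in transitively:", sorted(resolved - specified))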

you vs the package manager they told you not to worry about

To understand why the ability to track specified dependencies is important, imagine you have been asked to remove dagster and dagit from the project (for some silly reason). With poetry, you remove both packages from the dependencies sections of pyproject.toml and run poetry update. With pip, you would run pip uninstall dagster dagit, but that doesn’t clean up any of their dependencies. Over time, requirements.txt grows with more and more unnecessary packages until the painful day you decide to sift through the codebase in search of “Which packages am I actually importing?” The following video demonstrates just how easy this cleanup can be when using poetry:

When removing dagster, poetry removes *59 packages* for us that are no longer needed. If we were using pip, those 59 packages would still be cluttering up our environment and our requirements.txt

Major sections of pyproject.toml:

Below, I break down the sections of pyproject.toml and what each one does. For even more detail, take a look at the poetry pyproject documentation.

# Section 1
[tool.poetry]
name = "dagster-example-pipeline"
version = "1.0.0"
description = ""
authors = ["Alex Service <aservice@miletwo.us>"]

The first section defines our python package. A couple of notable things happen automatically here:

  • When packaging our source code, poetry will automatically search src for a subdirectory with a matching name. This behavior can be overridden if desired
    • Note: pyproject.toml expects hyphens for the name, but the directory itself should use underscores, e.g. src/dagster_example_pipeline
  • poetry respects semantic versioning. If you wish to bump the version number, you can manually change it, or use the poetry version command
    • e.g. poetry version minor would change the version to 1.1.0
# Section 2
[tool.poetry.dependencies]
python = "~3.9"
pandas = "^1.3.2"
google-cloud-storage = "^1.42"
dagster = "0.13.19"
dagster-gcp = "0.13.19"

The second section is where we include our specified dependencies. These are the packages we want at all times, both in production and during development. This section should only include the names of packages you explicitly want to define. Do not fill this with the output of pip freeze! poetry will resolve each package’s dependencies for us.

# Section 3
[tool.poetry.dev-dependencies]
dagit = "0.13.19"
debugpy = "^1.4.1"
# jupyterlab = "^3.2.2"

The third section specifies our dev-dependencies, which are packages we only want to install during development. dagit is a good example: we already have an existing dagit deployment, but I want to be able to test in the UI locally. It doesn’t need to be deployed with my user code, so it can be included as a dev-dependency. For my workflow, I often include a few types of dev-dependencies:

  • Packages for Exploratory Data Analysis, e.g. jupyterlab, matplotlib
  • Debugging packages. As a VSCode user, I find debugpy to be very helpful (see the sketch after this list)
  • New packages I’m trialing to see if they solve my problems; if they do, I’ll “promote” them to become a regular dependency by moving them out of the dev-dependencies
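
As an example of that debugging workflow, here’s roughly how debugpy gets used; the port and placement are arbitrary choices for illustration, not repo code:

# Hedged example: open a debugpy listener somewhere in your code so that
# VSCode can attach. The port (5678) is an arbitrary choice.
import debugpy

debugpy.listen(("0.0.0.0", 5678))  # listen on all interfaces (handy in Docker)
debugpy.wait_for_client()          # block until the debugger attaches
debugpy.breakpoint()               # then pause here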
# Section 4
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

The final section configures the python build system to use poetry instead of setuptools, in accordance with PEP 517.

poetry install


TIP: Before running the following commands, if you configure poetry to create the virtualenv inside of the project (via poetry config virtualenvs.in-project true), then VSCode will automatically recognize the new environment and ask you to select it as your environment :)


The command poetry install does a few things

  1. Creates a lock file and resolves the dependency tree (i.e. it resolves all sub-dependencies for our specified dependencies), marking each package as either “main” or “dev”
  2. Downloads & caches all of the dependencies and sub-dependencies from the previous step
  3. Adds our code as an editable package to the environment
$ poetry install
Updating dependencies
Resolving dependencies... (9.5s)

Writing lock file

Package operations: 124 installs, 0 updates, 0 removals

  • Installing protobuf (3.19.4)
  • Installing pyasn1 (0.4.8)
# ... omitted output
  • Installing pytest (6.2.5)

Installing the current project: dagster-example-pipeline (1.0.0)
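Once inside the environment (next section), a quick sanity check of step 3 is to confirm the package resolves to the source tree rather than site-packages (a tiny check of my own, not repo code):

# Inside `poetry run python`: an editable install should resolve to src/,
# not to site-packages.
import dagster_example_pipeline

print(dagster_example_pipeline.__file__)
# expected: /path/to/project/src/dagster_example_pipeline/__init__.py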

Activate the Environment

To actually use all of these packages, it’s very simple:

$ poetry shell
Spawning shell within /path/to/.venv
. /path/to/.venv/bin/activate

(.venv) bash-3.2$

Run Dagster Daemon and Dagit (without a container)

We’ll explore containerization in a moment, but first let’s demonstrate that the environment is properly set up:

(.venv) bash-3.2$ dagit
Using temporary directory /path/to/dagster-example-pipeline/tmp7wdyoxas for storage. This will be removed when dagit exits.
To persist information across sessions, set the environment variable DAGSTER_HOME to a directory to use.

2022-02-15 16:07:08 -0500 - dagit - INFO - Serving dagit on http://127.0.0.1:3000 in process 14650

Navigate to http://localhost:3000 and try running the job, which simply grabs the top 5 items from Hacker News :)

Job result
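
For reference, a job like that might look roughly like the following sketch, which assumes the public Hacker News Firebase API; the repo’s actual op and job names may differ:

# Rough sketch of a "top 5 Hacker News items" job (names are illustrative).
import requests
from dagster import job, op

HN_API = "https://hacker-news.firebaseio.com/v0"

@op
def fetch_top_five_stories():
    # topstories.json returns the IDs of the currently top-ranked stories
    ids = requests.get(f"{HN_API}/topstories.json").json()[:5]
    return [requests.get(f"{HN_API}/item/{i}.json").json() for i in ids]

@op
def log_stories(context, stories):
    for story in stories:
        context.log.info(story.get("title", "<untitled>"))

@job
def hacker_news_top_five():
    log_stories(fetch_top_five_stories())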

The Problem With Containerizing Dagster

A major selling point of containerization is how it blurs the lines between “works on my machine” and deploying to production. The fundamental problem is this: there is a tradeoff between support for hot-loading code changes and support for CI/CD build processes. This problem isn’t dagster-specific—it exists almost everywhere when trying to containerize a dev environment.

In more detail, this problem might sound familiar:

  • I want my python code to be editable, so that code changes are loaded immediately and I have a faster development loop. So, I will mount my project inside of a docker container with a configured python environment
  • My CI/CD build process expects a container with my project copied inside of it. I could use this container for local development, but will have to rebuild and rerun the container with each code change

It sounds like we have to either write multiple dockerfiles, or we have to give up the ability to hot-load our code*

*To be fair, this is a false dichotomy. Other approaches, such as VSCode devcontainers do exist, but in my experience, they don’t quite “scratch the itch”

The Solution: Multi-Stage Dockerfile

Using poetry and docker, we can use a multi-stage Dockerfile to support both needs and speed up the development of dagster user-code environments! Here’s how:

  1. Create a Dockerfile with 3 stages: dev, build, and deploy
    1. dev installs all of the necessary dependencies using poetry and runs dagit when targeted; it expects code to be volume-mounted rather than copied in
    2. build uninstalls dev dependencies, copies our project into the container, and then builds a python package of our code, which gives us a standard python wheel file
    3. deploy copies only the wheel file and installs it using pip (no poetry, no volume mount, no mess)
  2. Create a docker-compose file that targets the dev stage of our Dockerfile and mounts our project as a volume in the container. This will be used for local development
    1. Bonus: Use an external environment variable manager like direnv to centralize all project environment variables into a single .envrc file and simply reference these variables in docker-compose.yml
  3. Let our CI/CD process run through all stages of the Dockerfile, resulting in a container ready to be deployed as a dagster user-code environment

Let’s dive into each of the three stages to understand what’s going on.

  • A quick note about deployments: Elementl provides an example of deploying via docker, but even their documentation states that the user code container has to be restarted to reflect code changes

Dockerfile Stage 1: dev

Here are the critical bits from the first stage:

ARG BASE_IMAGE=python:3.9.8-slim-buster
FROM "${BASE_IMAGE}" as dev

The only exciting part above is that we label our first stage so it can be referenced later in the build stage.

COPY poetry.lock pyproject.toml ./
RUN poetry install

poetry.lock and pyproject.toml are the only files copied into the dev container, because it is expected that everything else will be mounted. As a result, the only reason to restart the dev container is if we make changes to our dependencies :)

RUN echo "poetry install" > /usr/bin/dev_command.sh
RUN echo "poetry run dagit -h 0.0.0.0 -p 3000" >> /usr/bin/dev_command.sh
RUN chmod +x /usr/bin/dev_command.sh
CMD ["bash", "dev_command.sh"]

It might seem weird that poetry install gets called a second time, but because dev_command.sh is executed after our code is mounted, the second install is what adds our code to the environment.

To use the newly-created dev environment, in docker-compose.yml simply specify the build and image tags for a service:

dagsterdev:
    build: 
      context: .
      dockerfile: Dockerfile
      target: dev
    image: dagster-example-pipeline-dev
    volumes:
      - ./:/usr/src/app

With a simple docker compose up, the dev environment is ready to go!

Dockerfile Stage 2: build

This stage is wonderfully simple:

FROM dev as build
RUN poetry install --no-dev
COPY . .

The build stage extends the dev stage, meaning all installed packages are still present. Above, poetry searches for any dependencies labeled “dev” and removes them. Also, we finally copy the actual project into the container.

RUN poetry build --format wheel | grep "Built" | sed 's/^.*\s\(.*\.whl\)/\1/' > package_name

The magic happens! poetry builds a python wheel from our code and packages it up with only the necessary dependencies. The rest of the line looks scary, but it’s just extracting and saving the filename of the wheel. For reference, the output of poetry build looks like this:

$ poetry build --format wheel
Building dagster-example-pipeline (1.0.0)
  - Building wheel
  - Built dagster_example_pipeline-1.0.0-py3-none-any.whl

Dockerfile Stage 3: deploy

Now that the code is packaged as a wheel, poetry’s no longer needed. In fact, nothing is needed outside of a fresh python environment, the wheel, and any configuration for dagster!

FROM "${BASE_IMAGE}"
# remember, BASE_IMAGE is just a python image
# ... omitted some python setup. I'll be honest, not sure how much 
#     of this is actually needed :) ...

# copy the directory with our wheel
COPY --from=build /usr/src/app/dist repo_package
# copy the file containing our wheel filename
COPY --from=build /usr/src/app/package_name package_name

RUN pip install --no-cache-dir repo_package/$(cat package_name)
COPY workspace.yaml workspace.yaml
COPY job_configs job_configs

And there we go! Everything from the previous stages is discarded except for the wheel that was just created. Once installed and configured, this final stage is ready to be deployed.

The Result: Faster Dev, Easier Deploys, & Cleaner Repositories

In the end, I now have everything I wanted:

  • The ability to develop & test jobs without constantly waiting for containers to build and spin up or down
  • Containerization handled without cluttering up my project (and mental) workspace
  • Package management that maintains a history of specified, intended packages so I don’t have to consider, months later, whether the package I want to remove is a dependency of a dependency of a dependency of...

Conclusion

Even if you don't need the repository, I hope you've found the technical discussion above useful for your own projects. I'd love it if you cloned the repo and tried it for yourself!
