<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AlexS</title>
    <description>The latest articles on DEV Community by AlexS (@alexserviceml).</description>
    <link>https://dev.to/alexserviceml</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F835631%2F673ad297-1e5a-47f5-8f7c-99a567921cbf.jpeg</url>
      <title>DEV Community: AlexS</title>
      <link>https://dev.to/alexserviceml</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexserviceml"/>
    <language>en</language>
    <item>
      <title>Developing in Dagster</title>
      <dc:creator>AlexS</dc:creator>
      <pubDate>Fri, 25 Mar 2022 22:40:26 +0000</pubDate>
      <link>https://dev.to/alexserviceml/developing-in-dagster-2flh</link>
      <guid>https://dev.to/alexserviceml/developing-in-dagster-2flh</guid>
      <description>&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt;: Use Poetry, Docker, and sensible folder structures to create a streamlined dev experience for creating dagster pipelines. This technical blog post dives into how this was accomplished. This post is about environment management more than it is about writing actual dagster ops, jobs, etc. The goal is to make your life easier while you do those things :)&lt;/p&gt;

&lt;p&gt;The associated code repo can be &lt;a href="https://github.com/MileTwo/dagster-example-pipeline" rel="noopener noreferrer"&gt;&lt;strong&gt;found here&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/XYlsFbbfrDA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;small&gt;&lt;em&gt;Fixing containerized code in (2x) real-time&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’ve been exploring &lt;a href="http://dagster.io" rel="noopener noreferrer"&gt;dagster&lt;/a&gt; for some of Mile Two’s data orchestration needs and have been absolutely loving it. It hits all of the sweet spots for gradually developing data pipelines, but I found myself in a familiar situation: trying to logically structure my code such that it can easily be containerized and thrown into a CI/CD process. To that end, I’ve open-sourced a boilerplate project that enhances the dagster development experience with these valuable features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Uses one multi-stage Dockerfile for development &amp;amp; deployment which can easily integrate with CI/CD processes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerized environment picks up code changes immediately (just hit &lt;code&gt;Reload&lt;/code&gt; in dagit); &lt;em&gt;no more waiting for containers to spin down and up!&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uses poetry for virtual environment creation and tractable package management&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependencies are specified according to &lt;a href="https://www.python.org/dev/peps/pep-0518/" rel="noopener noreferrer"&gt;PEP 518&lt;/a&gt; using &lt;code&gt;pyproject.toml&lt;/code&gt; instead of &lt;code&gt;setup.py&lt;/code&gt;, which means no more hideous &lt;code&gt;pip freeze &amp;gt; requirements.txt&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below, I start with a brief comparison of &lt;code&gt;dagster new-project&lt;/code&gt; and my project structure. Then, I walk through some features &amp;amp; configuration of poetry. Finally, I dive into the multi-stage Dockerfile and how it bridges the gap from development to deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improvements to New Projects
&lt;/h2&gt;

&lt;p&gt;dagster comes with the ability to create template projects. Even though it’s currently marked experimental, it’s an excellent starting point for the project structure&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;dagster new-project fresh-user-code
ExperimentalWarning: &lt;span class="s2"&gt;"new_project_command"&lt;/span&gt; is an experimental &lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; 
Creating a new Dagster repository &lt;span class="k"&gt;in &lt;/span&gt;fresh-user-code...
Done.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4nfw0amcln88315kgp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4nfw0amcln88315kgp6.png" alt="And the resulting project structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;small&gt;&lt;em&gt;The resulting project structure&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Overall, it’s lovely! Code is organized into appropriate submodules and has auto-generated environment setup instructions (as long as you’re using conda or virtualenv). It even configures user code as an editable package and creates  &lt;code&gt;setup.py&lt;/code&gt; for packaging. &lt;/p&gt;

&lt;p&gt;Let’s compare it against the enhanced project structure (differences highlighted on the left)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbz4evxa930z6xusqlzfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbz4evxa930z6xusqlzfp.png" alt="Our enhanced dagster user code boilerplate. The photo above contains the entire setup process! :)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;small&gt;&lt;em&gt;Our enhanced dagster user code boilerplate. The photo above contains the entire setup process! :)&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Change #1&lt;/th&gt;
&lt;th&gt;pyproject.toml and the generated poetry.lock replace setup.py&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Change #2&lt;/td&gt;
&lt;td&gt;.venv contains our virtual environment, including the installed dependencies (exists only after running poetry)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change #3&lt;/td&gt;
&lt;td&gt;Notice the nested folder! This allows poetry to auto-resolve &amp;amp; package our code. Also, this project doesn’t have subdirectories for job, op, etc for demonstration purposes, but they could be easily added&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change #4&lt;/td&gt;
&lt;td&gt;Docker-related files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change #5&lt;/td&gt;
&lt;td&gt;I like each job to have a corresponding default YAML configuration named job_name.yaml so it can easily be loaded programmatically; all of those configs are located in this directory&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
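
&lt;p&gt;As a rough sketch of the convention in Change #5, a job’s default config can be located purely from its name. The helper names and the example job name below are hypothetical (they are not taken from the repo), and loading the file assumes PyYAML is importable, which should hold wherever dagster is installed:&lt;/p&gt;

```python
from pathlib import Path

# Directory that holds one default config per job (name taken from the project layout).
CONFIG_DIR = Path("job_configs")


def default_config_path(job_name: str, config_dir: Path = CONFIG_DIR):
    """Map a job name to its default config file: my_job maps to job_configs/my_job.yaml."""
    return config_dir / f"{job_name}.yaml"


def load_default_config(job_name: str, config_dir: Path = CONFIG_DIR):
    """Read the job's default config into a dict (assumes PyYAML is installed)."""
    import yaml  # imported lazily so the path helper works without PyYAML

    with default_config_path(job_name, config_dir).open() as handle:
        return yaml.safe_load(handle)


# "my_job" is a hypothetical job name used only for illustration.
print(default_config_path("my_job").as_posix())
```

&lt;p&gt;The resulting dict can then be handed to dagster as the run config for that job.&lt;/p&gt;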

&lt;p&gt;The first three changes are poetry- and PEP 517/518-related and are discussed in the next section. In the section after that, I’ll dive into the contents of &lt;code&gt;Dockerfile&lt;/code&gt; and &lt;code&gt;docker-compose&lt;/code&gt; and how they support both local development and deployment&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing Via Poetry
&lt;/h2&gt;

&lt;p&gt;Poetry is a great choice when working exclusively in a python ecosystem because it allows us to distinguish between &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;specified dependencies&lt;/strong&gt;—packages we explicitly include in &lt;code&gt;pyproject.toml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;resolved dependencies&lt;/strong&gt;—any package in &lt;code&gt;poetry.lock&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we were using conda’s &lt;code&gt;environment.yml&lt;/code&gt; or a more traditional &lt;code&gt;requirements.txt&lt;/code&gt; , the specified dependencies would not be tracked and so we lose the context of which packages are desired. When managing packages later in a project’s lifecycle, it’s helpful to understand which packages are intended to be included and which ones can be pruned&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd7tvrxm8nkacy4n90z3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd7tvrxm8nkacy4n90z3.png" alt="you vs the package manager they told you not to worry about"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;small&gt;you vs the package manager they told you not to worry about&lt;/small&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To understand why the ability to track specified dependencies is important, imagine you have been asked to remove dagster and dagit from the project (for some silly reason). With poetry, you remove both packages from the dependencies sections of &lt;code&gt;pyproject.toml&lt;/code&gt; and run &lt;code&gt;poetry update&lt;/code&gt;. With pip, you would run &lt;code&gt;pip uninstall dagster dagit&lt;/code&gt;, but that doesn’t clean up any of their dependencies. Over time, the &lt;code&gt;requirements.txt&lt;/code&gt; grows with more and more unnecessary packages until the painful day you decide to sift through the codebase in search of “Which packages am I actually importing?” The following video demonstrates just how easy this cleanup can be when using poetry:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/EzOfoR13h8Y"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;When removing dagster, poetry &lt;strong&gt;removes &lt;em&gt;59 packages&lt;/em&gt;&lt;/strong&gt; for us that are no longer needed. If we were using pip, those 59 packages would still be cluttering up our environment and our requirements.txt&lt;/p&gt;
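
&lt;p&gt;To make the idea concrete, here is a minimal sketch of why tracking specified dependencies makes cleanup possible. It is not poetry’s actual resolver, and the toy graph below is invented for illustration (it is not dagster’s real dependency tree). A package is only safe to remove if nothing that is still specified reaches it:&lt;/p&gt;

```python
# Toy dependency graph: package name mapped to the packages it directly requires.
# The names and edges are illustrative only, not dagster's real dependency tree.
GRAPH = {
    "dagster": ["click", "pyyaml", "grpcio"],
    "pandas": ["numpy", "python-dateutil"],
    "python-dateutil": ["six"],
    "grpcio": ["six"],
}


def resolve(specified, graph):
    """Return every package reachable from the specified dependencies."""
    resolved = set()
    stack = list(specified)
    while stack:
        name = stack.pop()
        if name not in resolved:
            resolved.add(name)
            stack.extend(graph.get(name, []))
    return resolved


def orphaned_by_removal(specified, removed, graph):
    """Packages that are no longer needed once `removed` is dropped."""
    before = resolve(specified, graph)
    after = resolve([name for name in specified if name != removed], graph)
    return before - after


orphans = orphaned_by_removal(["dagster", "pandas"], "dagster", GRAPH)
print(sorted(orphans))  # six is kept because pandas still needs it
```

&lt;p&gt;With only a flat list of installed packages, there is no &lt;code&gt;specified&lt;/code&gt; list to start from, so this calculation cannot be done.&lt;/p&gt;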

&lt;h3&gt;
  
  
  Major sections of &lt;code&gt;pyproject.toml&lt;/code&gt;:
&lt;/h3&gt;

&lt;p&gt;Below, I break down the sections of &lt;code&gt;pyproject.toml&lt;/code&gt; and what each one does. For even more detail, take a look at the &lt;a href="https://python-poetry.org/docs/pyproject/" rel="noopener noreferrer"&gt;poetry pyproject documentation&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;

&lt;span class="c"&gt;# Section 1&lt;/span&gt;
&lt;span class="nn"&gt;[tool.poetry]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"dagster-example-pipeline"&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1.0.0"&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
&lt;span class="py"&gt;authors&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Alex Service &amp;lt;aservice@miletwo.us&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first section defines &lt;em&gt;our&lt;/em&gt; python package. A couple of notable things happen automatically here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When packaging our source code, poetry will automatically search &lt;code&gt;src&lt;/code&gt; for a subdirectory with a matching name. This behavior can be overridden if desired

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Note&lt;/strong&gt;: &lt;code&gt;pyproject.toml&lt;/code&gt; expects hyphens for the name, but the directory itself should use underscores, e.g. &lt;code&gt;src/dagster_example_pipeline&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;poetry respects semantic versioning. If you wish to bump the version number, you can manually change it, or use the &lt;code&gt;poetry version&lt;/code&gt; command

&lt;ul&gt;
&lt;li&gt;e.g. &lt;code&gt;poetry version minor&lt;/code&gt; would change the version to &lt;code&gt;1.1.0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
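
&lt;p&gt;The two conventions above can be sketched in a few lines. This mirrors the behavior described here rather than poetry’s own implementation; the function names are hypothetical, and real &lt;code&gt;poetry version&lt;/code&gt; handles more cases (such as pre-releases) than this simplified version does:&lt;/p&gt;

```python
def package_dir(project_name: str, source_root: str = "src"):
    """Directory expected for a project name: hyphens become underscores."""
    return f"{source_root}/{project_name.replace('-', '_')}"


def bump_version(version: str, part: str):
    """Bump a plain MAJOR.MINOR.PATCH version, resetting the lower parts to zero."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


print(package_dir("dagster-example-pipeline"))
print(bump_version("1.0.0", "minor"))
```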

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;

&lt;span class="c"&gt;# Section 2&lt;/span&gt;
&lt;span class="nn"&gt;[tool.poetry.dependencies]&lt;/span&gt;
&lt;span class="py"&gt;python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"~3.9"&lt;/span&gt;
&lt;span class="py"&gt;pandas&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^1.3.2"&lt;/span&gt;
&lt;span class="py"&gt;google-cloud-storage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^1.42"&lt;/span&gt;
&lt;span class="py"&gt;dagster&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.13.19"&lt;/span&gt;
&lt;span class="py"&gt;dagster-gcp&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.13.19"&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The second section is where we include our specified dependencies. These are the packages we want at all times, both in production and during development. This section should only include the names of packages you explicitly want to define. &lt;strong&gt;Do not fill this with the output of &lt;code&gt;pip freeze&lt;/code&gt;!&lt;/strong&gt; poetry will resolve each package’s dependencies for us.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;

&lt;span class="c"&gt;# Section 3&lt;/span&gt;
&lt;span class="nn"&gt;[tool.poetry.dev-dependencies]&lt;/span&gt;
&lt;span class="py"&gt;dagit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.13.19"&lt;/span&gt;
&lt;span class="py"&gt;debugpy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^1.4.1"&lt;/span&gt;
&lt;span class="c"&gt;# jupyterlab = "^3.2.2"&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The third section specifies our dev-dependencies, which are packages we only want to install during development. &lt;code&gt;dagit&lt;/code&gt; is a good example because we already have an existing dagit deployment, but I want to be able to test in the UI locally. It doesn’t need to be deployed with my user code, so it can be included as a dev-dependency. For my workflow, I often include a few types of dev-dependencies&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Packages for Exploratory Data Analysis, e.g. jupyterlab, matplotlib&lt;/li&gt;
&lt;li&gt;Debugging packages. As a VSCode user, I find debugpy to be very helpful&lt;/li&gt;
&lt;li&gt;New packages I’m trialing to see if they solve my problems; if they do, I’ll “promote” them to become a regular dependency by moving them out of the dev-dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;

&lt;span class="c"&gt;# Section 4&lt;/span&gt;
&lt;span class="nn"&gt;[build-system]&lt;/span&gt;
&lt;span class="py"&gt;requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="py"&gt;["poetry-core&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;"]&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="py"&gt;build-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"poetry.core.masonry.api"&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The final section configures the python &lt;a href="https://packaging.python.org/en/latest/tutorials/packaging-projects/#creating-pyproject-toml" rel="noopener noreferrer"&gt;build system&lt;/a&gt; to use poetry instead of setuptools in accordance with &lt;a href="https://www.python.org/dev/peps/pep-0517/" rel="noopener noreferrer"&gt;PEP 517&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  poetry install
&lt;/h3&gt;




&lt;p&gt;&lt;strong&gt;TIP:&lt;/strong&gt; Before running the following commands, if you configure poetry to create the virtualenv inside of the project (via &lt;code&gt;poetry config virtualenvs.in-project true&lt;/code&gt;), then VSCode will automatically recognize the new environment and ask you to select it as your environment :) &lt;/p&gt;




&lt;p&gt;The command &lt;code&gt;poetry install&lt;/code&gt; does a few things&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Resolves the dependency tree (i.e. it resolves all sub-dependencies for our specified dependencies) and writes a lock file, marking each package as either “main” or “dev”&lt;/li&gt;
&lt;li&gt;Downloads, caches, &amp;amp; installs all of the dependencies and sub-dependencies from the previous step&lt;/li&gt;
&lt;li&gt;Adds our code as an editable package to the environment&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;poetry &lt;span class="nb"&gt;install
&lt;/span&gt;Updating dependencies
Resolving dependencies... &lt;span class="o"&gt;(&lt;/span&gt;9.5s&lt;span class="o"&gt;)&lt;/span&gt;

Writing lock file

Package operations: 124 installs, 0 updates, 0 removals

  • Installing protobuf &lt;span class="o"&gt;(&lt;/span&gt;3.19.4&lt;span class="o"&gt;)&lt;/span&gt;
  • Installing pyasn1 &lt;span class="o"&gt;(&lt;/span&gt;0.4.8&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# ... omitted output&lt;/span&gt;
  • Installing pytest &lt;span class="o"&gt;(&lt;/span&gt;6.2.5&lt;span class="o"&gt;)&lt;/span&gt;

Installing the current project: dagster-example-pipeline &lt;span class="o"&gt;(&lt;/span&gt;1.0.0&lt;span class="o"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Activate the Environment
&lt;/h3&gt;

&lt;p&gt;To actually &lt;em&gt;use&lt;/em&gt; all of these packages, it’s very simple:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;poetry shell
Spawning shell within /path/to/.venv
&lt;span class="nb"&gt;.&lt;/span&gt; /path/to/.venv/bin/activate

&lt;span class="o"&gt;(&lt;/span&gt;.venv&lt;span class="o"&gt;)&lt;/span&gt; bash-3.2&lt;span class="err"&gt;$&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Run Dagster Daemon and Dagit (without a container)
&lt;/h3&gt;

&lt;p&gt;We’ll explore containerization in a moment, but first let’s demonstrate that the environment is properly set up: &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="o"&gt;(&lt;/span&gt;.venv&lt;span class="o"&gt;)&lt;/span&gt; bash-3.2&lt;span class="nv"&gt;$ &lt;/span&gt;dagit
&lt;span class="nv"&gt;$ &lt;/span&gt;dagit
Using temporary directory /path/to/dagster-example-pipeline/tmp7wdyoxas &lt;span class="k"&gt;for &lt;/span&gt;storage. This will be removed when dagit exits.
To persist information across sessions, &lt;span class="nb"&gt;set &lt;/span&gt;the environment variable DAGSTER_HOME to a directory to use.

2022-02-15 16:07:08 &lt;span class="nt"&gt;-0500&lt;/span&gt; - dagit - INFO - Serving dagit on http://127.0.0.1:3000 &lt;span class="k"&gt;in &lt;/span&gt;process 14650


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Navigate to &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt; and try running the job, which simply grabs the top 5 items from &lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt; :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1srfrs662bof433o381p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1srfrs662bof433o381p.png" alt="Job result"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Containerizing Dagster
&lt;/h2&gt;

&lt;p&gt;A major selling point of containerization is how it blurs the lines between “works on my machine” and deploying to production. &lt;strong&gt;The fundamental problem is this&lt;/strong&gt;: there is a tradeoff between support for hot-loading code changes and support for CI/CD build processes. This problem isn’t dagster-specific—it exists almost everywhere when trying to containerize a dev environment&lt;/p&gt;

&lt;p&gt;In more detail, this problem might sound familiar:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I want my python code to be editable, so that code changes are loaded immediately and I have a faster development loop. So, &lt;strong&gt;I will mount my project&lt;/strong&gt; inside of a docker container with a configured python environment&lt;/li&gt;
&lt;li&gt;My CI/CD build process expects a container &lt;strong&gt;with my project copied inside of it&lt;/strong&gt;. I could use this container for local development, but &lt;strong&gt;&lt;em&gt;will have to rebuild and rerun the container with each code change&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It &lt;em&gt;sounds like&lt;/em&gt; we have to either write multiple dockerfiles, or we have to give up the ability to hot-load our code*&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;small&gt;*To be fair, this is a false dichotomy. Other approaches, such as &lt;a href="https://code.visualstudio.com/docs/remote/containers" rel="noopener noreferrer"&gt;VSCode devcontainers&lt;/a&gt;, do exist, but in my experience, they don’t quite “scratch the itch”&lt;/small&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Multi-Stage Dockerfile
&lt;/h2&gt;

&lt;p&gt;Using poetry and docker, &lt;strong&gt;we can use a multi-stage Dockerfile to support both needs&lt;/strong&gt; and speed up the development of dagster user-code environments! Here’s how:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Dockerfile with 3 stages: &lt;code&gt;dev&lt;/code&gt;, &lt;code&gt;build&lt;/code&gt;, and &lt;code&gt;deploy&lt;/code&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;dev&lt;/code&gt; installs all of the necessary dependencies using poetry and runs dagit when targeted; it only expects code to be volume-mounted &lt;em&gt;if the dev stage is targeted&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;build&lt;/code&gt; uninstalls dev dependencies, &lt;em&gt;copies our project into the container,&lt;/em&gt; and then builds a python package of our code, which gives us a standard python wheel file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deploy&lt;/code&gt; &lt;strong&gt;copies only the wheel file and installs it using pip&lt;/strong&gt; (no poetry, no volume mount, no mess)&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Create a docker-compose file that &lt;strong&gt;targets the &lt;code&gt;dev&lt;/code&gt; stage&lt;/strong&gt; of our Dockerfile and mounts our project as a volume in the container. This will be used for local development

&lt;ol&gt;
&lt;li&gt;Bonus: Use an external environment variable manager like &lt;a href="https://direnv.net/" rel="noopener noreferrer"&gt;direnv&lt;/a&gt; to centralize all project environment variables into a single &lt;code&gt;.envrc&lt;/code&gt; file and simply reference these variables in &lt;code&gt;docker-compose.yml&lt;/code&gt; &lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Let our CI/CD process run through all stages of the Dockerfile, resulting in a container ready to be deployed as a dagster user-code environment&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;Let’s dive into each of the three stages to understand what’s going on&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;small&gt;&lt;strong&gt;A quick note about deployments&lt;/strong&gt;: Elementl provides an example of &lt;a href="https://github.com/dagster-io/dagster/tree/0.14.1/examples/deploy_docker" rel="noopener noreferrer"&gt;deploying via docker&lt;/a&gt;, but even &lt;a href="https://docs.dagster.io/deployment/guides/docker#mounting-volumes" rel="noopener noreferrer"&gt;their documentation&lt;/a&gt; for it states that the user code container has to be restarted to reflect code changes&lt;/small&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dockerfile Stage 1: dev
&lt;/h3&gt;

&lt;p&gt;Here are the critical bits from the first stage:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;

&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; BASE_IMAGE=python:3.9.8-slim-buster&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"${BASE_IMAGE}"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The only exciting part above is that we label our first stage so it can be referenced later in the build stage&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; poetry.lock pyproject.toml ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;poetry &lt;span class="nb"&gt;install&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;poetry.lock&lt;/code&gt; and &lt;code&gt;pyproject.toml&lt;/code&gt;  are the &lt;em&gt;only&lt;/em&gt; files copied into the dev container, because it is expected that everything else will be mounted. As a result, the only reason to restart the dev container is if we make changes to our dependencies :)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"poetry install"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /usr/bin/dev_command.sh
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"poetry run dagit -h 0.0.0.0 -p 3000"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /usr/bin/dev_command.sh
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x /usr/bin/dev_command.sh
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["bash", "dev_command.sh"]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It might seem weird that &lt;code&gt;poetry install&lt;/code&gt; gets called a second time, but because &lt;code&gt;dev_command.sh&lt;/code&gt; is executed after our code is mounted, it’s necessary in order to add our code to the environment&lt;/p&gt;

&lt;p&gt;To use the newly-created dev environment, simply specify the &lt;code&gt;build&lt;/code&gt; and &lt;code&gt;image&lt;/code&gt; keys for a service in &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;

dagsterdev:
    build: 
      context: .
      dockerfile: Dockerfile
      target: dev
    image: dagster-example-pipeline-dev
    volumes:
      - ./:/usr/src/app


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a simple &lt;code&gt;docker compose up&lt;/code&gt;, the dev environment is ready to go!&lt;/p&gt;

&lt;h3&gt;
  
  
  Dockerfile Stage 2: build
&lt;/h3&gt;

&lt;p&gt;This stage is wonderfully simple&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;poetry &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-dev&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;build&lt;/code&gt; stage extends the &lt;code&gt;dev&lt;/code&gt; stage, meaning all installed packages are still present. Above, poetry searches for any dependencies labeled “dev” and removes them. Also, we finally copy the actual project into the container&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;poetry build &lt;span class="nt"&gt;--format&lt;/span&gt; wheel | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Built"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/^.*\s\(.*\.whl\)/\1/'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; package_name


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The magic happens! poetry builds a python wheel from our code. The wheel’s metadata declares only the non-dev dependencies, which pip will resolve when the wheel is installed. The rest of the line looks scary, but it’s just extracting and saving the filename of the wheel. For reference, the output of &lt;code&gt;poetry build&lt;/code&gt; looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;

$ poetry build --format wheel
Building dagster-example-pipeline (1.0.0)
  - Building wheel
  - Built dagster_example_pipeline-1.0.0-py3-none-any.whl


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
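
&lt;p&gt;If the &lt;code&gt;grep&lt;/code&gt; and &lt;code&gt;sed&lt;/code&gt; pipeline is hard to read, the same extraction can be sketched in python. This is only an equivalent illustration of what the shell one-liner does, using the sample output above; the function name is hypothetical and it is not part of the Dockerfile:&lt;/p&gt;

```python
import re

# Sample output of `poetry build --format wheel`, copied from the listing above.
BUILD_OUTPUT = """\
Building dagster-example-pipeline (1.0.0)
  - Building wheel
  - Built dagster_example_pipeline-1.0.0-py3-none-any.whl
"""


def wheel_filename(build_output: str):
    """Return the .whl filename from the 'Built' line, or None if there is none."""
    for line in build_output.splitlines():
        if "Built" in line:  # same filter as grep "Built"
            match = re.search(r"(\S+\.whl)\s*$", line)  # last token ending in .whl
            if match:
                return match.group(1)
    return None


print(wheel_filename(BUILD_OUTPUT))
```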
&lt;h3&gt;
  
  
  Dockerfile Stage 3: deploy
&lt;/h3&gt;

&lt;p&gt;Now that the code is packaged as a wheel, poetry’s no longer needed. In fact, nothing is needed outside of a fresh python environment, the wheel, and any configuration for dagster!&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; "${BASE_IMAGE}"&lt;/span&gt;
&lt;span class="c"&gt;# remember, BASE_IMAGE is just a python image&lt;/span&gt;
&lt;span class="c"&gt;# ... omitted some python setup. I'll be honest, not sure how much &lt;/span&gt;
&lt;span class="c"&gt;#     of this is actually needed :) ...&lt;/span&gt;

&lt;span class="c"&gt;# copy the directory with our wheel&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=build /usr/src/app/dist repo_package&lt;/span&gt;
&lt;span class="c"&gt;# copy the file containing our wheel filename&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=build /usr/src/app/package_name package_name&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; repo_package/&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;package_name&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; workspace.yaml workspace.yaml&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; job_configs job_configs&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And there we go! Everything from the previous stages is discarded except for the wheel that was just created. Once installed and configured, this final stage is ready to be deployed&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result: Faster Dev, Easier Deploys, &amp;amp; Cleaner Repositories
&lt;/h2&gt;

&lt;p&gt;In the end, I now have everything I wanted: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ability to develop &amp;amp; test jobs without constantly waiting for containers to build and spin up or down&lt;/li&gt;
&lt;li&gt;Containerization handled without cluttering up my project (and mental) workspace&lt;/li&gt;
&lt;li&gt;Package management that maintains a history of &lt;em&gt;specified, intended&lt;/em&gt; packages so I don’t have to consider, months later, whether the package I want to remove is a dependency of a dependency of a dependency of...&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Even if you don't need the repository, I hope you've found the technical discussion above useful for your projects. I'd love it if you could clone the repo and try it for yourself!&lt;/p&gt;

</description>
      <category>dagster</category>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
