Podcast.init

Digging Into Dagster: An Opinionated Open Source Framework For Data Orchestration

Sep 7 '20

Summary

Data applications are complex and continually evolving, often requiring collaboration across multiple teams. In order to keep everyone on the same page a high level abstraction is needed to facilitate a cross-cutting view of the data orchestration across integration, transformation, analytics, and machine learning. Dagster is an innovative new framework that leans on the power and flexibility of Python to provide an extensible interface to the complete lifecycle of data projects. In this episode Nick Schrock explains how he designed the Dagster project to allow for integration with the entire data ecosystem while providing an opinionated structure for connecting the different stages of computation. He also discusses how he is working to grow an open ecosystem around the Dagster project, and his thoughts on building a sustainable business on top of it without compromising the integrity of the community. This was a great conversation about playing the long game when building a business while providing a valuable utility to a complex problem domain.

Announcements

Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at pythonpodcast.com/datadog. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host as usual is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source data orchestrator for powering data engineering, analytics, and machine learning

Interview

Introductions
How did you get introduced to Python?
Can you start by describing what Dagster is and how it got started?
What are the most common difficulties that organizations face when working with data projects?
- How does Dagster help in addressing those challenges?
There are a number of workflow orchestration platforms, spanning a few generations of tooling. What do you see as the defining characteristics of the various options, and how does Dagster fit in that ecosystem?
What are the assumptions that you made at the start of building Dagster and how have they been challenged, updated, or invalidated over the past year of working with end users?
How are the internals of Dagster implemented?
- How has the design changed or evolved since you first began working on it?
For someone who is building on top of Dagster, what is their workflow from first steps through to production?
What are your guiding principles for desigining the user facing API?
What are the available extension points for Dagster?
What was your reason for implementing Dagster as a Python framework?
- With the benefit of hindsight, would you make the same decision today?
What are some of the most interesting, innovative, or unexpected ways that you have seen Dagster used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Dagster and working to grow its ecosystem?
When is Dagster the wrong choice?
As you continue to build Dagster, what is your vision for it and its ecosystem?
- What are the next steps that you are taking to achieve that vision?

Keep In Touch

@schrockn on Twitter
schrockn on GitHub
LinkedIn

Picks

Tobias
- Caddy web server
Nick
- Black code formatter

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at pythonpodcast.com/chat