I have been trying all sorts of orchestration tools for data pipelines and background jobs over the years (Prefect, Airflow, Dagster and others). I always stayed away from their commercial offerings. I stuck with OSS and always found it super easy to get a first version running with Docker Compose on a single VM. However, the benefit was always limited to the visualization and orchestration part of the jobs. I was always left wondering: OK, so now this is basically a cron job plus a UI. The real benefit of having scalable compute for beefy background jobs was missing.
When starting to build Dryft last year, I was again faced with the choice of selecting an orchestration tool for our scheduled jobs. I opted for Dagster again, as I liked the developer experience and the concepts around assets.
We've been running Dagster in production for about a year now. For most of that time, it AGAIN lived on a single VM, with all the needed services in a Docker Compose setup (including a Postgres DB). This was very quick to set up and got the job done. But as our data volume grew, we started hitting limits.
We scaled up the VM a few times (more RAM, more CPU), and eventually ended up with a beefy VM that was idle 95% of the time while barely able to run our biggest jobs.
It was time to fix this properly.
The Obvious Options (And Why They All Felt Wrong)
When you outgrow a single VM with Dagster, the paths that people usually seem to take are:
(Side note: Azure was pretty much a given since it is our main cloud provider.)
Dagster+: Dagster's managed offering. It would get the job done: we could use their hosted UI for the Dagster webserver and deploy their agent in AKS for the compute itself. The issue is that we want RBAC and SSO with Microsoft Entra ID (Azure AD), which means we'd need at least the Starter plan at $100/month (which gives a measly 3 users). I'd rather spend that money on raw compute.
Dagster OSS via the Helm chart on AKS: Full control, "production-ready," all the knobs you could want. But now we're running a full Kubernetes cluster. You need someone who understands K8s networking, RBAC, persistent volumes, node pools. And the biggest added complexity is, again, the RBAC and SSO setup.
Stick with Docker Compose, just bigger: Throw more CPU and RAM at the VM. But this doesn't solve the fundamental problem. We'd be paying for that RAM and CPU 24/7 even when jobs aren't running. I was briefly thinking about some scheduled scaling or starting/stopping of the VM but that felt hacky and fragile. What if people want to start a job ad-hoc?
None of these fit what we actually needed: managed infrastructure for the boring parts (Dagster web UI, Entra ID SSO), real compute power for jobs when they need it, and not paying for idle resources.
What We Actually Needed
Overall this is what I wanted to achieve:
- Scalable compute: Being able to spin up almost arbitrary compute resources for jobs that really need it, and scaling back to zero when idle.
- Easy deploys: To deploy code changes to the jobs, I want to use a simple GitHub Actions workflow. No manual kubectl or Helm commands.
- Managed where possible: I don't want to run and maintain a full Kubernetes cluster for parts where I don't need it.
- SSO: All developers have an Entra ID account. I want to give only the devs access to Dagster, with SSO.
- Stay on Azure: All our stuff is there; like it or not, I will stay there for now.
- Reasonable cost: I am fine spending money on compute but only when we actually have a benefit from it.
The Architecture We Landed On
After a few weeks of experimentation, we settled on a hybrid. I haven't really seen this setup anywhere, so I'd be curious if others have tried something similar or have opinions on it. Does it make sense? Is it crazy?
Azure Container Apps runs the Dagster control plane: webserver, daemon, and code location. These are small, stable, long-running processes that don't need a lot of compute (a rough Terraform sketch follows the list below). ACA gives us:
- Consumption-based pricing
- Built-in authentication via EasyAuth (Microsoft Entra ID in front of the app, zero code changes)
- Simple deploys from GitHub Actions
- Managed TLS certificates and termination out of the box
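To give a feel for it, here's a rough Terraform sketch of the webserver app. The resource names, registry variable, and image tag are placeholders, not our exact code; the daemon and code location look very similar, minus the external ingress.

resource "azurerm_container_app" "dagster_webserver" {
  name                         = "dagster-webserver"
  container_app_environment_id = azurerm_container_app_environment.dagster.id
  resource_group_name          = var.resource_group_name
  revision_mode                = "Single"

  identity {
    type = "SystemAssigned"
  }

  template {
    # The UI stays always on; it's tiny anyway (see cost table below).
    min_replicas = 1
    max_replicas = 1

    container {
      name   = "webserver"
      image  = "${var.container_registry}/dagster-webserver:latest"
      cpu    = 0.25
      memory = "0.5Gi"
      args   = ["dagster-webserver", "-h", "0.0.0.0", "-p", "3000"]
    }
  }

  ingress {
    external_enabled = true
    target_port      = 3000
    transport        = "http"

    traffic_weight {
      percentage      = 100
      latest_revision = true
    }
  }
}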
Azure Kubernetes Service runs only the job pods. A dedicated cluster with:
- A tiny system node pool (always on, ~€15/month)
- A job node pool that scales to zero when idle
- Bigger VMs available when jobs need them
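The job pool is a plain user node pool with the cluster autoscaler allowed to go down to zero nodes. A minimal sketch (the max count and taint are illustrative, not gospel; on azurerm 3.x the attribute is called enable_auto_scaling instead):

resource "azurerm_kubernetes_cluster_node_pool" "jobs" {
  name                  = "jobs"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.dagster.id
  mode                  = "User"
  vm_size               = "Standard_D4s_v3"

  # The cluster autoscaler scales this pool between 0 and 4 nodes.
  # With no queued Dagster runs it sits at zero and costs nothing.
  auto_scaling_enabled = true
  min_count            = 0
  max_count            = 4

  # Optional: keep other workloads off these nodes. Job pods then need
  # a matching toleration in the run launcher's pod config.
  node_taints = ["workload=dagster-jobs:NoSchedule"]
}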
The key difference to other approaches I've seen: the webserver and daemon don't need Kubernetes flexibility. They're boring. I let them be boring on managed ACA and save the K8s complexity for where it actually matters: scalable compute for jobs with different needs.
User (Browser)
│
▼
┌────────────┐
│ EasyAuth │ (Entra ID SSO)
└─────┬──────┘
│
┌─────────────▼─────────────────────────────────────────────────────────────────────────────────┐
│ Azure Virtual Network │
│ │
│ ┌──────────────────────────────────────────────────┐ ┌────────────────────────────────┐ │
│ │ Azure Container Apps Environment │ │ Azure Kubernetes Service │ │
│ │ │ │ (Private Cluster) │ │
│ │ ┌───────────┐ gRPC ┌───────────────┐ │ │ │ │
│ │ │ Webserver ├───────►│ Code Location │ │ │ ┌─────────────┐ ┌───────────┐ │ │
│ │ └─────┬─────┘ └───────────────┘ │ │ │ System Pool │ │ Job Pool │ │ │
│ │ │ │ │ │ (Always On) │ │ (0 → N) │ │ │
│ │ │ ┌───────────────┐ K8s API │ │ └─────────────┘ └─────┬─────┘ │ │
│ │ │ │ Daemon ├──────────┼─────┼────────────────────────┘ │ │
│ │ │ └───────┬───────┘ │ │ │ │
│ └────────┼──────────────────────┼──────────────────┘ └────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ Azure Database for PostgreSQL (Flexible Server, Private Access) │ │
│ └─────────────────────────────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
The Non-Obvious Parts
Getting this to work turned out a bit trickier than it initially sounded. Here are a few gotchas I stumbled upon along the way.
The azurerm Provider Doesn't Support EasyAuth for Container Apps :(
Note: If it does, please let me know how!!
We wanted Microsoft SSO in front of Dagster. The obvious approach would be to add authentication middleware to the Dagster webserver. But then we'd need a custom Docker image, handle token validation, manage sessions—a whole thing.
Azure Container Apps has a feature called EasyAuth that puts authentication in front of your app at the infrastructure level. Your app never sees unauthenticated requests. But the Terraform azurerm provider doesn't support configuring this for Container Apps.
The workaround: use the azapi provider to hit the Azure Resource Manager API directly.
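If you haven't used azapi before: it just gets declared next to azurerm (version pins are up to you; the object-style body below assumes a reasonably recent azapi version).

terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    # azapi talks to the ARM REST API directly, for resources and settings
    # that azurerm doesn't cover yet (like Container Apps authConfigs).
    azapi = {
      source = "Azure/azapi"
    }
  }
}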
resource "azapi_resource" "dagster_webserver_auth" {
type = "Microsoft.App/containerApps/authConfigs@2024-03-01"
name = "current"
parent_id = azurerm_container_app.dagster_webserver.id
body = {
properties = {
platform = { enabled = true }
globalValidation = {
unauthenticatedClientAction = "RedirectToLoginPage"
redirectToProvider = "azureactivedirectory"
}
identityProviders = {
azureActiveDirectory = {
enabled = true
registration = {
clientId = azuread_application.dagster.client_id
clientSecretSettingName = "microsoft-provider-client-secret"
openIdIssuer = "https://login.microsoftonline.com/${tenant_id}/v2.0"
}
}
}
}
}
}
Now anyone hitting the Dagster URL gets redirected to Microsoft login. Only people in our Entra ID tenant can access it. Zero changes to Dagster itself.
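One detail the authConfig glosses over: clientSecretSettingName refers to a secret that must already exist on the container app. A sketch of that missing piece, assuming the app registration's secret is managed via azuread_application_password (attribute names differ slightly between azuread provider versions):

# App registration client secret used by the EasyAuth login flow
resource "azuread_application_password" "dagster_sso" {
  application_id = azuread_application.dagster.id
}

# Inside the webserver's azurerm_container_app resource:
secret {
  name  = "microsoft-provider-client-secret"
  value = azuread_application_password.dagster_sso.value
}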
Container Apps Talking to a Private AKS Cluster
Our AKS cluster is private. The API server isn't exposed to the internet. This is good for security but creates a problem: how does the Dagster daemon (running in Container Apps) submit jobs to Kubernetes?
The naive approach would be to create a Kubernetes service account, extract its token, and store it somewhere the Container Apps can access. This works but creates a long-lived credential that never expires. If it leaks, someone can create pods in your cluster forever.
The better approach: Azure Workload Identity.
AKS can be configured to use Azure RBAC for authorization. This means you can grant an Azure managed identity permission to perform Kubernetes operations. Container Apps already run with a managed identity. Connect the dots:
- Enable Azure RBAC on the AKS cluster
- Grant the Container Apps' managed identity the "Azure Kubernetes Service RBAC Writer" role on the cluster
- Configure Dagster's K8s run launcher to authenticate using Azure Identity
Now the daemon authenticates to Kubernetes using short-lived Entra ID tokens, automatically rotated, no static credentials to leak.
resource "azurerm_kubernetes_cluster" "dagster" {
# ... other config ...
azure_active_directory_role_based_access_control {
azure_rbac_enabled = true
tenant_id = data.azuread_client_config.current.tenant_id
}
}
resource "azurerm_role_assignment" "aca_aks_rbac" {
scope = azurerm_kubernetes_cluster.dagster.id
role_definition_name = "Azure Kubernetes Service RBAC Writer"
principal_id = var.container_apps_managed_identity_principal_id
}
Letting Job Pods Access Azure Resources
When Dagster runs a job, it spins up a pod in AKS. That pod needs to access our application's PostgreSQL database, Redis cache, and Azure Blob Storage. We use managed identity everywhere—no connection strings with passwords.
But a Kubernetes pod doesn't automatically have an Azure identity. You need to set up Workload Identity Federation: tell Azure to trust tokens issued by your AKS cluster's OIDC provider for a specific Kubernetes service account.
resource "azurerm_federated_identity_credential" "dagster_jobs" {
name = "dagster-jobs-aks"
resource_group_name = var.resource_group_name
parent_id = var.app_managed_identity_id
issuer = azurerm_kubernetes_cluster.dagster.oidc_issuer_url
subject = "system:serviceaccount:dagster-jobs:dagster-runner"
audience = ["api://AzureADTokenExchange"]
}
Now any pod running as the dagster-runner service account in the dagster-jobs namespace can authenticate as our application's managed identity. It can connect to PostgreSQL with Entra authentication, access Redis, read from Blob Storage—all without any secrets in environment variables.
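The Kubernetes side of that contract is small: a namespace plus a service account carrying the workload identity annotation. A sketch using the Terraform kubernetes provider, assuming it's configured against the cluster and that var.app_managed_identity_client_id holds the client ID of the same managed identity referenced above:

resource "kubernetes_namespace" "dagster_jobs" {
  metadata {
    name = "dagster-jobs"
  }
}

resource "kubernetes_service_account" "dagster_runner" {
  metadata {
    name      = "dagster-runner"
    namespace = kubernetes_namespace.dagster_jobs.metadata[0].name

    # Tells the workload identity webhook which managed identity to
    # exchange the projected service account token for.
    annotations = {
      "azure.workload.identity/client-id" = var.app_managed_identity_client_id
    }
  }
}

The run pods additionally need the azure.workload.identity/use: "true" label for the token to get injected; with Dagster's Kubernetes run launcher that can be set via the labels it applies to run pods.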
Make Sure to Allow Proper Ports for gRPC Traffic!
Dagster's code location serves definitions over gRPC. When running in Kubernetes, this just works—pods talk to each other directly. In Container Apps, you need to be explicit.
Container Apps defaults to HTTP ingress. For gRPC (which runs over HTTP/2 but isn't quite the same), you need TCP transport with an explicit port:
ingress {
  external_enabled = false  # Internal only
  target_port      = 4000
  exposed_port     = 4000   # Required for TCP
  transport        = "tcp"  # Not "http"

  traffic_weight {
    percentage      = 100
    latest_revision = true
  }
}
Without transport = "tcp" and exposed_port, the webserver can't connect to the code location and you get cryptic gRPC errors.
The Cost Breakdown
Here's what we're actually paying (Western Europe region, prices approximate):
| Component | Spec | Monthly Cost |
|---|---|---|
| ACA Webserver | 0.25 vCPU, 0.5 GB, always-on | ~€8 |
| ACA Daemon | 0.25 vCPU, 0.5 GB, always-on | ~€8 |
| ACA Code Location | 0.5 vCPU, 1 GB, always-on | ~€15 |
| AKS System Node | B2als_v2 (2 vCPU, 4 GB) | ~€18 |
| AKS Job Pool | D4s_v3, scale-to-zero | ~€0.15/hour when running |
| PostgreSQL Flexible | B1ms, 32 GB storage | ~€13 |
| Total baseline | | ~€62/month |
The job pool is the variable part. On a typical day with moderate pipeline activity, we might run 2-3 hours of job node time. Heavy processing days might hit 8-10 hours. Call it €30-50/month on average for job compute.
Total: roughly €90-110/month for a production Dagster deployment with proper isolation, SSO, autoscaling, and no shared resources between the UI and jobs.
Is This Right for You?
Honestly, you tell me. I'd be very curious to hear!
Imo this can make sense if:
- You're already on Azure
- You want SSO without building auth
- You're comfortable with Terraform but don't want to become a K8s admin
- You want scale-to-zero to save costs
It's probably not for you if:
- You need Dagster Cloud features (branch deployments, built-in alerting, the Insights product)
- You're not on Azure (duh, but I guess you could run a similar setup on AWS or GCP)
- You already use Kubernetes at scale (just use the Helm chart)
We've actually been super happy with this setup so far. The only thing I'd consider long term is moving everything to Kubernetes, but for now that seems unnecessary. I'd be super curious what you think of this setup. Please let me know!
Wanna chat about interesting topics in infra, DevOps, or AI agents in production? Leave a comment below or reach out via GitHub or LinkedIn:
- GitHub: https://github.com/chryztoph
- LinkedIn: https://www.linkedin.com/in/moserc/