AWS Glue or Airflow? You're probably paying for both to do one job

#python #data #ai

Glue or Airflow? You're probably paying for both to do one job.

It's the wrong question, and the wrong question quietly doubles your bill. Every couple of months someone asks me whether they should move their pipeline from Airflow to Glue, or the reverse, and the answer is almost always "neither, because you've misunderstood what each one is for." So let's fix that first, because once you see it, the cost mistakes become obvious.

Two different jobs that get confused for one

Picture a restaurant kitchen. There's a head chef calling out the order of dishes — appetizers first, mains when table four is ready, dessert last. And there are the line cooks actually chopping, searing, and plating. The chef coordinates. The cooks do the work. They are not interchangeable, and you wouldn't ask "should I hire a chef or a line cook?" You need the right amount of each.

That's Airflow and Glue.

Airflow is the head chef. It's an orchestrator. It decides what runs, in what order, and when, and then it waits. It does not move your data. It triggers a task, watches whether it succeeded, and triggers the next one. An Airflow DAG ("directed acyclic graph", a fancy term for "a set of steps with dependencies and no loops") looks like this:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract():  ...
def transform(): ...
def load():     ...

with DAG("daily_sales", start_date=datetime(2026, 1, 1),
         schedule="0 2 * * *", catchup=False) as dag:

    e = PythonOperator(task_id="extract",   python_callable=extract)
    t = PythonOperator(task_id="transform", python_callable=transform)
    l = PythonOperator(task_id="load",      python_callable=load)

    e >> t >> l     # this line is the whole point: order and dependency

Read that last line out loud: extract, then transform, then load. That >> is Airflow's entire reason to exist — managing order and dependencies and retries and schedules. Notice it doesn't say anything about how the data gets transformed. That's not its job.
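If it helps to see the idea stripped of Airflow entirely, here is a minimal sketch using only Python's standard library. It is not how Airflow is implemented; it just shows what "order and dependency" means. The `dependencies` map and the `run_order` helper are names I made up for illustration.

```python
from graphlib import TopologicalSorter, CycleError

# Each task maps to the set of tasks that must finish before it can start.
# This mirrors `e >> t >> l` from the DAG above (hypothetical task names).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

def run_order(deps):
    """Return one valid execution order for a dependency map.

    Raises graphlib.CycleError if the steps depend on each other in a loop,
    which is exactly what the "acyclic" in DAG rules out.
    """
    return list(TopologicalSorter(deps).static_order())

if __name__ == "__main__":
    print(run_order(dependencies))
```

An orchestrator adds scheduling, retries, and monitoring on top of this ordering. None of that involves touching the data itself.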

Glue is the line cook. It's managed Spark — actual compute that lifts and reshapes data. A Glue job does the chopping:

# awsglue is provided by the Glue runtime; it is not a regular pip install
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# read raw data, transform it with Spark, write it back — this MOVES data
df = glueContext.create_dynamic_frame.from_catalog(
        database="raw", table_name="sales").toDF()

clean = (df.filter(df.amount > 0)
           .dropDuplicates(["order_id"])
           .groupBy("region").sum("amount"))

clean.write.mode("overwrite").parquet("s3://warehouse/sales_by_region/")

This code reads gigabytes, filters, dedupes, aggregates, and writes Parquet. That's compute. That's the work.

And here's the thing that resolves the whole "versus": you run Glue jobs from inside Airflow. They're not competitors on the same shelf. The chef tells the cook when to start.

from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

run_glue = GlueJobOperator(task_id="transform_sales", job_name="clean_sales_job")

So where does the bill quietly double?

Now that the roles are clear, the two classic ways teams overpay are easy to see.

Mistake 1: a full orchestrator to babysit three cron jobs. Someone stands up Airflow — which means a scheduler, a webserver, a metadata database, all running 24/7 — to coordinate three independent daily jobs that have no real dependencies between them. Airflow is superb when you have a tangled DAG: forty tasks, branching, backfills, retries, SLAs. It is wild overkill for "run these three scripts every morning." That's a cron line:

# three independent jobs, no dependencies — this is the whole orchestrator you need
0 2 * * *  python extract_sales.py
0 3 * * *  python extract_users.py
0 4 * * *  python build_report.py

If your "DAG" is a straight line with no branching and the failure handling is "email me," you're paying the operational cost of a tool built for a problem you don't have.

Mistake 2: a Spark cluster to transform five gigabytes. This is the more expensive one. Someone spins up Glue — a distributed Spark cluster, billed per DPU-hour — to process a few gigabytes that pandas on a single modest box would crush in under a minute. Spark earns its cost when data genuinely doesn't fit on one machine and needs to be processed in parallel across a cluster. Below that threshold, you're paying cluster prices and cluster cold-start latency to do laptop work.

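# 5 GB that fits in memory? You don't need a cluster.
# Reading s3:// paths with pandas needs the s3fs package installed.
import pandas as pd

df = pd.read_parquet("s3://raw/sales/")
clean = (df[df.amount > 0]
           .drop_duplicates("order_id")
           # as_index=False keeps a DataFrame; a plain Series has no to_parquet()
           .groupby("region", as_index=False)["amount"].sum())
clean.to_parquet("s3://warehouse/sales_by_region/")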

Same result as the Glue job above, no cluster, no DPU-hours, no cold start.
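To put rough numbers on "billed per DPU-hour," here is a back-of-the-envelope sketch. The rate of 0.44 USD per DPU-hour and the one-minute billing minimum are assumptions on my part, so check the current pricing for your region and Glue version before trusting the output. The function names and the example workload are hypothetical.

```python
# Assumed values: verify against current AWS Glue pricing for your region.
ASSUMED_RATE_PER_DPU_HOUR = 0.44   # USD
ASSUMED_MIN_BILLED_MINUTES = 1.0   # minimum billed duration per run

def glue_job_cost(dpus, runtime_minutes,
                  rate_per_dpu_hour=ASSUMED_RATE_PER_DPU_HOUR,
                  min_billed_minutes=ASSUMED_MIN_BILLED_MINUTES):
    """Estimate the cost in USD of a single Glue job run."""
    billed_minutes = max(runtime_minutes, min_billed_minutes)
    return dpus * (billed_minutes / 60.0) * rate_per_dpu_hour

def monthly_cost(dpus, runtime_minutes, runs_per_day, days=30):
    """Estimate a month of runs at a fixed size and runtime."""
    return glue_job_cost(dpus, runtime_minutes) * runs_per_day * days

if __name__ == "__main__":
    # Hypothetical workload: 10 DPUs, 6 minutes per run, hourly schedule.
    print(f"per run:   ${glue_job_cost(10, 6):.2f}")
    print(f"per month: ${monthly_cost(10, 6, runs_per_day=24):.2f}")
```

Plug in your own DPU count, runtime, and schedule, then compare the result against what a single small instance running the pandas version would cost. Runtime here includes the time the cluster spends on work a laptop could do.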

The actual decision

Stop asking "Glue or Airflow." Ask two separate questions, because they're answering two separate needs:

→ Do I have real orchestration complexity? Branching, dependencies, backfills, retries across many tasks, schedules that interact. Yes → you want an orchestrator (Airflow, or a lighter one like Prefect, or your cloud's native scheduler). No, it's a few independent jobs → cron or a managed schedule is plenty.

→ How much data is actually moving in the transform? Tens of GB or more, or it genuinely needs parallelism → managed Spark like Glue earns its keep. A few GB that fits in memory → pandas, DuckDB, or plain SQL in your warehouse will be faster and cheaper, with none of the cluster overhead.

These are independent dials. A real pipeline might be: a managed schedule (no Airflow) triggering a small Python job (no Glue), because it's three steps and four gigabytes. Another might be: full Airflow orchestrating a dozen Glue jobs, because it's a forty-task DAG over terabytes. Both are correct. The expensive mistake is reaching for the heavyweight version of either dial when your workload only turned one of them up.
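The two dials can be written down as a tiny helper. This is a sketch of the reasoning above, not a rule to deploy: the 20 GB cut-off is my own placeholder for "tens of GB," and the `Workload` fields and `recommend` function are hypothetical names.

```python
from dataclasses import dataclass

# Placeholder for "tens of GB or more"; tune it to your own hardware.
ASSUMED_SPARK_THRESHOLD_GB = 20.0

@dataclass
class Workload:
    has_dependencies: bool          # branching, backfills, interacting schedules
    data_gb: float                  # data moved by the transform
    needs_parallelism: bool = False

def recommend(w):
    """Return (scheduling, compute), each dial decided on its own."""
    scheduling = ("orchestrator" if w.has_dependencies
                  else "cron or managed schedule")
    heavy = w.data_gb >= ASSUMED_SPARK_THRESHOLD_GB or w.needs_parallelism
    compute = "managed Spark" if heavy else "single node (pandas, DuckDB, SQL)"
    return scheduling, compute

if __name__ == "__main__":
    # Three steps and four gigabytes.
    print(recommend(Workload(has_dependencies=False, data_gb=4)))
    # A forty-task DAG over terabytes.
    print(recommend(Workload(has_dependencies=True, data_gb=5000)))
```

Note that neither answer feeds into the other. A tangled DAG over small data still gets an orchestrator with single-node compute, and a single huge transform still gets Spark with a plain schedule.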

Seventeen years in, the single most common thing I see on a cloud bill isn't a slow pipeline. It's a stack architected for a scale the company hasn't reached yet — Spark clusters idling, orchestrators babysitting cron jobs, all of it provisioned for the data volume someone hopes to have in two years.

What's running in your stack right now that's sized for a problem you don't actually have yet? Go look at your least-utilized component first — that's usually where the answer is hiding.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. Right-sizing over-built data stacks is a big part of what I do. If this sounds like yours, that work lives at vf-insights.com.
