DEV Community: Fazalul Haque

Airflow 3.x Architecture Explained: A Practical Guide for Data Engineers

Fazalul Haque — Mon, 22 Jun 2026 12:53:58 +0000

What Is Airflow?

Let me be honest about how I first encountered Airflow. My team had a bunch of Python and SQL scripts running on cron jobs. One script would download data, another would load it into a staging database, a third one would transform it, and a fourth one would ingest it into our data warehouse. The scripts were chained together with some fragile shell glue. When things worked, nobody noticed. When things broke — and they broke often — figuring out where it broke and why was a nightmare.

Someone on the team said "we should use Airflow." I nodded like I understood, then spent the next three evenings reading documentation.

Here's what I wish someone had told me then: Airflow is a platform for authoring, scheduling, and monitoring workflows — workflows that you define as Python code. That's it. No magic. You write a Python file describing what tasks to run and in what order, and Airflow takes care of the scheduling, retries, logging, and visibility.

The thing that took me a while to internalize is that Airflow is not a data processing engine. It doesn't move your data around. It orchestrates the things that do. It's the conductor, not the orchestra.

The DAG — Everything Starts Here

Before diving into the architecture, you need to understand what a DAG is, because the whole system revolves around it.

DAG stands for Directed Acyclic Graph. Sounds fancy. In practice it just means: a set of tasks with dependencies between them, where you can't have circular dependencies (task A depends on task B depends on task A — that would be a cycle, and Airflow won't allow it).

A simple DAG might look like this:

extract_data → transform_data → load_to_warehouse

You define this in a Python file. Each step is a Task – a single unit of work, like running a Python function, executing a SQL query, or calling an external API. The arrows between them are dependencies. Airflow reads your Python file, figures out the task ordering, and then knows how and when to execute things.
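To make the "directed acyclic" part concrete, here is a minimal sketch in plain Python (standard library only, not Airflow's own API). The pipeline dictionary and the execution_order helper are names I made up for illustration; the ordering logic comes from the standard graphlib module.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
pipeline = {
    "extract_data": set(),
    "transform_data": {"extract_data"},
    "load_to_warehouse": {"transform_data"},
}


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one valid run order, or raise ValueError if there is a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # args[1] holds the nodes that form the detected cycle.
        raise ValueError(f"not a DAG, cycle found: {exc.args[1]}") from exc


print(execution_order(pipeline))
# → ['extract_data', 'transform_data', 'load_to_warehouse']
```

Passing something like {"a": {"b"}, "b": {"a"}} raises ValueError, which is the same kind of rejection Airflow applies to a DAG file with circular dependencies.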

In Airflow 3.x, DAG files still live in a dags/ folder, but there have been some meaningful changes to how they're parsed and distributed — we'll get to that when we talk about the DAG Processor.

The Architecture: A Bird's Eye View

Before we look at each component individually, it helps to have a high-level view of how they fit together.

Every component reads from or writes to the Metadata Database in some way. That database is the single source of truth for the entire system. Let's now go through each piece.

Why Understanding the Architecture Matters

If you're running Airflow locally, many of these components may live on the same machine and feel invisible.

So why spend time learning the architecture?

Because most real-world Airflow problems come down to understanding which component is responsible for what.

For example:

DAG not showing up in the UI? Look at DAG parsing.
Tasks stuck in queued state? Check the executor and workers.
No new tasks are starting? Investigate the scheduler.
UI errors or failed API requests? Check the API Server.
Everything seems broken? The metadata database is often the first place to look.

You don't need to memorize every component. But having a mental model of how they fit together makes troubleshooting much easier as you move beyond simple local deployments.

The Metadata Database

I'm starting here because once you understand this, everything else makes sense.

The metadata database is a relational database, typically Postgres in production (SQLite is fine for local development, but please don't use it beyond that). It stores:

All your DAG definitions and their current state
Every DAG Run — every time a DAG has executed or is scheduled to execute
Every Task Instance — every individual task run, its state (queued, running, success, failed), pointers to logs, and timing
User information, connections, variables, and more

Every other component in Airflow depends on this database. The Scheduler reads from it to decide what to run. Task results from Workers are recorded in it when tasks finish. The API Server and UI read from it to show you the state of your DAGs. The database is the glue that holds the distributed system together.

One practical implication: if your database is slow or down, your entire Airflow cluster is effectively crippled. Invest in your database setup.

The Scheduler

The Scheduler is the brain of the operation. It's a long-running process that continuously checks the metadata database, figures out which tasks are ready to run, and sends them off to be executed.

Here's a rough version of what it does in a loop:

Check all DAG schedules — is there a DAG that's supposed to have run in the last interval and hasn't started yet?
Create DAG Runs and Task Instances in the metadata database for anything that needs to run
Look at all Task Instances that are in a "scheduled" state and evaluate their dependencies — are the upstream tasks done?
Move eligible tasks to "queued" state and submit them to the executor

The Scheduler doesn't run tasks itself. It decides that tasks should run and tells the executor to handle the actual execution. That's an important distinction.
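The loop above can be sketched in a few lines of plain Python. This is a toy model under my own assumptions (the TaskInstance dataclass and scheduler_pass function are hypothetical and far simpler than Airflow's real scheduler), but it shows the key idea: the scheduler only changes state, it never runs the task.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    task_id: str
    upstream: set[str] = field(default_factory=set)
    state: str = "scheduled"


def scheduler_pass(task_instances: dict[str, TaskInstance]) -> list[str]:
    """Queue every scheduled task whose upstream tasks have all succeeded."""
    queued = []
    for ti in task_instances.values():
        if ti.state != "scheduled":
            continue
        if all(task_instances[up].state == "success" for up in ti.upstream):
            ti.state = "queued"  # handed to the executor, not run here
            queued.append(ti.task_id)
    return queued


run = {
    "extract": TaskInstance("extract"),
    "transform": TaskInstance("transform", {"extract"}),
    "load": TaskInstance("load", {"transform"}),
}

first = scheduler_pass(run)        # only "extract" has no upstream tasks
run["extract"].state = "success"   # pretend a worker finished it
second = scheduler_pass(run)       # now "transform" becomes eligible
print(first, second)
# → ['extract'] ['transform']
```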

In Airflow 3.x, the Scheduler has become more reliable and scalable. As in 2.x, you can run multiple schedulers (an active-active setup) for high availability: if one scheduler process dies, another picks up the work.

One gotcha with the Scheduler: it needs up-to-date DAG metadata in order to schedule work. In Airflow 3.x, DAG parsing responsibilities can be delegated to a dedicated DAG Processor component. More on that below.

The DAG Processor (New in Airflow 3.x)

This is one of the more significant architectural changes in Airflow 3.x, and one that beginners often don't realize exists.

In older versions of Airflow, the Scheduler also handled parsing your DAG files. This was problematic. If a DAG file had a slow import or a bug that caused it to hang, it could slow down the Scheduler itself. Not great.

In Airflow 3.x, DAG parsing has been extracted into its own DAG Processor component. It runs as a separate process, continuously scanning your dags/ directory, importing the Python files, and extracting DAG metadata (schedule, tasks, dependencies) into the metadata database.

The Scheduler then just reads from the database — it doesn't need to parse Python files directly anymore. Cleaner separation of concerns.

To be precise, a standalone DAG Processor already existed as an opt-in in later 2.x releases, where the scheduler otherwise spawned subprocesses to parse DAGs for you. In 3.x it is a separate component in every deployment: you start it with airflow dag-processor (airflow standalone launches it for you locally). The important idea is that user DAG code stays out of the core scheduler process, and you can scale DAG parsing independently when you need to.

The API Server

Airflow 3.x introduced a proper standalone API Server. In earlier versions, the REST API was bundled into the Webserver.

The API Server exposes Airflow's REST API — it's what the UI uses to fetch DAG information, trigger runs, clear tasks, etc. It's also what you'd use if you want to trigger DAG runs programmatically from another system. CI/CD pipelines kicking off DAGs, external services triggering workflows — all of that goes through the API Server.
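As a rough illustration of triggering a DAG run from another system, here is a sketch that builds the HTTP request with the standard library only. The /api/v2 path, the bearer-token header, and the logical_date field reflect my understanding of the 3.x REST API, so check them against the API reference for your version. The base URL, DAG ID, and token are placeholders, and build_trigger_request is a hypothetical helper.

```python
import json
import urllib.request


def build_trigger_request(
    base_url: str, dag_id: str, token: str, conf: dict
) -> urllib.request.Request:
    """Build (but do not send) a request that asks Airflow to start a DAG run."""
    body = {"logical_date": None, "conf": conf}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v2/dags/{dag_id}/dagRuns",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_trigger_request(
    base_url="http://localhost:8080",   # assumed local API Server address
    dag_id="daily_sales_pipeline",      # hypothetical DAG ID
    token="<access-token>",             # placeholder; obtain one from your auth setup
    conf={"source": "ci-pipeline"},
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example safe to run without a live Airflow instance, and makes the URL and payload easy to inspect.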

What changed from 2.x is that the REST API is now a first-class interface rather than a secondary feature attached to the Webserver. This separation gives Airflow a cleaner architecture and makes it easier to evolve the UI and API independently over time.

The Webserver (Airflow UI)

In Airflow 3.x, the traditional standalone webserver role has largely been replaced by a UI that communicates with the API Server.

You use Airflow UI to:

See all your DAGs and their last run status
View the graph of a DAG — which tasks ran, which failed, which are running right now
Look at logs for individual task runs
Manually trigger DAG runs or clear failed tasks to retry them
Manage connections and variables

In Airflow 3.x, the UI has been significantly revamped. It's faster, cleaner, and the new Grid View is much more useful than the old tree view for understanding the history of a DAG across many runs.

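The Airflow UI is stateless with respect to actual scheduling. It doesn't schedule anything. It's purely a read/write interface to the metadata database (via the API Server). If the UI is unavailable, your DAGs keep running: your visibility goes away, but the work continues. An API Server outage is more serious in 3.x, because running tasks also report their state through the API Server rather than writing to the database directly.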

The Executor

The Executor is one of the most misunderstood Airflow concepts.

The Executor is not a separate process you run — it's a component within the Scheduler that determines how tasks actually get executed. Think of it as the strategy the Scheduler uses for dispatching work, not something you run independently.

There are a few executor types:

LocalExecutor — Tasks run as subprocesses on the same machine as the Scheduler. Simple, works well for small setups. Not suitable for real scale because everything shares one machine.

CeleryExecutor — Tasks are sent to a Celery task queue (backed by Redis or RabbitMQ), and picked up by Worker processes that can run on different machines. This is the classic horizontally scalable setup.

KubernetesExecutor — Each task spins up a new Kubernetes Pod to run in, and the pod is destroyed when the task finishes. Very clean isolation, great for containerized environments. More overhead per task but excellent for bursty workloads.

Some deployments use hybrid execution models that combine Celery workers with Kubernetes-based execution.

The choice of executor has massive implications for how you deploy and scale Airflow. Start with LocalExecutor if you're just learning. Move to CeleryExecutor or KubernetesExecutor as you grow.
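One way to picture "the executor is a strategy, not a service" is the sketch below. It is a toy model in plain Python with class names I invented (Executor, SubprocessExecutor, QueueExecutor), not Airflow's actual executor classes, and it simplifies heavily: the subprocess variant blocks until the task finishes, which a real executor would not do.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from queue import Queue


class Executor(ABC):
    """Strategy interface: the scheduler only ever calls submit()."""

    @abstractmethod
    def submit(self, task_id: str, command: list[str]) -> None: ...


class SubprocessExecutor(Executor):
    """LocalExecutor-style: run the task as a subprocess on this machine."""

    def __init__(self) -> None:
        self.exit_codes: dict[str, int] = {}

    def submit(self, task_id: str, command: list[str]) -> None:
        # Simplification: this blocks until the task ends.
        self.exit_codes[task_id] = subprocess.run(command, check=False).returncode


class QueueExecutor(Executor):
    """CeleryExecutor-style: enqueue the task for a worker elsewhere to pick up."""

    def __init__(self) -> None:
        self.queue: Queue = Queue()

    def submit(self, task_id: str, command: list[str]) -> None:
        self.queue.put((task_id, command))


def dispatch(executor: Executor, task_id: str) -> None:
    """Scheduler-side code is identical whichever executor is configured."""
    executor.submit(task_id, [sys.executable, "-c", "print('task ran')"])


local, queued = SubprocessExecutor(), QueueExecutor()
dispatch(local, "extract")
dispatch(queued, "extract")
print(local.exit_codes, queued.queue.qsize())
```

The dispatch function never changes; only the object passed in does. That is roughly why switching executors is a deployment decision rather than a change to your DAG code.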

The Workers

Workers are the things that actually execute your task code. They're separate processes (or pods, in the Kubernetes case) that pull tasks from the queue and run them. They only exist as separate processes if you're using CeleryExecutor or KubernetesExecutor. With LocalExecutor, there isn’t a separate worker service: tasks run as subprocesses on the same machine as the scheduler.

When a task runs on a Worker:

The Worker picks up the task from the queue
It executes your Python function (or Bash command, or Spark job, or whatever the operator does)
It reports the result (success or failure), which is recorded in the metadata database
It ships logs to wherever logs are configured to go (local filesystem, S3, GCS, etc.)

Workers don't need to know about your full DAG structure. They just need to know "run this task." All the DAG context they need is passed along with the task message.

One important thing: in a multi-worker setup, every worker machine needs access to the same DAG files. Otherwise, the worker won't be able to find the code it's supposed to run. This is commonly solved by mounting a shared filesystem or using Git Sync to pull DAGs onto every worker.

The Triggerer

One component I haven't mentioned yet: the Triggerer. It's an optional process that handles deferred tasks — tasks that are waiting on some external event (like a file landing in S3, or a sensor waiting for a condition) without occupying a worker slot while they wait. It runs deferred tasks in an asyncio event loop, which is far more efficient than a worker sitting idle. If you're not using deferrable operators, you don't strictly need it — but most production setups run it.
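To see why an event loop is so much cheaper than idle workers, here is a small standard-library sketch. It is not the Triggerer's code; wait_for_event and run_triggers are hypothetical stand-ins, and asyncio.sleep plays the role of waiting on an external system.

```python
import asyncio
import time


async def wait_for_event(name: str, delay: float) -> str:
    """Stand-in for a trigger: wait without blocking the thread."""
    await asyncio.sleep(delay)  # a real trigger would poll or await an external system
    return f"{name} fired"


async def run_triggers(count: int, delay: float) -> list[str]:
    # All waits share one event loop, so they overlap instead of queueing up.
    return await asyncio.gather(
        *(wait_for_event(f"trigger-{i}", delay) for i in range(count))
    )


start = time.perf_counter()
results = asyncio.run(run_triggers(count=500, delay=0.2))
elapsed = time.perf_counter() - start
print(f"{len(results)} waits finished in {elapsed:.2f}s")
```

Five hundred waits of 0.2 seconds each would take about 100 seconds if handled one after another. On a single event loop they finish in roughly the time of one wait, without tying up 500 worker slots.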

Connections and Variables

These aren't really "architecture" in the strict sense, but they're part of the runtime infrastructure and worth understanding early.

Connections are how Airflow stores credentials and connection info for external systems: your database host/port/user/password, your AWS credentials, your Snowflake account info. When you use an operator or hook (such as SQLExecuteQueryOperator or S3Hook), it looks up a named connection from the metadata database rather than relying on credentials hardcoded in your DAG.

Variables are just key-value pairs stored in the metadata database. Useful for config values you want to change without editing DAG code.

Both connections and variables can be managed through the UI, the API, or environment variables (the last of these is often preferred in production from a secrets management perspective).
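For the environment-variable route, Airflow reads connections from variables named AIRFLOW_CONN_<CONN_ID> (URI format is one accepted form) and variables from AIRFLOW_VAR_<KEY>. The sketch below only illustrates what such values contain. The describe_connection helper, the connection ID, and all values are made up, and Airflow does this parsing for you in practice.

```python
import os
from urllib.parse import unquote, urlsplit

# Assumed example values; in production your deployment would inject these.
os.environ["AIRFLOW_CONN_WAREHOUSE_DB"] = (
    "postgres://etl_user:example-password@db.internal:5432/analytics"
)
os.environ["AIRFLOW_VAR_BATCH_SIZE"] = "500"


def describe_connection(conn_id: str) -> dict:
    """Parse a URI-style AIRFLOW_CONN_* value into its parts (illustrative helper)."""
    uri = urlsplit(os.environ[f"AIRFLOW_CONN_{conn_id.upper()}"])
    return {
        "conn_type": uri.scheme,
        "host": uri.hostname,
        "port": uri.port,
        "login": unquote(uri.username or ""),
        "password": unquote(uri.password or ""),
        "schema": uri.path.lstrip("/"),
    }


conn = describe_connection("warehouse_db")
batch_size = int(os.environ["AIRFLOW_VAR_BATCH_SIZE"])
print(conn["host"], conn["port"], batch_size)
# → db.internal 5432 500
```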

Putting It All Together

Let's see what actually happens when a scheduled DAG runs, so all these pieces connect:

Your DAG file lives in the dags/ folder. The DAG Processor picks it up, parses it, and writes the DAG structure to the Metadata Database.
The Scheduler wakes up (it's running in a loop, typically every few seconds). It checks the database and sees that your DAG is scheduled to run at 2:00 AM. It's now 2:00 AM. The Scheduler creates a DAG Run record and individual Task Instance records in the database.
The Scheduler evaluates task dependencies. Task A has no upstream dependencies, so it becomes eligible. The Scheduler flips its state to "queued" and tells the Executor to execute it.
The Executor places Task A into the task queue (Redis/RabbitMQ when using CeleryExecutor).
An available Worker picks up Task A from the queue. It executes your Python function. If it succeeds, the Worker marks the Task Instance as "success" in the database.
Back to the Scheduler — on the next loop, it sees Task A is done. Task B was waiting on Task A. Now Task B's dependencies are satisfied, so it becomes eligible. The cycle repeats.
Throughout all of this, the API Server is answering requests from your browser, showing you the DAG Run and its task states in real time.
Everything is written to the Metadata Database along the way.

What if one of the components is down?

Component	What breaks if it's down?
Scheduler	No new tasks start
Workers	Tasks stop executing
Metadata DB	Entire platform affected
API Server	UI/API operations fail; in 3.x, task state reporting is affected too
Triggerer	Deferrable tasks stop progressing

A Note on Airflow 3.x Specifically

Airflow 3.x brought a bunch of changes, and if you're reading older tutorials, some things will look different. The key differences to be aware of:

The DAG authoring syntax got cleaner — the @dag and @task decorators (from the TaskFlow API) are now first-class citizens and the recommended way to write most DAGs. The old Operator-heavy style still works but new code should prefer TaskFlow.

The UI was rebuilt from scratch — it's a modern React app now, not the old Flask/Jinja UI. Grid view is the default and it's significantly better.

The Scheduler and DAG Processor are now clearly separated — as described above. Relevant if you're configuring a production deployment.

The REST API is now the proper interface for programmatic access — no more hacking around with CLI commands for automation.

Where to Go From Here

If you've followed this far, you understand the fundamental architecture. The next things worth digging into:

Write your first DAG using the TaskFlow API (@dag, @task decorators)
Understand what Operators are — they're the building blocks of tasks (PythonOperator, BashOperator, and so on)
Set up a local Airflow environment with Docker Compose (the official docker-compose.yaml from the Airflow docs is a good start)
Learn about XComs — that's how tasks pass data to each other
Understand task retries and the on_failure_callback — essential for production use

The architecture seems complex at first, but once you've run Airflow a few times, you start to develop intuition for which component is causing problems when things go wrong. And things will go wrong. That's part of the fun. Good luck.

How to Deploy a LangGraph Agent on AWS Bedrock AgentCore

Fazalul Haque — Mon, 25 May 2026 15:54:19 +0000

You’ve built a LangGraph agent that works fine on your laptop. The next challenge is getting it running in a scalable, serverless production infrastructure without having to redesign the whole thing.

That’s where AWS Bedrock AgentCore comes in. In this guide, I’ll show you how to wrap your existing agent so it runs on AgentCore, set up an AgentCore project, test it locally, deploy it to AWS, and invoke it after deployment.

What Is AWS Bedrock AgentCore?

AgentCore is a serverless hosting platform designed by AWS to deploy, scale, and operate your AI agents securely without you managing the infrastructure. It works with open-source frameworks such as LangGraph, Strands, CrewAI, or LlamaIndex, and with Large Language Models such as OpenAI's GPT, Google's Gemini, or Anthropic's Claude, so you don’t have to rewrite your agent logic. It also provides session isolation, persistent memory, observability, and identity management.

The deployment is managed through the AgentCore CLI, a Node.js tool that scaffolds projects, runs a local dev server, and deploys to AWS using CDK under the hood.

Prerequisites

Before you start, make sure you have the following in place:

Python 3.12+
uv or pip for Python dependency management
Node.js 20+ — the AgentCore CLI is an npm package
AWS CDK installed globally (npm install -g aws-cdk)
An AWS account with credentials configured locally (aws configure)
Your existing LangGraph agent code that creates a compiled StateGraph object

Step 1 — Install the AgentCore CLI

The AgentCore workflow starts with a single command-line tool. You’ll use it to create the project, run the app locally, and deploy it to AWS.

Install the CLI:

npm install -g @aws/agentcore

Verify it after the installation:

agentcore --help

Step 2 — Add Additional Dependencies

Add the following packages to your agent's dependencies:
- bedrock-agentcore - the Python SDK that provides the BedrockAgentCoreApp wrapper class.
- aws-opentelemetry-distro - AWS-supported distribution of the OpenTelemetry Python Instrumentation package.

# pyproject.toml
[project]
name = "my-agent"
version = "0.1.0"
requires-python = ">=3.12"

dependencies = [
    "aws-opentelemetry-distro==0.17.0",
    "bedrock-agentcore>=1.6.3",
    "boto3>=1.42.0",
    "langgraph>=1.1.0",
    "langchain-core>=1.2.0",
    # ... your other existing dependencies
]

Install all dependencies with your usual workflow:

uv sync
# or: pip install -e .

Step 3 — Write the AgentCore Entrypoint (`main.py`)

AgentCore expects a single Python file as the entrypoint with a BedrockAgentCoreApp instance and a function decorated with @app.entrypoint. The important part is that your LangGraph logic does not need to change much; you’re just wrapping it in a small runtime entrypoint.

Here is the basic structure:

# main.py

from langchain_core.messages import HumanMessage
from bedrock_agentcore.runtime import BedrockAgentCoreApp

from graphs.my_agent_graph import build_my_agent  # your existing graph builder

# 1. Instantiate the AgentCore app and logger
app = BedrockAgentCoreApp()
log = app.logger

# 2. Build your graph at module load time (startup)
#    AgentCore initialises the module once, then handles concurrent invocations.
#    Any failure here will prevent a broken agent from going live.
def create_agent():
    log.info("Initialising agent...")
    graph = build_my_agent()   # returns your compiled LangGraph StateGraph
    log.info("Agent ready")
    return graph

try:
    graph = create_agent()
except Exception as e:
    log.error(f"Critical failure during agent initialisation: {e}")
    raise  # fail fast — don't let a broken agent start serving requests

# 3. Async helper that drives the LangGraph streaming loop
async def run_agent(user_input: str, session_id: str = "default-session") -> str:
    responses = []
    config = {"configurable": {"thread_id": session_id}}

    async for chunk in graph.astream(
        {"messages": [HumanMessage(content=user_input)]},
        config=config,
        stream_mode="values",
    ):
        messages = chunk.get("messages", [])
        if messages:
            last = messages[-1]
            if getattr(last, "type", None) == "ai":
                responses.append(last)

    if not responses:
        return "No response"

    content = getattr(responses[-1], "content", "")
    return content if isinstance(content, str) else str(content)

# 4. The entrypoint — this is what AgentCore calls on every invocation
@app.entrypoint
async def invoke(payload, context):
    try:
        log.info("Invoke received")
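        user_input = payload.get("prompt", "")
        # Prefer an explicit session_id in the payload; otherwise fall back to the
        # runtime session ID on the request context (assumed attribute: session_id).
        session_id = (
            payload.get("session_id")
            or getattr(context, "session_id", None)
            or "default-session"
        )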

        if not user_input.strip():
            return {"error": "Prompt cannot be empty"}

        response = await run_agent(user_input, session_id)
        return {"response": response}

    except Exception as e:
        log.error(f"Error: {e}")
        return {"error": str(e)}

# 5. Run the app (only executed when running locally using agentcore dev)
if __name__ == "__main__":
    app.run()

A few things that matter

BedrockAgentCoreApp() gives you the AgentCore runtime wrapper. It sets up the HTTP server, health check endpoints, and structured logging for you.

Module-level graph initialisation — the graph is created once when the module loads, not on every request. That catches startup failures early, which is good because it prevents a broken app from going live.

@app.entrypoint — this decorator registers the function as the handler for incoming invocations. It receives payload (the parsed JSON body) and context (AgentCore request context). It should return a plain JSON-serializable dictionary, not a raw string or a custom object.

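session_id → thread_id: the session_id is passed into LangGraph as thread_id, which enables per-session memory, provided the graph was compiled with a checkpointer such as MemorySaver. Keep in mind that MemorySaver holds state in process memory only.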

Step 4 — Create an AgentCore Project

Next, generate the project layout. Run the initialization wizard inside a clean root folder and let it create the basic structure for you.

agentcore create

The setup wizard will ask a few simple questions, like the project name, language, Python version, entrypoint file, etc. The important part is that the entrypoint and the Python version match your app.

After the project is created, you’ll get an agentcore.json file (in the agentcore folder) that tells AgentCore where your code lives and how to run it.

{
  "name": "MyAgentOnAgentcore",
  "runtimes": [
    {
      "name": "my-agent",
      "build": "CodeZip",
      "entrypoint": "main.py",
      "codeLocation": "app/my-agent/",
      "runtimeVersion": "PYTHON_3_12",
      "networkMode": "PUBLIC",
      "protocol": "HTTP",
      "envVars": []
    }
  ]
}

Step 5 — Put Your Source Code in the Code Location

Copy or move your agent source into the directory specified by codeLocation in agentcore.json. This part is easy to get wrong, and when it’s wrong, deployment becomes unnecessarily frustrating. I struggled for a while after accidentally putting it in the agentcore folder.

A typical layout should look like this:

MyAgentOnAgentcore/
├── agentcore/
│   └── agentcore.json
└── app/
    └── my-agent/                    ← codeLocation
        ├── main.py                  ← entrypoint (matches agentcore.json)
        ├── pyproject.toml           ← or requirements.txt
        ├── graphs/
        │   ├── my_agent_graph.py    ← your LangGraph graph builder
        │   └── tools.py             ← your tools, utilities, etc.
        └── .env                     ← your configuration variables

Make sure main.py sits at the top level of codeLocation and matches the entrypoint in agentcore.json exactly.

Environment variables

AgentCore can pass environment variables into the runtime container. For local testing, a .env file is usually the easiest option.

For production, you can put values in agentcore.json under envVars (a .env file also works). Secrets such as passwords and API keys, however, are better stored in AWS Secrets Manager and loaded at runtime. If you do that, the AgentCore execution role needs permission to read those secrets.

Note: envVars should be an array of JSON objects with name and value fields.
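If you keep settings in a Python dict or similar, a tiny helper can produce that shape for you. This is only a sketch: to_env_vars is a hypothetical helper, and the setting names are invented for the example.

```python
import json


def to_env_vars(settings: dict[str, str]) -> list[dict[str, str]]:
    """Convert a plain mapping into the name/value list shape described above."""
    return [{"name": name, "value": str(value)} for name, value in settings.items()]


# Hypothetical non-secret settings; keep passwords and API keys out of this file.
env_vars = to_env_vars({"LOG_LEVEL": "INFO", "MODEL_REGION": "us-east-1"})
print(json.dumps({"envVars": env_vars}, indent=2))
```

The printed fragment can be pasted into the runtime entry in agentcore.json in place of the empty envVars array.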

Step 6 — Validate the Project

Before you try to run anything, use the AgentCore CLI to validate the project configuration:

agentcore validate

This catches the common mistakes early, like a missing entrypoint file or a bad path in the config.
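To show the kind of check involved, here is a small sketch that verifies each runtime's entrypoint exists under its codeLocation. It is not how the CLI implements validation; check_entrypoints is a hypothetical helper, and it assumes the layout from this guide (agentcore/agentcore.json, with codeLocation relative to the project root).

```python
import json
import tempfile
from pathlib import Path


def check_entrypoints(project_root: Path) -> list[str]:
    """Return a list of problems; an empty list means every entrypoint was found."""
    config = json.loads((project_root / "agentcore" / "agentcore.json").read_text())
    problems = []
    for runtime in config.get("runtimes", []):
        entry = project_root / runtime["codeLocation"] / runtime["entrypoint"]
        if not entry.is_file():
            problems.append(f"{runtime['name']}: missing {entry.relative_to(project_root)}")
    return problems


# Demo against a throwaway project that mirrors the layout in this guide.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "agentcore").mkdir()
    (root / "app" / "my-agent").mkdir(parents=True)
    (root / "agentcore" / "agentcore.json").write_text(json.dumps({
        "runtimes": [
            {"name": "my-agent", "entrypoint": "main.py", "codeLocation": "app/my-agent/"}
        ]
    }))
    before = check_entrypoints(root)   # main.py has not been created yet
    (root / "app" / "my-agent" / "main.py").write_text("# entrypoint\n")
    after = check_entrypoints(root)    # now the entrypoint exists

print(before, after)
```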

Step 7 — Run and Test Locally

Start the local runtime to test the agent locally:

agentcore dev

This spins up a local HTTP server that closely mirrors the production environment for testing. If everything is wired up properly, you should see that the app is ready and listening on a local port.

Test on Local Server

You can test an invocation using either the AgentCore CLI or curl:

agentcore invoke "Summarise last month sales"

curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarise last month sales", "session_id": "test-001"}'

A successful response should come back as JSON with the agent’s answer in it.

{"response": "Last month, total sales were $1.2M across 3 regions..."}

Step 8 — Deploy to AWS

Once local testing looks solid, deploy to AWS Bedrock AgentCore Runtime:

agentcore deploy

Under the hood, the CLI:

Packages your code
Provisions an S3 bucket for direct code deploy
Creates an IAM execution role
Deploys the AgentCore Runtime via CDK

The first deploy takes a few minutes. Subsequent deploys tend to be faster.

Once the deploy completes successfully, check the runtime status. The output shows the runtime ARN and the HTTP URL for invoking the deployed agent.

agentcore status

Step 9 — Invoke the Deployed Agent

The easiest way to test the deployed version is with the CLI:

agentcore invoke --prompt "Who are the top 5 customers by revenue?" --session-id "session-id-with-length-greater-than-or-equal-33"

If you want to call it using HTTP, you’ll need to sign the request with AWS SigV4.


curl -X POST "https://bedrock-agentcore.us-east-1.amazonaws.com/runtimes/arn-of-MyAgentOnAgentcore-xxxx/invocations" \
  --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
  --aws-sigv4 "aws:amz:us-east-1:bedrock-agentcore" \
  -H "Content-Type: application/json" \
  -H "x-amzn-bedrock-agentcore-runtime-session-id: session-id-with-length-greater-than-or-equal-33" \
  -d '{"prompt": "Who are the top 5 customers by revenue?"}'

Use the HTTP URL displayed by agentcore status.
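The placeholder session IDs in the commands above hint at a length requirement of at least 33 characters. One simple way to satisfy it is to build the ID from a UUID, as in this sketch. The new_session_id helper and the "session" prefix are my own choices, not part of AgentCore.

```python
import uuid

MIN_SESSION_ID_LENGTH = 33  # minimum implied by the placeholder session IDs above


def new_session_id(prefix: str = "session") -> str:
    """Return a unique session ID that is comfortably longer than the minimum."""
    session_id = f"{prefix}-{uuid.uuid4()}"  # uuid4 renders as 36 characters
    if len(session_id) < MIN_SESSION_ID_LENGTH:
        raise ValueError("session ID is too short")
    return session_id


print(new_session_id())
```

Generate one ID per conversation and reuse it for every follow-up request in that conversation.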

When integrating your agent with other Python applications, using the SDK is usually the cleanest approach. You can keep the runtime ARN in an environment variable, send the prompt, and pass the same session ID back on follow-up requests.

# invoke_agent.py
import json
import os

import boto3
from dotenv import load_dotenv

load_dotenv()

client = boto3.client("bedrock-agentcore", region_name="us-east-1")
runtime_arn = os.getenv("AGENTCORE_RUNTIME_ARN")


def invoke(prompt: str, session_id: str = "") -> tuple[str, str]:
    """Send one prompt; return (answer, session ID to reuse on the next call)."""
    params = {
        "agentRuntimeArn": runtime_arn,
        "payload": json.dumps({"prompt": prompt}),
        "contentType": "application/json",
    }
    # Omit runtimeSessionId on the first call so AgentCore assigns one.
    if session_id:
        params["runtimeSessionId"] = session_id

    response = client.invoke_agent_runtime(**params)
    body = json.loads(response["response"].read().decode())
    # The entrypoint returns {"response": ...} on success and {"error": ...} otherwise.
    answer = body.get("response") or f"Error: {body.get('error', 'unknown error')}"
    return answer, response.get("runtimeSessionId", session_id)


if __name__ == "__main__":
    if not runtime_arn:
        raise SystemExit("Set AGENTCORE_RUNTIME_ARN in your environment or .env file")

    session_id = ""
    print("Ready. Type 'exit' to quit.")

    while True:
        prompt = input("Your question: ").strip()
        if prompt == "exit":
            break
        if not prompt:
            continue

        try:
            # Reuse the returned session ID so follow-ups land in the same session.
            answer, session_id = invoke(prompt, session_id)
            print(f"Agent: {answer}\n")
        except Exception as e:
            print(f"Error: {e}")

Set AGENTCORE_RUNTIME_ARN in your .env file:

AGENTCORE_RUNTIME_ARN=arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/MyAgentOnAgentcore-xxxx

Maintaining session state across invocations

Notice runtimeSessionId in the boto3 example. AgentCore returns a runtimeSessionId in every response. Passing it back in the next request tells AgentCore to route subsequent calls to the same session context — which, combined with LangGraph's MemorySaver, gives the agent full conversation memory across multiple HTTP calls.

Common Gotchas

Startup failures are intentional. If create_agent() raises at module load, AgentCore will refuse to start the runtime. This is a feature, not a bug — it prevents an incorrectly configured agent (missing credentials, wrong DB URL) from going live silently.

The execution role AgentCore creates needs explicit permissions for any AWS service your agent touches, including Secrets Manager and S3. Add those permissions after the first deploy.

The Python version in pyproject.toml should match the runtime version in agentcore.json.

The entrypoint should return a JSON-serialisable dictionary. If it returns a plain string or a custom object, the runtime boundary will reject it.
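A cheap way to guard against that last gotcha is to normalise whatever your agent produces before returning it. The sketch below is one possible approach, not something AgentCore requires. AgentAnswer and to_response are hypothetical names, and the check simply relies on json.dumps failing for unsupported types.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AgentAnswer:  # hypothetical custom result type
    text: str
    sources: list[str]


def to_response(result) -> dict:
    """Coerce an agent result into a JSON-serialisable dict, or fail loudly."""
    if isinstance(result, AgentAnswer):
        result = asdict(result)
    elif isinstance(result, str):
        result = {"response": result}
    if not isinstance(result, dict):
        raise TypeError(f"expected a dict, got {type(result).__name__}")
    json.dumps(result)  # raises TypeError if a value is still not serialisable
    return result


print(to_response("hello"))
print(to_response(AgentAnswer("example answer", ["example_source"])))
```

Calling to_response on the value you return from the entrypoint turns a confusing runtime failure into an immediate TypeError you can catch in local testing.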

Conclusion

The transition from a local LangGraph agent to an AgentCore deployment really comes down to a few practical changes: add the right dependencies, wrap the graph in BedrockAgentCoreApp, scaffold the project, test locally, then deploy and invoke it.

Everything else — the graph, the tools, the model calls, and the memory setup — stays mostly the same. AgentCore handles the runtime side, and LangGraph handles the agent logic.

A full working example is available at github.com/thedataengr/data-agent-on-aws-agentcore.