At the AI Engineer Summit 2025 in New York, the mantra repeated from stage after stage was five words: capability does not mean reliability. Speakers from finance, infrastructure, and consumer products converged on the same point: shipping an agent that demos well is now a solved problem; shipping one that survives a Tuesday in production is not.
The data backs the room. LangChain's State of Agent Engineering report found that 89 percent of organizations running agents in production have had to add observability their framework did not provide, and 62 percent had to build detailed tracing for individual agent steps. Honeycomb's O11yCon 2026 was themed, in full, as the observability conference for the agent era. Three different angles on the same pattern: teams that took an agent to production had to build half an orchestrator on top of their framework.
The pattern has a name now. Last month, Kaxil Naik and Pavan Kumar Gopidesu shipped the Common AI Provider for Apache Airflow 3 with a sentence that articulates what hundreds of teams had been intuiting: "Not a wrapper around another framework, but a provider package that plugs into the orchestrator you already run." Both work at Astronomer, the commercial backer of Airflow, which is worth naming up front. The sentence is a diagnosis whether it came from Astronomer or anyone else.
The diagnosis is that the dominant design pattern of the last two years, treating the agent loop as a new runtime, was a category error. The agent loop is not a runtime. It is a workload. The runtime already exists.
What durable orchestration actually buys you
Three things a mature orchestrator gives an agent that a prototype framework does not.
The first is durable replay. The Common AI Provider post puts it bluntly: "When a 10-step agent task fails on step 8, a retry shouldn't re-run all 10 steps and double your API bill." Durable execution caches each model response and each tool result in object storage. A retry serves the cache instead of paying the LLM again. Anyone who has watched an agent loop burn a hundred dollars in a single Sunday night incident will recognize the value of that one line. Frameworks ship retry as a decorator. Orchestrators ship retry as a contract.
The second is observability that did not have to be invented. Airflow has had structured logging, run history, task duration metrics, and lineage tracking for years, because those features are how a data team trusts a pipeline at all. When the agent becomes a task, the agent inherits everything. There is no instrumentation project. The trace exists because it had to exist for ETL.
The third is the boring infrastructure that every framework eventually rediscovers. Authentication to three hundred and fifty backends. Role-based access control on who can approve which tool call. Secret management. Connection pooling. Cost attribution by team. None of these are agent features. All of them are required to ship one. Airflow has them because its core customers have been demanding them for a decade. A new framework starts at zero and rebuilds them, badly, in the months that follow its first production incident.
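Of the items on that list, cost attribution is the easiest to make tangible. The snippet below is a toy illustration of the idea, not any framework's API; the per-token prices are made-up placeholders:

```python
# Toy sketch of cost attribution by team: every model call is tagged
# with the owning team so spend can be rolled up per team later.
# Prices are placeholder values, not real rates.
from collections import defaultdict

spend_by_team = defaultdict(float)

def record_llm_call(team, prompt_tokens, completion_tokens,
                    usd_per_1k_prompt=0.005, usd_per_1k_completion=0.015):
    cost = (prompt_tokens / 1000) * usd_per_1k_prompt
    cost += (completion_tokens / 1000) * usd_per_1k_completion
    spend_by_team[team] += cost
    return cost

record_llm_call("growth", prompt_tokens=2000, completion_tokens=1000)
record_llm_call("growth", prompt_tokens=1000, completion_tokens=500)
```

Trivial on its own, and that is the point: multiply it by authentication, RBAC, secrets, and pooling, and the rebuild bill is the months after the first incident.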
These three are why the conference circuit converged on reliability as the theme. The talks were not announcing a new problem. They were naming the rebuild bill.
The category error
Most agent frameworks were designed around the same wrong premise. The premise was that the new thing was the orchestration of LLM calls. If you accept that premise, you build a runtime. You write a scheduler, a retry layer, a state machine, an observability story, a permissions model. You ship the runtime as the framework and the framework owns the lifecycle of the application.
The premise was off by one. The new thing was the LLM call. Everything around it was already a solved problem. The orchestrator did not need to be invented; it has been in production since 2014. The agent loop is the cargo, not the chassis.
The Airflow team had been building toward this correction for two years before the provider shipped. Airflow 3 reshaped the engine around assets rather than schedules, so a pipeline can react to data arriving instead of a clock ticking. The Common AI Provider is the surface layer on top of that foundation, not the diagnosis itself. The diagnosis was the engine work that came first.
Naik and Gopidesu's line, a provider package that plugs into the orchestrator you already run, is the cleanest articulation of the correction. It moves the agent from the center of the architecture to the edge. The center stays where it was. The provider model means a team running Airflow gets @task.agent and @task.llm as decorators next to the @task they have been using since Airflow 2.0, and the new code looks like the old code, because it is.
```python
# Prototype-shape: the agent runtime is the application
from pydantic_ai import Agent

agent = Agent(model="openai:gpt-4o", tools=[query_db, read_s3])
result = agent.run_sync("Analyze churn for Q3")
# retry, logging, RBAC, secrets: your problem
```

```python
# Production-shape: the agent is a task on the orchestrator
from pydantic_ai import Agent
from airflow.sdk import task

agent = Agent(model="openai:gpt-4o", tools=[query_db, read_s3])

@task.agent(agent=agent, llm_conn_id="openai_default")
def analyze_q3_churn(segment: str):
    return f"Analyze churn for {segment}"
# retry, logging, RBAC, secrets: the orchestrator's problem
```
Same agent definition. The difference is where it runs and what it inherits.
Where the framework still wins
Frameworks are not wrong. They are wrong in production. In every other phase of the work, they are correct.
Prototyping is faster in LangGraph or Pydantic AI than it will ever be in Airflow. The mental model is closer to the code, the iteration loop is shorter, the dependencies are lighter. When you are sketching a new agent shape and do not yet know what tools it needs, the right tool is a notebook with a framework, not a DAG.
Exploratory work belongs there too. Research, evaluation harnesses, small internal tools one person runs once a week. None of these justify the operational weight of an orchestrator. None of them suffer from missing durable replay because they do not run unattended.
The framework wins everywhere the agent is not yet load-bearing. The provider wins the moment the agent has to survive a holiday weekend without you watching it.
The two-line decision
Here is the heuristic, two lines.
If the agent is not yet running unattended on a schedule and not yet paid for by a customer, keep it in the framework. If it is either of those, move it behind a provider on an orchestrator you already run.
That is the line. It is not a commitment to Airflow specifically. The same logic applies if your orchestrator is Dagster, Prefect, or Temporal. The principle is that durable execution is not a checkbox on a roadmap. It is a contract between the engine and the workload, and prototype frameworks ship engines that do not honor that contract.
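The heuristic is small enough to encode directly. This function is illustrative only, a restatement of the two lines above, not anyone's API:

```python
def agent_home(runs_unattended_on_schedule: bool, customer_paid: bool) -> str:
    """The two-line decision: framework until the agent is load-bearing,
    orchestrator plus provider after."""
    if runs_unattended_on_schedule or customer_paid:
        return "orchestrator"
    return "framework"

# A weekly research notebook stays in the framework; a nightly
# customer-facing report moves behind the provider.
assert agent_home(False, False) == "framework"
assert agent_home(True, False) == "orchestrator"
```

Note that the predicate is an OR, not an AND: either condition alone makes the agent load-bearing.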
What the pattern says about software
The pattern is not new. Rails won the web because it absorbed the request-response cycle until that cycle became invisible. Kubernetes won infrastructure because it absorbed the deploy-and-restart loop until the loop became invisible. Postgres absorbed twenty years of small databases because each of those small databases eventually rediscovered transactions, recovery, and indexing, badly. Every time the boring layer wins, it wins because newcomers underestimate how much the boring layer was already doing.
Production AI is in that moment. The boring layer is the orchestrator. The newcomer is the agent framework. The newcomer is not going away, because the newcomer is correct in the half of the lifecycle where the boring layer is too heavy. The boring layer is not going away either, because the moment the agent goes load-bearing, the rebuild begins, and the rebuild is the orchestrator. The signal that this has moved from contrarian read to industry consensus is the framing of Astronomer's State of Airflow 2026 report itself: "The Orchestration Layer is Uniting Data, AI, and Enterprise Growth." Two years ago the orchestrator was a deployment concern. In 2026 it is the consolidating layer.
Naming the pattern early is the move. Teams who name it spend their second quarter shipping product. Teams who do not spend it rebuilding retry logic and calling it agent engineering.
The boring AI is the right AI. Borrowed term, durable claim.
I am building Kilnx, a declarative backend DSL that pairs with htmx, and Provero, where a lot of the orchestration-shape decisions I write about are the day job. If the diagnosis here lands, that is the door.
André Ahlert is a product engineer. Contributor across Apache, Flyte, Backstage, HTMX, Hyperscript. Currently building Kilnx and Provero.