Picking an Agent Framework in 2026: An Honest Verdict on Six of Them

#ai #llm #python #agents

Book: Agents in Production — Building, Tracing, and Shipping Multi-Step AI You Can Trust
Also by me: Observability for LLM Applications — the companion book in The AI Engineer's Library (2-book series)
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

On April 2, 2026, Microsoft shipped agent-framework 1.0 and, in the same blog post, moved AutoGen into maintenance mode. Semantic Kernel went with it. Three overlapping projects folded into one package with stable APIs and long-term support. Microsoft framed the move as a consolidation. If you had an AutoGen project that morning, you woke up with a migration.

That is the shape of this whole category. The framework landscape you pick from today is not the one you picked from a year ago, and it will not be the one you pick from next year. So the useful question is not "which framework is best." It is "which framework has which wedge, and which trade-off comes with it."

Here is an honest read on six frameworks worth installing in 2026, and when to reach for each.

The churn is the feature, not the bug

Before the tour, one thing that changed the math: the wire formats underneath these SDKs converged. Every framework here speaks MCP for tools. Most support A2A for cross-framework handoffs. Model Context Protocol started as an Anthropic proposal at the start of 2025 and is now the default way agents pick up external tools.

That convergence means the framework you pick locks you in less than it used to. You are still locked at the abstraction layer, though. Migrating a production system from CrewAI to Pydantic AI is a rewrite of every Agent definition and every tool decorator. The pick is sticky. Choose it with that in mind.

LangGraph: durability as the wedge

Reach for LangGraph when your agent has to survive a crash. It models the agent as a graph with checkpointers backed by Postgres or SQLite, so a workflow that dies at step seven resumes at step seven. Human-in-the-loop interrupts are first-class.

The prebuilt path is short:

from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent

model = ChatAnthropic(model="claude-opus-4-8")
agent = create_react_agent(model, tools=[refund])

The cost is conceptual weight. You think in nodes, edges, and state reducers. For a simple tool-call loop that never needs to resume, the graph is more than the job asks for. OTel spans come through OpenInference, not natively.

OpenAI Agents SDK: the fast path when you live on OpenAI

If your stack is already OpenAI and you want the shortest distance to a working agent, this is it. The primitives are Agent, Runner, and Handoff. State is session-scoped, not durable. You can point it at other models through a LiteLLM layer, but the ergonomics are tuned for the OpenAI ecosystem it ships from.

Pick it for speed on OpenAI. Look elsewhere when you need durable checkpoints or a language a frontend engineer can read.

Claude Agent SDK: for agents that touch code and files

The Claude Agent SDK is the one to reach for when the work is coding or file manipulation. It carries the same primitives that power Claude Code: subagents, file-system tools, and session resume. If your agent reads a repo, edits files, and runs commands, this SDK was shaped for exactly that loop.

It leans on Claude models, and the default when you show an LLM call is a Claude model like claude-opus-4-8. Native OTel is not there yet, so you wrap it with OpenInference for tracing. Outside the coding and file-agent lane, its wedge matters less.

Microsoft agent-framework: enterprise on Azure

This is the merged successor to AutoGen and Semantic Kernel. Its wedge is enterprise Microsoft. If your infrastructure runs on Azure, your auditors know the word Purview, and you need C# and Python parity in one codebase, nothing else competes. Native OpenTelemetry GenAI spans are built in, which puts it ahead of most of this list on the one metric that matters for production.

The primitives are ChatAgent, Handoff, and Workflow. The catch is migration fatigue: users who rode AutoGen 0.2 to 0.4 to agent-framework have absorbed three breaking APIs in eighteen months. There is no TypeScript. Pick it when your stack looks like Azure and governance is the real requirement.

CrewAI: when the problem is genuinely a team

CrewAI is what a non-ML engineer reaches for first. You hire agents. They have roles, goals, and backstories. You give them tasks, drop them in a crew, pick a process, and kick off. The mental model is an org chart, which is why it sells so well to the people buying the software.

That abstraction is also the trap. A lot of prompting happens behind the scenes. When you set a backstory, the framework injects an instruction block you did not write. When something goes wrong, you end up reading the CrewAI source to find out what was actually sent to the model. Reach for it when the product manager says "it's a team" and means it. Reach for something simpler when the work is really one agent wearing a costume.

Pydantic AI: types as the contract

Pydantic AI's wedge is the type system. Every agent is parameterized by its dependency type and its output type, and your IDE flags mismatches before you run anything.

from pydantic import BaseModel
from pydantic_ai import Agent

class Triage(BaseModel):
    lane: str
    urgent: bool

agent = Agent(
    "anthropic:claude-opus-4-8",
    output_type=Triage,
)

result = agent.run_sync("Card charged twice.")
print(result.output.lane)

The output_type earns the install. When the model returns something that does not parse into Triage, the framework feeds the validation error back to the model as a retry. You get a validated object or a clean exception, never a string with a JSON code fence in it. Logfire integration gives you OTel traces in one line.

The weak spot is ecosystem: smaller than LangGraph or CrewAI, with a graph API that is younger for checkpointing. If you need durable Postgres-backed state today, LangGraph still wins. Pick Pydantic AI when your shop already lives in Pydantic and you want agents that feel like FastAPI routes.

The pick-by-need matrix

Framework	Reach for it when	Watch out for
LangGraph	You need durable state and human-in-the-loop	Graph overhead; OTel via OpenInference
OpenAI Agents SDK	You are on OpenAI and want speed	Session-only state; OpenAI-shaped
Claude Agent SDK	The work is coding and file manipulation	Native OTel not there yet
Microsoft agent-framework	Azure, .NET parity, auditable governance	Migration fatigue; no TypeScript
CrewAI	The problem is genuinely a team	Hidden prompting; higher non-determinism
Pydantic AI	Your shop is already type-first	Smaller ecosystem; younger graph API

Before you install anything

Here is the note the framework vendors would rather you skip. Every framework on this list adds a vocabulary, a failure mode, a set of version pins, and a layer between you and the model. Sometimes that layer pays for itself. Often it does not.

The honest default is the bare provider SDK. An event loop, a list of tools, and a while loop around a messages.create call:

import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    tools=tools,
    messages=messages,
)
# loop while resp.stop_reason == "tool_use"

That is enough for more production systems than the framework marketing suggests. Reach for a framework only when one of three things is true: you need durable state across restarts, you are building on an ecosystem the framework is already glued into, or your team cannot hold the raw loop in its head and needs shared named primitives.

The test you cannot outsource: build the smallest honest program on your top two candidates, run them side by side, and see which one you read more easily at 9 a.m. the next day. Not the one with nicer docs. The one that tells you what the agent is about to do when you open the file cold. Then read the trace it produces. If the shape of what comes out does not match the shape of what you expected, you have found the next problem worth solving.

Picking the framework is the easy half. The hard half is running the thing once it ships: knowing why a tool call looped, what a turn cost, and whether the agent actually did the job. Agents in Production covers building and shipping the multi-step loop; Observability for LLM Applications covers the tracing, evals, and cost accounting that keep it honest after launch. Together they are The AI Engineer's Library.