Alex Cloudstar

Originally published at alexcloudstar.com

AI Agent Frameworks in 2026: LangGraph vs Mastra vs Vercel AI SDK vs OpenAI Agents SDK vs Pydantic AI

Every time someone asks me which agent framework to use, I have to ask three questions back before I can answer. What language are you in. How long-running is the agent. How much do you care about being able to swap out the model later. Without those answers, any recommendation is a coin flip, and that is exactly why most of the comparison posts you find online are not very useful. They list features. They do not tell you which framework will quietly destroy your week six months from now.

I have shipped production code on five of the major agent frameworks in the last year, and built throwaway prototypes on three more. The differences between them are real. The marketing pages do a bad job of surfacing those differences because every framework's pitch is "we let you build agents" and the actual job of an agent framework is to manage the long tail of state, retries, tool execution, observability, and model interactions in a way that does not eat your codebase alive when the requirements change.

This is the comparison I wish I had read a year ago. It covers Vercel AI SDK, LangGraph, Mastra, OpenAI Agents SDK, Pydantic AI, and the Claude Agent SDK. It is opinionated and based on what actually broke in production, not on what the docs claim.

What An Agent Framework Is Actually For

Before the comparison, the framing matters. An agent is not a model. An agent is a model plus a loop plus a set of tools plus a memory of what just happened. The framework's job is to manage the loop and the surrounding plumbing.

That plumbing has more parts than people give it credit for. The framework is responsible for, at minimum:

Calling the model with the right messages, system prompt, tools, and configuration. This sounds trivial until you realize you are now the proud owner of code that constructs a tool schema in three different formats depending on which provider you happen to be using today.

Parsing tool calls out of the model's response, validating the arguments, and dispatching them to the right handler. Bonus complexity if any tool returns something the model needs to retry with corrected arguments.

Holding state between turns. What the conversation has been so far, what tools have been called, what data has been retrieved, what the user said three turns ago that is suddenly relevant.

Handling failure. Model timeouts. Tool failures. Rate limits. Context window overflows. Each of these has a sensible recovery and an obvious wrong move, and the framework's defaults determine which one you get.

Streaming. Tokens, tool calls, intermediate steps, all of it ideally arriving at the UI as soon as it is available so the user does not stare at a spinner.

Observability. Knowing what the agent did and why. This is the part that goes from "nice to have" to "the only way to ship" the moment you put the agent in front of real users.

Frameworks differ on which of those they handle for you, how opinionated their solutions are, and how easy it is to opt out of any one piece. That is the lens for the comparison.
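
To make that concrete, every framework in this post is at its core a managed version of the loop below. This is a framework-agnostic sketch, not any particular SDK's API: `callModel` is a hypothetical stand-in for a real provider call, and the message and tool shapes are simplified to show the plumbing.

```ts
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelReply = { text?: string; toolCalls: ToolCall[] };

// Hypothetical stand-in for a real provider call.
async function callModel(messages: string[]): Promise<ModelReply> {
  return { text: `echo: ${messages[messages.length - 1]}`, toolCalls: [] };
}

// Tool registry: name -> handler.
const tools: Record<string, (args: Record<string, unknown>) => Promise<string>> = {
  search: async (args) => `results for ${String(args.query)}`,
};

async function runAgent(userInput: string, maxTurns = 5): Promise<string> {
  const messages = [`user: ${userInput}`]; // state held between turns

  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await callModel(messages);

    // No tool calls means the model is done.
    if (reply.toolCalls.length === 0) return reply.text ?? '';

    for (const call of reply.toolCalls) {
      const handler = tools[call.name];
      // Failure handling: feed errors back so the model can correct itself.
      const result = handler
        ? await handler(call.args).catch((e) => `error: ${String(e)}`)
        : `error: unknown tool ${call.name}`;
      messages.push(`tool ${call.name}: ${result}`);
    }
  }
  throw new Error('agent exceeded max turns'); // guard against infinite loops
}

console.log(await runAgent('find the docs'));
```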

Vercel AI SDK

The default if you are in JavaScript or TypeScript and you want to ship something that works without thinking too hard about the framework choice. The v6 release earlier this year leveled up the agent primitives substantially, and the integration with the generative UI patterns I wrote about makes it the smoothest path for anything that renders in a React frontend.

What it does well. Streaming is best in class. The generateText and streamText APIs feel like the AI equivalent of fetch: minimal, predictable, easy to reason about. Tool calls are typed end to end with Zod schemas, which means you stop spending time debugging tool argument shape mismatches. The provider abstraction lets you swap between Anthropic, OpenAI, Google, and the AI Gateway by changing one string. The agent loop is the right size: opinionated enough to save you boilerplate, open enough that you can step out of it when you need to.
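
A minimal sketch of what that looks like in practice. This follows the v5-era API shape (for example `inputSchema`, which older releases called `parameters`); v6 may differ in the details, and the model IDs are illustrative.

```ts
import { generateText, tool } from 'ai';
import { openai } from '@ai-sdk/openai';
// import { anthropic } from '@ai-sdk/anthropic';
import { z } from 'zod';

const weather = tool({
  description: 'Get the current temperature for a city',
  // Zod schema: arguments are validated before execute runs.
  inputSchema: z.object({ city: z.string() }),
  execute: async ({ city }) => ({ city, temperatureC: 21 }), // stubbed lookup
});

const result = await generateText({
  model: openai('gpt-4o'),
  // Swapping providers is a one-line change, e.g.:
  // model: anthropic('claude-sonnet-4-5'),
  tools: { weather },
  prompt: 'What is the weather in Berlin right now?',
});

console.log(result.text);
```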

What it does not do. Long-running durable execution is not its job. If your agent needs to pause for hours, survive a process restart, and resume cleanly, you are pairing it with something else (Inngest, Trigger.dev, Vercel Workflow). It does not impose a graph or state machine model, which is freedom if you want it and structure you have to build yourself if you need it. Multi-agent orchestration is doable but you are building the patterns yourself.

Where it bites. The flexibility means two engineers will end up writing two different patterns for the same problem if you do not establish conventions early. It is also very TypeScript-shaped, which is fine if your stack is TypeScript and not great if you wanted to do the agent in Python alongside a Python data pipeline.

When to pick it. You are in a TypeScript or Next.js stack. You are shipping an agent that renders into a frontend. You want to be able to swap models. You do not need durable execution out of the box. This is the default for most web product use cases right now.

LangGraph

The framework people pick when they have outgrown chains and want to model their agent as an explicit state machine. LangGraph is a graph executor where each node is a step in the agent's reasoning and edges are conditional transitions. It has matured into the heaviest hitter for genuinely complex agent workflows.

What it does well. Explicit state. Every transition is visible, every edge is testable, every state is inspectable. For agents with non-trivial control flow (multiple decision points, conditional retries, branching workflows), this is the model that makes the code legible. Persistence is first class with checkpointing built in: an agent can pause, the process can die, and the agent can resume from the last checkpoint without losing state. Multi-agent patterns are well-supported. The integration with the broader LangChain ecosystem (memory, retrievers, evaluators) is deep.
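
To make the graph model concrete, here is a minimal sketch using the TypeScript port (the Python API is analogous): one piece of shared state, two nodes, a conditional edge, and an in-memory checkpointer. The node bodies are stand-ins; real nodes would call models and tools.

```ts
import { StateGraph, Annotation, START, END, MemorySaver } from '@langchain/langgraph';

// Shared state schema: nodes read it and return partial updates.
const State = Annotation.Root({
  steps: Annotation<string[]>({
    reducer: (prev, next) => prev.concat(next), // how updates merge into state
    default: () => [],
  }),
});

const graph = new StateGraph(State)
  .addNode('plan', async () => ({ steps: ['planned'] }))
  .addNode('act', async () => ({ steps: ['acted'] }))
  .addEdge(START, 'plan')
  // Conditional edge: the branch is explicit, inspectable, and unit-testable.
  .addConditionalEdges('plan', (s) => (s.steps.length >= 4 ? END : 'act'))
  .addEdge('act', 'plan')
  // Checkpointer: runs keyed by thread_id can pause and resume.
  .compile({ checkpointer: new MemorySaver() });

const out = await graph.invoke({ steps: [] }, { configurable: { thread_id: 'job-42' } });
console.log(out.steps);
```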

What it does not do. Lightness. There is a learning curve and a vocabulary to absorb (state schemas, edges, conditional edges, supervisors, swarms). For a simple chat agent, it is overkill. The TypeScript port exists and works but lags the Python version on features and documentation.

Where it bites. The control flow is explicit, which means refactoring an agent often means redrawing the graph. That is good for clarity and bad for iteration speed. The framework also has opinions about how state flows that you have to internalize before you can move quickly. Onboarding a new engineer to a LangGraph codebase takes longer than onboarding to most of the alternatives.

When to pick it. Your agent has real branching logic. You need durable, resumable execution. You are doing multi-agent coordination and want a framework that thought about it. You are in Python and want the most mature option. For multi-agent architectures specifically, this is often the right starting point.

Mastra

The TypeScript-first option that is trying to be what LangChain wishes it had been from day one. Mastra has gained traction quickly because it bundles the things you need (agents, workflows, memory, evals, observability) without forcing the LangChain mental model on you.

What it does well. Sensible defaults across the whole agent stack. The workflow primitive is closer to durable execution than Vercel AI SDK gives you, with built-in step-based execution and retry semantics. Memory is a first-class concept rather than something you bolt on. Evals are built in, which is rare and valuable. The DX is closer to "Rails for agents" than to a library you wire together yourself.
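
A sketch of the agent-plus-tool shape, based on the API in Mastra's docs at the time of writing; treat the import paths and option names as assumptions, since the project moves quickly.

```ts
import { Agent } from '@mastra/core/agent';
import { createTool } from '@mastra/core/tools';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const lookupOrder = createTool({
  id: 'lookup-order',
  description: 'Fetch the status of an order by id',
  inputSchema: z.object({ orderId: z.string() }),
  execute: async ({ context }) => ({ status: 'shipped', orderId: context.orderId }), // stubbed
});

const supportAgent = new Agent({
  name: 'support',
  instructions: 'Help users with questions about their orders.',
  model: openai('gpt-4o'), // Mastra reuses the AI SDK provider packages
  tools: { lookupOrder },
});

const reply = await supportAgent.generate('Where is order 123?');
console.log(reply.text);
```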

What it does not do. It is younger than the alternatives. Its APIs have moved more in the last year than those of the more established frameworks, which is normal for a project at this stage but worth knowing. The Python story does not exist in any meaningful way: this is a TypeScript world.

Where it bites. The bundled approach means you are taking a position on a lot of decisions at once. If you want to use Mastra's agent runtime but your own memory layer, that is doable but you are fighting some current. Lock-in is higher than with the more decomposed frameworks. The community is smaller than LangChain's so you will find fewer answers when you hit an edge case.

When to pick it. You want the integrated experience: workflows, memory, evals, agents in one toolkit. You are TypeScript-only. You are willing to take some opinions in exchange for moving faster. For a team building several agent features that need to share infrastructure, the consolidation pays off quickly.

OpenAI Agents SDK

OpenAI's official entry into the agent framework space. It is the natural choice if you are already building heavily on OpenAI's platform (Responses API, Realtime, file search, code interpreter) and want a thin, officially blessed wrapper around the agent loop.

What it does well. Tight integration with OpenAI's hosted tools. Things like file search, code interpreter, and the computer-use model are first-class rather than something you wire up yourself. The agent handoffs primitive is clean for multi-agent flows. The SDK is small and easy to read, which means you can reason about exactly what is happening when something breaks.
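
The handoff primitive in the TypeScript flavor of the SDK looks roughly like this; a sketch, with placeholder agent names and instructions.

```ts
import { Agent, run } from '@openai/agents';

const billing = new Agent({
  name: 'Billing agent',
  instructions: 'Handle billing and refund questions.',
});

const triage = new Agent({
  name: 'Triage agent',
  instructions: 'Route the user to the right specialist.',
  handoffs: [billing], // triage can transfer the conversation to billing
});

const result = await run(triage, 'I was double-charged last month.');
console.log(result.finalOutput);
```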

What it does not do. Multi-provider abstraction. The whole point is that it is OpenAI's SDK. You can stretch it to other providers but you are working against the grain. Long-running durable execution is not the focus. The visualization and inspection tooling is thinner than what LangGraph or Mastra ship.

Where it bites. Lock-in to OpenAI. If your bet is "OpenAI is the right model provider for the next several years," this is a fine bet, and the SDK rewards it. If your bet is "we want to be able to switch when the price-performance frontier moves," this is the wrong abstraction layer to build on.

When to pick it. You are committed to OpenAI's platform. You want their hosted tools (Realtime, file search, computer use) without rewriting integration code. You are doing handoff-style multi-agent and want a small SDK that does not force a graph model on you.

Pydantic AI

The Python option for teams that care about type safety and structured outputs above all else. Pydantic AI brings the same discipline that Pydantic brings to data validation: agents and tools are defined in terms of typed Python, and the framework leans on that to validate everything that flows through.

What it does well. Type safety is real. Tool inputs and outputs are validated automatically. Structured outputs from the model are validated automatically. The error messages when something does not match are useful in a way that most agent framework error messages are not. The framework is small enough that you can hold the whole thing in your head, which is a different kind of value than "feature-complete."

What it does not do. It is not trying to be everything. There is no built-in graph executor, no checkpointing-based durable execution, no first-class memory layer. You bring those if you need them. It is also Python-only, which is fine if Python is where you live and a hard stop if it is not.

Where it bites. The minimalism is a feature until it is not. For a complex multi-agent system with checkpointing and rich memory, you are augmenting Pydantic AI with other libraries until you have a custom stack. At some point the question becomes whether you should have started with LangGraph instead.

When to pick it. You are in Python. You want type safety to do real work for you. Your agent is moderately scoped (one or two flows, a handful of tools) rather than a full multi-agent system. You value being able to read every line of the framework you depend on.

Claude Agent SDK

Anthropic's official SDK for building agents on top of Claude. It is newer than the OpenAI Agents SDK but it has converged on a similar shape: a small, opinionated wrapper around the agent loop with first-class support for the platform's specific features (computer use, the Files API, prompt caching, the new memory primitives).

What it does well. The Claude-specific features (extended thinking, prompt caching, computer use, memory primitives) are wired in idiomatically. If you are building on Claude and want all of those features without writing the integration yourself, this is the cleanest path. Prompt caching specifically is much easier to get right when the SDK handles the structure for you.
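
The core entry point is an async query loop. A minimal sketch, assuming the `query()` shape from the SDK docs; the tool names in `allowedTools` are illustrative.

```ts
import { query } from '@anthropic-ai/claude-agent-sdk';

// query() runs the agent loop and streams messages back as they happen.
for await (const message of query({
  prompt: 'Summarize the open TODOs in this repo',
  options: { allowedTools: ['Read', 'Grep'] }, // illustrative tool names
})) {
  if (message.type === 'result') console.log(message);
}
```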

What it does not do. Same as the OpenAI counterpart in mirror image: the multi-provider story is intentionally weak because the SDK exists to make Claude easy to use. If you want to swap providers, you are at the wrong abstraction layer.

Where it bites. Same lock-in tradeoff as the OpenAI Agents SDK. If you are confident in Claude as your primary model, the SDK pays for itself. If you want optionality, you are using the wrong tool.

When to pick it. You are committed to Claude as your model. You want to use Claude-specific features (extended thinking, memory, caching, computer use) without building the integrations yourself. You are happy to trade portability for ergonomics.

What About LlamaIndex, CrewAI, AutoGen, etc.

The frameworks above are the ones I would actually pick for a production agent in 2026. The others are not bad, they are just not where I would start.

LlamaIndex is best understood as a data framework for LLMs rather than an agent framework. If your agent's hard problem is RAG over a complex document corpus, LlamaIndex is excellent at that part and you can wrap it with a thin agent loop from any of the frameworks above.

CrewAI is appealing for the multi-agent role-play model (researcher, writer, critic). It works well for prototypes and content workflows. For production agents that have to be reliable rather than creative, the role-play abstraction tends to add complexity without earning its keep.

AutoGen has been moving fast and the Microsoft backing is real, but the API has shifted enough times that the community knowledge is fragmented. I would wait for it to stabilize further before betting a production system on it.

There are also dozens of smaller frameworks. Most of them solve a problem one of the big ones already solves. Unless they are unambiguously better at something specific to your use case, the tooling and community around the larger frameworks are worth more than a marginal feature.

How to Actually Decide

Three questions in this order.

What language is your stack in. If TypeScript: Vercel AI SDK, Mastra, or one of the provider SDKs. If Python: LangGraph, Pydantic AI, or one of the provider SDKs. Cross-language teams should pick per-service rather than trying to pick one for the whole org.

How important is provider portability. If high: Vercel AI SDK or LangGraph. If low and you are committing to a specific provider: the OpenAI Agents SDK or Claude Agent SDK will give you the cleanest integration with that provider's platform features.

How complex is the agent's control flow. If simple (a chat with tools): Vercel AI SDK, Pydantic AI, or one of the provider SDKs. If complex (branching workflows, multi-agent coordination, durable resumption): LangGraph or Mastra.

The match between framework and use case matters more than the framework's overall quality. All of the frameworks above are genuinely good at what they are designed for. They are also genuinely painful when used outside that range. Picking the right one is mostly about being honest about which job you are doing.

What You Will Wish You Had Built In Anyway

Whichever framework you pick, a few things are not its job and you will end up needing them.

Observability is on you. Every framework ships some logging. None of them ship the dashboard that lets your team replay a failing conversation, see what tool was called, see what the model produced, and adjust the prompt. Build that early. The patterns I outlined for debugging agents in production apply across all the frameworks.
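
Even before you adopt a tracing product, a thin wrapper that records every tool call per run gets you most of the replay value. A minimal sketch (the event shape and in-memory sink are assumptions; in production you would write to a store or a tracing backend):

```ts
type TraceEvent = {
  runId: string;
  ts: number;
  kind: 'model_call' | 'tool_call' | 'error';
  detail: unknown;
};

const traceLog: TraceEvent[] = []; // swap for a real sink (DB, Langfuse, etc.)

function trace(runId: string, kind: TraceEvent['kind'], detail: unknown) {
  traceLog.push({ runId, ts: Date.now(), kind, detail });
}

// Wrap any tool handler so every call and failure is recorded and replayable.
function traced<A, R>(runId: string, name: string, fn: (args: A) => Promise<R>) {
  return async (args: A): Promise<R> => {
    trace(runId, 'tool_call', { name, args });
    try {
      return await fn(args);
    } catch (err) {
      trace(runId, 'error', { name, err: String(err) });
      throw err;
    }
  };
}

const safeSearch = traced('run-1', 'search', async (q: string) => q.toUpperCase());
```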

Cost tracking is on you. Token counts per turn, per user, per feature. The frameworks expose the data, none of them roll it up the way your finance team will eventually ask for. Token cost discipline does not happen on its own.
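
The roll-up itself is small once you capture usage per call. A sketch with illustrative per-million-token prices; look up your provider's real rates.

```ts
type Usage = { inputTokens: number; outputTokens: number };

// Illustrative prices per million tokens; real rates vary by model.
const PRICE = { inputPerM: 3.0, outputPerM: 15.0 };

const costByFeature = new Map<string, number>();

function recordUsage(feature: string, usage: Usage) {
  const cost =
    (usage.inputTokens / 1e6) * PRICE.inputPerM +
    (usage.outputTokens / 1e6) * PRICE.outputPerM;
  costByFeature.set(feature, (costByFeature.get(feature) ?? 0) + cost);
}

// Call this after every model turn, tagged by feature (and user, in practice).
recordUsage('support-chat', { inputTokens: 1200, outputTokens: 400 });
```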

Eval discipline is on you. Mastra ships some primitives. Most of the others assume you will build your own. Evals for solo developers is achievable with a small budget if you start early.

Security on tool execution is on you. The frameworks will dispatch tool calls. They will not stop the agent from calling a destructive tool with bad arguments because a prompt injection made it past your input handling. Sandboxing, allow-lists, and confirmation flows for destructive tools are still your job.
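
A dispatch gate in front of your tool handlers covers the basics: allow-list first, then a confirmation step for anything destructive. A sketch, with placeholder tool names and a placeholder confirm callback:

```ts
const DESTRUCTIVE = new Set(['delete_record', 'send_email']); // placeholder names
const ALLOWED = new Set(['search', 'lookup_order', ...DESTRUCTIVE]);

async function dispatchTool(
  name: string,
  args: unknown,
  handlers: Record<string, (args: unknown) => Promise<string>>,
  confirm: (name: string, args: unknown) => Promise<boolean>, // e.g. a UI prompt
): Promise<string> {
  // Allow-list: the model cannot call anything you did not register.
  if (!ALLOWED.has(name) || !handlers[name]) return `refused: ${name} is not allow-listed`;
  // Confirmation flow: destructive tools need an explicit human yes.
  if (DESTRUCTIVE.has(name) && !(await confirm(name, args))) {
    return 'refused: user declined the action';
  }
  return handlers[name](args);
}
```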

Memory beyond the conversation window is partially on you. The frameworks have varying support for persistent memory, but the policy questions (what to remember, what to forget, what to summarize, when to retrieve) are product decisions that no framework can make for you.

What I Would Actually Do Today

If I were starting a new agent project this week, the decision would look like this.

For a SaaS product with a chat or copilot interface, in TypeScript: Vercel AI SDK with a tight component registry for generative UI. Pair it with Inngest or Vercel Workflow for any background or durable steps. Add Langfuse or Helicone for observability.

For a multi-step automation that has to survive restarts and run for hours: LangGraph in Python, with checkpointing turned on, deployed somewhere with persistent storage. Pair it with the relevant data integration libraries (LlamaIndex if heavy RAG, otherwise straight retrievers).

For a content or research workflow with multiple agents collaborating: Mastra if TypeScript, LangGraph if Python. CrewAI for prototypes, but plan to migrate.

For an agent that lives entirely inside one provider's ecosystem (heavy use of OpenAI Realtime, or computer use, or hosted RAG): the matching provider SDK. The integration savings are real and the lock-in is the price of those features.

For a Python service that does one well-scoped agent task with strong typing: Pydantic AI. Keep the scope tight, accept the framework will not grow with you forever, and migrate if the agent grows beyond it.

The thing I would not do is pick a framework on vibes. The agent space changes fast enough that "popular this quarter" is not a reliable signal. The questions above produce a more durable answer because they are about your code rather than someone else's marketing.

The frameworks are all good now. That was not true two years ago. The job of picking one is no longer about avoiding the bad option, it is about matching the framework to the shape of the problem you actually have. Spend an hour answering the three questions honestly before writing a line of code, and the next six months will go a lot smoother than they would have otherwise.
