Tae Kim

Posted on Jul 1 • Originally published at compare-lab.xyz

15 AI Agent Frameworks, One Side-by-Side Table

#python #langchain #ai #agents

Four production agent projects in the last two years. Four different frameworks.

LangGraph for a stateful multi-step pipeline with human review gates at critical decision points. CrewAI for a research workflow that needed role-based task delegation. Pydantic AI for typed tool calls behind a thin API wrapper. OpenAI Agents SDK for one that was going to live inside the OpenAI runtime anyway.

Every one of those projects started the same way: two weeks of reading documentation, building toy demos, and trying to understand whether the problem called for a graph, a crew, a typed tool loop, or something else entirely.

The decision I kept failing to make fast was framework selection. Not because the information was hidden, but because it was scattered. GitHub stars on one site. Benchmark numbers on another, measuring canned tasks I did not recognize from real production. Marketing pages using language shaped to attract everyone rather than describe anything specific.

What I wanted was a table. Control style: graph, role crew, typed tool, conversational. State model: how the framework holds and passes state between steps. License. What the framework is actually shaped for. A liveness signal so I could tell whether the project was maintained. One row per framework. All in one place.

That table did not exist, so I built it.

What the table covers

15 frameworks at launch: LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents SDK, Mastra, LangChain Agents, LlamaIndex Agents, Semantic Kernel, Haystack Agents, Smolagents, Atomic Agents, Phidata, DSPy, AG2.

Each row has a tagline I would write to a friend making the selection decision, not the one from the project marketing page. Control style. State model. License. What the framework is shaped for. GitHub stars as a rough liveness proxy.
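As a rough illustration of the row shape described above, here is a minimal sketch in Python. The class name, field names, and the `filter_by_control_style` helper are my own assumptions for illustration, not the directory's actual data file schema, and the sample rows are placeholders rather than real entries.

```python
from dataclasses import dataclass
from enum import Enum


class ControlStyle(Enum):
    # The four control styles named in the post.
    GRAPH = "graph"
    ROLE_CREW = "role crew"
    TYPED_TOOL = "typed tool"
    CONVERSATIONAL = "conversational"


@dataclass(frozen=True)
class FrameworkRow:
    # Hypothetical field names; one instance per framework.
    name: str
    tagline: str
    control_style: ControlStyle
    state_model: str
    license: str
    shaped_for: str
    github_stars: int  # rough liveness proxy only


def filter_by_control_style(rows, style):
    """Return the rows whose control style matches `style`."""
    return [row for row in rows if row.control_style is style]


# Placeholder rows, not real data from the directory.
rows = [
    FrameworkRow("example-graph", "placeholder tagline", ControlStyle.GRAPH,
                 "explicit shared state", "placeholder-license",
                 "stateful pipelines", 0),
    FrameworkRow("example-crew", "placeholder tagline", ControlStyle.ROLE_CREW,
                 "per-role context", "placeholder-license",
                 "role-based delegation", 0),
]

graph_rows = filter_by_control_style(rows, ControlStyle.GRAPH)
print([row.name for row in graph_rows])
```

Keeping the control style as an enum rather than free text is one way to make filtering reliable; the real data file may well do it differently.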

The full directory: https://compare-lab.xyz/ai-agent-frameworks/

Why not a benchmark

I do not trust the public agent benchmarks right now. They measure tool-call success on small canned tasks and miss the three things that actually determine which framework fits a real project.

The first is state-model fit. If your task requires explicit state that survives across steps and branches conditionally, you need a framework built around that model, such as LangGraph or Semantic Kernel. If your task is naturally decomposable into independent roles, CrewAI or multi-agent AutoGen patterns fit better. Picking the wrong state model means fighting the framework rather than building with it.
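To make "explicit state that branches conditionally" concrete, here is a framework-free sketch in plain Python. It does not use LangGraph's or any other library's API; the node names, the `sensitive` flag, and the `run_graph` helper are assumptions chosen for illustration.

```python
# Each node takes the current state dict and returns an updated copy.
def draft(state):
    return {**state, "draft": "summary of " + state["topic"]}


def human_review(state):
    # Stand-in for a human review gate; a real gate would pause here.
    return {**state, "approved": True}


def publish(state):
    return {**state, "published": True}


NODES = {"draft": draft, "human_review": human_review, "publish": publish}


def next_node(current, state):
    """Conditional routing: the next step depends on the state so far."""
    if current == "draft":
        return "human_review" if state["sensitive"] else "publish"
    if current == "human_review":
        return "publish" if state["approved"] else None
    return None  # publish is terminal


def run_graph(start, state):
    visited = []
    current = start
    while current is not None:
        state = NODES[current](state)
        visited.append(current)
        current = next_node(current, state)
    return state, visited


_, sensitive_path = run_graph("draft", {"topic": "refunds", "sensitive": True})
_, routine_path = run_graph("draft", {"topic": "weather", "sensitive": False})
print(sensitive_path)
print(routine_path)
```

If a task looks like this, with one state object threaded through steps and routing decided by its contents, a graph-style framework is the natural fit. If it looks more like independent workers each handling a role, the state rarely needs to be shared this explicitly.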

The second is abstraction escape. Every framework has an abstraction layer designed for the common case. Production agents hit edge cases constantly. What matters is whether you can break out of the abstraction cleanly when you need to, without rewriting the whole pipeline. Some frameworks make this easy. Some make it very hard.

The third is failure recovery. What does the framework actually do when a tool call fails halfway through a long run? Can you retry from the last checkpoint? Can you log the partial state and restart at step three? These questions are invisible in toy demos and critical in production.
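The "restart at step three" question can be sketched without any framework. The following is a minimal illustration, assuming JSON-serializable state and a local checkpoint file; `run_with_checkpoints` and the three step functions are hypothetical names, and this is not how any particular framework implements checkpointing.

```python
import json
import os
import tempfile


def run_with_checkpoints(steps, state, checkpoint_path):
    """Run steps in order, saving state after each one.

    If a checkpoint file exists, resume from the step it recorded
    instead of starting over.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
        start, state = saved["next_step"], saved["state"]
    for index in range(start, len(steps)):
        state = steps[index](state)
        # Note: a production version would write atomically.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_step": index + 1, "state": state}, f)
    return state


calls = {"fetch": 0, "summarize": 0, "send": 0}


def fetch(state):
    calls["fetch"] += 1
    return {**state, "docs": ["a", "b"]}


def summarize(state):
    calls["summarize"] += 1
    return {**state, "summary": ",".join(state["docs"])}


def send(state):
    calls["send"] += 1
    if calls["send"] == 1:
        raise TimeoutError("simulated tool timeout")  # fails on first attempt
    return {**state, "sent": True}


steps = [fetch, summarize, send]
path = os.path.join(tempfile.mkdtemp(), "run.json")

try:
    run_with_checkpoints(steps, {}, path)
except TimeoutError:
    pass  # steps one and two are already checkpointed

final = run_with_checkpoints(steps, {}, path)  # resumes at step three
print(final, calls)
```

The second call skips `fetch` and `summarize` and retries only `send`. The sketch checkpoints between steps only, so a step that fails after partially completing an external action is still re-run in full on resume.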

Contributing

The data file is open. If you have shipped on one of these in production and a row gets a detail wrong, a correction is a one-line PR. If you have shipped on a framework not yet listed, send me the slug.

The longer write-up on why I went with a directory format instead of a benchmark: https://hannune.ai/blog/why-i-built-ai-agent-framework-compare.html

Top comments (5)

Alex Shev • Jul 1

The framework comparison gets much more useful when it includes failure modes, not just features. For agent work, I would add one more column: where state lives when a run is interrupted. That detail usually decides whether the system is demo-ready or production-ready.

Tae Kim • Jul 2

That column would have saved me weeks of conversation in at least two projects. State persistence under interruption is where most framework comparisons fall apart in practice: the feature matrix looks identical until the first retry after a tool timeout, and then you find out whether you're rebuilding from scratch or replaying from a checkpoint. LangGraph's Postgres checkpointer is the only one I've seen actually solve this cleanly in production; everything else either loses state or double-executes on resume.

Alex Shev • Jul 5

That matches what I have seen too. Checkpointing only solves half the problem unless tool side effects are also tracked.

The hard case is not "can the graph resume?" but "can it resume without charging the card twice, sending the email twice, or re-running a destructive step?" That is where idempotency becomes part of the framework story.

DockSky • Jul 2

Thank you for this table, and for the trio state model / abstraction escape / failure recovery. These are exactly the kind of criteria you do not see in demos, but they matter as soon as you want a tool you can rely on day to day.

I am not an "AI agents" specialist in the research or pure framework sense. I am building more of a business assistant (desktop app + API + MCP tools) where AI fits into an existing workflow: project context, tool calls, and sometimes human validation before a sensitive action.

In that context, I am not necessarily looking for the most sophisticated framework, but for one that integrates without breaking everything and that lets you resume cleanly after a tool error or a timeout, without restarting the whole conversation from scratch. Your exchange about state persistence on interruption resonated with me. That is often the line between "it works in testing" and "I can trust it in production."

I will go through the directory with that filter in mind. Thanks also for opening it to contributions. Real-world feedback makes this kind of comparison much more useful than benchmarks on toy tasks.

Tae Kim • Jul 3

That gap — "works in testing" versus "I can trust it in production" — is exactly what the state model column is trying to surface. For a business assistant with tool calls and human validation gates, the critical question is what the framework actually persists when an interruption happens: in-memory session state you lose on crash, or a durable snapshot you can replay from. LangGraph's Postgres checkpointer covers the latter, and that's the pairing I'd look at first if clean resume after tool error is the non-negotiable for your use case.