Four production agent projects in the last two years. Four different frameworks.
LangGraph for a stateful multi-step pipeline with human review gates at critical decision points. CrewAI for a research workflow that needed role-based task delegation. Pydantic AI for typed tool calls behind a thin API wrapper. OpenAI Agents SDK for a project that was going to live inside the OpenAI runtime anyway.
Every one of those projects started the same way: two weeks of reading documentation, building toy demos, and trying to understand whether the problem called for a graph, a crew, a typed tool loop, or something else entirely.
The decision I kept failing to make fast was framework selection. Not because the information was hidden, but because it was scattered. GitHub stars on one site. Benchmark numbers on another, measured on canned tasks that looked nothing like what I had run in production. Marketing pages written to attract everyone rather than describe anything specific.
What I wanted was a table. Control style: graph, role crew, typed tool, conversational. State model: how the framework holds and passes state between steps. License. What the framework is actually shaped for. A liveness signal so I could tell whether the project was maintained. One row per framework. All in one place.
That table did not exist, so I built it.
What the table covers
15 frameworks at launch: LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents SDK, Mastra, LangChain Agents, LlamaIndex Agents, Semantic Kernel, Haystack Agents, Smolagents, Atomic Agents, Phidata, DSPy, AG2.
Each row has a tagline I would write to a friend making the selection decision, not the one from the project marketing page. Control style. State model. License. What the framework is shaped for. GitHub stars as a rough liveness proxy.
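As a rough illustration, one row of that table could be modeled like this. This is a minimal sketch, not the directory's actual data format: the `FrameworkRow` class, its field names, and the `filter_by_control_style` helper are hypothetical, and the license and star values are placeholders rather than real data. The control styles for the three sample rows follow the project descriptions at the top of this post.

```python
from dataclasses import dataclass
from enum import Enum


class ControlStyle(Enum):
    GRAPH = "graph"
    ROLE_CREW = "role crew"
    TYPED_TOOL = "typed tool"
    CONVERSATIONAL = "conversational"


@dataclass(frozen=True)
class FrameworkRow:
    """One row of the comparison table (hypothetical schema)."""
    slug: str
    name: str
    tagline: str
    control_style: ControlStyle
    state_model: str
    license: str        # placeholder below, not real license data
    shaped_for: str
    github_stars: int   # placeholder below, rough liveness proxy only


def filter_by_control_style(rows, style):
    """Return only the rows that use the given control style."""
    return [row for row in rows if row.control_style is style]


# Sample rows with placeholder values for everything the post does not state.
rows = [
    FrameworkRow("langgraph", "LangGraph", "<tagline>", ControlStyle.GRAPH,
                 "<state model>", "<license>",
                 "stateful multi-step pipelines with human review gates", 0),
    FrameworkRow("crewai", "CrewAI", "<tagline>", ControlStyle.ROLE_CREW,
                 "<state model>", "<license>",
                 "role-based task delegation", 0),
    FrameworkRow("pydantic-ai", "Pydantic AI", "<tagline>", ControlStyle.TYPED_TOOL,
                 "<state model>", "<license>",
                 "typed tool calls behind a thin API wrapper", 0),
]

graph_frameworks = filter_by_control_style(rows, ControlStyle.GRAPH)
print([row.name for row in graph_frameworks])
```

Keeping control style as an enum rather than free text is one way to make the column filterable, which is the point of having a table in the first place.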
The full directory: https://compare-lab.xyz/ai-agent-frameworks/
Why not a benchmark
I do not trust the public agent benchmarks right now. They measure tool-call success on small canned tasks and miss the three things that actually determine which framework fits a real project.
The first is state-model fit. If your task requires explicit state that survives across steps and branches conditionally, you want a framework built around an explicit state model, such as LangGraph or Semantic Kernel. If your task decomposes naturally into independent roles, CrewAI or multi-agent AutoGen patterns fit better. Picking the wrong state model means fighting the framework rather than building with it.
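To make "explicit state that survives across steps and branches conditionally" concrete, here is a framework-free sketch in plain Python. It does not use any framework's API: `RunState`, the node functions, and `run` are all hypothetical names, and the review gate is simulated by approving the second revision.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    """Explicit state carried from step to step."""
    draft: str = ""
    revisions: int = 0
    approved: bool = False
    history: list = field(default_factory=list)


def write_draft(state):
    state.revisions += 1
    state.draft = f"draft v{state.revisions}"
    return "review"


def review(state):
    # Stand-in for a human review gate: approve once there are two revisions.
    state.approved = state.revisions >= 2
    # Conditional branch: loop back to writing or move on to publishing.
    return "publish" if state.approved else "write_draft"


def publish(state):
    return "END"


NODES = {"write_draft": write_draft, "review": review, "publish": publish}


def run(state, start="write_draft", max_steps=20):
    """Walk the graph; each node mutates the state and names the next node."""
    node = start
    for _ in range(max_steps):
        if node == "END":
            return state
        state.history.append(node)
        node = NODES[node](state)
    raise RuntimeError("step limit reached without finishing")


final = run(RunState())
print(final.history)
```

The shape to notice is that the next step depends on the state, not on a fixed sequence. A task that looks like this is a graph problem; a task where each role can work independently and hand off a result is not.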
The second is abstraction escape. Every framework has an abstraction layer designed for the common case. Production agents hit edge cases constantly. What matters is whether you can break out of the abstraction cleanly when you need to, without rewriting the whole pipeline. Some frameworks make this easy. Some make it very hard.
The third is failure recovery. What does the framework actually do when a tool call fails halfway through a long run? Can you retry from the last checkpoint? Can you log the partial state and restart at step three? These questions are invisible in toy demos and critical in production.
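The questions above can be sketched without any framework. The following is a minimal illustration, assuming state is JSON-serializable and steps are plain functions that take and return a dict; `run_with_checkpoints`, the step functions, and the simulated failure are all hypothetical and do not reflect how any listed framework implements checkpointing.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_path, state=None):
    """Run steps in order, saving state after each; resume if a checkpoint exists."""
    path = Path(checkpoint_path)
    if path.exists():
        saved = json.loads(path.read_text())
        state, next_index = saved["state"], saved["next_index"]
    else:
        state, next_index = dict(state or {}), 0

    for index in range(next_index, len(steps)):
        # If this raises, the checkpoint on disk still points at this step.
        state = steps[index](state)
        path.write_text(json.dumps({"state": state, "next_index": index + 1}))
    return state


calls = {"plan": 0, "fetch": 0}


def plan(state):
    calls["plan"] += 1
    return {**state, "plan": "ok"}


def flaky_fetch(state):
    calls["fetch"] += 1
    if calls["fetch"] == 1:
        raise TimeoutError("simulated tool failure")
    return {**state, "data": [1, 2, 3]}


def summarize(state):
    return {**state, "summary": len(state["data"])}


steps = [plan, flaky_fetch, summarize]
checkpoint = Path(tempfile.mkdtemp()) / "run.json"

try:
    run_with_checkpoints(steps, checkpoint)
except TimeoutError:
    pass  # the first attempt dies at step two; step one is already saved

result = run_with_checkpoints(steps, checkpoint)  # resumes at step two
print(result, calls)
```

The second call skips `plan` entirely and picks up at the failed step. Whether a framework gives you this behavior out of the box, makes you build it, or makes it impossible is exactly what a toy demo will not tell you.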
Contributing
The data file is open. If you have shipped on one of these in production and a row gets a detail wrong, a correction is a one-line PR. If you have shipped on a framework not yet listed, send me the slug.
The longer write-up on why I went with a directory format instead of a benchmark: https://hannune.ai/blog/why-i-built-ai-agent-framework-compare.html