Dextra Labs

Posted on May 20

Top 10 Agentic AI Frameworks Compared: LangGraph vs CrewAI vs AutoGen vs... (Benchmarks Inside)

#webdev #ai #programming #python

I spent six weeks running identical tasks through ten different frameworks so you don't have to argue about this in Slack anymore.

There's a conversation that happens in almost every engineering team building agents right now. Someone says "we should use LangChain." Someone else says "CrewAI is better for multi-agent stuff." A third person asks if anyone has looked at AutoGen. Nobody can agree because everyone is going off demos, blog posts and vibes.

I got tired of that conversation, so I ran actual benchmarks.

Six weeks, ten frameworks, five evaluation tasks repeated consistently across all of them. The tasks were chosen to reflect what production agent systems actually need to do, not what looks impressive in a README.

Here's what I tested and what I found.

The Benchmark Setup

Five evaluation dimensions:

Agent setup time: how long from pip install to a working agent with tool use. Measured in minutes, not "effort."

Tool integration complexity: how many lines of code to add a custom tool that calls an external API.

Multi-agent orchestration: can it coordinate multiple specialised agents? How cleanly?

Memory handling: does it support conversation memory and persistent context across sessions?

Error recovery: what happens when a tool call fails or returns unexpected output?
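
The tool integration dimension is a line count, so the counting convention matters. The write-up doesn't spell out the rule, so here is one plausible convention as a minimal sketch: count lines that are neither blank nor pure comments. The `count_code_lines` helper and the `get_weather` sample tool are hypothetical names used only for illustration.

```python
def count_code_lines(source: str) -> int:
    """Count lines that are neither blank nor pure comments."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


# Hypothetical custom tool, present only to exercise the counter.
sample_tool = '''
# Fetch the weather for a city
import json

def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})
'''

tool_loc = count_code_lines(sample_tool)
print(tool_loc)
```

Whatever rule you pick, apply it identically to every framework, otherwise the numbers aren't comparable.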

Hardware: M3 MacBook Pro, 32GB. All tests used Claude Sonnet as the underlying model via API for consistency. I'm not benchmarking model quality, I'm benchmarking framework overhead and developer experience.

The Frameworks

LangGraph, CrewAI, AutoGen, LlamaIndex Workflows, Haystack, OpenClaw, Semantic Kernel, Phidata, Pydantic AI and AgentOps.

Let's go through them.

1. LangGraph

Setup time: 18 minutes | Tool integration: Medium | Multi-agent: Excellent | Memory: Good | Error recovery: Excellent

LangGraph is the framework I'd recommend to engineers who think in graphs. The mental model (nodes are processing steps, edges are transitions, state flows through the graph) is powerful once it clicks and slightly alien until it does.

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    tool_results: list
    next_action: str

def research_node(state: AgentState):
    # Agent reasoning step
    return {"messages": [{"role": "assistant", "content": "Researching..."}]}

def tool_node(state: AgentState):
    # Tool execution step  
    return {"tool_results": ["result_data"]}

workflow = StateGraph(AgentState)
workflow.add_node("research", research_node)
workflow.add_node("tools", tool_node)
workflow.add_edge("research", "tools")
workflow.add_edge("tools", END)

app = workflow.compile()

The error recovery is genuinely impressive. You can route on failure explicitly (if this node fails, go here instead) using conditional edges, which produces resilient agent behaviour without try/except spaghetti scattered throughout your application code.

The trade-off: the learning curve is real. Engineers who aren't comfortable with graph-based thinking will fight the abstraction. The setup time reflects this: it took 18 minutes because I kept second-guessing the state schema design.
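
To make the fallback-routing idea concrete without depending on any framework, here is a minimal plain-Python sketch. It is not LangGraph code: the node names, the `run_graph` loop and the `route_after_tool` function are all hypothetical. The point is the shape: a node records its failure in state, and a routing function picks the next node from that state.

```python
from typing import Callable, Dict

State = Dict[str, object]


def flaky_tool(state: State) -> State:
    # Simulates a tool call that fails. The node records the error in state
    # instead of raising, so the router can react to it.
    try:
        raise TimeoutError("upstream API timed out")
    except TimeoutError as exc:
        return {**state, "error": str(exc)}


def fallback_tool(state: State) -> State:
    # Alternative path that only runs after a failure.
    return {**state, "error": None, "result": "cached answer", "path": "fallback"}


def route_after_tool(state: State) -> str:
    # Routing function: choose the next node name from the current state.
    return "fallback" if state.get("error") else "done"


NODES: Dict[str, Callable[[State], State]] = {
    "tool": flaky_tool,
    "fallback": fallback_tool,
}
ROUTES: Dict[str, Callable[[State], str]] = {
    "tool": route_after_tool,
    "fallback": lambda state: "done",
}


def run_graph(start: str, state: State) -> State:
    current = start
    while current != "done":
        state = NODES[current](state)
        current = ROUTES[current](state)
    return state


final_state = run_graph("tool", {"query": "revenue trend"})
print(final_state["path"], final_state["result"])
```

In LangGraph itself, the routing function would be attached with `add_conditional_edges` rather than a hand-written loop.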

2. CrewAI

Setup time: 8 minutes | Tool integration: Easy | Multi-agent: Excellent | Memory: Good | Error recovery: Medium

CrewAI has the most intuitive API of any framework on this list for multi-agent work. The Role-Task-Crew mental model maps directly to how you'd describe the work to another engineer.

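from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, ScrapeWebsiteTool

# One possible choice for the tools used below (crewai_tools package);
# SerperDevTool reads SERPER_API_KEY from the environment
search_tool = SerperDevTool()
web_scraper_tool = ScrapeWebsiteTool()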

researcher = Agent(
    role="Research Analyst",
    goal="Find accurate information about {topic}",
    backstory="Expert at synthesising complex information",
    verbose=True,
    tools=[search_tool, web_scraper_tool]
)

writer = Agent(
    role="Technical Writer",
    goal="Write clear documentation from research",
    backstory="Turns technical findings into readable content"
)

research_task = Task(
    description="Research {topic} and identify key technical details",
    agent=researcher,
    expected_output="Structured research summary with sources"
)

write_task = Task(
    description="Write documentation based on the research",
    agent=writer,
    expected_output="Complete technical document"
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential
)

result = crew.kickoff(inputs={"topic": "MCP protocol"})

8 minutes to a working multi-agent system. That's impressive and it reflects how well-designed the abstractions are.

The error recovery is where CrewAI shows its youth. When tools fail, the default behaviour is for the agent to retry with the same approach, which sometimes loops rather than adapts. You can override this with custom callbacks but it requires more configuration than LangGraph's graph-native error routing.
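
The retry-loop problem can be guarded against in ordinary Python, whatever framework sits on top. The sketch below is framework-independent and makes no claims about CrewAI's callback API: `call_with_strategies`, `primary_search` and `simplified_search` are hypothetical names. It caps the number of attempts and moves to a different strategy after each failure instead of repeating the same call.

```python
from typing import Callable, List, Optional

attempt_log: List[str] = []


def primary_search(query: str) -> str:
    # Hypothetical tool that is currently failing.
    attempt_log.append("primary")
    raise ConnectionError("search API unavailable")


def simplified_search(query: str) -> str:
    # Hypothetical fallback that searches on the first keyword only.
    attempt_log.append("simplified")
    return f"results for {query.split()[0]}"


def call_with_strategies(
    strategies: List[Callable[[str], str]],
    query: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Try each strategy once, in order, up to max_attempts in total."""
    for attempt, strategy in enumerate(strategies):
        if attempt >= max_attempts:
            break
        try:
            return strategy(query)
        except Exception:
            # Move on to the next strategy rather than repeating this one.
            continue
    return None


answer = call_with_strategies([primary_search, simplified_search], "MCP protocol")
print(answer, attempt_log)
```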

3. AutoGen

Setup time: 22 minutes | Tool integration: Medium | Multi-agent: Excellent | Memory: Medium | Error recovery: Good

AutoGen is Microsoft's framework and it shows, in the best possible way. The conversational multi-agent pattern, where agents literally message each other to collaborate, is different from CrewAI's task assignment model and genuinely powerful for complex reasoning chains.

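import autogen

# api_type selects AutoGen's Anthropic client (install the anthropic extra);
# without it the config is treated as an OpenAI-style endpoint
config_list = [{
    "model": "claude-sonnet-4-5",
    "api_key": "your_key",
    "api_type": "anthropic"
}]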

assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config={"config_list": config_list},
    system_message="You are a helpful coding assistant."
)

code_reviewer = autogen.AssistantAgent(
    name="code_reviewer", 
    llm_config={"config_list": config_list},
    system_message="You review code for bugs and improvements."
)

user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config={
        "work_dir": "coding",
        "use_docker": False
    }
)

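# Put all three agents in one group chat so the reviewer actually takes part
group_chat = autogen.GroupChat(
    agents=[user_proxy, assistant, code_reviewer],
    messages=[],
    max_round=6
)
manager = autogen.GroupChatManager(
    groupchat=group_chat,
    llm_config={"config_list": config_list}
)

# Start a multi-agent conversation
user_proxy.initiate_chat(
    manager,
    message="Write a Python function to parse nested JSON"
)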

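The code execution capability, where agents write and run code themselves, is genuinely useful, and not every framework handles it this cleanly. Note that with use_docker set to False, as in the snippet above, generated code runs directly on the host in the work_dir; enable Docker if you need real sandboxing.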

The 22-minute setup time reflects the Azure OpenAI configuration options and the number of agent parameters. Not complex, just verbose.
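
For readers new to the conversational pattern, this is roughly what "agents message each other" means, stripped of any framework. It's a toy sketch, not AutoGen code: `writer_reply` and `reviewer_reply` are hypothetical stand-ins for LLM calls, and the shared `history` list plays the role of the chat.

```python
from typing import Dict, List

Message = Dict[str, str]


def writer_reply(history: List[Message]) -> str:
    # Stand-in for an LLM call: revise once reviewer feedback is in the chat.
    if any(m["sender"] == "reviewer" for m in history):
        return "def parse(data): return json.loads(data)"
    return "def parse(data): return eval(data)"


def reviewer_reply(history: List[Message]) -> str:
    # Stand-in for a reviewing agent: reject unsafe code, approve otherwise.
    last = history[-1]["content"]
    return "REJECT: eval is unsafe" if "eval(" in last else "APPROVE"


def converse(max_rounds: int = 4) -> List[Message]:
    history: List[Message] = [{"sender": "user", "content": "Write a JSON parser"}]
    for _ in range(max_rounds):
        history.append({"sender": "writer", "content": writer_reply(history)})
        verdict = reviewer_reply(history)
        history.append({"sender": "reviewer", "content": verdict})
        if verdict == "APPROVE":
            break
    return history


transcript = converse()
for message in transcript:
    print(f'{message["sender"]}: {message["content"]}')
```

The `max_rounds` cap does the same job as a round limit in a real group chat: it stops two agents from disagreeing forever.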

4. LlamaIndex Workflows

Setup time: 15 minutes | Tool integration: Easy | Multi-agent: Good | Memory: Excellent | Error recovery: Medium

LlamaIndex has the best RAG integration of any framework here, which makes sense given its origins. If your agent needs to reason over large document collections, LlamaIndex Workflows is the framework that handles this without bolted-on complexity.

from llama_index.core.workflow import Workflow, StartEvent, StopEvent, step, Event

class ResearchEvent(Event):
    query: str
    research_results: str

class ResearchWorkflow(Workflow):
    @step
    async def research(self, ev: StartEvent) -> ResearchEvent:
        # Query documents and retrieve context
        # (query_index is a placeholder method you implement yourself)
        results = await self.query_index(ev.query)
        # Carry the retrieved context forward on the event
        return ResearchEvent(query=ev.query, research_results=results)

    @step
    async def analyse(self, ev: ResearchEvent) -> StopEvent:
        # Synthesise the retrieved context
        # (synthesise is likewise a placeholder method)
        analysis = await self.synthesise(ev.query, ev.research_results)
        return StopEvent(result=analysis)

workflow = ResearchWorkflow(timeout=60, verbose=True)
result = await workflow.run(query="Explain the MCP protocol")

The event-driven architecture is clean once you understand it. The memory handling, particularly for RAG-heavy workloads, is the best on this list.

Where it falls short: the multi-agent orchestration requires more manual wiring than CrewAI or AutoGen. It's capable, but you're doing more of the coordination work yourself.
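
If the event-driven model is unfamiliar, the core mechanism fits in a few lines of standard-library Python. This is a simplified sketch, not how LlamaIndex is implemented: the `Start`, `Retrieved` and `Stop` classes and the `HANDLERS` table are hypothetical. Each step consumes one event type and emits the next, and the loop dispatches on the event's type until a stop event appears.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Start:
    query: str


@dataclass
class Retrieved:
    query: str
    context: str


@dataclass
class Stop:
    result: str


async def retrieve(ev: Start) -> Retrieved:
    # Stand-in for a document retrieval step.
    return Retrieved(query=ev.query, context=f"docs about {ev.query}")


async def summarise(ev: Retrieved) -> Stop:
    # Stand-in for a synthesis step that uses the retrieved context.
    return Stop(result=f"summary of {ev.context}")


# Dispatch table: which step handles which event type.
HANDLERS = {Start: retrieve, Retrieved: summarise}


async def run_workflow(query: str) -> str:
    event = Start(query=query)
    while not isinstance(event, Stop):
        event = await HANDLERS[type(event)](event)
    return event.result


workflow_result = asyncio.run(run_workflow("MCP"))
print(workflow_result)
```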

5. Haystack

Setup time: 12 minutes | Tool integration: Easy | Multi-agent: Good | Memory: Good | Error recovery: Good

Haystack's pipeline-based architecture is the most auditable of any framework here. Every processing step is explicit, the data flow is visible and the system is straightforward to debug when something goes wrong.

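from haystack import Pipeline
from haystack.components.routers import ConditionalRouter
# Anthropic support ships in the anthropic-haystack integration package
from haystack_integrations.components.generators.anthropic import AnthropicGenerator

# Each route forwards the query string to one named output when its condition holds.
# (MetadataRouter routes Documents, so it can't feed a generator's prompt input.)
routes = [
    {"condition": "{{ task == 'search' }}", "output": "{{ query }}",
     "output_name": "search", "output_type": str},
    {"condition": "{{ task == 'analysis' }}", "output": "{{ query }}",
     "output_name": "analysis", "output_type": str},
]

pipeline = Pipeline()
pipeline.add_component("router", ConditionalRouter(routes))
pipeline.add_component("search_agent", AnthropicGenerator(
    model="claude-sonnet-4-5"
))
pipeline.add_component("analysis_agent", AnthropicGenerator(
    model="claude-sonnet-4-5"
))

pipeline.connect("router.search", "search_agent.prompt")
pipeline.connect("router.analysis", "analysis_agent.prompt")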

For teams with compliance or audit requirements, the explicit pipeline structure makes Haystack genuinely preferable to more opaque frameworks. You can answer "what did this agent do and why" clearly from the pipeline logs.

The trade-off: less dynamic than graph-based frameworks. Complex conditional reasoning is harder to express in pipeline terms.
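
The auditability argument is easier to see in miniature. The sketch below is plain Python, not Haystack: `run_audited` and the two example steps are hypothetical. Because every step is an explicit named stage, recording its input and output is a single line in the runner, and the resulting log answers "what happened, in what order".

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_audited(steps: List[Step], payload: Any) -> Tuple[Any, List[Dict[str, Any]]]:
    """Run steps in order, recording each step's input and output."""
    audit_log: List[Dict[str, Any]] = []
    for name, fn in steps:
        result = fn(payload)
        audit_log.append({"step": name, "input": payload, "output": result})
        payload = result
    return payload, audit_log


def classify(text: str) -> str:
    # Toy classifier standing in for a routing decision.
    return "search" if text.startswith("find") else "analysis"


steps: List[Step] = [("normalise", str.strip), ("classify", classify)]
output, audit_log = run_audited(steps, "  find MCP docs ")

for entry in audit_log:
    print(entry)
```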

6. OpenClaw

Setup time: 25 minutes | Tool integration: Medium | Multi-agent: Good | Memory: Good | Error recovery: Good

OpenClaw is the self-hosted option on this list and the one worth knowing about if data privacy is a requirement. There are no API calls to external services; everything runs in your infrastructure.

from openclaw import Agent, LocalLLM, Tool

llm = LocalLLM(
    model_path="./models/llama-3.1-70b-q4",
    context_length=8192
)

@Tool.register("database_query")
async def query_db(sql: str) -> dict:
    conn = await get_db_connection()
    result = await conn.execute(sql)
    return {"rows": result.fetchall()}

agent = Agent(
    llm=llm,
    tools=["database_query"],
    system_prompt="You are a data analysis assistant.",
    memory_enabled=True
)

response = await agent.run(
    "Analyse the monthly revenue trend from the sales table"
)

The 25-minute setup reflects model download and local configuration. Once running, the performance is solid for its class.

The honest trade-off: the model capability ceiling is lower than API-based frameworks unless you have significant local compute. For use cases where data residency matters more than peak performance, it's worth it. For OpenClaw's full architecture and where it fits in the self-hosted landscape, the Dextra deep-dive covers what a README can't.

7. Semantic Kernel

Setup time: 20 minutes | Tool integration: Easy | Multi-agent: Good | Memory: Good | Error recovery: Good

Microsoft's other agent framework. Where AutoGen is built around conversational multi-agent patterns, Semantic Kernel is built around plugins and planners, a more structured, less conversational approach.

import semantic_kernel as sk
from semantic_kernel.connectors.ai.anthropic import AnthropicChatCompletion
# The decorator lives in the functions module, not at the package top level
from semantic_kernel.functions import kernel_function

kernel = sk.Kernel()
kernel.add_service(AnthropicChatCompletion(
    ai_model_id="claude-sonnet-4-5",
    api_key="your_key"
))

@kernel_function(name="analyse_data", description="Analyse dataset")
async def analyse_data(kernel: sk.Kernel, data: str) -> str:
    return f"Analysis of: {data}"

kernel.add_function(plugin_name="DataPlugin", function=analyse_data)

The .NET-first heritage shows in the C# documentation being significantly better than the Python docs. If your team works in .NET, Semantic Kernel is the clear choice. For Python-first teams, it's capable but requires tolerance for occasionally thin Python documentation.

8. Phidata

Setup time: 6 minutes | Tool integration: Very easy | Multi-agent: Good | Memory: Excellent | Error recovery: Medium

Phidata has the fastest setup time on this list (six minutes is genuinely six minutes), and the built-in storage integrations (PostgreSQL, SQLite, Redis) for agent memory are better out of the box than almost any other framework here.

from phi.agent import Agent
from phi.model.anthropic import Claude
from phi.tools.duckduckgo import DuckDuckGo
from phi.storage.agent.sqlite import SqlAgentStorage

agent = Agent(
    model=Claude(id="claude-sonnet-4-5"),
    tools=[DuckDuckGo()],
    storage=SqlAgentStorage(
        table_name="agent_sessions",
        db_file="agent_memory.db"
    ),
    add_history_to_messages=True,
    num_history_responses=5,
    show_tool_calls=True
)

agent.print_response("What are the latest developments in MCP?")

The trade-off for the fast setup: less flexibility for complex custom orchestration. Phidata is excellent for building agents quickly with solid memory. It's less suited for intricate multi-agent coordination patterns.
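
What a storage-backed memory setting buys you can be sketched with nothing but the standard library. This is not Phidata's implementation: the table layout and the `open_memory`, `remember` and `recent_history` helpers are assumptions for illustration. Messages are persisted per session, and only the most recent few are loaded back into the prompt.

```python
import sqlite3
from typing import List, Tuple


def open_memory(db_file: str = ":memory:") -> sqlite3.Connection:
    # Use a file path instead of ":memory:" to persist across processes.
    conn = sqlite3.connect(db_file)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "session_id TEXT, role TEXT, content TEXT)"
    )
    return conn


def remember(conn: sqlite3.Connection, session_id: str, role: str, content: str) -> None:
    conn.execute(
        "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )
    conn.commit()


def recent_history(
    conn: sqlite3.Connection, session_id: str, limit: int = 5
) -> List[Tuple[str, str]]:
    # Fetch the newest rows, then reverse so the result is oldest-first.
    rows = conn.execute(
        "SELECT role, content FROM messages "
        "WHERE session_id = ? ORDER BY id DESC LIMIT ?",
        (session_id, limit),
    ).fetchall()
    return list(reversed(rows))


conn = open_memory()
for i in range(7):
    remember(conn, "session-1", "user", f"message {i}")

history = recent_history(conn, "session-1")
print(history)
```

Capping the history at a fixed number of messages is the trade-off to watch: it keeps token usage bounded, but anything older than the window is invisible to the agent unless you summarise it.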

9. Pydantic AI

Setup time: 10 minutes | Tool integration: Very easy | Multi-agent: Medium | Memory: Medium | Error recovery: Excellent

If you already use Pydantic (and most Python developers do), Pydantic AI's mental model will feel immediately familiar. The typed output validation is the best of any framework here: if your agent produces structured data, Pydantic AI validates it against your schema and retries when it doesn't conform, so your code only ever receives output that matches.

from pydantic_ai import Agent
from pydantic import BaseModel

class AnalysisResult(BaseModel):
    summary: str
    key_findings: list[str]
    confidence_score: float
    recommendations: list[str]

agent = Agent(
    'anthropic:claude-sonnet-4-5',  # provider-prefixed model name
    result_type=AnalysisResult,  # named output_type in newer releases
    system_prompt="Analyse the provided data and return structured insights."
)

result = await agent.run("Analyse Q3 2025 performance metrics: revenue up 23%...")
# result.data is named result.output in newer releases
print(result.data.key_findings)  # Validated as list[str]
print(result.data.confidence_score)  # Validated as float

The error recovery is excellent specifically because validation happens at the framework level, not just at the application level. If the model produces output that doesn't match the schema, Pydantic AI retries automatically with corrective context.

The multi-agent orchestration is the weak point. It's not impossible but it requires more manual coordination than dedicated multi-agent frameworks.
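
The validate-then-retry loop described above is worth seeing in isolation. The following is a standard-library sketch of the idea, not Pydantic AI's internals: `validate`, `run_with_validation` and `fake_model` are hypothetical, and the checks are hand-written where Pydantic would derive them from the schema.

```python
import json
from typing import Callable, Optional


def validate(raw: str) -> dict:
    """Parse model output and check the two fields this example cares about."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    if not isinstance(data.get("key_findings"), list):
        raise ValueError("key_findings must be a list")
    if not isinstance(data.get("confidence_score"), (int, float)):
        raise ValueError("confidence_score must be a number")
    return data


def run_with_validation(
    model: Callable[[str, Optional[str]], str],
    prompt: str,
    max_retries: int = 2,
) -> dict:
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        raw = model(prompt, feedback)
        try:
            return validate(raw)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError too
            # Feed the validation error back so the next attempt can correct it.
            feedback = f"Previous output was invalid: {exc}"
    raise RuntimeError("model never produced valid output")


def fake_model(prompt: str, feedback: Optional[str]) -> str:
    # Stand-in for an LLM: wrong types first, corrected once it sees feedback.
    if feedback is None:
        return '{"key_findings": "revenue up", "confidence_score": "high"}'
    return '{"key_findings": ["revenue up 23%"], "confidence_score": 0.8}'


validated = run_with_validation(fake_model, "Analyse Q3 metrics")
print(validated)
```

Note the final `raise`: retries are bounded, so the caller gets either conforming data or an explicit failure, never silently malformed output.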

10. AgentOps

Setup time: 14 minutes | Tool integration: Medium | Multi-agent: Good | Memory: Medium | Error recovery: Good

AgentOps is different from the others: it's less a standalone framework and more an observability and orchestration layer that wraps other frameworks. If you're already using LangGraph or CrewAI and need production monitoring, cost tracking and session replay, AgentOps is the integration to look at.

import agentops
from agentops import track_agent, record_tool

agentops.init(api_key="your_key")

@track_agent(name="research_agent")
class ResearchAgent:
    @record_tool("web_search")
    async def search(self, query: str) -> str:
        results = await perform_search(query)
        return results

    async def run(self, task: str):
        research = await self.search(task)
        return research

agent = ResearchAgent()
result = await agent.run("Research agentic AI frameworks")
agentops.end_session("Success")

In production, the cost per session tracking alone makes this worth evaluating. Knowing which agent workflows are burning tokens without producing value is information you need and that most frameworks don't surface cleanly.
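
Per-session cost tracking is simple arithmetic once token counts are recorded. The sketch below is not the AgentOps API: `CostTracker` is a hypothetical class, and the prices in `PRICE_PER_MTOK` are assumed values for illustration only, so substitute your provider's current rates.

```python
from collections import defaultdict
from typing import Dict

# Assumed prices in USD per one million tokens; illustrative only.
PRICE_PER_MTOK: Dict[str, float] = {"input": 3.00, "output": 15.00}


class CostTracker:
    """Accumulate token usage per session and convert it to a dollar cost."""

    def __init__(self) -> None:
        self.tokens = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, session_id: str, input_tokens: int, output_tokens: int) -> None:
        usage = self.tokens[session_id]
        usage["input"] += input_tokens
        usage["output"] += output_tokens

    def cost(self, session_id: str) -> float:
        usage = self.tokens[session_id]
        total = sum(usage[kind] * PRICE_PER_MTOK[kind] for kind in usage)
        return total / 1_000_000


tracker = CostTracker()
tracker.record("research", input_tokens=200_000, output_tokens=10_000)
tracker.record("research", input_tokens=100_000, output_tokens=10_000)
research_cost = tracker.cost("research")
print(f"research session: ${research_cost:.2f}")
```

Here 300,000 input tokens and 20,000 output tokens come to $0.90 + $0.30 = $1.20 under the assumed prices.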

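The Benchmark Summary

Framework | Setup time | Tool integration | Multi-agent | Memory | Error recovery
LangGraph | 18 minutes | Medium | Excellent | Good | Excellent
CrewAI | 8 minutes | Easy | Excellent | Good | Medium
AutoGen | 22 minutes | Medium | Excellent | Medium | Good
LlamaIndex Workflows | 15 minutes | Easy | Good | Excellent | Medium
Haystack | 12 minutes | Easy | Good | Good | Good
OpenClaw | 25 minutes | Medium | Good | Good | Good
Semantic Kernel | 20 minutes | Easy | Good | Good | Good
Phidata | 6 minutes | Very easy | Good | Excellent | Medium
Pydantic AI | 10 minutes | Very easy | Medium | Medium | Excellent
AgentOps | 14 minutes | Medium | Good | Medium | Good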

My Actual Recommendations

For a new production agent system: LangGraph. The learning curve is real, but the error recovery and state management are worth it at scale.

For a team that needs something working this week: CrewAI. The time-to-working-system is the best of the serious frameworks.

For document-heavy RAG agent work: LlamaIndex Workflows. Nothing else handles this as naturally.

For regulated environments needing audit trails: Haystack. The pipeline explicitness isn't a limitation, it's the feature.

For self-hosted with data privacy requirements: OpenClaw. The setup overhead is the price of keeping data in your infrastructure.

The full breakdown with additional benchmarks on the top agentic AI frameworks in 2026 is published for teams who want more than fits in a single article.

If you're evaluating open-source options specifically, OpenClaw is worth a look for self-hosted use cases. We published a deep-dive on its architecture, deployment patterns and where it fits relative to the managed alternatives.

Published by Dextra Labs | AI Consulting & Enterprise Agent Development
