DEV Community

Sai Vishwak

Benchmarking AI Agent Frameworks in 2026: AutoAgents (Rust) vs LangChain, LangGraph, LlamaIndex, PydanticAI, and more

Why we ran this benchmark

Every AI agent framework claims to be production-ready. Few of them tell you what "production" actually costs in CPU, RAM, and latency. We built AutoAgents — a Rust-native framework for building tool-using AI agents — and wanted to know honestly how it performs against the established Python and Rust players under identical conditions.

This post covers the methodology, the raw numbers, and what we think they mean (and don't mean).


The Task

We picked a task that's representative of real-world agentic workloads: a ReAct-style agent that receives a question, decides to call a tool, processes a parquet file to compute average trip duration, and returns a formatted answer.

This tests:

  • LLM planning (tool selection)
  • Tool execution (actual parquet parsing and computation)
  • Result formatting and response generation

It's not a toy "what's 2+2" benchmark, but it's also a single-step tool call — not a long-horizon multi-agent workflow. We note this limitation upfront.
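To make the workload concrete, here is a minimal sketch of what the tool side of this task can look like, assuming a pandas-based implementation; the function name and the `trip_duration` column are hypothetical, not taken from the actual benchmark code:

```python
import pandas as pd

def average_trip_duration(df: pd.DataFrame) -> str:
    """Tool body: compute the mean trip duration and format an answer string
    for the agent to return."""
    avg = df["trip_duration"].mean()
    return f"Average trip duration: {avg:.2f} minutes"

# In the real tool, the DataFrame would come from the parquet file, e.g.:
# df = pd.read_parquet("trips.parquet")
```

The framework's job is everything around this function: exposing it to the LLM as a tool, parsing the model's tool call, executing it, and feeding the string back for the final response.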


Setup

Model: gpt-5.1 (same across all frameworks)
Requests: 50 total, 10 concurrent (higher concurrency exceeded the provider's tokens-per-minute rate limit, so we capped it there)
Machine: Same hardware for all runs, no process affinity pinning
Measured: end-to-end latency (P50, P95, P99), throughput (req/s), peak RSS memory (MB), CPU usage (%), cold-start time (ms), determinism rate (same output across runs)
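The latency percentiles are computed from the per-request timings; a stdlib sketch using the nearest-rank method (the benchmark's exact interpolation method may differ):

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample covering p% of the distribution."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

latencies = [5200.0, 5600.0, 5900.0, 6100.0, 9800.0]
p50 = percentile(latencies, 50)  # 5900.0
p95 = percentile(latencies, 95)  # 9800.0
```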

All frameworks achieved 100% success rate (50/50). CrewAI was excluded after it showed a 44% failure rate under the same conditions.

Benchmark code and raw JSON are in the repo: https://github.com/liquidos-ai/autoagents-bench


Results

| Framework  | Language | Avg Latency | P95 Latency | Throughput | Peak Memory | CPU   | Cold Start | Score |
|------------|----------|-------------|-------------|------------|-------------|-------|------------|-------|
| AutoAgents | Rust     | 5,714 ms    | 9,652 ms    | 4.97 rps   | 1,046 MB    | 29.2% | 4 ms       | 98.03 |
| Rig        | Rust     | 6,065 ms    | 10,131 ms   | 4.44 rps   | 1,019 MB    | 24.3% | 4 ms       | 90.06 |
| LangChain  | Python   | 6,046 ms    | 10,209 ms   | 4.26 rps   | 5,706 MB    | 64.0% | 62 ms      | 48.55 |
| PydanticAI | Python   | 6,592 ms    | 11,311 ms   | 4.15 rps   | 4,875 MB    | 53.9% | 56 ms      | 48.95 |
| LlamaIndex | Python   | 6,990 ms    | 11,960 ms   | 4.04 rps   | 4,860 MB    | 59.7% | 54 ms      | 43.66 |
| GraphBit   | JS/TS    | 8,425 ms    | 14,388 ms   | 3.14 rps   | 4,718 MB    | 44.6% | 138 ms     | 22.53 |
| LangGraph  | Python   | 10,155 ms   | 16,891 ms   | 2.70 rps   | 5,570 MB    | 39.7% | 63 ms      | 0.85  |

Composite score is a weighted, min-max normalized aggregate across all dimensions (latency 27.8%, throughput 33.3%, memory 22.2%, CPU efficiency 16.7%).

Breaking Down the Numbers

Memory: The Biggest Gap

The most striking result isn't latency — it's memory.

AutoAgents peaks at 1,046 MB. The non-Rust frameworks average 5,146 MB at peak. That's a ~5× difference on a single-agent workload.

At deployment scale (50 instances):

| Framework  | Total RAM needed |
|------------|------------------|
| AutoAgents | ~51 GB           |
| Rig        | ~50 GB           |
| LangChain  | ~279 GB          |
| LangGraph  | ~272 GB          |
| PydanticAI | ~238 GB          |
| LlamaIndex | ~237 GB          |
| GraphBit   | ~230 GB          |
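The totals above follow directly from peak RSS times instance count; the arithmetic, with MiB-to-GiB conversion:

```python
def fleet_ram_gb(peak_rss_mb: float, instances: int = 50) -> float:
    """Total RAM for a fleet of identical instances, converted MiB -> GiB."""
    return peak_rss_mb * instances / 1024

print(round(fleet_ram_gb(1046)))  # AutoAgents: 51
print(round(fleet_ram_gb(5706)))  # LangChain: 279
```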

Python frameworks carry baseline weight you pay even when idle: interpreter, dependency tree, dynamic dispatch, GC. Rust's ownership model means memory is freed immediately when objects go out of scope — no GC heap to keep around.

Latency: Smaller Gap, Still Real

Latency differences are more nuanced. The LLM network round-trip dominates, which is why all frameworks cluster between 5,700 and 7,000 ms. The outliers (GraphBit at 8,425 ms, LangGraph at 10,155 ms) reflect additional framework orchestration overhead.

AutoAgents beats the average non-Rust framework by 25% on latency, and beats LangGraph by 43.7%.

The P95 numbers diverge more:

  • AutoAgents P95: 9,652 ms
  • LangGraph P95: 16,891 ms

At the tail end — the requests that matter most for user-perceived reliability — the gap widens significantly.

Throughput

AutoAgents delivers 4.97 rps vs an average of 3.66 rps across the non-Rust frameworks — 36% more throughput under the same concurrency. Against LangGraph specifically, it's 84% more throughput (4.97 vs 2.70 rps).

Higher throughput per instance means you need fewer instances to serve the same load.
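That capacity difference translates directly into fleet size. For an illustrative (hypothetical) target of 100 rps:

```python
import math

def instances_needed(target_rps: float, per_instance_rps: float) -> int:
    """Minimum instance count to serve a target load, ignoring headroom."""
    return math.ceil(target_rps / per_instance_rps)

print(instances_needed(100, 4.97))  # AutoAgents: 21
print(instances_needed(100, 2.70))  # LangGraph: 38
```

In practice you'd add headroom on top of this, but the ratio between frameworks holds.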

Cold Start

This is where Rust's near-zero initialization really shows:

  • AutoAgents: 4 ms
  • LangChain: 62 ms (15× slower)
  • PydanticAI: 56 ms (14× slower)
  • LlamaIndex: 54 ms (14× slower)
  • GraphBit: 138 ms (34× slower)
  • LangGraph: 63 ms (16× slower)

For serverless deployments or auto-scaling scenarios where instances spin up on demand, a 4 ms cold start vs 60–140 ms is a qualitative difference in user experience.
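Cold-start gaps of this magnitude are easy to observe yourself. A rough sketch (not the benchmark's actual harness) that times a fresh interpreter launch plus an import:

```python
import subprocess
import sys
import time

def startup_ms(code: str = "pass") -> float:
    """Wall-clock time to launch a fresh interpreter and run `code`."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True)
    return (time.perf_counter() - start) * 1000

bare = startup_ms()                    # interpreter startup alone
with_deps = startup_ms("import json")  # plus one stdlib import
```

Swap `"import json"` for a framework's top-level import to see how quickly dependency trees dominate startup time.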

CPU Usage

CPU tells a more nuanced story. Rig (Rust) runs at 24.3% — the most efficient. AutoAgents runs at 29.2%. Among the Python frameworks, LangChain consumes the most at 64.0%. High CPU usage means less headroom for burst traffic without throttling.

The throughput-per-CPU efficiency ranking broadly tracks the composite score, with Rig the most efficient on that single metric.

How We Scored Frameworks

The composite score uses min-max normalization so every dimension is on a consistent 0–1 scale (best = 1, worst = 0), regardless of unit or direction:

```
score = mmLow(latency)     × 27.8%   # lower is better
      + mmLow(memory)      × 22.2%   # lower is better
      + mmHigh(throughput) × 33.3%   # higher is better
      + mmHigh(cpu_eff)    × 16.7%   # rps/cpu%, higher is better

where mmHigh(v, min, max) = (v - min) / (max - min)
      mmLow(v,  min, max) = (max - v) / (max - min)
```
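The formula maps to a short implementation. The sketch below uses the values from the results table and assumes the published scores are the weighted 0–1 sum rescaled to 0–100; small rounding differences from the published numbers are expected:

```python
def min_max(values: list[float]) -> list[float]:
    """Scale values to [0, 1]: min -> 0, max -> 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# (avg latency ms, peak memory MB, throughput rps, CPU %)
frameworks = {
    "AutoAgents": (5714, 1046, 4.97, 29.2),
    "Rig":        (6065, 1019, 4.44, 24.3),
    "LangChain":  (6046, 5706, 4.26, 64.0),
    "PydanticAI": (6592, 4875, 4.15, 53.9),
    "LlamaIndex": (6990, 4860, 4.04, 59.7),
    "GraphBit":   (8425, 4718, 3.14, 44.6),
    "LangGraph":  (10155, 5570, 2.70, 39.7),
}

names = list(frameworks)
lat, mem, thr, cpu = zip(*frameworks.values())
cpu_eff = [t / c for t, c in zip(thr, cpu)]  # rps per CPU%

lat_n = [1 - v for v in min_max(list(lat))]  # mmLow: lower is better
mem_n = [1 - v for v in min_max(list(mem))]  # mmLow: lower is better
thr_n = min_max(list(thr))                   # mmHigh: higher is better
eff_n = min_max(cpu_eff)                     # mmHigh: higher is better

scores = {
    name: 100 * (0.278 * l + 0.222 * m + 0.333 * t + 0.167 * e)
    for name, l, m, t, e in zip(names, lat_n, mem_n, thr_n, eff_n)
}

for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name:11s} {scores[name]:6.2f}")
```

Running this reproduces the published ranking, with AutoAgents near 98 and LangGraph under 1.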

Weights reflect what matters at production scale: throughput is the primary capacity driver (33.3%), latency is user-facing (27.8%), memory drives infrastructure cost (22.2%), and CPU efficiency determines burst headroom (16.7%).


What This Benchmark Doesn't Cover

  • Multi-step agents: We only benchmark single tool-call ReAct loops. Long-horizon planning with many LLM calls may change the picture.
  • Multi-agent systems: Frameworks designed for agent orchestration (LangGraph, CrewAI) are arguably optimized for complexity we didn't measure.
  • Answer quality: Determinism rate tracks whether the output is consistent, not whether it's correct by a human rubric.
  • Streaming: All results are blocking responses. Streaming latency profiles differ.
  • Different models: These results are specific to gpt-5.1 (the model listed in Setup). Different models with different token sizes will shift the LLM-dominated portion of latency.

If these gaps are important for your use case, we'd welcome contributions that extend the benchmark suite.


Takeaway

If you're choosing an AI agent framework for a production system where infrastructure cost and reliability under load matter, the memory footprint of Python frameworks is a real constraint. AutoAgents and Rig both stay under 1.1 GB peak — all Python frameworks measured exceeded 4.7 GB.

The throughput and latency advantages are meaningful but not dramatic for single-agent tasks. The memory advantage is 5×, and it's structural — not something you tune away with configuration.

We're continuing to extend the benchmark with more task types, multi-step workflows, and streaming measurements. Issues and PRs welcome.

Give us a star on GitHub: https://github.com/liquidos-ai/AutoAgents

Thanks
