<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: LukaszGrochal</title>
    <description>The latest articles on DEV Community by LukaszGrochal (@lukaszgrochal).</description>
    <link>https://dev.to/lukaszgrochal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3746029%2F11e966f6-099d-4c86-b40f-ae7f94ace120.png</url>
      <title>DEV Community: LukaszGrochal</title>
      <link>https://dev.to/lukaszgrochal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lukaszgrochal"/>
    <language>en</language>
    <item>
      <title>Choosing an Agent Framework in 2026: A Data-Driven Decision Guide</title>
      <dc:creator>LukaszGrochal</dc:creator>
      <pubDate>Thu, 26 Feb 2026 11:01:00 +0000</pubDate>
      <link>https://dev.to/lukaszgrochal/choosing-an-agent-framework-in-2026-a-data-driven-decision-guide-1mkk</link>
      <guid>https://dev.to/lukaszgrochal/choosing-an-agent-framework-in-2026-a-data-driven-decision-guide-1mkk</guid>
      <description>&lt;p&gt;You've seen the benchmarks. You've read the methodology. Now the question that actually matters: which one should YOU use?&lt;/p&gt;




&lt;p&gt;I spent weeks building the same multi-agent workflow in five frameworks, running 45 controlled benchmarks, and analyzing every dimension I could measure. The full results are in &lt;a href="https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela"&gt;Part 1&lt;/a&gt; and the methodology is in &lt;a href="https://dev.to/lukaszgrochal/how-i-built-a-fair-ai-agent-benchmark-architecture-methodology-4p34"&gt;Part 2&lt;/a&gt;. This article distills all of that into actionable guidance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Short Answer
&lt;/h2&gt;

&lt;p&gt;There is no single "best" framework. If someone tells you otherwise, they're either selling something or they only evaluated one dimension. The right choice depends on what you're optimizing for.&lt;/p&gt;

&lt;p&gt;Here's the decision matrix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your Priority&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;th&gt;Why (Data)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fastest prototype&lt;/td&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;Simplest API, 246s latency, 9.66 quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production stability&lt;/td&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;1.0 GA, graph-based control, 9.42 quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw speed&lt;/td&gt;
&lt;td&gt;MS Agent Framework&lt;/td&gt;
&lt;td&gt;93s latency (2.6x faster than the next best), 9.87 quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft/Azure ecosystem&lt;/td&gt;
&lt;td&gt;MS Agent Framework&lt;/td&gt;
&lt;td&gt;Ecosystem integration, successor to AutoGen + Semantic Kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-native apps&lt;/td&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;Tightest OpenAI integration, built-in tracing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lowest token cost&lt;/td&gt;
&lt;td&gt;MS Agent Framework&lt;/td&gt;
&lt;td&gt;7,006 tokens/run (vs CrewAI's 27,684)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Most consistent output&lt;/td&gt;
&lt;td&gt;MS Agent Framework&lt;/td&gt;
&lt;td&gt;Std=0.10, range=0.2 (narrowest)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you already know your top constraint, that table might be all you need. If you want to understand the tradeoffs in depth, keep reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Factor 1: Consistency Matters More Than Average Score
&lt;/h2&gt;

&lt;p&gt;This is the finding I keep coming back to. Everyone focuses on mean quality scores, but variance is what bites you in production.&lt;/p&gt;

&lt;p&gt;Think about it this way: if a framework averages 9.6 but occasionally drops to 8.6, you need retry logic, output validation, and fallback handling. If another framework averages 9.87 and never drops below 9.8, you can trust it and move on.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Mean&lt;/th&gt;
&lt;th&gt;Std Dev&lt;/th&gt;
&lt;th&gt;Min&lt;/th&gt;
&lt;th&gt;Max&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MS Agent&lt;/td&gt;
&lt;td&gt;9.87&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;9.8&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;9.66&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;9.2&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;9.42&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;9.31&lt;/td&gt;
&lt;td&gt;0.36&lt;/td&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;9.6&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;9.63&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1helzc0hz1q6r6povjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1helzc0hz1q6r6povjq.png" alt="Score distribution by framework" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MS Agent Framework's consistency is remarkable. A standard deviation of 0.10 means virtually every run lands in the same narrow band. AutoGen sits at the other end with a 1.4-point range and a std dev of 0.45 -- assuming roughly normal score distributions, about one run in three lands more than 0.45 points from the mean.&lt;/p&gt;

&lt;p&gt;Why does this happen? Architecture. Sequential pipelines (MS Agent) produce deterministic data flow: each agent gets a fixed input and produces a fixed output. Group chat patterns (AutoGen) introduce conversational branching where subtle phrasing differences in early turns cascade into meaningfully different outputs, even at &lt;code&gt;temperature=0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're building a pipeline that runs unattended -- batch processing, scheduled reports, automated analysis -- consistency should be your top priority. A framework that's slightly lower in average quality but tighter in variance will cause fewer 3am pages than one that's higher on average but occasionally produces garbage.&lt;/p&gt;
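&lt;p&gt;To make the variance argument concrete, here is a minimal sketch of how these statistics are computed with Python's standard library. The score lists are illustrative stand-ins chosen to match the tabled ranges, not the actual benchmark data:&lt;/p&gt;

```python
# Sketch: computing mean, spread, and range from per-run quality scores.
# These score lists are illustrative stand-ins, not the real benchmark data.
import statistics

runs = {
    "MS Agent": [9.8, 9.8, 9.9, 9.9, 9.9, 9.9, 9.9, 10.0, 9.8],
    "AutoGen": [8.6, 9.6, 9.8, 9.9, 9.5, 9.7, 10.0, 9.8, 9.8],
}

for name, scores in runs.items():
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)   # population std dev; the article may use the sample formula
    rng = max(scores) - min(scores)   # min-to-max spread
    print(f"{name}: mean={mean:.2f} std={std:.2f} range={rng:.1f}")
```

&lt;p&gt;For production monitoring, tracking the range (or a percentile spread) per deployment window is often more actionable than the mean alone.&lt;/p&gt;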

&lt;h2&gt;
  
  
  Factor 2: Token Cost at Scale
&lt;/h2&gt;

&lt;p&gt;When running locally via Ollama, tokens are effectively free (you pay in hardware and electricity instead). The moment you deploy to a cloud model, they're your biggest variable cost.&lt;/p&gt;

&lt;p&gt;Here's what each framework costs at GPT-4o rates ($2.50/1M input, $10/1M output, assuming a roughly 40/60 input/output split):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Tokens/Run&lt;/th&gt;
&lt;th&gt;Approx Cost/Run&lt;/th&gt;
&lt;th&gt;1,000 runs/month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MS Agent&lt;/td&gt;
&lt;td&gt;7,006&lt;/td&gt;
&lt;td&gt;~$0.06&lt;/td&gt;
&lt;td&gt;~$60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;8,676&lt;/td&gt;
&lt;td&gt;~$0.07&lt;/td&gt;
&lt;td&gt;~$70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;8,823&lt;/td&gt;
&lt;td&gt;~$0.07&lt;/td&gt;
&lt;td&gt;~$70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;10,793&lt;/td&gt;
&lt;td&gt;~$0.09&lt;/td&gt;
&lt;td&gt;~$90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;27,684&lt;/td&gt;
&lt;td&gt;~$0.22&lt;/td&gt;
&lt;td&gt;~$220&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7ej7nbmg3f9jyo3bwxa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7ej7nbmg3f9jyo3bwxa.png" alt="Token efficiency comparison" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CrewAI uses nearly 4x more tokens than MS Agent Framework. That's the cost of its role-playing architecture -- verbose system prompts and inter-agent communication inflate every run. At $220/month for 1,000 runs, it's still reasonable. But scale to 10,000 runs and you're looking at $2,200 vs $600. That delta funds an engineer for a week.&lt;/p&gt;

&lt;p&gt;MS Agent Framework is the most token-efficient at ~7,000 tokens per run, with Agents SDK and LangGraph close behind at roughly 8,700 and 8,800. If token cost is your binding constraint, any of the three lean frameworks is a safe bet.&lt;/p&gt;
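&lt;p&gt;For readers who want to adapt these numbers to their own pricing, here is the arithmetic as a small sketch, using the assumptions stated above (GPT-4o rates and a fixed 40/60 input/output split). A fixed split is a simplification, so it lands within a few cents of the table's rounded figures:&lt;/p&gt;

```python
# Sketch: cost-per-run arithmetic under the article's stated assumptions
# ($2.50/1M input, $10/1M output, 40/60 input/output split). The real
# measured splits differ per framework, so figures are approximate.
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 10.00 / 1_000_000  # dollars per output token

def cost_per_run(total_tokens, input_share=0.4):
    input_cost = total_tokens * input_share * INPUT_RATE
    output_cost = total_tokens * (1 - input_share) * OUTPUT_RATE
    return input_cost + output_cost

for name, tokens in [("MS Agent", 7_006), ("CrewAI", 27_684)]:
    monthly = cost_per_run(tokens) * 1_000   # at 1,000 runs/month
    print(f"{name}: ${cost_per_run(tokens):.2f}/run, ~${monthly:.0f}/month")
```

&lt;p&gt;Swap in your own provider's rates and your observed input/output split to get a budget estimate before committing to a framework.&lt;/p&gt;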

&lt;h2&gt;
  
  
  Factor 3: Production Readiness
&lt;/h2&gt;

&lt;p&gt;Raw benchmark numbers don't capture maturity. A framework that tops every metric doesn't help you if it ships breaking changes every two weeks or has no documentation for your edge case. Here's my honest tiered assessment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 -- Production Ready&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph 1.0&lt;/strong&gt; -- The only 1.0 GA release in this comparison. Graph-based architecture gives you explicit control over execution flow. Largest community, most Stack Overflow answers, best debugging and observability tools. If something goes wrong at 2am, you'll find help.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 -- Stable, Active Development&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI 1.9&lt;/strong&gt; -- Rapidly evolving with good documentation and an intuitive API. Some API churn between minor versions, so pin your dependencies carefully. The ecosystem is smaller than LangGraph's but growing fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents SDK&lt;/strong&gt; -- OpenAI-backed with a stable API surface. Tightly coupled to OpenAI's ecosystem, which is either a feature or a lock-in risk depending on your perspective. Built-in tracing is a genuine production advantage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tier 3 -- Use with Caution&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen 0.7&lt;/strong&gt; -- Effectively in maintenance mode. Microsoft's engineering energy is flowing into MS Agent Framework. The group chat architecture is genuinely powerful for open-ended collaboration, but if you're starting a new project today, you're building on a platform that's being superseded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tier 4 -- High Potential, Not GA&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MS Agent Framework 1.0.0b&lt;/strong&gt; -- Topped every metric in the benchmark: quality, speed, and consistency. But it's a beta release with GA expected around March 2026. The API surface could change. Documentation is thin. Community support is minimal. If you can absorb that risk, the numbers are compelling. If you need stability guarantees today, wait two months.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Factor 4: Architecture Style
&lt;/h2&gt;

&lt;p&gt;Each framework embodies a different mental model for agent orchestration. Picking one that matches how you think about your problem will save you more time than any benchmark number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph-based (LangGraph)&lt;/strong&gt; -- You define nodes (agents, functions) and edges (transitions, conditions). Execution follows the graph. Best for workflows with branching logic, conditional routing, or cycles. If you think in flowcharts, you'll feel at home.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task-based (CrewAI)&lt;/strong&gt; -- You define tasks with descriptions and assign them to agents with roles. The framework handles sequencing. Lowest boilerplate of the five. Best for quick prototypes and linear pipelines where you don't need fine-grained control over agent interaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Group chat (AutoGen)&lt;/strong&gt; -- Agents communicate via a shared message stream, taking turns based on selection logic. Most flexible for open-ended collaboration where you don't know the conversation shape in advance. Worst for structured pipelines where that flexibility becomes overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential (MS Agent Framework)&lt;/strong&gt; -- A clean pipeline where each agent processes input and passes output to the next. Simple mental model, predictable execution, easy to debug. Best when your workflow is a straight line from input to output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runner-based (Agents SDK)&lt;/strong&gt; -- A runner executes an agent, which can hand off to other agents. Lightweight abstraction with built-in tracing and OpenAI ecosystem integration. Best when you're already deep in the OpenAI stack and want minimal friction.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Recommendation
&lt;/h2&gt;

&lt;p&gt;I'll be opinionated here because vague advice is useless advice. These are my recommendations based on the data, tempered by practical experience:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starting a production system today? LangGraph.&lt;/strong&gt;&lt;br&gt;
It's the only 1.0 GA framework in this comparison. The graph-based architecture scales to complex workflows. The community and tooling ecosystem are mature. Quality is solid at 9.42, and while it's not the fastest (506s) or cheapest in tokens, it has the most predictable upgrade path. You won't regret this choice in six months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prototyping fast? CrewAI.&lt;/strong&gt;&lt;br&gt;
If you need a working multi-agent system by Friday, CrewAI's API is the fastest path from zero to demo. Define roles, assign tasks, run. Accept the 3-4x token overhead as the cost of velocity. You can always migrate later if the token cost becomes a problem at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can wait two months? MS Agent Framework.&lt;/strong&gt;&lt;br&gt;
The benchmark numbers are remarkable: the fastest latency (2.6x quicker than the next-fastest, CrewAI), highest quality, tightest consistency. If the GA release delivers on the beta's promise, this becomes the default recommendation. Watch the March 2026 release closely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Already in the OpenAI ecosystem? Agents SDK.&lt;/strong&gt;&lt;br&gt;
Don't fight your stack. If you're using OpenAI models, OpenAI's function calling, and OpenAI's tooling, the Agents SDK integrates most naturally. Second-lowest token cost (8,676 tokens/run), built-in tracing, clean handoff semantics. The coupling to OpenAI is the obvious tradeoff -- if you ever need to switch providers, you'll be rewriting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bowzerk3lkblneqviaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bowzerk3lkblneqviaq.png" alt="Quality heatmap across frameworks and companies" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Get the Data
&lt;/h2&gt;

&lt;p&gt;Everything behind this analysis is open source. Run the benchmarks yourself, challenge my numbers, extend the comparison to new frameworks or tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/LukaszGrochal/agent-framework-benchmark" rel="noopener noreferrer"&gt;agent-framework-benchmark&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysis notebook&lt;/strong&gt;: &lt;code&gt;notebooks/analysis.ipynb&lt;/code&gt; -- all charts, tables, and statistical tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw data&lt;/strong&gt;: &lt;code&gt;results/benchmark_results.csv&lt;/code&gt; -- 45 runs, every metric&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clone it, install with &lt;code&gt;uv sync&lt;/code&gt;, and run &lt;code&gt;uv run python -m benchmark.runner&lt;/code&gt;. If you find different results, I want to hear about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: &lt;a href="https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela"&gt;I Benchmarked 5 AI Agent Frameworks -- Here's What Actually Matters&lt;/a&gt;&lt;/strong&gt; -- The results: quality, latency, tokens, and consistency across 45 runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: &lt;a href="https://dev.to/lukaszgrochal/how-i-built-a-fair-ai-agent-benchmark-architecture-methodology-4p34"&gt;How I Built a Fair AI Agent Benchmark&lt;/a&gt;&lt;/strong&gt; -- Architecture, methodology, and the engineering behind controlled comparisons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3: Choosing an Agent Framework in 2026&lt;/strong&gt; -- You are here. The decision guide.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with Python 3.12, uv, Ollama (Qwen 3 14B), and 45 runs of hard data. Pick a framework, ship something, and remember: the model does the thinking. The framework just gets out of the way.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>leadership</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>How I Built a Fair AI Agent Benchmark (Architecture &amp; Methodology)</title>
      <dc:creator>LukaszGrochal</dc:creator>
      <pubDate>Tue, 24 Feb 2026 11:01:00 +0000</pubDate>
      <link>https://dev.to/lukaszgrochal/how-i-built-a-fair-ai-agent-benchmark-architecture-methodology-4p34</link>
      <guid>https://dev.to/lukaszgrochal/how-i-built-a-fair-ai-agent-benchmark-architecture-methodology-4p34</guid>
      <description>&lt;p&gt;Comparing frameworks is easy. Comparing them &lt;em&gt;fairly&lt;/em&gt; is the hard part.&lt;/p&gt;




&lt;p&gt;In &lt;a href="https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela"&gt;Part 1 of this series&lt;/a&gt;, I published the results of benchmarking five AI agent frameworks head-to-head. MS Agent Framework won on speed and consistency. Quality scores were nearly identical across the board. The results surprised me.&lt;/p&gt;

&lt;p&gt;But results without methodology are just opinions with charts. This article is about the engineering behind the benchmark: how I designed the system to isolate framework behavior from everything else, the architectural decisions that made fair comparison possible, and the mistakes I'd fix if I ran it again.&lt;/p&gt;

&lt;p&gt;If you've ever tried to compare two libraries by building a quick prototype in each, you know the problem. The first one you build teaches you the task. The second one benefits from everything you learned. Your "comparison" is really measuring your own learning curve. I wanted to eliminate that entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fairness Problem
&lt;/h2&gt;

&lt;p&gt;Most framework comparisons I've seen online have the same fundamental flaw: they're benchmarking prompt quality, not framework quality.&lt;/p&gt;

&lt;p&gt;Think about what typically happens. Someone builds a project in LangGraph, writes carefully tuned prompts, gets great results. Then they try CrewAI, use slightly different wording, maybe a different model temperature, and get different results. They write a blog post declaring one framework superior. But what actually differed? The prompts. The configuration. The author's familiarity with each API. The framework was maybe 10% of the equation.&lt;/p&gt;

&lt;p&gt;There are several ways naive comparisons fail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Different prompts&lt;/strong&gt; — Each implementation uses hand-written instructions. Prompt phrasing changes output quality dramatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different tools&lt;/strong&gt; — One version calls a real API, another uses a mock. Network latency and API variability dominate the measurement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature randomness&lt;/strong&gt; — Running at temperature 0.7 means every run produces different output. You're measuring random variance, not framework capability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework-specific optimizations&lt;/strong&gt; — Tuning one framework's settings while leaving another at defaults isn't a framework comparison; it's a configuration comparison.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted to control for all of this. Every variable that isn't "which framework is orchestrating the agents" had to be identical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Task: Company Research Agent
&lt;/h2&gt;

&lt;p&gt;The benchmark task is a 3-agent pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Researcher&lt;/strong&gt; — Gathers raw information about a target company&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyst&lt;/strong&gt; — Synthesizes research findings into structured business insights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer&lt;/strong&gt; — Produces a polished 500-800 word research report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I chose this task because it hits a sweet spot. It's complex enough to exercise real multi-agent orchestration — three agents with data dependencies, where each agent's output feeds the next. But it's simple enough that the output (a structured report) can be evaluated objectively on dimensions like completeness, accuracy, and readability.&lt;/p&gt;

&lt;p&gt;Each framework researches three companies (Anthropic, Stripe, Datadog), three iterations each, for 9 runs per framework and 45 runs total. Three companies give us variety in available information. Three iterations give us enough repetition to measure consistency.&lt;/p&gt;
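&lt;p&gt;The 5 x 3 x 3 design is easy to generate with &lt;code&gt;itertools.product&lt;/code&gt;. The framework identifiers below are illustrative, not the repo's actual module names:&lt;/p&gt;

```python
# Sketch: enumerating the full run matrix described above.
# Framework identifiers are illustrative stand-ins for the real module names.
from itertools import product

FRAMEWORKS = ["langgraph", "crewai", "autogen", "agents_sdk", "ms_agent"]
COMPANIES = ["Anthropic", "Stripe", "Datadog"]
ITERATIONS = range(1, 4)

runs = list(product(FRAMEWORKS, COMPANIES, ITERATIONS))
print(len(runs))  # 45 runs total, 9 per framework
```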

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The benchmark isn't a loose collection of scripts. It's a modular system with strict dependency boundaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                   ┌──────────────┐
                   │  Benchmark   │
                   │   Runner     │
                   └──────┬───────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
      ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
      │ Framework │ │ Framework │ │ Framework │  x5 frameworks
      │   Impl    │ │   Impl    │ │   Impl    │  x3 companies
      └─────┬─────┘ └─────┬─────┘ └─────┬─────┘  x3 iterations
            │             │             │        = 45 runs
            ▼             ▼             ▼
      ┌─────────────────────────────────────────┐
      │              shared/                    │
      │  prompts.py │ tools.py │ schemas.py     │
      │  config.py (BenchmarkSettings)          │
      └─────────────────────────────────────────┘
                           │
                    ┌──────▼───────┐
                    │  eval_core   │──▶ LLM-as-Judge
                    │  (LLMJudge)  │    Quality Scores
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │ vendor/      │
                    │ llm_core     │──▶ Provider Abstraction
                    └──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The runner dynamically imports each framework module, passes it a company name and settings, and collects the report text plus token usage. It then sends each report through the LLM judge for quality scoring. Everything feeds into a CSV of results for analysis.&lt;/p&gt;
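&lt;p&gt;The dispatch loop can be sketched in a few lines. The module path and the &lt;code&gt;run()&lt;/code&gt; signature here are assumptions for illustration; the repository's actual entry points may be named differently:&lt;/p&gt;

```python
# Sketch of the runner's dynamic-import dispatch. The module path and the
# run() entry point are hypothetical names, not necessarily the repo's.
import importlib

def run_framework(module_name, company):
    module = importlib.import_module(module_name)
    # Each implementation is expected to expose an entry point returning
    # the report text plus token usage for the given company.
    return module.run(company)

# Hypothetical usage:
# report, tokens = run_framework("frameworks.langgraph_impl", "Stripe")
```

&lt;p&gt;Dynamic import keeps the runner agnostic: adding a sixth framework means adding a module, not editing the orchestration code.&lt;/p&gt;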

&lt;h2&gt;
  
  
  The 5 Rules of Fair Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rule 1: Identical Prompts
&lt;/h3&gt;

&lt;p&gt;Every framework implementation imports its prompts from the same &lt;code&gt;shared/prompts.py&lt;/code&gt; file. No framework gets custom instructions. Here are the actual prompt strings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;RESEARCHER_SYSTEM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a company research specialist. Your task is to gather comprehensive &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;information about {company}. Focus on: company overview, key leadership, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;products/services, recent news and developments, market position, and key &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;financial or operational metrics. Be thorough and factual. Present your &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;findings as a structured list of facts with categories.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ANALYST_SYSTEM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a business analyst specializing in {company}. Review the research data &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provided and identify: key strengths, potential risks and challenges, market &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trends affecting {company}, competitive advantages, and notable strategic &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insights. Provide data-driven analysis with clear reasoning.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;WRITER_SYSTEM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a professional report writer covering {company}. Create a structured &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research report with these sections: Executive Summary, Company Overview, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Products &amp;amp; Services, Market Position, Key Insights, and Conclusion. Write &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clearly and professionally. The report should be 500-800 words.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If CrewAI got "Be incredibly detailed and thorough" while LangGraph got "Be concise," we'd be testing prompt engineering, not frameworks. Sharing a single source file eliminates that variable entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 2: Identical Tools
&lt;/h3&gt;

&lt;p&gt;The agents use a mock search tool from &lt;code&gt;shared/tools.py&lt;/code&gt; with pre-built data for each benchmark company. This is critical for two reasons.&lt;/p&gt;

&lt;p&gt;First, determinism. Real API calls return different results at different times. A company's stock price changes, news articles rotate, search rankings shift. Mock data guarantees every framework gets the exact same input information on every run.&lt;/p&gt;

&lt;p&gt;Second, isolation. If one framework happens to make API calls faster due to connection pooling, or runs into rate limiting, that shows up as a latency difference that has nothing to do with the framework's orchestration quality. Mock tools remove network variability from the equation.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;gather_all_search_results()&lt;/code&gt; function runs the same six standard queries for every company, ensuring all implementations receive identical raw data regardless of how they choose to call the search tool.&lt;/p&gt;
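&lt;p&gt;In this spirit, a deterministic mock search tool is little more than a dictionary lookup. The canned data and two-query list below are illustrative; the repo's fixtures run six queries per company:&lt;/p&gt;

```python
# Sketch of a deterministic mock search tool in the spirit of shared/tools.py.
# The canned data and query list are illustrative, not the repo's fixtures.
MOCK_DATA = {
    "Stripe": {
        "overview": "Stripe builds payments infrastructure for the internet.",
        "leadership": "Founded by Patrick and John Collison.",
    },
}

STANDARD_QUERIES = ["overview", "leadership"]  # the real tool runs six queries

def mock_search(company, query):
    # Same input always yields the same output: no network, no data drift.
    return MOCK_DATA.get(company, {}).get(query, "no data")

def gather_all_search_results(company):
    return {q: mock_search(company, q) for q in STANDARD_QUERIES}

print(gather_all_search_results("Stripe"))
```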

&lt;h3&gt;
  
  
  Rule 3: Same Model
&lt;/h3&gt;

&lt;p&gt;All five frameworks run against Qwen 3 14B via Ollama, configured through &lt;code&gt;BenchmarkSettings&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BenchmarkSettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseSettings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;llm_provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;llm_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:14b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;ollama_host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One model, one machine, one inference server. No framework gets a smarter model or a faster endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 4: temperature=0 Everywhere
&lt;/h3&gt;

&lt;p&gt;Every framework implementation sets &lt;code&gt;temperature=0&lt;/code&gt; for all LLM calls. This eliminates random sampling from the generation process. With temperature 0, the model always picks the highest-probability next token, making outputs as deterministic as the framework allows. (Some variation still occurs due to floating-point nondeterminism in GPU computation, but it's minimal.)&lt;/p&gt;
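&lt;p&gt;Concretely, every implementation ends up sending something shaped like the payload below. This helper is a sketch mirroring the shape of Ollama's &lt;code&gt;/api/chat&lt;/code&gt; options field, not any specific framework's API:&lt;/p&gt;

```python
# Sketch: the request shape every implementation effectively sends.
# Mirrors Ollama's /api/chat options field; helper name is hypothetical.
def chat_payload(model, messages):
    return {
        "model": model,
        "messages": messages,
        "options": {"temperature": 0},  # greedy decoding on every call
    }

payload = chat_payload("qwen3:14b", [{"role": "user", "content": "hi"}])
print(payload["options"])
```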

&lt;h3&gt;
  
  
  Rule 5: No Framework-Specific Optimizations
&lt;/h3&gt;

&lt;p&gt;No custom retry logic, no framework-specific prompt tweaking, no tuning of agent count or conversation structure beyond what the pipeline requires. Every implementation gets the most straightforward translation of the three-agent pipeline into that framework's idiom. If a framework makes certain patterns easier or harder, that's a legitimate difference worth measuring.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Module Dependency Fence
&lt;/h2&gt;

&lt;p&gt;Beyond shared inputs, architectural boundaries prevent accidental coupling between components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llm_core         &amp;lt;── eval_core (judge uses BaseLLMProvider)
shared/          &amp;lt;── all framework implementations
eval_core        &amp;lt;── benchmark/ (runner uses judge)
No framework implementation imports from another.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three rules make this work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;eval_core CANNOT import shared/.&lt;/strong&gt; The judge evaluates reports as plain text. It doesn't know what prompts were used, what tools were available, or how agents were structured. This prevents the evaluation from being biased toward the specific task design.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework implementations CANNOT import each other.&lt;/strong&gt; If the LangGraph implementation imported a utility from the CrewAI implementation, we'd have hidden coupling. Each implementation is self-contained within its own package.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;llm_core is vendored and unmodified.&lt;/strong&gt; The LLM provider abstraction layer (&lt;code&gt;BaseLLMProvider&lt;/code&gt;, &lt;code&gt;OllamaProvider&lt;/code&gt;, etc.) is vendored from another project and treated as a frozen dependency. No benchmark-specific modifications.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
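&lt;p&gt;Rules like these are easy to state and easy to violate by accident. A minimal way to enforce the first one in CI (sketched here with hypothetical module names, not taken from the repo) is an AST scan over each package's imports:&lt;/p&gt;

```python
import ast

def forbidden_imports(source, importer, rules):
    """Return imports in `source` that the fence forbids for `importer`."""
    banned = rules.get(importer, [])
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if any(name == b or name.startswith(b + ".") for b in banned):
                hits.append(name)
    return hits

# Rule 1: eval_core must never import shared/.
RULES = {"eval_core": ["shared"]}
# "shared.prompts" and WRITER_PROMPT are hypothetical names used for illustration.
print(forbidden_imports("from shared.prompts import WRITER_PROMPT", "eval_core", RULES))
```

&lt;p&gt;Run over every file in a package, a check like this turns the import fence from a convention into a test failure.&lt;/p&gt;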

&lt;p&gt;This might seem overly strict for a benchmark project. But without these boundaries, it's easy for shared state or implicit dependencies to contaminate the comparison. I've seen benchmark repos where "shared utilities" slowly accumulate framework-specific logic. Explicit import rules prevent that drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM-as-Judge Evaluation
&lt;/h2&gt;

&lt;p&gt;Each of the 45 reports is scored by an LLM judge on five criteria, each rated 1-10:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Completeness&lt;/strong&gt; — Does the report cover key aspects? (leadership, products, market position, developments, metrics)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; — Are stated facts verifiable and reasonable? Any fabrications?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure&lt;/strong&gt; — Well-organized with clear sections? Follows the requested format?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insight&lt;/strong&gt; — Analysis beyond surface-level facts? Meaningful observations?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Readability&lt;/strong&gt; — Well-written, professional, clear?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The judge receives the report as plain text along with a task description, and returns a structured JSON response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_JUDGE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are an expert evaluator of research reports. Score the following report on a &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;scale of 1-10 for each criterion. Be rigorous and fair.

## Criteria
1. **Completeness** (1-10): Does the report cover key aspects of the company?
2. **Accuracy** (1-10): Are the stated facts verifiable and reasonable?
3. **Structure** (1-10): Is the report well-organized with clear sections?
4. **Insight** (1-10): Does the report provide analysis beyond surface-level facts?
5. **Readability** (1-10): Is it well-written, professional, and clear?

## Report to Evaluate
{report}

Respond with ONLY valid JSON (no markdown, no code blocks):
{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completeness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;structure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;,
 &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;readability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;float&amp;gt;,
 &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;brief explanation&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why an LLM judge instead of human evaluation? Three reasons: &lt;strong&gt;scale&lt;/strong&gt; (45 reports is a lot to evaluate manually), &lt;strong&gt;consistency&lt;/strong&gt; (human evaluators drift over time — the 40th report gets different attention than the 5th), and &lt;strong&gt;reproducibility&lt;/strong&gt; (anyone can re-run the judge and get the same scores).&lt;/p&gt;

&lt;p&gt;The limitations are real. LLM judges have known biases — they tend to prefer verbose, well-formatted output over concise but equally correct output. But since all five frameworks produce structurally similar reports (they're all following the same writer prompt with the same section headings), this bias affects all frameworks roughly equally. It's a systematic offset, not a confound.&lt;/p&gt;

&lt;p&gt;The judge uses &lt;code&gt;temperature=0&lt;/code&gt; for consistency and retries up to 3 times on JSON parse failures. Failed parses get logged and the response is re-requested. This handles the occasional case where the model wraps its JSON in markdown code blocks despite being told not to.&lt;/p&gt;
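&lt;p&gt;The fence-stripping-plus-retry behavior is simple enough to sketch (this mirrors the description above, not the repo's exact code):&lt;/p&gt;

```python
import json
import re

# Matches an optional markdown code fence the model sometimes adds anyway.
_FENCE = re.compile(r"^`{3}(?:json)?\s*(.*?)\s*`{3}$", re.DOTALL)

def parse_judge_response(raw):
    """Parse the judge's JSON, tolerating a markdown code-fence wrapper."""
    text = raw.strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)

def judge_with_retries(call_llm, max_attempts=3):
    """Re-request the evaluation on JSON parse failures, up to max_attempts."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_judge_response(call_llm())
        except json.JSONDecodeError as err:
            last_error = err  # the real runner also logs the failed parse
    raise last_error

wrapped = "`" * 3 + 'json\n{"overall": 9.5}\n' + "`" * 3
print(parse_judge_response(wrapped))
```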

&lt;h2&gt;
  
  
  Local-First: Why Ollama Instead of Cloud APIs
&lt;/h2&gt;

&lt;p&gt;The entire benchmark runs locally using Ollama. No cloud API keys required. This was a deliberate choice with several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$0 cost.&lt;/strong&gt; Forty-five benchmark runs plus 45 judge evaluations. At cloud pricing, that's potentially hundreds of dollars. Locally, it's electricity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No rate limits.&lt;/strong&gt; Cloud APIs throttle concurrent requests. Pushing 45 back-to-back benchmark runs through GPT-4o means dealing with rate limiting, retry backoff, and variable response times that have nothing to do with framework quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No network variability.&lt;/strong&gt; When measuring latency differences between frameworks, the last thing you want is network jitter adding 50-500ms of noise per request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete reproducibility.&lt;/strong&gt; Anyone with an Ollama installation and the Qwen 3 14B model can reproduce these results exactly. No API key, no billing account, no waiting list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is obvious: Qwen 3 14B isn't GPT-4o. The absolute quality of outputs is lower than what you'd get from a frontier model. But this benchmark measures &lt;em&gt;relative&lt;/em&gt; framework performance — how much overhead each framework adds, how consistently each one produces results, how efficiently each one uses tokens. Those relative measurements hold regardless of the underlying model's capability.&lt;/p&gt;

&lt;p&gt;The configuration supports cloud providers too (&lt;code&gt;openai&lt;/code&gt;, &lt;code&gt;anthropic&lt;/code&gt; are valid &lt;code&gt;llm_provider&lt;/code&gt; values), so you can re-run with GPT-4o or Claude if you want to validate that framework rankings hold at higher model capability levels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Surviving Dependency Hell
&lt;/h2&gt;

&lt;p&gt;Here's a problem I didn't anticipate: the five frameworks literally cannot all be installed in the same Python environment.&lt;/p&gt;

&lt;p&gt;CrewAI pins &lt;code&gt;openai&amp;lt;1.84&lt;/code&gt;. MS Agent Framework requires &lt;code&gt;openai&amp;gt;=1.99&lt;/code&gt;. These are hard version constraints in their respective &lt;code&gt;pyproject.toml&lt;/code&gt; files. pip will just fail. Even if you could force-install both, one of them would break at runtime.&lt;/p&gt;

&lt;p&gt;The solution: uv's dependency groups (PEP 735). Each framework gets its own resolution context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--group&lt;/span&gt; crewai      &lt;span class="c"&gt;# Installs CrewAI (pins openai&amp;lt;1.84)&lt;/span&gt;
uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--group&lt;/span&gt; msagent     &lt;span class="c"&gt;# Installs MS Agent Framework (needs openai&amp;gt;=1.99)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Groups that are compatible can be installed together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--group&lt;/span&gt; langgraph &lt;span class="nt"&gt;--group&lt;/span&gt; autogen &lt;span class="nt"&gt;--group&lt;/span&gt; agents-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also declared explicit conflicts in &lt;code&gt;pyproject.toml&lt;/code&gt; so that uv resolves these groups independently rather than trying to find a single unified solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.uv]&lt;/span&gt;
&lt;span class="py"&gt;conflicts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"crewai"&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"msagent"&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"crewai"&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"agents-sdk"&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a real-world takeaway that goes beyond benchmarking: &lt;strong&gt;your existing dependency tree might rule out certain frameworks before you write a line of code.&lt;/strong&gt; If your project already depends on &lt;code&gt;openai&amp;gt;=1.90&lt;/code&gt;, CrewAI is off the table until they update their pin. If you're on an older &lt;code&gt;openai&lt;/code&gt; version and can't upgrade, the newer frameworks won't work. Check compatibility before you invest a week building a proof of concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;No benchmark is perfect, and this one has gaps I'd address in a v2:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More test companies.&lt;/strong&gt; Three companies give some variety, but 5-7 would provide better statistical power. With only 9 runs per framework, the confidence intervals on quality scores are wide enough that most pairwise differences aren't statistically significant (as the Mann-Whitney U tests in Part 1 confirmed).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple task types.&lt;/strong&gt; Company research is one workflow pattern. A more comprehensive benchmark would include a coding task (generate and debug code), a data analysis task (interpret a dataset), and a customer support task (handle multi-turn conversations). Different frameworks might excel at different patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human eval baseline.&lt;/strong&gt; I'd recruit 3-5 evaluators to score a subset of reports independently and compare their rankings to the LLM judge's rankings. This would validate whether the judge's quality scores match human intuition or if systematic biases are distorting results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test with cloud models.&lt;/strong&gt; Running the same benchmark with GPT-4o and Claude Sonnet would answer an important question: do framework rankings change with model capability? It's possible that a framework that adds overhead with a strong model actually helps compensate for a weaker model's limitations, or vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standardized token tracking.&lt;/strong&gt; Token tracking varies across frameworks — some report tokens natively, others require instrumentation hooks. A complete benchmark needs a framework-agnostic way to capture token usage at the provider level, rather than relying on each framework's own reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Python 3.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Package Manager&lt;/td&gt;
&lt;td&gt;uv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build System&lt;/td&gt;
&lt;td&gt;Hatchling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Serving&lt;/td&gt;
&lt;td&gt;Ollama (Qwen 3 14B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linter/Formatter&lt;/td&gt;
&lt;td&gt;ruff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type Checker&lt;/td&gt;
&lt;td&gt;mypy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;pytest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analysis&lt;/td&gt;
&lt;td&gt;pandas + Plotly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notebooks&lt;/td&gt;
&lt;td&gt;Jupyter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All code, data, and analysis notebooks are open source: &lt;a href="https://github.com/LukaszGrochal/agent-framework-benchmark" rel="noopener noreferrer"&gt;agent-framework-benchmark&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Series Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: &lt;a href="https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela"&gt;I Benchmarked 5 AI Agent Frameworks — Here's What Actually Matters&lt;/a&gt;&lt;/strong&gt; — The results: quality scores, latency, token efficiency, and consistency across all 45 runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: How I Built a Fair Benchmark&lt;/strong&gt; — You are here. Architecture, methodology, and the engineering behind controlled comparisons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3: A Practical Decision Guide&lt;/strong&gt; — Flowchart for picking the right framework based on your actual constraints.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with Python 3.12, uv, Ollama, and a determination to answer "which framework is best?" with data instead of opinions.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Benchmarked 5 AI Agent Frameworks — Here's What Actually Matters</title>
      <dc:creator>LukaszGrochal</dc:creator>
      <pubDate>Mon, 16 Feb 2026 07:01:00 +0000</pubDate>
      <link>https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela</link>
      <guid>https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela</guid>
      <description>&lt;p&gt;I ran 45 benchmarks across 5 agent frameworks expecting a clear winner. The answer wasn't what I expected.&lt;/p&gt;




&lt;p&gt;Everyone building with LLM agents in 2026 faces the same question: which framework should I use? Blog posts give you vibes. Docs give you cherry-picked examples. Twitter threads give you hot takes from people who tried one framework for a weekend.&lt;/p&gt;

&lt;p&gt;I wanted numbers. Real numbers, from a controlled experiment.&lt;/p&gt;

&lt;p&gt;So I built the same multi-agent workflow — a Company Research Agent — in five different frameworks, ran each one 9 times (3 companies x 3 iterations), scored every output with an LLM judge, and tracked latency and token usage down to the request level. Forty-five runs total, same model, same prompts, same evaluation criteria. No cloud APIs, no variable pricing confounding the results — everything running locally on the same machine.&lt;/p&gt;

&lt;p&gt;Here's what the data actually says.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Five frameworks, each implementing the same three-agent pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Researcher&lt;/strong&gt; — gathers raw information about a company&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyst&lt;/strong&gt; — synthesizes findings into structured insights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer&lt;/strong&gt; — produces a polished research report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The frameworks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph 1.0.x&lt;/strong&gt; — graph-based state machine with explicit node/edge definitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI 1.9.x&lt;/strong&gt; — task-based sequential orchestration with role-playing agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen 0.7.x&lt;/strong&gt; — async group chat where agents collaborate via messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MS Agent Framework 1.0.0b&lt;/strong&gt; — sequential orchestration with built-in tool routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Agents SDK&lt;/strong&gt; — runner-based pipeline with handoff semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All five ran against the same local model (Qwen 3 14B via Ollama) with &lt;code&gt;temperature=0&lt;/code&gt; for reproducibility. The target companies — Anthropic, Stripe, and Datadog — were chosen to represent different levels of public information availability: a heavily covered AI lab, a high-profile private fintech, and a publicly traded enterprise software company. Each framework researched all three, three times each.&lt;/p&gt;

&lt;p&gt;The LLM judge evaluated each output report on five dimensions: completeness, accuracy, structure, insight depth, and readability — each scored 1-10, then combined into an overall quality score.&lt;/p&gt;
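&lt;p&gt;One sanity check worth running on judge output (an illustrative sketch, not part of the published pipeline) is comparing the judge's own overall float against the plain mean of its five subscores:&lt;/p&gt;

```python
from statistics import mean

CRITERIA = ("completeness", "accuracy", "structure", "insight", "readability")

def overall_gap(judgment):
    """How far the judge's 'overall' sits from the plain mean of its subscores."""
    return judgment["overall"] - mean(judgment[k] for k in CRITERIA)

# Hypothetical judgment dict in the shape the judge prompt requests.
sample = {"completeness": 9, "accuracy": 10, "structure": 10,
          "insight": 9, "readability": 10, "overall": 9.6}
print(overall_gap(sample))  # 0.0 when overall equals the subscore mean
```

&lt;p&gt;A large gap on many reports would suggest the judge's overall score encodes something beyond the stated criteria.&lt;/p&gt;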

&lt;p&gt;Why does this matter in 2026? Because agent frameworks have matured past the "hello world" phase. The question is no longer "can I build a multi-agent system?" — it's "which framework gives me the best tradeoff between quality, speed, cost, and reliability for production workloads?" I picked a company research pipeline because it's complex enough to stress-test orchestration (three agents with dependencies) but simple enough that the results are easy to evaluate objectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality Results: Closer Than You'd Think
&lt;/h2&gt;

&lt;p&gt;Here's the part that surprised me most. Look at this radar chart:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjjh01tuifcovfysdu5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjjh01tuifcovfysdu5b.png" alt="Quality dimension profiles by framework" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every framework scores above 9.0 overall. The total spread from best to worst is just 0.56 points. Here are the full numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Completeness&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Structure&lt;/th&gt;
&lt;th&gt;Insight&lt;/th&gt;
&lt;th&gt;Readability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MS Agent&lt;/td&gt;
&lt;td&gt;9.87&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;td&gt;9.33&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;9.66&lt;/td&gt;
&lt;td&gt;9.44&lt;/td&gt;
&lt;td&gt;9.44&lt;/td&gt;
&lt;td&gt;9.89&lt;/td&gt;
&lt;td&gt;9.56&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;9.63&lt;/td&gt;
&lt;td&gt;9.44&lt;/td&gt;
&lt;td&gt;9.67&lt;/td&gt;
&lt;td&gt;9.89&lt;/td&gt;
&lt;td&gt;9.33&lt;/td&gt;
&lt;td&gt;9.89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;9.42&lt;/td&gt;
&lt;td&gt;9.11&lt;/td&gt;
&lt;td&gt;9.44&lt;/td&gt;
&lt;td&gt;9.89&lt;/td&gt;
&lt;td&gt;9.22&lt;/td&gt;
&lt;td&gt;9.78&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;9.31&lt;/td&gt;
&lt;td&gt;9.00&lt;/td&gt;
&lt;td&gt;9.11&lt;/td&gt;
&lt;td&gt;9.89&lt;/td&gt;
&lt;td&gt;9.00&lt;/td&gt;
&lt;td&gt;9.78&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MS Agent Framework sits at the top with a near-perfect 9.87. Agents SDK comes in last at 9.31. But here's the thing — 9.31 is still &lt;em&gt;excellent&lt;/em&gt;. When your worst performer is scoring above 9 out of 10, quality isn't the axis that differentiates these tools.&lt;/p&gt;

&lt;p&gt;The radar chart tells the same story visually: all five polygons overlap heavily. Structure and readability are essentially identical across the board (everyone's above 9.78). The only dimension with meaningful separation is completeness, where MS Agent's perfect 10.00 pulls away from Agents SDK's 9.00.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Differentiates Them
&lt;/h2&gt;

&lt;p&gt;If quality is a wash, what should you care about? Three things: &lt;strong&gt;speed&lt;/strong&gt;, &lt;strong&gt;token cost&lt;/strong&gt;, and &lt;strong&gt;consistency&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speed: A 6x Gap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqxk991gjo672tia6fvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqxk991gjo672tia6fvf.png" alt="Average latency by framework" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the differences get dramatic. Average end-to-end latency per run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MS Agent Framework&lt;/strong&gt;: 93s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI&lt;/strong&gt;: 246s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents SDK&lt;/strong&gt;: 448s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt;: 506s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen&lt;/strong&gt;: 572s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a &lt;strong&gt;6x gap&lt;/strong&gt; between fastest and slowest. MS Agent finishes in a minute and a half while AutoGen is still grinding away at nearly ten minutes. For a batch job researching 100 companies, that's the difference between 2.5 hours and 16 hours.&lt;/p&gt;
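&lt;p&gt;The batch arithmetic behind that claim is simple enough to verify:&lt;/p&gt;

```python
def batch_hours(seconds_per_run, companies=100):
    """Wall-clock hours to research `companies` sequentially at the measured per-run latency."""
    return seconds_per_run * companies / 3600

for name, seconds in [("MS Agent", 93), ("AutoGen", 572)]:
    print(f"{name}: {batch_hours(seconds):.1f} hours for 100 companies")
```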

&lt;p&gt;CrewAI lands in a comfortable middle ground at ~4 minutes — fast enough for interactive use, efficient enough for batch processing. LangGraph and Agents SDK cluster together in the 7.5-8.5 minute range.&lt;/p&gt;

&lt;p&gt;AutoGen's async group chat pattern, while flexible, introduces significant coordination overhead that shows up directly in wall-clock time. The agents exchange messages in a round-robin style, and each message round requires a full LLM call to decide whether to continue the conversation or hand off. That flexibility is powerful for open-ended collaboration, but for a linear pipeline like this one, it's overhead without payoff.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Cost: 3x Difference
&lt;/h3&gt;

&lt;p&gt;Not all frameworks are equally efficient with their LLM calls. Average total tokens per run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MS Agent Framework&lt;/strong&gt;: 7,006 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents SDK&lt;/strong&gt;: 8,676 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt;: 8,823 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen&lt;/strong&gt;: 10,793 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI&lt;/strong&gt;: 27,684 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CrewAI uses &lt;strong&gt;nearly 4x more tokens&lt;/strong&gt; than MS Agent Framework to produce comparable quality output. At local Ollama pricing, this is free. At GPT-4o pricing ($2.50/1M input, $10/1M output), that's the difference between ~$0.06 and ~$0.22 per run. Scale to thousands of runs per day and the gap matters.&lt;/p&gt;
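&lt;p&gt;Since the input/output token split isn't broken out per run, the honest way to translate totals into cloud dollars is to bracket the cost between the all-input and all-output extremes (a sketch, using the GPT-4o prices quoted above):&lt;/p&gt;

```python
def cost_bounds_usd(total_tokens, input_price_per_m=2.50, output_price_per_m=10.00):
    """Bracket per-run cost when the input/output token split is unknown."""
    low = total_tokens * input_price_per_m / 1_000_000    # everything billed as input
    high = total_tokens * output_price_per_m / 1_000_000  # everything billed as output
    return low, high

for name, tokens in [("MS Agent", 7_006), ("CrewAI", 27_684)]:
    low, high = cost_bounds_usd(tokens)
    print(f"{name}: ${low:.2f} to ${high:.2f} per run")
```

&lt;p&gt;The article's ~$0.06 and ~$0.22 figures fall inside these brackets, as they should.&lt;/p&gt;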

&lt;p&gt;Why such a spread? CrewAI's role-playing approach includes verbose system prompts and inter-agent communication that inflates token counts. MS Agent Framework, Agents SDK, and LangGraph take a leaner approach with minimal framing overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consistency: The Hidden Variable
&lt;/h3&gt;

&lt;p&gt;Average scores hide variance. Here's what the consistency numbers reveal:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Std Dev&lt;/th&gt;
&lt;th&gt;Min Score&lt;/th&gt;
&lt;th&gt;Max Score&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MS Agent&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;9.8&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;9.2&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;0.36&lt;/td&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;9.6&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MS Agent is remarkably tight — std dev of 0.10, range of just 0.2 points. Every single run scored between 9.8 and 10.0. You know exactly what you're going to get.&lt;/p&gt;

&lt;p&gt;AutoGen is the opposite story. It can hit a perfect 10.0, but it can also drop to 8.6 — a 1.4-point range. A standard deviation of 0.45 means roughly one run in three lands more than 0.45 points from the mean (the usual one-sigma rule, assuming roughly normal scores). If you're building a production pipeline where predictability matters (and it always does), this variance is a real concern. You'd need to build retry logic or output validation around it, which adds complexity.&lt;/p&gt;

&lt;p&gt;What drives the inconsistency? I suspect it's the group chat architecture. When agents negotiate via messages, the conversation can take different paths depending on subtle phrasing differences in early turns, even with &lt;code&gt;temperature=0&lt;/code&gt;. Sequential pipelines like MS Agent's don't have this branching problem — each agent gets a fixed input and produces a fixed output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Statistical Reality Check
&lt;/h2&gt;

&lt;p&gt;Eyeballing averages is one thing. Let's see what the statistics actually support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kruskal-Wallis test on quality scores: p = 0.005.&lt;/strong&gt; Statistically significant — differences between frameworks do exist. But that's the omnibus test. It tells you &lt;em&gt;something&lt;/em&gt; differs, not &lt;em&gt;what&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Pairwise Mann-Whitney U tests with Bonferroni correction (10 comparisons, corrected alpha = 0.005) tell a more nuanced story:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Only one pair shows a statistically significant quality difference: Agents SDK vs MS Agent (p = 0.0003, effect size r = 0.86 — large).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every other pairwise comparison — LangGraph vs CrewAI, AutoGen vs Agents SDK, CrewAI vs MS Agent, all of them — fails to reach significance after correction. The apparent quality differences between most frameworks are &lt;strong&gt;indistinguishable from noise&lt;/strong&gt; at this sample size.&lt;/p&gt;
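&lt;p&gt;The correction itself is one line of arithmetic: with 10 pairwise tests, each one must clear a much stricter bar to keep the family-wise error rate at 0.05:&lt;/p&gt;

```python
from operator import lt  # lt(a, b) is the functional form of "a less than b"

def bonferroni_significant(p, family_alpha=0.05, comparisons=10):
    """True when p survives a Bonferroni correction across all pairwise tests."""
    corrected = family_alpha / comparisons  # 0.05 / 10 = 0.005
    return lt(p, corrected)

print(bonferroni_significant(0.0003))  # the Agents SDK vs MS Agent p-value
print(bonferroni_significant(0.02))    # a hypothetical pair that fails after correction
```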

&lt;p&gt;Now compare that to latency. &lt;strong&gt;Kruskal-Wallis test on latency: p = 0.000001.&lt;/strong&gt; The speed differences are extremely real and not going away with more data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Translation: don't pick your framework based on quality. Pick based on speed, cost, and consistency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbayk8n6jwv0087btjmmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbayk8n6jwv0087btjmmc.png" alt="Quality vs latency scatter plot" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scatter plot drives this home. Quality clusters tightly between 8.6 and 10.0 regardless of framework, while latency sprawls from 80 seconds to over 700. The vertical axis is noise. The horizontal axis is signal.&lt;/p&gt;

&lt;p&gt;This is the single most important finding from this benchmark: &lt;strong&gt;all five frameworks produce excellent output when given the same model and prompts.&lt;/strong&gt; The framework is the orchestration layer, not the intelligence layer. The model does the heavy lifting. The framework's job is to get out of the way efficiently — and that's where the real differences emerge.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ranking
&lt;/h2&gt;

&lt;p&gt;Putting it all together:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Latency (s)&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Consistency (std)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MS Agent&lt;/td&gt;
&lt;td&gt;9.87&lt;/td&gt;
&lt;td&gt;93&lt;/td&gt;
&lt;td&gt;7,006&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;9.66&lt;/td&gt;
&lt;td&gt;246&lt;/td&gt;
&lt;td&gt;27,684&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;9.63&lt;/td&gt;
&lt;td&gt;572&lt;/td&gt;
&lt;td&gt;10,793&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;9.42&lt;/td&gt;
&lt;td&gt;506&lt;/td&gt;
&lt;td&gt;8,823&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;9.31&lt;/td&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;8,676&lt;/td&gt;
&lt;td&gt;0.36&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MS Agent dominates on every metric — quality, speed, token efficiency, and consistency — but it's a 1.0.0 beta release with a smaller ecosystem. If you're comfortable betting on a newer framework, it's compelling. If you need production maturity and community support today, that's a different calculation.&lt;/p&gt;

&lt;p&gt;CrewAI is the pragmatic middle ground: fast enough, high quality, reasonable consistency, and the most intuitive API of the bunch. The token cost is the tax you pay for its role-playing architecture. For most teams, that tradeoff is worth it.&lt;/p&gt;

&lt;p&gt;AutoGen produces great output but slowly and unpredictably. Its group chat pattern shines for open-ended agent collaboration — just not for structured pipelines.&lt;/p&gt;

&lt;p&gt;LangGraph and Agents SDK are solid workhorses with lean token usage. LangGraph gives you the most control over execution flow (it's a state machine, after all), while Agents SDK keeps things simple with minimal boilerplate. Both pay for that leanness with longer execution times.&lt;/p&gt;

&lt;p&gt;There's no single winner. There's a set of tradeoffs, and the right choice depends on what you're optimizing for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rest of the Series
&lt;/h2&gt;

&lt;p&gt;This article covered the &lt;em&gt;so what&lt;/em&gt;. The first two in this series cover the &lt;em&gt;what&lt;/em&gt; and the &lt;em&gt;how&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: What Actually Matters&lt;/strong&gt; — The full benchmark results across quality, latency, token usage, and consistency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: How I Built a Fair Benchmark&lt;/strong&gt; — The methodology behind controlled comparisons: same prompts, same model, LLM-as-judge evaluation, and the dependency hell of installing five frameworks that don't want to coexist.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with Python 3.12, uv, Ollama (Qwen 3 14B), and too many hours debugging dependency conflicts between frameworks that each want their own version of the OpenAI SDK.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All code, data, and analysis notebooks are open source: &lt;a href="https://github.com/LukaszGrochal/agent-framework-benchmark" rel="noopener noreferrer"&gt;agent-framework-benchmark&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built a Python CLI Tool for RAG Over Any Document Folder</title>
      <dc:creator>LukaszGrochal</dc:creator>
      <pubDate>Mon, 09 Feb 2026 07:01:00 +0000</pubDate>
      <link>https://dev.to/lukaszgrochal/i-built-a-python-cli-tool-for-rag-over-any-document-folder-55ic</link>
      <guid>https://dev.to/lukaszgrochal/i-built-a-python-cli-tool-for-rag-over-any-document-folder-55ic</guid>
      <description>&lt;p&gt;&lt;em&gt;A zero-config command-line tool for retrieval-augmented generation — index a folder, ask questions, get cited answers. Works locally with Ollama or with cloud APIs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every time I wanted to ask questions about a set of documents, I'd write the same 100 lines of boilerplate: load docs, chunk them, embed them, store in a vector DB, retrieve, generate. I got tired of it. So I built a CLI tool that does it in two commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;RAG prototyping has too much ceremony. You have a folder of PDFs, Markdown files, maybe some text notes. You want to ask questions about them. Simple enough in theory.&lt;/p&gt;

&lt;p&gt;In practice, you're wiring up document loaders, picking a chunking strategy, initializing an embedding provider, setting up a vector store, writing retrieval logic, and then finally getting to the part you actually care about: generating an answer. And you do this every single time you start a new project or want to test a new document set.&lt;/p&gt;

&lt;p&gt;Existing solutions sit at the extremes. Full frameworks like LangChain and LlamaIndex are powerful, but they're heavy. You pull in a framework with dozens of abstractions just to ask a question about a folder. On the other end, tutorial notebooks are disposable. They work once, for one demo, and you throw them away.&lt;/p&gt;

&lt;p&gt;I wanted something in the middle. A CLI that's zero-config for the common case, configurable when you need it, and built from pieces I can reuse in other projects. No framework dependencies. No notebook rot. Just a tool that does one thing well.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;rag-cli-tool gives you two commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rag-cli index ./my-docs/
rag-cli ask &lt;span class="s2"&gt;"What is the refund policy?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Point it at a folder, it indexes everything. Ask a question, it answers from your documents. Supported formats include PDF, Markdown, plain text, and DOCX.&lt;/p&gt;

&lt;p&gt;Under the hood, the pipeline is straightforward. &lt;code&gt;index&lt;/code&gt; loads documents from the directory, splits them into overlapping chunks using a recursive text splitter, generates embeddings, and stores everything in a local ChromaDB instance. &lt;code&gt;ask&lt;/code&gt; embeds your question, retrieves the most similar chunks, and generates an answer using only the retrieved context -- strict RAG, no hallucination from external knowledge.&lt;/p&gt;

&lt;p&gt;The tech stack is deliberately boring. ChromaDB for the vector store because it runs locally with zero setup -- no Docker, no server, just a directory. Typer for the CLI framework because it gives you type-checked arguments and auto-generated help for free. Rich for terminal output because progress bars and formatted answers make the tool pleasant to use. Pydantic Settings for configuration because environment variables and &lt;code&gt;.env&lt;/code&gt; files are the right answer for CLI tools.&lt;/p&gt;

&lt;p&gt;You can run it fully local with Ollama (no API keys needed) or use cloud providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Local -- no API keys&lt;/span&gt;
&lt;span class="nv"&gt;RAG_CLI_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama:llama3.2 &lt;span class="nv"&gt;RAG_CLI_EMBEDDING_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama:nomic-embed-text &lt;span class="se"&gt;\&lt;/span&gt;
  rag-cli ask &lt;span class="s2"&gt;"What are the payment terms?"&lt;/span&gt;

&lt;span class="c"&gt;# Cloud -- Anthropic + OpenAI&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-...
rag-cli ask &lt;span class="s2"&gt;"What are the payment terms?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Architecture -- Built for Reuse
&lt;/h2&gt;

&lt;p&gt;This is where rag-cli-tool diverges from a typical weekend project. The repository contains three independent packages, not one monolith:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
├── rag_cli/       # CLI interface (Typer + Rich)
├── llm_core/      # LLM abstraction layer (providers, config, retry)
└── rag_core/      # RAG pipeline (loaders, chunking, embeddings, retrieval)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;llm_core&lt;/code&gt; handles everything related to calling language models. It defines a provider interface, implements Anthropic and Ollama adapters, and includes retry logic with exponential backoff. It knows nothing about RAG, documents, or CLI output.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rag_core&lt;/code&gt; handles the RAG pipeline: loading documents, chunking text, generating embeddings, storing vectors, and retrieving results. It depends on &lt;code&gt;llm_core&lt;/code&gt; for embedding providers but has no opinion about how you present results to users.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rag_cli&lt;/code&gt; is the thin layer that wires everything together. It handles argument parsing, progress bars, and formatted output. The actual logic is a few lines of glue code.&lt;/p&gt;

&lt;p&gt;The reason for this separation is practical, not academic. I build AI projects regularly. The next one might be a web app, a Slack bot, or an API service. When that happens, I don't want to extract RAG logic from a CLI tool. I want to import &lt;code&gt;rag_core&lt;/code&gt; and start building. Same for &lt;code&gt;llm_core&lt;/code&gt; -- provider switching, retry logic, and configuration management are problems I solve once.&lt;/p&gt;

&lt;p&gt;Every major component has an abstract base class. &lt;code&gt;BaseLLMProvider&lt;/code&gt;, &lt;code&gt;BaseEmbedder&lt;/code&gt;, &lt;code&gt;BaseChunker&lt;/code&gt;, &lt;code&gt;BaseRetriever&lt;/code&gt;, &lt;code&gt;BaseVectorStore&lt;/code&gt;. Today I have one implementation of each. Tomorrow I can add a GraphRAG retriever or a Pinecone vector store without touching existing code. The abstractions aren't speculative -- they're the minimum interface each component needs to be swappable.&lt;/p&gt;
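&lt;p&gt;A minimal sketch of what that boundary can look like. &lt;code&gt;BaseChunker&lt;/code&gt; is from the list above; the method name and the concrete &lt;code&gt;FixedSizeChunker&lt;/code&gt; are my illustration, not the repo's actual signatures:&lt;br&gt;
&lt;/p&gt;

```python
# Illustrative sketch: the minimum interface that makes a component swappable.
# BaseChunker appears in the article; everything else here is assumed.
from abc import ABC, abstractmethod


class BaseChunker(ABC):
    """Split raw text into retrieval-sized pieces."""

    @abstractmethod
    def chunk(self, text):
        """Return a list of chunk strings."""


class FixedSizeChunker(BaseChunker):
    """One concrete implementation; a semantic chunker could slot in later
    without touching any code that depends on BaseChunker."""

    def __init__(self, size=500):
        self.size = size

    def chunk(self, text):
        step = self.size
        return [text[i:i + step] for i in range(0, len(text), step)]


print(FixedSizeChunker(size=4).chunk("abcdefghij"))  # ['abcd', 'efgh', 'ij']
```

The point is the size of the interface: one abstract method is all a caller needs, so every implementation stays trivially swappable.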

&lt;p&gt;The project has full test coverage across all three packages -- 37 tests covering providers, configuration, chunking, embeddings, retrieval, and vector store operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Decisions
&lt;/h2&gt;

&lt;p&gt;Four decisions shaped the project, each with a specific reason:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChromaDB over FAISS or Pinecone.&lt;/strong&gt; FAISS requires numpy gymnastics for persistence and doesn't store metadata natively. Pinecone requires an account and network access. ChromaDB gives you a local, persistent vector store with metadata filtering in one line: &lt;code&gt;ChromaStore(persist_dir=path)&lt;/code&gt;. For a CLI tool that should work offline, this was the only real choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typer over Click.&lt;/strong&gt; Click is battle-tested, but Typer gives you type annotations as your argument definitions. No decorators for each option, no callback functions. You write a normal Python function with type hints, and Typer generates the CLI. The help text writes itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pydantic Settings for configuration.&lt;/strong&gt; CLI tools need to read config from environment variables and &lt;code&gt;.env&lt;/code&gt; files. Pydantic Settings does both, with validation, default values, and type coercion. One class definition replaces a dozen &lt;code&gt;os.getenv()&lt;/code&gt; calls with fallback logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider routing via model string prefix.&lt;/strong&gt; Instead of separate config fields for provider selection, the model string does double duty: &lt;code&gt;claude-3-5-sonnet-latest&lt;/code&gt; routes to Anthropic, &lt;code&gt;ollama:llama3.2&lt;/code&gt; routes to Ollama. One config field, zero ambiguity. This pattern scales to any number of providers without config proliferation.&lt;/p&gt;
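&lt;p&gt;The routing itself is only a few lines. A hedged sketch of how such a prefix scheme can work (&lt;code&gt;resolve_provider&lt;/code&gt; is a hypothetical helper, not the tool's actual function):&lt;br&gt;
&lt;/p&gt;

```python
# Sketch of prefix-based provider routing; resolve_provider is a
# hypothetical name, not copied from rag-cli-tool.
def resolve_provider(model):
    """Map a model string to a (provider, model_name) pair."""
    if model.startswith("ollama:"):
        # 'ollama:llama3.2' routes to the local Ollama server
        return "ollama", model.split(":", 1)[1]
    # Anything without a recognized prefix goes to the hosted provider
    return "anthropic", model


print(resolve_provider("ollama:llama3.2"))           # ('ollama', 'llama3.2')
print(resolve_provider("claude-3-5-sonnet-latest"))  # ('anthropic', 'claude-3-5-sonnet-latest')
```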

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The 80/20 of RAG tooling surprised me. I expected the infrastructure -- vector stores, embedding APIs, retrieval logic -- to consume most of the development time. Instead, chunking decisions dominated. How big should chunks be? How much overlap? Which separators produce coherent boundaries? The pipeline code was straightforward; the tuning was where the real work happened.&lt;/p&gt;
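&lt;p&gt;Those knobs are easy to see in a toy splitter. This is not the recursive splitter the tool uses, just the size/overlap idea in isolation:&lt;br&gt;
&lt;/p&gt;

```python
# Toy illustration of the chunking knobs: size and overlap. More overlap
# means more context continuity across chunk boundaries, but also more
# near-duplicate text embedded and retrieved per query.
def split_with_overlap(text, size, overlap):
    """Return character chunks of length `size`, each sharing `overlap`
    characters with the previous chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


print(split_with_overlap("abcdefghij", size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```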

&lt;p&gt;CLI-first development forces good API design. When your first consumer is a command-line interface, you can't hide behind web framework magic. Every input is explicit, every output is visible. This discipline produced cleaner interfaces in &lt;code&gt;llm_core&lt;/code&gt; and &lt;code&gt;rag_core&lt;/code&gt; than I would have gotten starting with a web app.&lt;/p&gt;

&lt;p&gt;I intentionally shipped without several features: chat mode with conversation history, benchmarking against different chunking strategies, a web UI, and support for more vector stores. These are all reasonable features. They're also scope creep for a v0.1. The foundation is solid, the abstractions are in place, and each of those features is an afternoon of work because the architecture supports extension.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The best developer tools solve your own problems first. rag-cli-tool started as "I'm tired of writing this boilerplate" and turned into reusable building blocks for my entire AI project portfolio. If you work with documents and want a fast way to prototype RAG pipelines, give it a try.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install from PyPI&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;rag-cli-tool

&lt;span class="c"&gt;# Or from source&lt;/span&gt;
git clone https://github.com/LukaszGrochal/rag-cli-tool
&lt;span class="nb"&gt;cd &lt;/span&gt;rag-cli-tool
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# With Ollama (free, local)&lt;/span&gt;
ollama pull llama3.2 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ollama pull nomic-embed-text
rag-cli index ./sample-docs/
rag-cli ask &lt;span class="s2"&gt;"What is the refund policy?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PyPI: &lt;a href="https://pypi.org/project/rag-cli-tool/" rel="noopener noreferrer"&gt;https://pypi.org/project/rag-cli-tool/&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/LukaszGrochal/rag-cli-tool" rel="noopener noreferrer"&gt;https://github.com/LukaszGrochal/rag-cli-tool&lt;/a&gt;&lt;/p&gt;


</description>
      <category>rag</category>
      <category>cli</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>How a Missing Trace Led Me to Build a Local Observability Stack</title>
      <dc:creator>LukaszGrochal</dc:creator>
      <pubDate>Tue, 03 Feb 2026 14:24:11 +0000</pubDate>
      <link>https://dev.to/lukaszgrochal/how-a-missing-trace-led-me-to-build-a-local-observability-stack-2b82</link>
      <guid>https://dev.to/lukaszgrochal/how-a-missing-trace-led-me-to-build-a-local-observability-stack-2b82</guid>
      <description>&lt;p&gt;Last year, our team spent three days debugging why traces from a critical payment service weren't appearing in DataDog. This service processed ~15,000 orders daily—roughly $200K in transactions. The service was running, logs showed successful transactions, but the APM dashboard was empty. No traces. No spans. Nothing.&lt;/p&gt;

&lt;p&gt;For three days, we couldn't answer basic questions: Was the payment gateway slow? Were retries happening? Where was latency hiding? Without traces, we were debugging blind—adding print statements to production code, tailing logs, guessing at latency sources.&lt;/p&gt;

&lt;p&gt;The breakthrough came when someone asked: "Can we just run the same setup locally and see if traces actually leave the application?"&lt;/p&gt;

&lt;p&gt;We couldn't. DataDog requires cloud connectivity. The local agent still needs an API key and phones home. There was no way to intercept and visualize traces without a DataDog account—and our staging key had rate limits that made local testing impractical.&lt;/p&gt;

&lt;p&gt;So I built a stack that accepts ddtrace telemetry locally and routes it to open-source backends. Within an hour of running it, we found the bug. A config change from two sprints back had introduced this filter rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The bug - intended to filter health checks, matched EVERYTHING&lt;/span&gt;
&lt;span class="na"&gt;filter/health&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;span&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.target"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=~&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/.*"'&lt;/span&gt;  &lt;span class="c1"&gt;# Regex matched all paths!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of filtering only &lt;code&gt;/health&lt;/code&gt; endpoints, the regex &lt;code&gt;/.*&lt;/code&gt; matched every single span. A one-character fix—changing &lt;code&gt;=~&lt;/code&gt; to &lt;code&gt;==&lt;/code&gt; and using exact paths—and traces appeared in production within minutes.&lt;/p&gt;
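&lt;p&gt;For reference, the corrected rule looked roughly like this (exact match instead of a regex, so only genuine health-check spans are dropped):&lt;br&gt;
&lt;/p&gt;

```yaml
# The fix: exact match on the real health-check path only
filter/health:
  traces:
    span:
      - 'attributes["http.target"] == "/health"'
```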

&lt;p&gt;&lt;strong&gt;Why did it take three days to find a one-character bug?&lt;/strong&gt; Because we had no visibility into what the collector was actually doing. The config looked reasonable at a glance. The collector reported healthy. Logs showed "traces exported successfully"—but those were &lt;em&gt;other&lt;/em&gt; services' traces passing through. Without a way to isolate our service's telemetry and watch it flow through the pipeline, we were guessing. The local stack gave us that visibility in minutes.&lt;/p&gt;

&lt;p&gt;This repository is a cleaned-up, documented version of that debugging tool. It's now used across three teams: the original payments team, our logistics service team (who had a similar "missing traces" panic), and the platform team who adopted it for testing collector configs before production rollouts.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Stack Does
&lt;/h2&gt;

&lt;p&gt;Point your ddtrace-instrumented application at &lt;code&gt;localhost:8126&lt;/code&gt;. The OpenTelemetry Collector receives DataDog-format traces, converts them to OTLP, and exports to Grafana Tempo. Your application thinks it's talking to a DataDog agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs982bep2ptafaq3du5fn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs982bep2ptafaq3du5fn.png" alt="Architecture diagram" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No code changes required.&lt;/strong&gt; Set &lt;code&gt;DD_AGENT_HOST=localhost&lt;/code&gt; and your existing instrumentation works.&lt;/p&gt;
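&lt;p&gt;Conceptually, the collector config is a receive-convert-export pipeline. A trimmed sketch of that shape (the component names and ports here are my assumptions about a typical setup, not copied from the repo; the DataDog-format receiver lives in the collector-contrib distribution):&lt;br&gt;
&lt;/p&gt;

```yaml
# Sketch of the pipeline shape, not the repo's full config.
receivers:
  datadog:                     # speaks the DataDog agent protocol
    endpoint: "0.0.0.0:8126"   # the port ddtrace expects an agent on

processors:
  batch: {}

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true           # fine for a local-only stack

service:
  pipelines:
    traces:
      receivers: [datadog]
      processors: [batch]
      exporters: [otlp/tempo]
```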




&lt;h2&gt;
  
  
  When To Use This (And When Not To)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This stack is valuable when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to verify ddtrace instrumentation works before deploying&lt;/li&gt;
&lt;li&gt;You're debugging why traces aren't appearing in production DataDog&lt;/li&gt;
&lt;li&gt;You want local trace visualization without DataDog licensing costs&lt;/li&gt;
&lt;li&gt;You're testing collector configurations (sampling, filtering, batching) before production rollout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use something else when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're starting a new project—use OpenTelemetry native instrumentation for better portability&lt;/li&gt;
&lt;li&gt;You need DataDog-specific features (APM service maps, profiling, Real User Monitoring)&lt;/li&gt;
&lt;li&gt;You're processing sustained high throughput (see Performance section below)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alternatives I evaluated:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jaeger All-in-One&lt;/strong&gt;: Simpler setup, but no native log correlation. You'd need a separate logging stack and manual trace ID lookup. For debugging, clicking from log → trace is essential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DataDog Agent locally&lt;/strong&gt;: Requires API key, sends data to cloud, rate limits apply. Defeats the purpose of local-only debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Demo&lt;/strong&gt;: Excellent for learning OTLP from scratch, but doesn't help debug &lt;em&gt;existing&lt;/em&gt; ddtrace instrumentation—which was our whole problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Tempo over Jaeger for the backend?&lt;/strong&gt; Tempo integrates natively with Grafana's Explore view, enabling the bidirectional log↔trace correlation that made debugging fast. Jaeger would require a separate UI and manual correlation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/LukaszGrochal/demo-repo-otel-stack
&lt;span class="nb"&gt;cd &lt;/span&gt;local-otel-stack
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Verify stack health&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:3200/ready   &lt;span class="c"&gt;# Tempo&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:3100/ready   &lt;span class="c"&gt;# Loki&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:13133/health &lt;span class="c"&gt;# OTel Collector&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the example application (requires &lt;a href="https://github.com/astral-sh/uv" rel="noopener noreferrer"&gt;uv&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;examples/python
uv &lt;span class="nb"&gt;sync
&lt;/span&gt;&lt;span class="nv"&gt;DD_AGENT_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localhost &lt;span class="nv"&gt;DD_TRACE_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;uv run uvicorn app:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate a trace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"user_id": 1, "product": "widget", "amount": 29.99}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open Grafana at &lt;code&gt;http://localhost:3000&lt;/code&gt; → Explore → Tempo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb0onjiyza48vfyaq1to.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb0onjiyza48vfyaq1to.png" alt="Trace visualization in Grafana Tempo" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traces not appearing?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check collector is receiving data&lt;/span&gt;
docker-compose logs &lt;span class="nt"&gt;-f&lt;/span&gt; otel-collector | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"trace"&lt;/span&gt;

&lt;span class="c"&gt;# Common issues:&lt;/span&gt;
&lt;span class="c"&gt;# - Port 8126 already bound (existing DataDog agent?)&lt;/span&gt;
&lt;span class="c"&gt;# - DD_TRACE_ENABLED not set to "true"&lt;/span&gt;
&lt;span class="c"&gt;# - Application not waiting for collector startup&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pattern 1: Subprocess Trace Propagation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem We Hit
&lt;/h3&gt;

&lt;p&gt;Once the filter bug was fixed, we used the local stack to investigate another issue: the payment service spawned worker processes to generate invoice PDFs after each order. In production DataDog, we could see the HTTP request span, but the PDF generation time was invisible—traces stopped at the subprocess boundary.&lt;/p&gt;

&lt;p&gt;This made debugging timeouts nearly impossible. When customers complained about slow order confirmations, we couldn't tell if it was the payment gateway or the invoice generation. The worker was a black box.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why ddtrace Doesn't Handle This
&lt;/h3&gt;

&lt;p&gt;ddtrace automatically propagates trace context for HTTP requests, gRPC calls, Celery tasks, and other instrumented protocols. But &lt;code&gt;subprocess.run()&lt;/code&gt; isn't a protocol—it's an OS primitive. ddtrace can't know whether you want context passed via environment variables, command-line arguments, stdin, or files.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;Inject trace context into environment variables before spawning. The key insight is just 10 lines—the rest is error handling. From &lt;code&gt;examples/python/app.py:272-340&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;spawn_traced_subprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# THE KEY PATTERN: inject trace context into subprocess environment
&lt;/span&gt;    &lt;span class="n"&gt;current_span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_span&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DD_TRACE_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DD_PARENT_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subprocess.spawn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subprocess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subprocess.command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subprocess.exit_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full implementation includes timeout handling, error tagging, and logging—see the repository for the complete 70-line version with production error handling.&lt;/p&gt;

&lt;p&gt;The worker process picks up the context at startup. &lt;strong&gt;Key insight&lt;/strong&gt;: &lt;code&gt;DD_TRACE_ID&lt;/code&gt; and &lt;code&gt;DD_PARENT_ID&lt;/code&gt; are plain environment variables the parent injects before spawning; the worker reads them once, activates them as its parent context, and from then on every span it creates links into the parent trace automatically—no per-span linking required.&lt;/p&gt;
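&lt;p&gt;The parent side is symmetric: copy the environment and add the two variables before spawning. A minimal sketch—the literal IDs stand in for values you would take from &lt;code&gt;tracer.current_span()&lt;/code&gt;, and the echo child exists only to prove the context crosses the process boundary:&lt;/p&gt;

```python
import os
import subprocess
import sys

def spawn_with_trace_context(command, trace_id, parent_span_id):
    """Inject trace context into the child's environment before spawning."""
    env = os.environ.copy()
    env["DD_TRACE_ID"] = str(trace_id)
    env["DD_PARENT_ID"] = str(parent_span_id)
    return subprocess.run(command, env=env, capture_output=True, text=True)

# The child echoes the variables back, showing they crossed the boundary.
result = spawn_with_trace_context(
    [sys.executable, "-c",
     "import os; print(os.environ['DD_TRACE_ID'], os.environ['DD_PARENT_ID'])"],
    trace_id=4886718345,
    parent_span_id=42,
)
print(result.stdout.strip())  # → 4886718345 42
```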

&lt;p&gt;From &lt;code&gt;examples/python/worker.py:89-105&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_parent_trace_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Read trace context injected by parent process.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DD_TRACE_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DD_PARENT_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worker creates nested spans that automatically link to the parent trace. From &lt;code&gt;examples/python/worker.py:108-170&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;simulate_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;worker.process_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file-worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file.path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;worker.pid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getpid&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file.read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;read_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# ... file reading with span tags
&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk.process&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;chunk_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;chunk_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk.index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# ... chunk processing
&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file.write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;write_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# ... file writing with span tags
&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lines_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;processed_lines&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See &lt;code&gt;worker.py&lt;/code&gt; for the full implementation with error simulation and detailed span tagging.&lt;/p&gt;

&lt;p&gt;Test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/process-file &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"file_path": "test.txt"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgyra7wlck93t4zguxbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgyra7wlck93t4zguxbu.png" alt="Subprocess trace propagation" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trace shows the complete chain: HTTP request → subprocess.spawn → worker.process_file → file.read → chunk.process (×N) → file.write. All connected under one trace ID.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitation
&lt;/h3&gt;

&lt;p&gt;This only works for synchronous subprocess spawning where you control the invocation. For Celery, RQ, or other task queues, use their built-in trace propagation instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pattern 2: Circuit Breaker Observability
&lt;/h2&gt;

&lt;p&gt;We don't need another circuit breaker implementation—libraries like &lt;code&gt;pybreaker&lt;/code&gt; and &lt;code&gt;tenacity&lt;/code&gt; handle that. What matters for observability is &lt;em&gt;tagging spans with circuit state&lt;/em&gt; so you can query failures during incidents.&lt;/p&gt;

&lt;p&gt;From &lt;code&gt;examples/python/app.py:609-618&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check inventory with circuit breaker
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventory.check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventory-service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;circuit_breaker.state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;external_service_circuit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;external_service_circuit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;can_execute&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error.type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;circuit_open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;PROM_ORDERS_FAILED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;circuit_open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inventory service circuit breaker open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During an incident, query Tempo for &lt;code&gt;circuit_breaker.state=OPEN&lt;/code&gt; to see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When exactly the circuit opened&lt;/li&gt;
&lt;li&gt;What failure pattern preceded it&lt;/li&gt;
&lt;li&gt;Which downstream service caused the cascade&lt;/li&gt;
&lt;/ul&gt;
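&lt;p&gt;With the tag in place, the Tempo query should be along these lines in TraceQL (attribute name matching the &lt;code&gt;set_tag&lt;/code&gt; call above, assuming the state serializes as the string &lt;code&gt;OPEN&lt;/code&gt;):&lt;/p&gt;

```
{ span.circuit_breaker.state = "OPEN" }
```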




&lt;h2&gt;
  
  
  Pattern 3: Log-Trace Correlation
&lt;/h2&gt;

&lt;p&gt;Click a log line in Loki, jump directly to the trace in Tempo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inject Trace IDs Into Logs
&lt;/h3&gt;

&lt;p&gt;From &lt;code&gt;examples/python/app.py:84-109&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TraceIdFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Injects trace context into log records for correlation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Get current span from ddtrace
&lt;/span&gt;        &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_span&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt;
            &lt;span class="c1"&gt;# Convert to hex format for Tempo compatibility
&lt;/span&gt;            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id_hex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id_hex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;


&lt;span class="c1"&gt;# Set up logging with trace correlation
# Use hex format for trace_id to match Tempo's format
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s %(levelname)s [trace_id=%(trace_id_hex)s span_id=%(span_id)s] %(name)s: %(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TraceIdFilter&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
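&lt;p&gt;The filter is easy to exercise without ddtrace. A minimal sketch with a stub standing in for &lt;code&gt;tracer.current_span()&lt;/code&gt;—the &lt;code&gt;FakeSpan&lt;/code&gt; IDs are invented for illustration:&lt;/p&gt;

```python
import io
import logging

class FakeSpan:
    """Stand-in for a ddtrace span; real IDs come from tracer.current_span()."""
    trace_id = 4886718345  # == 0x123456789
    span_id = 42

class StubTracer:
    def current_span(self):
        return FakeSpan()

tracer = StubTracer()  # in app.py this is ddtrace's tracer

class TraceIdFilter(logging.Filter):
    """Same logic as the app.py filter: copy trace context onto each record."""
    def filter(self, record):
        span = tracer.current_span()
        if span:
            record.trace_id = span.trace_id
            record.span_id = span.span_id
            record.trace_id_hex = format(span.trace_id, 'x')  # hex for Tempo
        else:
            record.trace_id = record.span_id = 0
            record.trace_id_hex = '0'
        return True

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    '%(levelname)s [trace_id=%(trace_id_hex)s span_id=%(span_id)s] %(message)s'))
logger = logging.getLogger('demo')
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.propagate = False
logger.warning('order failed')

print(stream.getvalue().strip())
# → WARNING [trace_id=123456789 span_id=42] order failed
```

The hex conversion is the part that matters: Tempo displays trace IDs in hex, so a decimal ID in the log line would never match the derived-field regex.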



&lt;h3&gt;
  
  
  Configure Grafana to Link Them
&lt;/h3&gt;

&lt;p&gt;The Loki data source includes derived fields that extract trace IDs and create clickable links:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;derivedFields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo&lt;/span&gt;
    &lt;span class="na"&gt;matcherRegex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trace_id=([a-fA-F0-9]+)'&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TraceID&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$${__value.raw}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr1fq7nxor7ycddg5d9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr1fq7nxor7ycddg5d9h.png" alt="Loki log with trace ID" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Correlation works bidirectionally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loki → Tempo&lt;/strong&gt;: Click trace ID in any log entry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tempo → Loki&lt;/strong&gt;: Click "Logs for this span" in trace view&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y157d5n12bijlkzh2sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y157d5n12bijlkzh2sa.png" alt="Log-trace correlation" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Collector Pipeline
&lt;/h2&gt;

&lt;p&gt;This is where the debugging power comes from. From &lt;code&gt;config/otel-collector.yaml:146-160&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;extensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;health_check&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;zpages&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Main traces pipeline - processes all incoming traces&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;memory_limiter&lt;/span&gt;      &lt;span class="c1"&gt;# First: prevent OOM&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;filter/health&lt;/span&gt;       &lt;span class="c1"&gt;# Remove health check noise&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;attributes/sanitize&lt;/span&gt; &lt;span class="c1"&gt;# Remove sensitive data&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;probabilistic_sampler&lt;/span&gt; &lt;span class="c1"&gt;# Sample if needed&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;batch&lt;/span&gt;               &lt;span class="c1"&gt;# Batch for efficiency&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;resource&lt;/span&gt;            &lt;span class="c1"&gt;# Add metadata&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why each processor matters:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Processor&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;What breaks without it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;memory_limiter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevents OOM on traffic spikes&lt;/td&gt;
&lt;td&gt;Collector crashes, loses all buffered traces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;filter/health&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removes health check noise&lt;/td&gt;
&lt;td&gt;Storage fills with useless spans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;attributes/sanitize&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Strips sensitive headers&lt;/td&gt;
&lt;td&gt;Credentials leaked to trace storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;batch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Groups spans for efficient export&lt;/td&gt;
&lt;td&gt;High CPU, slow exports, Tempo overload&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the filter configuration behind our original production issue. From &lt;code&gt;config/otel-collector.yaml:82-91&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;filter/health&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;error_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ignore&lt;/span&gt;
  &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;span&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.target"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/health"'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.target"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/ready"'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.target"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/metrics"'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.target"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/"'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.route"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/health"'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.route"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/ready"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our production bug was a wildcard in one of these expressions that matched everything. Having a local stack to test filter rules before deploying them would have caught this in minutes, not days.&lt;/p&gt;
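&lt;p&gt;As a hypothetical illustration of how small the mistake can be (not the literal rule from our incident), an OTTL regex match one character too broad drops every span:&lt;/p&gt;

```yaml
filter/health:
  traces:
    span:
      - 'IsMatch(attributes["http.target"], "/health.*")'  # intended: health endpoints only
      # - 'IsMatch(attributes["http.target"], ".*")'       # over-broad: matches EVERY span
```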




&lt;h2&gt;
  
  
  Performance Characteristics
&lt;/h2&gt;

&lt;p&gt;Measured on M1 MacBook Pro, 16GB RAM, Docker Desktop 4.25:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Methodology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Idle memory (full stack)&lt;/td&gt;
&lt;td&gt;1.47 GB&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;docker stats&lt;/code&gt; after 5min idle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Collector memory&lt;/td&gt;
&lt;td&gt;89 MB&lt;/td&gt;
&lt;td&gt;Under load, batch size 100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sustained throughput&lt;/td&gt;
&lt;td&gt;~800 spans/sec&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hey&lt;/code&gt; load test, 50 concurrent, 60 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tempo query latency&lt;/td&gt;
&lt;td&gt;35-80ms&lt;/td&gt;
&lt;td&gt;Trace with 50 spans, cold query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Export latency (P99)&lt;/td&gt;
&lt;td&gt;18ms&lt;/td&gt;
&lt;td&gt;Collector metrics &lt;code&gt;/metrics&lt;/code&gt; endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What does 800 spans/sec mean in practice?&lt;/strong&gt; A typical request to our payment service generates 8-12 spans (HTTP, DB queries, external calls). That's ~70 requests/second before hitting limits. Our heaviest local testing—running integration suites with parallel workers—peaks at ~200 spans/sec, well within capacity.&lt;/p&gt;
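&lt;p&gt;The back-of-the-envelope math, for the skeptical:&lt;/p&gt;

```python
throughput = 800        # sustained spans/sec from the benchmark table
spans_per_request = 11  # midpoint of the 8-12 spans a payment request emits
requests_per_sec = throughput // spans_per_request
print(requests_per_sec)  # → 72, i.e. the ~70 req/s ceiling
```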

&lt;p&gt;&lt;strong&gt;At ~1200 spans/second&lt;/strong&gt;, the collector begins dropping traces. You'll see this in the &lt;code&gt;otelcol_processor_dropped_spans&lt;/code&gt; metric. For higher throughput, increase &lt;code&gt;memory_limiter&lt;/code&gt; thresholds and batch sizes—but this is a local dev tool, not a production trace pipeline.&lt;/p&gt;
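&lt;p&gt;If you do need more headroom, these are the knobs to raise—a sketch with illustrative values, not tuned recommendations:&lt;/p&gt;

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512        # raise the hard memory budget
    spike_limit_mib: 128  # headroom for bursts above the soft limit
  batch:
    send_batch_size: 2048 # larger batches, fewer export round-trips
    timeout: 5s
```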




&lt;h2&gt;
  
  
  Security Model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's Implemented
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Measure&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;read_only: true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immutable container filesystem—compromise can't persist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;no-new-privileges&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Blocks privilege escalation via setuid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network isolation&lt;/td&gt;
&lt;td&gt;Tempo only accessible from internal Docker network&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource limits&lt;/td&gt;
&lt;td&gt;Memory caps prevent container resource exhaustion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
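&lt;p&gt;In &lt;code&gt;docker-compose&lt;/code&gt; terms, those measures look roughly like this per service (the memory limit here is illustrative):&lt;/p&gt;

```yaml
services:
  tempo:
    read_only: true
    security_opt:
      - no-new-privileges:true
    networks:
      - internal          # not published to the host
    deploy:
      resources:
        limits:
          memory: 512M
```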

&lt;h3&gt;
  
  
  What's NOT Implemented
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLS between components&lt;/strong&gt;: All traffic is plaintext on the Docker network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt;: Grafana runs with anonymous access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets management&lt;/strong&gt;: No sensitive data in this stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is appropriate for local development. For shared dev environments, enable Grafana authentication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.override.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GF_AUTH_ANONYMOUS_ENABLED=false&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Alerting
&lt;/h2&gt;

&lt;p&gt;The stack ships with pre-configured Prometheus and Loki alert rules, evaluated by Grafana:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alert&lt;/th&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HighErrorRate&lt;/td&gt;
&lt;td&gt;&amp;gt;10% order failures&lt;/td&gt;
&lt;td&gt;Catch application bugs early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SlowRequests&lt;/td&gt;
&lt;td&gt;P95 latency &amp;gt; 2s&lt;/td&gt;
&lt;td&gt;Detect performance regressions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CircuitBreakerOpen&lt;/td&gt;
&lt;td&gt;State = OPEN&lt;/td&gt;
&lt;td&gt;External dependency issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ErrorLogSpike&lt;/td&gt;
&lt;td&gt;Error log rate &amp;gt; 0.1/sec&lt;/td&gt;
&lt;td&gt;Unusual error patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ServiceDown&lt;/td&gt;
&lt;td&gt;Scrape target unreachable&lt;/td&gt;
&lt;td&gt;Infrastructure failures&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
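&lt;p&gt;As a concrete example, the SlowRequests condition is a standard histogram-quantile query. A Prometheus-style rule sketch—the &lt;code&gt;http_request_duration_seconds&lt;/code&gt; metric name is an assumption; substitute whatever your instrumentation actually emits:&lt;/p&gt;

```yaml
# Prometheus alert rule sketch -- metric name is an assumption
groups:
  - name: latency
    rules:
      - alert: SlowRequests
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 2s for 5 minutes"
```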

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far2gd73stn6d4nxuqevm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far2gd73stn6d4nxuqevm.png" alt="Grafana alerting rules" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trace ID format conversion&lt;/strong&gt;: DataDog uses 64-bit trace IDs; OTLP uses 128-bit. The collector zero-pads the upper 64 bits, so correlation with systems that generate native 128-bit IDs may fail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No DataDog APM features&lt;/strong&gt;: This gives you traces, not service maps, anomaly detection, or profiling integration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory footprint&lt;/strong&gt;: ~1.5GB at idle. Not suitable for resource-constrained environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retention defaults&lt;/strong&gt;: 24h for traces, 7d for logs. Configurable in &lt;code&gt;tempo.yaml&lt;/code&gt; and &lt;code&gt;loki.yaml&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
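&lt;p&gt;The zero-padding in limitation 1 is easy to state precisely: the 64-bit DataDog ID becomes the low 64 bits of a 128-bit OTLP ID whose high bits are all zero. A minimal sketch of the conversion, and of why round-tripping a 128-bit-native ID is lossy:&lt;/p&gt;

```python
def dd_to_otlp(dd_trace_id: int) -> str:
    """Zero-pad a 64-bit DataDog trace ID into a 32-hex-char OTLP trace ID."""
    return format(dd_trace_id, "032x")  # high 64 bits are always zero

def otlp_to_dd(otlp_trace_id: str) -> int:
    """Truncate a 128-bit OTLP trace ID to its low 64 bits (lossy!)."""
    return int(otlp_trace_id, 16) % (2 ** 64)

# A DataDog-originated ID survives the round trip...
assert otlp_to_dd(dd_to_otlp(0xDEADBEEF)) == 0xDEADBEEF
# ...but a 128-bit-native ID does not: its high 64 bits are discarded,
# which is exactly the cross-system correlation failure described above.
```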




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start with OpenTelemetry native instrumentation.&lt;/strong&gt; If starting fresh today, I'd use the OpenTelemetry Python SDK rather than ddtrace. The 64-bit/128-bit trace ID mismatch we deal with is a symptom of building on a proprietary format. OTel gives you vendor portability from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Use W3C Trace Context for subprocess propagation.&lt;/strong&gt; The current pattern relies on ddtrace reading &lt;code&gt;DD_TRACE_ID&lt;/code&gt; and &lt;code&gt;DD_PARENT_ID&lt;/code&gt; from the environment—behavior that's not prominently documented and could change. A more portable approach would serialize W3C Trace Context headers to a temp file or pass via stdin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# More portable alternative (pseudocode, not implemented here)
# W3C traceparent format: version-trace_id(32 hex)-parent_id(16 hex)-flags
&lt;/span&gt;&lt;span class="n"&gt;traceparent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;00-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;032&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;016&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traceparent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;traceparent&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
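&lt;p&gt;On the receiving side, the subprocess would read that JSON from stdin and split the header back into its fields. A sketch of the child end—the field layout follows the W3C Trace Context spec, while the stdin plumbing is hypothetical, mirroring the pseudocode above:&lt;/p&gt;

```python
import json
import sys

def parse_traceparent(header: str):
    """Split a W3C traceparent header into (trace_id, span_id) integers.

    Layout: version-trace_id(32 hex)-parent_id(16 hex)-flags
    """
    version, trace_id_hex, span_id_hex, _flags = header.split("-")
    if version != "00":
        raise ValueError(f"unsupported traceparent version: {version}")
    return int(trace_id_hex, 16), int(span_id_hex, 16)

# In the child process (hypothetical wiring, matching the parent sketch):
#   payload = json.load(sys.stdin)
#   trace_id, parent_id = parse_traceparent(payload["traceparent"])
```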



&lt;p&gt;&lt;strong&gt;3. Add a config validation mode.&lt;/strong&gt; The filter regex bug that started this project could have been caught by a "dry run" mode that shows which spans &lt;em&gt;would&lt;/em&gt; be filtered without actually dropping them. I may add this in a future version.&lt;/p&gt;
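&lt;p&gt;The core of such a dry-run mode is small. A hypothetical sketch—the span names and filter pattern are invented for illustration and are not part of the published stack:&lt;/p&gt;

```python
import re

def dry_run_filter(span_names, pattern):
    """Report which spans a filter regex WOULD drop, without dropping any."""
    rx = re.compile(pattern)
    kept, dropped = [], []
    for name in span_names:
        (dropped if rx.search(name) else kept).append(name)
    return kept, dropped

# A filter meant to drop only health checks -- the dry run shows its reach
# before it ever touches real traffic.
spans = ["http.request", "db.query", "healthcheck.ping"]
kept, dropped = dry_run_filter(spans, r"^health")
# dropped == ["healthcheck.ping"]; kept retains the two real spans
```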

&lt;p&gt;&lt;strong&gt;4. Consider ClickHouse for trace storage.&lt;/strong&gt; Tempo is excellent for this use case, but for teams that need SQL queries over traces (e.g., "show me all spans where &lt;code&gt;db.statement&lt;/code&gt; contains 'SELECT *'"), ClickHouse with the OTel exporter would be more powerful.&lt;/p&gt;

&lt;p&gt;That said, for teams already invested in ddtrace, this stack provides immediate value without code changes—and that was the whole point.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons for Incident Response
&lt;/h2&gt;

&lt;p&gt;This incident changed how we handle observability issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;"Can we reproduce it locally?" is now our first question.&lt;/strong&gt; If the answer is no, we build the tooling to make it yes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config changes to observability pipelines get the same review rigor as application code.&lt;/strong&gt; That regex change went through PR review—but nobody caught it because we couldn't test it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent failures are the worst failures.&lt;/strong&gt; The collector reported healthy while dropping 100% of our traces. We now have alerts on &lt;code&gt;otelcol_processor_dropped_spans &amp;gt; 0&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
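&lt;p&gt;That third lesson turns directly into an alert rule. A Prometheus-style sketch—the metric name is the collector's own, but the window and severity are judgment calls:&lt;/p&gt;

```yaml
# Prometheus alert rule sketch -- window and severity are judgment calls
groups:
  - name: collector-health
    rules:
      - alert: CollectorDroppingSpans
        expr: increase(otelcol_processor_dropped_spans[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "OTel collector is silently dropping spans"
```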




&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/LukaszGrochal/demo-repo-otel-stack" rel="noopener noreferrer"&gt;github.com/LukaszGrochal/demo-repo-otel-stack&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a documented, tested version of the debugging tool that helped us fix a production outage. The patterns—subprocess tracing, circuit breaker tagging, log correlation—are used across three teams in our development workflows.&lt;/p&gt;

&lt;p&gt;MIT licensed. Issues and PRs welcome.&lt;/p&gt;

</description>
      <category>opentelemetry</category>
      <category>observability</category>
      <category>datadog</category>
      <category>grafana</category>
    </item>
  </channel>
</rss>
