<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: LukaszGrochal</title>
    <description>The latest articles on DEV Community by LukaszGrochal (@lukaszgrochal).</description>
    <link>https://dev.to/lukaszgrochal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3746029%2F11e966f6-099d-4c86-b40f-ae7f94ace120.png</url>
      <title>DEV Community: LukaszGrochal</title>
      <link>https://dev.to/lukaszgrochal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lukaszgrochal"/>
    <language>en</language>
    <item>
      <title>Choosing an Agent Framework in 2026: A Data-Driven Decision Guide</title>
      <dc:creator>LukaszGrochal</dc:creator>
      <pubDate>Thu, 26 Feb 2026 11:01:00 +0000</pubDate>
      <link>https://dev.to/lukaszgrochal/choosing-an-agent-framework-in-2026-a-data-driven-decision-guide-1mkk</link>
      <guid>https://dev.to/lukaszgrochal/choosing-an-agent-framework-in-2026-a-data-driven-decision-guide-1mkk</guid>
      <description>&lt;p&gt;You've seen the benchmarks. You've read the methodology. Now the question that actually matters: which one should YOU use?&lt;/p&gt;




&lt;p&gt;I spent weeks building the same multi-agent workflow in five frameworks, running 45 controlled benchmarks, and analyzing every dimension I could measure. The full results are in &lt;a href="https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela"&gt;Part 1&lt;/a&gt; and the methodology is in &lt;a href="https://dev.to/lukaszgrochal/how-i-built-a-fair-ai-agent-benchmark-architecture-methodology-4p34"&gt;Part 2&lt;/a&gt;. This article distills all of that into actionable guidance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Short Answer
&lt;/h2&gt;

&lt;p&gt;There is no single "best" framework. If someone tells you otherwise, they're either selling something or they only evaluated one dimension. The right choice depends on what you're optimizing for.&lt;/p&gt;

&lt;p&gt;Here's the decision matrix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your Priority&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;th&gt;Why (Data)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fastest prototype&lt;/td&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;Simplest API, 246s latency, 9.66 quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production stability&lt;/td&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;1.0 GA, graph-based control, 9.42 quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw speed&lt;/td&gt;
&lt;td&gt;MS Agent Framework&lt;/td&gt;
&lt;td&gt;93s latency (2.6x faster than the next best), 9.87 quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft/Azure ecosystem&lt;/td&gt;
&lt;td&gt;MS Agent Framework&lt;/td&gt;
&lt;td&gt;Ecosystem integration, successor to AutoGen + Semantic Kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-native apps&lt;/td&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;Tightest OpenAI integration, built-in tracing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lowest token cost&lt;/td&gt;
&lt;td&gt;MS Agent Framework&lt;/td&gt;
&lt;td&gt;7,006 tokens/run (vs CrewAI's 27,684)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Most consistent output&lt;/td&gt;
&lt;td&gt;MS Agent Framework&lt;/td&gt;
&lt;td&gt;Std=0.10, range=0.2 (narrowest)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you already know your top constraint, that table might be all you need. If you want to understand the tradeoffs in depth, keep reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Factor 1: Consistency Matters More Than Average Score
&lt;/h2&gt;

&lt;p&gt;This is the finding I keep coming back to. Everyone focuses on mean quality scores, but variance is what bites you in production.&lt;/p&gt;

&lt;p&gt;Think about it this way: if a framework averages 9.6 but occasionally drops to 8.6, you need retry logic, output validation, and fallback handling. If another framework averages 9.87 and never drops below 9.8, you can trust it and move on.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Mean&lt;/th&gt;
&lt;th&gt;Std Dev&lt;/th&gt;
&lt;th&gt;Min&lt;/th&gt;
&lt;th&gt;Max&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MS Agent&lt;/td&gt;
&lt;td&gt;9.87&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;9.8&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;9.66&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;9.2&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;9.42&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;9.31&lt;/td&gt;
&lt;td&gt;0.36&lt;/td&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;9.6&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;9.63&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1helzc0hz1q6r6povjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1helzc0hz1q6r6povjq.png" alt="Score distribution by framework" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MS Agent Framework's consistency is remarkable. A standard deviation of 0.10 means virtually every run lands in the same narrow band. AutoGen sits at the other end with a 1.4-point range and a std dev of 0.45 -- assuming roughly normal score distributions, about one run in three lands more than 0.45 points from the mean.&lt;/p&gt;

&lt;p&gt;Why does this happen? Architecture. Sequential pipelines (MS Agent) produce deterministic data flow: each agent gets a fixed input and produces a fixed output. Group chat patterns (AutoGen) introduce conversational branching where subtle phrasing differences in early turns cascade into meaningfully different outputs, even at &lt;code&gt;temperature=0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're building a pipeline that runs unattended -- batch processing, scheduled reports, automated analysis -- consistency should be your top priority. A framework that's slightly lower in average quality but tighter in variance will cause fewer 3am pages than one that's higher on average but occasionally produces garbage.&lt;/p&gt;
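&lt;p&gt;To make the variance argument concrete, here is a minimal sketch of how these statistics are computed with Python's standard library. The score lists are illustrative stand-ins chosen to match the tabled ranges, not the actual benchmark data:&lt;/p&gt;

```python
# Sketch: computing mean, spread, and range from per-run quality scores.
# These score lists are illustrative stand-ins, not the real benchmark data.
import statistics

runs = {
    "MS Agent": [9.8, 9.8, 9.9, 9.9, 9.9, 9.9, 9.9, 10.0, 9.8],
    "AutoGen": [8.6, 9.6, 9.8, 9.9, 9.5, 9.7, 10.0, 9.8, 9.8],
}

for name, scores in runs.items():
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)   # population std dev; the article may use the sample formula
    rng = max(scores) - min(scores)   # min-to-max spread
    print(f"{name}: mean={mean:.2f} std={std:.2f} range={rng:.1f}")
```

&lt;p&gt;For production monitoring, tracking the range (or a percentile spread) per deployment window is often more actionable than the mean alone.&lt;/p&gt;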

&lt;h2&gt;
  
  
  Factor 2: Token Cost at Scale
&lt;/h2&gt;

&lt;p&gt;When running locally via Ollama, tokens are effectively free (you pay in hardware and electricity instead). The moment you deploy to a cloud model, they're your biggest variable cost.&lt;/p&gt;

&lt;p&gt;Here's what each framework costs at GPT-4o rates ($2.50/1M input, $10/1M output, assuming a roughly 40/60 input/output split):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Tokens/Run&lt;/th&gt;
&lt;th&gt;Approx Cost/Run&lt;/th&gt;
&lt;th&gt;1,000 runs/month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MS Agent&lt;/td&gt;
&lt;td&gt;7,006&lt;/td&gt;
&lt;td&gt;~$0.06&lt;/td&gt;
&lt;td&gt;~$60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;8,676&lt;/td&gt;
&lt;td&gt;~$0.07&lt;/td&gt;
&lt;td&gt;~$70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;8,823&lt;/td&gt;
&lt;td&gt;~$0.07&lt;/td&gt;
&lt;td&gt;~$70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;10,793&lt;/td&gt;
&lt;td&gt;~$0.09&lt;/td&gt;
&lt;td&gt;~$90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;27,684&lt;/td&gt;
&lt;td&gt;~$0.22&lt;/td&gt;
&lt;td&gt;~$220&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7ej7nbmg3f9jyo3bwxa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7ej7nbmg3f9jyo3bwxa.png" alt="Token efficiency comparison" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CrewAI uses nearly 4x more tokens than MS Agent Framework. That's the cost of its role-playing architecture -- verbose system prompts and inter-agent communication inflate every run. At $220/month for 1,000 runs, it's still reasonable. But scale to 10,000 runs and you're looking at $2,200 vs $600. That delta funds an engineer for a week.&lt;/p&gt;

&lt;p&gt;MS Agent Framework is the most token-efficient at ~7,000 tokens per run, with Agents SDK and LangGraph close behind at roughly 8,700 and 8,800. If token cost is your binding constraint, any of the three lean frameworks is a safe bet.&lt;/p&gt;
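&lt;p&gt;For readers who want to adapt these numbers to their own pricing, here is the arithmetic as a small sketch, using the assumptions stated above (GPT-4o rates and a fixed 40/60 input/output split). A fixed split is a simplification, so it lands within a few cents of the table's rounded figures:&lt;/p&gt;

```python
# Sketch: cost-per-run arithmetic under the article's stated assumptions
# ($2.50/1M input, $10/1M output, 40/60 input/output split). The real
# measured splits differ per framework, so figures are approximate.
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 10.00 / 1_000_000  # dollars per output token

def cost_per_run(total_tokens, input_share=0.4):
    input_cost = total_tokens * input_share * INPUT_RATE
    output_cost = total_tokens * (1 - input_share) * OUTPUT_RATE
    return input_cost + output_cost

for name, tokens in [("MS Agent", 7_006), ("CrewAI", 27_684)]:
    monthly = cost_per_run(tokens) * 1_000   # at 1,000 runs/month
    print(f"{name}: ${cost_per_run(tokens):.2f}/run, ~${monthly:.0f}/month")
```

&lt;p&gt;Swap in your own provider's rates and your observed input/output split to get a budget estimate before committing to a framework.&lt;/p&gt;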

&lt;h2&gt;
  
  
  Factor 3: Production Readiness
&lt;/h2&gt;

&lt;p&gt;Raw benchmark numbers don't capture maturity. A framework that tops every metric doesn't help you if it ships breaking changes every two weeks or has no documentation for your edge case. Here's my honest tiered assessment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 -- Production Ready&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph 1.0&lt;/strong&gt; -- The only 1.0 GA release in this comparison. Graph-based architecture gives you explicit control over execution flow. Largest community, most Stack Overflow answers, best debugging and observability tools. If something goes wrong at 2am, you'll find help.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 -- Stable, Active Development&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI 1.9&lt;/strong&gt; -- Rapidly evolving with good documentation and an intuitive API. Some API churn between minor versions, so pin your dependencies carefully. The ecosystem is smaller than LangGraph's but growing fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents SDK&lt;/strong&gt; -- OpenAI-backed with a stable API surface. Tightly coupled to OpenAI's ecosystem, which is either a feature or a lock-in risk depending on your perspective. Built-in tracing is a genuine production advantage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tier 3 -- Use with Caution&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen 0.7&lt;/strong&gt; -- Effectively in maintenance mode. Microsoft's engineering energy is flowing into MS Agent Framework. The group chat architecture is genuinely powerful for open-ended collaboration, but if you're starting a new project today, you're building on a platform that's being superseded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tier 4 -- High Potential, Not GA&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MS Agent Framework 1.0.0b&lt;/strong&gt; -- Topped every metric in the benchmark: quality, speed, and consistency. But it's a beta release with GA expected around March 2026. The API surface could change. Documentation is thin. Community support is minimal. If you can absorb that risk, the numbers are compelling. If you need stability guarantees today, wait two months.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Factor 4: Architecture Style
&lt;/h2&gt;

&lt;p&gt;Each framework embodies a different mental model for agent orchestration. Picking one that matches how you think about your problem will save you more time than any benchmark number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph-based (LangGraph)&lt;/strong&gt; -- You define nodes (agents, functions) and edges (transitions, conditions). Execution follows the graph. Best for workflows with branching logic, conditional routing, or cycles. If you think in flowcharts, you'll feel at home.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task-based (CrewAI)&lt;/strong&gt; -- You define tasks with descriptions and assign them to agents with roles. The framework handles sequencing. Lowest boilerplate of the five. Best for quick prototypes and linear pipelines where you don't need fine-grained control over agent interaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Group chat (AutoGen)&lt;/strong&gt; -- Agents communicate via a shared message stream, taking turns based on selection logic. Most flexible for open-ended collaboration where you don't know the conversation shape in advance. Worst for structured pipelines where that flexibility becomes overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential (MS Agent Framework)&lt;/strong&gt; -- A clean pipeline where each agent processes input and passes output to the next. Simple mental model, predictable execution, easy to debug. Best when your workflow is a straight line from input to output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runner-based (Agents SDK)&lt;/strong&gt; -- A runner executes an agent, which can hand off to other agents. Lightweight abstraction with built-in tracing and OpenAI ecosystem integration. Best when you're already deep in the OpenAI stack and want minimal friction.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Recommendation
&lt;/h2&gt;

&lt;p&gt;I'll be opinionated here because vague advice is useless advice. These are my recommendations based on the data, tempered by practical experience:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starting a production system today? LangGraph.&lt;/strong&gt;&lt;br&gt;
It's the only 1.0 GA framework in this comparison. The graph-based architecture scales to complex workflows. The community and tooling ecosystem are mature. Quality is solid at 9.42, and while it's not the fastest (506s) or cheapest in tokens, it has the most predictable upgrade path. You won't regret this choice in six months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prototyping fast? CrewAI.&lt;/strong&gt;&lt;br&gt;
If you need a working multi-agent system by Friday, CrewAI's API is the fastest path from zero to demo. Define roles, assign tasks, run. Accept the 3-4x token overhead as the cost of velocity. You can always migrate later if the token cost becomes a problem at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can wait two months? MS Agent Framework.&lt;/strong&gt;&lt;br&gt;
The benchmark numbers are remarkable: the fastest latency (2.6x quicker than the next-fastest, CrewAI), highest quality, tightest consistency. If the GA release delivers on the beta's promise, this becomes the default recommendation. Watch the March 2026 release closely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Already in the OpenAI ecosystem? Agents SDK.&lt;/strong&gt;&lt;br&gt;
Don't fight your stack. If you're using OpenAI models, OpenAI's function calling, and OpenAI's tooling, the Agents SDK integrates most naturally. Second-lowest token cost (8,676 tokens/run), built-in tracing, clean handoff semantics. The coupling to OpenAI is the obvious tradeoff -- if you ever need to switch providers, you'll be rewriting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bowzerk3lkblneqviaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bowzerk3lkblneqviaq.png" alt="Quality heatmap across frameworks and companies" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Get the Data
&lt;/h2&gt;

&lt;p&gt;Everything behind this analysis is open source. Run the benchmarks yourself, challenge my numbers, extend the comparison to new frameworks or tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/LukaszGrochal/agent-framework-benchmark" rel="noopener noreferrer"&gt;agent-framework-benchmark&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysis notebook&lt;/strong&gt;: &lt;code&gt;notebooks/analysis.ipynb&lt;/code&gt; -- all charts, tables, and statistical tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw data&lt;/strong&gt;: &lt;code&gt;results/benchmark_results.csv&lt;/code&gt; -- 45 runs, every metric&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clone it, install with &lt;code&gt;uv sync&lt;/code&gt;, and run &lt;code&gt;uv run python -m benchmark.runner&lt;/code&gt;. If you find different results, I want to hear about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: &lt;a href="https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela"&gt;I Benchmarked 5 AI Agent Frameworks -- Here's What Actually Matters&lt;/a&gt;&lt;/strong&gt; -- The results: quality, latency, tokens, and consistency across 45 runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: &lt;a href="https://dev.to/lukaszgrochal/how-i-built-a-fair-ai-agent-benchmark-architecture-methodology-4p34"&gt;How I Built a Fair AI Agent Benchmark&lt;/a&gt;&lt;/strong&gt; -- Architecture, methodology, and the engineering behind controlled comparisons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3: Choosing an Agent Framework in 2026&lt;/strong&gt; -- You are here. The decision guide.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with Python 3.12, uv, Ollama (Qwen 3 14B), and 45 runs of hard data. Pick a framework, ship something, and remember: the model does the thinking. The framework just gets out of the way.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>leadership</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>How I Built a Fair AI Agent Benchmark (Architecture &amp; Methodology)</title>
      <dc:creator>LukaszGrochal</dc:creator>
      <pubDate>Tue, 24 Feb 2026 11:01:00 +0000</pubDate>
      <link>https://dev.to/lukaszgrochal/how-i-built-a-fair-ai-agent-benchmark-architecture-methodology-4p34</link>
      <guid>https://dev.to/lukaszgrochal/how-i-built-a-fair-ai-agent-benchmark-architecture-methodology-4p34</guid>
      <description>&lt;p&gt;Comparing frameworks is easy. Comparing them &lt;em&gt;fairly&lt;/em&gt; is the hard part.&lt;/p&gt;




&lt;p&gt;In &lt;a href="https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela"&gt;Part 1 of this series&lt;/a&gt;, I published the results of benchmarking five AI agent frameworks head-to-head. MS Agent Framework won on speed and consistency. Quality scores were nearly identical across the board. The results surprised me.&lt;/p&gt;

&lt;p&gt;But results without methodology are just opinions with charts. This article is about the engineering behind the benchmark: how I designed the system to isolate framework behavior from everything else, the architectural decisions that made fair comparison possible, and the mistakes I'd fix if I ran it again.&lt;/p&gt;

&lt;p&gt;If you've ever tried to compare two libraries by building a quick prototype in each, you know the problem. The first one you build teaches you the task. The second one benefits from everything you learned. Your "comparison" is really measuring your own learning curve. I wanted to eliminate that entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fairness Problem
&lt;/h2&gt;

&lt;p&gt;Most framework comparisons I've seen online have the same fundamental flaw: they're benchmarking prompt quality, not framework quality.&lt;/p&gt;

&lt;p&gt;Think about what typically happens. Someone builds a project in LangGraph, writes carefully tuned prompts, gets great results. Then they try CrewAI, use slightly different wording, maybe a different model temperature, and get different results. They write a blog post declaring one framework superior. But what actually differed? The prompts. The configuration. The author's familiarity with each API. The framework was maybe 10% of the equation.&lt;/p&gt;

&lt;p&gt;There are several ways naive comparisons fail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Different prompts&lt;/strong&gt; — Each implementation uses hand-written instructions. Prompt phrasing changes output quality dramatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different tools&lt;/strong&gt; — One version calls a real API, another uses a mock. Network latency and API variability dominate the measurement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature randomness&lt;/strong&gt; — Running at temperature 0.7 means every run produces different output. You're measuring random variance, not framework capability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework-specific optimizations&lt;/strong&gt; — Tuning one framework's settings while leaving another at defaults isn't a framework comparison; it's a configuration comparison.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted to control for all of this. Every variable that isn't "which framework is orchestrating the agents" had to be identical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Task: Company Research Agent
&lt;/h2&gt;

&lt;p&gt;The benchmark task is a 3-agent pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Researcher&lt;/strong&gt; — Gathers raw information about a target company&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyst&lt;/strong&gt; — Synthesizes research findings into structured business insights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer&lt;/strong&gt; — Produces a polished 500-800 word research report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I chose this task because it hits a sweet spot. It's complex enough to exercise real multi-agent orchestration — three agents with data dependencies, where each agent's output feeds the next. But it's simple enough that the output (a structured report) can be evaluated objectively on dimensions like completeness, accuracy, and readability.&lt;/p&gt;

&lt;p&gt;Each framework researches three companies (Anthropic, Stripe, Datadog), three iterations each, for 9 runs per framework and 45 runs total. Three companies give us variety in available information. Three iterations give us enough repetition to measure consistency.&lt;/p&gt;
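&lt;p&gt;The 5 x 3 x 3 design is easy to generate with &lt;code&gt;itertools.product&lt;/code&gt;. The framework identifiers below are illustrative, not the repo's actual module names:&lt;/p&gt;

```python
# Sketch: enumerating the full run matrix described above.
# Framework identifiers are illustrative stand-ins for the real module names.
from itertools import product

FRAMEWORKS = ["langgraph", "crewai", "autogen", "agents_sdk", "ms_agent"]
COMPANIES = ["Anthropic", "Stripe", "Datadog"]
ITERATIONS = range(1, 4)

runs = list(product(FRAMEWORKS, COMPANIES, ITERATIONS))
print(len(runs))  # 45 runs total, 9 per framework
```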

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The benchmark isn't a loose collection of scripts. It's a modular system with strict dependency boundaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                   ┌──────────────┐
                   │  Benchmark   │
                   │   Runner     │
                   └──────┬───────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
      ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
      │ Framework │ │ Framework │ │ Framework │  x5 frameworks
      │   Impl    │ │   Impl    │ │   Impl    │  x3 companies
      └─────┬─────┘ └─────┬─────┘ └─────┬─────┘  x3 iterations
            │             │             │        = 45 runs
            ▼             ▼             ▼
      ┌─────────────────────────────────────────┐
      │              shared/                    │
      │  prompts.py │ tools.py │ schemas.py     │
      │  config.py (BenchmarkSettings)          │
      └─────────────────────────────────────────┘
                           │
                    ┌──────▼───────┐
                    │  eval_core   │──▶ LLM-as-Judge
                    │  (LLMJudge)  │    Quality Scores
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │ vendor/      │
                    │ llm_core     │──▶ Provider Abstraction
                    └──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The runner dynamically imports each framework module, passes it a company name and settings, and collects the report text plus token usage. It then sends each report through the LLM judge for quality scoring. Everything feeds into a CSV of results for analysis.&lt;/p&gt;
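&lt;p&gt;The dispatch loop can be sketched in a few lines. The module path and the &lt;code&gt;run()&lt;/code&gt; signature here are assumptions for illustration; the repository's actual entry points may be named differently:&lt;/p&gt;

```python
# Sketch of the runner's dynamic-import dispatch. The module path and the
# run() entry point are hypothetical names, not necessarily the repo's.
import importlib

def run_framework(module_name, company):
    module = importlib.import_module(module_name)
    # Each implementation is expected to expose an entry point returning
    # the report text plus token usage for the given company.
    return module.run(company)

# Hypothetical usage:
# report, tokens = run_framework("frameworks.langgraph_impl", "Stripe")
```

&lt;p&gt;Dynamic import keeps the runner agnostic: adding a sixth framework means adding a module, not editing the orchestration code.&lt;/p&gt;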

&lt;h2&gt;
  
  
  The 5 Rules of Fair Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rule 1: Identical Prompts
&lt;/h3&gt;

&lt;p&gt;Every framework implementation imports its prompts from the same &lt;code&gt;shared/prompts.py&lt;/code&gt; file. No framework gets custom instructions. Here are the actual prompt strings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;RESEARCHER_SYSTEM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a company research specialist. Your task is to gather comprehensive &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;information about {company}. Focus on: company overview, key leadership, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;products/services, recent news and developments, market position, and key &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;financial or operational metrics. Be thorough and factual. Present your &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;findings as a structured list of facts with categories.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ANALYST_SYSTEM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a business analyst specializing in {company}. Review the research data &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provided and identify: key strengths, potential risks and challenges, market &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trends affecting {company}, competitive advantages, and notable strategic &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insights. Provide data-driven analysis with clear reasoning.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;WRITER_SYSTEM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a professional report writer covering {company}. Create a structured &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research report with these sections: Executive Summary, Company Overview, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Products &amp;amp; Services, Market Position, Key Insights, and Conclusion. Write &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clearly and professionally. The report should be 500-800 words.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If CrewAI got "Be incredibly detailed and thorough" while LangGraph got "Be concise," we'd be testing prompt engineering, not frameworks. Sharing a single source file eliminates that variable entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 2: Identical Tools
&lt;/h3&gt;

&lt;p&gt;The agents use a mock search tool from &lt;code&gt;shared/tools.py&lt;/code&gt; with pre-built data for each benchmark company. This is critical for two reasons.&lt;/p&gt;

&lt;p&gt;First, determinism. Real API calls return different results at different times. A company's stock price changes, news articles rotate, search rankings shift. Mock data guarantees every framework gets the exact same input information on every run.&lt;/p&gt;

&lt;p&gt;Second, isolation. If one framework happens to make API calls faster due to connection pooling, or runs into rate limiting, that shows up as a latency difference that has nothing to do with the framework's orchestration quality. Mock tools remove network variability from the equation.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;gather_all_search_results()&lt;/code&gt; function runs the same six standard queries for every company, ensuring all implementations receive identical raw data regardless of how they choose to call the search tool.&lt;/p&gt;
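&lt;p&gt;In this spirit, a deterministic mock search tool is little more than a dictionary lookup. The canned data and two-query list below are illustrative; the repo's fixtures run six queries per company:&lt;/p&gt;

```python
# Sketch of a deterministic mock search tool in the spirit of shared/tools.py.
# The canned data and query list are illustrative, not the repo's fixtures.
MOCK_DATA = {
    "Stripe": {
        "overview": "Stripe builds payments infrastructure for the internet.",
        "leadership": "Founded by Patrick and John Collison.",
    },
}

STANDARD_QUERIES = ["overview", "leadership"]  # the real tool runs six queries

def mock_search(company, query):
    # Same input always yields the same output: no network, no data drift.
    return MOCK_DATA.get(company, {}).get(query, "no data")

def gather_all_search_results(company):
    return {q: mock_search(company, q) for q in STANDARD_QUERIES}

print(gather_all_search_results("Stripe"))
```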

&lt;h3&gt;
  
  
  Rule 3: Same Model
&lt;/h3&gt;

&lt;p&gt;All five frameworks run against Qwen 3 14B via Ollama, configured through &lt;code&gt;BenchmarkSettings&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BenchmarkSettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseSettings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;llm_provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;llm_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:14b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;ollama_host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One model, one machine, one inference server. No framework gets a smarter model or a faster endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 4: temperature=0 Everywhere
&lt;/h3&gt;

&lt;p&gt;Every framework implementation sets &lt;code&gt;temperature=0&lt;/code&gt; for all LLM calls. This eliminates random sampling from the generation process. With temperature 0, the model always picks the highest-probability next token, making outputs as deterministic as the framework allows. (Some variation still occurs due to floating-point nondeterminism in GPU computation, but it's minimal.)&lt;/p&gt;
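&lt;p&gt;Concretely, every implementation ends up sending something shaped like the payload below. This helper is a sketch mirroring the shape of Ollama's &lt;code&gt;/api/chat&lt;/code&gt; options field, not any specific framework's API:&lt;/p&gt;

```python
# Sketch: the request shape every implementation effectively sends.
# Mirrors Ollama's /api/chat options field; helper name is hypothetical.
def chat_payload(model, messages):
    return {
        "model": model,
        "messages": messages,
        "options": {"temperature": 0},  # greedy decoding on every call
    }

payload = chat_payload("qwen3:14b", [{"role": "user", "content": "hi"}])
print(payload["options"])
```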

&lt;h3&gt;
  
  
  Rule 5: No Framework-Specific Optimizations
&lt;/h3&gt;

&lt;p&gt;No custom retry logic, no framework-specific prompt tweaking, no tuning of agent count or conversation structure beyond what the pipeline requires. Every implementation gets the most straightforward translation of the three-agent pipeline into that framework's idiom. If a framework makes certain patterns easier or harder, that's a legitimate difference worth measuring.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Module Dependency Fence
&lt;/h2&gt;

&lt;p&gt;Beyond shared inputs, architectural boundaries prevent accidental coupling between components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llm_core         &amp;lt;── eval_core (judge uses BaseLLMProvider)
shared/          &amp;lt;── all framework implementations
eval_core        &amp;lt;── benchmark/ (runner uses judge)
No framework implementation imports from another.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three rules make this work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;eval_core CANNOT import shared/.&lt;/strong&gt; The judge evaluates reports as plain text. It doesn't know what prompts were used, what tools were available, or how agents were structured. This prevents the evaluation from being biased toward the specific task design.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework implementations CANNOT import each other.&lt;/strong&gt; If the LangGraph implementation imported a utility from the CrewAI implementation, we'd have hidden coupling. Each implementation is self-contained within its own package.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;llm_core is vendored and unmodified.&lt;/strong&gt; The LLM provider abstraction layer (&lt;code&gt;BaseLLMProvider&lt;/code&gt;, &lt;code&gt;OllamaProvider&lt;/code&gt;, etc.) is vendored from another project and treated as a frozen dependency. No benchmark-specific modifications.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
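&lt;p&gt;Rules like these are easy to state and easy to violate by accident. A minimal way to enforce the first one in CI (sketched here with hypothetical module names, not taken from the repo) is an AST scan over each package's imports:&lt;/p&gt;

```python
import ast

def forbidden_imports(source, importer, rules):
    """Return imports in `source` that the fence forbids for `importer`."""
    banned = rules.get(importer, [])
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if any(name == b or name.startswith(b + ".") for b in banned):
                hits.append(name)
    return hits

# Rule 1: eval_core must never import shared/.
RULES = {"eval_core": ["shared"]}
# "shared.prompts" and WRITER_PROMPT are hypothetical names used for illustration.
print(forbidden_imports("from shared.prompts import WRITER_PROMPT", "eval_core", RULES))
```

&lt;p&gt;Run over every file in a package, a check like this turns the import fence from a convention into a test failure.&lt;/p&gt;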

&lt;p&gt;This might seem overly strict for a benchmark project. But without these boundaries, it's easy for shared state or implicit dependencies to contaminate the comparison. I've seen benchmark repos where "shared utilities" slowly accumulate framework-specific logic. Explicit import rules prevent that drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM-as-Judge Evaluation
&lt;/h2&gt;

&lt;p&gt;Each of the 45 reports is scored by an LLM judge on five criteria, each rated 1-10:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Completeness&lt;/strong&gt; — Does the report cover key aspects? (leadership, products, market position, developments, metrics)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; — Are stated facts verifiable and reasonable? Any fabrications?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure&lt;/strong&gt; — Well-organized with clear sections? Follows the requested format?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insight&lt;/strong&gt; — Analysis beyond surface-level facts? Meaningful observations?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Readability&lt;/strong&gt; — Well-written, professional, clear?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The judge receives the report as plain text along with a task description, and returns a structured JSON response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_JUDGE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are an expert evaluator of research reports. Score the following report on a &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;scale of 1-10 for each criterion. Be rigorous and fair.

## Criteria
1. **Completeness** (1-10): Does the report cover key aspects of the company?
2. **Accuracy** (1-10): Are the stated facts verifiable and reasonable?
3. **Structure** (1-10): Is the report well-organized with clear sections?
4. **Insight** (1-10): Does the report provide analysis beyond surface-level facts?
5. **Readability** (1-10): Is it well-written, professional, and clear?

## Report to Evaluate
{report}

Respond with ONLY valid JSON (no markdown, no code blocks):
{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completeness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;structure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;,
 &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;readability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;float&amp;gt;,
 &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;brief explanation&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why an LLM judge instead of human evaluation? Three reasons: &lt;strong&gt;scale&lt;/strong&gt; (45 reports is a lot to evaluate manually), &lt;strong&gt;consistency&lt;/strong&gt; (human evaluators drift over time — the 40th report gets different attention than the 5th), and &lt;strong&gt;reproducibility&lt;/strong&gt; (anyone can re-run the judge and get the same scores).&lt;/p&gt;

&lt;p&gt;The limitations are real. LLM judges have known biases — they tend to prefer verbose, well-formatted output over concise but equally correct output. But since all five frameworks produce structurally similar reports (they're all following the same writer prompt with the same section headings), this bias affects all frameworks roughly equally. It's a systematic offset, not a confound.&lt;/p&gt;

&lt;p&gt;The judge uses &lt;code&gt;temperature=0&lt;/code&gt; for consistency and retries up to 3 times on JSON parse failures. Failed parses get logged and the response is re-requested. This handles the occasional case where the model wraps its JSON in markdown code blocks despite being told not to.&lt;/p&gt;
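&lt;p&gt;The fence-stripping-plus-retry behavior is simple enough to sketch (this mirrors the description above, not the repo's exact code):&lt;/p&gt;

```python
import json
import re

# Matches an optional markdown code fence the model sometimes adds anyway.
_FENCE = re.compile(r"^`{3}(?:json)?\s*(.*?)\s*`{3}$", re.DOTALL)

def parse_judge_response(raw):
    """Parse the judge's JSON, tolerating a markdown code-fence wrapper."""
    text = raw.strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)

def judge_with_retries(call_llm, max_attempts=3):
    """Re-request the evaluation on JSON parse failures, up to max_attempts."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_judge_response(call_llm())
        except json.JSONDecodeError as err:
            last_error = err  # the real runner also logs the failed parse
    raise last_error

wrapped = "`" * 3 + 'json\n{"overall": 9.5}\n' + "`" * 3
print(parse_judge_response(wrapped))
```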

&lt;h2&gt;
  
  
  Local-First: Why Ollama Instead of Cloud APIs
&lt;/h2&gt;

&lt;p&gt;The entire benchmark runs locally using Ollama. No cloud API keys required. This was a deliberate choice with several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$0 cost.&lt;/strong&gt; Forty-five benchmark runs plus 45 judge evaluations. At cloud pricing, that's potentially hundreds of dollars. Locally, it's electricity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No rate limits.&lt;/strong&gt; Cloud APIs throttle concurrent requests. Pushing 45 back-to-back benchmark runs through GPT-4o means dealing with rate limiting, retry backoff, and variable response times that have nothing to do with framework quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No network variability.&lt;/strong&gt; When measuring latency differences between frameworks, the last thing you want is network jitter adding 50-500ms of noise per request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete reproducibility.&lt;/strong&gt; Anyone with an Ollama installation and the Qwen 3 14B model can reproduce these results exactly. No API key, no billing account, no waiting list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is obvious: Qwen 3 14B isn't GPT-4o. The absolute quality of outputs is lower than what you'd get from a frontier model. But this benchmark measures &lt;em&gt;relative&lt;/em&gt; framework performance — how much overhead each framework adds, how consistently each one produces results, how efficiently each one uses tokens. Those relative measurements hold regardless of the underlying model's capability.&lt;/p&gt;

&lt;p&gt;The configuration supports cloud providers too (&lt;code&gt;openai&lt;/code&gt;, &lt;code&gt;anthropic&lt;/code&gt; are valid &lt;code&gt;llm_provider&lt;/code&gt; values), so you can re-run with GPT-4o or Claude if you want to validate that framework rankings hold at higher model capability levels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Surviving Dependency Hell
&lt;/h2&gt;

&lt;p&gt;Here's a problem I didn't anticipate: the five frameworks literally cannot all be installed in the same Python environment.&lt;/p&gt;

&lt;p&gt;CrewAI pins &lt;code&gt;openai&amp;lt;1.84&lt;/code&gt;. MS Agent Framework requires &lt;code&gt;openai&amp;gt;=1.99&lt;/code&gt;. These are hard version constraints in their respective &lt;code&gt;pyproject.toml&lt;/code&gt; files. pip will just fail. Even if you could force-install both, one of them would break at runtime.&lt;/p&gt;

&lt;p&gt;The solution: uv's dependency groups (PEP 735). Each framework gets its own resolution context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--group&lt;/span&gt; crewai      &lt;span class="c"&gt;# Installs CrewAI (pins openai&amp;lt;1.84)&lt;/span&gt;
uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--group&lt;/span&gt; msagent     &lt;span class="c"&gt;# Installs MS Agent Framework (needs openai&amp;gt;=1.99)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Groups that are compatible can be installed together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--group&lt;/span&gt; langgraph &lt;span class="nt"&gt;--group&lt;/span&gt; autogen &lt;span class="nt"&gt;--group&lt;/span&gt; agents-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also declared explicit conflicts in &lt;code&gt;pyproject.toml&lt;/code&gt; so that uv resolves these groups independently rather than trying to find a single unified solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.uv]&lt;/span&gt;
&lt;span class="py"&gt;conflicts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"crewai"&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"msagent"&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"crewai"&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"agents-sdk"&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a real-world takeaway that goes beyond benchmarking: &lt;strong&gt;your existing dependency tree might rule out certain frameworks before you write a line of code.&lt;/strong&gt; If your project already depends on &lt;code&gt;openai&amp;gt;=1.90&lt;/code&gt;, CrewAI is off the table until they update their pin. If you're on an older &lt;code&gt;openai&lt;/code&gt; version and can't upgrade, the newer frameworks won't work. Check compatibility before you invest a week building a proof of concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;No benchmark is perfect, and this one has gaps I'd address in a v2:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More test companies.&lt;/strong&gt; Three companies give some variety, but 5-7 would provide better statistical power. With only 9 runs per framework, the confidence intervals on quality scores are wide enough that most pairwise differences aren't statistically significant (as the Mann-Whitney U tests in Part 1 confirmed).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple task types.&lt;/strong&gt; Company research is one workflow pattern. A more comprehensive benchmark would include a coding task (generate and debug code), a data analysis task (interpret a dataset), and a customer support task (handle multi-turn conversations). Different frameworks might excel at different patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human eval baseline.&lt;/strong&gt; I'd recruit 3-5 evaluators to score a subset of reports independently and compare their rankings to the LLM judge's rankings. This would validate whether the judge's quality scores match human intuition or if systematic biases are distorting results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test with cloud models.&lt;/strong&gt; Running the same benchmark with GPT-4o and Claude Sonnet would answer an important question: do framework rankings change with model capability? It's possible that a framework that adds overhead with a strong model actually helps compensate for a weaker model's limitations, or vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standardized token tracking.&lt;/strong&gt; Token tracking varies across frameworks — some report tokens natively, others require instrumentation hooks. A complete benchmark needs a framework-agnostic way to capture token usage at the provider level, rather than relying on each framework's own reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Python 3.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Package Manager&lt;/td&gt;
&lt;td&gt;uv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build System&lt;/td&gt;
&lt;td&gt;Hatchling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Serving&lt;/td&gt;
&lt;td&gt;Ollama (Qwen 3 14B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linter/Formatter&lt;/td&gt;
&lt;td&gt;ruff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type Checker&lt;/td&gt;
&lt;td&gt;mypy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;pytest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analysis&lt;/td&gt;
&lt;td&gt;pandas + Plotly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notebooks&lt;/td&gt;
&lt;td&gt;Jupyter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All code, data, and analysis notebooks are open source: &lt;a href="https://github.com/LukaszGrochal/agent-framework-benchmark" rel="noopener noreferrer"&gt;agent-framework-benchmark&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Series Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: &lt;a href="https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela"&gt;I Benchmarked 5 AI Agent Frameworks — Here's What Actually Matters&lt;/a&gt;&lt;/strong&gt; — The results: quality scores, latency, token efficiency, and consistency across all 45 runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: How I Built a Fair Benchmark&lt;/strong&gt; — You are here. Architecture, methodology, and the engineering behind controlled comparisons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3: A Practical Decision Guide&lt;/strong&gt; — Flowchart for picking the right framework based on your actual constraints.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with Python 3.12, uv, Ollama, and a determination to answer "which framework is best?" with data instead of opinions.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Benchmarked 5 AI Agent Frameworks — Here's What Actually Matters</title>
      <dc:creator>LukaszGrochal</dc:creator>
      <pubDate>Mon, 16 Feb 2026 07:01:00 +0000</pubDate>
      <link>https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela</link>
      <guid>https://dev.to/lukaszgrochal/i-benchmarked-5-ai-agent-frameworks-heres-what-actually-matters-3ela</guid>
      <description>&lt;p&gt;I ran 45 benchmarks across 5 agent frameworks expecting a clear winner. The answer wasn't what I expected.&lt;/p&gt;




&lt;p&gt;Everyone building with LLM agents in 2026 faces the same question: which framework should I use? Blog posts give you vibes. Docs give you cherry-picked examples. Twitter threads give you hot takes from people who tried one framework for a weekend.&lt;/p&gt;

&lt;p&gt;I wanted numbers. Real numbers, from a controlled experiment.&lt;/p&gt;

&lt;p&gt;So I built the same multi-agent workflow — a Company Research Agent — in five different frameworks, ran each one 9 times (3 companies x 3 iterations), scored every output with an LLM judge, and tracked latency and token usage down to the request level. Forty-five runs total, same model, same prompts, same evaluation criteria. No cloud APIs, no variable pricing confounding the results — everything running locally on the same machine.&lt;/p&gt;

&lt;p&gt;Here's what the data actually says.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Five frameworks, each implementing the same three-agent pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Researcher&lt;/strong&gt; — gathers raw information about a company&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyst&lt;/strong&gt; — synthesizes findings into structured insights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer&lt;/strong&gt; — produces a polished research report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The frameworks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph 1.0.x&lt;/strong&gt; — graph-based state machine with explicit node/edge definitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI 1.9.x&lt;/strong&gt; — task-based sequential orchestration with role-playing agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen 0.7.x&lt;/strong&gt; — async group chat where agents collaborate via messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MS Agent Framework 1.0.0b&lt;/strong&gt; — sequential orchestration with built-in tool routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Agents SDK&lt;/strong&gt; — runner-based pipeline with handoff semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All five ran against the same local model (Qwen 3 14B via Ollama) with &lt;code&gt;temperature=0&lt;/code&gt; for reproducibility. The target companies — Anthropic, Stripe, and Datadog — were chosen to represent different levels of public information availability: a heavily covered AI lab, a high-profile private fintech, and a publicly traded enterprise software company. Each framework researched all three, three times each.&lt;/p&gt;

&lt;p&gt;The LLM judge evaluated each output report on five dimensions: completeness, accuracy, structure, insight depth, and readability — each scored 1-10, then combined into an overall quality score.&lt;/p&gt;
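&lt;p&gt;One sanity check worth running on judge output (an illustrative sketch, not part of the published pipeline) is comparing the judge's own overall float against the plain mean of its five subscores:&lt;/p&gt;

```python
from statistics import mean

CRITERIA = ("completeness", "accuracy", "structure", "insight", "readability")

def overall_gap(judgment):
    """How far the judge's 'overall' sits from the plain mean of its subscores."""
    return judgment["overall"] - mean(judgment[k] for k in CRITERIA)

# Hypothetical judgment dict in the shape the judge prompt requests.
sample = {"completeness": 9, "accuracy": 10, "structure": 10,
          "insight": 9, "readability": 10, "overall": 9.6}
print(overall_gap(sample))  # 0.0 when overall equals the subscore mean
```

&lt;p&gt;A large gap on many reports would suggest the judge's overall score encodes something beyond the stated criteria.&lt;/p&gt;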

&lt;p&gt;Why does this matter in 2026? Because agent frameworks have matured past the "hello world" phase. The question is no longer "can I build a multi-agent system?" — it's "which framework gives me the best tradeoff between quality, speed, cost, and reliability for production workloads?" I picked a company research pipeline because it's complex enough to stress-test orchestration (three agents with dependencies) but simple enough that the results are easy to evaluate objectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality Results: Closer Than You'd Think
&lt;/h2&gt;

&lt;p&gt;Here's the part that surprised me most. Look at this radar chart:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjjh01tuifcovfysdu5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjjh01tuifcovfysdu5b.png" alt="Quality dimension profiles by framework" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every framework scores above 9.0 overall. The total spread from best to worst is just 0.56 points. Here are the full numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Completeness&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Structure&lt;/th&gt;
&lt;th&gt;Insight&lt;/th&gt;
&lt;th&gt;Readability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MS Agent&lt;/td&gt;
&lt;td&gt;9.87&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;td&gt;9.33&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;9.66&lt;/td&gt;
&lt;td&gt;9.44&lt;/td&gt;
&lt;td&gt;9.44&lt;/td&gt;
&lt;td&gt;9.89&lt;/td&gt;
&lt;td&gt;9.56&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;9.63&lt;/td&gt;
&lt;td&gt;9.44&lt;/td&gt;
&lt;td&gt;9.67&lt;/td&gt;
&lt;td&gt;9.89&lt;/td&gt;
&lt;td&gt;9.33&lt;/td&gt;
&lt;td&gt;9.89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;9.42&lt;/td&gt;
&lt;td&gt;9.11&lt;/td&gt;
&lt;td&gt;9.44&lt;/td&gt;
&lt;td&gt;9.89&lt;/td&gt;
&lt;td&gt;9.22&lt;/td&gt;
&lt;td&gt;9.78&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;9.31&lt;/td&gt;
&lt;td&gt;9.00&lt;/td&gt;
&lt;td&gt;9.11&lt;/td&gt;
&lt;td&gt;9.89&lt;/td&gt;
&lt;td&gt;9.00&lt;/td&gt;
&lt;td&gt;9.78&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MS Agent Framework sits at the top with a near-perfect 9.87. Agents SDK comes in last at 9.31. But here's the thing — 9.31 is still &lt;em&gt;excellent&lt;/em&gt;. When your worst performer is scoring above 9 out of 10, quality isn't the axis that differentiates these tools.&lt;/p&gt;

&lt;p&gt;The radar chart tells the same story visually: all five polygons overlap heavily. Structure and readability are essentially identical across the board (everyone's above 9.78). The only dimension with meaningful separation is completeness, where MS Agent's perfect 10.00 pulls away from Agents SDK's 9.00.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Differentiates Them
&lt;/h2&gt;

&lt;p&gt;If quality is a wash, what should you care about? Three things: &lt;strong&gt;speed&lt;/strong&gt;, &lt;strong&gt;token cost&lt;/strong&gt;, and &lt;strong&gt;consistency&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speed: A 6x Gap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqxk991gjo672tia6fvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqxk991gjo672tia6fvf.png" alt="Average latency by framework" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the differences get dramatic. Average end-to-end latency per run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MS Agent Framework&lt;/strong&gt;: 93s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI&lt;/strong&gt;: 246s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents SDK&lt;/strong&gt;: 448s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt;: 506s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen&lt;/strong&gt;: 572s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a &lt;strong&gt;6x gap&lt;/strong&gt; between fastest and slowest. MS Agent finishes in a minute and a half while AutoGen is still grinding away at nearly ten minutes. For a batch job researching 100 companies, that's the difference between 2.5 hours and 16 hours.&lt;/p&gt;
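&lt;p&gt;The batch arithmetic behind that claim is simple enough to verify:&lt;/p&gt;

```python
def batch_hours(seconds_per_run, companies=100):
    """Wall-clock hours to research `companies` sequentially at the measured per-run latency."""
    return seconds_per_run * companies / 3600

for name, seconds in [("MS Agent", 93), ("AutoGen", 572)]:
    print(f"{name}: {batch_hours(seconds):.1f} hours for 100 companies")
```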

&lt;p&gt;CrewAI lands in a comfortable middle ground at ~4 minutes — fast enough for interactive use, efficient enough for batch processing. LangGraph and Agents SDK cluster together in the 7.5-8.5 minute range.&lt;/p&gt;

&lt;p&gt;AutoGen's async group chat pattern, while flexible, introduces significant coordination overhead that shows up directly in wall-clock time. The agents exchange messages in a round-robin style, and each message round requires a full LLM call to decide whether to continue the conversation or hand off. That flexibility is powerful for open-ended collaboration, but for a linear pipeline like this one, it's overhead without payoff.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Cost: 3x Difference
&lt;/h3&gt;

&lt;p&gt;Not all frameworks are equally efficient with their LLM calls. Average total tokens per run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MS Agent Framework&lt;/strong&gt;: 7,006 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents SDK&lt;/strong&gt;: 8,676 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt;: 8,823 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen&lt;/strong&gt;: 10,793 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI&lt;/strong&gt;: 27,684 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CrewAI uses &lt;strong&gt;nearly 4x more tokens&lt;/strong&gt; than MS Agent Framework to produce comparable quality output. At local Ollama pricing, this is free. At GPT-4o pricing ($2.50/1M input, $10/1M output), that's the difference between ~$0.06 and ~$0.22 per run. Scale to thousands of runs per day and the gap matters.&lt;/p&gt;
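&lt;p&gt;Since the input/output token split isn't broken out per run, the honest way to translate totals into cloud dollars is to bracket the cost between the all-input and all-output extremes (a sketch, using the GPT-4o prices quoted above):&lt;/p&gt;

```python
def cost_bounds_usd(total_tokens, input_price_per_m=2.50, output_price_per_m=10.00):
    """Bracket per-run cost when the input/output token split is unknown."""
    low = total_tokens * input_price_per_m / 1_000_000    # everything billed as input
    high = total_tokens * output_price_per_m / 1_000_000  # everything billed as output
    return low, high

for name, tokens in [("MS Agent", 7_006), ("CrewAI", 27_684)]:
    low, high = cost_bounds_usd(tokens)
    print(f"{name}: ${low:.2f} to ${high:.2f} per run")
```

&lt;p&gt;The article's ~$0.06 and ~$0.22 figures fall inside these brackets, as they should.&lt;/p&gt;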

&lt;p&gt;Why such a spread? CrewAI's role-playing approach includes verbose system prompts and inter-agent communication that inflates token counts. MS Agent Framework, Agents SDK, and LangGraph take a leaner approach with minimal framing overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consistency: The Hidden Variable
&lt;/h3&gt;

&lt;p&gt;Average scores hide variance. Here's what the consistency numbers reveal:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Std Dev&lt;/th&gt;
&lt;th&gt;Min Score&lt;/th&gt;
&lt;th&gt;Max Score&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MS Agent&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;9.8&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;9.2&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;0.36&lt;/td&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;9.6&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MS Agent is remarkably tight — std dev of 0.10, range of just 0.2 points. Every single run scored between 9.8 and 10.0. You know exactly what you're going to get.&lt;/p&gt;

&lt;p&gt;AutoGen is the opposite story. It can hit a perfect 10.0, but it can also drop to 8.6 — a 1.4-point range. A standard deviation of 0.45 means roughly one run in three lands more than 0.45 points from the mean (the usual one-sigma rule, assuming roughly normal scores). If you're building a production pipeline where predictability matters (and it always does), this variance is a real concern. You'd need to build retry logic or output validation around it, which adds complexity.&lt;/p&gt;

&lt;p&gt;What drives the inconsistency? I suspect it's the group chat architecture. When agents negotiate via messages, the conversation can take different paths depending on subtle phrasing differences in early turns, even with &lt;code&gt;temperature=0&lt;/code&gt;. Sequential pipelines like MS Agent's don't have this branching problem — each agent gets a fixed input and produces a fixed output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Statistical Reality Check
&lt;/h2&gt;

&lt;p&gt;Eyeballing averages is one thing. Let's see what the statistics actually support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kruskal-Wallis test on quality scores: p = 0.005.&lt;/strong&gt; Statistically significant — differences between frameworks do exist. But that's the omnibus test. It tells you &lt;em&gt;something&lt;/em&gt; differs, not &lt;em&gt;what&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Pairwise Mann-Whitney U tests with Bonferroni correction (10 comparisons, corrected alpha = 0.005) tell a more nuanced story:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Only one pair shows a statistically significant quality difference: Agents SDK vs MS Agent (p = 0.0003, effect size r = 0.86 — large).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every other pairwise comparison — LangGraph vs CrewAI, AutoGen vs Agents SDK, CrewAI vs MS Agent, all of them — fails to reach significance after correction. The apparent quality differences between most frameworks are &lt;strong&gt;indistinguishable from noise&lt;/strong&gt; at this sample size.&lt;/p&gt;
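&lt;p&gt;The correction itself is one line of arithmetic: with 10 pairwise tests, each one must clear a much stricter bar to keep the family-wise error rate at 0.05:&lt;/p&gt;

```python
from operator import lt  # lt(a, b) is the functional form of "a less than b"

def bonferroni_significant(p, family_alpha=0.05, comparisons=10):
    """True when p survives a Bonferroni correction across all pairwise tests."""
    corrected = family_alpha / comparisons  # 0.05 / 10 = 0.005
    return lt(p, corrected)

print(bonferroni_significant(0.0003))  # the Agents SDK vs MS Agent p-value
print(bonferroni_significant(0.02))    # a hypothetical pair that fails after correction
```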

&lt;p&gt;Now compare that to latency. &lt;strong&gt;Kruskal-Wallis test on latency: p = 0.000001.&lt;/strong&gt; The speed differences are extremely real and not going away with more data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Translation: don't pick your framework based on quality. Pick based on speed, cost, and consistency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbayk8n6jwv0087btjmmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbayk8n6jwv0087btjmmc.png" alt="Quality vs latency scatter plot" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scatter plot drives this home. Quality clusters tightly between 8.6 and 10.0 regardless of framework, while latency sprawls from 80 seconds to over 700. The vertical axis is noise. The horizontal axis is signal.&lt;/p&gt;

&lt;p&gt;This is the single most important finding from this benchmark: &lt;strong&gt;all five frameworks produce excellent output when given the same model and prompts.&lt;/strong&gt; The framework is the orchestration layer, not the intelligence layer. The model does the heavy lifting. The framework's job is to get out of the way efficiently — and that's where the real differences emerge.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ranking
&lt;/h2&gt;

&lt;p&gt;Putting it all together:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Latency (s)&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Consistency (std)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MS Agent&lt;/td&gt;
&lt;td&gt;9.87&lt;/td&gt;
&lt;td&gt;93&lt;/td&gt;
&lt;td&gt;7,006&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;9.66&lt;/td&gt;
&lt;td&gt;246&lt;/td&gt;
&lt;td&gt;27,684&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;9.63&lt;/td&gt;
&lt;td&gt;572&lt;/td&gt;
&lt;td&gt;10,793&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;9.42&lt;/td&gt;
&lt;td&gt;506&lt;/td&gt;
&lt;td&gt;8,823&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents SDK&lt;/td&gt;
&lt;td&gt;9.31&lt;/td&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;8,676&lt;/td&gt;
&lt;td&gt;0.36&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MS Agent dominates on every metric — quality, speed, token efficiency, and consistency — but it's a 1.0.0 beta release with a smaller ecosystem. If you're comfortable betting on a newer framework, it's compelling. If you need production maturity and community support today, that's a different calculation.&lt;/p&gt;

&lt;p&gt;CrewAI is the pragmatic middle ground: fast enough, high quality, reasonable consistency, and the most intuitive API of the bunch. The token cost is the tax you pay for its role-playing architecture. For most teams, that tradeoff is worth it.&lt;/p&gt;

&lt;p&gt;AutoGen produces great output but slowly and unpredictably. Its group chat pattern shines for open-ended agent collaboration — just not for structured pipelines.&lt;/p&gt;

&lt;p&gt;LangGraph and Agents SDK are solid workhorses with lean token usage. LangGraph gives you the most control over execution flow (it's a state machine, after all), while Agents SDK keeps things simple with minimal boilerplate. Both pay for that leanness with longer execution times.&lt;/p&gt;

&lt;p&gt;There's no single winner. There's a set of tradeoffs, and the right choice depends on what you're optimizing for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rest of the Series
&lt;/h2&gt;

&lt;p&gt;This article covered the &lt;em&gt;so what&lt;/em&gt;. The first two in this series cover the &lt;em&gt;what&lt;/em&gt; and the &lt;em&gt;how&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: What Actually Matters&lt;/strong&gt; — The full benchmark results across quality, latency, token usage, and consistency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: How I Built a Fair Benchmark&lt;/strong&gt; — The methodology behind controlled comparisons: same prompts, same model, LLM-as-judge evaluation, and the dependency hell of installing five frameworks that don't want to coexist.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with Python 3.12, uv, Ollama (Qwen 3 14B), and too many hours debugging dependency conflicts between frameworks that each want their own version of the OpenAI SDK.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All code, data, and analysis notebooks are open source: &lt;a href="https://github.com/LukaszGrochal/agent-framework-benchmark" rel="noopener noreferrer"&gt;agent-framework-benchmark&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built a Python CLI Tool for RAG Over Any Document Folder</title>
      <dc:creator>LukaszGrochal</dc:creator>
      <pubDate>Mon, 09 Feb 2026 07:01:00 +0000</pubDate>
      <link>https://dev.to/lukaszgrochal/i-built-a-python-cli-tool-for-rag-over-any-document-folder-55ic</link>
      <guid>https://dev.to/lukaszgrochal/i-built-a-python-cli-tool-for-rag-over-any-document-folder-55ic</guid>
      <description>&lt;p&gt;&lt;em&gt;A zero-config command-line tool for retrieval-augmented generation — index a folder, ask questions, get cited answers. Works locally with Ollama or with cloud APIs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every time I wanted to ask questions about a set of documents, I'd write the same 100 lines of boilerplate: load docs, chunk them, embed them, store in a vector DB, retrieve, generate. I got tired of it. So I built a CLI tool that does it in two commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;RAG prototyping has too much ceremony. You have a folder of PDFs, Markdown files, maybe some text notes. You want to ask questions about them. Simple enough in theory.&lt;/p&gt;

&lt;p&gt;In practice, you're wiring up document loaders, picking a chunking strategy, initializing an embedding provider, setting up a vector store, writing retrieval logic, and then finally getting to the part you actually care about: generating an answer. And you do this every single time you start a new project or want to test a new document set.&lt;/p&gt;

&lt;p&gt;Existing solutions sit at the extremes. Full frameworks like LangChain and LlamaIndex are powerful, but they're heavy. You pull in a framework with dozens of abstractions just to ask a question about a folder. On the other end, tutorial notebooks are disposable. They work once, for one demo, and you throw them away.&lt;/p&gt;

&lt;p&gt;I wanted something in the middle. A CLI that's zero-config for the common case, configurable when you need it, and built from pieces I can reuse in other projects. No framework dependencies. No notebook rot. Just a tool that does one thing well.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;rag-cli-tool gives you two commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rag-cli index ./my-docs/
rag-cli ask &lt;span class="s2"&gt;"What is the refund policy?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Point it at a folder, it indexes everything. Ask a question, it answers from your documents. Supported formats include PDF, Markdown, plain text, and DOCX.&lt;/p&gt;

&lt;p&gt;Under the hood, the pipeline is straightforward. &lt;code&gt;index&lt;/code&gt; loads documents from the directory, splits them into overlapping chunks using a recursive text splitter, generates embeddings, and stores everything in a local ChromaDB instance. &lt;code&gt;ask&lt;/code&gt; embeds your question, retrieves the most similar chunks, and generates an answer using only the retrieved context -- strict RAG, no hallucination from external knowledge.&lt;/p&gt;

&lt;p&gt;The tech stack is deliberately boring. ChromaDB for the vector store because it runs locally with zero setup -- no Docker, no server, just a directory. Typer for the CLI framework because it gives you type-checked arguments and auto-generated help for free. Rich for terminal output because progress bars and formatted answers make the tool pleasant to use. Pydantic Settings for configuration because environment variables and &lt;code&gt;.env&lt;/code&gt; files are the right answer for CLI tools.&lt;/p&gt;

&lt;p&gt;You can run it fully local with Ollama (no API keys needed) or use cloud providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Local -- no API keys&lt;/span&gt;
&lt;span class="nv"&gt;RAG_CLI_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama:llama3.2 &lt;span class="nv"&gt;RAG_CLI_EMBEDDING_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama:nomic-embed-text &lt;span class="se"&gt;\&lt;/span&gt;
  rag-cli ask &lt;span class="s2"&gt;"What are the payment terms?"&lt;/span&gt;

&lt;span class="c"&gt;# Cloud -- Anthropic + OpenAI&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-...
rag-cli ask &lt;span class="s2"&gt;"What are the payment terms?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Architecture -- Built for Reuse
&lt;/h2&gt;

&lt;p&gt;This is where rag-cli-tool diverges from a typical weekend project. The repository contains three independent packages, not one monolith:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
├── rag_cli/       # CLI interface (Typer + Rich)
├── llm_core/      # LLM abstraction layer (providers, config, retry)
└── rag_core/      # RAG pipeline (loaders, chunking, embeddings, retrieval)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;llm_core&lt;/code&gt; handles everything related to calling language models. It defines a provider interface, implements Anthropic and Ollama adapters, and includes retry logic with exponential backoff. It knows nothing about RAG, documents, or CLI output.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rag_core&lt;/code&gt; handles the RAG pipeline: loading documents, chunking text, generating embeddings, storing vectors, and retrieving results. It depends on &lt;code&gt;llm_core&lt;/code&gt; for embedding providers but has no opinion about how you present results to users.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rag_cli&lt;/code&gt; is the thin layer that wires everything together. It handles argument parsing, progress bars, and formatted output. The actual logic is a few lines of glue code.&lt;/p&gt;

&lt;p&gt;The reason for this separation is practical, not academic. I build AI projects regularly. The next one might be a web app, a Slack bot, or an API service. When that happens, I don't want to extract RAG logic from a CLI tool. I want to import &lt;code&gt;rag_core&lt;/code&gt; and start building. Same for &lt;code&gt;llm_core&lt;/code&gt; -- provider switching, retry logic, and configuration management are problems I solve once.&lt;/p&gt;

&lt;p&gt;Every major component has an abstract base class. &lt;code&gt;BaseLLMProvider&lt;/code&gt;, &lt;code&gt;BaseEmbedder&lt;/code&gt;, &lt;code&gt;BaseChunker&lt;/code&gt;, &lt;code&gt;BaseRetriever&lt;/code&gt;, &lt;code&gt;BaseVectorStore&lt;/code&gt;. Today I have one implementation of each. Tomorrow I can add a GraphRAG retriever or a Pinecone vector store without touching existing code. The abstractions aren't speculative -- they're the minimum interface each component needs to be swappable.&lt;/p&gt;
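&lt;p&gt;A minimal sketch of what that boundary can look like. &lt;code&gt;BaseChunker&lt;/code&gt; is from the list above; the method name and the concrete &lt;code&gt;FixedSizeChunker&lt;/code&gt; are my illustration, not the repo's actual signatures:&lt;br&gt;
&lt;/p&gt;

```python
# Illustrative sketch: the minimum interface that makes a component swappable.
# BaseChunker appears in the article; everything else here is assumed.
from abc import ABC, abstractmethod


class BaseChunker(ABC):
    """Split raw text into retrieval-sized pieces."""

    @abstractmethod
    def chunk(self, text):
        """Return a list of chunk strings."""


class FixedSizeChunker(BaseChunker):
    """One concrete implementation; a semantic chunker could slot in later
    without touching any code that depends on BaseChunker."""

    def __init__(self, size=500):
        self.size = size

    def chunk(self, text):
        step = self.size
        return [text[i:i + step] for i in range(0, len(text), step)]


print(FixedSizeChunker(size=4).chunk("abcdefghij"))  # ['abcd', 'efgh', 'ij']
```

The point is the size of the interface: one abstract method is all a caller needs, so every implementation stays trivially swappable.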

&lt;p&gt;The project has full test coverage across all three packages -- 37 tests covering providers, configuration, chunking, embeddings, retrieval, and vector store operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Decisions
&lt;/h2&gt;

&lt;p&gt;Four decisions shaped the project, each with a specific reason:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChromaDB over FAISS or Pinecone.&lt;/strong&gt; FAISS requires numpy gymnastics for persistence and doesn't store metadata natively. Pinecone requires an account and network access. ChromaDB gives you a local, persistent vector store with metadata filtering in one line: &lt;code&gt;ChromaStore(persist_dir=path)&lt;/code&gt;. For a CLI tool that should work offline, this was the only real choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typer over Click.&lt;/strong&gt; Click is battle-tested, but Typer gives you type annotations as your argument definitions. No decorators for each option, no callback functions. You write a normal Python function with type hints, and Typer generates the CLI. The help text writes itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pydantic Settings for configuration.&lt;/strong&gt; CLI tools need to read config from environment variables and &lt;code&gt;.env&lt;/code&gt; files. Pydantic Settings does both, with validation, default values, and type coercion. One class definition replaces a dozen &lt;code&gt;os.getenv()&lt;/code&gt; calls with fallback logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider routing via model string prefix.&lt;/strong&gt; Instead of separate config fields for provider selection, the model string does double duty: &lt;code&gt;claude-3-5-sonnet-latest&lt;/code&gt; routes to Anthropic, &lt;code&gt;ollama:llama3.2&lt;/code&gt; routes to Ollama. One config field, zero ambiguity. This pattern scales to any number of providers without config proliferation.&lt;/p&gt;
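&lt;p&gt;The routing itself is only a few lines. A hedged sketch of how such a prefix scheme can work (&lt;code&gt;resolve_provider&lt;/code&gt; is a hypothetical helper, not the tool's actual function):&lt;br&gt;
&lt;/p&gt;

```python
# Sketch of prefix-based provider routing; resolve_provider is a
# hypothetical name, not copied from rag-cli-tool.
def resolve_provider(model):
    """Map a model string to a (provider, model_name) pair."""
    if model.startswith("ollama:"):
        # 'ollama:llama3.2' routes to the local Ollama server
        return "ollama", model.split(":", 1)[1]
    # Anything without a recognized prefix goes to the hosted provider
    return "anthropic", model


print(resolve_provider("ollama:llama3.2"))           # ('ollama', 'llama3.2')
print(resolve_provider("claude-3-5-sonnet-latest"))  # ('anthropic', 'claude-3-5-sonnet-latest')
```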

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The 80/20 of RAG tooling surprised me. I expected the infrastructure -- vector stores, embedding APIs, retrieval logic -- to consume most of the development time. Instead, chunking decisions dominated. How big should chunks be? How much overlap? Which separators produce coherent boundaries? The pipeline code was straightforward; the tuning was where the real work happened.&lt;/p&gt;
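&lt;p&gt;Those knobs are easy to see in a toy splitter. This is not the recursive splitter the tool uses, just the size/overlap idea in isolation:&lt;br&gt;
&lt;/p&gt;

```python
# Toy illustration of the chunking knobs: size and overlap. More overlap
# means more context continuity across chunk boundaries, but also more
# near-duplicate text embedded and retrieved per query.
def split_with_overlap(text, size, overlap):
    """Return character chunks of length `size`, each sharing `overlap`
    characters with the previous chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


print(split_with_overlap("abcdefghij", size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```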

&lt;p&gt;CLI-first development forces good API design. When your first consumer is a command-line interface, you can't hide behind web framework magic. Every input is explicit, every output is visible. This discipline produced cleaner interfaces in &lt;code&gt;llm_core&lt;/code&gt; and &lt;code&gt;rag_core&lt;/code&gt; than I would have gotten starting with a web app.&lt;/p&gt;

&lt;p&gt;I intentionally shipped without several features: chat mode with conversation history, benchmarking against different chunking strategies, a web UI, and support for more vector stores. These are all reasonable features. They're also scope creep for a v0.1. The foundation is solid, the abstractions are in place, and each of those features is an afternoon of work because the architecture supports extension.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The best developer tools solve your own problems first. rag-cli-tool started as "I'm tired of writing this boilerplate" and turned into reusable building blocks for my entire AI project portfolio. If you work with documents and want a fast way to prototype RAG pipelines, give it a try.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install from PyPI&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;rag-cli-tool

&lt;span class="c"&gt;# Or from source&lt;/span&gt;
git clone https://github.com/LukaszGrochal/rag-cli-tool
&lt;span class="nb"&gt;cd &lt;/span&gt;rag-cli-tool
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# With Ollama (free, local)&lt;/span&gt;
ollama pull llama3.2 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ollama pull nomic-embed-text
rag-cli index ./sample-docs/
rag-cli ask &lt;span class="s2"&gt;"What is the refund policy?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PyPI: &lt;a href="https://pypi.org/project/rag-cli-tool/" rel="noopener noreferrer"&gt;https://pypi.org/project/rag-cli-tool/&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/LukaszGrochal/rag-cli-tool" rel="noopener noreferrer"&gt;https://github.com/LukaszGrochal/rag-cli-tool&lt;/a&gt;&lt;/p&gt;


</description>
      <category>rag</category>
      <category>cli</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>How a Missing Trace Led Me to Build a Local Observability Stack</title>
      <dc:creator>LukaszGrochal</dc:creator>
      <pubDate>Tue, 03 Feb 2026 14:24:11 +0000</pubDate>
      <link>https://dev.to/lukaszgrochal/how-a-missing-trace-led-me-to-build-a-local-observability-stack-2b82</link>
      <guid>https://dev.to/lukaszgrochal/how-a-missing-trace-led-me-to-build-a-local-observability-stack-2b82</guid>
      <description>&lt;p&gt;Last year, our team spent three days debugging why traces from a critical payment service weren't appearing in DataDog. This service processed ~15,000 orders daily—roughly $200K in transactions. The service was running, logs showed successful transactions, but the APM dashboard was empty. No traces. No spans. Nothing.&lt;/p&gt;

&lt;p&gt;For three days, we couldn't answer basic questions: Was the payment gateway slow? Were retries happening? Where was latency hiding? Without traces, we were debugging blind—adding print statements to production code, tailing logs, guessing at latency sources.&lt;/p&gt;

&lt;p&gt;The breakthrough came when someone asked: "Can we just run the same setup locally and see if traces actually leave the application?"&lt;/p&gt;

&lt;p&gt;We couldn't. DataDog requires cloud connectivity. The local agent still needs an API key and phones home. There was no way to intercept and visualize traces without a DataDog account—and our staging key had rate limits that made local testing impractical.&lt;/p&gt;

&lt;p&gt;So I built a stack that accepts ddtrace telemetry locally and routes it to open-source backends. Within an hour of running it, we found the bug. A config change from two sprints back had introduced this filter rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The bug - intended to filter health checks, matched EVERYTHING&lt;/span&gt;
&lt;span class="na"&gt;filter/health&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;span&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.target"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=~&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/.*"'&lt;/span&gt;  &lt;span class="c1"&gt;# Regex matched all paths!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of filtering only &lt;code&gt;/health&lt;/code&gt; endpoints, the regex &lt;code&gt;/.*&lt;/code&gt; matched every single span. A one-character fix—changing &lt;code&gt;=~&lt;/code&gt; to &lt;code&gt;==&lt;/code&gt; and using exact paths—and traces appeared in production within minutes.&lt;/p&gt;
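&lt;p&gt;For reference, the corrected rule looked roughly like this (exact match instead of a regex, so only genuine health-check spans are dropped):&lt;br&gt;
&lt;/p&gt;

```yaml
# The fix: exact match on the real health-check path only
filter/health:
  traces:
    span:
      - 'attributes["http.target"] == "/health"'
```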

&lt;p&gt;&lt;strong&gt;Why did it take three days to find a one-character bug?&lt;/strong&gt; Because we had no visibility into what the collector was actually doing. The config looked reasonable at a glance. The collector reported healthy. Logs showed "traces exported successfully"—but those were &lt;em&gt;other&lt;/em&gt; services' traces passing through. Without a way to isolate our service's telemetry and watch it flow through the pipeline, we were guessing. The local stack gave us that visibility in minutes.&lt;/p&gt;

&lt;p&gt;This repository is a cleaned-up, documented version of that debugging tool. It's now used across three teams: the original payments team, our logistics service team (who had a similar "missing traces" panic), and the platform team who adopted it for testing collector configs before production rollouts.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Stack Does
&lt;/h2&gt;

&lt;p&gt;Point your ddtrace-instrumented application at &lt;code&gt;localhost:8126&lt;/code&gt;. The OpenTelemetry Collector receives DataDog-format traces, converts them to OTLP, and exports to Grafana Tempo. Your application thinks it's talking to a DataDog agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs982bep2ptafaq3du5fn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs982bep2ptafaq3du5fn.png" alt="Architecture diagram" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No code changes required.&lt;/strong&gt; Set &lt;code&gt;DD_AGENT_HOST=localhost&lt;/code&gt; and your existing instrumentation works.&lt;/p&gt;
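&lt;p&gt;Conceptually, the collector config is a receive-convert-export pipeline. A trimmed sketch of that shape (the component names and ports here are my assumptions about a typical setup, not copied from the repo; the DataDog-format receiver lives in the collector-contrib distribution):&lt;br&gt;
&lt;/p&gt;

```yaml
# Sketch of the pipeline shape, not the repo's full config.
receivers:
  datadog:                     # speaks the DataDog agent protocol
    endpoint: "0.0.0.0:8126"   # the port ddtrace expects an agent on

processors:
  batch: {}

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true           # fine for a local-only stack

service:
  pipelines:
    traces:
      receivers: [datadog]
      processors: [batch]
      exporters: [otlp/tempo]
```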




&lt;h2&gt;
  
  
  When To Use This (And When Not To)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This stack is valuable when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to verify ddtrace instrumentation works before deploying&lt;/li&gt;
&lt;li&gt;You're debugging why traces aren't appearing in production DataDog&lt;/li&gt;
&lt;li&gt;You want local trace visualization without DataDog licensing costs&lt;/li&gt;
&lt;li&gt;You're testing collector configurations (sampling, filtering, batching) before production rollout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use something else when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're starting a new project—use OpenTelemetry native instrumentation for better portability&lt;/li&gt;
&lt;li&gt;You need DataDog-specific features (APM service maps, profiling, Real User Monitoring)&lt;/li&gt;
&lt;li&gt;You're processing sustained high throughput (see Performance section below)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alternatives I evaluated:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jaeger All-in-One&lt;/strong&gt;: Simpler setup, but no native log correlation. You'd need a separate logging stack and manual trace ID lookup. For debugging, clicking from log → trace is essential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DataDog Agent locally&lt;/strong&gt;: Requires API key, sends data to cloud, rate limits apply. Defeats the purpose of local-only debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Demo&lt;/strong&gt;: Excellent for learning OTLP from scratch, but doesn't help debug &lt;em&gt;existing&lt;/em&gt; ddtrace instrumentation—which was our whole problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Tempo over Jaeger for the backend?&lt;/strong&gt; Tempo integrates natively with Grafana's Explore view, enabling the bidirectional log↔trace correlation that made debugging fast. Jaeger would require a separate UI and manual correlation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/LukaszGrochal/demo-repo-otel-stack
&lt;span class="nb"&gt;cd &lt;/span&gt;local-otel-stack
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Verify stack health&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:3200/ready   &lt;span class="c"&gt;# Tempo&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:3100/ready   &lt;span class="c"&gt;# Loki&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:13133/health &lt;span class="c"&gt;# OTel Collector&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the example application (requires &lt;a href="https://github.com/astral-sh/uv" rel="noopener noreferrer"&gt;uv&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;examples/python
uv &lt;span class="nb"&gt;sync
&lt;/span&gt;&lt;span class="nv"&gt;DD_AGENT_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localhost &lt;span class="nv"&gt;DD_TRACE_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;uv run uvicorn app:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate a trace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"user_id": 1, "product": "widget", "amount": 29.99}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open Grafana at &lt;code&gt;http://localhost:3000&lt;/code&gt; → Explore → Tempo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb0onjiyza48vfyaq1to.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb0onjiyza48vfyaq1to.png" alt="Trace visualization in Grafana Tempo" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traces not appearing?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check collector is receiving data&lt;/span&gt;
docker-compose logs &lt;span class="nt"&gt;-f&lt;/span&gt; otel-collector | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"trace"&lt;/span&gt;

&lt;span class="c"&gt;# Common issues:&lt;/span&gt;
&lt;span class="c"&gt;# - Port 8126 already bound (existing DataDog agent?)&lt;/span&gt;
&lt;span class="c"&gt;# - DD_TRACE_ENABLED not set to "true"&lt;/span&gt;
&lt;span class="c"&gt;# - Application not waiting for collector startup&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pattern 1: Subprocess Trace Propagation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem We Hit
&lt;/h3&gt;

&lt;p&gt;Once the filter bug was fixed, we used the local stack to investigate another issue: the payment service spawned worker processes to generate invoice PDFs after each order. In production DataDog, we could see the HTTP request span, but the PDF generation time was invisible—traces stopped at the subprocess boundary.&lt;/p&gt;

&lt;p&gt;This made debugging timeouts nearly impossible. When customers complained about slow order confirmations, we couldn't tell if it was the payment gateway or the invoice generation. The worker was a black box.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why ddtrace Doesn't Handle This
&lt;/h3&gt;

&lt;p&gt;ddtrace automatically propagates trace context for HTTP requests, gRPC calls, Celery tasks, and other instrumented protocols. But &lt;code&gt;subprocess.run()&lt;/code&gt; isn't a protocol—it's an OS primitive. ddtrace can't know whether you want context passed via environment variables, command-line arguments, stdin, or files.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;Inject trace context into environment variables before spawning. The key insight is just 10 lines—the rest is error handling. From &lt;code&gt;examples/python/app.py:272-340&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;spawn_traced_subprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# THE KEY PATTERN: inject trace context into subprocess environment
&lt;/span&gt;    &lt;span class="n"&gt;current_span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_span&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DD_TRACE_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DD_PARENT_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subprocess.spawn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subprocess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subprocess.command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subprocess.exit_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full implementation includes timeout handling, error tagging, and logging—see the repository for the complete 70-line version with production error handling.&lt;/p&gt;

&lt;p&gt;The worker process picks up the context at startup. &lt;strong&gt;Key insight&lt;/strong&gt;: &lt;code&gt;DD_TRACE_ID&lt;/code&gt; and &lt;code&gt;DD_PARENT_ID&lt;/code&gt; are plain environment variables the parent injects before spawning; the worker reads them once, activates them as its parent context, and from then on every span it creates links into the parent trace automatically—no per-span linking required.&lt;/p&gt;
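&lt;p&gt;The parent side is symmetric: copy the environment and add the two variables before spawning. A minimal sketch—the literal IDs stand in for values you would take from &lt;code&gt;tracer.current_span()&lt;/code&gt;, and the echo child exists only to prove the context crosses the process boundary:&lt;/p&gt;

```python
import os
import subprocess
import sys

def spawn_with_trace_context(command, trace_id, parent_span_id):
    """Inject trace context into the child's environment before spawning."""
    env = os.environ.copy()
    env["DD_TRACE_ID"] = str(trace_id)
    env["DD_PARENT_ID"] = str(parent_span_id)
    return subprocess.run(command, env=env, capture_output=True, text=True)

# The child echoes the variables back, showing they crossed the boundary.
result = spawn_with_trace_context(
    [sys.executable, "-c",
     "import os; print(os.environ['DD_TRACE_ID'], os.environ['DD_PARENT_ID'])"],
    trace_id=4886718345,
    parent_span_id=42,
)
print(result.stdout.strip())  # → 4886718345 42
```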

&lt;p&gt;From &lt;code&gt;examples/python/worker.py:89-105&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_parent_trace_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Read trace context injected by parent process.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DD_TRACE_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DD_PARENT_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worker creates nested spans that automatically link to the parent trace. From &lt;code&gt;examples/python/worker.py:108-170&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;simulate_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;worker.process_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file-worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file.path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;worker.pid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getpid&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file.read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;read_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# ... file reading with span tags
&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk.process&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;chunk_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;chunk_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk.index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# ... chunk processing
&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file.write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;write_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# ... file writing with span tags
&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lines_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;processed_lines&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See &lt;code&gt;worker.py&lt;/code&gt; for the full implementation with error simulation and detailed span tagging.&lt;/p&gt;

&lt;p&gt;Test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/process-file &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"file_path": "test.txt"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgyra7wlck93t4zguxbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgyra7wlck93t4zguxbu.png" alt="Subprocess trace propagation" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trace shows the complete chain: HTTP request → subprocess.spawn → worker.process_file → file.read → chunk.process (×N) → file.write. All connected under one trace ID.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitation
&lt;/h3&gt;

&lt;p&gt;This only works for synchronous subprocess spawning where you control the invocation. For Celery, RQ, or other task queues, use their built-in trace propagation instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pattern 2: Circuit Breaker Observability
&lt;/h2&gt;

&lt;p&gt;We don't need another circuit breaker implementation—libraries like &lt;code&gt;pybreaker&lt;/code&gt; and &lt;code&gt;tenacity&lt;/code&gt; handle that. What matters for observability is &lt;em&gt;tagging spans with circuit state&lt;/em&gt; so you can query failures during incidents.&lt;/p&gt;

&lt;p&gt;From &lt;code&gt;examples/python/app.py:609-618&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check inventory with circuit breaker
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventory.check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventory-service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;circuit_breaker.state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;external_service_circuit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;external_service_circuit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;can_execute&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error.type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;circuit_open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;PROM_ORDERS_FAILED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;circuit_open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inventory service circuit breaker open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During an incident, query Tempo for &lt;code&gt;circuit_breaker.state=OPEN&lt;/code&gt; to see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When exactly the circuit opened&lt;/li&gt;
&lt;li&gt;What failure pattern preceded it&lt;/li&gt;
&lt;li&gt;Which downstream service caused the cascade&lt;/li&gt;
&lt;/ul&gt;
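&lt;p&gt;With the tag in place, the Tempo query should be along these lines in TraceQL (attribute name matching the &lt;code&gt;set_tag&lt;/code&gt; call above, assuming the state serializes as the string &lt;code&gt;OPEN&lt;/code&gt;):&lt;/p&gt;

```
{ span.circuit_breaker.state = "OPEN" }
```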




&lt;h2&gt;
  
  
  Pattern 3: Log-Trace Correlation
&lt;/h2&gt;

&lt;p&gt;Click a log line in Loki, jump directly to the trace in Tempo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inject Trace IDs Into Logs
&lt;/h3&gt;

&lt;p&gt;From &lt;code&gt;examples/python/app.py:84-109&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TraceIdFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Injects trace context into log records for correlation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Get current span from ddtrace
&lt;/span&gt;        &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_span&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt;
            &lt;span class="c1"&gt;# Convert to hex format for Tempo compatibility
&lt;/span&gt;            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id_hex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace_id_hex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;


&lt;span class="c1"&gt;# Set up logging with trace correlation
# Use hex format for trace_id to match Tempo's format
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s %(levelname)s [trace_id=%(trace_id_hex)s span_id=%(span_id)s] %(name)s: %(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TraceIdFilter&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
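&lt;p&gt;The filter is easy to exercise without ddtrace. A minimal sketch with a stub standing in for &lt;code&gt;tracer.current_span()&lt;/code&gt;—the &lt;code&gt;FakeSpan&lt;/code&gt; IDs are invented for illustration:&lt;/p&gt;

```python
import io
import logging

class FakeSpan:
    """Stand-in for a ddtrace span; real IDs come from tracer.current_span()."""
    trace_id = 4886718345  # == 0x123456789
    span_id = 42

class StubTracer:
    def current_span(self):
        return FakeSpan()

tracer = StubTracer()  # in app.py this is ddtrace's tracer

class TraceIdFilter(logging.Filter):
    """Same logic as the app.py filter: copy trace context onto each record."""
    def filter(self, record):
        span = tracer.current_span()
        if span:
            record.trace_id = span.trace_id
            record.span_id = span.span_id
            record.trace_id_hex = format(span.trace_id, 'x')  # hex for Tempo
        else:
            record.trace_id = record.span_id = 0
            record.trace_id_hex = '0'
        return True

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    '%(levelname)s [trace_id=%(trace_id_hex)s span_id=%(span_id)s] %(message)s'))
logger = logging.getLogger('demo')
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.propagate = False
logger.warning('order failed')

print(stream.getvalue().strip())
# → WARNING [trace_id=123456789 span_id=42] order failed
```

The hex conversion is the part that matters: Tempo displays trace IDs in hex, so a decimal ID in the log line would never match the derived-field regex.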



&lt;h3&gt;
  
  
  Configure Grafana to Link Them
&lt;/h3&gt;

&lt;p&gt;The Loki data source includes derived fields that extract trace IDs and create clickable links:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;derivedFields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo&lt;/span&gt;
    &lt;span class="na"&gt;matcherRegex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trace_id=([a-fA-F0-9]+)'&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TraceID&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$${__value.raw}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr1fq7nxor7ycddg5d9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr1fq7nxor7ycddg5d9h.png" alt="Loki log with trace ID" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Correlation works bidirectionally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loki → Tempo&lt;/strong&gt;: Click trace ID in any log entry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tempo → Loki&lt;/strong&gt;: Click "Logs for this span" in trace view&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y157d5n12bijlkzh2sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y157d5n12bijlkzh2sa.png" alt="Log-trace correlation" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Collector Pipeline
&lt;/h2&gt;

&lt;p&gt;This is where the debugging power comes from. From &lt;code&gt;config/otel-collector.yaml:146-160&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;extensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;health_check&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;zpages&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Main traces pipeline - processes all incoming traces&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;memory_limiter&lt;/span&gt;      &lt;span class="c1"&gt;# First: prevent OOM&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;filter/health&lt;/span&gt;       &lt;span class="c1"&gt;# Remove health check noise&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;attributes/sanitize&lt;/span&gt; &lt;span class="c1"&gt;# Remove sensitive data&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;probabilistic_sampler&lt;/span&gt; &lt;span class="c1"&gt;# Sample if needed&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;batch&lt;/span&gt;               &lt;span class="c1"&gt;# Batch for efficiency&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;resource&lt;/span&gt;            &lt;span class="c1"&gt;# Add metadata&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why each processor matters:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Processor&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;What breaks without it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;memory_limiter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevents OOM on traffic spikes&lt;/td&gt;
&lt;td&gt;Collector crashes, loses all buffered traces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;filter/health&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removes health check noise&lt;/td&gt;
&lt;td&gt;Storage fills with useless spans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;attributes/sanitize&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Strips sensitive headers&lt;/td&gt;
&lt;td&gt;Credentials leaked to trace storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;batch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Groups spans for efficient export&lt;/td&gt;
&lt;td&gt;High CPU, slow exports, Tempo overload&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the filter configuration behind our original production issue. From &lt;code&gt;config/otel-collector.yaml:82-91&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;filter/health&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;error_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ignore&lt;/span&gt;
  &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;span&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.target"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/health"'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.target"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/ready"'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.target"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/metrics"'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.target"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/"'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.route"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/health"'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attributes["http.route"]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/ready"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our production bug was a wildcard in one of these expressions that matched everything. Having a local stack to test filter rules before deploying them would have caught this in minutes, not days.&lt;/p&gt;
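&lt;p&gt;As a hypothetical illustration of how small the mistake can be (not the literal rule from our incident), an OTTL regex match one character too broad drops every span:&lt;/p&gt;

```yaml
filter/health:
  traces:
    span:
      - 'IsMatch(attributes["http.target"], "/health.*")'  # intended: health endpoints only
      # - 'IsMatch(attributes["http.target"], ".*")'       # over-broad: matches EVERY span
```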




&lt;h2&gt;
  
  
  Performance Characteristics
&lt;/h2&gt;

&lt;p&gt;Measured on M1 MacBook Pro, 16GB RAM, Docker Desktop 4.25:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Methodology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Idle memory (full stack)&lt;/td&gt;
&lt;td&gt;1.47 GB&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;docker stats&lt;/code&gt; after 5min idle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Collector memory&lt;/td&gt;
&lt;td&gt;89 MB&lt;/td&gt;
&lt;td&gt;Under load, batch size 100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sustained throughput&lt;/td&gt;
&lt;td&gt;~800 spans/sec&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hey&lt;/code&gt; load test, 50 concurrent, 60 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tempo query latency&lt;/td&gt;
&lt;td&gt;35-80ms&lt;/td&gt;
&lt;td&gt;Trace with 50 spans, cold query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Export latency (P99)&lt;/td&gt;
&lt;td&gt;18ms&lt;/td&gt;
&lt;td&gt;Collector metrics &lt;code&gt;/metrics&lt;/code&gt; endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What does 800 spans/sec mean in practice?&lt;/strong&gt; A typical request to our payment service generates 8-12 spans (HTTP, DB queries, external calls). That's ~70 requests/second before hitting limits. Our heaviest local testing—running integration suites with parallel workers—peaks at ~200 spans/sec, well within capacity.&lt;/p&gt;
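&lt;p&gt;The back-of-the-envelope math, for the skeptical:&lt;/p&gt;

```python
throughput = 800        # sustained spans/sec from the benchmark table
spans_per_request = 11  # midpoint of the 8-12 spans a payment request emits
requests_per_sec = throughput // spans_per_request
print(requests_per_sec)  # → 72, i.e. the ~70 req/s ceiling
```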

&lt;p&gt;&lt;strong&gt;At ~1200 spans/second&lt;/strong&gt;, the collector begins dropping traces. You'll see this in the &lt;code&gt;otelcol_processor_dropped_spans&lt;/code&gt; metric. For higher throughput, increase &lt;code&gt;memory_limiter&lt;/code&gt; thresholds and batch sizes—but this is a local dev tool, not a production trace pipeline.&lt;/p&gt;
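&lt;p&gt;If you do need more headroom, these are the knobs to raise—a sketch with illustrative values, not tuned recommendations:&lt;/p&gt;

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512        # raise the hard memory budget
    spike_limit_mib: 128  # headroom for bursts above the soft limit
  batch:
    send_batch_size: 2048 # larger batches, fewer export round-trips
    timeout: 5s
```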




&lt;h2&gt;
  
  
  Security Model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's Implemented
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Measure&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;read_only: true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immutable container filesystem—compromise can't persist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;no-new-privileges&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Blocks privilege escalation via setuid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network isolation&lt;/td&gt;
&lt;td&gt;Tempo only accessible from internal Docker network&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource limits&lt;/td&gt;
&lt;td&gt;Memory caps prevent container resource exhaustion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
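&lt;p&gt;In &lt;code&gt;docker-compose&lt;/code&gt; terms, those measures look roughly like this per service (the memory limit here is illustrative):&lt;/p&gt;

```yaml
services:
  tempo:
    read_only: true
    security_opt:
      - no-new-privileges:true
    networks:
      - internal          # not published to the host
    deploy:
      resources:
        limits:
          memory: 512M
```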

&lt;h3&gt;
  
  
  What's NOT Implemented
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLS between components&lt;/strong&gt;: All traffic is plaintext on the Docker network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt;: Grafana runs with anonymous access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets management&lt;/strong&gt;: No sensitive data in this stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is appropriate for local development. For shared dev environments, enable Grafana authentication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.override.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GF_AUTH_ANONYMOUS_ENABLED=false&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Alerting
&lt;/h2&gt;

&lt;p&gt;The stack ships with pre-configured Prometheus and Loki alert rules, evaluated by Grafana:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alert&lt;/th&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HighErrorRate&lt;/td&gt;
&lt;td&gt;&amp;gt;10% order failures&lt;/td&gt;
&lt;td&gt;Catch application bugs early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SlowRequests&lt;/td&gt;
&lt;td&gt;P95 latency &amp;gt; 2s&lt;/td&gt;
&lt;td&gt;Detect performance regressions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CircuitBreakerOpen&lt;/td&gt;
&lt;td&gt;State = OPEN&lt;/td&gt;
&lt;td&gt;External dependency issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ErrorLogSpike&lt;/td&gt;
&lt;td&gt;Error log rate &amp;gt; 0.1/sec&lt;/td&gt;
&lt;td&gt;Unusual error patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ServiceDown&lt;/td&gt;
&lt;td&gt;Scrape target unreachable&lt;/td&gt;
&lt;td&gt;Infrastructure failures&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
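&lt;p&gt;As a concrete example, the SlowRequests condition is a standard histogram-quantile query. A Prometheus-style rule sketch—the &lt;code&gt;http_request_duration_seconds&lt;/code&gt; metric name is an assumption; substitute whatever your instrumentation actually emits:&lt;/p&gt;

```yaml
# Prometheus alert rule sketch -- metric name is an assumption
groups:
  - name: latency
    rules:
      - alert: SlowRequests
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 2s for 5 minutes"
```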

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far2gd73stn6d4nxuqevm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far2gd73stn6d4nxuqevm.png" alt="Grafana alerting rules" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trace ID format conversion&lt;/strong&gt;: DataDog uses 64-bit trace IDs; OTLP uses 128-bit. The collector zero-pads the upper 64 bits, so correlation with systems that generate native 128-bit IDs may fail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No DataDog APM features&lt;/strong&gt;: This gives you traces, not service maps, anomaly detection, or profiling integration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory footprint&lt;/strong&gt;: ~1.5GB at idle. Not suitable for resource-constrained environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retention defaults&lt;/strong&gt;: 24h for traces, 7d for logs. Configurable in &lt;code&gt;tempo.yaml&lt;/code&gt; and &lt;code&gt;loki.yaml&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
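&lt;p&gt;The zero-padding in limitation 1 is easy to state precisely: the 64-bit DataDog ID becomes the low 64 bits of a 128-bit OTLP ID whose high bits are all zero. A minimal sketch of the conversion, and of why round-tripping a 128-bit-native ID is lossy:&lt;/p&gt;

```python
def dd_to_otlp(dd_trace_id: int) -> str:
    """Zero-pad a 64-bit DataDog trace ID into a 32-hex-char OTLP trace ID."""
    return format(dd_trace_id, "032x")  # high 64 bits are always zero

def otlp_to_dd(otlp_trace_id: str) -> int:
    """Truncate a 128-bit OTLP trace ID to its low 64 bits (lossy!)."""
    return int(otlp_trace_id, 16) % (2 ** 64)

# A DataDog-originated ID survives the round trip...
assert otlp_to_dd(dd_to_otlp(0xDEADBEEF)) == 0xDEADBEEF
# ...but a 128-bit-native ID does not: its high 64 bits are discarded,
# which is exactly the cross-system correlation failure described above.
```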




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start with OpenTelemetry native instrumentation.&lt;/strong&gt; If starting fresh today, I'd use the OpenTelemetry Python SDK rather than ddtrace. The 64-bit/128-bit trace ID mismatch we deal with is a symptom of building on a proprietary format. OTel gives you vendor portability from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Use W3C Trace Context for subprocess propagation.&lt;/strong&gt; The current pattern relies on ddtrace reading &lt;code&gt;DD_TRACE_ID&lt;/code&gt; and &lt;code&gt;DD_PARENT_ID&lt;/code&gt; from the environment—behavior that's not prominently documented and could change. A more portable approach would serialize W3C Trace Context headers to a temp file or pass via stdin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# More portable alternative (pseudocode, not implemented here)
# W3C traceparent format: version-trace_id(32 hex)-parent_id(16 hex)-flags
&lt;/span&gt;&lt;span class="n"&gt;traceparent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;00-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;032&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;016&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traceparent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;traceparent&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
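&lt;p&gt;On the receiving side, the subprocess would read that JSON from stdin and split the header back into its fields. A sketch of the child end—the field layout follows the W3C Trace Context spec, while the stdin plumbing is hypothetical, mirroring the pseudocode above:&lt;/p&gt;

```python
import json
import sys

def parse_traceparent(header: str):
    """Split a W3C traceparent header into (trace_id, span_id) integers.

    Layout: version-trace_id(32 hex)-parent_id(16 hex)-flags
    """
    version, trace_id_hex, span_id_hex, _flags = header.split("-")
    if version != "00":
        raise ValueError(f"unsupported traceparent version: {version}")
    return int(trace_id_hex, 16), int(span_id_hex, 16)

# In the child process (hypothetical wiring, matching the parent sketch):
#   payload = json.load(sys.stdin)
#   trace_id, parent_id = parse_traceparent(payload["traceparent"])
```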



&lt;p&gt;&lt;strong&gt;3. Add a config validation mode.&lt;/strong&gt; The filter regex bug that started this project could have been caught by a "dry run" mode that shows which spans &lt;em&gt;would&lt;/em&gt; be filtered without actually dropping them. I may add this in a future version.&lt;/p&gt;
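&lt;p&gt;The core of such a dry-run mode is small. A hypothetical sketch—the span names and filter pattern are invented for illustration and are not part of the published stack:&lt;/p&gt;

```python
import re

def dry_run_filter(span_names, pattern):
    """Report which spans a filter regex WOULD drop, without dropping any."""
    rx = re.compile(pattern)
    kept, dropped = [], []
    for name in span_names:
        (dropped if rx.search(name) else kept).append(name)
    return kept, dropped

# A filter meant to drop only health checks -- the dry run shows its reach
# before it ever touches real traffic.
spans = ["http.request", "db.query", "healthcheck.ping"]
kept, dropped = dry_run_filter(spans, r"^health")
# dropped == ["healthcheck.ping"]; kept retains the two real spans
```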

&lt;p&gt;&lt;strong&gt;4. Consider ClickHouse for trace storage.&lt;/strong&gt; Tempo is excellent for this use case, but for teams that need SQL queries over traces (e.g., "show me all spans where &lt;code&gt;db.statement&lt;/code&gt; contains 'SELECT *'"), ClickHouse with the OTel exporter would be more powerful.&lt;/p&gt;

&lt;p&gt;That said, for teams already invested in ddtrace, this stack provides immediate value without code changes—and that was the whole point.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons for Incident Response
&lt;/h2&gt;

&lt;p&gt;This incident changed how we handle observability issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;"Can we reproduce it locally?" is now our first question.&lt;/strong&gt; If the answer is no, we build the tooling to make it yes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config changes to observability pipelines get the same review rigor as application code.&lt;/strong&gt; That regex change went through PR review—but nobody caught it because we couldn't test it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent failures are the worst failures.&lt;/strong&gt; The collector reported healthy while dropping 100% of our traces. We now have alerts on &lt;code&gt;otelcol_processor_dropped_spans &amp;gt; 0&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
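&lt;p&gt;That third lesson turns directly into an alert rule. A Prometheus-style sketch—the metric name is the collector's own, but the window and severity are judgment calls:&lt;/p&gt;

```yaml
# Prometheus alert rule sketch -- window and severity are judgment calls
groups:
  - name: collector-health
    rules:
      - alert: CollectorDroppingSpans
        expr: increase(otelcol_processor_dropped_spans[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "OTel collector is silently dropping spans"
```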




&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/LukaszGrochal/demo-repo-otel-stack" rel="noopener noreferrer"&gt;github.com/LukaszGrochal/demo-repo-otel-stack&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a documented, tested version of the debugging tool that helped us fix a production outage. The patterns—subprocess tracing, circuit breaker tagging, log correlation—are used across three teams in our development workflows.&lt;/p&gt;

&lt;p&gt;MIT licensed. Issues and PRs welcome.&lt;/p&gt;

</description>
      <category>opentelemetry</category>
      <category>observability</category>
      <category>datadog</category>
      <category>grafana</category>
    </item>
  </channel>
</rss>
