<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rahul</title>
    <description>The latest articles on DEV Community by Rahul (@rahugur).</description>
    <link>https://dev.to/rahugur</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3867731%2Fadb71a88-38fe-4406-9575-d7268df329b1.jpg</url>
      <title>DEV Community: Rahul</title>
      <link>https://dev.to/rahugur</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rahugur"/>
    <language>en</language>
    <item>
      <title>I Built a Lock-Free Agent Runtime in C++17 — Here's Why Python Frameworks Are 2500x Slower</title>
      <dc:creator>Rahul</dc:creator>
      <pubDate>Wed, 08 Apr 2026 12:06:49 +0000</pubDate>
      <link>https://dev.to/rahugur/i-built-a-lock-free-agent-runtime-in-c17-heres-why-python-frameworks-are-2500x-slower-2e2h</link>
      <guid>https://dev.to/rahugur/i-built-a-lock-free-agent-runtime-in-c17-heres-why-python-frameworks-are-2500x-slower-2e2h</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I replaced Python's LLM orchestration layer with C++17 lock-free data structures. The result: 25,000 sessions/sec vs LangChain's ~10-50. Here's what I learned about why the gap exists, how lock-free programming works, and why it matters for the future of AI infrastructure.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/rahugur" rel="noopener noreferrer"&gt;
        rahugur
      &lt;/a&gt; / &lt;a href="https://github.com/rahugur/forge-lock-free" rel="noopener noreferrer"&gt;
        forge-lock-free
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Forge — Lock-Free Agent Orchestration Runtime&lt;/h1&gt;
&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;A high-performance C++17 agent runtime that orchestrates LLM-powered workflows using lock-free concurrency primitives. Built to demonstrate that agent orchestration doesn't have to be slow — Forge handles &lt;strong&gt;25,000+ sessions/sec&lt;/strong&gt; where Python frameworks like LangChain manage ~50.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why This Exists&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Every major AI agent framework today — LangChain, CrewAI, AutoGen — is written in Python. Python is great for prototyping, but it has a fundamental problem for production agent workloads: &lt;strong&gt;the Global Interpreter Lock (GIL)&lt;/strong&gt;. The GIL means only one thread can execute Python bytecode at a time, even on a 64-core server. When you're orchestrating hundreds of concurrent agent sessions, each making LLM calls and executing tools, the framework itself becomes the bottleneck.&lt;/p&gt;
&lt;p&gt;Forge asks: &lt;strong&gt;what if the orchestration layer was as fast as the hardware allows?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This project is a from-scratch C++17 implementation of an agent runtime that uses lock-free data structures…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/rahugur/forge-lock-free" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;h2&gt;The Problem Nobody Talks About&lt;/h2&gt;

&lt;p&gt;Every production AI deployment I've seen has the same architecture: a Python framework (LangChain, CrewAI, AutoGen) orchestrating LLM calls and tool execution. For a single chatbot, this works fine. But when you need to run hundreds of concurrent agent sessions — think customer support, code review pipelines, batch analysis — the framework itself becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;Not the LLM. Not the network. The orchestration layer.&lt;/p&gt;

&lt;p&gt;I wanted to understand exactly &lt;em&gt;why&lt;/em&gt;, so I built Forge: a complete agent runtime in C++17 with lock-free concurrency primitives, three workflow patterns (ReAct, Plan-Execute, Map-Reduce), an HTTP API with SSE streaming, and 106 tests verified under ThreadSanitizer and AddressSanitizer.&lt;/p&gt;

&lt;p&gt;Then I benchmarked it against LangChain.&lt;/p&gt;

&lt;h2&gt;The Numbers&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Forge (C++17)&lt;/th&gt;
&lt;th&gt;LangChain (Python)&lt;/th&gt;
&lt;th&gt;Gap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scheduling overhead per task&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;307 ns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50-100 us&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;200-300x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session throughput&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25,000/sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~10-50/sec&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;500-2500x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory per session&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.8 KB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2-5 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2500-6000x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrent scaling&lt;/td&gt;
&lt;td&gt;Linear with cores&lt;/td&gt;
&lt;td&gt;GIL-capped&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These aren't synthetic micro-benchmarks. Both frameworks run the same 2-step ReAct workflow (LLM call -&amp;gt; tool execution -&amp;gt; LLM call -&amp;gt; final answer) against the same mock LLM server. The gap is &lt;em&gt;entirely&lt;/em&gt; orchestration overhead.&lt;/p&gt;

&lt;h2&gt;Why Is Python So Much Slower? (It's Not What You Think)&lt;/h2&gt;

&lt;h3&gt;1. The GIL Is Worse Than You Think&lt;/h3&gt;

&lt;p&gt;Python's Global Interpreter Lock isn't just "one thread at a time." It's worse: the GIL adds &lt;strong&gt;context-switch overhead&lt;/strong&gt; on top of the serialization. Every 5 ms (the default &lt;code&gt;sys.getswitchinterval()&lt;/code&gt;), a waiting thread can force the current holder to release and reacquire the GIL. That's a context switch that buys you zero parallelism.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;asyncio&lt;/code&gt; doesn't help with CPU-bound orchestration work. It gives you &lt;em&gt;concurrency&lt;/em&gt; (interleaving) but not &lt;em&gt;parallelism&lt;/em&gt; (simultaneous execution). When the event loop is building prompt templates, parsing JSON responses, or managing callback chains, it's single-threaded.&lt;/p&gt;

&lt;p&gt;Forge's approach: &lt;strong&gt;actual parallel threads&lt;/strong&gt; with lock-free data structures that never block each other.&lt;/p&gt;

&lt;h3&gt;2. Object Allocation Is The Hidden Tax&lt;/h3&gt;

&lt;p&gt;When LangChain creates an &lt;code&gt;AgentExecutor&lt;/code&gt;, here's what gets allocated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM wrapper (ChatOpenAI) with connection pooling, retry config, callbacks&lt;/li&gt;
&lt;li&gt;A prompt template object tree (SystemMessage, HumanMessage, MessagesPlaceholder)&lt;/li&gt;
&lt;li&gt;An output parser chain&lt;/li&gt;
&lt;li&gt;A callback manager with handler registration&lt;/li&gt;
&lt;li&gt;Tool wrappers with schema validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's thousands of Python objects on the heap — each requiring &lt;code&gt;malloc&lt;/code&gt;, reference count initialization, and eventually garbage collection.&lt;/p&gt;

&lt;p&gt;A Forge &lt;code&gt;Session&lt;/code&gt; is this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Session&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                              &lt;span class="c1"&gt;//  8 bytes&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SessionState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;//  4 bytes&lt;/span&gt;
    &lt;span class="n"&gt;Conversation&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                     &lt;span class="c1"&gt;// 24 bytes (vector + mutex)&lt;/span&gt;
    &lt;span class="n"&gt;SessionConfig&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                     &lt;span class="c1"&gt;// 16 bytes&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;step_count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;         &lt;span class="c1"&gt;//  4 bytes&lt;/span&gt;
    &lt;span class="n"&gt;time_point&lt;/span&gt; &lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// 16 bytes&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="n"&gt;initial_prompt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;               &lt;span class="c1"&gt;// 32 bytes&lt;/span&gt;
    &lt;span class="c1"&gt;// ... ~104 bytes total&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No callbacks. No middleware chain. No decorator stack. Just the state machine.&lt;/p&gt;

&lt;h3&gt;3. Scheduling: One Instruction vs Thousands&lt;/h3&gt;

&lt;p&gt;When Forge submits a task to the thread pool, the hot path is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;move&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;head_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exchange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_acq_rel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// ONE atomic instruction&lt;/span&gt;
    &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;              &lt;span class="c1"&gt;// ONE store&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two atomic operations on the hot path, plus one node allocation. No mutex lock, no kernel syscall, no memory fence beyond what the hardware requires.&lt;/p&gt;

&lt;p&gt;When LangChain submits an async task, it goes through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Python function call overhead (frame allocation, argument unpacking)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;asyncio&lt;/code&gt; event loop scheduling (heap allocation for the coroutine frame)&lt;/li&gt;
&lt;li&gt;Callback registration and future management&lt;/li&gt;
&lt;li&gt;GIL acquisition/release cycles&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these individually is "fast enough." Together, they compound to ~100 microseconds per task — &lt;strong&gt;300x more than Forge's 307 nanoseconds&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;How Lock-Free Programming Actually Works&lt;/h2&gt;

&lt;p&gt;If you've never done lock-free programming, here's the mental model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mutex-based:&lt;/strong&gt; "I'm going to lock this resource. Everyone else waits. I do my thing. I unlock. Next person goes."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lock-free:&lt;/strong&gt; "I'm going to &lt;em&gt;try&lt;/em&gt; to make my change atomically. If someone else changed it first, I notice and retry. Nobody ever waits — they either succeed immediately or retry."&lt;/p&gt;

&lt;p&gt;The key CPU instruction is &lt;strong&gt;Compare-And-Swap (CAS)&lt;/strong&gt;: "If this memory location still has value X, change it to Y. Tell me if it worked."&lt;/p&gt;

&lt;h3&gt;Example: Work-Stealing Deque&lt;/h3&gt;

&lt;p&gt;Forge's thread pool uses the &lt;a href="https://dl.acm.org/doi/10.1145/1073970.1073974" rel="noopener noreferrer"&gt;Chase-Lev work-stealing deque&lt;/a&gt;, the same family of work-stealing scheduler found in Go's goroutine runtime, Java's ForkJoinPool, and Rust's Rayon.&lt;/p&gt;

&lt;p&gt;Each worker thread has its own deque:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Owner&lt;/strong&gt; pushes/pops from the bottom (fast, no contention)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thieves&lt;/strong&gt; steal from the top (uses CAS — if two thieves race, one retries)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Worker 0:  [A] [B] [C]  ← owner pops C (LIFO, cache-friendly)
Worker 1:  [D] [E]
Worker 2:  (idle) ── steals A from Worker 0's top (FIFO, coarse-grained work)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The owner's push/pop is wait-free (it always completes in a bounded number of steps). Stealing requires one CAS, which costs roughly 10-20 nanoseconds on a modern CPU. Compare that to &lt;code&gt;pthread_mutex_lock&lt;/code&gt;, which can cost ~25 ns uncontended and microseconds contended (on Linux it falls back to a futex syscall under contention).&lt;/p&gt;

&lt;h3&gt;The Subtlety: Memory Ordering&lt;/h3&gt;

&lt;p&gt;The hardest part of lock-free programming isn't the algorithm. It's &lt;strong&gt;memory ordering&lt;/strong&gt;. Modern CPUs reorder memory operations for performance. Even x86, with its relatively strong model, lets a store become visible after a later load (store buffering). On ARM (including Apple Silicon), both loads and stores can be reordered far more freely.&lt;/p&gt;

&lt;p&gt;Forge uses explicit memory orderings throughout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Producer stores the value, then publishes with release ordering&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;move&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;ready_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Everything before this is visible&lt;/span&gt;

&lt;span class="c1"&gt;// Consumer acquires — sees everything the producer released&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ready_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// But does useful work while waiting!&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;memory_order_release&lt;/code&gt; means: "Make all my previous writes visible before this store."&lt;br&gt;
&lt;code&gt;memory_order_acquire&lt;/code&gt; means: "See all writes that happened before the corresponding release."&lt;/p&gt;

&lt;p&gt;Getting this wrong causes data races that only manifest under high contention on specific CPU architectures. That's why Forge runs all 106 tests under ThreadSanitizer, which catches these bugs at the level of individual memory accesses.&lt;/p&gt;

&lt;h2&gt;The Work-Stealing Future Trick&lt;/h2&gt;

&lt;p&gt;Here's my favorite design decision in Forge. When a worker thread calls &lt;code&gt;future.get()&lt;/code&gt; and the value isn't ready yet, what should it do?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;std::future:&lt;/strong&gt; Sleep. (Wastes the thread.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forge Future:&lt;/strong&gt; Process other tasks from the pool while waiting.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;state_&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ready_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// yield_fn is set by the ThreadPool — it tries to execute&lt;/span&gt;
        &lt;span class="c1"&gt;// another task from the pool, preventing starvation&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yield_fn&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;yield_fn&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Did useful work!&lt;/span&gt;
        &lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// No work available, back off&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;move&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;state_&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value_ptr&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents &lt;strong&gt;pool starvation&lt;/strong&gt;: the scenario where all 8 workers are blocked on futures, but the tasks that would &lt;em&gt;fulfill&lt;/em&gt; those futures are sitting in the queue with nobody to execute them. Standard thread pools deadlock here. Forge doesn't.&lt;/p&gt;

&lt;h2&gt;What I'd Do Differently&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Over-engineering the concurrent map.&lt;/strong&gt; Forge uses a 64-stripe concurrent hash map for session storage. In practice, session creation/deletion is rare compared to state queries. A simpler RCU (Read-Copy-Update) pattern or even a single &lt;code&gt;shared_mutex&lt;/code&gt; would have been fine for &amp;lt; 10,000 sessions. The striped map is more impressive on paper than necessary in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Under-investing in observability.&lt;/strong&gt; The tracing system (RAII spans with hierarchical IDs) works, but there's no export to Jaeger/Zipkin. For a production system, that's table stakes. I'd add OpenTelemetry support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not building a gRPC interface.&lt;/strong&gt; REST + SSE works, but gRPC with bidirectional streaming would be more natural for session management and would eliminate the polling in SSE.&lt;/p&gt;

&lt;h2&gt;When You Should Use This (And When You Shouldn't)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Forge (or something like it) when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're running 100+ concurrent agent sessions on a server&lt;/li&gt;
&lt;li&gt;Orchestration latency matters (real-time applications)&lt;/li&gt;
&lt;li&gt;You're deploying to edge/embedded (single binary, &amp;lt;1KB/session)&lt;/li&gt;
&lt;li&gt;You need to squeeze maximum throughput from your LLM API quota&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use LangChain/CrewAI when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're prototyping and need to move fast&lt;/li&gt;
&lt;li&gt;You need the ecosystem (vector stores, document loaders, 500+ integrations)&lt;/li&gt;
&lt;li&gt;You're building a single-agent chatbot&lt;/li&gt;
&lt;li&gt;Your team knows Python, not C++&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The honest answer: &lt;strong&gt;most teams should use Python frameworks&lt;/strong&gt;. The GIL only starts to hurt once you hit scale, most teams haven't hit that scale yet, and shipping fast matters more than scheduling overhead.&lt;/p&gt;

&lt;p&gt;But for the teams that &lt;em&gt;are&lt;/em&gt; at scale — running thousands of concurrent agents for production workloads — the orchestration layer is worth optimizing. And lock-free C++ is how you do it.&lt;/p&gt;

&lt;h2&gt;Try It&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &amp;lt;repo-url&amp;gt; forge &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;forge
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build
./build/src/forge &lt;span class="nt"&gt;--api-base&lt;/span&gt; https://api.groq.com/openai &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; llama-3.3-70b-versatile &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="nv"&gt;$GROQ_API_KEY&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"What are the trade-offs of lock-free vs mutex-based concurrency?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full source, all 106 tests, and benchmark scripts are on GitHub. PRs welcome.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Forge is a portfolio project demonstrating C++17 lock-free concurrency applied to AI agent orchestration. It includes lock-free MPSC queues, Chase-Lev work-stealing deques, atomic Future/Promise, three workflow patterns, an HTTP API with SSE streaming, and full sanitizer verification. Built from scratch — no frameworks, no external concurrency libraries.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;About the Author&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Systems engineer interested in high-performance computing, concurrency, and AI infrastructure. This project exists because I wanted to understand the actual performance cost of Python's GIL in production agent workloads — and because lock-free programming is fun.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cpp</category>
      <category>concurrency</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
