DEV Community

Rahul

I Built a Lock-Free Agent Runtime in C++17 — Here's Why Python Frameworks Are 2500x Slower

TL;DR: I replaced Python's LLM orchestration layer with C++17 lock-free data structures. The result: 25,000 sessions/sec vs LangChain's ~10-50. Here's what I learned about why the gap exists, how lock-free programming works, and why it matters for the future of AI infrastructure.

Forge — Lock-Free Agent Orchestration Runtime

A high-performance C++17 agent runtime that orchestrates LLM-powered workflows using lock-free concurrency primitives. Built to demonstrate that agent orchestration doesn't have to be slow — Forge handles 25,000+ sessions/sec where Python frameworks like LangChain manage ~50.


Why This Exists

Every major AI agent framework today — LangChain, CrewAI, AutoGen — is written in Python. Python is great for prototyping, but it has a fundamental problem for production agent workloads: the Global Interpreter Lock (GIL). The GIL means only one thread can execute Python bytecode at a time, even on a 64-core server. When you're orchestrating hundreds of concurrent agent sessions, each making LLM calls and executing tools, the framework itself becomes the bottleneck.

Forge asks: what if the orchestration layer was as fast as the hardware allows?

This project is a from-scratch C++17 implementation of an agent runtime that uses lock-free data structures…


The Problem Nobody Talks About

Every production AI deployment I've seen has the same architecture: a Python framework (LangChain, CrewAI, AutoGen) orchestrating LLM calls and tool execution. For a single chatbot, this works fine. But when you need to run hundreds of concurrent agent sessions — think customer support, code review pipelines, batch analysis — the framework itself becomes the bottleneck.

Not the LLM. Not the network. The orchestration layer.

I wanted to understand exactly why, so I built Forge: a complete agent runtime in C++17 with lock-free concurrency primitives, three workflow patterns (ReAct, Plan-Execute, Map-Reduce), an HTTP API with SSE streaming, and 106 tests verified under ThreadSanitizer and AddressSanitizer.

Then I benchmarked it against LangChain.

The Numbers

| Metric | Forge (C++17) | LangChain (Python) | Gap |
| --- | --- | --- | --- |
| Scheduling overhead per task | 307 ns | ~50-100 µs | 200-300x |
| Session throughput | 25,000/sec | ~10-50/sec | 500-2500x |
| Memory per session | 0.8 KB | ~2-5 MB | 2500-6000x |
| Concurrent scaling | Linear with cores | GIL-capped | N/A |

These aren't synthetic micro-benchmarks. Both frameworks run the same 2-step ReAct workflow (LLM call -> tool execution -> LLM call -> final answer) against the same mock LLM server. The gap is entirely orchestration overhead.

Why Is Python So Much Slower? (It's Not What You Think)

1. The GIL Is Worse Than You Think

Python's Global Interpreter Lock isn't just "one thread at a time." Every 5ms (the default sys.getswitchinterval()), the interpreter checks whether another thread is waiting for the GIL and forces a hand-off if one is. Under contention, that means constant kernel-level context switches between threads that each get only 5ms of progress at a time, and every one of those switches is pure orchestration overhead.

asyncio doesn't help with CPU-bound orchestration work. It gives you concurrency (interleaving) but not parallelism (simultaneous execution). When the event loop is building prompt templates, parsing JSON responses, or managing callback chains, it's single-threaded.

Forge's approach: actual parallel threads with lock-free data structures that never block each other.

2. Object Allocation Is The Hidden Tax

When LangChain creates an AgentExecutor, here's what gets allocated:

  • The LLM wrapper (ChatOpenAI) with connection pooling, retry config, callbacks
  • A prompt template object tree (SystemMessage, HumanMessage, MessagesPlaceholder)
  • An output parser chain
  • A callback manager with handler registration
  • Tool wrappers with schema validation

That's thousands of Python objects on the heap, each requiring an allocation, reference-count initialization, and eventual garbage-collector tracking.

A Forge Session is this:

struct Session {
    uint64_t id;                              //  8 bytes
    std::atomic<SessionState> state;          //  4 bytes
    Conversation history;                     // 24 bytes (vector + mutex)
    SessionConfig config;                     // 16 bytes
    std::atomic<uint32_t> step_count;         //  4 bytes
    time_point deadline, created_at;          // 16 bytes
    std::string initial_prompt;               // 32 bytes
    // ... ~104 bytes total
};

No callbacks. No middleware chain. No decorator stack. Just the state machine.

3. Scheduling: One Instruction vs Thousands

When Forge submits a task to the thread pool, the hot path is:

void push(T value) {
    auto* node = new Node(std::move(value));
    Node* prev = head_.exchange(node, std::memory_order_acq_rel);  // ONE atomic instruction
    prev->next.store(node, std::memory_order_release);              // ONE store
}

Two atomic operations (an exchange and a store), plus one node allocation. No mutex lock, no kernel syscall, no memory fence beyond what the hardware already requires.
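For context, here is what a complete queue built around that hot path can look like. This is a sketch of Vyukov's intrusive MPSC queue, not Forge's actual implementation; the Node layout and the pop() side are my reconstruction, assuming a single consumer:

```cpp
#include <atomic>
#include <utility>

// Minimal MPSC queue in the style of the push() above (Vyukov's algorithm).
// Many producers may push concurrently; exactly ONE consumer may pop.
template <typename T>
class MpscQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
        Node() = default;
        explicit Node(T v) : value(std::move(v)) {}
    };
    std::atomic<Node*> head_;  // producers exchange here (newest node)
    Node* tail_;               // consumer-only end (oldest node, a stub)

public:
    MpscQueue() {
        Node* stub = new Node();   // dummy node simplifies the empty case
        head_.store(stub);
        tail_ = stub;
    }
    ~MpscQueue() {
        while (tail_) { Node* n = tail_->next.load(); delete tail_; tail_ = n; }
    }

    void push(T value) {  // same two-step hot path as the snippet above
        auto* node = new Node(std::move(value));
        Node* prev = head_.exchange(node, std::memory_order_acq_rel);
        prev->next.store(node, std::memory_order_release);
    }

    bool pop(T& out) {    // single consumer only
        Node* next = tail_->next.load(std::memory_order_acquire);
        if (!next) return false;        // empty (or a producer mid-publish)
        out = std::move(next->value);
        delete tail_;                   // retire the old stub
        tail_ = next;                   // the popped node becomes the new stub
        return true;
    }
};
```

The dummy-stub trick is what keeps push() down to two atomic operations: producers never have to handle the empty-queue case.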

When LangChain submits an async task, it goes through:

  1. Python function call overhead (frame allocation, argument unpacking)
  2. asyncio event loop scheduling (heap allocation for the coroutine frame)
  3. Callback registration and future management
  4. GIL acquisition/release cycles

Each of these individually is "fast enough." Together, they compound to ~100 microseconds per task — 300x more than Forge's 307 nanoseconds.

How Lock-Free Programming Actually Works

If you've never done lock-free programming, here's the mental model.

Mutex-based: "I'm going to lock this resource. Everyone else waits. I do my thing. I unlock. Next person goes."

Lock-free: "I'm going to try to make my change atomically. If someone else changed it first, I notice and retry. Nobody ever waits — they either succeed immediately or retry."

The key CPU instruction is Compare-And-Swap (CAS): "If this memory location still has value X, change it to Y. Tell me if it worked."
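To make CAS concrete, here is a minimal lock-free stack (a Treiber stack) written around that retry loop. It is illustrative, not Forge code, and it glosses over the memory-reclamation issues (ABA, hazard pointers) that a production multi-threaded pop must handle:

```cpp
#include <atomic>
#include <utility>

// Treiber stack: every operation is a CAS retry loop. A failed CAS just
// means "someone beat me to it"; the loser reloads and tries again.
template <typename T>
class TreiberStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head_{nullptr};

public:
    void push(T v) {
        Node* n = new Node{std::move(v), head_.load(std::memory_order_relaxed)};
        // "If head_ is still n->next, swing it to n; otherwise n->next is
        //  refreshed with the current head and we retry."
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
            // nothing to do: the failed CAS already updated n->next
        }
    }

    bool pop(T& out) {
        Node* n = head_.load(std::memory_order_acquire);
        while (n && !head_.compare_exchange_weak(n, n->next,
                                                 std::memory_order_acquire,
                                                 std::memory_order_acquire)) {
            // n now holds the latest head; loop until we detach it or run dry
        }
        if (!n) return false;
        out = std::move(n->value);
        delete n;  // safe here only without concurrent poppers; real code
                   // needs hazard pointers or epoch reclamation (ABA hazard)
        return true;
    }
};
```

Note that neither loop ever sleeps or blocks: a thread that loses the race retries immediately with fresh data, which is exactly the lock-free mental model described above.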

Example: Work-Stealing Deque

Forge's thread pool uses the Chase-Lev work-stealing deque — the same algorithm used in Go's goroutine scheduler, Java's ForkJoinPool, and Rust's Rayon.

Each worker thread has its own deque:

  • Owner pushes/pops from the bottom (fast, no contention)
  • Thieves steal from the top (uses CAS — if two thieves race, one retries)
Worker 0:  [A] [B] [C]  ← owner pops C (LIFO, cache-friendly)
Worker 1:  [D] [E]
Worker 2:  (idle) ── steals A from Worker 0's top (FIFO, coarse-grained work)

The owner's push/pop is wait-free (always completes in bounded steps). Stealing requires one CAS — on a modern CPU, that's ~10-20 nanoseconds. Compare that to pthread_mutex_lock, which can cost 25ns uncontended and microseconds contended (it's a kernel syscall on contention).
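A heavily simplified sketch of that push/steal split is below. The names and the bounded buffer are my assumptions, and the owner-side pop is omitted entirely; the real Chase-Lev algorithm needs a growable buffer and a delicate fence in pop(), which is where most of the published subtlety lives:

```cpp
#include <atomic>
#include <cstddef>

// Simplified work-stealing deque shape: the owner appends at `bottom`,
// thieves advance `top` with CAS. Bounded buffer; assumes fewer than kCap
// tasks are ever outstanding at once. NOT the full Chase-Lev algorithm.
struct MiniDeque {
    static constexpr std::size_t kCap = 1024;
    long tasks[kCap];
    std::atomic<long> top{0};     // thieves CAS this forward (FIFO end)
    std::atomic<long> bottom{0};  // owner advances this (LIFO end)

    void push(long t) {                        // owner thread only
        long b = bottom.load(std::memory_order_relaxed);
        tasks[b % kCap] = t;
        // release: the task write above is visible before bottom advances
        bottom.store(b + 1, std::memory_order_release);
    }

    bool steal(long& out) {                    // any thief thread
        long t = top.load(std::memory_order_acquire);
        long b = bottom.load(std::memory_order_acquire);
        if (t >= b) return false;              // nothing to steal
        out = tasks[t % kCap];
        // If another thief won the race, this CAS fails and the caller
        // simply retries. Nobody ever blocks.
        return top.compare_exchange_strong(t, t + 1,
                                           std::memory_order_acq_rel);
    }
};
```

Even this stripped-down version shows the asymmetry the article describes: the owner's push is a plain write plus one release store, while contention is pushed entirely onto the (rare) steal path.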

The Subtlety: Memory Ordering

The hardest part of lock-free programming isn't the algorithm; it's memory ordering. Modern CPUs and compilers reorder memory operations for performance. Even on x86, with its comparatively strong model, a store can become visible to other cores after a later load from the same thread (store buffering). On ARM (including Apple Silicon), both loads and stores can be reordered far more aggressively.

Forge uses explicit memory orderings throughout:

// Producer stores the value, then publishes with release ordering
new (storage) T(std::move(value));
ready_.store(true, std::memory_order_release);  // Everything before this is visible

// Consumer acquires — sees everything the producer released
while (!ready_.load(std::memory_order_acquire)) {
    backoff(iter++);  // spin politely until the producer publishes
}

memory_order_release means: "Make all my previous writes visible before this store."
memory_order_acquire means: "See all writes that happened before the corresponding release."

Getting this wrong causes data races that only manifest under high contention on specific CPU architectures. That's why Forge runs all 106 tests under ThreadSanitizer — it catches these bugs at the instruction level.
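Here is a self-contained demonstration of that pairing: the producer publishes a plain write via a release store, and the consumer's acquire load guarantees it sees that write. Without the release/acquire pair, a weakly ordered CPU could legally let the consumer observe ready_ == true while still seeing a stale payload:

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready_{false}; // the publication flag

int run_demo() {
    std::thread producer([] {
        payload = 42;                                    // plain write...
        ready_.store(true, std::memory_order_release);   // ...published here
    });

    // Acquire load: once this reads true, the write to payload is visible.
    while (!ready_.load(std::memory_order_acquire)) {
        // spin; a real system would back off here
    }
    int seen = payload;  // guaranteed to be 42, not a stale 0
    producer.join();
    return seen;
}
```

Run this under ThreadSanitizer with the orderings downgraded to relaxed and the race is flagged immediately, which is the point the article makes about sanitizer verification.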

The Work-Stealing Future Trick

Here's my favorite design decision in Forge. When a worker thread calls future.get() and the value isn't ready yet, what should it do?

  • std::future: Sleep. (Wastes the thread.)
  • Forge Future: Process other tasks from the pool while waiting.
T get() {
    while (!state_->ready_.load(std::memory_order_acquire)) {
        // yield_fn is set by the ThreadPool — it tries to execute
        // another task from the pool, preventing starvation
        if (yield_fn && yield_fn()) continue;  // Did useful work!
        backoff(iter++);  // No work available, back off
    }
    return std::move(*state_->value_ptr());
}

This prevents pool starvation: the scenario where all 8 workers are blocked on futures, but the tasks that would fulfill those futures are sitting in the queue with nobody to execute them. Standard thread pools deadlock here. Forge doesn't.
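The post doesn't show backoff() itself, so here is one plausible implementation (an assumption on my part, not Forge's actual code): spin with a CPU pause hint for short waits, doubling the spin count each round, then fall back to yielding the thread. It returns the number of spins performed so the behavior is observable:

```cpp
#include <thread>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>   // _mm_pause
#endif

// Hypothetical exponential backoff for the spin loops above.
// Returns how many pause iterations were executed (0 means we yielded).
inline unsigned backoff(unsigned iter) {
    if (iter < 6) {
        unsigned spins = 1u << iter;       // 1, 2, 4, ... 32
        for (unsigned i = 0; i < spins; ++i) {
#if defined(__x86_64__) || defined(__i386__)
            _mm_pause();                   // hint: we're spin-waiting
#endif
        }
        return spins;
    }
    std::this_thread::yield();             // long wait: let others run
    return 0;
}
```

The pause hint matters on hyper-threaded cores: it tells the CPU the loop is a spin-wait, freeing execution resources for the sibling thread.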

What I'd Do Differently

Over-engineering the concurrent map. Forge uses a 64-stripe concurrent hash map for session storage. In practice, session creation/deletion is rare compared to state queries. A simpler RCU (Read-Copy-Update) pattern or even a single shared_mutex would have been fine for < 10,000 sessions. The striped map is more impressive on paper than necessary in practice.

Under-investing in observability. The tracing system (RAII spans with hierarchical IDs) works, but there's no export to Jaeger/Zipkin. For a production system, that's table stakes. I'd add OpenTelemetry support.

Not building a gRPC interface. REST + SSE works, but gRPC with bidirectional streaming would be more natural for session management and would eliminate the polling in SSE.

When You Should Use This (And When You Shouldn't)

Use Forge (or something like it) when:

  • You're running 100+ concurrent agent sessions on a server
  • Orchestration latency matters (real-time applications)
  • You're deploying to edge/embedded (single binary, <1KB/session)
  • You need to squeeze maximum throughput from your LLM API quota

Use LangChain/CrewAI when:

  • You're prototyping and need to move fast
  • You need the ecosystem (vector stores, document loaders, 500+ integrations)
  • You're building a single-agent chatbot
  • Your team knows Python, not C++

The honest answer: most teams should use Python frameworks. The GIL matters when you hit scale. Most teams haven't hit scale yet, and shipping fast matters more than scheduling overhead.

But for the teams that are at scale — running thousands of concurrent agents for production workloads — the orchestration layer is worth optimizing. And lock-free C++ is how you do it.

Try It

git clone <repo-url> forge && cd forge
cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build
./build/src/forge --api-base https://api.groq.com/openai \
  -m llama-3.3-70b-versatile -k $GROQ_API_KEY \
  -p "What are the trade-offs of lock-free vs mutex-based concurrency?"

The full source, all 106 tests, and benchmark scripts are on GitHub. PRs welcome.


Forge is a portfolio project demonstrating C++17 lock-free concurrency applied to AI agent orchestration. It includes lock-free MPSC queues, Chase-Lev work-stealing deques, atomic Future/Promise, three workflow patterns, an HTTP API with SSE streaming, and full sanitizer verification. Built from scratch — no frameworks, no external concurrency libraries.


About the Author

Systems engineer interested in high-performance computing, concurrency, and AI infrastructure. This project exists because I wanted to understand the actual performance cost of Python's GIL in production agent workloads — and because lock-free programming is fun.
