Mukunda Rao Katta

Posted on May 25

100 Users, One LLM Call: Single-Flight Deduplication with llm-batch-coalesce

#hermeschallenge #ai #python #agents

100 users hit the same endpoint at once

It was a status page. Simple enough. A chatbot interface where users could ask "what is the current system status?" and get a plain-language summary generated by an LLM from the latest metrics.

Then a deployment happened. The engineering team sent out a Slack alert. Within a few seconds, 100 people clicked the same link and asked the same question.

The server logged 100 LLM calls in a 200ms window. Same prompt. Same context. Same metrics snapshot. The model returned the same answer, 100 times, at full token cost.

The latency was fine. The bill was not. The system had paid for 100 identical LLM calls when it needed only one.

This is not a cache problem. Caching would have worked if those requests had been spread across minutes. But they were simultaneous: all 100 arrived while the first call was still in flight, before any result existed to cache. A standard result cache is only populated once a call returns, so it would have missed every single one of them.
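To make the miss concrete, here is a minimal simulation of that burst against a plain check-then-call cache. Everything in it is a stand-in: fake_llm, cached_call, and run_burst are hypothetical names, and the barrier simply models "no response comes back until every request is already in flight".

```python
import threading


def run_burst(n: int) -> int:
    """Fire n simultaneous requests at a check-then-call cache; return real calls made."""
    cache: dict[str, str] = {}
    calls = 0
    calls_lock = threading.Lock()
    # Models the burst: no call returns until all n calls are in flight.
    all_in_flight = threading.Barrier(n)

    def fake_llm(prompt: str) -> str:
        nonlocal calls
        with calls_lock:
            calls += 1
        all_in_flight.wait(timeout=5)
        return "All systems operational."

    def cached_call(prompt: str) -> str:
        if prompt in cache:      # always a miss during the burst
            return cache[prompt]
        result = fake_llm(prompt)
        cache[prompt] = result   # populated only after the call returns
        return result

    threads = [threading.Thread(target=cached_call, args=("status?",)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return calls


print(run_burst(100))  # 100: every request missed the cache
```

The cache only starts helping for requests that arrive after the first response has landed, which is exactly the traffic this burst does not contain.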

The actual fix is the single-flight pattern. When many callers request the same thing at the same time, you let one caller go through and make the real call. Everyone else waits. When the result comes back, you hand it to all of them at once.

That is what llm-batch-coalesce does.

The shape of the fix

Install the library:

pip install llm-batch-coalesce

Wrap your LLM call with the coalescer:

from llm_batch_coalesce import LLMBatchCoalescer

coalescer = LLMBatchCoalescer(ttl_seconds=5.0)

def call_llm(prompt: str) -> str:
    return coalescer.call(prompt, fn=_real_llm_call)

def _real_llm_call(prompt: str) -> str:
    # actual Anthropic/OpenAI SDK call here
    import anthropic
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

Now 100 concurrent callers with the same prompt produce one real LLM call. The other 99 block until the result is ready, then get the same value.

For async code, the pattern is the same:

from llm_batch_coalesce import AsyncLLMBatchCoalescer

coalescer = AsyncLLMBatchCoalescer(ttl_seconds=5.0)

async def call_llm_async(prompt: str) -> str:
    return await coalescer.call(prompt, fn=_real_llm_call_async)

async def _real_llm_call_async(prompt: str) -> str:
    import anthropic
    client = anthropic.AsyncAnthropic()
    message = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

The ttl_seconds parameter controls how long a result stays in the in-flight deduplication window after the real call completes. After that window closes, the next caller for the same prompt triggers a fresh LLM call.

What it does NOT do

This library is deliberately narrow. Here is what it intentionally skips:

No persistent result cache. Once the TTL expires, the next request pays full price again. The library does not save results to disk, Redis, or any external store. If you want a result cache that survives across minutes or sessions, pair this with a real cache layer.
No semantic similarity matching. Two prompts that mean the same thing but differ by whitespace or word order hash to different keys and get separate calls. The library compares prompts by canonical hash, not meaning.
No batching to a batch API. This is not the Anthropic Message Batches API. It does not accumulate requests and send them in bulk for async processing. It makes one real call right away and shares the result with every concurrent caller. These are different tools for different problems.
No distributed coordination. In-flight deduplication is per-process. Two server instances running in parallel do not share their coalescing state. For distributed deduplication, you need a shared lock layer that this library does not provide.
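The exact-match behaviour is easy to see with a plain hash. The sketch below assumes the key is a SHA-256 of the raw prompt text; the library's canonical hash may differ in detail. Both dedup_key and normalize_prompt are hypothetical helpers, not part of the library's API.

```python
import hashlib


def dedup_key(prompt: str) -> str:
    # Assumption: the key is a SHA-256 digest of the raw prompt text.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def normalize_prompt(prompt: str) -> str:
    # Hypothetical caller-side helper: trim and collapse runs of whitespace.
    return " ".join(prompt.split())


a = "What is the current system status?"
b = "What is the  current system status? "  # extra space and trailing space

print(dedup_key(a) == dedup_key(b))                                      # False
print(dedup_key(normalize_prompt(a)) == dedup_key(normalize_prompt(b)))  # True
```

If near-identical prompts are common in your traffic, normalizing on the caller side before the prompt reaches the coalescer is one way to get more hits without any semantic matching.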

Inside the lib: in-flight-only vs result-cache

The key design decision in this library is where the deduplication window starts and ends.

A result cache keeps a value around after the call completes. You control how long. If two callers hit the same key ten minutes apart, they share the cached result.

This library only deduplicates concurrent in-flight requests. The window runs from the moment the first caller starts the real LLM call until the result comes back, plus the TTL. After that, the entry is gone.

Two callers hitting the same prompt, with the second arriving more than ttl_seconds after the first call has returned, get two separate LLM calls. That is intentional.
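A small worked example of that window, as a sketch. The helper shares_result is hypothetical, times are seconds on an arbitrary clock, and treating both ends of the window as inclusive is an assumption.

```python
def shares_result(leader_start: float, leader_end: float,
                  arrival: float, ttl_seconds: float) -> bool:
    """True if a request arriving at `arrival` reuses the leader's call."""
    return leader_start <= arrival <= leader_end + ttl_seconds


# Leader starts at t=0.0, the model answers at t=1.2, TTL is 5.0,
# so the window closes at roughly t=6.2.
print(shares_result(0.0, 1.2, arrival=0.1, ttl_seconds=5.0))  # True: call still in flight
print(shares_result(0.0, 1.2, arrival=4.0, ttl_seconds=5.0))  # True: inside the TTL tail
print(shares_result(0.0, 1.2, arrival=6.5, ttl_seconds=5.0))  # False: window has closed
```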

Why? Because this sidesteps cache invalidation. LLM results can depend on time-sensitive context, session state, or data that changes between requests. A result cache that holds a value for minutes or hours can silently return stale answers. The in-flight window is short enough that the context is unlikely to have changed.

The implementation uses a dictionary of in-flight futures (or events for the sync case). When a new request arrives:

Hash the prompt to a cache key.
Check if there is already an in-flight entry for that key.
If yes, attach to the existing future and wait.
If no, create a new entry, make the real LLM call, and resolve all waiters with the result.

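# Simplified sketch of the sync path (illustrative, not the library's exact code)
import hashlib
import threading


class _Entry:
    """One in-flight call: the event waiters block on, plus its outcome."""

    def __init__(self):
        self.event = threading.Event()
        self.result = None
        self.error = None


class LLMBatchCoalescer:
    def __init__(self, ttl_seconds: float = 5.0):
        self._lock = threading.Lock()
        self._inflight: dict[str, _Entry] = {}
        self.ttl_seconds = ttl_seconds

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    def _expire(self, key: str, entry: _Entry) -> None:
        # Forget the entry once its TTL has passed.
        with self._lock:
            if self._inflight.get(key) is entry:
                del self._inflight[key]

    def call(self, prompt: str, fn) -> object:
        key = self._key(prompt)
        # Decide under the lock whether this caller leads or follows.
        with self._lock:
            entry = self._inflight.get(key)
            is_leader = entry is None
            if is_leader:
                entry = self._inflight[key] = _Entry()

        if is_leader:
            try:
                entry.result = fn(prompt)
            except BaseException as exc:
                entry.error = exc  # waiters re-raise the leader's exception
            finally:
                entry.event.set()  # wake every waiter
                # TTL cleanup: drop the key ttl_seconds after the call ends.
                # In this sketch a failed call is shared for the TTL as well.
                timer = threading.Timer(self.ttl_seconds, self._expire, (key, entry))
                timer.daemon = True
                timer.start()
        else:
            entry.event.wait()

        if entry.error is not None:
            raise entry.error
        return entry.result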

The real implementation handles cleanup, TTL expiry, error propagation to all waiters, and the async path with asyncio.Event. But that sketch captures the core idea.
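For the async side, one way to express the same leader and follower idea is a shared asyncio.Future per key. This is a sketch under stated assumptions, not the library's code: the class name AsyncSingleFlight is hypothetical, it uses a Future rather than asyncio.Event, and it drops the entry as soon as the call finishes, which is equivalent to a TTL of zero.

```python
import asyncio
import hashlib


class AsyncSingleFlight:
    """Hypothetical sketch: in-flight-only coalescing with no TTL."""

    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Future] = {}

    async def call(self, prompt: str, fn):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        existing = self._inflight.get(key)
        if existing is not None:
            # Follower: wait on the leader's future. shield() keeps one
            # cancelled follower from cancelling the shared future.
            return await asyncio.shield(existing)

        # Leader: register the future before the first await so that
        # later callers find it.
        future = asyncio.get_running_loop().create_future()
        self._inflight[key] = future
        try:
            result = await fn(prompt)
            future.set_result(result)
            return result
        except Exception as exc:
            future.set_exception(exc)
            future.exception()  # mark retrieved; avoids a warning if nobody waited
            raise
        finally:
            if not future.done():  # the leader itself was cancelled
                future.cancel()
            del self._inflight[key]


async def demo() -> tuple[int, list[str]]:
    calls = 0

    async def fake_llm(prompt: str) -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)  # stand-in for model latency
        return "All systems operational."

    flight = AsyncSingleFlight()
    answers = await asyncio.gather(*(flight.call("status?", fake_llm) for _ in range(100)))
    return calls, answers


calls, answers = asyncio.run(demo())
print(calls, len(answers))  # 1 100
```

No lock is needed here because the lookup and the registration happen without an await in between, so no other coroutine can run in that gap.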

When this is useful

Status pages and dashboards with bursty traffic. A deployment alert, a public incident, an on-call page. Users hit the same summarization endpoint all at once. The burst window is short. Single-flight collapses it.

Autocomplete or typeahead backed by an LLM. If ten users type the same prefix in the same 200ms window, they trigger the same LLM completion. Coalescing means one call.

Internal tools shared by a team. A Slack bot that answers "how do I do X in our system?" gets asked the same question by different people within seconds of a team meeting. Coalescing cuts the duplicate calls.

Rate-limit pressure. If your LLM API key is under a tight rate limit, reducing duplicate concurrent calls directly reduces the chance of hitting the limit during bursts.

When NOT to use it

Personalized responses. If the answer depends on user-specific context (session ID, user history, preferences), different callers must not share results even if the base prompt looks the same. Include that context in the prompt you pass to the coalescer so it becomes part of the key, or bypass the coalescer entirely for personalized paths.
Single-threaded, sequential workloads. If your agent calls the LLM one request at a time with no concurrency, there is no burst to collapse. The library adds overhead with no benefit.
Long TTL as a substitute for a real cache. Setting ttl_seconds to 300 to avoid cache invalidation complexity is the wrong tool for that job. Use a proper result cache with explicit invalidation instead.
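For the personalized case, a minimal sketch of scoping the prompt so that different users can never land on the same key. The helper scoped_prompt and its bracketed prefix format are hypothetical; any scheme works as long as everything the answer depends on ends up in the string the coalescer sees.

```python
def scoped_prompt(question: str, user_context: str) -> str:
    """Hypothetical helper: put the context the answer depends on into the prompt."""
    return f"[context: {user_context}]\n{question}"


question = "Summarize my open tickets."
alice = scoped_prompt(question, "user=alice")
bob = scoped_prompt(question, "user=bob")

print(alice == bob)                                     # False: never coalesced together
print(alice == scoped_prompt(question, "user=alice"))   # True: Alice's own duplicates still merge
```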

Install

pip install llm-batch-coalesce

Zero runtime dependencies. Python 3.9 and up. 22 tests covering sync and async paths, error propagation to all waiters, TTL expiry, and concurrent key isolation.

No opinion about which LLM SDK you use. Pass any callable as the fn argument. Works with Anthropic, OpenAI, Bedrock, or a local model.

Source: MukundaKatta/llm-batch-coalesce

Siblings

Lib	Boundary	Repo
tool-call-cache	Result memoization for tool calls, different scope from in-flight dedup	MukundaKatta/tool-call-cache
llm-message-hash-py	Canonical hashing that this library uses for the dedup key	MukundaKatta/llm-message-hash-py
anthropic-batch-kit	Async batch submission for sequential requests, not concurrent dedup	MukundaKatta/anthropic-batch-kit
token-budget-py	Budget cap so a burst of real calls does not overrun the monthly limit	MukundaKatta/token-budget-py

What is next

The core case works. A few gaps worth closing:

Distributed coordination via Redis. Right now deduplication is per-process. A Redis-backed lock and result store would extend the same pattern across multiple server instances without changing the caller API.
Error isolation per waiter. Currently if the leader call throws, all waiters get the same exception. A future version could give each waiter its own retry budget instead of propagating the leader error.
Metrics hook. A simple callback surface for recording how many requests were coalesced per key per window. Useful for capacity planning without adding a metrics library dependency to the core.

If any of those would be useful in your stack, open an issue or PR on the repo.

This is part of the Hermes Agent Challenge, a sprint to build and ship practical agent infrastructure libraries. The goal is a library per day covering the gaps between LLM SDK calls and production-ready agent behavior. Each library is small, focused, and ships with a full test suite.
