Mukunda Rao Katta

50 Users Ask the Same Question. Your Agent Makes 50 API Calls. Here's the Fix.

The product launched on a Tuesday. By Wednesday morning, the Anthropic usage dashboard showed 48x the expected API calls.

The feature: a sidebar that answered "What does this term mean?" when users hovered over highlighted words. Fifty users hovering over "amortization" at roughly the same time. Fifty separate calls to claude-sonnet-4-6, each asking for the same definition. Fifty identical responses that cost fifty times as much as one would have.

The fix is request coalescing: many callers with the same cache key get one underlying LLM call.


The Shape of the Fix

from llm_batch_coalesce import BatchCoalescer

coalescer = BatchCoalescer(ttl_seconds=5.0)

def get_definition(term: str) -> str:
    return coalescer.get_or_call(
        key=f"define:{term}",
        fn=lambda: call_llm(f"Define '{term}' in one sentence.")
    )

# 50 concurrent threads all calling get_definition("amortization")
# Result: 1 LLM call, 50 callers get the same response

The first caller triggers the LLM call. Every other caller with the same key waits for that same future to complete, then gets the same result. One call, N recipients.


What It Does NOT Do

llm-batch-coalesce does not implement caching across process restarts. It does not persist results to disk or Redis. It does not shard across multiple processes or machines.

It is an in-process single-flight guard. Within one process, within the TTL window, duplicate calls for the same key collapse to one. After the TTL expires or the process restarts, the next call goes through.

For persistent caching, pair it with tool-result-cache or your existing cache layer. This library handles the inflight deduplication problem: multiple concurrent callers asking for the same thing before any result is available.


Inside the Library

The core is a dict of pending futures, keyed by the cache key you provide:

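import threading
from concurrent.futures import Future
from typing import Callable, TypeVar

T = TypeVar("T")

class BatchCoalescer:
    def __init__(self, ttl_seconds: float = 5.0):
        self._pending: dict[str, Future] = {}
        self._lock = threading.Lock()
        self._ttl = ttl_seconds

    def get_or_call(self, key: str, fn: Callable[[], T]) -> T:
        # Decide under the lock whether this thread runs fn or waits.
        with self._lock:
            if key in self._pending:
                future = self._pending[key]
                is_caller = False
            else:
                future = Future()
                self._pending[key] = future
                is_caller = True

        if is_caller:
            try:
                future.set_result(fn())
            except BaseException as e:
                # Always resolve the future so waiters never block forever.
                future.set_exception(e)
            finally:
                # Keep the finished future for the TTL window, then evict it.
                timer = threading.Timer(self._ttl, self._evict, args=(key,))
                timer.daemon = True  # do not hold the process open at shutdown
                timer.start()

        return future.result()

    def _evict(self, key: str) -> None:
        with self._lock:
            self._pending.pop(key, None)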

The lock protects the dict. Only one thread becomes the "caller" — the one that created the future. All others wait on future.result().

Async variant: AsyncBatchCoalescer uses asyncio.Event instead of concurrent.futures.Future for use in async contexts. Both variants ship in the same package.
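To make the async idea concrete, here is a minimal sketch of how an Event-based coalescer could work. It is not the library's AsyncBatchCoalescer source: the names SketchAsyncCoalescer, _Entry, and demo_fan_in are hypothetical, and the sketch assumes every caller runs on the same event loop.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional, Tuple


class _Entry:
    """One in-flight (or recently finished) call for a single key."""

    def __init__(self) -> None:
        self.done = asyncio.Event()
        self.result: Any = None
        self.error: Optional[BaseException] = None


class SketchAsyncCoalescer:
    """Hypothetical stand-in, not the library's AsyncBatchCoalescer."""

    def __init__(self, ttl_seconds: float = 5.0) -> None:
        self._entries: Dict[str, _Entry] = {}
        self._ttl = ttl_seconds

    async def get_or_call(self, key: str, fn: Callable[[], Awaitable[Any]]) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            # There is no await between the lookup and the insert, so on a
            # single event loop no other task can become a second leader.
            entry = self._entries[key] = _Entry()
            try:
                entry.result = await fn()
            except BaseException as exc:  # includes cancellation of the leader
                entry.error = exc
            finally:
                entry.done.set()  # wake every waiter, success or failure
                loop = asyncio.get_running_loop()
                loop.call_later(self._ttl, self._entries.pop, key, None)
        else:
            await entry.done.wait()
        if entry.error is not None:
            raise entry.error
        return entry.result


async def demo_fan_in(n_callers: int = 50) -> Tuple[List[str], int]:
    """Run n_callers concurrent requests; return (results, underlying calls)."""
    calls = 0

    async def fake_llm() -> str:  # stand-in for a real async LLM call
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)
        return "definition"

    coalescer = SketchAsyncCoalescer(ttl_seconds=0.1)
    results = await asyncio.gather(
        *(coalescer.get_or_call("define:amortization", fake_llm) for _ in range(n_callers))
    )
    return list(results), calls
```

No lock is needed here because tasks only switch at await points, and the leader registers its entry before its first await. Calling asyncio.run(demo_fan_in()) runs all 50 tasks against one fake_llm invocation.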

The 22 tests cover threading (10 concurrent callers, verify 1 underlying call), async variant, TTL expiry (second wave after TTL goes through), exception propagation (all waiters get the exception), and key isolation (different keys get separate calls).
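Two of those cases can be sketched as follows. This is an illustration rather than the repository's actual test file: SketchCoalescer is a compact stand-in modeled on the class shown above, so the block runs without the package installed, and the helper names are hypothetical.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, List


class SketchCoalescer:
    """Compact stand-in modeled on the class above; not the shipped code."""

    def __init__(self, ttl_seconds: float = 5.0) -> None:
        self._pending: Dict[str, Future] = {}
        self._lock = threading.Lock()
        self._ttl = ttl_seconds

    def get_or_call(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            future = self._pending.get(key)
            is_caller = future is None
            if is_caller:
                future = self._pending[key] = Future()
        if is_caller:
            try:
                future.set_result(fn())
            except Exception as exc:
                future.set_exception(exc)
            finally:
                timer = threading.Timer(self._ttl, self._evict, args=(key,))
                timer.daemon = True
                timer.start()
        return future.result()

    def _evict(self, key: str) -> None:
        with self._lock:
            self._pending.pop(key, None)


def count_underlying_calls(n_callers: int = 10) -> int:
    """N concurrent callers on one key; return how often fn actually ran."""
    calls: List[int] = []

    def slow_fn() -> str:
        calls.append(1)
        time.sleep(0.05)  # simulate LLM latency so callers overlap
        return "ok"

    coalescer = SketchCoalescer()
    with ThreadPoolExecutor(max_workers=n_callers) as pool:
        results = list(pool.map(lambda _: coalescer.get_or_call("k", slow_fn), range(n_callers)))
    assert results == ["ok"] * n_callers
    return len(calls)


def count_propagated_errors(n_callers: int = 10) -> int:
    """N concurrent callers on a failing fn; return how many saw the error."""
    coalescer = SketchCoalescer()

    def boom() -> str:
        time.sleep(0.05)
        raise ValueError("upstream failure")

    def attempt(_: int) -> int:
        try:
            coalescer.get_or_call("k", boom)
        except ValueError:
            return 1
        return 0

    with ThreadPoolExecutor(max_workers=n_callers) as pool:
        return sum(pool.map(attempt, range(n_callers)))
```

One consequence worth noticing: because the finished future stays in the dict until the TTL fires, a failed call is also shared with anyone who arrives inside that window.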


When to Use It

Use it for stateless or idempotent LLM calls where the same input should always produce the same output (or where near-identical outputs from one call are acceptable for all concurrent waiters). Definitions. Classifications. Translations. Summarizations of fixed reference documents.

The key design decision: callers share the same result. Whatever the single underlying call returns, including any variation introduced by sampling temperature, is what all 50 callers receive. For most informational queries this is fine. For personalized responses, or anything where the output depends on caller identity, do not use coalescing.
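Since the key decides who shares a result, it helps to derive it from everything that can change the answer. The helper below is a hypothetical sketch and not part of llm-batch-coalesce; the name coalesce_key and its parameters are assumptions made for illustration.

```python
import hashlib
import json


def coalesce_key(task: str, prompt: str, model: str, **params) -> str:
    """Build a key from every input that can change the answer.

    params must be JSON-serializable (numbers, strings, lists, dicts).
    """
    payload = json.dumps(
        {"task": task, "prompt": prompt, "model": model, "params": params},
        sort_keys=True,  # keyword order must not change the key
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{task}:{digest}"


# Two callers asking for the same definition land on the same key...
shared_a = coalesce_key("define", "Define 'amortization' in one sentence.",
                        "example-model", temperature=0.0, max_tokens=64)
shared_b = coalesce_key("define", "Define 'amortization' in one sentence.",
                        "example-model", max_tokens=64, temperature=0.0)

# ...while anything caller-specific in the inputs yields separate keys.
personal_a = coalesce_key("greet", "Say hello.", "example-model", user_id="u1")
personal_b = coalesce_key("greet", "Say hello.", "example-model", user_id="u2")
```

If a request depends on who is asking, either put that identity into the key as above, which means those requests will rarely coalesce, or bypass the coalescer for them entirely.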

Skip it for single-threaded or low-concurrency environments. If you have one user at a time, there is nothing to coalesce and the lock overhead is pure cost.


Install

pip install git+https://github.com/MukundaKatta/llm-batch-coalesce
Then import whichever variant fits your runtime:

from llm_batch_coalesce import BatchCoalescer, AsyncBatchCoalescer

# Threading variant
sync_coalescer = BatchCoalescer(ttl_seconds=10.0)

# Async variant for async agents
async_coalescer = AsyncBatchCoalescer(ttl_seconds=10.0)

async def async_get_summary(doc_id: str) -> str:
    return await async_coalescer.get_or_call(
        key=f"summary:{doc_id}",
        fn=lambda: call_llm_async(f"Summarize document {doc_id}")
    )

Sibling Libraries

| Library | What it solves |
| --- | --- |
| tool-result-cache | LRU+TTL persistent cache for tool call results |
| tool-call-dedup | Session-scoped exact-duplicate detection |
| token-budget-pool | Shared budget across concurrent agents |
| llm-rate-limit-bucket | Rate limiting for concurrent API callers |
| agent-rate-fence | Per-key sliding-window rate limit |

The typical production setup: llm-batch-coalesce for inflight deduplication, tool-result-cache for post-call caching, and llm-rate-limit-bucket to cap burst rates when coalescing is not applicable.


What's Next

The main gap is distributed coalescing. Right now this works within one process. For a fleet of workers behind a load balancer, each process has its own coalescer and duplicates are still sent from different workers.

A Redis-backed DistributedBatchCoalescer that uses a distributed lock and pub/sub for result propagation would handle this. The interface would stay the same; the storage backend would change. That is a meaningful dependency to add, so it would be an optional extra, not the default.

Cache warming is another direction: after coalescing resolves, the result could be pushed to a local tool-result-cache so subsequent callers after TTL do not go back to the LLM at all. That would make the two libraries compose naturally without manual wiring.
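Until that exists, the wiring can be done by hand. The sketch below shows one possible shape: LocalTTLCache is a hypothetical stand-in and does not reflect tool-result-cache's real API, and get_or_call is any callable that takes a key and a zero-argument function, such as a coalescer's bound method.

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple


class LocalTTLCache:
    """Hypothetical stand-in for a result cache; not tool-result-cache's API."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._items: Dict[str, Tuple[float, Any]] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> Tuple[bool, Any]:
        """Return (hit, value); expired entries count as misses."""
        with self._lock:
            item = self._items.get(key)
            if item is None:
                return False, None
            expires_at, value = item
            if expires_at < time.monotonic():
                del self._items[key]
                return False, None
            return True, value

    def put(self, key: str, value: Any) -> None:
        with self._lock:
            self._items[key] = (time.monotonic() + self._ttl, value)


def cached_coalesced_call(
    cache: LocalTTLCache,
    get_or_call: Callable[[str, Callable[[], Any]], Any],
    key: str,
    fn: Callable[[], Any],
) -> Any:
    """Serve from the cache if possible, else coalesce and warm the cache."""
    hit, value = cache.get(key)
    if hit:
        return value
    value = get_or_call(key, fn)  # concurrent misses collapse to one call here
    cache.put(key, value)  # warm the cache for callers after the coalescing TTL
    return value


# Usage with a pass-through in place of a real coalescer:
llm_calls = []


def fake_llm() -> str:
    llm_calls.append(1)
    return "summary"


cache = LocalTTLCache(ttl_seconds=60.0)
first = cached_coalesced_call(cache, lambda key, fn: fn(), "summary:doc-1", fake_llm)
second = cached_coalesced_call(cache, lambda key, fn: fn(), "summary:doc-1", fake_llm)
```

With a real coalescer, every waiter writes the same value to the cache after the shared call returns. That is redundant but harmless, and it keeps the wrapper independent of how the coalescer is implemented.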


Built as part of the agent-stack family: composable Python primitives for production LLM agents.
