50 users, same question, 50 API calls
Imagine a news summarization endpoint. A breaking story drops. Fifty users hit your endpoint within two seconds. All of them send the same article URL. Each request triggers an LLM call to generate a summary.
Naive implementation: 50 LLM calls, 50 sets of billed tokens, 50 round trips to the API, 50 chances for a 429 rate-limit hit.
You could cache the result. But caching has a race condition. The first request fires. While it is running, the next 49 requests also see a cache miss and fire their own LLM calls. By the time the first call finishes and writes to the cache, you have already made 50 calls.
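The race is easy to reproduce. The sketch below is illustrative only: `fake_llm`, `summarize_with_naive_cache`, and the example URL are made-up stand-ins, and the "LLM" is just a short sleep. It shows that a check-then-fill cache does not reduce the call count when all requests arrive before the first one finishes.

```python
import asyncio
from typing import Dict

cache: Dict[str, str] = {}
llm_calls = 0  # how many times the stand-in LLM was invoked


async def fake_llm(url: str) -> str:
    """Stand-in for a slow LLM call; not a real API client."""
    global llm_calls
    llm_calls += 1
    await asyncio.sleep(0.05)  # simulate network and generation latency
    return f"summary of {url}"


async def summarize_with_naive_cache(url: str) -> str:
    if url in cache:              # every concurrent request misses here...
        return cache[url]
    result = await fake_llm(url)  # ...so every one of them fires its own call
    cache[url] = result           # the write only happens after the await
    return result


async def main() -> int:
    url = "https://example.com/breaking-story"
    await asyncio.gather(*(summarize_with_naive_cache(url) for _ in range(50)))
    return llm_calls


calls_made = asyncio.run(main())
print(f"50 requests, {calls_made} LLM calls")
```

The cache only starts helping for requests that arrive after the first write, which is too late for a burst.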
Single-flight batching solves this at the concurrency level. The first caller fires the LLM request. Every subsequent caller with the same key waits on the same Future. When the first call completes, all 50 callers get the result. One call. One billing event. No race condition.
This is what llm-batch-coalesce does.
This is NOT the Anthropic Batch API
Before the code, a clarification. The Anthropic Message Batches API is a different thing. It lets you submit a batch of up to 100,000 requests asynchronously at a 50% discount. Most batches finish within an hour, and results are available within 24 hours at the latest. It is for offline workloads where latency does not matter.
llm-batch-coalesce is for real-time workloads. It deduplicates concurrent in-process calls that happen to share the same prompt. No async job queue. No webhook. Results come back within the normal LLM latency (seconds, not hours).
Think of it as call deduplication, not batching in the Anthropic sense.
The code
```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor

import anthropic
from llm_batch_coalesce import BatchCoalescer

client = anthropic.Anthropic()


# The key function determines when two calls are "the same"
def make_key(model: str, system: str, user_message: str) -> str:
    payload = f"{model}|{system}|{user_message}"
    return hashlib.sha256(payload.encode()).hexdigest()


# The actual LLM call - this only fires once per unique key
def run_llm(model: str, system: str, user_message: str) -> str:
    msg = client.messages.create(
        model=model,
        max_tokens=512,
        system=system,
        messages=[{"role": "user", "content": user_message}],
    )
    return msg.content[0].text


coalescer = BatchCoalescer(key_fn=make_key, call_fn=run_llm)


# --- Async example ---
async def handle_request_async(article_text: str) -> str:
    return await coalescer.async_get(
        model="claude-sonnet-4-6",
        system="Summarize this article in two sentences.",
        user_message=article_text,
    )


async def simulate_concurrent_users(article_text: str, num_users: int = 50):
    tasks = [handle_request_async(article_text) for _ in range(num_users)]
    results = await asyncio.gather(*tasks)
    # All results are the same string - one underlying call was made
    assert len(set(results)) == 1
    print(f"{num_users} users, 1 LLM call. Result: {results[0][:80]}...")


# asyncio.run(simulate_concurrent_users("Breaking: Scientists discover..."))

# --- Sync example (for non-async codebases) ---
sync_coalescer = BatchCoalescer(key_fn=make_key, call_fn=run_llm, mode="sync")


def handle_request_sync(article_text: str) -> str:
    return sync_coalescer.get(
        model="claude-sonnet-4-6",
        system="Summarize this article in two sentences.",
        user_message=article_text,
    )


def simulate_threaded_users(article_text: str, num_threads: int = 20):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(handle_request_sync, article_text) for _ in range(num_threads)]
        results = [f.result() for f in futures]
    assert len(set(results)) == 1
    print(f"{num_threads} threads, 1 LLM call. Result: {results[0][:80]}...")
```
The coalescer holds a dict of in-flight Futures keyed by the hash. When a call comes in, it checks if a matching Future already exists. If yes, it attaches and waits. If no, it creates a new Future, fires the LLM call, and resolves all waiters when it completes.
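To make that mechanism concrete, here is a minimal asyncio sketch of the single-flight pattern. It is not the library's source: the `SingleFlight` class, its `do` method, and `fake_llm` are names invented for illustration, and the real implementation may differ (for example in how it handles threads or cancellation).

```python
import asyncio
from typing import Awaitable, Callable, Dict


class SingleFlight:
    """Illustrative only: coalesces concurrent awaits that share a key."""

    def __init__(self) -> None:
        # key -> in-flight task (a Task is a Future subclass)
        self._in_flight: Dict[str, "asyncio.Future[str]"] = {}

    async def do(self, key: str, fn: Callable[[], Awaitable[str]]) -> str:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the work and register it
            # before yielding control, so later callers can find it.
            task = asyncio.ensure_future(fn())
            self._in_flight[key] = task
            # Forget the entry once the call settles (success or failure).
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared call.
        return await asyncio.shield(task)


calls = 0


async def fake_llm() -> str:
    """Stand-in for the real LLM call."""
    global calls
    calls += 1
    await asyncio.sleep(0.05)
    return "two-sentence summary"


async def main() -> list:
    flight = SingleFlight()
    return await asyncio.gather(*(flight.do("article-key", fake_llm) for _ in range(50)))


results = asyncio.run(main())
print(f"{len(results)} callers, {calls} underlying call")
```

Because the entry is removed as soon as the call settles, a later caller with the same key starts a fresh call. Deduplication only covers the window while the call is in flight.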
The key function is the contract
The key function is the most important part. It defines when two calls are considered identical. Get it wrong in either direction and you have a bug.
Too narrow: you use only the user message as the key, ignoring the system prompt. Two callers with the same user message but different system prompts get deduplicated even though they should produce different outputs. Bad.
Too broad: you include a timestamp in the key. Every call gets a unique key. No deduplication ever happens. Also bad.
The right default is to hash every input that affects the output: model name, system prompt, user messages, temperature, max tokens. If the model and sampling settings are fixed for every call in your service, hashing system + user message is usually enough.
```python
import hashlib


# Minimal key for fixed-model, fixed-settings usage
def simple_key(system: str, user_message: str) -> str:
    return hashlib.sha256(f"{system}|{user_message}".encode()).hexdigest()


# Full key for variable model + settings
def full_key(model: str, system: str, user_message: str, max_tokens: int, temperature: float) -> str:
    payload = f"{model}|{system}|{user_message}|{max_tokens}|{temperature}"
    return hashlib.sha256(payload.encode()).hexdigest()
```
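One caveat with joining fields on `|`: if a field can itself contain `|`, two different requests can produce the same payload and therefore the same key. A possible way around this, sketched below, is to serialize the fields as JSON before hashing so the field boundaries stay explicit. `canonical_key` is a hypothetical helper name, not part of the library, and `delimiter_key` just repeats the delimiter-joined construction for comparison.

```python
import hashlib
import json


def delimiter_key(system: str, user_message: str) -> str:
    # Same construction as the minimal key: fields joined with "|"
    return hashlib.sha256(f"{system}|{user_message}".encode()).hexdigest()


def canonical_key(**fields) -> str:
    # Hypothetical helper: JSON keeps field boundaries explicit, and
    # sort_keys makes the result independent of keyword order.
    payload = json.dumps(fields, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two different requests that the delimiter-joined key cannot tell apart
a = {"system": "Reply in JSON|YAML", "user_message": "hello"}
b = {"system": "Reply in JSON", "user_message": "YAML|hello"}

collides = delimiter_key(**a) == delimiter_key(**b)
distinct = canonical_key(**a) != canonical_key(**b)
print(f"delimiter keys collide: {collides}, canonical keys differ: {distinct}")
```

Whether this matters depends on your inputs. For user-supplied article text, a collision would mean one caller silently receives another caller's result, so the stricter key is the safer choice.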
What this does NOT do
It does not cache results across time. When the in-flight call finishes, the Future is removed. If a new caller comes in one second later with the same key, it fires a fresh LLM call. For cross-time caching, use tool-result-cache or llm-cache-mem.
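For a sense of what the cross-time layer adds, here is a hand-rolled TTL cache sketch. It is an assumption-laden stand-in, not the API of tool-result-cache or llm-cache-mem: `TTLCache`, `get_or_compute`, and the injectable `clock` are all invented for this example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Hypothetical minimal TTL cache, for illustration only."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the example can fake time
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        entry = self._entries.get(key)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            return entry[1]  # still fresh: no call needed
        value = compute()    # in a real service the coalesced call would go here
        self._entries[key] = (self._clock(), value)
        return value


fake_now = [0.0]
calls = []


def fake_summary() -> str:
    calls.append(1)
    return "summary"


cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.get_or_compute("k", fake_summary)  # miss: computes
fake_now[0] = 1.0
cache.get_or_compute("k", fake_summary)  # one second later: served from cache
fake_now[0] = 120.0
cache.get_or_compute("k", fake_summary)  # after expiry: computes again
print(f"3 lookups, {len(calls)} computations")
```

The two layers complement each other: the cache covers repeats over time, and the coalescer covers the burst that arrives while the cache is still empty.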
It does not work across processes. The coalescer state lives in memory in a single process. If you have 10 gunicorn workers, each worker has its own coalescer. Call deduplication happens within each worker, not across the fleet. For fleet-level deduplication, you need a shared cache (Redis, Postgres) with a distributed lock.
It does not handle errors gracefully by default. If the in-flight LLM call raises an exception, every waiting caller gets the same exception. If you need retries today, put them inside your call function; the coalescer itself propagates to all.
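One way to soften this without touching the coalescer is to wrap the call function so transient failures are retried before any exception reaches the waiters. The sketch below is a generic wrapper under stated assumptions: `with_retries` and its parameters are hypothetical names, the backoff policy is arbitrary, and `flaky_llm` is a stand-in that fails twice before succeeding.

```python
import time
from typing import Callable, Tuple, Type


def with_retries(
    call_fn: Callable[..., str],
    attempts: int = 3,  # must be >= 1
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., str]:
    """Return a version of call_fn that retries with exponential backoff."""

    def wrapped(*args, **kwargs) -> str:
        for attempt in range(1, attempts + 1):
            try:
                return call_fn(*args, **kwargs)
            except retry_on:
                if attempt == attempts:
                    raise  # out of attempts: the error propagates to every waiter
                sleep(base_delay * 2 ** (attempt - 1))
        raise ValueError("attempts must be at least 1")

    return wrapped


failures_left = 2


def flaky_llm(prompt: str) -> str:
    """Stand-in LLM call that fails twice, then succeeds."""
    global failures_left
    if failures_left > 0:
        failures_left -= 1
        raise ConnectionError("transient upstream error")
    return f"summary of {prompt}"


delays = []  # record the backoff delays instead of actually sleeping
resilient = with_retries(flaky_llm, attempts=3, retry_on=(ConnectionError,), sleep=delays.append)
result = resilient("article text")
print(result, delays)
```

In the article's setup you would pass `with_retries(run_llm)` as `call_fn`, ideally with `retry_on` narrowed to the transient error types your client raises. Because the retry happens inside the single in-flight call, the waiters still share one result.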
It does not apply to streaming. If you use streaming LLM calls (where the response comes back in chunks), single-flight batching does not help: a Future resolves once, with the complete result, so waiters would receive nothing until the stream finishes. Streaming callers each need their own call.
When to use this
Use it when the same prompt hits your LLM endpoint many times in a short window due to multiple users or services querying the same underlying question.
Good fits:
- Public-facing summarization or Q&A where many users ask the same thing
- Scheduled jobs that fan out to many workers, all needing the same context lookup
- Webhook handlers where duplicate events arrive in bursts
- Any "thundering herd" pattern where cache invalidation triggers many simultaneous lookups
Not a good fit:
- Personalized responses where the prompt changes per user
- Prompts that include user-specific context, session history, or real-time data
- Low-traffic endpoints where deduplication saves you less than one call per hour
Install and quick-start
```shell
pip install llm-batch-coalesce
```
Zero runtime dependencies. The coalescer is LLM-agnostic: you pass in a call function, and that function can call Anthropic, OpenAI, a local model, anything.
```shell
pip install llm-batch-coalesce anthropic

# Set your API key
export ANTHROPIC_API_KEY=your-key-here

# Run the async example
python -c "
import asyncio
from llm_batch_coalesce import BatchCoalescer
import anthropic

client = anthropic.Anthropic()

def my_llm(prompt: str) -> str:
    msg = client.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=100,
        messages=[{'role': 'user', 'content': prompt}]
    )
    return msg.content[0].text

c = BatchCoalescer(key_fn=lambda p: p, call_fn=my_llm)
result = asyncio.run(c.async_get('Hello'))
print(result)
"
```
Sibling libraries in the agent stack
| Library | What it does |
|---|---|
| llm-batch-coalesce | Deduplicate concurrent in-process LLM calls by key |
| anthropic-batch-kit | Anthropic Message Batches API helper (async, discounted) |
| tool-result-cache | LRU+TTL cache for tool call results across time |
| llm-message-hash-py | Canonical hash of LLM request payloads |
| llm-cost-cap | Pre-flight USD cost gate before firing an LLM call |
| llm-cache-mem | In-process LRU cache for completed LLM responses |
What is next
Three things that would make llm-batch-coalesce more useful in production:
First, metrics integration. A counter for deduplication events. How many calls were coalesced into one? What is the average wait time for queued callers? Right now you have no visibility into how much work the coalescer is saving.
Second, configurable error behavior. Right now, exceptions propagate to all waiters. A retry-on-error option would let the coalescer retry the LLM call before notifying waiters, so a transient API error does not cascade to all queued callers.
Third, partial streaming support. When the LLM supports streaming, the first caller gets the stream directly. Subsequent callers with the same key attach to a broadcast stream. This is significantly more complex because you need to tee the stream, but it would allow streaming responses with deduplication.
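Until metrics land in the library, the first item can be approximated from the outside by counting at both ends: requests entering your handler and calls reaching the call function. The sketch below assumes nothing about the library's internals. `CoalesceStats` and its methods are hypothetical names, and the `handler` is a stand-in that mimics coalescing so the example runs on its own.

```python
import threading
from typing import Callable, Dict


class CoalesceStats:
    """Hypothetical counters; not something llm-batch-coalesce exposes today."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.requests = 0        # callers that asked for a result
        self.upstream_calls = 0  # times the wrapped call function actually ran

    def track_request(self, handler: Callable[..., str]) -> Callable[..., str]:
        def wrapped(*args, **kwargs) -> str:
            with self._lock:
                self.requests += 1
            return handler(*args, **kwargs)
        return wrapped

    def track_call(self, call_fn: Callable[..., str]) -> Callable[..., str]:
        def wrapped(*args, **kwargs) -> str:
            with self._lock:
                self.upstream_calls += 1
            return call_fn(*args, **kwargs)
        return wrapped

    @property
    def coalesced(self) -> int:
        # Requests that were answered without their own upstream call
        return self.requests - self.upstream_calls


stats = CoalesceStats()
counted_llm = stats.track_call(lambda prompt: f"summary of {prompt}")

_shared: Dict[str, str] = {}


def handler(prompt: str) -> str:
    # Stand-in for a coalesced lookup: only the first request reaches the call function
    if prompt not in _shared:
        _shared[prompt] = counted_llm(prompt)
    return _shared[prompt]


counted_handler = stats.track_request(handler)
for _ in range(50):
    counted_handler("article text")

print(f"{stats.requests} requests, {stats.upstream_calls} upstream, {stats.coalesced} coalesced")
```

In the article's setup, the equivalent would be wrapping `run_llm` with `track_call` before handing it to the coalescer and wrapping `handle_request_sync` with `track_request`. Wait-time measurements would still need support inside the coalescer.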
For high-traffic endpoints that serve repeated LLM queries, single-flight batching is one of the cheapest wins available: one line to set up the coalescer, and the rest of your code stays the same.
GitHub: MukundaKatta/llm-batch-coalesce
PyPI: pip install llm-batch-coalesce