Awaliyatul Hikmah

Posted on Jun 10

When Prompt Batching Made My LLM App More Expensive

#ai #llm #optimization #programming

I was working on cost optimization for an LLM-based document translation
pipeline.

At that point, the LLM translation flow was still very direct: one extracted
text segment became one API call.

It worked, but it was not ideal for cost.

For a document with many text segments, the number of API calls grew linearly.
So the optimization idea was straightforward: batch multiple text segments into
one prompt.

In simpler terms:

Instead of sending one API call for every text segment, we group multiple
segments into one request. In theory, fewer API calls should mean lower cost
and faster processing.

That was the plan.

But in the first real benchmark, the "optimization" made the system more
expensive and much slower.

The Baseline

Every benchmark in this post used the same input:

File: sample_10p.pdf
Language pair: zh-TW -> en
Model: gpt-4.1-nano

Before batching, the system translated one segment per API call.

Metric	No batching
Segments	160
API calls	160
Input tokens	14,287
Output tokens	2,506
Estimated cost	$0.0024
Duration	30.4s

This was simple and predictable: 160 segments meant 160 API calls.

The downside was just as obvious: cost scaled with the number of calls, so reducing
the number of LLM calls was the first thing to try.

What I Tried First

The first implementation added prompt batching.

The idea was to group up to 20 text segments into one request using keyed JSON:

keyed_subset = {str(idx): text for idx, text in enumerate(masked_subset)}

kwargs = {
    "model": settings.OLLAMA_MODEL_NAME,
    "messages": [
        {"role": "system", "content": self._sys_batch},
        {"role": "user", "content": user_msg},
    ],
    "temperature": self._temperature,
    "response_format": {"type": "json_object"},
}
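The grouping step itself is not shown above. A minimal sketch of how it could look, assuming count-based chunks of 20 and a user message that is simply the keyed JSON; `chunk_segments` and `build_user_message` are hypothetical names, not taken from the real pipeline:

```python
import json
from typing import Iterator


def chunk_segments(segments: list[str], batch_size: int = 20) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size segments."""
    for start in range(0, len(segments), batch_size):
        yield segments[start:start + batch_size]


def build_user_message(subset: list[str]) -> str:
    """Serialize one batch as keyed JSON: {"0": "...", "1": "...", ...}."""
    keyed_subset = {str(idx): text for idx, text in enumerate(subset)}
    # ensure_ascii=False keeps non-ASCII source text readable in the prompt.
    return json.dumps(keyed_subset, ensure_ascii=False)


segments = [f"segment {n}" for n in range(45)]
batches = list(chunk_segments(segments))
print([len(b) for b in batches])  # [20, 20, 5]
print(build_user_message(batches[2][:2]))
```

Keys restart at 0 in every batch, which matches the `enumerate`-based keying above and keeps the expected ID set identical for all full batches.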

At first glance, the result looked better because API calls dropped from 160 to
107.

But the cost and latency got worse.

Metric	No batching	First batching
Segments	160	140
API calls	160	107
Input tokens	14,287	14,876
Output tokens	2,506	4,541
Estimated cost	$0.0024	$0.0033
Duration	30.4s	136.2s
Fallback rate	0%	71.43%

So batching reduced API calls by 33%, but increased cost by 37%.

This was the confusing part.

The dashboard said we had fewer API calls. But the final bill estimate was
higher, and the total processing time was more than 4x slower.
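For reference, the percentages can be recomputed directly from the two columns of the table. This is only arithmetic on the rounded figures shown there; `pct_change` is a helper name introduced for this sketch:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Figures copied from the benchmark table (cost is rounded to 4 decimals).
baseline = {"api_calls": 160, "cost_usd": 0.0024, "duration_s": 30.4}
batched = {"api_calls": 107, "cost_usd": 0.0033, "duration_s": 136.2}

calls_change = pct_change(baseline["api_calls"], batched["api_calls"])
cost_change = pct_change(baseline["cost_usd"], batched["cost_usd"])
slowdown = batched["duration_s"] / baseline["duration_s"]

print(f"API calls: {calls_change:+.1f}%")
print(f"Cost:      {cost_change:+.1f}%")
print(f"Duration:  {slowdown:.2f}x")
```

On these rounded inputs the result is roughly -33% calls, about +37% cost, and close to 4.5x the duration.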

So the question became: where did the extra cost come from?

What Went Wrong?

The batch size was 20.

With 140 segments, the system should only need:

140 / 20 = 7 batch calls

But 5 of those 7 batch calls failed validation.

When one ID was missing from the JSON response, the old fallback logic retried
the whole batch item by item:

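# Collect translations in order; stop at the first missing ID.
for i in range(len(subset)):
    key = str(i)
    if key in keyed_translations:
        translated_list.append(keyed_translations[key])
    else:
        mismatch_found = True
        break

# All-or-nothing: a single gap sends the whole batch to per-item fallback,
# discarding every translation that did come back.
if mismatch_found or len(translated_list) != len(subset):
    return self._fallback_per_item(texts, tracker)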

That means one missing translation could discard 19 successful translations and
retry all 20 segments.

The reconstructed call count matched the dashboard:

7 batch calls
5 failed batches x 20 per-item retries = 100 retry calls

Total API calls = 7 + 100 = 107

So 100 of 107 API calls were retries.

That was the real cost multiplier.
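The same reconstruction as a small call-count model. It is a simplification that assumes every failed batch is a full batch and that each failure triggers one per-item call per segment; `calls_all_or_nothing` is a name made up for this sketch:

```python
import math


def calls_all_or_nothing(n_segments: int, batch_size: int, failed_batches: int) -> int:
    """Batch calls plus one per-item retry for every segment in a failed batch."""
    batch_calls = math.ceil(n_segments / batch_size)
    retry_calls = failed_batches * batch_size
    return batch_calls + retry_calls


total = calls_all_or_nothing(n_segments=140, batch_size=20, failed_batches=5)
print(total)  # 107
```

With zero failed batches the same function returns 7, which is the call count batching was supposed to deliver.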

JSON Mode Was Not Enough

The first implementation used:

"response_format": {"type": "json_object"}

This only asked the model to return valid JSON.

It did not guarantee that all required IDs would be present.

The prompt said "do not skip any IDs", but prompt instructions are still
instructions. They are not structural enforcement.

In the logs, the missing IDs often appeared near the end of the batch:

ID 19 missing
ID 18 missing
ID 12 missing
ID 18 missing
ID 14 missing

That pattern was consistent with long structured outputs degrading near the
tail.
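A response can be valid JSON and still be incomplete, so the check has to happen on the keys. A minimal sketch of that validation, assuming the response is a flat object keyed by ID as in the first implementation; `find_missing_ids` is a hypothetical helper:

```python
import json


def find_missing_ids(raw_response: str, n_items: int) -> list[int]:
    """Return the expected IDs that are absent from a keyed JSON response."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        # Unparseable output: treat every ID as missing.
        return list(range(n_items))
    if not isinstance(payload, dict):
        return list(range(n_items))
    # An ID counts as present only if its value is a string.
    return [i for i in range(n_items) if not isinstance(payload.get(str(i)), str)]


# Valid JSON that skips ID 2: accepted by json_object mode, rejected here.
response = '{"0": "Hello", "1": "World", "3": "Done"}'
print(find_missing_ids(response, 4))  # [2]
```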

What I Changed Next

The fix had three parts.

First, for the OpenAI endpoint, the response format was changed from
json_object to strict json_schema.

keys = [str(i) for i in range(n_items)]

return {
    "type": "json_schema",
    "json_schema": {
        "name": "batch_translation",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "translations": {
                    "type": "object",
                    "properties": {
                        k: {"type": "string"} for k in keys
                    },
                    "required": keys,
                    "additionalProperties": False,
                }
            },
            "required": ["translations"],
            "additionalProperties": False,
        },
    },
}

Now every expected ID is listed as required.

For non-OpenAI endpoints, the system still uses best-effort json_object mode
because compatibility varies.
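One way to express that per-endpoint choice is a single function that returns either format. This is a sketch, not the pipeline's actual code: the `supports_strict_schema` flag and both function names are assumptions, and how the real system detects endpoint capability is not covered here.

```python
def strict_schema_format(n_items: int) -> dict:
    """Build the strict json_schema response format with every ID required."""
    keys = [str(i) for i in range(n_items)]
    translations = {
        "type": "object",
        "properties": {k: {"type": "string"} for k in keys},
        "required": keys,
        "additionalProperties": False,
    }
    schema = {
        "type": "object",
        "properties": {"translations": translations},
        "required": ["translations"],
        "additionalProperties": False,
    }
    return {
        "type": "json_schema",
        "json_schema": {"name": "batch_translation", "strict": True, "schema": schema},
    }


def response_format_for(n_items: int, supports_strict_schema: bool) -> dict:
    """Strict schema where supported, best-effort JSON mode everywhere else."""
    if supports_strict_schema:
        return strict_schema_format(n_items)
    return {"type": "json_object"}


print(response_format_for(3, True)["json_schema"]["schema"]["properties"]["translations"]["required"])
print(response_format_for(3, False))
```

Because the best-effort branch gives no structural guarantee, the missing-ID check still has to run on every response regardless of which format was requested.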

Second, fallback became partial.

Instead of retrying the whole batch, the code keeps successful translations and
only retries missing IDs:

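# Indices whose translation is still missing after the first batch call.
missing = [i for i, v in enumerate(translated) if v is None]

if missing:
    tracker.record_prompt_batch_fallback()

    # Several gaps: retry them together as one smaller batch.
    if len(missing) > 1:
        retry_result = self._request_batch_keyed(
            [masked_subset[i] for i in missing],
            context,
            tracker,
        )
        # Merge step assumed here (not shown in the original excerpt);
        # retry_result is taken to be a list aligned with `missing`.
        for i, value in zip(missing, retry_result):
            if value is not None:
                translated[i] = value

    # Last resort: translate whatever is still missing one item at a time.
    still_missing = [i for i, v in enumerate(translated) if v is None]
    for i in still_missing:
        translated[i] = self.translate(subset[i], tracker)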
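To see the effect on call counts without hitting an API, here is a self-contained sketch of the same idea with stand-in backends. Everything in it is illustrative: `translate_with_partial_retry`, `fake_batch`, and `fake_single` are hypothetical, and the fake backend simply drops the last ID on its first call.

```python
from typing import Callable, Optional


def translate_with_partial_retry(
    subset: list[str],
    request_batch: Callable[[list[str]], dict[str, str]],
    translate_one: Callable[[str], str],
) -> list[str]:
    """Keep what the batch returned; retry only the IDs that are missing."""
    translated: list[Optional[str]] = [None] * len(subset)

    def fill(indices: list[int]) -> None:
        # Re-key the sub-batch from 0 and map results back to original positions.
        result = request_batch([subset[i] for i in indices])
        for local_idx, original_idx in enumerate(indices):
            value = result.get(str(local_idx))
            if value is not None:
                translated[original_idx] = value

    fill(list(range(len(subset))))
    missing = [i for i, v in enumerate(translated) if v is None]
    if len(missing) > 1:
        fill(missing)  # one smaller batch for the gaps
    for i in [i for i, v in enumerate(translated) if v is None]:
        translated[i] = translate_one(subset[i])  # last resort: per item
    return [v for v in translated if v is not None]


calls = {"batch": 0, "single": 0}


def fake_batch(items: list[str]) -> dict[str, str]:
    calls["batch"] += 1
    out = {str(i): text.upper() for i, text in enumerate(items)}
    if calls["batch"] == 1:
        out.pop(str(len(items) - 1))  # simulate a dropped tail ID
    return out


def fake_single(text: str) -> str:
    calls["single"] += 1
    return text.upper()


result = translate_with_partial_retry(["a", "b", "c", "d"], fake_batch, fake_single)
print(result, calls)
```

In this toy run one dropped ID costs one extra call, where the all-or-nothing fallback would have re-sent all four items.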

Third, the batch request now sets max_tokens and checks truncation:

if choice.finish_reason == "length" and len(items) > 1:
    mid = len(items) // 2
    left = self._request_batch_keyed(items[:mid], context, tracker)
    right = self._request_batch_keyed(items[mid:], context, tracker)
    return left + right

So a truncated batch is split and retried as smaller batches instead of falling
straight into per-item fallback.
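The split-and-retry behaviour can be sketched in isolation like this. The request signature, which returns the translations together with a finish reason, is an assumption for the example, and `fake_request` is a stand-in that pretends any batch above 5 items hits the token limit:

```python
from typing import Callable

# Hypothetical request signature: returns (translations, finish_reason).
RequestFn = Callable[[list[str]], tuple[list[str], str]]


def request_with_split(items: list[str], request: RequestFn) -> list[str]:
    """Halve the batch and retry whenever the response was cut off."""
    translations, finish_reason = request(items)
    if finish_reason == "length" and len(items) > 1:
        mid = len(items) // 2
        left = request_with_split(items[:mid], request)
        right = request_with_split(items[mid:], request)
        return left + right  # order of the original batch is preserved
    return translations


sizes_sent: list[int] = []


def fake_request(items: list[str]) -> tuple[list[str], str]:
    # Stand-in backend: any batch above 5 items is reported as truncated.
    sizes_sent.append(len(items))
    if len(items) > 5:
        return [], "length"
    return [text.upper() for text in items], "stop"


out = request_with_split([f"s{i}" for i in range(20)], fake_request)
print(len(out), sizes_sent)
```

Halving keeps the number of extra calls logarithmic in the batch size per truncation, instead of jumping straight to one call per item.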

The Result

After the fix, the same benchmark was rerun.

Metric	First batching	Fixed batching	No batching
API calls	107	7	160
Fallback rate	71.43%	0.00%	0%
Input tokens	14,876	6,206	14,287
Output tokens	4,541	2,640	2,506
Estimated cost	$0.0033	$0.0017	$0.0024
Duration	136.2s	22.1s	30.4s
Processed segments	240	140	160

The fixed version finally achieved the original goal:

API calls dropped from 160 to 7
Estimated cost dropped from $0.0024 to $0.0017
Duration dropped from 30.4s to 22.1s
Fallback dropped to 0%

Takeaways

The lesson is simple: batching is not automatically cheaper.

If a batch response can fail partially, the fallback strategy matters as much
as the batching strategy.

For structured LLM workflows, these details are important:

Use schema enforcement when the endpoint supports it.
Do not rely only on prompt instructions for required fields.
Keep partial successes.
Retry only missing items.
Check finish_reason.
Measure real cost, not just API call count.
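As a small illustration of that last point, the three runs can be compared on cost per segment instead of call count. The numbers are copied from the tables in this post, with the first batching run counted at its 140 input segments; since the runs report different segment counts, normalizing per segment seems like the fairer comparison. `cost_per_1k_segments` is a helper name introduced here.

```python
# Figures copied from the benchmark tables (estimated cost, rounded).
runs = {
    "no batching": {"cost_usd": 0.0024, "segments": 160, "api_calls": 160},
    "first batching": {"cost_usd": 0.0033, "segments": 140, "api_calls": 107},
    "fixed batching": {"cost_usd": 0.0017, "segments": 140, "api_calls": 7},
}


def cost_per_1k_segments(run: dict) -> float:
    """Estimated cost in USD per 1,000 translated segments."""
    return run["cost_usd"] / run["segments"] * 1000


for name, run in runs.items():
    per_1k = cost_per_1k_segments(run)
    print(f"{name:15s} {run['api_calls']:4d} calls  ${per_1k:.4f} per 1,000 segments")
```

On these figures the first batching run comes out at roughly $0.024 per 1,000 segments against $0.015 for no batching and about $0.012 for the fixed version, so the ranking by cost is the opposite of the ranking by call count for the first two runs.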

In this case, the first optimization reduced requests but increased cost.

The real optimization was not just batching.

It was making the batch output reliable.

Top comments (12)

VoltageGPU • Jun 16

Interesting take on prompt batching—costs can definitely spiral if you're not careful with token counts and model pricing tiers. In my work with GPU-accelerated inference, I've seen how batch sizes affect both latency and resource utilization. Sometimes smaller batches with tighter token limits yield better cost/performance trade-offs, especially when using models with strict context windows.

Vic Chen • Jun 17

Great write-up. This matches what I've seen in agent-style LLM systems: batching can reduce request count while still increasing total cost once retries, partial failures, and larger-context pricing kick in. The real metric isn't calls per job, it's cost per successful task with acceptable latency and debuggability. Fewer API calls can be a very misleading KPI.

Max Quimby • Jun 10

The detail that jumped out at me is that 100 of your 107 calls were retries — batching wasn't the villain, the all-or-nothing fallback was. One missing key nuking 19 good translations is exactly the kind of thing that stays invisible until you reconcile call counts against the dashboard like you did.

We hit a near-identical trap batching extraction jobs. Strict json_schema cut our miss rate a lot (so good call there), but the change that helped most was making the fallback surgical — only re-request the missing keys instead of replaying the whole batch, since even with strict schemas you still drop the occasional item on long outputs.

Two things I'm curious about: did you find a batch size where the tail-degradation basically disappears for nano? We landed around 8–10 items, not 20. And are you bounding batches by estimated output tokens rather than segment count? Translation length varies enough that a fixed count of 20 can quietly push past the model's reliable range even when the segment count looks safe.

Awaliyatul Hikmah • Jun 11

I haven’t done a proper batch-size sweep for the nano model yet. On the file I tested, batch size 20 worked after strict schema + retrying only the missing IDs: 140 segments, 7 calls, 0 fallback. But I wouldn’t treat 20 as a generally "safe" number.

I still need to benchmark this with larger and more complex files, and your point about bounding batches by estimated output tokens may be useful there. In this run the batching was still count-based (PROMPT_BATCH_SIZE=20). I added max_tokens estimation and split-on-length as a safety net after the request, but I’m not yet packing batches by estimated output tokens before sending them.

20 short segments and 20 long segments are very different workloads, so that’s probably one of the next things I’d benchmark.

Thanks for the insight 🙌🏻

Alex Shev • Jun 11

This is a good reminder that batching is a workflow decision, not a default optimization. If the batch hides failure modes or forces extra context into every request, the cost curve can move the wrong way.

The useful pattern is to measure per task: what can be cached, what needs fresh reasoning, and what should run as a deterministic terminal step instead.

Awaliyatul Hikmah • Jun 11

That’s a good way to frame it. I also started with the assumption that fewer requests would mean lower cost, but the benchmark showed that the workflow around the batch matters just as much 😅

Alex Shev • Jun 11

Yes, exactly. Batching is only cheaper when the unit of work stays clean.

Once the batch creates extra parsing, retries, larger prompts, or harder debugging, the “fewer requests” metric stops being the real cost model. I like benchmarking it as a workflow, not as an API-call count.

Andrii Krugliak • Jun 11

I hit the same wall batching translation segments. The token cost of dragging shared context into every batch quietly beat the per-call savings, and it only showed once I logged cost per segment instead of per request. Measuring the wrong unit hid it for a week.

Mininglamp • Jun 11

Batching prompts sounds efficient in theory but falls apart when you factor in context window waste. Each batch item shares the system prompt tokens, so you're paying for that overhead N times whether you batch or not. The real savings come from reducing round trips at the orchestration layer, not cramming more into a single call. For agent workloads with branching logic, sequential calls with proper caching end up cheaper than trying to batch everything upfront.

Jasmine Park • Jun 11

The retry economics are the part that bit us with batching too. One segment failing or coming back malformed means you re-send the whole batch, so a 5 percent per-segment failure rate quietly turns into a much higher re-billed-token rate once you batch 20 of them together. We only saw it after stamping a correlation id per logical document and measuring cost per document instead of per API call: the per-call number looked great while the per-document number got worse. Batching is a real win when the failure rate is low and a cost trap when it is not, and most of us do not measure the failure rate at the batch granularity until the bill says so.

Nazar Boyko • Jun 12

The detail that the missing IDs clustered near the tail is the useful tell. That's the output-token budget running out mid-response, not the model "forgetting." One lever that pairs nicely with your max_tokens + split fix: group segments by length before batching instead of taking them in document order. A batch that mixes a two-word heading with a 200-word paragraph spends its budget on the long one and drops the short ones at the end, even though they were trivial.

FastAnchor_io • Jun 11

Great write-up! This is a common trap — batching helps throughput but can blow up token costs if individual prompts get padded. One thing that helped me: set a max_tokens per request and group prompts by expected response length. Thanks for sharing the real numbers!
