DEV Community

Awaliyatul Hikmah

When Prompt Batching Made My LLM App More Expensive

I was working on cost optimization for an LLM-based document translation
pipeline.

At that point, the LLM translation flow was still very direct: one extracted
text segment became one API call.

It worked, but it was not ideal for cost.

For a document with many text segments, the number of API calls grew linearly
with the segment count. So the optimization idea was straightforward: batch
multiple text segments into one prompt.

In simpler terms:

Instead of sending one API call for every text segment, we group multiple
segments into one request. In theory, fewer API calls should mean lower cost
and faster processing.

That was the plan.

But in the first real benchmark, the "optimization" made the system more
expensive and much slower.

The Baseline

The test used the same input file:

  • File: sample_10p.pdf
  • Language pair: zh-TW -> en
  • Model: gpt-4.1-nano

Before batching, the system translated one segment per API call.

Metric            No batching
Segments          160
API calls         160
Input tokens      14,287
Output tokens     2,506
Estimated cost    $0.0024
Duration          30.4s

This was simple and predictable: 160 segments meant 160 API calls.

The optimization target was just as obvious: to reduce cost, cutting the number
of LLM calls was the first thing to try.

What I Tried First

The first implementation added prompt batching.

The idea was to group up to 20 text segments into one request using keyed JSON:

keyed_subset = {str(idx): text for idx, text in enumerate(masked_subset)}

kwargs = {
    "model": settings.OLLAMA_MODEL_NAME,
    "messages": [
        {"role": "system", "content": self._sys_batch},
        {"role": "user", "content": user_msg},
    ],
    "temperature": self._temperature,
    "response_format": {"type": "json_object"},
}
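
The grouping step itself can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the helper names `chunk` and `build_user_message` are hypothetical, and serializing the keyed dict with `json.dumps` is an assumption about how the user message was built.

```python
import json
from typing import Iterator

BATCH_SIZE = 20  # batch size used in the article


def chunk(segments: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` segments."""
    for start in range(0, len(segments), size):
        yield segments[start:start + size]


def build_user_message(subset: list[str]) -> str:
    """Serialize one batch as keyed JSON: {"0": "...", "1": "...", ...}."""
    keyed = {str(idx): text for idx, text in enumerate(subset)}
    # ensure_ascii=False keeps zh-TW text readable instead of \u escapes.
    return json.dumps(keyed, ensure_ascii=False)


segments = [f"segment {i}" for i in range(140)]
batches = list(chunk(segments))
print(len(batches))  # 7
print(build_user_message(["你好", "謝謝"]))  # {"0": "你好", "1": "謝謝"}
```

Keying each segment by its position in the batch is what later makes it possible to tell exactly which translations came back and which did not.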

At first glance, the result looked better because API calls dropped from 160 to
107.

But the cost and latency got worse.

Metric            No batching    First batching
Segments          160            140
API calls         160            107
Input tokens      14,287         14,876
Output tokens     2,506          4,541
Estimated cost    $0.0024        $0.0033
Duration          30.4s          136.2s
Fallback rate     0%             71.43%

So batching reduced API calls by 33%, but increased cost by 37%.

This was the confusing part.

The dashboard said we had fewer API calls. But the final bill estimate was
higher, and the total processing time was more than 4x slower.
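
The percentages can be checked directly from the table. The sketch below uses only the rounded figures reported above, so the cost change is approximate; `pct_change` is a helper name introduced here for illustration.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


calls_change = pct_change(160, 107)       # API calls
cost_change = pct_change(0.0024, 0.0033)  # estimated cost (rounded inputs)
slowdown = 136.2 / 30.4                   # duration ratio

print(f"API calls: {calls_change:.1f}%")  # API calls: -33.1%
print(f"Cost: {cost_change:+.1f}%")       # Cost: +37.5%
print(f"Duration: {slowdown:.1f}x")       # Duration: 4.5x
```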

So the question became: where did the extra cost come from?

What Went Wrong?

The batch size was 20.

With 140 segments, the system should only need:

140 / 20 = 7 batch calls

But 5 of those 7 batch calls failed validation.

When one ID was missing from the JSON response, the old fallback logic retried
the whole batch item by item:

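translated_list = []
mismatch_found = False

# Collect translations in input order; stop at the first missing ID.
for i in range(len(subset)):
    key = str(i)
    if key in keyed_translations:
        translated_list.append(keyed_translations[key])
    else:
        mismatch_found = True
        break

# Any mismatch discards the whole batch and retries every item individually.
if mismatch_found or len(translated_list) != len(subset):
    return self._fallback_per_item(texts, tracker)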

That means one missing translation could discard 19 successful translations and
retry all 20 segments.

The reconstructed call count matched the dashboard:

7 batch calls
5 failed batches x 20 per-item retries = 100 retry calls

Total API calls = 7 + 100 = 107

So 100 of 107 API calls were retries.

That was the real cost multiplier.
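
The same reconstruction as a few lines of arithmetic, using only numbers from the benchmark (140 segments, batch size 20, 5 failed batches). It also reproduces the reported 71.43% fallback rate and shows how many segment translations were paid for in total once retries are counted.

```python
import math

segments = 140
batch_size = 20
failed_batches = 5

batch_calls = math.ceil(segments / batch_size)      # initial batch requests
retry_calls = failed_batches * batch_size           # one call per item in each failed batch
total_calls = batch_calls + retry_calls
fallback_rate = failed_batches / batch_calls * 100  # share of batches that fell back
processed_segments = segments + retry_calls         # segments translated, counting repeats

print(batch_calls, retry_calls, total_calls)  # 7 100 107
print(f"{fallback_rate:.2f}%")                # 71.43%
print(processed_segments)                     # 240
```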

JSON Mode Was Not Enough

The first implementation used:

"response_format": {"type": "json_object"}

This only asked the model to return valid JSON.

It did not guarantee that all required IDs would be present.

The prompt said "do not skip any IDs", but prompt instructions are still
instructions. They are not structural enforcement.

In the logs, the missing IDs often appeared near the end of the batch:

ID 19 missing
ID 18 missing
ID 12 missing
ID 18 missing
ID 14 missing

That pattern was consistent with long structured outputs degrading near the
tail.
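
Because JSON mode does not enforce which keys appear, the caller has to check for them. A minimal sketch of that check, assuming the response is a flat object keyed by position as in the first implementation; `find_missing_ids` is a hypothetical helper, not the pipeline's function.

```python
import json


def find_missing_ids(raw_response: str, n_items: int) -> list[int]:
    """Return the batch positions that have no usable translation."""
    data = json.loads(raw_response)
    if not isinstance(data, dict):
        # Valid JSON, but not the keyed object we asked for.
        return list(range(n_items))
    # A key counts as present only if its value is a string.
    return [i for i in range(n_items) if not isinstance(data.get(str(i)), str)]


# Simulated response for a 20-item batch where the model dropped the last ID.
raw = json.dumps({str(i): f"translation {i}" for i in range(19)})
print(find_missing_ids(raw, 20))  # [19]
```

Returning the list of missing positions, rather than a pass/fail flag, is what allows a fallback to target only those items.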

What I Changed Next

The fix had three parts.

First, for the OpenAI endpoint, the response format was changed from
json_object to strict json_schema.

keys = [str(i) for i in range(n_items)]

return {
    "type": "json_schema",
    "json_schema": {
        "name": "batch_translation",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "translations": {
                    "type": "object",
                    "properties": {
                        k: {"type": "string"} for k in keys
                    },
                    "required": keys,
                    "additionalProperties": False,
                }
            },
            "required": ["translations"],
            "additionalProperties": False,
        },
    },
}

Now every expected ID is listed as required.

For non-OpenAI endpoints, the system still uses best-effort json_object mode
because compatibility varies.
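
One way to express that split is a single function that picks the response format per endpoint. This is a sketch under assumptions: the function name and the `supports_strict_schema` flag are hypothetical, and how the pipeline actually detects endpoint capabilities is not shown in this post.

```python
def build_response_format(n_items: int, supports_strict_schema: bool) -> dict:
    """Pick strict json_schema when available, else best-effort JSON mode."""
    if not supports_strict_schema:
        # Only guarantees JSON syntax; missing IDs must be checked by the caller.
        return {"type": "json_object"}

    keys = [str(i) for i in range(n_items)]
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "batch_translation",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "translations": {
                        "type": "object",
                        "properties": {k: {"type": "string"} for k in keys},
                        "required": keys,
                        "additionalProperties": False,
                    }
                },
                "required": ["translations"],
                "additionalProperties": False,
            },
        },
    }


strict = build_response_format(3, supports_strict_schema=True)
print(strict["json_schema"]["schema"]["properties"]["translations"]["required"])
# ['0', '1', '2']
print(build_response_format(3, supports_strict_schema=False))
# {'type': 'json_object'}
```

Either way, the response still goes through the same missing-ID validation, so the weaker mode degrades to more fallbacks rather than to silent gaps.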

Second, fallback became partial.

Instead of retrying the whole batch, the code keeps successful translations and
only retries missing IDs:

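# Indices whose translation is still missing after the first batch call.
missing = [i for i, v in enumerate(translated) if v is None]

if missing:
    tracker.record_prompt_batch_fallback()

    if len(missing) > 1:
        # Retry only the missing items as one smaller batch.
        retry_result = self._request_batch_keyed(
            [masked_subset[i] for i in missing],
            context,
            tracker,
        )
        # Merge the retry back in (assumes the result is aligned with
        # `missing` and uses None for items that are still absent).
        for i, value in zip(missing, retry_result):
            if value is not None:
                translated[i] = value

    # Last resort: translate whatever is still missing one item at a time.
    still_missing = [i for i, v in enumerate(translated) if v is None]
    for i in still_missing:
        translated[i] = self.translate(subset[i], tracker)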
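
To see the effect on call count, here is a self-contained simulation of the same idea. Everything in it is illustrative: `translate_with_partial_fallback`, `fake_batch`, and `fake_single` are hypothetical stand-ins, and the fake batch endpoint simply drops the last two items of any 20-item batch.

```python
from typing import Callable, Optional

BatchFn = Callable[[list[str]], list[Optional[str]]]
SingleFn = Callable[[str], str]


def translate_with_partial_fallback(
    subset: list[str], request_batch: BatchFn, translate_one: SingleFn
) -> list[Optional[str]]:
    """Keep successful translations and retry only the missing positions."""
    translated = request_batch(subset)
    missing = [i for i, v in enumerate(translated) if v is None]

    if len(missing) > 1:
        # One smaller batch for just the missing items.
        retry = request_batch([subset[i] for i in missing])
        for i, value in zip(missing, retry):
            if value is not None:
                translated[i] = value

    # Per-item calls only for what is still missing.
    for i, value in enumerate(translated):
        if value is None:
            translated[i] = translate_one(subset[i])
    return translated


calls = {"batch": 0, "single": 0}


def fake_batch(items: list[str]) -> list[Optional[str]]:
    calls["batch"] += 1
    out: list[Optional[str]] = [f"EN:{t}" for t in items]
    if len(items) == 20:  # simulate tail degradation on full batches
        out[-1] = None
        out[-2] = None
    return out


def fake_single(text: str) -> str:
    calls["single"] += 1
    return f"EN:{text}"


result = translate_with_partial_fallback(
    [f"s{i}" for i in range(20)], fake_batch, fake_single
)
print(calls)  # {'batch': 2, 'single': 0}
```

In this simulated case the batch costs 2 calls, where the old all-or-nothing fallback would have spent 1 batch call plus 20 per-item retries.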

Third, the batch request now sets max_tokens and checks finish_reason for
truncation:

if choice.finish_reason == "length" and len(items) > 1:
    mid = len(items) // 2
    left = self._request_batch_keyed(items[:mid], context, tracker)
    right = self._request_batch_keyed(items[mid:], context, tracker)
    return left + right

So a truncated batch is split and retried as smaller batches instead of falling
straight into per-item fallback.
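
The halving behaviour can be demonstrated without a real endpoint. In this sketch, `request_with_split` and `fake_send` are hypothetical names, and the fake transport reports `finish_reason == "length"` for any batch larger than 8 items, a threshold chosen only for the demo.

```python
from typing import Callable

# Hypothetical transport: returns (translations, finish_reason) for one batch.
SendFn = Callable[[list[str]], tuple[list[str], str]]


def request_with_split(items: list[str], send: SendFn) -> list[str]:
    """Send a batch; if the output was truncated, split in half and retry."""
    translations, finish_reason = send(items)
    if finish_reason == "length" and len(items) > 1:
        mid = len(items) // 2
        left = request_with_split(items[:mid], send)
        right = request_with_split(items[mid:], send)
        return left + right  # order of the original batch is preserved
    # A single item that still hits the limit is returned as-is here;
    # a real caller would need separate handling for that case.
    return translations


batch_sizes_sent: list[int] = []


def fake_send(items: list[str]) -> tuple[list[str], str]:
    batch_sizes_sent.append(len(items))
    if len(items) > 8:  # pretend the output would exceed max_tokens
        return [], "length"
    return [f"EN:{t}" for t in items], "stop"


result = request_with_split([f"s{i}" for i in range(20)], fake_send)
print(batch_sizes_sent)  # [20, 10, 5, 5, 10, 5, 5]
print(len(result))       # 20
```

A truncated 20-item batch costs 7 calls here instead of 21, and the number of extra calls shrinks as the batch size gets closer to what fits in the output limit.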

The Result

After the fix, the same benchmark was rerun.

Metric                First batching    Fixed batching    No batching
API calls             107               7                 160
Fallback rate         71.43%            0.00%             0%
Input tokens          14,876            6,206             14,287
Output tokens         4,541             2,640             2,506
Estimated cost        $0.0033           $0.0017           $0.0024
Duration              136.2s            22.1s             30.4s
Processed segments    240               140               160

The fixed version finally achieved the original goal:

  • API calls dropped from 160 to 7
  • Estimated cost dropped from $0.0024 to $0.0017
  • Duration dropped from 30.4s to 22.1s
  • Fallback dropped to 0%

Takeaways

The lesson is simple: batching is not automatically cheaper.

If a batch response can fail partially, the fallback strategy matters as much
as the batching strategy.

For structured LLM workflows, these details are important:

  1. Use schema enforcement when the endpoint supports it.
  2. Do not rely only on prompt instructions for required fields.
  3. Keep partial successes.
  4. Retry only missing items.
  5. Check finish_reason.
  6. Measure real cost, not just API call count.
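
For the last point, a cost estimate from token counts is enough to catch this kind of regression. The sketch below assumes per-million-token prices of $0.10 for input and $0.40 for output. These prices are an assumption, not stated in this post, although they do reproduce the three estimates in the results table; check the current pricing for your model before relying on them.

```python
# Assumed prices in USD per 1M tokens (not taken from the article).
PRICE_INPUT_PER_M = 0.10
PRICE_OUTPUT_PER_M = 0.40


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one benchmark run."""
    return (
        input_tokens * PRICE_INPUT_PER_M + output_tokens * PRICE_OUTPUT_PER_M
    ) / 1_000_000


# Token counts from the benchmark tables: (input, output).
runs = {
    "No batching": (14_287, 2_506),
    "First batching": (14_876, 4_541),
    "Fixed batching": (6_206, 2_640),
}

for name, (tokens_in, tokens_out) in runs.items():
    print(f"{name}: ${estimate_cost(tokens_in, tokens_out):.4f}")
# No batching: $0.0024
# First batching: $0.0033
# Fixed batching: $0.0017
```

Under these assumed prices, most of the increase in the first batching run comes from output tokens, which nearly doubled while input tokens barely moved.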

In this case, the first optimization reduced requests but increased cost.

The real optimization was not just batching.

It was making the batch output reliable.
