DEV Community

Alex Spinov

Posted on • Originally published at blog.spinov.online

Your LLM JSON Got Cut Off. Don't Just Raise max_tokens

Your agent asked a model for a JSON array of records. The model started writing it, ran into the token cap, and stopped halfway through an object. Your code called json.loads on the whole response, caught a JSONDecodeError, and returned an empty list. Every record the model already finished, and that you already paid for, left with the one it didn't.

The reflex is to raise max_tokens and run the call again. That is the expensive wrong move, and the API told you so before you even opened the parser.

Here is the shape of a response that got cut off at the cap. It is synthetic, hand-built so you can see the cliff:

[{"id":1,"name":"Acme Corp","city":"Reno"},
 {"id":2,"name":"Globex","city":"Ogdenville"},
 {"id":3,"name":"Initech","city":"Austin"},
 {"id":4,"na

Three complete records. Then {"id":4,"na, and nothing. That is where the token limit fell. The fourth object is half a key with no value, and the closing ] never came. Run json.loads on that string and you do not get three records and a warning. You get an exception, and most code answers an exception with return []. All three finished records are now gone because the fourth one is a stump.

TL;DR

  • A truncated JSON response is not a parse failure to swallow. The finished records in front of the cut are valid, complete, and already billed.
  • The single most expensive line is except json.JSONDecodeError: return []. One unfinished tail object, and it throws away every whole record before it.
  • The fix is not "raise max_tokens and retry." That pays to regenerate the whole batch and just moves the cliff to a bigger payload. The API already signals the cut (Anthropic stop_reason="max_tokens", OpenAI finish_reason="length").
  • Salvage the complete objects with stdlib json.JSONDecoder().raw_decode, then ask only for the tail past the last good id. On the synthetic response above, naive reading recovers 0; salvage recovers 3 and hands you a resume cursor.

The signal you are ignoring

When a model stops because it ran out of room, the API does not hide it. It is right there in the response, and it has been the whole time.

Anthropic's Messages API sets a stop_reason. I checked their current docs on handling stop reasons before writing this. For the max_tokens value, the description is exact: "The response reached your max_tokens limit." And the recommended action, in their own words, is "Raise max_tokens or continue the response." Read that second clause again. The official documentation puts "continue the response" right next to "raise the limit," as a peer, not a footnote. The platform itself is telling you that bumping the number is one option, not the option.

OpenAI's Chat Completions exposes the same fact through a different field. Each choice carries a finish_reason, and the value you get when the output was cut short by the token limit is length. Same signal, same meaning: the model did not decide it was done, the budget decided for it.

So before you touch max_tokens, you already know two things for free. You know the response was truncated, because the API flagged it. And you know roughly where, because you have the partial text in hand. That is enough to do something far cheaper than starting over.
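
As a minimal sketch of that check, the function below looks for either flag on a response that has already been parsed into a plain dict. The helper name was_truncated and the hand-built dicts are illustrative assumptions; official SDKs typically hand back objects with attributes rather than dict keys, so adapt the lookups to whatever your client returns.

```python
def was_truncated(response: dict) -> bool:
    """True if the response body says the output hit the token limit.

    Assumes `response` is the parsed JSON body as a plain dict (hypothetical
    helper, not part of any SDK).
    """
    # Anthropic Messages API: a top-level stop_reason field.
    if response.get("stop_reason") == "max_tokens":
        return True
    # OpenAI Chat Completions: each choice carries its own finish_reason.
    return any(
        choice.get("finish_reason") == "length"
        for choice in response.get("choices", [])
    )


# Hand-built stand-ins for real response bodies (synthetic).
anthropic_cut = {"stop_reason": "max_tokens"}
openai_cut = {"choices": [{"finish_reason": "length"}]}
openai_done = {"choices": [{"finish_reason": "stop"}]}

print(was_truncated(anthropic_cut), was_truncated(openai_cut), was_truncated(openai_done))
```

The point of running this before the parser is ordering: a truncated flag routes the text to salvage instead of json.loads.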

Why "just raise the limit" loses twice

I reached for the limit knob first too. It is the obvious lever. The output got cut, so give it more room. I raised max_tokens, re-ran the call, and felt clever for about a day.

Then the batch got bigger and it cut off again, further down. Of course it did. The cap was never the disease. It was a symptom of asking one call to emit more records than fit in one budget. Raising the ceiling to fit today's request just means the next, slightly larger request hits the new ceiling. You are chasing a line you will never catch as long as the payload can grow.

And every chase costs money twice over. When you re-run the whole call, you pay again for the output tokens you already received the first time. Those three finished records in the example were already generated, already billed. Throwing them away and asking for the entire array again means buying records one through three a second time so you can finally reach record four. On a real extraction batch, the part you keep re-buying is the expensive part.
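
The double payment can be put in rough numbers. The sketch below is arithmetic only, and every figure in it is a hypothetical assumption chosen to make the sums easy (10 records wanted, 3 arrived whole, 20 output tokens per record, 70 output tokens billed on the cut-off call). It also assumes the re-run with a bigger cap actually finishes, and it ignores input tokens, which you pay in both strategies.

```python
def output_tokens_billed(strategy, total_records, whole_records,
                         tokens_per_record, first_call_tokens):
    """Output tokens billed for the truncated call plus one follow-up call."""
    if strategy == "rerun":
        # Regenerate the entire array, including records already received.
        follow_up = total_records * tokens_per_record
    elif strategy == "resume":
        # Generate only the records past the last whole one.
        follow_up = (total_records - whole_records) * tokens_per_record
    else:
        raise ValueError(strategy)
    return first_call_tokens + follow_up


# All values are hypothetical, for illustration only.
TOTAL, WHOLE, PER_RECORD, FIRST_CALL = 10, 3, 20, 70

rerun = output_tokens_billed("rerun", TOTAL, WHOLE, PER_RECORD, FIRST_CALL)
resume = output_tokens_billed("resume", TOTAL, WHOLE, PER_RECORD, FIRST_CALL)
print(rerun, resume, rerun - resume)  # 270 210 60
```

The difference is exactly whole_records * tokens_per_record: the records you already had in hand.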

This is where the cost angle and the reliability angle are the same angle. The reliable move and the cheap move point in the same direction: keep what is whole, ask only for what is missing.

The arithmetic, on numbers you can run

Here is a small script. One import, json, from the standard library. No network, no randomness, no clock, so its output is identical every time you run it. It takes the truncated response from the top of this post and reads it two ways. The fixture is synthetic, hand-built to isolate the mechanism, not a capture from any one job. I will come back to why that label matters.

"""Salvage complete records from a truncated LLM JSON output.

Deterministic, stdlib-only (json), no net / RNG / clock / subprocess / env.
Fixture is SYNTHETIC: a JSON array the model started emitting, cut off at the
token/length cap mid-way through the 4th object.
"""
import json

# --- INPUT: one LLM response, cut off at the token cap (last object incomplete)
RAW = (
    '[{"id":1,"name":"Acme Corp","city":"Reno"},'
    '{"id":2,"name":"Globex","city":"Ogdenville"},'
    '{"id":3,"name":"Initech","city":"Austin"},'
    '{"id":4,"na'  # <-- length cap hit here; rest of the response never arrived
)


def naive(raw):
    """The instinct: json.loads the whole thing. One bad tail loses the batch."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []  # everything you already paid for is gone


def salvage(raw):
    """Recover every COMPLETE top-level object via stdlib raw_decode.

    Returns (records, truncated, resume_after_id).
    raw_decode is string-safe: it parses one JSON value from a position and
    tells you where it ended, so braces inside string values don't fool it.
    """
    dec = json.JSONDecoder()
    start = raw.find("[")
    if start < 0:
        return [], False, None
    i = start + 1
    n = len(raw)
    out = []
    while i < n:
        while i < n and raw[i] in " \t\r\n,":
            i += 1
        if i >= n:
            break
        if raw[i] == "]":  # clean close: not truncated
            return out, False, (out[-1]["id"] if out else None)
        try:
            obj, end = dec.raw_decode(raw, i)
        except json.JSONDecodeError:
            # the cut lands here: stop, keep what is whole
            return out, True, (out[-1]["id"] if out else None)
        out.append(obj)
        i = end
    return out, True, (out[-1]["id"] if out else None)


naive_recovered = naive(RAW)
records, truncated, resume_after_id = salvage(RAW)

print("INPUT: 1 LLM response, cut off at the token cap (4th object incomplete)")
print("NAIVE  json.loads(whole)  -> recovered %d records" % len(naive_recovered))
print("FIX    salvage(raw_decode)-> recovered %d records, truncated=%s"
      % (len(records), truncated))
print("       resume_after_id = %s  (request only id > %s, do NOT re-run the call)"
      % (resume_after_id, resume_after_id))
print("NAIVE threw away %d complete records you already paid for"
      % (len(records) - len(naive_recovered)))

# --- asserts (silent on success) ---
assert len(naive_recovered) == 0
assert len(records) == 3
assert truncated is True
assert resume_after_id == 3
assert [r["id"] for r in records] == [1, 2, 3]

# --- CEILING (this is a floor, not a cure) ---
print("--- CEILING ---")
print("1) Salvage keeps only WHOLE objects; the partial 4th record is dropped "
      "(half a field is not data). Damage control, not the truncation cure.")
print("2) raw_decode assumes the prefix up to the cut is otherwise well-formed; "
      "if the model also emitted a structural error earlier, salvage stops "
      "there. This is not a general JSON repairer.")
print("3) The real fix is upstream: read the signal the API already gives you "
      "(OpenAI finish_reason='length' / Anthropic stop_reason='max_tokens') and "
      "request only id>resume_after_id. Raising max_tokens just moves the cliff "
      "and you pay to regenerate the whole batch. Synthetic fixture.")

Run it with python3 -I salvage_truncated_json.py and you get:

INPUT: 1 LLM response, cut off at the token cap (4th object incomplete)
NAIVE  json.loads(whole)  -> recovered 0 records
FIX    salvage(raw_decode)-> recovered 3 records, truncated=True
       resume_after_id = 3  (request only id > 3, do NOT re-run the call)
NAIVE threw away 3 complete records you already paid for
--- CEILING ---
1) Salvage keeps only WHOLE objects; the partial 4th record is dropped (half a field is not data). Damage control, not the truncation cure.
2) raw_decode assumes the prefix up to the cut is otherwise well-formed; if the model also emitted a structural error earlier, salvage stops there. This is not a general JSON repairer.
3) The real fix is upstream: read the signal the API already gives you (OpenAI finish_reason='length' / Anthropic stop_reason='max_tokens') and request only id>resume_after_id. Raising max_tokens just moves the cliff and you pay to regenerate the whole batch. Synthetic fixture.

Same string in. Zero records one way, three records the other. The only thing that changed is whether the reader gives up on the first exception or keeps the objects it already decoded.

How salvage actually reads it

The naive function is the one almost everyone ships. Try to parse the whole blob, and if it throws, return an empty list. It is honest about one thing and wrong about everything else: it correctly notices the response is not valid JSON, and then it concludes the response is worthless. A truncated array is not valid JSON. The three records inside it are still perfectly good.

The salvage function walks the array one top-level object at a time using json.JSONDecoder().raw_decode. That method is the quiet hero here. Unlike json.loads, which insists on consuming an entire well-formed document, raw_decode parses one JSON value starting at a position you give it and returns both the value and the index where it stopped. You move your cursor to that index, skip the comma and whitespace, and decode the next one. When raw_decode finally hits the stump at {"id":4,"na, it raises, and that is the signal to stop. You keep everything you decoded before the raise.
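
To see the cursor mechanics in isolation, here is a minimal sketch on a two-object string (the string TEXT is a made-up example, not the fixture from the script). Note that raw_decode does not skip leading whitespace or commas on its own, which is why the cursor has to be advanced by hand between values.

```python
import json

TEXT = '{"a":1}, {"b":2}'
dec = json.JSONDecoder()

# Parse one JSON value starting at index 0; get the value and where it ended.
first, end = dec.raw_decode(TEXT, 0)

# Advance past the separator manually before decoding the next value.
pos = end
while pos < len(TEXT) and TEXT[pos] in " ,":
    pos += 1

second, end2 = dec.raw_decode(TEXT, pos)

print(first, end)    # {'a': 1} 7
print(second, end2)  # {'b': 2} 16
```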

One detail that matters more than it looks: raw_decode is string-safe. A naive brace-counting scan would get confused by a { or ] sitting inside a string value like "city":"Reno]". The decoder understands JSON string grammar, so braces and brackets inside quoted values do not fool it. That is the difference between a parser and a regex, and it is the reason to lean on stdlib here instead of writing your own bracket matcher.
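
A small comparison makes the difference concrete. The brace counter below is a deliberately naive strawman written for this illustration, and the input string is synthetic: the city value contains a closing brace inside its quotes.

```python
import json

# First object is complete; its "city" value contains a '}' inside the string.
RAW = '{"id":1,"city":"Reno}"},{"id":2,"ci'


def brace_count_end(s, start):
    """Naive scan: return the index just past the brace that balances s[start]."""
    depth = 0
    for i in range(start, len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    return None


# The counter stops at the '}' inside the string, so its slice is not valid JSON.
naive_slice = RAW[:brace_count_end(RAW, 0)]
try:
    json.loads(naive_slice)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# raw_decode follows JSON string grammar and finds the real end of the object.
obj, end = json.JSONDecoder().raw_decode(RAW, 0)

print(repr(naive_slice), naive_ok)
print(obj, end)
```

The counter cuts the object off mid-string and produces something json.loads rejects, while raw_decode returns the whole first record.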

The function also returns two things beyond the records. truncated=True tells you the array did not close cleanly, so you know there is more to fetch. And resume_after_id=3 is the id of the last whole record, which is your cursor. You do not re-run the call. You issue a follow-up that says, in whatever shape your prompt takes, "continue the list starting after id 3." You buy record four onward, once.
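
One possible shape for that follow-up, as a sketch: turn the cursor into a prompt fragment, then merge the tail into what you already hold. The prompt wording and both helper names (build_resume_prompt, merge_by_id) are hypothetical; your own prompt will differ, and this assumes ids are unique and increasing as in the fixture.

```python
def build_resume_prompt(resume_after_id):
    """Build the follow-up instruction from the resume cursor (hypothetical wording)."""
    if resume_after_id is None:
        # Nothing whole was recovered, so there is no cursor to resume from.
        return "Return the full list as a JSON array."
    return (
        "Continue the same list as a JSON array. "
        f"Include only records with id > {resume_after_id}; "
        "do not repeat earlier records."
    )


def merge_by_id(first, tail):
    """Append tail records, skipping any id already present in the first batch."""
    seen = {r["id"] for r in first}
    return first + [r for r in tail if r["id"] not in seen]


salvaged = [{"id": 1}, {"id": 2}, {"id": 3}]
# Synthetic tail: the model repeated id 3 despite the instruction.
tail = [{"id": 3}, {"id": 4}]

print(build_resume_prompt(3))
print([r["id"] for r in merge_by_id(salvaged, tail)])  # [1, 2, 3, 4]
```

The merge step is defensive: a model can ignore "do not repeat", so deduplicating on id keeps an overlap from becoming duplicate rows.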

Where this stops working

I put the limits in the program's own output, because a fix that oversells itself is just a fancier kind of bug. The script prints three of them, and I mean all three.

First, salvage keeps only whole objects. The half-written fourth record, {"id":4,"na, is dropped on the floor, and it should be. Half a field is not data. So this is damage control, not a cure. You recover what was finished and you accept that the interrupted object is gone until you re-request it. If you ever find yourself trying to "guess" the rest of a truncated object, stop. That is how you invent data.

Second, raw_decode assumes the prefix up to the cut is otherwise well-formed. It trusts that records one through three are valid JSON, because in a truncation they almost always are: the model was emitting clean objects right up until the budget ran out. But if the model also made a structural mistake inside an earlier object, a missing quote in record two or an unquoted key, salvage will stop at that error and report a smaller, earlier cursor. That is the correct behavior, and it is also a warning: this is not a general JSON repairer. It will not fix malformed objects, only stop cleanly at the first one it cannot read.
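
This second limit is easy to reproduce. In the synthetic string below, the array closes cleanly but record two has an unquoted key. Decoding by hand with raw_decode shows the walk stopping at record two, which means the valid record three behind it is never reached.

```python
import json

# Synthetic input: complete array, but record 2 has an unquoted key (name:).
RAW = '[{"id":1,"name":"Acme Corp"},{"id":2,name:"Globex"},{"id":3,"name":"Initech"}]'

dec = json.JSONDecoder()

# Record 1 decodes normally, starting just past the opening bracket.
first, end = dec.raw_decode(RAW, 1)

# Skip the separator, then try record 2.
pos = end
while pos < len(RAW) and RAW[pos] in " ,":
    pos += 1

try:
    dec.raw_decode(RAW, pos)
    second_ok = True
except json.JSONDecodeError as exc:
    second_ok = False
    print("stopped at record 2:", exc.msg)

print(first, second_ok)
```

A salvage loop built on this would return one record and a cursor of 1, even though nothing was truncated. Treat an early stop like this as a malformed-output problem to investigate, not as a token-limit cut.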

Third, the fixture is synthetic, and the real fix lives upstream. The 0-versus-3 result is arithmetic on a hand-built string, not a measured recovery rate from any one job. The durable fix is the boring one: read the stop_reason or finish_reason the API already hands you, and when it says you were cut off, request only the tail past your resume cursor. Raising max_tokens blindly just moves the cliff and makes you pay to regenerate the batch. Your real cut-off point is wherever your extraction payloads happen to grow past the budget, which is a number only you can see.

This is not the fetch-tool truncation

If you have read my earlier posts, this might look like a cousin of a problem I have written about before, where an agent trusts a 200 OK and acts on a page that came back garbage or cut short. It is the opposite end of the same pipe. That post is about input: content the agent reads, a fetched page that arrived incomplete, where the fix is to mark the truncation inline so the model does not reason over a half-page as if it were whole.

This post is about output: the structured response your own model generates, cut off by the generation budget. The page-truncation case is something arriving broken from outside. This case is something you are producing, hitting a limit you set. Different cause, different cursor, different fix. Worth keeping straight, because the instinct in both is to silently accept the partial thing, and in both that instinct is what bites you.

This is not a crawl that crashed

There is a second neighbor worth separating. When a long scraper dies at row 12,000, you resume a multi-hour process from a checkpoint: the worker fell over, and you restart the job where it stopped. That is process-level crash recovery, and the resume key is a position in a long run.

What I am describing is smaller and sharper. One API response came back truncated in a single call. The resume_after_id is not a checkpoint for a dead process. It is a marker inside one model output that says which records are already whole, so your next request asks for the remainder instead of the entire array. Same word, "resume," two very different scopes. One restarts a job. The other continues a single response.

And to close the loop on a third look-alike: this also is not the "valid value that is quietly wrong" bug, where a row parses fine and still holds junk. Truncation is a structural problem. The response is incomplete JSON. That is a different failure from a complete object holding a bad field, and it wants a different tool. Here you are recovering structure. There you are validating meaning.

What I would actually do

The whole thing collapses to three moves, and only the middle one is code.

Read the signal. Check stop_reason or finish_reason on every structured response before you parse it. If the API says it truncated, believe it, and skip straight to salvage.

Salvage the whole records. Use raw_decode to keep every complete top-level object and stop cleanly at the cut. Never let a single unfinished tail object turn into return []. The records in front of the cliff are finished and paid for.

Resume the tail, do not re-run the call. Take the last good id and ask only for what follows. You stop paying twice for the same output, and you stop moving the cliff one max_tokens bump at a time.
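
Put together, the three moves can be sketched as one loop. Everything network-shaped here is a stand-in: fake_model_call and its canned replies are hypothetical, the cut_off flag plays the role of stop_reason == "max_tokens" or finish_reason == "length", and the fourth record ("Example Co") is invented for the demo. The salvage here is a compact variant of the one in the script above.

```python
import json

# Hypothetical canned replies keyed by resume cursor: (text, cut_off).
CANNED = {
    None: ('[{"id":1,"name":"Acme Corp"},{"id":2,"name":"Globex"},{"id":3,"na', True),
    2: ('[{"id":3,"name":"Initech"},{"id":4,"name":"Example Co"}]', False),
}


def fake_model_call(after_id):
    """Stand-in for a real API call; returns (response_text, cut_off)."""
    return CANNED[after_id]


def salvage(raw):
    """Return every complete top-level object in a possibly truncated array."""
    dec = json.JSONDecoder()
    i = raw.find("[") + 1  # 0 when there is no '[', which skips the loop
    out = []
    while 0 < i < len(raw):
        if raw[i] in " \t\r\n,":
            i += 1
            continue
        if raw[i] == "]":
            break
        try:
            obj, i = dec.raw_decode(raw, i)
        except json.JSONDecodeError:
            break  # the cut (or a malformed object) lands here
        out.append(obj)
    return out


def collect(max_calls=5):
    """Read the signal, salvage whole records, resume from the last good id."""
    records, cursor = [], None
    for _ in range(max_calls):
        text, cut_off = fake_model_call(cursor)
        whole = salvage(text)
        records.extend(whole)
        if not cut_off:
            return records
        if not whole:
            # No progress is possible with this cursor; do not loop forever.
            raise RuntimeError("cut off before one whole record; shrink the batch")
        cursor = records[-1]["id"]
    raise RuntimeError("still truncated after max_calls")


print([r["id"] for r in collect()])  # [1, 2, 3, 4]
```

The max_calls bound and the no-progress check are design choices for this sketch: without them, a response that is cut before its first whole record would resume from the same cursor indefinitely.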

For context on why this is my default rather than a thought experiment: I run production scrapers and agents, currently 2,190 runs across 32 actors, and the Trustpilot review scraper alone holds 962 of them. Large extraction outputs are exactly the ones that flirt with the output budget, and when they cut off they do it quietly: the job exits clean and the dataset is just short. I am not going to quote you a percentage of how often that happens, because for a long stretch I was not catching it as truncation at all. I was catching the JSONDecodeError, returning an empty list, and calling the batch a failure. The records were sitting right there in the string the whole time. A truncated batch is not a failed batch.


Written by Aleksei Spinov. I run production scrapers and agents, currently 2,190 runs across 32 actors. The code here is stdlib-only and was run and verified (python3 -I, identical output, asserts green) before publishing; the input is a synthetic fixture, labelled as such in the script. Drafted with an AI assistant, fact-checked and edited by me.

Follow for the next teardown from the run ledger, one fix at a time. Honest question for the comments: when your structured output gets cut off, do you re-run the call, raise the limit, or salvage and resume, and what made you settle on that? I read every reply.

Source: Anthropic, handling stop reasons in the Messages API.
