John

Posted on Jun 29 • Edited on Jul 11 • Originally published at hexisteme.github.io

Claude Code '400: no low surrogate in string' on every turn: repairing a permanently broken session transcript

#ai #claude #cli #tutorial

Originally published on hexisteme notes.

A Claude Code session that returns API Error: 400 ... not valid JSON: no low surrogate in string on every turn is poisoned by a lone UTF-16 surrogate (a code point in U+D800–U+DFFF) baked into its on-disk transcript, and the fix is to close that session, strip only those lone surrogates from the offending line of the .jsonl file while leaving real emoji untouched, re-serialize that one line, and then claude --resume.

The poison is already on disk and is precisely targetable: a normal emoji is a single Python code point (e.g. U+1F9ED) and can never fall inside the surrogate range U+D800–U+DFFF, so deleting only that range removes the broken half-character with zero collateral damage to valid text. A cheap C-level byte pre-filter (scan for the \ud escape or raw ED A0–BF bytes before doing any per-line json.loads) cut a 174-file transcript scan from 3.4s to 1.1s, making it cheap enough to run automatically on every session start.

Why every turn fails: the surrogate is in the transcript, not the network

Claude Code persists each session to a JSONL transcript on disk (one JSON object per line, under your projects directory). Every turn replays the accumulated history back to the API. So if a single byte sequence in that history is invalid, the API rejects every subsequent request with the same error, at the same byte offset — the session is permanently bricked, and reopening it doesn't help because the bad data is reloaded from the file.

The specific failure is 400 The request body is not valid JSON: no low surrogate in string: line 1 column N (the mirror-image variant is no high surrogate). Non-BMP characters — emoji like 🧭, some extended CJK ideographs — are encoded in UTF-16 as a pair of surrogate code units: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). When a large tool output is truncated by a length limit and the cut lands exactly between the two halves of a pair, one orphaned half survives. That lone surrogate gets written into the transcript, replayed on every turn, and the API's strict JSON parser refuses it.

The triggering pattern is mundane: dumping a big, emoji-heavy blob into the context — a worker log peppered with status emoji, a daily-report job's output, a verbose ingest run — right before a large body that gets truncated mid-pair. Content-heavy projects (anything generating a lot of natural-language or creative text with non-BMP characters) re-hit this per session, not once.

The safe repair: strip only U+D800–U+DFFF, leave real emoji intact

The key insight that makes the fix safe is a property of Python's str: a valid emoji is a single code point (🧭 is U+1F9ED), so it can never land in the surrogate range U+D800–U+DFFF. Anything you find in that range is, by definition, a broken half. So you can delete exactly those code points and every legitimate character — emoji included — is left byte-for-byte untouched. You are not "removing emoji"; you are removing the orphaned halves that should never have been on disk.

The manual version of the fix, when you don't have a script handy:

import json

# read the one offending line, parse it, walk every string,
# drop only lone surrogates, re-serialize compactly
obj = json.loads(line)

def strip(o):
    if isinstance(o, str):
        return "".join(c for c in o if not (0xD800 <= ord(c) <= 0xDFFF))
    if isinstance(o, list):
        return [strip(x) for x in o]
    if isinstance(o, dict):
        return {k: strip(v) for k, v in o.items()}
    return o

fixed = json.dumps(strip(obj), ensure_ascii=False, separators=(",", ":"))

Write fixed back as that single line, leaving every other line of the transcript exactly as it was. Always copy the file to a .bak first. Re-serializing only the broken line keeps the diff minimal and preserves the rest of the conversation verbatim. After the rewrite, claude --resume and the 400 is gone.

Close the session before you touch the file

This is the step people skip, and it silently undoes the repair. While a session is open, Claude Code holds the transcript and will overwrite your edited file from its in-memory state — your fix vanishes the moment the next turn flushes. You must close the target session first, or verify nothing holds the file:

lsof -- ~/.claude/projects//.jsonl

Empty output means it's safe to edit. A good repair tool checks this for you and refuses (skips) any transcript that is currently held open, so an automated pass can never corrupt a live session. The trade-off: the one session you most want to fix — the one throwing 400s right now — may be the one you have open, so a fully automatic pass can skip exactly that file. That's the case where you fall back to the manual close-then-fix once.

Make it cheap enough to auto-repair on every session start

Because a content-heavy project re-breaks per session, a one-time manual fix isn't durable — you want the repair to run automatically before you ever see the error. The natural place is a SessionStart hook that scans recently-modified transcripts and silently cleans the closed ones. The problem is that per-line json.loads over every transcript in your projects tree is too slow to run on every launch.

The fix is a cheap byte-level pre-filter that runs before any JSON parsing. A lone surrogate only persists to disk in two shapes, and both are detectable by scanning raw bytes:

The JSON escape \ud... (some serializers emit unpaired surrogates as a \uXXXX escape; valid UTF-8 text never produces a literal \ud on disk).
The raw three-byte UTF-8 surrogate encoding ED A0–BF (valid UTF-8 only allows ED followed by 80–9F, so ED + A0–BF is unambiguously a surrogate).

A file with neither signal is provably clean and skips the expensive parse entirely. In practice this matters: a 174-file scan dropped from 3.4s to 1.1s, cheap enough to run on every session start. Run it scoped to recent days only, quiet unless something was actually fixed, and skipping any open transcript via lsof. A representative one-shot invocation in a hook:

python3 fix-jsonl-surrogates.py --fix-all --recent 3 --quiet

## Why you can't prevent it upstream (honest limitation)

This is a repair pattern, not a prevention pattern, and that's a deliberate concession to where the truncation happens. The cut that orphans a surrogate occurs inside the harness's own truncation logic, between the model output and the transcript write. The user-facing hook lifecycle (PreToolUse, PostToolUse, Stop, SessionStart, and so on) fires around lifecycle events, not in the middle of serializing the request body — so there is no interception point that can stop the bad byte from being written in the first place. Making the recovery fast and automatic is the only lever you actually control.

The pre-filter approach also has a real gap: if you immediately --resume the very session that's broken, the auto-repair pass may find the file held open and skip it (correctly, to avoid clobbering live state), so coverage is not 100% — that's the one case needing a manual close-then-fix. And do not lean on /compact as a fix: if a lone surrogate survives into the summary, the 400 persists; if it works, it worked by luck, not by design. The only genuine frequency-reduction is behavioral: avoid dumping huge emoji-saturated blobs (verbose logs, status-emoji-heavy output) into the context wholesale, or ASCII-ify such logs at the source.

More notes at hexisteme.github.io/notes.