Build a Bulletproof Claude Code JSONL Parser


Learn 3 battle-tested patterns for parsing Claude Code's JSONL session files under ~/.claude/projects/, including tolerant whitelists, versioned derived data, and toWellFormed() string normalization.

Build a Bulletproof Claude Code JSONL Parser: 3 Patterns That Survived a Dozen CLI Releases

Every conversation you have with Claude Code is written to disk as JSONL, under ~/.claude/projects/. Your decisions, your dead ends, the bug hunt that took three sessions: it is all there. You have probably never opened one.

The catch: the format is an internal implementation detail. No documentation, no version field, no stability guarantee. The schema can change with any CLI update, and at the current pace updates ship almost daily. But with the right patterns, you can build tools that survive those changes and even learn from them.

Here are three patterns from a developer who built a read-only replay and search tool on top of these files and kept it alive through a dozen CLI releases.

Pattern 1: Tolerant Line Parsing with an Explicit Whitelist

The naive loop (JSON.parse each line, switch on type) works on day one. The question is what happens when a CLI update introduces a type nobody has seen before. This is not hypothetical.

The approach that holds up: keep an explicit whitelist of known types, and treat everything outside it as "parse failed, but preserved":

const KNOWN_MESSAGE_TYPES = new Set([
  'user', 'assistant', 'system',
  'queue-operation', 'last-prompt',
  'progress', 'attachment', 'file-history-snapshot',
  'permission-mode', 'custom-title', 'ai-title',
  'agent-name', 'pr-link',
])

function parseLine(line: string): ParsedLine | null {
  if (!line.trim()) return null

  let obj: Record<string, unknown>
  try {
    obj = JSON.parse(line)
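  } catch {
    return null // malformed line: skip, never throw
  }
  // JSON.parse also accepts bare values such as null, which would throw on
  // obj.type below; skip anything that is not a plain object
  if (typeof obj !== 'object' || obj === null || Array.isArray(obj)) return null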

  const type = typeof obj.type === 'string' ? obj.type : 'unknown'
  const parseFailed = !KNOWN_MESSAGE_TYPES.has(type)
  // unknown type → keep the raw JSON string for later re-parse
  // known type → extracted fields are enough, raw can be dropped
  // ... (field extraction and return value elided)
}

The whitelist does double duty as a storage policy. For known types, the extracted columns are sufficient and the raw JSON can be discarded; that alone reclaims most of the disk space. For unknown types, the raw line goes into an archive table untouched. When a future version of the parser learns the new shape, the evidence is still there.
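To make the storage policy concrete, here is a minimal, self-contained sketch of a completed parse-and-route step. The ParsedLine shape, the abbreviated whitelist, and the ingest helper are assumptions for illustration; the real field extraction and archive table are not shown in this article.

```typescript
// Hypothetical result shape: the real ParsedLine fields are not shown in the article.
type ParsedLine = {
  type: string
  parseFailed: boolean
  raw: string | null // kept only when the type is not on the whitelist
}

// Abbreviated whitelist, for illustration only.
const KNOWN_TYPES = new Set(['user', 'assistant', 'system'])

function parseLineSketch(line: string): ParsedLine | null {
  if (!line.trim()) return null

  let obj: unknown
  try {
    obj = JSON.parse(line)
  } catch {
    return null // malformed line: skip, never throw
  }
  // Valid JSON that is not a plain object (null, numbers, arrays) is skipped too.
  if (typeof obj !== 'object' || obj === null || Array.isArray(obj)) return null

  const rawType = (obj as Record<string, unknown>).type
  const type = typeof rawType === 'string' ? rawType : 'unknown'
  const parseFailed = !KNOWN_TYPES.has(type)
  // Storage policy: unknown types keep the raw line, known types drop it.
  return { type, parseFailed, raw: parseFailed ? line : null }
}

// Split file contents into lines and collect raw lines destined for the archive.
function ingest(contents: string): { rows: ParsedLine[]; archive: string[] } {
  const rows: ParsedLine[] = []
  const archive: string[] = []
  for (const line of contents.split('\n')) {
    const parsed = parseLineSketch(line)
    if (!parsed) continue
    rows.push(parsed)
    if (parsed.raw !== null) archive.push(parsed.raw)
  }
  return { rows, archive }
}
```

In this sketch a line with an unseen type still produces a row, so counts and ordering stay intact, while its untouched raw text waits in the archive for a later parser version.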

One more detail that pays off: cap the length of identifier fields (uuid, requestId) at something sane like 128 chars before trusting them, so that a corrupted line cannot push an oversized value into a key column or index.
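A length cap can be a few lines. The sketch below is one possible reading of "cap": it rejects oversized identifiers instead of truncating them, on the assumption that a truncated ID could collide with a real one. The safeId name and the null-on-reject behavior are hypothetical.

```typescript
const MAX_ID_LENGTH = 128 // the cap suggested above; adjust as needed

// Returns the identifier only if it is a non-empty string within the cap.
// Anything else (wrong type, empty, oversized) becomes null so the caller
// can store the row without a trusted ID.
function safeId(value: unknown): string | null {
  if (typeof value !== 'string') return null
  if (value.length === 0 || value.length > MAX_ID_LENGTH) return null
  return value
}
```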

Pattern 2: Version Your Derived Data, Not Just Your Schema

Preserving unknowns only matters if you can act on them later. The mechanism is a SUMMARY_VERSION integer stored per session. When the parser learns new tricks, bump the version; the indexer sees stale versions and re-parses those sessions automatically on the next sync.

This turns "the schema changed again" from a migration crisis into a routine: extend the parser, bump the version, let the backfill run. No manual steps, no data loss, no "please delete your index and start over" release notes.
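The backfill loop can be sketched roughly as follows. Only SUMMARY_VERSION comes from the text; the SessionRow shape, the version number 3, and the reparse callback are placeholders, and a real indexer would read and write these rows in its own store.

```typescript
const SUMMARY_VERSION = 3 // bump whenever the parser learns to extract something new

// Hypothetical index row: only the stored version matters for this sketch.
type SessionRow = { sessionId: string; summaryVersion: number }

// Sessions whose derived data came from an older parser need a re-parse.
function staleSessions(rows: SessionRow[]): string[] {
  return rows.filter((r) => r.summaryVersion < SUMMARY_VERSION).map((r) => r.sessionId)
}

// One sync pass: re-parse each stale session and stamp it with the current version.
function sync(rows: SessionRow[], reparse: (sessionId: string) => void): SessionRow[] {
  const stale = new Set(staleSessions(rows))
  return rows.map((r) => {
    if (!stale.has(r.sessionId)) return r
    reparse(r.sessionId)
    return { ...r, summaryVersion: SUMMARY_VERSION }
  })
}
```

Because the version is stamped only after a session is re-parsed, an interrupted sync simply leaves the remaining sessions stale for the next run.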

War Story 1: The Lone Surrogates

One day the indexer started producing strings that crashed downstream consumers. The cause: some JSONL lines contained unpaired UTF-16 surrogates. Half an emoji, lurking in a tool-error message.

How does half an emoji end up on disk? Older Claude Code versions (up to around 2.1.132) truncated long tool outputs at a fixed length, and the cut sometimes landed mid-emoji, between the two UTF-16 code units of a surrogate pair. JSON.stringify happily writes the lone surrogate as a \udXXX escape, the file looks like clean ASCII, and JSON.parse faithfully reconstructs the broken string at read time.
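The round trip is easy to reproduce. This sketch assumes the truncation behaved like a plain string slice, which counts UTF-16 code units; the sample text and the hasLoneSurrogate helper are made up for the demonstration.

```typescript
// U+1F600 occupies two UTF-16 code units (a surrogate pair) at indexes 7 and 8.
const original = 'failed \u{1F600}'
const truncated = original.slice(0, 8) // keeps only the first half of the pair

// On disk the lone surrogate becomes an escape sequence, so the file is pure ASCII.
const onDisk = JSON.stringify({ error: truncated })

// Reading it back reconstructs the broken string exactly.
const roundTripped = (JSON.parse(onDisk) as { error: string }).error

// Detects any surrogate that is not part of a valid high+low pair.
function hasLoneSurrogate(s: string): boolean {
  const withoutPairs = s.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, '')
  return /[\uD800-\uDFFF]/.test(withoutPairs)
}
```

Nothing in this chain raises an error, which is why the damage tends to surface only later, in whatever consumer is strict about string encoding.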

The fix is one line, if it lands in the right place:

// at the parser's exit boundary, applied to every extracted string
export function ensureWellFormed(s: string): string {
  return s.toWellFormed() // lone surrogates → U+FFFD
}

The placement is the actual lesson. Normalize once, at ingestion, and every consumer downstream (search index, renderer, Markdown export) gets to assume well-formed strings forever. (String.prototype.toWellFormed() needs Node.js 20+.)
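One way to apply the normalization to "every extracted string" is to walk the parsed value once, right after JSON.parse. The sketch below is an assumption about how that could look, not the tool's actual code; it also falls back to a regex on runtimes that lack toWellFormed(). The names wellFormed and normalizeDeep are hypothetical.

```typescript
// Matches either a valid surrogate pair or a single stray surrogate.
const SURROGATES = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g

// Uses the native method when the runtime provides it; otherwise replaces
// each stray surrogate with U+FFFD and leaves valid pairs untouched.
function wellFormed(s: string): string {
  const native = (s as unknown as { toWellFormed?: () => string }).toWellFormed
  if (typeof native === 'function') return native.call(s)
  return s.replace(SURROGATES, (m) => (m.length === 2 ? m : '\uFFFD'))
}

// Normalizes every string value in a parsed JSON tree.
// Objects are updated in place; arrays are rebuilt. Keys are left as-is.
function normalizeDeep(value: unknown): unknown {
  if (typeof value === 'string') return wellFormed(value)
  if (Array.isArray(value)) return value.map((item) => normalizeDeep(item))
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    for (const key of Object.keys(obj)) obj[key] = normalizeDeep(obj[key])
    return obj
  }
  return value
}
```

Calling normalizeDeep(JSON.parse(line)) before any field extraction keeps the guarantee in one place, so later extraction code never needs its own checks.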

War Story 2: The Tokens That Counted Themselves Twice

The tool's token dashboard once reported usage numbers roughly 2.3× higher than reality. The cause: one API response can become several JSONL lines. When a response contains multiple content blocks (text plus tool calls), Claude Code writes one assistant entry per block, and each entry carries a copy of the full token count for the whole response. If you naively sum token counts per line, every response is counted once per block, so a three-block response is counted three times.

The fix: only count tokens from the first assistant line in a response, or better, deduplicate by requestId before summing.
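Deduplicating by requestId can be sketched as below. The AssistantLine shape and the outputTokens field name are placeholders for whatever the parser extracts; counting lines without a requestId as-is is also an assumption.

```typescript
// Hypothetical extracted row: one per assistant JSONL line.
type AssistantLine = { requestId: string | null; outputTokens: number }

// Sums tokens once per response instead of once per line.
function totalTokens(lines: AssistantLine[]): number {
  const seen = new Set<string>()
  let total = 0
  for (const line of lines) {
    if (line.requestId !== null) {
      if (seen.has(line.requestId)) continue // same response, already counted
      seen.add(line.requestId)
    }
    total += line.outputTokens // lines without a requestId are counted as-is
  }
  return total
}
```

With three lines sharing one requestId at 100 tokens each, a naive sum reports 300 while this version reports 100.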

Try It Now

Explore your own sessions: run ls ~/.claude/projects/ to see what's there, then head -n 5 ~/.claude/projects/<project>/<session>.jsonl to peek at the format.
Build a tolerant parser: Start with the whitelist pattern above. Don't assume stability; plan for unknown types.
Add string normalization: Apply toWellFormed() at the ingestion boundary to avoid downstream crashes.
Version your derived data: Store a version number per session so you can re-parse when your parser improves.