DEV Community: 林子超（子超）

Why ccRewind never writes to ~/.claude/

林子超（子超） — Thu, 09 Jul 2026 10:05:28 +0000

ccRewind reads ~/.claude/projects/*.jsonl, parses them, indexes them into a local SQLite database, and renders the result back as a searchable, browsable history. It never opens those source files for writing. Not a rename, not a metadata touch, not one byte.

That sentence is easy to state and easy to promise. It is worth explaining why the line is drawn exactly there, and what holding it actually costs.

The archaeologist doesn't touch the dig site

The parsing rules that make ccRewind possible start from treating JSONL as a historical record: skip what you cannot read, preserve what you do not yet understand, never let your opinion of the data change the data. That principle has a natural extension. An archaeologist who reshapes what they dig up is not doing archaeology anymore. They are authoring a new site and calling it the old one.

The payoff is specific: because ccRewind never writes, a session file is exactly what it was on the day it was written, no matter how much time has passed or how many tools have looked at it since. That property has already paid for itself once. A subtle client-side bug caused certain tool calls to silently misbehave for a run of sessions. Weeks later, reconstructing what had happened depended on going back to the original, untouched session data. If the viewing tool had normalized, deduplicated, or otherwise reshaped that data on the way in, the reconstruction would have had to rule out its own side effects before trusting anything it found. An archive that might have edited the evidence is not an archive; it is a second suspect.

There is a more mundane reason underneath the philosophical one. A session file being viewed might not be finished. Claude Code could still be appending to it in another process. Writing to a file that something else is actively extending is a reliable way to produce corruption.

What holding that line actually costs

None of this is free. Refusing to write to the source pushes the cost somewhere else, and it is worth being honest about where.

A parallel database, not an augmented one. There is no column added to the source, no sidecar file next to it. Instead there is an entirely separate SQLite database that has to be built, migrated, and kept in sync by rescanning the filesystem from the outside.

Staleness has to be invented. If writing one flag back to an already-read file were allowed, "have I processed this yet" would be a solved problem. Since it is not allowed, ccRewind fingerprints instead: modification time, plus a per-session version counter that increments whenever the parser learns to extract something new. A sync pass compares those fingerprints against what is already indexed and decides what to reprocess.

const summaryStale =
  existing.summaryVersion === null ||
  existing.summaryVersion < SUMMARY_VERSION

That one comparison exists only because the alternative, a boolean written back into the JSONL itself, is off the table.
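A minimal sketch of how the two fingerprints might combine into a single decision. Only summaryVersion and SUMMARY_VERSION come from the snippet above; the IndexedSession shape, the mtimeMs field, and the needsReprocess name are assumptions made for illustration, not ccRewind's actual schema.

```typescript
// Hypothetical shape: every name except summaryVersion is an assumption.
interface IndexedSession {
  mtimeMs: number               // source file mtime recorded when it was last indexed
  summaryVersion: number | null // parser version that produced the indexed rows
}

const SUMMARY_VERSION = 3 // illustrative value; bumped when the parser learns something new

function needsReprocess(
  existing: IndexedSession | undefined,
  currentMtimeMs: number,
): boolean {
  if (existing === undefined) return true              // never indexed
  if (existing.mtimeMs !== currentMtimeMs) return true // file changed on disk
  // unchanged file, but indexed by an older parser
  return existing.summaryVersion === null || existing.summaryVersion < SUMMARY_VERSION
}
```

Everything this decision needs comes from a stat of the source file and a row in the index, so nothing has to be written next to, or into, the JSONL.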

Disk, twice. The source history and its index both live on the same disk for the same conversations. In my data, the index alone runs to a bit under a gigabyte, sitting next to the JSONL it was built from.

Curation does not round-trip. Archiving a session or tagging it inside ccRewind creates metadata that exists only in ccRewind's own database. Rebuild the index, or point the tool at a fresh machine, and every conversation comes back exactly as it was, but every tag and archive flag is gone. The source was never an acceptable place to keep them, and the cost of that decision is that they are not durable anywhere else either.

Where the line actually sits

"Never writes" is precise about what it covers, and the boundary deserves the same precision. ccRewind is not a tool that touches nothing on disk. It keeps its own database, its own cache, its own state, and it manages all of that freely. The constraint is narrower and, because it is narrower, it is also unambiguous: the one thing it never touches is the source of truth it did not create.

Kept exactly there, that boundary is what keeps the promise true: the file did not change because someone looked at it.

ccRewind is open source, so the boundary can be checked against the code: ccRewind on GitHub. The parsing rules that the read-only index is built on are in the previous post: Parsing Claude Code's JSONL: patterns for a schema that keeps moving.

Disclosure: this post was drafted with Claude and edited by the human who made every tradeoff in it.

Parsing Claude Code's JSONL: patterns for a schema that keeps moving

林子超（子超） — Sun, 05 Jul 2026 10:01:55 +0000

Every conversation you have with Claude Code is written to disk as JSONL, under ~/.claude/projects/. Your decisions, your dead ends, the bug hunt that took three sessions: it is all there. You have probably never opened one.

The catch: the format is an internal implementation detail. No documentation, no version field, no stability guarantee. The schema changes whenever the CLI updates, which is, at the current pace, almost daily.

The patterns below come from building a read-only replay and search tool on top of these files, and from keeping it alive through a dozen CLI releases. Each pattern survived contact with real data. Four were learned the hard way, from bugs worth retelling.

The ground rule: you are an archaeologist, not a validator

A parser for someone else's internal format has a different job than a parser for your own API. Rejecting malformed input is not an option: the input is the historical record, and whatever is on disk is all there will ever be.

The contract that follows from this:

A bad line never kills the file. Skip it, count it, move on.
An unknown shape is preserved, not dropped. You can re-parse later; you cannot un-drop data.
Normalize at the boundary, once. Downstream code (search index, UI, export) should never see the mess.

Everything below is one of these rules meeting reality.

Pattern 1: tolerant line parsing with an explicit whitelist

The naive loop (JSON.parse each line, switch on type) works on day one. The question is what happens when a CLI update introduces a type nobody has seen before. This is not hypothetical; a real batch of them appears at the end of this post.

The approach that holds up: keep an explicit whitelist of known types, and treat everything outside it as "parse failed, but preserved":

const KNOWN_MESSAGE_TYPES = new Set([
  'user', 'assistant', 'system',
  'queue-operation', 'last-prompt',
  'progress', 'attachment', 'file-history-snapshot',
  'permission-mode', 'custom-title', 'ai-title',
  'agent-name', 'pr-link',
])

function parseLine(line: string): ParsedLine | null {
  if (!line.trim()) return null

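  let obj: Record<string, unknown>
  try {
    const parsed: unknown = JSON.parse(line)
    // valid JSON that is not an object (null, a number, an array) is a bad line too;
    // without this check, a bare `null` line would throw on obj.type below
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) return null
    obj = parsed as Record<string, unknown>
  } catch {
    return null // malformed line: skip, never throw
  }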

  const type = typeof obj.type === 'string' ? obj.type : 'unknown'
  const parseFailed = !KNOWN_MESSAGE_TYPES.has(type)
  // unknown type → keep the raw JSON string for later re-parse
  // known type → extracted fields are enough, raw can be dropped
  ...
}

The whitelist does double duty as a storage policy. For known types, the extracted columns are sufficient and the raw JSON can be discarded; that alone reclaims most of the disk space. For unknown types, the raw line goes into an archive table untouched. When a future version of the parser learns the new shape, the evidence is still there.
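The storage policy can be sketched as a small routing loop. This is an illustration under stated assumptions, not the tool's actual code: ingest and Ingested are hypothetical names, the whitelist is shortened to three types, and plain arrays stand in for the extracted columns and the archive table.

```typescript
// Sketch only: names are hypothetical and the whitelist is shortened.
const KNOWN_TYPES = new Set(['user', 'assistant', 'system'])

interface Ingested {
  known: Record<string, unknown>[] // parsed objects of whitelisted types
  archivedRaw: string[]            // raw lines of unknown types, kept verbatim
  skipped: number                  // lines that were not a JSON object
}

function ingest(text: string): Ingested {
  const result: Ingested = { known: [], archivedRaw: [], skipped: 0 }
  for (const line of text.split('\n')) {
    if (!line.trim()) continue // blank lines are not errors
    let parsed: unknown
    try {
      parsed = JSON.parse(line)
    } catch {
      result.skipped++ // malformed: count it, move on
      continue
    }
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
      result.skipped++ // valid JSON, but not an entry
      continue
    }
    const obj = parsed as Record<string, unknown>
    const type = typeof obj.type === 'string' ? obj.type : 'unknown'
    if (KNOWN_TYPES.has(type)) result.known.push(obj)
    else result.archivedRaw.push(line) // preserved as-is for a later re-parse
  }
  return result
}
```

The skipped counter is what makes silent damage visible: a sudden jump after a CLI update is a signal worth surfacing, even though no single bad line ever stops the file.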

One more detail that pays off: cap the length of identifier fields (uuid, requestId) at something sane like 128 chars before trusting them. Parsing files you do not control calls for a little paranoia at the boundary, and it is cheap.
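One way to apply that cap, as a hedged sketch: safeId is a hypothetical helper, and rejecting an oversized value outright (rather than truncating it) is an assumption of this example, chosen because truncation could make two different identifiers collide.

```typescript
const MAX_ID_LENGTH = 128 // the cap suggested above

// Returns the identifier if it is a plausible string, otherwise null.
// Rejecting instead of truncating is an assumption of this sketch.
function safeId(value: unknown): string | null {
  if (typeof value !== 'string') return null
  if (value.length === 0 || value.length > MAX_ID_LENGTH) return null
  return value
}
```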

Pattern 2: version your derived data, not just your schema

Preserving unknowns only matters if you can act on them later. The mechanism is a SUMMARY_VERSION integer stored per session. When the parser learns new tricks, bump the version; the indexer sees stale versions and re-parses those sessions automatically on the next sync.

This turns "the schema changed again" from a migration crisis into a routine: extend the parser, bump the version, let the backfill run. No manual steps, no data loss, no "please delete your index and start over" release notes.

War story 1: the lone surrogates

One day the indexer started producing strings that crashed downstream consumers. The cause: some JSONL lines contained unpaired UTF-16 surrogates. Half an emoji, lurking in a tool-error message.

How does half an emoji end up on disk? Older Claude Code versions (up to around 2.1.132, judging by the archived sessions) truncated long tool outputs by byte length, and the cut sometimes landed mid-emoji. JSON.stringify happily writes the lone surrogate as a \udXXX escape, the file looks like clean ASCII, and JSON.parse faithfully reconstructs the broken string at read time. The corruption stays invisible until something refuses it: SQLite, an IPC bridge, a TextEncoder.

The fix is one line, if it lands in the right place:

// at the parser's exit boundary, applied to every extracted string
export function ensureWellFormed(s: string): string {
  return s.toWellFormed() // lone surrogates → U+FFFD
}

The placement is the actual lesson. Normalize once, at ingestion, and every consumer downstream (search index, renderer, Markdown export) gets to assume well-formed strings forever. Unicode normalization (NFC/NFD) deliberately stays out of this step: it would change user-visible text, which an archival tool has no business doing. Fix what is broken, touch nothing else.

(String.prototype.toWellFormed() needs Node.js 20+. Before that, the surrogate scan has to be written by hand.)
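For runtimes without toWellFormed(), a hand-written scan might look like the sketch below. replaceLoneSurrogates is a hypothetical name; the behavior it aims for is the same as the built-in: every unpaired surrogate becomes U+FFFD and everything else is left untouched.

```typescript
// Fallback sketch: replace unpaired UTF-16 surrogates with U+FFFD.
function replaceLoneSurrogates(s: string): string {
  let out = ''
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i)
    if (c >= 0xd800 && c <= 0xdbff) {
      // high surrogate: valid only when immediately followed by a low surrogate
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += s[i] + s[i + 1] // keep the intact pair
        i++
      } else {
        out += '\ufffd'
      }
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      out += '\ufffd' // low surrogate with no preceding high surrogate
    } else {
      out += s[i]
    }
  }
  return out
}
```

Intact emoji pass through unchanged, which matters here: the goal is to repair what is broken without altering any text that was already well formed.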

War story 2: the tokens that counted themselves twice

The tool's token dashboard once reported usage numbers roughly 2.3× higher than reality, measured across a few hundred real sessions. The cause is a JSONL quirk worth knowing even if you never touch tokens:

One API response can become several JSONL lines. When a response contains multiple content blocks (text plus tool calls, for instance), Claude Code writes one assistant entry per block, and each entry carries a copy of the same usage object. Sum them naively and every multi-block turn is counted once per block.

The entries share a requestId, which is the dedup key. But there is a trap inside the trap: it is tempting to merge the entries into one logical message. Don't: entries of different requests can interleave on disk (streaming order), and merging would scramble the conversation. The entries themselves are fine; only the usage is duplicated.

So: keep every entry, zero out the usage on all but the last entry per requestId:

function deduplicateTokensByRequestId(lines: ParsedLine[]): ParsedLine[] {
  // pass 1: remember the index of the last assistant entry for each requestId
  const lastIndex = new Map<string, number>()
  lines.forEach((line, i) => {
    if (line.role === 'assistant' && line.requestId) {
      lastIndex.set(line.requestId, i)
    }
  })
  // pass 2: keep every entry in place, but null the usage on all except that last one
  return lines.map((line, i) =>
    line.role === 'assistant' && line.requestId && lastIndex.get(line.requestId) !== i
      ? { ...line, inputTokens: null, outputTokens: null,
          cacheReadTokens: null, cacheCreationTokens: null }
      : line,
  )
}

The general lesson: one JSONL line is not one logical event. Never assume a 1:1 mapping between physical lines and semantic units in a format you do not own.

War story 3: resumed sessions replay the past

When a Claude Code session is resumed, the new JSONL file starts with copies of messages from the original session: same uuid, same content, written again. Index both files naively and every resumed conversation shows up with duplicated history.

The remedy is UUID-level dedup against what is already indexed. The trap hiding inside that fix: the dedup query must exclude the session currently being indexed. Otherwise, re-indexing an existing session matches its own previously-indexed messages, concludes that every line is a duplicate, and quietly drops the entire session. A dedup check that can self-match is a data-loss machine with good intentions.
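The self-match trap is easier to see in code. The sketch below uses an in-memory Map as a stand-in for the index; the real check would be a query against the database, and UuidIndex, isReplayedFromOtherSession, and filterNewMessages are hypothetical names.

```typescript
// In-memory stand-in for the index: message uuid -> id of the session that stored it.
type UuidIndex = Map<string, string>

function isReplayedFromOtherSession(
  index: UuidIndex,
  uuid: string,
  currentSessionId: string,
): boolean {
  const owner = index.get(uuid)
  // A hit only counts as a duplicate when a *different* session owns the uuid.
  return owner !== undefined && owner !== currentSessionId
}

function filterNewMessages<T extends { uuid: string }>(
  index: UuidIndex,
  currentSessionId: string,
  messages: T[],
): T[] {
  return messages.filter(
    (m) => !isReplayedFromOtherSession(index, m.uuid, currentSessionId),
  )
}
```

Drop the owner !== currentSessionId condition and re-indexing any already-indexed session filters out every one of its own messages, which is exactly the silent data loss described above.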

War story 4: screenshots will eat your index

Claude Code conversations can contain images: screenshots pasted into the prompt, arriving as content blocks with base64 data inline. Store message content verbatim and a handful of screenshots will outweigh thousands of text messages in the database.

The pattern: strip the payload, keep the shape.

if (block.type === 'image' && block.source?.type === 'base64') {
  return { ...block, source: { ...block.source, data: '[base64-stripped]' } }
}

The block structure survives, so the UI can still render an "image was here" placeholder at the right position, and a has_image flag stays queryable. Only the megabytes are gone. Same archaeology principle as everywhere else: preserve the evidence of structure, not necessarily every byte of payload.
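Applied to a whole message, the same idea can also produce the flag in one pass. A sketch under assumptions: ContentBlock types only the fields touched here, stripImagePayloads is a hypothetical name, and the media_type field in the usage check is illustrative.

```typescript
// Hypothetical block shape: only the fields this sketch touches are typed.
interface ContentBlock {
  type: string
  source?: { type: string; data?: string; [key: string]: unknown }
  [key: string]: unknown
}

function stripImagePayloads(
  blocks: ContentBlock[],
): { blocks: ContentBlock[]; hasImage: boolean } {
  let hasImage = false
  const stripped = blocks.map((block) => {
    if (block.type === 'image' && block.source?.type === 'base64') {
      hasImage = true
      // keep every field of the block and its source, replace only the payload
      return { ...block, source: { ...block.source, data: '[base64-stripped]' } }
    }
    return block
  })
  return { blocks: stripped, hasImage }
}
```

The input array is not mutated, so the same parsed blocks can still be handed to other extractors before the stripped copy goes to storage.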

The schema will move again

In case "the schema keeps moving" sounds abstract, here is what diffing real session files before and after one CLI release (v2.1.168, June 2026) turned up: top-level attribution fields on assistant entries (which skill, plugin, or MCP server produced a reply), image content blocks, structured system subtypes carrying API error status codes, and an edited_text_file attachment type. Four schema extensions, zero announcements. A normal month.

With the patterns above, absorbing that release was: extend the whitelist, extract the new fields, bump SUMMARY_VERSION, ship. The sessions written before the parser update backfilled themselves on the next sync. Nothing was lost in the weeks where the parser did not yet understand the new shapes: the unknown parts were sitting in the archive table, waiting.

Takeaways

Skip bad lines, never throw. The file is the historical record; the parser's opinion of it is irrelevant.
Whitelist known shapes; archive unknown ones raw. Storage policy and forward compatibility in one mechanism.
Version your derived data. Re-parsing should be a routine background event, not a migration.
Normalize at the ingestion boundary, exactly once, and only what is actually broken (toWellFormed, yes; NFC, no).
Distrust the line/event mapping. Duplicated usage across entries, replayed messages across files: physical lines lie.

The tool these patterns come from is open source: ccRewind on GitHub, a read-only, offline replay and search tool for Claude Code history. It never writes a byte to ~/.claude/. Why that constraint exists, and what it cost, is the next post.

Disclosure: this post was drafted with Claude and edited by the human who debugged every story in it. The drafting sessions are, naturally, JSONL files under ~/.claude/projects/ now.