DEV Community

UB3DQY

How I shipped a broken capture pipeline and didn't notice for 3 days

TL;DR

I built a hook-based capture system for Claude Code. Every session-end was supposed to get summarized and written into a daily log. My doctor.py gate said 13/13 PASS. Lint was clean. CI was green on every commit.

Then a user asked a simple question: "Is the wiki actually capturing this conversation?"

I checked the log. Over half of my recent sessions had been silently dropped for three days. The gate never told me. Every smoke test was passing. The system was broken in the one place no test was actually looking.

This is what happened, how I caught it, and what I changed so I would not miss the same kind of bug again.


The setup

The project is a memory system for Claude Code and Codex CLI. A session-end hook reads the transcript and hands it off to a background Python script; that script asks the Agent SDK whether the conversation is worth saving, and the result gets appended to daily/YYYY-MM-DD.md. Fairly normal hook plumbing.

I had a doctor.py script with 13 smoke checks across the pipeline: session-start.py produced valid JSON, user-prompt-wiki.py could look up articles, stop.py exited cleanly. I had structural lint. I had a green CI gate on every push.

I had shipped eight commits over two days, each with doctor --quick green, each with CI passing. I was telling myself the system was in good shape.

The moment of doubt

Someone I was working with asked a very simple question: "Just to confirm, is the wiki actually storing this conversation?"

I almost said yes immediately. The hooks were wired. Every prompt I sent was coming back with wiki snippets injected at the top. UserPromptSubmit was clearly doing its job. From the outside, the system looked alive.

But I have been burned by "it looks alive" enough times that I checked instead of trusting the feeling. I opened scripts/flush.log, the file where the session-end and pre-compact hooks write their operational log, and scrolled to the recent entries:

2026-04-12 16:36:39 INFO [session-end] SessionEnd fired: session=...
2026-04-12 16:36:39 INFO [session-end] SKIP: only 2 turns (min 4)
2026-04-12 16:39:27 INFO [session-end] SessionEnd fired: session=...
2026-04-12 16:39:27 INFO [session-end] SKIP: only 2 turns (min 4)
2026-04-12 16:42:07 INFO [session-end] SessionEnd fired: session=...
2026-04-12 16:42:07 INFO [session-end] SKIP: only 2 turns (min 4)

That was the moment my confidence disappeared.

What I was looking at

The hooks were firing. SessionEnd fired is printed before any filtering happens, so those lines meant the hook chain from Claude Code to my Python script was intact. The wiring was not the problem.

But then immediately after, on every single recent entry: SKIP: only 2 turns (min 4).

My session-end code had this:

MIN_TURNS_TO_FLUSH = 4

# ... later ...

if turn_count < MIN_TURNS_TO_FLUSH:
    logging.info("SKIP: only %d turns (min %d)", turn_count, MIN_TURNS_TO_FLUSH)
    return

This was supposed to protect against flushing trivial sessions, the "one question, one answer, exit" pattern that is probably not worth archiving. A threshold of 4 felt reasonable when I wrote it. It felt reasonable when I reviewed it. It passed every test.

What I had not really internalized was the shape of my own usage. A typical Claude Code session for me is: open terminal, ask one specific question, get one specific answer, close terminal. That is exactly 2 turns. The rule I wrote to skip "trivial" sessions was skipping my normal session shape.

I ran the numbers over the full log:

SessionEnd fired:       109
Spawned flush.py:        52  (48%)
Skipped (various):       57  (52%)
Most recent skip reason: "SKIP: only 2 turns (min 4)"

Over half the sessions from the last three days had been silently dropped. Not edge cases. Not weird corner traffic. Just normal usage. The daily log for those days had looked thinner than it should have been, and I had noticed that in the background, but never chased it because doctor --quick was green and I trusted it.
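Tallying the log like that takes only a few lines of Python. This is a hedged sketch against the log format shown above; the exact "Spawned flush.py" wording is an assumption based on my tally labels, not a quote from the real hook:

```python
from pathlib import Path

def capture_stats(log_path: str) -> dict:
    """Count SessionEnd events vs spawned flushes in flush.log."""
    fired = spawned = 0
    last_skip = None
    for line in Path(log_path).read_text(encoding="utf-8").splitlines():
        if "SessionEnd fired" in line:
            fired += 1
        elif "Spawned flush.py" in line:
            spawned += 1
        elif "SKIP:" in line:
            # keep the most recent skip reason for the summary line
            last_skip = line.split("SKIP:", 1)[1].strip()
    skipped = fired - spawned
    return {
        "fired": fired,
        "spawned": spawned,
        "skipped": skipped,
        "skip_rate": skipped / fired if fired else 0.0,
        "last_skip_reason": last_skip,
    }
```

Running something like this once would have surfaced the 52% skip rate the moment it started climbing.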

Why the gate didn't catch it

The actual bug was trivial. Change a number. That part took no time. The question that mattered was: why did my gate tell me everything was fine while half the data was disappearing?

Let me walk through what doctor --full actually tested:

  • check_session_start_smoke — runs session-start.py with an empty JSON input, verifies it prints a valid hook-output JSON. ✅
  • check_user_prompt_smoke — runs user-prompt-wiki.py, verifies it returns additionalContext with articles. ✅
  • check_stop_smoke — runs stop.py, verifies it exits cleanly on empty stdin. ✅
  • check_index_freshness, check_structural_lint, check_env_settings, check_path_normalization — the rest of the usual health-check surface.

Notice what those tests are really asking. Each one asks: "Does this script run without crashing?" That is a useful question. It catches real bugs: ImportError after a refactor, JSONDecodeError from bad stdin, FileNotFoundError after a rename. But it is not the question I actually cared about: "Does a real transcript, processed by this chain, end up in the daily log?"
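For contrast, here is roughly what one of those smoke checks looks like, reduced to its shape. This is an illustrative sketch, not the literal doctor.py code; the function name and return shape are made up for the example:

```python
import json
import subprocess
import sys

def check_session_start_smoke(script: str) -> tuple[bool, str]:
    """Smoke check: does the script run and print valid JSON on empty input?

    This only proves the script does not crash on an empty payload.
    It says nothing about whether a real transcript makes it into
    the daily log.
    """
    proc = subprocess.run(
        [sys.executable, script],
        input="{}",            # empty payload, not a realistic transcript
        capture_output=True,
        text=True,
        timeout=10,
    )
    if proc.returncode != 0:
        return False, f"exited {proc.returncode}"
    try:
        json.loads(proc.stdout)  # valid hook-output JSON?
    except json.JSONDecodeError as e:
        return False, f"bad JSON: {e}"
    return True, "ok"
```

Every branch in that check is about crashing, exit codes, and parseability. None of it is about the data.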

That question has three subtly different parts:

  1. Does the hook fire when Claude Code ends a session? (Yes — I could see it in the log.)
  2. Does the hook's filter logic produce a "worth-saving" verdict for realistic input? (Turns out: no, because of the bug above.)
  3. Does the downstream chain actually write the result to the daily log? (Unknown, because step 2 always said no.)

doctor --full tested a weak version of (1) by running the script with an empty payload. It did not test (2), because that needs a realistic transcript. It did not test (3), because the chain never got that far. Every link passed in isolation, and the chain as a whole was still broken.

This is the old difference between smoke tests and end-to-end tests. In theory everybody knows it. In a personal tool, it is easy to get lazy about it. You know what the chain is supposed to do, so testing the whole thing can feel redundant. It is not redundant at all. The chain breaks in exactly the places where each individual component still passes its own tiny check.

Two things I added to stop this happening again

The code fix itself was boring: replace the turn-based threshold with a content-based one. Short but substantial sessions, two turns and a couple thousand characters of real discussion, now get captured. Tiny sessions, two turns and thirty characters of "ok thanks", still get skipped, but by character count instead of turn count. That is not really the point of this post.
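The replacement gate looks roughly like this. A sketch, not the literal hook code: WIKI_MIN_FLUSH_CHARS is the real knob (the doctor check later nags about it), but the 500-character default here is illustrative:

```python
import os

# Character-based gate: skip only genuinely tiny sessions,
# regardless of how few turns they took.
# NOTE: the 500-char default is illustrative, not the project's value.
MIN_FLUSH_CHARS = int(os.environ.get("WIKI_MIN_FLUSH_CHARS", "500"))

def should_flush(turns: list[str]) -> bool:
    """A 2-turn session with real content passes; 'ok thanks' does not."""
    content_chars = sum(len(t) for t in turns)
    return content_chars >= MIN_FLUSH_CHARS
```

The point of the change is that the gate now measures what I actually care about (how much substance the session carried), not a proxy that happened to correlate with it in my head.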

The interesting part is what I added to doctor.py afterward, because that is what turns this from a one-off fix into something the project can actually defend itself with.

1. Observability check: check_flush_capture_health

This one reads scripts/flush.log over a rolling 7-day window and summarizes what the capture pipeline has actually been doing:

def check_flush_capture_health() -> CheckResult:
    # parse flush.log, count SessionEnd fired vs Spawned flush.py
    ...
    detail = f"Last 7d: {spawned}/{session_fired} flushes spawned (skip rate {skip_rate:.0%})"

    if spawned == 0:
        return CheckResult(
            "flush_capture_health", False,
            f"{detail}. Pipeline appears broken: SessionEnds fired but nothing was spawned."
        )
    if skip_rate > 0.5:
        return CheckResult(
            "flush_capture_health", True,
            f"{detail} [attention: high skip rate — consider lowering WIKI_MIN_FLUSH_CHARS]"
        )
    return CheckResult("flush_capture_health", True, detail)

Important design choice: this check only FAILs when the pipeline is observably broken. If SessionEnds fired but nothing was spawned, that is a correctness bug. It does not FAIL on high skip rate, because skip rate is historical data about past usage, not necessarily a problem with the current code. A fresh clone has no history and should pass. A repo with lots of short sessions may have a high skip rate and still be behaving correctly. Blocking the merge gate on historical observability would be a mistake.

The check prints an [attention] marker in the detail line when the skip rate goes above 50%. On the first run in my own repo after I added it, it printed:

[PASS] flush_capture_health: Last 7d: 50/121 flushes spawned (skip rate 59%)
       [attention: high skip rate — consider lowering WIKI_MIN_FLUSH_CHARS]

That one line would have saved me three days.

2. End-to-end acceptance test: check_flush_roundtrip

This is the answer to the "why didn't any test catch this?" question. It only runs in doctor --full, because it is more expensive than the fast smoke checks.

The test writes a dummy 6-turn transcript, about 2000 characters of realistic content, to a temp file. Then it invokes hooks/session-end.py as a real subprocess with a realistic hook-input JSON on stdin:

test_session_id = f"doctor-roundtrip-{uuid.uuid4().hex[:8]}"
transcript_path = SCRIPTS_DIR / f"doctor-transcript-{test_session_id}.jsonl"

# ... write dummy turns ...

hook_input = {
    "session_id": test_session_id,
    "source": "doctor-roundtrip",
    "transcript_path": str(transcript_path),
    "cwd": str(ROOT_DIR),
}

env = os.environ.copy()
env["WIKI_FLUSH_TEST_MODE"] = "1"

proc = subprocess.run(
    [sys.executable, str(session_end_script)],
    input=json.dumps(hook_input),
    ...
)

Notice the WIKI_FLUSH_TEST_MODE=1 environment variable. That is the trick. The downstream script, flush.py, checks it at startup and, if it is set, skips the real Agent SDK call and writes a marker file to a known location instead:

# In flush.py
if os.environ.get("WIKI_FLUSH_TEST_MODE") == "1":
    TEST_MARKER_FILE.write_text(
        f"FLUSH_TEST_OK session={session_id} ts=...",
        encoding="utf-8",
    )
    return

The test then polls for that marker file with a 15-second timeout, verifies it contains the right session ID, and cleans up. If any link in the chain is broken — if session-end.py does not spawn flush.py, if flush.py fails to import, if the environment is not inherited correctly — the marker never appears and the test fails with a clear message.
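The polling side is the simple half. A sketch of what that wait loop can look like, assuming the marker path is a location shared between doctor.py and flush.py:

```python
import time
from pathlib import Path

def wait_for_marker(marker: Path, session_id: str, timeout: float = 15.0) -> bool:
    """Poll for the test-mode marker written by flush.py.

    Returns True once the marker exists and names the expected session,
    False if the timeout elapses -- meaning some link in the chain broke.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if marker.exists() and session_id in marker.read_text(encoding="utf-8"):
            return True
        time.sleep(0.2)  # cheap poll; the chain normally completes in well under a second
    return False
```

Checking the session ID, not just file existence, matters: it guards against a stale marker left over from a previous run passing the test by accident.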

This is an actual end-to-end test, not a smoke test. It exercises the real subprocess invocation, real environment inheritance, real stdin/stdout piping, and real timing. The only thing it fakes is the API call itself, because that would cost money and pollute the real daily log.

On my machine right now:

[PASS] flush_roundtrip: session-end -> flush.py chain completed in test mode

If I had had this test two weeks earlier, I would have caught the MIN_TURNS_TO_FLUSH = 4 bug on the first realistic transcript. It would not have needed to be clever. A visible skip where a spawn was expected would have been enough.

The lessons, short enough to remember

1. Smoke tests are not end-to-end tests, and they do not substitute for one. I had 13 smoke checks in doctor.py, and all of them were correct in isolation. None of them ran the actual production chain from a realistic input to a verifiable output. If you have a multi-process pipeline, you need at least one test that exercises the whole thing. It does not have to be fast and it does not have to run on every commit. It just has to exist somewhere meaningful in your gate.

2. Observability is a design choice, not an afterthought. My hooks were writing perfectly good operational logs. I just was not reading them, and my gate was not reading them either. Adding a check that summarizes those logs took about forty lines of code and would have turned a silent three-day outage into a visible [attention] marker from day one. Logs you do not read are not much better than logs you never wrote.

3. If a test could have caught the bug, it belongs in the gate — even if adding it feels obvious in hindsight. The wrong question is "why didn't I add this on day one?" The better question is "what class of future bugs does this protect me from now?" Hindsight is always perfect about the bug you already know. What you want is general immunity to the class of bugs you just learned about.


If you are building similar hook-based systems, the code for the project where this happened is at github.com/ub3dqy/llm-wiki. It is a markdown-first memory system for Claude Code and Codex CLI, and both fixes described here — the content-based threshold and the roundtrip test — live in scripts/doctor.py and hooks/session-end.py. No API keys, no vector database, and it boots with uv run python scripts/setup.py.
