Part of the series Debugging Claude Agent SDK pipelines. One of the layers I'll mention near the end — hidden account-level Gmail / Calendar MCP integrations blocking my subprocesses — deserved its own write-up: Hidden Gmail and Calendar integrations quietly broke my Claude SDK pipeline.
I noticed something weird in Codex.
`UserPromptSubmit` kept saying completed, but `Stop` kept saying failed.
If you're building a memory tool, that's a bad combination. It means the assistant can still read old context, but it may be failing to write new context back into long-term memory. In other words: it looks smart in the moment, but its memory may be quietly falling apart behind the scenes.
I assumed this would be a small hook bug.
It wasn't.
It turned into one of those debugging sessions where every layer was technically "working" and the system as a whole still wasn't.
What I was building
I'm working on a markdown-first memory system for Claude Code and Codex. The shape is simple enough:
- when a session ends, a hook grabs the transcript
- a background script decides whether the conversation is worth saving
- if it is, it writes a distilled note into a daily log
- later, that gets compiled into wiki pages and injected back into future sessions
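The capture path can be sketched in a few lines. All names here (`worth_saving`, `flush_transcript`) are hypothetical stand-ins, not the project's real API:

```python
# Minimal sketch of the capture path. The function names and the
# distillation step are illustrative assumptions, not the real code.
from pathlib import Path

def worth_saving(transcript: str) -> bool:
    # Placeholder heuristic: only distill sessions with enough real content.
    return len(transcript.strip()) > 500

def flush_transcript(transcript: str, daily_log: Path) -> None:
    """Called from the Stop hook: decide, distill, append to the daily log."""
    if not worth_saving(transcript):
        return
    note = transcript[:200]  # stand-in for the real distillation step
    daily_log.parent.mkdir(parents=True, exist_ok=True)
    with daily_log.open("a", encoding="utf-8") as f:
        f.write(f"\n## session note\n{note}\n")
```

Everything downstream (wiki compilation, re-injection) depends on that append actually happening.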
The part that mattered here was the Codex Stop hook. That's the capture path. If that hook fails, new memory may never make it into the wiki.
So when I saw Stop failed in the UI over and over again, I treated it as a real product problem, not a cosmetic one.
The first bug was real
The first issue was exactly where I expected it to be: in the hook.
The parser only understood one transcript shape. Codex was emitting another one. The hook would fire, look at the transcript, fail to extract meaningful context, and then skip capture.
That part was straightforward to fix:
- teach the parser the real Codex transcript shape
- add a fallback for when `transcript_path` is missing
- stop using the old turn-count gate and switch to a content-based threshold
- raise the timeout so the hook had room to finish
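The first two fixes together look roughly like the sketch below. The two transcript shapes (a JSON object with a `messages` list vs. JSONL events) are illustrative assumptions, not the actual Codex format:

```python
import json

def extract_context(raw: str) -> str:
    """Parse a transcript that may arrive in more than one shape.

    Hypothetical shapes for illustration: a single JSON object with a
    "messages" list, or JSONL with one event per line.
    """
    texts = []
    try:
        data = json.loads(raw)
        for msg in data.get("messages", []):
            texts.append(msg.get("content", ""))
    except json.JSONDecodeError:
        # Fallback: treat the transcript as JSONL, one event per line.
        for line in raw.splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            texts.append(event.get("content", ""))
    context = "\n".join(t for t in texts if t)
    # Content-based gate instead of counting turns.
    MIN_CHARS = 200
    return context if len(context) >= MIN_CHARS else ""
```

The point of the content gate is that a two-turn session full of substance should be captured, while a twenty-turn session of noise should not.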
After that, things got better. The logs stopped saying `SKIP: empty context`, and I started seeing the line I wanted:

`Spawned flush.py for session ...`
At that point I thought I was done.
I was not done.
The second bug was weirder
Now the hook was successfully spawning the downstream capture process, which should have been a win.
And then it still ended with:
`BrokenPipeError: [Errno 32] Broken pipe`
This was one of those bugs that is annoying precisely because it happens after the important part.
The capture process had already started. Memory might already be on its way to being saved. But the hook still looked failed in the Codex UI, which meant I couldn't trust the system yet.
The cause turned out to be simple in hindsight:
- the hook took longer than the local timeout
- Codex closed stdout
- the hook tried to print its final success JSON into a pipe that no longer existed
So yes, the system was partly working. It just wasn't finishing cleanly.
That fix was also small: protect the final stdout write against a closed pipe.
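A sketch of that guard, assuming the hook reports its result as a final JSON line on stdout (the redirect-to-devnull step is the standard Python recipe for keeping interpreter shutdown from tripping over the same broken pipe):

```python
import json
import sys

def emit_result(payload: dict) -> None:
    """Write the hook's final JSON to stdout, tolerating a closed pipe.

    If the host has already timed out and closed stdout, the write raises
    BrokenPipeError. The capture work has already been spawned by then,
    so we swallow the error instead of letting the whole hook "fail".
    """
    try:
        sys.stdout.write(json.dumps(payload) + "\n")
        sys.stdout.flush()
    except BrokenPipeError:
        import os
        # Point stdout at devnull so later implicit flushes don't raise.
        os.dup2(os.open(os.devnull, os.O_WRONLY), sys.stdout.fileno())
```

It's a boring fix, but it's the difference between "the hook failed" and "the hook finished but nobody was listening".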
At this point I had fixed the parser bug and the broken pipe. Surely now the memory pipeline would work.
Still no.
The third bug wasn't in the hook at all
Once I got past the broken pipe, the downstream process itself started failing.
The hook would spawn flush.py, and then flush.py would die with the deeply unhelpful classic:
`Command failed with exit code 1`
No useful stderr. No obvious explanation. Just enough information to waste an afternoon.
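One cheap defense is capturing the child's stderr at the spawn site so it isn't lost on the way back up. A sketch, with illustrative log formatting:

```python
import subprocess
import sys

def run_logged(cmd: list[str]) -> int:
    """Run a child process and surface its stderr instead of losing it.

    "Command failed with exit code 1" tells you nothing; logging the
    exact command, exit code, and stderr at the spawn site is what makes
    a failure several layers down visible.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Keep the evidence: exact command, code, and whatever stderr said.
        print(f"[flush] {' '.join(cmd)} exited {proc.returncode}", file=sys.stderr)
        print(f"[flush] stderr: {proc.stderr.strip() or '(empty)'}", file=sys.stderr)
    return proc.returncode
```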
This is the moment where I finally stopped assuming I was debugging "the hook" and started treating the whole thing like what it really was: a chain of separate runtimes.
Because that's what it was.
Not one program. A chain:
- Codex UI
- hook runner
- Python process
- subprocess launcher
- WSL boundary
- bundled Claude CLI
- local authentication state
Each link could fail for a different reason.
And that is exactly what had happened.
One of the real bugs was hiding in the boundary
At that stage, the most immediate root cause of the exit code 1 failure wasn't inside my Python code at all.
The Claude CLI inside the WSL runtime wasn't authenticated.
That was the immediate bug in this part of the investigation. There was still another layer around hidden account-level MCP integrations, but that turned into its own separate story. I wrote that one up separately here: Hidden Gmail and Calendar integrations quietly broke my Claude SDK pipeline.
I had already authenticated Claude on the Windows side. Claude Code on Windows was fine. But Codex hooks were spawning work inside WSL, and that runtime had its own separate `~/.claude` state.
So from one side of the system, Claude was logged in.
From the other side, it wasn't.
And because the failure was happening in a subprocess several layers down, what bubbled back up was just a generic process failure.
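A quick diagnostic run from inside the same process that spawns the subprocess settles the question of which runtime you're actually in. The WSL check (looking for "microsoft" in `/proc/version`) is a common heuristic, and the `~/.claude` path is an assumption about where the CLI keeps its state; verify both against your own setup:

```python
import platform
from pathlib import Path

def describe_runtime() -> dict:
    """Report which runtime this code is actually executing in.

    Both the WSL detection and the credentials path are assumptions
    for illustration; check what your CLI actually uses.
    """
    info = {
        "platform": platform.system(),
        "wsl": False,
        "claude_state": (Path.home() / ".claude").exists(),
    }
    proc_version = Path("/proc/version")
    if proc_version.exists():
        info["wsl"] = "microsoft" in proc_version.read_text().lower()
    return info
```

Logging this dict from the failing subprocess, rather than from my own shell, is what would have exposed the mismatch immediately.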
That was the moment the whole debugging session clicked for me:
I wasn't dealing with a broken feature. I was dealing with a system that crossed OS boundaries, process boundaries, and auth boundaries, and I was still mentally treating it like one runtime.
It wasn't one runtime.
It was several. They just happened to be glued together tightly enough to look like one.
And then Windows joined in
While I was cleaning that up, Claude Code on Windows started throwing a completely different kind of error on every hook run:
`error: failed to remove file .venv\lib64: Access is denied. (os error 5)`
That turned out to be another boundary problem.
I had Windows-side and WSL-side tooling both touching the same project environment. uv was trying to be helpful. Windows was trying to be Windows. A POSIX-style lib64 symlink was involved. None of this was improving my mood.
So now I had two parallel truths:
- the Codex capture path was broken because WSL auth was missing
- the Claude-side hook launcher was unstable because the shared `.venv` state was getting churned across environments
Both bugs were real.
Neither bug lived in the same place.
Both looked, from the outside, like "the AI tool is flaky again."
The part that actually mattered
Here's the practical lesson I walked away with:
When a system crosses runtime boundaries, the bug is often not in the place where the symptom shows up.
The symptom showed up as:
`Stop failed`
The actual causes were spread across multiple layers:
- transcript shape mismatch
- timeout mismatch
- unprotected stdout write
- missing WSL-side Claude auth
- shared Windows/WSL environment churn
If I had kept looking only at the final error message, I would have kept "fixing" the wrong layer.
What I changed
I didn't rewrite the system. I just stopped letting it be vague.
I added enough visibility so each boundary could tell me when it was the one failing:
- transcript parsing now understands real Codex output
- the stop hook has a fallback when transcript data is missing
- success output no longer crashes on a closed pipe
- the flush path logs more useful process diagnostics
- I verified the actual runtime where the subprocess was running, instead of assuming it matched the one I was sitting in
And maybe the most boring but important fix of all:
- I authenticated Claude in the runtime that was actually doing the work
Not "somewhere on the machine."
Not "the CLI works for me."
The exact runtime.
The lesson I'll keep
The real lesson wasn't "add more logging."
It was this:
if your tool crosses process boundaries, OS boundaries, and auth boundaries, you do not have one runtime anymore.
You have a chain of semi-independent runtimes, and each one can fail in its own extremely specific way.
That sounds obvious when written down. It was much less obvious when I was staring at one repeated Stop failed message and assuming there had to be one neat root cause behind it.
There wasn't.
There were several small, ordinary failures, all stacked on top of each other. That's what made the bug feel slippery.
And honestly, that is what a lot of debugging looks like in real life. Not one dramatic mistake. Just three or four boring mismatches, each living at a different seam, and all of them combining into one system that feels unreliable.
Sometimes the hardest part is realizing that one ugly symptom actually belongs to more than one story.
If you're building something similar
Don't just test whether the hook runs.
Test whether:
- it runs in the runtime you think it runs in
- it has the credentials you think it has
- it can finish within the timeout you actually configured
- and the final side effect really happens
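That last point is the one worth encoding in an actual test: assert on the artifact, not the exit code. A sketch, where the child process is a hypothetical stand-in for spawning `flush.py`:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def test_capture_side_effect():
    """Don't just assert the hook exits 0; assert the note really exists.

    The child script here is a stand-in for the real flush path.
    """
    out_dir = Path(tempfile.mkdtemp())
    note = out_dir / "note.md"
    script = f"from pathlib import Path; Path({str(note)!r}).write_text('# saved')"
    proc = subprocess.run([sys.executable, "-c", script], capture_output=True)
    assert proc.returncode == 0   # "the script executed"
    assert note.exists()          # "the system worked"
    assert note.read_text().startswith("# saved")
```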
Because "the script executed" is not the same thing as "the system worked."
And if your tool is supposed to remember things for you, that difference matters a lot.
I'm building this as part of llm-wiki, a markdown-first memory layer for Claude Code and Codex. The part I underestimated wasn't the prompt design or the summarization logic. It was the plumbing around the boundaries.
Which, in hindsight, is exactly where these systems like to break.