Part of the series Debugging Claude Agent SDK pipelines. One of the layers I'll mention near the end — hidden account-level Gmail / Calendar MCP integrations blocking my subprocesses — deserved its own write-up: Hidden Gmail and Calendar integrations quietly broke my Claude SDK pipeline.
I noticed something weird in Codex.
`UserPromptSubmit` kept saying completed, but `Stop` kept saying failed.
If you're building a memory tool, that's a bad combination. It means the assistant can still read old context, but it may be failing to write new context back into long-term memory. In other words: it looks smart in the moment, but its memory may be quietly falling apart behind the scenes.
I assumed this would be a small hook bug.
It wasn't.
It turned into one of those debugging sessions where every layer was technically "working" and the system as a whole still wasn't.
What I was building
I'm working on a markdown-first memory system for Claude Code and Codex. The shape is simple enough:
- when a session ends, a hook grabs the transcript
- a background script decides whether the conversation is worth saving
- if it is, it writes a distilled note into a daily log
- later, that gets compiled into wiki pages and injected back into future sessions
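The capture path can be sketched in a few lines. All names here (`worth_saving`, `flush_transcript`) are hypothetical stand-ins, not the project's real API:

```python
# Minimal sketch of the capture path. The function names and the
# distillation step are illustrative assumptions, not the real code.
from pathlib import Path

def worth_saving(transcript: str) -> bool:
    # Placeholder heuristic: only distill sessions with enough real content.
    return len(transcript.strip()) > 500

def flush_transcript(transcript: str, daily_log: Path) -> None:
    """Called from the Stop hook: decide, distill, append to the daily log."""
    if not worth_saving(transcript):
        return
    note = transcript[:200]  # stand-in for the real distillation step
    daily_log.parent.mkdir(parents=True, exist_ok=True)
    with daily_log.open("a", encoding="utf-8") as f:
        f.write(f"\n## session note\n{note}\n")
```

Everything downstream (wiki compilation, re-injection) depends on that append actually happening.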
The part that mattered here was the Codex Stop hook. That's the capture path. If that hook fails, new memory may never make it into the wiki.
So when I saw Stop failed in the UI over and over again, I treated it as a real product problem, not a cosmetic one.
The first bug was real
The first issue was exactly where I expected it to be: in the hook.
The parser only understood one transcript shape. Codex was emitting another one. The hook would fire, look at the transcript, fail to extract meaningful context, and then skip capture.
That part was straightforward to fix:
- teach the parser the real Codex transcript shape
- add a fallback for when `transcript_path` is missing
- stop using the old turn-count gate and switch to a content-based threshold
- raise the timeout so the hook had room to finish
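The first two fixes together look roughly like the sketch below. The two transcript shapes (a JSON object with a `messages` list vs. JSONL events) are illustrative assumptions, not the actual Codex format:

```python
import json

def extract_context(raw: str) -> str:
    """Parse a transcript that may arrive in more than one shape.

    Hypothetical shapes for illustration: a single JSON object with a
    "messages" list, or JSONL with one event per line.
    """
    texts = []
    try:
        data = json.loads(raw)
        for msg in data.get("messages", []):
            texts.append(msg.get("content", ""))
    except json.JSONDecodeError:
        # Fallback: treat the transcript as JSONL, one event per line.
        for line in raw.splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            texts.append(event.get("content", ""))
    context = "\n".join(t for t in texts if t)
    # Content-based gate instead of counting turns.
    MIN_CHARS = 200
    return context if len(context) >= MIN_CHARS else ""
```

The point of the content gate is that a two-turn session full of substance should be captured, while a twenty-turn session of noise should not.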
After that, things got better. The logs stopped saying `SKIP: empty context`, and I started seeing the line I wanted:

`Spawned flush.py for session ...`
At that point I thought I was done.
I was not done.
The second bug was weirder
Now the hook was successfully spawning the downstream capture process, which should have been a win.
And then it still ended with:
`BrokenPipeError: [Errno 32] Broken pipe`
This was one of those bugs that is annoying precisely because it happens after the important part.
The capture process had already started. Memory might already be on its way to being saved. But the hook still looked failed in the Codex UI, which meant I couldn't trust the system yet.
The cause turned out to be simple in hindsight:
- the hook took longer than the local timeout
- Codex closed stdout
- the hook tried to print its final success JSON into a pipe that no longer existed
So yes, the system was partly working. It just wasn't finishing cleanly.
That fix was also small: protect the final stdout write against a closed pipe.
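A sketch of that guard, assuming the hook reports its result as a final JSON line on stdout (the redirect-to-devnull step is the standard Python recipe for keeping interpreter shutdown from tripping over the same broken pipe):

```python
import json
import sys

def emit_result(payload: dict) -> None:
    """Write the hook's final JSON to stdout, tolerating a closed pipe.

    If the host has already timed out and closed stdout, the write raises
    BrokenPipeError. The capture work has already been spawned by then,
    so we swallow the error instead of letting the whole hook "fail".
    """
    try:
        sys.stdout.write(json.dumps(payload) + "\n")
        sys.stdout.flush()
    except BrokenPipeError:
        import os
        # Point stdout at devnull so later implicit flushes don't raise.
        os.dup2(os.open(os.devnull, os.O_WRONLY), sys.stdout.fileno())
```

It's a boring fix, but it's the difference between "the hook failed" and "the hook finished but nobody was listening".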
At this point I had fixed the parser bug and the broken pipe. Surely now the memory pipeline would work.
Still no.
The third bug wasn't in the hook at all
Once I got past the broken pipe, the downstream process itself started failing.
The hook would spawn flush.py, and then flush.py would die with the deeply unhelpful classic:
`Command failed with exit code 1`
No useful stderr. No obvious explanation. Just enough information to waste an afternoon.
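One cheap defense is capturing the child's stderr at the spawn site so it isn't lost on the way back up. A sketch, with illustrative log formatting:

```python
import subprocess
import sys

def run_logged(cmd: list[str]) -> int:
    """Run a child process and surface its stderr instead of losing it.

    "Command failed with exit code 1" tells you nothing; logging the
    exact command, exit code, and stderr at the spawn site is what makes
    a failure several layers down visible.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Keep the evidence: exact command, code, and whatever stderr said.
        print(f"[flush] {' '.join(cmd)} exited {proc.returncode}", file=sys.stderr)
        print(f"[flush] stderr: {proc.stderr.strip() or '(empty)'}", file=sys.stderr)
    return proc.returncode
```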
This is the moment where I finally stopped assuming I was debugging "the hook" and started treating the whole thing like what it really was: a chain of separate runtimes.
Because that's what it was.
Not one program. A chain:
- Codex UI
- hook runner
- Python process
- subprocess launcher
- WSL boundary
- bundled Claude CLI
- local authentication state
Each link could fail for a different reason.
And that is exactly what had happened.
One of the real bugs was hiding in the boundary
At that stage, the most immediate root cause of the exit code 1 failure wasn't inside my Python code at all.
The Claude CLI inside the WSL runtime wasn't authenticated.
That was the immediate bug in this part of the investigation. There was still another layer around hidden account-level MCP integrations, but that turned into its own separate story. I wrote that one up separately here: Hidden Gmail and Calendar integrations quietly broke my Claude SDK pipeline.
I had already authenticated Claude on the Windows side. Claude Code on Windows was fine. But Codex hooks were spawning work inside WSL, and that runtime had its own separate `~/.claude` state.
So from one side of the system, Claude was logged in.
From the other side, it wasn't.
And because the failure was happening in a subprocess several layers down, what bubbled back up was just a generic process failure.
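A quick diagnostic run from inside the same process that spawns the subprocess settles the question of which runtime you're actually in. The WSL check (looking for "microsoft" in `/proc/version`) is a common heuristic, and the `~/.claude` path is an assumption about where the CLI keeps its state; verify both against your own setup:

```python
import platform
from pathlib import Path

def describe_runtime() -> dict:
    """Report which runtime this code is actually executing in.

    Both the WSL detection and the credentials path are assumptions
    for illustration; check what your CLI actually uses.
    """
    info = {
        "platform": platform.system(),
        "wsl": False,
        "claude_state": (Path.home() / ".claude").exists(),
    }
    proc_version = Path("/proc/version")
    if proc_version.exists():
        info["wsl"] = "microsoft" in proc_version.read_text().lower()
    return info
```

Logging this dict from the failing subprocess, rather than from my own shell, is what would have exposed the mismatch immediately.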
That was the moment the whole debugging session clicked for me:
I wasn't dealing with a broken feature. I was dealing with a system that crossed OS boundaries, process boundaries, and auth boundaries, and I was still mentally treating it like one runtime.
It wasn't one runtime.
It was several. They just happened to be glued together tightly enough to look like one.
And then Windows joined in
While I was cleaning that up, Claude Code on Windows started throwing a completely different kind of error on every hook run:
`error: failed to remove file .venv\lib64: Access is denied. (os error 5)`
That turned out to be another boundary problem.
I had Windows-side and WSL-side tooling both touching the same project environment. uv was trying to be helpful. Windows was trying to be Windows. A POSIX-style lib64 symlink was involved. None of this was improving my mood.
So now I had two parallel truths:
- the Codex capture path was broken because WSL auth was missing
- the Claude-side hook launcher was unstable because the shared `.venv` state was getting churned across environments
Both bugs were real.
Neither bug lived in the same place.
Both looked, from the outside, like "the AI tool is flaky again."
The part that actually mattered
Here's the practical lesson I walked away with:
When a system crosses runtime boundaries, the bug is often not in the place where the symptom shows up.
The symptom showed up as:
`Stop failed`
The actual causes were spread across multiple layers:
- transcript shape mismatch
- timeout mismatch
- unprotected stdout write
- missing WSL-side Claude auth
- shared Windows/WSL environment churn
If I had kept looking only at the final error message, I would have kept "fixing" the wrong layer.
What I changed
I didn't rewrite the system. I just stopped letting it be vague.
I added enough visibility so each boundary could tell me when it was the one failing:
- transcript parsing now understands real Codex output
- the stop hook has a fallback when transcript data is missing
- success output no longer crashes on a closed pipe
- the flush path logs more useful process diagnostics
- I verified the actual runtime where the subprocess was running, instead of assuming it matched the one I was sitting in
And maybe the most boring but important fix of all:
- I authenticated Claude in the runtime that was actually doing the work
Not "somewhere on the machine."
Not "the CLI works for me."
The exact runtime.
The lesson I'll keep
The real lesson wasn't "add more logging."
It was this:
if your tool crosses process boundaries, OS boundaries, and auth boundaries, you do not have one runtime anymore.
You have a chain of semi-independent runtimes, and each one can fail in its own extremely specific way.
That sounds obvious when written down. It was much less obvious when I was staring at one repeated Stop failed message and assuming there had to be one neat root cause behind it.
There wasn't.
There were several small, ordinary failures, all stacked on top of each other. That's what made the bug feel slippery.
And honestly, that is what a lot of debugging looks like in real life. Not one dramatic mistake. Just three or four boring mismatches, each living at a different seam, and all of them combining into one system that feels unreliable.
Sometimes the hardest part is realizing that one ugly symptom actually belongs to more than one story.
If you're building something similar
Don't just test whether the hook runs.
Test whether:
- it runs in the runtime you think it runs in
- it has the credentials you think it has
- it can finish within the timeout you actually configured
- and the final side effect really happens
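That last point is the one worth encoding in an actual test: assert on the artifact, not the exit code. A sketch, where the child process is a hypothetical stand-in for spawning `flush.py`:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def test_capture_side_effect():
    """Don't just assert the hook exits 0; assert the note really exists.

    The child script here is a stand-in for the real flush path.
    """
    out_dir = Path(tempfile.mkdtemp())
    note = out_dir / "note.md"
    script = f"from pathlib import Path; Path({str(note)!r}).write_text('# saved')"
    proc = subprocess.run([sys.executable, "-c", script], capture_output=True)
    assert proc.returncode == 0   # "the script executed"
    assert note.exists()          # "the system worked"
    assert note.read_text().startswith("# saved")
```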
Because "the script executed" is not the same thing as "the system worked."
And if your tool is supposed to remember things for you, that difference matters a lot.
I'm building this as part of llm-wiki, a markdown-first memory layer for Claude Code and Codex. The part I underestimated wasn't the prompt design or the summarization logic. It was the plumbing around the boundaries.
Which, in hindsight, is exactly where these systems like to break.