<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: UB3DQY</title>
    <description>The latest articles on DEV Community by UB3DQY (@ub3dqy).</description>
    <link>https://dev.to/ub3dqy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875580%2F3dd40e53-66ce-4b1c-bdb9-bd0bdd6ca2f5.jpg</url>
      <title>DEV Community: UB3DQY</title>
      <link>https://dev.to/ub3dqy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ub3dqy"/>
    <language>en</language>
    <item>
      <title>Six small PRs later, repo hygiene stopped being a suggestion</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:16:01 +0000</pubDate>
      <link>https://dev.to/ub3dqy/six-small-prs-later-repo-hygiene-stopped-being-a-suggestion-3m70</link>
      <guid>https://dev.to/ub3dqy/six-small-prs-later-repo-hygiene-stopped-being-a-suggestion-3m70</guid>
      <description>&lt;p&gt;Over the last 24 hours I did one of those jobs that sounds smaller than it really is.&lt;/p&gt;

&lt;p&gt;On paper it was boring: tighten up repo hygiene in a small Python project.&lt;/p&gt;

&lt;p&gt;In practice it meant taking a repo where the right tools technically existed, but mostly as polite background noise, and turning them into something that actually pushes back.&lt;/p&gt;

&lt;p&gt;That distinction matters more than people admit.&lt;/p&gt;

&lt;p&gt;A lot of repos have hygiene in the aspirational sense. There is a formatter. There is a linter. There is a workflow file. There is maybe even a note somewhere saying “we should probably enforce this.” And yet the real contract of the repository is still social. If somebody forgets to sort imports, or reformats one file strangely, or adds a new script with a different style, nothing happens except maybe a vague feeling that the code is getting fuzzier around the edges.&lt;/p&gt;

&lt;p&gt;That was roughly where this repo was.&lt;/p&gt;

&lt;p&gt;By the end of the day, it wasn’t.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;The repo is a slightly odd little Python codebase: scripts, hooks, some WSL/Windows friction, and enough operational glue that a careless formatting pass can do real damage.&lt;/p&gt;

&lt;p&gt;The hygiene sequence ended up landing as six small PRs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;add a proper dev dependency group&lt;/li&gt;
&lt;li&gt;enable Ruff import sorting only, with an explicit Python target version&lt;/li&gt;
&lt;li&gt;add a narrow Ruff import-sorting check to CI&lt;/li&gt;
&lt;li&gt;run a style-only &lt;code&gt;ruff format&lt;/code&gt; pass&lt;/li&gt;
&lt;li&gt;add &lt;code&gt;.git-blame-ignore-revs&lt;/code&gt; for that style commit&lt;/li&gt;
&lt;li&gt;add &lt;code&gt;ruff format --check&lt;/code&gt; to CI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That list is tidy now. It did not start tidy.&lt;/p&gt;
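&lt;p&gt;The first two PRs are just a few lines of &lt;code&gt;pyproject.toml&lt;/code&gt;. A minimal sketch, with the group name, version pins, and Python target as assumptions rather than the repo’s actual values:&lt;/p&gt;

```toml
# PEP 735 dependency group for dev tooling (group name "dev" is assumed)
[dependency-groups]
dev = [
    "ruff>=0.4",
]

# Enable only import sorting ("I") and be explicit about the interpreter
# version Ruff should target, instead of letting it guess.
[tool.ruff]
target-version = "py311"  # assumption: set to the project's actual floor

[tool.ruff.lint]
select = ["I"]
```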

&lt;p&gt;The first temptation was the obvious one: if Ruff is underused, just turn on more Ruff. Add &lt;code&gt;UP&lt;/code&gt;, add &lt;code&gt;B&lt;/code&gt;, wire in pre-commit, maybe clean up packaging while we’re here, maybe finally fix the command entry point. The classic “while I’m touching hygiene, I might as well modernize the whole repo” impulse.&lt;/p&gt;

&lt;p&gt;That would have been a mistake.&lt;/p&gt;

&lt;p&gt;The actual winning move was to narrow the scope until every step was defensible on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The difference between “installed” and “real”
&lt;/h2&gt;

&lt;p&gt;The repo already had Ruff.&lt;/p&gt;

&lt;p&gt;Which is exactly why this kind of work gets postponed forever.&lt;/p&gt;

&lt;p&gt;When a tool already exists in &lt;code&gt;pyproject.toml&lt;/code&gt;, people start speaking about it in the present tense. “We use Ruff.” “We have formatting.” “The repo is linted.” But if nothing in the actual merge path forces those claims to matter, what you really have is a tool-shaped decoration.&lt;/p&gt;

&lt;p&gt;That was the first useful correction from this round of work:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a tool is not part of the contract until CI can turn red because of it&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the order mattered.&lt;/p&gt;

&lt;p&gt;Not pre-commit first. Not “let’s all remember to run it locally.” Not “there’s a command for that.”&lt;/p&gt;

&lt;p&gt;CI first.&lt;/p&gt;

&lt;p&gt;Once &lt;code&gt;ruff check --select I scripts/ hooks/&lt;/code&gt; and &lt;code&gt;ruff format --check scripts/ hooks/&lt;/code&gt; sit in the workflow, the repo changes shape. Future PRs can’t quietly reintroduce unsorted imports or drifting formatting and call it an accident. The repository starts enforcing a boundary instead of merely describing one.&lt;/p&gt;

&lt;p&gt;That is a much more important transition than it sounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we did not turn on all the rules
&lt;/h2&gt;

&lt;p&gt;This was probably the most useful design decision of the whole sequence.&lt;/p&gt;

&lt;p&gt;We enabled only Ruff’s &lt;code&gt;I&lt;/code&gt; rules (the isort group) at first. Import sorting. Nothing broader.&lt;/p&gt;

&lt;p&gt;That is not because the repo is perfect otherwise. It isn’t. There is still baseline lint debt sitting there, and we know it. But broadening the ruleset too early would have mixed three different jobs together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;introduce a new contract&lt;/li&gt;
&lt;li&gt;pay down old debt&lt;/li&gt;
&lt;li&gt;argue about the meaning of every new warning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is how hygiene work turns into a swamp.&lt;/p&gt;

&lt;p&gt;So the repo took the adult route instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;start with the rule that is almost entirely mechanical&lt;/li&gt;
&lt;li&gt;fix only what that rule surfaces&lt;/li&gt;
&lt;li&gt;make it enforced&lt;/li&gt;
&lt;li&gt;defer the noisier categories until there is a reason to take them on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sounds conservative. It is. It is also how the change actually got merged.&lt;/p&gt;

&lt;p&gt;There was one wrinkle, and it was a real one.&lt;/p&gt;

&lt;p&gt;Three hook files use an intentional &lt;code&gt;sys.path.insert(...)&lt;/code&gt;-then-import pattern. It is ugly, but it is tied to the current package layout and runtime boundary. Ruff’s import sorting model does not understand that pattern: applied blindly, it wanted to hoist the imports above the &lt;code&gt;sys.path&lt;/code&gt; mutation they depend on, which is a destructive change, not a cosmetic one.&lt;/p&gt;

&lt;p&gt;So instead of pretending the linter is always right, we added narrow &lt;code&gt;per-file-ignores&lt;/code&gt; for those three files and moved on.&lt;/p&gt;
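&lt;p&gt;The escape hatch is small and explicit. A sketch with hypothetical file names, since the real hook paths are not shown here:&lt;/p&gt;

```toml
[tool.ruff.lint.per-file-ignores]
# These hooks do sys.path.insert(...) before importing; sorting the
# imports above that line would break them at runtime.
"hooks/pre_tool.py" = ["I001"]
"hooks/post_tool.py" = ["I001"]
"hooks/session_end.py" = ["I001"]
```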

&lt;p&gt;That was the correct tradeoff.&lt;/p&gt;

&lt;p&gt;Linters are tools. They are not clergy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The most boring PR was secretly important
&lt;/h2&gt;

&lt;p&gt;The style-only formatting pass was the part that looks most cosmetic from the outside.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;ruff format scripts/ hooks/&lt;/code&gt;.&lt;br&gt;
Reformat a couple dozen files.&lt;br&gt;
Move on.&lt;/p&gt;

&lt;p&gt;But that step has two hidden traps.&lt;/p&gt;

&lt;p&gt;First, you need to prove it is actually style-only. In this repo that meant checking the non-import baseline before and after, making sure the same existing errors were still the same existing errors, and not quietly smuggling in behavior changes under the name of formatting.&lt;/p&gt;
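&lt;p&gt;One cheap way to back the “style-only” claim is to compare syntax trees before and after. A small sketch; the helper name is mine, not from the repo:&lt;/p&gt;

```python
import ast

def same_behavior_shape(before_src: str, after_src: str) -> bool:
    """True if two source texts parse to identical ASTs, i.e. the change
    touched layout only, not code structure."""
    return ast.dump(ast.parse(before_src)) == ast.dump(ast.parse(after_src))

# Pure reformatting keeps the AST identical...
assert same_behavior_shape("x=1\n", "x = 1\n")
# ...while a real change does not.
assert not same_behavior_shape("x = 1\n", "x = 2\n")
```

&lt;p&gt;Comments are invisible to the AST, so this is a floor rather than a proof, but it is cheap insurance against a logic change smuggled into a format commit.&lt;/p&gt;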

&lt;p&gt;Second, once you do a broad formatting commit, &lt;code&gt;git blame&lt;/code&gt; gets uglier unless you clean up after yourself.&lt;/p&gt;

&lt;p&gt;So the format pass was immediately followed by &lt;code&gt;.git-blame-ignore-revs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I’m mentioning that because a lot of teams skip it, and then six months later every touched line looks like it was “written” by the style commit. It is a small operational courtesy that saves a lot of irritation later.&lt;/p&gt;
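&lt;p&gt;The mechanics are two lines. A throwaway-repo demo; the SHA is a placeholder, not a real commit from this repo:&lt;/p&gt;

```shell
# Record the style commit's full SHA so git blame skips it.
demo=$(mktemp -d)
cd "$demo"
git init -q
echo "0123456789abcdef0123456789abcdef01234567" > .git-blame-ignore-revs
# Point blame at the file once per clone (GitHub's blame view reads it
# automatically when the file is committed at the repo root).
git config blame.ignoreRevsFile .git-blame-ignore-revs
git config blame.ignoreRevsFile
```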

&lt;p&gt;Formatting the repo is easy.&lt;br&gt;
Formatting the repo without damaging the usefulness of its history is slightly less easy.&lt;/p&gt;

&lt;p&gt;Still worth doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that kept this from becoming a mess
&lt;/h2&gt;

&lt;p&gt;This repo has a weird workflow on purpose: plans, reports, explicit verification, narrow whitelists, and a very annoying habit of stopping when the written plan no longer matches reality.&lt;/p&gt;

&lt;p&gt;That sounds bureaucratic until you hit the first real discrepancy.&lt;/p&gt;

&lt;p&gt;And there were several.&lt;/p&gt;

&lt;p&gt;One plan version assumed &lt;code&gt;uv.lock&lt;/code&gt; was tracked. It wasn’t.&lt;/p&gt;

&lt;p&gt;Another assumed &lt;code&gt;project.optional-dependencies&lt;/code&gt; was the right place for dev tooling. The docs said otherwise: for this setup, &lt;code&gt;dependency-groups&lt;/code&gt; was the semantically correct path.&lt;/p&gt;

&lt;p&gt;One acceptance check assumed a clean diff on files that were already dirty before the task started.&lt;/p&gt;

&lt;p&gt;Another workflow file had line-ending churn that made the raw diff look noisier than the actual content change.&lt;/p&gt;

&lt;p&gt;None of those were dramatic bugs. They were the more ordinary kind of engineering problem: stale assumptions surviving just long enough to confuse the next step.&lt;/p&gt;

&lt;p&gt;The only thing that reliably prevented those from turning into sloppy implementation was a simple rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;when the docs, the code, and the plan disagree, the plan loses&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds obvious. It still needs to be enforced.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the repo ended up
&lt;/h2&gt;

&lt;p&gt;At the end of the sequence, the workflow was doing real work.&lt;/p&gt;

&lt;p&gt;The main CI path now enforces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index freshness&lt;/li&gt;
&lt;li&gt;structural wiki lint&lt;/li&gt;
&lt;li&gt;Ruff import sorting for &lt;code&gt;scripts/&lt;/code&gt; and &lt;code&gt;hooks/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ruff formatting checks for &lt;code&gt;scripts/&lt;/code&gt; and &lt;code&gt;hooks/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Python syntax validity (an AST parse check)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not some grand platform rewrite. It is a modest hardening pass on a small codebase.&lt;/p&gt;

&lt;p&gt;But that is exactly why I like it.&lt;/p&gt;

&lt;p&gt;Too much engineering writing treats hygiene as if it only becomes interesting once there is a huge monorepo, a staff-sized platform team, or a catastrophe. Most repos do not live there. Most repos live in the much less glamorous zone where a dozen small inconsistencies slowly teach everybody that the rules are optional.&lt;/p&gt;

&lt;p&gt;This was the opposite kind of day.&lt;/p&gt;

&lt;p&gt;No reinvention. No giant “quality initiative.” No holiness around tool choice. Just six small PRs in the right order, each one making the next one easier to justify.&lt;/p&gt;

&lt;p&gt;That is usually what “maturity” looks like in a repo, if you strip away the self-importance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would do again
&lt;/h2&gt;

&lt;p&gt;If I had to repeat the same cleanup tomorrow in another Python repo, I’d keep the same order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;declare dev tooling properly&lt;/li&gt;
&lt;li&gt;enable one narrow lint rule with a high signal-to-drama ratio&lt;/li&gt;
&lt;li&gt;enforce it in CI&lt;/li&gt;
&lt;li&gt;do the style-only format pass&lt;/li&gt;
&lt;li&gt;add blame-ignore for that commit&lt;/li&gt;
&lt;li&gt;only then expand the gate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And I would still resist the urge to “just enable everything.”&lt;/p&gt;

&lt;p&gt;Not because broad linting is bad. Because sequencing matters.&lt;/p&gt;

&lt;p&gt;A repo does not become cleaner when you dump more rules into it. It becomes cleaner when the rules it has are real, narrow enough to survive contact with reality, and enforced in the place that actually decides what lands.&lt;/p&gt;

&lt;p&gt;That was the work of the last 24 hours.&lt;/p&gt;

&lt;p&gt;Not glamorous.&lt;br&gt;
But now the repo pushes back.&lt;/p&gt;

&lt;p&gt;That’s better.&lt;/p&gt;

</description>
      <category>python</category>
      <category>tooling</category>
      <category>ci</category>
      <category>opensource</category>
    </item>
    <item>
      <title>My wiki stopped being “memory” and quietly became a behavior patch for AI agents</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Wed, 15 Apr 2026 09:10:27 +0000</pubDate>
      <link>https://dev.to/ub3dqy/my-wiki-stopped-being-memory-and-quietly-became-a-behavior-patch-for-ai-agents-178h</link>
      <guid>https://dev.to/ub3dqy/my-wiki-stopped-being-memory-and-quietly-became-a-behavior-patch-for-ai-agents-178h</guid>
      <description>&lt;p&gt;I had one of those sessions a few days ago where the problem stopped being technical and got embarrassing.&lt;/p&gt;

&lt;p&gt;I was working with an AI coding agent inside a real repo. Not a toy prompt, not a benchmark, just ordinary work. At first the mistakes looked small enough to wave away: a suggestion that sounded locally reasonable, then another one that contradicted it, then a third that quietly walked us back to the first. The kind of thing that makes you stare at the screen and think, hang on, didn’t we already establish this?&lt;/p&gt;

&lt;p&gt;What made it worse was how familiar the rhythm felt. The agent wasn’t crashing. It wasn’t hallucinating a spaceship in the database schema. It was doing something more irritating: cutting corners, smoothing over uncertainty, and acting like each new complaint had appeared in a vacuum.&lt;/p&gt;

&lt;p&gt;That same day we had ingested a Reddit thread into the project wiki: &lt;em&gt;“Anthropic made Claude 67% dumber and didn’t tell anyone, a developer ran 6,852 sessions to prove it.”&lt;/em&gt; The underlying research lived in &lt;a href="https://github.com/anthropics/claude-code/issues/42796" rel="noopener noreferrer"&gt;anthropics/claude-code#42796&lt;/a&gt;, and whatever you think of the Reddit packaging, the numbers were not vague internet vibes. Someone had analyzed &lt;strong&gt;6,852 real Claude Code sessions&lt;/strong&gt; and &lt;strong&gt;17,871 thinking blocks&lt;/strong&gt; and came back with a specific pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reasoning depth down 67%&lt;/li&gt;
&lt;li&gt;file reads before edit down from 6.6 to 2 on average&lt;/li&gt;
&lt;li&gt;roughly one in three edits happening without any prior file read&lt;/li&gt;
&lt;li&gt;the word “simplest” showing up far more often in model output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That landed a little too cleanly.&lt;/p&gt;

&lt;p&gt;Because earlier that day, in one long conversation, the agent had done exactly the kind of thing those numbers predict. It first recommended using a separate terminal window. When that turned out to be awkward for me, it recommended the VS Code integrated terminal instead. Then, when the integrated terminal had a scrolling problem with a TUI, it started drifting back toward a separate terminal again, like none of the previous reasoning had happened. Same conversation, same user, same repo, and still somehow a goldfish.&lt;/p&gt;

&lt;p&gt;At one point, when I got annoyed with the quality of the advice, the agent did something even more revealing: it tried to solve the conversation by building more tooling. New directories, README files, orchestrator ideas, handoff pipeline scaffolding. As if “your reasoning got sloppy” naturally leads to “let me generate more infrastructure.”&lt;/p&gt;

&lt;p&gt;That was the moment the Reddit thread stopped feeling theoretical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The useful part of the bad session
&lt;/h2&gt;

&lt;p&gt;Here is what it actually gave me that I did not have an hour earlier.&lt;/p&gt;

&lt;p&gt;The thread mattered for two reasons.&lt;/p&gt;

&lt;p&gt;First, it gave me language for what I was looking at. Instead of the usual mushy feeling that the model was “off today,” there was a concrete behavioral shape: shallower reasoning, less reading before editing, more shortcut-taking. That does not magically explain every bad turn, but it does turn a vibe into something you can reason about.&lt;/p&gt;

&lt;p&gt;Second, it pushed the conversation away from “why is the model being like this?” and toward a much more practical question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What can I change locally so that the agent stops behaving this way in &lt;em&gt;my&lt;/em&gt; project?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That ended up being the right question.&lt;/p&gt;

&lt;p&gt;Because the fix wasn’t a new product, and it wasn’t a better prompt pasted into a chat box, and it definitely wasn’t waiting for Anthropic to say they had sorted things out. The fix was smaller and much duller than that. Which is probably why it worked.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;The repo already had a markdown wiki next to the codebase.&lt;/p&gt;

&lt;p&gt;Originally it was just that: a knowledge base. Notes, sources, concept pages, issue summaries, project-specific instructions, little slices of operational memory that are easy to forget between sessions. Plain files in git. Nothing clever.&lt;/p&gt;

&lt;p&gt;But this repo also had a project-level &lt;code&gt;CLAUDE.md&lt;/code&gt;, which gets injected into the agent’s context whenever work starts in the project. So instead of treating the wiki as a passive archive, we used it as a place to put behavior-shaping rules that would actually reappear in future sessions.&lt;/p&gt;

&lt;p&gt;We added rules like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; Research the codebase before editing. Never change code you haven't read.
&lt;span class="p"&gt;-&lt;/span&gt; Verify work actually works before claiming done.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whatever turns out to be true about the upstream claims, those rules are obviously worth having anyway.&lt;/p&gt;

&lt;p&gt;And we added a couple of persistent memory notes tied to concrete failure cases from the bad session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;don’t flip-flop on workflow advice inside one thread&lt;/li&gt;
&lt;li&gt;don’t respond to user frustration about reasoning quality by reflexively building new tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was it.&lt;/p&gt;

&lt;p&gt;No vector database. No new service. No orchestration layer. No fancy “agent memory platform.” Just a few markdown files in the place the agent actually reads before it starts moving.&lt;/p&gt;

&lt;p&gt;Which sounds almost insultingly small. But that’s the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing I had been misunderstanding
&lt;/h2&gt;

&lt;p&gt;I used to think of a project wiki mostly as memory in the obvious sense: a way to remember what happened, what was decided, what broke last month, what weird edge case we already paid for once and don’t want to rediscover at 2 a.m.&lt;/p&gt;

&lt;p&gt;That is still true. But it is not the whole story.&lt;/p&gt;

&lt;p&gt;For an agent, a wiki that gets re-injected into context is not just memory. It is part of the agent’s operating environment.&lt;/p&gt;

&lt;p&gt;That matters more than it sounds.&lt;/p&gt;

&lt;p&gt;An LLM does not “learn” from a bad day the way a human does. It doesn’t go for a walk, think things over, and come back morally improved. If you want a future instance of the agent to behave differently, the only reliable mechanism you control is what gets placed in its context window when the next similar moment happens.&lt;/p&gt;

&lt;p&gt;That means a sentence in a markdown file can do something surprisingly concrete. It can raise the floor.&lt;/p&gt;

&lt;p&gt;The Reddit thread is basically the negative version of this same idea. If the default instruction layer gets weakened, even a little, behavior changes downstream in measurable ways. Less reading. More shortcuts. More “simplest.” More pretending local plausibility is enough. So the opposite is also true: if you add back explicit local instructions that are actually relevant to your project, behavior improves in the dimensions those instructions govern.&lt;/p&gt;

&lt;p&gt;Not forever. Not perfectly. But enough to matter.&lt;/p&gt;

&lt;p&gt;That was the real shift for me.&lt;/p&gt;

&lt;p&gt;The wiki stopped being a storage layer and became a behavioral patch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why plain markdown turned out to be enough
&lt;/h2&gt;

&lt;p&gt;This part is easy to overcomplicate.&lt;/p&gt;

&lt;p&gt;There are many situations where you really do need a heavier retrieval system. Large corpora, fuzzy search across lots of semi-structured material, semantic lookup over things that were never designed to link to each other cleanly. Fine. Use the bigger machinery when the problem calls for it.&lt;/p&gt;

&lt;p&gt;But a lot of agent work is not blocked on better retrieval. It is blocked on better &lt;em&gt;discipline&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The agent doesn’t need a 40-millisecond vector search to discover the idea “read the file before you edit it.” It needs that rule to be present, visible, and hard to miss at the moment it starts acting clever.&lt;/p&gt;

&lt;p&gt;Plain markdown is very good at that.&lt;/p&gt;

&lt;p&gt;It is editable, reviewable, diffable, easy to keep near the code, and easy to inject back into the session. It also ages well. You can mark something stale. You can supersede it. You can point from one failure note to the later fix. You can tell the difference between historical context and current guidance if you are disciplined about how you structure the files.&lt;/p&gt;

&lt;p&gt;That last part matters. A wiki can absolutely become a swamp if you just keep shoveling text into it and never think about freshness. But that is a problem of curation, not a failure of the basic approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one claim I’m careful with
&lt;/h2&gt;

&lt;p&gt;The thread also included a much juicier claim: that leaked Claude Code source appeared to route Anthropic employees through a different instruction set, including a stronger “verify work actually works before claiming done” style directive.&lt;/p&gt;

&lt;p&gt;If true, that would explain a lot. It would also be a much bigger story than “my local workflow got weird this week.”&lt;/p&gt;

&lt;p&gt;But I don’t think it should carry the article. That part still needs independent confirmation.&lt;/p&gt;

&lt;p&gt;The measured regression does not.&lt;/p&gt;

&lt;p&gt;And honestly, the practical lesson does not depend on it. Even if that claim turns out to be wrong, the local fix still stands: project-level instructions plus durable memory can compensate for at least some classes of agent drift.&lt;/p&gt;

&lt;p&gt;That is enough to be useful on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I’d actually recommend
&lt;/h2&gt;

&lt;p&gt;If any of this sounds familiar, I’d do three things before buying anything, rewriting your stack, or filing angry bug reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Put operating rules in the repo
&lt;/h3&gt;

&lt;p&gt;Not a giant manifesto. Two or three lines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read before edit&lt;/li&gt;
&lt;li&gt;verify before claiming done&lt;/li&gt;
&lt;li&gt;don’t reverse yourself mid-thread without explicitly acknowledging it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep them short enough that they feel like rules, not prose.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Record specific failure patterns, not abstract complaints
&lt;/h3&gt;

&lt;p&gt;“Don’t be lazy” is useless.&lt;/p&gt;

&lt;p&gt;“Yesterday you recommended terminal A, then terminal B, then terminal A again in the same conversation” is useful.&lt;/p&gt;

&lt;p&gt;Agents pattern-match better against concrete reproductions than against moral advice. Frankly, so do humans.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Treat the wiki as part of runtime behavior, not just documentation
&lt;/h3&gt;

&lt;p&gt;If your memory system is actually read during future sessions, then it is not only an archive. It is one of the controls that shapes what the agent does next. Design it that way.&lt;/p&gt;

&lt;p&gt;That means caring about status, freshness, and whether a note is historical context or current guidance. It also means accepting that some of your best fixes may look embarrassingly low-tech.&lt;/p&gt;

&lt;p&gt;And if your stack does not have a &lt;code&gt;CLAUDE.md&lt;/code&gt; equivalent, the idea still transfers. Any text that gets re-injected into context at session start is a lever. The filename is incidental.&lt;/p&gt;
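&lt;p&gt;Concretely, a note can carry its own freshness signals so future sessions can tell current guidance from history. A hypothetical shape; the fields and filenames are mine:&lt;/p&gt;

```markdown
---
status: current
supersedes: terminal-advice-v1.md
---
# Terminal workflow decision
- Use the VS Code integrated terminal for agent sessions.
- A separate terminal window was tried and rejected; do not re-suggest it
  without naming what changed.
```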

&lt;h2&gt;
  
  
  The part I keep coming back to
&lt;/h2&gt;

&lt;p&gt;What changed my mind wasn’t that the agent had a bad session. Bad sessions happen.&lt;/p&gt;

&lt;p&gt;It was that a markdown wiki, sitting quietly next to the code, ended up being the most practical lever for changing the agent’s behavior the next day.&lt;/p&gt;

&lt;p&gt;Not memory as autobiography.&lt;/p&gt;

&lt;p&gt;Memory as control surface.&lt;/p&gt;

&lt;p&gt;And once you see that, a lot of “AI tooling” starts looking strangely overengineered for the problems people actually have. Sometimes the useful move is not another layer of automation. Sometimes it is one sentence in the right file, where the model is forced to read it before it starts improvising.&lt;/p&gt;

&lt;p&gt;That’s a very old kind of software.&lt;/p&gt;

&lt;p&gt;It still works.&lt;/p&gt;




&lt;p&gt;Reference: the Reddit thread &lt;em&gt;“Anthropic made Claude 67% dumber and didn't tell anyone, a developer ran 6,852 sessions to prove it”&lt;/em&gt; (&lt;a href="https://www.reddit.com/r/ClaudeCode/comments/1shaxkt/" rel="noopener noreferrer"&gt;r/ClaudeCode&lt;/a&gt;, 2026-04-10), and the underlying issue at &lt;a href="https://github.com/anthropics/claude-code/issues/42796" rel="noopener noreferrer"&gt;anthropics/claude-code#42796&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>tooling</category>
      <category>claude</category>
    </item>
    <item>
      <title>Remote-WSL broke my AI agent hooks with one malformed cwd</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:35:26 +0000</pubDate>
      <link>https://dev.to/ub3dqy/remote-wsl-broke-my-ai-agent-hooks-with-one-malformed-cwd-3okj</link>
      <guid>https://dev.to/ub3dqy/remote-wsl-broke-my-ai-agent-hooks-with-one-malformed-cwd-3okj</guid>
      <description>&lt;p&gt;I spent most of this week debugging what looked like a flaky hook pipeline.&lt;/p&gt;

&lt;p&gt;The project itself is simple on purpose: a local, markdown-first knowledge base that I use with coding agents. Hooks capture session output, a small filter decides what is worth keeping, and everything stays in git. No separate backend, no database to babysit, no infrastructure just because a text-heavy workflow &lt;em&gt;could&lt;/em&gt; be turned into infrastructure.&lt;/p&gt;

&lt;p&gt;That simplicity is exactly why the bug annoyed me so much.&lt;/p&gt;

&lt;p&gt;The failure looked like an application problem. It smelled like an application problem. It even produced the kind of vague symptoms that make you think, “great, another intermittent pipeline bug.”&lt;/p&gt;

&lt;p&gt;It was not an application problem.&lt;/p&gt;

&lt;p&gt;It was one malformed working directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I thought I was debugging
&lt;/h2&gt;

&lt;p&gt;At first, I thought I was dealing with one recurring failure in my capture pipeline.&lt;/p&gt;

&lt;p&gt;Sometimes a hook would appear to fail. Sometimes downstream logging would be empty. Sometimes a task would complete, but the surrounding automation would look half-dead. It was inconsistent enough to feel flaky and consistent enough to feel real.&lt;/p&gt;

&lt;p&gt;That is an especially bad combination.&lt;/p&gt;

&lt;p&gt;I started by doing the usual careful thing: grouping similar failures together and checking how often they happened.&lt;/p&gt;

&lt;p&gt;That worked for about ten minutes.&lt;/p&gt;

&lt;p&gt;Once I looked at timing, the nice tidy picture fell apart. Some failures happened almost immediately. Others took much longer. Same surface symptom, completely different execution pattern.&lt;/p&gt;

&lt;p&gt;That was the first sign that I might be flattening two different problems into one label.&lt;/p&gt;

&lt;p&gt;Then I went digging through the SDK layer and ran into a second problem: the tooling was not especially generous with useful error details. In a couple of places I could confirm that something had failed, but not why. The exact traceback or stderr I wanted either was not there or was being abstracted away into something much less helpful.&lt;/p&gt;

&lt;p&gt;At that point I stopped treating this as a Python problem and started treating it as an environment problem.&lt;/p&gt;

&lt;p&gt;That turned out to be the right move.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup that made it visible
&lt;/h2&gt;

&lt;p&gt;I use two agents against the same repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code inside VS Code&lt;/li&gt;
&lt;li&gt;Codex in parallel for implementation and verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That setup is productive, but it puts real pressure on local tooling. You are suddenly relying on editors, terminals, shells, path translation, process spawning, and hook execution to behave consistently across a stack that spans Windows and Linux.&lt;/p&gt;

&lt;p&gt;Naturally, I decided to make it &lt;em&gt;more&lt;/em&gt; convenient.&lt;/p&gt;

&lt;p&gt;I wanted a cleaner dual-window workflow inside VS Code instead of bouncing between an editor and a separate terminal. That pushed me deeper into VS Code Remote-WSL and duplicated workspace setups.&lt;/p&gt;

&lt;p&gt;That is where the really confusing symptom showed up.&lt;/p&gt;

&lt;p&gt;Codex could successfully answer a simple prompt, but the UI would still show hooks as &lt;code&gt;failed&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That combination should immediately make you suspicious.&lt;/p&gt;

&lt;p&gt;If the task result exists, but the hooks around it are marked failed and your own logging pipeline shows no fresh entries, then one of two things is true:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the hook command really is starting and dying very early, or&lt;/li&gt;
&lt;li&gt;the command is never starting in a sane execution context to begin with.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The second explanation was uglier, but it fit the evidence better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manual tests made things weirder, not clearer
&lt;/h2&gt;

&lt;p&gt;I checked the obvious suspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hook config&lt;/li&gt;
&lt;li&gt;executable paths&lt;/li&gt;
&lt;li&gt;shell path&lt;/li&gt;
&lt;li&gt;Python runner&lt;/li&gt;
&lt;li&gt;manual execution of the same command&lt;/li&gt;
&lt;li&gt;exit code&lt;/li&gt;
&lt;li&gt;runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything looked healthy.&lt;/p&gt;

&lt;p&gt;The hook command worked perfectly when I ran it manually.&lt;/p&gt;

&lt;p&gt;That made the problem harder, not easier. A command that fails only when launched through an editor integration is usually telling you that the command itself is fine and its environment is not.&lt;/p&gt;
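&lt;p&gt;This is exactly the gap a tiny self-reporting preamble closes. A sketch I would now put at the top of any hook entry point; the helper name and log path are mine:&lt;/p&gt;

```python
import json
import os
import sys
import time

def log_hook_context(log_path: str) -> dict:
    """Record the environment a hook actually receives, before any real
    work, so 'works manually, fails under the editor' stops being a mystery."""
    cwd = os.getcwd()
    ctx = {
        "ts": time.time(),
        "cwd": cwd,
        "cwd_exists": os.path.isdir(cwd),
        "argv": sys.argv,
        # First few PATH entries are usually enough to spot a wrong shell.
        "path_head": os.environ.get("PATH", "").split(os.pathsep)[:3],
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(ctx) + "\n")
    return ctx
```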

&lt;p&gt;So I stopped staring at the hook script and went looking for process-level clues on the Codex side.&lt;/p&gt;

&lt;p&gt;That was when I realized I had been looking in the wrong place for logs. I expected plain text logs. The active event data I needed was actually in a SQLite log store.&lt;/p&gt;

&lt;p&gt;Once I queried that, the whole thing cracked open.&lt;/p&gt;
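&lt;p&gt;Pulling context out of a SQLite store is a one-liner once you know it is there. A sketch with a hypothetical schema; the real table and column names will differ:&lt;/p&gt;

```python
import sqlite3

def last_turn_cwd(db_path: str) -> str:
    """Fetch the working directory recorded for the most recent turn.
    The table and column names here are assumptions, not the real schema."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT cwd FROM turns ORDER BY ts DESC LIMIT 1"
        ).fetchone()
        return row[0] if row else ""
    finally:
        con.close()
```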

&lt;h2&gt;
  
  
  The line that explained the whole day
&lt;/h2&gt;

&lt;p&gt;Inside the recorded turn data, the working directory looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/mnt/c/.../Microsoft VS Code/e:\work\my-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That path is nonsense.&lt;/p&gt;

&lt;p&gt;It is a broken hybrid of two fragments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the VS Code install directory on the Windows side&lt;/li&gt;
&lt;li&gt;with a raw Windows-style workspace path glued onto the end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is neither a valid WSL workspace path nor a valid normal working directory for a Linux-side process.&lt;/p&gt;

&lt;p&gt;And once I saw it, the rest of the symptoms stopped being mysterious.&lt;/p&gt;

&lt;p&gt;The issue was not that my hooks were unreliable.&lt;/p&gt;

&lt;p&gt;The issue was that, in this Remote-WSL setup, the VS Code extension was handing Codex a malformed &lt;code&gt;cwd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead of turning something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;E:\work\my-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;into:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/mnt/e/work/my-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;something in the chain appeared to be prepending the wrong base directory to the raw Windows path instead of translating it.&lt;/p&gt;
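
&lt;p&gt;The translation itself is exactly what WSL's own &lt;code&gt;wslpath&lt;/code&gt; utility does. As a rough pure-Python sketch of the mapping the integration should have applied (assuming the default &lt;code&gt;/mnt&lt;/code&gt; drive mounts):&lt;br&gt;
&lt;/p&gt;

```python
import re

def win_to_wsl(path):
    # Map a Windows drive-letter path onto the default WSL /mnt layout,
    # the same translation `wslpath` performs on a stock WSL install.
    m = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if not m:
        raise ValueError(f"not a drive-letter path: {path!r}")
    drive = m.group(1).lower()
    rest = m.group(2).replace("\\", "/")
    return f"/mnt/{drive}/{rest}"

print(win_to_wsl(r"E:\work\my-project"))  # /mnt/e/work/my-project
```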

&lt;p&gt;One bad &lt;code&gt;cwd&lt;/code&gt; is enough to poison an entire process tree.&lt;/p&gt;

&lt;p&gt;Once child processes inherit it, you start getting misleading secondary failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hooks reported as failed even though the commands are valid&lt;/li&gt;
&lt;li&gt;subprocess behavior that differs from manual shell execution&lt;/li&gt;
&lt;li&gt;empty downstream logs because the real work never starts in a usable context&lt;/li&gt;
&lt;/ul&gt;
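
&lt;p&gt;You can watch the poisoning happen with nothing but the standard library: hand a child process a malformed &lt;code&gt;cwd&lt;/code&gt; and it dies before its first line ever executes, which is exactly why the downstream logs stayed empty:&lt;br&gt;
&lt;/p&gt;

```python
import subprocess
import sys

# A child launched with a malformed cwd dies before its first
# instruction runs, so any logging inside the child never happens.
try:
    subprocess.run(
        [sys.executable, "-c", "print('hook body ran')"],
        cwd=r"/mnt/c/does-not-exist/e:\work\my-project",
        check=True,
    )
except (FileNotFoundError, NotADirectoryError) as exc:
    print(f"spawn failed before the hook started: {exc}")
```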

&lt;p&gt;That was exactly what I was seeing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this bug was so misleading
&lt;/h2&gt;

&lt;p&gt;This is my least favorite kind of tooling bug: the kind that breaks at the seams.&lt;/p&gt;

&lt;p&gt;Nothing explodes cleanly.&lt;/p&gt;

&lt;p&gt;The agent still answers.&lt;/p&gt;

&lt;p&gt;The UI still looks alive.&lt;/p&gt;

&lt;p&gt;The hook command still works in isolation.&lt;/p&gt;

&lt;p&gt;The repository still exists where you expect it to exist.&lt;/p&gt;

&lt;p&gt;Only one inherited bit of process state is wrong, and that is enough to make the system feel haunted.&lt;/p&gt;

&lt;p&gt;From the outside, it looks like a flaky automation problem.&lt;br&gt;
From the inside, it is just a bad path string.&lt;/p&gt;

&lt;p&gt;That difference matters, because it changes what you should inspect first.&lt;/p&gt;

&lt;p&gt;If a command works manually but fails only through an editor or agent integration, do not immediately assume the logic is wrong. Compare launch context first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;cwd&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;shell&lt;/li&gt;
&lt;li&gt;environment variables&lt;/li&gt;
&lt;li&gt;path translation&lt;/li&gt;
&lt;li&gt;parent process&lt;/li&gt;
&lt;/ul&gt;
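
&lt;p&gt;If I had to distill that into one habit, it would be: make every hook record its own launch context before doing any real work. A minimal sketch — the field names here are just my own, not any tool's contract:&lt;br&gt;
&lt;/p&gt;

```python
import json
import os
import sys
import time

def launch_context():
    # Snapshot how this process was started, so a bad cwd or wrong
    # shell shows up in one glance instead of a day of guessing.
    return {
        "ts": time.time(),
        "cwd": os.getcwd(),
        "executable": sys.executable,
        "ppid": os.getppid(),
        "shell": os.environ.get("SHELL"),
        "path_head": (os.environ.get("PATH") or "").split(os.pathsep)[:3],
    }

print(json.dumps(launch_context(), indent=2))
```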

&lt;p&gt;I should have done that earlier.&lt;/p&gt;
&lt;h2&gt;
  
  
  The workaround was almost embarrassingly simple
&lt;/h2&gt;

&lt;p&gt;Once I knew what was broken, the local workaround was not clever at all:&lt;/p&gt;

&lt;p&gt;do not run Codex through the VS Code extension in that setup.&lt;/p&gt;

&lt;p&gt;Run it directly from a normal WSL shell in the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /mnt/e/work/my-project
codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same machine. Same repository. Same hooks. Same scripts.&lt;/p&gt;

&lt;p&gt;Different launch path.&lt;/p&gt;

&lt;p&gt;And in that mode, everything immediately got boring again, which is exactly what you want from tooling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clean &lt;code&gt;cwd&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;hooks marked &lt;code&gt;completed&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;logging pipeline resumes&lt;/li&gt;
&lt;li&gt;downstream capture behaves normally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was the moment I realized I had nearly talked myself into a much bigger solution for a much smaller problem.&lt;/p&gt;

&lt;p&gt;At one point I was already mentally drifting toward “maybe I should move more of this workflow to a server” or “maybe the local-first design is too fragile.”&lt;/p&gt;

&lt;p&gt;Nope.&lt;/p&gt;

&lt;p&gt;The local-first design was fine.&lt;br&gt;
The markdown-first architecture was fine.&lt;br&gt;
The scripts were fine.&lt;br&gt;
The hooks were fine.&lt;/p&gt;

&lt;p&gt;The working directory was wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed after that
&lt;/h2&gt;

&lt;p&gt;I stopped trying to force the elegant version of the workflow.&lt;/p&gt;

&lt;p&gt;My stable setup now is much more boring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code inside VS Code&lt;/li&gt;
&lt;li&gt;Codex in a separate WSL terminal&lt;/li&gt;
&lt;li&gt;both pointing at the same repository&lt;/li&gt;
&lt;li&gt;shared append-only logs&lt;/li&gt;
&lt;li&gt;no editor-managed &lt;code&gt;cwd&lt;/code&gt; surprises&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is less stylish than the version I was trying to build.&lt;br&gt;
It is also dramatically more reliable.&lt;/p&gt;

&lt;p&gt;And honestly, that feels like the right ending for this story.&lt;/p&gt;

&lt;p&gt;I spent a day chasing what looked like a deep pipeline bug.&lt;br&gt;
It turned out that the best fix was to put the CLI tool back in a terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I took away from it
&lt;/h2&gt;

&lt;p&gt;Three things.&lt;/p&gt;

&lt;p&gt;First: if multiple failures share a label, that does not mean they share a cause.&lt;/p&gt;

&lt;p&gt;Second: editor integrations often fail in ways that look like application bugs when they are really process-launch bugs.&lt;/p&gt;

&lt;p&gt;Third: if you are debugging anything that touches Windows, WSL, and an editor extension at the same time, inspect &lt;code&gt;cwd&lt;/code&gt; much earlier than feels necessary.&lt;/p&gt;

&lt;p&gt;One malformed working directory was enough to waste an entire day.&lt;/p&gt;

&lt;p&gt;I would rather somebody else not lose the same one.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vscode</category>
      <category>wsl</category>
      <category>debugging</category>
    </item>
    <item>
      <title>I thought my AI memory hook was broken. It turned out to be Windows, WSL, uv, and one missing login</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Mon, 13 Apr 2026 23:29:37 +0000</pubDate>
      <link>https://dev.to/ub3dqy/i-thought-my-ai-memory-hook-was-broken-it-turned-out-to-be-windows-wsl-uv-and-one-missing-login-a6</link>
      <guid>https://dev.to/ub3dqy/i-thought-my-ai-memory-hook-was-broken-it-turned-out-to-be-windows-wsl-uv-and-one-missing-login-a6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Part of the series &lt;strong&gt;Debugging Claude Agent SDK pipelines&lt;/strong&gt;. One of the layers I'll mention near the end — hidden account-level Gmail / Calendar MCP integrations blocking my subprocesses — deserved its own write-up: &lt;a href="https://dev.to/ub3dqy/hidden-gmail-and-calendar-integrations-quietly-broke-my-claude-sdk-pipeline-18ei"&gt;Hidden Gmail and Calendar integrations quietly broke my Claude SDK pipeline&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I noticed something weird in Codex.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UserPromptSubmit&lt;/code&gt; kept saying &lt;code&gt;completed&lt;/code&gt;, but &lt;code&gt;Stop&lt;/code&gt; kept saying &lt;code&gt;failed&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're building a memory tool, that's a bad combination. It means the assistant can still &lt;strong&gt;read&lt;/strong&gt; old context, but it may be failing to &lt;strong&gt;write&lt;/strong&gt; new context back into long-term memory. In other words: it looks smart in the moment, but its memory may be quietly falling apart behind the scenes.&lt;/p&gt;

&lt;p&gt;I assumed this would be a small hook bug.&lt;/p&gt;

&lt;p&gt;It wasn't.&lt;/p&gt;

&lt;p&gt;It turned into one of those debugging sessions where every layer was technically "working" and the system as a whole still wasn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was building
&lt;/h2&gt;

&lt;p&gt;I'm working on a markdown-first memory system for Claude Code and Codex. The shape is simple enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;when a session ends, a hook grabs the transcript&lt;/li&gt;
&lt;li&gt;a background script decides whether the conversation is worth saving&lt;/li&gt;
&lt;li&gt;if it is, it writes a distilled note into a daily log&lt;/li&gt;
&lt;li&gt;later, that gets compiled into wiki pages and injected back into future sessions&lt;/li&gt;
&lt;/ul&gt;
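
&lt;p&gt;As a rough sketch of that capture path — the &lt;code&gt;transcript_path&lt;/code&gt; field matches what I describe below, but everything else here is my own simplification, not the exact hook contract:&lt;br&gt;
&lt;/p&gt;

```python
import json
import subprocess
import sys

def on_stop(payload):
    # Hand the transcript off to a detached background capture process.
    transcript = payload.get("transcript_path")
    if not transcript:
        return  # the missing-transcript fallback lives elsewhere
    # Detach so a slow capture never trips the editor's hook timeout.
    subprocess.Popen(
        [sys.executable, "flush.py", transcript],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )

on_stop(json.loads('{"transcript_path": ""}'))  # empty path: skipped safely
```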

&lt;p&gt;The part that mattered here was the Codex &lt;code&gt;Stop&lt;/code&gt; hook. That's the capture path. If that hook fails, new memory may never make it into the wiki.&lt;/p&gt;

&lt;p&gt;So when I saw &lt;code&gt;Stop failed&lt;/code&gt; in the UI over and over again, I treated it as a real product problem, not a cosmetic one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first bug was real
&lt;/h2&gt;

&lt;p&gt;The first issue was exactly where I expected it to be: in the hook.&lt;/p&gt;

&lt;p&gt;The parser only understood one transcript shape. Codex was emitting another one. The hook would fire, look at the transcript, fail to extract meaningful context, and then skip capture.&lt;/p&gt;

&lt;p&gt;That part was straightforward to fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;teach the parser the real Codex transcript shape&lt;/li&gt;
&lt;li&gt;add a fallback for when &lt;code&gt;transcript_path&lt;/code&gt; is missing&lt;/li&gt;
&lt;li&gt;stop using the old turn-count gate and switch to a content-based threshold&lt;/li&gt;
&lt;li&gt;raise the timeout so the hook had room to finish&lt;/li&gt;
&lt;/ul&gt;
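
&lt;p&gt;The content-based gate is the least obvious item on that list, so here is roughly what I mean by it — a hypothetical threshold on how much extracted text survived, rather than on how many turns the transcript happened to contain:&lt;br&gt;
&lt;/p&gt;

```python
def worth_saving(context_text, min_chars=200):
    # Gate on the amount of substantive text extracted from the
    # transcript, not on a raw turn count. The 200-character floor
    # is an illustrative number, not the real tuned value.
    stripped = " ".join(context_text.split())
    return len(stripped) >= min_chars

print(worth_saving("hi"))  # False
```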

&lt;p&gt;After that, things got better. The logs stopped saying &lt;code&gt;SKIP: empty context&lt;/code&gt;, and I started seeing the line I wanted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Spawned flush.py for session ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At that point I thought I was done.&lt;/p&gt;

&lt;p&gt;I was not done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second bug was weirder
&lt;/h2&gt;

&lt;p&gt;Now the hook was successfully spawning the downstream capture process, which should have been a win.&lt;/p&gt;

&lt;p&gt;And then it still ended with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BrokenPipeError: [Errno 32] Broken pipe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was one of those bugs that is annoying precisely because it happens &lt;strong&gt;after&lt;/strong&gt; the important part.&lt;/p&gt;

&lt;p&gt;The capture process had already started. Memory might already be on its way to being saved. But the hook still looked failed in the Codex UI, which meant I couldn't trust the system yet.&lt;/p&gt;

&lt;p&gt;The cause turned out to be simple in hindsight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the hook took longer than the local timeout&lt;/li&gt;
&lt;li&gt;Codex closed stdout&lt;/li&gt;
&lt;li&gt;the hook tried to print its final success JSON into a pipe that no longer existed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So yes, the system was partly working. It just wasn't finishing cleanly.&lt;/p&gt;

&lt;p&gt;That fix was also small: protect the final stdout write against a closed pipe.&lt;/p&gt;
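
&lt;p&gt;The guard is only a few lines. Sketched with my own names: treat a broken pipe on the final status write as a non-event, because the real work has already been handed off by then:&lt;br&gt;
&lt;/p&gt;

```python
import sys

def emit_final_status(payload_json, stream=None):
    # Write the hook's closing JSON, tolerating a parent that
    # timed out and closed our stdout before we finished.
    stream = stream or sys.stdout
    try:
        stream.write(payload_json + "\n")
        stream.flush()
    except BrokenPipeError:
        # The capture process was already spawned; losing the
        # final status line is cosmetic, not a real failure.
        pass

emit_final_status('{"status": "ok"}')
```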

&lt;p&gt;At this point I had fixed the parser bug and the broken pipe. Surely &lt;em&gt;now&lt;/em&gt; the memory pipeline would work.&lt;/p&gt;

&lt;p&gt;Still no.&lt;/p&gt;

&lt;h2&gt;
  
  
  The third bug wasn't in the hook at all
&lt;/h2&gt;

&lt;p&gt;Once I got past the broken pipe, the downstream process itself started failing.&lt;/p&gt;

&lt;p&gt;The hook would spawn &lt;code&gt;flush.py&lt;/code&gt;, and then &lt;code&gt;flush.py&lt;/code&gt; would die with the deeply unhelpful classic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Command failed with exit code 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No useful stderr. No obvious explanation. Just enough information to waste an afternoon.&lt;/p&gt;

&lt;p&gt;This is the moment where I finally stopped assuming I was debugging "the hook" and started treating the whole thing like what it really was: a chain of separate runtimes.&lt;/p&gt;

&lt;p&gt;Because that's what it was.&lt;/p&gt;

&lt;p&gt;Not one program. A chain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Codex UI&lt;/li&gt;
&lt;li&gt;hook runner&lt;/li&gt;
&lt;li&gt;Python process&lt;/li&gt;
&lt;li&gt;subprocess launcher&lt;/li&gt;
&lt;li&gt;WSL boundary&lt;/li&gt;
&lt;li&gt;bundled Claude CLI&lt;/li&gt;
&lt;li&gt;local authentication state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each link could fail for a different reason.&lt;/p&gt;

&lt;p&gt;And that is exactly what had happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  One of the real bugs was hiding in the boundary
&lt;/h2&gt;

&lt;p&gt;At that stage, the most immediate root cause of the &lt;code&gt;exit code 1&lt;/code&gt; failure wasn't inside my Python code at all.&lt;/p&gt;

&lt;p&gt;The Claude CLI inside the WSL runtime wasn't authenticated.&lt;/p&gt;

&lt;p&gt;That was the immediate bug in this part of the investigation. There was still another layer around hidden account-level MCP integrations, but that turned into its own separate story. I wrote that one up separately here: &lt;a href="https://dev.to/ub3dqy/hidden-gmail-and-calendar-integrations-quietly-broke-my-claude-sdk-pipeline-18ei"&gt;Hidden Gmail and Calendar integrations quietly broke my Claude SDK pipeline&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I had already authenticated Claude on the Windows side. Claude Code on Windows was fine. But Codex hooks were spawning work inside WSL, and that runtime had its own separate &lt;code&gt;~/.claude&lt;/code&gt; state.&lt;/p&gt;

&lt;p&gt;So from one side of the system, Claude was logged in.&lt;/p&gt;

&lt;p&gt;From the other side, it wasn't.&lt;/p&gt;

&lt;p&gt;And because the failure was happening in a subprocess several layers down, what bubbled back up was just a generic process failure.&lt;/p&gt;

&lt;p&gt;That was the moment the whole debugging session clicked for me:&lt;/p&gt;

&lt;p&gt;I wasn't dealing with a broken feature. I was dealing with a system that crossed &lt;strong&gt;OS boundaries&lt;/strong&gt;, &lt;strong&gt;process boundaries&lt;/strong&gt;, and &lt;strong&gt;auth boundaries&lt;/strong&gt;, and I was still mentally treating it like one runtime.&lt;/p&gt;

&lt;p&gt;It wasn't one runtime.&lt;/p&gt;

&lt;p&gt;It was several. They just happened to be glued together tightly enough to look like one.&lt;/p&gt;

&lt;h2&gt;
  
  
  And then Windows joined in
&lt;/h2&gt;

&lt;p&gt;While I was cleaning that up, Claude Code on Windows started throwing a completely different kind of error on every hook run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error: failed to remove file `.venv\\lib64`: Access is denied. (os error 5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That turned out to be another boundary problem.&lt;/p&gt;

&lt;p&gt;I had Windows-side and WSL-side tooling both touching the same project environment. &lt;code&gt;uv&lt;/code&gt; was trying to be helpful. Windows was trying to be Windows. A POSIX-style &lt;code&gt;lib64&lt;/code&gt; symlink was involved. None of this was improving my mood.&lt;/p&gt;

&lt;p&gt;So now I had two parallel truths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Codex capture path was broken because WSL auth was missing&lt;/li&gt;
&lt;li&gt;the Claude-side hook launcher was unstable because the shared &lt;code&gt;.venv&lt;/code&gt; state was getting churned across environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both bugs were real.&lt;br&gt;
Neither bug lived in the same place.&lt;br&gt;
Both looked, from the outside, like "the AI tool is flaky again."&lt;/p&gt;
&lt;h2&gt;
  
  
  The part that actually mattered
&lt;/h2&gt;

&lt;p&gt;Here's the practical lesson I walked away with:&lt;/p&gt;

&lt;p&gt;When a system crosses runtime boundaries, the bug is often not in the place where the symptom shows up.&lt;/p&gt;

&lt;p&gt;The symptom showed up as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stop failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual causes were spread across multiple layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcript shape mismatch&lt;/li&gt;
&lt;li&gt;timeout mismatch&lt;/li&gt;
&lt;li&gt;unprotected stdout write&lt;/li&gt;
&lt;li&gt;missing WSL-side Claude auth&lt;/li&gt;
&lt;li&gt;shared Windows/WSL environment churn&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I had kept looking only at the final error message, I would have kept "fixing" the wrong layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed
&lt;/h2&gt;

&lt;p&gt;I didn't rewrite the system. I just stopped letting it be vague.&lt;/p&gt;

&lt;p&gt;I added enough visibility so each boundary could tell me when it was the one failing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcript parsing now understands real Codex output&lt;/li&gt;
&lt;li&gt;the stop hook has a fallback when transcript data is missing&lt;/li&gt;
&lt;li&gt;success output no longer crashes on a closed pipe&lt;/li&gt;
&lt;li&gt;the flush path logs more useful process diagnostics&lt;/li&gt;
&lt;li&gt;I verified the actual runtime where the subprocess was running, instead of assuming it matched the one I was sitting in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And maybe the most boring but important fix of all:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I authenticated Claude in the runtime that was actually doing the work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not "somewhere on the machine."&lt;br&gt;
Not "the CLI works for me."&lt;br&gt;
The exact runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lesson I'll keep
&lt;/h2&gt;

&lt;p&gt;The real lesson wasn't "add more logging."&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;if your tool crosses process boundaries, OS boundaries, and auth boundaries, you do not have one runtime anymore.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have a chain of semi-independent runtimes, and each one can fail in its own extremely specific way.&lt;/p&gt;

&lt;p&gt;That sounds obvious when written down. It was much less obvious when I was staring at one repeated &lt;code&gt;Stop failed&lt;/code&gt; message and assuming there had to be one neat root cause behind it.&lt;/p&gt;

&lt;p&gt;There wasn't.&lt;/p&gt;

&lt;p&gt;There were several small, ordinary failures, all stacked on top of each other. That's what made the bug feel slippery.&lt;/p&gt;

&lt;p&gt;And honestly, that is what a lot of debugging looks like in real life. Not one dramatic mistake. Just three or four boring mismatches, each living at a different seam, and all of them combining into one system that feels unreliable.&lt;/p&gt;

&lt;p&gt;Sometimes the hardest part is realizing that one ugly symptom actually belongs to more than one story.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you're building something similar
&lt;/h2&gt;

&lt;p&gt;Don't just test whether the hook runs.&lt;/p&gt;

&lt;p&gt;Test whether:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it runs in the runtime you think it runs in&lt;/li&gt;
&lt;li&gt;it has the credentials you think it has&lt;/li&gt;
&lt;li&gt;it can finish within the timeout you actually configured&lt;/li&gt;
&lt;li&gt;and the final side effect really happens&lt;/li&gt;
&lt;/ul&gt;
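
&lt;p&gt;The first two checks collapse into a tiny smoke script you run from inside the suspect runtime — the &lt;code&gt;~/.claude&lt;/code&gt; state directory and the &lt;code&gt;claude&lt;/code&gt; binary name match my setup; adjust for yours:&lt;br&gt;
&lt;/p&gt;

```python
import os
import platform
import shutil
from pathlib import Path

def runtime_report():
    # Which runtime am I actually in, and does it hold credentials?
    return {
        "platform": platform.system(),
        "release": platform.release(),
        "cwd": os.getcwd(),
        # Assumed credential location for my setup:
        "claude_state_dir": (Path.home() / ".claude").is_dir(),
        "claude_on_path": shutil.which("claude") is not None,
    }

for key, value in runtime_report().items():
    print(f"{key}: {value}")
```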

&lt;p&gt;Because "the script executed" is not the same thing as "the system worked."&lt;/p&gt;

&lt;p&gt;And if your tool is supposed to remember things for you, that difference matters a lot.&lt;/p&gt;




&lt;p&gt;I'm building this as part of &lt;a href="https://github.com/ub3dqy/llm-wiki" rel="noopener noreferrer"&gt;llm-wiki&lt;/a&gt;, a markdown-first memory layer for Claude Code and Codex. The part I underestimated wasn't the prompt design or the summarization logic. It was the plumbing around the boundaries.&lt;/p&gt;

&lt;p&gt;Which, in hindsight, is exactly where these systems like to break.&lt;/p&gt;

</description>
      <category>debugging</category>
      <category>python</category>
      <category>ai</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Hidden Gmail and Calendar integrations quietly broke my Claude SDK pipeline</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:49:42 +0000</pubDate>
      <link>https://dev.to/ub3dqy/hidden-gmail-and-calendar-integrations-quietly-broke-my-claude-sdk-pipeline-18ei</link>
      <guid>https://dev.to/ub3dqy/hidden-gmail-and-calendar-integrations-quietly-broke-my-claude-sdk-pipeline-18ei</guid>
      <description>&lt;p&gt;I lost a few hours to one of those bugs that feels fake when you first describe it out loud.&lt;/p&gt;

&lt;p&gt;My Claude Agent SDK pipeline kept failing with the most generic error possible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Command failed with exit code 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No useful stderr. No clear traceback. No obvious repro beyond "real sessions fail, synthetic tests sometimes pass."&lt;/p&gt;

&lt;p&gt;At first I thought it was just another boring auth problem. Then I thought it was a subprocess visibility problem. Then I thought it was WSL. All of those were plausible. None of them were the whole story.&lt;/p&gt;

&lt;p&gt;The actual cause was stranger:&lt;/p&gt;

&lt;p&gt;after &lt;code&gt;claude auth login&lt;/code&gt;, my account quietly picked up &lt;strong&gt;account-level Gmail and Google Calendar MCP integrations&lt;/strong&gt; that I had never explicitly enabled, could not see in the Claude web UI, and could not remove from the CLI. Those integrations wanted an interactive Google OAuth flow, and that was enough to break every non-interactive Claude SDK subprocess I was using for automation.&lt;/p&gt;

&lt;p&gt;The workaround was one CLI flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;extra_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strict-mcp-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was the end of the bug.&lt;/p&gt;

&lt;p&gt;Finding it was the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this showed up
&lt;/h2&gt;

&lt;p&gt;I use Claude Agent SDK inside a small memory pipeline.&lt;/p&gt;

&lt;p&gt;The shape is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a hook fires at the end of a Codex session&lt;/li&gt;
&lt;li&gt;the hook spawns &lt;code&gt;flush.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;flush.py&lt;/code&gt; calls Claude Agent SDK&lt;/li&gt;
&lt;li&gt;the result gets written into a daily markdown log&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the kind of automation that should be boring once it's set up.&lt;/p&gt;

&lt;p&gt;Instead, real Codex sessions started failing. The hook would fire, the downstream script would start, and then the Agent SDK step would die with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fatal error in message reader: Command failed with exit code 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was enough to break the memory pipeline completely. No flush, no daily log entry, no durable memory from those sessions.&lt;/p&gt;

&lt;p&gt;And because the error was happening in a subprocess layer, the surface signal was awful. It just looked like "the SDK sometimes fails."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this was so annoying to debug
&lt;/h2&gt;

&lt;p&gt;There were at least three reasons this bug wasted my time.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The first fix appeared to work
&lt;/h3&gt;

&lt;p&gt;At one point I thought I had solved it just by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude auth login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for a moment, it looked like I had.&lt;/p&gt;

&lt;p&gt;Synthetic tests passed. The bundled Claude binary responded. The pipeline looked alive again.&lt;/p&gt;

&lt;p&gt;That "fix" turned out to be fake.&lt;/p&gt;

&lt;p&gt;It worked briefly because the environment had not yet fully populated the auth-related cache state that was about to cause the real failure.&lt;/p&gt;

&lt;p&gt;So I got the worst possible debugging gift: a temporary false victory.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The important error wasn't where I was looking
&lt;/h3&gt;

&lt;p&gt;I had already added stderr visibility around the SDK call, expecting to catch the real CLI failure there.&lt;/p&gt;

&lt;p&gt;That didn't help much, because the Google MCP auth path wasn't producing a nice actionable stderr message.&lt;/p&gt;

&lt;p&gt;The auth prompt was effectively happening on the wrong channel for my diagnostic setup. The SDK stderr callback got nothing useful. &lt;code&gt;ProcessError.stderr&lt;/code&gt; was empty. All I had was the outer shell of the failure.&lt;/p&gt;

&lt;p&gt;From the outside, it still looked like "exit code 1, good luck."&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The integrations were invisible
&lt;/h3&gt;

&lt;p&gt;This was the part that really crossed from "normal debugging" into "what exactly is this system doing?"&lt;/p&gt;

&lt;p&gt;In Claude.ai web settings, I could see the usual visible things. But I could not see any Gmail or Google Calendar integrations anywhere I would expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;not in &lt;strong&gt;Settings → Connectors&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;not in &lt;strong&gt;Settings → Customize → Skills&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;not in &lt;strong&gt;Customize → Connectors&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet on disk, after &lt;code&gt;claude auth login&lt;/code&gt;, I could see this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"claude.ai Gmail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1776092446619&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"claude.ai Google Calendar"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1776092446683&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That came from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/mcp-needs-auth-cache.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the CLI confirmed the same story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claude.ai Gmail:           https://gmail.mcp.claude.com/mcp - ! Needs authentication
claude.ai Google Calendar: https://gcal.mcp.claude.com/mcp - ! Needs authentication
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At that point the bug finally started making sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually happening
&lt;/h2&gt;

&lt;p&gt;Here is the version I wish someone had handed me before I started digging:&lt;/p&gt;

&lt;p&gt;When you authenticate Claude via OAuth, your account may receive &lt;strong&gt;account-level MCP integrations&lt;/strong&gt; as part of its backend claims.&lt;/p&gt;

&lt;p&gt;In my case, that included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;claude.ai Gmail&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;claude.ai Google Calendar&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those integrations were not local project config.&lt;br&gt;
They were not something I had added manually in the repo.&lt;br&gt;
And they were not removable with normal local MCP commands.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claude mcp remove "claude.ai Gmail"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No MCP server found with name: "claude.ai Gmail"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which makes sense once you realize the CLI isn't managing them as local entries. They are attached at the account/backend layer.&lt;/p&gt;

&lt;p&gt;Then the next failure follows naturally:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;bundled Claude CLI starts inside a non-interactive SDK subprocess&lt;/li&gt;
&lt;li&gt;it checks account-level MCP claims&lt;/li&gt;
&lt;li&gt;it sees Gmail and Calendar need additional Google auth&lt;/li&gt;
&lt;li&gt;it tries to initiate an interactive OAuth flow&lt;/li&gt;
&lt;li&gt;there is no TTY / browser / normal user interaction path&lt;/li&gt;
&lt;li&gt;subprocess stalls or exits with code 1&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That was the whole bug.&lt;/p&gt;

&lt;p&gt;Not my prompt.&lt;br&gt;
Not my retry logic.&lt;br&gt;
Not the summary logic.&lt;br&gt;
Not even the main auth state in the obvious sense.&lt;/p&gt;

&lt;p&gt;Just hidden MCP integrations pulling a subprocess into an auth flow it had no way to complete.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why this matters beyond my repo
&lt;/h2&gt;

&lt;p&gt;This is not really about my particular memory pipeline.&lt;/p&gt;

&lt;p&gt;It matters because subprocess-based Claude SDK automation is a completely normal pattern. People use it for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;capture pipelines&lt;/li&gt;
&lt;li&gt;background summarizers&lt;/li&gt;
&lt;li&gt;hook-triggered analysis&lt;/li&gt;
&lt;li&gt;scheduled jobs&lt;/li&gt;
&lt;li&gt;internal tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of those run in contexts where interactive browser auth is either awkward or impossible.&lt;/p&gt;

&lt;p&gt;If account-level integrations that require fresh OAuth can quietly attach themselves after login, and if they are invisible in the UI, then the failure mode becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;everything looks fine in your normal interactive CLI&lt;/li&gt;
&lt;li&gt;your automation suddenly fails in the background&lt;/li&gt;
&lt;li&gt;the error surface is generic&lt;/li&gt;
&lt;li&gt;and the root cause is not where you would reasonably look first&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a nasty class of bug.&lt;/p&gt;
&lt;h2&gt;
  
  
  The workaround that fixed it
&lt;/h2&gt;

&lt;p&gt;The workaround was to isolate the subprocess from account-level MCP discovery.&lt;/p&gt;

&lt;p&gt;In practice, that meant passing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extra_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strict-mcp-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In effect, the bundled Claude binary gets invoked with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--strict-mcp-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
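&lt;p&gt;My working assumption, and it is an assumption rather than documented behavior I have verified, is that each &lt;code&gt;extra_args&lt;/code&gt; entry becomes a CLI flag, with &lt;code&gt;None&lt;/code&gt; meaning a bare boolean flag. Sketched out:&lt;/p&gt;

```python
def extra_args_to_flags(extra_args):
    """My working assumption about the extra_args contract: a None value
    becomes a bare boolean flag, anything else becomes --key value."""
    flags = []
    for key, value in extra_args.items():
        flags.append(f"--{key}")
        if value is not None:
            flags.append(str(value))
    return flags

print(extra_args_to_flags({"strict-mcp-config": None}))  # → ['--strict-mcp-config']
```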



&lt;p&gt;And once I did that, the subprocess stopped trying to discover or auth those account-level MCP integrations.&lt;/p&gt;

&lt;p&gt;The pipeline started running cleanly again.&lt;/p&gt;

&lt;p&gt;That was the actual fix.&lt;/p&gt;

&lt;p&gt;Deleting the cache file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; ~/.claude/mcp-needs-auth-cache.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;did not really fix anything. It just bought a little time until the state was regenerated.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part I still don't love
&lt;/h2&gt;

&lt;p&gt;The workaround is fine. I'm happy to use it in automation.&lt;/p&gt;

&lt;p&gt;What I don't love is the product behavior that made it necessary.&lt;/p&gt;

&lt;p&gt;From a developer point of view, a few things feel wrong here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hidden integrations are a bad default
&lt;/h3&gt;

&lt;p&gt;If my account has Gmail and Calendar MCP integrations attached, I should be able to see them in the UI and turn them off.&lt;/p&gt;

&lt;p&gt;Right now, from the outside, it feels like they exist in a shadow layer of account state that can affect automation without being visible where a user would normally manage integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Opt-out without visible opt-in is rough
&lt;/h3&gt;

&lt;p&gt;I didn't explicitly wire Gmail or Calendar into this project. Yet they still ended up influencing subprocess behavior after OAuth login.&lt;/p&gt;

&lt;p&gt;That is a surprising default for anyone using Claude as a programmable toolchain component instead of just a chat app.&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-interactive contexts should fail more gracefully
&lt;/h3&gt;

&lt;p&gt;If the process is clearly non-interactive, the CLI should not wander into a user-hostile auth path and then collapse into a generic exit code.&lt;/p&gt;

&lt;p&gt;At minimum, I would want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a clear message saying an MCP integration requires interactive authentication&lt;/li&gt;
&lt;li&gt;the name of the integration&lt;/li&gt;
&lt;li&gt;and ideally a way to skip it automatically in SDK subprocess mode&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I wish the docs said
&lt;/h2&gt;

&lt;p&gt;The one warning I really needed was something like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you use Claude Agent SDK in non-interactive subprocesses, account-level MCP integrations attached through OAuth may trigger additional authentication flows. If those integrations are not fully authenticated, subprocess calls may fail. Use strict MCP config isolation for automation workloads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single paragraph would have saved me hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;If you hit a mysterious Claude Agent SDK subprocess failure with a generic &lt;code&gt;exit code 1&lt;/code&gt;, and your interactive CLI mostly works, check whether hidden account-level MCP integrations are involved before you start rewriting your own code.&lt;/p&gt;

&lt;p&gt;In my case, the real sequence was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I thought the pipeline was unauthenticated&lt;/li&gt;
&lt;li&gt;then I thought stderr visibility was missing&lt;/li&gt;
&lt;li&gt;then I thought WSL was the root cause&lt;/li&gt;
&lt;li&gt;and the real issue turned out to be hidden Gmail and Calendar MCP integrations trying to force Google OAuth inside a non-interactive subprocess&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix was not glamorous:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;extra_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strict-mcp-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But at least now I know what class of failure I was dealing with.&lt;/p&gt;

&lt;p&gt;And honestly, that's often the hardest part.&lt;/p&gt;




&lt;p&gt;I'm building this as part of a markdown-first memory system for Claude Code and Codex. I keep expecting the hard parts to be prompt design or summarization quality. More often than not, the real work is figuring out which invisible layer is making the obvious layer look broken.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>debugging</category>
      <category>claude</category>
    </item>
    <item>
      <title>How I shipped a broken capture pipeline and didn't notice for 3 days</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:00:36 +0000</pubDate>
      <link>https://dev.to/ub3dqy/how-i-shipped-a-broken-capture-pipeline-and-didnt-notice-for-3-days-4b1h</link>
      <guid>https://dev.to/ub3dqy/how-i-shipped-a-broken-capture-pipeline-and-didnt-notice-for-3-days-4b1h</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I built a hook-based capture system for Claude Code. Every session-end was supposed to get summarized and written into a daily log. My &lt;code&gt;doctor.py&lt;/code&gt; gate said &lt;code&gt;13/13 PASS&lt;/code&gt;. Lint was clean. CI was green on every commit.&lt;/p&gt;

&lt;p&gt;Then a user asked a simple question: &lt;em&gt;"Is the wiki actually capturing this conversation?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I checked the log. &lt;strong&gt;57% of my recent sessions had been silently dropped&lt;/strong&gt; for three days. The gate never told me. Every smoke test was passing. The system was broken in the one place no test was actually looking.&lt;/p&gt;

&lt;p&gt;This is what happened, how I caught it, and what I changed so I would not miss the same kind of bug again.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The project is a memory system for Claude Code and Codex CLI. A session-end hook reads the transcript and hands it off to a background Python script; that script asks the Agent SDK whether the conversation is worth saving, and the result gets appended to &lt;code&gt;daily/YYYY-MM-DD.md&lt;/code&gt;. Fairly normal hook plumbing.&lt;/p&gt;

&lt;p&gt;I had a &lt;code&gt;doctor.py&lt;/code&gt; script with 13 smoke checks across the pipeline. &lt;code&gt;session-start.py&lt;/code&gt; produced valid JSON. &lt;code&gt;user-prompt-wiki.py&lt;/code&gt; could look up articles. &lt;code&gt;stop.py&lt;/code&gt; exited cleanly. I had structural lint. I had a green CI gate on every push.&lt;/p&gt;

&lt;p&gt;I had shipped eight commits over two days, each with &lt;code&gt;doctor --quick&lt;/code&gt; green, each with CI passing. I was telling myself the system was in good shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  The moment of doubt
&lt;/h2&gt;

&lt;p&gt;Someone I was working with asked a very simple question: &lt;em&gt;"Just to confirm, is the wiki actually storing this conversation?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I almost said yes immediately. The hooks were wired. Every prompt I sent was coming back with wiki snippets injected at the top. &lt;code&gt;UserPromptSubmit&lt;/code&gt; was clearly doing its job. From the outside, the system looked alive.&lt;/p&gt;

&lt;p&gt;But I have been burned by "it looks alive" enough times that I checked instead of trusting the feeling. I opened &lt;code&gt;scripts/flush.log&lt;/code&gt;, the file where the session-end and pre-compact hooks write their operational log, and scrolled to the recent entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2026-04-12 16:36:39 INFO [session-end] SessionEnd fired: session=...
2026-04-12 16:36:39 INFO [session-end] SKIP: only 2 turns (min 4)
2026-04-12 16:39:27 INFO [session-end] SessionEnd fired: session=...
2026-04-12 16:39:27 INFO [session-end] SKIP: only 2 turns (min 4)
2026-04-12 16:42:07 INFO [session-end] SessionEnd fired: session=...
2026-04-12 16:42:07 INFO [session-end] SKIP: only 2 turns (min 4)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was the moment my confidence disappeared.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was looking at
&lt;/h2&gt;

&lt;p&gt;The hooks &lt;strong&gt;were firing&lt;/strong&gt;. &lt;code&gt;SessionEnd fired&lt;/code&gt; is printed before any filtering happens, so those lines meant the hook chain from Claude Code to my Python script was intact. The wiring was not the problem.&lt;/p&gt;

&lt;p&gt;But then immediately after, on every single recent entry: &lt;code&gt;SKIP: only 2 turns (min 4)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;My session-end code had this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MIN_TURNS_TO_FLUSH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="c1"&gt;# ... later ...
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;turn_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MIN_TURNS_TO_FLUSH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SKIP: only %d turns (min %d)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MIN_TURNS_TO_FLUSH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was supposed to protect against flushing trivial sessions, the "one question, one answer, exit" pattern that is probably not worth archiving. The threshold &lt;code&gt;4&lt;/code&gt; felt reasonable when I wrote it. It felt reasonable when I reviewed it. It passed every test.&lt;/p&gt;

&lt;p&gt;What I had not really internalized was the shape of my own usage. A typical Claude Code session for me is: open terminal, ask one specific question, get one specific answer, close terminal. That is &lt;strong&gt;exactly 2 turns&lt;/strong&gt;. The rule I wrote to skip "trivial" sessions was skipping &lt;strong&gt;my normal session shape&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I ran the numbers over the full log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SessionEnd fired:       109
Spawned flush.py:        52  (48%)
Skipped (various):       57  (52%)
Most recent skip reason: "SKIP: only 2 turns (min 4)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Over half the sessions from the last three days had been silently dropped.&lt;/strong&gt; Not edge cases. Not weird corner traffic. Just normal usage. The daily log for those days had looked thinner than it should have been, and I had noticed that in the background, but never chased it because &lt;code&gt;doctor --quick&lt;/code&gt; was green and I trusted it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the gate didn't catch it
&lt;/h2&gt;

&lt;p&gt;The actual bug was trivial. Change a number. That part took no time. The question that mattered was: why did my gate tell me everything was fine while half the data was disappearing?&lt;/p&gt;

&lt;p&gt;Let me walk through what &lt;code&gt;doctor --full&lt;/code&gt; actually tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;check_session_start_smoke&lt;/code&gt;&lt;/strong&gt; — runs &lt;code&gt;session-start.py&lt;/code&gt; with an empty JSON input, verifies it prints a valid hook-output JSON. ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;check_user_prompt_smoke&lt;/code&gt;&lt;/strong&gt; — runs &lt;code&gt;user-prompt-wiki.py&lt;/code&gt;, verifies it returns additionalContext with articles. ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;check_stop_smoke&lt;/code&gt;&lt;/strong&gt; — runs &lt;code&gt;stop.py&lt;/code&gt;, verifies it exits cleanly on empty stdin. ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;check_index_freshness&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;check_structural_lint&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;check_env_settings&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;check_path_normalization&lt;/code&gt;&lt;/strong&gt; — the rest of the usual health-check surface.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice what those tests are really asking. Each one asks: &lt;em&gt;"Does this script run without crashing?"&lt;/em&gt; That is a useful question. It catches real bugs: &lt;code&gt;ImportError&lt;/code&gt; after a refactor, &lt;code&gt;JSONDecodeError&lt;/code&gt; from bad stdin, &lt;code&gt;FileNotFoundError&lt;/code&gt; after a rename. But it is not the question I actually cared about: &lt;em&gt;"Does a real transcript, processed by this chain, end up in the daily log?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That question has three subtly different parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Does the hook fire when Claude Code ends a session?&lt;/strong&gt; (Yes — I could see it in the log.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does the hook's filter logic produce a "worth-saving" verdict for realistic input?&lt;/strong&gt; (Turns out: no, because of the bug above.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does the downstream chain actually write the result to the daily log?&lt;/strong&gt; (Unknown, because step 2 always said no.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;doctor --full&lt;/code&gt; tested a weak version of (1) by running the script with an empty payload. It did not test (2), because that needs a realistic transcript. It did not test (3), because the chain never got that far. Every link passed in isolation, and the chain as a whole was still broken.&lt;/p&gt;

&lt;p&gt;This is the classic distinction between &lt;strong&gt;smoke tests&lt;/strong&gt; and &lt;strong&gt;end-to-end tests&lt;/strong&gt;. In theory everybody knows it. In a personal tool, it is easy to get lazy about it. You know what the chain is supposed to do, so testing the whole thing can feel redundant. It is not redundant at all. The chain breaks in exactly the places where each individual component still passes its own tiny check.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two things I added to stop this happening again
&lt;/h2&gt;

&lt;p&gt;The code fix itself was boring: replace the turn-based threshold with a content-based one. Short but substantial sessions, two turns and a couple thousand characters of real discussion, now get captured. Tiny sessions, two turns and thirty characters of "ok thanks", still get skipped, but by character count instead of turn count. That is not really the point of the post.&lt;/p&gt;
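&lt;p&gt;For completeness, the replacement gate looks roughly like this (threshold value and transcript shape simplified from the real code):&lt;/p&gt;

```python
MIN_FLUSH_CHARS = 500  # illustrative threshold; tune it to your own usage

def worth_flushing(turns):
    """Content-based gate: flush when the transcript carries enough real
    text, no matter how few turns it took. `turns` is a list of dicts with
    a "content" field, a simplification of my real transcript format."""
    total_chars = sum(len(t.get("content", "")) for t in turns)
    return total_chars >= MIN_FLUSH_CHARS

# Two short-but-substantial turns pass; a drive-by "ok thanks" does not.
print(worth_flushing([{"content": "x" * 400}, {"content": "y" * 400}]))  # → True
print(worth_flushing([{"content": "ok thanks"}]))                        # → False
```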

&lt;p&gt;The interesting part is what I added to &lt;code&gt;doctor.py&lt;/code&gt; afterward, because that is what turns this from a one-off fix into something the project can actually defend itself with.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Observability check: &lt;code&gt;check_flush_capture_health&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This one reads &lt;code&gt;scripts/flush.log&lt;/code&gt; over a rolling 7-day window and summarizes what the capture pipeline has actually been doing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_flush_capture_health&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CheckResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# parse flush.log, count SessionEnd fired vs Spawned flush.py
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;detail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Last 7d: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;spawned&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;session_fired&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; flushes spawned (skip rate &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;skip_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;spawned&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CheckResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flush_capture_health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Pipeline appears broken: SessionEnds fired but nothing was spawned.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;skip_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CheckResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flush_capture_health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [attention: high skip rate — consider lowering WIKI_MIN_FLUSH_CHARS]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CheckResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flush_capture_health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important design choice: &lt;strong&gt;this check only FAILs when the pipeline is observably broken&lt;/strong&gt;. If SessionEnds fired but nothing was spawned, that is a correctness bug. It does &lt;strong&gt;not&lt;/strong&gt; FAIL on high skip rate, because skip rate is historical data about past usage, not necessarily a problem with the current code. A fresh clone has no history and should pass. A repo with lots of short sessions may have a high skip rate and still be behaving correctly. Blocking the merge gate on historical observability would be a mistake.&lt;/p&gt;

&lt;p&gt;The check prints an &lt;code&gt;[attention]&lt;/code&gt; marker in the detail line when the skip rate goes above 50%. On the first run in my own repo after I added it, it printed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[PASS] flush_capture_health: Last 7d: 50/121 flushes spawned (skip rate 59%)
       [attention: high skip rate — consider lowering WIKI_MIN_FLUSH_CHARS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one line would have saved me three days.&lt;/p&gt;
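&lt;p&gt;The counting behind that check is nothing fancy. A reduced version of the log parsing, with the marker strings matching what my hooks actually log and everything else simplified:&lt;/p&gt;

```python
def capture_stats(log_lines):
    """Count SessionEnd events against actual flush spawns and derive the
    skip rate. The real check adds a 7-day window and a CheckResult wrapper;
    those are stripped out here."""
    fired = sum(1 for line in log_lines if "SessionEnd fired" in line)
    spawned = sum(1 for line in log_lines if "Spawned flush.py" in line)
    skip_rate = 1 - spawned / fired if fired else 0.0
    return fired, spawned, skip_rate

lines = [
    "16:36:39 INFO [session-end] SessionEnd fired: session=a",
    "16:36:39 INFO [session-end] SKIP: only 2 turns (min 4)",
    "16:39:27 INFO [session-end] SessionEnd fired: session=b",
    "16:39:28 INFO [session-end] Spawned flush.py pid=1234",
]
print(capture_stats(lines))  # → (2, 1, 0.5)
```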

&lt;h3&gt;
  
  
  2. End-to-end acceptance test: &lt;code&gt;check_flush_roundtrip&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the answer to the "why didn't any test catch this?" question. It only runs in &lt;code&gt;doctor --full&lt;/code&gt;, because it is more expensive than the fast smoke checks.&lt;/p&gt;

&lt;p&gt;The test writes a dummy 6-turn transcript, about 2000 characters of realistic content, to a temp file. Then it invokes &lt;code&gt;hooks/session-end.py&lt;/code&gt; as a real subprocess with a realistic hook-input JSON on stdin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test_session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doctor-roundtrip-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;transcript_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SCRIPTS_DIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doctor-transcript-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;test_session_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# ... write dummy turns ...
&lt;/span&gt;
&lt;span class="n"&gt;hook_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;test_session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doctor-roundtrip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cwd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ROOT_DIR&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WIKI_FLUSH_TEST_MODE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;proc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_end_script&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hook_input&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;WIKI_FLUSH_TEST_MODE=1&lt;/code&gt; environment variable. That is the trick. The downstream script, &lt;code&gt;flush.py&lt;/code&gt;, checks it at startup and, if it is set, skips the real Agent SDK call and writes a marker file to a known location instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In flush.py
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WIKI_FLUSH_TEST_MODE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;TEST_MARKER_FILE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FLUSH_TEST_OK session=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ts=...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test then polls for that marker file with a 15-second timeout, verifies it contains the right session ID, and cleans up. If any link in the chain is broken — if &lt;code&gt;session-end.py&lt;/code&gt; does not spawn &lt;code&gt;flush.py&lt;/code&gt;, if &lt;code&gt;flush.py&lt;/code&gt; fails to import, if the environment is not inherited correctly — the marker never appears and the test fails with a clear message.&lt;/p&gt;

&lt;p&gt;This is an actual end-to-end test, not a smoke test. It exercises the real subprocess invocation, real environment inheritance, real stdin/stdout piping, and real timing. The only thing it fakes is the API call itself, because that would cost money and pollute the real daily log.&lt;/p&gt;

&lt;p&gt;On my machine right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[PASS] flush_roundtrip: session-end -&amp;gt; flush.py chain completed in test mode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If I had had this test two weeks earlier, I would have caught the &lt;code&gt;MIN_TURNS = 4&lt;/code&gt; bug on the first realistic transcript. It would not have needed to be clever. A visible skip where a spawn was expected would have been enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lessons, short enough to remember
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Smoke tests are not end-to-end tests, and they do not substitute for one.&lt;/strong&gt; I had thirteen checks in &lt;code&gt;doctor.py&lt;/code&gt;, and all of them were correct in isolation. None of them ran the actual production chain from a realistic input to a verifiable output. If you have a multi-process pipeline, you need at least one test that exercises the whole thing. It does not have to be fast and it does not have to run on every commit. It just has to exist somewhere meaningful in your gate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Observability is a design choice, not an afterthought.&lt;/strong&gt; My hooks were writing perfectly good operational logs. I just was not reading them, and my gate was not reading them either. Adding a check that summarizes those logs took about forty lines of code and would have turned a silent three-day outage into a visible &lt;code&gt;[attention]&lt;/code&gt; marker from day one. Logs you do not read are not much better than logs you never wrote.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. If a test could have caught the bug, it belongs in the gate — even if adding it feels obvious in hindsight.&lt;/strong&gt; The wrong question is "why didn't I add this on day one?" The better question is "what class of future bugs does this protect me from now?" Hindsight is always perfect about the bug you already know. What you want is &lt;strong&gt;general immunity&lt;/strong&gt; to the class of bugs you just learned about.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you are building similar hook-based systems, the code for the project where this happened is at &lt;a href="https://github.com/ub3dqy/llm-wiki" rel="noopener noreferrer"&gt;github.com/ub3dqy/llm-wiki&lt;/a&gt;. It is a markdown-first memory system for Claude Code and Codex CLI, and both fixes described here — the content-based threshold and the roundtrip test — live in &lt;code&gt;scripts/doctor.py&lt;/code&gt; and &lt;code&gt;hooks/session-end.py&lt;/code&gt;. No API keys, no vector database, and it boots with &lt;code&gt;uv run python scripts/setup.py&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>debugging</category>
      <category>python</category>
    </item>
  </channel>
</rss>
