<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: UB3DQY</title>
    <description>The latest articles on DEV Community by UB3DQY (@ub3dqy).</description>
    <link>https://dev.to/ub3dqy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875580%2F3dd40e53-66ce-4b1c-bdb9-bd0bdd6ca2f5.jpg</url>
      <title>DEV Community: UB3DQY</title>
      <link>https://dev.to/ub3dqy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ub3dqy"/>
    <language>en</language>
    <item>
      <title>Six small PRs later, repo hygiene stopped being a suggestion</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:16:01 +0000</pubDate>
      <link>https://dev.to/ub3dqy/six-small-prs-later-repo-hygiene-stopped-being-a-suggestion-3m70</link>
      <guid>https://dev.to/ub3dqy/six-small-prs-later-repo-hygiene-stopped-being-a-suggestion-3m70</guid>
      <description>&lt;p&gt;Over the last 24 hours I did one of those jobs that sounds smaller than it really is.&lt;/p&gt;

&lt;p&gt;On paper it was boring: tighten up repo hygiene in a small Python project.&lt;/p&gt;

&lt;p&gt;In practice it meant taking a repo where the right tools technically existed, but mostly as polite background noise, and turning them into something that actually pushes back.&lt;/p&gt;

&lt;p&gt;That distinction matters more than people admit.&lt;/p&gt;

&lt;p&gt;A lot of repos have hygiene in the aspirational sense. There is a formatter. There is a linter. There is a workflow file. There is maybe even a note somewhere saying “we should probably enforce this.” And yet the real contract of the repository is still social. If somebody forgets to sort imports, or reformats one file strangely, or adds a new script with a different style, nothing happens except maybe a vague feeling that the code is getting fuzzier around the edges.&lt;/p&gt;

&lt;p&gt;That was roughly where this repo was.&lt;/p&gt;

&lt;p&gt;By the end of the day, it wasn’t.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;The repo is a slightly odd little Python codebase: scripts, hooks, some WSL/Windows friction, and enough operational glue that a careless formatting pass can do real damage.&lt;/p&gt;

&lt;p&gt;The hygiene sequence ended up landing as six small PRs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;add a proper dev dependency group&lt;/li&gt;
&lt;li&gt;enable Ruff import sorting only, with an explicit Python target version&lt;/li&gt;
&lt;li&gt;add a narrow Ruff import-sorting check to CI&lt;/li&gt;
&lt;li&gt;run a style-only &lt;code&gt;ruff format&lt;/code&gt; pass&lt;/li&gt;
&lt;li&gt;add &lt;code&gt;.git-blame-ignore-revs&lt;/code&gt; for that style commit&lt;/li&gt;
&lt;li&gt;add &lt;code&gt;ruff format --check&lt;/code&gt; to CI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That list is tidy now. It did not start tidy.&lt;/p&gt;
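&lt;p&gt;The first two PRs are just a few lines of &lt;code&gt;pyproject.toml&lt;/code&gt;. A minimal sketch, with the group name, version pins, and Python target as assumptions rather than the repo’s actual values:&lt;/p&gt;

```toml
# PEP 735 dependency group for dev tooling (group name "dev" is assumed)
[dependency-groups]
dev = [
    "ruff>=0.4",
]

# Enable only import sorting ("I") and be explicit about the interpreter
# version Ruff should target, instead of letting it guess.
[tool.ruff]
target-version = "py311"  # assumption: set to the project's actual floor

[tool.ruff.lint]
select = ["I"]
```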

&lt;p&gt;The first temptation was the obvious one: if Ruff is underused, just turn on more Ruff. Add &lt;code&gt;UP&lt;/code&gt;, add &lt;code&gt;B&lt;/code&gt;, wire in pre-commit, maybe clean up packaging while we’re here, maybe finally fix the command entry point. The classic “while I’m touching hygiene, I might as well modernize the whole repo” impulse.&lt;/p&gt;

&lt;p&gt;That would have been a mistake.&lt;/p&gt;

&lt;p&gt;The actual winning move was to narrow the scope until every step was defensible on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The difference between “installed” and “real”
&lt;/h2&gt;

&lt;p&gt;The repo already had Ruff.&lt;/p&gt;

&lt;p&gt;Which is exactly why this kind of work gets postponed forever.&lt;/p&gt;

&lt;p&gt;When a tool already exists in &lt;code&gt;pyproject.toml&lt;/code&gt;, people start speaking about it in the present tense. “We use Ruff.” “We have formatting.” “The repo is linted.” But if nothing in the actual merge path forces those claims to matter, what you really have is a tool-shaped decoration.&lt;/p&gt;

&lt;p&gt;That was the first useful correction from this round of work:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a tool is not part of the contract until CI can turn red because of it&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the order mattered.&lt;/p&gt;

&lt;p&gt;Not pre-commit first. Not “let’s all remember to run it locally.” Not “there’s a command for that.”&lt;/p&gt;

&lt;p&gt;CI first.&lt;/p&gt;

&lt;p&gt;Once &lt;code&gt;ruff check --select I scripts/ hooks/&lt;/code&gt; and &lt;code&gt;ruff format --check scripts/ hooks/&lt;/code&gt; sit in the workflow, the repo changes shape. Future PRs can’t quietly reintroduce unsorted imports or drifting formatting and call it an accident. The repository starts enforcing a boundary instead of merely describing one.&lt;/p&gt;

&lt;p&gt;That is a much more important transition than it sounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we did not turn on all the rules
&lt;/h2&gt;

&lt;p&gt;This was probably the most useful design decision of the whole sequence.&lt;/p&gt;

&lt;p&gt;We enabled only Ruff’s &lt;code&gt;I&lt;/code&gt; rules (the isort group) at first. Import sorting. Nothing broader.&lt;/p&gt;

&lt;p&gt;That is not because the repo is perfect otherwise. It isn’t. There is still baseline lint debt sitting there, and we know it. But broadening the ruleset too early would have mixed three different jobs together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;introduce a new contract&lt;/li&gt;
&lt;li&gt;pay down old debt&lt;/li&gt;
&lt;li&gt;argue about the meaning of every new warning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is how hygiene work turns into a swamp.&lt;/p&gt;

&lt;p&gt;So the repo took the adult route instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;start with the rule that is almost entirely mechanical&lt;/li&gt;
&lt;li&gt;fix only what that rule surfaces&lt;/li&gt;
&lt;li&gt;make it enforced&lt;/li&gt;
&lt;li&gt;defer the noisier categories until there is a reason to take them on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sounds conservative. It is. It is also how the change actually got merged.&lt;/p&gt;

&lt;p&gt;There was one wrinkle, and it was a real one.&lt;/p&gt;

&lt;p&gt;Three hook files use an intentional &lt;code&gt;sys.path.insert(...)&lt;/code&gt;-then-import pattern. It is ugly, but it is tied to the current package layout and runtime boundary. Ruff’s import sorting model does not understand that pattern: applied blindly, it wanted to hoist the imports above the &lt;code&gt;sys.path&lt;/code&gt; mutation they depend on, which is a destructive change, not a cosmetic one.&lt;/p&gt;

&lt;p&gt;So instead of pretending the linter is always right, we added narrow &lt;code&gt;per-file-ignores&lt;/code&gt; for those three files and moved on.&lt;/p&gt;
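&lt;p&gt;The escape hatch is small and explicit. A sketch with hypothetical file names, since the real hook paths are not shown here:&lt;/p&gt;

```toml
[tool.ruff.lint.per-file-ignores]
# These hooks do sys.path.insert(...) before importing; sorting the
# imports above that line would break them at runtime.
"hooks/pre_tool.py" = ["I001"]
"hooks/post_tool.py" = ["I001"]
"hooks/session_end.py" = ["I001"]
```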

&lt;p&gt;That was the correct tradeoff.&lt;/p&gt;

&lt;p&gt;Linters are tools. They are not clergy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The most boring PR was secretly important
&lt;/h2&gt;

&lt;p&gt;The style-only formatting pass was the part that looks most cosmetic from the outside.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;ruff format scripts/ hooks/&lt;/code&gt;.&lt;br&gt;
Reformat a couple dozen files.&lt;br&gt;
Move on.&lt;/p&gt;

&lt;p&gt;But that step has two hidden traps.&lt;/p&gt;

&lt;p&gt;First, you need to prove it is actually style-only. In this repo that meant checking the non-import baseline before and after, making sure the same existing errors were still the same existing errors, and not quietly smuggling in behavior changes under the name of formatting.&lt;/p&gt;
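&lt;p&gt;One cheap way to back the “style-only” claim is to compare syntax trees before and after. A small sketch; the helper name is mine, not from the repo:&lt;/p&gt;

```python
import ast

def same_behavior_shape(before_src: str, after_src: str) -> bool:
    """True if two source texts parse to identical ASTs, i.e. the change
    touched layout only, not code structure."""
    return ast.dump(ast.parse(before_src)) == ast.dump(ast.parse(after_src))

# Pure reformatting keeps the AST identical...
assert same_behavior_shape("x=1\n", "x = 1\n")
# ...while a real change does not.
assert not same_behavior_shape("x = 1\n", "x = 2\n")
```

&lt;p&gt;Comments are invisible to the AST, so this is a floor rather than a proof, but it is cheap insurance against a logic change smuggled into a format commit.&lt;/p&gt;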

&lt;p&gt;Second, once you do a broad formatting commit, &lt;code&gt;git blame&lt;/code&gt; gets uglier unless you clean up after yourself.&lt;/p&gt;

&lt;p&gt;So the format pass was immediately followed by &lt;code&gt;.git-blame-ignore-revs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I’m mentioning that because a lot of teams skip it, and then six months later every touched line looks like it was “written” by the style commit. It is a small operational courtesy that saves a lot of irritation later.&lt;/p&gt;
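&lt;p&gt;The mechanics are two lines. A throwaway-repo demo; the SHA is a placeholder, not a real commit from this repo:&lt;/p&gt;

```shell
# Record the style commit's full SHA so git blame skips it.
demo=$(mktemp -d)
cd "$demo"
git init -q
echo "0123456789abcdef0123456789abcdef01234567" > .git-blame-ignore-revs
# Point blame at the file once per clone (GitHub's blame view reads it
# automatically when the file is committed at the repo root).
git config blame.ignoreRevsFile .git-blame-ignore-revs
git config blame.ignoreRevsFile
```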

&lt;p&gt;Formatting the repo is easy.&lt;br&gt;
Formatting the repo without damaging the usefulness of its history is slightly less easy.&lt;/p&gt;

&lt;p&gt;Still worth doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that kept this from becoming a mess
&lt;/h2&gt;

&lt;p&gt;This repo has a weird workflow on purpose: plans, reports, explicit verification, narrow whitelists, and a very annoying habit of stopping when the written plan no longer matches reality.&lt;/p&gt;

&lt;p&gt;That sounds bureaucratic until you hit the first real discrepancy.&lt;/p&gt;

&lt;p&gt;And there were several.&lt;/p&gt;

&lt;p&gt;One plan version assumed &lt;code&gt;uv.lock&lt;/code&gt; was tracked. It wasn’t.&lt;/p&gt;

&lt;p&gt;Another assumed &lt;code&gt;project.optional-dependencies&lt;/code&gt; was the right place for dev tooling. The docs said otherwise: for this setup, &lt;code&gt;dependency-groups&lt;/code&gt; was the semantically correct path.&lt;/p&gt;

&lt;p&gt;One acceptance check assumed a clean diff on files that were already dirty before the task started.&lt;/p&gt;

&lt;p&gt;Another workflow file had line-ending churn that made the raw diff look noisier than the actual content change.&lt;/p&gt;

&lt;p&gt;None of those were dramatic bugs. They were the more ordinary kind of engineering problem: stale assumptions surviving just long enough to confuse the next step.&lt;/p&gt;

&lt;p&gt;The only thing that reliably prevented those from turning into sloppy implementation was a simple rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;when the docs, the code, and the plan disagree, the plan loses&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds obvious. It still needs to be enforced.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the repo ended up
&lt;/h2&gt;

&lt;p&gt;At the end of the sequence, the workflow was doing real work.&lt;/p&gt;

&lt;p&gt;The main CI path now enforces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index freshness&lt;/li&gt;
&lt;li&gt;structural wiki lint&lt;/li&gt;
&lt;li&gt;Ruff import sorting for &lt;code&gt;scripts/&lt;/code&gt; and &lt;code&gt;hooks/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ruff formatting checks for &lt;code&gt;scripts/&lt;/code&gt; and &lt;code&gt;hooks/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Python syntax validity (an AST parse check)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not some grand platform rewrite. It is a modest hardening pass on a small codebase.&lt;/p&gt;

&lt;p&gt;But that is exactly why I like it.&lt;/p&gt;

&lt;p&gt;Too much engineering writing treats hygiene as if it only becomes interesting once there is a huge monorepo, a staff-sized platform team, or a catastrophe. Most repos do not live there. Most repos live in the much less glamorous zone where a dozen small inconsistencies slowly teach everybody that the rules are optional.&lt;/p&gt;

&lt;p&gt;This was the opposite kind of day.&lt;/p&gt;

&lt;p&gt;No reinvention. No giant “quality initiative.” No holiness around tool choice. Just six small PRs in the right order, each one making the next one easier to justify.&lt;/p&gt;

&lt;p&gt;That is usually what “maturity” looks like in a repo, if you strip away the self-importance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would do again
&lt;/h2&gt;

&lt;p&gt;If I had to repeat the same cleanup tomorrow in another Python repo, I’d keep the same order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;declare dev tooling properly&lt;/li&gt;
&lt;li&gt;enable one narrow lint rule with a high signal-to-drama ratio&lt;/li&gt;
&lt;li&gt;enforce it in CI&lt;/li&gt;
&lt;li&gt;do the style-only format pass&lt;/li&gt;
&lt;li&gt;add blame-ignore for that commit&lt;/li&gt;
&lt;li&gt;only then expand the gate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And I would still resist the urge to “just enable everything.”&lt;/p&gt;

&lt;p&gt;Not because broad linting is bad. Because sequencing matters.&lt;/p&gt;

&lt;p&gt;A repo does not become cleaner when you dump more rules into it. It becomes cleaner when the rules it has are real, narrow enough to survive contact with reality, and enforced in the place that actually decides what lands.&lt;/p&gt;

&lt;p&gt;That was the work of the last 24 hours.&lt;/p&gt;

&lt;p&gt;Not glamorous.&lt;br&gt;
But now the repo pushes back.&lt;/p&gt;

&lt;p&gt;That’s better.&lt;/p&gt;

</description>
      <category>python</category>
      <category>tooling</category>
      <category>ci</category>
      <category>opensource</category>
    </item>
    <item>
      <title>My wiki stopped being “memory” and quietly became a behavior patch for AI agents</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Wed, 15 Apr 2026 09:10:27 +0000</pubDate>
      <link>https://dev.to/ub3dqy/my-wiki-stopped-being-memory-and-quietly-became-a-behavior-patch-for-ai-agents-178h</link>
      <guid>https://dev.to/ub3dqy/my-wiki-stopped-being-memory-and-quietly-became-a-behavior-patch-for-ai-agents-178h</guid>
      <description>&lt;p&gt;I had one of those sessions a few days ago where the problem stopped being technical and got embarrassing.&lt;/p&gt;

&lt;p&gt;I was working with an AI coding agent inside a real repo. Not a toy prompt, not a benchmark, just ordinary work. At first the mistakes looked small enough to wave away: a suggestion that sounded locally reasonable, then another one that contradicted it, then a third that quietly walked us back to the first. The kind of thing that makes you stare at the screen and think, hang on, didn’t we already establish this?&lt;/p&gt;

&lt;p&gt;What made it worse was how familiar the rhythm felt. The agent wasn’t crashing. It wasn’t hallucinating a spaceship in the database schema. It was doing something more irritating: cutting corners, smoothing over uncertainty, and acting like each new complaint had appeared in a vacuum.&lt;/p&gt;

&lt;p&gt;That same day we had ingested a Reddit thread into the project wiki: &lt;em&gt;“Anthropic made Claude 67% dumber and didn’t tell anyone, a developer ran 6,852 sessions to prove it.”&lt;/em&gt; The underlying research lived in &lt;a href="https://github.com/anthropics/claude-code/issues/42796" rel="noopener noreferrer"&gt;anthropics/claude-code#42796&lt;/a&gt;, and whatever you think of the Reddit packaging, the numbers were not vague internet vibes. Someone had analyzed &lt;strong&gt;6,852 real Claude Code sessions&lt;/strong&gt; and &lt;strong&gt;17,871 thinking blocks&lt;/strong&gt; and came back with a specific pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reasoning depth down 67%&lt;/li&gt;
&lt;li&gt;file reads before edit down from 6.6 to 2 on average&lt;/li&gt;
&lt;li&gt;roughly one in three edits happening without any prior file read&lt;/li&gt;
&lt;li&gt;the word “simplest” showing up far more often in model output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That landed a little too cleanly.&lt;/p&gt;

&lt;p&gt;Because earlier that day, in one long conversation, the agent had done exactly the kind of thing those numbers predict. It first recommended using a separate terminal window. When that turned out to be awkward for me, it recommended the VS Code integrated terminal instead. Then, when the integrated terminal had a scrolling problem with a TUI, it started drifting back toward a separate terminal again, like none of the previous reasoning had happened. Same conversation, same user, same repo, and still somehow a goldfish.&lt;/p&gt;

&lt;p&gt;At one point, when I got annoyed with the quality of the advice, the agent did something even more revealing: it tried to solve the conversation by building more tooling. New directories, README files, orchestrator ideas, handoff pipeline scaffolding. As if “your reasoning got sloppy” naturally leads to “let me generate more infrastructure.”&lt;/p&gt;

&lt;p&gt;That was the moment the Reddit thread stopped feeling theoretical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The useful part of the bad session
&lt;/h2&gt;

&lt;p&gt;Here is what it actually gave me that I did not have an hour earlier.&lt;/p&gt;

&lt;p&gt;The thread mattered for two reasons.&lt;/p&gt;

&lt;p&gt;First, it gave me language for what I was looking at. Instead of the usual mushy feeling that the model was “off today,” there was a concrete behavioral shape: shallower reasoning, less reading before editing, more shortcut-taking. That does not magically explain every bad turn, but it does turn a vibe into something you can reason about.&lt;/p&gt;

&lt;p&gt;Second, it pushed the conversation away from “why is the model being like this?” and toward a much more practical question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What can I change locally so that the agent stops behaving this way in &lt;em&gt;my&lt;/em&gt; project?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That ended up being the right question.&lt;/p&gt;

&lt;p&gt;Because the fix wasn’t a new product, and it wasn’t a better prompt pasted into a chat box, and it definitely wasn’t waiting for Anthropic to say they had sorted things out. The fix was smaller and much duller than that. Which is probably why it worked.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;The repo already had a markdown wiki next to the codebase.&lt;/p&gt;

&lt;p&gt;Originally it was just that: a knowledge base. Notes, sources, concept pages, issue summaries, project-specific instructions, little slices of operational memory that are easy to forget between sessions. Plain files in git. Nothing clever.&lt;/p&gt;

&lt;p&gt;But this repo also had a project-level &lt;code&gt;CLAUDE.md&lt;/code&gt;, which gets injected into the agent’s context whenever work starts in the project. So instead of treating the wiki as a passive archive, we used it as a place to put behavior-shaping rules that would actually reappear in future sessions.&lt;/p&gt;

&lt;p&gt;We added rules like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; Research the codebase before editing. Never change code you haven't read.
&lt;span class="p"&gt;-&lt;/span&gt; Verify work actually works before claiming done.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whatever turns out to be true about the upstream claims, those rules are obviously worth having anyway.&lt;/p&gt;

&lt;p&gt;And we added a couple of persistent memory notes tied to concrete failure cases from the bad session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;don’t flip-flop on workflow advice inside one thread&lt;/li&gt;
&lt;li&gt;don’t respond to user frustration about reasoning quality by reflexively building new tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was it.&lt;/p&gt;

&lt;p&gt;No vector database. No new service. No orchestration layer. No fancy “agent memory platform.” Just a few markdown files in the place the agent actually reads before it starts moving.&lt;/p&gt;

&lt;p&gt;Which sounds almost insultingly small. But that’s the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing I had been misunderstanding
&lt;/h2&gt;

&lt;p&gt;I used to think of a project wiki mostly as memory in the obvious sense: a way to remember what happened, what was decided, what broke last month, what weird edge case we already paid for once and don’t want to rediscover at 2 a.m.&lt;/p&gt;

&lt;p&gt;That is still true. But it is not the whole story.&lt;/p&gt;

&lt;p&gt;For an agent, a wiki that gets re-injected into context is not just memory. It is part of the agent’s operating environment.&lt;/p&gt;

&lt;p&gt;That matters more than it sounds.&lt;/p&gt;

&lt;p&gt;An LLM does not “learn” from a bad day the way a human does. It doesn’t go for a walk, think things over, and come back morally improved. If you want a future instance of the agent to behave differently, the only reliable mechanism you control is what gets placed in its context window when the next similar moment happens.&lt;/p&gt;

&lt;p&gt;That means a sentence in a markdown file can do something surprisingly concrete. It can raise the floor.&lt;/p&gt;

&lt;p&gt;The Reddit thread is basically the negative version of this same idea. If the default instruction layer gets weakened, even a little, behavior changes downstream in measurable ways. Less reading. More shortcuts. More “simplest.” More pretending local plausibility is enough. So the opposite is also true: if you add back explicit local instructions that are actually relevant to your project, behavior improves in the dimensions those instructions govern.&lt;/p&gt;

&lt;p&gt;Not forever. Not perfectly. But enough to matter.&lt;/p&gt;

&lt;p&gt;That was the real shift for me.&lt;/p&gt;

&lt;p&gt;The wiki stopped being a storage layer and became a behavioral patch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why plain markdown turned out to be enough
&lt;/h2&gt;

&lt;p&gt;This part is easy to overcomplicate.&lt;/p&gt;

&lt;p&gt;There are many situations where you really do need a heavier retrieval system. Large corpora, fuzzy search across lots of semi-structured material, semantic lookup over things that were never designed to link to each other cleanly. Fine. Use the bigger machinery when the problem calls for it.&lt;/p&gt;

&lt;p&gt;But a lot of agent work is not blocked on better retrieval. It is blocked on better &lt;em&gt;discipline&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The agent doesn’t need a 40-millisecond vector search to discover the idea “read the file before you edit it.” It needs that rule to be present, visible, and hard to miss at the moment it starts acting clever.&lt;/p&gt;

&lt;p&gt;Plain markdown is very good at that.&lt;/p&gt;

&lt;p&gt;It is editable, reviewable, diffable, easy to keep near the code, and easy to inject back into the session. It also ages well. You can mark something stale. You can supersede it. You can point from one failure note to the later fix. You can tell the difference between historical context and current guidance if you are disciplined about how you structure the files.&lt;/p&gt;

&lt;p&gt;That last part matters. A wiki can absolutely become a swamp if you just keep shoveling text into it and never think about freshness. But that is a problem of curation, not a failure of the basic approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one claim I’m careful with
&lt;/h2&gt;

&lt;p&gt;The thread also included a much juicier claim: that leaked Claude Code source appeared to route Anthropic employees through a different instruction set, including a stronger “verify work actually works before claiming done” style directive.&lt;/p&gt;

&lt;p&gt;If true, that would explain a lot. It would also be a much bigger story than “my local workflow got weird this week.”&lt;/p&gt;

&lt;p&gt;But I don’t think it should carry the article. That part still needs independent confirmation.&lt;/p&gt;

&lt;p&gt;The measured regression does not.&lt;/p&gt;

&lt;p&gt;And honestly, the practical lesson does not depend on it. Even if that claim turns out to be wrong, the local fix still stands: project-level instructions plus durable memory can compensate for at least some classes of agent drift.&lt;/p&gt;

&lt;p&gt;That is enough to be useful on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I’d actually recommend
&lt;/h2&gt;

&lt;p&gt;If any of this sounds familiar, I’d do three things before buying anything, rewriting your stack, or filing angry bug reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Put operating rules in the repo
&lt;/h3&gt;

&lt;p&gt;Not a giant manifesto. Two or three lines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read before edit&lt;/li&gt;
&lt;li&gt;verify before claiming done&lt;/li&gt;
&lt;li&gt;don’t reverse yourself mid-thread without explicitly acknowledging it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep them short enough that they feel like rules, not prose.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Record specific failure patterns, not abstract complaints
&lt;/h3&gt;

&lt;p&gt;“Don’t be lazy” is useless.&lt;/p&gt;

&lt;p&gt;“Yesterday you recommended terminal A, then terminal B, then terminal A again in the same conversation” is useful.&lt;/p&gt;

&lt;p&gt;Agents pattern-match better against concrete reproductions than against moral advice. Frankly, so do humans.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Treat the wiki as part of runtime behavior, not just documentation
&lt;/h3&gt;

&lt;p&gt;If your memory system is actually read during future sessions, then it is not only an archive. It is one of the controls that shapes what the agent does next. Design it that way.&lt;/p&gt;

&lt;p&gt;That means caring about status, freshness, and whether a note is historical context or current guidance. It also means accepting that some of your best fixes may look embarrassingly low-tech.&lt;/p&gt;

&lt;p&gt;And if your stack does not have a &lt;code&gt;CLAUDE.md&lt;/code&gt; equivalent, the idea still transfers. Any text that gets re-injected into context at session start is a lever. The filename is incidental.&lt;/p&gt;
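&lt;p&gt;Concretely, a note can carry its own freshness signals so future sessions can tell current guidance from history. A hypothetical shape; the fields and filenames are mine:&lt;/p&gt;

```markdown
---
status: current
supersedes: terminal-advice-v1.md
---
# Terminal workflow decision
- Use the VS Code integrated terminal for agent sessions.
- A separate terminal window was tried and rejected; do not re-suggest it
  without naming what changed.
```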

&lt;h2&gt;
  
  
  The part I keep coming back to
&lt;/h2&gt;

&lt;p&gt;What changed my mind wasn’t that the agent had a bad session. Bad sessions happen.&lt;/p&gt;

&lt;p&gt;It was that a markdown wiki, sitting quietly next to the code, ended up being the most practical lever for changing the agent’s behavior the next day.&lt;/p&gt;

&lt;p&gt;Not memory as autobiography.&lt;/p&gt;

&lt;p&gt;Memory as control surface.&lt;/p&gt;

&lt;p&gt;And once you see that, a lot of “AI tooling” starts looking strangely overengineered for the problems people actually have. Sometimes the useful move is not another layer of automation. Sometimes it is one sentence in the right file, where the model is forced to read it before it starts improvising.&lt;/p&gt;

&lt;p&gt;That’s a very old kind of software.&lt;/p&gt;

&lt;p&gt;It still works.&lt;/p&gt;




&lt;p&gt;Reference: the Reddit thread &lt;em&gt;“Anthropic made Claude 67% dumber and didn't tell anyone, a developer ran 6,852 sessions to prove it”&lt;/em&gt; (&lt;a href="https://www.reddit.com/r/ClaudeCode/comments/1shaxkt/" rel="noopener noreferrer"&gt;r/ClaudeCode&lt;/a&gt;, 2026-04-10), and the underlying issue at &lt;a href="https://github.com/anthropics/claude-code/issues/42796" rel="noopener noreferrer"&gt;anthropics/claude-code#42796&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>tooling</category>
      <category>claude</category>
    </item>
    <item>
      <title>Remote-WSL broke my AI agent hooks with one malformed cwd</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:35:26 +0000</pubDate>
      <link>https://dev.to/ub3dqy/remote-wsl-broke-my-ai-agent-hooks-with-one-malformed-cwd-3okj</link>
      <guid>https://dev.to/ub3dqy/remote-wsl-broke-my-ai-agent-hooks-with-one-malformed-cwd-3okj</guid>
      <description>&lt;p&gt;I spent most of this week debugging what looked like a flaky hook pipeline.&lt;/p&gt;

&lt;p&gt;The project itself is simple on purpose: a local, markdown-first knowledge base that I use with coding agents. Hooks capture session output, a small filter decides what is worth keeping, and everything stays in git. No separate backend, no database to babysit, no infrastructure just because a text-heavy workflow &lt;em&gt;could&lt;/em&gt; be turned into infrastructure.&lt;/p&gt;

&lt;p&gt;That simplicity is exactly why the bug annoyed me so much.&lt;/p&gt;

&lt;p&gt;The failure looked like an application problem. It smelled like an application problem. It even produced the kind of vague symptoms that make you think, “great, another intermittent pipeline bug.”&lt;/p&gt;

&lt;p&gt;It was not an application problem.&lt;/p&gt;

&lt;p&gt;It was one malformed working directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I thought I was debugging
&lt;/h2&gt;

&lt;p&gt;At first, I thought I was dealing with one recurring failure in my capture pipeline.&lt;/p&gt;

&lt;p&gt;Sometimes a hook would appear to fail. Sometimes downstream logging would be empty. Sometimes a task would complete, but the surrounding automation would look half-dead. It was inconsistent enough to feel flaky and consistent enough to feel real.&lt;/p&gt;

&lt;p&gt;That is an especially bad combination.&lt;/p&gt;

&lt;p&gt;I started by doing the usual careful thing: grouping similar failures together and checking how often they happened.&lt;/p&gt;

&lt;p&gt;That worked for about ten minutes.&lt;/p&gt;

&lt;p&gt;Once I looked at timing, the nice tidy picture fell apart. Some failures happened almost immediately. Others took much longer. Same surface symptom, completely different execution pattern.&lt;/p&gt;

&lt;p&gt;That was the first sign that I might be flattening two different problems into one label.&lt;/p&gt;

&lt;p&gt;Then I went digging through the SDK layer and ran into a second problem: the tooling was not especially generous with useful error details. In a couple of places I could confirm that something had failed, but not why. The exact traceback or stderr I wanted either was not there or was being abstracted away into something much less helpful.&lt;/p&gt;

&lt;p&gt;At that point I stopped treating this as a Python problem and started treating it as an environment problem.&lt;/p&gt;

&lt;p&gt;That turned out to be the right move.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup that made it visible
&lt;/h2&gt;

&lt;p&gt;I use two agents against the same repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code inside VS Code&lt;/li&gt;
&lt;li&gt;Codex in parallel for implementation and verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That setup is productive, but it puts real pressure on local tooling. You are suddenly relying on editors, terminals, shells, path translation, process spawning, and hook execution to behave consistently across a stack that spans Windows and Linux.&lt;/p&gt;

&lt;p&gt;Naturally, I decided to make it &lt;em&gt;more&lt;/em&gt; convenient.&lt;/p&gt;

&lt;p&gt;I wanted a cleaner dual-window workflow inside VS Code instead of bouncing between an editor and a separate terminal. That pushed me deeper into VS Code Remote-WSL and duplicated workspace setups.&lt;/p&gt;

&lt;p&gt;That is where the really confusing symptom showed up.&lt;/p&gt;

&lt;p&gt;Codex could successfully answer a simple prompt, but the UI would still show hooks as &lt;code&gt;failed&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That combination should immediately make you suspicious.&lt;/p&gt;

&lt;p&gt;If the task result exists, but the hooks around it are marked failed and your own logging pipeline shows no fresh entries, then one of two things is true:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the hook command really is starting and dying very early, or&lt;/li&gt;
&lt;li&gt;the command is never starting in a sane execution context to begin with.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The second explanation was uglier, but it fit the evidence better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manual tests made things weirder, not clearer
&lt;/h2&gt;

&lt;p&gt;I checked the obvious suspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hook config&lt;/li&gt;
&lt;li&gt;executable paths&lt;/li&gt;
&lt;li&gt;shell path&lt;/li&gt;
&lt;li&gt;Python runner&lt;/li&gt;
&lt;li&gt;manual execution of the same command&lt;/li&gt;
&lt;li&gt;exit code&lt;/li&gt;
&lt;li&gt;runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything looked healthy.&lt;/p&gt;

&lt;p&gt;The hook command worked perfectly when I ran it manually.&lt;/p&gt;

&lt;p&gt;That made the problem harder, not easier. A command that fails only when launched through an editor integration is usually telling you that the command itself is fine and its environment is not.&lt;/p&gt;
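&lt;p&gt;This is exactly the gap a tiny self-reporting preamble closes. A sketch I would now put at the top of any hook entry point; the helper name and log path are mine:&lt;/p&gt;

```python
import json
import os
import sys
import time

def log_hook_context(log_path: str) -> dict:
    """Record the environment a hook actually receives, before any real
    work, so 'works manually, fails under the editor' stops being a mystery."""
    cwd = os.getcwd()
    ctx = {
        "ts": time.time(),
        "cwd": cwd,
        "cwd_exists": os.path.isdir(cwd),
        "argv": sys.argv,
        # First few PATH entries are usually enough to spot a wrong shell.
        "path_head": os.environ.get("PATH", "").split(os.pathsep)[:3],
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(ctx) + "\n")
    return ctx
```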

&lt;p&gt;So I stopped staring at the hook script and went looking for process-level clues on the Codex side.&lt;/p&gt;

&lt;p&gt;That was when I realized I had been looking in the wrong place for logs. I expected plain text logs. The active event data I needed was actually in a SQLite log store.&lt;/p&gt;

&lt;p&gt;Once I queried that, the whole thing cracked open.&lt;/p&gt;
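&lt;p&gt;Pulling context out of a SQLite store is a one-liner once you know it is there. A sketch with a hypothetical schema; the real table and column names will differ:&lt;/p&gt;

```python
import sqlite3

def last_turn_cwd(db_path: str) -> str:
    """Fetch the working directory recorded for the most recent turn.
    The table and column names here are assumptions, not the real schema."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT cwd FROM turns ORDER BY ts DESC LIMIT 1"
        ).fetchone()
        return row[0] if row else ""
    finally:
        con.close()
```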

&lt;h2&gt;
  
  
  The line that explained the whole day
&lt;/h2&gt;

&lt;p&gt;Inside the recorded turn data, the working directory looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/mnt/c/.../Microsoft VS Code/e:\work\my-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That path is nonsense.&lt;/p&gt;

&lt;p&gt;It is a broken hybrid of two fragments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the VS Code install directory on the Windows side&lt;/li&gt;
&lt;li&gt;with a raw Windows-style workspace path glued onto the end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is neither a valid WSL workspace path nor a valid normal working directory for a Linux-side process.&lt;/p&gt;

&lt;p&gt;And once I saw it, the rest of the symptoms stopped being mysterious.&lt;/p&gt;

&lt;p&gt;The issue was not that my hooks were unreliable.&lt;/p&gt;

&lt;p&gt;The issue was that, in this Remote-WSL setup, the VS Code extension was handing Codex a malformed &lt;code&gt;cwd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead of turning something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;E:\work\my-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;into:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/mnt/e/work/my-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;something in the chain appeared to be prepending the wrong base directory to the raw Windows path instead of translating it.&lt;/p&gt;
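
&lt;p&gt;The translation itself is exactly what WSL's own &lt;code&gt;wslpath&lt;/code&gt; utility does. As a rough pure-Python sketch of the mapping the integration should have applied (assuming the default &lt;code&gt;/mnt&lt;/code&gt; drive mounts):&lt;br&gt;
&lt;/p&gt;

```python
import re

def win_to_wsl(path):
    # Map a Windows drive-letter path onto the default WSL /mnt layout,
    # the same translation `wslpath` performs on a stock WSL install.
    m = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if not m:
        raise ValueError(f"not a drive-letter path: {path!r}")
    drive = m.group(1).lower()
    rest = m.group(2).replace("\\", "/")
    return f"/mnt/{drive}/{rest}"

print(win_to_wsl(r"E:\work\my-project"))  # /mnt/e/work/my-project
```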

&lt;p&gt;One bad &lt;code&gt;cwd&lt;/code&gt; is enough to poison an entire process tree.&lt;/p&gt;

&lt;p&gt;Once child processes inherit it, you start getting misleading secondary failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hooks reported as failed even though the commands are valid&lt;/li&gt;
&lt;li&gt;subprocess behavior that differs from manual shell execution&lt;/li&gt;
&lt;li&gt;empty downstream logs because the real work never starts in a usable context&lt;/li&gt;
&lt;/ul&gt;
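
&lt;p&gt;You can watch the poisoning happen with nothing but the standard library: hand a child process a malformed &lt;code&gt;cwd&lt;/code&gt; and it dies before its first line ever executes, which is exactly why the downstream logs stayed empty:&lt;br&gt;
&lt;/p&gt;

```python
import subprocess
import sys

# A child launched with a malformed cwd dies before its first
# instruction runs, so any logging inside the child never happens.
try:
    subprocess.run(
        [sys.executable, "-c", "print('hook body ran')"],
        cwd=r"/mnt/c/does-not-exist/e:\work\my-project",
        check=True,
    )
except (FileNotFoundError, NotADirectoryError) as exc:
    print(f"spawn failed before the hook started: {exc}")
```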

&lt;p&gt;That was exactly what I was seeing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this bug was so misleading
&lt;/h2&gt;

&lt;p&gt;This is my least favorite kind of tooling bug: the kind that breaks at the seams.&lt;/p&gt;

&lt;p&gt;Nothing explodes cleanly.&lt;/p&gt;

&lt;p&gt;The agent still answers.&lt;/p&gt;

&lt;p&gt;The UI still looks alive.&lt;/p&gt;

&lt;p&gt;The hook command still works in isolation.&lt;/p&gt;

&lt;p&gt;The repository still exists where you expect it to exist.&lt;/p&gt;

&lt;p&gt;Only one inherited bit of process state is wrong, and that is enough to make the system feel haunted.&lt;/p&gt;

&lt;p&gt;From the outside, it looks like a flaky automation problem.&lt;br&gt;
From the inside, it is just a bad path string.&lt;/p&gt;

&lt;p&gt;That difference matters, because it changes what you should inspect first.&lt;/p&gt;

&lt;p&gt;If a command works manually but fails only through an editor or agent integration, do not immediately assume the logic is wrong. Compare launch context first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;cwd&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;shell&lt;/li&gt;
&lt;li&gt;environment variables&lt;/li&gt;
&lt;li&gt;path translation&lt;/li&gt;
&lt;li&gt;parent process&lt;/li&gt;
&lt;/ul&gt;
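
&lt;p&gt;If I had to distill that into one habit, it would be: make every hook record its own launch context before doing any real work. A minimal sketch — the field names here are just my own, not any tool's contract:&lt;br&gt;
&lt;/p&gt;

```python
import json
import os
import sys
import time

def launch_context():
    # Snapshot how this process was started, so a bad cwd or wrong
    # shell shows up in one glance instead of a day of guessing.
    return {
        "ts": time.time(),
        "cwd": os.getcwd(),
        "executable": sys.executable,
        "ppid": os.getppid(),
        "shell": os.environ.get("SHELL"),
        "path_head": (os.environ.get("PATH") or "").split(os.pathsep)[:3],
    }

print(json.dumps(launch_context(), indent=2))
```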

&lt;p&gt;I should have done that earlier.&lt;/p&gt;
&lt;h2&gt;
  
  
  The workaround was almost embarrassingly simple
&lt;/h2&gt;

&lt;p&gt;Once I knew what was broken, the local workaround was not clever at all:&lt;/p&gt;

&lt;p&gt;do not run Codex through the VS Code extension in that setup.&lt;/p&gt;

&lt;p&gt;Run it directly from a normal WSL shell in the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /mnt/e/work/my-project
codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same machine. Same repository. Same hooks. Same scripts.&lt;/p&gt;

&lt;p&gt;Different launch path.&lt;/p&gt;

&lt;p&gt;And in that mode, everything immediately got boring again, which is exactly what you want from tooling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clean &lt;code&gt;cwd&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;hooks marked &lt;code&gt;completed&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;logging pipeline resumes&lt;/li&gt;
&lt;li&gt;downstream capture behaves normally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was the moment I realized I had nearly talked myself into a much bigger solution for a much smaller problem.&lt;/p&gt;

&lt;p&gt;At one point I was already mentally drifting toward “maybe I should move more of this workflow to a server” or “maybe the local-first design is too fragile.”&lt;/p&gt;

&lt;p&gt;Nope.&lt;/p&gt;

&lt;p&gt;The local-first design was fine.&lt;br&gt;
The markdown-first architecture was fine.&lt;br&gt;
The scripts were fine.&lt;br&gt;
The hooks were fine.&lt;/p&gt;

&lt;p&gt;The working directory was wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed after that
&lt;/h2&gt;

&lt;p&gt;I stopped trying to force the elegant version of the workflow.&lt;/p&gt;

&lt;p&gt;My stable setup now is much more boring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code inside VS Code&lt;/li&gt;
&lt;li&gt;Codex in a separate WSL terminal&lt;/li&gt;
&lt;li&gt;both pointing at the same repository&lt;/li&gt;
&lt;li&gt;shared append-only logs&lt;/li&gt;
&lt;li&gt;no editor-managed &lt;code&gt;cwd&lt;/code&gt; surprises&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is less stylish than the version I was trying to build.&lt;br&gt;
It is also dramatically more reliable.&lt;/p&gt;

&lt;p&gt;And honestly, that feels like the right ending for this story.&lt;/p&gt;

&lt;p&gt;I spent a day chasing what looked like a deep pipeline bug.&lt;br&gt;
It turned out that the best fix was to put the CLI tool back in a terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I took away from it
&lt;/h2&gt;

&lt;p&gt;Three things.&lt;/p&gt;

&lt;p&gt;First: if multiple failures share a label, that does not mean they share a cause.&lt;/p&gt;

&lt;p&gt;Second: editor integrations often fail in ways that look like application bugs when they are really process-launch bugs.&lt;/p&gt;

&lt;p&gt;Third: if you are debugging anything that touches Windows, WSL, and an editor extension at the same time, inspect &lt;code&gt;cwd&lt;/code&gt; much earlier than feels necessary.&lt;/p&gt;

&lt;p&gt;One malformed working directory was enough to waste an entire day.&lt;/p&gt;

&lt;p&gt;I would rather somebody else not lose the same one.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vscode</category>
      <category>wsl</category>
      <category>debugging</category>
    </item>
    <item>
      <title>I thought my AI memory hook was broken. It turned out to be Windows, WSL, uv, and one missing login</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Mon, 13 Apr 2026 23:29:37 +0000</pubDate>
      <link>https://dev.to/ub3dqy/i-thought-my-ai-memory-hook-was-broken-it-turned-out-to-be-windows-wsl-uv-and-one-missing-login-a6</link>
      <guid>https://dev.to/ub3dqy/i-thought-my-ai-memory-hook-was-broken-it-turned-out-to-be-windows-wsl-uv-and-one-missing-login-a6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Part of the series &lt;strong&gt;Debugging Claude Agent SDK pipelines&lt;/strong&gt;. One of the layers I'll mention near the end — hidden account-level Gmail / Calendar MCP integrations blocking my subprocesses — deserved its own write-up: &lt;a href="https://dev.to/ub3dqy/hidden-gmail-and-calendar-integrations-quietly-broke-my-claude-sdk-pipeline-18ei"&gt;Hidden Gmail and Calendar integrations quietly broke my Claude SDK pipeline&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I noticed something weird in Codex.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UserPromptSubmit&lt;/code&gt; kept saying &lt;code&gt;completed&lt;/code&gt;, but &lt;code&gt;Stop&lt;/code&gt; kept saying &lt;code&gt;failed&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're building a memory tool, that's a bad combination. It means the assistant can still &lt;strong&gt;read&lt;/strong&gt; old context, but it may be failing to &lt;strong&gt;write&lt;/strong&gt; new context back into long-term memory. In other words: it looks smart in the moment, but its memory may be quietly falling apart behind the scenes.&lt;/p&gt;

&lt;p&gt;I assumed this would be a small hook bug.&lt;/p&gt;

&lt;p&gt;It wasn't.&lt;/p&gt;

&lt;p&gt;It turned into one of those debugging sessions where every layer was technically "working" and the system as a whole still wasn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was building
&lt;/h2&gt;

&lt;p&gt;I'm working on a markdown-first memory system for Claude Code and Codex. The shape is simple enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;when a session ends, a hook grabs the transcript&lt;/li&gt;
&lt;li&gt;a background script decides whether the conversation is worth saving&lt;/li&gt;
&lt;li&gt;if it is, it writes a distilled note into a daily log&lt;/li&gt;
&lt;li&gt;later, that gets compiled into wiki pages and injected back into future sessions&lt;/li&gt;
&lt;/ul&gt;
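
&lt;p&gt;As a rough sketch of that capture path — the &lt;code&gt;transcript_path&lt;/code&gt; field matches what I describe below, but everything else here is my own simplification, not the exact hook contract:&lt;br&gt;
&lt;/p&gt;

```python
import json
import subprocess
import sys

def on_stop(payload):
    # Hand the transcript off to a detached background capture process.
    transcript = payload.get("transcript_path")
    if not transcript:
        return  # the missing-transcript fallback lives elsewhere
    # Detach so a slow capture never trips the editor's hook timeout.
    subprocess.Popen(
        [sys.executable, "flush.py", transcript],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )

on_stop(json.loads('{"transcript_path": ""}'))  # empty path: skipped safely
```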

&lt;p&gt;The part that mattered here was the Codex &lt;code&gt;Stop&lt;/code&gt; hook. That's the capture path. If that hook fails, new memory may never make it into the wiki.&lt;/p&gt;

&lt;p&gt;So when I saw &lt;code&gt;Stop failed&lt;/code&gt; in the UI over and over again, I treated it as a real product problem, not a cosmetic one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first bug was real
&lt;/h2&gt;

&lt;p&gt;The first issue was exactly where I expected it to be: in the hook.&lt;/p&gt;

&lt;p&gt;The parser only understood one transcript shape. Codex was emitting another one. The hook would fire, look at the transcript, fail to extract meaningful context, and then skip capture.&lt;/p&gt;

&lt;p&gt;That part was straightforward to fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;teach the parser the real Codex transcript shape&lt;/li&gt;
&lt;li&gt;add a fallback for when &lt;code&gt;transcript_path&lt;/code&gt; is missing&lt;/li&gt;
&lt;li&gt;stop using the old turn-count gate and switch to a content-based threshold&lt;/li&gt;
&lt;li&gt;raise the timeout so the hook had room to finish&lt;/li&gt;
&lt;/ul&gt;
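
&lt;p&gt;The content-based gate is the least obvious item on that list, so here is roughly what I mean by it — a hypothetical threshold on how much extracted text survived, rather than on how many turns the transcript happened to contain:&lt;br&gt;
&lt;/p&gt;

```python
def worth_saving(context_text, min_chars=200):
    # Gate on the amount of substantive text extracted from the
    # transcript, not on a raw turn count. The 200-character floor
    # is an illustrative number, not the real tuned value.
    stripped = " ".join(context_text.split())
    return len(stripped) >= min_chars

print(worth_saving("hi"))  # False
```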

&lt;p&gt;After that, things got better. The logs stopped saying &lt;code&gt;SKIP: empty context&lt;/code&gt;, and I started seeing the line I wanted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Spawned flush.py for session ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At that point I thought I was done.&lt;/p&gt;

&lt;p&gt;I was not done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second bug was weirder
&lt;/h2&gt;

&lt;p&gt;Now the hook was successfully spawning the downstream capture process, which should have been a win.&lt;/p&gt;

&lt;p&gt;And then it still ended with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BrokenPipeError: [Errno 32] Broken pipe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was one of those bugs that is annoying precisely because it happens &lt;strong&gt;after&lt;/strong&gt; the important part.&lt;/p&gt;

&lt;p&gt;The capture process had already started. Memory might already be on its way to being saved. But the hook still looked failed in the Codex UI, which meant I couldn't trust the system yet.&lt;/p&gt;

&lt;p&gt;The cause turned out to be simple in hindsight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the hook took longer than the local timeout&lt;/li&gt;
&lt;li&gt;Codex closed stdout&lt;/li&gt;
&lt;li&gt;the hook tried to print its final success JSON into a pipe that no longer existed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So yes, the system was partly working. It just wasn't finishing cleanly.&lt;/p&gt;

&lt;p&gt;That fix was also small: protect the final stdout write against a closed pipe.&lt;/p&gt;
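
&lt;p&gt;The guard is only a few lines. Sketched with my own names: treat a broken pipe on the final status write as a non-event, because the real work has already been handed off by then:&lt;br&gt;
&lt;/p&gt;

```python
import sys

def emit_final_status(payload_json, stream=None):
    # Write the hook's closing JSON, tolerating a parent that
    # timed out and closed our stdout before we finished.
    stream = stream or sys.stdout
    try:
        stream.write(payload_json + "\n")
        stream.flush()
    except BrokenPipeError:
        # The capture process was already spawned; losing the
        # final status line is cosmetic, not a real failure.
        pass

emit_final_status('{"status": "ok"}')
```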

&lt;p&gt;At this point I had fixed the parser bug and the broken pipe. Surely &lt;em&gt;now&lt;/em&gt; the memory pipeline would work.&lt;/p&gt;

&lt;p&gt;Still no.&lt;/p&gt;

&lt;h2&gt;
  
  
  The third bug wasn't in the hook at all
&lt;/h2&gt;

&lt;p&gt;Once I got past the broken pipe, the downstream process itself started failing.&lt;/p&gt;

&lt;p&gt;The hook would spawn &lt;code&gt;flush.py&lt;/code&gt;, and then &lt;code&gt;flush.py&lt;/code&gt; would die with the deeply unhelpful classic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Command failed with exit code 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No useful stderr. No obvious explanation. Just enough information to waste an afternoon.&lt;/p&gt;

&lt;p&gt;This is the moment where I finally stopped assuming I was debugging "the hook" and started treating the whole thing like what it really was: a chain of separate runtimes.&lt;/p&gt;

&lt;p&gt;Because that's what it was.&lt;/p&gt;

&lt;p&gt;Not one program. A chain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Codex UI&lt;/li&gt;
&lt;li&gt;hook runner&lt;/li&gt;
&lt;li&gt;Python process&lt;/li&gt;
&lt;li&gt;subprocess launcher&lt;/li&gt;
&lt;li&gt;WSL boundary&lt;/li&gt;
&lt;li&gt;bundled Claude CLI&lt;/li&gt;
&lt;li&gt;local authentication state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each link could fail for a different reason.&lt;/p&gt;

&lt;p&gt;And that is exactly what had happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  One of the real bugs was hiding in the boundary
&lt;/h2&gt;

&lt;p&gt;At that stage, the most immediate root cause of the &lt;code&gt;exit code 1&lt;/code&gt; failure wasn't inside my Python code at all.&lt;/p&gt;

&lt;p&gt;The Claude CLI inside the WSL runtime wasn't authenticated.&lt;/p&gt;

&lt;p&gt;That was the immediate bug in this part of the investigation. There was still another layer around hidden account-level MCP integrations, but that turned into its own separate story. I wrote that one up separately here: &lt;a href="https://dev.to/ub3dqy/hidden-gmail-and-calendar-integrations-quietly-broke-my-claude-sdk-pipeline-18ei"&gt;Hidden Gmail and Calendar integrations quietly broke my Claude SDK pipeline&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I had already authenticated Claude on the Windows side. Claude Code on Windows was fine. But Codex hooks were spawning work inside WSL, and that runtime had its own separate &lt;code&gt;~/.claude&lt;/code&gt; state.&lt;/p&gt;

&lt;p&gt;So from one side of the system, Claude was logged in.&lt;/p&gt;

&lt;p&gt;From the other side, it wasn't.&lt;/p&gt;

&lt;p&gt;And because the failure was happening in a subprocess several layers down, what bubbled back up was just a generic process failure.&lt;/p&gt;

&lt;p&gt;That was the moment the whole debugging session clicked for me:&lt;/p&gt;

&lt;p&gt;I wasn't dealing with a broken feature. I was dealing with a system that crossed &lt;strong&gt;OS boundaries&lt;/strong&gt;, &lt;strong&gt;process boundaries&lt;/strong&gt;, and &lt;strong&gt;auth boundaries&lt;/strong&gt;, and I was still mentally treating it like one runtime.&lt;/p&gt;

&lt;p&gt;It wasn't one runtime.&lt;/p&gt;

&lt;p&gt;It was several. They just happened to be glued together tightly enough to look like one.&lt;/p&gt;

&lt;h2&gt;
  
  
  And then Windows joined in
&lt;/h2&gt;

&lt;p&gt;While I was cleaning that up, Claude Code on Windows started throwing a completely different kind of error on every hook run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error: failed to remove file `.venv\\lib64`: Access is denied. (os error 5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That turned out to be another boundary problem.&lt;/p&gt;

&lt;p&gt;I had Windows-side and WSL-side tooling both touching the same project environment. &lt;code&gt;uv&lt;/code&gt; was trying to be helpful. Windows was trying to be Windows. A POSIX-style &lt;code&gt;lib64&lt;/code&gt; symlink was involved. None of this was improving my mood.&lt;/p&gt;

&lt;p&gt;So now I had two parallel truths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Codex capture path was broken because WSL auth was missing&lt;/li&gt;
&lt;li&gt;the Claude-side hook launcher was unstable because the shared &lt;code&gt;.venv&lt;/code&gt; state was getting churned across environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both bugs were real.&lt;br&gt;
Neither bug lived in the same place.&lt;br&gt;
Both looked, from the outside, like "the AI tool is flaky again."&lt;/p&gt;
&lt;h2&gt;
  
  
  The part that actually mattered
&lt;/h2&gt;

&lt;p&gt;Here's the practical lesson I walked away with:&lt;/p&gt;

&lt;p&gt;When a system crosses runtime boundaries, the bug is often not in the place where the symptom shows up.&lt;/p&gt;

&lt;p&gt;The symptom showed up as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stop failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual causes were spread across multiple layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcript shape mismatch&lt;/li&gt;
&lt;li&gt;timeout mismatch&lt;/li&gt;
&lt;li&gt;unprotected stdout write&lt;/li&gt;
&lt;li&gt;missing WSL-side Claude auth&lt;/li&gt;
&lt;li&gt;shared Windows/WSL environment churn&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I had kept looking only at the final error message, I would have kept "fixing" the wrong layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed
&lt;/h2&gt;

&lt;p&gt;I didn't rewrite the system. I just stopped letting it be vague.&lt;/p&gt;

&lt;p&gt;I added enough visibility so each boundary could tell me when it was the one failing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcript parsing now understands real Codex output&lt;/li&gt;
&lt;li&gt;the stop hook has a fallback when transcript data is missing&lt;/li&gt;
&lt;li&gt;success output no longer crashes on a closed pipe&lt;/li&gt;
&lt;li&gt;the flush path logs more useful process diagnostics&lt;/li&gt;
&lt;li&gt;I verified the actual runtime where the subprocess was running, instead of assuming it matched the one I was sitting in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And maybe the most boring but important fix of all:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I authenticated Claude in the runtime that was actually doing the work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not "somewhere on the machine."&lt;br&gt;
Not "the CLI works for me."&lt;br&gt;
The exact runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lesson I'll keep
&lt;/h2&gt;

&lt;p&gt;The real lesson wasn't "add more logging."&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;if your tool crosses process boundaries, OS boundaries, and auth boundaries, you do not have one runtime anymore.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have a chain of semi-independent runtimes, and each one can fail in its own extremely specific way.&lt;/p&gt;

&lt;p&gt;That sounds obvious when written down. It was much less obvious when I was staring at one repeated &lt;code&gt;Stop failed&lt;/code&gt; message and assuming there had to be one neat root cause behind it.&lt;/p&gt;

&lt;p&gt;There wasn't.&lt;/p&gt;

&lt;p&gt;There were several small, ordinary failures, all stacked on top of each other. That's what made the bug feel slippery.&lt;/p&gt;

&lt;p&gt;And honestly, that is what a lot of debugging looks like in real life. Not one dramatic mistake. Just three or four boring mismatches, each living at a different seam, and all of them combining into one system that feels unreliable.&lt;/p&gt;

&lt;p&gt;Sometimes the hardest part is realizing that one ugly symptom actually belongs to more than one story.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you're building something similar
&lt;/h2&gt;

&lt;p&gt;Don't just test whether the hook runs.&lt;/p&gt;

&lt;p&gt;Test whether:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it runs in the runtime you think it runs in&lt;/li&gt;
&lt;li&gt;it has the credentials you think it has&lt;/li&gt;
&lt;li&gt;it can finish within the timeout you actually configured&lt;/li&gt;
&lt;li&gt;and the final side effect really happens&lt;/li&gt;
&lt;/ul&gt;
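
&lt;p&gt;The first two checks collapse into a tiny smoke script you run from inside the suspect runtime — the &lt;code&gt;~/.claude&lt;/code&gt; state directory and the &lt;code&gt;claude&lt;/code&gt; binary name match my setup; adjust for yours:&lt;br&gt;
&lt;/p&gt;

```python
import os
import platform
import shutil
from pathlib import Path

def runtime_report():
    # Which runtime am I actually in, and does it hold credentials?
    return {
        "platform": platform.system(),
        "release": platform.release(),
        "cwd": os.getcwd(),
        # Assumed credential location for my setup:
        "claude_state_dir": (Path.home() / ".claude").is_dir(),
        "claude_on_path": shutil.which("claude") is not None,
    }

for key, value in runtime_report().items():
    print(f"{key}: {value}")
```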

&lt;p&gt;Because "the script executed" is not the same thing as "the system worked."&lt;/p&gt;

&lt;p&gt;And if your tool is supposed to remember things for you, that difference matters a lot.&lt;/p&gt;




&lt;p&gt;I'm building this as part of &lt;a href="https://github.com/ub3dqy/llm-wiki" rel="noopener noreferrer"&gt;llm-wiki&lt;/a&gt;, a markdown-first memory layer for Claude Code and Codex. The part I underestimated wasn't the prompt design or the summarization logic. It was the plumbing around the boundaries.&lt;/p&gt;

&lt;p&gt;Which, in hindsight, is exactly where these systems like to break.&lt;/p&gt;

</description>
      <category>debugging</category>
      <category>python</category>
      <category>ai</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Hidden Gmail and Calendar integrations quietly broke my Claude SDK pipeline</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:49:42 +0000</pubDate>
      <link>https://dev.to/ub3dqy/hidden-gmail-and-calendar-integrations-quietly-broke-my-claude-sdk-pipeline-18ei</link>
      <guid>https://dev.to/ub3dqy/hidden-gmail-and-calendar-integrations-quietly-broke-my-claude-sdk-pipeline-18ei</guid>
      <description>&lt;p&gt;I lost a few hours to one of those bugs that feels fake when you first describe it out loud.&lt;/p&gt;

&lt;p&gt;My Claude Agent SDK pipeline kept failing with the most generic error possible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Command failed with exit code 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No useful stderr. No clear traceback. No obvious repro beyond "real sessions fail, synthetic tests sometimes pass."&lt;/p&gt;

&lt;p&gt;At first I thought it was just another boring auth problem. Then I thought it was a subprocess visibility problem. Then I thought it was WSL. All of those were plausible. None of them were the whole story.&lt;/p&gt;

&lt;p&gt;The actual cause was stranger:&lt;/p&gt;

&lt;p&gt;after &lt;code&gt;claude auth login&lt;/code&gt;, my account quietly picked up &lt;strong&gt;account-level Gmail and Google Calendar MCP integrations&lt;/strong&gt; that I had never explicitly enabled, could not see in the Claude web UI, and could not remove from the CLI. Those integrations wanted an interactive Google OAuth flow, and that was enough to break every non-interactive Claude SDK subprocess I was using for automation.&lt;/p&gt;

&lt;p&gt;The workaround was one CLI flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;extra_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strict-mcp-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was the end of the bug.&lt;/p&gt;

&lt;p&gt;Finding it was the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this showed up
&lt;/h2&gt;

&lt;p&gt;I use Claude Agent SDK inside a small memory pipeline.&lt;/p&gt;

&lt;p&gt;The shape is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a hook fires at the end of a Codex session&lt;/li&gt;
&lt;li&gt;the hook spawns &lt;code&gt;flush.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;flush.py&lt;/code&gt; calls Claude Agent SDK&lt;/li&gt;
&lt;li&gt;the result gets written into a daily markdown log&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the kind of automation that should be boring once it's set up.&lt;/p&gt;

&lt;p&gt;Instead, real Codex sessions started failing. The hook would fire, the downstream script would start, and then the Agent SDK step would die with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fatal error in message reader: Command failed with exit code 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was enough to break the memory pipeline completely. No flush, no daily log entry, no durable memory from those sessions.&lt;/p&gt;

&lt;p&gt;And because the error was happening in a subprocess layer, the surface signal was awful. It just looked like "the SDK sometimes fails."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this was so annoying to debug
&lt;/h2&gt;

&lt;p&gt;There were at least three reasons this bug wasted my time.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The first fix appeared to work
&lt;/h3&gt;

&lt;p&gt;At one point I thought I had solved it just by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude auth login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for a moment, it looked like I had.&lt;/p&gt;

&lt;p&gt;Synthetic tests passed. The bundled Claude binary responded. The pipeline looked alive again.&lt;/p&gt;

&lt;p&gt;That "fix" turned out to be fake.&lt;/p&gt;

&lt;p&gt;It worked briefly because the environment had not yet fully populated the auth-related cache state that was about to cause the real failure.&lt;/p&gt;

&lt;p&gt;So I got the worst possible debugging gift: a temporary false victory.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The important error wasn't where I was looking
&lt;/h3&gt;

&lt;p&gt;I had already added stderr visibility around the SDK call, expecting to catch the real CLI failure there.&lt;/p&gt;

&lt;p&gt;That didn't help much, because the Google MCP auth path wasn't producing a nice actionable stderr message.&lt;/p&gt;

&lt;p&gt;The auth prompt was effectively happening on the wrong channel for my diagnostic setup. The SDK stderr callback got nothing useful. &lt;code&gt;ProcessError.stderr&lt;/code&gt; was empty. All I had was the outer shell of the failure.&lt;/p&gt;

&lt;p&gt;From the outside, it still looked like "exit code 1, good luck."&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The integrations were invisible
&lt;/h3&gt;

&lt;p&gt;This was the part that really crossed from "normal debugging" into "what exactly is this system doing?"&lt;/p&gt;

&lt;p&gt;In Claude.ai web settings, I could see the usual visible things. But I could not see any Gmail or Google Calendar integrations anywhere I would expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;not in &lt;strong&gt;Settings → Connectors&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;not in &lt;strong&gt;Settings → Customize → Skills&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;not in &lt;strong&gt;Customize → Connectors&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet on disk, after &lt;code&gt;claude auth login&lt;/code&gt;, I could see this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"claude.ai Gmail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1776092446619&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"claude.ai Google Calendar"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1776092446683&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That came from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/mcp-needs-auth-cache.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the CLI confirmed the same story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claude.ai Gmail:           https://gmail.mcp.claude.com/mcp - ! Needs authentication
claude.ai Google Calendar: https://gcal.mcp.claude.com/mcp - ! Needs authentication
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At that point the bug finally started making sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually happening
&lt;/h2&gt;

&lt;p&gt;Here is the version I wish someone had handed me before I started digging:&lt;/p&gt;

&lt;p&gt;When you authenticate Claude via OAuth, your account may receive &lt;strong&gt;account-level MCP integrations&lt;/strong&gt; as part of its backend claims.&lt;/p&gt;

&lt;p&gt;In my case, that included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;claude.ai Gmail&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;claude.ai Google Calendar&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those integrations were not local project config.&lt;br&gt;
They were not something I had added manually in the repo.&lt;br&gt;
And they were not removable with normal local MCP commands.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claude mcp remove "claude.ai Gmail"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No MCP server found with name: "claude.ai Gmail"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which makes sense once you realize the CLI isn't managing them as local entries. They are attached at the account/backend layer.&lt;/p&gt;

&lt;p&gt;Then the next failure follows naturally:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;bundled Claude CLI starts inside a non-interactive SDK subprocess&lt;/li&gt;
&lt;li&gt;it checks account-level MCP claims&lt;/li&gt;
&lt;li&gt;it sees Gmail and Calendar need additional Google auth&lt;/li&gt;
&lt;li&gt;it tries to initiate an interactive OAuth flow&lt;/li&gt;
&lt;li&gt;there is no TTY / browser / normal user interaction path&lt;/li&gt;
&lt;li&gt;subprocess stalls or exits with code 1&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That was the whole bug.&lt;/p&gt;

&lt;p&gt;Not my prompt.&lt;br&gt;
Not my retry logic.&lt;br&gt;
Not the summary logic.&lt;br&gt;
Not even the main auth state in the obvious sense.&lt;/p&gt;

&lt;p&gt;Just hidden MCP integrations pulling a subprocess into an auth flow it had no way to complete.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why this matters beyond my repo
&lt;/h2&gt;

&lt;p&gt;This is not really about my particular memory pipeline.&lt;/p&gt;

&lt;p&gt;It matters because subprocess-based Claude SDK automation is a completely normal pattern. People use it for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;capture pipelines&lt;/li&gt;
&lt;li&gt;background summarizers&lt;/li&gt;
&lt;li&gt;hook-triggered analysis&lt;/li&gt;
&lt;li&gt;scheduled jobs&lt;/li&gt;
&lt;li&gt;internal tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of those run in contexts where interactive browser auth is either awkward or impossible.&lt;/p&gt;

&lt;p&gt;If account-level integrations that require fresh OAuth can quietly attach themselves after login, and if they are invisible in the UI, then the failure mode becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;everything looks fine in your normal interactive CLI&lt;/li&gt;
&lt;li&gt;your automation suddenly fails in the background&lt;/li&gt;
&lt;li&gt;the error surface is generic&lt;/li&gt;
&lt;li&gt;and the root cause is not where you would reasonably look first&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a nasty class of bug.&lt;/p&gt;
&lt;h2&gt;
  
  
  The workaround that fixed it
&lt;/h2&gt;

&lt;p&gt;The workaround was to isolate the subprocess from account-level MCP discovery.&lt;/p&gt;

&lt;p&gt;In practice, that meant passing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extra_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strict-mcp-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In effect, the bundled Claude binary gets invoked with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--strict-mcp-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
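&lt;p&gt;My working assumption, and it is an assumption rather than documented behavior I have verified, is that each &lt;code&gt;extra_args&lt;/code&gt; entry becomes a CLI flag, with &lt;code&gt;None&lt;/code&gt; meaning a bare boolean flag. Sketched out:&lt;/p&gt;

```python
def extra_args_to_flags(extra_args):
    """My working assumption about the extra_args contract: a None value
    becomes a bare boolean flag, anything else becomes --key value."""
    flags = []
    for key, value in extra_args.items():
        flags.append(f"--{key}")
        if value is not None:
            flags.append(str(value))
    return flags

print(extra_args_to_flags({"strict-mcp-config": None}))  # → ['--strict-mcp-config']
```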



&lt;p&gt;And once I did that, the subprocess stopped trying to discover or auth those account-level MCP integrations.&lt;/p&gt;

&lt;p&gt;The pipeline started running cleanly again.&lt;/p&gt;

&lt;p&gt;That was the actual fix.&lt;/p&gt;

&lt;p&gt;Deleting the cache file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; ~/.claude/mcp-needs-auth-cache.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;did not really fix anything. It just bought a little time until the state was regenerated.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part I still don't love
&lt;/h2&gt;

&lt;p&gt;The workaround is fine. I'm happy to use it in automation.&lt;/p&gt;

&lt;p&gt;What I don't love is the product behavior that made it necessary.&lt;/p&gt;

&lt;p&gt;From a developer point of view, a few things feel wrong here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hidden integrations are a bad default
&lt;/h3&gt;

&lt;p&gt;If my account has Gmail and Calendar MCP integrations attached, I should be able to see them in the UI and turn them off.&lt;/p&gt;

&lt;p&gt;Right now, from the outside, it feels like they exist in a shadow layer of account state that can affect automation without being visible where a user would normally manage integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Opt-out without visible opt-in is rough
&lt;/h3&gt;

&lt;p&gt;I didn't explicitly wire Gmail or Calendar into this project. Yet they still ended up influencing subprocess behavior after OAuth login.&lt;/p&gt;

&lt;p&gt;That is a surprising default for anyone using Claude as a programmable toolchain component instead of just a chat app.&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-interactive contexts should fail more gracefully
&lt;/h3&gt;

&lt;p&gt;If the process is clearly non-interactive, the CLI should not wander into a user-hostile auth path and then collapse into a generic exit code.&lt;/p&gt;

&lt;p&gt;At minimum, I would want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a clear message saying an MCP integration requires interactive authentication&lt;/li&gt;
&lt;li&gt;the name of the integration&lt;/li&gt;
&lt;li&gt;and ideally a way to skip it automatically in SDK subprocess mode&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I wish the docs said
&lt;/h2&gt;

&lt;p&gt;The one warning I really needed was something like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you use Claude Agent SDK in non-interactive subprocesses, account-level MCP integrations attached through OAuth may trigger additional authentication flows. If those integrations are not fully authenticated, subprocess calls may fail. Use strict MCP config isolation for automation workloads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single paragraph would have saved me hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;If you hit a mysterious Claude Agent SDK subprocess failure with a generic &lt;code&gt;exit code 1&lt;/code&gt;, and your interactive CLI mostly works, check whether hidden account-level MCP integrations are involved before you start rewriting your own code.&lt;/p&gt;

&lt;p&gt;In my case, the real sequence was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I thought the pipeline was unauthenticated&lt;/li&gt;
&lt;li&gt;then I thought stderr visibility was missing&lt;/li&gt;
&lt;li&gt;then I thought WSL was the root cause&lt;/li&gt;
&lt;li&gt;and the real issue turned out to be hidden Gmail and Calendar MCP integrations trying to force Google OAuth inside a non-interactive subprocess&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix was not glamorous:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;extra_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strict-mcp-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But at least now I know what class of failure I was dealing with.&lt;/p&gt;

&lt;p&gt;And honestly, that's often the hardest part.&lt;/p&gt;




&lt;p&gt;I'm building this as part of a markdown-first memory system for Claude Code and Codex. I keep expecting the hard parts to be prompt design or summarization quality. More often than not, the real work is figuring out which invisible layer is making the obvious layer look broken.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>debugging</category>
      <category>claude</category>
    </item>
    <item>
      <title>How I shipped a broken capture pipeline and didn't notice for 3 days</title>
      <dc:creator>UB3DQY</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:00:36 +0000</pubDate>
      <link>https://dev.to/ub3dqy/how-i-shipped-a-broken-capture-pipeline-and-didnt-notice-for-3-days-4b1h</link>
      <guid>https://dev.to/ub3dqy/how-i-shipped-a-broken-capture-pipeline-and-didnt-notice-for-3-days-4b1h</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I built a hook-based capture system for Claude Code. Every session-end was supposed to get summarized and written into a daily log. My &lt;code&gt;doctor.py&lt;/code&gt; gate said &lt;code&gt;13/13 PASS&lt;/code&gt;. Lint was clean. CI was green on every commit.&lt;/p&gt;

&lt;p&gt;Then a user asked a simple question: &lt;em&gt;"Is the wiki actually capturing this conversation?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I checked the log. &lt;strong&gt;57% of my recent sessions had been silently dropped&lt;/strong&gt; for three days. The gate never told me. Every smoke test was passing. The system was broken in the one place no test was actually looking.&lt;/p&gt;

&lt;p&gt;This is what happened, how I caught it, and what I changed so I would not miss the same kind of bug again.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The project is a memory system for Claude Code and Codex CLI. A session-end hook reads the transcript and hands it off to a background Python script; that script asks the Agent SDK whether the conversation is worth saving, and the result gets appended to &lt;code&gt;daily/YYYY-MM-DD.md&lt;/code&gt;. Fairly normal hook plumbing.&lt;/p&gt;

&lt;p&gt;I had a &lt;code&gt;doctor.py&lt;/code&gt; script with 13 smoke checks across the pipeline. &lt;code&gt;session-start.py&lt;/code&gt; produced valid JSON. &lt;code&gt;user-prompt-wiki.py&lt;/code&gt; could look up articles. &lt;code&gt;stop.py&lt;/code&gt; exited cleanly. I had structural lint. I had a green CI gate on every push.&lt;/p&gt;

&lt;p&gt;I had shipped eight commits over two days, each with &lt;code&gt;doctor --quick&lt;/code&gt; green, each with CI passing. I was telling myself the system was in good shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  The moment of doubt
&lt;/h2&gt;

&lt;p&gt;Someone I was working with asked a very simple question: &lt;em&gt;"Just to confirm, is the wiki actually storing this conversation?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I almost said yes immediately. The hooks were wired. Every prompt I sent was coming back with wiki snippets injected at the top. &lt;code&gt;UserPromptSubmit&lt;/code&gt; was clearly doing its job. From the outside, the system looked alive.&lt;/p&gt;

&lt;p&gt;But I have been burned by "it looks alive" enough times that I checked instead of trusting the feeling. I opened &lt;code&gt;scripts/flush.log&lt;/code&gt;, the file where the session-end and pre-compact hooks write their operational log, and scrolled to the recent entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2026-04-12 16:36:39 INFO [session-end] SessionEnd fired: session=...
2026-04-12 16:36:39 INFO [session-end] SKIP: only 2 turns (min 4)
2026-04-12 16:39:27 INFO [session-end] SessionEnd fired: session=...
2026-04-12 16:39:27 INFO [session-end] SKIP: only 2 turns (min 4)
2026-04-12 16:42:07 INFO [session-end] SessionEnd fired: session=...
2026-04-12 16:42:07 INFO [session-end] SKIP: only 2 turns (min 4)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was the moment my confidence disappeared.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was looking at
&lt;/h2&gt;

&lt;p&gt;The hooks &lt;strong&gt;were firing&lt;/strong&gt;. &lt;code&gt;SessionEnd fired&lt;/code&gt; is printed before any filtering happens, so those lines meant the hook chain from Claude Code to my Python script was intact. The wiring was not the problem.&lt;/p&gt;

&lt;p&gt;But then immediately after, on every single recent entry: &lt;code&gt;SKIP: only 2 turns (min 4)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;My session-end code had this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MIN_TURNS_TO_FLUSH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="c1"&gt;# ... later ...
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;turn_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MIN_TURNS_TO_FLUSH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SKIP: only %d turns (min %d)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MIN_TURNS_TO_FLUSH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was supposed to protect against flushing trivial sessions, the "one question, one answer, exit" pattern that is probably not worth archiving. The threshold &lt;code&gt;4&lt;/code&gt; felt reasonable when I wrote it. It felt reasonable when I reviewed it. It passed every test.&lt;/p&gt;

&lt;p&gt;What I had not really internalized was the shape of my own usage. A typical Claude Code session for me is: open terminal, ask one specific question, get one specific answer, close terminal. That is &lt;strong&gt;exactly 2 turns&lt;/strong&gt;. The rule I wrote to skip "trivial" sessions was skipping &lt;strong&gt;my normal session shape&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I ran the numbers over the full log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SessionEnd fired:       109
Spawned flush.py:        52  (48%)
Skipped (various):       57  (52%)
Most recent skip reason: "SKIP: only 2 turns (min 4)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Over half the sessions from the last three days had been silently dropped.&lt;/strong&gt; Not edge cases. Not weird corner traffic. Just normal usage. The daily log for those days had looked thinner than it should have been, and I had noticed that in the background, but never chased it because &lt;code&gt;doctor --quick&lt;/code&gt; was green and I trusted it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the gate didn't catch it
&lt;/h2&gt;

&lt;p&gt;The actual bug was trivial. Change a number. That part took no time. The question that mattered was: why did my gate tell me everything was fine while half the data was disappearing?&lt;/p&gt;

&lt;p&gt;Let me walk through what &lt;code&gt;doctor --full&lt;/code&gt; actually tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;check_session_start_smoke&lt;/code&gt;&lt;/strong&gt; — runs &lt;code&gt;session-start.py&lt;/code&gt; with an empty JSON input, verifies it prints a valid hook-output JSON. ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;check_user_prompt_smoke&lt;/code&gt;&lt;/strong&gt; — runs &lt;code&gt;user-prompt-wiki.py&lt;/code&gt;, verifies it returns additionalContext with articles. ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;check_stop_smoke&lt;/code&gt;&lt;/strong&gt; — runs &lt;code&gt;stop.py&lt;/code&gt;, verifies it exits cleanly on empty stdin. ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;check_index_freshness&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;check_structural_lint&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;check_env_settings&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;check_path_normalization&lt;/code&gt;&lt;/strong&gt; — the rest of the usual health-check surface.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice what those tests are really asking. Each one asks: &lt;em&gt;"Does this script run without crashing?"&lt;/em&gt; That is a useful question. It catches real bugs: &lt;code&gt;ImportError&lt;/code&gt; after a refactor, &lt;code&gt;JSONDecodeError&lt;/code&gt; from bad stdin, &lt;code&gt;FileNotFoundError&lt;/code&gt; after a rename. But it is not the question I actually cared about: &lt;em&gt;"Does a real transcript, processed by this chain, end up in the daily log?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That question has three subtly different parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Does the hook fire when Claude Code ends a session?&lt;/strong&gt; (Yes — I could see it in the log.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does the hook's filter logic produce a "worth-saving" verdict for realistic input?&lt;/strong&gt; (Turns out: no, because of the bug above.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does the downstream chain actually write the result to the daily log?&lt;/strong&gt; (Unknown, because step 2 always said no.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;doctor --full&lt;/code&gt; tested a weak version of (1) by running the script with an empty payload. It did not test (2), because that needs a realistic transcript. It did not test (3), because the chain never got that far. Every link passed in isolation, and the chain as a whole was still broken.&lt;/p&gt;

&lt;p&gt;This is the classic distinction between &lt;strong&gt;smoke tests&lt;/strong&gt; and &lt;strong&gt;end-to-end tests&lt;/strong&gt;. In theory everybody knows it. In a personal tool, it is easy to get lazy about it. You know what the chain is supposed to do, so testing the whole thing can feel redundant. It is not redundant at all. The chain breaks in exactly the places where each individual component still passes its own tiny check.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two things I added to stop this happening again
&lt;/h2&gt;

&lt;p&gt;The code fix itself was boring: replace the turn-based threshold with a content-based one. Short but substantial sessions, two turns and a couple thousand characters of real discussion, now get captured. Tiny sessions, two turns and thirty characters of "ok thanks", still get skipped, but by character count instead of turn count. That is not really the point of the post.&lt;/p&gt;
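&lt;p&gt;For completeness, the replacement gate looks roughly like this (threshold value and transcript shape simplified from the real code):&lt;/p&gt;

```python
MIN_FLUSH_CHARS = 500  # illustrative threshold; tune it to your own usage

def worth_flushing(turns):
    """Content-based gate: flush when the transcript carries enough real
    text, no matter how few turns it took. `turns` is a list of dicts with
    a "content" field, a simplification of my real transcript format."""
    total_chars = sum(len(t.get("content", "")) for t in turns)
    return total_chars >= MIN_FLUSH_CHARS

# Two short-but-substantial turns pass; a drive-by "ok thanks" does not.
print(worth_flushing([{"content": "x" * 400}, {"content": "y" * 400}]))  # → True
print(worth_flushing([{"content": "ok thanks"}]))                        # → False
```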

&lt;p&gt;The interesting part is what I added to &lt;code&gt;doctor.py&lt;/code&gt; afterward, because that is what turns this from a one-off fix into something the project can actually defend itself with.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Observability check: &lt;code&gt;check_flush_capture_health&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This one reads &lt;code&gt;scripts/flush.log&lt;/code&gt; over a rolling 7-day window and summarizes what the capture pipeline has actually been doing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_flush_capture_health&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CheckResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# parse flush.log, count SessionEnd fired vs Spawned flush.py
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;detail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Last 7d: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;spawned&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;session_fired&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; flushes spawned (skip rate &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;skip_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;spawned&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CheckResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flush_capture_health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Pipeline appears broken: SessionEnds fired but nothing was spawned.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;skip_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CheckResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flush_capture_health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [attention: high skip rate — consider lowering WIKI_MIN_FLUSH_CHARS]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CheckResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flush_capture_health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important design choice: &lt;strong&gt;this check only FAILs when the pipeline is observably broken&lt;/strong&gt;. If SessionEnds fired but nothing was spawned, that is a correctness bug. It does &lt;strong&gt;not&lt;/strong&gt; FAIL on high skip rate, because skip rate is historical data about past usage, not necessarily a problem with the current code. A fresh clone has no history and should pass. A repo with lots of short sessions may have a high skip rate and still be behaving correctly. Blocking the merge gate on historical observability would be a mistake.&lt;/p&gt;

&lt;p&gt;The check prints an &lt;code&gt;[attention]&lt;/code&gt; marker in the detail line when the skip rate goes above 50%. On the first run in my own repo after I added it, it printed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[PASS] flush_capture_health: Last 7d: 50/121 flushes spawned (skip rate 59%)
       [attention: high skip rate — consider lowering WIKI_MIN_FLUSH_CHARS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one line would have saved me three days.&lt;/p&gt;
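&lt;p&gt;The counting behind that check is nothing fancy. A reduced version of the log parsing, with the marker strings matching what my hooks actually log and everything else simplified:&lt;/p&gt;

```python
def capture_stats(log_lines):
    """Count SessionEnd events against actual flush spawns and derive the
    skip rate. The real check adds a 7-day window and a CheckResult wrapper;
    those are stripped out here."""
    fired = sum(1 for line in log_lines if "SessionEnd fired" in line)
    spawned = sum(1 for line in log_lines if "Spawned flush.py" in line)
    skip_rate = 1 - spawned / fired if fired else 0.0
    return fired, spawned, skip_rate

lines = [
    "16:36:39 INFO [session-end] SessionEnd fired: session=a",
    "16:36:39 INFO [session-end] SKIP: only 2 turns (min 4)",
    "16:39:27 INFO [session-end] SessionEnd fired: session=b",
    "16:39:28 INFO [session-end] Spawned flush.py pid=1234",
]
print(capture_stats(lines))  # → (2, 1, 0.5)
```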

&lt;h3&gt;
  
  
  2. End-to-end acceptance test: &lt;code&gt;check_flush_roundtrip&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the answer to the "why didn't any test catch this?" question. It only runs in &lt;code&gt;doctor --full&lt;/code&gt;, because it is more expensive than the fast smoke checks.&lt;/p&gt;

&lt;p&gt;The test writes a dummy 6-turn transcript, about 2000 characters of realistic content, to a temp file. Then it invokes &lt;code&gt;hooks/session-end.py&lt;/code&gt; as a real subprocess with a realistic hook-input JSON on stdin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test_session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doctor-roundtrip-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;transcript_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SCRIPTS_DIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doctor-transcript-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;test_session_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# ... write dummy turns ...
&lt;/span&gt;
&lt;span class="n"&gt;hook_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;test_session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doctor-roundtrip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cwd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ROOT_DIR&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WIKI_FLUSH_TEST_MODE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;proc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_end_script&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hook_input&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;WIKI_FLUSH_TEST_MODE=1&lt;/code&gt; environment variable. That is the trick. The downstream script, &lt;code&gt;flush.py&lt;/code&gt;, checks it at startup and, if it is set, skips the real Agent SDK call and writes a marker file to a known location instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In flush.py
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WIKI_FLUSH_TEST_MODE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;TEST_MARKER_FILE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FLUSH_TEST_OK session=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ts=...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test then polls for that marker file with a 15-second timeout, verifies it contains the right session ID, and cleans up. If any link in the chain is broken — if &lt;code&gt;session-end.py&lt;/code&gt; does not spawn &lt;code&gt;flush.py&lt;/code&gt;, if &lt;code&gt;flush.py&lt;/code&gt; fails to import, if the environment is not inherited correctly — the marker never appears and the test fails with a clear message.&lt;/p&gt;

&lt;p&gt;This is an actual end-to-end test, not a smoke test. It exercises the real subprocess invocation, real environment inheritance, real stdin/stdout piping, and real timing. The only thing it fakes is the API call itself, because that would cost money and pollute the real daily log.&lt;/p&gt;

&lt;p&gt;On my machine right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[PASS] flush_roundtrip: session-end -&amp;gt; flush.py chain completed in test mode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If I had had this test two weeks earlier, I would have caught the &lt;code&gt;MIN_TURNS = 4&lt;/code&gt; bug on the first realistic transcript. It would not have needed to be clever. A visible skip where a spawn was expected would have been enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lessons, short enough to remember
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Smoke tests are not end-to-end tests, and they do not substitute for one.&lt;/strong&gt; I had thirteen checks in &lt;code&gt;doctor.py&lt;/code&gt;, and all of them were correct in isolation. None of them ran the actual production chain from a realistic input to a verifiable output. If you have a multi-process pipeline, you need at least one test that exercises the whole thing. It does not have to be fast and it does not have to run on every commit. It just has to exist somewhere meaningful in your gate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Observability is a design choice, not an afterthought.&lt;/strong&gt; My hooks were writing perfectly good operational logs. I just was not reading them, and my gate was not reading them either. Adding a check that summarizes those logs took about forty lines of code and would have turned a silent three-day outage into a visible &lt;code&gt;[attention]&lt;/code&gt; marker from day one. Logs you do not read are not much better than logs you never wrote.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. If a test could have caught the bug, it belongs in the gate — even if adding it feels obvious in hindsight.&lt;/strong&gt; The wrong question is "why didn't I add this on day one?" The better question is "what class of future bugs does this protect me from now?" Hindsight is always perfect about the bug you already know. What you want is &lt;strong&gt;general immunity&lt;/strong&gt; to the class of bugs you just learned about.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you are building similar hook-based systems, the code for the project where this happened is at &lt;a href="https://github.com/ub3dqy/llm-wiki" rel="noopener noreferrer"&gt;github.com/ub3dqy/llm-wiki&lt;/a&gt;. It is a markdown-first memory system for Claude Code and Codex CLI, and both fixes described here — the content-based threshold and the roundtrip test — live in &lt;code&gt;scripts/doctor.py&lt;/code&gt; and &lt;code&gt;hooks/session-end.py&lt;/code&gt;. No API keys, no vector database, and it boots with &lt;code&gt;uv run python scripts/setup.py&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>debugging</category>
      <category>python</category>
    </item>
  </channel>
</rss>
