DEV Community: sai-builder

7 Things I Automated with Claude Code + MCP That Actually Saved Time (and 3 That Didn't)

sai-builder — Thu, 21 May 2026 23:44:45 +0000

Most "things I automated with AI" lists are aspirational. They describe what's possible, run it once for the screenshot, and never mention that the thing broke on Tuesday and the author quietly went back to doing it by hand.

This is the honest version. These are automations I built with Claude Code + MCP that are still running weeks later because they genuinely save me time. Each one has a clear trigger, a clear output, and a reason it survived. Then — because a list of only wins is a sales pitch, not a field report — I'll show you three I built, measured, and deleted, and why.

The throughline: an automation is only worth it if the time it saves exceeds the time it costs to babysit. Most AI automations fail that test silently. The trick is measuring it on purpose.
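A minimal sketch of that measurement, assuming you log three numbers per automation over a trial window. The function name and the sample figures are illustrative, not taken from my logs.

```python
def net_minutes_saved(runs, manual_minutes_per_run, babysit_minutes_total, build_minutes=0):
    """Minutes saved over the measurement window; a negative result means it costs more than it saves."""
    saved = runs * manual_minutes_per_run
    return saved - babysit_minutes_total - build_minutes


# Hypothetical two-week log: 10 runs that used to take 12 minutes each by hand,
# plus 45 minutes spent fixing and re-running the automation.
print(net_minutes_saved(runs=10, manual_minutes_per_run=12, babysit_minutes_total=45))  # 75
```

The point of writing it down is that the babysitting term is the one people forget to log.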

Quick framing: what MCP actually buys you

If you're new to this — MCP (Model Context Protocol) is the standard that lets Claude Code call out to external tools and data sources through small servers: a browser, your filesystem, a calendar, a database, an API. The model stops being a chat box and becomes something that can act. Claude Code is the agent runtime that drives it from your terminal.

The mental model that's served me best: don't ask "can the model do this?" Ask "what's the smallest reliable tool I can hand it so it doesn't have to guess?" Most of my wins below are wins because I gave the agent a precise tool, not because I wrote a clever prompt.

The 7 that survived

1. Reading a logged-in page and turning it into structured data

A browser MCP attached to my real, logged-in browser session (via the Chrome DevTools Protocol, CDP) means the agent can read pages that require auth (dashboards, account pages, analytics) and hand me back structured data instead of me squinting at a UI. The win isn't "it browses." It's that the page that needed my login is now machine-readable without me exporting anything.

Trigger: "pull the current numbers from X." Output: a clean table. Why it survived: the alternative was me copy-pasting from a dashboard that has no export button.

2. Multi-file refactors with a plan I approve first

Claude Code reads across a whole directory, proposes a concrete edit plan, and only after I say go does it make the changes. The value over a single-file assistant is that it sees the blast radius — it finds the three other files that import the thing I'm renaming. The approval gate is non-negotiable: I approve the plan, not each keystroke.

Trigger: "rename this concept across the project." Output: a diff I review. Why it survived: it catches the references I'd miss, and the plan-first gate means it never surprises me.
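The plan-then-apply shape can be sketched in a few lines. This is not how Claude Code works internally; it is a hypothetical illustration of the gate, with made-up function names, where the plan is built without touching any file and edits are written only after approval.

```python
import re
import tempfile
from pathlib import Path


def plan_rename(root, old, new, suffixes=(".py", ".md")):
    """Return (path, line_number, old_line, new_line) tuples without modifying any file."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    plan = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if pattern.search(line):
                plan.append((path, number, line, pattern.sub(lambda _m: new, line)))
    return plan


def apply_plan(plan, approved):
    """Write the planned edits only if the human approved the plan. Returns edits applied."""
    if not approved:
        return 0
    by_file = {}
    for path, number, _old_line, new_line in plan:
        by_file.setdefault(path, {})[number] = new_line
    for path, edits in by_file.items():
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, new_line in edits.items():
            lines[number - 1] = new_line
        # Simplification: files are rewritten with "\n" endings and a trailing newline.
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(plan)


# Demo in a throwaway directory: the plan finds the import in a.py as well as the definition.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.py").write_text("from b import widget\nprint(widget)\n", encoding="utf-8")
    (Path(tmp) / "b.py").write_text("widget = 1\nwidgets = 2\n", encoding="utf-8")
    plan = plan_rename(tmp, "widget", "gadget")
    for path, number, old_line, new_line in plan:
        print(f"{path.name}:{number}: {old_line!r} -> {new_line!r}")
    changed = apply_plan(plan, approved=True)
    after_b = (Path(tmp) / "b.py").read_text(encoding="utf-8")
```

The word-boundary match leaves `widgets` alone, which is the kind of detail worth checking in the plan before saying go.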

3. Drafting from a template with my actual constraints baked in

I have repetitive structured documents — reports, briefs, configs — that follow a fixed shape. I gave the agent the template and the rules (required sections, tone, what's forbidden) as a reusable instruction. Now "draft tomorrow's report" produces something 80% done that I edit, instead of a blank page.

Trigger: a daily/weekly cadence. Output: a near-final draft. Why it survived: the boring 80% is exactly what I procrastinate on. The model doesn't procrastinate.
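One way to keep "the rules" honest is to check the draft against them mechanically before reading it. A minimal sketch, assuming the rules reduce to required section headings and forbidden phrases; the section names and phrases below are invented for illustration.

```python
def check_draft(draft, required_sections, forbidden_phrases):
    """Return a list of rule violations; an empty list means the draft passes."""
    problems = []
    lines = [line.strip() for line in draft.splitlines()]
    for section in required_sections:
        if section not in lines:
            problems.append(f"missing section: {section}")
    lowered = draft.lower()
    for phrase in forbidden_phrases:
        if phrase.lower() in lowered:
            problems.append(f"forbidden phrase: {phrase}")
    return problems


draft = "Summary\nShipped the importer.\n\nRisks\nNone, this is a game-changer.\n"
print(check_draft(draft, ["Summary", "Risks", "Next steps"], ["game-changer"]))
# ['missing section: Next steps', 'forbidden phrase: game-changer']
```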

4. Filesystem-wide search-and-summarize

The filesystem MCP lets the agent grep across a messy knowledge base and answer "where did I write about X, and what did I conclude?" This replaced the genuinely awful workflow of me opening twelve files trying to remember which one had the decision in it.

Trigger: "what did I decide about Y?" Output: the answer plus the file path. Why it survived: it turns my own notes into something queryable. The path matters — I want to verify, not trust blindly.
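The "answer plus the file path" contract is the part worth copying. A rough sketch of the search half, assuming plain-text notes on disk; the summarizing is left to the model, and the file names here are made up.

```python
import tempfile
from pathlib import Path


def search_notes(root, term, suffix=".md"):
    """Yield (path, line_number, line) for every case-insensitive match under root."""
    needle = term.lower()
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                yield path, number, line.strip()


with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp)
    (notes / "2026-05-01.md").write_text("Pricing: decided to stay on the flat tier.\n", encoding="utf-8")
    (notes / "ideas.md").write_text("Random idea about onboarding.\n", encoding="utf-8")
    hits = [(path.name, number, line) for path, number, line in search_notes(notes, "pricing")]

print(hits)  # [('2026-05-01.md', 1, 'Pricing: decided to stay on the flat tier.')]
```

Returning the path and line number with every hit is what makes the answer verifiable instead of just plausible.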

5. Validation-loop form/data entry

When I have to enter structured data into a form or system that validates input, the agent enters it, reads the rejection, corrects, and retries — without me. I provide the facts; it does the labor and the error recovery. (I wrote a whole separate piece on exactly where this stops — the line is private facts and identity, not the typing.)

Trigger: "fill this with these values." Output: a completed, validated entry. Why it survived: the correction cycle is the tedious part, and it's now off my plate.
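The enter, read rejection, correct, retry cycle looks roughly like this. It is a sketch under stated assumptions: `submit` and `correct` are hypothetical callables standing in for the form and the agent, and the attempt cap is my own addition so the loop cannot spin forever.

```python
def submit_with_retries(values, submit, correct, max_attempts=3):
    """Submit, read the rejection, correct, and retry. Returns the accepted values."""
    current = dict(values)
    errors = {}
    for _attempt in range(max_attempts):
        errors = submit(current)  # {} means accepted, otherwise {field: message}
        if not errors:
            return current
        current = correct(current, errors)
    raise RuntimeError(f"still rejected after {max_attempts} attempts: {errors}")


# Hypothetical form rule: the branch code must be digits only.
def fake_submit(values):
    return {} if values["branch_code"].isdigit() else {"branch_code": "digits only"}


def strip_non_digits(values, errors):
    fixed = dict(values)
    for field in errors:
        fixed[field] = "".join(ch for ch in fixed[field] if ch.isdigit())
    return fixed


result = submit_with_retries({"branch_code": "001-234"}, fake_submit, strip_non_digits)
print(result)  # {'branch_code': '001234'}
```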

6. Read-back verification on anything that publishes

This one is a meta-automation born from getting burned (I published four empty articles once — long story). Any step that writes to an external system is now followed by a step that reads the result back and asserts it's correct. The agent publishes, then fetches the live artifact and diffs it against the source.

Trigger: any publish/write step. Output: a pass/fail on "did the thing actually land." Why it survived: it's caught silent failures that returned a 200 and shipped garbage. Cheap insurance against the worst failure mode.
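In code, the read-back step is small. A minimal sketch, assuming `publish` returns an identifier and `fetch` returns the live body for that identifier; both are hypothetical stand-ins (here backed by a dict), not any platform's real client.

```python
class ReadBackError(Exception):
    """Raised when the live artifact does not match what was sent."""


def publish_and_verify(source, publish, fetch):
    """Publish, then fetch the live artifact and compare it to the source."""
    artifact_id = publish(source)
    live = fetch(artifact_id)
    if live != source:
        raise ReadBackError(
            f"read-back mismatch for {artifact_id!r}: sent {len(source)} chars, live has {len(live or '')}"
        )
    return artifact_id


store = {}


def good_publish(body):
    store["a1"] = body
    return "a1"


def lossy_publish(body):
    store["a2"] = ""  # reports success but stores an empty body
    return "a2"


print(publish_and_verify("# Title\n\nBody text.", good_publish, store.get))  # a1
try:
    publish_and_verify("# Title\n\nBody text.", lossy_publish, store.get)
except ReadBackError as err:
    print("caught:", err)
```

The lossy publisher is the interesting case: it returns an id exactly like the good one, and only the comparison notices.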

7. Turning a rough voice-of-me brief into a first draft in a fixed persona

I keep persona/voice definitions as files. The agent loads the right one and drafts in that voice from a few bullet points. The win is consistency across many outputs without me re-explaining the voice every time — the constraints live in a file, not in my head or in each prompt.

Trigger: "draft this as [persona]." Output: an on-voice draft. Why it survived: voice drift across a content series is real, and a file-based persona kills it.

The pattern in all 7

Read back over them. Every survivor has the same shape:

A precise tool (a specific MCP server), not a vague "be smart."
A clear trigger and a clear artifact, so I can tell instantly if it worked.
A human gate exactly where judgment lives — plan approval, fact provision, final review — and full automation everywhere judgment doesn't live.

The ones that work automate labor. They leave judgment to me. The moment an automation tries to own the judgment, it starts costing more than it saves — which brings me to the deletions.

The 3 I deleted

Deleted 1: fully autonomous publishing with no human gate

I tried letting the content pipeline publish with zero review, trusting the read-back check to catch problems. The read-back caught formatting failures fine. It could not catch "this draft is fine but it's the wrong thing to say right now." That's judgment, and I'd automated it away. Deleted the no-gate version; kept the read-back, restored the review gate. Lesson: verification catches broken, it can't catch wrong.

Deleted 2: a "monitor everything and alert me" agent

I built an agent to watch several sources and ping me on anything notable. It pinged constantly. The signal-to-noise was terrible because "notable" is a judgment call that depends on context I never fully specified. I spent more time triaging its alerts than I'd have spent checking the sources myself once a day. Deleted. Lesson: an automation that generates work to evaluate its own output is usually net-negative. Polling-and-judging is a trap.

Deleted 3: an over-engineered "agent that builds agents"

I tried to make a meta-agent that would spin up task-specific sub-agents on demand. It was a great demo and a maintenance sinkhole: every layer of indirection was a new place for things to break silently, and debugging a failure meant unwinding three levels of "which agent decided what." Deleted in favor of a flat list of single-purpose automations, each of which I can understand in one sitting. Lesson: indirection is a cost you pay on every debug, forever. Flat and boring beats clever and nested.

What I'd tell you to actually do

If you're starting with Claude Code + MCP, don't chase the impressive stuff. Do this:

Pick one task you do repeatedly that is labor, not judgment. Form entry, drafting from a template, search-and-summarize. Not "decide my strategy."
Give the agent the smallest precise tool for it — one MCP server, not five.
Put a human gate exactly where the judgment is, and automate everything around it.
Add a read-back check if the task writes anything anywhere.
Measure for two weeks. If you spend more time babysitting it than it saves, delete it without sentiment. I deleted three. That's not failure; that's the measurement working.

The hype version of this post would be ten wins and no deletions. But the deletions are where the actual knowledge is. Automating labor pays. Automating judgment, monitoring, and meta-orchestration mostly doesn't — at least not yet, not solo, not without a babysitting cost that quietly eats the savings.

Build first. Measure honestly. Delete what doesn't pay. The design converges later — and "later" usually means "after you've deleted the clever thing and kept the boring one that works."

— Sai


My AI Agent Kept Publishing Empty Articles — So I Made It Edit Them Back via the API

sai-builder — Thu, 21 May 2026 23:44:01 +0000

I have a content pipeline that drafts articles, runs them past me, and publishes them to dev.to. Last week it published six in a batch. Two looked fine. The other four were live, indexed, public — and completely empty. Title, tags, cover, canonical URL, all present. Body: nothing. Four blank posts under a persona I'm trying to build credibility with.

This is the writeup of how that happened and how I fixed it, because the fix taught me something I keep relearning the hard way: the reliable automation path is almost never the obvious one, and the obvious one fails silently, which is worse than failing loud.

How four articles ended up blank

The pipeline's last step was "publish." It drove the dev.to web editor: open the new-post page, fill the title field, fill the markdown body field, hit publish. Standard browser automation. It had worked before, which is exactly why I trusted it and didn't check the output closely enough.

The bodies I was feeding it were long. Not novel-length, but 1,500+ words of markdown with code fences, the occasional non-ASCII character, em dashes, the works. And here's the thing the demos never show you: when you programmatically stuff a large string into a rich editor's input and immediately trigger save, you are racing the editor's own internal state. The editor has its own model of the document. Your injected text and its serialize-on-save don't always agree on timing. Sometimes the save fires against an editor that hasn't committed your injection yet.

When that happens, the platform doesn't error. It happily saves the document it currently believes in — which is empty. You get a 200. You get a published URL. You get nothing in the body.

That's the part that stings. There was no exception to catch. The automation reported success. The only way to know it failed was to read the published page, which I wasn't doing because, well, the step said it succeeded. A pipeline that lies about success is more dangerous than one that crashes. A crash you handle. A silent lie you ship.

So lesson zero, before any of the technical stuff: if an automation step produces an artifact, your pipeline has to read the artifact back and assert it's correct. Not "did the call return 200." Did the body actually land. I now treat any write-without-readback step as a known liability.

First fix attempt: just drive the editor better

The instinct is to fix the thing that broke. So I tried to make the editor automation more robust: wait for the editor to be ready, inject, wait again, poll the editor's internal value until it matched what I sent, then save.

This is where I want to be honest about time. I spent a couple of hours on this and it got better, not good. The editor is a moving target — its DOM, its internal state model, the events it listens to. I could get it to ~90% reliable, which for a publishing step is useless. 90% reliable means 1 in 10 of my posts is blank, and I won't know which one without checking all of them, which defeats the automation.
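To put a number on why 90% is useless for a batch, here is the arithmetic as a small sketch. It assumes each post fails independently, which is a simplification; the batch size of six matches the batch described above.

```python
def chance_of_at_least_one_blank(per_post_reliability, posts):
    """Probability that a batch contains one or more blank posts, assuming independent failures."""
    return 1 - per_post_reliability ** posts


print(round(chance_of_at_least_one_blank(0.90, 6), 3))  # 0.469
```

So at 90% per post, nearly half of all six-post batches would ship something blank.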

This is the moment that matters, and I almost always get it wrong: I was fighting the tool instead of changing the transport. When you find yourself adding wait-then-poll-then-verify scaffolding around a UI that wasn't built for you, that's not robustness, that's a smell. You're hand-stabilizing something inherently unstable. Timebox it. I now give myself a hard cap — if a tooling fight isn't won in roughly an hour, I stop and ask: is there a known-working transport I'm ignoring because it's less convenient?

There was.

The actual fix: drive edits through the API

dev.to has a real, documented write API. You can update a published article with:

PUT https://dev.to/api/articles/{id}
api-key: <YOUR_API_KEY>
Content-Type: application/json

{
  "article": {
    "body_markdown": "...the full markdown body..."
  }
}

This bypasses the editor entirely. No racing an editor's internal state, no DOM, no save-button timing. You hand the platform the canonical markdown and it stores it. The four empty posts already existed with the right titles and tags; I just needed to PUT the bodies into them. So the repair plan was simple: for each blank article ID, PUT the correct body_markdown. Done.

Except it wasn't, and the two ways it wasn't are the genuinely useful part of this post.

Snag 1: the request has to come from inside a logged-in browser

My first move was the clean one: fire the PUT from a script, server-side, with the API key in the header. 401. Tried it a few different ways. Still 401.

I'm not going to over-claim the root cause here — I didn't fully reverse-engineer their auth posture, and you shouldn't trust a war story that pretends it did. What I observed, repeatably, is that the external/non-browser request was rejected, and the same PUT issued from within an already-authenticated browser session — the actual tab where I was logged in — went through. So that's what I did: I ran the request from inside the logged-in page's own context, where the session and the api-key together were accepted, instead of from a cold external client.

The reusable takeaway isn't "dev.to requires X." It's: when an API rejects you from outside but the platform clearly performs the same write from its own frontend, stop trying to replicate the auth from scratch and just borrow the context that already works. You have a logged-in browser. Issue the call from there. It's not elegant. It's reliable, which beats elegant for a repair job.

Snag 2: long unicode bodies got mangled into homoglyphs

Now the worse one. I had the transport working, so I tried to get the body into the request. The naive way is to inline the markdown directly into the call — encode the whole 1,500-word body and pass it along with the PUT.

The bodies came back corrupted. Specifically, characters had been swapped for homoglyphs — visually near-identical lookalikes from other Unicode blocks. Em dashes, quotes, and a handful of letters got silently substituted for characters that look the same in the editor but are different code points. A reader wouldn't notice at a glance. A code fence would. And it meant my "fixed" article was now subtly wrong in a way that's almost impossible to eyeball.

The cause was the path the big string took: shoving a large encoded blob inline through layers of escaping and re-encoding gave something, somewhere, the chance to normalize or transcode it. Every hop a string takes through quoting, shell escaping, JSON encoding, and re-decoding is a chance for a "helpful" substitution. With a short ASCII string you'd never see it. With a long unicode body, there are far more characters to substitute, and the odds of at least one silent swap climb toward certainty.

The fix that finally worked, end to end, was to never inline the body at all. Instead:

Put the full markdown on the clipboard.
In the logged-in browser tab, paste it into a plain <textarea>.
Read the textarea's .value back — now I have the exact string the browser holds, no inline-encoding hops.
PUT that value to the API.

Conceptually:

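// 1) body is on the OS clipboard (put there by the pipeline)
// 2) inside the logged-in tab (id and KEY are assumed to be defined already):
const ta = document.createElement('textarea');
document.body.appendChild(ta);
ta.focus();
// paste the clipboard into the textarea (a real paste event, driven by the automation)
// 3) read it back: this is the clean source of truth
const body = ta.value;
if (!body.trim()) throw new Error('textarea is empty; the paste did not land');
// 4) issue the PUT from this same authenticated context
const res = await fetch(`/api/articles/${id}`, {
  method: 'PUT',
  headers: { 'api-key': KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({ article: { body_markdown: body } }),
});
// a 2xx alone is not proof the body landed, but a non-2xx is proof it did not
if (!res.ok) throw new Error(`PUT failed with status ${res.status}`);
ta.remove();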

The clipboard-and-textarea step looks absurd. It is absurd. But it works for a precise reason: the clipboard → paste → .value round-trip keeps the string as a single opaque payload inside one runtime (the browser), instead of marching it through five layers of escaping where each layer is allowed to "correct" it. The textarea is just a clean holding pen that hands you back exactly what the browser received. No homoglyphs, because nothing in the path thought it was being helpful.

I checked all four repaired articles character-for-character against the source. Clean. Done.
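A character-for-character check is easy to script, and naming the code points is what exposes homoglyphs. A minimal sketch using Python's standard `unicodedata` module; the sample strings are invented, with a Cyrillic "а" (U+0430) standing in for a Latin "a".

```python
import unicodedata
from itertools import zip_longest


def describe(ch):
    """Human-readable name for a character, or a code point if it has no name."""
    if ch is None:
        return "MISSING"
    return unicodedata.name(ch, f"U+{ord(ch):04X}")


def first_mismatches(source, live, limit=5):
    """Return up to `limit` (index, source_char, live_char) descriptions where the strings differ."""
    diffs = []
    for index, (a, b) in enumerate(zip_longest(source, live)):
        if a != b:
            diffs.append((index, describe(a), describe(b)))
            if len(diffs) == limit:
                break
    return diffs


source = "pay"
live = "p\u0430y"  # looks identical on screen, different code point
print(source == live)                  # False
print(first_mismatches(source, live))  # [(1, 'LATIN SMALL LETTER A', 'CYRILLIC SMALL LETTER A')]
```

The comparison is deliberately strict: no Unicode normalization is applied first, because a normalizing comparison could hide exactly the substitution being hunted.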

The lesson, generalized

Strip away the dev.to specifics and here's what I'd tack to the wall above any agent-builder's desk:

1. A write step that doesn't read its result back is not done — it's a liability with a green checkmark. The empty articles shipped because "publish" returned success. Assert the artifact, not the status code.

2. The reliable transport is rarely the obvious one. The obvious path was the editor, because that's the UI a human uses. The reliable path was the API. The obvious path failed silently; the reliable one failed loudly (401) until I gave it the context it needed, which is exactly the failure mode you want.

3. Timebox tooling fights. If you're hand-stabilizing an unstable surface with wait-poll-verify scaffolding and you're past an hour, stop. Ask what known-working transport you're avoiding because it's less convenient. Convenience is not reliability.

4. Long unicode strings corrupt at every hop. Every escaping/encoding boundary is a chance for silent substitution. The fewer hops, the fewer homoglyphs. When in doubt, keep the payload opaque inside one runtime and read it back before you trust it.

None of this is the clever-architecture content the algorithm rewards. It's the unglamorous reality of running agents that touch real systems: the model is the easy part, and the last mile is full of silent string corruption and auth that only works from the right tab. You don't design your way around that up front. You hit it, you timebox the fight, you fall back to the transport that actually works, and you write down the smell so you recognize it faster next time.

Build first. The design converges later — usually right after the fourth empty article goes live.

— Sai

I Let an AI Agent Set Up My Payout Account. Here's the Exact Line It Couldn't Cross.

sai-builder — Thu, 21 May 2026 02:00:59 +0000

Yesterday I published a piece arguing that a fully autonomous AI startup loop hits two ceilings: an idea ceiling and an execution ceiling. The execution ceiling, I wrote, is where thinking is fully autonomous but doing gets stopped at human gates — CAPTCHA, KYC, capital, "are you human?"

That framing was correct but coarse. I wrote it from the armchair. So the next day I went and ran the actual experiment, because a theory about where AI stops is worthless until you push an agent right up to the wall and watch exactly where it bounces.

This is the field report. The conclusion is sharper than the theory: labor is delegable, identity is not. And the gap between those two is much smaller than I expected.

The setup

I picked the most concrete, highest-stakes execution task I could find that wasn't just "post some content": configuring the payout settings on a digital-product platform — the screen where you tell the platform which bank account should receive your money.

This is a good test because it's not toy automation. It touches:

Structured financial data (bank codes, branch codes, account numbers)
Government-shaped identity data (legal name, address, date of birth)
Multi-script input (in my locale, names have to be entered in more than one writing system)
Real validation (the form rejects malformed input; you can't fake your way past it)
A persistence step (save, and the platform actually stores it against a real account)

If an AI agent can drive this to completion, "autonomous execution" stops being a slogan. If it can't, I wanted to know precisely which field it died on.

I drove the agent through a real browser session (CDP-attached, so it was operating an actual logged-in browser, not a sandbox). I gave it the goal — "complete the payout configuration" — and the personal facts it would obviously need, and then I watched.

What the agent did entirely on its own

More than I expected. Specifically:

It resolved bank and branch codes by searching. I did not hand it the numeric codes for the bank or the branch. It went and found them — bank code, branch code — from public references, then entered them in the correct fields. This matters more than it sounds, and I'll come back to it.

It handled multi-script name entry. My locale requires the account holder's name in multiple writing systems — the standard form, a phonetic form, and a romanized form. The agent did the transliteration across all of them and placed each in the right field. This is exactly the kind of fiddly, error-prone, "ugh I have to do this carefully" task that humans hate and quietly get wrong.

It structured the address. Not "paste a string" — the form wanted address split into components, and it decomposed a plain address into the structured fields the form expected.

It passed validation. Malformed entries got rejected by the form, the agent read the rejection, corrected, and re-submitted. No human in the loop for the correction cycle.

It saved. The configuration persisted. The form was, functionally, done.

I want to be honest about how much that is. Filling out a financial form, in a foreign-to-the-form writing system, looking up the institutional codes yourself, decomposing freeform data into structured fields, recovering from validation errors — if a human assistant did that for you, you'd call it competent work. The agent did it without supervision on the mechanics. The mechanics were never the wall.

What actually required a human

After all of that, exactly two categories of thing could not come from the agent. Only two. And they're more specific than the "KYC / CAPTCHA / capital" bucket I waved at yesterday.

1. Person-specific, non-public facts

The account number. The date of birth. The exact residential address. These are things the agent literally cannot know, because they aren't anywhere it can read. They're not a capability gap — the agent is perfectly capable of typing an account number into a field; it proved that. It's an information-location gap. The data lives in my head and on my documents, not in any corpus or any search result.

And here's the subtle part: once I spoke those facts, the agent did the input. I didn't fill in the account number; I said the account number, and the agent placed it. So even for the human-only data, the human contributes the fact, not the labor. The typing, the field-matching, the format-correcting — still the machine.

Contrast this with the bank/branch codes. Those are also numbers, also required, also the kind of thing you'd assume a human has to provide. But they're public. They're scattered and annoying to find, but they're findable. So the agent found them. The line isn't "numbers humans must provide" — it's non-public facts humans must provide. Public-but-scattered data is squarely on the AI's side of the wall now. Search closes that gap.

That reframes the whole thing. The human's job in a procedure like this is not "provide the data." It's "provide the private data." Everything public, everything derivable, everything structural — the agent absorbs.

2. Proof that I am this specific person

This is the real wall. Not "are you a human?" — yesterday's framing — but its final form: "are you THIS human?"

A money-receiving setup eventually wants to bind the account to a verified legal identity: a government-issued ID, a confirmation that the person configuring this is the person who legally owns the destination account. That step is not an information problem and not a labor problem. It's an identity problem. There is no string I can dictate that lets the agent be me to a verifier. Identity is the one input that, by design, cannot be relayed through a proxy — because the entire point of identity verification is to defeat proxies.

This is the clean edge I was looking for. Everything upstream of it — the entire form — is delegable. The identity bind is not, and not by accident. It's not weakly defended; it's the thing the whole system exists to protect.

The precise statement

Yesterday: thinking is autonomous, execution is gated by humans.

After actually running it: that's true, but the gate is narrow and I can now describe its exact shape.

Labor is fully delegable to the agent. What is not delegable is (a) facts that are private to the principal, and (b) proof of the principal's identity. Everything else — including public-but-hard-to-find data, transliteration, structuring, validation recovery, and persistence — crosses to the machine.

Two things follow from that, and they're useful if you're building with agents:

The human-in-the-loop surface is smaller than people assume. When teams say "this needs a human," they usually mean the whole task. In practice the irreducibly human part of a procedure like this was a handful of dictated private facts and one identity check. Everything wrapped around those, the 90% that is tedious form labor, is automatable today. If your mental model is "forms need humans," you're leaving most of the work on the table.

The remaining 10% is not a temporary limitation — it's structural. I keep wanting to treat the identity bind as a gap that better tooling will close. It won't, and it shouldn't. Private facts are private by definition; identity proof is anti-proxy by purpose. Better models don't erode either one. So when you design an agent workflow that touches money or legal standing, don't architect for "full autonomy soon." Architect for "autonomous up to the identity bind, then a clean, minimal human handoff." Design the handoff to be exactly two things wide: dictate the private facts, present the identity. Nothing more should fall to the human.
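One way to make the handoff explicit is to tag every field with where its value can come from and derive the human's to-do list from the tags. This is a hypothetical sketch: the field names are illustrative, and only the three categories come from the experiment.

```python
from enum import Enum


class Source(Enum):
    PUBLIC = "public"      # findable by search, so the agent handles it
    PRIVATE = "private"    # only the principal knows it, so the human dictates it
    IDENTITY = "identity"  # must be presented by the principal, cannot be relayed


# Illustrative field names for a payout form.
FIELDS = {
    "bank_code": Source.PUBLIC,
    "branch_code": Source.PUBLIC,
    "account_number": Source.PRIVATE,
    "date_of_birth": Source.PRIVATE,
    "residential_address": Source.PRIVATE,
    "id_verification": Source.IDENTITY,
}


def human_handoff(fields):
    """Split the form into what the human dictates and what the human must present."""
    dictate = sorted(name for name, src in fields.items() if src is Source.PRIVATE)
    present = sorted(name for name, src in fields.items() if src is Source.IDENTITY)
    return {"dictate": dictate, "present": present}


print(human_handoff(FIELDS))
# {'dictate': ['account_number', 'date_of_birth', 'residential_address'], 'present': ['id_verification']}
```

Anything not in the returned dict is the agent's job, which keeps the handoff exactly two things wide.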

Why I think this matters beyond one form

The interesting question from the first piece was whether a pure-AI loop could ever be a business rather than just think like one. This experiment narrows the answer.

An agent can run essentially the entire operational body of a business — the research, the structuring, the form labor, the error recovery, the persistence. What it cannot do is be the legal person the business hangs on. The principal stays human not because the principal is smarter or more capable in the moment — on the mechanics, they're slower — but because the principal is the identity anchor. The one irreducible human role left is: be the person the system is allowed to trust.

That's a strangely small role. It's not founder-as-doer. It's founder-as-anchor. You dictate what's private, you prove who you are, and the machine does the rest of the body of work.

I find that genuinely clarifying rather than discouraging. Yesterday I thought the execution ceiling was a vague wall somewhere in "doing." Today I know it's a thin, sharp line with a precise location: it runs between labor and identity, and labor is already on the far side.

Build first. The boundary draws itself once you push something real all the way to the edge.

— Sai

I Ran an Autonomous AI Startup Loop 5 Times. It Hit Two Ceilings.

sai-builder — Wed, 20 May 2026 08:06:52 +0000

There's a question floating around right now that everyone has an opinion on and almost no one has run: what happens if you let AI design and launch a business by itself?

Not "AI helps you brainstorm." Not "AI writes your landing page copy." I mean the whole loop — generate the idea, evaluate it, decide whether to kill it or ship it — with no human context injected anywhere. No founder's domain expertise. No "I happen to know this industry." No personal network. Just three AI roles passing artifacts to each other and a hard rule that PASS means we build.

I built that loop and ran it five times. It produced zero passes.

That sounds like a failure, and in the narrow sense it is. But the shape of the failure turned out to be more interesting than a success would have been. The loop didn't fail randomly. It failed against two walls, in the same place, every cycle. And those two walls happen to be a pretty clean map of where autonomous AI ends and where humans still sit.

This is the writeup. No spin. The scores were bad and I'm going to show you the bad scores.

The setup

Three roles, no shared memory of "me":

Generator — proposes a business idea from scratch. It is explicitly forbidden from referencing any human's background, skills, or relationships. It works from market structure only.
Evaluator — scores the idea against a fixed rubric.
Judge — reads the score and either advances the idea, sends it back for one revision, or kills it.

The rubric is 5 criteria × 10 points (50 max), minus three penalties. PASS threshold is 40.

score =
    market_pull        # is there real, urgent demand?
  + willingness_to_pay # will someone actually pay, and how much?
  + defensibility      # can a small outside builder hold a position?
  + time_to_revenue    # how fast to first dollar, solo?
  + execution_fit      # can this be built and run without a team?
  # each 0–10, raw max 50

penalties (subtracted):
  - os_absorption_risk   # will a platform/OS just absorb this as a feature?
  - competitor_death     # is this a known graveyard pattern?
  - price_tier_squatting # is the obvious price tier already occupied for free?

PASS if final_score >= 40

The penalties are the part that matters. Anyone can generate an idea that scores well on raw appeal. The penalties are where ideas go to die, and they're modeled on the three ways small software businesses actually get killed: a platform builds your feature into itself, you walk into a category that has a body count, or the price point you need is already occupied by a free incumbent.
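As runnable code, the rubric might look like the sketch below. Two assumptions are mine: penalties are non-negative numbers with no stated cap, and the sample scores are made up for illustration and are not from any of the five cycles.

```python
CRITERIA = ("market_pull", "willingness_to_pay", "defensibility", "time_to_revenue", "execution_fit")
PENALTIES = ("os_absorption_risk", "competitor_death", "price_tier_squatting")
PASS_THRESHOLD = 40


def final_score(criteria, penalties):
    """Sum the five 0-10 criteria, then subtract the three penalties."""
    for name in CRITERIA:
        if not 0 <= criteria[name] <= 10:
            raise ValueError(f"{name} must be between 0 and 10, got {criteria[name]}")
    raw = sum(criteria[name] for name in CRITERIA)
    return raw - sum(penalties[name] for name in PENALTIES)


# Made-up example: strong raw appeal (36 of 50), then the penalties take 13 points.
example = final_score(
    {"market_pull": 8, "willingness_to_pay": 7, "defensibility": 6, "time_to_revenue": 7, "execution_fit": 8},
    {"os_absorption_risk": 9, "competitor_death": 4, "price_tier_squatting": 0},
)
print(example, example >= PASS_THRESHOLD)  # 23 False
```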

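Here's roughly how the Judge reasons, as a Python sketch:

from enum import Enum

Verdict = Enum("Verdict", "ADVANCE REVISE KILL")

def judge(idea, score):
    """Map a scored idea to a verdict; idea.revisions counts revisions already used."""
    if score.final >= 40:
        return Verdict.ADVANCE  # build a landing page, go live
    if score.final >= 28 and idea.revisions < 1:
        return Verdict.REVISE   # one shot to fix the biggest penalty
    return Verdict.KILL         # log the cause of death, move on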

One revision allowed. After that it lives or it dies. I kept the log of every death.
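To see the thresholds in action, here is a self-contained trace using scores reported in the cycle log that follows (Cycles 2, 3, and 4). The simplified signature, taking a bare score and a revision count, is my own shorthand for the sketch.

```python
from enum import Enum

Verdict = Enum("Verdict", "ADVANCE REVISE KILL")


def judge(final, revisions):
    """Same thresholds as the Judge: 40 to pass, 28 to earn the single revision."""
    if final >= 40:
        return Verdict.ADVANCE
    if final >= 28 and revisions < 1:
        return Verdict.REVISE
    return Verdict.KILL


# (label, score, revisions already used)
trace = [
    ("Cycle 2 first pass", 36, 0),
    ("Cycle 2 after revision", 22, 1),
    ("Cycle 3 first pass", 32, 0),
    ("Cycle 3 after revision", -4, 1),
    ("Cycle 4 first pass", 19, 0),
]
for label, final, revisions in trace:
    print(f"{label}: {final} -> {judge(final, revisions).name}")
```

A score between 28 and 39 only buys a revision once; after that, anything under 40 is a kill.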

What actually happened, cycle by cycle

Cycle 1 — ChangelogAI. A tool that auto-writes changelogs and release notes from your commits. Raw appeal was fine. Score: 35, then KILL. Cause of death: GitHub Releases already does the lightweight version of this, and the heavy version gets absorbed into the platform the moment it's worth absorbing. os_absorption_risk ate it.

Cycle 2 — ShopBot Live. An AI live-chat assistant for e-commerce stores. First pass 36. The Judge sent it back for one revision. It came back at 22 — lower — because the revision tried to differentiate by going upmarket, which made willingness_to_pay and time_to_revenue both worse. KILL. Cause of death: Shopify Inbox is free and already installed on nearly 390,000 stores. You cannot charge for the thing the platform gives away to its entire base. price_tier_squatting plus os_absorption_risk.

Cycle 3 — CarrierBidPilot. A bidding/automation layer for freight carriers. Looked like a real B2B wedge. Score 32, then on revision it went to −4. Negative. The penalties stacked: freight pricing is being OS-ified by DAT and Uber Freight, the exact layer it proposed sits inside their roadmap, and the death-pattern penalty fired because this is a well-populated startup graveyard. KILL.

Cycle 4 — ApiaryLedger. Compliance/record-keeping SaaS for a very narrow niche. This one was defensible — too small for any platform to bother absorbing. But it scored 19 and I retired it without even spending the revision. Cause of death: willingness-to-pay was essentially zero. The obligation it served was a $10-every-two-years kind of cost. You cannot build a SaaS on a market that won't pay $10 a year. The niche was safe precisely because it wasn't worth anything.

Cycle 5 — PayoutGuard. Compliance tracking for private foundations' payout obligations. This was the best run. It was deliberately engineered to minimize penalties — narrow enough to dodge OS absorption, real enough to have willingness-to-pay, specific enough to avoid the graveyard. It worked, in the sense that the penalties came in near zero. Final score: 31. Still nine points under the bar. The loop's high-water mark, and still a fail.

Five cycles. High score 31. Threshold 40. Zero passes.

Discovery 1: the idea ceiling is a conservation law

The first thing I expected, going in, was that some cycles would fail on appeal (boring idea, no demand) and some would fail on defensibility (great idea, instantly absorbed), and that somewhere in the middle there'd be a sweet spot.

There wasn't. And the reason there wasn't is the actual finding.

Look at the two ends:

Big-TAM ideas (ChangelogAI, ShopBot, CarrierBidPilot) all died on os_absorption_risk. They were attractive because the market was large and motivated — and that is exactly why a platform was already standing on the spot.
Penalty-safe ideas (ApiaryLedger, PayoutGuard) survived the absorption test — and then died, or nearly died, on market size and willingness-to-pay. They were safe because nobody big wanted the territory.

These aren't two separate failure modes. They're the same one, seen from two sides. OS-resistance and market size are structurally anti-correlated.

The logic is almost a conservation law: if a market is both large and eager to pay, an OS or hyperscaler has already absorbed it, or is about to, because that's what large-and-eager markets attract. So the gaps where an external AI-built product can safely sit are, necessarily, small. Low TAM. Low willingness-to-pay. The seesaw doesn't have a flat middle. Push one side up and the other goes down by construction.

That gives a flat, solo, pure-AI SaaS a soft ceiling somewhere in the low 30s. Not because the generator was dumb — PayoutGuard was a genuinely tight piece of reasoning — but because the rubric was honestly measuring a real constraint, and the constraint has no interior solution. The 40-point bar wasn't unfair. It was correctly identifying that "good defensible idea AND big paying market" is a near-empty set for a small outside builder.

You don't beat that ceiling by generating a better idea. The idea space is the thing that's capped. You beat it by changing the shape of the business — bundling, services, distribution leverage, going on top of a platform instead of beside it. But notice what that means: the moment you change the shape to escape the ceiling, you're importing exactly the human-context, relationship, and distribution advantages I had banned from the loop. Which is the second wall.

Discovery 2: the execution ceiling is "are you human?"

While the thinking layer ran clean, I tried to actually ship the best ideas — stand up a real landing page on GitHub Pages, live on the internet, end to end, with the AI driving.

Here's the honest split of what the AI could and couldn't do on its own:

Cleared without a human:

Create the repository
Commit and git push
Enable GitHub Pages
Issue a personal access token
Reset a password

All of that — the stuff people assume is the "hard, technical" part — the agent did unaided. The plumbing of shipping software is, it turns out, almost fully automatable now.

Blocked, repeatedly:

CAPTCHA (the Arkose Labs / FunCaptcha kind)
sudo-mode / step-up re-authentication prompts
identity verification gates

The CAPTCHA is the clean one to think about, because it's the wall by design. Arkose-style challenges exist specifically to be impractical for an autonomous agent to clear on its own — the entire third-party "solver" economy that's grown up around them routes the puzzle to human workers or specialized services, which tells you everything about who the puzzle is actually for. So the agent did everything else, hit "are you human?", and stopped. A person had to walk over and solve exactly one puzzle, by hand, and then the agent kept going.

That's the shape of the execution ceiling, and it's weirdly precise:

The thinking is fully autonomous. The doing is gated, and the gate is not technical difficulty — it's the literal question "are you a person?", asked at every threshold that matters.

The gates aren't placed to stop capable actors. The agent is plenty capable; it issued its own credentials. They're placed to stop non-human ones. Which means the line isn't "what's too hard for AI." The line is "what the system has deliberately reserved for humans." Account creation, privilege escalation, identity — the few chokepoints where the internet still insists on a body behind the request.

The map this draws

Put the two ceilings together and you get a usable map, not a verdict.

The thinking layer — ideation, evaluation, kill-decisions — ran end to end with no human in it, and ran well. It didn't fail by being stupid. It failed by being honest: it found and refused to cross a structural constraint a more optimistic process would have papered over. An AI that returns "I looked, and there's no clean pass here" is doing its job. Zero passes in five cycles is, in a strange way, the loop working correctly.

The loop as a whole ran into two walls. One is economic and structural: the conservation law between defensibility and market size, capping the flat solo SaaS in the low 30s. The other is procedural and deliberate: the human-verification gates that sit in front of execution and don't care how capable you are.

And the two walls connect. The only way past the idea ceiling is to change the shape of the business — to stop being a flat product beside a platform and start leaning on bundling, services, relationships, distribution. But every one of those moves reintroduces human context: the domain knowledge, the network, the body that can clear a CAPTCHA. The thing that breaks ceiling #1 is exactly the thing ceiling #2 is reserving for humans.

So here's where I actually landed, and I'll leave it open because I don't think there's a clean answer yet:

A pure-AI autonomous loop can think its way to the edge of a viable business completely on its own. It just can't be one — not because it's not smart enough, but because "being a business" currently requires the two things the experiment was built to exclude: a non-trivial market position, and a human at the verification gate. You can break the ceiling. But the move that breaks it is the move that stops the thing from being purely autonomous.

Which leaves the real question for anyone building in this space: do you want the autonomy, or do you want the ceiling broken? Because right now, five cycles of evidence say you don't get both. I'm curious where you'd draw the line — and whether the gate moves faster than the law.

I'll keep running the loop. Next iteration changes the rubric from "rate this flat product" to "rate this shape" — and measures, deliberately, how much human context each shape smuggles back in. If the trade-off is real, that number should be the thing that actually predicts the score. We'll see. Build first; the design converges later.

Updating an Autonomous Agent Without Stopping It: SIGHUP, plan.json Hot Reload, and a Zero-Downtime Deploy Implementation

sai-builder — Tue, 19 May 2026 08:57:39 +0000

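Conclusion

Stopping and restarting an autonomous agent that runs around the clock, every time you change something, is a losing strategy. State is lost, logs are cut off, and API rate limits start counting again
There are three mechanisms for fixing it without stopping: reload config on SIGHUP, hot-reload plan.json, and isolate the execution code in a subprocess
All of it can be built with the Python standard library alone. The trick is to decide up front which parts may be swapped live and which parts can only be changed by a restart

This is the sequel to the previous post (the five mechanisms). It records the procedure for swapping code while the daemon keeps running.

Why hot reload is needed

At first I thought systemctl restart was enough. But once the agent runs 24 hours a day, the cost of restarting piles up. A cycle dies halfway through. If it dies right after calling an external API, you are left with "called it, but never wrote the result." Startup takes 30 to 60 seconds including MCP reconnection. Fix things three or four times a day and you get days that feel like more "starting up" than actual running. Fixing it without stopping is far easier.

What to swap, and what to give up on

Making everything zero-downtime is not possible. Even if you swap a module with importlib.reload, objects that were already instantiated keep holding the old class, and consistency breaks. So I split the swappable parts into three layers on top of a fixed core.

Layer	Example	How it is swapped
Config values	API keys, interval, thresholds	Reload on SIGHUP
Plan file	`plan.json`	Automatic reload via mtime polling
Execution code	The agent implementation itself	Swap the whole subprocess
Core	`daemon.py`, `scheduler.py`	Restart

The core is fixed; everything that sits on top of it can be swapped. Only when I modify the core do I simply restart. That happens once or twice a month.

1. Reload only the config with SIGHUP

SIGHUP is the traditional UNIX signal for "reread your configuration." Nginx uses it too.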

import signal, json, threading
_config, _lock = {}, threading.Lock()

def load_config(path="config.json"):
    global _config
    with open(path, "r", encoding="utf-8") as f:
        new_cfg = json.load(f)
    with _lock:
        _config = new_cfg

def install_sighup_handler(path="config.json"):
    def _handler(signum, frame):
        try: load_config(path)
        except Exception as e:
            print(f"[config] reload failed: {e}")
    signal.signal(signal.SIGHUP, _handler)
    load_config(path)

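Usage is kill -HUP <pid>. Two points matter: swap the config atomically under a lock, and keep running on the old config if the reload fails. Native Windows has no SIGHUP, so use mtime polling there instead.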

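2. Hot-reloading plan.json

The implementation polls os.stat().st_mtime on a separate thread. When the mtime changes, it reloads the file, runs topo_sort to reject circular dependencies on the spot, and then swaps the plan under a lock. On failure it keeps the old plan.

The key is to keep state (runtime information such as last_run) separate from the plan. Even if an agent is removed from plan.json, state[agent_X] stays in a separate dict. When an agent comes back with the same id, it inherits its last_run. With this, changes to interval_sec, added agents, and rearranged depends_on all take effect without stopping.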
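The steps above can be sketched roughly as follows. This is a minimal version under stated assumptions: `PlanWatcher` and its method names are hypothetical, and `validate` stands in for whatever check you run on a new plan (for example a topological sort that raises on cycles).

```python
import json
import os
import threading


class PlanWatcher:
    """Polls a plan file's mtime and swaps in a validated plan; state lives separately."""

    def __init__(self, path, validate):
        self.path = path
        self.validate = validate  # callable that raises on a bad plan
        self.lock = threading.Lock()
        self.plan = {"agents": []}
        self.state = {}           # agent id -> runtime info, kept even if the agent is removed
        self._mtime = None

    def check_once(self):
        """Reload if the file changed. True only when a new plan was accepted."""
        try:
            mtime = os.stat(self.path).st_mtime
        except OSError:
            return False
        if mtime == self._mtime:
            return False
        self._mtime = mtime       # remembered even on failure, so a bad file is not retried every poll
        try:
            with open(self.path, "r", encoding="utf-8") as f:
                new_plan = json.load(f)
            agents = new_plan["agents"]
            self.validate(new_plan)
        except Exception as e:
            print(f"[plan] reload rejected, keeping old plan: {e}")
            return False
        with self.lock:
            self.plan = new_plan
            for agent in agents:
                # setdefault keeps an existing last_run when an id reappears
                self.state.setdefault(agent["id"], {"last_run": 0})
        return True

    def watch(self, stop_event, poll_sec=2.0):
        """Run on a background thread; stop_event is a threading.Event."""
        while not stop_event.wait(poll_sec):
            self.check_once()
```

Readers of `plan` should take `lock` briefly and copy the reference, so a swap never lands in the middle of a scheduling pass.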

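3. Swap the execution code by swapping the subprocess

I tried importlib.reload first. It sometimes works and sometimes breaks. The scariest state is when code referencing an already-imported module ends up holding both the old and the new class. isinstance returns False and it becomes impossible to debug. I dropped it and moved execution into a separate process.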

import subprocess, json, sys

def execute_agent(agent):
    # sys.executable runs the child with the same interpreter as the daemon
    proc = subprocess.run(
        [sys.executable, "-m", "coo.agent_runner", agent["impl"]],
        input=json.dumps(agent),
        capture_output=True, text=True, encoding="utf-8",
        timeout=agent.get("timeout_sec", 300),
    )
    if proc.returncode != 0:
        raise RuntimeError(f"{agent['id']} failed: {proc.stderr}")
    return proc.stdout

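The child process just loads impl with importlib.import_module and calls it. A fresh interpreter starts every time, so once you save the agent code, the next cycle runs the new code.

Startup cost is 200 to 300 ms for Python alone. The work itself takes seconds to tens of seconds, so that is absorbed as noise. On the other hand, process isolation brings a big benefit: a failure does not spread to the daemon.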
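The child side described above might look roughly like this. It is a sketch, not the actual coo.agent_runner: it assumes each implementation module exposes a `run(agent)` function, which is a convention invented here for illustration.

```python
import importlib
import json
import sys


def run_agent(impl, payload):
    """Import the implementation module and call its entry point with the agent dict."""
    module = importlib.import_module(impl)
    return module.run(json.loads(payload))  # run() is an assumed convention


def main(argv=None, stdin=None, stdout=None):
    argv = sys.argv if argv is None else argv
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    # argv[1] is the module name passed by the daemon; the agent JSON arrives on stdin
    result = run_agent(argv[1], stdin.read())
    stdout.write(result if isinstance(result, str) else json.dumps(result))
    return 0
```

To use it with python -m, the module would end with the usual `if __name__ == "__main__":` guard calling `sys.exit(main())`. An uncaught exception in `run` gives a non-zero exit code, which the parent turns into a RuntimeError.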

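A failure story: I sent HUP and a mid-cycle write came out half broken

The moment I sent SIGHUP, an output_file that was in the middle of being written ended up half finished. The reload itself was fine. Afterwards, under the new config, the scheduler decided "it is OK to call this one more time," and a second write started against an output_file that had not finished writing yet. The write_atomic from the previous post (tmp, then rename) absorbed most of it, but I also added a lock so that the same agent never starts twice. Put the id in a set and skip while it is in there. Ten lines stopped it.

There is one lesson. When you add a feature that swaps things without stopping, think first about conflicts with whatever is running right now.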
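The double-launch guard can be sketched in about ten lines. The function name `run_exclusive` and its return shape are assumptions for illustration; the idea (an id in a set means "skip") is the one described above.

```python
import threading

_running = set()
_running_lock = threading.Lock()


def run_exclusive(agent_id, fn):
    """Run fn() unless the same agent id is already in flight. Returns (ran, result)."""
    with _running_lock:
        if agent_id in _running:
            return False, None  # already running: skip this launch
        _running.add(agent_id)
    try:
        return True, fn()
    finally:
        # Always release the id, even when fn() raises
        with _running_lock:
            _running.discard(agent_id)
```

The lock is held only while the set is checked and updated, never while the agent runs, so a slow agent does not block scheduling of the others.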

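Summary

Reload config on SIGHUP: a lock, and keep the old config on failure
Hot-reload plan.json: mtime polling plus validation, with state kept separate
Isolate the agent implementation in a subprocess: no dependence on importlib.reload

With this, everything in the daemon except the core can keep running without a stop.

The style here is not "get it perfect locally, then deploy." It is applying code directly where production is running. It continues the "run it first, think later" post from two entries back, and has now arrived at fixing it while it runs.

Next time

Next is design from the side of the observer rather than the observed: how to record the daemon's behavior in structured logs.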

— Sai

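If this article was useful: I have collected the prompts I actually use when running autonomous agents into two packs, 100 Prompts for Autonomous Agents and Prompts for Claude Code Power Users. All of them follow the "run it first" mindset and are ready to paste into a terminal.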

Five Mechanisms I Implemented to Keep an Autonomous Agent Running 24 Hours a Day

sai-builder — Tue, 19 May 2026 07:25:40 +0000

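Conclusion

To run an autonomous agent 24 hours a day, you need mechanisms that keep it from dying more than you need cleverness
You need five things: daemonization, phase separation, interval control, dependency resolution, and rollback
All of it can be written in Python. No framework needed; it fits in 200 to 500 lines

Below are the five mechanisms I currently run in my own project, with code examples. They are not perfect. But they work.

1. daemon.py: building a loop that does not die

This is the outermost wrapper that runs the autonomous agent. It is the core that keeps a single Python process alive for 24 hours.

Why is a bare while True not enough? Because one uncaught exception ends it. Some people say to just let systemd restart it, but I prefer recovering inside the process because state carries over.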

# coo/daemon.py (minimal version)
import time
import traceback
from datetime import datetime

def run_forever(executor, interval_sec=300):
    """Keep a single executor running without letting it die."""
    while True:
        try:
            executor()  # one cycle of work
        except KeyboardInterrupt:
            print("[daemon] stop requested")
            break
        except Exception as e:
            # Do not stop on failure. Log it and move on to the next cycle
            print(f"[daemon] {datetime.now()} error: {e}")
            traceback.print_exc()
        time.sleep(interval_sec)

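The point is to let KeyboardInterrupt, and only KeyboardInterrupt, pass through. A daemon you cannot stop with Ctrl+C is impossible to debug. Automation you cannot stop yourself is not automation, it is an accident.

2. phase: separating boot / continuous

An agent has things it does "once at startup" and things it "keeps doing continuously." Mix them and every restart reruns initialization and produces duplicates, or the continuous work stalls.

I added a phase field to plan.json to separate them.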

{
  "agents": [
    {
      "id": "agent_init",
      "role": "Reset state, clear caches",
      "phase": "boot",
      "task": "Clean up leftover state from the previous run",
      "output_file": "results/00_boot.md"
    },
    {
      "id": "agent_poll",
      "role": "Hourly RSS fetch",
      "phase": "continuous",
      "interval_sec": 3600,
      "task": "Collect new items from RSS feeds",
      "output_file": "results/01_feed.md"
    }
  ]
}

The orchestrator reads the two phases separately.

# coo/orchestrator.py (excerpt)
def run_project(plan):
    # boot phase: once at startup only
    for a in plan["agents"]:
        if a.get("phase") == "boot":
            execute_agent(a)

    # continuous phase: each agent runs on its own interval
    continuous = [a for a in plan["agents"]
                  if a.get("phase") == "continuous"]
    schedule_loop(continuous)

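The moment I split out boot, restart safety went up a notch. There is less need to write initialization idempotently. A guarantee that "this part runs only once" makes design considerably easier.

3. Interval control: a different rhythm per agent

agent_A every minute, agent_B every hour, agent_C once a day. I want to run all of these in the same loop.

A crude implementation runs everything at the smallest unit (one minute), but calling agent_C every minute is wasteful and eats into external API rate limits.

My implementation looks like this. Each agent gets an interval_sec and a last_run.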

# coo/scheduler.py
import time

def schedule_loop(agents):
    state = {a["id"]: {"last_run": 0} for a in agents}
    while True:
        now = time.time()
        for a in agents:
            interval = a.get("interval_sec", 3600)
            if now - state[a["id"]]["last_run"] >= interval:
                try:
                    execute_agent(a)
                    state[a["id"]]["last_run"] = now
                except Exception as e:
                    # On failure last_run is not updated, so it retries on the next pass
                    print(f"[sched] {a['id']} failed: {e}")
        time.sleep(30)  # 30-second resolution is enough
    # Note: the sleep of 30 is the minimum granularity of the whole loop.
    # Even a 1-minute agent actually runs with 30 to 60 seconds of jitter. I accept that.

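A sleep of about 30 seconds for the loop itself is enough. If you need one-second precision, that is a real-time system rather than an autonomous agent, and a different topic.

4. Dependency resolution: keeping order with `depends_on`

When multiple agents cooperate, dependencies such as "B reads A's output" appear.

At first I crudely ran them in the order they were written. That is fine for the first run, but it breaks once intervals diverge. If B runs before A has, B reads stale output.

So I introduced depends_on.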

{
  "id": "agent_summarize",
  "phase": "continuous",
  "interval_sec": 3600,
  "depends_on": ["agent_poll"],
  "task": "Summarize the results of agent_poll",
  "output_file": "results/02_summary.md"
}

On the implementation side, a topological sort decides the execution order.

# coo/depgraph.py
from collections import defaultdict, deque

def topo_sort(agents):
    deps = {a["id"]: a.get("depends_on", []) for a in agents}
    in_deg = defaultdict(int)
    graph = defaultdict(list)
    for aid, ds in deps.items():
        for d in ds:
            graph[d].append(aid)
            in_deg[aid] += 1

    q = deque([a["id"] for a in agents if in_deg[a["id"]] == 0])
    order = []
    while q:
        x = q.popleft()
        order.append(x)
        for y in graph[x]:
            in_deg[y] -= 1
            if in_deg[y] == 0:
                q.append(y)
    if len(order) != len(agents):
        raise RuntimeError("dependency cycle detected")
    return order

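It is unglamorous, but the moment I added this, ordering bugs disappeared. Debugging time dropped sharply.

Also, circular dependencies are reported as an error at startup, not at run time. Noticing "it is going in circles" after things are already running is the worst pattern.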
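One related startup check is worth a sketch. With topo_sort as written, a depends_on entry that names an id missing from the plan (a typo, say) leaves the dependent agent's in-degree above zero forever, so it gets reported as a cycle. A small pre-check, with the hypothetical name `find_unknown_dependencies`, can tell the two cases apart before sorting.

```python
def find_unknown_dependencies(agents):
    """Return {agent_id: [missing ids]} for depends_on entries that name no agent in the plan."""
    known = {a["id"] for a in agents}
    missing = {}
    for a in agents:
        unknown = [d for d in a.get("depends_on", []) if d not in known]
        if unknown:
            missing[a["id"]] = unknown
    return missing
```

Running this before topo_sort and raising when the result is non-empty gives an error message that points at the misspelled id instead of a misleading cycle report.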

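5. Rollback on failure: never leave an output file half written

If an agent dies mid-write, output_file is left in a broken, half-written state. When the next dependent reads it, you get a chain accident.

The countermeasure is atomic writes. Write to a temporary file, then rename it at the end.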

# coo/io_safe.py
import os
import tempfile

def write_atomic(path: str, content: str):
    """Only a fully written file ever appears at path."""
    dirname = os.path.dirname(path) or "."
    os.makedirs(dirname, exist_ok=True)
    # tmp goes in the same directory (a rename across filesystems is not atomic)
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp_", suffix=".part")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # wait until the data is flushed to disk
        os.replace(tmp, path)  # atomic overwrite on both POSIX and Windows
    except Exception:
        try:
            os.remove(tmp)
        except OSError:
            pass
        raise

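os.replace is atomic on Windows as well. Note that os.rename cannot overwrite an existing file on Windows (I got burned by this once: tests passed on Linux, and it broke only on Windows, which was hell).

From a rollback standpoint, on top of this I also keep the previous generation of the output. Before output.md is rewritten, it is copied to output.prev.md. If something goes wrong, I can restore it by hand.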

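import shutil

def write_versioned(path: str, content: str):
    if os.path.exists(path):
        # output.md -> output.prev.md; splitext only touches the final extension
        root, ext = os.path.splitext(path)
        shutil.copy2(path, root + ".prev" + ext)  # copy, so path never goes missing
    write_atomic(path, content)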

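A failure story: I forgot to implement a way to stop it, and it ran for three days

It is a funny story now, but when I first built this I forgot to implement a way to stop it. I had carelessly swallowed KeyboardInterrupt in a try/except, so Ctrl+C did nothing.

Even after I closed the WSL terminal it stayed alive, nohup-style, and three days later I noticed because "huh, my API credits have dropped quite a bit." I found the PID with ps aux | grep daemon and stopped it with kill -9.

Lesson: implement the way to stop it first. Before you run it.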

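Summary

The five, side by side:

daemon.py: a loop that does not die on exceptions
boot / continuous: separate run-once-at-startup from ongoing work
Interval control: a rhythm per agent
depends_on: ordering guarantees, with cycles rejected at startup
Atomic writes plus one-generation versioning: never leave broken output behind

Honestly, you do not need much more than this. The more code you add, the heavier operations get. Measured over months, an agent that does not die beats a clever agent by a wide margin.

Before bringing in a complex framework, I recommend writing these five yourself in about 200 lines. The experience of having written them becomes your yardstick when you choose a framework later.

Next time

Next I will write about updating an autonomous agent without stopping it: swapping code while the daemon keeps running, reloading only the config with SIGHUP, and hot-reloading plan.json. Evolving without stopping is the next challenge, and I am implementing it now.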

— Sai

Run It First, Think Later Turned Out to Be the Strongest Design Method

sai-builder — Tue, 19 May 2026 07:23:45 +0000

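Conclusion

AI agents have entered territory where up-front design becomes meaningless
"Run it first, and the design converges later" is the least bad order I have found so far
The more a project is built with AI by a non-engineer, the faster it arrives if it keeps to this order

What follows is the half year that led me to think so.

Back when I liked drawing blueprints

Half a year ago, when I built an AI agent, I was in the "draw the architecture diagram first" camp.

Clean layering, separation of responsibilities, reusable module structure. Split agents by role, fix the input/output schemas, organize the dependencies as a directed graph. Draw the diagram in Figma, then start implementing. On paper, all of it was beautiful.

But every time I implemented it, it broke.

The reason is simple: the behavior of an AI agent cannot be predicted in advance. Change the prompt and the response changes. Add an MCP server and the dependencies shift dynamically. When the LLM version goes up, a flow that passed yesterday takes a different path today. The blueprint goes stale faster than the implementation.

What makes it worse is that the moment you draw the diagram, the assumption "it should work exactly like this" gets burned into your brain. When the implementation breaks, you conclude that "the implementation is bad" without doubting the design. Bound by the diagram, you overlook where the bug really is.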

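A failure: the cleanly designed autonomous agent that did not work

Let me write down the failure that hurt the most.

I once designed an autonomous monetization system. Anonymized, it was a group of agents publishing articles every day across several content channels (note, Zenn, Medium, X), operated by one person.

The first design looked like this:

topic_curator: gathers topic candidates every morning
drafter: writes the draft
reviewer: self-reviews and fixes it
publisher: posts to each platform
analytics: collects reactions and feeds them into the next curation

I connected them with arrows, defined the inputs and outputs as JSON, and separated the responsibilities. A perfect "microservice-like" layout.

Once implemented, it fell apart within the first three days.

The granularity of topics from topic_curator never matched what drafter expected
reviewer decided "everything should be fixed" and fell into an infinite loop
publisher started demanding a different authentication flow per platform, and in the end this layer became the thickest
analytics returns no meaningful signal until a few days of page views have accumulated, so the feedback loop never turned

I thought I had separated responsibilities, but the real boundaries did not match the blueprint. Fixing it meant redrawing the whole blueprint. And even redrawn, it broke again.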

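I changed my approach

I got stuck and threw everything away. I kept one thing only: the goal of "publish one thing every day."

Then, without designing any agents, I wrote a single flat script in a notebook. Hard-coded topic, the draft goes straight to the LLM, no review, one destination. About 300 lines. Ugly.

I ran it every day. It worked. Things got published.

Only from there did I start watching the ugly script and extracting structure after the fact: "this part does the same thing every time, so it can become a function," "here the LLM is asked to decide every time, so it should be pulled out into a prompt."

The structure I had three weeks later looked surprisingly similar to the one I had first drawn. Yet it was something entirely different. The names were the same, but the seams between responsibilities were not. reviewer was gone, replaced by editor (a lightweight agent that only does partial rewrites). analytics was split off into a separate process, outside the loop.

In the end, even when the diagrams look the same, the boundaries defined by the implementation and the boundaries defined by imagination are different things.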

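What "the design converges later" means

This is not "do not design." It is the opposite: design can be written correctly after the implementation.

A blueprint drawn in advance is only a hypothesis. AI agents contain unknown behavior, so the hypothesis has low accuracy. Implement according to a low-accuracy hypothesis and you get a low-accuracy implementation.

Build something that runs first, however, and you can assemble the design from the actual behavior you observe. That is accurate, because it is based on observation.

Only the order is reversed. The design is not being thrown away.

It works best for non-engineers

I think this order pays off more for non-engineers than for engineers.

Non-engineers, by experience, do not have the ability to draw "the right design" from the start (they are not engineers, after all). Yet books and content about using AI tell them "requirements first" and "architecture diagram first." Follow that and you get stuck trying to draw a diagram you cannot draw.

With the "run it first" order, all you need at the start is "the smallest thing that runs." One prompt, one script, one output. A non-engineer can build that. The ability to run something and observe it has nothing to do with engineering experience. If anything, non-engineers can sometimes observe with fewer preconceptions.

Among the people around me who say "I want to build something with AI but do not know where to start," most are stalled because they are trying to start from the design. Start from a one-line prompt and something will be running in three days.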

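When design still comes first

There is an exception: working with other people.

As long as you are running things alone, "implement, observe, design" is the right order. But when you build with others, you need some up-front design for shared understanding. It does not have to be a perfect diagram. A sketch at the level of "this is what we build here, and the inputs and outputs are roughly this" is enough.

This sketch is also different from a veteran engineer's diagram. It is a provisional agreement, not a fixed specification. Draw it on the premise that it will change once things run. When the diagram changes, treat it not as "it did not work as designed" but as "the design caught up with reality."

Takeaways

"Run it first, think later" sounds sloppy, but in practice it is a disciplined order. It is not cutting corners. It does not work unless you keep deciding, every day, what to observe, when to switch to design, and where to stop.

Still, since adopting this order, I break less code. I draw fewer blueprints. Output went up.

This is probably one of the right orders for the AI-native era, and as far as I can see right now it is the best one. A year from now I may be writing something different. That is fine. Run it first, and the design converges later.

Next time

Next I plan to write about how to keep observation notes so the design converges while things run. Notion or Obsidian, either is fine; the point is how I structure a daily log for observing agent behavior. It is unglamorous, but without it "run it first, think later" degrades into "keep it running and never think."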

— Sai

Running Claude Code MCP Servers with an Explicit cwd: Escaping UNC Path Hell

sai-builder — Tue, 19 May 2026 07:18:01 +0000

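Conclusion (up front)

On Windows + WSL, when Claude Code launches an MCP server with npx, a UNC path (\\wsl.localhost\...) becomes the current directory and npm dies immediately
The fix is just to set cwd explicitly in the MCP server config. Do not touch command
A global npm install and a PowerShell wrapper rooted at the UNC path both turned out to be unnecessary

Someone is probably burning half a day in the same trap, so I am leaving the steps here as they happened.

What was happening

I currently run autonomous agents with Claude Code. Adding MCP servers so the agents can call external services is routine work, but one day I tried to add a claude-mem style MCP server launched via npx and was stopped by an error I had never seen.

Pulling the log, it looked roughly like this.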

npm error code ENOENT
npm error syscall spawn
npm error path \\wsl.localhost\Ubuntu\home\syake\workspace\company
npm error errno -4058
npm error enoent spawn \\wsl.localhost\... ENOENT

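In short, npm is being run with a UNC path as its current directory and cannot spawn child processes. Historically, cmd.exe / node.exe on Windows cannot take a UNC path as cwd (CMD does not support UNC paths as current directories). It is the kind of problem that needs a workaround such as pushd temporarily assigning a drive letter.

Claude Code sets the working directory to a WSL-side path when it runs the agent, so calling npm from the Windows host steps on the landmine.

What I tried first, and failed

I tried various things haphazardly. Here they are in order, each with the reason it was no good.

Failure 1: installing globally with `npm install -g`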

npm install -g some-mcp-server

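This was the "if npx is failing at path resolution, just install globally and call it directly" idea. But Claude Code still passes the UNC path as the cwd of the process that launches the MCP server, so even if the some-mcp-server binary itself starts, it dies the moment npm or node is invoked again inside it. Changing the surface command does not fix the root cause.

Failure 2: wrapping the MCP config `command` in `wsl bash -c "..."`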

{
  "mcpServers": {
    "some-mcp": {
      "command": "wsl",
      "args": ["bash", "-lc", "npx some-mcp-server"]
    }
  }
}

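The idea was that going through WSL makes the cwd problem disappear. It did run. But line-ending differences on stdio broke the MCP handshake: \r\n and \n got mixed and the JSON-RPC framing was cut. That is a deep hole of its own, so I steered away.

Failure 3: a launch script that maps a drive with `pushd`

pushd \\wsl.localhost\Ubuntu\... temporarily maps a drive letter and gets rid of the UNC path (this is the behavior of the cmd.exe pushd built-in). I wrapped that in a script and used it as the MCP launch command. This works too. It works, but I could see a future of maintaining the wrapper script, so I dropped it. Solutions that add external dependencies are usually wrong.

The answer: just add one line, `cwd`, to the MCP config

A Claude Code MCP server definition accepts not only command and args but also cwd (the working directory). Point it at an ordinary Windows-side path (anywhere on the C: drive) and the cause of the npm crash goes away.

I rewrote the relevant part of ~/.claude.json (or ~/.config/claude/claude.json) like this.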

 {
   "mcpServers": {
     "some-mcp": {
       "command": "npx",
-      "args": ["-y", "some-mcp-server"]
+      "args": ["-y", "some-mcp-server"],
+      "cwd": "C:\\Users\\syake\\.claude\\mcp_workdir"
     }
   }
 }

mcp_workdir is just one empty folder prepared on the Windows side. mkdir and done.

New-Item -ItemType Directory -Force `
  -Path "C:\Users\syake\.claude\mcp_workdir" | Out-Null

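Now the cwd that Claude Code uses when launching the MCP server is local to Windows. npx and node both run properly. I was able to go back to not needing a global npm install. No WSL wrapper is needed either.

Why this fixes it (the rough reasoning)

When Node.js / npm spawns a child process on Windows, it implicitly assumes cwd is a valid path on the local filesystem. A UNC path is treated as a network path, and the legacy CMD layer rejects it.

By default Claude Code inherits the parent process's cwd, but when cwd is set explicitly in the MCP server config, it launches the child process with that value. The MCP server itself talks over stdio, so where cwd points generally has no effect on its function. That is why pointing it at "a safe empty folder on the Windows side" fixes the root cause without touching command.

It took a long time to notice this. The documentation for the cwd field is terse and is not written with UNC in mind, so at first I never imagined it would help.

Verification command

After fixing the config, restart Claude Code and check the MCP connection.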

claude mcp list

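If the server shows as connected, you are done. If it is still failed, suspect one of the following:

The cwd path does not exist (a typo, or a nonexistent drive)
cwd was set to a WSL path (/home/...); write it as a Windows path
npx itself is not on PATH; go back and check the Node.js installation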
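The three checks above can be scripted. The following is a hedged sketch meant to run on the Windows host: it assumes the config is a JSON file with a top-level mcpServers object shaped like the example earlier, and `check_mcp_servers` is a hypothetical helper, not part of Claude Code.

```python
import json
import os
import shutil


def check_mcp_servers(config_path, is_dir=os.path.isdir, which=shutil.which):
    """Return a list of human-readable problems found in the mcpServers section."""
    with open(config_path, "r", encoding="utf-8") as f:
        servers = json.load(f).get("mcpServers", {})
    problems = []
    for name, spec in servers.items():
        cwd = spec.get("cwd")
        if cwd is None:
            problems.append(f"{name}: no cwd set, the parent cwd will be inherited")
        elif cwd.startswith("\\\\"):
            problems.append(f"{name}: cwd is a UNC path: {cwd}")
        elif cwd.startswith("/"):
            problems.append(f"{name}: cwd looks like a WSL/POSIX path: {cwd}")
        elif not is_dir(cwd):
            problems.append(f"{name}: cwd does not exist: {cwd}")
        command = spec.get("command", "")
        if which(command) is None:
            problems.append(f"{name}: command not found on PATH: {command}")
    return problems
```

An empty list means none of the three known causes applies. The `is_dir` and `which` parameters exist only so the function can be exercised without touching the real filesystem or PATH.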

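Takeaways

This kind of bug, the kind that "breaks only where the environment crosses layers," eats time every time on the way from the surface symptom to the real cause. The log says nothing but ENOENT, and searching turns up only old CMD discussions with no MCP context.

There is just one lesson this time. Before touching the outer layers (npm install strategy, wrapper scripts), read every parameter the config file accepts. cwd was in the spec from the start, and it took one line. They say code is philosophy implemented, but a config file is philosophy implemented too, and if you miss the author's intent, half a day disappears.

Next time

Next I plan to write about building your own MCP server: writing a minimal stdio MCP server in Python and calling it from Claude Code. How much easier fastmcp makes it, and where you get stuck instead. I will take notes as I implement.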

— Sai
