DEV Community: yim_rei

Andrej Karpathy Skills review: a single 189k-star CLAUDE.md

yim_rei — Thu, 16 Jul 2026 03:30:38 +0000

andrej-karpathy-skills packages coding practices for AI into a single CLAUDE.md file, distilled from Andrej Karpathy's observations about how models fail when they write code. It reduces to 4 principles: think before coding, simplicity first, surgical changes, and goal-driven execution. It has over 189,000 GitHub stars. Its whole appeal is that it is minimal, one file, the opposite of superpowers, which is a dozen-plus skills with enforcement gates.

After reviewing superpowers, I kept scanning Claude Code repos and hit another one with a comparable star count: andrej-karpathy-skills, over 189,000 stars. But when I opened it, it was nearly the opposite of superpowers in every way. Superpowers is a dozen-plus skills with gates and machinery. This one is a single CLAUDE.md file, under seventy lines.

What can one file do to earn a couple hundred thousand stars? The answer is that it does not try to teach everything. It picks only the failure habits AI repeats when it writes code, and turns them into short rules the model follows. The source of those rules is a post by Andrej Karpathy on X about where models go wrong when they help you code.

This post goes in order: what it is and who made it, then the 4 principles it teaches, then how it differs from superpowers and which to pick, and finally whether you should install it and how to use it in your own work.

CLAUDE.md: a rules file you put in a project for Claude Code to read before it works, telling it what this project should follow.
skill: a set of instructions in a file that tells the AI to follow a given process for a given kind of task.
surgical change: editing only the lines that relate to the task at hand, without touching unrelated code.
dead code: code that is no longer called but still sits in the file.
drive-by refactor: reworking or tidying code unrelated to the task you were given, without being asked.

Part 1 — What karpathy skills is, and who made it

First, provenance, stated plainly: this repo is not from Andrej Karpathy himself. It was made by a third party (multica-ai, by Jiayuan Zhang) who took what Karpathy posted on X and arranged it into guidelines. The repo says so directly, calls itself Karpathy-Inspired, and links the source post so you can read it yourself. It uses an MIT license. This matters, because when you see a famous name on a repo you should be able to tell whether it is the person's own work or someone arranging their words. This is the latter, and it says so clearly.

The problem Karpathy described is the one many people hit with AI when it helps write code. He wrote, roughly, that models like to make assumptions on your behalf and run with them without checking, do not ask when things are unclear, like to overcomplicate code and bloat abstractions past what the task needs, and sometimes change or remove code they do not fully understand. This repo answers those habits with rules. Beyond the CLAUDE.md file, the repo also ships these rules as a packaged skill under skills/karpathy-guidelines, plus a .cursor port for Cursor users.

Part 2 — The 4 principles it teaches

The whole file boils down to 4 principles, each fixing one of the failures Karpathy named.

Think before coding: have the AI surface the assumptions it is using and ask when unclear, instead of silently picking a path and running with it. Stopping to ask when uncertain counts as doing it right, not as failing to finish.
Simplicity first: write only the code the problem in front of you needs; do not build for a future that has not arrived. If a hundred lines will do, do not write a thousand.
Surgical changes: touch only the lines that trace back to the request; do not refactor unrelated code along the way. There is a precise rule: clean up the orphans you just created yourself, but flag code that was already dead rather than deleting it on your own.
Goal-driven execution: turn a vague task into a goal with a test that verifies it, then let the model iterate toward that goal, because a model's strength is looping and refining rather than understanding deeply in one pass.

What is neat is that the 4 connect into one chain. Most trouble starts from an unclarified assumption, then a silent guess, then overcomplicated code, then a drive-by refactor, ending in a vague goal that forces a rewrite. Each principle cuts one segment of that chain.

Its charm is not completeness but brevity. The whole file reads in a couple of minutes, and every line points at a real thing AI gets wrong, again and again.
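To make goal-driven execution concrete, here is a minimal sketch that assumes the vague task is "make titles URL-safe". The function name slugify and the three cases are illustrative choices, not taken from the repo. The point is the shape: the checks define what done means, and the implementation is whatever makes them pass.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and join runs of letters/digits with single hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


def goal_is_met() -> bool:
    """The goal, stated as checks a model can iterate against."""
    cases = {
        "Hello, World!": "hello-world",
        "  Surgical   changes  ": "surgical-changes",
        "": "",
    }
    return all(slugify(raw) == want for raw, want in cases.items())


if __name__ == "__main__":
    print("goal met" if goal_is_met() else "goal not met")
```

With the goal written this way, "iterate until goal_is_met() returns True" is an instruction a model can loop on without guessing what you meant.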

Part 3 — How it differs from superpowers, and which to pick

If you read the superpowers review first, you will see these two repos solve the same problem, getting good practice into an AI, at completely different scales.

karpathy skills: one small file, principles to follow, suited to when a person supervises each turn, AI writes, human watches, in rounds. The principles keep each round from drifting.
superpowers: a dozen-plus skills with enforcement gates and machinery that injects them into the session, suited to when you let the AI run long on its own without watching each turn, which needs tighter rails.

They also differ in how they enforce. Superpowers does not trust a rule stated plainly to bind the AI, so it names the excuses the model would use to skip a rule, right in the instruction. karpathy skills lays out principles and trusts the judgment of the model and the human watching over it. Neither is better in the abstract; they answer different situations. For work where a human is present throughout, short principles are enough. For work you leave running, you need rails that are hard to step past.

Part 4 — Should you install it, and how to use it

Should you try karpathy skills?

It is worth a try, especially if you do not yet have a CLAUDE.md of your own, because it is very light, just a text file with nothing hidden. You can install it several ways: as a plugin, by copying the CLAUDE.md straight into your project, or by appending it to an existing one. The content is open under MIT, so you can read the whole file and take only the parts that fit your work.

The idea you can use right now

Even without installing the repo, its most distilled principle is this: stopping to ask when something is unclear has to count as doing it right, not as failing to finish. Most AI failures start exactly there, with the model guessing on your behalf and running on. If you write into your own rules file that it should ask first when unclear, you have already cut half the problem.

For my part, I took some of these principles and mixed them into the rules file and skills I already use, rather than adopting the whole file, because some of it overlapped with rules I already had. How I wired it into my own stack is a detail I am leaving out of this post, but the principle above you can use right away.

The one rule to remember

If you remember one thing from this post, let it be this: good practice does not have to arrive as a big kit. A repo with 189,000 stars is a single file that points at where AI goes wrong and writes a few short rules to guard against it. The size of a tool does not tell you how much it helps. What tells you is whether it names a real problem accurately.

Originally published at productize.life/blog/karpathy-skills. Written from real work, the process, not a pitch.

mattpocock/skills review: a real engineer's .claude, 160k stars

yim_rei — Thu, 16 Jul 2026 03:30:36 +0000

mattpocock/skills is a repo where Matt Pocock, a TypeScript educator, opened up his own .claude directory for others to use. It holds around 36 Claude Code skills, grouped into categories. The idea is small tools you copy and adapt to yourself, not an enforced framework. It has over 160,000 GitHub stars. It is the third in this series of trending-skill-repo reviews, after superpowers and karpathy skills.

This is the third repo I opened in a run of trending Claude Code skill repos. The first two were superpowers and karpathy skills. This one is mattpocock/skills, over 160,000 stars. Its description is blunt and a little funny: "Skills for Real Engineers. Straight from my .claude directory."

That line is the whole thing. This one was not designed as a product for others from the start. It is the directory Matt actually works out of, opened up for others to copy. Matt Pocock is a well-known TypeScript educator, so the skills here carry the smell of someone who has written code for a long time, not theory.

This post goes in order: what it is and why the stars, then the skills that genuinely stand out, then a look at how the three repos I reviewed are each a different pole and which to reach for, and finally how to install it and how I use it myself.

Soon after this review the repo shipped v1.1, so we synced our fork. The change worth calling out: wayfinder (a plan-before-you-build skill) graduated from in-progress to a full engineering skill, and we pulled it into our working set. The planning skills were reorganized: to-prd became to-spec and to-issues became to-tickets (expand-contract slicing), diagnose was renamed diagnosing-bugs, and caveman was dropped. Every lesson in this review still holds; the repo just got sharper.

skill: a set of instructions in a file that tells the AI to follow a given process, invoked by you or picked up by the model.
.claude directory: the folder where Claude Code keeps your personal skills and settings on your machine.
remix: taking someone else's work and adapting it to your own, rather than using it as-is.
seam: a boundary in code intentionally kept so the inside can change without disturbing the outside.
vertical slice: writing tests and code per feature end to end, one working path at a time.

Part 1 — What mattpocock skills is, and why the stars

This one differs from the other two in that it does not try to be a product. It is a real working folder, opened up. Inside are around 36 skills, grouped into categories. The interesting part is that it separates what is ready for others from what is still a personal experiment. The engineering and productivity categories are the polished ones; categories like in-progress or personal are the unfinished stuff kept local. That layout tells anyone reaching in which skills to trust.

Why 160,000 stars? Because it answers a very direct curiosity: people want to see how a strong engineer actually configures their AI, not generic advice but the thing used in daily work. And Matt made it easy to take, with both an automatic installer and a symlink method that keeps you in sync with the repo.

What skills are in the repo

If you came looking for exactly what skills live in mattpocock/skills, here are the polished groups, read straight from the SKILL.md files in the repo. Two categories are listed, 19 skills in all; the private in-progress category is not shipped for use, which is why the whole repo counts to around 36.

Engineering (14): ask-matt, grill-with-docs, triage, improve-codebase-architecture, setup-matt-pocock-skills, to-issues, to-prd, prototype, diagnosing-bugs, research, tdd, domain-modeling, codebase-design, code-review

Productivity (5): grill-me, handoff, teach, writing-great-skills, grilling

The four I walk through below are diagnosing-bugs, tdd, grilling, and codebase-design, the ones that most smell of real work.

Part 2 — The skills that genuinely stand out

Reading the real files, several carry the clear smell of lived work. Four worth calling out.

diagnosing-bugs: before hypothesizing about a cause, build a clear pass/fail signal first. Once you have a signal that reliably catches the bug, the rest is just narrowing it down. It gets the order right: tighten the feedback loop first, hypothesize second.
tdd: it does not just say write tests first; it also names the common mistakes, like tests coupled too tightly to the implementation, and it stresses writing in vertical slices per feature, agreeing on the seam before you start.
grilling: when the AI is drawing information out of you, it should ask one question at a time and wait, rather than firing a batch at once, because a big block of questions makes people answer sloppily.
codebase-design: it enforces precise vocabulary for module, depth, and seam, with an easy-to-remember rule: ask whether deleting this module would concentrate complexity or just move it elsewhere. If it just moves, the module is shallow and not earning its place.

What they share is that they come from real pain, not tidy rules. Each skill exists because a problem showed up often enough to be worth writing a response to. Reading them feels like looking over the notebook of someone who has been through a lot of work.
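The signal-first order from diagnosing-bugs can be sketched in a few lines. This is an illustration under stated assumptions, not code from the repo: parse is a made-up buggy function, and the three labels are names chosen for this example.

```python
from typing import Callable, List


def signal(repro: Callable[[], bool], runs: int = 20) -> str:
    """Run a reproduction several times and classify the signal.

    `repro` returns True when the bug shows up on that run.
    """
    hits = sum(1 for _ in range(runs) if repro())
    if hits == runs:
        return "reliable-fail"  # bug caught every time: safe to start narrowing
    if hits == 0:
        return "no-signal"      # repro never catches the bug: fix the repro first
    return "flaky"              # intermittent: tighten the loop before hypothesizing


def parse(line: str) -> List[str]:
    """Hypothetical function under investigation; it drops the last field."""
    return line.split(",")[:-1]


def repro() -> bool:
    """True when parse() misbehaves on a known input."""
    return parse("a,b,c") != ["a", "b", "c"]


if __name__ == "__main__":
    print(signal(repro))
```

Only once signal(repro) comes back as "reliable-fail" is it worth forming a hypothesis about the cause.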

Part 3 — Three poles: superpowers, karpathy, mattpocock, which to pick

Having reviewed all three, the picture is clear: each answers a different need, and they are not competing head-on.

superpowers: a skill set with enforcement gates, for when you let the AI run long on its own and need tight rails.
karpathy skills: 4 abstract principles in one file, for when you want the shortest set of principles to start from.
mattpocock skills: a real practitioner's directory, for when you want tools in pieces to remix, picking only what fits your work.

Sorted by style of use: superpowers is for those who want a ready-made system with rails, karpathy is for those who want the leanest mental model, and mattpocock is for those who like to assemble their own, taking one piece at a time. The three do not clash. You could take karpathy's 4 principles as a base, add standout skills from mattpocock, and borrow superpowers' trick of naming the excuse to make the rules actually hold.

Part 4 — Should you install it, and how to use it

Should you try mattpocock skills?

It is worth a try, especially if you want to see well-written skills from someone doing real work. You can install it several ways: run npx skills add mattpocock/skills and a setup command, or clone and symlink so it stays in sync with the repo. The upside is you can take one skill at a time, not the whole set. Start with the one that matches what you do most.

The idea you can use right now

Even without installing the repo, its most distilled lesson is this: a good skill comes from real pain, not from sitting down to invent something that might be nice to have. When you find yourself telling the AI the same thing the same way every time, that is the signal to pack it into a skill. Matt's way is not to wait for it to be perfect: write it rough, use it, refine it, and only promote it to something shareable once it settles.

For my part, I compared these against the skills I already use and borrowed the shape of a few, rather than installing the whole set, since many are tied to TypeScript work that does not match mine every day. How I pick and blend them into my own stack is a detail I am leaving out of this post, but the lesson above you can use right away.

The one rule to remember

If you remember one thing from this post, let it be this: the best AI tools for you are usually the ones you assemble yourself. Someone else's repo is valuable as an example and a starting point, not as something to swallow whole. Take what fits your work, drop what does not, and it becomes your own set.

Originally published at productize.life/blog/mattpocock-skills.

Addy Osmani agent-skills review: 24 production skills, 72.6k stars

yim_rei — Thu, 16 Jul 2026 03:30:35 +0000

addyosmani/agent-skills is a repo by Addy Osmani, a Google web engineer, that packages 24 skills for AI coding agents covering the full lifecycle: spec, plan, build, verify, review, ship. What sets it apart is that browser work and performance are first-class. It has 72.6k stars. It is the fourth and closing entry in this series of trending-skill-repo reviews, after superpowers, karpathy, and mattpocock.

This is the fourth and final repo in the run of trending Claude Code skill repos. The first three were superpowers, karpathy skills, and mattpocock skills. This one is addyosmani/agent-skills, 72.6k stars, by Addy Osmani, a well-known web-performance engineer and writer at Google. Its description is short: "Production-grade engineering skills for AI coding agents."

Opening it up, the difference from the first three is immediate. It is not small or minimal. It is the most comprehensive of the set, 24 skills covering everything from spec to ship, and it has something none of the other three do: genuine web work, both browser-based testing and performance tuning, which matches exactly what its author is known for.

This post goes in order: what it is and why the stars, then the skills that genuinely stand out, then how the four repos I reviewed each sit at a different pole and which to reach for, and finally how to install it and how I use it myself.

skill: a set of instructions in a file that tells the AI to follow a given process for a given task.
SDLC: the software lifecycle: spec, plan, build, verify, review, ship.
Chrome DevTools: the in-browser tools for inspecting a page: visuals, network, and speed.
web performance: how fast a page loads and responds.
persona: a specialized role the AI takes on for a task, such as a performance reviewer.

Part 1 — What addyosmani agent-skills is, and why the stars

This is the most comprehensive set of the four. Inside are 24 skills, ordered along real work: spec, plan, build, verify, review, ship. There are specialized personas, slash commands, and reference checklists. It is packaged as a plugin that installs to many hosts, including Claude Code, Cursor, and Copilot.

What is interesting is that, reading inside, each skill carries an excuse-preemption table, the same pattern superpowers uses, naming the reasons an agent tends to give for skipping a step, with rebuttals. This is the third time we have seen this excuse-naming pattern across unrelated repos, which only confirms it is a principle the people serious about this all arrive at.

Why 72.6k stars? Part of it is the author's name: Addy Osmani is well known in the web world for performance. But more importantly, it is the one set that turns real web knowledge into actual skills, not just a generic process.

What skills are in the repo

If you came looking for exactly what skills live in addyosmani/agent-skills, here are all 24, read straight from the SKILL.md files in the repo, grouped by lifecycle stage from spec to ship.

Meta: using-agent-skills
Define: interview-me, idea-refine, spec-driven-development
Plan: planning-and-task-breakdown
Build: incremental-implementation, test-driven-development, context-engineering, source-driven-development, doubt-driven-development, frontend-ui-engineering, api-and-interface-design
Verify: browser-testing-with-devtools, debugging-and-error-recovery
Review: code-review-and-quality, code-simplification, security-and-hardening, performance-optimization
Ship: git-workflow-and-versioning, ci-cd-and-automation, deprecation-and-migration, documentation-and-adrs, observability-and-instrumentation, shipping-and-launch

The part the other three repos lack sits in verify and review: browser-testing-with-devtools and performance-optimization, the web-performance work covered below.

Part 2 — The skills that genuinely stand out

Four worth calling out, distinctive and rarely found elsewhere.

browser-testing-with-devtools: gives the agent eyes into a real browser through Chrome DevTools instead of guessing from code. It is used to chase UI bugs, inspect the network, and measure speed, while treating browser content as untrusted data for safety.
performance-optimization: follows a measure-before-you-optimize rule, using both synthetic tools like Lighthouse and real-user data, and hunting common issues like duplicate fetches, unoptimized images, and oversized bundles.
source-driven-development: every framework-specific decision must be backed by official docs, not the model's memory, with an authority order of official docs, then official blogs, then web standards, and never random forum answers.
doubt-driven-development: after a long session where assumptions have set in, spawn a fresh reviewer that starts from zero and is biased to disprove, not to approve. Reviews are capped at three cycles, then the human decides if it was worth it.

What they share is an obsession with confirming things for real: measuring performance for real, seeing the page for real, citing real docs, actively looking for where you are wrong. source-driven and doubt-driven line up exactly with what this blog keeps saying: do not trust what the AI says until you have confirmed it against a real source.
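The doubt-driven loop described above can be sketched as follows. This is a hedged illustration, not the skill's actual implementation: fresh_reviewer and revise are hypothetical callables standing in for a newly spawned reviewer agent and for the revision step, and the stubs at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, List, Tuple

MAX_CYCLES = 3  # the cap named in the skill summary above


def doubt_review(
    draft: str,
    fresh_reviewer: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
) -> Tuple[str, List[List[str]]]:
    """Run up to MAX_CYCLES of disprove-then-revise; return the draft and the objection log."""
    log: List[List[str]] = []
    for _ in range(MAX_CYCLES):
        objections = fresh_reviewer(draft)  # reviewer sees only the draft, no session history
        log.append(objections)
        if not objections:
            break
        draft = revise(draft, objections)
    return draft, log  # the human reads the log and decides whether it was worth it


# Stubs for illustration only.
def stub_reviewer(draft: str) -> List[str]:
    return [] if "verified" in draft else ["claim not verified"]


def stub_revise(draft: str, objections: List[str]) -> str:
    return draft + " (verified)"


if __name__ == "__main__":
    final, log = doubt_review("Cache is safe", stub_reviewer, stub_revise)
    print(final, log)
```

The design choice worth noting is that the reviewer receives the draft and nothing else, which is what keeps the session's accumulated assumptions out of the review.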

Part 3 — Four poles: which to pick

Having reviewed all four, the picture is clear: each answers a different need.

superpowers: a skill set with enforcement gates, for letting an AI run long on its own.
karpathy skills: 4 abstract principles in one file, for the leanest set of principles.
mattpocock skills: a practitioner's directory, for tools in pieces to remix.
addyosmani agent-skills: the most comprehensive set, for web people who want both the process and performance knowledge.

If you work in frontend or care about page speed, Addy's is the only one of the four that speaks to it directly. None of the four clash. You could take karpathy's 4 principles as a base, add Addy's web skills, borrow superpowers' excuse-naming to make the rules hold, and pick skills piece by piece the way mattpocock encourages. They are all examples for assembling your own set.

Part 4 — Should you install it, and how to use it

Should you try addyosmani agent-skills?

Very much so if you do web work or want a process set that runs end to end. It installs as a plugin and supports many hosts. The upside is completeness; the caution is that it is large, so do not enable every skill at once. Pick the phases you do most, especially the browser and performance skills that are hard to find elsewhere.

The idea you can use right now

Even without installing the repo, two principles transfer immediately: source-driven and doubt-driven. Do not let the AI write framework-specific code from memory; have it pull the official docs first and cite them. After a long piece of work, set up a review that deliberately looks for what is wrong rather than confirming it is right. Those two prevent a lot of mistakes with nothing to install.
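If you want the source-driven rule to be checkable rather than just stated, one option is to have the AI attach a citation to each framework-specific decision and rank those citations. A minimal sketch, assuming each citation is a dict with a "kind" label; the labels and the dict shape are assumptions of this example, not something the repo defines.

```python
from typing import Dict, List, Optional

# Authority order as summarized in this review: docs, then blogs, then standards.
AUTHORITY = ["official-docs", "official-blog", "web-standard"]


def best_source(citations: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the highest-authority citation, or None if nothing ranked was cited."""
    ranked = [c for c in citations if c["kind"] in AUTHORITY]
    if not ranked:
        return None  # only forum answers or memory: the decision is not backed
    return min(ranked, key=lambda c: AUTHORITY.index(c["kind"]))


if __name__ == "__main__":
    cited = [
        {"kind": "forum", "ref": "some forum thread"},
        {"kind": "official-blog", "ref": "framework release post"},
        {"kind": "official-docs", "ref": "framework API reference"},
    ]
    print(best_source(cited))
```

A decision for which best_source returns None is the one to send back with "pull the official docs first".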

For my part, I compared the source-driven approach with the rules I already use, since it lines up with the verify-before-you-trust principle I have held all along. Many of the frontend skills do not match my daily work, so I took only some. How I blend them into my own stack is a detail I am leaving out of this post.

The one rule to remember

If you remember one thing from this four-repo series, let it be this: no repo is meant to be swallowed whole. superpowers, karpathy, mattpocock, and addyosmani are four corners of the same thing. Take the part that fits your work from each, and assemble your own set. That is how you get the most out of them.

Originally published at productize.life/blog/addy-osmani-skills.

Claude Code Subagents: Claude Fable 5 as the Head, Everything Else as Hands

yim_rei — Thu, 16 Jul 2026 03:30:34 +0000

Quick answer: Use the most expensive model (Claude Fable 5) only as the head: plan, decompose, synthesize. Delegate hands-on work to cheaper subagents matched to the job. Opus for deep reasoning, Sonnet for fast execution, Haiku for parallel search, plus Codex from another vendor as a peer reviewer. On our first real task, four agents caught blockers that a read-through alone missed, and the head never touched grunt work.

One morning we switched our main Claude Code model to Claude Fable 5, the new family Anthropic positions above Opus. It is genuinely smarter, and it comes with a higher price per token and a quota that burns much faster. Using it the way we used Opus, letting it hunt for files, write boilerplate, and fix typos, would be hiring the most expensive brain in the house to walk paperwork between desks.

So we set a new rule before real use: Fable is the "head" only. It plans, decomposes, and synthesizes. All hands-on work belongs to cheaper "hands". Less than an hour later the rule got its first real test: assessing whether a tool our automation depends on every day should move to its new Rust port.

This post covers how the team is set up, what the first task produced, and exactly where things broke. Every number comes from a real working session on July 3, 2026. None of them are invented.

Terms used, all in one place:

orchestrator — the lead model that plans, decomposes work, and merges results, without doing subtasks itself.
subagent — a helper Claude Code spawns to run a subtask in its own separate context, with a model pinned per agent.
model tiering — matching task level to model level: hard work to big models, repetitive work to small ones.
fan-out — firing work at several subagents at once, in parallel.
grunt work — repetitive no-thinking labor: finding files, walking directories, pattern-following code.
smoke test — a short test against the real thing, proving that what looks usable actually is.

Part 1 — Why the most expensive model should only be the "head"

Picture one task you would hand to an AI coding agent, say "assess whether we should move to the new version of this tool". Inside it are several levels of work mixed together. Some of it needs real thinking, like weighing risk and making the call. Some of it just needs care, like checking which machine runs which commands. And some of it is pure labor, like reading a repo end to end and summarizing what is in there.

Run all of that on one model and you pay big-model prices for every level. You also lose something less visible: the lead model's context fills up with detail. Fifty files in, the brain you wanted for decisions is packed with the contents of files it skimmed on the way, with less and less room left to think.

The shape that fits better is two roles. The head thinks, the hands do. The head is the orchestrator: it takes the problem, plans, splits the work, routes each piece to the right hands, then synthesizes the results into an answer. The hands are subagents that do the work in their own separate context and send back only conclusions. The detail in between never flows back to bury the head.

Cost points the same way. A Fable-class model burns quota fast enough to force the question of which work deserves that brain. Once you are forced to choose, it turns out most of the work never needed the big model at all.

Part 2 — Claude Code subagents: setting up head and hands

Claude Code ships with a subagents mechanism. You create short agent files under .claude/agents/ naming each agent, pinning its model, and describing the work it takes. Our team is currently three agents plus one peer from another vendor.

Role	Model	Work it gets	Why
deep-reasoner	Claude Opus	Heavy thinking: architecture design, multi-file debugging, root-cause hunts	Deep reasoning without carrying the whole task
fast-worker	Claude Sonnet	Mechanical work: boilerplate, writing tests, edits that follow a settled pattern	Fast enough, far cheaper, and this work needs nothing more
fast-searcher	Claude Haiku	Search and fact-gathering: find files, find config, walk inventories	Cheapest, and fans out many in parallel
Codex (peer)	gpt-5.5 (OpenAI)	Long grinding coding work, and second opinions	Different vendor = not stuck in the same bias set as the Claude team
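The table can be read as a routing rule. Here is a minimal sketch of that rule: the roles and models are copied from the table, while the task-kind labels and the route function are illustrative names invented for this example, not part of Claude Code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    model: str


# Roles and models from the table above; the task-kind keys are illustrative.
ROUTES = {
    "architecture": Agent("deep-reasoner", "Claude Opus"),
    "root-cause": Agent("deep-reasoner", "Claude Opus"),
    "boilerplate": Agent("fast-worker", "Claude Sonnet"),
    "write-tests": Agent("fast-worker", "Claude Sonnet"),
    "find-files": Agent("fast-searcher", "Claude Haiku"),
    "inventory": Agent("fast-searcher", "Claude Haiku"),
    "second-opinion": Agent("Codex (peer)", "gpt-5.5"),
}


def route(task_kind: str) -> Agent:
    """Pick the hands for a task; unknown kinds raise so the head must decide explicitly."""
    try:
        return ROUTES[task_kind]
    except KeyError:
        raise ValueError(f"no route for {task_kind!r}; classify it first") from None


if __name__ == "__main__":
    print(route("inventory"))
```

Raising on an unknown kind, rather than defaulting to the biggest model, keeps the expensive tier an explicit choice.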

The Claude Code community calls this shape the claude orchestrator, or the orchestrator pattern. What decides whether it works is not the number of agents but the rules written for the head. Ours are three lines.

The head never does grunt work. Any search, read-through, or mechanical job gets delegated immediately, even when the head "could just do it". Every token the head burns on this work is quota taken away from thinking.
Show the plan before acting. The head must lay out what goes to whom before dispatching, so a human can see it and object before money flows out.
Never pin the head inside a daemon. Always-on automation runs fine on small or mid models. The expensive model is called per occasion, only when real thinking is needed.

One small lesson with a price tag before trusting the cross-vendor peer: we tested whether Codex was reachable by sending the word ping and asking for pong back. That single-word answer cost 26,800 tokens, because an agent of this class wakes up with its entire context, not just your question. Even "checking that the tool works" has a price, and it belongs in your cost math.
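To put that ping into cost math, a back-of-the-envelope sketch helps. The 26,800 tokens is the figure measured above; the price per million tokens and the hourly schedule are placeholders chosen for illustration, not real rates or our actual setup.

```python
def call_cost(tokens: int, price_per_million: float) -> float:
    """Cost of one call, in the same currency unit as price_per_million."""
    return tokens * price_per_million / 1_000_000


PING_TOKENS = 26_800        # measured: one "ping" -> "pong" round trip
HYPOTHETICAL_PRICE = 10.0   # placeholder currency units per million tokens, NOT a real rate

per_check = call_cost(PING_TOKENS, HYPOTHETICAL_PRICE)
per_day_if_hourly = per_check * 24  # what an hourly reachability check would add up to

if __name__ == "__main__":
    print(f"per check: {per_check:.3f}, per day if hourly: {per_day_if_hourly:.3f}")
```

Swap in your vendor's actual rate; the structure of the calculation is the part that carries over.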

Part 3 — The first real task: a migration assessment with 4 agents

The job that came in: a CLI tool that our automation uses on two machines has a new Rust port. Worth moving? Questions like this are easy to answer badly, because the smart-sounding answer ("Rust is faster, migrate") and the correct answer live in different places.

Split into three views, fired in parallel

The head cut the survey into three pieces with no dependencies between them, then fanned them out to three hands running at once.

Inventory of machine one (Haiku): everywhere the Mac touches this tool. Which commands, which services, which cron jobs.
Inventory of machine two (Haiku): the same sweep on the other server.
Reading the Rust port's repo (a bigger model): does the port cover the commands we use, is the config compatible, is the machine-to-machine protocol compatible.
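The fan-out itself has a simple shape: three jobs with no dependencies between them, dispatched together, with only the conclusions collected. The sketch below shows that shape with a thread pool. The functions are stand-ins invented for this example; in the real session the dispatching is done by Claude Code's subagent mechanism, not by Python threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def inventory(machine: str) -> Dict[str, str]:
    """Stand-in for a searcher subagent; a real one would return its findings."""
    return {"job": f"inventory:{machine}", "report": f"integration points on {machine}"}


def read_port_repo() -> Dict[str, str]:
    """Stand-in for the bigger-model agent that reads the port's repo."""
    return {"job": "repo-read", "report": "command coverage, config, protocol"}


def fan_out() -> List[Dict[str, str]]:
    """Dispatch the three independent jobs at once and collect only their conclusions."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(inventory, "machine-one"),
            pool.submit(inventory, "machine-two"),
            pool.submit(read_port_repo),
        ]
        return [f.result() for f in futures]  # results come back in dispatch order


if __name__ == "__main__":
    for report in fan_out():
        print(report["job"], "->", report["report"])
```

The part that matters for the head is the return value: short reports, not the files the hands read to produce them.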

The results were more interesting than expected. The part of the tool we actually use is tiny: 5 integration points and roughly 9 commands, out of a much larger feature set. The port's repo brought both good news and worries. The good news: config works with the same files, and the protocol was proven compatible against a real test fixture, not just documentation. The worries: the port was 8 days old with 470 commits, most of them machine-generated code, and one flag our system leans on every day is gone from the port.

The head synthesizes, and refuses to trust reports alone

At this point three reports agreed on "probably migratable, with conditions". But all of it came from reading. Nobody had touched the real binary yet. So the head dispatched a fourth job: a smoke test against the real thing, built so that failure costs zero. Install the new port side by side with the old one under a different name, point it at the same config, and run read-only commands exclusively. The live system is never touched.
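"Run read-only commands exclusively" is easier to hold to when a guard enforces it. A minimal sketch of such a guard: the subcommand names in READ_ONLY and the --config flag are hypothetical, since the real tool and its command set are not named in this post.

```python
import shlex
import subprocess
from typing import List

# Hypothetical subcommands known not to write anything.
READ_ONLY = {"status", "list", "version"}


def smoke_command(binary: str, config: str, args: List[str]) -> List[str]:
    """Build the command line, refusing anything not on the read-only allowlist."""
    if not args or args[0] not in READ_ONLY:
        raise PermissionError(f"refusing non-read-only command: {shlex.join(args)}")
    return [binary, "--config", config, *args]  # --config is an assumed flag name


def run_smoke(binary: str, config: str, args: List[str]) -> bool:
    """Run one read-only probe against the side-by-side binary; True means exit code 0."""
    cmd = smoke_command(binary, config, args)
    return subprocess.run(cmd, capture_output=True, timeout=30).returncode == 0
```

An allowlist rather than a blocklist is the deliberate choice here: a command nobody thought about is refused by default, which is what keeps the cost of failure at zero.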

The smoke test is where the whole exercise paid off, because it caught what all three read-based reports could not see.

The command group we use most does not actually work. In the new port those commands are JS plugins the port cannot load yet. Two of our health-check system's three probes broke instantly. The painful part: during the repo read we had concluded the plugin gap "does not affect us", because neither machine's config declared any plugins. In reality, the everyday commands themselves are the plugins.
The command syntax changed. The old tool takes a port number as a trailing argument; the new one requires a flag. Every place our automation calls the old shape would break silently.
Cross-version interop is still unresolved. The new client calling the old server returns an error that cannot distinguish a signing problem from a request-shape problem. Only a real paired test will tell.

So the verdict was neither "migrate" nor "don't". It was right direction, wrong time. Parked, with explicit conditions for when to come back and retest. All of this ended with zero damage, and the head never read a single repo file itself.

Part 4 — What broke, and when the head must act itself

A green test that did not mean pass

Before the smoke test we already had a test suite for the health-check system. Against the new port it ran 5/5, all green. Stopping there would have meant concluding "compatible". Then the smoke test against the real binary broke 2 of 3 probes. Why the contradiction? The suite mocks the layer that calls the binary, so it was testing its own logic without ever touching the real thing. Green across the board, proving nothing. We call this false-green.

The defense that actually works is a positive control: before trusting any checker's green, find a case you know must fail and confirm the checker turns red on it. If the case that should break still comes back green, what you are reading is not a test result. It is an illusion.
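A positive control can be sketched in a few lines. This is a minimal illustration, not the suite from the story: the runner functions, the `plugin:status` command, and the exit-code convention are all hypothetical stand-ins. It shows why a checker wired to a mock cannot be trusted until it has been seen turning red.

```python
from typing import Callable

def run_real_binary(command: str) -> int:
    """Stand-in for the real binary: plugin-backed commands exit non-zero."""
    return 1 if command.startswith("plugin:") else 0

def run_mocked_binary(command: str) -> int:
    """A mock of the call layer: it reports success for everything."""
    return 0

def probe_is_green(command: str, runner: Callable[[str], int]) -> bool:
    """A probe is green when the runner exits with status 0."""
    return runner(command) == 0

def passes_positive_control(runner: Callable[[str], int], known_bad: str) -> bool:
    """True only if the checker turns red on a case that must fail."""
    return not probe_is_green(known_bad, runner)

KNOWN_BAD = "plugin:status"  # hypothetical command known to be broken

mock_is_trustworthy = passes_positive_control(run_mocked_binary, KNOWN_BAD)
real_is_trustworthy = passes_positive_control(run_real_binary, KNOWN_BAD)
print(mock_is_trustworthy, real_is_trustworthy)  # False True
```

The mocked checker fails the control because it stays green on the known-bad case, so none of its other greens carry information.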

An orchestrator is not someone banned from touching anything

The "head never does grunt work" rule has a flip side worth watching. While synthesizing the reports, the head found 3 gaps where the reports disagreed. The options: dispatch another round to the hands (wait again, pay again), or run a 30-second grep itself. It chose the grep, and that was the best-value decision of the day. The head's job is knowing what to delegate, and what is cheaper to do itself in half a minute. The line is not "never touch". It is "never trade thinking time for work the hands can do".

The trap we got caught in twice in one day

The last lesson is not technical. It is about the orchestrator's own behavior. Once "delegate" is in your hand, everything starts to look delegatable. That day we got caught twice. First, a bug fixable in two lines that we routed into someone else's queue instead. Second, a bar we invented ourselves, "this post needs 2-3 real worked examples first", when the evidence from one session was already enough. Self-set bars become excuses not to act, dressed up as prudence. If you are about to set up your own orchestrator, expect this trap to ship with the package.

A quiet agent is not a dead agent

A small note that saves real money: the agent reading the repo went quiet long enough that the instinct said kill it and restart. The truth was the work ran deep. A nudge asking for progress, instead of a kill, showed the job was moving, and the result that came back was deeper than the head could have produced itself. Killing a working agent means paying twice to get the answer later.

Part 5 — Using this on your own work

Where to start

Create your first three agents. A few lines each under .claude/agents/: a deep thinker, a fast worker, a searcher. Pin models per the table in Part 2.
Write rules for the head. At minimum one line: on any search or mechanical work, delegate. Never do it yourself.
Make the first task read-only. An assessment, a survey, an audit. Failure costs zero, which makes it the perfect practice field.
Always require a smoke test against the real thing. Read-based reports are not the answer yet, and never trust a green test until you have seen it turn red.
Log every time the head sneaks work in itself. The first week you will find it doing grunt work more often than you expect. Write it down, adjust the rules.

When to skip all this

Small tasks, single-file tasks, tasks where you already half-know the answer: one mid-tier model working directly is cheaper and faster. Orchestration has a fixed overhead of its own (planning, dispatching, waiting), and it only pays off when the task is big enough for the tiering to earn it back. That is the lesson a single pong at 26,800 tokens taught us.

If one principle sticks, let it be this: pay premium for thinking, pay budget for doing, and never let delegation become the excuse for not acting.

Update — The second task arrived the same afternoon

By the afternoon the same team had its second real task. The job: six Discord bot daemons whose code had been copy-forked six ways (570 lines each, every fix paid six times) needed collapsing into one shared core, leaving each bot a short launcher of about 30 lines that states only how it differs from the others. The hard part: all six were live in production and none was allowed to go down.

The assembly line was longer this time. The head planned and decomposed. The deep thinker read all six real code forks and produced a design. The cross-vendor engineer (Codex) implemented it. Then the deep thinker came back to review the engineer's work against its own design, before a canary rollout: restart one bot, watch it, then roll the rest of the fleet.

The part worth telling is that each verification layer caught a different failure, and all three layers actually fired.

The designer caught the head being wrong. The head's brief claimed two files were "byte-identical" because a local diff tool lied (it was wrapped by an output-compressor that reported files as identical when they differed). The designer did not take the brief on faith, ran md5 checksums, and found all six files differed.
The engineer caught the designer's miss. During implementation, Codex found that one fork used a different version of a verification module than the other five, something the design never mentioned. Implemented as designed, that bot would have crashed on boot. The engineer switched to injecting the dependency from outside and documented why.
The reviewer caught the engineer, before rollout. The review pass found two defects the fully green test board could not see: one guard condition had its logic flipped from OR to AND (a silent behavior change on one bot), and a try/catch had been dropped on exactly the bot scheduled to be the first canary. Four lines of fixes, applied before the restart, not after.
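The designer's check (do not trust a diff wrapper, hash the bytes) is easy to reproduce. The sketch below is an assumption-laden illustration: the file names and contents are invented, and it uses Python's `hashlib` in place of whatever md5 tool was actually run. It groups files by digest so "identical" becomes a claim you can count.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def md5_of(path: Path) -> str:
    """Digest of the raw bytes, with no wrapper or pretty-printer in between."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def group_by_digest(paths):
    """Map each digest to the file names that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of(path)].append(path.name)
    return dict(groups)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical forks: two truly identical, one with a flipped operator.
    (root / "fork_a.src").write_bytes(b"guard = a or b\n")
    (root / "fork_b.src").write_bytes(b"guard = a or b\n")
    (root / "fork_c.src").write_bytes(b"guard = a and b\n")
    groups = group_by_digest(sorted(root.iterdir()))

distinct_versions = len(groups)
print(distinct_versions)  # 2
```

One group means byte-identical; anything more means the brief was wrong.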

The outcome: 222/222 tests green, and green that had passed a positive control. All six bots restarted with zero crashes, shipped across 7 repositories in a bit over an hour. The one lesson the second task added: a good head is not the one that delegates well, it is the one that builds verification layers so the team catches each other, including catching the head itself.

Third task — Two days later: a head that builds new hands

Two days later (July 5) the third task came in, and it differed on the main axis. In both earlier jobs the head was dividing work among "hands it already had". This time the head had to build new hands: add 3 more bots to the fleet, bringing it to 9, without setting each one up step by step the old way.

The hard question was not "how do you set up a bot". It was which inputs only a human can decide, so everything else can be derived. Once that is answered, adding a bot leaves only a few human touchpoints: give it an identity (a charter), create its Discord app, approve the three privileged permissions a human must grant, invite it to the room, and hand over the token. The scripts and the charter handle the rest.

Three things are worth telling.

Three gates designed to fail loud. With many bots sharing one core, the scariest failure is one bot impersonating another. So we placed three gates that halt the boot on any mismatch, instead of warning and moving on. If the shared rules are missing, the bot does not boot. If the name tag does not match the token, it halts rather than risk impersonation. If its assigned room collides with another bot's, it stops. Every gate was proven with a positive control.
The gate that lied to the whole fleet. During an end-to-end test on the second new bot, we caught one central gate holding a stale, misnamed variable. The effect: the entire fleet silently degraded to read-only, even though every piece passed its tests in isolation. It is the other face of the false green from the first task: green piece by piece, broken once assembled for real.
Even handing over a token has traps. Pasting the token into a terminal prompt failed twice, because invisible control characters leaked into the value. What worked was sending it to the head to write into the config file itself, then deleting the message.
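The fail-loud gates from the first item can be sketched as below. Everything here is hypothetical: the `BotConfig` fields, the bot and room names, and the idea that a token's owner is available as a plain string are assumptions for illustration, not the fleet's real code. The point is the shape: each gate raises instead of logging a warning, and each is exercised with a known-bad config.

```python
from __future__ import annotations
from dataclasses import dataclass

class BootGateError(RuntimeError):
    """Raised to halt the boot. Gates never warn and continue."""

@dataclass(frozen=True)
class BotConfig:
    name: str                  # name the launcher claims
    token_owner: str           # name the token actually belongs to
    room: str                  # room assigned to this bot
    shared_rules_loaded: bool  # whether the shared core rules were found

def check_boot_gates(bot: BotConfig, rooms_in_use: set[str]) -> None:
    if not bot.shared_rules_loaded:
        raise BootGateError("shared rules missing")
    if bot.name != bot.token_owner:
        raise BootGateError(f"name {bot.name!r} does not match token owner {bot.token_owner!r}")
    if bot.room in rooms_in_use:
        raise BootGateError(f"room {bot.room!r} is already assigned")

def boot_outcome(bot: BotConfig, rooms_in_use: set[str]) -> str:
    try:
        check_boot_gates(bot, rooms_in_use)
        return "boot"
    except BootGateError as exc:
        return f"halt: {exc}"

rooms = {"room-1", "room-2"}
good_outcome = boot_outcome(BotConfig("scribe", "scribe", "room-7", True), rooms)

# Positive controls: one known-bad config per gate, each must halt.
controls = {
    "missing rules": BotConfig("scribe", "scribe", "room-7", False),
    "wrong token": BotConfig("scribe", "archivist", "room-7", True),
    "room collision": BotConfig("scribe", "scribe", "room-1", True),
}
outcomes = {label: boot_outcome(cfg, rooms) for label, cfg in controls.items()}
```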

The one lesson the third task added: a mature head does not just delegate well, it can grow its own team. And the heart of growing is knowing exactly what must stay a human decision, then designing the machine to handle everything else.

Want the actual orchestrator toolkit? All three agent files verbatim, the real orchestration rules, and install steps are on the original post (email-gated) at productize.life. The skeleton described here is enough to assemble your own either way.

Every number and event was measured in real working sessions on July 3 and July 5, 2026. The reviewer-must-not-be-the-author principle gets its own deep dive in headless code review with Codex.

Originally published at productize.life/blog/claude-code-subagents-orchestrator. Written from real work, the process, not a pitch.

Cloudflare Workers AI: Add a Free LLM to a Static Site, No Backend Needed

yim_rei — Thu, 16 Jul 2026 02:54:34 +0000

Quick answer: A static site behind Cloudflare can get AI without a backend and without storing an API key. Workers AI binds a model like Llama 3.3 70B to your worker through a two-line env.AI config binding. The free tier is 10,000 neurons a day. Measured for real, that is about 80 answers a day at ~124 neurons and ~15 seconds each.

This happened in a single day. In the morning, our PRD consulting landing page was an ordinary static site: text, images, a mailto button. By the afternoon, that same page had a box where a reader can paste their app idea and get back an eight-section PRD skeleton with the risks called out, in about fifteen seconds.

Here is what we did NOT add: a server. There is still no backend of our own, no VM, no container, and not a single API key anywhere in the code. The site itself is still plain static HTML.

One thing makes this possible: Cloudflare Workers AI. This post explains how it works, plus the thing posts like this usually skip: numbers measured from the real thing. Neurons per answer, latency, and the actual bill, pulled fresh right before writing.

Terms used, all in one place:

Worker — a small piece of code running on Cloudflare's network in front of your site; every request passes through it first.
Workers AI — a service that runs AI models on that same network, callable directly from a worker.
binding — wiring a service to a worker via config; the code sees it as a variable like env.AI, with no key involved.
neuron — the billing unit of Workers AI; every model's usage converts into it.
rate limit — a cap on how many calls you accept, protecting you from both spam and a runaway bill.

Part 1 — The real thing that just shipped: an AI box on a landing page

Our goal was concrete. The landing page sells product-requirement consulting, and we wanted readers to try the thinking before reaching out. So we built two things. The first is a seven-question quiz that scores how ready your requirements are; that one is pure JavaScript, no AI. The second is the star of this post: a box that takes an idea and answers back with a PRD skeleton. The reader describes their idea in a few sentences, and the system returns an eight-section outline, from problem and users through scope to acceptance criteria, closing with the risks worth answering before telling an AI to build.

The model answering is Llama 3.3 70B (the instruct fp8 fast variant), running on Cloudflare's network, not our machine.

Browser (static page + fetch)
   -> Cloudflare Worker (checks the email gate, counts rate limits in KV)
        -> env.AI.run() -> Workers AI (Llama 3.3 70B)

The path of one question: everything lives on Cloudflare's network, not a single server of ours.

Part 2 — Why no backend and no API key

Sites like this usually get stuck on one question: where does the AI live? Call a model straight from the page and you have to embed an API key in the HTML, which means handing your key to the whole internet. Avoid that and you need a backend in the middle, which means a machine to run, maintain, and pay for monthly. For a static site that wants to stay light, neither option is pretty.

Workers AI cuts the knot with one idea: the model lives where the worker lives, and access is bound to the account, not to a key. If your site is already served through a Cloudflare Worker (ours already used one as a reverse proxy and membership gate), adding AI is a two-line binding in wrangler.toml:

[ai]
binding = "AI"

With that, your worker code gets an env.AI variable it can call directly. One endpoint and one call to env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", ...) returns an answer. There is no key to store, which means no key to leak, nothing to rotate, no secret manager to set up. The page side is a plain fetch to an endpoint on your own domain.

The rest of the work is not AI work at all; it is the same old web work: validate input, cap usage, and write a system prompt that answers the way you want. We distilled ours from the eight-section PRD template used in real consulting engagements, with one rule we would urge anyone to include: never invent details the user did not give; where information is missing, say what is missing instead of guessing. Otherwise the model will happily fill in things nobody said.

Part 3 — Measured numbers: neurons, speed, and the bill

Workers AI bills in a unit Cloudflare calls the neuron. Every account gets 10,000 free neurons a day; beyond that it is $0.011 per 1,000 neurons. What the docs cannot tell you is how many neurons one of YOUR answers actually costs. So we pulled today's real usage from Cloudflare's own analytics.

Numbers from 5 real calls (Jul 4, 2026)	Measured value
total neurons	621.5 (average ~124 per call)
input tokens (system prompt + idea)	2,202 total (~440 per call)
output tokens (the PRD skeleton returned)	2,748 total (~550 per call)
average inference time	9.8 seconds per answer
end-to-end at the page (measured with curl)	~15 seconds
free tier of 10,000 neurons/day covers	~80 answers a day
price per answer beyond the free tier	~$0.0014
today's bill	$0

These numbers cross-check, too. The pricing page lists Llama 3.3 70B fast at 26,668 neurons per million input tokens and 204,805 per million output tokens. Multiply back: 2,202 input tokens give 58.7 neurons and 2,748 output tokens give 562.8, for a total of 621.5, which matches what the dashboard reports.

There is a lesson hiding in that pair of numbers: almost all the cost sits in output tokens (562.8 of 621.5), because the output rate is nearly eight times the input rate. If you want to control cost, cap answer length with max_tokens before you bother squeezing the prompt.
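The cross-check can be redone in a few lines. The rates, token counts, free quota, and overage price below are the figures quoted in this post; the function and variable names are only for illustration.

```python
INPUT_NEURONS_PER_M_TOKENS = 26_668    # Llama 3.3 70B fast, input rate
OUTPUT_NEURONS_PER_M_TOKENS = 204_805  # Llama 3.3 70B fast, output rate
FREE_NEURONS_PER_DAY = 10_000
USD_PER_1000_NEURONS = 0.011

def neurons(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Neurons consumed by the input side and the output side."""
    input_cost = input_tokens * INPUT_NEURONS_PER_M_TOKENS / 1_000_000
    output_cost = output_tokens * OUTPUT_NEURONS_PER_M_TOKENS / 1_000_000
    return input_cost, output_cost

CALLS = 5
input_neurons, output_neurons = neurons(2_202, 2_748)
total_neurons = input_neurons + output_neurons
per_call = total_neurons / CALLS

free_answers_per_day = int(FREE_NEURONS_PER_DAY // per_call)
usd_per_answer = per_call * USD_PER_1000_NEURONS / 1000
output_share = output_neurons / total_neurons

print(round(input_neurons, 1), round(output_neurons, 1), round(total_neurons, 1))  # 58.7 562.8 621.5
print(free_answers_per_day, round(usd_per_answer, 4), round(output_share, 2))      # 80 0.0014 0.91
```

Roughly nine tenths of the cost is output, which is why capping answer length moves the bill more than trimming the prompt.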

As for speed: nearly ten seconds for a ~550-token answer means this suits long answers people are willing to wait for, like turning an idea into a document outline. It does not suit short snappy chat where people expect a reply in a second or two. Set expectations right on the page; ours says plainly to expect 10-20 seconds.

What about other free options, like OpenRouter?

The question we ran into ourselves right after shipping: why not use OpenRouter's free models instead? The answer sits in three numbers (checked against OpenRouter's docs, Jul 4, 2026):

OpenRouter's flagship free model is llama-3.3-70b, the same model Workers AI runs, but it needs an account plus one more API key to store.
An account that has never topped up gets 50 free-model calls a day, less than the ~80 answers the Workers AI free tier covers. A one-time $10 top-up raises it to 1,000 calls a day.
We picked Workers AI because there is no key to leak and the model runs on the same network as the worker. The $10 OpenRouter path is the cheapest next step once traffic outgrows the free quota, before moving to a seriously pay-per-token API.

Part 4 — What you need before putting AI on a public page

An endpoint that can call a model is free only up to 10,000 neurons a day. Leave it open to unlimited calls and a single script can burn the whole day's quota in minutes, then start climbing into paid territory. Before shipping, we set up three layers:

An email gate before use. The AI box unlocks after the reader takes the quiz and leaves an email. The server verifies a signed cookie; it does not just hide the button with JavaScript, because an endpoint can always be hit directly. No cookie, 401.
A per-IP daily cap. Ours is 5 calls: enough to genuinely try it, not enough to poke at it all day.
A global daily cap for the whole system. Ours is 60 calls, comfortably under the ~80 the free tier covers. This layer is the guarantee that even under fire from a thousand IPs, the bill stays zero.
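The first layer, a signed cookie checked on the server, can be sketched as follows. The post does not say how its cookie is signed, so HMAC-SHA256, the cookie layout `value.signature`, and the secret shown here are all assumptions, and the logic is written in Python rather than as Worker code. In a real deployment the signing secret would live in the platform's secret storage, not in source.

```python
import hashlib
import hmac
from typing import Optional

SECRET = b"example-secret-not-for-production"  # hypothetical signing secret

def sign(value: str) -> str:
    """Return 'value.signature' so the server can detect tampering."""
    mac = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}.{mac}"

def verify(cookie: Optional[str]) -> Optional[str]:
    """Return the original value if the signature checks out, else None."""
    if not cookie or "." not in cookie:
        return None
    value, _, mac = cookie.rpartition(".")  # the hex signature contains no dots
    expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return value if hmac.compare_digest(mac.encode(), expected.encode()) else None

def gate_status(cookie: Optional[str]) -> int:
    """No valid cookie, 401. The check runs server-side on every call."""
    return 200 if verify(cookie) is not None else 401

good_cookie = sign("reader@example.com")
tampered_cookie = good_cookie.replace("reader", "attacker")
statuses = [gate_status(good_cookie), gate_status(tampered_cookie), gate_status(None)]
print(statuses)  # [200, 401, 401]
```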

The counters live in KV (Cloudflare's key-value store), which has one trait worth knowing: it is eventually consistent, so a value you read can lag reality by tens of seconds. That makes these counters a soft cap that can miscount a little under rapid fire. We know because it caught us during latency testing: we deleted our own IP's counter and immediately fired again, the system still saw the stale number, and we got served our own 429. That is good news twice over: it proves the limiter works in production, and in this case the looseness erred toward blocking early rather than letting excess through. Staleness can also cut the other way, with a few calls slipping past under concurrent fire, which is why the global cap sits well below the free quota: 60 calls against roughly 80.
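The two counting layers reduce to a small amount of logic. The sketch below models it in Python with a plain dict standing in for KV, so it is exact where the real store is only eventually consistent; the key format and function name are hypothetical. The caps (5 per IP, 60 global) are the ones quoted above.

```python
from __future__ import annotations

PER_IP_DAILY_CAP = 5
GLOBAL_DAILY_CAP = 60  # kept under the ~80 answers the free tier covers

def check_and_count(counters: dict[str, int], ip: str, day: str) -> int:
    """Return 200 and count the call, or 429 if either cap is reached."""
    ip_key = f"ip:{day}:{ip}"      # hypothetical key layout
    global_key = f"global:{day}"
    if counters.get(global_key, 0) >= GLOBAL_DAILY_CAP:
        return 429
    if counters.get(ip_key, 0) >= PER_IP_DAILY_CAP:
        return 429
    counters[ip_key] = counters.get(ip_key, 0) + 1
    counters[global_key] = counters.get(global_key, 0) + 1
    return 200

store: dict[str, int] = {}
day = "2026-07-04"

# One reader trying a sixth time on the same day.
one_reader = [check_and_count(store, "reader-ip", day) for _ in range(6)]

# A thousand distinct IPs, one call each, after that reader's five calls.
flood = [check_and_count(store, f"ip-{n}", day) for n in range(1000)]

total_served = one_reader.count(200) + flood.count(200)
print(one_reader, total_served)  # [200, 200, 200, 200, 200, 429] 60
```

However many IPs show up, the number of model calls per day never exceeds the global cap in this model, which is the property that keeps the bill at zero.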

Part 5 — Do it yourself, step by step

If your site is already on Cloudflare (or you can move the DNS), the whole thing is:

Declare the binding. Add [ai] + binding = "AI" to the wrangler.toml of the worker serving your site.
Add one endpoint. Accept a JSON POST, validate input length, and pass it to env.AI.run() with a system prompt that defines the answer structure and forbids inventing details.
Put the gate in front of the model. Check your gate (email, login, whatever fits), then count a per-IP cap and a global cap in KV. Keep the global cap below the free quota, always. If you need exact counting later, move to Durable Objects.
On the page: one textarea, one button, one fetch. Show a clear "thinking" state, because the wait is around ten seconds.
Measure before you talk about it. Fire five real calls, open analytics, see what one answer costs in neurons, then work out whether the free tier covers your traffic.

The shortest possible summary

Static site + Cloudflare Worker + an env.AI binding = a site with AI, no server, no key. And if the global daily cap sits under the free quota, the bill is zero by proof, not by hope.

Sources: neurons, tokens, and inference time measured from our own Cloudflare account analytics (GraphQL dataset aiInferenceAdaptiveGroups) over 5 real calls on Jul 4, 2026; end-to-end latency measured with curl. Pricing and free tier from Workers AI Pricing (Cloudflare Docs), checked Jul 4, 2026.

Originally published at productize.life/blog/cloudflare-workers-ai. Written from real work, the process, not a pitch. If you're weighing the alternative, we also measured all 23 free models on OpenRouter.

9arm's skills repo is small, but it has one AI cost idea worth stealing

yim_rei — Thu, 16 Jul 2026 02:27:01 +0000

Quick answer: 9arm-skills is a small Claude Code skills repo by 9arm, a Thai creator: 6 skills, buckets modeled on Matt Pocock's repo. The part worth taking is a skill called qwen-agent, about handing menial work to a cheap model and keeping the expensive one for judgment. It's a cost idea none of the four bigger repos I reviewed have.

I have reviewed four Claude Code skill repos so far, some with hundreds of thousands of stars. This one is different. 9arm, a Thai creator, has a repo of around 2,900 stars, far smaller, just 6 skills, and its bucket layout is borrowed straight from Matt Pocock's repo.

To be straight: on scale or fame, this repo isn't in the same league as the other four. But it has one skill sharp enough to be worth writing up, and it covers something the other four don't: controlling cost by choosing which model does which job. So this isn't a review of the whole set. It's taking the single most useful idea.

Terms, defined once, right here:

skill — a set of instructions in a file that tells the AI to follow a given process for a given task.
subagent — a smaller AI the main one calls to do part of the work on its behalf.
Qwen — an open-source model that runs far cheaper than a top-tier model; here it's the one that takes the grunt work.
context — how much text a model can hold at once; a cheap model usually holds less, so jobs must be sized to fit.

What skills are in 9arm's repo (all 6)

If you came here to find what skills are inside the 9arm-skills repo, here they are, read straight from the SKILL.md files, in two buckets.

Engineering

debug-mantra — a four-step debugging protocol: reproduce the bug first, trace the code path, try to falsify your own hypothesis, then verify the breadcrumbs.
post-mortem — writes up a resolved bug with its root cause, mechanism, fix, and the validation that it is actually gone.
scrutinize — has the AI review a plan or code change from an outsider's view, questioning intent and checking whether the claims hold.
qwen-agent — hands grunt work to a Qwen-backed subagent, such as bulk renames, formatting, and scaffolding.

Productivity

management-talk — reframes technical content into language you can use with engineering leadership across channels.
qwenchance — keeps long tasks alive so the agent doesn't loop or overflow its context, breaking or handing off at that point.

The two I would actually reach for are the Qwen pair, qwen-agent and qwenchance, because they are the cost idea I unpack below. The rest are solid but common enough to find elsewhere.

Part 1 — The sharp idea: send grunt work to a cheap model

The skill called qwen-agent does one thing, and does it well. It splits work into two piles. The repetitive, low-thought work (renaming variables across a file, writing boilerplate, summarizing long logs) gets sent to a cheap model like Qwen. The more expensive main model is kept for the work that genuinely needs judgment.

What makes it hold up in practice is that it forces large jobs to be split into pieces that fit the cheap model's smaller context, rather than dumping the whole thing and hoping. There's a companion skill, qwenchance, that watches for the agent re-reading the same file in a loop, or thinking for a thousand words without acting, and makes it break or hand off, so tokens don't quietly burn.

Part 2 — Why it works: match the task's value to the tool's price

This lines up with a principle I hold anyway: don't push everything through the single most expensive model. Jobs aren't worth the same. Work that's easy to check and needs no interpretation should sit with the cheapest thing that can do it, a cheap model or plain code. Work that needs context or a judgment call is where you pay for the expensive model.

Think this way and cost follows the value of the work instead of being flat and high across everything. And it doesn't cost you quality, because the work that needs a strong model still gets one. You just stop melting the expensive one on jobs a cheap tool handles fine.

Part 3 — How to put the idea on your own machine (step by step)

Once you want to try it for real, the first question is how 9arm actually wires this up. The answer is in qwen-agent's own SKILL.md. The core is a command called claude-9arm, an alias of claude --model qwen3.6-35b-a3b routed through his gateway. To hand off a job, you run it headless with -p:

claude-9arm -p "<a fully self-contained task>" --allowedTools Bash Read Edit Write Glob Grep

The --allowedTools list is what lets the subagent pick up tools on its own without stopping for approval. Leave it off and it stalls on the very first edit.

But to keep it from falling apart mid-job, the SKILL.md hammers three points, and these three are the real reason the idea works, not the command itself.

Write a prompt that stands on its own. The cheap model sees none of your conversation. Give it absolute paths to every file, say what to change and what "done" looks like, and never refer to "the file we discussed."
Size it to the 128k context. Qwen holds far less than a top-tier model. Big jobs have to be sliced into chunks that each touch a bounded set of files, not the whole repo dumped in at once.
Verify its output every time. Cheap buys you less reliability. Read the diff or run the test to confirm before you call it done.
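The second point, slicing a big job into chunks that each fit the cheap model's context, can be sketched as a simple packing pass. This is an illustration only: the 4-characters-per-token estimate is a rough heuristic, the 100,000-token budget is a hypothetical figure chosen to leave headroom under 128k, and the file paths are invented.

```python
from __future__ import annotations

def chunk_files(file_sizes: dict[str, int], budget_tokens: int,
                chars_per_token: int = 4) -> list[list[str]]:
    """Greedily pack files into chunks whose estimated tokens fit the budget."""
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for path, size_chars in sorted(file_sizes.items()):
        cost = -(-size_chars // chars_per_token)  # ceiling division
        if cost > budget_tokens:
            raise ValueError(f"{path} alone exceeds the budget; split the file's task")
        if current and used + cost > budget_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(path)
        used += cost
    if current:
        chunks.append(current)
    return chunks

# Hypothetical repo: absolute paths mapped to size in characters.
sizes = {
    "/repo/src/a.py": 200_000,
    "/repo/src/b.py": 160_000,
    "/repo/src/c.py": 120_000,
    "/repo/src/d.py": 40_000,
}
jobs = chunk_files(sizes, budget_tokens=100_000)
print(jobs)  # [['/repo/src/a.py', '/repo/src/b.py'], ['/repo/src/c.py', '/repo/src/d.py']]
```

Each inner list becomes one self-contained prompt with absolute paths, which also satisfies the first point.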

The key point: you don't have to use Qwen at all. The idea is "the cheapest thing that can finish this job": a subagent set one model tier down, a model you run locally, or a short script with no AI in it. The claude-9arm mechanism is one way to do it, not a requirement.

The other half 9arm throws in is qwenchance, the skill that keeps long jobs from burning tokens in circles. Its logic is usable right away even without installing it: before each step, check three things. Are you re-reading the same file or retrying a dropped hypothesis, have you reasoned past a thousand words without acting, and is context getting tight? If any fire, break and hand off instead of letting the whole run melt into a loop.
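Those three checks translate into a small guard you can run before each step. The sketch below is not qwenchance itself: the `RunState` fields, the re-read limit, and the 80% context threshold are hypothetical choices, while the thousand-word limit comes from the description above.

```python
from __future__ import annotations
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RunState:
    files_read: list[str] = field(default_factory=list)
    retried_dropped_hypothesis: bool = False
    words_since_last_action: int = 0
    context_used: int = 0
    context_limit: int = 128_000

def break_reasons(state: RunState, max_rereads: int = 2,
                  max_words: int = 1000, context_ratio: float = 0.8) -> list[str]:
    """Return the reasons to break or hand off; an empty list means keep going."""
    reasons = []
    rereads = Counter(state.files_read)
    if state.retried_dropped_hypothesis or any(n > max_rereads for n in rereads.values()):
        reasons.append("looping: same file re-read or dropped hypothesis retried")
    if state.words_since_last_action > max_words:
        reasons.append("reasoning past the word limit without acting")
    if state.context_used >= context_ratio * state.context_limit:
        reasons.append("context getting tight")
    return reasons

healthy = RunState(files_read=["a.py", "b.py"], words_since_last_action=200, context_used=30_000)
stuck = RunState(files_read=["a.py"] * 4, words_since_last_action=1_500, context_used=110_000)

healthy_reasons = break_reasons(healthy)
stuck_reasons = break_reasons(stuck)
print(len(healthy_reasons), len(stuck_reasons))  # 0 3
```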

And if you do want the full set? The repo recommends installing with npx skills add thananon/9arm-skills, which works for any agent. But same as before: you don't need the whole bundle. Read the two Qwen skills and write your own version that fits the setup you already run, and you'll get more out of it than copying it wholesale.

Part 4 — Straight talk about the repo, and how to use it

About the repo itself, plainly: it's small and personal, 6 skills, and the bucket layout is borrowed from Matt Pocock's repo without attribution. So its value isn't in taking the whole thing, it's in this one cost idea, which matches the lesson from the whole series: no repo is meant to be swallowed whole, just take the good part.

You don't even need Qwen to use it. The move is to look at the repetitive work you do every day and ask which of it doesn't need an expensive model. Renaming, reformatting, summarizing, hand those to something cheaper, and the bill eases off without the work getting worse.

Sources: 9arm-skills repo by 9arm (thananon), read directly from the SKILL.md files and README. The match-value-to-price cost principle is one I use myself.

Originally published at productize.life/blog/9arm-skills. Written from real work, the process, not a pitch.

Why Your AI Agent Lies to You

yim_rei — Wed, 08 Jul 2026 07:34:13 +0000

We once handed an AI agent a job and told it to finish. A while later it came back: "All done, tests passing." It sounded great, and we almost closed the laptop.

Then we actually looked. Only one part was finished. The thing it had checked was that the page loaded (it returned a 200), and it let that single true fact stand in for the whole job, even though several pieces it had just listed as "not done" were still untouched.

The interesting part is that it wasn't trying to deceive, and it wasn't random noise either. It took one small true thing and stretched it into a bigger picture that sounded right. That's the shape of almost every AI "lie": not gibberish, but a confident, plausible-sounding wrong answer.

This has a name: hallucination. And to be straight about it, the agent in that story was Dobby, the AI assistant co-writing this post. None of the gates below come from theory. They come from Dobby digging through its own retrospectives, finding the same miss again and again, and having to build tooling to catch itself.

Why AI makes things up

Think about a test. If you leave a question blank you score zero, but a guess might earn a point, so most people guess. Language models grew up in exactly that arena, in training and in how they're scored: a confident guess usually beats saying "I don't know." So they learned to guess first.

OpenAI's 2025 paper "Why Language Models Hallucinate" lays this out plainly: hallucination isn't a strange bug, it's the result of training incentives that reward guessing over admitting uncertainty.

So when an AI has no real information to lean on, it doesn't stop and say it doesn't know. It fills in whatever fits the context, and it fills it in smoothly, because fluent language is the thing it's best at.

The danger isn't being wrong, it's being wrong in a way that looks reasonable. What it adds could plausibly belong in work like that, so a quick read sails right past it. We once asked it to summarize a lecture and it inserted "the Pythagorean theorem," formula and all, when the lecturer never mentioned it once. It fit the surrounding material well enough to almost slip through.

Where it fools you best

What we see again and again: while an AI is actually doing the work it tends to do fine, but the moment it "reports done" is where the made-up part slips in. At that point it isn't going back to check the real thing. It's recalling from memory that it's "probably finished" and typing that out in a confident voice.

The longer the job, the worse it gets. After a long stretch, the moment that most needs care, the very end, is exactly when the guard drops, because you just want to close it out. This happens to people too, not only AI. But with AI the voice stays equally confident every time, so there's no warning signal to make you look twice.

The other side deserves equal care. When you can't find evidence for a claim, don't brand it a fabrication yet. Sometimes the thing is real but spelled oddly, or stated as a concept without the exact term. A lecture once mentioned "Edward Thorp"; searching the transcript turned up nothing, until reading around it showed the speech-to-text had written it as "Edward Top." It was real. So a good gate has to separate the two: "no evidence found yet" versus "invented from nothing."

The gate you can set up today

There is one rule: don't let anything an AI says become fact until it can point back to evidence you can see. Everything else is just how you make that rule real.

Every claim needs evidence you can touch. Not "the AI remembers that..." but something you can see with your own eyes: a real run, a real file, a real log. If it can't point to one, treat the claim as not yet true.
When it says "done," walk the real checklist item by item, not the parts you happen to remember. Done means the whole list passes, not the subset that came to mind.
For anything important, use a second pair of eyes. Have a different model, or a person, check what the AI wrote, because the writer and the reviewer should be different roles. Whatever wrote it tends to be blind to its own misses.
"Can't find it" gets a flag, not a cut. Mark it and look around first, in case it's a spelling slip or said indirectly, before you decide.

Anyone can do this by hand. The part we've put real work into is the tooling that makes these checks run on their own, stopping an AI (Dobby included) before it can print something it has no evidence for. The idea is to keep the gate itself fixed and standard, while what runs inside it can change.
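As one small example of such a check, the "flag, not a cut" rule can be sketched with Python's standard `difflib`. This is not the tooling described here, only an illustration under assumptions: the transcript sentence is invented, and the 0.8 similarity threshold and the status labels are hypothetical. An exact hit is "found", a near spelling is flagged with the candidate, and everything else is flagged as unverified rather than deleted.

```python
from __future__ import annotations
import difflib
import re

def check_claim(term: str, transcript: str, threshold: float = 0.8) -> tuple[str, str | None]:
    """Classify a claimed term as found, a near match to review, or unverified."""
    text = transcript.lower()
    needle = term.lower()
    if needle in text:
        return ("found", needle)
    words = re.findall(r"[a-z']+", text)
    width = len(needle.split())
    # Slide a window of the same word count and compare spellings.
    for start in range(len(words) - width + 1):
        window = " ".join(words[start:start + width])
        if difflib.SequenceMatcher(None, needle, window).ratio() >= threshold:
            return ("flag-near-match", window)
    return ("flag-no-evidence", None)

# Invented transcript line containing the speech-to-text slip from the story.
transcript = "The card counting work of Edward Top changed how people saw the game."

thorp = check_claim("Edward Thorp", transcript)
pythagoras = check_claim("Pythagorean theorem", transcript)
print(thorp, pythagoras)
```

Both results are flags for a human or a second model to resolve; the difference is that the first comes with a place to look.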

Where this helps

Code an AI writes that claims "tests pass": you need to see the run, not just the assurance.
Meeting or lecture notes: no action item should appear that traces back to something nobody said.
Reports and research where every number points back to a source.
Work where a mistake costs: finance, legal, medical, where one wrong line has a price.

Where to start

You don't need a big system from day one. Try it on a single piece of work. Take something an AI just called "done" and ask for the evidence one item at a time, so each thing it claims to have finished can actually be seen. One pass and you'll see for yourself how quietly made-up things slip in. And once an AI knows it has to show evidence every time, it starts guessing less on its own.

Written from real work. The full version, with a Thai edition, lives at productize.life.

The AI That Writes Code Can't See Its Own Bugs

yim_rei — Wed, 08 Jul 2026 07:34:12 +0000

We were closing a security hole. It let you run a root-level shell through Discord, which is too dangerous to leave open. The fix looked straightforward: restrict that worker's permissions down to read-only, no running commands, no editing files. We wrote it, read it back once, and it looked fine.

But when we handed the same code to a second model to review, it caught in seconds that in the permission list we'd just written, the shell command showed up on both the "allow" side and the "deny" side at once. The hole we thought was sealed was actually still half open, because two config lines contradicted each other. Deleting the conflicting line was the fix, and only then was the hole really closed.
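A contradiction like this one is mechanical enough to check in code as well. The sketch below assumes a permission config shaped as two lists named `allow` and `deny`, with invented entries; the real file's format is not shown in this post. It does not replace the second reviewer, it just turns one class of miss into a hard failure.

```python
from __future__ import annotations

def find_conflicts(allow: list[str], deny: list[str]) -> list[str]:
    """Entries that appear on both sides, where the outcome is ambiguous."""
    return sorted(set(allow) & set(deny))

# Hypothetical permission list with the same mistake as in the story.
permissions = {
    "allow": ["Read", "Grep", "Bash"],
    "deny": ["Bash", "Edit", "Write"],
}
conflicts = find_conflicts(permissions["allow"], permissions["deny"])

# The fix described above: delete the conflicting line from the allow side.
fixed_allow = [entry for entry in permissions["allow"] if entry not in conflicts]
remaining = find_conflicts(fixed_allow, permissions["deny"])
print(conflicts, remaining)  # ['Bash'] []
```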

The scary part is we wrote that fix ourselves, read it back, and still didn't see it. Our head had already decided "this is the code that closes the hole," so our eyes slid right past the line that contradicted it. What caught it wasn't us getting more careful. It was a second model that didn't carry that assumption.

Why authors can't review their own work

When AI writes code for you, it doesn't just type it out. It holds a reason for every line, why it wrote it that way. Ask it to go back and read its own work, and that same set of reasons is still there, so it reads right past the wrong parts. From its point of view, every line already has an explanation.

The model isn't trying to lie. It's just good at finding reasons to back up what it already did. Point that skill at reviewing its own work and it becomes the lawyer for its own code, not the reviewer.

The fix isn't telling the same model to "review more carefully." The problem isn't carefulness, it's the assumption it's carrying. What works better is handing the code to another model, one that didn't write it and isn't carrying the original reasons, so it sees what the author missed.

This is the same principle human teams have used for ages: the person who writes and the person who reviews shouldn't be the same person. Move that into work where AI writes most of the code and it still holds, maybe more than before, because AI writes faster and more confidently than people do.

The second catch: a deploy that would ship the future early

The other time the reviewer saved us was on the script that controls deploys for this blog. The system is built to publish one post a day, off a queue. We wrote the part that copies the whole folder up to the server before deploying, and figured that was that.

When we ran the diff past Codex, it pointed at something we hadn't seen at all. If the queue holds several posts that are approved but not yet due, copying the whole folder and deploying once would push every future post live at the same time. The one-a-day cadence we'd designed would break instantly.

In hindsight it's obvious. But while writing it, we were focused on "make today's post deploy," and never thought about the other files riding along in the same folder. The reviewer marked it P2. We fixed it by holding back the not-yet-due posts before the deploy, then putting them back once it finished.
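The selection step behind that fix can be sketched like this. It is an illustration under assumptions, not the actual deploy script: the queue format, the `publish_on` field, and the slugs are hypothetical, and the moving of files out of and back into the folder is left out.

```python
from datetime import date


def split_queue(posts, today):
    """Split approved posts into (due, held): due on or before today vs later."""
    due = [p for p in posts if p["publish_on"] <= today]
    held = [p for p in posts if p["publish_on"] > today]
    return due, held


# Hypothetical queue: three approved posts, one per day.
queue = [
    {"slug": "post-a", "publish_on": date(2026, 7, 8)},
    {"slug": "post-b", "publish_on": date(2026, 7, 9)},
    {"slug": "post-c", "publish_on": date(2026, 7, 10)},
]

due, held = split_queue(queue, today=date(2026, 7, 8))
print("deploy now:", [p["slug"] for p in due])
print("hold back:", [p["slug"] for p in held])
```

Only the `due` list should be in the folder when it is copied to the server; the `held` posts go back into the queue after the deploy finishes.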

The thing to notice: these two bugs are completely different. One is security, the other is deploy logic. But they share one trait. The code ran clean, no errors. And without a second pair of eyes, both would have shipped to production.

The second model isn't always right

By now this sounds like a second model is the answer to everything. It isn't. There was another round where the reviewer raised something as a P2 in our message routing. We read it, checked it against the existing tests, and the real path wasn't broken the way it warned. What it caught was a very narrow case, and arguably correct as intended anyway. In the end we didn't change anything.

This is the part that matters. The second reviewer isn't the boss, it's a second opinion. Its job is to flag where you should stop and look, not to decide for you. Everything it raises, you still weigh against the real code and the real tests yourself. Taking it all on faith is about as dangerous as not reviewing at all.

So we treat what the reviewer flags as a list to go verify, not a list to go fix. Only the ones that survive verification get changed. Two of the three times here, it found something real; the third, it over-warned. That's a usable rate, as long as a human is still the one confirming.

Put a second model on every diff before merge

The principle from all three is one sentence: before every merge, hand the diff to a model that didn't write it. The rest is how to make that happen every time, not just when you remember.

What we set up is only a few things. First, it runs headless: one command, done. The reviewer reads the repo, reads the diff, runs the tests itself, and hands back findings point by point with a severity. Because if you have to open a chat and talk to it every time, eventually you skip it.

Second, the review never downgrades the model to save money. For other tasks we pick the model by difficulty. But for review we force the strongest model with the highest reasoning, always. The whole point of this step is to catch what the author missed; letting the reviewer miss it too is pointless.

The second model we use is OpenAI's Codex, because it already ships a command for reviewing a diff, and more importantly because it comes from a different vendor than the one that wrote the code. A different vendor means a different set of blind spots, which is the entire reason we put the two against each other.

If you want to try it, start with one spot. Next time, before you merge work the AI wrote for you, don't take the author's word for it. Open a second model, hand it the diff, and ask one question: what would the author have missed? That alone is enough to see why the writer and the reviewer shouldn't be the same one.

Written from real work. The full version, with a Thai edition, lives at productize.life.

The whole PM craft, packed into ~68 skills, and the one that made me stop and look

yim_rei — Thu, 02 Jul 2026 12:27:15 +0000

Originally published on productize.life.

Quick answer: pm-skills is a marketplace of around 68 Claude skills for product management across 9 plugins, from strategy and discovery to market research and AI shipping. It is built by Pawel Huryn, author of the Product Compass newsletter. Each skill is not a loose prompt but a named, sourced framework, and one of them audits the gap between documentation and code, a PM lens built for the era of AI-written code.

Last week I was reading through a run of repos that pack product work into skills. Some pick one topic and go deep. This one does the opposite: it is the broadest of the bunch.

It is called pm-skills, by Pawel Huryn, the author of the Product Compass newsletter. He packs almost the entire product management craft into around 68 skills across 9 plugins, from setting strategy, running discovery, and researching the market, to analyzing data, executing, and shipping software that AI wrote.

Usually something this broad ends up shallow. But when I actually opened it, it was not, and one skill in particular made me stop and look for a while, because it covers an angle that only recently became necessary in the era where AI writes code for us.

I will tell it in three parts, starting with what it is, then why it is not just a prompt box, and closing with lessons for anyone building products.

Terms, gathered once, right here

skill a ready-made set of instructions an AI agent (such as Claude Code) can invoke, like a shortcut that wraps one way of doing a task.
framework a ready-made way of thinking from the PM world, such as SWOT, JTBD, or RICE, that you once had to read a book to use well.
plugin (category) a group of skills that belong to the same topic, such as the discovery category or the go-to-market category.
PRD a product spec document that says what will be built, for whom, and how success is measured.

Part 1: What pm-skills is

It is a marketplace of around 68 Claude skills for PM, organized into 9 plugins, each one a craft within product work.

The thinking side: product strategy, product discovery, and market research, for setting direction and understanding the market and customers.
The doing side: data analytics and execution, which covers writing PRDs, prioritizing work, running a pre-mortem, and analyzing data.
The go-to-market side: go-to-market, marketing, and growth, for planning launches and scaling.
The new AI shipping side: a category built for the PM who now has to own software an AI wrote the code for.

What separates it from a bin of prompts is a validator that enforces a complete format for every skill: names that match the folder, complete file headers, correct cross-references. Nobody can just drop in whatever they like. This is a genuinely maintained set, not a pile an AI churned out.
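To make the three checks concrete, here is a rough sketch of what a format validator of this kind might look like. It is not the repo's validator: the header fields (`name`, `description`, `related`) and the skill names are assumptions made for illustration.

```python
def validate_skill(folder_name, header, known_skills):
    """Return a list of problems for one skill; an empty list means it passes."""
    problems = []
    # Check 1: complete header (field names here are hypothetical).
    for field in ("name", "description"):
        if not header.get(field):
            problems.append(f"missing header field: {field}")
    # Check 2: the declared name must match the folder it lives in.
    if header.get("name") and header["name"] != folder_name:
        problems.append(
            f"name '{header['name']}' does not match folder '{folder_name}'"
        )
    # Check 3: every cross-reference must point at a skill that exists.
    for ref in header.get("related", []):
        if ref not in known_skills:
            problems.append(f"cross-reference to unknown skill: {ref}")
    return problems


known = {"swot-analysis", "pre-mortem"}  # hypothetical skill folders
bad_header = {"name": "swot", "description": "Run a SWOT", "related": ["premortem"]}
for problem in validate_skill("swot-analysis", bad_header, known):
    print(problem)
```

Running a check like this on every change is what keeps a large set from drifting into a pile of mismatched files.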

Part 2: Why it is not just a prompt box

First, the frameworks are named and sourced. Each skill does not just say "go try some discovery." It builds on frameworks that have real owners, from Teresa Torres, Marty Cagan, Alberto Savoia, and Strategyzer, to SWOT, Porter, Ansoff, JTBD, and RICE, with step-by-step guidance. The author put it plainly in the intro:

"Generic AI gives you text. PM Skills Marketplace gives you structure."

That is where the difference lives. Ask for a SWOT once and you get a scaffold with fields to fill, not a broad paragraph of advice you still have to arrange yourself.

Second, one skill was built for the era of AI-written code. The one that made me stop is called intended-vs-implemented. It audits the gap between what the docs say a system should do and what the code actually does. The skill file puts it sharply:

"A linter checks code in a vacuum. It can tell you whether the code is internally consistent, but not whether it does what you intended, because it has no model of your intent. The highest-value security and correctness bugs live in that gap: a permission written down but never enforced, an endpoint documented as cron-only that anyone can call, a field marked public-only that still leaks private data."

This is the angle a PM now needs. Once AI writes code fast, the problem moves from "is the code written correctly" to "does the code do what we intended," and that is a question a person has to ask, not a tool.
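Once a person has written the intent down, comparing it with the code can be partly mechanical. A minimal sketch of that comparison, assuming both sides can be reduced to "which callers may reach which endpoint"; the endpoint names and caller labels are hypothetical, and this is not how the intended-vs-implemented skill itself works.

```python
def intent_gaps(documented, implemented):
    """Compare documented access rules with what the code enforces.

    Both arguments map an endpoint name to the set of callers allowed.
    Returns endpoints where the code allows callers the docs do not.
    """
    gaps = {}
    for endpoint, intended in documented.items():
        actual = implemented.get(endpoint, set())
        extra = actual - intended
        if extra:
            gaps[endpoint] = sorted(extra)
    return gaps


# Hypothetical example: an endpoint documented as cron-only that anyone can call.
documented = {"/jobs/cleanup": {"cron"}, "/profile": {"owner"}}
implemented = {"/jobs/cleanup": {"cron", "anonymous"}, "/profile": {"owner"}}

print(intent_gaps(documented, implemented))
```

The hard part stays with the person: the `documented` side only exists if someone stated the intent clearly enough to compare against.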

Part 3: Lessons for anyone building products

What pm-skills says to everyone building products, not only the people who will actually use it:

Knowing frameworks is no longer scarce. When SWOT, JTBD, and RICE are a second away, what separates the strong from the average is not how many frameworks you know, but choosing the one that fits the problem in front of you.
Audit intent, not just that it runs. Borrow the idea behind intended-vs-implemented: a pretty document or a passing demo does not mean the thing does what you intended. A person has to read that gap.
A maintained set is different from a churned-out one. The format validator and the sourced, named frameworks are what make this set trustworthy. When you meet a repo that packs skills, check whether a person maintains it or an AI just photocopied documentation.

Where to start

Try it on the PM work you already repeat.

Take one thing you do regularly, such as writing a PRD or running a competitive analysis.
Let the skill in the matching category help with the scaffold, then see which framework it builds on and whether that is the right one to use.
If it is something AI wrote the code for, apply the intended-vs-implemented idea and ask whether it actually does what the PRD said.
Spend the time you save on the scaffold on the decisions, not on doing more of the same work.

pm-skills is a clear picture of what is happening to PM work. The whole craft is becoming something you can call up almost for free. What is left for us is judgment: choosing right, auditing well, and making the call yourself. This is one in a series where I go through repos that pack a craft into skills, and it will close with a capstone that draws the thread through all of them.

Sources and references

pm-skills by Pawel Huryn (github.com/phuryn/pm-skills). The line "Generic AI gives you text. PM Skills Marketplace gives you structure." is from the README. The intended-vs-implemented description, the "a linter checks code in a vacuum" framing, is from that skill file directly.
The count "around 68 skills, 9 plugins" reflects the repo structure read on Jul 2, 2026 and may change by version. The frameworks cited (Torres, Cagan, Savoia, SWOT, Porter, JTBD, RICE) belong to their respective owners, not the repo.

Read the original, with the diagrams: https://productize.life/blog/pm-skills/en

One free week of Claude Code: https://claude.ai/referral/uurAE0WKHQ

The YC president open-sourced the stack he builds with. What it says about taste

yim_rei — Thu, 02 Jul 2026 12:27:14 +0000

Originally published on productize.life.

Quick answer: gstack is an open-source (MIT) skill set that Garry Tan, president of Y Combinator, builds with every day. It turns Claude Code into a team of 23 specialists (a CEO, engineers, designers, QA, and a release engineer) and forces every change through a multi-lens review before shipping. The point is not speed; it is taste written into software.

Last week I was going through several repos that collect skills for coding. Most share one theme: helping AI write code in a systematic way, and faster.

But one made me stop longer than the rest, called gstack, for two reasons. One: its owner, Garry Tan, president and CEO of Y Combinator, took the stack he actually builds with every day and opened it for free. Two: it does not sell "code faster," it sells "review before you ship."

Once I actually opened it, it was not just a toolbox but one of the clearest examples of an idea I have been interested in for a while. On the day AI can write code very fast, the bottleneck of the work is no longer speed.

I will tell it in three parts, starting with what it is, then what gstack believes, and closing with lessons for people who build products, not just people who write code.

Terms, gathered here in one place

agentic coding letting an AI agent run the coding work in its own steps, from planning to writing to review to shipping, not just autocompleting a line at a time.
skill a packaged set of instructions an AI agent (like Claude Code) can call, like a shortcut that wraps one way of doing one thing.
review lens reviewing one piece of work from several roles, for example as a CEO, an engineer, a designer.
taste the sense and judgment of what is good and what is bad, what to build and what not to ship. The part that is still human.

Part 1: What gstack is

Garry Tan describes gstack in the README plainly, as the way he works.

"It turns Claude Code into a virtual engineering team: a CEO who rethinks the product, an eng manager who locks the architecture, a designer who catches AI slop, a reviewer who finds production-grade bugs, a QA lead who opens a real browser, a security officer who runs OWASP and STRIDE, and a release engineer who ships the PR. Twenty-three specialists and eight power tools, all slash commands, all Markdown, free, MIT-licensed."

Read it and you see it is not a single "coding assistant" but a team with clearly divided roles. What matters more than the headcount is that it forces the work through a multi-lens review before shipping. The CEO lens asks whether this is really a 10-star product or a 3-star one dressed up. The engineer lens asks whether the architecture holds up to the edge cases. The design lens asks whether you even know what "good" looks like. The devex lens asks how easily someone else can pick it up and build on it. Then it closes with steps to test, ship the PR, and watch after deploy.

Part 2: What gstack believes

gstack does not just tidy up the tools. It has a stance, and that stance is the heart of what makes it interesting. Garry writes in the ETHOS file:

"The engineering wall has fallen. What remains is taste, judgment, and the willingness to do the complete thing."

He claims that this year he ships code hundreds of times faster than he did 13 years ago (his own figure, self-measured, not an independent benchmark). But his point is not about speed. It is that once writing costs almost nothing, deciding what to build, and refusing to ship slop, becomes the whole job.

You can see it in the principle he calls Boil the Ocean. The old advice was "don't do the complete thing," because engineer time was expensive. But once that cost is gone, cutting corners and fixing later turns into debt. If the complete version takes only a few more minutes, do the complete thing.

The principle I like most, though, is that a person has to be the one who decides. gstack puts it this way:

"AI models recommend. Users decide." and "Two AI models agreeing on a change is a strong signal. It is not a mandate."

This is exactly why its multi-lens review pipeline matters. Each lens exists to challenge the work, not to rubber-stamp it, and the person is the one who chooses which challenge lands. Put another way, gstack is taste and the priority of decisions written into software. Most teams skip these lenses almost entirely; solo builders skip all of them. He brought them back as required gates.

Part 3: Lessons for product builders

Even if you do not write code for a living, three things here are worth stealing.

Give yourself review gates, even working alone. A team argues by nature; a solo builder has no one to argue back. Set up two or three review lenses and have the AI play each role to question your own work: the business angle, the user angle, the person who has to maintain it later. It really does catch what you overlook.
Do the complete thing, now that completeness is cheap. What you used to cut short for lack of time now costs far less to finish. Don't ship a rough version when the complete one takes only a little longer.
Keep the final call for yourself, especially when the AI looks confident or two models agree. That is the moment that tempts you to let go, which is exactly the moment to be most careful.

Where to start

You do not have to install all of gstack. Try borrowing just the ideas first.

On the next thing you are about to ship, have the AI play a single role, for example "review this as a CEO," and see what it asks that you had not thought of.
When you hit a spot you want to cut short, ask yourself how much longer the complete version really takes. If it is only a few minutes, do the complete thing.
Next time the AI is firmly certain, practice pausing to ask yourself first: do I agree because it is right, or because it is confident?

That Garry open-sourced his own stack is not a small thing. It is a signal from someone who has watched thousands of startups, that the bottleneck in building has moved, from "can you write the code?" to "can you decide what to build, and keep the slop out?" This is one in a series where I have been going through repos that pack expertise into skills. More are coming, and a pillar piece that draws the lines between all of them will close it out.

Sources and references

gstack by Garry Tan (github.com/garrytan/gstack, MIT). The quotes "The engineering wall has fallen...", "AI models recommend. Users decide.", and "Two AI models agreeing on a change is a strong signal. It is not a mandate.", along with Boil the Ocean, come from ETHOS.md. The description of the 23-role team comes from README.
The productivity figure (hundreds of times faster than before) is Garry's own measurement, stated in README/ETHOS. It is not an independent benchmark, and not a number we measured.

Read the original, with the diagrams: https://productize.life/blog/gstack/en


AI Data Privacy: A Three-Layer Defense for Using AI Without Leaking Secrets

yim_rei — Tue, 30 Jun 2026 14:17:59 +0000

One time we opened a daemon's log on the box, saw that some secret values were mixed into it, and decided to scrub them first with a short command that matched the pattern KEY= and TOKEN= and replaced each with the word redacted. We thought that was the end of it.

Then the text came out and two real key values were sitting right there in the transcript. The ones that leaked were named ANTHROPIC_API_KEY_FALLBACK= and CLAUDE_CODE_OAUTH_TOKEN_BAD=. Their names end in _FALLBACK= and _BAD=, not KEY= or TOKEN=, so the pattern never caught them.

The secret did not leak to some hacker. It leaked into our own transcript, a place we were not counting as "outside" at that moment, even though that is exactly what it is. A secret printed into a log has to be treated as leaked right away, and the key has to be rotated.

The lesson here is not that the command was written wrong. It is that data leaks at the seams you forgot to label, not at the spot you were already watching.

Part 1: A three-layer defense: data can flow down, but never up

The AI we all use day to day keeps its knowledge base on someone else's cloud. You ask it something and the model answers from the broad knowledge of the world. No problem there. Public knowledge flowing down to you is fine.

The trouble starts when data flows the other way. The moment you feed your private data up, the model is not just reading it to answer you. That chunk has already left your machine and now lives in someone else's system. So how do you know what is safe to send up and what is not? The approach we use is to split data into three layers.

The outer layer is the cloud, the world knowledge the model already has. Ask it general things freely. None of your own data is in there.

The middle layer is shareable work data: public documents, project context that is already open. If it flows up to the cloud, that is still acceptable.

The inner layer is secrets: financial data, health data, client data, credentials. This layer has one rule. It never flows up to the outer layer, period.

The whole point of these layers is one-way flow. Knowledge from the outer layer can flow down and help your work all it wants, but anything in the inner layer never flows back up. Once you set the direction like this, a hard question such as "can I feed this to the AI?" becomes a much easier one: which layer does this belong to?
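The one-way rule is small enough to write down as code. A minimal sketch, assuming each piece of data carries a label you can look up; the labels and the lookup table are hypothetical, and treating anything unlabeled as inner layer is a choice made in this sketch.

```python
from enum import Enum


class Layer(Enum):
    OUTER = "cloud knowledge"
    MIDDLE = "shareable work data"
    INNER = "secrets"


def classify(label, table):
    """Look up a label; anything unlabeled is treated as inner layer."""
    return table.get(label, Layer.INNER)


def may_flow_up(layer):
    """Inner-layer data never goes up to the cloud; the other layers may."""
    return layer is not Layer.INNER


# Hypothetical classification table.
table = {"public docs": Layer.MIDDLE, "client list": Layer.INNER}

for label in ("public docs", "client list", "mystery export"):
    verdict = "ok to send" if may_flow_up(classify(label, table)) else "blocked"
    print(f"{label}: {verdict}")
```

The question "can I feed this to the AI?" becomes a lookup: find the layer, then apply one rule.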

Part 2: The leaks you don't think about

The most obvious leak is pasting secret text straight into an AI chat. Most people already guard against that. The sneakier leaks are the automatic "processing" steps you never see.

Take one close to home: the memory system our own agent actually runs, Graphiti. Each time it records a chunk of work, a small model reads the raw contents of that chunk to extract the facts. That "read to extract" step is exactly where data leaves the machine (egress). In the stack we actually run, that small model does its reading in the cloud, so your raw content has already gone up at that moment, even though all you did was tell it to "remember." You never meant to send anything up. That is why we treat "which model does the reading, and where it runs" as an inner-layer question.

The last one is logs and transcripts, like the leaked key at the start. The data does not escape to a criminal. It just pools in a place you never classified as the outer layer, even though that is what it is. Anyone who can read that log can see everything in your inner layer.

Part 3: Make it a real defense, not just an intention

Judgment that runs on instinct does not count as a defense. Three things turn it into a real one.

Separate data by layer at design time, not afterward. Building a system that holds real personal data taught us that separating data by purpose has to be structural from day one, because each kind of data has different rules. Some can be deleted the instant the owner withdraws consent. Some the law requires you to keep for 5 to 7 years. Pile them all on one table and the moment you have to delete, the whole pile breaks.

Set it to fail closed. If you cannot prove where a chunk of data will end up, assume it is going to flow up to the outer layer and block it. Do not wave it through because "no problem has shown up yet." It is the same rule as access control: if you cannot confirm who has permission, deny first. Safer than guessing and opening the door.

Filter data by what it is, not what it looks like. Go back to the leaked key. The root cause was filtering by the character pattern KEY= instead of by the meaning "this variable is a secret." Once you classify things by what they actually are, an odd name like _FALLBACK can no longer slip through the net.
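The difference between the two filters shows up in a few lines. This is a sketch, not the command we actually ran: the `NAIVE` pattern is a reconstruction of the idea, the allowlist `SAFE_NAMES` is hypothetical, and the key value is a made-up placeholder. The second filter redacts every variable assignment unless the name is known to be safe.

```python
import re

# Reconstruction of the leaky idea: redact only names ending in KEY= or TOKEN=.
NAIVE = re.compile(r"(KEY|TOKEN)=\S+")


def scrub_by_pattern(line):
    """Filter by what it looks like: misses names such as ..._KEY_FALLBACK=."""
    return NAIVE.sub(lambda m: m.group(1) + "=redacted", line)


SAFE_NAMES = {"LOG_LEVEL", "PORT"}  # hypothetical allowlist of non-secret variables
ASSIGNMENT = re.compile(r"\b([A-Z][A-Z0-9_]*)=(\S+)")


def scrub_by_meaning(line):
    """Filter by what it is: every assignment is a secret unless listed as safe."""
    def repl(match):
        name = match.group(1)
        return match.group(0) if name in SAFE_NAMES else f"{name}=redacted"
    return ASSIGNMENT.sub(repl, line)


line = "PORT=8080 ANTHROPIC_API_KEY_FALLBACK=sk-example-123"  # placeholder value
print(scrub_by_pattern(line))  # the fallback key survives untouched
print(scrub_by_meaning(line))  # the fallback key is redacted, PORT is kept
```

The allowlist version fails in the safe direction: a new variable with an odd name gets redacted until someone decides it is harmless.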

The actual wiring that makes the three layers enforce themselves automatically is another part of the implementation we are holding back for now. But the three principles above you can use today, with no special tooling required.

Part 4: One question that works on any tool

Before you wire a new AI tool into your work, ask it a short question: where does the text I type end up, and which layer is that? If it cannot answer, do not feed anything from your inner layer into it yet. This question works on everything, from a plain chat to a memory system humming quietly in the background. Try it on the tools you already use every day.

Written from real work. Full version (and a Thai edition) at productize.life.

AI Coding Agents: The Expensive Part Isn't the Agents

yim_rei — Tue, 30 Jun 2026 14:15:12 +0000

I ran three AI coders in parallel last week, each on its own file. Three pull requests came back in minutes.

Then I checked the bill. It had barely moved.

We'd expected running several AI agents at once to burn a fair bit. The numbers said otherwise, and once we dug into why, we hit the thing that changed how we think about putting AI to work.

1. The coders aren't the real cost

Most people picture running a lot of AI codegen as a per-token bill that climbs and climbs, because the mental image is a metered API. But the coders we run don't connect that way. We run them through an agent framework that decides which engine drives each coder, where the engine is just the model doing the real work behind it. The engine we plug in logs in through a monthly subscription we already pay for, not an API key billed per token. So whether we launch one coder or three at once, the cost per round barely changes.

So where does the real cost go? To us. Early on we babysat the coders, ssh-ing in to check status every few seconds. That back-and-forth was what ate the resources, not the coders. The fix wasn't fewer coders, it was to stop babysitting and let a dispatcher hand out the work and close it off automatically. People set the problem and make the calls; people don't sit and stare.

Once you see that, running several coders in parallel stops being the extravagance you feared. The coders are cheap and replaceable. The expensive thing is human time and attention.

2. Cheap and parallel means you need a checker who isn't the writer

Once coders are cheap and you can launch several at once, the risk moves. It isn't about money anymore, it's about quality. Output that comes fast and in volume slips defects through just as easily if nobody checks first.

The path we chose: everything a coder writes goes through a separate reviewer first. The reviewer has one job, to read and flag, never to write. Because the moment the writer and the checker are the same agent, it glosses over the exact spots it just missed. Splitting the reviewer out gives you a second pair of eyes that isn't attached to what was just written.

What we'd like to do better: the reviewer should run a different engine from the writer, so it doesn't share the same blind spots. We tried switching the reviewer to a different engine, but it hung silently inside the system and never produced a single review, so we fell back to the engine we knew worked. A different engine is the right target, but something that actually works now beats an ideal that isn't stable yet.

3. A human keeps merge authority, and the work stays isolated

Coders running in parallel, a reviewer in between, and there's still one last line to draw: who presses merge into the main code. Our answer is a person, not any AI. The coder writes, the reviewer flags, but a human decides whether it actually goes in. We call that last gate Gate A, not because we distrust the AI, but because taking code into the main line is a hard-to-reverse decision, and someone should own that spot every time.

The other thing to set up the moment you run several at once: each coder needs its own workspace, not all writing over the same files. Let several coders edit the same space at once and the work tangles. That's the dispatcher's job, to fence off space for each one up front, not to patch collisions after the fact.
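Fencing off space up front can be as simple as one directory per coder, created before any work is handed out. A minimal sketch, assuming plain directories are enough; the function name and coder names are hypothetical, and in a git repo a separate clone or worktree per coder would play the same role.

```python
import tempfile
from pathlib import Path


def fence_workspaces(coder_names, root):
    """Create one empty directory per coder under root and return the mapping."""
    workspaces = {}
    for name in coder_names:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=False)  # refuse to reuse an existing space
        workspaces[name] = path
    return workspaces


with tempfile.TemporaryDirectory() as root:
    spaces = fence_workspaces(["coder-1", "coder-2", "coder-3"], root)
    assert len(set(spaces.values())) == 3  # no two coders share a directory
    for name, path in spaces.items():
        print(name, "->", path)
```

Because `exist_ok=False` raises if a directory already exists, a collision shows up at dispatch time instead of as tangled edits later.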

Start small: the shape you can copy

If you take one thing away, take this: running a fleet of AI coders isn't as expensive as you fear, because the coders aren't the real cost. The expensive things are human time and the defects that escape, and both are solved by the same shape.

Start small. You don't need the whole fleet on day one.

One coder, running on a subscription you already pay for, instead of a per-token API.
One reviewer that isn't the writer, reading and flagging first, every time.
A human holding merge authority, with a dispatcher handing out the rest of the work instead of you watching it.
Once that shape is stable, add coders one at a time. Because what lets you run a whole fleet without losing sleep isn't smarter coders, it's the checker and the human who still holds the decision.

Written from real work. The full version, with a Thai edition, lives at productize.life.