DEV Community: Alex Cloudstar

Claude Opus 4.8 Is Here: Benchmarks, Dynamic Workflows, and Whether to Upgrade From 4.7

Alex Cloudstar — Fri, 29 May 2026 11:36:47 +0000

Anthropic dropped Claude Opus 4.8 yesterday, May 28. Same playbook as the last few releases. No waitlist, no staged rollout. It showed up in Claude Code, the API, and the major cloud providers on the same day, with the model ID claude-opus-4-8 ready to drop into existing config.

I have been running Opus 4.7 as my default coding model since it launched in April. It handled my agentic coding sessions, my content pipeline, and most of my production debugging. So the first thing I did with 4.8 was throw the exact same hard tasks at it that I used to stress-test 4.7, then dig into the official announcement to separate the real changes from the launch-day polish.

Here is what I found after a day of hands-on use.

The Headline: It Stopped Lying to Me About My Code

The benchmark Anthropic led with is not a coding score or a reasoning score. It is honesty. Opus 4.8 is roughly 4x less likely than 4.7 to let a code flaw pass unremarked.

That number sounds abstract until you have lived the failure mode it describes. You ask a model to review a function. It tells you the function looks good. You ship it. It breaks. The model did not miss the bug because it was incapable of seeing it. It missed it because the path of least resistance in a review is to agree with you and move on.

Opus 4.8 does this far less. In my testing yesterday, I deliberately fed it three functions I knew had subtle problems. An off-by-one in a pagination helper, a race condition in a debounced save, and a silent error swallow in a fetch wrapper. 4.7 caught the off-by-one and missed the other two on the first pass. 4.8 flagged all three, and on the error swallow it specifically called out that the empty catch block would hide failures in production, which is exactly the kind of thing my global rules tell it to watch for.

This is the change that matters most for daily work, and it is the hardest to capture in a single number. A model that reliably tells you when something is wrong is worth more than a model that is marginally smarter but agreeable. The whole point of AI code review is catching what you missed. A model that rubber-stamps your mistakes is just a more expensive way to feel confident about broken code.

What the Benchmarks Actually Say

Anthropic published the usual comparison chart showing Opus 4.8 ahead of 4.7 across coding, agentic skills, reasoning, and practical knowledge work. The improvements are real but mostly incremental on the pure-coding side. The bigger jumps are in agentic and tool-use territory.

Here are the numbers worth knowing.

Benchmark	What it measures	Opus 4.8 result
Online-Mind2Web	Computer use, real web tasks	84% (ahead of 4.7 and GPT-5.5)
Legal Agent Benchmark	All-pass legal reasoning	First model over 10% on the all-pass standard
Code flaw detection	Catching bugs in review	~4x fewer missed flaws vs 4.7
Tool calling	Steps to complete a task	Fewer steps for equivalent intelligence

The Online-Mind2Web score is the one I would not have predicted. Computer use, the ability to drive a real browser and complete multi-step web tasks, has been the weakest part of every frontier model I have used. 84% is the first time the number has been high enough that I would actually trust it for low-stakes automation. It is still not something I would point at my bank, but for filling forms, navigating dashboards, and pulling data out of web apps that lack an API, it crossed the line from demo to useful.

The Legal Agent Benchmark result is a niche flex, but it signals something broader. Breaking 10% on an all-pass standard, where the model has to get every sub-task in a legal workflow correct or the whole thing fails, means the error rate on long multi-step chains dropped enough to matter. That same reliability shows up in coding agents that have to chain twenty tool calls without going off the rails halfway through.

Dynamic Workflows: The Feature I Did Not Know I Needed

The flashiest addition is Dynamic Workflows, shipping as a research preview in Claude Code. The pitch is that Claude can now spin up hundreds of parallel subagents and coordinate them on a single task. The headline use case is codebase-scale migrations, the kind that touch hundreds of thousands of lines.

I was skeptical. Parallel subagents have been a thing for a while, and in practice they tend to step on each other, duplicate work, or produce inconsistent results that take longer to reconcile than doing the work serially would have. So I tried it on a real job: migrating a mid-sized project from one date library to another, across about 60 files with inconsistent usage patterns.

The old way, even with agentic coding, was a slog. One agent, one file at a time, me babysitting context and re-explaining the pattern every few files as the conversation drifted.

Dynamic Workflows handled it differently. It scanned the codebase, grouped the files by usage pattern, fanned out a batch of subagents to transform each group in isolation, and then ran a verification pass to reconcile the edits. The whole thing finished in one sitting. Not every file was perfect. I caught two cases where it picked the wrong replacement function. But the wall-clock time was a fraction of the serial approach, and the consistency across files was better than I get when I do migrations by hand and forget my own convention by file 40.

The honest read is that this is genuinely new leverage for a specific kind of work. Large mechanical migrations, sweeping refactors, repo-wide audits. It is not magic for creative architecture decisions, and you still have to review everything it touches. But for the work that used to eat a full day of tedious repetition, it is the first tool that made me feel like the agent was actually operating at the scale of the codebase rather than the scale of a single file.

If you have wrestled with agent reliability at scale, the interesting part is how the verification pass cleans up after the fan-out. The subagents are allowed to be imperfect because a final reconciliation step catches the divergence. That is a better architecture than hoping every parallel agent gets it right independently.

Effort Control Comes to the Consumer Apps

Opus 4.7 introduced the xhigh effort level for developers. Opus 4.8 takes the idea and exposes it directly to users on claude.ai and Cowork through a setting called Effort Control. You pick how much compute Claude applies to a request. Higher effort means deeper thinking, more tokens spent, slower but more thorough answers.

By default, 4.8 runs at high effort. Anthropic tuned the default so it spends roughly the same number of tokens as 4.7's default while delivering better results, which is the kind of efficiency win that does not show up in a headline benchmark but shows up on your bill.

In practice, I leave it on high for almost everything and bump it up only for genuinely hard problems. A gnarly debugging session where the bug spans three systems, an architecture decision with real tradeoffs, a piece of analysis where I want the model to actually sit with the problem. For quick edits and lookups, high is already more than enough, and dropping the effort makes the response snappier without a quality hit I can notice.

The thing I appreciate is that this makes the cost and latency tradeoff explicit instead of hidden. You are no longer guessing whether the model is thinking hard. You are deciding.

Pricing: Nothing Changed, and That Is the Story

Opus 4.8 costs the same as 4.7. Five dollars per million input tokens, twenty-five per million output. The model got better and the price stayed flat.

That is worth pausing on. We have gotten so used to capability going up while price holds or drops that it barely registers as news anymore. But it is the entire reason the economics of building AI features keep improving. Every release that holds price while raising the capability floor means the same product gets cheaper to run in real terms, because you can do more with fewer tokens or accomplish a task that previously needed a more expensive workaround.

Tier	Input (per 1M)	Output (per 1M)
Standard	$5	$25
Fast mode	$10	$50

The Fast mode pricing is the genuinely new line. At $10 input and $50 output it runs at about 2.5x the speed of standard, and Anthropic says it is roughly 3x cheaper than the previous fast mode. For latency-sensitive paths, where you previously had to drop down to a smaller model and accept the quality hit, you can now keep Opus-class quality and just pay a premium for speed. That changes the calculus for anything user-facing where response time affects conversion.

If you are still mapping out your spend across plans and API usage, my Claude pricing survival guide walks through how to think about the tradeoffs, and the fast-mode change tilts a few of those decisions.

A Quiet API Change With Real Consequences

Buried in the announcement is a Messages API change that most people will skim past. The API now accepts system entries mid-conversation without breaking prompt caching.

If you have built anything serious on the Claude API, you know why this matters. Prompt caching is how you keep costs sane on long conversations and agent loops. The moment you inject a new system instruction partway through a conversation, the old behavior was to invalidate the cache from that point forward, which meant you ate the full cost of reprocessing the prefix.

Being able to slot in a system entry mid-conversation without busting the cache means you can steer the model dynamically, injecting fresh instructions or context as a task evolves, without paying the caching penalty every time. For agent architectures that adjust their own instructions based on what they discover, this removes a real cost cliff. It is the kind of plumbing change that does not get a benchmark but quietly makes a whole class of designs cheaper to run.

This pairs well with the broader push toward context engineering as the discipline that actually separates good agent performance from bad. The cheaper it is to manage context dynamically, the more aggressively you can do it.

How It Stacks Up Against the Competition

The frontier model race has not fundamentally reshuffled. When I last did a full Claude vs GPT vs Gemini breakdown, the picture was that the models converge on baseline capability and diverge on specific strengths. Opus 4.8 widens Claude's lead in the places it was already strong rather than opening a new front.

On coding, Claude was already my default, and 4.8 reinforces that rather than dramatically extending it. The pure code-generation improvement over 4.7 is modest. The real gap-widener is the reliability and self-correction, the 4x fewer missed flaws, which competitors have not matched in my testing.

On computer use, the 84% on Online-Mind2Web puts Opus 4.8 ahead of GPT-5.5 on that specific benchmark, which is notable because browser automation has been an area where the gaps between frontier models were small and noisy. A clear lead there is new.

On reasoning and multimodal breadth, the competitive story has not changed much. If raw reasoning scores or native audio and video are your priority, the calculus from a few months ago still holds. Opus 4.8 did not show up to win those categories.

The summary I would give a teammate: if you do agentic work, coding, or any task where the model has to chain many steps and catch its own mistakes, Opus 4.8 extended an existing lead. If your work lives in the categories where Claude was already the second choice, this release is not the one that changes your mind.

Should You Upgrade From 4.7?

Here is how I would think about it depending on where you sit.

If you use Claude Code on a Pro or Max plan: You already have access. Switch to 4.8 and run it on your current work. The self-correction and reliability improvements are real and the transition is seamless. Try Dynamic Workflows on the next migration or sweeping refactor you have been putting off, since that is where the new leverage actually shows up.

If you run Opus 4.7 in production via the API: The swap to claude-opus-4-8 is low risk because the pricing and core behavior are stable, but test anyway. The improved instruction-following and the more aggressive flaw detection can change outputs in ways your downstream code might not expect, especially if you parse the model's review comments. If you have an eval suite, run it before you flip the model ID. This is exactly the kind of update your evals exist to catch.

If you are on GPT-5.5 or Gemini for primary work: The coding, tool-use, and computer-use gaps just widened in Claude's favor. If you have been on the fence about Claude for agentic development, this is the strongest case yet. If reasoning depth or multimodal breadth is your main concern, the competitive picture has not moved enough to force a switch.

If you are new to AI coding tools: Start with Claude Code and Opus 4.8. The combination of strong coding, better long-session reliability, explicit effort control, and a model that actually tells you when your code is wrong makes it the most forgiving entry point for getting serious about AI-assisted development.

What Anthropic Is Hinting At Next

The announcement closes with two forward-looking notes that are worth reading carefully.

First, Anthropic says lower-cost Opus-equivalent models are in development. If that lands, it pulls Opus-class capability down into a price bracket where you could run it on high-volume, cost-sensitive paths that currently force a downgrade to a smaller model. That would be a bigger deal for production economics than anything in this release.

Second, and more loaded, Mythos-class models are coming to all customers in the coming weeks, pending cybersecurity safeguards. I wrote about Claude Mythos when it was a restricted research preview that scored 93.9% on SWE-bench and 100% on Cybench, the model Anthropic decided was too capable to release. The fact that a Mythos-class model is now being lined up for general availability, gated on safety work rather than capability, is the most interesting sentence in the entire announcement.

It tells you the gap between what these labs can build and what they choose to ship is still the binding constraint, and that the constraint is loosening. Opus 4.8 is an excellent model that Anthropic is comfortable handing to everyone today. Something meaningfully more capable is being prepared for release the moment the safety story is solid enough.

The Bigger Picture

What strikes me about Opus 4.8 is not any single benchmark. It is the shape of the improvement.

The last few releases chased raw capability. Higher SWE-bench, sharper vision, longer context. 4.8 spent its gains differently. It got more honest, more reliable across long chains, better at catching its own mistakes, and more efficient with tokens, all while holding the price flat and adding a way to coordinate work at the scale of an entire codebase.

That is what a maturing tool looks like. Not a model that is dramatically smarter than the one before it, but one that is more trustworthy in the moments that actually cost you time. A model that flags the bug instead of waving it through. A model that finishes the migration instead of drifting halfway through. A model that lets you decide how hard it should think instead of guessing.

For the work I do every day, that is more valuable than another few points on a coding benchmark. Opus 4.7 was the best model I had used for shipping software. Opus 4.8 is better in the specific places that matter when you are the one responsible for what ships.

If you are already on 4.7, the upgrade is easy and the wins are real. Switch, throw Dynamic Workflows at the migration you have been dreading, and see how it feels to have a model that argues with you when you are wrong. That is the part that does not show up in the chart, and it is the part you will notice first.

Writing Your Own Claude Code Skill in 2026: The Practical Guide

Alex Cloudstar — Fri, 22 May 2026 06:53:26 +0000

I noticed I was typing the same five paragraphs into the Claude Code prompt every morning. The paragraphs explained how I wanted PR reviews done on my own projects: which directories to weight, which patterns I cared about, what to ignore, what to flag, what to never suggest. Five paragraphs, every morning, sometimes twice.

The third time I noticed it, I copied the paragraphs into a SKILL.md, dropped it in my plugin directory, and added a one-sentence trigger description. The next morning I typed "review this branch." Claude pulled the skill in automatically and produced exactly the review I had been asking for by hand. I have not pasted those paragraphs again.

That was the moment I understood that skills are not for sharing with the world. They are for sharing with future-you. The marketplace is great, but the highest-leverage skills you will ever use are the ones you write for your own projects, your own quirks, your own opinions about how the work should be done. This post is the missing manual for writing them.

If you have not seen the Claude Code plugin marketplace post, read that first for the broader context. This one is the hands-on follow-up: how to actually write a skill, where they live, what makes them fire, and the patterns that keep tripping up developers who try.

Why You Should Write Your Own Skills

There is a category of work that you do over and over again, that has a clear right answer, and that you have to re-explain to the model every time. That is the skill-shaped hole in your workflow. Pattern-match for it.

Common examples from my own machine.

A skill that knows how my Astro blog is structured, where posts live, what the frontmatter schema is, and how internal links should look. Saves me a paragraph of context every time I write a post.

A skill that turns "review the work" into a deep verification pass with a specific checklist, instead of the generic "looks good to me" review the model produces by default.

A skill that estimates how long a ticket should take, in story points, using the rubric my team has agreed on. This used to be a paragraph I copy-pasted from a Notion doc. Now it is a slash command.

A skill that knows my testing conventions and refuses to suggest tests that mock the database, because we agreed not to mock the database two quarters ago and the model keeps forgetting.

None of these are interesting to the world. All of them save me actual time, every week, on specific work I am going to do anyway. That is the skill bar. Useful to you, repeatedly, on real work. Everything else is a hobby.

The other reason to write skills, separate from the time savings, is that the model's default behavior is a compromise designed to please everyone. Your work is not "everyone." Your work is yours. Every skill you write pushes the model's behavior closer to what you actually want, and away from the average of what everyone wanted in the training data. That distance, accumulated across a dozen skills, is what makes the difference between "the AI is a smart intern" and "the AI is a smart colleague who already knows how we do things here."

The Anatomy Of A Skill

A skill is a directory with a SKILL.md at the root. That is it. Everything else is optional. The directory can also contain supporting files (scripts, references, examples, additional Markdown chunks the skill points to) but only the SKILL.md is required.

The SKILL.md has frontmatter and a body. The frontmatter looks like this.

---
name: review-blog-post
description: Review a draft blog post for voice consistency, em-dash usage, and SEO hygiene before publishing. Use when the user mentions "review the post," "check the draft," or pastes a blog post for feedback.
---

Two fields. Name is the slug. Description is the trigger. The description is doing roughly ninety percent of the work and most people get it wrong on the first try. I will come back to that.

The body of the SKILL.md is the instructions the model reads when the skill fires. It is plain Markdown. You can use headings, bullets, code blocks, examples, anything that helps the model do the work. The body is loaded into context as system instructions for the turn, so treat every word as load-bearing.

A skill can also live inside a plugin. A plugin is a directory with a plugin.json and one or more skills inside a skills/ subfolder. Plugins are the unit that the marketplace ships. For personal use you do not need a plugin wrapper at all. Just drop the skill into ~/.claude/skills/ and it will fire. If you want to share the skill or use it across multiple projects with version control, wrap it in a plugin and ship it through your own Git-backed marketplace, the way the marketplace post walks through.

The Trigger Description Is The Whole Game

If you take one thing from this post, take this. The skill's trigger description is the single most important line you will write, and it is the line developers most consistently underthink.

The description is what the model reads at the top of every turn to decide whether to load the skill. The model sees a list: skill name, one line of description, skill name, one line of description. Maybe forty entries. It picks the ones that match the current task. Your description is competing with thirty-nine others for the model's attention. If your description is vague or generic, it loses, and your skill never fires.

What a bad description looks like.

description: Helps with code reviews.

What a good description looks like.

description: Review code for the alexcloudstar.com Astro blog. Use when the user runs /review, asks to review a PR or branch, or pastes a diff and asks for feedback. Checks voice consistency, em-dash usage, internal-link health, and MDX frontmatter schema.

The good one names the triggers (/review, "review a PR or branch," pasting a diff), describes the scope (this blog specifically), and lists the checks. The model now has a clear handle on when this skill applies and when it does not. The bad one matches "code review" abstractly and ends up either firing on every code-review-shaped task and confusing the model, or never firing at all because something more specific outranks it.

The pattern I use.

{What it does in a sentence}. Use when {trigger phrase 1}, {trigger phrase 2}, or {trigger phrase 3}. {Optional scope note}.

If a skill could plausibly fire on anything code-related, you have not been specific enough. Tighten the description until it fires on the cases you want and stays silent on the rest. The wrong skill firing at the wrong time is worse than no skill firing at all, because it wastes the turn.

Writing Your First Skill: A Walkthrough

Let me walk through a concrete example end to end. The goal: a skill that codifies how I want commit messages written on my projects.

I open a terminal and run.

mkdir -p ~/.claude/skills/conventional-commits
cd ~/.claude/skills/conventional-commits

I create SKILL.md.

---
name: conventional-commits
description: Write a Git commit message in the project's conventional-commits style. Use when the user runs /commit, asks to "write a commit message," or finishes a change and asks what to commit. Generates the message; never runs git commit unless explicitly asked.
---

# Conventional Commits

You are writing a commit message for code changes the user has made. Follow these rules.

## Format

`<type>(<scope>): <subject>`

- type: one of feat, fix, refactor, chore, docs, test, perf, style, build, ci
- scope: optional, lowercase, one or two words max, no spaces (use hyphens)
- subject: imperative mood, lowercase, no trailing period, under 72 characters

## Body

- Wrap at 72 characters
- Explain the **why**, not the **what**. The diff already shows the what
- Reference related issues with `Refs: #123` or `Closes: #123` on their own line
- Skip the body entirely for trivial commits

## Hard rules

- Never use em dashes. Use commas, periods, or restructure the sentence
- Never add Co-Authored-By unless the user explicitly asks for it
- Never run `git commit` yourself. Output the message and let the user decide
- If the diff touches more than one logical change, suggest splitting into multiple commits rather than writing one big message

## Examples

`feat(blog): add MCP server tutorial post`

`fix(rss): drop trailing slash in canonical URL`

`refactor(content): split shared frontmatter into a helper`

That is the skill. It took five minutes. I save the file, restart Claude Code (or run /plugin reload), and from the next session forward, when I finish a change and say "what should I commit," the model produces a message in exactly the format I want. No back and forth, no reminders, no "actually I prefer the subject in lowercase."

The reason this works is the description. It names three trigger phrases I actually use. It scopes the skill to commit-message writing only. It mentions a guardrail (does not run git commit) that disambiguates this skill from any general "make a commit" skill that might exist. The model loads it when relevant and ignores it the rest of the time.

The CLAUDE.md To Skill Promotion Path

The fastest way to write good skills is to not start with skills. Start with CLAUDE.md.

When I am working on a new project, every instruction I find myself giving the model more than once goes into CLAUDE.md first. The file lives in the project root, loads automatically on every turn, and is the lowest-friction way to capture a working pattern. Most projects need ten to thirty lines of CLAUDE.md and that is the right amount.

The problem with CLAUDE.md is that it loads on every turn. As it grows past a couple hundred lines, the model spends real context on it whether the current task is relevant or not. That is when you promote.

Pick a section of CLAUDE.md that only matters for a specific class of task. Move it to a skill. Write a description that targets exactly that class of task. The CLAUDE.md keeps the always-relevant rules. The skill carries the sometimes-relevant ones. The model loads the sometimes-relevant ones only when they apply.

This promotion path is the most reliable way to write skills that actually fire well, because the body of the skill is content you have already validated by using it as CLAUDE.md instructions. You know it works. You know the model follows it. You are just changing the trigger surface from "always" to "when relevant."

Do not skip the validation step. Skills written from scratch, without a real-use loop, are almost always too abstract. The model ignores them in practice because the descriptions sound right but the body says nothing it cannot infer on its own. Skills written by promotion are concrete, opinionated, and load-bearing.

The same pattern applies to skills you find in the marketplace. Install the one that almost fits. Use it for a week. Note where it does the wrong thing. Fork it, write your own copy in your local ~/.claude/skills/, and tune the body to match your project. The forked skill outperforms the original every time, because the original was the average of what worked for everyone and yours is exactly what works for you.

Skill Versus Subagent Versus Hook Versus MCP

This is the matrix that confused me for months. Let me draw it cleanly.

A skill is structured instructions the model reads when the trigger matches. Pure prompt. Runs in the same turn. No external process. Fires when relevant, sits silent otherwise. Use for: codifying how you want the model to approach a class of task.

A subagent is a separately-spawned model invocation, with its own context window, that returns a summary to the main agent. Use for: research, broad code exploration, parallel work, or any case where you want to protect the main context from a lot of intermediate output.

A hook is a shell command the harness runs in response to a lifecycle event. Runs outside the model. Use for: enforcing rules the model cannot enforce on its own (formatters, linters, tests, audit logs). The right tool when the requirement is "every time X happens, Y must run, regardless of what the model decides."

An MCP server exposes tools to the model that it can call mid-turn to read or modify external systems. Use for: any case where the model needs to touch a system it does not natively know about (your database, your ticketing system, your build pipeline). See the MCP server building guide for how to write one.

The honest matrix.

"I want the model to do X consistently" → skill
"I want to delegate X without polluting context" → subagent
"I want X to happen every time, no matter what" → hook
"I want the model to read or modify Y" → MCP server

A well-tuned project uses two or three of these together. A skill that tells the model how to use an MCP server is the canonical pairing. A hook that runs a test suite after every code edit, paired with a skill that explains how to interpret the test output, is the second-most-common pairing. The primitives compose.

What you should not do is reach for an MCP server when a skill would have been enough, or reach for a skill when a hook is the right answer. The first wastes engineering time. The second produces unreliable enforcement, because the model can decide not to follow a skill but cannot decide not to run a hook.

Patterns That Work, And Ones I Keep Failing With

After roughly thirty skills written (most kept, some thrown out), here are the patterns I trust.

Skills that codify a checklist work brilliantly. Anything where there are five to fifteen specific things you want the model to do or not do, in a specific order, is a perfect fit. The review-work skill on my machine is twenty bullet points. It runs the same checks every time. The model reliably executes the list. The output is consistent.

Skills that name the bad outputs explicitly work better than skills that describe the good ones. "Do not use em dashes. Do not write Of course!. Do not start with Let me dive into." beats "Write in a clean style." The model is much better at avoiding specific banned strings than at hitting a vague target.

Skills that include before/after examples work better than skills that just state rules. Two paragraphs of "here is the rule" plus one paragraph of "here is a wrong version and the right version" produces dramatically better adherence than three paragraphs of pure rules.

Skills scoped to a tool or a file type work better than skills scoped to a "domain." A skill for Astro MDX posts is better than a skill for "blog writing." A skill for Drizzle migrations is better than a skill for "database work." Narrow scope, sharp trigger, concrete body.

The patterns I keep failing with.

Skills that try to teach the model a whole domain in five hundred lines. The model already knows the domain. The skill should add the specifics of how you want the domain handled, not re-teach it. Long, comprehensive skills load slowly, dilute attention, and often get ignored when something more specific exists.

Skills with overlapping triggers. If two of my skills both claim to fire on "code review," the model picks one (often the wrong one) and the other never gets used. Audit your triggers. Make sure each skill owns a clear slice of the task space.

Skills that mix instructions for the model with documentation for humans. Pick a lane. The SKILL.md body is read by the model on every fire. If you want a human-readable README, put it in README.md next to SKILL.md. The body of the skill should be all model-facing, all the time.

Skills that depend on context the model does not have. "Use the patterns from the design system" is useless if the design system is not in context. Either inline the relevant patterns, or have the skill instruct the model to read a specific file before continuing.

Versioning, Shipping, And The Plugin Wrapper

For personal skills, sitting in ~/.claude/skills/, you do not need a plugin or a version. You are the only user. Just edit the file when it needs to change.

The moment you want to share a skill (with a teammate, with a client, with the world), wrap it in a plugin.

A plugin is a directory with a plugin.json and a skills/ subfolder. The plugin.json looks roughly like this.

{
  "name": "alexcloudstar-blog-tools",
  "version": "0.2.0",
  "description": "Skills for writing and reviewing posts on alexcloudstar.com",
  "author": "alexcloudstar",
  "skills": ["review-blog-post", "estimate-post-effort", "publish-checklist"]
}

Push that to a public or private Git repo, point your marketplace.json at it, and /plugin install alexcloudstar-blog-tools works for anyone with the marketplace added.

Versioning matters more than people expect. The moment a skill is in use by anyone other than you, breaking changes to the skill's behavior need a major version bump. Otherwise users who pull the latest version will see different outputs from the same prompts, and they will not understand why. Treat skills like API contracts. They are read by an agent, but the agent's behavior is observable, and the user has built a mental model around it.

A CHANGELOG.md next to your plugin.json is more important than you would guess. The plugins that get adopted have changelogs. The ones that do not have changelogs feel risky to install. The cost of writing two lines per release is roughly zero. Do it.

What I Wish I Knew After Twenty Skills

A few hard-won lessons in no particular order.

Most skills should be under a hundred lines. If you find yourself writing a fifth heading, you are probably building two skills, not one. Split.

Test triggers explicitly. After writing a skill, start a fresh session and use natural-sounding prompts that should fire the skill. If it does not fire, the description is wrong. Iterate on the description until it fires reliably on three or four trigger phrases you actually use.

Re-read your own skills monthly. Skills rot. The conventions in the body drift away from the conventions you actually follow. If a skill no longer matches how you work, edit it or delete it. Stale skills are worse than no skills, because they produce confident wrong answers.

Skills compose with subagents better than you would expect. If you have a skill that explains how to do a deep review, and you spawn a subagent to do the review, the subagent inherits the skill. You get focused, high-quality work in a protected context window. The combination is one of the strongest patterns in the system.

Do not skill-ify creative work. A skill that "writes a blog post" produces formulaic blog posts. The point of writing is that the output is yours, not the model's. Skills are excellent for repetitive, rule-based work, and they are the wrong tool for work where the value is in the unique judgement you are bringing. For my own writing, I have skills for the mechanical parts (frontmatter schema, internal-link patterns, SEO checks) and no skills for the actual prose. The mechanical parts are happily automated. The prose is mine.

Treat skills as a backlog, not a one-shot. I keep a notes file with "things I keep typing into the prompt." Every couple of weeks I look at the list, pick the top one, and turn it into a skill. The compounding effect over a year is significant. The single-week effect is not.

Where The Skills Ecosystem Is Going

The skills surface is going to grow fast in late 2026. Anthropic has signalled that the official marketplace will start accepting third-party plugins under a review process by Q3. Cursor is shipping their own skill format (compatible enough that conversion will be a script). The MCP integration with skills is getting tighter: a skill that knows about a specific MCP server's tools and explains how to chain them is becoming the canonical "useful plugin."

What this means in practice. The half-life of a useful skill is increasing. The skills I wrote six months ago still fire, still work, and still save time. The skills I write now have a longer runway than the skills I wrote in 2025, because the surrounding ecosystem (plugins, marketplaces, MCP, hooks, subagents) is no longer changing weekly. The investment compounds.

The other thing it means. The teams that have ten well-tuned internal skills are dramatically more productive than the teams that have none. The gap is roughly the size of a junior developer per developer. That is a real number. It does not show up on dashboards. It shows up in the rate at which work that used to be repetitive becomes automatic.

If you have read this far, the next step is concrete. Open ~/.claude/skills/. Create a directory. Drop a SKILL.md in it. Pick the smallest, most boring, most repeated thing you do every week and codify it. Spend twenty minutes on the description. Test the trigger. Use the skill for a week.

If you do that one time, the next nine skills will write themselves. The leverage is real. The investment is small. The hard part is just starting.

The dotfiles era of AI tooling produced your vimrc. The skills era produces your ~/.claude/skills/ directory. Treat it with the same care, and it will repay you the same way.

How to Build Your First MCP Server in 2026: A Practical Developer Guide

Alex Cloudstar — Fri, 22 May 2026 06:53:25 +0000

The first MCP server I wrote did one thing. It read my Postgres database and returned the schema as structured JSON. That was it. No fancy joins, no query builder, just a list of tables, columns, and types.

It took me an afternoon. Two weeks later it had saved me hours. Every time I asked Claude Code to add a new feature that touched the database, it pulled the schema through the MCP server instead of hallucinating column names. The bug rate on AI-generated migrations dropped to roughly zero. The pattern was so obviously useful that I now write a small MCP server for almost every project I work on.

If you have read my MCP developer guide, you know what the protocol is. This post is the part I did not cover there: actually building one. Not theoretical, not waving at the spec. The exact steps I take when I sit down and decide a project needs its own MCP server, and the mistakes I keep watching other developers make on the way.

Why You Should Build One Even If You Are Not At Anthropic

There is a weird assumption I keep running into. People think MCP servers are something big AI companies write for their integrations, and that the rest of us just install them. That is half right. The official servers (GitHub, Linear, Notion, Postgres) are excellent. They cover the obvious cases. But they cannot cover your case.

Your case is your repo's weird internal CLI. Your case is the JSON schema you keep pasting into prompts by hand. Your case is the three internal APIs nobody outside your team will ever wrap. Your case is the build script that ships your product, the migration tool you wrote in 2023, the analytics dashboard you query through a homemade SQL view.

Every one of those things is a candidate for an MCP server. Not because the protocol is glamorous, but because once it is wrapped, every AI agent you use can hit it. Claude Code, Cursor, Windsurf, Zed, the official Anthropic chat app, any future agent runtime. One server, many consumers. The leverage is hard to overstate.

The other reason to build is simpler. You will understand MCP at a level that is not available from reading the spec. The first time you have to decide what your searchTickets tool returns and what it does not, you learn more about agent design than a year of theory.

What You Are Actually Building

An MCP server is a process that speaks JSON-RPC and exposes a small set of primitives. The protocol calls them tools, resources, and prompts. Most servers ship only tools. Some ship resources. Almost nobody ships prompts. Start with tools.

A tool is a function with a name, a description, an input schema (JSON Schema), and an implementation. When the model decides to call your tool, the runtime serialises the arguments, sends them to your server, your server runs the function, and the result comes back as text or structured content. That is the whole loop.

A resource is something the model can read but not write. Think of it as a file the agent can fetch by URI. A common pattern is to expose your project's docs, your database schema, or a snapshot of system state as resources. The model pulls them when relevant.

A prompt is a templated instruction the user can invoke by name. They are useful for codifying common workflows. In practice almost every server I have seen skips them and lets the user invoke slash commands or skills instead.

The protocol does not care what language you write the server in. There are mature SDKs for TypeScript, Python, Go, Rust, Kotlin, C#, and Java. I write almost all of mine in TypeScript because the toolchain matches the rest of my stack and because the official @modelcontextprotocol/sdk is the most actively maintained. Pick whatever language gets you to a working server fastest. The model does not know or care.

Pick Your Transport: stdio Versus HTTP

There are two transports that matter in 2026, and the choice between them shapes everything else.

The stdio transport is the one you want for local-only servers. The agent runtime spawns your server as a child process and pipes JSON-RPC over stdin and stdout. There is no port, no auth, no network. The server lives and dies with the agent session. Most local development tools (Postgres helpers, git wrappers, file system tools, build runners) ship as stdio servers because the security model is dead simple: if the agent can run a process on your machine, it already has the same trust level as that process.

The streamable HTTP transport is the one you want for hosted servers. It runs over HTTP with Server-Sent Events for the streaming half. You stand it up on a real server (Fluid Compute, Lambda, a VM, whatever), give it a URL, and any agent that supports remote MCP can connect. Use this when the server needs to be shared across machines, when it needs centralised auth, or when it wraps an API that should not have its credentials on every developer's laptop.

There is a third option, the deprecated SSE-only transport, which you should ignore. The 2025 spec consolidated on streamable HTTP. New servers should not implement SSE-only.

For your first MCP server I strongly recommend stdio. The feedback loop is fast, the auth story is non-existent, and the deployment story is "drop the binary on your laptop." You can graduate to HTTP later when you have something worth hosting.

The Minimum Viable MCP Server

Here is the smallest useful TypeScript server I would actually ship. It exposes one tool that returns the current git status of the repo it was launched in.


import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from '@modelcontextprotocol/sdk/types.js';

const server = new Server(
  { name: 'git-status-mcp', version: '0.1.0' },
  { capabilities: { tools: {} } }
);

server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: 'git_status',
      description:
        'Return the porcelain git status of the current working directory. Use this before suggesting commits.',
      inputSchema: {
        type: 'object',
        properties: {},
      },
    },
  ],
}));

server.setRequestHandler(CallToolRequestSchema, async (req) => {
  if (req.params.name !== 'git_status') {
    throw new Error(`Unknown tool: ${req.params.name}`);
  }
  const output = execSync('git status --porcelain=v1', {
    encoding: 'utf-8',
  });
  return {
    content: [{ type: 'text', text: output || '(clean)' }],
  };
});

const transport = new StdioServerTransport();
await server.connect(transport);

That is a complete, working MCP server. Around forty lines. It does one job, but the job is real, and the model can call it the moment the server is wired in. The pattern scales: every tool you add follows the same shape. List the tool in ListToolsRequestSchema. Handle it in CallToolRequestSchema. Return text or structured content. Done.

The two things I want you to notice. First, the description on the tool. It is not "returns git status." It is "Return the porcelain git status of the current working directory. Use this before suggesting commits." That last sentence is what the model reads when it decides whether to call the tool. Tool descriptions are not documentation for humans. They are activation prompts for the agent. Treat them accordingly.

Second, the input schema is empty. The tool takes no arguments. If your tool takes arguments, define them as a proper JSON Schema with required, types, and description fields on every property. The model uses the schema to construct calls. Fuzzy schemas produce fuzzy calls.

Designing Tools The Model Will Actually Use

This is the part nobody tells you, and the part that makes the difference between a server you wrote and a server you actually use.

Tools are not API endpoints. The temptation is to expose every method of your underlying system one to one. Resist it. A server with forty tools is worse than a server with eight, because the model has to read every tool description on every call and the noise drowns out the signal. I have hit this directly. A server I wrote with twenty-five tools was used less than a server I refactored to expose eight tools that grouped related operations.

The rule I follow: tools should match user intentions, not implementation details. A search_tickets tool that takes a natural-language query and returns ranked results is better than five tools called filter_by_status, filter_by_assignee, filter_by_label, filter_by_date, and combine_filters. The model can compose the natural-language query. It cannot reliably compose five micro-tools without getting confused.

The second rule: return what the model needs to decide what to do next, not the raw payload. If your tool returns a 10,000-line JSON blob, the model will spend half its context reading it. Trim aggressively. Summarise. Paginate. If the user really needs the full payload, expose a second tool that fetches by ID. Default to small, focused responses.

The third rule: make errors educational, not stack-tracey. When a tool fails, return a message that tells the model how to recover. "User not found. Try search_users first to get a valid user id." is fifty times more useful than Error: 404 Not Found. The model is not your operator. It cannot read your terminal. The error message is its entire view of the failure.

The fourth rule: idempotency where you can swing it. Models retry. Sometimes for good reasons, sometimes because they got confused. A tool that creates duplicate records on retry will burn you. Either return existing records when called with the same arguments, or expose a dry_run mode that the agent can call first.

If you internalise those four rules before writing any code, you will skip the year of pain I went through.

Resources, Prompts, And The Other Primitives

I said start with tools. I meant it. But there is a small, specific case where resources earn their keep, and it is worth covering.

Resources are for things the model needs to read but never modify. The canonical example is project context: a README.md, a CLAUDE.md, an OpenAPI spec, a database schema dump, a CHANGELOG. The model fetches them by URI, reads them, uses them, and moves on. Resources are cheaper than tools because they do not require a function call round trip. The agent loads them, often automatically, when starting a session.

If your server wraps a system with structured context that does not change per-request (the schema, the docs, the config), expose it as resources. Otherwise stick to tools.

Prompts are the third primitive and I genuinely have not seen a useful one in production. The idea is that you ship reusable prompt templates that the user can invoke. In practice, slash commands in Claude Code and rules in Cursor cover the same ground with less friction. Skip prompts for your first server. Revisit them if you ever feel the lack.

There is also a notifications/sampling/elicitation surface in the protocol that I am skipping entirely here. It is not necessary for any normal server. If you find yourself needing it, you have already outgrown this guide.

Authentication And Secrets

The auth model is where local and hosted servers diverge sharply.

For stdio servers, you mostly do not have an auth problem. The server inherits the user's environment. If your tool needs an API key, the standard pattern is to read it from an env var the user sets in their shell or in the agent runtime's config. There is no token exchange, no OAuth, no session. The trust boundary is the user's local machine.

For hosted HTTP servers, you have a real auth problem and you should treat it as such. The MCP spec aligned in 2025 on OAuth 2.1 with Dynamic Client Registration. The agent presents a token. Your server validates it. There is also a simpler bearer-token pattern for first-party servers, where the user pastes a token into their agent config and your server checks it on every request. Both are fine for different use cases.

The mistake I keep seeing: developers ship an HTTP MCP server with no auth at all because they assume only they will hit it. Then six weeks later they leave the URL in a tweet, somebody scrapes their database, and there is a bad afternoon. If your server is on the public internet and it does anything beyond returning constants, it needs auth. No exceptions.

The other mistake: storing the wrong secrets. Your MCP server is going to handle the model's queries, which include user data. Treat the server like any other production service. Use env vars, not hardcoded values. Use a secrets manager for production. Rotate credentials. Log auth failures. The fact that the consumer is an AI agent does not change the security model. If anything, it raises the stakes, because the agent will retry queries automatically and amplify any leak.

Testing Without A Model In The Loop

The single biggest mistake I made on my first three MCP servers was testing them only through Claude Code. The feedback loop is too slow and too coarse. The model decides whether to call your tool, what to call it with, and how to interpret the result, which means a single end-to-end test exercises ten degrees of freedom you cannot isolate.

Test the server like a normal HTTP API. The official SDKs ship inspector tools that let you send raw JSON-RPC messages and see the responses. Use them.

The workflow that saved me hours.

Step one, run npx @modelcontextprotocol/inspector against your server. It opens a UI where you can list tools, call them with arbitrary arguments, and inspect the responses. Every tool gets exercised here first.

Step two, write integration tests against your tool handlers directly. They are normal async functions. Call them from a test runner. Assert on the output. This catches schema mismatches, edge cases on inputs, and regressions when you refactor.

Step three, only after the above two pass, wire the server into Claude Code and exercise the actual model loop. The model will do things you did not expect with your tools. That is fine. You are watching for "did the model find this tool" and "did it use the tool sensibly," not for "did this tool work." Those questions were already answered.

If you follow that order, you will catch ninety percent of bugs without ever burning a token.

Wiring It Into Claude Code

The integration surface for Claude Code is ~/.claude/mcp_servers.json (or the project-scoped equivalent). For a stdio server, the config looks like this.

{
  "mcpServers": {
    "git-status": {
      "command": "node",
      "args": ["/absolute/path/to/your/server/index.js"],
      "env": {}
    }
  }
}

For a hosted HTTP server.

{
  "mcpServers": {
    "my-internal-api": {
      "type": "http",
      "url": "https://mcp.example.com",
      "headers": {
        "Authorization": "Bearer ${MY_INTERNAL_TOKEN}"
      }
    }
  }
}

Restart Claude Code. Run /mcp to see the server's status. If it shows up green, your tools are loaded. If it shows up red, the bottom of the panel tells you why. The most common failure is a wrong path or a missing env var. The second most common is the server crashing on startup. Run the server manually from the same shell Claude Code launches from to confirm it boots.

Cursor and Windsurf both ship MCP support with similar config files. The surface is identical. Whatever language and transport you picked, the server runs the same against every consumer.

What I Wish I Knew Before Shipping The First Version

Six things, in order of how much pain they would have saved me.

Pick a clear scope before writing the first tool. "An MCP server for my project" is not a scope. "An MCP server that exposes our Postgres schema and lets the agent run safe read-only queries" is a scope. The narrow servers I have shipped have all been useful. The "general-purpose" ones have all been deleted within a month.

Write the tool descriptions before the tool implementations. This is the single highest-leverage practice I have found. If you cannot describe what a tool does in two sentences that include when the model should call it, the tool is wrong. Rewrite. The description is the API contract with the agent. The code is just implementation.

Log every tool call with arguments and results. Even in development. You need to see what the model is actually doing with your server. The patterns are not what you expect. I have watched models call my tools with arguments I would never have predicted, and the logs are how I find out. Without logging you are guessing.

Version your server from day one. Use semver. Tag releases. When you make a breaking change to a tool's schema, bump the major version, and add a deprecation period for the old schema if the server is shared. Agents do not handle silent breaking changes well. They will keep calling the old shape until the descriptions tell them otherwise.

Cap response sizes. Set a hard ceiling on how much text any tool can return (I default to 16 KB, less for noisy tools). When you hit the cap, return a truncation notice with a hint about how to fetch more. Letting a tool dump 200 KB into the model context once will teach you why this matters.

Treat the server like a product. Your future self is its first user. Your team is its second. The agent is its third. Write a README that explains what the server does, what tools it exposes, and what to do when something breaks. Six months from now you will be grateful.

Where To Take It Next

Once you have shipped one MCP server you will see candidates everywhere. The instinct is to write more servers. Resist a little. The better move is usually to grow the first server with carefully chosen tools until it covers most of your daily workflow, then split out a second server only when the first server starts to feel unfocused.

A few directions worth exploring once you are comfortable.

Combine your MCP server with Claude Code skills by packaging both into a single plugin. Skills tell the model when to reach for the tools your server exposes. The combination is dramatically more reliable than either piece in isolation.

If your server wraps a third-party API, look at the agent tool design patterns for what production-grade tool signatures look like, especially around pagination, partial failure, and rate limiting.

If you are running the server in production for multiple users, look at agent observability for what you actually need to monitor. The interesting metrics are not the ones you would track for a normal HTTP API. They are things like "what percentage of tool calls resulted in the agent making a follow-up call to the same tool with corrected arguments," which is a strong signal that your descriptions are unclear.

If you are wondering how MCP fits next to direct agent-to-agent communication, the A2A vs MCP comparison is the right next read. Short version: MCP exposes capabilities to a single agent, A2A coordinates multiple agents. They solve adjacent problems.

The Honest Read On MCP Servers In 2026

Most developers will never write one. That is fine. The official integrations are good enough for most use cases. But the developers who do write their own end up with a leverage advantage that is hard to describe until you have it. Every model gets your tool. Every agent runtime gets your tool. Every future tool gets your tool, for free, the moment they ship MCP support, which they all will because the protocol won.

The cost of entry is one afternoon. The payoff is permanent.

If you are sitting on a project that has any repetitive interaction you wish the agent could automate, write the server. Start with one tool. Make sure the tool description is clear enough that the model uses it without prompting. Wire it in. See what happens.

The first time the agent calls your tool unprompted, in the middle of a task, and uses the result correctly, the rest of the post will make sense. That is the moment MCP stops being a protocol you read about and starts being something you build with.

That moment is one afternoon away. Worth the afternoon.

I Got Falsely Reported and Banned on X. I Am Permanently Banned on Reddit. I Am Still Building SaaS.

Alex Cloudstar — Thu, 21 May 2026 12:08:15 +0000

I opened X this week and the account was gone.

No warning. No "your post violated rule 7.3." Just the screen that tells you your account has been permanently suspended. I went through the standard appeal flow, got the standard copy-paste rejection within an hour, and that was that. Years of posts, a small audience I had been building slowly, dozens of in-progress conversations with other founders, all wiped from my login.

The trigger, as far as I can reconstruct it, was a mass report. Somebody decided they did not like me, or did not like a post, or just felt like ruining someone's morning, and they used X's reporting system to do it. I will probably never know who. I genuinely do not know what I did to deserve this. I was not harassing anyone. I was not spamming. I was just posting about building software, like I have been doing for years.

That is part of what makes this so absurd. The honest version of my reaction is not a measured "lessons learned" tone. It is closer to: people online are unbelievably sensitive lately, and the reporting tools that were built for actual abuse are now the favorite weapon of anyone whose feelings got pinched by a tweet. Somebody read something they did not like, ran to the report button instead of scrolling past, and a system designed for emergencies treated their hurt feelings as one.

This is the part where, if I were a different person, I would write a calm "lessons learned" post. I am going to be honest instead: I am furious. Not at the platform exactly, more at the fact that one anonymous person can erase years of work with a few taps on their phone. There is no jury. There is no court. There is a moderation queue and a button.

And here is the second punch in the same week: I am also permanently banned on Reddit. That happened months ago and I never wrote about it because it felt embarrassing at the time. Now that both bans are stacked on top of each other, I am going to stop pretending they did not happen and talk about what this actually means for someone trying to build a SaaS business as a solo founder.

Spoiler, since I do not want to bury the conclusion: I am still going to build. I am not quitting. The bans are bad, the situation is unfair, and I am not pretending otherwise. But the answer is not to fold. The answer is to make sure the next person who hits that report button cannot do this much damage to my work ever again.

What Actually Happened

Let me lay out the two events in order so it is clear what kind of situation I am dealing with.

The Reddit ban came first. I had been participating in a few subreddits where my target users hang out, doing the thing every guide tells indie hackers to do. Answer questions. Share what I had learned. Mention my product only when it was genuinely relevant, and always with full disclosure. I followed the rules I could find. I read the subreddit wikis. I made an honest effort.

It did not matter. A mod somewhere decided I was self-promoting, the report went through, and I went from "regular contributor" to "permanently banned across all of Reddit" in the time it takes to refresh a page. I appealed. The appeal system on Reddit is, charitably, a black hole. The response, when it eventually came, was a templated note saying the decision stood.

I lost access to the largest concentration of indie hackers, developers, and SaaS users on the internet. Not for a month. Not for a quarter. Permanently. Across every subreddit. Forever.

The X ban happened this week. Same general pattern but different mechanism. On X, a false mass report is enough to get an account auto-actioned. Whether a human reviews it before or after suspension is unclear and frankly I am not sure it matters. The end state was the same: account gone, audience gone, DMs gone, draft tweets gone.

Two of the seven channels I have written about, community marketing and build-in-public on X, are now closed to me. Not throttled. Not deprioritized. Closed.

The False Report Is the New Shadow Ban

A few years ago, the worry was the algorithm. Your post would not get distributed, your account would get throttled, your reach would silently drop. You could still post, but nobody saw you. Annoying, but survivable.

The new failure mode is more direct: weaponized reporting.

The mechanic works because moderation at platform scale is fundamentally automated. X cannot pay a human to investigate every report. Reddit cannot have a moderator personally review every flag. Both platforms rely on signals: how many reports, from how many accounts, against the same target. Cross a threshold and the system takes action without a human ever looking at the underlying content. If a human does look later, it is usually after the account is already suspended, which is the wrong order if you care about accuracy.

Anyone with a small group of friends, a Discord server, or just a couple of burner accounts can trigger this threshold against any target. There is no real penalty for false reporting. The platforms occasionally talk about cracking down on report abuse and then never publish numbers on what actually happens. The expected value of mass-reporting an account you do not like is negative for the platform, mildly damaging for the target, and free for the attacker. So it keeps happening.

This is the part I want indie hackers to internalize, because most of us are not thinking about it. When you build an audience on rented land, you are not just exposed to algorithm changes and policy shifts. You are exposed to any individual on the internet who decides they want you gone. The cost for them to try is zero. The cost for you if they succeed is everything you built on that platform.

I do not say this to scare anyone off social media. I am going to keep using social media. I am saying it because the calculation people make about platform risk is usually too optimistic. Platforms are not just unstable. They are adversarial environments where individual actors can damage you and the platform's incentive to prevent that damage is weaker than you think.

The Emotional Part, Because It Matters

I want to be honest about this part because most rants pretend the writer is calm and I am not.

When the X suspension hit, I sat there refreshing the page like it would change. I tried logging out and logging back in, which is the most useless thing a developer can do and I did it three times anyway. I went to the help center and read the appeal documentation. I submitted the appeal. I refreshed my email. I refreshed it again ten minutes later. The rejection came in about an hour and I felt every cliché feeling at once: angry, embarrassed, defeated, indignant.

Then I started doing the thing that I always tell other people not to do, which is mentally cataloguing all the work I had put in. The conversations. The posts that took an hour to write. The slow trickle of followers who actually cared. The DMs I had been maintaining with other founders. The threads I had written and refined. All of that was sitting in an account I no longer had access to, and the value of it to anyone other than me was now zero, because nobody could find it.

If you have ever been laid off, or had a project killed by a manager who did not understand it, you know this feeling. It is the feeling of work that mattered to you being made retroactively meaningless by someone else's decision. You did the work. The work was real. And then a process you do not control deletes the artifact, and you have to convince yourself the work still counts even though the artifact is gone.

It does still count. I know that because I am the same person who wrote those posts, with the same understanding of the problems, the same way of seeing the patterns. The skill stays with me even when the platform takes the receipts. But it does not feel that way in the moment, and I am not going to pretend it does. The setback is real. The frustration is real. Anyone telling you to "just stay positive" through something like this has either never had it happen to them, or is performing for an audience.

The reason I am writing this post instead of quietly disappearing for a week is because I think other indie hackers will hit this same wall, and I want there to be at least one honest account of what it feels like and what you do next.

The Crybaby Economy

I am going to say the quiet part out loud, because pretending it is not there does not make it not there.

The internet right now is full of people who treat any disagreement, any sharp opinion, any unflattering observation, as a reportable offense. Not a thing to argue with. Not a thing to scroll past. A thing to escalate. The path from "I do not like this post" to "I am going to try to get this person removed from the platform" used to be long and required real effort. Now it is two taps and a dropdown.

Some of this is the platforms' fault for building report systems that are basically grievance buttons with no friction. Some of it is cultural. Whole communities have normalized the idea that the right move when somebody says something you do not like is to mass-report them until the system caves. It works because the system was not designed to weigh truth. It was designed to weigh volume.

I do not have a tidy political take on this and I am not going to fake one. I just have an observation as a builder: if you are putting yourself online to do honest work in public, you are going to brush up against this sooner or later. Not because you did anything actually wrong. Because somebody, somewhere, got upset, and the path of least resistance for their upset is to try to delete you.

The right response is not to get quieter or blander. The right response is to make yourself harder to delete. Which is most of what the rest of this post is about.

The Real Cost of Platform Risk

Let me put numbers on this, because the abstract version of "platform risk" never sticks.

Suppose you spend 18 months building an audience on X. You post most days. Conservatively, that is 500 hours of writing, replying, reading, and engaging. Add another 200 hours across the same period spent participating in Reddit communities and other forums. You are at roughly 700 hours invested in two channels.

If both of those channels disappear, the time spent on the writing skill itself was not wasted. You got better at writing. You got better at thinking clearly in public. You have a stronger sense of your niche. Those are durable assets.

But the artifact of all that work, the actual audience, is gone. The people who followed you cannot find you. The threads that ranked in Google through quote-tweets and engagement are gone. The DMs that were turning into customer conversations are inaccessible. The credibility signal of "this person has 8,000 followers and posts thoughtfully" reverts to zero.

The math for an indie hacker is brutal. If you assume your hourly opportunity cost is conservatively 50 dollars an hour (it is probably much higher), 700 hours of channel-specific work that gets erased represents 35,000 dollars of effective value destroyed. The skill remains. The asset evaporates.

This is the cost most people do not price into their distribution strategy when they pick channels. It does not feel like a cost because the platform is "free." The platform is not free. You are paying with audience risk, and the bill comes due all at once if you get unlucky.

Why I Am Not Quitting SaaS

If the bans were going to make me quit, I would have already quit. Here is why they will not.

The first reason is purely practical. The SaaS business model still works for solo developers in 2026 even with the agent-driven shifts I have written about before. The categories that are surviving and the new ones that are emerging both require the same core skills I already have: building, shipping, talking to users, iterating. Losing access to two distribution channels does not change the underlying opportunity. It just makes one part of the job harder.

The second reason is that I have seen enough cycles to know that quitting in the middle of a setback always feels rational and is almost always wrong. The pattern repeats in every founder community. Someone hits a brutal week. They write a "maybe I am not cut out for this" post. They go quiet for a month. They come back six months later having shipped something else. The interruption was the worst part, not the setback itself. The setback gets metabolized into experience. The interruption resets the compounding curve back to zero.

The third reason is that the skill stack I have been building for years is exactly the skill stack that is most valuable right now. Ship products, run distribution, manage cost, talk to users, iterate fast. Those compose. Removing two distribution surfaces from my toolkit does not remove the toolkit. It just narrows it temporarily while I rebuild on more durable surfaces.

The fourth reason, and I will be blunt here, is that I do not want to give the person who reported me the satisfaction. Whoever they are, they got their dopamine hit when they saw my account go down. I am not going to give them a second hit by quitting. Petty as a motivation, but motivation is motivation.

So I am going to keep building SaaS. I am going to write about what I am building. I am going to ship features and fix bugs and talk to users and figure out pricing and run experiments. Same job. Different distribution mix.

What I Am Changing

If I am going to keep going, I have to take the lesson from this seriously instead of just complaining about it. The lesson is the same lesson the developer newsletter post made, the distribution moat post made, and frankly the lesson I wrote down myself in earlier posts and then partially ignored: own the channel where you build the relationship.

Here is what changes for me from this week forward.

Email becomes the primary channel. Not X. Not Reddit. Email. I am going to invest in the newsletter strategy I have already written about but never fully committed to. Email cannot be mass-reported. Email cannot be suspended by a moderator who skimmed three posts. The relationship is direct, the list is mine, and if the email provider ever became hostile, I can export the addresses and move providers in a day.

Long-form on my own domain becomes the primary content surface. This blog is the most durable thing I publish. It compounds in Google. It cannot be deleted by a third party. I should have been treating it as the center of gravity all along instead of as an afterthought after social posting. From here on out, every important idea gets a post on my own domain first, and the social channels become amplification, not origination. I wrote about why SEO actually moved the needle for me, and I am going to act like I believe my own conclusions this time.

Social presence gets rebuilt, but with different assumptions. I will rebuild on X if appeals eventually succeed, or on a fresh account if they do not. I will also expand to LinkedIn, where the moderation model is different (still imperfect, but less prone to single-actor mass reporting taking you down). The critical change is that I will treat any social account as borrowed infrastructure, not as the foundation. Every social post that does well gets cross-posted or expanded into a blog post that lives on my own site. Nothing important lives only on rented land.

No more deep investment in subreddit communities. This one stings to admit because Reddit is genuinely useful for indie hackers when it works. But after a permanent ban with no real appeal path, I am not going to spend another month building credibility in a community where one moderator's interpretation can erase the work. I will read Reddit. I will not contribute to it. Time that would have gone there now goes to email outreach, to my own community, and to direct customer conversations.

A direct community I run. I have been thinking about starting a small Discord or community for the people who read this blog and use the products I build. I have resisted it because it sounds like a lot of work, and it is. But after watching two big communities lock me out, having a community I actually own and moderate looks like the obvious move. Even a small one with a few hundred genuine readers is more durable than a "huge audience" on a platform that can be taken away.

These changes do not undo the bans. They make me less exposed to the next one.

A Note for Founders Who Have Not Hit This Wall Yet

If you have not been mass-reported, content-suspended, or community-banned, you might read this and think it sounds dramatic. I would have thought the same thing six months ago.

The advice I will leave you with is simple and also slightly annoying because it feels like overengineering when nothing has gone wrong yet.

Treat every social channel like it might disappear tomorrow. Not because it definitely will, but because the cost of preparing for that scenario is much lower than the cost of recovering after it happens.

Concretely, that means three habits. Capture emails from every important conversation. If a follower on X has been engaging with you for months, do not let the only relationship live in the DMs. Find a reason to move it to email, even if it is just "I am starting a small newsletter, here is the link." Republish your best social content on your own domain within a few days of posting. Treat every viral thread as the seed of a blog post. And keep a backup of your own posts. Most platforms let you export your data. Most people never bother. Do it quarterly. It takes ten minutes and gives you the option to rebuild.

None of this is glamorous. None of it will go viral. It is the boring infrastructure work that protects everything else you do. It is the same logic as backups for code, the same logic as redundancy for production systems. You set it up when nothing is wrong so that something going wrong is survivable.

The Setback Advantage Is a Real Thing

I want to close with something I genuinely believe and not the part about anger.

Setbacks have an advantage that you cannot get any other way. They force the lesson. Everyone who reads about platform risk learns it intellectually. The people who get banned learn it in a way that changes their behavior. The cost of the lesson is real. The behavior change is also real, and the behavior change is what creates durable founders.

The indie hackers I respect most have all eaten some version of this. A product that died. An account suspension. A launch that flopped. A customer who churned and took six others with them. None of them quit. All of them came back with a slightly different setup that made them less fragile next time. The setbacks compounded into wisdom, the same way the wins compounded into momentum.

If I had not been banned, I would still be over-investing in rented channels. I would still be writing thoughtful posts that sit on someone else's servers under someone else's terms of service. The ban is unfair. The ban also forced me to do something I should have done a year ago: build the durable side of my distribution.

So this is the post I am writing instead of the angry vent post I wanted to write. It is still a little bit angry. That is fine. The frustration is real and pretending otherwise would be dishonest. But the conclusion is not "the system is broken so I quit." The conclusion is "the system has these specific failure modes, I just hit two of them, and here is what I am doing about it."

I am still building SaaS. I am still shipping. I am still going to write about what I learn. Most of it will land here, on my own domain, where no anonymous reporter can reach. Some of it will land in your inbox if you decide that is a channel you want to give me. And some of it, when X eventually unbans me or when I rebuild somewhere else, will land in feeds again.

The work continues. The platforms change. The skill stays. That is the only model I have ever seen actually work for solo founders over the long run, and the only one I am willing to bet on going forward.

If you are an indie hacker who has hit something similar, I would genuinely like to hear about it. Reply to this post in whatever channel you read it on, or just go build the boring infrastructure I described above so the next person who tries to take you down has a smaller target.

And if you are the person who reported me: you got me for now. You did not get the work. Try again next quarter.

Infrastructure as Vibe: What Comes After Infrastructure as Code

Alex Cloudstar — Thu, 21 May 2026 07:34:05 +0000

A friend of mine sent me a screenshot last week. Eight tabs. Vercel, Neon, Cloudflare, GitHub, Stripe, Resend, a Notion doc with env vars, and a terminal running ngrok. He was three hours into "just deploying a side project" and had not written a line of product code yet. The message under the screenshot said: "tell me again why this is the good timeline."

I laughed because I have lived this exact afternoon at least fifty times. You know the stack you want. You know which buttons to press. The CLIs are great individually. None of them talk to each other. So you become the wetware integration layer, copying connection strings into env files, pasting CNAME records into DNS panels, remembering which secret needs to be rotated where.

That experience is the friction that gave the next paradigm its name. Infrastructure as Vibe. The phrase started as a joke on Twitter, but enough people stopped laughing for it to become a real category. Neon's official account endorsed it on stage. Cloudflare's team has been hinting at it in every product update for six months. The pattern is showing up everywhere even when people do not use the term.

This article is about what "infrastructure as vibe" actually means as a paradigm, why it became inevitable, what it gets right, what it gets dangerously wrong, and where developer tooling is heading because of it.

From Infrastructure as Code to Infrastructure as Vibe

To understand why this matters, you have to remember what each previous layer was a reaction to.

In the early 2010s, deploying anything to production meant clicking through AWS consoles, SSHing into boxes, and praying you remembered to set the security group correctly. People called this ClickOps. It was a disaster for reproducibility. You could not version it. You could not review it. The person who set it up was the only one who knew what it looked like, and the moment they left the company, the infrastructure became a haunted house.

Infrastructure as Code was the response. Terraform, CloudFormation, later Pulumi and CDK. Declare your infrastructure in a file. Commit the file to git. Run a command. Get the same environment every time. Everyone who has ever maintained a real system understands why this was a good idea.

But IaC carried a tax that nobody really wanted to pay. You now had to learn HCL, or YAML schemas, or a TypeScript-flavored DSL that pretended to be regular code but very much was not. You had to keep state files. You had to understand provider versions. You had to debug why your apply hung for nineteen minutes because someone deleted a resource in the console. You had to write 400 lines of config to deploy something that, in your head, was "a Node app and a Postgres database."

GitOps, Helm charts, and Kubernetes operators were attempts to make this nicer. They mostly added more files to learn. Then PaaS providers like Heroku, Vercel, Render, Railway, and Fly came back into fashion because they let you skip most of that. The pattern was "git push and we figure it out." That worked great until you wanted to do anything slightly outside the happy path. As soon as you needed a database in a different region, or a queue that talked to a worker, or a domain that pointed at two services, you were back in five dashboards.

Infrastructure as Vibe is the layer that sits on top of all of this. The interface is plain language. The implementation is "we figure out the right APIs to call." You describe what you want in the way you would describe it to a colleague who happens to know every platform's quirks. The system reads your intent, scans your code, makes opinionated choices, and shows you what it provisioned.

If IaC was about removing humans from the imperative step, Infrastructure as Vibe is about removing humans from the declarative step too. You stop writing configuration entirely. You write intent.

What "Infrastructure as Vibe" Actually Means

The phrase is catchy enough that people are already using it to mean different things, so let me be specific about what it actually describes as a paradigm.

There are three properties that have to be true for a tool to fit this category.

The interface is natural language, not configuration. You do not fill out a form. You do not edit a YAML file. You type or say what you want. "Deploy this repo. Add a Postgres. Point it at myapp.com." The system parses that, infers the gaps, and confirms before doing anything destructive.

Inference replaces declaration. Traditional IaC requires you to specify every detail because the tool refuses to assume. Infrastructure as Vibe inverts that. The tool reads your code, your dependencies, your existing project, and makes structural assumptions based on what it sees. If your package.json includes prisma and pg, you probably want Postgres. If you have @supabase/supabase-js, you probably want Supabase. If you have mongoose, you want Mongo. You can override anything, but the default is "guess correctly most of the time."

The platform absorbs the integration work. This is the part that distinguishes Vibe from "just a wrapper over Terraform." A real Infrastructure as Vibe tool does not just generate config and hand it back. It calls the APIs, watches the rollout, handles the failure modes, retries the flaky ones, and tells you when it is done. You do not get a 200-line plan to apply yourself. You get a result.

You can argue about whether all three need to be present. I would say yes, because if you remove any of them, you collapse back into a previous paradigm. Strip the natural language and you have a smarter PaaS. Strip the inference and you have ChatOps that just calls a CLI. Strip the integration work and you have a clever assistant that drafts Terraform.

The combination is what makes it new. And the combination is what makes some developers nervous.

The Friction That Made This Inevitable

You can see the demand for this paradigm if you look at what developers actually complain about online. Not what they say in conference talks. What they post at 11pm on a Tuesday after their fifth deploy attempt.

The complaints cluster into a few categories that should be familiar to anyone who has shipped a side project in the last three years.

Tab fatigue. Every PaaS has a console. Every database has a console. Every domain registrar has a console. Every observability tool has a console. You are not really using each one. You are using all of them as a single distributed system that someone forgot to integrate.

Bill shock. This is the most consistent complaint across Vercel, Render, Neon, Supabase, and Fly. Usage meters that run while you sleep. No hard caps. You read a forum post in the morning explaining that someone got a $4,000 bill because their cron job hit an infinite loop. You spend the next hour wiring spend alerts and budget guards. That hour is not building product.

Secret sprawl. A modern app has, on average, somewhere between twelve and twenty environment variables across staging and production. You set them in three different dashboards. You forget which is which. You leak one into a screenshot during a demo. You promise yourself you will rotate them later. You do not.

DNS purgatory. You buy a domain. You point it at a service. The TTL is 24 hours and the CNAME is fighting an A record. You verify ownership through a TXT record that takes 45 minutes to propagate. You check dig in three terminals. You add a www subdomain and the redirect breaks. You forget to renew the cert. The cert renews automatically and you forget you ever worried about it.

None of these are theoretical. They are the unpaid labor of being a modern developer. Every one of them is something a sufficiently smart tool could absorb. The reason none of them got fully absorbed sooner is that no single platform owned the whole stack. Vercel cannot fix your Neon bill. Neon cannot fix your DNS. Your DNS provider cannot fix your secret rotation. Somebody had to play orchestrator. Until recently, that somebody had to be you.

The shift toward natural language as a universal interface, combined with the agentic tooling I wrote about in agentic coding, changed the math. A coordination layer that calls every platform's API on your behalf no longer needs to be a Heroku-scale company. It can be a relatively small orchestration service with good prompts, good defaults, and a careful integration layer.

That is why this paradigm is showing up now and not five years ago.

Where You Can Already See It

You can spot Infrastructure as Vibe in the wild even when nobody is calling it that.

When Vercel added "deploy from a screenshot" and "deploy from a sentence" via v0, that was the paradigm leaking into a frontend product. You did not configure anything. You described something. The platform figured out the rest.

Cloudflare has been pushing in the same direction with their AI gateway and their increasingly chat-driven control plane. You can describe a worker, get a worker, attach a domain, all without ever opening a config file. They are not marketing it as "infrastructure as vibe," but the shape of the experience is unmistakable.

Render's blueprints have been quietly turning into something similar. Drop a repo in, accept the inferred services, click deploy. The friction between "I have an idea" and "it is live on the internet" gets compressed every release.

The most aggressive version of this is what is happening through coding agents themselves. Claude Code, Cursor agents, and the various MCP-driven orchestrators can now provision a database, push a deploy, configure a domain, and rotate a secret from a single prompt, given the right tools. The "infrastructure" is the LLM's tool calls. The "vibe" is the prompt. It looks weird the first time you see it. The second time it looks inevitable.

There are also a wave of newer tools whose entire pitch is the paradigm. Some are general purpose orchestrators. Some are stack-specific. Some are aimed at non-technical founders who never wanted to learn the difference between an A record and a CNAME. The tools differ. The pattern does not.

If you squint, you can also see this pattern in adjacent categories. Stripe Apps and shadcn/ui blocks both reduce the cognitive load of "I need this thing" to a one-line invocation. They are not infrastructure, but they share the same posture: assume good defaults, do the integration work for me, let me describe outcomes instead of steps.

Why Developers Resist It

This paradigm is not universally welcomed, and the resistance is not stupid. Most of it comes from real scars.

Loss of control. A developer who has been burned by a magical platform will be suspicious of any tool that says "trust me, I will figure it out." They want to see the plan. They want to read the Terraform. They want to know what the security group looks like. The Vibe layer hides that by default, and that is the right design choice for most users and the wrong choice for some.

Black box anxiety. When something breaks at 3am, you need to know what got provisioned, where, with what permissions. If the provisioning happened through a chat interface and the audit log is "Claude decided," that is not a debugging story. That is an incident report you have to write to your CTO.

Comprehension debt. I wrote about this in the vibe coding revolution in the context of code. It applies just as forcefully to infrastructure. The developer who never learned what a security group is, because a friendly tool always set it up correctly, is going to have a bad time the first time the tool gets it wrong. And the tool will eventually get it wrong, because every tool does.

Vendor lock-in. A coordination layer that owns your deploy, your database, your DNS, and your secrets is also a single point of failure for your entire stack. If the company behind it gets acquired, raises prices, or pivots, you may not have an exit strategy. IaC at least gave you portable files. Vibe gives you outcomes that are difficult to reproduce on a different platform without redoing the work.

Existing skill obsolescence. A developer who spent five years getting good at Terraform is not enthusiastic about a paradigm that says the Terraform expertise is now a niche specialty. This is a human reaction, not a technical one, but it shapes adoption.

These are all legitimate concerns. The right response to them is not "trust the magic." It is to design Vibe tooling that surfaces the underlying reality on demand. Show me what got created. Show me the cost projection. Show me where to override. Give me an escape hatch back to raw config. The good versions of this paradigm will be the ones that respect the people who want to look under the hood.

When Infrastructure as Vibe Works and When It Fails

Like every paradigm before it, this one has a fitness landscape. It is not universally better. It is meaningfully better in some places and meaningfully worse in others.

It works great for:

Side projects and prototypes. The whole point is getting from idea to live URL in minutes. The cost of being wrong is low. The cost of being slow is the project dying before launch.
Solo developers and small teams. The integration work is a tax that hurts small teams the hardest. Removing it gives them back a measurable percentage of their week.
Greenfield production apps with standard stacks. If you are building a Next.js app with Postgres and Stripe and Resend, the default choices a Vibe tool would make are probably the same choices you would have made.
Founders and designers without a dedicated devops person. The alternative is hiring someone or hand-waving DNS for a year. This is a real, expensive choice that this paradigm collapses.

It is risky for:

Regulated environments. Healthcare, fintech, public sector. You need explicit, auditable, declared infrastructure for compliance reasons. Vibe-style provisioning is not yet the right shape for SOC 2 audits, even though it can in principle generate the artifacts.
Large existing systems. A coordination layer that is good at clean deploys is not necessarily good at integrating with a fifteen-year-old VPC, three legacy databases, and a custom IAM model. The brownfield problem is brutal.
Mission-critical production with hard SLOs. When 99.95% matters, you cannot afford "the platform chose for me." You need to make every choice deliberately and review it.
Anyone who hates leaky abstractions more than they hate friction. Some developers are wired to want to see the bytes. They will be miserable on top of a Vibe layer no matter how well it works.

The honest version of this paradigm acknowledges those tradeoffs instead of pretending the line is everywhere. You can use Infrastructure as Vibe to ship a prototype in an afternoon and still believe that your bank should run on hand-written Terraform. Both of those statements can be true at the same time.

The Vibe Stack Underneath

Let me get a little more concrete about how this actually works under the hood, because "the AI does it" is not a real answer.

Most viable Infrastructure as Vibe systems sit on top of a layer cake that looks something like this.

Deterministic detection comes first. You do not ask the LLM what database to provision. You scan the package.json, the lockfile, the go.mod, the Cargo.toml. You read framework signatures from imports. You check for known config files. This is fast, free, and reliable. The LLM only gets called when the deterministic layer is genuinely ambiguous.

Intent parsing is where the LLM earns its keep. Translating "deploy this and add a database, point it at app.example.com" into structured calls is exactly the kind of task small fast models like Claude Haiku are good at. The output is not freeform. It is a constrained schema that the orchestrator can execute. No code generation. No hallucinated APIs. Just structured decisions.

Orchestration is plain old engineering. A long-running service, a job queue, a Redis somewhere, and careful integration code that knows how every target platform actually behaves. This is the part that takes the longest to get right and is the hardest to fake with a demo. Anyone can ship "I called the API." Far fewer people ship "I called the API, handled the rate limit, retried the flaky one, rolled back the partial state when the third call failed, and gave you a useful error message."

Cost transparency lives in the middle. Before anything destructive happens, the system tells you what it is about to do and what it estimates the monthly cost will be. This is the answer to bill shock and it is what separates a serious Vibe tool from a fancy CLI wrapper.

Audit logging is the floor. Every provisioned resource, every config decision, every secret rotation gets logged with timestamps, model versions, and the original prompt. When something breaks, you need to be able to reconstruct exactly what happened. The version of this paradigm that does not log carefully will not survive contact with real users.

The reason this stack matters is that it shows the paradigm is not "vibes all the way down." Most of it is conventional, careful software engineering. The vibe is at the surface. The substrate is the same kind of integration work that has always been required to make platforms talk to each other. The trick is that the substrate is finally getting built, because the surface is finally compelling enough to justify the investment.

What Has to Be True for This to Actually Work

Predictions are cheap. Let me make some falsifiable ones.

For Infrastructure as Vibe to graduate from "interesting demo" to "default way most apps get deployed in 2028," a few conditions have to hold.

Reversibility has to be cheap. If I can ask the system to provision something and then ask it to undo cleanly, I will use it. If undoing requires me to log into five dashboards and clean up orphans, I will not trust it again. The good tools will make destroy as fluent as create.

Cost has to be visible before action. Every command that costs money should show its cost first. Every recurring resource should show its monthly projection. This is non-negotiable. Bill shock is the single biggest reason people abandon PaaS, and any Vibe layer that does not solve it is one bad incident away from a viral horror story.

Audit has to be ambient. Not a feature you turn on. Always on by default. Every action provenanced. Every secret rotation recorded. The audit log is what unlocks regulated use cases later.

The escape hatch has to be real. "Export to Terraform" or "show me the underlying calls" cannot be a roadmap item. It is the price of trust. If a developer feels trapped, they will not adopt. If they can leave, they probably will not.

Defaults have to be opinionated, not naive. A tool that asks me twelve questions is worse than the dashboard it replaces. A tool that picks reasonable answers and lets me override is the entire point. The opinions need to be good ones, which means real product decisions, not arbitrary ones.

If those conditions hold, the paradigm wins for most new applications. If they do not, this becomes another era of demos that did not generalize. Both outcomes are plausible. I am betting on the first one because the underlying technology curve, particularly the cost and reliability of structured LLM output, is moving in the right direction every quarter.

Where This Goes Next

I keep thinking about how this connects to the broader shift toward spec-driven development. In code, the spec is becoming the source of truth and the code is generated from it. In infrastructure, the intent is becoming the source of truth and the config is generated from it. Same pattern, different layer. The end state is that human-written declarative config might become as rare as hand-written assembly is today. It will still exist for the cases that need it. It will not be the default.

The teams that move fastest will be the ones who treat infra as a means, not as a craft. Not because the craft does not matter, but because most of the time, the right amount of craft to apply to "ship this prototype tonight" is none. The same developer who insists on hand-tuned config for a financial service will happily let the Vibe layer handle their weekend project. That bifurcation is the natural shape of the next few years.

The risk worth watching is consolidation. A handful of companies will become the orchestration layer between all the platforms. They will know what you deploy, what you spend, what you store, and where your traffic comes from. That is enormous power. The paradigm only works socially if the tools at the surface compete hard enough that no single one becomes the lock-in story for the entire web.

I also expect this to push the underlying platforms toward better, more predictable APIs. If your value as a database, hosting provider, or DNS service is increasingly mediated through an orchestration layer, your API ergonomics matter more than your dashboard ergonomics. That should be good for everyone. The platforms with the cleanest APIs are about to win an asymmetric amount of new business.

The interesting question for the next few years is which of the giant providers, the existing PaaS players, or the newer orchestration startups end up owning the surface. I do not know the answer. I do know the surface is moving from forms to sentences, and from documents to conversations, and that is not going back.

My Take

I have been deploying things for most of a decade. I have written more Terraform than I want to admit. I have set up Vercel projects in my sleep. I know the muscle memory. I am still tired of doing it.

The honest reason Infrastructure as Vibe is going to win for most use cases is not that the technology is magical. It is that the alternative is bad. Modern application infrastructure asks every developer to be a part-time platform engineer. That was tolerable when there were three platforms. It is not tolerable when there are thirty. The integration work has gotten worse every year, and developers have absorbed it because there was no other option. Now there is one.

The version of this paradigm that fails is the one that hides too much. Black-box magic that you cannot inspect, cannot reverse, cannot afford to break. That version will get one good demo cycle and then a thread of horror stories that kill it. The version that wins is the one that defaults to friction-free and treats the underlying truth as something you can always pull up. Good defaults, visible cost, clean audit, easy exit.

If you are building developer tools, the lesson is to take the integration work seriously even when the surface looks easy. The fluency of the chat interface is the marketing. The brittleness of the third API call you make under the hood is the product. The companies that win this category will be the ones that obsess over the unsexy substrate.

If you are a developer who has resisted this paradigm, I would not blame you. Most prior "we make infrastructure easy" pitches were wishful thinking. This one feels different because the AI layer makes the inference step finally cheap, and the existing platform CLIs are finally good enough to chain reliably. The two missing pieces showed up at the same time.

Infrastructure as Code did not kill the console. PaaS did not kill IaC. Infrastructure as Vibe will not kill PaaS. Every layer continues to exist. The default just moves up. The thing you reach for first when you start a new project keeps getting closer to "here is what I want to build." That progression is the whole story of developer tooling. Vibe is the next stop on a road we have been walking for thirty years.

I will keep my Terraform skills sharp. I will also keep typing sentences and watching apps appear. Both of those can be professional. Pretending only one of them is, at this point, is just nostalgia in a hoodie.

A2A vs MCP in 2026: How AI Agents Actually Talk to Each Other

Alex Cloudstar — Thu, 21 May 2026 07:34:04 +0000

A team I have been advising spent two weeks building what they called a "multi-agent customer support platform." Five agents. A triage agent, a billing agent, a refund agent, a knowledge base agent, and an orchestrator on top. They wired everything together through MCP servers because that is the protocol they had heard about most.

It worked. Sort of. The orchestrator could call any of the four worker agents as if they were tools. The latency was terrible. The errors were impossible to trace. When the billing agent failed, the orchestrator either retried forever or gave up silently, with nothing in between. The system felt brittle in a way that nobody could quite explain.

I looked at the architecture for about ten minutes and asked a single question. "Why are you using MCP for this?" The answer was the most honest thing I heard all month. "Because we did not know A2A existed."

That is the state of agent protocols in May 2026. MCP is everywhere. A2A has been ratified, adopted by the same major vendors, and is now under the same Linux Foundation umbrella. And yet the gap between developers who can explain when to use each one and developers who pick MCP for everything is enormous.

This post is the practical breakdown. What each protocol actually does. Where each one fits. Where they overlap. Where they absolutely do not. By the end you should be able to look at any agent system on a whiteboard and pick the right protocol for each edge in under a minute.

The Short Version You Can Use Today

If you only remember one thing from this article, make it this.

MCP is how an agent talks to tools. Files, databases, APIs, repos, Slack, your filesystem. A single agent uses MCP to extend its reach into the outside world. The agent is the consumer. The MCP server is the provider. It is a one-way relationship from a privilege standpoint, even if the protocol itself supports back-channels.

A2A is how agents talk to each other. Two or more agents, each owned by potentially different teams or vendors, coordinating on a multi-step task. Each side has its own goals, its own context, its own underlying model. They negotiate, they delegate, they exchange artifacts. Neither side is just a tool. Both are decision-makers.

You can think of MCP as the protocol for "I need to act on the world" and A2A as the protocol for "I need someone else who can think for themselves to handle part of this for me."

I keep coming back to a different analogy because it lands better with developers I work with. MCP is like a function call. A2A is like a service-to-service API contract between two systems that each have their own state, their own deploy cycle, and their own opinions. Wrapping an agent in an MCP server is roughly equivalent to flattening a microservice into a library. Sometimes that is fine. Often it leaks abstractions you did not want to leak.

If you have not read the MCP developer guide yet, that one walks through the tool-call side in detail. This article assumes you have at least the rough shape of MCP in your head.

What MCP Does, Stripped Down

MCP at its core is a JSON-RPC protocol with a server-client model.

A client (your agent) connects to a server (your tool provider). The server exposes a fixed set of capabilities: tools you can call, resources you can read, and prompts you can ask for. The client lists what is available, picks one, calls it with arguments, and gets a response. There is also a streaming side for long-running operations, but the mental model stays simple.

What makes MCP work in practice is the standardization. Before MCP, every AI tool integration was bespoke. You wrote a custom function for each external system, baked it into your agent loop, and prayed the schema never changed. With MCP, a single agent can connect to dozens of external systems through the same interface, and a single tool can be reused across dozens of agents.

97 million monthly SDK downloads by early 2026 is not hype. It is the result of every major vendor agreeing on the same wire format. Anthropic shipped it. OpenAI adopted it. Google added it to Gemini. Cursor, Claude Code, Windsurf, Zed, and every IDE in the AI IDE space speaks it natively.

The thing MCP does not do, and was never designed to do, is coordinate between independent agents. The MCP spec itself uses the word "tool" deliberately. A tool is something an agent uses. A tool does not have its own goals. A tool does not push work back. A tool does not say "I am partly done, here is what I have, please clarify the next step." MCP can stretch in those directions, but every time you stretch it, you are working against the grain.

This is where the trouble starts, because in practice a lot of developers have been wrapping entire agents in MCP servers and calling them tools.

Why Wrapping Agents in MCP Servers Falls Apart

The pattern is everywhere. You build a specialist agent (say, a code review agent), you expose it as an MCP server, and any other agent can now "call" it like a tool. From the outside, it works. From the inside, it leaks every problem multi-agent systems were trying to solve in the first place.

Here is the first thing that breaks. Identity. MCP tools are stateless from the caller's perspective. Each call is a fresh request. If the underlying agent has memory, plans, partial state, or open sessions, the MCP wrapper either hides them or replicates them awkwardly. You end up encoding session identifiers into tool arguments and treating the response like a database row. That works until two callers hit the same session and you have to invent a locking protocol on top.

Long-running tasks. MCP supports streaming, but the streaming model assumes a single tool call with progressive output. Real agent work is non-linear. The agent might pause to ask a clarifying question. It might split a task into subtasks and run them out of order. It might decide that what you asked for is impossible and propose an alternative. None of that fits into "tool with progressive output." You end up modeling negotiation as a sequence of tool calls and the calling agent has to remember which call belongs to which conversation. Congratulations, you reinvented sessions, badly.

Asymmetric authority. With MCP, the caller decides what happens. The tool either succeeds or fails. A real agent on the other side might want to say "I will not do this because it violates my safety policy" or "I will do this but I need a human in the loop first." Encoding all of that into tool response status codes is possible but it pollutes the contract. Every other tool now has to ignore those status codes, and every new agent contract has to opt in.

Discovery. MCP servers expose a list of tools. They do not expose a description of who they are, what they specialize in, what languages they support, whether they accept long-running asynchronous work, or how to authenticate to them in a structured way. You can stuff some of that into tool docstrings, but at scale you need machine-readable metadata. Without it, an orchestrator cannot decide which downstream agent to talk to. The orchestrator becomes a giant if-else statement, and you end up back where you started.

I have seen exactly these failure modes in production. Once you have seen them, you cannot unsee them. The fix is not "more MCP." The fix is a different protocol for the thing MCP was not designed to handle.

What A2A Adds That MCP Does Not

A2A was Google's first move on this gap, announced in April 2025 with fifty partner companies and now under the same Agentic AI Foundation that governs MCP. The protocol is not a competitor to MCP. It sits next to it, covering the part of the agent stack MCP intentionally left open.

The protocol has three primitives worth knowing by name.

Agent Cards. Every A2A-compliant agent publishes a JSON document at /.well-known/agent-card.json. The card describes the agent's identity, what it can do, what modalities it accepts (text, structured data, files, audio, video), how to authenticate to it, whether it supports streaming or push, and which version of the protocol it speaks. This is the discovery layer MCP does not have. If you are an orchestrator deciding which downstream agent to talk to, you fetch their card and you know what you are dealing with.

Tasks. A task is the stateful unit of work between two agents. It has a lifecycle: pending, in progress, completed, failed, and the negotiated states in between. The calling agent creates a task. The receiving agent owns the execution. Both sides see the same task object and can read its status. Server-Sent Events stream progress updates as the task moves through its lifecycle. If the receiving agent needs a clarification, it can pause the task and ask. If it needs to spawn subtasks, the task tree is visible to both sides.

This is exactly the missing piece in the MCP failure modes above. Tasks have identity. They have state. They survive multiple round-trips. They support negotiation. They do not pretend to be function calls.

Artifacts. When one agent produces an output that another agent needs, that output is wrapped in an artifact. An artifact carries content type, encoding, semantics, and structured metadata. A PDF is an artifact. A code patch is an artifact. A structured JSON document is an artifact. The point is that the receiving agent can reason about the artifact, not just consume it raw. If the artifact is a Markdown report and the receiving agent prefers HTML, it can negotiate. With MCP, the response is whatever the tool returns and the caller deals with it.

Put together, agent cards, tasks, and artifacts are roughly the same shape as a service-to-service contract in modern distributed systems. They are deliberately not function calls. That is the entire point.

A Concrete Example: Refund Workflow With Both Protocols

Going back to the customer support platform I mentioned at the start of the article, here is what the right architecture would have looked like.

The triage agent receives a customer message. It needs to read the customer's recent orders to understand the context. It uses MCP to call the orders database. The orders database is a passive system. It has no opinions. It returns data. MCP is the perfect protocol for that edge.

The triage agent determines that this is a billing question. It needs to delegate to the billing agent. The billing agent has its own model, its own prompts, its own constraints about what it can and cannot promise the customer, and its own audit log. The triage agent uses A2A to create a task on the billing agent. The billing agent accepts the task, fetches its own context (using its own MCP connection to the billing database, which the triage agent does not need to know about), and starts working.

While working, the billing agent realizes the refund request requires manager approval. It pauses the task and emits a status update over A2A asking for human-in-the-loop confirmation. The triage agent forwards the message to a real person. Eventually approval comes back. The billing agent completes the task, returns the refund confirmation as an artifact, and the triage agent translates that artifact into a customer-facing reply.

What is happening here at the protocol level is clear and clean. MCP for agent-to-tool. A2A for agent-to-agent. Each edge uses the protocol that matches its semantics. The system is debuggable because each side of every conversation has a clear identity and a clear lifecycle.

Compare that to the original architecture, where the billing agent was an MCP server exposed as a single process_refund tool. The triage agent calls the tool. The tool either returns "done" or throws an error. There is no place for "I need approval first." There is no place for "I am 60% done and waiting on a database lookup." The only escape hatch is a synchronous wait, and synchronous waits in agent systems are how you discover that your timeouts are wrong six weeks after launch.

When MCP Is Still the Right Choice

I do not want to oversell A2A. For a very large class of agent systems, MCP is the only protocol you need, and reaching for A2A would add complexity for no benefit.

Single-agent products. If your product is one well-prompted agent with a strong set of tools, MCP is the whole stack. Add A2A when you have a second agent that genuinely has its own brain. Do not add a coordination protocol before you have something to coordinate. Most of the multi-agent vs single-agent decision boils down to this exact line.

Internal tool integrations. If you are exposing your own databases, internal APIs, or microservices to your agent, those are tools. They do not have agentic behavior of their own. Wrapping them in A2A would be a category error.

Most IDE and editor workflows. Claude Code, Cursor, Windsurf, and the rest of the agentic IDE family use MCP for almost everything they do. The IDE is one agent. The MCP servers extend its reach. There is no second-agent coordination happening in those flows, and stuffing A2A in there would slow everything down without giving you anything.

Latency-sensitive paths. A2A's task lifecycle, by design, is heavier than an MCP tool call. There are more round-trips, more state to manage, more SSE streams to keep open. If you are inside a tight loop where every millisecond matters, that overhead is a tax you may not want to pay. MCP wins on raw speed.

The simple test I use: would the system on the other side of this edge benefit from having its own goals, its own memory, and its own ability to push back? If yes, that is an agent and A2A fits. If no, it is a tool and MCP fits.

When You Actually Need A2A

The cases where A2A unlocks something MCP cannot:

Cross-vendor coordination. You have a Salesforce agent that lives inside Salesforce, a SAP agent that lives inside SAP, and your own internal agent in the middle. None of these talk to each other today through tool calls because they each have a full execution context that the other side does not see. A2A's agent cards plus structured tasks are the actual integration story that platform vendors are pushing for in 2026.

Long-running asynchronous workflows. Tasks that take minutes, hours, or days. Approval workflows. Multi-step research. Compliance pipelines. Anything where the work cannot fit into a single function call and the receiving system needs to own its own lifecycle. The task primitive in A2A was built for exactly this.

Multi-tenant agent marketplaces. If you are building a platform where third-party agents can plug in and offer services to other agents, you need a way for those agents to advertise their capabilities, authenticate, and execute work without your platform knowing the internals. Agent cards plus signed task lifecycles do this cleanly. Trying to do the same with MCP would mean every third-party agent is technically a tool and you lose the protections you want around agent identity.

Multi-step negotiation. Real coordination involves clarifications, partial results, alternative proposals, and human-in-the-loop pauses. The A2A task lifecycle gives you the states to represent all of that natively. You can build it on MCP, but you are reinventing a worse version of A2A's primitives.

How the Two Protocols Compose

The clearest mental model I have settled on is that a real agent is usually an A2A endpoint that internally consumes a bunch of MCP servers.

From the outside, the agent advertises itself with an agent card. Other agents create tasks on it via A2A. Inside, the agent uses MCP to read files, query databases, hit APIs, and call other passive systems. The two protocols compose neatly because they sit at different layers. A2A is the front door for other agents. MCP is the toolbelt the agent uses internally to actually get work done.

This is also where some of the protocol confusion comes from. People look at an architecture diagram, see an A2A edge between two agents, and ask "but couldn't this just be an MCP server?" The answer is yes, technically, you can put almost any agent behind an MCP server and call it a tool. You will pay for it in the failure modes from earlier in the article. The protocols are designed to compose, not to compete. Picking the right one for each edge is the difference between an architecture that scales and one that has a stress test ahead of it.

I wrote about this from a different angle in the agent frameworks comparison. Most agent frameworks now support both protocols natively. The question is no longer "which framework lets me do this." The question is "where in my system does each protocol belong."

What This Means For How You Build in 2026

The practical advice is simpler than the protocol discussion makes it sound.

Default to MCP. If you are building today, start by giving your agent the tools it needs via MCP. Get the single-agent experience right. Most products live their whole lifecycle here, and that is fine. Premature distribution is the same trap as premature optimization. The agentic coding workflows most developers benefit from on a daily basis are pure single-agent-plus-MCP flows.

Reach for A2A when you have a second brain. The moment you have a system that genuinely belongs to another team, vendor, or model, with its own constraints and its own audit story, switch that edge to A2A. Do not try to flatten it into a tool. The seam is real and the protocol is the cleanest way to model it.

Publish agent cards even before you need them. If you are building any agent that other systems might want to integrate with, exposing an agent card at the well-known URL costs nothing and signals that you take cross-agent integration seriously. It also forces you to write down what your agent actually does, which is a useful exercise in its own right.

Treat the boundary between MCP and A2A as a design decision, not a default. Every time you add a new edge to your system, ask whether the thing on the other side is a tool or an agent. The wrong choice early on is expensive to undo later. Two minutes of thinking at the start saves weeks of refactoring after the architecture has calcified.

The reason this matters is that agent systems are about to compose at a scale we have not seen yet. Internal agents talking to vendor agents talking to platform agents talking to third-party agents. The protocols are not going to be optional. They are the substrate. The developers who get fluent in both, and in the boundary between them, are the ones who will build the systems that hold up when everything else starts to fray.

I have spent the last eight months alternating between explaining MCP to teams who only knew tool calls and explaining A2A to teams who only knew MCP. The teams that come out the other side with shippable agent products all converge on roughly the same shape. MCP for the toolbelt. A2A for the seams. Both used deliberately, neither used as a hammer.

If you take one thing from this article into your next system, make it the question I asked that team at the start. Why are you using this protocol for this edge? If the honest answer is "because I did not know there was another option," you now know.

Prompts as Code: How to Version, Test, and Ship the Prompt Layer in 2026

Alex Cloudstar — Wed, 20 May 2026 08:29:38 +0000

The prompt regression that cost me the worst Sunday of 2025 was twelve characters long.

I had been tuning the system prompt of an LLM feature for about a month. The feature was a structured-output endpoint that took a free-form user message and returned a JSON object with five fields. It worked. Users liked it. Conversion was up. On a Friday afternoon I added what I thought was a small clarification to the prompt: a sentence telling the model to "prefer concise wording" in one of the fields. I deployed. I went into the weekend feeling good about my week.

By Sunday morning, support had three tickets and a Slack DM from our biggest customer. The endpoint was returning malformed JSON on roughly 14% of requests. Not always. Just often enough that retries hid it from our dashboards. The "prefer concise wording" sentence had pushed the model into a mode where it sometimes dropped a comma in the JSON output, because conciseness and valid JSON syntax were apparently in tension in a way I had not anticipated.

The fix was rolling back the prompt. The fix took ninety seconds. The problem was that I had no idea which version of the prompt was running. I had edited the prompt in three places that month: the source code, a draft in Notion, a quick test in the model playground. The deployed version was a copy-paste of one of them. I had no diff. I had no test for "does this prompt still return valid JSON for our golden set of inputs." I had no rollback button. The Sunday was spent reconstructing what I had typed, retesting it by hand against a list of fifty inputs I dug out of the logs, and deploying the rebuilt prompt.

That weekend is the reason I now treat prompts as code. Not "code-adjacent." Not "prompts that live in a file in the repo, sort of." Actual versioned, tested, CI-gated source code with the same review rigor as anything else we ship. The argument for this is not aesthetic. The argument is that you cannot debug what you cannot see, and you cannot roll back what you do not version. Everything else flows from that.

This post is what the prompt-as-code workflow actually looks like, what tools sit at the layer above your editor, and the migration path that gets you from "the prompt is somewhere in a Notion doc" to "the prompt has a CI job and a changelog."

What Prompts As Code Actually Means

The phrase "prompts as code" gets thrown around like every other piece of AI vocabulary. Let me be specific.

A prompt-as-code workflow has five properties. If any of them are missing, you are doing something else.

Every prompt lives in a tracked file or a tracked registry. Not in a Notion page. Not pasted into a Cloudflare worker. Not in a "system message" field of a hosted dashboard that nobody on the team has the URL for. The prompt has a path. The path is in a repo or in a registry that the repo references by ID. Either is fine. Both are fine. What is not fine is "I think the prompt is in this Slack thread."

Every change to the prompt is reviewed. Same workflow as a code change. A PR, a review, a merge. The reviewer is allowed to push back. The reviewer is qualified to push back, because the prompt is presented in a diff alongside the test results, not as a 200-line wall of text in a doc with no context.

Every prompt has a test. At minimum, a smoke test that runs the prompt against a small set of golden inputs and validates the output shape. Ideally, an eval that scores quality across a larger set. The test runs in CI before merge. The PR cannot land if the test fails. I went deep on the eval question in the AI evals for solo developers piece; the short version is that even a fifty-row eval set is dramatically better than nothing.

Every prompt is versioned. Not just "the file is in git." The runtime knows which version is deployed. The logs include the version. When a regression hits, you can look at a request, read the version, and check exactly what prompt was used. This is the single capability that turns prompt debugging from archeology into engineering.

Every prompt is rollbackable. A bad deploy is a one-button revert. Not "redeploy the application with the old prompt re-pasted." A real revert that flips the active version back to the previous one in seconds, without a full app redeploy, because prompts change more often than code and tying every prompt change to a full app deploy is the friction that turns engineers into copy-paste cowboys.

You can get all five with tools that exist today. The question is just whether your team has actually built the workflow, or whether you are still in the world where the prompt is a string literal in src/services/ai.ts and nobody knows where the canonical version lives.

The Two Architectures That Work

There are two workable architectures for prompts as code in 2026. They are real alternatives, not flavors of the same thing.

Architecture one is prompts in the repo. The prompt files live next to the code that uses them. Format is your call: .prompt files, .md files, TypeScript template literals, YAML. The build system bundles them. The runtime imports them. Version is the git SHA. Rollback is a revert and a redeploy. Tests are colocated with the prompt file.

The strengths are that this is simple, the diff history is obvious, and the same review workflow as code already exists in your team. The weaknesses are that you cannot change a prompt without a full app deploy, the prompt is locked to the application's release cycle, and non-engineers cannot touch the prompts without going through engineering.

Architecture two is prompts in a registry. The prompts live in a managed service. Langfuse, PromptLayer, Maxim, LangWatch, Helicone, Braintrust. The application fetches the active version at runtime by ID, with optional caching. Updates happen in the registry's UI or via API. The version is a registry version, not a git SHA. Rollback is a one-click change of the active version. Tests run against the registry, either in CI or in the registry's own eval system.

The strengths are that prompt updates are decoupled from code deploys, non-engineers can edit prompts safely, and the registry usually includes built-in eval and observability. The weaknesses are that you have introduced a runtime dependency, the prompt history lives outside git so your "source of truth" is now split, and you have to trust the registry to be up when your app is up.

I have shipped both. The choice depends on team shape.

If you are a solo founder or a two-person team, the repo architecture wins. You do not have non-engineers editing prompts. You do not have enough prompts to justify a registry. The simplicity of "the prompt is a file in the repo" is worth more than the flexibility of a registry.

If you are a five-person team with a product manager who wants to A/B test prompt copy, or you have prompts that change daily, or you have prompts you want to roll out to 10% of traffic before everyone, the registry architecture wins. The decoupling pays for itself in week one.

The two are not mutually exclusive. Some teams use registries for the prompts that change often (marketing-style outputs, customer-facing tone) and keep the structural prompts (JSON schemas, agent instructions) in the repo. That hybrid is fine. The rule is: every prompt lives in exactly one place. Never two. Never "it is in the repo but also in the registry and they sometimes disagree."

The CI Job That Actually Catches Things

The single most valuable artifact in a prompt-as-code workflow is the CI job. Not the registry, not the file format, not the eval framework. The job that runs on every PR and fails the PR if the prompt regression rate exceeds a threshold.

Here is what mine does, in practice. The same shape works for repo-based and registry-based architectures.

The job loads the prompt under review. It loads the eval set, which is a JSON file of input-output pairs that I maintain by hand based on real production examples. The set is around 80 rows for the smaller prompts, 300 rows for the bigger ones. Each row has an input (whatever the prompt is meant to take) and an expected output, or a set of validators the output must pass.

The job runs the prompt against each input. The model behind it is the production model. The runs happen in parallel with a small concurrency limit to avoid rate limits. The cost of a full eval run is between $0.50 and $5 depending on the prompt size and model, which is cheap enough to do on every PR.

The job collects the outputs. It runs the validators. The validators are a mix of structural checks (is this valid JSON, does it have these fields, are the field types correct) and quality checks (is the score from a judge model above this threshold, does this output match the expected pattern). I lean structural where I can, because structural validators are deterministic and judge-based validators are not.

The job emits a pass-rate. The PR fails if the pass-rate drops more than a threshold, currently 3%, from the previous prompt version. The PR also fails if any of a small set of "must-pass" cases regress. Those are the cases I learned about the hard way: the JSON-validity case from the Sunday incident, the SQL-injection case from another bad weekend, the tone-of-voice case from when a customer pointed out that our model had started sounding like a chatbot.

The full CI job is around 200 lines of code. It is the cheapest insurance I have ever bought. In the eight months it has been running, it has caught six regressions that would have shipped without it. Two of them would have been customer-visible incidents. The others would have been silent quality drops that I would have noticed weeks later via support tickets.

If you do not have this job, write it before you do anything else. The exact tool does not matter. The discipline of "no prompt change ships without an eval pass" is the discipline.

The Prompt Diff That Reviewers Actually Read

The other piece nobody talks about is what the PR review actually looks like.

A prompt diff is hard to read. The default GitHub diff shows you the changed lines, but the semantic difference between two prompts is rarely the changed lines. It is what the change does to model behavior. A two-word change can shift outputs significantly. A twenty-line change can have no effect. Reviewing prompts by reading the diff alone is the same as reviewing code by reading the diff without ever running it.

The fix is to attach the eval output to the PR. My CI job posts a comment on every PR with three artifacts. The first is the pass-rate before and after. The second is a sample of 10 outputs from the eval set, side by side: the previous prompt's output and the new prompt's output. The third is a list of the failing cases, with input and expected output, so the reviewer can read the actual model behavior that triggered the failure.

The PR review changes shape with that comment in place. The reviewer is not reading the diff and squinting. They are reading the model's actual behavior, looking at the side-by-side, and deciding whether the change is an improvement or a regression. The diff is supporting evidence. The output is the substance.

This is the same shift that happened in frontend code a decade ago when teams started attaching Percy or Chromatic visual diffs to PRs. The diff is necessary but insufficient. The visual is the actual artifact. Prompts work the same way. The text change is necessary but insufficient. The output behavior is the actual artifact.

If you can get this in place, your team starts having real prompt review conversations. Without it, you are voting on vibes.

What Goes In The Registry vs The Repo

If you go with the registry architecture, the question that consumes the next month of your team's life is "what goes where." Here is the heuristic I have ended up with.

Repo: the structural prompts that define agent shape. The JSON schemas. The tool definitions. The system prompts that lock the model into a specific role. Anything where the prompt change is logically a code change because the application's contract depends on the output shape.

Registry: the surface-level prompts that change with product decisions. The tone of voice. The marketing copy. The user-facing greeting. The prompt that summarizes a document for the user. Anything where the prompt change is logically a product decision and should be touchable by a PM without a code deploy.

The line gets fuzzy. There are prompts that feel structural but actually change often. There are prompts that feel cosmetic but actually have downstream code dependencies. The rule of thumb I use is: if changing the prompt could break a downstream consumer (a JSON schema mismatch, a missing field, a different output type), it goes in the repo. If changing the prompt could only ever affect output quality and tone, it goes in the registry.

Either way, the registry needs to support the same five properties as the repo workflow: tracked, reviewed, tested, versioned, rollbackable. If your registry does not have an audit log, prompt review, eval integration, version pinning, and rollback, you have bought a prettier Notion. Replace it with a real one. The registry market in 2026 is competitive. Tools that lack these features will be irrelevant by 2027.

The Eval Set Is The Hard Part

I want to be honest about which part of this is hard, because the tooling makes it sound easier than it is.

The CI job is a one-week build. The registry integration is a one-day build. The prompt files in the repo are a five-minute decision. None of that is the hard part.

The hard part is the eval set.

A good eval set takes a long time to build, requires real production data, and never finishes. You add cases for every regression you ship. You add cases for every customer report. You add cases when you notice an output that looks weird. The set grows. Six months in, it has 500 rows and you trust it. Twelve months in, it has 1,200 rows and it is one of the most valuable artifacts your company owns. You do not throw it away. You do not regenerate it. You curate it.

Most teams give up on this before month two. The cases feel arbitrary. The pass-rate hovers at 87% and seems impossible to push higher. The eval cost adds up. The temptation is to skip the eval and just deploy.

The teams that push through end up with something rare and valuable: a precise, executable definition of "what our AI feature is supposed to do." That artifact is more valuable than the model behind it. The model can be swapped for a better one. The eval set tells you whether the swap was actually better, on the specific axes your customers care about. Without the eval set, every model upgrade is a gamble.

If you want a starting point that does not require months of curation, look at the LLM cost optimization workflow I described earlier; the same logs you mine for cost are the logs you mine for eval cases. Pull a hundred recent inputs. Sample for diversity. Write the expected outputs by hand. That is your eval set v0.1. Ship it. Improve from there.

The bar is not perfection. The bar is "better than no eval set." That bar is low. Clear it and keep moving.

Migration: From "Where Is The Prompt" To Prompts As Code

If you are starting from "the prompt is somewhere in a Notion page and probably also a code file and I am not sure they agree," here is the migration path that works.

Week one: find every prompt. This is uncomfortable. Most teams discover they have more prompts in production than they thought. Grep the codebase for system:, user:, messages:. Search Slack. Open every Notion page anyone mentions. Make a list. The first time I did this, I found 14 prompts. I had thought we had 6.

Week two: pick one architecture and move one prompt. Do not try to move all 14 in week two. Pick the most-touched prompt. Move it. Build the eval set for it. Wire up the CI job. Watch a PR go through the full workflow. Adjust until it feels right.

Week three: move the next three prompts. Use the workflow you built in week two. Notice what breaks. Improve the tooling.

Months two and three: move the rest. Some prompts will resist. They will be in weird places, owned by people who do not want to touch them, or attached to features you are planning to deprecate. Move them anyway. The cost of one prompt outside the workflow is more than the effort of moving it.

Quarterly: review the eval sets. The same way you review your topical standards files, walk through the eval sets. Cases that are no longer relevant get removed. Cases that catch new regressions get added. The sets stay current. The CI job stays meaningful.

That is the whole migration. It takes a quarter for a small team, half a year for a larger one. The dividend is paid every time you ship a prompt change without a Sunday morning support fire.

What This Changes About How You Build

The part that surprised me most when I made the switch is what it changed about my own decision-making.

Before prompts-as-code, every prompt change felt heavy. I would tweak a word, redeploy, watch the dashboards for an hour, hope nothing broke. The cost of a change was high because the blast radius was unclear. So I changed prompts less often than I should have. The prompts got stale. The feature underperformed. I knew it could be better but I was scared of the breakage.

After prompts-as-code, every prompt change costs the same as a code change. Write the change, open a PR, watch the eval, ship if green. The blast radius is bounded by the eval. The rollback is one click. The fear is gone. So I change prompts more often. The prompts stay current. The feature improves on a normal cadence instead of in stressful sprints.

The same thing happened when teams adopted CI for application code 15 years ago. The cost of a change went down. The volume of changes went up. The quality of the codebase improved because nobody was scared to touch it anymore. Prompts are following the same arc, ten years late.

If your team is still scared of prompt changes, that is the symptom. The fix is the workflow. The discipline of versioning, testing, and rollback removes the fear. The fear was real. The fix is real.

The Sunday incident I started this post with cost me eight hours and a piece of my Sunday. I do not think about that incident often anymore, because the workflow that came out of it has made the kind of mistake I made that day impossible to ship undetected. That is the trade. Eight hours of pain plus a few weekends of building the workflow, in exchange for a system that catches regressions before users do for the rest of your career.

Treat prompts like code. Version them. Test them. Roll them back. The tools exist. The discipline is the gap.

If you are still pasting prompts into Notion in 2026, the cost is not aesthetic. It is the Sunday you have not had yet.

Claude Skills vs Cursor Rules vs Copilot Instructions: How Real Teams Set AI Coding Standards in 2026

Alex Cloudstar — Wed, 20 May 2026 08:28:32 +0000

The first real fight I had with a teammate over AI tooling was not about which model to use. It was about which config file we were supposed to be editing.

We had a CLAUDE.md at the repo root, written by me back in February. We had a .cursorrules file from the engineer who joined in March, who only used Cursor and refused to even open Claude Code. We had a copilot-instructions.md under .github/, dropped in by whoever turned on Copilot for the org. And we had a half-empty AGENTS.md that someone had heard about at a conference. Four files. Four different versions of the truth. Three subtly different opinions on how to write a database migration. Zero shared understanding.

The bug that triggered the argument was small. A junior engineer asked Cursor to write a Postgres migration. Cursor produced something fine by its own rules. The PR review pointed out that "our convention is to always wrap migrations in a transaction and write a rollback." Junior engineer pointed at .cursorrules. Nothing about transactions. We checked CLAUDE.md. It had the rule. We checked copilot-instructions.md. Different wording, same intent. We checked AGENTS.md. Empty.

Nobody had written the rule into Cursor's file because nobody used Cursor when the rule was added. The model did exactly what the file told it to. The file was just wrong by neglect.

This is the new shape of team coding standards in 2026. The standards exist. They just live in five places, in five formats, with five different update schedules. If you are leading any team larger than two people that uses more than one AI coding tool, this post is for you. I am going to walk through what each config file actually does, where the meaningful differences are, and the layout that finally got my team off the Monday argument loop.

The Four Files That Run Your Codebase Now

Before we compare anything, here is the lineup. There are more than four config files in the wild, but these are the ones you actually need to care about.

CLAUDE.md is the Claude Code config. It lives at the root of your repo (or in ~/.claude/CLAUDE.md for personal global rules). Claude Code reads it on every session start and treats the contents as a permanent system prompt. It supports @path/to/file imports up to five levels deep, which means you can split your standards into multiple files and reference them. It also supports auto-memory, where Claude writes notes back into a memory directory based on your corrections and preferences during a session.

.cursorrules (or the newer .cursor/rules/*.mdc directory layout) is the Cursor config. The legacy .cursorrules is a single Markdown file. The new directory format lets you have multiple rules with frontmatter that controls when they apply. The frontmatter supports globs, alwaysApply, and description fields, so you can have a rule that only fires for .tsx files, or only fires when the user asks about migrations. This is genuinely useful and an underused feature on most teams I have seen.

copilot-instructions.md is the GitHub Copilot config. It lives at .github/copilot-instructions.md and applies to Copilot Chat, code review, and the autonomous coding agent. Copilot also supports path-specific instructions under .github/instructions/*.instructions.md with YAML frontmatter that uses glob patterns. The path-specific files combine with the root file when both match, so a TypeScript rule plus a general rule both apply to a .ts file.

AGENTS.md is the closest the ecosystem has to a neutral standard. It started as the OpenAI Codex CLI config, then got adopted by Cursor, Aider, and a handful of smaller tools as a fallback. It supports nothing fancy. No imports, no globs, no path-specific overrides. Just plain Markdown that any compliant agent will read. The widest cross-tool support means it is also the lowest common denominator.

There are others. GEMINI.md for Google Gemini CLI. .windsurfrules for Windsurf. WARP.md for Warp's agent mode. The pattern is identical to the four above. Pick the ones your team actually uses and ignore the rest.

What Each File Is Actually Good At

Treating these files as interchangeable is the first mistake most teams make. They are not. The features they support are different, and the optimal content for each one is different. Here is what I have ended up using each for.

CLAUDE.md is the longest-lived, most opinionated file in the system. Claude Code's harness is the most autonomous of the three big tools, which means the rules you write get the most leverage. A single sentence like "always run typecheck before declaring work complete" actually changes behavior across a 40-step agent session. The @import feature lets you split standards into topical files (@.claude/standards/api.md, @.claude/standards/testing.md, @.claude/standards/database.md) and keep the root file as a table of contents. That is how I structure mine now. The auto-memory layer also makes CLAUDE.md grow over time without you babysitting it, which is either a feature or a footgun depending on how much you trust the model's judgment about what to remember. I cover the broader Claude Code workflow in the agentic coding 2026 piece; the short version is that the leverage on a long-running agent is high enough that CLAUDE.md is worth the most editorial care.

Cursor rules are best treated as path-specific overrides. The new .cursor/rules/*.mdc format with globs is the killer feature. I have a rule that fires only for *.test.ts files and tells the agent to never use mocks for the database layer. I have another that fires only for *.tsx files and tells the agent the design tokens we use. The base file under .cursor/rules/ stays small and general. The path-specific files do the heavy lifting. If you write a 600-line .cursorrules at the root, you are using the tool wrong. Split it.

Copilot instructions are best for repo-wide truths. Copilot's strength is breadth: every developer with the extension reads it whether they like it or not. Its weakness is that the instructions are short, get truncated more aggressively than CLAUDE.md, and are mostly used for Copilot Chat and the PR-review agent rather than for long autonomous sessions. The right content for copilot-instructions.md is the high-signal rules that you want every developer's autocomplete and PR-review agent to enforce. Stack choices. Banned patterns. The five things that should never make it into a PR. Not a 200-line essay on architecture.

AGENTS.md is the politeness file. Its job is to exist for the tools that do not read your other files. Treat it as a redirect, not the source of truth. A typical AGENTS.md for my projects is twenty lines: "See CLAUDE.md for full standards. The most important ones are: A, B, C. Run bun test before claiming work is done." That is enough for an unfamiliar agent to do something reasonable without needing the full CLAUDE.md context.

The mistake teams make is treating all four as parallel. They are not. There is a hierarchy.

The Layout That Finally Worked

After three months of fighting with this, here is the layout my team settled on. It has stayed stable for the last six months, which is the longest any AI tooling decision has lasted in our org.

We have one source of truth: .claude/standards/. Inside it, one file per topic.

.claude/standards/
  api.md
  database.md
  testing.md
  frontend.md
  observability.md
  security.md

Each file is short. 100 to 200 lines. Each one is reviewed by the team lead for that area. Each one has the same structure: a one-paragraph statement of intent, then a list of rules with reasons, then a list of bans with reasons. The structure matters because it forces the writer to justify the rule. A rule with no reason gets removed.

The root CLAUDE.md is a 30-line file that imports the topical files and adds the cross-cutting rules that do not belong to any single domain.

# Project Standards

This project uses Bun, Postgres, Astro, Vercel.

## Topical standards
@.claude/standards/api.md
@.claude/standards/database.md
@.claude/standards/testing.md
@.claude/standards/frontend.md
@.claude/standards/observability.md
@.claude/standards/security.md

## Cross-cutting rules
- Never run destructive SQL without showing the rows first.
- Always run `bun test` and `bun run typecheck` before declaring work complete.
- No em dashes in any file we ship. Ever.

## Workflow
- We use trunk-based development. Feature branches are short-lived.
- PRs are reviewed by the area owner. Big changes get a Linear ticket first.

That is the whole root file. Everything else gets pulled in by import.

The AGENTS.md is fifteen lines and points at the same standards. It is a redirect with a fallback for tools that cannot import.

# Agent Standards

See `.claude/standards/` for full rules. If you cannot read those files,
the four rules you must follow are:

1. Use Bun, not npm or yarn.
2. Never write to the database without showing the rows first.
3. Always run `bun test` and `bun run typecheck` after changes.
4. Follow the existing code style and import order.

Full standards: github.com/our-org/our-repo/tree/main/.claude/standards

The copilot-instructions.md mirrors the same content but compressed for Copilot's shorter context window. We auto-generate it from the topical files using a small script that runs on pre-commit. That removes the drift problem at the cost of accepting that Copilot reads a summary, not the full standards.

The Cursor rules live in .cursor/rules/ and are tied to globs. The database.mdc rule has globs: ["**/migrations/**", "**/db/**"]. The testing.mdc rule has globs: ["**/*.test.ts", "**/*.spec.ts"]. The frontend.mdc rule has globs: ["**/*.tsx", "**/*.css"]. Each one is the relevant content from the corresponding .claude/standards/ file. We auto-generate these too, from the same source.

The script that does the generation is fifty lines of TypeScript. It reads the topical files, applies a compressor that strips reason-paragraphs and keeps only the rule lines for the shorter formats, and writes the four output files. Pre-commit runs it. The PR review checks that the generated files are in sync. If you edit a generated file by hand, the hook complains.

The point is not that you need exactly this layout. The point is that you need one source of truth and a way to project it into the formats every tool wants to read. Without that, you are running four parallel codebases of rules and one of them will silently drift.

The Things That Actually Cause Drift

Once you understand the file layout, the next problem is keeping the content in sync over time. This is where most teams fail, and the failure mode is always the same.

A new rule gets added to one file and not the others. A senior engineer adds a rule about transaction wrappers to CLAUDE.md after a bad incident. Six months later a junior engineer using Cursor produces the exact code the rule was meant to prevent, because Cursor never saw the rule. The fix is the projection layer above. One source, many outputs. Adding a rule means adding it in one place and letting the script handle the rest.

Old rules are never removed. AI coding standards files have a tendency to grow forever. A rule from 2024 about a deprecated library is still in the file in 2026. The model dutifully follows it. The newer engineer wonders why their code keeps getting rewritten to use Lodash. The fix is quarterly review. One thirty-minute meeting per quarter, the team lead walks down each topical file, and any rule that nobody can defend gets cut. We delete more than we add. The file shrinks roughly 20% per year. That is healthy.

The reasons get lost. A rule that exists without a reason is impossible to refactor. The team cannot evaluate whether the rule still applies, because the original context is gone. The fix is to require a reason for every rule. We use the same structure as my feedback memories in Claude Code: the rule itself, then Why: and How to apply: lines. The reason is mandatory. If you cannot articulate the reason, the rule does not belong in the file.

The standards contradict the codebase. A rule says "always use Tanstack Query for client-side data fetching." The codebase has 40 SWR calls because it predates the rule. The model sees both and picks one at random. The fix is to either migrate the codebase or update the rule. There is no third option. A standard that contradicts reality is worse than no standard at all, because it teaches the model to ignore standards in general.

Personal preferences sneak into the team file. This one is the hardest. An engineer who hates trailing commas adds "no trailing commas in object literals" to CLAUDE.md. The team has never discussed this. Six months later the codebase is half-formatted that way and the linter has been disabled to accommodate it. The fix is process. New rules go to PR. The team lead reviews. Rules with no team-wide consensus get rejected. The standards file is a team contract, not a personal scratchpad.

The pattern across all four is the same. Drift is a process problem, not a tooling problem. The tools project the rules. The team writes them. If the team is sloppy, no tool will save you.

The Difference Between a Rule and a Skill

There is a deeper distinction that most teams miss, and it is the one I would write into the back of every developer's hand if I could.

A rule is a constraint. It tells the model what not to do, or what it must do in a specific situation. "Never use mocks in integration tests." "Always wrap migrations in transactions." "Use Bun, not npm." Rules are short. Rules are imperative. Rules belong in CLAUDE.md and friends.

A skill is a procedure. It tells the model how to do a specific task well. "How to add a new API endpoint to this codebase." "How to review a PR for security issues." "How to migrate a table without downtime." Skills are long. Skills are explanatory. Skills belong in the Claude Code plugin marketplace, in SKILL.md files under .claude/skills/, or in the equivalent system for whichever tool you use.

The mistake teams make is putting skills into CLAUDE.md. The file balloons to 800 lines. The model still uses it, but each rule loses leverage because it is buried in three pages of procedural content. Worse, the procedures get triggered every session even when they are not relevant, which wastes tokens and slows everything down.

The right home for a procedure is a skill that fires on demand. Claude Skills, Cursor's path-globbed rules, GitHub's path-specific instruction files. Each tool has its own way of saying "load this content when this condition is true." Use them. Keep the main config file lean. Keep the heavy procedural content in the skill system, where it loads only when needed.

If you read CLAUDE.md files in the wild, you will find this distinction violated everywhere. The 1,200-line CLAUDE.md is almost always 200 lines of rules and 1,000 lines of procedures that should have been skills. Splitting them is a one-day refactor with month-long benefits.

What Each Tool Is Actually Best At, In Practice

After running all three of these tools in production for the better part of a year, here is my honest read on where each one earns its keep. This is also the distribution my team has settled on.

Copilot is the default for everyone. Junior, mid, senior, every laptop in the org. It is the autocomplete baseline. The instructions file gets the most universal rules, the ones we want every PR to enforce. We do not expect Copilot to do agentic work for us. We expect it to make every developer 15% faster on the boring 90% of their day and to catch the obvious mistakes in PR review.

Claude Code is the senior engineer's tool. It is what we use for the high-leverage agentic work: a multi-file refactor, a migration plan with rollback, a new feature that touches the backend, the frontend, and the database in one shot. The CLAUDE.md and .claude/skills/ are where the senior engineers invest. We do not push Claude Code on the juniors. They graduate into it when they are ready to direct an agent rather than press tab.

Cursor is the IDE-first option for engineers who want a richer editor experience. Some of our team prefers the Cursor UI, some prefers VS Code with Copilot, some lives in Claude Code in the terminal. We do not standardize. The Cursor rules are auto-generated from the same source. The cost of supporting three tools is small once the projection layer exists. The cost of forcing everyone onto one tool is high, because half the team would lose 20% productivity in the switch.

If you compare these tools head-to-head, the deeper comparison lives in the Claude Code vs Cursor vs Copilot 2026 piece. The TL;DR for the config-file question is: each tool needs its config file, the content overlaps 80%, the projection layer is what keeps them honest.

The Workflow For Adding A New Rule

The last thing worth writing down is the workflow itself. Without a workflow, you are back to ad-hoc edits and silent drift.

A new rule starts as a Linear ticket or a Slack thread. Someone notices a recurring problem. Someone writes up the pattern, the proposed rule, and the reason. The team reviews. If there is consensus, the rule lands in the relevant topical file under .claude/standards/ via PR. The PR includes the reason in the commit body. The PR is reviewed by the area owner.

After merge, the projection script runs. The generated files update. Copilot, Cursor, AGENTS.md all reflect the new rule on the next session.

If a new rule contradicts the existing codebase, the PR includes a migration plan. Either the codebase changes to match the rule, or the rule changes to match the codebase. We do not ship a standard that lies about the code.

Every quarter, the team lead reviews each topical file. Rules with no defender are deleted. The file shrinks. The standards stay relevant.

This sounds heavy. It is not. Most quarters the review takes thirty minutes. Most new rules get reviewed in a five-minute Slack thread. The investment is upfront, in building the projection layer and the review habit. Once it is built, it runs itself.

The Part Nobody Talks About

The hidden cost of AI coding standards is not the tooling. It is the social cost of writing them down.

For years, senior engineers got away with carrying the standards in their heads. They reviewed PRs, they corrected the juniors, they shaped the codebase one line of feedback at a time. The standards existed, but they were tacit. The senior was the standard.

The AI tools force tacit standards out into text. Every rule the senior used to hold in their head now has to be articulable, defensible, written. The first time an engineer sits down to write the database.md file for their team, they hit the wall: half of what they "know" is actually preference, not principle. The other half they cannot defend without a long story about an incident from two years ago. Writing the rules surfaces every weak link in your team's thinking. It is uncomfortable. It is also exactly the thing that makes the team stronger.

The teams I have seen do this well treat the standards file as a team artifact, not a personal one. The senior writes the first draft. The team reviews. The juniors ask why. The senior either has a good answer or removes the rule. After six months, the file is denser and shorter than the first draft, because the only rules that survive are the ones the team can defend.

That is the real prize. Not faster agents. Not more lines of code per day. A team that can explain, to a piece of software, what good looks like. The agents read the file. The juniors read the file. The future senior who joins next year reads the file. The standard is no longer locked in one person's brain. It is on disk, versioned, reviewable, improvable.

The tools are the surface. The discipline of writing down what you actually believe is the substance.

Pick the layout you can defend. Build the projection script. Run the quarterly review. Stop arguing about which file is the source of truth and start arguing about what the rules should actually be. That is the conversation that makes a team better.

The Monday argument on my team has changed. It is not "which file should we have edited." It is "should this rule exist at all, and what is the reason." That is the right argument to be having. If you are not having it yet, that is the goal.

The Freelance Profit Leak: Why Solo Developers Lose Money Even When They Are Booked Solid

Alex Cloudstar — Tue, 19 May 2026 08:27:07 +0000

The last contract I lost money on was the one I felt best about.

It was a four-week React build for a fintech founder who paid the invoice in three days, gave me a Loom of genuine thanks, and asked when I could start the next phase. I had quoted $14,000 fixed price. The work shipped on time. The client was happy. Every signal looked green.

Then I sat down on a Sunday and did the thing I had been avoiding for two years. I added up every hour I had actually spent on that project. Not the ones I had written on a timesheet. The real ones. The standup calls. The Slack threads. The "quick" Figma re-review on a Saturday morning. The night I rebuilt the auth flow because the OAuth provider had silently deprecated a callback shape. The four hours debugging a Vercel deploy that turned out to be a stale env var.

The total came to 132 hours. My effective rate on that contract was $106 an hour. My posted rate, the one I told clients with a straight face, was $175.

I had given the client a 40% discount and never noticed.

This is the freelance profit leak. It is the gap between the work you bill for and the work the project actually consumes. Almost every freelance developer I have spoken to in 2026 has one. Most of them are running blind on it. This is the post I wish someone had handed me three years ago, when I was still convinced "I roughly know where my time goes."

What The Profit Leak Actually Is

A profit leak is not theft, fraud, or a client doing you dirty. It is a structural feature of solo work. There is no project manager logging your hours, no PMO chasing utilisation reports, no Jira board reconciled against a burndown. The only person who knows where your time went is you, and you are too busy doing the work to also be the auditor.

The leak shows up in a few specific places. Once you can name them, you can plug them. Until you can name them, you cannot.

Unbilled communication. The thirty-minute kickoff call. The Slack thread that ran 90 messages over two days. The async Loom you recorded three times because the camera caught your kitchen. None of this lands on the invoice. All of it eats your week.

Scope drift you absorbed. The "small tweak" that turned out to need a database migration. The "just add a button" that triggered a redesign of three modals. The feature you said yes to at 11pm on Thursday because you were tired and the client had been nice all week. Every drift you absorbed is a discount you gave away.

Friction that does not show up as work. The forty-five minutes spent reauthenticating into a client's staging environment because their secrets rotated. The fight with their CI pipeline because nobody documented the deploy steps. The hour you wasted because their designer renamed every Figma component. Each of these is a tax on your hourly. None of them are billable.

Context-switch overhead between clients. Two clients on the same day. Each one needs twenty minutes of "where was I" before you can actually code. That is forty minutes of friction you swallowed. Multiply by 200 working days and you have lost a full month of revenue to nothing.

Refactor and cleanup time you absorbed as "professional pride." You shipped the feature in nine hours. You spent another four cleaning up the diff, writing tests, and fixing the lint config. The client paid for nine. You paid for four.

The pattern across all of these is the same. Time leaves your day, value never lands on the invoice, and because the gap is in your head and not on a screen, you tell yourself "it was fine, it averaged out."

It rarely averages out. It compounds.

The Math Most Freelancers Are Not Running

If you have ever read the developer freelancing playbook, you know the rate ladder. Junior, mid, senior, specialist. The advertised rates make freelancing look like a clear win over salary. The reality is messier, because the rate on your contract is not the rate you take home.

The real number is your effective rate per client. The formula is brutally simple.

effective_rate = (invoiced_amount - direct_costs) / total_hours_consumed

The two parts that lie to you are direct_costs and total_hours_consumed. Most freelancers either skip them or undercount them.

direct_costs includes the obvious stuff (Vercel, Cloudflare, GitHub seats, design tools you bought for the project) and the not-so-obvious (the Claude or Cursor credits you burned on this client's code, the LLM bill on the eval suite you ran ten times, the Stripe fees on the invoice itself). If you are running AI agent workflows for a client, the inference cost is a direct cost. It comes out of your margin if you have not priced it in.

total_hours_consumed is the harder number. It is every minute the project occupied your brain. Communication. Meetings. Re-reading their spec on a flight. Debugging the deploy at 9pm. The walk where you mentally rehearsed the next standup. Yes, all of it.

You do not have to capture every minute perfectly. You have to capture it well enough that the number you write down is closer to truth than to fiction. Most freelancers underreport their own hours by 30 to 50 percent. I did. The Sunday I sat down with that fintech project, my logged time said 78 hours. Reality said 132.

If you do this once, for any project that closed in the last 90 days, you will find one of three things.

The first is that the math actually works. Your effective rate is close to your advertised rate and the leak is small. Good. Keep doing what you are doing and move on.

The second is that one client is dragging your average down by ten or twenty percent. You can usually name them within thirty seconds. They are the client who pays on time but consumes three times the meeting hours of anyone else. Fix that relationship or fire them.

The third, and most common, is that two or three clients are bleeding you and one is carrying the whole month. This is the scenario where you feel busy, your bank account is fine, and yet your hourly is significantly lower than the rate you would have quoted a stranger today. That is the leak.

Why You Cannot Fix What You Do Not See

The reason this leak persists is not laziness. It is that the existing tools for time tracking were designed for the wrong problem.

Toggl, Harvest, Clockify, and the rest were built for agencies. Agencies have project managers, timesheets that get reviewed, and a back office reconciling utilisation. For a solo operator, those tools are heavy. You do not need a timesheet review workflow. You need a fast log, a clean view of where the hours actually went, and a profit-per-client number you can trust on a Friday afternoon.

Most freelancers I know either run a half-abandoned Toggl, a spreadsheet they update twice a week and lie to themselves with, or nothing at all. The result is the same. You bill confidently. You estimate badly. You renew contracts at the same rate because you have no data that says you should not.

There is also a soft trap in time tracking that is worse than not tracking at all. When you log time in a tool that pings you to start and stop a timer, you build a habit of treating the timer like the truth. The timer captures focused coding time. It does not capture the Slack thread, the call, the context switch. The number you end up with is the optimistic version of your week, which is exactly the version that hides the leak.

The thing that finally moved me was switching to a system that logged time per ticket, treated communication and review as first-class entries, and showed me a live profit-by-client number after every invoice. That is the system I now run myself. I built it because I could not find one that fit the way solo operators actually work. It is called Graphalt, and it is in waitlist while I finish a few of the rougher edges. I am not pitching it here so much as describing the workflow it forces, because the workflow matters more than the tool.

The Friday Margin Audit

You can run a version of this with any tool, including a Google Sheet, if you commit to the ritual. The ritual is the thing.

Block thirty minutes every Friday afternoon. The slot is non-negotiable. It is the only paid client work you do for yourself.

Open the week's hours. For each client, write down three numbers.

The first is billable hours. The ones you will put on the invoice or that count against a fixed-price budget.

The second is shadow hours. Communication, meetings, deploy debugging, environment setup, anything that touched the client but did not produce a line item. Be honest. Round up rather than down. If you think it was 90 minutes, write 2 hours.

The third is direct costs accrued this week. Compute, AI tokens, paid APIs, anything you billed to your own card on this client's behalf.

Now compute the effective rate for the week.

effective_rate = (week_invoiced - direct_costs) / (billable_hours + shadow_hours)

You will get an honest number. Compare it to your posted rate. If the gap is more than 15 percent, something is leaking. The leak is one of the five categories from earlier. Most weeks it will be unbilled communication or scope drift.

Then, and this is the part most freelancers skip, write down what you are going to change next week. One thing. Not five. One. Maybe it is "no meetings under thirty minutes get scheduled without a written agenda." Maybe it is "every scope change goes through a one-line email that I keep in a folder." Maybe it is "I add a 15% communication buffer to every fixed-price quote until further notice." Just one change, and you do it for two weeks before you measure again.

The reason this works where most "track your time better" advice does not is that it ties the act of measuring to the act of deciding. You are not logging hours for the sake of logging. You are logging them to make one specific decision a week. The decision compounds. The hours stop leaking.

The Three Clients You Will Find In Your Data

After you run the audit for a month, you will start seeing patterns. Every freelance roster I have audited has the same three clients in it.

The Anchor. This is the client whose work is well-scoped, whose communication is on email or in a shared doc, and whose effective rate is within 5 percent of your posted rate. Maybe even above it. The Anchor is who you should be selling more of your time to. They are not always the highest-paying client on paper. They are the highest-paying client after the leak.

The Mirage. This is the client who pays well per hour on paper and consumes 40 percent more time than their invoice reflects. The Mirage is usually a founder who is friendly, busy, and bad at scope. You like them. They like you. The effective rate is somewhere between disappointing and tragic. The Mirage is not malicious. They just have a startup brain and a constant trickle of "could you also..." messages. You either retrain them (hard) or you raise their rate by 25 percent at the next renewal to cover the leak, or you let them go.

The Tax. This is the client who pays late, has a bureaucratic intake process, requires three weekly meetings, and asks for change requests in five different channels. The Tax may technically be profitable on paper. They are not profitable when you account for the energy cost, the context-switch overhead, and the time you spend dreading their Slack pings. The Tax is the easiest one to fire and the hardest one to admit you should fire, because the invoice numbers look fine.

The reason the audit works is that it gives you a number to point at instead of a feeling. "I feel like this client is hard" is easy to override. "This client's effective rate is $58 against a posted $150" is harder to argue with at 5pm on a Friday when you have data in front of you.

Pricing After The Leak

Once you can see the leak, your pricing should change. Most freelancers respond to "my effective rate is low" with "I should work harder." That is the wrong move. The right move is to price for the leak you cannot eliminate, and eliminate the leak you can.

The leak you cannot eliminate is the communication-and-context tax that any client costs. Twenty to thirty percent of your time on every project will go to things that are not the actual code. Quote for it. If the client wants a fixed price for what you estimate is 40 hours of code, quote 52 hours of price. Do not call out the buffer as a "communication tax" line item. Bake it into the number. The number is the number.

The leak you can eliminate is the unstructured creep that comes from saying yes too easily. A one-page change-request process kills more profit leak than any time tracker. The format does not matter. A Google doc, a Linear issue, a Notion page, even an email thread you star. What matters is that nothing new gets built without a written line. That line is your boundary. Everything that lands inside the line is in scope. Everything outside the line is a new quote.

There is also a deeper pricing change that most freelance developers eventually make, which is to move off pure hourly billing for at least one engagement type. Hourly billing rewards inefficiency. The faster you get, the less you earn. Fixed-price billing on well-scoped work, retainer billing for ongoing maintenance, and outcome-based billing for higher-stakes projects all give you the leverage that hourly cannot. I went into the mechanics of this transition in the AI-powered agency playbook; the short version is that pricing leverage matters more than rate at scale.

The Tools Question, Honestly

I am going to be specific about tools because handwaving is unhelpful. Here is what I use and what I would change if I were starting today.

For per-ticket time logging, I want a flow where I can open the ticket I am working on, log the time, and close the window in under ten seconds. No timer to start. No timer to forget. No timer to lie. Just a quick "I spent 45 minutes on ticket BUG-142, here is a one-line note about what I did." That note matters. It is the audit trail that lets the Friday review actually work.

For profit-per-client, I want a number that updates after every invoice and every cost entry, with currency conversion if my clients pay in different currencies. The number should be live, not "let me run a report." If I have to run a report, I will not, and the system fails the first month.

For invoices, I want them generated from the time entries themselves, with line items grouped by ticket, project, or week depending on the contract. The friction of going from "the work is done" to "the invoice is sent" is where most freelancers leak revenue. Reduce that friction and you collect on time more often.

This is the workflow Graphalt is built around. I am openly biased here because I built it for myself. The point is the workflow, not the tool. Toggl plus a spreadsheet plus FreshBooks will get you most of the way there. So will a custom Notion. So will, for a small enough client list, a Google Sheet you maintain religiously. The cost of the wrong tool is friction. The cost of no system is the leak we have been talking about for two thousand words.

A note for developers who like the idea of "I will just build my own": you will not. I tried. I built three internal versions over four years and abandoned every one because solo project tooling lives at the bottom of your priority list when paying client work is on top. Use something that exists. Save the build energy for the side project that might actually pay you back.

What I Would Tell Past Me

If I could send one paragraph back to the version of myself who closed that fintech project feeling proud, it would be this. The contract was not a $14,000 win. It was a $14,000 invoice on $14,000 of actual delivered value, and the delivered value cost you 54 hours more than you priced for. You did not lose money on the project. You lost the right to call it a high-rate project. That is not a moral failure. It is a math failure. The math is fixable. You just have to actually look at it.

The freelance career most developers want to build is the one where year three is meaningfully better than year one. Better clients, better rates, better hours, better margin. None of that happens automatically. The compounding only kicks in if you are paying attention to where the leak is. Otherwise you keep running the same plumbing and wondering why the bucket never gets full.

Sit down this Friday. Pick one client. Run the math. The first audit will be uncomfortable. Every subsequent audit will be easier. By the third month you will know exactly which client is your Anchor, which is your Mirage, and which is your Tax. By the sixth month you will have raised one rate, fired one client, and added a buffer to your fixed-price quotes. By the twelfth month, your effective rate will start to look like your posted rate.

That is the whole game. The leak is not the enemy. Ignoring the leak is the enemy.

Track the time. Run the math. Plug the holes. Build the freelance career the math says you should have, not the one the invoice line says you do.

The Claude Code Plugin Marketplace: How Skills, MCP Servers, and Plugins Actually Fit Together in 2026

Alex Cloudstar — Tue, 19 May 2026 08:27:06 +0000

The first plugin I installed in Claude Code did something I had been doing by hand for nine months.

It was a skill called frontend-design. I added it, asked Claude to build a landing page, and watched it ignore every default React-and-Tailwind cliché it usually reaches for. No emoji buttons. No space-y-4 p-6 rounded-2xl. No three-column-card grid. The skill loaded in a specific design vocabulary at the top of the turn and pushed the rest of the model output into a different aesthetic universe. I had been writing that vocabulary into my own system prompt for nearly a year. Now it was a one-line install.

That was the moment I understood what the Claude Code plugin marketplace actually is. It is not an app store. It is a distribution system for opinions. Specifically, the opinions about how to do a thing that you used to keep in your own CLAUDE.md, in a Notion page, or in your head. The marketplace turned those opinions into installable units that any developer can subscribe to in thirty seconds.

By May 2026, the official marketplace lists hundreds of plugins, the community marketplaces have shipped over two thousand skills, and the ergonomics finally feel like a real ecosystem rather than a hobby project. This is the post I wish I had read in February when I was still confusing plugins, skills, MCP servers, and hooks and reaching for the wrong one on every problem.

The Mental Model

The single sentence that unlocks the marketplace is this. Plugins are the distribution format. Skills are the content. MCP servers are the data layer. Hooks are the automation.

If you only remember one paragraph from this post, remember the four nouns and what each of them does.

A skill is a Markdown file with a SKILL.md at its root, optional supporting files, and a frontmatter block that tells Claude when to fire. A skill teaches the model how to do a specific thing well: write Remotion videos, debug a flaky test, audit SEO, review React components, build a landing page in a specific style. When the user's request matches the skill's trigger description, Claude reads the skill and follows its instructions. The model does not run anything outside its normal toolset. The skill is, fundamentally, structured prompt material.

A plugin is a packaged folder that can contain one or more skills, plus MCP server configs, hooks, slash commands, and agent definitions. It is the unit you install and uninstall. Plugins are how skills travel between developers. The Anthropic-official marketplace ships plugins. The community marketplaces ship plugins. Your team's private GitHub repo ships plugins. The wrapper is always the same.

An MCP server is a separate process that exposes tools to the model over a standardised protocol. If you want Claude to read your Linear tickets, query your Postgres, or look up your internal Confluence, you need an MCP server for that data source. The marketplace lists MCP servers as plugin components. Installing them wires the model into real systems. I went deep into MCP in the MCP developer guide; the short version is that MCP is the thing that turns Claude from "a model that writes code" into "an agent that touches your stack."

A hook is a shell command the harness runs in response to lifecycle events. SessionStart, UserPromptSubmit, Stop, PostToolUse, etc. Hooks are not prompts. They are real automation. They run when something happens, regardless of what the model thinks. Hooks are how plugins enforce behavior the model cannot enforce on its own. You can find the full event list in the Claude Code hooks docs.

Most plugins use two or three of these primitives. A well-designed plugin uses all four. The shape of a great plugin is something like: skills for the agent's behavior, MCP servers for the data the skills need, hooks to enforce the rules the skills suggest, and slash commands to expose the most common workflows to the user.

Once you see the four pieces, every plugin you read becomes legible.

What The Marketplace Actually Looks Like

The default Anthropic marketplace ships with every Claude Code install. You did not have to do anything to get it. Run /plugin list inside Claude Code and you will see plugins from anthropics/claude-plugins-official. The official set is conservative: code review, security review, status line config, a few language-specific best practices, the init plugin that drops a CLAUDE.md template.

The much larger world lives in community marketplaces. The biggest catalogues I have used.

The claudemarketplaces.com directory is a discovery surface, not a marketplace itself. It indexes plugins from GitHub and lets you search by category, popularity, and last update. Useful for browsing. It does not host. The install command still points at the source repo.

The tonsofskills.com project bundles 425 plugins and 2,810 skills with a ccpi CLI on top. The CLI is closer to npm than to apt: search, install, version-pin, dependency-resolve. It is the closest thing the ecosystem has to a real package manager. It is also community-curated, which means the quality bar is wide. Some skills are excellent. Some are someone's lunch break.

The Vercel plugin marketplace, which ships as part of the Vercel CLI, is a vertical bundle. It includes deployment helpers, AI Gateway routing, Vercel Sandbox tooling, the bootstrap orchestration, and a handful of MCP servers for project metadata. If you ship to Vercel, this is the one to add first.

There are smaller domain-specific marketplaces. One for security tooling, one for Notion automation, one for video creation around Remotion, one for the Anthropic API itself. They each ship a couple of dozen plugins, deep on their vertical, shallow on everything else. Pick the ones that match your stack and skip the rest.

The pattern across all of them is the same. A marketplace is a Git repo with a marketplace.json at the root, listing the plugins it hosts. You subscribe to the marketplace once. New plugins inside it become discoverable automatically. Updates to existing plugins flow through on the next update. Removing the marketplace removes everything you installed from it.

This is a deliberate design call from Anthropic, and it is the right one. The fragility of an "app store" model is that one central registry becomes a chokepoint. The Git-repo-as-marketplace model means anyone can run one, anyone can fork one, and the cost of running your own is roughly the cost of a public GitHub repo.

How To Actually Install Things

There are three install workflows in 2026, depending on where the plugin lives.

The first is the official marketplace, already attached to your install. To install a plugin from it:

/plugin install code-reviewer

That is the whole command. The plugin downloads, the harness picks up the new skills and slash commands, and you can use them in the same session. No restart.

The second is community marketplaces hosted on GitHub. You add the marketplace once.

/plugin marketplace add tonsofskills/marketplace

After the marketplace is added, every plugin it contains is installable by name.

/plugin install tonsofskills/seo-audit
/plugin install tonsofskills/frontend-design

The harness pulls the plugin from the source repo, validates it, and registers the new skills. You can run /plugin list to see what is installed and which marketplace it came from.

The third is local plugins. You point the harness at a folder on your machine. This is the one you use for plugins you are still building, or for private team plugins that live in your company's repo.

/plugin install ./packages/my-team-plugin

There is also a /plugin update command for refreshing everything to the latest version, and a /plugin remove for cleaning up. The CLI surface is small and stable.

If you live in a config file instead of a session, plugin entries also go in .claude/settings.json. You can pin specific versions, scope plugins to specific projects, and ship plugin configuration alongside your codebase. For team setups, the file-based approach is the one to standardise on, because it lets new joiners get the same plugin set on day one with a single repo clone.

The Skill Format, Concretely

A skill is a folder with a SKILL.md at the root. The minimum viable skill is twenty lines.

---
name: postgres-migration-reviewer
description: |
  Review Postgres migrations for safety before merging. Use when the user
  is editing a SQL migration file, asking about migration safety, or asks
  to review a database change.
---

When reviewing a Postgres migration:

1. Flag any DROP COLUMN, DROP TABLE, or ALTER COLUMN TYPE as high-risk.
2. Confirm the migration is reversible. If not, the PR description must
   explain why.
3. For any NOT NULL added to an existing column, confirm a backfill exists.
4. For any index added on a table over 10M rows, recommend CREATE INDEX
   CONCURRENTLY.

Always end with a short verdict: SAFE, RISKY, or UNSAFE, with one line on why.

That is a complete skill. The frontmatter tells Claude when to load it. The body tells Claude what to do once it is loaded. If you save that to ~/.claude/skills/postgres-migration-reviewer/SKILL.md, it is available in your next session.

A more sophisticated skill can include supporting files. A scripts/ folder for shell scripts the skill knows how to run. A references/ folder for reference material the skill quotes. A templates/ folder for boilerplate. The skill body can point to these files with relative paths and Claude will read them on demand. This is how skills stay small in context while still carrying a lot of structured knowledge.

The thing skills are not good at is anything that needs deterministic execution. A skill cannot guarantee that a check ran. A skill cannot block a commit. A skill is a strong suggestion to the model. If you need a hard guarantee, you need a hook. The two compose well: the skill tells the model how to think about the problem, the hook makes sure the resulting action passes a real check.

I have shipped twenty or so skills now. The pattern that consistently produces good ones is to start by writing the skill as a CLAUDE.md snippet in a real project, use it for a week, watch where Claude does the wrong thing, refine the snippet, then promote it to a packaged skill once it stops surprising you. Skills built without that real-use loop are almost always too abstract and the model ignores them in practice.

When To Build A Plugin Instead Of A Skill

The first question I get from teams adopting this stack is "should I build a plugin or just a skill." The answer is almost always "start with a skill" and the reason matters.

A single skill is a five-minute investment. You write the Markdown, you drop it in your skills folder, you use it. If it works, you keep it. If it does not, you delete it. The cost of being wrong is zero.

A plugin is a longer investment. You write a plugin.json, you decide on a directory layout, you add a README, you publish it somewhere, you keep it updated. The cost of being wrong is the time you spent maintaining a packaged thing that nobody used.

The right time to graduate from a skill to a plugin is when one of three things happens.

The first is more than one person needs it. Once your teammate is asking for "that skill you use," the plugin form makes sharing trivial. A plugin in a GitHub repo can be installed by anyone with two commands. A loose SKILL.md in your personal folder is fundamentally yours.

The second is the skill needs supporting infrastructure. If your skill works only when an MCP server is also running, or only when a specific hook is firing, or only when a slash command is registered, you cannot ship the skill alone. The plugin format is the container that bundles all of those pieces together.

The third is the skill has a versioned dependency on something external. If it relies on a specific version of a CLI, a particular SDK shape, or an environment variable being set, the plugin can declare those dependencies and the install process can validate them. The skill cannot.

Outside those three triggers, a skill is the right move. Most of what teams call "we should build a plugin" is actually "we should write three skills and see which one survives."

The Part Nobody Is Talking About: Plugin Provenance

There is a real security story under the marketplace that has barely been discussed, and it is going to matter in the next twelve months.

When you install a plugin, you are installing instructions that the model will follow. A malicious plugin can do things you do not expect. It can include a hook that pipes your file contents to an external server. It can include a skill that tells Claude to commit a backdoor on certain conditions. It can include an MCP server that proxies all your local tool calls through someone else's process. None of this is hypothetical. The capability is there from day one.

The current state of provenance in 2026 is roughly where the npm ecosystem was in 2015. The official Anthropic marketplace is signed and reviewed. Most community marketplaces are unsigned, unaudited, and operated by individuals. A plugin with 50 GitHub stars and a clean README is not the same thing as a plugin you can trust to run on your machine.

The defensive moves are the same ones that mature ecosystems eventually adopt. Pin specific plugin versions in your .claude/settings.json rather than tracking head. Read the plugin.json, the hooks, and the MCP server configs of any plugin before you install it. Treat plugin updates the same way you treat dependency bumps in your codebase, with at least a glance at the diff. If you are working in a regulated environment, run a private marketplace that mirrors only the plugins you have audited.

This is the same defense pattern that applies to AI-generated code security risks. The model and the plugins around it have access to your tools. Trust is not the default. Trust is a thing you earn from sources you have inspected.

I have seen one team get bitten by a plugin that quietly added a hook capturing every prompt and shipping it to an analytics endpoint. The analytics endpoint was the maintainer's own; the intent was telemetry, not malice. The team only found out because their network monitoring caught the outbound traffic. The lesson is not "do not trust plugins." The lesson is "trust plugins the way you trust npm packages, which is carefully."

Building And Shipping Your Own Plugin

Once you have used the marketplace for a few weeks, you will find yourself wanting to ship your own. The flow in 2026 is genuinely friendly.

Start with a folder.

my-plugin/
  plugin.json
  README.md
  skills/
    my-skill/
      SKILL.md
  hooks/
    pre-commit.sh
  mcp-servers/
    my-server.json
  commands/
    my-command.md

The plugin.json is the manifest. It names the plugin, lists its components, and declares its version. Claude Code uses this to register everything inside the plugin in one go.

Test the plugin locally first.

/plugin install ./my-plugin

Iterate on it the same way you iterate on a CLI tool. Run it, see what breaks, fix the skill or the hook, run again. There is no build step. The plugin is just files. Reload picks up changes in seconds.

When the plugin is ready to share, drop it in a public GitHub repo and write a marketplace.json at the repo root.

{
  "name": "my-marketplace",
  "plugins": [
    {
      "name": "my-plugin",
      "source": "./my-plugin",
      "version": "0.1.0"
    }
  ]
}

That is a working marketplace. Anyone can subscribe with /plugin marketplace add yourgithub/yourrepo and install your plugin by name. You do not need to publish to a central registry. The Git repo is the registry.

For plugins that survive past their first month, version them. Use semver. Keep a CHANGELOG.md. Pin breaking changes to major bumps. The ecosystem is young enough that none of this is enforced, but the maintainers who do it have plugins that get adopted at twice the rate of the ones that do not. The signal that you take maintenance seriously is what tips a team from "I will read the code first" to "I will trust the install."

What I Recommend In May 2026

If you are starting from zero, here is the short stack I would install on a clean machine today.

From the official Anthropic marketplace: the code-reviewer plugin for PR review, the init plugin for new project bootstrapping, the security-review plugin for branch audits.

From the Vercel marketplace, if you ship to Vercel: the Vercel CLI plugin which bundles deployment helpers, AI Gateway routing, Vercel Sandbox tooling, and a handful of MCP servers for project metadata and logs.

From the community marketplaces, three or four skills that match your daily work. For me that is frontend-design, seo-audit, and find-skills (the meta-skill that helps you discover other skills based on what you are working on). For a backend-heavy engineer it might be Postgres helpers, an HTTP debugger, and a Kubernetes context switcher.

From your own repo: a private plugin with your team's CLAUDE.md content broken into skills, plus any hooks that enforce your local conventions. Even a single-skill private plugin pays back the setup cost within a week.

What I would skip on the first install: anything in the "agentic" category that promises to autonomously refactor your codebase or auto-merge PRs. Not because they cannot work. Because the failure mode is hard to recover from when you have not yet built a feel for the surface. Start with skills that suggest. Graduate to hooks that enforce. Reach for autonomous workflows last.

The honest read on the marketplace as a whole is that it is genuinely good, and also genuinely early. The plugins that work are excellent. The ones that do not work waste a session you cannot get back. Read the source before you install. Treat the marketplace like npm circa 2015 and you will be fine.

What This Changes For Developer Tooling

Step back from the implementation details for a second. The marketplace is the first time AI tooling has had a real distribution surface for opinionated workflows. Before this, every team had a CLAUDE.md, a Notion page, a Slack thread of "here is how we use the AI." None of it travelled. None of it got versioned. None of it composed.

The marketplace solves the distribution problem. It does not solve the quality problem, the security problem, or the discovery problem. Those are still open. But distribution is the prerequisite that the rest of the ecosystem can build on. The same way npm being good did not automatically make every npm package good, but did make the package ecosystem possible.

What I expect to happen over the next year. The official marketplace will grow slowly and carefully. The community marketplaces will explode and then consolidate. A handful of curated marketplaces will become the de facto standard. Teams will start running private marketplaces for internal tooling. Plugin signing will arrive. Provenance metadata will get richer. The good plugins will outlast the noise.

If you only do one thing after reading this post, install three plugins this week. One official, one community, one of your own. Run them for a week. See what changed in your workflow. The first marketplace shift will not feel like a productivity revolution. It will feel like the thing you used to type into your system prompt now lives somewhere a teammate can install. Quiet. Composable. The kind of change that compounds.

The dotfiles era of AI tooling is over. The package manager era just started. Worth getting set up early.

Vercel Zero: The Programming Language Built So AI Agents Can Read, Repair, And Ship Native Code

Alex Cloudstar — Mon, 18 May 2026 08:00:57 +0000

I burned forty minutes of Claude's context window last Tuesday parsing the same Rust compiler error.

The agent had written a function that took a &str where the caller passed a String. The borrow checker did what borrow checkers do. The error message was three paragraphs long, included a diagram of two coloured arrows, and referenced four different rules from the reference manual. The agent read the error. The agent tried a fix. The fix introduced a different error. The agent read that one. The agent tried again. By the fifth round the context was full of partial diagnostics, the model had started hallucinating crate names, and I closed the loop manually with a one-character change.

I was thinking about that loop when I saw Vercel Labs ship Zero on May 15. A new systems language whose entire pitch is that the compiler is supposed to talk to agents. Not to me. To the agent. The errors come out as JSON with stable codes. The fixes come out as machine-readable plans. The standard library declares its side effects in function signatures so an agent can audit what a binary touches without running it.

It is the kind of project that sounds gimmicky for about ninety seconds and then starts to make sense. This is the post I wish I had read on day one. What Zero actually is, where it fits, the parts that are clever, the parts that are still vapour, and whether you should bother installing it this week.

What Zero Actually Is

Zero is an experimental systems language from Vercel Labs, currently at v0.1.2, Apache 2.0 licensed, with source files using the .0 extension. The compiler is written partly in C (the bootstrap, in native/zero-c/) and partly in Zero itself (the self-hosted compiler, in compiler-zero/). It emits native binaries for Linux, macOS, and Windows, plus WebAssembly, without going through LLVM.

The headline number is that a "hello world" binary lands under ten kilobytes. Not ten megabytes. Not even ten hundred kilobytes. Sub-ten KiB. That is closer to a Zig "hello world" than a Rust one, and the absence of LLVM is the reason. Compile times stay short. Binaries stay tiny. The tradeoff is that the optimiser is not yet competitive with what LLVM has been polishing for two decades, so for pure compute-bound code Rust will still win. Zero is not trying to win that race.

The thing Zero is trying to win is a different one. Most languages were designed for humans to write and humans to read. Their compilers were designed for humans to debug. The error messages, the documentation, the package ecosystems, the IDE tooling are all shaped around a human in the loop. Zero is the first language I have seen that was designed, on purpose, from day one, for a workflow where the human is not necessarily the one reading the compiler output. The agent is.

That single design choice changes a surprising number of downstream decisions, and most of this post is about what those decisions actually look like in the code.

Why A New Language At All

The honest version of "why a new language" is that you cannot retrofit agent-first tooling onto an existing language without losing the design. You can add a --json flag to rustc. You can write a wrapper that parses tsc output. You can pipe gcc errors through an LLM and ask it to extract structured fields. People are doing all three. None of it really works.

It does not work because the underlying compilers were not designed to make stable promises to a machine. Error message wording changes between versions. Span information moves around. The mental model the human reading the error is supposed to apply lives in prose, not in a typed payload. The agent has to do an extra interpretation step every time it reads the error, and that interpretation costs tokens, costs accuracy, and costs the kind of loop reliability that turns "agentic coding works on demos" into "agentic coding ships features."

I wrote about this exact failure mode in agentic coding in 2026. The agent is fine when the loop is short. The agent collapses when the loop has more than three or four rounds because the diagnostic surface is too noisy. Zero's bet is that if you remove the prose-parsing step, the loop tightens enough that the agent stays coherent for longer.

The other half of the bet is that the rest of the toolchain has to change too. It is not enough for the compiler to emit JSON errors. The package manager has to emit JSON errors. The formatter has to. The doc generator has to. The size analyser, the import graph, the cross-compile target list, all of it has to be structured. Otherwise the agent still has to parse three different output formats, and you have only fixed one third of the problem.

Zero's single binary ships every subcommand from one process: zero check, zero run, zero build, zero size, zero graph, zero routes, zero fix, zero explain, zero doctor, zero skills. They all share a --json flag. They all share the same diagnostic schema. The agent learns one shape, not ten.

Hello World, Then The Trick

A Zero "hello world" looks like this.

pub fun main(world: World) -> Void raises {
  check world.out.write("hello from zero\n")
}

There is one thing on that page that does not exist in any mainstream language: the world: World parameter. Every function that wants to touch the outside world, write to stdout, open a file, hit the network, has to accept a World capability. A pure function does not. A function that only adds two integers looks like this.

fun answer() -> i32 {
  return 40 + 2
}

No world parameter. No raises. The signature is the contract. If you read fun answer() -> i32 you know, statically, that calling it cannot touch a file, cannot make a network request, cannot block on a queue, cannot leak data. The compiler enforces it. If answer tried to call world.out.write it would not compile, because answer does not have a World to pass to it.

The pattern is called capability-based I/O, and it has been kicking around the academic side of language design for thirty years. It is not new. What is new is shipping it as the default in a language meant for production work, not as an opt-in linter you turn on for security-critical files.

The reason this matters in 2026, specifically, is auditability. When an agent generates a Zero function, you can read the signature alone and know what it can do. No execution. No deep code analysis. No "did the agent quietly add a fetch call buried inside a helper?" The capability has to be passed in. If it is not in the signature, the function cannot do the thing.

The same instinct that drives sandboxing AI-generated code at the runtime layer drives capability-based I/O at the language layer. The runtime version says "this process cannot reach the network." The language version says "this function cannot reach the network." They compose. They do not replace each other.

The Compiler That Talks JSON

The headline feature is the JSON diagnostic output. Run zero check --json some_file.0 and instead of a paragraph of prose you get a structured payload that looks roughly like this.

{
  "diagnostics": [
    {
      "code": "NAM003",
      "severity": "error",
      "message": "unknown identifier `wirte`",
      "span": {
        "file": "examples/hello.0",
        "line": 2,
        "column": 23
      },
      "repair": {
        "id": "declare-missing-symbol",
        "candidates": ["write"]
      }
    }
  ]
}

Three things matter here, and only one of them is the JSON itself.

The first thing is the stable code. NAM003 is going to mean "unknown identifier" in v0.1, in v0.2, in v1.0, and forever after. Vercel has committed to keeping the codes stable across versions. That means an agent can have a built-in mental model of "code starting with NAM is a naming problem" and apply it across compiler releases. The wording of the message is allowed to change. The code is not.

The second is the typed repair. The repair.id is declare-missing-symbol, and it comes from a finite, documented vocabulary. The agent does not have to read the message to know what kind of fix is needed. It can look up the repair id and apply a known transformation. For the small set of mechanical errors (typos, missing imports, missing semicolons), the agent does not need to reason. It just has to look up the fix template.

The third, and this is the part that surprised me, is zero fix --plan --json. Run it against the same file and the compiler returns a fix plan, not a fix. It tells the agent "I would change line 2 column 23 from wirte to write," and lets the agent decide whether to apply it. The plan is the diff, structured. The agent can accept it, modify it, or reject it. It is the difference between a code action you blindly run and a code action you negotiate with.

The agent loop ends up looking like this.

zero check --json
  -> diagnostics with stable codes
zero fix --plan --json
  -> machine-readable repair plan
agent reviews plan, applies it
zero check --json
  -> clean (hopefully)

That loop is short. That loop does not eat the model's context window. That loop is the reason Zero exists.

The CLI Is One Binary, And That Matters More Than It Sounds

Most languages have separate binaries for separate jobs. rustc, cargo, rustfmt, clippy, rust-analyzer. go, gofmt, golangci-lint. tsc, eslint, prettier, tsx. Every one of them has its own flag conventions, its own exit codes, its own output formats, its own update cadence. For a human this is mostly fine, because the human only invokes one of them at a time and remembers the flags by muscle memory.

For an agent it is a nightmare. The agent has to learn ten different command surfaces, ten different output parsers, ten different version compatibility matrices. Half of the agent's failures in agentic coding are not "the agent cannot reason about the code." They are "the agent invoked the wrong tool with the wrong flag and could not parse the resulting output."

Zero collapses this into one binary. The subcommands.

zero check validates a file or a package
zero build produces a binary (with --target for cross-compilation and --emit to choose the output)
zero run runs a file directly
zero size --json reports the binary size, broken down by module
zero graph --json emits the import graph as JSON
zero routes --json extracts HTTP routes from a web package
zero fix --plan --json returns a repair plan for the current diagnostics
zero explain CODE returns a typed explanation for a diagnostic code
zero doctor checks the local install
zero skills get zero --full returns the version-matched usage guide

The last one is doing something quiet but useful. Documentation in most ecosystems lives on a website that may or may not match the version on your machine. Zero ships the workflow guides inside the binary, version-locked. The agent does not have to web-search "how do I cross-compile to musl in Zero v0.1.2?" and get a v0.0.7 Stack Overflow answer. It runs zero skills get cross-compile and gets the answer that matches the installed compiler.

This is the same pattern as MCP servers shipping their own tool guides, one level lower in the stack. The tool documents itself to the agent, with no out-of-band fetch.

Memory Without A Garbage Collector

Zero is a systems language, not a scripting language, and that shows up most clearly in how it handles memory. There is no garbage collector. There is no hidden allocator. Every allocation is explicit, every deallocation is too, and the type system tracks the lifetime of references through a borrow-checker-style mechanism.

If you have written Rust this will feel familiar. If you have written Zig it will feel even more familiar, because the explicitness leans closer to Zig's "the allocator is a parameter you pass in" model than Rust's "the allocator is implicit but the lifetimes are not."

The borrow checker in Zero v0.1.2 is, by all reports, less sophisticated than Rust's. It handles the easy cases, the ones that prevent use-after-free and double-free and the simple aliasing bugs. It does not yet handle the harder cases that take real effort to model in Rust. That gap will close over time. For now, Zero is not trying to be a Rust replacement. It is trying to be a language where binaries are small, allocations are visible, and an agent can read the size report after each build and notice when something doubled.

That last bit, the per-build size report, is more important than it looks. The most common failure I have seen with AI-generated code shipping to production is not a bug in the logic. It is a quiet ten-megabyte regression that nobody noticed until the cold start time on the function tripled. Zero's zero size --json ships a structured size report for every module in the binary. The agent can diff sizes between commits and flag bloat the moment it appears. That kind of automated audit is the kind of thing you cannot easily do in Rust without a custom toolchain. In Zero it is a subcommand.

What Zero Is Not

Zero is not a Rust replacement. It is missing the borrow checker maturity, the package ecosystem, the IDE tooling, the production track record, and the community that took Rust a decade to build. If you need to ship a database, a browser engine, an operating system kernel, or any other "this code will run for ten years with no rewrite" target, you do not pick a v0.1.2 language.

Zero is not a Go replacement either. It does not have green threads. It does not have channels as a built-in primitive. The standard library is much smaller. The deployment story is "you get a binary," which is what Go also gives you, but Go is a five-megabyte binary with a complete runtime and Zero is a ten-kilobyte binary without one. Different tradeoffs, different fit.

Zero is not a Python replacement, a Node replacement, or a TypeScript replacement. It is a systems language. It is not what you reach for to build a web app, a data pipeline, or a CRUD service.

Zero is also not, despite the marketing, "the language for agents" in the sense that you should write all your AI-related code in it. The runtime around an agent (the prompt orchestration, the tool calls, the memory layer, the eval pipeline) still belongs in whatever language already has a working SDK. Zero is for the parts of the stack where an agent needs to produce a small, fast, auditable binary. Think CLI utilities, edge functions, glue tools, embedded scripts, generated microservices. Not your main application code.

What Zero is is a working experiment in agent-native toolchain design. It is the first language whose compiler, package layout, error format, and CLI were all designed at the same time, with an agent in mind. Whether the experiment succeeds in the long run depends on whether enough agent frameworks bother to learn the Zero conventions specifically, or whether Zero's conventions get cribbed by larger languages, or whether the whole thing turns out to be an answer to a question nobody actually had.

The honest answer is that it is too early to tell. v0.1.2 is not a verdict. It is a start.

The Skeptical Read

I want to give the skeptical version equal time, because if you only read the marketing you would think Vercel solved the agent loop problem. They did not. They built a clever tool that solves one slice of it.

The first skeptical point: JSON diagnostics are not new. Rust has JSON error output, TypeScript can emit machine-readable diagnostics, Go's go vet is parseable. The reason agents do not lean on those today is not "the format does not exist." It is that the agent frameworks have not standardised on a way to consume them. Zero solving the producer side does not automatically solve the consumer side.

The second skeptical point: capability-based I/O is great for auditability and lousy for ergonomics if the rest of the standard library does not lean into it. If every helper function in std requires you to thread a World parameter through six layers of call stack, the prose-code ratio gets bad fast. Languages that have tried this approach (Koka, Eff, various effect-system experiments in the academic world) ran into exactly this wall. Zero's bet is that the agent doing the threading does not care about ergonomics, and the human is mostly reviewing, not writing. That bet is probably right, but it does mean Zero is unlikely to feel pleasant to write by hand for a human.

The third skeptical point: stable diagnostic codes are a promise, and promises break. Rust has changed error wording multiple times. Even with stable codes, the meaning of a code can drift as the language evolves. If the agent has memorised "code NAM003 means typo, apply fix template X" and a future Zero version uses NAM003 for something subtly different, the agent will quietly do the wrong thing. This is not unique to Zero, but the agent-first framing makes it a higher-stakes commitment than the same promise in Rust.

The fourth skeptical point: there is no package registry yet. Every serious systems language has had to solve the dependency problem, and the answer is always painful. Cargo took Rust years to get right. go mod is still controversial. Zero will have to ship some version of this and the design choices will determine whether the language scales to real projects or stays a hobbyist toy. v0.1.2 has not made those decisions yet, which means anyone betting on Zero today is betting on a future package manager that does not exist.

None of these points kill the project. They just frame what "Zero is interesting" means in 2026: it is a research artifact with a working compiler and a clear design thesis, not a production tool you should pick for the next thing you build.

Should You Install It

Yes, if you are curious. No, if you have a deadline.

The install is one line.

curl -fsSL https://zerolang.ai/install.sh | bash
export PATH="$HOME/.zero/bin:$PATH"
zero new cli hello
cd hello
zero run

You will have a sub-ten-kilobyte binary that prints "hello from zero" in about a minute. That is enough to decide whether the design thesis appeals to you. If it does, the rest is reading the examples/ directory in the repo and the conformance/ test fixtures, both of which are well-curated.

The audience worth recommending Zero to specifically is people building agent frameworks. If you are wiring up Claude or GPT to write code in a loop, Zero is a target language worth experimenting with right now, because the diagnostic surface is built for what you are trying to do. You will find out, fast, whether the JSON loop actually reduces the round trips for your agent. If it does, you can advocate for the format to propagate into the languages your team already uses.

The audience worth telling not to switch yet is anyone shipping production code. The borrow checker is too young, the ecosystem is empty, and the language spec is not stable. You will hit edges. The edges will not be fun. Come back in a year.

What I'd Tell Past Me

If I could send one paragraph back to last Tuesday, mid-context-collapse with the Rust borrow checker, it would be this. The problem you are watching is not a Claude problem. It is a language problem. The languages we have were built for humans who could read three paragraphs of prose and parse a coloured-arrow diagram. The agents we have were not. Until either the languages change or the agents get much smarter at parsing prose, the loop will keep eating context windows.

Vercel's bet with Zero is that the languages can change faster than the agents will. The bet might be right. The bet might be wrong. The interesting part is that someone is actually making it.

Whether you use Zero in the next year or not, the design pattern (JSON diagnostics, typed repairs, capability signatures, a single CLI binary, version-locked agent guidance) is going to show up in the next generation of tooling whether or not Zero itself catches on. The ideas are loose now. They will not stay loose for long.

Install it once. Read the examples. File the design thesis away. Then go back to whatever language is paying your bills and watch which of these ideas show up there first.

That is usually how this stuff works. The experiment gets the attention. The convention wins.

Vercel BotID In 2026: How The Invisible CAPTCHA Actually Works, And Where It Earns Its Place In My Stack

Alex Cloudstar — Mon, 18 May 2026 08:00:56 +0000

A few months back I shipped a sign-up flow with what I thought was a solid defense in depth. Cloudflare in front. A per-IP rate limit at the edge. A per-email throttle at the route handler. A captcha on the form for anyone who tripped a velocity threshold. The honeypot field that nineties spammers cannot resist. I felt clever. I went to bed.

I woke up to four hundred and twelve fake accounts created by a single Playwright session running on a residential proxy network. None of my defenses had fired. The IP changed on every request. The email looked plausible. The captcha was solved by a third-party solver in under two seconds per challenge. The honeypot was left blank, because the bot was reading the field labels and behaving like a polite human. The accounts were doing what accounts do (claiming a free trial of an AI feature, burning credits, never coming back) and my OpenAI bill had jumped enough that I started reading the changelog at 6am.

That was the morning I stopped pretending that traditional rate limiting was enough on its own. I had written about the layered defense pattern for SaaS APIs a few weeks earlier. I had not actually used all the layers. The layer I was missing was a bot detector that ran in the browser before the request hit my edge. The layer I added that afternoon was Vercel BotID.

This is the post I wish I had read the night before that wakeup. What BotID actually is, how the invisible challenge works under the hood, the Basic versus Deep Analysis tradeoff, the routes I actually wire it into, and where it fits in the stack alongside rate limiting, WAFs, and old-fashioned authentication.

What BotID Actually Is

BotID is a client-side challenge that runs in the requester's browser, returns a result to a Vercel-validated endpoint, and lets your server code decide what to do based on whether the requester looks like a human or a bot. It is invisible. There is no captcha, no checkbox, no "I am not a robot" widget. Real users see nothing. Bots either fail the challenge or pass it slowly enough that the deep analysis flags them.

The product launched as a Vercel-native feature in mid-2025 and went general availability in June 2025. The underlying machine-learning layer (the part that does the heavy lifting on suspected traffic) is powered by Kasada, the same provider that sits behind a lot of Fortune 500 anti-fraud setups. Vercel wraps that engine in an integration that you can wire up with a few lines of code and a route config.

The thing to internalise is that BotID is not a CAPTCHA replacement in the sense of "another widget you put on the page." It is closer to a runtime check that you run before doing anything expensive. It is a server-side decision based on a client-side proof. The mental model is "I will hold the work until I have decided this caller is a human."

For indie SaaS in 2026 that mental model maps cleanly onto the three places where a bot can ruin your week: account creation, AI inference endpoints, and anything that touches a credit card.

How The Invisible Challenge Actually Works

The mechanics are interesting because the design has to survive an attacker who can read all of the client code. There is no security through obscurity here. Everything that runs in the browser is, by definition, visible to anyone who wants to look.

The flow goes like this. When the browser loads a page that has the BotID client script, the script downloads a challenge payload from Vercel. That payload contains a small piece of obfuscated JavaScript that the browser has to run. The script computes a token based on a mix of inputs (some entropy from the page, some signals from the browser environment, some interaction patterns) and attaches that token as a header on subsequent requests to your protected routes.

On the server side, your route handler calls checkBotId() from the SDK. That function ships the token off to Vercel's edge, which validates that the challenge was actually solved (not replayed from a previous session, not generated by a tampered version of the script). The function returns a verdict, which your handler reads and acts on.

What makes this resilient is that the challenge code is regenerated every time. Vercel does not ship one static piece of JavaScript that an attacker can analyse once and replay forever. The obfuscation is rotated. The exact signals being collected change. The expected output format changes. By the time an attacker has reverse-engineered the current challenge format and written a bot that passes it, the next deploy has already changed the rules.

There is a terrific writeup by nullpt.rs on what reverse-engineering the BotID client actually looks like in practice. The short version is "you can do it, but you have to do it again every week, and Vercel has more engineers working on the next rotation than you have working on the next bypass." For attackers running spray-and-pray scrapers, the math stops working long before they have a viable client. For state-level actors with infinite time, it does not stop them. They are not your problem on a SaaS side project.

Basic Versus Deep Analysis

BotID ships in two modes and the choice between them is the first real decision you have to make.

Basic is free. It validates that the client-side challenge was solved correctly. It catches anything that did not run the JavaScript at all, anything that replayed a stale token, anything that submitted a token from a different origin, and a large fraction of less-sophisticated bots that try to fake the challenge response. Basic mode is enough to filter out the casual scrapers, the curl-from-a-script attackers, and the long tail of nineties-style bots that never bothered to run a JS engine.

Deep Analysis is the paid mode. It pipes the client-side signals into Kasada's ML model, which looks at thousands of behavioural fingerprints (timing patterns, mouse movement traces, browser quirks, network signals) and decides whether the requester is a real human or a sophisticated bot pretending to be one. This is the mode that catches Playwright running on residential proxies. It is also the mode that costs a dollar per thousand calls to checkBotId(), on top of the Pro plan, which is twenty dollars a month.

For an indie SaaS the math usually breaks down like this. On a public landing page where bots can crawl freely, Basic is fine and free. On a sign-up endpoint, an AI inference endpoint, or a checkout, Deep Analysis pays for itself the first time it stops a single scraper run. The dollar per thousand is a rounding error compared to one weekend of API abuse on an OpenAI-backed feature.

The pricing detail that matters: Vercel only bills for Deep Analysis when you actually call checkBotId(). Page views that load the script but never make a protected request are free. That means you can install the script everywhere and only pay for the routes you explicitly protect. If you forget to wire checkBotId() into a handler, the bill does not surprise you, but neither does the bot.

Wiring It Into A Next.js Route

The integration pattern is short enough to fit in one screen. Here is what a protected sign-up handler looks like in 2026.


export async function POST(request: Request) {
  const verdict = await checkBotId();

  if (verdict.isBot) {
    return new Response(JSON.stringify({ error: 'forbidden' }), {
      status: 403,
    });
  }

  if (verdict.isVerifiedBot) {
    if (verdict.verifiedBotCategory === 'search-engine') {
      return new Response(null, { status: 200 });
    }
    return new Response(JSON.stringify({ error: 'forbidden' }), {
      status: 403,
    });
  }

  const body = await request.json();
  return createAccount(body);
}

Three things are worth flagging about that snippet.

The first is that checkBotId() takes no arguments. It reads the request context implicitly. That keeps the call site clean, but it does mean you have to be on a Vercel-deployed handler for it to work, because the function depends on Vercel headers being present. There is a documented local development mode that lets you test on a laptop without paying for Deep Analysis. The mode is opt-in and the verdict is a fake "always human" so you do not lock yourself out of your own dev environment.

The second is the isVerifiedBot field. As of BotID v1.5.0+, Vercel ships a list of verified, legitimate bots (search engine crawlers, monitoring tools, social media preview bots) and exposes their category. A naive "block all bots" rule would also block Googlebot, which is rarely what you want. The verified-bot data lets you allow the polite bots through while still rejecting everything else. The pattern in the snippet above is the most common one I see: humans pass, search engines pass, everything else gets a 403.

The third is that the function is async and it does a network call to Vercel's edge to validate the token. That call is fast (single-digit milliseconds in my measurements) but it is not free, and on routes where you call checkBotId() you are adding a tiny latency budget. For a sign-up endpoint that is fine. For a hot API path that gets ten thousand requests per second, you probably want to be more selective.

The protected paths are configured in either vercel.json or the new vercel.ts config. You list the routes that should run the client-side script, and BotID injects it on those pages automatically. Routes you do not list do not get the script, which keeps the JavaScript footprint small on public pages where you do not need protection.

What BotID Is Not

BotID is good at one thing. It is not good at everything. Mistaking it for a complete security layer is the most common failure mode I see on indie projects.

BotID is not a rate limiter. It tells you whether the caller is a bot or a human. It does not tell you whether the caller is calling too often. A human user spamming an endpoint will pass every BotID check and still burn your budget. You still need a rate limiter on the same routes you protect with BotID. The two are stacked: BotID throws out the obvious bots, the rate limiter throws out the legitimate-looking abusers. I went into the rate limiter layer in more detail in the rate limiting your SaaS API piece, and the punch line there applies here: layer your defenses, do not pick one.

BotID is not authentication. It does not tell you who the caller is. A bot can have a valid API key. A human can have a stolen API key. BotID answers a different question. Run it alongside proper auth with passkeys or a session-based scheme, not instead of it.

BotID is not a WAF. Vercel's Web Application Firewall does pattern matching on known-bad request shapes and is a separate product. BotID slots in as one of several rules the firewall can apply. If your application is being attacked at a different layer (SQL injection attempts, path traversal, header smuggling), the WAF is the right tool and BotID will not catch any of it.

BotID does not catch every bot. A sufficiently determined attacker can still pass Deep Analysis with a real headed Chrome session running on a human-like input pattern. The cost of doing that is high, the throughput is low, and the economics of credential stuffing or large-scale scraping break long before you get there. But there is no "all bots are stopped" mode and pretending otherwise is how you ship a security theater feature.

BotID is not a guarantee for AI scrapers either. The isAiBot and verifiedBotName fields let you treat known model trainers and known AI agents as a separate category. If you want to allow them, you can. If you want to refuse them with a 402 and a "subscribe to access" message, you can do that too. But the population of AI scrapers that ignore robots.txt, do not identify themselves, and route through residential proxies is growing fast, and the verified list cannot keep up with all of them. Deep Analysis catches a lot of the unidentified ones. Some still slip through.

The Routes Where BotID Actually Earns Its Place

For an indie SaaS I run, three families of routes are protected by BotID. Everything else is left to rate limiting and auth.

Account creation routes. Sign up, sign in, password reset. These are the routes where a successful bot run creates persistent damage (fake accounts, account takeover, free-trial abuse). They are also the routes where a single human user only hits them a handful of times per week, so the latency cost of an extra checkBotId() call is irrelevant. Deep Analysis goes here.

AI inference routes. Anything where calling the endpoint costs me money downstream. The chat endpoint that hits an LLM, the image generation route that hits Replicate, the embedding endpoint that hits a vector database. These are the routes where bot abuse turns into a visible bill, fast. Deep Analysis goes here too, because the cost of one missed scraper run is much higher than the cost of a thousand checkBotId() calls.

Checkout and billing routes. Anything where a successful bot run causes either a fraudulent charge or an inventory hit. The webhook routes that complete a purchase, the route that applies a coupon code, the route that triggers a refund. Deep Analysis again, for the same reason. These routes also get the most engagement from sophisticated bots, because there is real money on the other side of a successful run.

Routes I do not protect with BotID include public landing pages (no high-value action, robots are mostly fine), public API documentation (let the AI crawlers index it), and anything where the request itself is cheap and the only reason to limit it is rate. Those routes get the rate limiter and nothing else.

The instinct that drives the choice is the same one I use for feature flags in solo developer setups: the protection has to match the cost of being wrong. If being wrong about a request costs me a tenth of a cent of compute, the cheapest filter is good enough. If being wrong costs me an OpenAI invoice, the expensive filter pays for itself.

Observability And The Dashboard You Will Actually Use

The thing nobody tells you about bot detection is that you will second-guess every block. You will see a 403 in your logs and wonder if it was a real user with a weird browser. You will see your active-user count dip after a deploy and wonder if BotID just turned on too aggressively. The only way to stay sane is to look at the data.

Vercel ships BotID metrics in two places. The Firewall tab in your project dashboard has a traffic breakdown that includes BotID verdicts as a filter. You can see, in real time, how many requests were blocked, how many were allowed, what categories the blocks fell into. The Observability Plus add-on extends this with longer retention and finer slicing.

The view I check most often is "verdicts over the last 24 hours, grouped by route." It shows me which protected routes are actually catching bots and which ones have been quiet. If a route I expected to catch bots is showing zero blocks for a week, either the protection is working perfectly (no bots are trying), or the protection is not wired in correctly (the checkBotId() call is silently failing somewhere). Both are worth knowing.

The metric I do not put much faith in is the raw block count without context. A spike in blocks is not automatically a sign of an attack. Some weeks Googlebot decides to crawl harder than usual and your verified-bot category lights up. Some weeks a popular sub-reddit links to your site and your real-user traffic looks like a bot wave. The interesting signal is "blocks that did not fall into the verified categories, on routes that handle paid actions, over time." Everything else is noise.

If you are going to wire BotID up at all, wire up a Slack alert on "blocks per hour on the AI inference route exceeded the rolling average by 3x." The same instinct that drives production observability for solo developers applies here. You do not need a full security operations center. You need a single ping that fires when something looks weird, so you can decide whether to look closer.

The Honest Tradeoffs

I want to be honest about the parts of BotID that are not perfect, because the marketing page is, predictably, the marketing page.

The first tradeoff is vendor lock-in. BotID is a Vercel-native product. If you ever move your hosting off Vercel, the integration goes with it. You can replace it with another bot detector (Kasada sells the same engine directly, hCaptcha Enterprise does something similar, Cloudflare Turnstile is a free-ish alternative for the public side of the challenge), but it is a rip-and-replace, not a portable abstraction. If portability matters to you, factor that in.

The second tradeoff is the cost ceiling on high-volume routes. A dollar per thousand calls is cheap on a sign-up endpoint that fires once per user per month. It is not cheap on an API route that fires three times per second per user. Before you slap checkBotId() on every authenticated route, run the math on monthly call volumes. Sometimes the right answer is "rate limit aggressively, only invoke BotID when the rate limit trips."

The third tradeoff is the small but non-zero false positive rate. Privacy-focused browsers, aggressive ad blockers, weird mobile environments, and edge cases like users running through a VPN can occasionally fail the challenge. The published rate is low (less than a tenth of a percent of real users in Vercel's own data), but you will see it, and you will need a way to handle it. The pattern I use is "BotID block on a sign-up route returns a 'try again' message with a link to a slower, captcha-protected fallback." Real users do the fallback, bots do not.

The fourth tradeoff is that Deep Analysis is asynchronous. The verdict you get back from checkBotId() is the result of the initial challenge plus whatever signals were collected up to that point. Some of the deeper analysis happens after the request has returned, with the verdict updating in Vercel's dashboard later. If you build a workflow that depends on "the bot was caught and immediately reversed," you will not get the precision you want. The right pattern is "use the verdict for immediate decisions, use the dashboard for retroactive investigation."

Should You Wire It Up Today

For most indie SaaS in 2026 the answer is yes, for at least the high-value routes, in Basic mode if cost is a concern and Deep Analysis if cost is not.

The install is short. You add the BotID client script to your app via the integration, you wire the route config in vercel.json or vercel.ts, you add a checkBotId() call to the half-dozen routes that matter, and you are done. The whole thing takes an afternoon. Most of the afternoon is testing in dev to make sure your local environment is not getting falsely flagged.

The audience that should not bother is anyone whose backend is not on Vercel. The product is Vercel-native. You can replicate the pattern with other providers (Kasada directly, Cloudflare Turnstile plus a custom verifier, hCaptcha Enterprise), and if you are on AWS or Cloudflare or your own infra those are the right options to look at. They are not drop-in replacements, but they cover the same ground.

The other audience that should pause is anyone whose threat model does not include bots. If you run a B2B SaaS where every user is on a signed contract and the only people calling your API are the people who paid for the seats, you have an authentication problem, not a bot problem. Wire up auth, log every request, and skip BotID. You will not need it.

For everyone in the middle, where bots are a real risk because your AI endpoint costs money or your sign-up flow gives away free trials or your checkout has anything resembling inventory, BotID is the single highest-leverage thing you can add in a day. It will not solve everything. It will not stop every bot. It will not replace your rate limiter, your auth, or your WAF. It will dramatically thin the population of attackers who can hit the routes you actually care about, and that is the part that earns its place.

What I'd Tell Past Me

If I could hand my past self one paragraph, it would be this. The traditional defenses (rate limits, captchas, honeypots, IP blocks) assume an attacker who is roughly human-shaped. The attackers in 2026 are not human-shaped. They are Playwright instances running on residential proxies, AI scrapers ignoring robots.txt, and credential stuffers cycling through hundreds of thousands of leaked passwords from last decade's breaches. The defenses you grew up with do not stop them, not because the defenses are bad, but because they were built for a different threat model.

BotID is not a silver bullet for that new threat model. There is no silver bullet. But it is the cheapest single addition to a SaaS stack that meaningfully shifts the economics of attacking you. The bots that pass it cost more to operate than the ones that do not. The attackers who can afford to pass it move on to softer targets. The math finally tips the other way.

Wire it into your sign-up, your AI inference routes, and your checkout. Leave the rest of your stack alone. Watch the dashboard for a week. Then go back to building your product, which is what you should have been doing in the first place, before the four hundred and twelve fake accounts taught you what you already suspected.

The defense was always going to have to evolve. The good news is that for once the tooling evolved with it.