DEV Community: Håvard Bartnes

Be the Gatekeeper. Your AI Agent Will Hate It

Håvard Bartnes — Sun, 07 Jun 2026 08:19:29 +0000

The most useful thing an AI coding agent can do is let you check its work. It will not enjoy it.

I once watched my own agent do beautiful work on the wrong half of the job.

I had built a harness called Mycelium to make Claude do real product development, discovery through delivery. On a little macOS app, it aced the discovery. Mapped the opportunity tree like it had written the textbook. Then it shipped the thing with zero tests and no accessibility, and told me it looked great. It wasn't lying, exactly. It just had no idea what it didn't know, and neither of us found out until I went looking.

That is the part nobody prices. Not how good the agent is. How long it takes you to find out whether it is right.

The agents people call a joy to use are the ones that hand you the answer and the receipt at the same time: a diff, a screen recording, a before and after you can glance at and trust in two seconds. The ones that wear you down hand you something you have to re-derive yourself just to know whether you can believe it. Same model underneath, sometimes the same task. The only difference is who pays to check, and how much it costs them. Hamel Husain has put the sharp version of this: if a thing is hard to evaluate, that is a product smell. He is right, and I would push it one step further.

If the cost of checking is the thing that matters, you do not win by making the agent smarter. You win by making it pay that cost up front, the moment it makes a claim, instead of dropping an unverifiable heap on your desk and letting you find the problem at review time, which is the most expensive time there is.

That is what a gate is, in the thing I build. People hear "gate" and picture a slow approval step with a clipboard. It is closer to an eval that runs at discovery time instead of after the output. Before the agent earns the right to write code, it has to clear one question: does the evidence for this actually exist. Not does it look plausible. Plausible is exactly how I ended up with a test-free app that demoed beautifully. Does it exist. When it does not, the agent stops, and it does not get to guess its way past. The current build has thirteen of these defined across the road from a vague idea to shipped work, and a project clears the ones its scale calls for.

Here is one, because the abstraction is cheap and the example is where it lives or dies.

An agent is about to build against an external API it has never actually read. The contract, the schema, the real shape of the payload: all imagined. Left alone it writes something plausible against an interface that exists only in its head, and you find out at integration time, which is to say the worst time. The gates already stop it on the missing evidence. The part I shipped recently is the part that makes the stop useful: instead of a shrug, the agent names what it is actually facing, "technical discovery," and sends itself off to read the docs and pull a real payload before it touches a line of code. The bill got paid up front, for almost nothing. The alternative is me paying it by hand, later, after the agent has already talked itself into being sure.

Move the check from after to before. From me to the loop. That is the whole trade.

Theory is cheap, though, and I have been burned by my own confidence often enough to distrust it. What tells you whether any of this holds is what happens when other people touch it.

Drew Hoskins, who wrote The Product-Minded Engineer, approved the line I now lead with: the agent should stop and say "I don't have enough to claim this" before it ships code or marks anything done. An agent admitting the gap is a lot cheaper than you finding it.

The evidence I actually trust came from someone with no reason to be kind about it. One of the developers testing Mycelium ran it on her own project, a public-sector app for next-of-kin in home care. At some point the agent decided to improve her brief by quietly rewriting it in its own words. She caught it. She made it keep her original and ask before changing the record. She enforced, by hand, the exact discipline the framework is built to enforce, in the one moment the framework had not quite managed to. She was the gatekeeper. The framework was supposed to be. Being out-disciplined by your own tester is humbling, and I recommend it to anyone who thinks their harness is finished.

I will give you the failure in the same breath, because a post that only reports wins is its own kind of product smell. That same tester told me, flatly, that the vocabulary reads like it was written for people who already know the frameworks. The gate caught the agent. The words lost the human. Straight into the backlog. The machinery was sound and the on-ramp was not, both true at once, and pretending otherwise would be exactly the overconfidence the gates exist to catch.

It is tempting to read all this and conclude that hard-to-check work is work agents should keep their hands off. I think that is backwards. The harder a thing is to verify, the more it needs the discipline, not less, because that is precisely where confident nonsense survives longest. Nobody can cheaply tell it is wrong. That is not the place to give up on checking. It is the place to spend the most on it, early, while it is still cheap.

The bet under all of it is small and faintly boring, which in my experience is what a good bet usually looks like. Gates do not make the agent smarter. They make it pay for its claims while the bill is small, instead of leaving the tab for you to find at review time, next to a demo that looked great. The agent does not get to move on until it has earned the right to. Same as the rest of us, on a good team.

The Coach, the Cage, and the Deadline

Håvard Bartnes — Sun, 31 May 2026 08:50:59 +0000

Two builders, one canon, and one disagreement about who gets to skip a step.

Your AI agent will skip the boring step the second it thinks it can get away with it. Which is immediately. Same as a tired developer at 4pm on a Friday, except the agent does it a hundred times faster, with a hundred times the confidence, and none of the self-doubt that occasionally saves the rest of us.

I learned this the hard way. I built a harness called Mycelium to make Claude do real product work, discovery through delivery. The agent aced the discovery on a macOS app. Mapped the opportunity tree like a textbook. Then it shipped the thing with zero tests and no accessibility, which is a flawless plan executed beautifully on the wrong half of the job. Why? Because my rules were friendly advice. And advice, human or machine, is the first thing out the window when there is a deadline in the room.

So I went looking for other people fixing the same problem. The best one I found is Dean Peters and his Product-Manager-Skills.

Here is the part that made me sit up. Dean packaged 49 product-management skills for AI agents. Torres on discovery, Cagan on empowered teams, Jobs-to-be-Done, the whole canon. I built Mycelium on the same canon. Also 49 skills. Two people, no coordination, same map of what good product work looks like, right down to the reading list. Either we are both onto something, or we read the same five books. Probably both.

We disagree about exactly one thing. And it is the thing that matters.

Coaching the human, or stopping the agent

Dean coaches. His tagline is "Always Be Coaching", and every skill is built to leave the human knowing more than they did before. He is refreshingly honest about the stance. On his Substack he says he uses frameworks "like seasoning, not a main course," which is the most sensible thing anyone has said about frameworks in years. The discipline lives in your judgment. The tool sharpens it.

Mycelium does not coach. It gates. Same frameworks, wired up as 38 guardrails in three tiers: a couple are blocked outright, a big middle tier will not let the work be marked done until the evidence exists, and the rest are nudges. No tests, no done. No threat model, no done. The agent finds this deeply unreasonable, or would, if it had feelings. It does not, which is the entire point.

Two different bets about where discipline should live.

Dean bets on the human. Keep it light, teach the craft, trust the person to carry it from there. That is the right bet when the person at the keyboard is steering and the real risk is that they do not yet know what good looks like. Coaching grows a product manager. A gate has never taught anyone anything, and never pretended to.

I bet on the loop. Make the step something the agent has to skip on purpose, because left to itself it will skip it the moment it feels sure, which is always. A handful of things are blocked outright. The rest the framework simply refuses to call done, and writes down that it is not done, which turns out to be enough most of the time. That is the right bet when the agent is moving fast and the risk is not ignorance but momentum. You cannot coach an agent into caring about the boring fix. You can, however, stand in the doorway until the boring fix is done.

Neither is the upgrade of the other. They stack. Run a coaching skill to think the problem through, then gate the output so it cannot ship half-built. The library teaches the human. The gate constrains the agent. Same canon, opposite ends of the same loop.

So which one do you need

One question: where is your risk?

If the risk is that the person does not yet know the craft, you want coaching. A gate will only frustrate them and teach them nothing, which is the worst of both worlds. Dean's framework is built for exactly that person, by someone who clearly enjoys teaching them.

If the risk is that the agent is fast, confident, and one keystroke from a shortcut you will pay for next quarter, you want gating. This is where I have evidence. Modest, but real. I ran Mycelium with three first-time users last month. One of them, a junior developer, caught the agent quietly trying to overwrite her own brief with an improved version. She stopped it and made it keep both. She enforced the exact discipline the framework was built to enforce, in the one moment the framework had not quite managed to. Being out-disciplined by your own tester is a humbling experience, and I recommend it. That is the whole argument in a single story. The discipline has to hold when nobody is watching. Especially then.

Three users is three, not thirty. I am not claiming this scales yet. I am claiming the failure mode is real, because I keep watching it happen, usually right after the agent tells me everything looks great.

The part we agree on

Strip the mechanism away and Dean and I are saying the same thing. The tools will keep changing, monthly, whether we want them to or not. What lasts is whether the discipline survives contact with a deadline. He builds tools that coach that discipline back into the human. I built one that holds the agent to it whether it wants to or not.

So if you are choosing between us, you are probably asking the wrong question. We answer two different ones. Work out whether the thing you cannot trust is the human's knowledge or the agent's restraint, and the choice makes itself. And if it turns out to be the agent, well, that is what the straitjacket is for.

Mycelium is at github.com/haabe/mycelium. Dean's Product-Manager-Skills is worth your time whichever bet you make.

I built for developers. What a non-developer felt hit hardest

Håvard Bartnes — Tue, 26 May 2026 14:03:08 +0000

AI has made building cheap. It hasn't made deciding cheap. Agents will jump from an idea to a pull request without asking why, who for, or whether anyone needs it. Mycelium is the framework I built to make them stop and demand evidence first.

I built Mycelium for developers. The strongest signal it produced this month came from someone who is not one.

My wife Edith is writing a book. She has never used Claude Code. Last week she sat down at my keyboard, ran /start once, and reached the assumption test fifteen minutes later. The framework read her own words back to her as a structured brief. She nearly cried. The note I wrote down: even though it was her own words, it was captured and presented well.

That wasn't supposed to happen. I built the brief synthesis to keep developers from skipping discovery. Reaching a non-developer at the emotional layer on a book project was nowhere in the design.

Two other testers ran Mycelium that same week, on projects with nothing in common. Together they showed me something I didn't expect.

What the test was

Three first-runs. Maximum variance by design — or rather, by accident, as it turned out.

Frida: junior developer, building a public-sector mobile app for next-of-kin in home care. GDPR, healthcare, AI-naive end users. Alex: also a junior developer, working on a project of his own. Edith: my wife, writing a book.

They ran independently. The brief was explicit. No comparing notes mid-run. No asking each other how to get past confusion.

Frida's friction log landed on a Tuesday. Edith's session a mere week later. Alex's recap landed in Discord the Tuesday after that.

I went in wanting evidence on one thing: that Mycelium is light enough to keep using past the first friction moment.

The triangulation rule was committed to after Frida's first log but before any other data arrived. Friction that two of three testers hit independently counts as a real signal on the cautious-learner segment. Friction only one tester hit stays a single observation, however vivid. The original cohort design called for Frida, Alex, and a third junior developer. The third never returned data. Edith's session became the third leg by accident, when she sat down to use the framework on her own work.

Three patterns converged anyway.

What converged

Three patterns. Each named by two of three testers, all strangers to each other, all building completely different things.

Wayfinding stops at phase transitions. Edith finished the assumption test and lost her place. She remembered being asked early about a deep-dive interview, but there were no traces of it after. She was bewildered. Alex finished his proof-of-concept and Mycelium went quiet on him. His words: once it was done with building it just kind of stopped and didn't prompt me for more info or advise me what to do next, so I had a look through the readme/commands to see what to do to get it on course again.

Two testers, two different phases, two different projects. Same failure. The framework points you to the start. It doesn't point you anywhere after that. I had built the orientation surface for one phase. It never extended to the others, because nobody had asked it to.

Wall of text. Frida stopped when she got tired, not when she got annoyed. Too much structured output to read through. Alex: my brain was fried from the gigantic walls of text I had been wading through so I gave it a break. Daniel Bentes, an experienced architect from a different cohort, hit the same wall earlier this month. He called it verbose and strict, framed as feature, not bug. Three users, three experience levels, three contexts, same complaint.

Internal vocabulary in user-facing prompts. Frida hit framework-internal terms with no anchor — words I'd never explained in the surface she was actually reading. Alex: I also kind of kept getting a little lost in the vocabulary, I wonder if more straight forward terms would have helped me at least. The glossary exists. The glossary is something the user has to go look up. The prompts themselves were leaking terms I never introduced.

Three patterns. Three users with nothing in common. That's stronger evidence than five users in the same context all reporting the same things would have been.

What the framework caught

This isn't a hatchet piece. What worked deserves naming too.

Frida wrote three sentences when the first question asked for one. The constraint annoyed her. Then it forced her into umbrella-thinking before the details. Afterward she told me it had been the right move. She'd have got there slower on her own.

Later in her session, Mycelium noticed two empty fields and called itself out. She told me she liked that. The agent flagging its own holes before she had to.

Edith's near-tears reaction to the brief is the most emotional moment Mycelium has produced from any user. The really saw me feedback on the assumption test was the second. Neither was in the spec.

Alex went from interview to feature selection to research prompting to a working proof-of-concept in one session. That's a lot of ground. The break came after the PoC, not before.

What Frida caught

Halfway through Frida's session, the agent tried to overwrite her brief with a revised version. She stopped it. She asked the agent to keep the original as a revision note and record the new version alongside as a confidence note, so the history would survive.

She enforced the discipline the framework was supposed to enforce.

That's the most useful thing any user has told Mycelium about itself.

Mycelium's whole pitch is that the agent shouldn't silently overwrite the user. In Frida's session, the agent tried. She caught it. The next day I shipped the calibration fix she'd surfaced — the confidence-floor framing that had triggered the overwrite attempt in the first place. Her exact preservation convention, revision note and confidence note as named structures, is still on the candidate list. I haven't shipped that one yet.

The cohort was set up to surface friction. Frida shipped ten items, fully attributed, with quotes. She also showed me what the framework's own discipline looks like when it actually works. That's not a tester's job. That's a contributor's.

What this evidence supports

Three first-runs is three. Not thirty.

Convergence is what makes the data load-bearing. Items two testers hit independently count as evidence on the cautious-learner segment, per the rule I committed to before the second and third logs arrived. Items only one tester hit stay anecdotes.

Edith's emotional reaction is one user. A real signal, not yet a pattern. Another non-developer reacting the same way would change that.

What this is NOT evidence of: that Mycelium has no friction. That it works for non-developers in general. That every book project would land like Edith's.

Three changes are supported by the data: wayfinding, verbosity, vocabulary. Whether they shipped is the next question.

What shipped

Frida's log landed Tuesday. By Wednesday I'd shipped seven of her ten findings.

Three were noise she shouldn't have been seeing in the first place. Two were consent moments where the framework wasn't giving her real choice. Two were transparency: numbers and terms the framework had been hiding from the user.

Here's the one that tells the story: the friction-log prompt. The old wording implied an opaque pipeline. She didn't know where her input was going, and the framework was making her hesitant about contributing. I rewrote it to offer three named destinations. She picks. That's the discipline I want at every consent moment.

Edith's wayfinding gap was logged the same day as her session. The correction was recorded. The mechanism didn't ship — I had her friction in the framework's memory and no clear fix for it yet.

What triggered the actual ship came six days later, when Alex's rich recap landed and converged on the same failure shape. Post-build silence on his side, post-assumption-test silence on Edith's, same underlying pattern surfaced by two testers in completely different contexts. v0.31.1 closed both the same evening Alex's recap landed.

When a build produces working code, the framework now offers six options: security-review, threat-model, definition-of-done, reflexion, refine the spec, or ship as-is. It doesn't pick. You do. The orientation surface fires at every phase transition, not just the start.

v0.31.2 shipped the same evening, fixing Alex's wall-of-text complaint. This is the one that broke my assumption about how to fix this kind of problem. The obvious move was to strip content. I almost did. Then I noticed what would have gone: the citations, the attribution labels, the alternatives I'd considered and rejected. All the discipline metadata that makes the framework auditable to a careful reader. Stripping it would have made the output cheaper and less honest at the same time.

So I layered it instead. Every response that carries discipline metadata now has three blocks. A one-or-two-line claim up top. A scannable rationale below. Then everything else, below a horizontal rule. Read the first line, you have the answer. Scroll down if you want the receipts.

Theory behind it: Sweller's cognitive load, Cowan's working-memory chunks, Nielsen's F-pattern, Minto's pyramid, the BLUF tradition, and W3C's COGA accessibility working draft. Honesty caveat: the chain wall-of-text → comprehension failure → people quit is one tester's report. The research is theory, not validation on this surface. The next move is an A/B test.

Three of Frida's items are still open. One of Alex's is too, vocabulary work I haven't shipped yet. Frida's preservation convention is in the same queue. None of the open items is blocking. But I know exactly what each one needs. That's on me.

The interesting number isn't how many shipped. It's how fast. Two of Alex's findings plus Edith's wayfinding correction were framework conventions the same evening Alex's recap landed.

What this implies

The cohort that surfaced the convergence wasn't designed for it. A junior developer on a healthcare mobile app, a junior developer on a software project, my wife on a book. Maximum variance was an accident. It's also why the data is sharper than the test design intended.

Mycelium's job has always been to surface failure modes early. The cohort that did that best this month had nothing in common except the framework.

Metabolism rate is what I get judged on next.

If you've been thinking about running Mycelium on a project you care about, and you'd keep a friction log along the way, I want it. Three first-runs gave me convergence on three patterns. Five would tell me more.

Mycelium is at github.com/haabe/mycelium. /mycelium:start is the first command.

Why Your AI Coding Agent Needs a Digital Straitjacket

Håvard Bartnes — Wed, 08 Apr 2026 16:27:07 +0000

We are giving AI coding agents way too much freedom. What they actually need is a digital straitjacket.

My north star in software development has always been simple: create actual value. Make things that make people's lives easier—or as Jonathan Smart perfectly puts it: Better Value Sooner Safer Happier (BVSSH).

But for 20 years, I’ve watched the industry actively fight this. I started in the waterfall factory floors of the dot-com bubble. Later, as a Scrum PO and agency owner, I fought fake agile and HiPPOs (Highest Paid Person's Opinion). When consulting in critical national infrastructure, I had to drag teams kicking and screaming into modern DevOps.

The core issue is always human: we know the right practices, but when deadlines loom, we drop the theories and take shortcuts.

For a long time, I dreamed of building a SaaS platform—a UI to force teams to actually follow product frameworks. But as generative AI evolved, I realized that building AI middleware was a dead end. The real power lies directly in the agents. So, I shelved the idea.

The Problem: AI Makes the Same Mistakes, Just 100x Faster

While building my DAW project, N-trax, I moved from the bloated bureaucracy of AWS Kiro to the reckless speed of Claude Code. I finally figured out how to mechanically harness an agent’s delivery phase.

That was the spark. It was time to resurrect my old dream, not as a SaaS, but as an open-source boilerplate.

I built Mycelium—a harnessing system for Claude Code that forces the agent to do proper product development from discovery to delivery.

To test the process (and honestly, because I just really wanted these products), I used Mycelium to guide Claude through building a multiplayer WebSockets game (play it here) and a native macOS app.

The early versions were a massive wake-up call.

The AI aced the product discovery. It mapped Opportunity Solution Trees perfectly. But then it shipped the macOS app with zero tests and completely ignored accessibility (WCAG).

Why? Because my rules were just "friendly advice." The AI acted exactly like a stressed human product team: it invested heavily in discovery, then took the path of least resistance during delivery.

The Solution: Mechanical Enforcement

Unlike humans, AI doesn't have feelings. It doesn't get annoyed by rigid, theory-driven bureaucracy.

This led to Mycelium v0.5.0. It is, quite literally, a digital straitjacket designed to guarantee BVSSH. It encodes 20+ frameworks into a mechanical, three-tier enforcement architecture:

🚫 BLOCKED: Physically stops fatal errors (e.g., exiting with code 2 if it tries to write secrets to disk).
🚧 GATED: The agent cannot mark delivery as "done" until automated tests exist, accessibility is checked, and the OWASP threat model is updated.
💡 ADVISORY: Nudges for clean code (DRY, KISS) and cognitive bias awareness.

Add to that a brutal corrections loop: The agent cannot write a single line of code without first reading its own previous mistakes.

The Theory Stack Under the Hood

For those curious about how the gates are structured, Mycelium encodes frameworks from:

Teresa Torres (Continuous Discovery)
Marty Cagan (Empowered Teams)
Nicole Forsgren & Gene Kim (DORA / Accelerate)
Lou Downe (Good Services)
Matthew Skelton & Manuel Pais (Team Topologies)
Itamar Gilad (GIST)

...along with OWASP, WCAG 2.1 AA, and cognitive bias mitigations by Richard Shotton and Daniel Kahneman.

I Need Your Help Stress-Testing This

The era of "vibecoding" needs to end. We need to force discipline into the loop.

Mycelium is a hypothesis. Testing it on my own projects isn't enough to prove it scales. If you are tired of AI writing incredibly fast code for the wrong problems, check out the repo:

🔗 haabe/mycelium on GitHub

Run it on your next project, see how the agent reacts to the straitjacket, and please open an issue or a PR when it breaks.

Where do you draw the line between giving an AI agent freedom and putting it in a straitjacket? Let me know in the comments!