Håvard Bartnes

Posted on May 26 • Originally published at linkedin.com

I built for developers. What a non-developer felt hit hardest

#ai #agents #productdevelopment #opensource

AI has made building cheap. It hasn't made deciding cheap. Agents will jump from an idea to a pull request without asking why, who for, or whether anyone needs it. Mycelium is the framework I built to make them stop and demand evidence first.

I built Mycelium for developers. The strongest signal it produced this month came from someone who is not one.

My wife Edith is writing a book. She has never used Claude Code. Last week she sat down at my keyboard, ran /start once, and reached the assumption test fifteen minutes later. The framework read her own words back to her as a structured brief. She nearly cried. The note I wrote down: even though it was her own words, it was captured and presented well.

That wasn't supposed to happen. I built the brief synthesis to keep developers from skipping discovery. Reaching a non-developer at the emotional layer on a book project was nowhere in the design.

Two other testers ran Mycelium that same week, on projects with nothing in common. Together they showed me something I didn't expect.

What the test was

Three first-runs. Maximum variance by design — or rather, by accident, as it turned out.

Frida: junior developer, building a public-sector mobile app for next-of-kin in home care. GDPR, healthcare, AI-naive end users. Alex: also a junior developer, working on a project of his own. Edith: my wife, writing a book.

They ran independently. The brief was explicit. No comparing notes mid-run. No asking each other how to get past confusion.

Frida's friction log landed on a Tuesday. Edith's session a mere week later. Alex's recap landed in Discord the Tuesday after that.

I went in wanting evidence on one thing: that Mycelium is light enough to keep using past the first friction moment.

The triangulation rule was committed to after Frida's first log but before any other data arrived. Friction that two of three testers hit independently counts as a real signal on the cautious-learner segment. Friction only one tester hit stays a single observation, however vivid. The original cohort design called for Frida, Alex, and a third junior developer. The third never returned data. Edith's session became the third leg by accident, when she sat down to use the framework on her own work.

Three patterns converged anyway.

What converged

Three patterns. Each named by two of three testers, all strangers to each other, all building completely different things.

Wayfinding stops at phase transitions. Edith finished the assumption test and lost her place. She remembered being asked early about a deep-dive interview, but there were no traces of it after. She was bewildered. Alex finished his proof-of-concept and Mycelium went quiet on him. His words: once it was done with building it just kind of stopped and didn't prompt me for more info or advise me what to do next, so I had a look through the readme/commands to see what to do to get it on course again.

Two testers, two different phases, two different projects. Same failure. The framework points you to the start. It doesn't point you anywhere after that. I had built the orientation surface for one phase. It never extended to the others, because nobody had asked it to.

Wall of text. Frida stopped when she got tired, not when she got annoyed. Too much structured output to read through. Alex: my brain was fried from the gigantic walls of text I had been wading through so I gave it a break. Daniel Bentes, an experienced architect from a different cohort, hit the same wall earlier this month. He called it verbose and strict, framed as feature, not bug. Three users, three experience levels, three contexts, same complaint.

Internal vocabulary in user-facing prompts. Frida hit framework-internal terms with no anchor — words I'd never explained in the surface she was actually reading. Alex: I also kind of kept getting a little lost in the vocabulary, I wonder if more straight forward terms would have helped me at least. The glossary exists. The glossary is something the user has to go look up. The prompts themselves were leaking terms I never introduced.

Three patterns. Three users with nothing in common. That's stronger evidence than five users in the same context all reporting the same things would have been.

What the framework caught

This isn't a hatchet piece. What worked deserves naming too.

Frida wrote three sentences when the first question asked for one. The constraint annoyed her. Then it forced her into umbrella-thinking before the details. Afterward she told me it had been the right move. She'd have got there slower on her own.

Later in her session, Mycelium noticed two empty fields and called itself out. She told me she liked that. The agent flagging its own holes before she had to.

Edith's near-tears reaction to the brief is the most emotional moment Mycelium has produced from any user. The really saw me feedback on the assumption test was the second. Neither was in the spec.

Alex went from interview to feature selection to research prompting to a working proof-of-concept in one session. That's a lot of ground. The break came after the PoC, not before.

What Frida caught

Halfway through Frida's session, the agent tried to overwrite her brief with a revised version. She stopped it. She asked the agent to keep the original as a revision note and record the new version alongside as a confidence note, so the history would survive.

She enforced the discipline the framework was supposed to enforce.

That's the most useful thing any user has told Mycelium about itself.

Mycelium's whole pitch is that the agent shouldn't silently overwrite the user. In Frida's session, the agent tried. She caught it. The next day I shipped the calibration fix she'd surfaced — the confidence-floor framing that had triggered the overwrite attempt in the first place. Her exact preservation convention, revision note and confidence note as named structures, is still on the candidate list. I haven't shipped that one yet.

The cohort was set up to surface friction. Frida shipped ten items, fully attributed, with quotes. She also showed me what the framework's own discipline looks like when it actually works. That's not a tester's job. That's a contributor's.

What this evidence supports

Three first-runs is three. Not thirty.

Convergence is what makes the data load-bearing. Items two testers hit independently count as evidence on the cautious-learner segment, per the rule I committed to before the second and third logs arrived. Items only one tester hit stay anecdotes.

Edith's emotional reaction is one user. A real signal, not yet a pattern. Another non-developer reacting the same way would change that.

What this is NOT evidence of: that Mycelium has no friction. That it works for non-developers in general. That every book project would land like Edith's.

Three changes are supported by the data: wayfinding, verbosity, vocabulary. Whether they shipped is the next question.

What shipped

Frida's log landed Tuesday. By Wednesday I'd shipped seven of her ten findings.

Three were noise she shouldn't have been seeing in the first place. Two were consent moments where the framework wasn't giving her real choice. Two were transparency: numbers and terms the framework had been hiding from the user.

Here's the one that tells the story: the friction-log prompt. The old wording implied an opaque pipeline. She didn't know where her input was going, and the framework was making her hesitant about contributing. I rewrote it to offer three named destinations. She picks. That's the discipline I want at every consent moment.

Edith's wayfinding gap was logged the same day as her session. The correction was recorded. The mechanism didn't ship — I had her friction in the framework's memory and no clear fix for it yet.

What triggered the actual ship came six days later, when Alex's rich recap landed and converged on the same failure shape. Post-build silence on his side, post-assumption-test silence on Edith's, same underlying pattern surfaced by two testers in completely different contexts. v0.31.1 closed both the same evening Alex's recap landed.

When a build produces working code, the framework now offers six options: security-review, threat-model, definition-of-done, reflexion, refine the spec, or ship as-is. It doesn't pick. You do. The orientation surface fires at every phase transition, not just the start.

v0.31.2 shipped the same evening, fixing Alex's wall-of-text complaint. This is the one that broke my assumption about how to fix this kind of problem. The obvious move was to strip content. I almost did. Then I noticed what would have gone: the citations, the attribution labels, the alternatives I'd considered and rejected. All the discipline metadata that makes the framework auditable to a careful reader. Stripping it would have made the output cheaper and less honest at the same time.

So I layered it instead. Every response that carries discipline metadata now has three blocks. A one-or-two-line claim up top. A scannable rationale below. Then everything else, below a horizontal rule. Read the first line, you have the answer. Scroll down if you want the receipts.

Theory behind it: Sweller's cognitive load, Cowan's working-memory chunks, Nielsen's F-pattern, Minto's pyramid, the BLUF tradition, and W3C's COGA accessibility working draft. Honesty caveat: the chain wall-of-text → comprehension failure → people quit is one tester's report. The research is theory, not validation on this surface. The next move is an A/B test.

Three of Frida's items are still open. One of Alex's is too, vocabulary work I haven't shipped yet. Frida's preservation convention is in the same queue. None of the open items is blocking. But I know exactly what each one needs. That's on me.

The interesting number isn't how many shipped. It's how fast. Two of Alex's findings plus Edith's wayfinding correction were framework conventions the same evening Alex's recap landed.

What this implies

The cohort that surfaced the convergence wasn't designed for it. A junior developer on a healthcare mobile app, a junior developer on a software project, my wife on a book. Maximum variance was an accident. It's also why the data is sharper than the test design intended.

Mycelium's job has always been to surface failure modes early. The cohort that did that best this month had nothing in common except the framework.

Metabolism rate is what I get judged on next.

If you've been thinking about running Mycelium on a project you care about, and you'd keep a friction log along the way, I want it. Three first-runs gave me convergence on three patterns. Five would tell me more.

Mycelium is at github.com/haabe/mycelium. /mycelium:start is the first command.