DEV Community: Stanislav Kremeň

Building with mini, Part 5/9: When you need to think — discuss and verify

Stanislav Kremeň — Wed, 08 Jul 2026 11:00:00 +0000

In Part 4 the loop next → plan → do → done ran and pycalc got its tokenizer. But the loop is optimistic: it assumes you already understand the phase, and that "the tests passed" means "done". Sometimes neither is true. For those two situations mini has two human checkpoints, symmetric around the loop:

discuss — before the plan, when you don't have the phase thought through yet.
verify — after do, when "green tests" aren't enough and it calls for a human eye.

Neither writes code, and neither moves the phase state. But each leaves behind a file that the later steps read — that's the whole point.

`discuss` — talk through the intent

We built the parser as Phase 2. Recursive descent looks like "just write it", but it hides decisions that are hard to change after the fact. Rather than discover them mid-implementation, I open a discussion:

mini discuss

The session gets the phase and the project, and is forbidden from implementing anything. The agent summarizes what the phase means, then asks — in small batches. The two most valuable questions were ones I wouldn't have asked myself:

2. Error positions at end of input — this is the main trap. For 2+ or (2+3 the token is missing exactly where the caret should point. But the parser only gets a list of tokens — the last position is at the last token, not past it. Either we pass the text length to the parser and report the error at pos = len(text), or the tokenizer starts adding an EOF token — but that changes Phase 1's output and breaks its tests.

That's not cosmetics. The second option would silently break a finished, tested Phase 1 — and it would only surface when the tests ran. We agreed on the first path (pass the length, don't touch the tokenizer). Decisions also landed on evaluation vs. AST (compute inline, no tree — an AST is overkill for a learning parser) and on the single public entry point (evaluate(text) -> Decimal).

Before the session ends, it writes a summary to .mini/discuss/phase-002.md — a fixed structure of Intent / Key decisions / Watch out for:

## Key decisions
- **Evaluate inline, no AST.** Compute the result during the recursive descent.
- **Single public entry point: `evaluate(text: str) -> Decimal`.**
- **End-of-input error positions: pass the text length to the parser.** Do NOT
  modify the tokenizer (no EOF token — that would break Phase 1 tests).
- **Division by zero points at the `/` operator token's position.**
- **Unary minus:** allow chaining (`--2` = 2) and post-operator use (`2*-3` = -6).
  **Reject unary plus** (`+5` is a syntax error).

That file is the payoff. mini plan and mini do pull it into their context — they see the decisions, not the whole rambling debate. The discussion is disposable, the conclusions permanent. (After discuss you also get offered an edit of the phase name/goal, in case the debate shifted the scope.)
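To make the end-of-input decision concrete, here is a minimal sketch of a parser that receives the text length instead of an EOF token. It is not pycalc's actual code: the fields of Token and CalcError, the Parser class and the regex-based tokenize stand-in are all assumptions made for illustration. Only evaluate(text) -> Decimal and the decisions themselves come from the summary above.

```python
import re
from decimal import Decimal
from typing import NamedTuple, Optional


class Token(NamedTuple):
    text: str
    pos: int  # index of the token's first character in the source text


class CalcError(Exception):
    def __init__(self, message: str, pos: int) -> None:
        super().__init__(message)
        self.pos = pos


def tokenize(text: str) -> list[Token]:
    # Simplified stand-in for the Phase 1 tokenizer. Note: no EOF token.
    tokens = []
    for m in re.finditer(r"\s*(?:([0-9]+(?:\.[0-9]+)?|[-+*/()])|(\S))", text):
        if m.group(2):
            raise CalcError(f"invalid character {m.group(2)!r}", m.start(2))
        tokens.append(Token(m.group(1), m.start(1)))
    return tokens


class Parser:
    def __init__(self, tokens: list[Token], text_len: int) -> None:
        self.tokens = tokens
        self.text_len = text_len  # plays the role of the missing EOF token
        self.i = 0

    def peek(self) -> Optional[Token]:
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def error(self) -> CalcError:
        tok = self.peek()
        if tok is None:
            # Out of tokens: point one past the last character of the text.
            return CalcError("unexpected end of input", self.text_len)
        return CalcError(f"unexpected token {tok.text!r}", tok.pos)

    def expr(self) -> Decimal:
        value = self.term()
        while (tok := self.peek()) is not None and tok.text in ("+", "-"):
            self.i += 1
            rhs = self.term()
            value = value + rhs if tok.text == "+" else value - rhs
        return value

    def term(self) -> Decimal:
        value = self.unary()
        while (tok := self.peek()) is not None and tok.text in ("*", "/"):
            self.i += 1
            rhs = self.unary()
            if tok.text == "/" and rhs == 0:
                raise CalcError("division by zero", tok.pos)
            value = value * rhs if tok.text == "*" else value / rhs
        return value

    def unary(self) -> Decimal:
        tok = self.peek()
        if tok is not None and tok.text == "-":
            self.i += 1
            return -self.unary()  # chaining: "--2" evaluates to 2
        return self.atom()

    def atom(self) -> Decimal:
        tok = self.peek()
        if tok is None or tok.text in ("+", "*", "/", ")"):
            raise self.error()  # also rejects unary plus
        self.i += 1
        if tok.text == "(":
            value = self.expr()
            closing = self.peek()
            if closing is None or closing.text != ")":
                raise self.error()
            self.i += 1
            return value
        return Decimal(tok.text)


def evaluate(text: str) -> Decimal:
    parser = Parser(tokenize(text), len(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise parser.error()  # leftover tokens, e.g. "2 3" or "2)"
    return value
```

The design choice shows up in error(): when the token list runs out, the only position the parser can still report is the one it was handed as text_len, so the tokenizer's output stays exactly what Phase 1 tested.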

`verify` — when "the tests passed" isn't enough

Two phases later pycalc actually computes: Phase 3 added the CLI layer. python3 pycalc.py "2+3*4" prints 14, errors go to stderr with a caret line. The tests are green, 14/14. Done?

No. The tests verify that 2++3 fails at position 2 — they don't verify whether that error looks good. And the readability of the messages is pycalc's entire selling point. So, before closing the phase:

mini verify

This is the one checkpoint that deliberately targets what a machine can't judge: appearance, clarity, smoothness. The agent leads the review interactively — one thing at a time, waiting for a reaction. It started from the verify items in the run report and then broadened the review. It signed off the happy path (the numbers read well, 1/3 truncates without a …, which I confirmed as intentional), but the wider pass turned up three real findings:

1. The error message has no prefix. The line prints unexpected token '+' with no error:/pycalc:. When you redirect stderr to a log or the caret line scrolls off, you can't tell an error from output. git/python prefix their errors.

2. The caret misaligns on tabs. The caret line is " " * err.pos + "^" over the raw index, but a tab renders as several columns — the caret lands elsewhere. Confirmed live. This is a genuine bug in the one feature you called the selling point, not cosmetics.

3. Usage names the wrong program. It prints usage: pycalc EXPRESSION, but it's invoked as python3 pycalc.py. The first line a confused user sees points at a command that doesn't exist.
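Finding 2 has a small remedy that can be sketched in a few lines. This is one possible fix, not necessarily the one pycalc ended up with: caret_line is a hypothetical helper, and it assumes the source line is printed unmodified directly above the caret line, starting at the same column. Instead of padding with err.pos spaces, it copies every tab from the source into the padding, so the terminal expands both lines the same way.

```python
def caret_line(source: str, pos: int) -> str:
    """Build the marker line to print directly under `source`."""
    # Copy tabs verbatim so both lines expand to the same columns;
    # every other character is assumed to occupy exactly one column.
    padding = "".join("\t" if ch == "\t" else " " for ch in source[:pos])
    return padding + "^"


if __name__ == "__main__":
    expression = "2\t+\t*3"
    print(expression)
    print(caret_line(expression, 4))  # index 4 is the '*'
```

The sketch still treats every non-tab character as one column wide, so double-width or combining characters would need separate handling.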

I confirmed the tab caret myself (python3 pycalc.py "2⇥+⇥*3": the caret really does land in the wrong column) and --help too (it is parsed as an expression and fails with "invalid character 'h'"). The agent wrote the findings into .mini/run/phase-003.md under a ## Verify findings section, below the YAML header, so mini done isn't disturbed. And the key moment:

Verdict: do not close the phase.
Recommended next step: go back to /mini:do to implement the three fixes.

This is the whole point of verify: it sent me back to do, not forward to done. Verify never closes phases. It either confirms there's something to close (and recommends done), or it finds work (and recommends do). It leaves the phase state in state.json untouched, because changing it is done's job, not verify's.

What to take away

Both commands are the same pattern: an interactive session, no code, no state change, output to a file. They differ only in where they stand in the loop and what they write — discuss notes for plan/do, verify findings into the report (and, for an already closed phase, into the memory too). Both are purely opt-in: the loop works without them. You reach for them only when thinking pays off.

The trade-offs, to put them on the table:

They cost tokens and time. They're full interactive sessions. For a trivial phase discuss is just ceremony — next straight into plan is faster. They earn their keep only when there's something to decide or something to judge.
verify is only as good as the reviewer. The agent suggests what to try, but you're the judge. If you answer "yeah, fine" without running the commands, you get a false "OK" in the report. The value is in actually clicking through it.
verify findings aren't phases. Those three fixes now live only in the report. The next do has them in context, but if I don't finish the phase and /clear, I have to come back to the report by hand — the checkpoint doesn't auto-generate tasks into todo.

Next time

Three human stops per phase (proposal, plan, verification) are priceless for one phase a day — for five phases in an evening they're a chore. Next time we run mini auto: the loop driving phases back to back on its own, stopping for you only where it must. And with it stop — a clean way to halt it.

mini is open source: npm install -g mini-orchestrator, then mini install-commands in your project. Source and docs on GitHub.

Building with mini, Part 4/9: The main loop — next plan do done

Stanislav Kremeň — Wed, 01 Jul 2026 11:00:00 +0000

pycalc has a vision (Part 2) and a backlog (Part 3) — and still not a single line of Python. Today we fix that. Four commands form the heart of mini: propose, break down, build, close. One turn of the loop = one phase = one commit.

`next` — propose

mini next

next is the entry point of the loop, and the only step that begins with a question: do you have your own idea, or do you leave it to the agent? I left it to the agent — and it's worth showing what it did with last episode's backlog:

The backlog has two items so far ([1] ** and %, [2] reading from stdin), but both are explicitly out of scope for this version (non-goals), so neither fits the first phase.

This is exactly the interplay that project.md and todo.md exist for: the backlog offered itself, the non-goals stopped it. Instead, three candidates came up — tokenizer with positions, a CLI skeleton, a walking skeleton — each with its trade-off:

I propose candidate 1: Tokenizer with positions. Why not the walking skeleton: it sounds tempting, but for a project this small the rewriting overhead outweighs the benefit — the pipeline has only three pieces, and in three phases it will be end-to-end anyway. The trade-off of candidate 1: until there's a CLI, you can't poke at anything by hand; verification is purely through tests.

After approval the agent saves the phase itself, via mini next --apply --title … --goal …. State is never changed by the model's hand:

Phases:
  [proposed] > 1. Tokenizer with positions

  Next: mini plan (break it down) or mini do (run directly)

`plan` — break down

mini plan

plan breaks the phase into steps — a short title plus a one-sentence detail (that pairing is deliberate: only the titles go into later steps' prompts, the detail is fetched when needed). For the tokenizer it produced five steps, from the Token NamedTuple through the number scanner into Decimal to the unittest table. And here too a decision surfaced that would otherwise only emerge in the code:

.5 and 2. are deliberately rejected (simplest rule with exact error positions; trade-off: .5 is valid in many calculators — say so if you want it supported and I'll adjust step 3).

Approved, saved (mini plan --apply reads the steps from stdin):

[planned]  > 1. Tokenizer with positions
  Steps:
    [todo]     Token type and CalcError in pycalc.py
    [todo]     tokenize() loop: operators, parens, whitespace
    [todo]     Number scanning into Decimal
    [todo]     Invalid character raises CalcError with position
    [todo]     unittest case table in test_pycalc.py

`do` — build

mini do

Only now does code get written. The session gets the phase, the steps and a reference to the project — not the whole conversation history. The agent ticks steps off as it goes (mini do --apply --step-done "<title>"), so mini status shows live progress, and at the end it writes a run report to .mini/run/phase-001.md: the verdict, the step statuses, and notes beyond the plan. Those notes are the most valuable thing in the report:

I deliberately rejected str.isdigit() — it accepts Unicode digits like ٣, which Decimal() would happily parse too. A helper restricts numbers to ASCII 0-9; there's a test that 1+٣ raises an error at position 2.

I verified the "no input ever produces an unhandled exception" claim with a fuzz over 20,000 strings — only CalcError was ever raised.

One of my own test expectations was off by one (10.25+0.75 positions); the tokenizer was correct, the test data was fixed.

All five steps done, 27 test cases green. And note what didn't happen: no REPL, no sin, no colors. The non-goals hold.
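The two number rules that surfaced in plan and do (reject .5 and 2., accept ASCII digits only) fit in one small scanner. What follows is a sketch under assumptions, not the code from the run: scan_number and is_ascii_digit are hypothetical names, CalcError's fields are guessed, and the position reported for 2. (the point itself) is my choice, since the post doesn't say where that error lands.

```python
from decimal import Decimal

ASCII_DIGITS = "0123456789"


class CalcError(Exception):
    def __init__(self, message: str, pos: int) -> None:
        super().__init__(message)
        self.pos = pos


def is_ascii_digit(ch: str) -> bool:
    # str.isdigit() is also true for characters such as "٣" (Arabic-Indic
    # three), so the check is done against an explicit ASCII table instead.
    return len(ch) == 1 and ch in ASCII_DIGITS


def scan_number(text: str, start: int) -> tuple[Decimal, int]:
    """Scan a number at text[start] (start < len(text) is the caller's job).

    Returns the value and the index just past the number.
    """
    i = start
    while i < len(text) and is_ascii_digit(text[i]):
        i += 1
    if i == start:
        # Covers ".5" and non-ASCII digits: a number must begin with 0-9.
        raise CalcError(f"invalid character {text[start]!r}", start)
    if i < len(text) and text[i] == ".":
        point = i
        i += 1
        while i < len(text) and is_ascii_digit(text[i]):
            i += 1
        if i == point + 1:
            # "2." has no digit after the point (assumed error position).
            raise CalcError("expected a digit after '.'", point)
    return Decimal(text[start:i]), i
```

Called as scan_number("10.25+0.75", 6), it returns the value 0.75 and the index 10, which is the kind of position arithmetic the off-by-one test expectation got wrong.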

`done` — close

done is the loop's human gate, the one place mini insists on a person. Incidentally, the state now reminds you on its own that there's something to close:

[doing]    > 1. Tokenizer with positions
            ⚠ stuck: phase "doing", but it has no open steps — close it via mini done

mini done

The agent summarizes the phase from the run report, proposes an entry for CHANGELOG.md, and asks the one thing that matters: does it work for you? Verification is my job, not the agent's:

python3 -m unittest -v
...
Ran 3 tests in 0.001s
OK

After confirmation it writes the CHANGELOG (Keep a Changelog format, Unreleased section) and calls mini done --apply. That does three things at once: moves the state, writes the phase memory to .mini/memory/phase-001.md (goal, steps, run report — what the next phase will find useful), and wraps the entire phase's work into a single commit:

git log --oneline
a01eb36 Phase 1: Tokenizer with positions
9cf1df1 init: mini project setup

The version isn't bumped by default and nothing is pushed — both are opt-in (--bump patch --push), no surprises on the remote.

Phases:
  [done]       1. Tokenizer with positions (took 2m 6s)

  Next: mini next (proposes the first phase)

Two minutes six seconds from do to commit. And the backlog? Still 2 open items — the loop never touched it, because it never came up.

Why a loop, not one big prompt

This is where the whole architecture from Part 0 pays off. Every step of the loop starts from saved state, not from a conversation: plan doesn't need to know what was said in next, it reads the phase from .mini/. Between phases /clear is recommended — the context is thrown away, because everything that matters survives in files: state in state.json, experience in the phase memory, history in git and the CHANGELOG. The conversation is disposable; the state is permanent.

A trade-off, to put it fairly on the table: one phase = one commit means a half-done phase isn't in git. If a do session crashes mid-way, the code on disk stays (so do the report and the ticked steps), but the commit doesn't — which is why keeping phases small helps, exactly as next enforces (1-3 days, one verifiable goal).

And about those three human stops (approving the proposal, approving the plan, verifying in done): for one phase a day they're priceless, for five phases in an evening they're a chore. That's precisely why mini auto exists — but more on that in a later episode.

Next time

The loop can build. Next we'll look at the two human checkpoints around it — discuss, when a phase needs talking through first, and verify, when "the tests passed" isn't enough and it calls for a human UI/UX review.


I shipped a card game in a weekend — but I didn't really "vibe code" it

Stanislav Kremeň — Wed, 24 Jun 2026 15:44:01 +0000

This is my entry for the Vibe Coding Weekend Challenge. The honest version, anyway — because what I did wasn't quite vibe coding.

The app, in two sentences

It's a browser version of Prší, a Czech card game from the Mau-Mau family. You play one full match, 1v1 against a deliberately dumb AI, from the deal to someone emptying their hand — mouse or touch.

🎮 Play: https://czsoftcode.github.io/prsi/
📦 Source: https://github.com/czsoftcode/prsi

Stack: TypeScript + Vite, no framework, plain DOM rendering. The game engine (rules, move validation, win detection) is pure functions with zero DOM dependencies, so it's fully testable on its own — it ended up with ~300 unit/integration tests.

The tool: Claude Code, but on a leash

The challenge suggests Lovable, Cursor, or Bolt. I used Claude Code instead — and not in the "describe the app, accept whatever comes out" way that vibe coding usually means.

Instead I drove it through a small phase orchestrator. The whole build was sliced into ~10 explicit phases, each with a written goal and a verifiable output:

Vite + TypeScript scaffold
Engine: data model (32-card deck, deal)
Engine: basic move (suit/rank match, draw)
Engine: special cards + win
AI opponent heuristic
UI: render the table
UI: interaction loop + suit picker
End of game, new match, stalemate
End-to-end simulation (200 AI-vs-AI games)
README + deploy

Each phase got planned, implemented, and reviewed before the next one started. So less "vibe", more spec-driven development with an AI doing the typing. The upside: when the AI got something subtly wrong, I had tests and a tight scope to catch it instead of a 2000-line blob I had to trust.

What actually went wrong

The challenge asks for the real iteration count and the things that broke. Honestly, the rules were the hard part, not the code:

The seven. A 7 forces the opponent to draw 2, and sevens stack — up to four of them, so the last player can be forced to draw 8. Getting "you can only respond to a seven with another seven" right took a couple of passes, because it interacts with which cards are even considered playable.
The queen is a wildcard. It can be played on anything and changes the demanded suit. The first engine version wrongly restricted it to a normal suit/rank match. Caught by a test, fixed.
Deployment bit me. Everything worked on localhost, then broke on GitHub Pages. The cards live under /cards/... as absolute paths, but the site is served from /prsi/ (a project subpath), so every card 404'd. Fix: build the paths through Vite's import.meta.env.BASE_URL instead of hardcoding the leading slash.
A dumb versioning slip. I tagged v1.0.2 but forgot to actually bump package.json, so the tag's tree still said 1.0.1. Small, but exactly the kind of thing you only notice when you look.

The AI side stayed simple on purpose: the opponent only ever plays legal moves and picks them with a basic heuristic. A smart AI wasn't the point — a correct one was, which is why those 200 simulated games matter (every one has to end in a win or a stalemate without breaking an invariant).
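The interaction between the seven and "which cards are even considered playable" is easier to see as a pure function. The real engine is TypeScript and its code isn't shown here, so this is only a Python sketch of the rules as described above: Card, playable and pending_penalty are hypothetical names, the rank labels are my own, and any Prší rules the post doesn't mention are left out. It also assumes a queen cannot answer a pending seven, which is how I read "you can only respond to a seven with another seven".

```python
from typing import NamedTuple, Optional


class Card(NamedTuple):
    suit: str
    rank: str  # "7", "8", "9", "10", "J", "Q", "K" or "A"


def pending_penalty(stacked_sevens: int) -> int:
    # Each seven in the current stack adds two cards to the draw.
    return 2 * stacked_sevens


def playable(hand: list[Card], top: Card,
             demanded_suit: Optional[str], stacked_sevens: int) -> list[Card]:
    if stacked_sevens > 0:
        # An unanswered seven narrows everything: only another seven is legal.
        return [c for c in hand if c.rank == "7"]
    if demanded_suit is not None:
        # A queen was played: follow the demanded suit, or play another queen.
        return [c for c in hand if c.rank == "Q" or c.suit == demanded_suit]
    # Normal move: the queen is a wildcard, otherwise match suit or rank.
    return [c for c in hand
            if c.rank == "Q" or c.suit == top.suit or c.rank == top.rank]
```

Keeping the seven check first is what makes the rules compose: the queen's wildcard status and the normal suit/rank match only apply once no penalty is pending.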

Was it worth it?

Yes. The weekend constraint forces you to cut scope — no accounts, no multiplayer, no score across rounds, just one good game. And going phase-by-phase instead of pure vibe meant the thing that shipped is one I can actually reason about and extend.

If you want to see a "dumb" AI lose at a Czech card game, it's right here: https://czsoftcode.github.io/prsi/

Building with mini, Part 3/9: Capturing ideas with todo

Stanislav Kremeň — Wed, 24 Jun 2026 11:00:00 +0000

In Part 2, pycalc went through its planning interview. A side product you may have noticed: the interview generated ideas we deliberately rejected. Claude itself deferred the ** power operator with the note that right-associativity is "the ideal v2 exercise". Reading from stdin dropped out too. The non-goals now say "Do not add ** or % in this version."

But a non-goal is a rule, not a plan. It says what must not happen now — it doesn't say what should happen someday. So where does "someday" go?

I know all three bad options from my own practice: your head (you forget), an IDEAS.md file growing forever (exactly the documentation pile mini stands against), or creating it as a phase right away (and the project state silts up with a future that may never come).

mini todo is the fourth option: a backlog for things that aren't phases yet.

Zero tokens

First the most important part, faithful to Part 0: mini todo never starts Claude. It's pure TypeScript over one markdown file — adding an idea costs exactly 0 tokens. You pay only at the moment the ideas become useful: mini next offers them as candidates, each as a single line.

And because the command needs no TTY and no interaction, it works the same from the terminal and as the /mini:todo slash command in the middle of a Claude Code session — when an idea shows up while you're working on something else, the agent stashes it with one call and nobody loses the thread.

pycalc gets a backlog

cd pycalc
mini todo

The todo archive is empty. Add an idea with `mini todo add "<text>"`.

Let's pour in what the interview deferred:

mini todo add "Power ** and modulo % — right-associativity of ** is the v2 parser exercise"
mini todo add "Read the expression from stdin so pycalc works in pipes (echo \"2+2\" | pycalc)"
mini todo add "Interactive REPL mode"

[ok] Added: Power ** and modulo % — right-associativity of ** is the v2 parser exercise
[ok] Added: Read the expression from stdin so pycalc works in pipes (echo "2+2" | pycalc)
[ok] Added: Interactive REPL mode

mini todo

Ideas & changes
  1. [ ] Power ** and modulo % — right-associativity of ** is the v2 parser exercise
  2. [ ] Read the expression from stdin so pycalc works in pipes (echo "2+2" | pycalc)
  3. [ ] Interactive REPL mode
  3 open / 3 total
  Actions: list · add "<text>" · edit <n> "<text>" · done <n> · remove <n> · clear

And immediately, the first act of curation. That third line is a mistake — not because a REPL is a bad idea, but because one-shot evaluation is the identity of the tool, not a version constraint. "What I'm building" from Part 2 says evaluates one expression and exits. A REPL isn't a deferred idea; it's a different program. The backlog deserves the same strictness as the non-goals:

mini todo remove 3

[ok] Removed: Interactive REPL mode

That's the whole curation lesson: done <n> ticks off (built), remove <n> deletes (we changed our mind), clear sweeps out everything ticked.

One file, readable without mini

The entire archive is .mini/todo.md — a plain markdown checklist:

# Ideas & changes

> Archive of future ideas and changes for this project. Managed by `mini todo`
> (`add` / `done` / `remove`); `mini next` offers the open items as candidate
> phase ideas. You can also edit this checklist by hand.
- [ ] Power ** and modulo % — right-associativity of ** is the v2 parser exercise
- [ ] Read the expression from stdin so pycalc works in pipes (echo "2+2" | pycalc)

The last sentence of the header isn't decoration: you can edit the file by hand and mini won't mind. The parser only takes checklist lines and ignores everything else — the state can't break because you scribbled a note into the file.
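Why hand edits are harmless is easiest to see in code. mini's real parser is TypeScript and isn't shown in this post, so the following is a Python sketch of the idea only: parse_todo, TodoItem and the exact pattern are assumptions, including the guess that a checked box is written with an x as in ordinary markdown checklists.

```python
import re
from typing import NamedTuple


class TodoItem(NamedTuple):
    number: int  # 1-based position among the checklist lines
    done: bool
    text: str


# Only lines shaped like "- [ ] text" or "- [x] text" count as items.
CHECKLIST_LINE = re.compile(r"^- \[( |x|X)\] (.+)$")


def parse_todo(markdown: str) -> list[TodoItem]:
    items: list[TodoItem] = []
    for line in markdown.splitlines():
        m = CHECKLIST_LINE.match(line.strip())
        if m is None:
            continue  # heading, blockquote or hand-written note: ignored
        items.append(TodoItem(len(items) + 1, m.group(1) != " ", m.group(2)))
    return items


if __name__ == "__main__":
    sample = "\n".join([
        "# Ideas & changes",
        "",
        "> Managed by `mini todo`.",
        "a note scribbled in by hand",
        "- [ ] Read the expression from stdin",
        "- [x] Something already built",
    ])
    for item in parse_todo(sample):
        print(item)
```

Because everything that doesn't match the checklist pattern is skipped rather than rejected, a stray note can't turn into a parse error; at worst it is invisible to the tool.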

The payoff: the backlog reports for duty

Here's where it connects. The next time we run mini next, this block gets bundled into the prompt — verbatim:

# Ideas in the backlog
The project keeps an ideas/changes archive (`mini todo`). Open items (numbered
by their archive position), any of which could become the next phase:
- [1] Power ** and modulo % — right-associativity of ** is the v2 parser exercise
- [2] Read the expression from stdin so pycalc works in pipes (echo "2+2" | pycalc)
If one of them fits as the next step, propose it. When you save such a phase,
add `--from-todo <n>` (the bracketed number) to `mini next --apply` so the
source item is ticked off automatically.

Note the last rule: when an idea becomes a phase, the agent saves it with --from-todo <n> and the item ticks itself off in the archive. No manual reconciliation of "which backlog items did we already do". It works the other way too — when you let mini next propose freely and it generates more candidates than fit into one phase, it offers to stash the surplus into todo so nothing gets lost. The backlog fills and drains from both ends.

What `todo` deliberately can't do

No priorities, labels, due dates, owners. It's a one-line checklist, not an issue tracker: if a project needs Jira, it should have Jira. One practical consequence follows, worth keeping in mind as you write items: the one-line text is all that survives. The discussion that led to the idea doesn't exist here. So write items that carry the why. Compare "Add **" with the first item in pycalc's backlog, which spells out that the right-associativity of ** is the v2 parser exercise. Half a year later, the second one still makes sense; the first one is a riddle.

Next time

We have a backlog, the project has a vision — and not a single line of Python exists yet. That changes next time: mini's main loop, next → plan → do → done, takes one item and carries it all the way to a finished phase.


I'm Building a Code Security Analyzer. A Security Tool Found a Critical In It.

Stanislav Kremeň — Tue, 23 Jun 2026 11:00:00 +0000

I'm building a tool that's supposed to help check code. I call it vibeanalyzer for now. The idea is simple: a lot of us vibe-code — we let an agent write the code, it writes it, it looks clean, the tests pass — and we have no real idea what we just let into the project. Someone close to me put it perfectly: "I don't know what's inside, but I want it to work on the outside."

Before I went all in on the analyzer, I finally did the thing I should have done long ago: check whether someone had already solved this better than me. They had. It's called Semgrep — and the first thing it did was find a Critical vulnerability in my own security analyzer.

This post is about that irony, and what follows from it.

Confession up front: I didn't know the standard tools

I'm self-taught. No CS degree, no senior leaning over my shoulder, I learn from discussions and my own mistakes. So I set off building a code analyzer without really knowing that established tools have been doing this for years — Semgrep, CodeQL, SonarQube, Snyk.

That's the part you'll probably call out in the comments first, so I'll say it myself: yes, I started building a solution without checking what already exists. So I finally did. I ran Semgrep on the vibeanalyzer repo — on the tool that's supposed to be the one guarding security.

What it found

In a few minutes:

One cosmetic issue. SHA-1 used to generate a directory name — which in my case isn't a security hole (it's not a password or a signature, just a hash choice in a non-critical spot), but it's exactly the kind of thing I'd never spot myself.

And then six findings in dependencies. This is where I sat up, because these weren't cosmetic:

Critical in vitest — a path traversal that, with the UI server running, lets someone read, write, and execute files outside the project. And this is a direct dependency, one I pulled in myself.
Two High — in esbuild (downloading a binary with no integrity check) and in vite (bypassing the guard that's supposed to stop .env and certificate files from being served).
Three Medium — variants of path traversal and sensitive-data exposure across vite, esbuild, and launch-editor.

I have to be fair about how serious this actually is. Most of these are transitive — dependencies of my dependencies, not something I installed directly. They all sit in the dev toolchain, so they mostly threaten a running dev server, not necessarily production. And the real-world exploit probability (EPSS) is low across the board — one percent or less. This isn't a burning house.

But that vitest one is Direct and Critical. That's something I dragged into the project myself, directly, and I'd have had no idea.

The uncomfortable question

I'll ask it for you, because I know it's coming: How can a security-checking tool have a Critical vulnerability in itself?

The answer is uncomfortable, and it's the whole point of the project. The finding isn't in my logic — it's in the dependencies. My code can be written as honestly as I can manage and the supply chain still gets me, because the risk isn't carried by me but by what I build on. And if I didn't catch this — someone who's literally building a security tool and actively cares about the problem — what chance does a vibecoder have who just had an agent build an app and shipped it?

None. That's the reason the project makes sense at all.

What Semgrep can't do — and why I'm building anyway

I could stop here with "use Semgrep and forget your own tool." And honestly, for supply chain and known patterns of dangerous code, that holds — it does it better, deterministically, and for free. I'm not trying to beat it at its own game, and you shouldn't trust me if I claimed I was.

But I noticed something in what Semgrep didn't say. It didn't tell me whether the code does what my project actually intended. It doesn't know the boundaries I set for the project. It can't tell that some function runs fine and is safe, but solves a problem that shouldn't be there at all. Semgrep knows patterns of danger. It doesn't know intent.

That blind spot — the gap between what the code does and what it was meant to do — is what this whole series is about. I call it the intent gap. (Yes, the term gets used elsewhere, in marketing and UX; I mean it in my own narrow sense: the distance between code and its intended purpose.)

And that's exactly what I'm building vibeanalyzer on. The first thing it does is load — or ask the user to create — the project's intent and its non-goals: the guardrails that shouldn't be crossed. So that when the AI evaluates the code, it knows what the goal actually is and what's already out of bounds. That's a gap static analyzers don't fill, because they don't read intent, the idea behind the code. And it's the only piece I dare say is worth building next to the tools that already exist.

Where I am now

I've got intent and non-goal loading done. I've got a starting machine analysis of TypeScript and a basic markdown output with a mermaid graph of the folder structure. Ahead of me: the machine security layer (which, by the way, Semgrep currently does better than I do — that needs saying out loud), the skeleton of the AI layer, and the AI logic itself, which is the whole point and also the most uncertain part. I have no idea whether AI can meaningfully evaluate code against intent, or whether it'll just be a flood of false alarms. I tried it on websites, but I've yet to see how it works on a repository — and I'll write about it even if it doesn't pan out.

This series is about building in the open. Don't expect a finished product or guarantees. Expect notes from someone who finally ran the standard tools, took a hit from his own project, and decided to keep going — because the gap he wants to fill is even more obvious after today.

In the next part I'll dig into what the machine analysis of my TS code actually produces — and where it starts to diverge from what I need the AI layer to do. That's where the intent gap stops being a nice phrase and has to become working code, or fall apart trying.

By the way: after writing this, I bumped that vitest version — and it cleared all six dependency findings at once, because the others hung off it transitively. Semgrep now reports zero. Which doesn't mean the project is "safe" — it means those six known CVEs are gone. Unknown holes and my own logic bugs are still invisible to it. That's the whole reason I'm still building. And yes: I wrote a post about a hole I should have patched long ago. That's vibecoding in a nutshell — even in a tool that's supposed to police it.

The Habit That Stops Your AI From Quietly Wrecking Your Plan

Stanislav Kremeň — Thu, 18 Jun 2026 11:00:00 +0000

This is the third and final part of a little series. In part 1 I worked out the workflow: let the AI build the plan with you instead of writing it alone. In part 2 I shared the actual prompts that make an agent plan with you. This one is about the single habit that made the biggest difference of all — and the one I resisted the longest.

It's writing down what the app should not do.

Why I ignored this for so long

For a long time my instinct was simple: tell the AI what to build and how to work, and that's it. Describing what not to build felt pointless. Why spend words on features I'm not even making?

Then I noticed a pattern. I'd agree on a clean, small plan with the agent. Then a few prompts later, while building something unrelated, it would casually add an admin panel I never asked for. Or wire up a payment flow "to be helpful." Or refactor a simple feature into a multi-tenant architecture. Every time, I'd lose an afternoon pulling out work I never wanted.

The agent wasn't broken. I'd just left the door open. An AI fills empty space with assumptions — and an unspoken boundary is empty space.

Non-goals are guardrails, not documentation

Here's the reframe that fixed it for me: a non-goal isn't a note for humans. It's a guardrail for future prompts.

"No payments in v1" isn't there to remind you of anything — you already know. It's there so that three days from now, when you ask the agent to "improve the checkout screen," it doesn't quietly decide that checkout needs Stripe integration. The line holds the boundary even when you forget to.

That's why the moment you delete a non-goal, the idea comes back. The agent rediscovers it, thinks it's helping, and spends your time on it. The boundary only works while it's written down.

How to generate them during planning

The good news: you don't have to think these up yourself. Ask the agent to produce them while you're still planning. Two prompts do the job.

Surface them:

Based on this plan, list everything you might be tempted to add that I have NOT
asked for — extra features, extra structure, extra integrations. For each one,
tell me whether we should build it now or explicitly leave it out of this version.

This is revealing. The agent will list exactly the things it would have silently added — auth systems, dashboards, caching layers — and now you get to say "not yet" before it happens instead of after.

Lock them in:

Take everything we agreed to leave out and write it as a short list of non-goals.
Phrase each as a clear rule, like "Do not add X in this version." I'm going to
keep these in front of you for the rest of the project.

Then you do exactly that — keep the list somewhere the agent reads every time (your project instructions, your rules file, whatever your tool uses), so the boundary travels with every future prompt.
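If you assemble prompts yourself rather than relying on a rules file, "travels with every future prompt" can be sketched in a few lines. The helper name `with_non_goals` and the wording of the prefix are hypothetical, not something any particular tool provides.

```python
def with_non_goals(request: str, non_goals: list[str]) -> str:
    """Prefix a request with the project's non-goals (hypothetical helper)."""
    if not non_goals:
        return request
    rules = "\n".join(f"- {rule}" for rule in non_goals)
    return (
        "Non-goals for this version (do not reintroduce them):\n"
        f"{rules}\n\n"
        f"Request: {request}"
    )


prompt = with_non_goals(
    "Improve the checkout screen.",
    [
        "Do not add payments in this version.",
        "Do not add an admin panel in this version.",
    ],
)
print(prompt)
```

The point is only that the boundary is attached mechanically, so it does not depend on you remembering it three days later.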

When to drop them

People ask whether non-goals are forever. They're not. Keep each one until the thing actually ships, or until you genuinely decide to build it. A non-goal isn't a permanent ban — it's a "not now" that protects your current scope. When "not now" becomes "now," you delete the line on purpose, not by accident.

The whole journey, in three lines

Don't write the plan alone — run a conversation that produces it.
Use prompts that make the agent interview you and argue with you, not just obey.
Write down what not to build, and keep it in front of the agent so it holds.

Looking back, this was never really about planning. It's the same thing I keep running into everywhere I work with an agent: the hard part isn't the code, it's deciding what the agent gets to know — and what it doesn't. A plan is that decision at the start. Non-goals are that decision about the edges. I've got a few more of these fights to write up. They're all the same fight.

Building with mini, Part 2/9: Shaping the vision with project

Stanislav Kremeň — Wed, 17 Jun 2026 11:00:00 +0000

In Part 1 we ran mini init and got pycalc — a small calculator for the terminal. init asked only four questions, and .mini/project.md is accordingly thin: a name, one line about what, who it's for, and the constraints ("Python 3, standard library only").

That's enough to start, but not enough to steer. Three things are missing: how we'll build it, what we deliberately won't build, and how we'll know it's done.

That's exactly what mini project adds.

What the command does (and doesn't)

mini project opens an interactive Claude Code session — but not to write code. It's a plan-before-code interview. Claude gets your current project.md as the starting point (it won't ask what the project is about again) and walks you through four stages:

Frame & remove assumptions — questions about the users, the core workflow, the data.
Rough plan & weighing decisions — for every choice the pros, the cons and an alternative, plus the question "why this over the simplest possible version?". This becomes the Approach section.
Non-goals & guardrails — everything you agreed to leave out, phrased as rules ("Do not add X in this version."). On top of that, Claude lists what it would be tempted to add — and recommends build-now or leave-out for each.
Success criteria & final check — the full draft of project.md is shown for approval; only then does it get saved.

One trait, faithful to the philosophy from Part 0: the interview is instructed to be critical, not agreeable. A weak idea doesn't get nodded through — you're told why it's weak.

And what the command doesn't do: it never touches .mini/state.json. Phase state stays in tested TypeScript; project rewrites one markdown page. The finished document isn't saved by a hallucination either — at the end the agent runs mini project --apply, and the contract goes through the same parser and renderer as init.

pycalc gets interviewed

cd pycalc
mini project

Mini first prints the exact prompt it will send to Claude (no magic — you see everything) and asks whether to start the session. What follows is from a real session, condensed; the answers are mine.

Claude opened by reflecting the project back and refusing to assume anything:

"Terminal calculator" can mean three quite different things, and each leads to a different design.

1. How is it used? One-shot (pycalc "2+3*4" → prints 14 and exits), an interactive REPL, or both? This is the most important decision — it determines the whole structure. And careful: for the one-shot case people already use python -c "print(2+3*4)" or bc. What makes pycalc different, so that it has a reason to exist? That's not nitpicking — it's the question the "What I'm building" line has to answer.

2. How much math? Just + - * / ( ) with precedence? Powers, modulo? Functions like sqrt, sin? Variables? Every extra level means a more complex parser, and variables only make sense in a REPL.

3. How will you evaluate expressions? Fair warning: the tempting shortcut is eval(). It's one line, but it's a security hole (it evaluates arbitrary Python) and you learn nothing from it. Alternatives: ast.parse with a node whitelist, or your own parser. What's your goal — to have a tool, or to learn how to write a parser?

My answers: one-shot; versus python -c it's less typing and nicer error messages, versus bc natural syntax and decimals without scale setup. Just the four operators and parentheses. And the goal is to learn — own tokenizer and recursive descent, no eval().
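To make "own tokenizer and recursive descent" concrete, here is a minimal sketch of that shape for the four operators, parentheses and unary minus. It is not pycalc's actual code: the names `CalcError`, `tokenize`, `Parser` and `evaluate` are illustrative, and parsing numbers as `Decimal` is an assumption made for this sketch.

```python
import re
from decimal import Decimal

NUMBER_RE = re.compile(r"[0-9]+(?:\.[0-9]+)?")


class CalcError(Exception):
    """An error tied to a position in the input text."""

    def __init__(self, message, pos):
        super().__init__(message)
        self.pos = pos


def tokenize(text):
    """Return (kind, value, pos) tuples; pos is an index into text."""
    tokens, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
        elif ch in "0123456789":
            match = NUMBER_RE.match(text, pos)
            tokens.append(("num", Decimal(match.group()), pos))
            pos = match.end()
        elif ch in "+-*/()":
            tokens.append(("op", ch, pos))
            pos += 1
        else:
            raise CalcError(f"unexpected character {ch!r}", pos)
    return tokens


class Parser:
    def __init__(self, tokens, end_pos):
        self.tokens, self.index = tokens, 0
        self.end_pos = end_pos  # position reported when input ends too early

    def peek(self):
        if self.index < len(self.tokens):
            return self.tokens[self.index]
        return ("end", None, self.end_pos)

    def take(self):
        token = self.peek()
        self.index += 1
        return token

    def at_op(self, ops):
        kind, value, _ = self.peek()
        return kind == "op" and value in ops

    def expr(self):  # expr := term (("+" | "-") term)*
        value = self.term()
        while self.at_op("+-"):
            _, op, _ = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):  # term := factor (("*" | "/") factor)*
        value = self.factor()
        while self.at_op("*/"):
            _, op, pos = self.take()
            rhs = self.factor()
            if op == "/" and rhs == 0:
                raise CalcError("division by zero", pos)
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):  # factor := "-" factor | number | "(" expr ")"
        kind, value, pos = self.take()
        if kind == "num":
            return value
        if kind == "op" and value == "-":
            return -self.factor()
        if kind == "op" and value == "(":
            inner = self.expr()
            if not self.at_op(")"):
                raise CalcError("expected ')'", self.peek()[2])
            self.take()
            return inner
        what = "end of input" if kind == "end" else f"character {value!r}"
        raise CalcError(f"unexpected {what}", pos)


def evaluate(text):
    parser = Parser(tokenize(text), end_pos=len(text))
    result = parser.expr()
    kind, _, pos = parser.peek()
    if kind != "end":
        raise CalcError(f"unexpected character {text[pos]!r}", pos)
    return result


print(evaluate("2+3*4"))     # 14
print(evaluate("-(1+2)/4"))  # -0.75
```

Each grammar rule is one method, and precedence falls out of which method calls which: `expr` handles the loosest operators and delegates to `term`, which delegates to `factor`.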

Claude then proposed the pipeline (tokenizer → recursive descent parser → evaluation, tokens carrying their position in the input) and pushed back where my framing was lazy:

Number semantics and division. You yourself list "decimals without scale setup" as the advantage over bc. So 1/2 must be 0.5, not 0. I'd propose floats — but floats have the known artifacts (0.1+0.2 → 0.30000000000000004). The alternative is the decimal module (still standard library). For a calculator whose selling point is decimal numbers, I'd seriously consider decimal. Which one?

Good catch — decimal it is. The interview just changed a design decision I would have gotten wrong by default. Two more came out of the same batch: unary minus goes in ("without it the calculator feels broken — but it's a conscious scope extension, so I'm saying it out loud"), and errors get a defined shape — stderr, non-zero exit code, and a message that points at the problem:

error: unexpected character '+' at position 2
2++3
  ^
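The error shape above is easy to produce once every error knows its position. A minimal sketch, assuming a zero-based position and hypothetical helper names `format_error` and `report`:

```python
import sys


def format_error(text: str, pos: int, message: str) -> str:
    """Build the three-line report: message, the input, a caret under pos."""
    caret_line = " " * pos + "^"
    return f"error: {message} at position {pos}\n{text}\n{caret_line}"


def report(text: str, pos: int, message: str) -> int:
    """Print the report to stderr and return a non-zero exit code."""
    print(format_error(text, pos, message), file=sys.stderr)
    return 1


exit_code = report("2++3", 2, "unexpected character '+'")
```

A CLI entry point could pass the returned value to `sys.exit`, which keeps the formatting function itself free of side effects and easy to test.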

Then stage three — the part I've come to value most. First the agreed leave-outs, then the things Claude would be tempted to add:

Power ** and modulo %: tempting, "just one more rule". But ** is right-associative, which is a new class of problem in the parser. Leave out in v1; once recursive descent is clear to you, it's the ideal v2 exercise.

Reading from stdin/pipes (echo "2+2" | pycalc) — useful, but it grows the CLI layer. Leave out.

Colored output, argparse with ten flags — the project needs none of it. One positional argument. Leave out.

Packaging (pyproject.toml, pip install) — needless overhead for a learning script. Leave out.
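Why right-associativity of ** counts as "a new class of problem" shows up as soon as you chain it. The four v1 operators group from the left, while a power chain groups from the right, and the two groupings give different answers. A small sketch, with hypothetical helper names:

```python
from functools import reduce


def power_left(operands):
    """Group from the left: (a ** b) ** c, the way + - * / chain."""
    return reduce(lambda acc, x: acc ** x, operands)


def power_right(operands):
    """Group from the right: a ** (b ** c), the usual meaning of a power chain."""
    result = operands[-1]
    for base in reversed(operands[:-1]):
        result = base ** result
    return result


print(power_left([2, 3, 2]))   # 64
print(power_right([2, 3, 2]))  # 512
print(2 ** 3 ** 2)             # 512, Python itself groups from the right
```

A parser whose rules are all "loop while the next token is an operator" produces the left grouping, so ** would need a rule built differently from the others.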

And the closing move the prompt demands — "is there any question that, answered wrong, would send us down the wrong path?" — earned its place:

We chose decimal, but we never said what 1/3 prints. The default context gives 28 significant digits: 0.3333333333333333333333333333. Technically correct, absurd as calculator output — and the fix, "set your precision", is exactly the scale from bc you wanted to avoid. I propose: compute at default precision, round the output to 10 places and strip trailing zeros (6.0 → 6). Trade-off: the output is no longer the bit-exact intermediate value, but for a calculator readability wins.
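The proposal is small enough to sketch with the standard decimal module. The function name `format_result` is hypothetical, and the sketch assumes the default decimal context (28 significant digits, round-half-even).

```python
from decimal import Decimal


def format_result(value: Decimal, places: int = 10) -> str:
    """Round to `places` decimal places, then strip trailing zeros."""
    quantum = Decimal(1).scaleb(-places)  # Decimal("1E-10") for places=10
    text = format(value.quantize(quantum), "f")
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return "0" if text in ("", "-0") else text


print(0.1 + 0.2)                                      # 0.30000000000000004
print(Decimal(1) / Decimal(3))                        # 0.3333333333333333333333333333
print(format_result(Decimal(1) / Decimal(3)))         # 0.3333333333
print(format_result(Decimal("0.1") + Decimal("0.2"))) # 0.3
print(format_result(Decimal("6.0")))                  # 6
```

One limitation of this sketch: `quantize` raises `InvalidOperation` when the rounded value would need more digits than the context precision allows, so very large results would need separate handling.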

What got saved

After I approved the draft, Claude saved it itself — through the contract, not by editing the file freehand:

mini project --apply <<'EOF'
NAME: pycalc
WHAT: A small command-line calculator: `pycalc "2+3*4"` evaluates one expression and exits. Less typing than `python -c`, nicer error messages, and decimal arithmetic that works without bc's scale setup.
FOR_WHOM: People who want quick arithmetic in the terminal
CONSTRAINTS: Python 3, standard library only
APPROACH:
- One-shot CLI: single positional argument, result on stdout, errors on stderr with a non-zero exit code.
- Pipeline: tokenizer → recursive descent parser → evaluation. No eval(); the goal is to learn how to write a parser.
- Grammar: + - * / with correct precedence, parentheses, unary minus.
- Numbers via decimal (exact decimal arithmetic: 0.1+0.2 = 0.3); output rounded to 10 places with trailing zeros stripped (6.0 → 6).
- Tokens carry their position in the input so errors can point at the exact spot (message + caret line).
- Tests: unittest with a table of cases (expression → expected result or error).
NON_GOALS:
- Do not add an interactive REPL mode in this version.
- Do not add functions (sqrt, sin, …), constants, variables or assignment.
- Do not use eval() — ever, not even as a temporary shortcut.
- Do not add ** or % in this version (right-associativity of ** is a v2 exercise).
- Do not read expressions from stdin/pipes; one positional argument only.
- Do not add packaging (pyproject.toml, pip install) — run as a plain script.
- Do not add colors or extra CLI flags.
SUCCESS:
- `pycalc "2+3*4"` returns 14; precedence and parentheses verified by the unittest case table.
- `0.1+0.2` → 0.3, `1/2` → 0.5, `-5+3` → -2.
- Every invalid input (2++3, (2+3, abc, empty, division by zero) exits non-zero with a stderr message showing position and caret — never a traceback.
- Runs on plain Python 3 with zero dependencies outside the standard library.
EOF

[ok] Updated .mini/project.md

That contract is also the command's second mode: --apply reads the contract from stdin, starts no Claude session, and just writes the file. Handy when you want to edit project.md from a script or by hand without a session.

One thing to watch: --apply does a full replace. Whatever you omit from the contract disappears from the file. The session prompt instructs the agent to keep the existing NAME / FOR_WHOM / CONSTRAINTS — but when you write the contract by hand, that's on you.
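The full-replace behavior is easy to picture with a toy parser for the contract format shown above. This is a sketch only, not mini's parser (which is TypeScript and not shown here); `parse_contract` and its rules are assumptions based on the example contract.

```python
import re

KEY_LINE = re.compile(r"^([A-Z_]+):\s*(.*)$")


def parse_contract(text):
    """Parse 'KEY: value' lines and 'KEY:' headers followed by '- item' lines."""
    fields = {}
    current_list = None
    for raw in text.splitlines():
        line = raw.strip()
        match = KEY_LINE.match(line)
        if match:
            key, value = match.groups()
            if value:
                fields[key] = value
                current_list = None
            else:
                current_list = fields[key] = []
        elif line.startswith("- ") and current_list is not None:
            current_list.append(line[2:])
    return fields


before = parse_contract("""\
NAME: pycalc
CONSTRAINTS: Python 3, standard library only
NON_GOALS:
- Do not use eval().
""")

# A hand-written contract that forgets CONSTRAINTS.
after = parse_contract("""\
NAME: pycalc
NON_GOALS:
- Do not use eval().
- Do not add colors or extra CLI flags.
""")

# Full replace: only what the new contract contains survives.
print(sorted(set(before) - set(after)))  # ['CONSTRAINTS']
```

Nothing merges the old and new field sets, which is why an omitted section is simply gone after the write.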

Tokenomics, as always

No new file appeared — project.md is still one page, it just steers now. Compare the before and after: init gave us "A small command-line calculator"; the interview turned that into a sharpened one-liner, six approach points, seven non-goals and four testable success criteria. The non-goals aren't decoration: mini keeps them in front of every later step, so when an agent in phase 7 feels like "improving" the calculator with colored output, the project has it in writing that no, it won't.

And the cost side: the whole thing is one interactive session, and you pay for it once. The result is a page that saves tokens afterwards — newer versions of mini send warm sessions only a reference ("read .mini/project.md if you don't already have it in context") instead of inlining it again and again.

Is the interview worth it for every project? Honestly, no — for a weekend script, init alone is fine. It pays off the moment a project is big enough that an agent will work on it across many sessions without you re-explaining the vision each time.

Next time

project.md now says what, for whom, how, what not, and when it's done. But the interview also left something behind: ideas like the ** operator — rejected for v1, too good to lose. A backlog for things that aren't phases yet is exactly what mini todo is for. Capturing ideas — that's Part 3.

mini is open source: npm install -g mini-orchestrator, then mini install-commands in your project. Source and docs on GitHub.

Stop Feeding Your AI Specs. Make It Interrogate You Instead

Stanislav Kremeň — Tue, 16 Jun 2026 11:00:00 +0000

Last week I wrote about the workflow: stop trying to write the perfect plan yourself, and let the AI build it with you through a conversation. A few people asked the obvious follow-up — okay, but what do I actually type?

Fair question. A workflow is useless if you don't know the words that trigger it. So here are the exact prompts I use. They work in Claude Code's Plan Mode, but nothing here is tool-specific — paste them into any agent before it writes a line of code.

The problem with "just describe your idea"

If you open a chat and say "build me an app that does X," the agent does what agents do: it fills every gap with an assumption and starts coding. You get a plan you never agreed to.

The fix is to change the agent's job before it starts. Don't ask it to build. Ask it to interview you, propose a structure, and argue with itself. Here's how.

The master prompt

This is the one I paste first, on an empty project, before anything else:

I have an idea for an app and I want to plan it with you before any code is written.

Do NOT write code yet. Your job right now is to help me produce a short, clear plan.

Here is the rough idea:
[describe your idea in 2-4 sentences — what it does and who it's for]

Before you propose anything, ask me the questions you need to remove your own
assumptions: about the users, the core workflow, the data, and the screens.
Ask them in small batches so I can actually answer.

Once you have enough, propose a short plan with:
- the main user and their main job
- 3-5 core flows
- the key data objects
- the main screens
- the unhappy paths that could corrupt data, leak permissions, or cost money
- a "not building yet" list of things we are deliberately leaving out

Keep it short and rough. We will refine it together.

The important parts: "do not write code yet" stops the rush. "ask me the questions you need to remove your own assumptions" turns the agent from a guesser into an interviewer. And the "not building yet" list is what keeps it from inventing scope later.

The smaller prompts for refining

Once it gives you a first draft, you steer with short follow-ups. These are the four I reach for most.

Make it push back instead of agreeing:

For each major decision in this plan, give me the pros, the cons, and at least
one alternative. If something is a bad idea, say so and tell me why. Do not just
agree with me.

Pressure-test a single choice:

You suggested [X]. Why is that the right call here? What would you lose by doing
the simplest possible version instead? Give me the trade-off, not reassurance.

Lock in the boundaries:

Turn the "not building yet" list into clear non-goals. Phrase each one as a rule
I can drop into my project instructions so you don't reintroduce it later.

Confirm you're done planning:

Are there any questions left that, if I answered them wrong, would send us down
the wrong path? If yes, ask them now. If no, summarize the final plan in one
short block so I can approve it.

That last one is how I know it's time to stop planning and start building — instead of guessing, I let the agent tell me whether anything important is still unanswered.

Why this works

None of these prompts are clever. They just reassign the agent's role at each stage: interviewer first, critic second, scope-keeper third, and only then a builder. The frustration in my old workflow came from skipping straight to "builder" and letting it improvise everything before it.

Try this

Next time you start something new, don't describe the app. Paste the master prompt, answer the questions honestly, and force it to argue with you before you approve anything.

You're not writing the plan. You're running the conversation that produces it — and that's a much easier job.

The Agent Reviewed Its Own Code and Passed Itself. It Was Wrong.

Stanislav Kremeň — Mon, 15 Jun 2026 17:00:00 +0000

I'm a self-taught solo dev. I vibe-code every day, and for a long time the same
quiet worry followed me around: the agent hands me code, it looks clean, the tests
pass — and I have no idea what I just let into my project.

It took me a while to see why that worry never went away. Most developers learn to
review code from someone — a senior leaning over their shoulder saying "this'll
bite you in a month." As a self-taught solo dev, I never had that. I learned to
write code. Nobody taught me what to check.

So I did the only thing I could: I asked. I put the question to people with more
miles on the clock than me, and the answers reshaped how I think about verifying AI
code. This is what they taught me — and the experiment it led to, which went wrong
in the most instructive way possible.

"Tests pass" was never the reassurance I treated it as

The first thing that landed: an agent tests what it assumed, not what reality
throws at it. The code runs fine on the happy path, the tests are green, and then
someone feeds it an empty string, a malformed page, an unexpected null — and it
falls apart. The agent skips the defensive thinking that only comes from actually
debugging things in production.

That reframed the whole problem. Green tests don't mean the code is correct. They
mean the code does what the agent thought it should do. Those are very different
claims, and the gap between them is where I kept getting burned.
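A tiny illustration of that gap, using a made-up function rather than anything from a real project. The happy-path check is green, and the same function still falls over on exactly the inputs listed above.

```python
def average(values):
    """Hypothetical agent-written helper: fine on the inputs its author imagined."""
    return sum(values) / len(values)


# The test the author would write: passes, so everything looks done.
assert average([2, 4, 6]) == 4

# The inputs reality throws at it.
findings = []
for candidate in ([], None, ["3", "4"]):
    try:
        outcome = repr(average(candidate))
    except Exception as exc:
        outcome = f"crashes with {type(exc).__name__}"
    findings.append((repr(candidate), outcome))
    print(f"average({candidate!r}) -> {outcome}")
```

The green assertion and the three crashes describe the same code; only the second half says anything about what happens outside the author's assumptions.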

You can't verify what you don't understand

The second thing was harder to swallow. The more code I let the agent write, the
less I understood what came out — and the worse I got at checking it. Verification
and comprehension are the same muscle. If I don't grasp what the code does, reading
every line is just theater.

One reviewer put the fix simply: before judging the lines, judge the shape. Have
the agent produce a systems-level overview — what calls what, where the boundaries
are — and verify that the structure makes sense before you ever look at
implementation. I'd been generating these as diagrams and still getting lost; the
real unlock was keeping the map as text, so I could hand it to a second model and
ask whether it actually matched the code.

Which points at the deepest problem of all.

An agent can't catch its own blind spots

Here's the thing nobody says out loud: when the same agent writes the code and
writes the tests, it's not verifying anything. It's confirming its own assumptions.
The tests pass because they test the same things the agent already believed. The
blind spot in the code and the blind spot in the test are the same blind spot.

The reviewers' answer was consistent: the check has to come from outside the
author. A fresh model, a separate pass, something that doesn't share the original's
assumptions. Don't let the thing that wrote the code be the only thing that grades it.

So I built exactly that. A red-team step for my orchestrator — a command that takes
the work the agent just finished and forces it into a different role: not the proud
author, but an independent reviewer whose job is to break it.

What I actually told the agent to do

The prompt is the whole trick, and it's blunt on purpose:

Switch into the role of an independent reviewer who did NOT write this code. Your
job is not to confirm it works — it's to find what breaks it. Assume there's a bug.

Then four concrete passes, straight from what the seniors drilled into me:

Unhappy path. Empty, malformed, unexpected input. Null, timeout, race. Show me a specific input that knocks it over.
Silent assumptions. Where does it assume a type, a shape, a state without checking? Where can failures cascade quietly instead of failing loud?
Premature complexity. Is there a layer solving a problem that doesn't exist yet? AI loves architecture that looks clean while serving no one.
Tests. If they exist — do they exercise failure, or just the happy path? What is NOT covered?

And one rule that matters more than the rest: don't tell me "looks good." If you
genuinely find nothing, list specifically what you checked and how. A confident
"looks solid" is the exact thing I'm trying to escape.

The part where it went wrong

Here's where it gets good, and a little absurd.

The agent built the red-team command. Then I pointed the command at the code that
had just produced it. It immediately tore into its own work — listed what it had
done wrong, what it had to fix. The agent admitted the problems and fixed them.

Win, right? Self-correcting AI. Except I ran it a second time.

The second pass found more. Things the first "fix everything" round had missed —
because the agent, even while critiquing itself, was still working from the same
assumptions that put the bugs there in the first place. One adversarial pass didn't
purge the blind spots. It just surfaced the ones the agent could already see.

That's the lesson, and it's sharper than anything I could have written on purpose:
giving an agent a tool to criticize itself doesn't remove its blind spots — it
only reveals the ones it was already capable of seeing. Verification isn't a step
you run once and tick off. The author grading its own work, even adversarially, is
still the author.

Why the step is optional, not a rule

I almost made adversarial review a global rule — run it on everything, always. I'm
glad I didn't.

All of this costs tokens. More scrutiny means more output, more passes, more model
time — and that runs directly against the thing I care about, which is keeping the
context I send lean. Verification has a price, and the price is real. If I red-teamed
every trivial change, I'd be drowning in self-reflection over things that don't
warrant it.

So the step is opt-in. You spend the scrutiny where the stakes justify it — input
handling, parsing, anything touching the outside world — and you skip it where it's
noise. That trade-off, honestly, is the part I'm least done thinking about. Verifying
costs context, and context isn't free. I haven't solved that tension. I've just
decided to pay it on purpose, where it counts, instead of everywhere by reflex.

What I'd tell the version of me from six months ago

The fix was never about writing better code, or even better prompts. It was about
accepting that the agent and I have different jobs. It writes. I decide what's true.
And deciding what's true means refusing to take the agent's word for it — especially
when the agent is grading itself.

I still don't have a senior leaning over my shoulder. But I've learned to build the
shoulder out of separate passes, fresh models, and a stubborn refusal to trust a
green checkmark. If you're self-taught and that quiet worry follows you around too —
that's not a gap in your skill. It's the start of the right instinct.

Building with mini, Part 1/9: Initializing a Project with init

Stanislav Kremeň — Wed, 10 Jun 2026 11:00:00 +0000

In Part 0 I laid out the philosophy behind mini: keep state thin, send Claude only the essentials, and let tested TypeScript — not the model — own the project state. Now we start building.

Across this series we'll work on one small running example: pycalc, a command-line calculator in Python. It's deliberately a boring domain — that way the spotlight stays on mini, not on the math. This part is about the very first command: mini init.

What `init` actually does

mini init creates a new project. It asks four questions and writes two files into a .mini/ directory. That's it — no scaffolding of your source code, no opinions about your stack. It only sets up mini's state.

mkdir pycalc && cd pycalc
mini init

The four questions are: name, what you're building, for whom, and the constraints. For our calculator:

Name: pycalc
What: A small command-line calculator
For whom: People who want quick arithmetic in the terminal
Constraints: Python 3, standard library only

Prefer it non-interactive? Every answer has a flag:

mini init --apply \
  --name "pycalc" \
  --what "A small command-line calculator" \
  --for-whom "People who want quick arithmetic in the terminal" \
  --constraints "Python 3, standard library only"

The two files it writes

project.md — one page, by hand if you want

This is the human-readable heart of the project: what you're building, for whom, and the constraints. One page, on purpose. It's the file that gets sent to Claude as context, so every extra paragraph is tokens you'll pay for on every phase.

pycalc

A small command-line calculator.

For: People who want quick arithmetic in the terminal.

Constraints: Python 3, standard library only.

You can edit project.md by hand anytime — it's just markdown. Keep it tight.

state.json — a lightweight header

The second file is the machine state. Right after init it's nearly empty — no phases yet — but it's the index that everything else hangs off: phase list, statuses, and the model choices per step.

The important design decision: state.json is only a header. When phases arrive later, their detail (steps, reports) won't bloat this file — it lives separately in .mini/phases/phase-{id}.json and loads only when needed. That separation is what keeps mini's context footprint small as the project grows.
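The header/detail split can be sketched as two small loaders. This is an illustration in Python, not mini's TypeScript code; the function names and every JSON field below are made up for the demo, and only the file locations come from the description above.

```python
import json
import tempfile
from pathlib import Path


def load_header(root):
    """Read the small index file; cheap enough to load on every command."""
    path = Path(root) / ".mini" / "state.json"
    return json.loads(path.read_text(encoding="utf-8"))


def load_phase(root, phase_id):
    """Read one phase's detail file, only when that phase is actually needed."""
    path = Path(root) / ".mini" / "phases" / f"phase-{phase_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))


# Demo with invented contents; the real files' fields are not shown here.
with tempfile.TemporaryDirectory() as root:
    phases_dir = Path(root) / ".mini" / "phases"
    phases_dir.mkdir(parents=True)
    (Path(root) / ".mini" / "state.json").write_text(
        json.dumps({"phases": [{"id": 1, "status": "done"},
                               {"id": 2, "status": "planned"}]}),
        encoding="utf-8",
    )
    (phases_dir / "phase-2.json").write_text(
        json.dumps({"id": 2, "steps": ["write parser", "add tests"]}),
        encoding="utf-8",
    )

    header = load_header(root)    # always loaded, stays small
    detail = load_phase(root, 2)  # loaded on demand for one phase
    print([phase["status"] for phase in header["phases"]])
    print(detail["steps"])
```

However many phases accumulate, the header grows by one short entry per phase, while the bulky step and report data stays in files that are opened one at a time.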

Starting inside an existing project

pycalc is greenfield, but init is just as happy in a directory that already has code. When it detects existing source, it offers to run mini audit at the end — a pass that builds a codebase.md overview so later Claude sessions can orient themselves without re-reading your whole src/. We'll cover that path properly in Part 2.

What you have now

After init, your project looks like this:

pycalc/
└── .mini/
    ├── project.md      # one page: what, for whom, constraints
    └── state.json      # lightweight header: phases, statuses, models

No phases, no plans, no code yet — just a clean, minimal foundation. That's the point: mini doesn't ask you to think about the whole roadmap up front. You define what you're building in one page, and decide the next step when you get there.

And deciding the next step is exactly what mini next is for.

Next up: In Part 2 we'll shape pycalc's vision with mini project — approach, non-goals and success criteria.

Want to try it now? mini is free and open source (MIT):

cd your-project
npx mini-orchestrator install-commands

Repo: github.com/czsoftcode/mini-orchestrator

My AI Code Was Fine. My Initial Plan Was a Mess.

Stanislav Kremeň — Tue, 09 Jun 2026 11:00:00 +0000

I kept hitting the same wall. I'd have a "perfect" app idea in my head, get excited, and start building from scratch with no real preparation. Halfway through, I'd get stuck — rewriting half the code and throwing away the other half.

For a long time I blamed the tools. But the problem wasn't the AI's code. It was me. When I didn't give Claude a proper plan, it designed things its own way: a different data model, a different structure than I had in mind. Then I'd spend the rest of the project chasing it and fixing things.

So I asked two communities how they write that first plan. The answers changed how I work. Here's the workflow I landed on.

1. Stop trying to write the perfect plan

The most freeing advice I got: keep the first spec ugly and short. I'd been frozen because I thought the design had to be perfect and cover everything. It doesn't — and it can't. An app isn't finished at the start; it's finished at the end, based on what the user actually needs. Everything changes along the way.

A good starting skeleton is just: user type, the main job, three to five screens, data objects, the ugly failure cases, and what not to build yet.

2. Write down what NOT to build

This was the part that surprised me most. I always thought you only describe what the app should do and how the agent should work. It never occurred to me to write down what it should not do.

That's exactly why the model kept inventing scope I never asked for. Lines like "no payments in v1" or "no social login yet" aren't documentation fluff — they're guardrails. The moment you remove them, the agent rediscovers the idea and spends your afternoon on it. So keep your non-goals in the spec until the thing ships, and move them into your CLAUDE.md so they stick across every prompt.

3. Don't write the plan alone — let the AI build it with you

Instead of forcing a flawless document out of my own head, I now let the plan emerge through a conversation. I describe the rough idea and ask the agent to flesh it out, propose the data model and screens, and ask me the questions it needs before writing anything.

One warning from experience: if you just let two models brainstorm freely, you end up with a huge document full of praise about how brilliant your idea is. Useless. The fix is to give the agent rules on how to push back. Always ask for pros and cons. Always ask for alternatives. Always ask why. If the AI agrees with everything, you're not getting a design partner — you're getting a cheerleader.

4. Use Plan Mode for the empty page

Here's the piece that tied it all together: Claude Code's Plan Mode works even when you have no code at all.

I used to think Plan Mode was only for changes inside an existing project. It's not. Open an empty folder, switch to Plan Mode (Shift+Tab), and describe your idea. Claude won't touch a single file — it explores, proposes a structure, and surfaces the decisions it would otherwise make silently. You review the plan, push back, and only approve it when it matches what's in your head.

That's the step I was missing. The decisions the agent used to invent for me now show up before a single line of code exists.

The workflow, end to end

Open Plan Mode on an empty folder and describe the rough idea.
Let the agent ask questions and propose a short, ugly spec.
Demand pros, cons, and alternatives — make it challenge the idea.
Ask what it would not build yet, and capture those non-goals.
Move the permanent rules and non-goals into CLAUDE.md.
Approve the plan, then build.

I'm not trying to write the perfect plan anymore. I bring the intent — the agent helps me find the plan. That one shift fixed the whole problem.

We're Building the Funnel and Standing Under It

Stanislav Kremeň — Mon, 08 Jun 2026 17:32:06 +0000

The picture says it all. Up top, a row of robots: one hammering away at a typewriter, another painting a landscape, a third spitting images out of a printer. Below them, a conveyor belt carrying it all away. And down at the bottom - wired directly into their heads by a hose - sit the people. Tablets, phones, laptops, eyes bugging out, a thread of drool at the corner of the mouth. Consuming. No pauses, no questions, no blinking.

It's an exaggeration. A caricature. And uncomfortably on point.

Because the question isn't whether the picture is true today. It's how far from it we actually are - and which direction we're drifting.

How we got here

Nobody wakes up one morning and decides to stop thinking. It happens in small, perfectly reasonable steps.

Instead of reading the long article, we have it summarized - who's got the time. Instead of searching and comparing sources, we ask and take the first answer - it sounds confident, after all. Instead of understanding the problem, we have a solution generated - it works, so why dig in.

Each step makes sense on its own. The problem is the sum. Active searching slowly turns into passive intake. "I understand it" becomes "I have it." And between those two sentences there's a chasm.

And then the uncomfortable part: the line separating what a human made from what a machine made gets thinner by the day. An article, a post, an image, a snippet of code, the comment underneath it - who wrote that? More and more often, we can't tell. And worse, we stop asking.

Why developers in particular should care

This isn't abstract philosophy. It has two very concrete dimensions.

The first is personal - skill atrophy. A muscle you don't use gets weaker. Spend five years handing off your debugging, your design, your decisions to a tool, and the ability to do it yourself quietly walks out the door. It won't vanish overnight; it'll vanish in a way you only notice the moment you badly need it - and it's gone. The point isn't to stop using tools. The point is not to lose the ability to tell when a tool is talking nonsense.

The second is systemic - and scarier. Models learn from data. But more and more of the data on the internet is generated by models themselves. A loop forms: AI trained on the output of other AI, not on human work. Researchers call this model collapse - a copy of a copy of a copy, where each generation loses a slice of diversity and quality, much like photographing a photograph. The phenomenon was documented by Shumailov et al. in Nature in 2024¹: when generative models are trained recursively on their own output, the tails of the original data distribution - the rare, unusual cases - disappear first, and the degradation compounds. The human original - that irregular, unpolished, but real thing - is fuel that can't be substituted. And we're supplying less and less of it.

Add to that the fact that we're simultaneously losing the ability to judge quality, and you get an unpleasant combination: machines produce ever worse content and people are ever less able to notice. The funnel tightens from both ends.

A fair caveat, in the spirit of this article: the research isn't unanimous. Later work by Gerstgrasser et al. argues that accumulating real and synthetic data - rather than replacing one with the other - can avoid collapse, and that the most catastrophic predictions assume real data gets deleted entirely, which isn't how the real world works. So treat model collapse as a real risk to manage, not a prophecy. Which is rather the point.
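For readers who prefer to see the two positions side by side, here is a toy sketch. It is not the experiment from either paper: the "model" is just a one-dimensional Gaussian fitted to a handful of points, and the names `fit_and_sample` and `simulate`, the sample size, and the generation count are all assumptions made up for illustration. In "replace" mode each generation trains only on the previous generation's output; in "accumulate" mode the original points stay in the pool.

```python
import random
import statistics


def fit_and_sample(data, n, rng):
    """Fit a Gaussian to `data`, then draw `n` synthetic points from the fit."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def simulate(generations, n, accumulate, seed=0):
    """Return (spread of the real data, spread of the final training pool)."""
    rng = random.Random(seed)
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for human-made data
    pool = list(real)
    for _ in range(generations):
        synthetic = fit_and_sample(pool, n, rng)
        if accumulate:
            pool.extend(synthetic)  # keep everything, real and synthetic
        else:
            pool = synthetic        # train only on the previous generation's output
    return statistics.pstdev(real), statistics.pstdev(pool)


if __name__ == "__main__":
    real_sd, replaced_sd = simulate(200, 10, accumulate=False)
    _, accumulated_sd = simulate(200, 10, accumulate=True)
    print(f"real data spread:       {real_sd:.4f}")
    print(f"replace, 200 rounds:    {replaced_sd:.4g}")
    print(f"accumulate, 200 rounds: {accumulated_sd:.4f}")
```

In the replace run the spread drifts toward zero, because every refit on a small sample tends to lose a little of the tails and nothing puts them back. In the accumulate run the original points never leave the pool, so the spread cannot shrink to nothing. A Gaussian fitted to ten numbers is obviously not a language model, so read this as an intuition pump for the argument, not as evidence.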

This isn't a manifesto against tools

Before this starts to sound like a sermon from some Luddite who rejects everything invented after the typewriter - it isn't.

These tools are wonderful. I had this very article's structure workshopped and half its phrasing polished in collaboration with a model. It'd be hypocritical to pretend otherwise. The question was never "use them or don't." The question is how.

One distinction helps me: tool versus prosthesis. A tool extends what you can do - makes you faster, lets you reach further, frees your hands for what matters. A prosthesis replaces what you've stopped being able to do. A hammer is a tool. A crutch you've talked a healthy leg into believing it can't walk without is something else.

The same model, the same prompt, can be either one - it depends entirely on what's happening inside your head. "Explain why this solution is failing, so I can spot it myself next time" is a tool. "Give me something that passes so I don't have to think about it" is the first installment on a prosthesis. From the outside, indistinguishable. The difference is all on the inside.

How not to end up hanging under the funnel

There's no heroic resistance here. Just a few habits that keep you in the robot's chair up top instead of sitting you down by the hose below.

Verify. A confident tone isn't proof. Before you adopt anything - especially when it sounds smooth and finished - check it against the source. Five seconds of doubt is what separates you from the role of passive recipient.

Ask smart, don't swallow blind. AI is a phenomenal thinking partner and a lousy replacement for thinking. Use it for questions that move you forward - "what did I miss?", "why isn't this working?", "what's the counterargument?" - not just for answers that spare you the thinking entirely.

Create more than you consume. This is maybe the most important one. Anyone who writes, builds, or designs something original feeds that rare human raw material back into the system. Being a maker instead of a mere channel is almost a political act these days. And it's also the only reliable defense against atrophy: the muscle you use doesn't weaken.

Closing

The picture isn't a prophecy. It's a warning - and the only point of a warning is that it can be avoided.

The robots up top and the people on the hose down below aren't two inevitable categories that history will sort us into. It's a choice. And the nice thing about it is that it doesn't renew once a generation - it renews every single day, in every prompt, in every article you either read or have paraphrased for you, in every thing you either make or just swallow.

The funnel exists. The only question is whether you're standing under it, or operating it.

¹ Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature, 631(8022), 755–759. doi:10.1038/s41586-024-07566-y. Earlier preprint: The Curse of Recursion: Training on Generated Data Makes Models Forget, arXiv:2305.17493. Counterpoint: Gerstgrasser et al. (2024), Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data, arXiv:2404.01413.
