Maxim Saplin

AI Dev: Plan Mode vs. SDD — A Weekend Experiment

Three months ago, I tested Kiro's Spec-Driven Development (SDD) workflow and walked away impressed but frustrated. The AI built 13,000 lines of Rust code with 246 tests... that took 30 minutes to run, checked God-knows-what, left CI/CD broken beyond repair, and produced a codebase I couldn't maintain. Fast-forward to this weekend: I built a complete mobile app using Cursor + Gemini 3 Pro + Flutter—structured, maintainable, and shipped in one evening plus half a day.

The difference? Let's unpack...

What I Have Done

Built a Flutter app targeting Android and macOS (mainly for UI debugging) from scratch -> https://github.com/maxim-saplin/nothingness:

  • It shows the currently playing media, provides media controls (pause, next, etc.), and displays a spectrum analyzer driven by the microphone
  • Used Cursor + Gemini 3 (and some GPT 5.1 and Opus 4.5), mostly Plan and Agent modes
  • Added 6 Cursor rules acting as Guardrails and Guidelines for agents
  • 26 Unit/integration tests
  • Focus on Docs:
    • I didn't save the MDs produced by plan mode
    • Yet I asked the agent to follow a simple discipline: add important tech decisions to the docs/ folder
    • Had a separate Cursor rule for docs
  • Set up and validated that the GitHub MCP server works and that agents can autonomously build CI/CD
  • Working CI/CD with GitHub Actions - build/test on commit, release by request
  • Saturday evening and Sunday (~ 8h effort)
  • Spent ~$50 in tokens

| Model | Tokens | Cost |
| --- | ---: | ---: |
| gemini-3-pro-preview | 42,757,369 | $32.02 |
| gpt-5.1-high | 9,721,834 | $5.79 |
| claude-4.5-opus-high-thinking | 9,065,436 | $8.66 |
| gpt-5.1-codex-high | 276,380 | $0.20 |
| composer-1 | 10,999 | $0.01 |
| Grand Total | 61,832,018 | $46.68 |

Why build this app in the first place? Well, I had been driving an "analog" VW Polo for a week while my EV was in a workshop. I went through serious withdrawal during this time, missing plenty of the "conveniences" my Zeekr provides: watching/listening to YouTube videos, a highway autopilot that let me doom-scroll, a 15-inch OLED infotainment screen always loaded with info (nav, videos).

During the 2nd week of digital withdrawal I felt a sudden relief... It was a 90s vibe: a nice song coming through the car audio, a pixelated LCD screen showing the name of an artist popular back then, no urge to pick up the phone and scroll while waiting at the traffic lights. It reminded me of a video on how gadgets and constant connectivity steal from our lives... Why not create a simple app that darkens the infotainment screen in my EV?

VW Polo Skin in app

SDD Sidenote

After my Kiro experiment in September I moved on to scaling the SDD approach to actual production work.

First, I tried GitHub SpecKit with my Cursor enterprise subscription (I couldn't use Kiro's free tier with a commercial code base) - and I didn't like what I saw. After Kiro it felt bloated: too many artifacts loaded with text, extra steps, etc.

It turned out there were Kiro prompts circulating around the internet. By tweaking them a bit and putting them in the right place I recreated the Kiro experience in Cursor - check out this gist for details.

Over that week I successfully shipped 4 features in a Python/Dart codebase - merged and rolled to prod. All of that while multi-tasking, only occasionally switching over to check results or untangle roadblocks. Still, I had mixed feelings: I was losing my grip on the codebase, lost in a flux.

Some of the lessons learned:

  • Feasibility Checks are Mandatory: Models often propose impossible or broken solutions (e.g., bad data flows, unworkable stacks). Always verify feasibility before implementation to avoid wasting days.
  • Aggressively Prune "Bloat": AI tends to over-engineer (excessive env vars, extra containers, verbose docs). Reducing scope before code generation saves massive cleanup time later.
  • Read Specs: Bugs caught during spec review are far cheaper than bugs caught in implementation. Poor doc review compounds AI-generated issues.
  • The "Shallow" Trap: AI allows you to avoid deep diving into tech, but this backfires during debugging. You are often faster if you understand requirements and the underlying tech/codebase rather than blindly trusting the agent.
  • Avoid "Time Sinks": Be ruthless about abandoning low-value features (e.g., "Geo in Analytics," complex filters) that the AI suggests but struggles to implement cleanly.

This Time I Felt in Control

In October the Cursor team introduced their response to the ever-growing demand for "think before you do" approaches - Plan Mode. Since then I have mostly used this mode, rarely reverting to SDD. And I never kept the produced plans/designs (unlike the specs produced by SDD). I saw Plan Mode as a more structured way to spend tokens and hold an "alignment" ceremony with an agent around a "transaction" - something fairly small: a task, a deliverable... Part of this transaction could be a doc put into a dedicated place, to keep traces of important decisions for later use.

While working on nothingness it felt natural to plan the implementation, argue over certain decisions, decide on a documentation rule, document, add a Cursor rule about creating rules, create rules, design the testing framework, expand test coverage... The experience was quite different - I felt complete control and confidence in what I was doing. Even if there were bugs or deficiencies, I had no doubt they would be easily fixable.

One could say I vibe coded an app over the weekend - I would argue I exercised a disciplined approach and produced maintainable code that can be built upon. And indeed over the next day I did quite a lot of refactoring and added multiple features.

The "Plan Mode" wasn't just about generating a to-do list; it acted as an alignment ceremony. It was a deliberate pause—spending tokens to "think" and clarify intent before rushing into implementation. In the same dialog I could switch between Plan and Agent modes multiple times, periodically compacting the conversation via /summarise command. When the thread was done - feature delivered, task done - I could nudge the agent to check test coverage (sometimes new tests were added) or if a doc is worth adding.

What about the Structured Approach?

While most of the work flowed naturally and I did not struggle with heavy ceremonies (think BMad or SpecKit), there was software engineering common sense and attention to structure - of both the solution and the way the work was executed:

  • The first prompt was a feasibility check of what felt like the most unclear/challenging part:
    Feasibility check

  • After some discussion with the agent I outlined the requirements and worked on the plan it proposed:
    Requirements planning

  • In further dialogs I asked the model to define a documentation discipline, and whenever I decided it was worth pausing to leave traces in the docs I prompted the model to make that detour. Those docs were later used by agents when ramping up on new features.

  • When the minimal version of the app was running, the agent and I agreed on a general approach to testing, documented it, added the initial coverage, and later extended and modified the test harness in accordance with the testing discipline that emerged early on. It's a best practice that protects against regressions and is also a strong signal to the AI agent about how well it is doing with newer features (see the test sketch after this list).

  • While reviewing the produced code I challenged the solution breakdown on several occasions (e.g. why downstream code must be aware of upstream scaling details) - that led to a few refactors, test updates and new docs.

  • For CI/CD there was a deliberate step of validating how the MCP tooling works and whether the agent could engage with it. Along the way, a number of Cursor rules popped up capturing peculiarities of interacting with GitHub Actions and of sandboxed CLI execution when dealing with Flutter commands.
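
For illustration, here is the kind of cheap, fast, pure-logic unit test this discipline favours, sketched in Dart with flutter_test. The bucketize function and the test cases are hypothetical stand-ins, not code from the repo:

```dart
import 'package:flutter_test/flutter_test.dart';

// Hypothetical pure function of the kind worth covering: it downsamples raw
// spectrum magnitudes into a fixed number of display buckets by averaging.
List<double> bucketize(List<double> magnitudes, int buckets) {
  final result = List<double>.filled(buckets, 0.0);
  final chunk = magnitudes.length / buckets;
  for (var i = 0; i < buckets; i++) {
    final start = (i * chunk).floor();
    final end = ((i + 1) * chunk).floor();
    var sum = 0.0;
    for (var j = start; j < end; j++) {
      sum += magnitudes[j];
    }
    result[i] = end > start ? sum / (end - start) : 0.0;
  }
  return result;
}

void main() {
  test('bucketize averages magnitudes into fixed-size buckets', () {
    expect(bucketize([1.0, 1.0, 3.0, 3.0], 2), [1.0, 3.0]);
  });

  test('bucketize returns the requested bucket count even with fewer samples', () {
    expect(bucketize([2.0], 4).length, 4);
  });
}
```

Tests like this run in milliseconds, which is exactly what keeps the suite useful as a feedback signal for the agent rather than a 30-minute chore.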

The MCP Stutter

I decided to let the agent set up GitHub Actions CI/CD autonomously. To do that I needed the GitHub MCP server working properly, which led to a few hours of "setup tax" worth mentioning:

  • The Auth Trap: My Personal Access Token had expired; I wasted time browsing around and reconfiguring. Classic.
  • The Tool Bundle Limits: GitHub's MCP server recently changed how it bundles tools. The default configuration exposed a limited set of tools (about 20), missing the critical Actions-related tools I needed. The agent initially couldn't "see" the CI/CD failures because it literally didn't have the tools in its context (a config sketch follows this list).
  • Validating MCP tooling: I explicitly probed the agent for MCP connectivity, which helped a lot, yet it didn't completely fix the feedback loop (see next point).

  • Troubleshooting YAML workflows: Recently I've noticed that LLMs struggle with YAML formatting. For an hour the agent struggled to get CI/CD running due to a YAML syntax error - it pushed the broken file, checked the CI/CD job status on the server, saw that it had errored, and then proceeded to check the job log, which was empty. It turns out that for workflow syntax errors the cause has to be read from the dedicated 'annotations' of the workflow run tool call - this rule handles the GH Actions feedback loop.
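
To illustrate the toolset point above: the fix amounted to telling the GitHub MCP server which tool bundles to expose. Below is a rough sketch of what a .cursor/mcp.json entry could look like for the Docker-hosted github-mcp-server with the Actions toolset enabled. The command, env vars and toolset names are assumptions based on the server's docs, not necessarily the exact config from this project - double-check against the current README:

```json
{
  "mcpServers": {
    "github": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "GITHUB_PERSONAL_ACCESS_TOKEN",
        "-e", "GITHUB_TOOLSETS",
        "ghcr.io/github/github-mcp-server"
      ],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "<a fresh PAT, not an expired one>",
        "GITHUB_TOOLSETS": "repos,pull_requests,actions"
      }
    }
  }
}
```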

Once I fixed the tool configuration, the payoff was massive: a green and easily maintainable CI/CD pipeline.
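
For reference, a minimal Flutter workflow in this spirit could look like the sketch below. It is not the actual file from the repo; the action versions and the community flutter-action are assumptions:

```yaml
# .github/workflows/ci.yml - sketch of a build/test-on-commit pipeline
name: CI

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Community action that installs the Flutter SDK on the runner
      - uses: subosito/flutter-action@v2
        with:
          channel: stable
      - run: flutter pub get
      - run: flutter analyze
      - run: flutter test
```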

Tips and Tricks

If you want to replicate this "Plan Mode" flow, here are the non-obvious lessons:

  • Treat Plans as Disposable: Unlike Kiro or strict SDD, I didn't treat the generated "Plan" as a sacred artifact to be committed to the repo. It's a transient thought process. The result of the plan (code, specific docs) is what matters.
  • Know the Task: make sure you are confident in what you are building; quite often we don't actually realise what we're building (feasibility, consistency, the "why").
  • Choose a Familiar Tech Stack: it's easier to spot issues when skimming through generated code and docs.
  • Rules as Guardrails: I added 6 specific Cursor rules (.cursor/rules). One was specifically for documentation: "If you change logic, you must update the docs/ folder." This forced the agent to maintain a "Technical Decisions" log alongside the code, which saved me from the "black box" problem later (a sketch of such a rule follows this list).
  • Use /summarize Ruthlessly: Long context windows are great, but models get "dumb" and expensive as the chat grows (especially past 20-30k tokens). I frequently used the /summarize command to compress the history. It keeps the agent sharp and the costs down.
  • Weekend Models: Anecdotally, gemini-3-pro-preview performed significantly better on Saturday/Sunday than during the week. Perhaps less traffic?
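
As an example, the documentation guardrail above could be expressed as a rule file such as .cursor/rules/docs.mdc (hypothetical name). This is a sketch, not one of the six rules from the repo, and the frontmatter fields may vary by Cursor version:

```markdown
---
description: Keep docs/ in sync with code changes
alwaysApply: true
---

- If you change application logic, update the relevant page under docs/.
- Record significant technical decisions (new dependencies, architecture
  changes, platform-specific workarounds) as short notes in docs/.
- Keep docs concise: describe intent and trade-offs, do not paste code.
```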

Model and Harness Progress

I attribute my satisfaction with the results to the significant progress models have made over the past 3 months: they are more reliable in agentic settings (multi-turn dialogs with extensive tool use), and it feels like the recent GPT 5+, Claude 4.5 and Gemini 3 can be relied upon to produce more substantial code and docs - no more shallow verbosity or pointless unit tests.

The same goes for tooling: AI IDE assistants like Cursor do a great job of context engineering - providing models with efficient tools and environments, feeding in relevant info and establishing effective feedback loops.

Disclaimer: When to Use What

This experiment convinced me that for greenfield projects, prototypes, or "Solopreneur" work, this Plan Mode + Guardrails approach is superior to heavy SDD. It's agile, keeps you in the driver's seat, and maintains momentum.

However, SDD still has its place. If I were tackling a massive legacy enterprise codebase, or working in a large team where "hidden knowledge" is the enemy, I would likely revert to a stricter Spec-Driven approach (like SpecKit or custom workflows). There, the overhead of generating strict artifacts pays off in alignment and safety.

But for building a bespoke infotainment system for my car in a single weekend? AI coding with discipline is the future.
