I shipped CalcBook — an offline natural-language calculator for Android — by pair-programming with Claude end-to-end. No bootcamp, no team. Here's what worked, what didn't, and the prompts that mattered.
I built and shipped a Soulver-style calculator app on the Play Store using only Claude (no Copilot, no Cursor, no human collaborators). 4 months. 100% offline. Free. Here's the honest breakdown of how AI pair-programming actually works at scale — and where it breaks.
The bet
Soulver is the legendary "notepad that computes" on Mac and iOS. For Android, the category didn't exist. I wanted to find out two things:
- Can a solo dev with zero Flutter background ship a polished offline app to the Play Store?
- Can Claude alone carry that project — design, architecture, debugging, polish — without it falling apart?
The answer to both: yes, but not the way Twitter makes it sound.
The app is live: 📱 CalcBook on Google Play.
This post is the unvarnished version of how it got built.
What "built with AI" actually means here
I want to be precise because this phrase has been beaten into a marketing slogan.
What I did: I made every product decision and every UX choice, and I wrote every requirement. I tested every screen. I rejected ~30% of what Claude generated on the first pass. I shipped, marketed, and supported the app.
What Claude did: Wrote ~95% of the actual Dart code. Designed the recognizer pipeline that powers the parser. Caught bugs I would never have found. Refactored mercilessly when I asked. Wrote inline docs and tests.
The split that worked: I owned product. Claude owned implementation. Every time I tried to flip that — letting Claude pick the feature, or me writing code without it — quality dropped immediately.
The stack
| Layer | Choice |
|---|---|
| Framework | Flutter 3.5 |
| State | Riverpod 2 |
| Storage | Hive (local-first, offline) |
| Parser | Custom recognizer pipeline (no petitparser, no regex soup) |
| Cloud sync | Google Drive appDataFolder (optional, opt-in) |
| Analytics | Firebase Analytics + Remote Config |
| Monetization | AdMob + one-time Pro IAP |
| IDE | VS Code |
| AI | Claude (chat + Claude Code CLI) |
Zero servers. Zero backend code. The only network calls in the app are Firebase, AdMob, and the user's own Drive — everything else runs on-device.
The hardest problem: the parser
Soulver's magic isn't the UI. It's that you can type:
coffee $4.50 and muffin $3.00
…and get $7.50 back. A traditional parser errors on coffee. Soulver doesn't.
My first three attempts at the parser used a single monolithic grammar. Each one collapsed under its own weight — every new feature broke the previous one. After throwing ~4,000 lines of grammar code in the trash, Claude proposed the approach that actually worked:
"Don't write one parser. Write N tiny recognizers, each owning one domain, and run them in priority order. Let unrecognized tokens silently fall through."
DateRecognizer (priority 80)
TimeRecognizer (priority 80)
PercentageRecognizer (priority 70)
CurrencyRecognizer (priority 65)
UnitRecognizer (priority 60)
MathRecognizer (priority 50)
FuzzyScanner (priority 10 — last resort)
Each one implements a single method:
abstract class Recognizer {
  int get priority;
  /// Try to recognize a pattern starting at [startIndex] in [tokens].
  /// Returns null if it doesn't apply; the caller then skips to the next position.
  RecognizerMatch? tryRecognize(
    List<Token> tokens,
    int startIndex,
    DocumentScope scope,
  );
}
The pipeline walks the tokens. At each position it asks recognizers (high priority first) to claim a window. Anything not claimed is just skipped. That's the entire trick.
Adding a new domain — say, temperatures — is now isolated work. Write a TemperatureRecognizer, register it in the pipeline, ship. No existing recognizer changes.
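To make the walk concrete, here is a minimal sketch of that loop. It is not CalcBook's actual code: the `Token`, `RecognizerMatch`, and `DocumentScope` shapes, the `Pipeline` class, the whitespace `tokenize` helper, and the "sum every match" step in `totalOf` are all assumptions made for illustration. Only the `Recognizer` interface and the priority value come from the post.

```dart
// Hypothetical minimal types; the post only shows the Recognizer interface.
class Token {
  final String text;
  const Token(this.text);
}

class DocumentScope {}

class RecognizerMatch {
  final double value;
  final int length; // number of tokens claimed, assumed to be >= 1
  const RecognizerMatch(this.value, this.length);
}

abstract class Recognizer {
  int get priority;
  RecognizerMatch? tryRecognize(
      List<Token> tokens, int startIndex, DocumentScope scope);
}

/// Illustrative recognizer: claims a single token shaped like "$4.50".
class CurrencyRecognizer implements Recognizer {
  @override
  int get priority => 65;

  @override
  RecognizerMatch? tryRecognize(
      List<Token> tokens, int startIndex, DocumentScope scope) {
    final text = tokens[startIndex].text;
    if (!text.startsWith(r'$')) return null;
    final amount = double.tryParse(text.substring(1));
    return amount == null ? null : RecognizerMatch(amount, 1);
  }
}

class Pipeline {
  final List<Recognizer> _recognizers;

  /// Copies the list and orders it so higher priorities are asked first.
  Pipeline(List<Recognizer> recognizers)
      : _recognizers = List.of(recognizers)
          ..sort((a, b) => b.priority.compareTo(a.priority));

  /// First non-null match at [index], or null if nobody claims it.
  RecognizerMatch? _firstMatch(
      List<Token> tokens, int index, DocumentScope scope) {
    for (final recognizer in _recognizers) {
      final match = recognizer.tryRecognize(tokens, index, scope);
      if (match != null) return match;
    }
    return null;
  }

  List<RecognizerMatch> run(List<Token> tokens, DocumentScope scope) {
    final matches = <RecognizerMatch>[];
    var i = 0;
    while (i < tokens.length) {
      final match = _firstMatch(tokens, i, scope);
      if (match == null) {
        i++; // unclaimed token ("coffee", "and"): skip it silently
      } else {
        matches.add(match);
        i += match.length; // jump past the claimed window
      }
    }
    return matches;
  }
}

/// Assumption: a naive whitespace tokenizer, enough for this demo.
List<Token> tokenize(String line) =>
    line.split(' ').map((word) => Token(word)).toList();

/// Assumption: the line's result is the sum of every claimed value.
double totalOf(String line) => Pipeline([CurrencyRecognizer()])
    .run(tokenize(line), DocumentScope())
    .fold<double>(0.0, (sum, match) => sum + match.value);

void main() {
  print(totalOf(r'coffee $4.50 and muffin $3.00')); // prints 7.5
}
```

In this sketch, supporting a new domain means writing one more class that implements `Recognizer` and adding it to the list passed to `Pipeline`; the loop and the existing recognizers stay as they are.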
I'd love to take credit for this design. I can't. Claude proposed it. I evaluated it, accepted it, and steered the implementation. That's the actual workflow.
Where Claude got it right
- Architecture. Recognizer pipeline. Riverpod state shape. Hive serialization layout. All Claude.
- Boilerplate. Settings persistence, syntax highlighting, theme switching. I described what I wanted; it produced clean code on the first try.
- Refactors. "Rename this. Extract this. Split this file." Claude does this with surgical precision. Far better than any human reviewer because it has no ego about the existing code.
- Inline documentation. Every comment in the codebase is Claude. Reads like a senior engineer wrote it.
Where Claude got it wrong (and how I caught it)
- Over-engineering. Default Claude wants to add abstractions, base classes, factories, and "future-proofing" everywhere. You have to fight this constantly. My most-used prompt fragment became: "No backward-compat shims. No defensive null checks for impossible cases. Just write the change."
- Confident hallucinations on Flutter APIs. Claude is occasionally certain about a TextField parameter that doesn't exist. Always run the build before trusting "looks good."
- UX taste. Claude can implement a UI spec flawlessly but cannot tell you the spec is bad. Every padding, every color, every animation curve needs human eyes.
- Don't let it choose features. "What should we build next?" produces generic answers. Product decisions stay with you.
Prompts that mattered
The four prompt patterns I used most:
1. "Here's the bug. Here's the file. Don't refactor — just fix the bug."
2. "Add feature X. Don't add error handling for cases that can't happen.
Don't add base classes. Don't introduce abstractions."
3. "Audit this file for dead code, unused params, redundant null checks.
Remove without changing behavior."
4. "Read these three files. Plan a change. Don't write code yet.
Describe the diff you'd make and why."
The fourth one is the multiplier. Forcing Claude into "plan first" mode catches 80% of the over-engineering before any code is written.
What I'd do differently
- Start with the AI workflow on day one. I spent the first two weeks trying to "learn Flutter properly" before letting Claude write. Pointless. The fastest way to learn Flutter was to read Claude's code, ask why, and iterate.
- Write the test cases as English sentences first. "coffee $4.50 and muffin $3.00 → $7.50". Every test was authored by me, every test implementation by Claude.
- Keep one source-of-truth file (CLAUDE.md in the project root) that documents the architecture. Update it whenever an invariant changes. Future Claude sessions read it and get up to speed instantly.
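One way the "English sentences first" idea could look in practice is a list of `input → expected` strings fed through a tiny checker. This is a sketch under assumptions, not the app's test suite: `evaluate` is a hypothetical stand-in that only sums `$` amounts, `failures` is an invented helper, and the second case is made up for illustration. Only the coffee-and-muffin sentence comes from the post.

```dart
/// Hypothetical stand-in for the real evaluator: sums every "$x.xx" word.
double evaluate(String line) {
  var total = 0.0;
  for (final word in line.split(' ')) {
    if (word.startsWith(r'$')) {
      total += double.tryParse(word.substring(1)) ?? 0.0;
    }
  }
  return total;
}

/// Cases written as plain sentences: "<input> → <expected>".
const cases = [
  r'coffee $4.50 and muffin $3.00 → $7.50', // from the post
  r'rent $1200.00 → $1200.00', // invented example
];

/// Returns the sentences that fail; an empty list means all passed.
List<String> failures(List<String> specs, double Function(String) eval) {
  final failed = <String>[];
  for (final spec in specs) {
    final parts = spec.split(' → ');
    // Format the result the way the sentence states it, e.g. "$7.50".
    final actual = '\$${eval(parts[0]).toStringAsFixed(2)}';
    if (parts.length != 2 || actual != parts[1]) failed.add(spec);
  }
  return failed;
}

void main() {
  final failed = failures(cases, evaluate);
  print(failed.isEmpty ? 'all cases pass' : 'failed: $failed');
}
```

The point of the shape is that the human-authored part stays readable as prose, while the checker and the evaluator behind it can be rewritten freely.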
The honest takeaway
AI pair-programming is not "describe the app, get the app." It's a senior engineer who never gets tired, never argues, has perfect recall, and sometimes invents APIs that don't exist. You have to lead it. But if you can describe what you want clearly, you can ship things you couldn't before.
CalcBook is the proof of concept I needed. Whether you use Claude, Cursor, or whatever else, the bottleneck is no longer "can I write the code." It's "do I know what I want?"
Try the app
📱 CalcBook on Google Play — free, fully offline.
I'm shipping more on LinkedIn
Builders using AI to ship: what's your prompt-pattern that changed everything? Drop it in the comments.