Claude Writes the Code, Gemini Runs It: How Two Competing AIs Cut My SaaS Costs by 30x
I build Nokos, an AI-powered note-taking app that auto-captures conversations from 25+ AI tools. Here's the thing — the product itself runs entirely on AI, and picking the wrong model for the wrong job almost killed the economics.
This is the story of how I went from "this will never be profitable" to "break-even at 150 users" by splitting my AI stack between two competing providers.
The Original Architecture (And Why It Was Bleeding Money)
When I first built Nokos, I used Anthropic's Claude for everything:
| Feature | Model | Cost per call |
|---|---|---|
| Metadata generation | Claude Haiku | ~¥0.5 |
| AI Chat | Claude Sonnet | ~¥4.5 |
| Personal AI (RAG) | Claude Sonnet | ~¥4.5 |
| Daily Diary generation | Claude Haiku | ~¥1.5 |
| Session summary | Claude Haiku | ~¥1.0 |
| Natural Language Search | Claude Haiku | ~¥0.5 |
The per-user costs added up fast. With the Plus plan priced at ¥480/month, I was losing money on every paying user. The math was simple: this product could never work.
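To make the bleed concrete, here is the old cost model as a quick script. The per-call prices are from the table above; the monthly usage figures are illustrative assumptions, not measured numbers:

```typescript
// Old per-call costs in yen, from the table above.
const oldCostPerCall: Record<string, number> = {
  metadata: 0.5, // Claude Haiku
  chat: 4.5,     // Claude Sonnet
  rag: 4.5,      // Claude Sonnet
  diary: 1.5,    // Claude Haiku
  summary: 1.0,  // Claude Haiku
  search: 0.5,   // Claude Haiku
};

// Assumed calls per active user per month (hypothetical numbers).
const monthlyUsage: Record<string, number> = {
  metadata: 100, chat: 60, rag: 40, diary: 30, summary: 50, search: 40,
};

// Sum of (calls x price) across every feature.
function monthlyCostPerUser(costPerCall: Record<string, number>): number {
  return Object.keys(monthlyUsage).reduce(
    (total, feature) => total + monthlyUsage[feature] * costPerCall[feature],
    0,
  );
}

const plusPrice = 480; // yen/month, the original Plus plan price
const cost = monthlyCostPerUser(oldCostPerCall);
console.log(`AI cost: ~\u00a5${cost}/user/month vs \u00a5${plusPrice} revenue`);
```

Under these assumptions a single active Plus user costs more in AI calls than the plan brings in, before hosting or anything else.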
The Realization: Not Every AI Call Needs a Genius
Here's what I noticed looking at my AI usage patterns:
Metadata generation — Extract title, tags, category, sentiment from a memo. Claude Sonnet is wildly overqualified for this. It's pattern matching, not reasoning.
Chat responses — Most user questions are "What did I write about X last week?" The answer is in the RAG context. The model just needs to synthesize it, not think deeply.
Diary generation — Take today's memos, write a narrative. This is structured content generation with clear inputs and outputs.
None of these need the most powerful model. They need a fast, cheap, good-enough model.
But one thing absolutely does need the best: writing the code itself.
The Split: Claude for Code, Gemini for Production
I migrated every production AI feature to Google's Gemini Flash in a single day. Here's the new architecture:
Claude Code (Opus) — The Architect
Claude Code writes all the application code — the API routes, the React components, the database migrations, the infrastructure config. This is where reasoning quality matters most. One subtle bug in Row Level Security (RLS) policy logic or Stripe webhook handling could be catastrophic.
Cost: High per-session, but I'm the only "user." Fixed cost, not per-customer.
Gemini Flash — The Production Workhorse
Every AI feature that runs for actual users:
| Feature | Model | Cost per call |
|---|---|---|
| Metadata generation | Gemini Flash | ~¥0.02 |
| AI Chat | Gemini Flash | ~¥0.07 |
| Personal AI (RAG) | Gemini Flash | ~¥0.15 |
| Daily Diary | Gemini Flash | ~¥1.0 |
| Session summary | Gemini Flash | ~¥0.15 |
| Natural Language Search | Gemini Flash | ~¥0.05 |
| Embedding | gemini-embedding-001 | ~¥0.01 |
The Result
The chat/RAG queries — the most expensive calls — dropped from ~¥4.5 to ~¥0.07-0.15. That's where the "30x cheaper" comes from.
Across all plans, per-user costs dropped by 3x to 7x. The Free plan became sustainable. The paid plans became profitable.
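The headline ratios can be sanity-checked directly from the two tables (per-call prices only; per-user savings depend on each user's mix of calls):

```typescript
// Per-call costs in yen, copied from the before/after tables.
const perCall = {
  chat:     { old: 4.5, now: 0.07 }, // Sonnet -> Flash
  rag:      { old: 4.5, now: 0.15 }, // the "30x" headline figure
  metadata: { old: 0.5, now: 0.02 },
  summary:  { old: 1.0, now: 0.15 },
};

// Print how many times cheaper each feature became per call.
for (const [feature, { old, now }] of Object.entries(perCall)) {
  console.log(`${feature}: ${(old / now).toFixed(0)}x cheaper`);
}
```

RAG lands at exactly 30x; plain chat is even steeper per call, while per-user totals compress to the 3x-7x range because cheaper features fell less.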
What Cheap AI Unlocked
The cost reduction didn't just improve margins — it changed what the product could be:
1. Free users get real AI features
When per-user costs were high, giving Free users AI chat was financial suicide. After the migration, I can afford 50 chat turns/month as a taste of the product. Small cost, huge conversion driver.
2. "Kimagure" Diary for Free
Free users get a "whimsical diary" — Nokos (the AI) writes one when it feels like it (triggered on login, a few times per month). At ~¥1/diary, this is viable as a free feature. Magical enough to drive upgrades.
3. Stamina-based pricing instead of feature gating
Instead of locking features behind plans, every plan gets access to everything — just with different quotas. This only works when the per-call cost is low enough that occasional use doesn't destroy your margins.
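A minimal sketch of what stamina-based gating looks like in code. The Free chat quota (50 turns) comes from this article; the other quota numbers are placeholders, not Nokos's real limits:

```typescript
type Plan = "free" | "plus" | "pro";
type Feature = "chat" | "diary" | "search";

// Every plan can use every feature; plans differ only in monthly quota.
const quotas: Record<Plan, Record<Feature, number>> = {
  free: { chat: 50,   diary: 3,  search: 100 },   // 50 chat turns per the article
  plus: { chat: 500,  diary: 31, search: 1000 },  // placeholder numbers
  pro:  { chat: 5000, diary: 31, search: 10000 }, // placeholder numbers
};

// No feature flags, just a quota check before each AI call.
function canUse(plan: Plan, feature: Feature, usedThisMonth: number): boolean {
  return usedThisMonth < quotas[plan][feature];
}

console.log(canUse("free", "chat", 49)); // one turn left
console.log(canUse("free", "chat", 50)); // quota exhausted
```

The design only pencils out because a blown quota estimate costs yen, not hundreds of yen.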
4. Session ingestion at scale
Coding sessions from Claude Code, Codex, Cursor, and others get summarized by Gemini Flash at ~¥0.15/session. At the old Claude Haiku cost (~¥1.0), high-volume session ingestion would have been economically impossible.
Break-Even Math
With the migration complete and pricing adjusted (Plus ¥980/month, Pro ¥2,980/month), the break-even point landed at roughly 150 users.
The assumption: the vast majority of users (~90%+) will be on the Free plan — that's standard for freemium SaaS. The cost reduction made this survivable. At the old per-user costs, I would have needed 700+ users just to break even.
The biggest cost driver to watch? Session ingestion. Claude Code fires "Compacting conversation" 3-5 times per coding session, each counting as a separate ingest. Heavy users can rack up hundreds of sessions per month. This is the line item I monitor most closely.
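A back-of-the-envelope estimate of that line item, using the ~¥0.15/ingest figure from the table above and assuming the midpoint of the 3-5 compactions per session:

```typescript
const costPerIngest = 0.15;      // yen, Gemini Flash session summary (from the table)
const compactionsPerSession = 4; // article says 3-5 per session; midpoint assumption

// Estimated monthly ingestion cost for a user running `sessions`
// coding sessions, each firing `compactionsPerSession` separate ingests.
function monthlyIngestCost(sessions: number): number {
  return sessions * compactionsPerSession * costPerIngest;
}

console.log(monthlyIngestCost(100)); // a heavy user: roughly 60 yen/month
```

Still cheap in absolute terms, but it scales with how hard users code, not with how much they pay, which is exactly why it needs watching.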
Quality: Did It Actually Get Worse?
Honestly? For these use cases, I can't tell the difference.
- Metadata extraction — Gemini Flash correctly identifies category, sentiment, tags, people, locations from memo text. The structured JSON output is reliable.
- Chat/RAG — When the relevant context is already retrieved by vector search, the model just needs to synthesize an answer. Flash does this well.
- Diary generation — The narrative quality is comparable. Users read their own memos reflected back as a story. The memos provide the substance; the model provides the structure.
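Even with reliable JSON mode, I'd still validate the model's output before trusting it downstream. Here is a hedged sketch of such a guard, using the metadata fields named above — the exact schema Nokos uses isn't shown in this article:

```typescript
// Fields the metadata call is expected to return (per the article:
// title, tags, category, sentiment).
interface MemoMetadata {
  title: string;
  tags: string[];
  category: string;
  sentiment: "positive" | "neutral" | "negative";
}

// Parse the model's raw JSON text; return null on any malformed
// payload so the caller can retry or fall back to defaults.
function parseMetadata(raw: string): MemoMetadata | null {
  try {
    const data = JSON.parse(raw);
    if (
      typeof data.title !== "string" ||
      !Array.isArray(data.tags) ||
      !data.tags.every((t: unknown) => typeof t === "string") ||
      typeof data.category !== "string" ||
      !["positive", "neutral", "negative"].includes(data.sentiment)
    ) {
      return null;
    }
    return data as MemoMetadata;
  } catch {
    return null; // not JSON at all
  }
}

const ok = parseMetadata(
  '{"title":"Standup notes","tags":["work"],"category":"meeting","sentiment":"neutral"}',
);
console.log(ok?.title);              // valid payload parses
console.log(parseMetadata("nope"));  // garbage returns null
```

A guard like this is what makes "good-enough model" safe: the rare bad output degrades gracefully instead of corrupting data.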
Where I would notice a difference: complex multi-step reasoning, nuanced code generation, architectural decisions. That's why Claude Code still writes the code.
The Uncomfortable Truth About Model Selection
Most AI features in production apps are glorified text transformation. Extract these fields. Summarize this text. Generate a response given this context.
You don't need the most intelligent model for text transformation. You need:
- Reliable structured output (JSON mode)
- Good instruction following
- Low latency
- Low cost
Gemini Flash delivers all four.
The hard part — the part that actually requires intelligence — is designing the system, writing the prompts, building the data pipeline, and catching the edge cases. That's where Claude Code (Opus) earns its cost, once, during development.
Five Lessons for Your AI Stack
- Audit your AI calls by complexity. Most of them are simpler than you think.
- The model that builds your product and the model that runs it don't need to be the same. Claude builds; Gemini runs. Each does what it's best at.
- Per-call cost determines your product design. At ¥4.5/query, you gate features. At ¥0.07/query, you give them away as samples.
- Session ingestion is a hidden cost bomb. If your product processes AI coding sessions, one "session" can actually be 3-5 API calls due to conversation compacting.
- Run the P&L before you pick a model. I built a spreadsheet with per-feature costs × expected usage × plan distribution. The answer was obvious once I saw the numbers.
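Lesson 5's spreadsheet, sketched as code. The plan prices come from this article; the plan distribution, per-user AI costs, and fixed costs are illustrative assumptions tuned to land near the ~150-user break-even:

```typescript
// Prices are from the article; share, aiCost, and fixedCosts are
// assumptions for illustration.
const plans = {
  free: { price: 0,    share: 0.90, aiCost: 30 },  // yen/user/month
  plus: { price: 980,  share: 0.08, aiCost: 120 },
  pro:  { price: 2980, share: 0.02, aiCost: 350 },
};
const fixedCosts = 14000; // yen/month: hosting, Claude Code, tooling (assumed)

// Revenue minus AI costs across the plan mix, minus fixed costs.
function monthlyProfit(totalUsers: number): number {
  let profit = -fixedCosts;
  for (const { price, share, aiCost } of Object.values(plans)) {
    profit += totalUsers * share * (price - aiCost);
  }
  return profit;
}

// Smallest user count at which the product stops losing money.
function breakEvenUsers(): number {
  let n = 0;
  while (monthlyProfit(n) < 0) n++;
  return n;
}

console.log(breakEvenUsers()); // roughly 150 under these assumptions
```

Swapping in the old per-call costs pushes the same model well past 700 users, which is the whole argument of this post in one function.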
Try It Yourself
Nokos is live — free plan with AI chat, diary generation, and session capture from Claude Code, ChatGPT, Cursor, and more. The entire product is powered by the dual-AI architecture described above.
What's your AI cost optimization story? I'd love to hear how others are handling the model selection tradeoff.
This is article 2 in a series about building Nokos as a solo developer. Article 1: Zero Lines of Code covered how the product was built entirely by AI.
Follow me for more — I'm launching on Product Hunt on March 31st!