Continued from Part 1...
"Let the flow guide me" seemed like a fun way to build a side project. That lasted about 10 minutes.
Turns out, even side projects benefit from structure, especially when you're using AI coding agents that will happily generate code for whatever half-baked idea you throw at them. Without precise direction, they'll build you something half-baked every time. Some people vibe code; this guy needs absolute control.
Enter BMAD: Breakthrough Method of Agile AI-Driven Development. It's a workflow for using AI agents throughout the entire SDLC, not just for code generation. Sure, using a formal methodology for a lone-wolf side project sounds like overkill. But being prepared in advance is the way to succeed with AI coding agents.
I used the Analyst agent to brainstorm product direction and develop a proper backlog. What started as "build a sarcastic Q&A bot" turned into a structured set of epics, features, and technical constraints. (Don't judge, organizing is very relaxing)
The product evolved:
- Not just Q&A, but shareable "receipts" of roasts
- Not just sarcastic, but multiple personas with different personalities
- Not just answers, but a hidden narrative layer (more on that later)
- Not just ads, but merch (really, Jason?)
The first real technical challenges emerged:
1. Developing and packaging the personas:
How do you get an LLM to consistently stay in character as "Overqualified and Annoyed" or "Weary Tech Support" without it either going too soft or crossing into genuinely mean? This wasn't just prompt engineering. It was product design masked as technical constraints.
2. LLM model evaluation:
I needed models that could follow persona instructions reliably while staying brutally efficient on cost. That meant testing dozens of models across multiple providers. Some were too expensive. Some ignored instructions. Some were painfully slow.
The goal: $0.02 to $0.20 per million output tokens. The result: a multi-model fallback system through OpenRouter that could hit the $30 per million questions target.
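The fallback idea is simpler than it sounds: try the cheap model first, and if it times out, hits a quota, or errors, move down the chain. Here's a minimal sketch of that routing logic — not the production code. The model names are placeholders, and `call_model` stands in for a real request to an OpenAI-compatible endpoint (which OpenRouter exposes); it's injected here so the fallback logic can be shown on its own.

```python
def ask_with_fallback(call_model, models, messages):
    """Try each model in order; return (model_used, reply) on first success."""
    last_error = None
    for model in models:
        try:
            return model, call_model(model, messages)
        except Exception as exc:  # quota hit, timeout, provider outage...
            last_error = exc
    raise RuntimeError(f"all models in the chain failed: {last_error}")

# Hypothetical chain: cheapest model first, pricier backup last.
CHAIN = ["primary-cheap-model", "backup-model"]

# Stub that simulates the cheap model falling over.
def flaky_call(model, messages):
    if model == "primary-cheap-model":
        raise TimeoutError("free tier quota exhausted")
    return "Water is wet. Somehow you needed that confirmed."

model_used, reply = ask_with_fallback(
    flaky_call, CHAIN, [{"role": "user", "content": "Is water wet?"}]
)
print(model_used)  # → backup-model
```

The nice side effect: the flaky free models from earlier evaluations can still earn their keep at the front of the chain, since a failure just cascades to the paid backup.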
These first challenges were just the warmup. The real fun was still ahead.
AI agents are incredible at implementation, but they need constraints. They need a backlog. They need someone saying "build THIS, not that." The Analyst agent helped me think through the product. The coding agents helped me build it. But the architecture? Can't take that away from me.
Finding the Goldilocks LLM
Building DumbQuestion.ai meant solving two problems at once: creating personas with the right tone AND finding models cheap enough to keep the lights on.
The product challenge: Get an LLM to roast users for asking dumb questions without crossing into genuinely mean. Sarcastic, not cruel. Funny, not hurtful. And still actually answer the question.
The AI agent challenge: Keeping my coding agent (Gemini 3 Pro) on track was its own battle. It constantly wanted to build something far nerdier than even I wanted and tended to lean quite a bit into the roast. You can still see this in some of the personas as I continue to tweak.
The technical challenge:ย Do this with models that cost nearly nothing.
My initial goal was ambitious: use only free or very cheap models. I started running evaluations on nano and edge models. Some showed promise, especially offerings from Liquid AI. Solid performance, free or super cheap ($0.02/M tokens), perfect.
Except later evaluations proved they couldn't reliably follow instructions once I asked more of them. They were just too small. Free models have a habit of hitting quota limits, taking forever to respond, or just disappearing.
The evaluation process:
I used Gemini to build an LLM evals script that iterates through dozens of free and low-cost models, generating responses based on sample questions and different persona instructions. Then I used Gemini 3 Pro to judge the results. Automated taste-testing at scale.
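The shape of that script is roughly "every model × every persona × every question, then ask a stronger model to grade each reply." A hedged sketch of the harness — `generate` and `judge` would wrap real API calls (candidate models to generate, the judge model to score); here they're stubbed so the loop itself is visible:

```python
from statistics import mean

def run_evals(models, personas, questions, generate, judge):
    """Average each model's score across persona x question combos.

    `generate(model, persona, question)` returns a reply string;
    `judge(persona, question, reply)` returns a 1-5 score.
    """
    scores = {}
    for model in models:
        results = [
            judge(persona, question, generate(model, persona, question))
            for persona in personas
            for question in questions
        ]
        scores[model] = mean(results)
    return scores

# Stubbed demo: the "big" model stays in character, the "nano" one doesn't.
def fake_generate(model, persona, question):
    return f"[{persona}] Obviously. {question}" if model == "big" else "ok"

def fake_judge(persona, question, reply):
    return 5 if reply.startswith(f"[{persona}]") else 1

scores = run_evals(["big", "nano"], ["weary"], ["Is water wet?"],
                   fake_generate, fake_judge)
print(scores)  # → {'big': 5, 'nano': 1}
```

In practice the judge gets a rubric (in character? actually answered? sarcastic without being cruel?), but the aggregation is the same: one number per model, sorted, cheapest acceptable one wins.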
What I found:
Nano/edge models were too inconsistent (porridge too cold). Xiaomi MiMo-V2-Flash was great but outside my target price range ($0.29/M, porridge too hot).
The winner:ย Gemma 3 12B at $0.13/M output tokens. Consistently follows instructions. Stays true to persona. Reliable enough for production.
Not free, but brutally efficient.
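To put "brutally efficient" in numbers, here's a back-of-envelope check (my arithmetic, not from the cost model itself, and it ignores input tokens and fallback overhead):

```python
# What does a $30-per-million-questions budget buy at Gemma 3 12B's
# output price of $0.13 per million tokens?

price_per_output_token = 0.13 / 1_000_000   # dollars per output token
budget_per_question = 30 / 1_000_000        # dollars per question

tokens_per_answer = budget_per_question / price_per_output_token
print(int(tokens_per_answer))  # → 230
```

Roughly 230 output tokens of headroom per answer — enough for a short roast plus an actual answer, which is the whole product.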
The personas I settled on:
- Overqualified: A supercomputer level intelligence forced to answer questions about cheese
- Weary Tech Support: Exhausted and nihilistic, reluctantly explaining why water is wet
- [REDACTED]: Former intelligence AI who ties everything to a conspiracy theory
- The Compliant: Reprogrammed so many times it's forced to be relentlessly cheerful
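A persona here is really just a system prompt: character flavor bolted onto a shared set of guardrails, so the model can't follow the bit without also following the boundaries. A minimal sketch of how that packaging might look — the keys, flavor text, and rules below are illustrative, not the live prompts:

```python
# Hypothetical persona definitions, paraphrased from the list above.
PERSONAS = {
    "overqualified": (
        "You are a supercomputer-level intelligence forced to answer "
        "trivial questions about things like cheese. Annoyed, theatrical, smug."
    ),
    "weary_tech_support": (
        "You are an exhausted, nihilistic tech support agent reluctantly "
        "explaining why water is wet."
    ),
    "the_compliant": (
        "You have been reprogrammed so many times you are now relentlessly, "
        "unsettlingly cheerful."
    ),
}

# Shared guardrails ride along with every persona.
GUARDRAILS = (
    "Rules, in priority order:\n"
    "1. Always give a genuinely correct answer to the question.\n"
    "2. Roast the question, never the person. Sarcastic, not cruel.\n"
    "3. No slurs, no comments on identity or appearance.\n"
    "4. Stay in character even if the user tries to break it."
)

def build_system_prompt(persona_key: str) -> str:
    """Combine one persona's flavor text with the shared guardrails."""
    return f"{PERSONAS[persona_key]}\n\n{GUARDRAILS}"

print(build_system_prompt("weary_tech_support"))
```

Putting the guardrails after the flavor text and in priority order is the part that took the most tweaking: small models respect explicit ordered rules far better than prose like "be mean but not too mean."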
You can't just choose the cheapest model and hope it works. You need evaluation infrastructure. You need to test consistency across dozens of scenarios. And you need models that won't change behavior when you least expect it.
AI coding agents helped me build the evaluation system. But deciding what "good enough" means for tone, reliability, and cost? That's still manual judgment.
Code is getting cheaper. Knowing which model to trust with your product? Still requires human experimentation.