The "build an app in 20 minutes" demos are real. The problem is they stop at the exact point engineering begins. So I gave five AI builders the same brief and graded them on production reality, not on the first pretty screen.
The brief: signup and login, per-user private data, a subscription payment, and an AI feature that must not hallucinate. These are the four things every demo skips and every real app needs.
What I actually checked under the hood:
- Auth. Not "is there a login screen" but "is this real authentication and authorization?" Can user A reach user B's rows?
- Data layer. Is the schema sane? Are there constraints, or just a table the model guessed at?
- AI correctness. Is there any grounding, or does the model freely invent facts?
- Security. Input validation, secrets handling, and the one everyone forgets: prompt injection.
- Cost. What does one request cost, and what happens to that number at a thousand users?
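The auth question in that list can be written as a test. Here is a minimal sketch, assuming a hypothetical `notes` table with an `owner_id` column and an in-memory SQLite database standing in for the real one. The table, column and function names are illustrative and do not come from any of the builders.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database holding two users' private notes (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notes ("
        " id INTEGER PRIMARY KEY,"
        " owner_id TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO notes (owner_id, body) VALUES (?, ?)",
        [("user_a", "A's private note"), ("user_b", "B's private note")],
    )
    conn.commit()
    return conn


def get_note_unsafe(conn: sqlite3.Connection, note_id: int):
    """Happy-path lookup by id only: any logged-in user can read any row."""
    return conn.execute(
        "SELECT body FROM notes WHERE id = ?", (note_id,)
    ).fetchone()


def get_note_scoped(conn: sqlite3.Connection, current_user: str, note_id: int):
    """Authorization enforced in the query: the row must belong to the caller."""
    return conn.execute(
        "SELECT body FROM notes WHERE id = ? AND owner_id = ?",
        (note_id, current_user),
    ).fetchone()


conn = make_db()

# Note 2 belongs to user_b; user_a asks for it.
leaked = get_note_unsafe(conn, 2)             # returns user_b's row
blocked = get_note_scoped(conn, "user_a", 2)  # returns None
own = get_note_scoped(conn, "user_a", 1)      # returns user_a's own row
```

The login screen looks identical in both cases. Only a request made as user A for user B's row tells the two apart, which is why this check belongs in the audit and not in the demo.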
The results, demo score versus ships score, with the builder truth a demo hides:
- Lovable 10 / 4. Built on Supabase. The trap: row-level security (RLS) is frequently left permissive or off, so the happy path works while every authenticated user can query every other user's rows. The first thing a builder should audit is the RLS policy. Edits are also not surgical: a small request can regenerate whole files and silently revert your manual fixes.
- Bolt 9 / 4. Runs in WebContainers, a browser-based runtime, not your target server. Native dependencies and some backend behaviour differ from a real deploy, so passing in Bolt is not passing in prod. Token burn is high.
- v0 8 / 3. Outputs idiomatic React, Next.js, Tailwind and shadcn/ui. Genuinely good handoff code, which is exactly the point: it stops at the component boundary. Server actions, data layer and auth are yours to wire.
- Replit 7 / 7. Real Postgres, a secrets manager, a shell, readable logs and one-click deploy. The closest thing to a real environment. Watch the always-on deployment cost and the agent checkpoint usage. Neither is free at scale, and the defaults are not tuned for load.
- Cursor 6 / 8. A VS Code fork operating on your actual repo and git, so every AI diff is reviewable and revertable. Context is manual: it only sees the files you feed it, and rules files matter. No database, hosting or deploy. That stays your stack.
The pattern: demo score and ships score are almost inversely correlated. The tools optimised to impress are not the tools optimised to survive.
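The "almost inversely correlated" claim can be checked against the ten scores listed above. This is a rough sketch using a hand-rolled Pearson coefficient. Five subjective data points are far too few for any statistical weight, so treat the number as a sanity check and nothing more.

```python
from math import sqrt

# (demo score, ships score) copied from the results list.
scores = {
    "Lovable": (10, 4),
    "Bolt": (9, 4),
    "v0": (8, 3),
    "Replit": (7, 7),
    "Cursor": (6, 8),
}


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


demo = [d for d, _ in scores.values()]
ships = [s for _, s in scores.values()]
r = pearson(demo, ships)
print(round(r, 2))  # prints -0.8
```

A coefficient near -0.8 matches the pattern: in this small sample, the higher the demo score, the lower the ships score tends to be.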
The part that matters most: stop the model inventing facts
Wrong: let the LLM decide the answer and hope the prompt holds.
Right: compute the answer deterministically in your backend, then let the LLM only phrase it.
On one product I shipped, the backend calculates the real result and the model is reduced to a narrator. Because the model never produces the core output, it cannot hallucinate it. No builder gives you this for free. It is an architecture decision, and architecture is the thing no 20-minute demo makes for you.
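Here is a minimal sketch of that narrator pattern, not the shipped product's code. It assumes a hypothetical billing total as the "real result" and takes the model as a plain callable (`llm`) so that any client can be plugged in. The two stub models stand in for a real API call. One addition beyond the description above: the draft is checked against the computed value and replaced by a fixed template if the model states any other number.

```python
import re
from typing import Callable


def compute_monthly_total(line_items: list[float]) -> float:
    """Deterministic business logic: the backend, not the model, owns this number."""
    return round(sum(line_items), 2)


def narrate(total: float, llm: Callable[[str], str]) -> str:
    """Ask the model to phrase a precomputed fact, then verify it did not change it."""
    fact = f"{total:.2f}"
    prompt = (
        "Rewrite the following as one friendly sentence. "
        f"Use the amount exactly as given and add no other numbers: total = {fact}"
    )
    draft = llm(prompt)

    # Every number in the draft must be the computed fact, and it must appear.
    numbers = re.findall(r"\d+(?:\.\d+)?", draft)
    if numbers and all(n == fact for n in numbers):
        return draft

    # Fallback: a fixed template, so a bad draft never reaches the user.
    return f"Your total this month is {fact}."


# Stub models (assumptions for illustration; a real one would call an LLM API).
def faithful_llm(prompt: str) -> str:
    return "This month you spent 42.50 in total."


def hallucinating_llm(prompt: str) -> str:
    return "This month you spent 99.99 in total."


total = compute_monthly_total([19.99, 12.51, 10.00])
good = narrate(total, faithful_llm)      # the model's sentence is accepted
bad = narrate(total, hallucinating_llm)  # rejected, template used instead
```

The design choice is that the model's output is treated as untrusted text. Whatever it writes, the number the user sees is the one the backend computed.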
The takeaway for engineers: none of these tools ship your app. They generate a starting point. Auth, data integrity, evals, security, cost control and a safe rollout are still yours. I have shipped more than one AI product, and the builder was never the hard part. Use the tool for the 5 percent. Own the 95 percent.
Ridhika | Prompt to Production