I shipped 14 AI features to production in 2025. Here's the checklist nobody talks about.

I run engineering at a small Singapore agency. Over 2025 we shipped 14 AI features into production websites — RAG chatbots, AI search, summary generators, document Q&A, agentic form fillers. None of them were "throw GPT-4 at it" demos. All had real users, real bills, real complaints.

This post is the checklist we use now, distilled from things that went wrong in the first 8 features. It's deliberately not aspirational. It's the boring operational layer most LLM tutorials skip.

If you're about to ship an AI feature past prototype, read this first.

The 7 failure modes that bit us, in order of pain

1. The "happy path demo" trap

The first 4 features all looked great on day one of QA. Three of them broke production within two weeks because the demo path is ~5% of real user traffic.

Real users:

Paste 8 KB of nested HTML when asked for a name
Ask "where is the chicken curry" to a real-estate chatbot
Trigger 14 concurrent requests by holding down the send button
Run the chatbot in an iframe inside an Outlook email preview pane

Fix: Before launch, run a "weirdness simulation": script 200 inputs scraped from your existing customer-support transcripts, plus 50 deliberately broken inputs (emoji floods, SQL-injection-shaped strings, mixed-direction Unicode such as right-to-left text). Don't ship until your tracing dashboard shows graceful handling for ≥95% of them.
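A minimal sketch of how such a run could be scored. The `handle` function, the `{ status, body }` result shape, and the definition of "graceful" (no exception, no 5xx, non-empty body) are assumptions for illustration. A real handler would be async and would report to your tracing tool instead of returning an object.

```javascript
// Hypothetical harness: `handle` is your feature's entry point, wrapped so it
// returns { status, body }. It is synchronous here only to keep the sketch short.
function runWeirdnessSuite(inputs, handle) {
  const failures = [];
  for (const input of inputs) {
    let result;
    try {
      result = handle(input);
    } catch (err) {
      failures.push({ input, reason: `threw: ${err.message}` });
      continue;
    }
    // Assumed definition of "graceful": any non-5xx status with a non-empty body.
    const graceful = Boolean(result) && result.status < 500 && Boolean(result.body);
    if (!graceful) {
      failures.push({ input, reason: `status ${result && result.status}` });
    }
  }
  const passRate =
    inputs.length === 0 ? 0 : (inputs.length - failures.length) / inputs.length;
  return { passRate, failures, shipIt: passRate >= 0.95 };
}

// A few deliberately broken inputs of the kind described above.
const brokenInputs = [
  "😀".repeat(500), // emoji flood
  "'; DROP TABLE users; --", // SQL-injection-shaped string
  "<div>".repeat(1000), // nested HTML where a name was expected
  "", // empty submission
];
```

The `failures` list matters more than the pass rate: each entry is a regression test waiting to be written.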

2. Cost runaways from one bad page

Feature #5 (a "summarise this page" button on a documentation site) burned through three months of our LLM budget in 11 days. Why: one page that nobody had read in months rendered to 200,000 tokens of HTML because of an infinite-recursion bug in the client's CMS.

Fix: Token budgets per request, enforced server-side. Always. Plus rate-limiting by IP+session, not just by API key. Plus a daily $-cap circuit breaker that pages you (literally) at 80% of budget.

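// One example: a Cloudflare Worker enforcing a per-request input cap.
// `tokenizer` stands in for whichever tokenizer matches your model; `userInput` is the request text.
const MAX_INPUT_TOKENS = 8000;
// Await the encode call first, then read .length. Awaiting `promise.length` yields undefined.
const inputTokens = (await tokenizer.encode(userInput)).length;
if (inputTokens > MAX_INPUT_TOKENS) {
  // 413 tells the client the input is the problem, not the server.
  return new Response("Input too long. Please trim and retry.", { status: 413 });
}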
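The daily $-cap circuit breaker could be sketched along these lines. The budget figure, the `pageOnCall` callback, and the in-memory counter are all assumptions for illustration. Worker instances do not share memory, so a real deployment would keep the running total in shared storage.

```javascript
// Assumed numbers, for illustration only.
const DAILY_BUDGET_USD = 50;
const ALERT_FRACTION = 0.8;

// `pageOnCall` is a hypothetical callback that reaches your on-call engineer.
function createBudgetBreaker(pageOnCall, budgetUsd = DAILY_BUDGET_USD) {
  let day = null;
  let spentUsd = 0;
  let alerted = false;

  // Reset the counters when the UTC date changes.
  const rollOver = (now) => {
    const today = now.toISOString().slice(0, 10);
    if (today !== day) {
      day = today;
      spentUsd = 0;
      alerted = false;
    }
  };

  return {
    // Call before each LLM request. When false, serve a cached or template response.
    allowRequest(now = new Date()) {
      rollOver(now);
      return spentUsd < budgetUsd;
    },
    // Call after each LLM response with the cost computed for that request.
    record(costUsd, now = new Date()) {
      rollOver(now);
      spentUsd += costUsd;
      if (!alerted && spentUsd >= budgetUsd * ALERT_FRACTION) {
        alerted = true; // page once per day, not once per request
        pageOnCall(`LLM spend at $${spentUsd.toFixed(2)} of $${budgetUsd}`);
      }
    },
  };
}
```

Paging at 80% and cutting off at 100% are deliberately separate thresholds: the first gives a human time to look before the second degrades the feature.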

3. "It works on Gemini Flash 2.5" doesn't mean it works on Gemini Flash 2.5

Two features that ran perfectly in our staging Gemini Flash 2.5 deployment started returning truncated answers in prod. Reason: after regional capacity changes, Google was silently routing some prod requests to the older Flash 2.0 endpoint, which has tighter output limits.

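Fix: Pin model versions explicitly (a versioned ID such as gemini-2.5-flash-001, not a moving alias such as gemini-2.5-flash; exact versioned IDs differ by provider and release, so check the provider's model list). Log the model that actually served each request from the response metadata. Alert if it deviates.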
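A minimal sketch of the log-and-alert step. The `modelVersion` field name and the `alert` callback are assumptions; providers and SDKs expose the served model under different names, so look up the right field for your client library.

```javascript
const PINNED_MODEL = "gemini-2.5-flash-001"; // the pinned ID used in the text above

// Compare the model the provider says it used against the one we pinned.
function checkModelDrift(servedModel, pinnedModel = PINNED_MODEL) {
  if (!servedModel) {
    return { drifted: true, reason: "no model reported in response metadata" };
  }
  if (servedModel !== pinnedModel) {
    return { drifted: true, reason: `expected ${pinnedModel}, served by ${servedModel}` };
  }
  return { drifted: false, reason: null };
}

// `responseMetadata.modelVersion` is a hypothetical field name.
// `alert` is a hypothetical callback into your alerting system.
function logAndAlert(requestId, responseMetadata, alert) {
  const servedModel = responseMetadata.modelVersion ?? null;
  const check = checkModelDrift(servedModel);
  console.log(JSON.stringify({ requestId, servedModel, drifted: check.drifted }));
  if (check.drifted) alert(`${requestId}: ${check.reason}`);
  return check;
}
```

Treating a missing model field as drift is a design choice: silence in the metadata is exactly the situation where you want someone to look.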

4. RAG retrieval drift after content updates

Feature #7 was a customer-service chatbot for a clinic. They updated their service-hours page in October. We didn't re-embed for 11 days. Customers kept being told outdated hours by the bot. Three angry phone calls before we figured it out.

Fix: Tie retrieval-index updates to your CMS publish webhook, not a cron. The CMS knows when content changes; cron only knows the time. We use a Cloudflare Workers webhook receiver that triggers re-embedding within ~30 seconds of a CMS publish.
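As a rough sketch, the decision logic inside such a webhook receiver might look like this. The payload shape (`event`, `pageId`, `updatedAt`) is hypothetical, since every CMS sends something different, and signature verification and the actual embedding call are left out.

```javascript
// Hypothetical payload: { event: "publish" | "unpublish", pageId, updatedAt }
// `indexedAt` maps pageId -> ISO timestamp of the content version already embedded.
function planIndexUpdate(payload, indexedAt) {
  const { event, pageId, updatedAt } = payload;
  if (!pageId || !updatedAt) {
    return { action: "reject", pageId: pageId ?? null };
  }

  if (event === "unpublish") {
    indexedAt.delete(pageId);
    return { action: "delete", pageId }; // remove the page's chunks from the index
  }
  if (event !== "publish") {
    return { action: "ignore", pageId };
  }

  // Webhooks can arrive late or twice, so skip versions that are already indexed.
  const known = indexedAt.get(pageId);
  if (known && Date.parse(known) >= Date.parse(updatedAt)) {
    return { action: "skip", pageId };
  }
  indexedAt.set(pageId, updatedAt);
  return { action: "reembed", pageId }; // enqueue re-embedding for this page only
}
```

Handling `unpublish` is as important as handling `publish`: a deleted page that stays in the index gets quoted just as confidently as a live one.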

5. The hallucination tax on "trustworthy" outputs

For feature #8 (a property-listing assistant), we used careful system prompts and grounding. The bot still confidently invented a 3-bedroom unit at a $X price that didn't exist. The agency that licensed it got a real customer enquiry for that imaginary unit.

Fix: Always show the source with the answer. If you can't cite, don't answer. Concretely: every chatbot response includes a sources: [] array linking to the retrieved chunks. The frontend renders them inline as [1] [2] markers. Users self-police when they can see sources are weak.
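One way the "cite or decline" rule could be expressed in the response builder. The chunk shape (`url`, `title`, `score`), the score cutoff, and the refusal wording are assumptions for illustration, not values from any particular retrieval library.

```javascript
const MIN_SCORE = 0.75; // assumed similarity cutoff; tune against your own retrieval scores

const DECLINE_MESSAGE =
  "I couldn't find this in our published information, so I'd rather not guess.";

// `chunks` are retrieval results shaped like { url, title, score } (hypothetical shape).
function citeOrDecline(draftAnswer, chunks, minScore = MIN_SCORE) {
  const sources = chunks
    .filter((chunk) => chunk.score >= minScore)
    .map((chunk, i) => ({ marker: `[${i + 1}]`, url: chunk.url, title: chunk.title }));

  // No usable source means no answer, however fluent the draft sounds.
  if (sources.length === 0) {
    return { answer: DECLINE_MESSAGE, sources: [] };
  }
  return { answer: draftAnswer, sources };
}
```

In practice the same check can run before the LLM call, so that a query with no usable context never costs a generation.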

6. The "but what if they ask about competitors" question

Feature #9 was a chatbot for a regional bank. Marketing asked midway through QA: "What if a user asks 'is OCBC better than DBS?'" Nobody had thought about this. The bot, helpfully, answered with hallucinated comparative facts.

Fix: Topic-routing guards. Before the LLM call, classify the user query against a fixed list of allowed topics for that deployment. Off-topic queries get a polite deflection ("I can only help with X, Y, Z. Try our search bar for general questions."). This is a 50ms classifier call. The savings on PR risk are huge.
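A sketch of the guard's shape. The topic names and deflection text are made up for a bank deployment, and the keyword matcher is only a stand-in for the small classifier call described above; it is far too crude for production and exists so the example runs on its own.

```javascript
// Example allowlist for one deployment (hypothetical topic names).
const ALLOWED_TOPICS = ["accounts", "cards", "branch_hours"];
const DEFLECTION =
  "I can only help with accounts, cards, and branch hours. Try our search bar for general questions.";

// Stand-in classifier: keyword matching. Swap in your real classifier call;
// only the returned topic label matters to the guard.
function classifyTopic(query) {
  const q = query.toLowerCase();
  if (/\b(account|balance|statement)\b/.test(q)) return "accounts";
  if (/\b(card|credit|debit)\b/.test(q)) return "cards";
  if (/\b(open|hours|branch)\b/.test(q)) return "branch_hours";
  return "other";
}

// Runs before the LLM call. Off-topic queries never reach the model.
function guardQuery(query, classify = classifyTopic) {
  const topic = classify(query);
  return ALLOWED_TOPICS.includes(topic)
    ? { allowed: true, topic }
    : { allowed: false, topic, reply: DEFLECTION };
}
```

The allowlist defaults to "no": a topic nobody thought about gets deflected rather than answered.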

7. Streaming UI looks fast but feels broken

We launched feature #11 with streamed token output. Beautiful in our office. In production, users on flaky mobile connections (60% of our traffic) saw text appear, pause for 3 seconds, then keep going. They thought it had crashed and refreshed. Refresh = start over = bad UX.

Fix: Streaming UX needs:

A "thinking" indicator that's separate from the streamed text (so users see that something is happening during gaps)
Heartbeat events every 500ms so the connection doesn't appear dead
A non-streaming fallback for clients that can't hold a stable stream (mobile gets the fallback by default in our setup)
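The heartbeat part could be sketched like this, assuming the stream is delivered as server-sent events. The event names (`token`, `heartbeat`) and the `write` callback are assumptions; `write` is whatever sends a string down your response stream.

```javascript
const HEARTBEAT_MS = 500;

// Format one server-sent event. JSON-encoding the data keeps newlines inside
// a token from breaking SSE framing.
function sseFrame(eventName, data) {
  return `event: ${eventName}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Emit a heartbeat event on a fixed interval until the returned function is called.
// `write` is a hypothetical function that sends a string to the client.
function startHeartbeat(write, intervalMs = HEARTBEAT_MS) {
  const timer = setInterval(() => {
    write(sseFrame("heartbeat", { at: Date.now() }));
  }, intervalMs);
  return () => clearInterval(timer);
}

// Browser side, with EventSource: named events need addEventListener, e.g.
//   source.addEventListener("heartbeat", () => showThinkingIndicator());
// where showThinkingIndicator is your own (hypothetical) UI function.
```

Using a named event rather than an SSE comment line means the client can see each heartbeat and keep the "thinking" indicator alive while tokens are stalled.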

The production checklist (the part most people scroll for)

Before any AI feature touches real users, we now run through this. It's saved us 3 weeks of post-launch firefighting per feature on average.

Inputs

[ ] Per-request input token cap, enforced server-side, returns 413 not 500
[ ] Per-user rate limit (cookie+IP), separate from per-API-key
[ ] "Weirdness simulation" run of 250 inputs passed before launch
[ ] Input is sanitized of HTML/script before logging (you WILL log inputs — don't log XSS)
[ ] Multi-language behaviour tested if your audience isn't 100% English
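For the per-user rate limit, a fixed-window counter keyed on session plus IP is one simple option. The window size, the request cap, and the in-memory `Map` are assumptions for illustration; across multiple server instances the counters would need to live in shared storage.

```javascript
const WINDOW_MS = 60_000; // assumed: 1-minute window
const MAX_REQUESTS = 20; // assumed: requests per user per window

// key -> { start, count }. In-memory for illustration only.
const rateWindows = new Map();

// Returns true if the request may proceed; on false the caller should respond with 429.
function allowUserRequest(sessionId, ip, now = Date.now()) {
  const key = `${sessionId}|${ip}`;
  const current = rateWindows.get(key);

  // Start a fresh window for a new user or once the old window has expired.
  if (!current || now - current.start >= WINDOW_MS) {
    rateWindows.set(key, { start: now, count: 1 });
    return true;
  }
  if (current.count >= MAX_REQUESTS) return false;
  current.count += 1;
  return true;
}
```

This is the check that absorbs the "14 concurrent requests from a held-down send button" case before any tokens are spent.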

Costs

[ ] Daily $-budget circuit breaker that pages on-call at 80%
[ ] Per-tenant budget if multi-tenant (one customer can't burn the others' allocation)
[ ] Logged cost per request, with attribution tag (feature, userId, tenant)
[ ] Free-tier downgrade path: if budget exceeded, system serves cached/template responses instead of failing

Models

[ ] Pin model version explicitly (-001 suffix, not the moving alias)
[ ] Log the model that served each request (from response metadata)
[ ] Alert on model-version drift in production
[ ] Tested with the worst model in your fallback chain (some traffic will route there)
[ ] Temperature locked at the lowest value that doesn't break the use case (we default to 0.2)

Grounding & truthfulness

[ ] Every answer cites its sources, in the response payload
[ ] Frontend renders sources visibly with the answer
[ ] Retrieval index re-builds on CMS publish webhook, NOT on a daily cron
[ ] "I don't know" path tested — does the bot actually defer when context is weak?
[ ] Topic-routing guard before LLM call (fixed allowlist per deployment)

UX

[ ] Streaming has separate "thinking" indicator from streamed content
[ ] Heartbeat events keep the connection alive for slow networks
[ ] Non-streaming fallback for clients/networks that need it
[ ] Loading states for the first 200ms (don't show empty boxes)
[ ] Error states clearly say "try again" with what went wrong, not "something happened"

Compliance & ethics

[ ] User data sent to the LLM provider is disclosed in your privacy policy
[ ] DPA in place with the LLM provider wherever personal-data laws apply (Singapore PDPA, EU GDPR, etc.)
[ ] No PII in retraining or vendor analytics pipelines (check the small print)
[ ] User-facing disclaimer that this is AI-generated where it matters (medical, legal, financial)

Monitoring

[ ] Every request logged with: timestamp, tenant, latency, tokens-in, tokens-out, $, model, source IPs
[ ] Hallucination canaries — synthetic queries with known correct answers, run hourly, alert on drift
[ ] User feedback widget on every response ("Was this helpful? Y/N + free text")
[ ] Weekly review of low-rated responses → manual triage → prompt updates
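The hallucination canaries can be as simple as substring checks on answers to questions you control. A minimal sketch follows; the canary data, the `mustInclude`/`mustNotInclude` fields, and the `alert` callback are hypothetical, and the hourly scheduling and the call to the bot itself are left out.

```javascript
// Hypothetical canaries: questions whose correct answer you control.
const canaries = [
  { id: "saturday-hours", question: "What time do you open on Saturday?", mustInclude: ["9am"] },
  {
    id: "no-such-unit",
    question: "Is there a 9-bedroom unit available?",
    mustInclude: ["don't have"],
    mustNotInclude: ["is available"],
  },
];

// Check one bot answer against one canary (case-insensitive substring match).
function checkCanary(canary, answer) {
  const text = answer.toLowerCase();
  const missing = canary.mustInclude.filter((s) => !text.includes(s.toLowerCase()));
  const forbidden = (canary.mustNotInclude ?? []).filter((s) => text.includes(s.toLowerCase()));
  return { id: canary.id, ok: missing.length === 0 && forbidden.length === 0, missing, forbidden };
}

// Summarise a run and raise one alert listing every failing canary.
function reportCanaries(results, alert) {
  const failed = results.filter((r) => !r.ok);
  if (failed.length > 0) {
    alert(`Canary drift: ${failed.map((r) => r.id).join(", ")}`);
  }
  return { total: results.length, failed: failed.length };
}
```

Substring checks are blunt, but they are cheap enough to run hourly and they catch the stale-hours and invented-unit failures described earlier.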

What didn't make this list

A few things I expected to matter but didn't, in practice:

Model fine-tuning — every time we considered it, prompt iteration + better RAG got us there faster. Save fine-tuning for genuinely specialized domains.
GPU local inference — interesting at scale, but for our SMB clients the API cost is rarely the actual bottleneck. Engineering complexity is.
Embeddings choice — we tried 5 embedding models. text-embedding-004 and text-embedding-3-small were both fine. Pick one and move on.
Chunking strategy debates — 300-600 char chunks with sentence boundaries, overlap of 50 chars. Stop reading chunking blog posts. Ship.
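For reference, the chunking described in the last item fits in a few lines. This is a simplified sketch, not a tuned implementation: the sentence splitter is naive, only the 600-character cap is enforced (chunks fill toward it rather than guaranteeing a 300-character minimum), and a single sentence longer than the cap is kept whole.

```javascript
const MAX_CHARS = 600;
const OVERLAP_CHARS = 50;

function chunkText(text, maxChars = MAX_CHARS, overlap = OVERLAP_CHARS) {
  // Naive sentence split: break after ., ! or ? when followed by whitespace.
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter(Boolean);

  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (current && candidate.length > maxChars) {
      chunks.push(current);
      // Carry the tail of the finished chunk forward as overlap.
      current = `${current.slice(-overlap)} ${sentence}`;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

The overlap is taken by character count, so it can start mid-word; that is usually harmless for embedding, but worth knowing when you read the chunks back.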

The bigger pattern

AI features are now ~20% of new builds we ship at SGBP — Singapore Build Partners. The pattern that's emerged: the LLM part is the easiest 30% of the work. The other 70% is the boring operational scaffolding above. That's where the actual product-readiness lives.

If you're an in-house dev or agency about to ship your first production AI feature, save this checklist. The version I keep in our internal wiki has another ~40 items, but the ones above cover 80% of the pain.

If your team is staring at a "we should add AI to this" mandate from leadership and isn't sure where to start, we do this for clinics, retailers, and SaaS teams across Singapore — happy to talk through your specific stack.

Otherwise — ship slowly, log everything, and never trust the happy path.

Daniel Cheong is a Senior Frontend Engineer at SGBP. He writes about web performance, AI in production, and the boring infrastructure that makes web products actually work.