TL;DR: 18 months building AI for a restaurant chain with 236 employees. Real production metrics, real code, real mistakes. This is what actually works (and what doesn't).
Production numbers:
- 94% accuracy on 23 KPI queries in natural language
- €12,000 in suspicious transactions caught in 6 months
- 40% reduction in HR tickets
- Response time: 180ms (down from 1.2s — this mattered more than accuracy)
- 2 → 140 daily messages after one change
I'll explain all of these. Starting with the one change that 10x'd adoption.
## The One Change That 10x'd Adoption
Daily active messages: 2 → 140. One change.
I moved the interface from a web app to WhatsApp.
The AI was identical. The responses were identical. But managers had WhatsApp open all day. Opening a browser tab was friction they wouldn't accept.
The lesson: employees don't want AI. They want answers in the place they already look.
## System 1: Natural Language → SQL KPI Engine
Managers were drowning in Excel. I built intent detection that converts plain language to one of 23 pre-validated query templates.
```php
// Intent detection — NOT fine-tuned, just prompt engineering
$intent = llm_detect_intent($query, $schema_context);

// Map to one of 23 query templates (not free-form SQL generation)
$template = QueryRegistry::get($intent['type']);

// Fill params from entity extraction
$params = EntityExtractor::extract($intent['entities'], $date_context);

// Execute against read replica only — never production writes
$result = DB::readReplica()->execute($template, $params);
```
Why templates instead of free-form LLM SQL generation:
Started with full LLM SQL generation. Disaster. Hallucinated JOINs, wrong table names, one query that locked a table for 40 seconds in production.
Switched to template matching. The LLM only does intent classification now. 23 templates cover 94% of real queries. Much safer, much cheaper.
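The template-registry idea can be sketched in a few lines. This is a minimal illustration, not the production code: the registry contents, the `run_kpi_query` helper, and the SQLite backend are all assumptions. The point is that the LLM only ever picks a key; parameters are bound by the database driver, so it can never write SQL.

```python
import sqlite3

# Pre-validated templates — the LLM chooses a key, never writes SQL itself.
QUERY_TEMPLATES = {
    "daily_revenue": "SELECT SUM(total) FROM sales WHERE location = ? AND day = ?",
    "labor_cost": "SELECT SUM(hours * rate) FROM shifts WHERE location = ? AND day = ?",
    # ... one entry per vetted KPI query
}

def run_kpi_query(conn, intent_type, params):
    """Look up a vetted template; refuse anything outside the registry."""
    template = QUERY_TEMPLATES.get(intent_type)
    if template is None:
        raise ValueError(f"unknown intent: {intent_type}")
    # Parameters are bound by the driver, never string-interpolated.
    return conn.execute(template, params).fetchone()
```

An unknown intent raises instead of falling back to generation, which is exactly the failure mode you want: loud, cheap, and safe.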
## System 2: RAG for HR/Policy Questions
40% reduction in HR tickets. 236 employees asking about schedules, policies, payroll.
The context size mistake everyone makes:
```python
# What everyone does:
context = vector_search(query, top_k=20, max_tokens=4000)

# What actually works:
context = vector_search(query, top_k=5, max_tokens=800)
# Re-rank by recency + exact keyword match
# Add only top 3 chunks
```
Smaller context → faster response → higher adoption. I measured it.
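The re-rank step can be sketched like this. Everything here is illustrative: the chunk dicts (with `similarity`, `updated`, and `text` fields) and the weighting constants are assumptions, not the production values.

```python
from datetime import datetime

def rerank(chunks, query, now=None, top_n=3):
    """Boost recent chunks and exact keyword hits; keep only top_n."""
    now = now or datetime.now()
    terms = set(query.lower().split())

    def score(chunk):
        age_days = (now - chunk["updated"]).days
        recency = 1.0 / (1 + age_days / 30)  # newer docs rank higher
        words = set(chunk["text"].lower().split())
        keyword = len(terms & words) / max(len(terms), 1)  # exact-match overlap
        return chunk["similarity"] + 0.3 * recency + 0.5 * keyword

    return sorted(chunks, key=score, reverse=True)[:top_n]
```

The weights would need tuning against real queries; the structure (vector score plus recency plus exact-match bonus, hard cap at three chunks) is the part that matters.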
The stack: a 240-page HR manual plus policy docs, chunked at 400 tokens with 50-token overlap. I started with LangChain but removed it after 3 weeks: too much abstraction over things I needed to control. Replaced it with ~200 lines of code I fully understand.
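A 400-token chunk with 50-token overlap can be sketched as below. This version approximates tokens with whitespace words to stay self-contained; a real tokenizer (tiktoken or similar) would be more accurate.

```python
def chunk_document(text, chunk_size=400, overlap=50):
    """Split text into overlapping chunks so no fact falls on a boundary."""
    words = text.split()
    step = chunk_size - overlap  # each chunk starts 350 words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```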
## System 3: Audio Meeting Intelligence
Shift handoffs by voice note. Manager leaves note at 11pm. Next manager arrives at 6am.
Voice note → Whisper (local, not API) → Structured extraction → Push to Notion
The prompt pattern that cut hallucinations by 60%:
```
You are extracting operational intelligence from a restaurant shift handoff.

Extract ONLY:
1. Problems that need action (with urgency: now/today/this-week)
2. Stock alerts
3. Staff incidents
4. Customer complaints needing followup

Format as JSON. If unclear, mark as "needs_clarification".
DO NOT summarize. DO NOT add context. Only factual operational items.
```
The "DO NOT summarize" instruction is the key. LLMs want to be helpful and add context. For operational data, you want facts only.
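The prompt only helps if the output is validated before it reaches Notion. A sketch of that guard, with assumed field names (`problems`, `urgency`) following the prompt format above:

```python
import json

def parse_handoff(raw):
    """Return (items, needs_review). Anything unparseable goes to review."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Model broke the format — route the whole note to a human.
        return [], [raw]
    items, review = [], []
    for item in data.get("problems", []):
        if item.get("urgency") == "needs_clarification":
            review.append(item)
        else:
            items.append(item)
    return items, review
```

The key design choice: malformed output never gets silently dropped or silently trusted; it lands in a review queue.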
## System 4: Fraud Detection (the one that paid for the whole project)
Simple statistical anomaly detection. Not ML. Not neural networks.
- Rolling 30-day average per employee per shift type
- Flag transactions > 2.5σ from mean
- Cross-reference with inventory consumption
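The 2.5σ rule above fits in a dozen lines of plain statistics, no ML library needed. A minimal sketch, assuming per-employee transaction amounts arrive as an ordered list (the real system would also key by shift type and cross-check inventory):

```python
from statistics import mean, stdev

def flag_anomalies(amounts, window=30, threshold=2.5):
    """Flag indices whose value deviates more than `threshold` sigma
    from the rolling mean of the previous `window` transactions."""
    flagged = []
    for i in range(window, len(amounts)):
        baseline = amounts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(amounts[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged
```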
Result: €12,000 in suspicious transactions flagged in 6 months.
Not all of it was fraud; some flags were data entry errors. But knowing someone was watching the patterns changed behavior on its own.
## Everything I Removed (and why)
| Removed | Reason |
|---|---|
| LangChain | Too much abstraction, replaced with 200 lines of custom code |
| Streaming responses | Managers started reading mid-sentence, got confused |
| GPT-4 for everything | Expensive + slow. Now: Haiku for classification, Opus for reasoning. Cost -80% |
| Conversation history > 3 exchanges | Context degraded after 3 turns. Truncate aggressively |
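The history truncation in the last row is trivial to implement. A sketch, assuming the usual chat-message dict shape (`role`/`content`) with two messages per exchange:

```python
def truncate_history(messages, max_exchanges=3):
    """Keep the system prompt plus only the last N user/assistant exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    return system + dialogue[-max_exchanges * 2:]  # 2 messages per exchange
```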
## 18-Month Results
| Metric | Before | After |
|---|---|---|
| HR tickets/week | 45 | 27 |
| Report generation | 2h manual | 0 (automated) |
| KPI query time | 20min Excel | 180ms |
| Fraud caught | unknown | €12,000 / 6 months |
| Daily AI interactions | 0 | 140+ |
## What I'd do differently
- Start with WhatsApp, not web. Would have saved 3 months building an interface nobody used.
- Template matching before LLM generation. For structured data queries, always.
- Measure adoption from day 1. I didn't track usage for the first 2 months. Flying blind.
- Smaller context windows. Instinct is to give LLM more context. Usually wrong.
## Want something like this for your business?
I do consulting on production AI systems for SMBs. Not "add ChatGPT to your website" — actual systems that replace manual work.
Typical projects: $500-1500, delivered in 2-4 weeks.
What I can build:
- Natural language → your database (no more Excel reports)
- Internal knowledge assistant (HR, policy, training)
- Meeting/audio intelligence → task extraction
- Anomaly detection on transaction data
Get in touch — I reply within 24h
Happy to answer any technical questions in the comments.