Ale Santini
I Spent a Year Building AI for 236 People. Here's What Actually Works

It was 2 AM on a Tuesday when the fraud detection model caught something that shouldn't exist.

A manager at the Barcelona location had processed 47 transactions in 8 minutes. All refunds. All to different cards. All flagged as "system error." The model caught the pattern—something no rule-based system would have—and locked the account. We recovered €12,000. That manager is now a matter for the chain's lawyers, not an ongoing loss.

That moment validated everything I'd been doing for the previous 11 months. But it also proved something I'd learned the hard way: AI in production isn't about building impressive models. It's about building systems that work when you're not looking.

What I Actually Built

Let me be specific because vague claims are useless:

  • RAG Assistant: Answers questions about POS policies, menu items, scheduling rules. 94% accuracy on employee questions. Reduced HR tickets by 40%.
  • Intent Detection: Converts natural language chat messages into 23 different KPI queries. Employees ask "why was yesterday slow?" and get actual data, not guesses.
  • Audio Meeting Intelligence: Records shift handoffs, extracts action items, surfaces problems. 15-minute meetings → 2-minute summaries.
  • 27-Collector Notification Engine: Monitors 27 different data collectors (inventory, labor, sales, compliance). Sends one message per day instead of 27 separate alerts.
  • Fraud Detection: Behavioral analysis on transactions. Caught the Barcelona incident plus 3 other smaller anomalies.
  • Automated PDF Reports: Daily reports on 14 locations, 8 different report types. Replaced manual work that took 4 hours/day.

Real numbers: €88,000/month in transactions processed. 236 employees using these systems daily. One engineer. No ML team. No data science hire.

Stack: OpenRouter (Claude Opus for complex reasoning, Haiku for speed), PHP backend, MySQL, Python for model training/inference, n8n for orchestration.

Three Things That Surprised Me

1. Model Versioning in the Database Saved Everything

I started by storing model parameters in code. Bad decision.

Halfway through, I needed to roll back the fraud detection model because it was too aggressive. Changing code, testing, deploying—that's a 45-minute process. I needed it in 5 minutes.

Now every model lives as a database record:

-- models table structure
CREATE TABLE models (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100),
    type ENUM('fraud_detection', 'intent_classifier', 'rag_prompt'),
    version INT,
    active BOOLEAN,
    parameters JSON,
    created_at TIMESTAMP,
    deployed_at TIMESTAMP
);

-- A rollback is two one-line queries
UPDATE models SET active = 0 WHERE name = 'fraud_detection' AND version = 3;
UPDATE models SET active = 1 WHERE name = 'fraud_detection' AND version = 2;

This single decision eliminated deployment anxiety. I could test a new model version for days before flipping one boolean. Rollbacks took seconds.
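The read side is just as simple: at request time, fetch whichever version is flagged active and deserialize its parameters. Here's a minimal sketch of that lookup — `load_active_model` is a hypothetical helper, and I'm using sqlite3 as a runnable stand-in for the MySQL table above:

```python
import json
import sqlite3

def load_active_model(conn, name):
    """Fetch version and parameters of the currently active model."""
    row = conn.execute(
        "SELECT version, parameters FROM models WHERE name = ? AND active = 1",
        (name,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no active model named {name!r}")
    version, params = row
    return version, json.loads(params)

# Stand-in schema and data (sqlite3 here; the real system uses MySQL)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT, version INT, active INT, parameters TEXT)")
conn.execute("INSERT INTO models VALUES ('fraud_detection', 2, 1, '{\"threshold\": 0.8}')")
conn.execute("INSERT INTO models VALUES ('fraud_detection', 3, 0, '{\"threshold\": 0.6}')")

version, params = load_active_model(conn, "fraud_detection")
```

Because the query only ever sees `active = 1`, flipping that boolean swaps versions for every caller at once, with no redeploy.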

2. Employees Don't Want AI. They Want Answers.

I spent two weeks building a beautiful chat interface. Nobody used it. They wanted Slack integration.

Once I integrated with Slack, usage went from 2 messages/day to 140 messages/day. Same underlying AI. Different interface.

The lesson: people don't care about your architecture. They care about where they already work. I now build AI as middleware that lives in their existing tools, not as new tools.
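The Slack side of that middleware is thin: take a question, run it through the existing answer pipeline, and post the result to an incoming webhook. A rough sketch, where `answer_in_slack` and the webhook URL are illustrative placeholders (Slack incoming webhooks accept a JSON body with a `text` field):

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def answer_in_slack(question, answer_fn):
    """Run the existing AI pipeline and format the result as a Slack payload."""
    answer = answer_fn(question)
    return {"text": f"*Q:* {question}\n*A:* {answer}"}

def post_to_slack(payload):
    """POST the payload to a Slack incoming webhook (network call, sketch only)."""
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

payload = answer_in_slack("why was yesterday slow?", lambda q: "staffing was down 30%")
```

The AI never changed; only the last ten lines of delivery code did.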

3. Latency Matters More Than Accuracy After 85%

I spent a month tuning my intent classifier from 87% to 92% accuracy. The improvement was invisible to users.

Then I optimized the API response time from 1.2 seconds to 180ms. Suddenly, people started using it in their workflow instead of as an afterthought.

At 180ms, you can use it while thinking. At 1.2 seconds, you've already moved on.

# I went from this (accurate but slow)
response = client.messages.create(
    model="claude-opus",
    messages=[system_prompt, user_message],
    max_tokens=500
)

# To this (fast enough and still accurate)
response = client.messages.create(
    model="claude-3-haiku-20240307",
    messages=[system_prompt, user_message],
    max_tokens=100,
    timeout=0.15  # Hard stop at 150ms (the SDK timeout is in seconds)
)

Haiku is 10x faster than Opus. For 85% of use cases, nobody notices the accuracy difference. The speed difference, everyone notices.
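For the other 15%, you don't have to choose once and forever — you can route per request. A sketch of that escalation logic, with both model calls stubbed as plain functions (the names and threshold are mine, not from the production system):

```python
def route(prompt, classify, escalate, threshold=0.6):
    """Try the fast model first; fall back to the slow model only when
    the fast model's confidence is below the threshold."""
    intent, confidence = classify(prompt)      # fast model (e.g. Haiku)
    if confidence >= threshold:
        return intent, "fast"
    return escalate(prompt), "slow"            # slow model (e.g. Opus)

# Stubbed model calls to show the control flow
fast = lambda p: ("kpi_query", 0.9) if "yesterday" in p else ("unknown", 0.2)
slow = lambda p: "kpi_query"
```

Most traffic stays on the fast path; only the ambiguous prompts pay the latency cost.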

What I Got Wrong First

Hallucination handling: I thought I could solve hallucinations with prompt engineering. I was wrong. I needed guardrails:

def validate_response(response, allowed_intents):
    """
    If the model says something outside our known intents,
    we don't trust it. Better to say "I don't know" than
    to confidently guess wrong.
    """
    parsed = extract_intent(response)

    if parsed['intent'] not in allowed_intents:
        return {"intent": "unknown", "confidence": 0}

    if parsed['confidence'] < 0.6:
        return {"intent": "unknown", "confidence": parsed['confidence']}

    return parsed

Context windows: I assumed larger context windows meant better results. They don't. They mean slower responses and higher costs. I now explicitly limit context to the last 3 exchanges.
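Trimming context is a one-liner on the message list. A minimal sketch, assuming the OpenAI-style role/content message format and counting one exchange as a user turn plus an assistant turn:

```python
def trim_context(messages, max_exchanges=3):
    """Keep the system prompt plus only the last N user/assistant exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    dialog = [m for m in messages if m["role"] != "system"]
    return system + dialog[-2 * max_exchanges:]  # each exchange = 2 messages

history = [{"role": "system", "content": "You answer POS questions."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_context(history)
```

The system prompt always survives; everything older than three exchanges is dropped before the API call.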

Real-time requirements: I thought everything needed to be real-time. It doesn't. The notification engine batches messages once per day. Users prefer one good summary to 27 real-time alerts.
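The daily batching itself is unglamorous: collect the day's alerts, group by collector, emit one message. A sketch under assumed field names (`collector`, `message` — the real schema may differ):

```python
from collections import defaultdict

def build_daily_digest(alerts):
    """Collapse a day's worth of alerts into one grouped summary message."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["collector"]].append(alert["message"])
    lines = [f"Daily summary: {len(alerts)} alerts from {len(grouped)} collectors"]
    for collector, messages in sorted(grouped.items()):
        lines.append(f"- {collector}: {len(messages)} alert(s); latest: {messages[-1]}")
    return "\n".join(lines)

digest = build_daily_digest([
    {"collector": "inventory", "message": "low stock: flour"},
    {"collector": "labor", "message": "overtime at BCN"},
    {"collector": "inventory", "message": "low stock: eggs"},
])
```

One cron job, one Slack message, and 27 collectors stop competing for attention.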

Training data quality: I spent a week collecting "good" examples. That was wasted time. I should have spent it on edge cases and failure modes.

The One Architectural Decision That Saved Everything

Storing models as database records instead of code changes.

This wasn't about elegance. This was about operational reality.

When a model breaks in production, you don't have time to run CI/CD. You need to flip it off immediately. You need to test a fix without deploying. You need to compare two versions side-by-side.

Every one of those things is trivial with models in the database. They're hard with models in code.

Everything else—the API structure, the choice of OpenRouter, the n8n orchestration—those were good choices but not critical. This one was critical.

What I'd Tell Myself on Day 1

  1. Start with the slowest, dumbest solution that works. I wasted three weeks on optimization that didn't matter. Ship something that works first.

  2. Your users don't care about your model architecture. They care about latency, accuracy, and where it lives. In that order.

  3. Batch processing is underrated. Real-time feels important until you realize users prefer one good summary to constant noise.

  4. Fraud detection works. It's not magic. It's just pattern matching on historical data. If you have 6 months of transaction history, you can build something useful in a week.

  5. Employees will use AI if it saves them time. They won't use it if it requires learning something new. Integration > innovation.

  6. One engineer can build this. You don't need a team. You need clarity on what you're solving and ruthlessness about scope.

  7. The database is your infrastructure. When you're one person, the database is your deployment system, your versioning system, your A/B testing system. Treat it as such.
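To make point 4 concrete: the simplest useful fraud signal is just "is today's count wildly outside the historical distribution?" A toy z-score version — not the production model, which does broader behavioral analysis:

```python
from statistics import mean, stdev

def refund_anomaly(history, today, z_threshold=3.0):
    """Flag today's refund count if it sits more than z_threshold
    standard deviations above the historical daily mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > z_threshold

# A Barcelona-style spike against a normal week of refund counts
normal_week = [2, 3, 1, 2, 4, 3, 2]
```

Against that baseline, 47 refunds in one burst is dozens of standard deviations out — any reasonable threshold catches it.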

The year is over. The system is running. Nobody thinks about it anymore, which is exactly what I wanted. That's when you know it works.


Need an AI system for your business?
I'm Alessandro Trimarco, AI engineer behind a 6-module AI stack for a 14-location restaurant chain (236 users, ~88k EUR/month processed).
Email: alevibecoding@gmail.com | Portfolio | Case study
