DEV Community: Ale Santini

Show HN: I built a full AI ops system for a restaurant chain (236 employees, 18 months in production)

Ale Santini — Tue, 24 Mar 2026 02:39:44 +0000

TL;DR: 18 months building AI for a restaurant chain with 236 employees. Real production metrics, real code, real mistakes. This is what actually works (and what doesn't).

Production numbers:

94% accuracy on 23 KPI queries in natural language
€12,000 in suspicious transactions caught in 6 months
40% reduction in HR tickets
Response time: 180ms (down from 1.2s — this mattered more than accuracy)
2 → 140 daily messages after one change

I'll explain all of these. Starting with the one change that 10x'd adoption.

The One Change That 10x'd Adoption

Daily active messages: 2 → 140. One change.

I moved the interface from a web app to WhatsApp.

The AI was identical. The responses were identical. But managers had WhatsApp open all day. Opening a browser tab was friction they wouldn't accept.

The lesson: employees don't want AI. They want answers in the place they already look.

System 1: Natural Language → SQL KPI Engine

Managers were drowning in Excel. I built intent detection that converts plain language to one of 23 pre-validated query templates.

// Intent detection — NOT fine-tuned, just prompt engineering
$intent = llm_detect_intent($query, $schema_context);

// Map to one of 23 query templates (not free-form SQL generation)
$template = QueryRegistry::get($intent['type']);

// Fill params from entity extraction
$params = EntityExtractor::extract($intent['entities'], $date_context);

// Execute against read replica only — never production writes
$result = DB::readReplica()->execute($template, $params);

Why templates instead of full SQL generation from LLM:

Started with full LLM SQL generation. Disaster. Hallucinated JOINs, wrong table names, one query that locked a table for 40 seconds in production.

Switched to template matching. The LLM only does intent classification now. 23 templates cover 94% of real queries. Much safer, much cheaper.
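
A minimal sketch of the template approach, written in Python rather than the PHP above and run against an in-memory SQLite database. The table, the two template names, and `run_template` are hypothetical stand-ins, not the production registry. The point it illustrates is that the model only picks a key and supplies parameters; it never writes SQL.

```python
import sqlite3

# Hypothetical registry: intent name -> (parameterized SQL, required parameter names)
QUERY_TEMPLATES = {
    "sales_by_location": (
        "SELECT location, SUM(amount) FROM sales "
        "WHERE day BETWEEN :start AND :end GROUP BY location ORDER BY location",
        ("start", "end"),
    ),
    "ticket_count": (
        "SELECT COUNT(*) FROM sales WHERE day = :day",
        ("day",),
    ),
}


def run_template(conn, intent_type, params):
    """Run a pre-validated template; reject anything outside the registry."""
    if intent_type not in QUERY_TEMPLATES:
        raise KeyError(f"No template for intent: {intent_type}")
    sql, required = QUERY_TEMPLATES[intent_type]
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError(f"Missing parameters: {missing}")
    # Values are bound by the driver, so extracted entities never become SQL text
    return conn.execute(sql, {name: params[name] for name in required}).fetchall()


def demo_connection():
    """Build a tiny in-memory database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (location TEXT, day TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("Centro", "2026-03-01", 120.0), ("Centro", "2026-03-02", 80.0),
         ("Norte", "2026-03-01", 50.0)],
    )
    return conn
```

An intent the registry does not know raises `KeyError` instead of reaching the database, which is the failure mode you want when classification goes wrong.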

System 2: RAG for HR/Policy Questions

40% reduction in HR tickets. 236 employees asking about schedules, policies, payroll.

The context size mistake everyone makes:

# What everyone does:
context = vector_search(query, top_k=20, max_tokens=4000)

# What actually works:
context = vector_search(query, top_k=5, max_tokens=800)
# Re-rank by recency + exact keyword match
# Add only top 3 chunks

Smaller context → faster response → higher adoption. I measured it.
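
The re-rank step in the comments above could look roughly like this. It is a sketch under assumptions: each chunk is a dict with hypothetical `text` and `updated` keys, and the scoring weights (one point per matching keyword, a recency bonus that halves after a year) are illustrative, not measured values.

```python
from datetime import date


def rerank(chunks, query, today, top_n=3):
    """Score retrieved chunks by exact keyword overlap plus a recency bonus.

    Each chunk is a dict with hypothetical keys: "text" and "updated" (a date).
    """
    query_terms = set(query.lower().split())

    def score(chunk):
        words = set(chunk["text"].lower().split())
        keyword_hits = len(query_terms & words)
        age_days = (today - chunk["updated"]).days
        recency = 1.0 / (1.0 + age_days / 365.0)  # 1.0 for today, 0.5 after a year
        return keyword_hits + recency

    # Keep only the best few chunks so the prompt stays small
    return sorted(chunks, key=score, reverse=True)[:top_n]
```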

The stack: 240-page HR manual + policy docs, chunked at 400 tokens with 50-token overlap. No LangChain. After 3 weeks I removed it — too much abstraction over things I needed to control. Replaced with ~200 lines of code I fully understand.
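
For reference, fixed-size chunking with overlap fits in a few lines. This sketch assumes the document has already been tokenized into a list; a whitespace split is a rough stand-in for a real tokenizer, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

With the defaults, each chunk starts 350 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next.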

System 3: Audio Meeting Intelligence

Shift handoffs by voice note. Manager leaves note at 11pm. Next manager arrives at 6am.

Voice note → Whisper (local, not API) → Structured extraction → Push to Notion

The prompt pattern that cut hallucinations by 60%:

You are extracting operational intelligence from a restaurant shift handoff.
Extract ONLY:
1. Problems that need action (with urgency: now/today/this-week)
2. Stock alerts
3. Staff incidents
4. Customer complaints needing followup

Format as JSON. If unclear, mark as "needs_clarification".
DO NOT summarize. DO NOT add context. Only factual operational items.

The "DO NOT summarize" instruction is the key. LLMs want to be helpful and add context. For operational data, you want facts only.

System 4: Fraud Detection (Paid for the whole project)

Simple statistical anomaly detection. Not ML. Not neural networks.

Rolling 30-day average per employee per shift type
Flag transactions > 2.5σ from mean
Cross-reference with inventory consumption
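
The first two steps can be sketched with the standard library alone. This assumes `history` already holds one employee's transaction amounts for one shift type over the trailing 30 days; the inventory cross-reference is left out, and `flag_anomalies` is a hypothetical name.

```python
from statistics import mean, stdev


def flag_anomalies(history, new_amounts, threshold=2.5):
    """Return the new amounts more than `threshold` standard deviations from the mean.

    `history` stands in for one employee's transactions on one shift type
    over the trailing 30 days.
    """
    if len(history) < 2:
        return []  # not enough data to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Every past amount was identical, so any different amount stands out
        return [x for x in new_amounts if x != mu]
    return [x for x in new_amounts if abs(x - mu) > threshold * sigma]
```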

Result: €12,000 in suspicious transactions flagged in 6 months.

Not all were fraud — some were data entry errors. But the attention to patterns changed behavior.

Everything I Removed (and why)

Removed	Reason
LangChain	Too much abstraction, replaced with 200 lines of custom code
Streaming responses	Managers started reading mid-sentence, got confused
GPT-4 for everything	Expensive + slow. Now: Haiku for classification, Opus for reasoning. Cost -80%
Conversation history > 3 exchanges	Context degraded after 3 turns. Truncate aggressively
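
The history rule in the last row is simple to enforce. A minimal sketch, assuming messages are dicts with a `role` key and that user and assistant turns alternate; `truncate_history` is a hypothetical helper.

```python
def truncate_history(messages, max_exchanges=3):
    """Keep the system message (if any) plus the last `max_exchanges` user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    if max_exchanges <= 0:
        return system
    dialogue = [m for m in messages if m["role"] != "system"]
    # One exchange is a user message followed by an assistant reply
    return system + dialogue[-2 * max_exchanges:]
```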

18-Month Results

Metric	Before	After
HR tickets/week	45	27
Report generation	2h manual	0 (automated)
KPI query time	20min Excel	180ms
Fraud caught	unknown	€12,000 / 6 months
Daily AI interactions	0	140+

What I'd do differently

Start with WhatsApp, not web. Would have saved 3 months building an interface nobody used.
Template matching before LLM generation. For structured data queries, always.
Measure adoption from day 1. I didn't track usage for the first 2 months. Flying blind.
Smaller context windows. Instinct is to give LLM more context. Usually wrong.

Want something like this for your business?

I do consulting on production AI systems for SMBs. Not "add ChatGPT to your website" — actual systems that replace manual work.

Typical projects: $500-1500, delivered in 2-4 weeks.

What I can build:

Natural language → your database (no more Excel reports)
Internal knowledge assistant (HR, policy, training)
Meeting/audio intelligence → task extraction
Anomaly detection on transaction data

Get in touch — I reply within 24h

Happy to answer any technical questions in the comments.

17 Things I Wish I Knew Before Building My First Production LLM System

Ale Santini — Tue, 24 Mar 2026 02:11:59 +0000

I deployed an LLM system to 236 real users last year. It broke in ways I couldn't have anticipated, taught me lessons I'll carry forever, and made me realize that building for production is fundamentally different from building a demo. Here's what I learned.

1. JSON Schema Validation Saves Your Life

Never trust an LLM to output valid JSON. Ever. I learned this the hard way when a single malformed response cascaded into 47 failed user requests. Validate every single output before you touch it.

import json
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "reasoning": {"type": "string"}
    },
    "required": ["sentiment", "confidence", "reasoning"]
}

def validate_llm_output(response_text):
    try:
        data = json.loads(response_text)
        jsonschema.validate(instance=data, schema=schema)
        return data
    except (json.JSONDecodeError, jsonschema.ValidationError) as e:
        raise ValueError(f"Invalid LLM output: {e}")

This single pattern prevented approximately 60% of the bugs I would have encountered in production.

2. Model-Agnostic From Day 1: Store Model Name in DB

You will switch models. Maybe GPT-4 becomes too expensive, maybe Claude gets better at your task, maybe you need to self-host. If your model name is hardcoded, you're going to have a bad time. Store it in your database from day one.

CREATE TABLE llm_configs (
    id INT PRIMARY KEY AUTO_INCREMENT,
    config_name VARCHAR(255) UNIQUE,
    model_name VARCHAR(255),
    api_key_secret VARCHAR(255),
    temperature FLOAT,
    max_tokens INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

INSERT INTO llm_configs (config_name, model_name, temperature, max_tokens)
VALUES ('sentiment_analysis', 'gpt-4-turbo', 0.0, 500);

Then in your code:

def get_llm_config(config_name):
    result = db.query(
        "SELECT model_name, temperature, max_tokens FROM llm_configs WHERE config_name = %s",
        (config_name,)
    )
    return result[0] if result else None

This saved me three days of refactoring when I needed to downgrade from GPT-4 to GPT-3.5-turbo for cost reasons.

3. Retry Logic Is the Difference Between Dev and Prod

Your LLM provider will have blips. Rate limits will hit. Networks will hiccup. Exponential backoff with jitter isn't optional—it's required.

import time
import random
from functools import wraps

import openai  # legacy SDK (<1.0); openai.ChatCompletion was removed in 1.0

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise

                    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                    print(f"Attempt {attempt + 1} failed. Retrying in {delay:.2f}s...")
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3, base_delay=1)
def call_llm(prompt):
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

This reduced our error rate from 3.2% to 0.4% in production.
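
One caveat: the decorator above retries on every exception, including ones a retry cannot fix, such as a malformed request. A variant that retries only a named set of errors might look like the sketch below. `TransientError` is a hypothetical marker class; in practice you would pass your client's rate-limit and timeout exceptions in `retry_on`.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (rate limits, timeouts)."""


def call_with_retry(func, max_retries=3, base_delay=1.0,
                    retry_on=(TransientError,), sleep=time.sleep):
    """Call func(); retry only on `retry_on`, with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s... plus up to 1s of jitter so clients do not retry in lockstep
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```

Any exception outside `retry_on` propagates on the first attempt. The `sleep` parameter exists so tests can skip the real delay.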

4. System Prompts Are Code: Version Them and Test Them

Your system prompt is not a one-time thing. It's code. It evolves. It needs versions, tests, and rollback capability.

# prompts.py
SYSTEM_PROMPTS = {
    "v1.0": """You are a customer support assistant. Be helpful, concise, and professional.""",
    "v1.1": """You are a customer support assistant. Be helpful, concise, and professional. 
    If you don't know something, say "I don't know" instead of guessing.""",
    "v1.2": """You are a customer support assistant. Be helpful, concise, and professional. 
    If you don't know something, say "I don't know" instead of guessing.
    Always cite sources when providing information."""
}

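def get_system_prompt(version="latest"):
    if version == "latest":
        # Compare versions numerically; a plain string max() ranks "v1.9" above "v1.10"
        latest = max(
            SYSTEM_PROMPTS,
            key=lambda v: tuple(int(part) for part in v.lstrip("v").split("."))
        )
        return SYSTEM_PROMPTS[latest]
    return SYSTEM_PROMPTS.get(version)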

# In your tests
def test_system_prompt_v1_2():
    response = call_llm("What's your favorite food?", prompt_version="v1.2")
    assert "I don't know" in response or "prefer" not in response.lower()

When a system prompt change caused issues, we rolled back in 30 seconds instead of debugging for hours.

5. Streaming Is Not Optional for UX

Users will leave if they stare at a loading spinner for 8 seconds. Streaming isn't a nice-to-have—it's table stakes for production LLM apps.

import json

from flask import Flask, Response
import openai

app = Flask(__name__)

@app.route("/chat", methods=["GET"])  # EventSource on the frontend always issues GET
def chat():
    def generate():
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": "Explain quantum computing"}],
            stream=True
        )

        for chunk in response:
            if "content" in chunk["choices"][0]["delta"]:
                content = chunk["choices"][0]["delta"]["content"]
                yield f"data: {json.dumps({'text': content})}\n\n"

    return Response(generate(), mimetype="text/event-stream")

And on the frontend:

const eventSource = new EventSource("/chat");
eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    // textContent keeps model output from being interpreted as HTML
    document.getElementById("response").textContent += data.text;
};

Streaming reduced perceived latency from "feels slow" to "feels responsive" even though the actual time-to-first-token didn't change much.

6. Your Most Expensive Token Is the One That Causes a Retry Loop

A single token that causes validation to fail, which triggers a retry, which fails again—that's not just expensive, it's a death spiral. One user hit a retry loop that cost us $47 before we noticed.

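class CostLimitExceeded(Exception):
    """Raised when the per-call budget is spent. Never retried."""

def call_llm_with_circuit_breaker(prompt, config):
    cost_this_call = 0.0
    max_cost_per_call = 0.50  # Hard limit in dollars

    for attempt in range(3):
        # Estimate cost before sending (simplified: word count as a token proxy)
        cost_this_call += len(prompt.split()) * 0.00001
        if cost_this_call > max_cost_per_call:
            # Checked outside the try block so the retry handler cannot swallow it
            log_alert(f"Cost limit hit: ${cost_this_call:.4f}", severity="critical")
            raise CostLimitExceeded(f"Cost exceeded limit: ${cost_this_call:.4f}")

        try:
            response = openai.ChatCompletion.create(
                model=config["model"],
                messages=[{"role": "user", "content": prompt}],
                temperature=config["temperature"]
            )
            return validate_llm_output(response["choices"][0]["message"]["content"])

        except Exception:
            # Covers API errors and validation failures; give up after the third attempt
            if attempt == 2:
                log_alert(f"Failed after 3 attempts, cost: ${cost_this_call:.4f}", severity="critical")
                raise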

7. Context Window Bloat Is a Hidden Cost

Every token you send costs money. Every token you receive costs money. I was sending 8KB of context when I needed 2KB. That's a 4x cost multiplier for no benefit.

def build_context_efficiently(user_history, max_context_tokens=2000):
    """Only include recent, relevant messages"""

    # Don't include the entire history
    recent_messages = user_history[-5:]  # Last 5 messages only

    context = "\n".join([
        f"User: {msg['user_message']}\nAssistant: {msg['assistant_response']}"
        for msg in recent_messages
    ])

    # Approximate the token count by word count (a real tokenizer would be more accurate)
    token_count = len(context.split())
    if token_count > max_context_tokens:
        # Truncate oldest messages first
        context = "\n".join(context.split("\n")[-20:])

    return context

Implementing this reduced our per-request token usage by 65% without degrading quality.

8. Temperature 0 for Structured Output, 0.7 for Creative

This seems obvious in retrospect, but I spent two weeks debugging "inconsistent JSON output" before realizing I had temperature set to 0.9. Temperature isn't a knob you set once.

TEMPERATURE_CONFIGS = {
    "sentiment_analysis": 0.0,      # Deterministic
    "classification": 0.0,           # Deterministic
    "content_generation": 0.7,       # Creative but coherent
    "brainstorming": 0.9,            # Very creative
    "code_generation": 0.2,          # Mostly deterministic, slight variation
}

def call_llm_with_task(prompt, task_type):
    temperature = TEMPERATURE_CONFIGS.get(task_type, 0.7)

    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )

9. Tool Calling Is More Reliable Than JSON-in-Prompt

When you need structured output, use the model's tool calling API instead of asking the model to output JSON in a prompt. It's more reliable, more consistent, and actually simpler.

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Extract the customer's name and email"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_contact_info",
            "description": "Extract contact information from text",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"}
                },
                "required": ["name", "email"]
            }
        }
    }],
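    # Force the tool call; with "auto" the model may answer in plain text instead
    tool_choice={"type": "function", "function": {"name": "extract_contact_info"}}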
)

# The response is already structured
tool_call = response["choices"][0]["message"]["tool_calls"][0]
extracted_data = json.loads(tool_call["function"]["arguments"])

This reduced JSON parsing errors from 8% to 0.3%.

10. Log Every LLM Call With Input/Output/Latency/Cost

You cannot debug what you cannot see. Log everything. When something goes wrong at 2 AM, you'll be grateful.


import logging
import time
import json
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def call_llm_with_logging(prompt, model, user_id):
    start_time = time.time()

    try:
        response = openai.ChatCom

---
**Need an AI system for your business?**
I'm Alessandro Trimarco, AI engineer behind a 6-module AI stack for a 14-location restaurant chain (236 users, ~88k EUR/month processed).
Email: **alevibecoding@gmail.com** | [Portfolio](https://alessandrotrimarco.github.io) | [Case study](https://github.com/AlessandroTrimarco/aires-burger-case-study)

The Path to AI Engineer Nobody Tells You About (From LATAM)

Ale Santini — Tue, 24 Mar 2026 02:06:03 +0000

I have never taken a formal AI course. My first LLM integration serves 236 users daily. This is what the path actually looks like.

The Trap Nobody Warns You About

You finish Andrew Ng's course. You understand transformers. You can explain attention mechanisms at a dinner party. You build a sentiment classifier on the movie review dataset. You feel like an AI engineer.

Then you apply for jobs. They want "production AI experience."

You look at their codebase. It's nothing like the tutorials. There are retry loops. There are rate limits. There are three different error states you never considered. There's a database call that sometimes takes 8 seconds and sometimes times out. There's a Slack alert at 3 AM because the API response format changed slightly.

The gap between "make this work in Jupyter" and "make this work when real money depends on it" is not a small step. It's a chasm. Most people get stuck in it, endlessly tweaking portfolios that nobody will ever use.

I was stuck there too.

What Actually Changed Everything

I stopped building for hypothetical users. I built for one real person with one real problem.

A friend's marketing agency was drowning in email summaries. They had a client sending 40-60 emails daily, and someone had to read each one and write a 2-line summary. It took 3 hours a day. The client wouldn't pay more for it. It was just eating their margin.

I said: "I can automate that."

That was it. No portfolio project. No "let me learn more first." Just: there's a problem, someone loses money if I don't solve it, and I have to make it work.

That constraint changed everything about how I built.

Tutorial Code vs. Production Code

Here's what I learned in the first 48 hours:

Tutorial code:

import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": email_text}]
)

summary = response.choices[0].message.content
print(summary)

Production code:

import asyncio
import json
import logging
import os

import openrouter
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def summarize_email(email_text: str, user_id: str) -> dict:
    """
    Summarize email with production safeguards.
    Returns structured response with metadata.
    """

    if not email_text or len(email_text) > 50000:
        logger.warning(f"Invalid email length for user {user_id}")
        return {
            "status": "error",
            "reason": "invalid_input",
            "summary": None
        }

    try:
        response = await openrouter.AsyncOpenRouter(
            api_key=os.getenv("OPENROUTER_KEY")
        ).create(
            model="anthropic/claude-3.5-sonnet",
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this email in 1-2 sentences. Be specific about action items."
                },
                {"role": "user", "content": email_text}
            ],
            temperature=0.3,
            max_tokens=150,
            timeout=15
        )

        summary = response.choices[0].message.content.strip()

        if len(summary) < 10:
            logger.warning(f"Suspiciously short summary for user {user_id}")
            return {
                "status": "warning",
                "reason": "short_output",
                "summary": summary
            }

        logger.info(f"Successfully summarized email for user {user_id}")
        return {
            "status": "success",
            "summary": summary,
            "tokens_used": response.usage.prompt_tokens + response.usage.completion_tokens
        }

    except openrouter.RateLimitError:
        logger.error(f"Rate limited for user {user_id}")
        raise
    except openrouter.APIError as e:
        logger.error(f"API error for user {user_id}: {str(e)}")
        return {
            "status": "error",
            "reason": "api_failure",
            "summary": None,
            "retry_after": 300
        }
    except asyncio.TimeoutError:
        logger.error(f"Timeout for user {user_id}")
        return {
            "status": "error",
            "reason": "timeout",
            "summary": None
        }

The second version is 10x longer. It's also the only one that doesn't wake you up at 3 AM.

I chose OpenRouter because:

I could switch models without rewriting code
Better rate limiting behavior than direct API calls
Cheaper fallback options when Claude was expensive
One API key instead of managing ten

This flexibility saved me when Claude had issues. I switched to GPT-4 for 6 hours. Users never noticed.
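
A fallback like that can be a short loop over an ordered list of models. This is a sketch, not the code from the system described here: `call_model(model, prompt)` is a hypothetical callable that wraps whatever client you use and raises on failure.

```python
def complete_with_fallback(prompt, models, call_model):
    """Try each model in order; return (model, text) from the first that succeeds.

    `call_model(model, prompt)` is a hypothetical callable wrapping whatever
    client is in use; it should raise on failure.
    """
    errors = {}
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # broad on purpose: any failure moves to the next model
            errors[model] = exc
    raise RuntimeError(f"All models failed: {errors}")
```

Returning the model name alongside the text makes it easy to log which model actually served each request.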

The Real Learning Curve

Week 1: Built the MVP. It worked for one email. My friend tested it. It broke on emails with attachments, forwarded threads, and HTML formatting.

Week 2: 10 users. That's when I learned about:

Emails that are 200KB of quoted history
Encoding issues with special characters
What happens when the LLM refuses to respond to certain content
Rate limiting when multiple users hit at the same time
Database queries that should be cached but weren't

Week 3: 50 users. The system was slow. I added caching. I batched requests. I learned that 90% of emails are duplicates of yesterday's emails, and summarizing the same thing twice is waste.

Week 4: 100+ users. Now it's a business. There's a Slack channel. People depend on it. I added monitoring. I set up alerts. I learned that one user's workflow was generating 500 emails a day and I needed to handle that gracefully.

Month 2: 236 users. The system is boring now. It just works. I spend 2 hours a week on maintenance.

That's the path. You don't become an AI engineer by studying. You become one by shipping something that breaks, then fixing it while real people wait.
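
The Week 3 fix, not summarizing the same email twice, can be sketched as a cache keyed by a hash of the normalized body. The in-memory dict is a stand-in for whatever store you actually use, and `summarize` is a hypothetical callable that takes text and returns a summary.

```python
import hashlib


class SummaryCache:
    """Cache summaries by a hash of the normalized email body."""

    def __init__(self, summarize):
        self._summarize = summarize  # hypothetical callable: text -> summary
        self._store = {}
        self.misses = 0

    @staticmethod
    def _key(text):
        normalized = " ".join(text.lower().split())  # ignore case and whitespace noise
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key not in self._store:
            # Only unseen content reaches the (paid) summarizer
            self.misses += 1
            self._store[key] = self._summarize(text)
        return self._store[key]
```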

Why LATAM Is Actually an Advantage

I'm in Buenos Aires. UTC-3. Most of my users are in the US (UTC-5 to UTC-8).

This is a feature, not a bug.

When it's 9 AM here, it's 4 AM in California. If something breaks overnight, I can fix it while they sleep: push updates, monitor for issues, and have it stable before they wake up. That's a competitive edge.

The model doesn't care where I am. Claude doesn't care if I'm in LATAM or Silicon Valley. The API response time is the same. But my cost of living is 60% lower. That means I can:

Charge less and still make real money
Spend more on infrastructure that's actually needed
Take longer on projects that matter
Turn down bad clients

The salary difference between "AI engineer in San Francisco" and "AI engineer in Buenos Aires" is real. But the skill difference is zero. And the opportunity difference is actually inverted—there's less competition here, and the market is hungry for this work.

I've turned down three job offers in the last six months because the consulting work pays better and I control my time.

The Uncomfortable Truth About Compensation

Here's what nobody tells you:

AI engineer who completed a course: $80K-$120K
AI engineer who deployed one production system: $200K-$300K
AI engineer who maintains three production systems with real users: $350K+

That's not a typo. The jump isn't 20%. It's 300%.

Why? Because the first person can't be trusted with anything that matters. The second person has scars from production failures and knows how to prevent them. The third person is irreplaceable.

You can't fake this jump with a better portfolio or a certification. You have to have actually done it.

Your Challenge This Week

Pick one real problem. Not hypothetical. Not "I could build this." Something where if it breaks, someone loses money or time or both.

It should be small enough to ship in a week. It should be something one person actually needs.

Build it. Make it work. Show me what you made.

Post it in the comments. Tell me:

What problem did you solve?
What broke first?
What surprised you about production vs. tutorial?

The people who do this will be the ones who aren't stuck between the tutorial and the job in six months.

The model doesn't care where you are. It doesn't care if you have a degree. It only cares if your code works when someone depends on it.

That's the path.

What's the real problem you're going to build for this week?

Meeting Intelligence Notion: Auto-extract tasks and decisions from any meeting transcript

Ale Santini — Tue, 24 Mar 2026 02:02:13 +0000

I've been automating meetings for a 236-person company for the past year.

Every week, the same problem: important decisions and tasks from meetings disappearing into chat history. Managers leaving voice notes at 11pm. The next shift arriving at 6am with no idea what was decided.

So I built MeetingMind: a tool that takes any meeting transcript or audio file, extracts every task and decision using AI, and creates structured Notion pages -- automatically -- via the Notion MCP server.

GitHub: github.com/AlessandroTrimarco/meetingmind

The Problem (From Real Production)

At a restaurant chain with 236 employees across 14 locations, shift handoffs happen by voice note. Manager A leaves a 3-minute audio at 11pm. Manager B arrives at 6am. Without a system, critical context gets lost.

Before MeetingMind:

Action items were forgotten
Decisions were not documented
People left meetings with different understandings of what was agreed

After MeetingMind:

Every meeting → structured Notion page in ~3 seconds
0 missed follow-ups
Full searchable history of every decision

Demo

# From transcript file
python meetingmind.py --transcript meeting.txt --notion-db YOUR_DB_ID

# From audio (local Whisper, no API cost)
python meetingmind.py --audio standup.mp3 --notion-db YOUR_DB_ID

# Test extraction without writing to Notion
python meetingmind.py --dry-run --transcript example_transcript.txt --notion-db x

Real output from the included example:

Extracted:
  Title       : Team Standup - March 24, 2026
  Date        : 2026-03-24
  Participants: 3
  Action Items: 6
  Decisions   : 3
  Blockers    : 0
  Sentiment   : positive

Full JSON with assignees, due dates, and priority levels -- all pushed to Notion via MCP.

Architecture

Meeting transcript or audio
         |
   [Optional] Whisper (local, free)
         |
   Claude Haiku via OpenRouter (~$0.001/meeting)
         |
   Notion MCP Server (@notionhq/notion-mcp-server)
         |
   Structured Notion page:
   - Action items (assignee + due date + priority)
   - Decisions (context + owner)
   - Blockers
   - Participants
   - Meeting sentiment

Why Notion MCP?

The Notion MCP server handles authentication, API versioning, and error handling. My code focuses only on the intelligence layer. This is how MCP is supposed to work -- infrastructure you don't maintain.

The Prompt Pattern That Works

After months in production, this made the biggest difference:

Extract ONLY factual information explicitly stated.
Do NOT infer or add context.
Mark as null anything not explicitly mentioned.

That last rule cut hallucinations by 60%. LLMs want to be helpful and fill in gaps. For operational data, you want facts only.
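
The same rule is worth enforcing in code after the model responds, so a missing field never gets a default value. A minimal sketch: the field names are hypothetical, chosen to match the action-item attributes listed under Architecture.

```python
ACTION_ITEM_FIELDS = ("task", "assignee", "due_date", "priority")


def normalize_action_item(raw):
    """Keep only known fields; anything absent or blank becomes None, never a guess."""
    item = {}
    for field in ACTION_ITEM_FIELDS:
        value = raw.get(field)
        if isinstance(value, str):
            value = value.strip() or None  # treat whitespace-only strings as missing
        item[field] = value
    return item
```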

Setup (5 minutes)

git clone https://github.com/AlessandroTrimarco/meetingmind
cd meetingmind
pip install -r requirements.txt
npm install -g @notionhq/notion-mcp-server
cp .env.example .env
# Edit .env: OPENROUTER_API_KEY and NOTION_TOKEN
python meetingmind.py --dry-run --transcript example_transcript.txt --notion-db x

Full code on GitHub -- MIT license, ready to use.

I Replaced 3 SaaS Tools With n8n + OpenRouter (Saving $80/month)

Ale Santini — Tue, 24 Mar 2026 01:59:42 +0000

Last year I was paying for 6 SaaS tools. Now I pay for one.

The shift wasn't about being cheap. It was about control. Every tool I used sat behind an API with rate limits, feature gates, and pricing that climbed with usage. So I built a stack around n8n and OpenRouter instead.

Here's what I killed and what I built to replace it.

Why n8n + OpenRouter Changes the Game

Zapier charges per task. OpenRouter charges per token. That's the fundamental difference.

With Zapier at $50/month, I was paying for task volume whether I used it or not. A task is anything that happens in your workflow—reading a database, calling an API, transforming data. Run 10,000 tasks and you hit their limit. You upgrade or hit the wall.

OpenRouter is different: I pay per token. Claude 3.5 Sonnet lists at roughly $3 per million input tokens and $15 per million output tokens, and a token is roughly 4 characters, so humanizing a short notification costs a fraction of a cent. With n8n self-hosted, there's no per-execution fee at all.

Zapier also locks you into their UI. n8n runs on your infrastructure. You get JSON configs, version control, and the ability to debug without refreshing a web form 40 times.

The trade: you need to understand JSON, handle errors yourself, and manage a server. Worth it at $80/month saved.

Workflow 1: Smart Notification Engine

The problem: I had 27 SQL data collectors feeding alerts to Slack. Most were noise. Notifications weren't personalized by role. I was paying $15/month to a custom SaaS just to filter and rewrite them.

The solution: n8n + OpenRouter humanizes raw data and routes by role.

Here's the flow:

┌─────────────────────┐
│  27 SQL Collectors  │
│   (cron: every 5m)  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  n8n Merge Node     │
│  (combine all data) │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────────────────────┐
│  OpenRouter API Call                │
│  (Claude 3.5 Sonnet humanization)   │
│  Cost: ~$0.001 per execution        │
└──────────┬──────────────────────────┘
           │
           ▼
┌─────────────────────┐
│  Role-Based Router  │
│  (IF/ELSE by dept)  │
└──────────┬──────────┘
           │
    ┌──────┴──────┬──────────┐
    ▼             ▼          ▼
┌────────┐   ┌────────┐  ┌────────┐
│ Slack  │   │ Email  │  │ PagerD │
│ Eng    │   │ Mgmt   │  │ OnCall │
└────────┘   └────────┘  └────────┘

The n8n workflow JSON (simplified):

{
  "nodes": [
    {
      "name": "Merge SQL Results",
      "type": "n8n-nodes-base.merge",
      "parameters": {
        "mode": "combine",
        "combinationMode": "multiplex"
      }
    },
    {
      "name": "Call OpenRouter",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "https://openrouter.ai/api/v1/chat/completions",
        "method": "POST",
        "headers": {
          "Authorization": "Bearer {{ $env.OPENROUTER_KEY }}",
          "HTTP-Referer": "https://myapp.com"
        },
        "body": {
          "model": "anthropic/claude-3.5-sonnet",
          "messages": [
            {
              "role": "user",
              "content": "Humanize this alert for a {{ $node['Merge SQL Results'].json.department }} team member:\n\n{{ JSON.stringify($node['Merge SQL Results'].json.raw_data) }}"
            }
          ],
          "max_tokens": 150
        }
      }
    },
    {
      "name": "Route by Role",
      "type": "n8n-nodes-base.switch",
      "parameters": {
        "cases": [
          {
            "condition": "department === 'engineering'",
            "output": 0
          },
          {
            "condition": "department === 'management'",
            "output": 1
          }
        ]
      }
    },
    {
      "name": "Send to Slack",
      "type": "n8n-nodes-base.slack",
      "parameters": {
        "channel": "#eng-alerts",
        "text": "{{ $node['Call OpenRouter'].json.choices[0].message.content }}"
      }
    }
  ]
}
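
Outside n8n, the role-based routing step amounts to a lookup with a fallback. The sketch below is illustrative only, not the workflow's actual logic: the `ROUTES` table, the `route_alert` function, the `oncall` department name, and the default route are all assumptions based on the three destinations in the diagram.

```python
# Hypothetical routing table: department -> (channel type, destination).
ROUTES = {
    "engineering": ("slack", "#eng-alerts"),
    "management": ("email", "management"),
    "oncall": ("pagerduty", "on-call"),
}

# Assumed fallback so an unrecognised department is never silently dropped.
DEFAULT_ROUTE = ("email", "management")


def route_alert(department: str) -> tuple[str, str]:
    """Return the (channel type, destination) pair for a department."""
    return ROUTES.get(department.strip().lower(), DEFAULT_ROUTE)


if __name__ == "__main__":
    print(route_alert("Engineering"))
    print(route_alert("kitchen"))
```

An explicit default matters here: a switch with no fallback output drops any alert whose department does not match a case.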

Raw OpenRouter API call:

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_KEY" \
  -H "HTTP-Referer: https://myapp.com" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3.5-sonnet",
    "messages": [
      {
        "role": "user",
        "content": "Rewrite this database alert for a non-technical manager: Database connection pool at 89% capacity. Avg query time 245ms. P99 latency spike detected."
      }
    ],
    "max_tokens": 150
  }'

Response:

{
  "choices": [
    {
      "message": {
        "content": "Our database is running hot right now. Response times are slower than normal, and we're approaching capacity limits. The team is monitoring this closely."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 28
  }
}
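
For anyone calling the endpoint from code rather than curl, here is a rough Python equivalent using only the standard library. It is a sketch: `build_request` and `extract_text` are hypothetical helper names, the model ID is passed in rather than assumed, and the request is built but not sent so the snippet runs offline.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_request(api_key: str, model: str, prompt: str,
                  max_tokens: int = 150) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_text(response_body: dict) -> str:
    """Pull the rewritten alert out of a chat-completions style response."""
    choices = response_body.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]
```

To send it, pass the request to `urllib.request.urlopen(req, timeout=10)` and feed the decoded JSON body to `extract_text`.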

Cost: Runs 27 times per day. ~$0.002 per run. Total: ~$1.60/month. The SaaS it replaced: $15/month.
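
The monthly figure follows directly from the per-run cost. A small sketch of the arithmetic, assuming a 30-day month (`monthly_cost` is a hypothetical helper, not part of the workflow):

```python
def monthly_cost(runs_per_day: int, cost_per_run: float, days: int = 30) -> float:
    """Estimated monthly spend for a workflow that runs a fixed number of times a day."""
    return runs_per_day * cost_per_run * days


if __name__ == "__main__":
    estimate = monthly_cost(runs_per_day=27, cost_per_run=0.002)
    # 27 runs * $0.002 * 30 days = $1.62, which rounds to the ~$1.60 quoted above
    print(f"~${estimate:.2f}/month")
```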

Workflow 2: Meeting Intelligence

The problem: I was paying $30/month for a meeting summarization tool that recorded calls, transcribed them, and extracted action items. I only needed summaries 3-4 times per week.

The solution: n8n + Whisper + OpenRouter on-demand.

┌──────────────────────┐
│  Slack Command:      │
│  /summarize [link]   │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Download Audio      │
│  (from Slack/Drive)  │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Whisper API         │
│  Transcription       │
│  Cost: $0.02 per min │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────────────────────┐
│  OpenRouter API                      │
│  (Claude 3.5 Sonnet)                 │
│  - Summarize (3 key points)          │
│  - Extract tasks (assign to people)  │
│  - Generate follow-up questions      │
│  Cost: ~$0.01 per meeting            │
└──────────┬───────────────────────────┘
           │
           ▼
┌──────────────────────┐
│  Generate PDF        │
│  (n8n template)      │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Post to Slack       │
│  + Save to Drive     │
└──────────────────────┘

The n8n workflow:

{
  "nodes": [
    {
      "name": "Receive Slack Command",
      "type": "n8n-nodes-base.slackTrigger",
      "parameters": {
        "event": "slash_command",
        "command": "summarize"
      }
    },
    {
      "name": "Download Audio",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "{{ $node['Receive Slack Command'].json.files[0].url_private }}",
        "method": "GET",
        "headers": {
          "Authorization": "Bearer {{ $env.SLACK_BOT_TOKEN }}"
        }
      }
    },
    {
      "name": "Transcribe with Whisper",
      "type": "n8n-nodes-base.openAi",
      "parameters": {
        "resource": "audio",
        "operation": "transcribe",
        "binaryPropertyName": "data",
        "options": {
          "model": "whisper-1"
        }
      }
    },
    {
      "name": "Analyze with OpenRouter",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "https://openrouter.ai/api/v1/chat/completions",
        "method": "POST",
        "headers": {
          "Authorization": "Bearer {{ $env.OPENROUTER_KEY }}",
          "HTTP-Referer": "https://myapp.com"
        },
        "body": {
          "model": "claude-3.5-sonnet",
          "messages": [
            {
              "role": "user",
              "content": "Analyze this meeting transcript and provide:\n1. Three key discussion points\n2. Action items with assigned owners\n3. Follow-up questions\n\nTranscript:\n\n{{ $node['Transcribe with Whisper'].json.text }}"
            }
          ],
          "max_tokens": 800
        }
      }
    },
    {
      "name": "Build PDF",
      "type": "n8n-nodes-base.pdf",
      "parameters": {
        "content": "Meeting Summary\n\n{{ $node['Analyze with OpenRouter'].json.choices[0].message.content }}"
      }
    },
    {
      "name": "Upload to Slack",
      "type": "n8n-nodes-base.slack",
      "parameters": {
        "resource": "file",
        "operation": "upload",
        "channels": ["{{ $node['Receive Slack Command'].json.channel_id }}"],
        "binaryPropertyName": "data"
      }
    }
  ]
}

Cost per meeting:

Whisper: 30-min call = ~$0.30
OpenRouter: ~$0.20
Total: ~$0.50 per meeting, or roughly $8-9/month at 4 meetings/week. The SaaS it replaced: $30/month.

Workflow 3: Automated Reporting

The problem: I had a reporting SaaS that pulled data from 5 sources, formatted it, and emailed it every Monday. $20/month. I built this in n8n in 2 hours.


{
  "nodes": [
    {
      "name": "Cron Trigger",
      "type": "n8n-nodes-base.cron",
      "parameters": {
        "cronExpression": "0 9 * * 1"
      }
    },
    {
      "name": "Query Data Warehouse",
      "type": "n8n-nodes-base.postgres",
      "

---
**Need an AI system for your business?**
I'm Alessandro Trimarco, AI engineer behind a 6-module AI stack for a 14-location restaurant chain (236 users, ~88k EUR/month processed).
Email: **alevibecoding@gmail.com** | [Portfolio](https://alessandrotrimarco.github.io) | [Case study](https://github.com/AlessandroTrimarco/aires-burger-case-study)

AI Is Creating a New Kind of Technical Debt — And Most Teams Don't See It Yet

Ale Santini — Tue, 24 Mar 2026 01:55:04 +0000

You're shipping AI features faster than you ever have. The prompts work. The model responds. Users are happy. Your sprint velocity looks incredible.

Six months from now, your AI system will be a maintenance nightmare.

Not because AI is fundamentally different, but because teams treat it like magic instead of infrastructure. You wouldn't ship a database query without tests, without monitoring, without versioning. But somehow, a 500-character string that controls your model's behavior? That lives in a .py file with no version control, no A/B testing, no audit trail.

This is AI technical debt. It's different from code debt. It's worse because it's invisible until it breaks production.

1. Prompt Debt: The Hardcoded Time Bomb

Bad Pattern:

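import openai

client = openai.OpenAI()

def generate_summary(text):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            # The prompt is hardcoded inline: no version, no history, no owner
            {"role": "system", "content": "You are a helpful assistant that summarizes text concisely."},
            {"role": "user", "content": text},
        ],
        temperature=0.7
    )
    return response.choices[0].message.content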

This prompt exists nowhere else. It's not versioned. You changed it last Tuesday to add "concisely" and forgot. Nobody knows when it changed or why. Your data team runs an analysis and finds summaries degraded 3% last week. They have no way to correlate it to your prompt tweak.

Fix:

# prompts.py - version controlled, tagged
PROMPTS = {
    "summarize_v1": {
        "created": "2024-01-15",
        "modified": "2024-01-20",
        "system": "You are a helpful assistant that summarizes text concisely.",
        "temperature": 0.7,
        "tags": ["production", "active"]
    },
    "summarize_v2": {
        "created": "2024-02-01",
        "modified": "2024-02-01",
        "system": "Summarize the following text in 1-2 sentences. Focus on actionable insights.",
        "temperature": 0.5,
        "tags": ["staging"]
    }
}

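import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_summary(text, prompt_version="summarize_v1"):
    # Look up the prompt by version so every call traces back to a config entry
    prompt_config = PROMPTS[prompt_version]
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": prompt_config["system"]},
            {"role": "user", "content": text},  # the text to summarize
        ],
        temperature=prompt_config["temperature"]
    )
    return response.choices[0].message.content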

Now your prompts are:

Versioned in git with commit history
Tagged for production/staging/experiment
Auditable (who changed what, when)
A/B testable (compare v1 vs v2 systematically)

Store this in version control. Treat it like database schema migrations. Because it is.
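
One way to put the tags to work is to resolve the production prompt at call time instead of hardcoding a version key. This is a minimal sketch, assuming a registry shaped like the `PROMPTS` dict above; `get_active_prompt` is a hypothetical helper, and the sample registry is trimmed for illustration.

```python
# Trimmed sample registry in the same shape as the PROMPTS dict above.
PROMPTS = {
    "summarize_v1": {
        "system": "You are a helpful assistant that summarizes text concisely.",
        "temperature": 0.7,
        "tags": ["production", "active"],
    },
    "summarize_v2": {
        "system": "Summarize the following text in 1-2 sentences. Focus on actionable insights.",
        "temperature": 0.5,
        "tags": ["staging"],
    },
}


def get_active_prompt(registry: dict, prefix: str, tag: str = "production") -> tuple[str, dict]:
    """Return (version_key, config) for the single prompt carrying `tag`.

    Raises if zero or several versions match, so a bad tag edit fails loudly.
    """
    matches = [
        (key, cfg) for key, cfg in registry.items()
        if key.startswith(prefix) and tag in cfg["tags"]
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one '{tag}' prompt for '{prefix}', found {len(matches)}"
        )
    return matches[0]
```

Promoting v2 then becomes a reviewed, one-line diff that moves the `production` tag, which is the audit trail this section argues for.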

2. Model Coupling: The Vendor Lock-In Trap

Bad Pattern:

import openai

class AIAssistant:
    def __init__(self):
        self.client = openai.OpenAI()

    def get_response(self, prompt):
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )
        return response.choices[0].message.content

You're locked to OpenAI's API. Your code is full of OpenAI-specific parameters. Claude's API is slightly different. Anthropic's pricing is better. You want to switch? You're rewriting everything.

Your CEO sees Claude's 200K context window and wants to migrate. Your team says "maybe next quarter" because it's entangled everywhere. That's model coupling.

Fix:

from abc import ABC, abstractmethod

import anthropic
import openai

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, messages: list[dict], temperature: float) -> str:
        pass

class OpenAIProvider(LLMProvider):
    def __init__(self):
        self.client = openai.OpenAI()

    def complete(self, messages, temperature):
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            temperature=temperature
        )
        return response.choices[0].message.content

class ClaudeProvider(LLMProvider):
    def __init__(self):
        self.client = anthropic.Anthropic()

    def complete(self, messages, temperature):
        response = self.client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=2048,
            messages=messages,
            temperature=temperature
        )
        return response.content[0].text

class AIAssistant:
    def __init__(self, provider: LLMProvider):
        self.provider = provider

    def get_response(self, prompt):
        messages = [{"role": "user", "content": prompt}]
        return self.provider.complete(messages, temperature=0.7)

# Usage
# assistant = AIAssistant(OpenAIProvider())
# or
# assistant = AIAssistant(ClaudeProvider())

Now switching models is a configuration change, not a rewrite. You can A/B test different providers. You can fall back gracefully if one API goes down.
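
The graceful fallback can be sketched as one more provider that wraps the others. This is an illustration under assumptions, not a production pattern: `FallbackProvider` and `ProviderError` are hypothetical names, the wrapped objects only need the `complete(messages, temperature)` method from the interface above, and the two stand-in providers exist so the snippet runs without network access.

```python
class ProviderError(Exception):
    """Raised when every configured provider has failed (hypothetical name)."""


class FallbackProvider:
    """Try each provider in order and return the first successful completion."""

    def __init__(self, providers: list):
        if not providers:
            raise ValueError("at least one provider is required")
        self.providers = providers

    def complete(self, messages: list[dict], temperature: float) -> str:
        errors = []
        for provider in self.providers:
            try:
                return provider.complete(messages, temperature)
            except Exception as exc:  # in real code, catch the SDK's specific errors
                errors.append(f"{type(provider).__name__}: {exc}")
        raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-in providers so the sketch runs offline.
class FailingProvider:
    def complete(self, messages, temperature):
        raise TimeoutError("simulated outage")


class EchoProvider:
    def complete(self, messages, temperature):
        return "echo: " + messages[-1]["content"]


if __name__ == "__main__":
    provider = FallbackProvider([FailingProvider(), EchoProvider()])
    print(provider.complete([{"role": "user", "content": "hi"}], temperature=0.7))
```

Because the wrapper exposes the same `complete` method, it can be handed to `AIAssistant` like any single provider.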

3. Evaluation Desert: The Silent Degradation

Bad Pattern:

import openai

client = openai.OpenAI()

def generate_title(article_text):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Generate a catchy title for this article"},
            {"role": "user", "content": article_text},
        ]
    )
    return response.choices[0].message.content

# Ship it
# No tests
# No metrics
# No benchmarks

Three months later, your titles are garbage. But you don't know when it started. Was it the model update? The prompt change? Something in your data pipeline? You have no baseline to compare against.

Fix:

from dataclasses import dataclass
from datetime import datetime

import openai

@dataclass
class EvalResult:
    prompt_version: str
    model: str
    score: float
    timestamp: str
    sample_size: int

class TitleGenerator:
    def __init__(self, prompt_version="v1", model="gpt-4"):
        self.prompt_version = prompt_version
        self.model = model

    def generate_title(self, article_text):
        # PROMPTS maps a version key to the system prompt string for that version
        response = openai.OpenAI().chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": PROMPTS[self.prompt_version]},
                {"role": "user", "content": article_text},
            ]
        )
        return response.choices[0].message.content

class TitleEvaluator:
    def __init__(self, generator: TitleGenerator):
        self.generator = generator
        self.eval_results = []

    def evaluate(self, test_articles: list[dict]) -> EvalResult:
        """
        test_articles: [{"text": "...", "expected_style": "..."}, ...]
        """
        scores = []

        for article in test_articles:
            title = self.generator.generate_title(article["text"])
            # Your evaluation metric (could be LLM-based, human, or heuristic)
            score = self._score_title(title, article.get("expected_style"))
            scores.append(score)

        avg_score = sum(scores) / len(scores)
        result = EvalResult(
            prompt_version=self.generator.prompt_version,
            model=self.generator.model,
            score=avg_score,
            timestamp=datetime.now().isoformat(),
            sample_size=len(test_articles)
        )

        self.eval_results.append(result)
        return result

    def _score_title(self, title, expected_style):
        # This could be:
        # - Length check (5-10 words)
        # - Sentiment analysis
        # - LLM-based scoring
        # - Human feedback loop
        # Placeholder heuristic: titles of 5-10 words score 1.0, others 0.5
        word_count = len(title.split())
        if word_count < 5 or word_count > 10:
            return 0.5
        return 1.0  # simplified

    def regression_check(self, threshold=0.85):
        if not self.eval_results:
            return True  # Nothing evaluated yet

        # The absolute threshold applies even on the very first run
        latest = self.eval_results[-1].score
        if latest < threshold:
            print(f"⚠️ Quality below threshold: {latest}")
            return False

        # Compare with the previous run only when one exists
        if len(self.eval_results) >= 2:
            previous = self.eval_results[-2].score
            if latest < previous * 0.95:  # 5% drop
                print(f"⚠️ Regression detected: {previous} → {latest}")
                return False

        return True

# Usage in CI/CD
import sys

evaluator = TitleEvaluator(TitleGenerator("v1", "gpt-4"))
test_data = load_eval_dataset()  # Your own loader for a fixed, versioned test set
result = evaluator.evaluate(test_data)

if not evaluator.regression_check():
    sys.exit(1)  # Fail the build

Now you have:

A baseline to compare against
Automated regression detection
Visibility into when quality changes
Data to debug what broke
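
One caveat: each CI run starts a fresh process, so an in-memory list of results never contains a previous run to compare against. A minimal sketch of persisting the baseline between runs, assuming a JSON Lines file kept alongside the build artifacts; `append_result`, `load_previous_score`, and `is_regression` are hypothetical helpers, not part of the classes above.

```python
import json
from pathlib import Path
from typing import Optional


def append_result(path: Path, result: dict) -> None:
    """Append one evaluation result as a JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")


def load_previous_score(path: Path) -> Optional[float]:
    """Return the score of the most recent stored run, or None if there is none."""
    if not path.exists():
        return None
    lines = [ln for ln in path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return json.loads(lines[-1])["score"] if lines else None


def is_regression(previous: Optional[float], latest: float, max_drop: float = 0.05) -> bool:
    """True when the latest score fell more than `max_drop` (relative) below the previous one."""
    return previous is not None and latest < previous * (1 - max_drop)
```

In a pipeline you would read the previous score first, run the evaluation, check `is_regression`, and append the new result only after the check.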

4. Context Window Inflation: The Expensive Slide

Bad Pattern:

def answer_question(question, user_history):
    # Just dump everything into the context
    context = "\n".join([
        f"Previous message {i}: {msg}"
        for i, msg in enumerate(user_history[-100:])  # Last 100 messages
    ])

    prompt = f"""
    {context}

    New question: {question}
    """

    response = openai.OpenAI().chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Your tokens per request: 8,000. Your monthly bill: $50K.

Six months ago, your context was 2,000 tokens. You kept adding "useful" context. Now every request costs 4x as much. Nobody noticed because it was gradual.

Fix:


from typing import Optional

class ContextManager:
    def __init__(self, max_tokens: int = 2000, model: str = "gpt-4-turbo"):
        self.max_tokens = max_tokens
        self.model = model
        self.token_counter = TokenCounter()  # Use tiktoken

    def build_context(
        self,
        question: str,
        user_history: list[str],
        max_history_messages: int = 5
    ) -> tuple[str, int]:
        """Returns (context_string, token_count)"""

        # Start with question
        context_parts = [f"Question: {question}"]
        token_count = self.token_counter.count(context_parts[0])

        # Add history incrementally, stop when we hit budget
        for msg in reversed(user_history[-max_history_messages:]):
            msg_tokens = self.token_counter.count(msg)
            if token_count + msg_tokens > self.max_tokens * 0.8:  # Leave 20% buffer
                break
            context_parts.insert(1, f"Previous: {msg}")
            token_count += msg_tokens

        context = "\n".join(context_parts)
        return context, token_count

    def answer_question(self, question: str, user_history: list[str]) -> str:
        context, token_count = self.build_context(question, user_history)

        # Log token usage for monitoring
        log_metric("llm_tokens

I Spent a Year Building AI for 236 People. Here's What Actually Works.

Ale Santini — Tue, 24 Mar 2026 01:53:33 +0000

It was 2 AM on a Tuesday when the fraud detection model caught something that shouldn't exist.

A manager at the Barcelona location had processed 47 transactions in 8 minutes. All refunds. All to different cards. All flagged as "system error." The model caught the pattern—something no rule-based system would have—and locked the account. We recovered €12,000. The manager is now a legal problem for the restaurant chain, not an ongoing one.

That moment validated everything I'd been doing for the previous 11 months. But it also proved something I'd learned the hard way: AI in production isn't about building impressive models. It's about building systems that work when you're not looking.

What I Actually Built

Let me be specific because vague claims are useless:

RAG Assistant: Answers questions about POS policies, menu items, scheduling rules. 94% accuracy on employee questions. Reduced HR tickets by 40%.
Intent Detection: Converts natural language chat messages into 23 different KPI queries. Employees ask "why was yesterday slow?" and get actual data, not guesses.
Audio Meeting Intelligence: Records shift handoffs, extracts action items, surfaces problems. 15-minute meetings → 2-minute summaries.
27-Collector Notification Engine: Monitors 27 different data collectors (inventory, labor, sales, compliance). Sends one message per day instead of 27 separate alerts.
Fraud Detection: Behavioral analysis on transactions. Caught the Barcelona incident plus 3 other smaller anomalies.
Automated PDF Reports: Daily reports on 14 locations, 8 different report types. Replaced manual work that took 4 hours/day.

Real numbers: €88,000/month in transactions processed. 236 employees using these systems daily. One engineer. No ML team. No data science hire.

Stack: OpenRouter (Claude Opus for complex reasoning, Haiku for speed), PHP backend, MySQL, Python for model training/inference, n8n for orchestration.

Three Things That Surprised Me

1. Model Versioning in the Database Saved Everything

I started by storing model parameters in code. Bad decision.

Halfway through, I needed to roll back the fraud detection model because it was too aggressive. Changing code, testing, deploying—that's a 45-minute process. I needed it in 5 minutes.

Now every model lives as a database record:

-- models table structure
CREATE TABLE models (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100),
    type ENUM('fraud_detection', 'intent_classifier', 'rag_prompt'),
    version INT,
    active BOOLEAN,
    parameters JSON,
    created_at TIMESTAMP,
    deployed_at TIMESTAMP
);

-- Deployment is literally one query: version 2 becomes active, every other version inactive
UPDATE models SET active = (version = 2) WHERE name = 'fraud_detection';

This single decision eliminated deployment anxiety. I could test a new model version for days before flipping one boolean. Rollbacks took seconds.
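
From application code, the same flip can be wrapped in a small helper. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely so it runs anywhere; the system described here uses MySQL, and `activate_version` and `active_version` are hypothetical names.

```python
import sqlite3


def activate_version(conn: sqlite3.Connection, name: str, version: int) -> None:
    """Make `version` the only active version of model `name`."""
    exists = conn.execute(
        "SELECT 1 FROM models WHERE name = ? AND version = ?", (name, version)
    ).fetchone()
    if exists is None:
        raise ValueError(f"{name} v{version} does not exist")
    with conn:  # commits on success, rolls back on error
        # One statement: the target row gets 1, every other version gets 0.
        conn.execute(
            "UPDATE models SET active = (version = ?) WHERE name = ?",
            (version, name),
        )


def active_version(conn: sqlite3.Connection, name: str):
    """Return the active version number for `name`, or None."""
    row = conn.execute(
        "SELECT version FROM models WHERE name = ? AND active = 1", (name,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE models (name TEXT, version INTEGER, active INTEGER)")
    conn.executemany(
        "INSERT INTO models VALUES (?, ?, ?)",
        [("fraud_detection", 2, 0), ("fraud_detection", 3, 1)],
    )
    activate_version(conn, "fraud_detection", 2)  # roll back from v3 to v2
    print(active_version(conn, "fraud_detection"))
```

Doing the switch in a single statement means there is never a moment with zero or two active versions, and checking that the target exists first avoids deactivating everything by accident.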

2. Employees Don't Want AI. They Want Answers.

I spent two weeks building a beautiful chat interface. Nobody used it. They wanted Slack integration.

Once I integrated with Slack, usage went from 2 messages/day to 140 messages/day. Same underlying AI. Different interface.

The lesson: people don't care about your architecture. They care about where they already work. I now build AI as middleware that lives in their existing tools, not as new tools.

3. Latency Matters More Than Accuracy After 85%

I spent a month tuning my intent classifier from 87% to 92% accuracy. The improvement was invisible to users.

Then I optimized the API response time from 1.2 seconds to 180ms. Suddenly, people started using it in their workflow instead of as an afterthought.

At 180ms, you can use it while thinking. At 1.2 seconds, you've already moved on.

# I went from this (accurate but slow)
response = client.messages.create(
    model="claude-3-opus-20240229",
    system=system_prompt,  # plain string; Anthropic takes it as its own parameter
    messages=[user_message],
    max_tokens=500
)

# To this (fast enough and still accurate)
response = client.messages.create(
    model="claude-3-haiku-20240307",
    system=system_prompt,
    messages=[user_message],
    max_tokens=100,
    timeout=0.15  # SDK timeout is in seconds: hard stop at 150ms
)

Haiku is 10x faster than Opus. For 85% of use cases, nobody notices the accuracy difference. The speed difference, everyone notices.

What I Got Wrong First

Hallucination handling: I thought I could solve hallucinations with prompt engineering. I was wrong. I needed guardrails:

def validate_response(response, allowed_intents):
    """
    If the model says something outside our known intents,
    we don't trust it. Better to say "I don't know" than
    to confidently guess wrong.
    """
    parsed = extract_intent(response)

    if parsed['intent'] not in allowed_intents:
        return {"intent": "unknown", "confidence": 0}

    if parsed['confidence'] < 0.6:
        return {"intent": "unknown", "confidence": parsed['confidence']}

    return parsed

Context windows: I assumed larger context windows meant better results. They don't. They mean slower responses and higher costs. I now explicitly limit context to the last 3 exchanges.

Real-time requirements: I thought everything needed to be real-time. It doesn't. The notification engine batches messages once per day. Users prefer one good summary to 27 real-time alerts.

Training data quality: I spent a week collecting "good" examples. That was wasted time. I should have spent it on edge cases and failure modes.
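
The context limit mentioned above can be as simple as slicing the message list before each call. A minimal sketch, assuming history is stored as alternating user and assistant message dicts; `trim_history` is a hypothetical helper name.

```python
def trim_history(messages: list[dict], max_exchanges: int = 3) -> list[dict]:
    """Keep only the most recent `max_exchanges` user/assistant exchanges.

    An exchange is assumed to be one user message plus the assistant reply,
    so the window is 2 * max_exchanges messages.
    """
    if max_exchanges <= 0:
        return []
    return messages[-2 * max_exchanges:]


if __name__ == "__main__":
    history = []
    for i in range(5):
        history.append({"role": "user", "content": f"question {i}"})
        history.append({"role": "assistant", "content": f"answer {i}"})
    for msg in trim_history(history):
        print(msg["role"], msg["content"])
```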

The One Architectural Decision That Saved Everything

Storing models as database records instead of code changes.

This wasn't about elegance. This was about operational reality.

When a model breaks in production, you don't have time to run CI/CD. You need to flip it off immediately. You need to test a fix without deploying. You need to compare two versions side-by-side.

Every one of those things is trivial with models in the database. They're hard with models in code.

Everything else—the API structure, the choice of OpenRouter, the n8n orchestration—those were good choices but not critical. This one was critical.

What I'd Tell Myself on Day 1

Start with the slowest, dumbest solution that works. I wasted three weeks on optimization that didn't matter. Ship something that works first.
Your users don't care about your model architecture. They care about latency, accuracy, and where it lives. In that order.
Batch processing is underrated. Real-time feels important until you realize users prefer one good summary to constant noise.
Fraud detection works. It's not magic. It's just pattern matching on historical data. If you have 6 months of transaction history, you can build something useful in a week.
Employees will use AI if it saves them time. They won't use it if it requires learning something new. Integration > innovation.
One engineer can build this. You don't need a team. You need clarity on what you're solving and ruthlessness about scope.
The database is your infrastructure. When you're one person, the database is your deployment system, your versioning system, your A/B testing system. Treat it as such.

The year is over. The system is running. Nobody thinks about it anymore, which is exactly what I wanted. That's when you know it works.

Whisper + LLM Task Extraction: My Meeting Intelligence Architecture

Ale Santini — Tue, 24 Mar 2026 01:43:25 +0000

Last quarter, our team was drowning in meeting notes. We had 40+ meetings per week across 12 people, and action items were scattered across Slack, email, and Google Docs. Someone would inevitably miss a deadline because a task got buried in a 2000-word transcript. So I built a system that listens to meetings, extracts structured tasks, and routes them to the right people. It's been running in production for 6 months, processing ~200 meetings monthly. Here's exactly how it works.

The Problem With Naive Transcription

You might think: "Just use Whisper to transcribe, then ask an LLM to extract tasks." That's a starting point, but it fails in practice.

The issues:

Whisper produces 3000-5000 word transcripts. LLMs struggle to extract precise tasks from walls of text.
A 45-minute meeting transcript costs $0.15-0.30 to process with GPT-4. At scale, this adds up.
You lose context about who said what and when they committed to something.
Generic prompts produce 10 tasks when there are actually 3 real ones. You get noise, not signal.

I needed a multi-stage pipeline: transcribe → segment → classify → extract → validate.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                      Meeting Audio File                      │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
        ┌────────────────────────────────────┐
        │  Whisper (speech-to-text)          │
        │  - Local or API                    │
        │  - Timestamps (+ diarization step) │
        └────────────────┬───────────────────┘
                         │
                         ▼
        ┌────────────────────────────────────┐
        │  Segment by speaker turns          │
        │  - Group into logical chunks       │
        │  - Max 300 tokens per segment      │
        └────────────────┬───────────────────┘
                         │
                         ▼
        ┌────────────────────────────────────┐
        │  Classify segments                 │
        │  - Decision/Action/Discussion      │
        │  - Cheap LLM (Claude Haiku)        │
        └────────────────┬───────────────────┘
                         │
         ┌───────────────┼───────────────────┐
         │               │                   │
         ▼               ▼                   ▼
    ┌────────┐     ┌─────────┐        ┌──────────┐
    │Decision│     │ Action  │        │Discussion│
    │ (skip) │     │(extract)│        │ (skip)   │
    └────────┘     └────┬────┘        └──────────┘
                        │
                        ▼
        ┌────────────────────────────────────┐
        │  Extract task details              │
        │  - Owner, deadline, dependencies   │
        │  - More capable LLM (Claude 3.5)   │
        └────────────────┬───────────────────┘
                         │
                         ▼
        ┌────────────────────────────────────┐
        │  Structured output (JSON)          │
        │  - Deduplicate                     │
        │  - Route to project management     │
        └────────────────────────────────────┘

The key insight: use cheap models for classification, expensive ones only for extraction. This cuts costs by 70%.

Implementation: The Real Code

Here's the production pipeline I use, simplified for clarity:

import anthropic
import json
from typing import TypedDict

class TaskExtraction(TypedDict):
    owner: str
    task: str
    deadline: str
    priority: str
    dependencies: list[str]

class MeetingProcessor:
    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)

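    def segment_transcript(self, transcript: str, max_tokens: int = 300) -> list[str]:
        """Split transcript into line-aligned segments of roughly max_tokens each.

        Word count stands in for token count here; swap in a real tokenizer
        if you need exact budgets.
        """
        segments = []
        current_lines: list[str] = []
        current_words = 0

        for line in transcript.split("\n"):
            line_words = len(line.split())
            # Close the current segment before it would exceed the budget
            if current_lines and current_words + line_words > max_tokens:
                segments.append("\n".join(current_lines))
                current_lines, current_words = [], 0
            current_lines.append(line)
            current_words += line_words

        if current_lines:
            segments.append("\n".join(current_lines))
        return segments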

    def classify_segment(self, segment: str) -> str:
        """Cheap classification: is this an action, decision, or discussion?"""
        response = self.client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=50,
            messages=[
                {
                    "role": "user",
                    "content": f"""Classify this meeting segment as one of: ACTION, DECISION, DISCUSSION.

Segment:
{segment}

Return only the classification word."""
                }
            ]
        )
        return response.content[0].text.strip()

    def extract_tasks(self, segments_with_context: list[dict]) -> list[TaskExtraction]:
        """Extract structured tasks from ACTION segments only."""
        action_segments = [
            s for s in segments_with_context 
            if s["classification"] == "ACTION"
        ]

        if not action_segments:
            return []

        combined_context = "\n\n".join([
            f"[{s['timestamp']}] {s['text']}" 
            for s in action_segments
        ])

        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1500,
            messages=[
                {
                    "role": "user",
                    "content": f"""Extract all action items from this meeting excerpt. 
Return a JSON array of tasks with this structure:
{{
  "owner": "person name or 'TBD'",
  "task": "specific action",
  "deadline": "date or 'ASAP' or 'TBD'",
  "priority": "high/medium/low",
  "dependencies": ["other task ids if mentioned"]
}}

Meeting excerpt:
{combined_context}

Return ONLY valid JSON array, no markdown formatting."""
                }
            ]
        )

        try:
            tasks = json.loads(response.content[0].text)
            return tasks
        except json.JSONDecodeError:
            print(f"Failed to parse: {response.content[0].text}")
            return []

    def process_meeting(self, transcript: str) -> list[TaskExtraction]:
        """Full pipeline: segment → classify → extract."""
        segments = self.segment_transcript(transcript)

        # Classify all segments (cheap pass)
        segments_with_class = [
            {
                "text": seg,
                "classification": self.classify_segment(seg),
                "timestamp": "0:00"  # You'd extract real timestamps
            }
            for seg in segments
        ]

        # Extract tasks from ACTION segments only
        tasks = self.extract_tasks(segments_with_class)

        # Deduplicate by (owner, task description) pair
        seen = set()
        unique_tasks = []
        for task in tasks:
            task_key = (task["owner"], task["task"])
            if task_key not in seen:
                seen.add(task_key)
                unique_tasks.append(task)

        return unique_tasks

# Usage
processor = MeetingProcessor(api_key="your-key")
with open("meeting_transcript.txt") as f:
    transcript = f.read()

tasks = processor.process_meeting(transcript)
print(json.dumps(tasks, indent=2))

This approach costs ~$0.04-0.08 per meeting. At 200 meetings/month, that's $8-16 in LLM costs alone.
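
The pipeline above ends at extraction, but the stage list also names a validate step. The sketch below shows one way that step could look; it is an assumption, not the production code. `strip_code_fence` and `parse_tasks` are hypothetical helpers that tolerate a Markdown fence around the JSON and drop any task missing the fields the prompt asks for.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker
REQUIRED_FIELDS = {"owner", "task", "deadline", "priority", "dependencies"}
ALLOWED_PRIORITIES = {"high", "medium", "low"}


def strip_code_fence(raw: str) -> str:
    """Remove a surrounding Markdown code fence if the model added one anyway."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.split("\n")[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines).strip()
    return text


def parse_tasks(raw: str) -> list[dict]:
    """Parse model output and keep only well-formed task objects."""
    try:
        data = json.loads(strip_code_fence(raw))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    valid = []
    for item in data:
        if (
            isinstance(item, dict)
            and REQUIRED_FIELDS.issubset(item)
            and str(item["priority"]).lower() in ALLOWED_PRIORITIES
            and isinstance(item["dependencies"], list)
        ):
            valid.append(item)
    return valid
```

Dropping malformed items individually keeps one bad task from discarding the whole meeting's output, which is what happens when a single `json.loads` failure returns an empty list.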

Cost Optimization: Using OpenRouter

If you're processing lots of meetings, you'll want to try different models and providers. I use OpenRouter for this—it abstracts away provider switching and gives you a single API key to test Claude, GPT-4, Llama, and others.

Here's a variant that lets you swap models easily:


import requests
import json

class OpenRouterTaskExtractor:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://openrouter.ai/api/v1"

    def classify_segment(self, segment: str, model: str = "anthropic/claude-3.5-haiku") -> str:
        """Classify using OpenRouter—easy model swapping."""
        response = requests.post(
            f"{self.base_url}/chat/completions",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "HTTP-Referer": "https://your-app.com",
            },
            json={
                "model": model,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Classify as ACTION, DECISION, or DISCUSSION:\n{segment}"
                    }
                ],
                "max_tokens": 50,
            }
        )
        return response.json()["choices"][0]["message"]["content"].strip()

    def extract_tasks(self, segments: str, model: str = "anthropic/claude-3.5-sonnet") -> list:
        """Extract tasks using OpenRouter."""
        response = requests.post(
            f"{self.base_url}/chat/completions",
            headers={
                "Authorization": f"Bearer {self.api_key}
