Most teams building AI-powered messaging systems make the same mistake: they run every inbound message through every agent. Got 5 agents? That's 5 LLM calls per message. Your users send "👍" and you just burned $0.003 classifying a thumbs-up five times.
We needed a better approach. Here's the pipeline we built — and why each layer exists.
## The Naive Approach (And Why It Hurts)
The obvious architecture:
```
Message arrives
 → Agent 1: "Is this for me?" (LLM call)
 → Agent 2: "Is this for me?" (LLM call)
 → Agent 3: "Is this for me?" (LLM call)
 → Agent 4: "Is this for me?" (LLM call)
 → Agent 5: "Is this for me?" (LLM call)
```
5 LLM calls. 5× the latency. 5× the cost. And 4 of them will say "nah, not for me."
Now imagine your hotel WhatsApp gets 500 messages/day. That's 2,500 LLM calls just for routing. Most of them are "ok", "👍", "thx", and emoji reactions.
You're paying OpenAI to classify thumbs-ups.
## The Pipeline: Free Before Paid
Our philosophy is simple: filter what you can for free, before you spend tokens.
Let's walk through each layer.
## Layer 1: Trailing-Edge Debounce
People don't send one message. They send three:
```
14:02:01  "I need room service"
14:02:04  "room 204"
14:02:07  "before 2pm if possible"
```
Without debounce, you'd process "I need room service" and miss the room number. The agent would hallucinate it or ask a follow-up question. Neither is great.
We use a trailing-edge debounce: each inbound message schedules a delayed job (say, 3 seconds). When the job fires, it checks if newer messages arrived for the same conversation. If yes, it bails — a fresher job is already in the queue.
The result: "I need room service" + "room 204" + "before 2pm" get batched into one pipeline run.
No Redis streams. No WebSocket aggregation. Just a background job with a delay and a freshness check. Dead simple.
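In plain Ruby, the freshness check boils down to comparing message ids. This is a minimal sketch of the idea — in production it would be a delayed background job (e.g. ActiveJob with `set(wait: 3.seconds)`), and the class and method names here are illustrative, not the actual implementation:

```ruby
# Trailing-edge debounce, sketched without any framework.
# Each inbound message schedules a delayed job carrying the id of the
# newest message at scheduling time. When the job fires, it processes
# only if no newer message has arrived since.
class Conversation
  attr_reader :messages

  def initialize
    @messages = []
  end

  # Returns the id of the message just added (here: its index).
  def add(text)
    @messages << text
    messages.size - 1
  end

  # Called when the delayed job fires.
  def process_if_fresh(scheduled_for_id)
    latest_id = messages.size - 1
    # A newer message arrived: bail, a fresher job is in the queue.
    return :stale if latest_id != scheduled_for_id

    :processed  # the whole pending batch goes through the pipeline once
  end
end
```

With the three room-service messages above, the jobs scheduled for the first two messages see a newer id and bail; only the job for "before 2pm if possible" processes, with all three messages batched.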
The debounce window is configurable per agent. Too short and you split messages. Too long and response feels slow. We default to 3 seconds — short enough to feel instant, long enough to catch the "oh wait, one more thing" follow-up.
## Layer 2: PreFilter — Kill Noise for Free
Before spending a single token, we filter the obvious junk with plain regex:
- Empty messages — webhook noise, media-only messages with no text
- Emoji-only — "👍", "😂😂😂", "🙏" (yes, we handle emoji ZWJ sequences)
- Ultra-short — "ok", "ty", "k" (below a configurable minimum length)
That's it. Three rules. Zero LLM cost. Takes less than a millisecond.
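A sketch of those three rules in plain Ruby. The codepoint ranges are a simplified approximation of emoji detection (covering common emoji blocks plus the ZWJ and variation selector used in sequences) — the real rules may differ, and `MIN_LENGTH` just mirrors the 3-character default mentioned later:

```ruby
# The three PreFilter rules: empty, emoji-only, ultra-short.
module PreFilter
  MIN_LENGTH = 3  # configurable default (assumption based on the article)

  # Common emoji blocks, plus ZWJ (U+200D) and variation selector (U+FE0F)
  # so "👍", "😂😂😂", and ZWJ sequences all count as emoji-only.
  EMOJI_ONLY = /\A[\u{1F000}-\u{1FAFF}\u{2600}-\u{27BF}\u{2B00}-\u{2BFF}\u{200D}\u{FE0F}\s]+\z/

  # Returns a skip reason, or nil if the message should continue
  # to the LLM router.
  def self.skip?(text)
    stripped = text.to_s.strip
    return :empty      if stripped.empty?            # webhook noise, media-only
    return :emoji_only if stripped.match?(EMOJI_ONLY)
    return :too_short  if stripped.length < MIN_LENGTH
    nil
  end
end
```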
In production, this filters ~30-40% of all inbound messages before the LLM even wakes up. On a busy account, that's hundreds of dollars saved per month.
We thought about adding language detection, spam scoring, sentiment analysis... We didn't. Three regex rules catch enough, and every rule we add is a rule we need to maintain. The LLM router handles the harder cases anyway.
## Layer 3: The Single-LLM Router
This is the core trick. Instead of asking each agent "is this for you?", we ask one cheap model to classify AND route in a single call.
The approach: build a tool-calling schema where the LLM must return:
- `actionable` (boolean) — is this a real request or just noise?
- `reason` — why it classified this way (for audit)
- `agent_ids` — which agents should handle this message
The key insight is constraining the tool schema dynamically. The agent_ids parameter uses an enum that's generated from the user's actual agent IDs:
```json
{
  "agent_ids": {
    "type": "array",
    "items": {
      "type": "string",
      "enum": ["uuid-concierge", "uuid-housekeeping", "uuid-billing"]
    }
  }
}
```
The LLM literally cannot hallucinate an agent ID — it can only pick from the list. And the system prompt is also dynamic, listing each agent's name and a truncated description of what it handles.
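Assembling that constrained schema from the user's agents is a few lines of Ruby. This is a sketch following the OpenAI tool-calling format; the helper name and hash shape are illustrative:

```ruby
# Build the router's tool schema dynamically from the account's agents.
# The agent_ids enum is generated from real agent IDs, so the model
# cannot return an ID outside the list.
def router_tool_schema(agents)
  {
    type: "function",
    function: {
      name: "route_message",
      parameters: {
        type: "object",
        properties: {
          actionable: { type: "boolean" },
          reason:     { type: "string" },
          agent_ids:  {
            type: "array",
            items: {
              type: "string",
              enum: agents.map { |a| a[:id] }  # constrained to real IDs
            }
          }
        },
        required: %w[actionable reason agent_ids]
      }
    }
  }
end
```

The system prompt is built the same way: iterate over the agents and list each name with a truncated description.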
One call. A tiny model (we use gpt-4.1-nano). Costs ~$0.00015 per message. Routes to 1, 2, or all agents at once.
On a real account with ~400 messages/day, the router costs about $2/month. Without it, running 3 agents on every message would cost ~$15/month in classification alone.
## Fail-Open: Never Lose a Message
What happens when the LLM provider is down?
We fail open. If the router call fails (after fallback — more on that in our next article), we route to ALL agents. Each agent's own LLM call will decide if the message is relevant.
You might burn a few extra tokens during an outage, but you'll never silently drop a customer message. This is a deliberate trade-off: false positives over false negatives. A customer whose message gets processed by an extra agent won't notice. A customer whose message disappears into the void will leave a 1-star review.
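The fail-open rule itself is a one-line rescue. A sketch, with `Router.classify` standing in for the real LLM call (the stub here exists only to make the example self-contained):

```ruby
# Stand-in for the real router LLM call; can be toggled "down"
# to simulate a provider outage.
module Router
  class << self
    attr_accessor :down
  end

  def self.classify(_message, agents)
    raise "provider unavailable" if down
    { actionable: true, agent_ids: [agents.first[:id]] }
  end
end

def route(message, agents)
  result = Router.classify(message, agents)
  result[:actionable] ? result[:agent_ids] : []
rescue StandardError
  # Fail open: never drop a message. Route to ALL agents and let each
  # agent's own LLM call decide relevance.
  agents.map { |a| a[:id] }
end
```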
## Layer 4: Per-Agent Filters
After routing, each agent gets one more chance to reject — based on config, not LLM:
- Trigger mode — only trigger on text messages, skip images/audio/stickers
- Min message length — ignore very short messages
- Ignore patterns — regex patterns to skip (e.g., `/^(ok|thx|merci)$/i`)
- Rate limit — max N messages/hour per conversation (prevents loops)
- Excluded numbers — blocklist specific phone numbers (internal, test devices)
Two design decisions worth highlighting:
1. User config can only make filters stricter, never weaker.
If our default minimum message length is 3 characters, a user can set it to 5 but not to 0. They can't set trigger_on: "all" if the default is "text_only". Opinionated? Yes. But it prevents a class of "my agent is running on every message and I'm out of quota" support tickets.
2. User-defined regex runs with a hard timeout.
Because users will write (.+)+$ and wonder why their server is melting. We timeout after 100ms and skip the broken pattern. ReDoS protection is not optional when you accept user-defined regex.
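On Ruby 3.2+, `Regexp.new` accepts a `timeout:` keyword, which makes this guard a few lines. A sketch (on older Rubies you'd need a different mechanism, e.g. a linear-time engine like RE2):

```ruby
# Match user-supplied ignore patterns with a hard timeout.
# Requires Ruby 3.2+ for the Regexp timeout: keyword.
def matches_ignore_pattern?(text, pattern)
  re = Regexp.new(pattern, Regexp::IGNORECASE, timeout: 0.1)
  re.match?(text)
rescue Regexp::TimeoutError, RegexpError
  # Catastrophic backtracking or an invalid pattern: skip the broken
  # pattern instead of melting the server.
  false
end
```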
## The Audit Trail
Every routing decision gets logged with:
- Outcome — `routed`, `skip_rules`, `skip_router`, `no_agent_matched`, `error_fallback`
- Per-agent decisions — which agents were routed, which were filtered, and why
- Token usage — how many tokens the router consumed
When a user asks "why didn't my agent process this message?", we can answer with data, not guesswork. And on our end, we can spot patterns — if error_fallback spikes, something is wrong upstream.
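The record itself can be small. A sketch of the shape — field names here are illustrative, not the actual schema:

```ruby
# One audit record per routing decision.
RoutingDecision = Struct.new(
  :outcome,         # :routed, :skip_rules, :skip_router, :no_agent_matched, :error_fallback
  :agent_decisions, # { agent_id => { routed: true/false, reason: "..." } }
  :tokens_used,     # tokens the router consumed for this message
  keyword_init: true
)

decision = RoutingDecision.new(
  outcome: :routed,
  agent_decisions: {
    "uuid-concierge" => { routed: true,  reason: "room service request" },
    "uuid-billing"   => { routed: false, reason: "not billing-related" }
  },
  tokens_used: 212
)
```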
## The Numbers
On a real account (hotel, ~400 messages/day, 3 agents):
| Layer | Messages filtered | Cost |
|---|---|---|
| PreFilter (regex) | ~35% | $0 |
| Router (not actionable) | ~25% | ~$0.06/day |
| Per-agent filter | ~5% | $0 |
| Actually processed | ~35% | Agent LLM cost |
65% of messages never reach an agent. The first and last filters are free. The router in the middle costs pennies.
## What We'd Do Differently
The debounce window is the hardest tuning problem. There's no perfect value. We expose it as a per-agent config and let users decide. Most never change the default.
Fail-open is the right default, but it needs monitoring. Without tracking `error_fallback` outcomes, you'd never know your router has been down for 3 hours and every message is hitting every agent. Alerting on this is critical.

Don't over-engineer the PreFilter. We almost built a spam classifier. We're glad we didn't. The simplest possible rules give you 80% of the value for 0% of the maintenance.
## This Is What We Built
This pipeline powers WhatsRB Cloud — a platform that turns WhatsApp messages into structured, actionable webhooks. You define agents, set routing rules, and receive clean JSON payloads.
If you're building something similar on WhatsApp, we've done the hard part so you don't have to.
Built with Ruby, Rails, and a deep respect for not wasting tokens on thumbs-ups.
