I was building an Agent-as-a-Service platform. Multi-tenant, WhatsApp + Telegram + Slack ingress, LLM inference running locally on a 6GB GPU.
And it kept dying.
Not crashing with a nice error. Just... silently failing. Telegram would send a webhook, wait 3 seconds, get no response, and retry. The retry would stack on top of the first one. My Python worker, already halfway through an LLM call, would get hit again. GPU memory would spike. The whole thing would choke.
I had built the wrong architecture. Python was doing a job it was never designed for.
The Core Problem Nobody Talks About
When you build an AI agent backend in Python, you're writing two fundamentally different types of code in the same process:
Type 1 — Latency-sensitive edge work:
Accepting the incoming webhook
Immediately returning 200 OK before the platform times out
Validating API keys and rate limits
Routing the payload somewhere
Type 2 — Slow, compute-heavy work:
Running LLM inference
Executing tool calls
Processing results
Writing to a database
The mistake I was making — the mistake most people make — is letting Python handle both. Type 1 work needs to respond in milliseconds. Type 2 work can take 15-30 seconds. Mixing them in a single FastAPI app means your slow work blocks your fast work, and platforms like WhatsApp and Telegram don't wait.
They retry. Which makes everything worse.
What I Built: Sentinel
Sentinel is a Go-based edge gateway that sits in front of my entire Python/LLM stack. It handles everything that needs to be fast, so Python can focus on everything that needs to be smart.
Here's the full architecture:
WhatsApp / Telegram / Slack / Salesforce
│
▼
┌───────────────┐
│ SENTINEL │ ← Go (high-concurrency edge)
│ (Go Proxy) │
└───────┬───────┘
│
┌──────┴──────┐
│ │
▼ ▼
Redis Queue Neon DB ← Job written, 200 OK returned instantly
│
▼
┌─────────────────┐
│ Python Healer │ ← LLM inference, tool execution, response generation
│ (AI Workers) │
└─────────────────┘
│
▼
Response delivered
back to user
Go handles the edge. Python handles the intelligence. They never block each other.
What Sentinel Actually Does
- Instant Webhook Acknowledgment: The WhatsApp Business API requires a 200 OK within 3 seconds, or it marks the delivery as failed and retries. Telegram has similar constraints. If your Python agent is mid-inference when the next webhook arrives, ghost retries start to accumulate. Sentinel accepts every incoming request and immediately fires back 200 OK. The payload gets written to a Redis queue, the platform is satisfied, and Python picks up the job when it's ready, not when the platform demands it. No more timeouts. No more phantom retries stacking on each other.
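The ack-first idea can be sketched in a few lines of Go. This is a minimal illustration, not the actual Sentinel code: `jobQueue` is an in-process channel standing in for the Redis RPUSH, and all names are mine.

```go
package main

import (
	"io"
	"net/http"
)

// jobQueue stands in for the Redis queue; in Sentinel this would be
// a push to a Redis list. All names here are illustrative.
var jobQueue = make(chan []byte, 1024)

// enqueue hands the payload to the queue without blocking the request
// path. If the queue is full we drop rather than block, because making
// the platform wait would only trigger the retries we're avoiding.
func enqueue(payload []byte) bool {
	select {
	case jobQueue <- payload:
		return true
	default:
		return false
	}
}

// ackHandler acknowledges the webhook immediately and defers all real
// work to the queue, keeping well inside the 3-second deadline.
func ackHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	enqueue(body)
	w.WriteHeader(http.StatusOK) // platform is satisfied; Python catches up later
}
```

The key design point: nothing in the handler waits on the LLM, the database, or anything slow, so the response time is bounded by a channel send.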
- The Async Shock Absorber: Imagine 500 users message your agent simultaneously. Without a queue, 500 requests hit your Python workers at once; on a 6GB GPU running local inference, that's an instant OOM crash. Sentinel's job is to be a bouncer. It absorbs the spike, pushes every payload into Redis, and lets your Python workers pull jobs at whatever rate they can sustainably handle. The queue is the pressure valve. Go can manage thousands of concurrent connections without breaking a sweat; that's what goroutines are built for.
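The pressure-valve behavior boils down to a bounded worker pool. A hedged sketch, again with an in-process channel standing in for Redis and a Go function standing in for the Python workers:

```go
package main

import "sync"

// drainQueue sketches the shock-absorber idea: the buffered jobs
// channel absorbs the spike, while a fixed pool of workers drains it
// at a rate the GPU can sustain. Names are illustrative.
func drainQueue(jobs <-chan string, workers int, handle func(string)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				handle(job) // at most `workers` jobs in flight at once
			}
		}()
	}
	wg.Wait() // returns once the queue is closed and fully drained
}
```

With `workers` set to, say, 2, a burst of 500 queued messages never puts more than 2 inferences on the GPU at a time; the other 498 simply wait in the queue instead of crashing the process.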
- Edge Validation Before the Expensive Stuff: Every LLM call costs money (or GPU time), and running a validation check after you've already started inference is wasteful. Sentinel validates before Python ever sees the request:
Is this API key valid?
Is this user's subscription active?
Have they hit their rate limit?
Does the payload look malicious?
If any check fails, Sentinel drops the request at the edge. Your Python workers and your LLM never get touched. You don't burn tokens on a user who hasn't paid.
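The checks above can be chained into a single gate that runs before anything is enqueued. A minimal sketch; the key store, size limit, and malicious-content heuristic are all placeholders for the real lookups (which would hit Redis or Postgres), not code from the repo:

```go
package main

import (
	"errors"
	"strings"
)

// Request is an illustrative view of an incoming webhook.
type Request struct {
	APIKey  string
	Payload string
}

var (
	validKeys = map[string]bool{"key-123": true} // hypothetical key store
	errDenied = errors.New("rejected at edge")
)

// validate runs the cheap checks before any GPU time is spent.
// Order matters: cheapest and most selective checks go first.
func validate(r Request) error {
	if !validKeys[r.APIKey] {
		return errDenied // invalid or inactive key: no tokens burned
	}
	if len(r.Payload) > 64*1024 {
		return errDenied // oversized payload
	}
	if strings.Contains(r.Payload, "<script") {
		return errDenied // crude stand-in for a real malicious-input check
	}
	return nil // safe to enqueue for the Python workers
}
```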
- The Bridge Between Frontend and AI: My Next.js dashboard doesn't talk directly to the AI; it talks to Sentinel. When a user clicks "Process this document" on the frontend, the request hits Sentinel. Sentinel writes the job to the database, pushes it to the queue, and returns a job ID. The frontend shows a loading bar. Python picks up the job, runs inference, and writes the result. The frontend polls for completion. This pattern, sometimes called the "async job" pattern, is what makes AI-powered UIs feel responsive even when the underlying work takes 30 seconds. Sentinel is what makes it possible.
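The submit/complete/poll lifecycle can be captured in a tiny state store. This is an in-memory sketch under my own naming; Sentinel persists jobs in Neon and moves them through Redis rather than a map:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"sync"
)

// JobStore is a minimal in-memory stand-in for the Neon jobs table.
type JobStore struct {
	mu      sync.Mutex
	results map[string]string // jobID -> result ("" = still running)
}

func NewJobStore() *JobStore {
	return &JobStore{results: make(map[string]string)}
}

// Submit registers a job and returns the ID the frontend will poll with.
func (s *JobStore) Submit() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	id := hex.EncodeToString(b)
	s.mu.Lock()
	s.results[id] = ""
	s.mu.Unlock()
	return id
}

// Complete is called by the worker when inference finishes.
func (s *JobStore) Complete(id, result string) {
	s.mu.Lock()
	s.results[id] = result
	s.mu.Unlock()
}

// Poll is what the frontend's loading bar hits until done is true.
func (s *JobStore) Poll(id string) (result string, done bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r := s.results[id]
	return r, r != ""
}
```

The point of the pattern is that the original HTTP request never blocks on inference: the frontend gets an ID in milliseconds and trades one slow request for many cheap polls.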
Why Go and Not Just More Python Workers
I get this question a lot. "Why not just run more FastAPI instances behind nginx?"
You could. But you're still paying the Python interpreter tax on every request. Even with asyncio, CPython adds per-connection overhead that Go simply doesn't have, and the GIL caps what a single process can do with CPU-bound work like parsing and validation. Go's goroutines are lightweight enough that a single Sentinel instance can handle tens of thousands of concurrent webhook connections on modest hardware.
More importantly: the operational model is simpler. One Go binary at the edge. Any number of Python workers behind it. You scale the workers independently based on AI load, not on webhook volume.
They're solving different problems. Let them be different services.
The Stack
Go — edge proxy, webhook ingress, routing, validation
Python — LLM inference, tool execution, business logic
Redis (Upstash) — async job queue between the two layers
Neon (PostgreSQL) — job persistence, audit trail
Docker Compose — runs the whole thing locally, single command
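For a sense of what "single command" looks like, here is a hypothetical shape for the compose file. Service names, build paths, and images are illustrative, not taken from the repo; locally, a plain Redis and Postgres container stand in for the hosted Upstash and Neon services:

```yaml
services:
  sentinel:
    build: ./sentinel        # the Go edge proxy
    ports: ["8080:8080"]
    depends_on: [redis, db]
  worker:
    build: ./worker          # the Python AI workers
    depends_on: [redis, db]
  redis:
    image: redis:7-alpine    # local stand-in for Upstash
  db:
    image: postgres:16       # local stand-in for Neon
    environment:
      POSTGRES_PASSWORD: dev
```

One `docker compose up` and the whole fast-layer/smart-layer split is running locally.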
The repo is here: github.com/Anhsirkm/sentinel
It's early. No production hardening yet. But the architecture is the point — the separation of concerns between the fast layer and the smart layer is what makes AI agent infrastructure actually work under real load.
What I Learned
The single most useful mental shift was this: your AI agent is not one service. It's two.
One service that has to be fast. One service that has to be smart. They should never be the same process, and they should only talk to each other through a queue.
Once I split them, the timeouts disappeared. The retries stopped stacking. The GPU stopped spiking. The system became predictable.
If you're building any kind of AI agent backend that receives webhooks from external platforms, I'd seriously consider this pattern before you go deeper on the Python side. The edge layer is not glamorous, but it's what keeps everything else alive.