<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: krishna kumar</title>
    <description>The latest articles on DEV Community by krishna kumar (@krishna_kumar_f87cba99533).</description>
    <link>https://dev.to/krishna_kumar_f87cba99533</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3863445%2Fdb7c298d-48c3-4d3d-a28e-c36c17b28ad7.png</url>
      <title>DEV Community: krishna kumar</title>
      <link>https://dev.to/krishna_kumar_f87cba99533</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/krishna_kumar_f87cba99533"/>
    <language>en</language>
    <item>
      <title>I Built a Go Gateway That Stops My AI Agent From Dying Under Load — Here's Why Python Can't Do This Job Alone</title>
      <dc:creator>krishna kumar</dc:creator>
      <pubDate>Mon, 06 Apr 2026 08:08:05 +0000</pubDate>
      <link>https://dev.to/krishna_kumar_f87cba99533/i-built-a-go-gateway-that-stops-my-ai-agent-from-dying-under-load-heres-why-python-cant-do-this-3mko</link>
      <guid>https://dev.to/krishna_kumar_f87cba99533/i-built-a-go-gateway-that-stops-my-ai-agent-from-dying-under-load-heres-why-python-cant-do-this-3mko</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flla46bs40strrzp1gkzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flla46bs40strrzp1gkzo.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;I was building an Agent-as-a-Service platform. Multi-tenant, WhatsApp + Telegram + Slack ingress, LLM inference running locally on a 6GB GPU.&lt;br&gt;
And it kept dying.&lt;br&gt;
Not crashing with a nice error. Just... silently failing. Telegram would send a webhook, wait 3 seconds, get no response, and retry. The retry would stack on top of the first one. My Python worker, already halfway through an LLM call, would get hit again. GPU memory would spike. The whole thing would choke.&lt;br&gt;
I had built the wrong architecture. Python was doing a job it was never designed for.&lt;/p&gt;

&lt;p&gt;The Core Problem Nobody Talks About&lt;br&gt;
When you build an AI agent backend in Python, you're writing two fundamentally different types of code in the same process:&lt;br&gt;
Type 1 — Latency-sensitive edge work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepting the incoming webhook&lt;/li&gt;
&lt;li&gt;Immediately returning 200 OK before the platform times out&lt;/li&gt;
&lt;li&gt;Validating API keys and rate limits&lt;/li&gt;
&lt;li&gt;Routing the payload somewhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Type 2 — Slow, compute-heavy work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running LLM inference&lt;/li&gt;
&lt;li&gt;Executing tool calls&lt;/li&gt;
&lt;li&gt;Processing results&lt;/li&gt;
&lt;li&gt;Writing to a database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mistake I was making — the mistake most people make — is letting Python handle both. Type 1 work needs to respond in milliseconds. Type 2 work can take 15-30 seconds. Mixing them in a single FastAPI app means your slow work blocks your fast work, and platforms like WhatsApp and Telegram don't wait.&lt;br&gt;
They retry. Which makes everything worse.&lt;/p&gt;
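&lt;p&gt;To make that failure mode concrete, here's a minimal, framework-free sketch of the inline anti-pattern (function names and timings are illustrative, scaled way down; this is not code from the actual system):&lt;/p&gt;

```python
import time

# Timings scaled down for the sketch; the article's real numbers are a
# 3-second platform timeout against 15-30 seconds of inference.
PLATFORM_TIMEOUT_S = 0.3
INFERENCE_TIME_S = 0.5

def run_inference(payload):
    """Stand-in for the Type 2 work: slow, compute-heavy."""
    time.sleep(INFERENCE_TIME_S)
    return "reply to " + payload

def handle_webhook_inline(payload):
    """Anti-pattern: the 200 OK waits on inference, so the platform
    times out and retries while the worker is still busy."""
    result = run_inference(payload)
    return {"status": 200, "result": result}

start = time.monotonic()
response = handle_webhook_inline("hello")
elapsed = time.monotonic() - start
ack_too_slow = elapsed > PLATFORM_TIMEOUT_S  # True: a retry is already on its way
```

&lt;p&gt;The 200 OK is correct, but it arrives after the deadline, so from the platform's point of view the delivery failed anyway.&lt;/p&gt;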

&lt;p&gt;What I Built: &lt;a href="https://github.com/Anhsirkm/sentinel" rel="noopener noreferrer"&gt;Sentinel&lt;/a&gt;&lt;br&gt;
Sentinel is a Go-based edge gateway that sits in front of my entire Python/LLM stack. It handles everything that needs to be fast, so Python can focus on everything that needs to be smart.&lt;br&gt;
Here's the full architecture:&lt;/p&gt;

&lt;pre&gt;WhatsApp / Telegram / Slack / Salesforce
            │
            ▼
    ┌───────────────┐
    │   SENTINEL    │  ← Go (high-concurrency edge)
    │  (Go Proxy)   │
    └───────┬───────┘
            │
     ┌──────┴──────┐
     │             │
     ▼             ▼
  Redis Queue    Neon DB      ← Job written, 200 OK returned instantly
     │
     ▼
┌─────────────────┐
│  Python Healer  │  ← LLM inference, tool execution, response generation
│  (AI Workers)   │
└─────────────────┘
            │
            ▼
    Response delivered
    back to user&lt;/pre&gt;

&lt;p&gt;Go handles the edge. Python handles the intelligence. They never block each other.&lt;/p&gt;

&lt;p&gt;What Sentinel Actually Does&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instant Webhook Acknowledgment
WhatsApp Business API requires a 200 OK within 3 seconds or it marks the delivery as failed and retries. Telegram has similar constraints. If your Python agent is mid-inference when the next webhook arrives, you're going to start accumulating ghost retries.
Sentinel accepts every incoming request and immediately fires back 200 OK. The payload gets written to a Redis queue. The platform is satisfied. Python picks up the job when it's ready — not when the platform demands it.
No more timeouts. No more phantom retries stacking on each other.&lt;/li&gt;
&lt;li&gt;The Async Shock Absorber
Imagine 500 users message your agent simultaneously. Without a queue, 500 requests hit your Python workers at once. On a 6GB GPU running local inference, that's an instant OOM crash.
Sentinel's job is to be a bouncer. It absorbs the spike, pushes every payload into Redis, and lets your Python workers pull jobs at whatever rate they can sustainably handle. The queue is the pressure valve. Go can manage thousands of concurrent connections without breaking a sweat — that's what goroutines are built for.&lt;/li&gt;
&lt;li&gt;Edge Validation Before the Expensive Stuff
Every LLM call costs money (or GPU time). Running a validation check after you've already started inference is wasteful.
Sentinel validates before Python ever sees the request:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Is this API key valid?&lt;/li&gt;
&lt;li&gt;Is this user's subscription active?&lt;/li&gt;
&lt;li&gt;Have they hit their rate limit?&lt;/li&gt;
&lt;li&gt;Does the payload look malicious?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any check fails, Sentinel drops the request at the edge. Your Python workers and your LLM never get touched. You don't burn tokens on a user who hasn't paid.&lt;/p&gt;
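&lt;p&gt;A toy version of that gate might look like this (in-memory stores and thresholds are illustrative stand-ins for the Redis and Neon lookups the real gateway would make):&lt;/p&gt;

```python
import time

# Illustrative in-memory stores; the real Sentinel consults Redis and Neon.
ACCOUNTS = {"key-abc": {"active": True}}
RATE_LIMIT = 5            # requests allowed per window
WINDOW_S = 60.0
request_log = {}          # api_key -> recent request timestamps

def validate_at_edge(api_key, payload):
    """Return (ok, reason); reject cheaply before any GPU time is spent."""
    account = ACCOUNTS.get(api_key)
    if account is None:
        return False, "invalid api key"
    if not account["active"]:
        return False, "subscription inactive"
    now = time.monotonic()
    recent = [t for t in request_log.get(api_key, []) if WINDOW_S > now - t]
    if len(recent) >= RATE_LIMIT:
        return False, "rate limited"
    if "script" in payload.lower():   # toy stand-in for a payload sanity check
        return False, "payload rejected"
    recent.append(now)
    request_log[api_key] = recent
    return True, "ok"
```

&lt;p&gt;Every branch that returns False is a request the LLM never sees.&lt;/p&gt;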

&lt;ol start="4"&gt;
&lt;li&gt;The Bridge Between Frontend and AI
My Next.js dashboard doesn't talk directly to the AI. It talks to Sentinel.
When a user clicks "Process this document" on the frontend, the request hits Sentinel. Sentinel writes the job to the database, pushes it to the queue, and returns a job ID. The frontend shows a loading bar. Python picks up the job, runs inference, writes the result. The frontend polls for completion.
This pattern — sometimes called the "async job" pattern — is what makes AI-powered UIs feel responsive even when the underlying work takes 30 seconds. Sentinel is what makes it possible.&lt;/li&gt;
&lt;/ol&gt;
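&lt;p&gt;The async job pattern described above can be sketched end to end (a dict standing in for the Neon job table, a list for the queue; all names hypothetical):&lt;/p&gt;

```python
import uuid

# In-memory stand-ins: a dict for the Neon job table, a list for the queue.
job_table = {}
job_queue = []

def submit_job(document):
    """Gateway side: persist the job, enqueue it, return an ID immediately."""
    job_id = str(uuid.uuid4())
    job_table[job_id] = {"status": "queued", "result": None}
    job_queue.append((job_id, document))
    return job_id

def worker_step():
    """Python side: pick up one queued job and run the slow work."""
    if not job_queue:
        return
    job_id, document = job_queue.pop(0)
    job_table[job_id]["status"] = "done"      # inference would run here
    job_table[job_id]["result"] = "processed " + document

def poll(job_id):
    """Frontend side: what the loading bar asks on each tick."""
    return job_table[job_id]["status"]
```

&lt;p&gt;The frontend only ever sees the job ID and its status; how long the worker takes is invisible to the UI except as a longer loading bar.&lt;/p&gt;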

&lt;p&gt;Why Go and Not Just More Python Workers&lt;br&gt;
I get this question a lot. "Why not just run more FastAPI instances behind nginx?"&lt;br&gt;
You could. But you're still paying the Python interpreter tax on every request. Even under asyncio, each event loop runs on a single thread behind the GIL, and the per-connection interpreter overhead adds up in a way Go's runtime simply doesn't. Go's goroutines are lightweight enough that a single Sentinel instance can handle tens of thousands of concurrent webhook connections on modest hardware.&lt;br&gt;
More importantly: the operational model is simpler. One Go binary at the edge. Any number of Python workers behind it. You scale the workers independently based on AI load, not on webhook volume.&lt;br&gt;
They're solving different problems. Let them be different services.&lt;/p&gt;

&lt;p&gt;The Stack&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go — edge proxy, webhook ingress, routing, validation&lt;/li&gt;
&lt;li&gt;Python — LLM inference, tool execution, business logic&lt;/li&gt;
&lt;li&gt;Redis (Upstash) — async job queue between the two layers&lt;/li&gt;
&lt;li&gt;Neon (PostgreSQL) — job persistence, audit trail&lt;/li&gt;
&lt;li&gt;Docker Compose — runs the whole thing locally, single command&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repo is here: &lt;a href="https://github.com/Anhsirkm/sentinel" rel="noopener noreferrer"&gt;github.com/Anhsirkm/sentinel&lt;/a&gt;&lt;br&gt;
It's early. No production hardening yet. But the architecture is the point — the separation of concerns between the fast layer and the smart layer is what makes AI agent infrastructure actually work under real load.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
The single most useful mental shift was this: your AI agent is not one service. It's two.&lt;br&gt;
One service that has to be fast. One service that has to be smart. They should never be the same process, and they should only talk to each other through a queue.&lt;br&gt;
Once I split them, the timeouts disappeared. The retries stopped stacking. The GPU stopped spiking. The system became predictable.&lt;br&gt;
If you're building any kind of AI agent backend that receives webhooks from external platforms, I'd seriously consider this pattern before you go deeper on the Python side. The edge layer is not glamorous, but it's what keeps everything else alive.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Anhsirkm" rel="noopener noreferrer"&gt;
        Anhsirkm
      &lt;/a&gt; / &lt;a href="https://github.com/Anhsirkm/sentinel" rel="noopener noreferrer"&gt;
        sentinel
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Sentinel: A high-concurrency reverse proxy that intelligently heals broken webhook schemas in transit using a Go execution engine and Python logic layer.
    &lt;/h3&gt;
  &lt;/div&gt;
&lt;/div&gt;


</description>
    </item>
  </channel>
</rss>
