Before We Begin
We need to get one distinction out of the way first, because it has significant implications for everything that follows.
In the GenAI world, a Chatbot and an Agent are very different things, yet the industry keeps using the two words interchangeably, which leads to architectural chaos.
A Chatbot is simple. It's text in, text out, send it to an LLM, and send the text back out. It's mostly stateless. It's mostly synchronous. It's just another web service, and you can go to bed at night with a clear conscience.
An Agent is a different thing entirely. It needs to:
- Plan: Break a vague request like "review this PR" into multiple steps, figure out what files to fetch, what context to gather, what order to analyze things in.
- Act: Use tools like GitHub API, Slack, databases, code interpreters, etc., to fetch actual information from the world.
- Remember: Preserve the state of its reasoning across many iterations, many tool calls, and possibly many minutes.
The chatbot is like the waiter, who simply takes the order and brings the food. The agent, meanwhile, is like the entire kitchen, which not only prepares the food but also decides what to cook, sources the ingredients, and at the same time remembers that table 4 has a peanut allergy.
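The plan/act/remember loop described above can be sketched in a few lines of Python. Everything here is illustrative: the planner, the tool table, and the step names are stand-ins, not a real agent framework.

```python
# Minimal plan/act/remember loop. All names are illustrative stand-ins.

def plan(request: str) -> list[str]:
    # A real planner would call an LLM to decompose the request;
    # here we hard-code the steps for a PR review.
    return ["fetch_diff", "analyze", "write_review"]

TOOLS = {
    "fetch_diff": lambda memory: "diff: +42 -7 lines",
    "analyze": lambda memory: f"analysis of ({memory[-1]})",
    "write_review": lambda memory: f"review based on ({memory[-1]})",
}

def run_agent(request: str) -> list[str]:
    memory: list[str] = []            # Remember: state carried across steps
    for step in plan(request):        # Plan: decompose the request
        result = TOOLS[step](memory)  # Act: call a tool
        memory.append(result)
    return memory

history = run_agent("review this PR")
```

The key property, and the one that causes all the trouble later, is that `memory` must survive across every step of the loop, however long the loop runs.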
We didn’t fully appreciate the difference between the two when we began, and it ended up costing us.
The Backstory: How We Got Here
This project started the way most good engineering stories do, with curiosity and a slightly naive sense of confidence.
We wanted to build an AI agent for a real application. Not a toy. Not a tutorial chatbot. Something that actually does work in the real world.
The idea was simple: build a GitHub PR Review Agent. A system that takes a URL for a GitHub pull request, reads the code changes, thinks about them a bit, and generates a thoughtful review of those changes, like a junior developer who's incredibly thorough and never gets tired.
We weren't trying to build a platform. We weren't trying to solve a set of distributed systems problems. We were just trying to see if an AI agent could actually be useful to a developer in their day to day workflow. And if it could, how difficult would it be to let other people use it?
Well, the answer to the first question was a definitive yes.
The answer to the second question... not so much.
It Works on My Machine
We went with a stack that any developer would recognize:
- Backend Framework: Django, battle-tested, familiar, fast to build with.
- Agent Framework: LangGraph, gave us the graph-based execution model we needed for multi-step reasoning.
- State Management: In-memory Python objects, a simple memory = [] list that stored conversation history.
- Deployment: A single container running inside a Kubernetes cluster.
The design philosophy was dead simple: keep a direct, open line of communication between the user's browser and the agent.
Here's how the happy path was supposed to work:
- The user pastes a GitHub PR link.
- The browser opens a persistent HTTP connection to our Django API.
- The API holds that connection open while the LangGraph agent does its thing: fetches the code diff, reads the files, reasons about the changes, waits for LLM tokens to stream back.
- After about 45 seconds to a minute, the review is complete.
- The API sends the JSON response, the connection closes, and the user reads a clean, structured code review.

Figure 1: The Intended V1 Architecture. A simple, synchronous flow designed for low complexity.
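In code, the V1 request path amounted to a single blocking call. The sketch below is a simplified stand-in (no Django, no real LLM; the function names and the fake PR URL are ours) that shows the shape of the problem: the handler cannot return until the entire agent run finishes.

```python
import time

def run_pr_review(pr_url: str) -> dict:
    # Stand-in for the LangGraph run: fetch the diff, reason, stream tokens.
    # The real run took 45-60 seconds; we sleep briefly to keep this runnable.
    time.sleep(0.05)
    return {"pr": pr_url, "review": "Looks good, but add tests."}

def handle_request(pr_url: str) -> dict:
    # V1: the HTTP connection stays open for the entire call below.
    # If the client or ingress times out first, this work is orphaned.
    return run_pr_review(pr_url)

response = handle_request("https://github.com/example/repo/pull/1")
```

Locally, that blocking call is invisible. Behind a browser and an ingress, it becomes the whole story.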
On our local machines, this was flawless. And we don't mean mostly worked, we mean genuinely flawless.
- Latency: Who cares? Local terminals do not time out after 60 seconds.
- State: Because we never restarted the Python process, the memory = [] variable retained conversation history perfectly.
- Concurrency: We were using it one user at a time. One request, one response, no issues.
So we packaged the code into a Docker image, pushed it to our Kubernetes cluster, attached a standard Horizontal Pod Autoscaler (HPA) to handle traffic spikes, and shared the link with a few friends to test it out.
Our reasoning was simple:
If traffic increases, HPA will notice the CPU load and spin up more pods.
That's what it's designed for.
That's what it does for web applications everywhere.
The Incident: 10 Users and a System Breakdown
After deploying the V1 container, we shared the application with a small group of friends to validate the idea. The expectation was modest traffic and predictable behavior.
What followed was our first real production incident.
As users started submitting GitHub pull request URLs at roughly the same time, the application began to misbehave in ways we hadn’t expected.
What Actually Happened
Several users accessed the system simultaneously and initiated agent executions almost at the same time.
Each request made a synchronous call to a LangGraph agent responsible for conducting a complete pull request review, a heavy operation that took almost a minute to complete.
However, the HTTP connections used for making the requests were subject to browser and ingress timeouts.
Before the completion of the agent execution, the connections timed out, and users encountered timeout errors.
From the user's point of view, the request had failed.
However, from the system's point of view, the agent executions were still in progress.
The users, assuming the failure was temporary, attempted to make their requests again.
Each request made a new agent execution while the previous ones continued running in the background.
Within minutes, a single pod was handling far more concurrent agent executions than it was designed for.
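This failure mode is easy to reproduce in miniature. The simulation below (illustrative numbers, no real network) counts how many executions pile up when clients time out before the agent finishes and then retry:

```python
# Illustrative simulation of the timeout-and-retry cascade.
AGENT_DURATION = 60   # seconds a single agent run takes
CLIENT_TIMEOUT = 30   # seconds before the browser/ingress gives up
RETRIES = 2           # times each user retries after seeing a "failure"
USERS = 10

running = 0
for _ in range(USERS):
    # The first attempt times out client-side but keeps running server-side,
    # because an agent run outlives the connection that started it.
    running += 1
    # Each retry starts a fresh execution while the old ones are still alive.
    for _ in range(RETRIES):
        if CLIENT_TIMEOUT < AGENT_DURATION:
            running += 1

print(f"{USERS} users -> {running} concurrent agent executions")
```

Ten users and two retries each is thirty concurrent minute-long agent runs on one pod, with nobody waiting for two thirds of them.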
The Impact
The consequences were immediate and visible across multiple layers of the system.
Unbounded LLM usage
Orphaned agent executions continued making LLM calls even though no client was waiting for the result, rapidly consuming our API quota.
Resource exhaustion
The growing number of concurrent executions pushed the Python process beyond its memory limits, causing the container to be terminated by the Kubernetes runtime.
State loss
When the pod restarted, all in-memory agent state was wiped.
Users who managed to reconnect attempted follow-up interactions, only to discover that the agent had lost all context.
At this point, the system was functionally unusable.
Root Cause Analysis: Why this Failed
After the incident, we analyzed why a design that worked in isolated testing failed almost immediately under real use. The failure was not caused by a single bug, but by a set of architectural mismatches between AI workloads and standard web service patterns.
Flaw 1: Long Running Work Tied to HTTP
Assumption: HTTP is suitable for serving AI agents.
Reality: Agent execution is long running, often taking tens of seconds, while HTTP connections are short lived and timeout-prone.
Impact: When requests timed out, agent executions continued running in the background, consuming resources without any active client or recovery mechanism.
Flaw 2: Autoscaling Blind to Agent Load
Assumption: Kubernetes autoscaling would handle increased traffic automatically.
Reality: Agent workloads are primarily I/O bound while waiting for LLM responses, resulting in low CPU utilization even under heavy load.
Impact: The Horizontal Pod Autoscaler failed to scale because CPU usage appeared idle, despite a growing backlog of requests.
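The blindness is visible in the HPA's own scaling formula: Kubernetes computes desired replicas as ceil(currentReplicas * currentMetric / targetMetric). With an I/O-bound worker idling at, say, 8% CPU against a 70% target, the math says there is nothing to scale, no matter how deep the backlog is. (The numbers below are illustrative.)

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_cpu: float,
                         target_cpu: float) -> int:
    # The standard HPA formula: ceil(current * metric / target),
    # floored at one replica here for simplicity.
    return max(1, math.ceil(current_replicas * current_cpu / target_cpu))

# Agents blocked on LLM responses barely use CPU...
replicas = hpa_desired_replicas(current_replicas=1, current_cpu=8.0, target_cpu=70.0)
# ...so the HPA sees no reason to add pods, backlog or not.
```

The autoscaler was not broken; it was faithfully scaling on a metric that does not measure agent load.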
Flaw 3: Volatile In-Memory State
Assumption: In-memory state is sufficient for maintaining agent context.
Reality: Kubernetes pods are ephemeral and expected to restart.
Impact: Storing state in memory causes all agent context to be lost during crashes or restarts, breaking continuity and recovery.
V1 failed because it treated AI agents like traditional web requests. Long running execution, invisible load, and volatile state created a fragile system that could not survive real concurrency.
The Paradigm Shift: From Web App to Distributed System
The failure of V1 made one thing clear: this was not a bug-level problem. The architecture was simply wrong.
We had treated the AI agent as a traditional web application, where a request is handled synchronously from start to finish. In reality, agent execution is a long running, failure prone process that does not fit within the lifecycle of an HTTP request.
To move forward, we adopted an asynchronous job model commonly used in large scale systems. Instead of executing agent logic inline with user requests, the system now accepts work, queues it, and processes it independently.
This approach gave us three guiding principles for the V2 design:
Decouple ingestion from execution
We split the monolith.
- The API (Django): Dumb, fast, and only handles HTTP.
- The Worker (LangGraph): Smart, slow, and only talks to the Queue.
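The split can be sketched in a few lines, with queue.Queue standing in for Redis and a thread standing in for the worker pod (all names here are ours, not the production code):

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()   # stand-in for the Redis queue
results: dict[str, str] = {}        # stand-in for a result store

def api_submit(pr_url: str) -> str:
    # The API is dumb and fast: enqueue the work and return a job id
    # immediately, well inside any HTTP timeout.
    job_id = f"job-{jobs.qsize() + len(results)}"
    jobs.put((job_id, pr_url))
    return job_id

def worker() -> None:
    # The worker is smart and slow: it only talks to the queue,
    # and no HTTP connection depends on how long it takes.
    while True:
        job_id, pr_url = jobs.get()
        results[job_id] = f"review of {pr_url}"  # stand-in for the agent run
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
job = api_submit("https://github.com/example/repo/pull/1")
jobs.join()  # in production the client polls for the result instead
```

The crucial change is that a timed-out browser no longer kills or orphans anything: the job id is the contract, and the work proceeds regardless.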
Externalize all agent state
We adopted Checkpointed State.
Every time the agent takes a step, it saves its entire memory to a CloudNativePG-managed PostgreSQL cluster. If the pod crashes, a new pod reads the database and continues exactly where the old one left off.
Scale based on work, not resource utilization
We moved to Event Driven Autoscaling (KEDA).
We scale based on the length of the Redis Queue.
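Queue-based scaling reduces to one line of arithmetic: desired pods follow the queue depth, the way a KEDA Redis-list scaler tracks its list-length target. The target and cap below are illustrative, not our production values.

```python
import math

def desired_workers(queue_length: int,
                    jobs_per_worker: int = 5,
                    max_workers: int = 20) -> int:
    # Scale on pending work: one worker per `jobs_per_worker` queued jobs,
    # capped at `max_workers`.
    if queue_length == 0:
        return 0  # with no pending work, workers can scale to zero
    return min(max_workers, math.ceil(queue_length / jobs_per_worker))
```

Unlike the CPU-based HPA, this metric cannot be fooled by workers that sit idle waiting on LLM responses: if jobs are queued, pods come up; if the queue is empty, they go away.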
These principles directly informed the V2 architecture described next.
What Comes Next
V1 showed that AI agents require a different architecture than traditional web applications.
In the next part of this series, we’ll walk through the V2 architecture and how we addressed these issues using queue-based execution, checkpointed state, and event-driven autoscaling to make the system production ready.