Kaushik Maram
Stop Treating AI Agents as Web Servers: A Kubernetes Survival Guide - Part 2

If you haven’t read Part 1, I’d recommend starting there.

Shanmukhi Kairuppala did a great job breaking down why the initial approach failed; this post builds on that and focuses on what we changed next.

In Part 1, we made a familiar mistake:

We treated AI agents like web servers.

A synchronous, request-response model worked fine at small scale, but broke the moment real-world usage hit.

Not because of infrastructure,
but because agents don’t behave like APIs.

They are stateful, long-running systems, and our architecture wasn’t built for that.

So we stopped trying to scale the system
and started questioning the model.

If agents aren’t request-response systems,
why are we treating them like one?

That question forced us to rethink everything.

This is where V2 architecture begins.

The V2 Architecture

The Event-Driven V2 Architecture

The V2 Implementation: Four Pillars of Resilience

The V2 architecture is built around four core components. Each directly addresses a specific failure mode observed in V1, and together they form a resilient, production-ready agent platform.

Pillar 1: Asynchronous Ingestion (Redis)

Problem:
Long-running agent execution exceeded browser and ingress timeouts, causing failed requests.

Solution:
We introduced a Redis-backed task queue to fully decouple HTTP ingestion from agent execution.

How it works:

  1. The user submits a request.
  2. The API enqueues the job in Redis and immediately returns 202 Accepted with a task_id.
  3. The client polls a status endpoint to track progress.
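The three steps above can be sketched as plain functions. This is a minimal illustration of the pattern, not our actual service: an in-memory list and dict stand in for the Redis queue and status hash, and the function names (`submit_request`, `poll_status`, `worker_step`) are hypothetical.

```python
import uuid

# In-memory stand-ins for the Redis structures (illustration only;
# production code would use a Redis list for the queue and a hash for status).
task_queue = []      # pending jobs, like an RPUSH'd Redis list
task_status = {}     # task_id -> {"state": ..., "result": ...}

def submit_request(payload):
    """API handler: enqueue the job and return immediately with 202 Accepted."""
    task_id = str(uuid.uuid4())
    task_status[task_id] = {"state": "queued", "result": None}
    task_queue.append((task_id, payload))
    return {"status": 202, "task_id": task_id}

def poll_status(task_id):
    """Status endpoint the client polls while the agent works."""
    return task_status.get(task_id, {"state": "unknown"})

def worker_step():
    """One iteration of a worker loop: pull a job off the queue and run it."""
    if not task_queue:
        return
    task_id, payload = task_queue.pop(0)
    task_status[task_id]["state"] = "running"
    # ... long-running agent execution happens here ...
    task_status[task_id]["result"] = f"reviewed: {payload}"
    task_status[task_id]["state"] = "done"
```

The key property is that `submit_request` does no agent work at all, so the HTTP response time is independent of how long the agent runs.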

Result:
User requests return in milliseconds, while agent execution proceeds independently. HTTP timeouts are eliminated, regardless of agent runtime.

Pillar 2: Checkpointed Agent State (CloudNativePG)

Problem:
The agent's memory was lost every time a pod crashed or restarted. On top of that, AI agents produce bursty traffic, opening hundreds of short-lived database connections in seconds and risking exhaustion of a standard PostgreSQL instance's connection limit.

Solution:
We used CloudNativePG (CNPG), a Kubernetes operator that automates the management of our PostgreSQL cluster and offers native connection pooling.

How it works:

  • State Checkpointing: After every meaningful step in the agent graph (e.g., tool call, generation), the worker persists the full conversation state to the CNPG-managed cluster.
  • Native Pooling: We enabled CNPG’s built-in PgBouncer integration, which lets the database absorb thousands of concurrent, short-lived connections from our auto-scaling AI workers without exhausting backend connection limits.
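The checkpointing side of this can be sketched in a few lines. This is a simplified illustration, not our production code: Python's built-in sqlite3 stands in for the CNPG-managed PostgreSQL cluster, and the table/function names are hypothetical; only the save-after-every-step, load-on-restart pattern is the point.

```python
import json
import sqlite3

# sqlite3 stands in for the CNPG-managed PostgreSQL cluster (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkpoints (task_id TEXT PRIMARY KEY, state TEXT)")

def save_checkpoint(task_id, state):
    """Persist the full conversation state after each meaningful agent step
    (tool call, generation), overwriting the previous checkpoint."""
    conn.execute(
        "INSERT INTO checkpoints (task_id, state) VALUES (?, ?) "
        "ON CONFLICT(task_id) DO UPDATE SET state = excluded.state",
        (task_id, json.dumps(state)),
    )
    conn.commit()

def load_checkpoint(task_id):
    """A replacement pod calls this on startup to resume from the last checkpoint."""
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE task_id = ?", (task_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Because every step is checkpointed, the worst a crash can lose is the single step in flight.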

Result:
The system is now resilient to pod failure. If a worker pod dies, a new pod retrieves the latest checkpoint from the cluster and continues running with no loss of context.

Pillar 3: Event-Driven Autoscaling (KEDA)

Problem:
CPU-based autoscaling failed to reflect real agent load during I/O-bound LLM waits.

Solution:
We adopted KEDA to scale workers based on queue depth rather than resource utilisation.

How it works:

  • KEDA monitors the Redis queue length.
  • Scaling rule: one worker per pending job.
  • When the queue is empty, workers scale down to zero.
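The scaling rule above reduces to a one-line function. This is an illustration of the policy KEDA applies for us, not KEDA itself; the function name and the cap value are assumptions.

```python
def desired_replicas(queue_length, max_replicas=50):
    """KEDA-style rule: one worker per pending job, capped at max_replicas,
    scaling all the way down to zero when the queue is empty."""
    return min(queue_length, max_replicas)
```

In real deployments this policy lives in a KEDA ScaledObject with a Redis scaler, where `listLength` plays the role of `queue_length` and `maxReplicaCount` the role of the cap.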

Result:
The system scales precisely with demand and incurs zero compute cost when idle.

Pillar 4: Secure Tool Governance (MCP)

Problem:
Granting agents direct access to infrastructure credentials posed significant security risks.

Solution:
We implemented Model Context Protocol (MCP) to isolate tool execution from agent reasoning.

How it works:

  • The agent never holds sensitive credentials.
  • Tool containers (e.g., GitHub, Slack) execute privileged actions after validating structured requests.
  • Each tool enforces strict scope and access controls.
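The tool-side gate can be sketched as a simple scope check. This is a minimal illustration of the governance idea, not the MCP wire protocol: the `ALLOWED_SCOPES` policy, the action names, and `execute_tool` are all hypothetical, and the credentials never appear on the agent side at all.

```python
# Per-tool allowlist of permitted actions (assumed policy, for illustration).
ALLOWED_SCOPES = {
    "github": {"read_pr", "comment_pr"},
    "slack": {"post_message"},
}

def execute_tool(tool, action, args):
    """Runs inside the tool container, which alone holds credentials.
    Validates the structured request against the scope policy before
    performing any privileged action."""
    if action not in ALLOWED_SCOPES.get(tool, set()):
        raise PermissionError(f"{tool}:{action} is out of scope")
    # ... the actual privileged API call would happen here ...
    return {"tool": tool, "action": action, "ok": True}
```

Even if a prompt injection convinces the model to request a destructive action, the request fails at this gate because the action is simply not in the tool's scope.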

Result:
Even in the presence of prompt injection or model misuse, critical resources remain protected.

Together, these four pillars transform a fragile, synchronous prototype into an event-driven, fault-tolerant agent platform designed for real-world Kubernetes environments.

The Results: Validating the V2 Architecture

After deploying the event-driven V2 platform, we recreated the same load conditions that broke the V1 system, aiming to measure robustness under concurrency and active failure.

Scenario A: Concurrent Load Handling
The system was hit with an immediate spike of ten concurrent pull request reviews, the exact threshold that caused complete failure in V1.
In V2, the API tier did not hold HTTP connections open. Each request was immediately acknowledged with a 202 Accepted status and a tracking ID, decoupling the user from backend processing. KEDA sensed the rise in Redis queue depth and created ten worker pods to process the load concurrently. All tasks completed successfully without timeouts, and users monitored progress asynchronously through the status endpoint.

Scenario B: Active Crash Recovery
To test the durable-state hypothesis, we killed an executing worker pod in the middle of code review generation.
In V1, this would have caused total data loss. In V2, Kubernetes automatically scheduled a replacement pod. The new worker connected to the CloudNativePG-managed cluster, restored the latest conversation checkpoint, and resumed code review generation from exactly where it had stopped. To the user, the crash caused only a brief pause in the output stream.

Here’s how V2 changed the system in practice:

Comparative Analysis: V1 vs. V2

  • Request handling: V1 held HTTP connections open for the full agent run; V2 returns 202 Accepted in milliseconds and processes asynchronously.
  • Agent state: V1 kept conversation state in pod memory, lost on every crash; V2 checkpoints it to CloudNativePG after every meaningful step.
  • Autoscaling: V1 scaled on CPU, blind to I/O-bound LLM waits; V2 scales on Redis queue depth via KEDA, down to zero when idle.
  • Tool access: V1 gave agents direct infrastructure credentials; V2 routes privileged actions through MCP-governed tool containers.

Key Takeaways

If you’re deploying AI agents on Kubernetes, these guidelines will protect you from the most common issues in production:

1. Separate Clocks (Async by Default). User requests and agent processing must never occur on the same timeline. HTTP is short-lived; agent processing is not.

  • Guideline: When processing takes longer than a few seconds, queue the task and return 202 Accepted. Let the client poll or be notified asynchronously.

2. Memory is Ephemeral (Externalise State). Kubernetes pods are cattle; they will die. Any state stored in memory is, by definition, temporary.

  • Guideline: Use a scalable, Kubernetes-native data store (such as CloudNativePG) to manage state persistence. Don’t simply run a database; run an Operator that manages failover for you automatically.

3. Scale on Work, Not CPU (Event-Driven Autoscaling). AI workloads are typically I/O-bound (waiting on LLMs). CPU saturation is not a good indicator of system load.

  • Guideline: Scale workers based on the amount of work pending (Redis queue length), not resource utilisation.

Conclusion

When we set out on this project, we thought the toughest part would be designing the prompt or choosing the model. What we found was that the toughest part was something much more basic: keeping the system up while the agent was thinking.

V1 taught us that AI apps are not web apps with a dash of LLMs. They are distributed systems, complete with long-running tasks, race conditions, retries, and failures. Calling an LLM like a regular function in the middle of an HTTP request may work on your local machine, but it will fail catastrophically under load.

We didn’t solve our system problems by improving the model or adding more resources. We solved them by reaching for the robust primitives that have served production systems for years: queues (Redis), autoscalers (KEDA), and durable storage (CloudNativePG). What we built wasn’t exciting; it was robust.

The moral of the story is this:

AI magic only works when the infrastructure underneath it is boring, predictable, and designed to fail.

→ Read Part 1: Stop Treating AI Agents as Web Servers
