There are two kinds of production incidents.
The first kind gives you signals. Metrics drift slightly off baseline. Latency edges upward. Dashboards turn yellow long before anything turns red. You have time to reason about it.
The second kind doesn’t negotiate. It lets you sleep peacefully and then informs you in the morning that your server died multiple times overnight.
This was the second kind.
The Setup
We’re building a voice agent platform.
Calls come in from users. Audio streams over WebSocket. We integrate with Twilio for real-time media streams. AI agents process the conversation, decide what to say next, and occasionally invoke tools. Some of those tools query our database to fetch context or perform actions.
Architecturally, nothing unusual. A fairly standard real-time pipeline: streaming input, AI orchestration, tool execution, database lookups.
And everything had been working fine.
Then one night, one of our Kubernetes pods limited to 1GB of memory started crashing repeatedly. There was no deployment. No configuration change. No obvious traffic spike. No infrastructure event. Just restarts.
That’s always unsettling. When nothing changed, but something clearly broke.
The First Suspect: Streaming
When memory spikes in a real-time system, your instinct immediately points to streaming.
WebSockets can buffer unexpectedly. Audio chunks might accumulate if something downstream slows down. Garbage collection might not keep up under bursty traffic. Maybe some array was growing quietly in memory.
All of those were reasonable hypotheses.
We spun up a test environment and tried to simulate the issue. We created parallel calls. We streamed audio continuously. We monitored memory closely, expecting to see the same runaway pattern.
Nothing happened.
Memory usage remained stable. The heap grew and shrank normally. No vertical spikes. No crashes.
That almost made it worse. Because in production, it was reproducible, just not consistently. During US night hours, when traffic was low, we triggered calls manually, and sometimes we could reproduce the crash. Other times, everything behaved perfectly.
Intermittent, probabilistic failures are far harder to reason about than deterministic ones.
Heap Snapshots and False Leads
Next, we went for heap snapshots using V8 and Chrome DevTools. If something large was being retained, the snapshot would reveal it.
process.on("SIGUSR1", () => {
  console.log("Received SIGUSR1 event on proc. Executing heap snapshot.");
  const fs = require("fs");
  const v8 = require("v8");
  function writeHeapSnapshot(filename = `/profiling/heap-${Date.now()}.heapsnapshot`) {
    const snapshotStream = v8.getHeapSnapshot();
    const fileStream = fs.createWriteStream(filename);
    snapshotStream.pipe(fileStream);
    // pipe() is asynchronous: log only once the file has actually been written
    fileStream.on("finish", () => {
      console.log(`Heap snapshot saved as ${filename}`);
    });
  }
  writeHeapSnapshot();
});
We added a signal handler to our Node.js process so we could trigger heap snapshots on demand. The plan was simple: wait until memory rose, send the signal, capture the snapshot, and analyze it offline.
There’s a catch, though. Generating a heap snapshot requires additional memory. If your pod is already close to its limit, the snapshot process itself can push it over.
That’s exactly what happened.
Sometimes the pod crashed before the snapshot completed. Other times it succeeded, but the analysis didn’t reveal anything clearly catastrophic. We saw objects. We saw JSON structures. We saw logs. But nothing that screamed “this is it.”
We compared multiple snapshots: normal state versus spike moments. The differences weren’t obvious enough to explain a near 1GB allocation.
Meanwhile, the day was progressing. It was evening in India. Which meant it was morning in the US.
Traffic was about to return.
Watching It Happen Live
As calls started coming in, we stopped theorizing and simply watched production.
There’s something tense about staring at live memory graphs when you know a crash is possible.
At first, everything looked normal. Heap usage was steady. CPU was fine. Calls were connecting. Conversations were flowing.
One call completed. No issue.
Another started. Still stable.
A few more came in. The graph moved slightly, but within normal range. For a moment, we thought maybe the issue had somehow resolved itself.
Then it happened.
The memory line didn’t drift upward gradually. It didn’t climb in a smooth curve. It jumped. A sharp vertical spike as if a massive object had been allocated in a single operation.
Within seconds, the pod was OOM-killed and terminated.
Restarting.
This wasn’t a leak accumulating over time. This was a sudden allocation.
Scaling Didn’t Save Us
Under pressure, and with leadership understandably concerned, we tried the obvious mitigation: horizontal scaling.
If one pod was overloaded, maybe splitting the traffic would help. So we spun up an additional pod and routed traffic between them.
The assumption was simple: less load per instance means less memory pressure.
It didn’t help.
Both pods eventually crashed.
That clarified something important. Scaling helps when the issue is cumulative load. It does not help when a single request is catastrophic. If one request allocates hundreds of megabytes, any pod that processes that request will fail independently.
The problem wasn’t load distribution. It was logic.
Observing Memory in Real Time
Instead of relying on snapshots, I added periodic memory logging directly in the application. Node.js exposes memory usage metrics like rss, heapTotal, heapUsed, external, and arrayBuffers. We logged them every few seconds.
If you’re a Node.js developer you may already know what each of these means; if not:

- rss (Resident Set Size): the total memory allocated to the process in main memory, including the heap, stack, and code segments.
- heapTotal: the total size of the allocated memory heap, managed by the V8 engine, which stores objects, strings, and closures.
- heapUsed: the memory actually in use within heapTotal. This is often the most relevant metric for identifying memory leaks in the JavaScript code itself.
- external: memory used by C++ objects that are bound to JavaScript objects managed by V8.
- arrayBuffers: memory allocated for ArrayBuffers and SharedArrayBuffers, which is also included in the external value.
So I added the code below to log memory usage over time:
setInterval(() => {
  const memory = process.memoryUsage();
  // Log each metric (rss, heapTotal, heapUsed, external, arrayBuffers) in MB
  for (const key of Object.keys(memory)) {
    console.log(`[MEMORY] ${key}: ${(memory[key] / 1024 / 1024).toFixed(2)} MB`);
  }
}, 5000);
Our expectation was that if streaming were the issue, we would see external or arrayBuffers increase.
But surprisingly, it was heapUsed that grew, and the pattern was consistent.
Then suddenly, heapUsed would spike dramatically, hundreds of megabytes in a short window, and the pod would be killed.
This ruled out slow leaks and Twilio audio stream buffers. Garbage collection wasn’t failing. Something large was being allocated all at once.
The Pattern
Eventually, one of our developers noticed something interesting in the logs around the spike.
A tool call.
More specifically, a tool call with an empty object as parameters.
Our AI agents can invoke tools. One of those tools performs a database search and expects a required parameter to filter the collection. What we saw in the logs was an empty object:
params: {}
At first glance, it didn’t look dangerous. It was syntactically valid. It didn’t throw an error. The function executed normally.
But that empty object changed everything. After receiving the params, we queried the database with them directly:
db.collection.find(params)
Our collection contained around one million documents.
When you execute find({}) in MongoDB, you are not asking for nothing. You are asking for everything.
MongoDB did exactly what we requested. It returned all documents.
The Node.js driver then deserialized those documents into JavaScript objects in memory before our code could process them. That meant potentially hundreds of megabytes being allocated almost instantly.
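A back-of-envelope estimate makes the scale concrete. The average document size here is an assumption for illustration, not something we measured:

```javascript
// Hypothetical sizing: why an empty filter can blow through a 1GB pod.
// BSON documents often deserialize into even larger JavaScript objects,
// so this is a lower bound.
const docs = 1_000_000;       // collection size from the incident
const avgBytesPerDoc = 500;   // assumption, not measured
const totalMB = (docs * avgBytesPerDoc) / 1024 / 1024;
console.log(totalMB.toFixed(0)); // ≈ 477 MB, before driver overhead
```

Even at a modest 500 bytes per document, a full-collection read approaches half the pod's entire memory limit in one allocation.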
Inside a pod limited to 1GB.
The vertical memory spike finally made sense.
This wasn’t a memory leak. It wasn’t streaming buffers accumulating. It wasn’t garbage collection lag. It was a full-collection query triggered by an empty filter.
Why It Was So Hard to Reproduce
It didn’t happen on every call.
Only one agent had access to that tool. Only certain conversation flows triggered it. Only when the AI decided the tool was relevant. And only when the model generated an empty object instead of a properly populated parameter set.
Unless that exact probabilistic sequence occurred, the system behaved perfectly.
Traditional bugs are deterministic. Given the same input, you get the same output.
AI-integrated systems introduce probabilistic behavior. The model didn’t crash the server directly. It generated a syntactically valid tool call that was semantically unsafe. And we trusted it.
That trust was the real bug.
The Fix
Once understood, the fix was straightforward.
We added strict schema validation before executing any tool call. If required parameters were missing, the call was rejected immediately. Empty filters were explicitly disallowed. We chose to fail fast instead of querying blindly.
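A minimal sketch of that guard (function and field names are illustrative, not our actual tool schema):

```javascript
// Reject tool calls with missing required fields or an empty filter,
// before anything reaches the database. Fail fast instead of querying blindly.
function validateToolParams(params, requiredFields) {
  if (!params || Object.keys(params).length === 0) {
    throw new Error("Tool call rejected: empty filter");
  }
  for (const field of requiredFields) {
    if (params[field] === undefined || params[field] === null) {
      throw new Error(`Tool call rejected: missing required field "${field}"`);
    }
  }
  return params;
}
```

As defense in depth, capping results with `.limit()` or iterating a cursor instead of materializing the whole result set also bounds worst-case memory, even if a bad filter slips through.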
There was no infrastructure change. No scaling adjustment. No tuning of garbage collection.
Just validation.
After that, the crashes stopped. It took us almost 24 hours to find and fix.


