Rocktim M for Zopdev


Systems that Scale Podcast: EP1 (The AI Shift in DevOps and SRE)

Systems That Scale is a podcast about real infrastructure at real scale. Each episode focuses on moments where systems stopped behaving the way their builders expected, and what those moments reveal about reliability, engineering judgment, and the future of operations.

In Episode 1, we speak with Gaurav, CEO of Sherlocks.ai, about operating large production systems during periods of explosive growth, the kinds of failures that never appear in design documents, and why AI is becoming unavoidable in SRE and operations work.


Scale Does Not Break Where You Expect

One of the central themes of the episode is that large-scale systems rarely fail in dramatic ways. Instead, they fail when assumptions quietly expire.

During a period of sustained growth, a production system suddenly stopped accepting new user requests. There were no obvious crashes, no loud alerts, and no immediate indicators pointing to a specific component. The system looked healthy on the surface, yet traffic had effectively stalled.

After extended investigation, the root cause turned out to be a hard limit that most teams never expect to hit: a core database table had exhausted its primary key space at roughly sixty-four million rows, so new records could no longer be inserted. The failure was not caused by a bug or a bad deploy. It was caused by scale outgrowing an early design assumption.
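
The episode does not name the column type behind that limit, so as a purely hypothetical illustration, the short Python sketch below shows the kind of headroom check that catches this failure mode early: compare the current maximum ID against the ceiling of the key column's integer type. The type ceilings are standard integer limits; the example values and threshold are invented.

```python
# Hypothetical headroom check: how close is an auto-increment primary key
# to the ceiling of its column type? Values and threshold are illustrative.

# Maximum value for common signed/unsigned integer key columns.
INT_TYPE_MAX = {
    "smallint": 2**15 - 1,
    "mediumint": 2**23 - 1,
    "int": 2**31 - 1,
    "int unsigned": 2**32 - 1,
    "bigint": 2**63 - 1,
}

def pk_headroom(current_max_id: int, column_type: str) -> float:
    """Fraction of the key space still unused (0.0 means exhausted)."""
    ceiling = INT_TYPE_MAX[column_type.lower()]
    return max(0.0, (ceiling - current_max_id) / ceiling)

if __name__ == "__main__":
    # Example: a signed INT key that has already consumed most of its range.
    remaining = pk_headroom(current_max_id=2_000_000_000, column_type="int")
    print(f"Key space remaining: {remaining:.1%}")
    if remaining < 0.20:
        print("Warning: primary key space is running low; plan a wider key type.")
```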


Why Debugging at Scale Is So Hard

A recurring point in the episode is that debugging at scale is not just about tools. It is about cognitive load.

At high traffic levels, a single incident can involve application code, databases, caches, networks, and deployment systems all interacting in subtle ways. Engineers form hypotheses based on experience, validate them against metrics and logs, discard what does not fit, and repeat the process until the real cause emerges.

This works when systems are small. It becomes painfully slow as complexity grows. Even with good observability, humans struggle to correlate signals across dozens of dashboards and timelines while under pressure.


The Gap AI Is Starting to Fill

The conversation draws a clear distinction between AI used for coding and AI used for operations.

Code generation happens before code reaches production, where mistakes are relatively cheap. SRE work happens in production, where incorrect decisions directly impact users and revenue. That makes accuracy far more important than fluency.

The episode explores why AI for SRE cannot be a simple chatbot. Instead, it needs to behave more like an experienced teammate. It must understand system architecture, track recent changes, observe metrics across subsystems, and continuously test hypotheses against live data.

In one real incident discussed in the episode, an application rollout caused severe memory and network pressure due to inefficient caching behavior. Identifying the true cause manually took hours. An AI SRE system designed to correlate deployments with resource anomalies could have narrowed the issue down in minutes.
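
The episode does not describe specific tooling, so the sketch below is only a minimal, hypothetical Python version of that correlation idea: given recent rollout events and a resource metric time series, flag any deployment that is immediately followed by an abnormal jump relative to the pre-deploy baseline. All service names, timestamps, and thresholds are invented.

```python
# Hypothetical sketch: correlate recent deployments with resource anomalies.
# Service names, timestamps, and thresholds below are made up for illustration.
from datetime import datetime, timedelta
from statistics import mean, stdev

# (timestamp, memory_mb) samples for one service, e.g. pulled from a metrics store.
memory_samples = [
    (datetime(2024, 5, 1, 12, 0) + timedelta(minutes=5 * i), 900 + (i % 3) * 10)
    for i in range(12)
] + [
    (datetime(2024, 5, 1, 13, 0) + timedelta(minutes=5 * i), 2400 + 50 * i)
    for i in range(6)
]

# Recent rollout events: (timestamp, change description).
deployments = [
    (datetime(2024, 5, 1, 9, 30), "payments v2.13 rollout"),
    (datetime(2024, 5, 1, 12, 58), "catalog v4.1 rollout (new cache layer)"),
]

def anomalous_after(deploy_time, samples, window=timedelta(minutes=30), z=3.0):
    """Flag a deploy if post-deploy samples deviate sharply from the prior baseline."""
    before = [v for t, v in samples if t < deploy_time]
    after = [v for t, v in samples if deploy_time <= t <= deploy_time + window]
    if len(before) < 3 or not after:
        return False
    mu, sigma = mean(before), stdev(before) or 1.0
    return any(abs(v - mu) / sigma > z for v in after)

for when, change in deployments:
    if anomalous_after(when, memory_samples):
        print(f"Likely culprit: {change} at {when:%H:%M} (memory spiked shortly after)")
```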


From Reaction to Compression

A key idea in the episode is not replacement, but compression.

AI does not eliminate the need for SREs. It compresses the time between symptom and understanding. Instead of engineers spending hours gathering context, forming guesses, and validating them one by one, AI systems can surface the most likely causes early, allowing humans to focus on decisions rather than detection.
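
As a toy illustration of what "surfacing the most likely causes" could look like (this is not a system described in the episode), the Python sketch below combines a few weak signals per candidate component into a ranked list for an engineer to review first. The candidates, signals, and weights are all hypothetical.

```python
# Hypothetical triage sketch: combine weak signals into a ranked list of
# candidate causes so an engineer reviews the top few instead of everything.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    deployed_recently: bool   # did this component change shortly before the symptom?
    error_rate_delta: float   # increase in error rate, in percentage points
    resource_anomaly: bool    # CPU/memory/network deviation detected

def score(c: Candidate) -> float:
    """Simple weighted score; higher means 'look at this first'."""
    return (
        2.0 * c.deployed_recently
        + 1.5 * c.resource_anomaly
        + 0.1 * c.error_rate_delta
    )

candidates = [
    Candidate("catalog-service", deployed_recently=True, error_rate_delta=4.0, resource_anomaly=True),
    Candidate("payments-service", deployed_recently=False, error_rate_delta=0.2, resource_anomaly=False),
    Candidate("edge-cache", deployed_recently=False, error_rate_delta=1.1, resource_anomaly=True),
]

for c in sorted(candidates, key=score, reverse=True):
    print(f"{score(c):5.2f}  {c.name}")
```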

This shift matters because modern teams are shipping faster than ever. AI-assisted development has increased the number of changes entering production, while SRE capacity has remained relatively flat. Without some form of automation in diagnosis and triage, incidents will only become harder to manage.


Watch the Full Episode

This post captures only part of the discussion.

In the full episode, we also talk about multi-cloud complexity, on-call fatigue, why traditional alerting breaks down at scale, and what skills engineers should focus on as infrastructure work evolves.

Start Your Cloud Cost Saving Journey

Cut cloud costs by up to 45% without sacrificing performance using ZopNight’s automated optimization.

👉 Get Started

👉 Talk to Us
