"AI is not replacing DevOps. It is widening the production surface DevOps teams have to operate: models, prompts, evals, GPU capacity, tool permissions, and rollback paths."
Ram is new to DevOps, but not new to the tools changing it.
He is comfortable with coding assistants, agent demos, GitHub workflows, cloud consoles, and AI-powered operations tools. If a model can explain a failed pipeline, generate Kubernetes YAML, or draft a Terraform module, Ram wants to try it.
That curiosity is useful.
It is also risky if nobody reviews it with production discipline.
That is where Siya comes in.
Siya has been through enough incidents to distrust clean demos. She likes useful automation, but she asks the questions that decide whether something survives production:
- What changed?
- Who approved it?
- Can we roll it back?
- What is the blast radius?
- How much does it cost when traffic doubles?
- What happens when the tool is wrong?
Ram’s question is: can AI make DevOps faster?
Siya’s question is: can we operate AI-assisted DevOps safely?
That is the theme of this series.
DevOps is not ending. The production surface is changing.
The part of DevOps that will shrink
Some DevOps work will absolutely get smaller.
Copying YAML from one repo to another. Writing the first draft of a CI workflow. Explaining a common Kubernetes error. Summarizing logs. Turning a runbook into a checklist.
AI is already useful for that work.
If someone’s entire value is typing commands without understanding the system, that is a fragile place to be. Ram sees that. He is not trying to protect busywork.
But that was never the best version of DevOps.
*The serious part of DevOps was always judgment under production constraints:*
- What changed?
- What is the blast radius?
- Can we roll back?
- Is this secure?
- Why did cost jump?
- What does the dashboard not show?
- Should this automation be allowed to act?
AI does not remove those questions. It adds more of them.
The production surface got bigger
A normal service has failure modes we know how to name: latency, error rate, saturation, bad deploy, expired certificate, broken dependency, runaway logs, surprise cloud bill.
An AI system can fail while looking healthy.
The pod is running. The API returns 200. The GPU is busy. The dashboard is green.
The answer is still wrong.
Or unsafe.
Or too expensive.
Or produced through a tool path nobody approved.
That is a DevOps-shaped problem. It touches release control, observability, security, identity, cost, rollback, and incident response.
The artifact is no longer just a container image. It might include:
- code
- model version
- prompt version
- retrieval index
- evaluation results
- tool permissions
- provider routing
- runtime configuration
If those pieces can change behavior, they belong in the operating model.
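As a sketch, that operating model can mean capturing every behavior-changing piece in one versioned spec in Git. The schema below is illustrative, not a real CRD or any existing tool's format; the point is that model, prompt, evals, tool permissions, routing, and runtime config are all pinned and reviewable together:

```yaml
# Illustrative only: a hypothetical Git-tracked release spec for an AI service.
# None of these field names come from a real tool; what matters is that every
# behavior-changing artifact is versioned in one reviewable place.
apiVersion: example.internal/v1
kind: AIRelease
metadata:
  name: support-assistant
spec:
  image: registry.example.com/support-assistant:1.14.2  # code
  model:
    provider: primary                                   # provider routing
    version: chat-model-2025-06                         # model version
  promptVersion: v37                                    # prompt version
  retrievalIndex: kb-snapshot-2025-06-10                # retrieval index
  evaluation:
    suite: regression-evals-v5
    mustPass: true                                      # evaluation results gate the rollout
  toolPermissions:
    - read:tickets                                      # tool permissions, explicit and auditable
  runtime:
    maxTokensPerRequest: 4096                           # runtime configuration
```

A diff against this file answers Siya's questions directly: what changed, who approved it, and what the rollback target is.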
Why Kubernetes and GitOps still matter
CNCF’s 2025 survey says Kubernetes is already a major production foundation for AI workloads. That tracks with what platform teams are seeing: AI workloads still need scheduling, isolation, rollout control, policy, networking, observability, and cost boundaries.
The details are changing.
Dynamic Resource Allocation matters because accelerators are not normal CPU requests.
Kueue matters because AI and batch workloads need fair queueing.
AI Gateway work matters because inference traffic is not ordinary web traffic.
KServe and llm-d matter because model serving is becoming a distributed systems problem.
GitOps also becomes more important, not less.
For AI systems, desired state has to include more than YAML. It has to answer:
- which model moved?
- which prompt changed?
- which evaluation passed?
- which tool permissions changed?
- which rollback path exists?
Siya’s point to Ram was not “stop experimenting.”
It was: turn the experiment into an operating model.
That is the calmer path.
What I would try first
I would start with read-only DevOps assistance:
- summarize a failed CI job
- explain Kubernetes events
- draft a runbook from existing alerts
- review a Helm chart for obvious mistakes
- compare a Terraform plan against a policy checklist
- build an incident timeline from logs and commits
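The first item on that list can be wired into an existing pipeline without granting any write access. Below is a sketch of a GitHub Actions job that hands a failed build log to a summarizer; `ai-summarize` is a hypothetical CLI standing in for whatever tool you use, while the workflow syntax itself (`if: failure()`, `GITHUB_TOKEN`, `gh pr comment`) is standard:

```yaml
# Sketch: read-only AI assistance on CI failure.
# The assistant reads the log and posts a comment; it changes nothing.
name: ci
on: [pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and test
        run: make test 2>&1 | tee build.log
      - name: Summarize failure (read-only)
        if: failure()
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # ai-summarize is hypothetical: reads the log, writes a summary.
          ai-summarize build.log > summary.md
          gh pr comment "${{ github.event.pull_request.number }}" --body-file summary.md
```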
Then I would learn the production concepts that are becoming unavoidable:
- Kubernetes scheduling and GPU capacity
- GitOps with Argo CD or Flux
- Helm packaging and rollback
- MLOps basics: model registry, evaluation, inference
- AI observability: traces, tokens, tool calls, cost, quality
- agent security: identity, permissions, audit trails
I would not begin by giving an agent production write access.
Read-only first. Sandbox writes second. Production writes only with narrow permissions, approval, receipts, and rollback.
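"Read-only first" is enforceable with standard Kubernetes RBAC rather than policy documents. A minimal sketch, assuming the assistant runs under its own service account (names like `ai-assistant` and `staging` are placeholders; the API group and verbs are real):

```yaml
# Standard Kubernetes RBAC: a namespaced, read-only identity for an assistant.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ai-assistant-readonly
  namespace: staging
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/log", "events", "deployments", "jobs"]
    verbs: ["get", "list", "watch"]   # no create, update, patch, or delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ai-assistant-readonly
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ai-assistant
    namespace: staging
roleRef:
  kind: Role
  name: ai-assistant-readonly
  apiGroup: rbac.authorization.k8s.io
```

Widening from here is a reviewable Git diff to this file, which is exactly the audit trail Siya asks for.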
Ops note: if your current platform cannot explain a normal deploy, it will not explain an AI deploy.
The career signal
This series will talk about careers, but it is not only career advice.
The larger story is how DevOps itself is changing.
DevOps absorbed cloud. It absorbed containers. It absorbed Kubernetes. It absorbed infrastructure as code, GitOps, DevSecOps, platform engineering, observability, and FinOps.
Now it is absorbing AI.
Ram’s instinct is right: DevOps teams should test these tools early.
Siya’s answer is the one I trust:
Someone still has to make these systems deployable, observable, secure, reversible, and affordable.
That work is not disappearing.
It is getting harder.
And if DevOps has always been about making change safer, then AI is not the end of DevOps.
It is the next test.
