Neeraj Kumar Singh Beshane

Posted on • Originally published at neerazz.hashnode.dev

The DevOps Engineer's AI Landscape: AIOps, Self-Healing, and What's Actually Production-Ready

I mapped five domains where AI is changing DevOps — what's ready for production, what's emerging, and what to skip. Here's the landscape, graded by maturity and annotated by a practitioner.


If you're a DevOps or SRE engineer and you've been hearing "AIOps" in every vendor pitch but aren't sure what's real versus what's marketing, you're in the right place.

What we're covering: Five domains where AI is transforming DevOps, with maturity ratings for each tool and a prioritized learning path.

Time investment: ~18 min read | 15–30 hours to work through the resources

The short version: the AIOps market hit $3 billion in 2024, 73% of enterprises will be implementing AIOps by the end of 2026, DevOps engineers with AI skills earn 20–45% more, and 98% of organizations now manage AI spend (up from 31% two years ago). But numbers don't help if you don't know where to start. That's what this post is for.


The 5 Domains Where AI Meets DevOps

Domain 1: AIOps and Intelligent Monitoring

In plain terms: Instead of setting manual alert thresholds ("alert me when CPU > 80%"), AIOps platforms use machine learning to detect unusual patterns across your metrics, logs, and traces. Some can investigate incidents using natural language.

Why it matters: Alert fatigue is real. On-call engineers routinely field hundreds of alerts per week, and most of them are noise. AIOps platforms like Datadog Bits AI and Dynatrace Davis AI correlate signals automatically to surface what actually matters.

What's production-ready vs. what isn't:

  • Production-ready: Datadog Bits AI (now with MCP Server), Dynatrace Davis AI (agentic AI, 12x better than LLM-only), New Relic AI (SRE Agent)
  • Emerging: InsightFinder (AI-native), New Relic Agentic Platform (no-code agent deployment)
  • Experimental: Fully autonomous incident response (no human approval gate)
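To make the threshold-vs-anomaly distinction concrete, here's a toy Python sketch (a rolling z-score, not any vendor's actual model): a service that normally runs at ~85% CPU would page you constantly under a static 80% threshold, while a baseline-aware check only fires on genuine deviation.

```python
from statistics import mean, stdev

def static_alert(cpu, threshold=80.0):
    """Classic threshold alerting: fires on every spike, noisy."""
    return cpu > threshold

def anomaly_alert(history, current, window=30, z_cutoff=3.0):
    """Toy anomaly detection: alert only when the current value
    deviates strongly from the recent baseline (rolling z-score)."""
    recent = history[-window:]
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_cutoff

# A service that normally runs hot: static thresholds cry wolf,
# the baseline-aware check stays quiet until behavior changes.
history = [85.0 + (i % 3) for i in range(60)]    # steady 85-87% CPU
print(static_alert(86.0))            # True  -> alert fatigue
print(anomaly_alert(history, 86.0))  # False -> within normal baseline
print(anomaly_alert(history, 99.0))  # True  -> genuine deviation
```

Real platforms do this across thousands of correlated series, but the principle is the same: the baseline, not a magic number, defines "abnormal."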

GitHub: hashicorp/terraform-mcp-server

The Terraform MCP Server is a Model Context Protocol (MCP) server that provides seamless integration with Terraform Registry APIs, enabling advanced automation and interaction capabilities for Infrastructure as Code (IaC) development.

Features

  • Dual Transport Support: Both Stdio and StreamableHTTP transports with configurable endpoints
  • Terraform Registry Integration: Direct integration with public Terraform Registry APIs for providers, modules, and policies
  • HCP Terraform & Terraform Enterprise Support: Full workspace management, organization/project listing, and private registry access
  • Workspace Operations: Create, update, delete workspaces with support for variables, tags, and run management
  • OTel metrics for monitoring tool usage: integration with OpenTelemetry meters to track tool-call volume, latency, and failures in StreamableHTTP mode

Security Note: At this stage, the MCP server is intended for local use only. If using the StreamableHTTP transport, always configure the MCP_ALLOWED_ORIGINS environment variable to restrict access to trusted origins only. This…

Domain 2: Self-Healing Infrastructure

In plain terms: Infrastructure that detects when something breaks and fixes itself. Kubernetes already does this at a basic level (restarting crashed pods). AI-powered self-healing tries to go further: diagnosing why something broke and applying the right fix across your fleet.

Gartner projects over 60% of large enterprises will adopt self-healing infrastructure by end of 2026. AI models now predict failures with 90%+ accuracy, though the number drops in complex environments.

What's production-ready vs. what isn't:

  • Production-ready: Kubernetes native self-healing (liveness probes, HPA, restart policies)
  • Emerging: Shoreline.io (NVIDIA-owned, fleet-wide auto-remediation)
  • Experimental: Fully autonomous self-healing without predefined runbooks
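The gap between "production-ready" and "experimental" above is essentially two things: a predefined runbook and a human approval gate. A hypothetical sketch of that emerging middle ground (all runbook entries and names are invented for illustration):

```python
# Invented runbook: maps a diagnosed symptom to a known, pre-approved fix.
RUNBOOKS = {
    "OOMKilled": "raise memory limit by 25% and restart pod",
    "CrashLoopBackOff": "roll back to previous image tag",
    "DiskPressure": "prune unused images and rotate logs",
}

def plan_remediation(symptom):
    """Return a known fix, or escalate when no runbook matches."""
    action = RUNBOOKS.get(symptom)
    if action is None:
        return ("escalate", "page the on-call engineer")
    return ("propose", action)

def remediate(symptom, approved_by=None):
    """Apply a fix only behind an explicit human approval gate."""
    decision, action = plan_remediation(symptom)
    if decision == "escalate":
        return f"ESCALATED: {action}"
    if approved_by is None:
        return f"PENDING APPROVAL: {action}"
    return f"APPLIED ({approved_by}): {action}"

print(remediate("OOMKilled"))                       # waits for a human
print(remediate("OOMKilled", approved_by="alice"))  # applies the runbook
print(remediate("SplitBrain"))                      # no runbook -> escalate
```

The "experimental" row is what you get when you delete the runbook dict and the `approved_by` parameter; that's the part not ready for production.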

Domain 3: LLM-Assisted Infrastructure as Code

In plain terms: AI that helps you write, review, and manage your Terraform, Pulumi, or Kubernetes YAML. The newest development: MCP (Model Context Protocol) servers that give AI agents access to your infrastructure documentation and schemas.

What's production-ready vs. what isn't:

  • Production-ready: GitHub Copilot for HCL/YAML, Pulumi AI
  • Emerging: Terraform MCP Server v0.4 (now with Stacks + Sentinel), Docker MCP Server, Pulumi Neo (3 days → 4 hours at Werner Enterprises)
  • Experimental: Autonomous IaC generation from plain English
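While fully autonomous IaC generation stays experimental, one pragmatic pattern is to put a policy gate between model output and `terraform apply`. This toy deny-list is not Sentinel and the patterns are invented; it just illustrates why AI-generated HCL should pass a review step before a human ever sees a plan:

```python
# Toy policy gate for model-generated Terraform. The deny-list is
# illustrative, not a real policy set.
DENY_PATTERNS = {
    '"local-exec"': "arbitrary shell execution in a provisioner",
    '"Action": "*"': "wildcard IAM actions",
    "0.0.0.0/0": "security group open to the world",
}

def review_generated_hcl(hcl_text):
    """Return a list of policy findings; an empty list means pass."""
    return [reason for pattern, reason in DENY_PATTERNS.items()
            if pattern in hcl_text]

snippet = '''
resource "aws_security_group_rule" "ingress" {
  cidr_blocks = ["0.0.0.0/0"]
}
'''
print(review_generated_hcl(snippet))
# -> ['security group open to the world']
```

Real governance belongs in Sentinel, OPA, or your CI checks; the point is the placement of the gate, not the string matching.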

Domain 4: AI Agent Orchestration for Ops

In plain terms: Frameworks for building AI agents that handle operational tasks — like automated incident triage, deployment validation, or cost anomaly investigation. The pattern isn't "AI replaces the on-call engineer." It's "AI does the repetitive diagnostic steps so you start from a hypothesis instead of a blank page."

  • Production-ready: Single-agent automation (ChatOps bots, runbooks)
  • Emerging: CrewAI (450M+ workflows/month), LangGraph Platform (GA, durable execution)
  • Experimental: Autonomous multi-agent ops with no human escalation
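Here's what "start from a hypothesis instead of a blank page" can look like in code. This is a hypothetical single-agent triage sketch, not any framework's API; the check names and incident signals are invented:

```python
# The agent runs the routine diagnostic checks an on-call engineer
# would run first, and hands back a hypothesis, never a fix.
def triage(incident, checks):
    evidence = {name: check(incident) for name, check in checks.items()}
    findings = [name for name, suspicious in evidence.items() if suspicious]
    if not findings:
        return "No obvious culprit; starting from scratch."
    return "Start here: " + ", ".join(findings)

checks = {
    "recent deploy in the last hour": lambda i: i["deploys_last_hour"] > 0,
    "error rate above baseline":      lambda i: i["error_rate"] > 0.05,
    "pod restarts spiking":           lambda i: i["restarts"] > 3,
}

incident = {"deploys_last_hour": 1, "error_rate": 0.12, "restarts": 0}
print(triage(incident, checks))
# -> Start here: recent deploy in the last hour, error rate above baseline
```

Frameworks like CrewAI and LangGraph add the LLM reasoning, tool calling, and durable execution around this loop, but the division of labor (agent gathers evidence, human decides) is the production-safe pattern.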

GitHub: crewAIInc/crewAI

Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.

Open-source multi-AI-agent orchestration framework

Fast and Flexible Multi-Agent Automation Framework

CrewAI is a lean, lightning-fast Python framework built entirely from scratch, completely independent of LangChain and other agent frameworks. It empowers developers with both high-level simplicity and precise low-level control, ideal for creating autonomous AI agents tailored to any scenario.

  • CrewAI Crews: Optimize for autonomy and collaborative intelligence.
  • CrewAI Flows: The enterprise and production architecture for building and deploying multi-agent systems. Enables granular, event-driven control and single LLM calls for precise task orchestration, and supports Crews natively.

With over 100,000 developers certified through our community courses at learn.crewai.com, CrewAI is rapidly becoming the standard for enterprise-ready AI automation.

CrewAI AMP Suite

CrewAI AMP Suite is a comprehensive bundle tailored for organizations that require secure, scalable, and easy-to-manage agent-driven automation.

You can try one part of the suite, the Crew Control Plane.

Domain 5: AI Cost Optimization and FinOps

In plain terms: AI-powered tools for managing your cloud bill. This matters more now because GPU workloads (for AI training and inference) cost significantly more than traditional CPU workloads and don't follow the same optimization patterns.

The FinOps Foundation's 2026 report found 98% of organizations now managing AI spend, up from 31% two years ago. AI cost management is the #1 skillset teams need to develop.

  • Production-ready: Infracost (cost estimates in PRs, 3,000+ companies), Kubecost
  • Emerging: Infracost AI for FinOps (300 cost issues fixed in 2 weeks)
  • Experimental: Autonomous budget management, real-time predictive cost optimization
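To see why GPU spend needs different handling, here's a toy per-workload week-over-week delta check (all figures invented). A single account-level total would smooth over the fact that one bursty training job tripled:

```python
# Toy FinOps check: compare this week's spend to last week's, per
# workload, and flag big movers. Bursty GPU jobs need per-workload
# deltas, not one aggregate bill.
def spend_deltas(last_week, this_week, flag_pct=50.0):
    """Flag workloads whose spend grew by at least flag_pct percent."""
    report = []
    for workload in sorted(this_week):
        before = last_week.get(workload, 0.0)
        after = this_week[workload]
        pct = float("inf") if before == 0 else (after - before) / before * 100.0
        if pct >= flag_pct:
            report.append((workload, before, after, pct))
    return report

last_week = {"web-api": 410.0, "gpu-training": 900.0}
this_week = {"web-api": 430.0, "gpu-training": 2700.0, "gpu-inference": 380.0}

for w, before, after, pct in spend_deltas(last_week, this_week):
    print(f"{w}: ${before:.0f} -> ${after:.0f} ({pct:+.0f}%)")
```

Tools like Infracost and Kubecost do the hard part (attributing spend to workloads in the first place); the delta logic itself is simple once attribution is clean.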

GitHub: infracost/infracost

Cloud cost estimates for Terraform in pull requests 💰📉 Shift FinOps left!

Infracost shows cloud cost estimates and FinOps best practices for Terraform. It lets engineers see a cost breakdown and understand costs before making changes, either in the terminal, VS Code, or pull requests.

Get started

Follow our quick start guide to get started 🚀

Infracost also has many CI/CD integrations so you can easily post cost estimates in pull requests. This provides your team with a safety net as people can discuss costs as part of the workflow.

Post cost estimates in pull requests
[Screenshot: Infracost in GitHub Actions]

Output of infracost breakdown
[Screenshot: infracost breakdown command]

infracost diff shows the diff of monthly costs between the current and planned state.
[Screenshot: infracost diff command]

Infracost Cloud

Infracost Cloud is our SaaS product that builds on top of Infracost open source and works with CI/CD integrations. It enables you to check for best practices such as using latest generation instance types or block storage, e.g. consider switching AWS gp2 volumes to gp3 as they…





What I Learned When I Actually Tried These

Here are the honest takeaways from hands-on experience:

AIOps works — if your observability hygiene is solid. Datadog's Bits AI natural-language investigation saves real time. Datadog also launched an MCP Server, letting AI agents like Claude and Cursor tap directly into your telemetry. But it works best when your tagging and service catalog are already clean. AI amplifies the mess if the data is messy.

Dynatrace went deterministic at Perform 2026. Their agentic operations system fuses three deterministic AI agents with LLM capabilities. Claimed results: 12x better than LLM-only, 3x faster resolution, 50% lower token costs. Vendor numbers, but the hybrid architecture (deterministic for known patterns, generative for novel ones) is the right pattern.

The Terraform MCP server is more capable now. v0.4 added Terraform Stacks support and Sentinel policy management via natural language. Still can't understand live state or review plans, but the governance integration is a real step forward. Worth the 30-minute setup.

GPU costs break traditional FinOps. Training jobs spike and disappear, inference demand is bursty, and GPU spot availability is lower than it is for CPUs. Start with the FinOps for AI framework before buying any tooling. The 2026 report shows 98% of orgs now manage AI spend; this is no longer optional.

Kelsey Hightower's reality check is required viewing. His "Beyond the Hype" talk cuts through the noise better than any blog post (including this one).


Where to Start (Your Action Plan)

This Week (1–2 hours):

  • Explore your monitoring platform's existing AI features. Most teams are paying for capabilities they haven't turned on
  • Read the FinOps for AI overview (30 min)

This Month (10–15 hours):

  • Set up the Terraform MCP Server (about a 30-minute setup) and try it against the Terraform Registry
  • Add Infracost cost estimates to one repository's pull requests
  • Watch Kelsey Hightower's "Beyond the Hype" talk

This Quarter:

  • Build a multi-agent ops workflow with CrewAI or LangGraph
  • Run a 90-day evaluation of your AIOps platform

What to Skip:

  • Building custom AIOps from scratch (your platform already has features you haven't activated)
  • AI K8s operators before your baseline automation is solid
  • Autonomous remediation without human approval gates

Over to You

  1. What AIOps features are you actually using in production today? Not what your platform offers. What your team has activated and depends on. I'm curious about the gap between "available" and "adopted."

  2. Are you managing GPU/AI workload costs differently than traditional compute? The FinOps for AI framework is new. Have you had to invent your own approach, or are you applying CPU-era models to GPU costs?

  3. What's one AI tool in the DevOps space you've tried and found useful (or disappointing)? No vendor loyalty required. Honest takes welcome.


This is Part 2 of the AI Role Upgrade Roadmap series. Part 1: The AI Foundation Every Engineer Needs. Next up: Security.

If you found this useful, the full resource list with grading and maturity ratings is on the Hashnode deep dive.
