Nerav Doshi

Posted on Jun 15 • Edited on Jun 24 • Originally published at pipelineandprompts.com

AI Tooling on OpenShift: A Practitioner's Evaluation Framework

#aiinthestack #platformengineering #openshift #aitooling

Pipeline & Prompts | Byte size guides on DevOps, Cloud and AI

** AI in the Stack #1**

Byte size summary

After reading this article, you'll have a framework for evaluating AI tools in platform engineering contexts — not by capability type, but by where in your workflow the tool actually changes the outcome. You'll understand why the tools that sound most compelling are still hype, where genuine productivity gains exist today, and what governance infrastructure you need in place before any AI component gets near production. This article is the foundation for the series; subsequent articles implement each touch point against real OpenShift infrastructure.

The story

I spent months selling IBM's AI and data science portfolio before I truly understood what I was selling.

I knew the pitch. Predictive analytics. Optimization. Decision intelligence. I could walk a room through the business value without breaking a sweat. CPLEX for scheduling, Watson for insights — I had the slides, the talking points, the customer stories.

Then I sat in on a data scientist demo.

Not a sales demo. An actual working session — models being trained, outputs being interrogated, assumptions being challenged in real time. And somewhere in that room, watching someone do the thing I'd been describing from the outside, something clicked — and not in a good way.

The models were impressive. The theory was solid. But I kept asking myself the same quiet question: where does this go next?

Because most of what I saw never made it anywhere near production. It lived in notebooks. In slide decks. In proof-of-concept environments that were never ready to cross the line into something real. I'd been selling outcomes — optimised schedules, smarter decisions, reduced costs — without a clear path to how you'd actually get there. And underneath all of it, something else bothered me that nobody was talking about loudly enough: the data going into these models was often messy, unvalidated, and ungoverned. Bias wasn't a theoretical risk. It was baked in. And there was no framework to catch it.

I kept selling anyway.

Not because I was dishonest. But because that's how the industry worked — and still largely works. The industry positions AI at the outcome layer. The messy middle — governance, production readiness, operationalisation — gets handed to someone else to figure out later.

That gap between AI as it's sold and AI as it actually lands in production? That's exactly what this series is going to dig into.

The problem

The AI hype cycle has arrived in platform engineering with full force. Every observability tool now has a "Copilot." Every CI/CD platform is announcing AI-powered pipeline suggestions. Every cloud vendor has an AI assistant that promises to write your Kubernetes manifests, triage your alerts, and — if you believe the marketing — practically run your infrastructure for you.

The problem isn't that these tools are useless. Some of them are genuinely good. The problem is that the signal-to-noise ratio is terrible, and platform engineers are making real decisions — budget decisions, architecture decisions, tooling decisions — in an environment where nearly everything is being AI-washed.

Recognise this pattern: A product adds "AI-powered" to its marketing, ships a chatbot interface over an existing feature, calls it a Copilot, and charges a premium tier for access. The underlying capability hasn't changed. Only the framing has.

Three categories of noise dominate right now:

AI-washing. Existing features rebranded with AI language. Natural language search that was always just a filter. Log aggregation renamed "intelligent log analysis." If removing the word "AI" from the description doesn't change what the product actually does, that's AI-washing.

Demo-ware. Tools that work beautifully in controlled demos on clean, predictable data — and fall apart the moment they touch the complexity of a real production environment. This is exactly what I kept seeing in those IBM sessions years ago, and it's still the dominant failure mode. The demo closes the deal. The production deployment reveals the gap.

Solutions to problems you don't have. Autonomous AI agents that self-heal your infrastructure sound compelling until you ask: what does "self-healing" mean when your organisation requires a change advisory board (CAB) approval for every production modification? Context matters. Most AI infrastructure tooling is built for a hypothetical engineering organisation that doesn't look much like yours.

The question isn't whether a tool uses AI. The question is whether it changes the outcome — and whether that change survives contact with your actual environment.

Why existing approaches fall short

Most teams evaluating AI tooling for infrastructure fall into one of three patterns. All lead to the same outcome: either you adopt too much too fast and create governance debt you'll spend months unwinding, or you dismiss the category entirely and miss the genuine wins available right now.

Evaluating by feature list. The vendor demo shows the feature. You evaluate whether your team would use it. This completely bypasses whether the feature survives contact with your environment's specific constraints — your compliance requirements, your data quality, your change management process. The feature list approach is how you end up with a "self-healing pipeline" tool that can't make a production change without CAB approval.

Evaluating by category. "We need an AI observability solution." This leads to comparing tools within a category without first asking whether that category of AI is actually mature enough to be useful. Anomaly detection in observability has been real and useful for years. Autonomous incident remediation is still largely demo-ware. Treating them the same because they both appear in an "AI in DevOps" quadrant is the evaluation mistake that sends teams down the wrong procurement path.

Evaluating by peer adoption. "Company X is using it in production." The signal is real but the inference is wrong. Their environment, their data quality, their governance framework, and their team's capacity to manage AI output are all different from yours. What works in a greenfield startup cluster on Elastic Kubernetes Service (EKS) with three engineers who all understand the tooling does not automatically work in a regulated, multi-tenant OpenShift environment with a full change management process.

The architecture

Rather than thinking about AI by capability type — supervised learning, generative, agentic — it's more useful for platform engineers to think about where in the workflow AI can change the outcome. There are five meaningful touch points, each with a different maturity level and a different blast radius when something goes wrong.

Touch point 1 — Writing infrastructure code. Generating Terraform, Helm charts, Kubernetes manifests, GitHub Actions pipelines. This is currently where AI delivers the most consistent value. Output quality is high enough to be useful as a starting point, and the cost of a mistake is manageable — you review before you apply. Tools like GitHub Copilot, Claude Code, and cursor-style IDE integrations have meaningfully changed how fast experienced engineers can scaffold infrastructure.

Touch point 2 — Reviewing infrastructure code. Using large language models (LLMs) to review Terraform plans, flag misconfigurations, surface security issues in manifests, or check for policy violations before they hit kubectl apply. Underutilised and underrated. AI as a first-pass reviewer catches the obvious before a human looks — freeing review time for the decisions that actually require judgment.

Touch point 3 — Operating systems. AI-assisted runbooks, natural language interfaces to cluster state, AI that can answer "why is this pod crashing?" and surface relevant logs and events in one response. OpenShift Lightspeed targets exactly this layer. Genuinely promising — but still early. "Natural language interface to cluster state" is a different capability from "correctly diagnoses the root cause of a cascading failure."

Touch point 4 — Observing systems. Anomaly detection, intelligent alerting, log triage, pattern recognition across time-series data. The most mature AI application in infrastructure tooling — ML-based anomaly detection in observability platforms has existed for years. The catch: AI observation is only as good as your instrumentation, and most organisations' instrumentation is messier than they admit.

Touch point 5 — Responding to incidents. AI-generated post-mortems, suggested remediation steps, automated root-cause correlation. The least mature category. The gap between "AI suggests a fix" and "AI safely executes a fix in production" is enormous — and crossing it requires governance infrastructure most organisations haven't built yet.

What's actually working right now

Still hype	Actually working
Fully autonomous agents managing production infra	AI-assisted Terraform scaffolding and review
Self-healing pipelines without human oversight	LLM-powered log triage and error summarisation
AI that understands your org context without setup	GitHub Copilot / Claude Code in terminal workflows
Zero-touch incident resolution	AI-generated first-pass post-mortems and runbooks
Replacing platform engineers with AI agents	Natural language interfaces to cluster state (OpenShift Lightspeed)

The pattern is consistent: AI is genuinely useful as an accelerator for experienced engineers. It's not yet reliable as an autonomous operator. The engineers getting real value are the ones who understand the domain well enough to critically evaluate AI output — not the ones hoping AI will substitute for that understanding.

What's still hype — and why it's hard

The hardest part of being honest about AI in infrastructure is explaining why the things that sound most compelling are still hype — because they're not impossible, they're just harder than the demos suggest.

Autonomous agents running production infrastructure. The dream: an AI agent that detects a problem, diagnoses it, and fixes it — all without human intervention. The reality: every production environment has constraints, guardrails, compliance requirements, and organisational processes that an AI agent has no context about. Building the scaffolding for an agent to operate safely in production is a significant engineering project in itself, before you even get to the AI.

Self-healing pipelines. Retry logic with exponential backoff isn't AI. Pipelines that genuinely diagnose why something failed and take contextually appropriate corrective action — that's a much harder problem. The current generation of tools can handle narrow, well-defined failure patterns. They struggle with novel failures, which are precisely the ones you most need to handle.

AI that understands your organisational context. Every demo uses clean, well-labelled, well-structured data. Every real environment has years of accumulated naming inconsistencies, undocumented dependencies, and tribal knowledge that exists nowhere in any system. Getting AI to be genuinely useful in your environment requires significant investment in context — not just in the AI tool itself.

Implementation

Prerequisites

Before applying this framework to any AI tool evaluation, establish these baselines:

Document your current change management process — specifically what requires CAB approval and what doesn't. Any AI tool that touches production is subject to these constraints.
Audit your observability instrumentation coverage. Incomplete instrumentation makes Touch point 4 (observing systems) unreliable before you start.
Know your OpenShift Security Context Constraints (SCC) and role-based access control (RBAC) model. Any AI tool that interacts with your cluster will operate within or around these — understand the model before you connect anything.
Identify one concrete, scoped problem in your current workflow. "Improve our platform with AI" is not a problem statement. "Our on-call team spends 40% of incident time manually correlating logs across three tools" is.

Step 1 — Locate the claim on the framework

For any AI tool or feature you're evaluating, determine which touch point it primarily operates at. Then read the blast radius that comes with it:

Touch point 1-2 (Writing/Reviewing code):
  - Human reviews output before anything is applied
  - Blast radius: the quality of what you accept and apply
  - Adopt with normal review discipline

Touch point 3-4 (Operating/Observing):
  - Evaluate data quality before adopting
  - Recommendations can be wrong; understand escalation path
  - Blast radius: operational decisions made on bad AI signal

Touch point 5 (Responding to incidents):
  - Requires explicit governance framework before adoption
  - "AI-suggested" ≠ "AI-executed" — keep them separate initially
  - Blast radius: autonomous action in production

If the vendor's description places a tool at Touch point 5 — autonomous remediation, self-healing, zero-touch incident resolution — apply significantly more scrutiny than if it operates at Touch points 1 or 2.

Step 2 — Apply the hype test

Before spending time on a proof of concept, run these four questions:

Can the vendor show it working on data with the same characteristics as yours? Not a demo on clean, synthetic, well-labelled data. Your data. If they can't or won't, that's the answer.
What happens when it's wrong? Every AI tool is wrong sometimes. The question is whether "wrong" means a suggestion you dismiss, or an action that causes an outage.
Does it require context your organisation hasn't documented? AI tools that depend on understanding your org's naming conventions, undocumented dependencies, or tribal knowledge will underperform until that context is captured somewhere. That capture work is your responsibility, not the vendor's.
Can you remove it if it's not working? Evaluating against reversibility is not pessimism — it's risk management. A tool you can't easily remove carries a higher adoption threshold.

Step 3 — Governance before production

Before any AI component reaches a production environment:

Define the audit requirement. Who reviews AI-suggested or AI-executed changes? What is the audit trail? For regulated environments this is not optional.
Establish the blast radius. What can this tool do if it behaves unexpectedly? Can it modify production resources directly, or does it only make recommendations?
Set the escalation path. When the AI is confidently wrong — and it will be — what is the process for catching and correcting it before it compounds?
Document the data governance position. What data are you sending to an external LLM? What data must stay on-cluster or on-premises? Most AI tools send more than you'd expect by default.

The governance gap: What bothered me years ago in those IBM data science sessions still applies today. Most teams rushing to deploy AI in their infrastructure have no governance framework for it. These aren't blockers — but they need answers before you're running AI anywhere near production decisions.

Security considerations

LLM prompt injection via infrastructure data. Any AI tool that reads external data — logs, alert content, GitHub Issues, Slack messages — and uses it as context for an LLM is a prompt injection surface. If an attacker can write to that data source, they may be able to influence the AI's output and, at Touch point 5, potentially influence what actions the AI recommends or takes.

Data exfiltration via LLM context. Sending cluster state, application logs, or infrastructure configuration to a third-party LLM endpoint is a data governance decision that must be made explicitly — not by default when you install the tool. Identify what data the tool sends, where it goes, and whether that is consistent with your data classification requirements before connecting it to production namespaces.

Blast radius of AI service accounts. An AI tool that applies changes directly has the blast radius of its service account. Apply the same least-privilege discipline to AI agent service accounts as to any other automation credential. Audit with oc auth can-i --list --as=system:serviceaccount:[namespace]:[sa-name] on a schedule — these accounts have a tendency to accumulate permissions when AI-suggested changes start failing for access reasons.

Data quality risk in observability AI. If your observability data has gaps or historical anomalies from past incidents, your anomaly detection model is trained on those. An AI baseline trained during a period of chronic latency will produce different signals than one trained on clean data. Understand what your observability AI was trained on, and re-evaluate the baseline when your environment changes significantly.

Tradeoffs

AI as accelerator vs. AI as operator. The most common evaluation mistake is treating these as the same procurement category. AI accelerators (Touch points 1-2) improve throughput for experienced engineers without autonomous authority. AI operators (Touch point 5) require governance infrastructure — audit trails, blast radius controls, escalation paths — before they can safely operate in production. The distinction drives different adoption timelines and different security requirements.

Speed of adoption vs. governance debt. Moving fast on AI tooling creates governance debt that compounds. Every AI tool in your stack without a documented blast radius, audit trail, or removal plan is a liability you'll eventually have to address — usually during an incident. The teams getting the best outcomes are adopting one touch point at a time, establishing governance, then expanding.

Build vs. buy for AI-infrastructure integration. Off-the-shelf tools offer faster time to value and someone else's maintenance burden. Custom integrations — your own MCP server connecting an LLM to your cluster — give you full control over what data the AI sees and what actions it can take. The right answer depends on your engineering capacity and how sensitive your environment is. Subsequent articles in this series cover both paths.

Vendor-integrated AI features vs. standalone tools. Your existing observability, CI/CD, and cluster management platforms are all adding AI features. The integrated feature is faster to adopt. A standalone AI tool is more flexible and less vendor-coupled. Risk of integrated: you're dependent on the vendor's AI implementation choices and data handling. Risk of standalone: you own the integration complexity and the maintenance of compatibility across upgrades.

What I'd do differently

Apply the framework before buying. I spent months selling AI solutions that were firmly in the "still hype" column — not because the technology was fraudulent, but because the missing piece was never the AI itself. It was the data quality, the governance, the production path. That framework, applied at the evaluation stage, would have changed what I recommended to customers.

Start at Touch point 1, not Touch point 5. The temptation is always to start with the most compelling use case — autonomous remediation, self-healing pipelines, AI that runs the on-call shift. Start instead where the blast radius is lowest and the feedback loop is tightest. AI-assisted infrastructure code generation gives you real signal about where LLMs help and where they confidently mislead — without the consequence of discovering that during a 2am incident.

Build the governance framework before the first tool, not after the fifth. The governance questions — who reviews, what's the audit trail, what's the blast radius, what data leaves the cluster — are significantly easier to answer when you have one AI tool than when you have five. Define the framework early.

Treat data quality as a blocking condition, not a future problem. Every AI capability in this framework degrades as data quality degrades — except the degradation is silent, in ways you won't notice until something breaks in production. Observability AI on bad data produces confidently wrong signals. LLMs fed poorly-structured logs produce poorly-structured summaries of the wrong thing. Fix the data before you build the AI layer on top of it.

GitHub repo

All working implementations for this series live at agentic-devops/pipelineandprompts-labs. Each subsequent article links directly to its repo. This article is the framework; the code starts in Article 02.

What's next in this series

#	Article	What it covers
01	What's Real, What's Hype (you are here)	The practitioner's framework for evaluating AI in infrastructure
02	MCP Servers — The Connective Tissue	How Model Context Protocol servers let AI agents interact with real systems
03	AI-Assisted OpenShift Operations	OpenShift Lightspeed, natural language cluster interrogation, where AI saves time
04	n8n Workflows for Platform Engineering	Agentic automation pipelines connecting AI with your infrastructure toolchain
05	Agentic AI Infrastructure — Doing It Safely	Governance, guardrails, and engineering scaffolding before handing AI operational authority

What's next

Article 02 — MCP Servers: The Connective Tissue Between AI and Infrastructure

Before AI agents can do anything useful in your stack, they need a way to talk to it. Model Context Protocol servers are how that happens. Next: what MCP servers are, why they matter for platform engineering, and how to build one that connects an LLM to your real infrastructure toolchain — with working code and a threat model.

DEV Community