Humanizing Artificial Intelligence for Platform Engineering: Helping Internal Developer Platforms Become Easier to Use

#kubernetes #devops #platformengineering #ai

The Developer Who Just Wanted to Ship a Service

A backend developer on my team pinged me last week. She'd built a small Go service, it passed tests locally, and she wanted to get it running in our staging cluster. Simple ask. Then she opened the platform wiki.

Forty pages. A section on naming conventions. A page on which Terraform module to use for a Postgres instance, with a note at the bottom saying that page was "mostly current." A Helm values reference. A link to a deprecated runbook that someone forgot to delete. By the time she found the right golden-path template, she'd burned half a day and pinged three people in Slack. She didn't need a tutorial on our platform. She needed someone to walk her to the right door.

That gap is the actual problem with most internal developer platforms. The paved road exists. The templates exist. The policy is sound. But the cognitive load of finding the paved road is so high that developers route around it, copy a YAML file from another team, and ship something that violates four of our standards. The platform isn't failing on capability. It's failing on usability.

This is where I think AI earns its keep in platform engineering, and it's a narrower claim than the hype suggests. "Humanizing AI" here doesn't mean a chatbot that replaces your platform team. It means using AI as the friendly front door to your internal developer platform (IDP): turning a plain-English request into the right golden-path template, scaffolding the boilerplate against an approved template, and explaining the cryptic platform error so the developer fixes it correctly instead of hacking around it. The guardrails, the templates, the approvals, and the humans who own them all stay exactly where they are. AI lowers the friction of self-service. It does not bypass the paved road.

Self-Service Is Only Self-Service If People Can Find It

The promise of an IDP is that a developer can provision what they need without filing a ticket and waiting two days. The reality, on most platforms I've seen, is a self-service portal that assumes the developer already knows the vocabulary of the platform. Which module? Which environment tier? Which approval gate applies to a service that touches PII?

A developer asking "I need a new Go service with Postgres" is asking a perfectly reasonable question in human terms. The platform answers in machine terms: pick a template ID, fill in fourteen variables, know in advance which of them are required. The translation layer between those two is exactly what an AI assistant is good at — and exactly where most teams have been doing the translation manually, in Slack, over and over, forever.

The trick is grounding. An ungrounded AI will happily tell your developer to drop in some Terraform it saw on a blog in 2021. That's worse than the wiki, because it's confidently wrong. A grounded assistant — one wired to your template catalog, your golden paths, your current docs via retrieval — recommends the paved road because the paved road is the only thing in its context.

Pro Tip: Don't fine-tune a model on your platform and call it done. Ground it with retrieval over your live template repo, your Backstage software catalog, and your current runbooks. Templates change weekly; a fine-tuned model freezes a snapshot. Retrieval means the assistant recommends what's actually approved today, not what was approved the day someone trained it.
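
To make "grounded" concrete, here is a minimal sketch, assuming a tiny in-memory catalog and plain keyword overlap in place of real retrieval. The Template class, the CATALOG contents, the go-service entry, and the recommend function are all hypothetical names for illustration. The point is only that the assistant can return nothing except what is in the catalog, and returns None when nothing fits.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    template_id: str
    keywords: frozenset
    source: str  # catalog path, so the answer can point back to it


# Hypothetical in-memory catalog. A real assistant would load this from the
# live template repo on each request instead of hard-coding it.
CATALOG = [
    Template("go-service-pg", frozenset({"go", "service", "postgres"}), "templates/go-service-pg"),
    Template("go-service", frozenset({"go", "service"}), "templates/go-service"),
]


def recommend(request, catalog=CATALOG, min_overlap=2):
    """Return the best-matching catalog template, or None if nothing fits."""
    words = set(re.findall(r"[a-z0-9]+", request.lower()))
    best, best_key = None, None
    for template in catalog:
        overlap = len(words & template.keywords)
        if overlap < min_overlap:
            continue
        # Prefer more matched keywords, then fewer unmatched template keywords.
        key = (overlap, -len(template.keywords - words))
        if best_key is None or key > best_key:
            best, best_key = template, key
    return best
```

Because the catalog is read at call time, swapping in a freshly loaded list is enough to pick up this week's templates. A None result is the signal to hand off to a human rather than improvise.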

From a Sentence to a Scaffold

Here's the concrete artifact. A developer types into the platform assistant:

I need a new Go service called billing-events with a Postgres database, staging and prod, and it needs to publish to our Kafka topic.

A grounded assistant doesn't generate freehand infrastructure. It maps that request onto the approved go-service-pg golden-path template in our catalog and produces the scaffold the platform team already blessed:

billing-events/
  catalog-info.yaml        # Backstage entity, owner team pre-filled
  cmd/main.go              # service skeleton, our logging + health endpoints
  Dockerfile               # from the approved base image, pinned digest
  helm/
    values.yaml            # resource limits from the "small" tier preset
  terraform/
    main.tf                # calls the rds module from registry.internal/our-org/rds/aws, pinned to 4.2.1
    kafka.tf               # topic ACL via the approved kafka module
  .github/workflows/
    deploy.yaml            # the standard pipeline, not a bespoke one

The Terraform it emits isn't invented. It calls our internal module registry at a pinned version, with the variables the template requires:

module "billing_events_db" {
  source         = "registry.internal/our-org/rds/aws"
  version        = "4.2.1"
  instance_class = "db.t4g.medium"   # from the "small" tier
  multi_az       = true              # prod default, policy-enforced
  pii            = false             # tagged so policy bot knows the rules
}

Notice what the AI did and didn't do. It filled in the boilerplate, picked sane defaults from our tier presets, and tagged the resource so our policy engine can reason about it. It did not choose the module version, invent a new instance class, or skip multi_az. Those are constraints baked into the template, and the assistant is generating into the template, not around it. If the developer asks for something the template can't express — a database class we don't support, a region we don't run in — the right answer from the assistant is "that's not on the paved road, here's who to talk to," not a creative workaround.
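
One way to picture "generating into the template, not around it" is a fill step that only accepts values the template allows. This is a rough sketch under stated assumptions: ALLOWED, PINNED, DEFAULTS, and fill_template are hypothetical names, and the allowed values simply mirror the example module above rather than any real template definition.

```python
# Hypothetical constraints for one template. In practice these would be read
# from the template definition, never decided by the assistant.
ALLOWED = {"instance_class": {"db.t4g.medium"}, "pii": {True, False}}
PINNED = {"version": "4.2.1", "multi_az": True}  # not negotiable per request
DEFAULTS = {"instance_class": "db.t4g.medium", "pii": False}


def fill_template(requested):
    """Merge requested values into the template.

    Returns (values, off_road). Anything the template cannot express is
    listed in off_road instead of being applied.
    """
    values = {**DEFAULTS, **PINNED}
    off_road = []
    for key, value in requested.items():
        if key in PINNED:
            if value != PINNED[key]:
                off_road.append(f"{key} is fixed by the template")
        elif key in ALLOWED and value in ALLOWED[key]:
            values[key] = value
        else:
            off_road.append(f"{key}={value!r} is not on the paved road")
    return values, off_road
```

A non-empty off_road list is where the assistant should stop and say who to talk to, rather than emit the Terraform anyway.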

If you're new to wiring AI into this kind of container-to-cluster flow, the walkthrough "from Dockerfile to first Kubernetes deployment with AI" is a good ground-level starting point for what "generate against a known-good shape" looks like in practice.

Golden Paths Need a Tour Guide, Not Just a Map

A golden path is only golden if developers stay on it. The failure mode I see constantly: the platform team builds an excellent paved road, documents it beautifully, and then watches teams drift off it because at the moment of decision — 4 p.m., trying to ship before standup tomorrow — nobody reads the docs. They grab whatever's nearest.

An AI assistant grounded in your golden paths changes the economics of that decision. Staying on the paved road becomes the low-effort option, because the assistant does the lookup and the scaffolding for you. Drifting off becomes the high-effort option, because now you're writing Terraform by hand. That's the inversion you want. For years we've asked developers to do extra work to comply with the platform. Done right, AI makes compliance the path of least resistance.

This matters most for the workflows that are genuinely hard to reason about. GitOps is a great example: the model of "Git is the source of truth, a controller reconciles the cluster to match" is powerful but unintuitive for someone shipping their first service. An assistant that can explain why their change isn't live yet ("your manifest is merged, but Argo CD hasn't synced; here's the sync status and here's the app it belongs to") is doing real developer support. If your platform runs this pattern, point people at the conceptual grounding in "GitOps for infrastructure explained" and the hands-on practical starting guide for Argo CD, and let the assistant answer the "but why isn't my thing deployed" questions against your actual setup.

Pro Tip: Make the assistant cite its source. Every recommendation should link back to the template, runbook, or catalog entry it came from. This does two things: it lets the developer verify the paved road for themselves, and it gives your platform team a signal — when the assistant keeps citing a stale doc, that's your cue to fix the doc. The AI becomes a usage tracker for your documentation.
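
The citation idea can be sketched in a few lines, assuming each answer carries the list of documents it was grounded in. Recommendation, require_sources, and stale_but_cited are hypothetical names, and how a doc gets marked stale is left out on purpose.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    text: str
    sources: tuple  # paths or links for the docs the answer was grounded in


def require_sources(rec):
    """Refuse to surface an answer that cites nothing."""
    if not rec.sources:
        raise ValueError("recommendation has no sources; do not show it")
    return rec


def stale_but_cited(recs, stale_docs):
    """Return (doc, citation count) for stale docs, most cited first."""
    counts = Counter(src for rec in recs for src in rec.sources)
    hits = [(doc, n) for doc, n in counts.items() if doc in stale_docs]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))
```

The second function is the "usage tracker" half: a stale doc that keeps getting cited is the one worth fixing first.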

Explaining the Cryptic Error Instead of Inviting a Hack

The highest-leverage place for AI in an IDP is the moment a developer hits a platform error they don't understand. This is where people hack. The pipeline fails with something opaque, the developer doesn't know what it means, and the fastest escape is to disable the check, add an || true, or copy a workaround from a Stack Overflow answer that has nothing to do with your platform.

A good assistant intercepts that moment. Here's the kind of output I want to see when a deploy fails our policy gate:

Your deploy to prod was blocked by the require-resource-limits policy.

Your helm/values.yaml sets a CPU request but no CPU limit. Our prod admission policy requires both — this prevents one service from starving the node. You don't need to disable the check.

The go-service-pg golden path ships a values-prod.yaml with the "small" tier preset that already includes compliant limits. You're missing it because this service was scaffolded before that preset existed. Copy the resources: block from the template here: [link to template]. After that, re-run the deploy — no override needed.

If you genuinely need higher limits than the preset allows, that's a capacity conversation with the platform team, not a policy bypass. Open a request here: [link].

That output respects the guardrail. It explains why the policy exists, points to the golden-path fix, and explicitly closes the door on the hack — while leaving a legitimate human-in-the-loop escalation path for the real exception. Compare that to the developer's default move, which is to find the flag that makes the red text go away. The AI didn't weaken the policy. It made the policy understandable, which is the only thing that makes a policy survive contact with a deadline.
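
As a rough sketch of how that kind of message could be assembled, assume the assistant gets the failing policy ID plus the parsed resources block from values.yaml. EXPLANATIONS, missing_limits, and explain are hypothetical, and the explanation text is paraphrased from the example above rather than taken from a real policy engine.

```python
# Hypothetical explanation table keyed by policy ID. In practice each entry
# would be retrieved from the policy's own documentation.
EXPLANATIONS = {
    "require-resource-limits": {
        "why": "prod requires a limit for every request so one service cannot starve the node",
        "fix": "copy the resources block from the golden-path template",
    },
}


def missing_limits(resources):
    """Return resource names that have a request but no matching limit."""
    requests = resources.get("requests") or {}
    limits = resources.get("limits") or {}
    return sorted(name for name in requests if name not in limits)


def explain(policy_id, resources):
    """Build a plain-language explanation, or hand off if none is grounded."""
    entry = EXPLANATIONS.get(policy_id)
    if entry is None:
        return f"No grounded explanation for {policy_id}; ask the platform team."
    missing = missing_limits(resources)
    if not missing:
        return f"{policy_id} blocked the deploy, but every request has a limit; ask the platform team."
    return (
        f"Blocked by {policy_id}: no limit set for {', '.join(missing)}. "
        f"Why: {entry['why']}. Fix: {entry['fix']}. No override needed."
    )
```

Note that both fallback branches route to a human. The sketch never suggests disabling the check.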

There's a deeper point here about toil. Every one of these "what does this error mean" questions used to land in the platform team's Slack channel, and answering them is pure, repetitive toil: the same Postgres-module question, the same policy-gate question, fifteen times a month. Routing those to a grounded assistant frees your senior people to work on the platform instead of staffing a help desk for it. I've written more about that specific dynamic in "identifying and eliminating toil with AI," and platform support is one of the cleanest examples of toil I know.

Deployment Workflows: Narrate, Don't Drive

I want to be precise about where AI sits in the deployment path, because this is where teams get nervous and they're right to. The assistant should narrate and assist the deployment workflow. It should not be the thing that pushes to prod.

The pattern that works: the developer commits to Git, the pipeline runs, the controller reconciles — your normal, audited, human-approved flow. The AI's job is to make that flow legible. "Here's where your change is. It's merged, it passed CI, it's waiting on the prod approval gate, which is owned by your team lead." When something goes wrong, it explains the failure in the terms above. When the developer asks "how do I roll back," it points them at the actual rollback procedure in your runbook — the real one, retrieved live, not a generic Kubernetes answer.
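
Narration can be as simple as turning pipeline state into a sentence. The sketch below assumes a status dictionary with four boolean stages. The stage names, the STAGES table, and narrate are hypothetical and do not correspond to any particular CI system's API.

```python
# Ordered stages of a hypothetical pipeline: (status key, done text, waiting text).
STAGES = [
    ("merged", "merged", "waiting to be merged"),
    ("ci_passed", "passed CI", "waiting on CI"),
    ("prod_approved", "approved for prod", "waiting on the prod approval gate"),
    ("synced", "synced to the cluster", "waiting for the controller to sync"),
]


def narrate(status, gate_owner):
    """Describe where a change is. Reads status only; never changes it."""
    done = []
    for key, done_text, waiting_text in STAGES:
        if status.get(key):
            done.append(done_text)
            continue
        so_far = f"Done so far: {', '.join(done)}. " if done else ""
        owner = f" (owner: {gate_owner})" if key == "prod_approved" else ""
        return f"{so_far}Now {waiting_text}{owner}."
    return f"Done so far: {', '.join(done)}. Nothing left to wait on."
```

For a change that is merged and green, narrate({"merged": True, "ci_passed": True}, "your team lead") reports the approval gate and who owns it. The function has no way to move the change forward, which is the point.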

What it should never do is hold the credentials to apply infrastructure or promote a release on its own. The approval gates are the guardrail. A human owns the merge. A human owns the prod promotion. The AI is the assistant standing next to the control panel explaining what each button does — not the operator with its hands on the panel. If you build platform automation that AI helps developers operate, the Kubernetes operator pattern guide is worth reading for how to keep the reconciliation logic — the part that actually changes cluster state — in deterministic, audited controllers rather than in a probabilistic model.

Pro Tip: Treat the AI assistant's outputs as suggestions that flow into your existing review and policy gates, never as a side channel that skips them. The scaffold it generates should land as a pull request a human reviews. The Terraform it writes should still hit plan, policy-as-code, and approval. If your assistant can make a change that your normal pipeline couldn't, you've built a backdoor, not a front door.
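
One way to keep the front door from turning into a backdoor is an explicit allowlist of what the assistant may do. This is a sketch only: the Action names and the authorize function are hypothetical, and real enforcement belongs in the credentials the assistant is issued, not just in application code.

```python
from enum import Enum


class Action(Enum):
    READ_STATUS = "read_status"
    READ_RUNBOOK = "read_runbook"
    OPEN_PULL_REQUEST = "open_pull_request"
    APPLY_INFRA = "apply_infra"
    PROMOTE_RELEASE = "promote_release"


# The assistant may read and propose. Applying and promoting stay with the
# normal pipeline and the humans who approve it.
ASSISTANT_ALLOWED = frozenset(
    {Action.READ_STATUS, Action.READ_RUNBOOK, Action.OPEN_PULL_REQUEST}
)


def authorize(action):
    """Return the action if the assistant may perform it, else raise."""
    if action not in ASSISTANT_ALLOWED:
        raise PermissionError(f"{action.value} must go through the normal pipeline")
    return action
```

Everything the assistant produces then arrives as a pull request, which means plan, policy-as-code, and approval all still apply.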

The Human Platform Team Is the Point

It's tempting to read all of this as "AI replaces the platform team." It's the opposite. Every capability I've described depends on the platform team's work being good: the templates have to be right, the golden paths have to be real, the docs have to be current, the policies have to be sound. The AI is a multiplier on that work, not a substitute for it. Point it at a mediocre platform and you get a fast, confident way to spread mediocrity.

The work shifts, though. Less time spent answering the same onboarding question for the hundredth time. More time spent making sure the templates the assistant recommends are excellent, because now they reach every developer instantly. Your golden paths become higher-stakes in the best way — they're no longer docs that maybe get read; they're the thing the front door walks everyone toward. That's a better job. It's the platform-engineering job, freed from the help-desk tax.

"Humanizing AI" in this context means keeping humans in control of the road while using AI to make the road easy to walk. The developer who just wants to ship a Go service gets walked to the right template in thirty seconds instead of reading forty pages of wiki. The platform team keeps its guardrails, gains a tireless first-line guide, and gets back the hours it used to spend repeating itself. Nobody bypasses the paved road. The paved road just finally became the obvious one to take.

James Joyner IV runs devopsaitoolkit.com, where he writes about running AI alongside real production platforms. If you want to put grounded, human-in-control prompts to work on your own platform, start with the prompt library or the Writing Humanizer pack.