Vitaliy Ryumshyn

Posted on Mar 24 • Edited on May 17

Why Your AI Agent Skill Sucks

#mcp #aiops #agentskills #kubernetes

You wrote a skill prompt for your AI agent. It looks great — diagnosis protocol, safety rules, operational discipline. Your agent fixes broken deployments 4x faster.

Ship it?

We tested role-based skills across 16 real infrastructure scenarios on 4 models. Here's what happened.

The Setup

infra-bench runs AI agents against real Kubernetes clusters and Terraform projects. No mocks. Kind clusters, real kubectl, real failures. The agent gets a task ("the deployment is broken"), tools (kubectl, terraform, helm), and a turn budget. Fix it or fail.

We tested two modes:

Baseline: no skill — the model uses its own judgment
With skill: a compact ~300-token role prompt (k8s-admin for Kubernetes, platform-eng for Terraform)

Same model, same scenarios, same cluster. The only difference: did we tell the agent how to think?

The Results

Kubernetes Scenarios (8 CKA/CKS scenarios, L2-L3)

Model	Baseline	With k8s-admin skill	Delta
Claude Sonnet 4	8/8	8/8	0
Gemini 2.5 Flash	6/8	5/8	-1
GPT-4o	4/6	4/8	-2
DeepSeek Chat	6/7	6/8	0

Terraform Scenarios (4 scenarios, L2-L3)

Model	Baseline	With platform-eng skill	Delta
Claude Sonnet 4	3/4	4/4	+1
Gemini 2.5 Flash	3/4	2/4	-1
GPT-4o	2/4	2/4	0
DeepSeek Chat	3/4	3/4	0

New Scenarios — Baseline Only (4 scenarios, L2-L4)

Model	readonly-fs (L2)	psa-conflict (L2)	capabilities (L2)	cascading (L4)	Total
DeepSeek Chat	PASS	PASS	PASS	PASS	4/4
GPT-4o	PASS	PASS	PASS	FAIL	3/4
Gemini 2.5 Flash	FAIL	PASS	PASS	FAIL	2/4
Claude Sonnet 4	FAIL	PASS	PASS	FAIL	2/4

DeepSeek Chat — the cheapest model in the test ($0.006/run) — was the only one to pass the L4 multi-stage cascading-failures scenario. Claude Sonnet 4 failed it.

The Pattern

Strong models don't need your skill. Claude Sonnet 4 scored 8/8 on Kubernetes without any skill. Adding the k8s-admin skill didn't improve anything — it was already diagnosing before fixing, checking blast radius, making targeted changes. The skill just described what it was already doing.

Weak models get hurt by your skill. GPT-4o lost 2 scenarios when we added the k8s-admin skill. The skill says "check events and conditions before logs." For a kubeconfig connectivity issue, the agent needed to inspect the kubeconfig file — not Kubernetes events. The skill imposed a wrong mental model.

Skills help on specific tasks and break others. The platform-eng skill helped Claude Sonnet pass terraform-import-existing (FAIL → PASS) because the skill specifically teaches "prefer import over destroy-recreate." But the same skill pattern made Gemini fail terraform-state-drift (PASS → FAIL) because it followed the skill's diagnostic protocol instead of just reading the plan diff.

Price doesn't correlate with performance. DeepSeek Chat at $0.006/run beat Claude Sonnet 4 at $0.06/run on the hardest scenario. The 10x price difference bought zero advantage on multi-stage forensics.

Why Skills Break Things

A skill prompt is a mental model injection. You're telling the agent: "think like THIS kind of engineer." That works when the scenario matches the model. It breaks when:

The skill is too procedural. "Run terraform plan first, then read .tf files, then check state" — great for state management, wrong for a simple image tag fix. The agent follows the procedure and burns turns on unnecessary diagnosis.
The skill overrides good instincts. A model that would naturally read the error message and fix it in 2 turns now follows your 5-step protocol and times out.
The skill scope is wrong. A k8s-admin skill teaches deployment patterns. But kubeconfig issues aren't deployment issues — the agent needs to think about TLS and cluster connectivity, not pod scheduling.

The Real Problem

You can't know whether a skill helps without testing it on real scenarios. Prompt engineering intuition fails here. The skill that cuts L1 scenarios from 17 to 4 turns is the same skill that makes L2 scenarios fail entirely.

We proved this with our first skill experiment:

Without skill: 17 turns, PASS (L1 broken-deployment)
With skill:     4 turns, PASS — 4x faster

Same skill, harder scenario:
Without skill: 12 turns, PASS (L2 crashloop-backoff)
With skill:     4 turns, FAIL — skipped diagnosis

The skill made the agent skip diagnosis and jump to a fix pattern. On L1 (obvious problem), that's a speedup. On L2 (requires investigation), it's a failure.

What Actually Works

For strong models (Claude Sonnet 4, GPT-5.2): Don't add skills for tasks they already handle. Your skill is at best neutral, at worst destructive. Test on harder scenarios where the model fails — skills can help there (Claude + platform-eng skill on terraform-import-existing).

For mid-tier models (Gemini Flash, DeepSeek): Test every skill variant against your actual scenarios. A skill that helps on 6 scenarios but breaks 2 is a net negative if those 2 are production-critical. Also: don't assume expensive = better. DeepSeek beat Claude on multi-stage forensics.

For weak models (Llama 70B, Qwen): Skills help more here — the structure compensates for weaker reasoning. But test anyway.

The general rule: Skills are not universally good or bad. You need to benchmark them against real infrastructure failures to know which help and which hurt.

62 scenarios. 8 exam-aligned tracks. 5 models. Run your skill against real clusters and get data, not opinions.

infra-bench: bench.evidra.cc

DEV Community