
Jake Page for MetalBear

Posted on • Originally published at metalbear.com

Four Layers of Validation in Kubernetes with Claude Code

Earlier this year, Moltbook, a social network for AI agents, launched, trended, and became a cautionary tale within the same week. Security researchers at Wiz found a Supabase API key sitting in its client-side JavaScript. That key was the database’s only access control, with no Row Level Security to narrow what it could reach. The result: 1.5 million API tokens, 35,000 email addresses, and thousands of private messages exposed to anyone with a browser console.

Moltbook was a greenfield project with no review process. The same class of mistake is far more serious when AI-generated code lands inside applications that already have users, real data, and an established trust surface. That’s increasingly the reality: as organizations adopt AI coding agents, more AI-generated code lands directly in production services that already hold your users’ credentials and personal details. A recent survey found that 95% of developers don’t fully trust AI-generated code, yet only 48% consistently review it before committing. It ships regardless.

Static review tools catch only some classes of issues: common CVEs, dependency hygiene, style violations, deterministic anti-patterns. What they can’t see are things like the actual name of your Kubernetes Secret in this cluster, whether your auth middleware is wired into the right route in this service, or whether the request a real user will send makes it through the new code path without breaking something downstream.

Closing that gap takes four independent layers: AI agent skills that shape what gets generated, commands that audit what was generated, integration tests that hit staging endpoints while traffic is routed to your local code, and preview environments that let a human review the change against staging dependencies before merging.

Layer 1: Skills (passive, shaping what gets generated)

Most AI coding assistants let you write down rules that shape generation. The simplest mechanism is a config file that the assistant loads into context on every prompt: .cursorrules in Cursor, CLAUDE.md in Claude Code, .github/copilot-instructions.md in GitHub Copilot. Drop NEVER/ALWAYS rules in there and the AI follows them. The downside is that those files load on every prompt, even when you’re working on something unrelated, and every rule you add costs tokens whether or not it’s relevant.

Claude Code goes a step further with skills: structured rule sets that ship as plugins (a directory with a SKILL.md and supporting reference files). Each skill has a description, and the model pulls a skill into context only when your prompt matches what that skill is meant to cover. If a skill never gets matched against the prompt, it never gets loaded, and you don’t pay for the tokens it would consume. We’ve already shipped six skills for AI agents working with mirrord, and this post adds a seventh focused on validation: k8s-validation, an open-source set of NEVER/ALWAYS rules for code that runs inside a Kubernetes cluster.

The skill covers two halves: the cluster-level concerns (Secrets, RBAC, pod hardening, NetworkPolicies, supply chain, file handling) and the application-level concerns that determine whether the workload behaves correctly inside the cluster (HTTP and parameter handling, auth, output sanitisation, API contracts, env-var configuration, test coverage).

Before and after

Without the skill, the model has no way of knowing about things like Kubernetes Secrets already mounted as environment variables in your pod, auth decorators other handlers in your service already use, or PII-sanitization utilities your team has already built. So it does the obvious thing: hardcodes the API key, skips the auth check, and returns the LLM output directly. With the skill loaded, the following prompt (“add a /summarise endpoint that calls OpenAI’s API”) produces something like this:

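import os

from flask import Flask, jsonify, request
from openai import OpenAI

from utils.auth import require_auth  # stand-in for your service's existing auth decorator
from utils.sanitize import filter_pii

app = Flask(__name__)

# Key comes from the pod environment (backed by a Kubernetes Secret), never from source
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

@app.route('/summarise', methods=['POST'])
@require_auth
def summarise():
    # Tolerate a missing or non-JSON body instead of failing on None
    payload = request.get_json(silent=True) or {}
    text = payload.get('text', '')
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Summarise: {text}"}],
    )
    summary = response.choices[0].message.content
    # Strip PII from the model output before it leaves the service
    return jsonify({"summary": filter_pii(summary)})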

filter_pii here is a stand-in for any utility that strips personally identifiable information (names, SSNs, emails, etc.) out of free text. Different teams build this differently; the Microsoft Presidio library is one of the more common open-source starting points.

Three differences from what the model would produce without the skill: the key comes from an environment variable backed by a Kubernetes Secret, the endpoint sits behind @require_auth, and the LLM output runs through filter_pii before going back to the user.

What skills can’t do

Skills shape generation, but they don’t verify anything.

In the example above, the AI correctly followed the skill: it read the API key from the environment, applied @require_auth, and called filter_pii from utils.sanitize. But the skill has no way to verify that filter_pii actually works. If the utility in your codebase only strips email addresses and misses phone numbers, the skill can’t know that. A user document containing a phone number sails straight through the filter and into the response, and the code looks correct at every layer the skill can see.
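To make that gap concrete, here is a minimal sketch. The filter_pii below is a hypothetical, deliberately naive implementation (not taken from any real codebase or library) that is assumed to redact only email addresses. The phone-number pattern and the leaks_phone_number helper are also illustrative assumptions. A behavioural check exposes what the skill can’t see:

```python
import re

# Hypothetical, deliberately incomplete filter: it only knows about emails.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def filter_pii(text: str) -> str:
    """Redact email addresses; phone numbers are (wrongly) left untouched."""
    return EMAIL_RE.sub("[REDACTED]", text)


# Rough phone pattern used only by the check below, not by the filter.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def leaks_phone_number(text: str) -> bool:
    """Return True if something that looks like a phone number survives."""
    return PHONE_RE.search(text) is not None


sample = "Contact jane@example.com or call +1 415 555 0100."
filtered = filter_pii(sample)
print(filtered)
print("phone leaked:", leaks_phone_number(filtered))
```

The handler that calls this filter looks identical whether the filter is complete or not, which is why only a check on actual output can tell the two apart.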

Skills set a floor by preventing the obvious structural mistakes. They’re instructions to a model, not checks against reality.

Layer 2: Commands (active, checking what was generated)

Where skills shape what the model generates, commands check what already exists. They’re explicitly invoked by a developer, an agent, or a CI step, and they run a defined set of checks against the code in front of them.

The same install from Layer 1 also ships a slash command: /k8s-validation:audit. It scans your codebase for the same NEVER/ALWAYS rules the skill enforces during generation, traces data flow through handlers and queries, and classifies each finding by severity. Skills don’t always load (a vague prompt, or a quick edit to a file the model didn’t classify as Kubernetes-adjacent, and the rules never enter context). The audit is the backstop: it runs on the code regardless of how the code got written.

> /k8s-validation:audit content-api/

CRITICAL 2 | HIGH 4 | MEDIUM 0 | INFO 0

[CRITICAL] src/routes/summarise.py: Hardcoded OpenAI API key → Use os.environ
[CRITICAL] src/routes/download.py: User filename in path without sanitization → Use secure_filename()
[HIGH]     src/routes/summarise.py: No authentication middleware → Add @require_auth
[HIGH]     k8s/deployment.yaml: No SecurityContext defined → Add runAsNonRoot, drop ALL
[HIGH]     src/routes/summarise.py: Reads OPENAI_API_KEY but no manifest defines it → Add to deployment.yaml env block
[HIGH]     src/routes/summarise.py: New endpoint with no integration test → Add test in tests/integration/

Note that the output mixes two kinds of findings. Security findings cover the hardcoded key and the missing SecurityContext. Correctness findings ask “does the code do what was asked, given the rest of the system?”: the Kubernetes deployment manifest doesn’t define the env var the code reads, and the new endpoint shipped without an integration test. Both halves matter for AI-generated code.
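Correctness findings are the less familiar half, so here is a rough sketch of what one such check could look like. This is not the plugin’s implementation. It assumes the env var names declared in the manifest have already been extracted into a set, and env_vars_read and missing_env_vars are hypothetical helpers:

```python
import ast


def env_vars_read(source: str) -> set[str]:
    """Collect names read via os.environ["NAME"] or os.getenv("NAME")."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        # Matches <anything>.environ["NAME"]
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Attribute)
            and node.value.attr == "environ"
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, str)
        ):
            names.add(node.slice.value)
        # Matches <anything>.getenv("NAME", ...)
        elif (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "getenv"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            names.add(node.args[0].value)
    return names


def missing_env_vars(source: str, declared: set[str]) -> set[str]:
    """Env vars the code reads that the manifest does not declare."""
    return env_vars_read(source) - declared


handler = '''
import os
client_key = os.environ["OPENAI_API_KEY"]
region = os.getenv("REGION", "eu-west-1")
'''
declared_in_manifest = {"REGION"}  # assumed to come from deployment.yaml
print(sorted(missing_env_vars(handler, declared_in_manifest)))
```

The sketch requires Python 3.9 or later, where a subscript’s slice is the expression node itself.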

Because the audit is a command you run rather than a rule the model loads, the same invocation works in three places: a developer runs it before opening a PR, an agent runs it as part of its own loop after generating code, and CI runs it as a merge gate. You can wire it into one, two, or all three. For the agent loop, one way to do that is an instruction in the agent’s rules file (such as CLAUDE.md), for example:

## Validation Workflow
After generating or modifying any Kubernetes-related code, run `/k8s-validation:audit`
on the changed files. If any CRITICAL findings exist, fix them before proceeding.
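For the CI case, a merge gate can be a few lines that read the audit’s summary line. The sketch below assumes the audit output has been captured as text in the summary format shown in the sample run above. parse_summary and gate are hypothetical helpers, and how you invoke the audit non-interactively depends on your setup:

```python
import re

# Matches "CRITICAL 2" style pairs, but not "[CRITICAL] path: ..." finding lines.
SUMMARY_RE = re.compile(r"(CRITICAL|HIGH|MEDIUM|INFO)\s+(\d+)")


def parse_summary(audit_output: str) -> dict[str, int]:
    """Extract severity counts from the first line carrying all four labels."""
    for line in audit_output.splitlines():
        counts = {label: int(n) for label, n in SUMMARY_RE.findall(line)}
        if len(counts) == 4:
            return counts
    raise ValueError("no severity summary line found in audit output")


def gate(audit_output: str, blocking=("CRITICAL",)) -> int:
    """Return a process exit code: 1 if any blocking severity is non-zero."""
    counts = parse_summary(audit_output)
    return 1 if any(counts[level] > 0 for level in blocking) else 0


sample = """CRITICAL 2 | HIGH 4 | MEDIUM 0 | INFO 0
[CRITICAL] src/routes/summarise.py: Hardcoded OpenAI API key
"""
print(gate(sample))
```

A CI step would pass the captured output to gate and exit with the returned code. Passing blocking=("CRITICAL", "HIGH") tightens the gate without changing the parser.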

What commands can’t do

The audit is still static analysis. It can find “you hardcoded a secret” or “you’re missing a SecurityContext,” but it can’t tell you whether your filter_pii regex actually catches the PII your users will send, or whether the environment variable you’re reading will resolve to a value in your staging cluster. Commands check the shape of the code, not the behavior.

Layer 3: Integration tests (runtime, proving it works)

Your team probably already has integration tests that hit your API endpoints, check response shapes, and verify that authentication rejects bad credentials. These tests encode what “correct behavior” actually means for your application.
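As a point of reference, a test of that kind might look like the sketch below. It uses only the standard library. The bearer-token scheme, the 401/403 status codes, and the helper names are assumptions for illustration; the /summarise path and the summary field follow the handler shown earlier.

```python
import json
import urllib.error
import urllib.request


def post_json(url: str, body: dict, token: str = "") -> tuple[int, dict]:
    """POST a JSON body and return (status code, parsed JSON body or {})."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"  # assumed auth scheme
    req = urllib.request.Request(
        url, data=json.dumps(body).encode(), headers=headers, method="POST"
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, json.loads(resp.read() or b"{}")
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive as exceptions; only the status matters here.
        return err.code, {}


def check_rejects_missing_token(base_url: str) -> None:
    """Unauthenticated requests must be refused."""
    status, _ = post_json(f"{base_url}/summarise", {"text": "hello"})
    assert status in (401, 403), f"expected auth rejection, got {status}"


def check_summary_shape(base_url: str, token: str) -> None:
    """Authenticated requests must return a JSON body with a string summary."""
    status, body = post_json(f"{base_url}/summarise", {"text": "hello"}, token)
    assert status == 200, f"expected 200, got {status}"
    assert isinstance(body.get("summary"), str), f"unexpected body: {body}"
```

Both checks take the base URL as a parameter, so the same code can point at staging or at any other environment.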

The bottleneck is running them. Locally, you mock your database, your auth service, your message queue, and hope the mocks match reality. In CI, each cycle takes 5 to 10 minutes. For a human pushing a few times a day, it’s already frustrating enough. For an AI agent trying to fix a failing test, it’s a feedback loop far too slow to learn from: the agent burns tokens on every iteration, and the integration bugs only surface after the change has been written, pushed, and built.

mirrord changes this equation by letting your local process stand in for a deployed pod. Your local code gets the target pod’s environment variables, access to the same files, and the same view of internal services. In steal mode, traffic destined for the targeted pod is intercepted and routed to your local process instead of whatever’s deployed in staging. Your existing integration tests, pointed at staging endpoints as usual, now run against your local code in seconds, not minutes.

The same pattern scales horizontally. Because mirrord can split a single pod’s incoming traffic between many local processes using header-based filters, multiple agents (or developers) can iterate against the same staging cluster simultaneously, each one routing its own slice of the traffic to its own local code. One staging environment, many concurrent agents, real downstream services for all of them.
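One way to set this up is a small config file per agent. The sketch below generates one, assuming the feature.network.incoming keys (mode and http_filter.header_filter) match mirrord’s current configuration reference, which is worth verifying before relying on them. The x-agent-id header name and the agent_config helper are hypothetical:

```python
import json


def agent_config(target: str, agent_id: str) -> dict:
    """Build a mirrord config that steals only requests tagged for one agent."""
    return {
        "target": {"path": target},
        "feature": {
            "network": {
                "incoming": {
                    "mode": "steal",
                    # Assumed to be a regex matched against "name: value" headers.
                    "http_filter": {"header_filter": f"x-agent-id: {agent_id}"},
                }
            }
        },
    }


config = agent_config("deploy/content-api", "agent-1")
print(json.dumps(config, indent=2))
```

Each agent would start its session with its own file (mirrord accepts a config file via -f/--config-file), and its test client would send the matching header. Because the filter is a pattern rather than an exact match, pick agent IDs that are not prefixes of one another.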

What this catches that the other layers can’t

Consider a prompt like “have /summarise fetch the document from our content-store service first.” The agent writes a handler that calls http://content-store/documents/{id} and reads response.json()["title"].

The catch: content-store moved to v2 months ago and now returns {"document": {"name": ..., "text": ...}}. The flat title/body shape only exists in the AI’s training data. Skills generated structurally clean code (good). The audit confirmed the call was made and the response was consumed (also good). Neither layer knows what shape content-store actually returns today.

The setup to catch this takes two processes: you or your AI agent starts a mirrord session, and your e2e tests run as normal against the staging content-api endpoint:

# Terminal 1, run your local content-api in place of the deployed pod
mirrord exec --target deploy/content-api --steal -- python -m content_api

# Terminal 2, run the existing integration suite against staging as usual
pytest tests/integration/test_summarise.py

When the test hits staging’s content-api endpoint, mirrord steals the request and reroutes it to your local process. The local handler calls http://content-store/documents/..., and that outbound call also routes through mirrord, hitting the real content-store in staging. The real service returns {"document": {"name": ..., "text": ...}}. The local code does response.json()["title"] and crashes with KeyError.

You fix the code to read the new shape, rerun the test, and it passes. The bug surfaces in your local code, against real downstream services, in seconds, instead of after a deploy cycle. The same pattern works for any other dependency the code touches: environment variables from the pod, files from mounted volumes, database queries against the real Postgres. mirrord runs your code; the cluster supplies its real environment.
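The fix itself is small. A minimal sketch, assuming the v2 payload shape described above; extract_document is a hypothetical helper, and failing loudly on an unexpected shape is one design choice among several:

```python
def extract_document(payload: dict) -> tuple[str, str]:
    """Return (name, text) from a content-store v2 response body."""
    try:
        document = payload["document"]
        return document["name"], document["text"]
    except (KeyError, TypeError) as exc:
        # Name the contract mismatch explicitly instead of leaking a bare KeyError.
        raise ValueError(
            f"unexpected content-store response shape: {payload!r}"
        ) from exc


v2 = {"document": {"name": "Q3 report", "text": "Revenue grew."}}
print(extract_document(v2))

# The stale flat shape the agent assumed.
v1 = {"title": "Q3 report", "body": "Revenue grew."}
try:
    extract_document(v1)
except ValueError as err:
    print("rejected:", err)
```

Raising a descriptive error means the next contract change shows up in the test output as a message about the response shape rather than as a stack trace to decode.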

Layer 4: Human review in a real environment

When the agent opens the PR, a human should still get to see the change running in a real environment, not just read the diff. mirrord’s Preview environments make that easy: a GitHub Action spins up an isolated pod in your staging cluster running the PR’s code, connected to all the real downstream services, and scoped to that PR via an environment key.

Reviewers can click through the actual feature instead of inferring behavior from the code or having to run the whole application locally. Most of the failures the previous three layers don’t catch (UX regressions, surprising interactions, “looks right but feels wrong”) show up the first time a human uses the thing.

Putting it together: agents handle layers 1–3, humans handle layer 4

Layers 1 through 3 can be run by the agent itself. Instead of generating code, opening a PR, and hoping CI catches the issues, the agent generates code shaped by skills, runs /k8s-validation:audit to check for structural issues, runs the integration tests via mirrord against real infrastructure, and fixes any failures before committing. The agent doesn’t even have to write the test from scratch: Layer 1’s validation skill includes a rule that any new HTTP handler must come with an integration test, so the test gets generated alongside the endpoint. Layer 3 just runs it against real infrastructure.

Layer 4 is the handoff. Once the agent has passed its own checks, it opens a PR and a human gets a live preview environment to click through rather than a diff to infer from. The failures that surface at that stage (UX regressions, surprising interactions, things that look right but feel wrong) are exactly the ones that didn’t show up in the previous three layers.

We’ve documented the concrete setup (per-service mirrord configs, helper scripts, and an AGENTS.md that tells the agent which script to run) in our mirrord with AI agents guide. The broader argument for why this matters, including the token cost of agents stuck in a feedback-less loop, is in How to prevent token burn using mirrord with e2e tests.

Adopting the layers

The four layers are independent. Pick the ones that close your biggest gap and add the others when the next failure teaches you what’s still missing.

Install the validation skill with /plugin install k8s-validation@metalbear-co/k8s-validation-plugin in Claude Code, or as a local rules directory for Cursor and GitHub Copilot (see the repo README). The /k8s-validation:audit command ships in the same install. For the runtime layer, install mirrord and wrap your local service with mirrord exec --target deploy/<service> --steal -- <run-command>. For preview environments on each PR, the feature is included in mirrord for Teams.

Each layer closes a different gap. Stop at any point and you’ve made things better than they were.

The skills are open source. If your AI assistant generates something the skills don’t catch, open a PR.
