paul_h

Posted on May 28

AI Agents Mapped My Legacy Production Environment in One Hour.

#devops #aiops #sre #welcome

I inherited a black box.

Three VMs. A hundred-something microservices. Redis, ClickHouse, MySQL, some homegrown database nobody could name. Kafka and Zookeeper thrown in because of course they were.

Nobody knew how the services connected. The original team was gone. The architecture lived entirely in oral tradition, and the last person who could recite it had left six months ago.

This is not a metaphor. This is Tuesday for anyone who's done SRE work long enough.

Setup: 30 seconds, zero footprint

I already had Teleport for daily ops: SSH access and session recording. It worked, and I didn't want to break it.

What I did:

Installed knoxd on my Teleport proxy (not on the servers)
The AI agent team auto-configured a Teleport connector

That's it. Nothing new on my production machines. The agents ride the Teleport session I already had, with the permissions I'd already defined.

Non-invasive — not in the "we promise it's lightweight" sense. In the "there is literally nothing new running on your production machines" sense.

How it actually works

The agents SSH in through Teleport. Plain SSH commands, same ones you'd type yourself.

What makes this safe rather than terrifying:

|  | Auto-run | Requires human approval |
| --- | --- | --- |
| Read-only | `ps`, `ss`, `cat /proc/net/tcp`, `nginx -T` | none |
| Mutating | none | `kill`, `systemctl restart`, `rm` |

The sandbox: strict AST parsing + default-deny whitelist. The agents can look at everything but touch nothing without asking.
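
To make the default-deny idea concrete, here is a minimal sketch of a command gate. It is not Knox's parser: it swaps real AST parsing for Python's `shlex` tokenizer plus a blanket ban on shell metacharacters, and the names `READ_ONLY`, `ALLOWED_ARGS`, `NEEDS_APPROVAL`, and `classify` are hypothetical.

```python
import shlex

# Hypothetical allowlists; a real one would also constrain flags per program.
READ_ONLY = {"ps", "ss", "cat"}
ALLOWED_ARGS = {"nginx": {"-T"}}          # program -> the only arguments allowed
NEEDS_APPROVAL = {"kill", "systemctl", "rm"}
# Characters that could chain, redirect, or substitute commands.
SHELL_META = set(";|&<>`$(){}\n")


def classify(command: str) -> str:
    """Return 'auto', 'approval', or 'deny' for a single command line."""
    if any(ch in SHELL_META for ch in command):
        return "deny"
    try:
        tokens = shlex.split(command)
    except ValueError:                    # e.g. unbalanced quotes
        return "deny"
    if not tokens:
        return "deny"
    program, args = tokens[0], tokens[1:]
    if program in READ_ONLY:
        return "auto"
    if program in ALLOWED_ARGS and args and set(args) <= ALLOWED_ARGS[program]:
        return "auto"
    if program in NEEDS_APPROVAL:
        return "approval"
    return "deny"                         # default-deny: anything unlisted is blocked
```

With these lists, `classify("nginx -T")` is auto-run, `classify("systemctl restart nginx")` waits for a human, and `classify("ps aux; rm -rf /")` is denied outright because of the `;`. Rejecting metacharacters before tokenizing is deliberately conservative: it also blocks harmless quoted uses, which is the trade a token-level check makes in place of a full syntax tree.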

What the agents discovered

Step 1: OS inventory — kernel, distro, packages. All 3 VMs in parallel.

Step 2: Process mapping — ps aux, parsed. Hundreds of processes tagged with binary path, resource footprint, parent-child relationships.
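
A rough sketch of what parsing `ps aux` can look like, assuming the usual procps column layout. The sample output and the `Proc` and `parse_ps_aux` names are made up for illustration. Note that `ps aux` has no parent PID column, so parent-child links would need something like `ps -eo pid,ppid,comm` instead.

```python
from dataclasses import dataclass

# Made-up sample output for illustration only.
SAMPLE_PS_AUX = """\
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.1 168000 11000 ?        Ss   May01   0:09 /sbin/init
redis        812  1.2  3.4 250000 70000 ?        Ssl  May01  12:31 /usr/bin/redis-server 127.0.0.1:6379
app         1440  5.0  8.1 900000 165000 ?       Sl   May02  98:10 /opt/billing/bin/billing-api --port 8080
"""


@dataclass
class Proc:
    user: str
    pid: int
    cpu_pct: float
    mem_pct: float
    rss_kib: int
    binary: str
    cmdline: str


def parse_ps_aux(text):
    """Turn `ps aux` text into a list of Proc records."""
    procs = []
    for line in text.splitlines()[1:]:      # skip the header row
        fields = line.split(None, 10)       # the 11th field (COMMAND) may contain spaces
        if len(fields) < 11:
            continue                        # ignore blank or truncated lines
        user, pid, cpu, mem, _vsz, rss, _tty, _stat, _start, _time, cmd = fields
        procs.append(
            Proc(user, int(pid), float(cpu), float(mem), int(rss), cmd.split()[0], cmd)
        )
    return procs
```

Calling `parse_ps_aux(SAMPLE_PS_AUX)` yields three records; the binary path of each one is what the next step uses to guess a service name.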

Step 3: Process → Service resolution

Check name service first
If unregistered (most services weren't registered; this is a legacy system), infer from install path
Flag for human confirmation before writing anything back
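
The three rules above can be sketched as a small resolver. Everything here is an assumption for illustration: the registry is a plain dict, the install-path convention (`/opt/<service>/...` or `/srv/<service>/...`) is invented, and `Resolution` and `resolve_service` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class Resolution:
    service: Optional[str]
    source: str                 # "registry", "install_path", or "unknown"
    needs_confirmation: bool


def resolve_service(binary, registry):
    """Map a binary path to a service name, flagging anything that is a guess."""
    # 1. Trust the name service when it knows the binary.
    if binary in registry:
        return Resolution(registry[binary], "registry", False)
    # 2. Otherwise infer from the assumed layout /opt/<service>/... or /srv/<service>/...
    parts = PurePosixPath(binary).parts
    if len(parts) >= 3 and parts[1] in ("opt", "srv"):
        return Resolution(parts[2], "install_path", True)
    # 3. No evidence at all: leave it unnamed and ask a human.
    return Resolution(None, "unknown", True)
```

The point of the `needs_confirmation` flag is that an inferred name never lands in the map silently; only registry hits skip the human check.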

The AI doesn't hallucinate service names into your architecture map. It asks.

Step 4: Service → Business Island grouping

A business island = logical grouping by business function (billing, user auth, order processing). The thing that exists in every architect's head but never in any document.

Step 5: Connection mapping — four evidence sources, cross-referenced:

| Source | What it reveals | Example |
| --- | --- | --- |
| Network connections (`ss -tnp`) | Live TCP dependencies | Port 6379 → Redis, port 9092 → Kafka |
| Config files | Declared dependencies | `kafka.brokers: kafka-01:9092` in YAML |
| Access logs | Actual call patterns | Who calls whom, how often |
| LB configs (nginx) | Ingress chain | Domain → LB → real server |

Cross-reference. Resolve conflicts. Draw edges.
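
One way to picture the cross-referencing, as a sketch under assumptions: each evidence source has already been reduced to `(source, caller, callee)` tuples, config files and nginx count as "declared", and `ss` and access logs count as "observed". The observations, status labels, and `build_edges` are all invented for illustration.

```python
from collections import defaultdict

# Made-up observations: (source, caller, callee).
OBSERVATIONS = [
    ("ss", "billing-api", "redis"),
    ("config", "billing-api", "redis"),
    ("config", "billing-api", "kafka"),
    ("access_log", "gateway", "billing-api"),
    ("nginx", "gateway", "billing-api"),
]

DECLARED = {"config", "nginx"}      # what the system says it does
OBSERVED = {"ss", "access_log"}     # what it was actually seen doing


def build_edges(observations):
    """Group observations by (caller, callee) and label how well the sources agree."""
    sources = defaultdict(set)
    for source, caller, callee in observations:
        sources[(caller, callee)].add(source)
    edges = []
    for (caller, callee), seen in sorted(sources.items()):
        declared = bool(seen & DECLARED)
        observed = bool(seen & OBSERVED)
        if declared and observed:
            status = "confirmed"
        elif declared:
            status = "declared_only"    # configured, but never seen on the wire
        else:
            status = "undeclared"       # live traffic with no matching config
        edges.append(
            {"from": caller, "to": callee, "sources": sorted(seen), "status": status}
        )
    return edges
```

In the sample data, `billing-api → kafka` comes out as `declared_only`: the config names a broker, but no live connection backs it up. Those disagreements are the conflicts worth a second look before an edge goes on the diagram.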

One hour.

What I got

Architecture diagrams — topology maps of each business island, services as nodes, dependencies as edges, data flows labeled. The kind of diagram you'd pay a consultant a week to produce.

High-risk report:

Single points of failure
Circular dependencies
Kafka topics with no visible consumer group
One Redis instance holding session state for 6 business islands, zero isolation
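
Two of these findings fall out of standard graph checks once the edges exist. The sketch below assumes the map is a dict of service to dependencies; the graph, `find_cycle`, and `shared_dependencies` are illustrative names, not KnoxOps internals. High fan-in is only a hint of a single point of failure, since the graph alone says nothing about replication.

```python
# Made-up dependency graph: service -> services it depends on.
GRAPH = {
    "orders": ["billing", "redis"],
    "billing": ["auth", "redis"],
    "auth": ["orders", "redis"],
    "redis": [],
}


def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None (depth-first search)."""
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:           # back edge: dep is already on the path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def shared_dependencies(graph, min_dependents=2):
    """List dependencies used by at least `min_dependents` services."""
    dependents = {}
    for service, deps in graph.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)
    return {
        dep: sorted(users)
        for dep, users in dependents.items()
        if len(users) >= min_dependents
    }
```

On the sample graph, `find_cycle(GRAPH)` walks `orders → billing → auth` and lands back on `orders`, and `shared_dependencies(GRAPH, 3)` singles out `redis` as the one node all three services lean on.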

Things I needed to know. Things dashboards would never show me.

The cost

Zero.

Knox gives free credits on signup. Enough for a small cluster for a long time. No credit card. No trial-that-converts-to-paid. One binary on a jump host.

Why this matters

Most AIOps tools treat metrics as the final answer. They're not. They're the starting point.

Real outages hide in blind spots:

System logs nobody tails
Manual changes nobody tracked
Config drift APM tools don't see

To find root cause, you have to log into machines and build an evidence chain. That's what humans do. That's what these agents do.

Monitoring tells you a metric crossed a threshold. It doesn't tell you:

Service X and Y form a circular dependency that will cascade
Your session store is a single point of failure for half the platform

Those aren't metric problems. They're structure problems. LLMs are good at reasoning about structure, if you give them a way to see it without breaking anything.

Safety model

Letting AI touch production should sound terrifying. That's why the safety model has four parts:

AST-parsed command validation — not string matching, actual syntax tree analysis
Default-deny whitelist — everything blocked unless explicitly allowed
Human-in-the-loop — any destructive action requires a plan + approval
Connector model — agents use paths you already trust (Teleport, SSH, AWS, Prometheus)

The agents never need their own access path. They never open a new hole in your security posture.

That's the difference between an agent you'd let near production and one you wouldn't.

What I'm building

It's called KnoxOps. Core idea: infrastructure is an object graph, not a flat list of resources. Model it that way and LLMs can reason like a senior SRE — tracing dependencies, calculating blast radius, finding what dashboards miss.
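
As a small illustration of the graph idea, blast radius can be read as reverse reachability: start at the failed node and follow "who depends on me" edges until nothing new turns up. This is a sketch over a made-up graph, not KnoxOps code; `DEPENDS_ON` and `blast_radius` are hypothetical names.

```python
from collections import deque

# Made-up graph: service -> services it depends on.
DEPENDS_ON = {
    "gateway": ["orders", "auth"],
    "orders": ["billing", "kafka"],
    "billing": ["mysql"],
    "auth": ["redis"],
    "reports": ["clickhouse"],
}


def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so each node knows who depends on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)
    affected, queue = set(), deque([failed])
    while queue:                          # breadth-first walk over reverse edges
        node = queue.popleft()
        for user in dependents.get(node, ()):
            if user not in affected and user != failed:
                affected.add(user)
                queue.append(user)
    return affected
```

Here a `mysql` failure reaches `billing`, then `orders`, then `gateway`, while `reports` is untouched. A flat list of resources cannot answer that question; a graph answers it with one traversal.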

The goal: delegate routine SRE toil so developers can focus on building.

More connectors coming. The principle stays the same: use the access paths you already trust.

If you've inherited a system nobody understands — I'd like to hear from you.

I'm the founder of KnoxOps. Currently in open beta — use code DEVTO26 for 10,000 free credits on signup.
