I inherited a black box.
Three VMs. A hundred-something microservices. Redis, ClickHouse, MySQL, some homegrown database nobody could name. Kafka and Zookeeper thrown in because of course they were.
Nobody knew how the services connected. The original team was gone. The architecture lived entirely in oral tradition, and the last person who could recite it had left six months ago.
This is not a metaphor. This is Tuesday for anyone who's done SRE work long enough.
Setup: 30 seconds, zero footprint
I already had Teleport for daily ops: SSH access, session recording. It worked, and I didn't want to break it.
What I did:
- Installed `knoxd` on my Teleport proxy (not on the servers)
- AI agent team auto-configured a Teleport connector
That's it. Nothing new on my production machines. The agents ride the Teleport session I already had, with the permissions I'd already defined.
Non-invasive — not in the "we promise it's lightweight" sense. In the "there is literally nothing new running on your production machines" sense.
How it actually works
The agents SSH in through Teleport. Plain SSH commands, same ones you'd type yourself.
What makes this safe rather than terrifying:
| | Auto-run | Requires human approval |
|---|---|---|
| Read-only | `ps`, `ss`, `cat /proc/net/tcp`, `nginx -T` | none |
| Mutating | none | `kill`, `systemctl restart`, `rm` |
The sandbox: strict AST parsing + default-deny whitelist. The agents can look at everything but touch nothing without asking.
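To make the gate concrete, here is a minimal sketch of a default-deny command classifier. It is not Knox's actual sandbox: it uses Python's `shlex` tokenizer as a stand-in for a real shell AST parser, and the policy tables (`AUTO_RUN`, `NEEDS_APPROVAL`) and the `classify` function are hypothetical names built from the commands in the table above.

```python
import shlex

# Hypothetical policy: each auto-run binary maps to a check on its arguments.
AUTO_RUN = {
    "ps": lambda args: True,
    "ss": lambda args: True,
    "cat": lambda args: True,
    "nginx": lambda args: args == ["-T"],  # config dump only, never a reload
}
NEEDS_APPROVAL = {"kill", "systemctl", "rm"}
SHELL_OPERATORS = set("();<>|&")


def classify(command: str) -> str:
    """Return "auto", "approval", or "deny" for a single shell command."""
    # Substitution and multi-line input can hide a second command.
    if any(ch in command for ch in "`$\n"):
        return "deny"
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    try:
        tokens = list(lexer)
    except ValueError:  # e.g. unbalanced quotes
        return "deny"
    if not tokens:
        return "deny"
    # Any operator token means pipes, chaining, or redirection: refuse.
    if any(tok and set(tok) <= SHELL_OPERATORS for tok in tokens):
        return "deny"
    binary, args = tokens[0], tokens[1:]
    if binary in AUTO_RUN:
        return "auto" if AUTO_RUN[binary](args) else "deny"
    if binary in NEEDS_APPROVAL:
        return "approval"
    return "deny"  # default-deny: unknown binaries never run
```

The sketch errs on the side of refusing: a quoted `;` is treated like a real operator, and `nginx` with anything other than `-T` is denied outright. Plain string matching would miss cases like `cat /proc/net/tcp; rm -rf /`, which is why the command is tokenized before the binary is looked up.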
What the agents discovered
Step 1: OS inventory — kernel, distro, packages. All 3 VMs in parallel.
Step 2: Process mapping — ps aux, parsed. Hundreds of processes tagged with binary path, resource footprint, parent-child relationships.
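As a rough illustration of the process-mapping step, the sketch below parses process listings into a parent-child map. One assumption to flag: plain `ps aux` does not print parent PIDs, so this example assumes the listing came from `ps -eo pid,ppid,rss,args` instead. The function names `parse_ps` and `children_of` are hypothetical.

```python
from collections import defaultdict


def parse_ps(output: str) -> dict:
    """Parse `ps -eo pid,ppid,rss,args` output into {pid: info}."""
    procs = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        pid, ppid, rss, args = line.split(None, 3)
        procs[int(pid)] = {
            "ppid": int(ppid),
            "rss_kb": int(rss),  # resident memory in KiB
            "cmd": args.strip(),
        }
    return procs


def children_of(procs: dict) -> dict:
    """Invert the ppid links into {parent_pid: [child_pids]}."""
    tree = defaultdict(list)
    for pid, info in procs.items():
        tree[info["ppid"]].append(pid)
    return dict(tree)
```

The binary path is the first word of `cmd`, which is what the next step uses to guess a service name.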
Step 3: Process → Service resolution
- Check name service first
- If unregistered (most services weren't registered; legacy system), infer from install path
- Flag for human confirmation before writing anything back
The AI doesn't hallucinate service names into your architecture map. It asks.
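The three-step resolution above can be sketched roughly as follows. This is an illustration under assumptions, not the agents' real logic: the registry is modeled as a plain dict from binary path to service name, and the install-path convention (`/opt/<service>/...` or `/srv/<service>/...`) is a guess chosen for the example.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Dict, Optional


@dataclass
class Resolution:
    service: Optional[str]
    source: str      # "registry", "install_path", or "unknown"
    confirmed: bool  # False means a human must confirm before write-back


def resolve_service(exe_path: str, registry: Dict[str, str]) -> Resolution:
    # 1. Trust the name service when it knows the binary.
    if exe_path in registry:
        return Resolution(registry[exe_path], "registry", True)
    # 2. Otherwise infer from the install path (assumed layout).
    parts = PurePosixPath(exe_path).parts  # ('/', 'opt', 'billing-api', ...)
    if len(parts) >= 4 and parts[1] in ("opt", "srv"):
        return Resolution(parts[2], "install_path", False)
    # 3. No guess at all: leave it for a human.
    return Resolution(None, "unknown", False)
```

The important part is the `confirmed` flag: an inferred name is only a proposal until someone approves it.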
Step 4: Service → Business Island grouping
A business island = logical grouping by business function (billing, user auth, order processing). The thing that exists in every architect's head but never in any document.
Step 5: Connection mapping — four evidence sources, cross-referenced:
| Source | What it reveals | Example |
|---|---|---|
| Network connections (`ss -tnp`) | Live TCP dependencies | Port 6379 → Redis, port 9092 → Kafka |
| Config files | Declared dependencies | `kafka.brokers: kafka-01:9092` in YAML |
| Access logs | Actual call patterns | Who calls whom, how often |
| LB configs (nginx) | Ingress chain | Domain → LB → real server |
Cross-reference. Resolve conflicts. Draw edges.
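Here is a small sketch of what cross-referencing could look like for two of the four sources. It assumes the usual `ss -tnp` line layout (state, queues, local address, peer address, `users:(("name",pid=...,fd=...))`), a hypothetical `KNOWN_PORTS` table, and made-up labels for the outcome of each edge. Only connections to known ports are kept, which sidesteps the question of telling inbound from outbound connections.

```python
import re
from collections import defaultdict

# Assumed port-to-service table; a real one would come from the inventory.
KNOWN_PORTS = {6379: "redis", 9092: "kafka"}

SS_LINE = re.compile(
    r'^ESTAB\s+\d+\s+\d+\s+\S+\s+(?P<peer>\S+):(?P<port>\d+)\s+'
    r'users:\(\("(?P<proc>[^"]+)"'
)


def edges_from_ss(output: str) -> list:
    """Turn established connections into ("network", caller, callee)."""
    edges = []
    for line in output.splitlines():
        match = SS_LINE.match(line.strip())
        if not match:
            continue  # header, LISTEN rows, or rows without process info
        port = int(match["port"])
        if port in KNOWN_PORTS:
            edges.append(("network", match["proc"], KNOWN_PORTS[port]))
    return edges


def merge_evidence(observations) -> dict:
    """Group (source, caller, callee) tuples into {edge: {sources}}."""
    merged = defaultdict(set)
    for source, caller, callee in observations:
        merged[(caller, callee)].add(source)
    return dict(merged)


def classify_edge(sources: set) -> str:
    observed = sources & {"network", "access_log"}
    declared = sources & {"config", "lb_config"}
    if observed and declared:
        return "confirmed"
    return "observed_only" if observed else "declared_only"
```

An edge that is declared in config but never observed may be dead configuration. An edge that is observed but never declared is the kind of undocumented dependency this whole exercise is meant to surface.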
One hour.
What I got
Architecture diagrams — topology maps of each business island, services as nodes, dependencies as edges, data flows labeled. The kind of diagram you'd pay a consultant a week to produce.
High-risk report:
- Single points of failure
- Circular dependencies
- Kafka topics with no visible consumer group
- One Redis instance holding session state for 6 business islands, zero isolation
Things I needed to know. Things dashboards would never show me.
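Two of these findings fall out of simple graph checks once the edges exist. The sketch below is illustrative only: the graph is a plain dict of service to dependencies, and `find_cycle`, `shared_across_islands`, and the island mapping are hypothetical names.

```python
from collections import defaultdict


def find_cycle(graph: dict):
    """Return one dependency cycle as a list of services, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in graph.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:  # back edge: dep is already on the path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def shared_across_islands(graph: dict, island_of: dict, min_islands: int = 2):
    """Find dependencies used by services from several business islands."""
    users = defaultdict(set)
    for caller, deps in graph.items():
        for dep in deps:
            users[dep].add(island_of.get(caller, "unknown"))
    return {dep: isl for dep, isl in users.items() if len(isl) >= min_islands}
```

A dependency shared by many islands is not automatically a single point of failure, but it is the first place to look for one.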
The cost
Zero.
Knox gives free credits on signup. Enough for a small cluster for a long time. No credit card. No trial-that-converts-to-paid. One binary on a jump host.
Why this matters
Most AIOps tools treat metrics as the final answer. They're not. They're the starting point.
Real outages hide in blind spots:
- System logs nobody tails
- Manual changes nobody tracked
- Config drift APM tools don't see
To find root cause, you have to log into machines and build an evidence chain. That's what humans do. That's what these agents do.
Monitoring tells you a metric crossed a threshold. It doesn't tell you:
- Service X and Y form a circular dependency that will cascade
- Your session store is a single point of failure for half the platform
Those aren't metric problems. They're structure problems. LLMs are uniquely good at structure — if you give them a way to see it without breaking anything.
Safety model
Letting AI touch production should sound terrifying. That's why the safety model has four parts:
- AST-parsed command validation — not string matching, actual syntax tree analysis
- Default-deny whitelist — everything blocked unless explicitly allowed
- Human-in-the-loop — any destructive action requires a plan + approval
- Connector model — agents use paths you already trust (Teleport, SSH, AWS, Prometheus)
The agents never need their own access path. They never open a new hole in your security posture.
That's the difference between an agent you'd let near production and one you wouldn't.
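The human-in-the-loop rule can be pictured as a plan object that refuses to run until someone signs it. This is a minimal sketch with hypothetical names (`Plan`, `ApprovalRequired`, `execute`), not the product's API, and the `run` callable stands in for whatever actually sends commands over the existing session.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class ApprovalRequired(Exception):
    """Raised when a mutating plan is executed without sign-off."""


@dataclass
class Plan:
    reason: str                        # why the agent wants to act
    commands: List[str]                # exactly what it intends to run
    approved_by: Optional[str] = None  # set by a human, never by the agent


def execute(plan: Plan, run: Callable[[str], str]) -> List[str]:
    """Run every command in the plan, but only after approval."""
    if plan.approved_by is None:
        raise ApprovalRequired(f"plan needs approval: {plan.reason}")
    return [run(cmd) for cmd in plan.commands]
```

The reviewer sees the reason and the exact commands together, so approval covers a specific action and not a blanket permission.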
What I'm building
It's called KnoxOps. Core idea: infrastructure is an object graph, not a flat list of resources. Model it that way and LLMs can reason like a senior SRE — tracing dependencies, calculating blast radius, finding what dashboards miss.
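To show why the graph model matters, here is a minimal blast-radius calculation: given "service depends on" edges, walk them backwards from a failed component to find everything that could be affected. The data shape and the `blast_radius` name are assumptions for illustration, not KnoxOps internals.

```python
from collections import defaultdict, deque


def blast_radius(depends_on: dict, failed: str) -> set:
    """Return every service that directly or transitively needs `failed`."""
    # Invert the edges: for each dependency, who relies on it?
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    affected = set()
    queue = deque([failed])
    while queue:  # breadth-first walk up the dependency chain
        node = queue.popleft()
        for upstream in dependents[node]:
            if upstream != failed and upstream not in affected:
                affected.add(upstream)
                queue.append(upstream)
    return affected
```

A flat resource list cannot answer this question. A graph answers it in a few lines.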
The goal: delegate routine SRE toil so developers can focus on building.
More connectors coming. The principle stays the same: use the access paths you already trust.
If you've inherited a system nobody understands — I'd like to hear from you.
I'm the founder of KnoxOps. Currently in open beta — use code DEVTO26 for 10,000 free credits on signup.


