paul_h

Posted on Jun 15

I Asked Claude to Map My Infrastructure. Then I Asked a Purpose-Built Tool.

#ai #infrastructure #sre #devops

I manage a small stack. Three Linux VMs, one Kubernetes cluster, maybe 20-something services total. Not big. But underdocumented — the kind of environment where you SSH in and discover things you forgot were running.

Last week I ran the same task through two different AI tools: "tell me what's running, how it connects, and what looks risky." One is a general-purpose LLM (Claude). The other is a purpose-built AI SRE tool. Same environment, same ask. The results were... instructive.

The task

Simple brief: infrastructure discovery. I want a full picture — services, dependencies, topology, risks. The kind of thing a new hire would spend their first week piecing together from wikis that haven't been updated since 2023.

Claude Code (Opus model)

My prompt:

"I manage a small infrastructure — 3 Linux VMs (172.30.0.41, 172.30.0.42, 172.30.0.43) and a Kubernetes cluster. SSH access is already configured. Help me understand what's running across this environment — I want a full picture of my services, dependencies, and topology."

I'm running Claude Code locally with the Opus model — their flagship tier. Claude didn't ask questions. It just started SSH-ing in.

Five minutes later it handed me a report. And honestly? It was better than I expected.

What Claude delivered:

Identified all three VM roles correctly (API Gateway, Order Processing, Data Tier)
Drew an ASCII topology showing Nginx routing to backend services with canary weights
Built a full service table — host, port, tech stack, notes
Mapped the Redis Sentinel cluster including a stale replica on a decommissioned node
Enumerated every K8s namespace and workload
Traced the observability pipeline (node_exporter → Prometheus, OTel → Jaeger, Datadog agents)
Flagged four real issues: dead Redis replica, broken image pulls in aigc-app, active canary split, multiple knoxd versions

Five minutes. No hand-holding. For a "quick, what's running here?" sweep, this is genuinely useful.

Where it stops

Here's what I noticed after the initial "wow, that was fast" wore off.

The output is a wall of markdown. Accurate, mostly. But flat. Everything has the same weight — a critical single-point-of-failure sits next to a cosmetic naming inconsistency. No severity. No priority.

More specifically:

No topology visualization. I got an ASCII diagram. It's readable for 6 machines. At 60 machines, it's unreadable. At 600, impossible.

No business grouping. Claude listed every service but couldn't tell me which ones form the e-commerce flow vs. the logistics flow vs. the platform layer. That requires domain context it doesn't have.

No risk assessment. Four issues found, but no severity classification. The dead Redis replica and the cosmetic knoxd naming thing are presented with equal weight.

No quality gate. Nobody verified whether Claude's topology was actually correct. It connected things confidently — but was the canary weight really 90/10? I'd need to go check.

No persistence. Close the chat window. The report is gone. Tomorrow I'd run it again and get a slightly different exploration path, slightly different findings.

No depth control. I can't say "that group of services looks risky, go deeper on it." It's all-or-nothing.
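The quality-gate question above (was the canary weight really 90/10?) is the kind of claim that can be checked mechanically. Here is a minimal sketch, assuming the canary split is expressed as `weight=` parameters on `server` lines inside an Nginx `upstream` block. The post doesn't say how the split is actually configured, and a `split_clients` setup would need a different parser. The sample config, upstream name, and addresses are hypothetical.

```python
import re

# Hypothetical config for illustration; real file contents and names will differ.
SAMPLE_CONF = """
upstream orders_backend {
    server 10.0.0.11:8080 weight=9;
    server 10.0.0.12:8080 weight=1;
}
"""


def upstream_weights(conf_text: str, upstream: str) -> dict[str, float]:
    """Return each server's share of traffic (in percent) for one upstream block.

    Only looks at `weight=`; servers marked `down` or `backup` are not
    treated specially, so this is an approximation of the real split.
    """
    block = re.search(
        r"upstream\s+" + re.escape(upstream) + r"\s*\{(.*?)\}", conf_text, re.S
    )
    if block is None:
        raise KeyError(f"upstream {upstream!r} not found")

    weights: dict[str, int] = {}
    for line in block.group(1).splitlines():
        m = re.match(r"\s*server\s+(\S+?)(?:\s+([^;]*))?;", line)
        if not m:
            continue
        addr, params = m.group(1), m.group(2) or ""
        w = re.search(r"\bweight=(\d+)", params)
        weights[addr] = int(w.group(1)) if w else 1  # Nginx defaults weight to 1

    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"no weighted servers found in upstream {upstream!r}")
    return {addr: 100 * w / total for addr, w in weights.items()}


if __name__ == "__main__":
    print(upstream_weights(SAMPLE_CONF, "orders_backend"))
```

Comparing this output against the split an LLM report claims turns "I'd need to go check" into a repeatable step.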

This maps to a pattern I keep seeing across industries. In legal tech, people noticed the same thing — general LLMs are good at summarizing contracts but can't do precision clause verification. In finance, ChatGPT can describe how to post a journal entry but can't actually post one. The dividing line is consistent: general AI is a thinking tool; specialized AI is an acting tool.

When the task is "reason about this data and explain it to me" — general tools are great. When the task shifts to "build a structured, persistent, verifiable model of my environment" — you've crossed into territory they weren't designed for.
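To make "structured, persistent, verifiable" concrete, here is a minimal sketch of findings stored with a severity, saved to disk, and compared between runs. The four findings come from Claude's report above, but the keys and severity labels are illustrative assumptions, not an assessment from either tool. The JSON file layout is also hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    key: str       # stable identifier so separate runs can be compared
    summary: str
    severity: str  # one of the SEVERITY_ORDER keys


# Severities below are assumed for illustration only.
FINDINGS = [
    Finding("redis-stale-replica", "Stale Redis replica on a decommissioned node", "high"),
    Finding("aigc-image-pull", "Broken image pulls in aigc-app", "high"),
    Finding("canary-split", "Active canary split", "medium"),
    Finding("knoxd-versions", "Multiple knoxd versions", "low"),
]


def ranked(findings):
    """Most severe first; ties broken by key for a stable order."""
    return sorted(findings, key=lambda f: (SEVERITY_ORDER[f.severity], f.key))


def save(findings, path):
    """Write the ranked findings to a JSON file."""
    rows = [asdict(f) for f in ranked(findings)]
    Path(path).write_text(json.dumps(rows, indent=2))


def load(path):
    """Read findings back from a file written by save()."""
    return [Finding(**row) for row in json.loads(Path(path).read_text())]


def diff(previous, current):
    """Report which finding keys appeared or disappeared between two runs."""
    prev = {f.key for f in previous}
    curr = {f.key for f in current}
    return {"new": sorted(curr - prev), "resolved": sorted(prev - curr)}


if __name__ == "__main__":
    for f in ranked(FINDINGS):
        print(f"{f.severity:<8} {f.summary}")
```

Saving each run and calling `diff(load("yesterday.json"), load("today.json"))` is what separates a one-off report from a model of the environment that can be tracked over time.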

Purpose-built tool, same task

For comparison, here's what happens when I send one line to Knox (our purpose-built AI SRE tool — yes, this is our product, stating that upfront):

"Run a full infrastructure discovery on our production environment."

Shorter prompt. No need to explain the environment — it already has connectors configured.

Twenty minutes later:

The differences that matter:

Visual topology — not ASCII art, an interactive service relationship graph
Business Islands — services auto-grouped by business function with criticality labels
Risk Triage — findings ranked by severity with a distribution chart
Persistence — results stored in a graph database, queryable later
Depth on demand — "Deep Analysis Available" button for any Business Island
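To show why storing results as a graph matters, here is a small sketch of the kind of question a queryable topology can answer: "if this service fails, what else is affected, and which business groups does that touch?" It uses plain Python structures rather than a graph database. The service names, dependency edges, and island labels are hypothetical, and this is not a description of Knox's actual data model.

```python
from collections import defaultdict, deque

# (service, depends_on) pairs. Hypothetical names for illustration.
EDGES = [
    ("nginx", "order-api"),
    ("order-api", "redis"),
    ("order-api", "postgres"),
    ("shipping-api", "postgres"),
]

# Hypothetical mapping of each service to a business group.
ISLANDS = {
    "nginx": "platform",
    "redis": "platform",
    "postgres": "platform",
    "order-api": "e-commerce",
    "shipping-api": "logistics",
}


def impacted_by(failed, edges):
    """Return services that depend on `failed`, directly or transitively."""
    dependents = defaultdict(set)
    for service, dependency in edges:
        dependents[dependency].add(service)

    seen = {failed}
    queue = deque([failed])
    while queue:  # breadth-first walk up the dependency chain
        node = queue.popleft()
        for service in dependents[node]:
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen - {failed}


def impacted_islands(failed, edges, islands):
    """Return the sorted business groups touched by a failure of `failed`."""
    return sorted({islands[s] for s in impacted_by(failed, edges)})


if __name__ == "__main__":
    print(impacted_by("redis", EDGES))
    print(impacted_islands("postgres", EDGES, ISLANDS))
```

A flat markdown report contains the same facts, but answering this question from it means re-reading the whole thing each time.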

How it got there — a team of agents, not a single model:

This is the work process, not a deliverable. Multiple specialized agents collaborated — one coordinated the task, one did the actual discovery, one quality-checked the findings — flagging 9 uncertain items for human review instead of presenting everything with equal confidence.
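The quality-check step can be pictured as a simple triage: claims with enough confidence and supporting evidence go into the report, and everything else is routed to a human. The sketch below is an assumption about the general pattern, not Knox's implementation. The `Claim` type, the 0.8 threshold, and the sample claims are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    statement: str
    confidence: float  # 0.0 to 1.0, as scored by a hypothetical reviewer agent
    evidence: list[str] = field(default_factory=list)


# Hypothetical claims for illustration.
CLAIMS = [
    Claim("nginx routes /orders to order-api", 0.95, ["nginx.conf upstream block"]),
    Claim("canary split is 90/10", 0.60, ["nginx.conf"]),
    Claim("order-api depends on redis", 0.90, []),
]


def triage(claims, threshold=0.8):
    """Split claims into (confirmed, needs_review).

    A claim with no evidence always goes to review, whatever its score.
    """
    confirmed, needs_review = [], []
    for claim in claims:
        if claim.evidence and claim.confidence >= threshold:
            confirmed.append(claim)
        else:
            needs_review.append(claim)
    return confirmed, needs_review


if __name__ == "__main__":
    confirmed, needs_review = triage(CLAIMS)
    print(f"{len(confirmed)} confirmed, {len(needs_review)} flagged for human review")
```

The point of the pattern is that uncertainty shows up in the output instead of being smoothed over by confident prose.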

The scale question

I ran this on 5-6 machines. The gap is already visible. But this is the minimum-gap scenario.

At 60 servers across multiple environments, I'd expect Claude's context window to fill up. You'd need multiple sessions and manual stitching, and the "flat markdown" problem becomes unbearable. The gap doesn't grow linearly; it compounds.

That's not a knock on Claude. A Swiss Army knife is great. But when you need surgery, you reach for a scalpel.

What's your environment look like? At what scale did you find general AI tools hitting their ceiling for ops work?

If you want to try the purpose-built approach: knoxops.app

Top comments (1)

paul_h • Jun 23

Hi, dev.to, this is a reader perk for you.
Knox is currently in open beta. If you want to try mapping your own environment, use code DEVTO26 for 10,000 free credits at knoxops.app, enough to manage a small cluster for a month.