paul_h

Posted on Jun 23

AI Scanned My Infra — 67% Were Dead Weight on My AWS Bill

#aws #devops #cloudcomputing #aiops

Last weekend I used AI to scan my AWS account for idle resources. Here's what I found:

Scanned 3 EC2 instances, flagged 2 as suspicious. One of them was running an entire microservice stack — with zero business traffic.

The whole thing was done by 10 AI agents working together. I wasn't typing commands in a terminal. I defined a process contract, then let go.

Let me start with what was discovered.

Discovery: 3 EC2 Instances Scanned, 2 Suspicious

The scope was small: 3 t3.xlarge EC2 instances in us-east-1. ICO, the agent pipeline described later in this post, had no Prometheus or Datadog history to draw on, so it relied on SSH to grab real-time snapshots and process details.

After the first round of scoring and screening, .42 was excluded — someone had logged in 26 days ago, within the 30-day active threshold. A live machine. Two remained:

Resource	zombie_score	Level	CPU	Network	Last Login
ec2-172.30.0.41	0.35	LOW	6.4%	1.82 GB/day	41 days ago
ec2-172.30.0.43	0.35	LOW	8.4%	1.18 GB/day	99 days ago

The three-signal filter is straightforward: a daily average CPU above 20% counts as active, network traffic above 2 GB/day counts as active, and a human login within the last 30 days counts as active. An instance becomes a candidate only when all three signals read inactive. .42 was ruled out by the login signal. .41 and .43 tripped none of the active signals, but their network volume (1.82 and 1.18 GB/day) was not far below the 2 GB/day threshold, so they scored only 0.35.
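The filter described above can be sketched in a few lines. This is a minimal illustration, not ICO's actual code: the `Signals` class and function names are made up for the example, and treating a login exactly 30 days old as active is an assumption. The post does not say how zombie_score itself is computed, so the sketch covers only the candidate filter.

```python
from dataclasses import dataclass
from typing import List

# Thresholds as stated in the post.
CPU_ACTIVE_PCT = 20.0        # daily average CPU above this counts as active
NET_ACTIVE_GB_PER_DAY = 2.0  # daily network volume above this counts as active
LOGIN_ACTIVE_DAYS = 30       # a human login within this many days counts as active


@dataclass
class Signals:
    """Hypothetical container for the three coarse signals of one instance."""
    resource: str
    cpu_pct: float
    net_gb_per_day: float
    days_since_login: int


def active_signals(s: Signals) -> List[str]:
    """Return the names of the signals that mark this instance as active."""
    hits = []
    if s.cpu_pct > CPU_ACTIVE_PCT:
        hits.append("cpu")
    if s.net_gb_per_day > NET_ACTIVE_GB_PER_DAY:
        hits.append("network")
    if s.days_since_login <= LOGIN_ACTIVE_DAYS:  # boundary handling is an assumption
        hits.append("login")
    return hits


def is_candidate(s: Signals) -> bool:
    """An instance is a candidate only when no signal marks it active."""
    return not active_signals(s)


# The two rows from the scorecard table.
fleet = [
    Signals("ec2-172.30.0.41", 6.4, 1.82, 41),
    Signals("ec2-172.30.0.43", 8.4, 1.18, 99),
]
candidates = [s.resource for s in fleet if is_candidate(s)]
```

Both table rows come out as candidates here, while any instance with a login 26 days ago would be dropped by the login check alone, whatever its CPU and network numbers.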

My first instinct was to skip them. A score of 0.35, LOW label — not worth the time. But ICO's process doesn't let you draw conclusions at this stage. Scoring is just a coarse filter. The next phase is deep scanning, and it requires human confirmation to proceed.

I selected both. Deep scan it is.

Deep Scan: What Is a 0.35-Score Instance Actually Running?

The deep scan phase launched agents that SSH'd in concurrently. Each machine was checked across 14 signals, including the process table, listening ports, crontab, systemd timers, disk usage, external connections, and real-time traffic topology.
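To make the concurrent collection step concrete, here is a rough sketch of how several hosts could be scanned in parallel. It is not how ICO is implemented: the mapping from signal names to standard Linux commands is my assumption, it covers only some of the signals, and the `run` callable is injected so the example works without real SSH access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Assumed mapping from signal name to a standard Linux command.
CHECKS = {
    "process_table": "ps aux",
    "listening_ports": "ss -tlnp",
    "crontab": "crontab -l",
    "systemd_timers": "systemctl list-timers --all",
    "disk_usage": "df -h",
    "external_connections": "ss -tnp",
}

# A runner takes (host, command) and returns the command's stdout.
Runner = Callable[[str, str], str]


def scan_host(host: str, run: Runner) -> Dict[str, str]:
    """Run every check on one host and collect the raw output per signal."""
    return {name: run(host, cmd) for name, cmd in CHECKS.items()}


def deep_scan(hosts: List[str], run: Runner) -> Dict[str, Dict[str, str]]:
    """Scan all hosts concurrently, one worker thread per host."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        results = pool.map(lambda h: scan_host(h, run), hosts)
        return dict(zip(hosts, results))


def fake_run(host: str, command: str) -> str:
    """Stand-in runner for the example; echoes what would have been executed."""
    return f"{host}$ {command}"


report = deep_scan(["ec2-172.30.0.41", "ec2-172.30.0.43"], fake_run)
```

A real runner could wrap `subprocess.run(["ssh", host, command], capture_output=True, text=True)` and return its `stdout`. Keeping the runner injectable also makes it easy to replay a saved scan.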

The deep scan result for ec2-172.30.0.41 came back as 317 lines of JSON. Here are the key findings:

What was running:

Nginx reverse proxy (:80), routing to multiple backends
Redis 7.0 MASTER (:6379), read-write mode, bound to 0.0.0.0
Redis Sentinel (:26379), in a cluster across three machines
Nacos standalone (:8848), Java process eating 512MB RAM
inventory-service (:8081) and warehouse-service (:8083), two Python HTTP services
Full Datadog agent stack (6 processes)
Docker installed but zero running containers

Traffic topology:

Redis Sentinel interconnection with ec2-172.30.0.43 (bidirectional, a few Kbps)
Redis client connections from 172.30.0.25 (8+ connections)
Datadog agent continuously sending metrics outbound

Looking at this report, you wouldn't think this machine is a zombie. Redis cluster, Nacos service registry, two Python services, Nginx reverse proxy — this looks like a full microservice setup.

But look closer at the traffic topology. All external connections are from Datadog and within the Redis cluster. No real business traffic coming in. All services are running, but nobody is using them.

That's the truth about this instance: an abandoned microservice setup. A zombie. Without the deep scan, with a 0.35 score on the scorecard, nobody would have given it a second look.

Scoring tells you "which ones might be idle." Deep scan tells you "what they're actually doing." Not the same question.

How It Works: 10 Agents, 4 Human Decision Gates

ICO covers compute instances, Kubernetes workloads, databases, object storage, and network resources — across AWS, GCP, Azure, or on-prem via SSH. This case focused on EC2, but the same pipeline handles all of them.

ICO is not "one AI that deletes your resources." It's 10 independent skill agents, each responsible for one link in the chain, passing data through structured files.

At each of the four BLOCKING checkpoints, the agent must stop and wait for a human. A human signs off three times before deletion is even proposed, and the fourth gate is the deletion approval itself:

Phase C — Review the scorecard, select which resources enter deep scan
Phase E — Review the deep scan report, select which enter isolation
Phase G — Approve the isolation plan (method, rollback script, observation period)
Phase J — Final deletion approval

This is not about writing "be careful" in a prompt. You can't get safety from prompts — the model might ignore what you said, or forget it once the context fills up. Safety must be hardcoded into the process: if a phase doesn't pass, the agent cannot jump to the next step on its own.
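One way to picture "safety hardcoded into the process" is a small state machine that refuses to advance past a gate until a human decision has been recorded. This is a sketch under assumptions: the phase letters A through J (one per agent) are inferred from the four gate names, and the `Pipeline` class is hypothetical rather than ICO's real orchestrator.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

PHASES = list("ABCDEFGHIJ")      # assumption: ten phases, one per agent
BLOCKING = {"C", "E", "G", "J"}  # the four human gates named in the post


class GateBlocked(Exception):
    """Raised when a blocking phase is reached without human approval."""


@dataclass
class Pipeline:
    completed: List[str] = field(default_factory=list)
    approved: Set[str] = field(default_factory=set)

    def next_phase(self) -> Optional[str]:
        """Return the phase waiting to run, or None when all are done."""
        done = len(self.completed)
        return PHASES[done] if done < len(PHASES) else None

    def approve(self, phase: str) -> None:
        """Record a human decision; only the gate currently waiting can be approved."""
        if phase not in BLOCKING or phase != self.next_phase():
            raise ValueError(f"phase {phase!r} is not the gate currently awaiting approval")
        self.approved.add(phase)

    def run_next(self) -> str:
        """Complete the next phase, or raise if it is an unapproved gate."""
        phase = self.next_phase()
        if phase is None:
            raise RuntimeError("pipeline already finished")
        if phase in BLOCKING and phase not in self.approved:
            raise GateBlocked(f"phase {phase} needs human approval")
        self.completed.append(phase)
        return phase


pipeline = Pipeline()
pipeline.run_next()          # A
pipeline.run_next()          # B
try:
    pipeline.run_next()      # C is a gate, so this raises
except GateBlocked as err:
    blocked_message = str(err)
pipeline.approve("C")        # the human decision
pipeline.run_next()          # now C completes
```

The check lives in `run_next`, not in any prompt, so an agent cannot talk its way past a gate. Approvals are also rejected when they arrive early, which prevents someone from pre-approving deletion before the earlier reviews have happened.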

Agents don't pass data through context either. Scoring produces suspect_assessment.json, deep scan produces deep_scan_{id}.json, isolation produces isolation_plan_{id}.json — each constrained by a Schema. If the previous agent's output doesn't match the format, the next agent errors out. No improvising.
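The "errors out on a format mismatch" behavior might look roughly like the following. The field names are hypothetical, borrowed from the scorecard columns, since the post does not show the real schema for suspect_assessment.json. The sketch uses only the standard library; a real setup would more likely use a JSON Schema validator.

```python
import json
from typing import List

# Hypothetical contract for one scorecard entry: field name -> accepted type(s).
SUSPECT_SCHEMA = {
    "resource": str,
    "zombie_score": (int, float),
    "level": str,
}


class ContractError(ValueError):
    """Raised when an agent's output does not match the agreed format."""


def validate(record: dict, schema: dict) -> dict:
    """Return the record unchanged if it matches the schema, otherwise raise."""
    if not isinstance(record, dict):
        raise ContractError("record must be a JSON object")
    missing = [k for k in schema if k not in record]
    if missing:
        raise ContractError(f"missing fields: {missing}")
    unexpected = [k for k in record if k not in schema]
    if unexpected:
        raise ContractError(f"unexpected fields: {unexpected}")
    wrong = [k for k, t in schema.items() if not isinstance(record[k], t)]
    if wrong:
        raise ContractError(f"wrong type for fields: {wrong}")
    return record


def load_suspects(text: str) -> List[dict]:
    """Parse the previous agent's output and refuse anything off-contract."""
    payload = json.loads(text)
    if not isinstance(payload, list):
        raise ContractError("top-level value must be a list")
    return [validate(r, SUSPECT_SCHEMA) for r in payload]


good = '[{"resource": "ec2-172.30.0.41", "zombie_score": 0.35, "level": "LOW"}]'
suspects = load_suspects(good)
```

Because the next agent only ever reads the validated file, a creative but malformed output from the previous agent stops the pipeline instead of being silently interpreted.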

This is what agent-runbook is about: constrain agent collaboration with contracts. Relying on prompts is gambling.

Three Hard Lessons

1. Scoring can't see the service stack. Deep scan can.

ec2-172.30.0.41 scored 0.35, LOW. Based on the score alone, you'd skip it. But the deep scan found Redis MASTER, Nacos, two Python services, Nginx — an entire microservice infrastructure stack sitting there idling. The three coarse signals — CPU, network, login — completely fail to capture "what's actually running." Scoring points the way. Deep scan shows you what's actually there.

2. Without historical data, real-time snapshots have blind spots

This scan had no Prometheus or Datadog history to work from, so ICO relied on real-time SSH snapshots. Monthly jobs, quarterly reports, on-demand batch processing: snapshots will never see them running. If a crontab has a "run at 1 AM on the 1st of every month" entry, watching live processes a hundred times won't catch the job unless you happen to scan at that moment. Historical monitoring data is the most reliable signal source. Without it, the deep scan has to carry far more of the weight, which is why it reads crontab and systemd timers instead of only looking at what is running right now.
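Since a snapshot cannot observe a job that is not running, one partial workaround is to read the schedule itself and flag entries that fire rarely. The heuristic below is my own illustration, not something the post attributes to ICO: it flags any crontab entry restricted to specific days of the month or specific months, plus the `@monthly`, `@yearly`, and `@annually` shortcuts. The script paths in the sample are placeholders.

```python
from typing import List

RARE_SHORTCUTS = {"@monthly", "@yearly", "@annually"}


def is_rare(entry: str) -> bool:
    """Heuristic: True if a crontab entry is tied to a day of month or a month."""
    fields = entry.split()
    if not fields:
        return False
    if fields[0] in RARE_SHORTCUTS:
        return True
    # Other @shortcuts, environment lines, and malformed lines are not flagged.
    if fields[0].startswith("@") or len(fields) < 6:
        return False
    day_of_month, month = fields[2], fields[3]
    return day_of_month != "*" or month != "*"


def rare_jobs(crontab_text: str) -> List[str]:
    """Return the entries a real-time snapshot is likely to miss."""
    jobs = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if is_rare(stripped):
            jobs.append(stripped)
    return jobs


sample = """\
# placeholder crontab for the example
MAILTO=ops@example.com
*/5 * * * * /usr/local/bin/healthcheck.sh
0 1 1 * * /opt/reports/monthly.sh
@yearly /opt/archive/rotate.sh
"""
flagged = rare_jobs(sample)
```

Here the "1 AM on the 1st of every month" entry and the yearly job are flagged, while the five-minute health check is not. Anything flagged this way is a reason to hold off on isolation until someone confirms the job is dead too.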

3. Internal traffic doesn't mean business usage

.41 and .43 had Redis Sentinel interconnections. .25 was connecting to .41's Redis. The topology graph had plenty of edges — it looked busy. But it was all infrastructure internal communication — services probing each other, syncing state, with not a single edge from an external user. The scorecard treats any network traffic as an active signal, but not all traffic is the same. Only the deep scan's traffic topology can distinguish "machines talking to each other" from "users making requests."
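The distinction between "machines talking to each other" and "users making requests" can be sketched as a filter over topology edges. This is an illustrative rule of my own, not ICO's: it treats an edge as business traffic only if it is an inbound, non-loopback connection to one of the application-facing ports found on .41 (Nginx on 80, the two Python services on 8081 and 8083). The Datadog destination address is a placeholder, since the post does not give one.

```python
import ipaddress
from dataclasses import dataclass
from typing import List

# Assumption: these are the ports real users or upstream callers would hit.
APP_PORTS = {80, 8081, 8083}


@dataclass(frozen=True)
class Edge:
    """One connection seen on the host; `port` is the listening side's port."""
    peer: str
    port: int
    inbound: bool


def is_business(edge: Edge) -> bool:
    """Inbound, non-loopback traffic to an application port counts as business."""
    if not edge.inbound or edge.port not in APP_PORTS:
        return False
    return not ipaddress.ip_address(edge.peer).is_loopback


def business_edges(edges: List[Edge]) -> List[Edge]:
    """Keep only the edges that look like real usage."""
    return [e for e in edges if is_business(e)]


# Edges reported for ec2-172.30.0.41 in the deep scan.
observed = [
    Edge("172.30.0.43", 26379, inbound=True),    # Redis Sentinel, from .43
    Edge("172.30.0.43", 26379, inbound=False),   # Redis Sentinel, to .43
    Edge("172.30.0.25", 6379, inbound=True),     # Redis client connections
    Edge("203.0.113.10", 443, inbound=False),    # Datadog metrics (placeholder address)
]
real_usage = business_edges(observed)
```

With the edges from the report, nothing survives the filter: the graph is busy, but every edge is Sentinel gossip, a Redis client, or telemetry. The loopback check matters because Nginx proxying to a local backend would otherwise look like inbound traffic on 8081 or 8083.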

Conclusion

The value of AI agents in operations is not "they can delete resources automatically" — that's called danger.

The real value is: you codify a verified safety procedure into an auditable, reusable agent skill anyone can run, and get the same result every time. Not typing commands each time and hoping, but a contract file committed to a repo.

ICO applies this approach to cloud cost optimization. If it works in dev, it's only more valuable in production.

The code:

open-devops-skills — directly installable ICO skill library: github.com/KnoxOps/open-devops-skills

Install with one line:

claude plugin install ico@open-devops-skills

Then:

/ico:orchestrator Scan my cloud for idle resources

If you've got EC2 instances you're not sure are still in use, give it a try. Let the agents scan and analyze — you make the call.
