Osman

New Benchmark for Cloud Tasks

AI capability is jagged, not uniform. So the question is: how does your favorite harness (Codex, Claude Code...) fare on real cloud work?

A harness can top the coding benchmarks and still hallucinate resources that don't exist. Established benchmarks cover coding, computer use, and general reasoning, but none covers the work cloud teams actually delegate: cloud management tasks.

We're building that benchmark. Codex and Claude Code are on the rig now, with others in the pipeline. Task 1 runs on AWS; the template is cloud-agnostic, with Azure and GCP replays to follow. This post lays out the methodology and is an open invitation to tell us what to cover next.

The methodology

IaC is the answer key. We use Terraform to build the resources, so its outputs are the ground truth: exact resource IDs for what should be found and what must not be flagged. No human labeling, nothing to drift, and anyone can deploy the same stack and reproduce the result.
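A minimal sketch of how such an answer key could be read, assuming the stack exposes two Terraform outputs. The output names `orphan_ids` and `in_use_ids` are hypothetical, as are the sample IDs. The only thing relied on is that `terraform output -json` prints each output as an object with a `value` field.

```python
import json


def load_answer_key(terraform_output_json: str) -> dict[str, set[str]]:
    """Parse the text printed by `terraform output -json` into two ID sets.

    The output names `orphan_ids` and `in_use_ids` are hypothetical.
    """
    outputs = json.loads(terraform_output_json)
    return {
        "must_find": set(outputs["orphan_ids"]["value"]),
        "must_not_flag": set(outputs["in_use_ids"]["value"]),
    }


# Sample payload in the shape `terraform output -json` produces (IDs are made up).
sample = json.dumps({
    "orphan_ids": {
        "sensitive": False,
        "type": ["list", "string"],
        "value": ["vol-0aaa", "eipalloc-0bbb"],
    },
    "in_use_ids": {
        "sensitive": False,
        "type": ["list", "string"],
        "value": ["vol-0ccc"],
    },
})
answer_key = load_answer_key(sample)
```

Because both sets come straight from the applied stack, redeploying the same Terraform would regenerate the key with no manual step.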

Environments vary on two axes.

  1. Size: small accounts you can read in one console page, medium ones, and large ones where the signal hides among thousands of live dependencies.
  2. History: greenfield accounts that are fully IaC-managed, and brownfield ones with years of drift and half-consistent tagging.

A harness that only performs on small greenfield fixtures hasn't demonstrated much; production clouds are large and brownfield.
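As a rough illustration of the two axes, here is the grid you would get if every size were paired with every history. Whether the benchmark builds all six combinations is an assumption; the labels are taken from the list above.

```python
from itertools import product

SIZES = ("small", "medium", "large")
HISTORIES = ("greenfield", "brownfield")


def fixture_matrix() -> list[dict[str, str]]:
    """Pair every size with every history (assumes the full grid is wanted)."""
    return [{"size": s, "history": h} for s, h in product(SIZES, HISTORIES)]


fixtures = fixture_matrix()
```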

The agent is contained, verifiably. It runs in an empty container with just the prompt, temporary read-only credentials, and the ephemeral permissions needed for that session. We use CloudTrail to confirm its activity on every run.
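One way the CloudTrail check could look, sketched under assumptions: the session's events have already been collected as CloudTrail log records, and "contained" is tested by the `readOnly` field those records carry. The function name and sample records are hypothetical, and the actual rig may check more than this.

```python
def mutating_events(records: list[dict]) -> list[str]:
    """Return 'eventSource:eventName' for every record not marked read-only.

    Records without a readOnly field are treated as suspect, which is a
    conservative assumption rather than documented benchmark behavior.
    """
    return [
        f"{r.get('eventSource', '?')}:{r.get('eventName', '?')}"
        for r in records
        if r.get("readOnly") is not True
    ]


# Made-up records in the shape of CloudTrail log file entries.
session_records = [
    {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeVolumes", "readOnly": True},
    {"eventSource": "ec2.amazonaws.com", "eventName": "DeleteVolume", "readOnly": False},
]
violations = mutating_events(session_records)
```

An empty list would mean the session only read from the account; anything else names the call to investigate.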

Runs are repeated. Each prompt runs three times, unchanged, to account for possible network issues and transient model failures.
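How the three runs are combined is not specified here. One possibility, shown purely as a sketch, is to count how many runs reported each ID so that unstable findings stand out. The function name and IDs are hypothetical.

```python
from collections import Counter


def report_stability(runs: list[set[str]]) -> dict[str, int]:
    """Count, for each reported ID, how many runs included it."""
    counts = Counter(resource_id for run in runs for resource_id in run)
    return dict(counts)


# Three made-up runs of the same prompt.
runs = [
    {"vol-0aaa", "eipalloc-0bbb"},
    {"vol-0aaa"},
    {"vol-0aaa", "eipalloc-0bbb"},
]
stability = report_stability(runs)
```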

Answers are classified. Every outcome lands in one of four classes: found, missed, flagged-though-in-use, or fabricated (an ID that exists nowhere in the account). Each is scored separately, because each has a different cause and a different fix. Every candidate fabrication is verified against the live account first; a "hallucination" that turns out to be a real resource we didn't plant is a fixture bug, not an agent failure. We know because it happened to us. That story is part of this series.
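The four classes map naturally onto set operations. The sketch below assumes the answer key is available as two ID sets and that a separate inventory of every ID in the live account exists; the function name, the `unplanted_real` bucket name, and the IDs are hypothetical.

```python
def classify(
    reported: list[str],
    must_find: set[str],
    must_not_flag: set[str],
    live_ids: set[str],
) -> dict[str, set[str]]:
    """Sort an agent's reported IDs into the outcome classes."""
    seen = set(reported)
    return {
        "found": seen & must_find,
        "missed": must_find - seen,
        "flagged_in_use": seen & must_not_flag,
        # Exists nowhere in the account: a true fabrication.
        "fabricated": seen - live_ids,
        # Real but not planted: review as a possible fixture bug.
        "unplanted_real": (seen & live_ids) - must_find - must_not_flag,
    }


must_find = {"vol-0aaa", "eipalloc-0bbb"}
must_not_flag = {"vol-0ccc"}
live_ids = must_find | must_not_flag | {"sg-0ddd"}
result = classify(
    ["vol-0aaa", "vol-0ccc", "sg-0ddd", "vol-0zzz"],
    must_find,
    must_not_flag,
    live_ids,
)
```

Keeping `unplanted_real` apart from `fabricated` mirrors the verification step: an ID only counts as fabricated once the live account confirms it does not exist.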

The first task

We picked a task every cloud team cares about: waste discovery on AWS.

Terraform plants orphans (unattached EBS volumes, an unassociated Elastic IP, an unattached ENI, a security group nothing references, a NAT gateway no route table points at) alongside in-use distractors that must not be flagged.
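To make "orphan" concrete for three of these resource types, here is a sketch that works on records shaped like the responses of the EC2 `DescribeVolumes`, `DescribeAddresses`, and `DescribeNetworkInterfaces` calls. It is not the agent's method or the scorer, the sample IDs are made up, and security groups and NAT gateways are left out because they need cross-referencing against other resources.

```python
def find_orphans(
    volumes: list[dict],
    addresses: list[dict],
    interfaces: list[dict],
) -> list[str]:
    """Return IDs of unattached volumes, unassociated EIPs, and unattached ENIs."""
    orphans = []
    # An EBS volume in state "available" is not attached to any instance.
    orphans += [v["VolumeId"] for v in volumes if v.get("State") == "available"]
    # An Elastic IP with no AssociationId is not associated with anything.
    orphans += [a["AllocationId"] for a in addresses if "AssociationId" not in a]
    # A network interface with status "available" is not attached.
    orphans += [
        n["NetworkInterfaceId"] for n in interfaces if n.get("Status") == "available"
    ]
    return orphans


volumes = [
    {"VolumeId": "vol-0aaa", "State": "available", "Attachments": []},
    {"VolumeId": "vol-0ccc", "State": "in-use", "Attachments": [{"InstanceId": "i-0eee"}]},
]
addresses = [
    {"AllocationId": "eipalloc-0bbb"},
    {"AllocationId": "eipalloc-0fff", "AssociationId": "eipassoc-0ggg"},
]
interfaces = [
    {"NetworkInterfaceId": "eni-0hhh", "Status": "available"},
    {"NetworkInterfaceId": "eni-0iii", "Status": "in-use"},
]
orphans = find_orphans(volumes, addresses, interfaces)
```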

Waste is first because it's real money and easy to score exactly. But nothing above depends on it. Security audit, cost forensics, and architecture reconstruction drop into the same template: seed with IaC, vary size and history, contain the agent, classify the results.

The series follows the order of the work itself: first the rig with everything open (Terraform, prompt, scorer, raw logs), then the results and takeaways.

Tell us what you think

Where is this methodology weak? What does a brownfield fixture need to approximate a real account? What distractor would most reliably fool an agent, or a human? Which task should drop into the template after this one, and which cloud should get the first replay?

Disclosure: I work on cloud tooling. The benchmark is vendor-neutral by design, and we will publish results even when they are unflattering.
