New Benchmark for Cloud Tasks

#ai #cloud #infrastructure #testing

Last updated: July 2026

AI capability continues to show jagged behavior: a model excels at some tasks while being mediocre at others. This holds across many knowledge-work verticals, including cloud computing, which is our focus.

As cloud infrastructure architects and engineers, we wanted to know which model is best for a given task and price. Anyone working in this space will relate: the first question is usually "Can this model do X?" In practice, most people now use Codex, Claude Code, Cursor, and similar tools, which are more precisely described as harnesses, so the question becomes "How good is my harness for a given task?"

Enterprise customers of all sizes are also interested in this question, since cost is increasingly being scrutinized and managers are asking whether a given harness offers the best value for its cost.

There is previous work on using AI in the cloud:

generating infrastructure code (IaC-Eval),
incident response on live systems (AIOpsLab, SREGym),
diagnosing and fixing injected AWS faults (Cloud Benchmark),
reconciling IaC drift (NSync), and
early research comparing agents across SDK, CLI, IaC, and console.

But the literature is very thin, and it points in the same direction: cloud work needs its own evaluations.

We're building that benchmark.
We are starting with Codex, Claude Code, and Cursor, the three main harnesses in use at the moment, with others to follow.
We are also starting with AWS and will add Azure and GCP evaluations next.

This post covers the methodology, the scope of the evaluation, the agent test bed, and the limitations and future direction.

The methodology

IaC is the answer key. We use Terraform to build the resources, so its outputs
are the ground truth: exact resource IDs for what should be found and what must
not be flagged. No human labeling, nothing to drift, and anyone can deploy the
same stack and reproduce the result.
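
As a minimal sketch of how a scorer might read that answer key, the snippet below parses the JSON printed by `terraform output -json` into two sets of IDs. The output names `planted_ids` and `distractor_ids` are hypothetical, as are the example IDs; the real fixture may name and shape its outputs differently.

```python
import json


def load_answer_key(terraform_output_json: str) -> tuple[set[str], set[str]]:
    """Split Terraform outputs into planted targets and in-use distractors.

    Assumes two list-valued outputs named `planted_ids` and `distractor_ids`
    (hypothetical names, not taken from the real fixture).
    """
    outputs = json.loads(terraform_output_json)
    # `terraform output -json` wraps each output as
    # {"value": ..., "type": ..., "sensitive": ...}; only "value" is needed here.
    planted = set(outputs["planted_ids"]["value"])
    distractors = set(outputs["distractor_ids"]["value"])
    overlap = planted & distractors
    if overlap:
        # An ID cannot be both a target and a distractor; treat it as a fixture bug.
        raise ValueError(f"IDs listed as both planted and in use: {sorted(overlap)}")
    return planted, distractors


# Illustrative input with made-up IDs.
example = json.dumps({
    "planted_ids": {
        "sensitive": False,
        "type": ["list", "string"],
        "value": ["vol-aaa", "eipalloc-bbb"],
    },
    "distractor_ids": {
        "sensitive": False,
        "type": ["list", "string"],
        "value": ["vol-ccc"],
    },
})
planted, distractors = load_answer_key(example)
```

Only the runner and scorer would ever call something like this; nothing it reads is exposed to the agent.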

The answer key is for the runner and scorer, not the agent. The agent never sees
Terraform state, outputs, fixture source, or seeded IDs. It gets the prompt and
the live account through allowed read-only interfaces. Nothing else.

Environments vary on two axes.

Size: small accounts you can read in one console page, medium, and large ones where the signal hides among thousands of live dependencies.
History: greenfield accounts that are fully IaC-managed, and brownfield accounts with unmanaged resources, inconsistent tags, stale naming, default VPC artifacts, service-created dependencies, and resources that look unused from one API but are still referenced through another. Brownfield is not just scale; it is misleading context. A harness that only performs on small greenfield fixtures hasn't demonstrated much; production clouds are large and brownfield.

The agent is contained, and every call is attributable. It runs in an empty
container with just the prompt, a fixed toolset, and strictly read-only
credentials minted per session. Permissions are broad read-only, not a
task-shaped set that leaks the answer space. Each run's credentials are unique,
so every API call the agent makes is attributable to that run and logged; that
log is what evidence claims are checked against. On AWS the log is CloudTrail;
Azure and GCP replays use the corresponding provider audit logs.
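
One way to picture the attribution step on AWS: if each run's credentials come from an assumed role with a unique session name (an assumption here; the post only says credentials are unique per run), the CloudTrail `userIdentity.arn` of every event ends in that session name. The sketch below filters already-downloaded CloudTrail records on that basis. The role name `bench-readonly` and the session names are hypothetical.

```python
def calls_for_run(records: list[dict], session_name: str) -> list[tuple[str, str]]:
    """Return (eventSource, eventName) pairs made under one run's session.

    `records` is a list of CloudTrail event records already parsed from JSON.
    """
    calls = []
    for record in records:
        arn = record.get("userIdentity", {}).get("arn", "")
        # Assumed-role ARNs end in .../<role-name>/<session-name>.
        if ":assumed-role/" in arn and arn.rsplit("/", 1)[-1] == session_name:
            calls.append((record["eventSource"], record["eventName"]))
    return calls


# Two illustrative records from different runs (made-up account and names).
records = [
    {
        "eventSource": "ec2.amazonaws.com",
        "eventName": "DescribeVolumes",
        "userIdentity": {
            "type": "AssumedRole",
            "arn": "arn:aws:sts::123456789012:assumed-role/bench-readonly/run-001",
        },
    },
    {
        "eventSource": "ec2.amazonaws.com",
        "eventName": "DescribeAddresses",
        "userIdentity": {
            "type": "AssumedRole",
            "arn": "arn:aws:sts::123456789012:assumed-role/bench-readonly/run-002",
        },
    },
]
run_001_calls = calls_for_run(records, "run-001")
```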

Runs are repeated. Same prompt, three runs, because a single run can't
distinguish capability from luck.

Scoring is a contract. Each run is scored independently; the headline is the
average across runs, with the range alongside it, never the best run. We report
recall, precision, false-positive rate, and fabrication rate, plus the
operational numbers a real delegation decision needs: wall-clock time, token
usage, model cost, and API activity. The goal is not only whether the agent was
right, but how expensive, stable, and auditable the run was.
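
The headline rule (average with the range, never the best run) is simple enough to sketch. The per-run recall values below are illustrative placeholders, not results.

```python
from statistics import mean


def summarize(values: list[float]) -> dict[str, float]:
    """Report the mean across runs with the range alongside it."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}


# Placeholder recall values for three runs of the same prompt.
recall_per_run = [0.5, 1.0, 0.75]
summary = summarize(recall_per_run)
# The best run (1.0) appears only as the top of the range, never as the headline.
```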

Findings are structured, not prose. The prompt is natural, but the answer
contract is strict: a JSON array where every finding carries a real resource ID,
where it lives, and the observation that supports the claim.

[
  {
    "resource_id": "vol-0a1b2c3d...",
    "type": "ebs_volume",
    "region": "us-east-1",
    "reason": "unattached",
    "evidence": "DescribeVolumes showed State=available, no attachments"
  }
]

A finding without a resource ID does not count as found; a tidy paragraph about
"some unused volumes" scores zero. We normalize exact aliases (ARN versus
resource ID where the mapping is unambiguous), duplicates count once, and a
finding in the wrong region does not count unless the ID is globally unique and
verifiable. Because every API call the agent's credentials make is logged per
run, the evidence field is checkable: a finding that cites a call the agent
never made is marked as unsupported evidence, tracked separately from resource
fabrication, because the resource may be real while the cited proof is not.
There is deliberately no confidence field; self-reported confidence is noise and
invites hedging.
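
A rough sketch of the normalization and evidence check, under stated assumptions: the ARN handling covers only the simple `.../type/id` shape, region checks are omitted, and the evidence check uses a naive pattern (`CALL_PATTERN`, hypothetical) to pull API call names out of free text. The real scorer may work quite differently.

```python
import re


def to_resource_id(value: str) -> str:
    """Map arn:aws:ec2:us-east-1:123456789012:volume/vol-0abc to vol-0abc."""
    if value.startswith("arn:"):
        resource = value.split(":", 5)[-1]   # e.g. "volume/vol-0abc"
        return resource.rsplit("/", 1)[-1]
    return value


def normalize(findings: list[dict]) -> list[dict]:
    """Drop findings without a resource ID, normalize IDs, count duplicates once."""
    seen, kept = set(), []
    for finding in findings:
        if not finding.get("resource_id"):
            continue  # a finding without a resource ID does not count
        resource_id = to_resource_id(finding["resource_id"])
        if resource_id in seen:
            continue
        seen.add(resource_id)
        kept.append({**finding, "resource_id": resource_id})
    return kept


# Assumption: cited calls look like DescribeX, ListX, or GetX in the evidence text.
CALL_PATTERN = re.compile(r"\b(?:Describe|List|Get)[A-Z][A-Za-z]*")


def evidence_supported(finding: dict, logged_calls: set[str]) -> bool:
    """True only if every call cited in the evidence appears in the run's log."""
    cited = CALL_PATTERN.findall(finding.get("evidence", ""))
    return bool(cited) and all(call in logged_calls for call in cited)


findings = [
    {"resource_id": "vol-0abc", "type": "ebs_volume", "region": "us-east-1",
     "reason": "unattached",
     "evidence": "DescribeVolumes showed State=available, no attachments"},
    {"resource_id": "arn:aws:ec2:us-east-1:123456789012:volume/vol-0abc",
     "type": "ebs_volume", "region": "us-east-1", "reason": "unattached",
     "evidence": "DescribeVolumes showed State=available, no attachments"},
    {"type": "ebs_volume", "region": "us-east-1", "reason": "unattached",
     "evidence": "some unused volumes"},
]
normalized = normalize(findings)
```

Here the ARN alias and the bare ID collapse into one finding, and the entry with no resource ID is dropped before scoring.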

Wrong answers are classified. Found, missed, flagged-though-in-use, and
fabricated: an ID that exists nowhere in the account. Each scores separately,
because each has a different cause and a different fix. Every candidate
fabrication is verified against the live account first; a "hallucination" that
turns out to be a real resource we didn't plant is a fixture bug, not an agent
failure. We know because it happened to us. That story is part of this series.
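
The classification can be sketched with set arithmetic. This assumes the scorer has the planted and distractor IDs from the answer key plus an inventory of IDs that exist in the live account (`live_ids`, a hypothetical input); real resources that were never planted land in their own bucket for fixture review instead of being counted as fabrications.

```python
def classify(
    reported: set[str],
    planted: set[str],
    distractors: set[str],
    live_ids: set[str],
) -> dict[str, set[str]]:
    """Sort reported IDs into the four outcome classes plus a review bucket."""
    known = live_ids | planted | distractors
    return {
        "found": reported & planted,
        "missed": planted - reported,
        "false_flags": reported & distractors,
        # Fabricated: the ID exists nowhere in the account.
        "fabricated": reported - known,
        # Real but unplanted: a fixture bug candidate, not an agent failure.
        "unplanted_real": (reported & live_ids) - planted - distractors,
    }


# Illustrative IDs only.
result = classify(
    reported={"vol-1", "vol-3", "vol-9", "vol-5"},
    planted={"vol-1", "vol-2"},
    distractors={"vol-3", "vol-4"},
    live_ids={"vol-1", "vol-2", "vol-3", "vol-4", "vol-5"},
)
```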

The counts and the derived metrics, so no result post can redefine them:

Found      = planted target correctly reported
Missed     = planted target not reported
False flag = in-use distractor incorrectly reported
Fabricated = reported ID that exists nowhere in the account

Recall              = Found / (Found + Missed)
Precision           = Found / (Found + False flags + Fabricated)
False-positive rate = False flags / In-use distractors
Fabrication rate    = Fabricated / All reported resources
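
As a worked example of these definitions, here is a small sketch that turns the four counts into the four metrics. Returning `None` when a denominator is zero is an assumption of this sketch, not part of the contract above, and the counts are illustrative.

```python
from typing import Optional


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Divide, returning None when the metric is undefined."""
    return numerator / denominator if denominator else None


def derive_metrics(
    found: int,
    missed: int,
    false_flags: int,
    fabricated: int,
    in_use_distractors: int,
    reported_total: int,
) -> dict[str, Optional[float]]:
    return {
        "recall": ratio(found, found + missed),
        "precision": ratio(found, found + false_flags + fabricated),
        "false_positive_rate": ratio(false_flags, in_use_distractors),
        "fabrication_rate": ratio(fabricated, reported_total),
    }


# Illustrative run: 5 planted targets, 10 distractors, 5 reported resources.
scores = derive_metrics(
    found=4, missed=1, false_flags=1, fabricated=0,
    in_use_distractors=10, reported_total=5,
)
# recall 4/5, precision 4/5, false-positive rate 1/10, fabrication rate 0/5
```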

For waste discovery we also report cost-weighted recall: the monthly cost of the
waste found over the monthly cost of the waste planted, because a missed NAT
gateway is not the same miss as an empty volume.
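
A minimal sketch of cost-weighted recall, assuming the fixture records a monthly cost per planted resource. The IDs and dollar figures below are made-up placeholders, not AWS prices.

```python
def cost_weighted_recall(planted_monthly_cost: dict[str, float], found: set[str]) -> float:
    """Monthly cost of the waste found over monthly cost of the waste planted."""
    planted_total = sum(planted_monthly_cost.values())
    if planted_total == 0:
        raise ValueError("fixture has no planted cost to weight against")
    found_total = sum(
        cost for resource_id, cost in planted_monthly_cost.items()
        if resource_id in found
    )
    return found_total / planted_total


# Placeholder costs: one expensive miss, one cheap hit.
planted_monthly_cost = {"nat-aaa": 32.0, "vol-bbb": 8.0}
weighted = cost_weighted_recall(planted_monthly_cost, found={"vol-bbb"})
```

In this example plain recall would be 1 of 2, while the cost-weighted figure is 8 of 40, which is the gap the metric is meant to expose.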

Agents are scored against a baseline. Each task ships with a deterministic
script that checks the obvious conditions, so the benchmark can't reward an agent
for work a dozen lines of boto3 already does perfectly. The baseline gets the
same live-account read-only interfaces as the agent, never the Terraform state
or the expected answers; where the agent beats it is where the task actually
required dependency reasoning.
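
To give a sense of what such a baseline looks like, here is a sketch of the obvious check for one resource type: unattached EBS volumes. It takes a response shaped like the one the EC2 `DescribeVolumes` API returns, so it runs without credentials; the actual baseline scripts may be structured differently.

```python
def unattached_volume_ids(describe_volumes_response: dict) -> list[str]:
    """Return IDs of volumes that are available and have no attachments.

    With boto3, the input could come from
    boto3.client("ec2").describe_volumes(); pagination is omitted here.
    """
    return sorted(
        volume["VolumeId"]
        for volume in describe_volumes_response.get("Volumes", [])
        if volume.get("State") == "available" and not volume.get("Attachments")
    )


# Illustrative response with made-up IDs.
response = {
    "Volumes": [
        {"VolumeId": "vol-orphan", "State": "available", "Attachments": []},
        {"VolumeId": "vol-busy", "State": "in-use",
         "Attachments": [{"InstanceId": "i-123", "State": "attached"}]},
    ]
}
baseline_hits = unattached_volume_ids(response)
```

A check like this has no notion of dependencies across services, which is exactly the kind of reasoning the agent has to add to beat it.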

Every run leaves a full trail. Each run is captured as one record, and that
record is what gets published:

harness and version, model, prompt and scorer versions, region, fixture version, time limit, and isolation mode; these products move fast, and a result without its exact versions is not reproducible
every finding the agent reported, plus its raw output
the classified counts: found, missed, false flags, and fabricated, with the exact fabricated IDs
wall-clock time, exit status, token usage, and cost
the run's API activity log: every call the agent's credentials made, which is what evidence claims are cross-checked against
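
One possible shape for that record, sketched as a dataclass. The field names and example values are hypothetical; the published schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    # Versions and limits: without these a result is not reproducible.
    harness: str
    harness_version: str
    model: str
    prompt_version: str
    scorer_version: str
    region: str
    fixture_version: str
    time_limit_seconds: int
    isolation_mode: str
    # What the agent reported.
    findings: list[dict] = field(default_factory=list)
    raw_output: str = ""
    # Classified counts, with the exact fabricated IDs.
    found: int = 0
    missed: int = 0
    false_flags: int = 0
    fabricated_ids: list[str] = field(default_factory=list)
    # Operational numbers.
    wall_clock_seconds: float = 0.0
    exit_status: int = 0
    tokens_used: int = 0
    cost_usd: float = 0.0
    # Every call the run's credentials made, for evidence cross-checks.
    api_calls: list[str] = field(default_factory=list)


# Placeholder values only.
record = RunRecord(
    harness="example-harness", harness_version="0.0.0", model="example-model",
    prompt_version="v1", scorer_version="v1", region="us-east-1",
    fixture_version="v1", time_limit_seconds=900, isolation_mode="container",
)
serialized = json.dumps(asdict(record), indent=2)
```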

Everything ships with every result. Not summaries; the artifacts themselves:

Artifact	The question it answers
Terraform fixture	What was planted, and what the exact ground truth is
Prompt, versioned	What the agent was actually asked
Harness manifest	Which harness, model, version, tools, permissions, limits
Per-run records	What was found, with evidence, timing, tokens, and cost
API activity log	What the agent actually did, and whether its evidence holds up
Scorer output	How raw findings became the reported numbers

Credentials, account IDs, and sensitive names are redacted; where redaction
could affect interpretation, the redaction is documented. The scoring contract
lives on this page, so every result post links back to the same definitions.
The implementation ships with the rig post: fixture, prompt, scorer, manifests,
and logs.

The first task: waste discovery

We picked a first task every cloud team cares about: waste discovery on AWS.

Terraform plants orphans (unattached EBS volumes, an unassociated Elastic IP, an
unattached ENI, a security group nothing references, a NAT gateway no route table
points at) alongside in-use distractors that must not be flagged. Here, "waste"
means resources the fixture's ground truth defines as unused; the task is
discovery, not deletion authorization.

Waste is first because it's real money and easy to score exactly. But nothing
above depends on it. Security audit, cost forensics, and architecture
reconstruction drop into the same template: seed with IaC, vary size and history,
contain the agent, classify the results.

Limitations

Task 1 does not prove an agent can safely operate a production cloud. It tests a
narrower prerequisite: can a contained harness inspect a real account, find known
unused resources, avoid seeded distractors, and refrain from inventing IDs?
Synthetic fixtures are cleaner than production. Read-only discovery is easier
than safe remediation. Three runs expose instability without fully characterizing
it. The point is not to declare a winner; it is to make cloud-agent failure modes
measurable.

Up Next

Next post: the rig itself, everything open where safe: Terraform fixture,
prompt, scorer, harness manifests, normalized findings, and redacted run logs.
After that: task 1 results across all three harnesses. Every result post links back
to this page and publishes the same artifact set.

Tell us what you think

Where is this methodology weak? What does a brownfield fixture need to
approximate a real account? What distractor would most reliably fool an agent, or
a human? Which task should drop into the template after this one, and which cloud
should get the first replay?

Disclosure: I work on cloud tooling. The benchmark is vendor-neutral by design,
and we will publish results even when they are unflattering.