Lars Winstand

Posted on May 18 • Originally published at standardcompute.com

I thought the $1.3M OpenAI bill was the story, then I looked at what 100 agents actually do all day

#ai #devops #api #automation

I saw the screenshot the same way everyone else did.

$1,305,088.81 in OpenAI API spend over 30 days.

My first reaction was the same as Reddit’s: what on earth are you doing to burn that much money on tokens?

But after digging into the details, I think the dollar amount is actually the least interesting part.

The real story is what happens when you run a serious agent fleet: around 100 coding agents, 7.6 million requests, and 603 billion tokens. At that point, per-token billing stops feeling like a clean usage model and starts feeling like distributed systems pain with an invoice attached.

That’s the part I think more developers should pay attention to.

The screenshot was wild, but the workload matters more

Tom’s Hardware reported that Peter Steinberger showed:

$1,305,088.81 in OpenAI API spend over 30 days
603 billion tokens
7.6 million requests
roughly 100 Codex instances
$19,985.84 spent on a single day
about 206,000 requests on that same day
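To put those totals in per-unit terms, here is a minimal sketch that divides the reported figures by each other. The inputs are the numbers listed above; the agent count is "roughly 100", so the per-agent figure is only approximate, and the per-token rate is a blended average across input, cached input, and output tokens rather than any single list price.

```javascript
// Figures as reported in the list above (30-day window).
const totalSpendUsd = 1305088.81;
const totalTokens = 603e9;
const totalRequests = 7.6e6;
const agentCount = 100; // "roughly 100" in the report, so treat results as approximate

const costPerRequest = totalSpendUsd / totalRequests;
const tokensPerRequest = totalTokens / totalRequests;
const costPerMillionTokens = totalSpendUsd / (totalTokens / 1e6);
const spendPerAgent = totalSpendUsd / agentCount;

console.log(`cost per request:       $${costPerRequest.toFixed(2)}`);       // $0.17
console.log(`tokens per request:     ${Math.round(tokensPerRequest)}`);     // 79342
console.log(`blended $/1M tokens:    $${costPerMillionTokens.toFixed(2)}`); // $2.16
console.log(`spend per agent (30 d): $${spendPerAgent.toFixed(2)}`);        // $13050.89
```

Roughly 79,000 tokens per request is the number to notice: these are large-context calls, not short chat turns.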

That sounds absurd until you look at what those agents were apparently doing all day:

pull request reviews
commit security scanning
GitHub issue deduplication
code fixes
benchmark monitoring
turning meeting discussions into PRs

That is not “I asked GPT-5.4 a coding question.”

That is a software factory.

And once you frame it that way, the pricing problem changes.

The real issue is not cost alone. It’s cost plus operations.

At small scale, token pricing feels reasonable.

You make a request. You use tokens. You pay for tokens.

Simple.

At fleet scale, that model gets messy fast.

Now you are managing:

request bursts
shared rate limits
prompt caching behavior
model selection by task priority
monthly caps
retries and backoff
queueing for async work
internal dashboards so nobody accidentally nukes the budget

That’s why I think the real OpenAI API cost problem is operational, not moral.

The money hurts.

The constant control work hurts more.
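As one example of that control work, here is a minimal sketch of "retries and backoff" from the list above: exponential backoff with full jitter and a hard attempt cap. The function names, the default values, and the rule for what counts as retryable are assumptions made for illustration. The official openai Node SDK also retries some failures on its own (configurable through its maxRetries option), so a wrapper like this is more about making the policy explicit than replacing the SDK.

```javascript
// Delay before retry number `attempt` (0-based): exponential growth, capped,
// with "full jitter" so parallel agents do not retry in lockstep.
function backoffDelayMs(attempt, { baseMs = 500, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Assumption for this sketch: HTTP 429 and 5xx are worth retrying, everything else is not.
function isRetryable(error) {
  const status = error && error.status;
  return status === 429 || (status >= 500 && status < 600);
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Runs `task` (an async function) at most `maxAttempts` times in total.
async function withRetry(task, { maxAttempts = 5, sleep = defaultSleep, ...backoff } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      const lastAttempt = attempt >= maxAttempts - 1;
      if (lastAttempt || !isRetryable(error)) throw error;
      await sleep(backoffDelayMs(attempt, backoff));
    }
  }
}
```

The attempt cap matters as much as the delay: every retry resends the full prompt, so an uncapped retry loop across a fleet is also an uncapped token bill.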

What breaks first when you run 100 agents?

Usually not the code.

Usually your sanity.

If you read OpenAI docs like an operator instead of a hobbyist, the limits are the giveaway. You are not just dealing with one meter. You are dealing with multiple overlapping ones:

RPM
TPM
RPD
TPD
monthly org caps
project-level caps
model-family shared limits

So when somebody says they hit an "OpenAI API quota exceeded" error, that can mean any of several different failure modes.

And once multiple agents are running in parallel, those failure modes stack.
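To show what "multiple overlapping meters" means for a client, here is a minimal sketch of a limiter that enforces requests per minute and tokens per minute at once and reports which one blocked a request. The class name and the limit values are hypothetical. It is in-process only, so a real fleet would need shared state, and the server-side limits remain the source of truth.

```javascript
// Sliding one-minute window that enforces two meters at once:
// requests per minute (RPM) and tokens per minute (TPM).
class DualLimiter {
  constructor({ rpm, tpm, windowMs = 60000 }) {
    this.rpm = rpm;
    this.tpm = tpm;
    this.windowMs = windowMs;
    this.events = []; // { at: timestampMs, tokens: estimatedTokens }
  }

  // Returns { ok: true } and records the request, or { ok: false, reason }.
  tryAcquire(estimatedTokens, now = Date.now()) {
    // Drop events that have aged out of the window.
    this.events = this.events.filter((e) => now - e.at < this.windowMs);
    const usedTokens = this.events.reduce((sum, e) => sum + e.tokens, 0);

    if (this.events.length + 1 > this.rpm) return { ok: false, reason: "rpm" };
    if (usedTokens + estimatedTokens > this.tpm) return { ok: false, reason: "tpm" };

    this.events.push({ at: now, tokens: estimatedTokens });
    return { ok: true };
  }
}

// Hypothetical limits, far smaller than real ones, to make the behavior visible.
const limiter = new DualLimiter({ rpm: 2, tpm: 1000 });
console.log(limiter.tryAcquire(400, 0)); // { ok: true }
console.log(limiter.tryAcquire(700, 1)); // { ok: false, reason: 'tpm' }
```

Returning the reason is the useful part: "slow down" and "send smaller prompts" are different fixes, and a generic quota error does not tell you which one you need.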

Prompt design stops being prompt design

It becomes infrastructure.

OpenAI’s Prompt Caching sounds great on paper:

up to 80% lower latency
up to 90% lower input token cost

But there’s a catch: cache hits require an exact prefix match, and caching only applies to prompts of 1,024 tokens or more.

That means small prompt differences across agents can destroy your cache efficiency.

If your agents all prepend slightly different repo instructions, tool descriptions, or task wrappers, you lose the caching benefit and pay full price.
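The effect of prompt ordering on a shared prefix can be sketched without calling any API. The example below compares two ways of assembling a prompt and measures how much of the beginning two agents have in common. Characters stand in for tokens here, and the prefix text, agent IDs, and helper names are made up for illustration.

```javascript
// Length of the shared leading run of two strings, in characters.
// Characters stand in for tokens here; real caching operates on tokens.
function commonPrefixLength(a, b) {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Hypothetical fleet-wide instructions, identical for every agent.
const SHARED_PREFIX = "You are a code review agent.\nFollow AGENTS.md exactly.\n";

// Cache-friendly: stable content first, per-agent details last.
const stableFirst = (agentId, task) => `${SHARED_PREFIX}Agent: ${agentId}\nTask: ${task}`;

// Cache-hostile: per-agent details first, so prompts diverge almost immediately.
const variableFirst = (agentId, task) => `Agent: ${agentId}\n${SHARED_PREFIX}Task: ${task}`;

const good = commonPrefixLength(stableFirst("a1", "review"), stableFirst("b2", "triage"));
const bad = commonPrefixLength(variableFirst("a1", "review"), variableFirst("b2", "triage"));

console.log({ good, bad, sharedPrefixLength: SHARED_PREFIX.length });
```

With the stable content first, the two prompts share the entire fleet-wide prefix. With the per-agent line first, they diverge at the agent ID and share almost nothing, even though the total text is the same.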

Here’s the kind of shape that matters:

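import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment by default.
const client = new OpenAI();

// diffText is the pull request diff as a string, loaded elsewhere.
const response = await client.responses.create({
  model: "gpt-5.5",
  // Identical instructions across agents keep the start of the prompt stable.
  instructions: [
    "You are a code review agent.",
    "Follow AGENTS.md exactly.",
    "Use the repo style guide.",
    "Only propose minimal diffs."
  ].join("\n"),
  // Task-specific content goes after the stable instructions.
  input: diffText,
  // Lower-priority processing for work that is not latency-sensitive.
  service_tier: "flex"
}, { timeout: 15 * 60 * 1000 }); // allow up to 15 minutes for slower flex responses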

That service_tier: "flex" line is doing a lot of work.

It’s OpenAI quietly admitting that not all inference is interactive and not all tokens should be priced the same way.

If per-token pricing is so clean, why are there so many exceptions?

This is the part I keep coming back to.

OpenAI now has multiple pricing modes because one token meter clearly does not fit every workload.

You can see it in products like:

Standard API pricing
Batch API
Flex processing

Batch API cuts input and output cost by 50% for jobs that can finish within 24 hours. Flex gives jobs that can tolerate slower responses lower-priority processing at a lower price.

That’s not a minor optimization.

That’s an admission that async agent work is fundamentally different from interactive chat.
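The size of that difference is easy to estimate. Here is a minimal sketch that computes a monthly bill after moving some fraction of spend to a discounted lane, using the 50% Batch figure quoted above as the default. The function name and the example bill are hypothetical, and the sketch assumes the discount applies evenly to everything that gets moved.

```javascript
// Monthly bill after moving a fraction of spend to a discounted lane.
// `discount` defaults to 0.5, the Batch API figure quoted above.
function billAfterShift(monthlyUsd, asyncFraction, discount = 0.5) {
  if (asyncFraction < 0 || asyncFraction > 1) {
    throw new RangeError("asyncFraction must be between 0 and 1");
  }
  const interactive = monthlyUsd * (1 - asyncFraction);
  const shifted = monthlyUsd * asyncFraction * (1 - discount);
  return interactive + shifted;
}

// Hypothetical: a $10,000/month bill where half of the spend can wait for a batch window.
console.log(billAfterShift(10000, 0.5)); // 7500
```

The saving scales with how much of the workload is genuinely async, which is why classifying workloads comes before optimizing prompts.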

Here’s the practical version.

Option	What it really means
OpenAI Standard API	Best for interactive requests where latency matters and usage is controlled
OpenAI Batch or Flex	Better for async jobs, evals, enrichment, and lower-priority agent work
OpenRouter	OpenAI-compatible routing layer with provider choice, analytics, and spend visibility
Standard Compute	OpenAI-compatible API with flat monthly pricing for teams that want predictable cost instead of per-token billing stress

I don’t think per-token billing is wrong.

I think it’s only honest for certain shapes of work.

The mismatch gets obvious in automation stacks

If you are running:

n8n workflows
Make scenarios
Zapier automations
OpenClaw jobs
custom coding agents
internal background workers

then your orchestration layer is already priced one way, and your model layer is priced another way.

You pay for executions, tasks, or runs on one side.

Then you pay for cognition by token on the other.

That second bill is where things get weird.

Especially when your agents run 24/7.

Developers already feel this way, even at much smaller scale

You do not need 603 billion tokens to feel token anxiety.

That was one of the most useful things hiding in the Reddit discussions around this story.

One user built usage monitoring tools just to answer a simple question: am I better off on a subscription or on API billing?

Another said OpenClaw “cost me an arm and a leg” while they were on a token budget.

That is the important part.

The problem shows up way before $1.3M.

It shows up the moment every experiment starts with a flinch.

Should I run one more eval pass?
Should I let this agent retry?
Should I keep the context window large?
Should I spawn more workers?
Should I turn on better models for code review?

When the meter is always visible, it changes behavior.

And usually not in a good way.
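The subscription-versus-API question raised earlier in this section comes down to a break-even calculation. Here is a minimal sketch, assuming a single blended per-token rate and a flat plan with no usage limits of its own. Both numbers in the example are placeholders, not quotes from any provider.

```javascript
// Tokens per month at which a flat plan and per-token billing cost the same.
function breakEvenTokens(flatMonthlyUsd, blendedUsdPerMillionTokens) {
  return (flatMonthlyUsd / blendedUsdPerMillionTokens) * 1e6;
}

// Which billing model is cheaper at a given monthly volume.
function cheaperOption(monthlyTokens, flatMonthlyUsd, blendedUsdPerMillionTokens) {
  const perTokenBill = (monthlyTokens / 1e6) * blendedUsdPerMillionTokens;
  return perTokenBill > flatMonthlyUsd ? "flat" : "per-token";
}

// Placeholder numbers only: a $500/month flat plan versus a $2.00 blended rate.
console.log(breakEvenTokens(500, 2.0));      // 250000000
console.log(cheaperOption(400e6, 500, 2.0)); // "flat"
console.log(cheaperOption(100e6, 500, 2.0)); // "per-token"
```

The arithmetic is trivial. The hard part is knowing your real monthly volume, which is exactly why people end up building monitoring tools first.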

What pricing model actually fits agent fleets?

My opinion: use two different economic models for two different workload shapes.

1. Use per-token pricing for interactive work

Per-token pricing still makes sense for:

chat interfaces
one-off coding help
low-volume internal tools
experiments with unpredictable usage
latency-sensitive requests

If a developer asks GPT-5.4 to debug a flaky test, token pricing is a reasonable fit.

You used a resource. You pay for the resource.

2. Use predictable pricing for persistent automation

If you have:

code review agents
issue triage agents
benchmark watchers
repo-specific fixers
support automations
enrichment pipelines

running all day, the thing you want is not just cheaper tokens.

You want to stop thinking about every token.

That’s why flat-cost infrastructure is so appealing for agent workloads.

You can budget it.

You can let automations run.

You can stop turning every architecture decision into a pricing debate.

That is the big appeal of Standard Compute.

It gives you an OpenAI-compatible API, so you can keep your existing SDKs and clients, but the pricing model is flat monthly instead of per-token. That matters a lot if you are building agents in n8n, Make, Zapier, OpenClaw, or custom internal workflows and you want predictable cost.

What this looks like in practice

If you are currently using the OpenAI SDK, the migration path is pretty boring, which is good.

npm install openai

import OpenAI from "openai";

// Same SDK as before; only the API key and base URL change.
const client = new OpenAI({
  apiKey: process.env.STANDARD_COMPUTE_API_KEY,
  baseURL: "https://api.standardcompute.com/v1"
});

const response = await client.chat.completions.create({
  model: "gpt-5.4",
  messages: [
    { role: "system", content: "You are a senior code review agent." },
    { role: "user", content: "Review this pull request diff for security and performance issues." }
  ]
});

// Print the assistant's reply.
console.log(response.choices[0].message.content);

That drop-in compatibility is a big deal if your real problem is not model quality, but cost predictability across a lot of automation.
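One way to keep that switch reversible is to read the endpoint from configuration instead of hard-coding it. Here is a minimal sketch of a helper that builds the SDK constructor options from environment variables. The variable names LLM_API_KEY and LLM_BASE_URL are hypothetical, chosen for this example.

```javascript
// Builds constructor options for the OpenAI SDK from environment variables.
// LLM_API_KEY and LLM_BASE_URL are hypothetical names chosen for this sketch.
function resolveClientOptions(env) {
  const apiKey = env.LLM_API_KEY;
  if (!apiKey) throw new Error("LLM_API_KEY is not set");

  const options = { apiKey };
  // Leave baseURL unset so the SDK falls back to its default endpoint.
  if (env.LLM_BASE_URL) options.baseURL = env.LLM_BASE_URL;
  return options;
}

// Usage, with the SDK imported as in the example above:
//   const client = new OpenAI(resolveClientOptions(process.env));
```

With this in place, moving a worker between providers is a deployment setting rather than a code change, which makes it cheap to compare bills for the same workload.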

A practical checklist for teams running agents

If you are operating agent workflows today, these are the questions I would ask first:

Is this workload interactive or async?
Can it be batched?
Can lower-priority tasks use Flex-like processing?
Are prompts structured for cache hits?
What happens when 20 agents spike the same model at once?
Are retries creating hidden token burn?
Do we actually need per-token billing for this workload shape?

That last one matters more than people think.

Because once your team starts designing around token caps, cache misses, quota errors, and pricing edge cases, you are not just building agents anymore.

You are running a token economy.
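The first few checklist questions can be turned into an explicit routing rule instead of a per-request judgment call. Here is a minimal sketch of one possible rule. The field names, lane labels, and ordering of the checks are assumptions for illustration, not a recommendation from any provider.

```javascript
// Maps a workload description to a pricing lane.
// Field names and lane labels are hypothetical, chosen for this sketch.
function chooseLane({ interactive, alwaysOn, deadlineHours }) {
  if (interactive) return "standard";      // a person is waiting on the answer
  if (alwaysOn) return "flat";             // persistent automation: budget it, don't meter it
  if (deadlineHours >= 24) return "batch"; // can wait for a 24-hour batch window
  return "flex";                           // async, but needed sooner than a batch
}

console.log(chooseLane({ interactive: true }));                   // "standard"
console.log(chooseLane({ interactive: false, alwaysOn: true }));  // "flat"
console.log(chooseLane({ interactive: false, deadlineHours: 48 })); // "batch"
console.log(chooseLane({ interactive: false, deadlineHours: 2 }));  // "flex"
```

Writing the rule down once means individual developers stop re-deciding it on every experiment.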

My takeaway

The viral screenshot was not proof that agentic coding is fake.

It was proof that OpenAI API pricing starts behaving strangely once agents become parallel, persistent, and semi-autonomous.

If you are building serious automations, the first question should not be “what is the cheapest model per million tokens?”

It should be:

what workload shape do I actually have?
what latency do I really need?
what breaks when multiple agents run at once?
do I want to optimize prompts, or do I want to keep shipping?

For side projects, per-token billing is fine.

For always-on agent fleets, it starts turning into queue management, cache management, rate-limit management, and human stress management all at the same time.

That’s the real story hiding behind the $1.3M screenshot.

And if you’re tired of building around token anxiety, this is exactly why products like Standard Compute exist.
