Arthur

Posted on May 18 • Originally published at pickles.news

What GenAI Actually Costs in Production

#llm #mlops #aiengineering #cost

The first number anyone quotes when asked what generative AI costs is a per-token figure. It is a comfortable number — small, unambiguous, available on a vendor's pricing page, and easy to multiply by an estimated request volume to produce a monthly total. It is also, on inspection of any actual production deployment, the smaller piece of what the company is paying.

I want to take that number seriously, then take it apart. The per-token bill is real. It is also the visible tip of a stack whose other line items, individually and in aggregate, dominate it. The "managed API vs self-hosted open-source" comparison that lives in every 2026 AI deck is, in the deck version, missing the six other line items that decide the answer. With those line items back in the picture, the comparison stops being a comparison of compute prices and becomes a question about a company's volume, team, and tolerance for failure modes.

The visible math

For a concrete example: an internal assistant handling a million requests per month, average request size 1,000 input tokens and 500 output tokens. With current Western API pricing:

Anthropic Claude Sonnet 4.6 — $3 per million input tokens, $15 per million output tokens — comes to about $10,500 per month.
Anthropic Claude Opus 4.6 — $5 / $25 per million tokens — about $17,500 per month.
OpenAI's current GPT-5 family pricing has GPT-5.4 at $2.50 / $15 per million tokens, which comes to $10,000 per month at the same volume.
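
The arithmetic behind those three figures can be reproduced with a minimal sketch. The prices are the ones quoted above; the function name and the dictionary keys are illustrative labels, not vendor API identifiers, and the workload is assumed to be perfectly uniform.

```python
# USD per million tokens (input, output), as quoted in the list above.
# The keys are informal labels for this sketch, not official model IDs.
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-opus-4.6": (5.00, 25.00),
    "gpt-5.4": (2.50, 15.00),
}


def monthly_inference_cost(requests, input_tokens, output_tokens,
                           input_price, output_price):
    """Monthly bill in USD, assuming every request has the same size."""
    input_millions = requests * input_tokens / 1_000_000
    output_millions = requests * output_tokens / 1_000_000
    return input_millions * input_price + output_millions * output_price


for model, (input_price, output_price) in PRICES.items():
    cost = monthly_inference_cost(1_000_000, 1_000, 500,
                                  input_price, output_price)
    print(f"{model}: ${cost:,.0f} per month")
```

At these prices the 500 output tokens cost more than the 1,000 input tokens for every model listed, so output length matters more to the bill than prompt length.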

These are not large numbers. A million requests is a serious internal deployment, and the inference bill for a serious internal deployment lands in the same order of magnitude as one mid-level engineer's monthly fully-loaded compensation. Anthropic's own pricing materials note that prompt caching can cut input-token cost by up to 90% on cached prefixes, and batch processing offers another 50% on bulk workloads, which can pull these figures lower in production patterns where cache hit rates are high.
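
As a rough sketch of how those discounts move the bill: the model below applies the 90% figure to the cached share of input tokens and the 50% figure to the batched share of the total. It is a simplification under stated assumptions. The function names are hypothetical, cache-write surcharges are ignored, and whether the two discounts stack in a given deployment is not modelled.

```python
def cached_input_cost(base_input_cost, cache_hit_fraction, cache_discount=0.90):
    """Input cost when a fraction of input tokens is billed at the cached rate.

    Assumption: cached tokens cost (1 - cache_discount) of the base price,
    and writing to the cache carries no surcharge.
    """
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    cached_part = base_input_cost * cache_hit_fraction * (1.0 - cache_discount)
    uncached_part = base_input_cost * (1.0 - cache_hit_fraction)
    return cached_part + uncached_part


def batched_cost(cost, batch_fraction, batch_discount=0.50):
    """Total cost when a fraction of the workload runs at the batch discount."""
    if not 0.0 <= batch_fraction <= 1.0:
        raise ValueError("batch_fraction must be between 0 and 1")
    return cost * (1.0 - batch_fraction * batch_discount)


# Sonnet example from above: $3,000 of input and $7,500 of output per month.
sonnet_input, sonnet_output = 3_000.0, 7_500.0
for hit_rate in (0.0, 0.5, 0.9):  # hypothetical cache hit rates
    total = cached_input_cost(sonnet_input, hit_rate) + sonnet_output
    print(f"cache hit rate {hit_rate:.0%}: ${total:,.0f} per month")
```

Under these assumptions caching only touches the input side, so even a perfect hit rate leaves the Sonnet example at about $7,800 per month, because the $7,500 of output tokens is unaffected.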

If the conversation stops here — and it usually does — the company has now compared three pricing-page numbers and picked the one with the smallest digit. That is the answer to a different question than the one anyone meant to ask.

The first invisible layer

A production GenAI service is rarely a single API call wrapped in a function. The architecture diagrams that ship to engineering review include a backend service, a vector database for retrieval-augmented generation, embedding generation (separately metered), a document store, an authorisation layer, content moderation and guardrails, request tracing, quality monitoring, structured logging, rate limits, retry queues, and a CI/CD path for prompts and model versions. None of this lives on the model vendor's pricing page.

Order-of-magnitude estimates for a serious internal deployment in 2026:

A managed vector DB such as Pinecone or Weaviate Cloud: $200–2,000 per month at moderate index sizes; pgvector on managed Postgres in the same range or lower.
Embedding generation: a separate per-token line, often a quarter to a half of the LLM input bill at RAG-heavy workloads.
Tracing and observability tooling (LangSmith, Phoenix, Helicone, or a Datadog plus custom instrumentation stack): $500–5,000 per month depending on volume and retention.
Guardrails / moderation: free if the model vendor's content filter is sufficient, or $1,000–10,000 per month for a managed third-party tool.
Document storage and access control: small in isolation, real in aggregate, especially with auditable retention.

That stack runs $5,000–25,000 per month for the kind of deployment whose inference bill is $10,000–17,500. The first invisible layer is, by itself, the same order of magnitude as the visible one.

The dominant line item

Production GenAI requires a team. The minimum viable headcount for a service the company is willing to put in front of paying users or sensitive internal data is something like:

A backend engineer, for the API surface, business logic, and integrations.
An ML or LLM engineer, for model selection, prompts, evaluations, and quality metrics.
A platform / SRE engineer, for orchestration, GPUs when present, CI/CD, and observability wiring.
Some fraction of a security or compliance person, for the data flow, audit logs, and PII handling.
Some fraction of a product person, for the use-case definition and prioritisation.

Levels.fyi's 2026 data places the median US Machine Learning Engineer total compensation at $264,400, with senior FAANG engineers reaching $350,000 and above and mid-market senior compensation, depending on the source's mix of base and total comp, landing in roughly the $180,000–260,000 range. Backend and platform engineers track those numbers within ten or fifteen percent. International compensation varies sharply; EU senior comp tends to land lower, but the structural point — that engineers are expensive — does not.

Three engineers at $250,000 each in total comp, fully loaded with employer payroll tax, benefits, equipment, management overhead, and recruiting amortisation, come to on the order of one million dollars per year, or about $83,000 per month. That single line item dwarfs the inference bill in the example above, and at the per-request prices quoted earlier it remains the larger of the two until volume reaches several million requests per month. It also dwarfs the GPU-rental bill at any volume the team would actually run on its own hardware.
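
A small sketch makes the comparison concrete. The load factor of 1.33 is an assumption chosen only because it turns $750,000 of total comp into roughly the one million dollars cited above; real loading varies by company. The per-request price is the Sonnet figure from the earlier example, and both function names are hypothetical.

```python
def monthly_people_cost(headcount, total_comp, load_factor):
    """Fully loaded monthly cost of a team, in USD."""
    return headcount * total_comp * load_factor / 12


def breakeven_requests(monthly_fixed_cost, cost_per_request):
    """Monthly request volume at which the inference bill equals a fixed cost."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return monthly_fixed_cost / cost_per_request


LOAD_FACTOR = 1.33  # assumption: payroll tax, benefits, equipment, overhead
people = monthly_people_cost(3, 250_000, LOAD_FACTOR)

# Sonnet example: $10,500 per million requests of 1,000 in / 500 out tokens.
sonnet_per_request = 10_500 / 1_000_000
volume = breakeven_requests(people, sonnet_per_request)

print(f"people: ${people:,.0f} per month")
print(f"inference matches people cost at {volume:,.0f} requests per month")
```

With these inputs the inference bill catches up with the three-engineer team at a little under eight million requests per month.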

The naive math — "API costs $10K per month, self-host costs $40K per month, therefore API wins" — is not wrong about those two numbers. It is wrong about which numbers are the deciding ones.

Self-host with the full stack

A self-hosted serving deployment is the second comparison most teams reach for. The compute side has fallen dramatically through 2025 and into 2026. AWS's p5.48xlarge instance, with eight NVIDIA H100s, lists at $55.04 per hour on-demand in us-east-1 as of May 2026, which works out to roughly $39,600 per month sustained, or about $6.88 per GPU-hour at the box level. Other H100 cloud providers run from $1.49 to $6.98 per GPU-hour depending on commitment, region, and provider tier, with the spot and longer-commitment ends of that range producing much lower numbers than the on-demand top.
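
The conversions in that paragraph are simple enough to check in a few lines. This sketch assumes a 720-hour month (30 days of continuous use), which is the convention that reproduces the roughly $39,600 figure; the function names are illustrative.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours of sustained use


def monthly_node_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Monthly cost in USD of one node billed by the hour."""
    return hourly_rate * hours


def per_gpu_hour(hourly_rate, gpus):
    """Effective hourly price per GPU for a multi-GPU node."""
    return hourly_rate / gpus


p5_hourly = 55.04  # p5.48xlarge on-demand rate quoted above
print(f"node: ${monthly_node_cost(p5_hourly):,.0f} per month")
print(f"per GPU: ${per_gpu_hour(p5_hourly, 8):.2f} per hour")

# The same eight GPUs at the low and high ends of the quoted provider range.
for rate in (1.49, 6.98):
    print(f"8 GPUs at ${rate}/GPU-hour: ${monthly_node_cost(rate * 8):,.0f} per month")
```

At the low end of the quoted range the same eight GPUs cost under $9,000 per month, which is why the choice of provider and commitment tier moves the compute line more than almost any tuning of the workload.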

The compute is real, and the compute is not the part that decides. Self-hosting means picking and operating a serving framework — vLLM and Text Generation Inference are the dominant choices in 2026 — and picking and operating the orchestration around it: Kubernetes or a comparable runtime, container registries, model artifact storage, an evaluation harness that runs on every model update, an on-call rotation that knows what to do when an inference replica wedges, a release-gating process for new model versions, and a documented procedure for the cases when an OSS model is deprecated by its publisher and needs to be rotated to a successor.

Each of those pieces costs money in the same way the first invisible layer does: not enormous in isolation, real in aggregate, and entirely absent from the GPU-hourly figure that dominates the comparison slide. The break-even where self-host beats managed API has moved as H100 rentals have fallen, but the break-even has moved against a fixed people cost. At realistic per-team headcounts, the break-even is still north of the volume most internal services run.
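
A deliberately crude sketch of that break-even follows. It assumes one on-demand eight-H100 node can serve the whole volume, that the self-hosted model meets the same quality bar, and that the API price is the Sonnet per-request figure from earlier. The extra operations cost is a hypothetical placeholder for roughly one additional fully loaded platform engineer. None of these assumptions comes from a measured deployment.

```python
def self_host_monthly_cost(nodes, node_monthly_cost, extra_ops_cost):
    """Fixed monthly cost of self-hosting: compute plus added operations cost."""
    return nodes * node_monthly_cost + extra_ops_cost


def breakeven_volume(self_host_cost, api_cost_per_request):
    """Monthly requests above which the API bill exceeds the self-host cost."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    return self_host_cost / api_cost_per_request


NODE_MONTHLY = 39_628.80      # p5.48xlarge on-demand, 720-hour month
API_PER_REQUEST = 0.0105      # Sonnet example: $10,500 per million requests
EXTRA_OPS = 27_700.0          # hypothetical: about one more platform engineer

compute_only = self_host_monthly_cost(1, NODE_MONTHLY, 0.0)
with_ops = self_host_monthly_cost(1, NODE_MONTHLY, EXTRA_OPS)

print(f"compute only: {breakeven_volume(compute_only, API_PER_REQUEST):,.0f} requests/month")
print(f"with ops:     {breakeven_volume(with_ops, API_PER_REQUEST):,.0f} requests/month")
```

Even on these generous assumptions, adding a single person's worth of operations cost pushes the break-even from under four million requests per month to well over six million.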

The cost of a bad answer

Quality is a cost. The framing of LLM cost as a token-pricing question quietly assumes that any answer the model produces is worth the price of producing it. In production deployments the assumption is wrong, often by an order of magnitude.

A wrong answer in a finance tool has a remediation cost: manual rework, an inflated escalation queue, sometimes a regulatory disclosure. A wrong answer in a legal-research tool has a similar profile, with the additional risk that nobody catches it until the document leaves the building. A wrong answer in a customer-support tool produces a churned customer; a wrong answer in an engineering tool produces a real bug in real code. Each of these has a dollar value that does not appear on the model pricing page.

The straightforward implication is that a more expensive model that is right ninety-five percent of the time can be cheaper than a less expensive model that is right eighty percent of the time, once the cost of the cheaper model's extra fifteen percent of wrong answers is accounted for. The same argument applies to running a smaller model with a more sophisticated retrieval layer versus running a larger model with cheaper retrieval. The decision-relevant comparison is the cost-per-correct-answer, not the cost-per-token.
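
One way to put numbers on cost-per-correct-answer is sketched below. The accuracy figures are the ones from the paragraph above; the cheaper model's per-request price and the remediation cost per wrong answer are hypothetical placeholders that each team would have to estimate for its own use case.

```python
def expected_cost_per_request(price_per_request, accuracy, error_cost):
    """Inference price plus the expected remediation cost of a wrong answer."""
    return price_per_request + (1.0 - accuracy) * error_cost


def cost_per_correct_answer(price_per_request, accuracy, error_cost):
    """Expected total cost divided by the fraction of answers that are right."""
    return expected_cost_per_request(price_per_request, accuracy, error_cost) / accuracy


def breakeven_error_cost(price_a, accuracy_a, price_b, accuracy_b):
    """Remediation cost above which the more accurate model A is cheaper overall."""
    if accuracy_a <= accuracy_b:
        raise ValueError("model A must be the more accurate one")
    return (price_a - price_b) / (accuracy_a - accuracy_b)


strong = (0.0175, 0.95)  # Opus per-request price from the earlier example
cheap = (0.0050, 0.80)   # hypothetical cheaper model
error_cost = 2.00        # hypothetical USD cost of handling one wrong answer

for name, (price, accuracy) in (("strong", strong), ("cheap", cheap)):
    print(f"{name}: ${cost_per_correct_answer(price, accuracy, error_cost):.4f} per correct answer")

print(f"break-even error cost: ${breakeven_error_cost(*strong, *cheap):.4f}")
```

With these placeholder prices, the more accurate model wins as soon as a wrong answer costs more than about eight cents to clean up, which is a low bar for any workflow that involves a human escalation.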

This is the line item teams least often track and most often are surprised by.

The honest TCO

A defensible total-cost-of-ownership formula for a production GenAI system looks something like this, in the order the line items dominate at typical scale:

People, fully loaded.
Cost of bad answers, evaluated against the use case's quality bar.
Self-host compute, when present, or managed-API inference, when not.
The first invisible layer (vector DB, embeddings, tracing, guardrails, moderation, logging, queues, retries).
Storage, network, and CI/CD.
Model-update churn — regression testing, prompt rebuilds, rollback capacity, the engineering hours absorbed every time a vendor deprecates a version or changes a behaviour.
Security and compliance, separately accounted because audit cost is real and uneven.

For a team running a serious deployment in 2026, none of these are zero. The first three together are typically eighty percent or more of the total. The token bill, the number that started the conversation, often comes in at five to fifteen percent of TCO at the volumes most teams actually operate.
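
The formula is easy to turn into a share-of-total calculation. In the sketch below the people and inference figures come from the worked numbers earlier in the article and the invisible-layer figure sits inside the range estimated above; every other monthly amount is a hypothetical placeholder included only to show the mechanics, not a measurement.

```python
def tco_shares(line_items):
    """Return each line item's fraction of the total monthly cost."""
    total = sum(line_items.values())
    if total <= 0:
        raise ValueError("total cost must be positive")
    return {name: cost / total for name, cost in line_items.items()}


# Monthly USD. Values marked hypothetical are placeholders, not data.
EXAMPLE_MONTHLY = {
    "people": 83_000,                # from the three-engineer estimate
    "bad_answers": 20_000,           # hypothetical
    "inference": 10_500,             # Sonnet example
    "invisible_layer": 15_000,       # within the $5,000-25,000 range above
    "storage_network_cicd": 3_000,   # hypothetical
    "model_update_churn": 5_000,     # hypothetical
    "security_compliance": 5_000,    # hypothetical
}

shares = tco_shares(EXAMPLE_MONTHLY)
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:22s} {share:6.1%}")
```

Swapping in a team's own estimates is the point of the exercise; the shape of the result matters more than any single placeholder value.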

The practical implication is not "self-hosting is bad" or "managed APIs are bad." It is that the question "what does GenAI cost?" has a different answer at different scales, with different teams, against different quality bars. A small deployment with a serious quality requirement and a small team is cleanly served by a managed API with high-quality models, because the dominant cost is the team and a more expensive model is the cheapest way to reduce the team's load. A large deployment with a stable workload, a quality bar that tolerates self-hosted-OSS quality, and a mature platform team is cleanly served by self-hosting, because at sufficient volume the compute line item becomes the dominant variable cost and the team is already paid for. Neither answer is "obvious" without doing the full TCO; both answers fall out of the formula once it is run.

What the per-token comparison is for

The per-token number is not useless. It is the one variable that is genuinely linear in volume, that is genuinely transparent, and that genuinely reflects a real cost. For sizing exercises, capacity planning, and quick architectural sketches it is the right starting number. The error is treating it as the finishing number — as if the conclusion of the cost analysis is contained in the same figure that started it.

A production cost analysis worth running in 2026 looks more like the formula above than like a per-token spreadsheet. It includes the team, the quality bar, the storage, the observability, and the model-update churn, and it produces an answer over a six-to-twelve month horizon rather than a single month. The numbers are easy to gather and the math is not difficult. The discipline is in remembering to include the line items that are not on the model vendor's pricing page.

The right question

The question "which model is cheaper?" has a clean answer on the pricing page and a misleading one in the deployment. The question "which architecture meets our quality, latency, security, and operational requirements at minimum total cost of ownership at our actual volume?" has a more useful answer, and a different one for different organisations on the same day.

On the worked numbers above, a typical serious deployment lands at $10,000 to $17,500 per month for inference and several multiples of that for the rest of the stack, and the team usually discovers this slowly, line item by line item, as production volume rises and the headcount grows around it. The discovery would arrive faster, and the architectural decisions would be better, if the conversation started with the full TCO formula instead of the visible number on the pricing page. The visible number on the pricing page is one of seven. It is also, in most cases, far from the largest of them. That is the part the per-token comparison hides, and that is the part the architectural decision should be built on.

Top comments (1)

Harjot Singh • May 31

The per-token-is-the-visible-tip framing is exactly right and it's the most expensive misconception in the field, because the comfortable number is the one people budget on and the iceberg below it is what actually blows the budget. The line items that hide under it: retries and the agent re-doing work it already did, the context tax (re-sending growing history every turn, quadratic in conversation length), failed runs you still paid for, the eval and observability infra you need to keep it honest, and the human time spent babysitting outputs that weren't verified. Per-token is the one cost that scales linearly and predictably; almost everything else scales worse and surprises you. The discipline that controls the real bill isn't negotiating per-token rates, it's the structural stuff: route cheap work to cheap models, bound the loops so retries can't spiral, trim context aggressively, and cache. The visible number is the one you can't do much about; the hidden stack is where all the leverage is. That treat-the-whole-stack-as-the-cost view is core to how I think about spend in Moonshift. In your production breakdown, which hidden line item was biggest, the retries/loops, or the context-resend tax?