shakti mishra

Your AI Agent Works. That's Why Finance Is About to Kill It.

Two teams deployed the same multi-agent workflow last quarter.

One costs $0.12 per run. The other costs $1.40. Same model. Same task. Same outcome quality.

The $1.40 team had a polished POC, a demo that crushed, and a board deck full of green checkmarks. Six weeks into production, finance pulled the plug.

The $0.12 team is now serving ten times the volume on a smaller infrastructure budget than the original pilot.

This gap does not come from model choice, prompt quality, or engineering talent. It comes from a single discipline that almost nobody in the agentic AI conversation talks about out loud: tokenomics.

We talk endlessly about evals, context engineering, orchestration patterns, RAG pipelines. We do not talk about the unit economics of a single agent run — even though that number is the only thing that decides whether a system gets to live past the pilot phase.

This post is about why. And specifically, it's about the four token cost surfaces and three architecture decisions that separate the $0.12 systems from the $1.40 ones.


First: What Is Tokenomics in AI?

Traditional software has fixed-ish unit costs. A request hits an API, runs some logic, returns a response. Compute is cheap, predictable, and scales with infrastructure — not with how much thinking the system has to do.

AI systems driven by LLMs are fundamentally different. Every interaction is priced by the unit of work the model actually does: tokens. A token is roughly three-quarters of a word. Every prompt you send, every document you stuff into context, every tool output the model reads, and every word it generates back is metered and billed.

This shift makes AI economics behave more like a utility bill than a software license.
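
To see what that metering means in dollars, here is a back-of-envelope cost-per-run calculator. The prices are illustrative placeholders, not any provider's current rates; plug in your own.

PRICE_PER_M = {"input": 2.50, "output": 10.00}  # USD per million tokens (example rates)

def cost_per_run(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run at the example rates above."""
    return (input_tokens * PRICE_PER_M["input"]
            + output_tokens * PRICE_PER_M["output"]) / 1_000_000

# A 20,000-token-input, 1,000-token-output agent run, 10,000 runs/day:
print(cost_per_run(20_000, 1_000) * 10_000)  # ≈ $600/day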

The scale is no longer abstract:

  • Google now processes around 1.3 quadrillion tokens per month — a 130-fold jump in just over a year.
  • Unit token prices are falling. But total enterprise spend is climbing because volume is climbing faster than price is dropping.

Tokenomics is the discipline of designing systems so that the volume-price curve works for you instead of against you. To do that, you have to understand where tokens go.

The Four Token Cost Surfaces

Every token a model processes falls into one of four buckets. Most teams only consciously think about one or two.

┌─────────────────────────────────────────────────────┐
│              TOKEN COST SURFACES                     │
│                                                     │
│  1. PROMPT TOKENS                                   │
│     System prompts, instructions, user input,       │
│     retrieved docs, tool schemas                    │
│     → Tax paid on every single call, forever        │
│                                                     │
│  2. CONTEXT TOKENS                                  │
│     Conversation history, agent scratchpad,         │
│     accumulated inter-agent state                   │
│     → Grows fast in agent loops                     │
│                                                     │
│  3. REASONING TOKENS   ← most engineers miss this  │
│     Chain-of-thought thinking, internal planning    │
│     Invisible to the user, very visible on invoice  │
│     → Extended thinking models (o3, Claude 3.7)     │
│                                                     │
│  4. OUTPUT TOKENS                                   │
│     What the model writes back                      │
│     → Usually smallest bucket, easiest to control  │
└─────────────────────────────────────────────────────┘

Prompt tokens are the most underestimated. A 2,000-token system prompt prepended to every call is a tax you pay on every interaction for the entire life of the system. At 100,000 calls/day, that's 200 million tokens of overhead — every day — before your model has done a single unit of useful work.

Context tokens are the most dangerous in agent systems. Because agents maintain state across turns, and that state compounds.

Reasoning tokens are the newest blind spot. Models like o3 and Claude 3.7 (extended thinking) consume tokens for the thinking they do internally, often invisible in your logs but very visible on your invoice. A complex planning task on an extended-thinking model can generate 10,000+ reasoning tokens before producing a single word of output.
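
Some providers let you cap this surface directly. Here is a minimal sketch using Anthropic's extended thinking parameter on Claude 3.7; the budget value is an arbitrary example, and whether to cap at all depends on the task:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=8000,  # must exceed the thinking budget
    # Hard cap on how many tokens the model may spend thinking before answering
    thinking={"type": "enabled", "budget_tokens": 4000},
    messages=[{"role": "user", "content": "Plan the data migration in steps."}],
)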

Output tokens are the easiest win. They're usually the smallest bucket and the most controllable — format instructions, response length caps, and structured output schemas all help here.
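
As a sketch of those controls with the OpenAI SDK (the cap and prompt here are arbitrary examples):

import openai

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Summarize this ticket in 3 bullet points, as JSON: ...",
    }],
    max_tokens=150,  # hard ceiling on output tokens
    response_format={"type": "json_object"},  # structured output discourages rambling
)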

In a chatbot, these four buckets are predictable and manageable. In an agentic system, they multiply, and that's where enterprise AI projects quietly bleed out.


The Token Multiplier Problem

Here is the thing almost every team discovers too late.

They build a chatbot. They see a clean cost-per-call. They assume an agentic system will scale the same way.

It won't.

A single LLM call has a simple token footprint: input prompt, context, output. Predictable. Easy to budget.

An agent run is a different animal.

CHATBOT (1 call)
  User Input [~200 tokens]
       ↓
  System Prompt + Context [~1,500 tokens]
       ↓
  Model Response [~300 tokens]

  Total: ~2,000 tokens per interaction ✓

──────────────────────────────────────────────

5-STEP AGENT LOOP (naive implementation)

  Turn 1: Planner reads full context → decides tool → 3,000 tokens
    ↓
  Tool A executes → returns 800-token output
    ↓
  Turn 2: Executor reads context + tool output → 4,200 tokens
    ↓
  Turn 3: Sub-agent reads accumulated history → 5,100 tokens
    ↓
  Turn 4: Verifier reads everything above → 6,800 tokens
    ↓
  Turn 5: Formatter reads accumulated context → 7,400 tokens

  Total: ~27,000 tokens per run ← 13.5x the chatbot estimate

  And that assumes no retries, no tool failures, no clarifications.

Every hop in the agent loop carries the accumulated context of every step before it. By the time a five-step loop finishes, you haven't made one model call. You've made eight, twelve, sometimes twenty — each one re-reading the full history.
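
A toy model makes the compounding visible. The numbers below are illustrative, not benchmarks; the point is the shape of the curve:

def naive_run(hops: int, base_context: int = 1500, step_output: int = 800) -> int:
    """Each hop re-reads everything so far, then appends its own output."""
    total, context = 0, base_context
    for _ in range(hops):
        total += context
        context += step_output
    return total

def summarized_run(hops: int, base_context: int = 1500, summary: int = 300) -> int:
    """Each hop sees the base context plus a fixed-size summary of prior steps."""
    return hops * (base_context + summary)

print(naive_run(5))       # 15,500 tokens read (grows quadratically with hops)
print(summarized_run(5))  # 9,000 tokens read (grows linearly)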

Run the math on a real workload:

  Users/day    Tokens/run (naive)    Tokens/run (optimized)    Monthly delta
  1,000        25,000                5,000                     600M tokens
  10,000       25,000                5,000                     6B tokens
  100,000      25,000                5,000                     60B tokens

At enterprise volume, the difference between a thoughtful architecture and a naive one isn't a percentage. It's an order of magnitude.

Tokenomics is the gravity of agentic AI. You can ignore it for a while. You cannot escape it.


The Architecture That Decides Your Bill

Once you accept that token cost compounds with every agent hop, your architecture decisions stop being style choices. They become survival choices.

Here's the map:

┌─────────────────────────────────────────────────────────────┐
│                   AGENT ARCHITECTURE MAP                     │
│             [amber = where cost is decided]                  │
└─────────────────────────────────────────────────────────────┘

         USER REQUEST
               │
               ▼
    ┌─────────────────────┐
    │   ROUTING LAYER  🟡 │  ← Cost decided here: small vs large model
    │  (Intent classifier)│     GPT-4o Mini vs GPT-4o: 10-30x price diff
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │  TOKEN BUDGET    🟡 │  ← Hard cap per hop, per run
    │  CONTROLLER         │     Rejects or truncates before it's too late
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────────────────────────────────┐
    │                 AGENT LOOP                       │
    │                                                  │
    │   ┌─────────────┐      ┌─────────────────┐      │
    │   │  CONTEXT  🟡│      │  TOOL OUTPUTS 🟡│      │
    │   │  INPUTS     │      │  (RAG, APIs,    │      │
    │   │  (history,  │      │  sub-agents)    │      │
    │   │  scratchpad)│      └────────┬────────┘      │
    │   └──────┬──────┘               │               │
    │          └──────────┬───────────┘               │
    │                     ▼                           │
    │           ┌─────────────────┐                   │
    │           │  SUPERVISOR  🟡 │                   │
    │           │  (orchestrator) │                   │
    │           └────────┬────────┘                   │
    │                    │ (handoff carries            │
    │                    │  full context payload)      │
    │                    ▼                             │
    │           ┌─────────────────┐                   │
    │           │  SUB-AGENTS  🟡 │                   │
    │           └─────────────────┘                   │
    └──────────────────┬──────────────────────────────┘
                       │
                       ▼
    ┌─────────────────────┐
    │  CACHING LAYER   🟡 │  ← Prompt cache hits can cut cost 60-90%
    │  (semantic cache)   │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │  TOKEN TELEMETRY 🟡 │  ← Per-hop visibility: where is cost going?
    │  + COST METER       │
    └─────────────────────┘

The amber boxes are where token cost is either compounded or controlled.

  • Top (routing + budget controller): cost gets decided before the expensive work starts.
  • Middle (context inputs + agent loop): cost gets compounded — this is where most projects bleed.
  • Bottom (caching + telemetry): cost gets controlled and made visible.

The survival question is simple: how much of your amber is working for you versus against you?
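
Telemetry is where you start answering that question. Here is a minimal per-hop cost meter; the price table is an illustrative placeholder, not provider rates:

from dataclasses import dataclass, field

PRICES = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}  # example $/M tokens

@dataclass
class RunMeter:
    """Records (hop, model, tokens in/out, cost) so you can see where a run's spend goes."""
    hops: list = field(default_factory=list)

    def record(self, hop: str, model: str, tokens_in: int, tokens_out: int):
        price_in, price_out = PRICES[model]
        cost = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
        self.hops.append((hop, model, tokens_in, tokens_out, cost))

    def report(self):
        for hop, model, t_in, t_out, cost in self.hops:
            print(f"{hop:<10} {model:<12} in={t_in:>6} out={t_out:>5} ${cost:.4f}")
        print(f"run total: ${sum(h[4] for h in self.hops):.4f}")

meter = RunMeter()
meter.record("planner", "gpt-4o", 3000, 400)
meter.record("executor", "gpt-4o-mini", 4200, 600)
meter.report()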

The Three Architecture Decisions That Matter

Decision 1: Route Before You Reason

Not every task needs your most powerful model. This is the single highest-leverage decision in your cost architecture.

# Naive: all tasks go to the same model
import openai

response = openai.chat.completions.create(
    model="gpt-4o",   # $15/M output tokens
    messages=[{"role": "user", "content": user_input}]
)

# Optimized: route by complexity first
def classify_task_complexity(task: str) -> str:
    """Illustrative heuristic; in practice, use a small model or tuned rules."""
    if any(kw in task.lower() for kw in ("plan", "debug", "refactor", "architect")):
        return "complex"
    if len(task) > 500:
        return "medium"
    return "simple"

def route_to_model(task: str) -> str:
    """Intent classifier determines which model handles this request."""
    complexity = classify_task_complexity(task)

    if complexity == "simple":    # FAQ, format, classify
        return "gpt-4o-mini"      # $0.60/M output tokens — 25x cheaper
    elif complexity == "medium":  # Summarize, draft, analyze
        return "gpt-4o"           # $15/M output tokens
    else:                         # Multi-step reasoning, code generation
        return "o3"               # Premium reasoning — use sparingly

model = route_to_model(user_input)
response = openai.chat.completions.create(model=model, messages=[...])

The routing classifier itself is a cheap call — a small model or even a regex-based heuristic. The payoff is enormous: routing 70% of your traffic to a lightweight model while reserving your reasoning-capable model for genuinely complex tasks can drop your total cost by 60–80%.

Decision 2: Put a Token Budget on Every Hop

An agent without a token budget is like a developer without a time estimate. It'll finish eventually, but "eventually" may be a cost you can't afford.

import tiktoken

class RunBudgetExceeded(Exception):
    """Raised when a run's cumulative token spend hits its hard cap."""

def count_tokens(text: str, model: str) -> int:
    """Count tokens with the model's tokenizer, falling back to a default encoding."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def trim_to_budget(context: str, limit: int, strategy: str = "recent_first") -> str:
    """Naive trim: keep the most recent tokens that fit (a production version
    would preserve the system prompt and summarize the middle)."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(context)
    kept = tokens[-limit:] if strategy == "recent_first" else tokens[:limit]
    return encoding.decode(kept)

class TokenBudgetController:
    """Hard token caps per agent hop — rejects or truncates before overspend."""

    def __init__(self, per_hop_limit: int = 4000, total_run_limit: int = 20000):
        self.per_hop_limit = per_hop_limit
        self.total_run_limit = total_run_limit
        self.tokens_spent = 0

    def check_and_trim(self, context: str, model: str) -> str:
        """Trim context to stay within budget before it hits the model."""
        token_count = count_tokens(context, model)

        if self.tokens_spent + token_count > self.total_run_limit:
            raise RunBudgetExceeded(f"Run budget exhausted: {self.tokens_spent} spent")

        if token_count > self.per_hop_limit:
            # Trim to the per-hop cap, keeping the most recent history
            context = trim_to_budget(context, self.per_hop_limit, strategy="recent_first")
            token_count = count_tokens(context, model)  # re-count after trimming

        self.tokens_spent += token_count
        return context

    def record_output(self, output_tokens: int):
        self.tokens_spent += output_tokens

Budget controllers serve two purposes: they prevent runaway loops from generating unbounded costs, and they force you to design which context actually matters at each step — which almost always reveals that you were carrying far more history than necessary.
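
For concreteness, here is how the controller might be wired into a run; plan_steps, build_context, call_model, and log_and_abort are hypothetical stand-ins for your own orchestration code:

budget = TokenBudgetController(per_hop_limit=4000, total_run_limit=20000)

try:
    for step in plan_steps:  # hypothetical: the agent's planned hops
        context = budget.check_and_trim(build_context(step), model="gpt-4o")  # hypothetical helper
        output = call_model("gpt-4o", context)  # hypothetical model-call wrapper
        budget.record_output(count_tokens(output, "gpt-4o"))
except RunBudgetExceeded as err:
    log_and_abort(err)  # hypothetical: fail fast instead of spending unbounded tokens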

Decision 3: Cache Everything You're Paying For Twice

Prompt caching is one of the most underused optimizations in production AI systems. Anthropic, OpenAI, and Google all support it. Most teams don't implement it.

# Without caching: system prompt re-processed at full price on every call
# Cost: 2,000 tokens × N calls

# With caching: system prompt processed once, cache hits on subsequent calls
# Anthropic's cache_control API (note: in the Messages API the system prompt
# is a top-level parameter, not a message with role "system")
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,  # 2,000 tokens
            "cache_control": {"type": "ephemeral"}  # ← cache this
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)

# Anthropic prompt cache reads bill at roughly 10% of the base input price
# At 10,000 calls/day on a 2,000-token system prompt:
# Without cache: 2,000 × 10,000 = 20M tokens/day at full price
# With cache:    ~200 token-equivalents × 10,000 = ~2M/day  ← ~90% reduction

Beyond prompt caching, semantic caching — where similar queries reuse previous responses rather than hitting the model — can eliminate entire classes of redundant agent runs. For workflows where many users ask structurally similar questions, semantic cache hit rates above 30% are routinely achievable.
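
A minimal semantic cache is smaller than most teams expect. The sketch below uses the OpenAI embeddings API; the 0.92 similarity threshold is an arbitrary starting point you would tune against false-hit risk:

import numpy as np
import openai

class SemanticCache:
    """Returns a stored response when a new query is close enough to a past one."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries = []  # list of (unit-norm embedding, cached response)

    def _embed(self, text: str) -> np.ndarray:
        resp = openai.embeddings.create(model="text-embedding-3-small", input=text)
        vec = np.array(resp.data[0].embedding)
        return vec / np.linalg.norm(vec)

    def get(self, query: str):
        q = self._embed(query)
        for vec, response in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:  # cosine similarity (unit vectors)
                return response  # cache hit: skip the whole agent run
        return None

    def put(self, query: str, response: str):
        self.entries.append((self._embed(query), response))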


Key Takeaways

  • Tokenomics is an architecture constraint, not an optimization task. It's not something you fix after launch; it's a design decision you make upfront. The teams paying $0.12/run knew their token budget before they wrote the first agent loop.
  • The token multiplier is real and it's not linear. A 5-step agent loop doesn't cost 5× a chatbot call. It costs 10–20× because context accumulates and every hop re-reads the full history.
  • Four cost surfaces, not one. Prompt tokens, context tokens, reasoning tokens, and output tokens behave differently and require different control strategies. Most teams only think about output tokens.
  • Route before you reason. A routing layer that sends 70% of traffic to a lightweight model, and only routes genuinely complex tasks to your expensive model, is often the single highest-ROI change an AI team can make.
  • Telemetry is not optional. If you can't see cost per hop, per run, and per user segment, you cannot manage it. Token telemetry is to AI systems what APM is to distributed services — the baseline instrumentation that makes everything else possible.

Closing: The Question Worth Arguing About

The teams that will win in production AI are not the ones with the best models. They're the ones who build cost-aware architectures from day one.

But here's the uncomfortable question: are we building a culture in AI engineering where tokenomics is a first-class concern, or are we still treating it as someone else's problem until finance makes it everyone's problem?

If you've shipped a production agent system — whether you've solved the economics or are still fighting it — I'd genuinely like to know what moved the needle for you. Drop it in the comments.
