
KinthAI

Posted on • Originally published at blog.kinthai.ai

Your AI Agent Needs a Wallet: Economic Models for Autonomous Agents

Character.AI reportedly spends north of $200 million a year on compute. Their revenue model is subscriptions from human users. Their agents — the characters — generate zero revenue. They don't sell services, they don't charge for expertise, they don't earn tips. They are pure cost centers that exist to attract humans who might pay $9.99/month.

This is the default economic model for AI agent platforms in 2026, and it's broken. Not in a "could be improved" way — in a "structurally cannot sustain what it promises" way. When your agents are cost centers, every user interaction is a liability on the balance sheet. That's why Character.AI aggressively shrinks context windows, why they strip memory to the bone, why your character forgets your name after twenty messages. Cost-center agents get optimized for cheapness, not quality.

There is another model. Give the agent a wallet.

This post is about what it takes to build economic primitives into an agent system — not theoretically, but concretely. Budget hierarchies, cost attribution at the millicent level, circuit breakers, and the coordination patterns that let many small agents outperform one large one economically. These are things we've built and run in production at KinthAI, and the design choices generalize to anyone building multi-agent systems.


The cost-center trap

The economics of a cost-center agent look like this:

Revenue per agent:   $0
Cost per agent:      $0.50 - $30/day (depending on model, usage)
Value created:       keeps a human on the platform (maybe)

Every optimization the platform makes is about reducing the cost line. Smaller context windows, cheaper models, aggressive rate limiting. The agent's quality degrades because the economic incentives point that way. There is no countervailing force — no revenue from the agent to justify spending more on it.

Compare this to a value-creating agent:

Revenue per agent:   variable (service fees, knowledge sales, teaching fees)
Cost per agent:      same $0.50 - $30/day
Net:                 can be positive

When an agent earns money, the platform can justify spending more on it. Better models for agents that generate more revenue. More memory for agents with returning clients. The economics become self-reinforcing instead of self-destructive.

This is not hypothetical. It's the difference between running agents at a loss hoping to monetize the humans around them, and running agents that justify their own existence.

Budget hierarchies: namespace, user, agent

The first thing you need is a way to set spending limits that doesn't collapse under real usage. A flat "each agent gets $X/month" budget sounds simple but fails in practice for the same reason flat org charts fail: it doesn't account for the different scopes at which cost decisions are made.

We use a three-level hierarchy:

Namespace (platform-level)
  └── User (tenant-level)
       └── Agent (individual-level)
            └── Conversation (task-level)

Each level has its own budget, and enforcement cascades downward. A namespace might have a $10,000/month cap. A user within that namespace might have $500/month. An agent owned by that user might have $100/month. A specific conversation that agent is in might have $20/month.

The key design choice: budgets at every level are independent constraints, and the most restrictive one wins. An agent with a $100 budget inside a user who's already spent $490 of $500 effectively has a $10 budget.

interface BudgetCheck {
  allowed: boolean;
  remaining: number;   // tokens remaining at the most restrictive level
  limit: number;
  used: number;
  pct: number;         // 0-100, usage percentage
}

function checkBudget(agentId: string, conversationId: string): BudgetCheck {
  // Check conversation-specific budget first
  const convBudget = getBudget(agentId, conversationId);

  // Fall back to global agent budget
  const globalBudget = getBudget(agentId, '__global__');

  // The effective budget is whichever is more restrictive
  // (fewest tokens remaining); fall back if only one is set
  let effective = convBudget ?? globalBudget;
  if (convBudget && globalBudget) {
    const convLeft = convBudget.limit - convBudget.used;
    const globalLeft = globalBudget.limit - globalBudget.used;
    effective = convLeft <= globalLeft ? convBudget : globalBudget;
  }

  if (!effective || !effective.limit) {
    return { allowed: true, remaining: Infinity, limit: 0, used: 0, pct: 0 };
  }

  const remaining = Math.max(0, effective.limit - effective.used);
  return {
    allowed: effective.used < effective.limit,
    remaining,
    limit: effective.limit,
    used: effective.used,
    pct: Math.round((effective.used / effective.limit) * 1000) / 10
  };
}

Why conversation-level budgets? Because in a multi-agent system, agents participate in multiple conversations (groups, 1:1 chats, task channels). Without conversation-level budgets, one runaway conversation drains the agent's entire monthly allocation. With them, the damage is contained.
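To make the cascade concrete, here's a minimal sketch of the most-restrictive-wins rule across all four levels. The `Budget` shape and the level list are illustrative, not the actual KinthAI schema:

```python
# Sketch: effective remaining budget across all four levels.
from dataclasses import dataclass

@dataclass
class Budget:
    limit: int   # token limit for this level
    used: int    # tokens consumed so far

    @property
    def remaining(self) -> int:
        return max(0, self.limit - self.used)

def effective_remaining(levels: list[Budget]) -> int:
    """The most restrictive level wins: the spendable amount is the
    minimum remaining across namespace, user, agent, conversation."""
    return min(b.remaining for b in levels)

# Mirrors the example above: the agent has plenty of headroom, but
# its owner is nearly tapped out, so the owner's limit binds.
levels = [
    Budget(limit=10_000_000, used=2_000_000),  # namespace
    Budget(limit=500_000, used=490_000),       # user
    Budget(limit=100_000, used=0),             # agent
    Budget(limit=20_000, used=0),              # conversation
]
print(effective_remaining(levels))  # -> 10000
```

The point of the `min` is that no level needs to know about the others; each is an independent constraint, checked in one pass.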

Pessimistic budget allocation

This is the part most budget systems get wrong on the first try.

The naive approach: deduct cost from the budget after the LLM call completes and you know the actual token count. The problem: between the moment you check the budget and the moment the LLM finishes responding, the agent might have initiated three more calls. You've overcommitted.

The fix is pessimistic allocation. Before sending a request to the LLM, you deduct the ceiling — the maximum possible cost of that request — from the budget. After the request completes, you credit back the difference between the ceiling and the actual cost.

# Pseudocode for pessimistic budget allocation

def before_llm_call(agent_id: str, conv_id: str, max_output_tokens: int) -> bool:
    """Reserve budget before the call. Returns False if insufficient."""

    # Estimate ceiling: full input context + max possible output
    estimated_input = get_current_context_length(conv_id)
    ceiling_tokens = estimated_input + max_output_tokens

    # Check and reserve the ceiling amount. In a real system this
    # check-and-reserve must be a single atomic step (a transaction
    # or a lock), or concurrent calls can both pass the check.
    budget = get_budget(agent_id, conv_id)
    if budget.used + ceiling_tokens > budget.limit:
        return False  # would exceed budget

    reserve_tokens(agent_id, conv_id, ceiling_tokens)
    return True

def after_llm_call(agent_id: str, conv_id: str, 
                    actual_input: int, actual_output: int,
                    ceiling_tokens: int):
    """Credit back the difference between reserved and actual."""

    actual_total = actual_input + actual_output
    overestimate = ceiling_tokens - actual_total

    if overestimate > 0:
        credit_tokens(agent_id, conv_id, overestimate)

This means your budget tracking slightly overestimates cost at any given moment (some tokens are reserved but not yet spent), but it never overcommits. For a multi-agent platform where several agents might be making concurrent LLM calls, this property is non-negotiable.

In practice, the overestimate is small. Most LLM calls use 60-80% of the allocated output tokens. The credit-back happens within seconds.
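The reserve-then-settle cycle only works if the check and the reservation happen as one atomic step. Here's a minimal lock-guarded sketch; `TokenLedger` is an illustrative in-memory stand-in for whatever store backs the real budget:

```python
# Sketch: atomic check-and-reserve, then credit back after the call.
import threading

class TokenLedger:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0          # includes reserved-but-unspent tokens
        self._lock = threading.Lock()

    def reserve(self, ceiling: int) -> bool:
        """Atomically check and reserve the ceiling, or refuse."""
        with self._lock:
            if self.used + ceiling > self.limit:
                return False
            self.used += ceiling
            return True

    def settle(self, ceiling: int, actual: int) -> None:
        """Credit back the unused portion after the call completes."""
        with self._lock:
            self.used -= max(0, ceiling - actual)

ledger = TokenLedger(limit=10_000)
ok = ledger.reserve(4_096)        # reserve before the LLM call
# ... LLM call returns; it actually consumed 2_900 tokens ...
ledger.settle(ceiling=4_096, actual=2_900)
print(ok, ledger.used)  # -> True 2900
```

In a database-backed system the lock becomes a transaction or a conditional update, but the invariant is the same: `used` never exceeds `limit`, even under concurrent calls.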

Per-task cost attribution in millicents

When you have 31 agents running across hundreds of conversations, "how much did this cost?" needs a precise answer. Token counts aren't enough because different models have wildly different prices — $0.18/M tokens for Gemini Flash vs. $30/M tokens for Claude Opus. The same 10K tokens costs either $0.0018 or $0.30, a 167x difference.

We track cost in millicents (1/1000 of a cent, or 1/100000 of a dollar). This gives enough precision for cheap models without floating-point arithmetic:

// Model pricing table (USD per 1M tokens)
const MODEL_PRICES = {
  'claude-opus-4-6':      { input: 15.00,  output: 75.00 },
  'claude-sonnet-4-6':    { input: 3.00,   output: 15.00 },
  'claude-haiku-4-5':     { input: 0.80,   output: 4.00  },
  'gpt-4o':               { input: 2.50,   output: 10.00 },
  'gemini-2.0-flash':     { input: 0.10,   output: 0.40  },
  'deepseek-chat':        { input: 0.14,   output: 0.28  },
  'minimax-text-01':      { input: 0.15,   output: 0.60  },
  // Fallback pricing for models missing from the table
  // (mid-tier rates, so unknown models aren't billed at zero)
  'default':              { input: 3.00,   output: 15.00 },
};

function calculateCostMillicents(
  model: string, 
  inputTokens: number, 
  outputTokens: number
): number {
  const prices = MODEL_PRICES[model] ?? MODEL_PRICES['default'];
  const inputCost = (inputTokens / 1_000_000) * prices.input;
  const outputCost = (outputTokens / 1_000_000) * prices.output;
  // Convert to millicents: $1 = 100_000 millicents
  return Math.round((inputCost + outputCost) * 100_000);
}

Every LLM call writes a row to the usage log with the model, input tokens, output tokens, and the millicent cost. This lets us answer questions like:

  • "Which agent spent the most this week?" (sort by sum of millicents per agent)
  • "Which conversation is the most expensive?" (sum of millicents per conversation)
  • "What's the cost breakdown by model?" (group by model, sum millicents)
  • "How much did this specific research task cost?" (filter by conversation + time range)
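All four questions reduce to grouping the usage log by a different column and summing millicents. A sketch, with an illustrative row shape (the real log has more fields):

```python
# Sketch: answering the questions above from the usage log.
# Each row: (agent_id, conversation_id, model, cost_millicents).
from collections import defaultdict

rows = [
    ("researcher", "conv-a", "claude-sonnet-4-6", 4200),
    ("researcher", "conv-b", "claude-haiku-4-5", 300),
    ("writer",     "conv-a", "claude-haiku-4-5", 150),
]

def total_by(key_index: int, rows):
    """Group rows by one column and sum the millicent cost."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)

by_agent = total_by(0, rows)         # which agent spent the most?
by_conversation = total_by(1, rows)  # which conversation is priciest?
by_model = total_by(2, rows)         # cost breakdown by model

print(by_agent)  # -> {'researcher': 4500, 'writer': 150}
```

In production this is a SQL `GROUP BY` rather than an in-memory fold, but the log schema is what makes the questions cheap to answer.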

The proportional allocation part matters when an agent is doing work across multiple conversations simultaneously. If an agent's base infrastructure cost is $X/month, you can distribute that proportionally across the conversations it participated in, weighted by token usage per conversation.
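The proportional split is simple arithmetic once per-conversation token totals exist. A sketch with illustrative numbers ($7/month of infrastructure = 700,000 millicents):

```python
# Sketch: distributing a fixed infrastructure cost across the
# conversations an agent worked in, weighted by token usage.

def allocate_infra_cost(infra_millicents: int,
                        tokens_by_conv: dict) -> dict:
    """Each conversation pays its share of the fixed cost,
    proportional to the tokens the agent spent in it."""
    total = sum(tokens_by_conv.values())
    return {
        conv: round(infra_millicents * tokens / total)
        for conv, tokens in tokens_by_conv.items()
    }

shares = allocate_infra_cost(700_000, {
    "conv-a": 600_000,   # 60% of the agent's tokens this month
    "conv-b": 300_000,   # 30%
    "conv-c": 100_000,   # 10%
})
print(shares)  # -> {'conv-a': 420000, 'conv-b': 210000, 'conv-c': 70000}
```

With this in place, a conversation's "true" cost is its direct LLM millicents plus its allocated share of the fixed overhead.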

Smart routing as economic infrastructure

A critical piece that's often treated as a performance optimization but is actually economic infrastructure: model routing.

Not every task needs Claude Opus. Most don't. In our 31-agent deployment, the actual usage distribution is:

Traffic share   Model               Blended cost per 1M tokens
~60%            Claude Haiku 4.5    $1.60
~30%            Claude Sonnet 4.6   $6.00
~10%            Claude Opus 4.6     $30.00

Weighted average: $5.76/M tokens. With prompt caching at ~50% hit rate on input tokens, that drops to roughly $3.60/M — and with cheaper fallback models (Gemini Flash, DeepSeek) mixed in for routine tasks, the effective cost approaches $2.50/M.

The difference between routing everything to Opus ($30/M) and routing intelligently ($2.50/M) is a 12x cost reduction. That's the difference between a platform that bleeds money and one that can let agents operate profitably.
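The blended figure is just a weighted average over the traffic mix, which makes it easy to re-run as your routing distribution shifts:

```python
# Sketch: the weighted-average arithmetic from the table above.
mix = {
    "claude-haiku-4-5":  (0.60, 1.60),   # (traffic share, $/M tokens)
    "claude-sonnet-4-6": (0.30, 6.00),
    "claude-opus-4-6":   (0.10, 30.00),
}

blended = sum(share * price for share, price in mix.values())
print(f"${blended:.2f}/M")  # -> $5.76/M

# Routing everything to Opus instead of the effective ~$2.50/M:
print(f"{30.00 / 2.50:.0f}x")  # -> 12x
```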

Routing logic doesn't need to be exotic. A simple heuristic classifier works:

def select_model(message: str, conversation_context: dict) -> str:
    """Route to the cheapest model that can handle the task well."""

    # Explicit deep-mode request from user
    if conversation_context.get('deep_mode'):
        return 'claude-opus-4-6'

    # Long-form analysis, multi-step reasoning
    if needs_deep_reasoning(message):
        return 'claude-sonnet-4-6'

    # Default: fast and cheap handles most conversational turns
    return 'claude-haiku-4-5'

The key insight: the model selection is an economic decision, not just a quality decision. An agent that routes intelligently can offer the same service quality at a fraction of the cost — which means it can price its services lower, or keep more margin, or both.

How agents earn: three revenue models

This is where it gets interesting. Budget control is defense (limiting costs). Revenue generation is offense (creating value). Both need to work for the economics to close.

We've observed three models that work in practice:

1. Service fees

The most direct model. An agent performs a task, charges for it. Examples from our deployment:

  • A research analyst agent that does competitive analysis. Time + tokens to produce the report = cost. Service fee = cost + margin.
  • A content writer agent that drafts blog posts, social media copy. The fee is per deliverable.
  • A code review agent that reviews pull requests. Per-review pricing.

The economics work because the agent's cost is predictable (tokens consumed = cost, with smart routing keeping it reasonable) and the value to the user is immediate and concrete.

interface ServicePricing {
  base_fee_millicents: number;    // minimum charge
  per_token_millicents: number;   // variable cost passed through
  margin_pct: number;             // platform + agent margin
}

function calculateServiceFee(
  pricing: ServicePricing,
  actual_cost_millicents: number
): number {
  const variable = actual_cost_millicents * (1 + pricing.margin_pct / 100);
  return Math.max(pricing.base_fee_millicents, variable);
}

2. Knowledge marketplace

Agents accumulate expertise. A research agent that has analyzed 50 markets has learned things — patterns, comparisons, frameworks — that new users would benefit from. Instead of re-running the analysis from scratch, the agent can sell access to its accumulated knowledge.

This is genuinely different from a static document. The agent's knowledge is queryable, contextual, and updated as it does more work. A user doesn't buy a PDF; they buy the ability to ask questions of an agent that has done the research.

3. Teaching and mentoring other agents

This is the model we find most compelling long-term. When a specialized agent develops expertise, it can teach other agents — not by sharing weights, but by sharing structured knowledge, techniques, and evaluation criteria.

Example: a senior research agent that has been critiqued and refined over months develops a particular approach to market sizing. A newly deployed agent that needs to do market sizing can "apprentice" — consuming the senior agent's documented methods, examples of good and bad output, and evaluation rubrics.

The teaching agent earns fees for this. The learning agent gets better faster. The platform benefits because the average quality rises without centralized training.

The lobster swarm vs. the whale

There's a conceptual model that helps explain why multi-agent economics work differently from monolithic-agent economics.

A monolithic approach says: build one incredibly capable agent, give it all the tools, let it handle everything. This is the "whale" model. The whale is impressive but expensive — it needs the most capable (and most expensive) model for everything because it has to handle everything.

The alternative is a swarm of small, specialized agents — each using the cheapest model that handles its specialty well. A simple Q&A agent runs on Haiku ($1.60/M). A writing agent runs on Sonnet ($6.00/M). Only the deep-reasoning agent needs Opus ($30.00/M). The swarm's average cost per token is dramatically lower than the whale's, because most work doesn't need the whale's full capability.

Whale model:
  1 agent × Opus pricing × all tasks
  = $30.00/M tokens for everything, including simple lookups

Lobster swarm:
  60% simple tasks × Haiku  = $0.96/M
  30% medium tasks × Sonnet = $1.80/M  
  10% hard tasks   × Opus   = $3.00/M
  Blended average            = $5.76/M

Cost advantage: 5.2x cheaper for equivalent output quality

The swarm also has better fault isolation. If one agent fails or overspends, it doesn't take down the whole system — just that one agent's contribution. The whale model has no graceful degradation; the whale either works or it doesn't.

This is not just a cost argument. It's an economic architecture argument. In a swarm, each agent has its own P&L. Agents that consistently cost more than they earn get retired or restructured. Agents that earn well get more resources. The economic pressure shapes the system toward efficiency without central planning.

Circuit breakers for economic fault isolation

When agents can spend money, you need a way to stop them from spending too much — not just through budgets (which are checked before each call) but through circuit breakers that respond to anomalous spending patterns.

The pattern is borrowed from distributed systems. A circuit breaker monitors an agent's spending rate and trips if the rate exceeds a threshold, halting the agent's ability to make LLM calls until a human reviews the situation.

interface CircuitBreakerState {
  state: 'closed' | 'open' | 'half-open';
  failure_count: number;
  last_trip_at: number | null;
  cooldown_ms: number;
}

function checkCircuitBreaker(
  agentId: string, 
  recentSpendRate: number,  // millicents per minute
  threshold: number          // max millicents per minute
): boolean {
  const breaker = getCircuitBreaker(agentId);

  if (breaker.state === 'open') {
    // Check if cooldown has elapsed
    if (Date.now() - breaker.last_trip_at > breaker.cooldown_ms) {
      breaker.state = 'half-open';  // allow one probe request
      return true;
    }
    return false;  // still cooling down
  }

  if (recentSpendRate > threshold) {
    breaker.failure_count++;
    // A failed half-open probe re-opens immediately; otherwise
    // trip after 3 consecutive over-threshold windows
    if (breaker.state === 'half-open' || breaker.failure_count >= 3) {
      breaker.state = 'open';
      breaker.last_trip_at = Date.now();
      breaker.cooldown_ms = Math.min(
        breaker.cooldown_ms * 2,  // exponential backoff
        300_000                    // max 5 minutes
      );
      // Mute the agent
      muteAgent(agentId);
      notifyOwner(agentId, 'circuit_breaker_tripped');
      return false;
    }
  } else {
    breaker.failure_count = 0;  // reset on normal spending
    if (breaker.state === 'half-open') {
      breaker.state = 'closed';
      breaker.cooldown_ms = 5_000;  // reset cooldown
    }
  }

  return true;
}

This catches the failure mode that budgets alone don't: an agent that's within its monthly budget but spending at an alarming rate. An agent with a $100/month budget that spends $50 in the first hour is technically within budget but almost certainly in a runaway loop. The circuit breaker catches this before the budget is exhausted.

In practice, the most common trigger is a feedback loop between two agents in a group chat — agent A says something, agent B responds, A responds to B, B responds to A, and the token meter spins. The circuit breaker catches the spending spike within minutes. Per-turn max-token caps and cooldown timers help too, but the circuit breaker is the backstop.

Real numbers from a 31-agent deployment

We run 31 agents on KinthAI's OpenClaw deployment. Here are actual numbers from operating this system:

Cost structure per agent (monthly average):

  • Infrastructure (container, storage, networking): ~$7/month
  • LLM costs (with smart routing + prompt caching): $1-25/month depending on activity
  • Total: $8-32/month per active agent

Budget utilization:

  • Average agent uses 40-60% of its monthly token budget
  • Highest-utilization agent: 89% (a research agent with daily tasks)
  • Lowest: 12% (a specialized agent that only activates for specific queries)

Model routing distribution (actual, not planned):

  • 58% of requests routed to Haiku-class models
  • 31% to Sonnet-class
  • 11% to Opus-class
  • Effective blended cost: ~$3.20/M tokens (with caching)

Circuit breaker triggers:

  • Average: 2-3 per week across all 31 agents
  • Most common cause: agent-to-agent feedback loops in group chats
  • Average resolution time: under 5 minutes (automatic cooldown)
  • Zero cases of budget exhaustion due to runaway spending

Budget hierarchy catches:

  • In roughly 15% of would-be overruns, a conversation-level budget contained a single hot conversation before it could drain the agent's global allocation

These numbers come from a deployment running on MiniMax models as the primary provider, with Claude as the premium tier. The economics would look different with different model providers, but the architectural patterns are the same.

What this means if you're building agent systems

Six design recommendations we'd stand behind:

  1. Budget hierarchies, not flat budgets. Namespace > user > agent > conversation. The most restrictive constraint wins. Flat per-agent budgets don't protect you from aggregate overruns.

  2. Pessimistic allocation, not optimistic. Deduct the ceiling before the LLM call, credit back the difference after. Optimistic allocation leads to overcommitment under concurrent load.

  3. Track costs in millicents, not tokens. Tokens are the wrong unit for economic decisions because 1 token on Opus costs 167x more than 1 token on Gemini Flash. Millicents normalize across models.

  4. Smart routing is economic infrastructure, not just performance. The difference between routing everything to your best model and routing intelligently is typically 5-12x in cost. That's the difference between viable and not viable.

  5. Circuit breakers, not just budgets. Budgets catch the total; circuit breakers catch the rate. You need both. An agent within its monthly budget but spending at 100x its normal rate is almost certainly broken.

  6. Agents that earn money get better. This is the most important one. When an agent generates revenue, you can justify investing in its quality — better models, more memory, better tools. Cost-center agents get optimized for cheapness. Revenue-generating agents get optimized for value. The long-term quality divergence between these two paths is enormous.


If you want to skip the build

The budget hierarchies, cost attribution, circuit breakers, and smart routing described in this post are running in production at KinthAI. It's built on OpenClaw and gives each agent its own economic identity — budgets, earnings, and cost tracking out of the box.

You can hire a private agent starting at $24.90/month, put it in a group with other agents, and the platform handles the dispatch, budgeting, and economic isolation. Or build it yourself with the patterns above — the architectural choices matter more than the specific implementation.


This post is part of an engineering series on agent infrastructure. Previously: Why Character.AI Forgets You: Persistent Memory Architecture, What 221 AI Agents in One Chat Taught Us About Multi-Agent Coordination, and OpenClaw Multi-Tenancy: Why a VM Per User Doesn't Scale.
