Give Your Agent a Budget: Cost Ceilings That Stop Runaway Loops

#ai #agents #go #reliability

Book: AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

A developer building on top of an LLM API woke up to a bill that ran into the tens of thousands of dollars. Versions of this story circulate regularly on Hacker News and in dev write-ups, and the shape is always the same. An autonomous loop, no hard cap, a tool or a prompt that kept the model asking for one more turn, and an API that happily charged for every one of those turns until someone noticed.

You do not have to believe any one of those numbers to take the lesson. An agent loop is a while loop that spends money on each pass. A while loop with no exit condition is a bug in any language. Wire that bug to a credit card and the only question is how fast the meter runs before a human sees it.

The fix is not clever. It is a ceiling the loop checks before every model call, and a switch you can throw without a deploy. Most teams add both after the first scary invoice. Here is what they look like when you build them before that invoice arrives.

A loop spends on every turn

An agent turn costs input tokens for the whole conversation so far, plus output tokens for the model's reply. The conversation grows on every turn, so each pass is more expensive than the last. A ten-turn run is not ten times the cost of one turn. Total input cost is closer to the sum of an arithmetic series, which grows with the square of the turn count, and the tail turns are the heavy ones.
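To put rough numbers on that, here is a small sketch. It assumes a fixed base prompt and a fixed number of tokens appended per turn, which real conversations will not match exactly, and it ignores output tokens and any prompt-caching discounts. The function name and the figures are illustrative.

```go
package main

// cumulativeInputTokens returns the total input tokens billed across n turns.
// Assumption (hypothetical model): every turn resends a base prompt of `base`
// tokens plus everything appended so far, and each turn appends `perTurn`
// tokens to the conversation.
func cumulativeInputTokens(base, perTurn, n int) int {
	total := 0
	for turn := 0; turn < n; turn++ {
		// Turn 0 sends only the base prompt; turn k sends base + k*perTurn.
		total += base + perTurn*turn
	}
	return total
}
```

With a 2,000-token base and 1,500 tokens added per turn, ten turns bill 87,500 input tokens, about 44 times the 2,000 of a single turn. The closed form is n*base + perTurn*n*(n-1)/2.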

That is the trap behind the runaway. The first few turns look cheap in your traces. By the time the cost per turn looks alarming, you have already paid for the turns that got you there. So the ceiling has to live in front of the call, not behind it.

Price the run, then check before you spend

Start with a budget that tracks dollars, not just tokens. Tokens are the honest unit, but dollars are what finance asks about, and the conversion is a multiply.

package agent

import "time"

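// Pricing holds per-1M-token prices in USD. Fill from your provider.
type Pricing struct {
    InputPerM  float64
    OutputPerM float64
}

// Cost converts the token counts for one call into dollars.
func (p Pricing) Cost(in, out int) float64 {
    return float64(in)/1e6*p.InputPerM +
        float64(out)/1e6*p.OutputPerM
}

// Budget sets the three ceilings for one run. A zero value in any
// ceiling stops the run before its first call, so set all three.
type Budget struct {
    MaxSteps   int     // model calls per run
    MaxUSD     float64 // dollars per run
    MaxSeconds float64 // wall-clock seconds per run
    Pricing    Pricing
}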

The three axes do different jobs. MaxSteps stops the loop that keeps calling tools forever. MaxUSD stops the loop where each step is cheap but the growing context quietly inflates the bill. MaxSeconds protects the user-facing request timeout and catches slow tools that burn wall-time without burning tokens. Because the check runs between turns, it only sees a slow tool once that tool returns; a call that never returns needs its own deadline. None of the three covers the other two.

Now the meter. It accumulates spend and answers one question: can I afford the next call.

type Meter struct {
    start  time.Time
    steps  int
    usd    float64
    budget Budget
}

func NewMeter(b Budget) *Meter {
    return &Meter{start: time.Now(), budget: b}
}

// CanContinue runs BEFORE each model call.
func (m *Meter) CanContinue() (bool, string) {
    if m.steps >= m.budget.MaxSteps {
        return false, "step_ceiling"
    }
    if m.usd >= m.budget.MaxUSD {
        return false, "cost_ceiling"
    }
    if time.Since(m.start).Seconds() >=
        m.budget.MaxSeconds {
        return false, "time_ceiling"
    }
    return true, ""
}

// Record runs AFTER each model call.
func (m *Meter) Record(in, out int) {
    m.steps++
    m.usd += m.budget.Pricing.Cost(in, out)
}

CanContinue returns the reason it stopped, not a bare false. The caller needs that string. A run that stops because the model finished is a success you show the user. A run that stops on cost_ceiling is a partial answer you flag, log, and maybe offer to resume. Same loop, different outcome, and the only way to tell them apart is the reason.
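As a sketch of how a caller might act on that reason, the helper below sorts the stop strings into three outcomes. The Outcome type and the classify function are illustrative names, and the three-way split is one reasonable choice rather than the only one.

```go
package main

// Outcome is a hypothetical three-way split a caller might act on.
type Outcome string

const (
	Success Outcome = "success" // the model finished on its own
	Partial Outcome = "partial" // a ceiling cut the run short
	Failed  Outcome = "failed"  // anything else, such as an error
)

// classify maps a stop-reason string to an Outcome. "done" is assumed to be
// the reason the loop reports when the model stops asking for tools.
func classify(reason string) Outcome {
	switch reason {
	case "done":
		return Success
	case "step_ceiling", "cost_ceiling", "time_ceiling":
		return Partial
	default:
		return Failed
	}
}
```

A Partial result is the one worth flagging in the UI and offering to resume; Success goes straight to the user.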

The loop that respects the ceiling

The loop is the boring part once the meter exists. Check, call, record, repeat.

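// ToolCall and Message are minimal placeholder shapes. Map them to
// the types your provider SDK uses.
type ToolCall struct {
    ID   string
    Name string
    Args string
}

type Message struct {
    Role       string
    Content    string
    ToolCalls  []ToolCall // set on assistant messages that request tools
    ToolCallID string     // set on tool messages, echoes ToolCall.ID
}

type Response struct {
    Content      string
    ToolCalls    []ToolCall
    InputTokens  int
    OutputTokens int
}

type ModelFunc func([]Message) (Response, error)
type ToolFunc func(ToolCall) string

// Run loops until the model stops requesting tools or a ceiling trips.
// It returns the transcript, a stop reason, and any model error.
func Run(
    model ModelFunc,
    tool ToolFunc,
    msgs []Message,
    b Budget,
) ([]Message, string, error) {
    m := NewMeter(b)

    for {
        // Refuse the turn before paying for it.
        ok, reason := m.CanContinue()
        if !ok {
            return msgs, reason, nil
        }

        resp, err := model(msgs)
        if err != nil {
            return msgs, "model_error", err
        }
        m.Record(resp.InputTokens, resp.OutputTokens)

        // Keep the tool calls on the assistant message so the next
        // turn can see what was requested.
        msgs = append(msgs, Message{
            Role:      "assistant",
            Content:   resp.Content,
            ToolCalls: resp.ToolCalls,
        })

        // No tool calls means the model is finished.
        if len(resp.ToolCalls) == 0 {
            return msgs, "done", nil
        }

        for _, call := range resp.ToolCalls {
            out := tool(call)
            msgs = append(msgs, Message{
                Role:       "tool",
                Content:    out,
                ToolCallID: call.ID,
            })
        }
    }
}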

The check sits at the top of the loop body, before model(msgs). Move it after the call and you have already paid for the turn that put you over the line. The whole point of a ceiling is to refuse the call you cannot afford, so it has to run first.

This loop caps cost reactively: it stops once spend has crossed the line, so a run can overshoot by up to one turn. When a single turn is small next to the ceiling, that is enough. If a single turn can be large enough to blow well past the line on its own, forecast the next call too. Estimate the input tokens for the conversation you are about to send, add the most output you will allow, price both, and add that to the running total before deciding. Refuse the call if the projection clears the ceiling, instead of discovering it after the fact.
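A minimal sketch of that projection follows. The four-characters-per-token estimate is a rough rule of thumb for English text and an assumption here; swap in your provider's token counter for real numbers. estimateTokens, canAfford, and maxOut (the output cap you pass to the model) are illustrative names.

```go
package main

// Pricing mirrors the post's struct: per-1M-token prices in USD.
type Pricing struct {
	InputPerM  float64
	OutputPerM float64
}

// Cost converts token counts for one call into dollars.
func (p Pricing) Cost(in, out int) float64 {
	return float64(in)/1e6*p.InputPerM + float64(out)/1e6*p.OutputPerM
}

// estimateTokens is a crude heuristic: about 4 characters per token,
// rounded up. It is an assumption, not a tokenizer.
func estimateTokens(texts []string) int {
	chars := 0
	for _, t := range texts {
		chars += len(t)
	}
	return (chars + 3) / 4
}

// canAfford reports whether spend so far plus the projected cost of the
// next call stays within maxUSD. Output is priced at maxOut tokens, the
// most the call could produce, so the projection errs on the high side.
func canAfford(spentUSD, maxUSD float64, p Pricing, texts []string, maxOut int) bool {
	projected := spentUSD + p.Cost(estimateTokens(texts), maxOut)
	return projected <= maxUSD
}
```

Called just before model(msgs), this refuses the turn when the running total plus a worst-case next call would exceed the ceiling. Pricing output at maxOut is deliberately pessimistic; a typical-case estimate is a looser alternative.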

The kill switch lives outside the loop

Per-run ceilings stop one session. They do nothing about the failure that actually hurts: a bad deploy where every session behaves, each one safely under its cap, but ten thousand of them in an hour add up to a number your business cannot absorb. No per-run budget sees that. It is an aggregate problem, and the control for it is a switch you can throw from outside the process without shipping code.

A flag in a config store works. So does a row in a table or a key in Redis. The loop reads it on every pass and refuses to start a turn when it is set.

type KillSwitch interface {
    Tripped() (bool, string)
}

// Checked at the top of the loop, alongside the meter.
func guard(m *Meter, ks KillSwitch) (bool, string) {
    if tripped, why := ks.Tripped(); tripped {
        return false, "kill_switch:" + why
    }
    return m.CanContinue()
}

Two properties make a kill switch worth having. It takes effect without a deploy, because during an incident the deploy pipeline is the slowest tool you own. And it fails safe: if the config store is unreachable, decide up front whether that means keep running or stop. For a loop that spends money on every pass, unreachable should usually mean stop. A few minutes of refused agent traffic is cheaper than an open meter you cannot see.
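Here is one way the fail-safe choice could look in code, sketched under the assumption that your config store can be wrapped in a single read function. FlagReader, FailClosedSwitch, and ErrStoreUnreachable are hypothetical names; only the KillSwitch interface comes from the code above.

```go
package main

import "errors"

// KillSwitch is the interface from the post.
type KillSwitch interface {
	Tripped() (bool, string)
}

// ErrStoreUnreachable is a hypothetical error a reader can return when
// the config store cannot be contacted.
var ErrStoreUnreachable = errors.New("config store unreachable")

// FlagReader is a hypothetical lookup against your config store, table,
// or cache. It returns the operator's reason and whether the flag is set.
type FlagReader func() (reason string, set bool, err error)

// FailClosedSwitch treats any read error as a tripped switch.
type FailClosedSwitch struct {
	Read FlagReader
}

// Compile-time check that FailClosedSwitch satisfies KillSwitch.
var _ KillSwitch = FailClosedSwitch{}

// Tripped reports true when the flag is set or the store cannot be read.
func (s FailClosedSwitch) Tripped() (bool, string) {
	reason, set, err := s.Read()
	if err != nil {
		// Unreachable means stop: refuse the turn rather than spend blind.
		return true, "store_unreachable"
	}
	if set {
		return true, reason
	}
	return false, ""
}
```

Because the loop reads the flag on every pass, the reader should carry a short timeout of its own so a slow store does not stall every session.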

Wire guard in place of the bare m.CanContinue() call and the loop now answers to both the per-run ceiling and the global switch. One human, one flag, every running session stops on the next turn.

What to log on every stop

A ceiling you cannot see is a ceiling you will set wrong. Every stop writes one structured record:

session_id
stop_reason: done, step_ceiling, cost_ceiling, time_ceiling, model_error, kill_switch:<why>
steps, input_tokens, output_tokens, usd, wall_seconds
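One possible shape for that record, sketched as a struct that serializes to a single JSON line. The StopRecord type and JSONLine method are illustrative. Note that the Meter shown earlier tracks only steps and dollars, so the token totals and session ID are assumed to be accumulated alongside it.

```go
package main

import "encoding/json"

// StopRecord is a hypothetical log record written once per run.
// Field names follow the list above.
type StopRecord struct {
	SessionID    string  `json:"session_id"`
	StopReason   string  `json:"stop_reason"`
	Steps        int     `json:"steps"`
	InputTokens  int     `json:"input_tokens"`
	OutputTokens int     `json:"output_tokens"`
	USD          float64 `json:"usd"`
	WallSeconds  float64 `json:"wall_seconds"`
}

// JSONLine renders the record as one line suitable for a log stream.
func (r StopRecord) JSONLine() (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

One line per run keeps the stop-reason mix easy to aggregate with whatever log tooling you already have.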

Watch the mix of stop reasons over time. You want done to dominate. A rising share of cost_ceiling or step_ceiling is the early signal that tasks are getting harder, a prompt regressed, or a tool started misbehaving, and it shows up here before it shows up on the invoice.

Where to set the numbers

Do not guess the ceilings. Pull the last few hundred sessions of a live loop and read the distribution. Take the p99 of cost per session, of steps per session, of wall-time per session, and set each ceiling at roughly 1.5x that p99. That clears your real traffic with headroom and still catches the run that goes three times beyond anything you have ever seen.
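The arithmetic is small enough to sketch. This version uses the nearest-rank definition of a percentile, which is one of several common definitions and an assumption here; percentile and ceilingFrom are illustrative names.

```go
package main

import (
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100).
// It copies the input so the caller's slice is left untouched.
func percentile(values []float64, p float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)

	// Nearest rank: the smallest value with at least p% of samples at or below it.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// ceilingFrom sets a ceiling at headroom times the p99 of observed values.
func ceilingFrom(values []float64, headroom float64) float64 {
	return percentile(values, 99) * headroom
}
```

Run it three times, once each over per-session cost, steps, and wall-time, with a headroom of 1.5, and use the results as the first MaxUSD, MaxSteps, and MaxSeconds.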

Then tighten. The first ceiling exists to stop a catastrophe, not to be precise. Catastrophe first, precision later, both before the loop sees a user.

A budget will not save you from a badly framed task. An agent told to "keep going until the answer is perfect" will hit the step ceiling every run and return a partial, and the ceiling is doing exactly its job. The prompt is the bug. But that is a problem you can read in a dashboard at your own pace, which is the entire difference between a tuning task and a Sunday-morning invoice.

If you want the loop-budget patterns here laid out end to end (bounded iteration, structured stop reasons, cost accounting, and the recovery paths that pair with each), that is most of what the AI Agents Pocket Guide is about. The chapter on autonomous loops is the one this post is a slice of.