DEV Community

Oladapo Anjolaiya

An LLM Has No Idea It's Killing Your Infrastructure. That's Your Problem to Solve.

I was mid-way through an architecture review for one of our internal Slack agents when something stopped me. We were sketching out the agentic loop (model receives a message, decides on an action, calls a tool, repeat) and I asked a question nobody had a quick answer for:

What happens if a user asks it to do something that never terminates?

Not a crash. Not a thrown exception. Just a loop that keeps going, one perfectly rational step at a time, until the cost notification lands in your inbox at 3am.

That question became a rabbit hole. This is what came out of it.


The core problem

An LLM is a function. Input in, output out. It has no memory of what it cost to run the previous step, no awareness of wall-clock time, no concept that the task it was just handed will never reach a termination condition. Every completion is locally sensible. That's the whole point.

The trouble starts when you put that function inside a loop.

Consider the canonical failure case. A user sends your agent: "Count to a billion, one message per number."

The agent doesn't hesitate. Step one: send "1". Step two: send "2". Step three: send "3". Each decision is correct. The model did exactly what it was asked. The system as a whole is catastrophically unbounded, and unless something external stops it, it will keep going indefinitely.

Each individual token is legal. The sequence is ruinous.

This is not theoretical. It is the natural, predictable consequence of three things happening simultaneously:

1. A model that follows instructions literally
2. An agentic loop that issues the next call automatically
3. No external circuit-breaker watching the aggregate

You cannot fix this by hoping the model refuses. Sometimes it will. Sometimes it won't. The refusal is a soft alignment signal. It lives in the output layer, which means it can be overridden by a sufficiently insistent system prompt, a malicious intermediate agent, or prompt injection buried in retrieved content. That's not a guardrail. That's a suggestion.

The guardrail has to live in the infrastructure.


Why "just prompt it better" isn't an answer

This is the most common objection, so I want to dwell on it.

Modern frontier models are genuinely good at pushing back on obviously absurd requests. Ask GPT-4o to count to a billion in a standard chat context and it'll probably tell you that's not a great idea and offer an alternative. But that behaviour is:

  • Not guaranteed. A well-prompted model, a fine-tuned model, or a model operating in strict instruction-following mode may comply without comment.
  • Not auditable. The refusal is text. It lives in the completion. It can be argued with, rephrased around, or injected over.
  • Not composable. In a multi-agent system, your model's refusal is one agent's output; the orchestrating layer may simply retry with a rephrased prompt.

The correct framing: model alignment is a layer of defence, not the defence. It should exist alongside hard infrastructure limits, not instead of them.


The five layers

A robust system enforces limits at every layer of the stack. No single layer is sufficient. Each one catches a class of failure the others miss.

| Layer | What it enforces | What it misses |
| --- | --- | --- |
| 01: Prompt budgeting | Declares limits in natural language; gives the model context to self-regulate | Bypassable via prompt injection |
| 02: Orchestrator guardrails | Hard step / token / time limits tracked in code, independent of model output | Per-task only; doesn't cover concurrent task aggregates |
| 03: Tool call rate limiting | Caps tool invocations per task in a sliding window | Doesn't catch pure completion loops with no tool calls |
| 04: Cost & token metering | Real-time observability; alerts before the ceiling is reached | Reactive, not preventive |
| 05: Semantic pre-flight | Detects inherently unbounded tasks before any tokens are spent | Pattern-matching; can be fooled by rephrasing |

No layer fully covers what another misses. That's the point. Assume every layer will eventually be defeated and design the system to degrade gracefully when it is.


Building it

All samples are TypeScript; the patterns are identical in Python or any other orchestration layer. LangGraph, AutoGen, CrewAI, and LangChain's agent executor all expose a hook where you can insert this logic.

Layer 01: System prompt budget declaration

The cheapest layer to implement and the first the model sees. Declare budget constraints in the system prompt so the model has the context to self-regulate and the language to push back on users when a task is obviously out of scope.

// BudgetConfig is defined in full in Layer 02 — shown here for context
interface BudgetConfig {
  maxSteps: number;
  maxTokens: number;
  maxWallSeconds: number;
  maxCostUsd: number;
}

function buildSystemPrompt(config: BudgetConfig): string {
  return `
You are a helpful assistant operating under strict budget constraints.

HARD LIMITS (infrastructure-enforced, non-overridable):
- Maximum steps per task: ${config.maxSteps}
- Maximum tokens per session: ${config.maxTokens.toLocaleString()}
- Maximum wall time: ${config.maxWallSeconds}s
- Maximum cost: $${config.maxCostUsd.toFixed(2)}

BEHAVIOUR REQUIREMENTS:
1. Before beginning any multi-step task, estimate the number of steps
   required. If your estimate exceeds ${Math.floor(config.maxSteps / 2)},
   ask the user to confirm or scope down before proceeding.
2. If a user requests a task that is inherently unbounded (e.g. "count
   to a billion", "send a message for every item in an infinite list"),
   decline and explain why. Offer a bounded alternative.
3. Never begin a loop without knowing (or explicitly bounding) its
   termination condition.
`.trim();
}

This is bypassable via prompt injection, which is exactly why it's layer 01 and not the only layer.

Layer 02: The BudgetedExecutor

The most important piece. The orchestrator loop that calls the model needs to track state the model cannot see. Fields are protected so subclasses (layer 04) can instrument them without breaking encapsulation.

interface BudgetConfig {
  maxSteps: number;
  maxTokens: number;
  maxWallSeconds: number;
  maxCostUsd: number;
}

interface ModelResponse {
  content: string;
  usage: { totalTokens: number; costUsd: number };
}

interface ExecutorSnapshot {
  steps: number;
  tokens: number;
  cost: number;
  elapsedMs: number;
}

// Typed errors so the caller can tell which ceiling was breached.
class StepLimitExceeded extends Error {
  constructor(steps: number) { super(`step limit reached at step ${steps}`); }
}
class TokenBudgetExceeded extends Error {
  constructor(tokens: number) { super(`token budget exhausted at ${tokens} tokens`); }
}
class WallClockExceeded extends Error {
  constructor(seconds: number) { super(`wall-clock ceiling hit at ${seconds.toFixed(1)}s`); }
}
class CostCeilingExceeded extends Error {
  constructor(cost: number) { super(`cost ceiling hit at $${cost.toFixed(2)}`); }
}

// Your model client; assumed to report token usage and cost per call.
declare function llmApi(prompt: string): Promise<ModelResponse>;

class BudgetedExecutor {
  protected steps = 0;
  protected tokens = 0;
  protected cost = 0;
  private readonly startTime = Date.now();

  constructor(protected readonly config: BudgetConfig) {}

  protected snapshot(): ExecutorSnapshot {
    return {
      steps: this.steps,
      tokens: this.tokens,
      cost: this.cost,
      elapsedMs: Date.now() - this.startTime,
    };
  }

  private check(): void {
    const elapsed = (Date.now() - this.startTime) / 1000;

    if (this.steps >= this.config.maxSteps)
      throw new StepLimitExceeded(this.steps);

    if (this.tokens >= this.config.maxTokens)
      throw new TokenBudgetExceeded(this.tokens);

    if (elapsed >= this.config.maxWallSeconds)
      throw new WallClockExceeded(elapsed);

    if (this.cost >= this.config.maxCostUsd)
      throw new CostCeilingExceeded(this.cost);
  }

  async call(prompt: string): Promise<string> {
    this.check(); // enforce BEFORE issuing

    const response: ModelResponse = await llmApi(prompt);

    this.steps += 1;
    this.tokens += response.usage.totalTokens;
    this.cost += response.usage.costUsd;

    this.check(); // enforce AFTER receiving

    return response.content;
  }
}

There are two check() calls: one before issuing, one after receiving. The before-check stops you from starting a call you can already predict will breach the budget. The after-check catches a single unexpectedly expensive completion that pushed you over the edge mid-step.

One production concern worth noting: maxWallSeconds tracks total task time, but if an individual llmApi() call hangs (provider slowness, network drop), the executor hangs with it. In practice, wrap each llmApi() call with a per-call timeout using Promise.race so a single stalled request doesn't hold the task open past your wall-clock ceiling.
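A minimal sketch of that wrapper, assuming nothing beyond the llmApi call shape already shown (the 30-second value in the usage comment is illustrative, not a recommendation):

```typescript
// Settle with the underlying promise, or reject once `ms` elapses.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`LLM call timed out after ${ms}ms`)),
      ms,
    );
  });
  // Clear the timer on either outcome so the process can exit cleanly.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Inside BudgetedExecutor.call, the bare invocation would become:
// const response = await withTimeout(llmApi(prompt), 30_000);
```

Note that Promise.race does not cancel the underlying request; if your provider SDK supports an AbortSignal, passing one through is the tighter fix.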

Layer 03: Sliding window rate limiter

For systems where the model drives tool calls directly, the tool registry needs its own rate limiter. A per-session counter is not sufficient. You need a sliding window so a burst of calls in a short period triggers the limit even if the session total looks fine.

// Thrown when the shared tool budget is exhausted.
class RateLimitExceeded extends Error {}

class SlidingWindowLimiter {
  private readonly timestamps: number[] = [];

  constructor(
    private readonly maxCalls: number,
    private readonly windowMs: number,
  ) {}

  acquire(): boolean {
    const now = Date.now();
    const cutoff = now - this.windowMs;

    while (this.timestamps.length > 0 && this.timestamps[0] < cutoff) {
      this.timestamps.shift();
    }

    if (this.timestamps.length >= this.maxCalls) {
      return false; // hard deny
    }

    this.timestamps.push(now);
    return true;
  }
}

function rateLimitedTool<T extends unknown[], R>(
  fn: (...args: T) => R,
  limiter: SlidingWindowLimiter,
): (...args: T) => R {
  return (...args: T): R => {
    if (!limiter.acquire()) {
      throw new RateLimitExceeded('Tool call rate limit reached');
    }
    return fn(...args);
  };
}

// One shared limiter across all tools, intentional.
// Per-tool limits would let a loop call 20 different tools 20 times each.
// A shared budget caps the aggregate.
const toolLimiter = new SlidingWindowLimiter(20, 60_000); // 20 tool calls per minute, across all tools
const searchWeb = rateLimitedTool(rawSearchWeb, toolLimiter);
const runCode = rateLimitedTool(rawRunCode, toolLimiter);

Layer 04: Cost & token metering

The hard stop lives in BudgetedExecutor. Layer 04 is about making the counters observable in real time, before the ceiling is reached. InstrumentedExecutor extends the base class and emits a metric after each step using the protected snapshot() method defined in layer 02.

interface UsageMetric {
  taskId: string;
  step: number;
  tokensThisStep: number;
  tokensTotal: number;
  costThisStepUsd: number;
  costTotalUsd: number;
  elapsedMs: number;
}

class InstrumentedExecutor extends BudgetedExecutor {
  constructor(
    config: BudgetConfig,
    private readonly onMetric: (metric: UsageMetric) => void,
    private readonly taskId: string,
  ) {
    super(config);
  }

  async call(prompt: string): Promise<string> {
    const before = this.snapshot();
    try {
      return await super.call(prompt);
    } finally {
      // Emit the metric on success AND on limit breach. The step that caused
      // the ceiling to be hit is exactly the one you want to see in monitoring.
      const after = this.snapshot();
      this.onMetric({
        taskId: this.taskId,
        step: after.steps,
        tokensThisStep: after.tokens - before.tokens,
        tokensTotal: after.tokens,
        costThisStepUsd: after.cost - before.cost,
        costTotalUsd: after.cost,
        elapsedMs: after.elapsedMs,
      });
    }
  }
}

Wire onMetric to your observability pipeline (Datadog, Prometheus, whatever you ship) and set alerts at 80% of each ceiling, not 100%. You want the alert before the hard stop, not at the same time.

Layer 05: Semantic pre-flight

This fires before the executor loop starts. It catches obviously pathological requests at near-zero cost: no tokens spent, no API call made. The step ceiling here should match your executor's maxSteps so both layers agree on what "too large" means.

const UNBOUNDED_PATTERNS: RegExp[] = [
  /\bcount\s+to\s+([\d,]+)\b/i,
  /\brepeat\s+([\d,]+)\s+times\b/i,
  /\bfor\s+each\s+of\s+([\d,]+)\b/i,
  /\bone\s+message\s+per\b/i,
];

const STEP_COST_ESTIMATE_USD = 0.002;

interface PlausibilityResult {
  ok: boolean;
  reason?: string;
  estimatedCostUsd?: number;
}

function plausibilityCheck(task: string, maxSteps: number): PlausibilityResult {
  for (const pattern of UNBOUNDED_PATTERNS) {
    const match = task.match(pattern);
    if (match) {
      const raw = match[1]?.replace(/,/g, '') ?? '';
      const steps = raw && /^\d+$/.test(raw) ? parseInt(raw, 10) : Infinity;
      const estimatedCostUsd = isFinite(steps)
        ? steps * STEP_COST_ESTIMATE_USD
        : Infinity;

      return {
        ok: steps <= maxSteps,
        reason: `Task requires ~${isFinite(steps) ? steps.toLocaleString() : 'unbounded'} steps (limit ${maxSteps})`,
        estimatedCostUsd,
      };
    }
  }

  return { ok: true };
}

A regex heuristic is fast but brittle. For systems handling untrusted user input, back it with a lightweight LLM-based intent classifier: a much cheaper model running a single binary classification. The plausibility check is the first filter, not the last.
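A sketch of that fallback classifier. The model call is injected as a plain function so it works with any provider; the prompt wording is illustrative, and real inputs should be tested against your chosen cheap model:

```typescript
type ModelCall = (prompt: string) => Promise<string>;

// Binary classification: does the task look unbounded? Runs only after the
// regex check passes, so most requests never pay for it.
async function isLikelyUnbounded(
  task: string,
  model: ModelCall,
): Promise<boolean> {
  const verdict = await model(
    'Answer with exactly one word, YES or NO. ' +
    'Does the following task lack a bounded termination condition, ' +
    `or require more than a few hundred steps?\n\nTask: ${task}`,
  );
  // Be lenient about casing and trailing punctuation in the reply.
  return verdict.trim().toUpperCase().startsWith('YES');
}
```

Treat a YES the same way as a failed plausibilityCheck: refuse with an explanation, before the executor loop ever starts.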

Wiring it all together

const BUDGET: BudgetConfig = {
  maxSteps: 50,
  maxTokens: 80_000,
  maxWallSeconds: 60,
  maxCostUsd: 0.25,
};

async function runTask(taskId: string, task: string): Promise<void> {
  // Layer 05: semantic pre-flight, zero tokens spent
  // Pass maxSteps from the shared config so both layers agree on the ceiling
  const check = plausibilityCheck(task, BUDGET.maxSteps);

  if (!check.ok) {
    const costStr = isFinite(check.estimatedCostUsd ?? Infinity)
      ? `$${(check.estimatedCostUsd ?? 0).toLocaleString()}`
      : 'unbounded';

    await respondToUser(
      `I can't run this task as described: ${check.reason}. ` +
      `Estimated cost would be ${costStr}. ` +
      `Please scope down the request.`,
    );
    return;
  }

  // Layers 02 to 04: bounded, instrumented executor
  // Layer 01 (system prompt) is set when the model client is configured, not here
  const executor = new InstrumentedExecutor(
    BUDGET,
    (metric) => metrics.emit('agent.usage', metric),
    taskId,
  );

  try {
    const result = await executor.call(task);
    await respondToUser(result);
  } catch (err) {
    if (
      err instanceof StepLimitExceeded ||
      err instanceof TokenBudgetExceeded ||
      err instanceof WallClockExceeded ||
      err instanceof CostCeilingExceeded
    ) {
      await respondToUser(`Task halted: ${err.message}.`);
    } else {
      throw err;
    }
  }
}

The billion-message scenario gets caught at layer 05. plausibilityCheck matches "count to 1,000,000,000", returns ok: false with an estimated cost of $2,000,000, and the executor is never invoked. Even if the pre-flight were bypassed, maxSteps: 50 terminates the loop after 50 iterations. The layers are redundant by design.

One gap worth flagging explicitly: these limits are per-task. A user who can't run one billion-step task can still launch 1,000 concurrent fifty-step tasks. Global account-level rate limiting at the API gateway layer is a separate, necessary control that sits outside this model.


Reference: limit configuration by use case

| Use case | max_steps | max_tokens | max_cost_usd | wall_time |
| --- | --- | --- | --- | --- |
| Chat assistant | 10 | 20,000 | $0.05 | 30s |
| Code generation agent | 30 | 60,000 | $0.20 | 90s |
| Research / multi-tool agent | 50 | 100,000 | $0.50 | 120s |
| Batch pipeline (supervised) | 500 | 1,000,000 | $5.00 | 600s |
| Untrusted user input | 5 | 10,000 | $0.02 | 15s |

The operating principle: start with the smallest limits that allow your legitimate use cases to run. Widen them deliberately when usage data justifies it. Never start permissive and attempt to tighten later. The window for damage is open the moment the system goes live.
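The reference table translates directly into configuration. A sketch of the same values as a preset map, using the BudgetConfig shape from layer 02 (the preset key names are mine, not from the article):

```typescript
interface BudgetConfig {
  maxSteps: number;
  maxTokens: number;
  maxWallSeconds: number;
  maxCostUsd: number;
}

// One preset per row of the reference table above.
const BUDGET_PRESETS: Record<string, BudgetConfig> = {
  chatAssistant:  { maxSteps: 10,  maxTokens: 20_000,    maxWallSeconds: 30,  maxCostUsd: 0.05 },
  codeGeneration: { maxSteps: 30,  maxTokens: 60_000,    maxWallSeconds: 90,  maxCostUsd: 0.2  },
  researchAgent:  { maxSteps: 50,  maxTokens: 100_000,   maxWallSeconds: 120, maxCostUsd: 0.5  },
  batchPipeline:  { maxSteps: 500, maxTokens: 1_000_000, maxWallSeconds: 600, maxCostUsd: 5    },
  untrustedInput: { maxSteps: 5,   maxTokens: 10_000,    maxWallSeconds: 15,  maxCostUsd: 0.02 },
};
```

Keeping the presets in one place makes the "widen deliberately" step an auditable diff rather than a scattered constant hunt.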


Failure modes worth knowing

| Failure mode | Root cause | Mitigation |
| --- | --- | --- |
| Prompt injection resets limits | Retrieved content contains forged system instructions | Layer 02: executor limits live in code, not in prompt content |
| Model splits task into sub-agents | Each sub-agent starts a fresh executor with its own counters | Propagate a shared budget context to all child executors |
| Loop disguised as recursion | Model calls itself as a tool | Detect and block recursive self-calls in the tool registry |
| User rephrases unbounded task | Plausibility check misses disguised phrasing | Layer 05: expand pattern library; add LLM-based intent classifier |
| Concurrent task abuse | Many small tasks each within budget, large aggregate cost | Global account-level rate limiting at the API gateway |
| Legitimate task hits limit prematurely | Limits set too conservatively | Log all limit hits; tune thresholds from real usage data |

That last row matters. An over-aggressive limit is also a failure. It's just a quieter one. Log every limit hit with the task description, the counter that triggered, and the budget config at the time. That log is how you tune the system without flying blind.
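For the sub-agent failure mode specifically, one way to propagate a shared budget is to have parent and child executors debit the same mutable counters. A sketch, with names that are illustrative rather than taken from the implementation above:

```typescript
// Aggregate counters shared by a parent executor and every child it spawns.
class SharedBudget {
  steps = 0;
  tokens = 0;
  cost = 0;

  constructor(
    readonly maxSteps: number,
    readonly maxTokens: number,
    readonly maxCostUsd: number,
  ) {}

  // Record one step against the aggregate; returns false once any
  // family-wide ceiling is breached, regardless of which child spent it.
  charge(tokens: number, costUsd: number): boolean {
    this.steps += 1;
    this.tokens += tokens;
    this.cost += costUsd;
    return (
      this.steps <= this.maxSteps &&
      this.tokens <= this.maxTokens &&
      this.cost <= this.maxCostUsd
    );
  }
}

// Spawning a sub-agent passes the SAME SharedBudget instance down, so a
// model that splits work across children cannot reset the counters.
```

The key property: spawning a child is free, but stepping is not, so fan-out buys the model nothing.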


The principle, stated plainly

An LLM has no self-awareness of resource consumption. The system that wraps it must.

Defence-in-depth across prompt declarations, orchestrator counters, tool rate limiters, cost meters, and semantic pre-flight checks is not redundant. Each layer catches what the others miss. The implementation in this article is a few hundred lines of infrastructure code. The failure mode it prevents is not a few-hundred-line problem.

The billion-message loop is not an edge case. It's the default behaviour of a system with no guardrails. The default behaviour of a system with guardrails is a clean error message and a $0.00 line on the invoice.

I'll take that trade every time.


If this is useful, share it with whoever on your team is designing the agentic loop. The conversation is easier to have before the first deployment than after.
