You're paying for every token your agent burns. And according to new research from Northwestern, Stanford, Cornell, and All Hands AI, a large share of that spend goes directly to waste — on trajectories the agent was never going to complete successfully.
The paper is BAGEN: Are LLM Agents Budget-Aware? (arXiv:2606.00198, submitted May 29, 2026). Its core question is simple: can frontier LLM agents predict when they're about to run out of runway? The answer, across five frontier models and four environments, is a firm no.
This article covers what BAGEN found and how budget-aware interval estimation works, then walks through an Effloow Lab PoC that reproduces the key dynamics using only the Python standard library: no API keys, no GPU.
Why This Matters for Production Agent Systems
Token budgets are a real constraint in every deployed agent system. You set a token cap (a per-call max_tokens, a per-run ceiling, or both), you watch the cost dashboard, and you assume the agent will either finish the task or hit the hard wall. What BAGEN documents is a third case that developers rarely account for: the agent keeps consuming tokens on a task it cannot complete, all the way to the limit.
The mechanism is predictable. Unsolvable tasks tend to produce backtracking behavior — more tool calls per step, increasing per-step token costs, no convergence signal. A budget-aware agent should detect that pattern early and stop (or alert a human). Frontier models, as BAGEN shows, don't.
The practical consequence:
- Failed tasks consume the same (or more) tokens as successful ones
- You pay for the full trajectory even when step 2 already signals doom
- At scale, this becomes a significant line item in your inference budget
BAGEN's headline number: early stopping on failed trajectories saves 28–64% of tokens versus running to completion. That's not a micro-optimization. That's a structural cost reduction available to any developer who builds the right wrapper.
What Budget-Awareness Actually Means
BAGEN distinguishes two budget types that agents encounter in practice:
Internal budgets come from agent computation itself — how many tokens the agent is burning. The environments used here include:
- Sokoban: A puzzle-solving benchmark where agents must push boxes onto targets. Token consumption per step signals whether the agent is making progress or backtracking.
- Search-R1: A search-and-retrieval environment where steps involve tool calls against a retrieval index.
- SWE-bench: Software engineering tasks requiring multi-file code modifications and test verification.
External budgets come from the downstream effects of agent actions:
- Supply-chain environment (curated from real enterprise data): agents manage cost, time, and warehouse occupancy simultaneously. Those three quantities are the external budgets (overspending on warehouse space is one way to blow them); token consumption remains the internal one.
This two-axis framing is useful because many developers think only about token cost (internal) and ignore external resource consumption (money spent via tool calls, API charges from agent actions, storage consumed). BAGEN shows agents are over-optimistic on both dimensions.
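To make the two-axis framing concrete, here is a minimal sketch of a ledger that tracks the internal budget (tokens) next to any number of external ones. The class name `BudgetLedger`, the budget names `usd` and `warehouse_m2`, and all numbers are illustrative assumptions, not part of BAGEN.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetLedger:
    """Tracks one internal budget (tokens) and any number of external ones."""
    token_budget: int
    external_budgets: dict[str, float]  # hypothetical names, e.g. {"usd": 50.0}
    tokens_used: int = 0
    external_used: dict[str, float] = field(default_factory=dict)

    def charge(self, tokens: int = 0, **external: float) -> None:
        """Record the cost of one agent step on both axes."""
        self.tokens_used += tokens
        for name, amount in external.items():
            if name not in self.external_budgets:
                raise KeyError(f"unknown external budget: {name}")
            self.external_used[name] = self.external_used.get(name, 0.0) + amount

    def overruns(self) -> list[str]:
        """Names of every budget that has been exceeded so far."""
        over = ["tokens"] if self.tokens_used > self.token_budget else []
        over += [name for name, cap in self.external_budgets.items()
                 if self.external_used.get(name, 0.0) > cap]
        return over


ledger = BudgetLedger(token_budget=1500,
                      external_budgets={"usd": 50.0, "warehouse_m2": 200.0})
ledger.charge(tokens=400, usd=20.0)
ledger.charge(tokens=500, usd=35.0, warehouse_m2=120.0)
print(ledger.overruns())  # ['usd']
```

In this toy run the agent is comfortably inside its token budget (900 of 1,500) while already over its dollar budget, which is exactly the case a token-only dashboard would miss.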
The Three Sub-Capabilities BAGEN Measures
Budget-awareness in BAGEN decomposes into three measurable abilities:
1. Feasibility Prediction
Can the agent correctly estimate whether a task is solvable before starting it? At step zero, before any action is taken, can the model predict whether the trajectory will succeed or fail within budget?
Current frontier models perform poorly here. They tend to rate most tasks as feasible, regardless of how complex or resource-intensive the prompt suggests the task will be.
2. Early Failure Detection
As the agent proceeds through a task, can it detect failure signals and trigger an alert or stop? This is where the 28–64% savings come from. An agent that detects a doomed trajectory at step 3 of a 15-step rollout recovers most of that budget.
The evaluation methodology BAGEN uses is a rollout-replay protocol: the paper first collects unconstrained rollouts (agents run to completion with no budget pressure), then re-queries each agent on every prefix of that rollout, asking for a budget estimate and feasibility prediction at each intermediate step. This separates the estimation capability from the actual task performance.
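The replay idea can be sketched in a few lines. This is an illustration of the protocol as described above, not the paper's harness: the estimator signature, the toy `fixed_horizon_estimator`, the six-step horizon, and the rollout numbers are all assumptions, and a real run would query an LLM where the toy function sits.

```python
from typing import Callable

# A recorded rollout here is just the per-step token costs of one unconstrained run.
Rollout = list[int]
# Hypothetical estimator signature: prefix of step costs -> (lower, upper) total tokens.
Estimator = Callable[[list[int]], tuple[float, float]]

HORIZON = 6  # assumed task length used by the toy estimator below


def replay_prefixes(rollout: Rollout, estimator: Estimator) -> list[dict]:
    """Query the estimator on every prefix and compare against the true total."""
    true_total = sum(rollout)
    records = []
    for k in range(len(rollout) + 1):  # k = 0 is the step-zero query
        lower, upper = estimator(rollout[:k])
        records.append({
            "step": k,
            "lower": lower,
            "upper": upper,
            "covered": lower <= true_total <= upper,
        })
    return records


def fixed_horizon_estimator(prefix: list[int]) -> tuple[float, float]:
    """Toy stand-in for a model: HORIZON steps at the average cost seen so far."""
    if not prefix:
        return (0.0, 1000.0)  # arbitrary prior before any step has run
    projected = HORIZON * sum(prefix) / len(prefix)
    return (0.8 * projected, 1.2 * projected)


records = replay_prefixes([100, 120, 110, 130, 140, 100], fixed_horizon_estimator)
for r in records:
    print(r["step"], round(r["lower"]), round(r["upper"]), r["covered"])
```

Because the rollout is recorded once and only the estimator is re-queried, estimation quality can be scored at every step without the estimate influencing what the agent actually did.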
3. Interval Calibration
Rather than asking for a point estimate ("I need X more tokens"), BAGEN asks for an interval: a lower and upper bound. An agent that says "I'll need between 800 and 1,200 tokens to finish" is far more useful than one that guesses a single number.
Interval coverage (the fraction of cases where the true token count falls within the predicted interval) caps at 47% after SFT+RL fine-tuning on the best-performing setup. That's low. It means that more than half the time (at least 53% of cases), the predicted interval misses the actual consumption. The interval estimation problem is genuinely hard, even for fine-tuned models.
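Coverage itself is a one-line metric. The sketch below computes it, along with mean interval width, since coverage alone can be inflated by predicting very wide intervals. The function names and the four example cases are made up for illustration and are not BAGEN data.

```python
def interval_coverage(predictions, actuals):
    """Fraction of cases where lower <= actual <= upper."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    if not actuals:
        raise ValueError("need at least one case")
    hits = sum(lo <= actual <= hi for (lo, hi), actual in zip(predictions, actuals))
    return hits / len(actuals)


def mean_width(predictions):
    """Average interval width; report it next to coverage to spot trivially wide intervals."""
    return sum(hi - lo for lo, hi in predictions) / len(predictions)


# Hypothetical (lower, upper) predictions and true token counts.
predicted = [(800, 1200), (500, 700), (1000, 1500), (300, 400)]
actual = [1100, 950, 1400, 650]

print(interval_coverage(predicted, actual))  # 0.5
print(mean_width(predicted))                 # 300.0
```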
The Over-Optimism Problem: What BAGEN Found
The paper's most striking result is the low correlation between task performance and budget-awareness: r = 0.35 across the five frontier models. A model can score highly on the underlying task (SWE-bench resolution rate, puzzle solutions) while simultaneously being a poor predictor of its own resource usage.
Why? The paper attributes this to a training signal mismatch. LLMs are optimized to complete tasks — not to predict when they'll fail to complete tasks. Budget reasoning is a metacognitive skill that isn't directly rewarded in standard RLHF or instruction-following fine-tuning. Agents are implicitly trained to be optimistic because optimistic agents appear more capable on success-rate benchmarks.
The practical result is an agent that:
- Underestimates remaining steps on hard tasks
- Doesn't recalibrate its estimate as costs increase
- Never triggers an early stop or user alert
- Runs to the hard token limit, returns a failure, and leaves you with a full invoice
SFT+RL fine-tuning on BAGEN-specific trajectories does improve early stop and alert behavior. But the coverage cap suggests that the interval estimation problem may require architectural changes, not just fine-tuning.
Effloow Lab PoC: Simulating Interval Estimation
Effloow Lab reproduced the core BAGEN dynamics using a Python stdlib simulator. The goal was to demonstrate the estimator comparison without any LLM API calls or external packages.
Setup:
- 20 simulated agent trajectories (10 solvable, 10 unsolvable)
- Budget: 1,500 tokens
- Solvable tasks: 6–12 steps, 80–200 tokens/step
- Unsolvable tasks: 10–18 steps, 150–350 tokens/step (higher variance, backtracking pattern)
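A trajectory generator matching the setup above might look like the following. The PoC's actual generator is not shown in this article, so the uniform sampling, the seed, and the function name `simulate_trajectory` are assumptions; only the step and cost ranges come from the list above.

```python
import random


def simulate_trajectory(solvable: bool, rng: random.Random) -> list[int]:
    """Per-step token costs for one simulated task, using the ranges listed above."""
    if solvable:
        steps, low, high = rng.randint(6, 12), 80, 200
    else:
        steps, low, high = rng.randint(10, 18), 150, 350
    # Assumption: costs are drawn uniformly; the real PoC may shape them differently.
    return [rng.randint(low, high) for _ in range(steps)]


BUDGET = 1500
rng = random.Random(0)  # fixed seed, chosen arbitrarily, only for repeatability
trials = [(solvable, simulate_trajectory(solvable, rng))
          for solvable in [True] * 10 + [False] * 10]

over_budget = sum(sum(costs) > BUDGET for _, costs in trials)
print(f"{over_budget} of {len(trials)} trajectories exceed {BUDGET} tokens")
```

Note that even the cheapest possible unsolvable trajectory (10 steps at 150 tokens) already reaches the 1,500-token budget, so under these ranges unsolvable tasks exceed it almost by construction.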
Two estimators were compared:
Over-optimistic estimator (baseline, mimicking frontier model behavior):
def over_optimistic_estimator(consumed_so_far, max_budget, step, total_steps_estimate):
    # step and total_steps_estimate are accepted for interface parity but unused.
    lower = max(0, consumed_so_far * 0.8)
    upper = consumed_so_far * 1.1  # Only 10% more than current, very optimistic
    feasible = upper <= max_budget
    # The alert is hard-coded off: this baseline never warns.
    return {"lower": lower, "upper": upper, "feasible": feasible, "alert": False}
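This estimator assumes consumption will flatten out. Because its alert field is hard-coded to False, it fires zero alerts across all 20 trajectories by construction. It is a stand-in that mimics the paper's finding about frontier model over-optimism rather than an independent replication of it.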
BAGEN-style interval estimator (rolling cost + variance):
import math

def bagen_estimator(consumed_so_far, max_budget, step, trajectory_so_far):
    step_costs = [t["cost"] for t in trajectory_so_far]
    if len(step_costs) < 2:
        # Assumed guard (not in the original listing): with fewer than two steps
        # there is no variance to work with, so return the widest interval.
        # This matches the step-1 row of the sample output below.
        return {"lower": consumed_so_far, "upper": max_budget,
                "feasible": True, "alert": False}
    avg_cost = sum(step_costs) / len(step_costs)
    variance = sum((c - avg_cost) ** 2 for c in step_costs) / len(step_costs)
    std = math.sqrt(variance)
    # Detect increasing cost trend (unsolvable signal)
    recent_avg = sum(step_costs[-3:]) / min(3, len(step_costs))
    est_remaining = 8 if recent_avg > avg_cost * 1.1 else max(2, 10 - step)
    lower = consumed_so_far + est_remaining * max(0, avg_cost - std)
    upper = consumed_so_far + est_remaining * (recent_avg + std)
    feasible = lower <= max_budget
    alert = upper > max_budget * 1.15
    return {"lower": lower, "upper": upper, "feasible": feasible, "alert": alert}
Results from the PoC run:
Experiment: 20 trials, budget=1500 tokens
  Solvable tasks (n=10): 1 exceeded budget
  Unsolvable tasks (n=10): 10 exceeded budget
Estimator Comparison:
  Over-optimistic (frontier): 0 alerts fired
  BAGEN-style estimator: 72 alerts total, 56 on unsolvable tasks
Early Stopping Simulation:
  Average savings on failed tasks: 44.6%
  Range: 40.9% – 48.7%
The step-by-step output on an unsolvable trajectory shows how quickly the rolling-cost estimator detects the pattern:
Step  Consumed  Lower  Upper  Alert?
--------------------------------------------------
   1       332    332   1500
   2       561   2393   3217  ⚠ ALERT
   3       813   2401   3019  ⚠ ALERT
   4      1134   2571   3002  ⚠ ALERT
   5      1450   2693   3139  ⚠ ALERT
The alert fires at step 2, when the upper bound already projects 3,217 tokens needed against a 1,500-token budget. A production system could halt, escalate to a human, or switch to a cheaper fallback at this point.
Lab note: This is a simulated environment. The paper evaluates on actual LLM agentic runs with real tool calls. Our PoC validates the statistical pattern, not specific model rankings.
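The table above can be checked by hand. Differencing the Consumed column gives per-step costs of 332, 229, 252, 321, and 316 tokens, and feeding those through the estimator's arithmetic reproduces every Lower and Upper value. The sketch below does this with a standalone function, `interval_estimate`, which is a restatement for illustration; the first-step guard is an assumption inferred from the step-1 row.

```python
import math


def interval_estimate(step_costs: list[int], max_budget: int) -> dict:
    """Same arithmetic as the BAGEN-style estimator, keyed on step costs only."""
    consumed = sum(step_costs)
    step = len(step_costs)
    if step < 2:
        # Assumption: with one step there is no variance yet, so fall back to
        # the widest interval. This matches the step-1 row of the table.
        return {"lower": consumed, "upper": max_budget, "alert": False}
    avg = consumed / step
    std = math.sqrt(sum((c - avg) ** 2 for c in step_costs) / step)
    recent_avg = sum(step_costs[-3:]) / min(3, step)
    est_remaining = 8 if recent_avg > avg * 1.1 else max(2, 10 - step)
    lower = consumed + est_remaining * max(0, avg - std)
    upper = consumed + est_remaining * (recent_avg + std)
    return {"lower": lower, "upper": upper, "alert": upper > max_budget * 1.15}


# Per-step costs recovered by differencing the Consumed column.
costs = [332, 229, 252, 321, 316]
rows = []
for step in range(1, len(costs) + 1):
    est = interval_estimate(costs[:step], max_budget=1500)
    rows.append((step, sum(costs[:step]),
                 round(est["lower"]), round(est["upper"]), est["alert"]))
    print(rows[-1])
# (1, 332, 332, 1500, False)
# (2, 561, 2393, 3217, True)
# (3, 813, 2401, 3019, True)
# (4, 1134, 2571, 3002, True)
# (5, 1450, 2693, 3139, True)
```

One thing the replay makes visible: the step-2 alert is not driven by a rising cost trend (step 2 was cheaper than step 1). It fires because eight more steps at the observed average would land far past the budget.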
How to Build This Into Your Agent System
The BAGEN insight translates directly into a production wrapper pattern. The idea is to run a lightweight interval estimator alongside your main agent loop, and trigger actions when the upper bound crosses a threshold.
Minimal budget guard implementation:
from collections import deque
import math


class BudgetGuard:
    """Rolling interval estimator that flags trajectories likely to exceed budget."""

    def __init__(self, max_budget: int, alert_threshold: float = 1.15):
        self.max_budget = max_budget
        self.alert_threshold = alert_threshold
        self.step_costs = deque(maxlen=10)  # Rolling window of recent step costs
        self.total_consumed = 0

    def record_step(self, tokens_used: int) -> dict:
        """Record one step's token cost and return the updated estimate."""
        self.step_costs.append(tokens_used)
        self.total_consumed += tokens_used
        return self._estimate()

    def _estimate(self) -> dict:
        if len(self.step_costs) < 2:
            # Not enough data for a variance yet; same keys as the full estimate.
            return {
                "feasible": True,
                "alert": False,
                "lower_bound": self.total_consumed,
                "upper_bound": self.max_budget,
                "consumed": self.total_consumed,
            }

        costs = list(self.step_costs)
        avg = sum(costs) / len(costs)
        variance = sum((c - avg) ** 2 for c in costs) / len(costs)
        std = math.sqrt(variance)
        recent_avg = sum(costs[-3:]) / min(3, len(costs))

        # Detect upward trend = backtracking/unsolvable signal
        est_remaining = 8 if recent_avg > avg * 1.1 else max(2, 12 - len(costs))
        lower = self.total_consumed + est_remaining * max(0, avg - std)
        upper = self.total_consumed + est_remaining * (recent_avg + std)

        return {
            "feasible": lower <= self.max_budget,
            "alert": upper > self.max_budget * self.alert_threshold,
            "lower_bound": int(lower),
            "upper_bound": int(upper),
            "consumed": self.total_consumed,
        }
# Usage in an agent loop. `agent`, `count_tokens`, and `trigger_escalation`
# are placeholders for whatever your agent framework provides.
guard = BudgetGuard(max_budget=8000)

for step in agent.run():
    tokens_this_step = count_tokens(step.messages)
    status = guard.record_step(tokens_this_step)

    if status["alert"]:
        # Upper bound exceeds the alert threshold: intervene
        agent.trigger_escalation(
            f"Budget alert: estimated {status['upper_bound']} tokens needed, "
            f"budget is {guard.max_budget}. Stopping early."
        )
        break
This pattern requires no LLM calls, no fine-tuning, and no additional dependencies. The computational overhead is negligible — a few floating-point operations per step.
Common Mistakes Developers Make with Token Budgets
Treating the hard limit as the only control point. Most frameworks let you set max_tokens and call it done. But a hard limit only acts at the wall, where the run ends in truncated output or an error; it doesn't give you a graceful exit. The BAGEN pattern adds soft signals earlier in the trajectory.
Measuring only task success rate. BAGEN's main point is that success rate and budget-awareness are only weakly correlated (r=0.35). If your eval only tracks task completion, you won't notice the over-optimism problem until your inference bill arrives.
Ignoring per-step cost trends. The early warning signal isn't total consumption — it's the derivative. A task that burns 200 tokens in step 1 and 350 in step 2 and 480 in step 3 is showing a diverging cost trajectory. That pattern, not the absolute number, is what BAGEN's estimator catches.
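The 200, 350, 480 example can be turned into a direct trend check. This is a deliberately simple sketch based on step-over-step differences, not the rolling-average rule used in the estimators above; the function names and the strict-rise criterion are assumptions chosen for illustration.

```python
def step_deltas(step_costs):
    """Change in token cost from each step to the next."""
    return [b - a for a, b in zip(step_costs, step_costs[1:])]


def is_cost_rising(step_costs, min_steps=3):
    """True when every step cost more than the one before it."""
    if len(step_costs) < min_steps:
        return False  # too few steps to call it a trend
    return all(delta > 0 for delta in step_deltas(step_costs))


costs = [200, 350, 480]
print(step_deltas(costs))     # [150, 130]
print(is_cost_rising(costs))  # True
```

A difference-based check is worth having early in a run: the `recent_avg > avg * 1.1` rule compares the last three steps against the whole history, so it cannot fire until at least four steps have been recorded, because before that the two averages are identical.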
Applying a flat budget to all task types. Sokoban puzzles, code generation tasks, and supply-chain optimization have different intrinsic token distributions. A budget that's appropriate for one type will be wasteful for another. Consider per-task-class budgets tuned from past trajectories.
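One way to derive per-task-class budgets from history is to take a high percentile of the token totals from past successful runs. The sketch below assumes a log of `(task_class, total_tokens, succeeded)` tuples; the log format, the 90th-percentile default, and the sample numbers are all hypothetical.

```python
import statistics
from collections import defaultdict


def per_class_budgets(history, percentile=90):
    """Budget per task class = chosen percentile of successful-run token totals."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    by_class = defaultdict(list)
    for task_class, total_tokens, succeeded in history:
        if succeeded:  # failed runs say nothing about what success costs
            by_class[task_class].append(total_tokens)

    budgets = {}
    for task_class, totals in by_class.items():
        if len(totals) < 2:
            budgets[task_class] = max(totals)  # too little data for a percentile
        else:
            cuts = statistics.quantiles(totals, n=100, method="inclusive")
            budgets[task_class] = round(cuts[percentile - 1])
    return budgets


# Hypothetical trajectory log: (task_class, total_tokens, succeeded)
history = [
    ("sokoban", 900, True), ("sokoban", 1100, True), ("sokoban", 1300, True),
    ("sokoban", 4000, False),
    ("swe_bench", 6000, True), ("swe_bench", 9000, True),
]
budgets = per_class_budgets(history)
print(budgets)  # {'sokoban': 1260, 'swe_bench': 8700}
```

Basing the budget on successful runs only keeps one runaway failure (the 4,000-token Sokoban run here) from inflating the cap for its whole class.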
FAQ
Q: Does this problem affect all LLM agents equally?
The paper evaluates five frontier models and finds consistent over-optimism across all of them, with r=0.35 correlation between task performance and budget-awareness. Variation exists between models, but no current frontier model reliably predicts its own token consumption on hard tasks.
Q: Is the 28–64% token savings claim realistic in production?
The 28–64% range comes from the BAGEN benchmark runs on real LLM trajectories (Sokoban, Search-R1, SWE-bench, supply-chain). The Effloow Lab PoC measured a 44.6% average in simulation, which falls inside that range but does not validate it for any real model. The key constraint is that savings only apply to failed trajectories, meaning tasks the agent was going to fail anyway. Stopping a trajectory that would have succeeded turns a success into a failure, and the PoC's estimator is not free of false positives: 16 of its 72 alerts (72 minus 56) fired on solvable tasks. Tune the alert threshold against your own trajectories so the guard only triggers on high-confidence cases.
Q: Can I fine-tune my model to be more budget-aware?
Yes — the paper demonstrates that SFT+RL fine-tuning on BAGEN-specific trajectories improves early stop and alert behavior. However, interval coverage still caps at 47% after fine-tuning, suggesting that perfect calibration remains an open problem. The wrapper pattern described above is a simpler, training-free alternative that works with any base model.
Q: What's the difference between this and just setting a lower token limit?
Lowering the hard limit truncates every trajectory at the same point, regardless of how it is going. The BAGEN approach ties the stopping decision to the trajectory itself: the estimator predicts that this specific run is unlikely to finish within budget, so stopping now saves tokens without touching runs that are on track. A lower hard limit cuts off expensive tasks that would have succeeded, while still letting doomed tasks burn all the way to the limit. Soft signals based on interval estimation are more selective.
Q: Where can I read the full paper?
The paper is available at arxiv.org/abs/2606.00198. The project website with benchmark environments and data is at ragen-ai.github.io/bagen.
Key Takeaways
- BAGEN (arXiv:2606.00198) documents that frontier LLM agents have a correlation of only r=0.35 between task performance and budget-awareness — being a strong agent doesn't make you budget-aware.
- Frontier models are consistently over-optimistic: they underestimate token consumption on hard tasks and never trigger early stops.
- Early stopping on failed trajectories saves 28–64% of tokens in the paper's benchmark runs.
- Budget-awareness breaks into three measurable skills: feasibility prediction, early failure detection, and interval calibration. Current models struggle with all three.
- A practical solution is a wrapper-level interval estimator that monitors per-step cost trends and triggers soft alerts before the hard budget wall.
- The Effloow Lab PoC confirmed 44.6% average savings in simulation using a rolling-cost estimator with variance — no LLM calls required.
Bottom Line
BAGEN turns a billing problem into a diagnostic one. If your agents are burning through budgets on failed tasks, the fix isn't raising the limit; it's adding interval estimation to your agent loop. The paper gives you the framework; the wrapper pattern above gives you the implementation. It takes roughly 30 lines of stdlib Python and works with any model.