Everyone knows long prompts cost money. Almost nobody knows which parts of their prompt actually matter.
Prompts accumulate over time: a system message, a style guide, a few-shot example or two, some background context. Each addition made sense when it was added. Over hundreds of API calls, the overhead compounds. And the honest answer to "which of these sections can I remove?" is: you don't know until you test.
Token Budget Negotiator makes that test systematic. It takes a prompt split into named, prioritised sections, runs a greedy ablation loop that drops one section at a time, scores the remaining prompt against a rubric using a local or remote LLM judge, and stops when savings hit the target without falling below the quality threshold. The result is the smallest prompt that still behaves like the original.
It ships as a CLI, a Python library, and an MCP server.
The Problem
Prompt sections are not equal in value, but there's no principled way to know which ones matter for a given task without testing. Manual trimming is guesswork. Token Budget Negotiator answers the question empirically per section, per task, against a rubric that defines what quality means for that use case.
How It Works
A prompt is defined as a YAML file with named sections. Each section carries a type (system, few_shot, context, instruction), a content block, and a priority integer. Priority determines the order in which sections are considered for removal: low-priority sections are evaluated first, high-priority sections last.
Before any removal happens, the full prompt is scored by the judge LLM against the rubric. This establishes a baseline. The quality target for the run is baseline_score × threshold.
The ablation loop then works through sections in ascending priority order. For each candidate, a test prompt is built without that section and rescored. If the score still meets the target, the section is dropped permanently and the loop continues with the updated prompt. If not, the section is kept and the next candidate is evaluated.
Two conditions stop the loop:
- Token savings reach min_token_savings: the target has been hit.
- A removal would push savings above max_token_savings: the ceiling is enforced.
Every accepted removal is verified to actually reduce the token count. The loop cannot produce a larger prompt than it started with.
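Putting those pieces together, the loop and its guards can be sketched in a few lines of Python. The function and field names here are illustrative, not the library's actual internals:

```python
def greedy_ablate(sections, score, count_tokens, threshold=0.95,
                  min_savings=0.40, max_savings=0.60):
    """Greedy one-at-a-time ablation over sections in ascending priority."""
    baseline = score(sections)          # score the full prompt once
    target = baseline * threshold       # quality floor for this run
    total = count_tokens(sections)
    kept, removed = list(sections), []
    for cand in sorted(sections, key=lambda s: s["priority"]):
        trial = [s for s in kept if s is not cand]
        if 1 - count_tokens(trial) / total > max_savings:
            break                       # removal would exceed the ceiling
        # Accept only removals that shrink the prompt AND keep quality.
        if count_tokens(trial) < count_tokens(kept) and score(trial) >= target:
            kept, removed = trial, removed + [cand["name"]]
        if 1 - count_tokens(kept) / total >= min_savings:
            break                       # savings target reached
    return kept, removed
```

The score and count_tokens callables stand in for the judge LLM and the tokenizer; in the real tool both are pluggable components.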
The output is a NegotiationResult containing the original and optimised token counts, the list of sections removed, per-step scores, quality retention percentage, elapsed time, scoring call count, rubric name, and a full ablation log. This can be written to JSON or YAML.
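Serialised to JSON, a result looks roughly like the sketch below. original_token_count, optimized_token_count, and sections_removed match the Python API; the remaining field names are illustrative, not guaranteed:

```json
{
  "original_token_count": 118,
  "optimized_token_count": 92,
  "sections_removed": ["style_guide"],
  "quality_retention": 1.167,
  "elapsed_seconds": 42.0,
  "scoring_calls": 7,
  "rubric_name": "qa",
  "ablation_log": [
    {"section": "style_guide", "score": 0.7, "kept": false}
  ]
}
```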
Getting Started
Install
cd token-budget-negotiator
pip install -e .
Requires Python 3.11+. The local judge path requires Ollama with a model pulled, verified end-to-end against gemma4:latest. The OpenRouter path requires OPENROUTER_API_KEY.
Analyze token distribution
Before negotiating, analyze prints how many tokens each section holds and its share of the total budget:
$ token-budget analyze examples/prompt.yaml
Token Distribution Analysis:
Section Type Tokens % Priority
-----------------------------------------------------------------
system system 22 18.6% 30
style_guide system 26 22.0% 10
few_shot_1 few_shot 26 22.0% 20
few_shot_2 few_shot 20 16.9% 25
context context 12 10.2% 40
instruction instruction 12 10.2% 100
-----------------------------------------------------------------
TOTAL 118 100.0%
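The counts above come from a tiktoken-backed tokenizer; the same per-section breakdown can be reproduced with a small helper. Here encode is any text-to-tokens callable; with tiktoken you would pass something like tiktoken.get_encoding("cl100k_base").encode, where the encoding name is an assumption:

```python
def token_distribution(sections, encode):
    """Token count and share of the total for each named section.

    sections: mapping of section name -> content string.
    encode: callable mapping text -> list of tokens (e.g. a tiktoken encoder).
    """
    counts = {name: len(encode(text)) for name, text in sections.items()}
    total = sum(counts.values())
    # Return (tokens, percentage of total) per section.
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.items()}
```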
Check the local judge:
$ token-budget check-ollama --model gemma4:latest
Ollama is connected
Host: http://localhost:11434
Model requested: gemma4:latest
Model available: Yes
Run the negotiator:
$ token-budget negotiate examples/prompt.yaml \
--scorer ollama --model gemma4:latest \
--threshold 0.80 --min-savings 0.20 --max-savings 0.80 \
--output result.json --format json
Negotiation Result:
Original: 118 tokens, score=0.600
Optimized: 92 tokens, score=0.700
Savings: 22.0%
Quality Retention: 116.7%
Success: Yes
Sections removed: style_guide
Results saved to result.json
result.json contains the full ablation log, the final optimized prompt, per-step scores, and metadata (elapsed time, scoring call count, rubric name).
Run the negotiator - OpenRouter path
$ export OPENROUTER_API_KEY=sk-or-...
$ token-budget check-openrouter
OpenRouter is connected
Base URL: https://openrouter.ai/api/v1
Model requested: qwen/qwen3-8b
$ token-budget -v negotiate examples/prompt.yaml \
--scorer openrouter --model meta-llama/llama-3.2-3b-instruct \
--rubric rubrics/qa.yaml \
--threshold 0.7 --min-savings 0.1 --max-savings 0.6 --no-cache
Connected to openrouter
Negotiation Result:
Original: 118 tokens, score=1.000
Optimized: 92 tokens, score=0.900
Savings: 22.0%
Quality Retention: 90.0%
Success: Yes
Sections removed: style_guide
With a looser threshold (-t 0.7 --min-savings 0.1 --max-savings 0.5) and caching left on, the same model drops two sections for 44.1% savings at 100% quality retention.
CLI Reference
Key negotiate flags:
- -r, --rubric PATH: YAML rubric. Defaults to a built-in accuracy+relevance rubric.
- -s, --scorer {ollama,openrouter}: which judge to use. Default ollama.
- -m, --model TEXT: model name (gemma4:latest for Ollama, qwen/qwen3-8b for OpenRouter, etc.).
- -t, --threshold FLOAT: minimum fraction of the baseline score to keep. Default 0.95.
- --min-savings FLOAT: stop once savings reach this fraction. Default 0.40.
- --max-savings FLOAT: never drop a section if doing so would save more than this. Default 0.60.
- -o, --output PATH / -f, --format {json,yaml}: write a machine-readable report.
- --no-cache: disable the in-memory scoring cache.
Python API
from token_budget_negotiator import (
    Negotiator, OllamaScorer, PromptSection, Rubric,
)
from token_budget_negotiator.models import RubricCriterion, SectionType

# Define the prompt as named, prioritised sections.
sections = [
    PromptSection(name="sys", content="You are helpful.",
                  section_type=SectionType.SYSTEM, priority=20),
    PromptSection(name="q", content="What is 2+2?",
                  section_type=SectionType.INSTRUCTION, priority=100),
]

# The rubric tells the judge what quality means for this task.
rubric = Rubric(
    name="qa", description="qa rubric",
    criteria=[RubricCriterion(name="accuracy",
                              description="factually correct",
                              weight=1.0)],
)

# Judge with a local Ollama model; keep at least 90% of the baseline
# score while targeting 10-90% token savings.
scorer = OllamaScorer(model="gemma4:latest")
negotiator = Negotiator(scorer=scorer, quality_threshold=0.9,
                        min_token_savings=0.1, max_token_savings=0.9)

result = negotiator.negotiate(sections=sections, rubric=rubric, target_task="qa")
print(result.original_token_count, "->", result.optimized_token_count)
print("removed:", result.sections_removed)
Rubric Format
The rubric defines what quality means for the task. The judge scores each test prompt against it. Three rubrics ship in rubrics/: qa.yaml, coding.yaml, summarization.yaml.
name: qa
description: General question-answer rubric
version: "1.0"
criteria:
- name: accuracy
description: Is the response factually correct?
weight: 1.0
- name: relevance
description: Does it answer what was asked?
weight: 1.0
scoring_instructions: |
Score 0-1. 1 = perfect, 0 = wrong or irrelevant.
output_format: json
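The weights suggest how per-criterion judge scores combine into one number; a weight-normalised average is one natural reading, sketched below. This is an assumption, not necessarily the library's exact formula:

```python
def weighted_score(criterion_scores, weights):
    """Combine 0-1 per-criterion judge scores using rubric weights."""
    total_weight = sum(weights.values())
    return sum(criterion_scores[name] * w
               for name, w in weights.items()) / total_weight
```

With the qa rubric above, accuracy 0.8 and relevance 1.0 at equal weight would combine to 0.9.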
MCP Server
The library also runs as an MCP server over stdio transport, exposing two tools, analyze and negotiate, so Claude Code or any MCP-compatible agent can call it directly during a session.
python -m token_budget_negotiator.mcp_server --scorer ollama --model gemma4:latest
analyze takes a sections list and returns token distribution as JSON. negotiate takes sections, rubric, task, thresholds, and scorer config and returns the full negotiation result as JSON.
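Over stdio the tools are invoked with standard MCP tools/call requests. The argument shape below is a sketch inferred from the section fields described earlier, not the server's exact schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "analyze",
    "arguments": {
      "sections": [
        {"name": "sys", "content": "You are helpful.",
         "section_type": "system", "priority": 20}
      ]
    }
  }
}
```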
Limitations
- Ablation is greedy one-at-a-time in priority order, not exhaustive subset search.
- The judge is asked for strict JSON; free-text replies fall back to regex score extraction with reduced confidence.
- Small local judges like gemma4 are noisy; prefer thresholds in the 0.80-0.90 range and expect multi-minute wall clock even for short prompts.
- check-openrouter and the OpenRouter scorer require OPENROUTER_API_KEY; there is no offline stub.
- Only the remove compression strategy is wired up. CompressionStrategy and sections_compressed exist on the model but are not yet produced by the negotiator.
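The JSON-first, regex-fallback parsing mentioned above can be sketched as follows; this is illustrative, not the actual parser, and the "score" key is an assumption:

```python
import json
import re

def parse_judge_score(reply):
    """Parse a 0-1 score from a judge reply.

    Try strict JSON first; fall back to grabbing the first 0-1 number
    in free text, returning a reduced confidence in that case.
    """
    try:
        return float(json.loads(reply)["score"]), 1.0   # full confidence
    except (ValueError, KeyError, TypeError):
        match = re.search(r"(?:0?\.\d+|[01](?:\.0+)?)", reply)
        if match:
            return float(match.group()), 0.5            # reduced confidence
        return None, 0.0                                # unparseable reply
```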
How I Built This Using NEO
This project was built using NEO, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.
The problem was defined at a high level: a tool that takes a structured prompt, scores it with a local or remote LLM judge, and finds the minimum set of sections needed to hit a quality threshold. NEO generated the full implementation: the greedy ablation loop in Negotiator, the OllamaScorer and OpenRouterScorer with their shared interface, the ScoreCache with TTL-based invalidation, the SectionTokenizer backed by tiktoken, the YAML rubric format, the MCP server with its two exposed tools, and the CLI built on Click. All 49 tests pass.
Final Notes
Token Budget Negotiator turns prompt compression from guesswork into an empirical process. It scores every section against a rubric, drops only what demonstrably doesn't matter, and produces a report showing exactly what changed and why.
The code is at https://github.com/dakshjain-1616/token-budget-negotiator
You can also build with NEO in your IDE using the VS Code extension or Cursor.