Everyone knows long prompts cost money. Almost nobody knows which parts of their prompt actually matter.
Prompts accumulate over time: a system message, a style guide, a few-shot example or two, some background context. Each addition made sense when it was added. Over hundreds of API calls, the overhead compounds. And the honest answer to "which of these sections can I remove?" is: you don't know until you test.
Token Budget Negotiator makes that test systematic. It takes a prompt split into named, prioritised sections, runs a greedy ablation loop that drops one section at a time, scores the remaining prompt against a rubric using a local or remote LLM judge, and stops when savings hit the target without falling below the quality threshold. The result is the smallest prompt that still behaves like the original.
It ships as a CLI, a Python library, and an MCP server.
The Problem
Prompt sections are not equal in value, but there's no principled way to know which ones matter for a given task without testing. Manual trimming is guesswork. Token Budget Negotiator answers the question empirically per section, per task, against a rubric that defines what quality means for that use case.
How It Works
A prompt is defined as a YAML file with named sections. Each section carries a type (system, few_shot, context, instruction), a content block, and a priority integer. Priority determines the order in which sections are considered for removal: low-priority sections are evaluated first, high-priority sections last.
Before any removal happens, the full prompt is scored by the judge LLM against the rubric. This establishes a baseline. The quality target for the run is baseline_score × threshold.
The ablation loop then works through sections in ascending priority order. For each candidate, a test prompt is built without that section and rescored. If the score still meets the target, the section is dropped permanently and the loop continues with the updated prompt. If not, the section is kept and the next candidate is evaluated.
Two conditions stop the loop:
- Token savings reach min_token_savings: the target has been hit.
- A removal would push savings above max_token_savings: the ceiling is enforced.
Every accepted removal is verified to actually reduce the token count. The loop cannot produce a larger prompt than it started with.
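Putting those pieces together, the loop and its guards can be sketched in a few lines of Python. The function and field names here are illustrative, not the library's actual internals:

```python
def greedy_ablate(sections, score, count_tokens, threshold=0.95,
                  min_savings=0.40, max_savings=0.60):
    """Greedy one-at-a-time ablation over sections in ascending priority."""
    baseline = score(sections)          # score the full prompt once
    target = baseline * threshold       # quality floor for this run
    total = count_tokens(sections)
    kept, removed = list(sections), []
    for cand in sorted(sections, key=lambda s: s["priority"]):
        trial = [s for s in kept if s is not cand]
        if 1 - count_tokens(trial) / total > max_savings:
            break                       # removal would exceed the ceiling
        # Accept only removals that shrink the prompt AND keep quality.
        if count_tokens(trial) < count_tokens(kept) and score(trial) >= target:
            kept, removed = trial, removed + [cand["name"]]
        if 1 - count_tokens(kept) / total >= min_savings:
            break                       # savings target reached
    return kept, removed
```

The score and count_tokens callables stand in for the judge LLM and the tokenizer; in the real tool both are pluggable components.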
The output is a NegotiationResult containing the original and optimised token counts, the list of sections removed, per-step scores, quality retention percentage, elapsed time, scoring call count, rubric name, and a full ablation log. This can be written to JSON or YAML.
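Serialised to JSON, a result looks roughly like the sketch below. original_token_count, optimized_token_count, and sections_removed match the Python API; the remaining field names are illustrative, not guaranteed:

```json
{
  "original_token_count": 118,
  "optimized_token_count": 92,
  "sections_removed": ["style_guide"],
  "quality_retention": 1.167,
  "elapsed_seconds": 42.0,
  "scoring_calls": 7,
  "rubric_name": "qa",
  "ablation_log": [
    {"section": "style_guide", "score": 0.7, "kept": false}
  ]
}
```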
Getting Started
Install
cd token-budget-negotiator
pip install -e .
Requires Python 3.11+. The local judge path requires Ollama with a model pulled, verified end-to-end against gemma4:latest. The OpenRouter path requires OPENROUTER_API_KEY.
Analyze token distribution
Before negotiating, analyze prints how many tokens each section holds and its share of the total budget:
$ token-budget analyze examples/prompt.yaml
Token Distribution Analysis:
Section Type Tokens % Priority
-----------------------------------------------------------------
system system 22 18.6% 30
style_guide system 26 22.0% 10
few_shot_1 few_shot 26 22.0% 20
few_shot_2 few_shot 20 16.9% 25
context context 12 10.2% 40
instruction instruction 12 10.2% 100
-----------------------------------------------------------------
TOTAL 118 100.0%
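The counts above come from a tiktoken-backed tokenizer; the same per-section breakdown can be reproduced with a small helper. Here encode is any text-to-tokens callable; with tiktoken you would pass something like tiktoken.get_encoding("cl100k_base").encode, where the encoding name is an assumption:

```python
def token_distribution(sections, encode):
    """Token count and share of the total for each named section.

    sections: mapping of section name -> content string.
    encode: callable mapping text -> list of tokens (e.g. a tiktoken encoder).
    """
    counts = {name: len(encode(text)) for name, text in sections.items()}
    total = sum(counts.values())
    # Return (tokens, percentage of total) per section.
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.items()}
```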
Check the local judge:
$ token-budget check-ollama --model gemma4:latest
Ollama is connected
Host: http://localhost:11434
Model requested: gemma4:latest
Model available: Yes
Run the negotiator:
$ token-budget negotiate examples/prompt.yaml \
--scorer ollama --model gemma4:latest \
--threshold 0.80 --min-savings 0.20 --max-savings 0.80 \
--output result.json --format json
Negotiation Result:
Original: 118 tokens, score=0.600
Optimized: 92 tokens, score=0.700
Savings: 22.0%
Quality Retention: 116.7%
Success: Yes
Sections removed: style_guide
Results saved to result.json
result.json contains the full ablation log, the final optimized prompt, per-step scores, and metadata (elapsed time, scoring call count, rubric name).
Run the negotiator - OpenRouter path
$ export OPENROUTER_API_KEY=sk-or-...
$ token-budget check-openrouter
OpenRouter is connected
Base URL: https://openrouter.ai/api/v1
Model requested: qwen/qwen3-8b
$ token-budget -v negotiate examples/prompt.yaml \
--scorer openrouter --model meta-llama/llama-3.2-3b-instruct \
--rubric rubrics/qa.yaml \
--threshold 0.7 --min-savings 0.1 --max-savings 0.6 --no-cache
Connected to openrouter
Negotiation Result:
Original: 118 tokens, score=1.000
Optimized: 92 tokens, score=0.900
Savings: 22.0%
Quality Retention: 90.0%
Success: Yes
Sections removed: style_guide
With a looser threshold (-t 0.7 --min-savings 0.1 --max-savings 0.5) and caching left on, the same model drops two sections for 44.1% savings at 100% quality retention.
CLI Reference
Key negotiate flags:
- -r, --rubric PATH: YAML rubric. Defaults to a built-in accuracy+relevance rubric.
- -s, --scorer {ollama,openrouter}: which judge to use. Default ollama.
- -m, --model TEXT: model name (gemma4:latest for Ollama, qwen/qwen3-8b for OpenRouter, etc.).
- -t, --threshold FLOAT: minimum fraction of the baseline score to keep. Default 0.95.
- --min-savings FLOAT: stop once savings reach this fraction. Default 0.40.
- --max-savings FLOAT: never drop a section if doing so would save more than this. Default 0.60.
- -o, --output PATH / -f, --format {json,yaml}: write a machine-readable report.
- --no-cache: disable the in-memory scoring cache.
Python API
from token_budget_negotiator import (
    Negotiator, OllamaScorer, PromptSection, Rubric,
)
from token_budget_negotiator.models import RubricCriterion, SectionType

# Define the prompt as named, prioritised sections.
sections = [
    PromptSection(name="sys", content="You are helpful.",
                  section_type=SectionType.SYSTEM, priority=20),
    PromptSection(name="q", content="What is 2+2?",
                  section_type=SectionType.INSTRUCTION, priority=100),
]

# The rubric tells the judge what quality means for this task.
rubric = Rubric(
    name="qa", description="qa rubric",
    criteria=[RubricCriterion(name="accuracy",
                              description="factually correct",
                              weight=1.0)],
)

# Judge with a local Ollama model; keep at least 90% of the baseline
# score while targeting 10-90% token savings.
scorer = OllamaScorer(model="gemma4:latest")
negotiator = Negotiator(scorer=scorer, quality_threshold=0.9,
                        min_token_savings=0.1, max_token_savings=0.9)

result = negotiator.negotiate(sections=sections, rubric=rubric, target_task="qa")
print(result.original_token_count, "->", result.optimized_token_count)
print("removed:", result.sections_removed)
Rubric Format
The rubric defines what quality means for the task. The judge scores each test prompt against it. Three rubrics ship in rubrics/: qa.yaml, coding.yaml, summarization.yaml.
name: qa
description: General question-answer rubric
version: "1.0"
criteria:
- name: accuracy
description: Is the response factually correct?
weight: 1.0
- name: relevance
description: Does it answer what was asked?
weight: 1.0
scoring_instructions: |
Score 0-1. 1 = perfect, 0 = wrong or irrelevant.
output_format: json
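The weights suggest how per-criterion judge scores combine into one number; a weight-normalised average is one natural reading, sketched below. This is an assumption, not necessarily the library's exact formula:

```python
def weighted_score(criterion_scores, weights):
    """Combine 0-1 per-criterion judge scores using rubric weights."""
    total_weight = sum(weights.values())
    return sum(criterion_scores[name] * w
               for name, w in weights.items()) / total_weight
```

With the qa rubric above, accuracy 0.8 and relevance 1.0 at equal weight would combine to 0.9.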
MCP Server
The library also runs as an MCP server over stdio transport, exposing two tools, analyze and negotiate, so Claude Code or any MCP-compatible agent can call it directly during a session.
python -m token_budget_negotiator.mcp_server --scorer ollama --model gemma4:latest
analyze takes a sections list and returns token distribution as JSON. negotiate takes sections, rubric, task, thresholds, and scorer config and returns the full negotiation result as JSON.
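Over stdio the tools are invoked with standard MCP tools/call requests. The argument shape below is a sketch inferred from the section fields described earlier, not the server's exact schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "analyze",
    "arguments": {
      "sections": [
        {"name": "sys", "content": "You are helpful.",
         "section_type": "system", "priority": 20}
      ]
    }
  }
}
```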
Limitations
- Ablation is greedy one-at-a-time in priority order, not exhaustive subset search.
- The judge is asked for strict JSON; free-text replies fall back to regex score extraction with reduced confidence.
- Small local judges like gemma4 are noisy; prefer thresholds in the 0.80-0.90 range and expect multi-minute wall clock even for short prompts.
- check-openrouter and the OpenRouter scorer require OPENROUTER_API_KEY; there is no offline stub.
- Only the remove compression strategy is wired up. CompressionStrategy and sections_compressed exist on the model but are not yet produced by the negotiator.
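The JSON-first, regex-fallback parsing mentioned above can be sketched as follows; this is illustrative, not the actual parser, and the "score" key is an assumption:

```python
import json
import re

def parse_judge_score(reply):
    """Parse a 0-1 score from a judge reply.

    Try strict JSON first; fall back to grabbing the first 0-1 number
    in free text, returning a reduced confidence in that case.
    """
    try:
        return float(json.loads(reply)["score"]), 1.0   # full confidence
    except (ValueError, KeyError, TypeError):
        match = re.search(r"(?:0?\.\d+|[01](?:\.0+)?)", reply)
        if match:
            return float(match.group()), 0.5            # reduced confidence
        return None, 0.0                                # unparseable reply
```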
How I Built This Using NEO
This project was built using NEO, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.
The problem was defined at a high level: a tool that takes a structured prompt, scores it with a local or remote LLM judge, and finds the minimum set of sections needed to hit a quality threshold. NEO generated the full implementation: the greedy ablation loop in Negotiator, the OllamaScorer and OpenRouterScorer with their shared interface, the ScoreCache with TTL-based invalidation, the SectionTokenizer backed by tiktoken, the YAML rubric format, the MCP server with its two exposed tools, and the CLI built on Click. All 49 tests pass.
Final Notes
Token Budget Negotiator turns prompt compression from guesswork into an empirical process. It scores every section against a rubric, drops only what demonstrably doesn't matter, and produces a report showing exactly what changed and why.
The code is at https://github.com/dakshjain-1616/token-budget-negotiator
You can also build with NEO in your IDE using the VS Code extension or Cursor.