Alan West
How to Measure and Reduce Your LLM Tokenizer Costs

You're shipping an AI-powered feature, the demo looks great, and then the invoice arrives. Suddenly that clever summarization endpoint is costing you $400/day because nobody bothered to measure how many tokens you're actually burning.

I've been there. Twice.

The problem isn't that LLM APIs are expensive — pricing has dropped dramatically. The problem is that most developers have no idea how their text maps to tokens, and that ignorance compounds fast at scale.

Why Token Counts Surprise You

Tokenizers don't work the way your brain does. You see "authentication" as one word. A BPE (Byte Pair Encoding) tokenizer might split it into ["auth", "entic", "ation"] — three tokens. Multiply that mismatch across thousands of requests per hour and your cost estimates are fiction.
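To make the splitting behavior concrete, here's a toy greedy subword matcher. The merge vocabulary is hypothetical and purely illustrative; real BPE tokenizers learn tens of thousands of merges from data and will split the same word differently:

```python
# Toy illustration of subword splitting. The vocabulary below is made up
# for demonstration; a real tokenizer's learned vocabulary is far larger.
VOCAB = {"auth", "entic", "ation"}  # hypothetical learned subwords

def greedy_split(word, vocab):
    """Greedily match the longest known subword from the left."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character falls back to itself
            i += 1
    return pieces

print(greedy_split("authentication", VOCAB))  # ['auth', 'entic', 'ation']
```

One word in, three tokens out. That 1:3 ratio is exactly the kind of mismatch that wrecks naive cost estimates.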

Different models use different tokenizers, too. Swapping from one model family to another can change your token counts by 10-20% on the same input text. I found this out the hard way when migrating a document processing pipeline between providers and watching costs drift upward despite "cheaper" per-token pricing.

The root causes of unexpected token costs usually boil down to:

  • Verbose system prompts that get sent with every single request
  • Uncompressed context windows stuffed with raw text instead of summaries
  • No measurement — you're guessing instead of counting
  • Ignoring output tokens, which are typically 3-5x more expensive than input tokens

Step 1: Actually Measure Your Tokens

Before optimizing anything, instrument your calls. Most LLM API responses include token usage in the response metadata. If you're not logging this, start now.

import anthropic
import json
from datetime import datetime, timezone

client = anthropic.Anthropic()

def call_with_tracking(messages, model, system=None):
    """Wrapper that logs token usage for every call."""
    kwargs = {"model": model, "max_tokens": 1024, "messages": messages}
    if system:
        kwargs["system"] = system

    response = client.messages.create(**kwargs)

    usage = response.usage
    log_entry = {
        # datetime.utcnow() is deprecated; use an aware UTC timestamp
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        # cache reads are cheaper — track them separately
        "cache_read_tokens": getattr(usage, "cache_read_input_tokens", 0),
        "cache_creation_tokens": getattr(usage, "cache_creation_input_tokens", 0),
    }

    # Ship this to your observability stack
    print(json.dumps(log_entry))
    return response

Run this for a day in production. You'll probably discover that 60% of your token spend is on input — specifically, on the same system prompt and context being resent over and over.
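You can check that input/output split directly from the log lines the wrapper above prints. A minimal sketch, assuming the JSON log format shown earlier:

```python
# Sanity check on a day of log lines: what fraction of tokens is input?
# Log format matches the call_with_tracking wrapper's output.
import json

def input_token_share(log_lines):
    """Return input tokens as a fraction of all tokens logged."""
    total_input = total_output = 0
    for line in log_lines:
        entry = json.loads(line)
        total_input += entry["input_tokens"]
        total_output += entry["output_tokens"]
    return total_input / (total_input + total_output)

logs = [
    '{"input_tokens": 2100, "output_tokens": 300}',
    '{"input_tokens": 2050, "output_tokens": 450}',
]
print(f"{input_token_share(logs):.0%}")  # 85%
```

If that number is high and your prompts are mostly static, prompt caching (Step 3) is your biggest lever.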

Step 2: Count Tokens Before You Send Them

Waiting for the API response to tell you token counts is like checking your bank balance after the vacation. You want to know before you make the call.

Anthropic provides a token counting API, and for local estimation, the tiktoken library (originally built for OpenAI's models) gives you a rough baseline for BPE tokenizers generally:

import tiktoken

def estimate_tokens(text, encoding_name="cl100k_base"):
    """Rough token estimate using a BPE tokenizer.
    Note: actual counts will vary by model — use the
    provider's counting API for precision."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    return len(tokens)

# Compare what you think vs. reality
test_strings = [
    "Hello world",
    "Authentication failed for user@example.com",
    '{"error": "rate_limit_exceeded", "retry_after": 30}',
    "The quick brown fox " * 100,  # repetitive text
]

for s in test_strings:
    count = estimate_tokens(s)
    ratio = count / len(s.split())
    print(f"Words: {len(s.split()):>4} | Tokens: {count:>4} | Ratio: {ratio:.2f}")

That ratio column is the number to watch. For English prose it's usually around 1.3. For code, it jumps to 1.5-2.0. For JSON with lots of punctuation and special characters? I've seen it hit 2.5.

Step 3: Slash Your System Prompt Costs

This is where the biggest wins hide. If your system prompt is 2,000 tokens and you're making 10,000 requests per day, that's 20 million input tokens daily just on instructions that never change.
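The arithmetic is worth writing down. The price below is an assumption for illustration; substitute your model's actual input rate:

```python
# Back-of-the-envelope cost of a fixed system prompt at scale.
SYSTEM_PROMPT_TOKENS = 2_000
REQUESTS_PER_DAY = 10_000
INPUT_PRICE_PER_MTOK = 3.00  # assumed USD per million input tokens

daily_tokens = SYSTEM_PROMPT_TOKENS * REQUESTS_PER_DAY
daily_cost = daily_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
print(daily_tokens, daily_cost)  # 20000000 60.0
```

That's $60/day, roughly $1,800/month, before a single user token is processed.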

Three strategies that actually work:

Prompt caching. Anthropic and other providers support caching of static prompt prefixes. The first request pays full price, but subsequent requests within the cache TTL (usually around 5 minutes) get charged at a fraction of the cost — sometimes 90% less.

# With Anthropic's prompt caching, mark your static content
response = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,  # 2000+ tokens
            "cache_control": {"type": "ephemeral"}  # enables caching
        }
    ],
    messages=[{"role": "user", "content": user_input}]
)
# Check response.usage.cache_read_input_tokens to verify it's working

Compress your instructions. I rewrote a 1,800-token system prompt down to 600 tokens by removing redundant phrasing, using shorthand, and cutting examples that weren't improving output quality. Test your outputs before and after — you'll often find that shorter prompts work just as well.
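A small helper makes the before/after comparison repeatable. This is a sketch; `count_tokens` can be any counter, such as the estimate_tokens helper from Step 2 or the provider's counting API:

```python
def prompt_savings(old_prompt, new_prompt, count_tokens):
    """Compare token counts before and after a prompt rewrite.
    count_tokens is any counter function, e.g. a tiktoken-based
    estimator or a call to the provider's counting API."""
    old = count_tokens(old_prompt)
    new = count_tokens(new_prompt)
    return {
        "old_tokens": old,
        "new_tokens": new,
        "saved": old - new,
        "saved_pct": round(100 * (old - new) / old, 1),
    }
```

Log this output in the PR whenever a prompt changes, and regressions become visible at review time.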

Move context to retrieval. Instead of stuffing 50 pages of documentation into every request, use RAG (retrieval-augmented generation) to pull in only the relevant chunks. This alone cut one of my project's token costs by 70%.
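The retrieval step doesn't have to be sophisticated to start saving tokens. A naive lexical sketch (a production setup would use embeddings and a vector index, but the cost logic is identical):

```python
def top_chunks(query, chunks, k=2):
    """Rank document chunks by word overlap with the query and return
    the top k. Sending only these, instead of the whole corpus, is the
    entire cost-saving mechanism of retrieval-augmented generation."""
    query_words = set(query.lower().split())

    def overlap(chunk):
        return len(query_words & set(chunk.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:k]

docs = [
    "reset your password via the settings page",
    "billing invoices are emailed monthly",
    "the api rate limit is 100 requests per minute",
]
print(top_chunks("how do i reset my password", docs, k=1))
```

Even this crude version turns "50 pages per request" into "two paragraphs per request."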

Step 4: Control Output Token Bloat

Output tokens cost more, and models love to be verbose. Fight back:

  • Set max_tokens to a reasonable limit, not the maximum
  • Add explicit length instructions: "Respond in under 100 words"
  • For structured data, ask for JSON — it's more token-dense than prose
  • Use streaming so you can abort early if the response is going off-track
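The first two bullets can live in one request-builder. A minimal sketch; `word_limit` and `token_cap` are illustrative names, not API parameters:

```python
def build_capped_request(user_input, model, word_limit=100, token_cap=300):
    """Assemble request kwargs that cap output on both ends: a hard
    max_tokens ceiling plus an explicit length instruction in the prompt.
    word_limit and token_cap are hypothetical helper parameters."""
    return {
        "model": model,
        "max_tokens": token_cap,  # hard stop regardless of model verbosity
        "messages": [{
            "role": "user",
            "content": f"{user_input}\n\nRespond in under {word_limit} words.",
        }],
    }
```

The instruction shapes the response; `max_tokens` is the backstop when the model ignores it.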

Step 5: Build a Cost Dashboard

Once you're logging token usage per request, aggregate it. You don't need anything fancy — a simple script that groups by endpoint and calculates daily cost is enough to catch problems:

def calculate_daily_cost(logs, input_price_per_mtok, output_price_per_mtok):
    """Calculate cost from a list of usage log entries.

    Note: Anthropic's usage metadata reports cache reads separately from
    input_tokens, so cached tokens are added at their discounted rate
    rather than subtracted from the input total.
    """
    total_input = sum(log["input_tokens"] for log in logs)
    total_output = sum(log["output_tokens"] for log in logs)
    cached = sum(log.get("cache_read_tokens", 0) for log in logs)

    cost = (
        (total_input / 1_000_000) * input_price_per_mtok
        + (cached / 1_000_000) * input_price_per_mtok * 0.1  # cache reads: ~90% discount
        + (total_output / 1_000_000) * output_price_per_mtok
    )
    return {
        "total_input_tokens": total_input,
        "total_output_tokens": total_output,
        "cached_tokens": cached,
        "estimated_cost_usd": round(cost, 4),
    }

Run this weekly. Set alerts for when daily cost exceeds your baseline by more than 20%. I guarantee it'll catch a runaway prompt or an unexpected traffic spike before it empties your credits.
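The 20% alert rule is a one-liner worth wiring into whatever runs your weekly report:

```python
def cost_alert(today_cost, baseline_cost, threshold=0.20):
    """True when today's spend drifts more than `threshold` above baseline."""
    return today_cost > baseline_cost * (1 + threshold)

print(cost_alert(130.0, 100.0))  # True: 30% over baseline
print(cost_alert(115.0, 100.0))  # False: within tolerance
```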

Prevention: Bake This Into Your Workflow

The real fix isn't one-time optimization — it's making token cost a first-class metric:

  • Add token counts to your CI. If a PR changes a system prompt, log the before/after token count in the PR description.
  • Set per-endpoint budgets. "This summarization endpoint should average under 800 input tokens per call." Alert when it drifts.
  • Review your model selection. A smaller, faster model might handle 80% of your requests at a fraction of the cost. Route only complex queries to the expensive model.
  • Benchmark when switching models or providers. Run your actual production prompts through the new tokenizer and compare counts before committing to a migration.

Token costs are the kind of problem that's trivially easy to measure and absurdly expensive to ignore. Spend an afternoon instrumenting your calls, and you'll probably find savings that pay for that afternoon a hundred times over.

The tools exist. The APIs report usage. There's genuinely no excuse for flying blind on this anymore.
