Ayi NEDJIMI

Posted on May 23

LLM Token Counting and Cost Optimization: A Practical Guide

#python #ai #llm #tutorial

Every API call to a language model costs money, and that cost is denominated in tokens. If you're running LLMs in production — for summarization, classification, or chat — token waste is the fastest way to blow your budget without realizing it. This guide covers how to count tokens accurately before sending requests, and the patterns that actually reduce your costs.

What a Token Actually Is

Tokens are not words, and they are not characters either. A token is a chunk of text produced by a tokenizer, typically a byte-pair encoding (BPE) algorithm whose vocabulary is learned from a text corpus and fixed before the model itself is trained. On average, one token ≈ 4 characters in English, but this breaks down for code, non-Latin scripts, and structured formats like JSON.

A few examples to make it concrete:

"hello world" → 2 tokens
{"key": "value"} → 6 tokens
A 4000-character Python function → roughly 1000 tokens

Most providers charge separately for input tokens (your prompt) and output tokens (the model's response). Output tokens are typically 3–5× more expensive than input tokens, so verbose output you don't need is, token for token, the most expensive kind of waste.

Counting Tokens Before You Pay

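You can only control costs you measure, so count before sending. OpenAI publishes its tokenizer as a library you can run locally, with no API call required; other providers typically offer their own tokenizer or a token-counting endpoint.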

For OpenAI-compatible models, tiktoken is the standard library:

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def estimate_cost(
    prompt: str,
    expected_output_tokens: int,
    model: str = "gpt-4o",
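    input_price_per_1k: float = 0.005,   # illustrative; check your provider's current pricing
    output_price_per_1k: float = 0.015,  # illustrative; check your provider's current pricing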
) -> dict:
    input_tokens = count_tokens(prompt, model)
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (expected_output_tokens / 1000) * output_price_per_1k
    return {
        "input_tokens": input_tokens,
        "expected_output_tokens": expected_output_tokens,
        "estimated_total_cost_usd": round(input_cost + output_cost, 6),
    }

# Before sending a batch job
with open("doc.txt", encoding="utf-8") as f:
    document = f.read()

result = estimate_cost(
    prompt="Summarize the following document:\n\n" + document,
    expected_output_tokens=150,
)
print(result)
# Example output for a document that yields a 2341-token prompt:
# {'input_tokens': 2341, 'expected_output_tokens': 150, 'estimated_total_cost_usd': 0.013955}

Run this check before every call in your batch jobs. If the estimate exceeds a threshold, truncate or reject rather than silently burning budget. A simple guard that raises an exception when input exceeds 3000 tokens will save you from accidentally passing an entire database dump as context.
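The guard described above can be sketched in a few lines. This is a minimal version, not a definitive implementation: the exception name TokenBudgetExceeded and the function enforce_token_budget are hypothetical, and the function expects a count that already came from a real tokenizer such as the count_tokens helper earlier in this section.

```python
class TokenBudgetExceeded(ValueError):
    """Raised when a prompt is too large to send (hypothetical name)."""


def enforce_token_budget(input_tokens: int, max_input_tokens: int = 3000) -> int:
    """Return input_tokens unchanged, or raise if it is over the limit.

    input_tokens should come from a real tokenizer, for example the
    count_tokens() helper defined earlier in this guide.
    """
    if input_tokens > max_input_tokens:
        raise TokenBudgetExceeded(
            f"prompt is {input_tokens} tokens, limit is {max_input_tokens}"
        )
    return input_tokens


# Usage: a 2341-token prompt passes; a 12000-token one is rejected.
enforce_token_budget(2341)
try:
    enforce_token_budget(12_000)
except TokenBudgetExceeded as exc:
    print(f"rejected: {exc}")
```

Raising a dedicated exception lets a batch runner catch oversized prompts and decide per job whether to truncate, skip, or alert, instead of failing on a generic error.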

Four Patterns That Actually Reduce Costs

1. Prompt compression

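Most prompts contain redundancy: repeated context, verbose instructions, unnecessary whitespace. A few lines of preprocessing reduce token counts, usually without affecting output quality. Skip the whitespace collapsing when the prompt contains code or other indentation-sensitive text.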

import re
import tiktoken

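def compress_prompt(text: str) -> str:
    # Strip HTML tags first (for scraped content) so the gaps they leave are collapsed below.
    # The pattern is naive: it also removes any "<...>" span that appears in plain text.
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\n{3,}', '\n\n', text)  # collapse runs of blank lines
    text = re.sub(r' {2,}', ' ', text)      # collapse runs of spaces (this breaks code indentation)
    return text.strip()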

def truncate_to_budget(
    prompt: str,
    system: str,
    max_input_tokens: int,
    model: str = "gpt-4o",
) -> str:
    enc = tiktoken.encoding_for_model(model)
    system_tokens = len(enc.encode(system))
    budget = max_input_tokens - system_tokens - 50  # 50-token safety margin
    if budget < 2:
        # Without this check, half would be 0 or negative and the slices below
        # would return the whole prompt (tokens[-0:]) or overlapping halves.
        raise ValueError("max_input_tokens leaves no room for the prompt")

    tokens = enc.encode(prompt)
    if len(tokens) <= budget:
        return prompt

    # Middle-truncation: preserve start and end context
    half = budget // 2
    kept = tokens[:half] + tokens[-half:]
    return enc.decode(kept)

Middle-truncation is often better than tail-truncation for documents where the conclusion matters as much as the introduction — legal clauses, technical specs, bug reports.

2. Hard output caps

If you only need a JSON object with three fields, tell the model explicitly and set max_tokens on the API call. Letting the model write a 500-token explanation when you asked for a label is expensive.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Respond with JSON only. No explanation."},
        {"role": "user", "content": prompt},
    ],
    max_tokens=256,  # hard ceiling on billed output tokens
    response_format={"type": "json_object"},
)

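Set max_tokens to 20–30% above your expected output length, not to the model's maximum. You are billed for the tokens actually generated, not for the cap, so the saving comes from the requests that would otherwise run long: on 10,000 daily requests, even a small share of runaway responses is expensive at max_tokens=4096 and bounded at max_tokens=256. A response that hits the cap is cut off mid-output, so check whether finish_reason is "length" before parsing the JSON.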
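To put a number on that ceiling, here is a small sketch of the worst case, where every response runs all the way to the cap. The function name worst_case_output_cost is hypothetical, and the default price is the same illustrative $0.015 per 1k output tokens used in estimate_cost, not a current rate.

```python
def worst_case_output_cost(
    max_tokens: int,
    requests_per_day: int,
    output_price_per_1k: float = 0.015,  # illustrative price, as in estimate_cost
) -> float:
    """Upper bound on daily output spend if every response hit the cap."""
    return max_tokens * requests_per_day / 1000 * output_price_per_1k


for cap in (256, 4096):
    ceiling = worst_case_output_cost(cap, requests_per_day=10_000)
    print(f"max_tokens={cap}: at most ${ceiling:,.2f}/day on output")
```

Under these assumptions the ceiling is about $38 a day at max_tokens=256 and about $614 a day at max_tokens=4096. Real spend sits well below either figure, but the cap decides how bad a day of runaway responses can get.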

3. Model routing by complexity

Not all requests need your most capable — and most expensive — model. A classification task that a smaller model handles at 94% accuracy doesn't belong in your flagship model's queue.

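def route_to_model(prompt: str, token_count: int) -> str:
    # The "short and simple" tier and the "medium" tier map to the same model here;
    # point the first tier at a cheaper model if your provider offers one.
    if token_count < 500 and is_simple_task(prompt):
        return "gpt-4o-mini"   # typically 10-15x cheaper
    elif token_count < 2000:
        return "gpt-4o-mini"
    else:
        return "gpt-4o"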

def is_simple_task(prompt: str) -> bool:
    simple_patterns = ["classify", "label", "category", "is this", "yes or no", "true or false"]
    return any(k in prompt.lower() for k in simple_patterns)

A routing strategy that sends 70% of traffic to the smaller model and 30% to the capable one often cuts monthly costs in half with negligible quality regression. Measure both accuracy and cost before committing; the tradeoff is task-dependent.
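The arithmetic behind that estimate is easy to check. The sketch below assumes the small model is 10 to 15 times cheaper per token and that the 70% share is measured in token volume rather than request count; blended_cost_fraction is a hypothetical helper name.

```python
def blended_cost_fraction(small_share: float, price_ratio: float) -> float:
    """Cost after routing, as a fraction of sending everything to the large model.

    small_share: fraction of token volume routed to the small model (0..1).
    price_ratio: how many times cheaper the small model is per token.
    """
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return (1.0 - small_share) + small_share / price_ratio


for ratio in (10, 15):
    fraction = blended_cost_fraction(small_share=0.70, price_ratio=ratio)
    print(f"{ratio}x cheaper: {fraction:.0%} of the original bill")
```

With these inputs the bill drops to roughly 35 to 37% of the original. In practice the requests routed to the small model tend to be the short ones, so their share of tokens is lower than their share of requests, which is why "about half" is the safer expectation.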

4. Response caching

Identical prompts should never hit the API twice when a repeated answer is acceptable. This matters most for FAQ chatbots, template-based generation, and any workflow with repeated structured inputs.

import hashlib
import json

_cache: dict[str, str] = {}  # swap for Redis in production

def cache_key(prompt: str, model: str, system: str = "") -> str:
    payload = json.dumps(
        {"prompt": prompt, "model": model, "system": system},
        sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(prompt: str, model: str, system: str = "") -> str:
    key = cache_key(prompt, model, system)
    if key in _cache:
        return _cache[key]

    # call_llm is a placeholder for your own wrapper around the provider API
    response = call_llm(prompt, model, system)
    _cache[key] = response
    return response

In production, replace _cache with Redis and set a TTL matched to how often your prompts change — 1 hour for dynamic content, 24 hours for reference data.
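To show what TTL expiry does without requiring a Redis server, here is an in-memory stand-in. The TTLCache class and its injectable clock are assumptions for illustration; with Redis you would normally get the same effect by passing an expiry when you set the key.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop the stale entry on read
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


dynamic_cache = TTLCache(ttl_seconds=60 * 60)         # 1 hour for dynamic content
reference_cache = TTLCache(ttl_seconds=24 * 60 * 60)  # 24 hours for reference data
```

The key passed to get and set would be the SHA-256 digest from cache_key. Expired entries are only removed when they are read, which is fine for a sketch but would leak memory in a long-running process.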

Measuring What You're Actually Spending

Logging token usage per request is non-negotiable. Most APIs return a usage object in the response:

response = client.chat.completions.create(...)
usage = response.usage
print(f"Prompt: {usage.prompt_tokens}, Completion: {usage.completion_tokens}")

Aggregate these into a time-series metric — Prometheus, InfluxDB, or even a SQLite table — and alert when daily spend exceeds a threshold. Track cost per feature, not just per model. A feature that costs $15/day when you budgeted $1/day is a conversation you need to have before the invoice arrives.
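As a sketch of the SQLite option, the block below logs each request with a feature label and reports which features have gone over a daily budget. The table layout, the function names log_usage and features_over_budget, and the single pair of prices are all assumptions for illustration; a real setup would store per-model prices.

```python
import sqlite3

# Illustrative prices per 1k tokens, as in estimate_cost; not current rates.
# One price pair is used for every model to keep the sketch short.
INPUT_PRICE_PER_1K = 0.005
OUTPUT_PRICE_PER_1K = 0.015

conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
conn.execute(
    """
    CREATE TABLE usage_log (
        ts TEXT DEFAULT CURRENT_TIMESTAMP,
        feature TEXT NOT NULL,
        model TEXT NOT NULL,
        prompt_tokens INTEGER NOT NULL,
        completion_tokens INTEGER NOT NULL,
        cost_usd REAL NOT NULL
    )
    """
)


def log_usage(feature: str, model: str, prompt_tokens: int, completion_tokens: int) -> None:
    """Record one request, priced with the illustrative constants above."""
    cost = (
        prompt_tokens / 1000 * INPUT_PRICE_PER_1K
        + completion_tokens / 1000 * OUTPUT_PRICE_PER_1K
    )
    conn.execute(
        "INSERT INTO usage_log (feature, model, prompt_tokens, completion_tokens, cost_usd)"
        " VALUES (?, ?, ?, ?, ?)",
        (feature, model, prompt_tokens, completion_tokens, cost),
    )
    conn.commit()


def features_over_budget(daily_budget_usd: float) -> list[tuple[str, float]]:
    """Return (feature, spend) pairs whose spend today (UTC) exceeds the budget."""
    rows = conn.execute(
        "SELECT feature, SUM(cost_usd) FROM usage_log"
        " WHERE date(ts) = date('now')"
        " GROUP BY feature HAVING SUM(cost_usd) > ?"
        " ORDER BY feature",
        (daily_budget_usd,),
    )
    return [(feature, spend) for feature, spend in rows]


# In a real call, the two token counts come from response.usage.
log_usage("summarize", "gpt-4o", prompt_tokens=2341, completion_tokens=150)
log_usage("classify", "gpt-4o-mini", prompt_tokens=120, completion_tokens=5)
print(features_over_budget(daily_budget_usd=0.01))
```

Only the summarize feature crosses the one-cent budget in this example. Running features_over_budget on a schedule and sending the result to your alerting channel gives you the per-feature view before the invoice does.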

The Takeaway

Token costs are predictable and controllable, but only if you instrument before you deploy:

Count before sending — use a local tokenizer, not approximations
Cap output hard — set max_tokens on every call, not just some
Route by task complexity — don't use a heavy model for simple classification
Cache identical prompts — hit Redis, not the API, for repeat requests
Monitor per feature — blind spots in cost attribution become budget surprises

If you're building production LLM applications and want to harden the surrounding infrastructure — API security, secrets management, least-privilege IAM — we publish free security hardening checklists that cover exactly that.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.
