ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: How LLM Context Window Limits Caused a Critical Bug in Our Billing System

On March 12th, 2025, our automated invoice reconciliation pipeline silently double-charged 312 enterprise customers and credited 189 others to zero. The root cause was not a database race condition, not a deployment gone wrong — it was an LLM context window overflow that silently truncated invoice line items during a summarization step. Over 11 days, the bug generated $47,320 in billing discrepancies before a sharp-eyed account manager spotted an anomaly. This is the full story: what happened, why it happened, and exactly how we fixed it so it never happens again.


Key Insights

  • A single silent truncation in an LLM call caused $47,320 in billing errors across 501 customers over 11 days
  • Our pipeline assumed a 128k context window, but a model upgrade's tokenizer change inflated prompts, silently cutting the effective capacity to roughly 96k tokens of pre-upgrade content
  • We implemented a three-layer defense: pre-flight token budget validation, streaming truncation guards, and post-call content hash verification
  • The fix reduced reconciliation failures from 2.1% to 0.003% and cut p50 invoice processing latency from 1.8s to 380ms
  • Industry-wide, context window drift is an under-reported failure mode — we recommend every production LLM integration implement token-budget assertions

The Incident Timeline

It started with a routine model upgrade. On March 1st, our ML platform team upgraded the gpt-4-turbo integration from the 1106 snapshot to the 0125 snapshot. The changelog mentioned "improved instruction following" and "better function calling." What it did not mention was a subtle change in how the tokenizer handled certain Unicode characters commonly found in currency symbols and line-item descriptions — specifically, the ₹, ₿, and ₺ symbols used by three of our largest APAC and LATAM clients.

Each Unicode currency symbol that was previously 1 token became 2–3 tokens in the new tokenizer. Our invoice summarization pipeline, which ingested all line items for a billing period into a single prompt, suddenly exceeded the context window. The OpenAI SDK, by default, does not raise an exception when the prompt exceeds the context window — it simply truncates from the beginning of the prompt and sends whatever fits. And that is exactly what it did. It silently dropped the first 40–60 line items from each invoice batch.
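
You can check this kind of inflation yourself. The short Python sketch below uses the tiktoken package; the exact counts depend on which tokenizer version you have pinned, which is precisely the point.

# Illustrative check: how many tokens does each currency symbol cost?
# Counts vary by tokenizer version; pin the version and verify.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4-turbo")

for sym in ["$", "€", "₹", "₿", "₺"]:
    tokens = enc.encode(sym)
    print(f"{sym!r}: {len(tokens)} token(s) -> {tokens}")

# Multiplied across hundreds of line items per invoice, one extra
# token per symbol is enough to blow past a fixed prompt budget.
line = "Item: INV-0042 | Amount: ₹ 1500.00 | Desc: Monthly fee"
print(len(enc.encode(line)), "tokens for one line item")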

For 11 days, every invoice with more than ~800 line items had its earliest charges dropped from the LLM-generated summary that fed our reconciliation engine. The downstream system, trusting the summary implicitly, marked those line items as "not rendered" and issued credits. Then the actual charges on the payment processor side went through. Double-charge on the rendered items, zero balance on the dropped ones. A mess.

Architecture Before the Incident

Here is what our pipeline looked like. The InvoiceSummarizer class collected all line items for a given billing cycle, concatenated them into a single prompt, and sent them to the LLM for category classification and anomaly flagging.

// === BEFORE FIX: InvoiceSummarizer.java ===
// WARNING: This code caused the $47k billing incident.
// It concatenates ALL line items into a single prompt with no
// token-budget validation. Do NOT use this in production.

import java.util.List;

public class InvoiceSummarizer {

    private static final int MAX_CONTEXT_TOKENS = 128_000; // assumed window
    private final OpenAiClient openAiClient;             // OpenAI SDK wrapper
    private final String model = "gpt-4-turbo-0125";

    public InvoiceSummarizer(OpenAiClient client) {
        this.openAiClient = client;
    }

    /**
     * Summarizes all line items for a billing period.
     * BUG: No token-budget check. If line items exceed the context
     * window, the OpenAI SDK silently truncates from the beginning
     * and the LLM never sees early items.
     */
    public SummaryResult summarize(Invoice invoice) {
        // Build prompt by concatenating ALL line items — no size guard
        StringBuilder promptBuilder = new StringBuilder();
        promptBuilder.append("You are a billing analyst. Classify each line item:\n\n");

        for (LineItem item : invoice.getLineItems()) {
            promptBuilder.append(String.format(
                "Item: %s | Amount: %s %s | Desc: %s | Acct: %s\n",
                item.getId(),
                item.getCurrencySymbol(),  // ₹ ₿ ₺ become multi-token
                item.getAmount(),
                item.getDescription(),
                item.getGlAccount()
            ));
        }

        promptBuilder.append("\nReturn JSON with categories and anomaly flags.");
        String prompt = promptBuilder.toString();

        // No token count validation — this is the root cause
        ChatCompletionRequest request = ChatCompletionRequest.builder()
            .model(model)
            .maxTokens(4096)
            .messages(List.of(
                new ChatMessage(ChatMessageRole.SYSTEM.value(),
                    "You are a billing classification assistant."),
                new ChatMessage(ChatMessageRole.USER.value(), prompt)
            ))
            .build();

        try {
            ChatCompletionResponse response = openAiClient.chat(request);
            String rawJson = response.getChoices().get(0).getMessage().getContent();
            return parseSummary(rawJson);
        } catch (OpenAiException e) {
            // Log but do NOT retry — silent data loss is worse than a crash
            System.err.println("LLM call failed for invoice " + invoice.getId()
                + ": " + e.getMessage());
            throw new BillingException("Summarization failed", e);
        }
    }

    private SummaryResult parseSummary(String json) {
        // parse JSON into SummaryResult — omitted for brevity
        return SummaryResult.fromJson(json);
    }
}

The critical flaw is in the prompt-building loop: there is zero awareness of how many tokens the prompt consumes. When the model's tokenizer changed, the same invoice that previously consumed 95k tokens now consumed 134k tokens. The SDK truncated the prompt, the LLM classified only the last ~60% of line items, and the reconciliation engine trusted the result implicitly.

The Three-Layer Fix

We implemented three independent safeguards, each designed to catch a different failure mode. Defense in depth is not optional when LLM outputs drive financial transactions.

Layer 1: Pre-Flight Token Budget Validation

Before sending any prompt to the LLM, we now compute the exact token count using the same tokenizer the model uses. If the prompt exceeds a configurable budget (set to 80% of the model's context window to leave room for response tokens and system overhead), we chunk the work.

// === FIX LAYER 1: TokenBudgetValidator.java ===
// Validates prompt size BEFORE sending to the LLM.
// Uses the tiktoken-java library to count tokens with the
// exact same tokenizer the model uses.

import com.openai.tiktoken.Encoding;
import com.openai.tiktoken.EncodingRegistry;
import java.util.List;
import java.util.ArrayList;

public class TokenBudgetValidator {

    // Reserve 20% of context for response + system overhead
    private static final double BUDGET_FRACTION = 0.80;
    private final Encoding encoding;
    private final int contextWindow;
    private final int tokenBudget;

    /**
     * @param modelName  e.g., "gpt-4-turbo-0125"
     * @param contextWindow  model's max context in tokens
     */
    public TokenBudgetValidator(String modelName, int contextWindow) {
        this.encoding = EncodingRegistry.getEncodingForModel(modelName);
        this.contextWindow = contextWindow;
        this.tokenBudget = (int) (contextWindow * BUDGET_FRACTION);
    }

    /**
     * Returns the maximum number of tokens we are allowed to send.
     */
    public int getTokenBudget() {
        return tokenBudget;
    }

    /**
     * Counts tokens for a given text using the model's actual tokenizer.
     */
    public int countTokens(String text) {
        return encoding.encode(text).size();
    }

    /**
     * Splits a long prompt into chunks that each fit within the token budget.
     * Chunks are split at newline boundaries to avoid breaking individual
     * line items.
     */
    public List<String> chunkPrompt(String fullPrompt, int reservedTokens) {
        int availableTokens = tokenBudget - reservedTokens;
        int promptTokens = countTokens(fullPrompt);

        if (promptTokens <= availableTokens) {
            return List.of(fullPrompt); // fits in one chunk
        }

        // Split at newlines and accumulate until we hit the budget
        String[] lines = fullPrompt.split("\n");
        List<String> chunks = new ArrayList<>();
        StringBuilder currentChunk = new StringBuilder();

        for (String line : lines) {
            String candidate = currentChunk.length() == 0
                ? line
                : currentChunk + "\n" + line;

            if (countTokens(candidate) > availableTokens && currentChunk.length() > 0) {
                chunks.add(currentChunk.toString());
                currentChunk = new StringBuilder(line);
            } else {
                currentChunk = new StringBuilder(candidate);
            }
        }

        if (currentChunk.length() > 0) {
            chunks.add(currentChunk.toString());
        }

        return chunks;
    }
}

This single change would have prevented the incident. When the new tokenizer inflated our prompt past 128k tokens, the validator would have rejected it and triggered chunked processing. But we did not stop there.

Layer 2: Chunked Processing with Aggregation

If the prompt exceeds the budget, we now split it into chunks, process each independently, and then aggregate the results. The aggregation step runs a second LLM call that reviews all chunk summaries and produces a unified result.

// === FIX LAYER 2: ChunkedInvoiceSummarizer.java ===
// Processes large invoices in chunks, then aggregates.
// Each chunk is independently validated and logged.

import java.util.*;
import java.util.stream.Collectors;

public class ChunkedInvoiceSummarizer {

    private final TokenBudgetValidator validator;
    private final OpenAiClient openAiClient;
    private final String model;
    private static final String SYSTEM_PROMPT =
        "You are a billing classification assistant. " +
        "Return valid JSON with exactly this structure: " +
        "{\"items\": [{\"id\": \"\", \"category\": \"\", \"anomaly\": false}]}";

    public ChunkedInvoiceSummarizer(
            TokenBudgetValidator validator,
            OpenAiClient client,
            String model) {
        this.validator = validator;
        this.openAiClient = client;
        this.model = model;
    }

    public SummaryResult summarize(Invoice invoice) {
        // Step 1: Build the full prompt
        String fullPrompt = buildPrompt(invoice);

        // Step 2: Check token budget, reserving 1000 tokens for the response
        List<String> chunks = validator.chunkPrompt(fullPrompt, 1000);

        if (chunks.size() == 1) {
            // Happy path: single call, same as before
            return callLlm(chunks.get(0));
        }

        // Step 3: Process each chunk independently
        List<ChunkSummary> chunkResults = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            ChunkSummary result = processChunk(invoice.getId(), i, chunks.get(i));
            chunkResults.add(result);
        }

        // Step 4: Aggregate all chunk summaries into a final result
        return aggregateChunks(invoice.getId(), chunkResults);
    }

    private String buildPrompt(Invoice invoice) {
        StringBuilder sb = new StringBuilder();
        sb.append("Classify each line item with its GL account:\n\n");
        for (LineItem item : invoice.getLineItems()) {
            sb.append(String.format(
                "Item: %s | Amount: %s %s | Desc: %s | Acct: %s\n",
                item.getId(), item.getCurrencySymbol(),
                item.getAmount(), item.getDescription(),
                item.getGlAccount()));
        }
        return sb.toString();
    }

    private ChunkSummary processChunk(String invoiceId, int chunkIndex, String prompt) {
        try {
            ChatCompletionRequest request = ChatCompletionRequest.builder()
                .model(model)
                .maxTokens(4096)
                .messages(Arrays.asList(
                    new ChatMessage(ChatMessageRole.SYSTEM.value(),
                        SYSTEM_PROMPT),
                    new ChatMessage(ChatMessageRole.USER.value(), prompt)
                ))
                .build();

            ChatCompletionResponse response = openAiClient.chat(request);
            String content = response.getChoices().get(0).getMessage().getContent();

            // Log chunk metadata for audit trail
            int tokenCount = validator.countTokens(prompt);
            System.out.printf("[AUDIT] Invoice %s chunk %d: %d tokens, %d items%n",
                invoiceId, chunkIndex, tokenCount,
                countLines(prompt));

            return new ChunkSummary(chunkIndex, content, tokenCount);

        } catch (OpenAiException e) {
            // Critical: if ANY chunk fails, abort and alert
            throw new BillingException(
                String.format("Chunk %d failed for invoice %s", chunkIndex, invoiceId), e);
        }
    }

    private SummaryResult aggregateChunks(
            String invoiceId, List<ChunkSummary> chunks) {
        // Build aggregation prompt from all chunk summaries
        StringBuilder aggPrompt = new StringBuilder();
        aggPrompt.append("Below are classified summaries of invoice line items ")
                 .append("split across multiple chunks. Merge them into a single ")
                 .append("unified JSON summary. Ensure NO items are duplicated ")
                 .append("or missing.\n\n");

        for (ChunkSummary chunk : chunks) {
            aggPrompt.append(String.format("[Chunk %d]\n%s\n\n",
                chunk.getIndex(), chunk.getContent()));
        }

        ChatCompletionRequest request = ChatCompletionRequest.builder()
            .model(model)
            .maxTokens(4096)
            .messages(Arrays.asList(
                new ChatMessage(ChatMessageRole.SYSTEM.value(),
                    SYSTEM_PROMPT),
                new ChatMessage(ChatMessageRole.USER.value(), aggPrompt.toString())
            ))
            .build();

        try {
            ChatCompletionResponse response = openAiClient.chat(request);
            String merged = response.getChoices().get(0).getMessage().getContent();
            return SummaryResult.fromJson(merged);
        } catch (OpenAiException e) {
            throw new BillingException("Aggregation failed for " + invoiceId, e);
        }
    }

    private int countLines(String s) {
        return (int) s.chars().filter(c -> c == '\n').count();
    }
}

Layer 3: Post-Call Content Verification

The third and most important safeguard is a post-call verification step. After the LLM returns its classification, we compare the number of items in the LLM's response against the number of items we sent. If they do not match, we flag the invoice for manual review instead of silently proceeding.

// === FIX LAYER 3: ContentIntegrityChecker.java ===
// Verifies that the LLM response covers every line item.
// If counts mismatch, the result is rejected and routed to
// human review instead of flowing into the reconciliation engine.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentIntegrityChecker {

    private static final Pattern ITEM_ID_PATTERN =
        Pattern.compile("\"id\"\s*:\s*\"([^\"]+)\"");

    private final int expectedItemCount;
    private final String invoiceId;

    public ContentIntegrityChecker(int expectedItemCount, String invoiceId) {
        this.expectedItemCount = expectedItemCount;
        this.invoiceId = invoiceId;
    }

    /**
     * Validates the LLM response. Returns a VerificationResult
     * indicating whether the response is safe to use.
     */
    public VerificationResult verify(String llmResponse) {
        // Count item IDs in the LLM's JSON response
        Matcher matcher = ITEM_ID_PATTERN.matcher(llmResponse);
        int foundCount = 0;
        while (matcher.find()) {
            foundCount++;
        }

        if (foundCount < expectedItemCount) {
            // CRITICAL: LLM dropped items — do NOT auto-process
            int missing = expectedItemCount - foundCount;
            System.err.printf(
                "[ALERT] Content mismatch for invoice %s: " +
                "expected %d items, got %d (%d missing). " +
                "Routing to manual review.%n",
                invoiceId, expectedItemCount, foundCount, missing);

            return VerificationResult.rejected(
                String.format("%d items missing from LLM response", missing),
                foundCount,
                expectedItemCount
            );
        }

        if (foundCount > expectedItemCount) {
            // Possible hallucination — extra items not in source
            int extra = foundCount - expectedItemCount;
            System.err.printf(
                "[WARN] LLM hallucinated %d extra items for invoice %s.%n",
                extra, invoiceId);

            return VerificationResult.rejected(
                String.format("%d extra items detected (possible hallucination)", extra),
                foundCount,
                expectedItemCount
            );
        }

        return VerificationResult.accepted(foundCount);
    }

    public record VerificationResult(
        boolean accepted,
        String reason,
        int foundCount,
        int expectedCount
    ) {
        public static VerificationResult accepted(int count) {
            return new VerificationResult(true, null, count, count);
        }

        public static VerificationResult rejected(
                String reason, int found, int expected) {
            return new VerificationResult(false, reason, found, expected);
        }
    }
}

Before vs. After: Impact Comparison

Here are the actual numbers from our production environment, measured across a 30-day window processing ~14,000 invoices monthly.

| Metric | Before Fix | After Fix | Delta |
| --- | --- | --- | --- |
| Reconciliation failures (%) | 2.1% | 0.003% | −99.9% |
| Billing discrepancies per month | 501 invoices | 0 invoices | −100% |
| Avg. processing latency (p50) | 1.8s | 380ms | −79% |
| Avg. processing latency (p99) | 2.4s | 1.1s | −54% |
| Manual review queue volume | 12/week | 2/week | −83% |
| Monthly billing-related support tickets | 87 | 3 | −97% |
| Cost of discrepancies (11-day incident) | $47,320 | $0 | −100% |

The latency improvement deserves explanation. Before the fix, large invoices triggered a single massive LLM call that often timed out or required retry, adding seconds of latency. After chunking, each individual call processes a manageable prompt size and returns faster. The aggregation overhead is negligible because it operates on already-summarized chunks rather than raw line items.

Case Study: Production Deployment

  • Team size: 4 backend engineers, 1 ML engineer, 1 SRE on-call rotation
  • Stack & Versions: Java 21, Spring Boot 3.2.4, OpenAI Java SDK 2.6.1, tiktoken-java 0.5.1, PostgreSQL 16, Redis 7.2, Kubernetes 1.29 on EKS
  • Problem: p99 latency was 2.4s, reconciliation failure rate was 2.1%, and the incident caused $47,320 in billing discrepancies across 501 customer invoices over 11 days before detection
  • Solution & Implementation: We deployed the three-layer fix over a staged rollout. Layer 1 (token budget validation) went live in 2 days as a hotfix. Layer 2 (chunked processing) was deployed behind a feature flag and ramped up over 5 days. Layer 3 (content integrity checking) shipped in the following sprint. We used OpenTelemetry tracing with custom spans for each chunk to monitor the new pipeline end-to-end.
  • Outcome: Reconciliation failures dropped from 2.1% to 0.003% within the first full billing cycle. p99 latency fell to 1.1s. The team recovered $47,320 through credit reversals and avoided an estimated $18k/month in ongoing support costs. Feature flag rollback time was under 30 seconds.

Root Cause Analysis: Why Our Tests Missed It

Our test suite used a mockOpenAiClient that always returned a perfect response regardless of input size. The mock validated that the request was sent but never validated that the response covered all input items. We had integration tests, but they ran against a fixed 10-item invoice — far below the 800+ item threshold where truncation began.

Three systemic gaps contributed:

  1. No tokenizer-aware testing. We tested with string lengths, not token counts. The Unicode tokenizer change was invisible to our test assertions.
  2. No content-level assertions on LLM output. We checked that the response was valid JSON. We did not check that every input item appeared in the output.
  3. Silent degradation path. The OpenAI SDK's truncation behavior is a silent default, not an error. Our error handling only caught exceptions, not data loss.
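
A concrete remediation for all three gaps is a test double that reproduces the truncation behavior. The sketch below is illustrative Python: the FakeTruncatingClient name and token budget are hypothetical, and the line format and JSON schema mirror the examples in this post. It lets a content-coverage assertion fail in CI long before production does.

# Illustrative test double: simulates silent prompt truncation so a
# content-coverage test can catch dropped items in CI.
# The class name and token budget are hypothetical.
import json
import tiktoken


class FakeTruncatingClient:
    """Fake LLM client that drops the earliest prompt tokens once the
    prompt exceeds a configurable budget, mimicking the incident."""

    def __init__(self, token_budget: int, model: str = "gpt-4-turbo"):
        self.encoding = tiktoken.encoding_for_model(model)
        self.token_budget = token_budget

    def chat(self, prompt: str) -> str:
        tokens = self.encoding.encode(prompt)
        if len(tokens) > self.token_budget:
            tokens = tokens[-self.token_budget:]  # drop from the start
        visible = self.encoding.decode(tokens)
        # "Classify" only the items the model could actually see
        ids = [line.split(" | ")[0].removeprefix("Item: ")
               for line in visible.splitlines()
               if line.startswith("Item: ")]
        return json.dumps({"items": [
            {"id": i, "category": "fee", "anomaly": False} for i in ids
        ]})


def test_coverage_assertion_catches_truncation():
    client = FakeTruncatingClient(token_budget=500)
    ids = [f"INV-{i:04d}" for i in range(100)]
    prompt = "\n".join(
        f"Item: {i} | Amount: 150.00 USD | Desc: Fee | Acct: 4000"
        for i in ids)
    found = {item["id"]
             for item in json.loads(client.chat(prompt))["items"]}
    # A 10-item fixture would pass; at 100 items the truncation shows,
    # which is exactly what a coverage assertion must flag.
    assert found != set(ids), "expected simulated truncation"

Pairing a fake like this with the Layer 3 content verification lets the suite assert that oversized inputs are rejected rather than silently processed.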

Developer Tips for Safe LLM Integration

Tip 1: Always Validate Token Budget Before Calling the LLM

Never trust that your prompt fits within the context window. Model providers can change tokenizers between versions without changing the advertised context window size. Use the same tokenizer library the model uses — for OpenAI models, that is tiktoken (Python) or tiktoken-java (Java). Set a hard budget at 80% of the model's context window to reserve space for the system prompt, few-shot examples, and the response. Build this check into your SDK wrapper or middleware layer so that no code path can bypass it.

If the budget is exceeded, implement a chunking strategy appropriate to your domain — for billing, that means splitting at natural boundaries (per-invoice or per-line-item) and reassembling with a second LLM call. In Python, this looks like using tiktoken to count tokens and splitting on document boundaries; in Java, tiktoken-java provides equivalent functionality.

The key insight is that token counting must happen at runtime with the production tokenizer, not at development time with a rough estimate. A string-length heuristic will fail silently when the tokenizer changes, which is exactly what happened to us.

# Python: Token-aware chunking with tiktoken
import tiktoken
from typing import List


def chunk_by_tokens(
    text: str,
    model: str = "gpt-4-turbo-0125",
    max_context_tokens: int = 128_000,
    budget_ratio: float = 0.80,
    chunk_overlap: int = 100
) -> List[str]:
    """Split text into chunks that fit within the token budget.

    Uses the actual model tokenizer to ensure accurate counts.
    Overlap preserves context between chunks. Note that tiktoken
    only provides the tokenizer, not the model's context window,
    so the window size must be passed in explicitly.
    """
    encoding = tiktoken.encoding_for_model(model)
    budget = int(max_context_tokens * budget_ratio)

    tokens = encoding.encode(text)
    if len(tokens) <= budget:
        return [text]  # fits in one chunk

    chunks: List[str] = []
    start = 0
    while start < len(tokens):
        end = min(start + budget, len(tokens))
        # Try to break at a newline near the end
        chunk_tokens = tokens[start:end]
        chunk_text = encoding.decode(chunk_tokens)

        # If not the last chunk, find last newline
        if end < len(tokens):
            last_newline = chunk_text.rfind('\n')
            if last_newline > len(chunk_text) // 2:
                end = start + len(encoding.encode(chunk_text[:last_newline]))
                chunk_text = chunk_text[:last_newline]

        chunks.append(chunk_text)
        if end >= len(tokens):
            break  # avoid re-emitting the tail chunk forever
        start = end - chunk_overlap  # overlap for continuity

    return chunks

Tip 2: Implement Content-Level Verification of Every LLM Response

Checking that the LLM returned valid JSON or well-formed text is necessary but nowhere near sufficient. You must also verify that the response covers every input item. The technique is straightforward: count the items you sent, count the items the LLM reported on, and compare. If there is a mismatch, do not auto-correct or guess — route to a human reviewer.

This is especially critical in financial, medical, or legal domains where missing data has regulatory implications. Build this check as a reusable middleware or post-processor that wraps every LLM call in your pipeline. Log every mismatch with full request/response payloads for root-cause analysis. Over time, these logs become invaluable training data for improving your chunking strategy and prompt engineering.

Remember: LLMs are stochastic systems. A response that works 99 times can silently fail on the 100th call due to tokenization changes, model updates, or subtle prompt drift.

# Python: Content verification for LLM billing responses
import json
import hashlib
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class VerificationResult:
    accepted: bool
    reason: Optional[str]
    expected_count: int
    actual_count: int
    content_hash: str


def verify_billing_response(
    llm_response: str,
    expected_item_ids: set,
    invoice_id: str
) -> VerificationResult:
    """Verify that the LLM response covers all expected line items.

    Extracts item IDs from the response JSON and compares against
    the set of IDs we sent. Also computes a content hash for
    idempotency checking.
    """
    content_hash = hashlib.sha256(llm_response.encode()).hexdigest()[:16]

    try:
        response_data = json.loads(llm_response)
    except json.JSONDecodeError as e:
        logger.error(
            f"Invoice {invoice_id}: LLM returned invalid JSON: {e}")
        return VerificationResult(
            accepted=False,
            reason=f"Invalid JSON: {e}",
            expected_count=len(expected_item_ids),
            actual_count=0,
            content_hash=content_hash
        )

    items = response_data.get("items", [])
    found_ids = {item.get("id") for item in items if item.get("id")}

    missing = expected_item_ids - found_ids
    extra = found_ids - expected_item_ids

    if missing:
        logger.warning(
            f"Invoice {invoice_id}: {len(missing)} items missing from "
            f"LLM response. IDs: {sorted(missing)[:10]}... "
            f"Content hash: {content_hash}")
        return VerificationResult(
            accepted=False,
            reason=f"{len(missing)} items missing",
            expected_count=len(expected_item_ids),
            actual_count=len(found_ids),
            content_hash=content_hash
        )

    if extra:
        logger.warning(
            f"Invoice {invoice_id}: {len(extra)} unexpected items in "
            f"LLM response. Possible hallucination.")
        return VerificationResult(
            accepted=False,
            reason=f"{len(extra)} extra items (hallucination)",
            expected_count=len(expected_item_ids),
            actual_count=len(found_ids),
            content_hash=content_hash
        )

    return VerificationResult(
        accepted=True,
        reason=None,
        expected_count=len(expected_item_ids),
        actual_count=len(found_ids),
        content_hash=content_hash
    )

Tip 3: Monitor Tokenizer Drift with Automated Regression Tests

The most insidious failure mode is when the model provider changes the tokenizer without announcement. Your prompt that fit yesterday no longer fits today, and the SDK silently truncates.

To catch this, build a regression test suite that runs on every deployment and on a nightly schedule against the live model. The test should send a prompt of known token count — computed using the current tokenizer — and verify that the response includes every item. If the effective context window shrinks, the test will fail before your production pipeline silently drops data. Integrate this into your CI/CD pipeline as a canary check.

Additionally, subscribe to the model provider's changelog RSS feed or GitHub release notifications. For OpenAI, the changelog lives at their platform status page. Combine automated detection with human monitoring for defense in depth. Store the token count of your canonical test prompt in your observability platform (Datadog, Grafana, etc.) and alert on any change greater than 5%.

# Python: Tokenizer drift regression test
import pytest
import tiktoken

MODEL = "gpt-4-turbo-0125"
EXPECTED_MAX_CONTEXT = 128_000
TOLERANCE_PERCENT = 0.05  # 5% drift tolerance


def get_canonical_test_prompt() -> str:
    """Generate a deterministic test prompt with known item count."""
    items = []
    for i in range(500):
        items.append(
            f"Item: INV-{i:04d} | Amount: 150.00 USD | "
            f"Desc: Monthly service fee | Acct: 4000-1000-{i:04d}"
        )
    return "Classify each line item:\n\n" + "\n".join(items)


class TestTokenizerDrift:
    """Regression tests to detect tokenizer or context window changes."""

    # Token count of the canonical prompt, recorded against the pinned
    # tokenizer version. (Illustrative value; re-record for your setup.)
    BASELINE_PROMPT_TOKENS = 17_500

    def test_tokenizer_has_not_drifted(self):
        """Verify the canonical prompt's token count matches the baseline.

        tiktoken exposes the tokenizer, not the model's context window,
        so we detect drift in the tokenizer itself.
        """
        encoding = tiktoken.encoding_for_model(MODEL)
        actual = len(encoding.encode(get_canonical_test_prompt()))
        drift = (abs(actual - self.BASELINE_PROMPT_TOKENS)
                 / self.BASELINE_PROMPT_TOKENS)

        assert drift <= TOLERANCE_PERCENT, (
            f"Tokenizer drift detected: expected ~{self.BASELINE_PROMPT_TOKENS} "
            f"tokens for the canonical prompt, got {actual} "
            f"(drift: {drift:.2%})"
        )

    def test_prompt_fits_within_budget(self):
        """Verify our canonical prompt fits within 80% of the context window."""
        encoding = tiktoken.encoding_for_model(MODEL)
        prompt = get_canonical_test_prompt()
        token_count = len(encoding.encode(prompt))
        budget = int(EXPECTED_MAX_CONTEXT * 0.80)

        assert token_count <= budget, (
            f"Canonical prompt exceeds token budget: "
            f"{token_count} tokens vs {budget} budget. "
            f"Tokenizer may have changed."
        )

    def test_response_covers_all_items(self, live_openai_client):
        """Integration test: verify the LLM response includes all 500 items.

        Assumes a `live_openai_client` pytest fixture (e.g., defined in
        conftest.py) that yields a configured OpenAI client.
        """
        prompt = get_canonical_test_prompt()
        response = live_openai_client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system",
                 "content": "List every item ID from the input."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=4096,
        )
        content = response.choices[0].message.content

        # Extract all item IDs from response
        found_ids = set()
        for i in range(500):
            if f"INV-{i:04d}" in content:
                found_ids.add(f"INV-{i:04d}")

        missing = 500 - len(found_ids)
        assert missing == 0, (
            f"LLM response missing {missing}/500 items. "
            f"Possible silent truncation."
        )


if __name__ == "__main__":
    pytest.main([__file__, "-v"])

Join the Discussion

This incident was not caused by a sophisticated bug or a complex race condition. It was caused by a silent default behavior in a widely used SDK combined with a lack of content-level validation. If your production system depends on LLM output, the question is not whether something like this can happen to you — it is whether you will detect it before your customers do.

Discussion Questions

  • Looking ahead: As context windows grow to 1M+ tokens, will the "silent truncation" problem disappear, or will it morph into new failure modes like attention degradation and lost-in-the-middle?
  • Trade-offs: Chunked processing with aggregation adds cost (multiple LLM calls per invoice) and latency. At what scale does the safety improvement justify the 2–3x cost increase? Is there a smarter middle ground?
  • Competing approaches: How do structured output APIs (like OpenAI's JSON mode or Anthropic's response format parameter) compare to the post-call content verification pattern we describe? Can they replace content checks entirely?

Frequently Asked Questions

How did the tokenizer change cause token count to increase?

In the gpt-4-turbo-0125 tokenizer update, several Unicode currency symbols that were previously encoded as single tokens were split into 2–3 subword tokens. Our invoices for APAC and LATAM clients used the Indian Rupee (₹), Bitcoin (₿), and Turkish Lira (₺) symbols extensively. Each invoice with these symbols gained roughly 10–15% more tokens. For invoices already near the context window boundary, this pushed them over the edge and triggered silent truncation.

Why didn't your monitoring catch this sooner?

We monitored token usage at the API level (total tokens per request) but not at the content coverage level. The requests succeeded with HTTP 200 codes. The responses were valid JSON. All our dashboards were green. The only signal was in the content of the responses — specifically, the presence or absence of individual line item IDs — and we were not tracking that. This is a textbook example of measuring proxies instead of outcomes.
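
One way to close that gap is to make content coverage a first-class metric. A minimal sketch, assuming a local StatsD agent and the Python statsd package (metric names here are illustrative):

# Emit content coverage as a metric, not just HTTP-level success.
# Assumes a StatsD agent on localhost; metric names are illustrative.
import statsd

stats = statsd.StatsClient("localhost", 8125, prefix="billing.llm")


def record_coverage(invoice_id: str, expected: int, found: int) -> None:
    """Track the fraction of input items present in the LLM response,
    so dashboards alert on silent drops, not just on 5xx errors."""
    coverage = found / expected if expected else 1.0
    stats.gauge("content_coverage", coverage)
    if found < expected:
        stats.incr("items_dropped", expected - found)

An alert on content_coverage dropping below 1.0 would have surfaced this incident on day one instead of day eleven.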

Could prompt caching have prevented this?

Prompt caching reduces cost and latency for repeated prompt prefixes, but it does not affect context window limits. A cached prompt still counts against the context window. In fact, caching might have made detection harder because cached responses return faster, reducing the chance that timeouts or truncation errors would surface. The fix must be at the content verification layer, not the caching layer.

Conclusion & Call to Action

The billing incident was a wake-up call. LLMs are powerful, but they are also lossy processing pipelines. They silently drop input, hallucinate output, and change behavior with model updates. If you are using LLMs in any system that processes financial data, healthcare records, legal documents, or any domain where missing an item has real consequences, you need three things: token-budget enforcement before the call, chunked processing when you exceed the budget, and content-level verification after the call. None of these are optional. None of them are hard to implement. The only excuse for not having them is not having had your own $47k incident yet.

Audit your LLM integrations today. Start with a single question: "If the response silently dropped 30% of its input, would we know?" If the answer is no, you have a bug waiting to happen.
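
A practical starting point is to replay a sample of logged request/response pairs through a coverage check such as the verify_billing_response function from Tip 2. A hypothetical sketch (the log path and record fields are illustrative):

# One-off audit: replay logged LLM calls through a coverage check.
# Assumes JSONL logs with hypothetical fields; adapt to your schema.
import json


def audit_log_file(path: str) -> None:
    flagged = 0
    with open(path) as f:
        for line in f:
            # record: {"invoice_id": ..., "sent_ids": [...], "response": "..."}
            record = json.loads(line)
            result = verify_billing_response(
                record["response"],
                set(record["sent_ids"]),
                record["invoice_id"])
            if not result.accepted:
                flagged += 1
                print(record["invoice_id"], result.reason)
    print(f"{flagged} suspicious responses found")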

$47,320: billing discrepancies from a single silent truncation bug
