How I Lost $2000 in One Night Because My LLM App Had No Observability

#ai #python #devops #discuss

Last month, I spent a sleepless night watching my startup's OpenAI bill spike by $2,000 in a single evening. The worst part? I had no idea why. No traces, no logs beyond raw text, no way to tell which user query triggered a 100,000-token monster. That night, I learned the hard way that building an LLM-powered app without observability is like flying a plane without instruments.

I'm sharing this post-mortem so you can avoid my mistake. I'll walk through what went wrong, what I tried that didn't work, and finally what saved us: adding structured tracing to every LLM call. I'll show you real Python code you can copy-paste today.

The Incident: An API Call Gone Rogue

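We run a document summarization service using GPT-4. Each user uploads a PDF, we chunk it, and send each chunk to the chat completions API for summarization. Typical request: ~4,000 input tokens, ~1,000 output tokens. Cost per request: ~$0.18 at the GPT-4 rates used in the code below ($0.03 per 1K input tokens, $0.06 per 1K output tokens).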

One evening, our monitoring (just Datadog for HTTP 200s) showed normal traffic. But the next morning, our AWS bill — and OpenAI usage dashboard — told a different story. At 2 AM, a single user session sent 47 requests. Each request used over 80,000 input tokens. Total spend that night: $2,100.

Why? Our chunking logic had a bug. A PDF with a malformed table caused an infinite loop that kept appending the same text to the prompt. The user got a timeout, retried, and the loop kept eating tokens. Without per-request token counts, we never saw it until the bill arrived.
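
A cheap pre-flight guard would have turned this kind of runaway prompt into a fast failure instead of a bill. The following is a minimal sketch, not the fix we shipped: the names (build_prompt, PromptTooLargeError), the 8,000-token budget, and the 4-characters-per-token ratio are all assumptions for illustration, and the ratio is only a coarse rule of thumb for English text rather than a real tokenizer.

```python
from typing import Iterable

# Assumption: ~4 characters per token is a rough heuristic, not a tokenizer.
CHARS_PER_TOKEN = 4
# Hypothetical budget; tune it to your model and use case.
MAX_INPUT_TOKENS = 8_000


class PromptTooLargeError(ValueError):
    """Raised when an assembled prompt exceeds the input budget."""


def estimate_tokens(text: str) -> int:
    """Return a rough, slightly pessimistic token estimate for text."""
    return len(text) // CHARS_PER_TOKEN + 1


def build_prompt(chunks: Iterable[str], max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Join chunks into one prompt, failing fast once the budget is exceeded."""
    parts = []
    used = 0
    for chunk in chunks:
        used += estimate_tokens(chunk)
        if used > max_tokens:
            # Stops a buggy (even endless) chunk source before any API call is made.
            raise PromptTooLargeError(
                f"prompt needs ~{used} tokens, budget is {max_tokens}"
            )
        parts.append(chunk)
    return "\n\n".join(parts)
```

Because the check runs inside the loop, even a chunk source that never terminates gets cut off after a handful of iterations.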

What I Tried That Didn't Work

First, I added naive logging: print(f"Input tokens: {len(prompt)}"). But that only gave raw character counts, not token counts. Worse, it flooded our logs and didn't correlate with request IDs.

Next, I tried parsing OpenAI's API response JSON and storing it in a SQLite table. That worked for a few hours, but then I had to query across multiple tables to find slow or expensive calls. No aggregation. No graphing. I was back to manual SELECT * queries.

I even considered adding a custom middleware to capture every API call, but that felt hacky and didn't address the deeper need: structured, correlated events.
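
To make "structured, correlated events" concrete, here is a minimal sketch of one JSON log line per LLM call, using only the standard library. The logger name, field names, and the log_llm_event helper are hypothetical; the point is that every line carries the request ID and token counts as separate, queryable fields.

```python
import json
import logging
import sys

logger = logging.getLogger("llm_events")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))


def log_llm_event(request_id, user_id, input_tokens, output_tokens, latency_ms):
    """Emit one JSON object per LLM call and return the serialized line."""
    event = {
        "event": "llm_call",
        "request_id": request_id,  # lets you join this line with HTTP logs
        "user_id": user_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
    }
    line = json.dumps(event, sort_keys=True)
    logger.info(line)
    return line


if __name__ == "__main__":
    log_llm_event("req-123", "user-42", 4000, 1000, 812.5)
```

This is still only logging, but unlike the print() approach, a log backend can filter and aggregate on these fields.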

What Finally Worked: OpenTelemetry-Based Tracing

The breakthrough was treating every LLM call as a span in a distributed trace. I used OpenTelemetry to instrument both the HTTP request to OpenAI and my application logic (chunking, summarization, etc.). This gave me:

Token usage per request (input/output)
Latency per step (chunking vs. API call vs. post-processing)
Correlation between user session, request ID, and error context
Cost estimation (since I could compute cost = input_tokens * input_rate + output_tokens * output_rate)
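
The cost estimate is simple arithmetic, so it is worth isolating in one small function. This is a sketch under assumptions: the rate table and the estimate_cost_usd name are illustrative, and the gpt-4 figures are the per-1K-token rates used in this post's instrumentation code, which you should check against current pricing.

```python
# USD per 1,000 tokens. Assumption: these are the GPT-4 rates quoted in this
# post; verify against current pricing before relying on them.
RATES_PER_1K = {
    "gpt-4": {"input": 0.03, "output": 0.06},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one call from its token counts."""
    rates = RATES_PER_1K[model]  # raises KeyError for models without a rate entry
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1000


if __name__ == "__main__":
    # An 80,000-token prompt with a 1,000-token reply comes to about $2.46.
    print(round(estimate_cost_usd("gpt-4", 80_000, 1_000), 2))
```

Keeping the rates in one table means a pricing change is a one-line edit rather than a hunt through the instrumentation code.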

Here's the core instrumentation pattern I now use (Python with openai, opentelemetry-sdk, and opentelemetry-exporter-otlp-proto-http):

import time

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
import openai

# Set up tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def call_llm_with_tracing(prompt, user_id=None):
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("user.id", user_id or "anonymous")
        span.set_attribute("prompt.length", len(prompt))

        # Capture before
        start = time.time()

        # openai>=1.0 call style; the older openai.ChatCompletion.create was removed in 1.0
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        # Capture after
        latency = time.time() - start
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        cost = input_tokens * 0.03/1000 + output_tokens * 0.06/1000  # GPT-4 rates

        span.set_attribute("llm.latency_ms", latency * 1000)
        span.set_attribute("llm.token.input", input_tokens)
        span.set_attribute("llm.token.output", output_tokens)
        span.set_attribute("llm.cost_usd", cost)
        # finish_reason ("stop", "length", ...) is not an HTTP status, so it gets its own attribute
        span.set_attribute("llm.finish_reason", response.choices[0].finish_reason)

        # Also log the response snippet (be careful with PII)
        if output_tokens < 500:  # only log short responses
            content = response.choices[0].message.content or ""
            span.set_attribute("llm.response_snippet", content[:200])

        return response

With spans exported to a backend like Jaeger, Grafana Tempo, or even a lightweight service like Observe (I found that one while searching for simple OTLP receivers), I could now filter by cost > $1 or latency > 30s and immediately see the root cause. That infinite loop showed up as a single span with 80k input tokens — obvious once you look at it.

Building the Dashboard

Once traces were flowing, I set up a simple Grafana dashboard with:

Cost by user (bar chart, top 10 spenders)
Average latency over time (time series, broken down by model)
Token usage distribution (histogram of input tokens per request)
Error rate (requests where finish_reason wasn't 'stop', which usually signals a truncated or filtered response rather than a clean completion)

Detecting hallucination patterns is tricky, but I found a heuristic: requests where finish_reason is 'length' (the output hit the token limit and was cut off) seem to carry a higher hallucination risk. It's a rough proxy, not a detector. We added an alert for any user exceeding 5 truncated responses in 5 minutes.
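
The alert rule itself is a sliding-window count per user. Here is a minimal in-memory sketch of that logic; the TruncationMonitor class is hypothetical, and in practice this rule could equally live in your alerting backend rather than in application code.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60  # 5-minute window, as in the alert described above
MAX_TRUNCATED = 5        # alert when a user exceeds this many truncations


class TruncationMonitor:
    """Track truncated responses per user over a sliding time window."""

    def __init__(self, window_seconds=WINDOW_SECONDS, threshold=MAX_TRUNCATED):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)  # user_id -> timestamps of truncations

    def record(self, user_id, finish_reason, now):
        """Record one response; return True if this user should trigger an alert."""
        if finish_reason != "length":
            return False
        events = self._events[user_id]
        events.append(now)
        # Drop truncations that have fallen out of the window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.threshold
```

Passing the timestamp in explicitly (for example, time.monotonic() in production) keeps the class easy to test with fake clocks.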

Lessons Learned & Trade-offs

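OpenTelemetry is powerful but has a learning curve. Getting sampling right is critical: you don't want to trace every single request if you handle millions. We sample 1 in 100 requests but always keep errors. Strictly speaking, only the 1-in-100 part is head-based sampling; keeping every error is a tail-based decision, because a head sampler commits before it knows how the request ends, so that rule has to run after the span finishes (for example, in the OpenTelemetry Collector's tail sampling processor).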
Cost tracking is never fully accurate unless you track token counts at the API call level. OpenAI's pricing varies by model and region, but even a rough estimate saved us from surprises.
Tracing adds latency. The instrumentation itself is cheap (microseconds), but shipping spans to an exporter can block the request path if it happens synchronously. Use BatchSpanProcessor, which queues spans and exports them from a background thread.
When NOT to use this: For very high-throughput apps (>10k req/s), you'll need aggressive sampling, or you should switch to metric-based monitoring (e.g., Prometheus counters) instead of full traces. Also, if you're prototyping, print() is fine for quick debugging, but add structured logging from day one.
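
The "sample 1 in 100, but always keep errors" rule can be sketched as a single decision function. This is plain Python illustrating the logic, not the OpenTelemetry sampler API, and the keep_trace name is hypothetical. Because it needs to know whether the request failed, the decision has to run after the request finishes.

```python
import hashlib

SAMPLE_ONE_IN = 100  # keep roughly 1% of successful traces


def keep_trace(trace_id: str, is_error: bool, sample_one_in: int = SAMPLE_ONE_IN) -> bool:
    """Decide whether to keep a finished trace."""
    if is_error:
        return True  # errors are always kept
    # Hash the trace ID so the same trace always gets the same decision,
    # no matter which process or host evaluates it.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % sample_one_in
    return bucket == 0
```

Hashing the trace ID instead of calling a random number generator keeps the decision consistent across services, so a trace is either kept everywhere or dropped everywhere.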

What I'd Do Differently Next Time

I'd implement this before the first production deployment. Seriously. The cost of retrofitting observability was two weeks of refactoring and one very expensive night. I'd also set up budget alerts on the OpenAI usage dashboard (they have them, but they're email-only — we missed ours because it went to spam).

So, what's your setup look like? Are you tracing your LLM calls, or just winging it with logs? I'd love to hear war stories from others who've been burned by invisible cost monsters.