Your Observability Is Looking at the Wrong Things

I've been in incident calls where every dashboard was green. Latency nominal. Error rate under 0.1%. CPU humming along at a comfortable 40%. And somewhere downstream, a critical workflow had been silently producing wrong results for six hours.

Nobody had an alert for "the thing is doing something, just not the right thing."

This is the gap most observability setups never close: they're watching the infrastructure, not the behavior. They'll tell you the system is alive. They won't tell you it's lying.

The Three Dials Everyone Watches

The default observability stack for most teams converges on the same three signals: uptime, latency, and error rate. These show up in every runbook, every SLA, every on-call rotation. They're not useless — a spike in error rate is real signal, a latency cliff is real signal — but they share a critical property: they're all lagging indicators of failure that's already happened.

More importantly, they only fire when the system is explicitly misbehaving. They say nothing about a system that's doing exactly what you told it to do, but where what you told it to do was wrong.

I had a recommendation service that returned results within 50ms, with a 0.02% error rate, and near-perfect uptime. It was also returning the same stale set of recommendations to every user because a cache invalidation job had silently stopped running four days earlier. The system was technically flawless. It had completely stopped serving its purpose.

The dashboard gave it a clean bill of health.

Logs Are Not a Narrative

The second failure mode is subtler. Most teams log well, in the sense that they log a lot. Request in. Response out. Exceptions caught and written somewhere. Database queries above a threshold. Auth events.

What they don't have is a narrative — a way to reconstruct what actually happened during a user's session, a job's execution, a transaction's lifecycle. Individual log lines are breadcrumbs. What you need is the trail.

The difference shows up immediately when something goes wrong. With breadcrumbs, you spend the first hour of an incident correlating timestamps across three different log streams, mentally assembling a sequence of events that should have been assembled for you. With a trail — structured traces with a shared correlation ID flowing through every service that touched a request — you open one query and see the story.

import uuid
import logging
import functools
from contextvars import ContextVar

logger = logging.getLogger(__name__)
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        cid = correlation_id.get()
        logger.info(
            "enter",
            extra={"fn": fn.__name__, "correlation_id": cid}
        )
        try:
            result = fn(*args, **kwargs)
            logger.info(
                "exit",
                extra={"fn": fn.__name__, "correlation_id": cid, "status": "ok"}
            )
            return result
        except Exception as e:
            logger.error(
                "error",
                extra={"fn": fn.__name__, "correlation_id": cid, "error": str(e)}
            )
            raise
    return wrapper

# At the edge — set once, propagate everywhere
def handle_request(request):
    correlation_id.set(request.headers.get("X-Correlation-ID") or str(uuid.uuid4()))
    return process(request)
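
With that in place, usage is just decoration: every function on the request path gets @traced, and because the correlation ID lives in a ContextVar, nothing has to pass it around explicitly. A minimal sketch with illustrative function names, reusing the decorator above:

@traced
def load_cart(user_id):
    return {"user_id": user_id, "items": []}

@traced
def price_cart(cart):
    return {"total": 0, "item_count": len(cart["items"])}

@traced
def process(request):
    # Both calls below log with the correlation ID set in handle_request.
    cart = load_cart(request.headers.get("X-User-ID", "anonymous"))
    return price_cart(cart)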

This is not complicated. It's not expensive. The reason most teams don't have it is that they added logging incrementally — one print statement at a time — and never stepped back to ask whether the sum of those statements could tell a story.

Metrics Without a Baseline Are Just Numbers

Here's a metric: your API is returning responses in 340ms.

Is that good? Bad? Degraded from yesterday? Normal for this time of week? You cannot answer without a baseline, and most teams don't have one that's precise enough to be useful.

What typically exists is a static threshold: alert if latency exceeds 500ms. That threshold was set during initial deployment, when load was a tenth of what it is now, and hasn't been revisited since. It's not a baseline — it's a guess that calcified into a rule.

A real baseline is dynamic. It accounts for time of day, day of week, and recent trend. It flags when you're 30% above your own normal, not when you cross an arbitrary line someone set two years ago.

from collections import deque
from statistics import mean, stdev
from datetime import datetime

class AdaptiveBaseline:
    """Rolling-window baseline: flags values far outside the system's own recent normal.

    This minimal version uses a single 24h window; bucketing samples by hour of
    day or day of week would capture the seasonality described above.
    """

    def __init__(self, window_size=1440):  # 24h of per-minute samples
        self.samples = deque(maxlen=window_size)
        self.samples = deque(maxlen=window_size)

    def record(self, value: float):
        self.samples.append((datetime.utcnow(), value))

    def is_anomalous(self, value: float, threshold_stdev: float = 2.5) -> bool:
        if len(self.samples) < 60:
            return False  # not enough data to have an opinion
        recent = [v for _, v in self.samples]
        m = mean(recent)
        s = stdev(recent)
        if s == 0:
            return False
        return abs(value - m) > threshold_stdev * s

    def summary(self) -> dict:
        if not self.samples:
            return {}
        values = [v for _, v in self.samples]
        # stdev needs at least two samples to be defined
        s = stdev(values) if len(values) > 1 else 0.0
        return {"mean": mean(values), "stdev": s, "n": len(values)}
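
Feeding the baseline is straightforward: record one sample per minute, and check each new value against the window before adding it. A short sketch, assuming a per-minute p95 latency sample and a hypothetical notify_oncall helper:

baseline = AdaptiveBaseline()

def on_minute_tick(latency_p95_ms: float):
    # Compare against the existing window first, then fold the new sample in.
    if baseline.is_anomalous(latency_p95_ms):
        notify_oncall("p95 latency outside its own recent normal", baseline.summary())  # hypothetical notifier
    baseline.record(latency_p95_ms)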

Static thresholds are a lazy stand-in for understanding your system's normal. They exist because setting them takes five minutes, and building real baselines takes an afternoon. That tradeoff looks different at 2am when an alert fires on a load pattern that's been there for three weeks.

What Actually Belongs in Your Dashboards

The signals that matter fall into a different category than infrastructure health. They're about whether the system is doing its job, measured in terms the business cares about.

Throughput on the critical path. Not "requests per second" in aggregate — the specific count of the transactions that matter. Orders placed. Reports generated. Messages delivered. If that number is lower than expected, something is wrong, even if all your infra metrics are green.

Queue depth and processing age. If you have async workers, the age of the oldest unprocessed item is a more honest health signal than worker CPU. A queue that's growing is a system falling behind, regardless of what the workers themselves are reporting.

Business-level error rates, not HTTP error rates. A 200 response that returns an empty result set is not a success. A job that completes without exception but produces zero output has failed. You need to define success in terms of what the system was supposed to produce, then measure whether it produced it; the instrumentation sketch below shows one way to do that.

Derivative metrics. If your checkout conversion rate drops from 68% to 51%, that's a signal — even if no individual service is throwing errors. Tracking rates and ratios, not just raw counts, catches the class of failures where something is working but working worse.
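
These signals start life as ordinary application counters. A minimal sketch using prometheus_client (assuming that client library and a hypothetical place_order call); the counter names line up with the recording rules below:

from prometheus_client import Counter

checkout_initiated = Counter("checkout_initiated_total", "Checkouts started")
checkouts_completed = Counter("checkouts_completed_total", "Checkouts that produced an order")
orders_completed = Counter("orders_completed_total", "Orders successfully placed")

def handle_checkout(cart):
    checkout_initiated.inc()
    result = place_order(cart)  # hypothetical downstream call
    # Success is defined by what was produced, not by the HTTP status returned.
    if result and result.get("order_id"):
        checkouts_completed.inc()
        orders_completed.inc()
    return result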

# Prometheus recording rules — compute these, don't query them live
groups:
  - name: business_health
    interval: 60s
    rules:
      - record: job:orders_per_minute:rate
        expr: rate(orders_completed_total[5m]) * 60

      - record: job:checkout_conversion:ratio
        expr: |
          rate(checkouts_completed_total[10m])
          / rate(checkout_initiated_total[10m])

      - record: job:queue_age_seconds:max
        expr: time() - min(job_enqueued_timestamp_seconds)

  - name: alerts
    rules:
      - alert: ConversionRateDrop
        expr: job:checkout_conversion:ratio < 0.55
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Checkout conversion below 55% for 5+ minutes"

      - alert: QueueProcessingStalled
        expr: job:queue_age_seconds:max > 300
        for: 2m
        labels:
          severity: warning

Alerts Should Be Harder to Silence Than to Fix

The last thing most teams get wrong is the incentive structure around noise. When alerts fire too often on non-issues, engineers start ignoring them — or worse, start routing around them. The standard fix is to raise thresholds and add retry logic so the alert doesn't fire. This is treating the symptom. The alert was lying because the metric was wrong, and the right fix is to measure something that's actually meaningful.

There's a useful rule here: if an alert fired and the on-call engineer's first instinct was to check whether it was a false positive, the alert is already broken. A good alert should produce a specific, directed response — not a "let me see if this is real" investigation. If you find yourself constantly confirming that real alerts are real, your signal-to-noise ratio is telling you something.

Flaky alerts are the observability equivalent of flaky tests. You know you have them. You've learned to distrust them. And every week they stay in the rotation makes you slightly less responsive to the ones that actually matter.

Track your alert false-positive rate like you track your error rate. Alert on your alerts. Set a rule that any alert firing more than twice without a corresponding incident review gets flagged for audit. This sounds bureaucratic until the first time you catch that a critical alert has been misfiring for three weeks and nobody noticed because the team had learned to dismiss it.
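
One low-tech way to enforce that rule is a periodic job that cross-references alert firings with incident reviews. A sketch over hypothetical data shapes: a list of fired alert names and a set of alert names that have had a review:

from collections import Counter

def alerts_needing_audit(firings, reviewed, max_unreviewed=2):
    """Flag alerts that fired more than max_unreviewed times with no incident review."""
    counts = Counter(firings)
    return [name for name, n in counts.items()
            if n > max_unreviewed and name not in reviewed]

# Example: ConversionRateDrop fired three times and was never reviewed, so it gets flagged.
flagged = alerts_needing_audit(
    firings=["ConversionRateDrop"] * 3 + ["QueueProcessingStalled"],
    reviewed={"QueueProcessingStalled"},
)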

What You're Actually Missing

Most observability stacks are built to answer one question: is the system up? That's a fine question. It's just not the most important one.

The more useful questions are: is the system doing what users need? Is it doing it as well as it was yesterday? Is anything changing that I should know about before it becomes a problem?

Those questions require measuring at the level of behavior and outcome, not infrastructure and response codes. They require traces that tell a story instead of logs that record events. They require baselines instead of thresholds, and business metrics instead of system metrics.

None of this is exotic. The tooling exists — OpenTelemetry, Prometheus recording rules, structured logging with correlation IDs. The gap isn't tooling. It's the habit of reaching for the infrastructure dashboard first and calling it observability.

Start with one question: if your system silently started doing the wrong thing at 3am, how long would it take you to find out? If the answer is "until a user complained," your dashboards are watching the machine, not the work.

That's the thing worth fixing.
