DEV Community

Timevolt

Catch Problems Before They Catch You: Practical Monitoring & Logging

Quick context

Here's the thing: a few months ago I pushed a tiny change to our payment API that looked harmless—just a new validation rule. The CI passed, the smoke tests were green, and I went home feeling good. Two hours later our support channel started flooding with messages: “I can’t checkout, the spinner never stops.” Turns out the validation was silently swallowing a specific edge case and the service kept retrying forever, chewing up CPU and blowing up our latency SLA. No one noticed until users were already angry.

That sucked. It reminded me that having tests isn’t enough; you need visibility into what your code is actually doing in the wild. If you’re waiting for a user to tweet about a problem, you’re already behind the curve.

The Insight

What I learned (the hard way) is that monitoring and logging aren’t just “nice‑to‑have” add‑ons; they’re the early‑warning system that lets you spot anomalies before they become user‑visible incidents.

  • Metrics give you a quantitative pulse—latency, error rates, throughput—so you can see trends and set alerts that fire before the SLA is breached.
  • Structured logging gives you the qualitative context: request IDs, user IDs, payload snippets, and the exact path through your code. When a metric spikes, you can jump straight to the relevant logs without guessing.
  • Correlation between the two is where the magic happens. A sudden rise in error rate paired with a specific log field (say, payment_method: "gift_card") points you straight at the root cause.
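
To make the correlation step concrete, here is a minimal sketch of grouping error logs by a single field. The records and the `errors_by_field` helper are hypothetical, invented for illustration; in practice the records would come from your log store's query API.

```python
from collections import Counter

# Hypothetical parsed log records; real ones would come from your log store.
records = [
    {"status": 200, "payment_method": "card"},
    {"status": 500, "payment_method": "gift_card"},
    {"status": 500, "payment_method": "gift_card"},
    {"status": 200, "payment_method": "gift_card"},
    {"status": 500, "payment_method": "card"},
]


def errors_by_field(records, field):
    """Count 5xx log records grouped by the value of one log field."""
    return Counter(r.get(field, "unknown") for r in records if r["status"] >= 500)


print(errors_by_field(records, "payment_method").most_common())
# [('gift_card', 2), ('card', 1)]
```

When one value dominates the error count the way `gift_card` does here, that is usually where to start digging.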

If you treat observability as an afterthought, you’ll spend your weekends firefighting. If you bake it in from the start, you’ll catch the weird edge cases while they’re still just a blip on a graph.

How (with code)

Below is a realistic example in Python using FastAPI, the Prometheus Python client (prometheus_client), and structlog. I’ll show the right way and then point out a common mistake I see all the time.

1. Instrument a handler with metrics and structured logs

# app/main.py
from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import structlog
import time

app = FastAPI()
log = structlog.get_logger()

# Metrics
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "http_status"]
)

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "endpoint"]
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    response = await call_next(request)
    process_time = time.perf_counter() - start

    # NOTE: read the route *after* call_next, once routing has run. FastAPI's
    # router stores the matched route in the ASGI scope; its template
    # (e.g. "/orders/{order_id}") keeps label cardinality low, unlike the raw
    # URL path. Requests that match no route share a single "unmatched" label.
    route = request.scope.get("route")
    endpoint = getattr(route, "path", "unmatched")
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=endpoint,
        http_status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=endpoint
    ).observe(process_time)

    # Structured log entry: includes the request ID for traceability
    log.info(
        "request completed",
        method=request.method,
        endpoint=endpoint,
        status=response.status_code,
        duration_ms=round(process_time * 1000, 2),
        request_id=request.headers.get("x-request-id", "unknown")
    )
    return response

@app.get("/metrics")
def metrics():
    # FastAPI would try to JSON-encode a Flask-style (body, status, headers)
    # tuple, so return an explicit Response with the Prometheus content type.
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

What’s happening here?

  • The middleware wraps every request and records a counter (labeled by method, endpoint, http_status) and a latency histogram (labeled by method, endpoint).
  • After the request finishes we emit a single structured log line that carries the same dimensions plus a request ID, read from the incoming x-request-id header (set by the client or an upstream proxy) and falling back to "unknown" when the header is missing.
  • The /metrics endpoint exposes Prometheus‑scrapable data.
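
The request ID only helps if every request actually has one. Here is a minimal standard-library sketch of the "propagated or generated" idea. `resolve_request_id` is a hypothetical helper, not part of FastAPI or structlog: it reuses the incoming ID when present and otherwise mints a fresh one, so unrelated requests never share the same placeholder value.

```python
import uuid
from collections.abc import Mapping


def resolve_request_id(headers: Mapping[str, str], header_name: str = "x-request-id") -> str:
    """Return the caller-supplied request ID, or generate one if it is missing."""
    incoming = headers.get(header_name, "").strip()
    # Reuse the upstream ID so one request can be traced across services;
    # otherwise fall back to a random 32-character hex string.
    return incoming or uuid.uuid4().hex


print(resolve_request_id({"x-request-id": "a1b2c3d4"}))  # reuses the incoming ID
print(resolve_request_id({}))  # generates a new random ID
```

In the middleware you could call it as `resolve_request_id(request.headers)` and echo the result back in a response header, so callers can quote the ID in bug reports.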

2. A common mistake: missing or high‑cardinality labels

I’ve seen developers do something like this:

# ❌ Bad: no labels at all, so every route and status code is lumped into one series
REQUEST_COUNT = Counter("http_requests_total", "Total HTTP requests")

Or they label with something that changes per request, like the full URL or a user ID:

# ❌ Bad – creates a metric series per user → Prometheus blows up
REQUEST_COUNT.labels(method=request.method, endpoint=request.url.path, user_id=str(user.id)).inc()

Why it hurts:

  • Without an endpoint label you can’t tell which route is slow or error‑prone; you just see a lump sum.
  • High‑cardinality labels (user IDs, UUIDs, raw query strings) cause Prometheus to create millions of time series, chewing memory and making queries unusable.

The fix? Stick to low‑cardinality, meaningful dimensions (method, endpoint, status code, maybe a version or tenant ID if you have few of them). Keep the rich, high‑cardinality stuff in your logs where it’s cheap to store and search.
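
A quick back-of-the-envelope check shows why. The worst-case number of time series for one metric is the product of the distinct values each label can take. The figures below are made up for illustration, and `estimate_series` is a hypothetical helper, not a Prometheus API.

```python
import math


def estimate_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())


# Hypothetical numbers for illustration only.
safe = {"method": 5, "endpoint": 20, "http_status": 10}
risky = {**safe, "user_id": 100_000}

print(estimate_series(safe))   # 1000
print(estimate_series(risky))  # 100000000
```

Prometheus only creates series for label combinations it actually observes, so real counts are lower than this bound. A histogram multiplies the number again by its bucket count.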

3. Using the logs to debug a metric alert

Imagine our alert fires when the p95 latency derived from http_request_duration_seconds for endpoint="/payments/charge" stays above 500 ms for 5 minutes. You jump into your log UI (Loki, Elasticsearch, etc.) and run:

{app="payment-api"} | json | endpoint="/payments/charge" | duration_ms > 500

You’ll see lines like:

{
  "method": "POST",
  "endpoint": "/payments/charge",
  "status": 200,
  "duration_ms": 842,
  "request_id": "a1b2c3d4",
  "user_id": "12345",
  "payment_method": "gift_card",
  "error": "invalid_expiry_date"
}

Now you know the latency spike is tied to gift‑card payments failing expiry validation—a clue you’d never get from metrics alone.
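
On the metrics side, latency alerts on a Prometheus histogram are usually written against an estimated percentile, not a raw value. Below is a rough sketch of how such an estimate can be derived from cumulative buckets, using linear interpolation inside the bucket that contains the target rank. It follows the same idea as PromQL's `histogram_quantile` but skips its edge cases, and the bucket bounds and counts are made up for illustration.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by upper bound
    and ending with the +Inf bucket, mirroring Prometheus `_bucket` series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The target falls beyond the largest finite bucket.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up bucket counts for illustration (upper bounds in seconds).
buckets = [(0.1, 50), (0.25, 80), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, buckets)
print(round(p95, 4))  # 0.8125
print(p95 > 0.5)      # True: this would breach a 500 ms threshold
```

The estimate is only as precise as the bucket boundaries, so it pays to place a boundary at or near your SLA threshold.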

Why This Matters

When you have this kind of visibility, you shift from “reacting to user complaints” to “spotting anomalies before they become complaints.”

  • Lower MTTR: Alerts fire on metric thresholds; logs give you the exact context to fix the issue.
  • Fewer fire drills: You catch regressions in staging or canary releases before they hit 100% of traffic.
  • Confidence to ship: Knowing you’ll see a problem early lets you move faster, experiment more, and sleep better.

It’s not about buying the fanciest SaaS; it’s about instrumenting your code with a few lines that pay off every time something goes sideways.

Your Turn

Pick one endpoint in your service that currently has no custom metrics or structured logs. Add a counter, a histogram (or timer), and a single log line that includes the request ID and any business‑relevant fields (like payment_method, order_type, etc.). Deploy it to a staging environment, generate some traffic, and watch the metrics dashboard light up.

What did you notice that you weren’t seeing before? Drop a comment or a tweet—I’d love to hear what you uncovered. Happy observing!
