🎬 The 3 AM Phone Call You're Not Prepared For
PagerDuty, 3:14 AM:
CRITICAL: payment-service error rate > 5%
You open your laptop. You open Grafana. You stare at 47 dashboards with 312 panels. Nothing looks obviously wrong. CPU is fine. Memory is fine. Pods are running.
You open the logs. There are 3.2 million log lines from the last hour. You search for "error." 47,000 results.
You are drowning in data but have zero information.
This is the difference between monitoring and observability, and it's why most teams are flying blind.
🔍 Monitoring vs. Observability: The Key Difference
Monitoring answers: "Is it broken?"
Observability answers: "WHY is it broken?"
Monitoring: Pre-defined dashboards for known problems
→ CPU high? Alert. Disk full? Alert.
→ Great for problems you've seen before.
Observability: The ability to ask ANY question about your system
→ "Why are requests from Germany 3x slower?"
→ "Which specific deployment caused the error spike?"
→ "What's different about the failing requests?"
→ Great for problems you've NEVER seen before.
At the Principal level, you need both. Monitoring catches the known issues automatically. Observability lets you debug the novel failures that wake you up at 3 AM.
📐 The Three Pillars (And How They Work Together)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ METRICS │ │ LOGS │ │ TRACES │
│ │ │ │ │ │
│ "WHAT is │ │ "WHAT │ │ "HOW does a │
│ happening?" │ │ happened?" │ │ request │
│ │ │ │ │ flow?" │
│ Numbers over │ │ Text events │ │ │
│ time │ │ with context │ │ Spans across │
│ │ │ │ │ services │
│ Cheap to │ │ Expensive │ │ Shows the │
│ store │ │ at scale │ │ full journey │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└───────────────────┼───────────────────┘
│
Trace ID links them all
"Error rate spiked at 14:32" ← Metric tells you WHEN
"timeout connecting to DB" ← Log tells you WHAT
"DB call took 30s (timeout: 5s)" ← Trace tells you WHERE & WHY
The magic happens when all three are correlated by a trace ID. One ID connects the metric spike, the error log, and the slow database call. Without correlation, you're playing detective with missing evidence.
📊 Metrics: The Numbers That Actually Matter
The Two Frameworks You Need
RED Method (for your services — anything handling requests):
R — Rate: How many requests per second?
E — Errors: How many of those requests are failing?
D — Duration: How long do requests take? (p50, p95, p99)
USE Method (for your infrastructure — CPU, memory, disk, network):
U — Utilization: How busy is it? (% used)
S — Saturation: Is there a queue? (waiting work)
E — Errors: Any hardware/resource errors?
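To make RED concrete, here's a toy in-process tracker for the three signals. This is a sketch for illustration only (the `RedTracker` class name is mine); a real service would export these as Prometheus metrics via a client library rather than computing them in-process:

```python
import math
import time
from collections import deque

class RedTracker:
    """Toy in-process RED tracker (hypothetical class, for illustration).
    A real service would export these as Prometheus metrics instead."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, duration_ms, is_error)

    def record(self, duration_ms, is_error=False, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, duration_ms, is_error))
        # Drop samples that have aged out of the window
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()

    def snapshot(self, now=None):
        now = time.monotonic() if now is None else now
        recent = [s for s in self.samples if s[0] >= now - self.window]
        if not recent:
            return {"rate": 0.0, "error_pct": 0.0, "p99_ms": 0.0}
        durations = sorted(d for _, d, _ in recent)
        errors = sum(1 for _, _, e in recent if e)
        # Nearest-rank p99 over the window
        p99 = durations[min(len(durations) - 1, math.ceil(len(durations) * 0.99) - 1)]
        return {
            "rate": len(recent) / self.window,          # R: requests per second
            "error_pct": 100.0 * errors / len(recent),  # E: % of requests failing
            "p99_ms": p99,                              # D: tail latency
        }
```

Call `record()` on every request and feed `snapshot()` to a dashboard; the point is that all three RED numbers come from the same stream of request events.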
The Metrics That Actually Predict Outages
🚨 These metrics predict problems BEFORE users complain:
1. Error rate trending up (even 0.1% → 0.5% is a red flag)
2. p99 latency increasing (even if p50 looks fine)
3. Request queue depth growing
4. Pod restart count > 0 in last hour
5. Memory usage trending upward over days (memory leak!)
6. Connection pool exhaustion approaching
7. Disk I/O wait time increasing
Real PromQL Queries You'll Actually Use
# Request rate (requests per second)
rate(http_requests_total[5m])
# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# p99 latency
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
)
# Pod restart count (something is crashing!)
increase(kube_pod_container_status_restarts_total[1h]) > 0
# Memory usage trending (catch leaks early)
predict_linear(
container_memory_working_set_bytes{pod=~"payment.*"}[6h],
3600 * 4
) > 1.5e9
# "If memory keeps growing at this rate, will it exceed 1.5GB in 4 hours?"
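If `predict_linear` feels like magic, here's roughly what it computes: a least-squares line through the (timestamp, value) samples, extrapolated forward. This Python sketch is conceptual, not Prometheus's actual implementation (Prometheus extrapolates from the query's evaluation timestamp):

```python
def predict_linear(samples, seconds_ahead):
    """Conceptual sketch of PromQL's predict_linear: fit a least-squares line
    through (timestamp_seconds, value) samples, then extrapolate forward."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept

# Memory growing 1 MB/minute, sampled once a minute for 6 minutes
samples = [(60 * i, 1_000_000_000 + i * 1_000_000) for i in range(7)]
projected = predict_linear(samples, 3600 * 4)  # where will we be in 4 hours?
print(projected)  # about 1.246e9: past 1.2 GB on the current trend
```

A leak growing by 1 MB/minute looks harmless on a point-in-time memory panel; the extrapolation is what turns it into an alert hours before the OOM kill.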
🚨 Real-World Disaster #1: The p50 Was Fine, But Everything Was Broken
The Dashboard: Average response time: 45ms. Looks great! 👍
The Reality:
p50 (median): 45ms ← What the dashboard showed
p95: 200ms ← 5% of users waited 4x longer
p99: 2,800ms ← 1% of users waited almost 3 seconds
p99.9: 12,000ms ← these users waited 12 seconds and gave up
What Happened: A database query had no index on a commonly filtered column. Most queries hit the cache (fast). But 1-5% missed the cache and did a full table scan (slow). The average hid the pain completely because 95% of requests were fast.
The Fix:
- Never use averages for latency dashboards. Always show p50, p95, p99.
- Add the slow query to database monitoring
- Created the missing index (latency dropped from 2.8s to 12ms for affected queries)
# Dashboard panel: Show ALL percentiles, not just average
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m])) # p50
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) # p95
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) # p99
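You can reproduce the whole disaster in a few lines. This sketch uses a simple nearest-rank percentile (the `percentile` helper is mine, not a library function) on a synthetic latency distribution shaped like the one above:

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile: a simple illustrative implementation."""
    ordered = sorted(values)
    rank = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[rank]

# Recreate the disaster: 99% cache hits at 45 ms, 1% table scans at 2,800 ms
latencies = [45] * 9_900 + [2_800] * 100

print("mean:", statistics.mean(latencies), "ms")  # 72.55 -- looks tolerable
print("p50: ", percentile(latencies, 50), "ms")   # 45 -- looks great
print("p99: ", percentile(latencies, 99), "ms")   # 2800 -- the hidden pain
```

One user in a hundred waits 62x longer than the median, and the mean barely flinches. That's why the latency panel shows percentiles, never averages.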
📝 Logging: Stop Logging Everything, Start Logging Smart
The Structured Logging Commandments
❌ BAD: Unstructured logs
"User 12345 failed to login from 192.168.1.1 at 2026-03-18T10:30:00Z"
✅ GOOD: Structured JSON logs
{
"timestamp": "2026-03-18T10:30:00.123Z",
"level": "warn",
"message": "Login failed",
"userId": "12345",
"sourceIp": "192.168.1.1",
"reason": "invalid_password",
"attemptCount": 3,
"traceId": "abc123def456",
"service": "auth-service",
"version": "v2.1.0"
}
Why structured? Because at 3 AM, searching for "reason": "invalid_password" is a billion times easier than grep-ing through text for "failed."
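Emitting structured logs doesn't require a heavyweight framework. Here's a stdlib-only sketch using Python's `logging` module; in production you'd more likely reach for a library like structlog or python-json-logger, which handle edge cases this sketch ignores:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object per line (stdlib-only sketch)."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Merge structured fields passed via logging's `extra=` mechanism
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("auth-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("Login failed", extra={"fields": {
    "userId": "12345",
    "reason": "invalid_password",
    "traceId": "abc123def456",
}})
```

The `extra={"fields": ...}` convention is this sketch's own; the payoff is that every log line is queryable by field name instead of by substring.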
Log Levels: What Actually Belongs Where
FATAL: "The app is dying. Page someone NOW."
→ Process cannot continue. Database connection permanently lost.
→ Usage: Extremely rare. If you see this, it's an incident.
ERROR: "Something failed, but the app survived."
→ A request failed. A retry was exhausted. An external call timed out.
→ Usage: Every error should be actionable. If you can't do anything about it,
it's not an error — it's a warning.
WARN: "Something is off, but not broken yet."
→ Memory usage above 80%. Retry attempt 2 of 3. Deprecated API called.
→ Usage: Things that MIGHT become problems.
INFO: "Normal operations, key events."
→ Service started. Request processed. User logged in. Deployment completed.
→ Usage: Audit trail of what happened. Keep it minimal.
DEBUG: "Developer needs this to debug locally."
→ Variable values. SQL queries. Internal state.
→ Usage: NEVER in production. Costs a fortune in log storage.
🚨 Real-World Disaster #2: The $14,000 Log Bill
What Happened: A developer set the log level to DEBUG in production "to investigate an issue" and forgot to change it back. For 3 weeks, every request logged 40+ lines of debug detail. Log Analytics ingestion cost went from $800/month to $14,800/month.
The Fix:
- Default to WARN in production, INFO in staging
- Use dynamic log levels — change via config without redeploy:
# Kubernetes ConfigMap for log level (change without redeploy)
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
LOG_LEVEL: "warn" # Change to "info" or "debug" temporarily when needed
- Set daily ingestion caps in Azure Log Analytics:
az monitor log-analytics workspace update \
--resource-group rg-monitoring \
--workspace-name law-prod \
--quota 10 # GB per day cap
- Sampling for high-volume services — log 10% of requests, 100% of errors
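The sampling point deserves a sketch, because how you sample matters. Deciding per log line (e.g. `random.random() < 0.1`) shreds traces; deciding by trace ID keeps or drops all of a request's lines together across every service. A minimal illustration (`should_log` is a hypothetical helper, not a library API):

```python
import hashlib

def should_log(trace_id, is_error, sample_rate=0.10):
    """Keep 100% of errors; keep ~10% of successes, decided by hashing the
    trace ID so every service in the request path keeps or drops together."""
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

# Errors always survive sampling
print(should_log("abc123def456", is_error=True))  # True
```

Because the decision is a pure function of the trace ID, a sampled-in request is fully logged end to end, so the 10% you keep is still debuggable.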
🔗 Distributed Tracing: Following the Breadcrumbs
When a user's request touches 5 microservices, a database, a cache, and an external API — how do you figure out which one is slow?
Distributed tracing follows a request across every service:
User request → api-gateway (12ms)
└→ auth-service (8ms)
└→ payment-service (2,340ms) ← 🚨 FOUND IT
└→ database query (2,280ms) ← 🚨 THE REAL CULPRIT
└→ cache lookup (3ms)
└→ notification-service (45ms)
Without tracing, you'd know "something is slow" but not WHERE. With tracing, you see the exact service AND the exact operation that's slow.
Setting Up Tracing (OpenTelemetry)
# Python example with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Setup
provider = TracerProvider()
provider.add_span_processor(
BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
# Use in your code
tracer = trace.get_tracer(__name__)
@app.route('/payment')
def process_payment():
with tracer.start_as_current_span("process-payment") as span:
span.set_attribute("payment.amount", amount)
span.set_attribute("payment.currency", "USD")
# This automatically creates a child span when calling the DB
result = db.execute(query)
return result
🚨 Real-World Disaster #3: The Invisible Retry Storm
Symptoms: p99 latency jumped from 200ms to 4,000ms. No errors in logs. CPU and memory normal. Dashboard shows nothing wrong.
What Tracing Revealed:
Request timeline:
api-gateway: 4,012ms total
└→ order-service: 3,998ms
└→ inventory-service: TIMEOUT (1,000ms) ← Attempt 1
└→ inventory-service: TIMEOUT (1,000ms) ← Attempt 2
└→ inventory-service: TIMEOUT (1,000ms) ← Attempt 3
└→ inventory-service: 800ms ← Attempt 4 (success!)
The Problem: The inventory service was experiencing intermittent timeouts. The order service had a retry policy (good!) but each retry added 1 second. Three 1-second timeouts plus one 800ms success added up to 3.8 seconds of latency. The retries themselves were never logged, so the logs showed nothing. Only traces revealed the retry storm.
The Fix:
- Log retries (even successful ones — they indicate underlying issues)
- Add circuit breaker to stop retrying a consistently-failing service
- Alert on retry rate, not just error rate
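The circuit breaker from the fix list can be sketched in a few lines. This is a hypothetical minimal version (production code would more likely use a library such as pybreaker, or the resilience features of your service mesh):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch. After `max_failures` consecutive
    failures, calls fail fast for `reset_seconds` instead of retrying."""

    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None = circuit closed (calls flow through)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast, not retrying")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Had the order service wrapped its inventory calls in something like this, attempt 2 and 3 would have failed fast instead of stacking another two seconds onto every request.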
🔔 Alerting: The Art of Not Crying Wolf
The Alert Fatigue Problem
Week 1: Team gets 50 alerts → Everyone investigates
Week 4: Team gets 50 alerts → "Probably false positive"
Week 8: Team gets 50 alerts → *mutes channel*
Week 12: Actual outage alert → Nobody sees it → 💀
Alert fatigue kills reliability. Every alert must be:
- Actionable: Someone can fix it right now
- Urgent: It needs to be fixed NOW, not tomorrow
- Real: False positive rate < 5%
Multi-Window Burn Rate Alerting (The Modern Approach)
Instead of "alert when error rate > 1%", use burn-rate alerting:
SLO: 99.9% availability (error budget: 43.2 minutes/month)
Alert when error budget is being consumed too fast:
🔴 Page (wake someone up):
1-hour window: burning > 14.4x normal rate
AND 5-minute window: burning > 14.4x normal rate
→ "This burn rate consumes 2% of the monthly budget every hour; the whole budget would be gone in about 2 days"
🟡 Ticket (fix during business hours):
6-hour window: burning > 6x normal rate
AND 30-minute window: burning > 6x normal rate
→ "This burn rate consumes 5% of the monthly budget every 6 hours; the whole budget would be gone in 5 days"
# Prometheus alerting rule: burn-rate based
groups:
- name: slo-alerts
rules:
# Fast burn: Page immediately
- alert: PaymentHighErrorBurnRate
expr: |
(
sum(rate(http_requests_total{service="payment",code=~"5.."}[1h]))
/ sum(rate(http_requests_total{service="payment"}[1h]))
) > (14.4 * 0.001)
and
(
sum(rate(http_requests_total{service="payment",code=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="payment"}[5m]))
) > (14.4 * 0.001)
labels:
severity: page
annotations:
summary: "Payment service burning error budget 14x too fast"
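Where does 14.4 come from? Burn rate is just the observed error ratio divided by the error ratio the SLO allows. A quick sketch of the arithmetic (function names are mine, not from any library):

```python
def burn_rate(error_ratio, slo=0.999):
    """How fast the error budget is burning, relative to the sustainable pace.
    A burn rate of 1.0 spends the budget exactly at the end of the SLO window."""
    budget = 1.0 - slo  # 0.001 for a 99.9% SLO
    return error_ratio / budget

def hours_until_budget_gone(rate, window_days=30):
    """At a constant burn rate, how long until the whole budget is spent."""
    return (window_days * 24) / rate

# A sustained 1.44% error ratio against a 99.9% SLO is exactly the 14.4x
# threshold in the rule above: the monthly budget would be gone in ~50 hours.
fast = burn_rate(0.0144)
hours = hours_until_budget_gone(fast)
```

The short 5-minute window in the rule is what stops a spike that ended an hour ago from paging you now; both windows must be burning hot at the same time.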
🚨 Real-World Disaster #4: The Alert That Fired 847 Times
What Happened: Alert rule: "Fire when CPU > 80%." A node running batch jobs hit 85% CPU for 30 seconds every 5 minutes (this is normal — batch jobs are CPU-intensive). Alert fired 847 times in one day. Team muted the channel. A real issue the next day went unnoticed for 4 hours.
The Fix:
- Add duration requirements: "CPU > 80% for > 15 minutes"
- Remove CPU alerts for batch job nodes (they're SUPPOSED to use CPU)
- Alert on SLO burn rate instead of raw resource metrics
📉 Dashboards That Actually Help at 3 AM
The Dashboard Hierarchy
Level 1: Service Overview (START HERE at 3 AM)
→ Is the service healthy? Yes/No at a glance.
→ RED metrics: Request rate, Error rate, Duration
→ Current SLO status and error budget remaining
Level 2: Infrastructure (if L1 shows a problem)
→ Pods, nodes, CPU, memory, network
→ Database connections, query latency
→ Queue depth, consumer lag
Level 3: Deep Dive (for root cause analysis)
→ Per-endpoint latency breakdown
→ Trace search
→ Log queries correlated with timeframe
The Perfect Incident Dashboard (4 Panels)
┌──────────────────────────────┬──────────────────────────────┐
│ Request Rate (req/s) │ Error Rate (%) │
│ ┌─────────────────────┐ │ ┌─────────────────────┐ │
│ │ 📈 Normal trend │ │ │ 📈 Spike! │ │
│ │ with deployment │ │ │ 🚨 this is why you │ │
│ │ markers │ │ │ got paged │ │
│ └─────────────────────┘ │ └─────────────────────┘ │
├──────────────────────────────┼──────────────────────────────┤
│ Latency (p50, p95, p99) │ Error Budget Remaining │
│ ┌─────────────────────┐ │ ┌─────────────────────┐ │
│ │ p50: 45ms ✅ │ │ │ ████████░░ 73% │ │
│ │ p95: 200ms ✅ │ │ │ "32 min remaining │ │
│ │ p99: 2.8s 🚨 │ │ │ this month" │ │
│ └─────────────────────┘ │ └─────────────────────┘ │
└──────────────────────────────┴──────────────────────────────┘
🎯 Key Takeaways
- Monitoring ≠ Observability — you need both, but observability saves you at 3 AM
- Correlate with Trace IDs — metrics, logs, and traces must be linked
- p50 is a lie — always show p95 and p99 latency
- Structured JSON logging or spend your debugging time grep-ing through chaos
- Alert fatigue kills — every alert must be actionable, urgent, and real
- Burn-rate alerting > simple threshold alerting
- DEBUG logs in production = financial disaster
🔥 Homework
- Check your production dashboards — do they show p99 latency? If only averages, add percentiles.
- Count your alerts from last week. How many were actionable? Delete the rest.
- Run `kubectl logs -n <namespace> <pod> | head -5` — is the output structured JSON? If not, fix it.
Next up in the series: **Hackers Tried to Breach My Pipeline at 3 AM — A DevSecOps Survival Guide** — where we cover supply chain attacks, container security, secrets management, and zero-trust architecture.
💬 What's the most expensive monitoring mistake you've seen? I once saw a team spending $23K/month on Application Insights because they logged every SQL query in production. Share your stories below! 💸
Top comments (2)
The dashboard hierarchy concept (L1 → L2 → L3) is something I wish more teams internalized. I manage a portfolio of online properties — a financial data site with 100K+ pages across 12 languages, plus a digital product store — and the monitoring challenge is fragmented across completely different platforms: Google Search Console for indexing health, GA4 for traffic, Yandex Webmaster for Russian search, Bing Webmaster, and Gumroad for revenue.
Your "drowning in data but zero information" opener hit home. I literally built an AI agent whose sole job is checking 5+ dashboards every morning and producing a single unified report. Before that, I was the person opening 47 panels and trying to correlate manually. The agent applies something like your RED method but for SEO: Rate (pages crawled per day), Errors (5xx/403 responses in crawl logs), Duration (time to index new content). When Google's crawl rate drops or error rate spikes, the agent flags it before I even open my laptop.
Your Disaster #1 about p50 hiding the pain is exactly what happened with our indexing metrics. The "average" index rate looked stable. But when we looked at it per-language, Dutch pages were indexing 3x faster than English ones — and Japanese pages were barely being crawled at all. The aggregate hid a catastrophic per-language distribution problem, just like your p99 example.
To answer your closing question: our most expensive monitoring mistake was not monitoring at all for the first month. We had 100K+ pages live with zero alerting on crawl errors, which meant Google was silently returning 403s on an entire page category for weeks before we noticed. By then, those pages had been deindexed and we lost 30% of our indexed page count. The alert that would have saved us? A simple "403 response count > 0 in crawl logs" threshold.
The trace ID linking all three pillars is the part most teams skip when setting up observability. That single correlation key turns 3.2M log lines from noise into a searchable story.