Abraham Acha

Building a Production-Grade Observability Platform with LGTM Stack, DORA Metrics & SLOs

GitHub Repository: https://github.com/AirFluke/meetmind-observability
One command to deploy: docker compose up -d


Introduction

Modern software teams don't just need to know when something is down — they need to understand why it broke, how long users were affected, how fast they recovered, and whether their engineering practices are improving over time.

This is the gap between basic monitoring and true observability.

For Stage 6 of the HNG DevOps track, Team MeetMind built a production-grade observability and reliability platform from scratch using the LGTM stack — Loki, Grafana, and Tempo, with Prometheus standing in for Mimir as the metrics backend — alongside DORA metrics, SLI/SLO/Error Budget frameworks, and a fully automated alerting pipeline routing to Slack.

Everything is infrastructure as code. No manual UI configuration. One command brings the entire stack up.


Why LGTM Over Managed Alternatives?

The observability market offers managed alternatives — Datadog, New Relic, Grafana Cloud. So why self-host the LGTM stack?

Cost at scale. Managed platforms charge per host, per metric, per log line. At scale this becomes a significant infrastructure cost. The LGTM stack runs on a single server with no per-metric pricing.

Data sovereignty. Logs contain sensitive data — request bodies, auth tokens, PII. Shipping these to a third-party SaaS introduces compliance risk. Self-hosted Loki keeps logs within your own infrastructure.

No vendor lock-in. Prometheus exposition format and OpenTelemetry are open standards. Every instrumented service, every dashboard, every alert rule is portable. Switching providers means changing an endpoint URL, not rewriting your entire observability layer.

Full control over retention. We configured 30-day retention for metrics, logs, and traces at no additional cost.

Learning depth. Operating the stack yourself forces genuine understanding of how metrics collection, log aggregation, and distributed tracing work — knowledge that transfers regardless of which tools your next employer uses.


Architecture Overview

The platform runs as a Docker Compose stack with nine services, all with automatic restart policies.

| Component | Role | Port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| Loki | Log aggregation | 3100 |
| Tempo | Distributed trace storage | 3200 |
| Grafana | Unified observability frontend | 3000 |
| Alertmanager | Alert routing to Slack | 9093 |
| Node Exporter | System metrics (CPU, RAM, disk, network) | 9100 |
| Blackbox Exporter | HTTP/SSL probing | 9115 |
| Pushgateway | Receives DORA metrics from GitHub Actions | 9091 |
| OTel Collector | Receives and routes traces and logs | 4317/4318 (mapped to 4319/4320 on the host, since Tempo claims 4317/4318) |

Data flow:

  • Node Exporter and Blackbox Exporter expose metrics → Prometheus scrapes every 15 seconds
  • GitHub Actions pushes deployment metrics → Pushgateway → Prometheus
  • Applications send traces via OpenTelemetry → OTel Collector → Tempo
  • Applications send logs via OpenTelemetry → OTel Collector → Loki
  • Grafana sits on top of all three — Prometheus, Loki, Tempo — enabling correlated drill-down from a single dashboard

📸 [Screenshot: docker compose ps showing all 9 services Up]


Part 1: Deploying the Full LGTM Stack

Docker Compose — the complete stack

# docker-compose.yml
version: "3.8"

networks:
  observability:
    driver: bridge

volumes:
  prometheus_data:
  loki_data:
  tempo_data:
  grafana_data:

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
      - "--web.enable-remote-write-receiver"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - observability

  loki:
    image: grafana/loki:2.9.7
    container_name: loki
    restart: unless-stopped
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - ./config/loki-config.yaml:/etc/loki/loki-config.yaml:ro
      - loki_data:/loki
    ports:
      - "3100:3100"
    networks:
      - observability

  tempo:
    image: grafana/tempo:2.4.1
    container_name: tempo
    restart: unless-stopped
    command: -config.file=/etc/tempo/tempo.yaml
    volumes:
      - ./config/tempo.yaml:/etc/tempo/tempo.yaml:ro
      - tempo_data:/var/tempo
    ports:
      - "3200:3200"
      - "4317:4317"
      - "4318:4318"
    networks:
      - observability

  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"
    networks:
      - observability

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - ./config/slack.tmpl:/etc/alertmanager/slack.tmpl:ro
    ports:
      - "9093:9093"
    networks:
      - observability

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - observability

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    volumes:
      - ./config/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    ports:
      - "9115:9115"
    networks:
      - observability

  pushgateway:
    image: prom/pushgateway:v1.7.0
    container_name: pushgateway
    restart: unless-stopped
    ports:
      - "9091:9091"
    networks:
      - observability

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.98.0
    container_name: otel-collector
    restart: unless-stopped
    command: ["--config=/etc/otel/otel-collector.yaml"]
    volumes:
      - ./config/otel-collector.yaml:/etc/otel/otel-collector.yaml:ro
    ports:
      - "4319:4317"
      - "4320:4318"
      - "8888:8888"
    networks:
      - observability

One command to bring everything up

docker compose up -d

Infrastructure as Code — non-negotiable

Every configuration file is version-controlled. Nothing is configured through a UI:

config/
├── prometheus.yml        # Scrape configs + recording rules
├── alertmanager.yml      # Route trees + inhibition rules
├── loki-config.yaml      # Log ingestion + 30d retention
├── tempo.yaml            # Trace storage + 30d retention
├── otel-collector.yaml   # Trace and log pipeline
└── blackbox.yml          # HTTP + SSL probe modules

alerts/
├── infrastructure.yml    # CPU, memory, disk, host down
├── slo-burnrate.yml      # Multi-window burn rate alerts
└── cicd.yml              # DORA threshold alerts

grafana/
├── provisioning/         # Datasource + dashboard discovery
└── dashboards/           # 5 JSON dashboards

Prometheus scrape configuration

# config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alerts/infrastructure.yml
  - /etc/prometheus/alerts/slo-burnrate.yml
  - /etc/prometheus/alerts/cicd.yml

scrape_configs:
  - job_name: node-exporter
    scrape_interval: 15s
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://grafana:3000
          - http://prometheus:9090/-/healthy
          - http://loki:3100/ready
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ["pushgateway:9091"]

Retention periods:

  • Prometheus metrics: 30 days (--storage.tsdb.retention.time=30d)
  • Loki logs: 30 days (retention_period: 30d in loki-config.yaml)
  • Tempo traces: 30 days (block_retention: 720h in tempo.yaml)

📸 [Screenshot: Prometheus targets page showing all scrapers green]


Part 2: The Four Golden Signals as SLIs

Before writing a single PromQL query or building any dashboard, we defined what reliability means for MeetMind using Google's Four Golden Signals framework.

Why Four Golden Signals beat CPU/RAM monitoring

Traditional monitoring asks "is the server healthy?" The Four Golden Signals ask "is the user experiencing a healthy service?"

A server can have 10% CPU and still serve every request with 5-second latency. CPU monitoring shows green. The Four Golden Signals show red. That's the difference.

Signal 1 — Latency

How long does it take to serve a request? We distinguish successful from error latency — a fast error is not a success.

# p95 latency for successful requests
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le, job)
)

# p95 latency for error requests (errors are often faster — fail fast)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le, job)
)

Signal 2 — Traffic

How much demand is the system handling?

# Requests per second
sum(rate(http_requests_total[1m])) by (job)

Signal 3 — Errors

Rate of failed requests — explicit failures (5xx responses), implicit failures (a 200 with the wrong content), and failures by policy.

# Error rate as a ratio (0 = perfect, 1 = everything failing)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)

Signal 4 — Saturation

How "full" is the service? We track CPU, memory, and disk.

# Memory saturation
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU saturation
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Disk saturation
1 - (
  node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
  / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}
)

These four PromQL expressions become our SLIs — the measurements we track.
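The saturation arithmetic is worth sanity-checking outside Prometheus. A minimal sketch that mirrors the memory expression above against made-up /proc/meminfo values (the sample numbers are ours, purely for illustration):

```shell
# Mirror the PromQL memory-saturation formula: 1 - MemAvailable / MemTotal.
# Sample values are invented for illustration.
cat <<'EOF' > /tmp/meminfo.sample
MemTotal:        8000000 kB
MemAvailable:    2000000 kB
EOF

awk '
  /^MemTotal:/     { total = $2 }
  /^MemAvailable:/ { avail = $2 }
  END { printf "memory saturation: %.2f\n", 1 - avail / total }
' /tmp/meminfo.sample
# memory saturation: 0.75
```

Node Exporter does essentially this parsing for you and exposes the raw gauges; the ratio itself lives in the PromQL layer.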


Part 3: SLOs and Error Budgets

The philosophy

An SLI is a measurement.
An SLO is a target for that measurement.
An Error Budget is the allowable gap between perfect and the SLO target.

This framework changes how engineering teams make decisions. Instead of arguing about whether a deployment is "safe enough", the question becomes: "Do we have enough error budget to absorb the risk of this deployment?"

It converts a subjective conversation into an objective one.

Our SLO targets

| SLO | Target | Window | Error Budget |
|---|---|---|---|
| Availability | 99.5% of HTTP probes return 2xx | 30 days | 216 minutes |
| Error rate | 99% of requests succeed | 30 days | 432 minutes |
| Latency | p95 < 500ms | Rolling 5m | Alert-only |

Why 99.5% availability?
This gives us 216 minutes per month — enough for one planned maintenance window without exhausting the budget. A stricter 99.9% would leave only 43 minutes, making any deployment risky.
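Both figures fall out of the same formula: budget = (1 − target) × window. A quick check over a 30-day window:

```shell
# Error budget in minutes = (1 - SLO target) * minutes in the window.
awk 'BEGIN {
  window_min = 30 * 24 * 60    # 43200 minutes in 30 days
  printf "99.5%% -> %g min\n", (1 - 0.995) * window_min
  printf "99.9%% -> %g min\n", (1 - 0.999) * window_min
}'
# 99.5% -> 216 min
# 99.9% -> 43.2 min
```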

Why 99% error rate?
One percent failure tolerance allows for transient errors during rolling deployments. Stricter targets require canary deployment infrastructure before they're meaningful.

Why 500ms p95 latency?
Industry standard for interactive APIs. Beyond this threshold, user experience degrades measurably. We chose p95 rather than p99 because optimising for the 99th percentile often requires disproportionate infrastructure investment.

Recording rules for SLIs

# alerts/slo-burnrate.yml
groups:
  - name: slo.recording_rules
    interval: 30s
    rules:
      - record: slo:availability:ratio_rate5m
        expr: avg_over_time(probe_success[5m])

      - record: slo:availability:ratio_rate1h
        expr: avg_over_time(probe_success[1h])

      - record: slo:availability:ratio_rate6h
        expr: avg_over_time(probe_success[6h])

      - record: slo:availability:ratio_rate30d
        expr: avg_over_time(probe_success[30d])

      # Burn rate = how fast we're consuming error budget
      # Error budget = 1 - 0.995 = 0.005
      - record: slo:availability:burn_rate1h
        expr: (1 - slo:availability:ratio_rate1h) / 0.005

      - record: slo:availability:burn_rate6h
        expr: (1 - slo:availability:ratio_rate6h) / 0.005
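The division in the burn-rate rules is easy to verify by hand: a burn rate of 1 consumes the budget in exactly 30 days, and anything above that proportionally faster. For example, at a hypothetical 99.0% observed availability against the 99.5% target:

```shell
# burn_rate = (1 - observed availability) / error budget
awk 'BEGIN {
  budget = 1 - 0.995                 # 0.005 for a 99.5% SLO
  observed = 0.990                   # hypothetical measured availability
  rate = (1 - observed) / budget
  printf "burn rate: %g (budget exhausted in %g days)\n", rate, 30 / rate
}'
# burn rate: 2 (budget exhausted in 15 days)
```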

Error Budget Policy

Budget remaining > 50%  → Deploy freely, feature work continues
Budget remaining 25-50% → Investigate incidents, no major changes
Budget remaining < 25%  → Reliability sprint, senior review on all deploys
Budget remaining 0%     → Feature freeze until budget recovers

Who owns the freeze decision? Engineering lead.
Review cadence? First Monday of each month.
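The ladder is mechanical enough to script. A hypothetical helper (thresholds taken from the policy above; `budget_policy` is our name for illustration, not something in the repo):

```shell
# Map remaining error budget (whole percent) to the policy tier above.
budget_policy() {
  pct=$1
  if   [ "$pct" -le 0 ];  then echo "feature freeze"
  elif [ "$pct" -lt 25 ]; then echo "reliability sprint"
  elif [ "$pct" -le 50 ]; then echo "investigate, no major changes"
  else                         echo "deploy freely"
  fi
}

budget_policy 80   # deploy freely
budget_policy 40   # investigate, no major changes
```

Wiring this to the Prometheus HTTP API so the tier shows up in CI would be a natural next step.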

📸 [Screenshot: SLO & Error Budget dashboard showing gauges and burn rate]


Part 4: DORA Metrics and CI/CD Observability

Why DORA metrics connect to business outcomes

DORA metrics answer: "Is our team getting better or worse at delivering software safely?"

| Metric | Business impact |
|---|---|
| Deployment Frequency | How often value reaches users |
| Lead Time for Changes | How quickly a bug fix ships |
| Change Failure Rate | Cost of broken deployments |
| Mean Time to Restore | Duration of user impact during incidents |

DORA benchmarks

| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deploy frequency | Multiple/day | Weekly | Monthly | < Monthly |
| Lead time | < 1 hour | < 1 day | 1d–1w | > 1 week |
| CFR | < 5% | 5–10% | 10–15% | > 15% |
| MTTR | < 1 hour | < 1 day | 1d–1w | > 1 week |
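To give one concrete example of how the dashboard's classification badges could be computed, here is a sketch for change failure rate (thresholds from the table above; `classify_cfr` is our invention, not part of the repo):

```shell
# Classify a change failure rate percentage against the DORA benchmarks.
classify_cfr() {
  awk -v cfr="$1" 'BEGIN {
    if      (cfr < 5)   print "Elite"
    else if (cfr <= 10) print "High"
    else if (cfr <= 15) print "Medium"
    else                print "Low"
  }'
}

classify_cfr 3     # Elite
classify_cfr 12.5  # Medium
```

In Grafana the same effect is achieved declaratively with value mappings on the CFR panel.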

GitHub Actions pushing DORA metrics to Pushgateway

# .github/workflows/deploy.yml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Record deploy start time
        id: timing
        run: echo "start_ts=$(date +%s)" >> $GITHUB_OUTPUT

      - name: Build and deploy
        run: |
          echo "Your actual build and deploy steps here"

      - name: Push DORA metrics on success
        if: success()
        run: |
          LEAD_TIME=$(( $(date +%s) - ${{ steps.timing.outputs.start_ts }} ))
          WORKFLOW="${{ github.workflow }}"

          # Deployment counter
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_total{status="success",workflow="${WORKFLOW}"} 1
          EOF

          # Lead time
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_lead_time_seconds{workflow="${WORKFLOW}"} ${LEAD_TIME}
          EOF

      - name: Push DORA metrics on failure
        if: failure()
        run: |
          WORKFLOW="${{ github.workflow }}"
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_total{status="failure",workflow="${WORKFLOW}"} 1
          EOF
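Pushgateway accepts the plain Prometheus text exposition format: one `name{labels} value` line per sample. The heredocs above can be dry-run without a running Pushgateway; the `render_metric` helper below is ours, purely to show the payload shape:

```shell
# Build one line of Prometheus exposition format, as pushed by the workflow.
render_metric() {
  name=$1; status=$2; workflow=$3; value=$4
  printf '%s{status="%s",workflow="%s"} %s\n' "$name" "$status" "$workflow" "$value"
}

render_metric deployment_total success deploy 1
# deployment_total{status="success",workflow="deploy"} 1
```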

DORA recording rules in Prometheus

groups:
  - name: cicd.recording_rules
    rules:
      # Deployment frequency
      - record: dora:deployment_frequency:rate24h
        expr: sum(increase(deployment_total[24h])) by (workflow)

      # Change Failure Rate = failed / total over 7 days
      - record: dora:change_failure_rate:ratio7d
        expr: |
          sum(increase(deployment_total{status="failure"}[7d])) by (workflow)
          /
          sum(increase(deployment_total[7d])) by (workflow)

      # Mean Time to Restore
      - record: dora:mttr:avg7d
        expr: avg_over_time(deployment_restore_time_seconds[7d])

Toil identified and automated

Toil 1 — Manual alert acknowledgement. Engineers read a Slack alert, open a browser, navigate to Grafana, search for the relevant dashboard. Automation: every alert payload includes a direct link to the exact dashboard. Saves 2–3 minutes per alert.

Toil 2 — Certificate renewal reminders. SSL expiry tracked via calendar reminders. Automation: Blackbox Exporter monitors SSL expiry continuously. SSLCertExpiringSoon alert fires 14 days before expiry automatically.

📸 [Screenshot: DORA metrics dashboard with classification badges]


Part 5: Five Grafana Dashboards — All Provisioned as Code

All dashboards are provisioned from JSON files. The Grafana UI was never used to create or modify any panel.

Grafana provisioning configuration

# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      # This is the key config for trace drill-down
      derivedFields:
        - name: TraceID
          matcherRegex: 'traceID=(\w+)'
          url: "${__value.raw}"
          datasourceUid: tempo
          urlDisplayLabel: "Open in Tempo"
        - name: TraceID_json
          matcherRegex: '"traceId":"(\w+)"'
          url: "${__value.raw}"
          datasourceUid: tempo
          urlDisplayLabel: "Open trace in Tempo"

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
        customQuery: true
        query: '{service_name="${__span.tags.service.name}"} |= "${__trace.traceId}"'
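The derived-field regex can be checked against a sample log line locally. A sketch (the log line is invented; `[[:alnum:]_]` approximates the `\w` class in Grafana's matcher):

```shell
# Simulate Grafana's derivedFields matcher 'traceID=(\w+)' on a sample log line.
line='level=error msg="payment failed" traceID=4bf92f3577b34da6a3ce929d0e0e4736'
echo "$line" | grep -oE 'traceID=[[:alnum:]_]+' | cut -d= -f2
# 4bf92f3577b34da6a3ce929d0e0e4736
```

The captured value is what Grafana turns into the "Open in Tempo" link.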

Dashboard 1 — Node Exporter

CPU utilisation total and per-core, memory used/cached/available, disk I/O, network I/O, and load averages at 1/5/15 minutes. Gives instant visibility into whether resource saturation is causing service degradation.

📸 [Screenshot: Node Exporter dashboard with live CPU and memory data]

Dashboard 2 — Blackbox Exporter

External probing: uptime/downtime timeline, HTTP response time, SSL certificate expiry countdown, probe success rate. This dashboard answers "what is the user experiencing?" rather than "what is the server doing?" — a critical distinction.

📸 [Screenshot: Blackbox Exporter dashboard showing probe results]

Dashboard 3 — DORA Metrics

Deployment frequency trend, lead time distribution, CFR raw count and rolling percentage, MTTR with DORA benchmark classification displayed prominently. Classification updates automatically as metrics change.

Dashboard 4 — SLO & Error Budget

SLI vs SLO gauges, error budget remaining as a bar gauge coloured by urgency, burn rate time series with fast/slow burn thresholds marked, SLO compliance history over 7 and 30 day windows.

📸 [Screenshot: SLO dashboard with error budget gauge]

Dashboard 5 — Unified Observability (the most important)

This is the dashboard that makes the entire stack worth building.

A user sees a spike in the error rate panel → clicks through to Loki → sees error logs from that exact time window → clicks the trace ID link → Tempo opens the waterfall → identifies exactly which service, endpoint, and span caused the failure.

This drill-down — metric spike → correlated logs → causing trace — is what separates observability from monitoring.

Monitoring: "Something is wrong"
Observability: "Here is exactly why, where, and when"

📸 [Screenshot: Unified dashboard showing error rate spike]

📸 [Screenshot: Loki logs panel with clickable trace IDs]


Part 6: The Alerting System

All alert rules are version-controlled

Zero alert rules live in Grafana. Every rule is in a .yml file under alerts/.

Infrastructure alerts

# alerts/infrastructure.yml
groups:
  - name: infrastructure.rules
    rules:
      # Recording rules — pre-compute SLIs
      - record: sli:node_cpu_saturation
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      - record: sli:node_memory_saturation
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      # CPU alerts
      - alert: HighCPUWarning
        expr: sli:node_cpu_saturation > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU is {{ $value | humanizePercentage }} (threshold: 80%)"
          dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"

      - alert: HighCPUCritical
        expr: sli:node_cpu_saturation > 0.90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU on {{ $labels.instance }}"
          description: "CPU is {{ $value | humanizePercentage }} for 10+ minutes"
          dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"

      # Host down — Blackbox probe fails for 2 minutes
      - alert: HostDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          description: "Blackbox probe failed for 2+ consecutive minutes"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/host-down.md"

Burn rate alerting — how it reduces alert fatigue

Traditional threshold alerting fires whenever a metric crosses a line. This produces alert storms — dozens of notifications for a single incident. Teams learn to ignore them.

Burn rate alerting answers a different question: "At this rate of failure, how long until our error budget is exhausted?"

Two alerts replace an entire category of noise:

# alerts/slo-burnrate.yml
  - name: slo.alerts
    rules:
      # Fast burn — act immediately
      # 14.4x means 2% of monthly budget gone in 1 hour
      - alert: SLOAvailabilityFastBurn
        expr: slo:availability:burn_rate1h > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn: act immediately"
          description: >
            Burn rate is {{ $value | humanize }}x. At this rate,
            2% of the 30-day budget will be consumed in 1 hour.
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-fast-burn.md"

      # Slow burn — investigate before it escalates
      # 5x means ~4.2% of monthly budget gone in 6 hours
      - alert: SLOAvailabilitySlowBurn
        expr: slo:availability:burn_rate6h > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn: investigate soon"
          description: >
            Burn rate is {{ $value | humanize }}x over 6h.
            At this rate, about 4.2% of the 30-day budget is consumed every 6 hours.
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-slow-burn.md"
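The thresholds translate to budget consumption through burn_rate × window / period. Checking both alerts over the 720-hour (30-day) budget period:

```shell
# Fraction of the 30-day error budget consumed during one alert window.
awk 'BEGIN {
  period_h = 720                                  # 30 days in hours
  printf "fast burn: %.1f%% per 1h window\n", 14.4 * 1 / period_h * 100
  printf "slow burn: %.1f%% per 6h window\n",  5.0 * 6 / period_h * 100
}'
# fast burn: 2.0% per 1h window
# slow burn: 4.2% per 6h window
```

These window/threshold pairings follow the multi-window burn-rate pattern from the Google SRE Workbook.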

Alertmanager routing and inhibition

# config/alertmanager.yml
route:
  receiver: slack-default
  group_by: [alertname, severity, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: slack-critical
      group_wait: 10s
      repeat_interval: 4h

inhibit_rules:
  # When host is completely down, suppress CPU/memory/latency noise
  - source_match:
      alertname: HostDown
    target_match_re:
      alertname: "HighCPU.*|HighMemory.*|HighLatency.*|DiskSpace.*"
    equal: [instance]

  # Critical suppresses warning for same alert on same host
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]

Structured Slack payload — plain text is not acceptable

Every alert in #all-hng-alerts includes alert name, severity, host, metric value, Grafana link, and runbook link.

# config/slack.tmpl
{{ define "slack.title" -}}
[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
{{- end }}

{{ define "slack.body" -}}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity | toUpper }}
*Status:* {{ if eq $.Status "resolved" }}✅ RESOLVED{{ else }}🔥 FIRING{{ end }}
*Host:* {{ .Labels.instance }}
*Summary:* {{ .Annotations.summary }}

*Links:*
• <{{ .Annotations.dashboard_url }}|📊 Grafana Dashboard>
• <{{ .Annotations.runbook_url }}|📖 Runbook>

*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ end }}
{{- end }}

📸 [Screenshot: Slack showing firing alert with full structured payload]

📸 [Screenshot: Slack showing RESOLVED alert]


Part 7: Runbooks and Incident Management

A runbook for every alert

Every alert links directly to its runbook. An engineer woken at 3am should be able to follow it to resolution without searching.

Each runbook answers six questions:

# Runbook: High CPU Usage

## What is this alert?
HighCPUWarning fires when CPU exceeds 80% for 5+ minutes.

## Likely cause
1. Traffic spike
2. Runaway process
3. Post-deployment regression

## First 3 investigation steps
1. Check running processes:
   top -bn1 | head -20
   docker stats --no-stream
2. Correlate with traffic on the Unified Observability dashboard
3. Check recent deployments in GitHub Actions

## Resolution
- Runaway process: kill -9 <PID>
- Traffic spike: scale horizontally
- Deployment regression: roll back

## Roll back when?
If CPU spike started within 30 minutes of a deployment
and correlates with increased error rate.

## Escalation
Senior engineer if unresolved after 20 minutes.

Blameless Post-Incident Review

We documented a simulated incident where a missing environment variable caused 35% of requests to return 503 for 47 minutes.

Timeline:

| Time | Event |
|---|---|
| 14:18 | Deployment triggered |
| 14:23 | 503 responses begin |
| 14:29 | SLOAvailabilityFastBurn fires (6-min detection lag) |
| 14:36 | Trace ID in Loki → Tempo reveals config read failure |
| 14:40 | Root cause identified: missing DATABASE_URL env var |
| 14:45 | Rollback initiated |
| 15:10 | Error rate returns to baseline |

Root cause: New environment variable added to code but not to docker-compose.yml.

Detection gap: 6-minute lag between incident start and alert firing. Action item: reduce fast-burn for: clause from 2m to 1m.

Action items:

| Action | Owner | Due |
|---|---|---|
| Add post-deploy smoke test | DevOps | 3 days |
| Add env var validation to entrypoint | App dev | 5 days |
| Reduce fast-burn for: clause to 1m | DevOps | 1 day |

This review is blameless — we focus on systems and processes, not individuals.


Part 8: Game Day Results

Scenario 1 — Deployment Failure

Added exit 1 to the GitHub Actions workflow and pushed. The workflow failed and pushed deployment_total{status="failure"} to the Pushgateway. CICDDeploymentFailed fired in Slack within 2 minutes. DORA dashboard showed CFR increase. Immediately reverted.

📸 [Screenshot: GitHub Actions showing red failed run]

📸 [Screenshot: CICDDeploymentFailed in Slack]

Scenario 2 — Latency Injection

Injected 600ms network latency:

sudo tc qdisc add dev ens5 root netem delay 600ms

HighLatencyWarning fired, confirming that the alerting pipeline for latency SLO breaches works end-to-end.

# Remove latency
sudo tc qdisc del dev ens5 root

RESOLVED message confirmed recovery detection works.

📸 [Screenshot: Unified dashboard showing latency spike]

📸 [Screenshot: HighLatencyWarning in Slack]

📸 [Screenshot: RESOLVED in Slack after tc removed]

Scenario 3 — Resource Pressure

Used stress-ng to drive CPU above 90%:

stress-ng --cpu 0 --cpu-method matrixprod --timeout 600s &

What we observed:

  • HighCPUWarning entered pending state after CPU sustained above 80%
  • After 5 minutes → HighCPUWarning turned firing in Prometheus
  • Alert arrived in Slack with full structured payload
  • HighCPUCritical entered pending (needs 10min sustained above 90%)
  • After killing stress: both alerts RESOLVED in Slack

This confirmed the full warning → critical → recovery sequence and proved inhibition rules work — critical suppressed the warning notification.

pkill stress-ng

📸 [Screenshot: Prometheus alerts page showing Warning firing]

📸 [Screenshot: Prometheus alerts page showing Critical pending]

📸 [Screenshot: Node Exporter dashboard with CPU spike at 92%]

📸 [Screenshot: HighCPUWarning in Slack]

📸 [Screenshot: RESOLVED in Slack]


Key Learnings

1. Observability is not monitoring.
Monitoring tells you something is wrong. Observability tells you why, where, and when — without needing to SSH into a server.

2. SLOs make reliability decisions objective.
"Is this deployment safe?" is subjective. "Do we have 100 minutes of error budget remaining?" is objective. SLOs turn reliability from a conversation into a measurement.

3. Burn rate alerting eliminates alert fatigue.
Two burn rate alerts replaced what would have been dozens of threshold alerts during our Game Day scenarios. Engineers respond to meaningful signals, not noise.

4. DORA metrics connect engineering to business.
High MTTR isn't just a technical problem — it's lost revenue per minute. Low deployment frequency isn't just slow — it's delayed value delivery. DORA makes this explicit.

5. Everything as code is non-negotiable.
Every dashboard, alert rule, and config that lives only in a UI is technical debt. When the server dies, you want to run docker compose up -d and have everything back — not spend three hours recreating dashboards from memory.


Conclusion

The MeetMind Observability Platform demonstrates that production-grade observability is achievable without managed services. The LGTM stack provides the full observability triad — metrics, logs, and traces — with correlation between all three. SLOs convert vague reliability goals into measurable targets. DORA metrics connect daily engineering decisions to business outcomes. Burn rate alerting replaces alert storms with two meaningful signals.

The entire platform deploys with one command. Every component is version-controlled. Every alert links to a runbook. Every metric spike links to correlated logs and traces.

GitHub Repository: https://github.com/AirFluke/meetmind-observability


Built by Team MeetMind for HNG DevOps Track Stage 6
