GitHub Repository: https://github.com/AirFluke/meetmind-observability
One command to deploy: `docker compose up -d`

## Introduction
Modern software teams don't just need to know when something is down — they need to understand why it broke, how long users were affected, how fast they recovered, and whether their engineering practices are improving over time.
This is the gap between basic monitoring and true observability.
For Stage 6 of the HNG DevOps track, Team MeetMind built a production-grade observability and reliability platform from scratch using the LGTM stack — Loki, Grafana, Tempo, and Prometheus — alongside DORA metrics, SLI/SLO/Error Budget frameworks, and a fully automated alerting pipeline routing to Slack.
Everything is infrastructure as code. No manual UI configuration. One command brings the entire stack up.
## Why LGTM Over Managed Alternatives?
The observability market offers managed alternatives — Datadog, New Relic, Grafana Cloud. So why self-host the LGTM stack?
**Cost at scale.** Managed platforms charge per host, per metric, per log line. At scale this becomes a significant infrastructure cost. The LGTM stack runs on a single server with no per-metric pricing.

**Data sovereignty.** Logs contain sensitive data — request bodies, auth tokens, PII. Shipping these to a third-party SaaS introduces compliance risk. Self-hosted Loki keeps logs within your own infrastructure.

**No vendor lock-in.** The Prometheus exposition format and OpenTelemetry are open standards. Every instrumented service, every dashboard, every alert rule is portable. Switching providers means changing an endpoint URL, not rewriting your entire observability layer.

**Full control over retention.** We configured 30-day retention for metrics, logs, and traces at no additional cost.

**Learning depth.** Operating the stack yourself forces genuine understanding of how metrics collection, log aggregation, and distributed tracing work — knowledge that transfers regardless of which tools your next employer uses.
## Architecture Overview
The platform runs as a Docker Compose stack with nine services, all with automatic restart policies.
| Component | Role | Port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| Loki | Log aggregation | 3100 |
| Tempo | Distributed trace storage | 3200 |
| Grafana | Unified observability frontend | 3000 |
| Alertmanager | Alert routing to Slack | 9093 |
| Node Exporter | System metrics (CPU, RAM, disk, network) | 9100 |
| Blackbox Exporter | HTTP/SSL probing | 9115 |
| Pushgateway | Receives DORA metrics from GitHub Actions | 9091 |
| OTel Collector | Receives and routes traces and logs | 4317/4318 (container; host 4319/4320) |
Data flow:
- Node Exporter and Blackbox Exporter expose metrics → Prometheus scrapes every 15 seconds
- GitHub Actions pushes deployment metrics → Pushgateway → Prometheus
- Applications send traces via OpenTelemetry → OTel Collector → Tempo
- Applications send logs via OpenTelemetry → OTel Collector → Loki
- Grafana sits on top of all three — Prometheus, Loki, Tempo — enabling correlated drill-down from a single dashboard
📸 [Screenshot: docker compose ps showing all 9 services Up]
## Part 1: Deploying the Full LGTM Stack

### Docker Compose — the complete stack

```yaml
# docker-compose.yml
version: "3.8"

networks:
  observability:
    driver: bridge

volumes:
  prometheus_data:
  loki_data:
  tempo_data:
  grafana_data:

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
      - "--web.enable-remote-write-receiver"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - observability

  loki:
    image: grafana/loki:2.9.7
    container_name: loki
    restart: unless-stopped
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - ./config/loki-config.yaml:/etc/loki/loki-config.yaml:ro
      - loki_data:/loki
    ports:
      - "3100:3100"
    networks:
      - observability

  tempo:
    image: grafana/tempo:2.4.1
    container_name: tempo
    restart: unless-stopped
    command: -config.file=/etc/tempo/tempo.yaml
    volumes:
      - ./config/tempo.yaml:/etc/tempo/tempo.yaml:ro
      - tempo_data:/var/tempo
    ports:
      - "3200:3200"
      - "4317:4317"
      - "4318:4318"
    networks:
      - observability

  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"
    networks:
      - observability

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - ./config/slack.tmpl:/etc/alertmanager/slack.tmpl:ro
    ports:
      - "9093:9093"
    networks:
      - observability

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - observability

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    volumes:
      - ./config/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    ports:
      - "9115:9115"
    networks:
      - observability

  pushgateway:
    image: prom/pushgateway:v1.7.0
    container_name: pushgateway
    restart: unless-stopped
    ports:
      - "9091:9091"
    networks:
      - observability

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.98.0
    container_name: otel-collector
    restart: unless-stopped
    command: ["--config=/etc/otel/otel-collector.yaml"]
    volumes:
      - ./config/otel-collector.yaml:/etc/otel/otel-collector.yaml:ro
    ports:
      - "4319:4317"   # host 4319 → container OTLP gRPC (Tempo already claims 4317 on the host)
      - "4320:4318"   # host 4320 → container OTLP HTTP
      - "8888:8888"   # collector self-metrics
    networks:
      - observability
```
### One command to bring everything up

```bash
docker compose up -d
```
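A quick way to confirm everything is healthy (ports and readiness endpoints as wired in the compose file above):

```bash
docker compose ps                           # all nine services should show "Up"
curl -s http://localhost:9090/-/healthy     # Prometheus
curl -s http://localhost:3100/ready         # Loki
curl -s http://localhost:3200/ready         # Tempo
curl -s http://localhost:3000/api/health    # Grafana
```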
### Infrastructure as Code — non-negotiable

Every configuration file is version-controlled. Nothing is configured through a UI:

```
config/
├── prometheus.yml        # Scrape configs + recording rules
├── alertmanager.yml      # Route trees + inhibition rules
├── loki-config.yaml      # Log ingestion + 30d retention
├── tempo.yaml            # Trace storage + 30d retention
├── otel-collector.yaml   # Trace and log pipeline
└── blackbox.yml          # HTTP + SSL probe modules
alerts/
├── infrastructure.yml    # CPU, memory, disk, host down
├── slo-burnrate.yml      # Multi-window burn rate alerts
└── cicd.yml              # DORA threshold alerts
grafana/
├── provisioning/         # Datasource + dashboard discovery
└── dashboards/           # 5 JSON dashboards
```
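Most of these files appear later in this post; the OTel Collector pipeline does not, so here is a minimal sketch of the shape ours takes (receiver and exporter names per the contrib distribution; treat the exact fields as illustrative, not the verbatim file):

```yaml
# config/otel-collector.yaml — illustrative sketch
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp/tempo:
    endpoint: tempo:4317        # forward traces to Tempo's OTLP port
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```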
### Prometheus scrape configuration

```yaml
# config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alerts/infrastructure.yml
  - /etc/prometheus/alerts/slo-burnrate.yml
  - /etc/prometheus/alerts/cicd.yml

scrape_configs:
  - job_name: node-exporter
    scrape_interval: 15s
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://grafana:3000
          - http://prometheus:9090/-/healthy
          - http://loki:3100/ready
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ["pushgateway:9091"]
```
Retention periods:

- Prometheus metrics: 30 days (`--storage.tsdb.retention.time=30d`)
- Loki logs: 30 days (`retention_period: 30d` in loki-config.yaml)
- Tempo traces: 30 days (`block_retention: 720h` in tempo.yaml)
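For reference, the relevant fragments of those two configs look roughly like this (illustrative sketches assuming single-binary deployments; the rest of each file is omitted):

```yaml
# config/loki-config.yaml — retention fragment (illustrative)
limits_config:
  retention_period: 30d          # enforced by the compactor
compactor:
  working_directory: /loki/compactor
  retention_enabled: true

# config/tempo.yaml — retention fragment (illustrative)
compactor:
  compaction:
    block_retention: 720h        # 30 days
```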
📸 [Screenshot: Prometheus targets page showing all scrapers green]
## Part 2: The Four Golden Signals as SLIs

Before writing a single PromQL query or building any dashboard, we defined what reliability means for MeetMind using Google's Four Golden Signals framework.

### Why Four Golden Signals beat CPU/RAM monitoring
Traditional monitoring asks "is the server healthy?" The Four Golden Signals ask "is the user experiencing a healthy service?"
A server can have 10% CPU and still serve every request with 5-second latency. CPU monitoring shows green. The Four Golden Signals show red. That's the difference.
### Signal 1 — Latency

How long does it take to serve a request? We distinguish successful from error latency — a fast error is not a success.

```promql
# p95 latency for successful requests
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le, job)
)

# p95 latency for error requests (errors are often faster — they fail fast)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le, job)
)
```
### Signal 2 — Traffic

How much demand is the system handling?

```promql
# Requests per second
sum(rate(http_requests_total[1m])) by (job)
```
### Signal 3 — Errors

The rate of failed requests — explicit failures (5xx) as well as implicit ones: responses with the wrong content or that violate policy.

```promql
# Error rate as a ratio (0 = perfect, 1 = everything failing)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
```
### Signal 4 — Saturation

How "full" is the service? We track CPU, memory, and disk.

```promql
# Memory saturation
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU saturation
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Disk saturation
1 - (
  node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
  / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}
)
```
These four PromQL expressions become our SLIs — the measurements we track.
## Part 3: SLOs and Error Budgets

### The philosophy
An SLI is a measurement.
An SLO is a target for that measurement.
An Error Budget is the allowable gap between perfect and the SLO target.
This framework changes how engineering teams make decisions. Instead of arguing about whether a deployment is "safe enough", the question becomes: "Do we have enough error budget to absorb the risk of this deployment?"
It converts a subjective conversation into an objective one.
### Our SLO targets
| SLO | Target | Window | Error Budget |
|---|---|---|---|
| Availability | 99.5% of HTTP probes return 2xx | 30 days | 216 minutes |
| Error rate | 99% of requests succeed | 30 days | 432 minutes |
| Latency | p95 < 500ms | Rolling 5m | Alert-only |
**Why 99.5% availability?**
This gives us 216 minutes of error budget per month (0.5% of the 43,200 minutes in a 30-day window) — enough for one planned maintenance window without exhausting the budget. A stricter 99.9% would leave only 43 minutes, making any deployment risky.

**Why 99% error rate?**
One percent failure tolerance allows for transient errors during rolling deployments. Stricter targets require canary deployment infrastructure before they're meaningful.

**Why 500ms p95 latency?**
A common industry target for interactive APIs: beyond this threshold, user experience degrades measurably. We chose p95 rather than p99 because optimising for the 99th percentile often requires disproportionate infrastructure investment.
### Recording rules for SLIs

```yaml
# alerts/slo-burnrate.yml
groups:
  - name: slo.recording_rules
    interval: 30s
    rules:
      - record: slo:availability:ratio_rate5m
        expr: avg_over_time(probe_success[5m])
      - record: slo:availability:ratio_rate1h
        expr: avg_over_time(probe_success[1h])
      - record: slo:availability:ratio_rate6h
        expr: avg_over_time(probe_success[6h])
      - record: slo:availability:ratio_rate30d
        expr: avg_over_time(probe_success[30d])

      # Burn rate = how fast we're consuming error budget
      # Error budget = 1 - 0.995 = 0.005
      - record: slo:availability:burn_rate1h
        expr: (1 - slo:availability:ratio_rate1h) / 0.005
      - record: slo:availability:burn_rate6h
        expr: (1 - slo:availability:ratio_rate6h) / 0.005
```
### Error Budget Policy

- Budget remaining > 50% → deploy freely; feature work continues
- Budget remaining 25–50% → investigate incidents; no major changes
- Budget remaining < 25% → reliability sprint; senior review on all deploys
- Budget remaining 0% → feature freeze until the budget recovers

**Who owns the freeze decision?** The engineering lead.
**Review cadence?** The first Monday of each month.
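To drive this policy from a dashboard panel, "budget remaining" can be derived from the 30-day recording rule above — a PromQL sketch:

```promql
# Fraction of the 30-day error budget still unspent (1 = untouched, 0 = exhausted)
clamp_min(1 - (1 - slo:availability:ratio_rate30d) / 0.005, 0)
```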
📸 [Screenshot: SLO & Error Budget dashboard showing gauges and burn rate]
## Part 4: DORA Metrics and CI/CD Observability

### Why DORA metrics connect to business outcomes
DORA metrics answer: "Is our team getting better or worse at delivering software safely?"
| Metric | Business impact |
|---|---|
| Deployment Frequency | How often value reaches users |
| Lead Time for Changes | How quickly a bug fix ships |
| Change Failure Rate | Cost of broken deployments |
| Mean Time to Restore | Duration of user impact during incidents |
### DORA benchmarks
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deploy frequency | Multiple/day | Weekly | Monthly | < Monthly |
| Lead time | < 1 hour | < 1 day | 1d–1w | > 1 week |
| CFR | < 5% | 5–10% | 10–15% | > 15% |
| MTTR | < 1 hour | < 1 day | 1d–1w | > 1 week |
### GitHub Actions pushing DORA metrics to Pushgateway

```yaml
# .github/workflows/deploy.yml
# Assumption: PUSHGATEWAY_URL is supplied via repository secrets
env:
  PUSHGATEWAY_URL: ${{ secrets.PUSHGATEWAY_URL }}

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Record deploy start time
        id: timing
        run: echo "start_ts=$(date +%s)" >> $GITHUB_OUTPUT

      - name: Build and deploy
        run: |
          echo "Your actual build and deploy steps here"

      - name: Push DORA metrics on success
        if: success()
        run: |
          LEAD_TIME=$(( $(date +%s) - ${{ steps.timing.outputs.start_ts }} ))
          WORKFLOW="${{ github.workflow }}"

          # Deployment counter
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_total{status="success",workflow="${WORKFLOW}"} 1
          EOF

          # Lead time
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_lead_time_seconds{workflow="${WORKFLOW}"} ${LEAD_TIME}
          EOF

      - name: Push DORA metrics on failure
        if: failure()
        run: |
          WORKFLOW="${{ github.workflow }}"
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_total{status="failure",workflow="${WORKFLOW}"} 1
          EOF
```
### DORA recording rules in Prometheus

```yaml
groups:
  - name: cicd.recording_rules
    rules:
      # Deployment frequency
      - record: dora:deployment_frequency:rate24h
        expr: sum(increase(deployment_total[24h])) by (workflow)

      # Change Failure Rate = failed / total over 7 days
      - record: dora:change_failure_rate:ratio7d
        expr: |
          sum(increase(deployment_total{status="failure"}[7d])) by (workflow)
          /
          sum(increase(deployment_total[7d])) by (workflow)

      # Mean Time to Restore
      - record: dora:mttr:avg7d
        expr: avg_over_time(deployment_restore_time_seconds[7d])
```
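The MTTR rule assumes a `deployment_restore_time_seconds` gauge. The post doesn't show where it's pushed from; a hedged sketch of how incident tooling (or a manual step at incident close) could record it:

```bash
# Hypothetical: push restore time once an incident is resolved.
# INCIDENT_START_TS / RESOLVED_TS come from your incident process.
INCIDENT_START_TS=1714480980   # example epoch seconds
RESOLVED_TS=1714483800
RESTORE_SECONDS=$(( RESOLVED_TS - INCIDENT_START_TS ))

cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
deployment_restore_time_seconds{workflow="deploy"} ${RESTORE_SECONDS}
EOF
```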
### Toil identified and automated

**Toil 1 — Manual alert acknowledgement.** Engineers read a Slack alert, open a browser, navigate to Grafana, and search for the relevant dashboard. Automation: every alert payload includes a direct link to the exact dashboard, saving 2–3 minutes per alert.

**Toil 2 — Certificate renewal reminders.** SSL expiry was tracked via calendar reminders. Automation: Blackbox Exporter monitors SSL expiry continuously, and an `SSLCertExpiringSoon` alert fires 14 days before expiry.
📸 [Screenshot: DORA metrics dashboard with classification badges]
## Part 5: Five Grafana Dashboards — All Provisioned as Code
All dashboards are provisioned from JSON files. The Grafana UI was never used to create or modify any panel.
### Grafana provisioning configuration

```yaml
# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      # This is the key config for trace drill-down
      derivedFields:
        - name: TraceID
          matcherRegex: 'traceID=(\w+)'
          url: "${__value.raw}"
          datasourceUid: tempo
          urlDisplayLabel: "Open in Tempo"
        - name: TraceID_json
          matcherRegex: '"traceId":"(\w+)"'
          url: "${__value.raw}"
          datasourceUid: tempo
          urlDisplayLabel: "Open trace in Tempo"

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
        customQuery: true
        query: '{service_name="${__span.tags.service.name}"} |= "${__trace.traceId}"'
```
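The five dashboard JSON files are discovered by a file provider. A sketch of that provisioning file (the provider name is illustrative; the path matches the compose bind mount above):

```yaml
# grafana/provisioning/dashboards/dashboards.yaml — illustrative sketch
apiVersion: 1
providers:
  - name: meetmind-dashboards
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
```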
### Dashboard 1 — Node Exporter
CPU utilisation total and per-core, memory used/cached/available, disk I/O, network I/O, and load averages at 1/5/15 minutes. Gives instant visibility into whether resource saturation is causing service degradation.
📸 [Screenshot: Node Exporter dashboard with live CPU and memory data]
### Dashboard 2 — Blackbox Exporter
External probing: uptime/downtime timeline, HTTP response time, SSL certificate expiry countdown, probe success rate. This dashboard answers "what is the user experiencing?" rather than "what is the server doing?" — a critical distinction.
📸 [Screenshot: Blackbox Exporter dashboard showing probe results]
### Dashboard 3 — DORA Metrics
Deployment frequency trend, lead time distribution, CFR raw count and rolling percentage, MTTR with DORA benchmark classification displayed prominently. Classification updates automatically as metrics change.
### Dashboard 4 — SLO & Error Budget
SLI vs SLO gauges, error budget remaining as a bar gauge coloured by urgency, burn rate time series with fast/slow burn thresholds marked, SLO compliance history over 7 and 30 day windows.
📸 [Screenshot: SLO dashboard with error budget gauge]
### Dashboard 5 — Unified Observability (the most important)
This is the dashboard that makes the entire stack worth building.
A user sees a spike in the error rate panel → clicks through to Loki → sees error logs from that exact time window → clicks the trace ID link → Tempo opens the waterfall → identifies exactly which service, endpoint, and span caused the failure.
This drill-down — metric spike → correlated logs → causing trace — is what separates observability from monitoring.
Monitoring: "Something is wrong"
Observability: "Here is exactly why, where, and when"
📸 [Screenshot: Unified dashboard showing error rate spike]
📸 [Screenshot: Loki logs panel with clickable trace IDs]
## Part 6: The Alerting System

### All alert rules are version-controlled

Zero alert rules live in Grafana. Every rule lives in a `.yml` file under `alerts/`.
### Infrastructure alerts

```yaml
# alerts/infrastructure.yml
groups:
  - name: infrastructure.rules
    rules:
      # Recording rules — pre-compute SLIs
      - record: sli:node_cpu_saturation
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: sli:node_memory_saturation
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      # CPU alerts
      - alert: HighCPUWarning
        expr: sli:node_cpu_saturation > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU is {{ $value | humanizePercentage }} (threshold: 80%)"
          dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"

      - alert: HighCPUCritical
        expr: sli:node_cpu_saturation > 0.90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU on {{ $labels.instance }}"
          description: "CPU is {{ $value | humanizePercentage }} for 10+ minutes"
          dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"

      # Host down — Blackbox probe fails for 2 minutes
      - alert: HostDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          description: "Blackbox probe failed for 2+ consecutive minutes"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/host-down.md"
```
### Burn rate alerting — how it reduces alert fatigue
Traditional threshold alerting fires whenever a metric crosses a line. This produces alert storms — dozens of notifications for a single incident. Teams learn to ignore them.
Burn rate alerting answers a different question: "At this rate of failure, how long until our error budget is exhausted?"
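The arithmetic behind the thresholds is worth spelling out (30 days ≈ 720 hours, budget = 0.5%):

```
error budget         = 720 h × (1 − 0.995) = 3.6 h of tolerated full downtime per 30 days
time to exhaustion   = 720 h ÷ burn rate
burn rate 14.4       → 720 ÷ 14.4 = 50 h to exhaustion; 1 h burns 1/50 = 2% of the budget
burn rate 5 over 6 h → (5 × 6) ÷ 720 ≈ 4.2% of the budget consumed
```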
Two alerts replace an entire category of noise:
```yaml
# alerts/slo-burnrate.yml (continued — alerting rules in the same groups list)
  - name: slo.alerts
    rules:
      # Fast burn — act immediately
      # 14.4x means 2% of the monthly budget gone in 1 hour
      - alert: SLOAvailabilityFastBurn
        expr: slo:availability:burn_rate1h > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn — act immediately"
          description: >
            Burn rate is {{ $value | humanize }}x. At this rate,
            2% of the 30-day budget will be consumed in 1 hour.
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-fast-burn.md"

      # Slow burn — investigate before it escalates
      # 5x means ~4% of the monthly budget gone in 6 hours
      - alert: SLOAvailabilitySlowBurn
        expr: slo:availability:burn_rate6h > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn — investigate soon"
          description: >
            Burn rate is {{ $value | humanize }}x over 6h. At this rate,
            ~4% of the 30-day budget is consumed in 6 hours.
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-slow-burn.md"
```
### Alertmanager routing and inhibition

```yaml
# config/alertmanager.yml
route:
  receiver: slack-default
  group_by: [alertname, severity, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: slack-critical
      group_wait: 10s
      repeat_interval: 4h

inhibit_rules:
  # When a host is completely down, suppress CPU/memory/latency noise
  - source_match:
      alertname: HostDown
    target_match_re:
      alertname: "HighCPU.*|HighMemory.*|HighLatency.*|DiskSpace.*"
    equal: [instance]
  # Critical suppresses warning for the same alert on the same host
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
```
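The route tree above references two receivers. A hedged sketch of how they would be wired to Slack using the custom template defined next (webhook URL redacted; fields per standard Alertmanager Slack config):

```yaml
# config/alertmanager.yml (continued) — illustrative receivers
receivers:
  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#all-hng-alerts"
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.body" . }}'
        send_resolved: true
  - name: slack-critical
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#all-hng-alerts"
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.body" . }}'
        send_resolved: true

templates:
  - /etc/alertmanager/slack.tmpl
```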
### Structured Slack payload — plain text is not acceptable
Every alert in #all-hng-alerts includes alert name, severity, host, metric value, Grafana link, and runbook link.
```
# config/slack.tmpl
{{ define "slack.title" -}}
[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
{{- end }}

{{ define "slack.body" -}}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity | toUpper }}
*Status:* {{ if eq $.Status "resolved" }}✅ RESOLVED{{ else }}🔥 FIRING{{ end }}
*Host:* {{ .Labels.instance }}
*Summary:* {{ .Annotations.summary }}
*Links:*
• <{{ .Annotations.dashboard_url }}|📊 Grafana Dashboard>
• <{{ .Annotations.runbook_url }}|📖 Runbook>
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ end }}
{{- end }}
```
📸 [Screenshot: Slack showing firing alert with full structured payload]
📸 [Screenshot: Slack showing RESOLVED alert]
## Part 7: Runbooks and Incident Management

### A runbook for every alert
Every alert links directly to its runbook. An engineer woken at 3am should be able to follow it to resolution without searching.
Each runbook answers six questions:
```markdown
# Runbook: High CPU Usage

## What is this alert?
HighCPUWarning fires when CPU exceeds 80% for 5+ minutes.

## Likely causes
1. Traffic spike
2. Runaway process
3. Post-deployment regression

## First 3 investigation steps
1. Check running processes:
   `top -bn1 | head -20`
   `docker stats --no-stream`
2. Correlate with traffic on the Unified Observability dashboard
3. Check recent deployments in GitHub Actions

## Resolution
- Runaway process: `kill -9 <PID>`
- Traffic spike: scale horizontally
- Deployment regression: roll back

## Roll back when?
If the CPU spike started within 30 minutes of a deployment
and correlates with an increased error rate.

## Escalation
Senior engineer if unresolved after 20 minutes.
```
### Blameless Post-Incident Review
We documented a simulated incident where a missing environment variable caused 35% of requests to return 503 for 47 minutes.
Timeline:
| Time | Event |
|---|---|
| 14:18 | Deployment triggered |
| 14:23 | 503 responses begin |
| 14:29 | SLOAvailabilityFastBurn fires (6-min detection lag) |
| 14:36 | Trace ID in Loki → Tempo reveals config read failure |
| 14:40 | Root cause identified: missing DATABASE_URL env var |
| 14:45 | Rollback initiated |
| 15:10 | Error rate returns to baseline |
**Root cause:** a new environment variable was added to the code but not to docker-compose.yml.

**Detection gap:** a 6-minute lag between incident start and alert firing. Action item: reduce the fast-burn `for:` clause from 2m to 1m.
Action items:
| Action | Owner | Due |
|---|---|---|
| Add post-deploy smoke test | DevOps | 3 days |
| Add env var validation to entrypoint | App dev | 5 days |
| Reduce fast-burn for: clause to 1m | DevOps | 1 day |
This review is blameless — we focus on systems and processes, not individuals.
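For the second action item, a minimal sketch of what the entrypoint validation might look like (hypothetical script; DATABASE_URL is the only variable named in the review):

```sh
#!/bin/sh
# entrypoint.sh — fail fast if a required env var is missing (hypothetical sketch)
for var in DATABASE_URL; do
  eval "val=\${$var:-}"
  if [ -z "$val" ]; then
    echo "FATAL: required env var $var is not set" >&2
    exit 1
  fi
done
exec "$@"   # hand off to the real container command
```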
## Part 8: Game Day Results

### Scenario 1 — Deployment Failure

Added `exit 1` to the GitHub Actions workflow and pushed. The workflow failed and pushed `deployment_total{status="failure"}` to the Pushgateway. `CICDDeploymentFailed` fired in Slack within 2 minutes, and the DORA dashboard showed the CFR increase. We reverted immediately.
📸 [Screenshot: GitHub Actions showing red failed run]
📸 [Screenshot: CICDDeploymentFailed in Slack]
### Scenario 2 — Latency Injection

Injected 600ms of network latency:

```bash
sudo tc qdisc add dev ens5 root netem delay 600ms
```

`HighLatencyWarning` fired, confirming that the alerting pipeline for latency SLO breaches works end-to-end.

```bash
# Remove the injected latency
sudo tc qdisc del dev ens5 root
```

The RESOLVED message confirmed that recovery detection works.
📸 [Screenshot: Unified dashboard showing latency spike]
📸 [Screenshot: HighLatencyWarning in Slack]
📸 [Screenshot: RESOLVED in Slack after tc removed]
### Scenario 3 — Resource Pressure

Used stress-ng to drive CPU above 90%:

```bash
stress-ng --cpu 0 --cpu-method matrixprod --timeout 600s &
```
What we observed:

- `HighCPUWarning` entered the pending state once CPU sustained above 80%
- After 5 minutes, `HighCPUWarning` turned firing in Prometheus
- The alert arrived in Slack with the full structured payload
- `HighCPUCritical` entered pending (it requires 10 minutes sustained above 90%)
- After killing the stress process, both alerts RESOLVED in Slack

This confirmed the full warning → critical → recovery sequence and proved the inhibition rules work — critical suppressed the warning notification.

```bash
pkill stress-ng
```
📸 [Screenshot: Prometheus alerts page showing Warning firing]
📸 [Screenshot: Prometheus alerts page showing Critical pending]
📸 [Screenshot: Node Exporter dashboard with CPU spike at 92%]
📸 [Screenshot: HighCPUWarning in Slack]
📸 [Screenshot: RESOLVED in Slack]
## Key Learnings

1. **Observability is not monitoring.** Monitoring tells you something is wrong. Observability tells you why, where, and when — without needing to SSH into a server.
2. **SLOs make reliability decisions objective.** "Is this deployment safe?" is subjective. "Do we have 100 minutes of error budget remaining?" is objective. SLOs turn reliability from a conversation into a measurement.
3. **Burn rate alerting eliminates alert fatigue.** Two burn rate alerts replaced what would have been dozens of threshold alerts during our Game Day scenarios. Engineers respond to meaningful signals, not noise.
4. **DORA metrics connect engineering to business.** High MTTR isn't just a technical problem — it's lost revenue per minute. Low deployment frequency isn't just slow — it's delayed value delivery. DORA makes this explicit.
5. **Everything as code is non-negotiable.** Every dashboard, alert rule, and config that lives only in a UI is technical debt. When the server dies, you want to run `docker compose up -d` and have everything back — not spend three hours recreating dashboards from memory.
## Conclusion
The MeetMind Observability Platform demonstrates that production-grade observability is achievable without managed services. The LGTM stack provides the full observability triad — metrics, logs, and traces — with correlation between all three. SLOs convert vague reliability goals into measurable targets. DORA metrics connect daily engineering decisions to business outcomes. Burn rate alerting replaces alert storms with two meaningful signals.
The entire platform deploys with one command. Every component is version-controlled. Every alert links to a runbook. Every metric spike links to correlated logs and traces.
GitHub Repository: https://github.com/AirFluke/meetmind-observability
*Built by Team MeetMind for HNG DevOps Track Stage 6*