Abraham Acha

Building a Production-Grade Observability Platform with the LGTM Stack, DORA Metrics & SLOs

GitHub Repository: https://github.com/AirFluke/meetmind-observability
One command to deploy: sudo bash install.sh


Introduction

Modern software teams don't just need to know when something is down — they need to understand why it broke, how long users were affected, how fast they recovered, and whether their engineering practices are improving over time.

This is the gap between basic monitoring and true observability.

For Stage 6 of the HNG DevOps track, Team MeetMind built a production-grade observability and reliability platform from scratch using the LGTM stack — Loki, Grafana, Tempo, and Prometheus — alongside DORA metrics, SLI/SLO/Error Budget frameworks, and a fully automated alerting pipeline routing to Slack.

Everything runs as native systemd services — no Docker, no containers. Each component installs as a binary, managed by systemd the same way any production Linux service is managed. One command brings the entire stack up on any Ubuntu server.


Why LGTM Over Managed Alternatives?

The observability market offers managed alternatives — Datadog, New Relic, Grafana Cloud. So why self-host the LGTM stack?

Cost at scale. Managed platforms charge per host, per metric, per log line. At scale this becomes a significant infrastructure cost. The LGTM stack runs on a single server with no per-metric pricing.

Data sovereignty. Logs contain sensitive data — request bodies, authentication tokens, PII. Shipping these to a third-party SaaS introduces compliance risk. Self-hosted Loki keeps logs within your own infrastructure.

No vendor lock-in. Prometheus exposition format and OpenTelemetry are open standards. Every instrumented service, every dashboard, every alert rule is portable. Switching providers means changing an endpoint URL, not rewriting your entire observability layer.

Full control over retention. We configured 30-day retention for metrics, logs, and traces at no additional cost.

Learning depth. Operating the stack yourself forces genuine understanding of how metrics collection, log aggregation, and distributed tracing work — knowledge that transfers regardless of which tools your next employer uses.


Architecture Overview

The platform runs as nine native systemd services on Ubuntu 24.04, all with automatic restart policies.

Service           | Role                                      | Port
Prometheus        | Metrics collection and storage            | 9090
Loki              | Log aggregation                           | 3100
Tempo             | Distributed trace storage                 | 3200
Grafana           | Unified observability frontend            | 3000
Alertmanager      | Alert routing to Slack                    | 9093
Node Exporter     | System metrics (CPU, RAM, disk, network)  | 9100
Blackbox Exporter | HTTP/SSL probing                          | 9115
Pushgateway       | Receives DORA metrics from GitHub Actions | 9091
OTel Collector    | Receives and routes traces and logs       | 4319/4320

Data flow:

  • Node Exporter and Blackbox Exporter expose metrics → Prometheus scrapes every 15 seconds
  • GitHub Actions pushes deployment metrics → Pushgateway → Prometheus
  • Applications send traces via OpenTelemetry → OTel Collector → Tempo
  • Applications send logs via OpenTelemetry → OTel Collector → Loki
  • Grafana queries all three — Prometheus, Loki, Tempo — enabling correlated drill-down from a single dashboard
  • Prometheus evaluates alert rules → fires to Alertmanager → routes to Slack
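
The scrape side of this flow is ordinary Prometheus configuration. A minimal sketch of what /etc/prometheus/prometheus.yml could look like for these targets (job names and the probe URL are illustrative, not copied from the repo):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: pushgateway
    honor_labels: true              # keep the labels pushed by GitHub Actions
    static_configs:
      - targets: ["localhost:9091"]

  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["https://example.com"]      # the endpoint being probed (illustrative)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115           # Blackbox Exporter does the probing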

📸 [Screenshot: All 9 services showing running status]


Part 1: Deploying the Full LGTM Stack as Systemd Services

Why systemd over Docker?

Running services as native systemd units means:

  • No container runtime dependency
  • Services start on boot automatically
  • Logs go directly to journald — journalctl -u prometheus -f
  • Standard Linux process management — systemctl start/stop/restart/status
  • No networking complexity — all services talk via localhost

One-command deployment

git clone https://github.com/AirFluke/meetmind-observability.git
cd meetmind-observability
sudo SLACK_WEBHOOK=https://hooks.slack.com/services/YOUR/WEBHOOK bash install.sh

The install script handles everything automatically:

  1. Installs system dependencies
  2. Downloads all binaries from GitHub releases
  3. Creates dedicated system users for each service
  4. Copies configs to /etc/
  5. Creates data directories in /var/lib/
  6. Installs systemd unit files to /etc/systemd/system/
  7. Enables and starts all services
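
For a single service, the pattern the script follows is roughly this (a sketch, not the literal install.sh; exact paths may differ):

# Example: the Prometheus portion of the install, approximately
sudo useradd --system --no-create-home --shell /usr/sbin/nologin prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp config/prometheus/prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /var/lib/prometheus
sudo cp systemd/prometheus.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus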

Systemd unit files — the core of the deployment

Each service has a unit file that defines how it runs. Here is Prometheus as an example:

# systemd/prometheus.service
[Unit]
Description=Prometheus Metrics Server
Documentation=https://prometheus.io/docs
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.enable-lifecycle \
  --web.enable-remote-write-receiver \
  --web.listen-address=0.0.0.0:9090

Restart=on-failure
RestartSec=5s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Restart=on-failure with RestartSec=5s is the systemd counterpart of a Docker restart policy such as restart: on-failure. Every service has this.
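
Because the unit also passes --web.enable-lifecycle, a config change does not even require a restart; Prometheus can be asked to reload in place:

curl -X POST http://localhost:9090/-/reload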

Checking service status

# Check all 9 at once
sudo bash scripts/status.sh

# Check individual service
sudo systemctl status prometheus

# Follow logs in real time
journalctl -u prometheus -f
journalctl -u grafana-server -f
journalctl -u loki -f

File layout on the server

/usr/local/bin/          ← all binaries
  prometheus, loki, tempo, alertmanager,
  node_exporter, blackbox_exporter,
  pushgateway, otelcol

/etc/                    ← all configs
  prometheus/prometheus.yml
  alertmanager/alertmanager.yml
  alertmanager/slack.tmpl
  loki/loki-config.yaml
  tempo/tempo.yaml
  otelcol/otel-collector.yaml
  blackbox_exporter/config.yml

/var/lib/                ← all data (30d retention)
  prometheus/
  loki/
  tempo/

/etc/systemd/system/     ← unit files
  prometheus.service
  loki.service
  tempo.service
  ... (9 total)

Retention periods

  • Prometheus metrics: 30 days (--storage.tsdb.retention.time=30d)
  • Loki logs: 30 days (retention_period: 30d)
  • Tempo traces: 30 days (block_retention: 720h)
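
The Prometheus value is the flag already shown in the unit file; the Loki and Tempo values live in their YAML configs. Illustrative excerpts (exact key names depend on the Loki and Tempo versions installed):

# loki-config.yaml (excerpt)
limits_config:
  retention_period: 30d
compactor:
  retention_enabled: true

# tempo.yaml (excerpt)
compactor:
  compaction:
    block_retention: 720h   # 30 days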

Infrastructure as Code — non-negotiable

Every configuration file is version-controlled in the repository:

meetmind-observability/
├── install.sh                  ← one-command deploy
├── uninstall.sh                ← clean teardown
├── scripts/status.sh           ← check all services
├── systemd/                    ← 9 unit files
├── config/                     ← all service configs
├── alerts/                     ← alert rules (.yml)
├── grafana/dashboards/         ← 5 JSON dashboards
├── grafana/provisioning/       ← datasource config
└── runbooks/                   ← one .md per alert

Nothing requires manual configuration to reproduce. Clone the repo, run the install script, and the entire platform is up.

📸 [Screenshot: Prometheus targets page showing all scrapers green]


Part 2: The Four Golden Signals as SLIs

Before writing a single PromQL query or building any dashboard, we defined what reliability means for MeetMind using Google's Four Golden Signals framework.

Why Four Golden Signals beat CPU/RAM monitoring

Traditional monitoring asks "is the server healthy?" The Four Golden Signals ask "is the user experiencing a healthy service?"

A server can have 10% CPU and still serve every request with 5-second latency. CPU monitoring shows green. The Four Golden Signals show red. That is the difference.

Signal 1 — Latency

How long does it take to serve a request? We distinguish successful from error latency — a fast error is not a success.

# p95 latency for successful requests
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le, job)
)

# p95 latency for error requests
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le, job)
)

Signal 2 — Traffic

How much demand is the system handling?

# Requests per second
sum(rate(http_requests_total[1m])) by (job)

Signal 3 — Errors

The rate of failed requests — explicit failures (5xx responses), implicit failures (a 200 with the wrong content), and policy failures such as requests slower than an agreed timeout.

# Error rate as a ratio (0 = perfect, 1 = everything failing)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)

Signal 4 — Saturation

How full is the service? We track CPU, memory, and disk.

# Memory saturation
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU saturation
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Disk saturation
1 - (
  node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
  / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}
)

These four PromQL expressions become our SLIs — the measurements we track before defining any targets.


Part 3: SLOs and Error Budgets

The philosophy

An SLI is a measurement.
An SLO is a target for that measurement.
An Error Budget is the allowable gap between perfect and the SLO target.

This framework changes how engineering teams make decisions. Instead of arguing about whether a deployment is safe enough, the question becomes: "Do we have enough error budget to absorb the risk of this deployment?" It converts a subjective conversation into an objective one.

Our SLO targets

SLO          | Target                          | Window     | Error Budget
Availability | 99.5% of HTTP probes return 2xx | 30 days    | 216 minutes
Error rate   | 99% of requests succeed         | 30 days    | 432 minutes
Latency      | p95 < 500ms                     | Rolling 5m | Alert-only
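
Where the budget numbers come from: a 30-day window contains 30 × 24 × 60 = 43,200 minutes. The 99.5% availability target tolerates 0.5% of that, 43,200 × 0.005 = 216 minutes; the 99% error-rate target tolerates 1%, 43,200 × 0.01 = 432 minutes.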

Why 99.5% availability?
This gives us 216 minutes per month — enough for one planned maintenance window without exhausting the budget. A stricter 99.9% would leave only 43 minutes, making any deployment risky.

Why 99% error rate?
One percent failure tolerance allows for transient errors during rolling deployments. Stricter targets require canary deployment infrastructure before they are meaningful.

Why 500ms p95 latency?
Industry standard for interactive APIs. Beyond this threshold, user experience degrades measurably. We chose p95 rather than p99 because optimising for the 99th percentile often requires disproportionate infrastructure investment.

Recording rules — pre-computing SLIs

# alerts/slo-burnrate.yml
groups:
  - name: slo.recording_rules
    interval: 30s
    rules:
      - record: slo:availability:ratio_rate1h
        expr: avg_over_time(probe_success[1h])

      - record: slo:availability:ratio_rate6h
        expr: avg_over_time(probe_success[6h])

      - record: slo:availability:ratio_rate30d
        expr: avg_over_time(probe_success[30d])

      # Burn rate = how fast we consume the error budget
      # Error budget = 1 - 0.995 = 0.005
      - record: slo:availability:burn_rate1h
        expr: (1 - slo:availability:ratio_rate1h) / 0.005

      - record: slo:availability:burn_rate6h
        expr: (1 - slo:availability:ratio_rate6h) / 0.005

Error Budget Policy

Budget > 50%    → Deploy freely, feature work continues
Budget 25–50%   → Investigate incidents, no major changes
Budget < 25%    → Reliability sprint, senior review on all deploys
Budget 0%       → Feature freeze until budget recovers

Who owns the freeze decision: Engineering lead.
Review cadence: First Monday of each month.
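
The policy bands are driven by how much budget remains, which falls straight out of the 30-day recording rule above. One possible gauge expression (a sketch; the dashboard JSON in the repo is the source of truth):

# Fraction of the 30-day error budget still unspent (1 = untouched, 0 = exhausted)
clamp_min(1 - (1 - slo:availability:ratio_rate30d) / 0.005, 0)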

📸 [Screenshot: SLO & Error Budget dashboard]


Part 4: DORA Metrics and CI/CD Observability

Why DORA metrics connect to business outcomes

Metric                | What it measures              | Business impact
Deployment Frequency  | How often value reaches users | Faster delivery
Lead Time for Changes | Commit to production          | Bug fix speed
Change Failure Rate   | Broken deployments            | Cost of poor quality
Mean Time to Restore  | Duration of incidents         | User impact per outage

DORA benchmarks

Metric           | Elite        | High    | Medium  | Low
Deploy frequency | Multiple/day | Weekly  | Monthly | < Monthly
Lead time        | < 1 hour     | < 1 day | 1d–1w   | > 1 week
CFR              | < 5%         | 5–10%   | 10–15%  | > 15%
MTTR             | < 1 hour     | < 1 day | 1d–1w   | > 1 week

GitHub Actions pushing DORA metrics to Pushgateway

# .github/workflows/deploy.yml
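# PUSHGATEWAY_URL is assumed to be provided to the job environment (e.g. as a repository secret/variable)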
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Record deploy start time
        id: timing
        run: echo "start_ts=$(date +%s)" >> $GITHUB_OUTPUT

      - name: Build and deploy
        run: echo "Your deploy steps here"

      - name: Push metrics on success
        if: success()
        run: |
          LEAD_TIME=$(( $(date +%s) - ${{ steps.timing.outputs.start_ts }} ))
          WORKFLOW="${{ github.workflow }}"

          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_total{status="success",workflow="${WORKFLOW}"} 1
          deployment_lead_time_seconds{workflow="${WORKFLOW}"} ${LEAD_TIME}
          EOF

      - name: Push metrics on failure
        if: failure()
        run: |
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_total{status="failure",workflow="${{ github.workflow }}"} 1
          EOF
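
After a run, the pushed series can be checked directly on the Pushgateway, even before Prometheus scrapes it:

curl -s http://localhost:9091/metrics | grep '^deployment_'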

Toil identified and automated

Toil 1 — Manual alert acknowledgement.
Engineers read a Slack alert, open a browser, navigate to Grafana, and search for the relevant dashboard. Automation: every alert payload includes a direct link to the exact dashboard. Saves 2–3 minutes per alert.

Toil 2 — Certificate renewal reminders.
SSL expiry was tracked via calendar reminders. Automation: Blackbox Exporter monitors SSL expiry continuously, and the SSLCertExpiringSoon alert fires automatically 14 days before expiry.

📸 [Screenshot: DORA metrics dashboard]


Part 5: Five Grafana Dashboards — All Provisioned as Code

All dashboards are provisioned from JSON files. The Grafana UI was never used to create or modify any panel.

Datasource provisioning with trace correlation

The key config that enables metric → log → trace drill-down:

# grafana/provisioning/datasources/datasources.yaml
datasources:
  - name: Prometheus
    type: prometheus
    url: http://localhost:9090
    isDefault: true

  - name: Loki
    type: loki
    url: http://localhost:3100
    jsonData:
      derivedFields:
        # Makes traceID= in log lines a clickable link to Tempo
        - name: TraceID
          matcherRegex: 'traceID=(\w+)'
          url: "${__value.raw}"
          datasourceUid: tempo
          urlDisplayLabel: "Open in Tempo"

  - name: Tempo
    type: tempo
    url: http://localhost:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
        customQuery: true
        query: '{service_name="${__span.tags.service.name}"} |= "${__trace.traceId}"'
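
For the derived field to work, application log lines need to carry the trace ID in a matching format. A hypothetical log line such as:

level=error msg="upstream timeout" service=meetmind-api traceID=4bf92f3577b34da6a3ce929d0e0e4736

is then rendered in Grafana with an "Open in Tempo" link on the trace ID.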

Dashboard 1 — Node Exporter

CPU utilisation total and per-core, memory used/cached/available, disk I/O, network I/O, and load averages at 1/5/15 minutes.

📸 [Screenshot: Node Exporter dashboard with live data]

Dashboard 2 — Blackbox Exporter

External probing: uptime/downtime timeline, HTTP response time, SSL certificate expiry countdown, probe success rate.

📸 [Screenshot: Blackbox Exporter dashboard]

Dashboard 3 — DORA Metrics

Deployment frequency trend, lead time, CFR rolling percentage, MTTR with DORA benchmark classification.

📸 [Screenshot: DORA dashboard]

Dashboard 4 — SLO & Error Budget

SLI vs SLO gauges, error budget remaining coloured by urgency, burn rate time series with fast/slow thresholds, compliance history.

📸 [Screenshot: SLO dashboard]

Dashboard 5 — Unified Observability (the most important)

A metric spike → click through to Loki → see logs from that exact time window → click the trace ID → Tempo opens the waterfall → identify exactly which service and span caused the failure.

This drill-down — metric spike → correlated logs → causing trace — is what separates observability from monitoring.

📸 [Screenshot: Unified dashboard]

📸 [Screenshot: Loki logs panel with clickable trace ID]

Part 6: The Alerting System

All alert rules are version-controlled

Zero alert rules live in Grafana. Every rule is in a .yml file under alerts/.
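
Keeping the rules in files also means they can be validated before they ever reach the server:

promtool check rules alerts/*.yml
amtool check-config config/alertmanager.yml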

Infrastructure alerts

# alerts/infrastructure.yml
groups:
  - name: infrastructure.rules
    rules:
      - record: sli:node_cpu_saturation
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      - record: sli:node_memory_saturation
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      - alert: HighCPUWarning
        expr: sli:node_cpu_saturation > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU is {{ $value | humanizePercentage }} (threshold 80%)"
          dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"

      - alert: HighCPUCritical
        expr: sli:node_cpu_saturation > 0.90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU on {{ $labels.instance }}"
          description: "CPU is {{ $value | humanizePercentage }} for 10+ minutes"
          dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"

      - alert: HostDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          description: "Blackbox probe failed for 2+ consecutive minutes"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/host-down.md"

Burn rate alerting — how it reduces alert fatigue

Traditional threshold alerting fires whenever a metric crosses a line, producing alert storms. Engineers learn to ignore them.

Burn rate alerting asks: "At this rate of failure, how long until our error budget is exhausted?"

Two alerts replace an entire category of noise:

# alerts/slo-burnrate.yml
  - name: slo.alerts
    rules:
      # Fast burn — act immediately
      # 14.4x = 2% of monthly budget gone in 1 hour
      - alert: SLOAvailabilityFastBurn
        expr: slo:availability:burn_rate1h > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn  act immediately"
          description: >
            Burn rate is {{ $value | humanize }}x.
            2% of the 30-day budget will be consumed in 1 hour.
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-fast-burn.md"

      # Slow burn — investigate before it escalates
      # 5x = 5% of monthly budget gone in 6 hours
      - alert: SLOAvailabilitySlowBurn
        expr: slo:availability:burn_rate6h > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn  investigate soon"
          description: >
            Burn rate is {{ $value | humanize }}x over 6h.
            5% of the 30-day budget will be consumed in 6 hours.
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-fast-burn.md"
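
The multipliers come from one relationship: the fraction of the 30-day budget consumed in a window is burn_rate × window ÷ 720h. For the fast-burn rule, 14.4 × 1h ÷ 720h = 2% of the month's budget gone in a single hour, which is why it pages as critical after only two minutes.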

Alertmanager routing and inhibition

# config/alertmanager.yml
route:
  receiver: slack-default
  group_by: [alertname, severity, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: slack-critical
      group_wait: 10s
      repeat_interval: 4h

inhibit_rules:
  # When host is completely down suppress CPU/memory/latency noise
  - source_match:
      alertname: HostDown
    target_match_re:
      alertname: "HighCPU.*|HighMemory.*|HighLatency.*|DiskSpace.*"
    equal: [instance]

  # Critical suppresses warning for same alert on same host
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]

Structured Slack template

Alertmanager uses Go templates. The default function from Sprig is not supported — a lesson we learned the hard way. Every field must be referenced directly:

# config/slack.tmpl
{{ define "slack.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{- end }}

{{ define "slack.body" -}}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity | toUpper }}
*Status:* {{ if eq $.Status "resolved" }}✅ RESOLVED{{ else }}🔥 FIRING{{ end }}
*Host:* {{ .Labels.instance }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}

*Links:*
• <{{ .Annotations.dashboard_url }}|📊 Grafana Dashboard>
• <{{ .Annotations.runbook_url }}|📖 Runbook>

*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ if eq $.Status "resolved" }}*Resolved:* {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}{{ end }}
---
{{ end }}
{{- end }}
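
If an annotation might be absent (not every alert defines dashboard_url), the plain Go-template way to guard it, instead of Sprig's default, is a conditional:

{{ if .Annotations.dashboard_url }}• <{{ .Annotations.dashboard_url }}|📊 Grafana Dashboard>{{ end }}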

📸 [Screenshot: Slack showing firing alert with full structured payload]

📸 [Screenshot: Slack showing RESOLVED alert]


Part 7: Runbooks and Incident Management

A runbook for every alert

Every alert links directly to its runbook. Each answers six questions:

# Runbook: High CPU Usage

## What is this alert?
HighCPUWarning fires when CPU exceeds 80% for 5+ minutes.

## Likely cause
1. Traffic spike
2. Runaway process
3. Post-deployment regression

## First 3 investigation steps

1. Check running processes:

   top -bn1 | head -20
   ps aux --sort=-%cpu | head -10

2. Check if the spike correlates with a deployment:
   check GitHub Actions for recent workflow runs.

3. Check the system journal:

   journalctl -n 100 --since "10 minutes ago"

## Resolution
- Runaway process: kill -9 <PID>
- Traffic spike: scale horizontally
- Deployment regression: roll back

## When to roll back?
If the CPU spike started within 30 minutes of a deployment
and correlates with an increased error rate.

## Escalation
Senior engineer if unresolved after 20 minutes.

Blameless Post-Incident Review

We documented a simulated incident where a missing environment variable caused 35% of requests to return 503 for 47 minutes.

Timeline:

Time  | Event
14:18 | Deployment triggered
14:23 | 503 responses begin
14:29 | SLOAvailabilityFastBurn fires — 6-min detection lag
14:36 | Trace ID in Loki → Tempo reveals config read failure
14:40 | Root cause: missing DATABASE_URL env var
14:45 | Rollback initiated
15:10 | Error rate returns to baseline

Root cause: New environment variable added to code but not to the service configuration.

Detection gap: 6-minute lag between incident start and alert. Action item: reduce fast-burn for: clause from 2m to 1m.

Action items:

Action                            | Owner   | Due
Add post-deploy smoke test        | DevOps  | 3 days
Add env var validation to startup | App dev | 5 days
Reduce fast-burn for: to 1m       | DevOps  | 1 day

This review is blameless — we focus on systems and processes, not individuals.


Part 8: Game Day Results

Scenario 1 — Deployment Failure

Added exit 1 to the GitHub Actions workflow and pushed. The workflow failed and pushed deployment_total{status="failure"} to the Pushgateway. CICDDeploymentFailed fired in Slack within 2 minutes. DORA dashboard showed CFR increase. Immediately reverted.

📸 [Screenshot: GitHub Actions showing red failed run]

📸 [Screenshot: CICDDeploymentFailed in Slack]

Scenario 2 — Latency Injection

Injected 600ms network latency using tc netem:

sudo tc qdisc add dev ens5 root netem delay 600ms

HighLatencyWarning fired, confirming that the alerting pipeline for latency SLO breaches works end-to-end.

# Remove latency
sudo tc qdisc del dev ens5 root

The subsequent RESOLVED message confirmed that recovery detection works.

📸 [Screenshot: HighLatencyWarning in Slack]

📸 [Screenshot: RESOLVED in Slack after tc removed]

Scenario 3 — Resource Pressure

Used stress-ng to drive CPU above 90%:

stress-ng --cpu 0 --cpu-method matrixprod --timeout 600s &

What we observed:

  • HighCPUWarning entered pending state after CPU sustained above 80%
  • After 5 minutes → HighCPUWarning moved to firing in Prometheus
  • Alert arrived in Slack with full structured payload
  • HighCPUCritical entered pending state (needs 10min sustained above 90%)
  • After killing stress → both alerts RESOLVED in Slack

This confirmed the full warning → critical → recovery sequence and proved inhibition rules work — critical suppresses the warning notification.

pkill stress-ng

📸 [Screenshot: Prometheus alerts page showing Warning firing]

📸 [Screenshot: Node Exporter dashboard with CPU spike at 92%]

📸 [Screenshot: HighCPUWarning in Slack]

📸 [Screenshot: RESOLVED in Slack]


Key Lessons Learned

1. Systemd is production-grade.
Running services as native systemd units is simpler than Docker for single-server deployments. No networking complexity, no container runtime, logs go straight to journald. journalctl -u prometheus -f is all you need.

2. Alertmanager templates have limits.
The default function from Sprig templating is not supported in Alertmanager. Any | default "value" in your template makes the notification fail silently: nothing ever reaches Slack. Always test templates before deploying.

3. Port conflicts happen.
Tempo and OTel Collector both want port 4317 (OTLP gRPC). When running as bare processes on the same host, one must move. We moved OTel Collector to 4319/4320. In Docker this was hidden by container networking.
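
In the collector config that amounts to something like the following (an illustrative excerpt; the repo's otel-collector.yaml is the source of truth):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4319   # moved off the default 4317, which Tempo keeps
      http:
        endpoint: 0.0.0.0:4320   # moved off the default 4318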

4. Observability is not monitoring.
Monitoring tells you something is wrong. Observability tells you why, where, and when — without needing to SSH into a server. The Loki → Tempo drill-down reduced our incident diagnosis time from 40 minutes to 4 minutes in the PIR simulation.

5. SLOs make reliability decisions objective.
"Is this deployment safe?" is subjective. "Do we have 100 minutes of error budget remaining?" is objective. SLOs turn reliability from a conversation into a measurement.

6. Burn rate alerting eliminates alert fatigue.
Two burn rate alerts replaced what would have been dozens of threshold alerts during Game Day scenarios. Engineers respond to meaningful signals, not noise.

7. Everything as code is non-negotiable.
Every dashboard, alert rule, and config that lives only in a UI is technical debt. Clone the repo, run install.sh, and the entire platform is back — no manual steps, no memory required.


Conclusion

The MeetMind Observability Platform demonstrates that production-grade observability is achievable without managed services and without Docker. Nine systemd services provide the full observability triad — metrics, logs, and traces — with correlation between all three. SLOs convert vague reliability goals into measurable targets. DORA metrics connect daily engineering decisions to business outcomes. Burn rate alerting replaces alert storms with two meaningful signals.

The entire platform deploys with one command on any Ubuntu 24.04 server. Every component is version-controlled. Every alert links to a runbook. Every metric spike links to correlated logs and traces.

GitHub Repository: https://github.com/AirFluke/meetmind-observability

Deploy it yourself:

git clone https://github.com/AirFluke/meetmind-observability.git
cd meetmind-observability
sudo SLACK_WEBHOOK=https://hooks.slack.com/services/YOUR/WEBHOOK bash install.sh

Built by Team MeetMind for HNG DevOps Track Stage 6
