DEV Community

Abraham Acha

Building a Production-Grade Remote Observability Platform with LGTM Stack, DORA Metrics & SLOs

GitHub: github.com/AirFluke/meetmind-observability
Spin up monitoring server: terraform apply
Tear it down: terraform destroy


Introduction

Modern software teams don't just need to know when something is down. They need to understand why it broke, how long users were affected, how fast the team recovered, and whether engineering practices are improving over time.

For Stage 6 of the HNG DevOps track, Team MeetMind built a production-grade observability platform with a real-world constraint most tutorials ignore: the monitoring stack and the application live on completely separate servers in different AWS accounts.

We solved this with a reverse SSH tunnel — no shared VPC, no cross-account IAM, no firewall rules to negotiate. The monitoring server spins up with one Terraform command, connects to the application server automatically, and tears down just as cleanly.


Architecture — Two Servers, One Observability Platform

Application Server (13.63.206.183)        Monitoring Server (Terraform-managed)
──────────────────────────────────        ─────────────────────────────────────
Node Exporter     :9100  ──────────────→  :9101  (reverse SSH tunnel)
Nginx / App       :80    ──────────────→  :8080  (reverse SSH tunnel)
App               :443   ←─────────────  Blackbox SSL probe (direct)

                                          Prometheus  → scrapes :9101, :8080
                                          Loki        → receives logs
                                          Tempo       → receives traces
                                          Grafana     → 5 dashboards
                                          Alertmanager → Slack #all-hng-alerts

Architecture Overview

The platform runs as nine native systemd services on Ubuntu 24.04, all with automatic restart policies.

| Service | Role | Port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| Loki | Log aggregation | 3100 |
| Tempo | Distributed trace storage | 3200 |
| Grafana | Unified observability frontend | 3000 |
| Alertmanager | Alert routing to Slack | 9093 |
| Node Exporter | System metrics (CPU, RAM, disk, network) | 9100 |
| Blackbox Exporter | HTTP/SSL probing | 9115 |
| Pushgateway | Receives DORA metrics from GitHub Actions | 9091 |
| OTel Collector | Receives and routes traces and logs | 4319/4320 |

Why the reverse SSH tunnel?

The application server is in a different AWS account. We cannot modify its security group, open firewall ports, or set up VPC peering. The standard solution would be to open port 9100 to the monitoring server IP — but that requires access to the other account.

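The reverse SSH tunnel solves this elegantly. The application server initiates an outbound SSH connection to the monitoring server (outbound traffic is almost always allowed). The -R 9101:localhost:9100 option makes the monitoring server listen on port 9101 and relay every connection back through the tunnel to port 9100 on the app server. By default that listener binds only to the monitoring server's loopback interface, so the port is not exposed publicly. When Prometheus scrapes localhost:9101, it is actually reading the app server's Node Exporter.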

# /etc/systemd/system/monitoring-tunnel.service on the APP SERVER
[Unit]
Description=AutoSSH Reverse Tunnel to Monitoring Server
After=network.target

[Service]
Environment=AUTOSSH_GATETIME=0
User=ubuntu
ExecStart=/usr/bin/autossh -M 0 -N \
  -R 9101:localhost:9100 \
  -R 8080:localhost:80 \
  -i /var/lib/monitoring/id_rsa \
  -o "ServerAliveInterval 30" \
  -o "ServerAliveCountMax 3" \
  -o "StrictHostKeyChecking=no" \
  ubuntu@MONITORING_SERVER_IP
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

We confirmed the tunnel is pulling real app server metrics by querying node_uname_info in Prometheus:

node_uname_info{nodename="ip-172-31-12-155"} 1

That hostname is the application server. The monitoring server is ip-172-31-21-196. Two different machines, one Prometheus.
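
The same check can be scripted. The following is a minimal sketch, assuming the tunneled exporter's text output has already been fetched (for example from localhost:9101/metrics on the monitoring server). The function name and the sample line are illustrative and not taken from the repo.

```python
import re
from typing import Optional

def nodename_from_metrics(text: str) -> Optional[str]:
    """Return the nodename label of node_uname_info, or None if it is absent."""
    for line in text.splitlines():
        if line.startswith("node_uname_info{"):
            match = re.search(r'nodename="([^"]*)"', line)
            if match:
                return match.group(1)
    return None

# Illustrative sample line in the Prometheus text exposition format.
sample = 'node_uname_info{machine="x86_64",nodename="ip-172-31-12-155",sysname="Linux"} 1'
print(nodename_from_metrics(sample))
```

If the returned nodename is the monitoring server's own hostname, the scrape is hitting the local Node Exporter on :9100 rather than the tunnel.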

📸 [Screenshot: Prometheus targets page showing all targets UP]


Why LGTM Over Managed Alternatives?

Cost at scale. Managed platforms charge per host, per metric, per log line. The LGTM stack runs on a single t3.small EC2 instance with no per-metric pricing.

Data sovereignty. Logs contain sensitive data. Self-hosted Loki keeps them within your own infrastructure.

No vendor lock-in. Prometheus and OpenTelemetry are open standards. Every dashboard, alert rule, and config is portable.

Spin up and down on demand. Because everything is Terraform, the monitoring server costs nothing when it is not needed. terraform destroy removes it entirely. terraform apply restores it in minutes, pointing at the same application server.


Part 1: Deploying the Stack

Why systemd instead of Docker

We chose native systemd services — each service runs as a dedicated unprivileged system user, logs go to journald, and restart behaviour matches Docker's restart: unless-stopped without the container overhead.

Terraform — spin up, point at app server, done

resource "aws_instance" "monitoring" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.small"
  key_name      = var.key_name

  user_data = templatefile("${path.module}/user_data.sh.tpl", {
    app_server_ip    = var.app_server_ip
    slack_webhook    = var.slack_webhook
    repo_url         = var.repo_url
    grafana_password = var.grafana_password
  })
}

resource "aws_eip" "monitoring" {
  instance = aws_instance.monitoring.id
  domain   = "vpc"
}

The user_data script clones the repo and runs install.sh automatically. All 9 services start without manual intervention.

One-command deploy

cd terraform
terraform apply
# Outputs:
# grafana_url          = "http://3.93.140.221:3000"
# monitoring_server_ip = "3.93.140.221"

All 9 services running

prometheus        running  9090
node_exporter     running  9100
blackbox_exporter running  9115
alertmanager      running  9093
pushgateway       running  9091
loki              running  3100
tempo             running  3200
otelcol           running  8888
grafana-server    running  3000

📸 [Screenshot: bash scripts/status.sh showing all 9 running]

Data retention:

  • Prometheus metrics: 30 days
  • Loki logs: 30 days
  • Tempo traces: 30 days

Part 2: The Four Golden Signals as SLIs

Why Four Golden Signals beat CPU/RAM monitoring

A server at 10% CPU can still be serving every request with 5-second latency. CPU monitoring shows green. The Four Golden Signals show red. That is the difference.

Latency

histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le, job)
)
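
To make the p95 query less of a black box, here is a simplified sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. It is a model for intuition, not Prometheus's implementation. It assumes 0 < q <= 1, buckets sorted by upper bound and ending with +Inf, and a positive lowest bound. The bucket counts are made up.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts for le="0.1", "0.5", "1.0", "+Inf".
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(p95)
```

With these counts the estimate lands around 0.75 seconds, which would breach a p95 < 500ms target. The result is only as precise as the bucket boundaries, so it helps to place a bucket edge at the SLO threshold.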

Traffic

sum(rate(http_requests_total[1m])) by (job)

Errors

sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)

Saturation

# Memory — pulled from app server via tunnel on :9101
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU — pulled from app server via tunnel
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

Part 3: SLOs and Error Budgets

The philosophy

An SLI is a measurement. An SLO is a target. An Error Budget is the allowable gap between perfect and the target.

Instead of asking "is this deployment safe?" the question becomes "do we have enough error budget to absorb the risk?" Objective, not subjective.

Our SLO targets

| SLO | Target | Window | Error Budget |
|---|---|---|---|
| Availability | 99.5% of probes return 2xx | 30 days | 216 minutes |
| Error rate | 99% of requests succeed | 30 days | 432 minutes |
| Latency | p95 < 500ms | Rolling 5m | Alert-only |
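
The budget figures are plain arithmetic: the fraction of the window that is allowed to fail, expressed in minutes. This short sketch reproduces them; the function name is illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed failure for a time-based SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes

print(round(error_budget_minutes(0.995, 30)))  # 216
print(round(error_budget_minutes(0.99, 30)))   # 432
```

For the request-based error-rate SLO, the 432 minutes is a time-equivalent figure. Strictly, that budget is 1% of the requests served in the window.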

Error Budget Policy

| Budget remaining | Action |
|---|---|
| > 50% | Deploy freely |
| 25–50% | Investigate incidents, no major changes |
| < 25% | Reliability sprint, senior review on deploys |
| 0% | Feature freeze until budget recovers |
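
The policy is simple enough to encode, for example as a gate in a deploy pipeline. This is a sketch only: the function name is hypothetical, and treating exactly 25% and 50% as part of the middle band is an assumption, since the table does not say which side the boundaries fall on.

```python
def budget_policy(remaining_pct: float) -> str:
    """Map remaining error budget (percent) to the action in the policy table."""
    if remaining_pct <= 0:
        return "Feature freeze until budget recovers"
    if remaining_pct < 25:
        return "Reliability sprint, senior review on deploys"
    if remaining_pct <= 50:
        return "Investigate incidents, no major changes"
    return "Deploy freely"

print(budget_policy(72.0))
print(budget_policy(10.0))
```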

📸 [Screenshot: SLO & Error Budget dashboard]


Part 4: DORA Metrics

| Metric | Elite target | Business impact |
|---|---|---|
| Deployment Frequency | Multiple/day | How often value reaches users |
| Lead Time for Changes | < 1 hour | How quickly a bug fix ships |
| Change Failure Rate | < 5% | Cost of broken deployments |
| Mean Time to Restore | < 1 hour | Duration of user impact |

GitHub Actions pushing metrics to Pushgateway

- name: Push deployment metrics
  run: |
    LEAD_TIME=$(( $(date +%s) - ${{ steps.timing.outputs.start_ts }} ))
    cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
    deployment_total{status="success",workflow="${WORKFLOW}"} 1
    deployment_lead_time_seconds{workflow="${WORKFLOW}"} ${LEAD_TIME}
    EOF
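
For readers who prefer to build the payload in a script instead of a heredoc, here is a minimal sketch of the same two metrics in the Prometheus text format. The function name and example values are hypothetical. It assumes label values contain no quotes or backslashes, which would need escaping.

```python
def deployment_payload(workflow: str, status: str, lead_time_seconds: int) -> str:
    """Build the text exposition body for the two deployment metrics."""
    lines = [
        f'deployment_total{{status="{status}",workflow="{workflow}"}} 1',
        f'deployment_lead_time_seconds{{workflow="{workflow}"}} {lead_time_seconds}',
    ]
    # The text format requires the final line to end with a newline.
    return "\n".join(lines) + "\n"

body = deployment_payload("deploy", "success", 312)
print(body, end="")
```

The resulting string would be sent as the request body to the same /metrics/job/github_actions path used in the workflow step.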

Toil identified

Toil 1: Manual alert acknowledgement — now every alert includes a direct Grafana link. Saves 2–3 minutes per incident.

Toil 2: Certificate renewal reminders — Blackbox now probes SSL expiry continuously and fires SSLCertExpiringSoon 14 days before expiry.

📸 [Screenshot: DORA metrics dashboard]


Part 5: Five Grafana Dashboards

All provisioned as JSON. The UI was never used.

Loki → Tempo drill-down (the key config)

- name: Loki
  jsonData:
    derivedFields:
      - name: TraceID
        matcherRegex: 'traceID=(\w+)'
        # "$$" escapes "$" so Grafana's provisioning loader does not expand it as an env var
        url: "$${__value.raw}"
        datasourceUid: tempo
        urlDisplayLabel: "Open in Tempo"

Node Exporter — CPU, memory, disk, network. The instance selector shows both servers. Selecting the app server instance shows nodename="ip-172-31-12-155" — real cross-server monitoring confirmed.

📸 [Screenshot: Node Exporter showing app server metrics with correct nodename]

Blackbox Exporter — uptime, HTTP response time, SSL expiry.

📸 [Screenshot: Blackbox Exporter dashboard]

DORA Metrics — deployment frequency, lead time, CFR, MTTR with benchmark classification.

SLO & Error Budget — SLI gauges, budget remaining, burn rate.

Unified Observability — metric spike → Loki logs → clickable trace ID → Tempo waterfall. This drill-down is what separates observability from monitoring.

📸 [Screenshot: Unified Observability dashboard]


Part 6: The Alerting System

Burn rate over threshold alerting

# Fast burn — 2% of monthly budget in 1 hour
- alert: SLOAvailabilityFastBurn
  expr: slo:availability:burn_rate1h > 14.4
  for: 5m
  labels:
    severity: critical

# Slow burn: about 4.2% of monthly budget in 6 hours (a burn rate of 6 would be 5%)
- alert: SLOAvailabilitySlowBurn
  expr: slo:availability:burn_rate6h > 5
  for: 15m
  labels:
    severity: warning
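
The thresholds come from one relationship: budget consumed equals burn rate times the alert window, divided by the SLO window. Here is a small sketch of that arithmetic, assuming a 30-day SLO window. The function name is illustrative.

```python
def budget_consumed(burn_rate: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the SLO window's error budget spent at a constant burn rate."""
    return burn_rate * window_hours / (slo_window_days * 24)

print(round(budget_consumed(14.4, 1) * 100, 2))  # 2.0
print(round(budget_consumed(5, 6) * 100, 2))     # 4.17
print(round(budget_consumed(6, 6) * 100, 2))     # 5.0
```

A burn rate of 14.4 sustained for one hour spends 2% of a 30-day budget. Over six hours, a burn rate of 5 spends about 4.2%, and a burn rate of 6 spends exactly 5%.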

A lesson learned — alert grouping matters

Early in the project we had 5 blackbox targets each firing their own SLO alert — 5 Slack messages per incident. The fix was grouping by alertname and severity only:

route:
  group_by: [alertname, severity]
  group_wait: 60s
  group_interval: 30m
  repeat_interval: 12h

One incident now produces one Slack message regardless of how many targets are affected.
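
The effect of group_by can be modelled in a few lines. This is a simplified illustration of the grouping idea, not Alertmanager's code, and the target names are made up.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)

# Five hypothetical blackbox targets firing the same SLO alert.
alerts = [
    {"alertname": "SLOAvailabilityFastBurn", "severity": "critical", "instance": f"target-{i}"}
    for i in range(5)
]
print(len(group_alerts(alerts, ["alertname", "severity"])))              # 1
print(len(group_alerts(alerts, ["alertname", "severity", "instance"])))  # 5
```

Each group becomes one notification, so leaving instance out of group_by is what collapses five messages into one.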

Inhibition rules

inhibit_rules:
  - source_match:
      alertname: HostDown
    target_match_re:
      alertname: "HighCPU.*|HighMemory.*|HighLatency.*"
    equal: [instance]
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
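
The first rule can be read as a predicate: a target alert is muted when a firing source alert matches and the two agree on every label listed in equal. The sketch below models only that logic under simplifying assumptions (anchored regexes, missing labels treated as empty strings). The alert labels are hypothetical, and it ignores edge cases such as an alert matching both sides of a rule.

```python
import re

def is_inhibited(target, firing, source_match, target_match_re, equal):
    """Return True if some firing alert suppresses `target` under one rule."""
    target_hit = all(
        re.fullmatch(pattern, target.get(label, ""))
        for label, pattern in target_match_re.items()
    )
    if not target_hit:
        return False
    for source in firing:
        source_hit = all(source.get(k) == v for k, v in source_match.items())
        same_scope = all(source.get(k, "") == target.get(k, "") for k in equal)
        if source_hit and same_scope:
            return True
    return False

rule = {
    "source_match": {"alertname": "HostDown"},
    "target_match_re": {"alertname": "HighCPU.*|HighMemory.*|HighLatency.*"},
    "equal": ["instance"],
}
firing = [{"alertname": "HostDown", "instance": "localhost:9101"}]
cpu_same_host = {"alertname": "HighCPUWarning", "instance": "localhost:9101"}
cpu_other_host = {"alertname": "HighCPUWarning", "instance": "localhost:9100"}
print(is_inhibited(cpu_same_host, firing, **rule))   # True
print(is_inhibited(cpu_other_host, firing, **rule))  # False
```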

📸 [Screenshot: Slack showing firing alert with structured payload]
📸 [Screenshot: Slack showing RESOLVED alert]


Part 7: Runbooks and Incident Management

Every alert links to a runbook answering six questions: what is it, likely cause, first 3 investigation steps, resolution, rollback criteria, escalation path.

Blameless PIR

| Time | Event |
|---|---|
| 14:18 | Deployment triggered |
| 14:23 | 503 responses begin (35% of requests) |
| 14:29 | SLOAvailabilityFastBurn fires (6-min detection lag) |
| 14:36 | Trace in Tempo reveals config read failure |
| 14:40 | Root cause: missing DATABASE_URL env var |
| 14:45 | Rollback initiated |
| 15:10 | Error rate returns to baseline |

Action items: reduce fast-burn for: from 2m to 1m, add post-deploy smoke test.
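
The timeline also yields the numbers a PIR usually reports. Here is a small sketch that derives them from the timestamps above, assuming all events fall on the same day. The helper name is illustrative.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

print(minutes_between("14:23", "14:29"))  # 6, detection lag
print(minutes_between("14:23", "15:10"))  # 47, time to restore
```

At 47 minutes from first 503 to baseline, this incident sits inside the under-one-hour restore target from the DORA table.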


Part 8: Game Day Results

Scenario 1 — Deployment failure: exit 1 in GitHub Actions → CICDDeploymentFailed in Slack within 2 minutes.

📸 [Screenshot: GitHub Actions failed run + Slack alert]

Scenario 2 — Latency injection: sudo tc qdisc add dev ens5 root netem delay 600ms → HighLatencyWarning fired after 5 minutes → RESOLVED after removal.

📸 [Screenshot: HighLatencyWarning + RESOLVED in Slack]

Scenario 3 — CPU pressure: stress-ng --cpu 0 drove CPU to 92% → HighCPUWarning fired → RESOLVED after pkill stress-ng.

📸 [Screenshot: Node Exporter showing 92% CPU + Slack alert + RESOLVED]


Real-World Incident — The Tunnel Drops

During the project the reverse SSH tunnel dropped intermittently, causing HostDown and SLOAvailabilityFastBurn alerts to cascade. Before the grouping fix, each drop produced multiple Slack messages.

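Root cause: autossh reconnects in under 30 seconds once it notices a drop, but with ServerAliveInterval 30 and ServerAliveCountMax 3 a dead connection can take up to about 90 seconds to detect. The resulting scrape gap could approach two minutes, so the for: 2m clause on HostDown was short enough to fire on these brief reconnects.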

Fix: increased the for: clause on HostDown from 2m to 5m. This was a genuine production incident, not a simulated one, and it validated the entire pipeline: the alert fired, we investigated in Grafana, identified and fixed the root cause, and received the RESOLVED message in Slack.


Key Learnings

Cross-account monitoring is a real problem. The reverse SSH tunnel solved a constraint most tutorials ignore. You won't always control the firewall on the server you're monitoring.

Alert grouping is as important as alert rules. Five targets firing five messages per incident creates noise. One grouped message creates signal.

Observability is not monitoring. Monitoring says something is wrong. Observability says why, where, and when — without SSHing into a server.

SLOs make reliability objective. Burn-rate alerting beats static thresholds: two alerts replaced dozens.

Terraform gives you a monitoring server that costs nothing when idle. terraform destroy removes it. terraform apply restores it in 5 minutes.

Everything as code is non-negotiable. One command restores everything.


Conclusion

The MeetMind Observability Platform monitors a production application server across AWS account boundaries without needing firewall access or cross-account IAM. A reverse SSH tunnel makes this work. Terraform makes it reproducible. Systemd makes it reliable.

Clone the repo. Fill in terraform.tfvars. Run terraform apply. Full observability in 5 minutes.

GitHub: github.com/AirFluke/meetmind-observability


Built by Team MeetMind — HNG DevOps Track Stage 6
