Abraham Acha

Posted on May 18

Building a Production-Grade Remote Observability Platform with LGTM Stack, DORA Metrics & SLOs

#devops #observability #prometheus #grafana

GitHub: github.com/AirFluke/meetmind-observability
Spin up monitoring server: terraform apply
Tear it down: terraform destroy

Introduction

Modern software teams don't just need to know when something is down. They need to understand why it broke, how long users were affected, how fast the team recovered, and whether engineering practices are improving over time.

For Stage 6 of the HNG DevOps track, Team MeetMind built a production-grade observability platform with a real-world constraint most tutorials ignore: the monitoring stack and the application live on completely separate servers in different AWS accounts.

We solved this with a reverse SSH tunnel — no shared VPC, no cross-account IAM, no firewall rules to negotiate. The monitoring server spins up with one Terraform command, connects to the application server automatically, and tears down just as cleanly.

Architecture — Two Servers, One Observability Platform

Application Server (13.63.206.183)        Monitoring Server (Terraform-managed)
──────────────────────────────────        ─────────────────────────────────────
Node Exporter     :9100  ──────────────→  :9101  (reverse SSH tunnel)
Nginx / App       :80    ──────────────→  :8080  (reverse SSH tunnel)
App               :443   ←─────────────  Blackbox SSL probe (direct)

                                          Prometheus  → scrapes :9101, :8080
                                          Loki        → receives logs
                                          Tempo       → receives traces
                                          Grafana     → 5 dashboards
                                          Alertmanager → Slack #all-hng-alerts

Architecture Overview

The platform runs as nine native systemd services on Ubuntu 24.04, all with automatic restart policies.

Service	Role	Port
Prometheus	Metrics collection and storage	9090
Loki	Log aggregation	3100
Tempo	Distributed trace storage	3200
Grafana	Unified observability frontend	3000
Alertmanager	Alert routing to Slack	9093
Node Exporter	System metrics (CPU, RAM, disk, network)	9100
Blackbox Exporter	HTTP/SSL probing	9115
Pushgateway	Receives DORA metrics from GitHub Actions	9091
OTel Collector	Receives and routes traces and logs	4319/4320

Why the reverse SSH tunnel?

The application server is in a different AWS account. We cannot modify its security group, open firewall ports, or set up VPC peering. The standard solution would be to open port 9100 to the monitoring server IP — but that requires access to the other account.

The reverse SSH tunnel solves this elegantly. The application server initiates an outbound SSH connection to the monitoring server (outbound traffic is almost always allowed). Through that connection, the monitoring server listens on its own port 9101 and forwards every request back to port 9100 on the app server. Prometheus scrapes localhost:9101, which is actually the app server's Node Exporter.

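# /etc/systemd/system/monitoring-tunnel.service on the APP SERVER
[Unit]
Description=AutoSSH Reverse Tunnel to Monitoring Server
# Wait until the network is actually up, not just configured.
Wants=network-online.target
After=network-online.target

[Service]
# GATETIME=0 makes autossh keep retrying even if the first connection fails.
Environment=AUTOSSH_GATETIME=0
User=ubuntu
# -M 0 disables autossh's monitor port; liveness relies on the ServerAlive options.
# -R 9101:localhost:9100 exposes the app server's Node Exporter on the monitoring server.
# -R 8080:localhost:80 exposes Nginx / the app the same way.
# ExitOnForwardFailure makes ssh exit (so autossh retries) if a remote port cannot be bound.
# StrictHostKeyChecking=no skips host key verification; pin the host key where possible.
ExecStart=/usr/bin/autossh -M 0 -N \
  -R 9101:localhost:9100 \
  -R 8080:localhost:80 \
  -i /var/lib/monitoring/id_rsa \
  -o "ServerAliveInterval 30" \
  -o "ServerAliveCountMax 3" \
  -o "ExitOnForwardFailure yes" \
  -o "StrictHostKeyChecking=no" \
  ubuntu@MONITORING_SERVER_IP
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target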

We confirmed the tunnel is pulling real app server metrics by querying node_uname_info in Prometheus:

node_uname_info{nodename="ip-172-31-12-155"} 1

That hostname is the application server. The monitoring server is ip-172-31-21-196. Two different machines, one Prometheus.
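
The same check can be scripted. Here is a minimal sketch, assuming the standard response shape of the Prometheus HTTP API for an instant query on node_uname_info. The helper name nodenames_from_response, the instance labels, and the sample payload are illustrative and not taken from the project repo:

```python
import json


def nodenames_from_response(body: str) -> set[str]:
    """Return the nodename label of every series in an instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    return {
        series["metric"]["nodename"]
        for series in payload["data"]["result"]
        if "nodename" in series["metric"]
    }


# Sample payload shaped like a real response; the instance labels are assumptions.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "node_uname_info", "instance": "localhost:9101",
                    "nodename": "ip-172-31-12-155"}, "value": [1716000000, "1"]},
        {"metric": {"__name__": "node_uname_info", "instance": "localhost:9100",
                    "nodename": "ip-172-31-21-196"}, "value": [1716000000, "1"]},
    ]},
})

hosts = nodenames_from_response(sample)
print(sorted(hosts))
```

Seeing two distinct nodenames in the result is the signal that one Prometheus is scraping both machines.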

📸 [Screenshot: Prometheus targets page showing all targets UP]

Why LGTM Over Managed Alternatives?

Cost at scale. Managed platforms charge per host, per metric, per log line. The LGTM stack runs on a single t3.small EC2 instance with no per-metric pricing.

Data sovereignty. Logs contain sensitive data. Self-hosted Loki keeps them within your own infrastructure.

No vendor lock-in. Prometheus and OpenTelemetry are open standards. Every dashboard, alert rule, and config is portable.

Spin up and down on demand. Because everything is Terraform, the monitoring server costs nothing when not needed. terraform destroy removes it entirely, and terraform apply restores it in minutes, pointing at the same application server.

Part 1: Deploying the Stack

Why systemd instead of Docker

We chose native systemd services — each service runs as a dedicated unprivileged system user, logs go to journald, and restart behaviour matches Docker's restart: unless-stopped without the container overhead.

Terraform — spin up, point at app server, done

resource "aws_instance" "monitoring" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.small"
  key_name      = var.key_name

  user_data = templatefile("${path.module}/user_data.sh.tpl", {
    app_server_ip    = var.app_server_ip
    slack_webhook    = var.slack_webhook
    repo_url         = var.repo_url
    grafana_password = var.grafana_password
  })
}

resource "aws_eip" "monitoring" {
  instance = aws_instance.monitoring.id
  domain   = "vpc"
}

The user_data script clones the repo and runs install.sh automatically. All 9 services start without manual intervention.

One-command deploy

cd terraform
terraform apply
# Outputs:
# grafana_url          = "http://3.93.140.221:3000"
# monitoring_server_ip = "3.93.140.221"

All 9 services running

prometheus        running  9090
node_exporter     running  9100
blackbox_exporter running  9115
alertmanager      running  9093
pushgateway       running  9091
loki              running  3100
tempo             running  3200
otelcol           running  8888
grafana-server    running  3000

📸 [Screenshot: bash scripts/status.sh showing all 9 running]

Data retention:

Prometheus metrics: 30 days
Loki logs: 30 days
Tempo traces: 30 days

Part 2: The Four Golden Signals as SLIs

Why Four Golden Signals beat CPU/RAM monitoring

A server at 10% CPU can still be serving every request with 5-second latency. CPU monitoring shows green. The Four Golden Signals show red. That is the difference.

Latency

histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le, job)
)
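
To make the query above less of a black box, here is a rough sketch of the idea behind histogram_quantile: find the bucket that contains the target rank, then interpolate linearly inside it. The bucket bounds and counts are invented for illustration, and this simplified function is not Prometheus's actual implementation:

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative bucket counts, keyed by the "le" label in seconds.
sample_buckets = [(0.1, 50), (0.25, 80), (0.5, 97), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample_buckets)
print(f"p95 latency is roughly {p95:.3f}s")
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries sit around the 500ms target.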

Traffic

sum(rate(http_requests_total[1m])) by (job)

Errors

sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)

Saturation

# Memory — pulled from app server via tunnel on :9101
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU — pulled from app server via tunnel
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

Part 3: SLOs and Error Budgets

The philosophy

An SLI is a measurement. An SLO is a target. An Error Budget is the allowable gap between perfect and the target.

Instead of asking "is this deployment safe?" the question becomes "do we have enough error budget to absorb the risk?" Objective, not subjective.

Our SLO targets

SLO	Target	Window	Error Budget
Availability	99.5% of probes return 2xx	30 days	216 minutes
Error rate	99% of requests succeed	30 days	432 minutes
Latency	p95 < 500ms	Rolling 5m	Alert-only
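
The error budget column is plain arithmetic: the fraction of the window that is allowed to be bad. A small sketch that reproduces the two numbers in the table (the function name is ours, for illustration):

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unreliability for an SLO target over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


availability_budget = error_budget_minutes(0.995, 30)
error_rate_budget = error_budget_minutes(0.99, 30)
print(round(availability_budget), round(error_rate_budget))
```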

Error Budget Policy

Budget remaining	Action
> 50%	Deploy freely
25–50%	Investigate incidents, no major changes
< 25%	Reliability sprint, senior review on deploys
0%	Feature freeze until budget recovers
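
The policy table maps directly onto a small decision function. This is a sketch of that mapping only. How the boundary values (exactly 25% and exactly 50%) are handled is our assumption, and the bad-minutes figure in the example is hypothetical:

```python
def budget_remaining(budget_minutes: float, bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    return 1.0 - bad_minutes / budget_minutes


def budget_policy(remaining: float) -> str:
    """Map remaining budget (1.0 = untouched, 0.0 = exhausted) to an action."""
    if remaining <= 0:
        return "Feature freeze until budget recovers"
    if remaining < 0.25:
        return "Reliability sprint, senior review on deploys"
    if remaining <= 0.50:
        return "Investigate incidents, no major changes"
    return "Deploy freely"


# Hypothetical month: 120 bad minutes against the 216-minute availability budget.
remaining = budget_remaining(216, 120)
print(f"{remaining:.0%} left -> {budget_policy(remaining)}")
```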

📸 [Screenshot: SLO & Error Budget dashboard]

Part 4: DORA Metrics

Metric	Elite target	Business impact
Deployment Frequency	Multiple/day	How often value reaches users
Lead Time for Changes	< 1 hour	How quickly a bug fix ships
Change Failure Rate	< 5%	Cost of broken deployments
Mean Time to Restore	< 1 hour	Duration of user impact
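
As a rough illustration of how the four numbers fall out of raw deployment records, here is a sketch that computes them and compares each with the elite target from the table. The record shape, the sample data, and the reading of "Multiple/day" as more than one deploy per day are all assumptions:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Deployment:
    lead_time_s: float            # commit-to-production time
    failed: bool                  # did it cause a failure in production?
    restore_time_s: float = 0.0   # only meaningful when failed is True


def dora_summary(deploys, period_days):
    failures = [d for d in deploys if d.failed]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "mean_lead_time_s": mean(d.lead_time_s for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mttr_s": mean(d.restore_time_s for d in failures) if failures else 0.0,
    }


def meets_elite(summary):
    return {
        "deployment_frequency": summary["deploys_per_day"] > 1,
        "lead_time": summary["mean_lead_time_s"] < 3600,
        "change_failure_rate": summary["change_failure_rate"] < 0.05,
        "mttr": summary["mttr_s"] < 3600,
    }


# Hypothetical history: 40 deploys in 10 days, one failure restored in 47 minutes.
history = [Deployment(lead_time_s=1200, failed=False) for _ in range(39)]
history.append(Deployment(lead_time_s=1200, failed=True, restore_time_s=47 * 60))
summary = dora_summary(history, period_days=10)
print(summary)
print(meets_elite(summary))
```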

GitHub Actions pushing metrics to Pushgateway

- name: Push deployment metrics
  run: |
    LEAD_TIME=$(( $(date +%s) - ${{ steps.timing.outputs.start_ts }} ))
    cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
    deployment_total{status="success",workflow="${WORKFLOW}"} 1
    deployment_lead_time_seconds{workflow="${WORKFLOW}"} ${LEAD_TIME}
    EOF
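
For anyone who wants to unit-test the payload before wiring it into a workflow, here is a sketch of the same push expressed as two pure functions. The base URL and the extra grouping label are hypothetical. Label values are assumed to be URL-safe. Also note that Pushgateway keeps only the most recent value pushed for a given metric and grouping key, so a repeated push of 1 does not accumulate on its own:

```python
def pushgateway_url(base_url, job, grouping=None):
    """Build the push path: /metrics/job/<job>{/<label>/<value>}."""
    path = f"/metrics/job/{job}"
    for label, value in (grouping or {}).items():
        path += f"/{label}/{value}"
    return base_url.rstrip("/") + path


def deployment_payload(workflow, status, lead_time_s):
    """Render the two samples in the Prometheus text exposition format."""
    lines = [
        f'deployment_total{{status="{status}",workflow="{workflow}"}} 1',
        f'deployment_lead_time_seconds{{workflow="{workflow}"}} {lead_time_s}',
    ]
    # The exposition format requires the final line to end with a newline.
    return "\n".join(lines) + "\n"


url = pushgateway_url("http://pushgateway.example:9091", "github_actions",
                      {"repo": "meetmind"})
body = deployment_payload("deploy", "success", 754)
print(url)
print(body, end="")
```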

Toil identified

Toil 1: Manual alert acknowledgement — now every alert includes a direct Grafana link. Saves 2–3 minutes per incident.

Toil 2: Certificate renewal reminders — Blackbox now probes SSL expiry continuously and fires SSLCertExpiringSoon 14 days before expiry.

📸 [Screenshot: DORA metrics dashboard]

Part 5: Five Grafana Dashboards

All provisioned as JSON. The UI was never used.

Loki → Tempo drill-down (the key config)

- name: Loki
  jsonData:
    derivedFields:
      - name: TraceID
        matcherRegex: 'traceID=(\w+)'
        # $$ escapes Grafana's environment-variable expansion in provisioning files
        url: '$${__value.raw}'
        datasourceUid: tempo
        urlDisplayLabel: "Open in Tempo"
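
The matcherRegex is easy to sanity-check outside Grafana. This sketch applies the same pattern to a made-up log line. The log format and the trace ID are hypothetical:

```python
import re

# Same pattern as matcherRegex in the datasource config.
TRACE_ID = re.compile(r"traceID=(\w+)")


def extract_trace_id(line):
    """Return the captured trace ID, or None when the line has none."""
    match = TRACE_ID.search(line)
    return match.group(1) if match else None


log_line = 'level=error msg="config read failed" traceID=4bf92f3577b34da6 status=503'
trace_id = extract_trace_id(log_line)
print(trace_id)
```

If the application logs the ID under a different key (for example trace_id=), the regex has to change to match, otherwise the link never appears.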

Node Exporter — CPU, memory, disk, network. The instance selector shows both servers. Selecting the app server instance shows nodename="ip-172-31-12-155" — real cross-server monitoring confirmed.

📸 [Screenshot: Node Exporter showing app server metrics with correct nodename]

Blackbox Exporter — uptime, HTTP response time, SSL expiry.

📸 [Screenshot: Blackbox Exporter dashboard]

DORA Metrics — deployment frequency, lead time, CFR, MTTR with benchmark classification.

SLO & Error Budget — SLI gauges, budget remaining, burn rate.

Unified Observability — metric spike → Loki logs → clickable trace ID → Tempo waterfall. This drill-down is what separates observability from monitoring.

📸 [Screenshot: Unified Observability dashboard]

Part 6: The Alerting System

Burn-rate alerting instead of static thresholds

# Fast burn: 2% of monthly budget in 1 hour (burn rate 14.4)
- alert: SLOAvailabilityFastBurn
  expr: slo:availability:burn_rate1h > 14.4
  for: 5m
  labels:
    severity: critical

# Slow burn: 5% of monthly budget in 6 hours (burn rate 6)
- alert: SLOAvailabilitySlowBurn
  expr: slo:availability:burn_rate6h > 6
  for: 15m
  labels:
    severity: warning
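
The thresholds come from simple arithmetic over a 30-day (720-hour) budget period. A burn rate of 1 spends the budget exactly over the full period. A burn rate of 14.4 sustained for 1 hour spends 2% of it, and a burn rate of 6 sustained for 6 hours spends 5%. A sketch, with function names of our own choosing:

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate, window_hours, period_hours=PERIOD_HOURS):
    """Fraction of the period's budget spent at this rate over the window."""
    return rate * window_hours / period_hours


def threshold_for(budget_fraction, window_hours, period_hours=PERIOD_HOURS):
    """Burn rate that spends budget_fraction of the budget within the window."""
    return budget_fraction * period_hours / window_hours


fast = threshold_for(0.02, 1)
slow = threshold_for(0.05, 6)
print(round(fast, 1), round(slow, 1))
```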

A lesson learned — alert grouping matters

Early in the project we had 5 blackbox targets each firing their own SLO alert — 5 Slack messages per incident. The fix was grouping by alertname and severity only:

route:
  group_by: [alertname, severity]
  group_wait: 60s
  group_interval: 30m
  repeat_interval: 12h

One incident now produces one Slack message regardless of how many targets are affected.
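
The effect of group_by is easy to see with a toy model. This sketch groups label sets the way a route's group_by does. The instance names are invented, and real Alertmanager grouping also involves timing (group_wait, group_interval), which is not modelled here:

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Five hypothetical blackbox targets firing the same SLO alert.
alerts = [
    {"alertname": "SLOAvailabilityFastBurn", "severity": "critical",
     "instance": f"target-{n}"}
    for n in range(1, 6)
]

per_instance = group_alerts(alerts, ["alertname", "severity", "instance"])
grouped = group_alerts(alerts, ["alertname", "severity"])
print(len(per_instance), "notifications vs", len(grouped))
```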

Inhibition rules

inhibit_rules:
  - source_match:
      alertname: HostDown
    target_match_re:
      alertname: "HighCPU.*|HighMemory.*|HighLatency.*"
    equal: [instance]
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
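
To illustrate what the first rule does, here is a simplified model of inhibition. A target alert is muted when some firing source alert matches source_match and agrees with it on every label in equal. The alert names ending in Warning and the instance labels are examples, and this toy does not cover every Alertmanager edge case:

```python
import re

HOST_DOWN_RULE = {
    "source_match": {"alertname": "HostDown"},
    "target_match_re": {"alertname": "HighCPU.*|HighMemory.*|HighLatency.*"},
    "equal": ["instance"],
}


def matches_exact(alert, matchers):
    return all(alert.get(key) == value for key, value in matchers.items())


def matches_regex(alert, matchers):
    # Alertmanager anchors its regexes, so fullmatch is the closest equivalent.
    return all(re.fullmatch(pattern, alert.get(key, ""))
               for key, pattern in matchers.items())


def is_inhibited(target, firing, rule):
    if not matches_regex(target, rule["target_match_re"]):
        return False
    return any(
        matches_exact(source, rule["source_match"])
        and all(source.get(label) == target.get(label) for label in rule["equal"])
        for source in firing
    )


host_down = {"alertname": "HostDown", "instance": "localhost:9101"}
cpu_same_host = {"alertname": "HighCPUWarning", "instance": "localhost:9101"}
cpu_other_host = {"alertname": "HighCPUWarning", "instance": "localhost:9100"}
firing = [host_down, cpu_same_host, cpu_other_host]

muted = [a for a in firing if is_inhibited(a, firing, HOST_DOWN_RULE)]
print(muted)
```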

📸 [Screenshot: Slack showing firing alert with structured payload]
📸 [Screenshot: Slack showing RESOLVED alert]

Part 7: Runbooks and Incident Management

Every alert links to a runbook answering six questions: what is it, likely cause, first 3 investigation steps, resolution, rollback criteria, escalation path.

Blameless PIR

Time	Event
14:18	Deployment triggered
14:23	503 responses begin (35% of requests)
14:29	SLOAvailabilityFastBurn fires (6-min detection lag)
14:36	Trace in Tempo reveals config read failure
14:40	Root cause: missing DATABASE_URL env var
14:45	Rollback initiated
15:10	Error rate returns to baseline

Action items: reduce fast-burn for: from 2m to 1m, add post-deploy smoke test.
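
The intervals worth tracking fall straight out of the timeline. A small sketch that derives them from the timestamps in the table (variable names are ours):

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


impact_start = "14:23"      # 503 responses begin
alert_fired = "14:29"       # SLOAvailabilityFastBurn fires
rollback_started = "14:45"  # rollback initiated
recovered = "15:10"         # error rate returns to baseline

detection_lag = minutes_between(impact_start, alert_fired)
time_to_restore = minutes_between(impact_start, recovered)
rollback_duration = minutes_between(rollback_started, recovered)
print(detection_lag, time_to_restore, rollback_duration)
```

Measured from first user impact to baseline, this incident took 47 minutes to restore, which is inside the one-hour elite target for time to restore.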

Part 8: Game Day Results

Scenario 1 — Deployment failure: exit 1 in GitHub Actions → CICDDeploymentFailed in Slack within 2 minutes.

📸 [Screenshot: GitHub Actions failed run + Slack alert]

Scenario 2 — Latency injection: sudo tc qdisc add dev ens5 root netem delay 600ms → HighLatencyWarning fired after 5 minutes → RESOLVED after removal.

📸 [Screenshot: HighLatencyWarning + RESOLVED in Slack]

Scenario 3 — CPU pressure: stress-ng --cpu 0 drove CPU to 92% → HighCPUWarning fired → RESOLVED after pkill stress-ng.

📸 [Screenshot: Node Exporter showing 92% CPU + Slack alert + RESOLVED]

Real-World Incident — The Tunnel Drops

During the project the reverse SSH tunnel dropped intermittently, causing HostDown and SLOAvailabilityFastBurn alerts to cascade. Multiple Slack messages per drop before the grouping fix.

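Root cause: autossh reconnects in under 30 seconds once it notices a drop, but noticing takes time. With ServerAliveInterval 30 and ServerAliveCountMax 3, ssh can wait up to about 90 seconds before declaring the connection dead, so a single drop could keep the target unreachable long enough to satisfy the for: 2m clause on HostDown.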

Fix: we raised for: on HostDown from 2m to 5m. This was a genuine production incident, not a simulated one, and it validated the entire pipeline: the alert fired, we investigated in Grafana, identified the root cause, applied the fix, and received the RESOLVED message in Slack.

Key Learnings

Cross-account monitoring is a real problem. The reverse SSH tunnel solved a constraint most tutorials ignore. You won't always control the firewall on the server you're monitoring.

Alert grouping is as important as alert rules. Five targets firing five messages per incident creates noise. One grouped message creates signal.

Observability is not monitoring. Monitoring says something is wrong. Observability says why, where, and when — without SSHing into a server.

SLOs make reliability objective. Burn-rate alerts replaced static thresholds, and two alerts replaced dozens.

Terraform gives you a monitoring server that costs nothing when idle. terraform destroy removes it. terraform apply restores it in 5 minutes.

Everything as code is non-negotiable. One command restores everything.

Conclusion

The MeetMind Observability Platform monitors a production application server across AWS account boundaries without needing firewall access or cross-account IAM. A reverse SSH tunnel makes this work. Terraform makes it reproducible. Systemd makes it reliable.

Clone the repo. Fill in terraform.tfvars. Run terraform apply. Full observability in 5 minutes.

GitHub: github.com/AirFluke/meetmind-observability

Built by Team MeetMind — HNG DevOps Track Stage 6
