GitHub: github.com/AirFluke/meetmind-observability
Spin up monitoring server: terraform apply
Tear it down: terraform destroy
Introduction
Modern software teams don't just need to know when something is down. They need to understand why it broke, how long users were affected, how fast the team recovered, and whether engineering practices are improving over time.
For Stage 6 of the HNG DevOps track, Team MeetMind built a production-grade observability platform with a real-world constraint most tutorials ignore: the monitoring stack and the application live on completely separate servers in different AWS accounts.
We solved this with a reverse SSH tunnel — no shared VPC, no cross-account IAM, no firewall rules to negotiate. The monitoring server spins up with one Terraform command, connects to the application server automatically, and tears down just as cleanly.
Architecture — Two Servers, One Observability Platform
Application Server (13.63.206.183)        Monitoring Server (Terraform-managed)
──────────────────────────────────        ─────────────────────────────────────
Node Exporter :9100 ──────────────→       :9101 (reverse SSH tunnel)
Nginx / App   :80   ──────────────→       :8080 (reverse SSH tunnel)
App           :443  ←─────────────        Blackbox SSL probe (direct)
                                          Prometheus   → scrapes :9101, :8080
                                          Loki         → receives logs
                                          Tempo        → receives traces
                                          Grafana      → 5 dashboards
                                          Alertmanager → Slack #all-hng-alerts
Architecture Overview
The platform runs as nine native systemd services on Ubuntu 24.04, all with automatic restart policies.
| Service | Role | Port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| Loki | Log aggregation | 3100 |
| Tempo | Distributed trace storage | 3200 |
| Grafana | Unified observability frontend | 3000 |
| Alertmanager | Alert routing to Slack | 9093 |
| Node Exporter | System metrics (CPU, RAM, disk, network) | 9100 |
| Blackbox Exporter | HTTP/SSL probing | 9115 |
| Pushgateway | Receives DORA metrics from GitHub Actions | 9091 |
| OTel Collector | Receives and routes traces and logs | 4319/4320 |
Why the reverse SSH tunnel?
The application server is in a different AWS account. We cannot modify its security group, open firewall ports, or set up VPC peering. The standard solution would be to open port 9100 to the monitoring server IP — but that requires access to the other account.
The reverse SSH tunnel solves this elegantly. The application server initiates an outbound SSH connection to the monitoring server (outbound traffic is almost always allowed). Through that connection, the monitoring server listens on port 9101 and relays anything sent there back to port 9100 on the app server. Prometheus scrapes localhost:9101, which is actually the app server's Node Exporter.
# /etc/systemd/system/monitoring-tunnel.service on the APP SERVER
[Unit]
Description=AutoSSH Reverse Tunnel to Monitoring Server
After=network.target

[Service]
Environment=AUTOSSH_GATETIME=0
User=ubuntu
ExecStart=/usr/bin/autossh -M 0 -N \
    -R 9101:localhost:9100 \
    -R 8080:localhost:80 \
    -i /var/lib/monitoring/id_rsa \
    -o "ServerAliveInterval 30" \
    -o "ServerAliveCountMax 3" \
    -o "StrictHostKeyChecking=no" \
    ubuntu@MONITORING_SERVER_IP
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
We confirmed the tunnel is pulling real app server metrics by querying node_uname_info in Prometheus:
node_uname_info{nodename="ip-172-31-12-155"} 1
That hostname is the application server. The monitoring server is ip-172-31-21-196. Two different machines, one Prometheus.
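The same check can be scripted against the raw scrape output. The sketch below is only an illustration: the sample text is a trimmed, hypothetical excerpt of what localhost:9101/metrics might return, and the function names are ours, not part of the repo.

```python
import re
from typing import Optional

# Hypothetical, trimmed sample of a Node Exporter scrape through the tunnel.
SAMPLE_METRICS = (
    "# TYPE node_uname_info gauge\n"
    'node_uname_info{nodename="ip-172-31-12-155",sysname="Linux"} 1\n'
)


def scraped_nodename(metrics_text: str) -> Optional[str]:
    """Return the nodename label of node_uname_info, or None if it is absent."""
    for line in metrics_text.splitlines():
        if line.startswith("node_uname_info{"):
            match = re.search(r'nodename="([^"]*)"', line)
            if match:
                return match.group(1)
    return None


def is_remote_host(metrics_text: str, local_nodename: str) -> bool:
    """True when the scraped metrics come from a host other than the local one."""
    name = scraped_nodename(metrics_text)
    return name is not None and name != local_nodename


# The monitoring server's own hostname, as reported in the text above.
print(is_remote_host(SAMPLE_METRICS, "ip-172-31-21-196"))
```

If the tunnel were accidentally pointed at the monitoring server's own exporter, the nodename would match the local hostname and the check would return False.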
📸 [Screenshot: Prometheus targets page showing all targets UP]
Why LGTM Over Managed Alternatives?
Cost at scale. Managed platforms charge per host, per metric, per log line. The LGTM stack runs on a single t3.small EC2 instance with no per-metric pricing.
Data sovereignty. Logs contain sensitive data. Self-hosted Loki keeps them within your own infrastructure.
No vendor lock-in. Prometheus and OpenTelemetry are open standards. Every dashboard, alert rule, and config is portable.
Spin up and down on demand. Because everything is Terraform, the monitoring server costs nothing when not needed. terraform destroy removes it entirely. terraform apply restores it in minutes pointing at the same application server.
Part 1: Deploying the Stack
Why systemd instead of Docker
We chose native systemd services — each service runs as a dedicated unprivileged system user, logs go to journald, and restart behaviour matches Docker's restart: unless-stopped without the container overhead.
Terraform — spin up, point at app server, done
resource "aws_instance" "monitoring" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.small"
  key_name      = var.key_name

  user_data = templatefile("${path.module}/user_data.sh.tpl", {
    app_server_ip    = var.app_server_ip
    slack_webhook    = var.slack_webhook
    repo_url         = var.repo_url
    grafana_password = var.grafana_password
  })
}

resource "aws_eip" "monitoring" {
  instance = aws_instance.monitoring.id
  domain   = "vpc"
}
The user_data script clones the repo and runs install.sh automatically. All 9 services start without manual intervention.
One-command deploy
cd terraform
terraform apply
# Outputs:
# grafana_url = "http://3.93.140.221:3000"
# monitoring_server_ip = "3.93.140.221"
All 9 services running
prometheus running 9090
node_exporter running 9100
blackbox_exporter running 9115
alertmanager running 9093
pushgateway running 9091
loki running 3100
tempo running 3200
otelcol running 8888
grafana-server running 3000
📸 [Screenshot: bash scripts/status.sh showing all 9 running]
Data retention:
- Prometheus metrics: 30 days
- Loki logs: 30 days
- Tempo traces: 30 days
Part 2: The Four Golden Signals as SLIs
Why Four Golden Signals beat CPU/RAM monitoring
A server at 10% CPU can still be serving every request with 5-second latency. CPU monitoring shows green. The Four Golden Signals show red. That is the difference.
Latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le, job)
)
Traffic
sum(rate(http_requests_total[1m])) by (job)
Errors
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
Saturation
# Memory — pulled from app server via tunnel on :9101
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# CPU — pulled from app server via tunnel
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
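To make the latency query less of a black box, here is a minimal Python sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. It is a simplified approximation of the documented behaviour, not Prometheus's actual code, and the bucket values are invented for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    with the last bound being +Inf, as in a *_bucket series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound instead, since there is nothing to interpolate.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Invented example: 100 requests, bucket bounds in seconds.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

The takeaway is that the result is only as precise as the bucket boundaries: a p95 SLO of 500ms is easiest to evaluate when one of the bucket bounds sits at 0.5.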
Part 3: SLOs and Error Budgets
The philosophy
An SLI is a measurement. An SLO is a target. An Error Budget is the allowable gap between perfect and the target.
Instead of asking "is this deployment safe?" the question becomes "do we have enough error budget to absorb the risk?" Objective, not subjective.
Our SLO targets
| SLO | Target | Window | Error Budget |
|---|---|---|---|
| Availability | 99.5% of probes return 2xx | 30 days | 216 minutes |
| Error rate | 99% of requests succeed | 30 days | 432 minutes |
| Latency | p95 < 500ms | Rolling 5m | Alert-only |
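The error budget column is plain arithmetic: the share of the window you are allowed to miss. A small sketch (function name is ours) that reproduces the two figures in the table:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed failure in the window for a given SLO target."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (1.0 - slo_target) * window_minutes


print(f"{error_budget_minutes(0.995):.0f} minutes")  # availability SLO
print(f"{error_budget_minutes(0.99):.0f} minutes")   # error-rate SLO
```

Expressing the request-based error-rate budget in minutes is a simplification; strictly it is 1% of requests, which only equals 432 minutes if traffic is evenly spread.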
Error Budget Policy
| Budget remaining | Action |
|---|---|
| > 50% | Deploy freely |
| 25–50% | Investigate incidents, no major changes |
| < 25% | Reliability sprint, senior review on deploys |
| 0% | Feature freeze until budget recovers |
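The policy table translates directly into a lookup that a deploy pipeline could consult. This is a sketch of one possible encoding; how the boundaries at exactly 25% and 50% are treated is our assumption, since the table does not say.

```python
def budget_policy(remaining_pct: float) -> str:
    """Map remaining error budget (percent) to the action from the policy table."""
    if remaining_pct <= 0:
        return "Feature freeze until budget recovers"
    if remaining_pct < 25:
        return "Reliability sprint, senior review on deploys"
    if remaining_pct <= 50:
        return "Investigate incidents, no major changes"
    return "Deploy freely"


for pct in (80, 40, 10, 0):
    print(pct, "->", budget_policy(pct))
```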
📸 [Screenshot: SLO & Error Budget dashboard]
Part 4: DORA Metrics
| Metric | Elite target | Business impact |
|---|---|---|
| Deployment Frequency | Multiple/day | How often value reaches users |
| Lead Time for Changes | < 1 hour | How quickly a bug fix ships |
| Change Failure Rate | < 5% | Cost of broken deployments |
| Mean Time to Restore | < 1 hour | Duration of user impact |
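For readers new to DORA, the four numbers can be derived from a simple list of deployment records. The sketch below uses a hypothetical record shape and made-up sample data; it is not how the dashboards in this project compute them (those use PromQL over the pushed metrics).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Deployment:
    lead_time_s: float                      # commit-to-production time
    failed: bool                            # did this deploy cause an incident?
    restore_time_s: Optional[float] = None  # time to recover, failed deploys only


def dora_summary(deploys: List[Deployment], window_days: float) -> Dict[str, float]:
    """Compute the four DORA metrics for a window of deployments."""
    n = len(deploys)
    failures = [d for d in deploys if d.failed]
    restores = [d.restore_time_s for d in failures if d.restore_time_s is not None]
    return {
        "deploys_per_day": n / window_days,
        "mean_lead_time_s": sum(d.lead_time_s for d in deploys) / n if n else 0.0,
        "change_failure_rate": len(failures) / n if n else 0.0,
        "mttr_s": sum(restores) / len(restores) if restores else 0.0,
    }


# Made-up sample: four deploys over two days, one of which failed.
sample = [
    Deployment(600, False),
    Deployment(1200, False),
    Deployment(1800, True, restore_time_s=2820),
    Deployment(400, False),
]
print(dora_summary(sample, window_days=2))
```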
GitHub Actions pushing metrics to Pushgateway
- name: Push deployment metrics
  run: |
    LEAD_TIME=$(( $(date +%s) - ${{ steps.timing.outputs.start_ts }} ))
    cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
    deployment_total{status="success",workflow="${WORKFLOW}"} 1
    deployment_lead_time_seconds{workflow="${WORKFLOW}"} ${LEAD_TIME}
    EOF
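The same push can be done from any language, since Pushgateway accepts the plain text exposition format over HTTP. Here is a minimal Python equivalent of the step above; the function names are ours, and the URL would come from the same PUSHGATEWAY_URL setting the workflow uses.

```python
import urllib.request


def build_payload(workflow: str, lead_time_s: int, status: str = "success") -> str:
    """Build the text-format body with the two metrics the workflow pushes."""
    lines = [
        f'deployment_total{{status="{status}",workflow="{workflow}"}} 1',
        f'deployment_lead_time_seconds{{workflow="{workflow}"}} {lead_time_s}',
    ]
    # Pushgateway expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def push(pushgateway_url: str, payload: str, job: str = "github_actions") -> int:
    """POST the payload to /metrics/job/<job> and return the HTTP status code."""
    request = urllib.request.Request(
        f"{pushgateway_url}/metrics/job/{job}",
        data=payload.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Only builds the body here; call push(...) with a reachable Pushgateway URL.
print(build_payload("deploy", 420), end="")
```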
Toil identified
Toil 1: Manual alert acknowledgement — now every alert includes a direct Grafana link. Saves 2–3 minutes per incident.
Toil 2: Certificate renewal reminders — Blackbox now probes SSL expiry continuously and fires SSLCertExpiringSoon 14 days before expiry.
📸 [Screenshot: DORA metrics dashboard]
Part 5: Five Grafana Dashboards
All provisioned as JSON. The UI was never used.
Loki → Tempo drill-down (the key config)
- name: Loki
  jsonData:
    derivedFields:
      - name: TraceID
        matcherRegex: 'traceID=(\w+)'
        # "$$" escapes the "$" so Grafana's provisioning does not treat it as an env var
        url: "$${__value.raw}"
        datasourceUid: tempo
        urlDisplayLabel: "Open in Tempo"
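The matcherRegex is the piece that decides whether a log line gets a clickable trace link. A quick way to sanity-check it is to run the same pattern over a sample line; the log line below is hypothetical, and Python's regex engine is used here as a stand-in for the one Grafana runs in the browser.

```python
import re
from typing import Optional

# Same pattern as the derived field's matcherRegex.
TRACE_ID_PATTERN = re.compile(r"traceID=(\w+)")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the captured trace ID, or None when the line has no traceID= field."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


# Hypothetical application log line.
line = 'level=error msg="config read failed" traceID=4bf92f3577b34da6a3ce929d0e0e4736 status=503'
print(extract_trace_id(line))
```

If the application logs the ID under a different key (for example trace_id=), the pattern returns nothing and the link never appears, which is the first thing to check when drill-down silently stops working.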
Node Exporter — CPU, memory, disk, network. The instance selector shows both servers. Selecting the app server instance shows nodename="ip-172-31-12-155" — real cross-server monitoring confirmed.
📸 [Screenshot: Node Exporter showing app server metrics with correct nodename]
Blackbox Exporter — uptime, HTTP response time, SSL expiry.
📸 [Screenshot: Blackbox Exporter dashboard]
DORA Metrics — deployment frequency, lead time, CFR, MTTR with benchmark classification.
SLO & Error Budget — SLI gauges, budget remaining, burn rate.
Unified Observability — metric spike → Loki logs → clickable trace ID → Tempo waterfall. This drill-down is what separates observability from monitoring.
📸 [Screenshot: Unified Observability dashboard]
Part 6: The Alerting System
Burn-rate alerting over threshold alerting
# Fast burn: a burn rate of 14.4 consumes 2% of the 30-day budget in 1 hour
- alert: SLOAvailabilityFastBurn
  expr: slo:availability:burn_rate1h > 14.4
  for: 5m
  labels:
    severity: critical
# Slow burn: a burn rate of 5 consumes about 4.2% of the 30-day budget in 6 hours
# (consuming a full 5% in 6 hours corresponds to a burn rate of 6)
- alert: SLOAvailabilitySlowBurn
  expr: slo:availability:burn_rate6h > 5
  for: 15m
  labels:
    severity: warning
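The thresholds come from a simple relationship: burn rate times the alert window, divided by the SLO window, is the share of budget consumed. A short sketch of that arithmetic for a 30-day window (function names are ours):

```python
SLO_WINDOW_HOURS = 30 * 24  # 720 hours in a 30-day window


def burn_rate_threshold(budget_fraction: float, alert_window_hours: float) -> float:
    """Burn rate that consumes budget_fraction of the budget within the alert window."""
    return budget_fraction * SLO_WINDOW_HOURS / alert_window_hours


def budget_consumed(burn_rate: float, alert_window_hours: float) -> float:
    """Fraction of the 30-day budget consumed at a given burn rate over the window."""
    return burn_rate * alert_window_hours / SLO_WINDOW_HOURS


print(f"2% in 1h  -> burn rate {burn_rate_threshold(0.02, 1):.1f}")
print(f"5% in 6h  -> burn rate {burn_rate_threshold(0.05, 6):.1f}")
print(f"rate 5 over 6h -> {budget_consumed(5, 6):.1%} of budget")
```

A burn rate of 1 means the budget runs out exactly at the end of the window; 14.4 means it would be gone in about two days.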
A lesson learned — alert grouping matters
Early in the project we had 5 blackbox targets each firing their own SLO alert — 5 Slack messages per incident. The fix was grouping by alertname and severity only:
route:
  group_by: [alertname, severity]
  group_wait: 60s
  group_interval: 30m
  repeat_interval: 12h
One incident now produces one Slack message regardless of how many targets are affected.
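The effect of group_by is easy to see in a toy model: alerts that share the same values for the grouping labels end up in one notification. The sketch below imitates that bucketing with hypothetical alert labels; it is an illustration of the idea, not Alertmanager's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels (one bucket = one message)."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Five hypothetical blackbox targets firing the same SLO alert.
alerts = [
    {"alertname": "SLOAvailabilityFastBurn", "severity": "critical", "instance": f"target-{i}"}
    for i in range(1, 6)
]

print(len(group_alerts(alerts, ["alertname", "severity", "instance"])))  # per-target grouping
print(len(group_alerts(alerts, ["alertname", "severity"])))              # grouping used here
```

The trade-off is that a grouped message has to list the affected instances in its body, since the instance label no longer separates notifications.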
Inhibition rules
inhibit_rules:
  - source_match:
      alertname: HostDown
    target_match_re:
      alertname: "HighCPU.*|HighMemory.*|HighLatency.*"
    equal: [instance]
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
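To reason about which alerts these rules silence, it helps to model them in a few lines. This is a simplified sketch of inhibition semantics (regexes are fully anchored, and the labels listed under equal must match between source and target); the alert labels are hypothetical and the code is not Alertmanager's.

```python
import re
from typing import Dict, List, Optional

Alert = Dict[str, str]

RULES = [
    {
        "source_match": {"alertname": "HostDown"},
        "target_match_re": {"alertname": "HighCPU.*|HighMemory.*|HighLatency.*"},
        "equal": ["instance"],
    },
    {
        "source_match": {"severity": "critical"},
        "target_match": {"severity": "warning"},
        "equal": ["alertname", "instance"],
    },
]


def matches(alert: Alert, exact: Optional[dict] = None, regex: Optional[dict] = None) -> bool:
    """True if the alert satisfies all exact and all (anchored) regex matchers."""
    exact, regex = exact or {}, regex or {}
    return all(alert.get(k) == v for k, v in exact.items()) and all(
        re.fullmatch(pattern, alert.get(k, "")) for k, pattern in regex.items()
    )


def is_inhibited(target: Alert, firing: List[Alert], rules: list = RULES) -> bool:
    """True if some firing source alert suppresses the target under any rule."""
    for rule in rules:
        if not matches(target, rule.get("target_match"), rule.get("target_match_re")):
            continue
        for source in firing:
            if source is target:
                continue
            same_labels = all(source.get(l) == target.get(l) for l in rule["equal"])
            if matches(source, rule.get("source_match")) and same_labels:
                return True
    return False


host_down = {"alertname": "HostDown", "instance": "localhost:9101", "severity": "critical"}
cpu_app = {"alertname": "HighCPUWarning", "instance": "localhost:9101", "severity": "warning"}
cpu_mon = {"alertname": "HighCPUWarning", "instance": "localhost:9100", "severity": "warning"}
firing = [host_down, cpu_app, cpu_mon]

print(is_inhibited(cpu_app, firing), is_inhibited(cpu_mon, firing))
```

The equal list is what keeps a HostDown on one server from hiding a genuine CPU problem on the other.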
📸 [Screenshot: Slack showing firing alert with structured payload]
📸 [Screenshot: Slack showing RESOLVED alert]
Part 7: Runbooks and Incident Management
Every alert links to a runbook answering six questions: what is it, likely cause, first 3 investigation steps, resolution, rollback criteria, escalation path.
Blameless PIR
| Time | Event |
|---|---|
| 14:18 | Deployment triggered |
| 14:23 | 503 responses begin (35% of requests) |
| 14:29 | SLOAvailabilityFastBurn fires (6-min detection lag) |
| 14:36 | Trace in Tempo reveals config read failure |
| 14:40 | Root cause: missing DATABASE_URL env var |
| 14:45 | Rollback initiated |
| 15:10 | Error rate returns to baseline |
Action items: reduce the fast-burn for: duration from 2m to 1m, and add a post-deploy smoke test.
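The intervals that matter in a PIR fall straight out of the timeline. A small sketch that derives them from the timestamps in the table (the dictionary keys are our own labels for the rows):

```python
from datetime import datetime

# Timestamps copied from the PIR timeline above.
TIMELINE = {
    "deploy": "14:18",
    "impact_start": "14:23",
    "alert_fired": "14:29",
    "rollback": "14:45",
    "recovered": "15:10",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


print("detection lag:", minutes_between(TIMELINE["impact_start"], TIMELINE["alert_fired"]))
print("alert to rollback:", minutes_between(TIMELINE["alert_fired"], TIMELINE["rollback"]))
print("time to restore:", minutes_between(TIMELINE["impact_start"], TIMELINE["recovered"]))
```

Measured from first user impact to baseline, this incident took 47 minutes to restore, which is the figure that feeds the MTTR metric.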
Part 8: Game Day Results
Scenario 1 — Deployment failure: exit 1 in GitHub Actions → CICDDeploymentFailed in Slack within 2 minutes.
📸 [Screenshot: GitHub Actions failed run + Slack alert]
Scenario 2 — Latency injection: sudo tc qdisc add dev ens5 root netem delay 600ms → HighLatencyWarning fired after 5 minutes → RESOLVED after removal.
📸 [Screenshot: HighLatencyWarning + RESOLVED in Slack]
Scenario 3 — CPU pressure: stress-ng --cpu 0 drove CPU to 92% → HighCPUWarning fired → RESOLVED after pkill stress-ng.
📸 [Screenshot: Node Exporter showing 92% CPU + Slack alert + RESOLVED]
Real-World Incident — The Tunnel Drops
During the project, the reverse SSH tunnel dropped intermittently, causing HostDown and SLOAvailabilityFastBurn alerts to cascade. Before the grouping fix, each drop produced multiple Slack messages.
Root cause: autossh reconnects in under 30 seconds once it notices a drop, but with ServerAliveInterval 30 and ServerAliveCountMax 3 a dead connection can take roughly 90 seconds to detect. The combined gap could reach the for: 2m clause on HostDown, so brief drops were enough to fire it.
Fix: increased for: on HostDown from 2m to 5m. This was a genuine production incident, not a simulation, and it validated the entire pipeline: alert fired, investigated in Grafana, root cause identified, fixed, RESOLVED message in Slack.
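A rough timing budget shows why the longer for: window helps. The sketch below uses the ServerAliveInterval and ServerAliveCountMax values from the tunnel unit file and the 30-second reconnect figure quoted above; it ignores scrape and rule-evaluation intervals, so treat it as a back-of-the-envelope estimate rather than a measurement.

```python
def worst_case_gap_s(alive_interval_s: int, alive_count_max: int, reconnect_s: int) -> int:
    """Approximate seconds without metrics: time to detect a dead link plus reconnect."""
    detection_s = alive_interval_s * alive_count_max  # missed keepalives before ssh exits
    return detection_s + reconnect_s


def could_fire(gap_s: int, for_s: int) -> bool:
    """True if the target could stay down long enough to satisfy the for: duration."""
    return gap_s >= for_s


gap = worst_case_gap_s(alive_interval_s=30, alive_count_max=3, reconnect_s=30)
print("estimated gap:", gap, "seconds")
print("fires with for: 2m?", could_fire(gap, 120))
print("fires with for: 5m?", could_fire(gap, 300))
```

Under these assumptions a single drop sits right at the old two-minute limit, while five minutes leaves comfortable headroom without hiding a genuinely dead host for long.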
Key Learnings
Cross-account monitoring is a real problem. The reverse SSH tunnel solved a constraint most tutorials ignore. You won't always control the firewall on the server you're monitoring.
Alert grouping is as important as alert rules. Five targets firing five messages per incident creates noise. One grouped message creates signal.
Observability is not monitoring. Monitoring says something is wrong. Observability says why, where, and when — without SSHing into a server.
SLOs make reliability objective. Burn-rate alerts beat static thresholds: two alerts replaced dozens.
Terraform gives you a monitoring server that costs nothing when idle. terraform destroy removes it. terraform apply restores it in 5 minutes.
Everything as code is non-negotiable. One command restores everything.
Conclusion
The MeetMind Observability Platform monitors a production application server across AWS account boundaries without needing firewall access or cross-account IAM. A reverse SSH tunnel makes this work. Terraform makes it reproducible. Systemd makes it reliable.
Clone the repo. Fill in terraform.tfvars. Run terraform apply. Full observability in 5 minutes.
GitHub: github.com/AirFluke/meetmind-observability
Built by Team MeetMind — HNG DevOps Track Stage 6