GitHub Repository: https://github.com/AirFluke/meetmind-observability
One command to deploy: sudo bash install.sh
Introduction
Modern software teams don't just need to know when something is down — they need to understand why it broke, how long users were affected, how fast they recovered, and whether their engineering practices are improving over time.
This is the gap between basic monitoring and true observability.
For Stage 6 of the HNG DevOps track, Team MeetMind built a production-grade observability and reliability platform from scratch using the LGTM stack — Loki, Grafana, Tempo, and Prometheus — alongside DORA metrics, SLI/SLO/Error Budget frameworks, and a fully automated alerting pipeline routing to Slack.
Everything runs as native systemd services — no Docker, no containers. Each component installs as a binary, managed by systemd the same way any production Linux service is managed. One command brings the entire stack up on any Ubuntu server.
Why LGTM Over Managed Alternatives?
The observability market offers managed alternatives — Datadog, New Relic, Grafana Cloud. So why self-host the LGTM stack?
Cost at scale. Managed platforms charge per host, per metric, per log line. At scale this becomes a significant infrastructure cost. The LGTM stack runs on a single server with no per-metric pricing.
Data sovereignty. Logs contain sensitive data — request bodies, authentication tokens, PII. Shipping these to a third-party SaaS introduces compliance risk. Self-hosted Loki keeps logs within your own infrastructure.
No vendor lock-in. Prometheus exposition format and OpenTelemetry are open standards. Every instrumented service, every dashboard, every alert rule is portable. Switching providers means changing an endpoint URL, not rewriting your entire observability layer.
Full control over retention. We configured 30-day retention for both metrics and logs at no additional cost.
Learning depth. Operating the stack yourself forces genuine understanding of how metrics collection, log aggregation, and distributed tracing work — knowledge that transfers regardless of which tools your next employer uses.
Architecture Overview
The platform runs as nine native systemd services on Ubuntu 24.04, all with automatic restart policies.
| Service | Role | Port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| Loki | Log aggregation | 3100 |
| Tempo | Distributed trace storage | 3200 |
| Grafana | Unified observability frontend | 3000 |
| Alertmanager | Alert routing to Slack | 9093 |
| Node Exporter | System metrics (CPU, RAM, disk, network) | 9100 |
| Blackbox Exporter | HTTP/SSL probing | 9115 |
| Pushgateway | Receives DORA metrics from GitHub Actions | 9091 |
| OTel Collector | Receives and routes traces and logs | 4319/4320 |
Data flow:
- Node Exporter and Blackbox Exporter expose metrics → Prometheus scrapes every 15 seconds
- GitHub Actions pushes deployment metrics → Pushgateway → Prometheus
- Applications send traces via OpenTelemetry → OTel Collector → Tempo
- Applications send logs via OpenTelemetry → OTel Collector → Loki
- Grafana queries all three — Prometheus, Loki, Tempo — enabling correlated drill-down from a single dashboard
- Prometheus evaluates alert rules → fires to Alertmanager → routes to Slack
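For reference, the scrape side of this flow can be sketched in prometheus.yml. The job names and the probe target below are illustrative, not necessarily the repo's actual config:

```yaml
# Hypothetical scrape_configs sketch wiring the data flow above together.
scrape_configs:
  - job_name: node
    static_configs: [{ targets: ["localhost:9100"] }]
  - job_name: blackbox
    metrics_path: /probe
    params: { module: [http_2xx] }
    static_configs: [{ targets: ["https://example.com"] }]
    relabel_configs:
      # Standard blackbox pattern: the target becomes a query parameter,
      # and Prometheus actually scrapes the exporter on 9115.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
  - job_name: pushgateway
    honor_labels: true   # keep labels pushed by GitHub Actions
    static_configs: [{ targets: ["localhost:9091"] }]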
📸 [Screenshot: All 9 services showing running status]
Part 1: Deploying the Full LGTM Stack as Systemd Services
Why systemd over Docker?
Running services as native systemd units means:
- No container runtime dependency
- Services start on boot automatically
- Logs go directly to journald — journalctl -u prometheus -f
- Standard Linux process management — systemctl start/stop/restart/status
- No networking complexity — all services talk via localhost
One-command deployment
git clone https://github.com/AirFluke/meetmind-observability.git
cd meetmind-observability
sudo SLACK_WEBHOOK=https://hooks.slack.com/services/YOUR/WEBHOOK bash install.sh
The install script handles everything automatically:
- Installs system dependencies
- Downloads all binaries from GitHub releases
- Creates dedicated system users for each service
- Copies configs to /etc/
- Creates data directories in /var/lib/
- Installs systemd unit files to /etc/systemd/system/
- Enables and starts all services
Systemd unit files — the core of the deployment
Each service has a unit file that defines how it runs. Here is Prometheus as an example:
# systemd/prometheus.service
[Unit]
Description=Prometheus Metrics Server
Documentation=https://prometheus.io/docs
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=30d \
--web.enable-lifecycle \
--web.enable-remote-write-receiver \
--web.listen-address=0.0.0.0:9090
Restart=on-failure
RestartSec=5s
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Restart=on-failure with RestartSec=5s is the systemd equivalent of Docker's restart: unless-stopped. Every service has this.
Checking service status
# Check all 9 at once
sudo bash scripts/status.sh
# Check individual service
sudo systemctl status prometheus
# Follow logs in real time
journalctl -u prometheus -f
journalctl -u grafana-server -f
journalctl -u loki -f
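A minimal sketch of what a status script like this might do (the repo's scripts/status.sh may differ): iterate over the nine units and print each one's state via systemctl is-active.

```shell
#!/usr/bin/env bash
# Hypothetical status loop; unit names follow the service table above.
services="prometheus loki tempo grafana-server alertmanager \
node_exporter blackbox_exporter pushgateway otelcol"

for svc in $services; do
  # is-active prints active/inactive/failed; fall back to "unknown"
  # when systemctl is unavailable or the unit does not exist.
  state=$(systemctl is-active "$svc" 2>/dev/null)
  printf '%-20s %s\n' "$svc" "${state:-unknown}"
done
```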
File layout on the server
/usr/local/bin/ ← all binaries
prometheus, loki, tempo, alertmanager,
node_exporter, blackbox_exporter,
pushgateway, otelcol
/etc/ ← all configs
prometheus/prometheus.yml
alertmanager/alertmanager.yml
alertmanager/slack.tmpl
loki/loki-config.yaml
tempo/tempo.yaml
otelcol/otel-collector.yaml
blackbox_exporter/config.yml
/var/lib/ ← all data (30d retention)
prometheus/
loki/
tempo/
/etc/systemd/system/ ← unit files
prometheus.service
loki.service
tempo.service
... (9 total)
Retention periods
- Prometheus metrics: 30 days (--storage.tsdb.retention.time=30d)
- Loki logs: 30 days (retention_period: 30d)
- Tempo traces: 30 days (block_retention: 720h)
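Tempo expresses retention in hours rather than days; a one-liner confirms 720h is the same 30-day window:

```shell
# 720 hours / 24 = 30 days, matching Prometheus and Loki retention.
echo "$(( 720 / 24 )) days"   # 30 days
```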
Infrastructure as Code — non-negotiable
Every configuration file is version-controlled in the repository:
meetmind-observability/
├── install.sh ← one-command deploy
├── uninstall.sh ← clean teardown
├── scripts/status.sh ← check all services
├── systemd/ ← 9 unit files
├── config/ ← all service configs
├── alerts/ ← alert rules (.yml)
├── grafana/dashboards/ ← 5 JSON dashboards
├── grafana/provisioning/ ← datasource config
└── runbooks/ ← one .md per alert
Nothing requires manual configuration to reproduce. Clone the repo, run the install script, and the entire platform is up.
📸 [Screenshot: Prometheus targets page showing all scrapers green]
Part 2: The Four Golden Signals as SLIs
Before writing a single PromQL query or building any dashboard, we defined what reliability means for MeetMind using Google's Four Golden Signals framework.
Why Four Golden Signals beat CPU/RAM monitoring
Traditional monitoring asks "is the server healthy?" The Four Golden Signals ask "is the user experiencing a healthy service?"
A server can have 10% CPU and still serve every request with 5-second latency. CPU monitoring shows green. The Four Golden Signals show red. That is the difference.
Signal 1 — Latency
How long does it take to serve a request? We distinguish successful from error latency — a fast error is not a success.
# p95 latency for successful requests
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le, job)
)
# p95 latency for error requests
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le, job)
)
Signal 2 — Traffic
How much demand is the system handling?
# Requests per second
sum(rate(http_requests_total[1m])) by (job)
Signal 3 — Errors
Rate of failed requests — explicit failures (5xx), implicit failures (successful status but wrong content), and policy failures (such as responses slower than an agreed threshold).
# Error rate as a ratio (0 = perfect, 1 = everything failing)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
Signal 4 — Saturation
How full is the service? We track CPU, memory, and disk.
# Memory saturation
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# CPU saturation
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Disk saturation
1 - (
node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
/ node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}
)
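To make the saturation ratio concrete, here is the memory formula applied to made-up values (2 GiB available out of 8 GiB total):

```shell
# Worked example of 1 - (MemAvailable / MemTotal); values are illustrative.
avail=$((2 * 1024 * 1024 * 1024))   # 2 GiB available
total=$((8 * 1024 * 1024 * 1024))   # 8 GiB total
sat=$(awk -v a="$avail" -v t="$total" 'BEGIN { printf "%.2f", 1 - a / t }')
echo "memory saturation: $sat"   # 0.75, i.e. 75% of memory in use
```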
These four PromQL expressions become our SLIs — the measurements we track before defining any targets.
Part 3: SLOs and Error Budgets
The philosophy
An SLI is a measurement.
An SLO is a target for that measurement.
An Error Budget is the allowable gap between perfect and the SLO target.
This framework changes how engineering teams make decisions. Instead of arguing about whether a deployment is safe enough, the question becomes: "Do we have enough error budget to absorb the risk of this deployment?" It converts a subjective conversation into an objective one.
Our SLO targets
| SLO | Target | Window | Error Budget |
|---|---|---|---|
| Availability | 99.5% of HTTP probes return 2xx | 30 days | 216 minutes |
| Error rate | 99% of requests succeed | 30 days | 432 minutes |
| Latency | p95 < 500ms | Rolling 5m | Alert-only |
Why 99.5% availability?
This gives us 216 minutes per month — enough for one planned maintenance window without exhausting the budget. A stricter 99.9% would leave only 43 minutes, making any deployment risky.
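Both the 216-minute and 43-minute figures fall straight out of the arithmetic; a quick check:

```shell
# Error budget in minutes = (1 - SLO) x window length.
window_minutes=$((30 * 24 * 60))   # 43200 minutes in a 30-day window
for slo in 0.995 0.999; do
  awk -v s="$slo" -v w="$window_minutes" \
    'BEGIN { printf "SLO %s -> %.1f minutes of budget\n", s, (1 - s) * w }'
done
```

This prints 216.0 minutes for 99.5% and 43.2 minutes for 99.9%.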
Why 99% error rate?
One percent failure tolerance allows for transient errors during rolling deployments. Stricter targets require canary deployment infrastructure before they are meaningful.
Why 500ms p95 latency?
Industry standard for interactive APIs. Beyond this threshold, user experience degrades measurably. We chose p95 rather than p99 because optimising for the 99th percentile often requires disproportionate infrastructure investment.
Recording rules — pre-computing SLIs
# alerts/slo-burnrate.yml
groups:
- name: slo.recording_rules
interval: 30s
rules:
- record: slo:availability:ratio_rate1h
expr: avg_over_time(probe_success[1h])
- record: slo:availability:ratio_rate6h
expr: avg_over_time(probe_success[6h])
- record: slo:availability:ratio_rate30d
expr: avg_over_time(probe_success[30d])
# Burn rate = how fast we consume the error budget
# Error budget = 1 - 0.995 = 0.005
- record: slo:availability:burn_rate1h
expr: (1 - slo:availability:ratio_rate1h) / 0.005
- record: slo:availability:burn_rate6h
expr: (1 - slo:availability:ratio_rate6h) / 0.005
Error Budget Policy
Budget > 50% → Deploy freely, feature work continues
Budget 25–50% → Investigate incidents, no major changes
Budget < 25% → Reliability sprint, senior review on all deploys
Budget 0% → Feature freeze until budget recovers
Who owns the freeze decision: Engineering lead.
Review cadence: First Monday of each month.
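The policy ladder above is mechanical enough to script; a sketch (thresholds from the policy, tier names abbreviated):

```shell
# Map remaining error budget (%) to the policy tier above.
budget_to_tier() {
  b=$1
  if   [ "$b" -le 0 ];  then echo "feature-freeze"
  elif [ "$b" -lt 25 ]; then echo "reliability-sprint"
  elif [ "$b" -le 50 ]; then echo "investigate"
  else                       echo "deploy-freely"
  fi
}
budget_to_tier 80   # deploy-freely
budget_to_tier 30   # investigate
budget_to_tier 10   # reliability-sprint
```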
📸 [Screenshot: SLO & Error Budget dashboard]
Part 4: DORA Metrics and CI/CD Observability
Why DORA metrics connect to business outcomes
| Metric | What it measures | Business impact |
|---|---|---|
| Deployment Frequency | How often value reaches users | Faster delivery |
| Lead Time for Changes | Commit to production | Bug fix speed |
| Change Failure Rate | Broken deployments | Cost of poor quality |
| Mean Time to Restore | Duration of incidents | User impact per outage |
DORA benchmarks
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deploy frequency | Multiple/day | Weekly | Monthly | < Monthly |
| Lead time | < 1 hour | < 1 day | 1d–1w | > 1 week |
| CFR | < 5% | 5–10% | 10–15% | > 15% |
| MTTR | < 1 hour | < 1 day | 1d–1w | > 1 week |
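As a worked example of the CFR benchmark (numbers invented): 2 failed deployments out of 25 lands in the High band.

```shell
# Change Failure Rate = failed deployments / total deployments.
failed=2
total=25
cfr=$(awk -v f="$failed" -v t="$total" 'BEGIN { printf "%.0f", f / t * 100 }')
echo "CFR: ${cfr}%"   # CFR: 8% -> High band (5-10%)
```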
GitHub Actions pushing DORA metrics to Pushgateway
# .github/workflows/deploy.yml
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Record deploy start time
id: timing
run: echo "start_ts=$(date +%s)" >> $GITHUB_OUTPUT
- name: Build and deploy
run: echo "Your deploy steps here"
- name: Push metrics on success
if: success()
run: |
LEAD_TIME=$(( $(date +%s) - ${{ steps.timing.outputs.start_ts }} ))
WORKFLOW="${{ github.workflow }}"
cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
deployment_total{status="success",workflow="${WORKFLOW}"} 1
deployment_lead_time_seconds{workflow="${WORKFLOW}"} ${LEAD_TIME}
EOF
- name: Push metrics on failure
if: failure()
run: |
cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
deployment_total{status="failure",workflow="${{ github.workflow }}"} 1
EOF
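The payload the workflow sends is plain Prometheus exposition text, so it can be generated and inspected locally without a live Pushgateway (timestamps made up):

```shell
# Simulate the lead-time arithmetic and the exposition-format payload.
start_ts=1700000000          # recorded at the top of the job
end_ts=1700000347            # taken when the deploy step finishes
lead_time=$(( end_ts - start_ts ))
cat <<EOF
deployment_total{status="success",workflow="deploy"} 1
deployment_lead_time_seconds{workflow="deploy"} ${lead_time}
EOF
```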
Toil identified and automated
Toil 1 — Manual alert acknowledgement.
Engineers read a Slack alert, open a browser, navigate to Grafana, search for the relevant dashboard. Automation: every alert payload includes a direct link to the exact dashboard. Saves 2–3 minutes per alert.
Toil 2 — Certificate renewal reminders.
SSL expiry tracked via calendar reminders. Automation: Blackbox Exporter monitors SSL expiry continuously. SSLCertExpiringSoon alert fires 14 days before expiry automatically.
📸 [Screenshot: DORA metrics dashboard]
Part 5: Five Grafana Dashboards — All Provisioned as Code
All dashboards are provisioned from JSON files. The Grafana UI was never used to create or modify any panel.
Datasource provisioning with trace correlation
The key config that enables metric → log → trace drill-down:
# grafana/provisioning/datasources/datasources.yaml
datasources:
- name: Prometheus
type: prometheus
url: http://localhost:9090
isDefault: true
- name: Loki
type: loki
url: http://localhost:3100
jsonData:
derivedFields:
# Makes traceID= in log lines a clickable link to Tempo
- name: TraceID
matcherRegex: 'traceID=(\w+)'
url: "${__value.raw}"
datasourceUid: tempo
urlDisplayLabel: "Open in Tempo"
- name: Tempo
type: tempo
url: http://localhost:3200
jsonData:
tracesToLogsV2:
datasourceUid: loki
filterByTraceID: true
customQuery: true
query: '{service_name="${__span.tags.service.name}"} |= "${__trace.traceId}"'
Dashboard 1 — Node Exporter
CPU utilisation total and per-core, memory used/cached/available, disk I/O, network I/O, and load averages at 1/5/15 minutes.
📸 [Screenshot: Node Exporter dashboard with live data]
Dashboard 2 — Blackbox Exporter
External probing: uptime/downtime timeline, HTTP response time, SSL certificate expiry countdown, probe success rate.
📸 [Screenshot: Blackbox Exporter dashboard]
Dashboard 3 — DORA Metrics
Deployment frequency trend, lead time, CFR rolling percentage, MTTR with DORA benchmark classification.
📸 [Screenshot: DORA dashboard]
Dashboard 4 — SLO & Error Budget
SLI vs SLO gauges, error budget remaining coloured by urgency, burn rate time series with fast/slow thresholds, compliance history.
📸 [Screenshot: SLO dashboard]
Dashboard 5 — Unified Observability (the most important)
A metric spike → click through to Loki → see logs from that exact time window → click the trace ID → Tempo opens the waterfall → identify exactly which service and span caused the failure.
This drill-down — metric spike → correlated logs → causing trace — is what separates observability from monitoring.
📸 [Screenshot: Unified dashboard]
📸 [Screenshot: Loki logs panel with clickable trace ID]
Part 6: The Alerting System
All alert rules are version-controlled
Zero alert rules live in Grafana. Every rule is in a .yml file under alerts/.
Infrastructure alerts
# alerts/infrastructure.yml
groups:
- name: infrastructure.rules
rules:
- record: sli:node_cpu_saturation
expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
- record: sli:node_memory_saturation
expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
- alert: HighCPUWarning
expr: sli:node_cpu_saturation > 0.80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU is {{ $value | humanizePercentage }} (threshold 80%)"
dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"
- alert: HighCPUCritical
expr: sli:node_cpu_saturation > 0.90
for: 10m
labels:
severity: critical
annotations:
summary: "Critical CPU on {{ $labels.instance }}"
description: "CPU is {{ $value | humanizePercentage }} for 10+ minutes"
dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"
- alert: HostDown
expr: probe_success == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Host {{ $labels.instance }} is down"
description: "Blackbox probe failed for 2+ consecutive minutes"
runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/host-down.md"
Burn rate alerting — how it reduces alert fatigue
Traditional threshold alerting fires whenever a metric crosses a line, producing alert storms. Engineers learn to ignore them.
Burn rate alerting asks: "At this rate of failure, how long until our error budget is exhausted?"
Two alerts replace an entire category of noise:
# alerts/slo-burnrate.yml
- name: slo.alerts
rules:
# Fast burn — act immediately
# 14.4x = 2% of monthly budget gone in 1 hour
- alert: SLOAvailabilityFastBurn
expr: slo:availability:burn_rate1h > 14.4
for: 2m
labels:
severity: critical
annotations:
summary: "Fast error budget burn — act immediately"
description: >
Burn rate is {{ $value | humanize }}x.
2% of the 30-day budget will be consumed in 1 hour.
runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-fast-burn.md"
# Slow burn — investigate before it escalates
# 5x = 5% of monthly budget gone in 6 hours
- alert: SLOAvailabilitySlowBurn
expr: slo:availability:burn_rate6h > 5
for: 15m
labels:
severity: warning
annotations:
summary: "Slow error budget burn — investigate soon"
description: >
Burn rate is {{ $value | humanize }}x over 6h.
5% of the 30-day budget will be consumed in 6 hours.
runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-fast-burn.md"
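The 14.4x threshold is not arbitrary: it is the rate at which 2% of a 30-day budget disappears in a single hour. A quick check of that arithmetic:

```shell
# Fraction of the 30-day budget consumed = burn_rate x window / (30 days).
hours_in_window=$((30 * 24))   # 720 hours in the SLO window
pct=$(awk -v b=14.4 -v w="$hours_in_window" \
  'BEGIN { printf "%.0f", b * 1 / w * 100 }')
echo "14.4x burn for 1h consumes ${pct}% of the budget"
```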
Alertmanager routing and inhibition
# config/alertmanager.yml
route:
receiver: slack-default
group_by: [alertname, severity, instance]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: slack-critical
group_wait: 10s
repeat_interval: 4h
inhibit_rules:
# When host is completely down suppress CPU/memory/latency noise
- source_match:
alertname: HostDown
target_match_re:
alertname: "HighCPU.*|HighMemory.*|HighLatency.*|DiskSpace.*"
equal: [instance]
# Critical suppresses warning for same alert on same host
- source_match:
severity: critical
target_match:
severity: warning
equal: [alertname, instance]
Structured Slack template
Alertmanager uses Go templates. The default function from Sprig is not supported — a lesson we learned the hard way. Every field must be referenced directly:
# config/slack.tmpl
{{ define "slack.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{- end }}
{{ define "slack.body" -}}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity | toUpper }}
*Status:* {{ if eq $.Status "resolved" }}✅ RESOLVED{{ else }}🔥 FIRING{{ end }}
*Host:* {{ .Labels.instance }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Links:*
• <{{ .Annotations.dashboard_url }}|📊 Grafana Dashboard>
• <{{ .Annotations.runbook_url }}|📖 Runbook>
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ if eq $.Status "resolved" }}*Resolved:* {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}{{ end }}
---
{{ end }}
{{- end }}
📸 [Screenshot: Slack showing firing alert with full structured payload]
📸 [Screenshot: Slack showing RESOLVED alert]
Part 7: Runbooks and Incident Management
A runbook for every alert
Every alert links directly to its runbook. Each answers six questions:
# Runbook: High CPU Usage
## What is this alert?
HighCPUWarning fires when CPU exceeds 80% for 5+ minutes.
## Likely cause
1. Traffic spike
2. Runaway process
3. Post-deployment regression
## First 3 investigation steps
1. Check running processes:
   top -bn1 | head -20
   ps aux --sort=-%cpu | head -10
2. Check if the spike correlates with a deployment:
   check GitHub Actions for recent workflow runs
3. Check the system journal:
   journalctl -n 100 --since "10 minutes ago"
## Resolution
- Runaway process: kill -9 <PID>
- Traffic spike: scale horizontally
- Deployment regression: roll back
## Roll back when?
If CPU spike started within 30 minutes of a deployment
and correlates with increased error rate.
## Escalation
Senior engineer if unresolved after 20 minutes.
Blameless Post-Incident Review
We documented a simulated incident where a missing environment variable caused 35% of requests to return 503 for 47 minutes.
Timeline:
| Time | Event |
|---|---|
| 14:18 | Deployment triggered |
| 14:23 | 503 responses begin |
| 14:29 | SLOAvailabilityFastBurn fires — 6-min detection lag |
| 14:36 | Trace ID in Loki → Tempo reveals config read failure |
| 14:40 | Root cause: missing DATABASE_URL env var |
| 14:45 | Rollback initiated |
| 15:10 | Error rate returns to baseline |
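From the timeline, MTTR runs from first user impact (14:23) to error-rate recovery (15:10). The date below is arbitrary, and GNU date is assumed (standard on Ubuntu):

```shell
# MTTR = recovery time - first user impact, from the timeline above.
impact=$(date -u -d "2025-01-06 14:23" +%s)    # 503 responses begin
recovered=$(date -u -d "2025-01-06 15:10" +%s) # error rate back to baseline
echo "MTTR: $(( (recovered - impact) / 60 )) minutes"   # MTTR: 47 minutes
```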
Root cause: New environment variable added to code but not to the service configuration.
Detection gap: 6-minute lag between incident start and alert. Action item: reduce fast-burn for: clause from 2m to 1m.
Action items:
| Action | Owner | Due |
|---|---|---|
| Add post-deploy smoke test | DevOps | 3 days |
| Add env var validation to startup | App dev | 5 days |
| Reduce fast-burn for: to 1m | DevOps | 1 day |
This review is blameless — we focus on systems and processes, not individuals.
Part 8: Game Day Results
Scenario 1 — Deployment Failure
Added exit 1 to the GitHub Actions workflow and pushed. The workflow failed and pushed deployment_total{status="failure"} to the Pushgateway. CICDDeploymentFailed fired in Slack within 2 minutes. DORA dashboard showed CFR increase. Immediately reverted.
📸 [Screenshot: GitHub Actions showing red failed run]
📸 [Screenshot: CICDDeploymentFailed in Slack]
Scenario 2 — Latency Injection
Injected 600ms network latency using tc netem:
sudo tc qdisc add dev ens5 root netem delay 600ms
HighLatencyWarning fired, confirming that the alerting pipeline for latency SLO breaches works end-to-end.
# Remove latency
sudo tc qdisc del dev ens5 root
RESOLVED message confirmed recovery detection works.
📸 [Screenshot: HighLatencyWarning in Slack]
📸 [Screenshot: RESOLVED in Slack after tc removed]
Scenario 3 — Resource Pressure
Used stress-ng to drive CPU above 90%:
stress-ng --cpu 0 --cpu-method matrixprod --timeout 600s &
What we observed:
- HighCPUWarning entered pending state after CPU sustained above 80%
- After 5 minutes → HighCPUWarning turned firing in Prometheus
- Alert arrived in Slack with full structured payload
- HighCPUCritical entered pending state (needs 10 min sustained above 90%)
- After killing stress-ng → both alerts RESOLVED in Slack
This confirmed the full warning → critical → recovery sequence and proved inhibition rules work — critical suppresses the warning notification.
pkill stress-ng
📸 [Screenshot: Prometheus alerts page showing Warning firing]
📸 [Screenshot: Node Exporter dashboard with CPU spike at 92%]
📸 [Screenshot: HighCPUWarning in Slack]
📸 [Screenshot: RESOLVED in Slack]
Key Lessons Learned
1. Systemd is production-grade.
Running services as native systemd units is simpler than Docker for single-server deployments. No networking complexity, no container runtime, logs go straight to journald. journalctl -u prometheus -f is all you need.
2. Alertmanager templates have limits.
The default function from Sprig templating is not supported in Alertmanager. Any | default "value" in your template will make notification rendering fail, often with no visible error. Always test templates before deploying.
3. Port conflicts happen.
Tempo and OTel Collector both want port 4317 (OTLP gRPC). When running as bare processes on the same host, one must move. We moved OTel Collector to 4319/4320. In Docker this was hidden by container networking.
4. Observability is not monitoring.
Monitoring tells you something is wrong. Observability tells you why, where, and when — without needing to SSH into a server. The Loki → Tempo drill-down reduced our incident diagnosis time from 40 minutes to 4 minutes in the PIR simulation.
5. SLOs make reliability decisions objective.
"Is this deployment safe?" is subjective. "Do we have 100 minutes of error budget remaining?" is objective. SLOs turn reliability from a conversation into a measurement.
6. Burn rate alerting eliminates alert fatigue.
Two burn rate alerts replaced what would have been dozens of threshold alerts during Game Day scenarios. Engineers respond to meaningful signals, not noise.
7. Everything as code is non-negotiable.
Every dashboard, alert rule, and config that lives only in a UI is technical debt. Clone the repo, run install.sh, and the entire platform is back — no manual steps, no memory required.
Conclusion
The MeetMind Observability Platform demonstrates that production-grade observability is achievable without managed services and without Docker. Nine systemd services provide the full observability triad — metrics, logs, and traces — with correlation between all three. SLOs convert vague reliability goals into measurable targets. DORA metrics connect daily engineering decisions to business outcomes. Burn rate alerting replaces alert storms with two meaningful signals.
The entire platform deploys with one command on any Ubuntu 24.04 server. Every component is version-controlled. Every alert links to a runbook. Every metric spike links to correlated logs and traces.
GitHub Repository: https://github.com/AirFluke/meetmind-observability
Deploy it yourself:
git clone https://github.com/AirFluke/meetmind-observability.git
cd meetmind-observability
sudo SLACK_WEBHOOK=https://hooks.slack.com/services/YOUR/WEBHOOK bash install.sh
Built by Team MeetMind for HNG DevOps Track Stage 6