DEV Community

Nehemiah


ObserveX: Building a Centralized Observability Platform for Modern Infrastructure

How We Built ObserveX: A Unified Monitoring and Reliability Platform for Production Systems
Introduction
Most teams start monitoring the wrong way. They install a tool, set a threshold on CPU, and call it observability. When something breaks at 2am, they get paged for a CPU spike that resolved itself, miss the actual error rate climbing quietly for six hours, and spend the incident staring at graphs that don't tell them what the user experienced.
This post documents how we built ObserveX — a production-grade observability and reliability platform built entirely on bare metal, no managed services, no black boxes. Every binary installed by hand. Every config file version-controlled. Every alert backed by a runbook.
By the end of this post you will understand not just what we built, but why each decision was made — from why we chose open source over managed alternatives, to the arithmetic behind burn rate alerting, to what DORA metrics actually tell you about your engineering organisation.
Architecture Overview


ObserveX runs across two servers.
The App Server runs the application being monitored: a service instrumented with OpenTelemetry, a Node Exporter exposing system metrics, and an OTel Collector acting as a local agent that ships logs and traces across the network.
The Observability Server runs the entire LGTM stack: Loki for logs, Grafana for visualisation, Tempo for traces, and Prometheus for metrics — plus Alertmanager for alert routing and Blackbox Exporter for probing the app from the outside.
Every component runs as a systemd service. No Docker. No Kubernetes. No abstraction layers between you and the process. When something breaks, you know exactly where to look.
Why the LGTM Stack Over Managed Alternatives
The first question anyone asks is: why not just use Datadog, New Relic, or Grafana Cloud?
The honest answer is that managed alternatives are excellent products. But they come with tradeoffs that matter depending on your context.
Cost at scale: Managed observability is priced per host, per metric, or per log line ingested. At small scale the cost is trivial. As your infrastructure grows, the bill grows with it — often faster than the infrastructure itself. With the LGTM stack, the cost is the server running it. That's it.
Data ownership: Every log line, every trace, every metric your application emits contains information about your system's behaviour. With a managed service, that data lives on someone else's infrastructure under their retention and access policies. With a self-hosted stack, your data stays in your environment.
Understanding: This is the most important reason for this project specifically. When you use a managed service, the tool does the wiring for you. You never have to understand why traceID needs to appear in a log line, or what the difference between a counter and a gauge is, or how burn rate alerting actually reduces pages. Building it yourself forces that understanding. When something breaks at 2am, you can debug it because you built it.
The tradeoff we accept: Self-hosted means you own the operational burden. Upgrades, storage management, and backups are your responsibility. For a learning platform, that's the point. For production at scale, a hybrid approach often makes sense: self-hosted for cost efficiency, managed for the critical path.
The Philosophy Behind SLIs, SLOs, and Error Budgets
Before we look at any dashboard, we need to establish the thinking behind the numbers. This is the part most monitoring tutorials skip entirely.
What is an SLI
A Service Level Indicator is a measurement of how your service is performing from the user's perspective. Not whether your CPU is high. Not whether your deployment succeeded. Whether the user got a good response.
The key constraint: an SLI must be a ratio between 0 and 1. This makes SLIs comparable across different services and different time windows.
Availability SLI = successful probes / total probes
Error Rate SLI = non-5xx requests / total requests
Latency SLI = requests under 500ms / total requests
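The three ratios above share one shape: good events divided by total events. A minimal Python sketch of that shape follows. The function name, the event counts, and the choice to report 1.0 when there is no traffic are all assumptions made for illustration, not values taken from ObserveX.

```python
def sli(good_events: int, total_events: int) -> float:
    """Return the fraction of good events, a ratio between 0 and 1."""
    if total_events == 0:
        # Assumption: with no traffic nothing failed, so report a perfect SLI.
        return 1.0
    return good_events / total_events


# Hypothetical counts for a single measurement window.
availability_sli = sli(good_events=2_870, total_events=2_880)     # successful probes
error_rate_sli = sli(good_events=99_412, total_events=100_000)    # non-5xx requests
latency_sli = sli(good_events=96_300, total_events=100_000)       # requests under 500ms

for name, value in [
    ("availability", availability_sli),
    ("error rate", error_rate_sli),
    ("latency", latency_sli),
]:
    print(f"{name} SLI: {value:.4f}")
```

Because every SLI reduces to the same ratio, the same function can be reused for any service or time window, which is what makes the values comparable.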
What is an SLO
A Service Level Objective is a target for your SLI over a time window. It is a promise to yourself and your users about the reliability you intend to provide.
Availability SLO: 99.5% of probes succeed over any 30-day rolling window
Error Rate SLO: 99.0% of requests are non-5xx over any 30-day rolling window
The number matters less than the reasoning behind it. 99.5% is not a random choice. It means you accept that your service can be unreachable for up to 216 minutes per month. If your users would find that unacceptable, your SLO is wrong.
What is an Error Budget
The error budget is the gap between perfection and your SLO target — the amount of failure you are allowed to have before you breach your promise.
Availability error budget = (1 - 0.995) × 30 days × 24 hours × 60 minutes
= 0.005 × 43,200 minutes
= 216 minutes of downtime allowed per month
This number does something important: it makes reliability a resource that can be spent. When your error budget is full, you can deploy aggressively and move fast. When it is nearly empty, you slow down and focus on reliability. The error budget is the mechanism that aligns the interests of developers who want to ship and operators who want stability.
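The budget arithmetic can be checked with a few lines of Python. This is a sketch under the assumption of a fixed 30-day window; the function name is hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of failure allowed in the window before the SLO is breached."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


# The two SLO targets used in this post.
for label, target in [("availability", 0.995), ("error rate", 0.990)]:
    minutes = error_budget_minutes(target)
    print(f"{label} SLO {target:.1%}: {minutes:.0f} minutes of budget per 30 days")
```

For the error rate SLO the result is better read as a share of requests than as wall-clock minutes, since errors are counted per request rather than per minute of downtime.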
The Four Golden Signals — Beyond CPU and RAM
Traditional monitoring watches CPU, memory, and disk. These are easy to measure and feel comprehensive. The problem is they are not what users experience.
A user does not know or care that your CPU is at 85%. They care whether their request was slow, failed, or never arrived. The Four Golden Signals are a framework for measuring what users actually experience.
Latency — How long does it take?
Not average latency. The p95 or p99 — the experience of your worst-off users. An average can look healthy while 5% of users are waiting 10 seconds for every request.
We chose p95 as our SLI. This means we are explicitly committing to the experience of 95% of our users, and accepting that 5% may occasionally have worse experiences.
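To see how an average can hide a slow tail, here is a small sketch with made-up latency samples. The nearest-rank percentile helper is an illustrative assumption; Prometheus itself estimates quantiles from histogram buckets rather than from raw samples.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical window: 94 fast requests (0.2 s) and 6 very slow ones (10 s).
latencies = [0.2] * 94 + [10.0] * 6

mean_latency = statistics.fmean(latencies)   # stays under one second
p95_latency = percentile(latencies, 95)      # lands in the slow tail
latency_sli = sum(1 for v in latencies if v < 0.5) / len(latencies)

print(f"mean={mean_latency:.2f}s p95={p95_latency:.1f}s latency SLI={latency_sli:.2f}")
```

The mean suggests a sub-second service, while the p95 shows that the slowest requests take ten seconds.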
Traffic — How much demand is the system handling?
Requests per second tells you the load context. A 500ms response time means something very different at 10 RPS versus 10,000 RPS. Traffic is also the denominator in your error rate calculation.
Errors — What fraction of requests are failing?
Not just 5xx responses. Implicit errors matter too — a request that returns 200 with empty content, or a timeout that the client treats as a failure even though the server returned successfully.
Saturation — How close to the limit is the system?
CPU and memory are saturation signals, but so are connection pool utilisation, queue depth, and disk I/O wait. Saturation predicts problems before they become failures. When saturation is high, latency usually follows.
The Full Stack: What We Built and How It Fits Together
Prometheus — The Metrics Engine
Prometheus is a pull-based metrics system. Every 15 seconds it visits each target's /metrics endpoint and reads a plain text file of numbers. It stores those numbers as time series and makes them queryable with PromQL.
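To make the "plain text file of numbers" concrete, the sketch below parses a small sample of the text exposition format. It is a deliberately simplified parser written for illustration: it ignores optional timestamps and escaped characters in label values, and the sample payload is invented.

```python
import re

# metric_name{label="value",...} value
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # simplified: anything unexpected is skipped
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Invented payload in the shape a /metrics endpoint returns.
EXAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="GET",status="500"} 3
node_load1 0.42
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

Each distinct combination of metric name and labels becomes its own time series once Prometheus stores the scraped values.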

Loki — The Log Aggregator
Loki works like Prometheus but for logs. Instead of scraping, it receives log streams pushed from the OTel Collector. Logs are indexed only by labels — the log content itself is not indexed, which keeps storage costs low.
Tempo — The Trace Backend
Tempo stores distributed traces sent via the OpenTelemetry protocol. A trace is a record of a single request's journey through your system — when it started, how long each operation took, and whether it succeeded.

Grafana — The Unified Frontend
Grafana reads from all three backends simultaneously. A single dashboard can show a Prometheus metric, the correlated Loki logs, and the causing Tempo trace — all from the same time window. This is what makes the LGTM stack more than the sum of its parts.
The Observability Stack in Detail
Node Exporter Dashboard
The Node Exporter dashboard gives us visibility into the App Server's system resources. CPU usage broken down by mode, memory showing used versus cached versus available, disk I/O in bytes per second, network I/O, and system load averages.


Blackbox Exporter Dashboard
The Blackbox Exporter probes our endpoints from the outside — simulating what a user experiences when they try to reach the service. It measures HTTP response time, probe success rate, and SSL certificate expiry.
This is important because it catches failures that internal metrics miss. If the server is up but the load balancer is misconfigured, Node Exporter will show healthy but Blackbox will show the probe failing.


SLO and Error Budget Dashboard
This is the most important dashboard in the stack. It answers the one question that matters: are we meeting our promises to users, and if not, how urgently do we need to respond?


Row 1 — Current SLI vs SLO Target
Four stat panels, one per SLO. Each shows the current SLI value and whether it is above or below the SLO target. Green means meeting the SLO. Red means breaching it right now.

Row 2 — Error Budget Remaining
Bar gauge panels showing how much of the monthly error budget has been consumed. The colour encoding makes urgency immediately visible: green when more than 50% remains, yellow when between 25% and 50%, red when below 25%.
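The colour thresholds can be expressed as a small function. This sketch assumes the SLI is measured over the same 30-day window as the SLO; the function names are hypothetical, and the real panels use Grafana's own threshold settings rather than Python.

```python
def budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent, clamped at zero."""
    allowed_failure = 1 - slo_target
    consumed = (1 - sli) / allowed_failure
    return max(0.0, 1 - consumed)


def budget_colour(remaining: float) -> str:
    """Map remaining budget to the colour bands described above."""
    if remaining > 0.50:
        return "green"
    if remaining >= 0.25:
        return "yellow"
    return "red"


# Hypothetical 30-day SLI readings against the 99.5% availability SLO.
for sli in (0.998, 0.9965, 0.9952, 0.990):
    remaining = budget_remaining(sli, slo_target=0.995)
    print(f"SLI {sli:.4f}: {remaining:.0%} remaining, {budget_colour(remaining)}")
```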

Row 3 — Burn Rate
Time series panels showing the current burn rate for each SLO, with reference lines at 14.4x (fast burn, critical) and 5x (slow burn, warning).

How Burn Rate Alerting Reduces Alert Fatigue
This deserves its own section because it is one of the most practically valuable concepts in reliability engineering and one of the least understood.
The problem with simple threshold alerting
Imagine your error rate SLO is 99%. You set an alert: fire if error rate exceeds 1%.
Now imagine you have a brief spike of 2% errors for 3 minutes at 3am. Your alert fires, wakes someone up, and resolves itself before they even open their laptop. That is a false positive. Over time, if this happens regularly, engineers stop trusting alerts. They start ignoring pages. And then the real incident — the one that matters — gets ignored too.
The opposite problem also exists. Raise the threshold to, say, 2% to silence those spikes, and a slow, steady error rate of 1.1% never triggers it, yet it quietly exhausts your entire monthly error budget in about 27 days. You would never get paged, but you would breach your SLO.
What burn rate solves
Burn rate does not ask "is the error rate above threshold right now?" It asks "at this rate, how quickly are we exhausting our monthly budget?"
Burn rate = actual error rate / allowed error rate = (1 - current SLI) / (1 - SLO target)
A burn rate of 1x means you are consuming your budget at exactly the rate that will exhaust it in 30 days — right on the SLO line. A burn rate of 14.4x means you would exhaust the budget in about 2 days. That is the fast burn threshold.
Fast burn: > 14.4x burn rate — page someone now
Slow burn: > 5x burn rate — create a ticket, handle this sprint
The result is that brief spikes — even large ones — do not trigger pages unless they are sustained long enough to actually threaten the budget. And slow burns that would have gone unnoticed now generate warnings proportional to their actual impact.
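The sketch below works through that behaviour in Python. It assumes the burn rate is averaged over a 60-minute evaluation window with equal traffic in every minute; the window length and the sample error rates are illustrative assumptions, while the 14.4x and 5x thresholds are the ones used in this post.

```python
FAST_BURN = 14.4  # page someone now
SLOW_BURN = 5.0   # create a ticket


def window_burn_rate(per_minute_error_rates, slo_target):
    """Average error rate over the window divided by the allowed error rate."""
    mean_error = sum(per_minute_error_rates) / len(per_minute_error_rates)
    return mean_error / (1 - slo_target)


def classify(rate):
    if rate > FAST_BURN:
        return "page"
    if rate > SLOW_BURN:
        return "ticket"
    return "ok"


def days_to_exhaustion(rate, window_days=30):
    """Days until the whole budget is spent if this burn rate is sustained."""
    return float("inf") if rate <= 0 else window_days / rate


SLO = 0.99
scenarios = {
    "3-minute spike at 2% errors": [0.02] * 3 + [0.0] * 57,
    "sustained 7% errors": [0.07] * 60,
    "sustained 20% errors": [0.20] * 60,
}
for label, rates in scenarios.items():
    rate = window_burn_rate(rates, SLO)
    print(f"{label}: {rate:.1f}x, {classify(rate)}, "
          f"budget gone in {days_to_exhaustion(rate):.1f} days")
```

The three-minute spike averages out to a fraction of 1x and stays silent, while the sustained failures cross the ticket and page thresholds.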

Alert Rules
All alert rules live in version-controlled YAML files. There are three files.
infrastructure.yml
Contains the Four Golden Signal recording rules and all infrastructure alerts: CPU, memory, disk, host down, and the
slo-burn-rate.yml
Contains the four burn rate alerts.
cicd.yml
Contains the DORA metric recording rules and the CFR and MTTR threshold alerts.

Alertmanager Configuration
Alertmanager handles routing, grouping, and inhibition. It receives alerts from Prometheus and decides who gets notified, how, and when.

All alerts go to #DevOps-Alerts by default.
Inhibition rules
When HostDown fires for an instance, Alertmanager suppresses all other alerts for that same instance — CPU, memory, latency. This matters because when a host is completely unreachable, firing five separate alerts adds noise without adding information. The HostDown alert tells the whole story.
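The effect of that inhibition rule can be simulated in a few lines. This is only an illustration of the behaviour, not how Alertmanager is implemented; the instance names and the HighLatency alert name are hypothetical.

```python
def apply_inhibition(alerts, source_name="HostDown", match_label="instance"):
    """Drop alerts that share an instance with a firing source alert."""
    down_instances = {
        alert[match_label] for alert in alerts if alert["alertname"] == source_name
    }
    return [
        alert
        for alert in alerts
        if alert["alertname"] == source_name
        or alert.get(match_label) not in down_instances
    ]


# Hypothetical set of firing alerts.
firing = [
    {"alertname": "HostDown", "instance": "app-server"},
    {"alertname": "CPUCritical", "instance": "app-server"},
    {"alertname": "HighLatency", "instance": "app-server"},
    {"alertname": "CPUWarning", "instance": "observability-server"},
]

for alert in apply_inhibition(firing):
    print(alert["alertname"], alert["instance"])
```

Only HostDown for the unreachable instance and the unrelated CPUWarning survive, so the on-call engineer sees two notifications instead of four.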

Slack Notifications
Every alert routes to #DevOps-Alerts with a structured payload.
Each notification includes the alert name, severity, affected host, description of the current metric value, a direct link to the relevant Grafana dashboard, and a link to the runbook for that alert.

DORA Metrics Dashboard
DORA metrics measure engineering team performance across four dimensions. Understanding why these four were chosen requires understanding what they actually measure.


Deployment Frequency measures how often you successfully deploy to production. High frequency is a proxy for small batch sizes, which reduces risk per deployment and accelerates feedback loops.
Lead Time for Changes measures the time from a commit being merged to it running in production. Short lead time means your delivery pipeline is efficient and you can respond quickly to user needs and incidents.
Change Failure Rate measures what percentage of deployments cause a failure requiring a hotfix or rollback. It is a quality signal — high CFR means your testing and review processes are not catching problems before they reach production.
Mean Time to Restore measures how long it takes to recover from a failure. It is a resilience signal — teams with good runbooks, monitoring, and on-call processes restore service faster.
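As a rough sketch, all four metrics can be derived from a list of deployment records. The record fields, the sample data, and the choice to count every deployment that reached production toward frequency are assumptions made for illustration; ObserveX itself computes these with Prometheus recording rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    merged_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def dora_summary(deployments, window_days):
    """Compute the four DORA metrics over a window; assumes at least one deployment."""
    failures = [d for d in deployments if d.failed]
    lead_hours = [(d.deployed_at - d.merged_at).total_seconds() / 3600 for d in deployments]
    restore_minutes = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "lead_time_hours": sum(lead_hours) / len(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_minutes": sum(restore_minutes) / len(restore_minutes) if restore_minutes else 0.0,
    }


# Hypothetical week of deployments.
start = datetime(2024, 1, 1, 9, 0)
history = [
    Deployment(start, start + timedelta(hours=1)),
    Deployment(start + timedelta(days=1), start + timedelta(days=1, hours=2)),
    Deployment(
        start + timedelta(days=2),
        start + timedelta(days=2, hours=3),
        failed=True,
        restored_at=start + timedelta(days=2, hours=3, minutes=45),
    ),
    Deployment(start + timedelta(days=3), start + timedelta(days=3, hours=2)),
]

for metric, value in dora_summary(history, window_days=7).items():
    print(f"{metric}: {value:.2f}")
```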

How DORA metrics connect to business outcomes
This is the part most monitoring tutorials do not explain. Why do engineering leaders care about these four numbers specifically?
Deployment Frequency and Lead Time measure throughput — how fast value reaches users. Change Failure Rate and MTTR measure stability — how reliably that value works when it arrives.
The insight from the DORA research programme is that high-performing teams score well on all four simultaneously. Throughput and stability are not a tradeoff — organisations that deploy more frequently also have lower failure rates and recover faster when failures occur.
This means DORA metrics are not just engineering vanity metrics. They predict business outcomes: faster feature delivery, higher reliability, and faster incident recovery all translate directly to user satisfaction and revenue.

Log and Trace Correlation — The Unified Dashboard
This is the most powerful capability in the stack and the one that makes the biggest difference during an actual incident.

When an error rate spike appears on the metrics panel, you can navigate directly to the correlated logs from that exact time window. The Loki panel automatically filters to the same time range.

When you see a log line with a traceID, clicking it opens the full trace in Tempo, showing you exactly which endpoint was called, how long each operation took, and at which point it failed.
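The jump from log to trace depends on the trace ID being extractable from the log line. The sketch below assumes the application logs it as traceID=<32 hex characters>; that log layout and the sample lines are assumptions for illustration, and in Grafana the equivalent matching is configured on the Loki data source rather than written in Python.

```python
import re

# OpenTelemetry trace IDs are 16 bytes, rendered as 32 lowercase hex characters.
TRACE_ID = re.compile(r"traceID=([0-9a-f]{32})")


def extract_trace_id(log_line):
    """Return the trace ID embedded in a log line, or None if there is none."""
    match = TRACE_ID.search(log_line)
    return match.group(1) if match else None


# Hypothetical log lines.
lines = [
    "level=error msg='upstream timeout' path=/slow traceID=4bf92f3577b34da6a3ce929d0e0e4736",
    "level=info msg='health check ok' path=/health",
]

for line in lines:
    print(extract_trace_id(line))
```

A log line without a trace ID still appears in Loki, but it offers no link into Tempo, which is why the ID has to be written into every request-scoped log line.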

Runbooks
Every alert in the stack has a corresponding runbook. Here is the slo-fast-burn.md runbook as an example.
The runbook answers six questions:
• What is this alert
• What are the likely causes
• What are the first three investigation steps with exact commands
• How do you resolve it
• When should you roll back vs fix forward
• When and who to escalate to
The burn rate runbook includes the arithmetic so the on-call engineer understands the urgency without needing to do mental math at 2am:
At 14.4x burn rate: budget exhausts in 720 / 14.4 = 50 hours (720 is the number of hours in the 30-day window)
At 40x burn rate: budget exhausts in 720 / 40 = 18 hours
At 200x burn rate: budget exhausts in 720 / 200 = 3.6 hours
Game Day: Chaos and Failure Simulation
Game Day is the test of whether the observability platform actually works when things go wrong. We ran three scenarios.
Scenario 1 — Deployment Failure
We triggered a failing GitHub Actions workflow by adding a deliberately failing test step. The workflow completed with status 0 (failure), which pushed deployment_status=0 to the Pushgateway.
The Change Failure Rate updated in the DORA dashboard within one Prometheus scrape interval. The alert fired and the resolved notification arrived when we pushed a successful deployment to bring the CFR back down.

Scenario 2 — Latency Injection
We created a /slow endpoint that sleeps for a random duration.
[SCREENSHOT: Flask app code change showing the modified sleep range]
The p95 latency SLI began climbing past the 500ms SLO threshold.
The burn rate began accelerating and within two minutes the AvailabilityFastBurn alert fired.
We then used the Unified Observability dashboard to find a slow trace. The error rate panel showed elevated latency, the Loki panel showed log lines with long processing times, and clicking the traceID opened the exact slow request in Tempo.

Scenario 3 — Resource Pressure
We ran stress-ng --cpu 4 --timeout 360s on the App Server to spike CPU above the alert thresholds.
After 5 minutes the CPUWarning alert fired in Slack.
After 5 minutes of sustained pressure the CPUCritical alert fired.
When stress-ng finished, CPU dropped back to baseline and both resolved notifications arrived in Slack in the correct order — critical resolved first, then warning resolved.

Toil Identified
Toil is manual, repetitive work that has no lasting value. Two sources of toil were identified during this project.
Manual runbook lookups during incidents. When an alert fires, the on-call engineer has to remember which runbook applies and open it manually. The alert payload now includes a direct runbook link, which eliminates the lookup step. What was a 2-minute search becomes a one-click navigation.
Manual Grafana dashboard creation. Without provisioning as code, every dashboard would need to be recreated manually after a server rebuild. Dashboard JSON files in the repository mean cp grafana/dashboards/*.json /var/lib/grafana/dashboards/ followed by a Grafana reload restores all dashboards in under 30 seconds. The toil of manual recreation is eliminated entirely.
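The copy step could also be scripted so that it reports what it restored. This is a minimal sketch of that idea using only the standard library; the function name is hypothetical, and reloading Grafana afterwards is left out.

```python
import shutil
from pathlib import Path


def restore_dashboards(source_dir, target_dir):
    """Copy every dashboard JSON file from the repository into Grafana's directory."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for json_file in sorted(source.glob("*.json")):
        shutil.copy2(json_file, target / json_file.name)
        copied.append(json_file.name)
    return copied
```

Calling restore_dashboards("grafana/dashboards", "/var/lib/grafana/dashboards") would mirror the cp command above and return the list of file names it copied.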

Conclusion
ObserveX is not a complex system. It is six binaries, a handful of config files, and a clear mental model of how data flows from application to alert.
The complexity in observability does not come from the tools. It comes from not understanding what you are measuring and why. An SLI is just a PromQL expression. An SLO is just a target for that expression. An error budget is just arithmetic. A burn rate alert is just a comparison of your current consumption rate against the rate that would exhaust that budget.
Once those concepts are clear, the tooling follows naturally. Prometheus measures the SLI. The recording rule stores it efficiently. The alert rule compares the burn rate against a threshold. Alertmanager routes the notification. The runbook tells the engineer what to do.
The LGTM stack was chosen not because it is the easiest option but because it is the most instructive one. Every connection between components is explicit. Every data flow is visible. When something breaks, you can trace it end to end because you built it end to end.
That understanding is the point.
