<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nehemiah</title>
    <description>The latest articles on DEV Community by Nehemiah (@nehemiah_dev).</description>
    <link>https://dev.to/nehemiah_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3901837%2F28f453f3-d8ed-4826-a0cc-683de758ccc4.jpg</url>
      <title>DEV Community: Nehemiah</title>
      <link>https://dev.to/nehemiah_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nehemiah_dev"/>
    <language>en</language>
    <item>
      <title>ObserveX: Building a Centralized Observability Platform for Modern Infrastructure</title>
      <dc:creator>Nehemiah</dc:creator>
      <pubDate>Tue, 19 May 2026 10:56:57 +0000</pubDate>
      <link>https://dev.to/nehemiah_dev/observex-building-a-centralized-observability-platform-for-modern-infrastructure-3e11</link>
      <guid>https://dev.to/nehemiah_dev/observex-building-a-centralized-observability-platform-for-modern-infrastructure-3e11</guid>
      <description>&lt;p&gt;&lt;em&gt;How We Built ObserveX: A Unified Monitoring and Reliability Platform for Production Systems&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Most teams start monitoring the wrong way. They install a tool, set a threshold on CPU, and call it observability. When something breaks at 2am, they get paged for a CPU spike that resolved itself, miss the actual error rate climbing quietly for six hours, and spend the incident staring at graphs that don't tell them what the user experienced.&lt;br&gt;
This post documents how we built ObserveX — a production-grade observability and reliability platform built entirely on bare metal, no managed services, no black boxes. Every binary installed by hand. Every config file version-controlled. Every alert backed by a runbook.&lt;br&gt;
By the end of this post you will understand not just what we built, but why each decision was made — from why we chose open source over managed alternatives, to the arithmetic behind burn rate alerting, to what DORA metrics actually tell you about your engineering organisation.&lt;br&gt;
&lt;strong&gt;Architecture Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cxwe8jvoz7h3yofkwsv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cxwe8jvoz7h3yofkwsv.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
ObserveX runs across two servers.&lt;br&gt;
The App Server runs the application being monitored: a service instrumented with OpenTelemetry, a Node Exporter exposing system metrics, and an OTel Collector acting as a local agent that ships logs and traces across the network.&lt;br&gt;
The Observability Server runs the entire LGTM stack: Loki for logs, Grafana for visualisation, Tempo for traces, and Prometheus for metrics — plus Alertmanager for alert routing and Blackbox Exporter for probing the app from the outside.&lt;br&gt;
Every component runs as a systemd service. No Docker. No Kubernetes. No abstraction layers between you and the process. When something breaks, you know exactly where to look.&lt;br&gt;
&lt;strong&gt;Why the LGTM Stack Over Managed Alternatives&lt;/strong&gt;&lt;br&gt;
The first question anyone asks is: why not just use Datadog, New Relic, or Grafana Cloud?&lt;br&gt;
The honest answer is that managed alternatives are excellent products. But they come with tradeoffs that matter depending on your context.&lt;br&gt;
&lt;strong&gt;Cost at scale&lt;/strong&gt;: Managed observability is priced per host, per metric, or per log line ingested. At small scale the cost is trivial. As your infrastructure grows, the bill grows with it — often faster than the infrastructure itself. With the LGTM stack, the cost is the server running it. That's it.&lt;br&gt;
&lt;strong&gt;Data ownership&lt;/strong&gt;: Every log line, every trace, every metric your application emits contains information about your system's behaviour. With a managed service, that data lives on someone else's infrastructure under their retention and access policies. With a self-hosted stack, your data stays in your environment.&lt;br&gt;
&lt;strong&gt;Understanding&lt;/strong&gt;: This is the most important reason for this project specifically. When you use a managed service, the tool does the wiring for you. You never have to understand why traceID needs to appear in a log line, or what the difference between a counter and a gauge is, or how burn rate alerting actually reduces pages. Building it yourself forces that understanding. When something breaks at 2am, you can debug it because you built it.&lt;br&gt;
&lt;strong&gt;The tradeoff we accept&lt;/strong&gt;: Self-hosted means you own the operational burden. Upgrades, storage management, and backups are your responsibility. For a learning platform, that's the point. For production at scale, a hybrid approach often makes sense: self-hosted for cost efficiency, managed for the critical path.&lt;br&gt;
&lt;strong&gt;The Philosophy Behind SLIs, SLOs, and Error Budgets&lt;/strong&gt;&lt;br&gt;
Before we look at any dashboard, we need to establish the thinking behind the numbers. This is the part most monitoring tutorials skip entirely.&lt;br&gt;
&lt;strong&gt;What is an SLI&lt;/strong&gt;&lt;br&gt;
A Service Level Indicator is a measurement of how your service is performing from the user's perspective. Not whether your CPU is high. Not whether your deployment succeeded. Whether the user got a good response.&lt;br&gt;
The key constraint: an SLI must be a ratio between 0 and 1. This makes SLIs comparable across different services and different time windows.&lt;br&gt;
Availability SLI = successful probes / total probes &lt;br&gt;
Error Rate SLI = successful requests / total requests &lt;br&gt;
Latency SLI = requests under 500ms / total requests &lt;br&gt;
&lt;strong&gt;What is an SLO&lt;/strong&gt;&lt;br&gt;
A Service Level Objective is a target for your SLI over a time window. It is a promise to yourself and your users about the reliability you intend to provide.&lt;br&gt;
Availability SLO: 99.5% of probes succeed over any 30-day rolling window &lt;br&gt;
Error Rate SLO: 99.0% of requests are non-5xx over any 30-day rolling window &lt;br&gt;
The number matters less than the reasoning behind it. 99.5% is not a random choice. It means you accept that your service can be unreachable for up to 216 minutes per month. If your users would find that unacceptable, your SLO is wrong.&lt;br&gt;
&lt;strong&gt;What is an Error Budget&lt;/strong&gt;&lt;br&gt;
The error budget is the gap between perfection and your SLO target — the amount of failure you are allowed to have before you breach your promise.&lt;br&gt;
Availability error budget = (1 - 0.995) × 30 days × 24 hours × 60 minutes = 0.005 × 43,200 minutes = 216 minutes of downtime allowed per month &lt;br&gt;
This number does something important: it makes reliability a resource that can be spent. When your error budget is full, you can deploy aggressively and move fast. When it is nearly empty, you slow down and focus on reliability. The error budget is the mechanism that aligns the interests of developers who want to ship and operators who want stability.&lt;br&gt;
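The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the ObserveX repository; the function name error_budget_minutes is an assumption made for the example.&lt;br&gt;

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed minutes of failure for an SLO over a rolling window.

    error_budget_minutes is a hypothetical helper name used for illustration.
    """
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


# Availability SLO of 99.5% over 30 days
print(round(error_budget_minutes(0.995), 1))  # 216.0
# Error rate SLO of 99.0% over 30 days
print(round(error_budget_minutes(0.99), 1))   # 432.0
```

Tightening the target from 99.0% to 99.5% halves the budget, which is why the target should follow from what users will tolerate.&lt;br&gt;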
&lt;strong&gt;The Four Golden Signals — Beyond CPU and RAM&lt;/strong&gt;&lt;br&gt;
Traditional monitoring watches CPU, memory, and disk. These are easy to measure and feel comprehensive. The problem is they are not what users experience.&lt;br&gt;
A user does not know or care that your CPU is at 85%. They care whether their request was slow, failed, or never arrived. The Four Golden Signals are a framework for measuring what users actually experience.&lt;br&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — How long does it take?&lt;br&gt;
Not average latency. The p95 or p99 — the experience of your worst-off users. An average can look healthy while 5% of users are waiting 10 seconds for every request.&lt;br&gt;
We chose p95 as our SLI. This means we are explicitly committing to the experience of 95% of our users, and accepting that 5% may occasionally have worse experiences.&lt;br&gt;
&lt;strong&gt;Traffic — How much demand is the system handling?&lt;/strong&gt;&lt;br&gt;
Requests per second tells you the load context. A 500ms response time means something very different at 10 RPS versus 10,000 RPS. Traffic is also the denominator in your error rate calculation.&lt;br&gt;
&lt;strong&gt;Errors — What fraction of requests are failing?&lt;/strong&gt;&lt;br&gt;
Not just 5xx responses. Implicit errors matter too — a request that returns 200 with empty content, or a timeout that the client treats as a failure even though the server returned successfully.&lt;br&gt;
&lt;strong&gt;Saturation — How close to the limit is the system?&lt;/strong&gt;&lt;br&gt;
CPU and memory are saturation signals, but so are connection pool utilisation, queue depth, and disk I/O wait. Saturation predicts problems before they become failures. When saturation is high, latency usually follows.&lt;br&gt;
&lt;strong&gt;The Full Stack: What We Built and How It Fits Together&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Prometheus — The Metrics Engine&lt;/strong&gt;&lt;br&gt;
Prometheus is a pull-based metrics system. Every 15 seconds it visits each target's /metrics endpoint and reads a plain text file of numbers. It stores those numbers as time series and makes them queryable with PromQL.&lt;/p&gt;
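
&lt;p&gt;To make the "plain text file of numbers" concrete, here is a rough sketch of what a scrape response looks like and how it could be read. The sample metrics and the parse_exposition helper are assumptions for illustration; the parser ignores timestamps and escaping rules that the real text format allows, so it is not a substitute for Prometheus's own parser.&lt;/p&gt;

```python
# Illustrative scrape response; metric names and values are made up.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="GET",status="500"} 3
# TYPE process_open_fds gauge
process_open_fds 12
"""


def parse_exposition(text):
    """Map each series (name plus labels) to its float value.

    Hypothetical helper: skips comment lines and assumes the value is the
    last space-separated token on the line (no timestamps).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


for series, value in parse_exposition(SAMPLE).items():
    print(series, value)
```

&lt;p&gt;The counter only ever goes up between restarts, while the gauge can move in both directions, which is why counters are normally queried as rates.&lt;/p&gt;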

&lt;p&gt;&lt;strong&gt;Loki — The Log Aggregator&lt;/strong&gt;&lt;br&gt;
Loki works like Prometheus but for logs. Instead of scraping, it receives log streams pushed from the OTel Collector. Logs are indexed only by labels — the log content itself is not indexed, which keeps storage costs low.&lt;br&gt;
&lt;strong&gt;Tempo — The Trace Backend&lt;/strong&gt;&lt;br&gt;
Tempo stores distributed traces sent via the OpenTelemetry protocol. A trace is a record of a single request's journey through your system — when it started, how long each operation took, and whether it succeeded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana — The Unified Frontend&lt;/strong&gt;&lt;br&gt;
Grafana reads from all three backends simultaneously. A single dashboard can show a Prometheus metric, the correlated Loki logs, and the causing Tempo trace — all from the same time window. This is what makes the LGTM stack more than the sum of its parts.&lt;br&gt;
The Observability Stack in Detail&lt;br&gt;
Node Exporter Dashboard&lt;br&gt;
The Node Exporter dashboard gives us visibility into the App Server's system resources. CPU usage broken down by mode, memory showing used versus cached versus available, disk I/O in bytes per second, network I/O, and system load averages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsye1bm2we28vsqa1o67a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsye1bm2we28vsqa1o67a.jpg" alt=" " width="720" height="1600"&gt;&lt;/a&gt;&lt;br&gt;
Blackbox Exporter Dashboard&lt;br&gt;
The Blackbox Exporter probes our endpoints from the outside — simulating what a user experiences when they try to reach the service. It measures HTTP response time, probe success rate, and SSL certificate expiry.&lt;br&gt;
This is important because it catches failures that internal metrics miss. If the server is up but the load balancer is misconfigured, Node Exporter will show healthy but Blackbox will show the probe failing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex30gumq4s66f97mdo51.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex30gumq4s66f97mdo51.jpg" alt=" " width="720" height="1600"&gt;&lt;/a&gt;&lt;br&gt;
SLO and Error Budget Dashboard&lt;br&gt;
This is the most important dashboard in the stack. It answers the one question that matters: are we meeting our promises to users, and if not, how urgently do we need to respond?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufvbebeudcvb84zbisov.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufvbebeudcvb84zbisov.jpg" alt=" " width="720" height="1600"&gt;&lt;/a&gt;&lt;br&gt;
Row 1 — Current SLI vs SLO Target&lt;br&gt;
Four stat panels, one per SLO. Each shows the current SLI value and whether it is above or below the SLO target. Green means meeting the SLO. Red means breaching it right now.&lt;/p&gt;

&lt;p&gt;Row 2 — Error Budget Remaining&lt;br&gt;
Bar gauge panels showing how much of the monthly error budget remains. The colour encoding makes urgency immediately visible: green when more than 50% remains, yellow when between 25% and 50%, red when below 25%.&lt;/p&gt;

&lt;p&gt;Row 3 — Burn Rate&lt;br&gt;
Time series panels showing the current burn rate for each SLO, with reference lines at 14.4x (fast burn, critical) and 5x (slow burn, warning).&lt;/p&gt;

&lt;p&gt;How Burn Rate Alerting Reduces Alert Fatigue&lt;br&gt;
This deserves its own section because it is one of the most practically valuable concepts in reliability engineering and one of the least understood.&lt;br&gt;
The problem with simple threshold alerting&lt;br&gt;
Imagine your error rate SLO is 99%. You set an alert: fire if error rate exceeds 1%.&lt;br&gt;
Now imagine you have a brief spike of 2% errors for 3 minutes at 3am. Your alert fires, wakes someone up, and resolves itself before they even open their laptop. That is a false positive. Over time, if this happens regularly, engineers stop trusting alerts. They start ignoring pages. And then the real incident — the one that matters — gets ignored too.&lt;br&gt;
The opposite problem also exists. If you loosen the threshold to silence that noise (say, to 2%), a slow, steady error rate of 1.1% would never trigger the alert but would quietly exhaust your entire monthly error budget in about 27 days. You would never get paged, but you would breach your SLO.&lt;br&gt;
What burn rate solves&lt;br&gt;
Burn rate does not ask "is the error rate above threshold right now?" It asks "at this rate, how quickly are we exhausting our monthly budget?"&lt;br&gt;
Burn rate = actual error rate / allowed error rate = (1 - current SLI) / (1 - SLO target) &lt;br&gt;
A burn rate of 1x means you are consuming your budget at exactly the rate that will exhaust it in 30 days — right on the SLO line. A burn rate of 14.4x means you would exhaust the budget in about 2 days. That is the fast burn threshold.&lt;br&gt;
Fast burn: &amp;gt; 14.4x burn rate — page someone now &lt;br&gt;
Slow burn: &amp;gt; 5x burn rate — create a ticket, handle this sprint &lt;br&gt;
The result is that brief spikes — even large ones — do not trigger pages unless they are sustained long enough to actually threaten the budget. And slow burns that would have gone unnoticed now generate warnings proportional to their actual impact.&lt;/p&gt;
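
&lt;p&gt;The burn rate formula and the two thresholds above can be sketched as follows. This is an illustrative model only: in the real stack this logic lives in Prometheus rules, and the function names here are assumptions.&lt;/p&gt;

```python
FAST_BURN = 14.4  # page someone now
SLOW_BURN = 5.0   # create a ticket


def burn_rate(current_sli, slo_target):
    """Burn rate = (1 - current SLI) / (1 - SLO target)."""
    return (1 - current_sli) / (1 - slo_target)


def hours_to_exhaustion(rate, window_days=30):
    """Hours until a full budget is spent if this burn rate is sustained."""
    if rate == 0:
        return float("inf")
    return window_days * 24 / rate


def classify(rate):
    """Hypothetical helper mapping a burn rate to the response described above."""
    if rate > FAST_BURN:
        return "page"
    if rate > SLOW_BURN:
        return "ticket"
    return "ok"


rate = burn_rate(current_sli=0.90, slo_target=0.995)
print(round(rate, 1), classify(rate))          # 20.0 page
print(round(hours_to_exhaustion(14.4), 1))     # 50.0
```

&lt;p&gt;At exactly 1x the budget lasts the full 30 days (720 hours), and at 14.4x it lasts about 50 hours, which matches the "about 2 days" figure above.&lt;/p&gt;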

&lt;p&gt;Alert Rules&lt;br&gt;
All alert rules live in version-controlled YAML files. There are three files.&lt;br&gt;
infrastructure.yml&lt;br&gt;
Contains the Four Golden Signal recording rules and all infrastructure alerts: CPU, memory, disk, host down, and the &lt;br&gt;
slo-burn-rate.yml&lt;br&gt;
Contains the four burn rate alerts. &lt;br&gt;
cicd.yml&lt;br&gt;
Contains the DORA metric recording rules and the CFR and MTTR threshold alerts.&lt;/p&gt;

&lt;p&gt;Alertmanager Configuration&lt;br&gt;
Alertmanager handles routing, grouping, and inhibition. It receives alerts from Prometheus and decides who gets notified, how, and when.&lt;/p&gt;

&lt;p&gt;All alerts go to #DevOps-Alerts by default. &lt;br&gt;
Inhibition rules&lt;br&gt;
When HostDown fires for an instance, Alertmanager suppresses all other alerts for that same instance — CPU, memory, latency. This matters because when a host is completely unreachable, firing five separate alerts adds noise without adding information. The HostDown alert tells the whole story.&lt;/p&gt;
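
&lt;p&gt;The effect of that inhibition rule can be modelled in a few lines. This is a simulation of the behaviour, not Alertmanager's implementation or configuration syntax; the apply_inhibition helper and the instance names are assumptions for the example.&lt;/p&gt;

```python
def apply_inhibition(alerts):
    """Drop alerts for instances where HostDown is firing, keeping HostDown itself.

    Hypothetical model of the rule: HostDown is the source alert and every
    other alert sharing the same instance label is a suppressed target.
    """
    down = {a["instance"] for a in alerts if a["alertname"] == "HostDown"}
    return [
        a for a in alerts
        if a["alertname"] == "HostDown" or a["instance"] not in down
    ]


firing = [
    {"alertname": "HostDown", "instance": "app-1"},
    {"alertname": "CPUWarning", "instance": "app-1"},
    {"alertname": "CPUWarning", "instance": "app-2"},
]
for alert in apply_inhibition(firing):
    print(alert["alertname"], alert["instance"])
```

&lt;p&gt;Only the HostDown alert for app-1 and the unrelated CPUWarning for app-2 would be delivered.&lt;/p&gt;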

&lt;p&gt;Slack Notifications&lt;br&gt;
Every alert routes to #DevOps-Alerts with a structured payload. &lt;br&gt;
Each notification includes the alert name, severity, affected host, description of the current metric value, a direct link to the relevant Grafana dashboard, and a link to the runbook for that alert.&lt;/p&gt;

&lt;p&gt;DORA Metrics Dashboard&lt;br&gt;
DORA metrics measure engineering team performance across four dimensions. Understanding why these four were chosen requires understanding what they actually measure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsft6yyp0ti2s525phi8g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsft6yyp0ti2s525phi8g.jpg" alt=" " width="720" height="1600"&gt;&lt;/a&gt;&lt;br&gt;
Deployment Frequency measures how often you successfully deploy to production. High frequency is a proxy for small batch sizes, which reduces risk per deployment and accelerates feedback loops.&lt;br&gt;
Lead Time for Changes measures the time from a commit being merged to it running in production. Short lead time means your delivery pipeline is efficient and you can respond quickly to user needs and incidents.&lt;br&gt;
Change Failure Rate measures what percentage of deployments cause a failure requiring a hotfix or rollback. It is a quality signal — high CFR means your testing and review processes are not catching problems before they reach production.&lt;br&gt;
Mean Time to Restore measures how long it takes to recover from a failure. It is a resilience signal — teams with good runbooks, monitoring, and on-call processes restore service faster.&lt;/p&gt;
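
&lt;p&gt;As a rough sketch of how the four numbers relate to raw deployment events, the definitions above can be computed from a list of records. The record fields and sample data are invented for illustration; in ObserveX these metrics come from Prometheus recording rules rather than a script like this.&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Hypothetical deployment records; the field names are illustrative only.
DEPLOYMENTS = [
    {"merged": datetime(2026, 5, 1, 9), "deployed": datetime(2026, 5, 1, 10),
     "failed": False, "restored": None},
    {"merged": datetime(2026, 5, 2, 9), "deployed": datetime(2026, 5, 2, 12),
     "failed": True, "restored": datetime(2026, 5, 2, 12, 30)},
    {"merged": datetime(2026, 5, 4, 9), "deployed": datetime(2026, 5, 4, 11),
     "failed": False, "restored": None},
    {"merged": datetime(2026, 5, 6, 9), "deployed": datetime(2026, 5, 6, 11),
     "failed": False, "restored": None},
]


def dora_metrics(deployments, window_days):
    """Compute the four DORA metrics for one observation window."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d["failed"]]
    successes = len(deployments) - len(failures)
    lead_times = [d["deployed"] - d["merged"] for d in deployments]
    restore_times = [d["restored"] - d["deployed"] for d in failures]
    return {
        "deployment_frequency_per_day": successes / window_days,
        "lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr": (sum(restore_times, timedelta()) / len(restore_times)
                 if restore_times else None),
    }


for name, value in dora_metrics(DEPLOYMENTS, window_days=7).items():
    print(name, value)
```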

&lt;p&gt;How DORA metrics connect to business outcomes&lt;br&gt;
This is the part most monitoring tutorials do not explain. Why do engineering leaders care about these four numbers specifically?&lt;br&gt;
Deployment Frequency and Lead Time measure throughput — how fast value reaches users. Change Failure Rate and MTTR measure stability — how reliably that value works when it arrives.&lt;br&gt;
The insight from the DORA research programme is that high-performing teams score well on all four simultaneously. Throughput and stability are not a tradeoff — organisations that deploy more frequently also have lower failure rates and recover faster when failures occur.&lt;br&gt;
This means DORA metrics are not just engineering vanity metrics. They predict business outcomes: faster feature delivery, higher reliability, and faster incident recovery all translate directly to user satisfaction and revenue.&lt;/p&gt;

&lt;p&gt;Log and Trace Correlation — The Unified Dashboard&lt;br&gt;
This is the most powerful capability in the stack and the one that makes the biggest difference during an actual incident.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1fqgjgggc0vdb1vkqv8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1fqgjgggc0vdb1vkqv8.jpg" alt=" " width="720" height="1600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When an error rate spike appears on the metrics panel, you can navigate directly to the correlated logs from that exact time window. The Loki panel automatically filters to the same time range.&lt;/p&gt;

&lt;p&gt;When you see a log line with a traceID, clicking it opens the full trace in Tempo, showing you exactly which endpoint was called, how long each operation took, and at which point it failed.&lt;/p&gt;

&lt;p&gt;Runbooks&lt;br&gt;
Every alert in the stack has a corresponding runbook. Here is the slo-fast-burn.md runbook as an example.&lt;br&gt;
The runbook answers six questions:&lt;br&gt;
• What is this alert&lt;br&gt;
• What are the likely causes&lt;br&gt;
• What are the first three investigation steps with exact commands&lt;br&gt;
• How do you resolve it&lt;br&gt;
• When should you roll back vs fix forward&lt;br&gt;
• When and who to escalate to&lt;br&gt;
The burn rate runbook includes the arithmetic so the on-call engineer understands the urgency without needing to do mental math at 2am:&lt;br&gt;
At 14.4x burn rate: budget exhausts in 30 days × 24 hours / 14.4 = 50 hours &lt;br&gt;
At 40x burn rate: budget exhausts in 720 / 40 = 18 hours &lt;br&gt;
At 200x burn rate: budget exhausts in 720 / 200 = 3.6 hours &lt;br&gt;
Game Day: Chaos and Failure Simulation&lt;br&gt;
Game Day is the test of whether the observability platform actually works when things go wrong. We ran three scenarios.&lt;br&gt;
Scenario 1 — Deployment Failure&lt;br&gt;
We triggered a failing GitHub Actions workflow by adding a deliberately failing test step. The workflow completed with status 0 (failure), which pushed deployment_status=0 to the Pushgateway.&lt;br&gt;
The Change Failure Rate updated in the DORA dashboard within one Prometheus scrape interval. The alert fired and the resolved notification arrived when we pushed a successful deployment to bring the CFR back down.&lt;/p&gt;
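
&lt;p&gt;A push like the one in this scenario could look roughly like the sketch below. The deployment_status metric name comes from the scenario; the job name, the gateway address, and the build_push helper are assumptions. The request is only constructed here, not sent, so the example runs without a Pushgateway.&lt;/p&gt;

```python
from urllib import request


def build_push(gateway_url, job, status):
    """Build a POST request carrying deployment_status in text exposition format.

    Hypothetical helper: the Pushgateway accepts pushes on /metrics/job/ followed
    by the job name. Pass the result to urllib.request.urlopen to actually send it.
    """
    body = (
        "# TYPE deployment_status gauge\n"
        f"deployment_status {status}\n"
    )
    url = f"{gateway_url}/metrics/job/{job}"
    return request.Request(url, data=body.encode(), method="POST")


# "github_actions" and the localhost address are illustrative values.
req = build_push("http://localhost:9091", "github_actions", 0)
print(req.get_method(), req.full_url)
print(req.data.decode(), end="")
```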

&lt;p&gt;Scenario 2 — Latency Injection&lt;br&gt;
We created a /slow endpoint in the Flask app that sleeps for a random duration.&lt;br&gt;
The p95 latency SLI began climbing past the 500ms SLO threshold.&lt;br&gt;
The burn rate began accelerating and within two minutes the AvailabilityFastBurn alert fired.&lt;br&gt;
We then used the Unified Observability dashboard to find a slow trace. The error rate panel showed elevated latency, the Loki panel showed log lines with long processing times, and clicking the traceID opened the exact slow request in Tempo.&lt;/p&gt;

&lt;p&gt;Scenario 3 — Resource Pressure&lt;br&gt;
We ran stress-ng --cpu 4 --timeout 360s on the App Server to spike CPU above the alert thresholds.&lt;br&gt;
After 5 minutes the CPUWarning alert fired in Slack.&lt;br&gt;
After 5 minutes of sustained pressure the CPUCritical alert fired.&lt;br&gt;
When stress-ng finished, CPU dropped back to baseline and both resolved notifications arrived in Slack in the correct order — critical resolved first, then warning resolved.&lt;/p&gt;

&lt;p&gt;Toil Identified&lt;br&gt;
Toil is manual, repetitive work that has no lasting value. Two sources of toil were identified during this project.&lt;br&gt;
Manual runbook lookups during incidents. When an alert fires, the on-call engineer has to remember which runbook applies and open it manually. The alert payload now includes a direct runbook link, which eliminates the lookup step. What was a 2-minute search becomes a one-click navigation.&lt;br&gt;
Manual Grafana dashboard creation. Without provisioning as code, every dashboard would need to be recreated manually after a server rebuild. Dashboard JSON files in the repository mean cp grafana/dashboards/*.json /var/lib/grafana/dashboards/ followed by a Grafana reload restores all dashboards in under 30 seconds. The toil of manual recreation is eliminated entirely.&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
ObserveX is not a complex system. It is six binaries, a handful of config files, and a clear mental model of how data flows from application to alert.&lt;br&gt;
The complexity in observability does not come from the tools. It comes from not understanding what you are measuring and why. An SLI is just a PromQL expression. An SLO is just a target for that expression. An error budget is just arithmetic. A burn rate alert is just a comparison of your current consumption rate against the rate that would exhaust that budget.&lt;br&gt;
Once those concepts are clear, the tooling follows naturally. Prometheus measures the SLI. The recording rule stores it efficiently. The alert rule compares the burn rate against a threshold. Alertmanager routes the notification. The runbook tells the engineer what to do.&lt;br&gt;
The LGTM stack was chosen not because it is the easiest option but because it is the most instructive one. Every connection between components is explicit. Every data flow is visible. When something breaks, you can trace it end to end because you built it end to end.&lt;br&gt;
That understanding is the point.&lt;/p&gt;

</description>
      <category>infrastructure</category>
      <category>monitoring</category>
      <category>showdev</category>
      <category>sre</category>
    </item>
    <item>
      <title>Building swiftdeploy: A Policy-Gated Deployment CLI</title>
      <dc:creator>Nehemiah</dc:creator>
      <pubDate>Wed, 06 May 2026 20:41:57 +0000</pubDate>
      <link>https://dev.to/nehemiah_dev/i-built-my-own-deployment-engine-2gon</link>
      <guid>https://dev.to/nehemiah_dev/i-built-my-own-deployment-engine-2gon</guid>
      <description>&lt;p&gt;In a world of "one-click" managed solutions, it’s easy to let the underlying mechanics of infrastructure become a mystery. But for those of us who want production-grade control and a deep understanding of the "why" behind the "how", managed solutions can feel like a black box: magical until something breaks at 2am and you have no mental model of what's actually happening under the hood.&lt;br&gt;
So I built swiftdeploy, a deployment CLI that does the same job as those platforms, but entirely in code I wrote, understand, and can reason about at any layer.&lt;br&gt;
This is the story of how it works, why I made the choices I did, and the one technical problem that turned out to be far more interesting than I expected.&lt;br&gt;
&lt;strong&gt;The Philosophy: Own Your Abstractions&lt;/strong&gt;&lt;br&gt;
There's a version of this project where I reach for an existing solution. Argo CD, Flux — all excellent tools. But using them at this stage would have given me a working deployment pipeline and almost no understanding of what a deployment pipeline actually is.&lt;br&gt;
The constraint I set myself was simple: if I can't explain exactly what happens between typing a command and traffic reaching my app, I don't get to use that tool.&lt;br&gt;
This meant writing my own template engine around Jinja2, my own nginx config generator, my own health-check loop, and eventually my own policy engine integration. Every layer I owned became a layer I understood. That understanding compounds.&lt;br&gt;
The engineering philosophy here isn't "reinvent everything." It's reinvent the things that teach you something. Deployment orchestration teaches you an enormous amount about networking, process lifecycle, and operational trust. That's worth the friction.&lt;br&gt;
&lt;strong&gt;What swiftdeploy Actually Does&lt;/strong&gt;&lt;br&gt;
At its core, swiftdeploy is a CLI that manages a Docker Compose stack with a twist: nothing happens without a policy check first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swiftdeploy init  &lt;span class="c"&gt;#generate nginx.conf + docker-compose.yaml from your manifest &lt;/span&gt;
swiftdeploy deploy &lt;span class="c"&gt;# policy check → start stack → health check loop &lt;/span&gt;
swiftdeploy promote &lt;span class="c"&gt;# scrape metrics → policy check → switch canary/stable mode &lt;/span&gt;
swiftdeploy status &lt;span class="c"&gt;# live terminal dashboard with real-time policy compliance &lt;/span&gt;
swiftdeploy audit &lt;span class="c"&gt;# parse history.jsonl → generate audit_report.md &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every deploy is gated. Every promotion is evidence-based. Every decision is logged.&lt;br&gt;
&lt;strong&gt;The Policy Sidecar: OPA as the Brain&lt;/strong&gt;&lt;br&gt;
The most deliberate architectural choice in this project is that the CLI never makes allow/deny decisions itself. All decision logic lives in Open Policy Agent, running as a sidecar container.&lt;br&gt;
The CLI's job is to collect facts and ask questions. OPA's job is to answer them.&lt;br&gt;
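That split can be sketched as a small client that posts facts to OPA's Data API and treats anything other than an explicit true as a deny. This is an assumption-laden illustration, not swiftdeploy's code: the policy path, the input fields, the port, and both helper names are hypothetical, and the request is built but not sent.&lt;br&gt;

```python
import json
from urllib import request


def build_opa_query(base_url, policy_path, facts):
    """Build a POST to OPA's /v1/data endpoint with the facts wrapped as input."""
    payload = json.dumps({"input": facts}).encode()
    return request.Request(
        f"{base_url}/v1/data/{policy_path}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(response_body):
    """Fail closed: a missing, false, or non-boolean result counts as deny."""
    return json.loads(response_body).get("result") is True


# Hypothetical policy path and facts, for illustration only.
req = build_opa_query(
    "http://127.0.0.1:8181",
    "swiftdeploy/deploy/allow",
    {"disk_free_gb": 42, "mode": "canary"},
)
print(req.full_url)
print(is_allowed('{"result": true}'), is_allowed("{}"))
```

Failing closed matters here because OPA returns an empty object when a rule is undefined, and a gate that read that as "allow" would pass every deploy.&lt;br&gt;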
&lt;strong&gt;The Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzpqqhp2c03jfytmac2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzpqqhp2c03jfytmac2n.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
Here's something that initially seems like a minor detail but is actually load-bearing: nginx must never be able to reach OPA.&lt;br&gt;
If nginx could reach the policy engine, a malformed request from the internet could theoretically influence policy evaluation. The trust boundary would be blurred. So OPA runs on an internal-only Docker network.&lt;br&gt;
The CLI reaches OPA via a loopback-bound host port. Only processes running directly on the host can touch it. A container can't reach a loopback port on the host by default.&lt;br&gt;
This isn't theoretical hardening. It's a concrete, verifiable guarantee that public traffic can never trigger a policy evaluation.&lt;br&gt;
&lt;strong&gt;The Most Interesting Technical Challenge: P99 From a Histogram&lt;/strong&gt;&lt;br&gt;
The canary promotion gate blocks you from promoting if P99 latency exceeds 500ms. Simple requirement. Surprisingly interesting implementation.&lt;br&gt;
Prometheus doesn't give you a P99 directly. It gives you a histogram — a set of cumulative bucket counts with upper bounds. To get a percentile, you have to interpolate across those buckets.&lt;br&gt;
The naive approach, reading the upper bound of the bucket that contains the 99th percentile, gives you a ceiling, not a value. If your 99th-percentile observation falls somewhere inside the 0.25 bucket, you'd report 250ms regardless of whether the true value was 80ms or 249ms.&lt;br&gt;
The correct approach is linear interpolation within the containing bucket. Find where the target rank (99% × total count) falls, identify which bucket it lands in, then interpolate based on how far through that bucket's count range you are.&lt;br&gt;
But there's a second problem: you can't just read a snapshot. Prometheus counters are cumulative and monotonically increasing. If you scrape once and see 1000 requests in the 0.25 bucket, that's every request since the process started — not the last 30 seconds.&lt;br&gt;
The solution is to take two scrapes separated by a time window and compute deltas. The bucket counts in the second scrape minus the first give you the distribution for only that window. Then you run the interpolation on those deltas.&lt;br&gt;
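The delta step might look like the sketch below. The dictionary shape (bucket upper bound mapped to cumulative count) and the counter-reset handling are assumptions; the post does not say how the real CLI represents scrapes or handles restarts.&lt;br&gt;

```python
from typing import Dict


def bucket_deltas(first: Dict[float, float], second: Dict[float, float]) -> Dict[float, float]:
    """Per-bucket increase between two scrapes of cumulative bucket counts."""
    deltas = {}
    for upper, later in second.items():
        earlier = first.get(upper, 0.0)
        if later >= earlier:
            deltas[upper] = later - earlier
        else:
            # Assumption: a smaller count means the process restarted,
            # so everything in the second scrape happened in this window.
            deltas[upper] = later
    return deltas


inf = float("inf")
window = bucket_deltas(
    {0.1: 900, 0.25: 1000, inf: 1000},
    {0.1: 940, 0.25: 1048, inf: 1050},
)
```

Here the lifetime counts are in the hundreds, but the window between the two scrapes contains only 50 requests, and it is those 50 that the percentile should be computed from.&lt;br&gt;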
This is what makes the promote gate actually meaningful: it's not checking the lifetime health of the service, it's checking the last 30 seconds of traffic before you ask it to carry more.&lt;br&gt;
&lt;strong&gt;The Audit Trail&lt;/strong&gt;&lt;br&gt;
Every status scrape appends a JSON record to history.jsonl. Running swiftdeploy audit parses that file and produces a markdown report with metrics trends and a dedicated violations section.&lt;br&gt;
The design principle here is that observability is not optional. The history file survives stack restarts. The audit report gives you a narrative of exactly what your system was doing and what the policy engine was saying about it at every point.&lt;br&gt;
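An append-only JSON Lines file is simple to sketch in Python. The field names used here (ts, p99_ms, violations) and both function names are hypothetical; the post does not show the real record schema.&lt;br&gt;

```python
import json
import time


def append_record(path, record):
    """Append one scrape result as a single JSON line."""
    line = json.dumps({"ts": time.time(), **record})
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def load_violations(path):
    """Return only the records that carry at least one violation."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return [r for r in records if r.get("violations")]
```

Because each record is appended as its own line and nothing is rewritten, the file survives restarts and can be read back in order when building the report.&lt;br&gt;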
&lt;strong&gt;In Conclusion&lt;/strong&gt; &lt;br&gt;
This project taught me a lot of things that are usually hidden under abstractions. The metrics interpolation was the most satisfying problem. It looks like a small utility function. It's actually the difference between a gate that measures something real and one that just performs measurement.&lt;br&gt;
And the network isolation detail — the one that's easy to skip — is the one that determines whether your security model is real or decorative.&lt;br&gt;
That's the thing about owning your abstractions. The interesting problems are hiding inside the details that pre-built platforms quietly handle for you.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Building a Real-Time Attack Detection Daemon</title>
      <dc:creator>Nehemiah</dc:creator>
      <pubDate>Wed, 29 Apr 2026 22:50:21 +0000</pubDate>
      <link>https://dev.to/nehemiah_dev/building-a-real-time-anomaly-detection-daemon-for-a-live-cloud-storage-platform-3fp8</link>
      <guid>https://dev.to/nehemiah_dev/building-a-real-time-anomaly-detection-daemon-for-a-live-cloud-storage-platform-3fp8</guid>
      <description>&lt;p&gt;Imagine you're running a busy coffee shop. On a normal day, about 30 customers walk in per hour. You know your regulars, you know the rhythm. Then one afternoon, 300 people rush in through the door in two minutes — and they're not ordering coffee, they're just slamming every cabinet open and closed.&lt;br&gt;
You'd notice. You'd react.&lt;br&gt;
That's exactly what this project does — but for an online service, instead of a coffee shop. It watches every single HTTP request coming into a server, learns what "normal" looks like, and automatically sounds the alarm (and slams the door shut) when something looks wrong.&lt;br&gt;
Let's walk through how it works, piece by piece.&lt;br&gt;
&lt;strong&gt;Step 1: Reading the Logs — The Monitor&lt;/strong&gt;&lt;br&gt;
The first thing the detector needs to do is read traffic data from the log. The reverse proxy (Nginx in this case) is configured to write logs in JSON format, which makes them easy to parse programmatically. Each line looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1705318496.123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.2.3.4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/login"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4821"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every field tells you something: who sent the request, when, what they asked for, and whether the server responded OK (status 200) or with an error (status 404, 500, etc.).&lt;br&gt;
The monitor's job is to tail this file: it reads each new line as it appears, parses it, and drops it into a queue for the detector to process.&lt;br&gt;
This runs as a daemon — a background process that never stops. Not a cron job, not a script you run once. Always on, always watching.&lt;br&gt;
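The core of tailing can be sketched as a function that drains whatever has been appended since the last call. The name drain_new_lines and the choice to skip malformed lines are assumptions, not the actual monitor.py.&lt;br&gt;

```python
import json
import queue


def drain_new_lines(fh, out_queue):
    """Parse every complete line appended since the last call.

    fh must be opened in binary mode. Returns the number of entries queued.
    """
    queued = 0
    while True:
        position = fh.tell()
        raw = fh.readline()
        if not raw:
            break  # reached the current end of the file
        if not raw.endswith(b"\n"):
            # Nginx is mid-write: rewind and retry on the next call.
            fh.seek(position)
            break
        try:
            entry = json.loads(raw)
        except ValueError:
            continue  # skip lines that are not valid JSON
        out_queue.put(entry)
        queued += 1
    return queued
```

A daemon loop would call this repeatedly with a short sleep in between, so the file handle stays open and only new lines are ever read.&lt;br&gt;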
&lt;strong&gt;Step 2: The Sliding Window — Counting Requests Over Time&lt;/strong&gt;&lt;br&gt;
Now the detector has a stream of incoming log entries. The first question it needs to answer is:&lt;br&gt;
How fast is this IP sending requests right now?&lt;br&gt;
You might think — just count all their requests! But that doesn't work. If an IP sent 10,000 requests six hours ago and 2 requests in the last minute, they're not attacking right now. You need to know the recent rate, not the all-time total.&lt;br&gt;
This is where a sliding window comes in.&lt;br&gt;
Think of it like a 60-second ruler sliding along a timeline.&lt;br&gt;
As time moves forward, the window moves with it. Requests older than 60 seconds slide out of the left side. New requests come in on the right. At any moment, you can count how many requests are inside the window to get the current rate.&lt;br&gt;
In code, we use a deque (a double-ended queue) — a list that's cheap to add to on the right and remove from on the left. We keep one window per IP address, plus one global window that counts all requests from all IPs combined.&lt;br&gt;
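A minimal sketch of that structure is below. The class and method names are assumptions; only the deque, the 60-second window, and the per-IP plus global split come from the description above.&lt;br&gt;

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60


class SlidingWindow:
    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.per_ip = defaultdict(deque)  # one deque of timestamps per IP
        self.global_window = deque()      # timestamps from all IPs combined

    def _evict(self, window, now):
        # Drop timestamps that have slid out of the left side of the window.
        cutoff = now - self.window_seconds
        while window and cutoff >= window[0]:
            window.popleft()

    def record(self, ip, now):
        self.per_ip[ip].append(now)
        self.global_window.append(now)

    def rate(self, ip, now):
        """Requests per second from this IP over the window."""
        window = self.per_ip[ip]
        self._evict(window, now)
        return len(window) / self.window_seconds

    def global_rate(self, now):
        """Requests per second across all IPs over the window."""
        self._evict(self.global_window, now)
        return len(self.global_window) / self.window_seconds
```

Appending on the right and popping on the left are both constant-time on a deque, which is why it fits this access pattern better than a plain list.&lt;br&gt;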
&lt;strong&gt;Step 3: The Rolling Baseline — Learning What "Normal" Looks Like&lt;/strong&gt;&lt;br&gt;
Here's the thing about anomaly detection: you can't hardcode a threshold like "block anyone over 10 req/s." Why? Because traffic patterns are different at 3am versus 3pm. A small company might have 0.1 req/s average; a big one might have 50 req/s average. A threshold that's too low creates false alarms. Too high and you miss real attacks.&lt;br&gt;
The solution is to let the system learn what normal looks like from real traffic, and update that knowledge continuously.&lt;br&gt;
We do this with a rolling 30-minute baseline.&lt;br&gt;
Every second, we count how many requests came in and store that number. From the last 30 minutes of those per-second counts, we calculate two things:&lt;br&gt;
&lt;strong&gt;Mean (average)&lt;/strong&gt; — the typical number of requests per second.&lt;br&gt;
&lt;strong&gt;Standard deviation&lt;/strong&gt; — how much the traffic normally varies around that average. Low stddev means traffic is very steady. High stddev means it's naturally spiky.&lt;br&gt;
Anything beyond 3 standard deviations (3σ) from the mean is statistically very unlikely under normal traffic: if per-second counts are roughly normally distributed, it happens by chance only about 0.3% of the time. So if we see it, something unusual is probably happening. The baseline is recalculated every 60 seconds so it is always fresh. If traffic naturally grows over the day, the baseline grows with it. It's self-adapting.&lt;br&gt;
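One way to hold the baseline is a fixed-length deque of per-second counts, sketched below. The class name and the use of the population standard deviation (statistics.pstdev) are assumptions; the post does not say which variant the detector uses.&lt;br&gt;

```python
import statistics
from collections import deque

BASELINE_SECONDS = 30 * 60  # 30 minutes of per-second counts


class RollingBaseline:
    def __init__(self, size=BASELINE_SECONDS):
        # maxlen makes the oldest count fall off automatically.
        self.counts = deque(maxlen=size)

    def add_second(self, request_count):
        self.counts.append(request_count)

    def stats(self):
        """Return (mean, stddev) of the stored counts, or None if empty."""
        if not self.counts:
            return None
        return statistics.fmean(self.counts), statistics.pstdev(self.counts)


baseline = RollingBaseline()
for count in (2, 4, 4, 4, 5, 5, 7, 9):
    baseline.add_second(count)
mean, stddev = baseline.stats()
```

For those eight sample counts the mean is 5 and the standard deviation is 2, so a second with more than 11 requests would sit beyond 3σ.&lt;br&gt;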
&lt;strong&gt;Step 4: Making a Decision — The Anomaly Detection Logic&lt;/strong&gt;&lt;br&gt;
Now we have everything we need to answer the key question:&lt;br&gt;
Is this traffic suspicious?&lt;br&gt;
We use two tests, either of which can trigger an alert, plus a third check that makes detection more sensitive:&lt;br&gt;
&lt;strong&gt;Test 1: The Z-Score&lt;/strong&gt;&lt;br&gt;
The z-score measures how many standard deviations the current rate is from the mean: z = (current rate - mean) / stddev. A z-score above 3.0 triggers an alert.&lt;br&gt;
&lt;strong&gt;Test 2: The Multiplier&lt;/strong&gt;&lt;br&gt;
Z-scores can be misleading when traffic is very low (e.g., at 3am when stddev is near zero). So we also check: is the current rate more than 5 times the mean?&lt;br&gt;
&lt;strong&gt;Test 3: Error Surge Detection&lt;/strong&gt;&lt;br&gt;
If an IP is generating lots of 4xx/5xx errors (like hammering /login and failing), that's a signal too. We check whether their error rate is 3× the normal error rate, and if so, we tighten the thresholds — making detection more sensitive for that IP.&lt;br&gt;
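The three checks can be combined in one function, sketched below. The function name, the parameter names, and halving the thresholds during an error surge are assumptions; the post only says the thresholds are tightened, not by how much.&lt;br&gt;

```python
def is_anomalous(rate, mean, stddev, error_rate=0.0, baseline_error_rate=0.0,
                 z_threshold=3.0, multiplier=5.0):
    """Return True if the current rate looks suspicious against the baseline."""
    # Test 3: an error surge makes detection more sensitive for this IP.
    if baseline_error_rate > 0 and error_rate >= 3 * baseline_error_rate:
        z_threshold *= 0.5  # assumed tightening factor
        multiplier *= 0.5

    # Test 1: z-score, skipped when stddev is zero to avoid dividing by zero.
    z = (rate - mean) / stddev if stddev > 0 else 0.0
    over_z = z > z_threshold

    # Test 2: plain multiple of the mean, useful when stddev is near zero.
    over_multiplier = mean > 0 and rate > multiplier * mean

    return over_z or over_multiplier
```

For example, a rate of 0.6 req/s against a mean of 0.1 and a stddev of 0 yields no usable z-score, but it is six times the mean, so the multiplier test still flags it.&lt;br&gt;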
&lt;strong&gt;What Happens When Something Is Flagged?&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Step 5: Blocking With iptables&lt;/strong&gt;&lt;br&gt;
Once an IP is flagged, we need to actually stop its traffic. We use iptables, the standard Linux tool for configuring the kernel's built-in packet filter (netfilter).&lt;br&gt;
Think of iptables as a bouncer standing at the network door. You give it a list of rules, and it checks every packet against that list before letting it through.&lt;br&gt;
Once a DROP rule for the flagged IP is in place, its packets are dropped at the kernel level and never even reach Nginx.&lt;br&gt;
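A sketch of how the blocker might build its commands is below. The function name and the IPv4-only check are assumptions; the iptables arguments match the INPUT and FORWARD rules shown in the flow later in this post.&lt;br&gt;

```python
import ipaddress

CHAINS = ("INPUT", "FORWARD")


def build_block_commands(ip):
    """Return one iptables argument list per chain for dropping this IP."""
    # Validate first so arbitrary log content never reaches the command line.
    address = ipaddress.ip_address(ip)
    if address.version != 4:
        raise ValueError("iptables handles IPv4 only; IPv6 would need ip6tables")
    return [
        ["iptables", "-I", chain, "-s", str(address), "-j", "DROP"]
        for chain in CHAINS
    ]
```

Each list could be passed to subprocess.run(command, check=True) by a process running as root. The sketch only builds the commands, so it can be exercised without touching the firewall.&lt;br&gt;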
&lt;strong&gt;Step 6: Auto-Unban With Backoff&lt;/strong&gt;&lt;br&gt;
We don't ban forever on the first offence (well, almost never). The unbanner follows a tiered schedule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Offence       Ban Duration
 1st           10 minutes
 2nd           30 minutes
 3rd           2 hours
 4th           Permanent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every 30 seconds, the unbanner checks whether any bans have expired. When it unbans an IP, it remembers the tier — so if that same IP attacks again, the next ban is longer. Repeat offenders escalate toward permanent.&lt;br&gt;
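The tiered schedule maps naturally onto a small lookup, sketched here. The function names and the use of None to mean a permanent ban are assumptions for illustration.&lt;br&gt;

```python
# Ban lengths in seconds for the 1st, 2nd and 3rd offence.
BAN_SCHEDULE_SECONDS = [10 * 60, 30 * 60, 2 * 60 * 60]


def ban_duration(offence_number):
    """Return the ban length in seconds, or None for a permanent ban."""
    if not offence_number >= 1:
        raise ValueError("offence_number starts at 1")
    if offence_number > len(BAN_SCHEDULE_SECONDS):
        return None  # 4th offence and beyond: permanent
    return BAN_SCHEDULE_SECONDS[offence_number - 1]


def is_expired(banned_at, offence_number, now):
    """True once a temporary ban has run its full duration."""
    duration = ban_duration(offence_number)
    return duration is not None and now - banned_at >= duration
```

The unbanner's periodic check would call is_expired for every active ban and remove the rules for the ones that return True; permanent bans never do.&lt;br&gt;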
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nginx writes a log line
        │
        ▼
monitor.py reads the new line, parses JSON
        │
        ▼
detector.py adds timestamp to IP's sliding window deque
        │
        ▼
detector.py calculates current rate (len(window) / 60)
        │
        ▼
detector.py computes z-score against rolling baseline
        │
     z &amp;gt; 3.0 or rate &amp;gt; 5x mean?
        │
       YES
        │
        ▼
blocker.py runs: iptables -I INPUT -s &amp;lt;ip&amp;gt; -j DROP &amp;amp;&amp;amp; \
iptables -I FORWARD -s &amp;lt;ip&amp;gt; -j DROP
        │
        ▼
notifier.py fires Slack message + writes audit log entry
        │
        ▼
(10 minutes later) unbanner.py removes the iptables rule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of these steps happen asynchronously inside a single Python process using threads.&lt;br&gt;
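The threading layout can be sketched with a producer thread and a consumer thread sharing a queue. Everything here is a simplified assumption: the real daemon runs forever, while this version uses a sentinel so the example terminates.&lt;br&gt;

```python
import queue
import threading


def run_pipeline(lines, handle):
    """Feed lines through a queue from a monitor thread to a detector thread."""
    entries = queue.Queue()
    stop = object()  # sentinel telling the detector to finish

    def monitor():
        for line in lines:
            entries.put(line)
        entries.put(stop)

    def detector():
        while True:
            item = entries.get()  # blocks until the monitor produces something
            if item is stop:
                break
            handle(item)

    threads = [threading.Thread(target=monitor), threading.Thread(target=detector)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


seen = []
run_pipeline(["first", "second"], seen.append)
```

The queue is the only shared state between the two threads, so the monitor never has to wait for detection to finish before reading the next log line.&lt;br&gt;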
&lt;strong&gt;What I Learned Building This&lt;/strong&gt;&lt;br&gt;
A few things that surprised me along the way:&lt;br&gt;
&lt;strong&gt;1. Hardcoded thresholds are fragile.&lt;/strong&gt; The very first version I sketched out used if rate &amp;gt; 20: ban(). That would have been a disaster — blocking legitimate traffic during a busy period, missing attacks at quiet times. The rolling baseline was the most important design decision.&lt;br&gt;
&lt;strong&gt;2. iptables chain selection matters enormously.&lt;/strong&gt; Blocking only in INPUT and not in FORWARD meant SSH dropped but HTTP kept flowing. Docker's NAT rules redirect traffic for published container ports before the routing decision, so those packets traverse the FORWARD chain and never hit INPUT. A lot of guides skip over this.&lt;br&gt;
&lt;strong&gt;3. Standard deviation is surprisingly useful.&lt;/strong&gt; Before this project, stddev felt like a statistics-class abstraction. Using it here to define "normal variance" made it concrete — it's just a measure of how wiggly your traffic normally is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security tooling&lt;/strong&gt; doesn't have to be mysterious. At its core, this project is just: read data, count things, compare to normal, act when something's off. The same pattern underlies intrusion detection systems, fraud detection, network monitoring, and a lot of other "scary" security tools.&lt;br&gt;
Once you understand the pieces — sliding windows, rolling baselines, z-scores, iptables — you can compose them into something genuinely useful.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>security</category>
    </item>
  </channel>
</rss>
