<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adeolu </title>
    <description>The latest articles on DEV Community by Adeolu  (@adeolu102).</description>
    <link>https://dev.to/adeolu102</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916643%2Fa64f5e39-d396-4957-9103-13cf50c27649.jpg</url>
      <title>DEV Community: Adeolu </title>
      <link>https://dev.to/adeolu102</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/adeolu102"/>
    <language>en</language>
    <item>
      <title>Building a Production-Grade Observability Platform for the Anvila API with LGTM, SLOs, DORA Metrics, and Game Day Testing</title>
      <dc:creator>Adeolu </dc:creator>
      <pubDate>Tue, 19 May 2026 00:08:05 +0000</pubDate>
      <link>https://dev.to/adeolu102/building-a-production-grade-observability-platform-for-the-anvila-api-with-lgtm-slos-dora-5fnc</link>
      <guid>https://dev.to/adeolu102/building-a-production-grade-observability-platform-for-the-anvila-api-with-lgtm-slos-dora-5fnc</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;For the HNG DevOps Stage 6 task, our team built a production-grade observability and reliability platform for the Anvila API.&lt;/p&gt;

&lt;p&gt;The goal was not just to check whether a server was up or down. We needed to build a monitoring system that could help a team understand the health of an application from different angles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the application available?&lt;/li&gt;
&lt;li&gt;Is it fast enough for users?&lt;/li&gt;
&lt;li&gt;Are requests failing?&lt;/li&gt;
&lt;li&gt;Is the server under pressure?&lt;/li&gt;
&lt;li&gt;Are deployments healthy?&lt;/li&gt;
&lt;li&gt;Can engineers investigate logs, metrics, and traces from one place?&lt;/li&gt;
&lt;li&gt;Can alerts reach the team with enough context to act quickly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To achieve this, we used an LGTM-style observability stack. The M in LGTM normally stands for Grafana Mimir; we used Prometheus for metrics instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loki for logs&lt;/li&gt;
&lt;li&gt;Grafana for dashboards&lt;/li&gt;
&lt;li&gt;Tempo for traces&lt;/li&gt;
&lt;li&gt;Prometheus for metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also added Alertmanager for alert routing, Node Exporter for server metrics, Blackbox Exporter for uptime probing, OpenTelemetry Collector for logs and traces, and a custom GitHub Actions DORA exporter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus targets all up
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogdcls0fkvl3rsn9qosi.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogdcls0fkvl3rsn9qosi.PNG" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;

&lt;p&gt;Our architecture uses a separate monitoring server. The Anvila API continues to run on its application server, while the monitoring stack runs on a dedicated AWS EC2 instance.&lt;/p&gt;

&lt;p&gt;The monitoring server collects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system metrics from Node Exporter&lt;/li&gt;
&lt;li&gt;uptime and response time metrics from Blackbox Exporter&lt;/li&gt;
&lt;li&gt;application metrics from the FastAPI &lt;code&gt;/metrics&lt;/code&gt; endpoint&lt;/li&gt;
&lt;li&gt;logs from the app server through OpenTelemetry Collector and Loki&lt;/li&gt;
&lt;li&gt;traces from the instrumented FastAPI staging service through OpenTelemetry and Tempo&lt;/li&gt;
&lt;li&gt;CI/CD metrics from GitHub Actions through a custom DORA exporter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grafana connects all these data sources together so that we can view metrics, logs, and traces in one place.&lt;/p&gt;

&lt;h3&gt;
  
  
  LGTM dashboards list
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ttw06j86uoau4d2e4w8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ttw06j86uoau4d2e4w8.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a high level, the data flow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Anvila API -&amp;gt; OpenTelemetry Collector -&amp;gt; Loki / Tempo
Anvila API /metrics -&amp;gt; Prometheus
Node Exporter -&amp;gt; Prometheus
Blackbox Exporter -&amp;gt; Prometheus
GitHub Actions exporter -&amp;gt; Prometheus
Prometheus / Loki / Tempo -&amp;gt; Grafana
Prometheus alerts -&amp;gt; Alertmanager -&amp;gt; Slack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why We Chose LGTM Over Managed Alternatives
&lt;/h2&gt;

&lt;p&gt;Managed observability platforms are useful because they reduce operational work. However, for this task, we chose LGTM because we needed to understand and control the full observability pipeline.&lt;/p&gt;

&lt;p&gt;LGTM gave us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;full control over metrics, logs, and traces&lt;/li&gt;
&lt;li&gt;version-controlled configuration&lt;/li&gt;
&lt;li&gt;no dependency on a paid managed observability provider&lt;/li&gt;
&lt;li&gt;a better learning experience for how observability systems actually work&lt;/li&gt;
&lt;li&gt;the ability to provision dashboards, alert rules, and data sources as code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prometheus is strong for metrics and alerting. Loki indexes logs by label rather than by full text, which keeps storage light and fits naturally with Grafana's label-based queries. Tempo stores distributed traces and helps us understand request flow. Grafana brings all three together in one interface.&lt;/p&gt;

&lt;p&gt;This made LGTM a good fit for a platform engineering task where the goal was to build the stack from the ground up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as Code and Systemd Deployment
&lt;/h2&gt;

&lt;p&gt;One important requirement was that the stack should be reproducible. We used Terraform to provision the monitoring EC2 server and upload the configuration files.&lt;/p&gt;

&lt;p&gt;The services are installed and managed with systemd rather than Docker, because the updated task requirements said not to install Docker on the server.&lt;/p&gt;

&lt;p&gt;The stack includes systemd services for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus&lt;/li&gt;
&lt;li&gt;Grafana&lt;/li&gt;
&lt;li&gt;Loki&lt;/li&gt;
&lt;li&gt;Tempo&lt;/li&gt;
&lt;li&gt;Alertmanager&lt;/li&gt;
&lt;li&gt;Node Exporter&lt;/li&gt;
&lt;li&gt;Blackbox Exporter&lt;/li&gt;
&lt;li&gt;OpenTelemetry Collector&lt;/li&gt;
&lt;li&gt;DORA exporter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repository contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terraform files&lt;/li&gt;
&lt;li&gt;Prometheus configuration&lt;/li&gt;
&lt;li&gt;Alertmanager configuration&lt;/li&gt;
&lt;li&gt;Grafana dashboard JSON files&lt;/li&gt;
&lt;li&gt;Loki and Tempo configuration&lt;/li&gt;
&lt;li&gt;OpenTelemetry Collector configuration&lt;/li&gt;
&lt;li&gt;alert rules&lt;/li&gt;
&lt;li&gt;runbooks&lt;/li&gt;
&lt;li&gt;SLI/SLO documentation&lt;/li&gt;
&lt;li&gt;Game Day documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Alert rules YAML
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flavsgpson2f6f3lrtmc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flavsgpson2f6f3lrtmc4.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics Collection
&lt;/h2&gt;

&lt;p&gt;Metrics are numerical measurements that tell us what is happening in the system.&lt;/p&gt;

&lt;p&gt;We collect several types of metrics:&lt;/p&gt;

&lt;h3&gt;
  
  
  Server Metrics
&lt;/h3&gt;

&lt;p&gt;Node Exporter gives us system-level metrics such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU usage&lt;/li&gt;
&lt;li&gt;memory usage&lt;/li&gt;
&lt;li&gt;disk usage&lt;/li&gt;
&lt;li&gt;disk I/O&lt;/li&gt;
&lt;li&gt;network I/O&lt;/li&gt;
&lt;li&gt;system load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics help us understand saturation, which means how full or stressed the system is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node Exporter dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpek4kijrvsq7bbrsuv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpek4kijrvsq7bbrsuv7.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Uptime and HTTP Probe Metrics
&lt;/h3&gt;

&lt;p&gt;Blackbox Exporter probes both the staging and production API URLs.&lt;/p&gt;

&lt;p&gt;It checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether the endpoint is reachable&lt;/li&gt;
&lt;li&gt;whether it returns a successful HTTP response&lt;/li&gt;
&lt;li&gt;how long the response takes&lt;/li&gt;
&lt;li&gt;SSL certificate expiry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is important because it measures the service from the outside, closer to how a user or client would experience it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Blackbox dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fva00ljsag6kofbi3um5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fva00ljsag6kofbi3um5m.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Metrics
&lt;/h3&gt;

&lt;p&gt;We also added app-level Prometheus metrics to the FastAPI staging application through a &lt;code&gt;/metrics&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;p&gt;This gives Prometheus real application request metrics such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http_request_duration_seconds&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http_requests_in_progress&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics help us track request volume, request latency, and request status codes from inside the application itself.&lt;/p&gt;
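
&lt;p&gt;As a rough illustration of what Prometheus reads when it scrapes &lt;code&gt;/metrics&lt;/code&gt;, the standard-library sketch below counts requests per label set and renders them in the Prometheus text exposition format. The &lt;code&gt;RequestCounter&lt;/code&gt; class and the &lt;code&gt;method&lt;/code&gt; label are assumptions made for illustration; a real FastAPI service would normally use a Prometheus client library rather than hand-written code like this.&lt;/p&gt;

```python
from collections import Counter


class RequestCounter:
    """Toy stand-in for a labelled Prometheus counter (illustrative only)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: Counter = Counter()

    def inc(self, method: str, path: str, status: int) -> None:
        # Each distinct label combination is its own time series.
        self._values[(method, path, str(status))] += 1

    def render(self) -> str:
        # Text exposition format: HELP and TYPE lines, then one sample per series.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for (method, path, status), value in sorted(self._values.items()):
            labels = f'method="{method}",path="{path}",status="{status}"'
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines) + "\n"


requests_total = RequestCounter("http_requests_total", "Total HTTP requests.")
requests_total.inc("GET", "/", 200)
requests_total.inc("GET", "/", 200)
requests_total.inc("GET", "/health", 500)
print(requests_total.render())
```

&lt;p&gt;Every unique combination of label values becomes a separate series, which is why labels with many possible values should be used with care.&lt;/p&gt;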

&lt;h3&gt;
  
  
  App metrics target up
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns4ut8wvz7bp4ffqy11d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns4ut8wvz7bp4ffqy11d.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  App request metrics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft93mpghbv0udo3mbwzg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft93mpghbv0udo3mbwzg2.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  App latency histogram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs9b7b941e4p7zhnng65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs9b7b941e4p7zhnng65.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Logs and Traces
&lt;/h2&gt;

&lt;p&gt;Metrics tell us that something is wrong. Logs and traces help us understand why.&lt;/p&gt;

&lt;p&gt;We used OpenTelemetry Collector to collect application logs and send them to Loki. In Grafana, we can search these logs using Loki queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loki logs
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eygkg19pea248mx5bxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eygkg19pea248mx5bxi.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also instrumented the staging FastAPI service with OpenTelemetry so that it sends traces to Tempo.&lt;/p&gt;

&lt;p&gt;A trace shows the path of a request through the application. This is very useful when investigating slow requests, because it helps identify which endpoint or operation caused the delay.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tempo traces
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxm7qyg8oxjvpu52wu3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxm7qyg8oxjvpu52wu3c.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Unified Observability dashboard combines metrics, logs, and traces. The idea is that if latency or error rate increases, an engineer can look at the metric, check related logs in Loki, and then inspect traces in Tempo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unified observability dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbseuiewhomi7l896jxoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbseuiewhomi7l896jxoa.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Golden Signals
&lt;/h2&gt;

&lt;p&gt;The Four Golden Signals are a simple way to define service reliability from a user-focused perspective.&lt;/p&gt;

&lt;p&gt;They are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Traffic&lt;/li&gt;
&lt;li&gt;Errors&lt;/li&gt;
&lt;li&gt;Saturation&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;Latency means how long it takes the service to respond.&lt;/p&gt;

&lt;p&gt;For Anvila, we track latency using both Blackbox response time and application request duration metrics.&lt;/p&gt;

&lt;p&gt;Example PromQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(
  0.95,
  sum by (le, path) (
    rate(http_request_duration_seconds_bucket{job="anvila-api-staging",status!~"5.."}[5m])
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



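&lt;p&gt;This tells us the 95th percentile request latency for each path, excluding requests that returned a 5xx status. Note that 4xx responses are still included, so this is latency for non-failing requests rather than strictly successful ones.&lt;/p&gt;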
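
&lt;p&gt;To make the percentile estimate more concrete, here is a simplified sketch of the linear interpolation that &lt;code&gt;histogram_quantile&lt;/code&gt; performs over cumulative buckets. The bucket boundaries and counts are invented for the example, and the function leaves out several edge cases that Prometheus itself handles.&lt;/p&gt;

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: q is expected to be greater than 0 and at most 1,
    and the last bucket must have an upper bound of +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative counts for buckets le=0.1, 0.5, 1.0 and +Inf.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example_buckets)
print(f"estimated p95 latency: {p95:.4f}s")
```

&lt;p&gt;Because the result is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries surround the latencies we care about.&lt;/p&gt;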

&lt;h3&gt;
  
  
  Traffic
&lt;/h3&gt;

&lt;p&gt;Traffic means how much demand the service is receiving.&lt;/p&gt;

&lt;p&gt;Example PromQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(http_requests_total{job="anvila-api-staging"}[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells us the number of requests per second, averaged over the last five minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Errors
&lt;/h3&gt;

&lt;p&gt;Errors measure how many requests are failing.&lt;/p&gt;

&lt;p&gt;Example PromQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(http_requests_total{job="anvila-api-staging",status=~"5.."}[5m]))
/
clamp_min(sum(rate(http_requests_total{job="anvila-api-staging"}[5m])), 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



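&lt;p&gt;This gives us the fraction of requests that failed with a 5xx status. The &lt;code&gt;clamp_min&lt;/code&gt; call keeps the denominator from reaching zero when the service is idle. With a floor of 1, it also understates the ratio whenever total traffic is below one request per second.&lt;/p&gt;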
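
&lt;p&gt;The arithmetic behind the error query can be sketched in a few lines. The request rates below are invented, and the &lt;code&gt;error_ratio&lt;/code&gt; helper is only an illustration of how the clamped denominator behaves at high and low traffic.&lt;/p&gt;

```python
def error_ratio(error_rps: float, total_rps: float, floor: float = 1.0) -> float:
    """Mirror the PromQL above: 5xx rate divided by the clamped total rate."""
    # clamp_min(x, floor) behaves like max(x, floor).
    return error_rps / max(total_rps, floor)


# 40 requests per second, 2 of them failing: the clamp has no effect.
busy = error_ratio(error_rps=2.0, total_rps=40.0)

# 0.4 requests per second, half of them failing: the clamp raises 0.4 to 1.0.
quiet = error_ratio(error_rps=0.2, total_rps=0.4)

print(f"busy ratio: {busy:.3f}")
print(f"quiet ratio with clamp: {quiet:.3f}")
print(f"quiet ratio without clamp: {0.2 / 0.4:.3f}")
```

&lt;p&gt;For a low-traffic staging service, a much smaller floor would keep the division safe without distorting the ratio as much.&lt;/p&gt;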

&lt;h3&gt;
  
  
  Saturation
&lt;/h3&gt;

&lt;p&gt;Saturation shows how full the system is.&lt;/p&gt;

&lt;p&gt;For example, CPU usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
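
&lt;p&gt;The CPU query above works on idle-time counters, which can be easier to follow with numbers. The sketch below reproduces the same steps for one instance, assuming two cores and made-up counter readings taken five minutes apart.&lt;/p&gt;

```python
def cpu_usage_percent(idle_start: list[float], idle_end: list[float], window_seconds: float) -> float:
    """CPU usage for one instance from per-core idle-time counters.

    Each list holds node_cpu_seconds_total{mode="idle"} for every core,
    sampled at the start and at the end of the window.
    """
    # rate(): per-second increase of each core's idle counter (0.0 to 1.0).
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    # avg by (instance): mean idle fraction across cores, then invert it.
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical readings: core 0 was idle for 240s and core 1 for 270s out of 300s.
usage = cpu_usage_percent([1000.0, 2000.0], [1240.0, 2270.0], window_seconds=300)
print(f"CPU usage: {usage:.1f}%")
```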



&lt;p&gt;Together, these four signals go beyond simply checking whether the server is alive. A server can be up but overloaded, and the Four Golden Signals show the actual quality of the service.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLOs and Error Budgets
&lt;/h2&gt;

&lt;p&gt;An SLI is a Service Level Indicator. It is a measurement.&lt;/p&gt;

&lt;p&gt;An SLO is a Service Level Objective. It is the target we want to meet.&lt;/p&gt;

&lt;p&gt;An error budget is the amount of unreliability we are allowed within a time window.&lt;/p&gt;

&lt;p&gt;For Anvila, our main availability SLO is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;99.5% successful probes over 30 days
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error budget calculation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(1 - 0.995) * 30 days = 0.15 days = 3.6 hours
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the service can be unavailable for about 3 hours and 36 minutes in a 30-day window before the error budget is fully consumed.&lt;/p&gt;
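
&lt;p&gt;The same calculation can be written as a small helper, which also makes it easy to see how much budget is left after some downtime. The 54 minutes of downtime used below is an invented figure, and the function names are ours.&lt;/p&gt;

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(slo=0.995, window_days=30)
remaining = budget_remaining(slo=0.995, window_days=30, downtime_minutes=54)
print(f"budget: {budget:.0f} minutes, remaining: {remaining:.0%}")
```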

&lt;p&gt;The SLO dashboard shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30-day availability SLI&lt;/li&gt;
&lt;li&gt;SLO target&lt;/li&gt;
&lt;li&gt;error budget remaining&lt;/li&gt;
&lt;li&gt;error budget time remaining&lt;/li&gt;
&lt;li&gt;burn rate&lt;/li&gt;
&lt;li&gt;7-day and 30-day compliance history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SLO and Error Budget dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfmuvz6ssqpw9xj0ye0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfmuvz6ssqpw9xj0ye0b.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Burn Rate Alerting
&lt;/h2&gt;

&lt;p&gt;Burn rate tells us how quickly the service is consuming its error budget.&lt;/p&gt;

&lt;p&gt;This is better than alerting on every individual failure because it only notifies the team when the error budget is actually at risk, which reduces alert fatigue.&lt;/p&gt;

&lt;p&gt;For example, a brief, minor failure may not need to wake everyone up. But if the service is burning through its error budget very quickly, the team should respond immediately.&lt;/p&gt;

&lt;p&gt;We configured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast Burn alert: critical&lt;/li&gt;
&lt;li&gt;Slow Burn alert: warning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Fast Burn alert means the system is consuming the error budget too quickly and needs immediate action.&lt;/p&gt;

&lt;p&gt;The Slow Burn alert means the error budget is being consumed steadily and the situation may become a serious incident if the trend continues.&lt;/p&gt;
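
&lt;p&gt;As a sketch of the idea, burn rate is the observed error ratio divided by the error ratio the SLO allows. A burn rate of 1 uses up the budget exactly at the end of the window. The thresholds of 14.4 and 6 below are assumptions borrowed from commonly cited SRE guidance for a 30-day window, not necessarily the values in our alert rules, and production rules usually check more than one time window.&lt;/p&gt;

```python
import math


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1 - slo)


def hours_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Time for a full budget to run out if the burn rate stays constant."""
    if rate == 0:
        return math.inf
    return window_days * 24 / rate


def classify(rate: float, fast_threshold: float = 14.4, slow_threshold: float = 6.0) -> str:
    """Map a burn rate to a severity (thresholds are illustrative assumptions)."""
    if rate >= fast_threshold:
        return "critical"
    if rate >= slow_threshold:
        return "warning"
    return "ok"


# Hypothetical observation: 8% of probes failing against a 99.5% SLO.
rate = burn_rate(error_ratio=0.08, slo=0.995)
print(f"burn rate: {rate:.1f}, severity: {classify(rate)}")
print(f"budget gone in about {hours_until_exhausted(rate):.0f} hours")
```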

&lt;h2&gt;
  
  
  DORA Metrics
&lt;/h2&gt;

&lt;p&gt;DORA metrics help connect engineering work to business outcomes.&lt;/p&gt;

&lt;p&gt;The four DORA metrics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment Frequency&lt;/li&gt;
&lt;li&gt;Lead Time for Changes&lt;/li&gt;
&lt;li&gt;Change Failure Rate&lt;/li&gt;
&lt;li&gt;Mean Time to Restore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics help answer questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How often do we deploy?&lt;/li&gt;
&lt;li&gt;How long does it take for code to reach production?&lt;/li&gt;
&lt;li&gt;How often do deployments fail?&lt;/li&gt;
&lt;li&gt;How quickly can we recover from incidents?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We built a GitHub Actions DORA exporter that exposes deployment workflow data to Prometheus.&lt;/p&gt;

&lt;p&gt;The dashboard shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deployment frequency&lt;/li&gt;
&lt;li&gt;lead time&lt;/li&gt;
&lt;li&gt;change failure rate&lt;/li&gt;
&lt;li&gt;MTTR&lt;/li&gt;
&lt;li&gt;deployment frequency classification&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DORA dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgy5jqcv8xbt64q4p58gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgy5jqcv8xbt64q4p58gb.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our current DORA implementation measures commit-to-workflow-completion lead time. A future improvement would be to break it into more detailed sub-intervals such as commit time, workflow trigger time, workflow completion time, and deployment confirmation time.&lt;/p&gt;
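
&lt;p&gt;To show the kind of arithmetic involved, here is a minimal sketch that derives three of the metrics from a list of workflow runs. The &lt;code&gt;WorkflowRun&lt;/code&gt; fields and the sample data are hypothetical and do not reflect the exporter's real data model. Change failure rate is simplified here to failed runs divided by all runs.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkflowRun:
    commit_time: datetime      # when the deployed commit was created
    completed_time: datetime   # when the deployment workflow finished
    conclusion: str            # "success" or "failure"


def dora_summary(runs: list[WorkflowRun], window_days: int) -> dict[str, float]:
    """Summarise deployment frequency, failure rate and lead time."""
    successes = [run for run in runs if run.conclusion == "success"]
    failures = [run for run in runs if run.conclusion == "failure"]
    # Lead time: commit to workflow completion, for successful runs only.
    lead_times = [
        (run.completed_time - run.commit_time).total_seconds()
        for run in successes
    ]
    return {
        "deployments_per_day": len(successes) / window_days,
        "change_failure_rate": len(failures) / len(runs) if runs else 0.0,
        "lead_time_seconds": sum(lead_times) / len(lead_times) if lead_times else 0.0,
    }


start = datetime(2026, 5, 1, 12, 0)
runs = [
    WorkflowRun(start, start + timedelta(minutes=10), "success"),
    WorkflowRun(start, start + timedelta(minutes=20), "success"),
    WorkflowRun(start, start + timedelta(minutes=30), "success"),
    WorkflowRun(start, start + timedelta(minutes=5), "failure"),
]
summary = dora_summary(runs, window_days=7)
print(summary)
```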

&lt;h2&gt;
  
  
  Alerting and Slack Notifications
&lt;/h2&gt;

&lt;p&gt;Alert rules are stored in version-controlled YAML files, not manually created in the Grafana UI.&lt;/p&gt;

&lt;p&gt;We configured alerts for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;host downtime&lt;/li&gt;
&lt;li&gt;high CPU&lt;/li&gt;
&lt;li&gt;high memory&lt;/li&gt;
&lt;li&gt;disk usage&lt;/li&gt;
&lt;li&gt;SLO burn rate&lt;/li&gt;
&lt;li&gt;change failure rate&lt;/li&gt;
&lt;li&gt;MTTR threshold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alertmanager handles routing and sends alerts to Slack.&lt;/p&gt;

&lt;p&gt;The Alertmanager configuration includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;route grouping by alert name, service, and severity&lt;/li&gt;
&lt;li&gt;warning and critical routes&lt;/li&gt;
&lt;li&gt;inhibition rules&lt;/li&gt;
&lt;li&gt;Slack receiver configuration&lt;/li&gt;
&lt;/ul&gt;
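
&lt;p&gt;The effect of grouping and inhibition can be sketched outside Alertmanager. In the example below, the grouping labels match the ones listed above, while the inhibition rule (a critical alert mutes warnings for the same service) and the sample alerts are assumptions for illustration. Real Alertmanager inhibition is configured with source and target matchers, which this sketch does not model.&lt;/p&gt;

```python
from collections import defaultdict

# Labels used as the grouping key, mirroring the route grouping described above.
GROUP_BY = ("alertname", "service", "severity")


def apply_inhibition(alerts: list[dict[str, str]]) -> list[dict[str, str]]:
    """Hypothetical rule: a critical alert mutes warnings for the same service."""
    critical_services = {a["service"] for a in alerts if a["severity"] == "critical"}
    return [
        a for a in alerts
        if not (a["severity"] == "warning" and a["service"] in critical_services)
    ]


def group_alerts(alerts: list[dict[str, str]]) -> dict[tuple[str, ...], list[dict[str, str]]]:
    """Bucket alerts so that each group becomes one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in GROUP_BY)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "HighCPU", "service": "anvila-api", "severity": "warning", "instance": "app-1"},
    {"alertname": "HighCPU", "service": "anvila-api", "severity": "warning", "instance": "app-2"},
    {"alertname": "HostDown", "service": "monitoring", "severity": "critical", "instance": "mon-1"},
    {"alertname": "HighMemory", "service": "monitoring", "severity": "warning", "instance": "mon-1"},
]
grouped = group_alerts(apply_inhibition(alerts))
for key, members in grouped.items():
    print(key, len(members))
```

&lt;p&gt;Here the two CPU warnings collapse into one notification, and the memory warning is muted because a critical alert is already firing for the same service.&lt;/p&gt;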

&lt;h3&gt;
  
  
  Alertmanager config
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0ixouxdi1gqxqt0th8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0ixouxdi1gqxqt0th8f.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also used a structured Slack template. The alert payload includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;alert name&lt;/li&gt;
&lt;li&gt;severity&lt;/li&gt;
&lt;li&gt;affected host or target&lt;/li&gt;
&lt;li&gt;current value&lt;/li&gt;
&lt;li&gt;dashboard link&lt;/li&gt;
&lt;li&gt;runbook link&lt;/li&gt;
&lt;li&gt;firing or resolved status&lt;/li&gt;
&lt;/ul&gt;
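
&lt;p&gt;To illustrate how these fields come together in a message, here is a small sketch that renders one alert as text. The dictionary keys, the message layout, and the example URLs are all hypothetical. Our actual notifications are produced by an Alertmanager template, not by Python code.&lt;/p&gt;

```python
def format_slack_message(alert: dict[str, str]) -> str:
    """Render one alert as a plain-text Slack message (illustrative layout)."""
    status = alert["status"].upper()  # "FIRING" or "RESOLVED"
    return "\n".join([
        f"[{status}] {alert['alertname']} ({alert['severity']})",
        f"Target: {alert['instance']}",
        f"Current value: {alert['value']}",
        f"Dashboard: {alert['dashboard_url']}",
        f"Runbook: {alert['runbook_url']}",
    ])


# Hypothetical alert payload with placeholder links.
example_alert = {
    "status": "firing",
    "alertname": "HighCPU",
    "severity": "warning",
    "instance": "monitoring-server",
    "value": "91%",
    "dashboard_url": "https://grafana.example.com/d/node",
    "runbook_url": "https://example.com/runbooks/high-cpu",
}
message = format_slack_message(example_alert)
print(message)
```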

&lt;h3&gt;
  
  
  Slack template
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9oh0vfs9xfbakyq0nhwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9oh0vfs9xfbakyq0nhwh.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of a firing and resolved Slack notification:&lt;/p&gt;

&lt;h3&gt;
  
  
  Slack firing and resolved alert
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mfm70esbf4n5fty0tv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mfm70esbf4n5fty0tv3.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Runbooks and Incident Management
&lt;/h2&gt;

&lt;p&gt;Every alert needs a runbook. A runbook explains what the alert means and what to do first.&lt;/p&gt;

&lt;p&gt;Our runbooks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what the alert means&lt;/li&gt;
&lt;li&gt;likely causes&lt;/li&gt;
&lt;li&gt;first investigation steps&lt;/li&gt;
&lt;li&gt;how to resolve it&lt;/li&gt;
&lt;li&gt;when to roll back&lt;/li&gt;
&lt;li&gt;when to escalate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example runbook:&lt;/p&gt;

&lt;h3&gt;
  
  
  SLO burn rate runbook
&lt;/h3&gt;

&lt;p&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11y5o0hk8zhbnq5ykvhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11y5o0hk8zhbnq5ykvhm.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also documented a blameless post-incident review for the simulated latency incident.&lt;/p&gt;

&lt;p&gt;The goal of a blameless PIR is not to blame a person. The goal is to understand what happened, what the impact was, what went well, what did not go well, and what should be improved.&lt;/p&gt;
&lt;h2&gt;
  
  
  Game Day Testing
&lt;/h2&gt;

&lt;p&gt;A Game Day is an exercise in which we intentionally create controlled failures to test whether our monitoring and response system works.&lt;/p&gt;

&lt;p&gt;We ran three scenarios.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 1: Deployment Failure
&lt;/h3&gt;

&lt;p&gt;We created a temporary PR with an intentionally failing GitHub Actions workflow.&lt;/p&gt;

&lt;p&gt;The goal was to prove that a deployment or pipeline failure can be detected without touching production.&lt;/p&gt;

&lt;p&gt;The PR was closed without merging after screenshots were captured.&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day deployment failure PR
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy258ugy7gsskyioqsl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy258ugy7gsskyioqsl3.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day deployment failure check
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hblklpm7oz471w30ku3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hblklpm7oz471w30ku3.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 2: Latency Injection
&lt;/h3&gt;

&lt;p&gt;During the latency Game Day, we temporarily added a 2-second delay to the staging root endpoint and generated test requests. The evidence below shows the resulting Tempo traces and the recovery after the delay was removed.&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day latency traces
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2955gcm0spcpjuixm5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2955gcm0spcpjuixm5d.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day latency recovery
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9ovatvpqvp2geec4o0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9ovatvpqvp2geec4o0f.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 3: Resource Pressure
&lt;/h3&gt;

&lt;p&gt;We created CPU pressure on the monitoring server using controlled background processes.&lt;/p&gt;

&lt;p&gt;This triggered a CPU warning alert in Slack. After stopping the pressure, we confirmed the resolved alert was also sent.&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day resource pressure trigger
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi704l8a6fd2hgaqwfmk4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi704l8a6fd2hgaqwfmk4.png" alt=" " width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day resource pressure Slack alert
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm15yy0iujs8vurm688wd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm15yy0iujs8vurm688wd.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day resource pressure recovery
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35lxsbkd4a3xdvfrtcw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35lxsbkd4a3xdvfrtcw7.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Toil Identified and Automation
&lt;/h2&gt;

&lt;p&gt;Toil is repetitive manual work that can be automated.&lt;/p&gt;

&lt;p&gt;We identified several examples of toil:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Manual Dashboard Creation
&lt;/h3&gt;

&lt;p&gt;Creating dashboards manually in Grafana is slow and hard to reproduce.&lt;/p&gt;

&lt;p&gt;Automation: All dashboards are stored as JSON and provisioned automatically.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Manual Monitoring Server Setup
&lt;/h3&gt;

&lt;p&gt;Installing each service manually would be time-consuming and error-prone.&lt;/p&gt;

&lt;p&gt;Automation: Terraform creates the monitoring server, and systemd installation scripts configure the observability stack.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Vague Alert Investigation
&lt;/h3&gt;

&lt;p&gt;If an alert only says, "CPU high", the engineer still has to search for the server, dashboard, and runbook.&lt;/p&gt;

&lt;p&gt;Automation: Alertmanager Slack messages include service, host, metric value, dashboard link, and runbook link.&lt;/p&gt;
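
&lt;p&gt;Alertmanager builds its Slack messages from its own templates, which the post does not show. The Python sketch below only illustrates which fields an actionable message might carry. The &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;labels&lt;/code&gt;, and &lt;code&gt;annotations&lt;/code&gt; keys follow the shape of an Alertmanager alert, while the specific label and annotation names (&lt;code&gt;service&lt;/code&gt;, &lt;code&gt;instance&lt;/code&gt;, &lt;code&gt;value&lt;/code&gt;, &lt;code&gt;dashboard_url&lt;/code&gt;, &lt;code&gt;runbook_url&lt;/code&gt;) are assumptions.&lt;/p&gt;

```python
def format_alert(alert: dict) -> str:
    """Build a plain-text Slack message from one alert's labels and annotations."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "unknown").upper()
    lines = [
        f"[{status}] {labels.get('alertname', 'unnamed alert')}",
        f"Service: {labels.get('service', 'n/a')}",
        f"Host: {labels.get('instance', 'n/a')}",
        f"Value: {annotations.get('value', 'n/a')}",
        f"Dashboard: {annotations.get('dashboard_url', 'n/a')}",
        f"Runbook: {annotations.get('runbook_url', 'n/a')}",
    ]
    return "\n".join(lines)
```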
&lt;h3&gt;
  
  
  4. Manual CI/CD Metrics Collection
&lt;/h3&gt;

&lt;p&gt;Checking GitHub Actions manually does not scale.&lt;/p&gt;

&lt;p&gt;Automation: The DORA exporter pulls GitHub Actions data and exposes it as Prometheus metrics.&lt;/p&gt;
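
&lt;p&gt;As a minimal sketch of the exporter idea, the function below turns a list of workflow run records into Prometheus text format. The record keys (&lt;code&gt;conclusion&lt;/code&gt;, &lt;code&gt;commit_at&lt;/code&gt;, &lt;code&gt;completed_at&lt;/code&gt;) and the metric names are simplified assumptions, not the real exporter's schema. Lead time is measured from commit to workflow completion, matching the simplification described in this post.&lt;/p&gt;

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-05-18T10:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def render_dora_metrics(runs: list) -> str:
    """Render run counts and mean lead time in Prometheus text format."""
    succeeded = [r for r in runs if r["conclusion"] == "success"]
    failed = [r for r in runs if r["conclusion"] == "failure"]
    # Lead time here is commit-to-workflow-completion, in seconds.
    lead_times = [
        (parse_ts(r["completed_at"]) - parse_ts(r["commit_at"])).total_seconds()
        for r in succeeded
    ]
    mean_lead = sum(lead_times) / len(lead_times) if lead_times else 0.0
    lines = [
        "# TYPE dora_deployments gauge",
        f'dora_deployments{{result="success"}} {len(succeeded)}',
        f'dora_deployments{{result="failure"}} {len(failed)}',
        "# TYPE dora_lead_time_seconds gauge",
        f"dora_lead_time_seconds {mean_lead}",
    ]
    return "\n".join(lines) + "\n"
```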
&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;This project helped us understand that observability is more than installing tools.&lt;/p&gt;

&lt;p&gt;The tools are only useful when they are connected to reliability goals.&lt;/p&gt;

&lt;p&gt;Prometheus helped us collect and alert on metrics. Loki helped us inspect logs. Tempo helped us understand request traces. Grafana helped us bring everything together. Alertmanager helped us route actionable alerts to Slack.&lt;/p&gt;

&lt;p&gt;We also learned that good reliability engineering requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clear SLIs&lt;/li&gt;
&lt;li&gt;realistic SLOs&lt;/li&gt;
&lt;li&gt;error budgets&lt;/li&gt;
&lt;li&gt;runbooks&lt;/li&gt;
&lt;li&gt;incident reviews&lt;/li&gt;
&lt;li&gt;controlled failure testing&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Current Limitations and Next Steps
&lt;/h2&gt;

&lt;p&gt;The platform is functional, but there are still improvements we can make.&lt;/p&gt;

&lt;p&gt;Current limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;App-level &lt;code&gt;/metrics&lt;/code&gt; is currently verified only on staging.&lt;/li&gt;
&lt;li&gt;Production is monitored through Blackbox probes and shared host-level metrics.&lt;/li&gt;
&lt;li&gt;DORA lead time is currently simplified to commit-to-workflow-completion.&lt;/li&gt;
&lt;li&gt;A follow-up PR was opened to normalize unmatched route labels in Prometheus metrics and reduce high-cardinality risk.&lt;/li&gt;
&lt;/ul&gt;
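
&lt;p&gt;The route label normalization mentioned in the limitations can be sketched in a few lines. This is an illustration of the idea rather than the contents of the follow-up PR; the route set and the &lt;code&gt;unmatched&lt;/code&gt; placeholder are hypothetical.&lt;/p&gt;

```python
# Hypothetical set of routes the application actually registers.
KNOWN_ROUTES = frozenset({"/", "/healthz", "/metrics"})


def route_label(path: str, known_routes=KNOWN_ROUTES) -> str:
    """Return the path for registered routes, else one fixed placeholder.

    Collapsing every unknown path into a single value keeps scanners and
    typos from creating a new time series per random URL.
    """
    return path if path in known_routes else "unmatched"
```

&lt;p&gt;With this in place, a request for an arbitrary path such as &lt;code&gt;/wp-login.php&lt;/code&gt; is counted under one shared label value instead of adding a new series.&lt;/p&gt;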

&lt;p&gt;Next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;promote app-level metrics to production after staging validation&lt;/li&gt;
&lt;li&gt;merge the metrics label normalization improvement&lt;/li&gt;
&lt;li&gt;add more detailed DORA lead-time sub-intervals&lt;/li&gt;
&lt;li&gt;add deployment annotations to Grafana dashboards&lt;/li&gt;
&lt;li&gt;continue reviewing SLOs as the service matures&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For this task, we built a complete observability platform around the Anvila API using the LGTM stack.&lt;/p&gt;

&lt;p&gt;We deployed the monitoring stack with Terraform and systemd, collected metrics, logs, and traces, built dashboards, configured Slack alerts, wrote runbooks, and tested the system through Game Day simulations.&lt;/p&gt;

&lt;p&gt;The biggest lesson is that observability is not just about knowing when something is broken. It is about giving the team enough context to understand the problem, respond quickly, and improve the system after every incident.&lt;/p&gt;

&lt;p&gt;Repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://github.com/Adeolu1024/anvila-observability-platform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Evidence screenshots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://drive.google.com/drive/folders/1v_DiT6XQtq0iJtn5JqxfZecAmxhCdd82?usp=sharing


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>devops</category>
      <category>graphana</category>
      <category>sre</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Building SwiftDeploy: A Declarative Infrastructure CLI with Observability and Policy Enforcement</title>
      <dc:creator>Adeolu </dc:creator>
      <pubDate>Wed, 06 May 2026 20:54:03 +0000</pubDate>
      <link>https://dev.to/adeolu102/building-swiftdeploy-a-declarative-infrastructure-cli-with-observability-and-policy-enforcement-4g8o</link>
      <guid>https://dev.to/adeolu102/building-swiftdeploy-a-declarative-infrastructure-cli-with-observability-and-policy-enforcement-4g8o</guid>
      <description>&lt;p&gt;What Is This Project?&lt;/p&gt;

&lt;p&gt;SwiftDeploy is a command-line tool that automatically sets up and manages web application deployments. Instead of manually configuring Docker containers, Nginx, and monitoring, you write one file (&lt;code&gt;manifest.yaml&lt;/code&gt;) that describes what you want, and the tool builds everything for you.&lt;/p&gt;

&lt;p&gt;The project was built in two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 4A: Basic infrastructure generation and container management&lt;/li&gt;
&lt;li&gt;Stage 4B: Monitoring, policy enforcement, and audit logging&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The Core Idea: Declarative Configuration&lt;/p&gt;

&lt;p&gt;In traditional DevOps, you manually write configuration files for each service. With SwiftDeploy, you write a single manifest file, and the tool generates all the configuration files automatically.&lt;/p&gt;

&lt;p&gt;manifest.yaml (the only file you edit manually):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-keeds-api:v1.0.0&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-service&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stable&lt;/span&gt;

&lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:alpine&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;proxy_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;

&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-net&lt;/span&gt;
  &lt;span class="na"&gt;driver_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From this one file, SwiftDeploy generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nginx.conf&lt;/code&gt; (web server configuration)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker-compose.yml&lt;/code&gt; (container orchestration)&lt;/li&gt;
&lt;li&gt;All the settings for monitoring and policy checks&lt;/li&gt;
&lt;/ul&gt;
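
&lt;p&gt;To make the generation step concrete, here is a minimal sketch that renders an &lt;code&gt;nginx.conf&lt;/code&gt; from the manifest values shown above. It assumes the manifest has already been parsed into a dictionary, and the template and function name are illustrative rather than SwiftDeploy's actual code.&lt;/p&gt;

```python
# Doubled braces are literal braces in str.format templates.
NGINX_TEMPLATE = """\
events {{}}

http {{
    server {{
        listen {listen_port};

        location / {{
            proxy_pass http://{service_name}:{service_port};
            proxy_read_timeout {proxy_timeout};
        }}
    }}
}}
"""


def render_nginx_conf(manifest: dict) -> str:
    """Fill the nginx template with values taken from the parsed manifest."""
    return NGINX_TEMPLATE.format(
        listen_port=manifest["nginx"]["port"],
        service_name=manifest["services"]["name"],
        service_port=manifest["services"]["port"],
        proxy_timeout=manifest["nginx"]["proxy_timeout"],
    )
```

&lt;p&gt;Because the template is plain text with named placeholders, changing a port in &lt;code&gt;manifest.yaml&lt;/code&gt; and re-running the generator is enough to update the proxy configuration.&lt;/p&gt;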




&lt;p&gt;How the Tool Works&lt;/p&gt;

&lt;p&gt;The CLI tool (&lt;code&gt;swiftdeploy&lt;/code&gt;) has several commands:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;init&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reads manifest.yaml and generates nginx.conf + docker-compose.yml&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;validate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Checks if everything is ready for deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deploy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Starts all containers and waits for them to be healthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;promote canary/stable&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switches between stable and canary modes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shows a live dashboard with metrics and policy compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;audit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generates a report of all events and policy violations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;teardown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stops and removes all containers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;Stage 4A: The Foundation&lt;/p&gt;

&lt;p&gt;API Service&lt;/p&gt;

&lt;p&gt;The API service is a Python Flask application that serves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /&lt;/code&gt; — Welcome message with current mode and version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /healthz&lt;/code&gt; — Health check endpoint&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /chaos&lt;/code&gt; — Simulates problems for testing (only in canary mode)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nginx Proxy&lt;/p&gt;

&lt;p&gt;Nginx acts as a reverse proxy, routing all traffic to the API service. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Listens on port 8080 (configurable)&lt;/li&gt;
&lt;li&gt;Forwards requests to the API service&lt;/li&gt;
&lt;li&gt;Returns JSON error responses for 502, 503, 504 errors&lt;/li&gt;
&lt;li&gt;Logs all requests in a specific format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker Compose&lt;/p&gt;

&lt;p&gt;Docker Compose manages all containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API service (your application)&lt;/li&gt;
&lt;li&gt;Nginx (web server/proxy)&lt;/li&gt;
&lt;li&gt;OPA (policy engine, added in Stage 4B)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Stage 4B: Observability and Policy Enforcement&lt;/p&gt;

&lt;p&gt;The /metrics Endpoint&lt;/p&gt;

&lt;p&gt;The API service now exposes a &lt;code&gt;/metrics&lt;/code&gt; endpoint that reports statistics in Prometheus format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/healthz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="n"&gt;http_request_duration_seconds_bucket&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;le&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"0.1"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;
&lt;span class="n"&gt;app_uptime_seconds&lt;/span&gt; &lt;span class="mi"&gt;847&lt;/span&gt;
&lt;span class="n"&gt;app_mode&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;chaos_active&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These metrics tell you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many requests have been made&lt;/li&gt;
&lt;li&gt;How fast responses are&lt;/li&gt;
&lt;li&gt;How long the app has been running&lt;/li&gt;
&lt;li&gt;Whether you're in stable or canary mode&lt;/li&gt;
&lt;li&gt;Whether chaos testing is active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OPA: The Policy Engine&lt;/p&gt;

&lt;p&gt;OPA (Open Policy Agent) is a separate container that acts like a security guard. Before you can deploy or promote, the CLI asks OPA: "Is it safe?"&lt;/p&gt;

&lt;p&gt;Why use OPA instead of checking directly in the CLI?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Policies are separate from code — easier to update&lt;/li&gt;
&lt;li&gt;If OPA crashes, the CLI still works (just warns you)&lt;/li&gt;
&lt;li&gt;OPA is not accessible from the internet (security)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Two Policies&lt;/p&gt;

&lt;p&gt;Infrastructure Policy (checks before deploy):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is there enough disk space? (must be &amp;gt; 10GB)&lt;/li&gt;
&lt;li&gt;Is the CPU overloaded? (CPU load must be &amp;lt; 2.0)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Canary Safety Policy (checks before promoting to canary):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the error rate too high? (must be &amp;lt; 1%)&lt;/li&gt;
&lt;li&gt;Is the response time too slow? (P99 must be &amp;lt; 500ms)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data-Driven Thresholds&lt;/p&gt;

&lt;p&gt;The actual numbers (10GB, 2.0, 1%, 500ms) are stored in a separate JSON file, not in the policy code. This means you can change the limits without modifying the policy logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;thresholds.json:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"infrastructure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"min_disk_gb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_cpu_load"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"canary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_error_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_p99_latency_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Status Dashboard&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;swiftdeploy status&lt;/code&gt; command shows a live dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╔═══════════════════════════════════════╗
║     SwiftDeploy Status Dashboard      ║
╠═══════════════════════════════════════╣
║ Mode: canary                         ║
║ Chaos: none                          ║
║ Req/s: 0.98                          ║
║ P99 Latency: 5ms                     ║
║ Error Rate: 0.00%                    ║
║ Uptime: 133s                         ║
╠═══════════════════════════════════════╣
║ Policy Compliance                    ║
║   Infrastructure: PASS               ║
║   Canary Safety:  PASS               ║
╚═══════════════════════════════════════╝
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every time the dashboard refreshes, it saves the data to &lt;code&gt;history.jsonl&lt;/code&gt; for the audit trail.&lt;/p&gt;

&lt;p&gt;The Audit Report&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;swiftdeploy audit&lt;/code&gt; command reads &lt;code&gt;history.jsonl&lt;/code&gt; and generates &lt;code&gt;audit_report.md&lt;/code&gt; with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A timeline of all events (mode changes, status updates)&lt;/li&gt;
&lt;li&gt;A list of policy violations (when checks failed)&lt;/li&gt;
&lt;/ul&gt;
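
&lt;p&gt;A rough sketch of the history and audit steps follows. The snapshot fields (&lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;mode&lt;/code&gt;, &lt;code&gt;violations&lt;/code&gt;) and the report layout are assumptions made for illustration; the post does not show the real schema.&lt;/p&gt;

```python
import json
from pathlib import Path


def append_snapshot(history_path: Path, snapshot: dict) -> None:
    """Append one status snapshot as a single JSON line."""
    with history_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(snapshot) + "\n")


def build_audit_report(history_path: Path) -> str:
    """Turn the JSONL history into a Markdown timeline plus a violations list."""
    raw_lines = history_path.read_text(encoding="utf-8").splitlines()
    events = [json.loads(line) for line in raw_lines if line.strip()]

    lines = ["# Audit Report", "", "## Timeline"]
    for event in events:
        lines.append(f"- {event['timestamp']} mode={event['mode']}")

    lines += ["", "## Policy Violations"]
    violations = [
        (event["timestamp"], message)
        for event in events
        for message in event.get("violations", [])
    ]
    for timestamp, message in violations:
        lines.append(f"- {timestamp}: {message}")
    if not violations:
        lines.append("- none recorded")
    return "\n".join(lines) + "\n"
```

&lt;p&gt;An append-only JSON Lines file suits this job because each refresh writes one independent line, so a partial write cannot corrupt earlier entries.&lt;/p&gt;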




&lt;p&gt;Bugs We Fixed&lt;/p&gt;

&lt;p&gt;Bug 1: OPA Crashed on Startup&lt;/p&gt;

&lt;p&gt;Problem: OPA wouldn't start because of a "conflicting rules" error.&lt;/p&gt;

&lt;p&gt;Cause: We wrote &lt;code&gt;default deny := []&lt;/code&gt; in the Rego file, which conflicted with &lt;code&gt;deny contains msg if { ... }&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Fix: Removed the &lt;code&gt;default deny := []&lt;/code&gt; line. A rule defined with &lt;code&gt;contains&lt;/code&gt; already evaluates to an empty set when none of its conditions match, so no default is needed.&lt;/p&gt;

&lt;p&gt;Bug 2: OPA Couldn't Find Threshold Values&lt;/p&gt;

&lt;p&gt;Problem: OPA loaded the policy files but couldn't find the threshold numbers.&lt;/p&gt;

&lt;p&gt;Cause: The JSON file was in the wrong directory. OPA loads files based on their path structure.&lt;/p&gt;

&lt;p&gt;Fix: Moved &lt;code&gt;thresholds.json&lt;/code&gt; into a &lt;code&gt;swiftdeploy/&lt;/code&gt; subdirectory so OPA could find it at the correct data path.&lt;/p&gt;

&lt;p&gt;Bug 3: Status Dashboard Showed "FAIL" Incorrectly&lt;/p&gt;

&lt;p&gt;Problem: The dashboard showed "Infrastructure: FAIL" and "Canary Safety: FAIL" even when everything was within limits.&lt;/p&gt;

&lt;p&gt;Cause: The CLI didn't send a &lt;code&gt;timestamp&lt;/code&gt; field to OPA. The policy rules need &lt;code&gt;input.timestamp&lt;/code&gt; to work. Without it, the rules failed, and the CLI defaulted to "FAIL".&lt;/p&gt;

&lt;p&gt;Fix: Added &lt;code&gt;timestamp&lt;/code&gt; to all OPA queries.&lt;/p&gt;

&lt;p&gt;Bug 4: Nginx Couldn't Find the API Service&lt;/p&gt;

&lt;p&gt;Problem: Nginx returned 502 errors saying it couldn't resolve the API service hostname.&lt;/p&gt;

&lt;p&gt;Cause: Nginx tried to find the API service at startup, but the container wasn't running yet.&lt;/p&gt;

&lt;p&gt;Fix: Added Docker's internal DNS resolver (&lt;code&gt;127.0.0.11&lt;/code&gt;) and used a variable for the proxy address. This tells Nginx to look up the hostname when a request comes in, not at startup.&lt;/p&gt;
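
&lt;p&gt;The fix can be sketched as a small generator function that emits the relevant part of the Nginx configuration. The variable name &lt;code&gt;$upstream&lt;/code&gt; and the &lt;code&gt;valid=10s&lt;/code&gt; cache time are assumptions; only the resolver address and the use of a variable come from the description above.&lt;/p&gt;

```python
def render_proxy_block(service_name: str, service_port: int) -> str:
    """Return nginx directives that resolve the upstream on each request."""
    lines = [
        # Docker's embedded DNS server; valid=10s is an assumed cache time.
        "resolver 127.0.0.11 valid=10s;",
        "",
        "location / {",
        # Using a variable makes nginx resolve the name at request time
        # instead of failing at startup when the container is not up yet.
        f"    set $upstream http://{service_name}:{service_port};",
        "    proxy_pass $upstream;",
        "}",
    ]
    return "\n".join(lines) + "\n"
```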

&lt;p&gt;Bug 5: Container Didn't Update After Promoting&lt;/p&gt;

&lt;p&gt;Problem: After switching to canary mode, the container was still running in stable mode.&lt;/p&gt;

&lt;p&gt;Cause: Using &lt;code&gt;docker compose restart&lt;/code&gt; doesn't reload environment variables from the updated docker-compose.yml.&lt;/p&gt;

&lt;p&gt;Fix: Changed to &lt;code&gt;docker compose up -d --no-deps &amp;lt;service&amp;gt;&lt;/code&gt;, which recreates the container with new settings.&lt;/p&gt;
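
&lt;p&gt;Inside a Python CLI, the recreate step might be wrapped as shown below. This is a sketch with hypothetical function names; only the &lt;code&gt;docker compose&lt;/code&gt; arguments come from the fix described above.&lt;/p&gt;

```python
import subprocess


def recreate_command(service: str) -> list:
    """Build the argv that recreates one service without touching its dependencies."""
    return ["docker", "compose", "up", "-d", "--no-deps", service]


def recreate_service(service: str) -> None:
    """Run the recreate command and raise if Docker reports a failure."""
    subprocess.run(recreate_command(service), check=True)
```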

&lt;p&gt;Bug 6: Nginx Permission Denied&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Nginx failed to start with "Permission denied" errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause&lt;/strong&gt;: We set &lt;code&gt;user: nginx&lt;/code&gt; and removed all Linux capabilities, which prevented Nginx from creating necessary directories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Removed the explicit user setting. The official Nginx image handles user switching internally.&lt;/p&gt;




&lt;p&gt;Key Design Decisions&lt;/p&gt;

&lt;p&gt;Why a Separate OPA Container?&lt;/p&gt;

&lt;p&gt;The task required: "The CLI must not make any allow/deny decision itself."&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The CLI asks OPA for permission before every deploy/promote&lt;/li&gt;
&lt;li&gt;OPA returns "allowed" or "denied" with a reason&lt;/li&gt;
&lt;li&gt;The CLI never makes the decision itself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Policies can be updated without changing the CLI&lt;/li&gt;
&lt;li&gt;If OPA is down, the CLI warns but continues&lt;/li&gt;
&lt;li&gt;All decisions are logged with reasoning&lt;/li&gt;
&lt;/ul&gt;
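
&lt;p&gt;A minimal sketch of how a CLI could query OPA over its REST Data API is shown below. The base URL assumes OPA's default server port, and the policy path passed in (for example &lt;code&gt;swiftdeploy/infrastructure/deny&lt;/code&gt;) is hypothetical because the post does not show the package names. The fallback branch mirrors the warn-and-continue behaviour described above.&lt;/p&gt;

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181"  # assumption: OPA server on its default port


def build_request(policy_path: str, input_doc: dict) -> urllib.request.Request:
    """Create a POST request for OPA's Data API with the input document."""
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        f"{OPA_URL}/v1/data/{policy_path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def interpret(response_doc: dict) -> tuple:
    """Map OPA's deny set to (allowed, reasons); an empty set means allowed."""
    reasons = list(response_doc.get("result", []))
    return (not reasons, reasons)


def check_policy(policy_path: str, input_doc: dict) -> tuple:
    """Ask OPA for a decision, warning and continuing if OPA is unreachable."""
    try:
        request = build_request(policy_path, input_doc)
        with urllib.request.urlopen(request, timeout=3) as response:
            return interpret(json.load(response))
    except OSError as exc:  # covers connection errors and timeouts
        print(f"warning: OPA unreachable ({exc}); continuing without a decision")
        return (True, [])
```

&lt;p&gt;Continuing when OPA is unreachable is a trade-off: deployments are not blocked by a policy engine outage, but they also run without a policy decision during that window.&lt;/p&gt;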

&lt;p&gt;Why Data-Driven Thresholds?&lt;/p&gt;

&lt;p&gt;The task required: "Threshold values must not be hardcoded inside Rego files."&lt;/p&gt;

&lt;p&gt;This means the numbers (10GB, 2.0, 1%, 500ms) are in a separate JSON file, not in the policy code. This makes it easy to change limits without touching the policy logic.&lt;/p&gt;

&lt;p&gt;Why Separate Policy Files?&lt;/p&gt;

&lt;p&gt;The task required: "Organise policies by domain. Each domain owns exactly one question."&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;infrastructure.rego&lt;/code&gt; only checks disk and CPU&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;canary.rego&lt;/code&gt; only checks error rate and latency&lt;/li&gt;
&lt;li&gt;Changing one policy never requires changing another&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;How the Pieces Fit Together&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User runs: swiftdeploy deploy
    │
    ▼
CLI gets host stats (disk, CPU)
    │
    ▼
CLI asks OPA: "Is it safe to deploy?"
    │
    ▼
OPA checks infrastructure policy
    │
    ├── If safe → Start containers
    │
    └── If not safe → Block with reason
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
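
&lt;p&gt;The "CLI gets host stats" step in the flow above could look roughly like this. The field names are assumptions, and the CPU figure is taken to be the 1-minute load average, which is one plausible reading of the 2.0 threshold. A &lt;code&gt;timestamp&lt;/code&gt; is included because the policies expect &lt;code&gt;input.timestamp&lt;/code&gt;.&lt;/p&gt;

```python
import os
import shutil
from datetime import datetime, timezone


def collect_host_stats(path: str = "/") -> dict:
    """Gather free disk space, 1-minute load average, and a UTC timestamp."""
    free_bytes = shutil.disk_usage(path).free
    # os.getloadavg is available on Unix-like systems only.
    load_1m, _, _ = os.getloadavg()
    return {
        "disk_free_gb": round(free_bytes / 1024**3, 2),
        "cpu_load": round(load_1m, 2),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```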





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User runs: swiftdeploy promote canary
    │
    ▼
CLI scrapes /metrics endpoint
    │
    ▼
CLI calculates error rate and P99 latency
    │
    ▼
CLI asks OPA: "Is it safe to promote?"
    │
    ▼
OPA checks canary safety policy
    │
    ├── If safe → Switch to canary mode
    │
    └── If not safe → Block with reason
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
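
&lt;p&gt;The "calculates error rate and P99 latency" step can be sketched as follows. The parser is deliberately simple and assumes label values contain no commas. The P99 figure is the upper bound of the first histogram bucket that covers 99% of requests, which is a coarse estimate and may differ from how SwiftDeploy computes it. Counting only 5xx responses as errors is also an assumption.&lt;/p&gt;

```python
def parse_samples(text: str) -> list:
    """Parse 'name{labels} value' lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        head, _, value = line.rpartition(" ")
        name, _, label_part = head.partition("{")
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if "=" in pair:
                key, _, raw = pair.partition("=")
                labels[key.strip()] = raw.strip().strip('"')
        samples.append((name.strip(), labels, float(value)))
    return samples


def error_rate(samples: list) -> float:
    """Share of requests whose status code starts with 5."""
    requests = [(l, v) for n, l, v in samples if n == "http_requests_total"]
    total = sum(v for _, v in requests)
    errors = sum(v for l, v in requests if l.get("status_code", "").startswith("5"))
    return errors / total if total else 0.0


def p99_upper_bound_ms(samples: list) -> float:
    """Upper bound (ms) of the first bucket holding at least 99% of requests."""
    buckets = sorted(
        (float(labels["le"]), value)
        for name, labels, value in samples
        if name == "http_request_duration_seconds_bucket"
    )
    if not buckets or buckets[-1][1] == 0:
        return 0.0
    total = buckets[-1][1]
    for bound, count in buckets:
        if count * 100 >= 99 * total:
            return bound * 1000
    return buckets[-1][0] * 1000
```

&lt;p&gt;The two resulting numbers are what the CLI would send to OPA as input for the canary safety policy; the comparison against the thresholds stays in the policy, not in the CLI.&lt;/p&gt;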






&lt;p&gt;Lessons Learned&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;One file can drive everything: A single manifest file can generate all the configuration files you need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Policies should be separate from code: Using OPA makes policies easier to update and test.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Always handle failures gracefully: If OPA is down, the CLI warns but continues working.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple templates work fine: You don't need complex template engines for configuration files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Container recreation vs restart: Restarting a container doesn't reload environment variables. You need to recreate it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Docker DNS is important: Nginx needs to know how to find containers by name, which requires Docker's internal DNS resolver.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Summary&lt;/p&gt;

&lt;p&gt;SwiftDeploy is a tool that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Takes a single manifest file as input&lt;/li&gt;
&lt;li&gt;Generates all configuration files automatically&lt;/li&gt;
&lt;li&gt;Manages container lifecycle (deploy, promote, teardown)&lt;/li&gt;
&lt;li&gt;Enforces safety policies via OPA before deploy/promote&lt;/li&gt;
&lt;li&gt;Provides monitoring via /metrics endpoint&lt;/li&gt;
&lt;li&gt;Tracks all events in an audit log&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key innovation is that everything is driven by one file, and all safety checks happen automatically before any deployment action.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article documents the development of SwiftDeploy as part of the HNG Internship DevOps Track, Stage 4.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>beginners</category>
      <category>python</category>
    </item>
  </channel>
</rss>
