<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adeolu </title>
    <description>The latest articles on DEV Community by Adeolu (@adeolu102).</description>
    <link>https://dev.to/adeolu102</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916643%2Fa64f5e39-d396-4957-9103-13cf50c27649.jpg</url>
      <title>DEV Community: Adeolu </title>
      <link>https://dev.to/adeolu102</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/adeolu102"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Adeolu </dc:creator>
      <pubDate>Wed, 10 Jun 2026 20:11:36 +0000</pubDate>
      <link>https://dev.to/adeolu102/-216g</link>
      <guid>https://dev.to/adeolu102/-216g</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/adeolu102/engineering-design-document-reusable-observability-platform-v2-54gb" class="crayons-story__hidden-navigation-link"&gt;Engineering Design Document: Reusable Observability Platform V2&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/adeolu102" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916643%2Fa64f5e39-d396-4957-9103-13cf50c27649.jpg" alt="adeolu102 profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/adeolu102" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Adeolu 
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Adeolu 
                
              
              &lt;div id="story-author-preview-content-3868216" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/adeolu102" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916643%2Fa64f5e39-d396-4957-9103-13cf50c27649.jpg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Adeolu &lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/adeolu102/engineering-design-document-reusable-observability-platform-v2-54gb" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jun 10&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/adeolu102/engineering-design-document-reusable-observability-platform-v2-54gb" id="article-link-3868216"&gt;
          Engineering Design Document: Reusable Observability Platform V2
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/observability"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;observability&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/architecture"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;architecture&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/sre"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;sre&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/adeolu102/engineering-design-document-reusable-observability-platform-v2-54gb" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;1&lt;span class="hidden s:inline"&gt;&amp;nbsp;reaction&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/adeolu102/engineering-design-document-reusable-observability-platform-v2-54gb#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              

              1&lt;span class="hidden s:inline"&gt;&amp;nbsp;comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            19 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Engineering Design Document: Reusable Observability Platform V2</title>
      <dc:creator>Adeolu </dc:creator>
      <pubDate>Wed, 10 Jun 2026 20:11:24 +0000</pubDate>
      <link>https://dev.to/adeolu102/engineering-design-document-reusable-observability-platform-v2-54gb</link>
      <guid>https://dev.to/adeolu102/engineering-design-document-reusable-observability-platform-v2-54gb</guid>
      <description>&lt;h3&gt;
  
  
  A production-focused redesign of a Stage 6 LGTM observability platform, moving from a single-service Anvila monitoring setup to a reusable, secure, highly available observability platform.
&lt;/h3&gt;

&lt;h2&gt;
  
  
  Executive Summary
&lt;/h2&gt;

&lt;p&gt;This document proposes V2 of the observability platform I built during Stage 6. V1 was validated using the Anvila API as the first monitored service, so many names, dashboards, alerts, and targets were Anvila-specific. However, the real architectural idea was broader than Anvila: build a reusable internal monitoring and reliability platform that could collect metrics, logs, traces, service-level objectives, DORA metrics, and alerts for production applications.&lt;/p&gt;

&lt;p&gt;In this document, I treat Anvila as the first customer of the platform, not as the only possible customer. V1 proved that the stack worked for one real API. V2 redesigns it into a highly available, secure, multi-environment observability platform that can onboard many services without rebuilding the whole system each time. "Highly available" means the platform should keep working even if one server or component fails. The main changes are: replacing the single monitoring server with a resilient telemetry architecture, hardening access to observability tools, improving OpenTelemetry collection, making alert routing ownership-aware, storing telemetry with explicit retention and durability policies, and turning SLO and DORA measurements into enforceable release decisions.&lt;/p&gt;

&lt;p&gt;The target reader for this document is a Principal Engineer reviewing whether the proposed architecture can survive operational pressure, not whether the dashboards look impressive. For clarity, I still reference Anvila throughout the document because it was the real service used to test V1.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. V1 Architecture Critique
&lt;/h2&gt;

&lt;h3&gt;
  
  
  V1 Overview
&lt;/h3&gt;

&lt;p&gt;V1 was a single-service observability platform deployed on a dedicated AWS EC2 monitoring server. It monitored Anvila as the first real application, so the implementation was branded and configured around Anvila. The application server ran the Anvila API through Nginx and PM2, with staging and production processes exposed on separate local ports. PM2 is a process manager; in this setup, it kept the FastAPI backend running and allowed the team to restart staging or production processes. The monitoring server ran the LGTM stack as systemd services rather than Docker. LGTM conventionally stands for Loki, Grafana, Tempo, and Mimir; V1 used Prometheus in the metrics role instead of Mimir, alongside several supporting components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus collected metrics, which are numeric measurements such as request count, latency, CPU usage, and error rate.&lt;/li&gt;
&lt;li&gt;Loki stored logs, which are timestamped records of what the application and servers are doing.&lt;/li&gt;
&lt;li&gt;Tempo stored traces, which show the path and timing of a request across services.&lt;/li&gt;
&lt;li&gt;Grafana visualized dashboards by reading data from Prometheus, Loki, and Tempo.&lt;/li&gt;
&lt;li&gt;Alertmanager routed alerts to &lt;code&gt;#DevOps-Alerts&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Node Exporter collected host-level CPU, memory, disk, filesystem, and network metrics.&lt;/li&gt;
&lt;li&gt;Blackbox Exporter probed the public staging and production API URLs.&lt;/li&gt;
&lt;li&gt;OpenTelemetry Collector received application traces and shipped logs.&lt;/li&gt;
&lt;li&gt;A custom DORA exporter collected GitHub Actions workflow data and exposed it as metrics for Prometheus to scrape.&lt;/li&gt;
&lt;li&gt;A later Tempo recent-traces exporter exposed trace summaries to Grafana through Prometheus.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The backend also received application-level instrumentation. Instrumentation means adding code or configuration so the system can report what is happening inside it. The FastAPI app exposed &lt;code&gt;/metrics&lt;/code&gt; using &lt;code&gt;prometheus_client&lt;/code&gt;, with counters, histograms, and in-progress gauges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http_request_duration_seconds&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http_requests_in_progress&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The route-labeling logic normalized unmatched paths into a controlled label to avoid Prometheus label-cardinality explosion from random scanner URLs.&lt;/p&gt;
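
&lt;p&gt;A minimal sketch of that normalization is shown here. It assumes a hypothetical set of known route templates and a hypothetical catch-all label value, &lt;code&gt;unmatched&lt;/code&gt;; the actual Anvila route table and label name are not part of this document.&lt;/p&gt;

```python
from typing import Optional

# Hypothetical route templates; the real Anvila route table is not shown in this document.
KNOWN_ROUTES = {"/health", "/metrics", "/api/v1/users/{user_id}"}

# Assumed catch-all label value for any request the router did not match.
UNMATCHED_LABEL = "unmatched"


def route_label(matched_template: Optional[str]) -> str:
    """Return a bounded label value for a request.

    `matched_template` is the route template the framework matched,
    or None (or an unknown raw path) when nothing matched.
    """
    if matched_template in KNOWN_ROUTES:
        return matched_template
    return UNMATCHED_LABEL


if __name__ == "__main__":
    # A matched request is labeled with its template, never the raw URL.
    print(route_label("/api/v1/users/{user_id}"))
    # A scanner probe collapses into the single catch-all value.
    print(route_label("/wp-login.php"))
```

&lt;p&gt;Because the label can only take values from a fixed set plus one catch-all, the number of time series per metric stays bounded no matter how many random URLs are probed.&lt;/p&gt;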

&lt;p&gt;Dashboards were provisioned as JSON, and alert rules were committed as YAML. V1 included SLI/SLO definitions, an error budget policy, runbooks, a blameless post-incident review, and Game Day evidence for deployment failure, latency injection, and resource pressure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exact Breaking Points
&lt;/h3&gt;

&lt;p&gt;The first major breaking point was the single monitoring EC2 instance. If that instance failed, the team lost Prometheus, Grafana, Loki, Tempo, Alertmanager, dashboards, and local alert routing at the same time. That is unacceptable for production because the observability platform becomes unavailable exactly when operators may need it most. This is the opposite of high availability: one machine became a single point of failure.&lt;/p&gt;

&lt;p&gt;V2 fix: split the platform into multiple highly available components. Run at least two OpenTelemetry Collectors behind an internal load balancer, run Grafana in a highly available setup, and move telemetry storage to durable backends. One server failure should reduce capacity, not remove observability.&lt;/p&gt;

&lt;p&gt;The second breaking point was local disk dependency. Prometheus, Loki, Tempo, and Grafana state lived on one machine. A disk fill, filesystem issue, or instance replacement could destroy recent telemetry unless backups were handled outside the documented workflow. V1 had retention periods, but retention is not durability: retention controls how long data is kept, while durability determines whether that data survives a failure.&lt;/p&gt;

&lt;p&gt;V2 fix: move logs and traces to object-storage-backed Loki and Tempo, and ship metrics through Prometheus remote write to a Prometheus-compatible long-term store such as Mimir. This means telemetry survives instance replacement and is not tied to one local disk.&lt;/p&gt;

&lt;p&gt;The third breaking point was access control. Grafana, Prometheus, and Alertmanager were exposed on public ports and protected mainly by a security group allowlist. That is better than open internet access, but it is not strong enough for production. IP allowlists break when team members change networks, and they do not provide identity, auditability, role-based permissions, or revocation at the user level.&lt;/p&gt;

&lt;p&gt;V2 fix: put observability tools behind an identity-aware access layer. Grafana should use SSO, MFA, and RBAC, while Prometheus, Loki, Tempo, and Alertmanager APIs should stay private. Access should be based on who the engineer is and what role they have, not only on their IP address.&lt;/p&gt;

&lt;p&gt;The fourth breaking point was telemetry ingestion coupling. Tracing depended on the PM2 startup command launching the application with OpenTelemetry runtime instrumentation. If the process was restarted without the instrumentation wrapper, traces could silently disappear while the app continued serving traffic. That creates false confidence: dashboards still exist, but the signal is incomplete.&lt;/p&gt;

&lt;p&gt;V2 fix: make telemetry configuration part of the deployment contract. Services should send telemetry to a stable internal OTLP endpoint, and instrumentation settings should be versioned with the service deployment. Health checks should also verify that metrics, logs, and traces are arriving, not only that the API returns HTTP 200.&lt;/p&gt;
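
&lt;p&gt;The telemetry-arrival check could look roughly like the sketch below. It assumes the platform can already report a last-seen timestamp per signal for a service; the function name, the signal names, and the five-minute staleness budget are illustrative assumptions, not part of V1.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Assumed staleness budget; a real deployment would tune this per signal.
MAX_SIGNAL_AGE = timedelta(minutes=5)
REQUIRED_SIGNALS = ("metrics", "logs", "traces")


def stale_signals(last_seen: dict, now: datetime) -> list:
    """Return the required signals that are missing or older than MAX_SIGNAL_AGE."""
    stale = []
    for signal in REQUIRED_SIGNALS:
        seen_at = last_seen.get(signal)
        if seen_at is None or now - seen_at > MAX_SIGNAL_AGE:
            stale.append(signal)
    return stale


if __name__ == "__main__":
    now = datetime(2026, 6, 10, 20, 0, tzinfo=timezone.utc)
    last_seen = {
        "metrics": now - timedelta(seconds=30),
        "logs": now - timedelta(minutes=2),
        # No entry for traces: the app was restarted without instrumentation.
    }
    print(stale_signals(last_seen, now))
```

&lt;p&gt;A deployment health check that fails when this list is non-empty would catch the V1 failure mode where the API kept returning HTTP 200 while traces had stopped arriving.&lt;/p&gt;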

&lt;p&gt;The fifth breaking point was incomplete DORA accuracy. Deployment frequency and change failure rate were reasonable approximations from GitHub Actions, but lead time and mean time to restore (MTTR) were not fully event-driven. Deployment confirmation was inferred from workflow success, and MTTR used a placeholder/manual incident metric. This is acceptable for a Stage 6 demo, but it weakens executive reporting because the data can overstate delivery health.&lt;/p&gt;

&lt;p&gt;V2 fix: make DORA events explicit. The deployment pipeline should emit deployment started, deployment completed, rollback started, rollback completed, incident opened, and incident resolved events. The DORA exporter should report those real events instead of approximating deployment confirmation and MTTR.&lt;/p&gt;
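
&lt;p&gt;Once those events exist, MTTR and change failure rate can be computed directly instead of inferred. The sketch below is one possible calculation. It assumes incidents arrive as (opened, resolved) timestamp pairs and that a completed rollback counts as a failed change; both are illustrative assumptions rather than the exporter's actual logic.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def mean_time_to_restore(incidents: list) -> timedelta:
    """Average (resolved - opened) over incidents given as (opened, resolved) pairs."""
    if not incidents:
        return timedelta(0)
    seconds = [(resolved - opened).total_seconds() for opened, resolved in incidents]
    return timedelta(seconds=mean(seconds))


def change_failure_rate(deployments_completed: int, rollbacks_completed: int) -> float:
    """Share of completed deployments that ended in a rollback (assumed definition)."""
    if deployments_completed == 0:
        return 0.0
    return rollbacks_completed / deployments_completed


if __name__ == "__main__":
    opened = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
    incidents = [
        (opened, opened + timedelta(minutes=30)),
        (opened, opened + timedelta(minutes=90)),
    ]
    print(mean_time_to_restore(incidents))
    print(change_failure_rate(deployments_completed=20, rollbacks_completed=3))
```

&lt;p&gt;The point of the sketch is that every input is a recorded event timestamp or count, so nothing depends on a manually maintained placeholder metric.&lt;/p&gt;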

&lt;p&gt;The sixth breaking point was environment and service modeling. Staging and production were monitored, but the stack was not yet designed as a reusable multi-service platform. The labels were Anvila-specific, ownership routing was basic, and onboarding another service would require manual config changes and dashboard duplication. A stronger platform would let another application define its service name, owners, SLOs, alert routes, and dashboards through a repeatable template.&lt;/p&gt;

&lt;p&gt;V2 fix: introduce a service registry such as &lt;code&gt;observability_services.yml&lt;/code&gt;. Each service defines its name, environment, owners, SLO profile, dashboard folder, and alert route. New services are onboarded by adding a registry entry and using shared dashboard and alert templates.&lt;/p&gt;

&lt;p&gt;The seventh breaking point was operational drift. Terraform created the server and uploaded config, but the installation relied heavily on remote shell scripts and mutable systemd services. After deployment, manual server changes could drift away from Git without immediate detection.&lt;/p&gt;

&lt;p&gt;V2 fix: reduce mutable server configuration. Terraform should provision infrastructure, while service configuration should move toward immutable images, cloud-init, Ansible, or container orchestration. Config changes should be reviewed in Git and redeployed, not manually patched on the server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Blind Spots
&lt;/h3&gt;

&lt;p&gt;V1 left several security gaps because the immediate goal was proving observability capability under pressure.&lt;/p&gt;

&lt;p&gt;Secrets management was too manual. Slack webhook URLs, GitHub tokens, Brevo keys, Terraform variables, and environment-specific credentials were handled through local files or server-side environment files. In a production workflow, files such as private keys, Terraform state, and &lt;code&gt;terraform.tfvars&lt;/code&gt; must never live in a publishable repository. They should be stored outside Git, encrypted where possible, rotated if exposure is suspected, and replaced by references to a managed secrets system.&lt;/p&gt;

&lt;p&gt;Internal services were not consistently authenticated. Loki had &lt;code&gt;auth_enabled: false&lt;/code&gt;, and Prometheus, Alertmanager, and Tempo were designed around network trust rather than service identity. That is common for demos, but production needs stronger boundaries.&lt;/p&gt;

&lt;p&gt;Grafana access did not have enterprise-grade identity controls. There was no documented SSO, MFA, per-team RBAC, or audit trail for dashboard and datasource access.&lt;/p&gt;

&lt;p&gt;The attack surface was larger than needed. Public access to Grafana, Prometheus, and Alertmanager ports, even through allowlists, increases exposure. Observability systems often contain sensitive data: URLs, headers, traces, stack traces, user IDs, deployment metadata, internal hostnames, and incident timelines.&lt;/p&gt;

&lt;p&gt;Log and trace data did not have a documented PII redaction policy. The backend did make good security choices around OAuth logs by hashing emails and avoiding raw token logging, but the wider telemetry platform lacked a formal rule for what must never enter logs, spans, or labels.&lt;/p&gt;
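
&lt;p&gt;As an illustration of what such a rule could enforce, the sketch below replaces email addresses in a log message with a truncated SHA-256 digest before the message leaves the process. The pattern, the &lt;code&gt;email_sha256:&lt;/code&gt; prefix, and the function names are assumptions for this example; it is not the backend's existing OAuth logging code.&lt;/p&gt;

```python
import hashlib
import re

# Deliberately simple email pattern for illustration; not a full RFC 5322 matcher.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_email(match: re.Match) -> str:
    """Replace one matched email with a short, stable digest."""
    normalized = match.group(0).lower().encode("utf-8")
    digest = hashlib.sha256(normalized).hexdigest()
    return f"email_sha256:{digest[:12]}"


def redact(message: str) -> str:
    """Return the message with every email address replaced by its digest."""
    return EMAIL_PATTERN.sub(hash_email, message)


if __name__ == "__main__":
    print(redact("login failed for user@example.com from 10.0.0.7"))
```

&lt;p&gt;The same address always maps to the same digest, so operators can still correlate events for one user. A plain hash of an email can be guessed by hashing candidate addresses, so a production policy would likely prefer a keyed hash with a secret managed outside the codebase.&lt;/p&gt;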

&lt;h2&gt;
  
  
  2. New Features Fully Designed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Feature 1: Highly Available Telemetry Control Plane
&lt;/h3&gt;

&lt;p&gt;What it does and why it is needed:&lt;/p&gt;

&lt;p&gt;V2 replaces the single monitoring server with a highly available observability control plane. A control plane is the part of the system that receives, processes, stores, and routes observability data. The minimum production version runs two OpenTelemetry Collectors behind an internal load balancer, a highly available Grafana instance, and durable backend storage for metrics, logs, and traces. The goal is to ensure that one node failure does not remove visibility during an incident.&lt;/p&gt;

&lt;p&gt;Architectural integration:&lt;/p&gt;

&lt;p&gt;Application services send OTLP traces, logs, and metrics to an internal telemetry endpoint. OTLP means OpenTelemetry Protocol; it is the standard protocol applications use to send telemetry to OpenTelemetry Collectors. The endpoint load-balances across OpenTelemetry Collectors. Collectors apply batching, memory limits, retries, redaction processors, and routing rules before exporting telemetry to the storage layer.&lt;/p&gt;

&lt;p&gt;Prometheus either runs in high-availability pair mode with remote write, or V2 adopts a Prometheus-compatible long-term store such as Grafana Mimir. Remote write means Prometheus keeps scraping metrics but sends a copy to a more durable backend. Loki stores logs using object storage for durability. Tempo stores traces using object storage-backed blocks. Grafana reads from Prometheus/Mimir, Loki, and Tempo.&lt;/p&gt;

&lt;p&gt;Data model changes:&lt;/p&gt;

&lt;p&gt;Telemetry itself is not stored in the Anvila relational database, but V2 introduces an &lt;code&gt;observability_services.yml&lt;/code&gt; registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anvila-api&lt;/span&gt;
    &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anvila-devops&lt;/span&gt;
    &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user-facing&lt;/span&gt;
    &lt;span class="na"&gt;environments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;production&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;slo_profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;public-api-standard&lt;/span&gt;
    &lt;span class="na"&gt;slack_channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#DevOps-Alerts"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This registry becomes the source of truth for labels, alert routes, dashboard folders, and service ownership. If another application joins later, it gets another entry in the same file instead of a separate hand-built monitoring stack.&lt;/p&gt;
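
&lt;p&gt;To make the "source of truth" idea concrete, here is a small sketch of how tooling might derive label sets and alert routes from the registry. The registry is mirrored as an in-memory dictionary so the example has no dependencies; a real loader would parse &lt;code&gt;observability_services.yml&lt;/code&gt;. The output label names (&lt;code&gt;service&lt;/code&gt;, &lt;code&gt;environment&lt;/code&gt;, &lt;code&gt;owner&lt;/code&gt;, &lt;code&gt;tier&lt;/code&gt;) and both function names are assumptions.&lt;/p&gt;

```python
# In-memory mirror of the registry entry from this section.
# A real loader would read observability_services.yml instead.
REGISTRY = {
    "services": [
        {
            "service_name": "anvila-api",
            "owner": "anvila-devops",
            "tier": "user-facing",
            "environments": ["staging", "production"],
            "slo_profile": "public-api-standard",
            "slack_channel": "#DevOps-Alerts",
        }
    ]
}


def label_sets(registry: dict) -> list:
    """Expand each service into one label set per environment (assumed label names)."""
    labels = []
    for service in registry["services"]:
        for environment in service["environments"]:
            labels.append(
                {
                    "service": service["service_name"],
                    "environment": environment,
                    "owner": service["owner"],
                    "tier": service["tier"],
                }
            )
    return labels


def alert_routes(registry: dict) -> dict:
    """Map each service name to the Slack channel that receives its alerts."""
    return {s["service_name"]: s["slack_channel"] for s in registry["services"]}


if __name__ == "__main__":
    for labels in label_sets(REGISTRY):
        print(labels)
    print(alert_routes(REGISTRY))
```

&lt;p&gt;Onboarding a second application then means appending one more entry to &lt;code&gt;services&lt;/code&gt;; the derived labels and routes follow without hand-editing scrape or alert configuration.&lt;/p&gt;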

&lt;p&gt;Trade-offs:&lt;/p&gt;

&lt;p&gt;This increases operational complexity and cost. V1 was cheap and simple because everything lived on one EC2 instance. V2 sacrifices simplicity for survivability. The cost is justified because observability must remain available during failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature 2: Identity-Aware Access Layer
&lt;/h3&gt;

&lt;p&gt;What it does and why it is needed:&lt;/p&gt;

&lt;p&gt;V2 puts Grafana, Prometheus, Alertmanager, and trace/log exploration behind an identity-aware access layer. Engineers authenticate through SSO with MFA. Access is granted by team and role, not by IP address.&lt;/p&gt;

&lt;p&gt;Architectural integration:&lt;/p&gt;

&lt;p&gt;Grafana is placed behind a private Application Load Balancer (ALB) or a VPN-accessible endpoint. Authentication is delegated to an identity provider. Prometheus, Alertmanager, Loki, and Tempo APIs are not directly public. Engineers query them through Grafana or through short-lived authenticated access paths.&lt;/p&gt;

&lt;p&gt;Data model changes:&lt;/p&gt;

&lt;p&gt;The observability service registry gains ownership metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;owners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anvila-devops&lt;/span&gt;
    &lt;span class="na"&gt;grafana_role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;editor&lt;/span&gt;
    &lt;span class="na"&gt;alert_contact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#DevOps-Alerts"&lt;/span&gt;
    &lt;span class="na"&gt;escalation_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anvila-primary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grafana teams map to this registry. Alertmanager routes use the same ownership source.&lt;/p&gt;

&lt;p&gt;Trade-offs:&lt;/p&gt;

&lt;p&gt;Identity-aware access adds setup time and may slow emergency access if misconfigured. The trade-off is acceptable because production telemetry can contain sensitive operational and user-adjacent data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature 3: Release Health Gate Based on SLO Burn and DORA Signals
&lt;/h3&gt;

&lt;p&gt;What it does and why it is needed:&lt;/p&gt;

&lt;p&gt;V1 showed SLO burn and DORA metrics on dashboards, but V2 turns those signals into release policy. An SLO, or Service Level Objective, is a reliability target such as "99.5% of requests should succeed." An error budget is the amount of failure the service can tolerate before it breaks that target. DORA metrics measure software delivery performance: deployment frequency, lead time for changes, change failure rate, and mean time to restore. In V2, deployments can be blocked, delayed, or escalated when the service is burning error budget too quickly or when change failure rate is above threshold.&lt;/p&gt;

&lt;p&gt;The important change is that observability stops being only something engineers look at after a problem. It becomes part of the deployment decision. If the platform already knows the service is unhealthy, the release pipeline should not blindly push more change into production. This is the same idea as a safety gate: before the deployment continues, the system checks whether reliability conditions are acceptable.&lt;/p&gt;

&lt;p&gt;SLO burn means the service is consuming its error budget. A slow burn means the service is getting worse gradually. A fast burn means the service is failing quickly enough that the team may break the SLO soon. For example, if the availability target is 99.5%, the service only has 0.5% failure allowance for the window. A fast burn alert means that allowance is being consumed too quickly.&lt;/p&gt;
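
&lt;p&gt;The arithmetic behind burn rate can be sketched as follows. A burn rate of 1 means the service is failing at exactly the pace that would use up the whole budget by the end of the window; higher values mean the budget runs out sooner. The 30-day window and the function names are assumptions for illustration, since this document does not fix a window length.&lt;/p&gt;

```python
def error_budget(slo_target_percent: float) -> float:
    """Allowed failure ratio for a target, for example 99.5 gives 0.005."""
    return (100.0 - slo_target_percent) / 100.0


def burn_rate(failed_requests: int, total_requests: int, slo_target_percent: float) -> float:
    """Observed error ratio divided by the allowed error ratio.

    Assumes a target below 100%, so the budget is never zero.
    """
    if total_requests == 0:
        return 0.0
    observed_error_ratio = failed_requests / total_requests
    return observed_error_ratio / error_budget(slo_target_percent)


def hours_until_budget_exhausted(rate: float, window_days: int = 30) -> float:
    """Time to consume a full window's budget if the current burn rate continues."""
    if rate == 0:
        return float("inf")
    return window_days * 24 / rate


if __name__ == "__main__":
    # 100 failures in 1000 requests against a 99.5% target.
    rate = burn_rate(failed_requests=100, total_requests=1000, slo_target_percent=99.5)
    print(round(rate, 2))
    print(round(hours_until_budget_exhausted(rate), 1))
```

&lt;p&gt;In this example a 10% error ratio against a 0.5% allowance is a burn rate of about 20, which would exhaust an assumed 30-day budget in roughly 36 hours. That is the kind of condition a fast burn alert is meant to catch.&lt;/p&gt;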

&lt;p&gt;Architectural integration:&lt;/p&gt;

&lt;p&gt;CI/CD queries a reliability policy endpoint before production deployment. CI/CD means Continuous Integration and Continuous Deployment: the automated path from code change to deployed service. The policy endpoint reads current SLO burn rate, active critical alerts, deployment failure history, and service tier. For Anvila API, production deployment is blocked if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a fast-burn alert is active;&lt;/li&gt;
&lt;li&gt;the availability budget is fully consumed;&lt;/li&gt;
&lt;li&gt;an unresolved critical incident exists;&lt;/li&gt;
&lt;li&gt;the change failure rate exceeds 15% over the configured window.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The policy endpoint does not replace the deployment pipeline. It gives the pipeline a decision: allow, warn, or block. The deployment tool sends context such as service name, environment, commit SHA, actor, and target version. The policy service then checks Prometheus or Mimir for SLO burn, checks Alertmanager for active critical alerts, checks recent deployment history, and returns a decision with reasons.&lt;/p&gt;

&lt;p&gt;Example response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"block"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anvila-api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"environment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasons"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"SLOFastBurn is active"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Change failure rate is 18%, above the 15% threshold"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because it gives the operator a clear explanation. The system should not just say "deployment blocked." It should explain which reliability rule failed and what evidence caused the block.&lt;/p&gt;
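
&lt;p&gt;A minimal sketch of the decision logic, under stated assumptions: the signals are already fetched and passed in as plain values, the &lt;code&gt;ReleaseSignals&lt;/code&gt; type and &lt;code&gt;decide&lt;/code&gt; function are hypothetical names, and the rule "block in production, warn elsewhere" is an assumed simplification of per-environment policy. It is not the policy service's actual implementation.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ReleaseSignals:
    """Inputs the policy service would gather before deciding (hypothetical shape)."""
    fast_burn_active: bool
    budget_remaining_ratio: float  # 1.0 = untouched, 0.0 = fully consumed
    open_critical_incidents: int
    change_failure_rate: float  # 0.18 means 18%


def decide(service: str, environment: str, signals: ReleaseSignals,
           max_change_failure_rate: float = 0.15) -> dict:
    """Return allow, warn, or block together with the reasons for the decision."""
    reasons = []
    if signals.fast_burn_active:
        reasons.append("SLOFastBurn is active")
    if not signals.budget_remaining_ratio > 0:
        reasons.append("Availability budget is fully consumed")
    if signals.open_critical_incidents > 0:
        reasons.append("Unresolved critical incident exists")
    if signals.change_failure_rate > max_change_failure_rate:
        reasons.append(
            f"Change failure rate is {signals.change_failure_rate:.0%}, "
            f"above the {max_change_failure_rate:.0%} threshold"
        )

    if not reasons:
        decision = "allow"
    elif environment == "production":
        decision = "block"
    else:
        decision = "warn"  # assumed: non-production environments only warn

    return {
        "decision": decision,
        "service": service,
        "environment": environment,
        "reasons": reasons,
    }


if __name__ == "__main__":
    signals = ReleaseSignals(
        fast_burn_active=True,
        budget_remaining_ratio=0.4,
        open_critical_incidents=0,
        change_failure_rate=0.18,
    )
    print(decide("anvila-api", "production", signals))
```

&lt;p&gt;Collecting every failed rule into &lt;code&gt;reasons&lt;/code&gt;, rather than returning at the first failure, is what lets the response explain the full picture to the operator.&lt;/p&gt;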

&lt;p&gt;Data model changes:&lt;/p&gt;

&lt;p&gt;V2 introduces a small reliability metadata database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;service_release_policies&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;service_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;max_change_failure_rate&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;block_on_fast_burn&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;block_on_open_critical&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;release_decisions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;service_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;commit_sha&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;reasons&lt;/span&gt; &lt;span class="n"&gt;JSONB&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;requested_by&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;service_release_policies&lt;/code&gt; table stores the rules for each service and environment. A user-facing production API can have stricter rules than an internal staging service. For example, production might block deployment during fast burn, while staging might only warn.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;release_decisions&lt;/code&gt; table stores every decision made by the policy service. This creates an audit trail. If someone asks why a deployment was blocked or allowed, the team can inspect the commit SHA, environment, decision, reasons, requester, and timestamp.&lt;/p&gt;
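
&lt;p&gt;As a rough illustration of how the policy service could turn these columns into a decision, the sketch below evaluates one policy row against a snapshot of reliability signals. The &lt;code&gt;ReliabilitySignals&lt;/code&gt; shape, the reason strings, and the &lt;code&gt;allow&lt;/code&gt;, &lt;code&gt;warn&lt;/code&gt;, and &lt;code&gt;block&lt;/code&gt; decision values are assumptions made for this example; the schema only says that &lt;code&gt;decision&lt;/code&gt; is text and &lt;code&gt;reasons&lt;/code&gt; is JSONB.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleasePolicy:
    """Mirrors the rule columns of service_release_policies."""
    service_name: str
    environment: str
    max_change_failure_rate: float
    block_on_fast_burn: bool = True
    block_on_open_critical: bool = True


@dataclass(frozen=True)
class ReliabilitySignals:
    """Hypothetical snapshot gathered from Prometheus and Alertmanager."""
    change_failure_rate: float
    fast_burn_firing: bool
    open_critical_alerts: int


def evaluate_release(policy, signals):
    """Return (decision, reasons), the pair a release_decisions row would record."""
    blocking, warnings = [], []

    if signals.change_failure_rate > policy.max_change_failure_rate:
        blocking.append("change_failure_rate_exceeded")

    if signals.fast_burn_firing:
        target = blocking if policy.block_on_fast_burn else warnings
        target.append("slo_fast_burn")

    if signals.open_critical_alerts > 0:
        target = blocking if policy.block_on_open_critical else warnings
        target.append("open_critical_alert")

    if blocking:
        return "block", blocking + warnings
    if warnings:
        return "warn", warnings
    return "allow", []


# Illustrative values only: same service, stricter rules in production.
prod = ReleasePolicy("anvila-api", "production", max_change_failure_rate=0.15)
staging = ReleasePolicy("anvila-api", "staging", 0.15, block_on_fast_burn=False)
burning = ReliabilitySignals(0.05, fast_burn_firing=True, open_critical_alerts=0)

print(evaluate_release(prod, burning))     # ('block', ['slo_fast_burn'])
print(evaluate_release(staging, burning))  # ('warn', ['slo_fast_burn'])
```

&lt;p&gt;With the same fast-burn signal, production blocks and staging only warns, which is the behaviour described for the two environments. Returning the reasons alongside the decision keeps the audit row self-explanatory.&lt;/p&gt;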

&lt;p&gt;Trade-offs:&lt;/p&gt;

&lt;p&gt;This can slow feature delivery during reliability incidents. That is intentional. V2 chooses controlled delivery over shipping into known instability.&lt;/p&gt;

&lt;p&gt;The cost of this feature is extra operational complexity. The team now needs a policy service, a metadata database, and reliable integrations with CI/CD, Prometheus, and Alertmanager. There is also a risk of false positives: a bad alert rule could block a safe deployment. To manage that risk, the policy should support an emergency override path that requires approval, records the reason, and notifies the team.&lt;/p&gt;

&lt;p&gt;The benefit is that release decisions become evidence-based. Instead of relying on a human to remember to check dashboards before deployment, the platform checks the most important reliability signals automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature 4: Telemetry Data Hygiene and Redaction
&lt;/h3&gt;

&lt;p&gt;What it does and why it is needed:&lt;/p&gt;

&lt;p&gt;V2 standardizes what can enter logs, metrics, and traces. It prevents secrets, OAuth codes, access tokens, raw emails, payment IDs, and high-cardinality labels from polluting telemetry. High-cardinality labels are labels with too many possible values, such as raw URLs, user emails, random IDs, or search terms. They can make Prometheus expensive, noisy, and slow.&lt;/p&gt;

&lt;p&gt;Architectural integration:&lt;/p&gt;

&lt;p&gt;OpenTelemetry Collector processors redact sensitive fields before exporting. Application logging uses structured JSON with approved fields. Metrics labels are restricted to bounded values such as &lt;code&gt;method&lt;/code&gt;, &lt;code&gt;route&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, and &lt;code&gt;environment&lt;/code&gt;.&lt;/p&gt;
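
&lt;p&gt;The label restriction can be pictured as a simple allowlist. The sketch below is an application-side illustration in Python, not collector configuration; the function name and the sample route are hypothetical, while the allowed label names come from the list above.&lt;/p&gt;

```python
# Bounded label names taken from the paragraph above.
ALLOWED_LABELS = frozenset({"method", "route", "status", "service", "environment"})


def bounded_labels(labels, allowed=ALLOWED_LABELS):
    """Split metric labels into the ones to keep and the names that were dropped."""
    kept = {name: value for name, value in labels.items() if name in allowed}
    dropped = sorted(name for name in labels if name not in allowed)
    return kept, dropped


kept, dropped = bounded_labels({
    "method": "GET",
    "route": "/personas/{id}",  # route template, not the raw URL (hypothetical route)
    "status": "200",
    "user_email": "someone@example.com",
    "raw_url": "/personas/8f2c?q=search+terms",
})
print(kept)
print(dropped)  # ['raw_url', 'user_email']
```

&lt;p&gt;Returning the dropped names makes it possible to count or log rejected labels, so a team can spot instrumentation that keeps trying to emit unbounded values.&lt;/p&gt;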

&lt;p&gt;Data model changes:&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;observability_redaction_rules.yml&lt;/code&gt; file defines redaction policies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;redact_fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;authorization&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cookie&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;access_token&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;refresh_token&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;oauth_code&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;github_token&lt;/span&gt;
&lt;span class="na"&gt;hash_fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;user_email&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;user_id&lt;/span&gt;
&lt;span class="na"&gt;drop_span_attributes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;http.request.header.authorization&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
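
&lt;p&gt;To make the effect of these rules concrete, here is a minimal sketch that applies the &lt;code&gt;redact_fields&lt;/code&gt; and &lt;code&gt;hash_fields&lt;/code&gt; groups to a flat log record. It is an illustration only: in V2 the work is done by collector processors, and the &lt;code&gt;scrub&lt;/code&gt; function, the salt, and the output format are assumptions. Span attribute dropping is not covered here.&lt;/p&gt;

```python
import hashlib

# Field names copied from observability_redaction_rules.yml above.
RULES = {
    "redact_fields": ["authorization", "cookie", "access_token",
                      "refresh_token", "oauth_code", "github_token"],
    "hash_fields": ["user_email", "user_id"],
}


def scrub(record, rules=RULES, salt="example-salt"):
    """Return a copy of a flat log record with redaction and hashing applied."""
    redact = set(rules["redact_fields"])
    hashed = set(rules["hash_fields"])
    clean = {}
    for key, value in record.items():
        name = key.lower()  # match header-style keys such as "Authorization"
        if name in redact:
            clean[key] = "[REDACTED]"
        elif name in hashed:
            # Salted hash: records stay correlatable without storing the raw value.
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            clean[key] = "sha256:" + digest[:16]
        else:
            clean[key] = value
    return clean


print(scrub({
    "event": "oauth_callback_failed",
    "Authorization": "Bearer abc123",
    "user_email": "someone@example.com",
}))
```

&lt;p&gt;Hashing instead of deleting keeps the ability to group events by the same user during an investigation. A real deployment would keep the salt in the secrets manager rather than in code.&lt;/p&gt;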



&lt;p&gt;Trade-offs:&lt;/p&gt;

&lt;p&gt;Redaction can remove useful debugging context. V2 chooses privacy and security over convenience. Debug-only access to sensitive values should come from controlled application debugging, not broad telemetry storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Production Readiness
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;Authentication and authorization are treated as separate concerns. Authentication, often shortened to AuthN, proves who the user or service is. Authorization, often shortened to AuthZ, decides what that identity is allowed to do.&lt;/p&gt;

&lt;p&gt;Human authentication uses SSO with MFA for Grafana and all human access to observability tools. SSO means Single Sign-On, where users log in through a central identity provider. MFA means Multi-Factor Authentication, where login requires an extra proof beyond a password. Authorization is RBAC-based, meaning Role-Based Access Control: viewers can inspect dashboards, on-call engineers can silence alerts, and platform admins can edit datasources and alert rules. Direct API access to Prometheus, Loki, Tempo, and Alertmanager is blocked from the public internet.&lt;/p&gt;

&lt;p&gt;Service-to-service authentication uses private networking plus cloud identity where available. GitHub Actions uses OIDC, or OpenID Connect, to assume AWS roles instead of storing long-lived cloud keys. Secrets move into AWS Secrets Manager or SSM Parameter Store. Terraform receives only secret references, not plaintext secrets. Slack, GitHub, Brevo, Stripe, Gemini, and database credentials are rotated and never stored in Git.&lt;/p&gt;

&lt;p&gt;Input validation is enforced at the application and telemetry layers. The FastAPI app already uses typed Pydantic settings and route-level models. V2 extends this by validating telemetry labels and dropping unknown high-cardinality labels at the collector.&lt;/p&gt;

&lt;p&gt;The attack surface is minimized by private networking. Only Grafana is reachable through an authenticated access layer. OTLP ports are internal. Node Exporter is reachable only from the collector or Prometheus security group. Admin APIs are private.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;Horizontal scaling boundaries are explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anvila API scales horizontally behind a load balancer.&lt;/li&gt;
&lt;li&gt;OpenTelemetry Collectors scale horizontally and are stateless.&lt;/li&gt;
&lt;li&gt;Grafana scales horizontally behind a load balancer with external database/session storage.&lt;/li&gt;
&lt;li&gt;Logs and traces scale through object-storage-backed Loki and Tempo.&lt;/li&gt;
&lt;li&gt;Metrics scale through Prometheus HA with remote write or a Mimir-compatible backend.&lt;/li&gt;
&lt;li&gt;Celery workers scale independently for persona generation. Celery is a Python background job system. It lets slow work, such as LLM calls or GitHub publishing, run outside the user-facing HTTP request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Caching should be precise, not generic. Redis is an in-memory data store, which makes it fast for temporary data. V2 uses it for session-adjacent ephemeral state, Celery broker/result coordination, persona generation events, and short-lived rate limit counters. Cache keys must have TTLs, meaning they expire automatically after a defined time. OAuth state remains short-lived and signed. Token blacklists or revocation sets should use a TTL equal to the token's remaining lifetime, so each entry disappears once the token could no longer be accepted anyway.&lt;/p&gt;
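
&lt;p&gt;The revocation rule can be sketched as follows. A revoked token only needs to stay listed until it would have expired on its own, so the TTL is derived from the token's expiry time. The &lt;code&gt;RevocationSet&lt;/code&gt; class is an in-memory stand-in for Redis, and the token ID is hypothetical; both are assumptions for illustration.&lt;/p&gt;

```python
import math


def revocation_ttl_seconds(token_exp, now):
    """Seconds a revoked token must stay listed: until it would expire anyway."""
    return max(0, math.ceil(token_exp - now))


class RevocationSet:
    """In-memory stand-in for Redis keys with expiry (an assumption for this sketch)."""

    def __init__(self):
        self._expires_at = {}

    def revoke(self, token_id, token_exp, now):
        ttl = revocation_ttl_seconds(token_exp, now)
        if ttl > 0:  # an already-expired token needs no entry
            self._expires_at[token_id] = now + ttl
        return ttl

    def is_revoked(self, token_id, now):
        expires_at = self._expires_at.get(token_id)
        if expires_at is None:
            return False
        if now >= expires_at:  # emulate Redis dropping the key at expiry
            del self._expires_at[token_id]
            return False
        return True


revoked = RevocationSet()
print(revoked.revoke("jti-123", token_exp=1_000, now=400))  # 600
print(revoked.is_revoked("jti-123", now=500))    # True
print(revoked.is_revoked("jti-123", now=1_000))  # False
```

&lt;p&gt;In Redis itself the same idea is a &lt;code&gt;SET&lt;/code&gt; with the &lt;code&gt;EX&lt;/code&gt; option, which lets the store clean up entries without a separate job.&lt;/p&gt;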

&lt;p&gt;Traffic spikes are handled through three layers: API autoscaling, queue-based background processing for expensive persona generation, and backpressure/rate limiting on high-cost endpoints.&lt;/p&gt;

&lt;p&gt;API autoscaling means adding more API instances when request volume increases. Queue-based background processing means that expensive work, such as persona generation, is placed into a queue and handled by workers instead of forcing the user-facing API request to wait until all the work is finished. This matters because persona generation may call an LLM provider, write database records, match skills, build files, and publish events. Those operations are slower and more failure-prone than a normal API read request.&lt;/p&gt;

&lt;p&gt;Backpressure means the system deliberately slows down or rejects new work when it is already overloaded. Rate limiting means setting a maximum number of requests a user or client can make in a period of time. For high-cost endpoints, such as generation, publishing, OAuth callbacks, or payment actions, V2 should respond explicitly: &lt;code&gt;202 Accepted&lt;/code&gt; when the work has been queued for background processing, and &lt;code&gt;429 Too Many Requests&lt;/code&gt; when a client has exceeded its limit or the system is shedding load. Either response is better than allowing unlimited requests to overload Redis, PostgreSQL, the worker pool, or the external LLM provider.&lt;/p&gt;

&lt;p&gt;The system should reject excess work predictably instead of letting the database, LLM provider, or worker pool collapse.&lt;/p&gt;
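
&lt;p&gt;A minimal sketch of predictable rejection is a fixed-window rate limiter. The class below keeps counters in memory as a stand-in for the short-lived Redis counters; the limit, window size, and client ID are illustrative assumptions, and a production limiter might prefer a sliding window or token bucket.&lt;/p&gt;

```python
import math


class FixedWindowLimiter:
    """Per-client fixed-window counter; in-memory stand-in for Redis counters with a TTL."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._state = {}  # client_id -> (window_index, count)

    def check(self, client_id, now):
        """Return (status_code, headers) for one request arriving at time `now`."""
        window = int(now // self.window_seconds)
        seen_window, count = self._state.get(client_id, (window, 0))
        if seen_window != window:  # a new window started, so reset the counter
            count = 0
        count += 1
        self._state[client_id] = (window, count)
        if count > self.limit:
            retry_after = (window + 1) * self.window_seconds - now
            return 429, {"Retry-After": str(math.ceil(retry_after))}
        return 202, {}  # accepted: the expensive work would be queued here


limiter = FixedWindowLimiter(limit=2, window_seconds=60)
for second in (0, 1, 2, 61):
    print(second, limiter.check("user-42", now=second))
```

&lt;p&gt;The third request in the first window is rejected with a &lt;code&gt;Retry-After&lt;/code&gt; header, and the request at second 61 is accepted again because a new window has started. Telling the client when to retry is what makes the rejection predictable rather than opaque.&lt;/p&gt;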

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;Structured logging uses JSON with required fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;timestamp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;service&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;environment&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;level&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;event&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;request_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;trace_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;route&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status_code&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;duration_ms&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;error_class&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Structured logging means every important log line follows the same shape. Instead of writing vague messages like &lt;code&gt;something failed&lt;/code&gt;, the service emits consistent fields that can be searched and grouped. For example, if an OAuth callback fails, the log should include the service name, environment, route, status code, error class, request ID, and trace ID. That makes it possible to connect a user-visible problem to the exact backend event without exposing secrets.&lt;/p&gt;
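
&lt;p&gt;One possible way to enforce that shape in a Python service is a custom &lt;code&gt;logging.Formatter&lt;/code&gt; that always emits the required fields. This is a sketch using only the standard library; the class name, logger name, route, and sample values are assumptions, and the real application may use a different logging setup.&lt;/p&gt;

```python
import json
import logging
from datetime import datetime, timezone

# Per-request fields from the required list; missing ones are emitted as null.
REQUEST_FIELDS = ("request_id", "trace_id", "route", "status_code",
                  "duration_ms", "error_class")


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with a fixed set of keys."""

    def __init__(self, service, environment):
        super().__init__()
        self.service = service
        self.environment = environment

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "service": self.service,
            "environment": self.environment,
            "level": record.levelname,
            "event": record.getMessage(),
        }
        for field in REQUEST_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("anvila-api", "staging"))
logger = logging.getLogger("anvila.example")
logger.addHandler(handler)

# Values passed through `extra` become attributes on the log record.
logger.error("oauth_callback_failed", extra={
    "request_id": "req-1", "trace_id": "4bf92f35", "route": "/auth/callback",
    "status_code": 502, "duration_ms": 840, "error_class": "ProviderTimeout",
})
```

&lt;p&gt;Because every line carries the same keys, a log query can filter on &lt;code&gt;route&lt;/code&gt; and &lt;code&gt;error_class&lt;/code&gt; and then follow &lt;code&gt;trace_id&lt;/code&gt; to the matching trace.&lt;/p&gt;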

&lt;p&gt;Core metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;request rate by route and status;&lt;/li&gt;
&lt;li&gt;p50/p95/p99 latency by route;&lt;/li&gt;
&lt;li&gt;5xx rate;&lt;/li&gt;
&lt;li&gt;auth failure rate;&lt;/li&gt;
&lt;li&gt;OAuth provider failure rate;&lt;/li&gt;
&lt;li&gt;persona generation queue depth;&lt;/li&gt;
&lt;li&gt;persona generation success/failure count;&lt;/li&gt;
&lt;li&gt;Celery task duration and retry count;&lt;/li&gt;
&lt;li&gt;Redis connection errors;&lt;/li&gt;
&lt;li&gt;database connection pool saturation;&lt;/li&gt;
&lt;li&gt;SLO burn rate;&lt;/li&gt;
&lt;li&gt;DORA deployment frequency, lead time, change failure rate (CFR), and mean time to restore (MTTR).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics are the operational health signals for the platform. Request rate shows demand. Latency shows whether users are waiting too long. Error rate shows whether the service is failing. Queue depth shows whether background workers are falling behind. Database pool saturation shows whether the API is running out of database connections. SLO burn rate shows whether the service is consuming its allowed failure budget too quickly. DORA metrics show whether the team is deploying safely and recovering quickly.&lt;/p&gt;

&lt;p&gt;Alerting thresholds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;availability fast burn: 14.4x over 1 hour;&lt;/li&gt;
&lt;li&gt;availability slow burn: 5x over 6 hours;&lt;/li&gt;
&lt;li&gt;5xx rate above 1% for 5 minutes;&lt;/li&gt;
&lt;li&gt;p95 API latency above 500ms for 10 minutes;&lt;/li&gt;
&lt;li&gt;CPU above 80% warning and 90% critical;&lt;/li&gt;
&lt;li&gt;memory above 80% warning and 90% critical;&lt;/li&gt;
&lt;li&gt;disk above 75% warning and 90% critical;&lt;/li&gt;
&lt;li&gt;queue depth above expected worker capacity for 10 minutes;&lt;/li&gt;
&lt;li&gt;deployment CFR above 15%.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The alert thresholds are intentionally tied to user impact and operational risk. A single slow request should not wake someone up, but sustained high latency or a fast SLO burn should. CPU and memory alerts are useful, but they are secondary signals; the more important question is whether users are experiencing failures or slow responses. Queue depth matters because it warns that background work is piling up before users start complaining that generation is stuck.&lt;/p&gt;
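
&lt;p&gt;The burn-rate thresholds are easier to reason about with the arithmetic written out. Burn rate is the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes a 99.9% availability target over a 30-day period; neither number is stated in this section, so treat both as placeholders.&lt;/p&gt;

```python
def burn_rate(error_ratio, slo_target):
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate, window_hours, slo_period_hours=30 * 24):
    """Fraction of the whole period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / slo_period_hours


SLO_TARGET = 0.999  # assumed availability target, not stated in this document

# 1.44% of requests failing against a 0.1% allowance is a 14.4x burn.
print(f"{burn_rate(0.0144, SLO_TARGET):.1f}x")  # 14.4x
print(f"{budget_consumed(14.4, 1):.1%}")         # 2.0%
print(f"{budget_consumed(5, 6):.1%}")            # 4.2%
```

&lt;p&gt;Under these assumptions, the fast-burn alert fires when about 2% of the monthly error budget has been spent in a single hour, and the slow-burn alert fires at roughly 4.2% spent over six hours. That is why a fast burn justifies paging while isolated slow requests do not.&lt;/p&gt;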

&lt;p&gt;Distributed error tracking connects metrics, logs, and traces through &lt;code&gt;trace_id&lt;/code&gt;. Grafana derived fields allow jumping from Loki logs into Tempo traces. Application errors include stable error classes but not raw secrets or full user data.&lt;/p&gt;

&lt;p&gt;In practical terms, this means an engineer can start from a Grafana latency spike, open the related Loki logs for the same time window, then jump into the Tempo trace for the exact request. The trace shows where time was spent, while the logs explain what happened. This is the difference between knowing "the API is slow" and knowing "persona generation is slow because the LLM call timed out after 30 seconds."&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Tech Stack Decisions
&lt;/h2&gt;

&lt;p&gt;FastAPI remains appropriate because Anvila is an API-heavy Python backend with strong async support, typed request validation, and easy OpenTelemetry instrumentation. Async support matters because API servers often wait on databases, external APIs, and network calls; async handling helps the server use resources efficiently during that waiting time.&lt;/p&gt;

&lt;p&gt;Nginx remains useful at the application edge because it is mature, battle-tested, and efficient at reverse proxying traffic to backend processes. In V2, it should either sit behind an AWS Application Load Balancer or be replaced by a managed ingress layer if the platform moves to containers.&lt;/p&gt;

&lt;p&gt;PM2 worked for V1 because it kept the FastAPI staging and production processes running and allowed fast restarts during the task. For V2, PM2 is acceptable for a small VM-based deployment, but a production platform should consider systemd units, containers, or orchestration so process configuration is versioned and less dependent on manual runtime commands.&lt;/p&gt;

&lt;p&gt;PostgreSQL remains the primary relational database because Anvila needs transactional integrity for users, personas, OAuth links, refresh tokens, payments, and publishing state. Unique indexes on email and provider subjects protect identity flows from races.&lt;/p&gt;

&lt;p&gt;Redis is used for ephemeral coordination because persona generation, rate limiting, short-lived state, Celery queues, and event streams need low-latency TTL-backed storage. "Ephemeral" means temporary. Redis is not the source of truth; if data must survive permanently, it belongs in PostgreSQL or object storage.&lt;/p&gt;

&lt;p&gt;Celery remains appropriate for persona generation because LLM calls and GitHub publishing can be slow, retryable, and unsuitable for synchronous request handling. Synchronous request handling would force the user to wait while the server does all the work before returning a response.&lt;/p&gt;

&lt;p&gt;Prometheus remains the metrics interface because it is mature, pull-based, and has strong PromQL support for SLO burn-rate alerting. For V2 scale, Prometheus should remote-write to durable long-term storage.&lt;/p&gt;

&lt;p&gt;Loki is retained for logs because it integrates tightly with Grafana and is cost-efficient when logs are indexed by labels rather than full text.&lt;/p&gt;

&lt;p&gt;Tempo is retained for traces because it is designed for high-volume trace storage with object storage and works well with OpenTelemetry.&lt;/p&gt;

&lt;p&gt;Grafana remains the visualization layer because it can unify Prometheus, Loki, and Tempo and supports provisioned dashboards.&lt;/p&gt;

&lt;p&gt;OpenTelemetry Collector becomes more central in V2 because it gives a vendor-neutral telemetry pipeline with batching, redaction, retries, memory limits, and routing. Vendor-neutral means the application does not have to be rewritten if the team later changes storage or visualization tools.&lt;/p&gt;

&lt;p&gt;Alertmanager remains useful for alert grouping, routing, inhibition, and resolved notifications. V2 strengthens it with ownership metadata and escalation policies.&lt;/p&gt;

&lt;p&gt;Node Exporter remains useful for host-level metrics such as CPU, memory, disk, filesystem, and network usage. Blackbox Exporter remains useful for probing public endpoints from outside the application process, because an API can look healthy internally while the public route is broken.&lt;/p&gt;

&lt;p&gt;The custom DORA exporter remains useful as a bridge between GitHub Actions and Prometheus. Its V2 responsibility should be narrowed and made more accurate: export deployment timestamps, workflow duration, deployment result, rollback markers, and incident restoration timestamps instead of relying on approximations.&lt;/p&gt;
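
&lt;p&gt;Once the exporter records explicit deployment results and restoration timestamps, two of the DORA metrics reduce to simple calculations. The record shapes below are hypothetical and exist only to show the arithmetic; the actual exporter fields may differ.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def change_failure_rate(deployments):
    """Share of deployments that needed remediation, such as a rollback."""
    if not deployments:
        return 0.0
    failed = sum(1 for deployment in deployments if deployment["failed"])
    return failed / len(deployments)


def mean_time_to_restore(incidents):
    """Average of (restored_at - started_at) across incidents."""
    if not incidents:
        return timedelta(0)
    durations = (i["restored_at"] - i["started_at"] for i in incidents)
    return sum(durations, timedelta(0)) / len(incidents)


# Hypothetical records for illustration.
deployments = [
    {"sha": "a1", "failed": False},
    {"sha": "b2", "failed": True},   # rolled back
    {"sha": "c3", "failed": False},
    {"sha": "d4", "failed": False},
]
start = datetime(2026, 5, 1, 12, 0)
incidents = [
    {"started_at": start, "restored_at": start + timedelta(minutes=30)},
    {"started_at": start, "restored_at": start + timedelta(minutes=90)},
]

print(change_failure_rate(deployments))  # 0.25
print(mean_time_to_restore(incidents))   # 1:00:00
```

&lt;p&gt;Computing these from recorded timestamps and rollback markers, rather than inferring them from workflow status alone, is what removes the approximation.&lt;/p&gt;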

&lt;p&gt;GitHub Actions remains the CI/CD system because it is already the source of deployment workflow events for the Anvila backend. Keeping it reduces migration risk and allows deployment metadata to feed directly into the DORA exporter and release health gate.&lt;/p&gt;

&lt;p&gt;Slack remains the primary alert destination because it is where the team already collaborates during incidents. Alertmanager should send structured Slack messages with severity, affected service, current value, dashboard link, runbook link, and resolved/firing status. Slack is not the system of record for incidents, but it is effective for fast human response.&lt;/p&gt;

&lt;p&gt;Brevo or another transactional email provider remains useful for lower-urgency notifications and account-related emails. Email should not replace Slack for urgent incidents, but it is useful for user-facing flows, escalation summaries, and non-real-time operational notices.&lt;/p&gt;

&lt;p&gt;Gemini remains the LLM provider for persona generation because the existing application already uses it. In V2, LLM calls should stay behind Celery workers and rate limits because they are slower, more expensive, and more failure-prone than normal API reads.&lt;/p&gt;

&lt;p&gt;Stripe remains the payment provider where payment features are enabled because it is a mature managed payment platform. The system should not store raw card data. Stripe webhooks should be validated, logged with safe event IDs, and monitored as high-impact integration points.&lt;/p&gt;

&lt;p&gt;Terraform remains the infrastructure provisioning tool because the system needs reproducible cloud infrastructure. However, V2 should reduce shell provisioner dependence and move toward immutable images, cloud-init, Ansible, or container orchestration.&lt;/p&gt;

&lt;p&gt;AWS remains a reasonable cloud provider because the existing deployment is already on EC2, and AWS provides the missing production pieces: Secrets Manager, IAM, ALB, S3, RDS, ElastiCache, autoscaling, and private networking. ALB means Application Load Balancer, which distributes traffic across healthy backends. S3 provides durable object storage for logs and traces. RDS provides managed PostgreSQL. ElastiCache provides managed Redis.&lt;/p&gt;

&lt;p&gt;Grafana Mimir is optional but valuable if metrics retention and scale outgrow a single Prometheus server. The trade-off is operational complexity; the benefit is long-term, horizontally scalable Prometheus-compatible metrics storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proposed V2 Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3k4p5hv73skzhihnf14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3k4p5hv73skzhihnf14.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
The diagram separates the system into six readable paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user traffic enters through the public Application Load Balancer and reaches Anvila API replicas;&lt;/li&gt;
&lt;li&gt;slow persona generation and GitHub publishing run in Celery workers instead of blocking user requests;&lt;/li&gt;
&lt;li&gt;application services send metrics, logs, and traces to an internal OTLP endpoint;&lt;/li&gt;
&lt;li&gt;OpenTelemetry Collectors process, redact, batch, and route telemetry to durable storage;&lt;/li&gt;
&lt;li&gt;engineers access dashboards through SSO/MFA/RBAC-protected Grafana rather than direct public APIs;&lt;/li&gt;
&lt;li&gt;GitHub Actions, the DORA exporter, Alertmanager, and the release health policy close the loop between deployment and reliability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;V1 was successful as a learning-stage observability platform because it proved the full reliability loop: telemetry collection, dashboards, SLOs, alerts, runbooks, incident review, and Game Day validation. It used Anvila as the first real service, which made the work concrete and testable. It was not production-grade because it depended on a single monitoring host, weak identity boundaries, mutable server state, partial DORA accuracy, incomplete data hygiene, and service-specific configuration.&lt;/p&gt;

&lt;p&gt;V2 keeps the strongest V1 decisions: LGTM, OpenTelemetry, provisioned dashboards, SLO burn-rate alerting, and DORA visibility. It replaces the fragile parts with highly available collectors, durable telemetry storage, identity-aware access, secrets management, release health gates, and structured ownership metadata. The result is an observability platform that can support Anvila as a real user-facing product and also onboard other applications through the same platform model.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>observability</category>
      <category>architecture</category>
      <category>sre</category>
    </item>
    <item>
      <title>Building a Production-Grade Observability Platform for the Anvila API with LGTM, SLOs, DORA Metrics, and Game Day Testing</title>
      <dc:creator>Adeolu </dc:creator>
      <pubDate>Tue, 19 May 2026 00:08:05 +0000</pubDate>
      <link>https://dev.to/adeolu102/building-a-production-grade-observability-platform-for-the-anvila-api-with-lgtm-slos-dora-5fnc</link>
      <guid>https://dev.to/adeolu102/building-a-production-grade-observability-platform-for-the-anvila-api-with-lgtm-slos-dora-5fnc</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;For the HNG DevOps Stage 6 task, our team built a production-grade observability and reliability platform for the Anvila API.&lt;/p&gt;

&lt;p&gt;The goal was not just to check whether a server was up or down. We needed to build a monitoring system that could help a team understand the health of an application from different angles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the application available?&lt;/li&gt;
&lt;li&gt;Is it fast enough for users?&lt;/li&gt;
&lt;li&gt;Are requests failing?&lt;/li&gt;
&lt;li&gt;Is the server under pressure?&lt;/li&gt;
&lt;li&gt;Are deployments healthy?&lt;/li&gt;
&lt;li&gt;Can engineers investigate logs, metrics, and traces from one place?&lt;/li&gt;
&lt;li&gt;Can alerts reach the team with enough context to act quickly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To achieve this, we used the LGTM observability stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loki for logs&lt;/li&gt;
&lt;li&gt;Grafana for dashboards&lt;/li&gt;
&lt;li&gt;Tempo for traces&lt;/li&gt;
&lt;li&gt;Prometheus for metrics (the "M" in Grafana's LGTM naming stands for Mimir; in this setup Prometheus fills that role)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also added Alertmanager for alert routing, Node Exporter for server metrics, Blackbox Exporter for uptime probing, OpenTelemetry Collector for logs and traces, and a custom GitHub Actions DORA exporter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus targets all up
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogdcls0fkvl3rsn9qosi.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogdcls0fkvl3rsn9qosi.PNG" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;

&lt;p&gt;Our architecture uses a separate monitoring server. The Anvila API continues to run on its application server, while the monitoring stack runs on a dedicated AWS EC2 instance.&lt;/p&gt;

&lt;p&gt;The monitoring server collects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system metrics from Node Exporter&lt;/li&gt;
&lt;li&gt;uptime and response time metrics from Blackbox Exporter&lt;/li&gt;
&lt;li&gt;application metrics from the FastAPI &lt;code&gt;/metrics&lt;/code&gt; endpoint&lt;/li&gt;
&lt;li&gt;logs from the app server through OpenTelemetry Collector and Loki&lt;/li&gt;
&lt;li&gt;traces from the instrumented FastAPI staging service through OpenTelemetry and Tempo&lt;/li&gt;
&lt;li&gt;CI/CD metrics from GitHub Actions through a custom DORA exporter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grafana connects all these data sources together so that we can view metrics, logs, and traces in one place.&lt;/p&gt;

&lt;h3&gt;
  
  
  LGTM dashboards list
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ttw06j86uoau4d2e4w8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ttw06j86uoau4d2e4w8.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a high level, the data flow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Anvila API -&amp;gt; OpenTelemetry Collector -&amp;gt; Loki / Tempo
Anvila API /metrics -&amp;gt; Prometheus
Node Exporter -&amp;gt; Prometheus
Blackbox Exporter -&amp;gt; Prometheus
GitHub Actions exporter -&amp;gt; Prometheus
Prometheus / Loki / Tempo -&amp;gt; Grafana
Prometheus alerts -&amp;gt; Alertmanager -&amp;gt; Slack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why We Chose LGTM Over Managed Alternatives
&lt;/h3&gt;

&lt;p&gt;Managed observability platforms are useful because they reduce operational work. However, for this task, we chose LGTM because we needed to understand and control the full observability pipeline.&lt;/p&gt;

&lt;p&gt;LGTM gave us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;full control over metrics, logs, and traces&lt;/li&gt;
&lt;li&gt;version-controlled configuration&lt;/li&gt;
&lt;li&gt;no dependency on a paid managed observability provider&lt;/li&gt;
&lt;li&gt;a better learning experience for how observability systems actually work&lt;/li&gt;
&lt;li&gt;the ability to provision dashboards, alert rules, and data sources as code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prometheus is strong for metrics and alerting. Loki is useful for storing logs in a way that works well with Grafana. Tempo stores distributed traces and helps us understand request flow. Grafana brings all three together in one interface.&lt;/p&gt;

&lt;p&gt;This made LGTM a good fit for a platform engineering task where the goal was to build the stack from the ground up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure as Code and Systemd Deployment
&lt;/h3&gt;

&lt;p&gt;One important requirement was that the stack should be reproducible. We used Terraform to provision the monitoring EC2 server and upload the configuration files.&lt;/p&gt;

&lt;p&gt;The services are installed and managed with systemd instead of Docker. This was important because the task update said we should not install Docker on the server.&lt;/p&gt;

&lt;p&gt;The stack includes systemd services for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus&lt;/li&gt;
&lt;li&gt;Grafana&lt;/li&gt;
&lt;li&gt;Loki&lt;/li&gt;
&lt;li&gt;Tempo&lt;/li&gt;
&lt;li&gt;Alertmanager&lt;/li&gt;
&lt;li&gt;Node Exporter&lt;/li&gt;
&lt;li&gt;Blackbox Exporter&lt;/li&gt;
&lt;li&gt;OpenTelemetry Collector&lt;/li&gt;
&lt;li&gt;DORA exporter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repository contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terraform files&lt;/li&gt;
&lt;li&gt;Prometheus configuration&lt;/li&gt;
&lt;li&gt;Alertmanager configuration&lt;/li&gt;
&lt;li&gt;Grafana dashboard JSON files&lt;/li&gt;
&lt;li&gt;Loki and Tempo configuration&lt;/li&gt;
&lt;li&gt;OpenTelemetry Collector configuration&lt;/li&gt;
&lt;li&gt;alert rules&lt;/li&gt;
&lt;li&gt;runbooks&lt;/li&gt;
&lt;li&gt;SLI/SLO documentation&lt;/li&gt;
&lt;li&gt;Game Day documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Alert rules YAML
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flavsgpson2f6f3lrtmc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flavsgpson2f6f3lrtmc4.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics Collection
&lt;/h3&gt;

&lt;p&gt;Metrics are numerical measurements that tell us what is happening in the system.&lt;/p&gt;

&lt;p&gt;We collect several types of metrics:&lt;/p&gt;

&lt;h3&gt;
  
  
  Server Metrics
&lt;/h3&gt;

&lt;p&gt;Node Exporter gives us system-level metrics such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU usage&lt;/li&gt;
&lt;li&gt;memory usage&lt;/li&gt;
&lt;li&gt;disk usage&lt;/li&gt;
&lt;li&gt;disk I/O&lt;/li&gt;
&lt;li&gt;network I/O&lt;/li&gt;
&lt;li&gt;system load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics help us understand saturation, which means how full or stressed the system is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node Exporter dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpek4kijrvsq7bbrsuv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpek4kijrvsq7bbrsuv7.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Uptime and HTTP Probe Metrics
&lt;/h3&gt;

&lt;p&gt;Blackbox Exporter probes both the staging and production API URLs.&lt;/p&gt;

&lt;p&gt;It checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether the endpoint is reachable&lt;/li&gt;
&lt;li&gt;whether it returns a successful HTTP response&lt;/li&gt;
&lt;li&gt;how long the response takes&lt;/li&gt;
&lt;li&gt;SSL certificate expiry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is important because it measures the service from the outside, closer to how a user or client would experience it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Blackbox dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fva00ljsag6kofbi3um5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fva00ljsag6kofbi3um5m.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Metrics
&lt;/h3&gt;

&lt;p&gt;We also added app-level Prometheus metrics to the FastAPI staging application through a &lt;code&gt;/metrics&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;p&gt;This gives Prometheus real application request metrics such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http_request_duration_seconds&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http_requests_in_progress&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics help us track request volume, request latency, and request status codes from inside the application itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  App metrics target up
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns4ut8wvz7bp4ffqy11d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns4ut8wvz7bp4ffqy11d.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  App request metrics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft93mpghbv0udo3mbwzg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft93mpghbv0udo3mbwzg2.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  App latency histogram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs9b7b941e4p7zhnng65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs9b7b941e4p7zhnng65.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Logs and Traces
&lt;/h2&gt;

&lt;p&gt;Metrics tell us that something is wrong. Logs and traces help us understand why.&lt;/p&gt;

&lt;p&gt;We used OpenTelemetry Collector to collect application logs and send them to Loki. In Grafana, we can search these logs using Loki queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loki logs
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eygkg19pea248mx5bxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eygkg19pea248mx5bxi.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also instrumented the staging FastAPI service with OpenTelemetry so that it sends traces to Tempo.&lt;/p&gt;

&lt;p&gt;A trace shows the path of a request through the application. This is very useful when investigating slow requests, because it helps identify which endpoint or operation caused the delay.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tempo traces
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxm7qyg8oxjvpu52wu3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxm7qyg8oxjvpu52wu3c.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Unified Observability dashboard combines metrics, logs, and traces. The idea is that if latency or error rate increases, an engineer can look at the metric, check related logs in Loki, and then inspect traces in Tempo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unified observability dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbseuiewhomi7l896jxoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbseuiewhomi7l896jxoa.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Golden Signals
&lt;/h2&gt;

&lt;p&gt;The Four Golden Signals are a simple way to define service reliability from a user-focused perspective.&lt;/p&gt;

&lt;p&gt;They are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Traffic&lt;/li&gt;
&lt;li&gt;Errors&lt;/li&gt;
&lt;li&gt;Saturation&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;Latency means how long it takes the service to respond.&lt;/p&gt;

&lt;p&gt;For Anvila, we track latency using both Blackbox response time and application request duration metrics.&lt;/p&gt;

&lt;p&gt;Example PromQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(
  0.95,
  sum by (le, path) (
    rate(http_request_duration_seconds_bucket{job="anvila-api-staging",status!~"5.."}[5m])
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells us the 95th percentile request latency per path over the last 5 minutes, excluding requests that returned a 5xx status.&lt;/p&gt;
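
&lt;p&gt;To make the percentile less of a black box, here is a simplified Python sketch of the linear interpolation that &lt;code&gt;histogram_quantile&lt;/code&gt; applies to cumulative buckets. The bucket boundaries and counts are made up for illustration, and the function leaves out edge cases that Prometheus handles.&lt;/p&gt;

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile (q in (0, 1]) from cumulative buckets.

    Each bucket is (upper_bound, cumulative_count); the last bucket must be
    the +Inf bucket, as in a Prometheus histogram.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies in the +Inf bucket: report the highest
                # finite bound instead of interpolating towards infinity.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical latency buckets in seconds, not taken from the Anvila service.
sample = [(0.1, 50), (0.25, 80), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, sample))
```

&lt;p&gt;Because the result is interpolated inside a bucket, its accuracy depends on how well the bucket boundaries match the real latency distribution.&lt;/p&gt;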

&lt;h3&gt;
  
  
  Traffic
&lt;/h3&gt;

&lt;p&gt;Traffic means how much demand the service is receiving.&lt;/p&gt;

&lt;p&gt;Example PromQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(http_requests_total{job="anvila-api-staging"}[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells us the number of requests per second, averaged over the last 5 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Errors
&lt;/h3&gt;

&lt;p&gt;Errors measure how many requests are failing.&lt;/p&gt;

&lt;p&gt;Example PromQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(http_requests_total{job="anvila-api-staging",status=~"5.."}[5m]))
/
clamp_min(sum(rate(http_requests_total{job="anvila-api-staging"}[5m])), 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



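&lt;p&gt;This gives us the fraction of requests that returned a 5xx status. The &lt;code&gt;clamp_min&lt;/code&gt; call keeps the denominator from being zero when there is no traffic.&lt;/p&gt;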
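
&lt;p&gt;One detail of the query above is worth checking against real traffic levels: a floor of 1 on the denominator also changes the result whenever total traffic is below 1 request per second. The Python sketch below mirrors the arithmetic with made-up numbers; the function name and the alternative floor value are illustrative assumptions, not part of the Anvila rules.&lt;/p&gt;

```python
def error_ratio(errors_per_s: float, total_per_s: float, floor: float) -> float:
    """Mirror errors / clamp_min(total, floor) from the PromQL query."""
    return errors_per_s / max(total_per_s, floor)


# Low traffic: 0.1 requests/s, half of them failing.
print(error_ratio(0.05, 0.1, floor=1))     # denominator is raised to 1
print(error_ratio(0.05, 0.1, floor=1e-9))  # denominator stays at 0.1

# No traffic at all: the floor prevents a division by zero.
print(error_ratio(0.0, 0.0, floor=1))
```

&lt;p&gt;With a floor of 1, the low-traffic case reports a much smaller ratio than the true 50% failure share, so a smaller floor may be a better fit for a lightly used staging service.&lt;/p&gt;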

&lt;h3&gt;
  
  
  Saturation
&lt;/h3&gt;

&lt;p&gt;Saturation shows how full the system is.&lt;/p&gt;

&lt;p&gt;For example, CPU usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This goes beyond simply checking if the server is alive. A server can be up but overloaded. The Four Golden Signals help us understand the actual quality of the service.&lt;/p&gt;
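
&lt;p&gt;The CPU usage query above boils down to simple counter arithmetic. As a rough sketch for a single core, with hypothetical sample values:&lt;/p&gt;

```python
def cpu_usage_percent(idle_start: float, idle_end: float, interval_s: float) -> float:
    """CPU usage for one core from two samples of its idle-seconds counter."""
    idle_rate = (idle_end - idle_start) / interval_s  # fraction of time idle
    return 100 - idle_rate * 100


# Hypothetical samples: the idle counter grew by 225 s over a 300 s window.
print(cpu_usage_percent(idle_start=1000.0, idle_end=1225.0, interval_s=300.0))  # 25.0
```

&lt;p&gt;The PromQL version additionally averages this value across all cores of each instance.&lt;/p&gt;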

&lt;h2&gt;
  
  
  SLOs and Error Budgets
&lt;/h2&gt;

&lt;p&gt;An SLI is a Service Level Indicator. It is a measurement.&lt;/p&gt;

&lt;p&gt;An SLO is a Service Level Objective. It is the target we want to meet.&lt;/p&gt;

&lt;p&gt;An error budget is the amount of unreliability we are allowed within a time window.&lt;/p&gt;

&lt;p&gt;For Anvila, our main availability SLO is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;99.5% successful probes over 30 days
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error budget calculation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(1 - 0.995) * 30 days = 0.15 days = 3.6 hours
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the service can be unavailable for about 3 hours and 36 minutes in a 30-day window before the error budget is fully consumed.&lt;/p&gt;
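
&lt;p&gt;The same arithmetic can be sketched in Python. This is only an illustration: the function name &lt;code&gt;error_budget&lt;/code&gt; and its parameters are hypothetical and not taken from the Anvila repository.&lt;/p&gt;

```python
from datetime import timedelta


def error_budget(slo: float, window_days: int) -> timedelta:
    """Return the allowed downtime for an availability SLO over a window."""
    allowed_fraction = 1 - slo  # e.g. 1 - 0.995 = 0.005
    seconds = round(allowed_fraction * window_days * 24 * 60 * 60)
    return timedelta(seconds=seconds)


budget = error_budget(slo=0.995, window_days=30)
print(budget)  # 3:36:00
```

&lt;p&gt;Changing the target shows how quickly the budget shrinks: at 99.9% over the same window, the allowance drops to about 43 minutes.&lt;/p&gt;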

&lt;p&gt;The SLO dashboard shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30-day availability SLI&lt;/li&gt;
&lt;li&gt;SLO target&lt;/li&gt;
&lt;li&gt;error budget remaining&lt;/li&gt;
&lt;li&gt;error budget time remaining&lt;/li&gt;
&lt;li&gt;burn rate&lt;/li&gt;
&lt;li&gt;7-day and 30-day compliance history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SLO and Error Budget dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfmuvz6ssqpw9xj0ye0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfmuvz6ssqpw9xj0ye0b.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Burn Rate Alerting
&lt;/h2&gt;

&lt;p&gt;Burn rate tells us how quickly the service is consuming its error budget.&lt;/p&gt;

&lt;p&gt;This is better than alerting on every individual failure because it pages only when the error budget is actually at risk, which reduces alert fatigue.&lt;/p&gt;

&lt;p&gt;For example, a brief, minor failure may not need to wake everyone up. But if the service is burning through its error budget very quickly, the team should respond immediately.&lt;/p&gt;

&lt;p&gt;We configured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast Burn alert: critical&lt;/li&gt;
&lt;li&gt;Slow Burn alert: warning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Fast Burn alert means the system is consuming the error budget too quickly and needs immediate action.&lt;/p&gt;

&lt;p&gt;The Slow Burn alert means the budget is being consumed more gradually, but the problem may become a serious incident if the trend continues.&lt;/p&gt;
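
&lt;p&gt;As a rough sketch of the idea, burn rate is the observed error ratio divided by the error ratio the SLO allows. A burn rate of 1 uses up the budget exactly at the end of the window. The thresholds below are placeholders chosen for illustration; this article does not state the values used in the actual alert rules.&lt;/p&gt;

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    return error_ratio / (1 - slo)


def days_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Days a full budget lasts if the current burn rate continues."""
    return window_days / rate


# Hypothetical thresholds, for illustration only.
FAST_BURN = 10.0
SLOW_BURN = 2.0


def classify(rate: float) -> str:
    """Map a burn rate to the severity of the alert it would raise."""
    if rate >= FAST_BURN:
        return "critical"
    if rate >= SLOW_BURN:
        return "warning"
    return "ok"


rate = burn_rate(error_ratio=0.10, slo=0.995)  # 10% of probes failing
print(round(rate, 1), classify(rate), round(days_until_exhausted(rate), 1))
```

&lt;p&gt;With 10% of probes failing against a 99.5% target, the budget for the whole 30-day window would be gone in about a day and a half, which is why a fast burn is treated as critical.&lt;/p&gt;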

&lt;h2&gt;
  
  
  DORA Metrics
&lt;/h2&gt;

&lt;p&gt;DORA metrics help connect engineering work to business outcomes.&lt;/p&gt;

&lt;p&gt;The four DORA metrics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment Frequency&lt;/li&gt;
&lt;li&gt;Lead Time for Changes&lt;/li&gt;
&lt;li&gt;Change Failure Rate&lt;/li&gt;
&lt;li&gt;Mean Time to Restore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics help answer questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How often do we deploy?&lt;/li&gt;
&lt;li&gt;How long does it take for code to reach production?&lt;/li&gt;
&lt;li&gt;How often do deployments fail?&lt;/li&gt;
&lt;li&gt;How quickly can we recover from incidents?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We built a GitHub Actions DORA exporter that exposes deployment workflow data to Prometheus.&lt;/p&gt;
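
&lt;p&gt;The calculations behind such an exporter can be sketched as follows. This is not the code from the repository: the &lt;code&gt;WorkflowRun&lt;/code&gt; record and its fields are hypothetical, and the sketch assumes at least one run in the window.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class WorkflowRun:
    """Hypothetical record of one deployment workflow run."""
    commit_time: datetime
    completed_time: datetime
    succeeded: bool


def deployment_frequency(runs: list[WorkflowRun], window_days: int) -> float:
    """Successful deployments per day over the window."""
    return sum(r.succeeded for r in runs) / window_days


def change_failure_rate(runs: list[WorkflowRun]) -> float:
    """Share of deployment runs that failed."""
    return sum(not r.succeeded for r in runs) / len(runs)


def lead_time(runs: list[WorkflowRun]) -> timedelta:
    """Median commit-to-workflow-completion time of successful runs."""
    return median(r.completed_time - r.commit_time for r in runs if r.succeeded)


# Made-up sample data: three runs on consecutive days, one of them failing.
t0 = datetime(2026, 6, 1, 12, 0)
runs = [
    WorkflowRun(t0, t0 + timedelta(minutes=10), True),
    WorkflowRun(t0 + timedelta(days=1), t0 + timedelta(days=1, minutes=20), False),
    WorkflowRun(t0 + timedelta(days=2), t0 + timedelta(days=2, minutes=30), True),
]
print(deployment_frequency(runs, window_days=7))
print(change_failure_rate(runs))
print(lead_time(runs))
```

&lt;p&gt;An exporter would publish values like these as Prometheus gauges so that Grafana can chart them over time.&lt;/p&gt;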

&lt;p&gt;The dashboard shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deployment frequency&lt;/li&gt;
&lt;li&gt;lead time&lt;/li&gt;
&lt;li&gt;change failure rate&lt;/li&gt;
&lt;li&gt;MTTR&lt;/li&gt;
&lt;li&gt;deployment frequency classification&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DORA dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgy5jqcv8xbt64q4p58gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgy5jqcv8xbt64q4p58gb.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our current DORA implementation measures commit-to-workflow-completion lead time. A future improvement would be to break it into more detailed sub-intervals such as commit time, workflow trigger time, workflow completion time, and deployment confirmation time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alerting and Slack Notifications
&lt;/h2&gt;

&lt;p&gt;Alert rules are stored in version-controlled YAML files, not manually created in the Grafana UI.&lt;/p&gt;

&lt;p&gt;We configured alerts for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;host downtime&lt;/li&gt;
&lt;li&gt;high CPU&lt;/li&gt;
&lt;li&gt;high memory&lt;/li&gt;
&lt;li&gt;disk usage&lt;/li&gt;
&lt;li&gt;SLO burn rate&lt;/li&gt;
&lt;li&gt;change failure rate&lt;/li&gt;
&lt;li&gt;MTTR threshold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alertmanager handles routing and sends alerts to Slack.&lt;/p&gt;

&lt;p&gt;The Alertmanager configuration includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;route grouping by alert name, service, and severity&lt;/li&gt;
&lt;li&gt;warning and critical routes&lt;/li&gt;
&lt;li&gt;inhibition rules&lt;/li&gt;
&lt;li&gt;Slack receiver configuration&lt;/li&gt;
&lt;/ul&gt;
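
&lt;p&gt;Route grouping is easier to picture with a small example. The Python sketch below imitates the effect of grouping by alert name, service, and severity: alerts that share all three labels end up in one notification. It illustrates the concept and is not Alertmanager code; the sample alerts are invented.&lt;/p&gt;

```python
from collections import defaultdict


def group_alerts(alerts: list[dict]) -> dict[tuple, list[dict]]:
    """Bucket alerts by grouping labels; each bucket becomes one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = (alert["alertname"], alert["service"], alert["severity"])
        groups[key].append(alert)
    return dict(groups)


# Invented sample alerts: two hosts firing the same warning, plus one critical.
alerts = [
    {"alertname": "HighCPU", "service": "api", "severity": "warning", "host": "a"},
    {"alertname": "HighCPU", "service": "api", "severity": "warning", "host": "b"},
    {"alertname": "HostDown", "service": "api", "severity": "critical", "host": "c"},
]
for key, members in group_alerts(alerts).items():
    print(key, len(members))
```

&lt;p&gt;Here three alerts collapse into two Slack messages, which keeps a noisy incident readable.&lt;/p&gt;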

&lt;h3&gt;
  
  
  Alertmanager config
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0ixouxdi1gqxqt0th8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0ixouxdi1gqxqt0th8f.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also used a structured Slack template. The alert payload includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;alert name&lt;/li&gt;
&lt;li&gt;severity&lt;/li&gt;
&lt;li&gt;affected host or target&lt;/li&gt;
&lt;li&gt;current value&lt;/li&gt;
&lt;li&gt;dashboard link&lt;/li&gt;
&lt;li&gt;runbook link&lt;/li&gt;
&lt;li&gt;firing or resolved status&lt;/li&gt;
&lt;/ul&gt;
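
&lt;p&gt;To show how these fields fit together, here is a small Python sketch that renders one alert as message text. The real template is an Alertmanager template, not Python; the field names and the example.com URLs below are placeholders.&lt;/p&gt;

```python
def format_slack_alert(alert: dict) -> str:
    """Render one alert as Slack message text (field names are hypothetical)."""
    status = alert["status"].upper()  # "FIRING" or "RESOLVED"
    lines = [
        f"[{status}] {alert['name']} ({alert['severity']})",
        f"Target: {alert['target']}",
        f"Current value: {alert['value']}",
        f"Dashboard: {alert['dashboard_url']}",
        f"Runbook: {alert['runbook_url']}",
    ]
    return "\n".join(lines)


# Placeholder alert; none of these values come from the Anvila setup.
example = {
    "status": "firing",
    "name": "HighCPU",
    "severity": "warning",
    "target": "monitoring-server",
    "value": "92%",
    "dashboard_url": "https://example.com/dashboard",
    "runbook_url": "https://example.com/runbook",
}
print(format_slack_alert(example))
```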

&lt;h3&gt;
  
  
  Slack template
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9oh0vfs9xfbakyq0nhwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9oh0vfs9xfbakyq0nhwh.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of a firing and resolved Slack notification:&lt;/p&gt;

&lt;h3&gt;
  
  
  Slack firing and resolved alert
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mfm70esbf4n5fty0tv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mfm70esbf4n5fty0tv3.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Runbooks and Incident Management
&lt;/h2&gt;

&lt;p&gt;Every alert needs a runbook. A runbook explains what the alert means and what to do first.&lt;/p&gt;

&lt;p&gt;Our runbooks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what the alert means&lt;/li&gt;
&lt;li&gt;likely causes&lt;/li&gt;
&lt;li&gt;first investigation steps&lt;/li&gt;
&lt;li&gt;how to resolve it&lt;/li&gt;
&lt;li&gt;when to roll back&lt;/li&gt;
&lt;li&gt;when to escalate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example runbook:&lt;/p&gt;

&lt;p&gt;SLO burn rate runbook:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11y5o0hk8zhbnq5ykvhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11y5o0hk8zhbnq5ykvhm.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also documented a blameless post-incident review for the simulated latency incident.&lt;/p&gt;

&lt;p&gt;The goal of a blameless PIR is not to blame a person. The goal is to understand what happened, what the impact was, what went well, what did not go well, and what should be improved.&lt;/p&gt;
&lt;h2&gt;
  
  
  Game Day Testing
&lt;/h2&gt;

&lt;p&gt;Game Day is where we intentionally create controlled failures to test whether our monitoring and response system works.&lt;/p&gt;

&lt;p&gt;We ran three scenarios.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 1: Deployment Failure
&lt;/h3&gt;

&lt;p&gt;We created a temporary PR with an intentionally failing GitHub Actions workflow.&lt;/p&gt;

&lt;p&gt;The goal was to prove that a deployment or pipeline failure can be detected without touching production.&lt;/p&gt;

&lt;p&gt;The PR was closed without merging after screenshots were captured.&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day deployment failure PR
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy258ugy7gsskyioqsl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy258ugy7gsskyioqsl3.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day deployment failure check
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hblklpm7oz471w30ku3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hblklpm7oz471w30ku3.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 2: Latency Injection
&lt;/h3&gt;

&lt;p&gt;During the latency Game Day, we temporarily added a 2-second delay to the staging root endpoint and generated test requests. The evidence below shows the resulting Tempo traces and the recovery after the delay was removed.&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day latency traces
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2955gcm0spcpjuixm5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2955gcm0spcpjuixm5d.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day latency recovery
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9ovatvpqvp2geec4o0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9ovatvpqvp2geec4o0f.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 3: Resource Pressure
&lt;/h3&gt;

&lt;p&gt;We created CPU pressure on the monitoring server using controlled background processes.&lt;/p&gt;

&lt;p&gt;This triggered a CPU warning alert in Slack. After stopping the pressure, we confirmed the resolved alert was also sent.&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day resource pressure trigger
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi704l8a6fd2hgaqwfmk4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi704l8a6fd2hgaqwfmk4.png" alt=" " width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day resource pressure Slack alert
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm15yy0iujs8vurm688wd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm15yy0iujs8vurm688wd.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Game Day resource pressure recovery
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35lxsbkd4a3xdvfrtcw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35lxsbkd4a3xdvfrtcw7.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Toil Identified and Automation
&lt;/h2&gt;

&lt;p&gt;Toil is repetitive manual work that can be automated.&lt;/p&gt;

&lt;p&gt;We identified several examples of toil:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Manual Dashboard Creation
&lt;/h3&gt;

&lt;p&gt;Creating dashboards manually in Grafana is slow and hard to reproduce.&lt;/p&gt;

&lt;p&gt;Automation: All dashboards are stored as JSON and provisioned automatically.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Manual Monitoring Server Setup
&lt;/h3&gt;

&lt;p&gt;Installing each service manually would be time-consuming and error-prone.&lt;/p&gt;

&lt;p&gt;Automation: Terraform creates the monitoring server, and systemd installation scripts configure the observability stack.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Vague Alert Investigation
&lt;/h3&gt;

&lt;p&gt;If an alert only says "CPU high", the engineer still has to search for the server, dashboard, and runbook.&lt;/p&gt;

&lt;p&gt;Automation: Alertmanager Slack messages include service, host, metric value, dashboard link, and runbook link.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Manual CI/CD Metrics Collection
&lt;/h3&gt;

&lt;p&gt;Checking GitHub Actions manually does not scale.&lt;/p&gt;

&lt;p&gt;Automation: The DORA exporter pulls GitHub Actions data and exposes it as Prometheus metrics.&lt;/p&gt;
&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;This project helped us understand that observability is more than installing tools.&lt;/p&gt;

&lt;p&gt;The tools are only useful when they are connected to reliability goals.&lt;/p&gt;

&lt;p&gt;Prometheus helped us collect and alert on metrics. Loki helped us inspect logs. Tempo helped us understand request traces. Grafana helped us bring everything together. Alertmanager helped us route actionable alerts to Slack.&lt;/p&gt;

&lt;p&gt;We also learned that good reliability engineering requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clear SLIs&lt;/li&gt;
&lt;li&gt;realistic SLOs&lt;/li&gt;
&lt;li&gt;error budgets&lt;/li&gt;
&lt;li&gt;runbooks&lt;/li&gt;
&lt;li&gt;incident reviews&lt;/li&gt;
&lt;li&gt;controlled failure testing&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Current Limitations and Next Steps
&lt;/h2&gt;

&lt;p&gt;The platform is functional, but there are still improvements we can make.&lt;/p&gt;

&lt;p&gt;Current limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;App-level &lt;code&gt;/metrics&lt;/code&gt; is currently verified only on staging.&lt;/li&gt;
&lt;li&gt;Production is monitored through Blackbox probes and shared host-level metrics.&lt;/li&gt;
&lt;li&gt;DORA lead time is currently simplified to commit-to-workflow-completion.&lt;/li&gt;
&lt;li&gt;Unmatched routes are not yet normalized in Prometheus metric labels, which carries a high-cardinality risk. A follow-up PR has been opened to fix this.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;promote app-level metrics to production after staging validation&lt;/li&gt;
&lt;li&gt;merge the metrics label normalization improvement&lt;/li&gt;
&lt;li&gt;add more detailed DORA lead-time sub-intervals&lt;/li&gt;
&lt;li&gt;add deployment annotations to Grafana dashboards&lt;/li&gt;
&lt;li&gt;continue reviewing SLOs as the service matures&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For this task, we built a complete observability platform around the Anvila API using the LGTM stack.&lt;/p&gt;

&lt;p&gt;We deployed the monitoring stack with Terraform and systemd, collected metrics, logs, and traces, built dashboards, configured Slack alerts, wrote runbooks, and tested the system through Game Day simulations.&lt;/p&gt;

&lt;p&gt;The biggest lesson is that observability is not just about knowing when something is broken. It is about giving the team enough context to understand the problem, respond quickly, and improve the system after every incident.&lt;/p&gt;

&lt;p&gt;Repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://github.com/Adeolu1024/anvila-observability-platform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Evidence screenshots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://drive.google.com/drive/folders/1v_DiT6XQtq0iJtn5JqxfZecAmxhCdd82?usp=sharing```


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>devops</category>
      <category>graphana</category>
      <category>sre</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Building SwiftDeploy: A Declarative Infrastructure CLI with Observability and Policy Enforcement</title>
      <dc:creator>Adeolu </dc:creator>
      <pubDate>Wed, 06 May 2026 20:54:03 +0000</pubDate>
      <link>https://dev.to/adeolu102/building-swiftdeploy-a-declarative-infrastructure-cli-with-observability-and-policy-enforcement-4g8o</link>
      <guid>https://dev.to/adeolu102/building-swiftdeploy-a-declarative-infrastructure-cli-with-observability-and-policy-enforcement-4g8o</guid>
      <description>&lt;p&gt;What Is This Project?&lt;/p&gt;

&lt;p&gt;SwiftDeploy is a command-line tool that automatically sets up and manages web application deployments. Instead of manually configuring Docker containers, Nginx, and monitoring, you write one file (&lt;code&gt;manifest.yaml&lt;/code&gt;) that describes what you want, and the tool builds everything for you.&lt;/p&gt;

&lt;p&gt;The project was built in two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 4A: Basic infrastructure generation and container management&lt;/li&gt;
&lt;li&gt;Stage 4B: Monitoring, policy enforcement, and audit logging&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The Core Idea: Declarative Configuration&lt;/p&gt;

&lt;p&gt;In traditional DevOps, you manually write configuration files for each service. With SwiftDeploy, you write a single manifest file, and the tool generates all the configuration files automatically.&lt;/p&gt;

&lt;p&gt;manifest.yaml (the only file you edit manually):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-keeds-api:v1.0.0&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-service&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stable&lt;/span&gt;

&lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:alpine&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;proxy_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;

&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-net&lt;/span&gt;
  &lt;span class="na"&gt;driver_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From this one file, SwiftDeploy generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nginx.conf&lt;/code&gt; (web server configuration)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker-compose.yml&lt;/code&gt; (container orchestration)&lt;/li&gt;
&lt;li&gt;All the settings for monitoring and policy checks&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;How the Tool Works&lt;/p&gt;

&lt;p&gt;The CLI tool (&lt;code&gt;swiftdeploy&lt;/code&gt;) has several commands:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;init&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reads manifest.yaml and generates nginx.conf + docker-compose.yml&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;validate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Checks if everything is ready for deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deploy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Starts all containers and waits for them to be healthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;promote canary/stable&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switches between stable and canary modes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shows a live dashboard with metrics and policy compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;audit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generates a report of all events and policy violations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;teardown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stops and removes all containers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
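
&lt;p&gt;The command layout in the table could be wired up with &lt;code&gt;argparse&lt;/code&gt; subcommands. This is only a sketch of the shape, assuming the CLI is written in Python; the handlers themselves are left out.&lt;/p&gt;

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare one subcommand per row of the command table."""
    parser = argparse.ArgumentParser(prog="swiftdeploy")
    sub = parser.add_subparsers(dest="command", required=True)

    # Commands that take no extra arguments.
    for name in ("init", "validate", "deploy", "status", "audit", "teardown"):
        sub.add_parser(name)

    # promote needs a target mode: canary or stable.
    promote = sub.add_parser("promote")
    promote.add_argument("mode", choices=["canary", "stable"])
    return parser


args = build_parser().parse_args(["promote", "canary"])
print(args.command, args.mode)
```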




&lt;p&gt;Stage 4A: The Foundation&lt;/p&gt;

&lt;p&gt;API Service&lt;/p&gt;

&lt;p&gt;The API service is a Python Flask application that serves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /&lt;/code&gt; — Welcome message with current mode and version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /healthz&lt;/code&gt; — Health check endpoint&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /chaos&lt;/code&gt; — Simulates problems for testing (only in canary mode)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nginx Proxy&lt;/p&gt;

&lt;p&gt;Nginx acts as a reverse proxy, routing all traffic to the API service. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Listens on port 8080 (configurable)&lt;/li&gt;
&lt;li&gt;Forwards requests to the API service&lt;/li&gt;
&lt;li&gt;Returns JSON error responses for 502, 503, and 504 errors&lt;/li&gt;
&lt;li&gt;Logs all requests in a specific format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker Compose&lt;/p&gt;

&lt;p&gt;Docker Compose manages all containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API service (your application)&lt;/li&gt;
&lt;li&gt;Nginx (web server/proxy)&lt;/li&gt;
&lt;li&gt;OPA (policy engine, added in Stage 4B)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Stage 4B: Observability and Policy Enforcement&lt;/p&gt;

&lt;p&gt;The /metrics Endpoint&lt;/p&gt;

&lt;p&gt;The API service now exposes a &lt;code&gt;/metrics&lt;/code&gt; endpoint that reports statistics in Prometheus format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/healthz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="n"&gt;http_request_duration_seconds_bucket&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;le&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"0.1"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;
&lt;span class="n"&gt;app_uptime_seconds&lt;/span&gt; &lt;span class="mi"&gt;847&lt;/span&gt;
&lt;span class="n"&gt;app_mode&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;chaos_active&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These metrics tell you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many requests have been made&lt;/li&gt;
&lt;li&gt;How fast responses are&lt;/li&gt;
&lt;li&gt;How long the app has been running&lt;/li&gt;
&lt;li&gt;Whether you're in stable or canary mode&lt;/li&gt;
&lt;li&gt;Whether chaos testing is active&lt;/li&gt;
&lt;/ul&gt;
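
&lt;p&gt;To show how a client might read this output, here is a minimal parser for the sample lines above. It is a simplified sketch: it ignores comment lines and does not handle escaped quotes in label values or optional timestamps, which the full Prometheus text format allows.&lt;/p&gt;

```python
import re

# metric_name{label="value",...} value
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')

METRICS_TEXT = """\
http_requests_total{method="GET",path="/healthz",status_code="200"} 42
http_request_duration_seconds_bucket{le="0.1"} 35
app_uptime_seconds 847
app_mode 0
chaos_active 0
"""


def parse_metrics(text: str) -> list:
    """Return a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip anything this simplified pattern cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(METRICS_TEXT)
for name, labels, value in samples:
    print(name, labels, value)
```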

&lt;p&gt;OPA: The Policy Engine&lt;/p&gt;

&lt;p&gt;OPA (Open Policy Agent) is a separate container that acts like a security guard. Before you can deploy or promote, the CLI asks OPA: "Is it safe?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why use OPA instead of checking directly in the CLI?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Policies are separate from code — easier to update&lt;/li&gt;
&lt;li&gt;If OPA crashes, the CLI still works (just warns you)&lt;/li&gt;
&lt;li&gt;OPA is not accessible from the internet (security)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Two Policies&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure Policy&lt;/strong&gt; (checks before deploy):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is there enough disk space? (must be &amp;gt; 10GB)&lt;/li&gt;
&lt;li&gt;Is the CPU overloaded? (CPU load must be &amp;lt; 2.0)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Canary Safety Policy&lt;/strong&gt; (checks before promoting to canary):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the error rate too high? (must be &amp;lt; 1%)&lt;/li&gt;
&lt;li&gt;Is the response time too slow? (P99 must be &amp;lt; 500ms)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data-Driven Thresholds&lt;/p&gt;

&lt;p&gt;The actual numbers (10GB, 2.0, 1%, 500ms) are stored in a separate JSON file, not in the policy code. This means you can change the limits without modifying the policy logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;thresholds.json:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"infrastructure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"min_disk_gb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_cpu_load"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"canary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_error_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_p99_latency_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Status Dashboard&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;swiftdeploy status&lt;/code&gt; command shows a live dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╔═══════════════════════════════════════╗
║     SwiftDeploy Status Dashboard      ║
╠═══════════════════════════════════════╣
║ Mode: canary                         ║
║ Chaos: none                          ║
║ Req/s: 0.98                          ║
║ P99 Latency: 5ms                     ║
║ Error Rate: 0.00%                    ║
║ Uptime: 133s                         ║
╠═══════════════════════════════════════╣
║ Policy Compliance                    ║
║   Infrastructure: PASS               ║
║   Canary Safety:  PASS               ║
╚═══════════════════════════════════════╝
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every time the dashboard refreshes, it saves the data to &lt;code&gt;history.jsonl&lt;/code&gt; for the audit trail.&lt;/p&gt;

&lt;p&gt;The Audit Report&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;swiftdeploy audit&lt;/code&gt; command reads &lt;code&gt;history.jsonl&lt;/code&gt; and generates &lt;code&gt;audit_report.md&lt;/code&gt; with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A timeline of all events (mode changes, status updates)&lt;/li&gt;
&lt;li&gt;A list of policy violations (when checks failed)&lt;/li&gt;
&lt;/ul&gt;
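
&lt;p&gt;The history-then-report flow can be sketched in a few lines. The field names (&lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;mode&lt;/code&gt;, &lt;code&gt;policies&lt;/code&gt;) and the report layout are assumptions for illustration; the real &lt;code&gt;history.jsonl&lt;/code&gt; schema may differ.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def append_snapshot(history: Path, snapshot: dict) -> None:
    """Append one dashboard refresh as a single JSON line."""
    with history.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(snapshot) + "\n")


def build_report(history: Path) -> str:
    """Turn the JSONL history into a Markdown timeline plus violations."""
    lines = ["# Audit Report", "", "## Timeline"]
    violations = []
    for raw in history.read_text(encoding="utf-8").splitlines():
        if not raw.strip():
            continue
        event = json.loads(raw)
        lines.append(f"- {event['timestamp']}: mode={event['mode']}")
        for policy, verdict in event.get("policies", {}).items():
            if verdict != "PASS":
                violations.append(f"- {event['timestamp']}: {policy} {verdict}")
    lines += ["", "## Policy Violations"] + (violations or ["- none"])
    return "\n".join(lines)


# Hypothetical snapshots written to a throwaway directory.
workdir = Path(tempfile.mkdtemp())
history_file = workdir / "history.jsonl"
append_snapshot(history_file, {"timestamp": 1, "mode": "stable",
                               "policies": {"infrastructure": "PASS"}})
append_snapshot(history_file, {"timestamp": 2, "mode": "canary",
                               "policies": {"canary": "FAIL"}})
report = build_report(history_file)
print(report)
```

&lt;p&gt;Appending one JSON object per line keeps each refresh independent, so the file can be read back line by line without loading a single large document.&lt;/p&gt;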




&lt;p&gt;Bugs We Fixed&lt;/p&gt;

&lt;p&gt;Bug 1: OPA Crashed on Startup&lt;/p&gt;

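&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: OPA wouldn't start because of a "conflicting rules" error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause&lt;/strong&gt;: We wrote &lt;code&gt;default deny := []&lt;/code&gt; in the Rego file. That declares &lt;code&gt;deny&lt;/code&gt; as a single value, which conflicts with the set-building rule &lt;code&gt;deny contains msg if { ... }&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Removed the &lt;code&gt;default deny := []&lt;/code&gt; line. A rule written with &lt;code&gt;contains&lt;/code&gt; already evaluates to an empty set when no rule body matches, so no default is needed.&lt;/p&gt;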

&lt;p&gt;Bug 2: OPA Couldn't Find Threshold Values&lt;/p&gt;

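&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: OPA loaded the policy files but couldn't find the threshold numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause&lt;/strong&gt;: The JSON file was in the wrong directory. When OPA loads a directory, it places each data file under a path derived from the directory names, so the thresholds ended up at a different data path than the one the policies read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Moved &lt;code&gt;thresholds.json&lt;/code&gt; into a &lt;code&gt;swiftdeploy/&lt;/code&gt; subdirectory so OPA could find it at the correct data path.&lt;/p&gt;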

&lt;p&gt;Bug 3: Status Dashboard Showed "FAIL" Incorrectly&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: The dashboard showed "Infrastructure: FAIL" and "Canary Safety: FAIL" even when everything was within limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause&lt;/strong&gt;: The CLI didn't send a &lt;code&gt;timestamp&lt;/code&gt; field to OPA. The policy rules need &lt;code&gt;input.timestamp&lt;/code&gt; to work. Without it, the rules could not be evaluated, and the CLI treated the missing result as "FAIL".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Added &lt;code&gt;timestamp&lt;/code&gt; to all OPA queries.&lt;/p&gt;

&lt;p&gt;Bug 4: Nginx Couldn't Find the API Service&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Nginx returned 502 errors saying it couldn't resolve the API service hostname.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause&lt;/strong&gt;: Nginx tried to resolve the API service hostname once, at startup, when the API container wasn't running yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Added Docker's internal DNS resolver (&lt;code&gt;127.0.0.11&lt;/code&gt;) and used a variable for the proxy address. This tells Nginx to look up the hostname when a request comes in, not at startup.&lt;/p&gt;

&lt;p&gt;Bug 5: Container Didn't Update After Promoting&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: After switching to canary mode, the container was still running in stable mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause&lt;/strong&gt;: Using &lt;code&gt;docker compose restart&lt;/code&gt; doesn't reload environment variables from the updated &lt;code&gt;docker-compose.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Changed to &lt;code&gt;docker compose up -d --no-deps &amp;lt;service&amp;gt;&lt;/code&gt;, which recreates the container with new settings.&lt;/p&gt;
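
&lt;p&gt;From a Python CLI, the recreate step might be invoked roughly as follows. The service name &lt;code&gt;api&lt;/code&gt; and the helper names are hypothetical; only the &lt;code&gt;docker compose&lt;/code&gt; arguments come from the fix described above.&lt;/p&gt;

```python
import subprocess


def recreate_command(service: str) -> list:
    """Build the argv that recreates one service without touching its dependencies."""
    return ["docker", "compose", "up", "-d", "--no-deps", service]


def recreate_service(service: str) -> None:
    """Run the command; raises CalledProcessError if docker exits non-zero."""
    subprocess.run(recreate_command(service), check=True)


# Only the command is built here, so the example runs without Docker installed.
command = recreate_command("api")  # "api" is a placeholder service name
print(" ".join(command))
```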

&lt;p&gt;Bug 6: Nginx Permission Denied&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Nginx failed to start with "Permission denied" errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause&lt;/strong&gt;: We set &lt;code&gt;user: nginx&lt;/code&gt; and removed all Linux capabilities, which prevented Nginx from creating necessary directories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Removed the explicit user setting. The official Nginx image handles user switching internally.&lt;/p&gt;




&lt;p&gt;Key Design Decisions&lt;/p&gt;

&lt;p&gt;Why a Separate OPA Container?&lt;/p&gt;

&lt;p&gt;The task required: "The CLI must not make any allow/deny decision itself."&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The CLI asks OPA for permission before every deploy/promote&lt;/li&gt;
&lt;li&gt;OPA returns "allowed" or "denied" with a reason&lt;/li&gt;
&lt;li&gt;The CLI never makes the decision itself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Policies can be updated without changing the CLI&lt;/li&gt;
&lt;li&gt;If OPA is down, the CLI warns but continues&lt;/li&gt;
&lt;li&gt;All decisions are logged with reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why Data-Driven Thresholds?&lt;/p&gt;

&lt;p&gt;The task required: "Threshold values must not be hardcoded inside Rego files."&lt;/p&gt;

&lt;p&gt;This means the numbers (10GB, 2.0, 1%, 500ms) are in a separate JSON file, not in the policy code. This makes it easy to change limits without touching the policy logic.&lt;/p&gt;

&lt;p&gt;Why Separate Policy Files?&lt;/p&gt;

&lt;p&gt;The task required: "Organise policies by domain. Each domain owns exactly one question."&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;infrastructure.rego&lt;/code&gt; only checks disk and CPU&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;canary.rego&lt;/code&gt; only checks error rate and latency&lt;/li&gt;
&lt;li&gt;Changing one policy never requires changing another&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;How the Pieces Fit Together&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User runs: swiftdeploy deploy
    │
    ▼
CLI gets host stats (disk, CPU)
    │
    ▼
CLI asks OPA: "Is it safe to deploy?"
    │
    ▼
OPA checks infrastructure policy
    │
    ├── If safe → Start containers
    │
    └── If not safe → Block with reason
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
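
&lt;p&gt;The deploy flow above can be sketched with the standard library alone. OPA's REST Data API accepts a &lt;code&gt;POST&lt;/code&gt; with an &lt;code&gt;input&lt;/code&gt; document and answers with a &lt;code&gt;result&lt;/code&gt; field; everything else here is an assumption for the example: the policy path &lt;code&gt;swiftdeploy/infrastructure&lt;/code&gt;, the input field names, the timestamp format, and the helper names. Note that the CLI side only relays what OPA returned; it does not compare against thresholds itself.&lt;/p&gt;

```python
import json
import os
import shutil
import time
import urllib.error
import urllib.request

# Assumed address and policy path; 8181 is OPA's default port.
OPA_URL = "http://localhost:8181/v1/data/swiftdeploy/infrastructure"


def collect_host_stats(path: str = "/") -> dict:
    """Gather the facts the infrastructure policy needs (field names assumed)."""
    usage = shutil.disk_usage(path)
    return {
        "disk_free_gb": round(usage.free / 1024**3, 2),
        "cpu_load": os.getloadavg()[0],  # 1-minute load average (Unix only)
        "timestamp": int(time.time()),
    }


def ask_opa(payload: dict, url: str = OPA_URL, timeout: float = 3.0):
    """POST the input to OPA; return its result, or None if no answer came back."""
    body = json.dumps({"input": payload}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response).get("result")
    except (urllib.error.URLError, OSError, ValueError):
        return None  # OPA down or unreadable reply: warn instead of crashing


def summarize(result) -> str:
    """Relay OPA's answer; the deny reasons come from the policy, not the CLI."""
    if result is None:
        return "WARN: no decision from OPA"
    reasons = result.get("deny", [])
    if reasons:
        return "BLOCKED: " + "; ".join(reasons)
    return "ALLOWED"


stats = collect_host_stats()
print(stats)
print(summarize({"deny": ["disk space below minimum"]}))
```

&lt;p&gt;Calling &lt;code&gt;summarize(ask_opa(stats))&lt;/code&gt; against a running OPA container would complete the flow; the example stops short of that so it runs without one.&lt;/p&gt;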





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User runs: swiftdeploy promote canary
    │
    ▼
CLI scrapes /metrics endpoint
    │
    ▼
CLI calculates error rate and P99 latency
    │
    ▼
CLI asks OPA: "Is it safe to promote?"
    │
    ▼
OPA checks canary safety policy
    │
    ├── If safe → Switch to canary mode
    │
    └── If not safe → Block with reason
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
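
&lt;p&gt;The "calculates error rate and P99 latency" step could look roughly like this. Two assumptions are baked in: a request counts as an error when its status code starts with 5, and P99 is estimated as the upper bound of the first histogram bucket that covers 99% of requests, which is coarser than interpolating inside the bucket.&lt;/p&gt;

```python
def error_rate(requests_by_status: dict) -> float:
    """Fraction of requests whose status code is 5xx (assumed definition)."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(
        count for status, count in requests_by_status.items()
        if status.startswith("5")
    )
    return errors / total


def p99_upper_bound(buckets: list) -> float:
    """Estimate P99 in seconds from (upper_bound, cumulative_count) pairs.

    Returns the upper bound of the first bucket holding at least 99% of
    all observations. The list is expected to include the +Inf bucket.
    """
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return 0.0
    target = 0.99 * total
    for upper, cumulative in ordered:
        if cumulative >= target:
            return upper
    return ordered[-1][0]


# Hypothetical numbers shaped like the /metrics output.
rate = error_rate({"200": 98, "500": 2})
p99 = p99_upper_bound([(0.05, 90), (0.1, 98), (0.5, 100), (float("inf"), 100)])
print(rate, p99)
```

&lt;p&gt;The two values would then go into the &lt;code&gt;input&lt;/code&gt; document sent to OPA, which compares them against the thresholds.&lt;/p&gt;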






&lt;p&gt;Lessons Learned&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One file can drive everything&lt;/strong&gt;: A single manifest file can generate all the configuration files you need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Policies should be separate from code&lt;/strong&gt;: Using OPA makes policies easier to update and test.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always handle failures gracefully&lt;/strong&gt;: If OPA is down, the CLI warns but continues working.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simple templates work fine&lt;/strong&gt;: You don't need complex template engines for configuration files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Container recreation vs restart&lt;/strong&gt;: Restarting a container doesn't reload environment variables. You need to recreate it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker DNS is important&lt;/strong&gt;: Nginx needs to know how to find containers by name, which requires Docker's internal DNS resolver.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Summary&lt;/p&gt;

&lt;p&gt;SwiftDeploy is a tool that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Takes a single manifest file as input&lt;/li&gt;
&lt;li&gt;Generates all configuration files automatically&lt;/li&gt;
&lt;li&gt;Manages container lifecycle (deploy, promote, teardown)&lt;/li&gt;
&lt;li&gt;Enforces safety policies via OPA before deploy/promote&lt;/li&gt;
&lt;li&gt;Provides monitoring via a &lt;code&gt;/metrics&lt;/code&gt; endpoint&lt;/li&gt;
&lt;li&gt;Tracks all events in an audit log&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key innovation is that everything is driven by one file, and all safety checks happen automatically before any deployment action.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article documents the development of SwiftDeploy as part of the HNG Internship DevOps Track, Stage 4.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>beginners</category>
      <category>python</category>
    </item>
  </channel>
</rss>
