<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ramasankar Molleti</title>
    <description>The latest articles on DEV Community by Ramasankar Molleti (@ramasankar_molleti_f7f80d).</description>
    <link>https://dev.to/ramasankar_molleti_f7f80d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2661743%2F15059d3b-47e1-4d5f-93ee-2006386880a7.jpg</url>
      <title>DEV Community: Ramasankar Molleti</title>
      <link>https://dev.to/ramasankar_molleti_f7f80d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ramasankar_molleti_f7f80d"/>
    <language>en</language>
    <item>
      <title>Kronveil v0.3: Multi-Cluster Federation, Custom Collector SDK, and Automated Runbooks</title>
      <dc:creator>Ramasankar Molleti</dc:creator>
      <pubDate>Mon, 06 Apr 2026 04:56:17 +0000</pubDate>
      <link>https://dev.to/ramasankar_molleti_f7f80d/kronveil-v03-multi-cluster-federation-custom-collector-sdk-and-automated-runbooks-38oc</link>
      <guid>https://dev.to/ramasankar_molleti_f7f80d/kronveil-v03-multi-cluster-federation-custom-collector-sdk-and-automated-runbooks-38oc</guid>
      <description>&lt;h2&gt;
  
  
  From Single Cluster to Multi-Cluster Production
&lt;/h2&gt;

&lt;p&gt;A couple of weeks ago, I shipped &lt;a href="https://dev.to/ramasankar_molleti_f7f80d/kronveil-v02-dashboard-grpc-secret-management-and-local-deployment-heres-what-changed-25gb"&gt;Kronveil v0.2&lt;/a&gt; — a fully running AI infrastructure agent with a dashboard, gRPC transport, secret management, and local Docker deployment. If you missed the original launch post, &lt;a href="https://dev.to/ramasankar_molleti_f7f80d/i-built-an-ai-powered-infrastructure-observability-agent-from-scratch-4j68"&gt;here's where it all started&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;v0.2 worked well for a single cluster. But production environments don't run on one cluster. Teams have &lt;code&gt;us-east-prod&lt;/code&gt;, &lt;code&gt;eu-west-prod&lt;/code&gt;, maybe a staging cluster in &lt;code&gt;ap-south&lt;/code&gt;. They have GitHub Actions pipelines they need visibility into. They have Azure VMs and GCP instances alongside Kubernetes workloads.&lt;/p&gt;

&lt;p&gt;v0.3 addresses all of that. Here's what changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's New in v0.3
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Multi-Cluster Federation&lt;/td&gt;
&lt;td&gt;Aggregate telemetry from multiple Kubernetes clusters into one view&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Custom Collector SDK&lt;/td&gt;
&lt;td&gt;Build your own collectors in ~50 lines of Go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Automated Runbook Engine&lt;/td&gt;
&lt;td&gt;Execute incident response playbooks automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Real Azure SDK Integration&lt;/td&gt;
&lt;td&gt;Azure Monitor metrics + Resource Manager listing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Real GCP SDK Integration&lt;/td&gt;
&lt;td&gt;Cloud Monitoring + Asset Inventory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;GitHub Actions CI/CD Collector&lt;/td&gt;
&lt;td&gt;Poll workflow runs, track status changes, map to severity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Kafka Throughput Monitoring&lt;/td&gt;
&lt;td&gt;Real offset tracking and messages/sec computation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;WebSocket Real-Time Streaming&lt;/td&gt;
&lt;td&gt;Live event feed to the dashboard, no more polling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Runbooks Dashboard Page&lt;/td&gt;
&lt;td&gt;New UI page for runbook management and execution history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Vault Background Sync&lt;/td&gt;
&lt;td&gt;Periodic secret rotation monitoring with KV v2 metadata API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  1. Multi-Cluster Federation
&lt;/h2&gt;

&lt;p&gt;This is the biggest feature in v0.3. The federation manager sits on top of multiple Kubernetes collectors and aggregates their telemetry into a single event stream.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│              Federation Manager                  │
│         implements engine.Collector              │
├─────────────────────────────────────────────────┤
│                                                  │
│  ┌──────────────┐  ┌──────────────┐             │
│  │ us-east-prod │  │ eu-west-prod │  ...        │
│  │  K8s Collector│  │  K8s Collector│            │
│  └──────┬───────┘  └──────┬───────┘             │
│         │                  │                     │
│         └──────┬───────────┘                     │
│                ▼                                  │
│  ┌──────────────────────────┐                    │
│  │       Aggregator         │                    │
│  │  SHA256 dedup (30s window)│                   │
│  │  Cross-cluster metrics    │                   │
│  └──────────────────────────┘                    │
│                                                  │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each cluster's events are tagged with &lt;code&gt;cluster_name&lt;/code&gt; and &lt;code&gt;cluster_region&lt;/code&gt; metadata before being forwarded. The aggregator deduplicates events using SHA256 fingerprinting — if overlapping collectors emit the same event within a 30-second window, it's counted once.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;collectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kubernetes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;clusters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-prod&lt;/span&gt;
        &lt;span class="na"&gt;kubeconfig_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~/.kube/us-east&lt;/span&gt;
        &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-context&lt;/span&gt;
        &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;poll_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eu-west-prod&lt;/span&gt;
        &lt;span class="na"&gt;kubeconfig_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~/.kube/eu-west&lt;/span&gt;
        &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-context&lt;/span&gt;
        &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;poll_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The federation manager implements &lt;code&gt;engine.Collector&lt;/code&gt;, so the rest of Kronveil — the intelligence pipeline, the API, the dashboard — doesn't need to know whether it's watching 1 cluster or 20.&lt;/p&gt;

&lt;p&gt;Aggregate metrics are computed automatically: total pods, total nodes, total events across all clusters.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Custom Collector SDK
&lt;/h2&gt;

&lt;p&gt;Writing a Kronveil collector used to mean implementing the full &lt;code&gt;engine.Collector&lt;/code&gt; interface — managing goroutines, channels, health reporting, and lifecycle. Now you implement three methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Plugin&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Healthcheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDK's &lt;code&gt;Builder&lt;/code&gt; handles the rest — polling loop, buffered event channel with backpressure, health reporting, and clean shutdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;myPlugin&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;WithPollInterval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;WithBufferSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;WithLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c"&gt;// col implements engine.Collector — register it like any built-in collector&lt;/span&gt;
&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RegisterCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Full Example: HTTP Health Checker
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;HTTPChecker&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;HTTPChecker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"http-checker"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;HTTPChecker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
            &lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;"http_check_failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}{&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
        &lt;span class="p"&gt;}},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
        &lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"http_check"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}{&lt;/span&gt;
            &lt;span class="s"&gt;"url"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"latency_ms"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Milliseconds&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;HTTPChecker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Healthcheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What the Adapter Handles For You
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Polling loop at your configured interval&lt;/li&gt;
&lt;li&gt;Immediate first collect (no waiting for the first tick)&lt;/li&gt;
&lt;li&gt;Buffered channel with drop + warn when full&lt;/li&gt;
&lt;li&gt;Health status combining &lt;code&gt;Healthcheck()&lt;/code&gt; result with recent collect errors&lt;/li&gt;
&lt;li&gt;Thread-safe start/stop lifecycle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One interesting bug I caught during CI: the original &lt;code&gt;Stop()&lt;/code&gt; held a mutex while waiting on a &lt;code&gt;WaitGroup&lt;/code&gt;. The polling goroutine needed the same mutex to record errors. Classic deadlock — the goroutine couldn't finish because it couldn't acquire the lock, and &lt;code&gt;Stop()&lt;/code&gt; couldn't return because it was waiting for the goroutine. Fixed by releasing the lock before &lt;code&gt;wg.Wait()&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Automated Runbook Engine
&lt;/h2&gt;

&lt;p&gt;When Kronveil detects an incident, it can now execute a predefined playbook instead of just alerting.&lt;/p&gt;

&lt;p&gt;The runbook engine ships with 4 default runbooks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runbook&lt;/th&gt;
&lt;th&gt;Triggers&lt;/th&gt;
&lt;th&gt;Steps&lt;/th&gt;
&lt;th&gt;Auto-Execute&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pod OOM&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;OOMKilled&lt;/code&gt;, &lt;code&gt;MemoryPressure&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Diagnose &amp;gt; Scale &amp;gt; Notify&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High Latency&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;HighLatency&lt;/code&gt;, &lt;code&gt;SLOBreach&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Diagnose &amp;gt; Restart &amp;gt; Notify&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk Pressure&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;DiskPressure&lt;/code&gt;, &lt;code&gt;LogVolumeHigh&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Cleanup &amp;gt; Notify&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Certificate Expiry&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CertExpiry&lt;/code&gt;, &lt;code&gt;TLSError&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Renew &amp;gt; Notify&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How Execution Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incident Detected
       │
       ▼
  FindRunbooks(incidentType)
       │
       ▼
  For each matching runbook:
    ├── autoExecute=true  → Execute immediately
    └── autoExecute=false → Queue for approval
       │
       ▼
  Execute each Step sequentially:
    ├── kubectl_scale   → Scale deployment replicas
    ├── restart_pod     → Delete pod for controller restart
    ├── notify_oncall   → Slack/PagerDuty notification
    ├── run_diagnostic  → Execute diagnostic command
    └── custom_script   → Run remediation script
       │
       ▼
  Record ExecutionResult (timing, step results, success/failure)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In v0.3, all action handlers run in &lt;strong&gt;dry-run mode&lt;/strong&gt; — they log what they would do without executing. This lets you validate runbook logic before enabling live remediation. Live execution is on the v0.4 roadmap.&lt;/p&gt;
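The dry-run behavior can be sketched as a wrapper around an action handler — a hypothetical illustration, not Kronveil's actual handler code:

```go
package main

import "fmt"

// Handler executes one runbook action type.
type Handler func(cfg map[string]string) error

// dryRun wraps a handler so it only logs what it would do.
// The wrapped handler is never invoked.
func dryRun(action string, h Handler) Handler {
	return func(cfg map[string]string) error {
		fmt.Printf("[dry-run] %s would run with %v\n", action, cfg)
		return nil
	}
}

func main() {
	scale := func(cfg map[string]string) error {
		// a live implementation would call the Kubernetes API here
		return fmt.Errorf("live execution not enabled in v0.3")
	}
	h := dryRun("kubectl_scale", scale)
	_ = h(map[string]string{"deployment": "api", "replicas": "5"})
}
```

Swapping the wrapper out for the real handler is then a one-line change per action type, which is roughly what "enabling live remediation" in v0.4 would amount to.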

&lt;p&gt;You can register custom runbooks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RegisterRunbook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runbook&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Runbook&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;            &lt;span class="s"&gt;"custom-db-failover"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;          &lt;span class="s"&gt;"Database Failover"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IncidentTypes&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"DatabaseDown"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ReplicationLag"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;AutoExecute&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Steps&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;runbook&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Step&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Check replication"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Action&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"run_diagnostic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"command"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"pg_stat_replication"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Promote standby"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Action&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"custom_script"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"script"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"/opt/scripts/promote-standby.sh"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Notify DBA team"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Action&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"notify_oncall"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"channel"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"#dba-oncall"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Real Cloud Provider Integrations
&lt;/h2&gt;

&lt;p&gt;v0.2 had stub implementations for cloud providers. v0.3 wires up real SDKs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auth&lt;/strong&gt;: &lt;code&gt;azidentity.DefaultAzureCredential&lt;/code&gt; — supports managed identity, CLI, environment variables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Azure Monitor &lt;code&gt;azquery.MetricsClient&lt;/code&gt; queries CPU, memory, disk, and network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: ARM &lt;code&gt;armresources.Client&lt;/code&gt; with full pagination support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config&lt;/strong&gt;: Set &lt;code&gt;AZURE_SUBSCRIPTION_ID&lt;/code&gt; and standard Azure credentials&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GCP
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auth&lt;/strong&gt;: Application Default Credentials (ADC)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Cloud Monitoring &lt;code&gt;ListTimeSeries&lt;/code&gt; with 5-minute lookback window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: Cloud Asset &lt;code&gt;SearchAllResources&lt;/code&gt; for inventory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config&lt;/strong&gt;: Set &lt;code&gt;GCP_PROJECT_ID&lt;/code&gt; or &lt;code&gt;GOOGLE_CLOUD_PROJECT&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GitHub Actions (CI/CD Collector)
&lt;/h3&gt;

&lt;p&gt;The CI/CD collector now polls the GitHub REST API for workflow runs across configured repositories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;collectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cicd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;github_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ghp_..."&lt;/span&gt;
    &lt;span class="na"&gt;repo_filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-org/your-repo"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-org/another-repo"&lt;/span&gt;
    &lt;span class="na"&gt;poll_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It tracks run status changes, emits events for new runs and state transitions, and maps GitHub conclusions to severity levels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Conclusion&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;failure&lt;/code&gt;, &lt;code&gt;timed_out&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;cancelled&lt;/code&gt;, &lt;code&gt;action_required&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;success&lt;/code&gt;, other
&lt;/td&gt;
&lt;td&gt;Info&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
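The table above maps directly to a switch — a plausible Go sketch (the function name is illustrative):

```go
package main

import "fmt"

// severityFor maps a GitHub Actions run conclusion to a severity level.
func severityFor(conclusion string) string {
	switch conclusion {
	case "failure", "timed_out":
		return "high"
	case "cancelled", "action_required":
		return "medium"
	default: // "success" and everything else
		return "info"
	}
}

func main() {
	fmt.Println(severityFor("timed_out")) // high
	fmt.Println(severityFor("cancelled")) // medium
	fmt.Println(severityFor("success"))   // info
}
```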

&lt;h3&gt;
  
  
  Kafka Throughput
&lt;/h3&gt;

&lt;p&gt;The Kafka collector now dials brokers directly, reads partition offsets, and computes real messages/second throughput per topic. No more mock data.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. WebSocket Real-Time Streaming
&lt;/h2&gt;

&lt;p&gt;The dashboard no longer polls the REST API for updates. Events flow over WebSocket.&lt;/p&gt;

&lt;h3&gt;
  
  
  Backend
&lt;/h3&gt;

&lt;p&gt;The Go server manages a WebSocket hub with a broadcaster that pushes engine status to all connected clients every 2 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client connects → wsHub.add(conn)
                     │
Broadcaster (2s)  ───┼──→ JSON to all clients
                     │
Client disconnects → wsHub.remove(conn)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
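&lt;p&gt;The hub pattern boils down to a mutex-guarded set of connections plus a fan-out loop. A minimal Go sketch — clients are plain IDs here, where the real hub holds live websocket connections and writes JSON to each:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// wsHub tracks connected clients and fans a payload out to each of them.
// Names mirror the pseudocode above; they are not Kronveil's actual types.
type wsHub struct {
	mu      sync.Mutex
	clients map[int]bool
}

func (h *wsHub) add(id int)    { h.mu.Lock(); h.clients[id] = true; h.mu.Unlock() }
func (h *wsHub) remove(id int) { h.mu.Lock(); delete(h.clients, id); h.mu.Unlock() }

// broadcast returns how many clients received the payload; a real
// implementation would serialize status and write it to each connection.
func (h *wsHub) broadcast(payload string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	n := 0
	for id := range h.clients {
		_ = id // send payload to this client's connection here
		n++
	}
	return n
}

func main() {
	h := wsHub{clients: map[int]bool{}}
	h.add(1)
	h.add(2)
	fmt.Println(h.broadcast(`{"status":"healthy"}`)) // 2
	h.remove(1)
	fmt.Println(h.broadcast(`{"status":"healthy"}`)) // 1
}
```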



&lt;h3&gt;
  
  
  Frontend
&lt;/h3&gt;

&lt;p&gt;Two new React hooks power the live experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;useWebSocket&lt;/code&gt;&lt;/strong&gt; — generic hook with auto-reconnect and exponential backoff (1s to 30s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;useEventStream&lt;/code&gt;&lt;/strong&gt; — wraps WebSocket for the events endpoint, maintains a 100-event rolling buffer, provides memoized filtered views for incidents, anomalies, and all events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Overview page shows a green pulsing &lt;strong&gt;Live&lt;/strong&gt; indicator while the WebSocket is connected. If the connection drops, the page falls back to mock data gracefully.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Runbooks Dashboard Page
&lt;/h2&gt;

&lt;p&gt;New &lt;code&gt;/runbooks&lt;/code&gt; route in the dashboard:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary cards at the top:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total runbooks&lt;/li&gt;
&lt;li&gt;Auto-execute count&lt;/li&gt;
&lt;li&gt;Executions in last 24 hours&lt;/li&gt;
&lt;li&gt;Average success rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Each runbook card shows:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name and description&lt;/li&gt;
&lt;li&gt;Auto/manual execution badge (green dot for auto, gray for manual)&lt;/li&gt;
&lt;li&gt;Incident type tags&lt;/li&gt;
&lt;li&gt;Step count, last run time, success rate&lt;/li&gt;
&lt;li&gt;Recent run indicators — green and red dots for the last 3 executions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same dark theme as the rest of the dashboard. Built with the same Tailwind patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. OpenTelemetry &amp;amp; Observability
&lt;/h2&gt;

&lt;p&gt;Kronveil exports traces and metrics via OpenTelemetry, fitting into your existing observability stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;:4317&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;gRPC&lt;/td&gt;
&lt;td&gt;OTLP traces and metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;:4318&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HTTP&lt;/td&gt;
&lt;td&gt;OTLP traces and metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;:8889&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HTTP&lt;/td&gt;
&lt;td&gt;Prometheus metrics export&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;:13133&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HTTP&lt;/td&gt;
&lt;td&gt;OTel collector health check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;:55679&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HTTP&lt;/td&gt;
&lt;td&gt;zPages debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Point your Jaeger, Tempo, or Datadog backend at these endpoints, or configure Prometheus to scrape Kronveil's metrics.&lt;/p&gt;
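&lt;p&gt;For the Prometheus route, a scrape job like the following is enough — job name and host are placeholders for your own setup:&lt;/p&gt;

```yaml
# Hypothetical Prometheus scrape config for Kronveil's metrics endpoint
# from the table above; adjust the target to where the agent runs.
scrape_configs:
  - job_name: kronveil
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8889"]
```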




&lt;h2&gt;
  
  
  Full Architecture (v0.3)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌───────────────────────────────────────────────────────┐
│                  Dashboard (React)                     │
│          WebSocket &amp;lt;── REST API (Go) ──&amp;gt;               │
├───────────────────────────────────────────────────────┤
│                     Engine Core                        │
│  ┌────────────┐  ┌──────────┐  ┌────────────────┐    │
│  │ Federation │  │ Runbook  │  │ AI Intelligence│    │
│  │  Manager   │  │ Executor │  │ (AWS Bedrock)  │    │
│  └────────────┘  └──────────┘  └────────────────┘    │
├───────────────────────────────────────────────────────┤
│                     Collectors                         │
│  ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐ ┌───┐ ┌────────┐ │
│  │ K8s │ │Kafka│ │CI/CD │ │Azure│ │GCP│ │Custom  │ │
│  │     │ │     │ │GitHub│ │     │ │   │ │ (SDK)  │ │
│  └─────┘ └─────┘ └──────┘ └─────┘ └───┘ └────────┘ │
├───────────────────────────────────────────────────────┤
│               Integrations &amp;amp; Export                    │
│  Slack · PagerDuty · Vault · Prometheus · OTel        │
└───────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Local Deployment
&lt;/h2&gt;

&lt;p&gt;Everything runs with Docker Compose — same as v0.2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kronveil/kronveil.git
&lt;span class="nb"&gt;cd &lt;/span&gt;kronveil
docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;URL&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;&lt;code&gt;http://localhost:3000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;React UI with live WebSocket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;&lt;code&gt;http://localhost:8080/api/v1/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;REST endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health&lt;/td&gt;
&lt;td&gt;&lt;code&gt;http://localhost:8080/api/v1/health&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agent health check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSocket&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ws://localhost:8080/api/v1/ws/events&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Real-time event stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus&lt;/td&gt;
&lt;td&gt;&lt;code&gt;http://localhost:8889/metrics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Metrics export&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OTel gRPC&lt;/td&gt;
&lt;td&gt;&lt;code&gt;localhost:4317&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OTLP ingest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Verify it's running
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/api/v1/health | jq &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"components"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kubernetes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kafka"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cicd-collector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cloud-aws"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"uptime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2m30s"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Production Deployment
&lt;/h2&gt;

&lt;p&gt;Kronveil ships with a Helm chart for Kubernetes deployment. For AWS EKS with Rancher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build and push to ECR&lt;/span&gt;
docker build &lt;span class="nt"&gt;-f&lt;/span&gt; deploy/Dockerfile.agent &lt;span class="nt"&gt;-t&lt;/span&gt; &amp;lt;account&amp;gt;.dkr.ecr.&amp;lt;region&amp;gt;.amazonaws.com/kronveil/agent:v0.3 &lt;span class="nb"&gt;.&lt;/span&gt;
docker push &amp;lt;account&amp;gt;.dkr.ecr.&amp;lt;region&amp;gt;.amazonaws.com/kronveil/agent:v0.3

&lt;span class="c"&gt;# Deploy with Helm&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;kronveil helm/kronveil/ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; kronveil &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; values-prod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Helm chart includes deployment, RBAC (ClusterRole + Role), ServiceAccount with IRSA annotation, NetworkPolicy, and Prometheus scrape annotations. A full production deployment guide covering ECR, MSK, IRSA, ALB Ingress, and TLS is available in the repository.&lt;/p&gt;
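&lt;p&gt;For the IRSA piece, a values override like this is the usual shape — the key names below are illustrative, not necessarily the chart's actual schema, and the account/role IDs are placeholders:&lt;/p&gt;

```yaml
# Hypothetical values-prod.yaml fragment. IRSA binds the ServiceAccount
# to an IAM role so the agent can reach Bedrock, MSK, etc. without keys.
image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/kronveil/agent
  tag: v0.3
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/kronveil-agent
```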




&lt;h2&gt;
  
  
  CI Pipeline
&lt;/h2&gt;

&lt;p&gt;All 7 CI jobs passing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;What It Checks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lint&lt;/td&gt;
&lt;td&gt;golangci-lint v2 (errcheck, staticcheck, govet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;govulncheck for known vulnerabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;go test ./... -race&lt;/code&gt; with coverage threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build&lt;/td&gt;
&lt;td&gt;CGO_ENABLED=0 static binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker Build &amp;amp; Scan&lt;/td&gt;
&lt;td&gt;Trivy scan for CRITICAL/HIGH CVEs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;npm lint + build&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm Lint&lt;/td&gt;
&lt;td&gt;Chart validation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What's Next (v0.4)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live runbook execution&lt;/strong&gt; — move from dry-run to real &lt;code&gt;kubectl&lt;/code&gt; and script execution with approval gates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collector marketplace&lt;/strong&gt; — share and install community-built collectors via the SDK&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-cluster incident correlation&lt;/strong&gt; — AI-powered correlation across federated clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard runbook triggers&lt;/strong&gt; — execute runbooks directly from the UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana plugin&lt;/strong&gt; — embed Kronveil panels in existing Grafana dashboards&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/kronveil/kronveil" rel="noopener noreferrer"&gt;github.com/kronveil/kronveil&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;v0.1 post: &lt;a href="https://dev.to/ramasankar_molleti_f7f80d/i-built-an-ai-powered-infrastructure-observability-agent-from-scratch-4j68"&gt;I Built an AI-Powered Infrastructure Observability Agent from Scratch&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;v0.2 post: &lt;a href="https://dev.to/ramasankar_molleti_f7f80d/kronveil-v02-dashboard-grpc-secret-management-and-local-deployment-heres-what-changed-25gb"&gt;Kronveil v0.2: Dashboard, gRPC, Secret Management, and Local Deployment&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been following along, star the repo and try it out. PRs, issues, and feedback are always welcome.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>observability</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Kronveil v0.2: From Stubs to a Fully Running AI Infrastructure Agent with Dashboard, OTel, and Auto-Remediation</title>
      <dc:creator>Ramasankar Molleti</dc:creator>
      <pubDate>Mon, 16 Mar 2026 05:37:07 +0000</pubDate>
      <link>https://dev.to/ramasankar_molleti_f7f80d/kronveil-v02-dashboard-grpc-secret-management-and-local-deployment-heres-what-changed-25gb</link>
      <guid>https://dev.to/ramasankar_molleti_f7f80d/kronveil-v02-dashboard-grpc-secret-management-and-local-deployment-heres-what-changed-25gb</guid>
      <description>&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;p&gt;A week ago, I &lt;a href="https://dev.to/ramasankar_molleti_f7f80d/i-built-an-ai-powered-infrastructure-observability-agent-from-scratch-4j68"&gt;launched Kronveil&lt;/a&gt; - an AI-powered infrastructure observability agent that detects anomalies, performs root cause analysis, and auto-remediates incidents in milliseconds. The response was incredible.&lt;/p&gt;

&lt;p&gt;But that first version had a lot of stubs. The roadmap listed features like "Dashboard UI", "Prometheus metrics", and "multi-cloud secret management" as coming soon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They're here now.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post covers every new feature shipped in v0.2, a step-by-step guide to run Kronveil locally with Docker Compose, and live screenshots from the running dashboard.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's New in v0.2
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Full Dashboard UI (React + TypeScript)
&lt;/h3&gt;

&lt;p&gt;The biggest visible change. Kronveil now ships with a production-ready dashboard built with React 18, TypeScript, Tailwind CSS, and Recharts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Six pages, zero fluff:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Page&lt;/th&gt;
&lt;th&gt;What It Shows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time event throughput, active incidents, MTTR, anomaly count (24h), cluster health matrix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Incidents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filterable list (active/acknowledged/resolved), timeline view, root cause display, one-click acknowledge/resolve&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anomalies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detected anomalies with scores, signal source, severity, historical comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Collectors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Health status per collector, event emission rates, degradation indicators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Policies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OPA policy listing, enable/disable toggles, violation history, Rego rule display&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Settings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Collector config, integration credentials, anomaly sensitivity, remediation toggles&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dashboard runs as a separate container behind nginx, which reverse-proxies &lt;code&gt;/api/&lt;/code&gt; requests to the agent. No CORS headaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. gRPC API with TLS/mTLS
&lt;/h3&gt;

&lt;p&gt;The REST API was always there. Now there's a full gRPC API on port 9091 with four RPCs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;StreamEvents&lt;/code&gt;&lt;/strong&gt; - Server-side streaming of real-time telemetry events with source and severity filtering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GetIncident&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;ListIncidents&lt;/code&gt;&lt;/strong&gt; - Incident queries with status filtering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GetHealth&lt;/code&gt;&lt;/strong&gt; - Component-level health reporting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built with reflection support, so you can debug with &lt;code&gt;grpcurl&lt;/code&gt; out of the box. TLS and mutual TLS are configurable - just point it at your cert/key files.&lt;/p&gt;
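&lt;p&gt;With reflection enabled, you can poke at the API without writing a client. The fully-qualified service name below is a placeholder — use whatever &lt;code&gt;list&lt;/code&gt; actually reports:&lt;/p&gt;

```shell
# List services exposed via reflection
grpcurl -plaintext localhost:9091 list

# Invoke an RPC (substitute the real service/method name from the list output)
grpcurl -plaintext localhost:9091 kronveil.v1.Agent/GetHealth
```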

&lt;h3&gt;
  
  
  3. Secret Management: Vault + AWS Secrets Manager
&lt;/h3&gt;

&lt;p&gt;Two new integrations for secret lifecycle management:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HashiCorp Vault:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes auth method&lt;/li&gt;
&lt;li&gt;TLS certificate lifecycle tracking&lt;/li&gt;
&lt;li&gt;Secret caching for performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS Secrets Manager:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefix-based secret organization (&lt;code&gt;kronveil/&lt;/code&gt; default)&lt;/li&gt;
&lt;li&gt;Rotation monitoring with configurable windows (default 30 days)&lt;/li&gt;
&lt;li&gt;Secret expiration tracking&lt;/li&gt;
&lt;li&gt;Built-in caching layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both use the graceful degradation pattern - if credentials aren't configured, the agent logs a warning and continues running without them.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Three New Collectors
&lt;/h3&gt;

&lt;p&gt;The original had Kubernetes and Kafka. Now there are five:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Collector (AWS/Azure/GCP):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch metrics for EC2, RDS, ELB, Lambda, S3&lt;/li&gt;
&lt;li&gt;Multi-region support with resource enumeration&lt;/li&gt;
&lt;li&gt;Cost tracking per resource&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CI/CD Collector (GitHub Actions):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Webhook-based pipeline monitoring&lt;/li&gt;
&lt;li&gt;Job and step-level tracking with duration metrics&lt;/li&gt;
&lt;li&gt;Repository filtering with webhook secret validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Logs Collector:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File tailing with structured log parsing&lt;/li&gt;
&lt;li&gt;JSON, logfmt, and raw text format support&lt;/li&gt;
&lt;li&gt;Configurable error pattern matching (error, fatal, panic, OOM, killed)&lt;/li&gt;
&lt;/ul&gt;
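&lt;p&gt;The error-pattern matching for raw text lines can be sketched in a few lines of Go — substring matching here is a simplification, since the real collector parses JSON and logfmt lines first:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// errorPatterns mirrors the default patterns listed above.
var errorPatterns = []string{"error", "fatal", "panic", "oom", "killed"}

// isErrorLine reports whether a raw log line matches any error pattern,
// case-insensitively.
func isErrorLine(line string) bool {
	l := strings.ToLower(line)
	for _, p := range errorPatterns {
		if strings.Contains(l, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isErrorLine("2026-04-06 FATAL out of memory")) // true
	fmt.Println(isErrorLine("2026-04-06 INFO request served")) // false
}
```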

&lt;h3&gt;
  
  
  5. Capacity Planner
&lt;/h3&gt;

&lt;p&gt;New intelligence module that goes beyond anomaly detection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear regression-based forecasting (default 30-day horizon)&lt;/li&gt;
&lt;li&gt;Right-sizing recommendations: scale_up, scale_down, right_size, optimize&lt;/li&gt;
&lt;li&gt;Days-to-capacity projection&lt;/li&gt;
&lt;li&gt;Cost savings calculations with confidence intervals&lt;/li&gt;
&lt;li&gt;Historical data retention (90 days default)&lt;/li&gt;
&lt;/ul&gt;
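&lt;p&gt;The core of the forecasting is ordinary least squares over daily samples. A sketch of that regression step — the real module layers confidence intervals and cost math on top:&lt;/p&gt;

```go
package main

import "fmt"

// forecast fits y = a + b*x by least squares to one sample per day and
// projects the value `horizon` days past the last sample.
func forecast(samples []float64, horizon int) float64 {
	n := float64(len(samples))
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	b := (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX) // slope
	a := (sumY - b*sumX) / n                           // intercept
	lastX := n - 1
	return a + b*(lastX+float64(horizon))
}

func main() {
	cpu := []float64{50, 52, 54, 56, 58} // % utilization, one sample per day
	fmt.Printf("projected in 30 days: %.0f%%\n", forecast(cpu, 30)) // 118%
}
```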

&lt;h3&gt;
  
  
  6. Policy Engine (OPA/Rego)
&lt;/h3&gt;

&lt;p&gt;Compliance and governance built into the agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open Policy Agent integration with Rego language&lt;/li&gt;
&lt;li&gt;Default policies pre-loaded (compliance, security)&lt;/li&gt;
&lt;li&gt;Resource evaluation against all enabled policies&lt;/li&gt;
&lt;li&gt;Policy violation tracking with evaluation metrics&lt;/li&gt;
&lt;/ul&gt;
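&lt;p&gt;For a flavor of what these policies look like, here's a hypothetical Rego rule in the style of the pre-loaded security policies (package and field names are illustrative):&lt;/p&gt;

```rego
package kronveil.security

# Example policy: deny pods that run as root.
deny[msg] {
  input.kind == "Pod"
  input.spec.securityContext.runAsUser == 0
  msg := "pods must not run as root"
}
```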

&lt;h3&gt;
  
  
  7. Prometheus Metrics Export
&lt;/h3&gt;

&lt;p&gt;Kronveil now exposes a full Prometheus scrape endpoint on port 9090:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard Go runtime metrics (goroutines, memory, GC)&lt;/li&gt;
&lt;li&gt;Custom Kronveil metrics: event counts per source, collector errors, policy evaluations, processing latency&lt;/li&gt;
&lt;li&gt;Ready-to-use with Grafana dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8. OpenTelemetry (OTel) Integration
&lt;/h3&gt;

&lt;p&gt;Full OpenTelemetry support for distributed tracing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gRPC exporter&lt;/strong&gt; to any OTLP-compatible endpoint (Jaeger, Tempo, Datadog, etc.)&lt;/li&gt;
&lt;li&gt;Configurable export intervals (default 30s)&lt;/li&gt;
&lt;li&gt;Span and trace propagation across the agent pipeline&lt;/li&gt;
&lt;li&gt;Insecure mode for local development, TLS for production&lt;/li&gt;
&lt;li&gt;Default endpoint: &lt;code&gt;localhost:4317&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means you can plug Kronveil into your existing OTel collector pipeline and see traces from anomaly detection through incident creation to remediation execution - all in one trace.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. PagerDuty Integration
&lt;/h3&gt;

&lt;p&gt;Full Events API v2 support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incident triggering, acknowledgment, resolution&lt;/li&gt;
&lt;li&gt;Deduplication keys for idempotent alerts&lt;/li&gt;
&lt;li&gt;Severity mapping (critical, high, warning, info)&lt;/li&gt;
&lt;li&gt;Links back to Kronveil dashboard&lt;/li&gt;
&lt;/ul&gt;
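&lt;p&gt;Deduplication keys are what keep repeated triggers from paging twice: the same key updates the existing PagerDuty alert. A sketch of one way to build them — the inputs and format here are hypothetical, not the integration's actual scheme:&lt;/p&gt;

```go
package main

import "fmt"

// dedupKey builds a stable deduplication key so repeated triggers for the
// same incident update one PagerDuty alert instead of paging again.
func dedupKey(clusterID, incidentType, resource string) string {
	return fmt.Sprintf("kronveil:%s:%s:%s", clusterID, incidentType, resource)
}

func main() {
	fmt.Println(dedupKey("prod-us", "oom", "pod/api-7")) // kronveil:prod-us:oom:pod/api-7
}
```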

&lt;h3&gt;
  
  
  10. Audit Logging
&lt;/h3&gt;

&lt;p&gt;Security-grade audit trail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event types: auth, incident, remediation, policy_change, config_change, secret_access, api_call&lt;/li&gt;
&lt;li&gt;In-memory buffer with file sink&lt;/li&gt;
&lt;li&gt;Structured JSON output via slog&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  11. Helm Chart for Kubernetes
&lt;/h3&gt;

&lt;p&gt;Production-ready Helm chart with security-hardened defaults:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-root containers (UID 1000)&lt;/li&gt;
&lt;li&gt;Read-only root filesystem&lt;/li&gt;
&lt;li&gt;Seccomp: RuntimeDefault&lt;/li&gt;
&lt;li&gt;NetworkPolicy for ingress/egress&lt;/li&gt;
&lt;li&gt;RBAC: ClusterRole with minimal permissions (pods, nodes, events, deployments)&lt;/li&gt;
&lt;li&gt;Prometheus scrape annotations built-in&lt;/li&gt;
&lt;li&gt;Liveness and readiness probes&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;kronveil helm/kronveil/ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kronveil &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; agent.bedrock.region&lt;span class="o"&gt;=&lt;/span&gt;us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Upgraded Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;v0.1&lt;/th&gt;
&lt;th&gt;v0.2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;1.21&lt;/td&gt;
&lt;td&gt;1.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;golangci-lint&lt;/td&gt;
&lt;td&gt;v1&lt;/td&gt;
&lt;td&gt;v2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alpine&lt;/td&gt;
&lt;td&gt;3.21&lt;/td&gt;
&lt;td&gt;3.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;td&gt;React 18 + Tailwind&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;REST only&lt;/td&gt;
&lt;td&gt;REST + gRPC + mTLS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Vault + AWS SM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics Export&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Prometheus + OTel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tracing&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;OpenTelemetry (OTLP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerting&lt;/td&gt;
&lt;td&gt;Slack&lt;/td&gt;
&lt;td&gt;Slack + PagerDuty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Docker Compose + Helm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Full pipeline (lint, test, security, build, Docker scan)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Run Kronveil Locally (5 Minutes)
&lt;/h2&gt;

&lt;p&gt;Here's the full local deployment walkthrough with live screenshots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.docker.com/products/docker-desktop/" rel="noopener noreferrer"&gt;Docker Desktop&lt;/a&gt; installed and running&lt;/li&gt;
&lt;li&gt;&lt;a href="https://git-scm.com/" rel="noopener noreferrer"&gt;Git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;~2GB free RAM (Kafka needs memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Clone and Build
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kronveil/kronveil.git
&lt;span class="nb"&gt;cd &lt;/span&gt;kronveil
docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; deploy/docker-compose.yaml up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This builds two images and starts four containers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Container&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;agent&lt;/td&gt;
&lt;td&gt;8080&lt;/td&gt;
&lt;td&gt;Kronveil REST API + gRPC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dashboard&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;Web UI (nginx + React SPA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kafka&lt;/td&gt;
&lt;td&gt;9092&lt;/td&gt;
&lt;td&gt;Event bus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zookeeper&lt;/td&gt;
&lt;td&gt;2181&lt;/td&gt;
&lt;td&gt;Kafka coordinator&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 2: Verify Everything Is Running
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; deploy/docker-compose.yaml ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All four containers should show &lt;code&gt;Up (healthy)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;NAME                 STATUS                         PORTS
&lt;/span&gt;&lt;span class="gp"&gt;deploy-agent-1       Up About a minute (healthy)    127.0.0.1:8080-&amp;gt;&lt;/span&gt;8080/tcp
&lt;span class="gp"&gt;deploy-dashboard-1   Up About a minute (healthy)    127.0.0.1:3000-&amp;gt;&lt;/span&gt;8080/tcp
&lt;span class="gp"&gt;deploy-kafka-1       Up About a minute (healthy)    127.0.0.1:9092-&amp;gt;&lt;/span&gt;9092/tcp
&lt;span class="go"&gt;deploy-zookeeper-1   Up About a minute (healthy)    2181/tcp
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Access the Endpoints
&lt;/h3&gt;

&lt;p&gt;Once deployed, you have three endpoints available:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;URL&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dashboard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Full web UI with all 6 pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="http://localhost:8080/api/v1/health" rel="noopener noreferrer"&gt;http://localhost:8080/api/v1/health&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;REST API (health, incidents, anomalies)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="http://localhost:9090/metrics" rel="noopener noreferrer"&gt;http://localhost:9090/metrics&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Prometheus scrape endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 4: Check Agent Health
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/api/v1/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Open the Dashboard
&lt;/h3&gt;

&lt;p&gt;Open &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt; in your browser.&lt;/p&gt;

&lt;h4&gt;
  
  
  Overview Page
&lt;/h4&gt;

&lt;p&gt;The Overview page shows real-time infrastructure intelligence at a glance - 10.2M events/sec throughput, 2 active incidents, 23-second average MTTR, and 47 anomalies detected in the last 24 hours. The cluster health matrix shows three clusters across US, EU, and AP regions with live node and pod counts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmu9cmeoni1fpzcmqhd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmu9cmeoni1fpzcmqhd2.png" alt="Kronveil Overview Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Incidents Page
&lt;/h4&gt;

&lt;p&gt;AI-detected and auto-remediated incidents with filtering by status (all, active, acknowledged, resolved). Each incident shows the title, description, MTTR, and number of affected resources. Notice the resolved OOM incident with 23s MTTR - that's the auto-remediation in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhonq1tvoxisz9v6pci8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhonq1tvoxisz9v6pci8.png" alt="Kronveil Incidents"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Anomalies Page
&lt;/h4&gt;

&lt;p&gt;ML-powered anomaly detection and prediction. The distribution chart shows detected vs. predicted anomalies over 24 hours. Each anomaly has a score (0-100%) - the Kafka consumer lag spike scored 94%, and the system predicted a pod OOM 15 minutes before it happened.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rcy0ufcef57vtz9bibo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rcy0ufcef57vtz9bibo.png" alt="Kronveil Anomalies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Collectors Page
&lt;/h4&gt;

&lt;p&gt;Telemetry collection agents across your infrastructure. Five active collectors processing 10.2M events/sec across 487 targets with only 0.001% error rate. Kubernetes leads at 4.2M events/sec monitoring 3 clusters, 54 nodes, and 312 pods. Each collector shows real-time health status.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhod8vofqm3l5ehjilemr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhod8vofqm3l5ehjilemr.png" alt="Kronveil Collectors - Top"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scroll down to see all five collectors - Kubernetes, Apache Kafka, AWS CloudWatch, GitHub Actions (CI/CD), and the Logs collector. GitHub Actions shows a degraded status with 3 errors, which is expected when webhook endpoints aren't publicly accessible in a local deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdjql793e6cw5a0wn6qy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdjql793e6cw5a0wn6qy.png" alt="Kronveil Collectors - All"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Explore the API
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Full system status:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/api/v1/status | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;List collectors and their health:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/api/v1/collectors | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Inject a test event (single):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/api/v1/test/inject?mode&lt;span class="o"&gt;=&lt;/span&gt;single
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Inject a burst of events to trigger anomaly detection:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/api/v1/test/inject?mode&lt;span class="o"&gt;=&lt;/span&gt;burst
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the burst injection, check for detected anomalies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/api/v1/anomalies | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And incidents that were auto-created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/api/v1/incidents | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Prometheus Metrics
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:9090/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see standard Go metrics plus Kronveil-specific counters for events processed, collector errors, and policy evaluations. Wire this into your Grafana instance for dashboards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Tail the Logs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; deploy/docker-compose.yaml logs &lt;span class="nt"&gt;-f&lt;/span&gt; agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch the agent detect anomalies, correlate incidents, and execute remediation in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cleanup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; deploy/docker-compose.yaml down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Architecture Diagram (Updated)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                         +------------------+
                         |   Dashboard UI   |
                         |  (React + nginx) |
                         |   :3000          |
                         +--------+---------+
                                  |
                           /api/ proxy
                                  |
+------------------+    +---------v----------+    +------------------+
|   Collectors     |    |    Kronveil Agent  |    |  Integrations    |
|                  +---&amp;gt;+                    +---&amp;gt;+                  |
| - Kubernetes     |    |  REST API  :8080   |    | - Slack          |
| - Kafka          |    |  gRPC API  :9091   |    | - PagerDuty      |
| - Cloud (AWS)    |    |  Metrics   :9090   |    | - Prometheus     |
| - CI/CD          |    |                    |    | - OpenTelemetry  |
| - Logs           |    |  +==============+  |    | - AWS Bedrock    |
+------------------+    |  | Intelligence |  |    | - Vault          |
                        |  | - Anomaly    |  |    | - AWS Secrets    |
                        |  | - RootCause  |  |    +------------------+
                        |  | - Capacity   |  |
                        |  | - Incident   |  |         +----------+
                        |  +==============+  |    +---&amp;gt;| OTel     |
                        |                    +----+    | Collector|
                        |  +==============+  |         +----------+
                        |  | Policy (OPA) |  |
                        |  | Audit Log    |  |
                        |  +==============+  |
                        +---------+----------+
                                  |
                         +--------v---------+
                         |   Apache Kafka   |
                         |   :9092          |
                         +------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  CI Pipeline
&lt;/h2&gt;

&lt;p&gt;Every push to &lt;code&gt;main&lt;/code&gt; runs seven jobs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lint&lt;/strong&gt; - golangci-lint v2 with staticcheck, errcheck, govet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test&lt;/strong&gt; - &lt;code&gt;go test -race&lt;/code&gt; with 40% coverage threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Scan&lt;/strong&gt; - govulncheck for Go stdlib/dependency CVEs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; - Cross-compile with ldflags (version, commit, date)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Build &amp;amp; Scan&lt;/strong&gt; - Multi-stage build + Trivy vulnerability scan (CRITICAL/HIGH)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt; - npm ci, ESLint, Vite production build&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Helm Lint&lt;/strong&gt; - Chart validation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All green before merge. No exceptions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next (v0.3 Roadmap)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cluster support&lt;/strong&gt; - Federated monitoring across Kubernetes clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom collector SDK&lt;/strong&gt; - Build your own collectors with a plugin interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runbook automation&lt;/strong&gt; - Attach runbooks to incident types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost anomaly detection&lt;/strong&gt; - Spot unexpected cloud spend spikes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana dashboards&lt;/strong&gt; - Pre-built dashboards for Kronveil Prometheus metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile alerts&lt;/strong&gt; - Push notifications via native apps&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kronveil/kronveil" rel="noopener noreferrer"&gt;github.com/kronveil/kronveil&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kronveil/kronveil.git
&lt;span class="nb"&gt;cd &lt;/span&gt;kronveil
docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; deploy/docker-compose.yaml up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="c"&gt;# Open http://localhost:3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you find it useful, star the repo. If you find a bug, open an issue. PRs welcome - especially for new collectors, dashboard improvements, and LLM prompt tuning.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow me for more updates on building production-grade infrastructure tooling with Go and AI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>go</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Built an AI-Powered Infrastructure Observability Agent from Scratch</title>
      <dc:creator>Ramasankar Molleti</dc:creator>
      <pubDate>Sun, 08 Mar 2026 01:11:01 +0000</pubDate>
      <link>https://dev.to/ramasankar_molleti_f7f80d/i-built-an-ai-powered-infrastructure-observability-agent-from-scratch-4j68</link>
      <guid>https://dev.to/ramasankar_molleti_f7f80d/i-built-an-ai-powered-infrastructure-observability-agent-from-scratch-4j68</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Kronveil watches your infrastructure, detects anomalies in real time, and auto-remediates incidents before you even wake up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As platform engineers, we've all been there: 3 AM pages, scrambling through dashboards, correlating logs across 15 different tools, and trying to figure out &lt;em&gt;why&lt;/em&gt; the system broke — not just &lt;em&gt;what&lt;/em&gt; broke.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;Kronveil&lt;/strong&gt; to solve this. It's an open-source, AI-powered observability agent that combines deep telemetry collection, real-time anomaly detection, LLM-powered root cause analysis, and autonomous remediation — all in a single Go binary.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk you through the architecture, the intelligence pipeline, and show you real test results of the system detecting anomalies and auto-remediating incidents in milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/kronveil/kronveil" rel="noopener noreferrer"&gt;github.com/kronveil/kronveil&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Modern infrastructure is complex. A typical production environment has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hundreds of Kubernetes pods scaling up and down&lt;/li&gt;
&lt;li&gt;Apache Kafka clusters processing millions of events per second&lt;/li&gt;
&lt;li&gt;Multi-cloud workloads across AWS, Azure, and GCP&lt;/li&gt;
&lt;li&gt;CI/CD pipelines deploying dozens of times per day&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional monitoring tools tell you &lt;em&gt;what&lt;/em&gt; happened. But by the time you get the alert, correlate the signals, and figure out the root cause — you've already burned 30 minutes of MTTR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if your observability platform could think?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Kronveil is designed as a layered system with four main tiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; LAYER 1: DATA COLLECTION
 ========================
 +----------+  +-------+  +-------+
 |Kubernetes|  | Kafka |  | Cloud |
 |Collector |  |Collect|  |Collect|
 +----+-----+  +---+---+  +---+---+
      |            |           |
 +----+-----+  +--+---+       |
 |CI/CD     |  | Logs |       |
 |Collector |  |Tailer|       |
 +----+-----+  +--+---+       |
      |            |           |
      v            v           v
 ================================
 LAYER 2: KAFKA EVENT BUS
 ================================
 telemetry.raw -&amp;gt; telemetry.enriched
 anomalies.detected -&amp;gt; incidents.new
 remediation.actions -&amp;gt; policy.audit
 (10M+ events/sec | 3x replication)
 ================================
            |
            v
 LAYER 3: INTELLIGENCE
 ========================
 +---------+ +----------+
 | Anomaly | | Root     |
 | Detect  | | Cause    |
 | Z-Score | | Analyzer |
 | EWMA    | | DFS+LLM  |
 +---------+ +----------+
       |          |
       v          v
 +---------------------+
 | INCIDENT RESPONDER  |
 | Detect -&amp;gt; Triage    |
 | -&amp;gt; Respond-&amp;gt;Resolve |
 +---------------------+
            |
            v
 LAYER 4: ACTION
 ========================
 +-------+ +------+ +------+
 | Slack | |Pager | |Prom  |
 | Alert | |Duty  | |Metric|
 +-------+ +------+ +------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The diagram above shows the full platform. Let's break down each layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Data Collection
&lt;/h3&gt;

&lt;p&gt;Five specialized collectors continuously gather telemetry from your infrastructure. Each collector is a Go interface implementation that runs in its own goroutine and pushes &lt;code&gt;TelemetryEvent&lt;/code&gt; structs into the event bus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------+---------------------------+
| Collector    | What It Watches           |
+--------------+---------------------------+
| Kubernetes   | Pods, Nodes, Events, HPA  |
|              | Metrics API, Deployments  |
+--------------+---------------------------+
| Kafka        | Consumer lag, Topics      |
|              | Throughput, Partitions    |
+--------------+---------------------------+
| Cloud        | EC2, RDS, ELB, Lambda     |
|              | S3, CloudWatch metrics    |
+--------------+---------------------------+
| CI/CD        | GitHub Actions, Jenkins   |
|              | GitLab CI pipelines       |
+--------------+---------------------------+
| Logs         | File tailing, Syslog      |
|              | Structured log parsing    |
+--------------+---------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each collector implements a simple interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Collector&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;Stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;Health&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;ComponentHealth&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means adding a new data source (e.g., Datadog, New Relic) is just implementing this interface — no changes to the core engine needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Apache Kafka Event Bus
&lt;/h3&gt;

&lt;p&gt;All telemetry flows through a unified Kafka event bus. This decouples collectors from intelligence modules — they don't know about each other. The bus handles &lt;strong&gt;10M+ events/sec&lt;/strong&gt; with 3x replication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KAFKA TOPICS (10 total):
========================

Telemetry Flow:
  telemetry.raw
    -&amp;gt; telemetry.enriched
      -&amp;gt; anomalies.detected

Incident Flow:
  incidents.new
    -&amp;gt; incidents.updated
      -&amp;gt; remediation.actions

Governance Flow:
  policy.violations
    -&amp;gt; policy.audit
      -&amp;gt; capacity.forecasts

Config: capacity.changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why Kafka? Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Durability&lt;/strong&gt; — events survive crashes, enabling replay and audit trails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fan-out&lt;/strong&gt; — multiple intelligence modules can consume the same event stream independently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure&lt;/strong&gt; — if anomaly detection falls behind, events queue up instead of being dropped&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Layer 3: Intelligence Engine
&lt;/h3&gt;

&lt;p&gt;This is the brain of Kronveil. Three modules analyze telemetry in parallel, each specializing in a different aspect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------------+
|        ANOMALY DETECTOR          |
|                                  |
|  Input: telemetry.enriched      |
|                                  |
|  Algorithms:                    |
|  - Z-Score (deviation from mean)|
|  - EWMA (trend smoothing)      |
|  - Linear Trend (prediction)   |
|                                  |
|  Output: anomalies.detected    |
+----------------------------------+
          |
          v
+----------------------------------+
|     ROOT CAUSE ANALYZER          |
|                                  |
|  Input: anomalies.detected      |
|                                  |
|  Process:                       |
|  1. Build dependency graph      |
|  2. DFS traversal for causality |
|  3. Collect evidence            |
|  4. LLM analysis (AWS Bedrock) |
|                                  |
|  Output: root cause + fix       |
+----------------------------------+
          |
          v
+----------------------------------+
|     CAPACITY PLANNER             |
|                                  |
|  Input: telemetry.enriched      |
|                                  |
|  Algorithms:                    |
|  - Linear regression forecast   |
|  - Confidence intervals         |
|  - Resource right-sizing        |
|                                  |
|  Output: capacity.forecasts    |
+----------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three modules feed into the &lt;strong&gt;Incident Responder&lt;/strong&gt;, which orchestrates the full incident lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INCIDENT LIFECYCLE:
===================

  Anomaly    Root Cause    Capacity
  Detected   Found         Alert
     \          |           /
      v         v          v
  +-------------------------+
  |   INCIDENT RESPONDER    |
  |                         |
  |  1. Create Incident     |
  |  2. Score Severity      |
  |  3. Correlate Events    |
  |  4. Auto-Remediate      |
  |  5. Notify (Slack/PD)   |
  |  6. Track Resolution    |
  +-------------------------+
           |
           v
  +-------------------------+
  |   AUTO-REMEDIATION      |
  |                         |
  |  - scale_deployment     |
  |  - restart_pods         |
  |  - rollback_deploy      |
  |  - drain_node           |
  |  - failover_db          |
  |  - toggle_feature       |
  |                         |
  |  Safety:                |
  |  - Circuit breaker      |
  |  - Dry run mode         |
  |  - Human approval gate  |
  +-------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 4: Action &amp;amp; Integrations
&lt;/h3&gt;

&lt;p&gt;The final layer delivers results to humans and systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------+  +-----------+  +---------+
| AWS      |  | Slack     |  | Pager   |
| Bedrock  |  | Block Kit |  | Duty    |
| (LLM)    |  | Alerts    |  | Events  |
+----------+  +-----------+  +---------+

+----------+  +-----------+  +---------+
| REST API |  | gRPC API  |  | Prom    |
| :8080    |  | :9091     |  | :9090   |
+----------+  +-----------+  +---------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;REST API&lt;/strong&gt; (:8080) — Dashboard, incident management, test injection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gRPC API&lt;/strong&gt; (:9091) — High-performance inter-service communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; (:9090) — Metrics export for Grafana dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack&lt;/strong&gt; — Real-time alerts with Block Kit rich formatting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PagerDuty&lt;/strong&gt; — On-call escalation via Events API v2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Bedrock&lt;/strong&gt; — LLM backbone for root cause analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Complete Event Flow
&lt;/h3&gt;

&lt;p&gt;Here's how a single CPU spike travels through the entire system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CPU spike on pod-xyz (95% usage)
  |
  v
[K8s Collector] picks up metric
  |
  v
[Kafka] telemetry.raw topic
  |
  v
[Anomaly Detector] Z-score = 5.8 sigma
  |
  v
[Kafka] anomalies.detected topic
  |
  v
[Incident Responder] creates INC-0001
  |
  +---&amp;gt; [Root Cause Analyzer]
  |       |
  |       v
  |     DFS on dependency graph
  |       |
  |       v
  |     AWS Bedrock LLM analysis
  |       |
  |       v
  |     "OOM in pod-xyz caused by
  |      memory leak in v2.3.1"
  |
  +---&amp;gt; [Auto-Remediation]
  |       |
  |       v
  |     scale_deployment (replicas: 5)
  |
  +---&amp;gt; [Slack] Alert with root cause
  +---&amp;gt; [PagerDuty] Page on-call
  +---&amp;gt; [Prometheus] Metric exported
  |
  v
INC-0001 resolved (MTTR: 1.7ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Deep Dive: The Intelligence Pipeline
&lt;/h2&gt;

&lt;p&gt;This is where Kronveil gets interesting. Let me walk through how a single CPU spike turns into an auto-remediated incident.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;Kronveil uses a combination of statistical methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Z-Score Analysis&lt;/strong&gt;: Measures how many standard deviations a value is from the mean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EWMA&lt;/strong&gt;: Smooths out noise to detect real trends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linear Trend Prediction&lt;/strong&gt;: Identifies directional trends to predict upcoming anomalies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The detector maintains a sliding time window for each signal and requires a minimum of 30 data points before it starts detecting. This prevents false positives during cold starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sensitivity levels:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Z-Score Threshold&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;2.0 sigma&lt;/td&gt;
&lt;td&gt;Critical systems, catch everything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;3.0 sigma&lt;/td&gt;
&lt;td&gt;Default, balanced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;4.0 sigma&lt;/td&gt;
&lt;td&gt;Noisy environments, reduce alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 2: Incident Creation &amp;amp; Severity Scoring
&lt;/h3&gt;

&lt;p&gt;When an anomaly is detected, it gets scored on a 0.0 to 1.0 scale:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Score &amp;gt;= 0.9  --&amp;gt;  CRITICAL  --&amp;gt;  Page On-Call
Score &amp;gt;= 0.7  --&amp;gt;  HIGH      --&amp;gt;  Slack Alert
Score &amp;gt;= 0.5  --&amp;gt;  MEDIUM    --&amp;gt;  Dashboard
Score &amp;lt;  0.5  --&amp;gt;  LOW       --&amp;gt;  Log Only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The incident responder also &lt;strong&gt;correlates events&lt;/strong&gt; — grouping related anomalies within the same time window to avoid alert storms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Root Cause Analysis (LLM-Powered)
&lt;/h3&gt;

&lt;p&gt;For high/critical incidents, Kronveil uses &lt;strong&gt;AWS Bedrock&lt;/strong&gt; (Claude or Titan):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a &lt;strong&gt;dependency graph&lt;/strong&gt; of affected services&lt;/li&gt;
&lt;li&gt;Traverse the graph using &lt;strong&gt;DFS&lt;/strong&gt; to find the causal chain&lt;/li&gt;
&lt;li&gt;Collect evidence (metrics, logs, events)&lt;/li&gt;
&lt;li&gt;Send to the LLM with a structured prompt&lt;/li&gt;
&lt;li&gt;Receive root cause explanation and recommended fix&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 4: Auto-Remediation
&lt;/h3&gt;

&lt;p&gt;Supported actions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scale_deployment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scale up/down pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;restart_pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rolling restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rollback_deploy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Revert to previous version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;drain_node&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safely drain a problematic node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;failover_db&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Database failover&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;toggle_feature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Feature flag toggle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Safety is built in:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Circuit Breaker&lt;/strong&gt;: Max 5 attempts per 10 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dry Run Mode&lt;/strong&gt;: Test remediation without executing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval Required&lt;/strong&gt;: Optional human-in-the-loop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooldown Period&lt;/strong&gt;: Prevent remediation storms&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Testing It Live
&lt;/h2&gt;

&lt;p&gt;I deployed Kronveil on a local Kubernetes cluster using &lt;code&gt;kind&lt;/code&gt; and tested the full pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kind create cluster &lt;span class="nt"&gt;--name&lt;/span&gt; kronveil-test

docker build &lt;span class="nt"&gt;-f&lt;/span&gt; deploy/Dockerfile.agent &lt;span class="nt"&gt;-t&lt;/span&gt; kronveil:latest &lt;span class="nb"&gt;.&lt;/span&gt;
kind load docker-image kronveil:latest &lt;span class="nt"&gt;--name&lt;/span&gt; kronveil-test

helm &lt;span class="nb"&gt;install &lt;/span&gt;kronveil ./helm/kronveil &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kronveil &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; image.repository&lt;span class="o"&gt;=&lt;/span&gt;kronveil &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; image.tag&lt;span class="o"&gt;=&lt;/span&gt;latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; image.pullPolicy&lt;span class="o"&gt;=&lt;/span&gt;Never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Health Check
&lt;/h3&gt;

&lt;p&gt;All 6 modules running and healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"components"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kubernetes-collector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kafka-collector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anomaly-detector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"incident-responder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"root-cause-analyzer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"capacity-planner"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Triggering Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;Kronveil includes a test injection endpoint. The &lt;code&gt;burst&lt;/code&gt; mode sends 35 normal baseline events followed by a single spike — triggering the full pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"http://localhost:8080/api/v1/test/inject?mode=burst"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"source":"production-api","signal":"cpu_usage"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"burst_complete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"events_injected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anomalies_found"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"incidents_created"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anomalies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"signal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production-api.cpu_usage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.97&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"critical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value 200.00 deviates 5.8 sigma from mean"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"incidents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INC-0001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"critical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"resolved"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"timeline"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"created"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"actor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"remediation_started"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"actor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ai"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"resolved"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MTTR: 1.7ms"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happened in those 1.7 milliseconds:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;35 baseline events established a normal CPU usage pattern (~50%)&lt;/li&gt;
&lt;li&gt;One spike event hit 200% — deviating &lt;strong&gt;5.8 sigma&lt;/strong&gt; from the mean&lt;/li&gt;
&lt;li&gt;Anomaly detector flagged it as &lt;strong&gt;critical&lt;/strong&gt; (score: 0.97/1.0)&lt;/li&gt;
&lt;li&gt;Incident responder created &lt;strong&gt;INC-0001&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Auto-remediation kicked in with &lt;code&gt;scale_deployment&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Incident resolved — &lt;strong&gt;MTTR: 1.7ms&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
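The detection in step 3 is the Z-Score method from the tech stack. Here is a minimal sketch of that check; the `zScore` function and the 3-sigma threshold are illustrative assumptions, not Kronveil's actual API:

```go
package main

import (
	"fmt"
	"math"
)

// zScore returns how many standard deviations x lies from the mean
// of the baseline window.
func zScore(baseline []float64, x float64) float64 {
	var sum float64
	for _, v := range baseline {
		sum += v
	}
	mean := sum / float64(len(baseline))

	var variance float64
	for _, v := range baseline {
		variance += (v - mean) * (v - mean)
	}
	stddev := math.Sqrt(variance / float64(len(baseline)))
	if stddev == 0 {
		return 0
	}
	return math.Abs(x-mean) / stddev
}

func main() {
	// 35 baseline events hovering around 50% CPU, then a 200% spike.
	baseline := make([]float64, 35)
	for i := range baseline {
		baseline[i] = 48 + float64(i%5) // roughly the normal band
	}
	sigma := zScore(baseline, 200)
	fmt.Printf("spike deviates %.1f sigma from mean\n", sigma)
	if sigma > 3 { // illustrative critical threshold
		fmt.Println("severity: critical")
	}
}
```

The same window-then-compare shape generalizes to any numeric signal, which is why one detector can cover CPU, Kafka lag, and the rest.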

&lt;h3&gt;
  
  
  Testing Multiple Signal Sources
&lt;/h3&gt;

&lt;p&gt;I then simulated a Kafka consumer lag spike — a second anomaly detected at 5.8 sigma, triggering INC-0002 with auto-remediation. Both incidents are independently tracked with full audit trails.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Go 1.21 (single binary, ~10MB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Event Bus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache Kafka (10 topics, 3x replication)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI/LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS Bedrock (Claude, Titan)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anomaly Detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Z-Score, EWMA, Linear Regression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Policy Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OPA (Rego rules)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Secret Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS Secrets Manager + HashiCorp Vault&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kubernetes + Helm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dashboard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;React + TypeScript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST + gRPC + Prometheus&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;76 files. One binary. Zero external Go dependencies.&lt;/p&gt;
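The anomaly-detection row above lists Z-Score, EWMA, and linear regression. As an illustrative sketch of the EWMA half (the type and field names here are mine, not Kronveil's), a tracker that smooths a metric stream looks like:

```go
package main

import "fmt"

// EWMA keeps an exponentially weighted moving average of a signal.
// alpha controls how quickly old samples decay (0 < alpha <= 1).
type EWMA struct {
	alpha   float64
	value   float64
	started bool
}

// Update folds one sample into the average and returns the new value.
func (e *EWMA) Update(x float64) float64 {
	if !e.started {
		e.value = x // the first sample seeds the average
		e.started = true
		return e.value
	}
	e.value = e.alpha*x + (1-e.alpha)*e.value
	return e.value
}

func main() {
	e := &EWMA{alpha: 0.3}
	for _, v := range []float64{50, 51, 49, 50, 200} { // spike at the end
		fmt.Printf("%.1f ", e.Update(v))
	}
	fmt.Println()
}
```

A spike pulls the smoothed value up only gradually, so comparing the raw sample against the EWMA gives a drift-tolerant deviation signal.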




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real Kubernetes client-go integration&lt;/strong&gt; — watch actual pods, nodes, and events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka consumer group monitoring&lt;/strong&gt; — connect to real brokers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud secret management&lt;/strong&gt; — Azure Key Vault and GCP Secret Manager support (currently AWS-focused)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard UI&lt;/strong&gt; — React dashboard for visualizing anomalies and incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus metrics export&lt;/strong&gt; — anomaly scores, incident counts, MTTR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webhook integrations&lt;/strong&gt; — Slack and PagerDuty notifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cluster support&lt;/strong&gt; — monitor multiple clusters from a single agent&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Get Involved
&lt;/h2&gt;

&lt;p&gt;Kronveil is open source under the Apache 2.0 license.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/kronveil/kronveil" rel="noopener noreferrer"&gt;github.com/kronveil/kronveil&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Star the repo&lt;/strong&gt; if you find this useful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contributions welcome&lt;/strong&gt; — especially around new collector integrations, LLM prompt engineering, and dashboard widgets&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Developed by &lt;a href="https://github.com/sankar276" rel="noopener noreferrer"&gt;Ramasankar Molleti&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>GitOps-Driven Multi-Cluster Kubernetes Management: A Deep Dive into Modern Infrastructure</title>
      <dc:creator>Ramasankar Molleti</dc:creator>
      <pubDate>Fri, 24 Jan 2025 05:11:13 +0000</pubDate>
      <link>https://dev.to/ramasankar_molleti_f7f80d/gitops-driven-multi-cluster-kubernetes-management-a-deep-dive-into-modern-infrastructure-3ka0</link>
      <guid>https://dev.to/ramasankar_molleti_f7f80d/gitops-driven-multi-cluster-kubernetes-management-a-deep-dive-into-modern-infrastructure-3ka0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As organizations scale their container deployments, managing multiple Kubernetes clusters across different environments and regions has become increasingly complex. This article explores modern approaches to multi-cluster management using GitOps principles, focusing on real-world implementation strategies and emerging best practices in 2025.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution of Cluster Management
&lt;/h2&gt;

&lt;p&gt;Traditional Kubernetes management relies on direct cluster access and manual intervention. The modern landscape demands more sophisticated approaches, and GitOps has emerged as the de facto standard for managing declarative infrastructure, providing consistency, reliability, and auditability that traditional methods cannot match.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Components of Modern Kubernetes Architecture
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Cluster Blueprints
Modern Kubernetes deployments utilize cluster blueprints - templated configurations that define the entire cluster state, including:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Node pool configurations&lt;/li&gt;
&lt;li&gt;Security policies&lt;/li&gt;
&lt;li&gt;Network policies&lt;/li&gt;
&lt;li&gt;Service mesh setup&lt;/li&gt;
&lt;li&gt;Monitoring and logging infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;GitOps Control Plane&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The GitOps control plane consists of several critical components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: gitops.example.com/v1
kind: ClusterTemplate
metadata:
  name: production-blueprint
spec:
  version: 1.28.0
  networking:
    cni: cilium
    serviceType: internal
  security:
    policyEngine: OPA
    imageScanning: true
  observability:
    prometheus: true
    opentelemetry: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Advanced Multi-Cluster Patterns&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Fleet Management
&lt;/h2&gt;

&lt;p&gt;Modern fleet management introduces the concept of cluster sets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: fleet.example.com/v1
kind: ClusterSet
metadata:
  name: production-fleet
spec:
  regions:
    - name: us-east
      clusters: 3
      template: production-blueprint
    - name: eu-west
      clusters: 2
      template: production-blueprint
  loadBalancing:
    mode: global
    algorithm: weighted-least-request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementing Zero-Trust Security
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Certificate Management
&lt;/h2&gt;

&lt;p&gt;Modern Kubernetes deployments require sophisticated certificate management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type CertificateRotation struct {
    Interval    time.Duration
    Algorithm   string
    KeySize     int
    CommonName  string
    SANs        []string
}

func (c *CertificateRotation) Setup() error {
    // Implementation for automated certificate rotation
    return nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Network Policy Enforcement
&lt;/h2&gt;

&lt;p&gt;Example of a zero-trust network policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: zero-trust-policy
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          security-zone: trusted
    ports:
    - protocol: TCP
      port: 443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced Observability
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Distributed Tracing
&lt;/h2&gt;

&lt;p&gt;Implementation of OpenTelemetry-based tracing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func setupTracing(ctx context.Context) (*trace.TracerProvider, error) {
    // Exporter from go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithInsecure(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
    )
    if err != nil {
        return nil, err
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(
            resource.NewWithAttributes(
                semconv.SchemaURL,
                semconv.ServiceNameKey.String("cluster-manager"),
            ),
        ),
    )
    return tp, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Metrics Aggregation
&lt;/h2&gt;

&lt;p&gt;Example of custom metrics collection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type ClusterMetrics struct {
    NodeUtilization    float64
    PodDensity         float64
    NetworkLatency     map[string]float64
    ResourceQoS        map[string]int
}

func (cm *ClusterMetrics) Collect() error {
    // Implementation for metrics collection
    return nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Disaster Recovery and Business Continuity
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Cross-Cluster Backup Strategy
&lt;/h2&gt;

&lt;p&gt;Implementation of automated backup procedures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type BackupStrategy struct {
    Interval    time.Duration
    Retention   time.Duration
    Encryption  bool
    Location    string
}

func (b *BackupStrategy) Execute() error {
    // Implementation for backup execution
    return nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Recovery Time Objectives
&lt;/h2&gt;

&lt;p&gt;Example of recovery automation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func automateRecovery(cluster *Cluster) error {
    // Step 1: Validate backup integrity
    if err := validateBackup(cluster.LastBackup); err != nil {
        return err
    }

    // Step 2: Restore core components
    if err := restoreCoreComponents(cluster); err != nil {
        return err
    }

    // Step 3: Verify cluster health
    return verifyClusterHealth(cluster)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cost Optimization Strategies
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Resource Right-Sizing
&lt;/h2&gt;

&lt;p&gt;Example of automated resource optimization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type ResourceOptimizer struct {
    Thresholds  map[string]float64
    History     []ResourceMetrics
    Predictions []ResourcePrediction
}

func (ro *ResourceOptimizer) Optimize() (*ResourceRecommendation, error) {
    // Implementation for resource optimization
    return nil, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
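As a hedged illustration of what `Optimize` might compute — the percentile-plus-headroom heuristic and the `recommendRequest` name are my assumptions, not the article's — right-sizing a CPU request from usage history:

```go
package main

import (
	"fmt"
	"sort"
)

// recommendRequest suggests a CPU request (in millicores) from usage
// history: the 95th percentile of observed usage times a headroom factor,
// so the request tracks real demand while ignoring rare spikes.
func recommendRequest(usageMillicores []float64, headroom float64) float64 {
	if len(usageMillicores) == 0 {
		return 0
	}
	sorted := append([]float64(nil), usageMillicores...)
	sort.Float64s(sorted)
	idx := int(0.95 * float64(len(sorted)-1))
	return sorted[idx] * headroom
}

func main() {
	// Ten samples of observed CPU usage; one outlier spike at 500m.
	history := []float64{120, 140, 135, 500, 130, 128, 132, 125, 138, 131}
	fmt.Printf("recommended CPU request: %.0fm\n", recommendRequest(history, 1.2))
}
```

The same shape works for memory; a real optimizer would also feed the `Predictions` field forward to anticipate growth rather than only reacting to history.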



&lt;h2&gt;
  
  
  End-to-End Implementation Example
&lt;/h2&gt;

&lt;p&gt;Project Structure&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── clusters/
│   ├── production/
│   │   ├── cluster-config.yaml
│   │   ├── network-policies/
│   │   └── workloads/
│   └── staging/
├── platform/
│   ├── monitoring/
│   ├── security/
│   └── service-mesh/
└── tools/
    └── cluster-setup/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  1. Cluster Bootstrap
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize infrastructure
terraform init
terraform apply -var-file=prod.tfvars

# Bootstrap cluster
./tools/cluster-setup/bootstrap.sh \
  --cluster-name=prod-east \
  --region=us-east-1 \
  --nodes=3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Base Platform Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# platform/base/platform.yaml
apiVersion: platform.example.com/v1
kind: PlatformConfig
metadata:
  name: base-platform
spec:
  serviceMesh:
    enabled: true
    type: istio
    version: 1.20.0
    config:
      mtls: strict
      autoInject: true

  monitoring:
    prometheus:
      retention: 15d
      resources:
        requests:
          cpu: 1000m
          memory: 4Gi
    grafana:
      enabled: true
      dashboards:
        - cluster-health
        - application-metrics

  security:
    networkPolicies:
      defaultDeny: true
    podSecurityPolicies:
      enforcePrivileged: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Application Deployment
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# workloads/web-application/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      securityContext:
        runAsNonRoot: true
      containers:
      - name: web-app
        image: example/web-app:v1.2.3
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - ALL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Service Mesh Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# platform/service-mesh/virtual-service.yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
  namespace: production
spec:
  hosts:
  - web-app.example.com
  gateways:
  - production-gateway
  http:
  - match:
    - uri:
        prefix: /api
    route:
    - destination:
        host: web-app
        port:
          number: 8080
    retries:
      attempts: 3
      perTryTimeout: 2s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Monitoring Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# platform/monitoring/service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-app
  namespace: production
spec:
  selector:
    matchLabels:
      app: web-app
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Pipeline Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .github/workflows/deploy.yml
name: Deploy Application
on:
  push:
    branches: [main]
    paths:
      - 'workloads/**'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Setup Kubernetes Tools
        uses: azure/setup-kubectl@v1

      - name: Deploy to Kubernetes
        run: |
          kubectl apply -k workloads/web-application/
          kubectl rollout status deployment/web-app -n production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Testing and Verification
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Verify deployment
kubectl get pods -n production
kubectl get virtualservice -n production
kubectl get servicemonitor -n production

# Test connectivity
curl -H "Host: web-app.example.com" \
     https://production-gateway.example.com/api/health

# Check metrics
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring
# Visit http://localhost:9090 in browser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  This end-to-end example demonstrates:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure as Code setup&lt;/li&gt;
&lt;li&gt;Platform configuration&lt;/li&gt;
&lt;li&gt;Application deployment&lt;/li&gt;
&lt;li&gt;Service mesh integration&lt;/li&gt;
&lt;li&gt;Monitoring configuration&lt;/li&gt;
&lt;li&gt;CI/CD pipeline&lt;/li&gt;
&lt;li&gt;Verification steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When implementing this example:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Replace placeholder values (domains, image names)&lt;/li&gt;
&lt;li&gt;Adjust resource requests/limits based on needs&lt;/li&gt;
&lt;li&gt;Customize monitoring parameters&lt;/li&gt;
&lt;li&gt;Update security policies per requirements&lt;/li&gt;
&lt;li&gt;Configure backup/DR settings&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As Kubernetes evolves, the focus is shifting from managing individual clusters to orchestrating fleets of them at scale. Integrating GitOps principles, zero-trust security, and advanced observability provides a strong foundation for modern cloud-native applications.&lt;br&gt;
With this comprehensive approach, an organization can ensure consistency, security, and reliability across its entire container infrastructure while preparing for future scaling challenges.&lt;/p&gt;

&lt;p&gt;About the Author&lt;br&gt;
Results-driven Principal Cloud Architect with extensive experience designing and operating complex Kubernetes environments across a variety of industries.&lt;/p&gt;

&lt;p&gt;Hope you enjoyed the post.&lt;/p&gt;

&lt;p&gt;Cheers&lt;/p&gt;

&lt;p&gt;Ramasankar Molleti&lt;/p&gt;

&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/ramasankar-molleti-23b13218?trk=nav_responsive_tab_profile" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/ramasankar-molleti-23b13218?trk=nav_responsive_tab_profile&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Book 1:1 (&lt;a href="http://topmate.io/ramasankar_molleti/?utm_source=topmate&amp;amp;utm_medium=popup&amp;amp;utm_campaign=Page_Ready" rel="noopener noreferrer"&gt;http://topmate.io/ramasankar_molleti/?utm_source=topmate&amp;amp;utm_medium=popup&amp;amp;utm_campaign=Page_Ready&lt;/a&gt;) &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
