DEV Community

Ramasankar Molleti
Ramasankar Molleti

Posted on

I Built an AI-Powered Infrastructure Observability Agent from Scratch

Kronveil watches your infrastructure, detects anomalies in real time, and auto-remediates incidents before you even wake up.

As platform engineers, we've all been there: 3 AM pages, scrambling through dashboards, correlating logs across 15 different tools, and trying to figure out why the system broke — not just what broke.

I built Kronveil to solve this. It's an open-source, AI-powered observability agent that combines deep telemetry collection, real-time anomaly detection, LLM-powered root cause analysis, and autonomous remediation — all in a single Go binary.

In this post, I'll walk you through the architecture, the intelligence pipeline, and show you real test results of the system detecting anomalies and auto-remediating incidents in milliseconds.

GitHub: github.com/kronveil/kronveil


The Problem

Modern infrastructure is complex. A typical production environment has:

  • Hundreds of Kubernetes pods scaling up and down
  • Apache Kafka clusters processing millions of events per second
  • Multi-cloud workloads across AWS, Azure, and GCP
  • CI/CD pipelines deploying dozens of times per day

Traditional monitoring tools tell you what happened. But by the time you get the alert, correlate the signals, and figure out the root cause — you've already burned 30 minutes of MTTR.

What if your observability platform could think?


Architecture Overview

Kronveil is designed as a layered system with four main tiers:

 LAYER 1: DATA COLLECTION
 ========================
 +----------+  +-------+  +-------+
 |Kubernetes|  | Kafka |  | Cloud |
 |Collector |  |Collect|  |Collect|
 +----+-----+  +---+---+  +---+---+
      |            |           |
 +----+-----+  +--+---+       |
 |CI/CD     |  | Logs |       |
 |Collector |  |Tailer|       |
 +----+-----+  +--+---+       |
      |            |           |
      v            v           v
 ================================
 LAYER 2: KAFKA EVENT BUS
 ================================
 telemetry.raw -> telemetry.enriched
 anomalies.detected -> incidents.new
 remediation.actions -> policy.audit
 (10M+ events/sec | 3x replication)
 ================================
            |
            v
 LAYER 3: INTELLIGENCE
 ========================
 +---------+ +----------+
 | Anomaly | | Root     |
 | Detect  | | Cause    |
 | Z-Score | | Analyzer |
 | EWMA    | | DFS+LLM  |
 +---------+ +----------+
       |          |
       v          v
 +---------------------+
 | INCIDENT RESPONDER  |
 | Detect -> Triage    |
 | -> Respond -> Resolve|
 +---------------------+
            |
            v
 LAYER 4: ACTION
 ========================
 +-------+ +------+ +------+
 | Slack | |Pager | |Prom  |
 | Alert | |Duty  | |Metric|
 +-------+ +------+ +------+
Enter fullscreen mode Exit fullscreen mode

The diagram above shows the full platform. Let's break down each layer.

Layer 1: Data Collection

Five specialized collectors continuously gather telemetry from your infrastructure. Each collector is a Go interface implementation that runs in its own goroutine and pushes TelemetryEvent structs into the event bus.

+--------------+---------------------------+
| Collector    | What It Watches           |
+--------------+---------------------------+
| Kubernetes   | Pods, Nodes, Events, HPA  |
|              | Metrics API, Deployments  |
+--------------+---------------------------+
| Kafka        | Consumer lag, Topics      |
|              | Throughput, Partitions    |
+--------------+---------------------------+
| Cloud        | EC2, RDS, ELB, Lambda    |
|              | S3, CloudWatch metrics   |
+--------------+---------------------------+
| CI/CD        | GitHub Actions, Jenkins  |
|              | GitLab CI pipelines      |
+--------------+---------------------------+
| Logs         | File tailing, Syslog     |
|              | Structured log parsing   |
+--------------+---------------------------+
Enter fullscreen mode Exit fullscreen mode

Each collector implements a simple interface:

type Collector interface {
    Name() string
    Start(ctx context.Context) error
    Stop() error
    Health() ComponentHealth
}
Enter fullscreen mode Exit fullscreen mode

This means adding a new data source (e.g., Datadog, New Relic) is just implementing this interface — no changes to the core engine needed.

Layer 2: Apache Kafka Event Bus

All telemetry flows through a unified Kafka event bus. This decouples collectors from intelligence modules — they don't know about each other. The bus handles 10M+ events/sec with 3x replication.

KAFKA TOPICS (10 total):
========================

Telemetry Flow:
  telemetry.raw
    -> telemetry.enriched
      -> anomalies.detected

Incident Flow:
  incidents.new
    -> incidents.updated
      -> remediation.actions

Governance Flow:
  policy.violations
    -> policy.audit
      -> capacity.forecasts

Config: capacity.changes
Enter fullscreen mode Exit fullscreen mode

Why Kafka? Three reasons:

  1. Durability — events survive crashes, enabling replay and audit trails
  2. Fan-out — multiple intelligence modules can consume the same event stream independently
  3. Backpressure — if anomaly detection falls behind, events queue up instead of being dropped

Layer 3: Intelligence Engine

This is the brain of Kronveil. Three modules analyze telemetry in parallel, each specializing in a different aspect:

+----------------------------------+
|        ANOMALY DETECTOR          |
|                                  |
|  Input: telemetry.enriched      |
|                                  |
|  Algorithms:                    |
|  - Z-Score (deviation from mean)|
|  - EWMA (trend smoothing)      |
|  - Linear Trend (prediction)   |
|                                  |
|  Output: anomalies.detected    |
+----------------------------------+
          |
          v
+----------------------------------+
|     ROOT CAUSE ANALYZER          |
|                                  |
|  Input: anomalies.detected      |
|                                  |
|  Process:                       |
|  1. Build dependency graph      |
|  2. DFS traversal for causality |
|  3. Collect evidence            |
|  4. LLM analysis (AWS Bedrock) |
|                                  |
|  Output: root cause + fix       |
+----------------------------------+
          |
          v
+----------------------------------+
|     CAPACITY PLANNER             |
|                                  |
|  Input: telemetry.enriched      |
|                                  |
|  Algorithms:                    |
|  - Linear regression forecast   |
|  - Confidence intervals         |
|  - Resource right-sizing        |
|                                  |
|  Output: capacity.forecasts    |
+----------------------------------+
Enter fullscreen mode Exit fullscreen mode

All three modules feed into the Incident Responder, which orchestrates the full incident lifecycle:

INCIDENT LIFECYCLE:
===================

  Anomaly    Root Cause    Capacity
  Detected   Found         Alert
     \          |           /
      v         v          v
  +-------------------------+
  |   INCIDENT RESPONDER    |
  |                         |
  |  1. Create Incident     |
  |  2. Score Severity      |
  |  3. Correlate Events    |
  |  4. Auto-Remediate      |
  |  5. Notify (Slack/PD)   |
  |  6. Track Resolution    |
  +-------------------------+
           |
           v
  +-------------------------+
  |   AUTO-REMEDIATION      |
  |                         |
  |  - scale_deployment     |
  |  - restart_pods         |
  |  - rollback_deploy      |
  |  - drain_node           |
  |  - failover_db          |
  |  - toggle_feature       |
  |                         |
  |  Safety:                |
  |  - Circuit breaker      |
  |  - Dry run mode         |
  |  - Human approval gate  |
  +-------------------------+
Enter fullscreen mode Exit fullscreen mode

Layer 4: Action & Integrations

The final layer delivers results to humans and systems:

+----------+  +-----------+  +---------+
| AWS      |  | Slack     |  | Pager   |
| Bedrock  |  | Block Kit |  | Duty    |
| (LLM)   |  | Alerts    |  | Events  |
+----------+  +-----------+  +---------+

+----------+  +-----------+  +---------+
| REST API |  | gRPC API  |  | Prom    |
| :8080    |  | :9091     |  | :9090   |
+----------+  +-----------+  +---------+
Enter fullscreen mode Exit fullscreen mode
  • REST API (:8080) — Dashboard, incident management, test injection
  • gRPC API (:9091) — High-performance inter-service communication
  • Prometheus (:9090) — Metrics export for Grafana dashboards
  • Slack — Real-time alerts with Block Kit rich formatting
  • PagerDuty — On-call escalation via Events API v2
  • AWS Bedrock — LLM backbone for root cause analysis

The Complete Event Flow

Here's how a single CPU spike travels through the entire system:

CPU spike on pod-xyz (95% usage)
  |
  v
[K8s Collector] picks up metric
  |
  v
[Kafka] telemetry.raw topic
  |
  v
[Anomaly Detector] Z-score = 5.8 sigma
  |
  v
[Kafka] anomalies.detected topic
  |
  v
[Incident Responder] creates INC-0001
  |
  +---> [Root Cause Analyzer]
  |       |
  |       v
  |     DFS on dependency graph
  |       |
  |       v
  |     AWS Bedrock LLM analysis
  |       |
  |       v
  |     "OOM in pod-xyz caused by
  |      memory leak in v2.3.1"
  |
  +---> [Auto-Remediation]
  |       |
  |       v
  |     scale_deployment (replicas: 5)
  |
  +---> [Slack] Alert with root cause
  +---> [PagerDuty] Page on-call
  +---> [Prometheus] Metric exported
  |
  v
INC-0001 resolved (MTTR: 1.7ms)
Enter fullscreen mode Exit fullscreen mode

Deep Dive: The Intelligence Pipeline

This is where Kronveil gets interesting. Let me walk through how a single CPU spike turns into an auto-remediated incident.

Step 1: Anomaly Detection

Kronveil uses a combination of statistical methods:

  • Z-Score Analysis: Measures how many standard deviations a value is from the mean
  • EWMA: Smooths out noise to detect real trends
  • Linear Trend Prediction: Identifies directional trends to predict upcoming anomalies

The detector maintains a sliding time window for each signal and requires a minimum of 30 data points before it starts detecting. This prevents false positives during cold starts.

Sensitivity levels:

Level Z-Score Threshold Use Case
High 2.0 sigma Critical systems, catch everything
Medium 3.0 sigma Default, balanced
Low 4.0 sigma Noisy environments, reduce alerts

Step 2: Incident Creation & Severity Scoring

When an anomaly is detected, it gets scored on a 0.0 to 1.0 scale:

Score >= 0.9  -->  CRITICAL  -->  Page On-Call
Score >= 0.7  -->  HIGH      -->  Slack Alert
Score >= 0.5  -->  MEDIUM    -->  Dashboard
Score <  0.5  -->  LOW       -->  Log Only
Enter fullscreen mode Exit fullscreen mode

The incident responder also correlates events — grouping related anomalies within the same time window to avoid alert storms.

Step 3: Root Cause Analysis (LLM-Powered)

For high/critical incidents, Kronveil uses AWS Bedrock (Claude or Titan):

  1. Build a dependency graph of affected services
  2. Traverse the graph using DFS to find the causal chain
  3. Collect evidence (metrics, logs, events)
  4. Send to the LLM with a structured prompt
  5. Receive root cause explanation and recommended fix

Step 4: Auto-Remediation

Supported actions:

Action Description
scale_deployment Scale up/down pods
restart_pods Rolling restart
rollback_deploy Revert to previous version
drain_node Safely drain a problematic node
failover_db Database failover
toggle_feature Feature flag toggle

Safety is built in:

  • Circuit Breaker: Max 5 attempts per 10 minutes
  • Dry Run Mode: Test remediation without executing
  • Approval Required: Optional human-in-the-loop
  • Cooldown Period: Prevent remediation storms

Testing It Live

I deployed Kronveil on a local Kubernetes cluster using kind and tested the full pipeline.

Deployment

kind create cluster --name kronveil-test

docker build -f deploy/Dockerfile.agent -t kronveil:latest .
kind load docker-image kronveil:latest --name kronveil-test

helm install kronveil ./helm/kronveil \
  --namespace kronveil --create-namespace \
  --set image.repository=kronveil \
  --set image.tag=latest \
  --set image.pullPolicy=Never
Enter fullscreen mode Exit fullscreen mode

Health Check

All 6 modules running and healthy:

{
  "data": {
    "status": "healthy",
    "components": [
      {"name": "kubernetes-collector", "status": "healthy"},
      {"name": "kafka-collector", "status": "healthy"},
      {"name": "anomaly-detector", "status": "healthy"},
      {"name": "incident-responder", "status": "healthy"},
      {"name": "root-cause-analyzer", "status": "healthy"},
      {"name": "capacity-planner", "status": "healthy"}
    ]
  }
}
Enter fullscreen mode Exit fullscreen mode

Triggering Anomaly Detection

Kronveil includes a test injection endpoint. The burst mode sends 35 normal baseline events followed by a single spike — triggering the full pipeline.

curl -s -X POST \
  "http://localhost:8080/api/v1/test/inject?mode=burst" \
  -H "Content-Type: application/json" \
  -d '{"source":"production-api","signal":"cpu_usage"}'
Enter fullscreen mode Exit fullscreen mode

Result:

{
  "data": {
    "status": "burst_complete",
    "events_injected": 36,
    "anomalies_found": 1,
    "incidents_created": 1,
    "anomalies": [{
      "signal": "production-api.cpu_usage",
      "score": 0.97,
      "severity": "critical",
      "description": "value 200.00 deviates 5.8 sigma from mean"
    }],
    "incidents": [{
      "id": "INC-0001",
      "severity": "critical",
      "status": "resolved",
      "timeline": [
        {"action": "created", "actor": "system"},
        {"action": "remediation_started", "actor": "ai"},
        {"action": "resolved", "details": "MTTR: 1.7ms"}
      ]
    }]
  }
}
Enter fullscreen mode Exit fullscreen mode

What happened in those 1.7 milliseconds:

  1. 35 baseline events established a normal CPU usage pattern (~50%)
  2. One spike event hit 200% — deviating 5.8 sigma from the mean
  3. Anomaly detector flagged it as critical (score: 0.97/1.0)
  4. Incident responder created INC-0001
  5. Auto-remediation kicked in with scale_deployment
  6. Incident resolved — MTTR: 1.7ms

Testing Multiple Signal Sources

I then simulated a Kafka consumer lag spike — a second anomaly detected at 5.8 sigma, triggering INC-0002 with auto-remediation. Both incidents are independently tracked with full audit trails.


Tech Stack

Component Technology
Agent Go 1.21 (single binary, ~10MB)
Event Bus Apache Kafka (10 topics, 3x replication)
AI/LLM AWS Bedrock (Claude, Titan)
Anomaly Detection Z-Score, EWMA, Linear Regression
Policy Engine OPA (Rego rules)
Secret Management AWS Secrets Manager + HashiCorp Vault
Deployment Kubernetes + Helm
Dashboard React + TypeScript
API REST + gRPC + Prometheus

76 files. One binary. Zero external Go dependencies.


What's Next

  • Real Kubernetes client-go integration — watch actual pods, nodes, and events
  • Kafka consumer group monitoring — connect to real brokers
  • Multi-cloud secret management — Azure Key Vault and GCP Secret Manager support (currently AWS-focused)
  • Dashboard UI — React dashboard for visualizing anomalies and incidents
  • Prometheus metrics export — anomaly scores, incident counts, MTTR
  • Webhook integrations — Slack and PagerDuty notifications
  • Multi-cluster support — monitor multiple clusters from a single agent

Get Involved

Kronveil is open source under the Apache 2.0 license.

  • GitHub: github.com/kronveil/kronveil
  • Star the repo if you find this useful
  • Contributions welcome — especially around new collector integrations, LLM prompt engineering, and dashboard widgets

Developed by Ramasankar Molleti

Top comments (0)