Quick Recap
A week ago, I launched Kronveil - an AI-powered infrastructure observability agent that detects anomalies, performs root cause analysis, and auto-remediates incidents in seconds. The response was incredible.
But that first version had a lot of stubs. The roadmap listed features like "Dashboard UI", "Prometheus metrics", and "multi-cloud secret management" as coming soon.
They're here now.
This post covers every new feature shipped in v0.2, a step-by-step guide to run Kronveil locally with Docker Compose, and live screenshots from the running dashboard.
What's New in v0.2
1. Full Dashboard UI (React + TypeScript)
The biggest visible change. Kronveil now ships with a production-ready dashboard built with React 18, TypeScript, Tailwind CSS, and Recharts.
Six pages, zero fluff:
| Page | What It Shows |
|---|---|
| Overview | Real-time event throughput, active incidents, MTTR, anomaly count (24h), cluster health matrix |
| Incidents | Filterable list (active/acknowledged/resolved), timeline view, root cause display, one-click acknowledge/resolve |
| Anomalies | Detected anomalies with scores, signal source, severity, historical comparison |
| Collectors | Health status per collector, event emission rates, degradation indicators |
| Policies | OPA policy listing, enable/disable toggles, violation history, Rego rule display |
| Settings | Collector config, integration credentials, anomaly sensitivity, remediation toggles |
The dashboard runs as a separate container behind nginx, which reverse-proxies /api/ requests to the agent. No CORS headaches.
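The proxy layer is just a few lines of nginx config. Here's a simplified sketch of the idea - the container name `agent` and the internal port 8080 are assumptions, and the real file in the repo will differ:

```nginx
server {
    listen 8080;
    root /usr/share/nginx/html;

    # SPA routing: unknown paths fall back to index.html
    location / {
        try_files $uri /index.html;
    }

    # Same-origin API calls, proxied to the agent container on the
    # compose network - the browser never makes a cross-origin request
    location /api/ {
        proxy_pass http://agent:8080;
    }
}
```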
2. gRPC API with TLS/mTLS
The REST API was always there. Now there's a full gRPC API on port 9091 with four RPCs:
- `StreamEvents` - Server-side streaming of real-time telemetry events with source and severity filtering
- `GetIncident` / `ListIncidents` - Incident queries with status filtering
- `GetHealth` - Component-level health reporting
Built with reflection support, so you can debug with grpcurl out of the box. TLS and mutual TLS are configurable - just point it at your cert/key files.
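The service surface looks roughly like this - a hypothetical sketch, since the actual proto file (package name, message definitions) lives in the repo:

```protobuf
syntax = "proto3";

package kronveil.v1;

// Sketch of the four RPCs described above; message definitions elided.
service Kronveil {
  // Server-side stream of telemetry events, filterable by source/severity.
  rpc StreamEvents(StreamEventsRequest) returns (stream Event);
  rpc GetIncident(GetIncidentRequest) returns (Incident);
  rpc ListIncidents(ListIncidentsRequest) returns (ListIncidentsResponse);
  rpc GetHealth(GetHealthRequest) returns (HealthResponse);
}
```

With reflection enabled, `grpcurl -plaintext localhost:9091 list` enumerates the services without needing the proto file at all.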
3. Secret Management: Vault + AWS Secrets Manager
Two new integrations for secret lifecycle management:
HashiCorp Vault:
- Kubernetes auth method
- TLS certificate lifecycle tracking
- Secret caching for performance
AWS Secrets Manager:
- Prefix-based secret organization (`kronveil/default`)
- Rotation monitoring with configurable windows (default 30 days)
- Secret expiration tracking
- Built-in caching layer
Both use the graceful degradation pattern - if credentials aren't configured, the agent logs a warning and continues running without them.
4. Three New Collectors
The original had Kubernetes and Kafka. Now there are five:
Cloud Collector (AWS/Azure/GCP):
- CloudWatch metrics for EC2, RDS, ELB, Lambda, S3
- Multi-region support with resource enumeration
- Cost tracking per resource
CI/CD Collector (GitHub Actions):
- Webhook-based pipeline monitoring
- Job and step-level tracking with duration metrics
- Repository filtering with webhook secret validation
Logs Collector:
- File tailing with structured log parsing
- JSON, logfmt, and raw text format support
- Configurable error pattern matching (error, fatal, panic, OOM, killed)
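To make the error-pattern matching concrete, here's a Go sketch of the idea - a case-insensitive pattern list applied to the parsed message. The parsing here is deliberately crude; the real collector's format handling is more thorough:

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// errorPattern mirrors the default match list (error, fatal, panic, OOM, killed).
var errorPattern = regexp.MustCompile(`(?i)\b(error|fatal|panic|oom|killed)\b`)

// parseLine extracts a message from JSON or logfmt lines, falling back to raw text.
func parseLine(line string) string {
	var obj map[string]any
	if json.Unmarshal([]byte(line), &obj) == nil {
		if msg, ok := obj["msg"].(string); ok {
			return msg
		}
	}
	// crude logfmt handling: look for msg="..."
	if i := strings.Index(line, "msg="); i >= 0 {
		return strings.Trim(line[i+4:], `"`)
	}
	return line
}

// isError reports whether a log line matches one of the error patterns.
func isError(line string) bool {
	return errorPattern.MatchString(parseLine(line))
}

func main() {
	for _, l := range []string{
		`{"level":"info","msg":"pod started"}`,
		`{"level":"error","msg":"OOM killed container web-1"}`,
		`ts=2025-01-01 msg="fatal: disk full"`,
	} {
		fmt.Printf("error=%v  %s\n", isError(l), l)
	}
}
```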
5. Capacity Planner
New intelligence module that goes beyond anomaly detection:
- Linear regression-based forecasting (default 30-day horizon)
- Right-sizing recommendations: scale_up, scale_down, right_size, optimize
- Days-to-capacity projection
- Cost savings calculations with confidence intervals
- Historical data retention (90 days default)
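The core math is simple: fit a line to recent utilization samples and project forward. A minimal Go sketch of the days-to-capacity calculation (Kronveil's actual planner adds confidence intervals and cost math that I'm not reproducing here):

```go
package main

import "fmt"

// fitLine returns slope and intercept of a least-squares fit over
// equally spaced daily samples (x = 0, 1, 2, ...).
func fitLine(samples []float64) (slope, intercept float64) {
	n := float64(len(samples))
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	slope = (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	intercept = (sumY - slope*sumX) / n
	return slope, intercept
}

// daysToCapacity projects how many days until utilization crosses limit;
// returns -1 if usage is flat or shrinking.
func daysToCapacity(samples []float64, limit float64) int {
	slope, intercept := fitLine(samples)
	if slope <= 0 {
		return -1
	}
	last := float64(len(samples) - 1)
	days := (limit - (intercept + slope*last)) / slope
	if days < 0 {
		return 0
	}
	return int(days)
}

func main() {
	// disk utilization (%) over the last 5 days, growing ~2 points/day
	usage := []float64{70, 72, 74, 76, 78}
	fmt.Println("days until 90% capacity:", daysToCapacity(usage, 90)) // 6
}
```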
6. Policy Engine (OPA/Rego)
Compliance and governance built into the agent:
- Open Policy Agent integration with Rego language
- Default policies pre-loaded (compliance, security)
- Resource evaluation against all enabled policies
- Policy violation tracking with evaluation metrics
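A default security policy might look something like this in Rego - a hypothetical example, the shipped policies will differ:

```rego
package kronveil.security

# Deny deployments whose containers don't drop root.
deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.securityContext.runAsNonRoot
    msg := sprintf("container %q must set runAsNonRoot", [container.name])
}
```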
7. Prometheus Metrics Export
Kronveil now exposes a full Prometheus scrape endpoint on port 9090:
- Standard Go runtime metrics (goroutines, memory, GC)
- Custom Kronveil metrics: event counts per source, collector errors, policy evaluations, processing latency
- Ready-to-use with Grafana dashboards
8. OpenTelemetry (OTel) Integration
Full OpenTelemetry support for distributed tracing:
- gRPC exporter to any OTLP-compatible endpoint (Jaeger, Tempo, Datadog, etc.)
- Configurable export intervals (default 30s)
- Span and trace propagation across the agent pipeline
- Insecure mode for local development, TLS for production
- Default endpoint: `localhost:4317`
This means you can plug Kronveil into your existing OTel collector pipeline and see traces from anomaly detection through incident creation to remediation execution - all in one trace.
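On the collector side, nothing special is needed - a stock OTLP receiver picks the traces up. A minimal OpenTelemetry Collector config using the built-in `debug` exporter (swap in Jaeger/Tempo/your vendor for real use):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```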
9. PagerDuty Integration
Full Events API v2 support:
- Incident triggering, acknowledgment, resolution
- Deduplication keys for idempotent alerts
- Severity mapping (critical, high, warning, info)
- Links back to Kronveil dashboard
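The Events API v2 payload is small enough to show in full. A Go sketch of the trigger event - the field names are PagerDuty's, but the `high` → `error` severity mapping and the dedup-key scheme are my illustration, not necessarily what Kronveil does:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event mirrors the PagerDuty Events API v2 request body.
type event struct {
	RoutingKey  string  `json:"routing_key"`
	EventAction string  `json:"event_action"` // trigger | acknowledge | resolve
	DedupKey    string  `json:"dedup_key,omitempty"`
	Payload     payload `json:"payload"`
}

type payload struct {
	Summary  string `json:"summary"`
	Source   string `json:"source"`
	Severity string `json:"severity"` // critical | error | warning | info
}

// mapSeverity translates internal severities to the four PagerDuty accepts.
// The "high" -> "error" mapping is an assumption about Kronveil's levels.
func mapSeverity(s string) string {
	switch s {
	case "critical":
		return "critical"
	case "high":
		return "error"
	case "warning":
		return "warning"
	default:
		return "info"
	}
}

// newTrigger builds an idempotent trigger event: reusing the incident ID
// as dedup_key means retries update the same PagerDuty alert.
func newTrigger(routingKey, incidentID, summary, severity string) event {
	return event{
		RoutingKey:  routingKey,
		EventAction: "trigger",
		DedupKey:    "kronveil-" + incidentID,
		Payload: payload{
			Summary:  summary,
			Source:   "kronveil-agent",
			Severity: mapSeverity(severity),
		},
	}
}

func main() {
	ev := newTrigger("<ROUTING_KEY>", "inc-42", "Pod OOM on web-1", "high")
	body, _ := json.MarshalIndent(ev, "", "  ")
	fmt.Println(string(body)) // POST this to https://events.pagerduty.com/v2/enqueue
}
```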
10. Audit Logging
Security-grade audit trail:
- Event types: auth, incident, remediation, policy_change, config_change, secret_access, api_call
- In-memory buffer with file sink
- Structured JSON output via slog
11. Helm Chart for Kubernetes
Production-ready Helm chart with security-hardened defaults:
- Non-root containers (UID 1000)
- Read-only root filesystem
- Seccomp: RuntimeDefault
- NetworkPolicy for ingress/egress
- RBAC: ClusterRole with minimal permissions (pods, nodes, events, deployments)
- Prometheus scrape annotations built-in
- Liveness and readiness probes
```bash
helm install kronveil helm/kronveil/ \
  --namespace kronveil \
  --create-namespace \
  --set agent.bedrock.region=us-east-1
```
Upgraded Stack
| Component | v0.1 | v0.2 |
|---|---|---|
| Go | 1.21 | 1.25 |
| golangci-lint | v1 | v2 |
| Alpine | 3.21 | 3.23 |
| Dashboard | Planned | React 18 + Tailwind |
| API | REST only | REST + gRPC + mTLS |
| Secrets | None | Vault + AWS SM |
| Metrics Export | None | Prometheus + OTel |
| Tracing | None | OpenTelemetry (OTLP) |
| Alerting | Slack | Slack + PagerDuty |
| Deployment | Manual | Docker Compose + Helm |
| CI | Basic | Full pipeline (lint, test, security, build, Docker scan) |
Run Kronveil Locally (5 Minutes)
Here's the full local deployment walkthrough with live screenshots.
Prerequisites
- Docker Desktop installed and running
- Git
- ~2GB free RAM (Kafka needs memory)
Step 1: Clone and Build
```bash
git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker-compose -f deploy/docker-compose.yaml up --build -d
```
This builds two images and starts four containers:
| Container | Port | Purpose |
|---|---|---|
| agent | 8080 | Kronveil REST API + gRPC |
| dashboard | 3000 | Web UI (nginx + React SPA) |
| kafka | 9092 | Event bus |
| zookeeper | 2181 | Kafka coordinator |
Step 2: Verify Everything Is Running
```bash
docker-compose -f deploy/docker-compose.yaml ps
```
All four containers should show Up (healthy):
```text
NAME                 STATUS                             PORTS
deploy-agent-1       Up About a minute (healthy)        127.0.0.1:8080->8080/tcp
deploy-dashboard-1   Up About a minute (healthy)        127.0.0.1:3000->8080/tcp
deploy-kafka-1       Up About a minute (healthy)        127.0.0.1:9092->9092/tcp
deploy-zookeeper-1   Up About a minute (healthy)        2181/tcp
```
Step 3: Access the Endpoints
Once deployed, you have three endpoints available:
| Service | URL | Description |
|---|---|---|
| Dashboard | http://localhost:3000 | Full web UI with all 6 pages |
| Agent API | http://localhost:8080/api/v1/health | REST API (health, incidents, anomalies) |
| Metrics | http://localhost:9090/metrics | Prometheus scrape endpoint |
Step 4: Check Agent Health
```bash
curl http://localhost:8080/api/v1/health
```

```json
{
  "data": {
    "status": "healthy"
  }
}
```
Step 5: Open the Dashboard
Open http://localhost:3000 in your browser.
Overview Page
The Overview page shows real-time infrastructure intelligence at a glance - 10.2M events/sec throughput, 2 active incidents, 23-second average MTTR, and 47 anomalies detected in the last 24 hours. The cluster health matrix shows three clusters across US, EU, and AP regions with live node and pod counts.
Incidents Page
AI-detected and auto-remediated incidents with filtering by status (all, active, acknowledged, resolved). Each incident shows the title, description, MTTR, and number of affected resources. Notice the resolved OOM incident with 23s MTTR - that's the auto-remediation in action.
Anomalies Page
ML-powered anomaly detection and prediction. The distribution chart shows detected vs. predicted anomalies over 24 hours. Each anomaly has a score (0-100%) - the Kafka consumer lag spike scored 94%, and the system predicted a pod OOM 15 minutes before it happened.
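For intuition, a score like that can come from something as simple as a z-score mapped onto a 0-100 scale. This sketch is generic statistics, not Kronveil's actual model:

```go
package main

import (
	"fmt"
	"math"
)

// score maps how far the latest value sits from the mean of recent
// history (in standard deviations) onto 0-100; 3 sigma or more
// saturates at 100. Illustrative only.
func score(history []float64, latest float64) float64 {
	var sum, sumSq float64
	for _, v := range history {
		sum += v
		sumSq += v * v
	}
	n := float64(len(history))
	mean := sum / n
	std := math.Sqrt(sumSq/n - mean*mean)
	if std == 0 {
		if latest == mean {
			return 0
		}
		return 100
	}
	z := math.Abs(latest-mean) / std
	return math.Min(z/3, 1) * 100
}

func main() {
	lag := []float64{100, 110, 95, 105, 90} // steady Kafka consumer lag
	fmt.Printf("score for 105: %.0f\n", score(lag, 105)) // small wiggle, low score
	fmt.Printf("score for 400: %.0f\n", score(lag, 400)) // clear spike, 100
}
```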
Collectors Page
Telemetry collection agents across your infrastructure. Five active collectors processing 10.2M events/sec across 487 targets with only 0.001% error rate. Kubernetes leads at 4.2M events/sec monitoring 3 clusters, 54 nodes, and 312 pods. Each collector shows real-time health status.
Scroll down to see all five collectors - Kubernetes, Apache Kafka, AWS CloudWatch, GitHub Actions (CI/CD), and the Logs collector. GitHub Actions shows a degraded status with 3 errors, which is expected when webhook endpoints aren't publicly accessible in a local deployment.
Step 6: Explore the API
Full system status:

```bash
curl http://localhost:8080/api/v1/status | python3 -m json.tool
```

List collectors and their health:

```bash
curl http://localhost:8080/api/v1/collectors | python3 -m json.tool
```

Inject a test event (single):

```bash
curl -X POST "http://localhost:8080/api/v1/test/inject?mode=single"
```

Inject a burst of events to trigger anomaly detection:

```bash
curl -X POST "http://localhost:8080/api/v1/test/inject?mode=burst"
```

After the burst injection, check for detected anomalies:

```bash
curl http://localhost:8080/api/v1/anomalies | python3 -m json.tool
```

And incidents that were auto-created:

```bash
curl http://localhost:8080/api/v1/incidents | python3 -m json.tool
```
Step 7: Prometheus Metrics
```bash
curl http://localhost:9090/metrics
```
You'll see standard Go metrics plus Kronveil-specific counters for events processed, collector errors, and policy evaluations. Wire this into your Grafana instance for dashboards.
Step 8: Tail the Logs
```bash
docker-compose -f deploy/docker-compose.yaml logs -f agent
```
Watch the agent detect anomalies, correlate incidents, and execute remediation in real-time.
Cleanup
```bash
docker-compose -f deploy/docker-compose.yaml down
```
Architecture Diagram (Updated)
```text
                       +------------------+
                       |   Dashboard UI   |
                       |  (React + nginx) |
                       |      :3000       |
                       +--------+---------+
                                |
                           /api/ proxy
                                |
+------------------+  +---------v---------+  +------------------+
|    Collectors    |  |  Kronveil Agent   |  |   Integrations   |
|                  +->+                   +->+                  |
|  - Kubernetes    |  |  REST API :8080   |  |  - Slack         |
|  - Kafka         |  |  gRPC API :9091   |  |  - PagerDuty     |
|  - Cloud (AWS)   |  |  Metrics  :9090   |  |  - Prometheus    |
|  - CI/CD         |  |                   |  |  - OpenTelemetry |
|  - Logs          |  |  +=============+  |  |  - AWS Bedrock   |
+------------------+  |  | Intelligence|  |  |  - Vault         |
                      |  | - Anomaly   |  |  |  - AWS Secrets   |
                      |  | - RootCause |  |  +------------------+
                      |  | - Capacity  |  |
                      |  | - Incident  |  |   +-----------+
                      |  +=============+  +-->| OTel      |
                      |                   |   | Collector |
                      |  +=============+  |   +-----------+
                      |  | Policy (OPA)|  |
                      |  | Audit Log   |  |
                      |  +=============+  |
                      +---------+---------+
                                |
                       +--------v---------+
                       |   Apache Kafka   |
                       |      :9092       |
                       +------------------+
```
CI Pipeline
Every push to main runs seven jobs:
- Lint - golangci-lint v2 with staticcheck, errcheck, govet
- Test - `go test -race` with a 40% coverage threshold
- Security Scan - govulncheck for Go stdlib/dependency CVEs
- Build - Cross-compile with ldflags (version, commit, date)
- Docker Build & Scan - Multi-stage build + Trivy vulnerability scan (CRITICAL/HIGH)
- Dashboard - npm ci, ESLint, Vite production build
- Helm Lint - Chart validation
All green before merge. No exceptions.
What's Next (v0.3 Roadmap)
- Multi-cluster support - Federated monitoring across Kubernetes clusters
- Custom collector SDK - Build your own collectors with a plugin interface
- Runbook automation - Attach runbooks to incident types
- Cost anomaly detection - Spot unexpected cloud spend spikes
- Grafana dashboards - Pre-built dashboards for Kronveil Prometheus metrics
- Mobile alerts - Push notifications via native apps
Try It
GitHub: github.com/kronveil/kronveil
License: Apache 2.0
```bash
git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker-compose -f deploy/docker-compose.yaml up --build -d
# Open http://localhost:3000
```
If you find it useful, star the repo. If you find a bug, open an issue. PRs welcome - especially for new collectors, dashboard improvements, and LLM prompt tuning.
Follow me for more updates on building production-grade infrastructure tooling with Go and AI.




