
Ramasankar Molleti


Kronveil v0.2: Dashboard, gRPC, Secret Management, and Local Deployment - Here's What Changed

Quick Recap

A week ago, I launched Kronveil - an AI-powered infrastructure observability agent that detects anomalies, performs root cause analysis, and auto-remediates incidents in milliseconds. The response was incredible.

But that first version had a lot of stubs. The roadmap listed features like "Dashboard UI", "Prometheus metrics", and "multi-cloud secret management" as coming soon.

They're here now.

This post covers every new feature shipped in v0.2, a step-by-step guide to run Kronveil locally with Docker Compose, and live screenshots from the running dashboard.


What's New in v0.2

1. Full Dashboard UI (React + TypeScript)

The biggest visible change. Kronveil now ships with a production-ready dashboard built with React 18, TypeScript, Tailwind CSS, and Recharts.

Six pages, zero fluff:

| Page | What It Shows |
| --- | --- |
| Overview | Real-time event throughput, active incidents, MTTR, anomaly count (24h), cluster health matrix |
| Incidents | Filterable list (active/acknowledged/resolved), timeline view, root cause display, one-click acknowledge/resolve |
| Anomalies | Detected anomalies with scores, signal source, severity, historical comparison |
| Collectors | Health status per collector, event emission rates, degradation indicators |
| Policies | OPA policy listing, enable/disable toggles, violation history, Rego rule display |
| Settings | Collector config, integration credentials, anomaly sensitivity, remediation toggles |

The dashboard runs as a separate container behind nginx, which reverse-proxies /api/ requests to the agent. No CORS headaches.
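For reference, that proxy rule is only a few lines of nginx config. This is a sketch, assuming the agent container is reachable as `agent:8080` on the compose network; the actual shipped config may differ:

```nginx
server {
    listen 8080;
    root /usr/share/nginx/html;        # built React assets

    location /api/ {
        proxy_pass http://agent:8080;  # forward API calls to the agent container
    }

    location / {
        try_files $uri /index.html;    # SPA fallback routing
    }
}
```

Because the browser only ever talks to one origin, no CORS configuration is needed on the agent side.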

2. gRPC API with TLS/mTLS

The REST API was always there. Now there's a full gRPC API on port 9091 with four RPCs:

  • StreamEvents - Server-side streaming of real-time telemetry events with source and severity filtering
  • GetIncident / ListIncidents - Incident queries with status filtering
  • GetHealth - Component-level health reporting

Built with reflection support, so you can debug with grpcurl out of the box. TLS and mutual TLS are configurable - just point it at your cert/key files.
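With a local instance running, a quick grpcurl smoke test might look like this (the mTLS file names are placeholders for your own certs):

```shell
# List services exposed via server reflection (plaintext, local dev)
grpcurl -plaintext localhost:9091 list

# Describe the reflected services and their request/response types
grpcurl -plaintext localhost:9091 describe

# With mTLS enabled, point grpcurl at your CA and client cert/key instead
grpcurl -cacert ca.pem -cert client.pem -key client-key.pem localhost:9091 list
```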

3. Secret Management: Vault + AWS Secrets Manager

Two new integrations for secret lifecycle management:

HashiCorp Vault:

  • Kubernetes auth method
  • TLS certificate lifecycle tracking
  • Secret caching for performance

AWS Secrets Manager:

  • Prefix-based secret organization (kronveil/ default)
  • Rotation monitoring with configurable windows (default 30 days)
  • Secret expiration tracking
  • Built-in caching layer

Both use the graceful degradation pattern - if credentials aren't configured, the agent logs a warning and continues running without them.
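A minimal sketch of that degradation pattern in Go (the types and names here are illustrative, not Kronveil's actual code):

```go
package main

import (
	"fmt"
	"log/slog"
	"os"
)

// SecretStore is a minimal stand-in for the Vault / AWS Secrets
// Manager clients; the real interfaces are richer.
type SecretStore interface {
	Get(key string) (string, bool)
}

// disabledStore is the no-op fallback used when no backend is configured.
type disabledStore struct{}

func (disabledStore) Get(string) (string, bool) { return "", false }

// newSecretStore degrades gracefully: with no address configured it
// logs a warning and returns the no-op store so the agent keeps running.
func newSecretStore(vaultAddr string) SecretStore {
	if vaultAddr == "" {
		slog.Warn("vault not configured; secret management disabled")
		return disabledStore{}
	}
	// A real implementation would construct and return a Vault client here.
	return disabledStore{}
}

func main() {
	store := newSecretStore(os.Getenv("VAULT_ADDR"))
	_, ok := store.Get("kronveil/api-key")
	fmt.Println("secret available:", ok)
}
```

The agent stays up either way; only the secret-dependent features switch off.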

4. Three New Collectors

The original had Kubernetes and Kafka. Now there are five:

Cloud Collector (AWS/Azure/GCP):

  • CloudWatch metrics for EC2, RDS, ELB, Lambda, S3
  • Multi-region support with resource enumeration
  • Cost tracking per resource

CI/CD Collector (GitHub Actions):

  • Webhook-based pipeline monitoring
  • Job and step-level tracking with duration metrics
  • Repository filtering with webhook secret validation

Logs Collector:

  • File tailing with structured log parsing
  • JSON, logfmt, and raw text format support
  • Configurable error pattern matching (error, fatal, panic, OOM, killed)
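As a sketch, the error-pattern matching and a logfmt split can be done in a few lines of Go (simplified for illustration; a real logfmt parser must also handle quoted values):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// errorPatterns mirrors the default keywords listed above,
// matched case-insensitively as whole words.
var errorPatterns = regexp.MustCompile(`(?i)\b(error|fatal|panic|oom|killed)\b`)

// isError reports whether a log line matches any error pattern.
func isError(line string) bool { return errorPatterns.MatchString(line) }

// parseLogfmt does a minimal logfmt split into key=value pairs.
func parseLogfmt(line string) map[string]string {
	fields := map[string]string{}
	for _, tok := range strings.Fields(line) {
		if k, v, ok := strings.Cut(tok, "="); ok {
			fields[k] = v
		}
	}
	return fields
}

func main() {
	line := "level=fatal msg=oom pod=api-7f9c"
	fmt.Println(isError(line), parseLogfmt(line)["pod"])
}
```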

5. Capacity Planner

New intelligence module that goes beyond anomaly detection:

  • Linear regression-based forecasting (default 30-day horizon)
  • Right-sizing recommendations: scale_up, scale_down, right_size, optimize
  • Days-to-capacity projection
  • Cost savings calculations with confidence intervals
  • Historical data retention (90 days default)
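The core of the forecast is ordinary least-squares over a usage series. Here's a self-contained Go sketch of the idea (the sample numbers are hypothetical, and the real planner adds confidence intervals on top):

```go
package main

import "fmt"

// fitLine performs ordinary least-squares linear regression over
// (x, y) samples, returning slope and intercept.
func fitLine(xs, ys []float64) (slope, intercept float64) {
	n := float64(len(xs))
	var sumX, sumY, sumXY, sumXX float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
		sumXY += xs[i] * ys[i]
		sumXX += xs[i] * xs[i]
	}
	slope = (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	intercept = (sumY - slope*sumX) / n
	return
}

// daysToCapacity projects when the fitted trend crosses the capacity
// limit, measured in days after the last observation. Returns -1 when
// usage is flat or shrinking (no crossing ahead).
func daysToCapacity(xs, ys []float64, limit float64) float64 {
	slope, intercept := fitLine(xs, ys)
	if slope <= 0 {
		return -1
	}
	crossing := (limit - intercept) / slope
	return crossing - xs[len(xs)-1]
}

func main() {
	// Hypothetical daily disk-usage samples (GiB) over one week.
	days := []float64{0, 1, 2, 3, 4, 5, 6}
	usage := []float64{40, 42, 44, 46, 48, 50, 52}
	fmt.Printf("days to 100 GiB: %.0f\n", daysToCapacity(days, usage, 100))
	// prints "days to 100 GiB: 24"
}
```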

6. Policy Engine (OPA/Rego)

Compliance and governance built into the agent:

  • Open Policy Agent integration with Rego language
  • Default policies pre-loaded (compliance, security)
  • Resource evaluation against all enabled policies
  • Policy violation tracking with evaluation metrics
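For flavor, a custom rule in modern Rego syntax might look like this (a hypothetical example, not one of the shipped defaults):

```rego
package kronveil.security

import rego.v1

# Flag any container that runs as root (illustrative rule only).
deny contains msg if {
    some c in input.spec.containers
    c.securityContext.runAsUser == 0
    msg := sprintf("container %q must not run as root", [c.name])
}
```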

7. Prometheus Metrics Export

Kronveil now exposes a full Prometheus scrape endpoint on port 9090:

  • Standard Go runtime metrics (goroutines, memory, GC)
  • Custom Kronveil metrics: event counts per source, collector errors, policy evaluations, processing latency
  • Ready-to-use with Grafana dashboards
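Wiring it into an existing Prometheus is a single scrape job (adjust the target if you remap the port):

```yaml
scrape_configs:
  - job_name: kronveil
    static_configs:
      - targets: ["localhost:9090"]
```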

8. OpenTelemetry (OTel) Integration

Full OpenTelemetry support for distributed tracing:

  • gRPC exporter to any OTLP-compatible endpoint (Jaeger, Tempo, Datadog, etc.)
  • Configurable export intervals (default 30s)
  • Span and trace propagation across the agent pipeline
  • Insecure mode for local development, TLS for production
  • Default endpoint: localhost:4317

This means you can plug Kronveil into your existing OTel collector pipeline and see traces from anomaly detection through incident creation to remediation execution - all in one trace.
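If you run your own OpenTelemetry Collector, a minimal receiving pipeline looks roughly like this (a collector config sketch; the `debug` exporter is just for verifying traces arrive):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # matches Kronveil's default export target
exporters:
  debug: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```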

9. PagerDuty Integration

Full Events API v2 support:

  • Incident triggering, acknowledgment, resolution
  • Deduplication keys for idempotent alerts
  • Severity mapping (critical, high, warning, info)
  • Links back to Kronveil dashboard
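Under the hood, an Events API v2 trigger is a single JSON POST to `https://events.pagerduty.com/v2/enqueue`. The field values below are illustrative:

```json
{
  "routing_key": "<your-integration-key>",
  "event_action": "trigger",
  "dedup_key": "kronveil-incident-1234",
  "payload": {
    "summary": "Pod OOM detected on payments-api",
    "source": "kronveil-agent",
    "severity": "critical"
  },
  "links": [
    { "href": "http://localhost:3000/incidents/1234", "text": "Open in Kronveil" }
  ]
}
```

Reusing the same `dedup_key` on retries is what makes the alerts idempotent.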

10. Audit Logging

Security-grade audit trail:

  • Event types: auth, incident, remediation, policy_change, config_change, secret_access, api_call
  • In-memory buffer with file sink
  • Structured JSON output via slog

11. Helm Chart for Kubernetes

Production-ready Helm chart with security-hardened defaults:

  • Non-root containers (UID 1000)
  • Read-only root filesystem
  • Seccomp: RuntimeDefault
  • NetworkPolicy for ingress/egress
  • RBAC: ClusterRole with minimal permissions (pods, nodes, events, deployments)
  • Prometheus scrape annotations built-in
  • Liveness and readiness probes

helm install kronveil helm/kronveil/ \
  --namespace kronveil \
  --create-namespace \
  --set agent.bedrock.region=us-east-1

Upgraded Stack

| Component | v0.1 | v0.2 |
| --- | --- | --- |
| Go | 1.21 | 1.25 |
| golangci-lint | v1 | v2 |
| Alpine | 3.21 | 3.23 |
| Dashboard | Planned | React 18 + Tailwind |
| API | REST only | REST + gRPC + mTLS |
| Secrets | None | Vault + AWS SM |
| Metrics Export | None | Prometheus + OTel |
| Tracing | None | OpenTelemetry (OTLP) |
| Alerting | Slack | Slack + PagerDuty |
| Deployment | Manual | Docker Compose + Helm |
| CI | Basic | Full pipeline (lint, test, security, build, Docker scan) |

Run Kronveil Locally (5 Minutes)

Here's the full local deployment walkthrough with live screenshots.

Prerequisites

All you need is Docker with Docker Compose, git, and curl. python3 is optional for pretty-printing JSON responses.

Step 1: Clone and Build

git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker-compose -f deploy/docker-compose.yaml up --build -d

This builds two images and starts four containers:

| Container | Port | Purpose |
| --- | --- | --- |
| agent | 8080 | Kronveil REST API + gRPC |
| dashboard | 3000 | Web UI (nginx + React SPA) |
| kafka | 9092 | Event bus |
| zookeeper | 2181 | Kafka coordinator |

Step 2: Verify Everything Is Running

docker-compose -f deploy/docker-compose.yaml ps

All four containers should show Up (healthy):

NAME                 STATUS                         PORTS
deploy-agent-1       Up About a minute (healthy)    127.0.0.1:8080->8080/tcp
deploy-dashboard-1   Up About a minute (healthy)    127.0.0.1:3000->8080/tcp
deploy-kafka-1       Up About a minute (healthy)    127.0.0.1:9092->9092/tcp
deploy-zookeeper-1   Up About a minute (healthy)    2181/tcp

Step 3: Access the Endpoints

Once deployed, you have three endpoints available:

| Service | URL | Description |
| --- | --- | --- |
| Dashboard | http://localhost:3000 | Full web UI with all 6 pages |
| Agent API | http://localhost:8080/api/v1/health | REST API (health, incidents, anomalies) |
| Metrics | http://localhost:9090/metrics | Prometheus scrape endpoint |

Step 4: Check Agent Health

curl http://localhost:8080/api/v1/health
{
  "data": {
    "status": "healthy"
  }
}

Step 5: Open the Dashboard

Open http://localhost:3000 in your browser.

Overview Page

The Overview page shows real-time infrastructure intelligence at a glance - 10.2M events/sec throughput, 2 active incidents, 23-second average MTTR, and 47 anomalies detected in the last 24 hours. The cluster health matrix shows three clusters across US, EU, and AP regions with live node and pod counts.

Kronveil Overview Dashboard

Incidents Page

AI-detected and auto-remediated incidents with filtering by status (all, active, acknowledged, resolved). Each incident shows the title, description, MTTR, and number of affected resources. Notice the resolved OOM incident with 23s MTTR - that's the auto-remediation in action.

Kronveil Incidents

Anomalies Page

ML-powered anomaly detection and prediction. The distribution chart shows detected vs. predicted anomalies over 24 hours. Each anomaly has a score (0-100%) - the Kafka consumer lag spike scored 94%, and the system predicted a pod OOM 15 minutes before it happened.

Kronveil Anomalies

Collectors Page

Telemetry collection agents across your infrastructure. Five active collectors processing 10.2M events/sec across 487 targets with only 0.001% error rate. Kubernetes leads at 4.2M events/sec monitoring 3 clusters, 54 nodes, and 312 pods. Each collector shows real-time health status.

Kronveil Collectors - Top

Scroll down to see all five collectors - Kubernetes, Apache Kafka, AWS CloudWatch, GitHub Actions (CI/CD), and the Logs collector. GitHub Actions shows a degraded status with 3 errors, which is expected when webhook endpoints aren't publicly accessible in a local deployment.

Kronveil Collectors - All

Step 6: Explore the API

Full system status:

curl http://localhost:8080/api/v1/status | python3 -m json.tool

List collectors and their health:

curl http://localhost:8080/api/v1/collectors | python3 -m json.tool

Inject a test event (single):

curl -X POST "http://localhost:8080/api/v1/test/inject?mode=single"

Inject a burst of events to trigger anomaly detection:

curl -X POST "http://localhost:8080/api/v1/test/inject?mode=burst"

After the burst injection, check for detected anomalies:

curl http://localhost:8080/api/v1/anomalies | python3 -m json.tool

And incidents that were auto-created:

curl http://localhost:8080/api/v1/incidents | python3 -m json.tool

Step 7: Prometheus Metrics

curl http://localhost:9090/metrics

You'll see standard Go metrics plus Kronveil-specific counters for events processed, collector errors, and policy evaluations. Wire this into your Grafana instance for dashboards.

Step 8: Tail the Logs

docker-compose -f deploy/docker-compose.yaml logs -f agent

Watch the agent detect anomalies, correlate incidents, and execute remediation in real-time.

Cleanup

docker-compose -f deploy/docker-compose.yaml down

Architecture Diagram (Updated)

                         +------------------+
                         |   Dashboard UI   |
                         |  (React + nginx) |
                         |   :3000          |
                         +--------+---------+
                                  |
                           /api/ proxy
                                  |
+------------------+    +---------v---------+    +------------------+
|   Collectors     |    |    Kronveil Agent  |    |  Integrations    |
|                  +--->+                    +--->+                  |
| - Kubernetes     |    |  REST API  :8080   |    | - Slack          |
| - Kafka          |    |  gRPC API  :9091   |    | - PagerDuty      |
| - Cloud (AWS)    |    |  Metrics   :9090   |    | - Prometheus     |
| - CI/CD          |    |                    |    | - OpenTelemetry  |
| - Logs           |    |  +==============+  |    | - AWS Bedrock    |
+------------------+    |  | Intelligence |  |    | - Vault          |
                        |  | - Anomaly    |  |    | - AWS Secrets    |
                        |  | - RootCause  |  |    +------------------+
                        |  | - Capacity   |  |
                        |  | - Incident   |  |         +----------+
                        |  +==============+  |    +--->| OTel     |
                        |                    +----+    | Collector|
                        |  +==============+  |         +----------+
                        |  | Policy (OPA) |  |
                        |  | Audit Log    |  |
                        |  +==============+  |
                        +---------+----------+
                                  |
                         +--------v---------+
                         |   Apache Kafka   |
                         |   :9092          |
                         +------------------+

CI Pipeline

Every push to main runs seven jobs:

  1. Lint - golangci-lint v2 with staticcheck, errcheck, govet
  2. Test - go test -race with 40% coverage threshold
  3. Security Scan - govulncheck for Go stdlib/dependency CVEs
  4. Build - Cross-compile with ldflags (version, commit, date)
  5. Docker Build & Scan - Multi-stage build + Trivy vulnerability scan (CRITICAL/HIGH)
  6. Dashboard - npm ci, ESLint, Vite production build
  7. Helm Lint - Chart validation

All green before merge. No exceptions.
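For reference, the first two jobs in a GitHub Actions workflow like this typically reduce to something like the following (an illustrative excerpt, not the repo's exact workflow file):

```yaml
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: golangci/golangci-lint-action@v6
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: go test -race -coverprofile=cover.out ./...
```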


What's Next (v0.3 Roadmap)

  • Multi-cluster support - Federated monitoring across Kubernetes clusters
  • Custom collector SDK - Build your own collectors with a plugin interface
  • Runbook automation - Attach runbooks to incident types
  • Cost anomaly detection - Spot unexpected cloud spend spikes
  • Grafana dashboards - Pre-built dashboards for Kronveil Prometheus metrics
  • Mobile alerts - Push notifications via native apps

Try It

GitHub: github.com/kronveil/kronveil
License: Apache 2.0

git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker-compose -f deploy/docker-compose.yaml up --build -d
# Open http://localhost:3000

If you find it useful, star the repo. If you find a bug, open an issue. PRs welcome - especially for new collectors, dashboard improvements, and LLM prompt tuning.


Follow me for more updates on building production-grade infrastructure tooling with Go and AI.
