DEV Community

Ramasankar Molleti


Kronveil v0.3: Multi-Cluster Federation, Custom Collector SDK, and Automated Runbooks

From Single Cluster to Multi-Cluster Production

A couple of weeks ago, I shipped Kronveil v0.2 — a fully running AI infrastructure agent with a dashboard, gRPC transport, secret management, and local Docker deployment. If you missed the original launch post, here's where it all started.

v0.2 worked well for a single cluster. But production environments don't run on one cluster. Teams have us-east-prod, eu-west-prod, maybe a staging cluster in ap-south. They have GitHub Actions pipelines they need visibility into. They have Azure VMs and GCP instances alongside Kubernetes workloads.

v0.3 addresses all of that. Here's what changed.


What's New in v0.3

| # | Feature | What It Does |
|---|---------|--------------|
| 1 | Multi-Cluster Federation | Aggregate telemetry from multiple Kubernetes clusters into one view |
| 2 | Custom Collector SDK | Build your own collectors in ~50 lines of Go |
| 3 | Automated Runbook Engine | Execute incident response playbooks automatically |
| 4 | Real Azure SDK Integration | Azure Monitor metrics + Resource Manager listing |
| 5 | Real GCP SDK Integration | Cloud Monitoring + Asset Inventory |
| 6 | GitHub Actions CI/CD Collector | Poll workflow runs, track status changes, map to severity |
| 7 | Kafka Throughput Monitoring | Real offset tracking and messages/sec computation |
| 8 | WebSocket Real-Time Streaming | Live event feed to the dashboard, no more polling |
| 9 | Runbooks Dashboard Page | New UI page for runbook management and execution history |
| 10 | Vault Background Sync | Periodic secret rotation monitoring with KV v2 metadata API |

1. Multi-Cluster Federation

This is the biggest feature in v0.3. The federation manager sits on top of multiple Kubernetes collectors and aggregates their telemetry into a single event stream.

How It Works

┌─────────────────────────────────────────────────┐
│              Federation Manager                 │
│          implements engine.Collector            │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌───────────────┐  ┌───────────────┐           │
│  │ us-east-prod  │  │ eu-west-prod  │  ...      │
│  │ K8s Collector │  │ K8s Collector │           │
│  └───────┬───────┘  └───────┬───────┘           │
│          │                  │                   │
│          └────────┬─────────┘                   │
│                   ▼                             │
│  ┌────────────────────────────┐                 │
│  │         Aggregator         │                 │
│  │ SHA256 dedup (30s window)  │                 │
│  │   Cross-cluster metrics    │                 │
│  └────────────────────────────┘                 │
│                                                 │
└─────────────────────────────────────────────────┘

Each cluster's events are tagged with cluster_name and cluster_region metadata before being forwarded. The aggregator deduplicates events using SHA256 fingerprinting — if overlapping collectors emit the same event within a 30-second window, it's counted once.

Configuration

collectors:
  kubernetes:
    clusters:
      - name: us-east-prod
        kubeconfig_path: ~/.kube/us-east
        context: prod-context
        namespaces: ["default", "payments", "auth"]
        poll_interval: 15s
      - name: eu-west-prod
        kubeconfig_path: ~/.kube/eu-west
        context: prod-context
        namespaces: ["default", "payments"]
        poll_interval: 15s

The federation manager implements engine.Collector, so the rest of Kronveil — the intelligence pipeline, the API, the dashboard — doesn't need to know whether it's watching 1 cluster or 20.

Aggregate metrics are computed automatically: total pods, total nodes, total events across all clusters.
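
The fan-in behind this is a standard Go pattern. A rough sketch, with a simplified `Event` type standing in for Kronveil's real one:

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a stand-in for Kronveil's event type; the real struct differs.
type Event struct {
	Type     string
	Metadata map[string]string
}

// fanIn merges per-cluster event channels into one stream, tagging each
// event with its cluster name before forwarding. The output channel
// closes once every input channel is drained.
func fanIn(inputs map[string]<-chan Event) <-chan Event {
	out := make(chan Event)
	var wg sync.WaitGroup
	for cluster, ch := range inputs {
		wg.Add(1)
		go func(cluster string, ch <-chan Event) {
			defer wg.Done()
			for ev := range ch {
				if ev.Metadata == nil {
					ev.Metadata = map[string]string{}
				}
				ev.Metadata["cluster_name"] = cluster
				out <- ev
			}
		}(cluster, ch)
	}
	go func() { wg.Wait(); close(out) }()
	return out
}

func main() {
	east := make(chan Event, 1)
	west := make(chan Event, 1)
	east <- Event{Type: "PodPending"}
	west <- Event{Type: "NodeNotReady"}
	close(east)
	close(west)

	merged := fanIn(map[string]<-chan Event{"us-east-prod": east, "eu-west-prod": west})
	for ev := range merged {
		fmt.Println(ev.Type, ev.Metadata["cluster_name"])
	}
}
```

Because the merged stream looks exactly like a single collector's stream, everything downstream stays cluster-count-agnostic.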


2. Custom Collector SDK

Writing a Kronveil collector used to mean implementing the full engine.Collector interface — managing goroutines, channels, health reporting, and lifecycle. Now you implement three methods:

type Plugin interface {
    Name() string
    Collect(ctx context.Context) ([]*collector.Event, error)
    Healthcheck(ctx context.Context) error
}

The SDK's Builder handles the rest — polling loop, buffered event channel with backpressure, health reporting, and clean shutdown:

col := collector.NewBuilder(&myPlugin{}).
    WithPollInterval(10 * time.Second).
    WithBufferSize(128).
    WithLogger(slog.Default()).
    Build()

// col implements engine.Collector — register it like any built-in collector
registry.RegisterCollector(col)

Full Example: HTTP Health Checker

type HTTPChecker struct {
    url string
}

func (h *HTTPChecker) Name() string { return "http-checker" }

func (h *HTTPChecker) Collect(ctx context.Context) ([]*collector.Event, error) {
    start := time.Now()
    // Honor the collector's context so slow checks are cancelled cleanly on shutdown.
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, h.url, nil)
    if err != nil {
        return nil, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return []*collector.Event{{
            Type:     "http_check_failed",
            Severity: "high",
            Payload:  map[string]interface{}{"url": h.url, "error": err.Error()},
        }}, nil
    }
    defer resp.Body.Close()

    return []*collector.Event{{
        Type: "http_check",
        Payload: map[string]interface{}{
            "url":        h.url,
            "status":     resp.StatusCode,
            "latency_ms": time.Since(start).Milliseconds(),
        },
    }}, nil
}

func (h *HTTPChecker) Healthcheck(ctx context.Context) error { return nil }

What the Adapter Handles For You

  • Polling loop at your configured interval
  • Immediate first collect (no waiting for the first tick)
  • Buffered channel with drop + warn when full
  • Health status combining Healthcheck() result with recent collect errors
  • Thread-safe start/stop lifecycle

One interesting bug I caught during CI: the original Stop() held a mutex while waiting on a WaitGroup. The polling goroutine needed the same mutex to record errors. Classic deadlock — the goroutine couldn't finish because it couldn't acquire the lock, and Stop() couldn't return because it was waiting for the goroutine. Fixed by releasing the lock before wg.Wait().


3. Automated Runbook Engine

When Kronveil detects an incident, it can now execute a predefined playbook instead of just alerting.

The runbook engine ships with 4 default runbooks:

| Runbook | Triggers | Steps | Auto-Execute |
|---------|----------|-------|--------------|
| Pod OOM | OOMKilled, MemoryPressure | Diagnose > Scale > Notify | Yes |
| High Latency | HighLatency, SLOBreach | Diagnose > Restart > Notify | No |
| Disk Pressure | DiskPressure, LogVolumeHigh | Cleanup > Notify | Yes |
| Certificate Expiry | CertExpiry, TLSError | Renew > Notify | No |

How Execution Works

Incident Detected
       │
       ▼
  FindRunbooks(incidentType)
       │
       ▼
  For each matching runbook:
    ├── autoExecute=true  → Execute immediately
    └── autoExecute=false → Queue for approval
       │
       ▼
  Execute each Step sequentially:
    ├── kubectl_scale   → Scale deployment replicas
    ├── restart_pod     → Delete pod for controller restart
    ├── notify_oncall   → Slack/PagerDuty notification
    ├── run_diagnostic  → Execute diagnostic command
    └── custom_script   → Run remediation script
       │
       ▼
  Record ExecutionResult (timing, step results, success/failure)

In v0.3, all action handlers run in dry-run mode — they log what they would do without executing. This lets you validate runbook logic before enabling live remediation. Live execution is on the v0.4 roadmap.
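
A dry-run-aware handler might look roughly like this. The handler signature and config keys are illustrative, not Kronveil's actual API:

```go
package main

import (
	"fmt"
	"log/slog"
)

// StepConfig mirrors the map[string]string config used in runbook steps.
type StepConfig map[string]string

// ActionHandler executes one runbook step.
type ActionHandler func(cfg StepConfig, dryRun bool) error

// kubectlScale sketches a dry-run-aware handler: when dryRun is set,
// it only logs the action it would take.
func kubectlScale(cfg StepConfig, dryRun bool) error {
	msg := fmt.Sprintf("scale deployment %s to %s replicas",
		cfg["deployment"], cfg["replicas"])
	if dryRun {
		slog.Info("DRY-RUN: would " + msg)
		return nil
	}
	// v0.4: call the Kubernetes API here for real.
	return fmt.Errorf("live execution not implemented: %s", msg)
}

func main() {
	handlers := map[string]ActionHandler{"kubectl_scale": kubectlScale}
	cfg := StepConfig{"deployment": "checkout", "replicas": "5"}
	if err := handlers["kubectl_scale"](cfg, true); err != nil {
		fmt.Println("step failed:", err)
	}
}
```

Keeping `dryRun` a parameter of every handler (rather than a global) makes it easy to flip individual runbooks to live mode later.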

You can register custom runbooks:

executor.RegisterRunbook(runbook.Runbook{
    ID:            "custom-db-failover",
    Name:          "Database Failover",
    IncidentTypes: []string{"DatabaseDown", "ReplicationLag"},
    AutoExecute:   false,
    Steps: []runbook.Step{
        {Name: "Check replication", Action: "run_diagnostic",
         Config: map[string]string{"command": "pg_stat_replication"}},
        {Name: "Promote standby", Action: "custom_script",
         Config: map[string]string{"script": "/opt/scripts/promote-standby.sh"}},
        {Name: "Notify DBA team", Action: "notify_oncall",
         Config: map[string]string{"channel": "#dba-oncall"}},
    },
})

4. Real Cloud Provider Integrations

v0.2 had stub implementations for cloud providers. v0.3 wires up real SDKs.

Azure

  • Auth: azidentity.DefaultAzureCredential — supports managed identity, CLI, environment variables
  • Metrics: Azure Monitor azquery.MetricsClient queries CPU, memory, disk, and network
  • Resources: ARM armresources.Client with full pagination support
  • Config: Set AZURE_SUBSCRIPTION_ID and standard Azure credentials

GCP

  • Auth: Application Default Credentials (ADC)
  • Metrics: Cloud Monitoring ListTimeSeries with 5-minute lookback window
  • Resources: Cloud Asset SearchAllResources for inventory
  • Config: Set GCP_PROJECT_ID or GOOGLE_CLOUD_PROJECT

GitHub Actions (CI/CD Collector)

The CI/CD collector now polls the GitHub REST API for workflow runs across configured repositories:

collectors:
  cicd:
    github_token: "ghp_..."
    repo_filters:
      - "your-org/your-repo"
      - "your-org/another-repo"
    poll_interval: 60s

It tracks run status changes, emits events for new runs and state transitions, and maps GitHub conclusions to severity levels:

| Conclusion | Severity |
|------------|----------|
| failure, timed_out | High |
| cancelled, action_required | Medium |
| success, other | Info |
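
The mapping itself is a small switch. A sketch (the function name is mine, not Kronveil's):

```go
package main

import "fmt"

// severityFor maps a GitHub Actions run conclusion to a Kronveil
// severity level, following the table above.
func severityFor(conclusion string) string {
	switch conclusion {
	case "failure", "timed_out":
		return "high"
	case "cancelled", "action_required":
		return "medium"
	default: // success and anything else
		return "info"
	}
}

func main() {
	for _, c := range []string{"failure", "cancelled", "success"} {
		fmt.Println(c, "→", severityFor(c))
	}
}
```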

Kafka Throughput

The Kafka collector now dials brokers directly, reads partition offsets, and computes real messages/second throughput per topic. No more mock data.


5. WebSocket Real-Time Streaming

The dashboard no longer polls the REST API for updates. Events flow over WebSocket.

Backend

The Go server manages a WebSocket hub with a broadcaster that pushes engine status to all connected clients every 2 seconds:

Client connects → wsHub.add(conn)
                     │
Broadcaster (2s)  ───┤──→ JSON to all clients
                     │
Client disconnects → wsHub.remove(conn)
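
The hub boils down to a locked set of connections plus a broadcast loop. A simplified sketch — the `client` interface here abstracts whatever WebSocket library the server actually wraps:

```go
package main

import (
	"fmt"
	"sync"
)

// client abstracts a WebSocket connection so the hub logic stands alone.
type client interface {
	Send(msg []byte) error
}

// hub tracks connected clients and broadcasts to all of them.
type hub struct {
	mu      sync.Mutex
	clients map[client]struct{}
}

func newHub() *hub { return &hub{clients: make(map[client]struct{})} }

func (h *hub) add(c client)    { h.mu.Lock(); h.clients[c] = struct{}{}; h.mu.Unlock() }
func (h *hub) remove(c client) { h.mu.Lock(); delete(h.clients, c); h.mu.Unlock() }

// broadcast sends msg to every client, dropping any that error —
// a failed write usually means the peer disconnected.
func (h *hub) broadcast(msg []byte) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for c := range h.clients {
		if err := c.Send(msg); err != nil {
			delete(h.clients, c)
		}
	}
}

// fakeClient records messages for the demo below.
type fakeClient struct{ got [][]byte }

func (f *fakeClient) Send(msg []byte) error { f.got = append(f.got, msg); return nil }

func main() {
	h := newHub()
	a, b := &fakeClient{}, &fakeClient{}
	h.add(a)
	h.add(b)
	h.broadcast([]byte(`{"status":"healthy"}`)) // what the 2s ticker would push
	h.remove(b)
	h.broadcast([]byte(`{"status":"healthy"}`))
	fmt.Println(len(a.got), len(b.got)) // 2 1
}
```

In the real server, a `time.Ticker` drives `broadcast` every 2 seconds with the serialized engine status.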

Frontend

Two new React hooks power the live experience:

  • useWebSocket — generic hook with auto-reconnect and exponential backoff (1s to 30s)
  • useEventStream — wraps WebSocket for the events endpoint, maintains a 100-event rolling buffer, provides memoized filtered views for incidents, anomalies, and all events

The Overview page shows a green pulsing Live indicator when WebSocket is connected. When disconnected, it falls back to mock data gracefully.


6. Runbooks Dashboard Page

New /runbooks route in the dashboard:

Summary cards at the top:

  • Total runbooks
  • Auto-execute count
  • Executions in last 24 hours
  • Average success rate

Each runbook card shows:

  • Name and description
  • Auto/manual execution badge (green dot for auto, gray for manual)
  • Incident type tags
  • Step count, last run time, success rate
  • Recent run indicators — green and red dots for the last 3 executions

Same dark theme as the rest of the dashboard. Built with the same Tailwind patterns.


7. OpenTelemetry & Observability

Kronveil exports traces and metrics via OpenTelemetry, fitting into your existing observability stack:

| Endpoint | Protocol | Purpose |
|----------|----------|---------|
| :4317 | gRPC | OTLP traces and metrics |
| :4318 | HTTP | OTLP traces and metrics |
| :8889 | HTTP | Prometheus metrics export |
| :13133 | HTTP | OTel collector health check |
| :55679 | HTTP | zPages debugging |

Point your Jaeger, Tempo, or Datadog backend at these endpoints, or configure Prometheus to scrape Kronveil's metrics.


Full Architecture (v0.3)

┌───────────────────────────────────────────────────────┐
│                   Dashboard (React)                   │
│              WebSocket / REST API (Go)                │
├───────────────────────────────────────────────────────┤
│                     Engine Core                       │
│  ┌────────────┐  ┌──────────┐  ┌────────────────┐     │
│  │ Federation │  │ Runbook  │  │ AI Intelligence│     │
│  │  Manager   │  │ Executor │  │ (AWS Bedrock)  │     │
│  └────────────┘  └──────────┘  └────────────────┘     │
├───────────────────────────────────────────────────────┤
│                     Collectors                        │
│  ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐ ┌───┐ ┌────────┐    │
│  │ K8s │ │Kafka│ │CI/CD │ │Azure│ │GCP│ │Custom  │    │
│  │     │ │     │ │GitHub│ │     │ │   │ │ (SDK)  │    │
│  └─────┘ └─────┘ └──────┘ └─────┘ └───┘ └────────┘    │
├───────────────────────────────────────────────────────┤
│               Integrations & Export                   │
│  Slack · PagerDuty · Vault · Prometheus · OTel        │
└───────────────────────────────────────────────────────┘

Local Deployment

Everything runs with Docker Compose — same as v0.2:

git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker compose up --build

| Service | URL | Description |
|---------|-----|-------------|
| Dashboard | http://localhost:3000 | React UI with live WebSocket |
| API | http://localhost:8080/api/v1/ | REST endpoints |
| Health | http://localhost:8080/api/v1/health | Agent health check |
| WebSocket | ws://localhost:8080/api/v1/ws/events | Real-time event stream |
| Prometheus | http://localhost:8889/metrics | Metrics export |
| OTel gRPC | localhost:4317 | OTLP ingest |

Verify it's running

curl http://localhost:8080/api/v1/health | jq .
{
  "status": "healthy",
  "components": [
    {"name": "kubernetes", "status": "healthy"},
    {"name": "kafka", "status": "healthy"},
    {"name": "cicd-collector", "status": "healthy"},
    {"name": "cloud-aws", "status": "healthy"}
  ],
  "uptime": "2m30s"
}

Production Deployment

Kronveil ships with a Helm chart for Kubernetes deployment. For AWS EKS with Rancher:

# Build and push to ECR
docker build -f deploy/Dockerfile.agent -t <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent:v0.3 .
docker push <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent:v0.3

# Deploy with Helm
helm install kronveil helm/kronveil/ \
  -n kronveil --create-namespace \
  -f values-prod.yaml

The Helm chart includes deployment, RBAC (ClusterRole + Role), ServiceAccount with IRSA annotation, NetworkPolicy, and Prometheus scrape annotations. A full production deployment guide covering ECR, MSK, IRSA, ALB Ingress, and TLS is available in the repository.


CI Pipeline

All 7 CI jobs passing:

| Job | What It Checks |
|-----|----------------|
| Lint | golangci-lint v2 (errcheck, staticcheck, govet) |
| Security | govulncheck for known vulnerabilities |
| Test | go test ./... -race with coverage threshold |
| Build | CGO_ENABLED=0 static binary |
| Docker Build & Scan | Trivy scan for CRITICAL/HIGH CVEs |
| Dashboard | npm lint + build |
| Helm Lint | Chart validation |

What's Next (v0.4)

  • Live runbook execution — move from dry-run to real kubectl and script execution with approval gates
  • Collector marketplace — share and install community-built collectors via the SDK
  • Cross-cluster incident correlation — AI-powered correlation across federated clusters
  • Dashboard runbook triggers — execute runbooks directly from the UI
  • Grafana plugin — embed Kronveil panels in existing Grafana dashboards

Links

If you've been following along, star the repo and try it out. PRs, issues, and feedback are always welcome.
