From Single Cluster to Multi-Cluster Production
A couple of weeks ago, I shipped Kronveil v0.2 — a fully running AI infrastructure agent with a dashboard, gRPC transport, secret management, and local Docker deployment. If you missed the original launch post, here's where it all started.
v0.2 worked well for a single cluster. But production environments don't run on one cluster. Teams have us-east-prod, eu-west-prod, maybe a staging cluster in ap-south. They have GitHub Actions pipelines they need visibility into. They have Azure VMs and GCP instances alongside Kubernetes workloads.
v0.3 addresses all of that. Here's what changed.
What's New in v0.3
| # | Feature | What It Does |
|---|---|---|
| 1 | Multi-Cluster Federation | Aggregate telemetry from multiple Kubernetes clusters into one view |
| 2 | Custom Collector SDK | Build your own collectors in ~50 lines of Go |
| 3 | Automated Runbook Engine | Execute incident response playbooks automatically |
| 4 | Real Azure SDK Integration | Azure Monitor metrics + Resource Manager listing |
| 5 | Real GCP SDK Integration | Cloud Monitoring + Asset Inventory |
| 6 | GitHub Actions CI/CD Collector | Poll workflow runs, track status changes, map to severity |
| 7 | Kafka Throughput Monitoring | Real offset tracking and messages/sec computation |
| 8 | WebSocket Real-Time Streaming | Live event feed to the dashboard, no more polling |
| 9 | Runbooks Dashboard Page | New UI page for runbook management and execution history |
| 10 | Vault Background Sync | Periodic secret rotation monitoring with KV v2 metadata API |
1. Multi-Cluster Federation
This is the biggest feature in v0.3. The federation manager sits on top of multiple Kubernetes collectors and aggregates their telemetry into a single event stream.
How It Works
```text
┌─────────────────────────────────────────────────┐
│              Federation Manager                 │
│         implements engine.Collector             │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────────┐   ┌──────────────┐            │
│  │ us-east-prod │   │ eu-west-prod │   ...      │
│  │ K8s Collector│   │ K8s Collector│            │
│  └──────┬───────┘   └──────┬───────┘            │
│         │                  │                    │
│         └──────┬───────────┘                    │
│                ▼                                │
│  ┌──────────────────────────┐                   │
│  │ Aggregator               │                   │
│  │ SHA256 dedup (30s window)│                   │
│  │ Cross-cluster metrics    │                   │
│  └──────────────────────────┘                   │
│                                                 │
└─────────────────────────────────────────────────┘
```
Each cluster's events are tagged with cluster_name and cluster_region metadata before being forwarded. The aggregator deduplicates events using SHA256 fingerprinting — if overlapping collectors emit the same event within a 30-second window, it's counted once.
Configuration
```yaml
collectors:
  kubernetes:
    clusters:
      - name: us-east-prod
        kubeconfig_path: ~/.kube/us-east
        context: prod-context
        namespaces: ["default", "payments", "auth"]
        poll_interval: 15s
      - name: eu-west-prod
        kubeconfig_path: ~/.kube/eu-west
        context: prod-context
        namespaces: ["default", "payments"]
        poll_interval: 15s
```
The federation manager implements engine.Collector, so the rest of Kronveil — the intelligence pipeline, the API, the dashboard — doesn't need to know whether it's watching 1 cluster or 20.
Aggregate metrics are computed automatically: total pods, total nodes, total events across all clusters.
2. Custom Collector SDK
Writing a Kronveil collector used to mean implementing the full engine.Collector interface — managing goroutines, channels, health reporting, and lifecycle. Now you implement three methods:
```go
type Plugin interface {
	Name() string
	Collect(ctx context.Context) ([]*collector.Event, error)
	Healthcheck(ctx context.Context) error
}
```
The SDK's Builder handles the rest — polling loop, buffered event channel with backpressure, health reporting, and clean shutdown:
```go
col := collector.NewBuilder(&myPlugin{}).
	WithPollInterval(10 * time.Second).
	WithBufferSize(128).
	WithLogger(slog.Default()).
	Build()

// col implements engine.Collector — register it like any built-in collector
registry.RegisterCollector(col)
```
Full Example: HTTP Health Checker
```go
type HTTPChecker struct {
	url string
}

func (h *HTTPChecker) Name() string { return "http-checker" }

func (h *HTTPChecker) Collect(ctx context.Context) ([]*collector.Event, error) {
	start := time.Now()
	// Build the request with ctx so cancellation and shutdown propagate.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, h.url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return []*collector.Event{{
			Type:     "http_check_failed",
			Severity: "high",
			Payload:  map[string]interface{}{"url": h.url, "error": err.Error()},
		}}, nil
	}
	defer resp.Body.Close()
	return []*collector.Event{{
		Type: "http_check",
		Payload: map[string]interface{}{
			"url":        h.url,
			"status":     resp.StatusCode,
			"latency_ms": time.Since(start).Milliseconds(),
		},
	}}, nil
}

func (h *HTTPChecker) Healthcheck(ctx context.Context) error { return nil }
```
What the Adapter Handles For You
- Polling loop at your configured interval
- Immediate first collect (no waiting for the first tick)
- Buffered channel with drop + warn when full
- Health status combining `Healthcheck()` result with recent collect errors
- Thread-safe start/stop lifecycle
One interesting bug I caught during CI: the original Stop() held a mutex while waiting on a WaitGroup. The polling goroutine needed the same mutex to record errors. Classic deadlock — the goroutine couldn't finish because it couldn't acquire the lock, and Stop() couldn't return because it was waiting for the goroutine. Fixed by releasing the lock before wg.Wait().
3. Automated Runbook Engine
When Kronveil detects an incident, it can now execute a predefined playbook instead of just alerting.
The runbook engine ships with 4 default runbooks:
| Runbook | Triggers | Steps | Auto-Execute |
|---|---|---|---|
| Pod OOM | `OOMKilled`, `MemoryPressure` | Diagnose > Scale > Notify | Yes |
| High Latency | `HighLatency`, `SLOBreach` | Diagnose > Restart > Notify | No |
| Disk Pressure | `DiskPressure`, `LogVolumeHigh` | Cleanup > Notify | Yes |
| Certificate Expiry | `CertExpiry`, `TLSError` | Renew > Notify | No |
How Execution Works
```text
Incident Detected
      │
      ▼
FindRunbooks(incidentType)
      │
      ▼
For each matching runbook:
  ├── autoExecute=true  → Execute immediately
  └── autoExecute=false → Queue for approval
      │
      ▼
Execute each Step sequentially:
  ├── kubectl_scale  → Scale deployment replicas
  ├── restart_pod    → Delete pod for controller restart
  ├── notify_oncall  → Slack/PagerDuty notification
  ├── run_diagnostic → Execute diagnostic command
  └── custom_script  → Run remediation script
      │
      ▼
Record ExecutionResult (timing, step results, success/failure)
```
In v0.3, all action handlers run in dry-run mode — they log what they would do without executing. This lets you validate runbook logic before enabling live remediation. Live execution is on the v0.4 roadmap.
You can register custom runbooks:
```go
executor.RegisterRunbook(runbook.Runbook{
	ID:            "custom-db-failover",
	Name:          "Database Failover",
	IncidentTypes: []string{"DatabaseDown", "ReplicationLag"},
	AutoExecute:   false,
	Steps: []runbook.Step{
		{Name: "Check replication", Action: "run_diagnostic",
			Config: map[string]string{"command": "pg_stat_replication"}},
		{Name: "Promote standby", Action: "custom_script",
			Config: map[string]string{"script": "/opt/scripts/promote-standby.sh"}},
		{Name: "Notify DBA team", Action: "notify_oncall",
			Config: map[string]string{"channel": "#dba-oncall"}},
	},
})
```
4. Real Cloud Provider Integrations
v0.2 had stub implementations for cloud providers. v0.3 wires up real SDKs.
Azure
- Auth: `azidentity.DefaultAzureCredential` — supports managed identity, CLI, and environment variables
- Metrics: Azure Monitor `azquery.MetricsClient` queries CPU, memory, disk, and network
- Resources: ARM `armresources.Client` with full pagination support
- Config: set `AZURE_SUBSCRIPTION_ID` and standard Azure credentials
GCP
- Auth: Application Default Credentials (ADC)
- Metrics: Cloud Monitoring `ListTimeSeries` with a 5-minute lookback window
- Resources: Cloud Asset `SearchAllResources` for inventory
- Config: set `GCP_PROJECT_ID` or `GOOGLE_CLOUD_PROJECT`
GitHub Actions (CI/CD Collector)
The CI/CD collector now polls the GitHub REST API for workflow runs across configured repositories:
```yaml
collectors:
  cicd:
    github_token: "ghp_..."
    repo_filters:
      - "your-org/your-repo"
      - "your-org/another-repo"
    poll_interval: 60s
```
It tracks run status changes, emits events for new runs and state transitions, and maps GitHub conclusions to severity levels:
| Conclusion | Severity |
|---|---|
| `failure`, `timed_out` | High |
| `cancelled`, `action_required` | Medium |
| `success`, other | Info |
Kafka Throughput
The Kafka collector now dials brokers directly, reads partition offsets, and computes real messages/second throughput per topic. No more mock data.
5. WebSocket Real-Time Streaming
The dashboard no longer polls the REST API for updates. Events flow over WebSocket.
Backend
The Go server manages a WebSocket hub with a broadcaster that pushes engine status to all connected clients every 2 seconds:
```text
Client connects    → wsHub.add(conn)
                          │
Broadcaster (2s) ─────────┤──→ JSON to all clients
                          │
Client disconnects → wsHub.remove(conn)
```
Frontend
Two new React hooks power the live experience:
- `useWebSocket` — generic hook with auto-reconnect and exponential backoff (1s to 30s)
- `useEventStream` — wraps the WebSocket events endpoint, maintains a 100-event rolling buffer, and provides memoized filtered views for incidents, anomalies, and all events
The Overview page shows a green pulsing Live indicator when WebSocket is connected. When disconnected, it falls back to mock data gracefully.
6. Runbooks Dashboard Page
New /runbooks route in the dashboard:
Summary cards at the top:
- Total runbooks
- Auto-execute count
- Executions in last 24 hours
- Average success rate
Each runbook card shows:
- Name and description
- Auto/manual execution badge (green dot for auto, gray for manual)
- Incident type tags
- Step count, last run time, success rate
- Recent run indicators — green and red dots for the last 3 executions
Same dark theme as the rest of the dashboard. Built with the same Tailwind patterns.
7. OpenTelemetry & Observability
Kronveil exports traces and metrics via OpenTelemetry, fitting into your existing observability stack:
| Endpoint | Protocol | Purpose |
|---|---|---|
| `:4317` | gRPC | OTLP traces and metrics |
| `:4318` | HTTP | OTLP traces and metrics |
| `:8889` | HTTP | Prometheus metrics export |
| `:13133` | HTTP | OTel collector health check |
| `:55679` | HTTP | zPages debugging |
Point your Jaeger, Tempo, or Datadog backend at these endpoints, or configure Prometheus to scrape Kronveil's metrics.
Full Architecture (v0.3)
```text
┌───────────────────────────────────────────────────────┐
│                  Dashboard (React)                    │
│        WebSocket <── REST API (Go) ──>                │
├───────────────────────────────────────────────────────┤
│                    Engine Core                        │
│  ┌────────────┐ ┌──────────┐ ┌────────────────┐       │
│  │ Federation │ │ Runbook  │ │ AI Intelligence│       │
│  │  Manager   │ │ Executor │ │ (AWS Bedrock)  │       │
│  └────────────┘ └──────────┘ └────────────────┘       │
├───────────────────────────────────────────────────────┤
│                    Collectors                         │
│ ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐ ┌───┐ ┌────────┐     │
│ │ K8s │ │Kafka│ │CI/CD │ │Azure│ │GCP│ │Custom  │     │
│ │     │ │     │ │GitHub│ │     │ │   │ │ (SDK)  │     │
│ └─────┘ └─────┘ └──────┘ └─────┘ └───┘ └────────┘     │
├───────────────────────────────────────────────────────┤
│               Integrations & Export                   │
│   Slack · PagerDuty · Vault · Prometheus · OTel       │
└───────────────────────────────────────────────────────┘
```
Local Deployment
Everything runs with Docker Compose — same as v0.2:
```shell
git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker compose up --build
```
| Service | URL | Description |
|---|---|---|
| Dashboard | http://localhost:3000 | React UI with live WebSocket |
| API | http://localhost:8080/api/v1/ | REST endpoints |
| Health | http://localhost:8080/api/v1/health | Agent health check |
| WebSocket | ws://localhost:8080/api/v1/ws/events | Real-time event stream |
| Prometheus | http://localhost:8889/metrics | Metrics export |
| OTel gRPC | localhost:4317 | OTLP ingest |
Verify it's running
```shell
curl http://localhost:8080/api/v1/health | jq .
```

```json
{
  "status": "healthy",
  "components": [
    {"name": "kubernetes", "status": "healthy"},
    {"name": "kafka", "status": "healthy"},
    {"name": "cicd-collector", "status": "healthy"},
    {"name": "cloud-aws", "status": "healthy"}
  ],
  "uptime": "2m30s"
}
```
Production Deployment
Kronveil ships with a Helm chart for Kubernetes deployment. For AWS EKS with Rancher:
```shell
# Build and push to ECR
docker build -f deploy/Dockerfile.agent -t <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent:v0.3 .
docker push <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent:v0.3

# Deploy with Helm
helm install kronveil helm/kronveil/ \
  -n kronveil --create-namespace \
  -f values-prod.yaml
```
The Helm chart includes deployment, RBAC (ClusterRole + Role), ServiceAccount with IRSA annotation, NetworkPolicy, and Prometheus scrape annotations. A full production deployment guide covering ECR, MSK, IRSA, ALB Ingress, and TLS is available in the repository.
CI Pipeline
All 7 CI jobs passing:
| Job | What It Checks |
|---|---|
| Lint | golangci-lint v2 (errcheck, staticcheck, govet) |
| Security | govulncheck for known vulnerabilities |
| Test | `go test ./... -race` with coverage threshold |
| Build | CGO_ENABLED=0 static binary |
| Docker Build & Scan | Trivy scan for CRITICAL/HIGH CVEs |
| Dashboard | npm lint + build |
| Helm Lint | Chart validation |
What's Next (v0.4)
- Live runbook execution — move from dry-run to real `kubectl` and script execution with approval gates
- Collector marketplace — share and install community-built collectors via the SDK
- Cross-cluster incident correlation — AI-powered correlation across federated clusters
- Dashboard runbook triggers — execute runbooks directly from the UI
- Grafana plugin — embed Kronveil panels in existing Grafana dashboards
Links
- GitHub: github.com/kronveil/kronveil
- v0.1 post: I Built an AI-Powered Infrastructure Observability Agent from Scratch
- v0.2 post: Kronveil v0.2: Dashboard, gRPC, Secret Management, and Local Deployment
If you've been following along, star the repo and try it out. PRs, issues, and feedback are always welcome.