From Single Cluster to Multi-Cluster Production
A couple of weeks ago, I shipped Kronveil v0.2 — a fully running AI infrastructure agent with a dashboard, gRPC transport, secret management, and local Docker deployment. If you missed the original launch post, here's where it all started.
v0.2 worked well for a single cluster. But production environments don't run on one cluster. Teams have us-east-prod, eu-west-prod, maybe a staging cluster in ap-south. They have GitHub Actions pipelines they need visibility into. They have Azure VMs and GCP instances alongside Kubernetes workloads.
v0.3 addresses all of that. Here's what changed.
What's New in v0.3
| # | Feature | What It Does |
|---|---|---|
| 1 | Multi-Cluster Federation | Aggregate telemetry from multiple Kubernetes clusters into one view |
| 2 | Custom Collector SDK | Build your own collectors in ~50 lines of Go |
| 3 | Automated Runbook Engine | Execute incident response playbooks automatically |
| 4 | Real Azure SDK Integration | Azure Monitor metrics + Resource Manager listing |
| 5 | Real GCP SDK Integration | Cloud Monitoring + Asset Inventory |
| 6 | GitHub Actions CI/CD Collector | Poll workflow runs, track status changes, map to severity |
| 7 | Kafka Throughput Monitoring | Real offset tracking and messages/sec computation |
| 8 | WebSocket Real-Time Streaming | Live event feed to the dashboard, no more polling |
| 9 | Runbooks Dashboard Page | New UI page for runbook management and execution history |
| 10 | Vault Background Sync | Periodic secret rotation monitoring with KV v2 metadata API |
1. Multi-Cluster Federation
This is the biggest feature in v0.3. The federation manager sits on top of multiple Kubernetes collectors and aggregates their telemetry into a single event stream.
How It Works
```text
┌─────────────────────────────────────────────────┐
│              Federation Manager                 │
│         implements engine.Collector             │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────────┐   ┌──────────────┐            │
│  │ us-east-prod │   │ eu-west-prod │   ...      │
│  │ K8s Collector│   │ K8s Collector│            │
│  └──────┬───────┘   └──────┬───────┘            │
│         │                  │                    │
│         └──────┬───────────┘                    │
│                ▼                                │
│  ┌──────────────────────────┐                   │
│  │ Aggregator               │                   │
│  │ SHA256 dedup (30s window)│                   │
│  │ Cross-cluster metrics    │                   │
│  └──────────────────────────┘                   │
│                                                 │
└─────────────────────────────────────────────────┘
```
Each cluster's events are tagged with cluster_name and cluster_region metadata before being forwarded. The aggregator deduplicates events using SHA256 fingerprinting — if overlapping collectors emit the same event within a 30-second window, it's counted once.
Configuration
```yaml
collectors:
  kubernetes:
    clusters:
      - name: us-east-prod
        kubeconfig_path: ~/.kube/us-east
        context: prod-context
        namespaces: ["default", "payments", "auth"]
        poll_interval: 15s
      - name: eu-west-prod
        kubeconfig_path: ~/.kube/eu-west
        context: prod-context
        namespaces: ["default", "payments"]
        poll_interval: 15s
```
The federation manager implements engine.Collector, so the rest of Kronveil — the intelligence pipeline, the API, the dashboard — doesn't need to know whether it's watching 1 cluster or 20.
Aggregate metrics are computed automatically: total pods, total nodes, total events across all clusters.
2. Custom Collector SDK
Writing a Kronveil collector used to mean implementing the full engine.Collector interface — managing goroutines, channels, health reporting, and lifecycle. Now you implement three methods:
```go
type Plugin interface {
	Name() string
	Collect(ctx context.Context) ([]*collector.Event, error)
	Healthcheck(ctx context.Context) error
}
```
The SDK's Builder handles the rest — polling loop, buffered event channel with backpressure, health reporting, and clean shutdown:
```go
col := collector.NewBuilder(&myPlugin{}).
	WithPollInterval(10 * time.Second).
	WithBufferSize(128).
	WithLogger(slog.Default()).
	Build()

// col implements engine.Collector — register it like any built-in collector
registry.RegisterCollector(col)
```
Full Example: HTTP Health Checker
```go
type HTTPChecker struct {
	url string
}

func (h *HTTPChecker) Name() string { return "http-checker" }

func (h *HTTPChecker) Collect(ctx context.Context) ([]*collector.Event, error) {
	start := time.Now()
	// Build the request with ctx so cancellation and shutdown propagate.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, h.url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return []*collector.Event{{
			Type:     "http_check_failed",
			Severity: "high",
			Payload:  map[string]interface{}{"url": h.url, "error": err.Error()},
		}}, nil
	}
	defer resp.Body.Close()
	return []*collector.Event{{
		Type: "http_check",
		Payload: map[string]interface{}{
			"url":        h.url,
			"status":     resp.StatusCode,
			"latency_ms": time.Since(start).Milliseconds(),
		},
	}}, nil
}

func (h *HTTPChecker) Healthcheck(ctx context.Context) error { return nil }
```
What the Adapter Handles For You
- Polling loop at your configured interval
- Immediate first collect (no waiting for the first tick)
- Buffered channel with drop + warn when full
- Health status combining `Healthcheck()` result with recent collect errors
- Thread-safe start/stop lifecycle
One interesting bug I caught during CI: the original Stop() held a mutex while waiting on a WaitGroup. The polling goroutine needed the same mutex to record errors. Classic deadlock — the goroutine couldn't finish because it couldn't acquire the lock, and Stop() couldn't return because it was waiting for the goroutine. Fixed by releasing the lock before wg.Wait().
3. Automated Runbook Engine
When Kronveil detects an incident, it can now execute a predefined playbook instead of just alerting.
The runbook engine ships with 4 default runbooks:
| Runbook | Triggers | Steps | Auto-Execute |
|---|---|---|---|
| Pod OOM | `OOMKilled`, `MemoryPressure` | Diagnose > Scale > Notify | Yes |
| High Latency | `HighLatency`, `SLOBreach` | Diagnose > Restart > Notify | No |
| Disk Pressure | `DiskPressure`, `LogVolumeHigh` | Cleanup > Notify | Yes |
| Certificate Expiry | `CertExpiry`, `TLSError` | Renew > Notify | No |
How Execution Works
```text
Incident Detected
      │
      ▼
FindRunbooks(incidentType)
      │
      ▼
For each matching runbook:
  ├── autoExecute=true  → Execute immediately
  └── autoExecute=false → Queue for approval
      │
      ▼
Execute each Step sequentially:
  ├── kubectl_scale  → Scale deployment replicas
  ├── restart_pod    → Delete pod for controller restart
  ├── notify_oncall  → Slack/PagerDuty notification
  ├── run_diagnostic → Execute diagnostic command
  └── custom_script  → Run remediation script
      │
      ▼
Record ExecutionResult (timing, step results, success/failure)
```
In v0.3, all action handlers run in dry-run mode — they log what they would do without executing. This lets you validate runbook logic before enabling live remediation. Live execution is on the v0.4 roadmap.
You can register custom runbooks:
```go
executor.RegisterRunbook(runbook.Runbook{
	ID:            "custom-db-failover",
	Name:          "Database Failover",
	IncidentTypes: []string{"DatabaseDown", "ReplicationLag"},
	AutoExecute:   false,
	Steps: []runbook.Step{
		{Name: "Check replication", Action: "run_diagnostic",
			Config: map[string]string{"command": "pg_stat_replication"}},
		{Name: "Promote standby", Action: "custom_script",
			Config: map[string]string{"script": "/opt/scripts/promote-standby.sh"}},
		{Name: "Notify DBA team", Action: "notify_oncall",
			Config: map[string]string{"channel": "#dba-oncall"}},
	},
})
```
4. Real Cloud Provider Integrations
v0.2 had stub implementations for cloud providers. v0.3 wires up real SDKs.
Azure
- Auth: `azidentity.DefaultAzureCredential` — supports managed identity, CLI, and environment variables
- Metrics: Azure Monitor `azquery.MetricsClient` queries CPU, memory, disk, and network
- Resources: ARM `armresources.Client` with full pagination support
- Config: set `AZURE_SUBSCRIPTION_ID` and standard Azure credentials
GCP
- Auth: Application Default Credentials (ADC)
- Metrics: Cloud Monitoring `ListTimeSeries` with a 5-minute lookback window
- Resources: Cloud Asset `SearchAllResources` for inventory
- Config: set `GCP_PROJECT_ID` or `GOOGLE_CLOUD_PROJECT`
GitHub Actions (CI/CD Collector)
The CI/CD collector now polls the GitHub REST API for workflow runs across configured repositories:
```yaml
collectors:
  cicd:
    github_token: "ghp_..."
    repo_filters:
      - "your-org/your-repo"
      - "your-org/another-repo"
    poll_interval: 60s
```
It tracks run status changes, emits events for new runs and state transitions, and maps GitHub conclusions to severity levels:
| Conclusion | Severity |
|---|---|
| `failure`, `timed_out` | High |
| `cancelled`, `action_required` | Medium |
| `success`, other | Info |
Kafka Throughput
The Kafka collector now dials brokers directly, reads partition offsets, and computes real messages/second throughput per topic. No more mock data.
5. WebSocket Real-Time Streaming
The dashboard no longer polls the REST API for updates. Events flow over WebSocket.
Backend
The Go server manages a WebSocket hub with a broadcaster that pushes engine status to all connected clients every 2 seconds:
```text
Client connects    → wsHub.add(conn)
                          │
Broadcaster (2s) ─────────┤──→ JSON to all clients
                          │
Client disconnects → wsHub.remove(conn)
```
Frontend
Two new React hooks power the live experience:
- `useWebSocket` — generic hook with auto-reconnect and exponential backoff (1s to 30s)
- `useEventStream` — wraps the WebSocket events endpoint, maintains a 100-event rolling buffer, and provides memoized filtered views for incidents, anomalies, and all events
The Overview page shows a green pulsing Live indicator when WebSocket is connected. When disconnected, it falls back to mock data gracefully.
6. Runbooks Dashboard Page
New /runbooks route in the dashboard:
Summary cards at the top:
- Total runbooks
- Auto-execute count
- Executions in last 24 hours
- Average success rate
Each runbook card shows:
- Name and description
- Auto/manual execution badge (green dot for auto, gray for manual)
- Incident type tags
- Step count, last run time, success rate
- Recent run indicators — green and red dots for the last 3 executions
Same dark theme as the rest of the dashboard. Built with the same Tailwind patterns.
7. OpenTelemetry & Observability
Kronveil exports traces and metrics via OpenTelemetry, fitting into your existing observability stack:
| Endpoint | Protocol | Purpose |
|---|---|---|
| `:4317` | gRPC | OTLP traces and metrics |
| `:4318` | HTTP | OTLP traces and metrics |
| `:8889` | HTTP | Prometheus metrics export |
| `:13133` | HTTP | OTel collector health check |
| `:55679` | HTTP | zPages debugging |
Point your Jaeger, Tempo, or Datadog backend at these endpoints, or configure Prometheus to scrape Kronveil's metrics.
Full Architecture (v0.3)
```text
┌───────────────────────────────────────────────────────┐
│                  Dashboard (React)                    │
│        WebSocket <── REST API (Go) ──>                │
├───────────────────────────────────────────────────────┤
│                    Engine Core                        │
│  ┌────────────┐ ┌──────────┐ ┌────────────────┐       │
│  │ Federation │ │ Runbook  │ │ AI Intelligence│       │
│  │  Manager   │ │ Executor │ │ (AWS Bedrock)  │       │
│  └────────────┘ └──────────┘ └────────────────┘       │
├───────────────────────────────────────────────────────┤
│                    Collectors                         │
│ ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐ ┌───┐ ┌────────┐     │
│ │ K8s │ │Kafka│ │CI/CD │ │Azure│ │GCP│ │Custom  │     │
│ │     │ │     │ │GitHub│ │     │ │   │ │ (SDK)  │     │
│ └─────┘ └─────┘ └──────┘ └─────┘ └───┘ └────────┘     │
├───────────────────────────────────────────────────────┤
│               Integrations & Export                   │
│   Slack · PagerDuty · Vault · Prometheus · OTel       │
└───────────────────────────────────────────────────────┘
```
Local Deployment
Everything runs with Docker Compose — same as v0.2:
```shell
git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker compose up --build
```
| Service | URL | Description |
|---|---|---|
| Dashboard | http://localhost:3000 | React UI with live WebSocket |
| API | http://localhost:8080/api/v1/ | REST endpoints |
| Health | http://localhost:8080/api/v1/health | Agent health check |
| WebSocket | ws://localhost:8080/api/v1/ws/events | Real-time event stream |
| Prometheus | http://localhost:8889/metrics | Metrics export |
| OTel gRPC | localhost:4317 | OTLP ingest |
Verify it's running
```shell
curl http://localhost:8080/api/v1/health | jq .
```

```json
{
  "status": "healthy",
  "components": [
    {"name": "kubernetes", "status": "healthy"},
    {"name": "kafka", "status": "healthy"},
    {"name": "cicd-collector", "status": "healthy"},
    {"name": "cloud-aws", "status": "healthy"}
  ],
  "uptime": "2m30s"
}
```
Production Deployment
Kronveil ships with a Helm chart for Kubernetes deployment. For AWS EKS with Rancher:
```shell
# Build and push to ECR
docker build -f deploy/Dockerfile.agent -t <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent:v0.3 .
docker push <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent:v0.3

# Deploy with Helm
helm install kronveil helm/kronveil/ \
  -n kronveil --create-namespace \
  -f values-prod.yaml
```
The Helm chart includes deployment, RBAC (ClusterRole + Role), ServiceAccount with IRSA annotation, NetworkPolicy, and Prometheus scrape annotations. A full production deployment guide covering ECR, MSK, IRSA, ALB Ingress, and TLS is available in the repository.
CI Pipeline
All 7 CI jobs passing:
| Job | What It Checks |
|---|---|
| Lint | golangci-lint v2 (errcheck, staticcheck, govet) |
| Security | govulncheck for known vulnerabilities |
| Test | `go test ./... -race` with coverage threshold |
| Build | CGO_ENABLED=0 static binary |
| Docker Build & Scan | Trivy scan for CRITICAL/HIGH CVEs |
| Dashboard | npm lint + build |
| Helm Lint | Chart validation |
What's Next (v0.4)
- Live runbook execution — move from dry-run to real `kubectl` and script execution with approval gates
- Collector marketplace — share and install community-built collectors via the SDK
- Cross-cluster incident correlation — AI-powered correlation across federated clusters
- Dashboard runbook triggers — execute runbooks directly from the UI
- Grafana plugin — embed Kronveil panels in existing Grafana dashboards
Links
- GitHub: github.com/kronveil/kronveil
- v0.1 post: I Built an AI-Powered Infrastructure Observability Agent from Scratch
- v0.2 post: Kronveil v0.2: Dashboard, gRPC, Secret Management, and Local Deployment
If you've been following along, star the repo and try it out. PRs, issues, and feedback are always welcome.