DEV Community: Ramasankar Molleti

Anthropic Just Killed the API Key: A Deep Dive into Workload Identity Federation for Claude

Ramasankar Molleti — Wed, 06 May 2026 03:30:40 +0000

TL;DR — Anthropic shipped Workload Identity Federation (WIF) for the Claude API. Your workloads now exchange a short-lived OIDC JWT from your IdP (EKS IRSA, GKE, AKS, GitHub Actions, Kubernetes, SPIFFE/SPIRE, Okta, Entra ID) for a short-lived sk-ant-oat01-... token via RFC 7523 jwt-bearer grant. Zero static secrets. But it's workload identity, not user delegation — and that distinction is where confused deputy bugs are about to start showing up.

Why this matters (and why I'm writing a sequel)

A few weeks back I wrote about draft-klrc-aiagent-auth — the IETF blueprint for agentic identity from engineers at AWS, Zscaler, Ping Identity, and Defakto Security. The thesis was straightforward: most teams securing AI agents with API keys are one breach away from disaster, and the fix is an 8-layer Agent Identity Management System (AIMS) built on SPIFFE for workload identity, WIMSE for proof tokens across proxies, OAuth Token Exchange for delegation, and Transaction Tokens for operation-scoped authorization.

That post was about the standard. This post is about the first major LLM provider to ship a production implementation of the bottom half of that stack.

If you're running Claude in a regulated environment — financial services, healthcare, gov — and you've been waiting for the day you can stop baking sk-ant-... keys into Kubernetes secrets, that day is here. But there's a subtle architectural trap, and it's easy to miss.

Let's walk through what shipped, the exchange flow, the SPIFFE integration, and the confused-deputy footgun.

What Anthropic actually built

The mental model is clean:

A service account has credentials minted for it on demand, instead of being a credential.

That one sentence captures the whole shift. An API key is a credential — possessing it is sufficient. A service account is a principal that gets credentials minted on demand from an attested workload identity. Possession of the principal isn't a thing. You have to be the workload.

Three resources in the Claude Console express the trust relationship:

1. Service Account (`svac_...`)

A non-human principal in your Anthropic org. No email, no password, no Console login. It's the identity a federated token acts as. Joins workspaces like a human member. Minted tokens inherit that workspace's rate limits and usage attribution.

2. Federation Issuer (`fdis_...`)

Registers an OIDC provider with two key fields:

Issuer URL — must match the iss claim in your IdP's JWTs exactly.
JWKS source — discovery (default, hits /.well-known/openid-configuration), explicit_url, or inline for air-gapped clusters.

One issuer per environment. Your prod EKS, your staging EKS, and GitHub Actions are three separate issuers.

3. Federation Rule (`fdrl_...`)

The bridge between issuer and service account: "when a JWT from issuer X has claims matching Y, mint a token for service account Z."

Match conditions:

subject_prefix — exact or trailing-* match
exact audience
exact claim values (key/value map)
a CEL condition expression for complex logic

All matchers must pass. There is no implicit rule search — the client specifies the rule ID in the exchange request, and Anthropic verifies the JWT satisfies that rule. This is a deliberate design choice that prevents "rule confusion" attacks where a token accidentally matches a more permissive rule.

The exchange flow

┌──────────────┐  1. Get JWT     ┌───────────┐
│   Workload   │ ──────────────▶ │  Your IdP │
│  (in pod)    │ ◀────────────── │  (SPIRE,  │
└──────┬───────┘   JWT-SVID      │   EKS,    │
       │                          │   GHA…)   │
       │ 2. POST /v1/oauth/token  └───────────┘
       │    (jwt-bearer grant)
       ▼
┌──────────────────────────────────────┐
│  Anthropic token endpoint            │
│  - Verify signature against JWKS     │
│  - Check exp/nbf/iat                 │
│  - Match against federation rule     │
│  - Mint sk-ant-oat01-... (≤ rule TTL)│
└──────────────────────────────────────┘
       │
       │ 3. Bearer token on every API call
       ▼
   api.anthropic.com/v1/messages

Concretely, the SDK construction looks like this:

from anthropic import Anthropic, WorkloadIdentityCredentials, IdentityTokenFile

client = Anthropic(
    credentials=WorkloadIdentityCredentials(
        identity_token_provider=IdentityTokenFile(
            "/var/run/secrets/anthropic.com/token"
        ),
        federation_rule_id="fdrl_...",
        organization_id="00000000-0000-0000-0000-000000000000",
        service_account_id="svac_...",
        workspace_id="wrkspc_...",
    ),
)

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, Claude"}],
)

In production you'd skip the explicit constructor entirely and let the SDK resolve from environment variables — that's the recommended pattern. Ship the same container image everywhere, inject the env per environment:

ANTHROPIC_FEDERATION_RULE_ID=fdrl_...
ANTHROPIC_ORGANIZATION_ID=00000000-...
ANTHROPIC_SERVICE_ACCOUNT_ID=svac_...
ANTHROPIC_WORKSPACE_ID=wrkspc_...
ANTHROPIC_IDENTITY_TOKEN_FILE=/var/run/secrets/anthropic.com/token

Then in code: client = Anthropic(). Done. The SDK reads the file, exchanges for a token, refreshes before expiry, retries on rotation.

Token lifetime — the smart part

Most "short-lived token" systems get this wrong. Anthropic got it right.

The minted token's lifetime is the lesser of:

The rule's token_lifetime_seconds (60s to 24h, default 1h)

Twice the remaining IdP JWT lifetime, with a 60-second floor

That second bound is what matters. It prevents an Anthropic token from significantly outliving the upstream identity it was derived from. If your SPIRE JWT-SVID has a 5-minute TTL (SPIRE's default), the Anthropic token can live at most 10 minutes regardless of what the rule says.

Upstream attestation is the binding constraint — exactly the property you want.

The SDK runs a two-tier refresh modeled on botocore:

Tier	Trigger	Behavior
Advisory	`expiry - 120s`	Best-effort exchange. Falls back to cached token on failure.
Mandatory	`expiry - 30s`	Failed exchange raises an error. Cached token too close to expiry.

And it re-reads ANTHROPIC_IDENTITY_TOKEN_FILE on every exchange, so rotated projected tokens (Kubernetes service-account tokens, SPIFFE JWT-SVIDs from spiffe-helper) get picked up transparently. No app restart. No human in the loop.

SPIFFE on Anthropic — the cleanest path

If you're running SPIRE, Anthropic has a first-class SPIFFE provider and the integration is genuinely well-designed. Here's the full setup.

SPIRE side

server.conf:

server {
    trust_domain         = "prod.example.com"
    jwt_issuer           = "https://oidc-discovery.prod.example.com"
    default_jwt_svid_ttl = "5m"
}

Two non-obvious things here:

jwt_issuer MUST equal the OIDC Discovery Provider's public URL — that exact string is what you register with Anthropic. Mismatch = 400 invalid_grant. This is the #1 cause of failed setups.
default_jwt_svid_ttl ≤ 1 hour. Anthropic's token-exchange endpoint rejects identity tokens with longer lifetimes. SPIRE's default is fine.

OIDC Discovery Provider config:

domains = ["oidc-discovery.prod.example.com"]

server_api {
    address = "unix:///run/spire/sockets/private/api.sock"
}

acme {
    email        = "..."
    tos_accepted = true
}

Workload registration entry:

spire-server entry create \
    -spiffeID spiffe://prod.example.com/ns/inference/sa/worker \
    -parentID spiffe://prod.example.com/spire/agent/k8s_psat/prod-cluster/NODE_UID \
    -selector k8s:ns:inference \
    -selector k8s:sa:worker

For cluster-wide registration, parent to a node alias instead of a single agent ID — otherwise you're pinned to one node.

spiffe-helper sidecar config:

agent_address = "/run/spire/sockets/agent.sock"
cert_dir      = "/var/run/secrets/anthropic.com"
daemon_mode   = true

jwt_svids = [{
    jwt_audience       = "https://api.anthropic.com"
    jwt_svid_file_name = "token"
}]

Anthropic side

Federation issuer:

{
  "name": "spire-prod",
  "issuer_url": "https://oidc-discovery.prod.example.com",
  "jwks_source": "discovery"
}

Federation rule:

{
  "name": "spire-inference-worker",
  "issuer_id": "fdis_...",
  "match": {
    "subject_prefix": "spiffe://prod.example.com/ns/inference/sa/worker",
    "audience": "https://api.anthropic.com"
  },
  "target": {
    "type": "service_account",
    "service_account_id": "svac_..."
  },
  "workspace_id": "wrkspc_...",
  "oauth_scope": "workspace:developer",
  "token_lifetime_seconds": 600
}

Kubernetes deployment — the volume detail nobody mentions

This is the one operational detail people miss:

volumes:
  - name: anthropic-token
    emptyDir:
      medium: Memory   # ← THIS LINE

Use a memory-backed emptyDir shared between spiffe-helper and the application container. The bearer JWT-SVID never touches the node's disk. Same pattern as Vault Agent token sinks. Same reason: bearer tokens on persistent storage are a postmortem waiting to happen.

Validation before wiring up the SDK

Always validate the JWT-SVID claims before you trust your federation rule:

spire-agent api fetch jwt \
    -audience https://api.anthropic.com \
    -socketPath /run/spire/sockets/agent.sock \
  | awk '/^[[:space:]]*eyJ/{print $1; exit}' \
  | jq -rR 'split(".")[1] | gsub("-";"+") | gsub("_";"/") | @base64d | fromjson'

Check:

iss matches the OIDC Discovery Provider URL you registered
sub is the workload's SPIFFE ID
aud contains https://api.anthropic.com

If any of those don't match what your federation rule expects, the exchange returns 400 invalid_grant with no useful diagnostic on the client side. Validate the claims first.

Three SPIFFE gotchas

1. Always set the audience matcher on the rule. Without it, the rule accepts JWT-SVIDs minted for any relying party. If the same workload also calls some other SaaS via SPIFFE, a token meant for that SaaS could exchange for an Anthropic token. Always pin audience.

2. Inline JWKS = you own rotation. SPIRE rotates signing keys frequently. If you registered the issuer with inline JWKS (air-gapped clusters), you must add new keys before workloads present them, and remove superseded keys after tokens signed with them expire. Stale keys in inline JWKS remain trusted indefinitely.

3. One issuer per trust domain. Each SPIRE trust domain has its own signing keys and OIDC Discovery Provider. Register each as a separate Anthropic federation issuer.

Mapping onto the IETF AIMS stack

This is where it gets interesting for anyone tracking the agentic identity standards work.

In draft-klrc-aiagent-auth, AIMS is 8 layers:

Identifiers → Credentials → Attestation → Provisioning → Authentication → Authorization → Observability → Policy

Anthropic WIF doesn't implement all eight. But it implements the bottom five correctly, and that's exactly the foundation the upper layers need.

AIMS Layer	What it requires	Anthropic WIF
Identifiers	Cryptographic, runtime-issued	SPIFFE ID via `sub` claim, or IdP-native subject
Credentials	Short-lived, attested, no secrets at rest	OIDC JWT exchanged for OAuth access token. Zero static secrets.
Attestation	Identity bound to what the workload is	Inherited from upstream IdP (SPIRE selectors, IRSA pod identity, GHA repo+workflow claims)
Provisioning	Federated trust, declarative	Console-configured issuer + rule. CEL for complex policy.
Authentication	Standards-based, verifiable	RFC 7523 JWT-bearer. JWKS validated. `iss`/`aud`/`exp`/`nbf`/`iat` enforced.
Authorization	Scoped, least-privilege	`workspace:developer` scope, workspace-bound, rate-limited
Observability	Audit chain	Service account attribution per request
Policy	Centralized enforcement	CEL match expressions, per-rule scoping

What's notably not here: the upper-layer agentic authorization primitives — OAuth Token Exchange with act claims, Transaction Tokens, Rich Authorization Requests (RAR), CAEP for real-time revocation.

That's not a criticism. Those belong at your gateway, not at the LLM provider's auth endpoint.

Which brings me to the most important point in this whole post.

The trap: this is workload auth, NOT user delegation

Here's the single most consequential thing to understand about Anthropic WIF, and it's hiding in plain sight: the caller is treated as a workload. There is no user delegation semantics here.

This isn't OAuth's authorization code flow. There's no user identity riding through the exchange. The federated token represents the workload that called Anthropic — not the user that asked the agent to do something. And in any real agentic deployment, a user is almost always at the top of the call chain.

That gap is where confused deputy bugs live.

Let me make this concrete.

Anthropic WIF answers: "Is this workload allowed to call Claude on behalf of this Anthropic service account, with these rate limits, in this workspace?"

Anthropic WIF does NOT answer: "Is Alice allowed to ask Claude to summarize Bob's salary data?"

There is no act claim. No user identity propagated to Anthropic. From Anthropic's perspective, every request from your gateway looks like the same service account. The user is invisible to them — by design, because that's how a workload-to-API trust boundary should work.

This is the classic confused deputy setup:

Your AgentGateway holds workload credentials (now: WIF tokens) that grant access to Claude.
Users delegate tasks to agents. Agents call through the gateway.
If the gateway doesn't enforce user authorization at its own boundary, an authenticated agent acting for a low-privilege user can ask the LLM to operate on data that user shouldn't see.
The LLM has no way to know.

The upstream WIF token only proves "the gateway said this is a legit workload call." It says nothing about which user triggered the call, what they're allowed to do, or whether the prompt content respects their authorization scope.

The layered model that actually works

┌───────────────────────────────────────────────────────────┐
│  USER                                                     │
│  ↓ (OIDC auth code flow, MFA, IdP session)                │
│  AGENT-FACING APP                                         │
│  ↓ (OAuth Token Exchange — adds `act` claim)              │
│  AGENT  ←── Transaction Token (RAR-scoped: "summarize     │
│  ↓                              doc X, max 1 LLM call")   │
│  AGENTGATEWAY  ←── enforces user policy + scope intersect │
│  ↓ (Anthropic WIF: SPIFFE JWT-SVID → sk-ant-oat01-...)    │
│  CLAUDE API                                               │
└───────────────────────────────────────────────────────────┘

Read top-to-bottom: user identity rides through OAuth Token Exchange with act claims, Transaction Tokens scope the specific operation, the gateway enforces user-level authorization, and Anthropic WIF handles the workload-to-LLM hop. Each layer answers a different question. None are interchangeable.

If you skip the user layer, WIF is still a massive upgrade over API keys — you've eliminated stored secrets, gained short-lived tokens, gained per-workload attribution. But you have not solved agentic identity. You've solved infrastructure identity.

The migration trap that will bite you

Buried in the docs:

ANTHROPIC_API_KEY sits above the federation tiers, so a leftover key in the environment silently shadows federation.

Translation: you can configure WIF perfectly, deploy, smoke-test, ship — and still be using the old API key. Because credential precedence puts ANTHROPIC_API_KEY above the federation env vars, the federation code path simply never runs.

The migration sequence that actually works:

1. Stand up federation in parallel. Leave ANTHROPIC_API_KEY in place.
2. Run `ant auth status` from inside the workload.
   At this stage: the API key wins. That's expected.
3. Unset ANTHROPIC_API_KEY EVERYWHERE:
     - CI secrets
     - Container env (Deployment manifests, Helm values)
     - Shell profiles
     - Any sidecar that injects it
4. Re-run `ant auth status`. Confirm the federation source is selected.
5. ONLY NOW: revoke the API key in the Console.

Step 3 is the high-risk step where audit chains catch leftover injections. I'd add: instrument your gateway logs to alert on requests carrying sk-ant-api03-... prefixes after cutover. If that prefix shows up after step 5, you have a stowaway. Could be a CronJob, a CI workflow, a debug pod, a contractor's laptop.

What this means for platform architecture

If you're running Claude in production today, three things change:

1. The threat model shifts from "key custody" to "issuer trust"

You're no longer worried about a static key leaking from a Vault transit engine, a CI log, a Slack message, or a developer laptop. You're worried about whether your IdP is correctly attesting workload identity.

The blast radius of compromise goes from "anyone with the key can be us" to "the attacker needs to compromise our IdP and satisfy the federation rule's match conditions during a short token window".

2. Audit and attribution become per-workload by default

Service account IDs flow into Anthropic's usage and rate limit attribution. Combined with your gateway logs, you can trace a single Claude inference back to: which workspace, which service account, which workload (via the federation rule + JWT sub), which user request (via correlation ID).

That's the audit chain regulators will eventually require for AI inference in regulated industries.

3. The gateway's job gets more important, not less

WIF closes the workload-to-LLM hop. The user-to-agent and agent-to-tool hops are still yours to enforce.

AgentGateway, with SPIFFE for workload mechanism and OAuth Token Exchange + RAR for user delegation, is where confused-deputy attacks get prevented. WIF is necessary but not sufficient.

Closing thought

Anthropic shipped Workload Identity Federation. RFC 7523 JWT-bearer grant. First-class support for AWS, GCP, Azure, GitHub Actions, Kubernetes, SPIFFE, Okta. Service account model. Short-lived tokens. Two-tier SDK refresh. Per-workload attribution.

For platform teams running Claude in regulated environments — especially on Kubernetes with SPIRE — this is the API key killer we've been waiting for. It maps cleanly onto the bottom five layers of the IETF AIMS stack.

But it is workload identity, not user delegation. The agent calling through your gateway still needs OAuth Token Exchange with act claims for user context, Transaction Tokens for operation scoping, and gateway-level policy enforcement to prevent confused deputy.

Static credentials die at the gateway. Dynamic attested tokens live at every hop.

The shift continues:

Stop asking "does it have the right key?"
Start asking "what IS this entity, do we trust it, what do we expect from it?"

References

If you found this useful, follow along — the next post in this series digs into AgentGateway implementation patterns: SPIFFE attestation, WIMSE proof tokens across proxies, and OAuth Token Exchange with act claim chaining for user delegation. The piece WIF doesn't solve.

Kronveil v0.3: Multi-Cluster Federation, Custom Collector SDK, and Automated Runbooks

Ramasankar Molleti — Mon, 06 Apr 2026 04:56:17 +0000

From Single Cluster to Multi-Cluster Production

A couple of weeks ago, I shipped Kronveil v0.2 — a fully running AI infrastructure agent with a dashboard, gRPC transport, secret management, and local Docker deployment. If you missed the original launch post, here's where it all started.

v0.2 worked well for a single cluster. But production environments don't run on one cluster. Teams have us-east-prod, eu-west-prod, maybe a staging cluster in ap-south. They have GitHub Actions pipelines they need visibility into. They have Azure VMs and GCP instances alongside Kubernetes workloads.

v0.3 addresses all of that. Here's what changed.

What's New in v0.3

#	Feature	What It Does
1	Multi-Cluster Federation	Aggregate telemetry from multiple Kubernetes clusters into one view
2	Custom Collector SDK	Build your own collectors in ~50 lines of Go
3	Automated Runbook Engine	Execute incident response playbooks automatically
4	Real Azure SDK Integration	Azure Monitor metrics + Resource Manager listing
5	Real GCP SDK Integration	Cloud Monitoring + Asset Inventory
6	GitHub Actions CI/CD Collector	Poll workflow runs, track status changes, map to severity
7	Kafka Throughput Monitoring	Real offset tracking and messages/sec computation
8	WebSocket Real-Time Streaming	Live event feed to the dashboard, no more polling
9	Runbooks Dashboard Page	New UI page for runbook management and execution history
10	Vault Background Sync	Periodic secret rotation monitoring with KV v2 metadata API

1. Multi-Cluster Federation

This is the biggest feature in v0.3. The federation manager sits on top of multiple Kubernetes collectors and aggregates their telemetry into a single event stream.

How It Works

┌─────────────────────────────────────────────────┐
│              Federation Manager                  │
│         implements engine.Collector              │
├─────────────────────────────────────────────────┤
│                                                  │
│  ┌──────────────┐  ┌──────────────┐             │
│  │ us-east-prod │  │ eu-west-prod │  ...        │
│  │  K8s Collector│  │  K8s Collector│            │
│  └──────┬───────┘  └──────┬───────┘             │
│         │                  │                     │
│         └──────┬───────────┘                     │
│                ▼                                  │
│  ┌──────────────────────────┐                    │
│  │       Aggregator         │                    │
│  │  SHA256 dedup (30s window)│                   │
│  │  Cross-cluster metrics    │                   │
│  └──────────────────────────┘                    │
│                                                  │
└─────────────────────────────────────────────────┘

Each cluster's events are tagged with cluster_name and cluster_region metadata before being forwarded. The aggregator deduplicates events using SHA256 fingerprinting — if overlapping collectors emit the same event within a 30-second window, it's counted once.

Configuration

collectors:
  kubernetes:
    clusters:
      - name: us-east-prod
        kubeconfig_path: ~/.kube/us-east
        context: prod-context
        namespaces: ["default", "payments", "auth"]
        poll_interval: 15s
      - name: eu-west-prod
        kubeconfig_path: ~/.kube/eu-west
        context: prod-context
        namespaces: ["default", "payments"]
        poll_interval: 15s

The federation manager implements engine.Collector, so the rest of Kronveil — the intelligence pipeline, the API, the dashboard — doesn't need to know whether it's watching 1 cluster or 20.

Aggregate metrics are computed automatically: total pods, total nodes, total events across all clusters.

2. Custom Collector SDK

Writing a Kronveil collector used to mean implementing the full engine.Collector interface — managing goroutines, channels, health reporting, and lifecycle. Now you implement three methods:

type Plugin interface {
    Name() string
    Collect(ctx context.Context) ([]*collector.Event, error)
    Healthcheck(ctx context.Context) error
}

The SDK's Builder handles the rest — polling loop, buffered event channel with backpressure, health reporting, and clean shutdown:

col := collector.NewBuilder(&myPlugin{}).
    WithPollInterval(10 * time.Second).
    WithBufferSize(128).
    WithLogger(slog.Default()).
    Build()

// col implements engine.Collector — register it like any built-in collector
registry.RegisterCollector(col)

Full Example: HTTP Health Checker

type HTTPChecker struct {
    url string
}

func (h *HTTPChecker) Name() string { return "http-checker" }

func (h *HTTPChecker) Collect(ctx context.Context) ([]*collector.Event, error) {
    start := time.Now()
    resp, err := http.Get(h.url)
    if err != nil {
        return []*collector.Event{{
            Type:     "http_check_failed",
            Severity: "high",
            Payload:  map[string]interface{}{"url": h.url, "error": err.Error()},
        }}, nil
    }
    defer resp.Body.Close()

    return []*collector.Event{{
        Type: "http_check",
        Payload: map[string]interface{}{
            "url":        h.url,
            "status":     resp.StatusCode,
            "latency_ms": time.Since(start).Milliseconds(),
        },
    }}, nil
}

func (h *HTTPChecker) Healthcheck(ctx context.Context) error { return nil }

What the Adapter Handles For You

Polling loop at your configured interval
Immediate first collect (no waiting for the first tick)
Buffered channel with drop + warn when full
Health status combining Healthcheck() result with recent collect errors
Thread-safe start/stop lifecycle

One interesting bug I caught during CI: the original Stop() held a mutex while waiting on a WaitGroup. The polling goroutine needed the same mutex to record errors. Classic deadlock — the goroutine couldn't finish because it couldn't acquire the lock, and Stop() couldn't return because it was waiting for the goroutine. Fixed by releasing the lock before wg.Wait().

3. Automated Runbook Engine

When Kronveil detects an incident, it can now execute a predefined playbook instead of just alerting.

The runbook engine ships with 4 default runbooks:

Runbook	Triggers	Steps	Auto-Execute
Pod OOM	`OOMKilled`, `MemoryPressure`	Diagnose > Scale > Notify	Yes
High Latency	`HighLatency`, `SLOBreach`	Diagnose > Restart > Notify	No
Disk Pressure	`DiskPressure`, `LogVolumeHigh`	Cleanup > Notify	Yes
Certificate Expiry	`CertExpiry`, `TLSError`	Renew > Notify	No

How Execution Works

Incident Detected
       │
       ▼
  FindRunbooks(incidentType)
       │
       ▼
  For each matching runbook:
    ├── autoExecute=true  → Execute immediately
    └── autoExecute=false → Queue for approval
       │
       ▼
  Execute each Step sequentially:
    ├── kubectl_scale   → Scale deployment replicas
    ├── restart_pod     → Delete pod for controller restart
    ├── notify_oncall   → Slack/PagerDuty notification
    ├── run_diagnostic  → Execute diagnostic command
    └── custom_script   → Run remediation script
       │
       ▼
  Record ExecutionResult (timing, step results, success/failure)

In v0.3, all action handlers run in dry-run mode — they log what they would do without executing. This lets you validate runbook logic before enabling live remediation. Live execution is on the v0.4 roadmap.

You can register custom runbooks:

executor.RegisterRunbook(runbook.Runbook{
    ID:            "custom-db-failover",
    Name:          "Database Failover",
    IncidentTypes: []string{"DatabaseDown", "ReplicationLag"},
    AutoExecute:   false,
    Steps: []runbook.Step{
        {Name: "Check replication", Action: "run_diagnostic",
         Config: map[string]string{"command": "pg_stat_replication"}},
        {Name: "Promote standby", Action: "custom_script",
         Config: map[string]string{"script": "/opt/scripts/promote-standby.sh"}},
        {Name: "Notify DBA team", Action: "notify_oncall",
         Config: map[string]string{"channel": "#dba-oncall"}},
    },
})

4. Real Cloud Provider Integrations

v0.2 had stub implementations for cloud providers. v0.3 wires up real SDKs.

Azure

Auth: azidentity.DefaultAzureCredential — supports managed identity, CLI, environment variables
Metrics: Azure Monitor azquery.MetricsClient queries CPU, memory, disk, and network
Resources: ARM armresources.Client with full pagination support
Config: Set AZURE_SUBSCRIPTION_ID and standard Azure credentials

GCP

Auth: Application Default Credentials (ADC)
Metrics: Cloud Monitoring ListTimeSeries with 5-minute lookback window
Resources: Cloud Asset SearchAllResources for inventory
Config: Set GCP_PROJECT_ID or GOOGLE_CLOUD_PROJECT

GitHub Actions (CI/CD Collector)

The CI/CD collector now polls the GitHub REST API for workflow runs across configured repositories:

collectors:
  cicd:
    github_token: "ghp_..."
    repo_filters:
      - "your-org/your-repo"
      - "your-org/another-repo"
    poll_interval: 60s

It tracks run status changes, emits events for new runs and state transitions, and maps GitHub conclusions to severity levels:

Conclusion	Severity
`failure`, `timed_out`	High
`cancelled`, `action_required`	Medium
`success`, other	Info

Kafka Throughput

The Kafka collector now dials brokers directly, reads partition offsets, and computes real messages/second throughput per topic. No more mock data.

5. WebSocket Real-Time Streaming

The dashboard no longer polls the REST API for updates. Events flow over WebSocket.

Backend

The Go server manages a WebSocket hub with a broadcaster that pushes engine status to all connected clients every 2 seconds:

Client connects → wsHub.add(conn)
                     │
Broadcaster (2s)  ───┤──→ JSON to all clients
                     │
Client disconnects → wsHub.remove(conn)

Frontend

Two new React hooks power the live experience:

useWebSocket — generic hook with auto-reconnect and exponential backoff (1s to 30s)
useEventStream — wraps WebSocket for the events endpoint, maintains a 100-event rolling buffer, provides memoized filtered views for incidents, anomalies, and all events

The Overview page shows a green pulsing Live indicator when WebSocket is connected. When disconnected, it falls back to mock data gracefully.

6. Runbooks Dashboard Page

New /runbooks route in the dashboard:

Summary cards at the top:

Total runbooks
Auto-execute count
Executions in last 24 hours
Average success rate

Each runbook card shows:

Name and description
Auto/manual execution badge (green dot for auto, gray for manual)
Incident type tags
Step count, last run time, success rate
Recent run indicators — green and red dots for the last 3 executions

Same dark theme as the rest of the dashboard. Built with the same Tailwind patterns.

7. OpenTelemetry & Observability

Kronveil exports traces and metrics via OpenTelemetry, fitting into your existing observability stack:

Endpoint	Protocol	Purpose
`:4317`	gRPC	OTLP traces and metrics
`:4318`	HTTP	OTLP traces and metrics
`:8889`	HTTP	Prometheus metrics export
`:13133`	HTTP	OTel collector health check
`:55679`	HTTP	zPages debugging

Point your Jaeger, Tempo, or Datadog backend at these endpoints, or configure Prometheus to scrape Kronveil's metrics.

Full Architecture (v0.3)

┌───────────────────────────────────────────────────────┐
│                  Dashboard (React)                     │
│          WebSocket <── REST API (Go) ──>               │
├───────────────────────────────────────────────────────┤
│                     Engine Core                        │
│  ┌────────────┐  ┌──────────┐  ┌────────────────┐    │
│  │ Federation  │  │ Runbook  │  │ AI Intelligence│    │
│  │  Manager    │  │ Executor │  │ (AWS Bedrock)  │    │
│  └────────────┘  └──────────┘  └────────────────┘    │
├───────────────────────────────────────────────────────┤
│                     Collectors                         │
│  ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐ ┌───┐ ┌────────┐ │
│  │ K8s │ │Kafka│ │CI/CD │ │Azure│ │GCP│ │Custom  │ │
│  │     │ │     │ │GitHub│ │     │ │   │ │ (SDK)  │ │
│  └─────┘ └─────┘ └──────┘ └─────┘ └───┘ └────────┘ │
├───────────────────────────────────────────────────────┤
│               Integrations & Export                    │
│  Slack · PagerDuty · Vault · Prometheus · OTel        │
└───────────────────────────────────────────────────────┘

Local Deployment

Everything runs with Docker Compose — same as v0.2:

git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker compose up --build

Service	URL	Description
Dashboard	`http://localhost:3000`	React UI with live WebSocket
API	`http://localhost:8080/api/v1/`	REST endpoints
Health	`http://localhost:8080/api/v1/health`	Agent health check
WebSocket	`ws://localhost:8080/api/v1/ws/events`	Real-time event stream
Prometheus	`http://localhost:8889/metrics`	Metrics export
OTel gRPC	`localhost:4317`	OTLP ingest

Verify it's running

curl http://localhost:8080/api/v1/health | jq .

{
  "status": "healthy",
  "components": [
    {"name": "kubernetes", "status": "healthy"},
    {"name": "kafka", "status": "healthy"},
    {"name": "cicd-collector", "status": "healthy"},
    {"name": "cloud-aws", "status": "healthy"}
  ],
  "uptime": "2m30s"
}

Production Deployment

Kronveil ships with a Helm chart for Kubernetes deployment. For AWS EKS with Rancher:

# Build and push to ECR
docker build -f deploy/Dockerfile.agent -t <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent:v0.3 .
docker push <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent:v0.3

# Deploy with Helm
helm install kronveil helm/kronveil/ \
  -n kronveil --create-namespace \
  -f values-prod.yaml

The Helm chart includes deployment, RBAC (ClusterRole + Role), ServiceAccount with IRSA annotation, NetworkPolicy, and Prometheus scrape annotations. A full production deployment guide covering ECR, MSK, IRSA, ALB Ingress, and TLS is available in the repository.

CI Pipeline

All 7 CI jobs passing:

Job	What It Checks
Lint	golangci-lint v2 (errcheck, staticcheck, govet)
Security	govulncheck for known vulnerabilities
Test	`go test ./... -race` with coverage threshold
Build	CGO_ENABLED=0 static binary
Docker Build & Scan	Trivy scan for CRITICAL/HIGH CVEs
Dashboard	npm lint + build
Helm Lint	Chart validation

What's Next (v0.4)

Live runbook execution — move from dry-run to real kubectl and script execution with approval gates
Collector marketplace — share and install community-built collectors via the SDK
Cross-cluster incident correlation — AI-powered correlation across federated clusters
Dashboard runbook triggers — execute runbooks directly from the UI
Grafana plugin — embed Kronveil panels in existing Grafana dashboards

Kronveil v0.2: From Stubs to a Fully Running AI Infrastructure Agent with Dashboard, OTel, and Auto-Remediation

Ramasankar Molleti — Mon, 16 Mar 2026 05:37:07 +0000

Quick Recap

A week ago, I launched Kronveil - an AI-powered infrastructure observability agent that detects anomalies, performs root cause analysis, and auto-remediates incidents in milliseconds. The response was incredible.

But that first version had a lot of stubs. The roadmap listed features like "Dashboard UI", "Prometheus metrics", and "multi-cloud secret management" as coming soon.

They're here now.

This post covers every new feature shipped in v0.2, a step-by-step guide to run Kronveil locally with Docker Compose, and live screenshots from the running dashboard.

What's New in v0.2

1. Full Dashboard UI (React + TypeScript)

The biggest visible change. Kronveil now ships with a production-ready dashboard built with React 18, TypeScript, Tailwind CSS, and Recharts.

Six pages, zero fluff:

Page	What It Shows
Overview	Real-time event throughput, active incidents, MTTR, anomaly count (24h), cluster health matrix
Incidents	Filterable list (active/acknowledged/resolved), timeline view, root cause display, one-click acknowledge/resolve
Anomalies	Detected anomalies with scores, signal source, severity, historical comparison
Collectors	Health status per collector, event emission rates, degradation indicators
Policies	OPA policy listing, enable/disable toggles, violation history, Rego rule display
Settings	Collector config, integration credentials, anomaly sensitivity, remediation toggles

The dashboard runs as a separate container behind nginx, which reverse-proxies /api/ requests to the agent. No CORS headaches.

2. gRPC API with TLS/mTLS

The REST API was always there. Now there's a full gRPC API on port 9091 with four services:

StreamEvents - Server-side streaming of real-time telemetry events with source and severity filtering
GetIncident / ListIncidents - Incident queries with status filtering
GetHealth - Component-level health reporting

Built with reflection support, so you can debug with grpcurl out of the box. TLS and mutual TLS are configurable - just point it at your cert/key files.

3. Secret Management: Vault + AWS Secrets Manager

Two new integrations for secret lifecycle management:

HashiCorp Vault:

Kubernetes auth method
TLS certificate lifecycle tracking
Secret caching for performance

AWS Secrets Manager:

Prefix-based secret organization (kronveil/ default)
Rotation monitoring with configurable windows (default 30 days)
Secret expiration tracking
Built-in caching layer

Both use the graceful degradation pattern - if credentials aren't configured, the agent logs a warning and continues running without them.

4. Three New Collectors

The original had Kubernetes and Kafka. Now there are five:

Cloud Collector (AWS/Azure/GCP):

CloudWatch metrics for EC2, RDS, ELB, Lambda, S3
Multi-region support with resource enumeration
Cost tracking per resource

CI/CD Collector (GitHub Actions):

Webhook-based pipeline monitoring
Job and step-level tracking with duration metrics
Repository filtering with webhook secret validation

Logs Collector:

File tailing with structured log parsing
JSON, logfmt, and raw text format support
Configurable error pattern matching (error, fatal, panic, OOM, killed)

5. Capacity Planner

New intelligence module that goes beyond anomaly detection:

Linear regression-based forecasting (default 30-day horizon)
Right-sizing recommendations: scale_up, scale_down, right_size, optimize
Days-to-capacity projection
Cost savings calculations with confidence intervals
Historical data retention (90 days default)

6. Policy Engine (OPA/Rego)

Compliance and governance built into the agent:

Open Policy Agent integration with Rego language
Default policies pre-loaded (compliance, security)
Resource evaluation against all enabled policies
Policy violation tracking with evaluation metrics

7. Prometheus Metrics Export

Kronveil now exposes a full Prometheus scrape endpoint on port 9090:

Standard Go runtime metrics (goroutines, memory, GC)
Custom Kronveil metrics: event counts per source, collector errors, policy evaluations, processing latency
Ready-to-use with Grafana dashboards

8. OpenTelemetry (OTel) Integration

Full OpenTelemetry support for distributed tracing:

gRPC exporter to any OTLP-compatible endpoint (Jaeger, Tempo, Datadog, etc.)
Configurable export intervals (default 30s)
Span and trace propagation across the agent pipeline
Insecure mode for local development, TLS for production
Default endpoint: localhost:4317

This means you can plug Kronveil into your existing OTel collector pipeline and see traces from anomaly detection through incident creation to remediation execution - all in one trace.

9. PagerDuty Integration

Full Events API v2 support:

Incident triggering, acknowledgment, resolution
Deduplication keys for idempotent alerts
Severity mapping (critical, high, warning, info)
Links back to Kronveil dashboard

10. Audit Logging

Security-grade audit trail:

Event types: auth, incident, remediation, policy_change, config_change, secret_access, api_call
In-memory buffer with file sink
Structured JSON output via slog

11. Helm Chart for Kubernetes

Production-ready Helm chart with security hardened defaults:

Non-root containers (UID 1000)
Read-only root filesystem
Seccomp: RuntimeDefault
NetworkPolicy for ingress/egress
RBAC: ClusterRole with minimal permissions (pods, nodes, events, deployments)
Prometheus scrape annotations built-in
Liveness and readiness probes

helm install kronveil helm/kronveil/ \
  --namespace kronveil \
  --create-namespace \
  --set agent.bedrock.region=us-east-1

Upgraded Stack

Component	v0.1	v0.2
Go	1.21	1.25
golangci-lint	v1	v2
Alpine	3.21	3.23
Dashboard	Planned	React 18 + Tailwind
API	REST only	REST + gRPC + mTLS
Secrets	None	Vault + AWS SM
Metrics Export	None	Prometheus + OTel
Tracing	None	OpenTelemetry (OTLP)
Alerting	Slack	Slack + PagerDuty
Deployment	Manual	Docker Compose + Helm
CI	Basic	Full pipeline (lint, test, security, build, Docker scan)

Run Kronveil Locally (5 Minutes)

Here's the full local deployment walkthrough with live screenshots.

Prerequisites

Docker Desktop installed and running
Git
~2GB free RAM (Kafka needs memory)

Step 1: Clone and Build

git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker-compose -f deploy/docker-compose.yaml up --build -d

This builds two images and starts four containers:

Container	Port	Purpose
agent	8080	Kronveil REST API + gRPC
dashboard	3000	Web UI (nginx + React SPA)
kafka	9092	Event bus
zookeeper	2181	Kafka coordinator

Step 2: Verify Everything Is Running

docker-compose -f deploy/docker-compose.yaml ps

All four containers should show Up (healthy):

NAME                 STATUS                         PORTS
deploy-agent-1       Up About a minute (healthy)    127.0.0.1:8080->8080/tcp
deploy-dashboard-1   Up About a minute (healthy)    127.0.0.1:3000->8080/tcp
deploy-kafka-1       Up About a minute (healthy)    127.0.0.1:9092->9092/tcp
deploy-zookeeper-1   Up About a minute (healthy)    2181/tcp

Step 3: Access the Endpoints

Once deployed, you have three endpoints available:

Service	URL	Description
Dashboard	http://localhost:3000	Full web UI with all 6 pages
Agent API	http://localhost:8080/api/v1/health	REST API (health, incidents, anomalies)
Metrics	http://localhost:9090/metrics	Prometheus scrape endpoint

Step 4: Check Agent Health

curl http://localhost:8080/api/v1/health

{
  "data": {
    "status": "healthy"
  }
}

Step 5: Open the Dashboard

Open http://localhost:3000 in your browser.

Overview Page

The Overview page shows real-time infrastructure intelligence at a glance - 10.2M events/sec throughput, 2 active incidents, 23-second average MTTR, and 47 anomalies detected in the last 24 hours. The cluster health matrix shows three clusters across US, EU, and AP regions with live node and pod counts.

Incidents Page

AI-detected and auto-remediated incidents with filtering by status (all, active, acknowledged, resolved). Each incident shows the title, description, MTTR, and number of affected resources. Notice the resolved OOM incident with 23s MTTR - that's the auto-remediation in action.

Anomalies Page

ML-powered anomaly detection and prediction. The distribution chart shows detected vs. predicted anomalies over 24 hours. Each anomaly has a score (0-100%) - the Kafka consumer lag spike scored 94%, and the system predicted a pod OOM 15 minutes before it happened.

Collectors Page

Telemetry collection agents across your infrastructure. Five active collectors processing 10.2M events/sec across 487 targets with only 0.001% error rate. Kubernetes leads at 4.2M events/sec monitoring 3 clusters, 54 nodes, and 312 pods. Each collector shows real-time health status.

Scroll down to see all five collectors - Kubernetes, Apache Kafka, AWS CloudWatch, GitHub Actions (CI/CD), and the Logs collector. GitHub Actions shows a degraded status with 3 errors, which is expected when webhook endpoints aren't publicly accessible in a local deployment.

Step 6: Explore the API

Full system status:

curl http://localhost:8080/api/v1/status | python3 -m json.tool

List collectors and their health:

curl http://localhost:8080/api/v1/collectors | python3 -m json.tool

Inject a test event (single):

curl -X POST http://localhost:8080/api/v1/test/inject?mode=single

Inject a burst of events to trigger anomaly detection:

curl -X POST http://localhost:8080/api/v1/test/inject?mode=burst

After the burst injection, check for detected anomalies:

curl http://localhost:8080/api/v1/anomalies | python3 -m json.tool

And incidents that were auto-created:

curl http://localhost:8080/api/v1/incidents | python3 -m json.tool

Step 7: Prometheus Metrics

curl http://localhost:9090/metrics

You'll see standard Go metrics plus Kronveil-specific counters for events processed, collector errors, and policy evaluations. Wire this into your Grafana instance for dashboards.

Step 8: Tail the Logs

docker-compose -f deploy/docker-compose.yaml logs -f agent

Watch the agent detect anomalies, correlate incidents, and execute remediation in real-time.

Cleanup

docker-compose -f deploy/docker-compose.yaml down

Architecture Diagram (Updated)

                         +------------------+
                         |   Dashboard UI   |
                         |  (React + nginx) |
                         |   :3000          |
                         +--------+---------+
                                  |
                           /api/ proxy
                                  |
+------------------+    +---------v---------+    +------------------+
|   Collectors     |    |    Kronveil Agent  |    |  Integrations    |
|                  +--->+                    +--->+                  |
| - Kubernetes     |    |  REST API  :8080   |    | - Slack          |
| - Kafka          |    |  gRPC API  :9091   |    | - PagerDuty      |
| - Cloud (AWS)    |    |  Metrics   :9090   |    | - Prometheus     |
| - CI/CD          |    |                    |    | - OpenTelemetry  |
| - Logs           |    |  +==============+  |    | - AWS Bedrock    |
+------------------+    |  | Intelligence |  |    | - Vault          |
                        |  | - Anomaly    |  |    | - AWS Secrets    |
                        |  | - RootCause  |  |    +------------------+
                        |  | - Capacity   |  |
                        |  | - Incident   |  |         +----------+
                        |  +==============+  |    +--->| OTel     |
                        |                    +----+    | Collector|
                        |  +==============+  |         +----------+
                        |  | Policy (OPA) |  |
                        |  | Audit Log    |  |
                        |  +==============+  |
                        +---------+----------+
                                  |
                         +--------v---------+
                         |   Apache Kafka   |
                         |   :9092          |
                         +------------------+

CI Pipeline

Every push to main runs seven jobs:

Lint - golangci-lint v2 with staticcheck, errcheck, govet
Test - go test -race with 40% coverage threshold
Security Scan - govulncheck for Go stdlib/dependency CVEs
Build - Cross-compile with ldflags (version, commit, date)
Docker Build & Scan - Multi-stage build + Trivy vulnerability scan (CRITICAL/HIGH)
Dashboard - npm ci, ESLint, Vite production build
Helm Lint - Chart validation

All green before merge. No exceptions.

What's Next (v0.3 Roadmap)

Multi-cluster support - Federated monitoring across Kubernetes clusters
Custom collector SDK - Build your own collectors with a plugin interface
Runbook automation - Attach runbooks to incident types
Cost anomaly detection - Spot unexpected cloud spend spikes
Grafana dashboards - Pre-built dashboards for Kronveil Prometheus metrics
Mobile alerts - Push notifications via native apps

Try It

GitHub: github.com/kronveil/kronveil
License: Apache 2.0

git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker-compose -f deploy/docker-compose.yaml up --build -d
# Open http://localhost:3000

If you find it useful, star the repo. If you find a bug, open an issue. PRs welcome - especially for new collectors, dashboard improvements, and LLM prompt tuning.

Follow me for more updates on building production-grade infrastructure tooling with Go and AI.

I Built an AI-Powered Infrastructure Observability Agent from Scratch

Ramasankar Molleti — Sun, 08 Mar 2026 01:11:01 +0000

Kronveil watches your infrastructure, detects anomalies in real time, and auto-remediates incidents before you even wake up.

As platform engineers, we've all been there: 3 AM pages, scrambling through dashboards, correlating logs across 15 different tools, and trying to figure out why the system broke — not just what broke.

I built Kronveil to solve this. It's an open-source, AI-powered observability agent that combines deep telemetry collection, real-time anomaly detection, LLM-powered root cause analysis, and autonomous remediation — all in a single Go binary.

In this post, I'll walk you through the architecture, the intelligence pipeline, and show you real test results of the system detecting anomalies and auto-remediating incidents in milliseconds.

GitHub: github.com/kronveil/kronveil

The Problem

Modern infrastructure is complex. A typical production environment has:

Hundreds of Kubernetes pods scaling up and down
Apache Kafka clusters processing millions of events per second
Multi-cloud workloads across AWS, Azure, and GCP
CI/CD pipelines deploying dozens of times per day

Traditional monitoring tools tell you what happened. But by the time you get the alert, correlate the signals, and figure out the root cause — you've already burned 30 minutes of MTTR.

What if your observability platform could think?

Architecture Overview

Kronveil is designed as a layered system with four main tiers:

 LAYER 1: DATA COLLECTION
 ========================
 +----------+  +-------+  +-------+
 |Kubernetes|  | Kafka |  | Cloud |
 |Collector |  |Collect|  |Collect|
 +----+-----+  +---+---+  +---+---+
      |            |           |
 +----+-----+  +--+---+       |
 |CI/CD     |  | Logs |       |
 |Collector |  |Tailer|       |
 +----+-----+  +--+---+       |
      |            |           |
      v            v           v
 ================================
 LAYER 2: KAFKA EVENT BUS
 ================================
 telemetry.raw -> telemetry.enriched
 anomalies.detected -> incidents.new
 remediation.actions -> policy.audit
 (10M+ events/sec | 3x replication)
 ================================
            |
            v
 LAYER 3: INTELLIGENCE
 ========================
 +---------+ +----------+
 | Anomaly | | Root     |
 | Detect  | | Cause    |
 | Z-Score | | Analyzer |
 | EWMA    | | DFS+LLM  |
 +---------+ +----------+
       |          |
       v          v
 +---------------------+
 | INCIDENT RESPONDER  |
 | Detect -> Triage    |
 | -> Respond -> Resolve|
 +---------------------+
            |
            v
 LAYER 4: ACTION
 ========================
 +-------+ +------+ +------+
 | Slack | |Pager | |Prom  |
 | Alert | |Duty  | |Metric|
 +-------+ +------+ +------+

The diagram above shows the full platform. Let's break down each layer.

Layer 1: Data Collection

Five specialized collectors continuously gather telemetry from your infrastructure. Each collector is a Go interface implementation that runs in its own goroutine and pushes TelemetryEvent structs into the event bus.

+--------------+---------------------------+
| Collector    | What It Watches           |
+--------------+---------------------------+
| Kubernetes   | Pods, Nodes, Events, HPA  |
|              | Metrics API, Deployments  |
+--------------+---------------------------+
| Kafka        | Consumer lag, Topics      |
|              | Throughput, Partitions    |
+--------------+---------------------------+
| Cloud        | EC2, RDS, ELB, Lambda    |
|              | S3, CloudWatch metrics   |
+--------------+---------------------------+
| CI/CD        | GitHub Actions, Jenkins  |
|              | GitLab CI pipelines      |
+--------------+---------------------------+
| Logs         | File tailing, Syslog     |
|              | Structured log parsing   |
+--------------+---------------------------+

Each collector implements a simple interface:

type Collector interface {
    Name() string
    Start(ctx context.Context) error
    Stop() error
    Health() ComponentHealth
}

This means adding a new data source (e.g., Datadog, New Relic) is just implementing this interface — no changes to the core engine needed.

Layer 2: Apache Kafka Event Bus

All telemetry flows through a unified Kafka event bus. This decouples collectors from intelligence modules — they don't know about each other. The bus handles 10M+ events/sec with 3x replication.

KAFKA TOPICS (10 total):
========================

Telemetry Flow:
  telemetry.raw
    -> telemetry.enriched
      -> anomalies.detected

Incident Flow:
  incidents.new
    -> incidents.updated
      -> remediation.actions

Governance Flow:
  policy.violations
    -> policy.audit
      -> capacity.forecasts

Config: capacity.changes

Why Kafka? Three reasons:

Durability — events survive crashes, enabling replay and audit trails
Fan-out — multiple intelligence modules can consume the same event stream independently
Backpressure — if anomaly detection falls behind, events queue up instead of being dropped

Layer 3: Intelligence Engine

This is the brain of Kronveil. Three modules analyze telemetry in parallel, each specializing in a different aspect:

+----------------------------------+
|        ANOMALY DETECTOR          |
|                                  |
|  Input: telemetry.enriched      |
|                                  |
|  Algorithms:                    |
|  - Z-Score (deviation from mean)|
|  - EWMA (trend smoothing)      |
|  - Linear Trend (prediction)   |
|                                  |
|  Output: anomalies.detected    |
+----------------------------------+
          |
          v
+----------------------------------+
|     ROOT CAUSE ANALYZER          |
|                                  |
|  Input: anomalies.detected      |
|                                  |
|  Process:                       |
|  1. Build dependency graph      |
|  2. DFS traversal for causality |
|  3. Collect evidence            |
|  4. LLM analysis (AWS Bedrock) |
|                                  |
|  Output: root cause + fix       |
+----------------------------------+
          |
          v
+----------------------------------+
|     CAPACITY PLANNER             |
|                                  |
|  Input: telemetry.enriched      |
|                                  |
|  Algorithms:                    |
|  - Linear regression forecast   |
|  - Confidence intervals         |
|  - Resource right-sizing        |
|                                  |
|  Output: capacity.forecasts    |
+----------------------------------+

All three modules feed into the Incident Responder, which orchestrates the full incident lifecycle:

INCIDENT LIFECYCLE:
===================

  Anomaly    Root Cause    Capacity
  Detected   Found         Alert
     \          |           /
      v         v          v
  +-------------------------+
  |   INCIDENT RESPONDER    |
  |                         |
  |  1. Create Incident     |
  |  2. Score Severity      |
  |  3. Correlate Events    |
  |  4. Auto-Remediate      |
  |  5. Notify (Slack/PD)   |
  |  6. Track Resolution    |
  +-------------------------+
           |
           v
  +-------------------------+
  |   AUTO-REMEDIATION      |
  |                         |
  |  - scale_deployment     |
  |  - restart_pods         |
  |  - rollback_deploy      |
  |  - drain_node           |
  |  - failover_db          |
  |  - toggle_feature       |
  |                         |
  |  Safety:                |
  |  - Circuit breaker      |
  |  - Dry run mode         |
  |  - Human approval gate  |
  +-------------------------+

Layer 4: Action & Integrations

The final layer delivers results to humans and systems:

+----------+  +-----------+  +---------+
| AWS      |  | Slack     |  | Pager   |
| Bedrock  |  | Block Kit |  | Duty    |
| (LLM)   |  | Alerts    |  | Events  |
+----------+  +-----------+  +---------+

+----------+  +-----------+  +---------+
| REST API |  | gRPC API  |  | Prom    |
| :8080    |  | :9091     |  | :9090   |
+----------+  +-----------+  +---------+

REST API (:8080) — Dashboard, incident management, test injection
gRPC API (:9091) — High-performance inter-service communication
Prometheus (:9090) — Metrics export for Grafana dashboards
Slack — Real-time alerts with Block Kit rich formatting
PagerDuty — On-call escalation via Events API v2
AWS Bedrock — LLM backbone for root cause analysis

The Complete Event Flow

Here's how a single CPU spike travels through the entire system:

CPU spike on pod-xyz (95% usage)
  |
  v
[K8s Collector] picks up metric
  |
  v
[Kafka] telemetry.raw topic
  |
  v
[Anomaly Detector] Z-score = 5.8 sigma
  |
  v
[Kafka] anomalies.detected topic
  |
  v
[Incident Responder] creates INC-0001
  |
  +---> [Root Cause Analyzer]
  |       |
  |       v
  |     DFS on dependency graph
  |       |
  |       v
  |     AWS Bedrock LLM analysis
  |       |
  |       v
  |     "OOM in pod-xyz caused by
  |      memory leak in v2.3.1"
  |
  +---> [Auto-Remediation]
  |       |
  |       v
  |     scale_deployment (replicas: 5)
  |
  +---> [Slack] Alert with root cause
  +---> [PagerDuty] Page on-call
  +---> [Prometheus] Metric exported
  |
  v
INC-0001 resolved (MTTR: 1.7ms)

Deep Dive: The Intelligence Pipeline

This is where Kronveil gets interesting. Let me walk through how a single CPU spike turns into an auto-remediated incident.

Step 1: Anomaly Detection

Kronveil uses a combination of statistical methods:

Z-Score Analysis: Measures how many standard deviations a value is from the mean
EWMA: Smooths out noise to detect real trends
Linear Trend Prediction: Identifies directional trends to predict upcoming anomalies

The detector maintains a sliding time window for each signal and requires a minimum of 30 data points before it starts detecting. This prevents false positives during cold starts.

Sensitivity levels:

Level	Z-Score Threshold	Use Case
High	2.0 sigma	Critical systems, catch everything
Medium	3.0 sigma	Default, balanced
Low	4.0 sigma	Noisy environments, reduce alerts

Step 2: Incident Creation & Severity Scoring

When an anomaly is detected, it gets scored on a 0.0 to 1.0 scale:

Score >= 0.9  -->  CRITICAL  -->  Page On-Call
Score >= 0.7  -->  HIGH      -->  Slack Alert
Score >= 0.5  -->  MEDIUM    -->  Dashboard
Score <  0.5  -->  LOW       -->  Log Only

The incident responder also correlates events — grouping related anomalies within the same time window to avoid alert storms.

Step 3: Root Cause Analysis (LLM-Powered)

For high/critical incidents, Kronveil uses AWS Bedrock (Claude or Titan):

Build a dependency graph of affected services
Traverse the graph using DFS to find the causal chain
Collect evidence (metrics, logs, events)
Send to the LLM with a structured prompt
Receive root cause explanation and recommended fix

Step 4: Auto-Remediation

Supported actions:

Action	Description
`scale_deployment`	Scale up/down pods
`restart_pods`	Rolling restart
`rollback_deploy`	Revert to previous version
`drain_node`	Safely drain a problematic node
`failover_db`	Database failover
`toggle_feature`	Feature flag toggle

Safety is built in:

Circuit Breaker: Max 5 attempts per 10 minutes
Dry Run Mode: Test remediation without executing
Approval Required: Optional human-in-the-loop
Cooldown Period: Prevent remediation storms

Testing It Live

I deployed Kronveil on a local Kubernetes cluster using kind and tested the full pipeline.

Deployment

kind create cluster --name kronveil-test

docker build -f deploy/Dockerfile.agent -t kronveil:latest .
kind load docker-image kronveil:latest --name kronveil-test

helm install kronveil ./helm/kronveil \
  --namespace kronveil --create-namespace \
  --set image.repository=kronveil \
  --set image.tag=latest \
  --set image.pullPolicy=Never

Health Check

All 6 modules running and healthy:

{
  "data": {
    "status": "healthy",
    "components": [
      {"name": "kubernetes-collector", "status": "healthy"},
      {"name": "kafka-collector", "status": "healthy"},
      {"name": "anomaly-detector", "status": "healthy"},
      {"name": "incident-responder", "status": "healthy"},
      {"name": "root-cause-analyzer", "status": "healthy"},
      {"name": "capacity-planner", "status": "healthy"}
    ]
  }
}

Triggering Anomaly Detection

Kronveil includes a test injection endpoint. The burst mode sends 35 normal baseline events followed by a single spike — triggering the full pipeline.

curl -s -X POST \
  "http://localhost:8080/api/v1/test/inject?mode=burst" \
  -H "Content-Type: application/json" \
  -d '{"source":"production-api","signal":"cpu_usage"}'

Result:

{
  "data": {
    "status": "burst_complete",
    "events_injected": 36,
    "anomalies_found": 1,
    "incidents_created": 1,
    "anomalies": [{
      "signal": "production-api.cpu_usage",
      "score": 0.97,
      "severity": "critical",
      "description": "value 200.00 deviates 5.8 sigma from mean"
    }],
    "incidents": [{
      "id": "INC-0001",
      "severity": "critical",
      "status": "resolved",
      "timeline": [
        {"action": "created", "actor": "system"},
        {"action": "remediation_started", "actor": "ai"},
        {"action": "resolved", "details": "MTTR: 1.7ms"}
      ]
    }]
  }
}

What happened in those 1.7 milliseconds:

35 baseline events established a normal CPU usage pattern (~50%)
One spike event hit 200% — deviating 5.8 sigma from the mean
Anomaly detector flagged it as critical (score: 0.97/1.0)
Incident responder created INC-0001
Auto-remediation kicked in with scale_deployment
Incident resolved — MTTR: 1.7ms

Testing Multiple Signal Sources

I then simulated a Kafka consumer lag spike — a second anomaly detected at 5.8 sigma, triggering INC-0002 with auto-remediation. Both incidents are independently tracked with full audit trails.

Tech Stack

Component	Technology
Agent	Go 1.21 (single binary, ~10MB)
Event Bus	Apache Kafka (10 topics, 3x replication)
AI/LLM	AWS Bedrock (Claude, Titan)
Anomaly Detection	Z-Score, EWMA, Linear Regression
Policy Engine	OPA (Rego rules)
Secret Management	AWS Secrets Manager + HashiCorp Vault
Deployment	Kubernetes + Helm
Dashboard	React + TypeScript
API	REST + gRPC + Prometheus

76 files. One binary. Zero external Go dependencies.

What's Next

Real Kubernetes client-go integration — watch actual pods, nodes, and events
Kafka consumer group monitoring — connect to real brokers
Multi-cloud secret management — Azure Key Vault and GCP Secret Manager support (currently AWS-focused)
Dashboard UI — React dashboard for visualizing anomalies and incidents
Prometheus metrics export — anomaly scores, incident counts, MTTR
Webhook integrations — Slack and PagerDuty notifications
Multi-cluster support — monitor multiple clusters from a single agent

Get Involved

Kronveil is open source under the Apache 2.0 license.

GitHub: github.com/kronveil/kronveil
Star the repo if you find this useful
Contributions welcome — especially around new collector integrations, LLM prompt engineering, and dashboard widgets

Developed by Ramasankar Molleti

GitOps-Driven Multi-Cluster Kubernetes Management: A Deep Dive into Modern Infrastructure

Ramasankar Molleti — Fri, 24 Jan 2025 05:11:13 +0000

Introduction

As organizations scale their container deployments, managing multiple Kubernetes clusters across different environments and regions has become increasingly complex. This article explores modern approaches to multi-cluster management using GitOps principles, focusing on real-world implementation strategies and emerging best practices in 2025.

The Evolution of Cluster Management

Traditional management of Kubernetes mostly relies directly on access to clusters and manual interference. Modern landscape requires more sophisticated methods and approaches, among which GitOps emerged as a de facto standard to manage declarative infrastructure that ensures consistency, reliability, and audit capabilities not available with traditional methods.

Key Components of Modern Kubernetes Architecture

Cluster Blueprints Modern Kubernetes deployments utilize cluster blueprints - templated configurations that define the entire cluster state, including:

Node pool configurations
Security policies
Network policies
Service mesh setup
Monitoring and logging infrastructure

GitOps Control Plane

The GitOps control plane consists of several critical components:

apiVersion: gitops.example.com/v1
kind: ClusterTemplate
metadata:
  name: production-blueprint
spec:
  version: 1.28.0
  networking:
    cni: cilium
    serviceType: internal
  security:
    policyEngine: OPA
    imageScanning: true
  observability:
    prometheus: true
    opentelemetry: true

Advanced Multi-Cluster Patterns

Fleet Management

Modern fleet management introduces the concept of cluster sets:

apiVersion: fleet.example.com/v1
kind: ClusterSet
metadata:
  name: production-fleet
spec:
  regions:
    - name: us-east
      clusters: 3
      template: production-blueprint
    - name: eu-west
      clusters: 2
      template: production-blueprint
  loadBalancing:
    mode: global
    algorithm: weighted-least-request

Implementing Zero-Trust Security

1. Certificate Management

Modern Kubernetes deployments require sophisticated certificate management:

type CertificateRotation struct {
    Interval    time.Duration
    Algorithm   string
    KeySize     int
    CommonName  string
    SANs       []string
}

func (c *CertificateRotation) Setup() error {
    // Implementation for automated certificate rotation
    return nil
}

2. Network Policy Enforcement

Example of a zero-trust network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: zero-trust-policy
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          security-zone: trusted
    ports:
    - protocol: TCP
      port: 443

Advanced Observability

1. Distributed Tracing

Implementation of OpenTelemetry-based tracing:

func setupTracing(ctx context.Context) (*trace.TracerProvider, error) {
    exporter, err := otlptrace.New(ctx,
        otlptrace.WithInsecure(),
        otlptrace.WithEndpoint("otel-collector:4317"),
    )
    if err != nil {
        return nil, err
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(
            resource.NewWithAttributes(
                semconv.SchemaURL,
                semconv.ServiceNameKey.String("cluster-manager"),
            ),
        ),
    )
    return tp, nil
}

2. Metrics Aggregation

Example of custom metrics collection:

type ClusterMetrics struct {
    NodeUtilization    float64
    PodDensity         float64
    NetworkLatency     map[string]float64
    ResourceQoS        map[string]int
}

func (cm *ClusterMetrics) Collect() error {
    // Implementation for metrics collection
    return nil
}

Disaster Recovery and Business Continuity

1. Cross-Cluster Backup Strategy

Implementation of automated backup procedures:

type BackupStrategy struct {
    Interval    time.Duration
    Retention   time.Duration
    Encryption  bool
    Location    string
}

func (b *BackupStrategy) Execute() error {
    // Implementation for backup execution
    return nil
}

2. Recovery Time Objectives

Example of recovery automation:

func automateRecovery(cluster *Cluster) error {
    // Step 1: Validate backup integrity
    if err := validateBackup(cluster.LastBackup); err != nil {
        return err
    }

    // Step 2: Restore core components
    if err := restoreCoreComponents(cluster); err != nil {
        return err
    }

    // Step 3: Verify cluster health
    return verifyClusterHealth(cluster)
}

Cost Optimization Strategies

1. Resource Right-Sizing

Example of automated resource optimization:

type ResourceOptimizer struct {
    Thresholds map[string]float64
    History    []ResourceMetrics
    Predictions []ResourcePrediction
}

func (ro *ResourceOptimizer) Optimize() (*ResourceRecommendation, error) {
    // Implementation for resource optimization
    return nil, nil
}

End-to-End Implementation Example

Project Structure

├── clusters/
│   ├── production/
│   │   ├── cluster-config.yaml
│   │   ├── network-policies/
│   │   └── workloads/
│   └── staging/
├── platform/
│   ├── monitoring/
│   ├── security/
│   └── service-mesh/
└── tools/
    └── cluster-setup/

1. Cluster Bootstrap

# Initialize infrastructure
terraform init
terraform apply -var-file=prod.tfvars

# Bootstrap cluster
./tools/cluster-setup/bootstrap.sh \
  --cluster-name=prod-east \
  --region=us-east-1 \
  --nodes=3

2. Base Platform Configuration

# platform/base/platform.yaml
apiVersion: platform.example.com/v1
kind: PlatformConfig
metadata:
  name: base-platform
spec:
  serviceMesh:
    enabled: true
    type: istio
    version: 1.20.0
    config:
      mtls: strict
      autoInject: true

  monitoring:
    prometheus:
      retention: 15d
      resources:
        requests:
          cpu: 1000m
          memory: 4Gi
    grafana:
      enabled: true
      dashboards:
        - cluster-health
        - application-metrics

  security:
    networkPolicies:
      defaultDeny: true
    podSecurityPolicies:
      enforcePrivileged: false

3. Application Deployment

# workloads/web-application/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      securityContext:
        runAsNonRoot: true
      containers:
      - name: web-app
        image: example/web-app:v1.2.3
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - ALL

4. Service Mesh Configuration

# platform/service-mesh/virtual-service.yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
  namespace: production
spec:
  hosts:
  - web-app.example.com
  gateways:
  - production-gateway
  http:
  - match:
    - uri:
        prefix: /api
    route:
    - destination:
        host: web-app
        port:
          number: 8080
    retries:
      attempts: 3
      perTryTimeout: 2s

5. Monitoring Setup

# platform/monitoring/service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-app
  namespace: production
spec:
  selector:
    matchLabels:
      app: web-app
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics

6. Pipeline Configuration

# .github/workflows/deploy.yml
name: Deploy Application
on:
  push:
    branches: [main]
    paths:
      - 'workloads/**'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Setup Kubernetes Tools
        uses: azure/setup-kubectl@v1

      - name: Deploy to Kubernetes
        run: |
          kubectl apply -k workloads/web-application/
          kubectl rollout status deployment/web-app -n production

7. Testing and Verification

# Verify deployment
kubectl get pods -n production
kubectl get virtualservice -n production
kubectl get servicemonitor -n production

# Test connectivity
curl -H "Host: web-app.example.com" \
     https://production-gateway.example.com/api/health

# Check metrics
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring
# Visit http://localhost:9090 in browser

This end-to-end example demonstrates:

Infrastructure as Code setup
Platform configuration
Application deployment
Service mesh integration
Monitoring configuration
CI/CD pipeline
Verification steps

When implementing this example:

Replace placeholder values (domains, image names)
Adjust resource requests/limits based on needs
Customize monitoring parameters
Update security policies per requirements
Configure backup/DR settings

Conclusion

As Kubernetes is evolving, the focus is shifting from mere cluster management to sophisticated orchestration of multiple clusters at scale. Integration of GitOps principles, zero-trust security, and advanced observability forms a strong foundation for modern cloud-native applications.
With this comprehensive approach to managing Kubernetes, an organization is able to assure consistency, security, and reliability throughout the entire container infrastructure while preparing for future scaling challenges.

About the Author
Results-driven Principal Cloud Architect with extensive experience in designing and deploying complex Kubernetes deployments for a variety of industries.

Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn: https://www.linkedin.com/in/ramasankar-molleti-23b13218?trk=nav_responsive_tab_profile

Book 1:1 (http://topmate.io/ramasankar_molleti/?utm_source=topmate&utm_medium=popup&utm_campaign=Page_Ready)

DEV Community: Ramasankar Molleti

Anthropic Just Killed the API Key: A Deep Dive into Workload Identity Federation for Claude

Why this matters (and why I'm writing a sequel)

What Anthropic actually built

1. Service Account (svac_...)

2. Federation Issuer (fdis_...)

3. Federation Rule (fdrl_...)

The exchange flow

Token lifetime — the smart part

SPIFFE on Anthropic — the cleanest path

SPIRE side

Anthropic side

Kubernetes deployment — the volume detail nobody mentions

Validation before wiring up the SDK

Three SPIFFE gotchas

Mapping onto the IETF AIMS stack

The trap: this is workload auth, NOT user delegation

The layered model that actually works

The migration trap that will bite you

What this means for platform architecture

1. The threat model shifts from "key custody" to "issuer trust"

2. Audit and attribution become per-workload by default

3. The gateway's job gets more important, not less

Closing thought

References

Kronveil v0.3: Multi-Cluster Federation, Custom Collector SDK, and Automated Runbooks

From Single Cluster to Multi-Cluster Production

What's New in v0.3

1. Multi-Cluster Federation

How It Works

Configuration

2. Custom Collector SDK

Full Example: HTTP Health Checker

What the Adapter Handles For You

3. Automated Runbook Engine

How Execution Works

4. Real Cloud Provider Integrations

Azure

GCP

GitHub Actions (CI/CD Collector)

Kafka Throughput

5. WebSocket Real-Time Streaming

Backend

Frontend

6. Runbooks Dashboard Page

7. OpenTelemetry & Observability

Full Architecture (v0.3)

Local Deployment

Verify it's running

Production Deployment

CI Pipeline

What's Next (v0.4)

Links

Kronveil v0.2: From Stubs to a Fully Running AI Infrastructure Agent with Dashboard, OTel, and Auto-Remediation

Quick Recap

What's New in v0.2

1. Full Dashboard UI (React + TypeScript)

2. gRPC API with TLS/mTLS

3. Secret Management: Vault + AWS Secrets Manager

4. Three New Collectors

5. Capacity Planner

6. Policy Engine (OPA/Rego)

7. Prometheus Metrics Export

8. OpenTelemetry (OTel) Integration

9. PagerDuty Integration

10. Audit Logging

11. Helm Chart for Kubernetes

Upgraded Stack

Run Kronveil Locally (5 Minutes)

Prerequisites

Step 1: Clone and Build

Step 2: Verify Everything Is Running

Step 3: Access the Endpoints

Step 4: Check Agent Health

Step 5: Open the Dashboard

Overview Page

Incidents Page

Anomalies Page

Collectors Page

Step 6: Explore the API

1. Service Account (`svac_...`)

2. Federation Issuer (`fdis_...`)

3. Federation Rule (`fdrl_...`)