Running Bifrost in Cluster Mode: The Path to Enterprise AI Deployments

Bifrost cluster mode distributes gateway traffic across multiple nodes with synchronized state, automatic node failover, and zero-downtime rolling updates.

If you're running an AI gateway in production, you've probably wondered what happens when a single instance fails. Cluster mode is Bifrost's answer: multiple gateway nodes that act as a single system, sharing rate limits, budgets, and governance rules in real time. Bifrost, the open-source AI gateway built in Go by Maxim AI, supports cluster mode to give teams the high-availability infrastructure that production AI workloads demand. Let's walk through what it is, why you'd use it, how it actually works under the hood, and how to deploy it on Kubernetes with Helm.

Understanding Cluster Mode

Cluster mode runs multiple gateway instances as an interconnected system. In the default setup (cluster mode disabled), each instance is isolated. It talks to a shared database, but handles its own rate limits and governance state. With cluster mode on, every node synchronizes those counters and policies with every other node, creating a single logical gateway spread across many machines.

Here's the practical difference: without clustering, each node enforces a rate limit independently. So if you set a limit of 1,000 requests per minute, each node gets its own 1,000-request quota, and a three-node fleet can admit up to 3,000 requests per minute. With clustering, the fleet shares a single 1,000 requests-per-minute quota. One node uses 600, another uses 300, and only 100 requests remain for the whole fleet in that minute.
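To make the arithmetic concrete, here is a minimal Python sketch of the two behaviors. It is an illustration only, not Bifrost's implementation: the Quota class and admitted function are hypothetical names, the third node's 400 requests are made up for the example, and the shared counter is modeled as one in-memory object rather than state that is synchronized between nodes.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """A one-minute request counter with a hard limit (hypothetical model)."""
    limit: int
    used: int = 0

    def try_acquire(self) -> bool:
        # Admit the request only if it still fits in the remaining quota.
        if self.used + 1 > self.limit:
            return False
        self.used += 1
        return True


def admitted(node_quotas: list[Quota], traffic: list[int]) -> int:
    """Send traffic[i] requests to node i and count how many are admitted."""
    total = 0
    for quota, requests in zip(node_quotas, traffic):
        for _ in range(requests):
            if quota.try_acquire():
                total += 1
    return total


LIMIT = 1_000
TRAFFIC = [600, 300, 400]  # requests hitting three nodes in the same minute

# Cluster mode off: every node owns a private counter.
isolated = [Quota(LIMIT) for _ in TRAFFIC]

# Cluster mode on: every node refers to the same logical counter.
fleet_counter = Quota(LIMIT)
shared = [fleet_counter] * len(TRAFFIC)
```

Calling admitted(isolated, TRAFFIC) lets all 1,300 requests through, because no single node reaches its private limit. Calling admitted(shared, TRAFFIC) stops at 1,000: the third node only gets the 100 requests left over after the first two. A real cluster converges over seconds, so brief overshoot at the boundary is plausible; this sketch ignores that.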

Gossip-based state synchronization lets every node converge on the same view of the cluster within seconds. Cluster mode requires PostgreSQL, because SQLite doesn't support multi-node coordination. The feature set is part of Bifrost Enterprise, though the open-source image accepts cluster config values.

Why Teams Deploy Clustering

Single nodes fail. Cloud environments are noisy. Rate limits need to be accurate across the whole system. That's why enterprises move to clustered gateways.

Here's what clustering solves:

No single point of failure: When a node goes down (hardware fault, pod eviction, zone disruption), traffic automatically shifts to healthy instances. The gateway keeps serving requests.
Shared rate limits and budgets: A 1,000 requests-per-minute limit holds across the whole fleet, not per node. Budget counters stay consistent the same way.
Unified governance: Virtual keys, routing rules, providers, and access policies sync to every node within seconds. Configuration drift disappears.
Handle spikes gracefully: Horizontal scaling adds replicas, distributing load instead of crushing a single instance.
Zero-downtime updates: Roll out new code one pod at a time while the rest keep serving traffic.
Multi-region awareness: Tag nodes with region labels so requests route to the closest instance.

For teams building capacity plans and high-availability strategies, the enterprise deployment guide covers cluster sizing and HA topologies in detail.

How Clustering Actually Works

Bifrost cluster mode is fully peer-to-peer. There's no primary node, no external coordinator, no special "cluster leader node" in the request path. Every node is equal. Each one discovers peers automatically, monitors whether every other node is alive, and gets state updates as they happen.

Clustering uses two separate communication channels with different jobs. Membership signals (which nodes are in the cluster, are they alive?) run over a memberlist gossip layer on port 10101 (TCP and UDP). This uses the SWIM protocol, which is designed for scalable, eventually consistent group membership. Everything else, including usage counters, config changes, routing rules, virtual keys, RBAC, and MCP tool definitions, travels over a dedicated gRPC channel on port 10102 (TCP). Splitting these two keeps the membership system isolated from application traffic, and each can be tuned independently.

Bifrost replicates 30+ entity types across the cluster, including the model catalog, providers, governance counters, routing rules, RBAC, MCP tools, pricing, and prompt deployments. Each message carries a unique ID and timestamp. Nodes keep a short-lived deduplication cache, so a message that arrives twice is processed only once. All nodes converge to the same state within seconds (eventual consistency).
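The deduplication idea can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not Bifrost's code: the DedupCache and apply_update names, the message dictionary shape, and the 30-second TTL are all hypothetical.

```python
import time
from typing import Callable


class DedupCache:
    """Remembers recently seen message IDs for a limited time."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds          # assumed value, for illustration
        self._clock = clock              # injectable so tests can fake time
        self._seen = {}                  # message ID -> time first seen

    def first_time(self, message_id: str) -> bool:
        """Return True if the message is new, False if it is a duplicate."""
        now = self._clock()
        # Drop entries older than the TTL so the cache stays short-lived.
        self._seen = {mid: seen_at for mid, seen_at in self._seen.items()
                      if now - seen_at < self._ttl}
        if message_id in self._seen:
            return False
        self._seen[message_id] = now
        return True


def apply_update(cache: DedupCache, state: dict, message: dict) -> bool:
    """Apply a replicated update once; ignore repeat deliveries."""
    if not cache.first_time(message["id"]):
        return False
    state[message["key"]] = message["value"]
    return True
```

Delivering the same message twice changes the state only once. The TTL only needs to outlast the window in which a repeat delivery can realistically arrive, which keeps the cache small.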

Leader election is automatic. One cluster-wide leader, one leader per region. The rule is simple: lexicographically first healthy member wins. The cluster re-evaluates every 30 seconds, so leadership transfers automatically when nodes join, leave, or fail. The leader handles singleton tasks (like fetching upstream pricing once instead of every node doing it), then broadcasts the result to the rest.
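The election rule is simple enough to write down directly. The sketch below assumes each member is identified by a string ID and carries a region tag and a health flag; the Member type, the elect function, and the node names are hypothetical, and the 30-second re-evaluation loop is left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Member:
    node_id: str
    region: str
    healthy: bool


def elect(members: Iterable[Member], region: Optional[str] = None) -> Optional[str]:
    """Pick the lexicographically first healthy member.

    With region=None this yields the cluster-wide leader; with a region
    name it yields that region's leader. Returns None if nobody qualifies.
    """
    candidates = [
        m.node_id
        for m in members
        if m.healthy and (region is None or m.region == region)
    ]
    # min() on strings compares lexicographically.
    return min(candidates) if candidates else None


members = [
    Member("bifrost-2", "us-east-1", healthy=True),
    Member("bifrost-0", "us-east-1", healthy=False),  # failed node is skipped
    Member("bifrost-1", "eu-west-1", healthy=True),
]
```

Here elect(members) returns "bifrost-1" and elect(members, "us-east-1") returns "bifrost-2". Because every node applies the same deterministic rule to the same membership view, they agree on the leader without a voting round once that view has converged.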

Deploying on Kubernetes with Helm

The easiest deployment path is Kubernetes + Helm with native pod discovery. The default mesh clustering model means nodes reach each other directly over gossip and gRPC. Run at least three nodes, which survives one failure. Five nodes survive two failures, and seven or more add further redundancy.
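The sizing numbers in this article (three nodes for one failure, five for two, seven for three) match the familiar majority pattern, where a fleet of n nodes tolerates floor((n - 1) / 2) failures. That formula is an inference from the quoted numbers, not a documented Bifrost rule, and tolerated_failures is a hypothetical helper:

```python
def tolerated_failures(nodes: int) -> int:
    """Failures a fleet can absorb under the majority pattern (assumed)."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return (nodes - 1) // 2


# Reproduces the sizes quoted in this article: 3 -> 1, 5 -> 2, 7 -> 3.
sizing = {n: tolerated_failures(n) for n in (3, 5, 7)}
```

Under this pattern an even node count buys nothing extra: four nodes tolerate one failure, the same as three.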

A minimal cluster Helm config looks like:

# cluster-values.yaml
replicaCount: 3

storage:
  mode: postgres

postgresql:
  external:
    enabled: true
    host: "your-postgres-host.example.com"
    port: 5432
    user: bifrost
    database: bifrost
    sslMode: require
    existingSecret: "postgres-credentials"
    passwordKey: "password"

bifrost:
  encryptionKeySecret:
    name: "bifrost-encryption"
    key: "encryption-key"
  cluster:
    enabled: true
    region: "us-east-1"
    discovery:
      enabled: true
      type: kubernetes
      k8sNamespace: "default"
      k8sLabelSelector: "app.kubernetes.io/name=bifrost"
    gossip:
      port: 10101

Kubernetes discovery works by querying the API for pods matching your label selector. The gossip layer finds peers that way. You'll need service account RBAC to list pods. After creating your PostgreSQL and encryption secrets and configuring RBAC, just run:

helm install bifrost bifrost/bifrost -f cluster-values.yaml

Both ports 10101 (TCP+UDP) and 10102 (TCP) need to be open between pods, so check your NetworkPolicies and security groups.
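A quick way to confirm the TCP side of that requirement is a small connectivity probe run from inside one pod against another. The sketch below uses only the Python standard library; the function names are hypothetical, and it does not verify UDP on 10101, which needs a separate check.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_peer(host: str, ports=(10101, 10102)) -> dict:
    """Probe the gossip (10101) and gRPC (10102) TCP ports on one peer."""
    return {port: tcp_port_open(host, port) for port in ports}
```

Calling check_peer with a peer pod's IP returns a mapping such as {10101: True, 10102: False}, which points at a NetworkPolicy or security group blocking the gRPC channel.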

Other discovery methods

Bifrost supports six ways to discover cluster members, so it works anywhere:

Kubernetes: Pod discovery via the API, best for cloud-native setups.
DNS: Headless services or A-record resolution, great with StatefulSets.
Consul and etcd: For teams already running those systems.
UDP broadcast and mDNS: Local-network discovery for on-prem and dev.

On serverless platforms without peer-to-peer networking (like Google Cloud Run), Bifrost offers broker mode. Each node makes one outbound connection to a central relay that fans out messages to the rest. Same reliability, different topology. See the in-VPC deployment docs for air-gapped and private-cloud setups.
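The fan-out topology is easy to picture with an in-memory stand-in. This is a conceptual sketch only: the Relay class and its methods are hypothetical, and lists stand in for the outbound network connections a real broker would hold.

```python
class Relay:
    """Central relay: each node connects once, messages fan out to the rest."""

    def __init__(self):
        self._inboxes = {}  # node ID -> list of messages delivered to it

    def connect(self, node_id: str) -> None:
        # A node registers through its single outbound connection.
        self._inboxes.setdefault(node_id, [])

    def publish(self, sender: str, message: dict) -> None:
        # Deliver to every connected node except the sender.
        for node_id, inbox in self._inboxes.items():
            if node_id != sender:
                inbox.append(message)

    def inbox(self, node_id: str) -> list:
        return list(self._inboxes[node_id])


relay = Relay()
for node in ("node-a", "node-b", "node-c"):
    relay.connect(node)
relay.publish("node-a", {"id": "m1", "key": "limit", "value": 1000})
```

After the publish call, node-b and node-c each hold the message and node-a holds nothing. Nodes never need to accept inbound connections from each other, which is what makes this shape workable on platforms without peer-to-peer networking.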

Production HA Patterns

A solid Bifrost cluster combines the multi-node foundation with scheduling, autoscaling, and shutdown policies that keep things running through real-world disruptions. Goal: a fleet that survives node faults, scales for demand, and deploys new versions without dropping requests.

Spread replicas across hosts: Use pod anti-affinity so instances don't land on the same node. One machine can't take down the whole gateway.
Guard availability during maintenance: Set a PodDisruptionBudget with minAvailable: 2. Kubernetes won't let voluntary disruptions (node drains, cluster-autoscaler scale-downs) drop you below that threshold.
Graceful shutdown: A termination grace period plus a preStop hook lets streaming responses finish cleanly instead of getting cut off.
Smart autoscaling: Enable HPA with conservative scale-down. Let the cluster grow for spikes, but don't aggressively kill pods that are still busy.
Layer in provider resilience: Mix cluster-level HA with automatic provider failover and adaptive load balancing. Survive node failures AND provider outages.
Monitor health: Export Prometheus metrics. Use the built-in cluster topology view to confirm every node is alive and receiving messages.
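To see how the PodDisruptionBudget setting interacts with replica count, it helps to compute how many pods Kubernetes may evict voluntarily at any moment. With an integer minAvailable, that is the number of healthy pods minus minAvailable, floored at zero. The helper name below is hypothetical:

```python
def allowed_disruptions(healthy_pods: int, min_available: int = 2) -> int:
    """Voluntary evictions permitted right now under a minAvailable PDB."""
    return max(0, healthy_pods - min_available)


# Three healthy replicas with minAvailable: 2 leave room for one eviction.
headroom = {pods: allowed_disruptions(pods) for pods in (2, 3, 5)}
```

With three replicas and minAvailable: 2, a node drain can evict one pod at a time. If one pod is already unhealthy, the headroom drops to zero and the drain waits, which is the behavior you want during maintenance.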

Rolling upgrades work because nodes negotiate capabilities with each other. New and old versions can run side by side without losing quorum. Deploy one pod at a time using standard Kubernetes rolling strategies.

Common Questions

Does cluster mode need a specific database?

Yes, PostgreSQL. Every node in the cluster talks to the same database, and SQLite doesn't support multi-node coordination, so PostgreSQL is a hard requirement. SQLite only works for single instances.

How many nodes should I run?

Three is the minimum and handles one failure. Five tolerates two. Seven tolerates three. Pick based on how much disruption you can absorb.

Can I use cluster mode on serverless?

Yes, with broker mode. Instead of direct peer-to-peer connections, every node connects outbound to a central relay. Works on Cloud Run and similar platforms.

Do cluster upgrades require downtime?

Nope. Nodes support mixed versions during rolling upgrades. Replace one pod at a time, keep the rest serving traffic.

Getting Started

Bifrost cluster mode is the deployment pattern for teams that need a highly available AI gateway. It delivers gossip-based state sync, automatic failover, shared rate limits, and zero-downtime deploys across a peer-to-peer cluster. The enterprise deployment guide and the full Bifrost resources hub have reference topologies, sizing guidance, and HA walkthroughs.

Ready to see how Bifrost cluster mode fits your production setup? Book a demo with the Bifrost team.