
Posted on • Originally published at yepchaos.com

Active/Active Multi-Region Chat Application Architecture

In the previous post I covered how I connected two Kubernetes clusters across Mongolia and Germany using Netbird. That was the networking layer — pods can reach each other, DNS works across clusters. Now the interesting part: making the actual application work active/active across both regions.

Active/active means both clusters run independently and serve users, but a user on cluster A can chat with a user on cluster B in real time. No single point of failure, no "primary" region. Either cluster can go down and the other keeps running.

This breaks down into three problems: real-time events, chat history, and application state. Each one needs a different solution.

Part 1: Real-Time Events (NATS Super-Cluster)

For single-cluster WebSocket scaling I already use NATS — covered in an earlier post. The short version: all WebSocket servers publish and subscribe through NATS, so a message from a user on server A reaches a user on server B without those servers knowing about each other.

For multi-region, NATS has a concept called a super-cluster. You deploy independent NATS clusters in each region and connect them together. Messages published in one cluster eventually replicate to the other. "Eventually" here means milliseconds of extra latency — there are more network hops, but I accept that.

Setup is straightforward. Deploy NATS in each cluster using the operator (Helm chart), then configure the super-cluster by pointing each cluster at the other's gateway endpoints. After that, the application doesn't change at all. A backend in Germany subscribes to the same subjects as a backend in Mongolia. A message published in one region fans out to both. The application has no idea it's talking to a distributed system — it just publishes and subscribes like before.

This is the cleanest part of the whole setup. NATS was designed for this, and it shows. Here's an example values.yaml for the fsn1 cluster:

config:
  cluster:
    enabled: true
    replicas: 3
    merge:
      name: astring-fsn1

  gateway:
    enabled: true
    port: 7522
    merge:
      name: astring-fsn1
      gateways:
        - name: astring-mn
          urls:
            - nats://nats-mn-headless.nats.astring-mn.internal:7522

  monitor:
    enabled: true
    port: 8222
  merge:
    authorization:
      user: << $NATS_USER >>
      password: << $NATS_PASSWORD >>

container:
  env:
    NATS_USER:
      valueFrom:
        secretKeyRef:
          name: nats-auth-secret
          key: username
    NATS_PASSWORD:
      valueFrom:
        secretKeyRef:
          name: nats-auth-secret
          key: password

service:
  ports:
    nats:
      enabled: true
    gateway:
      enabled: true
    monitor:
      enabled: true

promExporter:
  enabled: true
  env:
    NATS_USER:
      valueFrom:
        secretKeyRef:
          name: nats-auth-secret
          key: username
    NATS_PASSWORD:
      valueFrom:
        secretKeyRef:
          name: nats-auth-secret
          key: password

reloader:
  enabled: true

natsBox:
  enabled: true
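The values above are the fsn1 side. The mn cluster gets a mirror-image gateway stanza pointing back at fsn1; the fsn1 headless-service hostname below is an assumption based on the naming convention used for the mn URL above, so adjust it to your actual service names:

```yaml
# Hypothetical mn-side gateway config, mirroring the fsn1 values above.
# The fsn1 hostname is assumed from the naming pattern of the mn URL.
config:
  gateway:
    enabled: true
    port: 7522
    merge:
      name: astring-mn
      gateways:
        - name: astring-fsn1
          urls:
            - nats://nats-fsn1-headless.nats.astring-fsn1.internal:7522
```

Each side names itself in `merge.name` and lists every *other* cluster under `gateways`, which is what makes the full mesh form.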

Part 2: Chat History (Cassandra, Not ScyllaDB)

I was using ScyllaDB. In December 2024 ScyllaDB moved from AGPL to a source-available license: the code is still public on GitHub, but running a cluster beyond a certain size requires a commercial license, and ScyllaDB Manager (the tool for automation, repairs, and backups) is limited to 5 nodes on the free version. The source is still visible, but it's no longer open source in any meaningful sense. I switched to Cassandra, which is fully open source under Apache 2.0 and shares the same architecture.

For multi-region, Cassandra is actually the best-fit database I've worked with. Cassandra natively understands the concept of datacenters — your two sites aren't two separate clusters, they're two DCs in one logical Cassandra cluster. Replication is configured per-DC. Consistency levels let you decide per-query whether you need a local quorum (fast, single-region) or global quorum (slower but cross-region consistent).

For chat history, I use local quorum for reads and writes. Messages replicate to the other DC asynchronously. A user reading chat history gets it from their local DC — fast. Eventually the other DC catches up. For chat history this is fine — nobody needs sub-millisecond cross-region consistency for reading old messages.
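As a rough CQL sketch of that setup (the keyspace and table names here are made up for illustration), replication is declared per DC and consistency is chosen per query:

```sql
-- Hypothetical chat keyspace replicated 3x in each datacenter,
-- matching the dc1/dc2 topology in the K8ssandraCluster spec.
CREATE KEYSPACE IF NOT EXISTS chat
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
  };

-- LOCAL_QUORUM means 2 of the 3 replicas in the *local* DC must answer,
-- so neither reads nor writes wait on a cross-region round trip.
-- (CONSISTENCY is a cqlsh command; application drivers set this per statement.)
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM chat.messages WHERE room_id = 42;
```

Switching a single query to `QUORUM` or `EACH_QUORUM` buys cross-region consistency for that query alone, at the cost of the intercontinental hop.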

For Kubernetes I use the k8ssandra-operator, which manages Cassandra clusters across multiple Kubernetes clusters. This is where it gets interesting: the operator needs to manage pods in both cluster-mn and cluster-de, which means it needs to reach both clusters. I deploy the k8ssandra-operator on a separate management cluster — a small single-node k3s cluster that reaches both application clusters through Netbird. The operator registers both clusters and treats them as two DCs in one Cassandra deployment.

If the management cluster goes down, the Cassandra cluster keeps running; the operator just can't make configuration changes until it comes back. Acceptable tradeoff. After registering the two clusters (follow the official docs, which explain that step better than I can here), my cluster.yaml is:

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: astring
  namespace: k8ssandra-operator
spec:
  cassandra:
    serverVersion: "4.0.10"
    telemetry:
      mcac:
        enabled: false
      prometheus:
        enabled: true
    storageConfig:
      cassandraDataVolumeClaimSpec:
        storageClassName: openebs-hostpath
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
    config:
      cassandraYaml:
        listen_address: "0.0.0.0"
      jvmOptions:
        heapSize: 512M
    datacenters:
      - metadata:
          name: dc1
        k8sContext: astring-mn
        size: 3
      - metadata:
          name: dc2
        k8sContext: astring-fsn1
        size: 3
  stargate:
    size: 1
    heapSize: 512M

Part 3: Application Database (The Hard Part)

NATS and Cassandra were relatively clean. Postgres is where I spent most of my time.

Postgres stores users, rooms, OTPs, and other metadata: all the relational data. The problem: Postgres has one primary at a time. All writes go to the primary; replicas are read-only. In a multi-region setup, if the primary is in Mongolia and a user in Germany logs in, that request either has to cross the continent to write (a ~200 ms penalty) or I need two primaries that stay in sync.

What I Looked At

TiDB / SurrealDB (TiKV-based)

These are impressive databases but built for low-latency interconnects — single region or multi-AZ with <10ms between nodes. Stretch them across continents and the distributed SQL magic collapses for three specific reasons:

TSO coordination latency. TiDB relies on a Placement Driver (PD) acting as a Timestamp Oracle (TSO) to assign globally ordered timestamps. While timestamp allocation is optimized (batched/pipelined), it still requires coordination with a leader. In a Mongolia–Germany setup, this introduces non-trivial latency before transaction execution, especially under high concurrency.

Raft + 2PC write latency. TiKV uses Raft consensus for replication and Percolator-style two-phase commit for distributed transactions. Writes require quorum acknowledgment, which in cross-region setups means at least one intercontinental round trip. Combined with 2PC coordination, end-to-end write latency can reach hundreds of milliseconds.
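To make "hundreds of milliseconds" concrete, here is a back-of-the-envelope sketch. The numbers are assumptions for illustration, not measurements: roughly a 240 ms Mongolia-Germany round trip, a TSO leader in the remote region, and both 2PC phases needing a cross-region quorum.

```python
# Worst-case model of a cross-region TiDB-style distributed write.
# RTT_MS is an assumed Mongolia<->Germany round-trip time.
RTT_MS = 240

tso_fetch = RTT_MS   # fetch a timestamp from a PD/TSO leader in the other region
prewrite  = RTT_MS   # 2PC phase 1: Raft quorum ack spans regions
commit    = RTT_MS   # 2PC phase 2: second quorum round

total = tso_fetch + prewrite + commit
print(f"worst-case write latency ~ {total} ms")
```

Batched timestamp allocation and local quorum placement can shave parts of this off, but the floor is still set by intercontinental round trips.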

Scaling across regions. Adding more regions increases coordination overhead (more replicas, more quorum distance). These systems scale well within a region, but cross-region deployments require careful topology design and acceptance of higher write latency.

CockroachDB

CockroachDB has similar characteristics.

Consensus-driven latency. CockroachDB also uses Raft for replication. Cross-region writes require quorum, so latency is bounded by inter-region round trips, similar to TiDB/TiKV.

Operational and licensing considerations. Recent versions have shifted licensing and feature availability. Advanced capabilities like geo-partitioning (which help localize data and reduce cross-region latency) are part of paid tiers. This introduces constraints for setups that require fine-grained data locality control without additional licensing.

YugabyteDB

This one I actually deployed and tested. YugabyteDB is Kubernetes-native, supports active/active replication through their xCluster feature, and the management UI is genuinely good — modern, clear, well-designed. I ran it on both clusters using their operator.

The xCluster setup works: deploy two independent YugabyteDB clusters, configure bidirectional xCluster replication between them. In theory, writes in Mongolia replicate to Germany and vice versa.

The dealbreaker: no DDL replication. Every time I add a table or alter a schema, I have to manually register each new table in the xCluster configuration. There's no automation for this in the open source version — I'd have to go into the dashboard, find the table ID, and add it manually every time. The UI for xCluster management is also rough. YugabyteDB Anywhere (their managed product) handles this properly, but that requires a license.

Here are the values I used; they work fine for a single-cluster deployment if you're interested:

Image:
  tag: 2025.2.1.0-b141

storage:
  master:
    count: 3
    size: 2Gi
    storageClass: openebs-hostpath
  tserver:
    count: 3
    size: 2Gi
    storageClass: openebs-hostpath

resource:
  master:
    requests:
      cpu: 0.5
      memory: 0.5Gi
    limits:
      cpu: 1
      memory: 1Gi
  tserver:
    requests:
      cpu: 0.5
      memory: 0.5Gi
    limits:
      cpu: 1
      memory: 1Gi

replicas:
  master: 3
  tserver: 3

partition:
  master: 3
  tserver: 3

domainName: "<zone>.internal"

And creating xCluster replication (run from inside a pod):

yb-admin \
  --master_addresses yb-master-0.yb-masters.yb.svc.astring-fsn1.internal:7100,... \
  setup_universe_replication \
  <replication_id> \
  yb-master-0.yb-masters.yb.svc.astring-mn.internal:7100,... \
  <table_id>

You get the table ID from the dashboard manually. As I said — not practical.

What I Actually Use: PgEdge + Spock

After going through all of that, I ended up with two independent Postgres clusters synchronized using logical replication via Spock — a Postgres extension that enables multi-master replication. PgEdge is a Helm chart built on top of CloudNativePG that packages Spock with a proper Kubernetes operator.

CloudNativePG is excellent — backup, restore, WAL archiving, high availability all work seamlessly. PgEdge adds Spock on top for the cross-cluster sync.

The architecture: two independent Postgres clusters (one per region), each a primary with replicas. Spock creates a logical replication subscription in each direction — cluster-mn subscribes to cluster-de, cluster-de subscribes to cluster-mn. Writes in either region replicate to the other asynchronously.

Helm values for each cluster:

pgEdge:
  appName: astring-cluster
  nodes:
    - name: n1
      hostname: astring-cluster-n1-rw
      clusterSpec:
        instances: 2
        enableSuperuserAccess: true
        postgresql:
          parameters:
            track_commit_timestamp: "on"
            wal_level: "logical"
        plugins:
          - name: barman-cloud.cloudnative-pg.io
            isWALArchiver: true
            parameters:
              barmanObjectName: r2-storage
  clusterSpec:
    storage:
      size: 1Gi
      storageClass: openebs-hostpath

After deploying both clusters, I run an initialization script that sets up the database, roles, and Spock nodes on each cluster:

#!/bin/bash

CONTEXTS=("astring-mn" "astring-fsn1")
DB_NAME="astring_prod"
NAMESPACE="pgedge"
POD_NAME="astring-cluster-n1-1"

for CTX in "${CONTEXTS[@]}"; do
    echo "--- Initializing: $CTX ---"

    POD_IP=$(kubectl get pod "$POD_NAME" --context "$CTX" -n "$NAMESPACE" -o jsonpath='{.status.podIP}')
    SUPER_PASS=$(kubectl get secret "astring-cluster-n1-superuser" --context "$CTX" -n "$NAMESPACE" -o jsonpath='{.data.password}' | base64 --decode)
    APP_PASS=$(kubectl get secret "astring-cluster-n1-app" --context "$CTX" -n "$NAMESPACE" -o jsonpath='{.data.password}' | base64 --decode)

    [[ "$CTX" == *"mn"* ]] && NODE_NAME="region_mn" || NODE_NAME="region_fsn1"
    DSN_HOST="astring-cluster-n1-rw.pgedge.svc.cluster.local"

    export PGPASSWORD=$SUPER_PASS

    psql -h "$POD_IP" -U postgres -d postgres -c "CREATE DATABASE $DB_NAME;" || true
    psql -h "$POD_IP" -U postgres -d "$DB_NAME" -c "
        ALTER ROLE app WITH REPLICATION;
        ALTER DATABASE $DB_NAME OWNER TO app;
        CREATE EXTENSION IF NOT EXISTS spock;
        GRANT USAGE ON SCHEMA spock TO app;
        GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA spock TO app;
        GRANT ALL PRIVILEGES ON ALL FUNCTIONS IN SCHEMA spock TO app;
    "

    psql -h "$POD_IP" -U postgres -d "$DB_NAME" -c "
        SELECT spock.node_create(
            node_name := '$NODE_NAME',
            dsn := 'host=$DSN_HOST port=5432 dbname=$DB_NAME user=app password=$APP_PASS'
        );
    "
done

unset PGPASSWORD

Then sync an initial data dump to make both clusters start from the same state, and set up bidirectional replication:

#!/bin/bash

C1_CTX="astring-mn"
C2_CTX="astring-fsn1"
NS="pgedge"
DB_NAME="astring_prod"  # must match the database created by the init script

C1_HOST="postgres-primary.pgedge.astring-mn.internal"
C2_HOST="postgres-primary.pgedge.astring-fsn1.internal"

get_ip() { kubectl get pod "astring-cluster-n1-1" --context "$1" -n "$NS" -o jsonpath='{.status.podIP}'; }
get_pass() { kubectl get secret "$2" --context "$1" -n "$NS" -o jsonpath='{.data.password}' | base64 --decode; }

C1_IP=$(get_ip "$C1_CTX")
C2_IP=$(get_ip "$C2_CTX")
C1_SUP_PASS=$(get_pass "$C1_CTX" "astring-cluster-n1-superuser")
C2_SUP_PASS=$(get_pass "$C2_CTX" "astring-cluster-n1-superuser")
C1_APP_PASS=$(get_pass "$C1_CTX" "astring-cluster-n1-app")
C2_APP_PASS=$(get_pass "$C2_CTX" "astring-cluster-n1-app")

# FSN1 subscribes to MN
export PGPASSWORD=$C2_SUP_PASS
psql -h "$C2_IP" -U postgres -d "$DB_NAME" -c "
SELECT spock.sub_create(
    subscription_name := 'sub_to_region_mn',
    provider_dsn := 'host=$C1_HOST port=5432 dbname=$DB_NAME user=app password=$C1_APP_PASS'
);"

# MN subscribes to FSN1
export PGPASSWORD=$C1_SUP_PASS
psql -h "$C1_IP" -U postgres -d "$DB_NAME" -c "
SELECT spock.sub_create(
    subscription_name := 'sub_to_region_fsn1',
    provider_dsn := 'host=$C2_HOST port=5432 dbname=$DB_NAME user=app password=$C2_APP_PASS'
);"

unset PGPASSWORD

DDL syncs automatically — add a table in one cluster, it appears in the other. No manual table registration like YugabyteDB.
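This relies on Spock's automatic DDL replication. My understanding is that recent Spock versions expose it through settings like the ones below, but treat the exact parameter names as an assumption and verify them against the pgEdge/Spock docs for your installed version:

```sql
-- Assumed Spock AutoDDL settings (names per pgEdge/Spock docs; verify
-- against your version). With these enabled, DDL executed on one node
-- is captured and replayed on subscribers, and new tables are added
-- to replication sets automatically.
ALTER SYSTEM SET spock.enable_ddl_replication = on;
ALTER SYSTEM SET spock.include_ddl_repset = on;
ALTER SYSTEM SET spock.allow_ddl_from_functions = on;
SELECT pg_reload_conf();
```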

One important detail: with two independent primaries both generating IDs, you need to make sure sequences don't conflict. If both clusters auto-increment from 1, you get duplicate primary keys. Either offset the sequences (different start values with an increment of 2 on each cluster), or sidestep the problem entirely with UUIDv7 or Snowflake-style IDs.
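A minimal sketch of the sequence-offset approach, assuming a hypothetical `users` table with a serial `id` column:

```sql
-- Interleave IDs so the two primaries can never collide:
-- cluster-mn emits odd IDs, cluster-fsn1 emits even IDs.

-- On cluster-mn:
ALTER SEQUENCE users_id_seq RESTART WITH 1 INCREMENT BY 2;

-- On cluster-fsn1:
ALTER SEQUENCE users_id_seq RESTART WITH 2 INCREMENT BY 2;
```

The increment has to equal the number of regions, so adding a third region later means re-spacing every sequence; ID schemes like UUIDv7 avoid that coordination.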

Where It Stands

Both clusters run independently behind GSLB. Users are routed to the nearest region, so normal operations stay local — no cross-ocean round trips on the critical path. Within each cluster, data remains strongly consistent. Across regions, I accept eventual consistency and the small window where state may diverge (e.g., a newly created user that hasn’t replicated yet during a failover). (I’ll cover the GSLB setup and routing details separately.)

Real-time messaging flows through a NATS supercluster, chat history is replicated using Apache Cassandra’s multi–data center replication, and application-level state syncs through Spock.

Is this over-engineered for a chat app with 10 users? Yes.

The architecture scales in a straightforward way — adding a new region means deploying another cluster and integrating it into the existing messaging and replication topology, with the usual tradeoffs around replication lag and consistency.

ArgoCD manages all of this across all three clusters — application clusters and management cluster — through ApplicationSets. Maybe I’ll write about this later.
