<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mark Zlamal</title>
    <description>The latest articles on DEV Community by Mark Zlamal (@world2mark).</description>
    <link>https://dev.to/world2mark</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F907202%2F635697eb-7d84-4e68-a20e-2e9eb7a45297.png</url>
      <title>DEV Community: Mark Zlamal</title>
      <link>https://dev.to/world2mark</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/world2mark"/>
    <language>en</language>
    <item>
      <title>CockroachDB: live certificate rotation</title>
      <dc:creator>Mark Zlamal</dc:creator>
      <pubDate>Thu, 20 Mar 2025 16:28:23 +0000</pubDate>
      <link>https://dev.to/world2mark/cockroachdb-live-certificate-rotation-1h4e</link>
      <guid>https://dev.to/world2mark/cockroachdb-live-certificate-rotation-1h4e</guid>
      <description>&lt;h2&gt;
  
  
  certificate rotation in a live CRDB environment
&lt;/h2&gt;

&lt;p&gt;On Kubernetes we have a CockroachDB deployment and associated &lt;strong&gt;&lt;em&gt;secret&lt;/em&gt;&lt;/strong&gt; resources that are mapped as volumes in the CRDB pods.  These secrets represent the certificates that are required by the database to operate, and include &lt;strong&gt;CA&lt;/strong&gt; certs, &lt;strong&gt;Node&lt;/strong&gt; certs, and &lt;strong&gt;User&lt;/strong&gt; certs.&lt;/p&gt;

&lt;p&gt;CockroachDB allows you to rotate these certificates in a non-disruptive way that keeps existing client/SQL connections alive, and no rolling restarts are required.&lt;/p&gt;

&lt;p&gt;Because we’re working in a containerized environment, there is a specific sequence of tasks required to accomplish this process.&lt;/p&gt;

&lt;p&gt;This blog covers these basics, and with the help of the linked &lt;em&gt;GitHub&lt;/em&gt; repo you can automate this workflow using the NodeJS app in a reliable, repeatable, reusable way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/world2mark/crdb-cert-rotation" rel="noopener noreferrer"&gt;GitHub: crdb-cert-rotation&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ...to the cert rotation tasks
&lt;/h2&gt;

&lt;p&gt;This section details the sequence of steps that are required to complete the cert-rotation. Like any integrated software system, we will encounter caveats and variances based on target platforms, customizations on the deployments, and security/role restrictions.&lt;/p&gt;

&lt;h3&gt;
  
  
  step 1: update your actual secrets (manual)
&lt;/h3&gt;

&lt;p&gt;This first step requires you to generate, or otherwise refresh, your actual certificates.&lt;br&gt;
These can be grouped into a single common &lt;strong&gt;&lt;em&gt;secret&lt;/em&gt;&lt;/strong&gt; object, or broken down into separate secret objects based on specific usage.  For example, the CA can be its own secret, the Node certs might reside in a separate secret, and the same goes for user/client certs.&lt;/p&gt;

&lt;p&gt;This is the only step that requires manual intervention.  This is by design, because each organization typically has its own workflow to refresh certificates: some use HashiCorp Vault, some rely on the &lt;strong&gt;cockroach cert&lt;/strong&gt; commands, while other admins generate their own certs manually.&lt;/p&gt;

&lt;p&gt;The secrets must be updated so that future &lt;strong&gt;pod&lt;/strong&gt; restarts load the new certificates moving forward.&lt;/p&gt;
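&lt;p&gt;As a hedged sketch of what this manual step can look like (the &lt;code&gt;openssl&lt;/code&gt; stand-in and all resource names are illustrative, not from the linked repo; most teams would use &lt;code&gt;cockroach cert&lt;/code&gt; or Vault instead):&lt;/p&gt;

```shell
# Illustrative stand-in for refreshing the CA material; real deployments would
# typically run "cockroach cert create-ca" or pull from Vault instead.
mkdir -p certs
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout certs/ca.key -out certs/ca.crt \
  -days 365 -subj "/CN=Cockroach CA"
# Push the refreshed files back into the secret so future pod restarts load them
# (requires a cluster context, hence commented out; names are examples):
# kubectl create secret generic crdb-ca --from-file=certs/ca.crt \
#   --dry-run=client -o yaml | kubectl apply -f -
ls certs
```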

&lt;h3&gt;
  
  
  step 2: read all the new secrets
&lt;/h3&gt;

&lt;p&gt;This step collects all the secrets tied to CockroachDB. These are specified as a list of tuples, each holding a secret name and its mount path from the CRDB pod's perspective.  The example in the GitHub repo shows two secrets with mounts, but you might have only one secret or possibly more. The tool lets you specify any number of mounts and their respective secret objects.&lt;/p&gt;
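&lt;p&gt;Hypothetically, the tuple list might look like this in NodeJS (the secret names and mount paths are illustrative, not the repo's actual configuration):&lt;/p&gt;

```javascript
// Hypothetical shape of the secret / mount-path tuples the tool iterates over;
// the names here are illustrative examples.
const secretMounts = [
  { secret: 'crdb-node-certs', mountPath: '/cockroach/cockroach-certs' },
  { secret: 'crdb-client-certs', mountPath: '/cockroach/client-certs' },
];

// Any number of entries is allowed; each one is processed independently
// in the later rotation steps.
for (const { secret, mountPath } of secretMounts) {
  console.log(secret + ' maps to ' + mountPath);
}
```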

&lt;h3&gt;
  
  
  step 3: identify the target CRDB pods to refresh (auto)
&lt;/h3&gt;

&lt;p&gt;CockroachDB pods should be labelled using common tags as well as node-specific tags. In the case here, we have a common tag that matches all the pods in the cluster, defined by the environment variable &lt;strong&gt;MZ_CRDB_POD_LABEL_KV&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This tag is actually a tuple (KV), where my example indicates a common tag defined as &lt;strong&gt;&lt;em&gt;zlamal:demo-2025&lt;/em&gt;&lt;/strong&gt; that can be found in all the pods of my cluster.  It’s functionally equal to a label listed as “zlamal: demo-2025”.&lt;/p&gt;

&lt;p&gt;The NodeJS app reads this from the environment (supplied by you in the job-definition YAML), and uses it to capture the running pods so that the tool can perform actions on them in subsequent steps.&lt;/p&gt;
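&lt;p&gt;As a sketch, the KV tuple can be turned into a standard Kubernetes label selector (the &lt;code&gt;key:value&lt;/code&gt; split rule is an assumption about the format, and the function name is illustrative):&lt;/p&gt;

```javascript
// Sketch: turn the MZ_CRDB_POD_LABEL_KV tuple (e.g. "zlamal:demo-2025")
// into a label selector usable with the Kubernetes API or kubectl.
function labelSelectorFromEnv(kv) {
  const idx = kv.indexOf(':');
  if (idx === -1) throw new Error('expected a key:value tuple');
  return kv.slice(0, idx) + '=' + kv.slice(idx + 1);
}

console.log(labelSelectorFromEnv('zlamal:demo-2025')); // "zlamal=demo-2025"
```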

&lt;h3&gt;
  
  
  step 4: we iterate through each pod and perform the certificate updates
&lt;/h3&gt;

&lt;p&gt;This next step is a loop that performs a few sub-steps to accomplish the rotation for each node. These tasks ensure that the pod remains running while bringing in the latest secrets, and this can be verified by shelling directly into the pods to inspect the certs folders for changes.&lt;/p&gt;

&lt;h4&gt;
  
  
  step 4.1: delete the old certs
&lt;/h4&gt;

&lt;p&gt;This step deletes the certificates for each secret object.  There can be many secrets, and each secret can contain many certs and keys.&lt;/p&gt;

&lt;h4&gt;
  
  
  step 4.2: save the new certs
&lt;/h4&gt;

&lt;p&gt;This step saves the new certificates for each secret object in an iterative process.  There can be many secrets, and each secret can contain many certs and keys, so nested loops walk through and save all the items.&lt;/p&gt;

&lt;h4&gt;
  
  
  step 4.3: adjust the permissions of the keys
&lt;/h4&gt;

&lt;p&gt;CockroachDB will NOT accept keys that have open permissions. The app restricts read permission to the current user only, with no group or world access.&lt;/p&gt;

&lt;h4&gt;
  
  
  step 4.4: the magic of sighup
&lt;/h4&gt;

&lt;p&gt;This is the special task that tells CRDB there were changes to the certificates.  &lt;strong&gt;&lt;em&gt;SIGHUP&lt;/em&gt;&lt;/strong&gt; is an OS-level kill signal that doesn't actually kill the process; it merely notifies the process of a HANGUP event.  CockroachDB responds by reloading all certificates, without dropping any connections.&lt;/p&gt;

&lt;p&gt;You can also see the effects of a SIGHUP in the CockroachDB logging (both on disk and in the console).&lt;/p&gt;
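&lt;p&gt;In a containerized setup the signal is typically delivered via &lt;code&gt;kubectl exec&lt;/code&gt;. A hypothetical sketch of the command a tool might assemble (the pod/namespace names and the assumption that cockroach runs as PID 1 in the container are illustrative):&lt;/p&gt;

```javascript
// Sketch: build the kubectl exec command that delivers SIGHUP to the
// cockroach process inside a pod. Assumes cockroach is PID 1 in the
// container, which is typical but not guaranteed for every deployment.
function sighupCommand(pod, namespace) {
  return ['kubectl', 'exec', '-n', namespace, pod, '--', 'kill', '-HUP', '1'];
}

console.log(sighupCommand('crdb-0', 'demo').join(' '));
// "kubectl exec -n demo crdb-0 -- kill -HUP 1"
```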

&lt;h2&gt;
  
  
  step 5: ...verify the certificates?
&lt;/h2&gt;

&lt;p&gt;In your CockroachDB admin console, you can find the actual certificates under the &lt;strong&gt;Advanced Debug&lt;/strong&gt; tab, and in the &lt;strong&gt;Even More Advanced Debugging&lt;/strong&gt; section.&lt;/p&gt;

&lt;p&gt;Here you will find “Certificates on this node”, which lets you check the expiry of your certs; they should now reflect the fresh dates from step 1 (cert creation saved as secrets).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh7asewsqwp9er7hl9fz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh7asewsqwp9er7hl9fz.png" alt="Image description" width="800" height="666"&gt;&lt;/a&gt;&lt;/p&gt;
Example certificates in &lt;b&gt;Advanced Debug&lt;/b&gt;






&lt;h2&gt;
  
  
  Conclusion &amp;amp; References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All YAML is available in the GitHub repo for this project.&lt;/strong&gt; I did not want to clutter this write-up with YAML blocks, so please review them in the repository.&lt;/li&gt;
&lt;li&gt;Have fun, but please work with the Cockroach Enterprise Architects to run through this process for the first time.&lt;/li&gt;
&lt;li&gt;Please review the code. It’s an implementation of these steps, written in JavaScript running under NodeJS.&lt;/li&gt;
&lt;li&gt;I have a dockerized image available (currently for ARM64, but it can be built for all architectures); note that it requires a pull-secret.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/world2mark/crdb-cert-rotation" rel="noopener noreferrer"&gt;GitHub: crdb-cert-rotation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openshift</category>
      <category>cockroachdb</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>CockroachDB: fast-start configuration on a fresh cluster</title>
      <dc:creator>Mark Zlamal</dc:creator>
      <pubDate>Tue, 24 Dec 2024 16:58:19 +0000</pubDate>
      <link>https://dev.to/world2mark/cockroachdb-fast-start-configuration-on-a-fresh-cluster-23ha</link>
      <guid>https://dev.to/world2mark/cockroachdb-fast-start-configuration-on-a-fresh-cluster-23ha</guid>
      <description>&lt;p&gt;You frequently deploy CockroachDB clusters, and each time you need to create initial users, initial databases, some pre-loaded tables, adjustments to grants and privileges, and perhaps some custom zone-configurations based on your locality settings.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pain is repeating this activity over, and over, and over again...
&lt;/h2&gt;

&lt;p&gt;With this guide, after you deploy (or redeploy) your cluster, you can quickly configure it using a repeatable, reliable, and consistent pattern, encapsulated as a single Kubernetes job. This approach eliminates the error-prone and manual process of running your scripts to organize the database.&lt;/p&gt;

&lt;p&gt;This configuration process prepares the database for your workloads, eliminating the need for application-managed configurations. These configs may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;defining database regions&lt;/li&gt;
&lt;li&gt;creation of named databases and schemas&lt;/li&gt;
&lt;li&gt;pre-creating some tables&lt;/li&gt;
&lt;li&gt;applying license keys&lt;/li&gt;
&lt;li&gt;creating initial users or accounts that apps require to operate&lt;/li&gt;
&lt;li&gt;defining permissions, grants, and roles-groups for the users&lt;/li&gt;
&lt;li&gt;backup schedules&lt;/li&gt;
&lt;li&gt;CDC/changefeed jobs&lt;/li&gt;
&lt;li&gt;etc, etc, etc...&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Theory of operation
&lt;/h2&gt;

&lt;p&gt;This process is defined as a Kubernetes Job object that runs once.&lt;/p&gt;

&lt;p&gt;It initially connects using the root-account via certs. These are typically found in the related secrets object that contains the CA-certs, node-certs, and other cluster-applicable certificates.&lt;/p&gt;

&lt;p&gt;Upon connecting to the cluster, the job invokes the CockroachDB &lt;em&gt;command line&lt;/em&gt; option &lt;code&gt;--file=&amp;lt;some SQL file&amp;gt;&lt;/code&gt;, which points to the mapped/mounted ConfigMap containing the SQL to apply.&lt;/p&gt;

&lt;p&gt;When you create the job spec (code follows later in this blog), it runs immediately, and CockroachDB issues the SQL commands sequentially, as defined in the ConfigMap.  Upon completion, you can check the completed pod and review the console logging to ensure that all your SQL statements were successful.&lt;/p&gt;
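&lt;p&gt;A sketch of what the job effectively executes (the connection URL and file names follow the examples later in this post; the in-cluster commands are shown commented since they require a live environment):&lt;/p&gt;

```shell
# Local stand-in for the mounted ConfigMap: the job reads one SQL file
# and feeds it to cockroach via --file.
mkdir -p zlamal-initial-sql
printf "create user mark with password 'zlamal';\ngrant admin to mark;\n" \
  > zlamal-initial-sql/zlamal-run-sql-once

# In-cluster, the job's container then boils down to (commented; needs a cluster):
# cockroach sql --url 'postgresql://mz-crdb-v11-cockroachdb-public:26257' \
#   --certs-dir=cockroach-certs --file /zlamal-initial-sql/zlamal-run-sql-once
# Afterwards, review the pod logs:
# kubectl logs job/zlamal-prep-crdb
wc -l zlamal-initial-sql/zlamal-run-sql-once
```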




&lt;h3&gt;
  
  
  zlamal-initial-sql config map
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Step 1: Define your SQL embedded inside a ConfigMap
&lt;/h4&gt;

&lt;p&gt;The &lt;em&gt;ConfigMap&lt;/em&gt; spec is below.  Note that it's much easier to do this using the OpenShift console UI rather than text-edit the below spec.&lt;/p&gt;

&lt;p&gt;In this example I create an initial user, and an initial table with some inserted dummy-data.  Here is where you can define your database properties such as regions/user-accounts/license-keys/etc to prepare the cluster for client/app connections.&lt;/p&gt;

&lt;p&gt;I wish there was a way to syntax-highlight specific fields that you can focus on. I deliberately use &lt;em&gt;zlamal&lt;/em&gt; and &lt;em&gt;mz&lt;/em&gt; (my initials) to help with search/replace when you adopt these specs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-initial-sql&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;zlamal-run-sql-once&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;create user mark with password 'zlamal';&lt;/span&gt;
    &lt;span class="s"&gt;grant admin to mark;&lt;/span&gt;
    &lt;span class="s"&gt;create table aaaa (c0 int, c1 string);&lt;/span&gt;
    &lt;span class="s"&gt;insert into aaaa values (0, 'val0'),(1, 'val1'),(2, 'val2');&lt;/span&gt;
    &lt;span class="s"&gt;-- ...&lt;/span&gt;
    &lt;span class="s"&gt;-- ...&lt;/span&gt;
    &lt;span class="s"&gt;-- ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  zlamal-prep-crdb job
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Step 2: Job object to run the SQL embedded inside a ConfigMap
&lt;/h4&gt;

&lt;p&gt;The Kubernetes job spec is below.  Again, I wish there was a way to syntax-highlight specific fields that you can focus on, but the idea is that you'll need to adjust the connection-strings, and the names of the K8s resources to align this task to your cluster.&lt;/p&gt;

&lt;p&gt;This &lt;em&gt;job&lt;/em&gt; will mount the &lt;em&gt;ConfigMap&lt;/em&gt; in a folder named "/zlamal-initial-sql". You can change this provided you're consistent with your naming convention.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-prep-crdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-prep-crdb&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb/cockroach:v24.2.5&lt;/span&gt;
          &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/bin/bash&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-ecx'&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
              &lt;span class="s"&gt;exec /cockroach/cockroach&lt;/span&gt;
              &lt;span class="s"&gt;sql&lt;/span&gt;
              &lt;span class="s"&gt;--url&lt;/span&gt;
              &lt;span class="s"&gt;'postgresql://mz-crdb-v11-cockroachdb-public:26257'&lt;/span&gt;
              &lt;span class="s"&gt;--file&lt;/span&gt;
              &lt;span class="s"&gt;/zlamal-initial-sql/zlamal-run-sql-once&lt;/span&gt;
              &lt;span class="s"&gt;--certs-dir=cockroach-certs&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;client-certs&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/cockroach/cockroach-certs/&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ca&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/cockroach/ca&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-initial-sql-configmap&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/zlamal-initial-sql&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-initial-sql-configmap&lt;/span&gt;
          &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-initial-sql&lt;/span&gt;
            &lt;span class="na"&gt;defaultMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0777&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;client-certs&lt;/span&gt;
          &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mz-crdb-v11-cockroachdb-client-secret&lt;/span&gt;
                    &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ca.crt&lt;/span&gt;
                        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ca.crt&lt;/span&gt;
                      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tls.crt&lt;/span&gt;
                        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;client.root.crt&lt;/span&gt;
                      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tls.key&lt;/span&gt;
                        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;client.root.key&lt;/span&gt;
              &lt;span class="na"&gt;defaultMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;256&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ca&lt;/span&gt;
          &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mz-crdb-v11-cockroachdb-ca-secret&lt;/span&gt;
                  &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ca.key&lt;/span&gt;
                      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ca.key&lt;/span&gt;
            &lt;span class="na"&gt;defaultMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;256&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, you'll need to adjust many of the fields in this YAML, such as the CockroachDB version, the resource names, and possibly the permissions, based on your K8s / OpenShift cluster characteristics.&lt;/p&gt;

&lt;p&gt;You should create your own copy of these fragments, and after making the necessary changes, please test them against your environments!&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This procedure is a quick reference into Kubernetes Jobs, ConfigMaps, volumes &amp;amp; mounts, and operating CockroachDB by applying some SQL using command-line arguments.&lt;/p&gt;

</description>
      <category>cockroachdb</category>
      <category>openshift</category>
    </item>
    <item>
      <title>CockroachDB on OpenShift: Separate your logs from data!</title>
      <dc:creator>Mark Zlamal</dc:creator>
      <pubDate>Thu, 07 Nov 2024 11:34:39 +0000</pubDate>
      <link>https://dev.to/world2mark/cockroachdb-on-openshift-separate-your-logs-from-data-422b</link>
      <guid>https://dev.to/world2mark/cockroachdb-on-openshift-separate-your-logs-from-data-422b</guid>
      <description>&lt;h2&gt;
  
  
  CockroachDB and persistent volumes
&lt;/h2&gt;

&lt;p&gt;When deployed on Kubernetes or OpenShift, CockroachDB uses persistent volumes (PVs) to store DB data, metadata, state data, user data, log files, and configuration files.  These volumes are typically file-system mounts that are mapped to disks/SSDs, where the data is physically saved in a distributed fashion.  When you operate CockroachDB and run queries, data must be read or written, and these operations translate to frequent or continuous disk reads &amp;amp; writes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing the disk: IOPS &amp;amp; throughput
&lt;/h2&gt;

&lt;p&gt;On cloud-managed orchestrators, when you read or write data to disk (PVs), this consumes IOPS and utilizes some of the available IO throughput. These are limiting factors that can result in bandwidth saturation, or worse, &lt;strong&gt;throttling&lt;/strong&gt; by the cloud provider under heavier workloads. This condition can be identified by the combination of low CPU usage and high disk latencies, visualized through the hardware dashboard metrics and charts in the CockroachDB UI console.&lt;/p&gt;

&lt;h2&gt;
  
  
  Divide &amp;amp; conquer
&lt;/h2&gt;

&lt;p&gt;To overcome these limitations, CockroachDB lets you take advantage of multiple, independent PVs to separate the destinations of the cockroach runtime data.  CockroachDB &lt;em&gt;&lt;strong&gt;Logging&lt;/strong&gt;&lt;/em&gt; is a good candidate to move out of the critical path by giving it its own dedicated volume/storage. This helps with performance tuning, since your SQL/schemas live on their own dedicated volume. In fact, the &lt;a href="https://www.cockroachlabs.com/docs/stable/recommended-production-settings#storage" rel="noopener noreferrer"&gt;production readiness recommendation&lt;/a&gt; is to split the data from the logs into separate PVs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Typical CockroachDB deployments
&lt;/h2&gt;

&lt;p&gt;Most CockroachDB clusters implement a single PVC that is assigned to each node in a stateful set. Default configurations in both HELM and Operator managed environments create this 1:1 mapping as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8awd6aahrv3smtzbw3h7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8awd6aahrv3smtzbw3h7.png" alt="Default PV/PVC relationship between nodes and volumes" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;
Default PV/PVC relationship between nodes and volumes






&lt;h3&gt;
  
  
  Our planned deployment with multiple PVs
&lt;/h3&gt;

&lt;p&gt;By introducing a second PV dedicated to logs, we split the workload, effectively doubling the IO channels and allowing each to be configured independently. Storage for logs can be significantly smaller than the cockroach-data PV, since logs can be rotated/truncated while your business data grows over time.  This illustration highlights the logical infrastructure layout between nodes and PVs.  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu878qqtwxfaohmcrrlck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu878qqtwxfaohmcrrlck.png" alt="Multiple PV/PVCs assigned to each node" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;
Multiple PV/PVCs assigned to each node






&lt;h2&gt;
  
  
  …to the implementation
&lt;/h2&gt;

&lt;p&gt;We need to make additions to the StatefulSet template along with custom log-configuration settings to direct CockroachDB logs into the new destination PV.&lt;/p&gt;

&lt;h4&gt;
  
  
  The logging “secret” configuration
&lt;/h4&gt;

&lt;p&gt;This resource is the one-stop shop for all your customized logging properties: log sinks (output logs to different locations, including over the network), the logging channels mapped to each sink, the message format, any redaction flags, and the buffering and maximum sizes of log messages.&lt;/p&gt;

&lt;p&gt;The following log configuration is the smallest/simplest configuration that we will use as a starting point.  Here we keep most defaults, only adjusting the &lt;strong&gt;&lt;em&gt;file-defaults&lt;/em&gt;&lt;/strong&gt; destination path for the actual files, where this path will be mounted to a separate PV defined in the StatefulSet template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;file-defaults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/cockroach/cockroach-logs&lt;/span&gt;
&lt;span class="na"&gt;sinks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;file-groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;channels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ALL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a comprehensive explanation of these fragments, along with working examples, please refer to the &lt;a href="https://www.cockroachlabs.com/docs/stable/configure-logs" rel="noopener noreferrer"&gt;Cockroach log configuration&lt;/a&gt; documentation so you can tailor the logging to your needs.&lt;/p&gt;
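&lt;p&gt;For example, assuming the same mounted directory, a hypothetical variation routes operational channels into their own file group (the channel grouping shown is an illustration, not a recommendation):&lt;/p&gt;

```yaml
# Hypothetical extension of the minimal config above: one file group for
# general channels, a second dedicated to operational/health channels.
file-defaults:
  dir: /cockroach/cockroach-logs
sinks:
  file-groups:
    default:
      channels: [DEV, SQL_SCHEMA, USER_ADMIN]
    ops:
      channels: [OPS, HEALTH, STORAGE]
```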

&lt;h4&gt;
  
  
  The StatefulSet template configuration
&lt;/h4&gt;

&lt;p&gt;This StatefulSet fragment highlights only the added template properties that define the PVC and the specific mount points for both the log-config secret and the new logs folder.  A complete StatefulSet example follows this fragment to show the entirety of an actual solution I deployed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StatefulSet&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;volumeClaimTemplates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ...&lt;/span&gt;
    &lt;span class="c1"&gt;# ...&lt;/span&gt;
    &lt;span class="c1"&gt;# Fragment 1&lt;/span&gt;
    &lt;span class="c1"&gt;# New volumeClaimTemplate to generate Log PVC &amp;amp; PV&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
      &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logsdir&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal&lt;/span&gt;
          &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
        &lt;span class="na"&gt;volumeMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Filesystem&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="c1"&gt;# ...&lt;/span&gt;
          &lt;span class="c1"&gt;# ...&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# ...&lt;/span&gt;
            &lt;span class="c1"&gt;# ...&lt;/span&gt;
            &lt;span class="c1"&gt;# Fragment 2&lt;/span&gt;
            &lt;span class="c1"&gt;# Additional mount-points for path to logs and log-config&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logsdir&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/cockroach/cockroach-logs/&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;log-config&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/cockroach/log-config&lt;/span&gt;
          &lt;span class="c1"&gt;# Fragment 3&lt;/span&gt;
          &lt;span class="c1"&gt;# Addition of a new “cockroach start” parameter --log-config-file=...&lt;/span&gt;
          &lt;span class="c1"&gt;# This parameter points CRDB to the mounted log-config secret&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;shell&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-ecx'&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
              &lt;span class="s"&gt;exec /cockroach/cockroach start --log-config-file=/cockroach/log-config/log-config.yaml --join=... --advertise-host=... --certs-dir=/cockroach/cockroach-certs/ --http-port=8081 --port=26257 --cache=11% --max-sql-memory=10%&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadir&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadir&lt;/span&gt;
        &lt;span class="c1"&gt;# Fragment 4&lt;/span&gt;
        &lt;span class="c1"&gt;# Establish the logical YAML reference to the logging directory&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logsdir&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logsdir&lt;/span&gt;
        &lt;span class="c1"&gt;# Fragment 5&lt;/span&gt;
        &lt;span class="c1"&gt;# Establish logical YAML reference to the log-config secret resource&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;log-config&lt;/span&gt;
          &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-cockroachdb-log-config&lt;/span&gt;
            &lt;span class="na"&gt;defaultMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;420&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
Note the “Fragment 1, 2, 3, 4, 5” additions to the StatefulSet 








&lt;p&gt;Here is the complete StatefulSet with these changes, including tags/labels specific to my cluster, as a reference example that you can copy and edit to make your own (e.g. sizes, storage classes, IOPS, tags/labels, etc.):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StatefulSet&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-cockroachdb&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/managed-by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Helm&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
    &lt;span class="na"&gt;helm.sh/chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb-14.0.4&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-cockroachdb&lt;/span&gt;
  &lt;span class="na"&gt;volumeClaimTemplates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
      &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadir&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal&lt;/span&gt;
          &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
        &lt;span class="na"&gt;volumeMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Filesystem&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
      &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logsdir&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal&lt;/span&gt;
          &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
        &lt;span class="na"&gt;volumeMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Filesystem&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app.kubernetes.io/component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
        &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal&lt;/span&gt;
        &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Always&lt;/span&gt;
      &lt;span class="na"&gt;initContainers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
          &lt;span class="na"&gt;terminationMessagePath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/dev/termination-log&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;copy-certs&lt;/span&gt;
          &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/bin/sh&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-c'&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cp -f /certs/* /cockroach-certs/; chmod 0400 /cockroach-certs/*.key&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POD_NAMESPACE&lt;/span&gt;
              &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;fieldRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
                  &lt;span class="na"&gt;fieldPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;metadata.namespace&lt;/span&gt;
          &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/cockroach-certs/&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs-secret&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/certs/&lt;/span&gt;
          &lt;span class="na"&gt;terminationMessagePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;File&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;busybox&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-cockroachdb&lt;/span&gt;
      &lt;span class="na"&gt;schedulerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-scheduler&lt;/span&gt;
      &lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;podAntiAffinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;preferredDuringSchedulingIgnoredDuringExecution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
              &lt;span class="na"&gt;podAffinityTerm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;app.kubernetes.io/component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
                    &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal&lt;/span&gt;
                    &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
                &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/hostname&lt;/span&gt;
      &lt;span class="na"&gt;terminationGracePeriodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
      &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
          &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health?ready=1&lt;/span&gt;
              &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
              &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPS&lt;/span&gt;
            &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
            &lt;span class="na"&gt;timeoutSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
            &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
            &lt;span class="na"&gt;successThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
            &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
          &lt;span class="na"&gt;terminationMessagePath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/dev/termination-log&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db&lt;/span&gt;
          &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
              &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
              &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPS&lt;/span&gt;
            &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
            &lt;span class="na"&gt;timeoutSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
            &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
            &lt;span class="na"&gt;successThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
            &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STATEFULSET_NAME&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-cockroachdb&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STATEFULSET_FQDN&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-cockroachdb.mz-helm-v11.svc.cluster.local&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;COCKROACH_CHANNEL&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes-helm&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grpc&lt;/span&gt;
              &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;26257&lt;/span&gt;
              &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
              &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8081&lt;/span&gt;
              &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
          &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadir&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/cockroach/cockroach-data/&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logsdir&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/cockroach/cockroach-logs/&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;log-config&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/cockroach/log-config&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/cockroach/cockroach-certs/&lt;/span&gt;
          &lt;span class="na"&gt;terminationMessagePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;File&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cockroachdb/cockroach:v23.2.1'&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;shell&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-ecx'&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
              &lt;span class="s"&gt;exec /cockroach/cockroach start --log-config-file=/cockroach/log-config/log-config.yaml --join=${STATEFULSET_NAME}-0.${STATEFULSET_FQDN}:26257,${STATEFULSET_NAME}-1.${STATEFULSET_FQDN}:26257,${STATEFULSET_NAME}-2.${STATEFULSET_FQDN}:26257 --advertise-host=$(hostname).${STATEFULSET_FQDN} --certs-dir=/cockroach/cockroach-certs/ --http-port=8081 --port=26257 --cache=11% --max-sql-memory=10% &lt;/span&gt;
      &lt;span class="na"&gt;topologySpreadConstraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;maxSkew&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;topology.kubernetes.io/zone&lt;/span&gt;
          &lt;span class="na"&gt;whenUnsatisfiable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ScheduleAnyway&lt;/span&gt;
          &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;app.kubernetes.io/component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
              &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal&lt;/span&gt;
              &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-cockroachdb&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadir&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadir&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logsdir&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logsdir&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;log-config&lt;/span&gt;
          &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-cockroachdb-log-config&lt;/span&gt;
            &lt;span class="na"&gt;defaultMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;420&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs&lt;/span&gt;
          &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs-secret&lt;/span&gt;
          &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal-cockroachdb-node-secret&lt;/span&gt;
                  &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ca.crt&lt;/span&gt;
                      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ca.crt&lt;/span&gt;
                      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;256&lt;/span&gt;
                    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tls.crt&lt;/span&gt;
                      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.crt&lt;/span&gt;
                      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;256&lt;/span&gt;
                    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tls.key&lt;/span&gt;
                      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.key&lt;/span&gt;
                      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;256&lt;/span&gt;
            &lt;span class="na"&gt;defaultMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;420&lt;/span&gt;
      &lt;span class="na"&gt;dnsPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterFirst&lt;/span&gt;
  &lt;span class="na"&gt;podManagementPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Parallel&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;updateStrategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RollingUpdate&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zlamal&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
Note how the logical volume names connect the volumeMounts, volumes, and volumeClaimTemplates sections together
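For reference, the log-config volume above points at a secret created from a single file. A minimal sketch of that secret object, assuming the secretName used in the StatefulSet (the embedded file body is abbreviated and illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: zlamal-cockroachdb-log-config
type: Opaque
stringData:
  # The key name becomes the filename under the mountPath, so it must
  # match the --log-config-file path passed to "cockroach start"
  log-config.yaml: |
    file-defaults:
      dir: /cockroach/cockroach-logs
```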








&lt;h2&gt;
  
  
  Conclusion &amp;amp; References
&lt;/h2&gt;

&lt;p&gt;This is a versatile addition to the standard StatefulSet: IOPS can be managed independently across the persistent volumes, and the plumbing is in place for log customization. DB admins can easily change the logging channels in a running environment by editing a single log-config file that is saved as a secret object.&lt;/p&gt;
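As an illustration of that workflow, the log-config.yaml stored in the secret can route logging channels to separate file groups. The channel assignments below are examples only, not recommendations; consult the CockroachDB log-configuration docs for the full schema:

```yaml
file-defaults:
  # Write all file sinks to the dedicated logs PV mounted earlier
  dir: /cockroach/cockroach-logs
sinks:
  file-groups:
    health:
      channels: [HEALTH]
    ops:
      channels: [OPS]
    default:
      # Remaining channels fall through to the default file group
      channels: [all]
```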

&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/docs/stable/logging-overview" rel="noopener noreferrer"&gt;Cockroach Logging Overview&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.cockroachlabs.com/docs/stable/configure-logs" rel="noopener noreferrer"&gt;Cockroach log configuration&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.cockroachlabs.com/docs/stable/cockroach-start#logging" rel="noopener noreferrer"&gt;Cockroach start: logging&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.cockroachlabs.com/docs/stable/recommended-production-settings" rel="noopener noreferrer"&gt;Production recommendations&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openshift</category>
      <category>cockroachdb</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>CockroachDB: Multi-Region OpenShift using Azure Virtual WAN</title>
      <dc:creator>Mark Zlamal</dc:creator>
      <pubDate>Thu, 10 Aug 2023 16:25:44 +0000</pubDate>
      <link>https://dev.to/world2mark/cockroachdb-multi-region-openshift-using-azure-virtual-wan-5gem</link>
      <guid>https://dev.to/world2mark/cockroachdb-multi-region-openshift-using-azure-virtual-wan-5gem</guid>
      <description>&lt;h3&gt;
  
  
  &lt;em&gt;...art of the practical&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Networking between managed OpenShift clusters has become practical through Azure's &lt;strong&gt;&lt;em&gt;Virtual WAN&lt;/em&gt;&lt;/strong&gt;. When deploying &lt;strong&gt;CockroachDB&lt;/strong&gt; on top, you become the owner of a production-ready, globally available distributed database. Let's create the key components that facilitate secure multi-region networking and architect this database solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  what is this?
&lt;/h3&gt;

&lt;p&gt;This blog features a single logical CockroachDB database that’s deployed on &lt;strong&gt;&lt;em&gt;ARO&lt;/em&gt;&lt;/strong&gt; (Azure Red Hat OpenShift) clusters which are provisioned across multiple regions.&lt;/p&gt;

&lt;p&gt;This article is my second instalment on OpenShift networking at global scale. Here we leverage &lt;strong&gt;Microsoft Azure&lt;/strong&gt;, complementing my original post &lt;a href="https://dev.to/world2mark/cockroachdb-multi-region-rosa-using-cloud-wan-2h39"&gt;CockroachDB: Multi-Region ROSA using Cloud WAN&lt;/a&gt;, which focuses on the equivalent &lt;strong&gt;AWS&lt;/strong&gt; infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  the technical catalog
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Virtual Networks&lt;/strong&gt; are the fundamental building blocks for private networks in Azure. This foundation provides the ability for cloud services to interconnect and securely communicate with each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subnets&lt;/strong&gt; are the specific IP address ranges where resources are grouped and interconnected within a virtual network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ARO&lt;/strong&gt; is the highly available, fully managed service for deploying Red Hat OpenShift instances that are monitored and operated jointly by Microsoft and Red Hat. OpenShift is Kubernetes running on Red Hat Enterprise Linux (RHEL) with built-in management UI services for convenience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtual WAN&lt;/strong&gt; is a networking service that brings many networking, security, and globally scaled routing functionalities together, managed by a single operational interface. This is the key offering that lets us manage and connect local and regional &lt;strong&gt;&lt;em&gt;virtual networks&lt;/em&gt;&lt;/strong&gt; together.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtual HUB&lt;/strong&gt; is the service that lets you connect virtual networks together across sites and regions. The Virtual WAN architecture is a hub and spoke model so at least 1 hub is always required to establish a network-service mesh for your resources across the Microsoft network backbone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CockroachDB&lt;/strong&gt; (aka &lt;em&gt;&lt;strong&gt;CRDB&lt;/strong&gt;&lt;/em&gt;) is the distributed database that is deployed on the worker nodes across the OpenShift clusters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  the typical ARO cluster
&lt;/h3&gt;

&lt;p&gt;It is important to have some background on the relevant components of a managed Red Hat OpenShift cluster to understand how they can be connected together. When creating an ARO cluster, the resources and services are pre-configured and automatically deployed based on the sizing criteria specified during the creation process. Virtual Network &lt;strong&gt;Subnets&lt;/strong&gt; are the critical networking requirement of ARO clusters, compartmentalizing the master and worker node pools. These are used by the Virtual WAN service and its hubs to establish the network routing rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasn1i22d6ib4ykv53jka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasn1i22d6ib4ykv53jka.png" alt="ARO Cluster in Virtual Network"&gt;&lt;/a&gt;&lt;/p&gt;
Architecture: This managed OpenShift cluster hosts containerized apps including CockroachDB nodes. Overview of &lt;a href="https://azure.microsoft.com/en-ca/products/openshift" rel="noopener noreferrer"&gt;Microsoft Azure ARO&lt;/a&gt;






&lt;h3&gt;
  
  
  Virtual WAN: facilitator of OpenShift cluster networking
&lt;/h3&gt;

&lt;p&gt;This service defines the networking infrastructure and connectivity rules for the complete solution. The &lt;strong&gt;Virtual WAN&lt;/strong&gt; defines the core network framework that manages hubs, VPN sites, ExpressRoute circuits, and virtual networks. My searches online revealed many great examples of single and multi-hub networks that handle groups of networks, including how to establish links to on-prem environments or other cloud resources (eg: multi-cloud solutions).&lt;/p&gt;

&lt;p&gt;In our solution, we will use a single &lt;strong&gt;HUB&lt;/strong&gt; to define our CockroachDB network, and a distinct &lt;strong&gt;Virtual Network&lt;/strong&gt; connection associated with every managed OpenShift cluster in our solution.&lt;/p&gt;

&lt;p&gt;The most important caveat is to define unique subnet ranges for each OpenShift environment, as this is how we’ll differentiate and access the actual network IPs across each node in each cluster.&lt;/p&gt;
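&lt;p&gt;As a quick sanity check before provisioning (a sketch using Python’s standard &lt;strong&gt;ipaddress&lt;/strong&gt; module; the CIDRs are illustrative, not prescriptive), you can confirm that the candidate subnet ranges and the hub address space do not overlap:&lt;/p&gt;

```python
import ipaddress
from itertools import combinations

# Illustrative per-cluster master/worker subnets plus the hub address space.
cidrs = {
    "canada-masters": "172.23.0.0/24",
    "canada-workers": "172.23.1.0/24",
    "europe-masters": "172.24.0.0/24",
    "europe-workers": "172.24.1.0/24",
    "hub": "172.100.0.0/16",
}

def overlapping_pairs(ranges):
    """Return the name pairs whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]

print(overlapping_pairs(cidrs))  # an empty list means the ranges can be routed side by side
```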

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8x43egsnuzd50wt3ndq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8x43egsnuzd50wt3ndq.png" alt="Multiple ARO Clusters interconnected via Virtual WAN"&gt;&lt;/a&gt;&lt;/p&gt;
Architecture: Virtual WAN allows interconnectivity between ARO clusters across regions, resulting in a network mesh that’s ideal for a distributed CockroachDB platform. Note the unique CIDR ranges for each virtual network.






&lt;h3&gt;
  
  
  the network service mesh
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In the following diagram, all master nodes are hidden for simplicity.  They are typically hosted on the 172.x.0.x/16 subnet, while all workers are hosted on the 172.x.1.x/16 subnet.&lt;/li&gt;
&lt;li&gt;Worker nodes within each cluster are inherently connected; links have been omitted for simplicity in this chart.&lt;/li&gt;
&lt;li&gt;Every node has network visibility to every other node across clusters and regions. This example only illustrates Node 2 links for simplicity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h29ygstu13je1wtpfyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h29ygstu13je1wtpfyc.png" alt="Network service mesh"&gt;&lt;/a&gt;&lt;/p&gt;
Architecture: The CockroachDB network mesh requires that every worker node hosting a CockroachDB pod has routing/visibility to every other worker node (by extension every other CockroachDB pod) in all the OpenShift clusters across regions.






&lt;h3&gt;
  
  
  let’s create the Azure resources
&lt;/h3&gt;

&lt;p&gt;The following steps will establish the necessary cluster infrastructure based on the included screenshots, but your settings may vary, such as IP CIDR ranges or regional settings.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Create your Virtual Networks
&lt;/h4&gt;

&lt;p&gt;For each OpenShift cluster, you need a corresponding &lt;strong&gt;virtual network&lt;/strong&gt; with 2 subnets.  One subnet is for OpenShift workers, the other is for OpenShift masters. This is required for associating an ARO cluster with a valid virtual network during the creation process.&lt;br&gt;
Below is my new virtual network, named &lt;strong&gt;zlamal-aro-canada-01&lt;/strong&gt; with the 2 subnets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0tbl50x9f2uxzfxy5vv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0tbl50x9f2uxzfxy5vv.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
I use default as my masters subnet, and create the second workers subnet with a uniquely routable CIDR range.



&lt;p&gt;&lt;a href="" class="article-body-image-wrapper"&gt;&lt;img&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 2: Create your OpenShift Clusters
&lt;/h4&gt;

&lt;p&gt;We will focus on the networking aspects of the Azure ARO &lt;em&gt;Creation Wizard&lt;/em&gt;.&lt;br&gt;
After specifying the cluster name, region, and node sizing, the &lt;strong&gt;Virtual Networks&lt;/strong&gt; list will be filtered to the cluster's region. In my example I chose Canada Central for my cluster &amp;amp; networking. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F877s7utumgfmb4wd9kfr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F877s7utumgfmb4wd9kfr.png" alt="Image description"&gt;&lt;/a&gt;Master subnet is my default virtual network subnet, while the Worker subnet is tied to my worker-virtual-network subnet.&lt;/p&gt;

&lt;p&gt;I’ve chosen to make the endpoints &lt;strong&gt;visible&lt;/strong&gt; for convenience and to avoid the challenges of establishing a VPN framework for this exercise.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Notice: This screenshot indicates &lt;strong&gt;172.18.x.x&lt;/strong&gt; virtual network subnets while the prior screenshot shows &lt;strong&gt;172.23.x.x&lt;/strong&gt;. This is because during this blog write-up I’ve created yet another virtual network for an upcoming OpenShift cluster, but normally these CIDR ranges would align with each other.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Step 3: Create a Virtual WAN instance
&lt;/h4&gt;

&lt;p&gt;Here, the first step is to specify the management region of the Virtual WAN instance. Actual networking at this phase is agnostic of locality. Specify your name and ensure the &lt;strong&gt;Type&lt;/strong&gt; is &lt;strong&gt;Standard&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0gc68dqh4zfuwywyrov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0gc68dqh4zfuwywyrov.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="" class="article-body-image-wrapper"&gt;&lt;img&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 4: Create the Hub instance
&lt;/h4&gt;

&lt;p&gt;This hub resides in the Virtual WAN, and will connect all the OpenShift clusters together in the subsequent step. The hub is also assigned a region, but ultimately accepts connections from any Azure network-resource provisioned in any region.&lt;/p&gt;

&lt;p&gt;In this example it’s named &lt;em&gt;&lt;strong&gt;canada-east-hub&lt;/strong&gt;&lt;/em&gt;, along with a dedicated address space used internally by Microsoft to manage and operate the hub instance. I’ve chosen a non-overlapping CIDR of &lt;strong&gt;172.100.0.0/16&lt;/strong&gt;, which will not interfere with my clusters or any future additions to this network. To save on costs I kept the hub capacity at the minimum possible throughput and connection capability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2t0qrvspo1vw2iqa2p3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2t0qrvspo1vw2iqa2p3.png" alt="Image description"&gt;&lt;/a&gt;”Basics” tab is the only one that’s relevant.  All others can be kept as default.&lt;/p&gt;

&lt;p&gt;&lt;a href="" class="article-body-image-wrapper"&gt;&lt;img&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 5: Add the connections to the Hub
&lt;/h4&gt;

&lt;p&gt;Every OpenShift cluster resides on a unique Virtual Network. This step establishes the connectivity on all the virtual networks. The hub manages and operates this pool of connections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34lqpwmt4b5k7jvegcz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34lqpwmt4b5k7jvegcz8.png" alt="Image description"&gt;&lt;/a&gt;To add a connection, select your Virtual WAN, under &lt;em&gt;Virtual network connections&lt;/em&gt; click on &lt;strong&gt;Add connection&lt;/strong&gt;. From here you can label this interface, apply the Hub, apply default settings, and complete the peering.&lt;/p&gt;

&lt;p&gt;&lt;a href="" class="article-body-image-wrapper"&gt;&lt;img&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As you scale-out, create, or delete OpenShift clusters, here is where the interfaces are established or removed.  This means creation of a Virtual WAN along with the Hub is a one-time process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm3f8950wxgwfez2acye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm3f8950wxgwfez2acye.png" alt="Image description"&gt;&lt;/a&gt;Upon completion of virtual network associations with the hub, the network topology will be updated with your complete network.&lt;/p&gt;

&lt;p&gt;&lt;a href="" class="article-body-image-wrapper"&gt;&lt;img&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 6: Establish CockroachDB nodes
&lt;/h4&gt;

&lt;p&gt;Every OpenShift cluster is interconnected at this point. This means that if you open a terminal window to any running pod, you can &lt;strong&gt;curl&lt;/strong&gt; any worker node on any of the clusters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;From this point you can create a unique OpenShift deployment with node affinity for each worker, ensuring a 1:1 relationship between workers and CRDB nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Catalog the IP addresses of the worker nodes in your &lt;em&gt;&lt;strong&gt;primary&lt;/strong&gt;&lt;/em&gt; OpenShift cluster that will serve as the starting point for CockroachDB.  Once this primary cluster of 3+ nodes is initialized, all the other nodes can use these 3+ IP addresses, along with corresponding service &lt;strong&gt;NodePorts&lt;/strong&gt; to connect to every other node on the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Load balancers can be created per-OpenShift-cluster, providing additional redundancy if a cluster becomes unavailable.  This also clearly defines multiple access-points for clients to use and prioritize based on latency and geographic location.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
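&lt;p&gt;The bookkeeping described above, cataloging the primary cluster’s worker IPs and their node-comms NodePorts into a shared &lt;em&gt;&lt;strong&gt;--join&lt;/strong&gt;&lt;/em&gt; list, can be sketched in a few lines of Python (the IPs and ports are the illustrative values used later in this article):&lt;/p&gt;

```python
# Worker-node IP -> node-comms NodePort, as cataloged from the primary cluster.
primary_nodes = {
    "172.23.1.6": 31950,
    "172.23.1.4": 31951,
    "172.23.1.5": 31952,
}

def join_flag(nodes):
    """Build the --join flag shared by every CockroachDB node in the mesh."""
    return "--join " + ",".join(f"{ip}:{port}" for ip, port in nodes.items())

print(join_flag(primary_nodes))
# --join 172.23.1.6:31950,172.23.1.4:31951,172.23.1.5:31952
```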

&lt;p&gt;This YAML fragment illustrates the locality and network parameters for a single node in the cluster. The &lt;em&gt;&lt;strong&gt;--join&lt;/strong&gt;&lt;/em&gt; statement is typically static for all nodes, while the sql-addr, listen-addr, and advertise-addr will be unique for each node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          ...
          ...
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crdb-deployment-zlamal-172-23-1-4
  labels:
    crdb-cluster: crdb-zlamal
spec:
  selector:
    matchLabels:
      crdb-pod-name: crdb-pod-zlamal-172-23-1-4
  replicas: 1
  template:
    metadata:
      labels:
        crdb-pod-name: crdb-pod-zlamal-172-23-1-4
        crdb-cluster: crdb-zlamal
    ...
    spec:
      ...
      containers:
          command:
            - /bin/bash
            - '-ecx'
            - &amp;gt;-
              exec /cockroach/cockroach
              start
              --cache=.25
              --logtostderr
              --certs-dir=/cockroach/cockroach-certs
              --locality=country=CA,region=CA-East,zone=Zone-1
              --sql-addr=:26257
              --listen-addr=:26357
              --advertise-addr=172.23.1.4:31951
              --join 172.23.1.6:31950,172.23.1.4:31951,172.23.1.5:31952
              ...
              ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each CRDB node will require its own &lt;em&gt;&lt;strong&gt;NodePort&lt;/strong&gt;&lt;/em&gt; service to establish the internal database network mesh, while a common service can internally load-balance connections for co-located applications.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This example service aligns with the above fragment to define unique NodePort port numbers that must not overlap with other services in the current OpenShift cluster.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: Service
apiVersion: v1
metadata:
  name: crdb-service-zlamal-172-23-1-4
spec:
  ports:
    - name: sql-access
      protocol: TCP
      port: 26257
      targetPort: 26257
      nodePort: 31851
    - name: node-comms
      protocol: TCP
      port: 26357
      targetPort: 26357
      nodePort: 31951
    - name: console-ui
      protocol: TCP
      port: 8080
      targetPort: 8080
  type: NodePort
  selector:
    crdb-pod-name: crdb-pod-zlamal-172-23-1-4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;A generic service would have its &lt;em&gt;&lt;strong&gt;selector&lt;/strong&gt;&lt;/em&gt; point not at a single pod, but at the common label of the deployments. In this case set the selector to &lt;strong&gt;crdb-cluster: crdb-zlamal&lt;/strong&gt; so the load balancers properly interface with all the active pods.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example, a load balancer service definition handling the local cluster will have a selector as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: crdb-zlamal-load-balancer
  labels:
    crdb-cluster: crdb-zlamal
spec:
  ports:
    - name: console-ui
      port: 8080
      protocol: TCP
      targetPort: 8080
    - name: sql-access
      port: 26257
      protocol: TCP
      targetPort: 26257
  type: LoadBalancer
  selector:
    crdb-cluster: crdb-zlamal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  conclusion
&lt;/h3&gt;

&lt;p&gt;These Azure network services abstract away the complexities of infrastructure, security, and resource management, offering a view of the entire architecture from a single pane of glass. I can't cover every detail of these technologies, but Azure also provides service &amp;amp; network monitoring, security controls, and visual representations of the connected ecosystem.&lt;/p&gt;

&lt;p&gt;For a starter-pack of Kubernetes templates to help with deploying CockroachDB, go to the &lt;a href="https://github.com/cockroachdb/cockroach/tree/master/cloud/kubernetes" rel="noopener noreferrer"&gt;Cockroach/cloud/kubernetes&lt;/a&gt; GitHub repo.&lt;/p&gt;




&lt;h3&gt;
  
  
  references
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.cockroachlabs.com" rel="noopener noreferrer"&gt;Cockroach  Labs&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/world2mark/cockroachdb-multi-region-rosa-using-cloud-wan-2h39"&gt;CockroachDB: Multi-Region ROSA (AWS) using Cloud WAN&lt;/a&gt;&lt;br&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/virtual-wan/virtual-wan-about" rel="noopener noreferrer"&gt;What is Azure Virtual WAN?&lt;/a&gt;&lt;br&gt;
&lt;a href="https://azure.microsoft.com/en-ca/products/openshift" rel="noopener noreferrer"&gt;Azure ARO&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cockroachdb</category>
      <category>azure</category>
      <category>openshift</category>
    </item>
    <item>
      <title>CockroachDB: row-level TTL to simulate Redis</title>
      <dc:creator>Mark Zlamal</dc:creator>
      <pubDate>Mon, 03 Apr 2023 16:53:52 +0000</pubDate>
      <link>https://dev.to/world2mark/cockroachdb-row-level-ttl-to-simulate-redis-4a12</link>
      <guid>https://dev.to/world2mark/cockroachdb-row-level-ttl-to-simulate-redis-4a12</guid>
      <description>&lt;h2&gt;
  
  
  what this blog covers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;row-level TTL implementation in CockroachDB&lt;/li&gt;
&lt;li&gt;examples of &lt;em&gt;&lt;strong&gt;prepared statements&lt;/strong&gt;&lt;/em&gt; with execution parameters&lt;/li&gt;
&lt;li&gt;a usage demo that illustrates automatic deletion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  what this is not...
&lt;/h2&gt;

&lt;p&gt;Redis is a feature-rich, in-memory key/value database designed for high-performance caching and text-based queries against key-strings. This blog is not meant to replace a true Redis use-case; instead it provides an implementation of the most frequently used Redis capabilities, namely the &lt;strong&gt;GET&lt;/strong&gt;, &lt;strong&gt;SET&lt;/strong&gt;, and &lt;strong&gt;EXPIRE&lt;/strong&gt; functions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no expressions or conditional get capabilities&lt;/li&gt;
&lt;li&gt;no gets using multiple keys, scans, or wildcard queries&lt;/li&gt;
&lt;li&gt;no in-memory stores. Data is distributed across nodes, but always uses disk for I/O. &lt;em&gt;There are in-memory options that would benefit this solution, but they are not covered here. At the bottom of this blog is a link to in-memory stores to improve performance.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  implementation: table definition
&lt;/h3&gt;

&lt;p&gt;The Redis table must contain 3 specific columns to facilitate the capabilities tied to row-level TTL, namely the  &lt;em&gt;&lt;strong&gt;key&lt;/strong&gt;&lt;/em&gt;, the &lt;em&gt;&lt;strong&gt;value&lt;/strong&gt;&lt;/em&gt;, and the &lt;em&gt;&lt;strong&gt;expired_at&lt;/strong&gt;&lt;/em&gt; columns. The names of these columns can be adjusted to meet your application needs, provided that the prepared statements below are in-sync with your naming conventions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;redis_tbl&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;expired_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ttl_expiration_expression&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'expired_at'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;redis_tbl&lt;/span&gt; &lt;span class="n"&gt;configure&lt;/span&gt; &lt;span class="k"&gt;zone&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ttlseconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;You can introduce additional app-specific columns including indexes to accommodate your workload.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;key&lt;/strong&gt; and &lt;strong&gt;value&lt;/strong&gt; data-types can also be tailored to meet your needs.  &lt;em&gt;In fact I often use &lt;strong&gt;JSONB&lt;/strong&gt; data-type for values for easy data-processing in my NodeJS apps.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;expired_at&lt;/strong&gt; column is a &lt;strong&gt;timestamptz&lt;/strong&gt; marking when the row becomes eligible for deletion.&lt;/li&gt;
&lt;li&gt;Note the &lt;em&gt;&lt;strong&gt;gc.ttlseconds&lt;/strong&gt;&lt;/em&gt; alteration. The default CockroachDB garbage collector removes tombstones after 25 hours (90,000 seconds), so the recommended practice is to protect your storage capacity by reducing this window, especially under workloads with many short-lived or churning keys.&lt;/li&gt;
&lt;li&gt;This example has GC set to 300 seconds (5 min), but should be adjusted based on anticipated usage and can be revisited &amp;amp; altered in a production environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  implementation: prepared statements
&lt;/h3&gt;

&lt;p&gt;For convenience, we create 3 prepared statements to provide the core functionality tied to &lt;strong&gt;set&lt;/strong&gt;, &lt;strong&gt;get&lt;/strong&gt;, and &lt;strong&gt;expire&lt;/strong&gt; capabilities. These can be tailored to meet your application needs, including data-type augmentation or additional parameters.&lt;/p&gt;

&lt;blockquote&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;prepare&lt;/span&gt; &lt;span class="n"&gt;redis_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;
  &lt;span class="n"&gt;upsert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;redis_tbl&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;em&gt;The &lt;strong&gt;redis_set&lt;/strong&gt; statement saves key/value data including an expiry duration (in seconds).&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;prepare&lt;/span&gt; &lt;span class="n"&gt;redis_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;redis_tbl&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;expired_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;em&gt;The &lt;strong&gt;redis_get&lt;/strong&gt; statement retrieves the stored value that’s identified by the key.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;prepare&lt;/span&gt; &lt;span class="n"&gt;redis_expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;
  &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;redis_tbl&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;expired_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;em&gt;The &lt;strong&gt;redis_expire&lt;/strong&gt; statement updates the expiry duration of an existing key to this new value (in seconds).&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
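&lt;p&gt;The expiry arithmetic shared by &lt;strong&gt;redis_set&lt;/strong&gt; and &lt;strong&gt;redis_expire&lt;/strong&gt;, the current time plus a duration in seconds, can be mirrored client-side. This is only a sketch of the computation (the function name is mine); in practice, let the database evaluate &lt;strong&gt;now()&lt;/strong&gt; to avoid clock skew between clients.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def expiry_timestamp(ttl_seconds, now=None):
    """Compute expired_at the way redis_set does: current time plus a TTL in seconds."""
    now = now or datetime.now(timezone.utc)
    return now + timedelta(seconds=ttl_seconds)

# Pin the clock so the example is deterministic.
t0 = datetime(2023, 4, 3, 12, 0, 0, tzinfo=timezone.utc)
print(expiry_timestamp(10, t0))  # 2023-04-03 12:00:10+00:00
```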




&lt;h3&gt;
  
  
  testing &amp;amp; usage: set, get, expire
&lt;/h3&gt;

&lt;p&gt;Below is some basic usage of these operations. Note that time is of the essence when running these tests, since this is intended to be a real-time demo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="n"&gt;redis_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mz1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hello1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- entry is saved with a 10 second TTL&lt;/span&gt;

&lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="n"&gt;redis_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mz1'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- returns the 'mz1/hello1' row;&lt;/span&gt;

&lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="n"&gt;redis_expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mz1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- entry is updated with a fresh 10 second TTL window&lt;/span&gt;

&lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="n"&gt;redis_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mz1'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- returns the 'mz1/hello1' row;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;...wait 11 seconds to observe the DB changes (auto-deleted/expired rows)...&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="n"&gt;redis_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mz1'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- returns 0 rows;&lt;/span&gt;

&lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="n"&gt;redis_expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mz1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- this is a no-op since mz1 expired due to row-level TTL.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test-harness is not exhaustive, but it demonstrates the core behavior of CockroachDB, highlighting the outputs when keys exist and what you can expect after they’ve expired.&lt;/p&gt;
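&lt;p&gt;To watch the expiry behaviour more closely, CockroachDB exposes row-level TTL knobs as table storage parameters. As a sketch (assuming the backing table is named &lt;strong&gt;redis_tbl&lt;/strong&gt;, as in the statements above), you can inspect the background TTL deletion schedule and tighten its cadence so expired rows are physically removed quickly during a demo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;show schedules; -- lists the background row-level TTL deletion job, among others

alter table redis_tbl set (ttl_job_cron = '* * * * *'); -- run the TTL deletion job every minute instead of the hourly default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note that the deletion job controls when expired rows are physically removed; it is independent of whether queries treat a row as expired.&lt;/em&gt;&lt;/p&gt;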




&lt;h2&gt;
  
  
  conclusion
&lt;/h2&gt;

&lt;p&gt;If you’re already operating a CockroachDB database, this is a quick extension that simulates Redis-style capabilities without the need to provision a dedicated Redis platform. For example, during the development of a web-application that requires &lt;strong&gt;&lt;em&gt;session &amp;amp; cookie&lt;/em&gt;&lt;/strong&gt; tracking, this technique is a quick add-on that lets you prove out your code and demo the app. When you’re ready to move to production, you can then provision a true Redis platform and take advantage of its full range of capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  terminology &amp;amp; resources
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/docs/stable/row-level-ttl.html"&gt;Batch Delete Expired Data with Row-Level TTL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/docs/stable/timestamp.html"&gt;TIMESTAMP / TIMESTAMPTZ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/docs/stable/cockroach-start.html#store"&gt;CockroachDB in-memory storage options&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://redis.io"&gt;redis.io homepage&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cockroachdb</category>
      <category>redis</category>
    </item>
    <item>
      <title>CockroachDB: trace logging with Datadog</title>
      <dc:creator>Mark Zlamal</dc:creator>
      <pubDate>Wed, 11 Jan 2023 13:47:49 +0000</pubDate>
      <link>https://dev.to/world2mark/cockroachdb-trace-logging-with-datadog-1cm1</link>
      <guid>https://dev.to/world2mark/cockroachdb-trace-logging-with-datadog-1cm1</guid>
      <description>&lt;h2&gt;
  
  
  trace logs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Trace logs&lt;/em&gt;&lt;/strong&gt; are the detailed internal synchronous &amp;amp; asynchronous operations that take place to fulfill a complete transaction or deliver a result.&lt;/p&gt;

&lt;h2&gt;
  
  
  CockroachDB trace logs
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;&lt;em&gt;CockroachDB&lt;/em&gt;&lt;/strong&gt; these operations may include any combination of network interactions, calls to the &lt;a href="https://www.cockroachlabs.com/docs/stable/architecture/storage-layer.html" rel="noopener noreferrer"&gt;storage layer&lt;/a&gt;, &lt;a href="https://www.cockroachlabs.com/docs/stable/architecture/sql-layer.html" rel="noopener noreferrer"&gt;SQL layer&lt;/a&gt;, &lt;a href="https://www.cockroachlabs.com/docs/stable/architecture/distribution-layer.html" rel="noopener noreferrer"&gt;distribution layer&lt;/a&gt;, &lt;a href="https://www.cockroachlabs.com/docs/stable/architecture/life-of-a-distributed-transaction.html#txncoordsender" rel="noopener noreferrer"&gt;query batching&lt;/a&gt;, &lt;a href="https://www.cockroachlabs.com/docs/stable/architecture/reads-and-writes-overview.html" rel="noopener noreferrer"&gt;reads/updates/writes&lt;/a&gt;, and various runtime coordinator services.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Normally this level of detail is what you’d find inside a CockroachDB &lt;a href="https://www.cockroachlabs.com/docs/stable/cockroach-debug-zip.html" rel="noopener noreferrer"&gt;debug ZIP bundle&lt;/a&gt;, specifically the &lt;strong&gt;&lt;em&gt;jaeger-file&lt;/em&gt;&lt;/strong&gt; that lets you visualize and interact with the execution flow to identify where processing/waiting time is spent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  improving database performance
&lt;/h2&gt;

&lt;p&gt;Datadog provides tools to assist in the tasks of database tuning, query-tuning, and improvements to schemas.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What’s special is that trace-logs are sent to &lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; in real-time during query execution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the Datadog portal you can monitor, filter, and explore the same jaeger-style activity on these dissected operations to help guide you with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;narrowing down bottle-necks in the cluster network topology&lt;/li&gt;
&lt;li&gt;capturing transient issues tied to network fluctuations&lt;/li&gt;
&lt;li&gt;isolating root-causes of poor-performing queries&lt;/li&gt;
&lt;li&gt;observing the behaviour of applications that send queries&lt;/li&gt;
&lt;li&gt;tracking down the impact of sub-optimal schemas&lt;/li&gt;
&lt;li&gt;providing additional context for Cockroach Labs support teams&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  to the architecture
&lt;/h2&gt;

&lt;p&gt;The trace logging system is divided into 3 areas of integration.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The CockroachDB cluster (source; host machine)&lt;/li&gt;
&lt;li&gt;The Datadog agent (OTLP protocol; host machine)&lt;/li&gt;
&lt;li&gt;The Datadog web portal (sink; in cloud)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxgcd4djtsqzpnu5wb09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxgcd4djtsqzpnu5wb09.png" alt="Core Architecture" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;
architecture of CockroachDB and Datadog integration





&lt;h4&gt;
  
  
  1) in the cloud: Datadog
&lt;/h4&gt;

&lt;p&gt;Datadog is an integrated cloud-portal offering data-aggregation services, including all the tools needed to specify data-sources, infrastructure observability, build interactive dashboards, explore metrics, define alerts, monitors, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.datadoghq.com/developers/marketplace/" rel="noopener noreferrer"&gt;CockroachDB data source integration (Datadog marketplace)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CockroachDB is one of the many data-source integrations that’s available through their vendor marketplace. This service establishes the core settings for a typical CRDB installation that are required to capture metrics, text-log data, and defines all the back-end monitored properties.&lt;/p&gt;

&lt;h4&gt;
  
  
  2) on the host machine: Datadog agent
&lt;/h4&gt;

&lt;p&gt;Here we have the Datadog agent. It’s a binary executable that runs as a service in the background, listening to &lt;a href="https://opentelemetry.io/docs/concepts/what-is-opentelemetry" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; data sources.  You can find installation instructions on the main portal with download links specific to your target OS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.datadoghq.com/agent/" rel="noopener noreferrer"&gt;Datadog agents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They provide agents for many environments, including cloud-based and operator-driven deployments for managed Kubernetes and OpenShift.&lt;/p&gt;

&lt;h4&gt;
  
  
  3) on the host machine: CockroachDB
&lt;/h4&gt;

&lt;p&gt;CockroachDB relies on the OpenTelemetry protocol as part of its observability instrumentation.  This is enabled by applying the cluster setting &lt;strong&gt;trace.opentelemetry.collector&lt;/strong&gt; to associate the database with the Datadog agent.&lt;/p&gt;

&lt;p&gt;In this post, the agent is running locally on my laptop alongside CockroachDB, so I’m sticking to the default Datadog OTLP port for the telemetry connection.&lt;/p&gt;




&lt;h2&gt;
  
  
  to the installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  create an API key
&lt;/h3&gt;

&lt;p&gt;Datadog provides a multi-tenant view of user-content and data-sources. API keys are used to identify and differentiate whose data we’re working with, and where the data comes from.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpokxzajf8okduys14ej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpokxzajf8okduys14ej.png" alt="Core Architecture" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;
API keys in Datadog




&lt;h3&gt;
  
  
  install the agent
&lt;/h3&gt;

&lt;p&gt;Browse to the &lt;a href="https://app.datadoghq.com/account/settings#agent" rel="noopener noreferrer"&gt;Agents Page&lt;/a&gt; and download the binary suitable for your OS or platform.&lt;/p&gt;

&lt;p&gt;Once installed, configuration files are made available so that you can associate your CockroachDB installation with the agent, and associate the agent with your Datadog account.&lt;/p&gt;

&lt;h4&gt;
  
  
  associate the Agent with Datadog
&lt;/h4&gt;

&lt;p&gt;Using the API key generated above, you must edit the &lt;strong&gt;datadog.yaml&lt;/strong&gt; file and save the key. The OTLP listening port is 4317 by default and can also be changed if you have multiple listeners or agents present.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;edit &lt;strong&gt;~/.datadog-agent/datadog-agent/datadog.yaml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
...
api_key: 4fxxxxxxxxxxxxxxxxxxxxxxxxxxxxe0
...
...
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: localhost:4317
...
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  associate CockroachDB with the agent
&lt;/h4&gt;

&lt;p&gt;Next step (optional): Tag your services and specify CockroachDB endpoints that match your database installation.&lt;/p&gt;

&lt;p&gt;Tagging is important when multiple CockroachDB instances are running on the same server and you need to differentiate between them on your dashboards.&lt;/p&gt;

&lt;p&gt;The default prometheus URL setting is &lt;strong&gt;&lt;em&gt;localhost:8080/_status/vars&lt;/em&gt;&lt;/strong&gt; but can be changed as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;edit &lt;strong&gt;~/.datadog-agent/conf.d/cockroachdb.d/conf.yaml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
...
## Every instance is scheduled independent of the others.
#
instances:
  - openmetrics_endpoint: http://localhost:8091/_status/vars
    tags:
      - service:mzmz-crdb-8091
      - env:mzmz-env-8091
    service: mzmz-crdb-8091
    ...
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CockroachDB cluster settings
&lt;/h3&gt;

&lt;p&gt;The last setting is to establish the link between CockroachDB and the Datadog agent. This is accomplished by setting the trace-collector to the running Datadog agent as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@localhost:26257/defaultdb&amp;gt; set cluster setting trace.opentelemetry.collector='localhost:4317';
SET CLUSTER SETTING
Time: 78ms total (execution 78ms / network 0ms)

root@localhost:26257/defaultdb&amp;gt; set cluster setting sql.trace.log_statement_execute=on;
SET CLUSTER SETTING
Time: 41ms total (execution 41ms / network 0ms)

root@localhost:26257/defaultdb&amp;gt; set tracing=on;
SET TRACING
Time: 0ms total (execution 0ms / network 0ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
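<p>&lt;p&gt;Before heading to the Datadog portal, you can sanity-check the recording locally from the same SQL session. This is a minimal sketch; the query against &lt;strong&gt;system.users&lt;/strong&gt; is just an arbitrary statement to generate spans:&lt;/p&gt;</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@localhost:26257/defaultdb&amp;gt; select count(*) from system.users; -- any query generates trace spans while tracing is on

root@localhost:26257/defaultdb&amp;gt; show trace for session; -- inspect the current session's recording locally

root@localhost:26257/defaultdb&amp;gt; set tracing=off; -- stop recording once you've confirmed spans are flowing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;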



&lt;h4&gt;
  
  
  Reference
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/docs/stable/set-vars.html#set-tracing" rel="noopener noreferrer"&gt;CockroachLabs Docs: Set Tracing&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET TRACING changes the trace recording state of the current session. A trace recording can be inspected with the SHOW TRACE FOR SESSION statement.
SET TRACING=off;  -- Trace recording is disabled.
SET TRACING=cluster;  -- Trace recording is enabled; distributed traces are collected.
SET TRACING=on;  -- Same as cluster.
SET TRACING=kv;  -- Same as cluster except that "kv messages" are collected instead of regular trace messages.
SET TRACING=results; -- Result rows and row counts are copied to the session trace. This must be specified in order for the output of a query to be printed in the session trace.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  testing the environment
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Launch CockroachDB&lt;/li&gt;
&lt;li&gt;Open the Datadog Agent local admin console (default address is &lt;a href="http://127.0.0.1:5002" rel="noopener noreferrer"&gt;http://127.0.0.1:5002&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;Under &lt;strong&gt;Checks&lt;/strong&gt; --&amp;gt; &lt;strong&gt;Checks Summary&lt;/strong&gt; you should see all green checkmarks.

&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x02qydio2grub2kz439.png" alt="Checks Summary for Datadog Agent" width="800" height="435"&gt;In my installation, I have 2 CockroachDB instances configured, but only one is running.


&lt;/li&gt;
&lt;li&gt;Sign into your Datadog HQ account&lt;/li&gt;
&lt;li&gt;Left hand navigation bar: APM --&amp;gt; Traces

&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknyreix1bly472iaqcib.png" alt="APM Traces" width="800" height="617"&gt;Tracing from CockroachDB emits large volumes of data. Searching for &lt;strong&gt;sql txn&lt;/strong&gt; will remove the noise and focus the trace-spans onto your queries for inspection.


&lt;/li&gt;
&lt;li&gt;Upon running a query, the full trace and spans are sent to this UI and can be dissected.

&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faob02o8cltb5d6nev8k3.png" alt="APM Traces" width="800" height="473"&gt;Visualize where (and how) time is spent processing a query, and can be used as a baseline for comparison against related queries. A typical example is to confirm range hot-spots due to long wait-times acquiring a write-latch.


&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The UI lets you explore the query history, and with capabilities such as sorting by &lt;strong&gt;duration&lt;/strong&gt;, isolating long-running statements is a quick process.&lt;/p&gt;

&lt;p&gt;By adjusting the filters, you can switch to other CockroachDB run-time activities such as liveness inspections, tracking jobs, or garbage-collector event logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  conclusion
&lt;/h2&gt;

&lt;p&gt;The tracing capabilities provided by Datadog offer a non-invasive solution for processing and visualizing logs across many deployment options. It's a fully-managed &amp;amp; integrated cloud solution that provides all the back-end instrumentation, allowing you to focus on the health of your applications.&lt;/p&gt;




&lt;h2&gt;
  
  
  terminology &amp;amp; resources
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;APM&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Application Performance Monitoring&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;OTLP&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The OpenTelemetry Protocol specification describes the encoding/transport/delivery mechanism of telemetry data between sources/collectors/backends. &lt;a href="https://opentelemetry.io/docs/concepts/what-is-opentelemetry" rel="noopener noreferrer"&gt;What is OpenTelemetry?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Datadog &amp;amp; CockroachDB integration&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Datadog Agent ingests metrics and traces in the OpenTelemetry format (OTLP), which can be produced by OpenTelemetry-instrumented applications such as CockroachDB.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;configure CRDB&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/docs/stable/datadog.html" rel="noopener noreferrer"&gt;CockroachDB cluster configuration&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;configure Datadog&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://docs.datadoghq.com/agent/guide/agent-configuration-files/?tab=agentv6v7" rel="noopener noreferrer"&gt;Client-side Datadog configuration&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.datadoghq.com/integrations/" rel="noopener noreferrer"&gt;Cloud account Datadog integration&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;trace&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Information about the sub-operations performed as part of a high-level operation (a query or a transaction). This information is internally represented as a tree of "spans", with a special "root span" representing a whole SQL transaction&lt;br&gt;
&lt;a href="https://www.cockroachlabs.com/docs/stable/set-vars.html#set-tracing" rel="noopener noreferrer"&gt;CockroachLabs Docs: Set Tracing&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>emptystring</category>
    </item>
    <item>
      <title>CockroachDB: Multi-Region ROSA using Cloud WAN</title>
      <dc:creator>Mark Zlamal</dc:creator>
      <pubDate>Fri, 14 Oct 2022 18:48:19 +0000</pubDate>
      <link>https://dev.to/world2mark/cockroachdb-multi-region-rosa-using-cloud-wan-2h39</link>
      <guid>https://dev.to/world2mark/cockroachdb-multi-region-rosa-using-cloud-wan-2h39</guid>
      <description>&lt;h2&gt;
  
  
  what is this?
&lt;/h2&gt;

&lt;p&gt;This blog dives into a distributed CockroachDB solution that is hosted on a fully-managed OpenShift platform known as ROSA.&lt;/p&gt;

&lt;p&gt;A ROSA environment is provisioned in each region, and the environments are connected together using AWS Cloud WAN (aka SD-WAN, Software-defined WAN).&lt;/p&gt;

&lt;p&gt;To support multiple regions in a cloud ecosystem, the solution is tightly coupled across &lt;em&gt;IaaS&lt;/em&gt;, &lt;em&gt;PaaS&lt;/em&gt;, and &lt;em&gt;SaaS&lt;/em&gt;. &lt;/p&gt;




&lt;h3&gt;
  
  
  IaaS, PaaS, and SaaS
&lt;/h3&gt;

&lt;p&gt;There is a natural harmony and integration across these layers. This blog will highlight the capabilities, the value-add, and the required work-effort when building enterprise-ready production environments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Cloud (&lt;em&gt;IaaS&lt;/em&gt;):&lt;/strong&gt; AWS is the cloud vendor that hosts our global environments, facilitating all the infrastructure and platform services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROSA (&lt;em&gt;PaaS&lt;/em&gt;):&lt;/strong&gt; This is the Red Hat managed OpenShift platform that sits on the AWS ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CockroachDB (&lt;em&gt;SaaS&lt;/em&gt;):&lt;/strong&gt; This is the distributed database that's deployed across multiple OpenShift (ROSA) clusters on AWS.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  background: CockroachDB deployment options
&lt;/h3&gt;

&lt;p&gt;Cockroach Labs offers CockroachDB as a &lt;strong&gt;self hosted&lt;/strong&gt; solution, and &lt;em&gt;as-a-service&lt;/em&gt; &lt;strong&gt;dedicated&lt;/strong&gt; solution (including &lt;em&gt;serverless&lt;/em&gt;), each with specific benefits tied to the desired use-cases &amp;amp; requirements.&lt;/p&gt;

&lt;h4&gt;
  
  
  self hosted
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;This relies on the expertise of customers to stand-up and operate the entire ecosystem. This means specialists making decisions about infrastructure and specialists who provision and connect it all together. So while you retain complete control across all layers, it’s no surprise that this approach is highly involved and complex from design to creation to maintenance of the entire stack.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  dedicated and as-a-service offerings
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;This turnkey option lets customers focus on their data and apps, offering many great advantages that promote fast-to-market strategies with an evolving set of capabilities including hybrid connectivity.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  where does ROSA fit in the CockroachDB landscape?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3q2zardro8r8gkdj2zu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3q2zardro8r8gkdj2zu0.png" alt="ROSA is the bridge between self-hosted and dedicated offerings"&gt;&lt;/a&gt;&lt;/p&gt;
...the bridge between &lt;b&gt;self-hosted&lt;/b&gt; and 
 &lt;b&gt;dedicated&lt;/b&gt; offerings.




&lt;p&gt;ROSA is a balanced sweet-spot that offers the flexibility of &lt;strong&gt;self hosted&lt;/strong&gt; while automating everything else. It offers the best of both worlds by abstracting-away the annoyances of infrastructure decisions and countless choices on service types. This is all done through a prescriptive creation process using internal terraform scripts that are highly optimized for the cloud of choice (AWS in this case). The result is a rapidly provisioned, ready-to-use, globally available Kubernetes platform.&lt;/p&gt;

&lt;p&gt;This middle-ground simplifies day-2 operations so you remain focused on the databases, applications, and business logic, while inheriting all the AWS resources to explore this environment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Like any containerization environment, ROSA lets you push applications, services, and software such as CockroachDB onto the platform. The key advantage that specifically benefits CockroachDB is the ease of scaling the system. By scaling I mean true resizing of the cloud environment, ranging from physical or virtual hardware all the way to the workloads themselves.  This is possible because ROSA, being a cloud-native offering, is tightly integrated with AWS. You can start small and operate a cost-effective database solution. Should there be demand for more storage or performance, this end-to-end scaling can be accomplished in short order.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  ...to the AWS environment
&lt;/h3&gt;

&lt;p&gt;When provisioning a ROSA Kubernetes cluster, something very exciting happens...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ivtjgn3ntb32p5brt3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ivtjgn3ntb32p5brt3c.png" alt="Detailed ROSA Architecture"&gt;&lt;/a&gt;&lt;/p&gt;
Architecture: In less than 1 hour, ROSA becomes a complete, ready-to-use, scalable, highly-available solution ready for containerized apps such as CockroachDB. &lt;a href="https://console.redhat.com/openshift" rel="noopener noreferrer"&gt;Red Hat OpenShift provisioning console&lt;/a&gt;




&lt;p&gt;Through built-in terraform services, over two dozen services and resources are provisioned within AWS. Everything is secured, pre-configured, and inter-connected, including worker nodes, infrastructure nodes, and master nodes. These server nodes are all accessible through load-balancers and network routing tables; gateways are established, IP addresses are defined, and access-controls are mapped out through defined security groups. The best part? It’s all laid out in the AWS cloud console, providing full visibility and control over this entire ecosystem.&lt;/p&gt;

&lt;h4&gt;
  
  
  ...so what does this mean for CockroachDB?
&lt;/h4&gt;

&lt;p&gt;The good news is that we already provide extensive documentation and guides on Kubernetes and OpenShift deployment using Operators and Helm charts, including collections of YAML specs and fragments.  This makes ROSA a sweet-spot for rapidly standing-up a CockroachDB database in the cloud.&lt;br&gt;
&lt;a href="https://www.cockroachlabs.com/docs/v22.1/deploy-cockroachdb-with-kubernetes-openshift.html" rel="noopener noreferrer"&gt;Deploy CockroachDB on Red Hat OpenShift&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.cockroachlabs.com/docs/v22.1/orchestrate-cockroachdb-with-kubernetes-multi-cluster.html?filters=eks" rel="noopener noreferrer"&gt;Orchestrate CockroachDB Across Multiple Kubernetes Clusters&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/cockroachdb/cockroach/tree/master/cloud/kubernetes" rel="noopener noreferrer"&gt;YAML Fragments on GitHub&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.cockroachlabs.com/docs/stable/scale-cockroachdb-kubernetes.html" rel="noopener noreferrer"&gt;Scaling using operators, manual configs, helm charts&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  ...to the NETWORKING!
&lt;/h3&gt;

&lt;p&gt;Networking challenges for multi-region deployments aren’t specific to ROSA; they’re a general challenge in cloud computing under any environment. I chose ROSA because it’s the path of least resistance: full featured, highly modernized, easy to deploy, and scalable.  The clusters sit on VPCs, VPCs are hosted in single regions, and regions belong to the global AWS Cloud ecosystem; this is a common theme in all cases.&lt;/p&gt;
&lt;h4&gt;
  
  
  legacy approaches: Transit Gateways &amp;amp; VPC Peering
&lt;/h4&gt;

&lt;p&gt;In the following diagrams you’ll see the use of transit gateways or VPC peering to establish connectivity, which is now considered a legacy approach. These examples highlight the growing challenges, complexity, and risk as a cluster extends across regions.&lt;/p&gt;
&lt;h4&gt;
  
  
  legacy: networking across 2 regions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp76jsc5x8hp9fh2zjtu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp76jsc5x8hp9fh2zjtu4.png" alt="2 cluster networking"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This diagram represents the simplest solution where &lt;b&gt;transit gateways&lt;/b&gt; or &lt;b&gt;VPC peering&lt;/b&gt; is used to establish connectivity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  legacy: networking across 3 regions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl6vbycf85oxdyyyl9oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl6vbycf85oxdyyyl9oy.png" alt="3 cluster networking"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When a 3&lt;sup&gt;rd&lt;/sup&gt; region is added, each routing table must be updated to reflect the new IP range and destination.  Challenges around networking continue since all regions need explicit tables defined to see the other regions.  It’s a game of continuous maintenance of routing tables across all the clusters, either via VPC peering or Transit-gateways.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  legacy: networking across 4 regions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpntxssgyde2k1afksmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpntxssgyde2k1afksmv.png" alt="4 cluster networking"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;By adding a 4&lt;sup&gt;th&lt;/sup&gt; region you quickly see the complexity growing (almost exponentially), since each cluster must have explicit access to the others. Every peering connection governs its IP range; this range points to the transit gateway, which in turn connects to a network of other transit gateways distributed across regions.  Everything must be perfectly defined, and it becomes a tedious &amp;amp; highly error-prone process since a single typo could take out the entire ecosystem. To make matters worse, the addition of this new cluster requires you to visit all operational/production/live clusters and update their routing tables with this new VPC connection.  Even in the AWS portal, navigating across the resources to find every field becomes a complicated and unmanageable mess that I can’t even fully depict here.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  AWS Recommendation: &lt;b&gt;Cloud WAN&lt;/b&gt;
&lt;/h2&gt;

&lt;p&gt;I had a meeting with AWS staff, and they recommended the use of the Cloud WAN services instead of transit gateways or VPC peering.  Transit Gateways are still the fundamental building block for communication between VPCs, in fact the underlying physical AWS WAN architecture continues to use transit gateways, but with added intelligence and UI tools to make it very easy to use and economical as you expand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt8xng5dpn7zdp6vp9rm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt8xng5dpn7zdp6vp9rm.png" alt="Cloud WAN"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same solution as above, using the AWS Software-Defined WAN. Immediately you see that it’s a flat network connecting all the clusters using the global AWS backbone. This picture shows the ENTIRETY of the implementation and network configuration. You only need a single route that points to a core network, and the SD-WAN manages all the connectivity across VPCs, across edges, regions, and partitions.  That’s it. The AWS management console is a single pane-of-glass graphical UI providing visibility across the entire AWS network.&lt;/p&gt;

&lt;p&gt;The best part is that when you add a new cluster, you don’t have to make any changes to the existing clusters or infrastructure.  The only rule that applies to all implementations: each VPC must have a unique IP address range so that it can be propagated across the global network.&lt;/p&gt;
&lt;h4&gt;
  
  
  option 1: public Postgres endpoints
&lt;/h4&gt;

&lt;p&gt;Here is an end product for a traditional 3-region CockroachDB cluster, allowing users to pick &amp;amp; choose, or be assigned to, the closest CockroachDB edge location with the lowest latency. All inter-node communication is hidden behind the SD-WAN, while direct access to the database ports is handled by regional load balancers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubfquqizh1grr8pwvvmj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubfquqizh1grr8pwvvmj.png" alt="CockroachDB with Public Load Balancers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  option 2: private Postgres endpoints, public-facing colocated apps
&lt;/h4&gt;

&lt;p&gt;Some customers do not want to expose the actual CockroachDB Postgres connection interfaces; instead they publish application endpoints that can be firewalled and protected. These visible connections point to APIs, UIs, and mobile-app services, while the data processors and apps all sit behind the firewall, on the same subnet as the multi-region CockroachDB instance. VPC architectures like this are often referred to as secure landing zones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cbe0x64qjm9m73gepc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cbe0x64qjm9m73gepc2.png" alt="CockroachDB in a private environment"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Secure Landing Zones
&lt;/h4&gt;

&lt;p&gt;These secure landing zones are entire platforms that operate under the umbrella of AWS compliance controls, security, and encryption. You inherit all the governance rules for networking, user permissions, and app permissions that protect these workloads. Data-hungry apps such as analytics/OLAP or CDC workloads are prime candidates as colocated apps in these environments.  Because these workloads are on the same subnet, effectively all services are zero network hops away from CockroachDB. Data can be consumed, perhaps using follower reads or ranges pinned to the physical nodes in the subnet. You’re literally benefiting from the performance of a local area network with no ingress or egress charges, limitations, or packet loss due to unreliable networks. At the risk of being controversial: we can squeeze out additional CockroachDB performance by running the entire cluster in insecure mode, because nothing is accessible outside the secure landing zone.&lt;/p&gt;


&lt;h3&gt;
  
  
  connect our VPCs together: Cloud WAN on AWS
&lt;/h3&gt;

&lt;p&gt;The first step is to create a global network. This defines your environment that will manage all the connections, policies, routes, and data metrics.&lt;/p&gt;

&lt;p&gt;In my project I created a "core network", selecting 2 regions that will be connected (us-east-2, us-west-2). You can always add and remove regions across the entire AWS ecosystem.&lt;/p&gt;
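&lt;p&gt;If you prefer the CLI over the console, the same two steps can be sketched with the &lt;strong&gt;networkmanager&lt;/strong&gt; commands (I used the console for my build, so treat the description and the policy filename as assumptions to verify against the AWS CLI reference):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws networkmanager create-global-network \
    --description "CRDB multi-region demo"

aws networkmanager create-core-network \
    --global-network-id GLOBAL_NETWORK_ID \
    --policy-document file://core-policy.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;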

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt3u5k8qo8atprw4e137.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt3u5k8qo8atprw4e137.png" alt="AWS network backbone: us-west-2 to us-east-2"&gt;&lt;/a&gt;&lt;/p&gt;
AWS network backbone: us-west-2 to us-east-2



&lt;h4&gt;
  
  
  my Cloud WAN policy
&lt;/h4&gt;

&lt;p&gt;The next step is to establish a network policy for this new core network. The policy governs the rights and extended capabilities of the connections: regions, segments, routes, and the conditions under which an attachment is accepted. I left most of this at the defaults; an example policy JSON that works for me is below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "version": "2021.12",
  "core-network-configuration": {
    "vpn-ecmp-support": true,
    "asn-ranges": [
      "64512-65534"
    ],
    "edge-locations": [
      {
        "location": "us-east-2"
      },
      {
        "location": "us-west-2"
      }
    ]
  },
  "segments": [
    {
      "name": "PrimarySegment",
      "edge-locations": [
        "us-east-2",
        "us-west-2"
      ],
      "require-attachment-acceptance": false
    }
  ],
  "attachment-policies": [
    {
      "rule-number": 100,
      "condition-logic": "and",
      "conditions": [
        {
          "type": "any"
        }
      ],
      "action": {
        "association-method": "constant",
        "segment": "PrimarySegment"
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  establish Cloud WAN connections
&lt;/h4&gt;

&lt;p&gt;This final step is where the actual interfacing to the VPCs takes place (and you'll see that VPN, Transit Gateways, and other site-to-site options are provided).&lt;/p&gt;
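&lt;p&gt;For completeness, the console flow can also be expressed as a VPC-attachment CLI call (all IDs/ARNs are placeholders; I performed this step in the console, so treat this as an untested sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws networkmanager create-vpc-attachment \
    --core-network-id CORE_NETWORK_ID \
    --vpc-arn arn:aws:ec2:us-east-2:ACCOUNT_ID:vpc/VPC_ID \
    --subnet-arns arn:aws:ec2:us-east-2:ACCOUNT_ID:subnet/PRIVATE_SUBNET_ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;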

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F121ts99wvj7m3ojns2li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F121ts99wvj7m3ojns2li.png" alt="Cloud WAN connections"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;By selecting a region where your VPC is provisioned (along with ROSA on that VPC), you pick the VPC and &lt;strong&gt;private&lt;/strong&gt; subnet that ROSA provisioned. It's the private subnet that all the workers and servers are attached to.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  back at the regional VPCs...
&lt;/h4&gt;

&lt;p&gt;Find your ROSA VPC (region-specific) and go into its private subnet. The subnet has a link to the &lt;strong&gt;Route Table&lt;/strong&gt; associated with it. Here is where you establish the network route between the subnet and the global network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm9oegku2hm39i2tnafl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm9oegku2hm39i2tnafl.png" alt="Subnet routing tables"&gt;&lt;/a&gt;&lt;/p&gt;
All traffic in the range 10.x.x.x will enter SD-WAN, except for the &lt;b&gt;local&lt;/b&gt; traffic.
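&lt;p&gt;The same route can be added from the CLI, since &lt;strong&gt;ec2 create-route&lt;/strong&gt; accepts a core network ARN as the target (IDs are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 create-route \
    --route-table-id ROUTE_TABLE_ID \
    --destination-cidr-block 10.0.0.0/8 \
    --core-network-arn CORE_NETWORK_ARN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;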



&lt;blockquote&gt;
&lt;p&gt;This process needs to be done for each ROSA VPC in every region. Note that all ROSA VPCs must have unique CIDR ranges. In my demo I defined 3 ROSA clusters with &lt;strong&gt;local&lt;/strong&gt; CIDR ranges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10.100.0.0 (cluster 1, above example screenshot)&lt;/li&gt;
&lt;li&gt;10.110.0.0 (cluster 2)&lt;/li&gt;
&lt;li&gt;10.120.0.0 (cluster 3)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;


&lt;h4&gt;
  
  
  back at the Cloud WAN portal...
&lt;/h4&gt;

&lt;p&gt;You can now verify the routes across each edge location to validate that the network propagation is complete and active. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygd6wo7s8ycw45fgr7g0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygd6wo7s8ycw45fgr7g0.png" alt="Routes with all the CIDR destinations"&gt;&lt;/a&gt;&lt;/p&gt;
Routes with all the CIDR destinations.




&lt;p&gt;The topology trees and maps can be explored for a graphical representation of your networks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk6xirfclhw664mgqagju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk6xirfclhw664mgqagju.png" alt="Network topology tree down to the VPCs"&gt;&lt;/a&gt;&lt;/p&gt;
Network topology tree down to the VPCs



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl7hsclh6z4bf8daxekv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl7hsclh6z4bf8daxekv.png" alt="Network topology map showing the connections"&gt;&lt;/a&gt;&lt;/p&gt;
Network topology map showing the connections.




&lt;h4&gt;
  
  
  test connectivity between regions
&lt;/h4&gt;

&lt;p&gt;The key to success is ensuring that every worker node in every ROSA cluster can reach the others.&lt;br&gt;
I create a dummy pod &lt;strong&gt;on each cluster/each region&lt;/strong&gt; just to get a terminal window from which I can run &lt;strong&gt;curl&lt;/strong&gt; commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: dummy-curl-pod
spec:
  containers:
    - name: dummy-curl-pod
      image: curlimages/curl
      command: [ "sh", "-c"]
      args:
      - while true; do
          echo -en '\n';
          printenv MY_NODE_NAME MY_HOST_IP MY_POD_NAME;
          sleep 20;
        done;
      env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: MY_HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
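&lt;p&gt;Assuming the manifest above is saved as &lt;strong&gt;dummy-curl-pod.yaml&lt;/strong&gt; (the filename is my choice), creating and entering the pod uses standard commands; on ROSA you can substitute &lt;strong&gt;oc&lt;/strong&gt; for &lt;strong&gt;kubectl&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f dummy-curl-pod.yaml
kubectl exec -it dummy-curl-pod -- sh

# ...run the curl tests, then clean up:
kubectl delete pod dummy-curl-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;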



&lt;p&gt;Issue a curl command to a compute node in a different cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/ $ curl 10.120.3.5
curl: (7) Failed to connect to 10.120.3.5 port 80 after 0 ms: Connection refused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;success!&lt;/strong&gt; This error message tells us that while there are no services listening to port 80 on that node, the server is &lt;strong&gt;&lt;em&gt;reachable&lt;/em&gt;&lt;/strong&gt;!&lt;br&gt;
This one-time test needs to be run at all edges/clusters to ensure that CockroachDB nodes can properly communicate (to at least a single node). This ensures database integrity and data replication across the entire platform.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Delete these pods after testing is done.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  from the perspective of CockroachDB...
&lt;/h3&gt;

&lt;p&gt;This blog does not dive into the installation process, but it's a typical Kubernetes-focused methodology of creating deployments, services, routes, etc (see &lt;strong&gt;&lt;em&gt;caveats&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;When CockroachDB is live, the admin console provides visibility to the entire cluster (all regions), along with latency-charts and maps of node-deployments to monitor the overall health of the database.&lt;/p&gt;

&lt;p&gt;In this example, we have 3 ROSA clusters (each hosting 3 worker nodes with CockroachDB installed), totalling 9 worker nodes.  One ROSA cluster is on the west coast (AWS Oregon), and the other two ROSA clusters are in Ohio.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys6jofsoiozpmgpx6fdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys6jofsoiozpmgpx6fdd.png" alt="Map of the ROSA-deployed CockroachDB service across 2 regions"&gt;&lt;/a&gt;Map of the ROSA-deployed CockroachDB service across 2 regions&lt;/p&gt;




&lt;p&gt;Investigating the latency: there is approximately 52 ms between the regional edges, while ROSA clusters in the same data center see sub-millisecond latency.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxdalfcreif1rkit26a4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxdalfcreif1rkit26a4.png" alt="Latency chart across all 9 worker nodes"&gt;&lt;/a&gt;Latency chart across all 9 worker nodes.&lt;/p&gt;




&lt;h3&gt;
  
  
  conclusion
&lt;/h3&gt;

&lt;p&gt;I can't cover every detail of the architecture, design, and deployment, since each of these topics and sub-topics is a discipline of its own, ideally delivered in an in-person working session with Q&amp;amp;A and conversation rather than in a blog.&lt;br&gt;
This solution is merely a proof of concept that allows me to leverage Red Hat and AWS resources to the maximum extent in a reliable and repeatable environment.&lt;/p&gt;




&lt;h3&gt;
  
  
  caveats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes services, deployments, persistent volume claims, secrets, and load balancers need to be used across each ROSA cluster. These are found on &lt;a href="https://github.com/cockroachdb/cockroach/tree/master/cloud/kubernetes" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://www.cockroachlabs.com/docs/v22.1/cockroach-start" rel="noopener noreferrer"&gt;cockroach start join&lt;/a&gt; syntax must specify proper server (worker-node) IPs that are part of the ROSA clusters.&lt;/li&gt;
&lt;li&gt;CockroachDB certificates must be created for secure environments.&lt;/li&gt;
&lt;li&gt;ROSA EC2 instances (e.g. worker nodes) are inaccessible by default, governed by the access-control-list &lt;strong&gt;inbound rules&lt;/strong&gt; and &lt;strong&gt;outbound rules&lt;/strong&gt;.  You will need to adjust them to allow traffic from the other CIDR ranges.  For convenience I've been known to allow 0.0.0.0/0, all ports, both ways in VPCs, since the IP ranges are virtual and not publicly accessible.  &lt;strong&gt;I would love to know what your thoughts and concerns are on this.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
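&lt;p&gt;To illustrate the join-syntax caveat above, a single node's startup could be sketched as follows (the IPs, certificate directory, and region label are illustrative; consult the &lt;strong&gt;cockroach start&lt;/strong&gt; documentation for your version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cockroach start \
  --certs-dir=certs \
  --advertise-addr=10.100.3.5 \
  --join=10.100.3.5,10.110.3.5,10.120.3.5 \
  --locality=region=us-east-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;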




&lt;h3&gt;
  
  
  references
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://console.redhat.com/openshift/" rel="noopener noreferrer"&gt;Red Hat OpenShift provisioning console&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/aws-cloud-wan-and-aws-transit-gateway-migration-and-interoperability-patterns" rel="noopener noreferrer"&gt;Migration to SD-WAN from TGW&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aws.amazon.com/cloud-wan" rel="noopener noreferrer"&gt;Cloud Wan Product Overview&lt;/a&gt;&lt;br&gt;
&lt;a href="https://ngoyal16.medium.com/vpc-peering-or-transit-gateway-b0f1176874f" rel="noopener noreferrer"&gt;VPC Peering vs Transit Gateways&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.redhat.com/blog/using-the-multus-cni-in-openshift" rel="noopener noreferrer"&gt;MULTUS CNI in OpenShift&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/k8snetworkplumbingwg/multus-cni/blob/master/docs/quickstart.md" rel="noopener noreferrer"&gt;MULTUS CNI GitHub Quickstart&lt;/a&gt;&lt;br&gt;
&lt;a href="https://catalog.redhat.com/software/operators/detail/5ec54aa3535cb70ab8c02996" rel="noopener noreferrer"&gt;Red Hat Advanced Cluster Management for Kubernetes&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aws.amazon.com/blogs/containers/red-hat-openshift-service-on-aws-architecture-and-networking" rel="noopener noreferrer"&gt;AWS Architecture on ROSA (MZRs)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://thechief.io/c/editorial/7-advantages-openshift-over-kubernetes" rel="noopener noreferrer"&gt;7 Advantages Of OpenShift Over Kubernetes&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/vpc/latest/cloudwan/cloudwan.pdf" rel="noopener noreferrer"&gt;For the tech-savvy and initiated (Cloud WAN pdf)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.globenewswire.com/news-release/2021/02/18/2178094/0/en/Global-Application-Container-Market-2021-to-2026-Growth-Trends-COVID-19-Impact-and-Forecasts.html" rel="noopener noreferrer"&gt;Cloud App Trends)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.enterprisestorageforum.com/software/containerization-market" rel="noopener noreferrer"&gt;Containerization Market&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.enterprisestorageforum.com/software/containerization-trends" rel="noopener noreferrer"&gt;Containerization Trends&lt;/a&gt;&lt;br&gt;
&lt;a href="https://enterprisersproject.com/article/2022/1/5-kubernetes-trends-watch-2022" rel="noopener noreferrer"&gt;Growth in managed services: ROSA&lt;/a&gt;&lt;br&gt;
&lt;a href="https://connect.redhat.com/blog/forrester-red-hat-partner-openshift-tei-study" rel="noopener noreferrer"&gt;Forrester: Red Hat partnerships for OpenShift&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cockroachdb</category>
      <category>aws</category>
      <category>openshift</category>
    </item>
    <item>
      <title>Docker Solution: CockroachDB with Grafana Logging &amp; Monitoring</title>
      <dc:creator>Mark Zlamal</dc:creator>
      <pubDate>Thu, 18 Aug 2022 00:51:00 +0000</pubDate>
      <link>https://dev.to/world2mark/docker-solution-cockroachdb-with-grafana-logging-monitoring-27dd</link>
      <guid>https://dev.to/world2mark/docker-solution-cockroachdb-with-grafana-logging-monitoring-27dd</guid>
      <description>&lt;h2&gt;
  
  
  what is this?
&lt;/h2&gt;

&lt;p&gt;A prescriptive approach to deploying CockroachDB with integrated logging, monitoring, and alerting through Grafana.&lt;/p&gt;

&lt;h2&gt;
  
  
  what's special here?
&lt;/h2&gt;

&lt;p&gt;This blog provides an overview of the containerized database environment that leverages key services such as &lt;em&gt;Prometheus&lt;/em&gt;, &lt;em&gt;Fluentd&lt;/em&gt;, &lt;em&gt;Loki&lt;/em&gt;, and a handful of supporting components bundled into a &lt;u&gt;single package&lt;/u&gt;. This package is hosted on GitHub, pre-configured using typical settings and can be easily adjusted to match your environment.&lt;/p&gt;

&lt;p&gt;

&lt;strong&gt;&lt;a href="https://github.com/cockroachlabs/Docker-CRDB" rel="noopener noreferrer"&gt;github.com/cockroachlabs/Docker-CRDB&lt;/a&gt;&lt;/strong&gt;

&lt;/p&gt;

&lt;p&gt;This &lt;strong&gt;Docker-CRDB&lt;/strong&gt; GitHub repository provides the core project files while this article highlights the components and how the overall solution fits together. &lt;/p&gt;

&lt;h2&gt;
  
  
  your takeaways?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Coverage of the components and technologies used in this CockroachDB / Logging solution.&lt;/li&gt;
&lt;li&gt;Overview of the architecture, how everything is tied together within Docker. No need to be a subject matter expert in these areas.&lt;/li&gt;
&lt;li&gt;A pre-built reference solution that you can download and make your own.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhjmzmidg27rpqfk42sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhjmzmidg27rpqfk42sj.png" alt="Logging components used here"&gt;&lt;/a&gt;&lt;/p&gt;
Technologies and components leveraged in this blog



&lt;h3&gt;
  
  
  logging is complicated
&lt;/h3&gt;

&lt;p&gt;It's complicated because there are many ways to architect a solution and many ways to deploy it. It can be containerized and orchestrated through Kubernetes, deployed as your own self-managed cloud solution, or run on Grafana’s paid subscription for their integrated cloud-managed services.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;em&gt;good news&lt;/em&gt;&lt;/strong&gt; is that there is extensive documentation into logging, configuration, workshops, technologies, approaches, repositories, fragments, tips, and tricks.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;em&gt;bad news&lt;/em&gt;&lt;/strong&gt; is that there is extensive documentation into logging, configuration, workshops, technologies, approaches, repositories, fragments, tips, and tricks.&lt;/p&gt;

&lt;p&gt;In my journey I wasn't looking for deep dives into these topics; I just wanted a reference solution that's already pre-built, pre-configured, and operational with a minimal number of integration steps.&lt;/p&gt;

&lt;p&gt;At the same time I don't want automation through Ansible or Terraform, since they can hide key integration aspects, potentially taking away from my understanding of how everything is tied together.&lt;/p&gt;

&lt;p&gt;This project is deployed and operated as a set of independent and interacting Docker containers, where each container runs a single image that manages a single task. No orchestration through Kubernetes, and everything runs locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  ...to the architecture
&lt;/h2&gt;

&lt;p&gt;This solution is divided into 2 perspectives using a Docker bridge network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5fy0oove8qyi5kc77j8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5fy0oove8qyi5kc77j8.png" alt="Core Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;a href="https://github.com/cockroachlabs/Docker-CRDB/blob/main/images/architecture-3-Node.png" rel="noopener noreferrer"&gt;Core architecture (GitHub)&lt;/a&gt;




&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The top half (public network) represents access to pre-configured endpoints that the host machine can connect to in order to interact with the platform.  Typical clients are browsers, workload apps, CockroachDB clients, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The lower half (private network) is the isolated and containerized 'sandbox' of apps and services that interconnect using virtualized network ports and hosts within the framework of the Docker and the Docker bridge network.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  the public network
&lt;/h3&gt;

&lt;p&gt;The upper half is the public network where we can interact with the database and all available logging services. This is effectively the host machine and all accessible endpoints to workloads outside of Docker. All the ports shown here (eg: 3000, 9090, 8080, 26257, …) are defined in the settings as defaults in this project, and must be available host-ports when this project is deployed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you have any conflicts, say another unrelated project/app/workload uses one or more of these ports, then you can adjust the port number through the provided configuration and docker-compose files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  the private network
&lt;/h3&gt;

&lt;p&gt;The lower half is the private network, under the umbrella of Docker running on the host. You can see all the pieces connected in a networked interaction chain that processes (sources/sinks) logging data, which is eventually consumed by the user.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Port conflicts typically do not occur in the private network because each container is treated as a virtualized host within Docker. The key example here is the set of 3 fluentd instances that all listen on port 5170; because each is treated as a unique host (and hostname), you can distinguish between them and connect accordingly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each orange box is a container running a single image of the highlighted component, and these all run inside a bridge network. The bridge is a private Docker network, conceptually similar to a virtual private cloud. Every running container acts as a virtual host, a distinct service whose ports can be exposed across the bridge network.  Each container (i.e. host) has visibility to all other containers within this network, but only to the exposed ports defined in the docker-compose configuration file.  Outside of this private network, none of these exposed ports are visible or accessible to the host machine running Docker.&lt;/p&gt;

&lt;h2&gt;
  
  
  data and log flow
&lt;/h2&gt;

&lt;p&gt;Starting at the right-hand side, we have a containerized 3-node CockroachDB cluster. Normally the database sends logs directly into the cockroach-data/logs folder, but here it’s configured to use fluentd as our log sink, and this is the first step in the chain.&lt;/p&gt;
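&lt;p&gt;CockroachDB can redirect its logs to a network sink via a logging YAML passed with &lt;strong&gt;--log-config-file&lt;/strong&gt;. A minimal sketch of a Fluentd-compatible sink, with the hostname and port taken from this project’s defaults (adjust to your compose settings), might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sinks:
  fluent-servers:
    ops:
      channels: [OPS, HEALTH]
      address: fluentd:5170
      net: tcp
      format: json-fluent-compact
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;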

&lt;h3&gt;
  
  
  so what’s Fluentd?
&lt;/h3&gt;

&lt;p&gt;It’s an open source data collector. It unifies log collection and consumption in a formatted, tagged, buffered, consistent way across all your applications. Fluentd can then save this structured data back into the filesystem, or as the basis of this project, Fluentd sends the formatted data via local-networking to Loki, and this is the next piece of the puzzle.&lt;/p&gt;
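&lt;p&gt;On the Fluentd side, the pipeline boils down to a TCP/JSON source and a Loki output (the &lt;em&gt;@type loki&lt;/em&gt; output comes from the &lt;strong&gt;fluent-plugin-grafana-loki&lt;/strong&gt; plugin; hostnames and the tag are this project’s defaults and my own choices, so treat this as a sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;source&amp;gt;
  @type tcp
  port 5170
  tag crdb
  &amp;lt;parse&amp;gt;
    @type json
  &amp;lt;/parse&amp;gt;
&amp;lt;/source&amp;gt;

&amp;lt;match crdb.**&amp;gt;
  @type loki
  url "http://loki:3100"
  &amp;lt;label&amp;gt;
    job crdb
  &amp;lt;/label&amp;gt;
&amp;lt;/match&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;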

&lt;h3&gt;
  
  
  ...and Loki?
&lt;/h3&gt;

&lt;p&gt;Loki is a scalable, multi-tenant log aggregator and time-series database. It’s similar to Prometheus but specifically designed for text-based analytics: indexing, searching, scanning, and querying inside Grafana.&lt;/p&gt;

&lt;h3&gt;
  
  
  what about Prometheus?
&lt;/h3&gt;

&lt;p&gt;In parallel with the above log flow, there is another network link from CockroachDB directly to the Prometheus container. This connection carries the operational metrics that let us run queries on CockroachDB statistics, usage, SQL, data volumes, contention rates, etc.&lt;/p&gt;
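&lt;p&gt;CockroachDB exposes its metrics at &lt;strong&gt;/_status/vars&lt;/strong&gt; on the HTTP port, so the Prometheus side is a simple scrape job (the &lt;em&gt;roach-N&lt;/em&gt; target names are assumptions standing in for this project’s container hostnames):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_configs:
  - job_name: cockroachdb
    metrics_path: /_status/vars
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - roach-0:8080
          - roach-1:8080
          - roach-2:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;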

&lt;h3&gt;
  
  
  destination: Grafana!
&lt;/h3&gt;

&lt;p&gt;Finally, in Grafana we define data sources that read from both Loki and Prometheus; this is the final sink for our logs.&lt;/p&gt;

&lt;p&gt;As mentioned earlier we have a handful of endpoints exposed to the public network, notably from Cockroach and Grafana so we can access their fancy UI and run logging queries against the CockroachDB nodes.&lt;/p&gt;
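&lt;p&gt;Grafana can pick both sources up automatically through a provisioning file (a minimal sketch; the URLs assume the bridge-network hostnames &lt;em&gt;loki&lt;/em&gt; and &lt;em&gt;prometheus&lt;/em&gt; with their default ports):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;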

&lt;h2&gt;
  
  
  alerting
&lt;/h2&gt;

&lt;p&gt;The alerting framework is separated from the core architecture because it's an optional capability and requires API/Service keys from 3rd party cloud services. In our example, when a trigger in Grafana is activated, it calls a NodeJS endpoint that issues an API call to Twilio and SendGrid. These services send live email and global SMS messages to a recipient, notifying them of this alert with context.&lt;/p&gt;

&lt;p&gt;In Grafana, the alert is configured to monitor Prometheus metrics. When the threshold value is reached, Grafana triggers the alert and calls a webhook, sending a JSON payload that contains the properties of the alert plus custom fields such as API-key information and recipient details.&lt;/p&gt;

&lt;p&gt;The NodeJS application listening on this alerting endpoint (webhook URL &amp;amp; port) formats the data into a human-readable form, leveraging email templates and SMS formatting, and sends the new payload to Twilio through their API services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzu41vjen5drgxfcz7tu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzu41vjen5drgxfcz7tu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://github.com/cockroachlabs/Docker-CRDB/blob/main/images/alerting.png" rel="noopener noreferrer"&gt;Alerting architecture (GitHub)&lt;/a&gt;



&lt;h3&gt;
  
  
  alerting key capabilities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring and alerting using quantitative metrics from Prometheus, triggered when thresholds are reached or exceeded.&lt;/li&gt;
&lt;li&gt;Monitoring and alerting using events from text/string/JSON logs as our triggers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  the GitHub repository
&lt;/h2&gt;

&lt;p&gt;The folders represent all the operational containers running the complete solution. Specifically, there is a 1:1 relationship between each folder on GitHub and each container in the architecture diagram; this is intentional, making it easy to learn, and to locate and adjust any component independently of all the others.&lt;/p&gt;

&lt;h3&gt;
  
  
  configuration files &amp;amp; settings
&lt;/h3&gt;

&lt;p&gt;The architecture diagram highlights the configuration file that governs each aspect of this platform; these files can be found in the corresponding GitHub folders.  While everything is wired together right out of the box using typical values, you have full control over all the settings and communication.  This facilitates an easy and flexible integration into an existing CockroachDB environment, and as a bonus you get to learn how all the pieces work together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz2lixqp0yraeeharzhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz2lixqp0yraeeharzhl.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
GitHub folders and associated docker files



&lt;p&gt;The key aspect of this organization is that each folder contains a docker-compose YAML file that defines the container: hostnames, the image to use, and which ports to expose and map publicly. Running &lt;strong&gt;docker-compose up -d&lt;/strong&gt; in a folder builds or pulls the image, applies the properties in the spec, stores the image in your local Docker cache, and launches the container in your Docker environment (the &lt;strong&gt;-f&lt;/strong&gt; flag lets you point at a non-default compose file).&lt;/p&gt;
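
&lt;p&gt;As an illustration, each folder's compose file follows this general shape (the service name, image tag, and port mappings shown here are placeholders, not the repo's actual values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:
  crdb-node01:
    image: cockroachdb/cockroach:latest
    hostname: crdb-node01
    ports:
      - "26257:26257"
      - "8080:8080"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;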

&lt;h2&gt;
  
  
  implementation details
&lt;/h2&gt;

&lt;p&gt;The GitHub repo covers the remaining details such as tools/prerequisites, establishing a Docker bridge network, and creating certificates for CockroachDB.&lt;/p&gt;

&lt;p&gt;Cheat-sheets, fragments, and command shortcuts are included with example values to help stand-up the environment quickly.&lt;/p&gt;

&lt;p&gt;Finally the startup sequence of containers and a listing of the endpoints are given for convenience.  Note that the endpoints have defined ports based on the default settings of this project.&lt;/p&gt;

&lt;h3&gt;
  
  
  running CockroachDB and Grafana
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;We need to create data-source connections from our log sources into Grafana. According to the architecture diagram, we’ll establish the Loki and Prometheus endpoints in the Grafana UI as shown in this screenshot:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8zcq2d8h3o2faotg3ub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8zcq2d8h3o2faotg3ub.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
data source definitions in the Grafana UI



&lt;ol start="2"&gt;
&lt;li&gt;We need to define a default &lt;strong&gt;&lt;em&gt;contact point&lt;/em&gt;&lt;/strong&gt; which includes a webhook URL to the &lt;strong&gt;&lt;em&gt;alerts&lt;/em&gt;&lt;/strong&gt; container:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyiwpbdlvg376hrbuofb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyiwpbdlvg376hrbuofb.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
Contact-point &amp;amp; webhook definition



&lt;ol start="3"&gt;
&lt;li&gt;Define the alert parameters that are sent as a payload to the &lt;strong&gt;&lt;em&gt;alerts&lt;/em&gt;&lt;/strong&gt; container. Note that the SMS_Key field must have the format '&amp;lt;SID&amp;gt;:&amp;lt;Secret&amp;gt;'.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvau9rl6o6w935x1w4y6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvau9rl6o6w935x1w4y6x.png" alt="Image description"&gt;&lt;/a&gt;Alert container parameters as required by the alerting NodeJS app&lt;/p&gt;
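
&lt;p&gt;On the NodeJS side, that &lt;strong&gt;&lt;em&gt;SID:Secret&lt;/em&gt;&lt;/strong&gt; convention can be split apart roughly as follows (a hypothetical helper, not the repo's actual parsing code):&lt;/p&gt;

```javascript
// Split the 'SID:Secret' credential pair out of the SMS_Key field.
// indexOf is used instead of split so a secret containing ':' survives intact.
function parseSmsKey(smsKey) {
  const sep = smsKey.indexOf(':');
  if (sep === -1) {
    throw new Error('SMS_Key must have the format SID:Secret');
  }
  return {
    sid: smsKey.slice(0, sep),
    secret: smsKey.slice(sep + 1),
  };
}
```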

&lt;h2&gt;
  
  
  Example Log Queries
&lt;/h2&gt;

&lt;p&gt;Below are a few example queries that you can test out against your CockroachDB cluster. &lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus log query examples
&lt;/h3&gt;

&lt;p&gt;A: Alerting Rule Query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(sql_distsql_contended_queries_count{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[3m:10s])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;B: When this rate is &amp;gt; 0.373&lt;/p&gt;
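
&lt;p&gt;Taken together, A and B amount to a single threshold condition, which can also be written as one PromQL expression (shown here for illustration; in Grafana the threshold is typically configured on the alert rule itself):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(sql_distsql_contended_queries_count{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[3m:10s]) &amp;gt; 0.373
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;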

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9elb8bqmrmf52y13bxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9elb8bqmrmf52y13bxv.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
This example shows the activity and the triggered alert during our load test.



&lt;p&gt;The CRDB admin console provides basic views of the Prometheus data, such as "SQL Statements" (queries) and "SQL Statement Contention". These charts can be replicated in Grafana using the following queries:&lt;/p&gt;

&lt;h4&gt;
  
  
  SQL Statements (queries)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(sql_update_count{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[1m:2s])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  SQL Statement Contention
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(sql_txn_contended_count{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[1m:1s])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftm0fvc0bo3c4cfmttnj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftm0fvc0bo3c4cfmttnj8.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
Escalating contention across two connections to CockroachDB.



&lt;p&gt;Other example queries that offer interesting views into the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(sql_mem_distsql_current{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[5m:10s])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;admission_admitted_kv{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(admission_admitted_kv{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[2m:10s])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(sql_insert_count_internal{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[3m:10s])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sql_contention_resolver_retries{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(sql_contention_resolver_retries{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[3m:10s])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sql_stats_mem_current{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sql_contention_resolver_queue_size{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Loki log query examples
&lt;/h3&gt;

&lt;p&gt;Loki digs into the CockroachDB logs folder, capturing all the text-based messages that the database emits. This service is necessary to capture connectivity issues, gossip-protocol updates, system events, and other activity in a distributed database.&lt;/p&gt;

&lt;p&gt;Exact-case string search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{job=~"CRDB01|CRDB02|CRDB03"} |= "circuitbreaker"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Case-insensitive search using a regex line-filter expression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{job=~"CRDB01|CRDB02|CRDB03"} |~ "(?i)CircuitBREAKER"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Capture log lines that mention both circuit breakers and connection (issues):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{job=~"CRDB01|CRDB02|CRDB03"} |~ "(?i)Circuitbreaker" |~"connection"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rate of circuit-breaker and gossip log lines over 50-second and 20-second windows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate( ( {job=~"CRDB01|CRDB02|CRDB03"} |~ "(?i)Circuitbreaker")[50s] ) 
rate( ( {job=~"CRDB01|CRDB02|CRDB03"} |~ "(?i)gossip")[20s] )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6ji7vik70m1jdyk8roj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6ji7vik70m1jdyk8roj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
This screenshot illustrates a sample query capturing the word “fail” (case-insensitive) and the trend-chart of occurrences, followed by the actual log-file entries with text-highlighting.



&lt;h2&gt;
  
  
  GitHub Repo URL
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/cockroachlabs/Docker-CRDB" rel="noopener noreferrer"&gt;github.com/cockroachlabs/Docker-CRDB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/docs/v22.1/configure-logs.html" rel="noopener noreferrer"&gt;https://www.cockroachlabs.com/docs/v22.1/configure-logs.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.cockroachlabs.com/docs/v22.1/logging-use-cases.html" rel="noopener noreferrer"&gt;https://www.cockroachlabs.com/docs/v22.1/logging-use-cases.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.cockroachlabs.com/docs/v22.1/log-formats.html" rel="noopener noreferrer"&gt;https://www.cockroachlabs.com/docs/v22.1/log-formats.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/cockroachlabs/a-gentle-introduction-to-cockroachdb-logging-and-fluentd-4mn9"&gt;https://dev.to/cockroachlabs/a-gentle-introduction-to-cockroachdb-logging-and-fluentd-4mn9&lt;/a&gt;&lt;br&gt;
&lt;a href="https://grafana.com/docs/loki/latest/logql/query_examples" rel="noopener noreferrer"&gt;https://grafana.com/docs/loki/latest/logql/query_examples&lt;/a&gt;&lt;br&gt;
&lt;a href="https://grafana.com/blog/2021/05/05/how-to-search-logs-in-loki-without-worrying-about-the-case" rel="noopener noreferrer"&gt;https://grafana.com/blog/2021/05/05/how-to-search-logs-in-loki-without-worrying-about-the-case&lt;/a&gt;&lt;br&gt;
&lt;a href="https://megamorf.gitlab.io/cheat-sheets/loki" rel="noopener noreferrer"&gt;https://megamorf.gitlab.io/cheat-sheets/loki&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cockroachdb</category>
      <category>docker</category>
      <category>grafana</category>
    </item>
  </channel>
</rss>
