Chaos testing a Postgres cluster managed by CloudNativePG

As more organizations move their databases to cloud-native environments, effectively managing and monitoring these systems becomes crucial. According to Coroot’s anonymous usage statistics, 64% of projects use PostgreSQL, making it the most popular RDBMS among our users, compared to 14% using MySQL. This is not surprising since it is also the most widely used open-source database worldwide.

Kubernetes is more than a platform for running containerized applications. It also enables better management of databases by allowing automation of tasks like backups, high availability, and scaling through its operator framework. This provides a management experience similar to using a managed service like AWS RDS but without vendor lock-in and often at a lower cost.

CloudNativePG is an open-source operator originally created by EDB, the oldest and biggest Postgres vendor worldwide. Like other operators, CNPG helps manage PostgreSQL databases on Kubernetes, covering the entire operational lifecycle from initial deployment to ongoing maintenance. It is worth mentioning that CNPG is the youngest Postgres operator on the market, but its open-source traction is growing rapidly, and based on my observations it is the favorite operator among Reddit users.

In this post I’ll install a CNPG cluster in my lab, instrument it with Coroot Community (open source), then generate some load and introduce some failures to ensure high availability and observability.

Setting up the cluster

Installing the CloudNativePG operator is simple:

helm repo add cnpg https://cloudnative-pg.github.io/charts
helm upgrade --install cnpg cnpg/cloudnative-pg


To deploy a cluster, create a Kubernetes custom resource:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised
  storage:
    size: 30Gi   
  postgresql:
    shared_preload_libraries: [pg_stat_statements]
    parameters:
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
  managed:
    roles:
    - name: coroot
      ensure: present
      login: true
      connectionLimit: 2
      inRoles:
      - pg_monitor
      passwordSecret:
        name: pg-cluster
---
apiVersion: v1
data:
  username: ******==
  password: *********==
kind: Secret
metadata:
  name: pg-cluster
type: kubernetes.io/basic-auth


Installing Coroot

In this post, I’ll be using the open source Community Edition of Coroot. Here are the commands to install the Coroot Operator for Kubernetes along with all Coroot components:

helm repo add coroot https://coroot.github.io/helm-charts
helm repo update coroot
helm install -n coroot --create-namespace coroot-operator coroot/coroot-operator
helm install -n coroot coroot coroot/coroot-ce


To access Coroot, I’m forwarding the Coroot UI port to my local machine. For production deployments the operator can create an Ingress.

kubectl port-forward -n coroot service/coroot-coroot 8083:8080
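
If you prefer to manage the Ingress yourself rather than through the operator, a minimal sketch pointing at the same UI service could look like the following. The hostname and ingress class are placeholders; only the namespace, service name, and port come from the commands above.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: coroot
  namespace: coroot
spec:
  ingressClassName: nginx            # placeholder: use your ingress class
  rules:
    - host: coroot.example.com       # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: coroot-coroot  # the UI service used in the port-forward above
                port:
                  number: 8080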

In the UI, we can see two applications: the operator (cnpg-cloudnative-pg) and our Postgres cluster (pg-cluster). Coroot has also identified that pg-cluster is a Postgres database and suggests integrating Postgres monitoring.

The Kubernetes approach to monitoring databases typically involves running metric exporters as sidecar containers within database instance Pods. However, this method can be challenging in some cases. For example, CNPG doesn’t support running custom sidecar containers, and its CNPG-I plugin interface requires specific plugin support and is still experimental.

To address these limitations, Coroot has a dedicated coroot-cluster-agent that can discover and gather metrics from databases without requiring a separate container for each database instance. To configure this integration, simply use the credentials of the database role already created for Coroot. Click on “Postgres” in the Coroot UI and then on the “Configure” button.

Next, provide the credentials configured for Coroot in the cluster specification. Coroot’s cluster-agent will then collect Postgres metrics from each instance in the cluster.

It feels a bit dull without any load or issues. Let’s add an application that interacts with this database.

I deployed a simple application called “app” that executes approximately 600 queries per second: 300 on the primary and 300 across both replicas.
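
The application itself isn’t important for this experiment. If you want to reproduce a similar load without writing any code, pgbench against the services CNPG creates for the cluster (pg-cluster-rw for the primary, pg-cluster-ro for the replicas) is a rough stand-in. The database name, user, and secret below assume the default CNPG bootstrap (database and user both named app) and the default namespace.

# Password for the default "app" user, from the secret CNPG generates
export PGPASSWORD=$(kubectl get secret pg-cluster-app -o jsonpath='{.data.password}' | base64 -d)

# One-off Pod to initialize the pgbench tables on the primary
kubectl run pgbench-init --rm -it --restart=Never --image=postgres:16 \
  --env=PGPASSWORD=$PGPASSWORD -- pgbench -i -h pg-cluster-rw -U app app

# Read/write load against the primary (pg-cluster-rw)
kubectl run pgbench-rw --restart=Never --image=postgres:16 \
  --env=PGPASSWORD=$PGPASSWORD -- pgbench -h pg-cluster-rw -U app -c 4 -T 600 app

# Read-only load spread across the replicas (pg-cluster-ro)
kubectl run pgbench-ro --restart=Never --image=postgres:16 \
  --env=PGPASSWORD=$PGPASSWORD -- pgbench --select-only -h pg-cluster-ro -U app -c 4 -T 600 app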

I believe that any observability solution must be tested against failures to ensure that, if a problem occurs, we will be able to quickly identify the root cause. So, let’s introduce some failures.

Failure #1: CPU noisy neighbor

In shared infrastructures like Kubernetes clusters, applications often compete for resources. Let’s simulate a scenario with a noisy neighbor, where a CPU-intensive application runs on the same node as our database instance. The following Job will create a Pod with stress-ng on node100:

apiVersion: batch/v1
kind: Job
metadata:
  name: cpu-stress
spec:
  template:
    metadata:
      labels:
        app: cpu-stress
    spec:
      nodeSelector:
        kubernetes.io/hostname: node100
      containers:
        - name: stress-ng
          image: debian:bullseye-slim
          command:
            - "/bin/sh"
            - "-c"
            - |
              apt-get update && 
              apt-get install -y stress-ng && 
              stress-ng --cpu 0 --timeout 300s
      restartPolicy: Never

As we can see, our “noisy neighbor” has affected Postgres performance. Now, let’s assume we don’t know the root cause and use Coroot to identify the issue.

Using the CPU Delay chart, we can observe that pg-cluster-2 is experiencing a CPU time shortage. Why? Because node100 is overloaded. And why is that? The cpu-stress application has consumed all available CPU time.
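
You can confirm the same thing from the command line, assuming metrics-server is installed in the cluster:

# Node-level CPU usage on the affected node
kubectl top node node100

# The stress-ng Pod created by the Job above, matched by its label
kubectl top pod -l app=cpu-stress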

Failure #2: Postgres Locks

Now, let’s explore a Postgres-specific failure scenario. We’ll run a suboptimal schema migration on our articles table, which contains 10 million rows:

ALTER TABLE articles ALTER COLUMN body SET NOT NULL;

For those who aren’t deeply familiar with databases: this migration takes an exclusive lock on the entire table while Postgres scans it to verify that the body column contains no NULL values. Since the table is relatively large, the migration can take some time to complete, and during this period queries from our app are forced to wait until the lock is released.
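
As a side note, on recent Postgres versions the same change can be made with only a brief lock window by going through a CHECK constraint. This is a sketch of the safer pattern, not what I ran in this experiment:

-- Creating the constraint as NOT VALID holds the exclusive lock only briefly,
-- because the table is not scanned at this point
ALTER TABLE articles ADD CONSTRAINT articles_body_not_null CHECK (body IS NOT NULL) NOT VALID;

-- Validation scans the table but only takes a SHARE UPDATE EXCLUSIVE lock,
-- so reads and writes keep flowing
ALTER TABLE articles VALIDATE CONSTRAINT articles_body_not_null;

-- On PostgreSQL 12+ SET NOT NULL can rely on the validated constraint
-- instead of re-scanning the table
ALTER TABLE articles ALTER COLUMN body SET NOT NULL;
ALTER TABLE articles DROP CONSTRAINT articles_body_not_null;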

Let’s interpret these charts together: The Postgres latency of pg-cluster-2 has significantly increased. Many SELECT and INSERT queries are locked by another query. Which one? The ALTER TABLE query. Why is this query taking so long to execute? Because it is performing I/O operations to verify that the body column in each row is not NULL.

As you can see, having the right metrics was crucial in this scenario. For instance, simply knowing the number of Postgres locks wouldn’t help us identify the specific query holding the lock.
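
Without these metrics, you would have to catch the situation live in psql, for example with a query along these lines using pg_blocking_pids():

-- Sessions that are currently blocked, together with the query blocking them
SELECT blocked.pid     AS blocked_pid,
       blocked.query   AS blocked_query,
       blocking.pid    AS blocking_pid,
       blocking.query  AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));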

Failure #3: primary Postgres instance failure

Now, let’s see how CloudNativePG handles a primary instance failure. To simulate this failure, I’ll simply delete the Pod of the primary Postgres instance.

kubectl delete pod pg-cluster-2
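
To watch the failover, you can check which instance is the current primary before and after deleting the Pod. The first command below requires the cnpg kubectl plugin; the others assume the default namespace.

# Cluster overview, including the current primary
kubectl cnpg status pg-cluster

# Or read the current primary straight from the Cluster resource
kubectl get cluster pg-cluster -o jsonpath='{.status.currentPrimary}{"\n"}'

# Watch the Pods while the operator promotes a replica
kubectl get pods -w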

(That's all that fits within the character limit here. To view the last experiment, visit our blog.)
