Franck Pachot

CloudNativePG - install (1.28) and first test: transient failure

I'm starting a series of blog posts to explore CloudNativePG (CNPG), a Kubernetes operator for PostgreSQL that automates high availability in containerized environments.

PostgreSQL itself supports physical streaming replication, but doesn’t provide orchestration logic — no automatic promotion, scaling, or failover. Tools like Patroni fill that gap by implementing consensus (etcd, Consul, ZooKeeper, Kubernetes, or Raft) for cluster state management. In Kubernetes, databases are often deployed with StatefulSets, which provide stable network identities and persistent storage per instance. CloudNativePG instead defines PostgreSQL‑specific CustomResourceDefinitions (CRDs), which introduce the following resources:

  • ImageCatalog: PostgreSQL image catalogs
  • Cluster: PostgreSQL cluster definition
  • Database: Declarative database management
  • Pooler: PgBouncer connection pooling
  • Backup: On-demand backup requests
  • ScheduledBackup: Automated backup scheduling
  • Publication: Logical replication publications
  • Subscription: Logical replication subscriptions
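
Once the operator is installed (next section), a quick way to confirm these resource types are registered is to list the API resources of the postgresql.cnpg.io group:

# List the custom resource types registered by the CloudNativePG operator
kubectl api-resources --api-group=postgresql.cnpg.io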

Install: control plane for PostgreSQL

Here I’m using CNPG 1.28, the first release to support quorum-based failover. Prior versions promoted the most recently available standby without preventing data loss (good for disaster recovery, but not strict high availability).

Install the operator’s components:

kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.28/releases/cnpg-1.28.0.yaml


This installs the CRDs (which are cluster-scoped) and deploys the controller into the cnpg-system namespace. Check the rollout status:

kubectl rollout status deployment -n cnpg-system cnpg-controller-manager

deployment "cnpg-controller-manager" successfully rolled out

This Deployment defines the CloudNativePG Controller Manager — the control plane component — which runs as a single pod and continuously reconciles PostgreSQL cluster resources with their desired state via the Kubernetes API:

kubectl get deployments -n cnpg-system -o wide

NAME                      READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                                         SELECTOR
cnpg-controller-manager   1/1     1            1           11d   manager      ghcr.io/cloudnative-pg/cloudnative-pg:1.28.0   app.kubernetes.io/name=cloudnative-pg


The pod’s manager container listens on ports for metrics (8080/TCP) and the admission webhook server (9443/TCP), and interacts with CNPG’s CRDs during the reconciliation loop:

kubectl describe deploy -n cnpg-system cnpg-controller-manager

Name:                   cnpg-controller-manager
Namespace:              cnpg-system
CreationTimestamp:      Thu, 15 Jan 2026 21:04:25 +0100
Labels:                 app.kubernetes.io/name=cloudnative-pg
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app.kubernetes.io/name=cloudnative-pg
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/name=cloudnative-pg
  Service Account:  cnpg-manager
  Containers:
   manager:
    Image:           ghcr.io/cloudnative-pg/cloudnative-pg:1.28.0
    Ports:           8080/TCP (metrics), 9443/TCP (webhook-server)
    Host Ports:      0/TCP (metrics), 0/TCP (webhook-server)
    SeccompProfile:  RuntimeDefault
    Command:
      /manager
    Args:
      controller
      --leader-elect
      --max-concurrent-reconciles=10
      --config-map-name=cnpg-controller-manager-config
      --secret-name=cnpg-controller-manager-config
      --webhook-port=9443
    Limits:
      cpu:     100m
      memory:  200Mi
    Requests:
      cpu:      100m
      memory:   100Mi
    Liveness:   http-get https://:9443/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get https://:9443/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Startup:    http-get https://:9443/readyz delay=0s timeout=1s period=5s #success=1 #failure=6
    Environment:
      OPERATOR_IMAGE_NAME:           ghcr.io/cloudnative-pg/cloudnative-pg:1.28.0
      OPERATOR_NAMESPACE:             (v1:metadata.namespace)
      MONITORING_QUERIES_CONFIGMAP:  cnpg-default-monitoring
    Mounts:
      /controller from scratch-data (rw)
      /run/secrets/cnpg.io/webhook from webhook-certificates (rw)
  Volumes:
   scratch-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
   webhook-certificates:
    Type:          Secret (a volume populated by a Secret)
    SecretName:    cnpg-webhook-cert
    Optional:      true
  Node-Selectors:  <none>
  Tolerations:     <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   cnpg-controller-manager-6b9f78f594 (1/1 replicas created)
Events:          <none>
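
Since the manager exposes Prometheus metrics on port 8080, a quick sanity check is to port-forward it and scrape a few lines; a minimal sketch, assuming the standard /metrics path:

# Forward the operator's metrics port locally (run the curl in another terminal)
kubectl -n cnpg-system port-forward deploy/cnpg-controller-manager 8080:8080
curl -s http://localhost:8080/metrics | head -20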

Deploy: data plane (PostgreSQL cluster)

The control plane handles orchestration logic. The actual PostgreSQL instances — the data plane — are managed via CNPG’s Cluster custom resource.

Create a dedicated namespace:

kubectl delete namespace lab
kubectl create namespace lab

namespace/lab created

Here’s a minimal high-availability cluster spec:

  • 3 instances: 1 primary, 2 hot standby replicas
  • Synchronous commit to 1 replica
  • Quorum-based failover enabled

cat > lab-cluster-rf3.yaml <<'YAML'
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cnpg
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any
      number: 1
      failoverQuorum: true
  storage:
    size: 1Gi
YAML

kubectl -n lab apply -f lab-cluster-rf3.yaml


CNPG provisions Pods with stateful semantics, using PersistentVolumeClaims for storage:

kubectl -n lab get pvc -o wide

NAME     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
cnpg-1   Bound    pvc-76754ba4-e8bd-4218-837f-36aa0010940f   1Gi        RWO            hostpath       <unset>                 42s   Filesystem
cnpg-2   Bound    pvc-3b231dcc-b973-43f8-a429-80222bd51420   1Gi        RWO            hostpath       <unset>                 26s   Filesystem
cnpg-3   Bound    pvc-b8e4c6a0-bbcb-445d-9267-ffe38a1a8685   1Gi        RWO            hostpath       <unset>                 10s   Filesystem

These PVCs bind to PersistentVolumes provided by their storage class:

kubectl -n lab get pv -o wide 

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM        STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE   VOLUMEMODE
pvc-3b231dcc-b973-43f8-a429-80222bd51420   1Gi        RWO            Delete           Bound    lab/cnpg-2   hostpath       <unset>                          53s   Filesystem
pvc-76754ba4-e8bd-4218-837f-36aa0010940f   1Gi        RWO            Delete           Bound    lab/cnpg-1   hostpath       <unset>                          69s   Filesystem
pvc-b8e4c6a0-bbcb-445d-9267-ffe38a1a8685   1Gi        RWO            Delete           Bound    lab/cnpg-3   hostpath       <unset>                          37s   Filesystem

The PostgreSQL instances run in pods:

kubectl -n lab get pod -o wide

NAME     READY   STATUS    RESTARTS   AGE     IP           NODE             NOMINATED NODE   READINESS GATES
cnpg-1   1/1     Running   0          3m46s   10.1.0.141   docker-desktop   <none>           <none>
cnpg-2   1/1     Running   0          3m29s   10.1.0.143   docker-desktop   <none>           <none>
cnpg-3   1/1     Running   0          3m13s   10.1.0.145   docker-desktop   <none>           <none>

In Kubernetes, pods are typically interchangeable, but a PostgreSQL cluster has a single primary while the other instances serve as read replicas. CNPG identifies which pod runs the primary instance:

kubectl -n lab get cluster      

NAME   AGE   INSTANCES   READY   STATUS                     PRIMARY
cnpg    4m   3           3       Cluster in healthy state   cnpg-1
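
If the cnpg kubectl plugin is installed, it gives a richer view of the same information (roles, replication state, sync status); for example:

# Detailed cluster status from the CNPG plugin
kubectl cnpg status cnpg -n lab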

As pod roles can change with a switchover or failover, applications connect through services that expose the right instances:

kubectl -n lab get svc -o wide

NAME      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE     SELECTOR
cnpg-r    ClusterIP   10.97.182.192    <none>        5432/TCP   4m13s   cnpg.io/cluster=cnpg,cnpg.io/podRole=instance
cnpg-ro   ClusterIP   10.111.116.164   <none>        5432/TCP   4m13s   cnpg.io/cluster=cnpg,cnpg.io/instanceRole=replica
cnpg-rw   ClusterIP   10.108.19.85     <none>        5432/TCP   4m13s   cnpg.io/cluster=cnpg,cnpg.io/instanceRole=primary

Those are the endpoints used to connect to PostgreSQL:

  • cnpg-rw connects to the primary for consistent reads and writes
  • cnpg-ro connects to one standby for stale reads
  • cnpg-r connects to the primary or a standby for stale reads

Read workloads are load-balanced round-robin across the matching instances, much like a client-side host list, so the same workload runs on all replicas.
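
From a pod inside the cluster, these services resolve through the usual Kubernetes DNS names; a minimal sketch, assuming the default app database and app user that CNPG creates (password from the cnpg-app secret shown below):

# Role-aware endpoints via service DNS (<service>.<namespace>.svc)
psql "host=cnpg-rw.lab.svc port=5432 dbname=app user=app"   # primary: reads and writes
psql "host=cnpg-ro.lab.svc port=5432 dbname=app user=app"   # standbys only: stale reads
psql "host=cnpg-r.lab.svc port=5432 dbname=app user=app"    # any instance: stale reads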

Client access setup

CNPG generated credentials in a Kubernetes Secret named cnpg-app for the user app:

kubectl -n lab get secrets

NAME               TYPE                       DATA   AGE
cnpg-app           kubernetes.io/basic-auth   11     8m48s
cnpg-ca            Opaque                     2      8m48s
cnpg-replication   kubernetes.io/tls          2      8m48s
cnpg-server        kubernetes.io/tls          2      8m48s

When needed, the password can be retrieved with kubectl -n lab get secret cnpg-app -o jsonpath='{.data.password}' | base64 -d.
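
The same secret stores other connection details among its 11 keys; assuming the default key names used by CNPG, a ready-made connection URI can be read the same way:

# Read the full connection URI stored in the application secret
kubectl -n lab get secret cnpg-app -o jsonpath='{.data.uri}' | base64 -d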

Define a shell alias to launch a PostgreSQL client pod with these credentials:

alias pgrw='kubectl -n lab run client --rm -it --restart=Never \
 --env PGHOST="cnpg-rw" \
 --env PGUSER="app" \
 --env PGPASSWORD="$(kubectl -n lab get secret cnpg-app -o jsonpath='{.data.password}' | base64 -d)" \
 --image=postgres:18 --'


Use the alias pgrw to run a PostgreSQL client connected to the primary.
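
For example, a quick smoke test confirms it lands on the primary (pg_is_in_recovery() returns false there):

# Should print the server version and "f" (not in recovery)
pgrw psql -c "select version(), pg_is_in_recovery()"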

PgBench default workload

With the previous alias defined, initialize PgBench tables:


pgrw pgbench -i

dropping old tables...
creating tables...
generating data (client-side)...
vacuuming...                                                                              
creating primary keys...
done in 0.10 s (drop tables 0.02 s, create tables 0.01 s, client-side generate 0.04 s, vacuum 0.01 s, primary keys 0.01 s).
pod "client" deleted from lab namespace

Run for 10 minutes with progress every 5 seconds:

pgrw pgbench -T 600 -P 5

progress: 5.0 s, 1541.4 tps, lat 0.648 ms stddev 0.358, 0 failed
progress: 10.0 s, 1648.6 tps, lat 0.606 ms stddev 0.154, 0 failed
progress: 15.0 s, 1432.7 tps, lat 0.698 ms stddev 0.218, 0 failed
progress: 20.0 s, 1581.3 tps, lat 0.632 ms stddev 0.169, 0 failed
progress: 25.0 s, 1448.2 tps, lat 0.690 ms stddev 0.315, 0 failed
progress: 30.0 s, 1640.6 tps, lat 0.609 ms stddev 0.155, 0 failed
progress: 35.0 s, 1609.9 tps, lat 0.621 ms stddev 0.223, 0 failed

Simulated failure

In another terminal, I checked which pod was the primary:

kubectl -n lab get cluster      

NAME   AGE   INSTANCES   READY   STATUS                     PRIMARY
cnpg   40m   3           3       Cluster in healthy state   cnpg-1

From the Docker Desktop GUI, I paused the container in the primary's pod.
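
The command-line equivalent, assuming Docker Desktop's container naming that embeds the pod name, looks like this (the exact container ID will differ):

# Find the PostgreSQL container backing the cnpg-1 pod, then freeze its processes
docker ps --filter name=cnpg-1 --format '{{.ID}} {{.Names}}'
docker pause <container-id>
# ...and later resume it
docker unpause <container-id>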

PgBench queries hung because the primary they were connected to stopped replying.

The pod then recovered and PgBench continued without being disconnected.

Kubernetes monitors pod health with liveness and readiness probes and restarts containers when the liveness probe fails. In this case, Kubernetes—not CNPG—restored the service.
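
One way to see this from the Kubernetes side is the container's restart count and the pod's recent events:

# Restart count of the container in the former primary's pod
kubectl -n lab get pod cnpg-1 -o jsonpath='{.status.containerStatuses[0].restartCount}'
# Probe failures and restarts are recorded as events
kubectl -n lab get events --field-selector involvedObject.name=cnpg-1 --sort-by=.lastTimestamp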

Meanwhile, CNPG, which monitors PostgreSQL independently, had already triggered a failover before Kubernetes restarted the pod:

kubectl -n lab get cluster

NAME   AGE    INSTANCES   READY   STATUS         PRIMARY
cnpg   3m6s   3           2       Failing over   cnpg-1

Kubernetes brought the service back in about 30 seconds, but CNPG had already initiated a failover, so a second outage was on its way.
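
To follow the role changes as they happen, the cluster resource and the role labels on the pods can be watched:

# Watch the cluster status and instance roles switch in real time
kubectl -n lab get cluster -w
kubectl -n lab get pods -L cnpg.io/instanceRole -w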

A few minutes later, cnpg-1 restarted and PgBench exited with:

WARNING:  canceling the wait for synchronous replication and terminating connection due to administrator command
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
pgbench: error: client 0 aborted in command 10 (SQL) of script 0; perhaps the backend died while processing

Because cnpg-1 was still present and healthy, it remained the primary, but all client connections were terminated.

Observations

This test shows how PostgreSQL and Kubernetes interact under CloudNativePG. Kubernetes pod health checks and CloudNativePG’s failover logic each run their own control loop:

  • Kubernetes restarts containers when liveness probes fail (readiness failures only remove the pod from service endpoints).
  • CloudNativePG (CNPG) evaluates database health using replication state, quorum, and instance manager connectivity.

Pausing the container briefly triggered CNPG’s primary isolation check. When the primary loses contact with both the Kubernetes API and other cluster members, CNPG shuts it down to prevent split-brain. Timeline:

  • T+0s — Primary paused. CNPG detects isolation.
  • T+30s — Kubernetes restarts the container.
  • T+180s — CNPG triggers failover.
  • T+275s — Primary shutdown terminates client connections.
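
The sequence can be reconstructed after the fact from the operator log and the events in the namespace; a sketch, assuming the default deployment name and the lab namespace used here:

# Failover-related entries in the operator log over the test window
kubectl -n cnpg-system logs deploy/cnpg-controller-manager --since=15m | grep -iE 'failover|fenc|isolat'
# Cluster-level events recorded in the lab namespace
kubectl -n lab get events --sort-by=.lastTimestamp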

Because CNPG and Kubernetes act on different timelines, the original pod restarted as primary (“self-failover”) when no replica was a better promotion candidate. CNPG prioritizes data integrity over fast recovery and, without a consensus protocol like Raft, relies on:

  • Kubernetes API state
  • PostgreSQL streaming replication
  • Instance manager health checks

This can cause false positives under transient faults but protects against split-brain. Reproducible steps:
https://github.com/cloudnative-pg/cloudnative-pg/discussions/9814

Cloud systems can fail in many ways. In this test, I used docker pause to freeze processes and simulate a primary that stops responding to clients and health checks. This mirrors a previous test I did with Yugabyte: YugabyteDB Recovery Time Objective (RTO) with PgBench: continuous availability with max. 15s latency on infrastructure failure

This post starts a CNPG series where I will also cover failures like network partitions and storage issues, and the connection pooler.
