<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jesper Axelsen</title>
    <description>The latest articles on DEV Community by Jesper Axelsen (@jaxels10).</description>
    <link>https://dev.to/jaxels10</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F585473%2F16287ee2-45e0-4f2e-a919-d5eaee113cf4.jpg</url>
      <title>DEV Community: Jesper Axelsen</title>
      <link>https://dev.to/jaxels10</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jaxels10"/>
    <language>en</language>
    <item>
      <title>The simple, yet powerful testing framework hidden in FluxCD</title>
      <dc:creator>Jesper Axelsen</dc:creator>
      <pubDate>Tue, 30 Sep 2025 17:45:18 +0000</pubDate>
      <link>https://dev.to/jaxels10/the-secret-powerful-and-simple-testing-framework-hidden-in-fluxcd-35df</link>
      <guid>https://dev.to/jaxels10/the-secret-powerful-and-simple-testing-framework-hidden-in-fluxcd-35df</guid>
      <description>&lt;h1&gt;
  
  
  Testing Deployments in Kubernetes with FluxCD and Helm
&lt;/h1&gt;

&lt;p&gt;Throughout my career, I’ve helped set up numerous production-grade Kubernetes clusters. While there are many problem domains in that area, today I’ll be focusing on &lt;strong&gt;testing&lt;/strong&gt;.  &lt;/p&gt;




&lt;h2&gt;
  
  
  The Challenge of Testing in Kubernetes
&lt;/h2&gt;

&lt;p&gt;Testing in Kubernetes has never been easy. Questions such as &lt;em&gt;“What kind of testing would you like to do?”&lt;/em&gt; or &lt;em&gt;“When should the tests run?”&lt;/em&gt; are difficult to answer — not just in Kubernetes, but in DevOps as a whole.  &lt;/p&gt;

&lt;p&gt;When using &lt;strong&gt;GitOps&lt;/strong&gt; as a deployment model on Kubernetes, one of its core benefits is &lt;strong&gt;fast deployment&lt;/strong&gt;. GitOps promises that &lt;strong&gt;committed code&lt;/strong&gt; enters a reconciliation loop, eventually resulting in scheduled workloads on, for example, a Kubernetes cluster.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Why Testing Matters in GitOps Workflows
&lt;/h2&gt;

&lt;p&gt;Since code is typically committed many times a day, any testing framework that runs alongside a deployment should be &lt;strong&gt;easy to use&lt;/strong&gt;, &lt;strong&gt;flexible&lt;/strong&gt;, and ideally &lt;strong&gt;integrated&lt;/strong&gt; into your deployment system.  &lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;FluxCD&lt;/strong&gt; comes in.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing FluxCD
&lt;/h2&gt;

&lt;p&gt;FluxCD is a delivery solution for Kubernetes that implements a &lt;strong&gt;GitOps operator&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
I won’t go into detail about GitOps operators or FluxCD in general — you can explore those topics in the &lt;a href="https://fluxcd.io/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Instead, I want to highlight a powerful feature from their &lt;a href="https://fluxcd.io/flux/components/helm/helmreleases/" rel="noopener noreferrer"&gt;&lt;code&gt;HelmRelease&lt;/code&gt; CRD&lt;/a&gt;.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Leveraging Helm Tests with FluxCD
&lt;/h2&gt;

&lt;p&gt;The idea is simple: when deploying to Kubernetes, we want to run a series of tests to ensure that our deployment is successful.  &lt;/p&gt;

&lt;p&gt;Helm provides an excellent &lt;a href="https://helm.sh/docs/topics/chart_tests/" rel="noopener noreferrer"&gt;&lt;strong&gt;testing framework&lt;/strong&gt;&lt;/a&gt;, and FluxCD integrates directly with it. In the &lt;a href="https://fluxcd.io/flux/components/helm/helmreleases/#test-configuration" rel="noopener noreferrer"&gt;test configuration&lt;/a&gt; section of the documentation, FluxCD specifically enables us to run Helm Tests as part of both &lt;strong&gt;install&lt;/strong&gt; and &lt;strong&gt;upgrade&lt;/strong&gt; processes for any &lt;code&gt;HelmRelease&lt;/code&gt;.  &lt;/p&gt;
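&lt;p&gt;As a sketch, enabling this in a &lt;code&gt;HelmRelease&lt;/code&gt; only takes a few lines (the release name, chart, and intervals below are placeholder values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: my-app          # placeholder name
spec:
  interval: 10m
  chart:
    spec:
      chart: my-app     # placeholder chart
      sourceRef:
        kind: HelmRepository
        name: my-repo   # placeholder repository
  test:
    enable: true        # run the chart's Helm tests after install and upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;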

&lt;p&gt;This means we can write tests in Helm and have them automatically executed during an install or upgrade. The deployment will only be reported as successful to the FluxCD controller once all tests pass.&lt;/p&gt;
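&lt;p&gt;On the Helm side, a test is simply a pod template annotated with the &lt;code&gt;helm.sh/hook: test&lt;/code&gt; annotation, typically placed under &lt;code&gt;templates/tests/&lt;/code&gt; in the chart. A minimal sketch, in the style of the connection test that &lt;code&gt;helm create&lt;/code&gt; scaffolds (the service name and port are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: my-app-test-connection
  annotations:
    "helm.sh/hook": test
spec:
  restartPolicy: Never
  containers:
    - name: wget
      image: busybox
      command: ["wget"]
      args: ["my-app:80"]   # placeholder service name and port
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;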

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuy0x9pk3kh5fwq6u0cgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuy0x9pk3kh5fwq6u0cgy.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Automatic Rollbacks and Safer Deployments
&lt;/h2&gt;

&lt;p&gt;This is a powerful capability — one that I haven’t seen utilized often.&lt;br&gt;&lt;br&gt;
Essentially, we can instruct our deployment controller to &lt;strong&gt;roll back&lt;/strong&gt; a deployment automatically if any tests fail.  &lt;/p&gt;

&lt;p&gt;We can also run tests during installation to verify that specific resources are created or configured correctly, ensuring that everything behaves as expected before marking the deployment as complete.&lt;/p&gt;
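&lt;p&gt;This behavior is configured through the remediation settings of the &lt;code&gt;HelmRelease&lt;/code&gt;. A sketch of the relevant fields (the retry count is an arbitrary choice):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  test:
    enable: true
  install:
    remediation:
      retries: 3            # retry a failed install up to 3 times
  upgrade:
    remediation:
      retries: 3
      strategy: rollback    # roll back when the upgrade or its tests fail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;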




&lt;h2&gt;
  
  
  Confidence and Control in Your Deployments
&lt;/h2&gt;

&lt;p&gt;Running Helm tests during every install and upgrade in FluxCD gives us a &lt;strong&gt;high level of confidence&lt;/strong&gt; that our deployment is successful.  &lt;/p&gt;

&lt;p&gt;It also allows us to define &lt;strong&gt;specific tests&lt;/strong&gt; that serve as &lt;strong&gt;rollback triggers&lt;/strong&gt; when they fail. With this approach, you gain the confidence and control that many large organizations require for production deployments.  &lt;/p&gt;

&lt;p&gt;In the environments where I’ve implemented this setup, install and upgrade errors have been reduced to &lt;strong&gt;almost zero&lt;/strong&gt;. And when an error does occur, it’s straightforward to pinpoint the issue — since a failing test will clearly indicate what went wrong.  &lt;/p&gt;

&lt;p&gt;If a deployment error slips through without any test failing, remember to &lt;strong&gt;add a test&lt;/strong&gt; covering that scenario in the future. Continuous improvement of test coverage is key to long-term reliability.  &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>coding</category>
      <category>testing</category>
    </item>
    <item>
      <title>Ceph data durability, redundancy, and how to use Ceph</title>
      <dc:creator>Jesper Axelsen</dc:creator>
      <pubDate>Fri, 12 Mar 2021 11:35:55 +0000</pubDate>
      <link>https://dev.to/itminds/ceph-data-durability-redundancy-and-how-to-use-ceph-2ml0</link>
      <guid>https://dev.to/itminds/ceph-data-durability-redundancy-and-how-to-use-ceph-2ml0</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post is the second in a series concerning Ceph.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Creating data redundancy
&lt;/h1&gt;

&lt;p&gt;One of the main concerns when dealing with large sets of data is data durability. We do not want a cluster in which a simple disk failure causes data loss. What Ceph aims for instead is fast recovery from any type of failure occurring on a specific failure domain. &lt;/p&gt;

&lt;p&gt;Ceph is able to ensure data durability by using either replication or erasure coding. &lt;/p&gt;

&lt;h2&gt;
  
  
  Replication
&lt;/h2&gt;

&lt;p&gt;For those of you who are familiar with RAID, you can think of Ceph's replication as RAID 1 but with subtle differences. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbps2b6bzwdye6fdtr597.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbps2b6bzwdye6fdtr597.png" alt="Alt Text" width="150" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data is replicated onto a number of different OSDs, nodes, or racks depending on your cluster configuration. The original data and the replicas are split into many small chunks and evenly distributed across your cluster using the CRUSH algorithm. If you have chosen three replicas on a 6-node cluster, the replica chunks will be spread across all six nodes, not concentrated on three nodes each holding a full copy. &lt;/p&gt;

&lt;p&gt;It is important to choose the right level of data replication. If you are running a single-node cluster, replication on the node level would be impossible and your cluster would lose data in the event of a single OSD failure. In this case, you would choose to replicate data across the OSDs you have available on the node. &lt;/p&gt;
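&lt;p&gt;With Rook, for example, the replication level and failure domain are set on the pool definition. A sketch of a pool for a single-node cluster, replicating across OSDs rather than hosts (assuming the Rook &lt;code&gt;CephBlockPool&lt;/code&gt; CRD; the pool name is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: single-node-pool
spec:
  failureDomain: osd   # replicate across OSDs, not nodes
  replicated:
    size: 3            # keep three copies of every object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;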

&lt;p&gt;On a multi-node cluster, your replication factor decides how many OSDs or nodes you can afford to lose in case of disk or node failure, without data loss. Of course, the replication of data introduces the problem of lowering your total amount of space available in your cluster. If you choose a replication factor of 3 on the node level, you will only have 1/3 of your total storage available in your cluster for you to use. &lt;/p&gt;

&lt;p&gt;Replication in Ceph is fast and only limited by the read/write operations of the OSDs. However, some people are not content with "only" being able to use a small amount of their total space. Therefore, Ceph also introduced erasure coding. &lt;/p&gt;
&lt;h2&gt;
  
  
  Erasure Coding
&lt;/h2&gt;

&lt;p&gt;Erasure coding encodes your original data in such a way that only a subset of the stored fragments is needed to reconstruct the original information. It splits objects into &lt;em&gt;k&lt;/em&gt; data fragments and computes &lt;em&gt;m&lt;/em&gt; parity fragments. I will provide an example. &lt;/p&gt;

&lt;p&gt;Let us say that the value of our data is 52. We could split it into: &lt;br&gt;
&lt;code&gt;x = 5&lt;/code&gt;&lt;br&gt;
&lt;code&gt;y = 2&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;The encoding process will then compute a number of parity fragments. In this example, these will be equations: &lt;br&gt;
&lt;code&gt;x + y = 7&lt;/code&gt;&lt;br&gt;
&lt;code&gt;x - y = 3&lt;/code&gt;&lt;br&gt;
&lt;code&gt;2x + y = 12&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here, we have &lt;em&gt;k = 2&lt;/em&gt; and &lt;em&gt;m = 3&lt;/em&gt;. &lt;em&gt;k&lt;/em&gt; is the number of data fragments and &lt;em&gt;m&lt;/em&gt; is the number of parity fragments. If a disk or node fails and the data needs to be recovered, we only require any two of the five stored elements (the two data fragments and the three parity fragments). For example, if only &lt;code&gt;x + y = 7&lt;/code&gt; and &lt;code&gt;x - y = 3&lt;/code&gt; survive, solving these two equations recovers &lt;code&gt;x = 5&lt;/code&gt; and &lt;code&gt;y = 2&lt;/code&gt;, and thereby the original value 52. This is what ensures data durability when using erasure coding. &lt;/p&gt;

&lt;p&gt;Now, why does this matter? It matters because these parity fragments take up significantly less space when compared to replicating the data. Here is a table that shows how much overhead there is on different erasure coding schemes. The overhead is calculated with &lt;em&gt;m / k&lt;/em&gt;. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Erasure coding scheme &lt;em&gt;(k+m)&lt;/em&gt;
&lt;/th&gt;
&lt;th&gt;Minimum number of nodes&lt;/th&gt;
&lt;th&gt;Storage overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4+2&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6+2&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8+2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6+3&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As the table shows, the &lt;em&gt;(8+2)&lt;/em&gt; scheme lets you lose two of your nodes without losing any data, while incurring only 25% storage overhead. &lt;/p&gt;

&lt;p&gt;If you look at this from a storage space optimization standpoint, this is a much better use of the storage. However, it is not without certain downsides. The parity fragments take time for the cluster to calculate and read/write operations are therefore slower than with replication. Therefore, erasure coding is usually recommended on clusters that deal with large amounts of &lt;a href="https://www.komprise.com/glossary_terms/cold-data/" rel="noopener noreferrer"&gt;cold data&lt;/a&gt;. &lt;/p&gt;
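&lt;p&gt;In Rook, an erasure coded pool is declared with the &lt;em&gt;k&lt;/em&gt; and &lt;em&gt;m&lt;/em&gt; values from the table above. A sketch of a &lt;em&gt;(4+2)&lt;/em&gt; pool (assuming the Rook &lt;code&gt;CephBlockPool&lt;/code&gt; CRD; the pool name is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ec-pool
spec:
  failureDomain: host
  erasureCoded:
    dataChunks: 4     # k: data fragments
    codingChunks: 2   # m: parity fragments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;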
&lt;h1&gt;
  
  
  Using Ceph
&lt;/h1&gt;

&lt;p&gt;A natural part of deployments on Kubernetes is to create persistent volume claims (PVCs). PVCs can claim a volume and use that as storage for data in the pod. In order to create a PVC you first need to define a StorageClass in Kubernetes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
spec:
  failureDomain: host
  replicated:
    size: 3
    requireSafeReplicaSize: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
    clusterID: rook-ceph # namespace:cluster
    pool: replicapool
    imageFormat: "2"
    imageFeatures: layering
    csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster
    csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster
    csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
    csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster
    csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
reclaimPolicy: Delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this StorageClass file, you can see that we first create a replica pool that keeps 3 replicas in total and uses &lt;code&gt;host&lt;/code&gt; as the failure domain. After that, we define whether or not volume expansion should be allowed after a volume is created, and what the reclaim policy should be. The reclaim policy determines whether the data stored in the volume is deleted or retained when the persistent volume claim is deleted. In this case, I have chosen &lt;code&gt;Delete&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kubectl get storageclass -n rook-ceph
NAME              PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
rook-ceph-block   rook-ceph.rbd.csi.ceph.com      Delete          Immediate           true                   10m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the StorageClass has been created, we can create a PVC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a PVC that is now running on our Kubernetes cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kubectl get pvc -n rook-ceph
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc       Bound    pvc-56c45f01-562f-4222-8199-43abb856ca94   1Gi        RWO            rook-ceph-block   37s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will now deploy a pod that uses this PVC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
   - name: web-server
     image: nginx
     volumeMounts:
       - name: mypvc
         mountPath: /var/lib/www/html
  volumes:
   - name: mypvc
     persistentVolumeClaim:
       claimName: pvc
       readOnly: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After deploying this pod, you can see it in the pod list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kubectl get pods -n rook-ceph
NAME              READY   STATUS    RESTARTS   AGE
demo-pod          1/1     Running   0          118s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is how you deploy pods that create persistent volume claims on your Ceph cluster!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ceph</category>
    </item>
    <item>
      <title>Deploying a Ceph cluster with Kubernetes and Rook</title>
      <dc:creator>Jesper Axelsen</dc:creator>
      <pubDate>Fri, 05 Mar 2021 12:28:49 +0000</pubDate>
      <link>https://dev.to/itminds/deploying-a-ceph-cluster-with-kubernetes-and-rook-1291</link>
      <guid>https://dev.to/itminds/deploying-a-ceph-cluster-with-kubernetes-and-rook-1291</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post is the first in a series concerning Ceph.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In a world generating ever-increasing amounts of data, the need for scalable storage solutions naturally rises. I am going to introduce you to one of these today. It is called Ceph.&lt;/p&gt;

&lt;p&gt;Ceph is an open-source software storage platform. It implements object storage on a distributed computer cluster and provides an interface for three storage types: block, object, and file. Ceph's aim is to provide a free, distributed storage platform without any single point of failure that is highly scalable and will keep your data intact.&lt;/p&gt;

&lt;p&gt;This post will go through the Ceph architecture, how to set up your own Ceph storage cluster, and discuss the architectural decisions you will inevitably have to make. We will be deploying Ceph on a Kubernetes cluster using the cloud-native storage orchestrator Rook. &lt;/p&gt;

&lt;h1&gt;
  
  
  Architecture
&lt;/h1&gt;

&lt;p&gt;First, a small introduction to Ceph's architecture. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjad8334w7b7rk9vzdrdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjad8334w7b7rk9vzdrdk.png" alt="Ceph architecture" width="693" height="490"&gt;&lt;/a&gt;&lt;br&gt;
A Ceph storage cluster can be accessed in a number of ways. &lt;/p&gt;

&lt;p&gt;First, Ceph provides the LIBRADOS library that allows you to connect directly to your storage cluster using C, C++, Java, Python, Ruby, or PHP. Ceph also allows for object storage through a REST gateway compatible with the S3 and Swift APIs. &lt;/p&gt;

&lt;p&gt;Using Kubernetes, the more common ways to use your storage cluster are to create persistent volume claims (PVCs) using .yaml files in Kubernetes or to create a POSIX-compliant distributed filesystem. &lt;/p&gt;

&lt;p&gt;Underneath all of this lies the Reliable Autonomic Distributed Object Store (RADOS). RADOS is in charge of managing the underlying daemons that are deployed with Ceph. &lt;/p&gt;

&lt;p&gt;A Ceph storage cluster has these types of daemons: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2pj675bnhwffebwzohz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2pj675bnhwffebwzohz.png" alt="Alt Text" width="750" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The object storage daemons (OSDs) handle read/write operations on the disks. They are also in charge of checking that the state of the disk is healthy and reporting back to the monitor daemons. &lt;/li&gt;
&lt;li&gt;The monitor daemons keep a copy of the cluster map and monitor the state of the cluster. These daemons are what ensure high availability if any monitor fails. You will always need an odd number of monitor daemons to keep quorum, and it is recommended to dedicate nodes for the monitor daemons to run on, separate from the storage nodes. &lt;/li&gt;
&lt;li&gt;The manager daemons create and manage a map of clients, and handle reweighting and rebalancing operations.&lt;/li&gt;
&lt;li&gt;The metadata servers manage additional metadata about the file system, specifically permissions, hierarchy, names, timestamps, and owners. &lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Deploying the cluster
&lt;/h1&gt;

&lt;p&gt;Having acquired a rudimentary understanding of Ceph, we are now ready to build our storage cluster. A basic guide on how to set up a Kubernetes cluster on Ubuntu can be found &lt;a href="https://computingforgeeks.com/how-to-setup-3-node-kubernetes-cluster-on-ubuntu-18-04-with-weave-net-cni/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. We will be deploying Ceph on a 3-node cluster where each node will have 2 available drives for Ceph to mount. To confirm that the cluster is up and running, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kubectl get nodes
NAME          STATUS   ROLES                  AGE    VERSION
k8s-master    Ready    control-plane,master   110m   v1.20.4
k8s-node-01   Ready    &amp;lt;none&amp;gt;                 105m   v1.20.4
k8s-node-02   Ready    &amp;lt;none&amp;gt;                 105m   v1.20.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rook
&lt;/h2&gt;

&lt;p&gt;As previously stated, we will be using &lt;a href="//rook.io"&gt;Rook&lt;/a&gt; as our storage orchestrator. Clone the newest version with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/rook/rook.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After cloning the repo, navigate to the right folder with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd rook/cluster/examples/kubernetes/ceph. 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, we need to create the necessary custom resource definitions (CRDs) and RoleBindings. Run the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create -f crds.yaml -f common.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I will not go through these two files as they are not relevant to the cluster configuration. &lt;/p&gt;

&lt;p&gt;Now, it is time for the Rook operator to be deployed. The Rook operator automates most of the deployment of Ceph. In this example, we will enable the Rook operator to automatically discover any empty drives, mount them, and thereby join them into the cluster as OSDs. The Rook operator is found in &lt;code&gt;operator.yaml&lt;/code&gt;. A multitude of things can be configured in the operator file. Most noteworthy is that resources can be limited, to ensure that certain parts of your cluster do not consume too many resources and slow down other parts. We will go with a standard configuration and only change the following value from &lt;code&gt;false&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: ROOK_ENABLE_DISCOVERY_DAEMON
  value: "true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This enables the operator to automatically discover the empty drives currently in the cluster, as well as any drives that might be added later, without any input from us as admins. &lt;/p&gt;

&lt;p&gt;Now deploy the Rook operator&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kubectl create -f operator.yaml
configmap/rook-ceph-operator-config created
deployment.apps/rook-ceph-operator created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should now be able to see the operator pod and the discovery pods running in the rook-ceph namespace in Kubernetes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kubectl get pods -n rook-ceph
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-678f988875-r6nc4   1/1     Running   0          83s
rook-discover-4w92b                   1/1     Running   0          41s
rook-discover-gw22p                   1/1     Running   0          41s
rook-discover-kskfx                   1/1     Running   0          41s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the operator now running, we are ready to deploy our storage cluster. The storage cluster will be created with the &lt;code&gt;cluster.yaml&lt;/code&gt; file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cluster configuration
&lt;/h2&gt;

&lt;p&gt;Before deploying a storage cluster, we need to configure the cluster's behavior. A storage solution needs to ensure that data is not lost in case of disk failure and that the system is able to recover quickly if anything was to happen. &lt;/p&gt;

&lt;p&gt;Changing the configurations in &lt;code&gt;cluster.yaml&lt;/code&gt; should be done with caution as you can introduce severe overhead into your cluster and even create a cluster without any data security, safety, or reliability. We will be going through the configurations I find relevant for someone deploying their first cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mon: 
  count: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A standard cluster will have 3 monitor daemons. The optimal number of monitor daemons has been discussed for a long time. The general consensus is that a single monitor pod will leave your cluster in an unhealthy state if the node it runs on goes down. This is obviously not a great choice if you would like to ensure any kind of data durability. The other choice could be to create 5 monitor daemons, which is often regarded as a good idea when a cluster expands to hundreds or thousands of nodes. However, since each monitor keeps an updated copy of the CRUSH map, running 5 monitors on a small cluster can slow it down. The community largely agrees that for most clusters, this should be 3. This introduces another problem, however: if we lose more than one node at the same time, we lose quorum and thereby leave the cluster in an unhealthy state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;waitTimeoutForHealthyOSDInMinutes: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have to configure how long to wait for OSDs that are still in the cluster but are non-responsive. This is set in minutes. If you set it too low, you risk that a temporarily unresponsive OSD starts a recovery process that slows down your cluster unnecessarily. However, if you wait too long, you risk permanently losing data if the other OSDs that hold the replicated data also fail. &lt;/p&gt;

&lt;p&gt;There are more things to configure in the &lt;code&gt;cluster.yaml&lt;/code&gt; file. If you would like to use the Ceph dashboard or perhaps monitor your cluster with a monitoring tool like Prometheus, you can also enable these. For now, we will leave the rest of the settings as is and deploy the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create -f operator.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
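&lt;p&gt;For reference, the dashboard and Prometheus monitoring mentioned above are toggled with a few fields in &lt;code&gt;cluster.yaml&lt;/code&gt;. A sketch (assuming the &lt;code&gt;CephCluster&lt;/code&gt; spec fields; Prometheus monitoring also requires the Prometheus operator to be installed):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  dashboard:
    enabled: true   # serve the Ceph dashboard
  monitoring:
    enabled: true   # create Prometheus monitoring resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;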



&lt;p&gt;To see the magic unfold, you can use the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;watch kubectl get pods -n rook-ceph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a couple of minutes, Kubernetes should have deployed all the necessary daemons to have your cluster up and running. You should be able to see the monitor daemons.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rook-ceph-mon-a-5588866567-vjg99                        1/1     Running     0          4m51s
rook-ceph-mon-b-9bc647c5b-fmbjf                         1/1     Running     0          4m27s
rook-ceph-mon-c-7cd784c4b7-qwwwb                        1/1     Running     0          4m1s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should also be able to see the OSDs in the cluster. There should be six of them since we have two disks on each of our nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rook-ceph-osd-0-7b884cfccb-qpqbd                        1/1     Running     0          4m49s
rook-ceph-osd-1-5d4c587cdb-bzstp                        1/1     Running     0          4m48s
rook-ceph-osd-2-857b8786bd-q8wqk                        1/1     Running     0          4m41s
rook-ceph-osd-3-443df7d8er-q9we3                        1/1     Running     0          4m41s
rook-ceph-osd-4-5d47f54f7d-tq6rd                        1/1     Running     0          4m41s
rook-ceph-osd-5-32jkjdkwk2-33jkk                        1/1     Running     0          4m41s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was it! You have now created your very own Ceph storage cluster, in which you will be able to create a distributed filesystem and Kubernetes will be able to create PVCs.&lt;/p&gt;

&lt;p&gt;This blog post will be continued next week with more on how Ceph ensures data durability and how to start using your Ceph cluster with Kubernetes. &lt;br&gt;
&lt;em&gt;To be continued...&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ceph</category>
      <category>rook</category>
      <category>storage</category>
    </item>
  </channel>
</rss>
