This blog post is the second in a series concerning Ceph.
One of the main concerns when dealing with large sets of data is data durability. We do not want a cluster in which a simple disk failure will introduce a loss in data. What Ceph aims for instead is fast recovery from any type of failure occurring on a specific failure domain.
Ceph is able to ensure data durability by using either replication or erasure coding.
The data is replicated onto a number of different OSDs, nodes, or racks depending on your cluster configuration. The original data and the replicas are split into many small chunks and evenly distributed across your cluster using the CRUSH-algorithm. If you have chosen to have three replicas on a 6-node cluster, these three replicas will be spread out onto all six nodes, not just three nodes containing the full replicas.
It is important to choose the right level of data replication. If you are running a single-node cluster, replication on the node level would be impossible and your cluster would lose data in the event of a single OSD failure. In this case, you would choose to replicate data across the OSDs you have available on the node.
On a multi-node cluster, your replication factor decides how many OSDs or nodes you can afford to lose in case of disk or node failure, without data loss. Of course, the replication of data introduces the problem of lowering your total amount of space available in your cluster. If you choose a replication factor of 3 on the node level, you will only have 1/3 of your total storage available in your cluster for you to use.
Replication in Ceph is fast and only limited by the read/write operations of the OSDs. However, some people are not content with "only" being able to use a small amount of their total space. Therefore, Ceph also introduced erasure coding.
Erasure coding encodes your original data in a way so that when you need to retrieve the data again, you only need a subset of the data to recreate the original information. It splits objects into k data fragments and then computes m parity fragments. I will provide an example.
Let us say that the value of our data is 52. We could split it into:
x = 5
y = 2
The encoding process will then compute a number of parity fragments. In this example, these will be equations:
x + y = 7
x - y = 3
2x + y = 12
Here, we have a k = 2 and m = 3. k is the number of data fragments and m is the number of parity fragments. In case of a disk or node failure and the data needs to be recovered, out of the 5 elements we will be storing (the two data fragments and the three parity fragments) we only require two of these five to recover. This is what ensures data durability when using erasure coding.
Now, why does this matter? It matters because these parity fragments take up significantly less space when compared to replicating the data. Here is a table that shows how much overhead there is on different erasure coding schemes. The overhead is calculated with m / k.
|Erasure coding scheme (k+m)||Minimum number of nodes||Storage overhead|
As we can see in the table, you can use the (8+2) scheme to make sure you can lose two of your nodes without losing any data, and this with only a 25% storage overhead.
If you look at this from a storage space optimization standpoint, this is a much better use of the storage. However, it is not without certain downsides. The parity fragments take time for the cluster to calculate and read/write operations are therefore slower than with replication. Therefore, erasure coding is usually recommended on clusters that deal with large amounts of cold data.
A natural part of deployments on Kubernetes is to create persistent volume claims (PVCs). PVCs can claim a volume and use that as storage for data in the pod. In order to create a PVC you first need to define a StorageClass in Kubernetes.
apiVersion: ceph.rook.io/v1 kind: CephBlockPool metadata: name: replicapool spec: failureDomain: host replicated: size: 3 requireSafeReplicaSize: true --- apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: rook-ceph-block provisioner: rook-ceph.rbd.csi.ceph.com parameters: clusterID: rook-ceph # namespace:cluster pool: replicapool imageFormat: "2" imageFeatures: layering csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster csi.storage.k8s.io/fstype: ext4 allowVolumeExpansion: true reclaimPolicy: Delete
In this StorageClass file, you can see that we first create a replica pool that creates 3 replicas in total and uses
host as the failure domain. After that, we define whether or not we should allow volume expansion after a volume is created and what the reclaim policy should be. Reclaim policy determines whether the data that is stored in the volume should be deleted or retained when a pod ceases to exist. In this case, I have chosen delete.
# kubectl get storageclass -n rook-ceph NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE rook-ceph-block rook-ceph.rbd.csi.ceph.com Delete Immediate true 10m
Now that the StorageClass has been created, we can now create a PVC:
--- apiVersion: v1 kind: PersistentVolumeClaim metadata: name: pvc spec: accessModes: - ReadWriteOnce resources: requests: storage: 1Gi storageClassName: rook-ceph-block
This creates a PVC that is now running on our Kubernetes cluster:
# kubectl get pvc -n rook-ceph NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE rbd-pvc Bound pvc-56c45f01-562f-4222-8199-43abb856ca94 1Gi RWO rook-ceph-block 37s
We will now deploy a pod that uses this PVC:
-------- apiVersion: v1 kind: Pod metadata: name: demo-pod spec: containers: - name: web-server image: nginx volumeMounts: - name: mypvc mountPath: /var/lib/www/html volumes: - name: mypvc persistentVolumeClaim: claimName: pvc readOnly: false
After deploying this pod, you can see it in the pod list:
# kubectl get pods -n rook-ceph NAME READY STATUS RESTARTS AGE demo-pod 1/1 Running 0 118s
That is how you deploy pods that create persistent volume claims on your Ceph cluster!