Dmitriy A. for appfleet

Posted on Jun 8, 2020 • Originally published at appfleet.com

Orchestrating MongoDB over Kubernetes

#kubernetes #devops #docker #mongodb

Kubernetes was primarily used for Stateless Applications. However, PetSets were introduced in version 1.3 and later they evolved to Stateful Sets. The official documentation describes Stateful sets as

StatefulSets are intended to be used with stateful applications and distributed systems

One of the best use cases for this is to orchestrate data-store services such as MongoDB, ElasticSearch, Redis, Zookeeper and so on.

Some features that can be ascribed to StatefulSets are:

Pods with Ordinal Indexes
Stable Network Identities
Ordered and Parallel Pod Management
Rolling Updates

One very distinct feature of Stateful Sets is to provide StableNetwork Identities which when used with Headless Services, can be even more powerful.

Without spending much time on information readily available in Kubernetes documentation, let us focus on running and scaling a MongoDB cluster.

You need to have a running Kubernetes Cluster with RBAC enabled (recommended). In this tutorial, we will be using a GKE cluster, however, AWS EKS or Microsoft’s AKS or a Kops Managed K8s are also viable alternatives.

We will deploy the following components for our MongoDB cluster

1.Daemon Set to configure HostVM
2.Service Account and ClusterRole Binding for Mongo Pods
3.Storage Class to provision persistent SSDs for the Pods
4.Headless Service to access to Mongo Containers
5.Mongo Pods Stateful Set
6.GCP Internal LB to access MongoDB from outside the kuberntes cluster (Optional)
7.Access to pods using Ingress (Optional)

It is important to note that each MongoDB Pod will have a sidecar running, in order to configure the replica set on the fly. The sidecar checks for new members every 5 seconds.

Daemon Set for HostVM Configuration

kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: hostvm-configurer
  labels:
    app: startup-script
spec:
  template:
    metadata:
      labels:
        app: startup-script
    spec:
      hostPID: true
      containers:
      - name: hostvm-configurer-container
        image: gcr.io/google-containers/startup-script:v1
        securityContext:
          privileged: true
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            set -o errexit
            set -o pipefail
            set -o nounset

            # Disable hugepages
            echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
            echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag

Configuration for ServiceAccount, Storage Class, Headless SVC and StatefulSet

apiVersion: v1
kind: Namespace
metadata:
  name: mongo
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mongo
  namespace: mongo
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: mongo
subjects:
  - kind: ServiceAccount
    name: mongo
    namespace: mongo
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  fsType: xfs
allowVolumeExpansion: true
---
apiVersion: v1
kind: Service
metadata:
 name: mongo
 namespace: mongo
 labels:
   name: mongo
spec:
 ports:
 - port: 27017
   targetPort: 27017
 clusterIP: None
 selector:
   role: mongo
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: mongo
  namespace: mongo
spec:
  serviceName: mongo
  replicas: 3
  template:
    metadata:
      labels:
        role: mongo
        environment: staging
        replicaset: MainRepSet
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: replicaset
                  operator: In
                  values:
                  - MainRepSet
              topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 10
      serviceAccountName: mongo
      containers:
        - name: mongo
          image: mongo
          command:
            - mongod
            - "--wiredTigerCacheSizeGB"
            - "0.25"
            - "--bind_ip"
            - "0.0.0.0"
            - "--replSet"
            - MainRepSet
            - "--smallfiles"
            - "--noprealloc"
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: mongo-persistent-storage
              mountPath: /data/db
          resources:
            requests:
              cpu: 1
              memory: 2Gi
        - name: mongo-sidecar
          image: cvallance/mongo-k8s-sidecar
          env:
            - name: MONGO_SIDECAR_POD_LABELS
              value: "role=mongo,environment=staging"
            - name: KUBE_NAMESPACE
              value: "mongo"
            - name: KUBERNETES_MONGO_SERVICE_NAME
              value: "mongo"
  volumeClaimTemplates:
  - metadata:
      name: mongo-persistent-storage
      annotations:
        volume.beta.kubernetes.io/storage-class: "fast"
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 10Gi

Important Points:

The Sidecar for Mongo should be configured carefully with proper environment variables, stating the labels given to the pod, namespace for the deployment and service.
The guidance around default cache size is: “50% of RAM minus 1 GB, or 256 MB”. Given that the amount of memory requested is 2GB, the WiredTiger cache size here has been set to 256MB
Inter-Pod Anti-Affinity ensures that no 2 Mongo Pods are scheduled on the same worker node, thus, making it resilient to node failures. Also, it is recommended to keep the nodes in different AZs so that the cluster is resilient to Zone failures.
The Sevice Account currently deployed has admin privileges. However, it should be restricted to the DB’s namespace.

Deploying the MongoDB cluster

Once we deploy both the manifests specified above, we can check the components as following

root$ kubectl -n mongo get all
NAME                 DESIRED   CURRENT   AGE
statefulsets/mongo   3         3         3m
NAME         READY     STATUS    RESTARTS   AGE
po/mongo-0   2/2       Running   0          3m
po/mongo-1   2/2       Running   0          2m
po/mongo-2   2/2       Running   0          1m
NAME        TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
svc/mongo   ClusterIP   None         <none>        27017/TCP   3m

As you can see, that the service has no Cluster-IP and neither an External-IP, it is a headless service. This service will directly resolve to Pod-IPs for our Stateful Sets.

To verify the DNS resolution, we will launch an interactive shell within our cluster

kubectl run my-shell --rm -i --tty --image ubuntu -- bash
root@my-shell-68974bb7f7-cs4l9:/# dig mongo.mongo +search +noall +answer
; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> mongo.mongo +search +noall +answer
;; global options: +cmd
mongo.mongo.svc.cluster.local. 30 IN A 10.56.7.10
mongo.mongo.svc.cluster.local. 30 IN A 10.56.8.11
mongo.mongo.svc.cluster.local. 30 IN A 10.56.1.4

The DNS for service will be ., therefore, in our case it will be mongo.mongo .

The IPs( 10.56.6.17, 10.56.7.10, 10.56.8.11 ) are our Mongo Stateful Set’s Pod IPs. This can be tested by running a nslookup over these, from inside the cluster.

root@my-shell-68974bb7f7-cs4l9:/# nslookup 10.56.6.17
17.6.56.10.in-addr.arpa name = mongo-0.mongo.mongo.svc.cluster.local.

root@my-shell-68974bb7f7-cs4l9:/# nslookup 10.56.7.10
10.7.56.10.in-addr.arpa name = mongo-1.mongo.mongo.svc.cluster.local.

root@my-shell-68974bb7f7-cs4l9:/# nslookup 10.56.8.11
11.8.56.10.in-addr.arpa name = mongo-2.mongo.mongo.svc.cluster.local.

If your app is deployed in the K8’s cluster, then it can access the nodes by -

Node-0: mongo-0.mongo.mongo.svc.cluster.local:27017 
Node-1: mongo-1.mongo.mongo.svc.cluster.local:27017 
Node-2: mongo-2.mongo.mongo.svc.cluster.local:27017

If you would like to access the mongo nodes from outside the cluster, you can deploy internal load balancers for each of these pods or create an internal ingress, using an Ingress Controller such as NGINX or Traefik

GCP Internal LB SVC Configuration (Optional)

apiVersion: v1
kind: Service
metadata: 
  annotations: 
    cloud.google.com/load-balancer-type: Internal
  name: mongo-0
  namespace: mongo
spec: 
  ports: 
    - 
      port: 27017
      targetPort: 27017
  selector: 
    statefulset.kubernetes.io/pod-name: mongo-0
  type: LoadBalancer

Deploy 2 more similar services for mongo-1 and mongo-2.

You can provide IPs of the Internal Load Balancer to the MongoClient URI.

root$ kubectl -n mongo get svc
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
mongo     ClusterIP      None            <none>        27017/TCP         15m
mongo-0   LoadBalancer   10.59.252.157   10.20.20.2    27017:30184/TCP   9m
mongo-1   LoadBalancer   10.59.252.235   10.20.20.3    27017:30343/TCP   9m
mongo-2   LoadBalancer   10.59.254.199   10.20.20.4    27017:31298/TCP   9m

The external IPs for mongo-0/1/2 are the IPs of the newly created TCP load balancers. These are local to your Subnetwork or peered networks, if any.

Access Pods using Ingress (Optional)

Traffic to Mongo Stateful set pods can also be directed using an Ingress Controller such as Nginx. Make sure the ingress service is internal and not exposed over public ip. The ingress object will look something like this:

...
spec:
  rules:
  - host: mongo.example.com
    http:
      paths:
      - path: '/'
        backend:
          serviceName: mongo # There is no extra service. This is 
          servicePort: '27017' # the headless service

It is important to note that your application is aware of atleast one mongo node which is currently up so that it can discover all the others.

You can use Robo 3T as a mongo client on your local mac. After connecting to one of the nodes and running rs.status() , you can view the details of the replica set and check if the other 2 pods were configured and connected to the Replica Set automatically.

Now we scale the Stateful Set for mongo Pods to check if the new mongo pods get added to the ReplicaSet or not.

root$ kubectl -n mongo scale statefulsets mongo --replicas=4
statefulset "mongo" scaled
root$ kubectl -n mongo get pods -o wide
NAME      READY     STATUS    RESTARTS   AGE       IP           NODE
mongo-0   2/2       Running   0          25m       10.56.6.17   gke-k8-demo-demo-k8-pool-1-45712bb7-vfqs
mongo-1   2/2       Running   0          24m       10.56.7.10   gke-k8-demo-demo-k8-pool-1-c6901f2e-trv5
mongo-2   2/2       Running   0          23m       10.56.8.11   gke-k8-demo-demo-k8-pool-1-c7622fba-qayt
mongo-3   2/2       Running   0          3m        10.56.1.4    gke-k8-demo-demo-k8-pool-1-85308bb7-89a4

It can be seen that all four pods are deployed to different GKE nodes and thus our Pod-Anti Affinity policies are working correctly.

The scaling action will also automatically provision a persistent volume, which will act as the data directory for the new pod.

root$ kubectl -n mongo get pvc
NAME                               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mongo-persistent-storage-mongo-0   Bound     pvc-337fb7d6-9f8f-11e8-bcd6-42010a940024   11G        RWO            fast           49m
mongo-persistent-storage-mongo-1   Bound     pvc-53375e31-9f8f-11e8-bcd6-42010a940024   11G        RWO            fast           49m
mongo-persistent-storage-mongo-2   Bound     pvc-6cee0f97-9f8f-11e8-bcd6-42010a940024   11G        RWO            fast           48m
mongo-persistent-storage-mongo-3   Bound     pvc-3e89573f-9f92-11e8-bcd6-42010a940024   11G        RWO            fast           28m

To check whether the pod named mongo-3 gets added to the replica set or not, we run rs.status() once again on the same node and observe the difference.

Further Considerations:

It can be helpful to label the Node Pool which will be used for Mongo Pods and ensure that appropriate Node Affinity is mentioned in the Spec for the Stateful Set and HostVM configurer Daemon Set. This is because the Daemon set will tweak some parameters of the host OS and those settings should be restricted for MongoDB Pods only. Other applications might work better without those settings.
Labeling a node pool is extremely easy in GKE, can be directly from the GCP console.
Although we have specified CPU and Memory limits in the Pod Spec, we can also consider deploying a VPA (Vertical Pod Autoscaler).
Traffic to our DB from inside the cluster can be controlled by implementing network policies or a service mesh such as Istio.

The aim of this blog is to give you enough information in order to get started with Stateful Sets on Kubernetes and hope you find it useful.