Kubernetes Concepts: Deep Dive

Kubernetes Concepts - Idan Refaeli ©


Section 1: Core Concepts

Cluster Architecture

  • Worker Nodes - "Hold" and host the applications as containers.
  • Master Node - Plans how to load containers onto the worker nodes, identifies the right nodes, stores information, schedules, monitors and tracks the containers, and more.

1. The master node consists of:

  • etcd - Key-value database that stores information about the worker nodes: which node holds which container, when a container was loaded, and so on.
  • kube-scheduler - Identifies which containers should be placed on which worker node, choosing the right node based on its size and available capacity, the types of containers it is allowed to carry, and so on.
  • Controller Manager - Takes care of node handling, traffic control and damage control; when containers are destroyed, it makes sure new containers are made available. Node handling is done by the Node-Controller, and the Replication-Controller ensures that the desired number of containers is running at all times in a replication group.
  • kube-apiserver - The primary management component of Kubernetes. Responsible for orchestrating all the operations in the cluster, and also lets us make the necessary changes to the cluster as required.

2. The worker nodes consist of:

  • kubelet - The "leader" of each node. Responsible for managing all the activity on the node: it creates the connection with the master node, receives information about the containers that should be loaded on the node, loads them as required, and reports the status of the node and its containers back to the master. The kubelet is an agent that runs on each node in the cluster, listens for instructions from the kube-apiserver, and deploys & destroys containers as required.
  • kube-proxy service - A component that runs on each worker node. Responsible for network communication between sessions within or outside of the cluster - for example, configuring the connection between a container running a web server and a database server running in another container on another node.

The ETCD

The etcd is a key-value database that stores information in the form of documents or pages. Each individual gets a document that stores all the information regarding that individual, and the fields can differ between documents.

After installing & running etcd on the system, it listens on port 2379 by default. ./etcdctl is the etcd control client that allows storing & retrieving key-value pairs.

Storing:

./etcdctl set key1 value1

# in ETCDCTL_API version 3+:
./etcdctl put key1 value1

Getting:

./etcdctl get key1

In Kubernetes, the etcd data store contains information about the cluster such as the nodes, pods, secrets, accounts, roles, role binding and more. Every change is updated in this server.

If we setup the cluster using kubeadm, it deploys the etcd server as a pod in the kube-system namespace.

List all the keys stored by k8s:

kubectl exec etcd-master -n kube-system -- ./etcdctl get / --prefix --keys-only

Kube-API Server

The kube-apiserver is the primary management component in k8s. kubectl reaches the kube-apiserver, which authenticates the user, validates the request, retrieves the data from the etcd cluster and sends back the response.

Note: We can send REST API POST requests directly instead of using kubectl.
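For instance, a hedged sketch of creating a pod by POSTing its definition straight to the API server (the server address and the certificate file names are assumptions; a real call needs valid client credentials):

curl -X POST https://my-kube:6443/api/v1/namespaces/default/pods \
  --key admin.key --cert admin.crt --cacert ca.crt \
  --header "Content-Type: application/json" \
  --data @pod-definition.json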

Example - What happens when you create a pod?

  1. The request is authenticated & validated.
  2. The API Server creates a pod object but doesn't assign it to a node.
  3. Updates the etcd server with the new information.
  4. Updates the user that the pod has been created.
  5. The scheduler regularly monitors the API Server, finding that there is a new pod with no node assigned.
  6. The scheduler finds the right node to place the new pod, sending this information to the kube-apiserver.
  7. The kube-apiserver updates this new information in the etcd cluster.
  8. The API Server passes this information to the kubelet in the assigned worker node.
  9. The kubelet creates the pod on the worker node, and sends instructions to the container runtime engine (Docker for example) to deploy the application image.
  10. The kubelet updates the information into the API Server, and the API Server updates the etcd cluster.

This behavior happens every time a change is requested. The API Server sits at the center of all the tasks that need to happen to make changes in the cluster.

Get All Available Resources & Relevant Information:

kubectl api-resources

Kubernetes Controller Manager

This component manages various controllers in Kubernetes. A controller is a process that continuously monitors the state of various components in the system.

Examples for some controllers:

  1. The node controller is responsible for taking care of the nodes in the system so that it keeps running properly, working through the kube-apiserver. The node controller monitors the health of the nodes: if a node stops responding, after 40 seconds without a heartbeat it is marked as unreachable. The node then gets 5 minutes to return to an active status; otherwise the pods assigned to it are removed.

  2. The replication controller is responsible for monitoring the status of replica sets and making sure that the right number of pods is available within each set. If not, new ones are created.

There are more kinds of controllers in k8s such as endpoints controller, namespace controller, service account controller and more. Those are responsible for the logic of the cluster and k8s.

Those controllers are packaged into a single process known as the Kubernetes Controller Manager, and are installed when we install this daemon.

The kube-controller-manager can be downloaded as a binary from the Kubernetes release page.

Kubernetes Scheduler

The k8s scheduler is responsible for scheduling pods on nodes. It doesn't actually place the pods on the nodes - it only decides which pod goes where; placing the pods is the responsibility of the kubelet. The scheduler looks at all the pods waiting to be scheduled and tries to find the best node for each of them.

It starts by filtering out the nodes that cannot fit the pod, for instance due to a lack of CPU or RAM.

Then it "ranks" the remaining nodes using a priority function, on a scale of 0-10, to find the best fit for the pod. It checks how many resources would remain free on each node after placing the pod there to determine the score.

We can also customize the scheduler behavior and also create our own scheduler for our specific needs.

The kube-scheduler binary can likewise be downloaded from the Kubernetes release page.

The Kubelet

The kubelet is the "captain" of the ship (the worker node). It interfaces with the master node and loads/unloads containers on the node (using the container runtime engine, such as Docker) as required by the scheduler running on the master node.

Kube-Proxy

How do pods "talk" to each other?

This task is accomplished using kube-proxy. The pod network is an internal, virtual network that spans all the nodes in the cluster and allows the pods to connect to each other.

kube-proxy is a process that runs on each node in the cluster. It looks for new services, and every time a new service is created, it creates the appropriate rules on each node to forward traffic to those services.

One way it does this is by using iptables rules. It creates an iptables rule on each node in the cluster to forward traffic heading to the IP of the service to the IP of the actual pod.
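As a rough illustration (the service and pod IPs below are made-up example values), such a rule has roughly this shape:

# hypothetical example: traffic to service IP 10.96.0.12:80 is forwarded to pod IP 10.244.1.3:80
iptables -t nat -A PREROUTING -p tcp -d 10.96.0.12 --dport 80 -j DNAT --to-destination 10.244.1.3:80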

Pods

Kubernetes does not deploy containers directly on the worker nodes; the containers are encapsulated in a Kubernetes object known as a pod.

A pod is a single instance of an application, and is the smallest object we can create and use in Kubernetes.

Use Case Example:

We have a single node k8s cluster, with a single instance of our application, running in a single Docker container encapsulated in a pod.

The number of users accessing the app increases, and we need to add additional instances of our web app to share the load.

We don't bring up a new container inside the same pod; we create a new pod with a new instance of the same application, so now we have 2 instances of the application in separate pods on the same k8s node.

What happens if the node has insufficient capacity? We deploy additional pods on a new node in the cluster, expanding the cluster's physical capacity.

To Conclude:

  • Pods usually have a one-to-one relationship with the containers running the application.
  • To scale up, we create new pods.
  • To scale down, we delete existing pods.

However, we are not restricted to holding only one container in a pod. We can hold more containers in a pod when we want to add a supporting container that performs some auxiliary task for the application, such as processing user-entered data or an uploaded file, and we need those containers to operate alongside each other.

The containers in the same pod can communicate using localhost, because they share the same network space (and storage space as well).

Without pods and Kubernetes, we would have to configure all of these responsibilities by ourselves, for example using Docker or any other container runtime to link the application container and the helper container, setting up shared network and storage and more, so pods simplify this process for us.
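A minimal sketch of a two-container pod (the helper container and its command are made up for the illustration):

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
    - name: myapp-container
      image: nginx
    - name: helper-container      # hypothetical supporting container
      image: busybox
      command: ['sh', '-c', 'sleep 3600']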

Examples: Managing Pods In The CLI:

  • Create an nginx pod using CLI:

    kubectl run nginx --image=nginx

  • Show tabled information about existing pods:

    kubectl get pods -o wide

  • Create a pod configuration file and save as YAML format:

    kubectl run redis --image=redis123 --dry-run=client -o yaml > redis.yaml

  • Edit a pod:

    kubectl edit pod [pod-name]

YAML In Kubernetes

Kubernetes uses YAML files as configuration for the creation of objects such as pods, replicas, deployments, services and more.

A k8s configuration file always contains four top-level required fields:

apiVersion:
kind:
metadata:


spec: 

Example: Pods Using YAML:

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
    type: front-end
spec:
  containers:
    - name: nginx-container
      image: nginx

Creating the pod:

kubectl create -f pod-definition.yml

Viewing the pods:

kubectl get pods

Seeing detailed information about the pod:

kubectl describe pod myapp-pod

Replica Set (And Replication Controller)

What is a replica and why do we need the Replication Controller?

If a pod fails, users will not be able to access the application; hence we want more than one pod running simultaneously. This is accomplished by the Replication Controller.

This characteristic is called 'High Availability'.

Even if we have a single pod, the Replication Controller can help to bring up a new one if it fails.

The Replication Controller can also help with load balancing - creating multiple pods to share the load - and it can span multiple nodes across the cluster.

The Replication Controller is the older technology; the Replica Set is its newer replacement, though both share the features discussed above.

Defining a Replication Controller:

apiVersion: v1
kind: ReplicationController
metadata:
  name: myapp-rc
  labels:
    app: myapp
    type: front-end
spec:
  template:
    metadata:
      name: myapp-pod
      labels:
        app: myapp
        type: front-end
    spec:
      containers:
      - name: nginx-container
        image: nginx
  replicas: 3

kubectl create -f rc-definition.yml

To View the replication controller created:

kubectl get replicationcontroller

To view the pods that were created automatically by the replication controller:

kubectl get pods


Defining a Replica Set:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: myapp-replicaset
  labels:
    app: myapp
    type: front-end
spec:
  template:
    metadata:
      name: myapp-pod
      labels:
        app: myapp
        type: front-end
    spec:
      containers:
      - name: nginx-container
        image: nginx
  replicas: 3
  selector: 
    matchLabels:
      type: front-end
Differences from the Replication Controller:

  • The 'apiVersion' field is different.
  • The 'kind' field is different.
  • The 'selector' field is required.

The 'selector' is needed because it allows the Replica Set to also manage pods that were not created as part of the Replica Set itself.

For instance, if some pods existed beforehand and they match the Replica Set's selector, the Replica Set will take care of them as well.

This is why, from now on, the 'labels' field is important: the selector is now required and must match the labels in the pod template.

After defining the file: Create replica set, show the replica set created and show the new pods:

kubectl create -f replicaset-definition.yml

kubectl get replicaset

kubectl get pods

3 ways to scale the replica set:

  1. Change the 'replicas' value and run the command kubectl replace -f [yaml_file]
  2. kubectl scale --replicas=6 -f [yaml_file] - scales using the file as input (note that the number of replicas stored in the file does not change).
  3. kubectl scale --replicas=6 replicaset [replicaset_name] - scales the running replica set; the number of replicas in the file remains the same.

Kubernetes Deployments

When we build a newer version of the application and upload it to Docker Hub, we usually want to roll it out to a small portion of users first and then to the rest, instead of rolling it out to all the users at the same time.

We would also like to be able to revert to the older version, in case the update results in an unexpected error and we need to undo the change.

All of this can be done using Kubernetes Deployments - a kind of Kubernetes object that sits higher in the hierarchy than Replica Sets.

A deployment usually represents an application. We can see the current deployments using kubectl get deploy.

  • The syntax of a Deployment definition file is exactly the same as a ReplicaSet's; the only change is 'kind': Deployment (see the sketch below).
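A minimal deployment-definition.yml along those lines:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deployment
  labels:
    app: myapp
    type: front-end
spec:
  replicas: 3
  selector:
    matchLabels:
      type: front-end
  template:
    metadata:
      labels:
        app: myapp
        type: front-end
    spec:
      containers:
      - name: nginx-container
        image: nginx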

kubectl create -f deployment-definition.yml

kubectl get deployments

Deployments automatically create Replica Sets:

kubectl get replicaset

Output: myapp-deployment-6944234 ...

  • Getting information about the cluster, deployments, replica sets and pods that exist on the system: kubectl get all

  • Creating a deployment with a command: kubectl create deployment my-dep --image=nginx --replicas=3

Services

Kubernetes services enable communication between various components within and outside of the application.

They allow us to connect applications together with other applications and users.

This is helpful when, for example, our application is built using microservices, where each microservice is packaged as a container for a specific task; services enable loose coupling between the microservices.

Say, for example, we are trying to access a pod running on a node from an external computer; one option is to SSH into the node and run curl http://<pod_ip>.

The drawback is that we can only reach the pod from inside that specific node. We would like to access it remotely, without having to SSH into the node - simply by using the Kubernetes node's IP.

This is what the Kubernetes Service enables.

The Kubernetes Service is an object like Pods, ReplicaSet, Deployments.

As an example, one of the use cases it provides is forwarding requests that arrive on a port of the node to a port on the pod. This service type is called NodePort.

There are more types of services such as:

  • NodePort - Makes an internal port accessible through a port on the node.
  • ClusterIP - Creates a virtual IP inside the cluster to enable communication between different services.
  • Load Balancer - Provides load balancer in the supported cloud providers.

Creating a NodePort Service

# service-definition.yml:

apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  type: NodePort
  ports:
    - targetPort: 80
      port: 80
      nodePort: 30008
  
  # linking the specific pod using selector
  selector:
    app: myapp
    type: front-end
  • targetPort - The port on the pod that the service forwards requests to; the application must be listening on this port.
  • port - Exposes the k8s service on the specified port within the cluster. Other pods within the cluster can communicate with the server on this port.
  • nodePort - Exposes the service externally to the cluster via the node's IP address and the specified nodePort.

Create and view the service:

kubectl create -f service-definition.yml

kubectl get services


What happens if we have multiple pods with the same label specified in the selector?

The service selects one of the matching pods at random, which means the service acts like an internal load balancer, distributing load across the pods without any additional configuration.

The ClusterIP:

New pods are created all the time, so we can't rely on pod IP addresses for the internal communication of the application (front end, back end, databases, for instance).

A ClusterIP service groups all the pods in a layer of the application and gives us a single interface to that layer. A request reaches one of the pods at random. Each service gets a specific IP and name inside the cluster and is accessible from other pods using these identifiers.

This allows us to deploy a cluster that is built in the microservices architecture.

Creating the ClusterIP service-definition.yml:

apiVersion: v1
kind: Service
metadata: 
  name: back-end
spec:
  type: ClusterIP
  ports:
    - targetPort: 80
      port: 80
  selector:
    app: myapp
    type: back-end

kubectl create -f service-definition.yml

kubectl get services

kubectl describe service [service]


The Load Balancer:

Instead of setting up a load balancer manually using nginx for instance, we can use the native k8s load balancer to integrate with the supported cloud platforms such as AWS, Azure, GCP.

The definition file is the same as NodePort, except the type is LoadBalancer:

apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  type: LoadBalancer
  ports:
    - targetPort: 80
      port: 80
      nodePort: 30008
  • If we use the LoadBalancer type on an unsupported platform such as VirtualBox, it will act as a NodePort and will not perform any actual load balancing.

Namespaces

Until now we operated in a single namespace called the default namespace which is created automatically when the cluster is first set up.

A namespace is like a "house" with its own set of rules.

Kubernetes isolates the user from its internal containers and services - those designed to operate the engine itself - by placing them in a separate namespace called kube-system.

There is also a third namespace called kube-public which is the namespace where public resources are created.

If we are using k8s for educational purposes or for small applications, the default namespace is enough. But for enterprise use, we may want more than the default namespace.

For example, we can use two different namespaces, Dev and Production, so that we don't accidentally modify resources in production.

When a namespace is created, DNS entries are created as well. When accessing a resource in the same namespace as our own, we don't need the full DNS name; for a resource in another namespace, we do.

Connecting to a MySQL database example:

  • Inside the same namespace: mysql.connect("db-service")
  • In a different namespace: mysql.connect("db-service.dev.svc.cluster.local")

(The format for accessing a service in another namespace is mysql.connect("[service-name].[namespace].svc.[domain]"))

Note: kubectl get pods for example, lists all the pods in the default namespace only. If we want to list pods in another namespace use: kubectl get pods --namespace=[namespace]

Note: If we want a pod from a pod-definition file to always be created in a specific namespace, add the pair: namespace: [namespace] under metadata in the YAML file.
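For example, a hedged snippet of such a pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  namespace: dev
spec:
  containers:
  - name: nginx-container
    image: nginx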

Create a dev namespace:

Option 1:

apiVersion: v1
kind: Namespace
metadata:
  name: dev

kubectl create -f namespace-dev.yml

Option 2: Creating using the CLI: kubectl create namespace dev

Change Namespace Context:

kubectl config set-context $(kubectl config current-context) --namespace=[namespace]

View Pods In All Namespaces:

kubectl get pods --all-namespaces

Limit Resources In a Namespace:

Using resource quota:

# compute-quota.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: dev
spec:
  hard:
    pods: "10"
    requests.cpu: "4"
    requests.memory: 5Gi
    limits.cpu: "10"
    limits.memory: 10Gi

kubectl create -f compute-quota.yaml




Section 2: Scheduling

The Scheduler

How does the scheduler work in the backend?

Every pod has a field called nodeName in its definition that is not set by default - k8s adds it automatically.

The scheduler goes through all the pods, finds those that don't have this property set, and chooses them as candidates for scheduling.

Once a pod has been chosen by the scheduling algorithm, the scheduler schedules the pod on a node by setting the nodeName property to the name of that node - creating a binding object.

If no scheduler is running, the pods remain in the Pending state.

We can do it manually by simply specifying the name of the node in the nodeName property.
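A minimal sketch of a pod definition with the node set manually (the node name is an example value):

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: node02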

Or, we can also create a binding object and send a POST request to the pod's binding API - acting like the scheduler.

For example:

  1. pod-definition.yaml - the pod to be bound.
  2. pod-bind-definition.yaml:

apiVersion: v1
kind: Binding
metadata:
  name: nginx
target:
  apiVersion: v1
  kind: Node
  name: node02

  3. Send the binding object to the pod's binding API: curl --header "Content-Type:application/json" --request POST --data '{"apiVersion": "v1", "kind": "Binding", ...}' http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/
  • Check if the scheduler exists on the system: kubectl get pods --namespace kube-system

Taints & Tolerations

Taints and tolerations exist in order to set restrictions on which pods can be scheduled on a node, either as a preference or as a hard requirement.

For example: Node 1 is tainted so that it only accepts Pod 'A' (in other words, only Pod 'A' has a toleration for Node 1's taint). When pods 'B' and 'C' try to be scheduled on Node 1, it rejects them and they get scheduled on nodes that don't reject them.

  • Taint a node: kubectl taint nodes node-name key=value:taint-effect

  • Tolerations can be specified in the pod-definition file (see the sketch after this list).

  • When the k8s system is set up, the master node is tainted so that pods are prevented from being scheduled on it.

  • We can verify the previous bullet using kubectl describe node kubemaster | grep Taint
    Output: Taints: node-role.kubernetes.io/master:NoSchedule

  • Untaint a node: kubectl taint nodes node-name key=value:taint-effect- (the same taint followed by a trailing minus sign)
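A minimal sketch of a toleration in a pod definition, assuming the node was tainted with key app, value blue and effect NoSchedule:

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
  - name: nginx-container
    image: nginx
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"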

Node Selectors

We can limit specific pods to run only on specified nodes.

Option 1: Using Node Selectors -

apiVersion
...
...
spec:
  ...
  nodeSelector:
    size: Large

The value Large used here comes from a label attached to the node; this is how we identify the nodes.

Labeling Nodes: kubectl label nodes (node-name) (label-key)=(label-value)

In our example: kubectl label nodes node-1 size=Large

There is a drawback to node selectors: we can't select nodes with more complex queries such as OR, NOT, AND or "not equal" conditions.

For this we will use Node Affinity and Anti Affinity features.

Option 2: Using Node Affinity:

apiVersion
...
...
spec:
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: size
              operator: In
              values:
              - Large
              - Medium

This does exactly the same as the example we have seen above, but also adds the "Medium" option to the node selectors.

"Operator" can also be "NotIn", "Exists" and more.

Check the documentation for more operators and affinity types.

Resources Requirements And Limits

We can add to the pod-definition file, the following:

...
spec:
  containers:
    ...
    resources:
      requests:
        memory: "1Gi"
        cpu: 1

To specify the resources request of the container.

  • The 'cpu' can be as low as 0.1 (or '100m' where 'm' stands for milli).

  • 1 count of 'cpu' is considered as 1 vCPU (1 vCPU in AWS or 1 core in GCP or Azure)

By default k8s sets a limit of 1 vCPU per container; if we would like to change this limit, we can add the following:

...
spec:
  containers:
    ...
    resources:
      requests:
        ...
      limits:
        memory: ".."
        cpu: ".."

Setting the default limits & requests for the pods to pick from:

apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
  - default:
      memory: ".."
    defaultRequest:
      memory: ".."
    type: Container

Resources units:

  1. 1G (Gigabyte) = 1,000,000,000 bytes
  2. 1M (Megabyte) = 1,000,000 bytes
  3. 1K (Kilobyte) = 1,000 bytes
  4. 1Gi (Gibibyte) = 1,073,741,824 bytes
  5. 1Mi (Mebibyte) = 1,048,576 bytes
  6. 1Ki (Kibibyte) = 1,024 bytes

Exceeding The Limits:

What happens when the pod tries to exceed the specified limit?

In the case of CPU, a container cannot exceed its CPU limit; if it tries to, the CPU is throttled.

With memory, a container can use more resources than its limit, but if it does so constantly, the pod will eventually be terminated.

Changing the limits & requests of a running pod:

> kubectl get pod [name] -o yaml > pod-definition.yaml

>  vi pod-definition.yaml

> # change the required values..

> kubectl delete pod [name]

> kubectl create -f pod-definition.yaml

> # can also use 'kubectl replace'

Daemon Sets

Daemon sets allow deploying multiple instances of a pod, running exactly one copy of the pod on each node in the cluster.

When a new node is added, a replica of the pod is automatically added to that node, and the pod is removed when the node is removed as well.

It basically makes sure that one copy of the pod always exists in all nodes in the cluster.

  • A use case of daemon sets may be deploying monitoring agents, logging agents on nodes, etc.

  • The daemon set is ignored by the Kube-Scheduler.

  • The daemon-set definition file looks exactly the same as a replica-set definition file, except the kind is DaemonSet (see the sketch below).
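A minimal sketch of such a daemon-set definition (the agent image name is made up for the illustration):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: monitoring-daemon
spec:
  selector:
    matchLabels:
      app: monitoring-agent
  template:
    metadata:
      labels:
        app: monitoring-agent
    spec:
      containers:
      - name: monitoring-agent
        image: monitoring-agent   # hypothetical image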

How to schedule a pod using Daemon Set?

One option is to schedule the pods manually by setting the nodeName property, so that when they are created they are assigned to the specified nodes.

But since k8s v1.12, the Daemon Set uses NodeAffinity and the default scheduler as we learned previously.

Static Pods

The kubelet can manage pods independently, as "the captain of the ship".

All of these operations of course need to be executed without communicating with the kube-apiserver. For this we configure the kubelet to read pod definitions from files located in a directory on the server, e.g. /etc/k8s/manifests. The kubelet then continuously checks that directory, reads the files and creates the pods on the host, makes sure the pods stay alive in case of failure, recreates a pod if its manifest file changes, and removes the pod if the file is deleted.

This static method is supported only for pods, not any other objects, since the other objects are operated using controllers, and the kubelet is only responsible for pods.

  • The static pods are ignored by the Kube-Scheduler.

How to configure the path?

  1. When configuring the kubelet.service, we can specify the option --pod-manifest-path=[path].

  2. Alternatively, when configuring the kubelet.service we can specify a path to a configuration file, --config=kubeconfig.yaml, where kubeconfig.yaml specifies: staticPodPath: [path]

  • The kubelet can create pods from the api-server and static pods at the same time, which means that the api-server is aware of the static pods (only as a read-only mirror of the pod - we can view, but can't modify the pod).

Use Case: Since static pods are independent of the k8s control plane, we can use static pods to deploy the control plane components themselves as pods on a node.

This way we don't have to download the binaries, configure services and worry about crashing services. If a service crashes, as a static pod it will be restarted by the kubelet.

How to do that?

  1. Install kubelet on all the master nodes
  2. Create pod-definition files that use Docker images of the control plane components (apiserver.yaml, etcd.yaml, controller-manager.yaml, ..)
  3. Place the files in the specified manifest folder
  4. The kubelet will now deploy the control plane components as pods on the cluster.

That's why when we list the pods on the kube-system namespace, we see the control plane components as pods in a cluster set up by the kube-admin tool.

How to check path of static pods configuration type:

  1. $ ps aux | grep kubelet and then look at the --config=[path] option.
  2. Check in the config file for the staticPodPath

How to delete a static pod:

  1. Identify which node the pod is located at: kubectl get pods --all-namespaces -o wide | grep [podname]

  2. SSH into the node and identify the path configured for static pods in this node (make sure to check the path in the kubelet configuration file as above): $ ssh [nodename]

  3. Find the file and delete it:
     a. ps -ef | grep /usr/bin/kubelet
     b. grep -i staticPodPath [kubelet-config-path]
     c. Navigate to the staticPodPath directory, find the pod-definition YAML file for the pod, and delete it

Multiple Schedulers

If the default scheduler doesn't fit our needs we can run multiple schedulers, even customize our own scheduler with custom conditions and checks in it.

The default scheduler is configured in scheduler-config.yaml:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: [schedulername] (default is default-scheduler)

How to deploy a scheduler as a pod:

my-custom-scheduler.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: my-custom-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
     - kube-scheduler
     - --address=127.0.0.1
     - --kubeconfig=/etc/kubernetes/scheduler.conf
     - --config=/etc/kubernetes/my-scheduler-config.yaml

     image: k8s.gcr.io/kube-scheduler-amd64:v1.11.3
     name: kube-scheduler

my-scheduler-config.yaml

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: [schedulername] (default is default-scheduler)
leaderElection:
  leaderElect: true
  resourceNamespace: kube-system
  resourceName: lock-object-my-scheduler

(the leader election is used when we have multiple schedulers running on different master nodes).

How to use the custom scheduler:

In the pod-definition file:

[...]
kind: Pod
[...]
spec:
  containers:
    [..]
  schedulerName: my-custom-scheduler

How to make sure the scheduler was used:

kubectl get events -o wide

How to view scheduler logs:

kubectl logs my-custom-scheduler --namespace=kube-system

Scheduler Profiles

When pods are created, they are placed in the scheduling queue. At this stage the pods are sorted based on their priority.

To set a priority to a pod we can add priorityClassName under spec in the pod-definition.

For example: priorityClassName: high-priority

And this is defined in the priority class file:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be for pods X and Y only"

Next comes the Filtering phase, which filters out the nodes that cannot run the pod due to the limitations specified in the pod file.

The next phase is the Scoring phase, which uses the scoring algorithm based on free resources that we saw earlier.

And the final phase is the Binding Phase, at this phase the pod is assigned to the node with the highest score.

The plugins that support those phases:

  1. Scheduling Queue = PrioritySort
  2. Filtering Phase = NodeResourcesFit, NodeName, NodeUnschedulable.
  3. Scoring Phase = NodeResourcesFit, ImageLocality
  4. Binding Phase = DefaultBinder

There are also Extension Points, which are like sub-phases of each stage where the plugins specified above can be attached.

For example:

  • preFilter
  • postFilter
  • preScore
  • reserve
  • permit
  • preBind
  • postBind

And there are many more plugins that can fit more extension points.

To Conclude:

  • Scheduling profiles allow us to configure the different stages of scheduling in the kube-scheduler.
  • Each stage is exposed in an extension point.
  • Plugins provide scheduling behaviors by implementing one or more of these extension points.



Section 3: Logging & Monitoring

Monitoring Cluster Components

We would like to monitor things like:

  • Node-level metrics such as the number of nodes in the cluster, how many are healthy.
  • Performance metrics such as CPU, memory, network, disk utilization.
  • Pod-level metrics such as number of pods, performance metrics of each pod such as CPU, memory consumption.

We can use the metrics server - one per k8s cluster.

The metrics server retrieves metrics from the nodes and pods, aggregates them and stores them in memory.

The results are not stored in the disk, for that we can use advanced monitoring solutions such as the Elastic Stack.

cAdvisor runs inside the kubelet and is responsible for retrieving performance metrics from the pods and exposing them through the kubelet API so that they are accessible to the Metrics Server.

To enable the metrics server: clone its deployment manifests (for example, from the metrics-server project repository) and then run kubectl create -f deploy/1.8+/ - this deploys a set of pods, services and roles that enable the Metrics Server to poll for performance metrics from the nodes in the cluster.

Usage:

kubectl top node - to monitor the nodes.
kubectl top pod - to monitor the pods.

Logging Pods

Single container in the pod:

kubectl logs -f [podname]

Multiple containers in the pod:

kubectl logs -f [podname] [containername]




Section 4: Application Lifecycle Management

Rolling Updates and Rollbacks

When we initialize a deployment, it creates a rollout. A new rollout creates a new deployment revision.

Whenever the application is updated, a new rollout is triggered and a new deployment revision is created.

Viewing the status of the rollout: kubectl rollout status deployment/myapp-deployment

To view the revisions and the history of the deployment: kubectl rollout history deployment/myapp-deployment

Deployment Strategies:

  • Recreate - Destroy the old application instances and then create instances of the newer version. The downside is that the application is inaccessible to users during the downtime.
  • RollingUpdate - Take down older-version instances and bring up newer-version instances one by one - the default deployment strategy.

We can view the deployment strategy used using: kubectl describe deployment [deployment]

Performing Rollback:

In case we notice something wrong with the deployment, we can roll back using kubectl rollout undo deployment/myapp-deployment. It will destroy the pods in the new replica set and bring back the older ones.

We can view the change using kubectl get replicasets - this will show the desired/current pods of the different versions.

To Summarize The Commands:

Create:

kubectl create -f deployment-definition.yaml

Get:

kubectl get deployments

kubectl describe deployment [deployment]

Update:

kubectl apply -f deployment-definition.yaml

kubectl set image deployment/myapp-deployment nginx=nginx:1.9.1

Status:

kubectl rollout status deployment/myapp-deployment

kubectl rollout history deployment/myapp-deployment

Rollback:

kubectl rollout undo deployment/myapp-deployment

Commands & Arguments In Pods

The command: field in the pod-definition file overwrites the ENTRYPOINT field in the Dockerfile.

And the args: field overwrites the CMD field.

Adding arguments to a new container on kubectl run: kubectl run [name] --image=[image] -- [args], for example: kubectl run webapp --image=[image] -- --color green.
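A minimal sketch of command and args in a pod definition (the image and values are example choices):

apiVersion: v1
kind: Pod
metadata:
  name: sleeper-pod
spec:
  containers:
  - name: sleeper
    image: ubuntu
    command: ["sleep"]            # overrides the Dockerfile ENTRYPOINT
    args: ["10"]                  # overrides the Dockerfile CMD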

ENV In Kubernetes

Common Way To Use:

env:
  - name: KEY
    value: value

Configuring ConfigMap In Applications:

Config maps are used to pass configuration data in the form of key-value pairs in k8s.

Creating ConfigMaps:

  • Imperative way (without a definition file): kubectl create configmap [config-name] --from-literal=[key]=[value] or kubectl create configmap [config-name] --from-file=[path-to-file]
  • Declarative way (using a definition file with kind: ConfigMap, as sketched below): kubectl create -f [file-path]
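A minimal sketch of such a ConfigMap definition file (the key-value pairs are example values):

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_COLOR: blue
  APP_MODE: prod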

Using in a container:

...
kind: Pod
spec:
  containers:
    ...
  - envFrom:
    - configMapRef:
        name: app-config

To Summarize:

Ways to implement ENV to a container:

  1. envFrom (env)
  2. env (single env)
  3. volumes (haven't covered yet)

Secrets

Although ConfigMaps can store environment values, they are not a good place to store plain-text passwords and other sensitive information.

Imperative:

  1. kubectl create secret generic [secret-name] --from-literal=[key]=[value]

  2. kubectl create secret generic [secret-name] --from-literal=[KEY1]=[value] --from-literal=[KEY2]=[value]

  3. kubectl create secret generic [secret-name] --from-file=[path]

Declarative:

apiVersion: v1
kind: Secret
metadata:
  name: app-secret
data:
  KEY1: [value-raw]
  KEY2: [value-raw]
  ...

kubectl create -f secret-definition.yaml

PROBLEM: [value-raw] is plain text - so we base64-encode the values:

On Linux:

echo -n '[value-raw]' | base64

Then to get, describe & view (encoded values):

kubectl get secrets

kubectl describe secrets

kubectl get secret [name] -o yaml

Viewing decoded values:

echo -n '[value-encoded64]' | base64 --decode

Secrets in Pods:

apiVersion: v1
kind: Pod
...
spec:
  containers:
  -  name:
  ...
  envFrom:
    - secretRef:
        name: [secret-name]

Secret in Pods as Volumes:

volumes:
- name: [app-secret-volume]
  secret:
    secretName: [secret-name]

> ls /opt/[app-secret-volumes]

> cat /opt/[app-secret-volumes]/[KEY]

Problem: Secrets are not encrypted, only encoded. Be careful when pushing them to version control.

Problem: Secrets are not encrypted in ETCD. Fix: Use Encryption At Rest.

Problem: Anyone able to create pods/deployments in the same namespace can access the secrets. Fix: Configure least-privilege access to Secrets - RBAC

Another Solution: Consider 3rd party secrets store providers (AWS, Azure Provider, GCP Provider, Vault Provider) instead of storing in the ETCD.

InitContainers

In a multi-container pod, each container is expected to run a process that stays alive as long as the pod does. If any of them fails, the pod restarts.

However, we sometimes want to run a process to completion inside a container (for example, a task that runs only once when the pod is created).

This can be made using InitContainers.

Example:

apiVersion: v1
kind: Pod
...
spec:
  containers:
  -  name: [container]
     image: busybox:1.28
     command: ['sh', '-c', 'echo the app is running && sleep 3600']
  initContainers:
  -  name: init-myservice
     image: busybox
     command: ['sh', '-c', 'git clone [repo];']

Here, the initContainer will run at first until completion and then the real container hosting the application will run.

We can also configure multiple initContainers, which run in sequential order. If any of them fails, k8s restarts the pod until all of them succeed.




Section 5: Cluster Maintenance

OS Upgrades

Sometimes we may want to take out nodes from the cluster for OS upgrades, security patches and more.

If a node becomes unavailable for more than 5 minutes, k8s considers its pods dead. If the pods were part of a replica set, they are recreated on other nodes.

Pod Eviction Timeout is the time k8s waits for the pod to come back online and can be set using kube-controller-manager --pod-eviction-timeout=5m0s ...

After the pod-eviction-timeout, if the node returns, it comes back blank: the pods that were not part of a replica set are gone, and those that were have been scheduled on other nodes.

If we know that a system upgrade on a node may take more than 5 minutes, we can use kubectl drain [node-name] to drain the node and mark it as unschedulable.

To make the node unavailable for scheduling we can use: kubectl cordon [node-name]

To make the node available for scheduling again, we can use kubectl uncordon [node-name]

  • if kubectl drain is prevented because of daemon sets we can use --ignore-daemonsets option.

  • If we run kubectl drain --force, pods that are not managed by a replica set (or another controller) will be deleted permanently.

Cluster Upgrade Process

Since the kube-apiserver is highest in the hierarchy, no other component may run a higher version than it.

The versions can be for example:

kube-apiserver (X)              (v1.10)
├── Controller Manager (X-1)    (v1.9 or v1.10)
├── kube-scheduler (X-1)        (v1.9 or v1.10)
├── kubelet (X-2)               (v1.8, v1.9 or v1.10)
└── kube-proxy (X-2)            (v1.8, v1.9 or v1.10)

kubectl can be higher or lower than the kube-apiserver.

The recommended approach to upgrading is incremental upgrades, one version at a time.

Upgrading the master:

While the master is being upgraded, the control plane components are briefly down. The applications running in the cluster are not impacted, but we can't use kubectl, access the cluster in any other way, or deploy new applications until the master is up again.

Upgrading the workers:

  1. One approach is to take down all at once, and no user will be able to access the application - requires downtime.
  2. Second approach is to update all nodes one at a time, move pods from one node to another (drain) until all nodes are upgraded.
  3. The third approach is to add new nodes to the cluster (with the new version) - good to be used in cloud environments, where we can provision and manage new nodes easily, and then remove the older nodes.

All of these can be done using:

apt-get upgrade -y kubelet=[version]

kubeadm upgrade [plan / apply / node ...]

systemctl restart kubelet

Backup & Restore

Candidates for backups:

  1. Resource configuration
  2. ETCD cluster
  3. Persistent Volumes

Resource Configuration:

A good practice is to store configuration files (like pod definition files) in a repository like GitHub.

A better approach to backing up the resource configuration is to query the kube-apiserver: kubectl get all --all-namespaces -o yaml > all-deploy-services.yaml

The approach above can also be made using a tool such as Velero.

The ETCD:

This backs up the information stored in ETCD - the information about the cluster itself (the nodes, for example). Instead of backing up the resources as described above, we can back up ETCD.

ETCD is hosted on the master node; when configuring it we set the location of all its data with --data-dir - this is the directory to back up.

We can also take a snapshot of the ETCD database by etcdctl snapshot save [name]

Note: Don't forget to also specify the certificates (since we are contacting the etcd server):

--endpoints=...

--cacert=...

--cert=...

--key=...
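Putting it together, a hedged example of the full command (the endpoint and certificate paths are assumptions matching a typical kubeadm setup):

ETCDCTL_API=3 etcdctl snapshot save /opt/snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key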


View details about the backup: etcdctl snapshot status [name].

To restore:

  1. service kube-apiserver stop

  2. etcdctl snapshot restore [name] --data-dir [path]

  3. systemctl daemon-reload

  4. service etcd restart

  5. service kube-apiserver start

  • Switch cluster: kubectl config use-context [cluster]
  • View kubectl configuration (clusters, contexts, ..): kubectl config view



Section 6: Security

We will learn how to secure the cluster and the communication between internal components.

Using authentication to secure access to the cluster:

  • k8s doesn't manage user accounts natively.
  • although, it can manage service accounts (kubectl create serviceaccount sa1, kubectl get serviceaccount), we will learn about it later.

When we try to access the cluster using kubectl or curl https://kube-server-ip:6443/, the request goes to the kube-apiserver, which tries to authenticate the user.

The authentication mechanisms can be:

  1. Static password file (Not recommended)
  2. Static token file (Not recommended)
  3. Certificates
  4. Identity services such as LDAP, Kerberos and more.

Using basic authentication:

Inside kube-apiserver.service: --basic-auth-file=user-details.csv

While user-details.csv looks like:

password1,user1,u0001
password2,user2,u0002
...

Example of basic auth: curl -v -k [uri] -u "user1:password1"

Using token authentication:

Inside kube-apiserver.service: --token-auth-file=user-token-details.csv

While user-token-details.csv looks like:

[token],user1,u0001
[token],user2,u0002
...

Example of token auth: curl -v -k [uri] --header "Authorization: Bearer {token}"

TLS Certificates In Kubernetes

Prerequisites:

  • Symmetric & Asymmetric encryption with SSL.
  • Certificates - Certificate Authority (CA), openssl, openssl req, Public Key Infrastructure (PKI).

Naming Conventions:

  • Private key - *.key, *-key.pem
  • Certificate (Public Key) - *.crt, *.pem

To remember: private keys have the word "key" in them, certificates and public keys don't.

Client Certificates For Clients:

Those are used by clients to connect to the kube-apiserver

  • Admin - admin.crt, admin.key
  • Kube-Scheduler - scheduler.crt, scheduler.key
  • Controller Manager - controller-manager.crt, controller-manager.key
  • Kube Proxy - ...
  • Kube API Client - apiserver-kubelet-client.crt, apiserver-kubelet-client.key
  • ...

Server Certificates For Servers

Those are used by the servers to identify themselves to their clients

  • Kube API-Server - apiserver.crt, apiserver.key
  • ETCD Server - etcdserver.crt, etcdserver.key
  • Kubelet Server - kubelet.crt, kubelet.key

Kubernetes requires us to have at least 1 Certificate Authority (CA) for the cluster.

Creating a Certificate:

CA:

  1. Generate keys - openssl genrsa -out ca.key 2048
  2. Certificate signing request - openssl req -new -key ca.key -subj "/CN=KUBERNETES-CA" -out ca.csr
  3. Sign certificates - openssl x509 -req -in ca.csr -signkey ca.key -out ca.crt

Admin:

  1. Generate keys - openssl genrsa -out admin.key 2048
  2. Certificate signing request - openssl req -new -key admin.key -subj "/CN=kube-admin" -out admin.csr
  3. Sign certificates - openssl x509 -req -in admin.csr -CA ca.crt -CAkey ca.key -out admin.crt

Kube Scheduler: The kube-scheduler is a system component and part of the control plane, so its certificate name must be prefixed with "system:" (e.g. "system:kube-scheduler").

Same for Kube Controller Manager and Kube Proxy.

View Certificate Details:

  1. while deploying as native services on the nodes manually - cat /etc/systemd/system/kube-apiserver.service
  2. while deploying using kubeadm - cat /etc/kubernetes/manifests/kube-apiserver.yaml

Get detailed data about the certificate (.crt file): openssl x509 -in [.crt-file] -text

Certificates API

What:

Say we are the admins of the cluster, and a user wants a certificate signed; instead of us logging in to the master node and signing the certificate ourselves, the user sends an object called a CertificateSigningRequest.

Once the object has been created, the signing request can be seen by all the admins of the cluster, reviewed and approved easily with kubectl, and the signed certificate can then be sent back to the user.

How:

  1. User creates a key: openssl genrsa -out user.key 2048
  2. User sends the request to the admin: openssl req -new -key user.key -subj "/CN=user" -out user.csr
  3. The administrator creates a signing request object using a definition file (kind: CertificateSigningRequest - see the sketch after this list).
  4. Before specifying the 'request' field in the YAML file, the admin encodes it to base64: cat user.csr | base64 | tr -d "\n"
  5. After the request is submitted, all the administrators can see its status with kubectl get csr, and approve/deny it using kubectl certificate [approve/deny] user.
  6. Kubernetes signs the certificate using the CA key pairs and generates the certificate
  7. View the certificate using kubectl get csr user -o yaml (need to base64 decode the 'certificate' field).
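A minimal sketch of such a CertificateSigningRequest definition (the signerName shown is the usual one for client certificates; the request value is the base64-encoded CSR from step 4):

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: user
spec:
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - client auth
  request: [base64-encoded-csr]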

All of these operations are performed using the Controller Manager (it handles the controllers 'CSR Approving' and 'CSR Signing').

Kubeconfig

We have seen how to get the pods using curl https://my-kube:6443/api/v1/pods --key admin.key --cert admin.crt --cacert ca.crt

How to do that using the kubectl: kubectl get pods --server my-kube-playground:6443 --client-key admin.key --client-certificate admin.crt --certificate-authority ca.crt

We can move these options to a kubeconfig file and pass it with kubectl get pods --kubeconfig config; if no file is specified, kubectl looks for one at $HOME/.kube/config by default.

The kube config file has 3 sections:

  • Clusters - The clusters we need access to (development, testing, production environments, etc.).
  • Users - The user accounts (with their credentials) used to access the clusters.
  • Contexts - Define which user account is used to access which cluster.

KubeConfig:

apiVersion: v1
kind: Config

# Default context to use:
current-context: my-kube-admin@my-kube-playground

clusters:
- name: my-kube
  cluster:
    certificate-authority:
    server: ...

contexts:
- name: my-kube-admin@my-kube-playground
  context:
    cluster: my-kube
    user: my-kube-admin

users:
- name: my-kube-admin
  user:
    client-certificate: admin.crt
    client-key: admin.key

To view the current file used: kubectl config view - if we don't specify which file to use, it will use the default file in the $HOME/.kube/config directory.

Change the current context: kubectl config use-context user@cluster

We can also add 'namespace' under 'contexts' field - so that when we switch to a context, it will automatically be in a specific namespace:

...
kind: Config
...

contexts:
- name: ...
  context:
    cluster: ...
    user: ...
    namespace: [namespace]
...

We can also add certificates to the KubeConfig file:

apiVersion: v1
kind: Config

clusters:
-  name: production
   cluster:
     # Using a file:
     certificate-authority: /etc/kubernetes/pki/ca.crt

     # Or using plain data:
     certificate-authority-data: [base64-encoded-text]

Authorization

We can restrict specific users / services / groups to perform operations on the cluster / use their own namespace only.

Authorization Techniques Available:

  • Node authorization
  • ABAC authorization
  • RBAC authorization
  • Webhook authorization

RBAC Example:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list", "get", "create", 'update", "delete"]
- apiGroups: [""]
  resources: ["ConfigMap"]
  verbs: ["create"]

kubectl create -f developer-role.yaml

Binding the user to the role:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: devuser-developer-binding
subjects:
- kind: User
  name: dev-user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io

kubectl create -f devuser-developer-binding.yaml

Useful commands:

kubectl get roles

kubectl get rolebindings

kubectl describe role developer

kubectl describe rolebinding devuser-developer-binding

Check access [result is 'yes'/'no']:

kubectl auth can-i create deployments

kubectl auth can-i delete nodes

kubectl auth can-i create deployments --as dev-user

kubectl auth can-i create pods --as dev-user --namespace test

Create Cluster-Role & Role-Binding:

Imperative Way:

k create clusterrole storage-admin --resource=persistentvolumes,storageclasses --verb=list,create,get,watch

k create clusterrolebinding storage-admin --user=storage-administrator --clusterrole=storage-admin

Declarative Way:

# cluster-admin-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-administrator
rules:
-  apiGroups: [""]
   resources: ["nodes"]
   verbs: ["list", "get", "create", "delete"]
# cluster-admin-role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-admin-role-binding
subjects:
-  kind: User
   name: cluster-admin
   apiGroup: rbac.authorization.k8s.io
roleRef:
   kind: ClusterRole
   name: cluster-administrator
   apiGroup: rbac.authorization.k8s.io

kubectl create -f cluster-admin-role-binding.yaml

Service Accounts

External services can access & control the Kubernetes cluster, services such as Prometheus, Jenkins, Azure DevOps, and more.

Creating and View Service Account:

kubectl create serviceaccount [name]

kubectl get serviceaccount

While creating the service account, a token is also created for it. We can view its name using kubectl describe serviceaccount [name].

The token is stored as a secret object (which we learned about earlier in this section), hence we can view the token using kubectl describe secret [serviceaccount-name]-token-kbbdm.

We can provide the token in a REST API request as a Bearer token.

When a new pod is created, the default service account's token is automatically mounted into the pod as a volume.

Note: To disable this behavior, add automountServiceAccountToken: false under spec in the pod-definition.

As we run kubectl describe pod [pod] we can see:

...
Mounts:
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-j4hkv (ro)
...

And then, if we exec the container and view the files in the mounted location using: kubectl exec -it [name] ls /var/run/secrets/kubernetes.io/serviceaccount

We can see 3 files: ca.crt, namespace, and token.

Then, if we check the contents of the token file, we can see the token being used to access the k8s API.

The default service account is very restricted; hence, if we would like to use another service account, we specify it in the pod-definition:

...
kind: Pod
...
spec:
  ...
  serviceAccountName: [service-account-name]

Image Security

Configuring credentials to access a private Docker registry:

kubectl create secret docker-registry regcred \
  --docker-server=private-registry.io \
  --docker-username=user \
  --docker-password=password \
  --docker-email=email

pod-definition.yaml:

...
spec:
  containers:
  ...
  imagePullSecrets:
  -  name: regcred

Security Contexts

As we know, when we run a container in Docker we can set security options, such as the user ID that runs the container, using docker run --user=101 ubuntu sleep 3600, or the Linux capabilities to be added to or removed from the container, using docker run --cap-add MAC_ADMIN ubuntu, and more.

Those can be configured in k8s as well in a container or pod level. If we configure it at a container level, this will override the settings on the pod.

Pod level example:

..
kind: Pod
..
spec:
  securityContext:
    runAsUser: 1000 # ID
  containers:
    -  name: ...
    ...

Container level example:

..
kind: Pod
..
spec:
  containers:
    - name: ...
      securityContext:
        runAsUser: 1000 # ID
       ...

Network Policy

Network policies are used to control the traffic between pods and other network endpoints. The traffic is directed towards those pods using rules specified by labels.

Creating a network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-policy
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          name: api-pod
    ports:
    - protocol: TCP
      port: 3306

For example - protecting the database pod that it will not allow access from any other pod other than the API pod and only from port 3306:

Step one is to block out every in/out (Ingress/Egress) traffic of the DB pod:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-policy
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          name: api-pod
    ports:
    - protocol: TCP
      port: 3306

The example above blocks all traffic to the DB pod, except from the API pod.

We can also add support for specific namespaces using namespaceSelector and specific IP addresses using ipBlock (for example a backup server initiating a backup) under ingress: - from: (we will see an example below).

If those selectors are specified as separate dashes (-), they are separate rules and act as an OR condition, for example:

...
spec:
  ...
  ingress:
  -  from:
     -  podSelector:
         ...
     -  namespaceSelector:
         ...
     -  ipBlock:
          cidr: 192.168.5.10/32
	 ...

However, if one selector is combined with another inside the same rule, it acts like an AND condition, for example:

...
spec:
 ...
 ingress:
 -  from:
    -  podSelector:
        ...
       namespaceSelector:
          ...
    -  ipBlock:
         cidr: 192.168.5.10/32
    ...

Here, namespaceSelector appears in the same rule (the same list item) as podSelector, which means both criteria must match for the rule to apply.

What about egress traffic to the backup server? Let's say we have an agent on the database pod that pushes backups to the backup server; the egress rule will be specified like so:

...
spec:
  podSelector:
    ...
  policyTypes:
  -  Ingress
  -  Egress
  ingress:
  -  from:
     ...
  egress:
  -  to:
     # can be any selector (pod, namespace, ipblock)
     -  ipBlock:
           cidr: 192.168.5.10/32
     ports:
     -  protocol: TCP
        port: 80



Section 7: Storage

Volume Drivers in Docker

In Docker, storage drivers help manage storage in images and containers.

Examples for storage drivers:

  • AUFS
  • ZFS
  • BTRFS
  • Device Mapper
  • Overlay

If we want the data to be persistent we can use volumes.

Volumes are not handled by storage drivers, however, they are handled by volume driver plugins.

Examples for volume driver plugins:

  • Local - Creates a volume on the local Docker host and stores the data under /var/lib/docker/volumes
  • Azure File Storage
  • Convoy
  • DigitalOcean Block Storage
  • Flocker
  • GCE-Docker
  • GlusterFS
  • NetApp
  • RexRay
  • Portworx
  • VMWare vSphere Storage
  • And more..

Container Storage Interface (And Other Interfaces)

As new tools for working with containers have developed, new sets of standards were created such as:

  1. CRI (Container Runtime Interface) - To allow k8s to work with container runtime engines such as Docker, rkt, cri-o.
  2. CNI (Container Network Interface) - To allow k8s to work with networking solutions such as weaveworks, flannel, cilium.
  3. CSI (Container Storage Interface) - To allow k8s to work with storage solutions like portworx, Amazon EBS, Dell EMC, GlusterFS and many more.

Those standards are not designed uniquely for k8s, those are applicable to any container orchestration platform.

Volumes

Under the volumes specification in the pod-definition YAML file we can define the hostPath:

volumes:
-  name: data-volume
   hostPath:
     path: /data
     type: Directory
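
To make the volume usable inside the container, it also needs to be mounted via volumeMounts in the container spec. A minimal sketch (the mount path /opt is only an example):

apiVersion: v1
kind: Pod
...
spec:
  containers:
  -  name: ...
     volumeMounts:
     -  mountPath: /opt
        name: data-volume
  volumes:
  -  name: data-volume
     hostPath:
       path: /data
       type: Directory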

This works fine on a single node; however, on a multi-node cluster it is not recommended, since each pod will use the /data directory on whichever node it runs on, and those directories are not actually the same or kept in sync.

To configure access to AWS Elastic Block Store volume for example:

volumes:
-  name: data-volume
   awsElasticBlockStore:
     volumeID: [volume-id]
     fsType: ext4

Persistent Volumes

The drawback of the approach we saw in the previous chapter is that the storage needs to be configured every time for each pod, in every pod-definition file.

We would like a centralized storage pool. For that we use persistent volumes, which is a cluster-wide storage pool and can be used by users that deploy the applications to the cluster.

Each user can use a portion of the centralized storage pool.

Creating the persistent volume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-vol1
spec:
  accessModes:
    -  [ReadOnlyMany/ReadWriteOnce/ReadWriteMany]
  capacity:
    storage: 1Gi
  hostPath: 
  # (in production environment use external storage service such as awsElasticBlockStore as we saw earlier)
    path: /tmp/data

kubectl create -f pv-definition.yaml

kubectl get persistentvolume

Persistent Volume Claim

Persistent volume claims are used to request storage from the persistent volumes.

The administrator creates persistent volumes.

The user creates persistent volume claims in order to use the storage.

Once the claims are created, k8s binds them to persistent volumes: it tries to find a persistent volume with sufficient capacity for what the claim requested (and matching other properties such as access modes, volume modes, storage class, selectors, etc.).

Persistent volumes and persistent volume claims have a one-to-one relationship. If no matching volume is available, the claim will stay in a 'Pending' state until new volumes become available.

Creating a Claim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
    -  ReadWriteOnce
  resources:
    requests:
      storage: 500Mi

kubectl create -f [file]

kubectl get persistentvolumeclaim

The PVC is created and is in 'pending' state.

Now, k8s will look at the persistent volumes created and once found, will bind the matching PV with the PVC (in this case it will bind the 1Gi storage capacity volume we created in the previous chapter).

We can use kubectl get persistentvolumeclaim again to see the Status is 'Bound'.

Deleting a PVC:

Delete using kubectl delete persistentvolumeclaim [name]

We can specify what happens to the underlying persistent volume once the claim is deleted, using the reclaim policy set on the PV (Retain is the default):

persistentVolumeReclaimPolicy: [Retain / Delete / Recycle]

Setting PVCs in Pods

apiVersion: v1
kind: Pod
...
spec:
  containers:
    ...
  volumes:
    - name: [name]
      persistentVolumeClaim:
        claimName: [claimName]

We can also set this in ReplicaSets and Deployments - under the pod's template section.

Storage Classes

Static Provisioning - Say we create a PV backed by a Google Cloud persistent disk; a prerequisite is that the disk must first be created on Google Cloud with the same name.

A nicer approach is to have the volume provisioned automatically when required.

This is called "Dynamic Provisioning" and can be implemented by Storage Classes.

How it works is that we create a configuration (the storage class) that automatically provisions storage in the storage service and attaches it to the pods when a claim is made.

sc-definition.yaml:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: google-storage
provisioner: kubernetes.io/gce-pd

Now we don't need to create the PV definition anymore, because the PV will be provisioned automatically by the storage class when a claim uses it.

For the pvc-definition to use the new storage class, we add storageClassName: google-storage under spec in pvc-definition.yaml file.
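
For example, the claim might then look like this (reusing the myclaim example from earlier, assuming the google-storage class above exists):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
    -  ReadWriteOnce
  storageClassName: google-storage
  resources:
    requests:
      storage: 500Mi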




Section 8: Networking

Prerequisite: Switching & Routing Useful Linux Commands

List and modify interfaces on the host: > ip link

See the IP addresses assigned to those interfaces: > ip addr

Set IP addresses on the interfaces: > ip addr add 192.168.1.10/24 dev eth0

Note: Those operations do not persist after reboot. If we would like to make them persistent, set them in the /etc/network/interfaces file.

View the routing table: > ip route

Add entries into the routing table: > ip route add 192.168.1.0/24 via 192.168.2.1

Check if IP forwarding is enabled on the host (configured as a router): > cat /proc/sys/net/ipv4/ip_forward

Prerequisite: DNS Management (Linux)

Configuring a name entry for Host B in /etc/hosts on Host A is problematic because the mapping exists only on Host A (Host A will not look up the real address of Host B anywhere else), which means we would need to configure the /etc/hosts file of every host in our network.

This approach is tedious at scale, so we need a better solution - A DNS server that all hosts point to.

How to make a host point to the DNS server?

cat /etc/resolv.conf

nameserver             192.168.1.100

To change the order in which a host looks at its local files (/etc/hosts) versus the DNS server, we can edit the file /etc/nsswitch.conf

Example of the contents of /etc/nsswitch.conf:

...
hosts:             files dns
...

Which means that the host will first look at its files (/etc/hosts) and then at the DNS server.

  • Use "nslookup" to query a hostname from a DNS server (does not consider the entries at /etc/hosts.
  • Use "dig" to test DNS name resolution,

Record types (how the records are stored in the DNS server):

  • A = Hostname <-> IPv4
  • AAAA = Hostname <-> IPv6
  • CNAME = Hostname <-> Name (Alias)
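
For illustration, such records could look roughly like this in a zone file (the names and addresses are made up):

web-server      IN  A      192.168.1.101    ; hostname -> IPv4
web-server      IN  AAAA   2001:db8::101    ; hostname -> IPv6
www.web-server  IN  CNAME  web-server       ; alias -> hostname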

Prerequisite: Network Namespaces (Linux)

Containers are isolated from other containers running on the host.

We can see that as an example by running ps aux both from the container and from the host. We will see different PIDs for the same processes. This is performed using "Namespaces".

Create a new network namespace:

ip netns add [name]

View the namespaces:

ip netns

Execute command in specific namespace:

> ip netns exec [name]  [command]
# OR
> ip -n [name] [command]

Examples:

  1. Connecting two namespaces (using a virtual ethernet (veth) pair - like a virtual cable):
> ip link add veth-red type veth peer name veth-blue # creating the veth pair
> ip link set veth-red netns red # attaching one end to the red namespace
> ip link set veth-blue netns blue # attaching the other end to the blue namespace
> ip -n red addr add 192.168.15.1 dev veth-red # assigning an IP address within the red namespace
> ip -n blue addr add 192.168.15.2 dev veth-blue # assigning an IP address within the blue namespace
> ip -n red link set veth-red up # bring up the red interface
> ip -n blue link set veth-blue up # bring up the blue interface
> ip netns exec red ping 192.168.15.2 # reach blue from red
> ip netns exec red arp # see blue in red's ARP table
> ip netns exec blue arp # see red in blue's ARP table
> arp # the host's ARP table doesn't show the new namespaces and interfaces
  2. Creating a virtual switch (a virtual network that lets the namespaces communicate as if on a physical network) using Linux Bridge:
> ip link add v-net-0 type bridge # creating a new interface for the host
> ip link set dev v-net-0 up # bring the interface up

# Connecting the interfaces to the network switch:
> ip -n red link del veth-red # deleting the earlier link removes both of its ends (red and blue)
> ip link add veth-red type veth peer name veth-red-br # creating a veth pair to connect the red namespace to the bridge network
> ip link add veth-blue type veth peer name veth-blue-br # creating a veth pair to connect the blue namespace to the bridge network
> ip link set veth-red netns red # attach the first end of the interface to the red namespace
> ip link set veth-red-br master v-net-0 # attach the other end to the bridge network
# Same for the blue network:
> ip link set veth-blue netns blue
> ip link set veth-blue-br master v-net-0

# Setting IP addresses to the links & setting them up:
> ip -n red addr add 192.168.15.1 dev veth-red
> ip -n blue addr add 192.168.15.2 dev veth-blue
> ip -n red link set veth-red up
> ip -n blue link set veth-blue up

# Assigning an IP address to the bridge switch interface to allow access to namespaces through it:
> ip addr add 192.168.15.5/24 dev v-net-0

Now, the network we built is restricted to the host: we can't reach networks outside the host from the namespaces, and vice versa.

We need to provide an entry in the routing table for a gateway to the outside world.

We can use the host itself to act as the gateway for those private networks, since it hosts all the namespaces.

So we can route the traffic this way:

> ip netns exec blue ip route add [destination] via [gateway="192.168.15.5"]

Now when we try to ping the external internet we don't get the network unreachable message anymore, however we don't get a response back from the destination, since the external network doesn't know about our private network.

We need to enable NAT on our host to act as a gateway, to add NAT functionality to the host: > iptables -t nat -A POSTROUTING -s 192.168.15.0/24 -j MASQUERADE

Now we get a response back from the target, the private network is recognized.

However, we still can't access the internet from the namespaces in general, so we need to add a default route through the host: > ip netns exec blue ip route add default via [gateway="192.168.15.5"]

Now, say the blue namespace runs a web application on port 80; external hosts still can't reach it, because they don't know about our private network.

So we need to add a port-forwarding rule that forwards any traffic arriving on the host's port 80 to port 80 of the blue namespace: > iptables -t nat -A PREROUTING -p tcp --dport 80 --to-destination 192.168.15.2:80 -j DNAT

Prerequisite: Networking in Docker

In docker we have different networking options to choose from (as --network option in docker run):

  1. None - The container can't reach any network resource, and no resource can reach the container.
  2. Host - The container is attached directly to the host's network.
  3. Bridge - An internal private network that the containers and the host attach to, providing isolation from containers that are not connected to that bridge network. We can see the port-forwarding rules Docker creates using iptables -nvL -t nat

Prerequisite: Container Networking Interface (CNI)

So far, in networking we learned the following topics:

  1. Creating network namespaces
  2. Creating bridge network/interface
  3. Creating vEth pairs (pipe, virtual cable)
  4. Attaching vEth to namespaces
  5. Attaching other vEth to Bridge
  6. Assigning IP addresses
  7. Bringing the interfaces up
  8. Enabling NAT - IP Masquerade

We then learned that Docker (and other container runtime engines) basically perform the same steps.

Since they all perform more or less the same steps to configure the networking, why not create a single program that does all of that, so each runtime doesn't have to deal with the networking itself?

This program is called the Bridge program.

Use case example: adding a container to a specific namespace: bridge add [container_id] /var/run/netns/[container_id]

Now, this standard (and more) are implemented in the CNI framework that is designed for dynamically configuring networking resources for containers.

Note: Docker does not implement CNI, it supports CNM (Container Network Model) which is a bit different. Which means that we can't use the plugin directly on Docker.

For example:

Can't do:

> docker run --network=cni-bridge nginx

Can do:

> docker run --network=none nginx
> bridge add [container_id] /var/run/netns/[container_id]

Cluster Networking

As we know, the k8s cluster consists of master and worker nodes.

  • Each node must have at least one interface connected to a network.

  • Each interface must have an address configured.

  • The hosts must have a unique name & MAC address.

There are also some ports that need to be opened (investigate them if something is not working properly):

  • The master should accept connection on port 6443 (for the API server).
  • The kubelet on the master & worker nodes listens on port 10250 (yes, the master node can also run a kubelet, even if we haven't mentioned it until now).
  • The kube-scheduler requires port 10251 to be open.
  • The kube-controller-manager requires port 10252 to be open.
  • The worker nodes expose service for external access on ports 30000-32767.
  • The ETCD server listens on port 2379.
  • If we have multiple master nodes, all the ports we discussed regarding the master node need to be open on each of them, and we also need an additional port 2380 open so the etcd members on the different master nodes can communicate with each other (peer communication).
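
To verify that a component is actually listening on its expected port, we can check on the relevant node, for example:

> netstat -nltp | grep 6443    # API server
> netstat -nltp | grep 2379    # etcd
# (or use ss -nltp on systems without netstat)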

Pod Networking

Our k8s cluster will run a large number of pods and services. We need to figure out: how will the pods communicate with each other? How can we access the services running on the pods, both internally within the cluster and externally from outside the cluster?

Kubernetes expects us to solve these issues, but provides the k8s networking model to help us do so.

The Networking Model:

  1. Every pod should have an IP address.
  2. Every pod should be able to communicate with every other pod in the same node with that IP address.
  3. Every pod should be able to communicate with every other pod on other nodes without NAT and using that IP address.

Solving Parts 1+2:

Say we have 3 nodes in the cluster.

The nodes are running on an external network and connected by LAN at 192.168.1.0.

Node 1: 192.168.1.11

Node 2: 192.168.1.12

Node 3: 192.168.1.13

Next, containers are created and k8s creates for them network namespaces.

To enable communication, we need to attach the namespaces to a bridge network on each node using ip link add v-net-0 type bridge and bring them up using ip link set dev v-net-0 up.

Now we need to assign IP addresses to the bridge interfaces/networks - each node's bridge network gets its own subnet (whatever we like), for example: ip addr add 10.244.[node_number].1/24 dev v-net-0.

Now the next steps need to be performed for every container and every time a container is created:

> ip link add ... # Create vEth pair
> ip link set ... # Attach vEth first end to the container
> ip link set ... # Attach vEth second end to the bridge 
> ip -n [namespace] addr add ... # Assigning IP addresses
> ip -n [namespace] route add ... # Adding a route to the default gateway

# Now decide what IP address will be used.
# For this example we use 10.244.1.2

> ip -n [namespace] link set ... # Bring up the interface

With that, we solved parts 1+2: each pod has its own IP address, and pods can communicate with each other within their own node.

Solving Part 3:

Now we need to enable them to reach other pods on other nodes.

Naive Solution on a Simple Setup:

From Node 1:

# Do this for all pods we want to reach on other nodes 
> ip route add [pod_ip_on_node2] via [node2_ip]

A Better Approach:

If our system is more complex, we would define an IP routing table on the router and point all hosts to use that as the default gateway.

| Network | Gateway |
| --- | --- |
| 10.244.1.0/24 | 192.168.1.11 |
| 10.244.2.0/24 | 192.168.1.12 |
| 10.244.3.0/24 | 192.168.1.13 |

Since these are many manual steps, say we wrap them all in a script.

That's where CNI comes in: it tells k8s how (and when) to call the script when a container is created, and it also specifies for us how the script should look.

The script structure should be:

net-script.sh

ADD)
  # Create veth pair
  # Attach veth pair
  # Assign IP address
  # Bring up the interface
  ip -n [namespace] link set ...

DEL)
  # Delete veth pair
  ip link del ...

Now, every time a container is created, the kubelet looks at the CNI configuration directory (--cni-conf-dir=/etc/cni/net.d) to identify which plugin (script) to use.

It then searches the CNI binaries directory (--cni-bin-dir=/opt/cni/bin) to find the executable and runs the script using ./net-script.sh add [container] [namespace]

  • To view the kubelet options (like we saw above) use: ps -aux | grep kubelet

  • To view the supported CNI plugins use: ls /opt/cni/bin

  • To view the CNI configuration file use: ls /etc/cni/net.d
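
For reference, a CNI configuration file in /etc/cni/net.d for the bridge plugin typically looks roughly like this (a sketch; the name, bridge, subnet and version are illustrative):

{
  "cniVersion": "0.3.1",
  "name": "mynet",
  "type": "bridge",
  "bridge": "v-net-0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.0.0/16"
  }
}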

CNI Weave

The routing table is a good solution for simple networks, but for big networks with hundreds of nodes, managing routes this way may not scale.

The Weaveworks Weave Net plugin (among many other tools; we will focus on Weave for now) can help with this problem.

It can be thought of as an "outsourced delivery company" that handles the massive traffic the simple router setup may fail to handle.

Weave can be deployed as:

  1. Services or daemons on each node in the cluster, manually.
  2. Pods in the cluster (if k8s is already set up) using kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')" , then view the pods with kubectl get pods -n kube-system and view the logs with kubectl logs weave-net-5gcmb weave -n kube-system.

Service Networking

Usually we don't configure pods to communicate with each other directly.

If we want pods to access applications running in other pods, we use services.

In this chapter we will learn how services such as ClusterIP and NodePort (that we learned in Section 1) work.

In addition to the kubelet, each node runs kube-proxy, which watches for changes in the cluster; each time a new service is created, it configures the required forwarding rules.

It's important to mention that services are cluster-wide virtual objects: unlike pods and other objects, there are no processes, namespaces or interfaces behind a service.

When we define a service object, it gets an IP address from a predefined range. Then, the kube-proxies create forwarding rules on each node.

Then, every time a pod tries to reach the IP & port of the service, the traffic gets forwarded to the IP of one of the backing pods.
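
When kube-proxy runs in iptables mode, those forwarding rules can be inspected directly on any node, for example (the service name is a placeholder):

> iptables -L -t nat | grep [service-name]
# shows the DNAT rules mapping the service's cluster IP & port to the pod IPs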

DNS In Kubernetes

How to access a service with the DNS name that is configured in the Kube DNS:

  1. Same namespace - curl http://[service_name]
  2. Different namespace - curl http://[service-name].[namespace]

The fully qualified domain name of the service is actually like:

curl http://[service-name].[namespace].svc.cluster.local

| Hostname | Namespace | Type | Root | IP Address |
| --- | --- | --- | --- | --- |
| [service-name] | [namespace] | svc | cluster.local | [service-ip-address] |
| ... | ... | ... | ... | ... |

For pods however, the entries are a bit different:

curl http://[ip-separated-by-dashes].[namespace].pod.cluster.local

The hostname is the IP address of the pod, separated by dashes (10.0.0.1 -> 10-0-0-1).

| Hostname | Namespace | Type | Root | IP Address |
| --- | --- | --- | --- | --- |
| [ip-separated-by-dashes] | [namespace] | pod | cluster.local | [pod-ip-address] |
| ... | ... | ... | ... | ... |

CoreDNS In Kubernetes

Kubernetes creates DNS entries in the DNS server for the services and pods. The difference is in the naming of the pods - dashes instead of dots (as we saw in the previous chapter).

From k8s version 1.12 the recommended DNS server k8s uses switched to CoreDNS instead of Kube-DNS.

In the kube-system namespace, the CoreDNS server is deployed as a pod (actually 2 pods in a replicaset within a deployment, for redundancy purposes) that runs the ./coredns executable.

In the case of k8s, the file used is /etc/coredns/Corefile where plugins are configured for monitoring, cache, error handling and more.

The plugin that enables connectivity between CoreDNS and Kubernetes is the kubernetes entry inside the Corefile.

This entry sets (among other properties) cluster.local as the top-level domain of the cluster, which means every record in CoreDNS falls under this domain.

There are more entries in the Corefile, such as proxy, meaning that every DNS request CoreDNS can't resolve is forwarded to the nameserver configured there - by default /etc/resolv.conf, i.e. the nameserver of the k8s node.
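
A simplified sketch of what such a Corefile can look like (abridged; the exact set of plugins varies between versions, and newer CoreDNS versions use forward instead of proxy):

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       upstream
       fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    reload
}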

Note that /etc/coredns/Corefile is also available to edit as a configmap (kubectl get configmap -n kube-system).

When CoreDNS is deployed, it also creates a service named kube-dns so that other components in the cluster can reach it.

The kubelet is responsible for the DNS configuration on pods, meaning that if we view the kubelet configuration file /var/lib/kubelet/config.yaml, we can see the DNS server IP address.

All of this configuration, done behind the scenes, allows us to resolve the cluster's services and pods from inside the same namespace and from other namespaces.

We can also view the DNS entry using:

> host [service-name]             # all three resolve to the same record
> host [service-name].default
> host [service-name].default.svc
----
# Output:
[service-name].default.svc.cluster.local has address [ip-address]

However, for pods, when searching using the "host" command as above we need to provide the full DNS name:

host [pod-ip-separated-by-dashes].default.pod.cluster.local
----
# Output
[pod-ip-separated-by-dashes].default.pod.cluster.local has address [ip-address]

Ingress in Kubernetes

Say we are deploying an E-Commerce application in a k8s cluster for a company, and the application will be hosted at www.my-online-store.com.

We dockerize the application and deploy it as a pod in a deployment.

We also deploy a MySQL database as a pod and create a ClusterIP service for it, in order to allow the application to use the database.

Now, the application is working on the cluster, but we want to make it accessible to the world, hence, we create a NodePort service and the application is available at port 38080 for example.

Now, the users can access the application on http://[node-ip]:38080/, if the traffic increases, we scale the app by increasing the replicas accordingly.

There are also steps like configuring a proxy server (e.g., the default nginx server) to forward all requests from port 80 to 38080, so that users don't need to remember the port number, and setting up a DNS entry.

All of this configuration is for an on premise application inside our own data center.

Now, let's suppose we want to deploy the app on GCP. Here, instead of creating a NodePort service, we would create a LoadBalancer service, and GCP would configure a load balancer for the service.

Now, say the application grew and now supports video streaming over www.my-online-store.com/watch and we also moved the E-Commerce to www.my-online-store.com/wear.

Now, the applications are completely separated from each other, but we want them to share the same cluster resources - so we deploy each app as a separate deployment on k8s.

Now we create a LoadBalancer over port 38282 and provision it on GCP as well.

Now, how do we direct the traffic between our application services depending on the URL the user entered? We need a proxy / load balancer to choose between the video and e-commerce apps.

This configuration is required any time we add a new service to our application.

Of course we would also like to enable SSL encryption on our website to allow the app to use https instead of http. Where do we configure that?

To avoid reconfiguration every time, the best approach is to set all of this configuration once in the k8s cluster just as any other definition file - That is the purpose of Ingress.

Ingress lets users access the application through a single externally accessible URL, which we can configure to route to different services within the cluster based on the URL path, while also implementing SSL - essentially a layer-7 load balancer that lives inside the cluster and is configured like any other k8s object.

Note that we still need to expose (once) the ingress-service, either by NodePort or with a cloud native Load Balancer.

For ingress we need to deploy a reverse proxy solution such as Nginx, HAProxy, Traefik, etc., and then define the ingress configuration (using definition files).

The solution we deploy is the "Ingress Controller", and the set of rules we configure is called the "Ingress Resources".

It's important to mention that a cluster does not come with an ingress controller built in, which means that deploying an ingress controller is a prerequisite to configuring ingress resources.

Now let's use Nginx for this example as the Ingress Controller, which is deployed in our cluster as any other deployment object (the nginx-ingress-controller deployment definition file contains 1 replica of the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.21.0, and the name of the container is nginx-ingress-controller).

For more information on how to deploy the nginx-ingress-controller, refer to the official documentation.

Next, we configure a NodePort with the selector name: nginx-ingress to expose the Ingress Controller.
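
Such a NodePort service could look roughly like this (the ports and labels are illustrative, matching the nginx-ingress deployment described above):

apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress
spec:
  type: NodePort
  selector:
    name: nginx-ingress
  ports:
  -  name: http
     port: 80
     targetPort: 80
     protocol: TCP
  -  name: https
     port: 443
     targetPort: 443
     protocol: TCP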

Ingress controllers are quite sophisticated: they monitor the cluster for ingress resources and reconfigure the nginx server whenever a change occurs. For that, the controller requires a service account with the relevant permissions (correct roles and role bindings).

Now let's handle the Ingress Resource, which will eventually route users to the specific application based on the URL path they navigate to.

Ingress configuration file:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-wear
spec:
  defaultBackend:
  # (traffic is routed to the application's service and not directly to the pods)
    service:
      name: ...
      port:
        number: ...

kubectl create -f ingress-definition.yaml

kubectl get ingress

We can set rules in the ingress definition file. For example, what happens when the user reaches different URL addresses.

For example:

...
kind: Ingress
...
spec:
  rules:
  -  http:
       paths:
       -  path: /path1
          pathType: Prefix
          backend:
            service:
              name: ...
              port:
                number: ...
       -  path: /path2
          pathType: Prefix
          backend:
            ...

And in the imperative way: kubectl create ingress [ingress-name] --rule="host/path=service:port"

Now we can view the rules using kubectl describe ingress [ingress].
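
Rules can also split traffic by host instead of (or in addition to) path - for example routing wear.my-online-store.com and watch.my-online-store.com to different services. A sketch (the service names are hypothetical):

...
kind: Ingress
...
spec:
  rules:
  -  host: wear.my-online-store.com
     http:
       paths:
       -  path: /
          pathType: Prefix
          backend:
            service:
              name: wear-service
              port:
                number: 80
  -  host: watch.my-online-store.com
     http:
       paths:
       -  path: /
          pathType: Prefix
          backend:
            service:
              name: watch-service
              port:
                number: 80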




Section 9: Designing a Cluster

Choosing Infrastructure

On a laptop or local machine running Linux, we can start by installing the binaries and setting up a local cluster (those operations can be automated using solutions that we will explore in this section).

On Windows however, we cannot run Kubernetes natively since there are no Kubernetes binaries for Windows. Hence, we must rely on a virtualization solution such as Hyper-V, VMware Workstation or VirtualBox to create Linux virtual machines on which we can run the Kubernetes cluster.

Even the Docker containers for Windows are Linux-based; behind the scenes, those run on a small Linux operating system created by Hyper-V.

For easily deploying a single node cluster we can use Minikube.

For quickly deploying a single/multi node cluster we can use Kubeadm (requires manual provisioning and configuration of the host).

So the difference between the two is that Minikube provisions the virtual machine for us, while kubeadm expects the virtual machines (or hosts) to already be provisioned and ready for use.

The "Manual" solutions are named "Turnkey solutions", on which we provision, configure the VMs manually, use scripts to deploy the cluster, and maintain the VM ourselves (for example Kubernetes on AWS using KOPS).

The managed solutions provide services like Kubernetes-as-a-Service, VM provisioning and maintenance, and more (for example GKE), without having to perform the configuration ourselves.

The Turnkey solutions can be:

  • OpenShift
  • Cloud Foundry Container Runtime
  • VMWare Cloud PKS
  • Vagrant
  • And more..

The Hosted (Managed) solutions can be:

  • Google Kubernetes Engine (GKE)
  • OpenShift Online
  • Azure Kubernetes Service
  • Amazon Elastic Container Service for Kubernetes (EKS)

Configuring High Availability

What happens if the master node fails? Without a standby master, the cluster can no longer be managed and the system will slowly degrade.

So, in a high availability environment (where we have redundancy over each cluster component) we should use multiple master nodes.

The API server on all of the master nodes must be alive & running at all times, so we should have a load balancer configured in front of the master nodes to split traffic between the API servers.

The controller manager and the scheduler, however, must not run actively on all masters in parallel, otherwise they would duplicate work (for example, creating more pods than actually needed).

So they run in "Active-Standby" mode, where a leader election decides which master's instance is currently active and which are passive.

For example to achieve this when the Controller Manager starts, we can use: kube-controller-manager --leader-elect true [options].

In an HA scenario, if etcd runs on the master nodes themselves ("stacked" topology), it is easier to set up and manage, but riskier during failures.

So, we can separate etcd and configure it to run on its own set of servers, externally.

This approach is called External ETCD Topology and is less risky but harder to setup and requires more servers.

As a reminder, the etcd servers the API server points to are defined (via the --etcd-servers flag) in /etc/systemd/system/kube-apiserver.service.

ETCD In High Availability Setup

ETCD is a distributed key-value store (holding data as documents/pages), as opposed to a traditional tabular database.

Say we have 3 servers running ETCD in a HA setup, all running and maintaining an identical copy of the database for redundancy purposes.

How do we make sure that the data is consistent and make sure that every read/write gets updated on all the copies of the database?

Only one instance is responsible for processing write requests.

The nodes elect a leader among them, and the other nodes become followers; the leader processes the writes and replicates a copy of the modified data to the followers.

If a write request is received by a follower node, it forwards the request to the leader; the leader processes it and sends the update to the followers.

How is the leader elected? How do they make sure the writes are distributed across all the instances?

The leader election protocol is named RAFT.

How the RAFT protocol works in a cluster of 3 master nodes:

As a starting point, we don't have a leader elected.

The RAFT algorithm uses a random timer on each of the three nodes.

The first node whose timer expires sends a request to the other nodes asking to be elected as the leader.

The other master nodes vote for the node that requested the leadership, and once the node obtains the leader role, it sends notifications (heartbeats) at regular intervals to assert that it is still the leader.

If the other master nodes don't receive the notification in time, they assume the leader went down or lost connectivity, and the election process starts again.

Now, a write is considered complete only once it has been written to a majority of the nodes in the cluster.

The majority is called the Quorum and is calculated using: $\lfloor \dfrac{N}{2} \rfloor + 1$.

The Quorum is the minimum number of nodes that must be available in order for the cluster to write successfully and function properly.

Note that if we have 2 nodes in the cluster, the quorum is still 2, so we gain no fault tolerance - having 2 instances is like having 1 instance.

So the recommended minimum number of instances is 3, and ideally an odd number, so the quorum can still be met when a node fails or the network partitions.

As a rule of thumb the best numbers for fault tolerance are 3, 5 and 7 master nodes.
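
Working the formula out for a few cluster sizes (fault tolerance = N - quorum):

| Instances (N) | Quorum | Fault Tolerance |
| --- | --- | --- |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |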



Created By Idan Refaeli ©
