
Arseny Zinchenko

Originally published at rtfm.co.ua

Kubernetes: Service, load balancing, kube-proxy, and iptables

One day I wondered: how does load balancing between Pods actually work in Kubernetes?

That is, we have an external Load Balancer, then a Service, and behind it, a set of Pods.

What happens when we receive a network packet from the outside world and we have a few Pods: how will the traffic be distributed between them?

kube-proxy

So, the routing rules between a Service and its Pods are controlled by the kube-proxy service, which can work in one of the three following modes - user space proxy mode, iptables proxy mode, and IPVS proxy mode.

User space proxy mode

A deprecated mode that previously was the default one.

When using this mode, kube-proxy watches for changes in the cluster and opens a dedicated TCP port on a WorkerNode for each new Service.

Then, iptables on this WorkerNode starts routing traffic from this port to the kube-proxy service itself, which acts as a proxy using the round-robin approach, i.e. it sends traffic to the next pod in its backends list. During this, kube-proxy can try to send the packet to another pod if the first one didn't respond.

iptables proxy mode

This is our case, and the one we will investigate in this post. Currently, it is the default mode.

When using this mode, kube-proxy watches for changes in the cluster and opens a dedicated TCP port on a WorkerNode for each new Service.

Then, iptables on this WorkerNode sends traffic from this port to the Kubernetes Service, which is actually a chain in the iptables rules, and via this chain traffic goes directly to the pods which are the backends of this Service. The target pod is selected randomly.

This mode is less expensive in terms of system resources, as all necessary operations are performed in the kernel by the netfilter module. Also, this mode works faster and is more reliable, because there is no "middleware" - the kube-proxy itself - in the data path.

But if the first pod the packet was sent to does not respond, the connection fails, while in the user space proxy mode kube-proxy would try to send it to another pod.

This is why it is so important to properly configure Readiness Probes, so Kubernetes will not send traffic to pods that are not ready to accept it.
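
For example, here is a minimal sketch of such a probe (the /healthz path is a hypothetical endpoint, not taken from this post's application); a Pod is added to the Service's backends only after this check starts to succeed:

containers:
  - name: appname-backend
    ports:
      - containerPort: 3001
    readinessProbe:
      httpGet:
        path: /healthz     # hypothetical health-check endpoint
        port: 3001
      initialDelaySeconds: 5
      periodSeconds: 10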

Furthermore, this mode is more complicated to debug, because in the user space proxy mode kube-proxy writes its logs to /var/log/kube-proxy, while with netfilter you have to debug the kernel itself.
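
If you do need to trace what netfilter does with a packet, one option (a sketch, not from the original post; where the trace output lands depends on the distribution's netfilter logging backend) is the TRACE target of the raw table:

# trace packets arriving at the NodePort used later in this post
iptables -t raw -A PREROUTING -p tcp --dport 31103 -j TRACE
# reproduce a request, then look for TRACE entries in the kernel log
dmesg | grep 'TRACE:'
# remove the rule when finished
iptables -t raw -D PREROUTING -p tcp --dport 31103 -j TRACE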

IPVS proxy mode

The most recent mode. It uses the netlink interface and creates IPVS rules for new Kubernetes Services.

Its main advantage is the variety of available load-balancing modes (see the configuration sketch after this list):

  • rr: round-robin
  • lc: least connection (smallest number of open connections)
  • dh: destination hashing
  • sh: source hashing
  • sed: shortest expected delay
  • nq: never queue
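
As a sketch (this is not this cluster's configuration), enabling IPVS and picking a scheduler in the kube-proxy configuration could look like this:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  # one of the schedulers from the list above, e.g. round-robin
  scheduler: "rr"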

kube-proxy config

Let's check which mode is used in our case, in an AWS Elastic Kubernetes Service cluster.

Find kube-proxy pods:

$ kubectl -n kube-system get pod -l k8s-app=kube-proxy -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-proxy-4prtt 1/1 Running 1 158d 10.3.42.245 ip-10-3-42-245.us-east-2.compute.internal <none> <none>
kube-proxy-5b7pd 1/1 Running 0 60d 10.3.58.133 ip-10-3-58-133.us-east-2.compute.internal <none> <none>
…
kube-proxy-tgl52 1/1 Running 0 59d 10.3.59.159 ip-10-3-59-159.us-east-2.compute.internal <none> <none>

On every WorkerNode of the cluster, we have a dedicated kube-proxy instance with the kube-proxy-config ConfigMap attached:

$ kubectl -n kube-system get pod kube-proxy-4prtt -o yaml
apiVersion: v1
kind: Pod
…
spec:
…
  containers:
  …
    volumeMounts:
    …
    - mountPath: /var/lib/kube-proxy-config/
      name: config
…
  volumes:
  …
  - configMap:
      defaultMode: 420
      name: kube-proxy-config
    name: config

Check this ConfigMap content:

$ kubectl -n kube-system get cm kube-proxy-config -o yaml
apiVersion: v1
data:
  config: |-
    …
    mode: "iptables"
    …
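
Another way to check the active mode (an assumption, based on kube-proxy's metrics server, which by default listens on port 10249 of the node) is to query it directly from a WorkerNode:

$ curl -s http://localhost:10249/proxyMode
iptables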

Now that we are more familiar with the kube-proxy modes, let's go deeper to see how it works and what iptables is doing here.

Kubernetes Pod load-balancing

For our journey, let’s take a real application with an Ingress (AWS Application Load Balancer, ALB) which sends traffic to a Kubernetes Service:

$ kubectl -n eks-dev-1-appname-ns get ingress appname-backend-ingress -o yaml
…
  - backend:
      serviceName: appname-backend-svc
      servicePort: 80
…

Check the Service itself:

$ kubectl -n eks-dev-1-appname-ns get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
appname-backend-svc NodePort 172.20.249.22 <none> 80:31103/TCP 63d

Here we have the NodePort type - the Service listens on a TCP port on every WorkerNode.
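
A minimal sketch of such a Service (the selector label is assumed, as it is not shown in this post; normally the nodePort is auto-allocated from the 30000-32767 range):

apiVersion: v1
kind: Service
metadata:
  name: appname-backend-svc
  namespace: eks-dev-1-appname-ns
spec:
  type: NodePort
  selector:
    app: appname-backend   # assumed Pod label
  ports:
    - port: 80             # the Service (ClusterIP) port
      targetPort: 3001     # the containerPort of the backend Pods
      nodePort: 31103      # the port opened on every WorkerNode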

The e172ad3e-eksdev1appname-abac ALB routes the traffic from clients to the e172ad3e-4caa286edf23ff7e06d AWS TargetGroup.

The EC2 instances in this TargetGroup are listening on port 31103, which we saw in the Service details above.

AWS LoadBalancer traffic modes

See the AWS documentation for details.

A side note: the AWS ALB supports two traffic modes, Instance mode and IP mode (see the annotation sketch after this list).

  • Instance mode : the default mode; it requires the Kubernetes Service to have the NodePort type and routes traffic to a TCP port of a WorkerNode
  • IP mode : with this mode, the targets of the ALB are the Kubernetes Pods directly, instead of the Kubernetes Worker Nodes
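
With the AWS ALB Ingress Controller (today's AWS Load Balancer Controller), the mode is selected by an annotation on the Ingress; a sketch, assuming that controller's annotation names:

metadata:
  annotations:
    kubernetes.io/ingress.class: alb
    # "instance" (the default) targets the NodePort on the WorkerNodes,
    # "ip" targets the Pod IPs directly
    alb.ingress.kubernetes.io/target-type: instance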

Now we need access to one of these nodes: connect to a Bastion host, and then to one of the WorkerNodes:

ubuntu@ip-10-3-29-14:~$ ssh ec2-user@10.3.49.200 -i .ssh/bttrm-eks-nodegroup-us-east-2.pem
Last login: Thu May 28 06:25:27 2020 from ip-10-3-29-32.us-east-2.compute.internal

       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

39 package(s) needed for security, out of 64 available
Run "sudo yum update" to apply all updates.
[ec2-user@ip-10-3-49-200 ~]$ sudo -s
[root@ip-10-3-49-200 ec2-user]#

kube-proxy and iptables

So, a packet from a client came to the WorkerNode.

On this node, the kube-proxy service binds to the allocated port, so that no other service can use it, and it also creates a set of iptables rules:

[root@ip-10-3-49-200 ec2-user]# netstat -anp | grep 31103
tcp6 0 0 :::31103 :::* LISTEN 4287/kube-proxy

The packet arrives at port 31103, and from here it starts its way through the iptables rules.

iptables rules

The Linux kernel accepts this packet and sends it to the PREROUTING chain of the nat table:

See also the diagram describing the kube-proxy iptables rules.

Check rules in the nat table and its PREROUTING chain:

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L PREROUTING | column -t
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */

Here we have a jump to the next chain, KUBE-SERVICES, which in its turn contains the KUBE-NODEPORTS chain as its last rule; it captures packets for Services with the NodePort type:

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SERVICES -n | column -t
…
KUBE-NODEPORTS all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
…

Check rules in this chain:

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-NODEPORTS -n | column -t | grep 31103
KUBE-MARK-MASQ tcp -- 0.0.0.0/0 0.0.0.0/0 /* eks-dev-1-appname-ns/appname-backend-svc: */ tcp dpt:31103
KUBE-SVC-TII5GQPKXWC65SRC tcp -- 0.0.0.0/0 0.0.0.0/0 /* eks-dev-1-appname-ns/appname-backend-svc: */ tcp dpt:31103

Here, packets for dpt:31103 (destination port 31103) are intercepted and sent to the next chain, KUBE-SVC-TII5GQPKXWC65SRC. Let's check it now:

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SVC-TII5GQPKXWC65SRC | column -t
Chain KUBE-SVC-TII5GQPKXWC65SRC (2 references)
target prot opt source destination
KUBE-SEP-N36I6W2ULZA2XU52 all -- anywhere anywhere statistic mode random probability 0.50000000000
KUBE-SEP-4NFMB5GS6KDP7RHJ all -- anywhere anywhere

Here we can see the next two chains, and this is where the "routing magic" happens: a packet will be randomly sent to one of these chains, each having a "weight" of 0.5 out of 1.0 (statistic mode random probability 0.5), as per the official Kubernetes documentation:

By default, kube-proxy in iptables mode chooses a backend at random.

See also Turning IPTables into a TCP load balancer for fun and profit.
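
To see the same mechanism outside of Kubernetes, here is a standalone sketch (the addresses and the port are made up) that spreads new TCP connections to port 8080 between two backends in the same 50/50 manner:

# 50% of the connections are DNAT-ed to the first backend...
iptables -t nat -A PREROUTING -p tcp --dport 8080 \
  -m statistic --mode random --probability 0.5 \
  -j DNAT --to-destination 10.0.0.1:8080
# ...and everything not caught by the rule above goes to the second one
iptables -t nat -A PREROUTING -p tcp --dport 8080 \
  -j DNAT --to-destination 10.0.0.2:8080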

Check those chains.

The first one:

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SEP-N36I6W2ULZA2XU52 -n | column -t
Chain KUBE-SEP-N36I6W2ULZA2XU52 (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 10.3.34.219 0.0.0.0/0
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp to:10.3.34.219:3001

And the second one:

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SEP-4NFMB5GS6KDP7RHJ -n | column -t
Chain KUBE-SEP-4NFMB5GS6KDP7RHJ (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 10.3.57.124 0.0.0.0/0
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp to:10.3.57.124:3001

And here we can see the DNAT (Destination NAT) target, which sends the packet to a Pod IP and port 3001, which is actually our containerPort - check the Deployment:

$ kubectl -n eks-dev-1-appname-ns get deploy appname-backend -o json | jq '.spec.template.spec.containers[].ports[].containerPort'
3001

Now let's look at our Pods' IPs.

Find the pods:

$ kubectl -n eks-dev-1-appname-ns get pod
NAME READY STATUS RESTARTS AGE
appname-backend-768ddf9f54-2nrp5 1/1 Running 0 3d
appname-backend-768ddf9f54-pm9bh 1/1 Running 0 3d

The IP of the first pod:

$ kubectl -n eks-dev-1-appname-ns get pod appname-backend-768ddf9f54-2nrp5 --template={{.status.podIP}}
10.3.34.219

And the second one:

$ kubectl -n eks-dev-1-appname-ns get pod appname-backend-768ddf9f54-pm9bh --template={{.status.podIP}}
10.3.57.124
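
These are exactly the addresses that the Service keeps in its Endpoints object, and that kube-proxy turns into the KUBE-SEP-* chains; we can confirm this (the output below is assumed to match the IPs above):

$ kubectl -n eks-dev-1-appname-ns get endpoints appname-backend-svc
NAME ENDPOINTS AGE
appname-backend-svc 10.3.34.219:3001,10.3.57.124:3001 63d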

Isn’t it great? :-) So simple — and so great.

Now, let's try to scale our Deployment to see how the KUBE-SVC-TII5GQPKXWC65SRC chain will change to reflect the scaling.

Find the Deployment:

$ kubectl -n eks-dev-1-appname-ns get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
appname-backend 2/2 2 2 64d

Scale it from two to three pods:

$ kubectl -n eks-dev-1-appname-ns scale deploy appname-backend --replicas=3
deployment.extensions/appname-backend scaled

Check the iptables rules:

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SVC-TII5GQPKXWC65SRC | column -t
Chain KUBE-SVC-TII5GQPKXWC65SRC (2 references)
target prot opt source destination
KUBE-SEP-N36I6W2ULZA2XU52 all -- anywhere anywhere statistic mode random probability 0.33332999982
KUBE-SEP-HDIQCDRXRXBGRR55 all -- anywhere anywhere statistic mode random probability 0.50000000000
KUBE-SEP-4NFMB5GS6KDP7RHJ all -- anywhere anywhere

Now we can see that our KUBE-SVC-TII5GQPKXWC65SRC chain got three rules: the first one with probability 0.33332999982, as there are still two more rules after it; the second one with probability 0.5, as only two backends remain at that point; and the last one without any probability match at all, as it catches everything that is left. In this way each of the three pods gets an equal share of new connections: 1/3, then 1/2 of the remaining 2/3 (which is again 1/3), and then the rest.

See the iptables statistic module for more details on how these probabilities work.
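
To check how evenly the traffic is actually spread in practice, one option (a sketch, not from the original post) is to watch the packet counters on these rules while requests are coming in:

[root@ip-10-3-49-200 ec2-user]# watch -n 1 'iptables -t nat -L KUBE-SVC-TII5GQPKXWC65SRC -n -v'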

Actually, “That’s all folks!” ©

Originally published at RTFM: Linux, DevOps and system administration.

