Understanding Docker Swarm: Architecture and Networking

#docker #microservices #networking #containers

Organizations increasingly lean towards containerization to deploy their applications. Netflix revealed that in 2017 it launched more than one million containers per week to support its worldwide streaming services. Managing that many containers manually would be impossible, so container orchestrators like Kubernetes and Docker Swarm have become essential tools. These platforms automate the management of large numbers of containers across many servers, enabling efficient resource utilization, high availability, and dynamic scaling. In this article, I will explore Docker Swarm’s architecture and networking features and provide practical code examples to demonstrate its use.

What is Docker Swarm?

Docker Swarm is a container orchestration tool developed and released by Docker, Inc. in 2015. It’s an integral part of the Docker ecosystem. Thanks to its simplicity and tight integration with Docker, it quickly garnered significant attention within the development community. Despite Mirantis’ acquisition of Docker’s Enterprise business in 2019 and the rapid rise of Kubernetes, along with managed services built on it such as Azure’s AKS, Docker Swarm has maintained its relevance and appeal. As Mirantis’ Engineering Manager, Ryan Zhang, points out, “Both Docker Swarm and Kubernetes serve as container orchestrators, leading many to believe that Docker Swarm would become obsolete as Kubernetes gained traction. However, the reality has been quite different. We increasingly see our clients utilizing Docker Swarm and Kubernetes concurrently, leveraging the strengths of each through MKE. Consequently, we will continue to foster the development of Docker Swarm while also giving back to the open-source community.” This dual approach, leveraging the strengths of both tools, is being adopted by various companies, including MetLife, Royal Bank of Canada, and S&P Global, to maximize their value.

Docker Swarm Architecture

Docker Swarm mode is built into the Docker Engine and is based on SwarmKit. Picture it as a group, or “swarm,” of units called nodes. Each node is an instance of the Docker Engine that participates in the swarm. Nodes can run on physical machines or on virtual machines, so several nodes may share one physical host or be spread across different ones. Within the swarm, a node takes one of two roles: manager or worker.

Manager nodes: These act as the ‘brain’ of the cluster. They accept Docker CLI commands, orchestrate cluster operations, and maintain the desired state of the swarm. They use the Raft consensus algorithm to keep a consistent view of the swarm’s state and task assignments. By default, manager nodes also run service tasks like worker nodes, but they can be configured to carry out manager duties exclusively.
Worker nodes: These function as the ‘muscles’ of the cluster, receiving and executing the tasks assigned by the manager nodes.

Applications are deployed by submitting a service definition to a manager node. A service definition describes the desired state of the service, including the number of replicas, network and storage resources, externally published ports, and the container image.

Once the manager node receives this information, it schedules the service’s tasks (running containers) onto worker nodes. After that, Docker does all the work to maintain the desired state. For example, if a worker node becomes inaccessible, Docker reschedules that node’s tasks onto other healthy nodes.
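The idea of maintaining a desired state can be illustrated with a toy reconciliation step. This is only a sketch: the function name, the node names, and the "least-loaded node first" placement rule are assumptions made for illustration and are not SwarmKit’s actual scheduler. It also only handles replacing lost tasks, not scaling down.

```python
from collections import Counter


def reconcile(assignments, healthy_nodes, desired_replicas):
    """Return a task -> node mapping that matches the desired replica count.

    assignments: dict of task id (int) -> node name (the current state)
    healthy_nodes: list of node names that are still reachable
    desired_replicas: how many tasks the service should be running
    """
    if not healthy_nodes:
        raise ValueError("no healthy nodes available to run tasks")

    # Keep only the tasks whose node is still healthy.
    kept = {task: node for task, node in assignments.items()
            if node in healthy_nodes}
    load = Counter({node: 0 for node in healthy_nodes})
    load.update(kept.values())

    # Start replacement tasks on the least-loaded healthy node until the
    # number of tasks matches the desired replica count again.
    next_id = max(assignments, default=0) + 1
    while len(kept) < desired_replicas:
        target = min(healthy_nodes, key=lambda node: load[node])
        kept[next_id] = target
        load[target] += 1
        next_id += 1
    return kept


# Hypothetical state: worker-3 has become unreachable.
current = {1: "worker-1", 2: "worker-2", 3: "worker-3"}
print(reconcile(current, ["worker-1", "worker-2"], desired_replicas=3))
```

The task that ran on the unreachable node is replaced by a new task on a healthy node, while the surviving tasks are left where they are.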

If there are multiple manager nodes, they must reach consensus on the service’s desired state and task distribution before tasks are assigned to the worker nodes. This ensures consistency and avoids conflicting decisions across the managers. Consensus requires a majority (quorum) of the managers, which is why an odd number of manager nodes is recommended: as the table below shows, adding one manager to an odd-sized group raises the quorum without raising the number of failures the swarm can tolerate.

+---------------+-------------------+---------------------+
| # of managers | failure tolerance | required for quorum |
+---------------+-------------------+---------------------+
| 1             | 0                 | 1                   |
+---------------+-------------------+---------------------+
| 2             | 0                 | 2                   |
+---------------+-------------------+---------------------+
| 3             | 1                 | 2                   |
+---------------+-------------------+---------------------+
| 4             | 1                 | 3                   |
+---------------+-------------------+---------------------+
| 5             | 2                 | 3                   |
+---------------+-------------------+---------------------+
| 6             | 2                 | 4                   |
+---------------+-------------------+---------------------+
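The table follows directly from majority arithmetic. As a small sketch (the function names are mine), the quorum is the smallest strict majority of the managers, and the failure tolerance is whatever is left over:

```python
def quorum(managers: int) -> int:
    """Smallest number of managers that forms a strict majority."""
    return managers // 2 + 1


def failure_tolerance(managers: int) -> int:
    """How many managers can be lost while a majority still remains."""
    return managers - quorum(managers)


# Prints: number of managers, failure tolerance, required for quorum.
for n in range(1, 7):
    print(n, failure_tolerance(n), quorum(n))
```

Going from 3 to 4 managers raises the quorum from 2 to 3 while the failure tolerance stays at 1, so the fourth manager adds no resilience.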

When a service must be reachable from outside the swarm, it is given a published port (PublishedPort). You can specify any unused port yourself; if you do not, the swarm manager assigns one in the range 30000 to 32767. When an external host connects to this port on any swarm node, the swarm’s routing mesh redirects the request to a node running one of the service’s tasks, and the ingress load balancer distributes incoming requests among the service’s replicas.
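To make the port publishing concrete, here is a minimal sketch of a helper that assembles a docker service create command line. The helper itself, the nginx image, and the port numbers are assumptions for illustration; the function only builds the argument list and does not run anything.

```python
import shlex
from typing import List, Optional


def service_create_cmd(name: str, image: str, replicas: int,
                       target_port: int,
                       published_port: Optional[int] = None) -> List[str]:
    """Build the argv for `docker service create` with one published port."""
    publish = f"target={target_port}"
    if published_port is not None:
        # Explicit published port; when omitted, the swarm picks one itself.
        publish = f"published={published_port},{publish}"
    return ["docker", "service", "create",
            "--name", name,
            "--replicas", str(replicas),
            "--publish", publish,
            image]


# Hypothetical service: three nginx replicas, container port 80 on port 8080.
print(shlex.join(service_create_cmd("my-web", "nginx", 3, 80, 8080)))
```

Passing the resulting list to subprocess.run on a manager node would create the service. Leaving out published_port produces --publish target=80, which lets the swarm choose the published port.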

To ensure connectivity between manager and worker nodes, the following ports must be open:

TCP port 2377 for communication with and between manager nodes
TCP and UDP port 7946 for overlay network node discovery
UDP port 4789 for overlay network traffic

Also, to use an overlay network with encryption, you must allow IP protocol 50 (IPsec ESP) traffic.
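A quick way to spot a firewall problem is to try a TCP connection to the ports listed above. The sketch below is an assumption-laden convenience, not an official tool: it can only test the TCP ports, because UDP 7946, UDP 4789, and IP protocol 50 cannot be verified with a plain TCP connect.

```python
import socket

# TCP ports from the list above (UDP and ESP traffic are not covered here).
SWARM_TCP_PORTS = {
    2377: "communication with and between manager nodes",
    7946: "overlay network node discovery",
}


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or unresolvable
        return False


def check_swarm_tcp_ports(host: str) -> dict:
    """Map each swarm TCP port to whether it accepted a connection."""
    return {port: tcp_port_open(host, port) for port in SWARM_TCP_PORTS}
```

For example, check_swarm_tcp_ports("vm-swarm-uks-01") could be run from a worker node against a manager. Keep in mind that port 2377 is expected to answer only on manager nodes.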

Raft consensus

Before explaining how Raft works, let’s take a moment to explain its purpose. Raft was designed to solve the consensus problem in fault-tolerant distributed systems: getting multiple servers to agree on the same information. In a cluster using Raft, each server is in one of three states:

Leader: The only server that interacts with clients. It is responsible for replicating the log to the followers, and it regularly announces its existence by sending them heartbeat messages.
Followers: These servers keep their copy of the data in sync with the leader.
Candidates: When a follower doesn’t receive a heartbeat from the leader within a specific time frame (the election timeout), it changes its state to “candidate” and starts a leader election.

When a server becomes a candidate, it increments the Term (a number that acts as a logical clock and is stamped on all outgoing messages), votes for itself, and asks the other servers for their votes via a RequestVote RPC. Each server can vote only once per term, and if the candidate receives votes from a majority of the servers, it becomes the leader.

Because elections are not coordinated, servers that share the same election timeout could become candidates at the same moment, vote for themselves, and split the vote so that no majority is reached, election after election. To avoid this, Raft uses randomized election timeouts. Also, if a previous leader rejoins the cluster, it discovers the change because its Term is lower than the one used in the messages it receives; it then changes its state to “follower.”
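The election rules above can be condensed into a small simulation. It is a simplified sketch, not SwarmKit’s implementation: the class and function names are mine, the 150 to 300 ms timeout range is an assumed example, and the log comparison that real Raft performs before granting a vote is left out.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    term: int = 0
    state: str = "follower"
    voted_for: Optional[str] = None

    def handle_request_vote(self, candidate: str, candidate_term: int) -> bool:
        """Grant at most one vote per term."""
        if candidate_term < self.term:
            return False                   # request from a stale term
        if candidate_term > self.term:
            # A higher term was seen: adopt it and step down to follower.
            self.term = candidate_term
            self.state = "follower"
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(servers, rng):
    """The server whose randomized timeout fires first starts an election."""
    timeouts = {s.name: rng.uniform(150, 300) for s in servers}  # assumed ms
    candidate = min(servers, key=lambda s: timeouts[s.name])
    candidate.state = "candidate"
    candidate.term += 1                    # new election, new term
    candidate.voted_for = candidate.name   # a candidate votes for itself
    votes = 1 + sum(
        s.handle_request_vote(candidate.name, candidate.term)
        for s in servers if s is not candidate
    )
    if votes > len(servers) // 2:          # strict majority required
        candidate.state = "leader"
    return candidate


servers = [Server("manager-1"), Server("manager-2"), Server("manager-3")]
winner = run_election(servers, random.Random(42))
print(winner.name, winner.state, winner.term)
```

Because each server draws a different timeout, only one of them becomes a candidate first and collects the votes. The same handle_request_vote logic makes a returning old leader step down as soon as it sees a message with a higher term.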

Raft Algorithm and Docker

The Raft algorithm is used exclusively among manager nodes. One of the manager nodes is elected leader, and this leader is responsible for making decisions, handling cluster-wide operations, and coordinating the activities of the other nodes. Worker nodes, on the other hand, do not participate in Raft and simply follow the instructions provided by the leader. The screenshot below shows that once the node elected as leader becomes unreachable, the algorithm automatically passes leadership to another manager node.

Docker Network

In Docker Swarm, networking is crucial for communication and connectivity among containers and nodes. By default, when you initialize or join a swarm, Docker creates two networks:

An overlay network (ingress): Docker uses this network to route each packet to and from the destination node by providing the routing mesh.
A virtual bridge (docker_gwbridge): This connects the overlay networks to an individual Docker daemon’s physical network and gives containers connectivity outside their cluster. It is created on each node that participates in the swarm.

$ sudo docker network ls
NETWORK ID     NAME              DRIVER    SCOPE
0284b8e83ba5   bridge            bridge    local
bf3d13651304   docker_gwbridge   bridge    local
6616f39918cf   host              host      local
j1m15skfclkc   ingress           overlay   swarm
67dd551b8c76   none              null      local

By default, Docker also starts a hidden container (not visible via the docker container ls -a command) and attaches it to both the docker_gwbridge and ingress networks. This container, ingress-sbox, handles network-related tasks such as load balancing for the routing mesh.

If we inspect the docker_gwbridge network, we will see that the ingress-sbox container’s network interface is attached to docker_gwbridge, which allows traffic to flow between them.

When the manager node creates a new service, its containers are attached to different networks depending on whether the service publishes ports. If no ports are published, they are attached to the default bridge (bridge) network.

If the containers need to expose ports to the outside world, they are attached to both an overlay network (either user-defined or the default ingress network) and the docker_gwbridge network.

Earlier, I mentioned that a service’s containers can be attached to a user-defined overlay network. To do so, we don’t create the overlay network on a worker node; instead, we create it on a manager node, and the swarm automatically makes it available to the nodes participating in the cluster.

Lastly, the screenshot below demonstrates that the containers on the worker node, vm-swarm-uks-03, can successfully communicate with both the manager node, vm-swarm-uks-01, and an instance of the containers from the my-web service.

Security

Swarm service management traffic is encrypted by default with the AES algorithm in GCM mode. Every 12 hours, the swarm’s manager nodes rotate the key used to encrypt gossip data.

It is also possible to secure the applications’ data by passing the --opt encrypted argument when creating the overlay network. This enables IPsec encryption at the VXLAN level. However, it’s important to note that this encryption may significantly affect performance.
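As an illustration of where the encryption option goes, the sketch below assembles a docker network create command for a user-defined overlay network. The helper and the network name my-overlay are hypothetical; the function only builds the argument list, which would then be run on a manager node.

```python
import shlex
from typing import List


def overlay_network_cmd(name: str, encrypted: bool = False,
                        attachable: bool = False) -> List[str]:
    """Build the argv for `docker network create` with the overlay driver."""
    cmd = ["docker", "network", "create", "--driver", "overlay"]
    if encrypted:
        cmd += ["--opt", "encrypted"]   # IPsec encryption of the VXLAN traffic
    if attachable:
        cmd.append("--attachable")      # also allow standalone containers
    cmd.append(name)
    return cmd


print(shlex.join(overlay_network_cmd("my-overlay", encrypted=True)))
# prints: docker network create --driver overlay --opt encrypted my-overlay
```

Given the performance cost mentioned above, it seems sensible to enable the option only for networks that carry sensitive application traffic.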