Eventual Consistency in Kubernetes and the Real World™
It was once acceptable — and even expected — for web services to go down for maintenance or when under heavy load. You could run a simple LAMP stack on a single physical machine and call it a day. Today, however, services are measured in the number of nines of availability they provide. A single server no longer cuts it.
One way to achieve higher availability is by running countless copies, or replicas, of our services across geographies. Now, though, we've got ourselves a distributed system to wrangle! This is cool, it can add a shiny new buzzword to your resume, but it also brings along its own challenges.
You've now got to...
- Keep n servers up to date with the latest operating system patches
- Ensure that they are all running the correct version of your service
- Prevent configuration drift across them all
- Orchestrate rollouts of new code and configuration
- Handle network partitions between servers
- Etc. ...
The CAP Theorem posits that a stateful distributed system can guarantee at most two of the following properties:
- Consistency - every read request is guaranteed to either receive the most current data or fail
- Availability - ability to continue serving requests without the guarantee that the response contains the most current data
- Partition Tolerance - resiliency in the event of dropped or degraded network communication between nodes
There is no right answer when it comes to selecting which two CAP properties your distributed system will guarantee. It's all about tradeoffs. If you're designing a distributed database to store financial transactions for a bank, you'll want to favor Consistency over Availability in the event of a network partition. The cost of returning incorrect financial information is simply too high.
However, if you're designing a distributed system to manage a highly available web service deployment, you'll surely want to choose Availability over Consistency. If some portion of your services is running out of date code or configuration, so be it. At least they're still serving customer requests instead of errors!
Since Kubernetes is optimized for running resilient, highly available workloads, it chooses Availability. State updates are propagated over time in what is known as eventual consistency. Eventual consistency means that if no new updates are made to a system, then eventually, the system will converge. This means that eventually, all nodes will reflect the most current update and be Consistent. There are no guarantees around how long this "eventually" will take. In the event of a crashed process, it could be less than a second. If a network cable is unwittingly unplugged, it could take days. All that is "guaranteed" is that an eventually consistent system will try to make the desired state of the system into its actual state.
To make this all a bit more concrete, let's consider a real-world distributed system that we're all aware of: a household HVAC system.
When you set the temperature on the thermostat to 70 degrees, what you're really doing is requesting that the temperature be 70 degrees. Setting the temperature on the thermostat does nothing on its own. However, the thermostat will work together with the furnace and air conditioner to make this request a reality. In other words, the HVAC system will work to make this desired state (thermostat) into the actual state of your home.
If the actual temperature is 65 degrees, the thermostat will turn on the furnace and periodically check to see if the room has reached the desired temperature. Likewise, if the actual temperature is 75 degrees, it will turn on the air conditioner. The thermostat will do this monitoring in perpetuity in what is known as a control loop.
Now let's consider a "network partition" scenario. The temperature in the room is 75 degrees, and someone has unplugged the air conditioner to save energy. In this case, convergence to 70 degrees could potentially take a very long time. Eventually, though, someone will plug the air conditioner back in (or it will get cold enough outside), and the room will reach its desired temperature. That's eventual consistency for you!
So, to summarize:
- Desired State is the state that you want the system to be in
- Actual State is the state that the system is actually in
You may have seen references to something called "Perceived State." I like to think of this as merely being a qualifier on the concept of "Actual State." Philosophically it's referring to our inability to distinguish what we can perceive through our senses from "true" reality. Think Plato's Allegory of the Cave.
Practically, in distributed systems "Perceived State" refers to the imprecision of our measurements and lag time between when the system was measured and when we observe the results. In the HVAC example, you can think of the current temperature reported by the thermostat as being the Perceived State and know that the Actual State may have changed slightly since the temperature was last probed.
All Kubernetes objects have
status subobjects. You can think of
spec as being the specification of the object. This is the section where you declare the Desired State of the resource, and Kubernetes will try its hardest to make it so. The
status section, on the other hand, is not meant to be touched. This is where Kubernetes provides visibility into the Actual State of the resource. If you wish to learn more, I recommend reading the following architecture docs about Kubernetes API conventions.
Now let's dive a bit deeper with some concrete examples of eventual consistency in Kubernetes. I'll be using a small test cluster running on Digital Ocean that has two worker nodes. Each node is running other workloads (such as Contour for ingress), but we should have enough headroom to run three 256M
Pods on each. This is a managed cluster, so Digital Ocean is taking care of the master nodes for us.
Think back to our web service scenario where we mentioned that we wanted to be able to do the following:
Have multiple replicas of our application for high availability
Keep code and configuration in sync across these replicas
Be resilient in the event of application and infrastructure failures
Kubernetes makes this task simple with its
Deployments allow us to provide a template for the workloads we want to run as well as the number of replicas and the strategy for rolling out updates.
# example-deployment.yaml --- apiVersion: apps/v1 kind: Deployment metadata: name: example-deployment spec: selector: matchLabels: app: httpbin replicas: 5 strategy: rollingUpdate: maxSurge: 25% maxUnavailable: 25% type: RollingUpdate template: metadata: labels: app: httpbin spec: containers: - name: httpbin image: kennethreitz/httpbin:latest ports: - containerPort: 80 env: - name: VERSION value: "1.0.0" resources: requests: memory: "256M"
Deployment above, we declare that we want to five two replicas of our web service (
httpbin) and that each needs 256M of memory to run. We also specify that we want it to perform rolling updates with at most one-quarter of the
Pods being updated at a given time. Let's apply it:
$ kubectl apply -f example-deployment.yaml
Wait a few seconds and then view the
pods that are running:
$ kubectl get deployments NAME READY UP-TO-DATE AVAILABLE AGE example-deployment 5/5 5 3 12s $ kubectl get pods NAME READY STATUS RESTARTS AGE example-deployment-58958c58d9-229qp 1/1 Running 0 25s example-deployment-58958c58d9-cwbzp 1/1 Running 0 25s example-deployment-58958c58d9-hgf52 1/1 Running 0 25s example-deployment-58958c58d9-mr5kp 1/1 Running 0 25s example-deployment-58958c58d9-nkwbn 1/1 Running 0 25s
Our cluster is now running five replicas of our web service across its two worker nodes.
So what just happened here? Think back to our thermostat example. The thermostat took the occupant's desired temperature as input and then acted on the actual temperature by either turning up the furnace or air conditioner. Kubernetes does something similar by continuously watching the state of API resources using components known as controllers.
On Kubernetes we created a
Deployment that defined how we want our applications to run on the cluster. The built-in
Deployment controller observed this and then created a
ReplicaSet to represent our group of
Pod replicas. The
ReplicaSet controller saw this happened and spun up five
Pods based on the template we specified in our
Deployment. If all of that sounded like a bunch of Kubernonsense, perhaps the diagram below will help you visualize it.
Now let's take a closer look at our
$ kubectl get deployment example-deployment -o yaml
The first thing you'll notice is that there's a whole lot more YAML than what we initially declared. Not only has the
spec section been filled in with some additional defaults, but we now have a
status section revealing the Actual State of our
status: availableReplicas: 5 conditions: - lastTransitionTime: "2020-02-02T19:20:47Z" lastUpdateTime: "2020-02-02T19:20:47Z" message: Deployment has minimum availability. reason: MinimumReplicasAvailable status: "True" type: Available - lastTransitionTime: "2020-02-02T19:20:23Z" lastUpdateTime: "2020-02-02T19:20:48Z" message: ReplicaSet "example-deployment-58958c58d9" has successfully progressed. reason: NewReplicaSetAvailable status: "True" type: Progressing observedGeneration: 1 readyReplicas: 5 replicas: 5 updatedReplicas: 5
Here we see that of the five replicas we requested, we have five that are ready (running).
Let's delete a pod to simulate one of our applications crashing:
$ kubectl delete pod example-deployment-58958c58d9-nkwbn --wait=false
Now quickly look at the
$ kubectl get deployment example-deployment -o yaml
status: availableReplicas: 4 conditions: - lastTransitionTime: "2020-02-02T19:20:47Z" lastUpdateTime: "2020-02-02T19:20:47Z" message: Deployment has minimum availability. reason: MinimumReplicasAvailable status: "True" type: Available - lastTransitionTime: "2020-02-02T19:20:23Z" lastUpdateTime: "2020-02-02T19:20:48Z" message: ReplicaSet "example-deployment-58958c58d9" has successfully progressed. reason: NewReplicaSetAvailable status: "True" type: Progressing observedGeneration: 1 readyReplicas: 4 replicas: 5 unavailableReplicas: 1 updatedReplicas: 5
status has been updated to reflect that only four of the five pods are running. The
ReplicaSet controller will quickly remedy this by creating a new
Pod, so you have to act quickly to see the
Deployment in this state. Also, it's important to note that the
spec did not change at all since our Desired State remains the same.
Now let's consider the case of a network partition. Imagine that someone tripped over an ethernet cable and left an entire server rack unreachable. We can simulate this by deleting a node from our cluster.
Let's look at what happened to our deployment and pods:
$ kubectl get deployments NAME READY UP-TO-DATE AVAILABLE AGE example-deployment 3/5 5 3 3h27m $ kubectl get pods NAME READY STATUS RESTARTS AGE example-deployment-58958c58d9-4f2nq 0/1 Pending 0 9m25s example-deployment-58958c58d9-cwbzp 1/1 Running 0 3h27m example-deployment-58958c58d9-hgf52 1/1 Running 0 3h27m example-deployment-58958c58d9-mr5kp 1/1 Running 0 3h27m example-deployment-58958c58d9-xm7dr 0/1 Pending 0 9m25s
Since we lost a node, we've lost some available capacity. Even though we desired five pods, we can only have enough headroom to run three of them. The Actual State of our
Deployment is now:
status: availableReplicas: 3 conditions: - lastTransitionTime: "2020-02-02T19:20:23Z" lastUpdateTime: "2020-02-02T19:20:48Z" message: ReplicaSet "example-deployment-58958c58d9" has successfully progressed. reason: NewReplicaSetAvailable status: "True" type: Progressing - lastTransitionTime: "2020-02-02T22:38:47Z" lastUpdateTime: "2020-02-02T22:38:47Z" message: Deployment does not have minimum availability. reason: MinimumReplicasUnavailable status: "False" type: Available observedGeneration: 1 readyReplicas: 3 replicas: 5 unavailableReplicas: 2 updatedReplicas: 5
Going back to our thermostat example, you can think of this situation as analogous to the air conditioner being unplugged and unresponsive. However, if we add the node back to the worker pool, the control loops will eventually reconcile this.
$ kubectl get pods NAME READY STATUS RESTARTS AGE example-deployment-58958c58d9-4f2nq 0/1 ContainerCreating 0 30m example-deployment-58958c58d9-cwbzp 1/1 Running 0 3h48m example-deployment-58958c58d9-hgf52 1/1 Running 0 3h48m example-deployment-58958c58d9-mr5kp 1/1 Running 0 3h48m example-deployment-58958c58d9-xm7dr 0/1 ContainerCreating 0 30m
That's eventual consistency in action!
So far, we've looked at a couple failure scenarios that caused the actual number of running pods to be lower than what we desired. Let's now consider a rollout of updated configuration to our
Let's add an additional environment variable creatively named
$ kubectl edit deployment example-deployment
apiVersion: apps/v1 kind: Deployment metadata: ... name: example-deployment ... spec: ... replicas: 5 ... template: metadata: ... spec: containers: - env: - name: VERSION value: 1.0.0 - name: NEW_ENVIRONMENT_VARIABLE value: hello there! image: kennethreitz/httpbin:latest ... status: ...
This will cause the
Deployment to gradually rollout a new set of
Pods that meet this new Desired State. It does this by scaling down the existing
ReplicaSet and creating a new one that matches the updated
You can view this by observing the
Pods that are running on the cluster:
$ kubectl get replicasets NAME DESIRED CURRENT READY AGE example-deployment-58958c58d9 0 0 0 3h52m example-deployment-9787854b4 5 5 5 39s $ kubectl get pods NAME READY STATUS RESTARTS AGE example-deployment-58958c58d9-4f2nq 1/1 Running 0 33m example-deployment-58958c58d9-cwbzp 1/1 Running 0 3h51m example-deployment-58958c58d9-hgf52 1/1 Running 0 3h51m example-deployment-58958c58d9-mr5kp 1/1 Running 0 3h51m example-deployment-58958c58d9-xm7dr 0/1 Terminating 0 33m example-deployment-9787854b4-5jm9c 0/1 ContainerCreating 0 3s example-deployment-9787854b4-wvtb4 0/1 Pending 0 3s example-deployment-9787854b4-zgvd5 0/1 ContainerCreating 0 3s
During this period, users of the application may temporarily be directed to older
Pods, but eventually, all of the original
Pods will be replaced. Unfortunately, there's no direct analog thermostat example here. It does, however, tie in nicely with our earlier discussion of the CAP theorem. This showcases how the platform chooses Availability over Consistency.
If the platform waited for the original
Pods to be entirely deleted before starting the new ones, clients would surely receive errors. Instead, the
Deployment optimizes for no downtime, but with the potential for inconsistent responses. Eventually, though, the platform converges, and we end up with only the five new
$ kubectl get pods NAME READY STATUS RESTARTS AGE example-deployment-9787854b4-5jm9c 1/1 Running 0 9m6s example-deployment-9787854b4-hkcr9 1/1 Running 0 9m example-deployment-9787854b4-wvtb4 1/1 Running 0 9m6s example-deployment-9787854b4-zcnsl 1/1 Running 0 9m1s example-deployment-9787854b4-zgvd5 1/1 Running 0 9m6s
We've now explored a few real-world examples of convergence and eventual consistency. Hopefully, they helped clarify what people mean when they refer to Desired and Actual state in Kubernetes. However, if you only take away one thing from this post, just remember the following:
Desired State is what you asked for, but may not reflect reality
Actual State is the state the system is currently in (to the best of our knowledge)
Desired State and Actual State may not always align, but Kubernetes will try its hardest!
Have a happy Palindrome Day (2020-02-02)!* 🤓
* Note: this post on desired vs actual state was originally posted on my blog on 2020-02-02. Unfortunately it must have been 2020-02-03 in UTC by the time I cross-posted to DEV. 😞