<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tibo Beijen</title>
    <description>The latest articles on DEV Community by Tibo Beijen (@tbeijen).</description>
    <link>https://dev.to/tbeijen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F92094%2F0b9a7cf5-a3d8-477d-818a-66e18d9ecf9f.jpg</url>
      <title>DEV Community: Tibo Beijen</title>
      <link>https://dev.to/tbeijen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tbeijen"/>
    <language>en</language>
    <item>
      <title>Introducing the Zen of DevOps</title>
      <dc:creator>Tibo Beijen</dc:creator>
      <pubDate>Sun, 01 Mar 2026 08:31:00 +0000</pubDate>
      <link>https://dev.to/tbeijen/introducing-the-zen-of-devops-3khm</link>
      <guid>https://dev.to/tbeijen/introducing-the-zen-of-devops-3khm</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Over the past ten years or so, my role has gradually shifted from software to platforms. More towards the 'ops' side of things, but coming from a background that values APIs, automation, artifacts and guardrails in the form of automated tests.&lt;/p&gt;

&lt;p&gt;And I found out that a lot of best practices from software engineering can be adapted and applied to modern ops as well.&lt;/p&gt;

&lt;p&gt;DevOps in a nutshell really: Bridging the gap between Dev and Ops. &lt;/p&gt;

&lt;p&gt;One of the most impactful pieces of guidance I have encountered is the &lt;a href="https://peps.python.org/pep-0020/" rel="noopener noreferrer"&gt;Zen of Python&lt;/a&gt;, which largely applies to modern DevOps as well.&lt;/p&gt;

&lt;p&gt;So, I have created a variant: The &lt;a href="https://www.zenofdevops.org/" rel="noopener noreferrer"&gt;Zen of DevOps&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Zen of Python
&lt;/h2&gt;

&lt;p&gt;It must have been around 2013, while working at &lt;a href="https://www.nu.nl/" rel="noopener noreferrer"&gt;NU.nl&lt;/a&gt;, that we phased out PHP in favor of Python. And that was an interesting mental exercise!&lt;/p&gt;

&lt;p&gt;Now I like my share of abstractions. When working on my graduation project, my favorite part was using OOP concepts in Macromedia Director, even though the demo app was just a small part of the project's scope. And working with PHP I went through my '&lt;a href="https://en.wikipedia.org/wiki/Design_Patterns" rel="noopener noreferrer"&gt;Gang of Four&lt;/a&gt;' phase and built a fair share of overengineered bloat. &lt;a href="https://framework.zend.com/manual/2.4/en/index.html" rel="noopener noreferrer"&gt;Zend Framework&lt;/a&gt; was my tool of choice, satisfying every design pattern craving I had.&lt;/p&gt;

&lt;p&gt;Then came Python. And with that came &lt;a href="https://www.djangoproject.com/" rel="noopener noreferrer"&gt;Django&lt;/a&gt;, an &lt;em&gt;opinionated&lt;/em&gt; framework. Really the opposite of Zend Framework (which is just a grab-bag of tools with a consistent interface). And it was not just Django that had opinions; Python itself did as well. A vision of the core values of the language: The &lt;a href="https://peps.python.org/pep-0020/" rel="noopener noreferrer"&gt;Zen of Python&lt;/a&gt;.&lt;/p&gt;
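
&lt;p&gt;Fittingly, the Zen of Python ships with the language itself: importing the standard-library &lt;code&gt;this&lt;/code&gt; module prints it.&lt;/p&gt;

```python
import codecs
import this  # importing `this` prints the Zen of Python as a side effect

# The module also exposes the text, ROT13-encoded, in `this.s`
zen = codecs.decode(this.s, "rot13")
assert "Explicit is better than implicit" in zen
```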

&lt;p&gt;After a transition period, shoving some PHP-isms into Python, I came to appreciate the nature of Python. At its core it's simple. But, when you need it, it gives you all the OOP you want, as well as powerful concepts such as &lt;a href="https://realpython.com/primer-on-python-decorators/" rel="noopener noreferrer"&gt;decorators&lt;/a&gt;, &lt;a href="https://realpython.com/python-with-statement/" rel="noopener noreferrer"&gt;context managers&lt;/a&gt; and &lt;a href="https://realpython.com/python-exceptions/" rel="noopener noreferrer"&gt;exceptions&lt;/a&gt; that are cheap and very useful.&lt;/p&gt;

&lt;p&gt;Simple when possible. Complex when needed.&lt;/p&gt;
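
&lt;p&gt;A minimal sketch of what that looks like in practice. The names here are illustrative (&lt;code&gt;ignored&lt;/code&gt; mimics the standard library's &lt;code&gt;contextlib.suppress&lt;/code&gt;): a decorator adds timing without touching the function, and a context manager makes silencing an error an explicit act.&lt;/p&gt;

```python
import time
from contextlib import contextmanager
from functools import wraps

def timed(func):
    """Decorator: report how long a call takes, without changing the function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
    return wrapper

@contextmanager
def ignored(*exceptions):
    """Context manager: silence only the listed exception types, explicitly."""
    try:
        yield
    except exceptions:
        pass

@timed
def parse(value: str) -> int:
    return int(value)

# "Errors should never pass silently. Unless explicitly silenced."
with ignored(ValueError):
    parse("not-a-number")
```

Simple call sites, with the complexity opted into only where needed.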

&lt;h2&gt;
  
  
  Adapting to DevOps
&lt;/h2&gt;

&lt;p&gt;The Zen of DevOps combines personal experience with conversations and observations from the past many years: Setups I have found easy to maintain. Setups that, despite all the modern tools, were complex and brittle. Countless conference talks attended and articles read. Many hallway tracks, discussing practices with peers.&lt;/p&gt;

&lt;p&gt;The resulting guidelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be able to break non-production systems&lt;/li&gt;
&lt;li&gt;Be able to break non-production systems only&lt;/li&gt;
&lt;li&gt;Design for more than one&lt;/li&gt;
&lt;li&gt;Design for more than once&lt;/li&gt;
&lt;li&gt;Favor changes that make you faster over those that slow you down&lt;/li&gt;
&lt;li&gt;Beautiful is better than ugly&lt;/li&gt;
&lt;li&gt;Explicit is better than implicit&lt;/li&gt;
&lt;li&gt;Simple is better than complex&lt;/li&gt;
&lt;li&gt;Complex is better than complicated&lt;/li&gt;
&lt;li&gt;Errors should never pass silently&lt;/li&gt;
&lt;li&gt;Unless explicitly silenced&lt;/li&gt;
&lt;li&gt;In the face of ambiguity, refuse the temptation to guess&lt;/li&gt;
&lt;li&gt;There should be one - and preferably only one - obvious way to do it&lt;/li&gt;
&lt;li&gt;If the implementation is hard to explain, it's a bad idea&lt;/li&gt;
&lt;li&gt;If the implementation is easy to explain, it may be a good idea&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Removals
&lt;/h3&gt;

&lt;p&gt;Some elements of the Zen of Python, I have left out of the Zen of DevOps:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Flat is better than nested / Sparse is better than dense / Readability counts&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In DevOps we have to make do with the languages and formats that are common: Mostly &lt;a href="https://go.dev/" rel="noopener noreferrer"&gt;Go&lt;/a&gt;, which has its own opinionated &lt;code&gt;fmt&lt;/code&gt;. Furthermore, schema design of YAML and JSON should be guided more by API design guidelines than by readability. Although readability of course is a good thing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Special cases aren't special enough to break the rules / Although practicality beats purity&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although valid points in their own right, I felt that, because the scope of DevOps is so much wider than a single programming language, these guidelines are a bit too restrictive. The reality of DevOps in large organizations is often a messy variety of practices at different levels of maturity. These guidelines just get in the way.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Now is better than never / Although never is often better than &lt;em&gt;right&lt;/em&gt; now&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I replaced that with "Favor changes that make you faster over those that slow you down", putting a bit more emphasis on modern Agile and scrum practices, which sometimes favor external stakeholder requests over internal team effectiveness&lt;sup id="fnref1"&gt;1&lt;/sup&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Although that way may not be obvious at first unless you're Dutch.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's a joke. I like jokes, and even though I'm Dutch as well: This one makes no sense and distracts more than it adds.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Namespaces are one honking great idea -- let's do more of those!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ubiquitous 'naming things'. A bit off-key here: DevOps is a lot about 'moving parts' and 'orchestration', not just software, where namespaces indeed are useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Additions
&lt;/h3&gt;

&lt;p&gt;Some new guidelines have been added (see the &lt;a href="https://www.zenofdevops.org/" rel="noopener noreferrer"&gt;Zen of DevOps&lt;/a&gt; for a more elaborate explanation of each):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Be able to break non-production systems / Be able to break non-production systems only&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These two guidelines emphasize differentiating non-production from production. That is more an 'ops' thing than a 'dev' thing, and was not really conveyed in the Zen of Python.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Design for more than one / Design for more than once&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These guidelines focus on the automation and codifying practices of modern infra. And really, it's not that new: Using &lt;a href="https://www.ibm.com/docs/en/tpmfod/7.1.1.16?topic=sysprep-windows-xp-windows-2003-operating-systems" rel="noopener noreferrer"&gt;sysprep&lt;/a&gt; in the Windows XP era, to stamp out many desktops, is not entirely different from preparing USB sticks for air-gapped Kubernetes edge deployments. And that is not unlike immutable infrastructure, never modifying a server in-place, just stamping out new ones.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Favor changes that make you faster over those that slow you down&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As stated above, this guideline emphasizes the need to stay ahead of the maintenance curve. The scope and complexity of what teams can, and need to, manage is ever-growing. That means changes that simplify, reduce friction and improve efficiency are more important than ever.&lt;/p&gt;

&lt;h2&gt;
  
  
  Universal and timeless
&lt;/h2&gt;

&lt;p&gt;Time will tell if the Zen of DevOps will be as timeless as the Zen of Python. I hope so!&lt;/p&gt;

&lt;p&gt;The range of practices that can be observed in the field of DevOps is increasingly wide: Front runners have already adopted agentic workflows. At the same time there are organizations where requesting a server, a cluster, a DNS change, or a firewall change can take many days&lt;sup id="fnref2"&gt;2&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;AI is changing many fields of work in impactful ways&lt;sup id="fnref3"&gt;3&lt;/sup&gt;. At the same time, engineering principles are quite foundational. If you design a plane, you build it to last, you design for maintenance and upgrades, add redundancy&lt;sup id="fnref4"&gt;4&lt;/sup&gt;, add safety margins. Whether the design is created on paper using rulers, using a computer, or mostly by AI: Those principles still exist, and should be supervised. &lt;/p&gt;

&lt;p&gt;Software is no different: Security, observability, maintainability, auditability and computational efficiency are all foundational engineering concerns, also known as 'non-functional requirements'.&lt;/p&gt;

&lt;p&gt;We will see if the Zen of DevOps will hold strong in these times of AI. If it doesn't, we have probably ended up with a lot of incomprehensible junk. But I have good hopes.&lt;/p&gt;

&lt;p&gt;Take for example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There should be one - and preferably only one - obvious way to do it&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This will translate directly into better agent performance. Likewise, when experimenting with agent integrations, it's really important you can do that on non-prod. And if first experiments mess things up, it's really helpful you can rebuild your setup.&lt;/p&gt;

&lt;p&gt;Unlike the Zen of Python, which focused on a single language, the Zen of DevOps aims to be more universal. &lt;/p&gt;

&lt;p&gt;Our industry is full of 'strong preferences' or previous choices we have become invested in beyond the point of no return. The Zen of DevOps aims to guide at a higher level, so it is not about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serverless vs. Kubernetes&lt;/li&gt;
&lt;li&gt;Public cloud vs. on-premise&lt;/li&gt;
&lt;li&gt;AWS vs. Azure vs. GCP&lt;/li&gt;
&lt;li&gt;Terraform vs. CDK vs. Pulumi vs. Crossplane&lt;/li&gt;
&lt;li&gt;GitOps vs. Pipelines&lt;/li&gt;
&lt;li&gt;Agile vs. Kanban vs. Waterfall&lt;/li&gt;
&lt;li&gt;Strong vs. Weak typing&lt;/li&gt;
&lt;li&gt;Imperative vs. Declarative&lt;/li&gt;
&lt;li&gt;Windows vs. Linux&lt;/li&gt;
&lt;li&gt;Pets vs. Cattle&lt;/li&gt;
&lt;li&gt;DevOps vs. SRE vs. Platform Engineering&lt;/li&gt;
&lt;li&gt;Rust vs. ... &amp;lt;every other language&amp;gt;&lt;/li&gt;
&lt;li&gt;YAML vs. JSON vs. TOML vs. KYAML vs JSON5&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Means, not goals
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Don’t let guidelines distract you from your goals!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some of the guidelines can be interpreted in several ways. And not every guideline might be feasible or applicable in every environment. And that's ok!&lt;/p&gt;

&lt;p&gt;Consider 'explicit'. To some it might mean: Make everything very visible. No abstractions. Everything is 'out there'. To others, including me, it means: Make conscious choices in what to expose, and what to hide, making the parts that can be considered 'the interface', explicit.&lt;/p&gt;

&lt;p&gt;The main takeaway is: Be deliberate about such practices, and keep evaluating how they affect a project and collaboration within and between teams.&lt;/p&gt;

&lt;p&gt;One does not complete or fail the Zen of DevOps.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;It has been interesting to try to distill years of experience and observations into a small set of principles. I hope it gives teams and individuals some new perspectives. Even when disagreeing, unpacking why can yield insights, and has value.&lt;/p&gt;

&lt;p&gt;In the coming period I might dive deeper into certain topics. If so, they will be tagged &lt;a href="https://www.tibobeijen.nl/tags/zenofdevops/" rel="noopener noreferrer"&gt;zenofdevops&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Got any thoughts or feedback? By all means, reach out on &lt;a href="https://www.linkedin.com/in/tibobeijen/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Zen to all...&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;I have yet to see an OKR stating 'good team mental health' as a key result. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;I am aware there are people, upon reading, wishing it were mere days. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Curious what happens when we find out that we have replaced all simple deterministic processes by awesome 'magic', and now are strategically dependent on the new oil: Datacenter capacity, energy, GPUs, memory. Sold by just a few big companies. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;The price of not doing so &lt;a href="https://risktec.tuv.com/knowledge-bank/the-price-of-single-point-failure/" rel="noopener noreferrer"&gt;can be unacceptably high&lt;/a&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>platformengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>East, west, north, south: How to fix your local cluster routes</title>
      <dc:creator>Tibo Beijen</dc:creator>
      <pubDate>Fri, 04 Apr 2025 13:52:16 +0000</pubDate>
      <link>https://dev.to/tbeijen/east-west-north-south-how-to-fix-your-local-cluster-routes-1n6b</link>
      <guid>https://dev.to/tbeijen/east-west-north-south-how-to-fix-your-local-cluster-routes-1n6b</guid>
      <description>

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently I needed to test a &lt;a href="https://www.keycloak.org/" rel="noopener noreferrer"&gt;Keycloak&lt;/a&gt; upgrade. This required me to deploy both the new Keycloak version and a sample OIDC application on my local Kubernetes setup. And it pointed me to a thing I kept postponing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Improve my local development DNS, routing and TLS setup&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  The challenge
&lt;/h2&gt;

&lt;p&gt;Until recently, I used URLs like &lt;code&gt;keycloak.127.0.0.1.nip.io:8443&lt;/code&gt;. This points to &lt;code&gt;127.0.0.1&lt;/code&gt; port &lt;code&gt;8443&lt;/code&gt;, which forwards to a local &lt;a href="https://k3d.io/" rel="noopener noreferrer"&gt;k3d&lt;/a&gt; cluster. At the same time it provides a unique hostname that can be used for configuring ingress. &lt;/p&gt;

&lt;p&gt;Nice. But not without flaws.&lt;/p&gt;

&lt;p&gt;For starters, this works for routing traffic &lt;em&gt;to&lt;/em&gt; the k3d cluster (we could call this 'north-south'), but not for routing traffic &lt;em&gt;within&lt;/em&gt; the cluster (east-west). This becomes apparent when trying to set up an OIDC sample application&lt;sup id="fnref1"&gt;1&lt;/sup&gt;, such as &lt;a href="https://github.com/dexidp/dex/pkgs/container/example-app" rel="noopener noreferrer"&gt;the one shipped with DEX&lt;/a&gt;. The domain pointing to Keycloak is used in two places: By the browser of the user logging in, so in this case from the host OS, &lt;em&gt;and&lt;/em&gt; directly from the backend, so within the cluster. &lt;/p&gt;

&lt;p&gt;This puts us in a catch-22: &lt;code&gt;nip.io&lt;/code&gt;, or an entry in &lt;code&gt;/etc/hosts&lt;/code&gt; only works for north-south. &lt;code&gt;svc.cluster.local&lt;/code&gt; only works for east-west. &lt;/p&gt;

&lt;p&gt;Another problem is that the default certificates issued by &lt;a href="https://github.com/traefik/traefik" rel="noopener noreferrer"&gt;Traefik&lt;/a&gt; are not trusted by other systems or browsers. So we frequently need to bypass security warnings, which by itself is indicative of a problem and encourages bad habits. Furthermore, even if we manage to configure our setup to use the ingress service from within the cluster, it depends on the backend application whether TLS host checking can be bypassed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82bkochnf9ujeqfvbhdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82bkochnf9ujeqfvbhdp.png" alt="Problem routing both north-south and east-west traffic" width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To improve this, we need to address some things. In this article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;The challenge&lt;/li&gt;
&lt;li&gt;The plan&lt;/li&gt;
&lt;li&gt;Improvement 1: TLS certificates and trust&lt;/li&gt;
&lt;li&gt;Improvement 2: Fixing north-south routing&lt;/li&gt;
&lt;li&gt;Improvement 3: Fixing east-west routing&lt;/li&gt;
&lt;li&gt;Combining and automating&lt;/li&gt;
&lt;li&gt;Next steps and wrapping it up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;☞ Don't fancy reading? Head straight to the &lt;a href="https://github.com/TBeijen/dev-cluster-config" rel="noopener noreferrer"&gt;github repo containing taskfile automation&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The plan
&lt;/h2&gt;

&lt;p&gt;So, let's identify and configure the components needed to create a smooth local Kubernetes setup, providing trusted TLS and predictable endpoints. &lt;/p&gt;

&lt;p&gt;This will result in applications being accessible via the following pattern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;k3d cluster&lt;/th&gt;
&lt;th&gt;Hostnames&lt;/th&gt;
&lt;th&gt;HTTP port&lt;/th&gt;
&lt;th&gt;HTTPS port&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cl0&lt;/td&gt;
&lt;td&gt;*.cl0.k3d.local&lt;/td&gt;
&lt;td&gt;10080&lt;/td&gt;
&lt;td&gt;10443&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cl1&lt;/td&gt;
&lt;td&gt;*.cl1.k3d.local&lt;/td&gt;
&lt;td&gt;11080&lt;/td&gt;
&lt;td&gt;11443&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cl2&lt;/td&gt;
&lt;td&gt;*.cl2.k3d.local&lt;/td&gt;
&lt;td&gt;12080&lt;/td&gt;
&lt;td&gt;12443&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;etc, etc...&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;East, west, north, south. The components used to fix the routes. Source: Wikimedia &amp;amp; Open Source projects&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Improvement 1: TLS certificates and trust
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Create CA certificate and key
&lt;/h3&gt;

&lt;p&gt;The ingress configurations in the cluster need to serve a certificate that is trusted by browsers and systems. One way could be registering a public (sub)domain for internal use and using &lt;a href="https://letsencrypt.org/" rel="noopener noreferrer"&gt;Let's Encrypt&lt;/a&gt; certificates, with the &lt;a href="https://letsencrypt.org/docs/challenge-types/#dns-01-challenge" rel="noopener noreferrer"&gt;DNS-01 challenge&lt;/a&gt; for verification.&lt;/p&gt;

&lt;p&gt;Another way is to create a self-signed Certificate Authority (CA), use that to issue TLS certificates, and ensure the CA is trusted by the relevant systems. I chose this approach since it's mostly a setup-once affair, and doesn't require me to deal with API tokens of my DNS provider. &lt;/p&gt;

&lt;p&gt;One important thing to be aware of is that adding a CA to a trust bundle means any certificate signed by it will be trusted. So, if the CA signs a certificate for e.g. &lt;code&gt;myaccount.google.com&lt;/code&gt;, your browser will trust it. This can be mitigated by adding &lt;a href="https://wiki.mozilla.org/CA:NameConstraints" rel="noopener noreferrer"&gt;NameConstraints&lt;/a&gt;, which reduces the risk of adding self-signed CAs to your trust bundles.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;ca.ini&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ req ]
default_bits       = 4096
distinguished_name = req_distinguished_name
req_extensions     = v3_req
prompt             = no

[ req_distinguished_name ]
CN = Development Setup .local CA
O = LocalDev
C = NL

[ v3_req ]
basicConstraints = critical, CA:TRUE
keyUsage = critical, keyCertSign, cRLSign
nameConstraints = critical, permitted;DNS:.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create key, certificate signing request (csr) and signed certificate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl ecparam -name prime256v1 -genkey -noout -out ca.key
openssl req -new -key ca.key -out ca.csr -config ca.ini
openssl x509 -req -in ca.csr -signkey ca.key -out ca.crt -days 3650 -extfile ca.ini -extensions v3_req
# Show the certificate
openssl x509 -noout -text -in ca.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the name constraints section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X509v3 extensions:
    X509v3 Name Constraints: critical
        Permitted:
          DNS:.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's test whether the name constraint works by issuing a certificate that does not match it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a certificate for example.com
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 \
  -CA ca.crt -CAkey ca.key \
  -nodes -keyout example.com.key -out example.com.crt -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:*.example.com,IP:10.0.0.1"
# Ok, so we can create a certificate outside the name constraints
# Let's verify it
openssl verify -verbose -CAfile ca.crt example.com.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This results in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CN=example.com
error 47 at 0 depth lookup: permitted subtree violation
error example.com.crt: verification failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good! Now add the CA to the macOS Keychain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain ca.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Setup Kubernetes cluster issuer and trust bundle
&lt;/h3&gt;

&lt;p&gt;We can now install &lt;a href="https://cert-manager.io/docs/" rel="noopener noreferrer"&gt;cert-manager&lt;/a&gt; to issue TLS certificates and &lt;a href="https://cert-manager.io/docs/trust/trust-manager/" rel="noopener noreferrer"&gt;trust-manager&lt;/a&gt; to distribute trust bundles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add jetstack https://charts.jetstack.io --force-update

# cert-manager
# Note the NameConstraints feature gates!
helm upgrade --install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true \
  --set webhook.featureGates="NameConstraints=true" \
  --set featureGates="NameConstraints=true"

# trust-manager
helm upgrade --install \
  trust-manager jetstack/trust-manager \
  --namespace cert-manager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the CA certificate and create a &lt;code&gt;ClusterIssuer&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl -n cert-manager create secret tls root-ca --cert=ca.crt --key=ca.key

cat &amp;lt;&amp;lt;EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: root-ca
spec:
  ca:
    secretName: root-ca
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a trust &lt;code&gt;Bundle&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;lt;&amp;lt;EOF | kubectl apply -f -
apiVersion: trust.cert-manager.io/v1alpha1
kind: Bundle
metadata:
  name: default-ca-bundle
spec:
  sources:
  - useDefaultCAs: true
  - secret:
      name: "root-ca"
      key: "tls.crt"
  target:
    configMap:
      key: "bundle.pem"
    namespaceSelector:
      matchLabels:
        trust: enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the bits of configuration we need to remember when setting up applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The cluster issuer name is &lt;code&gt;root-ca&lt;/code&gt;, so Ingress objects need the annotation &lt;code&gt;cert-manager.io/cluster-issuer: root-ca&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The CA bundle, including our issuer CA, can be found in a ConfigMap &lt;code&gt;default-ca-bundle&lt;/code&gt; under key &lt;code&gt;bundle.pem&lt;/code&gt;, &lt;em&gt;if&lt;/em&gt; the namespace is labeled &lt;code&gt;trust: enabled&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Since I consider dev clusters ephemeral and short-lived, topics like safely rotating issuer certificates don't need attention. When setting up trust manager in production environments, be sure to consider &lt;a href="https://cert-manager.io/docs/trust/trust-manager/installation/#trust-namespace" rel="noopener noreferrer"&gt;what namespace to install&lt;/a&gt; in and &lt;a href="https://cert-manager.io/docs/trust/trust-manager/#cert-manager-integration-intentionally-copying-ca-certificates" rel="noopener noreferrer"&gt;prepare for issuer certificate rotation&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Improvement 2: Fixing north-south routing
&lt;/h2&gt;

&lt;p&gt;As mentioned in the introduction, wildcard DNS services like &lt;code&gt;nip.io&lt;/code&gt; are helpful for routing from host to development cluster, but will not work within the cluster: the name resolves to &lt;code&gt;127.0.0.1&lt;/code&gt; and the target service won't be there.&lt;/p&gt;

&lt;p&gt;One way to handle this is to install &lt;a href="https://thekelleys.org.uk/dnsmasq/doc.html" rel="noopener noreferrer"&gt;dnsmasq&lt;/a&gt;, and configure the host to send DNS queries for &lt;code&gt;.local&lt;/code&gt; to &lt;code&gt;127.0.0.1&lt;/code&gt;, where dnsmasq will answer with &lt;code&gt;127.0.0.1&lt;/code&gt;. Using the port matching the k3d cluster then ensures the correct cluster receives the traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install dnsmasq
# Ensure service can bind to port 53 and starts at reboot
sudo brew services start dnsmasq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure dnsmasq, using &lt;code&gt;brew --prefix&lt;/code&gt; to determine where the config is located. On an Apple silicon Mac that will be &lt;code&gt;/opt/homebrew&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Ensure the following line in &lt;code&gt;/opt/homebrew/etc/dnsmasq.conf&lt;/code&gt; is uncommented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conf-dir=/opt/homebrew/etc/dnsmasq.d/,*.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can configure dnsmasq to resolve &lt;code&gt;.local&lt;/code&gt; to &lt;code&gt;127.0.0.1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "address=/local/127.0.0.1" &amp;gt; $(brew --prefix)/etc/dnsmasq.d/local.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we tell macOS to use dnsmasq at &lt;code&gt;127.0.0.1&lt;/code&gt; to resolve DNS queries for &lt;code&gt;.local&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo sh -c "echo 'nameserver 127.0.0.1' &amp;gt; /etc/resolver/local"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Be aware that tools like &lt;code&gt;dig&lt;/code&gt; and &lt;code&gt;nslookup&lt;/code&gt; behave a bit differently from most programs on macOS, so they are not the best way to test&lt;sup id="fnref2"&gt;2&lt;/sup&gt;. If we have set up a k3d cluster, &lt;a href="https://k3d.io/v5.3.0/usage/commands/k3d_cluster_create/#options" rel="noopener noreferrer"&gt;mapping host ports&lt;/a&gt; to the http and https ports using &lt;code&gt;-p&lt;/code&gt;, we could try to reach an application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Assuming we have set up keycloak and port 10443 is forwarded to k3d https port
# Using -k since curl does not use the system trust bundle, so is not aware of our CA
curl -k https://keycloak.cl0.k3d.local:10443/ -I
HTTP/2 302
location: https://keycloak.cl0.k3d.local:10443/admin/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Application is there. Good. Let's move on.&lt;/p&gt;


&lt;h2&gt;
  
  
  Improvement 3: Fixing east-west routing
&lt;/h2&gt;

&lt;p&gt;Our dnsmasq setup works from the host, via ingress, to a service. But it won't work when accessing another service within the cluster using that same domain. &lt;/p&gt;

&lt;p&gt;Of course, we can access services the usual way via &lt;code&gt;service-name.namespace.svc.cluster.local&lt;/code&gt;. But this means within the cluster we need to use a different domain than from the outside. Confusing at best, and in some cases not possible. One example being Keycloak client applications, as outlined in the introduction, where only one configuration item for the Keycloak domain exists.&lt;/p&gt;
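
&lt;p&gt;That in-cluster naming follows a fixed pattern. A tiny illustrative helper (hypothetical, not part of any tooling) shows the shape of the name:&lt;/p&gt;

```python
def cluster_local_name(service: str, namespace: str = "default") -> str:
    """Build the standard in-cluster DNS name for a Kubernetes Service."""
    return f"{service}.{namespace}.svc.cluster.local"

# Resolvable inside the cluster only; from the host this name means nothing,
# which is exactly the split this article sets out to fix.
print(cluster_local_name("keycloak", "auth"))  # keycloak.auth.svc.cluster.local
```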

&lt;p&gt;If, from &lt;em&gt;within&lt;/em&gt; our cluster, we try to resolve &lt;code&gt;keycloak.cl0.k3d.local&lt;/code&gt; from a pod, the following happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CoreDNS knows nothing about this name: it's not a pod, it's not a service&lt;/li&gt;
&lt;li&gt;CoreDNS forwards the query to the host&lt;/li&gt;
&lt;li&gt;The host (our MacBook) recognizes &lt;code&gt;.local&lt;/code&gt; and directs the query to &lt;code&gt;127.0.0.1:53&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Inside the pod, nothing is listening on &lt;code&gt;127.0.0.1:53&lt;/code&gt;, so DNS resolution fails&lt;/li&gt;
&lt;/ul&gt;
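
&lt;p&gt;We can observe this failure from a throwaway pod (assuming &lt;code&gt;kubectl&lt;/code&gt; points at the k3d cluster; pod name and image are arbitrary):&lt;br&gt;
&lt;/p&gt;

```
# Start a short-lived busybox pod and try to resolve the ingress domain.
# With the setup so far, this lookup fails.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
  -- nslookup keycloak.cl0.k3d.local
```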

&lt;p&gt;To fix this, we make two adjustments.&lt;/p&gt;

&lt;p&gt;First, we copy the existing &lt;code&gt;traefik&lt;/code&gt; service to &lt;code&gt;traefik-internal&lt;/code&gt;, changing the type from &lt;code&gt;LoadBalancer&lt;/code&gt; to &lt;code&gt;ClusterIP&lt;/code&gt; and adjusting the ports to align with the ports mapped to the host. The resulting service looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: traefik-internal
  namespace: kube-system
spec:
  ports:
  - name: webext
    port: 10080
    protocol: TCP
    targetPort: web
  - name: websecureext
    port: 10443
    protocol: TCP
    targetPort: websecure
  selector:
    app.kubernetes.io/instance: traefik-kube-system
    app.kubernetes.io/name: traefik
  type: ClusterIP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to ensure that connecting to e.g. &lt;code&gt;keycloak.cl0.k3d.local&lt;/code&gt; from &lt;em&gt;within&lt;/em&gt; our cluster, will end up at the &lt;code&gt;traefik-internal&lt;/code&gt; service. &lt;/p&gt;

&lt;p&gt;For this we can add a &lt;a href="https://coredns.io/2017/05/08/custom-dns-entries-for-kubernetes/" rel="noopener noreferrer"&gt;custom DNS entry to CoreDNS&lt;/a&gt;. We can do so by adding a configmap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coredns-custom&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;k3d.local.server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;k3d.local:53 {&lt;/span&gt;
        &lt;span class="s"&gt;errors&lt;/span&gt;
        &lt;span class="s"&gt;cache 30&lt;/span&gt;
        &lt;span class="s"&gt;rewrite name regex (.*)\.cl0\.k3d\.local traefik-internal.kube-system.svc.cluster.local&lt;/span&gt;
        &lt;span class="s"&gt;# We need to rewrite everything coming into this configuration block to avoid infinite loops&lt;/span&gt;
        &lt;span class="s"&gt;rewrite name regex (.*)\.k3d\.local host.k3d.internal&lt;/span&gt;
        &lt;span class="s"&gt;forward . 127.0.0.1:53&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; As the comment says, we need to ensure we rewrite &lt;em&gt;everything&lt;/em&gt;, since we feed the rewritten domain back to CoreDNS. This is why using &lt;code&gt;k3d.local&lt;/code&gt; as the domain to put clusters under works fine, whereas &lt;code&gt;k3d.internal&lt;/code&gt; does &lt;em&gt;not&lt;/em&gt;: with the latter, we would rewrite to a new FQDN that re-enters the custom config, resulting in an infinite loop and a CoreDNS crash.&lt;/p&gt;

&lt;p&gt;With the three improvements in place, we now have a setup that works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xgavrk3kcbdl1paa29t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xgavrk3kcbdl1paa29t.png" alt="Consistent DNS and trusted certificates" width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Combining and automating
&lt;/h2&gt;

&lt;p&gt;Although we now have a configuration that works, it is not particularly easy to set up. So, what do we do? We automate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://taskfile.dev/" rel="noopener noreferrer"&gt;Taskfile&lt;/a&gt; is single-binary &lt;code&gt;Make&lt;/code&gt; alternative that provides all the templating and configurability needed, to easily spin up K3D clusters configured as described in this article.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/TBeijen/dev-cluster-config" rel="noopener noreferrer"&gt;dev-cluster-config&lt;/a&gt; repository. Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# review .default.env
cat .default.env

# If needing to adjust config 
cp .default.env .env
vim .env

# Once: Setup CA certificate
task cert

# Once: Setup dnsmasq
task dnsmasq_brew

# Setup &amp;amp; configure clusters
task k3d-cluster-setup-0
task k3d-cluster-setup-1

# Optionally: Add example applications (nginx/curl)
task k3d-cluster-examples-0
task k3d-cluster-examples-1

# Use k3d to remove a cluster
k3d cluster delete cl0
k3d cluster delete cl1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Next steps and wrapping it up
&lt;/h2&gt;

&lt;p&gt;Optionally, we could also set up a load balancer like &lt;a href="https://www.haproxy.org/" rel="noopener noreferrer"&gt;haproxy&lt;/a&gt; on the macOS host that listens on the default http and https ports 80 and 443. It would serve the trusted certificate and, based on the requested host, forward to the proper k3d cluster. This would remove the need to use custom ports.&lt;/p&gt;
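
&lt;p&gt;As a hypothetical sketch of such an haproxy configuration (the certificate path and the cl1 port are illustrative):&lt;br&gt;
&lt;/p&gt;

```
# Hypothetical sketch: terminate TLS on 443, route on the Host header
frontend https-in
    mode http
    bind *:443 ssl crt /usr/local/etc/haproxy/k3d-local.pem
    use_backend cl0 if { hdr(host) -m end .cl0.k3d.local }
    use_backend cl1 if { hdr(host) -m end .cl1.k3d.local }

backend cl0
    mode http
    server k3d-cl0 127.0.0.1:10443 ssl verify none

backend cl1
    mode http
    server k3d-cl1 127.0.0.1:11443 ssl verify none
```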

&lt;p&gt;If, on the other hand, one does not want to set up dnsmasq, one could address clusters like &lt;code&gt;myapp.cl1.k3d.127.0.0.1.nip.io&lt;/code&gt;, and update the CoreDNS configuration accordingly, to intercept DNS lookups for &lt;code&gt;*.k3d.127.0.0.1.nip.io&lt;/code&gt; and return the IP address of &lt;code&gt;host.k3d.internal&lt;/code&gt;.&lt;/p&gt;
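
&lt;p&gt;The custom CoreDNS server block from improvement 3 could then, hypothetically, look something like:&lt;br&gt;
&lt;/p&gt;

```
# Hypothetical variant of the coredns-custom entry for nip.io style names
k3d.127.0.0.1.nip.io:53 {
    errors
    cache 30
    # everything is rewritten, so the forwarded query cannot re-enter this block
    rewrite name regex (.*)\.k3d\.127\.0\.0\.1\.nip\.io host.k3d.internal
    forward . 127.0.0.1:53
}
```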

&lt;p&gt;The setup described in this article consists of several discrete parts. It is not a one-stop integrated solution. However, as illustrated above, it can easily be extended and adjusted, which can be considered an advantage. If you prefer to run &lt;a href="https://kind.sigs.k8s.io/" rel="noopener noreferrer"&gt;Kind&lt;/a&gt;, &lt;a href="https://minikube.sigs.k8s.io/docs/" rel="noopener noreferrer"&gt;Minikube&lt;/a&gt;, &lt;a href="https://docs.rancherdesktop.io/" rel="noopener noreferrer"&gt;Rancher Desktop&lt;/a&gt; or &lt;a href="https://github.com/abiosoft/colima" rel="noopener noreferrer"&gt;Colima&lt;/a&gt;, a similar approach will work.&lt;/p&gt;

&lt;p&gt;Now, local development setups, like OS and editor choices, are typically something engineers are very opinionated about. And that's fine!!&lt;sup id="fnref3"&gt;3&lt;/sup&gt; So, if you are wondering "why are you doing all this and not doing this other thing instead?", by all means reach out on &lt;a href="https://www.linkedin.com/in/tibobeijen/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://bsky.app/profile/tibobeijen.nl" rel="noopener noreferrer"&gt;BlueSky&lt;/a&gt;. I'm curious!&lt;/p&gt;

&lt;p&gt;Regardless, I hope the above provides some guidance on getting the most out of your local development clusters.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Yes, we are mixing Keycloak and DEX. The beauty of standards such as OIDC. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;It's... complicated. &lt;a href="https://rakhesh.com/infrastructure/macos-vpn-doesnt-use-the-vpn-dns/" rel="noopener noreferrer"&gt;This article&lt;/a&gt; about configuring DNS and VPN gives some insights. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Although there is often a balance to strike between 'own improvements' and 'team standards'. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>kubernetes</category>
      <category>k3d</category>
      <category>certmanager</category>
      <category>macos</category>
    </item>
    <item>
      <title>12 Factor: 13 years later</title>
      <dc:creator>Tibo Beijen</dc:creator>
      <pubDate>Sat, 27 Apr 2024 04:33:00 +0000</pubDate>
      <link>https://dev.to/tbeijen/12-factor-13-years-later-4k3o</link>
      <guid>https://dev.to/tbeijen/12-factor-13-years-later-4k3o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In a presentation about CI/CD I gave recently, I briefly mentioned the &lt;a href="https://12factor.net/" rel="noopener noreferrer"&gt;12 factor methodology&lt;/a&gt;. Something along the lines of "You might find some good practices there", summarizing it as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;artifact
configuration +
---------------
deployment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the talk, a colleague from way back came to me and said: "You were way too mild in &lt;em&gt;suggesting&lt;/em&gt; it. It's mandatory, people &lt;em&gt;should&lt;/em&gt; follow those practices."&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;And yes, he was right. There are a lot of good practices to get from the 12 factor methodology. But do &lt;em&gt;all&lt;/em&gt; parts still hold up? Or might following it to the letter be actually counter-productive in some cases?&lt;/p&gt;

&lt;p&gt;In the past, I have onboarded quite a number of applications into Kubernetes that were already built with 12 factor in mind. That process usually was fairly smooth, so you start to take things for granted. Until you bump into applications that are tough to operate, that is.&lt;/p&gt;

&lt;p&gt;Upon closer inspection, such applications are usually found to violate some of the 12 factor principles.&lt;/p&gt;

&lt;p&gt;The 12 factor methodology &lt;a href="https://github.com/heroku/12factor/commit/2b06e7deabb64bb759f9fc6f4d9b6fcc546921bb" rel="noopener noreferrer"&gt;was initiated almost 13 years ago&lt;/a&gt; at Heroku, a company that was 'cloud native', focused on developer experience and ease of operation. So, it's no surprise it still &lt;em&gt;is&lt;/em&gt; relevant.&lt;/p&gt;

&lt;p&gt;So, let's glance over the 12 factors, and put them in the context of modern cloud native applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 12 factors
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Codebase
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;One &lt;a href="https://12factor.net/codebase" rel="noopener noreferrer"&gt;codebase&lt;/a&gt; tracked in revision control, many deploys&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Looking at the image, these days we would add an artifact between codebase and deploys. The artifact being a container, or perhaps a zip file (serverless).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;code         -&amp;gt; artifact       -&amp;gt; deploy
- versioned     - container       - prod
                - zip             - staging
                                  - local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's worth noting that for local development, depending on the setup, some form of live-reload usually comes in place of creating an actual artifact.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dependencies
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Explicitly declare and isolate &lt;a href="https://12factor.net/dependencies" rel="noopener noreferrer"&gt;dependencies&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is something that has become more natural in containerized applications. &lt;/p&gt;

&lt;p&gt;One part of the description is a bit dated though: "Twelve-factor apps also do not rely on the implicit existence of any system tools. Examples include shelling out to ImageMagick or curl."&lt;/p&gt;

&lt;p&gt;In containerized applications, the boundary &lt;em&gt;is&lt;/em&gt; the container, and its contents are well-defined. So an application shelling out to &lt;code&gt;curl&lt;/code&gt; is not a problem, since &lt;code&gt;curl&lt;/code&gt; now comes with the artifact, instead of it being assumed to exist.&lt;/p&gt;

&lt;p&gt;Similarly, in serverless setups like AWS Lambda, the execution environment is so well-defined that any dependency it provides, can be safely used.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Config
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Store &lt;a href="https://12factor.net/config" rel="noopener noreferrer"&gt;config&lt;/a&gt; in the environment&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This point is perhaps overly specific about the exact solution. The main takeaways are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration not in application code&lt;/li&gt;
&lt;li&gt;Artifact + configuration = deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Confusingly, and especially with the rise of GitOps, the configuration &lt;em&gt;is&lt;/em&gt; in a codebase, but detached from the application code.&lt;/p&gt;

&lt;p&gt;As long as the above concept is followed, using environment variables or config files is mostly an implementation detail.&lt;/p&gt;
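
&lt;p&gt;The core idea, sketched in shell (hostnames are made up):&lt;br&gt;
&lt;/p&gt;

```shell
# One 'artifact' (here: a function), two deployments that differ only in
# configuration passed via the environment
start_app() {
    # the application reads all of its configuration from the environment
    echo "connecting to ${DB_HOST}:${DB_PORT:-5432}"
}

DB_HOST=db.staging.internal start_app   # prints: connecting to db.staging.internal:5432
DB_HOST=db.prod.internal start_app      # prints: connecting to db.prod.internal:5432
```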

&lt;p&gt;Using Kubernetes, depending on security requirements, there might be considerations to use files instead of environment variables, optionally combined with envelope encryption. On this topic, I can recommend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KubeCon EU 2023: &lt;a href="https://kccnceu2023.sched.com/event/1HyVr/a-confidential-story-of-well-kept-secrets-lukonde-mwila-aws" rel="noopener noreferrer"&gt;A Confidential Story of Well-Kept Secrets - Lukonde Mwila, AWS&lt;/a&gt; &lt;a href="https://youtu.be/-I1JjJxy-rU?t=302" rel="noopener noreferrer"&gt;(video)&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Backing services
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Treat &lt;a href="https://12factor.net/backing-services" rel="noopener noreferrer"&gt;backing services&lt;/a&gt; as attached resources&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This has become common practice. In Kubernetes, it's usually easy to configure either a local single-pod (non-prod) Redis or Postgres, or a remote cloud-managed variant like RDS or Elasticache.&lt;/p&gt;

&lt;p&gt;There can be reasons to use local file system or memory, for example performance, or simplicity. This is fine, as long as the data is completely ephemeral, and the implementation doesn't negatively affect any of the other factors.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Build, release, run
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Strictly &lt;a href="https://12factor.net/build-release-run" rel="noopener noreferrer"&gt;separate&lt;/a&gt; build and run stages&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From Kubernetes to AWS Lambda: It will be hard these days to violate this principle. Enhancing the aforementioned summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build   -&amp;gt; artifact
Release -&amp;gt; configuration +
--------------------------
Run     -&amp;gt; deployment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Processes
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Execute the app as one or more &lt;a href="https://12factor.net/processes" rel="noopener noreferrer"&gt;stateless processes&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the full text, there is a line that better summarizes the point:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Twelve-factor processes are stateless and share-nothing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One container, one process, one service.&lt;/li&gt;
&lt;li&gt;No sticky-sessions. Store sessions externally, e.g. in Redis. See also factor 4.&lt;/li&gt;
&lt;li&gt;Simplify the process by considering &lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/init-containers/" rel="noopener noreferrer"&gt;init containers&lt;/a&gt; or &lt;a href="https://helm.sh/docs/topics/charts_hooks/" rel="noopener noreferrer"&gt;Helm chart hooks&lt;/a&gt;. See also factor 12.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Somewhat overlapping with factor 4, this factor implies using external services where possible. For example: Use external Redis instead of embedded Infinispan.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Port binding
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://12factor.net/port-binding" rel="noopener noreferrer"&gt;Export services&lt;/a&gt; via port binding&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This holds up for TCP-based applications. But it is no longer applicable for event-driven systems such as AWS Lambda or WASM on Kubernetes using &lt;a href="https://www.spinkube.dev/" rel="noopener noreferrer"&gt;SpinKube&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Concurrency
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://12factor.net/concurrency" rel="noopener noreferrer"&gt;Scale out&lt;/a&gt; via the process model&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Make your application horizontally scalable. This is somewhat related to factor 4, which results in share-nothing application processes.&lt;/p&gt;

&lt;p&gt;Furthermore, the application should leave process management to the operating system or orchestrator.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Disposability
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Maximize robustness with &lt;a href="https://12factor.net/disposability" rel="noopener noreferrer"&gt;fast startup and graceful shutdown&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In a way this can be seen as complementing the previous factor: Just as it should be easy to horizontally scale out, it should be easy to remove or replace processes.&lt;/p&gt;

&lt;p&gt;Specific to Kubernetes, this boils down to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Obey &lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination" rel="noopener noreferrer"&gt;termination signals&lt;/a&gt;. The application should gracefully shut down. Either handle the &lt;code&gt;SIGTERM&lt;/code&gt; signal in the application, or set up a &lt;a href="https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks" rel="noopener noreferrer"&gt;PreStop&lt;/a&gt; hook (&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace" rel="noopener noreferrer"&gt;more info&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Set up probes. Probes should only return &lt;code&gt;OK&lt;/code&gt; when the application is actually ready to receive traffic.&lt;/li&gt;
&lt;li&gt;Set up &lt;code&gt;maxSurge&lt;/code&gt; (&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment" rel="noopener noreferrer"&gt;rolling updates&lt;/a&gt;) and a &lt;code&gt;PodDisruptionBudget&lt;/code&gt; (&lt;a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/" rel="noopener noreferrer"&gt;scheduling&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Nodes are cattle, so it always should be possible to reschedule pods: The share-nothing concept.&lt;/li&gt;
&lt;/ul&gt;
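
&lt;p&gt;The first point can be sketched with a plain POSIX shell trap; the same pattern applies to signal handlers in the application's own language:&lt;br&gt;
&lt;/p&gt;

```shell
# Graceful shutdown in miniature: trap SIGTERM, drain, exit 0
sh -c 'trap "echo draining; exit 0" TERM; while true; do sleep 1; done' &
app_pid=$!
sleep 1                    # let the loop start
kill -TERM "$app_pid"      # what the orchestrator sends on pod termination
wait "$app_pid"
status=$?
echo "exit status: $status"
```

An application that ignores &lt;code&gt;SIGTERM&lt;/code&gt; would instead be force-killed after the grace period, dropping in-flight requests.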

&lt;h3&gt;
  
  
  10. Dev/prod parity
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Keep development, staging, and production &lt;a href="https://12factor.net/dev-prod-parity" rel="noopener noreferrer"&gt;as similar as possible&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a broad topic and as relevant as ever. At a high level it boils down to 'Shift left': Validate changes as reliably and quickly as possible.&lt;/p&gt;

&lt;p&gt;Solutions are many, and could include &lt;a href="https://docs.docker.com/compose/" rel="noopener noreferrer"&gt;Docker Compose&lt;/a&gt;, &lt;a href="https://code.visualstudio.com/docs/devcontainers/containers" rel="noopener noreferrer"&gt;VS Code dev containers&lt;/a&gt;, &lt;a href="https://www.telepresence.io/" rel="noopener noreferrer"&gt;Telepresence&lt;/a&gt;, &lt;a href="https://www.localstack.cloud/" rel="noopener noreferrer"&gt;Localstack&lt;/a&gt; or setting up temporary AWS accounts as a development environment for serverless applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. Logs
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Treat logs as &lt;a href="https://12factor.net/logs" rel="noopener noreferrer"&gt;event streams&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Don't store logs in files. Don't 'ship' logs in the application.&lt;/p&gt;

&lt;p&gt;The operating system or orchestrator should capture the output stream and route it to the logging storage of choice.&lt;/p&gt;

&lt;p&gt;Where the 12 factor methodology shows its age a bit, is that there is no mention of metrics and traces, which together with logs are often referred to as "the three pillars of observability".&lt;/p&gt;

&lt;p&gt;Extrapolating the approach to logging, consider systems that 'wrap' an application instead of requiring a detailed implementation. &lt;a href="https://opentelemetry.io/docs/concepts/instrumentation/zero-code/" rel="noopener noreferrer"&gt;OpenTelemetry zero-code instrumentation&lt;/a&gt; could be a good starting point. APM agents of observability SaaS platforms such as New Relic or Datadog can be applied similarly.&lt;/p&gt;

&lt;h3&gt;
  
  
  12. Admin processes
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Run admin/management tasks as one-off processes&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This fragment in the full description might summarize it better: "Admin code should ship with the application code".&lt;/p&gt;

&lt;p&gt;This is about tasks like changing database schema, or uploading asset bundles to a centralized storage location. &lt;/p&gt;

&lt;p&gt;The goal is to rule out any synchronization issues. Keywords are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identical environment&lt;/li&gt;
&lt;li&gt;Same codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summarizing the 12 factors
&lt;/h2&gt;

&lt;p&gt;As long as we try to grasp the idea behind the factors instead of following every detail, I would say most of the factors hold up quite well.&lt;/p&gt;

&lt;p&gt;Some recommendations have become more or less common practice over the years. Some other recommendations have a bit of overlap. For example: Externalizing state (factor 4) makes concurrency (factor 8) and disposability (factor 9) easier to accomplish.&lt;/p&gt;

&lt;h2&gt;
  
  
  Factor 13: Forward and backward compatibility
&lt;/h2&gt;

&lt;p&gt;There is a point not addressed in the 12 factor methodology that in my experience has always made an application easier to operate: Backward and forward compatibility.&lt;/p&gt;

&lt;p&gt;These days we expect application deployments to be frequent and without any downtime. That implies either rolling updates or blue/green deployments. Even blue/green deployments, in large distributed platforms, are hardly ever truly atomic. And deployment patterns like canary deployments, imply being able to roll back.&lt;/p&gt;

&lt;p&gt;So, getting this right opens up the path to frequent, friction-less deploys.&lt;/p&gt;

&lt;p&gt;This is about databases, cached data and API contracts. We need to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How does our application handle data while version &lt;code&gt;N&lt;/code&gt; and &lt;code&gt;N+1&lt;/code&gt; are running simultaneously?&lt;/li&gt;
&lt;li&gt;What happens if we need to roll back from &lt;code&gt;N+1&lt;/code&gt; to &lt;code&gt;N&lt;/code&gt;?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some pointers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When changing the database schema, first &lt;em&gt;add&lt;/em&gt; columns. Only remove the columns in a subsequent release once the data has been migrated.&lt;/li&gt;
&lt;li&gt;First add a field to an API or event schema, only then update consumers to actually expect the new field.&lt;/li&gt;
&lt;li&gt;Consider compatibility of cached objects. Prefixing cache-keys with something unique to the application version can help here.&lt;/li&gt;
&lt;/ul&gt;
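
&lt;p&gt;The cache-key suggestion, sketched in shell (the version string is made up):&lt;br&gt;
&lt;/p&gt;

```shell
# Scope cache keys to the application version, so version N and N+1
# never deserialize each other's cached objects
APP_VERSION="2024.04.1"    # e.g. injected at build/release time
cache_key() { printf '%s:%s\n' "$APP_VERSION" "$1"; }

cache_key "user:123:profile"   # prints: 2024.04.1:user:123:profile
```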

&lt;p&gt;What will happen with data in the transition period? Store the data in old &lt;em&gt;and&lt;/em&gt; new format? Do we need to store version information with the data and support multiple versions?&lt;/p&gt;

&lt;p&gt;This can be complicated for applications provided for others to operate, unlike applications operated and released via CI/CD by the developing team itself. External users often don't follow every minor release, making it more likely that backward compatibility breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Some of the above recommendations might take additional effort. However, in my experience that is worth it, and will be paid back (with interest) by ease of operations, peace of mind and a reduced need for coordinating releases. &lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/XS4ALL" rel="noopener noreferrer"&gt;XS4All&lt;/a&gt; had a great culture, showing its roots: Passionate, knowledgeable and vocal. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>cloudnative</category>
      <category>kubernetes</category>
      <category>operations</category>
      <category>development</category>
    </item>
    <item>
      <title>EKS and the quest for IP addresses: Secondary CIDR ranges and private NAT gateways</title>
      <dc:creator>Tibo Beijen</dc:creator>
      <pubDate>Thu, 10 Feb 2022 11:59:19 +0000</pubDate>
      <link>https://dev.to/tbeijen/eks-and-the-quest-for-ip-addresses-secondary-cidr-ranges-and-private-nat-gateways-109o</link>
      <guid>https://dev.to/tbeijen/eks-and-the-quest-for-ip-addresses-secondary-cidr-ranges-and-private-nat-gateways-109o</guid>
      <description>&lt;h2&gt;
  
  
  EKS and its hunger for IP addresses
&lt;/h2&gt;

&lt;p&gt;Kubernetes allows running highly diverse workloads with similar effort. From a user perspective there's little difference between running 2 pods on a node, each consuming 2 vCPU, and running tens of pods each consuming 0.05 vCPU. Looking at the network however, there is a big difference: Each pod needs to have a unique IP address. In most Kubernetes implementations there is a CNI plugin that allocates each pod an IP address in an IP space that is &lt;em&gt;internal&lt;/em&gt; to the cluster. &lt;/p&gt;

&lt;p&gt;EKS, the managed Kubernetes offering by AWS, &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html" rel="noopener noreferrer"&gt;by default&lt;/a&gt; uses the &lt;a href="https://github.com/aws/amazon-vpc-cni-k8s" rel="noopener noreferrer"&gt;Amazon VPC CNI plugin for Kubernetes&lt;/a&gt;. Different to most networking implementations, this assigns each pod a dedicated IP address in the VPC, the network the nodes reside in.&lt;/p&gt;

&lt;p&gt;What the VPC CNI plugin does &lt;sup id="fnref1"&gt;1&lt;/sup&gt; boils down to this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It keeps a number of network interfaces (ENIs) and IP addresses 'warm' on each node, to be able to quickly assign IP addresses to new pods.&lt;/li&gt;
&lt;li&gt;By default it keeps an entire spare ENI warm.&lt;/li&gt;
&lt;li&gt;This means that any node effectively claims &lt;code&gt;2 ENIs * ips-per-ENI&lt;/code&gt;, since there will always be at least one daemonset claiming an IP address of the first ENI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now if we look at the &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI" rel="noopener noreferrer"&gt;list of available IP addresses per ENI&lt;/a&gt; and calculate an example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2 type &lt;code&gt;m5.xlarge&lt;/code&gt;, 15 IP addresses per ENI. 30 IP addresses at minimum per node.&lt;/li&gt;
&lt;li&gt;Say, we have 50 nodes running. That's 1500 private addresses taken. (For perspective: That's ~$7000/month worth of on-demand EC2 compute).&lt;/li&gt;
&lt;li&gt;Say, we have a &lt;code&gt;/21&lt;/code&gt; VPC, providing 3 &lt;code&gt;/23&lt;/code&gt; private subnets. That's &lt;code&gt;3 x 512 = 1536&lt;/code&gt; available IP addresses.&lt;/li&gt;
&lt;li&gt;Managed services also need IP addresses...&lt;/li&gt;
&lt;/ul&gt;
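
&lt;p&gt;Spelled out as quick arithmetic:&lt;br&gt;
&lt;/p&gt;

```shell
# Back-of-the-envelope IP math for the example above
ips_per_eni=15                   # m5.xlarge
per_node=$((2 * ips_per_eni))    # warm spare ENI included: 30 IPs per node
nodes=50
claimed=$((per_node * nodes))
available=$((3 * 512))           # three /23 private subnets
echo "claimed=${claimed} available=${available}"   # claimed=1500 available=1536
```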

&lt;p&gt;We can see where this is going. So, creating &lt;code&gt;/16&lt;/code&gt; VPCs it is then? Probably not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multiple VPCs
&lt;/h2&gt;

&lt;p&gt;In a lot of organizations there is not just one VPC. The networking landscape might be a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple AWS accounts and VPCs in one or more regions&lt;/li&gt;
&lt;li&gt;Data centers&lt;/li&gt;
&lt;li&gt;Office networks&lt;/li&gt;
&lt;li&gt;Peered services, like DBaaS from providers other than AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/aws-vpc-connectivity-options/welcome.html" rel="noopener noreferrer"&gt;many ways&lt;/a&gt; to connect VPCs and other networks. The larger the CIDR range is that needs to be routable from outside the VPC, the more likely it becomes that there is overlap.&lt;/p&gt;

&lt;p&gt;As a result, in larger organizations, individual AWS accounts are typically provided a VPC with a relatively small CIDR range that fits in the larger networking plan. To still have 'lots of IPs', AWS VPCs &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html#VPC_Sizing" rel="noopener noreferrer"&gt;can be configured&lt;/a&gt; with secondary CIDR ranges.&lt;/p&gt;
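
&lt;p&gt;For example, associating a secondary range from the CG-NAT space (&lt;code&gt;100.64.0.0/10&lt;/code&gt;); the VPC id below is a placeholder:&lt;br&gt;
&lt;/p&gt;

```
# Attach a secondary CIDR range to an existing VPC (placeholder vpc-id)
aws ec2 associate-vpc-cidr-block \
    --vpc-id vpc-0123456789abcdef0 \
    --cidr-block 100.64.0.0/16
```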

&lt;p&gt;This solves the IP space problem, but does not by itself solve the routing problem. The secondary CIDR range would still need to be unique in the total networking landscape to be routable from outside the VPC. This might not be an actual problem if workloads in the secondary CIDR &lt;em&gt;only&lt;/em&gt; need to connect to resources within the VPC, but that very often is not the case.&lt;/p&gt;

&lt;p&gt;Quite recently AWS introduced &lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/how-to-solve-private-ip-exhaustion-with-private-nat-solution/" rel="noopener noreferrer"&gt;Private NAT gateways&lt;/a&gt; which, together with custom networking, are options to facilitate routable EKS pods in secondary CIDR ranges.&lt;/p&gt;

&lt;h2&gt;
  
  
  VPC setups
&lt;/h2&gt;

&lt;p&gt;Let's go over some VPC setups to illustrate the problem and see how we can run EKS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic
&lt;/h3&gt;

&lt;p&gt;A basic VPC consists of a single CIDR range, some private and public subnets, a NAT gateway and an Internet Gateway. Depending on the primary CIDR range size this might be sufficient, but in the scope of larger organizations let's assume a relatively small CIDR range.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8swh6j7x25k9a8bas8yy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8swh6j7x25k9a8bas8yy.png" alt="Basic VPC" width="580" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pro: Simple&lt;/li&gt;
&lt;li&gt;Con: Private IP exhaustion &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Secondary CIDR range
&lt;/h3&gt;

&lt;p&gt;Next step: Adding a secondary CIDR range, placing nodes and pods in the secondary subnets. This &lt;em&gt;could&lt;/em&gt; work if workloads never need to connect to resources in private networks outside the VPC, which is unlikely. Theoretically pods would be able to send packets to other VPCs but there is no route back.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eeys4g6z7ad2t0z7q5h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eeys4g6z7ad2t0z7q5h.png" alt="Secondary CIDR range" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pro: Simple&lt;/li&gt;
&lt;li&gt;Con: No route between pods and private resources outside the VPC&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Secondary CIDR range + custom networking
&lt;/h3&gt;

&lt;p&gt;To remedy the routing problem, custom networking can be enabled in the VPC CNI plugin. This allows placing nodes and pods in different subnets: nodes go into the primary private subnets, pods go into the secondary private subnets. This solves the routing problem because, by default, for traffic to external networks the CNI plugin translates the pod's IP address to the primary IP address of the node (SNAT). In this setup those nodes are in routable subnets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15ytxbgjicpgoya6wy0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15ytxbgjicpgoya6wy0r.png" alt="Secondary CIDR range + Custom networking" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Setting up secondary CIDR ranges and custom networking is described in the &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/eks-multiple-cidr-ranges/" rel="noopener noreferrer"&gt;AWS knowledge center&lt;/a&gt; and also in the &lt;a href="https://www.eksworkshop.com/beginner/160_advanced-networking/secondary_cidr/" rel="noopener noreferrer"&gt;Amazon EKS Workshop&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Be aware that Source Network Address Translation &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html" rel="noopener noreferrer"&gt;is disabled when using security groups for pods&lt;/a&gt;&lt;sup id="fnref2"&gt;2&lt;/sup&gt;: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Source NAT is disabled for outbound traffic from pods with assigned security groups so that outbound security group rules are applied. To access the internet, pods with assigned security groups must be launched on nodes that are deployed in a private subnet configured with a NAT gateway or instance. Pods with assigned security groups deployed to public subnets are not able to access the internet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Pro: No additional NAT gateway needed&lt;/li&gt;
&lt;li&gt;Con: Complex VPC CNI network configuration&lt;/li&gt;
&lt;li&gt;Con: Not compatible with security groups for pods&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Secondary CIDR range + private NAT gateway
&lt;/h3&gt;

&lt;p&gt;Instead of configuring custom networking, it is also possible to solve the routing problem by using a private NAT gateway. Unlike a public NAT gateway, it is placed in a private subnet and is not linked to an internet gateway.&lt;/p&gt;

&lt;p&gt;This way nodes &lt;em&gt;and&lt;/em&gt; pods can run in the secondary CIDR range, and the routing problem is solved outside of EKS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqwbkc1bcftjjqm2tnj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqwbkc1bcftjjqm2tnj5.png" alt="Secondary CIDR range + Private NAT gateway" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pro: Straightforward default VPC CNI network configuration&lt;/li&gt;
&lt;li&gt;Pro: Can be used with security groups for pods&lt;/li&gt;
&lt;li&gt;Con: NAT gateway incurs cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Routing and controlling cost
&lt;/h2&gt;

&lt;h3&gt;
  
  
  One NAT gateway is enough
&lt;/h3&gt;

&lt;p&gt;Let's take a look at the most basic route table one can set up for the secondary private subnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.150.40.0/21  local   
100.64.0.0/16   local   
0.0.0.0/0       nat-&amp;lt;private-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This puts any non-VPC traffic into the primary private subnet, and lets the route table that is configured there do the rest. Simple, but there is a catch&lt;sup id="fnref3"&gt;3&lt;/sup&gt; which we can observe when testing internet connectivity from a node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ec2-user@ip-100-64-43-196 ~]$ ping www.google.com
PING www.google.com (74.125.193.147) 56(84) bytes of data.
64 bytes from ig-in-f147.1e100.net (74.125.193.147): icmp_seq=1 ttl=49 time=2.31 ms
^C

[ec2-user@ip-100-64-43-196 ~]$ tracepath -p 443 74.125.193.147
 1?: [LOCALHOST]                                         pmtu 9001
 1:  ip-10-150-42-36.eu-west-1.compute.internal            0.168ms
 1:  ip-10-150-42-36.eu-west-1.compute.internal            1.016ms
 2:  ip-10-150-40-116.eu-west-1.compute.internal           0.739ms
 3:  ip-10-150-40-1.eu-west-1.compute.internal             1.510ms pmtu 1500
 3:  no reply
^C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at the trace, and at the NAT gateways that exist, we can see that traffic passes the private &lt;em&gt;and&lt;/em&gt; the public NAT gateway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbpxewxzgm77td75lusr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbpxewxzgm77td75lusr.png" alt="NAT gateways that exist in the VPC" width="800" height="50"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A careful observer might have noticed the green line in the traffic diagram bypassing the private NAT gateway. To accomplish this one needs to adjust the routing table by &lt;em&gt;only&lt;/em&gt; directing private network traffic to the private NAT gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.150.40.0/21  local   
100.64.0.0/16   local   
10.0.0.0/8      nat-&amp;lt;private-id&amp;gt;
0.0.0.0/0       nat-&amp;lt;public-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Halving the amount of traffic passing through NAT gateways halves the data processing cost (ignoring the fixed hourly fee of a NAT gateway).&lt;/p&gt;
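&lt;p&gt;The difference between the two route tables comes down to longest-prefix matching. A minimal sketch of that selection (the gateway names are placeholders, and a real VPC route table does considerably more than this):&lt;/p&gt;

```python
import ipaddress

def select_route(routes, destination):
    """Pick the most specific (longest-prefix) matching route, as a route table does."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, target) for net, target in routes if dest in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

net = ipaddress.ip_network

# First attempt: everything non-local leaves via the private NAT gateway
naive = [
    (net("10.150.40.0/21"), "local"),
    (net("100.64.0.0/16"), "local"),
    (net("0.0.0.0/0"), "nat-private"),
]
# Internet-bound traffic needlessly passes the private NAT gateway first
assert select_route(naive, "74.125.193.147") == "nat-private"

# Fix: only private network traffic goes to the private NAT gateway
fixed = [
    (net("10.150.40.0/21"), "local"),
    (net("100.64.0.0/16"), "local"),
    (net("10.0.0.0/8"), "nat-private"),
    (net("0.0.0.0/0"), "nat-public"),
]
assert select_route(fixed, "74.125.193.147") == "nat-public"  # straight to public NAT
assert select_route(fixed, "10.20.30.40") == "nat-private"    # other private networks
assert select_route(fixed, "10.150.42.36") == "local"         # in-VPC stays local
```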

&lt;h3&gt;
  
  
  VPC endpoints and peering connections
&lt;/h3&gt;

&lt;p&gt;The above illustrates that it is important to replicate route table entries for VPC endpoints and peering connections that exist in the primary private subnets, to avoid traffic unnecessarily passing through the private NAT gateway. It will (probably) work, but it brings unneeded cost.&lt;/p&gt;

&lt;p&gt;A reminder: since the planets that are DNS, routing and security groups need to align, be sure to grant the secondary CIDR range access to any VPC endpoint of the type 'Interface' that exists in the VPC. Not doing so will have DNS return a VPC-local IP address, for which traffic will &lt;em&gt;not&lt;/em&gt; go through the private NAT gateway and will hence be blocked by the security group on the VPC endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concluding
&lt;/h2&gt;

&lt;p&gt;Private NAT gateways can be an alternative to custom networking when running EKS pods in secondary CIDR ranges. As always, there are trade-offs that need to be considered, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amount of network traffic going over the Transit Gateway, and thereby the private NAT gateway&lt;/li&gt;
&lt;li&gt;Ability to use security groups for pods&lt;/li&gt;
&lt;li&gt;Complexity of set-up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above should give some insight into the world of EKS networking and hopefully provides pointers on what to investigate more deeply and which pitfalls to avoid. As always, feel free to &lt;a href="https://twitter.com/TBeijen" rel="noopener noreferrer"&gt;reach out on Twitter&lt;/a&gt; to discuss!&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;This is described in great detail in this blog post: &lt;a href="https://betterprogramming.pub/amazon-eks-is-eating-my-ips-e18ea057e045" rel="noopener noreferrer"&gt;https://betterprogramming.pub/amazon-eks-is-eating-my-ips-e18ea057e045&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;Disclaimer: We haven't yet enabled security groups for pods so this is theoretical. However, following the described logic of 'No NAT = no route to the internet', we can assume similar restrictions to apply to external private networks. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Using more NAT gateways than needed can be a &lt;a href="https://twitter.com/QuinnyPig/status/1433949394915639300" rel="noopener noreferrer"&gt;serious waste of money&lt;/a&gt; and be subject to snark. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>eks</category>
      <category>vpc</category>
    </item>
    <item>
      <title>Terraform: Good plan = good apply</title>
      <dc:creator>Tibo Beijen</dc:creator>
      <pubDate>Sat, 15 Jan 2022 08:04:01 +0000</pubDate>
      <link>https://dev.to/tbeijen/terraform-good-plan-good-apply-2e22</link>
      <guid>https://dev.to/tbeijen/terraform-good-plan-good-apply-2e22</guid>
      <description>&lt;p&gt;Recently I worked on some infrastructure changes that resulted in &lt;code&gt;terraform plan&lt;/code&gt; showing more, and more impactful, changes than expected. Diving deeper, it appeared that a lot of the planned changes could be avoided by some preparations, resulting in a &lt;code&gt;terraform apply&lt;/code&gt; with no impact at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ordered lists and state manipulation
&lt;/h2&gt;

&lt;p&gt;First of all, Terraform has for quite some time supported &lt;code&gt;for_each&lt;/code&gt;, which &lt;a href="https://www.terraform.io/language/meta-arguments/count#when-to-use-for_each-instead-of-count" rel="noopener noreferrer"&gt;is a more robust way to create multiple resources&lt;/a&gt;. That said, there can be various reasons why resources have been created by iterating over a list. The most obvious one being code that originates from before &lt;code&gt;for_each&lt;/code&gt; was common, where converting from &lt;code&gt;count&lt;/code&gt; to &lt;code&gt;for_each&lt;/code&gt; would require module users to do complex migrations. (The &lt;a href="https://learn.hashicorp.com/tutorials/terraform/move-config" rel="noopener noreferrer"&gt;new 'moved' block&lt;/a&gt; will certainly help in these scenarios, although it requires users to upgrade to Terraform v1.1 first.)&lt;/p&gt;

&lt;p&gt;In this particular case we added a secondary cidr to a VPC 'A' that was peered to another VPC 'B'. The VPC peering connection was managed by CloudPosse's &lt;a href="https://github.com/cloudposse/terraform-aws-vpc-peering" rel="noopener noreferrer"&gt;terraform-aws-vpc-peering&lt;/a&gt; module.&lt;/p&gt;

&lt;p&gt;In the vocabulary of the module, VPC A is the accepter, VPC B is the requester. Before the CIDR addition the peering module managed 5 route tables in VPC B: 1x VPC default, 1x public subnets, and 3x private subnets (one per availability zone). In each of those route tables the module manages a route that sends traffic for the VPC A CIDR over the VPC peering connection, which is also managed by the module.&lt;/p&gt;

&lt;p&gt;Now with the introduction of a second cidr in VPC A, the module adds an additional route to each of the 5 route tables, resulting in a desired state like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymgodm8j5zu26sfcrb0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymgodm8j5zu26sfcrb0m.png" alt="Route table example showing routes to VPC peering" width="800" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It &lt;a href="https://github.com/cloudposse/terraform-aws-vpc-peering/blob/master/main.tf#L52" rel="noopener noreferrer"&gt;does so&lt;/a&gt; by looping over a combination of the accepter VPC &lt;code&gt;cidr_block_associations&lt;/code&gt; attributes and requester VPC route tables. This results in a &lt;code&gt;terraform plan&lt;/code&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# module.vpc_peering.aws_route.requestor[1] must be replaced
-/+ resource "aws_route" "requestor" {
      ~ destination_cidr_block     = "10.123.32.0/21" -&amp;gt; "100.64.0.0/16" # forces replacement
      # truncated for readability
      ~ route_table_id             = "rtb-some-id" -&amp;gt; "rtb-other-id" # forces replacement
      ~ state                      = "active" -&amp;gt; (known after apply)
        vpc_peering_connection_id  = "pcx-peering-to-vpc-a"
    }

  # module.vpc_peering.aws_route.requestor[2] must be replaced
-/+ resource "aws_route" "requestor" {
        destination_cidr_block     = "10.123.32.0/21"
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[3] must be replaced
-/+ resource "aws_route" "requestor" {
      ~ destination_cidr_block     = "10.123.32.0/21" -&amp;gt; "100.64.0.0/16" # forces replacement
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[4] must be replaced
-/+ resource "aws_route" "requestor" {
        destination_cidr_block     = "10.123.32.0/21"
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[5] will be created
  + resource "aws_route" "requestor" {
      + destination_cidr_block     = "100.64.0.0/16"
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[6] will be created
  + resource "aws_route" "requestor" {
      + destination_cidr_block     = "10.123.32.0/21"
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[7] will be created
  + resource "aws_route" "requestor" {
      + destination_cidr_block     = "100.64.0.0/16"
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[8] will be created
  + resource "aws_route" "requestor" {
      + destination_cidr_block     = "10.123.32.0/21"
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[9] will be created
  + resource "aws_route" "requestor" {
      + destination_cidr_block     = "100.64.0.0/16"
      # truncated for readability
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tried applying a change like this to a non-critical development environment and it applies quite fast. However, for production, 'quite fast' is not good enough. Furthermore, if for some odd reason it partially fails, you and your visitors are having a very bad day.&lt;/p&gt;

&lt;p&gt;Looking closer, and helped by &lt;a href="https://github.com/cloudposse/terraform-aws-vpc-peering/blob/master/main.tf#L52" rel="noopener noreferrer"&gt;looking at the module source&lt;/a&gt;, one can distinguish an alternating pattern in the destination cidr block: Primary, secondary, primary, secondary, repeat.&lt;/p&gt;
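&lt;p&gt;A toy model of that loop (not the module's actual code; route table names are made up) shows why the indices shift, and recovers the old-to-new index mapping needed for the state moves:&lt;/p&gt;

```python
from itertools import product

# 5 requester route tables, as in the situation described above
route_tables = ["rtb-default", "rtb-public", "rtb-priv-a", "rtb-priv-b", "rtb-priv-c"]
primary, secondary = "10.123.32.0/21", "100.64.0.0/16"

# Before: one accepter cidr per route table; after: the flattened product
before = list(product(route_tables, [primary]))            # indices 0..4
after = list(product(route_tables, [primary, secondary]))  # primary, secondary, repeat

# Each existing (route table, cidr) pair keeps its identity but moves to a new index
moves = {i: after.index(pair) for i, pair in enumerate(before)}
assert moves == {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}
```

&lt;p&gt;Note that such moves have to be applied highest index first, so that a destination address is always free before a resource is moved into it.&lt;/p&gt;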

&lt;p&gt;This is validated by comparing some of the planned &lt;em&gt;additions&lt;/em&gt; to what's currently at different locations in state, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Same destination cidr and route table as planned addition [6]
terraform state show module.vpc_peering.aws_route.requestor[3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Long story short, moving the existing resources in Terraform state to where the module expects them to be in this new situation results in a much more straightforward plan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform state module.vpc_peering.aws_route.requestor[4] module.vpc_peering.aws_route.requestor[8]
terraform state module.vpc_peering.aws_route.requestor[3] module.vpc_peering.aws_route.requestor[6]
terraform state module.vpc_peering.aws_route.requestor[1] module.vpc_peering.aws_route.requestor[2]
terraform state module.vpc_peering.aws_route.requestor[2] module.vpc_peering.aws_route.requestor[4]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resulting plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  # module.vpc_peering.aws_route.requestor[1] will be created
  + resource "aws_route" "requestor" {
      + destination_cidr_block     = "100.64.0.0/16"
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[3] will be created
  + resource "aws_route" "requestor" {
      + destination_cidr_block     = "100.64.0.0/16"
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[5] will be created
  + resource "aws_route" "requestor" {
      + destination_cidr_block     = "100.64.0.0/16"
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[7] will be created
  + resource "aws_route" "requestor" {
      + destination_cidr_block     = "100.64.0.0/16"
      # truncated for readability
    }

  # module.vpc_peering.aws_route.requestor[9] will be created
  + resource "aws_route" "requestor" {
      + destination_cidr_block     = "100.64.0.0/16"
      # truncated for readability
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Much better and can be applied with zero impact. Another case where a simple change would result in an unexpected amount of &lt;code&gt;terraform plan&lt;/code&gt; output was the following:&lt;/p&gt;

&lt;h2&gt;
  
  
  Using --target to prevent computed values side effects
&lt;/h2&gt;

&lt;p&gt;We manage some EKS clusters, having managed node groups, using the &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks" rel="noopener noreferrer"&gt;terraform-aws-eks&lt;/a&gt; module.&lt;/p&gt;

&lt;p&gt;Changing a property of a cluster, the &lt;code&gt;public_access_cidrs&lt;/code&gt;, resulted in quite some planned changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The cidr addition, as expected.&lt;/li&gt;
&lt;li&gt;Apparently the mere fact that the EKS cluster itself is changed, causes a computed value change that introduces a new launch template version.&lt;/li&gt;
&lt;li&gt;The new launch template version causes a managed node group update, causing EKS to replace all nodes. Not a problem per se, since workloads should be able to handle this, but it takes a considerable amount of time.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# All resources truncated for readability

  # module.eks_cluster_a.module.eks_cluster.aws_eks_cluster.this[0] will be updated in-place
  ~ resource "aws_eks_cluster" "this" {
      ~ vpc_config {
          ~ public_access_cidrs       = [
              # Yes, we fully trust Cloudflare DNS to not hack into our cluster
              + "1.1.1.1/32",
            ]
        }
    }

  # module.eks_cluster_a.module.eks_cluster.module.node_groups.aws_eks_node_group.workers["ng_a"] will be updated in-place
  ~ resource "aws_eks_node_group" "workers" {
      ~ launch_template {
            id      = "lt-someid"
            name    = "cluster_a-ng_a20211222060030102500000001"
          ~ version = "5" -&amp;gt; (known after apply)
        }
    }

  # module.eks_cluster_a.module.eks_cluster.module.node_groups.aws_launch_template.workers["ng_a"] will be updated in-place
  ~ resource "aws_launch_template" "workers" {
      ~ default_version         = 5 -&amp;gt; (known after apply)
      ~ latest_version          = 5 -&amp;gt; (known after apply)
      ~ user_data               = "base64-gibberish" -&amp;gt; (known after apply)
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This seemed a bit over-the-top for a cidr addition. And indeed it can be avoided. &lt;/p&gt;
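&lt;p&gt;One way to picture the chain: any value that is '(known after apply)' makes every value derived from it unknown as well. A toy model of that propagation (an illustration, not how Terraform is actually implemented):&lt;/p&gt;

```python
UNKNOWN = "(known after apply)"

def computed(*inputs):
    """A derived value is only known at plan time if all of its inputs are."""
    return UNKNOWN if UNKNOWN in inputs else "known"

# The cluster is updated in-place, so some attributes are unknown during plan
cluster_attribute = UNKNOWN                # e.g. data rendered into user_data
user_data = computed(cluster_attribute)    # launch template user_data
lt_version = computed(user_data)           # new launch template version
node_group = computed(lt_version)          # node group references that version

# ...which is why a single cidr change cascades into a node group update
assert node_group == UNKNOWN
```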

&lt;p&gt;Targeting only the cluster obviously shows only a change to the cluster (and the removal of several outputs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform plan --target=module.eks_cluster_a.module.eks_cluster.aws_eks_cluster.this[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applying this, and then running &lt;code&gt;terraform plan&lt;/code&gt; on the entire project results in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Truncated for readability: Some read data sources that change

Plan: 0 to add, 0 to change, 0 to destroy.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again, much better!&lt;/p&gt;

&lt;h2&gt;
  
  
  Concluding
&lt;/h2&gt;

&lt;p&gt;When confronted with impactful planned changes, there might be more options than hoping for the best, scheduling the apply at night, or sitting it out.&lt;/p&gt;

&lt;p&gt;It's safe to do a &lt;code&gt;terraform plan&lt;/code&gt;, so when suspecting a chain of dependencies, experimenting with &lt;code&gt;--target&lt;/code&gt; can help.&lt;/p&gt;

&lt;p&gt;Modifying state is more tricky. What works for me is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prepare &lt;em&gt;all&lt;/em&gt; of the state mv changes in a txt file first before applying them.&lt;/li&gt;
&lt;li&gt;Make sure to have the current state backed up (e.g. Copy &lt;code&gt;terraform state show&lt;/code&gt; output to a file).&lt;/li&gt;
&lt;li&gt;Know how to revert the moves if needed.&lt;/li&gt;
&lt;li&gt;Test the pattern on a non-prod environment first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hopefully the above helps anyone use Terraform with confidence without breaking (important) things. If you have feedback or comments, be sure to leave a comment or &lt;a href="https://twitter.com/TBeijen" rel="noopener noreferrer"&gt;reach out on Twitter&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>terraform</category>
      <category>iac</category>
      <category>aws</category>
    </item>
    <item>
      <title>Shifting Akamai to the left using Terraform</title>
      <dc:creator>Tibo Beijen</dc:creator>
      <pubDate>Fri, 03 Dec 2021 12:52:00 +0000</pubDate>
      <link>https://dev.to/tbeijen/shifting-akamai-to-the-left-using-terraform-353m</link>
      <guid>https://dev.to/tbeijen/shifting-akamai-to-the-left-using-terraform-353m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally written for &lt;a href="https://www.tibobeijen.nl/2021/12/03/shift-left-akamai-terraform/" rel="noopener noreferrer"&gt;my personal blog&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Recently &lt;a href="https://www.nu.nl" rel="noopener noreferrer"&gt;we&lt;/a&gt; migrated our CDNs from Cloudfront to Akamai. We use Terraform for infrastructure as code (IaC) and luckily it supports Akamai as well. Since we had Cloudfront distributions for pretty much every environment, it served as a good moment to reflect on what we've taken for granted in the past years, especially since Akamai has the concept of a 'staging network' which doesn't naturally seem to fit in a test-early, test-often approach (Spoiler alert: We don't use the staging network).&lt;/p&gt;

&lt;h2&gt;
  
  
  Shift Left testing
&lt;/h2&gt;

&lt;p&gt;"Shift left" is popular in the contemporary agile and DevOps IT-landscape, and for good reasons. This article &lt;a href="https://www.bmc.com/blogs/what-is-shift-left-shift-left-testing-explained/" rel="noopener noreferrer"&gt;by BMC&lt;/a&gt; summarizes it nicely:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Shift Left is a practice intended to find and prevent defects early in the software delivery process. The idea is to improve quality by moving tasks to the left as early in the lifecycle as possible. Shift Left testing means testing earlier in the software development process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Shift left testing illustrated:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ucpw3r03ztbv3wywtnb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ucpw3r03ztbv3wywtnb.png" alt="Shift Left testing (image by Launchable, Inc.)" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quoting yet another source, &lt;a href="https://devops.com/devops-shift-left-avoid-failure/" rel="noopener noreferrer"&gt;devops.com&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Shifting left requires two key DevOps practices: continuous testing and continuous deployment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Another way to reduce the failure rate is to make all environments in the pipeline look as much like production as possible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, how does a CDN (any CDN, be it Cloudfront, Akamai, Fastly, you name it) fit into this shift left approach? Very well actually, as long as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CDN isn't limited to production only but is present in every environment as early as possible in the development lifecycle. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;Setting up and updating the CDN should be no different than any other code or infra change. As &lt;a href="https://martinfowler.com/bliki/FrequencyReducesDifficulty.html" rel="noopener noreferrer"&gt;Martin Fowler says&lt;/a&gt;: "If it hurts, do it more often".&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Akamai concepts
&lt;/h2&gt;

&lt;p&gt;Compared to Cloudfront, Akamai has some advanced concepts that need to be fitted into an IaC workflow in some way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Activation
&lt;/h3&gt;

&lt;p&gt;An Akamai 'property' (what in Cloudfront is called a 'distribution') has versions, of which one is active at any moment, typically the most recent one. Modifying a configuration whose latest version is active results in a new property version, which can be activated when ready. &lt;/p&gt;

&lt;h3&gt;
  
  
  The staging network
&lt;/h3&gt;

&lt;p&gt;Akamai provides two networks: Production and staging. Property versions can be activated on the staging and production networks independently. The staging network is feature-complete but doesn't provide the performance of the production network. If the production network would use &lt;code&gt;mysite.com.edgekey.net&lt;/code&gt; then the staging network would be accessible using &lt;code&gt;mysite.com.edgekey-staging.net&lt;/code&gt;. This &lt;a href="https://learn.akamai.com/en-us/webhelp/ion/web-performance-getting-started-for-http-properties/GUID-094B3C1E-0205-4104-A091-36FD4E28362D.html" rel="noopener noreferrer"&gt;can be used by modifying&lt;/a&gt; the &lt;code&gt;/etc/hosts&lt;/code&gt; file, to allow testing before activating the version on the production network.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adapting to IaC
&lt;/h3&gt;

&lt;p&gt;One can observe that both of the above concepts seem to originate from a more traditional acceptance testing practice happening late in the development lifecycle. In an IaC practice they lose some of their relevance and can even cause ambiguity that can be considered undesirable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration versions are already present by having configuration in source control. The active version is determined by the branching model that is used (commonly 'latest master'), combined with any automation that exists.&lt;/li&gt;
&lt;li&gt;The Akamai staging network can be used to test a property version, but it's not really a staging environment since it uses the &lt;em&gt;production&lt;/em&gt; origins. &lt;sup id="fnref2"&gt;2&lt;/sup&gt; To illustrate: one could only test the integration of an application and a CDN change &lt;em&gt;after&lt;/em&gt; deploying the application to production. This limits the scope of what can be tested using the staging network. So for a test environment, let alone multiple (feature) test environments, more than one property is needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What we found works well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a property for all environments: test (one or multiple), staging and production.&lt;/li&gt;
&lt;li&gt;Always activate the latest version.&lt;/li&gt;
&lt;li&gt;Test on test, which is fully representative, using any automation one has, for example &lt;a href="https://www.cypress.io/" rel="noopener noreferrer"&gt;cypress&lt;/a&gt; e2e tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way the delivery of Akamai config changes is identical to that of application changes.&lt;/p&gt;

&lt;p&gt;Note that it still allows shit-hits-the-fan rollbacks: The first hour after activating a production property version, there's a quick fallback option. This can be activated (stop the bleeding), after which the active version defined in IaC can be aligned with the actual active version and a fix can be worked on (proper surgery). &lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform
&lt;/h2&gt;

&lt;p&gt;Overall the Terraform module does a fine job in translating declarative Terraform config into Akamai API actions. There are however some things to consider:&lt;/p&gt;

&lt;h3&gt;
  
  
  Version to be activated
&lt;/h3&gt;

&lt;p&gt;An activation is a &lt;a href="https://registry.terraform.io/providers/akamai/akamai/latest/docs/resources/property_activation" rel="noopener noreferrer"&gt;separate Terraform resource&lt;/a&gt;. What happens under the hood is that if the version changes it will use Akamai's Property API (PAPI) to &lt;a href="https://developer.akamai.com/api/core_features/property_manager/v1.html#postpropertyactivations" rel="noopener noreferrer"&gt;create a new activation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://registry.terraform.io/providers/akamai/akamai/latest/docs/resources/property" rel="noopener noreferrer"&gt;Terraform property resource&lt;/a&gt; has 3 attributes related to versions: &lt;code&gt;latest_version&lt;/code&gt;, &lt;code&gt;production_version&lt;/code&gt; and &lt;code&gt;staging_version&lt;/code&gt;. These are determined &lt;em&gt;after&lt;/em&gt; the property has been updated, but &lt;em&gt;before&lt;/em&gt; any activation has finished. &lt;/p&gt;

&lt;p&gt;We take 'always activating latest' as a starting point. However, scenarios can exist where you want to pin a version. One possible way to accomplish this is setting a &lt;code&gt;local&lt;/code&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locals {
  production_version_to_activate = (var.production_activate_latest == true ? 
    akamai_property.property.latest_version : 
    (var.production_pinned_version &amp;gt; 0 ? var.production_pinned_version : akamai_property.property.production_version))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having variable defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Note: Similar variables would exist for staging network
variable "production_pinned_version" {
  description = "Pin PRODUCTION network activation to this version. Set to 0 to always use previous property version on production (don't activate any property changes)."
  type        = number
  default     = 0
}

variable "production_activate_latest" {
  description = "Apply latest version to production. This supersedes any pinned version, so disable it if you want to stay at a specific version."
  type        = bool
  default     = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, &lt;code&gt;tfvars&lt;/code&gt; can be set for various scenarios, as in the examples below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Directly activate latest property version (default)
production_activate_latest   = true

# Stick to previously active version (update the property, activate later, or via GUI)
production_activate_latest   = false

# Activate specific version (e.g. reverting to known to work version)
production_pinned_version    = 7
production_activate_latest   = false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Slow activations
&lt;/h3&gt;

&lt;p&gt;Activating the staging network takes about 2 to 3 minutes. Activating production typically takes between 9 and 11 minutes. To shorten the feedback loop, one can configure DNS for the test environment to use Akamai's staging network, avoiding activation on the production network altogether. Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test.mysite.com CNAME test.mysite.com.edgekey-staging.net
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given low traffic, the cache-hit ratio on a test environment usually can't be compared to production anyway, so missing production-level performance is normally not an issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implicit edge hostnames
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://registry.terraform.io/providers/akamai/akamai/latest/docs/resources/edge_hostname" rel="noopener noreferrer"&gt;edge hostname&lt;/a&gt; resource requires to set a certificate enrollment ID when using enhanced TLS (edge hostnames ending in &lt;code&gt;edgekey.net&lt;/code&gt;). However, if you're a 'Secure by default' customer, you &lt;em&gt;can&lt;/em&gt; (not: must) &lt;a href="https://learn.akamai.com/en-us/learn_akamai/getting_started_with_akamai_developers/core_features/create_edgehostnames.html#cpsprerequisitefortls" rel="noopener noreferrer"&gt;use default certificates&lt;/a&gt;. In that case the edge hostname will be &lt;a href="https://developer.akamai.com/api/core_features/property_manager/v1.html#postedgehostnames" rel="noopener noreferrer"&gt;created implicitly by the property manager API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a result, the edge hostname that is created is not managed via Terraform. Most of the &lt;a href="https://registry.terraform.io/providers/akamai/akamai/latest/docs/resources/edge_hostname#argument-reference" rel="noopener noreferrer"&gt;edge hostname attributes&lt;/a&gt; hardly ever need to be changed, but for &lt;a href="https://registry.terraform.io/providers/akamai/akamai/latest/docs/resources/edge_hostname#ip_behavior" rel="noopener noreferrer"&gt;ip_behavior&lt;/a&gt; this can be a problem (&lt;a href="https://github.com/akamai/terraform-provider-akamai/issues/268" rel="noopener noreferrer"&gt;GitHub issue&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;The main takeaway is: treat a CDN like any other cloud resource, making sure to have representative environments as early as possible in the development lifecycle, whether via Terraform, the &lt;a href="https://developer.akamai.com/cli" rel="noopener noreferrer"&gt;Akamai CLI&lt;/a&gt; or another tool of choice.&lt;/p&gt;

&lt;p&gt;Shift-left in the context of Akamai means achieving confidence in provisioning &lt;code&gt;1...n&lt;/code&gt; near-identical CDN properties, reducing the need for Akamai's staging network and ultimately speeding up the delivery process.&lt;/p&gt;

&lt;p&gt;Worth noting is that end-to-end tests in a caching setup can be challenging to keep fast due to cache TTLs. This can be mitigated via cachebusters, reduced &lt;code&gt;max-age&lt;/code&gt; values in response headers or other constructs.&lt;/p&gt;
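&lt;p&gt;A cachebuster can be as simple as a unique query string per test run, ensuring requests bypass objects cached by earlier runs. A sketch (the path and parameter name are arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical cachebuster: a unique query string per test run
curl -s "https://test.mysite.com/some/page?cb=$(date +%s)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;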

&lt;p&gt;A representative test environment with carefully considered exceptions still beats shifting right.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Please leave any feedback or comments below, or &lt;a href="https://twitter.com/TBeijen" rel="noopener noreferrer"&gt;find me on Twitter&lt;/a&gt;.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;For a CDN, representative &lt;em&gt;local&lt;/em&gt; development seems a bit far-fetched, but once you deploy, having a representative environment should be the goal. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;One could attempt to mitigate this by selecting a staging origin based on the request host, but this is a bad idea for a variety of reasons, the most obvious one being that it adds complexity that can easily backfire (production traffic ending up on staging origin), while still being limited to just production and staging. No test. No shifting left. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>devops</category>
      <category>akamai</category>
      <category>terraform</category>
      <category>testing</category>
    </item>
    <item>
      <title>Maximize learnings from a Kubernetes cluster failure</title>
      <dc:creator>Tibo Beijen</dc:creator>
      <pubDate>Fri, 01 Feb 2019 20:16:29 +0000</pubDate>
      <link>https://dev.to/tbeijen/maximize-learnings-from-a-kubernetes-cluster-failure-3p53</link>
      <guid>https://dev.to/tbeijen/maximize-learnings-from-a-kubernetes-cluster-failure-3p53</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally written for &lt;a href="https://www.tibobeijen.nl/2019/02/01/learning-from-kubernetes-cluster-failure/" rel="noopener noreferrer"&gt;my personal blog&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For a number of months now, we (the &lt;a href="https://www.nu.nl" rel="noopener noreferrer"&gt;NU.nl&lt;/a&gt; development team) have been operating a small number of Kubernetes clusters. We see the potential of Kubernetes: how it can increase our productivity and improve our CI/CD practices. Currently we run part of our logging and build toolset on Kubernetes, plus some small (internal) customer-facing workloads, with the plan to move more applications there once we have built up knowledge and confidence.&lt;/p&gt;

&lt;p&gt;Recently our team faced some problems on one of the clusters. Not as severe as to bring down the cluster completely, but definitely affecting the user experience of some internally used tools and dashboards.&lt;/p&gt;

&lt;p&gt;Coincidentally, around the same time I visited DevOpsCon 2018 in Munich, where the opening keynote &lt;a href="https://devopsconference.de/business-company-culture/staying-alive-patterns-for-failure-management-from-the-bottom-of-the-ocean/" rel="noopener noreferrer"&gt;"Staying Alive: Patterns for Failure Management from the Bottom of the Ocean"&lt;/a&gt; related very well to this incident.&lt;/p&gt;

&lt;p&gt;The talk (by &lt;a href="https://twitter.com/rondoftw" rel="noopener noreferrer"&gt;Ronnie Chen&lt;/a&gt;, engineering manager at Twitter) focused on various ways to make DevOps teams more effective in preventing and handling failures. One of the topics addressed was how catastrophes are usually caused by a cascade of failures, resulting in this quote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A post-mortem that blames an incident only on the root cause, might only cover ~15% of the issues that led up to the incident.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As can be seen in &lt;a href="https://github.com/dastergon/postmortem-templates" rel="noopener noreferrer"&gt;this list of postmortem templates&lt;/a&gt;, quite a lot of them contain 'root cause(s)' (plural). Nevertheless the chain of events can be easily overlooked, especially as in a lot of situations, removing or fixing the root cause makes the problem go away.&lt;/p&gt;

&lt;p&gt;So, let's see what cascade of failures led to our incident and maximize our learnings.&lt;/p&gt;

&lt;h2&gt;
  
  
  The incident
&lt;/h2&gt;

&lt;p&gt;Our team received reports of a number of services showing erratic behavior: Occasional error pages, slow responses and time-outs.&lt;/p&gt;

&lt;p&gt;Attempting to investigate via Grafana, we experienced similar behavior affecting Grafana and Prometheus. Examining the cluster from the console resulted in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$: kubectl get nodes
NAME                                          STATUS     ROLES     AGE       VERSION
ip-10-150-34-78.eu-west-1.compute.internal    Ready      master    43d       v1.10.6
ip-10-150-35-189.eu-west-1.compute.internal   Ready      node      2h        v1.10.6
ip-10-150-36-156.eu-west-1.compute.internal   Ready      node      2h        v1.10.6
ip-10-150-37-179.eu-west-1.compute.internal   NotReady   node      2h        v1.10.6
ip-10-150-37-37.eu-west-1.compute.internal    Ready      master    43d       v1.10.6
ip-10-150-38-190.eu-west-1.compute.internal   Ready      node      4h        v1.10.6
ip-10-150-39-21.eu-west-1.compute.internal    NotReady   node      2h        v1.10.6
ip-10-150-39-64.eu-west-1.compute.internal    Ready      master    43d       v1.10.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nodes &lt;code&gt;NotReady&lt;/code&gt;, not good. Describing various nodes (not just the unhealthy ones) showed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$: kubectl describe node ip-10-150-36-156.eu-west-1.compute.internal

&amp;lt;truncated&amp;gt;

Events:
  Type     Reason                   Age                From                                                     Message
  ----     ------                   ----               ----                                                     -------
  Normal   Starting                 36m                kubelet, ip-10-150-36-156.eu-west-1.compute.internal     Starting kubelet.
  Normal   NodeHasSufficientDisk    36m (x2 over 36m)  kubelet, ip-10-150-36-156.eu-west-1.compute.internal     Node ip-10-150-36-156.eu-west-1.compute.internal status is now: NodeHasSufficientDisk
  Normal   NodeHasSufficientMemory  36m (x2 over 36m)  kubelet, ip-10-150-36-156.eu-west-1.compute.internal     Node ip-10-150-36-156.eu-west-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    36m (x2 over 36m)  kubelet, ip-10-150-36-156.eu-west-1.compute.internal     Node ip-10-150-36-156.eu-west-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     36m                kubelet, ip-10-150-36-156.eu-west-1.compute.internal     Node ip-10-150-36-156.eu-west-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeNotReady             36m                kubelet, ip-10-150-36-156.eu-west-1.compute.internal     Node ip-10-150-36-156.eu-west-1.compute.internal status is now: NodeNotReady
  Warning  SystemOOM                36m (x4 over 36m)  kubelet, ip-10-150-36-156.eu-west-1.compute.internal     System OOM encountered
  Normal   NodeAllocatableEnforced  36m                kubelet, ip-10-150-36-156.eu-west-1.compute.internal     Updated Node Allocatable limit across pods
  Normal   Starting                 36m                kube-proxy, ip-10-150-36-156.eu-west-1.compute.internal  Starting kube-proxy.
  Normal   NodeReady                36m                kubelet, ip-10-150-36-156.eu-west-1.compute.internal     Node ip-10-150-36-156.eu-west-1.compute.internal status is now: NodeReady
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looked like the node's operating system was killing processes before the &lt;code&gt;kubelet&lt;/code&gt; was able to reclaim memory, as &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior" rel="noopener noreferrer"&gt;described in the Kubernetes docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The nodes in our cluster are part of an auto-scaling group. So, considering we had intermittent outages and at that time had problems reaching Grafana, we decided to terminate the &lt;code&gt;NotReady&lt;/code&gt; nodes one by one to see if new nodes would remain stable. This was not the case: new nodes appeared correctly, but soon after, existing or new nodes would again end up in status &lt;code&gt;NotReady&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It &lt;em&gt;did&lt;/em&gt;, however, result in Prometheus and Grafana being scheduled on a node that remained stable, so at least we had more data to analyze, and the root cause quickly became apparent...&lt;/p&gt;

&lt;h2&gt;
  
  
  Root cause
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://grafana.com/dashboards/315" rel="noopener noreferrer"&gt;One of the dashboards&lt;/a&gt; in our Grafana setup shows cluster-wide totals as well as a graphs for pod memory and cpu usage. This quickly showed the source of our problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.tibobeijen.nl%2Fimg%2Flearning_from_k8s_failure_grafana_pod_memory.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.tibobeijen.nl%2Fimg%2Flearning_from_k8s_failure_grafana_pod_memory.png" alt="Pods memory usage, during and after incident"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Those lines going up into nowhere are all pods running &lt;a href="https://github.com/Yelp/elastalert" rel="noopener noreferrer"&gt;ElastAlert&lt;/a&gt;. For logs we have an Elasticsearch cluster running, and recently we had been experimenting with ElastAlert to trigger alerts based on logs. One of the alerts introduced shortly before the incident would fire if our &lt;code&gt;Cloudfront-*&lt;/code&gt; indexes did not receive new documents for a certain period. As the throughput of that Cloudfront distribution is a couple of million requests per hour, this apparently caused an enormous ramp-up in memory usage. In hindsight, &lt;a href="https://elastalert.readthedocs.io/en/latest/ruletypes.html#max-query-size" rel="noopener noreferrer"&gt;digging deeper into the documentation&lt;/a&gt;, we should have used &lt;code&gt;use_count_query&lt;/code&gt; and/or &lt;code&gt;max_query_size&lt;/code&gt;.&lt;/p&gt;
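&lt;p&gt;As a sketch of what such a rule could have looked like with count queries enabled (the rule name, index pattern and thresholds below are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical ElastAlert flatline rule. use_count_query makes ElastAlert
# use the count API instead of fetching documents, keeping memory usage
# flat even on high-volume indexes.
name: cloudfront-no-new-documents
type: flatline
index: cloudfront-*
threshold: 1
timeframe:
  minutes: 15
use_count_query: true
doc_type: doc  # required by some ElastAlert versions when using count queries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;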

&lt;h2&gt;
  
  
  Cascade of failures
&lt;/h2&gt;

&lt;p&gt;So, root cause identified, investigated and fixed. Incident closed, right? Keeping in mind the quote from before, there's still ~85% of learnings to be found, so let's dive in:&lt;/p&gt;

&lt;h3&gt;
  
  
  No alerts fired
&lt;/h3&gt;

&lt;p&gt;Obviously alerting was already on our minds, as the root cause itself was related to ElastAlert. Some data to act on is (currently) only available in Elasticsearch, like log messages (occurrence of keywords) or data from systems outside of the Kubernetes cluster. Prometheus also has an Alertmanager, which we still need to set up. Besides those two sources we use New Relic for APM. Regardless of the sources, and a probable need to converge on fewer of them, it starts with at least &lt;em&gt;defining&lt;/em&gt; alert rules.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Resolution:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define alerts related to resource usage, like CPU, Memory and disk space.&lt;/li&gt;
&lt;li&gt;Continue research on alerting strategy that effectively combines possibly multiple sources.&lt;/li&gt;
&lt;/ul&gt;
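&lt;p&gt;As a sketch of the first point, a Prometheus alert rule on node memory could look roughly like this (a hedged example; the metric names assume node_exporter, and the threshold and labels are arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: node-resources
    rules:
      - alert: NodeLowMemory
        # Recent node_exporter versions expose node_memory_MemAvailable_bytes;
        # older versions use node_memory_MemAvailable instead.
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes &amp;lt; 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% memory available on {{ $labels.instance }}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;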

&lt;h3&gt;
  
  
  Grafana dashboard affected by cluster problems
&lt;/h3&gt;

&lt;p&gt;Prometheus and Grafana are very easy to set up in a Kubernetes cluster (install a few Helm charts and you're pretty much up and running). However, if you can't reach your cluster, you're blind.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Resolution:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consider exporting metrics outside of the cluster and moving Grafana out of the cluster as well. This is not without downsides, as very well explained &lt;a href="https://www.robustperception.io/federation-what-is-it-good-for" rel="noopener noreferrer"&gt;in this article at robustperception.io&lt;/a&gt;. An advantage might be having a single go-to point for dashboards covering multiple clusters. To my knowledge, &lt;a href="https://kublr.com/" rel="noopener noreferrer"&gt;Kublr&lt;/a&gt; uses a similar set-up to monitor multiple clusters.&lt;/li&gt;
&lt;li&gt;Out-of-cluster location could be EC2 but also a separate Kubernetes cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Not fully benefiting from our ELK stack
&lt;/h3&gt;

&lt;p&gt;We are running an EC2-based ELK stack that ingests a big volume of Cloudfront logs, but also logs and metrics from the Kubernetes clusters, exported by filebeat and metricbeat daemonsets. So, the data we couldn't access via the in-cluster Grafana existed in the ELK stack as well, but either wasn't visualized properly or was overlooked.&lt;/p&gt;

&lt;p&gt;This in general is a somewhat tricky subject: on the one hand, Elasticsearch is likely to be needed anyway for centralized logging and can handle metrics as well, so it &lt;em&gt;could&lt;/em&gt; be the one-stop solution. However, at scale it's quite a beast to operate, and onboarding could really benefit from more example dashboards (imo).&lt;/p&gt;

&lt;p&gt;On the other hand, Prometheus is simple to set up, seems to be the default technology in the Kubernetes eco-system and, paired with Grafana's available dashboards, is very easy to get into.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Resolution:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Either visualize important metrics in ELK or improve Prometheus/Grafana availability.&lt;/li&gt;
&lt;li&gt;Improve metrics strategy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  No CPU &amp;amp; memory limits on ElastAlert pod
&lt;/h3&gt;

&lt;p&gt;The Helm chart used to install ElastAlert &lt;a href="https://github.com/helm/charts/blob/d3fd3f11578ebf74749b1b2c994c51d5199b8599/stable/elastalert/values.yaml#L33" rel="noopener noreferrer"&gt;allows specifying resource requests and limits&lt;/a&gt;; however, these have no default values (not uncommon) and were overlooked by us.&lt;/p&gt;

&lt;p&gt;In order to enforce configuring resource limits we could have &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/#create-a-limitrange-and-a-pod" rel="noopener noreferrer"&gt;configured default and limit memory requests for our namespace&lt;/a&gt;. &lt;/p&gt;
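&lt;p&gt;A minimal sketch of such a &lt;code&gt;LimitRange&lt;/code&gt; (the namespace name and the memory values are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: LimitRange
metadata:
  name: mem-defaults
  namespace: tools       # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:    # request applied when a container specifies none
        memory: 256Mi
      default:           # limit applied when a container specifies none
        memory: 512Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;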

&lt;p&gt;&lt;em&gt;Resolution:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specify resource limits via Helm values.&lt;/li&gt;
&lt;li&gt;Configure namespaces to have defaults and limits for memory requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ops services affecting customer facing workloads
&lt;/h3&gt;

&lt;p&gt;Customer workloads and monitoring/logging tools sharing the same set of resources has the risk of an amplifying effect consuming all resources. Increased traffic, causing increased CPU/memory pressure, causing more logging/metric volume, causing even more CPU/memory pressure, etc.&lt;/p&gt;

&lt;p&gt;We were already planning to move all logging, monitoring and CI/CD tooling to a dedicated node group within the production cluster. Depending on our experience with that, having a dedicated 'tools' cluster is also an option. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Resolution (was already planned):&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Isolate customer facing workloads from build, logging &amp;amp; monitoring workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  No team-wide awareness of the ElastAlert change that was deployed
&lt;/h3&gt;

&lt;p&gt;Although the new alert passed code review, the fact that it was merged and deployed was not known to everybody. More importantly, as it was installed via the command line by one of the team members, there was no immediate source of information showing which applications in the cluster might have been updated.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Resolution:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy everything via automation (e.g. Jenkins pipelines)&lt;/li&gt;
&lt;li&gt;Consider a &lt;a href="https://thenewstack.io/gitops-git-push-all-the-things/" rel="noopener noreferrer"&gt;GitOps&lt;/a&gt; approach for deploying new application versions: 'state to be' and history of changes in code, using a tool well known by developers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  No smoke tests
&lt;/h3&gt;

&lt;p&gt;If we had deployed the ElastAlert update using a pipeline, we could have added a 'smoke test' step &lt;em&gt;after&lt;/em&gt; the deploy. This could have signaled excessive memory usage, or pod restarts due to the pod exceeding configured memory limits.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Resolution:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy via a pipeline that includes a smoke test step.&lt;/li&gt;
&lt;/ul&gt;
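&lt;p&gt;Such a smoke test step could start out as simple as the following sketch (the deployment name, namespace and label are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical post-deploy smoke test: fail the pipeline if the rollout
# doesn't stabilize in time, then inspect container restart counts.
kubectl rollout status deployment/elastalert --namespace tools --timeout=2m
kubectl get pods --namespace tools -l app=elastalert \
  -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;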

&lt;h3&gt;
  
  
  Knowledge of operating Kubernetes limited to part of team
&lt;/h3&gt;

&lt;p&gt;Our team (like most teams) consists of people with various levels of expertise on different topics. Some have more cloud &amp;amp; DevOps experience, some are front-end or Django experts, etc. As Kubernetes is a fairly new technology, certainly for our team, knowledge was not as widespread as is desirable. As with all technologies practiced by agile teams: DevOps should not be limited to (part of) a single team. Luckily, experienced team members were available to assist the on-call team member who had little infrastructure experience.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Resolution (was already planned):&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensuring Kubernetes-related work (cloud infrastructure-related in general actually) is part of team sprints and is picked up by all team members, pairing with more experienced members.&lt;/li&gt;
&lt;li&gt;Workshops deep-diving into certain topics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;As becomes quite apparent, fixing the ElastAlert problem itself was just the tip of the iceberg. There was a lot more to learn from this seemingly simple incident. Most points listed in this article were already on our radar in one way or another, but the incident emphasized their importance.&lt;/p&gt;

&lt;p&gt;Turning these learnings into scrum (or kanban) items will allow us to improve our platform and practices in a focused way and measure our progress. &lt;/p&gt;

&lt;p&gt;Learning and improving as a team requires a company culture that allows 'blameless post mortems' and does not merely focus on 'number of incidents' or 'time to resolve'. To finish with a quote &lt;a href="https://devopsconference.de/continuous-delivery/i-deploy-on-fridays-and-maybe-you-should-too/" rel="noopener noreferrer"&gt;heard at a DevOps conference&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Success consists of going from failure to failure without loss of enthusiasm - Winston Churchill&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>failuremanagement</category>
      <category>postmortem</category>
    </item>
  </channel>
</rss>
