DEV Community: Steven Sklar

How do Kubernetes Operators Handle Concurrency?

Steven Sklar — Wed, 09 Oct 2024 15:17:08 +0000

Originally posted on my blog

By default, operators built using Kubebuilder and controller-runtime process a single reconcile request at a time. This is a sensible setting, since it's easier for operator developers to reason about and debug the logic in their applications. It also constrains throughput from the controller to core Kubernetes resources like ectd and the API server.

But what if your work queue starts backing up and average reconciliation times increase due to requests that are left sitting in the queue, waiting to be processed? Luckily for us, a controller-runtime Controller struct includes a MaxConcurrentReconciles field (as I previously mentioned in my Kubebuilder Tips article). This option allows you to set the number of concurrent reconcile loops that are running in a single controller. So with a value above 1, you can reconcile multiple Kubernetes resources simultaneously.

Early in my operator journey, one question that I had was how could we guarantee that the same resource isn't being reconciled at the same time by 2 or more goroutines? With MaxConcurrentReconciles set above 1, this could lead to all sorts of race conditions and undesirable behavior, as the state of an object inside a reconciliation loop could change via a side-effect from an external source (a reconciliation loop running in a different thread).

I thought about this for a while, and even implemented a sync.Map-based approach that would allow a goroutine to acquire a lock for a given resource (based on its namespace/name).

It turns out that all of this effort was for naught, since I recently learned (in a k8s slack channel) that the controller workqueue already includes this feature! Albeit with a simpler implementation.

This is a quick story about how a k8s controller's workqueue guarantees that unique resources are reconciled sequentially. So even if MaxConcurrentReconciles is set above 1, you can be confident that only a single reconciliation function is acting on any given resource at a time.

client-go/util

Controller-runtime uses the client-go/util/workqueue library to implement its underlying reconciliation queue. In the package's doc.go file, a comment states that the workqueue supports these properties:

Fair: items processed in the order in which they are added.
Stingy: a single item will not be processed multiple times concurrently, and if an item is added multiple times before it can be processed, it will only be processed once.
Multiple consumers and producers. In particular, it is allowed for an item to be re-enqueued while it is being processed.
Shutdown notifications.

Wait a second... My answer is right here in the second bullet, the "Stingy" property! According to these docs, the queue will automatically handle this concurrency issue for me, without having to write a single line of code. Let's run through the implementation.

How does the workqueue work?

The workqueue struct has 3 main methods, Add, Get, and Done. Inside a controller, an informer would Add reconcile requests (namespaced-names of generic k8s resources) to the workqueue. A reconcile loop running in a separate goroutine would then Get the next request from the queue (blocking if it is empty). The loop would perform whatever custom logic is written in the controller, and then the controller would call Done on the queue, passing in the reconcile request as an argument. This would start the process over again, and the reconcile loop would call Get to retrieve the next work item.

This is similar to processing messages in RabbitMQ, where a worker pops an item off the queue, processes it, and then sends an "Ack" back to the message broker indicating that processing has completed and it's safe to remove the item from the queue.

Still, I have an operator running in production that powers QuestDB Cloud's infrastructure, and wanted to be sure that the workqueue works as advertised. So a wrote a quick test to validate its behavior.

A little test

Here is a simple test that validates the "Stingy" property:

package main_test

import (
    "testing"

    "github.com/stretchr/testify/assert"

    "k8s.io/client-go/util/workqueue"
)

func TestWorkqueueStingyProperty(t *testing.T) {

    type Request int

    // Create a new workqueue and add a request
    wq := workqueue.New()
    wq.Add(Request(1))
    assert.Equal(t, wq.Len(), 1)

    // Subsequent adds of an identical object
    // should still result in a single queued one
    wq.Add(Request(1))
    wq.Add(Request(1))
    assert.Equal(t, wq.Len(), 1)

    // Getting the object should remove it from the queue
    // At this point, the controller is processing the request
    obj, _ := wq.Get()
    req := obj.(Request)
    assert.Equal(t, wq.Len(), 0)

    // But re-adding an identical request before it is marked as "Done"
    // should be a no-op, since we don't want to process it simultaneously
    // with the first one
    wq.Add(Request(1))
    assert.Equal(t, wq.Len(), 0)

    // Once the original request is marked as Done, the second
    // instance of the object will be now available for processing
    wq.Done(req)
    assert.Equal(t, wq.Len(), 1)

    // And since it is available for processing, it will be
    // returned by a Get call
    wq.Get()
    assert.Equal(t, wq.Len(), 0)
}

Since the workqueue uses a mutex under the hood, this behavior is threadsafe. So even if I wrote more tests that used multiple goroutines simultaneously reading and writing from the queue at high speeds in an attempt to break it, the workqueue's actual behavior would be the same as that of our single-threaded test.

All is not lost

There are a lot of little gems like this hiding in the Kubernetes standard libraries, some of which are in not-so-obvious places (like a controller-runtime workqueue found in the go client package). Despite this discovery, and others like it that I've made in the past, I still feel that my previous attempts at solving these issues are not complete time-wasters. They force you to think critically about fundamental problems in distributed systems computing, and help you to understand more of what is going on under the hood. So that by the time I've discovered that "Kubernetes did it", I'm relieved that I can simplify my codebase and perhaps remove some unnecessary unit tests.

How to add Kubernetes-powered leader election to your Go apps

Steven Sklar — Fri, 19 Jul 2024 20:03:36 +0000

Originally published on by blog

The Kubernetes standard library is full of gems, hidden away in many of the various subpackages that are part of the ecosystem. One such example that I discovered recently k8s.io/client-go/tools/leaderelection, which can be used to add a leader election protocol to any application running inside a Kubernetes cluster. This article will discuss what leader election is, how it's implemented in this Kubernetes package, and provide an example of how we can use this library in our own applications.

Leader Election

Leader election is a distributed systems concept that is a core building block of highly-available software. It allows for multiple concurrent processes to coordinate amongst each other and elect a single "leader" process, which is then responsible for performing synchronous actions like writing to a data store.

This is useful in systems like distributed databases or caches, where multiple processes are running to create redundancy against hardware or network failures, but can't write to storage simultaneously to ensure data consistency. If the leader process becomes unresponsive at some point in the future, the remaining processes will kick off a new leader election, eventually picking a new process to act as the leader.

Using this concept, we can create highly-available software with a single leader and multiple standby replicas.

In Kubernetes, the controller-runtime package uses leader election to make controllers highly-available. In a controller deployment, resource reconciliation only occurs when a process is the leader, and other replicas are waiting on standby. If the leader pod becomes unresponsive, the remaining replicas will elect a new leader to perform subsequent reconciliations and resume normal operation.

Kubernetes Leases

This library uses a Kubernetes Lease, or distributed lock, that can be obtained by a process. Leases are native Kubernetes resources that are held by a single identity, for a given duration, with a renewal option. Here's an example spec from the docs:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  labels:
    apiserver.kubernetes.io/identity: kube-apiserver
    kubernetes.io/hostname: master-1
  name: apiserver-07a5ea9b9b072c4a5f3d1c3702
  namespace: kube-system
spec:
  holderIdentity: apiserver-07a5ea9b9b072c4a5f3d1c3702_0c8914f7-0f35-440e-8676-7844977d3a05
  leaseDurationSeconds: 3600
  renewTime: "2023-07-04T21:58:48.065888Z"

Leases are used by the k8s ecosystem in three ways:

Node Heartbeats: Every Node has a corresponding Lease resource and updates its renewTime field on an ongoing basis. If a Lease's renewTime hasn't been updated in a while, the Node will be tainted as not available and no more Pods will be scheduled to it.
Leader Election: In this case, a Lease is used to coordinate among multiple processes by having a leader update the Lease's holderIdentity. Standby replicas, with different identities, are stuck waiting for the Lease to expire. If the Lease does expire, and is not renewed by the leader, a new election takes place in which the remaining replicas attempt to take ownership of the Lease by updating its holderIdentity with their own. Since the Kubernetes API server disallows updates to stale objects, only a single standby node will successfully be able to update the Lease, at which point it will continue execution as the new leader.
API Server Identity: Starting in v1.26, as a beta feature, each kube-apiserver replica will publish its identity by creating a dedicated Lease. Since this is a relatively slim, new feature, there's not much else that can be derived from the Lease object aside from how many API servers are running. But this does leave room to add more metadata to these Leases in future k8s versions.

Now let's explore this second use case of Leases by writing a sample program to demonstrate how you can use them in leader election scenarios.

Example Program

In this code example, we are using the leaderelection package to handle the leader election and Lease manipulation specifics.

package main

import (
    "context"
    "fmt"
    "os"
    "time"

    "k8s.io/client-go/tools/leaderelection"
    rl "k8s.io/client-go/tools/leaderelection/resourcelock"
    ctrl "sigs.k8s.io/controller-runtime"
)

var (
    // lockName and lockNamespace need to be shared across all running instances
    lockName      = "my-lock"
    lockNamespace = "default"

    // identity is unique to the individual process. This will not work for anything,
    // outside of a toy example, since processes running in different containers or
    // computers can share the same pid.
    identity      = fmt.Sprintf("%d", os.Getpid())
)

func main() {
    // Get the active kubernetes context
    cfg, err := ctrl.GetConfig()
    if err != nil {
        panic(err.Error())
    }

    // Create a new lock. This will be used to create a Lease resource in the cluster.
    l, err := rl.NewFromKubeconfig(
        rl.LeasesResourceLock,
        lockNamespace,
        lockName,
        rl.ResourceLockConfig{
            Identity: identity,
        },
        cfg,
        time.Second*10,
    )
    if err != nil {
        panic(err)
    }

    // Create a new leader election configuration with a 15 second lease duration.
    // Visit https://pkg.go.dev/k8s.io/client-go/tools/leaderelection#LeaderElectionConfig
    // for more information on the LeaderElectionConfig struct fields
    el, err := leaderelection.NewLeaderElector(leaderelection.LeaderElectionConfig{
        Lock:          l,
        LeaseDuration: time.Second * 15,
        RenewDeadline: time.Second * 10,
        RetryPeriod:   time.Second * 2,
        Name:          lockName,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) { println("I am the leader!") },
            OnStoppedLeading: func() { println("I am not the leader anymore!") },
            OnNewLeader:      func(identity string) { fmt.Printf("the leader is %s\n", identity) },
        },
    })
    if err != nil {
        panic(err)
    }

    // Begin the leader election process. This will block.
    el.Run(context.Background())

}

What's nice about the leaderelection package is that it provides a callback-based framework for handling leader elections. This way, you can act on specific state changes in a granular way and properly release resources when a new leader is elected. By running these callbacks in separate goroutines, the package takes advantage of Go's strong concurrency support to efficiently utilize machine resources.

Testing it out

To test this, lets spin up a test cluster using kind.

$ kind create cluster

Copy the sample code into main.go, create a new module (go mod init leaderelectiontest) and tidy it (go mod tidy) to install its dependencies. Once you run go run main.go, you should see output like this:

$ go run main.go
I0716 11:43:50.337947     138 leaderelection.go:250] attempting to acquire leader lease default/my-lock...
I0716 11:43:50.351264     138 leaderelection.go:260] successfully acquired lease default/my-lock
the leader is 138
I am the leader!

The exact leader identity will be different from what's in the example (138), since this is just the PID of the process that was running on my computer at the time of writing.

And here's the Lease that was created in the test cluster:

$ kubectl describe lease/my-lock
Name:         my-lock
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  coordination.k8s.io/v1
Kind:         Lease
Metadata:
  Creation Timestamp:  2024-07-16T15:43:50Z
  Resource Version:    613
  UID:                 1d978362-69c5-43e9-af13-7b319dd452a6
Spec:
  Acquire Time:            2024-07-16T15:43:50.338049Z
  Holder Identity:         138
  Lease Duration Seconds:  15
  Lease Transitions:       0
  Renew Time:              2024-07-16T15:45:31.122956Z
Events:                    <none>

See that the "Holder Identity" is the same as the process's PID, 138.

Now, let's open up another terminal and run the same main.go file in a separate process:

$ go run main.go
I0716 11:48:34.489953     604 leaderelection.go:250] attempting to acquire leader lease default/my-lock...
the leader is 138

This second process will wait forever, until the first one is not responsive. Let's kill the first process and wait around 15 seconds. Now that the first process is not renewing its claim on the Lease, the .spec.renewTime field won't be updated anymore. This will eventually cause the second process to trigger a new leader election, since the Lease's renew time is older than its duration. Because this process is the only one now running, it will elect itself as the new leader.

the leader is 604
I0716 11:48:51.904732     604 leaderelection.go:260] successfully acquired lease default/my-lock
I am the leader!

If there were multiple processes still running after the initial leader exited, the first process to acquire the Lease would be the new leader, and the rest would continue to be on standby.

No single-leader guarantees

This package is not foolproof, in that it "does not guarantee that only one client is acting as a leader (a.k.a. fencing)". For example, if a leader is paused and lets its Lease expire, another standby replica will acquire the Lease. Then, once the original leader resumes execution, it will think that it's still the leader and continue doing work alongside the newly-elected leader. In this way, you can end up with two leaders running simultaneously.

To fix this, a fencing token which references the Lease needs to be included in each request to the server. A fencing token is effectively an integer that increases by 1 every time a Lease changes hands. So a client with an old fencing token will have its requests rejected by the server. In this scenario, if an old leader wakes up from sleep and a new leader has already incremented the fencing token, all of the old leader's requests would be rejected because it is sending an older (smaller) token than what the server has seen from the newer leader.

Implementing fencing in Kubernetes would be difficult without modifying the core API server to account for corresponding fencing tokens for each Lease. However, the risk of having multiple leader controllers is somewhat mitigated by the k8s API server itself. Because updates to stale objects are rejected, only controllers with the most up-to-date version of an object can modify it. So while we could have multiple controller leaders running, a resource's state would never regress to older versions if a controller misses a change made by another leader. Instead, reconciliation time would increase as both leaders need to refresh their own internal states of resources to ensure that they are acting on the most recent versions.

Still, if you're using this package to implement leader election using a different data store, this is an important caveat to be aware of.

Conclusion

Leader election and distributed locking are critical building blocks of distributed systems. When trying to build fault-tolerant and highly-available applications, having tools like these at your disposal is critical. The Kubernetes standard library gives us a battle-tested wrapper around its primitives to allow application developers to easily build leader election into their own applications.

While use of this particular library does limit you to deploying your application on Kubernetes, that seems to be the way the world is going recently. If in fact that is a dealbreaker, you can of course fork the library and modify it to work against any ACID-compliant and highly-available datastore.

Stay tuned for more k8s source deep dives!

K3s Traefik Ingress - configured for your homelab!

Steven Sklar — Fri, 15 Dec 2023 17:20:19 +0000

Originally posted on my blog

I recently purchased a used Lenovo M900 Think Centre (i7 with 32GB RAM) from eBay to expand my mini-homelab, which was just a single Synology DS218+ plugged into my ISP's router (yuck!). Since I've been spending a big chunk of time at work playing around with Kubernetes, I figured that I'd put my skills to the test and run a k3s node on the new server. While I was familiar with k3s before starting this project, I'd never actually run it before, opting for tools like kind (and minikube before that) to run small test clusters for my local development work.

K3s comes "batteries-included" and is packaged with all of the core components that are needed to run a cluster. For ingress, this component is Traefik, something which I also never had any hands-on experience with.

After a bit of trial-and-error, along with some help from the docs, I was able to figure out how to configure the Traefik ingress to expose a service on the new node to the rest of my LAN. This post will go into detail about how I did it, in case you've also installed k3s on a node in your own lab and want to know how to use the pre-bundled Traefik ingress.

Installing k3s

The k3s installation process is fairly straightforward. It just requires curling https://get.k3s.io and executing the script. The default settings all make sense to me, except for the need to chmod the bundled kubeconfig since its default filemode is inaccessible to non-root users. Luckly, k3s has a solution for that! All you need to do is export K3S_KUBECONFIG_MODE="644" before you run the installation, and the k3s install script does the rest.

Configuring Traefik

By default, host ports 80 and 443 are exposed by the bundled Traefik ingress controller. This will let you create HTTP and HTTPS ingresses on their standard ports.

But of course, I want to run a QuestDB instance on my node, which uses two additional TCP ports for Influx Line Protocol (ILP) and Pgwire communication with the database. So how can I expose these extra ports on my node and route traffic to the QuestDB container running inside of k3s?

K3s deploys Traefik via a Helm chart and allows you to modify that chart's values.yaml through a HelmChartConfig resource. To further customize values.yaml files for installed charts, you can place extra HelmChartConfig manifests in /var/lib/rancher/k3s/server/manifests. Here is my /var/lib/rancher/k3s/server/manifests/traefik-config.yaml:

---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    ports:
      psql:
        port: 8812
        expose: true
        exposedPort: 8812
        protocol: TCP
      ilp:
        port: 9009
        expose: true
        exposedPort: 9009
        protocol: TCP

This manifest modifies the Traefik deployment to serve two extra TCP ports on the host: 8812 for Pgwire and 9009 for ILP. Along with these new ports, Traefik also ships with its default "web" and "websecure" ports that route HTTP and HTTPS traffic respectively. But this is already preconfigured for you in the Helm chart, so there's no extra work involved.

So now that we have our ports exposed, lets figure out how to route traffic from them to our QuestDB instance running inside the cluster.

IngressRoutes

While you can use traditional k8s Ingresses to configure external access to cluster resources, Traefik v2 also includes new, more flexible types of ingress that coordinate directly with the Traefik deployment. These can be configured by using Traefik-specific Custom Resources, which allow users to specify cluster ingress routes using Traefik's custom routing rules instead of the standard URI-based routing traditionally found in k8s. Using these rules, you can route requests not just based on hostnames and paths, but also by request headers, querystrings, and source IPs, with regex matching support for many of these options. This unlocks significantly more flexibility when routing traffic into your cluster as opposed to using a standard k8s ingress.

Here's an example of an IngressRoute that I use to expose the QuestDB web console.

---
kind: IngressRoute
metadata:
  name: questdb
  namespace: questdb
spec:
  entryPoints:
    - web
  routes:
    - kind: Rule
      match: Host(`my-hostname`)
      services:
        - kind: Service
          name: questdb
          port: 9000

This CRD maps the QuestDB service port 9000 to host port 80 through the web entrypoint, which as I mentioned above, comes pre-installed in the Traefik Helm chart. With this config, I can now access the console by navigating to http://my-hostname/ in my web browser and Traefik will route the request to my QuestDB HTTP service.

IngressRouteTCP

It's important to note that IngressRoutes are used solely for HTTP ingress routing. Raw TCP routing is done using IngressRouteTCPs, and there are also IngressRouteUDPs available for you to use as well.

To support the TCP-based ILP and Pgwire protocols, I created 2 IngressRouteTCP resources to handle traffic on host ports 9009 and 8812.

---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
  name: questdb-ilp
  namespace: questdb
spec:
  entryPoints:
    - ilp
  routes:
    - match: HostSNI(`*`)
      services:
        - name: questdb
          port: 9009
          terminationDelay: -1
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
  name: questdb-psql
  namespace: questdb
spec:
  entryPoints:
    - psql
  routes:
    - match: HostSNI(`*`)
      services:
        - name: questdb
          port: 8812

Here, we use the TCP-specific HostSNI matcher to route all node traffic on ports 9009 (ilp) and 8812 (psql) to the questdb service. The ilp and psql entrypoints correspond to the ilp and psql ports that we exposed in the traefik-config.yaml.

Now we're able to access QuestDB on all 3 supported ports on my k3s node from the rest of my home network.

Conclusion

I hope this example gives you a bit more confidence when configuring Traefik on a single-node k3s "cluster". It may be a bit confusing at first, but once you have the basics down, you'll be exposing all of your hosted services in no time!

Since we're only working with one node, there's a limited amount of routing flexibility that I could include in the Traefik CRD configurations. In a more complex environment, you can really go wild with all of the possibilities that are provided by Traefik routing rules! Still, the benefit of this simple example is that you can learn the basics in a small and controlled environment, and then add complexity once it's needed.

Remember, if you're using k3s clustering mode and running multiple nodes, you'll need to route traffic using an external load balancer like Haproxy. Maybe I'll play around with this if I ever pick up a second M900 or other mini-desktop. But until then, I'll stick with this ingress configuration.

How I Built a Terminal UI to Display K8s Operator Metrics

Steven Sklar — Thu, 07 Dec 2023 02:40:26 +0000

Originally posted on my blog

As you hopefully know from reading some of my recent articles, I really enjoy writing Kubernetes operators! I've written a bit about them on this blog, and also just recently presented a talk about how to build them at the inaugural FOSSY conference in Portland, Oregon. So here goes another operator-related article, although this one is about one of the later stages of operator development: performance tuning.

Since I've been working on an operator that includes multiple controllers, each with 4 reconcilers running concurrently that make API calls to resources outside of the cluster, I wanted to start profiling my reconciliation loop performance to ensure that I could handle real-world throughput requirements.

Prometheus Metrics

Kubebuilder offers the ability to export operator-specific Prometheus metrics as part of its default scaffolding. This provides a great way to profile and monitor your operator performance under both testing and live conditions.

Some of these metrics (many on a per-controller basis) include:

A histogram of reconcile times
Number of active workers
Workqueue depth
Total reconcile time
Total reconcile errors
API server request method & response counts
Process memory usage & GC performance

I wanted a way to monitor these metrics for controllers that are under development on my local machine. This means that they are running as processes outside of any cluster context, which makes it difficult to scrape their exposed metrics using a standard prometheus/grafana setup.

operator-prom-metrics-viewer

So I thought, why not craft a little GUI to display this information? Inspired by the absolutely incredible k9s project, I decided to make a terminal-driven GUI using the tview library.

Since I'm not (yet) storing this information for further analysis, I decided to only display the latest scraped data point for each metric. But I also didn't want to implement my own prometheus metric parser to do this, so I imported pieces of the prometheus scrape library itself to handle the metric parsing. All I had to do was implement a few interfaces for storing and querying my metrics to interop with the scrape library.

Prometheus in-memory, transient storage

The prometheus storage and query interfaces are relatively simple, and it helps that we actually don't need to implement all of their methods!

Storage

The underlying storage mechanism of my InMemoryMetricStorage class is a simple map along with a mutex for locking it during multithreaded IO. In this map, we are only storing the latest value for each metric, so there's no problem overwriting a key's value if it already exists. And we also don't need to worry about out-of-order writes or any other time-series database problems.

type InMemoryAppender struct {
    data map[uint64]DataPoint
    mu   *sync.Mutex
}

The DataPoint struct maps directly to a prometheus metric's structure. If you've worked with prometheus metrics before, this should look pretty familiar to you.

type DataPoint struct {
    Labels    labels.Labels
    Timestamp int64
    Value     float64
}

To populate each DataPoint's uint64 key in the map, we can simply use the prometheus builtin labels.Hash() function to generate a unique key for each DataPoint, without having to do any extra work on our end.

To demonstrate, here's an example of a prometheus metric:

# HELP http_requests_total Total number of http api requests
# TYPE http_requests_total counter
http_requests_total{api="add_product"} 4633433

And how it would be represented by the DataPoint struct:

d := DataPoint{
    Labels: []Label{
        {Name: "api", Value: "add_product"},
    },
    Value: 4633433,
    Timestamp: time.Now(),
}

Now that we have our data model, we need a way to store DataPoints for further retrieval.

Appender

Here's the interface of the storage backend appender:

type Appender interface {
    Append(ref SeriesRef, l labels.Labels, t int64, v float64) (SeriesRef, error)
    Commit() error
    Rollback() error

    ExemplarAppender
    HistogramAppender
    MetadataUpdater
}

Because we're building a simple in-memory tool (with no persistence or availability guarantees), we can ignore the Commit() and Rollback() methods (by turning them into no-ops). Furthermore, we can avoid implementing the bottom three interfaces, since the metrics that are exposed by kubebuilder are not exemplars, metadata, or histograms (the framework actually uses gauges to represent the histogram values, more on this below!). So this only leaves the Append() function to implement, which just writes to the InMemoryAppender's underlying map storage in a threadsafe manner.

func (a *InMemoryAppender) Append(ref storage.SeriesRef, l labels.Labels, t int64, v float64) (storage.SeriesRef, error) {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.data[l.Hash()] = DataPoint{Labels: l, Timestamp: t, Value: v}
    return ref, nil
}

Querier

The querier implementation is also fairly straightforward. We just need to implement a single Query method against our datastore to extract metrics from our store.

My solution is far from an optimized inner-loop; it has awful exponential performance in the worst case. But for looping through a few metrics at a rate of at most once-per-second, I'm not overly concerned with the time-complexity of this function.

func (a *InMemoryAppender) Query(metric string, l labels.Labels) []DataPoint {
    a.mu.Lock()
    defer a.mu.Unlock()

    dataToReturn := []DataPoint{}

    for _, d := range a.data {
        if d.Labels.Get("__name__") != metric {
            continue
        }

        var isMatch bool = true
        for _, label := range l {
            if d.Labels.Get(label.Name) != label.Value {
                isMatch = false
            }
        }

        if isMatch {
            dataToReturn = append(dataToReturn, d)
        }

    }

    return dataToReturn
}

Histogram implementation

Since the controller reconcile time histogram data is stored in prometheus gauges, I did need to write some additional code to transform these into a workable schema to actually generate a histogram. To aid in the construction of the histogram buckets, I created 2 structs, one to represent the histogram itself, and the other to represent each of its buckets.

type HistogramData struct {
    buckets []Bucket
    curIdx  int
}

type Bucket struct {
    Label string
    Value int
}

For each controller, we can run a query against our storage backend for the workqueue_queue_duration_seconds_bucket metric with the name=<controller-name> label. This will return all histogram buckets for a specific controller, each with a separate label (le) that denotes the bucket's limit. We can iterate through these metrics to create Bucket objects, append them to the HistogramData.buckets slice, and eventually sort them before rendering the histogram to keep the display consistent and in the correct order.

The curIdx field on the HistogramData struct is only used internally when rendering the histogram, to keep track of the current index of the buckets slice that we're on. This way, we can consume buckets using a for loop and exit when we return the final bucket. It's a bit clunky, but it works!

v1.0

After all of this hacking, I ended up with something that looks like this:

It's far from perfect; since I don't manipulate the histogram values (yet) or have any display scrolling, it's very likely that the histogram will spill off the right of the screen. The UX can also be improved, since you can only switch the controller that is being displayed by pressing the Up and Down buttons.

But it's been a fantastic tool for me when debugging slow reconciles due to API rate limiting, and has also highlighted errors for me when I miss them in the logs. Give it a whirl and let me know what you think!

Link to the Github: https://github.com/sklarsa/operator-prom-metrics-viewer

Annotations in Kubernetes Operator Design

Steven Sklar — Sat, 25 Nov 2023 23:22:16 +0000

Originally published on my blog

It seems that annotations are everywhere in the Kubernetes (k8s) ecosystem. Ingress controllers, cloud providers, and operators of all kinds use the metadata stored in annotations to perform targeted actions inside of a cluster. So how can we leverage these when developing a new k8s operator?

To the Docs

Despite their widespread use, the official documentation of annotations is actually quite brief. In fact, it only takes two short sentences at the top of the page to define an annotation:

You can use Kubernetes annotations to attach arbitrary non-identifying metadata to objects. Clients such as tools and libraries can retrieve this metadata.

While technically accurate, this definition is still pretty vague and not entirely helpful.

The docs expand on this by providing a few examples of the types of metadata that can be stored in an annotation. But these samples range from build information all the way to individuals' "phone or pager numbers" (who still carries a pager these days anyway?).

Somewhere within their ambiguity lies the true power of k8s annotations; they grant the ability to tag any cluster resource with structured data in almost any format. It's like having a dedicated key-value store attached to every resource in your cluster! So how can we harness this power in an operator?

In this post, I will detail a way in which I recently used annotations while writing an operator for my company's product, QuestDB. Hopefully this will give you an idea of how you can incorporate annotations into your own operators to harness their full potential.

Background

The operator that I've been working on is designed to manage the full lifecycle of a QuestDB database instance, including version and hardware upgrades, config changes, backups, and (eventually) recovery from node failure. I used the Operator SDK and kubebuilder frameworks to provide scaffolding and API support.

It always comes back to a JWK

In order to take advantage of the database's many performance optimizations (such as importing over 300k rows/sec with io_uring), we recommend that users ingest data over InfluxDB Line Protocol. One of the features that we offer, which is not part of the original protocol, is authentication over TCP using a JSON Web Key (JWK).

This feature can be configured in a file that is referenced by the main server config on launch. You just need to add your JWK's key id and public data to the file in this format:

testUser1 ec-p-256-sha256 fLKYEaoEb9lrn3nkwLDA-M_xnuFOdSt9y0Z7_vWSHLU Dt5tbS1dEDMSYfym3fgMv0B99szno-dFc1rYF9t0aac
# [key/user id] [key type] {keyX keyY}

Let's say that you have your private key stored elsewhere in a k8s cluster as a Secret, so your client application can securely push data to your QuestDB instance. The JWK secret data would look something like this:

{
  "kty": "EC",
  "d": "5UjEMuA0Pj5pjK8a-fa24dyIf-Es5mYny3oE_Wmus48",
  "crv": "P-256",
  "kid": "testUser1",
  "x": "fLKYEaoEb9lrn3nkwLDA-M_xnuFOdSt9y0Z7_vWSHLU",
  "y": "Dt5tbS1dEDMSYfym3fgMv0B99szno-dFc1rYF9t0aac"
}

When a user creates a QuestDB Custom Resource (CR) in the cluster, we want to be able to point our operator to this private key and reformat the public values ("kid", "x", and "y") so that it can create a valid auth.conf ConfigMap value to mount to the Pod running our QuestDB instance. The operator can then add line.tcp.auth.db.path=auth.conf to the main server config to make it aware of the new authentication file, and the client application can communicate to QuestDB securely over ILP using the private key.

How can we let the operator know which Secret to use?

Using the Spec

One approach is to simply create a field on the QuestDB Custom Resource:

type QuestDBSpec struct {
    ...
    IlpSecretName      string `json:"ilpSecretName,omitempty"`
    IlpSecretNamespace string `json:"ilpSecretNamespace,omitempty"`
    ...
}

With these fields, a user can now set their values to the name and namespace of the secret that contains the JWK's private key, like so:

apiVersion: crd.questdb/v1
kind: QuestDB
...
spec:
  ilpSecretName: my-private-key
  ilpSecretNamespace: default

After applying the above yaml to the cluster, the operator will kick off a reconciliation loop of the newly created (or updated) QuestDB CR. Inside this loop, the operator will query the k8s API for the Secret default/my-private-key, obtain the "kid", "x", and "y" values from the Secret's data, modify the ConfigMap that is holding the QuestDB configuration, and continue the process as described above.

Even though this technically works, the approach is fairly naive and can lead to some issues down the line. For example, if you want to rotate your JWK, how will the operator know to update the public key in the QuestDB auth ConfigMap? Or, what will happen if the secret does not even exist? Let's use some kubebuilder primitives to help answer these questions and improve the solution.

Kubebuilder Watches

Kubebuilder has built-in support for watching resources that are managed both by the operator and also externally by another component. A watch is a function that registers the controller with the k8s API server, so that the controller is notified when a "watched" resource has changed. This allows the operator to kick off a reconciliation loop against the changed object, to ensure that the actual resource spec matches the desired spec (through operator's custom logic).

Using kubebuilder, resource watches can be configured in a function:

func (r *QuestDBReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&questdbv1.QuestDB{}).
        Owns(&corev1.ConfigMap{}).
        Watches(
            &source.Kind{Type: &corev1.Secret{}},
            handler.EnqueueRequestsFromMapFunc(r.secretToQuestDB),
            builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}),
        ).
        Complete(r)
}

In this function, we register our reconciler with a controller manager and set up 3 different types of watches:

For(&questdbv1.QuestDB{}) instructs the manager that the controller's primary managed resource is a questdbv1.QuestDB. This watch function registers the manager with the k8s API so it will be notified about any changes that happen to a QuestDB CR. When a change has been identified, the manager will kick off a reconcile of that object, calling the QuestDBReconciler.Reconcile() function to migrate the resource status to its desired state. Only one For clause can be used when registering a new controller, which goes along hand-in-hand with the recommendation that a controller should be responsible for a single CR.
Owns(&corev1.ConfigMap{}) will kick off a reconcile of any QuestDB CR when a ConfigMap that is owned by a QuestDB changes. To own an object, you can use the controllerutil.SetControllerReference function to create a parent-child relationship between the QuestDB parent and ConfigMap child. So then, changes to that ConfigMap will trigger a reconcile of the parent QuestDB in the controller.
Based on the function signature alone, the Watches block is clearly very different than the previous types. In this case, we are listening for changes to any corev1.Secret inside the entire cluster, regardless of ownership constraints. The watch is also set up with a specific predicate to filter out some events (predicate.ResourceVersionChangedPredicate). This predicate will match cluster events when a Secret's version is incremented (as the result of a Spec or Status change). So when a corev1.Secret change is found anywhere in the cluster, the manager will run the secretToQuestDB function to map that Secret to zero-or-more QuestDB NamespacedName references, based on its characteristics.

Below, we will use this function to update a QuestDB's config if a JWK value has changed. To do this, we need to map from a Secret to any QuestDBs that are using that Secret's value for ILP authentication.

Let's take a deeper look at this mapper function to see how to accomplish this.

EnqueueRequestsFromMapFunc

The sigs.k8s.io/controller-runtime package defines a MapFunc that is an input to the Watches function:

type MapFunc func(client.Object) []reconcile.Request

This function accepts a generic API object and returns a list of reconcile requests, which are simple wrappers on top of namespaced names (usually seen in the form "namespace/name"):

type Request struct {
  // NamespacedName is the name and namespace of the object to reconcile.
  types.NamespacedName
}

So how can we turn a generic client.Object (that is a generic abstraction on top of a Secret) into the name and namespace of a QuestDB object that we want to reconcile?

There are many possible answers to this question!

One idea is to create a naming convention that somehow encodes the name and namespace of the target QuestDB into the Secret's name, so we could use the client.Object.GetName() and client.Object.GetNamespace() to build a NamespacedName to reconcile. Perhaps something like questdb-${DB_NAME}-ilp. But this would limit what we could name Secrets, which might not interop well if something like external secrets controller is syncing the secret from an external source like Vault. Or if a developer simply forgets the naming convention, and needs to debug why their QuestDB's ILP auth isn't working.

Maybe we could reuse the IlpSecretName and IlpSecretNamespace spec fields from the previous section? We could query for a QuestDB that has a Spec.IlpSecretName == client.Object.GetName() (and likewise for namespace) inside our mapper function. But this doesn't work for a few reasons.

The first is that you are unable to use field selectors with CRDs, so this query is literally impossible in the current version of k8s!

Secondly, lets say you try to bypass this restriction by storing the secret name on the QuestDB object in something that could queried against, like resource labels. Since the function only accepts a client.Object and does not return an error along with its []reconcile.Request, there's no clean place to instantiate a new client inside a MapFunc. To do that, you would need a cancelable context and a standardized way to handle API errors. You can create all of this inside a MapFunc, but you wouldn't be able to use the rest of kubebuilder's built-in error handling capabilities and its context that is attached to every other API request in the system. So based on the signature of MapFunc, it's clear that the designers don't want you making any queries inside of them!

Then how can we only use the data found in the client.Object to create a list of QuestDBs to reconcile?

Annotations to the rescue!

To solve this issue, I decided to create a new annotation: "crd.questdb.io/name". This annotation will be attached to a Secret and points to the name of the QuestDB CR that will use its data to construct an ILP auth config file. For simplicity, I will assume that the Secret will only be used by a single QuestDB, and that both the Secret and QuestDB will reside in the same namespace.

This allows us to create a very simple mapper function that looks something like this:

func CheckSecretForQdbs(obj client.Object) []reconcile.Request {

  var (
    requests = []reconcile.Request{}
  )

  // Exit if the object is not a Secret
  if _, ok := obj.(*v1core.Secret); !ok {
    return requests
  }

  // Extract the target QuestDB from the annotation
  qdbName, ok := obj.GetAnnotations()["crd.questdb.io/name"]
  if !ok {
    return requests
  }

  requests = append(requests, reconcile.Request{
    NamespacedName: client.ObjectKey{
      Name:      qdbName,
      // The Secret and QuestDB must reside in
      // the same namespace for this to work
      Namespace: obj.GetNamespace(),
    },
  })

  return requests

}

Reconciliation logic

But we're not done yet! The controller still needs to find this Secret and use its data to construct the auth config.

Inside our QuestDB reconciliation loop, we can query for all Secrets in a QuestDB's namespace and iterate over them until we find the one we're looking for, based on our new annotation. Here's a small code sample of that, without any additional error-checking.

func (r *QuestDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
  q := &questdbv1.QuestDB{}

  // Assumes that the QuestDB exists (for simplicity)
  err := r.Get(ctx, req.NamespacedName, q)
  if err != nil {
    return ctrl.Result{}, err
  }

  allSecrets := &v1core.SecretList{}
  authSecret := v1core.Secret{}

  // Get a list of all secrets in the namespace
  if err := r.List(ctx, allSecrets, client.InNamespace(q.Namespace)); err != nil {
    return nil, err
  }

  // Iterate over them to find the secret with the desired annotation
  for _, secret := range allSecrets.Items {
    if secret.Annotations["crd.questdb.io/name"] == q.Name {
      authSecret = secret
    }
  }

  if authSecret.Name == "" {
    return errors.New("auth secret not found")
  }

  var (
    x   = authSecret["x"]
    y   = authSecret["y"]
    kid = authSecret["kid"]
  )

  // Construct the ILP auth string to add to the QuestDB config
  var auth string = constructIlpAuthConfig(x, y, kid)

  // Add this auth string to a ConfigMap value and update...
}

As you can see, the new annotation allows us to fully decouple the Secret from the QuestDB operator, since there are no domain-specific naming requirements for the Secret. You don't even need to change the QuestDB CR spec to update the config. All you need to do is add the annotation to any Secret in the QuestDB's namespace, set the value to the name of the QuestDB resource, and the operator will be automatically be notified of the change and update your QuestDB's config to use the Secret's public key data.

Note that this is a golden-path solution; we still need to handle cases where more than 1 Secret has the annotation or a matching Secret does not have the required keys that are needed to generate the JWK public key.

No limits

The beauty of annotations is that you can store anything in them, and with a custom operator, use that data to perform any cluster automation that you can dream of! K8s doesn't even prescribe the format of an annotation's value, as long as it can be represented in a YAML string. This means you can use simple strings, JSON, or even base64-encoded binary blobs as annotation values for an operator to use! Still, since k8s is a young-ish and constantly evolving system, I would probably stick with simple annotation values to abide by KISS as much as possible.

After using annotations in my operator code, I've started to gain more of an appreciation for why the k8s annotation docs are so vague; because they can be used for any custom action, it's not really possible to define all of their capabilities. It's up to the operator developer to use annotations in his or her own way.

I hope this example has sparked some of your own ideas about how to use annotations in your own operators. Let me know if it has!

Hacking in kind (Kubernetes in Docker)

Steven Sklar — Fri, 17 Nov 2023 19:17:15 +0000

Originally published on my blog

How to dynamically add nodes to a kind cluster

Kind allows you to run a Kubernetes cluster inside Docker. This is incredibly useful for developing Helm charts, Operators, or even just testing out different k8s features in a safe way.

I've recently been working on an operator (built using the operator-sdk) that manages cluster node lifecycles. Kind allows you to spin up clusters with multiple nodes, using a Docker container per-node and joining them using a common Docker network. However, the kind executable does not allow you to modify an existing cluster by adding or removing a node.

I wanted to see if this was possible using a simple shell script, and it turns out that it's actually not too difficult!

Creating the node

Using my favorite diff tool, DiffMerge, and docker inspect to compare an existing kind node's state to a new container's, I experimented with various docker run flags until I got something that's close enough to the kind node.

docker run \
--restart on-failure \
-v /lib/modules:/lib/modules:ro \
--privileged \
-h $NODE_NAME \
-d \
--network kind \
--network-alias $NODE_NAME \
--tmpfs /run \
--tmpfs /tmp \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--security-opt label=disable \
-v /var \
--name $NODE_NAME \
--label io.x-k8s.kind.cluster=kind \
--label io.x-k8s.kind.role=worker \
--env KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER \
kindest/node:v1.25.2@sha256:9be91e9e9cdf116809841fc77ebdb8845443c4c72fe5218f3ae9eb57fdb4bace

Joining to the cluster

You can join new nodes to a k8s cluster by using the kubeadm join command. In this case, we can use docker exec to execute this command on our node after its container has started up.

This command won't work out of the box because kind uses a kubeadm.conf that does not exist in the node docker image. It is injected into the container by the kind executable.

Again, using my trusty DiffMerge tool, I compared two /kind/kubeadm.conf files in existing kind nodes and found very few differences. This allowed me to just grab one from any worker node to use as a template.

docker exec --privileged kind-worker cat /kind/kubeadm.conf > $LOCAL_KUBEADM

From here, I needed to set the node's unique IP in its kubeadm.conf. We can use docker inspect to grab any node IP address we need. Since I'm working in bash, I just decided to use a simple sed replacement to replace the template node's IP address with my new node's IP in my local copy of kubeadm.conf.

TEMPLATE_IP=$(docker inspect kind-worker | jq -r '.[0].NetworkSettings.Networks.kind.IPAddress')
NODE_IP=$(docker inspect $NODE_NAME | jq -r '.[0].NetworkSettings.Networks.kind.IPAddress')

ESCAPED_TEMPLATE_IP=$(echo $TEMPLATE_IP | sed 's/\./\\./g' )
ESCAPED_NODE_IP=$(echo $NODE_IP | sed 's/\./\\./g')

sed -i.bkp "s/${ESCAPED_TEMPLATE_IP}/${ESCAPED_NODE_IP}/g" $LOCAL_KUBEADM

Now that our kubeadm.conf is prepared, we need to copy it to the new node:

docker exec --privileged -i $NODE_NAME cp /dev/stdin /kind/kubeadm.conf < $LOCAL_KUBEADM

Finally, we can join our node to the cluster:

docker exec --privileged $NODE_NAME kubeadm join --config /kind/kubeadm.conf --skip-phases=preflight --v=6

Node Tags

Since you have complete control of the new node's kubeadm.conf, it is possible to configure many of its properties for further testing. For example, to add additional labels to the new node, you can run something like this:

sed -i.bkp "s/node-labels: \"\"/node-labels: \"my-label-key=my-label-value\"/g" $LOCAL_KUBEADM

This will add the my-label-key=my-label-value label to the node once it joins the cluster.

Future Work

Based on this script, I believe it's possible to add a kind create node subcommand to add a node to an existing cluster. Stay tuned for that...

How are Kubernetes VolumeAttachments Named?

Steven Sklar — Tue, 29 Aug 2023 15:33:30 +0000

This is a short article with a little k8s persistent storage nugget that I recently uncovered. FWIW, I've only checked this while using the AWS EBS Container Storage Interface (CSI) driver, but the article should apply to any cluster using CSI drivers to mount volumes.

Background

After a PersistentVolume (PV) is created and bound to a Persistent Volume Claim (PVC), a Pod can mount the claim as a volume to access its data. While the Pod is being provisioned, the k8s storage controller creates a VolumeAttachment, which declares its intent to physically attach the volume to the Pod's Node. The storage driver running on the Node will be notified when this VolumeAttachment is created, recognize that it needs to mount a volume to its host, and perform the volume mount, thus making the volume available for the Pod to use.

Lets say that you want to check if a particular volume is actually attached to a Node. For example, maybe you want to ensure that a volume is unmounted before destroying a Node to ensure data integrity of a high-throughput database? So how can you find a PV's corresponding VolumeAttachment to determine if its backing volume is unmounted?

Querying VolumeAttachments

The VolumeAttachment Spec contains nodeName and source fields, which as expected, correspond to the Node's name and PV's name respectively. But these are not valid fieldSelector fields, so you cannot query the API server for a VolumeAttachment with a matching Node or PV name. Instead, you need to LIST all VolumeAttachments in the cluster and iterate over them to find the one that you want. In a large cluster, there could be hundreds (or even thousands) of VolumeAttachments, so this operation could take quite a while to run inside of a controller reconcile loop.

Maybe we could somehow craft a GET request to find a PV's matching VolumeAttachment? Unfortunately, the "name" field, at least when using a supported k8s CSI driver, is a vague string that looks something like csi-e52f481e06b9cf4dc9d95a56d1788026b4707cdce2e105d4444f0be9d6206a09. Where does this name come from? Can we use this to check if a PV is mounted to a particular Node?

VolumeAttachment Naming

After a bit of digging, I found the answer in the k8s csi source code

// getAttachmentName returns csi-<sha256(volName,csiDriverName,NodeName)>
func getAttachmentName(volName, csiDriverName, nodeName string) string {
    result := sha256.Sum256([]byte(fmt.Sprintf("%s%s%s", volName, csiDriverName, nodeName)))
    return fmt.Sprintf("csi-%x", result)
}

The attachment suffix is a sha256 sum of the names of the following components:

The volume handle (part of the CSI Spec)
The driver doing the mounting, and
The target Node

With this discovery, we can now determine if a specific PV is mounted to a Node by using a GET request based on the imputed VolumeAttachment name.

Here's a small function that I wrote to generate a VolumeAttachment name based on a PV and Node:

package csi

import (
    "crypto/sha256"
    "fmt"

    v1core "k8s.io/api/core/v1"
)

func GetVolumeAttachmentName(pv *v1core.PersistentVolume, n *v1core.Node) string {
    if pv.Spec.CSI == nil {
        return ""
    }

    var (
        volName       = pv.Spec.CSI.VolumeHandle
        csiDriverName = pv.Spec.CSI.Driver
        nodeName      = n.Name
    )

    result := sha256.Sum256([]byte(fmt.Sprintf("%s%s%s", volName, csiDriverName, nodeName)))
    return fmt.Sprintf("csi-%x", result)
}

This function accepts a PV and a Node, and sha256sums pieces of their metadata to construct a valid VolumeAttachment name. Now, you can now use this function to generate an input to check if a PV is mounted to a Node, all in a single API request!


func IsPvAttached(pv *v1core.PersistentVolume, n *v1core.Node) bool {
    va := &v1storage.VolumeAttachment{}
    return !apierrors.IsNotFound(
        client.Get(
            ctx,
            types.NamespacedName{
                Name: GetVolumeAttachmentName(pv, n),
            },
            va,
        ),
    )
}

The storage controller will delete a VolumeAttachment when a PV is unmounted from a Node, so we can simply check for the existence of a VolumeAttachment with the PV's name & driver and the Node's name by using our GetVolumeAttachmentName function from above. If the VolumeAttachment does not exist, we can definitively say that the volume is unmounted from the Node, and safely delete the node without any data corruption on the volume.

Hopefully this demystifies those strange VolumeAttachment names, and unlocks some new ideas when writing your next k8s controller.

Originally posted on my blog: https://sklar.rocks/k8s-volumeattachment-names/

Kubebuilder Tips and Tricks

Steven Sklar — Tue, 22 Aug 2023 14:23:58 +0000

Recently, I've been spending a lot of time writing a Kubernetes operator using the go operator-sdk, which is built on top of the Kubebuilder framework. This is a list of a few tips and tricks that I've compiled over the past few months working with these frameworks.

Log Formatting

Kubebuilder, like much of the k8s ecosystem, utilizes zap for logging. Out of the box, the Kubebuilder zap configuration outputs a timestamp for each log, which gets formatted using scientific notation. This makes it difficult for me to read the time of an event just by glancing at it. Personally, I prefer ISO 8601, so let's change it!

In your scaffolding's main.go, you can configure your current logger format by modifying the zap.Options struct and calling ctrl.SetLogger.

opts := zap.Options{
    Development: true,
    TimeEncoder: zapcore.ISO8601TimeEncoder,
}

ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

In this case, I added the zapcore.ISO8601TimeEncoder, which encodes timestamps to human-readable ISO 8601-formatted strings. It took a bit of digging, along with a bit of help from the Kubernetes Slack org, to figure this one out. But it's been a huge quality-of-life improvement when debugging complex reconcile loops, especially in a multithreaded environment.

MaxConcurrentReconciles

Speaking of multithreaded environments, by default, an operator will only run a single reconcile loop per-controller. However, in practice, especially when running a globally-scoped controller, it's useful to run multiple concurrent reconcile loops to simultaneously handle many resource changes at once. Luckily, the Operator SDK makes this incredibly easy with the MaxConcurrentReconciles setting. We can set this up in a new controller's SetupWithManager func:

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {

    return ctrl.NewControllerManagedBy(mgr).
        WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
        ...
        Complete(r)
}

I've created a command line arg in my main.go file that allows the user to set this value to any integer value, since this will likely be a tweaked over time depending on how the controller performs in a production cluster.

Parent-Child Relationships

One of the basic functions of a controller is to act as a parent to Kubernetes resources. This allows the controller to "own"
these objects such that when it is deleted, all child objects are automatically garbage collected by the Kubernetes runtime.

I like this small function that can be called for any client.Object to add a parent reference to the controller
that you're writing.

func (r *CustomReconciler) ownObject(ctx context.Context, cr *myapiv1alpha1.CustomResource, obj client.Object) error {

    err := ctrl.SetControllerReference(cr, obj, r.Scheme)
    if err != nil {
        return err
    }
    return r.Update(ctx, obj)
}

You can then add Owns watches for these resources in your SetupWithManager func. These will instruct your controller to listen for changes in child resources of the specified types, triggering a reconcile loop on each change.

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {

    return ctrl.NewControllerManagedBy(mgr).
        Owns(&v1apps.Deployment{}).
        Owns(&v1core.ConfigMap{}).
        Owns(&v1core.Service{}).
        Complete(r)
}

Watches

Your controller can also watch resources that it doesn't own. This is useful for when you need to watch for changes in globally-scoped resources like PersistentVolumes or Nodes. Here's an example of how you would register this watch in your SetupWithManager func.

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {

    return ctrl.NewControllerManagedBy(mgr).
        Watches(
            &source.Kind{Type: &v1core.Node{}},
            handler.EnqueueRequestsFromMapFunc(myNodeFilterFunc),
            builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}),
        ).
        Complete(r)
}

In this case, you need to implement myNodeFilterFunc to accept
an obj client.Object and return []reconcile.Request. Using the
ResourceVersionChangedPredicate triggers the filter function for every change on that resource type, so it's important to
write your filter function to be as efficient as possible, since there is a chance that it could be called quite a bit, especially
if your controller is globally-scoped.

Field Indexers

One gotcha that I encountered happened when trying to query for a list of Pods that are running on a particular Node. This query uses a FieldSelector filter, as
seen here:

// Get a list of all pods on the node
err := c.List(ctx, &pods, &client.ListOptions{
    Namespace:     "",
    FieldSelector: fields.ParseSelectorOrDie(fmt.Sprintf("spec.nodeName=%s", node.Name)),
})

This codepath led to the following error: Index with name field:spec.nodeName does not exist. After some googling around, I
found this GitHub issue that referenced
a Kubebuilder docs page which contained the answer.

Controllers created using operator-sdk and Kubebuilder use a built-in caching mechanism to store results of API requests. This is to prevent
spamming the K8s API, as well as improve reconciliation performance.

When performing resource lookups using FieldSelectors, you first need to add your desired search field to an index
that the cache can use for lookups. Here's an example that will build this index for a Pod's nodeName:

if err := mgr.GetFieldIndexer().IndexField(context.TODO(), &v1core.Pod{}, "spec.nodeName", func(rawObj client.Object) []string {
    pod := rawObj.(*v1core.Pod)
    return []string{pod.Spec.NodeName}
}); err != nil {
    return err
}

Now, we can run the List function from above with the FieldSelector with no issues.

Retries on Conflicts

If you've ever written controllers, you're probably very familiar with the error Operation cannot be fulfilled on ...: the object has been modified; please apply your changes to the latest version and try again

This occurs when the version of the resource that you're currently reconciling in your controller is out-of-date with what's in latest version of the K8s cluster state. If you're retrying your reconciliation loop on any errors, your controller will eventually reconcile the resource, but this can really pollute your logs and make it difficult to spot more important errors.

After reading through the k8s source, I found the solution to this: RetryOnConflict. It's a utility function in the client-go package that runs a function and automatically retries on conflict, up to a certain point.

Now, you can just wrap your logic inside this function argument, and never have to worry about this issue again! And the added benefit is that you just get to return err instead of return ctrl.Result{}, err, which makes your code that much easier to read.

Useful Kubebuilder Markers

Here are some useful code markers that I've found while developing my operator.

To add custom columns to your custom resource's description (when running kubectl get), you can add annotations to your API object like these:

//+kubebuilder:printcolumn:name="NodeReady",type="boolean",JSONPath=".status.nodeReady"
//+kubebuilder:printcolumn:name="NodeIp",type="string",JSONPath=".status.nodeIp"

To add a shortname to your custom resource (like pvc for PersistentVolumeClaim for example), you can add this annotation:

//+kubebuilder:resource:shortName=mycr;mycrs

More docs on kubebuilder markers can be found here:

https://book.kubebuilder.io/reference/markers/crd.html

Originally published on my blog: https://sklar.rocks/kubebuilder-tips/

We moved our Cloud operations to a Kubernetes Operator

Steven Sklar — Tue, 15 Aug 2023 16:16:22 +0000

Kubernetes operators seem to be everywhere these days. There are hundreds of them available to install in your cluster. Operators manage everything from X.509 certificates to enterprise-y database deployments, service meshes, and monitoring stacks. But why add even more complexity to a system that many say is already convoluted? With an operator controlling cluster resources, if you have a bug, you now have two problems: figuring out which of the myriad of Kubernetes primitives is causing the issue, as well as understanding how that dang operator is contributing to the problem!

So when I proposed the idea of introducing a custom-built operator into our Cloud infrastructure stack, I knew that I had to really sell it as a true benefit to the company, so people wouldn't think that I was spending my time on a piece of RDD vaporware. After the go-ahead to begin the project, I spent my days learning about Kubernetes internals, scouring the internet far and wide to find content about operator development, and writing (and rewriting) the base of what was to become our QuestDB Cloud operator.

After many months of hard work, we've finally deployed this Kubernetes operator into our production Cloud. Building on top of our existing 'one-shot-with-rollback' provisioning model, the operator adds new functionality that allows us to perform complex operations like automatically recovering from node failure, and will also orchestrate upcoming QuestDB Enterprise features such as High Availability and Cold Storage.

If we've done my job correctly, you, as a current or potential end user will continue to enjoy the stability and easy management of your growing QuestDB deployments, without noticing a single difference. This is the story of how we accomplished this engineering feat and why we even did it in the first place.

Original System Architecture

To manage customer instances in our cloud, we use a queue-based system that is able to provision databases across different AWS accounts and regions using a centralized control plane known simply as "the provisioner."

When a user creates, modifies, or deletes a database in the QuestDB Cloud, an application backend server asynchronously sends a message that describes the change to a backend worker, which then appends that message to a Kafka log. Each message contains information about the target database (its unique identifier, resident cluster, and information about its config, runtime, and storage) as well as the specific set of operations that are to be performed.

These instructional messages get picked up by one of many provisioner workers, each of which subscribes to a particular Kafka log partition. Once a provisioner worker receives a message, it is decoded and its instructions are carried out inside the same process. Instructions vary based on the specific operation that needs to be performed. These range from simple tasks like an application-level restart, to more complex ones such as tearing down an underlying database's node if the user wants to pause an active instance to save costs on an unused resource.

Each high-level instruction is broken down into smaller, composable granular tasks. Thus, the provisioner coordinates resources at a relatively low level, explicitly managing k8s primitives like StatefulSets, Deployments, Ingresses, Services, and PersistentVolumes as well as AWS resources like EC2 Instances, Security Groups, Route53 Records, and EBS Volumes.

Provisioning progress is reported back to the application through a separate Kafka queue (we call it the "gossip" queue). Backend workers are listening to this queue for new messages, and update a customer database's status in our backend database when new information has been received.

Each granular operation also has a corresponding rollback procedure in case the provisioner experiences an error while running a set of commands. These rollbacks are designed to leave the target database's infrastructure in a consistent state, although it's important to note that this is not the desired state as intended by the end user's actions. Once an error has been encountered and the rollback executed, the backend will be notified about the original error through the gossip Kafka queue in order to surface any relevant information back to the end user. Our internal monitoring system will also trigger alerts on these errors so an on-call engineer can investigate the issue and kick off the appropriate remediation steps.

While this system has proven to be reliable over the past several months since we first launched our public cloud offering, the framework still leaves room for improvement. Much of our recent database development work around High Availability and Cold Storage requires adding more infrastructure components to our existing system. And adding new components to any distributed system tends to increase the system's overall error rate.

As we continue to add complexity to our infrastructure, there is a risk that the above rollback strategy won't be sufficient enough to recover from new classes of errors. We need a more responsive and dynamic provisioning system that can automatically respond to faults and take proactive remediation action, instead of triggering alerts and potentially waking up one of our globally-distributed engineers.

This is where a Kubernetes operator can pick up the slack and improve our existing infrastructure automation.

Introducing a Kubernetes Operator

Now that stateful workloads have started to mature in the Kubernetes ecosystem, practitioners are writing operators to automate the management of these workloads. Operators that handle full-lifecycle tasks like provisioning, snapshotting, migrating, and upgrading clustered database systems have started to become mainstream across the industry.

These operators include Custom Resource Definitions (CRDs) that define new types of Kubernetes API resources which describe how to manage the intricacies of each particular database system. Custom Resources (CRs) are then created from these CRDs that effectively define the expected state of a particular database resource. CRs are managed by operator controller deployments that orchestrate various Kubernetes primitives to migrate the cluster state to its desired state based on the CR manifests.

By writing our own Kubernetes operator, I believed, we could take advantage of this model by allowing the provisioner to only focus on higher-level tasks, while an operator handles the nitty-gritty details of managing each individual customer database. Since an operator is directly hooked in to the Kubernetes API server through Resource Watches, it can immediately react to changes in cluster state and constantly reconcile cluster resources to achieve the desired state described by a Custom Resource spec. This allows us to automate even more of our infrastructure management, since the control plane is effectively running all of the time, instead of only during explicit provisioning actions.

With an operator deployed in one of our clusters, we can now interact with a customer database by executing API calls and CRUD operations against a single resource, a new Custom Resource Definition that represents an entire QuestDB Cloud deployment. This CRD allows us to maintain a central Kubernetes object that represents the canonical version of the entire state of a customer's QuestDB database deployment in our cluster. It includes fields like the database version, instance type, volume size, server config, and the underlying node image, as well as key pieces of status like a node's IP address and DNS record information. Based on these spec field values, as well as changes to them, the operator performs a continuous reconciliation of all the relevant Kubernetes and AWS primitives to match the expected state of the higher-level QuestDB construct.

New Architecture

This leads us to our new architecture:

While this diagram resembles the old one, there is a key difference here; a QuestDB operator instance is now running inside of each cluster! Instead of the provisioner being responsible for all of the low-level Kubernetes and AWS building blocks, it is now only responsible for modifying a single object: a QuestDB Custom Resource. Any modifications to the state of a QuestDB deployment, large or small, can be made by changing a single manifest that describes the desired state of the entire deployment.

This solution still uses the provisioner, but in a much more limited capacity than before. The provisioner continues to act as the communication link between the application and each tenant cluster, and it is still responsible for coordinating high-level operations. But, it can now delegate low-level database management responsibilities to the operator's constantly-running reconciliation loops.

Operator Benefits

Auto-Healing

Since the operator model is eventually-consistent, we no longer need to rely on rollbacks in the face of errors. Instead, if the operator encounters an error in its reconciliation loop, it re-attempts that reconciliation at a later time, using an exponential backoff by default.

The operator is registered to Resource Watches for all QuestDBs in the cluster, as well as for their subcomponents. So, a change in any of these components, or the QuestDB spec itself, will trigger a reconciliation loop inside the operator. If a component is accidentally mutated or experiences some state drift, the operator will be immediately notified of this change and automatically attempt to bring the rest of the infrastructure to its desired state.

This property is incredibly useful to us, since we provision a dedicated node for each database and use a combination of taints and NodeSelectors to ensure that all of a deployment's Pods are running on this single node. Due to these design choices, we are now exposed to the risk of a node hardware failure that renders it inaccessible. Typical Kubernetes recovery patterns can not save us.

So instead of using an alerting-first remediation paradigm, where on-call engineers would follow a runbook after receiving an alert that a database is down, the operator allows us to write custom node failover logic that is triggered as soon as the cluster is aware of an unresponsive node. Now, our engineers can sleep soundly through the night while the operator automatically performs the remediation steps that they would previously be responsible for.

Simpler Provisioning

Instead of having to coordinate tens of individual objects in a typical operation, the provisioner now only has to modify a single Kubernetes object while executing higher level instructions. This dramatically reduces the code complexity inside of the provisioner codebase and makes it far easier for us to write unit and integration tests. It is also easier for on-call engineers to quickly diagnose and solve issues with customer databases, since all of the relevant information about a specific database is contained inside of a single Kubernetes spec manifest, instead of spread across various resource types inside each namespace. Although, as expressed in the previous point, hopefully there will be fewer problems for our on-call engineers to solve in the first place!

Easier Upgrades

The 15-week Kubernetes release cadence can be a burden for smaller infrastructure teams to keep on top of. This relatively rapid schedule, combined with the short lifetime of each release, means that it is critical for organisations to keep their clusters up-to-date. The importance of this is magnified if you're not running your own control plane, since vendors tend to drop support for old k8s versions fairly quickly.

Since we're running each database on its own node, and there are only two of us to manage the entire QuestDB Cloud infrastructure, we could easily spend most of our time just upgrading customer database nodes to new Kubernetes versions. Instead, our operator now allows us to automatically perform a near-zero-downtime node upgrade with a single field change on our CRD.

Once the operator detects a change in a database's underlying node image, it will handle all of the orchestration around the upgrade, including keeping the customer database online for as long as possible and cleaning up stale resources once the upgrade is complete. With this automation in place, we can now perform partial upgrades against subsets of our infrastructure to test new versions before rolling them out to the entire cluster, and also seamlessly roll back to an older node version if required.

Control Plane Locality

By deploying an operator-per-cluster, we can improve the time it takes for provisioning operations to complete by avoiding most cross-region traffic that takes place from the provisioner to remote tenant clusters. We only need to make a few cross-region API calls to kickstart the provisioning process by performing a CRUD operation on a QuestDB spec, and the local cluster operator will handle the rest. This improves our API call network latency and reliability, reducing the number of retries for any given operation as well as the overall database provisioning time.

Resource Coordination

Now that we have a single Kubernetes resource to represent an entire QuestDB deployment, we can use this resource as a primitive to build even larger abstractions. For example, now we can easily create a massive (but temporary) instance to crunch large amounts of data, save that data to a PersistentVolume in the cluster, then spin up another smaller database to act as the query interface. Due to the inherent composability of Kubernetes resources, the possibilities with a QuestDB Custom Resource are endless!

The DevOps Stuff

Testing the Operator

Before deploying such a significant piece of software in our stack, I wanted to be as confident as possible in its behavior and runtime characteristics before letting it run wild in our clusters. So we wrote and ran a large series of unit and integration tests before we felt confident enough to deploy the new infrastructure to development, and eventually to production.

Unit tests were written against an in-memory Kubernetes API server using the controller-runtime/pkg/envtest library. Envtest allowed us to iterate quickly since we could run tests against a fresh API cluster that started up in around 5 seconds instead of having to spin up a new cluster every time we wanted to run a test suite. Even existing micro-cluster tools like Kind could not get us that level of performance. Since envtest is also not packaged with any controllers, we could also set our test cluster to a specific state and be sure that this state would not be modified unless we explicitly did so in our test code. This allowed us to fully test specific edge-cases without having to worry about control plane-level controllers modifying various objects out from underneath us.

But unit testing is not enough, especially since many of the primitives that we are manipulating are fairly high-level abstractions in and of themselves. So we also spun up a test EKS cluster to run live tests against, built with the same manifests as our development and production environments. This cluster allowed us run tests using live AWS and K8s APIs, and was integral when testing features like node recovery and upgrades.

We were also able to leverage Ginkgo's parallel testing runtime to run our integration tests on multiple concurrent processes. This provided multiple benefits: we could run our entire integration test suite in under 10 minutes and also reuse the same suite to load test the operator in a production-like environment. Using these tests, we were able to identify hot spots in the code that needed further optimization and experimented with ways to save API calls to ease the load on our own Kubernetes API server while also staying under various AWS rate limits. It was only after running these tests over and over again that I felt confident enough to deploy the operator to our dev and prod clusters.

Monitoring

Since we built our operator using the Kubebuilder framework, most standard monitoring tasks were handled for us out-of-the-box. Our operator automatically exposes a rich set of Prometheus metrics that measure reconciliation performance, the number of k8s API calls, workqueue statistics, and memory-related metrics. We we were able to ingest these metrics into pre-built dashboards by leveraging the grafana/v1-alpha plugin, which scaffolds two Grafana dashboards to monitor Operator resource usage and performance. All we had to do was add these to our existing Grafana manifests and we were good to go!

Migrating to the New Paradigm

No article about a large infrastructure change would be complete without a discussion about the migration strategy from old to the new. In our case, we designed the operator to be a drop-in-replacement for the database instances managed by the provisioner. This involved many tedious rounds of diffing yaml manifests to ensure that the resources coordinated by the operator would match the ones that were already in our cluster.

Once this resource parity was reached, we utilized Kubernetes owner-dependent relationships to create an Owner Reference on each sub-resource that pointed to the new QuestDB Custom Resource. Part of the reconciliation loop involves ensuring that these ownership references are in place, and if not, are created automatically. This way, the operator can seamlessly "inherit" legacy resources through its main reconciliation codepath.

We were able to roll this out incrementally due to the guarantee that the operator cannot make any changes to resources that are not owned by a QuestDB CR. Because of this property, we could create a single QuestDB CR and monitor the operator's behavior on only the cluster resources associated with that database. Once everything behaved as expected, we started to roll out additional Custom Resources. This could be easily controlled on the provisioner side; by checking for the existence of a QuestDB CR for a given database. If the CR existed, then we could be sure that we were dealing with an operator-managed instance, otherwise the provisioner would behave as it did before the operator existed.

Conclusion

I'm extremely proud of the hard work that we've put into this project. Starting from an empty git repo, it's been a long journey to get here. There have been several times over the past few months when I questioned if all of this work was even worth the trouble, especially since we already had a provisioning system that was battle-tested in production. But with the support of my colleagues, I powered through to the end. And I can definitely say that it's been worth it, as we've already seen immediate benefits of moving to this new paradigm!

Using our QuestDB CR as a building block has paved the way for new features like adding a large-format bulk CSV upload with our SQL COPY command. The new node upgrade process will save us countless hours with every new k8s release. And even though we haven't experienced a hardware failure yet, I'm confident that our new operator will seamlessly handle that situation as well, without paging up an on-call engineer. Especially since we are a small team, it is critical for us to use automation as a force-multiplier that assists us in managing our Cloud infrastructure, and a Kubernetes operator provides us with next-level automation that was previously unattainable.

The best way to determine the best fit is to make it real with your own data. QuestDB Cloud offers $200 in free credit. Or, if you would rather play around with sample data first, check out our live Web Console.