<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joel Takvorian</title>
    <description>The latest articles on DEV Community by Joel Takvorian (@jotak).</description>
    <link>https://dev.to/jotak</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F508598%2F73887a04-fa6f-42d6-bc96-7f57b3301f50.png</url>
      <title>DEV Community: Joel Takvorian</title>
      <link>https://dev.to/jotak</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jotak"/>
    <language>en</language>
    <item>
      <title>Kubernetes operators: avoiding the memory pitfall</title>
      <dc:creator>Joel Takvorian</dc:creator>
      <pubDate>Fri, 19 Jul 2024 08:43:55 +0000</pubDate>
      <link>https://dev.to/jotak/kubernetes-operators-avoiding-the-memory-pitfall-10le</link>
      <guid>https://dev.to/jotak/kubernetes-operators-avoiding-the-memory-pitfall-10le</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;sup&gt;Cover image by Blake Patterson - CC BY 2.0 - &lt;a href="https://flickr.com/photos/blakespot/6173837649" rel="noopener noreferrer"&gt;https://flickr.com/photos/blakespot/6173837649&lt;/a&gt;&lt;/sup&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Previously, in the tribulations of a Kubernetes operator developer: &lt;a href="https://dev.to/jotak/kubernetes-crd-the-versioning-joy-6g0"&gt;Kubernetes CRD: the versioning joy&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A while ago, in the NetObserv team, we heard a user complaining about the memory consumption of our operator. Sure, memory and resource footprint are always a concern, but wait, what? Did they really mean the &lt;strong&gt;operator&lt;/strong&gt;? They must have been mistaken: the operator itself doesn't do anything memory-intensive. Or does it...&lt;/p&gt;

&lt;p&gt;What does the operator do, by the way? It is responsible for keeping the other NetObserv components afloat. It reads a config (a custom resource), and makes sure all the underlying components (for us: some eBPF agents, &lt;code&gt;flowlogs-pipeline&lt;/code&gt;, and an OpenShift console plugin) are well configured and running according to that global config. To do so, it needs to fetch, create, or update a few Kubernetes resources (deployments, config maps, secrets, etc.), and watch them. It does a few other things, but really nothing that could explain such high memory usage.&lt;/p&gt;

&lt;p&gt;And we were told the user had to increase the operator's memory limit to 4 GB. &lt;strong&gt;4 GB&lt;/strong&gt;. That is fishy.&lt;/p&gt;

&lt;h3&gt;The problems&lt;/h3&gt;

&lt;p&gt;Just to clarify: &lt;a href="https://github.com/netobserv/network-observability-operator/pull/476" rel="noopener noreferrer"&gt;we fixed the problem&lt;/a&gt; back in November 2023. For the initial investigation, we asked our friends from the Operator Framework: were we doing anything wrong? It turned out that yes, we had made a couple of wrong assumptions. To begin with, we were given some pointers such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sdk.operatorframework.io/docs/best-practices/managing-resources/#how-to-compute-default-values" rel="noopener noreferrer"&gt;Managing resource requests/limits with operators&lt;/a&gt; (we already did that, this wasn't the root problem here, although it helps avoiding OOM-kills)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://groups.google.com/g/operator-framework/c/AIiDgRPJc00" rel="noopener noreferrer"&gt;This email thread&lt;/a&gt; (getting closer to the problem!)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://master.sdk.operatorframework.io/docs/best-practices/designing-lean-operators/" rel="noopener noreferrer"&gt;Good practices around cache management&lt;/a&gt; (problem found! Solution? Not yet...)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quoting some excerpts from that last link:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One of the pitfalls that many operators are failing into is that they watch resources with high cardinality like secrets possibly in all namespaces. This has a massive impact on the memory used by the controller on big clusters. Such resources can be filtered by label or fields.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But also:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Requests to a client backed by a filtered cache for objects that do not match the filter will never return anything. In other words, filtered caches make the filtered-out objects invisible to the client.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So yes, that was us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewControllerManagedBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mgr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;For&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;flowslatest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FlowCollector&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Owns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;corev1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConfigMap&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Owns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;appsv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deployment&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="c"&gt;// etc.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was the former definition of our controller builder. You can see that we declare "owning" ConfigMaps. &lt;code&gt;Owns&lt;/code&gt; is documented as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Owns defines types of Objects being *generated* by the ControllerManagedBy, and configures the ControllerManagedBy to respond to&lt;/span&gt;
&lt;span class="c"&gt;// create / delete / update events by *reconciling the owner object*.&lt;/span&gt;
&lt;span class="c"&gt;//&lt;/span&gt;
&lt;span class="c"&gt;// The default behavior reconciles only the first controller-type OwnerReference of the given type.&lt;/span&gt;
&lt;span class="c"&gt;// Use Owns(object, builder.MatchEveryOwner) to reconcile all owners.&lt;/span&gt;
&lt;span class="c"&gt;//&lt;/span&gt;
&lt;span class="c"&gt;// By default, this is the equivalent of calling&lt;/span&gt;
&lt;span class="c"&gt;// Watches(object, handler.EnqueueRequestForOwner([...], ownerType, OnlyControllerOwner())).&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that reconcile requests are generated when watched resources of the given kinds are created/updated/deleted. &lt;strong&gt;It doesn't mean that only these owned resources are watched by the underlying informers&lt;/strong&gt;. In fact, if you don't properly configure the cache, the underlying informers end up watching &lt;strong&gt;all&lt;/strong&gt; resources of the given kinds in the cluster. That was our first wrong assumption, about what &lt;code&gt;Owns&lt;/code&gt; means. On large clusters with many ConfigMaps or Secrets, this has a massive impact, not only on memory, but also on bandwidth usage with the API server. Other kinds, such as Deployments or DaemonSets, are generally less numerous and less heavy, so it might be OK to keep them globally watched, even though the same problem fundamentally applies.&lt;/p&gt;

&lt;p&gt;But there's more.&lt;/p&gt;

&lt;p&gt;We noticed that removing the &lt;code&gt;Owns(&amp;amp;corev1.ConfigMap{}).&lt;/code&gt; line didn't change the memory consumption at all. There was something else. We found that we still had all of the cluster's config maps in memory, yet we don't declare any watch or informer on ConfigMaps. At least not intentionally. What we do, however, is call things like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NamespacedName&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"my-cm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s"&gt;"my-ns"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;configmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sounds pretty harmless, doesn't it?&lt;/p&gt;

&lt;p&gt;Let's check the doc, from &lt;code&gt;Get&lt;/code&gt; in the &lt;code&gt;Reader&lt;/code&gt; interface (&lt;a href="https://github.com/kubernetes-sigs/controller-runtime/blob/1ed345090869edc4bd94fe220386cb7fa5df745f/pkg/client/interfaces.go#L50" rel="noopener noreferrer"&gt;permalink&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="c"&gt;// Get retrieves an obj for the given object key from the Kubernetes Cluster.&lt;/span&gt;
    &lt;span class="c"&gt;// obj must be a struct pointer so that obj can be updated with the response&lt;/span&gt;
    &lt;span class="c"&gt;// returned by the Server.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, nothing fancy here. But just in case: what about the interface implementations? First, it's interesting to notice that there are several implementations: from &lt;code&gt;client&lt;/code&gt;, &lt;code&gt;typed_client&lt;/code&gt;, &lt;code&gt;unstructured_client&lt;/code&gt;, &lt;code&gt;namespaced_client&lt;/code&gt;, &lt;code&gt;metadata_client&lt;/code&gt;. The &lt;a href="https://github.com/kubernetes-sigs/controller-runtime/blob/1ed345090869edc4bd94fe220386cb7fa5df745f/pkg/client/client.go#L351" rel="noopener noreferrer"&gt;main implementation&lt;/a&gt;, in &lt;code&gt;client.go&lt;/code&gt;, attempts to read from a cache, or falls back to doing a live query with one of the other implementations (typed, metadata or unstructured client). The namespaced client is a wrapper on top of another client.&lt;/p&gt;

&lt;p&gt;The package-level doc mentions that a cache is used, but without telling much about the details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// It is a common pattern in Kubernetes to read from a cache and write to the API&lt;/span&gt;
&lt;span class="c"&gt;// server.  This pattern is covered by the creating the Client with a Cache.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What cache are we talking about? Is it for lazy-loading the requested resources?&lt;br&gt;
Not really: it's again &lt;a href="https://github.com/kubernetes-sigs/controller-runtime/blob/1ed345090869edc4bd94fe220386cb7fa5df745f/pkg/cache/cache.go#L386-L403" rel="noopener noreferrer"&gt;an informers cache&lt;/a&gt;. Informers don't load resources lazily: they prefetch everything. So when the first request comes in for a resource of a given kind, an informer for that kind is started, filling up with data from the whole cluster. This is why we still had high memory consumption, and it was our second wrong assumption: that a simple &lt;code&gt;Get&lt;/code&gt; could not be harmful. I find this quite pernicious: since it's all done implicitly, it's easy to shoot yourself in the foot.&lt;/p&gt;

&lt;p&gt;Removing this &lt;code&gt;Get&lt;/code&gt; call finally resulted in a much smaller memory footprint for the operator: from gigabytes down to less than 100 MB. OK then, how do we really fix it, while still fetching the resources that we need? There are several options, depending on the use case.&lt;/p&gt;
&lt;h3&gt;Non-solutions&lt;/h3&gt;

&lt;p&gt;Regarding the manager cache, you may think that using custom predicates in the controller builder would solve this issue. For instance, writing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;MyReconciler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;SetupWithManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mgr&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Manager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewControllerManagedBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mgr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;For&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MyResource&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Owns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;corev1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConfigMap&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithPredicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;myPredicate&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where &lt;code&gt;myPredicate&lt;/code&gt; would narrow down the watched config maps.&lt;/p&gt;

&lt;p&gt;But it doesn't. These predicates filter which changes generate a reconcile request, and that filtering only happens &lt;em&gt;after&lt;/em&gt; the informers have been updated with the created/updated/deleted config maps. In other words, it avoids triggering unnecessary reconcile loops (thus saving CPU), but it has no impact on what the informers keep in cache under the covers.&lt;/p&gt;

&lt;h3&gt;The solutions&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;controller-runtime&lt;/code&gt; offers several mitigations for the issue:&lt;/p&gt;

&lt;p&gt;&lt;span&gt;1.&lt;/span&gt; As mentioned above, &lt;a href="https://master.sdk.operatorframework.io/docs/best-practices/designing-lean-operators/" rel="noopener noreferrer"&gt;this page on good practices&lt;/a&gt; provides two examples of restricting the informers' scope: by filtering the cached resources on a label, or on a field such as the resource name or namespace. Read the &lt;a href="https://github.com/kubernetes-sigs/controller-runtime/blob/1ed345090869edc4bd94fe220386cb7fa5df745f/pkg/cache/cache.go#L135" rel="noopener noreferrer"&gt;cache options documentation&lt;/a&gt; for more information. As mentioned, you need to be careful with this solution, as it will prevent you from accessing resources outside the defined scope. But this is definitely the way to go, if it works for you.&lt;/p&gt;
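&lt;p&gt;As an illustration, here is a minimal sketch of such a scope restriction (assuming a recent controller-runtime exposing &lt;code&gt;cache.Options.ByObject&lt;/code&gt;; the label and namespace values are made up):&lt;/p&gt;

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newManager builds a manager whose informers only cache ConfigMaps
// carrying a given label, and Secrets from a single namespace.
func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.ConfigMap{}: {
					Label: labels.SelectorFromSet(labels.Set{"app": "my-operator"}),
				},
				&corev1.Secret{}: {
					Field: fields.OneTermEqualSelector("metadata.namespace", "my-namespace"),
				},
			},
		},
	})
}
```

&lt;p&gt;Remember the caveat quoted earlier: with such a configuration, a &lt;code&gt;Get&lt;/code&gt; on a ConfigMap that doesn't match the label selector returns a not-found error, even if the object exists in the cluster.&lt;/p&gt;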

&lt;p&gt;However, this solution is only useful when you have some control, or prior knowledge, over the resources that you want to fetch. This is not always the case. For instance, you may expose an API allowing users to reference any config map or secret; &lt;a href="https://github.com/netobserv/network-observability-operator/blob/f569dbb5b476578ec0c57284f7f43e1abccfc939/apis/flowcollector/v1beta2/flowcollector_types.go#L932-L948" rel="noopener noreferrer"&gt;this is what we do in NetObserv&lt;/a&gt; when we need to load certificates for communicating with other systems. In that case, it is not possible to make any assumption about the config map's name, namespace, labels, etc., as we only get this information in the reconcile loops, when the manager and the controllers are already started, hence their cache is already configured. We could, maybe, consider restarting the whole manager and controllers when we detect a change in the required resource names/namespaces, but… meh, that sounds very complicated for an apparently simple problem, doesn't it?&lt;/p&gt;

&lt;p&gt;&lt;span&gt;2.&lt;/span&gt; Another option is to simply &lt;a href="https://github.com/kubernetes-sigs/controller-runtime/blob/1ed345090869edc4bd94fe220386cb7fa5df745f/pkg/client/client.go#L82" rel="noopener noreferrer"&gt;disable the cache&lt;/a&gt; for some group-version-kinds (GVKs). Here, we're talking about the client cache, which is not the same as the controller-runtime manager cache. This is a bit of a brutal solution: caches exist for a good reason; used correctly, they minimize traffic with the API server, especially when the fetched resources aren't expected to change often.&lt;/p&gt;
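&lt;p&gt;For instance, opting ConfigMaps and Secrets out of the client cache might look like this (a sketch assuming a controller-runtime version that exposes &lt;code&gt;client.CacheOptions.DisableFor&lt;/code&gt;):&lt;/p&gt;

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newManager builds a manager whose client bypasses the cache for
// ConfigMaps and Secrets: every Get/List on these kinds is a live call
// to the API server, and no informer is started for them.
func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{&corev1.ConfigMap{}, &corev1.Secret{}},
			},
		},
	})
}
```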

&lt;p&gt;&lt;span&gt;3.&lt;/span&gt; Or we can implement our own cache layer. This is the solution we opted for in NetObserv, given that option 1 didn't work for us and option 2 would lose all the benefits of a cache.&lt;/p&gt;

&lt;p&gt;So we have this &lt;a href="https://github.com/netobserv/network-observability-operator/blob/f569dbb5b476578ec0c57284f7f43e1abccfc939/pkg/narrowcache/doc.go" rel="noopener noreferrer"&gt;narrowcache&lt;/a&gt; package that:&lt;br&gt;
      - provides a client that is a wrapper on top of &lt;code&gt;sigs.k8s.io/controller-runtime/pkg/client&lt;/code&gt;&lt;br&gt;
      - is configured with an explicit list of GVKs to manage (requests for other GVKs are redirected to the wrapped client)&lt;br&gt;
      - also provides a &lt;code&gt;Source&lt;/code&gt; interface, allowing it to be used for enqueuing reconcile requests with watches defined on the controllers&lt;/p&gt;

&lt;p&gt;The resources of managed GVKs are then lazy-loaded: on the first request, they are fetched using a live client and added to a local cache, and a watch is created under the covers to track changes to that specific resource (NOT to the full GVK, like informers do). Subsequent calls just return the cached object.&lt;/p&gt;
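&lt;p&gt;To illustrate the lazy-loading idea, here is a toy, dependency-free sketch (this is &lt;em&gt;not&lt;/em&gt; the actual &lt;code&gt;narrowcache&lt;/code&gt; code: the real package fetches through a live Kubernetes client and refreshes or evicts entries from a per-object watch):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// Key identifies a single namespaced object.
type Key struct{ Namespace, Name string }

// LazyCache fetches an object on first request and serves it from memory
// afterwards. A real implementation would also start a watch scoped to that
// single object, to refresh or evict the cached copy when it changes.
type LazyCache struct {
	mu      sync.Mutex
	store   map[Key]string
	fetch   func(Key) (string, error) // stand-in for a live API client
	fetches int                       // counts live calls, for demonstration
}

func NewLazyCache(fetch func(Key) (string, error)) *LazyCache {
	return &LazyCache{store: map[Key]string{}, fetch: fetch}
}

func (c *LazyCache) Get(k Key) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.store[k]; ok {
		return v, nil // cache hit: no API traffic at all
	}
	v, err := c.fetch(k)
	if err != nil {
		return "", err
	}
	c.fetches++
	c.store[k] = v
	return v, nil
}

func main() {
	c := NewLazyCache(func(k Key) (string, error) {
		return "content-of-" + k.Name, nil
	})
	k := Key{Namespace: "my-ns", Name: "my-cm"}
	v1, _ := c.Get(k)
	v2, _ := c.Get(k)
	fmt.Println(v1, v2, c.fetches) // the second Get is served from the cache
}
```

&lt;p&gt;Only the objects actually requested end up in memory, as opposed to an informer prefetching every object of the kind in the cluster.&lt;/p&gt;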

&lt;p&gt;The narrowcache client is used as follows in NetObserv:&lt;/p&gt;

&lt;p&gt;&lt;span&gt;1.&lt;/span&gt; In manager initialisation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;kcfg&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Manager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;narrowCache&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;narrowcache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kcfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrowcache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConfigMaps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrowcache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Secrets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Cache&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;narrowCache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ControllerRuntimeClientCacheOptions&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

    &lt;span class="n"&gt;internalManager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kcfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;narrowCache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;internalManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetClient&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"unable to create narrow cache client: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;span&gt;2.&lt;/span&gt; Subsequent calls to &lt;code&gt;Get&lt;/code&gt; are done transparently, as it implements the client &lt;code&gt;Reader&lt;/code&gt; interface.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;3.&lt;/span&gt; Watching, to enqueue reconcile requests, is done like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;watch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt; &lt;span class="n"&gt;controller&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Controller&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cl&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;narrowcache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;cl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Watch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EnqueueRequestsFromMapFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;reconcile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c"&gt;// enqueuing logic / filtering here&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be honest, this is not a panacea: we're rewriting logic that also exists in controller-runtime, which is almost certainly better at it than we are (except that it doesn't cover the use case we need). There's also the drawback of having to deal with potential breaking changes in the controller-runtime interfaces, especially around &lt;code&gt;Source&lt;/code&gt; watching. I wish this were addressed directly upstream, but &lt;a href="https://github.com/kubernetes-sigs/controller-runtime/issues/2570" rel="noopener noreferrer"&gt;this proposal&lt;/a&gt; was rejected: it's considered an edge case.&lt;/p&gt;

&lt;h3&gt;By the way: an edge case, or a broad one?&lt;/h3&gt;

&lt;p&gt;I played this little game:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install some operators&lt;/li&gt;
&lt;li&gt;Monitor memory consumption and ingress traffic on these operators&lt;/li&gt;
&lt;li&gt;Create many config maps and secrets&lt;/li&gt;
&lt;li&gt;Rinse and repeat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of the ~60 random operators I tested, picked from OperatorHub, 13 showed a memory increase and a traffic spike to the API server correlated with the unrelated resources that I created. This is far from negligible. (I'm contacting the authors to let them know; no intent to shame, of course, it's so easy to get trapped.) You can also play this little game yourself, with the operators that you're using. Of course, there is a small chance that this is done on purpose, i.e. some operators may actually &lt;em&gt;need&lt;/em&gt; to watch all config maps or secrets in the cluster for their normal operation, but I bet this would be a very small minority, if any.&lt;/p&gt;

&lt;p&gt;To create many config maps and secrets, simply run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace &lt;span class="nb"&gt;test
&lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;0..200&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;kubectl create cm test-cm-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--from-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./large_file &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done
for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;0..200&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;kubectl create secret generic test-secret-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--from-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./large_file &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;large_file&lt;/code&gt; is a local file of ~ 500KB (for instance).&lt;/p&gt;
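&lt;p&gt;If you need to generate such a file, one way among others (the exact size doesn't matter much):&lt;/p&gt;

```shell
# generate ~500KB of printable data to use as the ConfigMap/Secret payload
base64 /dev/urandom | head -c 500000 > large_file
```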

&lt;p&gt;An operator tests positive if the memory metric increases and the network shows a spike during the operation, and negative if these metrics stay flat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahmpcfh95vkyepbhdt4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahmpcfh95vkyepbhdt4p.png" alt="Memory usage and bandwidth during ConfigMaps creation" width="800" height="454"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Here we see the tested operator reacting to unrelated ConfigMap creation, with memory increasing from 72MB to 350MB, and receive bandwidth showing spikes above 1MBps downloads.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I did not dig deep enough to check whether, for each operator that tested positive, a simple cache configuration would be sufficient. For sure, some of them don't need more than that.&lt;/p&gt;
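&lt;p&gt;For reference, in a controller-runtime based operator, the usual way to avoid caching every ConfigMap and Secret of the cluster is to narrow the manager's cache with a selector. This is only a hedged configuration sketch: the &lt;code&gt;cache.ByObject&lt;/code&gt; option exists in recent controller-runtime releases (v0.15+), and the &lt;code&gt;app=netobserv&lt;/code&gt; label is a made-up example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Only cache ConfigMaps and Secrets carrying our label, instead of
	// mirroring all of them in the operator's memory.
	sel := labels.SelectorFromSet(labels.Set{"app": "netobserv"})
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&amp;corev1.ConfigMap{}: {Label: sel},
				&amp;corev1.Secret{}:    {Label: sel},
			},
		},
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // register reconcilers and start the manager as usual
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;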

&lt;p&gt;This blog is my small contribution to help raise awareness of the waste of resources still often seen in the software industry.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>softwareengineering</category>
      <category>theycoded</category>
    </item>
    <item>
      <title>Kubernetes CRD: the versioning joy</title>
      <dc:creator>Joel Takvorian</dc:creator>
      <pubDate>Thu, 04 Jul 2024 15:44:09 +0000</pubDate>
      <link>https://dev.to/jotak/kubernetes-crd-the-versioning-joy-6g0</link>
      <guid>https://dev.to/jotak/kubernetes-crd-the-versioning-joy-6g0</guid>
      <description>&lt;p&gt;&lt;em&gt;(The tribulations of a Kubernetes operator developer)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I am a developer of the &lt;a href="https://operatorhub.io/operator/netobserv-operator" rel="noopener noreferrer"&gt;Network Observability operator&lt;/a&gt;, for Kubernetes / OpenShift.&lt;/p&gt;

&lt;p&gt;A few days ago, we released our 1.6 version -- which I hope you will try and appreciate, but this isn't the point here. I want to talk about an issue that was reported to us soon after the release.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F783rekrxmjva1sus3lo3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F783rekrxmjva1sus3lo3.png" alt="OLM Console page in OpenShift showing an error during the operator upgrade" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The error says: risk of data loss updating "flowcollectors.flows.netobserv.io": new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What's that? It was a first for the team. This is an error reported by &lt;a href="https://olm.operatorframework.io/" rel="noopener noreferrer"&gt;OLM&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Investigating
&lt;/h2&gt;

&lt;p&gt;Indeed, we used to serve a &lt;code&gt;v1alpha1&lt;/code&gt; version of our CRD. And indeed, we are now removing it. But we didn't do it abruptly. We thought we followed all the guidelines of an API versioning lifecycle. I think we did, except for one detail.&lt;/p&gt;

&lt;p&gt;Let's rewind and recap the timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;v1alpha1&lt;/code&gt; was the first version, introduced in our operator 1.0&lt;/li&gt;
&lt;li&gt;in 1.2, we introduced a new &lt;code&gt;v1beta1&lt;/code&gt;. It was the new preferred version, but the storage version was still &lt;code&gt;v1alpha1&lt;/code&gt;. Both versions were still served, and a conversion webhook converted between them.&lt;/li&gt;
&lt;li&gt;in 1.3, &lt;code&gt;v1beta1&lt;/code&gt; became the stored version. At this point, after an upgrade, every instance of our resource in &lt;em&gt;etcd&lt;/em&gt; is in version &lt;code&gt;v1beta1&lt;/code&gt;, right? (spoiler: it's more complicated).&lt;/li&gt;
&lt;li&gt;in 1.5 we introduced a &lt;code&gt;v1beta2&lt;/code&gt;, and we flagged &lt;code&gt;v1alpha1&lt;/code&gt; as deprecated.&lt;/li&gt;
&lt;li&gt;in 1.6, we made &lt;code&gt;v1beta2&lt;/code&gt; the storage version and removed &lt;code&gt;v1alpha1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And &lt;strong&gt;BOOM&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;A few users complained about the error message mentioned above:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;risk of data loss updating "flowcollectors.flows.netobserv.io": new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And they are stuck: OLM won't allow them to proceed any further, short of entirely removing the operator and the CRD, then reinstalling.&lt;/p&gt;

&lt;p&gt;In fact, only some early adopters of NetObserv have been seeing this, and we didn't see it when testing the upgrade prior to release. So what happened? I spent the last couple of days trying to clear the fog.&lt;/p&gt;

&lt;p&gt;When users installed an old version (&amp;lt;= 1.2), the CRD kept track of the storage version in its status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get crd flowcollectors.flows.netobserv.io &lt;span class="nt"&gt;-ojsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.status.storedVersions}'&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"v1alpha1"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Later on, when users upgrade to 1.3, the new storage version becomes &lt;code&gt;v1beta1&lt;/code&gt;. So, this is certainly what now appears in the CRD status. This is certainly what now appears in the CRD status? (Padme style)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get crd flowcollectors.flows.netobserv.io &lt;span class="nt"&gt;-ojsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.status.storedVersions}'&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"v1alpha1"&lt;/span&gt;,&lt;span class="s2"&gt;"v1beta1"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why is it keeping &lt;code&gt;v1alpha1&lt;/code&gt;? Oh, I know! Upgrading the operator did not necessarily &lt;em&gt;change&lt;/em&gt; anything in the custom resources. Only resources that changed post-install would have made the &lt;em&gt;apiserver&lt;/em&gt; write them to &lt;em&gt;etcd&lt;/em&gt; in the new storage version; different versions may coexist in &lt;em&gt;etcd&lt;/em&gt;, hence the &lt;code&gt;status.storedVersions&lt;/code&gt; field being an array and not a single string. That makes sense.&lt;/p&gt;

&lt;p&gt;Certainly, I can do some dummy edit of my custom resources to make sure they are in the new storage version: the &lt;em&gt;apiserver&lt;/em&gt; will replace each old object with a new one, written in the updated storage version. Let's do this. Then check again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get crd flowcollectors.flows.netobserv.io &lt;span class="nt"&gt;-ojsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.status.storedVersions}'&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"v1alpha1"&lt;/span&gt;,&lt;span class="s2"&gt;"v1beta1"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hmm...&lt;br&gt;
So, I am now &lt;em&gt;almost&lt;/em&gt; sure I don't have any &lt;code&gt;v1alpha1&lt;/code&gt; remaining in my cluster, but the CRD doesn't tell me that. What I learned is that the CRD status &lt;strong&gt;is not a source of truth&lt;/strong&gt; for what's in &lt;em&gt;etcd&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here's what the doc says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;storedVersions&lt;/code&gt; lists all versions of CustomResources that were ever persisted. Tracking these versions allows a migration path for stored versions in &lt;em&gt;etcd&lt;/em&gt;. The field is mutable so a migration controller can finish a migration to another version (ensuring no old objects are left in storage), and then remove the rest of the versions from this list. Versions may not be removed from &lt;code&gt;spec.versions&lt;/code&gt; while they exist in this list.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But how to ensure no old objects are left in storage? While poking around, I haven't found any simple way to inspect which custom resources are in &lt;em&gt;etcd&lt;/em&gt;, and in which version. It seems like no one in the core kube ecosystem wants to be responsible for that. It is like a black box.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Apiserver&lt;/em&gt;? It deals with incoming requests, but it doesn't actively keep track of what's in &lt;em&gt;etcd&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is actually a metric (gauge) showing which objects the &lt;em&gt;apiserver&lt;/em&gt; stored. It is called &lt;code&gt;apiserver_storage_objects&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh71kkrc3l6d0r8u2r7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh71kkrc3l6d0r8u2r7g.png" alt="Graph showing the Prometheus metric " width="800" height="413"&gt;&lt;/a&gt;&lt;br&gt;
But it tells nothing about the version -- and even if it did, it would probably not be reliable: it's generated from the &lt;em&gt;requests&lt;/em&gt; that the &lt;em&gt;apiserver&lt;/em&gt; handles, not from an active view of what's in &lt;em&gt;etcd&lt;/em&gt;, as far as I understand.&lt;/p&gt;
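&lt;p&gt;If you want to eyeball that gauge without a Prometheus setup, you can scrape the &lt;em&gt;apiserver&lt;/em&gt; metrics endpoint directly; the &lt;code&gt;resource&lt;/code&gt; label value below matches our CRD and is just an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Scrape the apiserver metrics and keep only our CRD's storage gauge
kubectl get --raw /metrics | grep 'apiserver_storage_objects{resource="flowcollectors.flows.netobserv.io"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;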

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;etcd&lt;/em&gt; itself? It is a binary store, it knows nothing about the business meaning of what comes in and out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Not to mention &lt;em&gt;OLM&lt;/em&gt;, which is probably even further from knowing that.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you, reader, can shed some light on how you would do that, i.e. how you would ensure that no deprecated version of a custom resource is still lying around somewhere in a cluster, I would love to hear from you -- don't hesitate to let me know!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Update from October 8th, 2024:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://github.com/gmeghnag/koff" rel="noopener noreferrer"&gt;koff&lt;/a&gt; allows to do so! You first need to dump your &lt;em&gt;etcd&lt;/em&gt; database by &lt;a href="https://docs.openshift.com/container-platform/4.16/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html#backing-up-etcd-data_backup-etcd" rel="noopener noreferrer"&gt;creating a snapshot&lt;/a&gt;. Then you can use &lt;em&gt;koff&lt;/em&gt; to get the versions of your custom resources, for instance:&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;koff use etcd.db
koff get myresource -ojson | jq '.items.[].apiVersion'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's the &lt;a href="https://github.com/etcd-io/etcd/blob/main/etcdctl/README.md" rel="noopener noreferrer"&gt;etcdctl&lt;/a&gt; tool that lets you interact with &lt;em&gt;etcd&lt;/em&gt;, if you know exactly what you're looking for, how it is stored in &lt;em&gt;etcd&lt;/em&gt;, etc. But expecting our users to do this just to upgrade? Meh...&lt;/p&gt;

&lt;h2&gt;
  
  
  Kube Storage Version Migrator
&lt;/h2&gt;

&lt;p&gt;Actually, it turns out the kube community has a go-to option for the whole issue. It's called the &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/" rel="noopener noreferrer"&gt;Kube Storage Version Migrator&lt;/a&gt; (SVM). I guess in some flavours of Kubernetes it might be enabled by default and trigger for any custom resource. In OpenShift, &lt;a href="https://github.com/openshift/cluster-kube-storage-version-migrator-operator?tab=readme-ov-file#kube-storage-version-migrator-operator" rel="noopener noreferrer"&gt;the trigger for automatic migration is not enabled&lt;/a&gt;, so it is up to the operator developers (or the users) to create the migration requests.&lt;/p&gt;

&lt;p&gt;In our case, this is what the migration request looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;migration.k8s.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StorageVersionMigration&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;migrate-flowcollector-v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flows.netobserv.io&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flowcollectors&lt;/span&gt; 
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1alpha1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, SVM just rewrites the custom resources without any modification, making the &lt;em&gt;apiserver&lt;/em&gt; trigger a conversion (possibly via your webhooks, if you have some) and store them in the new storage version.&lt;/p&gt;

&lt;p&gt;To make sure the resources have really been modified, we can check their &lt;code&gt;resourceVersion&lt;/code&gt; before and after applying the &lt;code&gt;StorageVersionMigration&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get flowcollector cluster &lt;span class="nt"&gt;-ojsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.metadata.resourceVersion}'&lt;/span&gt;
53114

&lt;span class="c"&gt;# Apply&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ./migrate-flowcollector-v1alpha1.yaml

&lt;span class="c"&gt;# After&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get flowcollector cluster &lt;span class="nt"&gt;-ojsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.metadata.resourceVersion}'&lt;/span&gt;
55111

&lt;span class="c"&gt;# Did it succeed?&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get storageversionmigration.migration.k8s.io/migrate-flowcollector-v1alpha1 &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;span class="c"&gt;# [...]&lt;/span&gt;
  conditions:
  - lastUpdateTime: &lt;span class="s2"&gt;"2024-07-04T07:53:12Z"&lt;/span&gt;
    status: &lt;span class="s2"&gt;"True"&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;: Succeeded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, all you have to do is trust SVM and the &lt;em&gt;apiserver&lt;/em&gt; to have effectively rewritten every resource from the deprecated version into the new one.&lt;/p&gt;

&lt;p&gt;Unfortunately, we're not entirely done yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get crd flowcollectors.flows.netobserv.io &lt;span class="nt"&gt;-ojsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.status.storedVersions}'&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"v1alpha1"&lt;/span&gt;,&lt;span class="s2"&gt;"v1beta1"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, the CRD status isn't updated: apparently that's not something SVM does for us. So OLM will still block the upgrade. We need to manually edit the CRD status and remove the deprecated version -- now that we're 99.9% sure it's not there (I don't like the remaining 0.1% much).&lt;/p&gt;
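&lt;p&gt;Here is a hedged sketch of that manual edit, assuming &lt;code&gt;jq&lt;/code&gt; is available and kubectl is recent enough (v1.24+) to support &lt;code&gt;--subresource&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Compute the storedVersions list without the deprecated v1alpha1
NEW=$(kubectl get crd flowcollectors.flows.netobserv.io -ojson \
  | jq -c '.status.storedVersions | map(select(. != "v1alpha1"))')
# Rewrite the status subresource with the filtered list
kubectl patch crd flowcollectors.flows.netobserv.io \
  --subresource=status --type=merge \
  -p "{\"status\":{\"storedVersions\":$NEW}}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;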

&lt;h2&gt;
  
  
  Revisited lifecycle
&lt;/h2&gt;

&lt;p&gt;To revisit the versioning timeline, here is what we should have done:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;v1alpha1&lt;/code&gt; was the first version, introduced in our operator 1.0&lt;/li&gt;
&lt;li&gt;in 1.2, we introduced a new &lt;code&gt;v1beta1&lt;/code&gt;. Storage version is still &lt;code&gt;v1alpha1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;in 1.3, &lt;code&gt;v1beta1&lt;/code&gt; becomes the stored version.

&lt;ul&gt;
&lt;li&gt;⚠️ &lt;strong&gt;The operator should check the CRD status and, if needed, create a &lt;code&gt;StorageVersionMigration&lt;/code&gt;, and then update the CRD status to remove the old storage version&lt;/strong&gt; ⚠️&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;in 1.5 &lt;code&gt;v1beta2&lt;/code&gt; is introduced, and we flag &lt;code&gt;v1alpha1&lt;/code&gt; as deprecated&lt;/li&gt;

&lt;li&gt;in 1.6, &lt;code&gt;v1beta2&lt;/code&gt; is the new storage version, and &lt;strong&gt;we run again through the &lt;code&gt;StorageVersionMigration&lt;/code&gt; steps&lt;/strong&gt; (so we're safe when &lt;code&gt;v1beta1&lt;/code&gt; is removed later). We remove &lt;code&gt;v1alpha1&lt;/code&gt;.
&lt;/li&gt;

&lt;li&gt;Everything works like a charm, hopefully.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;As an anecdote, in our case with NetObserv, this whole convoluted scenario probably resulted from a false alarm, the initial OLM error being a false positive: our FlowCollector resource manages workload installation, and it has a status that reports the deployments' readiness. On upgrade, new images are used and pods are redeployed, so the FlowCollector status changes; hence it had to be rewritten in the new storage version, &lt;code&gt;v1beta1&lt;/code&gt;, prior to the removal of the deprecated version. The users who hit this issue could simply have removed &lt;code&gt;v1alpha1&lt;/code&gt; from the CRD status manually, and that's it.&lt;/p&gt;

&lt;p&gt;One could argue that OLM is too conservative here, blocking an upgrade that should pass since all the resources in storage are most likely fine; but in its defense, it probably has no simple way to know that. And ending up with resources made inaccessible in &lt;em&gt;etcd&lt;/em&gt; is certainly a scenario we really don't want to run into. This is something that operator developers have to deal with.&lt;/p&gt;

&lt;p&gt;I hope this article helps others avoid the same mistake. This error is quite tricky to spot, as it can reveal itself long after the fact.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update: examples of implementations have been given in the comments below (thanks Jeeva):&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://github.com/tektoncd/operator/blob/v0.72.0/pkg/reconciler/shared/tektonconfig/upgrade/helper/migrator.go" rel="noopener noreferrer"&gt;migrator from TektonCD&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://github.com/knative/pkg/blob/2783cd8cfad9ba907e6f31cafeef3eb2943424ee/apiextensions/storageversion/migrator.go" rel="noopener noreferrer"&gt;migrator from Knative&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>softwareengineering</category>
      <category>theycoded</category>
    </item>
  </channel>
</rss>
