
Joel Takvorian

Kubernetes operators: avoiding the memory pitfall

Cover image by Blake Patterson - CC BY 2.0 - https://flickr.com/photos/blakespot/6173837649

(Previously, in the tribulations of a Kubernetes operator developer: Kubernetes CRD: the versioning joy)

A while ago, in the NetObserv team, we heard a user complaining about the memory consumption of our operator. Sure, memory and resource footprint is always a concern, but wait, what? Did they really mean the operator? They were probably mistaken: the operator itself doesn't do anything memory-intensive. Or...

What does the operator do, by the way? It is responsible for keeping the other NetObserv components afloat. It reads a config (a custom resource) and makes sure all the underlying components (for us: some eBPF agents, flowlogs-pipeline, and an OpenShift console plugin) are well configured and running according to that global config. To do so, it needs to fetch, create or update a few Kubernetes resources (deployments, configmaps, secrets, etc.), and watch them. There are a few other things, but really nothing that can explain such high memory usage.

And we were told the user had to increase the operator's memory limit to 4 GB. 4 GB. That is fishy.

The problems

Just to clarify: we fixed the problem back in November 2023, last year. For the initial investigation, we asked our friends at the Operator Framework: are we doing anything wrong? It turned out that yes, we had made a couple of wrong assumptions. To begin with, we were given some pointers, such as:

Quoting some excerpts from that last link:

One of the pitfalls that many operators are falling into is that they watch resources with high cardinality like secrets possibly in all namespaces. This has a massive impact on the memory used by the controller on big clusters. Such resources can be filtered by label or fields.

But also:

Requests to a client backed by a filtered cache for objects that do not match the filter will never return anything. In other words, filtered caches make the filtered-out objects invisible to the client.

So yes, that was us:

builder := ctrl.NewControllerManagedBy(mgr).
        For(&flowslatest.FlowCollector{}).
        Owns(&corev1.ConfigMap{}).
        Owns(&appsv1.Deployment{}).
        // etc.

This was the former definition of our controller builder. You can see that we declare "owning" ConfigMaps. Owns is documented as:

// Owns defines types of Objects being *generated* by the ControllerManagedBy, and configures the ControllerManagedBy to respond to
// create / delete / update events by *reconciling the owner object*.
//
// The default behavior reconciles only the first controller-type OwnerReference of the given type.
// Use Owns(object, builder.MatchEveryOwner) to reconcile all owners.
//
// By default, this is the equivalent of calling
// Watches(object, handler.EnqueueRequestForOwner([...], ownerType, OnlyControllerOwner())).

This means that reconcile requests are generated when the watched resources of the given kinds are created/updated/deleted. It doesn't mean that only these owned resources are watched in the underlying informers. In fact, if you don't properly configure the cache, the underlying informers end up watching all resources of the given kinds in the cluster. That was the first wrong assumption we made, about what Owns means. On large clusters with many ConfigMaps or Secrets, this has a massive impact, not only on memory, but also on the bandwidth used with the API server. Other kinds, such as Deployments or DaemonSets, are generally less numerous and less heavy, so it might be OK to keep them globally watched, even though the same problem fundamentally applies.

But there's more.

We noticed that removing the Owns(&corev1.ConfigMap{}). line didn't change anything about the memory consumption. There was something else: we found that we still had all of the cluster's ConfigMaps in memory. Yet we don't declare any watches or informers on ConfigMaps, at least not intentionally. What we do, however, is call things like:

r.Client.Get(ctx,
  types.NamespacedName{Name: "my-cm", Namespace: "my-ns"},
  &configmap,
)

Sounds pretty harmless, doesn't it?

Let's check the doc, from Get in the Reader interface (permalink):

    // Get retrieves an obj for the given object key from the Kubernetes Cluster.
    // obj must be a struct pointer so that obj can be updated with the response
    // returned by the Server.

Okay, nothing fancy here. But just in case: what about the interface implementations? First, it's interesting to note that there are several implementations: client, typed_client, unstructured_client, namespaced_client and metadata_client. The main implementation, in client.go, attempts to read from a cache, or falls back to doing a live query with one of the other implementations (typed, metadata or unstructured client). The namespaced client is a wrapper on top of another client.

The package-level doc mentions that a cache is used, but without giving many details:

// It is a common pattern in Kubernetes to read from a cache and write to the API
// server.  This pattern is covered by the creating the Client with a Cache.

What cache are we talking about? Is it for lazy-loading the requested resources?
Not really: it's again an informer cache. Informers don't load resources lazily: they prefetch everything. So when the first request is made to fetch a resource of a given kind, an informer for that kind is started, filling up with all the cluster data. This is why we still had high memory consumption, and this is our second wrong assumption: that a simple Get couldn't be harmful. I find this quite pernicious, as it's all done very implicitly; it's so easy to shoot yourself in the foot.

Removing this Get call would finally bring the operator's memory footprint way down, from gigabytes to less than 100 MB. OK then, how do we really fix it, while still fetching the resources that we need? There are several options, depending on the use case.

Non-solutions

Regarding the manager cache, you may think that using custom predicates in the controller builder would solve this issue. For instance, writing:

func (r *MyReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&v1.MyResource{}).
        Owns(&corev1.ConfigMap{}, builder.WithPredicates(myPredicate())).
        Complete(r)
}

where myPredicate would narrow down the watched ConfigMaps.
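
For the record, myPredicate could look something like this: a minimal sketch using predicate.NewPredicateFuncs, where the name prefix is a made-up filtering criterion.

import (
    "strings"

    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// Hypothetical predicate: only events for ConfigMaps whose name starts with
// a given prefix are turned into reconcile requests.
func myPredicate() predicate.Predicate {
    return predicate.NewPredicateFuncs(func(o client.Object) bool {
        return strings.HasPrefix(o.GetName(), "my-prefix-")
    })
}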

But it doesn't. These predicates filter which changes generate a reconcile request, but that filtering only happens after the informers have been updated with the created/updated/deleted ConfigMaps. In other words, it avoids triggering unnecessary reconcile loops (thus saving CPU), but it has no impact on what the informers keep in cache under the cover.

The solutions

controller-runtime offers several mitigations for the issue:

1. As mentioned above, this page on good practices provides two examples of restricting the scope of the informers: by filtering cached resources on their labels, or on a field such as the resource name or namespace. Read the cache options documentation for more information. As mentioned, you need to be careful when using this solution, as it will prevent you from accessing resources outside the defined scope. But this is definitely the way to go, if it works for you.
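
As a minimal sketch (assuming controller-runtime v0.15+; the label and namespace values below are made up), restricting the manager cache can look like this:

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/apimachinery/pkg/labels"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/cache"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func newFilteredManager() (ctrl.Manager, error) {
    return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Cache: cache.Options{
            ByObject: map[client.Object]cache.ByObject{
                // Only ConfigMaps carrying this (made-up) label are cached and watched...
                &corev1.ConfigMap{}: {
                    Label: labels.SelectorFromSet(labels.Set{"app": "my-operator"}),
                },
                // ...and Secrets are only cached from a single (made-up) namespace.
                &corev1.Secret{}: {
                    Field: fields.OneTermEqualSelector("metadata.namespace", "my-namespace"),
                },
            },
        },
    })
}

Remember the caveat quoted earlier: with such a setup, a Get on a ConfigMap that doesn't match the label returns a NotFound error, even if the object actually exists in the cluster.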

However, this solution is only useful when you have some control, or prior knowledge, over the resources that you want to get. This is not always the case. For instance, you may expose an API to users allowing them to reference any config map or secret; this is what we do in NetObserv when we need to load certificates for communicating with other systems. In that case, it is not possible to make any assumption about the config map name or namespace, labels, etc., as we only get this information in the reconcile loops, when the manager and the controllers are already started, hence their cache is already configured. We could maybe consider restarting the whole manager and controllers when we detect a change in the required resource names/namespaces, but… meh, that sounds very complicated for an apparently simple problem, doesn't it?

2. Another option is to just disable the cache for some group-version-kinds (GVKs). Here we're talking about the client cache, which is not the same as the controller-runtime manager cache. This is a bit of a brutal solution: caches exist for a good reason; when used correctly, they minimize the traffic with the API server, especially when the fetched resources aren't expected to change often.
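
Here is a minimal sketch of what that can look like (again assuming controller-runtime v0.15+), via the client options passed to the manager:

import (
    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func newManagerWithoutConfigMapCache() (ctrl.Manager, error) {
    return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Client: client.Options{
            Cache: &client.CacheOptions{
                // Get/List on these types bypass the informer cache and always
                // hit the API server directly.
                DisableFor: []client.Object{&corev1.ConfigMap{}, &corev1.Secret{}},
            },
        },
    })
}

For occasional one-off reads, the manager also exposes an uncached reader via mgr.GetAPIReader(), which never starts an informer.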

3. Or we can implement our own cache layer. This is the solution we opted for with NetObserv, given that option 1 didn't work for us and option 2 would lose all the benefits of a cache.

So we have this narrowcache package that:
- provides a client that is a wrapper on top of sigs.k8s.io/controller-runtime/pkg/client
- is configured with an explicitly provided list of GVKs (requests for other GVKs are redirected to the wrapped client)
- also provides a Source interface, allowing it to be used for enqueuing reconcile requests with watches defined on the controllers

The resources of managed GVKs are then lazy-loaded: on first request, they are fetched using a live client, added to a local cache, and a watch is created under the cover to track any change to that specific resource (NOT to the full GVK, like informers do). Subsequent calls just return the cached object.
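
Just to illustrate that last point (this is not the actual narrowcache code, only a sketch of the mechanism, using client-go directly): watching a single named object can be done with a field selector on its name.

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/client-go/kubernetes"
)

// Hypothetical sketch: watch one specific ConfigMap instead of running an
// informer on every ConfigMap in the cluster.
func watchSingleConfigMap(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
    w, err := cs.CoreV1().ConfigMaps(namespace).Watch(ctx, metav1.ListOptions{
        FieldSelector: fields.OneTermEqualSelector("metadata.name", name).String(),
    })
    if err != nil {
        return err
    }
    defer w.Stop()
    for event := range w.ResultChan() {
        // Here, update or invalidate the locally cached object.
        _ = event
    }
    return nil
}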

The narrowcache client is used as follows in NetObserv:

1. In manager initialisation:

func NewManager(
    ctx context.Context,
    kcfg *rest.Config,
    opts *ctrl.Options,
    // ...
) (*Manager, error) {
    // ...
    narrowCache := narrowcache.NewConfig(kcfg, narrowcache.ConfigMaps, narrowcache.Secrets)
    opts.Client = client.Options{Cache: narrowCache.ControllerRuntimeClientCacheOptions()}

    internalManager, err := ctrl.NewManager(kcfg, *opts)
    if err != nil {
        return nil, err
    }
    client, err := narrowCache.CreateClient(internalManager.GetClient())
    if err != nil {
        return nil, fmt.Errorf("unable to create narrow cache client: %w", err)
    }
    // ...

2. Subsequent calls to Get are done transparently, as it implements the client Reader interface.

3. Watching for enqueuing reconcile requests is done as such:

func watch(ctx context.Context, ctrl controller.Controller, cl *narrowcache.Client, obj client.Object) error {
    s, err := cl.GetSource(ctx, obj)
    if err != nil {
        return err
    }
    return ctrl.Watch(
        s,
        handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, o client.Object) []reconcile.Request {
            // enqueuing logic / filtering here
        }),
    )
}

To be honest, this is not a panacea: we're rewriting some logic that also exists in controller-runtime, which is almost certainly better at it than we are (except that it doesn't cover the use case we need). There's also the drawback of having to deal with potential breaking changes in the controller-runtime interfaces, especially around Source watching. I wish it were addressed directly upstream, but this proposal was rejected: it's considered an edge case.

By the way: an edge-case, or a broad one?

I played this little game:

  • Install some operators
  • Monitor memory consumption and ingress traffic on these operators
  • Create many config maps and secrets
  • Rinse and repeat

Of the ~60 random operators I tested, picked from OperatorHub, 13 showed a memory increase and a traffic spike to the API server, correlated with the unrelated resources that I created. This is far from negligible. (I'm contacting the authors to let them know; no intent to shame of course, it's so easy to get trapped.) You can also play this little game yourself, with the operators that you're using. Of course, there is a small chance that this is done on purpose, i.e. some operators may actually need to watch all ConfigMaps or Secrets in the cluster for their normal operation, but I bet this would be a very small minority, if any.

To create many config maps and secrets, simply run:

kubectl create namespace test
for i in {0..200}; do kubectl create cm test-cm-$i -n test --from-file=./large_file ; done
for i in {0..200}; do kubectl create secret generic test-secret-$i -n test --from-file=key=./large_file ; done

Where large_file is a local file of about 500 KB (for instance).

An operator is considered to have tested positive if its memory metric increases and its network traffic shows a spike during the operation, or negative if these metrics stay flat.

Memory usage and bandwidth during ConfigMap creation
Here we see the tested operator reacting to unrelated ConfigMap creation, with memory increasing from 72 MB to 350 MB, and receive bandwidth showing download spikes above 1 MB/s.

I did not dig deep enough to check whether, for each operator that tested positive, a simple cache configuration would be sufficient. For sure, some of them don't need more than that.

This blog post is my small contribution to raising awareness of the waste of resources still too often seen in the software industry.
