Dev Blog - Celery Operator - Part 1 - Initial Design

#celery #kubernetes #python #go

This is the first part of my development blog on the celery operator. First, a disclaimer: this operator is purely a personal learning project. It has not gone through the official CEP process in the celery community. Please refer to the official operator if you are looking for one with long-term support (LTS).

Background

I have used operators heavily throughout my career. The abstraction they add on top of Kubernetes is so awesome that engineers can easily monitor and scale their applications with kubectl.

At my current company, some tasks are delegated to a task queue application to take load off the web server. In Python, that application is called celery. A ton of apps now support celery as their task queuing backend.

The great thing about celery is the same as with other task queuing services: it provides multiple queues for tasks with different requirements. It also comes with a cron-like scheduler that runs tasks periodically. Together, these give a web server a super handy way to run both on-demand and cron-based tasks.

Issues spotted

In the Kubernetes world, the biggest advantages are scaling and observability.

For example, it is common to combine the horizontal pod autoscaler (HPA) with different affinity and selector rules. High availability and redundancy become super easy to accomplish in this realm.

Object status is also a powerful utility when we deal with custom resource definitions (CRDs). It provides a handy way to list the status of the apps themselves. If you have dealt with operators before, you will know that a well-designed operator can give you a full overview through kubectl alone.

Let's circle back to celery itself. My experience deploying celery onto Kubernetes has been bad, because the app cannot fully utilize the modern infra design of Kubernetes.

For example, celery requires a number of components to provide a fully functioning app stack: a broker, a result backend, workers and a scheduler. At my current company, the helm chart that deploys the celery stack is pretty complex, which hurts the maintainability of our charts.

Observability is also a serious issue in celery. Celery does not come with a tracing or monitoring feature. We need to deploy an additional app called flower to show the status of celery. Unfortunately, one of my colleagues had a really terrible experience with flower... It consumed a myriad of resources on his laptop... Anyway, to get more detailed info, we may need to enter the pods and execute the celery command. This introduces a serious security issue because you have to give engineers exec permission in the cluster. Eventually, this drastically increases my team's workload when dealing with celery.

Last but not least, scaling celery with Kubernetes is terrible. To scale celery against the queue size (of course it should be based on queue size, not the built-in CPU or memory metrics), we need to set up a celery exporter and Prometheus, and then have the Prometheus metrics converted into the Kubernetes-native metrics format, typically through a metrics adapter. Umm... When you see this pipeline, you know how complicated it is... But without scaling, we pay extra to maintain potentially unused instances... What a dilemma!
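To make "scale against the queue size" concrete, here is a minimal sketch of the kind of rule I have in mind. It is not the operator's actual code: the function name, the tasks-per-worker target and the min/max bounds are all assumptions for illustration.

```go
package main

import "fmt"

// desiredReplicas returns how many workers are needed so that each worker
// is responsible for at most tasksPerWorker queued tasks. The result is
// clamped to [minReplicas, maxReplicas]. All names here are hypothetical.
func desiredReplicas(queueLength, tasksPerWorker, minReplicas, maxReplicas int) int {
	if tasksPerWorker <= 0 {
		// No sensible target configured: fall back to the minimum.
		return minReplicas
	}
	// Ceiling division: 45 tasks at 10 per worker needs 5 workers.
	n := (queueLength + tasksPerWorker - 1) / tasksPerWorker
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	for _, q := range []int{0, 45, 500} {
		fmt.Printf("queue=%d -> replicas=%d\n", q, desiredReplicas(q, 10, 1, 8))
	}
}
```

The rule itself is tiny. The painful part in a plain Kubernetes setup is getting the queue length to the component that makes this decision.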

Design of operator

To resolve the issues above, I decided to move to an operator. Originally, I just wanted to use the official one. But I realized that the official one is still at the proof-of-concept (POC) stage and cannot fully support what I want to do.

Therefore, I decided to design and develop my own celery operator!

There are a ton of ways to build a Kubernetes operator. The officially supported celery operator takes a templating approach, similar to helm and kustomize. For my celery operator, I want to try something different: operator SDK with kubebuilder!

Originally, I wanted to create an all-in-one celery CRD. However, after careful thought, I felt that an all-in-one celery CRD would be pretty hard to maintain. Imagine a reconcile function with over 1000 lines. That is definitely not good coding style.

Therefore, I eventually gave each component its own CRD and controller, and use an all-in-one celery CRD to wrap the rest of the CRDs. The resulting structure looks like this.

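// Celery is the all-in-one CRD that wraps the component CRDs.
type Celery struct {
  // Schedulers run the periodic (cron-based) tasks.
  Schedulers []CeleryScheduler
  // Broker is the message broker of the stack.
  Broker CeleryBroker
  // Workers execute the queued tasks.
  Workers []CeleryWorker
}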

Or using graphviz to visualize the relation.

With this design, users can simply pick what fits their own usage. If they want to keep their existing stack and only add new workers, they can create just the worker resources. The code itself is also easier to separate.
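As a rough sketch of how the wrapper can stay thin, the snippet below expands a Celery object into the child resources it would create and then leave to each kind's own controller. The child types are heavily simplified, and the Name and Queue fields as well as the "parent-child" naming scheme are assumptions for illustration, not the operator's real API.

```go
package main

import "fmt"

// Hypothetical, heavily simplified child types. The Name and Queue fields
// are assumptions for illustration only.
type CeleryBroker struct{ Name string }
type CeleryScheduler struct{ Name string }
type CeleryWorker struct {
	Name  string
	Queue string
}

// Celery mirrors the wrapper structure shown earlier, plus an assumed Name.
type Celery struct {
	Name       string
	Schedulers []CeleryScheduler
	Broker     CeleryBroker
	Workers    []CeleryWorker
}

// children lists the child resources as "Kind/name" strings. The wrapper
// only creates them; reconciling each one is the job of that kind's controller.
func children(c Celery) []string {
	out := []string{fmt.Sprintf("CeleryBroker/%s-%s", c.Name, c.Broker.Name)}
	for _, s := range c.Schedulers {
		out = append(out, fmt.Sprintf("CeleryScheduler/%s-%s", c.Name, s.Name))
	}
	for _, w := range c.Workers {
		out = append(out, fmt.Sprintf("CeleryWorker/%s-%s", c.Name, w.Name))
	}
	return out
}

func main() {
	stack := Celery{
		Name:       "demo",
		Broker:     CeleryBroker{Name: "redis"},
		Schedulers: []CeleryScheduler{{Name: "beat"}},
		Workers:    []CeleryWorker{{Name: "default", Queue: "default"}},
	}
	for _, child := range children(stack) {
		fmt.Println(child)
	}
}
```

A user who only needs extra workers would skip the wrapper entirely and create a single worker resource, which is handled by the same worker controller.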

Under the controller, I originally chose deployments to control pod spawning. However, I foresaw that a deployment is possibly not a good fit if I want to control pod spawning with custom metrics, like queue size.

Eventually, I changed the design from deployments to managing pods directly. This was a really tough decision because I needed to implement all the CRUD handling on my own... But that is another story...

For now, the celery operator already implements the create, update and delete methods. The last step before releasing the first alpha version is metric extraction, which will be covered in the next article in this series.
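To give a feel for what "doing the CRUD on my own" involves, here is a minimal sketch of the diffing step a reconcile pass needs once no deployment does it for you: compare the pods that exist with the desired replica count and work out what to create and what to delete. The function, the plan type and the "prefix-index" pod naming are hypothetical and not taken from the operator's code.

```go
package main

import (
	"fmt"
	"sort"
)

// plan describes the pod changes one reconcile pass would make.
type plan struct {
	Create []string
	Delete []string
}

// planPods compares the existing pod names with the desired replica count.
// Pods are assumed to be named "<prefix>-<index>" (a hypothetical scheme).
func planPods(prefix string, desired int, existing []string) plan {
	have := make(map[string]bool)
	for _, name := range existing {
		have[name] = true
	}

	var p plan
	want := make(map[string]bool)
	for i := 0; i < desired; i++ {
		name := fmt.Sprintf("%s-%d", prefix, i)
		want[name] = true
		if !have[name] {
			// Desired but missing: schedule a create.
			p.Create = append(p.Create, name)
		}
	}
	for _, name := range existing {
		if !want[name] {
			// Present but no longer desired: schedule a delete.
			p.Delete = append(p.Delete, name)
		}
	}
	sort.Strings(p.Delete)
	return p
}

func main() {
	p := planPods("worker", 3, []string{"worker-0", "worker-2", "worker-5"})
	fmt.Println("create:", p.Create)
	fmt.Println("delete:", p.Delete)
}
```

A real controller would then apply this plan through the Kubernetes API and handle failures and retries, which is where most of the extra work sits.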

Ref:
celery-operator - https://github.com/RyanSiu1995/celery-operator
