In this post you’ll learn what Operators and Custom Resource Definitions (CRDs) are, how they work together, their pros and pitfalls, how to scaffold one using tools like Kubebuilder, and how to write your own operator in Go — diving into the reconciliation loop and controller mechanics in practice.
Introduction (let's set the stage)
When you think of Kubernetes, you probably think of Deployments, Services, StatefulSets, etc. But what if you want Kubernetes “to understand” higher-level concepts in your domain (e.g. “a Database cluster”, “a Cache cluster”, “a workflow job”) — and automate not just deployment, but upgrades, backups, self-healing, etc.? That’s where Operators and CRDs come in.
In this article, we’ll start from first principles — what Operators and CRDs are, how they play together — and then go step by step through the process of scaffolding, writing, and understanding the key “reconciliation loop” logic in Go. I’ll also share practical tips, design pitfalls, and trade-offs from real projects. Let’s go.
1. What are Operators and CRDs, and how do they interact?
Custom Resource Definitions (CRDs) — extending the Kubernetes API
At its core, a Custom Resource Definition (CRD) is a way to extend the Kubernetes API with your own new kinds (types).
Kubernetes comes with built-in resource kinds: Pod, Deployment, Service, etc. Each of these has a spec (desired state) and status (observed state) and is served by the Kubernetes API server.
A CRD allows you to add a new kind — for example, MyApp, DatabaseCluster, Cache, MySQLBackup, etc. You define the schema (often via OpenAPI v3 validation), the group/version/kind, and Kubernetes will then allow clients to kubectl apply objects of that new kind (Custom Resources, CRs).
Once your CRD is installed, your cluster effectively “knows about” this new API surface.
So CRD = schema + API registration (i.e. telling Kubernetes: “I have this new type, validate it, store it, serve it”).
But a CRD by itself only gives you a data model — it does nothing automatically. You still need logic to act when CRs are created, updated, or deleted.
Operators — controllers with domain logic
An Operator is the piece that makes your CRD useful. It is (in practice) a Kubernetes controller (a client of the Kubernetes API) that:
Watches events on your custom resources (CRs),
Compares the desired state (in spec) with the current state of the world,
And takes actions (create/update/delete Kubernetes primitives or external resources) so as to converge the system toward the desired state.
Thus, an Operator combines two parts:
CRD — defines the “language” (what attributes the user can express).
Controller / Reconciler logic — the “brain” that watches for changes and enforces them.
The Operator pattern is essentially: “Let me treat my application (or cluster component) as a first-class Kubernetes object; the operator will drive its lifecycle.”
When you write an operator, you typically own the CRD (i.e. your operator is the canonical manager for that CRD). You register the CRD, and then inside the operator you write logic to reconcile every instance of the CRD.
In operation, things go as follows:
- A user (or system) runs kubectl apply with a CR of kind Foo (that your CRD defines).
- The Kubernetes API server stores that CR object (desired state).
- Your operator’s controller sees that new CR (via watch/informer) and triggers a reconcile.
- In Reconcile(), your code reads the CR, inspects existing resources (e.g. Deployments, Services, ConfigMaps), and, if things are missing or wrong, issues requests to the API (create/update/delete) to align them.
- Over time, through repeated reconciliation, the “actual” cluster state is made to match what the CR requests (ideally).
- Optionally, the operator updates the CR’s status subfield to reflect progress, health, or conditions.
One way to think: Kubernetes built-in controllers reconcile built-in kinds (e.g. Deployment reconciles Pods). Your operator reconciles CR kinds into a set of built-in or other CRs that in turn get reconciled.
Hence, CRD + Operator = your extension to Kubernetes behavior — you teach Kubernetes to “understand” your domain.
2. Why use this pattern? Benefits and challenges
Advantages
Using CRDs + Operators yields several compelling benefits:
Declarative, consistent API
Users express what they want (via the CRD spec) and the operator handles how to realize it. That hides complexity and reduces human error.
Day-2 operations automation
Beyond initial deploy (Day 1), operators allow you to automate upgrades, backups, schema migrations, health checks, scaling, rolling restarts, etc.
You codify your “operational knowledge” and embed it.
Self-healing and drift correction
If someone manually fiddles with resources (e.g. deletes a Pod, modifies a ConfigMap), the operator's reconciliation loop can detect drift and restore the correct state.
Domain-aware orchestration
The operator can understand ordering, dependencies, and constraints (e.g. start the DB, wait, then migrate) and enforce complex workflows, something flat YAML can't do reliably.
Simplified user experience
For many users, deploying your app becomes kubectl apply -f myapp.yaml. Under the hood, the operator installs all the needed services, handles upgrades, etc. They don't need to know all the Kubernetes primitives.
Extensibility and composability
You can build operators that interact (watching CRs of other operators), build meta-operators, or chain behavior modularly (though this comes with trade-offs).
Challenges, pitfalls, and caveats
With power comes responsibility. Here are key challenges and trade-offs:
Correctness & idempotency
The reconciliation logic must be idempotent: running it multiple times should not break things or cause oscillations. Mistakes here lead to thrashing, resource conflicts, or stuck states.
Complexity growth
As your domain logic grows (multiple subcomponents, version upgrades, backward compatibility), the operator code can become complex. Structuring it carefully is vital.
Testing and observability burden
You need solid tests (unit, integration) for reconcile logic, error paths, and race conditions. You also need metrics, logs, tracing, health checks, leader election, etc., to operate in a production cluster.
Upgrade path and API versioning
As your CRD evolves, you'll need to support version migrations (v1alpha1 → v1beta1 → v1), conversion, and deprecation. Mistakes here can break existing installations.
Handling external systems and side effects
If your operator talks to external databases, cloud APIs, or non-Kubernetes systems, you must manage eventual consistency, network failures, retries, and backoff. The reconcile loop can't block indefinitely.
Race conditions, concurrency, and resource ownership
You must ensure controllers don’t step on each other’s toes. For example, two operators managing the same CR kind is discouraged.
Handling concurrent reconcile loops safely, avoiding duplicate work, and reconciling in correct order adds complexity.
Operator’s resource consumption & scale
If there are many CR instances or many events, the operator must scale (e.g. concurrency, rate limiting). Also be careful to avoid large list operations in every reconcile.
Drift vs manual override tension
Sometimes users want to override something (tweak a child ConfigMap directly). The operator may revert that on the next reconcile. You may need "ignore diff" or "do not manage this field" features.
Garbage collection and deletion semantics
When a CR is deleted, your operator should clean up owned resources in the right order (especially if there are dependencies). Use ownerReferences and finalizers carefully.
3. Scaffolding CRDs/Operators easily: Kubebuilder and friends
You don’t have to start from scratch. Tools like Kubebuilder, Operator SDK, or controller-runtime scaffolding greatly reduce boilerplate and help you follow best practices.
Here’s a walkthrough of how you’d use Kubebuilder to scaffold your operator + CRD.
Getting started with Kubebuilder
(These are high-level steps; for full detail see the Kubebuilder Book)
Install Kubebuilder
Download the appropriate binaries and put them in your PATH.
Initialize project
kubebuilder init --domain your.domain --repo github.com/you/your-operator
This sets up the project scaffolding: main.go, API directory, controller directory, etc.
Create API + Controller scaffold
kubebuilder create api --group <group> --version <version> --kind <KindName>
This generates:
- api/vX/KindName_types.go (where you define Spec and Status)
- api/vX/KindName_webhook.go (if validation/defaulting is enabled)
- controllers/KindName_controller.go with a stub Reconcile() and SetupWithManager()
- Sample manifest YAMLs under config/samples/
- CRD YAML generation logic under config/crd
Edit Spec/Status & markers
In *_types.go, you annotate fields with markers (// +kubebuilder:validation: etc.) for CRD schema validation, default values, optional fields, and so on.
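For example, a minimal sketch of what filled-in types might look like (the field names here are illustrative, not what the scaffold generates):

// api/v1/mykind_types.go (illustrative)
type MyKindSpec struct {
	// Size is the desired number of replicas.
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:default=1
	Size int32 `json:"size"`

	// Image is the container image to run.
	// +optional
	Image string `json:"image,omitempty"`
}

type MyKindStatus struct {
	// ReadyReplicas mirrors the ready count of the child Deployment.
	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
}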
Implement Reconcile logic
In the controller stub, replace the generated TODO code with your actual logic.
Set up Watches/Ownership
In SetupWithManager(), you wire which resources your controller watches (the primary resource and any secondary ones). E.g.:
func (r *KindReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&yourgroupv1.Kind{}).
Owns(&appsv1.Deployment{}).
Owns(&corev1.Service{}).
Complete(r)
}
This ensures your reconcile loop is triggered when CR changes or when owned resources change.
Generate CRD manifests/controllers
Use make manifests or make install, depending on your scaffold, to generate CRD YAMLs (which include your validation markers).
Build, deploy, test
You build the operator binary (often containerize it), install the CRD in a cluster, deploy the operator, then apply sample CR YAMLs (from config/samples) and see the behavior.
Kubebuilder (and controller-runtime) handles much of the plumbing: caching, informers, client libraries, leader election, default reconcile loop wiring, etc.
Pros of using Kubebuilder:
- You start with solid boilerplate following best practices.
- You get validation/defaulting support, CRD schema generation, versioning support.
- It standardizes how your operator is structured, which helps maintainability.
Caveats:
- The scaffold may not exactly match your domain logic — you’ll adapt.
- For highly custom behavior (multi-CR operators, cross-CR relationships), you’ll need to extend the scaffold.
- Learning the marker syntax, imports, API versioning, etc., has a learning curve.
Once your operator grows, you may want to break large reconcile logic into well-structured domain services, state machines, or sub-reconcilers.
4. Writing your own operator in Go — the Reconciliation Loop in action
Let’s walk through a simplified example operator in Go, focusing on the reconcile loop mechanics. I’ll highlight key patterns and pitfalls.
The skeleton: controller and reconcile stub
After scaffolding, you’ll have something like:
import (
	"context"

	"github.com/go-logr/logr"
	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	mygroupv1 "github.com/you/your-operator/api/v1" // module path from the init step; version is illustrative
)

type MyKindReconciler struct {
	client.Client
	Scheme *runtime.Scheme
	Log    logr.Logger
}
func (r *MyKindReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("MyKind", req.NamespacedName)
	log.Info("reconciling")
// 1. Fetch the Custom Resource
var my mygroupv1.MyKind
if err := r.Get(ctx, req.NamespacedName, &my); err != nil {
if apierrors.IsNotFound(err) {
// CR deleted — cleanup if needed
return ctrl.Result{}, nil
}
return ctrl.Result{}, err
}
// 2. Desired vs actual: examine my.Spec, read existing resources
// e.g. look for Deployment named after the CR or matching labels
// 3. If child Deployment doesn’t exist, create
// Or if exists but spec doesn’t match, update
// 4. Optionally update status: set conditions, phases
// 5. Return result: maybe requeue, or success
return ctrl.Result{}, nil
}
func (r *MyKindReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&mygroupv1.MyKind{}).
Owns(&appsv1.Deployment{}).
Complete(r)
}
Let’s break it down and dive into nuances.
Step-by-step logic and patterns
(a) Fetch the custom resource
This is your starting point. If the CR is not found (deleted), often you simply exit (the ownerReferences + finalizers may handle cleanup).
But note: your reconcile should handle stale events — e.g. events where the CR was deleted before your code saw it. So check IsNotFound carefully.
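If you need explicit cleanup that ownerReferences can't cover (external systems, cloud resources), the usual mechanism is a finalizer. A minimal sketch, assuming a hypothetical finalizer name and cleanup helper:

const myFinalizer = "mygroup.example.com/finalizer" // hypothetical

if my.ObjectMeta.DeletionTimestamp.IsZero() {
	// CR is live: ensure our finalizer is present so deletion waits for us.
	if !controllerutil.ContainsFinalizer(&my, myFinalizer) {
		controllerutil.AddFinalizer(&my, myFinalizer)
		if err := r.Update(ctx, &my); err != nil {
			return ctrl.Result{}, err
		}
	}
} else {
	// CR is being deleted: clean up external state, then release the finalizer.
	if err := r.cleanupExternal(ctx, &my); err != nil { // hypothetical helper
		return ctrl.Result{}, err
	}
	controllerutil.RemoveFinalizer(&my, myFinalizer)
	if err := r.Update(ctx, &my); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}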
(b) Observe existing “child” or managed resources
You might next issue a Get or List to find related resources (Deployments, StatefulSets, Services, Secrets) that you manage and should reflect the CR’s desired spec.
A common pattern is:
found := &appsv1.Deployment{}
err := r.Get(ctx, types.NamespacedName{Namespace: my.Namespace, Name: childName}, found)
if err != nil && apierrors.IsNotFound(err) {
// strictly not found → create new
} else if err != nil {
return ctrl.Result{}, err
}
If found, you compare fields (replica count, container image, env vars, etc.) with what my.Spec asks for. If they differ, you update via r.Update, as in the sketch below.
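A minimal sketch of that comparison, assuming the CR spec carries a Replicas field (illustrative):

// Update only when desired and actual differ; blind updates cause flapping.
if found.Spec.Replicas == nil || *found.Spec.Replicas != my.Spec.Replicas {
	found.Spec.Replicas = &my.Spec.Replicas
	if err := r.Update(ctx, found); err != nil {
		return ctrl.Result{}, err
	}
}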
(c) Set owner references
When creating child resources, use controllerutil.SetControllerReference(&my, child, r.Scheme) so that Kubernetes understands the CR “owns” that child. That enables garbage collection: when the CR is deleted, its owned children go away, too.
This also enables watch events (when child changes) to trigger your reconcile function.
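A sketch of creating a child with an owner reference (childName and the spec-building helper are hypothetical):

dep := &appsv1.Deployment{
	ObjectMeta: metav1.ObjectMeta{Name: childName, Namespace: my.Namespace},
	Spec:       buildDeploymentSpec(&my), // hypothetical helper deriving the spec from the CR
}
// Record the CR as controlling owner: enables garbage collection and
// watch-triggered reconciles when the child changes.
if err := controllerutil.SetControllerReference(&my, dep, r.Scheme); err != nil {
	return ctrl.Result{}, err
}
if err := r.Create(ctx, dep); err != nil {
	return ctrl.Result{}, err
}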
(d) Idempotency
Your code should consider “if exists and is correct, do nothing.” Don’t blindly issue updates unless needed. This avoids infinite loops, API flapping, etc.
Also, your code should gracefully handle partial failures (e.g. child creation succeeded, but status update fails). Ensure no inconsistent state or repeated destructive loops.
(e) Status subresource updates
Often you want to update my.Status to reflect progress, conditions, readiness, errors, etc. For example:
my.Status.ReadyReplicas = found.Status.ReadyReplicas
if err := r.Status().Update(ctx, &my); err != nil {
return ctrl.Result{}, err
}
Use r.Status() so it updates only status, not spec. Be cautious about infinite loops: status update is itself an update event, triggering another reconcile.
(f) Return ctrl.Result
Your Reconcile returns two values: ctrl.Result and error. The combination dictates what happens next:
- return ctrl.Result{}, nil → done, no immediate requeue
- return ctrl.Result{Requeue: true}, nil → immediately requeue
- return ctrl.Result{RequeueAfter: time.Duration}, nil → requeue after the given delay
- return ctrl.Result{}, err → error, so the runtime may retry with backoff
You use requeue when you know further work is needed after a delay (e.g. waiting for a child to settle). The scaffolding also sets a syncPeriod default (e.g. 10 hours), so even in the absence of events, reconciles run periodically.
Also, your code should not block indefinitely — reconcilers must return rather than wait on long blocking operations.
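Instead of sleeping or polling inside Reconcile, return and ask to be requeued. A sketch (the readiness check is hypothetical; time is the standard library package):

if !isChildReady(found) { // hypothetical readiness check
	// Hand the worker back to the queue and come back later.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}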
(g) Concurrent reconciles & safety
Controller-runtime supports concurrent reconciliation of different objects (via MaxConcurrentReconciles), allowing your operator to scale, as sketched below.
The runtime never reconciles the same object concurrently: reconciles are serialized per object key. But you should be careful about cross-object state (e.g. two CRs manipulating the same shared resource).
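Raising concurrency is a one-line builder option; a sketch:

// controller here is sigs.k8s.io/controller-runtime/pkg/controller
return ctrl.NewControllerManagedBy(mgr).
	For(&mygroupv1.MyKind{}).
	Owns(&appsv1.Deployment{}).
	WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
	Complete(r)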
(h) Watch other resources, not just primary CR
Often you’ll want to watch secondary or external resources (e.g. ConfigMaps, Secrets, other CRs). You map events on them to reconcile your CRs (via .Owns(...), .Watches(...) in SetupWithManager).
ctrl.NewControllerManagedBy(mgr).
For(&mygroupv1.MyKind{}).
Owns(&appsv1.Deployment{}).
Watches(&source.Kind{Type: &corev1.Secret{}}, handler.EnqueueRequestsFromMapFunc(mapFn)).
Complete(r)
Thus, if a Secret changes, you can trigger reconciliation of relevant CR(s).
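The mapFn above is code you supply. A hedged sketch, matching the handler.MapFunc signature of the controller-runtime versions that use source.Kind (the label key is hypothetical):

// (packages: client and reconcile from controller-runtime, types from apimachinery)
mapFn := func(obj client.Object) []reconcile.Request {
	// Decide which MyKind CR(s) care about this Secret, e.g. via a label.
	owner, ok := obj.GetLabels()["mygroup.example.com/owned-by"]
	if !ok {
		return nil
	}
	return []reconcile.Request{{
		NamespacedName: types.NamespacedName{Namespace: obj.GetNamespace(), Name: owner},
	}}
}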
Example: Memcached operator (minimal)
The classic tutorial example (used in the Operator SDK docs) is Memcached: the user supplies size: N in the CR, and the operator ensures a Deployment with N replicas of memcached is running.
Pseudocode outline:
func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the Memcached CR
	// Define the desired Deployment spec (memcached image, size: N replicas)
	// Check whether that Deployment exists
	// If not, create it
	// Else, if the replica counts differ, update it
	// Update status (e.g. ready replicas, conditions)
	return ctrl.Result{}, nil
}
This simple example illustrates the core pattern. You can expand it to include scaling, backups, upgrades, etc.
The reconciliation loop in practice
The reconciliation loop is the heart of your operator. It is:
- Event-driven (via watches)
- State-agnostic (reconcile must handle any starting state)
- Idempotent (safe to run multiple times)
- Non-blocking (each call should complete quickly)
- Triggers further reconciles by requeue or watching owned resources
As Kubernetes operators are merely controllers in user space, they plug into the control plane’s reconciliation machinery. When the controller-runtime manager runs, it registers your controller, and each time an event (create/update/delete) happens on watched resources, the manager enqueues a reconcile Request, which is processed by calling your Reconcile() function.
In effect, the operator’s reconcilers extend Kubernetes’ control loop to your custom domain.
Best Practices & Tips (parting advice)
- Keep your reconcile logic modular: break it into sub-reconcilers or small functions (e.g. “ensureDeployment”, “ensureConfig”, “updateStatus”).
- Use conditions in status (Ready, Progressing, Degraded) rather than encoding booleans or strings; it makes status easier to interpret and extend (see the sketch after this list).
- Guard expensive list or watch operations — use indexers or field selectors to limit scope.
- Use leader election if you run multiple replicas of your operator (to avoid double work).
- Monitor metrics (reconcile durations, queue length, errors).
- Be careful with schema evolution: provide CRD conversion webhooks or adoption strategies when migrating APIs.
- Use finalizers to clean up external dependencies (e.g. delete cloud resources) before object is fully removed.
- Gracefully handle partial failures: circuit-breakers, retries, backoff.
- Document your CRD’s fields, constraints, examples (use config/samples).
- Scope your operator to a namespace (or cluster-wide) thoughtfully, and restrict RBAC accordingly.
- Don't let one controller manage more than one CR kind; if there are too many concerns, split them into multiple controllers.
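As promised in the conditions bullet above, here is a minimal sketch using apimachinery's condition helpers. It assumes your status type has a Conditions []metav1.Condition field, which is a convention you add yourself, not something the scaffold generates:

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Inside Reconcile, after the children are confirmed healthy:
meta.SetStatusCondition(&my.Status.Conditions, metav1.Condition{
	Type:    "Ready",
	Status:  metav1.ConditionTrue,
	Reason:  "ReconcileSuccess",
	Message: "All child resources are up to date",
})
if err := r.Status().Update(ctx, &my); err != nil {
	return ctrl.Result{}, err
}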
Conclusion
Operators + CRDs represent a powerful pattern for making Kubernetes aware of your domain logic and automating much of the operational burden. You define new APIs (CRDs), and the operator (controller) drives the system toward the desired state — doing what a human operator would, but continuously, reliably, at cluster scale.
Yes, there’s complexity, and writing a robust operator takes care, testing, observability, and design discipline. But once you cross the learning curve, operators become your go-to tool to manage data stores, middleware, clusters, workflows, and many other system components in a Kubernetes-native way.
