If you had told me last year that I would be working with Kubernetes and all things clusters, deployments and service meshes, I would have brushed it off. I am truly grateful for the journey thus far.
Early last month, I was accepted as an LFX Mentee for Term 1 of this calendar year. For me it is a big deal, given my background and how much effort went in behind the scenes to get to this stage.
I'm currently a mentee in the LFX Mentorship program working on PipeCD, an open-source GitOps continuous delivery platform. For the past four weeks, I've been building out the kubernetes_multicluster plugin, specifically implementing the deployment pipeline stages that handle canary, primary, and baseline deployments across multiple clusters.
What is PipeCD and what is this plugin?
PipeCD is an open-source GitOps CD platform that manages deployments across different infrastructure targets like Kubernetes, ECS, Terraform, Lambda and more. Each target type has a plugin that knows how to deploy to it.
The kubernetes_multicluster plugin is for teams running the same application across multiple Kubernetes clusters (say US, EU, and Asia) and needing all of them to stay in sync through a single pipeline. Rolling out a new version across clusters one at a time, manually, with no coordination, is error-prone and slow. The plugin lets you define one pipeline that runs across every cluster at the same time, with canary and baseline checks before anything hits production.
Progressive Delivery and Why These Stages Exist
Before a new version reaches all users, it goes through stages. A canary sends a small slice of traffic to the new version first. A baseline runs the current version at the same scale so you have a fair comparison. Primary is the actual promotion. Clean stages remove the temporary resources when you're done.
This pattern is called progressive delivery: you roll out gradually, check that things look good, then commit. If something looks wrong at the canary stage, you stop there. Nothing has touched production yet.
The kubernetes_multicluster plugin runs all of this across every cluster at the same time. One pipeline, every cluster, same stages.
A full pipeline looks like this:
```yaml
stages:
  - name: K8S_CANARY_ROLLOUT
  - name: K8S_BASELINE_ROLLOUT
  - name: K8S_TRAFFIC_ROUTING
  - name: K8S_PRIMARY_ROLLOUT
  - name: K8S_CANARY_CLEAN
  - name: K8S_BASELINE_CLEAN
```
Each of these is a stage I built. The sections below go through what each one does.
What I Built
K8S_CANARY_ROLLOUT
The canary stage deploys the new version of your app as a small slice alongside the existing production deployment. If your app normally runs 3 pods, canary might spin up 1 pod (or 20%) of the new version, enough to catch problems without affecting most users.
It loads manifests from Git, creates copies of all workloads with a -canary suffix, scales them down to the configured replica count, adds a pipecd.dev/variant=canary label, and applies them to every target cluster in parallel. The original deployment is never touched; this stage only ever adds resources.
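The steps above can be sketched as a single transformation over each workload. This is a simplified illustration using a stand-in Manifest struct rather than the real Kubernetes types; makeCanaryVariant and its fields are hypothetical names, not the plugin's actual code:

```go
package main

import "fmt"

// Manifest is a simplified stand-in for a Kubernetes workload manifest.
type Manifest struct {
	Name     string
	Replicas int
	Labels   map[string]string
}

// makeCanaryVariant sketches the canary transformation: copy the workload,
// add the -canary suffix, scale it down, and label the variant. The original
// manifest is left untouched.
func makeCanaryVariant(m Manifest, canaryReplicas int) Manifest {
	labels := map[string]string{}
	for k, v := range m.Labels {
		labels[k] = v
	}
	labels["pipecd.dev/variant"] = "canary"
	return Manifest{
		Name:     m.Name + "-canary",
		Replicas: canaryReplicas,
		Labels:   labels,
	}
}

func main() {
	prod := Manifest{Name: "simple", Replicas: 3, Labels: map[string]string{"app": "simple"}}
	canary := makeCanaryVariant(prod, 1) // 3 pods in production, 1 pod for canary
	fmt.Println(canary.Name, canary.Replicas, canary.Labels["pipecd.dev/variant"])
}
```

The copy-then-modify shape is the point: because the function returns a new manifest instead of mutating the input, the production deployment can never be changed by accident.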
K8S_CANARY_CLEAN
Once the canary window is over, whether you promoted or rolled back, the canary pods are just sitting in every cluster doing nothing. K8S_CANARY_CLEAN removes them.
It finds all resources with the label pipecd.dev/variant=canary for the application and deletes them in order: Services first, then Deployments, then everything else. The order matters: you don't want to remove the Deployment while the Service is still sending traffic to it.
One thing worth noting: the query is scoped strictly to canary-labelled resources. Even if something goes wrong in the deletion logic, it cannot touch primary resources.
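The ordered deletion described above amounts to a stable sort over resource kinds. A minimal sketch, not the plugin's actual implementation; deletionOrder and sortForDeletion are hypothetical helpers:

```go
package main

import (
	"fmt"
	"sort"
)

// deletionOrder ranks kinds so Services are removed before Deployments,
// and everything else comes last.
func deletionOrder(kind string) int {
	switch kind {
	case "Service":
		return 0
	case "Deployment":
		return 1
	default:
		return 2
	}
}

// sortForDeletion orders canary-labelled resources for safe removal without
// mutating the input slice.
func sortForDeletion(kinds []string) []string {
	out := append([]string(nil), kinds...)
	sort.SliceStable(out, func(i, j int) bool {
		return deletionOrder(out[i]) < deletionOrder(out[j])
	})
	return out
}

func main() {
	fmt.Println(sortForDeletion([]string{"ConfigMap", "Deployment", "Service"}))
}
```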
K8S_PRIMARY_ROLLOUT
After the canary looks good, you promote the new version to primary, the workload that actually serves all your users. This stage takes the manifests from Git, adds the pipecd.dev/variant=primary label, and applies them across all clusters in parallel.
It also has a prune option: after applying, it checks what's currently running in the cluster against what was just applied, and deletes anything that's no longer in Git. Useful when you remove a resource from your manifests and want the cluster to reflect that.
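That prune check boils down to a set difference between what's live and what was just applied. A minimal sketch, assuming resources are identified by simple string keys (findPrunable is a hypothetical name, not the plugin's API):

```go
package main

import "fmt"

// findPrunable returns live resource keys that are no longer present in the
// just-applied Git manifests, i.e. candidates for deletion.
func findPrunable(live, applied []string) []string {
	keep := map[string]bool{}
	for _, key := range applied {
		keep[key] = true
	}
	var prunable []string
	for _, key := range live {
		if !keep[key] {
			prunable = append(prunable, key)
		}
	}
	return prunable
}

func main() {
	live := []string{"Service/simple", "Deployment/simple", "ConfigMap/old-config"}
	applied := []string{"Service/simple", "Deployment/simple"}
	// The ConfigMap was removed from Git, so it should be pruned.
	fmt.Println(findPrunable(live, applied))
}
```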
K8S_BASELINE_ROLLOUT
This one took me a while to understand, and it's also the stage I find most interesting to explain.
When you're running a canary, the natural thing is to compare it against primary. The issue is that's not a fair comparison: primary is handling far more traffic than canary, under different conditions.
Baseline gives you a fairer comparison. You take the current version (not the new one) and run it at the same scale as canary. Now your cluster has:
```
simple            2/2   ← production, current version
simple-canary     1/1   ← new version, being tested
simple-baseline   1/1   ← current version at canary scale
```
You compare canary against baseline: same number of pods, same traffic conditions. If canary is worse, it's obvious.
The key difference from every other rollout stage is one line of code. Canary and primary load manifests from the new Git commit (TargetDeploymentSource). Baseline loads from what's currently running (RunningDeploymentSource):
```go
// canary.go: new version
manifests, err := p.loadManifests(ctx, ..., &input.Request.TargetDeploymentSource, ...)

// baseline.go: current version
manifests, err := p.loadManifests(ctx, ..., &input.Request.RunningDeploymentSource, ...)
```
K8S_BASELINE_CLEAN
Once the analysis is done, baseline resources get cleaned up the same way as canary: find everything labelled pipecd.dev/variant=baseline and delete it in order. No configuration needed. It doesn't matter whether createService: true was set during rollout; the stage finds whatever is there and removes it.
K8S_TRAFFIC_ROUTING
Canary and baseline pods exist in the cluster but get no traffic until this stage runs. Without it, you're analysing pods that nobody is actually hitting. This stage is what sends real user traffic to them.
Two methods are supported:
PodSelector (no service mesh needed): changes the Kubernetes Service selector to point at one variant. All-or-nothing: 100% to canary or 100% back to primary.
Istio: updates VirtualService route weights to split traffic across all three variants at once, for example primary 80%, canary 10%, baseline 10%. It also supports editableRoutes to limit which named routes the stage is allowed to modify.
One small thing I added on top of the traffic routing stage: per-route logging. When the stage runs, it now logs each route it processes whether it was skipped (because it's not in editableRoutes) or updated with new weights. Before this, the log just said "Successfully updated traffic routing" with no detail. Now you can see exactly which routes changed and to what percentages, which is useful when debugging a misconfigured VirtualService.
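Putting the weight update and the per-route logging together, here is a minimal sketch using a simplified Route struct in place of the real Istio VirtualService types; applyWeights, its fields, and the log wording are all hypothetical, not the plugin's actual output:

```go
package main

import "fmt"

// Route is a simplified stand-in for a named VirtualService route.
type Route struct {
	Name    string
	Weights map[string]int // variant -> traffic percentage
}

// applyWeights updates only the routes listed in editable (when the set is
// non-empty) and returns one log line per route describing what happened,
// so a misconfigured route is visible instead of silent.
func applyWeights(routes []Route, editable map[string]bool, weights map[string]int) []string {
	var logs []string
	for i := range routes {
		if len(editable) > 0 && !editable[routes[i].Name] {
			logs = append(logs, fmt.Sprintf("skipped route %q (not in editableRoutes)", routes[i].Name))
			continue
		}
		routes[i].Weights = weights
		logs = append(logs, fmt.Sprintf("updated route %q: primary=%d canary=%d baseline=%d",
			routes[i].Name, weights["primary"], weights["canary"], weights["baseline"]))
	}
	return logs
}

func main() {
	routes := []Route{{Name: "http-main"}, {Name: "http-internal"}}
	logs := applyWeights(routes,
		map[string]bool{"http-main": true},
		map[string]int{"primary": 80, "canary": 10, "baseline": 10})
	for _, line := range logs {
		fmt.Println(line)
	}
}
```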
Something I Found Interesting
The thing that surprised me was how errgroup handles running across multiple clusters without much extra code.
Every stage needs to run against N clusters, not one. A simple for-loop would run them one at a time: slow, and if cluster 2 fails you don't find out until cluster 1 is already done.
errgroup runs all clusters at the same time and returns the first error:
```go
eg, ctx := errgroup.WithContext(ctx)
for _, tc := range targetClusters {
	tc := tc // capture the loop variable (needed before Go 1.22)
	eg.Go(func() error {
		return canaryRollout(ctx, tc.deployTarget, ...)
	})
}
return eg.Wait()
```
All clusters run in parallel. If any one fails, the stage fails immediately. The same pattern is used across every stage, so adding a new stage is mostly just writing the per-cluster logic; the concurrency part is already solved.
What's Next
The next piece is DetermineStrategy: the logic that decides what kind of deployment to trigger based on what changed in Git. After that comes livestate drift detection, so PipeCD can flag when a cluster has drifted from what Git says it should be.
To get involved, check out the PipeCD project and come join us on Slack.