DEV Community: Mohammed Firdous

What It's Actually Like to Be an LFX Mentee at CNCF

Mohammed Firdous — Thu, 04 Jun 2026 00:20:21 +0000

The CNCF has a lot of projects. When I opened the CNCF Landscape for the first time, I spent days just trying to figure out where to start. Kubernetes, Cilium, Argo CD, Flux, Istio, and many more: there is a lot going on, and it is easy to feel overwhelmed.

I had used Argo CD before on a personal project, so GitOps felt familiar. That's when I found PipeCD, a continuous delivery tool still at the CNCF Sandbox stage. What caught my attention was the plugin architecture. Instead of one big codebase handling every deployment type, PipeCD had redesigned things so each platform (Kubernetes, ECS, Lambda) has its own separate plugin: a standalone binary talking to the agent over gRPC. That design meant the Kubernetes multi-cluster plugin was its own contained project, and it was still being built out. I could contribute to something real without needing to understand the entire codebase first. Being at the Sandbox stage also meant I could get involved early, at a point where the project was still taking shape. That felt like the right fit.

The Project

During Term 1 2026, I worked with the PipeCD team to build out the Kubernetes multi-cluster plugin. The goal was simple: let teams deploy to multiple Kubernetes clusters from one pipeline, with proper progressive delivery (canary testing, baseline comparison, gradual rollout) instead of pushing to every cluster at once and hoping for the best.

When I joined, the plugin could only do a basic sync and rollback. Across more than 30 merged PRs, I built six deployment stages: Canary Rollout, Canary Clean, Baseline Rollout, Baseline Clean, Primary Rollout, and Traffic Routing. Each stage can target a different set of clusters, so you can test on one cluster before rolling out to the rest. I also updated the plugin SDK to report pass or fail per cluster instead of one combined result, and added health checks, rollback cleanup, and drift detection for five Kubernetes workload types.

The Part That Humbled Me

My first PR was a small refactor: easy, clean, merged quickly. That gave me too much confidence. A few weeks in, I jumped straight into building the Analysis stage plugin without really understanding what was needed or why. I opened three PRs. Two were closed. One was merged after a lot of rework. I had rushed in without reading the existing code or asking questions first.

The lesson was simple: read the issue, read the code, ask questions, then write. I had it backwards.

What Surprised Me

Open source is still the best place to learn in public and get real feedback on your work. What surprised me is how much a small change can matter. Before this mentorship, I had contributed to GitLab: a single change fixing an outdated tutorial that was sending new contributors to a search string that no longer existed in the codebase. Small fix, but every new contributor who followed that guide hit that wall. It just went in and quietly stopped being a problem for people.

The Mentorship Itself

The LFX Mentorship gave me something I could not get from a course or a side project. I was working on real code that real users run, with experienced engineers reviewing everything I wrote. My mentors Shinnosuke Sawada-Dazai and Khanh Tran gave honest feedback on every PR. They pushed me to write cleaner Go, think about edge cases I would have missed, and understand why a design decision matters, not just what it does. Doing all of this in the open, with every comment and discussion public, raised the bar for how carefully I thought before writing code.

If You Are Applying Next Term

Take your time choosing a project. CNCF is large and it is easy to pick something just because it is well known. Pick something you actually use or are genuinely curious about. Find the community on Slack, read through past issues, watch how the project works before you jump in. The community is what makes CNCF work. Being active in it will take you further than the code alone.

For the technical details of what I built so far, check out these posts:

Building the Kubernetes Multi-Cluster Plugin for PipeCD — LFX Mentorship

Building Canary, Baseline & Traffic Routing for PipeCD's Kubernetes Multi-Cluster Plugin

Mohammed Firdous — Tue, 07 Apr 2026 09:27:17 +0000

If you had told me last year that I would be working with Kubernetes and all things clusters, deployments and service meshes, I would have brushed it off. I am truly grateful for the journey thus far.

Earlier last month, I got accepted as an LFX Mentee for Term 1 of this calendar year. For me it is such a big deal, given my background, and how much effort has been put in behind the scenes to get to this stage.

I'm currently a mentee in the LFX Mentorship program working on PipeCD, an open-source GitOps continuous delivery platform. For the past four weeks, I've been building out the kubernetes_multicluster plugin, specifically implementing the deployment pipeline stages that handle canary, primary and baseline deployments across multiple clusters.

What is PipeCD and what is this plugin?

PipeCD is an open-source GitOps CD platform that manages deployments across different infrastructure targets like Kubernetes, ECS, Terraform, Lambda and more. Each target type has a plugin that knows how to deploy to it.

The kubernetes_multicluster plugin is for teams running the same application across multiple Kubernetes clusters (say US, EU and Asia) and needing all of them to stay in sync through a single pipeline. Rolling out a new version across clusters one at a time, manually, with no coordination, is error-prone and slow. The plugin lets you define one pipeline that runs across every cluster at the same time, with canary and baseline checks before anything hits production.

Progressive Delivery and Why These Stages Exist

Before a new version reaches all users, it goes through stages. A canary sends a small slice of traffic to the new version first. A baseline runs the current version at the same scale so you have a fair comparison. Primary is the actual promotion. Clean stages remove the temporary resources when you're done.

This pattern is called progressive delivery, because you roll out gradually, check things look good, then commit. If something looks wrong at the canary stage, you stop there. Nothing has touched production yet.

The kubernetes_multicluster plugin runs all of this across every cluster at the same time. One pipeline, every cluster, same stages.

A full pipeline looks like this:

stages:
  - name: K8S_CANARY_ROLLOUT
  - name: K8S_BASELINE_ROLLOUT
  - name: K8S_TRAFFIC_ROUTING
  - name: K8S_PRIMARY_ROLLOUT
  - name: K8S_CANARY_CLEAN
  - name: K8S_BASELINE_CLEAN

Each of these is a stage I built. The sections below go through what each one does.

What I Built

K8S_CANARY_ROLLOUT

The canary stage deploys the new version of your app as a small slice alongside the existing production deployment. If your app normally runs 3 pods, canary might spin up 1 pod of the new version (the size can also be given as a percentage, such as 20%), enough to catch problems without affecting most users.

It loads manifests from Git, creates copies of all workloads with a -canary suffix, scales them down to the configured replica count, adds a pipecd.dev/variant=canary label, and applies them to every target cluster in parallel. The original deployment is never touched; this stage only ever adds resources.
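To make the copy-and-relabel step concrete, here is a minimal sketch. It is not the plugin's actual code: the Workload struct, makeCanary, and canaryReplicas are hypothetical names, and rounding a percentage up to at least one replica is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Workload is a simplified, hypothetical stand-in for a Kubernetes Deployment manifest.
type Workload struct {
	Name     string
	Replicas int
	Labels   map[string]string
}

// canaryReplicas converts a percentage of the primary replica count into a
// canary replica count. Rounding up with a floor of 1 is an assumption.
func canaryReplicas(primary, percent int) int {
	n := int(math.Ceil(float64(primary) * float64(percent) / 100))
	if n < 1 {
		return 1
	}
	return n
}

// makeCanary returns a copy of w named "<name>-canary" with the variant label
// added. The original workload, including its label map, is left unmodified.
func makeCanary(w Workload, replicas int) Workload {
	labels := make(map[string]string, len(w.Labels)+1)
	for k, v := range w.Labels {
		labels[k] = v
	}
	labels["pipecd.dev/variant"] = "canary"
	return Workload{Name: w.Name + "-canary", Replicas: replicas, Labels: labels}
}

func main() {
	primary := Workload{Name: "simple", Replicas: 3, Labels: map[string]string{"app": "simple"}}
	canary := makeCanary(primary, canaryReplicas(primary.Replicas, 20))
	fmt.Println(canary.Name, canary.Replicas, canary.Labels["pipecd.dev/variant"])
	fmt.Println(primary.Name, primary.Replicas, len(primary.Labels))
}
```

Copying the label map instead of mutating it mirrors the "only ever adds resources" property: the primary manifest stays exactly as it was loaded.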

K8S_CANARY_CLEAN

Once the canary window is over, whether you promoted or rolled back, the canary pods are just sitting in every cluster doing nothing. K8S_CANARY_CLEAN removes them.

It finds all resources with the label pipecd.dev/variant=canary for the application and deletes them in order: Services first, then Deployments, then everything else. The order matters as you don't want to remove the Deployment while the Service is still sending traffic to it.

One thing worth noting: the query is scoped strictly to canary-labelled resources. Even if something goes wrong in the deletion logic, it cannot touch primary resources.
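The deletion order described above can be sketched as a stable sort by kind. This is only an illustration under assumed names (Resource, deletionPriority, orderForDeletion are hypothetical), not the plugin's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a hypothetical, simplified view of a labelled cluster resource.
type Resource struct {
	Kind string
	Name string
}

// deletionPriority ranks kinds so Services go first, then Deployments,
// then everything else.
func deletionPriority(kind string) int {
	switch kind {
	case "Service":
		return 0
	case "Deployment":
		return 1
	default:
		return 2
	}
}

// orderForDeletion returns a sorted copy of resources. The stable sort keeps
// the original relative order of resources that share a priority.
func orderForDeletion(resources []Resource) []Resource {
	ordered := make([]Resource, len(resources))
	copy(ordered, resources)
	sort.SliceStable(ordered, func(i, j int) bool {
		return deletionPriority(ordered[i].Kind) < deletionPriority(ordered[j].Kind)
	})
	return ordered
}

func main() {
	found := []Resource{
		{Kind: "ConfigMap", Name: "simple-canary"},
		{Kind: "Deployment", Name: "simple-canary"},
		{Kind: "Service", Name: "simple-canary"},
	}
	for _, r := range orderForDeletion(found) {
		fmt.Println(r.Kind, r.Name)
	}
}
```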

K8S_PRIMARY_ROLLOUT

After the canary looks good, you promote the new version to primary, the workload actually serving all your users. This stage takes the manifests from Git, adds the pipecd.dev/variant=primary label, and applies them across all clusters in parallel.

It also has a prune option: after applying, it checks what's currently running in the cluster against what was just applied, and deletes anything that's no longer in Git. Useful when you remove a resource from your manifests and want the cluster to reflect that.
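Conceptually, pruning is a set difference: everything live in the cluster minus everything that was just applied. A minimal sketch, assuming a hypothetical ResourceKey type and findPrunable helper rather than the plugin's real types:

```go
package main

import "fmt"

// ResourceKey is a hypothetical identifier for a resource in one cluster.
type ResourceKey struct {
	Kind      string
	Namespace string
	Name      string
}

// findPrunable returns the live resources that are absent from the set that
// was just applied from Git. These are the candidates for deletion.
func findPrunable(live, applied []ResourceKey) []ResourceKey {
	keep := make(map[ResourceKey]struct{}, len(applied))
	for _, k := range applied {
		keep[k] = struct{}{}
	}
	var prunable []ResourceKey
	for _, k := range live {
		if _, ok := keep[k]; !ok {
			prunable = append(prunable, k)
		}
	}
	return prunable
}

func main() {
	live := []ResourceKey{
		{Kind: "Deployment", Namespace: "default", Name: "simple"},
		{Kind: "ConfigMap", Namespace: "default", Name: "old-config"},
	}
	applied := []ResourceKey{
		{Kind: "Deployment", Namespace: "default", Name: "simple"},
	}
	for _, k := range findPrunable(live, applied) {
		fmt.Println("prune:", k.Kind, k.Namespace+"/"+k.Name)
	}
}
```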

K8S_BASELINE_ROLLOUT

This one took me a while to understand and it is the stage I find most interesting to explain as well.

When you're running a canary, the natural thing is to compare it against primary. The issue is that this is not a fair comparison: primary is handling far more traffic than canary, under different conditions.

Baseline gives you a fairer comparison. You take the current version (not the new one) and run it at the same scale as canary. Now your cluster has:

simple             2/2   ← production, current version
simple-canary      1/1   ← new version, being tested
simple-baseline    1/1   ← current version at canary scale

You compare canary vs baseline: same number of pods, same traffic conditions. If canary is worse, it's obvious.

The key difference from every other rollout stage is one line of code. Canary and primary load manifests from the new Git commit (TargetDeploymentSource). Baseline loads from what's currently running (RunningDeploymentSource):

// canary.go — new version
manifests, err := p.loadManifests(ctx, ..., &input.Request.TargetDeploymentSource, ...)

// baseline.go — current version
manifests, err := p.loadManifests(ctx, ..., &input.Request.RunningDeploymentSource, ...)

K8S_BASELINE_CLEAN

Once the analysis is done, baseline resources get cleaned up the same way as canary: find everything labelled pipecd.dev/variant=baseline and delete it in order. No configuration needed. Whether or not createService: true was set during rollout, the stage finds whatever is there and removes it.

K8S_TRAFFIC_ROUTING

Canary and baseline pods exist in the cluster but get no traffic until this stage runs. Without it, you're analysing pods that nobody is actually hitting. This stage is what sends real user traffic to them.

Two methods are supported:

PodSelector (no service mesh needed): changes the Kubernetes Service selector to point at one variant. It is all-or-nothing: 100% to canary or 100% back to primary.

Istio: updates VirtualService route weights to split traffic across all three variants at once, for example primary 80%, canary 10%, baseline 10%. Also supports editableRoutes to limit which named routes the stage is allowed to modify.

One small thing I added on top of the traffic routing stage: per-route logging. When the stage runs, it now logs each route it processes and whether it was skipped (because it's not in editableRoutes) or updated with new weights. Before this, the log just said "Successfully updated traffic routing" with no detail. Now you can see exactly which routes changed and to what percentages, which is useful when debugging a misconfigured VirtualService.
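The interplay between editableRoutes and per-route logging can be sketched roughly like this. The Route struct, updateRoutes function, and log wording are all hypothetical, and treating an empty editableRoutes list as "every route is editable" is an assumption of this sketch, not a statement about the plugin.

```go
package main

import "fmt"

// Route is a hypothetical, simplified named route with per-variant weights.
type Route struct {
	Name    string
	Weights map[string]int // variant name -> percentage of traffic
}

// updateRoutes applies weights to every editable route in place and returns
// one log line per route, saying whether it was skipped or updated.
// Assumption: an empty editableRoutes list means all routes may be edited.
func updateRoutes(routes []Route, editableRoutes []string, weights map[string]int) []string {
	editable := make(map[string]bool, len(editableRoutes))
	for _, name := range editableRoutes {
		editable[name] = true
	}
	var logs []string
	for i := range routes {
		r := &routes[i]
		if len(editableRoutes) > 0 && !editable[r.Name] {
			logs = append(logs, fmt.Sprintf("route %q: skipped (not in editableRoutes)", r.Name))
			continue
		}
		r.Weights = map[string]int{
			"primary":  weights["primary"],
			"canary":   weights["canary"],
			"baseline": weights["baseline"],
		}
		logs = append(logs, fmt.Sprintf("route %q: updated to primary=%d%% canary=%d%% baseline=%d%%",
			r.Name, weights["primary"], weights["canary"], weights["baseline"]))
	}
	return logs
}

func main() {
	routes := []Route{{Name: "main"}, {Name: "internal"}}
	weights := map[string]int{"primary": 80, "canary": 10, "baseline": 10}
	for _, line := range updateRoutes(routes, []string{"main"}, weights) {
		fmt.Println(line)
	}
}
```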

Something I Found Interesting

The thing that surprised me was how errgroup handles running across multiple clusters without much extra code.

Every stage needs to run against N clusters, not one. A simple for-loop would run them one at a time, which is slow, and if cluster 2 is going to fail you don't find out until cluster 1 is already done.

errgroup runs all clusters at the same time and returns the first error:

eg, ctx := errgroup.WithContext(ctx)
for _, tc := range targetClusters {
    tc := tc
    eg.Go(func() error {
        return canaryRollout(ctx, tc.deployTarget, ...)
    })
}
return eg.Wait()

All clusters run in parallel. If any one fails, errgroup cancels the shared context so the other clusters can stop early, and Wait returns the first error, which fails the stage. The same pattern is used across every stage, so adding a new stage is mostly just writing the per-cluster logic; the concurrency part is already solved.

What's Next

The next piece is DetermineStrategy, the logic that decides what kind of deployment to trigger based on what changed in Git. After that comes livestate drift detection, so PipeCD can flag when a cluster has drifted from what Git says it should be.

To get involved, check out the PipeCD project and come join us on Slack.

Orchestrating Complex Serverless Workflows on AWS

Mohammed Firdous — Tue, 27 May 2025 07:05:00 +0000

TL;DR: Just linking Lambda functions makes your app hard to manage and easy to break. AWS Step Functions help you control steps in your app with built-in error handling and easy tracking. AWS EventBridge lets parts of your app send messages (events) to each other without being directly connected.
Pattern 1: Use Step Functions to run long tasks in the background while your app stays fast.
Pattern 2: Use EventBridge to start jobs automatically when something happens, like a new customer signing up.
These tools make your serverless app easier to grow, fix, and keep working well.

Introduction
Why Do Orchestration and Events Matter?
AWS Step Functions - Your Workflow Manager
AWS EventBridge – Your Serverless Event Bus
Pattern 1 - Asynchronous API Processing with Step Functions
Pattern 2 - Event Driven Workflow Triggering with EventBridge
Practical Tips
Taking the Next Step
References

Introduction

So, you've learned how to use AWS Lambda. You can create functions, call them using API Gateway, and save data in DynamoDB. That’s great! But what happens when your app starts getting bigger and more complex?

When one user action needs to do many things, like calling different services, handling errors well, and making sure everything happens in the right order, just linking Lambda functions can get messy. It can feel like a game of pinball, where you lose track of what’s happening.

When you try to handle state, retries, and errors across multiple Lambda functions, things get hard. You also need to see what’s going on when a process has many steps. That’s where the real power of AWS serverless tools helps.

Two tools are especially useful here: AWS Step Functions and AWS EventBridge.

EventBridge acts like a message system that lets different parts of your app (and other services) send and receive events without directly calling each other. This keeps your app flexible and able to handle changes or failures better.
Step Functions lets you create a visual workflow that shows the steps and how they connect, like a flowchart for your app.

This guide helps you go beyond basic Lambda.

We will look at two practical patterns using Step Functions and EventBridge. These patterns help you build stronger, easier-to-maintain, and more scalable serverless applications on AWS.

Why Do Orchestration and Events Matter?

Before we go into Step Functions and EventBridge, let’s talk about why these tools are important when your serverless apps grow.

Imagine you’re building a multi-step order system with just Lambda functions calling each other:

ProcessOrder gets the order.
It calls ValidateInventory.
If inventory is fine, it calls ProcessPayment.
If payment works, it calls ShipOrder.
But if something fails, what do you do? Roll back? Tell the user? Retry?
How do you know which step is running? Or if it’s finished?
If ProcessPayment takes a long time, does the first function just wait and risk timing out?

Chaining Lambdas like this ties the functions too closely together, and the system becomes fragile. Error handling spreads across different functions, and keeping track of the whole process gets tricky. People call this the "Lambda Pinball" anti-pattern, where your logic jumps around from function to function like a pinball in a machine.

This is where orchestration and event-driven patterns help a lot:

Orchestration (Step Functions): It gives you one place to define and manage the workflow. Step Functions keep track of state between steps, handle retries and errors, and let you see what’s happening.
Event-Driven (EventBridge): It separates services. Instead of calling each other directly, functions send events like OrderPlaced. Other services listen for those events and act on them. This makes the system stronger: if one service is down, the others can still work. It’s also easier to add new features, since you don’t have to change existing services to add a new one that listens to the same event.

Using Step Functions for workflows and EventBridge for events helps you build serverless systems that are easier to manage, grow, and handle failures.

AWS Step Functions - Your Workflow Manager

Think of AWS Step Functions as a tool to design and run workflows. You define the steps using JSON in the Amazon States Language. This definition describes a state machine: a system that controls how each step runs, keeps track of the current step, and handles errors and retries for you.

Key Benefits:

Automatic State Management: Step Functions keeps track of data between steps, so you don’t have to pass or store it manually.
Built-in Error Handling: You can set rules to retry on temporary errors or catch specific errors right in the workflow, making error handling easier and centralized.
Supports Long Tasks: Standard workflows can run for up to a year, which suits processes that take a long time or wait on human input, far beyond Lambda's 15-minute maximum timeout.
Run Steps in Parallel: You can run several tasks at the same time and wait for all or some to finish before moving on.
Direct AWS Service Calls: Step Functions can call many AWS services directly, like Lambda, SQS, DynamoDB, and others, with no extra code needed for simple calls.
Clear Visibility: You get a visual view in the AWS console showing each step’s input, output, and errors, which helps a lot with debugging.

Here's a conceptual snippet of what a state machine definition might look like:

{
  "Comment": "A simple example of a Step Functions state machine",
  "StartAt": "ValidateInput",
  "States": {
    "ValidateInput": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ValidateLambda",
      "Next": "ProcessData"
    },
    "ProcessData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessLambda",
      "Retry": [{
        "ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException", "Lambda.SdkClientException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2
      }],
      "Catch": [{
        "ErrorEquals": ["States.TaskFailed"],
        "Next": "NotifyFailure"
      }],
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:MySNSTopic",
        "Message.$": "$.Cause"
      },
      "End": true
    }
  }
}

This simple example shows defining states (ValidateInput, ProcessData, NotifyFailure), linking them (Next), and adding retry/catch logic.
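To see what the Retry block's numbers mean in practice, the wait before each retry can be worked out by hand: the first retry waits IntervalSeconds, and each later retry multiplies the previous wait by BackoffRate. The helper below is a small illustrative sketch (retryDelays is a hypothetical name, and it ignores optional settings such as a maximum delay or jitter).

```go
package main

import (
	"fmt"
	"math"
)

// retryDelays returns the wait, in seconds, before each retry attempt:
// intervalSeconds * backoffRate^n for n = 0 .. maxAttempts-1.
func retryDelays(intervalSeconds, maxAttempts int, backoffRate float64) []float64 {
	delays := make([]float64, 0, maxAttempts)
	for attempt := 0; attempt < maxAttempts; attempt++ {
		delays = append(delays, float64(intervalSeconds)*math.Pow(backoffRate, float64(attempt)))
	}
	return delays
}

func main() {
	// Values from the ProcessData state: IntervalSeconds 2, MaxAttempts 3, BackoffRate 2.
	fmt.Println(retryDelays(2, 3, 2)) // prints [2 4 8]
}
```

So with the settings in the example, ProcessData is retried after 2, 4, and 8 seconds before the Catch block takes over.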

AWS EventBridge – Your Serverless Event Bus

Step Functions manages workflows you define, but EventBridge handles events you might not know about yet. It works like a central hub where events from AWS services, your apps, or external SaaS tools flow through and get routed to the right places automatically.

Key Benefits:

Decoupling: Event producers don’t need to know who will handle the event, and handlers don’t need to know who sent it. They just send or listen for events. This makes your system more flexible and stronger.
Content-Based Filtering: You can set rules to catch only certain events based on what’s inside them.
Flexible Routing: One event can trigger many targets like Lambda, Step Functions, SQS, and more.
Many Event Sources: EventBridge works with more than 90 AWS services and many SaaS tools. You can react to things like new S3 files, DynamoDB changes, or partner events from tools like Datadog.
Schema Registry: Store and share event formats so teams understand them better and can even generate code for handling events.

Example:

Say users upload images to an S3 bucket. Instead of making S3 call your image processor directly, you can:

Set S3 to send ObjectCreated events to EventBridge.
Create a rule that listens only for .jpg or .png files in certain folders.
Set the rule’s target to your image processing Lambda or a Step Functions workflow for more steps.

Now, the S3 upload and image processing are separate. You can add more rules to send the same event to other services, like notifications or audits, without changing S3 or the processing function. This keeps your system flexible and easier to update.

Pattern 1 - Asynchronous API Processing with Step Functions

Sometimes, your API needs to start a task that takes a long time but still respond quickly to the user.

Example: A user asks for a detailed report that could take minutes to create.

In this case, the API starts a Step Functions workflow to handle the long process in the background and immediately returns a response saying the request is received. The workflow runs the report generation without making the user wait.

Architecture:

The client sends a POST request to /generate-report through API Gateway.
API Gateway starts the Step Functions workflow directly or via a quick Lambda.
The workflow begins with the client’s input.
API Gateway immediately sends back a 202 Accepted response with the workflow ID so the client can check progress later.
The Step Functions workflow runs these tasks:

Validate input with a Lambda.
Query data using Lambda or Fargate.
Format the report with Lambda.
Save the report to S3 with Lambda.
Optionally notify the user via Lambda or SNS when done or if it fails.

Why this helps:

The client doesn’t wait for the whole report to finish.
Step Functions handles retries and errors automatically.
The API stays light and scalable, while the heavy work runs separately.

Pattern 2 - Event Driven Workflow Triggering with EventBridge

You can use events from different sources to automatically start complex workflows.

Example: When a new customer signs up and their info is added to a DynamoDB Customers table, start an onboarding workflow with multiple steps.

Architecture:

Here’s a simple breakdown:

The Customers DynamoDB table has Streams enabled to track changes.
A Lambda function listens to the DynamoDB Stream and gets batches of changes.
For each new customer (INSERT), the Lambda creates a custom event with the customer data and sends it to a custom EventBridge event bus.
An EventBridge rule listens for events with source: myapp.customers and detail-type: CustomerCreated.
The rule triggers a Step Functions workflow for onboarding.
The Step Functions workflow runs steps like:

Add customer to CRM.
Send a welcome email.
Provision resources for the customer.

Why this works:

Customer creation is separated from onboarding logic.
The system reacts automatically to new customers.
You can add more rules or workflows easily without changing the original services.
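The stream-to-event step in the middle of this flow can be sketched as a pure function. StreamRecord and EventEntry are simplified, hypothetical shapes (real DynamoDB stream records and EventBridge entries carry more fields), and the bus name is a placeholder; only the source and detail-type values come from the steps above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StreamRecord is a simplified, hypothetical view of one DynamoDB stream record.
type StreamRecord struct {
	EventName string            // "INSERT", "MODIFY", or "REMOVE"
	NewImage  map[string]string // flattened item attributes
}

// EventEntry is a simplified, hypothetical view of one event to publish.
type EventEntry struct {
	EventBusName string
	Source       string
	DetailType   string
	Detail       string // JSON-encoded payload
}

// toCustomerEvents keeps only INSERT records and turns each one into a
// CustomerCreated event for the given bus.
func toCustomerEvents(records []StreamRecord, bus string) ([]EventEntry, error) {
	var entries []EventEntry
	for _, r := range records {
		if r.EventName != "INSERT" {
			continue
		}
		detail, err := json.Marshal(r.NewImage)
		if err != nil {
			return nil, err
		}
		entries = append(entries, EventEntry{
			EventBusName: bus,
			Source:       "myapp.customers",
			DetailType:   "CustomerCreated",
			Detail:       string(detail),
		})
	}
	return entries, nil
}

func main() {
	records := []StreamRecord{
		{EventName: "INSERT", NewImage: map[string]string{"customerId": "c-1", "email": "a@example.com"}},
		{EventName: "MODIFY", NewImage: map[string]string{"customerId": "c-0"}},
	}
	// "customers-bus" is a placeholder bus name.
	entries, err := toCustomerEvents(records, "customers-bus")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Println(e.Source, e.DetailType, e.Detail)
	}
}
```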

Bonus:
You might connect DynamoDB Streams directly to Step Functions using EventBridge Pipes. Pipes can handle basic filtering and input transformation itself, so you can skip the Lambda unless you need more complex custom logic.

Practical Tips

Cost Models:
Step Functions Standard charges per state transition. Express charges based on how long it runs and how many times it’s called, which is often cheaper for many short tasks.
EventBridge charges per event sent to custom or partner event buses and per target invoked. AWS service events are usually free.
Observability:
Use CloudWatch Logs inside your Lambdas. Turn on AWS X-Ray tracing for Lambda and Step Functions to see the full flow of requests. Set up CloudWatch Metrics and Alarms to track failures and queue depths.
Standard vs Express Workflows:
Use Standard for long, reliable workflows (up to 1 year) where exactly-once matters.
Use Express for fast, high volume, short tasks (under 5 minutes) where it’s okay if tasks run more than once and cost is a priority.
Error Handling:
Use Step Functions’ Retry blocks to handle temporary problems like network issues. Use Catch blocks to handle specific errors and run clean-up or notification tasks.
Idempotency:
Because events might arrive more than once, make sure tasks can safely run multiple times with the same input without causing problems. Check if the work is already done before acting.
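A minimal sketch of the "check before acting" idea follows. It keeps the set of processed event IDs in memory, which only works inside one process; a real system would likely use a durable store instead. IdempotentProcessor and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotentProcessor remembers which event IDs have been handled so that a
// duplicate delivery does not repeat the work. In-memory only, for illustration.
type IdempotentProcessor struct {
	mu   sync.Mutex
	done map[string]bool
}

func NewIdempotentProcessor() *IdempotentProcessor {
	return &IdempotentProcessor{done: make(map[string]bool)}
}

// Process runs work at most once per eventID. It reports whether work ran.
// A failed attempt is not recorded, so a later retry can try again.
func (p *IdempotentProcessor) Process(eventID string, work func() error) (bool, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.done[eventID] {
		return false, nil
	}
	if err := work(); err != nil {
		return false, err
	}
	p.done[eventID] = true
	return true, nil
}

func main() {
	p := NewIdempotentProcessor()
	emails := 0
	send := func() error { emails++; return nil }

	// The same event delivered twice only sends one welcome email.
	p.Process("evt-123", send)
	p.Process("evt-123", send)
	fmt.Println("emails sent:", emails)
}
```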

Taking the Next Step

Using Step Functions for orchestration and EventBridge for event-driven workflows lets you build more powerful, scalable, and reliable serverless apps. The examples of asynchronous API handling and event-triggered workflows show just how these services solve real challenges.

Once you understand these, you can design complex systems that are easier to manage and adapt to changing needs.

Try implementing one of these patterns yourself. Explore the extensive AWS Serverless Patterns Collection for more inspiration and ready-to-deploy examples. And most importantly, share your experiences and questions in the comments below. Let's learn together :)!
