<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohammed Firdous</title>
    <description>The latest articles on DEV Community by Mohammed Firdous (@mohammedfirdouss).</description>
    <link>https://dev.to/mohammedfirdouss</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2996814%2F7289c02d-5656-481f-814d-7a8481f6c523.png</url>
      <title>DEV Community: Mohammed Firdous</title>
      <link>https://dev.to/mohammedfirdouss</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mohammedfirdouss"/>
    <language>en</language>
    <item>
      <title>Building Canary, Baseline &amp; Traffic Routing for PipeCD's Kubernetes Multi-Cluster Plugin</title>
      <dc:creator>Mohammed Firdous</dc:creator>
      <pubDate>Tue, 07 Apr 2026 09:27:17 +0000</pubDate>
      <link>https://dev.to/mohammedfirdouss/building-canary-baseline-traffic-routing-for-pipecds-kubernetes-multi-cluster-plugin-ddi</link>
      <guid>https://dev.to/mohammedfirdouss/building-canary-baseline-traffic-routing-for-pipecds-kubernetes-multi-cluster-plugin-ddi</guid>
      <description>&lt;p&gt;If you had told me last year that I would be working with Kubernetes and all things clusters, deployments and service meshes, I would have brushed it off. I am truly grateful for the journey thus far.&lt;/p&gt;

&lt;p&gt;Early last month, I got accepted as an LFX Mentee for Term 1 of this calendar year. For me this is a big deal, given my background and how much effort went in behind the scenes to get to this stage.&lt;/p&gt;

&lt;p&gt;I'm currently a mentee in the LFX Mentorship program working on &lt;a href="https://pipecd.dev" rel="noopener noreferrer"&gt;PipeCD&lt;/a&gt;, an open-source GitOps continuous delivery platform. For the past four weeks, I've been building out the &lt;code&gt;kubernetes_multicluster&lt;/code&gt; plugin, specifically the deployment pipeline stages that handle canary, primary and baseline deployments across multiple clusters.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is PipeCD and what is this plugin?
&lt;/h2&gt;

&lt;p&gt;PipeCD is an open-source GitOps CD platform that manages deployments across different infrastructure targets like Kubernetes, ECS, Terraform, Lambda and more. Each target type has a plugin that knows how to deploy to it.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubernetes_multicluster&lt;/code&gt; plugin is for teams that run the same application across multiple Kubernetes clusters (say US, EU and Asia) and need all of them to stay in sync through a single pipeline. Rolling out a new version across clusters one at a time, manually, with no coordination, is error-prone and slow. The plugin lets you define one pipeline that runs across every cluster at the same time, with canary and baseline checks before anything hits production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Progressive Delivery and Why These Stages Exist
&lt;/h2&gt;

&lt;p&gt;Before a new version reaches all users, it goes through stages. A canary sends a small slice of traffic to the new version first. A baseline runs the &lt;em&gt;current&lt;/em&gt; version at the same scale so you have a fair comparison. Primary is the actual promotion. Clean stages remove the temporary resources when you're done.&lt;/p&gt;

&lt;p&gt;This pattern is called progressive delivery: you roll out gradually, check that things look good, then commit. If something looks wrong at the canary stage, you stop there. The primary workload has not been changed yet.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubernetes_multicluster&lt;/code&gt; plugin runs all of this across every cluster at the same time. One pipeline, every cluster, same stages.&lt;/p&gt;

&lt;p&gt;A full pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8S_CANARY_ROLLOUT&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8S_BASELINE_ROLLOUT&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8S_TRAFFIC_ROUTING&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8S_PRIMARY_ROLLOUT&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8S_CANARY_CLEAN&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8S_BASELINE_CLEAN&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of these is a stage I built. The sections below go through what each one does.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;h3&gt;
  
  
  K8S_CANARY_ROLLOUT
&lt;/h3&gt;

&lt;p&gt;The canary stage deploys the new version of your app as a small slice alongside the existing production deployment. If your app normally runs 3 pods, canary might spin up 1 pod of the new version (or a percentage, such as 20%), enough to catch problems without affecting most users.&lt;/p&gt;

&lt;p&gt;It loads manifests from Git, creates copies of all workloads with a &lt;code&gt;-canary&lt;/code&gt; suffix, scales them down to the configured replica count, adds a &lt;code&gt;pipecd.dev/variant=canary&lt;/code&gt; label, and applies them to every target cluster in parallel. The original deployment is never touched; this stage only ever adds resources.&lt;/p&gt;
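
&lt;p&gt;To make the transformation concrete, here is a minimal sketch of the copy, rename, rescale and label steps. The &lt;code&gt;workload&lt;/code&gt; type and the &lt;code&gt;canaryVariant&lt;/code&gt; function are hypothetical names invented for this illustration; they are not the plugin's actual types, and real manifests carry far more fields.&lt;/p&gt;

```go
package main

import "fmt"

// workload is a hypothetical, heavily simplified stand-in for a Deployment manifest.
type workload struct {
	Name     string
	Replicas int
	Labels   map[string]string
}

// canaryVariant returns a copy of w with a "-canary" suffix, the requested
// replica count and the variant label. The input workload is not modified,
// mirroring the idea that the canary stage only adds resources.
func canaryVariant(w workload, replicas int) workload {
	labels := make(map[string]string, len(w.Labels)+1)
	for k, v := range w.Labels {
		labels[k] = v
	}
	labels["pipecd.dev/variant"] = "canary"
	return workload{Name: w.Name + "-canary", Replicas: replicas, Labels: labels}
}

func main() {
	primary := workload{Name: "simple", Replicas: 3, Labels: map[string]string{"app": "simple"}}
	canary := canaryVariant(primary, 1)
	fmt.Println(canary.Name, canary.Replicas, canary.Labels["pipecd.dev/variant"])
	fmt.Println(primary.Name, primary.Replicas, len(primary.Labels))
}
```

&lt;p&gt;Copying the label map instead of reusing it is what keeps the original workload untouched in this sketch.&lt;/p&gt;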

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswcu1ppt38ltw87wwbol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswcu1ppt38ltw87wwbol.png" alt="Canary rollout stage log applying manifests to cluster-eu and cluster-us" width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lfhrddbu6r3mt01tlrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lfhrddbu6r3mt01tlrs.png" alt="Canary rollout success — deploy targets: cluster-eu + cluster-us" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  K8S_CANARY_CLEAN
&lt;/h3&gt;

&lt;p&gt;Once the canary window is over, whether you promoted or rolled back, the canary pods are just sitting in every cluster doing nothing. &lt;code&gt;K8S_CANARY_CLEAN&lt;/code&gt; removes them.&lt;/p&gt;

&lt;p&gt;It finds all resources with the label &lt;code&gt;pipecd.dev/variant=canary&lt;/code&gt; for the application and deletes them in order: Services first, then Deployments, then everything else. The order matters as you don't want to remove the Deployment while the Service is still sending traffic to it.&lt;/p&gt;

&lt;p&gt;One thing worth noting: the query is scoped strictly to canary-labelled resources. Even if something goes wrong in the deletion logic, it cannot touch primary resources.&lt;/p&gt;
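
&lt;p&gt;The deletion order described above can be sketched as a stable sort over the resources that matched the label query. The &lt;code&gt;resource&lt;/code&gt; type and both function names are assumptions made for this example, not the plugin's real code.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// resource is a hypothetical, simplified stand-in for a Kubernetes object.
type resource struct {
	Kind string
	Name string
}

// deletionRank returns a sort key: Services first, then Deployments,
// then every other kind.
func deletionRank(kind string) int {
	switch kind {
	case "Service":
		return 0
	case "Deployment":
		return 1
	default:
		return 2
	}
}

// orderForDeletion returns a copy of rs sorted into deletion order.
// The sort is stable, so resources with the same rank keep their input order.
func orderForDeletion(rs []resource) []resource {
	out := make([]resource, len(rs))
	copy(out, rs)
	sort.SliceStable(out, func(i, j int) bool {
		return deletionRank(out[j].Kind) > deletionRank(out[i].Kind)
	})
	return out
}

func main() {
	found := []resource{
		{Kind: "ConfigMap", Name: "simple-canary-config"},
		{Kind: "Deployment", Name: "simple-canary"},
		{Kind: "Service", Name: "simple-canary"},
	}
	for _, r := range orderForDeletion(found) {
		fmt.Println("delete", r.Kind, r.Name)
	}
}
```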

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vzb1fikt47ax3z53bjy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vzb1fikt47ax3z53bjy.png" alt="K8S_CANARY_CLEAN stage log deleting simple-canary resources from both clusters" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5980a6b46dtiqv9oetz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5980a6b46dtiqv9oetz7.png" alt="K8S_CANARY_ROLLOUT → K8S_CANARY_CLEAN pipeline — both stages green on cluster-eu and cluster-us" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  K8S_PRIMARY_ROLLOUT
&lt;/h3&gt;

&lt;p&gt;After the canary looks good, you promote the new version to primary, the workload actually serving all your users. This stage takes the manifests from Git, adds the &lt;code&gt;pipecd.dev/variant=primary&lt;/code&gt; label, and applies them across all clusters in parallel.&lt;/p&gt;

&lt;p&gt;It also has a &lt;code&gt;prune&lt;/code&gt; option: after applying, it checks what's currently running in the cluster against what was just applied, and deletes anything that's no longer in Git. Useful when you remove a resource from your manifests and want the cluster to reflect that.&lt;/p&gt;
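
&lt;p&gt;Conceptually, pruning is a set difference between what is running and what was just applied. The sketch below assumes resources are identified by a hypothetical &lt;code&gt;Kind/name&lt;/code&gt; key; &lt;code&gt;findPrunable&lt;/code&gt; is an illustrative name, and the real stage works on live cluster objects, not strings.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// findPrunable returns the keys that are present in running but absent from
// applied, sorted so the output is deterministic. Keys are hypothetical
// "Kind/name" strings used only for this illustration.
func findPrunable(running, applied []string) []string {
	keep := make(map[string]bool, len(applied))
	for _, k := range applied {
		keep[k] = true
	}
	var prune []string
	for _, k := range running {
		if !keep[k] {
			prune = append(prune, k)
		}
	}
	sort.Strings(prune)
	return prune
}

func main() {
	running := []string{"Deployment/simple", "Service/simple", "ConfigMap/old-config"}
	applied := []string{"Deployment/simple", "Service/simple"}
	for _, k := range findPrunable(running, applied) {
		fmt.Println("prune", k)
	}
}
```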

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5i4pfqi10ef38i7ltn55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5i4pfqi10ef38i7ltn55.png" alt="K8S_PRIMARY_ROLLOUT success deploy targets: cluster-eu + cluster-us" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbi6iplj1l8g5wbkwgql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbi6iplj1l8g5wbkwgql.png" alt="kubectl confirming simple 2/2 updated in both cluster-eu and cluster-us" width="765" height="107"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  K8S_BASELINE_ROLLOUT
&lt;/h3&gt;

&lt;p&gt;This one took me a while to understand and it is the stage I find most interesting to explain as well.&lt;/p&gt;

&lt;p&gt;When you're running a canary, the natural thing is to compare it against primary. The issue is that this is not a fair comparison: primary is handling far more traffic than canary, under different conditions.&lt;/p&gt;

&lt;p&gt;Baseline gives you a fairer comparison. You take the &lt;em&gt;current&lt;/em&gt; version (not the new one) and run it at the same scale as canary. Now your cluster has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;simple             2/2   ← production, current version
simple-canary      1/1   ← new version, being tested
simple-baseline    1/1   ← current version at canary scale
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You compare canary vs baseline: same number of pods, same traffic conditions. If canary is worse, it's obvious.&lt;/p&gt;

&lt;p&gt;The key difference from every other rollout stage is one line of code. Canary and primary load manifests from the new Git commit (&lt;code&gt;TargetDeploymentSource&lt;/code&gt;). Baseline loads from what's currently running (&lt;code&gt;RunningDeploymentSource&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// canary.go — new version&lt;/span&gt;
&lt;span class="n"&gt;manifests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loadManifests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TargetDeploymentSource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// baseline.go — current version&lt;/span&gt;
&lt;span class="n"&gt;manifests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loadManifests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RunningDeploymentSource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa61aeapcwnqquh3v3vdh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa61aeapcwnqquh3v3vdh.png" alt="K8S_BASELINE_ROLLOUT stage log loading manifests from running deployment source" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r26i800uc46kfmauo5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r26i800uc46kfmauo5p.png" alt="K8S_BASELINE_ROLLOUT" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6lowk38efysvvy0vbmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6lowk38efysvvy0vbmk.png" alt="K8S_BASELINE_ROLLOUT" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8n8ailmbo0gae31sf58l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8n8ailmbo0gae31sf58l.png" alt="kubectl showing simple, simple-baseline, simple-canary all running in both clusters" width="800" height="127"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  K8S_BASELINE_CLEAN
&lt;/h3&gt;

&lt;p&gt;Once the analysis is done, baseline resources get cleaned up the same way as canary: find everything labelled &lt;code&gt;pipecd.dev/variant=baseline&lt;/code&gt; and delete it in order. No configuration is needed. It doesn't matter whether &lt;code&gt;createService: true&lt;/code&gt; was set during rollout; the stage finds whatever is there and removes it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddfhlxgi1i40r0x8dgh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddfhlxgi1i40r0x8dgh2.png" alt="K8S_BASELINE_CLEAN" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsxadrvt48rbr4hi113m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsxadrvt48rbr4hi113m.png" alt="K8S_BASELINE_CLEAN stage log deleting baseline resources from both clusters" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapa1tdqa751bgl2t9g6c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapa1tdqa751bgl2t9g6c.png" alt="K8S_BASELINE_CLEAN stage log deleting baseline resources from both clusters" width="800" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frder7h0ylhxn0wkaykep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frder7h0ylhxn0wkaykep.png" alt="K8S_BASELINE_CLEAN" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

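&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8epcyha332ir2xz3jhua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8epcyha332ir2xz3jhua.png" alt="kubectl confirming no baseline resources remain in cluster-eu or cluster-us" width="800" height="58"&gt;&lt;/a&gt;&lt;/p&gt;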

&lt;h3&gt;
  
  
  K8S_TRAFFIC_ROUTING
&lt;/h3&gt;

&lt;p&gt;Canary and baseline pods exist in the cluster but get no traffic until this stage runs. Without it, you're analysing pods that nobody is actually hitting. This stage is what sends real user traffic to them.&lt;/p&gt;

&lt;p&gt;Two methods are supported:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PodSelector&lt;/strong&gt; (no service mesh needed): changes the Kubernetes Service selector to point at one variant. It is all-or-nothing: 100% to canary or 100% back to primary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uqb18dwo6jyhr0aekpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uqb18dwo6jyhr0aekpw.png" alt="PodSelector traffic routing full pipeline success across cluster-eu and cluster-us" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqddkmg9qs6unwk0bklu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqddkmg9qs6unwk0bklu3.png" alt="PodSelector" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpiwbmpywrxmf4ykbcomp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpiwbmpywrxmf4ykbcomp.png" alt="PodSelector" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2u1kqy9jasqhqwtwebbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2u1kqy9jasqhqwtwebbd.png" alt="PodSelector" width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Istio&lt;/strong&gt;: updates VirtualService route weights to split traffic across all three variants at once, for example primary 80%, canary 10%, baseline 10%. It also supports &lt;code&gt;editableRoutes&lt;/code&gt; to limit which named routes the stage is allowed to modify.&lt;/p&gt;

&lt;p&gt;One small thing I added on top of the traffic routing stage is per-route logging. When the stage runs, it now logs each route it processes and whether it was skipped (because it's not in &lt;code&gt;editableRoutes&lt;/code&gt;) or updated with new weights. Before this, the log just said "Successfully updated traffic routing" with no detail. Now you can see exactly which routes changed and to what percentages, which is useful when debugging a misconfigured VirtualService.&lt;/p&gt;
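
&lt;p&gt;The combination of &lt;code&gt;editableRoutes&lt;/code&gt; filtering and per-route logging can be sketched roughly as follows. The &lt;code&gt;route&lt;/code&gt; type, the &lt;code&gt;applyWeights&lt;/code&gt; function and the log wording are all invented for this example, and treating an empty editable list as "every route is editable" is an assumption of the sketch, not a statement about the plugin's behaviour.&lt;/p&gt;

```go
package main

import "fmt"

// route is a hypothetical, simplified view of one named HTTP route in a
// VirtualService: a name plus a weight in percent per variant.
type route struct {
	Name    string
	Weights map[string]int
}

// applyWeights sets weights on every editable route in place and returns one
// log line per route. An empty editable list is assumed to mean that all
// routes may be modified.
func applyWeights(routes []route, editable []string, weights map[string]int) []string {
	allowed := make(map[string]bool, len(editable))
	for _, name := range editable {
		allowed[name] = true
	}
	logs := make([]string, 0, len(routes))
	for i, r := range routes {
		skip := len(editable) > 0
		if allowed[r.Name] {
			skip = false
		}
		if skip {
			logs = append(logs, fmt.Sprintf("route %q skipped: not in editableRoutes", r.Name))
			continue
		}
		// Copy the weights so routes never share one underlying map.
		updated := make(map[string]int, len(weights))
		for variant, w := range weights {
			updated[variant] = w
		}
		routes[i].Weights = updated
		logs = append(logs, fmt.Sprintf("route %q updated: primary=%d%% canary=%d%% baseline=%d%%",
			r.Name, weights["primary"], weights["canary"], weights["baseline"]))
	}
	return logs
}

func main() {
	routes := []route{{Name: "api"}, {Name: "internal"}}
	weights := map[string]int{"primary": 80, "canary": 10, "baseline": 10}
	for _, line := range applyWeights(routes, []string{"api"}, weights) {
		fmt.Println(line)
	}
}
```

&lt;p&gt;Returning one line per route, including the skipped ones, is what makes a misconfigured route name visible in the stage log.&lt;/p&gt;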

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F498uvcytrlppjpxqcq05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F498uvcytrlppjpxqcq05.png" alt="Istio traffic routing stage log per-route logging showing which routes were updated in both clusters" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ql3nmfb19psi5gfu8a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ql3nmfb19psi5gfu8a2.png" alt="Istio" width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxy1gv5cojotd6ks253m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxy1gv5cojotd6ks253m.png" alt="Istio" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftva9uwcmd7prb6qf72dm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftva9uwcmd7prb6qf72dm.png" alt="Istio" width="800" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6oj4elm5asfgergkp2se.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6oj4elm5asfgergkp2se.png" alt="Full Istio pipeline, all 7 stages green on cluster-eu and cluster-us" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Something I Found Interesting
&lt;/h2&gt;

&lt;p&gt;The thing that surprised me was how &lt;code&gt;errgroup&lt;/code&gt; handles running across multiple clusters without much extra code.&lt;/p&gt;

&lt;p&gt;Every stage needs to run against N clusters, not one. A simple for-loop would run them one at a time, which is slow, and if cluster 2 fails you don't find out until cluster 1 is already done.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;errgroup&lt;/code&gt; runs all clusters at the same time and returns the first error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;eg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;errgroup&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;targetClusters&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;
    &lt;span class="n"&gt;eg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Go&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;canaryRollout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deployTarget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;eg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



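&lt;p&gt;All clusters run in parallel. If any one fails, the shared context is cancelled so the remaining calls can stop early, and &lt;code&gt;Wait&lt;/code&gt; returns the first error, which fails the stage. The same pattern is used across every stage, so adding a new stage is mostly just writing the per-cluster logic; the concurrency part is already solved.&lt;/p&gt;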

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The next piece is &lt;code&gt;DetermineStrategy&lt;/code&gt;, the logic that decides what kind of deployment to trigger based on what changed in Git. After that comes livestate drift detection, so PipeCD can flag when a cluster has drifted from what Git says it should be.&lt;/p&gt;

&lt;p&gt;To get involved, check out the PipeCD project and come join us on Slack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/pipe-cd/pipecd" rel="noopener noreferrer"&gt;PipeCD repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mentorship.lfx.linuxfoundation.org" rel="noopener noreferrer"&gt;LFX Mentorship Program&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pipe-cd/pipecd/issues/6446" rel="noopener noreferrer"&gt;Issue #6446, kubernetes_multicluster plugin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pipe-cd/pipecd/pull/6629" rel="noopener noreferrer"&gt;PR #6629 K8S_TRAFFIC_ROUTING&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pipe-cd/pipecd/pull/6648" rel="noopener noreferrer"&gt;PR #6648 Per-route logging in K8S_TRAFFIC_ROUTING&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.slack.com/client/T08PSQ7BQ/C01B27F9T0X" rel="noopener noreferrer"&gt;Slack #PipeCD&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>gitops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Orchestrating Complex Serverless Workflows on AWS</title>
      <dc:creator>Mohammed Firdous</dc:creator>
      <pubDate>Tue, 27 May 2025 07:05:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/orchestrating-complex-serverless-workflows-on-aws-3hbo</link>
      <guid>https://dev.to/aws-builders/orchestrating-complex-serverless-workflows-on-aws-3hbo</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Just linking Lambda functions makes your app hard to manage and easy to break. AWS Step Functions help you control steps in your app with built-in error fixing and easy tracking. AWS EventBridge lets parts of your app send messages (events) to each other without being directly connected.&lt;br&gt;
&lt;strong&gt;Pattern 1&lt;/strong&gt;: Use Step Functions to run long tasks in the background while your app stays fast.&lt;br&gt;
&lt;strong&gt;Pattern 2&lt;/strong&gt;: Use EventBridge to start jobs automatically when something happens, like a new customer signing up.&lt;br&gt;
These tools make your serverless app easier to grow, fix, and keep working well.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Why Do Orchestration and Events Matter?&lt;/li&gt;
&lt;li&gt;AWS Step Functions - Your Workflow Manager&lt;/li&gt;
&lt;li&gt;AWS EventBridge – Your Serverless Event Bus&lt;/li&gt;
&lt;li&gt;Pattern 1 - Asynchronous API Processing with Step Functions&lt;/li&gt;
&lt;li&gt;Pattern 2 - Event Driven Workflow Triggering with EventBridge&lt;/li&gt;
&lt;li&gt;Practical Tips&lt;/li&gt;
&lt;li&gt;Taking the Next Step&lt;/li&gt;
&lt;li&gt;References&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;So, you've learned how to use AWS Lambda. You can create functions, call them using API Gateway, and save data in DynamoDB. That’s great! But what happens when your app starts getting bigger and more complex?&lt;/p&gt;

&lt;p&gt;When one user action needs to do many things, like calling different services, handling errors well, and making sure everything happens in the right order, just linking Lambda functions can get messy. It can feel like a game of pinball, where you lose track of what’s happening.&lt;/p&gt;

&lt;p&gt;When you try to handle state, retries, and errors across multiple Lambda functions, things get hard. You also need to see what’s going on when a process has many steps. That’s where AWS’s orchestration and event-routing services help.&lt;/p&gt;

&lt;p&gt;Two tools are especially useful here: &lt;strong&gt;AWS Step Functions&lt;/strong&gt; and &lt;strong&gt;Amazon EventBridge&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;EventBridge&lt;/strong&gt; acts like a message system that lets different parts of your app (and other services) send and receive events without directly calling each other. This keeps your app flexible and able to handle changes or failures better.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Step Functions&lt;/strong&gt; lets you create a visual workflow that shows the steps and how they connect, like a flowchart for your app.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guide helps you go beyond basic Lambda.&lt;/p&gt;

&lt;p&gt;We will look at two practical patterns using &lt;strong&gt;Step Functions&lt;/strong&gt; and &lt;strong&gt;EventBridge&lt;/strong&gt;. These patterns help you build stronger, easier-to-maintain, and more scalable serverless applications on AWS.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Do Orchestration and Events Matter?
&lt;/h2&gt;

&lt;p&gt;Before we go into Step Functions and EventBridge, let’s talk about &lt;em&gt;why&lt;/em&gt; these tools are important when your serverless apps grow.&lt;/p&gt;

&lt;p&gt;Imagine you’re building a multi-step order system with just Lambda functions calling each other:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;ProcessOrder&lt;/code&gt; gets the order.&lt;/li&gt;
&lt;li&gt;It calls &lt;code&gt;ValidateInventory&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If inventory is fine, it calls &lt;code&gt;ProcessPayment&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If payment works, it calls &lt;code&gt;ShipOrder&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;But if something fails, what do you do? Roll back? Tell the user? Retry?&lt;/li&gt;
&lt;li&gt;How do you know which step is running? Or if it’s finished?&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;ProcessPayment&lt;/code&gt; takes a long time, does the first function just wait and risk timing out?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Chaining Lambdas like this ties the functions too closely together, and the system becomes fragile. Error handling is spread across different functions, which makes it hard to manage, and keeping track of the whole process gets tricky. This problem is called the &lt;strong&gt;Lambda Pinball&lt;/strong&gt; anti-pattern, where your logic jumps around from function to function like a pinball in a machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf0bxoocf950f58unaov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf0bxoocf950f58unaov.png" title="Diagram illustrating the Lambda Pinball anti-pattern compared to an orchestrated flow" alt="Lambda Pinball Anti-Pattern Diagram" width="679" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where orchestration and event-driven patterns help a lot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Orchestration (Step Functions)&lt;/strong&gt;: It gives you one place to define and manage the workflow. Step Functions keep track of state between steps, handle retries and errors, and let you see what’s happening.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event-Driven (EventBridge)&lt;/strong&gt;: It separates services. Instead of calling each other directly, functions publish events like &lt;code&gt;OrderPlaced&lt;/code&gt;, and other services listen for those events and act on them. This makes the system more resilient: if one service is down, the others can still work. It’s also easier to add new features, since you can add a new service that listens to the same event without changing the existing ones.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using Step Functions for workflows and EventBridge for events helps you build serverless systems that are easier to manage, grow, and handle failures.&lt;/p&gt;
&lt;h2&gt;
  
  
  AWS Step Functions - Your Workflow Manager
&lt;/h2&gt;

&lt;p&gt;Think of AWS Step Functions as a tool to design and run workflows. You define the steps using JSON in the &lt;strong&gt;Amazon States Language&lt;/strong&gt;. This definition is a &lt;strong&gt;state machine&lt;/strong&gt;: a system that controls how each step runs, keeps track of the current step, and handles errors and retries for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjusm31wo7v1j094ua6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjusm31wo7v1j094ua6l.png" title="Conceptual diagram of a basic Step Functions workflow" alt="Basic Step Functions Workflow Diagram" width="760" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatic State Management:&lt;/strong&gt; Step Functions keeps track of data between steps, so you don’t have to pass or store it manually.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in Error Handling:&lt;/strong&gt; You can set rules to retry on temporary errors or catch specific errors right in the workflow, making error handling easier and centralized.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Supports Long Tasks:&lt;/strong&gt; Standard workflows can run for up to a year, far beyond Lambda’s 15-minute timeout. That suits processes that take a long time or that wait for human input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run Steps in Parallel:&lt;/strong&gt; You can run several tasks at the same time and wait for all of them to finish before moving on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Direct AWS Service Calls:&lt;/strong&gt; Step Functions can call many AWS services directly, like Lambda, SQS, DynamoDB, and others, with no extra code needed for simple calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clear Visibility:&lt;/strong&gt; You get a visual view in the AWS console showing each step’s input, output, and errors, which helps a lot with debugging.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's a conceptual snippet of what a state machine definition might look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"A simple example of a Step Functions state machine"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StartAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ValidateInput"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"States"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ValidateInput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:lambda:us-east-1:123456789012:function:ValidateLambda"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ProcessData"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ProcessData"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:lambda:us-east-1:123456789012:function:ProcessLambda"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Retry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ErrorEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Lambda.ServiceException"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Lambda.AWSLambdaException"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Lambda.SdkClientException"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"IntervalSeconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"MaxAttempts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"BackoffRate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Catch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ErrorEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"States.TaskFailed"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NotifyFailure"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"End"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"NotifyFailure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:sns:us-east-1:123456789012:MySNSTopic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"End"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple example defines three states (&lt;code&gt;ValidateInput&lt;/code&gt;, &lt;code&gt;ProcessData&lt;/code&gt;, &lt;code&gt;NotifyFailure&lt;/code&gt;), links them with &lt;code&gt;Next&lt;/code&gt;, and adds retry/catch logic to &lt;code&gt;ProcessData&lt;/code&gt;.&lt;/p&gt;
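
&lt;p&gt;The same definition language also covers the "run steps in parallel" benefit listed earlier. Here is a minimal sketch of a single &lt;code&gt;Parallel&lt;/code&gt; state as it might appear inside a &lt;code&gt;States&lt;/code&gt; block. The branch names, the Lambda ARNs, and the &lt;code&gt;ShipOrder&lt;/code&gt; target are placeholders made up for illustration, not part of any real workflow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Comment": "Hypothetical fragment: names and ARNs are placeholders",
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "ChargeCard",
      "States": {
        "ChargeCard": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargeCardLambda",
          "End": true
        }
      }
    },
    {
      "StartAt": "ReserveStock",
      "States": {
        "ReserveStock": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ReserveStockLambda",
          "End": true
        }
      }
    }
  ],
  "Next": "ShipOrder"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each branch is a small state machine of its own. The &lt;code&gt;Parallel&lt;/code&gt; state moves on to &lt;code&gt;Next&lt;/code&gt; only after every branch has finished, and its output is an array with one entry per branch.&lt;/p&gt;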

&lt;h2&gt;
  
  
  AWS EventBridge – Your Serverless Event Bus
&lt;/h2&gt;

&lt;p&gt;Step Functions manages workflows you define, but EventBridge handles events you might not know about yet. It works like a central hub where events from AWS services, your apps, or external SaaS tools flow through and get routed to the right places automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpw3sfxn9fubl3jmfk5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpw3sfxn9fubl3jmfk5x.png" title="Conceptual diagram of a basic EventBridge event bus" alt="Basic EventBridge Event Bus Diagram" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decoupling:&lt;/strong&gt; Event producers don’t need to know who will handle the event, and handlers don’t need to know who sent it. They just send or listen for events. This makes your system more flexible and more resilient.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Content-Based Filtering:&lt;/strong&gt; You can set rules to catch only certain events based on what’s inside them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible Routing:&lt;/strong&gt; One event can trigger many targets like Lambda, Step Functions, SQS, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Many Event Sources:&lt;/strong&gt; EventBridge works with over 90 AWS services and many SaaS tools. You can react to things like new S3 files, DynamoDB changes, or partner events from tools like Datadog.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema Registry:&lt;/strong&gt; Store and share event formats so teams understand them better and can even generate code for handling events.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Say users upload images to an S3 bucket. Instead of making S3 call your image processor directly, you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set S3 to send &lt;code&gt;ObjectCreated&lt;/code&gt; events to EventBridge.&lt;/li&gt;
&lt;li&gt;Create a rule that listens only for &lt;code&gt;.jpg&lt;/code&gt; or &lt;code&gt;.png&lt;/code&gt; files in certain folders.&lt;/li&gt;
&lt;li&gt;Set the rule’s target to your image processing Lambda or a Step Functions workflow for more steps.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, the S3 upload and image processing are separate. You can add more rules to send the same event to other services, like notifications or audits, without changing S3 or the processing function. This keeps your system flexible and easier to update.&lt;/p&gt;
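
&lt;p&gt;As a rough sketch, the rule from step 2 could use an event pattern like the one below. It assumes EventBridge notifications are turned on for the bucket, in which case S3 publishes events with the detail type &lt;code&gt;Object Created&lt;/code&gt;. The bucket name &lt;code&gt;my-upload-bucket&lt;/code&gt; is a placeholder, and the folder restriction is left out to keep the example short:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["my-upload-bucket"]
    },
    "object": {
      "key": [
        { "suffix": ".jpg" },
        { "suffix": ".png" }
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Values inside one array are alternatives, so a key ending in either suffix matches, while the separate fields (&lt;code&gt;source&lt;/code&gt;, &lt;code&gt;detail-type&lt;/code&gt;, bucket name, key) must all match at once.&lt;/p&gt;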

&lt;h2&gt;
  
  
  Pattern 1 - Asynchronous API Processing with Step Functions
&lt;/h2&gt;

&lt;p&gt;Sometimes, your API needs to start a task that takes a long time but still respond quickly to the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A user asks for a detailed report that could take minutes to create.&lt;/p&gt;

&lt;p&gt;In this case, the API starts a Step Functions workflow to handle the long process in the background and immediately returns a response saying the request is received. The workflow runs the report generation without making the user wait.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnchyfzxcuwnqgqa4j2e4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnchyfzxcuwnqgqa4j2e4.png" title="Diagram illustrating the asynchronous API pattern using API Gateway and Step Functions" alt="Asynchronous API Pattern Diagram" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The client sends a POST request to &lt;code&gt;/generate-report&lt;/code&gt; through API Gateway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;API Gateway starts the Step Functions workflow directly or via a quick Lambda.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The workflow begins with the client’s input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;API Gateway immediately sends back a &lt;code&gt;202 Accepted&lt;/code&gt; response with the workflow’s execution ID (its execution ARN) so the client can check progress later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Step Functions workflow runs these tasks:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Validate input with a Lambda.&lt;/li&gt;
&lt;li&gt;Query data using Lambda or Fargate.&lt;/li&gt;
&lt;li&gt;Format the report with Lambda.&lt;/li&gt;
&lt;li&gt;Save the report to S3 with Lambda.&lt;/li&gt;
&lt;li&gt;Optionally notify the user via Lambda or SNS when done or if it fails.&lt;/li&gt;
&lt;/ul&gt;
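
&lt;p&gt;The tasks above could be sketched as a state machine along these lines. This is only an outline under stated assumptions: every state name, Lambda function, and the SNS topic is a placeholder, the data query is shown with Lambda rather than Fargate, and the retry numbers are arbitrary:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Comment": "Sketch of the report workflow; all names and ARNs are placeholders",
  "StartAt": "ValidateInput",
  "States": {
    "ValidateInput": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ValidateReportRequest",
      "Next": "QueryData"
    },
    "QueryData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:QueryReportData",
      "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 5,
        "MaxAttempts": 2,
        "BackoffRate": 2
      }],
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "NotifyFailure"
      }],
      "Next": "FormatReport"
    },
    "FormatReport": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:FormatReport",
      "Next": "SaveReport"
    },
    "SaveReport": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:SaveReportToS3",
      "Next": "NotifySuccess"
    },
    "NotifySuccess": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:ReportNotifications",
        "Message": "Your report is ready"
      },
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:ReportNotifications",
        "Message": "Report generation failed"
      },
      "End": true
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only &lt;code&gt;QueryData&lt;/code&gt; has &lt;code&gt;Retry&lt;/code&gt; and &lt;code&gt;Catch&lt;/code&gt; here to keep the sketch short. In a real workflow you would likely add similar handling to the other tasks.&lt;/p&gt;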

&lt;p&gt;&lt;strong&gt;Why this helps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The client doesn’t wait for the whole report to finish.&lt;/li&gt;
&lt;li&gt;Step Functions handles retries and errors automatically.&lt;/li&gt;
&lt;li&gt;The API stays light and scalable, while the heavy work runs separately.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pattern 2 - Event Driven Workflow Triggering with EventBridge
&lt;/h2&gt;

&lt;p&gt;You can use events from different sources to automatically start complex workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; When a new customer signs up and their info is added to a DynamoDB &lt;code&gt;Customers&lt;/code&gt; table, start an onboarding workflow with multiple steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg69g0bq4e9v9txgigbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg69g0bq4e9v9txgigbv.png" title="Diagram illustrating the event-driven workflow pattern using DynamoDB Streams, Lambda, EventBridge, and Step Functions" alt="Event-Driven Workflow Diagram" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s a simple breakdown:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;Customers&lt;/code&gt; DynamoDB table has Streams enabled to track changes.&lt;/li&gt;
&lt;li&gt;A Lambda function listens to the DynamoDB Stream and gets batches of changes.&lt;/li&gt;
&lt;li&gt;For each new customer (&lt;code&gt;INSERT&lt;/code&gt;), the Lambda creates a custom event with the customer data and sends it to a custom EventBridge event bus.&lt;/li&gt;
&lt;li&gt;An EventBridge rule listens for events with &lt;code&gt;source: myapp.customers&lt;/code&gt; and &lt;code&gt;detail-type: CustomerCreated&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The rule triggers a Step Functions workflow for onboarding.&lt;/li&gt;
&lt;li&gt;The Step Functions workflow runs steps like:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Add customer to CRM.&lt;/li&gt;
&lt;li&gt;Send a welcome email.&lt;/li&gt;
&lt;li&gt;Provision resources for the customer.&lt;/li&gt;
&lt;/ul&gt;
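
&lt;p&gt;For step 4, the rule's event pattern could be as small as the sketch below. It uses the &lt;code&gt;source&lt;/code&gt; and &lt;code&gt;detail-type&lt;/code&gt; values named above and assumes the Lambda in step 3 sets the same two values on the events it publishes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "source": ["myapp.customers"],
  "detail-type": ["CustomerCreated"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pattern values are always arrays, even when there is only one allowed value. Any event on the bus that does not match both fields is simply ignored by this rule.&lt;/p&gt;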

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer creation is separated from onboarding logic.&lt;/li&gt;
&lt;li&gt;The system reacts automatically to new customers.&lt;/li&gt;
&lt;li&gt;You can add more rules or workflows easily without changing the original services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bonus:&lt;/strong&gt;&lt;br&gt;
You might connect DynamoDB Streams directly to Step Functions using EventBridge Pipes. Pipes supports basic filtering and input transformation, so you can skip the Lambda when no custom logic is needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Tips
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Models:&lt;/strong&gt;&lt;br&gt;
Step Functions Standard charges per state transition. Express charges based on how many times it’s called and how long it runs, which is often cheaper for many short tasks.&lt;br&gt;
EventBridge charges per event sent to custom or partner event buses and per target invoked. AWS service events are usually free.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability:&lt;/strong&gt;&lt;br&gt;
Use CloudWatch Logs inside your Lambdas. Turn on AWS X-Ray tracing for Lambda and Step Functions to see the full flow of requests. Set up CloudWatch Metrics and Alarms to track failures and queue depths.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Standard vs Express Workflows:&lt;/strong&gt;&lt;br&gt;
Use &lt;strong&gt;Standard&lt;/strong&gt; for long, reliable workflows (up to 1 year) where exactly-once execution matters.&lt;br&gt;
Use &lt;strong&gt;Express&lt;/strong&gt; for fast, high-volume, short tasks (up to 5 minutes) where it’s okay if tasks run more than once and cost is a priority.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error Handling:&lt;/strong&gt;&lt;br&gt;
Use Step Functions’ &lt;code&gt;Retry&lt;/code&gt; blocks to handle temporary problems like network issues. Use &lt;code&gt;Catch&lt;/code&gt; blocks to handle specific errors and run clean-up or notification tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Idempotency:&lt;/strong&gt;&lt;br&gt;
Because events might arrive more than once, make sure tasks can safely run multiple times with the same input without causing problems. Check if the work is already done before acting.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Taking the Next Step
&lt;/h2&gt;

&lt;p&gt;Using Step Functions for orchestration and EventBridge for event-driven workflows lets you build more powerful, scalable, and reliable serverless apps. The examples of asynchronous API handling and event-triggered workflows show just how these services solve real challenges.&lt;/p&gt;

&lt;p&gt;Once you understand these, you can design complex systems that are easier to manage and adapt to changing needs.&lt;/p&gt;

&lt;p&gt;Try implementing one of these patterns yourself. Explore the extensive &lt;a href="https://serverlessland.com/patterns" rel="noopener noreferrer"&gt;AWS Serverless Patterns Collection&lt;/a&gt; for more inspiration and ready-to-deploy examples. And most importantly, share your experiences and questions in the comments below. Let's learn together :)!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html" rel="noopener noreferrer"&gt;AWS Step Functions Developer Guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html" rel="noopener noreferrer"&gt;AWS EventBridge User Guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://serverlessland.com/patterns" rel="noopener noreferrer"&gt;Serverless Land Patterns Collection&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>eventdriven</category>
    </item>
  </channel>
</rss>
