<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Temporal</title>
    <description>The latest articles on DEV Community by Temporal (@temporalio).</description>
    <link>https://dev.to/temporalio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F3146%2F0ad3097f-4bc5-469a-9263-de293ef2ab2e.png</url>
      <title>DEV Community: Temporal</title>
      <link>https://dev.to/temporalio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/temporalio"/>
    <language>en</language>
    <item>
      <title>Why I Needed Durable Execution to Read a Toy Manual</title>
      <dc:creator>Shy Ruparel</dc:creator>
      <pubDate>Fri, 10 Apr 2026 15:53:01 +0000</pubDate>
      <link>https://dev.to/temporalio/why-i-needed-durable-execution-to-read-a-toy-manual-35cn</link>
      <guid>https://dev.to/temporalio/why-i-needed-durable-execution-to-read-a-toy-manual-35cn</guid>
      <description>&lt;p&gt;Watch me build a bulletproof, &lt;strong&gt;AI-powered ETL pipeline&lt;/strong&gt; to translate a Japanese toy manual. I’ll show you how I use &lt;strong&gt;Temporal Workflows&lt;/strong&gt; to guarantee an AI pipeline never loses progress, surviving network failures, API crashes, and more.&lt;/p&gt;




&lt;h3&gt;What You'll Learn&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Why Spider-Man is the reason that Power Rangers has a giant robot.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guaranteed Completion with Temporal:&lt;/strong&gt; I’ll show you how to ensure your code keeps running even if servers crash or APIs fail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel OCR &amp;amp; Translation:&lt;/strong&gt; Learn how I used a &lt;strong&gt;"fan-out" pattern&lt;/strong&gt; with Google Document AI to process a 20-page manual in &lt;strong&gt;50 seconds&lt;/strong&gt; instead of 10 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilient AI Cleanup:&lt;/strong&gt; See how I use &lt;strong&gt;Pydantic&lt;/strong&gt; and &lt;strong&gt;Temporal&lt;/strong&gt; together to handle non-deterministic LLM outputs from Gemini and automatically retry failed validations.&lt;/li&gt;
&lt;/ul&gt;
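The "fan-out" pattern above can be sketched in plain Python with asyncio. This is a minimal illustration under assumed names (`ocr_page`, `fan_out` are placeholders, not the article's code); in the real pipeline each per-page call would be a retryable Temporal Activity invoked from a Workflow rather than a bare coroutine:

```python
import asyncio

# Hypothetical stand-in for the per-page OCR/translation call.
# In the article's pipeline this would be a Temporal Activity,
# so each page gets its own retries and failure isolation.
async def ocr_page(page_number: int) -> str:
    await asyncio.sleep(0.01)  # simulate the external API latency
    return f"page-{page_number}-text"

async def fan_out(total_pages: int) -> list[str]:
    # Start every page concurrently, then await all results.
    # Total wall-clock time is roughly one page's latency,
    # not total_pages times that latency.
    tasks = [ocr_page(n) for n in range(1, total_pages + 1)]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    results = asyncio.run(fan_out(20))
    print(len(results))
```

Because `asyncio.gather` preserves task order, the results come back in page order even though the calls overlap in time.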




&lt;p&gt;&lt;strong&gt;Ready to build it yourself?&lt;/strong&gt; 👉 &lt;a href="https://temporal.io/code-exchange/toku-solutions" rel="noopener noreferrer"&gt;Check out the code here!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>temporal</category>
      <category>supersentai</category>
      <category>kamenrider</category>
      <category>ai</category>
    </item>
    <item>
      <title>Decoupling Temporal Services with Nexus and the Java SDK</title>
      <dc:creator>Nikolay Advolodkin</dc:creator>
      <pubDate>Thu, 02 Apr 2026 13:50:51 +0000</pubDate>
      <link>https://dev.to/temporalio/decoupling-temporal-services-with-nexus-and-the-java-sdk-20p</link>
      <guid>https://dev.to/temporalio/decoupling-temporal-services-with-nexus-and-the-java-sdk-20p</guid>
      <description>&lt;p&gt;Your Temporal services share a blast radius. A bug in Compliance at 3 AM crashes Payments, too, because they share the same Worker. The obvious fix is separate services with HTTP calls between them - but then you're managing HTTP clients, routing, error mapping, and callback infrastructure yourself.&lt;/p&gt;

&lt;p&gt;We published a hands-on tutorial on &lt;a href="https://learn.temporal.io/tutorials/nexus/nexus-sync-tutorial/?utm_source=enterprise-dev-rel&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nexus-sync-tutorial&amp;amp;utm_content=devto-launch" rel="noopener noreferrer"&gt;learn.temporal.io&lt;/a&gt; where you take a monolithic banking payment system and split it into two independently deployable services connected through Temporal Nexus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nexus Endpoints, Services, and Operations from scratch&lt;/li&gt;
&lt;li&gt;Two handler patterns for different use cases&lt;/li&gt;
&lt;li&gt;How to swap an Activity call for a durable cross-namespace Nexus call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The caller-side change is minimal - the method call stays the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BEFORE (monolith - direct activity call):&lt;/span&gt;
&lt;span class="nc"&gt;ComplianceResult&lt;/span&gt; &lt;span class="n"&gt;compliance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;complianceActivity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;checkCompliance&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compReq&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// AFTER (Nexus - durable cross-team call):&lt;/span&gt;
&lt;span class="nc"&gt;ComplianceResult&lt;/span&gt; &lt;span class="n"&gt;compliance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;complianceService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;checkCompliance&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compReq&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same method name. Same input. Same output. Behind that swap: a shared service contract, a Nexus handler, an endpoint registration, and a Worker configuration change.&lt;/p&gt;

&lt;p&gt;Here's what the Nexus handler looks like - it backs the operation with a long-running workflow so retries reuse the existing workflow instead of creating duplicates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@OperationImpl&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;OperationHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ComplianceRequest&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ComplianceResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;checkCompliance&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;WorkflowRunOperation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromWorkflowHandle&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;WorkflowClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Nexus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOperationContext&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getWorkflowClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;ComplianceWorkflow&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newWorkflowStub&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;ComplianceWorkflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;WorkflowOptions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTaskQueue&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"compliance-risk"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setWorkflowId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"compliance-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTransactionId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;WorkflowHandle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromWorkflowMethod&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;wf:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tutorial includes a durability checkpoint: you kill the Compliance Worker mid-transaction, restart it, and watch the payment resume exactly where it left off. No custom retry logic to write, no data loss across the namespace boundary. Java SDK, runs entirely on Temporal's dev server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.temporal.io/tutorials/nexus/nexus-sync-tutorial/?utm_source=enterprise-dev-rel&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nexus-sync-tutorial&amp;amp;utm_content=devto-launch" rel="noopener noreferrer"&gt;&lt;strong&gt;Try it&lt;/strong&gt; &lt;/a&gt;&lt;/p&gt;

</description>
      <category>temporal</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>2025 — Part 2</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Tue, 18 Nov 2025 16:55:03 +0000</pubDate>
      <link>https://dev.to/temporalio/2025-part-2-2eoi</link>
      <guid>https://dev.to/temporalio/2025-part-2-2eoi</guid>
      <description>&lt;p&gt;(&lt;a href="https://dev.to/temporalio/2025-3dmg"&gt;Part 1&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;Company&lt;/h2&gt;

&lt;p&gt;At the time of my last update, the company had 116 people. Now we are over 300. The Go-to-Market organization is now larger than Engineering. &lt;a href="https://en.wikipedia.org/wiki/Dunbar%27s_number" rel="noopener noreferrer"&gt;Some studies&lt;/a&gt; claim that our ancestors couldn’t handle tribes of over about 150 people. We are definitely past the point when one could know every employee. The loss of intimacy is offset by the feeling that we now have resources — a growing number of teams focusing on different areas while collaborating on cross-group efforts.&lt;/p&gt;

&lt;p&gt;With such growth, we are doubling down on our efforts to foster and reemphasize consistency in our hiring practices, decision-making, behavioral patterns, and rules of engagement, otherwise referred to as values and culture. In my previous life within a huge corporation, those things generally made sense to me, but they also felt somewhat artificial and performative. Within the context of a small company with a relatively flat structure, it feels very different — much closer to home. This makes me genuinely attentive to such aspects and eager to contribute where I can. Just recently, we rolled out our &lt;a href="https://temporal.io/careers" rel="noopener noreferrer"&gt;updated values&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z753za1mzrzrau64rbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z753za1mzrzrau64rbd.png" alt="Company" width="800" height="527"&gt;&lt;/a&gt;&lt;br&gt;
My impression is that at least half of the VC money these days goes to companies with corporate domains ending in “.ai,” and aside from that, funding isn’t easy. We raised our &lt;a href="https://temporal.io/blog/temporal-series-c-announcement" rel="noopener noreferrer"&gt;C round&lt;/a&gt; early this year with a very good, some say almost exceptional, multiple. This tells us that the investors have a strong conviction about our product, business model, and growth. I’m no VC, but I see how they are impressed with the quality of the use cases and the caliber of customers that come to our cloud. I hope they know better than I do how to assess and evaluate such factors. Since the C round, we’ve also had a &lt;a href="https://temporal.io/blog/temporal-raises-secondary-funding" rel="noopener noreferrer"&gt;secondary round&lt;/a&gt; that pushed the company’s valuation significantly higher.&lt;/p&gt;

&lt;p&gt;Keeping the hiring bar high continues to be a top priority. With the turmoil in the job market and Temporal becoming a better-known brand, we now have access to a larger pool of high-quality engineering talent. The interview process is still more art than science, and scaling and improving this art as the company grows is a challenge by itself. Hiring at the junior levels has its own difficulties. Recently, we had to close an open SDE 1 position after only a few hours because, during that time, we received more than 3,000 applications. We found that the old recipe still works well — filling junior positions via internships.&lt;/p&gt;

&lt;p&gt;We are still fully remote, with WeWork as an option for folks who want to come into the office. We are geo-distributed but not very balanced. Most of Engineering is on the U.S. West Coast, with roughly a tie between the Seattle and Bay areas. Smaller pockets are in Colorado, North Carolina, and the cities of New York, Chicago, Toronto, and Vancouver. The GTM team has its own distribution. My impression is they are more heavily tilted toward the East Coast.&lt;/p&gt;

&lt;p&gt;We settled on an annual all-company offsite (we started with twice a year). We complemented it with smaller team offsites and are now aggregating them into an annual R&amp;amp;D offsite, side by side with GTM’s sales kickoff event. We’ll see how this goes. There doesn’t appear to be a simple solution for doing it right, and each company needs to find its own rhythm. From time to time, we leverage the West Coast’s locality for in-person meetings to discuss some critical decisions or designs. In such cases, we consciously violate the remote-first setup for the sake of high-throughput discussions and faster decision-making — at the unfair expense of colleagues who can’t attend in person and have to connect via Zoom.&lt;/p&gt;

&lt;h2&gt;Replay&lt;/h2&gt;

&lt;p&gt;It was a bold move in August 2022 to start our own annual conference. The &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlRWBrfqOOX1rN_d1mahYI70" rel="noopener noreferrer"&gt;inaugural edition&lt;/a&gt; was in Seattle. The &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlREHL7fiEKBWTp5QuFeYS2r" rel="noopener noreferrer"&gt;2023&lt;/a&gt; and &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlR0xieUwBN_nNHW0oijCZa6" rel="noopener noreferrer"&gt;2024&lt;/a&gt; editions were in Bellevue, WA, growing bigger each year. In &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlQ4Hw1U1aGxc2wH7oQ3tisp" rel="noopener noreferrer"&gt;2025&lt;/a&gt;, we held the event in London to reach audiences unlikely to travel to the U.S. Attending Replay is a very special experience. Seeing so many engineers and engineering leaders talking non-stop about your product and presenting on stage what they’ve built with it is a special kind of pleasure. I presented at all Replays but the very first one. In 2023, my talk was on the second day and I talked with folks so much before then that my voice let me down close to the end of my presentation. I guess that’s why no recording of it was published. But I gave slightly different versions of the same talk at &lt;a href="https://www.youtube.com/watch?v=LHkeXk_8Cq4" rel="noopener noreferrer"&gt;J on the Beach&lt;/a&gt; and &lt;a href="https://www.infoq.com/presentations/durable-execution-control-plane/" rel="noopener noreferrer"&gt;QCon SF&lt;/a&gt; that year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bheq35jmfjvgns7tqyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bheq35jmfjvgns7tqyy.png" alt="Replays" width="800" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://replay.temporal.io/" rel="noopener noreferrer"&gt;Replay 2026&lt;/a&gt; will be in San Francisco — at Moscone, no less. It should be epic. I’ll need to rewatch the &lt;a href="https://www.imdb.com/title/tt2575988/" rel="noopener noreferrer"&gt;Silicon Valley documentary&lt;/a&gt; before going there.&lt;/p&gt;

&lt;h2&gt;Operations&lt;/h2&gt;

&lt;p&gt;We operate a multi-million-dollar business based on a single product — Temporal Cloud. Our customers trust us with their hot-path business processes — often their most critical ones. This is an interesting phenomenon. They choose Temporal’s Durable Execution to make their applications resilient to various failures. Naturally, they first and foremost care about the reliability of their most critical services. Some choose to self-host Temporal Server with all its dependencies. Many don’t view operating such complex production machinery as their core competency, and they come to our cloud service with their most precious workloads. It is amazing and sobering at the same time when big Internet household names bring us their “crown jewels” to run — even those who have a policy of not taking a dependency on SaaS vendors in the critical path. It was eye-opening to hear, on a couple of occasions, a customer say, “We only have two external dependencies — AWS and Temporal Cloud.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5guzazuxia3xm3t2hhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5guzazuxia3xm3t2hhn.png" alt="APS" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Customer expectations are very high. Sometimes it feels like they set them higher for us than for the hyperscalers. We now have about eight engineering on-call rotations (teams), covering different areas of the system, plus one for on-call managers who coordinate across teams, and another for the Developer Success team that communicates with customers. This may seem large for our company size, but that’s the nature of the service we run.&lt;/p&gt;

&lt;p&gt;We use &lt;a href="http://incident.io" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt; for managing incidents. It integrates nicely with Slack, creates a per-incident channel, and automatically adds the current on-call engineers to it, among other things. We saw great promise in the early days of their product. They haven’t disappointed and are growing fast. Like most folks, we use &lt;a href="http://statuspage.io" rel="noopener noreferrer"&gt;statuspage.io&lt;/a&gt; for public incidents and &lt;a href="http://pagerduty.com" rel="noopener noreferrer"&gt;pagerduty.com&lt;/a&gt; for on-call paging. Incident.io also integrates with Jira to automatically turn incident follow-ups into tickets, helping us continuously improve the system.&lt;/p&gt;

&lt;h2&gt;Replication&lt;/h2&gt;

&lt;p&gt;Temporal inherited the application-level replication stack from Cadence. Over the years, we dramatically improved it and added Control Plane functionality to manage it. Initially, we used replication to transparently migrate customer namespaces from Cell to Cell. After we got it working at the level we were happy with, we exposed it to customers as high-availability options — multi-region, cross-cloud, and single-region replication.&lt;/p&gt;

&lt;p&gt;At first, few customers immediately understood why they would want to pay double (due to the duplicate hardware needed) for such a feature. Some just used it, at our suggestion, as a tool for migrating their workloads from one region or cloud provider to another. The recent &lt;a href="https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW" rel="noopener noreferrer"&gt;GCP&lt;/a&gt; and &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;AWS us-east-1&lt;/a&gt; outages vindicated the paranoid among our customers who refused to accept that “cloud regions pretty much never go down.”&lt;/p&gt;

&lt;p&gt;Customers who had replication enabled for their namespaces were able to fail over to the other region or cloud, and their applications continued executing as if nothing had happened. We discovered a few misses on our side and had to fail over some namespaces manually, with a longer delay than we expected. The important part is that replicated namespaces continued running after failover. We saw a major spike in customers setting up replication in the days after the AWS us-east-1 outage. One customer was in the process of migrating their namespace from AWS to GCP during GCP’s global outage. They weren’t impacted and didn’t even need to fail over because their active replica was still in AWS. They were considering keeping the cross-cloud replication running indefinitely after that.&lt;/p&gt;

&lt;p&gt;I gave a &lt;a href="https://www.youtube.com/watch?v=DM68pz5ysWE" rel="noopener noreferrer"&gt;conceptual talk&lt;/a&gt; about replicated namespaces, but the topic probably deserves its own post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=DM68pz5ysWE" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyit8wprc7xqny4e6zn6s.png" alt="Fourth Little Pig" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Road ahead&lt;/h2&gt;

&lt;p&gt;With great opportunities come great responsibility and pressure to execute and realize those opportunities. We still have to strike the right balance between running a highly reliable service and investing in new functionality. It’s a deeply humbling experience to see that some of the world’s top companies — household names with tens or even hundreds of millions of users — take an all-in dependency on Temporal Cloud. This leaves no room for hubris, complacency, or sloppiness. We have to keep pushing the reliability and quality bar higher without hampering further development of the product.&lt;/p&gt;

&lt;p&gt;I don’t believe there’s a general recipe for how to grow an organization, be it engineering, R&amp;amp;D, or the whole company. We’ll have to navigate our own path — growing sustainably while preserving what has made us successful so far and learning new ways in parallel. It’s exciting and somewhat dizzying at the same time. Yet I feel we are still only getting started.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>ai</category>
      <category>microservices</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>2025</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Wed, 12 Nov 2025 21:56:41 +0000</pubDate>
      <link>https://dev.to/temporalio/2025-3dmg</link>
      <guid>https://dev.to/temporalio/2025-3dmg</guid>
      <description>&lt;h2&gt;Part 1&lt;/h2&gt;

&lt;p&gt;Belated update. Yes, it’s been five years, can’t believe it myself. What’s the “delta” for the last three years that flew by too fast?&lt;br&gt;
Certain things haven’t changed much. We are still under the “dual mandate” — OSS server, SDKs (clients), CLI, and a whole bunch of other peripheral software, plus a cloud service where we charge customers for running and managing the invisible infrastructure so that they don’t have to. My focus continues to be primarily on the cloud side.&lt;/p&gt;

&lt;p&gt;At the same time, obviously, everything has changed — some things even multiple times. We went through COVID with its obligatory work-from-home setup, only for many companies to start imposing, some more gradually than others, return-to-office policies. I interviewed a number of candidates recently who moved away from major tech hubs during COVID and had to leave their jobs because of the RTO push. We are still fully remote.&lt;/p&gt;

&lt;p&gt;Most of the Big Tech companies went through rounds of mass layoffs — a tectonic shift from the previous 20 or so years of competing for talent and outbidding each other in offers. Startups suddenly became much more attractive for Big Tech employees who were previously reluctant to take the risk of leaving their well-paid jobs. At the same time, startup founders faced the funding drought starting in late 2021 and early 2022 caused by interest rate changes. Many had to close or fire-sell their ventures where just a couple of years earlier, cheap money seemed unlimited.&lt;/p&gt;

&lt;h2&gt;Product&lt;/h2&gt;

&lt;p&gt;Developers choose Temporal for its programming model. They experience it in a language of their choice via Temporal SDKs. We started with two languages. Now we support seven: Go, Java, TypeScript, Python, .NET, PHP, and Ruby. Four of them are built on the same Core SDK written in Rust. No, we still don’t have an official Rust SDK.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqek4rva5d7fzlv1xwim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqek4rva5d7fzlv1xwim.png" alt="SDKs" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We started Temporal Cloud by hosting the OSS Temporal Server with an added layer of security and multi-tenancy. The original value proposition included that, plus general operational concerns such as monitoring, alerting, configuration, upgrades, and scale. We’ve been investing along several dimensions since then and are now running the fifth-generation Cells (&lt;a href="https://www.youtube.com/watch?v=KvxAz5HwBpc" rel="noopener noreferrer"&gt;Temporal Cloud “clusters”&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Building a &lt;a href="https://www.youtube.com/watch?v=SQv9ot-jB6o" rel="noopener noreferrer"&gt;custom storage layer&lt;/a&gt; between the server and database to absorb reads and coalesce writes was one of the first bold undertakings. Rolling it out to production over the course of 2023 gave us a significant increase in reliability, performance, and scalability compared to the vanilla OSS server. Another major investment was making it possible to incrementally add multiple databases to a running server. With these improvements, the scenario I mentioned in one of my previous posts, a major customer needing to increase their already outsized traffic level up to 10x for a day, became routine. At the end of 2021, a day like that was a big deal for both companies, with teams of engineers monitoring the system, communicating live, and taking action. The subsequent occurrences became increasingly “boring” and turned into non-events.&lt;/p&gt;

&lt;p&gt;On the authentication/authorization dimension, we went from initially supporting only &lt;a href="https://docs.temporal.io/cloud/certificates" rel="noopener noreferrer"&gt;mTLS&lt;/a&gt; and Google SSO to adding &lt;a href="https://docs.temporal.io/cloud/api-keys" rel="noopener noreferrer"&gt;API keys&lt;/a&gt;, &lt;a href="https://docs.temporal.io/cloud/service-accounts" rel="noopener noreferrer"&gt;service accounts&lt;/a&gt;, &lt;a href="https://docs.temporal.io/cloud/saml" rel="noopener noreferrer"&gt;SAML&lt;/a&gt;, &lt;a href="https://docs.temporal.io/cloud/user-groups" rel="noopener noreferrer"&gt;SCIM&lt;/a&gt;, and a bunch of other features critical for enterprise — and not only enterprise — customers.&lt;/p&gt;

&lt;p&gt;We started with prospective cloud customers filling out a form and getting contacted by our sales team to complete the paperwork, after which we created an account on their behalf. Embarrassing. Now, we have a complete &lt;a href="https://temporal.io/get-cloud" rel="noopener noreferrer"&gt;self-signup process&lt;/a&gt; that guides prospective customers along the path, with a full PLG motion behind it. When we opened up Temporal Cloud to the world, we were missing a number of table-stakes features. At the time, I called the bar we had to meet a “reasonable cloud service.” I believe we passed this milestone 12–18 months ago.&lt;/p&gt;

&lt;p&gt;I like that we don’t play licensing games (our OSS is under the MIT license) and instead extend and enhance it with proprietary features to differentiate our cloud offering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5b73uredlrwamqsz8nut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5b73uredlrwamqsz8nut.png" alt="Self-signup" width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We launched Temporal Cloud in 2022 with support for AWS only. We added GCP in 2025 and are working on bringing in Azure, the last of the big three providers. Even though support for Kubernetes clusters across them is similar, most of the integration effort goes into their disparate security and resource hierarchy models, differences in networking, and subtle behavioral differences in their seemingly compatible APIs — for example, &lt;a href="https://airbyte.com/data-engineering-resources/s3-gcs-and-azure-blob-storage-compared" rel="noopener noreferrer"&gt;GCS vs. S3&lt;/a&gt;. Recently, we’ve been chasing GCP load balancers mysteriously ghosting a fraction of the connections. Support for hosted Elasticsearch is another headache — only AWS has it, but in the form of OpenSearch, their fork of ES from before Elastic changed its license.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy27c2ww3nn32lyir9fm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy27c2ww3nn32lyir9fm.png" alt="Multi-region" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AI
&lt;/h2&gt;

&lt;p&gt;The agentic AI “storm” turned into a sudden tailwind for Temporal. The very nature of such applications — being stateful, depending on a significant number of semi-reliable API calls to external services, and taking seconds to minutes to execute — made code-first Durable Execution a compelling programming model for this fast-moving, massive herd. While there are still some rough edges for AI use cases in the near term (such as payload and history size limits and required determinism of workflow code), the immediate benefits — high-velocity development of much more reliable code in the language of your choice, guaranteed scalability, and unparalleled visibility into execution for debugging — &lt;a href="https://docs.temporal.io/ai-cookbook" rel="noopener noreferrer"&gt;keep bringing AI-focused companies&lt;/a&gt; to Temporal. More traditional businesses that are scrambling to integrate AI into their systems do the same. I was told that as of late 2024, out of the top 20 AI companies, only two were aware of Temporal — and now, 16 of them already run Temporal-based apps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nexus
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstb7215vb88u6fxdijok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstb7215vb88u6fxdijok.png" alt="Nexus" width="84" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This year we launched the initial version of &lt;a href="https://docs.temporal.io/evaluate/nexus" rel="noopener noreferrer"&gt;Nexus&lt;/a&gt;, an &lt;a href="https://github.com/nexus-rpc/api" rel="noopener noreferrer"&gt;open-standard-based protocol&lt;/a&gt; for APIs that may take arbitrarily long to complete. I think of it as a great frontend layer for Durable Execution. But the protocol itself is implementation-agnostic. One could implement it using more traditional tools and approaches, for example, within the paradigm of event-driven architecture. The idea was conceived in the early days of Temporal. We started talking about it publicly in 2022, only to do nothing for another year due to other priorities.&lt;/p&gt;

&lt;p&gt;We believe that Nexus is an immense opportunity to integrate systems and services in a new, powerful way. Nexus deserves a dedicated post, and I’m contemplating a conference talk about how the combination of Durable Execution and Nexus could define a major evolution of the &lt;a href="https://martinfowler.com/microservices/" rel="noopener noreferrer"&gt;Microservice Architecture&lt;/a&gt;. I understand this is a very bold statement, but sometimes you have to shoot for the Moon.&lt;/p&gt;

&lt;p&gt;(Continued in &lt;a href="https://dev.to/temporalio/2025-part-2-2eoi"&gt;Part 2&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>ai</category>
      <category>microservices</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Building Durable Cloud Control Systems with Temporal</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Sat, 09 Aug 2025 00:47:09 +0000</pubDate>
      <link>https://dev.to/temporalio/building-durable-cloud-control-systems-with-temporal-5l7</link>
      <guid>https://dev.to/temporalio/building-durable-cloud-control-systems-with-temporal-5l7</guid>
      <description>&lt;p&gt;In today’s world of managed cloud services, delivering exceptional user experiences often requires rethinking traditional architecture and operational strategies. At Temporal, we faced this challenge head-on, navigating complex decisions about tenancy models, resource management, and durable execution to build a reliable, scalable cloud service. This post explores our approach and the lessons we learned while creating &lt;a href="https://temporal.io/cloud" rel="noopener noreferrer"&gt;Temporal Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case for Managed Cloud Services
&lt;/h2&gt;

&lt;p&gt;Managed services have become the default for delivering hosted solutions to customers. Whether it’s a database, queueing system, or another server-side technology, hosting a service not only provides a better user experience but also opens doors for monetization, especially for open-source projects. The challenge is how to do it effectively while maintaining reliability and scalability.&lt;/p&gt;

&lt;p&gt;One of the first decisions we made was about tenancy models. Should we pursue single-tenancy — provisioning dedicated clusters for each customer — or opt for &lt;a href="https://docs.temporal.io/evaluate/development-production-features/multi-tenancy" rel="noopener noreferrer"&gt;multi-tenancy&lt;/a&gt;, which allows multiple customers to share the same resources? While single-tenancy offers simplicity and isolation, its inefficiencies quickly become apparent. Customers end up paying for unused capacity, and providers shoulder higher operational costs. Multi-tenancy, though harder to implement, emerged as the clear winner. It optimizes resource usage, allows customers to pay for actual usage, and creates shared headroom for handling traffic spikes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Plane vs. Control Plane: Defining Responsibilities
&lt;/h2&gt;

&lt;p&gt;Separating a managed service into a data plane and a control plane is an industry best practice, and we followed it, clearly defining and implementing each plane’s distinct role within our cloud architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Plane&lt;/strong&gt;: This is where the actual work happens — processing transactions, executing workflows, and handling customer data. It must maintain high availability, low latency, and resilience to failures. For Temporal Cloud, we adopted a cell-based architecture to isolate resources and minimize the blast radius of potential failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control Plane&lt;/strong&gt;: This acts as the brain of the system, managing resources, provisioning namespaces, and handling configurations. While its performance is less critical than the data plane, reliability here still matters for customer experience. For instance, provisioning a namespace may not be urgent, but delays or errors in this process can frustrate users.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementing the Data Plane: A Cell-Based Architecture
&lt;/h2&gt;

&lt;p&gt;For the data plane, we applied a cell-based architecture to achieve strong isolation and scalability. Each cell operates as a self-contained unit with its own AWS account, VPC, EKS cluster, and supporting infrastructure. While this approach is framed within the context of AWS, we have applied the same principles to Google Cloud Platform (GCP), leveraging its equivalent primitives to ensure consistency and reliability across cloud providers. This approach ensures that failures or updates in one cell do not impact others, reducing the risk of cascading outages.&lt;/p&gt;

&lt;p&gt;Each cell in Temporal Cloud includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compute Pods&lt;/strong&gt;: Running Temporal services and infrastructure tools for observability, ingress management, and certificate handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databases&lt;/strong&gt;: Both primary databases and Elasticsearch for enhanced visibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Additional Components&lt;/strong&gt;: Load balancers, private connectivity endpoints, and other supporting infrastructure that ensures smooth operation and integration across environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently, Temporal Cloud operates across 14 AWS regions, and we’ve also added support for GCP. This architecture allows us to meet the diverse needs of our customers while maintaining reliability at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Durable Execution: The Foundation of the Control Plane
&lt;/h2&gt;

&lt;p&gt;Building the control plane presented its own set of challenges, particularly around reliability and maintainability. Control plane tasks, such as provisioning namespaces or rolling out updates, involve complex long-running processes with many interdependent steps. Writing this logic as traditional, ad-hoc code often leads to brittle systems that are hard to debug and evolve.&lt;/p&gt;

&lt;p&gt;This is where Temporal’s &lt;a href="https://temporal.io/blog/building-reliable-distributed-systems-in-node-js-part-2" rel="noopener noreferrer"&gt;durable execution&lt;/a&gt; model shines. Designed based on experience with earlier systems like AWS Simple Workflow Service and Azure Durable Functions, Temporal’s approach separates business logic from state management and failure handling. Developers can write workflows as straightforward, happy-path code without worrying about retries, error handling, or state persistence. The system automatically manages these concerns, allowing workflows to seamlessly recover from failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Namespace Provisioning: A Real-World Example
&lt;/h2&gt;

&lt;p&gt;Consider the process of creating a new namespace in Temporal Cloud. When a user clicks “Create Namespace” on the web interface, the control plane orchestrates a series of tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Selecting a suitable cell within the chosen region.&lt;/li&gt;
&lt;li&gt;Creating database records and roles.&lt;/li&gt;
&lt;li&gt;Generating and provisioning mTLS certificates.&lt;/li&gt;
&lt;li&gt;Configuring ingress routes and verifying connectivity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step involves external API calls, DNS propagation, and other potential points of failure.&lt;/p&gt;

&lt;p&gt;Without durable execution, managing retries, backoffs, and state persistence would result in a tangle of brittle code. With Temporal, these tasks are encapsulated in workflows, which transparently handle retries and maintain state across failures. Developers can focus on the high-level logic, confident that the system will handle the edge cases.&lt;/p&gt;
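&lt;p&gt;A minimal sketch of the idea in plain Python (a toy illustration with made-up step names, not our actual provisioning code or the Temporal SDK): each step’s result is journaled before the workflow advances, so a crashed run resumes at the failed step rather than starting over.&lt;/p&gt;

```python
STEPS = ["select_cell", "create_db_records", "provision_certs", "configure_ingress"]

def provision_namespace(journal, execute):
    # Run the steps in order, skipping any step the journal already shows
    # as complete, so a crashed run resumes exactly where it left off.
    for step in STEPS:
        if step not in journal:
            journal[step] = execute(step)   # persisted before advancing
    return journal

journal, executed = {}, []
attempts = {"provision_certs": 0}

def flaky_execute(step):
    executed.append(step)
    if step == "provision_certs":
        attempts[step] += 1
        if attempts[step] == 1:
            raise RuntimeError("certificate authority timed out")
    return "done"

try:
    provision_namespace(journal, flaky_execute)   # first run dies mid-flight
except RuntimeError:
    pass
provision_namespace(journal, flaky_execute)       # retry resumes at step 3
```

In the real system the journal is Temporal’s event history and the retry is automatic, but the shape of the business logic is the same: a linear list of steps with no recovery plumbing.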

&lt;h2&gt;
  
  
  Rolling Upgrades: Ensuring Safe Deployments
&lt;/h2&gt;

&lt;p&gt;Another common control plane scenario is rolling out updates to the Temporal Cloud fleet. Our deployment strategy organizes cells into deployment rings, progressing from pre-production environments to customer-facing cells that serve increasingly high-priority traffic.&lt;/p&gt;

&lt;p&gt;The rollout process is carefully staged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ring 0&lt;/strong&gt;: Synthetic traffic only, no customer impact. Changes are monitored here for at least a week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ring 1&lt;/strong&gt;: Low-priority traffic namespaces, allowing for additional testing with minimal risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher Rings&lt;/strong&gt;: Gradually expanding to critical, high-priority traffic customers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Within each ring, updates are applied in batches, with pauses between batches to observe for potential issues like memory leaks or race conditions. Temporal workflows handle this process, ensuring that even long-running deployments (which can span weeks) are resilient to failures or restarts.&lt;/p&gt;
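&lt;p&gt;In plain Python, the staged rollout loop can be sketched like this (a toy, not our actual deployment workflow; in Temporal the pauses would be durable timers and the health checks would be activities):&lt;/p&gt;

```python
def roll_out(rings, apply_update, observe, batch_size=2):
    # Walk the rings in priority order; within each ring, update cells in
    # batches and observe between batches before proceeding.
    updated = []
    for ring in rings:
        for start in range(0, len(ring), batch_size):
            batch = ring[start:start + batch_size]
            for cell in batch:
                apply_update(cell)
                updated.append(cell)
            if not observe(batch):
                return updated, False   # regression spotted: halt the rollout
    return updated, True

rings = [
    ["ring0-synthetic"],                # Ring 0: synthetic traffic only
    ["ring1-a", "ring1-b"],             # Ring 1: low-priority namespaces
    ["prod-a", "prod-b", "prod-c"],     # Higher rings: critical traffic
]
done, ok = roll_out(rings, apply_update=lambda cell: None, observe=lambda batch: True)
```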

&lt;h2&gt;
  
  
  Entity Workflows: A Powerful Pattern
&lt;/h2&gt;

&lt;p&gt;Temporal’s durable execution also enables powerful patterns like entity workflows. These are workflows tied to specific resources, such as cells or namespaces, providing a natural way to model state and operations. For example, each cell in Temporal Cloud has an entity workflow that manages its lifecycle, from provisioning to upgrades. This approach ensures consistency and simplifies concurrency control.&lt;/p&gt;
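&lt;p&gt;Here’s a toy Python sketch of the pattern (not the Temporal SDK; in a real entity workflow the inbox would be fed by signals and the loop would run durably): one long-lived loop owns a cell’s state and serializes every operation applied to it, which is what gives the pattern its simple concurrency story.&lt;/p&gt;

```python
from collections import deque

class CellEntity:
    # Toy entity "workflow" for one cell: a long-lived loop owns the cell's
    # state and serializes the operations (signals) sent to it.
    TRANSITIONS = {
        ("provisioning", "provisioned"): "running",
        ("running", "upgrade"): "upgrading",
        ("upgrading", "upgraded"): "running",
    }

    def __init__(self, cell_id):
        self.cell_id = cell_id
        self.state = "provisioning"
        self.inbox = deque()

    def signal(self, command):
        self.inbox.append(command)   # in Temporal, a signal to the workflow

    def run_until_idle(self):
        while self.inbox:
            command = self.inbox.popleft()
            # Commands invalid for the current state are simply ignored here.
            self.state = self.TRANSITIONS.get((self.state, command), self.state)

cell = CellEntity("cell-7")
for command in ["provisioned", "upgrade", "upgraded"]:
    cell.signal(command)
cell.run_until_idle()
```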

&lt;h2&gt;
  
  
  Developer Happiness and Productivity
&lt;/h2&gt;

&lt;p&gt;One of the biggest benefits of Temporal’s approach is the impact on developer experience. By eliminating the need to write boilerplate code for retries, backoffs, and state management, developers can focus on delivering business value. Temporal’s built-in tools for observing and debugging workflows further enhance productivity, making it easier to understand and troubleshoot complex systems.&lt;/p&gt;

&lt;p&gt;Happy developers are productive developers, and Temporal’s approach fosters this by reducing the cognitive load and frustration associated with traditional workflow coding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Durable Execution Matters
&lt;/h2&gt;

&lt;p&gt;Durable execution is more than a technical innovation; it’s a paradigm shift for building cloud-native systems. By decoupling business logic from state management and failure handling, Temporal empowers developers to build reliable, scalable systems with less effort. Whether you’re managing control planes, provisioning resources, orchestrating complex workflows, performing money transfers, training AI models, or processing social media posts, this approach delivers clear benefits.&lt;/p&gt;

&lt;p&gt;At Temporal, we’ve seen firsthand how durable execution transforms the development process, enabling us to deliver a robust managed service that scales with our customers’ needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready to Transform Your Control Plane?
&lt;/h2&gt;

&lt;p&gt;Temporal isn’t just a tool for building cloud systems; it’s a better way to think about workflows and application architecture. If you’re building or planning a managed cloud service, consider how durable execution can simplify your journey and unlock new possibilities. For more insights into our approach, check out &lt;a href="https://www.infoq.com/presentations/durable-execution-control-plane/" rel="noopener noreferrer"&gt;my full talk at QCon&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>control</category>
      <category>service</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Why Top Developers Prioritize Failure Management</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Sat, 09 Aug 2025 00:35:05 +0000</pubDate>
      <link>https://dev.to/temporalio/why-top-developers-prioritize-failure-management-lj6</link>
      <guid>https://dev.to/temporalio/why-top-developers-prioritize-failure-management-lj6</guid>
      <description>&lt;p&gt;There’s a saying: “Amateurs study tactics, while professionals study logistics.” In software, this translates to: “Amateurs focus on algorithms, while professionals focus on failures.”&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://jonthebeach.com/" rel="noopener noreferrer"&gt;J on the Beach&lt;/a&gt;, I took time in my &lt;a href="https://www.youtube.com/watch?v=pMfMm2eD3GM" rel="noopener noreferrer"&gt;talk&lt;/a&gt; to expand on this saying and explain that real-world systems don’t just need code that works on the “happy path” — they need a safety net for when things go wrong.&lt;/p&gt;

&lt;p&gt;Modern software development has layers of complexity. You’re not just writing code; you’re connecting systems across time and space, handling data that doesn’t sleep, and ensuring flawless performance at scale. What sets top developers apart is how they manage failures. Building resilience focuses on ensuring reliability when things inevitably go wrong, not just maintaining uptime.&lt;/p&gt;

&lt;p&gt;In this post, we’ll walk through three common approaches to handling failures in software, each with its own strengths and weaknesses. Then we’ll introduce Temporal’s approach, workflow-as-code, which makes it easier to build reliability into your systems from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Ways to Handle Failure in Your Software
&lt;/h2&gt;

&lt;p&gt;Failures are inevitable in your distributed systems. When a network link fails, a server times out, or a service crashes, systems need strategies to respond properly and ensure that your operations remain reliable.&lt;/p&gt;

&lt;p&gt;Below, we’ll explore three common approaches to coordination between systems — Remote Procedure Calls (RPCs), persistent queues, and workflows — and their relationship to failure management.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Request-Response (RPC)
&lt;/h3&gt;

&lt;p&gt;The request-response, or RPC model, is a classic approach. A client makes a request, the server processes it, and sends back a response. In the best-case scenario — the “happy path” — everything works smoothly. Imagine a money transfer request: one service debits the sender while another credits the receiver. If all goes as planned, the transfer completes with no issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of the RPC Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simplicity&lt;/strong&gt;: The direct client-server connection makes this model easy to implement for straightforward workflows.&lt;br&gt;
&lt;strong&gt;Efficiency on the “happy path”&lt;/strong&gt;: When things go smoothly, RPC provides fast, efficient responses and low latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons of the RPC Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited resilience for partial failures&lt;/strong&gt;: If the client’s request succeeds but a response isn’t received, or a step in the process fails, RPC often requires extensive error-handling code on the client side.&lt;br&gt;
&lt;strong&gt;Heavy client burden&lt;/strong&gt;: Clients must handle errors, recovery, and retries, complicating systems as they scale.&lt;/p&gt;

&lt;p&gt;The RPC model works well for simple, synchronous tasks. However, it falls short on resilience, placing the onus on the developers of RPCs, and on those consuming them, to manage every failure scenario — and this is no trivial matter.&lt;/p&gt;
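&lt;p&gt;To see the client-side burden concretely, here’s a toy Python sketch (hypothetical service names, not any real API): the retry loop, the ambiguity after a timeout, and the give-up path all live in the caller.&lt;/p&gt;

```python
def transfer_rpc(attempt_log):
    # Simulated remote call: times out on the first two attempts.
    attempt_log.append(1)
    if len(attempt_log) != 3:
        raise TimeoutError("no response from credit service")
    return "transfer-complete"

def client_side_transfer(max_attempts=5):
    # Under plain RPC, all of this ceremony lives in the client.
    attempt_log = []
    for _ in range(max_attempts):
        try:
            return transfer_rpc(attempt_log)
        except TimeoutError:
            # Ambiguity: did the debit land? Was the credit applied? The
            # client cannot tell; it retries and hopes the call is idempotent.
            continue
    raise RuntimeError("transfer failed after retries")

result = client_side_transfer()
```

Multiply this boilerplate across every call site in a system, and the scaling problem becomes clear.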

&lt;h3&gt;
  
  
  2. Persistent Queues
&lt;/h3&gt;

&lt;p&gt;Persistent queues add a degree of flexibility by decoupling the client from the server. Messages are placed in a queue, and the system processes them asynchronously. Queues help distribute workloads: they support automatic retries and asynchronous processing, which can smooth out demand spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of Persistent Queues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic retries&lt;/strong&gt;: Persistent queues often support automatic retries, attempting tasks multiple times if they initially fail.&lt;br&gt;
&lt;strong&gt;Load distribution&lt;/strong&gt;: Queues smooth processing under heavy loads, distributing requests over time to improve system reliability.&lt;br&gt;
&lt;strong&gt;Producer-consumer separation&lt;/strong&gt;: Decoupling producers and consumers allows the queue to function independently, improving fault tolerance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons of Persistent Queues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loss of ordering&lt;/strong&gt;: Since queues process messages independently, tasks may execute out of order, causing unexpected issues for dependent operations.&lt;br&gt;
&lt;strong&gt;Dead-letter queues&lt;/strong&gt;: Tasks that continuously fail may require a separate “dead-letter” queue, adding complexity and, typically, manual intervention.&lt;br&gt;
&lt;strong&gt;Limited visibility into status&lt;/strong&gt;: Visibility becomes even more challenging when you have systems that use multiple queues, requiring additional tooling and infrastructure.&lt;/p&gt;

&lt;p&gt;Queues work well when you need flexibility and decoupling, but they lack the control and visibility needed for comprehensive failure management.&lt;/p&gt;
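&lt;p&gt;A tiny Python sketch (a toy, not any particular broker) shows both the retry behavior and the ordering and dead-letter trade-offs in one place:&lt;/p&gt;

```python
from collections import deque

def process_with_dlq(tasks, handler, max_attempts=3):
    # Toy persistent queue: retry failed tasks, shunting repeat offenders
    # to a dead-letter queue for manual inspection.
    queue = deque((task, 0) for task in tasks)
    done, dead_letter = [], []
    while queue:
        task, attempts = queue.popleft()
        try:
            done.append(handler(task))
        except Exception:
            if attempts + 1 >= max_attempts:
                dead_letter.append(task)            # poison message: park it
            else:
                queue.append((task, attempts + 1))  # requeued: ordering is lost
    return done, dead_letter

def handler(task):
    if task == "bad":
        raise ValueError("cannot process")
    return task.upper()

done, dlq = process_with_dlq(["a", "bad", "b"], handler)
```

Note how the failing task is retried after later tasks and eventually lands in the dead-letter list, where someone (or some tool) still has to deal with it.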

&lt;h3&gt;
  
  
  3. Workflows
&lt;/h3&gt;

&lt;p&gt;Workflows provide a robust solution for orchestrating complex processes across distributed systems. Unlike RPC or queue-based models, workflows manage retries, state, and error handling automatically, making them ideal for long-running or multi-step processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of Workflows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in resilience&lt;/strong&gt;: Workflows handle retries, recovery, and compensation steps automatically, reducing the need for custom error-handling code.&lt;br&gt;
&lt;strong&gt;Support for long-running processes&lt;/strong&gt;: Workflows accommodate processes that span minutes, hours, or even days, making them well-suited for complex tasks.&lt;br&gt;
&lt;strong&gt;Enhanced visibility&lt;/strong&gt;: Workflow systems enable real-time tracking and querying, so both clients and developers can see exactly where each process stands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons of Workflows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure requirements&lt;/strong&gt;: Workflows require solid infrastructure to manage state, retries, and tracking, which some teams may lack.&lt;br&gt;
&lt;strong&gt;Setup complexity&lt;/strong&gt;: Workflow systems can be complex to set up, especially when building custom solutions.&lt;/p&gt;

&lt;p&gt;For complex processes that demand reliability and transparency, workflows provide the most comprehensive solution, though they require dedicated infrastructure to deploy effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resilience Without Extra Overhead
&lt;/h2&gt;

&lt;p&gt;At Temporal, we addressed these challenges by designing a platform that handles resilience, error handling, and state management so you don’t have to.&lt;/p&gt;

&lt;p&gt;With Temporal, you write workflows as code: no extra XML, JSON, or YAML definitions of workflow logic that are difficult to understand and debug down the line. Define your steps in regular code, and Temporal does the rest, managing retries, maintaining state, and ensuring that your workflows are reliable and simple to create.&lt;/p&gt;
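&lt;p&gt;As a rough illustration (plain Python, not the Temporal SDK), compare a definition-as-data workflow with the same logic written as ordinary code:&lt;/p&gt;

```python
# Definition-as-data, the JSON/YAML-engine style: opaque to debuggers,
# and awkward the moment you need a conditional.
transfer_dsl = {
    "steps": [
        {"action": "debit", "account": "alice", "amount": 50},
        {"action": "credit", "account": "bob", "amount": 50},
    ]
}

# Workflow-as-code: the same logic as an ordinary function. A platform
# like Temporal persists each step and retries failures behind the scenes.
def transfer(ledger, src, dst, amount):
    if ledger[src] >= amount:   # real control flow, not a DSL construct
        ledger[src] -= amount
        ledger[dst] += amount
        return "completed"
    return "insufficient-funds"

ledger = {"alice": 100, "bob": 0}
status = transfer(ledger, "alice", "bob", 50)
```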

&lt;p&gt;Companies like &lt;a href="https://temporal.io/resources/case-studies/anz-story" rel="noopener noreferrer"&gt;ANZ Bank&lt;/a&gt;, one of the largest banks in the Asia-Pacific region, rely on Temporal to strengthen the resilience and reliability of critical financial processes. With Temporal, ANZ orchestrates and manages complex operations across distributed systems, ensuring tasks are retried automatically, failures are handled, and long-running processes are tracked seamlessly. This has enabled ANZ to boost system reliability, reduce operational complexity, and uphold strict compliance standards in their high-stakes FinServ environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Management Is a Strategy, Not a Setback
&lt;/h2&gt;

&lt;p&gt;Any complex system will encounter failures. But how you handle those failures makes all the difference. For developers, focusing on failure management from the start distinguishes exceptional teams from average ones. Building resilience into your system sets your project up for long-term success.&lt;/p&gt;

</description>
      <category>reliability</category>
      <category>distributed</category>
      <category>failure</category>
    </item>
    <item>
      <title>Time-Travel Debugging Production Code</title>
      <dc:creator>Loren 🤓</dc:creator>
      <pubDate>Tue, 08 Aug 2023 07:03:31 +0000</pubDate>
      <link>https://dev.to/temporalio/time-travel-debugging-production-code-4m6o</link>
      <guid>https://dev.to/temporalio/time-travel-debugging-production-code-4m6o</guid>
      <description>&lt;p&gt;In this post, I’ll give an overview of time travel debugging (what it is, its history, how it’s implemented) and show how it relates to debugging your production code.&lt;/p&gt;

&lt;p&gt;Normally, when we use debuggers, we set a breakpoint on a line of code, we run our code, execution pauses on our breakpoint, we look at values of variables and maybe the call stack, and then we manually step forward through our code's execution. In &lt;em&gt;time-travel debugging&lt;/em&gt;, also known as &lt;em&gt;reverse debugging&lt;/em&gt;, we can step backward as well as forward. This is powerful because debugging is an exercise in figuring out what happened: traditional debuggers are good at telling you what your program is doing right now, whereas time-travel debuggers let you see what happened. You can wind back to any line of code that executed and see the full program state at any point in your program’s history.&lt;/p&gt;

&lt;h2&gt;
  
  
  History and current state
&lt;/h2&gt;

&lt;p&gt;It all started with Smalltalk-76, developed in 1976 at &lt;a href="https://en.wikipedia.org/wiki/PARC_(company)"&gt;Xerox PARC&lt;/a&gt;. (&lt;a href="https://en.wikipedia.org/wiki/Graphical_user_interface"&gt;Everything&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Computer_mouse"&gt;started&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Ethernet"&gt;at&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/WYSIWYG"&gt;PARC&lt;/a&gt; 😄.) It had the ability to retrospectively inspect checkpointed places in execution. Around 1980, MIT added a "retrograde motion" command to its &lt;a href="https://en.wikipedia.org/wiki/Dynamic_debugging_technique"&gt;DDT debugger&lt;/a&gt;, which gave a limited ability to move backward through execution. In a 1995 paper, MIT researchers released ZStep 95, the first true reverse debugger, which recorded all operations as they were performed and supported stepping backward, reverting the system to the previous state. However, it was a research tool and not widely adopted outside academia. &lt;/p&gt;

&lt;p&gt;ODB, the &lt;a href="https://omniscientdebugger.github.io/ODBUserManual.html"&gt;Omniscient Debugger&lt;/a&gt;, was a Java reverse debugger that was introduced in 2003, marking the first instance of time-travel debugging in a widely used programming language. &lt;a href="https://en.wikipedia.org/wiki/GNU_Debugger"&gt;GDB&lt;/a&gt; (perhaps the most well-known command-line debugger, used mostly with C/C++) added it in 2009.&lt;/p&gt;

&lt;p&gt;Now, time-travel debugging is available for &lt;a href="https://github.com/rr-debugger/rr/wiki/Related-work"&gt;many&lt;/a&gt; languages, platforms, and IDEs, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.replay.io/"&gt;Replay&lt;/a&gt; for JavaScript in Chrome, Firefox, and Node, and &lt;a href="https://wallabyjs.com/docs/intro/time-travel-debugger.html"&gt;Wallaby&lt;/a&gt; for tests in Node&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/time-travel-debugging-overview"&gt;WinDbg&lt;/a&gt; for Windows applications&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://rr-project.org/"&gt;rr&lt;/a&gt; for C, C++, Rust, Go, and others on Linux&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://undo.io/"&gt;Undo&lt;/a&gt; for C, C++, Java, Kotlin, Rust, and Go on Linux&lt;/li&gt;
&lt;li&gt;Various extensions (often rr- or Undo-based) for Visual Studio, VS Code, JetBrains IDEs, Emacs, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation techniques
&lt;/h2&gt;

&lt;p&gt;There are three main approaches to implementing time-travel debugging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Record &amp;amp; Replay&lt;/strong&gt;: Record all non-deterministic inputs to a program during its execution. Then, during the debug phase, the program can be deterministically replayed using the recorded inputs in order to reconstruct any prior state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshotting&lt;/strong&gt;: Periodically take snapshots of a program's entire state. During debugging, the program can be rolled back to these saved states. This method can be memory-intensive because it involves storing the entire state of the program at multiple points in time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrumentation&lt;/strong&gt;: Add extra code to the program that logs changes in its state. This extra code allows the debugger to step the program backwards by reverting changes. However, this approach can significantly slow down the program's execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;rr uses the first (the rr name stands for Record and Replay), as does &lt;a href="https://docs.replay.io/learn-more/contribute/replay-for-new-contributors#5130506fb24843ab86fe79d11f02261b"&gt;Replay&lt;/a&gt;. WinDbg uses the first two, and Undo uses all three (see &lt;a href="https://undo.io/resources/liverecorder-vs-rr/"&gt;how it differs from rr&lt;/a&gt;).&lt;/p&gt;
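&lt;p&gt;To make the record &amp;amp; replay idea concrete, here’s a toy Python sketch (nothing like the real engineering in rr or Replay, which work at the syscall and instruction level): record every nondeterministic input on the live run, then feed the log back in to reproduce the exact same execution, as many times as you like.&lt;/p&gt;

```python
import random

def run(recorder, replay_log=None):
    # One program run. In record mode, every nondeterministic input is
    # appended to `recorder`; in replay mode, inputs come from the log,
    # so execution is repeatable and can be re-inspected at will.
    def nondet(produce):
        if replay_log is not None:
            return replay_log.pop(0)   # replay: reuse the recorded input
        value = produce()              # live: consult the outside world
        recorder.append(value)
        return value

    total = 0
    for _ in range(3):
        total += nondet(lambda: random.randint(1, 100))
    return total

log = []
live_result = run(log)                 # record phase
replayed_result = run([], list(log))   # replay phase: same inputs, same result
```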

&lt;h2&gt;
  
  
  Time-traveling in production
&lt;/h2&gt;

&lt;p&gt;Traditionally, running a debugger in prod doesn't make much sense. Sure, we could SSH into a prod machine and start the process handling requests with a debugger and a breakpoint, but once we hit the breakpoint, we're delaying responses to all current requests and unable to respond to new requests. Also, debugging non-trivial issues is an iterative process: we get a clue, we keep looking and find more clues; discovery of each clue is typically rerunning the program and reproducing the failure. So, instead of debugging in production, what we do is replicate on our dev machine whatever issue we're investigating and use a debugger locally (or, more often, add log statements 😄), and re-run as many times as required to figure it out. Replicating takes time (and in some cases a &lt;em&gt;lot&lt;/em&gt; of time, and in some cases infinite time), so it would be really useful if we didn't have to.&lt;/p&gt;

&lt;p&gt;While running traditional debuggers doesn't make sense, time-travel debuggers can record a process execution on one machine and replay it on another machine. So we can record (or snapshot or instrument) production and replay it on our dev machine for debugging (depending on the tool, our machine may need to have the same CPU instruction set as prod). However, the recording step generally doesn't make sense to use in prod given the high amount of overhead—if we set up recording and then have to use ten times as many servers to handle the same load, whoever &lt;a href="https://www.linkedin.com/in/kevin-laughlin-4133166/"&gt;pays our AWS bill&lt;/a&gt; will not be happy 😁.&lt;/p&gt;

&lt;p&gt;But there are a couple scenarios in which it does make sense:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Undo only slows down execution &lt;a href="https://undo.io/solutions/products/live-recorder/#section-133"&gt;2–5x&lt;/a&gt;, so while we don't want to leave it on just in case, we can &lt;a href="https://twitter.com/gregthelaw/status/1654558923242762243"&gt;turn it on temporarily&lt;/a&gt; on a subset of prod processes for hard-to-repro bugs until we have captured the bug happening, and then we turn it off.&lt;/li&gt;
&lt;li&gt;When we're already recording the execution of a program in the normal course of operation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rest of this post is about #2, which is a way of running programs called &lt;em&gt;durable execution&lt;/em&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Durable execution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's that?
&lt;/h3&gt;

&lt;p&gt;First, a brief backstory. After Amazon (one of the first large adopters of microservices) decided that using message queues to communicate between services was not the way to go (hear the story first-hand &lt;a href="https://www.youtube.com/watch?v=wIpz4ioK0gI"&gt;here&lt;/a&gt;), they started using orchestration. And once they realized defining orchestration logic in YAML/JSON wasn't a good developer experience, they created &lt;a href="https://docs.aws.amazon.com/amazonswf/latest/developerguide/swf-welcome.html"&gt;AWS Simple Workflow Service&lt;/a&gt; to define logic in code. This technique of backing code by an orchestration engine is called durable execution, and it spread to &lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=csharp-inproc"&gt;Azure Durable Functions&lt;/a&gt;, &lt;a href="https://cadenceworkflow.io/"&gt;Cadence&lt;/a&gt; (used at Uber for &lt;a href="https://www.uber.com/blog/announcing-cadence/"&gt;&amp;gt; 1,000 services&lt;/a&gt;), and &lt;a href="https://temporal.io/"&gt;Temporal&lt;/a&gt; (used by Stripe, Netflix, Datadog, Snap, Coinbase, and many more).&lt;/p&gt;

&lt;p&gt;Durable execution runs code durably—recording each step in a database, so that when anything fails, it can be retried from the same step. The machine running the function can even lose power before it gets to line 10, and another process is guaranteed to pick up executing at line 10, with all variables and threads intact.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; It does this with a form of record &amp;amp; replay: all input from the outside is recorded, so when the second process picks up the partially-executed function, it can replay the code (in a side-effect–free manner) with the recorded input in order to get the code into the right state by line 10.&lt;/p&gt;

&lt;p&gt;Durable execution's flavor of record &amp;amp; replay doesn't use high-overhead methods like &lt;a href="https://undo.io/resources/liverecorder-vs-rr/"&gt;software JIT binary translation&lt;/a&gt;, snapshotting, or instrumentation. It also doesn't require special hardware. It does require one constraint: durable code must be deterministic (i.e., given the same input, it must take the same code path). So it can't do things that might have different results at different times, like use the network or disk. However, it can call other functions that are run normally (&lt;a href="https://twitter.com/DominikTornow/status/1582370919258783744"&gt;"volatile functions"&lt;/a&gt;, as we like to call them 😄), and while each step of those functions isn't persisted, the functions are automatically retried on transient failures (like a service being down).&lt;/p&gt;

&lt;p&gt;Only the steps that require interacting with the outside world (like calling a volatile function, or calling &lt;code&gt;sleep('30 days')&lt;/code&gt;, which stores a timer in the database) are persisted. Their results are also persisted, so that when you replay the durable function that died on line 10, if it previously called the volatile function on line 5 that returned "foo", during replay, "foo" will immediately be returned (instead of the volatile function getting called again). While yes, it adds latency to be saving things to the database, Temporal supports extremely high throughput (tested up to a million recorded steps per second). And in addition to function recoverability and automatic retries, it comes with &lt;a href="https://temporal.io/blog/building-reliable-distributed-systems-in-node"&gt;many more benefits&lt;/a&gt;, including extraordinary visibility into and debuggability of production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging prod
&lt;/h3&gt;

&lt;p&gt;With durable execution, we can read through the steps that every single durable function took in production. We can also download the execution’s history, checkout the version of the code that's running in prod, and pass the file to a replayer (Temporal has runtimes for Go, Java, JavaScript, Python, .NET, and PHP) so we can see in a debugger exactly what the code did during that production function execution. Read &lt;a href="https://temporal.io/blog/temporal-for-vs-code"&gt;this post&lt;/a&gt; or watch &lt;a href="https://www.youtube.com/watch?v=3IjQde9HMNY"&gt;this video&lt;/a&gt; to see an example in VS Code.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Being able to debug any past production code is a huge step up from the other option (finding a bug, trying to repro locally, failing, turning on Undo recording in prod until it happens again, turning it off, &lt;em&gt;then&lt;/em&gt; debugging locally). It's also a (sometimes necessary) step up from distributed tracing.&lt;/p&gt;




&lt;p&gt;I hope you found this post interesting! If you'd like to learn more about durable execution, I recommend reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://temporal.io/blog/building-reliable-distributed-systems-in-node"&gt;Building reliable distributed systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://temporal.io/blog/building-reliable-distributed-systems-in-node-js-part-2"&gt;How durable execution works&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and watching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=wIpz4ioK0gI"&gt;Introduction to Temporal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=6lSuDRRFgyY"&gt;Why durable execution changes everything&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Thanks to Greg Law, Jason Laster, Chad Retz, and Fitz for reviewing drafts of this post.&lt;/em&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Technically, it doesn't have line-by-line granularity. It only records certain steps that the code takes—read on for more info ☺️. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;The astute reader may note that our extension uses the default VS Code debugger, which doesn’t have a back button 😄. I transitioned from talking about TTD to methods of debugging production code via recording, so while Temporal doesn’t have TTD yet, it does record all the non-deterministic inputs to the program and is able to replay execution, so it’s definitely possible to implement. Upvote &lt;a href="https://github.com/temporalio/vscode-debugger-extension/issues/51"&gt;this issue&lt;/a&gt; or comment if you have thoughts on implementation! ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>programming</category>
      <category>debugging</category>
      <category>temporal</category>
    </item>
    <item>
      <title>Actors and Workflows: Building a Customer Loyalty Program with Temporal</title>
      <dc:creator>Fitz</dc:creator>
      <pubDate>Thu, 03 Aug 2023 17:28:24 +0000</pubDate>
      <link>https://dev.to/temporalio/actors-and-workflows-building-a-customer-loyalty-program-with-temporal-9f3</link>
      <guid>https://dev.to/temporalio/actors-and-workflows-building-a-customer-loyalty-program-with-temporal-9f3</guid>
      <description>&lt;p&gt;This post is technically a followup of &lt;a href="https://temporal.io/blog/workflows-as-actors-is-it-really-possible"&gt;another post&lt;/a&gt;. You don't &lt;em&gt;need&lt;/em&gt; to read that one to make sense of this one, but it might give some useful background.&lt;/p&gt;

&lt;p&gt;That post talked through how &lt;a href="https://en.wikipedia.org/wiki/Actor_model"&gt;the Actor Model&lt;/a&gt; can be implemented using "Workflows" (on &lt;a href="https://github.com/temporalio/temporal"&gt;https://github.com/temporalio/temporal&lt;/a&gt;), even though these two concepts don't immediately appear compatible.&lt;/p&gt;

&lt;p&gt;Here, I dive into a concrete example: a Workflow representing a customer's loyalty status.&lt;/p&gt;

&lt;p&gt;If you want to skip the prose and just jump right into the code, you can find it all in &lt;a href="https://github.com/afitz0/customer-loyalty-workflow"&gt;this GitHub repository&lt;/a&gt;, with implementations in Go, Java, and Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actor Model Refresher
&lt;/h2&gt;

&lt;p&gt;As formally defined, Actors must be able to do three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send and receive messages&lt;/li&gt;
&lt;li&gt;Create new Actors&lt;/li&gt;
&lt;li&gt;Maintain state&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Exact implementation details vary depending on what framework, library, or tools you're using, but the biggest challenge is having some kind of software artifact running &lt;em&gt;somewhere&lt;/em&gt; that can handle these things.&lt;/p&gt;

&lt;p&gt;That's where most Actor frameworks come in to help: providing both the programming model and the runtime environment for being able to build an Actor-based application in a highly distributed, concurrent, and scalable way.&lt;/p&gt;

&lt;p&gt;Temporal differs here in that it’s general-purpose, rather than specific to one model or system design pattern. With Workflows, you define a function that Temporal will ensure runs to completion (or reliably runs forever, if the function doesn’t return).&lt;/p&gt;

&lt;p&gt;I recognize that statement is both rather bold and also so generic as to be hard to disprove. So, let's look at a concrete example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loyal Customers
&lt;/h2&gt;

&lt;p&gt;Many consumer businesses have some kind of loyalty program. Buy 10 items, get the 11th free! Fly 10,000 miles, get free access to the airport lounge! Earn one million points over the lifetime of your account, earn a gold star!&lt;/p&gt;

&lt;p&gt;At the highest level, the application's logic isn't complex: Each customer has an integer counter that's incremented after the customer does certain things (e.g., buy something, or take a trip). When that counter crosses different thresholds, new rewards are unlocked. And, although we may not like it, customers can always close their accounts.&lt;/p&gt;

&lt;p&gt;When we create the diagram for the app, it might look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--41iv8ffH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3how2h0elz6ia2b8fn3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--41iv8ffH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3how2h0elz6ia2b8fn3x.png" alt="customer loyalty diagram version 1" width="500" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In terms of the Actor Model, two of the three requirements are on display:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Send and receive messages&lt;/strong&gt;: A customer can send either an "earn points" message or a "try to use reward" message.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create new Actors&lt;/strong&gt;: ??? (This is the Actor requirement not apparent in this application, but we'll see later how it can be incorporated.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain state&lt;/strong&gt;: A customer loyalty account needs to maintain the points counter and which rewards are unlocked (or be able to look up this information based on the points value).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Requirement #2, the ability to create other Actors, isn't immediately obvious here, but it isn't too far out of reach. We could define in this example application that one of the rewards for earning enough points is the ability to gift status to someone else, inviting them (i.e., creating their account) to the program if they aren't already a member.&lt;/p&gt;

&lt;p&gt;If our goal is to create a demo application for the Actor Model (as it is in this post), then there's actually one other thing missing: the ability for a customer (or rather, their loyalty account) to send messages. For that, we could also declare that customers with enough points can gift points or status levels (i.e., which rewards are unlocked) to their guests. Then they can send messages, too!&lt;/p&gt;

&lt;p&gt;Reworking the previous diagram to be more befitting of a full "Actor," we'd get the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mCfdb_dA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0bfuafimxwqxzn7rxxam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mCfdb_dA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0bfuafimxwqxzn7rxxam.png" alt="customer loyalty diagram version 2" width="500" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And, as for the exact implementation details, read on!&lt;/p&gt;

&lt;h2&gt;
  
  
  Loyal (Temporal!) Customers
&lt;/h2&gt;

&lt;p&gt;Imagine being able to write the customer loyalty program above in just a function or two. Conceptually, that's not hard. In pseudocode, that might look like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INVITE_REWARD_MINIMUM_POINTS = 1000

function CustomerLoyaltyAccount:
    account_canceled = false
    points = 0

    while !account_canceled:
        message = receive_message()
        switch message.type:
            case 'cancel':
                account_canceled = true
            case 'add_points':
                fallthrough
            case 'gift_points':
                points += message.value
            case 'invite_guest':
                if points &amp;gt;= INVITE_REWARD_MINIMUM_POINTS:
                    spawn(new CustomerLoyaltyAccount())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But there are a few crucial details that are, well, rather undefined in this pseudo-function. Specifically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What's &lt;code&gt;receive_message()&lt;/code&gt; doing? How is it receiving messages?&lt;/li&gt;
&lt;li&gt;Similarly, what's &lt;code&gt;spawn(new CustomerLoyaltyAccount())&lt;/code&gt; doing? &lt;/li&gt;
&lt;li&gt;And most importantly, where is this function running? What happens if that runtime crashes or the function otherwise stops running?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these maps to core Temporal features that we can implement in an example Workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data can be sent to Workflows via Signals&lt;/li&gt;
&lt;li&gt;Workflows can create new Workflow instances&lt;/li&gt;
&lt;li&gt;As long as there are Workers running &lt;em&gt;somewhere&lt;/em&gt; that can communicate with the Temporal Server, then if the Worker running the function dies, the function will continue running on another (you know, kind of Temporal's main benefit)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Customers Go Loyal
&lt;/h3&gt;

&lt;p&gt;Let's build this up in Go. If you are more comfortable with other languages, I've also written the same Workflow in &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/tree/main/python"&gt;Python&lt;/a&gt; and &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/tree/main/java"&gt;Java&lt;/a&gt;. While the languages are different, most of the same concepts and patterns should carry over.&lt;/p&gt;

&lt;p&gt;(For brevity in the body of this blog post, I'll in most cases omit error handling but include it when non-trivial and relevant.)&lt;/p&gt;

&lt;p&gt;First, we write the skeleton of a Workflow and an Activity. For some of the milestones in a customer's lifecycle, it'd be nice to send them some kind of notification. In a real application, you'd call out to SendGrid, Mailchimp, Constant Contact, or some other email provider, but for simplicity's sake, I'm just logging out the details. This initial Workflow does just that: if it's a new customer, send a welcome email, but otherwise move on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;CustomerLoyaltyWorkflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="n"&gt;CustomerInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newCustomer&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Loyalty workflow started."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CustomerInfo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt; &lt;span class="n"&gt;Activities&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;newCustomer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"New customer workflow; sending welcome email."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecuteActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SendEmail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Welcome, %v, to our loyalty program!"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
            &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error running SendEmail activity for welcome email."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Skipping welcome email for non-new customer."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// ... [to be added later] ... //&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Activities&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Client&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Activities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;SendEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sending email."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Contents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next up, we need to be able to handle messages. This is the primary thing the Workflow (i.e., customer loyalty Actor) does: sit around waiting for new messages to come in.&lt;/p&gt;

&lt;p&gt;The following code replaces the &lt;code&gt;// ... [to be added later] ... //&lt;/code&gt; line from the previous snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;  &lt;span class="n"&gt;selector&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Signal handler for adding points&lt;/span&gt;
    &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddReceive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetSignalChannel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"addPoints"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReceiveChannel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;signalAddPoints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c"&gt;// Signal handler for canceling account&lt;/span&gt;
    &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddReceive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetSignalChannel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"cancelAccount"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReceiveChannel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;signalCancelAccount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c"&gt;// ... [register other Signal handlers here] ... //&lt;/span&gt;

  &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Waiting for new messages"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AccountActive&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Signal handler function for adding points does very little, adding in the given points to the customer's state and then sending an email to the customer with the new value.&lt;/p&gt;

&lt;p&gt;As you might imagine, &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/workflow.go#L250-L261"&gt;the cancel account handler&lt;/a&gt; is very similar, setting the &lt;code&gt;customer.AccountActive&lt;/code&gt; flag used above to false and then notifying the customer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;signalAddPoints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReceiveChannel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CustomerInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt; &lt;span class="n"&gt;Activities&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;pointsToAdd&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Receive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pointsToAdd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Adding points to customer account."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"PointsAdded"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pointsToAdd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoyaltyPoints&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;pointsToAdd&lt;/span&gt;

    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecuteActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SendEmail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"You've earned more points! You now have %v."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoyaltyPoints&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error running SendEmail activity for added points."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// ... [insert logic for unlocking status levels or rewards] ... //&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All combined, the code so far does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, it registers the &lt;code&gt;signalAddPoints&lt;/code&gt; and &lt;code&gt;signalCancelAccount&lt;/code&gt; functions as the handlers for the "addPoints" and "cancelAccount" Signals, respectively.&lt;/li&gt;
&lt;li&gt;Then, it blocks forward progress on the Workflow, via &lt;code&gt;selector.Select(ctx)&lt;/code&gt;, until a registered Signal comes in. Unless that Signal is "cancelAccount," the Workflow will keep looping on this select.&lt;/li&gt;
&lt;li&gt;I've chosen, for this application, not to fail the Workflow when an email fails to send. This keeps the Workflow representing the customer's loyalty account active and running even in the face of external system failures.

&lt;ul&gt;
&lt;li&gt;For that, you'll want to set an appropriate retry policy to ensure that the Workflow doesn't completely block on email failures, for example by setting the &lt;code&gt;MaximumAttempts&lt;/code&gt; to a reasonably low number like 10.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Already this gives us most of the application. Thanks to Temporal, we have a function that runs perpetually and can receive two different kinds of messages, both of which modify the Workflow's state, and one of which also causes the Workflow to finish.&lt;/p&gt;

&lt;p&gt;What remains is a couple more Temporal-specific considerations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long-Lived Customers
&lt;/h3&gt;

&lt;p&gt;In &lt;a href="https://temporal.io/blog/workflows-as-actors-is-it-really-possible#building-a-workflow-that-can-practically-run-forever"&gt;my last post&lt;/a&gt;, I spilled many words on the topic of "Continue-As-New." If you didn't—or don't want to!—read those words, the gist is this: at some point, a Workflow's history may get unwieldily big; Continue-As-New resets it.&lt;/p&gt;

&lt;p&gt;For this customer loyalty example Workflow, the far-and-away biggest pressure on the Event History is the &lt;em&gt;number&lt;/em&gt; of events, not their size. With the &lt;code&gt;addPoints&lt;/code&gt; Signal taking only a single integer argument and the &lt;code&gt;cancelAccount&lt;/code&gt; Signal taking none, their combined contribution to the &lt;em&gt;size&lt;/em&gt; of the history is minimal.&lt;/p&gt;

&lt;p&gt;A Signal with only a single integer parameter will, by itself, contribute one Event and about 500 bytes to the History, even with very large values. And so, how many of these Signals would be required to hit either the size or length limits?&lt;/p&gt;

&lt;p&gt;If &lt;em&gt;nothing&lt;/em&gt; else happened but &lt;code&gt;addPoints&lt;/code&gt; Signals, it'd take 51,200 of them to reach the length limit, but &lt;code&gt;50 * 1024 * 1024 / 500&lt;/code&gt; or 104,857.6 to reach the size limit. Knowing that many of these Signals will result in the &lt;code&gt;SendEmail&lt;/code&gt; Activity running, and each Activity contributes a handful of (small) events to the history, this Workflow will hit the History &lt;em&gt;length&lt;/em&gt; limit well before the size limit.&lt;/p&gt;

&lt;p&gt;So, let's add a check for that into our Workflow loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;eventsThreshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;
    &lt;span class="c"&gt;// ... snip ...&lt;/span&gt;

    &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Waiting for new messages"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AccountActive&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetCurrentHistoryLength&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;eventsThreshold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, trigger Continue-As-New as needed, draining any pending signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AccountActive&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Account still active, but hit continue-as-new threshold."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c"&gt;// Drain signals before continuing-as-new&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasPending&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewContinueAsNewError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CustomerLoyaltyWorkflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My previous post on this topic &lt;a href="https://temporal.io/blog/workflows-as-actors-is-it-really-possible#avoiding-signal-and-update-loss"&gt;explained in a little more detail&lt;/a&gt; about why it's necessary to drain signals before continuing-as-new. To briefly recap, &lt;a href="https://docs.temporal.io/glossary#continue-as-new"&gt;Continue-As-New&lt;/a&gt; finishes the current Workflow run and starts a new instance of the Workflow &lt;em&gt;regardless of any pending Signals&lt;/em&gt;. If we don't drain (and handle!) Signals before calling &lt;code&gt;workflow.NewContinueAsNewError&lt;/code&gt; (or &lt;a href="https://python.temporal.io/temporalio.workflow.html#continue_as_new"&gt;&lt;code&gt;workflow.continue_as_new&lt;/code&gt;&lt;/a&gt; in Python, or &lt;a href="https://www.javadoc.io/doc/io.temporal/temporal-sdk/latest/io/temporal/workflow/Workflow.html#continueAsNew(io.temporal.workflow.ContinueAsNewOptions,java.lang.Object...)"&gt;&lt;code&gt;Workflow.continueAsNew&lt;/code&gt;&lt;/a&gt; in Java), those pending Signals will be forever lost.&lt;/p&gt;

&lt;p&gt;The last major thing this Workflow needs to make it a true, stage-worthy Actor is the ability to create others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spawning New Customers
&lt;/h3&gt;

&lt;p&gt;While Temporal has support for Parent/Child relationships between Workflows, in this customer loyalty application the only thing we need is the ability to send a message from one Workflow to another when gifting status or points.&lt;/p&gt;

&lt;p&gt;Temporal provides an API in the Client that can do this and create other Workflows all in one call, called &lt;a href="https://docs.temporal.io/dev-guide/go/features#signal-with-start"&gt;Signal-with-Start&lt;/a&gt;. Since this is only available in the Client, not from a Workflow, we'll need to do this &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/activities.go#L29-L50"&gt;in an Activity&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First, I'm setting the ID Reuse Policy to &lt;code&gt;REJECT&lt;/code&gt;. This is in some ways a "business logic" kind of decision, where I'm declaring that once a customer's account is closed, it can't be re-invited. (Note that after a &lt;a href="https://docs.temporal.io/clusters#retention-period"&gt;namespace's retention period&lt;/a&gt; has passed, IDs from closed Workflows can be reused regardless of this policy, and so in a real-life production version of this app, you'd want to have this check an external source for customer account statuses.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Activities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;StartGuestWorkflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt; &lt;span class="n"&gt;CustomerInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;workflowOptions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StartWorkflowOptions&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;TaskQueue&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;             &lt;span class="n"&gt;TaskQueue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;WorkflowIDReusePolicy&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we can call &lt;code&gt;Client.SignalWithStartWorkflow&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Starting and signaling guest workflow."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"GuestID"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SignalWithStartWorkflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CustomerWorkflowID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;SignalEnsureMinimumStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusLevel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ordinal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;workflowOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CustomerLoyaltyWorkflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the use of the Client from the Activities receiver struct! I'm taking advantage of how the Temporal Go SDK works: if, when we instantiate and register the Activities in the Worker, &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/worker/main.go#L26-L28"&gt;we also set this Client,&lt;/a&gt; then the same connection will be available within the Activities. This way, we don't have to worry about re-creating the Client.&lt;/p&gt;

&lt;p&gt;I'm also ignoring the future returned from &lt;code&gt;SignalWithStartWorkflow&lt;/code&gt;, via the Go convention of assigning to &lt;code&gt;_&lt;/code&gt;: because this "guest" Workflow is expected to run indefinitely, blocking on its result would prevent the original Workflow from doing anything else. And since the future returned from starting a Workflow is only useful for waiting on the Workflow to finish or for getting its IDs (which we already know from the &lt;code&gt;CustomerWorkflowID(guest.CustomerID)&lt;/code&gt; call), we can safely ignore it.&lt;/p&gt;

&lt;p&gt;But it's still necessary to handle the error. With the ID Reuse Policy set to &lt;code&gt;REJECT&lt;/code&gt;, retrying the error that results from trying to start an already-closed Workflow will get us nowhere, so we should instead send some useful information back to the Workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;serviceerror&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WorkflowExecutionAlreadyStarted&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;As&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GuestAlreadyCanceled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GuestInvited&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// ... [Defined at top] ...&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;GuestInviteResult&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;

&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;GuestInvited&lt;/span&gt; &lt;span class="n"&gt;GuestInviteResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;iota&lt;/span&gt;
    &lt;span class="n"&gt;GuestAlreadyCanceled&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Back in the Workflow, after running this Activity I can check that result and notify the customer as appropriate. As before, I'm allowing the Workflow to continue if sending the email fails. But if that &lt;code&gt;SignalWithStartWorkflow&lt;/code&gt; call failed for any reason other than the guest's account already existing, I want to make some noise and fail the Workflow—something unusual is likely happening.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;inviteResult&lt;/span&gt; &lt;span class="n"&gt;GuestInviteResult&lt;/span&gt;
&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecuteActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StartGuestWorkflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;inviteResult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"could not signal-with-start guest/child workflow for guest ID '%v': %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guestID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;inviteResult&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;GuestAlreadyCanceled&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;emailToSend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Your guest has canceled!"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;emailToSend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Your guest has been invited!"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecuteActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SendEmail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emailToSend&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This snippet of code would end up being in a Signal handler for something like &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/workflow.go#L191-L232"&gt;an "invite guest" Signal&lt;/a&gt;. The handler would also include, as discussed at the top of this post, &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/workflow.go#L198-L199"&gt;a check&lt;/a&gt; for if the current customer is even allowed to do this action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summing it all up
&lt;/h2&gt;

&lt;p&gt;There are a few other things to explore in this app, like &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/workflow.go#L102-L108"&gt;catching a cancellation request&lt;/a&gt; or &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/workflow_test.go"&gt;looking through the tests&lt;/a&gt;, but this post has gotten long enough as it is. 🙂&lt;/p&gt;

&lt;p&gt;Hopefully this post serves as a nice "close-to-real-world" example for you of how to build something that looks like an "Actor"—aka, a really, really long running Workflow that can send and receive messages and maintain state without a database—using Temporal.&lt;/p&gt;

&lt;p&gt;For more information related to this post and about Temporal, check out the following links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/afitz0/customer-loyalty-workflow/"&gt;This post's source code&lt;/a&gt; (As of publishing, available in Java, Go, and Python)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://temporal.io/blog/workflows-as-actors-is-it-really-possible"&gt;Actors &amp;amp; Workflows, Part 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.temporal.io/dev-guide/go/features#signal-with-start"&gt;SignaIWithStart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//docs.temporal.io"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//docs.temporal.io/dev-guide"&gt;Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the best way to learn Temporal is with &lt;a href="https://learn.temporal.io/courses"&gt;our free courses&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover image &lt;a href="https://unsplash.com/photos/fg7J6NnebBc?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditShareLink"&gt;from John Jennings on Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>backend</category>
      <category>distributedsystems</category>
      <category>go</category>
      <category>microservices</category>
    </item>
    <item>
      <title>To Choreograph or Orchestrate your Saga, that is the question.</title>
      <dc:creator>Emily Fortuna</dc:creator>
      <pubDate>Wed, 12 Jul 2023 17:46:50 +0000</pubDate>
      <link>https://dev.to/temporalio/to-choreograph-or-orchestrate-your-saga-that-is-the-question-4kna</link>
      <guid>https://dev.to/temporalio/to-choreograph-or-orchestrate-your-saga-that-is-the-question-4kna</guid>
      <description>&lt;p&gt;The saga pattern is a distributed systems design pattern for a task that spans machine or microservice boundaries in which full execution of all steps is necessary. Partial execution is not desirable. A common life example used to explain when the saga pattern is useful is trip planning. If you’re planning on attending &lt;a href="https://temporal.io/replay"&gt;Replay&lt;/a&gt;, for example, you’d need to book a conference ticket, an airplane ticket, and a hotel. If you fail to acquire one of these things, you’ll miss out on meeting fun people in backend engineering face-to-face.&lt;/p&gt;

&lt;p&gt;Below the surface, there are two main ways microservices can talk to one another that make your saga possible: choreography and orchestration. &lt;/p&gt;

&lt;h3&gt;
  
  
  Choreography
&lt;/h3&gt;

&lt;p&gt;Choreography is analogous to ants in an ant colony. Like ants, each microservice has &lt;em&gt;local&lt;/em&gt; knowledge, and shares information about state changes with other services via chemical signals called pheromones–I mean via message passing. Just as an ant trail to food emerges organically from pheromones, the overall behavior of a choreographed system emerges organically from each microservice’s individual instructions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XAPEtOwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0of9n89mjwqrbffissp1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XAPEtOwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0of9n89mjwqrbffissp1.jpg" alt="trail of ants on a log against a green background" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One tenet drilled into every software engineer’s head is the value of decoupling. Choreography embodies this idea and is straightforward to implement as a whole. Choreography can be a popular, easy choice for systems that are incrementally moving from a monolith to a microservices architecture. However, if you have any sort of ordering requirement among tasks, such as ordered steps in your saga, choreography can get unwieldy fairly quickly. Suppose we want to book the plane first so that the hotel can know your flight number and pick you up from the airport. Then we book our conference ticket (maybe there’s a discount with certain hotels). The sequence of messages that each service responds to would look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cvHbRlhr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y9xojsfi7vkxl4j16llc.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cvHbRlhr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y9xojsfi7vkxl4j16llc.gif" alt="three services: plane, hotel, and conference ticket sending messages about when their state has changed so that the other services can act on them." width="600" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, just from looking at each microservice’s individual codebase, it’s difficult to understand the order that the system &lt;em&gt;should&lt;/em&gt; have, since that ordering is distributed throughout the code. This leads to all sorts of higher-level business logic diagrams that need to be kept in sync with the code…but wouldn’t it be better if the code were just easier to read in the first place? It also can be difficult to debug the exact sequence of events that led to a bug, since control flow is not immediately clear. So, unless all of your microservices are truly independent of one another and don’t have any sort of “happens before” logic, consider using orchestration instead. &lt;/p&gt;

&lt;h3&gt;
  
  
  Orchestration
&lt;/h3&gt;

&lt;p&gt;Orchestration, on the other hand, is like an air traffic control tower directing planes. One service, a “super microservice” if you will, functions as the message broker, sending messages directly to individual microservices telling them what to do, just as planes wait for the tower’s permission to take off. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pW_IGjsC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hrskypbc97yzls0c25mo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pW_IGjsC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hrskypbc97yzls0c25mo.gif" alt="three boxes, each representing plane, hotel, and conference booking microservices, and a message broker sending book and cancel commands to each." width="600" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because orchestration centralizes control flow, debugging and understanding control flow is much simpler. Additionally, since each step doesn’t need to keep track of what “happens before” messages it needs to listen to, the code for individual microservices is much simpler. Orchestration also shines in situations where many services need to interact in a single &lt;a href="https://temporal.io/blog/saga-pattern-made-easy"&gt;saga&lt;/a&gt; step. The glaring Achilles’ heel of this method is that bane of all distributed systems: the message broker is a single point of failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---jR8mRfu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ne7nfbuk4lujoye8ug7i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---jR8mRfu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ne7nfbuk4lujoye8ug7i.jpg" alt="Still from the movie airplane with an inflatable co pilot and a disheveled pilot sitting in the cockpit, with a flight attendant standing between them looking slightly concerned" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting it all together
&lt;/h3&gt;

&lt;p&gt;So to summarize, choreography:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is decentralized and decoupled&lt;/li&gt;
&lt;li&gt;Is good for highly independent microservices&lt;/li&gt;
&lt;li&gt;Is “easier” to implement, at least initially &lt;/li&gt;
&lt;li&gt;Is an easy choice for converting established monoliths to microservices&lt;/li&gt;
&lt;li&gt;Can make control flow unclear&lt;/li&gt;
&lt;li&gt;Can be challenging to debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and orchestration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has one service issuing “commands” to execute microservices&lt;/li&gt;
&lt;li&gt;Makes control flow easier to understand&lt;/li&gt;
&lt;li&gt;Is easier to build into greenfield applications&lt;/li&gt;
&lt;li&gt;Makes debugging and failure handling clearer&lt;/li&gt;
&lt;li&gt;Is “harder” to implement initially, but pays dividends later&lt;/li&gt;
&lt;li&gt;Has a single point of failure (the message broker)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting tradeoff between these two approaches is that in the early days you want to reach for the light, agile option (choreography) and avoid over-architecting your project, but counterintuitively, orchestration is often easier to build when you use it from the start. &lt;/p&gt;

&lt;h3&gt;
  
  
  So, what does Temporal do?
&lt;/h3&gt;

&lt;p&gt;Temporal uses orchestration under the hood (you won’t have to implement it yourself), &lt;strong&gt;&lt;em&gt;but&lt;/em&gt; &lt;em&gt;also&lt;/em&gt;&lt;/strong&gt; avoids that crucial drawback of a single point of failure. How is such a thing possible? Internally, Temporal records your program’s progress in a log. If that message broker were to go offline, your entire program’s history has already been saved, so another machine can start up exactly where your program left off, as if nothing happened. This makes Temporal completely horizontally scalable. &lt;/p&gt;

&lt;p&gt;To bring this idea back to the saga pattern, an important component of the saga pattern is driving towards completion of all the steps of the saga. The fact that Temporal ensures no progress will ever be lost means it will pick up exactly where it left off “no matter what”, including failures for an unknown length of time, completing the saga with no extra code or heavy lifting on your part.&lt;/p&gt;

&lt;p&gt;Additionally, unlike some orchestration engines, in Temporal the logic of your workflow is expressed entirely in code, so you don’t have to deal with JSON. In essence, nothing additional is needed to make a robust, failure-resilient application other than the business logic of your application itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Choreography and orchestration provide different approaches to coordinating communication between microservices. Choreography is decoupled but can make debugging and control flow difficult to follow. Orchestration is centralized, but results in a single point of failure. Temporal uses orchestration under the covers, &lt;em&gt;but&lt;/em&gt; by design safeguards against a single point of failure, allowing you to focus on writing your code with the confidence that it is failure resilient.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>microservices</category>
      <category>designpatterns</category>
      <category>sagas</category>
    </item>
    <item>
      <title>25 Key Terms for Speaking Distributed Systems and Temporal (an emoji-based guide)</title>
      <dc:creator>Emily Fortuna</dc:creator>
      <pubDate>Thu, 29 Jun 2023 18:55:20 +0000</pubDate>
      <link>https://dev.to/temporalio/25-key-terms-for-speaking-distributed-systems-and-temporal-an-emoji-based-guide-1cp9</link>
      <guid>https://dev.to/temporalio/25-key-terms-for-speaking-distributed-systems-and-temporal-an-emoji-based-guide-1cp9</guid>
      <description>&lt;p&gt;So you want to keep up with all the cool kids throwing around terms like “&lt;a href="https://docs.temporal.io/clusters#multi-cluster-replication"&gt;multi-cluster replication&lt;/a&gt;” but you don’t have time to read several textbooks. This handy quick-reference will give you the framework for following (and participating in!) conversations involving distributed systems or Temporal with ease. At the next dinner party you’ll be able to win friends and influence people with your ability to explain distributed systems succinctly in plain English… because we all know you’re&lt;a href="https://simpsons.fandom.com/wiki/You_Don%27t_Win_Friends_with_Salad"&gt; not gonna do it with salad&lt;/a&gt;. This guide builds upon itself, with terms requiring no additional context first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eRpNTDEK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gr7en1ario6hn05xe87r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eRpNTDEK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gr7en1ario6hn05xe87r.jpg" alt="An astronaut floating in space, attached to a book as if it is the oxygen supply" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Distributed Systems Terms To Know
&lt;/h2&gt;

&lt;h4&gt;
  
  
  concurrency   ↠
&lt;/h4&gt;

&lt;p&gt; Roughly, the idea of running multiple things at once. Two people eating dinner at the same time (“in parallel”) are eating concurrently. Your operating system context switching between a web browser and IDE is also a form of concurrency. &lt;/p&gt;

&lt;h4&gt;
  
  
  scalability   📈
&lt;/h4&gt;

&lt;p&gt; The ability for a system, such as a website, to accommodate a growing number of requests or work. You can improve scalability by finding places where work can be executed simultaneously or removing performance bottlenecks.&lt;/p&gt;

&lt;h4&gt;
  
  
  reliability   ✅
&lt;/h4&gt;

&lt;p&gt; The likelihood of a system to run without failure for a period of time. Systems can be made more reliable by reducing single points of failure, and detecting failures quickly.&lt;/p&gt;

&lt;h4&gt;
  
  
  eventual consistency   🐌
&lt;/h4&gt;

&lt;p&gt; Let’s say you’ve replicated a database to improve reliability (and possibly scalability). Great! Eventual consistency says a change in the data in one location will &lt;em&gt;eventually&lt;/em&gt; be updated in every location that the database lives; however, until every location is updated, a read from one of the locations may not &lt;em&gt;yet&lt;/em&gt; have the updated value. You, the programmer, need to bear in mind you may not always have the most up-to-date data when working under this model.&lt;/p&gt;

&lt;h4&gt;
  
  
  strong consistency   💪
&lt;/h4&gt;

&lt;p&gt; The guarantee that a data store will always provide the most up-to-date value.&lt;/p&gt;

&lt;h4&gt;
  
  
  CAP Theorem   🧢
&lt;/h4&gt;

&lt;p&gt; The rule that you gotta pick two of the three: &lt;em&gt;(strong)&lt;/em&gt; &lt;em&gt;consistency&lt;/em&gt;, &lt;em&gt;availability&lt;/em&gt;, &lt;em&gt;partition tolerance&lt;/em&gt;. Any distributed data store can provide at most two of these three qualities, alas. &lt;em&gt;Availability&lt;/em&gt; means that every request returns a non-error response. &lt;em&gt;Partition tolerance&lt;/em&gt; is the ability of a system to continue to operate despite requests between data store nodes being delayed or dropped. See also strong consistency and eventual consistency.&lt;/p&gt;

&lt;h4&gt;
  
  
  ACID   🧪
&lt;/h4&gt;

&lt;p&gt; Hardcore-sounding acronym borrowed from databases that stands for &lt;em&gt;atomicity&lt;/em&gt;, &lt;em&gt;consistency&lt;/em&gt;, &lt;em&gt;isolation&lt;/em&gt;, &lt;em&gt;durability&lt;/em&gt;. See strong consistency,  eventual consistency, and other deets below.&lt;/p&gt;

&lt;h4&gt;
  
  
  atomicity   ⚛️
&lt;/h4&gt;

&lt;p&gt; Executing a sequence of operations all together as if they were a single unit, or not at all. &lt;/p&gt;

&lt;h4&gt;
  
  
  isolation   📦
&lt;/h4&gt;

&lt;p&gt; Executing a sequence of operations concurrently with another sequence has the same effect as executing the two sequences one after the other.&lt;/p&gt;

&lt;h4&gt;
  
  
  durability   🗿
&lt;/h4&gt;

&lt;p&gt; Think long-lasting. Standing the test of time. Persisting—i.e. written to disk, or if you were really hardcore, etched on a stone tablet—at which point it can be looked up even in the face of system failure such as a power outage or crash.&lt;/p&gt;

&lt;h4&gt;
  
  
  durable execution   🔜
&lt;/h4&gt;

&lt;p&gt; Similar to &lt;em&gt;durability&lt;/em&gt;: once a program has started executing, it will &lt;em&gt;continue&lt;/em&gt; executing to completion. This is achieved by persisting every step the program takes, so that execution can be continued by another process if the current process dies. &lt;/p&gt;

&lt;h4&gt;
  
  
  idempotent function   🥪
&lt;/h4&gt;

&lt;p&gt; Scary-sounding word, less scary meaning: a function that has the same observed result when called with the same inputs, whether it is called one time or many times.&lt;/p&gt;

&lt;p&gt; A function setting some field &lt;code&gt;foo=3&lt;/code&gt;? Idempotent. The function &lt;code&gt;foo += 3&lt;/code&gt;? Not idempotent, because the value of &lt;code&gt;foo&lt;/code&gt; is dependent on the number of times your function is called. Naive implementations of functions that transfer money or send emails are also not idempotent by &lt;em&gt;default&lt;/em&gt;. &lt;/p&gt;
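&lt;p&gt;A quick sketch of the difference, using a hypothetical account balance (the names are invented for illustration):&lt;/p&gt;

```go
package main

import "fmt"

// Hypothetical illustration of idempotency. setBalance has the same
// observed result no matter how many times it runs; addToBalance does not.
func setBalance(acct map[string]int, amount int) {
	acct["balance"] = amount
}

func addToBalance(acct map[string]int, amount int) {
	acct["balance"] += amount
}

func main() {
	acct := map[string]int{"balance": 0}

	// Three calls (think: three retries) behave exactly like one call.
	setBalance(acct, 3)
	setBalance(acct, 3)
	setBalance(acct, 3)
	fmt.Println(acct["balance"]) // 3

	// The non-idempotent version compounds with every retry.
	addToBalance(acct, 3)
	addToBalance(acct, 3)
	addToBalance(acct, 3)
	fmt.Println(acct["balance"]) // 12
}
```

&lt;p&gt;This is exactly why retried operations, like a money transfer, need to be made idempotent deliberately; retrying the naive version repeats the side effect.&lt;/p&gt;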

&lt;h4&gt;
  
  
  deterministic function   🧮
&lt;/h4&gt;

&lt;p&gt; Code that always has the same effect/output when given a particular input. Things that are &lt;em&gt;not&lt;/em&gt; deterministic use some external state such as user input, a random number, or stored data. Code that reads or writes to a variable that other code can also modify simultaneously is also &lt;em&gt;not&lt;/em&gt; deterministic.&lt;/p&gt;
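&lt;p&gt;For example, in Go (function names invented for illustration):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// double is deterministic: its output depends only on its input, so
// re-running it always reproduces the same result.
func double(x int) int {
	return x * 2
}

// roll is not deterministic: it reads external state (the clock and a
// random number generator), so two runs can disagree.
func roll(x int) int {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	return x + r.Intn(100)
}

func main() {
	fmt.Println(double(21)) // always 42
	fmt.Println(roll(21))   // varies from run to run
}
```

&lt;p&gt;Non-deterministic work belongs in places designed for it; in Temporal, that place is an Activity.&lt;/p&gt;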

&lt;h4&gt;
  
  
  platform   💻
&lt;/h4&gt;

&lt;p&gt; Windows, iOS, Docker, and VMware are all platforms. They’re execution environments that define how programs behave inside them. Temporal is also a platform, one that guarantees code run with Temporal is failure and timeout resilient. You may see the term &lt;em&gt;platform-level&lt;/em&gt; used in relation to &lt;a href="https://temporal.io/blog/failure-handling-in-practice"&gt;failures&lt;/a&gt;. Platform-level failures are caused by low-level issues such as network errors or process crashes.&lt;/p&gt;

&lt;h4&gt;
  
  
  application   〉
&lt;/h4&gt;

&lt;p&gt; The code you write. You may see the term &lt;em&gt;application-level&lt;/em&gt; used in relation to &lt;a href="https://temporal.io/blog/failure-handling-in-practice"&gt;failures&lt;/a&gt;. Application-level failures are domain-specific failures like “insufficient inventory”, or “user canceled ride request.”&lt;/p&gt;

&lt;h4&gt;
  
  
  event sourcing   🎤
&lt;/h4&gt;

&lt;p&gt; A design pattern that creates event objects for every state change in a system, and records this sequence of events in a log (or event history). Temporal uses event sourcing “under the hood” to ensure failure resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temporal-Specific Terms
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Temporal   ✧
&lt;/h4&gt;

&lt;p&gt; A way to run your code (a service and a library that work together) that ensures your code never gets stuck in failure at the application level or the platform level. While libraries like &lt;a href="https://www.npmjs.com/package/async-retry"&gt;async-retry&lt;/a&gt; take care of retry logic for functions that fail, what happens if your code making that library call crashes? Temporal says “we gotchu.” It abstracts away complex concepts around retries, rollbacks, queues, state machines, and timers, so that no matter where the failure happens, we’ll ensure your code keeps running the way you want. &lt;/p&gt;

&lt;h4&gt;
  
  
  Worker   👷
&lt;/h4&gt;

&lt;p&gt; The process that’s actually &lt;em&gt;doing the work&lt;/em&gt;: executing all of your Temporal code (the Workflow and Activities). Capitalized here to denote the Temporal-specific concept of a Worker, to differentiate it from the generic idea of a worker process.&lt;/p&gt;

&lt;h4&gt;
  
  
  Workflow   📖
&lt;/h4&gt;

&lt;p&gt; The high-level business logic of your program. Essentially, this is where the logic of your application begins. (&lt;em&gt;Technically&lt;/em&gt; execution starts with the Worker, and the Worker runs the Workflow code.) All Workflow logic must be deterministic.&lt;/p&gt;

&lt;h4&gt;
  
  
  Activity   💾
&lt;/h4&gt;

&lt;p&gt; Components of your Workflow that might fail, like network or file system calls, inventory holds, or credit card charges. The decision around how &lt;em&gt;many&lt;/em&gt; Activities your program should have–whether you make a separate Activity for every non-deterministic call or put the entire rest of your program in an Activity (don’t do that)–is generally a function of how you’d like your program to behave when retrying a failure. For example, if a downstream instruction should always grab the very freshest data when retrying, those instructions should be grouped together in a single Activity. If you can retry with the old data, they can be in separate Activities. Since Activities can be retried, they should be idempotent.&lt;/p&gt;

&lt;h4&gt;
  
  
  Query   🙋
&lt;/h4&gt;

&lt;p&gt; A way to inspect the state of a Workflow. The results are guaranteed to show the most recent state.&lt;/p&gt;

&lt;h4&gt;
  
  
  Signal   🧑‍🏫
&lt;/h4&gt;

&lt;p&gt; A way to notify or send information to a Workflow. A common use case is notifying a Workflow that the user added items to their shopping cart.&lt;/p&gt;

&lt;h4&gt;
  
  
  retry   🔄
&lt;/h4&gt;

&lt;p&gt; Generally, re-executing an Activity that has failed. Technically, Workflows can also be retried, but that is &lt;em&gt;far&lt;/em&gt; less common, such as when a developer is updating Workflow code running in production.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cluster   🏘️
&lt;/h4&gt;

&lt;p&gt; The collection of services and databases that make failure and timeout resilience possible. You might sometimes see this colloquially called the Temporal Server.&lt;/p&gt;

&lt;h4&gt;
  
  
  History   🗃️
&lt;/h4&gt;

&lt;p&gt; A log of events that happened over the course of execution. This log contains attempts to run Activities, Workflow status changes (started, failed, scheduled, etc.), timer events, and external information signaled to the system during the run.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Closing
&lt;/h2&gt;

&lt;p&gt;Knowing this core set of 25 terms should give you sufficient lay-of-the-land to sling references like &lt;em&gt;ACID&lt;/em&gt; and &lt;em&gt;Workflow&lt;/em&gt; in conversations with coworkers, friends, and family with ease! Better yet, you now know enough to dive deeper into subdomains of interest. If you’d like to try out these terms in practice, check out our &lt;a href="https://learn.temporal.io/"&gt;getting started guides, courses, and examples in Go, Java, Python, PHP, and TypeScript&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>microservices</category>
      <category>learning</category>
      <category>backend</category>
    </item>
    <item>
      <title>Tuning Temporal Server request latency on Kubernetes</title>
      <dc:creator>Rob Holland</dc:creator>
      <pubDate>Thu, 15 Jun 2023 16:44:17 +0000</pubDate>
      <link>https://dev.to/temporalio/tuning-temporal-server-request-latency-on-kubernetes-20np</link>
      <guid>https://dev.to/temporalio/tuning-temporal-server-request-latency-on-kubernetes-20np</guid>
      <description>&lt;p&gt;Request latency is an important indicator for the performance of Temporal Server. Temporal Cloud can offer reliably low request latencies, thanks to its custom persistence backend and expertly managed Temporal Server infrastructure. In this post, we’ll give you some tips for getting lower and more predictable request latencies, and making more efficient use of your nodes, when deploying a self-hosted Temporal Server on Kubernetes.&lt;/p&gt;

&lt;p&gt;When evaluating the performance of a Temporal Server deployment, we begin by looking at metrics for the request latencies your application, or workers, observe when communicating with Temporal Server. In order for the system as a whole to run efficiently and reliably, requests must be handled with consistent, low latencies. Low latencies allow us to get high throughput, and stable latencies avoid unexpected slowdowns in our application and allow us to monitor for performance degradation without triggering false alerts.&lt;/p&gt;

&lt;p&gt;For this post, we’ll use the &lt;a href="https://docs.temporal.io/clusters#history-service"&gt;History&lt;/a&gt; service as our example, which is the service responsible for handling calls to start a new workflow execution, or to update a workflow’s state (history) as it makes progress. None of these tips are specific to the History service—most of them can be applied to all the &lt;a href="https://docs.temporal.io/clusters#temporal-server"&gt;Temporal Server services&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The curious case of the unexpected throttling
&lt;/h2&gt;

&lt;p&gt;Generally, Kubernetes deployments will set CPU limits on containers to stop them from consuming too much CPU and starving other containers running on the same node. This is enforced using something called &lt;a href="https://medium.com/@ramandumcs/cpu-throttling-unbundled-eae883e7e494"&gt;CPU throttling&lt;/a&gt;. Kubernetes converts the CPU limit you set on the container into a limit on CPU cycles per 1/10th of a second. If the container tries to use more than this limit, it is “throttled”, meaning its execution is delayed. This can have a non-trivial impact on container performance, as it can increase request latency. This is particularly true for requests involving CPU-intensive tasks, such as obtaining locks.&lt;/p&gt;

&lt;p&gt;For monitoring the Kubernetes clusters in our Scaling series (&lt;a href="https://dev.to/temporalio/scaling-temporal-the-basics-31l5"&gt;first post here&lt;/a&gt;) we use the &lt;a href="https://github.com/prometheus-operator/kube-prometheus#readme"&gt;&lt;code&gt;kube-prometheus&lt;/code&gt;&lt;/a&gt; stack.&lt;/p&gt;

&lt;p&gt;In contrast to the 1/10th second used to manage CPU throttling, the Prometheus system uses an interval of 15 seconds or more between scrapes of aggregated CPU metrics. The large difference between the throttling period and the monitoring scrape interval means that CPU throttling can be occurring even while CPU usage metrics report well under 100% usage. For this reason, it’s important to monitor CPU throttling specifically.&lt;/p&gt;

&lt;p&gt;Here is an example for the History service:&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/3owwiacgyKBDDD52yfUhLd/0a66b3e514929ae4a57bd96d4829804d/CPU_Throttling-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/3owwiacgyKBDDD52yfUhLd/0a66b3e514929ae4a57bd96d4829804d/CPU_Throttling-mh.png" alt="History Service Dashboard: CPU is being throttled despite low CPU usage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We can see from the dashboard that although the history pods’ CPU usage is reporting below 60%, it is being throttled.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;kube-prometheus&lt;/code&gt; setups, you can use this Prometheus query to check for CPU throttling, adjusting the &lt;code&gt;namespace&lt;/code&gt; and &lt;code&gt;workload&lt;/code&gt; selectors as appropriate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(
    increase(container_cpu_cfs_throttled_periods_total{job="kubelet", metrics_path="/metrics/cadvisor", container!=""}[$__rate_interval])
    * on(namespace,pod)
    group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{namespace="temporal", workload="temporal-history"}
)
/
sum(
    increase(container_cpu_cfs_periods_total{job="kubelet", metrics_path="/metrics/cadvisor", container!=""}[$__rate_interval])
    * on(namespace,pod)
    group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{namespace="temporal", workload="temporal-history"}
) &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, how can we fix the throttling? Later we’ll discuss why you should probably stop using CPU limits entirely, but for now, as Temporal Server is written in Go, there is something else we can do to improve latencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  GOMAXPROCS in Kubernetes
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;GOMAXPROCS&lt;/code&gt; is a Go runtime setting that controls how many operating system threads may execute Go code simultaneously. By default, Go assumes it can run a thread on every core of the machine it’s running on, giving it a high level of concurrency.&lt;/p&gt;

&lt;p&gt;On a Kubernetes cluster, however, containers will generally not be allowed to use the majority of the cores on a node, due to CPU limits. This mismatch means that Go will make bad decisions about how many threads to run, leading to inefficient CPU usage. It will (among other things) have to run garbage collection and other housekeeping tasks on CPU cores that it can’t use for any useful amount of real work. As an example: on our Kubernetes cluster, the nodes have 8 cores, but our history pods are limited to 2 cores. This means Go may run up to 8 threads, but across those 8 threads only be able to use a total of 2 cores’ share of cycles in every throttling period. It then becomes easy for the container’s threads to starve each other of allowed CPU cycles. &lt;/p&gt;

&lt;p&gt;To fix this, we can let Go know how many cores it’s allowed to use by setting the &lt;code&gt;GOMAXPROCS&lt;/code&gt; environment variable to match our CPU limit. Note: &lt;code&gt;GOMAXPROCS&lt;/code&gt; must be an integer, so you should set it to the number of whole cores you set in the limit. Let’s see what happens when we set &lt;code&gt;GOMAXPROCS&lt;/code&gt; on our deployments:&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/3cwcbRAng6gzTZ2c4XpDL9/12585d12d60dff2258ec3d86cfd13e94/GOMAXPROCS-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/3cwcbRAng6gzTZ2c4XpDL9/12585d12d60dff2258ec3d86cfd13e94/GOMAXPROCS-mh.png" alt="History Dashboard: Showing reduced CPU usage and lower request latency after setting GOMAXPROCS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left of the graphs, you can see the performance with the default &lt;code&gt;GOMAXPROCS&lt;/code&gt; setting. Towards the right, you can see the results of setting the &lt;code&gt;GOMAXPROCS&lt;/code&gt; environment variable to “2”, letting Go know it should use at most 2 threads to run Go code. CPU throttling has disappeared entirely, which has helped make our latency more stable. We can also see that because Go can make better decisions about how many threads to create, our CPU usage has dropped, even though performance has actually improved slightly (request latency has lowered). Here, you can see how the CPU usage across all Temporal services drops after adjusting &lt;code&gt;GOMAXPROCS&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/4sVnSDGkT9lh2P4w2xy6n8/7f2233d25779c520992a5be4809ff8f3/Resources_-_Temporal_-_Dashboards_-_Grafana-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/4sVnSDGkT9lh2P4w2xy6n8/7f2233d25779c520992a5be4809ff8f3/Resources_-_Temporal_-_Dashboards_-_Grafana-mh.png" alt="Resource Dashboard: Showing reduced CPU by all Temporal Server services after setting GOMAXPROCS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To help give a better experience out of the box, from release 1.21.0 onwards, Temporal will automatically set &lt;code&gt;GOMAXPROCS&lt;/code&gt; to match Kubernetes CPU limits if they are present and the &lt;code&gt;GOMAXPROCS&lt;/code&gt; environment variable is not already set. Before that release, you should manually set the &lt;code&gt;GOMAXPROCS&lt;/code&gt; environment variable for your Temporal Cluster deployments. Also note that &lt;code&gt;GOMAXPROCS&lt;/code&gt; will not automatically be set based on &lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container"&gt;CPU requests&lt;/a&gt;, only limits. If you are not using CPU limits, you should set &lt;code&gt;GOMAXPROCS&lt;/code&gt; manually to a value close to your CPU request (equal to it, or slightly greater). This allows Go to make good decisions about CPU efficiency, taking your CPU requests into consideration.&lt;/p&gt;

&lt;p&gt;Which brings us nicely to our second suggestion…&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU limits probably do more harm than good
&lt;/h2&gt;

&lt;p&gt;Now that we’ve improved the efficiency of our CPU usage, I’m going to echo the &lt;a href="https://twitter.com/thockin/status/1134193838841401345?s=20"&gt;sentiment of Tim Hockin&lt;/a&gt; (of Kubernetes fame) and &lt;a href="https://home.robusta.dev/blog/stop-using-cpu-limits"&gt;many&lt;/a&gt; &lt;a href="https://medium.com/directeam/kubernetes-resources-under-the-hood-part-3-6ee7d6015965"&gt;others&lt;/a&gt; and suggest that you stop using CPU limits entirely. CPU requests should be closely monitored to ensure you are requesting a sensible amount of CPU for your containers, so that Kubernetes can make good decisions about how many pods it assigns to a node. This allows containers that are having a CPU burst to make use of any spare CPU on the node. Make sure to monitor node CPU usage as well—frequently running out of CPU on the node tells you that pods are bursting more often than your requests allow for, and you should re-examine their CPU requests.&lt;/p&gt;

&lt;p&gt;If you can’t disable limits entirely as they enforce some business requirements (customer isolation for example), then consider dedicating some nodes to the Temporal Cluster and use &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/"&gt;taints and tolerations&lt;/a&gt; to pin the deployments to those nodes. This allows you to remove CPU limits from your Temporal Cluster deployments while leaving them in place for your other workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Avoiding increased latency from re-balancing during Temporal upgrades
&lt;/h2&gt;

&lt;p&gt;Temporal Server’s &lt;a href="https://docs.temporal.io/clusters#history-service"&gt;History&lt;/a&gt; service automatically balances history shards across the available history pods; this is what allows a Temporal Cluster to scale horizontally. &lt;em&gt;Note: Although we use the term balance here, Temporal does not guarantee that there will be an equal number of shards on each pod.&lt;/em&gt; The History service will rebalance shards every time a history pod is added or removed, and this process can take a while to settle. Depending on the scale of your cluster, this rebalancing can increase request latency, as a shard cannot be written to while it is being reassigned to a new history pod. The effect will vary depending on what percentage of the shards each pod is responsible for: the fewer pods you have, the greater the effect on latency when pods are added or removed.&lt;/p&gt;

&lt;p&gt;The latency spike during a rollout can be mitigated in two ways, depending on the number of history pods you have:&lt;/p&gt;

&lt;p&gt;If you have more than 10 pods, the best option will be to do rollouts slowly, ideally one pod at a time. You can use low values for &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge"&gt;maxSurge&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable"&gt;maxUnavailable&lt;/a&gt; to ensure pods are rotated slowly. Using &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#min-ready-seconds"&gt;minReadySeconds&lt;/a&gt;, or a &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#min-ready-seconds"&gt;startupProbe&lt;/a&gt; with initialDelaySeconds, can give Temporal Server time to rebalance as each pod is added.&lt;/p&gt;

&lt;p&gt;If you have fewer than 10 pods, it’s better to rotate pods quickly so that rebalancing can settle quickly. You will see latency spikes for each change, but the overall impact will be lower. You can experiment with the &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge"&gt;maxSurge&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable"&gt;maxUnavailable&lt;/a&gt; settings to allow Kubernetes to roll out more pods at the same time. The defaults are 25% for each, which for 4 pods means only 1 pod will be rotated at once. Your mileage will vary based on scale and load, but we’ve had good success with 50% for maxSurge/maxUnavailable on low (4 or fewer) pod counts.&lt;/p&gt;
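&lt;p&gt;As a sketch, the fast-rotation variant might look like this in a Deployment spec (the 50% values are what worked for us, not a universal recommendation):&lt;/p&gt;

```yaml
# Illustrative rollout settings for a small (4 or fewer pods) History deployment.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%        # bring up half again as many pods at once
      maxUnavailable: 50%  # allow half the pods to be down during the rollout
```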

&lt;p&gt;Pull-based monitoring systems such as Prometheus use a discovery mechanism to find pods to scrape for metrics. As there is a delay between a pod being started and Prometheus being aware of it, the pod may not be scraped for a few intervals after starting up. This means metrics can report inaccurate values during a deployment, until all the new pods are being scraped.&lt;/p&gt;

&lt;p&gt;For this reason, it’s best to ensure you are not using metrics that are emitted by the History service when evaluating History deployment strategies. Instead, SDK metrics such as &lt;code&gt;StartWorkflowExecution&lt;/code&gt; request latency are a good fit here. Frontend metrics can also be useful, as long as the Frontend service is not being rolled out at the same time as the History service.&lt;/p&gt;

&lt;p&gt;These same deployment strategies are also useful for the &lt;a href="https://docs.temporal.io/clusters#matching-service"&gt;Matching&lt;/a&gt; service, which balances task queue partitions across matching pods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this post we’ve discussed CPU throttling, CPU limits, and the effect of rebalancing during Temporal upgrades/rollouts. Hopefully, these tips will help you save some money on resources, by using less CPU, and improve the performance and reliability of your self-hosted Temporal Cluster.&lt;/p&gt;

&lt;p&gt;We hope you’ve found this useful; we’d love to discuss it further or answer any questions you might have. Please reach out with any questions or comments on the &lt;a href="https://community.temporal.io/"&gt;Community Forum&lt;/a&gt; or &lt;a href="https://t.mp/slack"&gt;Slack&lt;/a&gt;. My name is Rob Holland; feel free to reach out to me directly on &lt;a href="https://t.mp/slack"&gt;Temporal’s Slack&lt;/a&gt; if you like, and I’d love to hear from you. You can also follow us on &lt;a href="https://twitter.com/temporalio"&gt;Twitter&lt;/a&gt; if you’d like more of this kind of content.&lt;/p&gt;

</description>
      <category>temporal</category>
      <category>docker</category>
      <category>kubernetes</category>
      <category>go</category>
    </item>
    <item>
      <title>Scaling Temporal: The Basics</title>
      <dc:creator>Rob Holland</dc:creator>
      <pubDate>Thu, 15 Jun 2023 16:42:55 +0000</pubDate>
      <link>https://dev.to/temporalio/scaling-temporal-the-basics-31l5</link>
      <guid>https://dev.to/temporalio/scaling-temporal-the-basics-31l5</guid>
      <description>&lt;p&gt;Scaling your own Temporal Cluster can be a complex subject because there are infinite variations on workload patterns, business goals, and operational goals. So, for this post, we will help make it simple and focus on metrics and terminology that can be used to discuss scaling a Temporal Cluster for any kind of workflow architecture.&lt;br&gt;
By far the simplest way to scale is to use Temporal Cloud. Our custom persistence layer and expertly managed Temporal Clusters can support extreme levels of load, and you pay only for what you use as you grow.&lt;br&gt;
In this post, we'll walk through a process for scaling a self-hosted instance of Temporal Cluster.&lt;/p&gt;

&lt;p&gt;Out of the box, our Temporal Cluster is configured with the development-level defaults. We’ll work through some &lt;strong&gt;load&lt;/strong&gt;, &lt;strong&gt;measure&lt;/strong&gt;, &lt;strong&gt;scale&lt;/strong&gt; iterations to move towards a production-level setup, touching on Kubernetes resource management, Temporal shard count configuration, and polling optimization. The process we’ll follow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: Set or adjust the level of load we want to test with. Normally, we’ll be increasing the load as we improve our configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure&lt;/strong&gt;: Check our monitoring to spot bottlenecks or problem areas under our new level of load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: Adjust Kubernetes or Temporal configuration to remove bottlenecks, ensuring we have safe headroom for CPU and memory usage. We may also need to adjust node or persistence instance sizes here, either to scale up for more load or scale things down to save costs if we have more headroom than we need.&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Our Cluster
&lt;/h2&gt;

&lt;p&gt;For our load testing we’ve deployed Temporal on Kubernetes, and we’re using MySQL for the persistence backend. The MySQL instance has 4 CPU cores and 32GB RAM, and each Temporal service (Frontend, History, Matching, and Worker) has 2 pods, with &lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/"&gt;requests&lt;/a&gt; for 1 CPU core and 1GB RAM as a starting point. We’re not setting CPU limits for our pods—see our upcoming &lt;em&gt;Temporal on Kubernetes&lt;/em&gt; post for more details on why. For monitoring we’ll use Prometheus and Grafana, installed via the &lt;a href="https://github.com/prometheus-operator/kube-prometheus"&gt;kube-prometheus&lt;/a&gt; stack, giving us some useful Kubernetes metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/7jAvPG8jSHNbpI7WRFsEM3/614dcb36bf01a6816f470ded01b953f2/cluster.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/7jAvPG8jSHNbpI7WRFsEM3/614dcb36bf01a6816f470ded01b953f2/cluster.png" alt="Temporal Cluster diagram"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Scaling Up
&lt;/h2&gt;

&lt;p&gt;Our goal in this post will be to see what performance we can achieve while keeping our persistence database (MySQL in this case) at or below 80% CPU. Temporal is designed to be horizontally scalable, so it is almost always the case that it can be scaled to the point that the persistence backend becomes the bottleneck.&lt;/p&gt;
&lt;h3&gt;
  
  
  Load
&lt;/h3&gt;

&lt;p&gt;To create load on a Temporal Cluster, we need to start Workflows and have Workers to run them. To make it easy to set up load tests, we have packaged a simple Workflow and some Activities in the &lt;a href="https://github.com/temporalio/benchmark-workers/pkgs/container/benchmark-workers"&gt;benchmark-workers package&lt;/a&gt;. Running a &lt;code&gt;benchmark-worker&lt;/code&gt; container will bring up a load test Worker with default Temporal Go SDK settings. The only configuration it needs out of the box is the host and port for the Temporal Frontend service.&lt;/p&gt;

&lt;p&gt;To run a benchmark Worker with default settings we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl run benchmark-worker &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--image&lt;/span&gt; ghcr.io/temporalio/benchmark-workers:main &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--image-pull-policy&lt;/span&gt; Always &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="s2"&gt;"TEMPORAL_GRPC_ENDPOINT=temporal-frontend.temporal:7233"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once our Workers are running, we need something to start Workflows in a predictable way. The benchmark-workers package includes a runner that starts a configurable number of Workflows in parallel, starting a new execution each time one of the Workflows completes. This gives us a simple dial to increase load, by increasing the number of parallel Workflows that will be running at any given time.&lt;/p&gt;

&lt;p&gt;To run a benchmark runner we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl run benchmark-worker &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--image&lt;/span&gt; ghcr.io/temporalio/benchmark-workers:main &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--image-pull-policy&lt;/span&gt; Always &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="s2"&gt;"TEMPORAL_GRPC_ENDPOINT=temporal-frontend.temporal:7233"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--command&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; runner &lt;span class="nt"&gt;-t&lt;/span&gt; ExecuteActivity &lt;span class="s1"&gt;'{ "Count": 3, "Activity": "Echo", "Input": { "Message": "test" } }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For our load test, we’ll use a deployment rather than &lt;code&gt;kubectl&lt;/code&gt; to deploy the Workers and runner. This allows us to easily scale the Workers and collect metrics from them via Prometheus. We’ll use a deployment similar to the example here: &lt;a href="https://github.com/temporalio/benchmark-workers/blob/main/deployment.yaml"&gt;github.com/temporalio/benchmark-workers/blob/main/deployment.yaml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this test we’ll start off with the default runner settings, which will keep 10 parallel Workflow executions active. You can find details of the available configuration options in the &lt;a href="https://github.com/temporalio/benchmark-workers#readme"&gt;README&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Measure
&lt;/h3&gt;

&lt;p&gt;When deciding how to measure system performance under load, the first metric that might come to mind is the number of Workflows completed per second. However, Workflows in Temporal can vary enormously between different use cases, so this turns out to not be a very useful metric. A load test using a Workflow which just runs one Activity might produce a relatively high result compared to a system running a batch processing Workflow which calls hundreds of Activities. For this reason, we use a metric called &lt;a href="https://docs.temporal.io/workflows#state-transition"&gt;&lt;strong&gt;State Transitions&lt;/strong&gt;&lt;/a&gt; as our measure of performance. State Transitions represent Temporal writing to its persistence backend, which is a reasonable proxy of how much work Temporal itself is doing to ensure your executions are durable. Using State Transitions per second allows us to compare numbers across different workloads. Using Prometheus, you can monitor State Transitions with the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(state_transition_count_count{namespace="default"}[1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have State Transitions per second as our throughput metric, we need to qualify it with some other metrics for business or operational goals (commonly called Service Level Objectives, or SLOs). The values you decide on for a production SLO will vary. To start our load tests, we are going to work on handling a fixed (as opposed to spiky) level of load, and expect a StartWorkflowExecution request latency of less than 150ms. If a load test can run within our StartWorkflowExecution latency SLO, we’ll consider that the cluster can handle the load. As we progress we’ll add other SLOs to help us decide if the cluster can be scaled to handle higher load, or to handle the current load more efficiently.&lt;/p&gt;

&lt;p&gt;We can add a Prometheus alert to make sure we are meeting our SLO. We’re only concerned about &lt;code&gt;StartWorkflowExecution&lt;/code&gt; requests for now, so we filter the operation metric tag to focus on those.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TemporalRequestLatencyHigh&lt;/span&gt;
     &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Temporal {{ $labels.operation }} request latency is currently {{ $value | humanize }}, outside of SLO 150ms.&lt;/span&gt;
       &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Temporal request latency is too high.&lt;/span&gt;
     &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
       &lt;span class="s"&gt;histogram_quantile(0.95, sum by (le, operation) (rate(temporal_request_latency_bucket{job="benchmark-monitoring",operation="StartWorkflowExecution"}[5m])))&lt;/span&gt;
       &lt;span class="s"&gt;&amp;gt; 0.150&lt;/span&gt;
     &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
     &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;temporal&lt;/span&gt;
       &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checking our dashboard, we can see that unfortunately our alert is already firing, telling us we’re failing our SLO for request latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/1vuDxyXumf9Iq9DZAs3p8n/780a082dc9749902cfa6497019472b30/Scaling__1-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/1vuDxyXumf9Iq9DZAs3p8n/780a082dc9749902cfa6497019472b30/Scaling__1-mh.png" alt="SLO Dashboard: Showing alert firing for request latency"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scale
&lt;/h3&gt;

&lt;p&gt;Obviously, this is not where we want to leave things, so let’s find out why our request latency is so high. The request we’re concerned with is the &lt;code&gt;StartWorkflowExecution&lt;/code&gt; request, which is handled by the History service. Before we dig into where the bottleneck might be, we should introduce one of the main tuning aspects of Temporal performance, &lt;strong&gt;&lt;a href="https://docs.temporal.io/clusters/#history-shard"&gt;History Shards&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Temporal uses shards (partitions) to divide responsibility for a Namespace’s Workflow histories amongst History pods, each of which will manage a set of the shards. Each Workflow history will belong to a single shard, and each shard will be managed by a single History pod. Before a Workflow history can be created or updated, there is a shard lock that must be obtained. This needs to be a very fast operation so that Workflow histories can be created and updated efficiently. Temporal allows you to choose the number of shards to partition across. The larger the shard count, the less lock contention there is, as each shard will own fewer histories, so there will be less waiting to obtain the lock.&lt;/p&gt;

&lt;p&gt;We can measure the latency for obtaining the shard lock in Prometheus using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.95, sum by (le)(rate(lock_latency_bucket[1m])))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/7hNeLRnXBctLK7erfMfpeV/4232591701cb604cd2e72d24cb3453e1/Scaling__2-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/7hNeLRnXBctLK7erfMfpeV/4232591701cb604cd2e72d24cb3453e1/Scaling__2-mh.png" alt="History Dashboard: High shard lock latency"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the History dashboard we can see that shard lock latency p95 is nearly 50ms. This is much higher than we’d like. For good performance we’d expect shard lock latency to be less than 5ms, ideally around 1ms. This tells us that we probably have too few shards.&lt;/p&gt;

&lt;p&gt;The shard count on our cluster is set to the development default, which is 4. Temporal recommends that small production clusters use 512 shards. To give an idea of scale, it is rare for even large Temporal clusters to go beyond 4,096 shards.&lt;/p&gt;

&lt;p&gt;The downside to increasing the shard count is that each shard requires resources to manage. An overly large shard count wastes CPU and Memory on History pods; each shard also has its own task processing queues, which puts extra pressure on the persistence database. &lt;em&gt;One thing to note about shard count in Temporal is that it is the one configuration setting which cannot (currently) be changed after the cluster is built&lt;/em&gt;. For this reason it’s very important to do your own load testing or research to determine what a sensible shard count would be, &lt;strong&gt;before&lt;/strong&gt; building a production cluster. In future we hope to make the shard count adjustable. As this is just a test cluster, we can rebuild it with a shard count of 512.&lt;/p&gt;
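&lt;p&gt;For reference, on a cluster deployed with the official Helm charts the shard count is set in the server configuration. A values-file sketch (verify the exact key path against your chart version):&lt;/p&gt;

```yaml
# values.yaml (sketch; key path may vary between chart versions)
server:
  config:
    numHistoryShards: 512  # fixed at cluster creation; cannot be changed later
```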

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/1ACDp4CIlgckjgE4Pqidp5/c855c8f71ecfa25719b2822186f591ea/Scaling__3-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/1ACDp4CIlgckjgE4Pqidp5/c855c8f71ecfa25719b2822186f591ea/Scaling__3-mh.png" alt="History Dashboard: Shard latency dropped, but pod memory climbing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After changing the shard count, the shard lock latency has dropped from around 50ms to around 1ms. That’s a huge improvement!&lt;/p&gt;

&lt;p&gt;However, as we mentioned, each shard needs management. Part of that management is a cache of Workflow histories for the shard. We can see the History pods’ memory usage is rising quickly. If the pods run out of memory, Kubernetes will terminate and restart them (OOMKilled). This causes Temporal to rebalance the shards onto the remaining History pod(s), only to rebalance again once the new History pod comes up. Each time you make a scaling change, be sure to check that all Temporal pods are still within their CPU and memory requests—pods being frequently restarted is very bad for performance! To fix this, we can bump the memory limits for the History containers. Currently it is hard to estimate how much memory a History pod will use, because the history cache limits are configured not per host, or even in MB, but as a number of cache entries to store. There is work underway to improve this: &lt;a href="https://github.com/temporalio/temporal/issues/2941"&gt;github.com/temporalio/temporal/issues/2941&lt;/a&gt;. For now, we’ll set the History memory limit to 8GB and keep an eye on the pods—we can always raise the limit later if we find they need more.&lt;/p&gt;
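&lt;p&gt;In Kubernetes terms this is just the memory limit on the History containers. A sketch of the container resources, matching the requests we started with (note we still set no CPU limit):&lt;/p&gt;

```yaml
# History container resources (sketch)
resources:
  requests:
    cpu: "1"
    memory: 1Gi
  limits:
    memory: 8Gi  # headroom for the per-shard history caches; no CPU limit
```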

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/lFR4cS7n4rnQcFnWnba3a/58833e80b80b65ba0691153894e9b715/Scaling__4-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/lFR4cS7n4rnQcFnWnba3a/58833e80b80b65ba0691153894e9b715/Scaling__4-mh.png" alt="History Dashboard: History pods with memory headroom"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this change, the History pods are looking good. Now that things are stable, let’s see what impact our changes have had on the State Transitions and our SLO.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/61sgBu7aq8c8jMMMJRYcvN/dbd1d3e3232f5d79dfb21287c5250af0/Scaling__6-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/61sgBu7aq8c8jMMMJRYcvN/dbd1d3e3232f5d79dfb21287c5250af0/Scaling__6-mh.png" alt="History Dashboard: State transitions up, latency within SLO"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;State Transitions are up from our starting point of 150/s to 395/s and we’re way below our SLO of 150ms for request latency, staying under 50ms, so that’s great! We’ve completed a &lt;strong&gt;load&lt;/strong&gt;, &lt;strong&gt;measure&lt;/strong&gt;, &lt;strong&gt;scale&lt;/strong&gt; iteration and everything looks stable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Round two!
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Load
&lt;/h3&gt;

&lt;p&gt;After our shard adjustment, we’re stable, so let’s iterate again. We’ll increase the load to 20 parallel workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/5jaByfR0fEJEff8Fa2Gvfk/269b4da15595d06fdd714bda6689c97e/Scaling__7-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/5jaByfR0fEJEff8Fa2Gvfk/269b4da15595d06fdd714bda6689c97e/Scaling__7-mh.png" alt="SLO Dashboard: State transitions up"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Checking our SLO dashboard, we can see that State Transitions have risen to 680/s. Our request latency is still fine, so let’s bump the load to 30 parallel workflows and see if we get more State Transitions for free.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/5pds6aSQF43gP7JQlHkEzp/34fbda176b4558d4d76fde3bc41ac64a/Scaling__8-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/5pds6aSQF43gP7JQlHkEzp/34fbda176b4558d4d76fde3bc41ac64a/Scaling__8-mh.png" alt="SLO Dashboard: State transitions up"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We did get another rise in State Transitions, although not as dramatic. Time to check the dashboards again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Measure
&lt;/h3&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/6U4GgTbHiDun0u9AhNiMoS/283a92144e3c31e97c422280583517a1/Scaling__9-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/6U4GgTbHiDun0u9AhNiMoS/283a92144e3c31e97c422280583517a1/Scaling__9-mh.png" alt="History Dashboard: High CPU"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;History CPU is now exceeding its requests at times, and we’d like some headroom. Ideally, the process should use under 80% of its request the majority of the time, so let’s bump the History pods to 2 cores.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/5vXteyowBCu5yEGfJMTmdf/4f0df0946a8528e747a4b82504932b59/Scaling__10-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/5vXteyowBCu5yEGfJMTmdf/4f0df0946a8528e747a4b82504932b59/Scaling__10-mh.png" alt="History Dashboard: CPU has headroom"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;History CPU is looking better now, how about our State Transitions?&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/4svsTIEvMoFk8p1ZHuUtEF/da30c588d18dee3f2e5b027d56462572/Scaling__11-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/4svsTIEvMoFk8p1ZHuUtEF/da30c588d18dee3f2e5b027d56462572/Scaling__11-mh.png" alt="SLO Dashboard: State transitions up, request latency down"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’re doing well! State Transitions are now up to 1,200/s and request latency is back down to 50ms. We’ve got the hang of the History scaling process, so let’s move on to look at another core Temporal sub-system, polling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scale
&lt;/h3&gt;

&lt;p&gt;While the History service is concerned with shuttling event histories to and from the persistence backend, the polling system (known as the Matching service) is responsible for matching tasks to your application workers efficiently.&lt;/p&gt;

&lt;p&gt;If your Worker replica count and poller configuration are not optimized, there will be a delay between the time a task is scheduled and the time a Worker starts processing it. This is known as Schedule-to-Start latency, and it will be our next SLO. We’ll aim for 150ms, as we do for our Request Latency SLO.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TemporalWorkflowTaskScheduleToStartLatencyHigh&lt;/span&gt;
     &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Temporal Workflow Task Schedule to Start latency is currently {{ $value | humanize }}, outside of SLO 150ms.&lt;/span&gt;
       &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Temporal Workflow Task Schedule to Start latency is too high.&lt;/span&gt;
     &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
       &lt;span class="s"&gt;histogram_quantile(0.95, sum by (le) (rate(temporal_workflow_task_schedule_to_start_latency_bucket{namespace="default"}[5m])))&lt;/span&gt;
       &lt;span class="s"&gt;&amp;gt; 0.150&lt;/span&gt;
     &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
     &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;temporal&lt;/span&gt;
       &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TemporalActivityScheduleToStartLatencyHigh&lt;/span&gt;
     &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Temporal Activity Schedule to Start latency is currently {{ $value | humanize }}, outside of SLO 150ms.&lt;/span&gt;
       &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Temporal Activity Schedule to Start latency is too high.&lt;/span&gt;
     &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
       &lt;span class="s"&gt;histogram_quantile(0.95, sum by (le) (rate(temporal_activity_schedule_to_start_latency_bucket{namespace="default"}[5m])))&lt;/span&gt;
       &lt;span class="s"&gt;&amp;gt; 0.150&lt;/span&gt;
     &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
     &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;temporal&lt;/span&gt;
       &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding these alerts, let’s check out the polling dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/3rZ9bGWgJSt10SSx5CZNsT/d3ae7acf9613980fc537a4304b458e1c/Scaling__12-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/3rZ9bGWgJSt10SSx5CZNsT/d3ae7acf9613980fc537a4304b458e1c/Scaling__12-mh.png" alt="Polling Dashboard: Activity Schedule-to-Start latency is outside of our SLO"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So we can see here that our Schedule-to-Start latency for Activities is too high. We’re taking over 150ms to begin an Activity after it’s been scheduled. The dashboard also shows another polling-related metric which we call &lt;strong&gt;Poll Sync Rate&lt;/strong&gt;. In an ideal world, when a Worker’s poller requests some work, the Matching service can hand it a task directly from memory. This is known as “sync match”, short for synchronous matching. If the Matching service holds a task in memory for too long, because it has not been able to hand out work quickly enough, the task is flushed to the persistence database. Tasks that were sent to the persistence database then need to be loaded back later to hand to pollers (async matching). Compared with sync matching, async matching increases the load on the persistence database and is a lot less efficient. The ideal, then, is to have enough pollers to quickly consume all the tasks that land on a task queue. To put the least load on the persistence database and get the highest throughput of tasks on a task queue, we should aim for both Workflow and Activity Poll Sync Rates to be 99% or higher. Improving the Poll Sync Rate will also improve the Schedule-to-Start latency, as Workers will receive tasks more quickly.&lt;/p&gt;

&lt;p&gt;We can measure the Poll Sync Rate in Prometheus using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (task_type) (rate(poll_success_sync[1m])) / sum by (task_type) (rate(poll_success[1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To improve the Poll Sync Rate, we adjust the number of Worker pods and their poller configuration. Our setup currently has only 2 Worker pods, each configured with 10 Activity pollers and 10 Workflow pollers. Let’s up that to 20 pollers of each kind.&lt;/p&gt;
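&lt;p&gt;With the deployment-based setup, that change is a couple of environment variables on the Worker Deployment. The variable names below are hypothetical placeholders; check the benchmark-workers &lt;a href="https://github.com/temporalio/benchmark-workers#readme"&gt;README&lt;/a&gt; for the real ones:&lt;/p&gt;

```yaml
# Worker Deployment fragment (sketch; poller env var names are placeholders)
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: benchmark-worker
          image: ghcr.io/temporalio/benchmark-workers:main
          env:
            - name: TEMPORAL_GRPC_ENDPOINT
              value: "temporal-frontend.temporal:7233"
            - name: WORKFLOW_POLLERS   # placeholder name
              value: "20"
            - name: ACTIVITY_POLLERS   # placeholder name
              value: "20"
```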

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/1eNCqEDaL8YpOzh9HZg01p/6a9ef9b984140cbdd22178ed41f6b411/Scaling__13-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/1eNCqEDaL8YpOzh9HZg01p/6a9ef9b984140cbdd22178ed41f6b411/Scaling__13-mh.png" alt="Polling Dashboard: Activity Schedule-to-Start latency improved, but still over SLO"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Better, but not enough. Let’s try 100 of each type.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/4TVpVof4j52n6ZJZwMA3pg/61072500d1a914c33baea56f61a37944/Scaling__14-mh__1_.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/4TVpVof4j52n6ZJZwMA3pg/61072500d1a914c33baea56f61a37944/Scaling__14-mh__1_.png" alt="Polling Dashboard: Activity Schedule-to-Start latency within SLO"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Much better! The Activity Poll Sync Rate is still not quite sticking at 99%, though, and bumping the Activity pollers to 150 didn’t fix it either. Let’s try adding 2 more Worker pods…&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/3HtIYQStrxDiUaTvYo8cwP/2b23174471ac07ad60a4d7c6a8a65d23/Scaling__15-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/3HtIYQStrxDiUaTvYo8cwP/2b23174471ac07ad60a4d7c6a8a65d23/Scaling__15-mh.png" alt="Polling Dashboard: Poll Sync Rate &amp;gt; 99%"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nice, consistently above 99% for Poll Sync Rate for both Workflow and Activity now. A quick check of the Matching dashboard shows that the Matching pods are well within CPU and Memory requests, so we’re looking stable. Now let’s see how we’re doing for State Transitions.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/3iTC2pgTlHcCdjgY6MHTLj/8bfe80c9d541ae0f21fb5750468aacd6/Scaling__16-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/3iTC2pgTlHcCdjgY6MHTLj/8bfe80c9d541ae0f21fb5750468aacd6/Scaling__16-mh.png" alt="SLO Dashboard: State Transitions up to 1,350/second"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking good. Improving our polling efficiency has increased our State Transitions by around 150/second. One last check to see if we’re still within our persistence database CPU target of below 80%.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/3j3qAqFod91T0wLTAau2iN/0ac24573769a3f6b974bd2c37e960b7c/Scaling__17-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/3j3qAqFod91T0wLTAau2iN/0ac24573769a3f6b974bd2c37e960b7c/Scaling__17-mh.png" alt="Persistence Dashboard: Database CPU &amp;lt; 80%"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes! We’re nearly spot on, averaging around 79%. That brings us to the end of our second &lt;strong&gt;load&lt;/strong&gt;, &lt;strong&gt;measure&lt;/strong&gt;, &lt;strong&gt;scale&lt;/strong&gt; iteration. The next step would be either to increase the database instance size and continue iterating to scale up, or, if we’ve hit our desired performance target, to check resource usage and reduce requests where appropriate, potentially saving costs by reducing the node count.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;We’ve taken a cluster configured to the development default settings and scaled it from 150 to 1,350 State Transitions/second. To achieve this, we increased the shard count from 4 to 512, increased the History pod CPU and Memory requests, and adjusted our Worker replica count and poller configuration.&lt;/p&gt;

&lt;p&gt;We hope you’ve found this useful. We’d love to discuss it further or answer any questions you might have. Please reach out with any questions or comments on the &lt;a href="https://community.temporal.io/"&gt;Community Forum&lt;/a&gt; or &lt;a href="https://t.mp/slack"&gt;Slack&lt;/a&gt;. My name is Rob Holland; feel free to reach out to me directly on Slack if you like—I’d love to hear from you.&lt;/p&gt;

</description>
      <category>temporal</category>
      <category>performance</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
