<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Temporal</title>
    <description>The latest articles on DEV Community by Temporal (@temporalio).</description>
    <link>https://dev.to/temporalio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F3146%2F0ad3097f-4bc5-469a-9263-de293ef2ab2e.png</url>
      <title>DEV Community: Temporal</title>
      <link>https://dev.to/temporalio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/temporalio"/>
    <language>en</language>
    <item>
      <title>Behind The Badge: How We Built 2,000 Hackable Badges For Temporal Replay</title>
      <dc:creator>Shy Ruparel</dc:creator>
      <pubDate>Tue, 26 May 2026 04:00:00 +0000</pubDate>
      <link>https://dev.to/temporalio/behind-the-badge-how-we-built-2000-hackable-badges-for-temporal-replay-ejo</link>
      <guid>https://dev.to/temporalio/behind-the-badge-how-we-built-2000-hackable-badges-for-temporal-replay-ejo</guid>
      <description>&lt;p&gt;At some point in December, Candace, our Head of Design, asked me: what if your time at Replay could be represented as a Workflow? Every check-in, every fun event, every person you met, a living timeline, running on Temporal.&lt;/p&gt;

&lt;p&gt;Four months later, I'm coordinating manufacturing circuit board badges in Shenzhen over WhatsApp at 11 p.m., and I've hand-soldered more PCBs than I'd like to admit.&lt;/p&gt;

&lt;p&gt;To understand why I was asked to figure out this project, you need to know that my role at Temporal has an informal understanding baked into it: give Shy the weird projects. The ones that involve breadboards on a kitchen table at 11 p.m., a &lt;a href="https://www.youtube.com/watch?v=C_fE8T-DwiU" rel="noopener noreferrer"&gt;document translation pipeline for toys&lt;/a&gt;, or a half-baked idea (or two, or three...) that could either be really cool or a complete disaster at a conference in front of 2,000 people.&lt;/p&gt;

&lt;p&gt;The latter is this project where I set out to create badges for Temporal's annual developer conference, &lt;a href="http://replay.temporal.io" rel="noopener noreferrer"&gt;Replay&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Concept Stage
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.ctfassets.net%2F0uuz8ydxyd9p%2F5300nhqLn62wcqZYu5IeCj%2F807787f838daa7450f2996453ae423b6%2FP1000081.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.ctfassets.net%2F0uuz8ydxyd9p%2F5300nhqLn62wcqZYu5IeCj%2F807787f838daa7450f2996453ae423b6%2FP1000081.png" alt="Elecrow QAing the Replay badge during manufacturing" width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hit the ground running and started spec'ing it out. I could make one badge by myself, but to scale up to make a badge for everyone at Replay I knew I would need help. Through an introduction from some fellow Recurse Center alum, I found a hardware consultant and we put together a proposal.&lt;/p&gt;

&lt;p&gt;Next, I had to pitch that proposal to Andrew Baker, our VP of Developer Relations, during his first five days on the job. Tricky because he didn't know me at all, but lucky for me, he trusted some DevRel guy with a hardware project budget that required convoluted manufacturing.&lt;/p&gt;

&lt;p&gt;The early concept drawings are genuinely charming to look at now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qk4m43362o376zjl32g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qk4m43362o376zjl32g.png" alt="Early badge concept sketch showing a chunky handheld device" width="700" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our hardware consultant's early sketches are full of hand-drawn illustrations of attendees holding a chunky handheld device with a T9 keyboard, an e-paper name tag on the front, a kickstand so it could sit on a table. There's even a little Tamagotchi-esque character on the screen. The vision at that stage was something closer to a personal game console, or honestly a weird cell phone that only worked at Replay.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.ctfassets.net%2F0uuz8ydxyd9p%2FTSchEqORfXgaVloPgEPFc%2F5d013338c29a8f8cf776508837235ed5%2FScreenshot_2026-04-29_at_2.38.02%25C3%25A2__PM.png%3Fw%3D700" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.ctfassets.net%2F0uuz8ydxyd9p%2FTSchEqORfXgaVloPgEPFc%2F5d013338c29a8f8cf776508837235ed5%2FScreenshot_2026-04-29_at_2.38.02%25C3%25A2__PM.png%3Fw%3D700" alt="Hand-drawn illustration of Replay attendees gathered around a glowing badge" width="463" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then on to artistic inspiration. Our consultant came back with these stunning hand-drawn illustrations. They capture the spirit of what we were going for better than any spec doc ever could: attendees gathered together, shrouded in a collective glow, illuminated by the feeling of knowledge and community. I'd even say the energy is a little cosmic.&lt;/p&gt;

&lt;p&gt;This is just the type of energy we captured for the event, as Replay is the community's conference. We just get the privilege of paying for it. Finding ways for attendees to interact, share moments, learn from each other, and spark something in each other is always the actual goal. The badge is the vehicle taking you there.&lt;/p&gt;

&lt;h2&gt;
  
  
  What The Badge Does
&lt;/h2&gt;

&lt;p&gt;When you hear "badge," you probably picture a rigid little plastic thing with your name and an outdated LinkedIn headshot on it. Scrap that entire notion when we're talking about this badge because this one is more like a game piece.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrny7fau0wj9adufolwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrny7fau0wj9adufolwj.png" alt="Early prototype badge held in hand with the screen showing attendee information" width="700" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Attendees navigate Replay with a mission of gathering with their peers to traverse the cosmos of the development unknown. Along the way, they collect connections, unlock new levels, and even find hidden treasures, all while the badge handles the technical complexity involved underneath.&lt;/p&gt;

&lt;p&gt;So what's actually inside it?&lt;/p&gt;

&lt;p&gt;The badge runs on an ESP32-S3 microcontroller with a 128x64 pixel OLED screen, an IR transmitter and receiver, a joystick, buttons, haptic feedback via a vibration motor, and an LED matrix with programmable colors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27c4t6wd7tol9ixrhnc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27c4t6wd7tol9ixrhnc7.png" alt="Prototype B v0.0 schematic render" width="638" height="794"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It also has a gyroscope, so the screen reorients depending on how you're holding it. Firmware is written in C, with a MicroPython layer sitting on top so attendees can write their own apps without needing to learn anything new. Attendees are able to update their contact information by using the MicroPython REPL and then beam it over to each other with the IR transmitter. The conference schedule is right there on the badge so you don't have to download an app. I don't know about you, but I've encountered very few conference apps that I've liked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ejbsug2k04xr1zyltg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ejbsug2k04xr1zyltg9.png" alt="Green PCB with joystick and buttons held at a factory in Shenzhen" width="800" height="1422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb2zs5oxwogei8t1tqi8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb2zs5oxwogei8t1tqi8.png" alt="Temporal-branded bare PCB with ESP32 chip" width="523" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7xqwzdvp02n8sx9ursq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7xqwzdvp02n8sx9ursq.png" alt="DELTA badge dev kit revision" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknzrc7zpb96sqtp5rthn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknzrc7zpb96sqtp5rthn.png" alt="Badge dev kit components laid out on a table" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2lsgihm55z76tlw34xp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2lsgihm55z76tlw34xp.png" alt="Final black badge render with space motif and blank screen" width="799" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g52uou3swxe1qz81tbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g52uou3swxe1qz81tbs.png" alt="Final Replay badge render with physical print card attached" width="800" height="1003"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is what it became after Kathy, our Senior Brand Designer, got involved. Etched into the PCB on the back of the badge is Ziggy, our tardigrade mascot, as an astronaut, which I love.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI Of It All
&lt;/h2&gt;

&lt;p&gt;I have to tell you something, and I want you to keep it just between us... I never wrote firmware code before this project. Like ever.&lt;/p&gt;

&lt;p&gt;Now, I'm a decent enough programmer and I know how to have an architecture conversation with the best of 'em. I can confidently identify when something is going wrong and I know the right questions to ask to reveal a solution, but C firmware for an embedded microcontroller wasn't something I had ever done professionally.&lt;/p&gt;

&lt;p&gt;My testimony is this: generative AI made this project possible for me!&lt;/p&gt;

&lt;p&gt;I used Claude and Claude Code extensively throughout my development process, eventually swapping to Codex as newer models became available. The way I think about it: I became a full-time architect with access to a team of very high int, low wisdom junior engineers who are great at implementation but need a lot of hand-holding. You can't give them something vague. You have to know what you want, know when they've gotten it wrong, and stay deliberate about what you let them build. I genuinely don't think this workflow would have worked without my ten-plus years of developer experience behind it because the experience tells you which questions matter and when to push back.&lt;/p&gt;

&lt;p&gt;The codebase ended up at around 5,000 lines of production logic, with tests more than quadrupling that. I used those tests as a way to validate what the AI was building and to build trust in the output. I would record and review Playwright tests to confirm user interactions worked the way I wanted them to. I'll admit that in all my previous work, coding without AI, I mostly skipped writing tests. To trust the code that was being generated, I needed extremely solid test coverage from the start.&lt;/p&gt;

&lt;p&gt;The documentation from all the architecture debates I had with the AI, the back-and-forth about structure, what to build, and what to cut, is over &lt;strong&gt;21,000&lt;/strong&gt; &lt;em&gt;lines&lt;/em&gt;, managed through &lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;GitHub Spec Kit&lt;/a&gt;, which tracked specs and architecture decisions as the project evolved. That ratio of documentation to production code says something interesting about what the job actually looks like now. It was well outside my comfort zone as someone who earned a CS degree and built a full career before any AI tools even existed. Because this project took four months, I swapped my tooling quite a bit. I finalized the project using Codex with the &lt;a href="https://skills.sh/mattpocock/skills/grill-me" rel="noopener noreferrer"&gt;Grill Me Skill&lt;/a&gt;. It was a lot more reminiscent of wordsmithing an English essay than anything hands-on-keyboard, which I'm still adjusting to.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Something for you to look forward to: the whole codebase is open source! That means you'll be able to see the output of every argument I had with AI to get here! &lt;em&gt;and the crowd goes wild&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  From One To Two Thousand
&lt;/h2&gt;

&lt;p&gt;Now, I've built weird IoT stuff before. In fact, that's kind of my whole thing. But I've never had to build 2,000 of the same thing, and it turns out those are completely different problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfmn64mn7lg9aognhqll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfmn64mn7lg9aognhqll.png" alt="DEF CON cat-shaped badge" width="700" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The original inspiration for this whole category of badge (ours, Twilio's, GitHub's) traces back to &lt;a href="https://defcon.org/" rel="noopener noreferrer"&gt;DEF CON&lt;/a&gt;, the hacker conference in Las Vegas. They've been doing &lt;a href="https://www.defcon.org/html/links/dc-badge.html" rel="noopener noreferrer"&gt;programmable badges&lt;/a&gt; for over twenty years. The badges run cryptography puzzles, communicate with each other, and unlock hidden games. Last year's badge had a full Pokémon clone where you could walk around a virtual version of the convention center. The level of engineering that community pours into these things is genuinely absurd in the best way, and I have a lot of love for it. I used to go back when I was in college, during the era of Ryan "LostboY/1o57" Clarke making all the conference badges, and my memories of it helped guide a lot of the creative energy I poured into this project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hvygiau8avj5t9wewom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hvygiau8avj5t9wewom.png" alt="Pimoroni Badgeware and a Badger 2040" width="800" height="603"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I had a lot of guidance too. &lt;a href="https://pimoroni.com/" rel="noopener noreferrer"&gt;Pimoroni&lt;/a&gt;, the team behind the GitHub Universe badge, were also incredibly generous. They couldn't meet our timeline but they sat with me anyway and gave me advice, so this is a huge shout out to them. Their work inspired me and the updated &lt;a href="https://badgewa.re/" rel="noopener noreferrer"&gt;Badgeware&lt;/a&gt; ecosystem served as my prototyping platform to prove that I could pull this off. Their contributions helped us get here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcjrjd6gn2djz7i0297i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcjrjd6gn2djz7i0297i.png" alt="Multiple dev badges plugged in and running simultaneously" width="431" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the manufacturing side: I knew, at least in an abstract way, that hardware companies manage factory relationships through WhatsApp and that most electronics manufacturing happens in Shenzhen. Actually living that is something else. I'd start getting messages from the factory at 10 p.m. I'd answer until I fell asleep. I'd wake up to find that our West Coast-based hardware consultant had been chatting with them for another four hours on top of that. You're always catching up, always a little behind.&lt;/p&gt;

&lt;p&gt;Our manufacturer, &lt;a href="https://www.elecrow.com" rel="noopener noreferrer"&gt;Elecrow&lt;/a&gt;, was incredibly generous with their time and expertise, helping us solve the problems that came up throughout this entire process. I'm especially grateful to our account manager, Chris, who not only stepped up to make sure everything worked, arrived safely, and showed up on time, but also helped us get the most out of our time in Shenzhen. She did double duty, guiding us through all the electronics markets Shenzhen is known for.&lt;/p&gt;

&lt;p&gt;Sourcing LiPo batteries in bulk is harder than you'd think: you can't ship them by air, so everything goes ground, and vendors get suspicious when you try to order large quantities at once. As a result, our dev kits ran on AAA batteries. Getting enough screens for the dev kits meant placing about 30 separate Amazon orders across different sellers to work around quantity throttling. Our accounting team is going to have feelings about that for a while... sorry guys.&lt;/p&gt;

&lt;p&gt;There's also the China trip. Temporal's security policy doesn't allow company compute into the country, which is reasonable, so when I needed to go supervise manufacturing, it required multi-week negotiations between HR, security, and finance to work out.&lt;/p&gt;

&lt;p&gt;The answer turned out to be a clean MacBook and a clean phone provisioned specifically for the trip, access to exactly one Slack channel, and no corporate accounts. I had to say goodbye to Claude for the duration, which meant going back to the ways of ye olden days, pre-2023, for development. That definitely made the firmware work interesting.&lt;/p&gt;

&lt;p&gt;That setup was the right call, but it did mean that once I was in China, I was mostly on my own for anything that came up. Between the limited access, the timezone gap, and the fact that problems were being discovered in real time on the manufacturing floor, there was not much opportunity to phone a friend. I got one hour-long call to get help on a specific issue, and otherwise the last-minute problems had to be debugged with whatever I had with me in the moment.&lt;/p&gt;

&lt;p&gt;Meanwhile, my single allowed Slack channel basically turned into a mini travel blog: factory floors, production lines, electronics markets, and a steady stream of "look at this, manufacturing is magic and also chaos." It was useful for keeping people in the loop, but it was not exactly the same thing as having the full company brain available while trying very hard not to become the bottleneck between "we found a problem" and "thousands of these need to ship."&lt;/p&gt;

&lt;p&gt;Here are a few photos of what went down on my trip.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg4icunarpy2qbbrb8bw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg4icunarpy2qbbrb8bw.jpg" alt="Factory floor with hands working on PCBs in a jig" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruozzmwalwxuwwyn0x03.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruozzmwalwxuwwyn0x03.jpg" alt="Stack of manufactured PCBs racked up at the factory" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj37gq7bka6ymxrnxdp4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj37gq7bka6ymxrnxdp4.jpg" alt="Factory machine with PCBs on an assembly line" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I was back, badges in tow, and excited to get them into your hands at Replay.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's In Your Hands At Replay! And Beyond
&lt;/h2&gt;

&lt;p&gt;We live in a hyper-consumerist world, and we don't want to pile on to that by creating a bunch of e-waste.&lt;/p&gt;

&lt;p&gt;So the badge isn't e-waste when the conference ends. That was a hard constraint from the start: we're not building 2,000 things that get thrown in a drawer after three days. Because the firmware is open source and MicroPython is embedded, you can keep hacking on yours after Replay.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.ctfassets.net%2F0uuz8ydxyd9p%2F6TgOcYwpjhj3jnfBWZr9iM%2F6f67d39262846d6e60aac83113d9322e%2FCH5_8602.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.ctfassets.net%2F0uuz8ydxyd9p%2F6TgOcYwpjhj3jnfBWZr9iM%2F6f67d39262846d6e60aac83113d9322e%2FCH5_8602.jpg" alt="Rows of finished Replay badges ready to be handed out" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sky's the limit really. The IR sensor means you could build a universal remote, the screen and joystick give you a tiny game device, and it'll keep being whatever you make it.&lt;/p&gt;

&lt;p&gt;People also started hacking on the firmware almost immediately. One attendee built a full Tamagotchi game during the conference. That's exactly the kind of thing we hoped for: someone picking up the hardware and immediately thinking &lt;em&gt;cool, what can I make this do?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.ctfassets.net%2F0uuz8ydxyd9p%2F4gVtwY9sQkYoDy1A9oPDZU%2F5a162eb2bc010d1bf120ea4b08ba8962%2FCH5_8753.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.ctfassets.net%2F0uuz8ydxyd9p%2F4gVtwY9sQkYoDy1A9oPDZU%2F5a162eb2bc010d1bf120ea4b08ba8962%2FCH5_8753.jpg" alt="Replay attendees exploring the badges" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For next year, I want to shift from the ESP32-S3 to the Raspberry Pi RP2350 and have every badge function as a Temporal worker. We managed to get &lt;a href="https://docs.temporal.io/develop/rust/" rel="noopener noreferrer"&gt;Temporal's Rust SDK&lt;/a&gt; running on the badge thanks to Edward Amsden, Staff Software Engineer on the SDK Language Runtime team, who saw me demo the badge during an engineering sprint showcase in his first week at Temporal and then casually spent the weekend hacking together Rust SDK support for us. I more or less seconded him to the badge team for his first month with us so we could get the project over the finish line. SDK team, I promise you can have him back now that we're finished with Replay. Because Edward got Temporal running in the final weeks of the project, we didn't have enough time to really showcase what we could do with it. Next year, though: two thousand workers hanging around people's necks in the same room. I don't know what I'd run on them yet, but I very much want to find out.&lt;/p&gt;

&lt;p&gt;We also made a bunch of deeply weird games for the badge, because apparently once you put an OLED screen, joystick, accelerometer, LED matrix, Wi-Fi, and a questionable amount of ambition into something people wear around their necks, then put a person who spent far too much time playing The World Ends with You and Henry Hatsworth on the Nintendo DS in their youth in charge of the project, aka me, this is what happens. Some are real games, some are firmware tests that got out of hand, and at least one is Flappy Asteroids: Flappy Bird on the lower LED matrix, Asteroids on the top, and absolutely no mercy anywhere. If you survived for more than 30 seconds at Replay, I hope you got your special variant joystick cap. And yes, we got Doom running.&lt;/p&gt;

&lt;p&gt;2,000+ badges showed up and worked. If you want to get hands-on with yours, everything you need is at &lt;a href="https://badge.temporal.io/" rel="noopener noreferrer"&gt;badge.temporal.io&lt;/a&gt;, and if you missed out this year, make sure you come to one of our SF events over the next few months and attend Replay 2027!&lt;/p&gt;

</description>
      <category>temporal</category>
      <category>hardware</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why I needed Durable Execution to Read a Toy Manual</title>
      <dc:creator>Shy Ruparel</dc:creator>
      <pubDate>Fri, 10 Apr 2026 15:53:01 +0000</pubDate>
      <link>https://dev.to/temporalio/why-i-needed-durable-execution-to-read-a-toy-manual-35cn</link>
      <guid>https://dev.to/temporalio/why-i-needed-durable-execution-to-read-a-toy-manual-35cn</guid>
      <description>&lt;p&gt;Watch me take a Japanese toy manual and turn its translation into a bulletproof, &lt;strong&gt;AI-powered ETL pipeline&lt;/strong&gt;. I’ll show you how I use &lt;strong&gt;Temporal Workflows&lt;/strong&gt; to guarantee an AI pipeline never loses progress, surviving network failures, API crashes, and more.&lt;/p&gt;




&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Why Spider-Man is the reason that Power Rangers has a giant robot.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guaranteed Completion with Temporal:&lt;/strong&gt; I’ll show you how to ensure your code keeps running even if servers crash or APIs fail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel OCR &amp;amp; Translation:&lt;/strong&gt; Learn how I used a &lt;strong&gt;"fan-out" pattern&lt;/strong&gt; with Google Document AI to process a 20-page manual in &lt;strong&gt;50 seconds&lt;/strong&gt; instead of 10 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilient AI Cleanup:&lt;/strong&gt; See how I use &lt;strong&gt;Pydantic&lt;/strong&gt; and &lt;strong&gt;Temporal&lt;/strong&gt; together to handle non-deterministic LLM outputs from Gemini and automatically retry failed validations.&lt;/li&gt;
&lt;/ul&gt;
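
&lt;p&gt;To make the "fan-out" idea from the list above concrete, here is a minimal sketch in plain Java (16 or newer). It is an illustration only, not the code from the video: it uses nothing but &lt;code&gt;java.util.concurrent&lt;/code&gt;, it does not call the Temporal SDK or Google Document AI, and the names &lt;code&gt;FanOutSketch&lt;/code&gt;, &lt;code&gt;processPage&lt;/code&gt;, and &lt;code&gt;translateAll&lt;/code&gt; are hypothetical stand-ins.&lt;/p&gt;

```java
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

public class FanOutSketch {

    // Hypothetical stand-in for the per-page OCR + translation call.
    // A real pipeline would call an external service here.
    static String processPage(int page) {
        return "page " + page + " translated";
    }

    // Fan out: start one asynchronous task per page.
    // Fan in: wait for every task and collect the results in page order.
    static String[] translateAll(int pageCount) {
        // Collect the futures first so that every task is started
        // before we block on any single result.
        var futures = IntStream.rangeClosed(1, pageCount)
                .mapToObj(page -> CompletableFuture.supplyAsync(() -> processPage(page)))
                .toList();

        // join() blocks until the matching task has finished.
        return futures.stream()
                .map(CompletableFuture::join)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] results = translateAll(20);
        System.out.println(results.length + " pages processed");
    }
}
```

&lt;p&gt;Collecting the futures into a list before joining is the important design choice: streams are lazy, so joining inside the same pipeline would process the pages one at a time. Note that this sketch has no durability. If the process dies halfway through, all progress is lost, which is the gap a Temporal Workflow is meant to close.&lt;/p&gt;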




&lt;p&gt;&lt;strong&gt;Ready to build it yourself?&lt;/strong&gt; 👉 &lt;a href="https://temporal.io/code-exchange/toku-solutions" rel="noopener noreferrer"&gt;Check out the code here!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>temporal</category>
      <category>supersentai</category>
      <category>kamenrider</category>
      <category>ai</category>
    </item>
    <item>
      <title>Decoupling Temporal Services with Nexus and the Java SDK</title>
      <dc:creator>Nikolay Advolodkin</dc:creator>
      <pubDate>Thu, 02 Apr 2026 13:50:51 +0000</pubDate>
      <link>https://dev.to/temporalio/decoupling-temporal-services-with-nexus-and-the-java-sdk-20p</link>
      <guid>https://dev.to/temporalio/decoupling-temporal-services-with-nexus-and-the-java-sdk-20p</guid>
      <description>&lt;p&gt;Your Temporal services share a blast radius. A bug in Compliance at 3 AM crashes Payments, too, because they share the same Worker. The obvious fix is separate services with HTTP calls between them - but then you're managing HTTP clients, routing, error mapping, and callback infrastructure yourself.&lt;/p&gt;

&lt;p&gt;We published a hands-on tutorial on &lt;a href="https://learn.temporal.io/tutorials/nexus/nexus-sync-tutorial/?utm_source=enterprise-dev-rel&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nexus-sync-tutorial&amp;amp;utm_content=devto-launch" rel="noopener noreferrer"&gt;learn.temporal.io&lt;/a&gt; where you take a monolithic banking payment system and split it into two independently deployable services connected through Temporal Nexus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nexus Endpoints, Services, and Operations from scratch&lt;/li&gt;
&lt;li&gt;Two handler patterns for different use cases&lt;/li&gt;
&lt;li&gt;How to swap an Activity call for a durable cross-namespace Nexus call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The caller-side change is minimal - the method call stays the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BEFORE (monolith - direct activity call):&lt;/span&gt;
&lt;span class="nc"&gt;ComplianceResult&lt;/span&gt; &lt;span class="n"&gt;compliance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;complianceActivity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;checkCompliance&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compReq&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// AFTER (Nexus - durable cross-team call):&lt;/span&gt;
&lt;span class="nc"&gt;ComplianceResult&lt;/span&gt; &lt;span class="n"&gt;compliance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;complianceService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;checkCompliance&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compReq&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same method name. Same input. Same output. Behind that swap: a shared service contract, a Nexus handler, an endpoint registration, and a Worker configuration change.&lt;/p&gt;

&lt;p&gt;Here's what the Nexus handler looks like - it backs the operation with a long-running workflow so retries reuse the existing workflow instead of creating duplicates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@OperationImpl&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;OperationHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ComplianceRequest&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ComplianceResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;checkCompliance&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;WorkflowRunOperation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromWorkflowHandle&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;WorkflowClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Nexus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOperationContext&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getWorkflowClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;ComplianceWorkflow&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newWorkflowStub&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;ComplianceWorkflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;WorkflowOptions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTaskQueue&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"compliance-risk"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setWorkflowId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"compliance-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTransactionId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;WorkflowHandle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromWorkflowMethod&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;wf:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
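
&lt;p&gt;To see why deriving the Workflow ID from the transaction ID makes retries safe, here is a minimal plain-Java sketch. It is a simulation under stated assumptions, not the Temporal SDK and not the tutorial's code: the &lt;code&gt;IdempotentStartSketch&lt;/code&gt; class, its in-memory registry, and the &lt;code&gt;run-N&lt;/code&gt; identifiers are hypothetical stand-ins for the bookkeeping the server does for you.&lt;/p&gt;

```java
import java.util.Properties;

// Hypothetical, in-memory illustration of "same ID, same run".
// Nothing here calls Temporal; it only models the idea behind setWorkflowId above.
class IdempotentStartSketch {
    // Stand-in for the server's record of started workflows (workflow ID to run ID).
    private final Properties runsByWorkflowId = new Properties();
    private int startCount = 0;

    // Mirrors the handler: the ID is derived from the transaction, not from the attempt.
    static String workflowIdFor(String transactionId) {
        return "compliance-" + transactionId;
    }

    // Starts a new run only if none is recorded under this ID; otherwise returns the existing one.
    String startOrAttach(String transactionId) {
        String workflowId = workflowIdFor(transactionId);
        String existingRun = runsByWorkflowId.getProperty(workflowId);
        if (existingRun != null) {
            return existingRun; // a retry lands on the run that already exists
        }
        startCount++;
        String runId = "run-" + startCount;
        runsByWorkflowId.setProperty(workflowId, runId);
        return runId;
    }

    // Number of runs that were actually started.
    int startCount() {
        return startCount;
    }

    public static void main(String[] args) {
        IdempotentStartSketch registry = new IdempotentStartSketch();
        String first = registry.startOrAttach("txn-42");
        String retry = registry.startOrAttach("txn-42"); // same transaction, retried
        String other = registry.startOrAttach("txn-43"); // different transaction
        System.out.println(first.equals(retry)); // prints "true"
        System.out.println(first.equals(other)); // prints "false"
        System.out.println(registry.startCount()); // prints "2"
    }
}
```

&lt;p&gt;The design choice to notice is that the key comes from business data. Had the ID been generated per attempt, every retry would have looked like a brand-new request.&lt;/p&gt;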



&lt;p&gt;The tutorial includes a durability checkpoint: you kill the Compliance Worker mid-transaction, restart it, and watch the payment resume exactly where it left off. No hand-written retry logic, no data loss across the namespace boundary. The tutorial uses the Java SDK and runs entirely on Temporal's dev server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.temporal.io/tutorials/nexus/nexus-sync-tutorial/?utm_source=enterprise-dev-rel&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nexus-sync-tutorial&amp;amp;utm_content=devto-launch" rel="noopener noreferrer"&gt;&lt;strong&gt;Try it&lt;/strong&gt; &lt;/a&gt;&lt;/p&gt;

</description>
      <category>temporal</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>2025 — Part 2</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Tue, 18 Nov 2025 16:55:03 +0000</pubDate>
      <link>https://dev.to/temporalio/2025-part-2-2eoi</link>
      <guid>https://dev.to/temporalio/2025-part-2-2eoi</guid>
      <description>&lt;p&gt;(&lt;a href="https://dev.to/temporalio/2025-3dmg"&gt;Part 1&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Company
&lt;/h2&gt;

&lt;p&gt;At the time of my last update, the company had 116 people. Now we are over 300. The Go-to-Market organization is now larger than Engineering. &lt;a href="https://en.wikipedia.org/wiki/Dunbar%27s_number" rel="noopener noreferrer"&gt;Some studies&lt;/a&gt; claim that our ancestors couldn’t handle tribes larger than about 150 people. We are definitely past the point when one could know every employee. The loss of intimacy is offset by the feeling that we now have resources: a growing number of teams focusing on different areas while collaborating on cross-group efforts.&lt;/p&gt;

&lt;p&gt;With such growth, we are doubling down on our efforts to foster and reemphasize consistency in our hiring practices, decision-making, behavioral patterns, and rules of engagement, otherwise referred to as values and culture. In my previous life within a huge corporation, those things generally made sense to me, but they also felt somewhat artificial and performative. Within the context of a small company with a relatively flat structure, it feels very different — much closer to home. This makes me genuinely attentive to such aspects and eager to contribute where I can. Just recently, we rolled out our &lt;a href="https://temporal.io/careers" rel="noopener noreferrer"&gt;updated values&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z753za1mzrzrau64rbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z753za1mzrzrau64rbd.png" alt="Company" width="800" height="527"&gt;&lt;/a&gt;&lt;br&gt;
My impression is that at least half of the VC money these days goes to companies with corporate domains ending in “.ai,” and aside from that, funding isn’t easy. We raised our &lt;a href="https://temporal.io/blog/temporal-series-c-announcement" rel="noopener noreferrer"&gt;C round&lt;/a&gt; early this year with a very good, some say almost exceptional, multiple. This tells us that the investors have a strong conviction about our product, business model, and growth. I’m no VC, but I see how they are impressed with the quality of the use cases and the caliber of customers that come to our cloud. I hope they know better than I do how to assess and evaluate such factors. Since the C round, we’ve also had a &lt;a href="https://temporal.io/blog/temporal-raises-secondary-funding" rel="noopener noreferrer"&gt;secondary round&lt;/a&gt; that pushed the company’s valuation significantly higher.&lt;/p&gt;

&lt;p&gt;Keeping the hiring bar high continues to be a top priority. With the turmoil in the job market and Temporal becoming a better-known brand, we now have access to a larger pool of high-quality engineering talent. The interview process is still more art than science, and scaling and improving this art as the company grows is a challenge in itself. Hiring at the junior levels has its own difficulties. Recently, we had to close an open SDE 1 position after only a few hours because, during that time, we received more than 3,000 applications. We found that the old recipe still works well: filling junior positions via internships.&lt;/p&gt;

&lt;p&gt;We are still fully remote, with WeWork as an option for folks who want to come into the office. We are geo-distributed but not very balanced. Most of Engineering is on the U.S. West Coast, with roughly a tie between the Seattle and Bay areas. Smaller pockets are in Colorado, North Carolina, and the cities of New York, Chicago, Toronto, and Vancouver. The GTM team has its own distribution. My impression is they are more heavily tilted toward the East Coast.&lt;/p&gt;

&lt;p&gt;We settled on an annual all-company offsite (we started with twice a year). We complemented it with smaller team offsites and are now aggregating them into an annual R&amp;amp;D offsite, side by side with GTM’s sales kickoff event. We’ll see how this goes. There doesn’t appear to be a simple solution for doing it right, and each company needs to find its own rhythm. From time to time, we leverage the West Coast’s locality for in-person meetings to discuss some critical decisions or designs. In such cases, we consciously violate the remote-first setup for the sake of high-throughput discussions and faster decision-making — at the unfair expense of colleagues who can’t attend in person and have to connect via Zoom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replay
&lt;/h2&gt;

&lt;p&gt;It was a bold move in August 2022 to start our own annual conference. The &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlRWBrfqOOX1rN_d1mahYI70" rel="noopener noreferrer"&gt;inaugural edition&lt;/a&gt; was in Seattle. The &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlREHL7fiEKBWTp5QuFeYS2r" rel="noopener noreferrer"&gt;2023&lt;/a&gt; and &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlR0xieUwBN_nNHW0oijCZa6" rel="noopener noreferrer"&gt;2024&lt;/a&gt; editions were in Bellevue, WA, growing bigger each year. In &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlQ4Hw1U1aGxc2wH7oQ3tisp" rel="noopener noreferrer"&gt;2025&lt;/a&gt;, we held the event in London to reach audiences unlikely to travel to the U.S. Attending Replay is a very special experience. Seeing so many engineers and engineering leaders talking non-stop about your product and presenting on stage what they’ve built with it is a special kind of pleasure. I presented at all Replays but the very first one. In 2023, my talk was on the second day and I talked with folks so much before then that my voice let me down close to the end of my presentation. I guess that’s why no recording of it was published. But I gave slightly different versions of the same talk at &lt;a href="https://www.youtube.com/watch?v=LHkeXk_8Cq4" rel="noopener noreferrer"&gt;J on the Beach&lt;/a&gt; and &lt;a href="https://www.infoq.com/presentations/durable-execution-control-plane/" rel="noopener noreferrer"&gt;QCon SF&lt;/a&gt; that year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bheq35jmfjvgns7tqyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bheq35jmfjvgns7tqyy.png" alt="Replays" width="800" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://replay.temporal.io/" rel="noopener noreferrer"&gt;Replay 2026&lt;/a&gt; will be in San Francisco — at Moscone, no less. It should be epic. I’ll need to rewatch the &lt;a href="https://www.imdb.com/title/tt2575988/" rel="noopener noreferrer"&gt;Silicon Valley documentary&lt;/a&gt; before going there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operations
&lt;/h2&gt;

&lt;p&gt;We operate a multi-million-dollar business based on a single product: Temporal Cloud. Our customers trust us with their hot-path business processes, often their most critical ones. This is an interesting phenomenon. They choose Temporal’s Durable Execution to make their applications resilient to various failures. Naturally, they first and foremost care about the reliability of their most critical services. Some choose to self-host Temporal Server with all its dependencies. Many don’t view operating such complex production machinery as their core competency, so they come to our cloud service with their most precious workloads. It is amazing and sobering at the same time when big Internet household names bring us their “crown jewels” to run, even those who have a policy of not taking a dependency on SaaS vendors in the critical path. It was eye-opening to hear, on a couple of occasions, a customer say, “We only have two external dependencies: AWS and Temporal Cloud.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5guzazuxia3xm3t2hhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5guzazuxia3xm3t2hhn.png" alt="APS" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Customer expectations are very high. Sometimes it feels like they set them higher for us than for the hyperscalers. We now have about eight engineering on-call rotations (teams), covering different areas of the system, plus one for on-call managers who coordinate across teams, and another for the Developer Success team that communicates with customers. This may seem large for our company size, but that’s the nature of the service we run.&lt;/p&gt;

&lt;p&gt;We use &lt;a href="http://incident.io" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt; for managing incidents. It integrates nicely with Slack, creates a per-incident channel, and automatically adds the current on-call engineers to it, among other things. We saw great promise in the early days of their product. They haven’t disappointed and are growing fast. Like most folks, we use &lt;a href="http://statuspage.io" rel="noopener noreferrer"&gt;statuspage.io&lt;/a&gt; for public incidents and &lt;a href="http://pagerduty.com" rel="noopener noreferrer"&gt;pagerduty.com&lt;/a&gt; for on-call paging. Incident.io also integrates with Jira to automatically turn incident follow-ups into tickets, helping us continuously improve the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replication
&lt;/h2&gt;

&lt;p&gt;Temporal inherited the application-level replication stack from Cadence. Over the years, we dramatically improved it and added Control Plane functionality to manage it. Initially, we used replication to transparently migrate customer namespaces from Cell to Cell. After we got it working at the level we were happy with, we exposed it to customers as high-availability options — multi-region, cross-cloud, and single-region replication.&lt;/p&gt;

&lt;p&gt;At first, few customers understood why they would want to pay double (due to the duplicate hardware needed) for such a feature. Some just used it, at our suggestion, as a tool for migrating their workloads from one region or cloud provider to another. The recent &lt;a href="https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW" rel="noopener noreferrer"&gt;GCP&lt;/a&gt; and &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;AWS us-east-1&lt;/a&gt; outages vindicated the paranoid among our customers who refused to accept that “cloud regions pretty much never go down.”&lt;/p&gt;

&lt;p&gt;Customers who had replication enabled for their namespaces were able to fail over to the other region or cloud, and their applications continued executing as if nothing had happened. We discovered a few misses on our side and had to fail over some namespaces manually, with a longer delay than we expected. The important part is that replicated namespaces continued running after failover. We saw a major spike in customers setting up replication in the days after the AWS us-east-1 outage. One customer was in the process of migrating their namespace from AWS to GCP during GCP’s global outage. They weren’t impacted and didn’t even need to fail over because their active replica was still in AWS. They were considering keeping the cross-cloud replication running indefinitely after that.&lt;/p&gt;

&lt;p&gt;I gave a &lt;a href="https://www.youtube.com/watch?v=DM68pz5ysWE" rel="noopener noreferrer"&gt;conceptual talk&lt;/a&gt; about replicated namespaces, but the topic probably deserves its own post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=DM68pz5ysWE" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyit8wprc7xqny4e6zn6s.png" alt="Fourth Little Pig" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Road ahead
&lt;/h2&gt;

&lt;p&gt;With great opportunities come great responsibility and pressure to execute and realize those opportunities. We still have to strike the right balance between running a highly reliable service and investing in new functionality. It’s a deeply humbling experience to see that some of the world’s top companies — household names with tens or even hundreds of millions of users — take an all-in dependency on Temporal Cloud. This leaves no room for hubris, complacency, or sloppiness. We have to keep pushing the reliability and quality bar higher without hampering further development of the product.&lt;/p&gt;

&lt;p&gt;I don’t believe there’s a general recipe for how to grow an organization, be it engineering, R&amp;amp;D, or the whole company. We’ll have to navigate our own path — growing sustainably while preserving what has made us successful so far and learning new ways in parallel. It’s exciting and somewhat dizzying at the same time. Yet I feel we are still only getting started.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>ai</category>
      <category>microservices</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>2025</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Wed, 12 Nov 2025 21:56:41 +0000</pubDate>
      <link>https://dev.to/temporalio/2025-3dmg</link>
      <guid>https://dev.to/temporalio/2025-3dmg</guid>
      <description>&lt;h2&gt;
  
  
  Part 1
&lt;/h2&gt;

&lt;p&gt;Belated update. Yes, it’s been five years, can’t believe it myself. What’s the “delta” for the last three years that flew by too fast?&lt;br&gt;
Certain things haven’t changed much. We are still under the “dual mandate” — OSS server, SDKs (clients), CLI, and a whole bunch of other peripheral software, plus a cloud service where we charge customers for running and managing the invisible infrastructure so that they don’t have to. My focus continues to be primarily on the cloud side.&lt;/p&gt;

&lt;p&gt;At the same time, obviously, everything has changed — some things even multiple times. We went through COVID with its obligatory work-from-home setup, only for many companies to start imposing, some more gradually than others, return-to-office policies. I interviewed a number of candidates recently who moved away from major tech hubs during COVID and had to leave their jobs because of the RTO push. We are still fully remote.&lt;/p&gt;

&lt;p&gt;Most of the Big Tech companies went through rounds of mass layoffs, a tectonic shift from the previous 20 or so years of competing for talent and outbidding each other in offers. Startups suddenly became much more attractive for Big Tech employees who were previously reluctant to take the risk of leaving their well-paid jobs. At the same time, startup founders faced a funding drought, starting in late 2021 and early 2022, caused by interest rate changes. Many had to close or fire-sell their ventures when, just a couple of years earlier, cheap money had seemed unlimited.&lt;/p&gt;

&lt;h2&gt;
  
  
  Product
&lt;/h2&gt;

&lt;p&gt;Developers choose Temporal for its programming model. They experience it in a language of their choice via Temporal SDKs. We started with two languages. Now we support seven: Go, Java, TypeScript, Python, .NET, PHP, and Ruby. Four of them are built on the same Core SDK written in Rust. No, we still don’t have an official Rust SDK.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqek4rva5d7fzlv1xwim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqek4rva5d7fzlv1xwim.png" alt="SDKs" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We started Temporal Cloud by hosting the OSS Temporal Server with an added layer of security and multi-tenancy. The original value proposition included that, plus general operational concerns such as monitoring, alerting, configuration, upgrades, and scale. We’ve been investing along several dimensions since then and are now running the fifth-generation Cells (&lt;a href="https://www.youtube.com/watch?v=KvxAz5HwBpc" rel="noopener noreferrer"&gt;Temporal Cloud “clusters”&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Building a &lt;a href="https://www.youtube.com/watch?v=SQv9ot-jB6o" rel="noopener noreferrer"&gt;custom storage layer&lt;/a&gt; between the server and database to absorb reads and coalesce writes was one of the first bold undertakings. Rolling it out to production over the course of 2023 gave us a significant increase in reliability, performance, and scalability compared to the vanilla OSS server. Another major investment was making it possible to incrementally add multiple databases to a running server. With these improvements, the scenario I mentioned in one of my previous posts, a major customer needing to increase their already outsized traffic level up to 10x for a day, became routine. At the end of 2021, a day like that was a big deal for both companies, with teams of engineers monitoring the system, communicating live, and taking action. The subsequent occurrences became increasingly “boring” and turned into non-events.&lt;/p&gt;

&lt;p&gt;On the authentication/authorization dimension, we went from initially supporting only &lt;a href="https://docs.temporal.io/cloud/certificates" rel="noopener noreferrer"&gt;mTLS&lt;/a&gt; and Google SSO to adding &lt;a href="https://docs.temporal.io/cloud/api-keys" rel="noopener noreferrer"&gt;API keys&lt;/a&gt;, &lt;a href="https://docs.temporal.io/cloud/service-accounts" rel="noopener noreferrer"&gt;service accounts&lt;/a&gt;, &lt;a href="https://docs.temporal.io/cloud/saml" rel="noopener noreferrer"&gt;SAML&lt;/a&gt;, &lt;a href="https://docs.temporal.io/cloud/user-groups" rel="noopener noreferrer"&gt;SCIM&lt;/a&gt;, and a bunch of other features critical for enterprise — and not only enterprise — customers.&lt;/p&gt;

&lt;p&gt;We started with prospective cloud customers filling out a form and getting contacted by our sales team to complete the paperwork, after which we created an account on their behalf. Embarrassing. Now, we have a complete &lt;a href="https://temporal.io/get-cloud" rel="noopener noreferrer"&gt;self-signup process&lt;/a&gt; that guides prospective customers along the path, with a full PLG motion behind it. When we opened up Temporal Cloud to the world, we were missing a number of table-stakes features. At the time, I called the bar we had to meet a “reasonable cloud service.” I believe we passed this milestone 12–18 months ago.&lt;/p&gt;

&lt;p&gt;I like that we don’t play licensing games (our OSS is under the MIT license) and instead extend and enhance it with proprietary features to differentiate our cloud offering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5b73uredlrwamqsz8nut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5b73uredlrwamqsz8nut.png" alt="Self-signup" width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We launched Temporal Cloud in 2022 with support for AWS only. We added GCP in 2025 and are working on bringing in Azure, the last of the big three providers. Even though support for Kubernetes clusters across them is similar, most of the integration effort goes into their disparate security and resource hierarchy models, differences in networking, and subtle behavioral differences in their seemingly compatible APIs — for example, &lt;a href="https://airbyte.com/data-engineering-resources/s3-gcs-and-azure-blob-storage-compared" rel="noopener noreferrer"&gt;GCS vs. S3&lt;/a&gt;. Recently, we’ve been chasing GCP load balancers mysteriously ghosting a fraction of the connections. Support for hosted Elasticsearch is another headache — only AWS has it, but in the form of OpenSearch, their fork of ES from before Elastic changed its license.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy27c2ww3nn32lyir9fm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy27c2ww3nn32lyir9fm.png" alt="Multi-region" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AI
&lt;/h2&gt;

&lt;p&gt;The agentic AI “storm” turned into a sudden tailwind for Temporal. The very nature of such applications — being stateful, depending on a significant number of semi-reliable API calls to external services, and taking seconds to minutes to execute — made code-first Durable Execution a compelling programming model for this fast-moving, massive herd. While there are still some rough edges for AI use cases in the near term (such as payload and history size limits and required determinism of workflow code), the immediate benefits — high-velocity development of much more reliable code in the language of your choice, guaranteed scalability, and unparalleled visibility into execution for debugging — &lt;a href="https://docs.temporal.io/ai-cookbook" rel="noopener noreferrer"&gt;keep bringing AI-focused companies&lt;/a&gt; to Temporal. More traditional businesses that are scrambling to integrate AI into their systems do the same. I was told that as of late 2024, out of the top 20 AI companies, only two were aware of Temporal — and now, 16 of them already run Temporal-based apps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nexus
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstb7215vb88u6fxdijok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstb7215vb88u6fxdijok.png" alt="Nexus" width="84" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This year we launched the initial version of &lt;a href="https://docs.temporal.io/evaluate/nexus" rel="noopener noreferrer"&gt;Nexus&lt;/a&gt;, an &lt;a href="https://github.com/nexus-rpc/api" rel="noopener noreferrer"&gt;open-standard-based protocol&lt;/a&gt; for APIs that may take arbitrarily long to complete. I think of it as a great frontend layer for Durable Execution. But the protocol itself is implementation-agnostic. One could implement it using more traditional tools and approaches, for example, within the paradigm of event-driven architecture. The idea was conceived in the early days of Temporal. We started talking about it publicly in 2022, only to do nothing for another year due to other priorities.&lt;/p&gt;

&lt;p&gt;We believe that Nexus is an immense opportunity to integrate systems and services in a new, powerful way. Nexus deserves a dedicated post, and I’m contemplating a conference talk about how the combination of Durable Execution and Nexus could define a major evolution of the &lt;a href="https://martinfowler.com/microservices/" rel="noopener noreferrer"&gt;Microservice Architecture&lt;/a&gt;. I understand this is a very bold statement, but sometimes you have to shoot for the Moon.&lt;/p&gt;

&lt;p&gt;(Continued in &lt;a href="https://dev.to/temporalio/2025-part-2-2eoi"&gt;Part 2&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>ai</category>
      <category>microservices</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Building Durable Cloud Control Systems with Temporal</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Sat, 09 Aug 2025 00:47:09 +0000</pubDate>
      <link>https://dev.to/temporalio/building-durable-cloud-control-systems-with-temporal-5l7</link>
      <guid>https://dev.to/temporalio/building-durable-cloud-control-systems-with-temporal-5l7</guid>
      <description>&lt;p&gt;In today’s world of managed cloud services, delivering exceptional user experiences often requires rethinking traditional architecture and operational strategies. At Temporal, we faced this challenge head-on, navigating complex decisions about tenancy models, resource management, and durable execution to build a reliable, scalable cloud service. This post explores our approach and the lessons we learned while creating &lt;a href="https://temporal.io/cloud" rel="noopener noreferrer"&gt;Temporal Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case for Managed Cloud Services
&lt;/h2&gt;

&lt;p&gt;Managed services have become the default for delivering hosted solutions to customers. Whether it’s a database, queueing system, or another server-side technology, hosting a service not only provides a better user experience but also opens doors for monetization, especially for open-source projects. The challenge is how to do it effectively while maintaining reliability and scalability.&lt;/p&gt;

&lt;p&gt;One of the first decisions we made was about tenancy models. Should we pursue single-tenancy — provisioning dedicated clusters for each customer — or opt for &lt;a href="https://docs.temporal.io/evaluate/development-production-features/multi-tenancy?_gl=1*145kve2*_gcl_au*MTgzNjc0NTczNi4xNzQ4MzcwMTc5*_ga*MTI3NTM0MDA4OC4xNzQ4MzcwMTc5*_ga_R90Q9SJD3D*czE3NTQ2OTg5OTgkbzIzJGcwJHQxNzU0Njk4OTk4JGo2MCRsMCRoMA.." rel="noopener noreferrer"&gt;multi-tenancy&lt;/a&gt;, which allows multiple customers to share the same resources? While single-tenancy offers simplicity and isolation, its inefficiencies quickly become apparent. Customers end up paying for unused capacity, and providers shoulder higher operational costs. Multi-tenancy, though harder to implement, emerged as the clear winner. It optimizes resource usage, allows customers to pay for actual usage, and creates shared headroom for handling traffic spikes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Plane vs. Control Plane: Defining Responsibilities
&lt;/h2&gt;

&lt;p&gt;Architecting a managed service in terms of the data plane and control plane is an industry best practice that we followed, clearly defining and implementing their distinct roles within our cloud architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Plane&lt;/strong&gt;: This is where the actual work happens — processing transactions, executing workflows, and handling customer data. It must maintain high availability, low latency, and resilience to failures. For Temporal Cloud, we adopted a cell-based architecture to isolate resources and minimize the blast radius of potential failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control Plane&lt;/strong&gt;: This acts as the brain of the system, managing resources, provisioning namespaces, and handling configurations. While its performance is less critical than the data plane, reliability here still matters for customer experience. For instance, provisioning a namespace may not be urgent, but delays or errors in this process can frustrate users.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementing the Data Plane: A Cell-Based Architecture
&lt;/h2&gt;

&lt;p&gt;For the data plane, we applied a cell-based architecture to achieve strong isolation and scalability. Each cell operates as a self-contained unit with its own AWS account, VPC, EKS cluster, and supporting infrastructure. While this approach is framed within the context of AWS, we have applied the same principles to Google Cloud Platform (GCP), leveraging its equivalent primitives to ensure consistency and reliability across cloud providers. This approach ensures that failures or updates in one cell do not impact others, reducing the risk of cascading outages.&lt;/p&gt;

&lt;p&gt;Each cell in Temporal Cloud includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compute Pods&lt;/strong&gt;: Running Temporal services and infrastructure tools for observability, ingress management, and certificate handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databases&lt;/strong&gt;: Both primary databases and Elasticsearch for enhanced visibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Additional Components&lt;/strong&gt;: Load balancers, private connectivity endpoints, and other supporting infrastructure that ensures smooth operation and integration across environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently, Temporal Cloud operates across 14 AWS regions, and we’ve also added support for GCP. This architecture allows us to meet the diverse needs of our customers while maintaining reliability at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Durable Execution: The Foundation of the Control Plane
&lt;/h2&gt;

&lt;p&gt;Building the control plane presented its own set of challenges, particularly around reliability and maintainability. Control plane tasks, such as provisioning namespaces or rolling out updates, involve complex long-running processes with many interdependent steps. Writing this logic as traditional, ad-hoc code often leads to brittle systems that are hard to debug and evolve.&lt;/p&gt;

&lt;p&gt;This is where Temporal’s &lt;a href="https://temporal.io/blog/building-reliable-distributed-systems-in-node-js-part-2" rel="noopener noreferrer"&gt;durable execution&lt;/a&gt; model shines. Designed based on experience with earlier systems like AWS Simple Workflow Service and Azure Durable Functions, Temporal’s approach separates business logic from state management and failure handling. Developers can write workflows as straightforward, happy-path code without worrying about retries, error handling, or state persistence. The system automatically manages these concerns, allowing workflows to seamlessly recover from failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Namespace Provisioning: A Real-World Example
&lt;/h2&gt;

&lt;p&gt;Consider the process of creating a new namespace in Temporal Cloud. When a user clicks “Create Namespace” on the web interface, the control plane orchestrates a series of tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Selecting a suitable cell within the chosen region.&lt;/li&gt;
&lt;li&gt;Creating database records and roles.&lt;/li&gt;
&lt;li&gt;Generating and provisioning mTLS certificates.&lt;/li&gt;
&lt;li&gt;Configuring ingress routes and verifying connectivity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step involves external API calls, DNS propagation, and other potential points of failure.&lt;/p&gt;

&lt;p&gt;Without durable execution, managing retries, backoffs, and state persistence would result in a tangle of brittle code. With Temporal, these tasks are encapsulated in workflows, which transparently handle retries and maintain state across failures. Developers can focus on the high-level logic, confident that the system will handle the edge cases.&lt;/p&gt;
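
&lt;p&gt;To make the contrast concrete, here is a minimal plain-Python sketch of the bookkeeping that has to be written by hand when there is no durable execution runtime. It is not Temporal Cloud's code and does not use the Temporal SDK: the step names, the &lt;code&gt;actions&lt;/code&gt; mapping, and the &lt;code&gt;checkpoint&lt;/code&gt; dict (a stand-in for durable storage) are all assumptions made for illustration.&lt;/p&gt;

```python
import time

# Hypothetical step names mirroring the list above; each would call an external API.
STEPS = ["select_cell", "create_db_records", "provision_mtls_certs", "configure_ingress"]


def run_provisioning(namespace, actions, checkpoint, max_attempts=5, base_delay=0.01):
    """Run each step in order, retrying with exponential backoff.

    `checkpoint` is a dict standing in for durable storage: a restarted
    process passes the same dict back in and skips the completed steps.
    """
    done = checkpoint.setdefault(namespace, [])
    for step in STEPS:
        if step in done:
            continue  # finished before an earlier crash or restart
        for attempt in range(max_attempts):
            try:
                actions[step](namespace)
                break
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise  # out of retries: surface the failure
                time.sleep(base_delay * 2 ** attempt)
        done.append(step)
    return done
```

&lt;p&gt;Even this toy version mixes retry policy, checkpointing, and business logic in one function, and it still ignores timeouts, concurrent callers, and partial writes. That mix is what a workflow runtime is meant to take off the developer's plate.&lt;/p&gt;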

&lt;h2&gt;
  
  
  Rolling Upgrades: Ensuring Safe Deployments
&lt;/h2&gt;

&lt;p&gt;Another common control plane scenario is rolling out updates to the Temporal Cloud fleet. Our deployment strategy involves organizing cells into deployment rings, progressing from pre-production environments to customer-facing cells with increasing priority of traffic.&lt;/p&gt;

&lt;p&gt;The rollout process is carefully staged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ring 0&lt;/strong&gt;: Synthetic traffic only, no customer impact. Changes are monitored here for at least a week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ring 1&lt;/strong&gt;: Low-priority traffic namespaces, allowing for additional testing with minimal risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher Rings&lt;/strong&gt;: Gradually expanding to critical, high-priority traffic customers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Within each ring, updates are applied in batches, with pauses between batches to observe for potential issues like memory leaks or race conditions. Temporal workflows handle this process, ensuring that even long-running deployments (which can span weeks) are resilient to failures or restarts.&lt;/p&gt;
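
&lt;p&gt;The ring-and-batch ordering can be sketched in a few lines of plain Python. This is only an illustration of the staging logic, not our deployment tooling: &lt;code&gt;rollout_plan&lt;/code&gt;, &lt;code&gt;deploy&lt;/code&gt;, and the &lt;code&gt;upgrade&lt;/code&gt; and &lt;code&gt;healthy&lt;/code&gt; callbacks are hypothetical names.&lt;/p&gt;

```python
def rollout_plan(rings, batch_size):
    """Yield (ring_name, batch) pairs in deployment order.

    `rings` is an ordered list of (ring_name, [cell, ...]) pairs, lowest risk first.
    """
    for ring_name, cells in rings:
        for start in range(0, len(cells), batch_size):
            yield ring_name, cells[start:start + batch_size]


def deploy(rings, batch_size, upgrade, healthy):
    """Upgrade batch by batch and halt at the first unhealthy batch."""
    upgraded = []
    for ring_name, batch in rollout_plan(rings, batch_size):
        for cell in batch:
            upgrade(cell)
            upgraded.append(cell)
        # A real rollout would pause here, for hours or days, before checking health.
        if not all(healthy(cell) for cell in batch):
            return upgraded, ring_name  # halted in this ring
    return upgraded, None
```

&lt;p&gt;In a durable workflow the pause between batches would be a timer that survives process restarts, which is what lets a rollout like this span weeks.&lt;/p&gt;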

&lt;h2&gt;
  
  
  Entity Workflows: A Powerful Pattern
&lt;/h2&gt;

&lt;p&gt;Temporal’s durable execution also enables powerful patterns like entity workflows. These are workflows tied to specific resources, such as cells or namespaces, providing a natural way to model state and operations. For example, each cell in Temporal Cloud has an entity workflow that manages its lifecycle, from provisioning to upgrades. This approach ensures consistency and simplifies concurrency control.&lt;/p&gt;
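
&lt;p&gt;As a rough illustration of why this simplifies concurrency control, the sketch below models one entity per cell in plain Python. Callers only enqueue operations, and the entity applies them one at a time, so no two operations ever race on the same cell. The class, its states, and the operation names are hypothetical and are not taken from the Temporal SDK.&lt;/p&gt;

```python
from collections import deque


class CellEntity:
    """One instance per cell; operations are applied strictly one at a time."""

    def __init__(self, cell_id):
        self.cell_id = cell_id
        self.state = "provisioning"
        self.version = None
        self.pending = deque()

    def signal(self, op, payload=None):
        # Callers only enqueue; they never mutate the entity's state directly.
        self.pending.append((op, payload))

    def run_pending(self):
        while self.pending:
            op, payload = self.pending.popleft()
            if op == "provisioned":
                self.state = "active"
            elif op == "upgrade" and self.state == "active":
                self.version = payload  # upgrades are ignored until the cell is active
            elif op == "decommission":
                self.state = "deleted"
        return self.state, self.version
```

&lt;p&gt;With durable execution, the equivalent of &lt;code&gt;pending&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt; would be persisted by the platform rather than held in process memory.&lt;/p&gt;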

&lt;h2&gt;
  
  
  Developer Happiness and Productivity
&lt;/h2&gt;

&lt;p&gt;One of the biggest benefits of Temporal’s approach is the impact on developer experience. By eliminating the need to write boilerplate code for retries, backoffs, and state management, developers can focus on delivering business value. Temporal’s built-in tools for observing and debugging workflows further enhance productivity, making it easier to understand and troubleshoot complex systems.&lt;/p&gt;

&lt;p&gt;Happy developers are productive developers, and Temporal’s approach fosters this by reducing the cognitive load and frustration associated with traditional workflow coding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Durable Execution Matters
&lt;/h2&gt;

&lt;p&gt;Durable execution is more than a technical innovation; it’s a paradigm shift for building cloud-native systems. By decoupling business logic from state management and failure handling, Temporal empowers developers to build reliable, scalable systems with less effort. Whether you’re managing control planes, provisioning resources, orchestrating complex workflows, performing money transfers, training AI models, or processing social media posts, this approach delivers clear benefits.&lt;/p&gt;

&lt;p&gt;At Temporal, we’ve seen firsthand how durable execution transforms the development process, enabling us to deliver a robust managed service that scales with our customers’ needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready to Transform Your Control Plane?
&lt;/h2&gt;

&lt;p&gt;Temporal isn’t just a tool for building cloud systems; it’s a better way to think about workflows and application architecture. If you’re building or planning a managed cloud service, consider how durable execution can simplify your journey and unlock new possibilities. For more insights into our approach, check out &lt;a href="https://www.infoq.com/presentations/durable-execution-control-plane/" rel="noopener noreferrer"&gt;my full talk at QCon&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>control</category>
      <category>service</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Why Top Developers Prioritize Failure Management</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Sat, 09 Aug 2025 00:35:05 +0000</pubDate>
      <link>https://dev.to/temporalio/why-top-developers-prioritize-failure-management-lj6</link>
      <guid>https://dev.to/temporalio/why-top-developers-prioritize-failure-management-lj6</guid>
      <description>&lt;p&gt;There’s a saying: “Amateurs study tactics, while professionals study logistics.” In software, this translates to: “Amateurs focus on algorithms, while professionals focus on failures.”&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://jonthebeach.com/" rel="noopener noreferrer"&gt;J on the Beach&lt;/a&gt;, I took time in my &lt;a href="https://www.youtube.com/watch?v=pMfMm2eD3GM" rel="noopener noreferrer"&gt;talk&lt;/a&gt; to expand on this saying and explain that real-world systems don’t just need code that works on the “happy path” — they need a safety net for when things go wrong.&lt;/p&gt;

&lt;p&gt;Modern software development has layers of complexity. You’re not just writing code; you’re connecting systems across time and space, handling data that doesn’t sleep, and ensuring flawless performance at scale. What sets top developers apart is how they manage failures. Building resilience focuses on ensuring reliability when things inevitably go wrong, not just maintaining uptime.&lt;/p&gt;

&lt;p&gt;In this post, we’ll walk through three common approaches to handling failures in software, each with its own strengths and weaknesses. Then we’ll introduce Temporal’s approach, workflow-as-code, which makes it easier to build reliability into your systems from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Ways to Handle Failure in Your Software
&lt;/h2&gt;

&lt;p&gt;Failures are inevitable in your distributed systems. When a network link fails, a server times out, or a service crashes, systems need strategies to respond properly and ensure that your operations remain reliable.&lt;/p&gt;

&lt;p&gt;Below, we’ll explore three common approaches to coordination between systems — Remote Procedure Calls (RPCs), persistent queues, and workflows — and their relationship to failure management.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Request-Response (RPC)
&lt;/h3&gt;

&lt;p&gt;The request-response model, or RPC, is a classic approach. A client makes a request, and the server processes it and sends back a response. In the best-case scenario (the “happy path”), everything works smoothly. Imagine a money transfer request: one service debits the sender while another credits the receiver. If all goes as planned, the transfer completes with no issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of the RPC Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simplicity&lt;/strong&gt;: The direct client-server connection makes this model easy to implement for straightforward workflows.&lt;br&gt;
&lt;strong&gt;Efficiency on the “happy path”&lt;/strong&gt;: When things go smoothly, RPC provides fast, efficient responses and low latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons of the RPC Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited resilience for partial failures&lt;/strong&gt;: If the client’s request is successful, but a response isn’t received, or a step in the process fails, RPC often requires extensive error-handling code on the client side.&lt;br&gt;
&lt;strong&gt;Heavy client burden&lt;/strong&gt;: Clients must handle errors, recovery, and retries, complicating systems as they scale.&lt;/p&gt;

&lt;p&gt;The RPC model works well for simple, synchronous tasks. However, for resilience, it falls short by placing the onus on developers of the RPCs and those consuming them to manage every failure scenario, and this is no trivial matter.&lt;/p&gt;
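
&lt;p&gt;The lost-response case is worth a small sketch. In the plain-Python toy below, the server completes a transfer but the reply never reaches the client, so the client retries. An idempotency key is one common way to keep the retry from moving the money twice. &lt;code&gt;TransferService&lt;/code&gt;, &lt;code&gt;transfer_with_retry&lt;/code&gt;, and the &lt;code&gt;lost_replies&lt;/code&gt; knob are hypothetical names used only to simulate the failure.&lt;/p&gt;

```python
import uuid


class TransferService:
    """Toy in-process stand-in for a remote debit/credit service."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.seen = {}  # idempotency key mapped to the earlier result

    def transfer(self, key, src, dst, amount):
        if key in self.seen:
            return self.seen[key]  # duplicate request: do not move money again
        self.balances[src] -= amount
        self.balances[dst] += amount
        self.seen[key] = "ok"
        return "ok"


def transfer_with_retry(service, src, dst, amount, lost_replies=0, max_attempts=3):
    """Retry after a lost reply, reusing one idempotency key for every attempt."""
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        result = service.transfer(key, src, dst, amount)
        # Simulated timeout: the server did the work but the reply was lost.
        reply_lost = attempt in range(lost_replies)
        if not reply_lost:
            return result
    raise TimeoutError("no reply after %d attempts" % max_attempts)
```

&lt;p&gt;Both the client and the server need to cooperate for this to work, which is the kind of burden the paragraph above describes.&lt;/p&gt;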

&lt;h3&gt;
  
  
  2. Persistent Queues
&lt;/h3&gt;

&lt;p&gt;Persistent queues add a degree of flexibility by decoupling the client from the server. Messages are placed in a queue, and the system processes them asynchronously. Queues help distribute workloads: they support automatic retries and asynchronous processing, which can smooth out demand spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of Persistent Queues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic retries&lt;/strong&gt;: Persistent queues often support automatic retries, attempting tasks multiple times if they initially fail.&lt;br&gt;
&lt;strong&gt;Load distribution&lt;/strong&gt;: Queues smooth processing under heavy loads, distributing requests over time, to improve system reliability.&lt;br&gt;
&lt;strong&gt;Producer-consumer separation&lt;/strong&gt;: Decoupling producers and consumers allows each side to operate independently, improving fault tolerance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons of Persistent Queues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loss of ordering&lt;/strong&gt;: Since queues process messages independently, tasks may execute out of order, causing unexpected issues for dependent operations.&lt;br&gt;
&lt;strong&gt;Dead-letter queues&lt;/strong&gt;: Tasks that continuously fail may require a separate “dead-letter” queue, adding complexity and, typically, manual intervention.&lt;br&gt;
&lt;strong&gt;Limited visibility into status&lt;/strong&gt;: Visibility becomes even more challenging when you have systems that use multiple queues, requiring additional tooling and infrastructure.&lt;/p&gt;

&lt;p&gt;Queues work well when you need flexibility and decoupling, but they lack the control and visibility needed for comprehensive failure management.&lt;/p&gt;
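
&lt;p&gt;A small in-memory sketch shows how retries, reordering, and dead-lettering interact. It assumes a single consumer and a &lt;code&gt;deque&lt;/code&gt; in place of a real broker. The function name &lt;code&gt;drain&lt;/code&gt; and the &lt;code&gt;(message, attempts)&lt;/code&gt; tuple layout are hypothetical.&lt;/p&gt;

```python
from collections import deque


def drain(queue, handler, max_attempts=3):
    """Process (message, attempts) pairs; requeue failures, dead-letter after max_attempts."""
    processed = []
    dead_letters = []
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
            processed.append(message)
        except Exception:
            if attempts + 1 == max_attempts:
                dead_letters.append(message)  # give up: needs manual attention
            else:
                # Requeued at the back, so it now runs after later messages.
                queue.append((message, attempts + 1))
    return processed, dead_letters
```

&lt;p&gt;Note that a message that fails once and then succeeds is processed after messages that were enqueued later, which is the loss of ordering listed above.&lt;/p&gt;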

&lt;h3&gt;
  
  
  3. Workflows
&lt;/h3&gt;

&lt;p&gt;Workflows provide a robust solution for orchestrating complex processes across distributed systems. Unlike RPC or queue-based models, workflows manage retries, state, and error handling automatically, making them ideal for long-running or multi-step processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of Workflows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in resilience&lt;/strong&gt;: Workflows handle retries, recovery, and compensation steps automatically, reducing the need for custom error-handling code.&lt;br&gt;
&lt;strong&gt;Support for long-running processes&lt;/strong&gt;: Workflows accommodate processes that span minutes, hours, or even days, making them well-suited for complex tasks.&lt;br&gt;
&lt;strong&gt;Enhanced visibility&lt;/strong&gt;: Workflow systems enable real-time tracking and querying, so both clients and developers can see exactly where each process stands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons of Workflows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure requirements&lt;/strong&gt;: Workflows require a solid infrastructure to manage states, retries, and tracking, which some teams may lack.&lt;br&gt;
&lt;strong&gt;Setup complexity&lt;/strong&gt;: Workflow systems can be complex to set up, especially when building custom solutions to manage workflows.&lt;/p&gt;

&lt;p&gt;For complex processes that demand reliability and transparency, workflows provide the most comprehensive solution, though they require dedicated infrastructure to deploy effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resilience Without Extra Overhead
&lt;/h2&gt;

&lt;p&gt;At Temporal, we addressed these challenges by designing a platform that handles resilience, error handling, and state management so you don’t have to.&lt;/p&gt;

&lt;p&gt;With Temporal, you write workflows as code, with no extra XML, JSON, or YAML definitions of workflow logic that are difficult to understand and debug down the line. Define your steps in regular code, and Temporal does the rest: managing retries, maintaining state, and ensuring that your workflows are reliable and simple to create.&lt;/p&gt;

&lt;p&gt;Companies like &lt;a href="https://temporal.io/resources/case-studies/anz-story" rel="noopener noreferrer"&gt;ANZ Bank&lt;/a&gt;, one of the largest banks in the Asia-Pacific region, rely on Temporal to strengthen the resilience and reliability of critical financial processes. With Temporal, ANZ orchestrates and manages complex operations across distributed systems, ensuring tasks are retried automatically, failures are handled, and long-running processes are tracked seamlessly. This has enabled ANZ to boost system reliability, reduce operational complexity, and uphold strict compliance standards in their high-stakes FinServ environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Management Is a Strategy, Not a Setback
&lt;/h2&gt;

&lt;p&gt;Any complex system will encounter failures, but how you handle those failures makes all the difference. Focusing on failure management from the start distinguishes exceptional teams from average ones. Building resilience into your system sets your project up for long-term success.&lt;/p&gt;

</description>
      <category>reliability</category>
      <category>distributed</category>
      <category>failure</category>
    </item>
    <item>
      <title>Time-Travel Debugging Production Code</title>
      <dc:creator>Loren 🤓</dc:creator>
      <pubDate>Tue, 08 Aug 2023 07:03:31 +0000</pubDate>
      <link>https://dev.to/temporalio/time-travel-debugging-production-code-4m6o</link>
      <guid>https://dev.to/temporalio/time-travel-debugging-production-code-4m6o</guid>
      <description>&lt;p&gt;In this post, I’ll give an overview of time travel debugging (what it is, its history, how it’s implemented) and show how it relates to debugging your production code.&lt;/p&gt;

&lt;p&gt;Normally, when we use debuggers, we set a breakpoint on a line of code, we run our code, execution pauses on our breakpoint, we look at values of variables and maybe the call stack, and then we manually step forward through our code's execution. In &lt;em&gt;time-travel debugging&lt;/em&gt;, also known as &lt;em&gt;reverse debugging&lt;/em&gt;, we can step backward as well as forward. This is powerful because debugging is an exercise in figuring out what happened: traditional debuggers are good at telling you what your program is doing right now, whereas time-travel debuggers let you see what happened. You can wind back to any line of code that executed and see the full program state at any point in your program’s history.&lt;/p&gt;

&lt;h2&gt;
  
  
  History and current state
&lt;/h2&gt;

&lt;p&gt;It all started with Smalltalk-76, developed in 1976 at &lt;a href="https://en.wikipedia.org/wiki/PARC_(company)"&gt;Xerox PARC&lt;/a&gt;. (&lt;a href="https://en.wikipedia.org/wiki/Graphical_user_interface"&gt;Everything&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Computer_mouse"&gt;started&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Ethernet"&gt;at&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/WYSIWYG"&gt;PARC&lt;/a&gt; 😄.) It had the ability to retrospectively inspect checkpointed places in execution. Around 1980, MIT added a "retrograde motion" command to its &lt;a href="https://en.wikipedia.org/wiki/Dynamic_debugging_technique"&gt;DDT debugger&lt;/a&gt;, which gave a limited ability to move backward through execution. In a 1995 paper, MIT researchers released ZStep 95, the first true reverse debugger, which recorded all operations as they were performed and supported stepping backward, reverting the system to the previous state. However, it was a research tool and not widely adopted outside academia. &lt;/p&gt;

&lt;p&gt;ODB, the &lt;a href="https://omniscientdebugger.github.io/ODBUserManual.html"&gt;Omniscient Debugger&lt;/a&gt;, was a Java reverse debugger that was introduced in 2003, marking the first instance of time-travel debugging in a widely used programming language. &lt;a href="https://en.wikipedia.org/wiki/GNU_Debugger"&gt;GDB&lt;/a&gt; (perhaps the most well-known command-line debugger, used mostly with C/C++) added it in 2009.&lt;/p&gt;

&lt;p&gt;Now, time-travel debugging is available for &lt;a href="https://github.com/rr-debugger/rr/wiki/Related-work"&gt;many&lt;/a&gt; languages, platforms, and IDEs, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.replay.io/"&gt;Replay&lt;/a&gt; for JavaScript in Chrome, Firefox, and Node, and &lt;a href="https://wallabyjs.com/docs/intro/time-travel-debugger.html"&gt;Wallaby&lt;/a&gt; for tests in Node&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/time-travel-debugging-overview"&gt;WinDbg&lt;/a&gt; for Windows applications&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://rr-project.org/"&gt;rr&lt;/a&gt; for C, C++, Rust, Go, and others on Linux&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://undo.io/"&gt;Undo&lt;/a&gt; for C, C++, Java, Kotlin, Rust, and Go on Linux&lt;/li&gt;
&lt;li&gt;Various extensions (often rr- or Undo-based) for Visual Studio, VS Code, JetBrains IDEs, Emacs, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation techniques
&lt;/h2&gt;

&lt;p&gt;There are three main approaches to implementing time-travel debugging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Record &amp;amp; Replay&lt;/strong&gt;: Record all non-deterministic inputs to a program during its execution. Then, during the debug phase, the program can be deterministically replayed using the recorded inputs in order to reconstruct any prior state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshotting&lt;/strong&gt;: Periodically take snapshots of a program's entire state. During debugging, the program can be rolled back to these saved states. This method can be memory-intensive because it involves storing the entire state of the program at multiple points in time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrumentation&lt;/strong&gt;: Add extra code to the program that logs changes in its state. This extra code allows the debugger to step the program backwards by reverting changes. However, this approach can significantly slow down the program's execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;rr uses the first (the rr name stands for Record and Replay), as does &lt;a href="https://docs.replay.io/learn-more/contribute/replay-for-new-contributors#5130506fb24843ab86fe79d11f02261b"&gt;Replay&lt;/a&gt;. WinDbg uses the first two, and Undo uses all three (see &lt;a href="https://undo.io/resources/liverecorder-vs-rr/"&gt;how it differs from rr&lt;/a&gt;).&lt;/p&gt;
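
&lt;p&gt;To give a feel for the first approach, here is a toy record-and-replay sketch in plain Python. It is nowhere near what rr or Replay do at the system-call or browser level. It only shows the core idea that logging every non-deterministic input is enough to reconstruct every intermediate state later. &lt;code&gt;Recorder&lt;/code&gt;, &lt;code&gt;Replayer&lt;/code&gt;, and &lt;code&gt;program&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import random


class Recorder:
    """Wraps a non-deterministic source and logs every value it returns."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def __call__(self):
        value = self.source()
        self.log.append(value)
        return value


class Replayer:
    """Feeds back a recorded log instead of consulting the real source."""

    def __init__(self, log):
        self.values = iter(log)

    def __call__(self):
        return next(self.values)


def program(next_input, steps=5):
    """Deterministic apart from `next_input`; returns every intermediate state."""
    total, states = 0, []
    for _ in range(steps):
        total = total * 31 + next_input()
        states.append(total)
    return states


recorder = Recorder(lambda: random.randint(0, 99))
live_states = program(recorder)                       # the original run
replayed_states = program(Replayer(recorder.log))     # same states, any time later
```

&lt;p&gt;Because the replayed run visits the same states as the live one, a debugger attached to the replay can stop at any earlier point, which is what makes stepping backward possible.&lt;/p&gt;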

&lt;h2&gt;
  
  
  Time-traveling in production
&lt;/h2&gt;

&lt;p&gt;Traditionally, running a debugger in prod doesn't make much sense. Sure, we could SSH into a prod machine and start the process handling requests with a debugger and a breakpoint, but once we hit the breakpoint, we're delaying responses to all current requests and unable to respond to new requests. Also, debugging non-trivial issues is an iterative process: we get a clue, we keep looking and find more clues, and discovering each clue typically means rerunning the program and reproducing the failure. So, instead of debugging in production, what we do is replicate on our dev machine whatever issue we're investigating and use a debugger locally (or, more often, add log statements 😄), and re-run as many times as required to figure it out. Replicating takes time (and in some cases a &lt;em&gt;lot&lt;/em&gt; of time, and in some cases infinite time), so it would be really useful if we didn't have to.&lt;/p&gt;

&lt;p&gt;While running traditional debuggers doesn't make sense, time-travel debuggers can record a process execution on one machine and replay it on another machine. So we can record (or snapshot or instrument) production and replay it on our dev machine for debugging (depending on the tool, our machine may need to have the same CPU instruction set as prod). However, the recording step generally doesn't make sense to use in prod given the high amount of overhead—if we set up recording and then have to use ten times as many servers to handle the same load, whoever &lt;a href="https://www.linkedin.com/in/kevin-laughlin-4133166/"&gt;pays our AWS bill&lt;/a&gt; will not be happy 😁.&lt;/p&gt;

&lt;p&gt;But there are a couple of scenarios in which it does make sense:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Undo only slows down execution &lt;a href="https://undo.io/solutions/products/live-recorder/#section-133"&gt;2–5x&lt;/a&gt;, so while we don't want to leave it on just in case, we can &lt;a href="https://twitter.com/gregthelaw/status/1654558923242762243"&gt;turn it on temporarily&lt;/a&gt; on a subset of prod processes for hard-to-repro bugs until we have captured the bug happening, and then we turn it off.&lt;/li&gt;
&lt;li&gt;When we're already recording the execution of a program in the normal course of operation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rest of this post is about #2, which is a way of running programs called &lt;em&gt;durable execution&lt;/em&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Durable execution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's that?
&lt;/h3&gt;

&lt;p&gt;First, a brief backstory. After Amazon (one of the first large adopters of microservices) decided that using message queues to communicate between services was not the way to go (hear the story first-hand &lt;a href="https://www.youtube.com/watch?v=wIpz4ioK0gI"&gt;here&lt;/a&gt;), they started using orchestration. And once they realized defining orchestration logic in YAML/JSON wasn't a good developer experience, they created &lt;a href="https://docs.aws.amazon.com/amazonswf/latest/developerguide/swf-welcome.html"&gt;AWS Simple Workflow Service&lt;/a&gt; to define logic in code. This technique of backing code by an orchestration engine is called durable execution, and it spread to &lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=csharp-inproc"&gt;Azure Durable Functions&lt;/a&gt;, &lt;a href="https://cadenceworkflow.io/"&gt;Cadence&lt;/a&gt; (used at Uber for &lt;a href="https://www.uber.com/blog/announcing-cadence/"&gt;&amp;gt; 1,000 services&lt;/a&gt;), and &lt;a href="https://temporal.io/"&gt;Temporal&lt;/a&gt; (used by Stripe, Netflix, Datadog, Snap, Coinbase, and many more).&lt;/p&gt;

&lt;p&gt;Durable execution runs code durably—recording each step in a database, so that when anything fails, it can be retried from the same step. The machine running the function can even lose power before it gets to line 10, and another process is guaranteed to pick up executing at line 10, with all variables and threads intact.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; It does this with a form of record &amp;amp; replay: all input from the outside is recorded, so when the second process picks up the partially-executed function, it can replay the code (in a side-effect–free manner) with the recorded input in order to get the code into the right state by line 10.&lt;/p&gt;

&lt;p&gt;Durable execution's flavor of record &amp;amp; replay doesn't use high-overhead methods like &lt;a href="https://undo.io/resources/liverecorder-vs-rr/"&gt;software JIT binary translation&lt;/a&gt;, snapshotting, or instrumentation. It also doesn't require special hardware. It does require one constraint: durable code must be deterministic (i.e., given the same input, it must take the same code path). So it can't do things that might have different results at different times, like use the network or disk. However, it can call other functions that are run normally (&lt;a href="https://twitter.com/DominikTornow/status/1582370919258783744"&gt;"volatile functions"&lt;/a&gt;, as we like to call them 😄), and while each step of those functions isn't persisted, the functions are automatically retried on transient failures (like a service being down).&lt;/p&gt;

&lt;p&gt;Only the steps that require interacting with the outside world (like calling a volatile function, or calling &lt;code&gt;sleep('30 days')&lt;/code&gt;, which stores a timer in the database) are persisted. Their results are also persisted, so that when you replay the durable function that died on line 10, if it previously called the volatile function on line 5 that returned "foo", during replay, "foo" will immediately be returned (instead of the volatile function getting called again). While yes, it adds latency to be saving things to the database, Temporal supports extremely high throughput (tested up to a million recorded steps per second). And in addition to function recoverability and automatic retries, it comes with &lt;a href="https://temporal.io/blog/building-reliable-distributed-systems-in-node"&gt;many more benefits&lt;/a&gt;, including extraordinary visibility into and debuggability of production.&lt;/p&gt;
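
&lt;p&gt;The replay behavior described above can be sketched in plain Python. This is a deliberately simplified model, not the Temporal SDK: &lt;code&gt;DurableContext&lt;/code&gt;, its &lt;code&gt;call&lt;/code&gt; method, the &lt;code&gt;Crash&lt;/code&gt; exception, and the in-memory &lt;code&gt;history&lt;/code&gt; list (standing in for the database) are assumptions made for illustration.&lt;/p&gt;

```python
class Crash(Exception):
    """Simulates the worker process dying in the middle of a function."""


class DurableContext:
    def __init__(self, history):
        self.history = history  # list standing in for the database
        self.position = 0

    def call(self, volatile_fn, *args):
        """Return the recorded result during replay; otherwise run and record."""
        if self.position != len(self.history):
            result = self.history[self.position]  # replay: do not call again
        else:
            result = volatile_fn(*args)
            self.history.append(result)
        self.position += 1
        return result


def order_workflow(ctx, charge, ship, crash_before_ship=False):
    """Deterministic durable function; all outside work goes through ctx.call."""
    receipt = ctx.call(charge, "order-1")
    if crash_before_ship:
        raise Crash()
    tracking = ctx.call(ship, "order-1")
    return receipt, tracking
```

&lt;p&gt;If the first run crashes after &lt;code&gt;charge&lt;/code&gt;, a second run with a fresh &lt;code&gt;DurableContext&lt;/code&gt; over the same &lt;code&gt;history&lt;/code&gt; gets the recorded receipt back without charging again, and then carries on to &lt;code&gt;ship&lt;/code&gt;. This only works because the durable function itself is deterministic, so it asks for the same steps in the same order on every replay.&lt;/p&gt;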

&lt;h3&gt;
  
  
  Debugging prod
&lt;/h3&gt;

&lt;p&gt;With durable execution, we can read through the steps that every single durable function took in production. We can also download the execution’s history, check out the version of the code that's running in prod, and pass the history file to a replayer (Temporal has runtimes for Go, Java, JavaScript, Python, .NET, and PHP) so we can see in a debugger exactly what the code did during that production function execution. Read &lt;a href="https://temporal.io/blog/temporal-for-vs-code"&gt;this post&lt;/a&gt; or watch &lt;a href="https://www.youtube.com/watch?v=3IjQde9HMNY"&gt;this video&lt;/a&gt; to see an example in VS Code.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Being able to debug any past production code is a huge step up from the other option (finding a bug, trying to repro locally, failing, turning on Undo recording in prod until it happens again, turning it off, &lt;em&gt;then&lt;/em&gt; debugging locally). It's also a (sometimes necessary) step up from distributed tracing.&lt;/p&gt;




&lt;p&gt;I hope you found this post interesting! If you'd like to learn more about durable execution, I recommend reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://temporal.io/blog/building-reliable-distributed-systems-in-node"&gt;Building reliable distributed systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://temporal.io/blog/building-reliable-distributed-systems-in-node-js-part-2"&gt;How durable execution works&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and watching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=wIpz4ioK0gI"&gt;Introduction to Temporal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=6lSuDRRFgyY"&gt;Why durable execution changes everything&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Thanks to Greg Law, Jason Laster, Chad Retz, and Fitz for reviewing drafts of this post.&lt;/em&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Technically, it doesn't have line-by-line granularity. It only records certain steps that the code takes—read on for more info ☺️. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;The astute reader may note that our extension uses the default VS Code debugger, which doesn’t have a back button 😄. I transitioned from talking about TTD to methods of debugging production code via recording, so while Temporal doesn’t have TTD yet, it does record all the non-deterministic inputs to the program and is able to replay execution, so it’s definitely possible to implement. Upvote &lt;a href="https://github.com/temporalio/vscode-debugger-extension/issues/51"&gt;this issue&lt;/a&gt; or comment if you have thoughts on implementation! ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>programming</category>
      <category>debugging</category>
      <category>temporal</category>
    </item>
    <item>
      <title>Actors and Workflows: Building a Customer Loyalty Program with Temporal</title>
      <dc:creator>Fitz</dc:creator>
      <pubDate>Thu, 03 Aug 2023 17:28:24 +0000</pubDate>
      <link>https://dev.to/temporalio/actors-and-workflows-building-a-customer-loyalty-program-with-temporal-9f3</link>
      <guid>https://dev.to/temporalio/actors-and-workflows-building-a-customer-loyalty-program-with-temporal-9f3</guid>
      <description>&lt;p&gt;This post is technically a followup of &lt;a href="https://temporal.io/blog/workflows-as-actors-is-it-really-possible"&gt;another post&lt;/a&gt;. You don't &lt;em&gt;need&lt;/em&gt; to read that one to make sense of this one, but it might give some useful background.&lt;/p&gt;

&lt;p&gt;That post talked through how &lt;a href="https://en.wikipedia.org/wiki/Actor_model"&gt;the Actor Model&lt;/a&gt; can be implemented using "Workflows" (on &lt;a href="https://github.com/temporalio/temporal"&gt;Temporal&lt;/a&gt;), even though these two concepts don't immediately appear compatible.&lt;/p&gt;

&lt;p&gt;Here, I dive into a concrete example: a Workflow representing a customer's loyalty status.&lt;/p&gt;

&lt;p&gt;If you want to skip the prose and just jump right into the code, you can find it all in &lt;a href="https://github.com/afitz0/customer-loyalty-workflow"&gt;this GitHub repository&lt;/a&gt;, with implementations in Go, Java, and Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actor Model Refresher
&lt;/h2&gt;

&lt;p&gt;As formally defined, Actors must be able to do three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send and receive messages&lt;/li&gt;
&lt;li&gt;Create new Actors&lt;/li&gt;
&lt;li&gt;Maintain state&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Exact implementation details vary depending on what framework, library, or tools you're using, but the biggest challenge is having some kind of software artifact running &lt;em&gt;somewhere&lt;/em&gt; that can handle these things.&lt;/p&gt;

&lt;p&gt;That's where most Actor frameworks come in to help: providing both the programming model and the runtime environment for being able to build an Actor-based application in a highly distributed, concurrent, and scalable way.&lt;/p&gt;

&lt;p&gt;Temporal differs here in that it’s general-purpose, rather than specific to one model or system design pattern. With Workflows, you define a function that Temporal will ensure runs to completion (or reliably runs forever, if the function doesn’t return).&lt;/p&gt;

&lt;p&gt;I recognize that statement is both rather bold and also so generic as to be hard to disprove. So, let's look at a concrete example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loyal Customers
&lt;/h2&gt;

&lt;p&gt;Many consumer businesses have some kind of loyalty program. Buy 10 items, get the 11th free! Fly 10,000 miles, get free access to the airport lounge! Earn one million points over the lifetime of your account, earn a gold star!&lt;/p&gt;

&lt;p&gt;At the highest level, the application's logic isn't complex: Each customer has an integer counter that's incremented after the customer does certain things (e.g., buy something, or take a trip). When that counter crosses different thresholds, new rewards are unlocked. And, although we may not like it, customers can always close their accounts.&lt;/p&gt;

&lt;p&gt;When we create the diagram for the app, it might look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--41iv8ffH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3how2h0elz6ia2b8fn3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--41iv8ffH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3how2h0elz6ia2b8fn3x.png" alt="customer loyalty diagram version 1" width="500" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In terms of the Actor Model, two of the three requirements are on display:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Send and receive messages&lt;/strong&gt;: A customer can send either an "earn points" message or a "try to use reward" message.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create new Actors&lt;/strong&gt;: ??? (This is the Actor requirement not apparent in this application, but we'll see later how it can be incorporated.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain state&lt;/strong&gt;: A customer loyalty account needs to maintain the points counter and which rewards are unlocked (or be able to look up this information based on the points value).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Requirement #2, the ability to create other Actors, isn't immediately obvious here, but it isn't too far out of reach. We could define in this example application that one of the rewards for earning enough points is the ability to gift status to someone else, inviting them (i.e., creating their account) to the program if they aren't already a member.&lt;/p&gt;

&lt;p&gt;If our goal is to create a demo application for the Actor Model (as it is in this post), then there's actually one other thing missing: the ability for a customer (or rather, their loyalty account) to send messages. For that, we could also declare that customers with enough points can gift points or status levels (i.e., which rewards are unlocked) to their guests. Then they can send messages, too!&lt;/p&gt;

&lt;p&gt;Reworking the previous diagram to be more befitting of a full "Actor," we'd get the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mCfdb_dA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0bfuafimxwqxzn7rxxam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mCfdb_dA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0bfuafimxwqxzn7rxxam.png" alt="customer loyalty diagram version 2" width="500" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And, as for the exact implementation details, read on!&lt;/p&gt;

&lt;h2&gt;
  
  
  Loyal (Temporal!) Customers
&lt;/h2&gt;

&lt;p&gt;Imagine being able to write the customer loyalty program above in just a function or two. Conceptually, that's not hard. In pseudocode, that might look like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INVITE_REWARD_MINIMUM_POINTS = 1000

function CustomerLoyaltyAccount:
    account_canceled = false
    points = 0

    while !account_canceled:
        message = receive_message()
        switch message.type:
            case 'cancel':
                account_canceled = true
            case 'add_points':
                fallthrough
            case 'gift_points':
                points += message.value
            case 'invite_guest':
                if points &amp;gt;= INVITE_REWARD_MINIMUM_POINTS:
                    spawn(new CustomerLoyaltyAccount())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But there are a few crucial details that are, well, rather undefined in this pseudo-function. Specifically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What's &lt;code&gt;receive_message()&lt;/code&gt; doing? How is it receiving messages?&lt;/li&gt;
&lt;li&gt;Similarly, what's &lt;code&gt;spawn(new CustomerLoyaltyAccount())&lt;/code&gt; doing? &lt;/li&gt;
&lt;li&gt;And most importantly, where is this function running? What happens if that runtime crashes or the function otherwise stops running?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these maps to core Temporal features that we can implement in an example Workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data can be sent to Workflows via Signals&lt;/li&gt;
&lt;li&gt;Workflows can create new Workflow instances&lt;/li&gt;
&lt;li&gt;As long as there are Workers running &lt;em&gt;somewhere&lt;/em&gt; that can communicate with the Temporal Server, then if the Worker running the function dies, the function will continue running on another (you know, kind of Temporal's main benefit)&lt;/li&gt;
&lt;/ol&gt;
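
&lt;p&gt;Before bringing Temporal in, the pseudocode above can be made runnable in plain Go. This is only a sketch under stated assumptions: &lt;code&gt;Message&lt;/code&gt;, &lt;code&gt;Account&lt;/code&gt;, and &lt;code&gt;Run&lt;/code&gt; are hypothetical names, the mailbox is an in-memory slice rather than a real message transport, and the Go &lt;code&gt;switch&lt;/code&gt; lists both point-adding cases together instead of using a fallthrough.&lt;/p&gt;

```go
package main

import "fmt"

const inviteRewardMinimumPoints = 1000

// Message is a hypothetical envelope for everything an account can receive.
type Message struct {
	Type  string
	Value int
}

// Account holds the state the actor maintains between messages.
type Account struct {
	Points   int
	Canceled bool
	Guests   []*Account // accounts created by invite_guest messages
}

// Run handles the mailbox in order until it is empty or the account is canceled.
func (a *Account) Run(mailbox []Message) {
	for _, m := range mailbox {
		if a.Canceled {
			return
		}
		switch m.Type {
		case "cancel":
			a.Canceled = true
		case "add_points", "gift_points":
			a.Points += m.Value
		case "invite_guest":
			if a.Points >= inviteRewardMinimumPoints {
				a.Guests = append(a.Guests, new(Account))
			}
		}
	}
}

func main() {
	account := new(Account)
	account.Run([]Message{
		{Type: "add_points", Value: 1200},
		{Type: "invite_guest"},
		{Type: "cancel"},
		{Type: "add_points", Value: 50}, // ignored: the account is already canceled
	})
	fmt.Println(account.Points, len(account.Guests), account.Canceled) // prints "1200 1 true"
}
```

&lt;p&gt;Everything here lives in one process's memory, so a crash loses the points and the guests. That gap is what the three Temporal features listed above are meant to close.&lt;/p&gt;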

&lt;h3&gt;
  
  
  Customers Go Loyal
&lt;/h3&gt;

&lt;p&gt;Let's build this up in Go. If you are more comfortable with other languages, I've also written the same Workflow in &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/tree/main/python"&gt;Python&lt;/a&gt; and &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/tree/main/java"&gt;Java&lt;/a&gt;. While the languages are different, most of the same concepts and patterns should carry over.&lt;/p&gt;

&lt;p&gt;(For brevity in the body of this blog post, I'll in most cases omit error handling but include it when non-trivial and relevant.)&lt;/p&gt;

&lt;p&gt;First, we write the skeleton of a Workflow and an Activity. For some of the milestones in a customer's lifecycle, it'd be nice to send them some kind of notification. In a real application, you'd call out to SendGrid, Mailchimp, Constant Contact, or some other email provider, but for simplicity's sake, I'm just logging out the details. This initial Workflow does just that: if it's a new customer, send a welcome email, but otherwise move on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;CustomerLoyaltyWorkflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="n"&gt;CustomerInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newCustomer&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Loyalty workflow started."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CustomerInfo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt; &lt;span class="n"&gt;Activities&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;newCustomer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"New customer workflow; sending welcome email."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecuteActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SendEmail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Welcome, %v, to our loyalty program!"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
            &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error running SendEmail activity for welcome email."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Skipping welcome email for non-new customer."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// ... [to be added later] ... //&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Activities&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Client&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Activities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;SendEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sending email."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Contents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next up, we need to be able to handle messages. This is the primary thing the Workflow (i.e., customer loyalty Actor) does: sit around waiting for new messages to come in.&lt;/p&gt;

&lt;p&gt;The following code replaces the &lt;code&gt;// ... [to be added later] ... //&lt;/code&gt; line from the previous snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;  &lt;span class="n"&gt;selector&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Signal handler for adding points&lt;/span&gt;
    &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddReceive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetSignalChannel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"addPoints"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReceiveChannel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;signalAddPoints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c"&gt;// Signal handler for canceling account&lt;/span&gt;
    &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddReceive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetSignalChannel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"cancelAccount"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReceiveChannel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;signalCancelAccount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c"&gt;// ... [register other Signal handlers here] ... //&lt;/span&gt;

  &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Waiting for new messages"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AccountActive&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Signal handler function for adding points does very little, adding in the given points to the customer's state and then sending an email to the customer with the new value.&lt;/p&gt;

&lt;p&gt;As you might imagine, &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/workflow.go#L250-L261"&gt;the cancel account handler&lt;/a&gt; is very similar, setting the &lt;code&gt;customer.AccountActive&lt;/code&gt; flag used above to false and then notifying the customer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;signalAddPoints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReceiveChannel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CustomerInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt; &lt;span class="n"&gt;Activities&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;pointsToAdd&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Receive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pointsToAdd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Adding points to customer account."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"PointsAdded"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pointsToAdd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoyaltyPoints&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;pointsToAdd&lt;/span&gt;

    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecuteActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SendEmail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"You've earned more points! You now have %v."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoyaltyPoints&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error running SendEmail activity for added points."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// ... [insert logic for unlocking status levels or rewards] ... //&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All combined, the code so far does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, it registers the &lt;code&gt;signalAddPoints&lt;/code&gt; and &lt;code&gt;signalCancelAccount&lt;/code&gt; functions as the handlers for the "addPoints" and "cancelAccount" Signals, respectively.&lt;/li&gt;
&lt;li&gt;Then, it blocks forward progress on the Workflow, via &lt;code&gt;selector.Select(ctx)&lt;/code&gt;, until a registered Signal comes in. Unless that Signal is "cancelAccount," the Workflow will keep looping on this select.&lt;/li&gt;
&lt;li&gt;Finally, it logs email failures rather than failing the Workflow. I've chosen this so that the Workflow representing the customer's loyalty account stays active and running even when an external system fails.

&lt;ul&gt;
&lt;li&gt;For this to work, you'll want to set an appropriate retry policy so that the Workflow doesn't block indefinitely on email failures, for example by setting &lt;code&gt;MaximumAttempts&lt;/code&gt; to a reasonably low number like 10.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Already this gives us most of the application. We have a function that runs perpetually, thanks to Temporal, and can receive two different kinds of messages. Both modify the state of the Workflow, and one of them also causes the Workflow to finish.&lt;/p&gt;

&lt;p&gt;What remains are a couple more Temporal-specific considerations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long-Lived Customers
&lt;/h3&gt;

&lt;p&gt;In &lt;a href="https://temporal.io/blog/workflows-as-actors-is-it-really-possible#building-a-workflow-that-can-practically-run-forever"&gt;my last post&lt;/a&gt;, I spilled many words on the topic of "Continue-As-New." If you didn't—or don't want to!—read those words, the gist is this: at some point, a Workflow's history may get unwieldy; Continue-As-New resets it.&lt;/p&gt;

&lt;p&gt;For this customer loyalty example Workflow, the far-and-away biggest contributor to the Event History is the &lt;em&gt;number&lt;/em&gt; of events, not the size. With the &lt;code&gt;addPoints&lt;/code&gt; Signal only taking a single integer argument and the &lt;code&gt;cancelAccount&lt;/code&gt; Signal taking none, the combined contribution to the &lt;em&gt;size&lt;/em&gt; of the history is minimal.&lt;/p&gt;

&lt;p&gt;A Signal with only a single integer parameter will, by itself, contribute one Event and about 500 bytes to the History, even with very large values. And so, how many of these Signals would be required to hit either the size or length limits?&lt;/p&gt;

&lt;p&gt;If &lt;em&gt;nothing&lt;/em&gt; else happened but &lt;code&gt;addPoints&lt;/code&gt; Signals, it'd take 51,200 of them to reach the length limit, but &lt;code&gt;50 * 1024 * 1024 / 500&lt;/code&gt; or 104,857.6 to reach the size limit. Knowing that many of these Signals will result in the &lt;code&gt;SendEmail&lt;/code&gt; Activity running, and each Activity contributes a handful of (small) events to the history, this Workflow will hit the History &lt;em&gt;length&lt;/em&gt; limit well before the size limit.&lt;/p&gt;
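
&lt;p&gt;The arithmetic above can be checked with a few lines of Go. The limits and the 500-byte figure are the ones quoted in this post. The figure of 10 events per Signal-plus-Activity is my own assumption for illustration, not a measured value.&lt;/p&gt;

```go
package main

import "fmt"

const (
	historyLengthLimit = 51_200           // events, as quoted above
	historySizeLimit   = 50 * 1024 * 1024 // bytes, as quoted above
	bytesPerSignal     = 500              // approximate, as quoted above
)

// signalsToLengthLimit returns how many Signals fit before the event-count
// limit, given how many history events each Signal produces in total.
func signalsToLengthLimit(eventsPerSignal int) int {
	return historyLengthLimit / eventsPerSignal
}

// signalsToSizeLimit returns how many Signals fit before the size limit.
func signalsToSizeLimit() float64 {
	return float64(historySizeLimit) / bytesPerSignal
}

func main() {
	fmt.Println(signalsToLengthLimit(1)) // Signals only: 51200
	fmt.Println(signalsToSizeLimit())    // 104857.6

	// Assumption for illustration: each Signal plus its SendEmail Activity
	// adds about 10 events in total.
	fmt.Println(signalsToLengthLimit(10)) // 5120
}
```

&lt;p&gt;Under that assumption the length limit arrives after roughly 5,000 Signals, about twenty times sooner than the size limit would.&lt;/p&gt;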

&lt;p&gt;So, let's add a check for that into our Workflow loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;eventsThreshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;
    &lt;span class="c"&gt;// ... snip ...&lt;/span&gt;

    &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Waiting for new messages"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AccountActive&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetCurrentHistoryLength&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;eventsThreshold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, trigger Continue-As-New as needed, draining any pending signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AccountActive&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Account still active, but hit continue-as-new threshold."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c"&gt;// Drain signals before continuing-as-new&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasPending&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewContinueAsNewError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CustomerLoyaltyWorkflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My previous post on this topic &lt;a href="https://temporal.io/blog/workflows-as-actors-is-it-really-possible#avoiding-signal-and-update-loss"&gt;explained in a little more detail&lt;/a&gt; why it's necessary to drain Signals before continuing-as-new. To briefly recap, &lt;a href="https://docs.temporal.io/glossary#continue-as-new"&gt;Continue-As-New&lt;/a&gt; finishes the current Workflow run and starts a new instance of the Workflow &lt;em&gt;regardless of any pending Signals&lt;/em&gt;. If we don't drain (and handle!) Signals before calling &lt;code&gt;workflow.NewContinueAsNewError&lt;/code&gt; (or &lt;a href="https://python.temporal.io/temporalio.workflow.html#continue_as_new"&gt;&lt;code&gt;workflow.continue_as_new&lt;/code&gt;&lt;/a&gt; in Python, or &lt;a href="https://www.javadoc.io/doc/io.temporal/temporal-sdk/latest/io/temporal/workflow/Workflow.html#continueAsNew(io.temporal.workflow.ContinueAsNewOptions,java.lang.Object...)"&gt;&lt;code&gt;Workflow.continueAsNew&lt;/code&gt;&lt;/a&gt; in Java), those pending Signals will be lost forever.&lt;/p&gt;
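&lt;p&gt;As a rough illustration of why the drain matters, here is a small simulation that uses only the Go standard library, not the Temporal SDK. The &lt;code&gt;run&lt;/code&gt; type, &lt;code&gt;handleOne&lt;/code&gt;, and &lt;code&gt;continueAsNew&lt;/code&gt; are hypothetical stand-ins I made up for this sketch: a run holds some state plus Signals that were delivered but not yet handled, and continuing-as-new carries over only the state.&lt;/p&gt;

```go
package main

import "fmt"

// run is a hypothetical stand-in for one Workflow run: its state plus the
// Signals that have been delivered but not yet handled.
type run struct {
	points  int
	pending []int
}

// handleOne applies the oldest pending Signal (here: "add points").
func (r *run) handleOne() {
	r.points += r.pending[0]
	r.pending = r.pending[1:]
}

// continueAsNew starts a fresh run that carries over state only.
// Anything still pending on the old run is dropped unless drained first.
func continueAsNew(old run, drain bool) run {
	if drain {
		for len(old.pending) != 0 {
			old.handleOne()
		}
	}
	return run{points: old.points}
}

func main() {
	old := run{points: 100, pending: []int{10, 5}}
	fmt.Println(continueAsNew(old, false).points) // 100: both Signals lost
	fmt.Println(continueAsNew(old, true).points)  // 115: both Signals applied
}
```

&lt;p&gt;In the real Workflow, the loop over &lt;code&gt;selector.HasPending()&lt;/code&gt; shown earlier plays the role of the &lt;code&gt;drain&lt;/code&gt; branch here.&lt;/p&gt;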

&lt;p&gt;The last major thing this Workflow needs to make it a true, stage-worthy Actor is the ability to create others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spawning New Customers
&lt;/h3&gt;

&lt;p&gt;While Temporal has support for Parent/Child relationships between Workflows, in this customer loyalty application, the only thing we need is the ability to send a message from one to the other in the case of gifting status or points.&lt;/p&gt;

&lt;p&gt;Temporal provides an API in the Client that can do this and create other Workflows all in one call, called &lt;a href="https://docs.temporal.io/dev-guide/go/features#signal-with-start"&gt;Signal-with-Start&lt;/a&gt;. Since this is only available in the Client, not from a Workflow, we'll need to do this &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/activities.go#L29-L50"&gt;in an Activity&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First, I'm setting the ID Reuse Policy to &lt;code&gt;REJECT&lt;/code&gt;. This is in some ways a "business logic" kind of decision, where I'm declaring that once a customer's account is closed, it can't be re-invited. (Note that after a &lt;a href="https://docs.temporal.io/clusters#retention-period"&gt;namespace's retention period&lt;/a&gt; has passed, IDs from closed Workflows can be reused regardless of this policy, and so in a real-life production version of this app, you'd want to have this check an external source for customer account statuses.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Activities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;StartGuestWorkflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt; &lt;span class="n"&gt;CustomerInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GuestInviteResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;workflowOptions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StartWorkflowOptions&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;TaskQueue&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;             &lt;span class="n"&gt;TaskQueue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;WorkflowIDReusePolicy&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we can call &lt;code&gt;Client.SignalWithStartWorkflow&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Starting and signaling guest workflow."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"GuestID"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SignalWithStartWorkflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CustomerWorkflowID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;SignalEnsureMinimumStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusLevel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ordinal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;workflowOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CustomerLoyaltyWorkflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the use of the Client from the Activities receiver struct! I'm making use of something in the way Temporal works in Go: if, when we instantiate and register the Activities in the Worker, &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/worker/main.go#L26-L28"&gt;we also set this Client,&lt;/a&gt; then the same connection will be available within the Activities. This way, we don't have to worry about re-creating the Client.&lt;/p&gt;

&lt;p&gt;I'm also ignoring the returned future from &lt;code&gt;SignalWithStartWorkflow&lt;/code&gt; via a Go convention of assigning to &lt;code&gt;_&lt;/code&gt;; because this "guest" Workflow is expected to run indefinitely long, blocking on its results would prevent the original Workflow from doing anything else. Since the future returned from starting a Workflow is either used for waiting for the Workflow to finish, or getting its IDs (which we already know from the &lt;code&gt;CustomerWorkflowID(guest.CustomerID)&lt;/code&gt; call), we can safely ignore it.&lt;/p&gt;

&lt;p&gt;But it's still necessary to handle the error. With the ID Reuse Policy set to &lt;code&gt;REJECT&lt;/code&gt;, retrying the resulting error from trying to start an already-closed Workflow will get us nowhere, and so we should instead send some useful information back to the Workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;serviceerror&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WorkflowExecutionAlreadyStarted&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;As&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GuestAlreadyCanceled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GuestInvited&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// ... [Defined at top] ...&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;GuestInviteResult&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;

&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;GuestInvited&lt;/span&gt; &lt;span class="n"&gt;GuestInviteResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;iota&lt;/span&gt;
    &lt;span class="n"&gt;GuestAlreadyCanceled&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Back in the Workflow, after running this Activity I can then check its result and notify the customer as appropriate. As before, I'm allowing the Workflow to continue if sending the email failed. But if that &lt;code&gt;SignalWithStartWorkflow&lt;/code&gt; call failed for any reason other than the guest's account having already been closed, I want to make some noise and fail the Workflow: something unusual is likely happening.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;inviteResult&lt;/span&gt; &lt;span class="n"&gt;GuestInviteResult&lt;/span&gt;
&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecuteActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StartGuestWorkflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;inviteResult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"could not signal-with-start guest/child workflow for guest ID '%v': %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guestID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;inviteResult&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;GuestAlreadyCanceled&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;emailToSend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Your guest has canceled!"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;emailToSend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Your guest has been invited!"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecuteActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SendEmail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emailToSend&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This snippet of code would end up in a Signal handler for something like &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/workflow.go#L191-L232"&gt;an "invite guest" Signal&lt;/a&gt;. The handler would also include, as discussed at the top of this post, &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/workflow.go#L198-L199"&gt;a check&lt;/a&gt; for whether the current customer is even allowed to do this action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summing it all up
&lt;/h2&gt;

&lt;p&gt;There are a few other things to explore in this app, like &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/workflow.go#L102-L108"&gt;catching a cancellation request&lt;/a&gt; or &lt;a href="https://github.com/afitz0/customer-loyalty-workflow/blob/main/go/loyalty/workflow_test.go"&gt;looking through the tests&lt;/a&gt;, but this post has gotten long enough as it is. 🙂&lt;/p&gt;

&lt;p&gt;Hopefully this post serves as a nice "close-to-real-world" example for you of how to build something that looks like an "Actor"—aka, a really, really long running Workflow that can send and receive messages and maintain state without a database—using Temporal.&lt;/p&gt;

&lt;p&gt;For more information related to this post and about Temporal, check out the following links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/afitz0/customer-loyalty-workflow/"&gt;This post's source code&lt;/a&gt; (As of publishing, available in Java, Go, and Python)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://temporal.io/blog/workflows-as-actors-is-it-really-possible"&gt;Actors &amp;amp; Workflows, Part 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.temporal.io/dev-guide/go/features#signal-with-start"&gt;SignalWithStart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//docs.temporal.io"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//docs.temporal.io/dev-guide"&gt;Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the best way to learn Temporal is with &lt;a href="https://learn.temporal.io/courses"&gt;our free courses&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover image &lt;a href="https://unsplash.com/photos/fg7J6NnebBc?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditShareLink"&gt;from John Jennings on Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>backend</category>
      <category>distributedsystems</category>
      <category>go</category>
      <category>microservices</category>
    </item>
    <item>
      <title>To Choreograph or Orchestrate your Saga, that is the question.</title>
      <dc:creator>Emily Fortuna</dc:creator>
      <pubDate>Wed, 12 Jul 2023 17:46:50 +0000</pubDate>
      <link>https://dev.to/temporalio/to-choreograph-or-orchestrate-your-saga-that-is-the-question-4kna</link>
      <guid>https://dev.to/temporalio/to-choreograph-or-orchestrate-your-saga-that-is-the-question-4kna</guid>
      <description>&lt;p&gt;The saga pattern is a distributed systems design pattern for a task that spans machine or microservice boundaries in which full execution of all steps is necessary. Partial execution is not desirable. A common real-life example used to explain when the saga pattern is useful is trip planning. If you’re planning on attending &lt;a href="https://temporal.io/replay"&gt;Replay&lt;/a&gt;, for example, you’d need to book a conference ticket, an airplane ticket, and a hotel. If you fail to acquire one of these things, you’ll miss out on meeting fun people in backend engineering face-to-face.&lt;/p&gt;

&lt;p&gt;Below the surface, there are two main ways microservices can talk to one another that make your saga possible: choreography and orchestration. &lt;/p&gt;

&lt;h3&gt;
  
  
  Choreography
&lt;/h3&gt;

&lt;p&gt;Choreography is analogous to ants in an ant colony. Like ants, each microservice has &lt;em&gt;local&lt;/em&gt; knowledge, and shares information about state changes with other services via chemical signals called pheromones–I mean via message passing. Just as an ant trail to food emerges organically from pheromones, the overall behavior of a system as a whole that contains choreographed microservices emerges organically from each microservice’s instructions.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XAPEtOwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0of9n89mjwqrbffissp1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XAPEtOwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0of9n89mjwqrbffissp1.jpg" alt="trail of ants on a log against a green background" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One tenet drilled into every software engineer’s head is the value of decoupling. Choreography embodies this idea and is straightforward to implement as a whole. Choreography can be a popular, easy choice for systems that are incrementally moving from a monolith to a microservices architecture. However, if you have any sort of ordering requirement for tasks, such as ordered steps in your saga, choreography can get unwieldy fairly quickly. Suppose we want to book the plane first so that the hotel can know your flight number and pick you up from the airport. Then we book our conference ticket (maybe there’s a discount with certain hotels). The sequence of messages that each service responds to would look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cvHbRlhr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y9xojsfi7vkxl4j16llc.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cvHbRlhr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y9xojsfi7vkxl4j16llc.gif" alt="three services: plane, hotel, and conference ticket sending messages about when their state has changed so that the other services can act on them." width="600" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, just from looking at each microservice’s individual codebase, it’s difficult to understand the order that the system &lt;em&gt;should&lt;/em&gt; have since that ordering is distributed throughout the code. This leads to all sorts of higher-level business logic diagrams that need to be kept in sync with the code…but wouldn’t it be better if the code were just easier to read in the first place? It also can be difficult to debug the exact sequence of events that led to a bug since control flow is not immediately clear. So, unless all of your microservices are truly independent of one another and don’t have any sort of “happens before” logic, consider using orchestration instead. &lt;/p&gt;

&lt;h3&gt;
  
  
  Orchestration
&lt;/h3&gt;

&lt;p&gt;Orchestration, on the other hand, is like an air traffic control tower directing planes, or microservices. One service, a “super microservice” if you will, functions as the message broker sending messages directly to individual microservices telling them what to do, just like planes wait for permission to take off. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pW_IGjsC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hrskypbc97yzls0c25mo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pW_IGjsC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hrskypbc97yzls0c25mo.gif" alt="three boxes, each representing plane, hotel, and conference booking microservices, and a message broker sending book and cancel commands to each." width="600" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because orchestration centralizes control flow, debugging and understanding control flow is much simpler. Additionally, since each step doesn’t need to keep track of what “happens before” messages it needs to listen to, the code for individual microservices is much simpler. Orchestration also shines in situations where many services need to interact in a single &lt;a href="https://temporal.io/blog/saga-pattern-made-easy"&gt;saga&lt;/a&gt; step. The glaring Achilles’ heel of this method is that bane of all distributed systems: the message broker is a single point of failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---jR8mRfu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ne7nfbuk4lujoye8ug7i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---jR8mRfu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ne7nfbuk4lujoye8ug7i.jpg" alt="Still from the movie airplane with an inflatable co pilot and a disheveled pilot sitting in the cockpit, with a flight attendant standing between them looking slightly concerned" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting it all together
&lt;/h3&gt;

&lt;p&gt;So to summarize, choreography:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is decentralized and decoupled&lt;/li&gt;
&lt;li&gt;Is good for highly independent microservices&lt;/li&gt;
&lt;li&gt;Is “easier” to implement, at least initially &lt;/li&gt;
&lt;li&gt;Is an easy choice for converting established monoliths to microservices&lt;/li&gt;
&lt;li&gt;Can make control flow unclear&lt;/li&gt;
&lt;li&gt;Can be challenging to debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and orchestration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has one service issuing “commands” to execute microservices&lt;/li&gt;
&lt;li&gt;Makes control flow easier to understand&lt;/li&gt;
&lt;li&gt;Is easier to build in greenfield applications&lt;/li&gt;
&lt;li&gt;Makes debugging and failure handling clearer&lt;/li&gt;
&lt;li&gt;Is “harder” to implement initially, but pays dividends later&lt;/li&gt;
&lt;li&gt;Has a single point of failure (the message broker)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting tradeoff between these two approaches is that you want to reach for the light, agile option (choreography) in the early days to avoid over-architecting your project, but counterintuitively, orchestration is often easier to build when you use it from the start. &lt;/p&gt;

&lt;h3&gt;
  
  
  So, what does Temporal do?
&lt;/h3&gt;

&lt;p&gt;Temporal uses orchestration under the hood (you won’t have to implement it yourself), &lt;strong&gt;&lt;em&gt;but&lt;/em&gt; &lt;em&gt;also&lt;/em&gt;&lt;/strong&gt; avoids that crucial drawback of a single point of failure. How is such a thing possible? Internally, Temporal records your program’s progress in a log. If that message broker were to go offline, your entire program’s history will have been saved so that another machine can start up exactly where your program left off, as if nothing happened. This makes Temporal completely horizontally scalable. &lt;/p&gt;

&lt;p&gt;To bring this idea back to the saga pattern, an important component of the saga pattern is driving towards completion of all the steps of the saga. The fact that Temporal ensures no progress will ever be lost means it will pick up exactly where it left off “no matter what”, including failures for an unknown length of time, completing the saga with no extra code or heavy lifting on your part.&lt;/p&gt;

&lt;p&gt;Additionally, unlike some orchestration engines, in Temporal, the logic of your workflow is expressed entirely in code, so you don’t have to deal with JSON. In essence, nothing additional is needed to make a robust, failure-resilient application other than the business logic of your application itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Choreography and orchestration provide different approaches to coordinating communication between microservices. Choreography is decoupled but can make debugging and control flow difficult to follow. Orchestration is centralized, but results in a single point of failure. Temporal uses orchestration under the covers, &lt;em&gt;but&lt;/em&gt; by design safeguards against a single point of failure, allowing you to focus on writing your code with the confidence that it is failure-resilient.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>microservices</category>
      <category>designpatterns</category>
      <category>sagas</category>
    </item>
    <item>
      <title>25 Key Terms for Speaking Distributed Systems and Temporal (an emoji-based guide)</title>
      <dc:creator>Emily Fortuna</dc:creator>
      <pubDate>Thu, 29 Jun 2023 18:55:20 +0000</pubDate>
      <link>https://dev.to/temporalio/25-key-terms-for-speaking-distributed-systems-and-temporal-an-emoji-based-guide-1cp9</link>
      <guid>https://dev.to/temporalio/25-key-terms-for-speaking-distributed-systems-and-temporal-an-emoji-based-guide-1cp9</guid>
      <description>&lt;p&gt;So you want to keep up with all the cool kids throwing around terms like “&lt;a href="https://docs.temporal.io/clusters#multi-cluster-replication"&gt;multi-cluster replication&lt;/a&gt;” but you don’t have time to read several textbooks. This handy quick-reference will give you the framework for following (and participating in!) conversations involving distributed systems or Temporal with ease. At the next dinner party you’ll be able to win friends and influence people with your ability to explain distributed systems succinctly in plain English… because we all know you’re&lt;a href="https://simpsons.fandom.com/wiki/You_Don%27t_Win_Friends_with_Salad"&gt; not gonna do it with salad&lt;/a&gt;. This guide builds upon itself, with terms requiring no additional context first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eRpNTDEK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gr7en1ario6hn05xe87r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eRpNTDEK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gr7en1ario6hn05xe87r.jpg" alt="An astronaut floating in space, attached to a book as if it is the oxygen supply" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Distributed Systems Terms To Know
&lt;/h2&gt;

&lt;h4&gt;
  
  
  concurrency   ↠
&lt;/h4&gt;

&lt;p&gt; Roughly, the idea of running multiple things at once. Two people eating dinner at the same time (“in parallel”) are eating concurrently. Your operating system context switching between a web browser and IDE is also a form of concurrency. &lt;/p&gt;

&lt;h4&gt;
  
  
  scalability   📈
&lt;/h4&gt;

&lt;p&gt; The ability for a system, such as a website, to accommodate a growing number of requests or work. You can improve scalability by finding places where work can be executed simultaneously or removing performance bottlenecks.&lt;/p&gt;

&lt;h4&gt;
  
  
  reliability   ✅
&lt;/h4&gt;

&lt;p&gt; The likelihood of a system to run without failure for a period of time. Systems can be made more reliable by reducing single points of failure, and detecting failures quickly.&lt;/p&gt;

&lt;h4&gt;
  
  
  eventual consistency   🐌
&lt;/h4&gt;

&lt;p&gt; Let’s say you’ve replicated a database to improve reliability (and possibly scalability). Great! Eventual consistency says a change in the data in one location will &lt;em&gt;eventually&lt;/em&gt; be updated in every location that the database lives; however, until every location is updated, a read from one of the locations may not &lt;em&gt;yet&lt;/em&gt; have the updated value. You, the programmer, need to bear in mind you may not always have the most up-to-date data when working under this model.&lt;/p&gt;
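&lt;p&gt; A toy sketch of that stale-read window, in plain Go. The &lt;code&gt;store&lt;/code&gt; type and its methods are made up for this illustration; real replication is asynchronous and far more involved, so here the "eventually" is just an explicit call to &lt;code&gt;replicate&lt;/code&gt;.&lt;/p&gt;

```go
package main

import "fmt"

// store is a toy replicated value with one copy per replica. Writes land on
// replica 0 first and reach the others only when replicate runs.
type store struct {
	replicas []string
}

func (s *store) write(v string) { s.replicas[0] = v }

func (s *store) read(replica int) string { return s.replicas[replica] }

// replicate copies the newest value to every other replica.
func (s *store) replicate() {
	for i := range s.replicas {
		s.replicas[i] = s.replicas[0]
	}
}

func main() {
	s := store{replicas: []string{"v1", "v1", "v1"}}
	s.write("v2")
	fmt.Println(s.read(0), s.read(2)) // v2 v1: replica 2 is stale
	s.replicate()
	fmt.Println(s.read(0), s.read(2)) // v2 v2: the replicas have converged
}
```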

&lt;h4&gt;
  
  
  strong consistency   💪
&lt;/h4&gt;

&lt;p&gt; The guarantee that a data store will always provide the most up-to-date value.&lt;/p&gt;

&lt;h4&gt;
  
  
  CAP Theorem   🧢
&lt;/h4&gt;

&lt;p&gt; The rule that you gotta pick two of the three: &lt;em&gt;(strong)&lt;/em&gt; &lt;em&gt;consistency&lt;/em&gt;, &lt;em&gt;availability&lt;/em&gt;, &lt;em&gt;partition tolerance&lt;/em&gt;. Any distributed data store can provide at most two of these qualities, alas. &lt;em&gt;Availability&lt;/em&gt; means that every request receives a non-error response. &lt;em&gt;Partition tolerance&lt;/em&gt; is the ability for a system to continue to operate despite requests between data store nodes being delayed or dropped. See also strong consistency and eventual consistency.&lt;/p&gt;

&lt;h4&gt;
  
  
  ACID   🧪
&lt;/h4&gt;

&lt;p&gt; Hardcore-sounding acronym borrowed from databases that stands for &lt;em&gt;atomicity&lt;/em&gt;, &lt;em&gt;consistency&lt;/em&gt;, &lt;em&gt;isolation&lt;/em&gt;, &lt;em&gt;durability&lt;/em&gt;. See strong consistency, eventual consistency, and other deets below.&lt;/p&gt;

&lt;h4&gt;
  
  
  atomicity   ⚛️
&lt;/h4&gt;

&lt;p&gt; Executing a sequence of operations all together as if they were a single unit, or not at all. &lt;/p&gt;

&lt;h4&gt;
  
  
  isolation   📦
&lt;/h4&gt;

&lt;p&gt; Executing a sequence of operations concurrently with another sequence has the same effect as executing the two sequences one after the other.&lt;/p&gt;

&lt;h4&gt;
  
  
  durability   🗿
&lt;/h4&gt;

&lt;p&gt; Think long-lasting. Standing the test of time. Persisting—i.e. written to disk, or if you were really hardcore, etched on a stone tablet—at which point it can be looked up even in the face of system failure such as a power outage or crash.&lt;/p&gt;

&lt;h4&gt;
  
  
  durable execution   🔜
&lt;/h4&gt;

&lt;p&gt; Similar to &lt;em&gt;durability&lt;/em&gt;: once a program has started executing, it will &lt;em&gt;continue&lt;/em&gt; executing to completion. This is achieved by persisting every step the program takes, so that execution can be continued by another process if the current process dies. &lt;/p&gt;

&lt;h4&gt;
  
  
  idempotent function   🥪
&lt;/h4&gt;

&lt;p&gt; Scary-sounding word, less scary meaning: a function that has the same observed result when called with the same inputs, whether it is called one time or many times.&lt;/p&gt;

&lt;p&gt; A function setting some field &lt;code&gt;foo=3&lt;/code&gt;? Idempotent. The function &lt;code&gt;foo += 3&lt;/code&gt;? Not idempotent, because the value of &lt;code&gt;foo&lt;/code&gt; is dependent on the number of times your function is called. Naive implementations of functions that transfer money or send emails are also not idempotent by &lt;em&gt;default&lt;/em&gt;. &lt;/p&gt;
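
&lt;p&gt; As a rough illustration of the difference, here is a minimal Go sketch. The &lt;code&gt;Counter&lt;/code&gt; type, its methods, and the request ID are hypothetical names made up for this example. Remembering which request IDs were already applied is one common way to make an operation idempotent, not the only one.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import "fmt"

// Counter stands in for some stored state (illustrative only).
type Counter struct {
    foo     int
    applied map[string]bool // request IDs that were already handled
}

// SetFoo is idempotent: one call or many calls leave foo at 3.
func (c *Counter) SetFoo() { c.foo = 3 }

// AddToFoo is not idempotent: every call changes the result.
func (c *Counter) AddToFoo() { c.foo += 3 }

// AddToFooOnce makes the increment idempotent by skipping any
// request ID it has already applied.
func (c *Counter) AddToFooOnce(requestID string) {
    if c.applied[requestID] {
        return
    }
    c.applied[requestID] = true
    c.foo += 3
}

func main() {
    c := Counter{applied: map[string]bool{}}

    c.SetFoo()
    c.SetFoo()
    fmt.Println(c.foo) // 3

    c.AddToFoo()
    c.AddToFoo()
    fmt.Println(c.foo) // 9

    c.AddToFooOnce("req-1")
    c.AddToFooOnce("req-1") // a retry of the same request is ignored
    fmt.Println(c.foo)      // 12
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt; Calling &lt;code&gt;SetFoo&lt;/code&gt; twice is harmless, calling &lt;code&gt;AddToFoo&lt;/code&gt; twice is not, and &lt;code&gt;AddToFooOnce&lt;/code&gt; tolerates a retry because the second call with the same ID does nothing.&lt;/p&gt;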

&lt;h4&gt;
  
  
  deterministic function   🧮
&lt;/h4&gt;

&lt;p&gt; Code that always has the same effect/output when given a particular input. Things that are &lt;em&gt;not&lt;/em&gt; deterministic use some external state such as user input, a random number, or stored data. Code that reads or writes to a variable that other code can also modify simultaneously is also &lt;em&gt;not&lt;/em&gt; deterministic.&lt;/p&gt;

&lt;h4&gt;
  
  
  platform   💻
&lt;/h4&gt;

&lt;p&gt; Windows, iOS, Docker, and VMware are all platforms. They’re execution environments that define how programs behave inside them. Temporal is also a platform: code run with Temporal is resilient to failures and timeouts. You may see the term &lt;em&gt;platform-level&lt;/em&gt; used in relation to &lt;a href="https://temporal.io/blog/failure-handling-in-practice"&gt;failures&lt;/a&gt;. Platform-level failures are caused by low-level issues such as network errors or process crashes.&lt;/p&gt;

&lt;h4&gt;
  
  
  application   〉
&lt;/h4&gt;

&lt;p&gt; The code you write. You may see the term &lt;em&gt;application-level&lt;/em&gt; used in relation to &lt;a href="https://temporal.io/blog/failure-handling-in-practice"&gt;failures&lt;/a&gt;. Application-level failures are domain-specific failures like “insufficient inventory”, or “user canceled ride request.”&lt;/p&gt;

&lt;h4&gt;
  
  
  event sourcing   🎤
&lt;/h4&gt;

&lt;p&gt; A design pattern that creates event objects for every state change in a system, and records this sequence of events in a log (or event history). Temporal uses event sourcing “under the hood” to ensure failure resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temporal-Specific Terms
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Temporal   ✧
&lt;/h4&gt;

&lt;p&gt; A way to run your code, made up of a service and a library that work together, that ensures your code never gets stuck in failure at the application level or the platform level. While libraries like &lt;a href="https://www.npmjs.com/package/async-retry"&gt;async-retry&lt;/a&gt; take care of retry logic for functions that fail, what happens if your code making that library call crashes? Temporal says “we gotchu.” It abstracts away complex concepts around retries, rollbacks, queues, state machines, and timers, so that no matter where the failure happens, we’ll ensure your code keeps running the way you want. &lt;/p&gt;

&lt;h4&gt;
  
  
  Worker   👷
&lt;/h4&gt;

&lt;p&gt; The process that’s actually &lt;em&gt;doing the work&lt;/em&gt; executing all of your Temporal code (the Workflow and Activities). Capitalized here to denote the Temporal-specific concept of a Worker, to differentiate from the generic idea of a worker process.&lt;/p&gt;

&lt;h4&gt;
  
  
  Workflow   📖
&lt;/h4&gt;

&lt;p&gt; The high-level business logic of your program. Essentially, this is where the logic of your application begins. (&lt;em&gt;Technically&lt;/em&gt; execution starts with the Worker, and the Worker runs the Workflow code.) All Workflow logic must be deterministic.&lt;/p&gt;

&lt;h4&gt;
  
  
  Activity   💾
&lt;/h4&gt;

&lt;p&gt; Components of your Workflow that might fail, like network or file system calls, inventory holds, or credit card charges. The decision around how &lt;em&gt;many&lt;/em&gt; Activities your program should have–whether you make a separate Activity for every non-deterministic call or put the entire rest of your program in an Activity (don’t do that)–is generally a function of how you’d like your program to behave when retrying a failure. For example, if a downstream instruction should always grab the very freshest data when retrying, those instructions should be grouped together in a single Activity. If you can retry with the old data, they can be in separate Activities. Since Activities can be retried, they should be idempotent.&lt;/p&gt;

&lt;h4&gt;
  
  
  Query   🙋
&lt;/h4&gt;

&lt;p&gt; A way to inspect the state of a Workflow. The results are guaranteed to show the most recent state.&lt;/p&gt;

&lt;h4&gt;
  
  
  Signal   🧑‍🏫
&lt;/h4&gt;

&lt;p&gt; A way to notify or send information to a Workflow. A common use case is notifying a Workflow that the user added items to their shopping cart.&lt;/p&gt;

&lt;h4&gt;
  
  
  retry   🔄
&lt;/h4&gt;

&lt;p&gt; Generally, re-executing an Activity that has failed. Technically, Workflows can also be retried, but Workflow retries are &lt;em&gt;far&lt;/em&gt; less common. One example is a developer attempting to update Workflow code running in production.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cluster   🏘️
&lt;/h4&gt;

&lt;p&gt; The collection of services and databases that make failure and timeout resilience possible. You might sometimes see this colloquially called the Temporal Server.&lt;/p&gt;

&lt;h4&gt;
  
  
  History   🗃️
&lt;/h4&gt;

&lt;p&gt; A log of events that happened over the course of execution. This log contains attempts to run Activities, Workflow status changes (started, failed, scheduled, etc), timer events, and external information signaled to the system during the run.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Closing
&lt;/h2&gt;

&lt;p&gt;Knowing this core set of 25 terms should give you sufficient lay-of-the-land to sling references like &lt;em&gt;ACID&lt;/em&gt; and &lt;em&gt;Workflow&lt;/em&gt; in conversations with coworkers, friends, and family with ease! Better yet, you now know enough to dive deeper into subdomains of interest. If you’d like to try out these terms in practice, check out our &lt;a href="https://learn.temporal.io/"&gt;getting started guides, courses, and examples in Go, Java, Python, PHP, and TypeScript&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>microservices</category>
      <category>learning</category>
      <category>backend</category>
    </item>
    <item>
      <title>Tuning Temporal Server request latency on Kubernetes</title>
      <dc:creator>Rob Holland</dc:creator>
      <pubDate>Thu, 15 Jun 2023 16:44:17 +0000</pubDate>
      <link>https://dev.to/temporalio/tuning-temporal-server-request-latency-on-kubernetes-20np</link>
      <guid>https://dev.to/temporalio/tuning-temporal-server-request-latency-on-kubernetes-20np</guid>
      <description>&lt;p&gt;Request latency is an important indicator for the performance of Temporal Server. Temporal Cloud can offer reliably low request latencies, thanks to its custom persistence backend and expertly managed Temporal Server infrastructure. In this post, we’ll give you some tips for getting lower and more predictable request latencies, and making more efficient use of your nodes, when deploying a self-hosted Temporal Server on Kubernetes.&lt;/p&gt;

&lt;p&gt;When evaluating the performance of a Temporal Server deployment, we begin by looking at metrics for the request latencies your application, or workers, observe when communicating with Temporal Server. In order for the system as a whole to run efficiently and reliably, requests must be handled with consistent, low latencies. Low latencies allow us to get high throughput, and stable latencies avoid unexpected slowdowns in our application and allow us to monitor for performance degradation without triggering false alerts.&lt;/p&gt;

&lt;p&gt;For this post, we’ll use the &lt;a href="https://docs.temporal.io/clusters#history-service"&gt;History&lt;/a&gt; service as our example, which is the service responsible for handling calls to start a new workflow execution, or to update a workflow’s state (history) as it makes progress. None of these tips are specific to the History service—most of them can be applied to all the &lt;a href="https://docs.temporal.io/clusters#temporal-server"&gt;Temporal Server services&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The curious case of the unexpected throttling
&lt;/h2&gt;

&lt;p&gt;Generally, Kubernetes deployments will set CPU limits on containers to stop them from consuming too much CPU and starving other containers running on the same node. This is enforced using a mechanism called &lt;a href="https://medium.com/@ramandumcs/cpu-throttling-unbundled-eae883e7e494"&gt;CPU throttling&lt;/a&gt;. Kubernetes converts the CPU limit you set on the container into a limit on CPU cycles per 1/10th second. If the container tries to use more than this limit, it is “throttled”, which means its execution is delayed. This can have a non-trivial impact on the performance of containers, as it can increase request latency. This is particularly true for requests requiring CPU-intensive tasks, such as obtaining locks.&lt;/p&gt;

&lt;p&gt;For monitoring the Kubernetes clusters in our Scaling series (&lt;a href="https://dev.to/temporalio/scaling-temporal-the-basics-31l5"&gt;first post here&lt;/a&gt;) we use the &lt;a href="https://github.com/prometheus-operator/kube-prometheus#readme"&gt;&lt;code&gt;kube-prometheus&lt;/code&gt;&lt;/a&gt; stack.&lt;/p&gt;

&lt;p&gt;In contrast to the 1/10th second used to manage CPU throttling, the Prometheus system uses an interval of 15 seconds or more between scrapes of aggregated CPU metrics. The large difference in intervals between the throttling period and the monitoring scraping interval means that CPU throttling can be occurring even if CPU usage metrics are reporting a long way under 100% usage. For this reason, it’s important to monitor CPU throttling specifically.&lt;/p&gt;

&lt;p&gt;Here is an example for the History service:&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/3owwiacgyKBDDD52yfUhLd/0a66b3e514929ae4a57bd96d4829804d/CPU_Throttling-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/3owwiacgyKBDDD52yfUhLd/0a66b3e514929ae4a57bd96d4829804d/CPU_Throttling-mh.png" alt="History Service Dashboard: CPU is being throttled despite low CPU usage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We can see from the dashboard that although the history pods’ CPU usage is reporting below 60%, it is being throttled.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;kube-prometheus&lt;/code&gt; setups, you can use this Prometheus query to check for CPU throttling, adjusting the &lt;code&gt;namespace&lt;/code&gt; and &lt;code&gt;workload&lt;/code&gt; selectors as appropriate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(
    increase(container_cpu_cfs_throttled_periods_total{job="kubelet", metrics_path="/metrics/cadvisor", container!=""}[$__rate_interval])
    * on(namespace,pod)
    group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{namespace="temporal", workload="temporal-history"}
)
/
sum(
    increase(container_cpu_cfs_periods_total{job="kubelet", metrics_path="/metrics/cadvisor", container!=""}[$__rate_interval])
    * on(namespace,pod)
    group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{namespace="temporal", workload="temporal-history"}
) &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, how can we fix the throttling? Later we’ll discuss why you should probably stop using CPU limits entirely, but for now, as Temporal Server is written in Go, there is something else we can do to improve latencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  GOMAXPROCS in Kubernetes
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;GOMAXPROCS&lt;/code&gt; is a runtime setting for Go that limits how many operating system threads can execute Go code simultaneously. By default, Go will assume that it can run one such thread for each core on the machine it’s running on, giving it a high level of concurrency.&lt;/p&gt;

&lt;p&gt;On a Kubernetes cluster, however, containers will generally not be allowed to use the majority of the cores on a node, due to CPU limits. This mismatch means that Go will make bad decisions about how many threads to run, leading to inefficient CPU usage. It will (among other things) have to run garbage collection and other housekeeping tasks on CPU cores that it isn’t able to use for any useful amount of real work. As an example: on our Kubernetes cluster, the nodes have 8 cores, but our history pods are limited to 2 cores. This means they may run up to 8 threads at once, but across those 8 only be able to use a total of 2 cores' share of cycles in every throttling period. It then becomes easy for the container’s threads to starve each other of allowed CPU cycles. &lt;/p&gt;
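
&lt;p&gt;To put rough numbers on this, the sketch below works through the arithmetic for the example above. It assumes the 1/10th second (100ms) throttling period mentioned earlier and a worst case where every thread is busy for the whole time. The function names are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "fmt"
    "time"
)

// quota returns the CPU time a container may use per throttling period.
func quota(limitCores int, period time.Duration) time.Duration {
    return time.Duration(limitCores) * period
}

// runnableFor returns how much wall-clock time passes before the quota
// is spent, if busyThreads threads all run flat out on separate cores.
func runnableFor(q time.Duration, busyThreads int) time.Duration {
    return q / time.Duration(busyThreads)
}

func main() {
    period := 100 * time.Millisecond
    q := quota(2, period) // CPU limit of 2 cores

    fmt.Println(q)                 // 200ms
    fmt.Println(runnableFor(q, 8)) // 25ms
    fmt.Println(runnableFor(q, 2)) // 100ms
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under these assumptions, 8 busy threads use up the allowance after 25ms and the container is throttled for the remaining 75ms of the period, whereas 2 busy threads can run for the whole period without being throttled.&lt;/p&gt;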

&lt;p&gt;To fix this, we can let Go know how many cores it’s allowed to use by setting the &lt;code&gt;GOMAXPROCS&lt;/code&gt; environment variable to match our CPU limit. Note: &lt;code&gt;GOMAXPROCS&lt;/code&gt; must be an integer, so you should set it to the number of whole cores you set in the limit. Let’s see what happens when we set &lt;code&gt;GOMAXPROCS&lt;/code&gt; on our deployments:&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/3cwcbRAng6gzTZ2c4XpDL9/12585d12d60dff2258ec3d86cfd13e94/GOMAXPROCS-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/3cwcbRAng6gzTZ2c4XpDL9/12585d12d60dff2258ec3d86cfd13e94/GOMAXPROCS-mh.png" alt="History Dashboard: Showing reduced CPU usage and lower request latency after setting GOMAXPROCS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left of the graphs, you can see the performance with the default &lt;code&gt;GOMAXPROCS&lt;/code&gt; setting. Towards the right, you can see the results of setting the &lt;code&gt;GOMAXPROCS&lt;/code&gt; environment variable to “2”, letting Go know it should run at most 2 threads at once. CPU throttling has disappeared entirely, which has helped make our latency more stable. We can also see that because Go can make better decisions about how many threads to run, our CPU usage has lowered, even though performance has actually improved slightly (request latency has lowered). Here, you can see how the CPU across all Temporal services drops after adjusting &lt;code&gt;GOMAXPROCS&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/0uuz8ydxyd9p/4sVnSDGkT9lh2P4w2xy6n8/7f2233d25779c520992a5be4809ff8f3/Resources_-_Temporal_-_Dashboards_-_Grafana-mh.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/0uuz8ydxyd9p/4sVnSDGkT9lh2P4w2xy6n8/7f2233d25779c520992a5be4809ff8f3/Resources_-_Temporal_-_Dashboards_-_Grafana-mh.png" alt="Resource Dashboard: Showing reduced CPU by all Temporal Server services after setting GOMAXPROCS"&gt;&lt;/a&gt;&lt;/p&gt;
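
&lt;p&gt;For reference, setting the variable on a deployment might look roughly like the fragment below. This is a sketch of a pod template only, not a complete manifest. The container name is an assumption based on the workload name used in the query above, and the CPU values mirror the 2-core example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fragment of a Deployment pod template (illustrative names and values)
containers:
  - name: temporal-history
    env:
      - name: GOMAXPROCS
        value: "2"   # whole cores, matching the CPU limit below
    resources:
      requests:
        cpu: "2"
      limits:
        cpu: "2"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The value is quoted because environment variable values in a pod spec are strings.&lt;/p&gt;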

&lt;p&gt;To help give a better experience out of the box, from release 1.21.0 onwards, Temporal will automatically set &lt;code&gt;GOMAXPROCS&lt;/code&gt; to match Kubernetes CPU limits if they are present and the &lt;code&gt;GOMAXPROCS&lt;/code&gt; environment variable is not already set. Before that release, you should manually set the &lt;code&gt;GOMAXPROCS&lt;/code&gt; environment variable for your Temporal Cluster deployments. Also note that &lt;code&gt;GOMAXPROCS&lt;/code&gt; will not automatically be set based on &lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container"&gt;CPU requests&lt;/a&gt;, only limits. If you are not using CPU limits, you should set &lt;code&gt;GOMAXPROCS&lt;/code&gt; manually to a value close to (equal to or slightly greater than) your CPU request. This allows Go to make good decisions about CPU efficiency, taking your CPU requests into consideration.&lt;/p&gt;

&lt;p&gt;Which brings us nicely to our second suggestion…&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU limits probably do more harm than good
&lt;/h2&gt;

&lt;p&gt;Now that we’ve improved the efficiency of our CPU usage, I’m going to echo the &lt;a href="https://twitter.com/thockin/status/1134193838841401345?s=20"&gt;sentiment of Tim Hockin&lt;/a&gt; (of Kubernetes fame) and &lt;a href="https://home.robusta.dev/blog/stop-using-cpu-limits"&gt;many&lt;/a&gt; &lt;a href="https://medium.com/directeam/kubernetes-resources-under-the-hood-part-3-6ee7d6015965"&gt;others&lt;/a&gt; and suggest that you stop using CPU limits entirely. CPU requests should be closely monitored to ensure you are requesting a sensible amount of CPU for your containers, so that Kubernetes can make good decisions about how many pods it assigns to a node. This allows containers that are having a CPU burst to make use of any spare CPU on the node. Make sure to monitor node CPU usage as well—frequently running out of CPU on the node tells you that pods are bursting more often than your requests allow for, and you should re-examine their CPU requests.&lt;/p&gt;

&lt;p&gt;If you can’t disable limits entirely because they enforce some business requirements (customer isolation, for example), then consider dedicating some nodes to the Temporal Cluster and using &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/"&gt;taints and tolerations&lt;/a&gt; to pin the deployments to those nodes. This allows you to remove CPU limits from your Temporal Cluster deployments while leaving them in place for your other workloads.&lt;/p&gt;
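
&lt;p&gt;As a sketch of what that could look like, the fragment below tolerates a taint on the dedicated nodes and drops the CPU limit. The &lt;code&gt;dedicated=temporal&lt;/code&gt; key and value, the container name, and the CPU figures are all assumptions for illustration. A toleration only allows pods onto tainted nodes, so a &lt;code&gt;nodeSelector&lt;/code&gt; on a matching node label is added here to keep them there.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Assumes the dedicated nodes were tainted and labeled beforehand, e.g.:
#   kubectl taint nodes NODE_NAME dedicated=temporal:NoSchedule
#   kubectl label nodes NODE_NAME dedicated=temporal
spec:
  template:
    spec:
      tolerations:
        - key: dedicated
          operator: Equal
          value: temporal
          effect: NoSchedule
      nodeSelector:
        dedicated: temporal
      containers:
        - name: temporal-history
          env:
            - name: GOMAXPROCS
              value: "2"   # set manually, since there is no CPU limit
          resources:
            requests:
              cpu: "2"     # request only, no limits block
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;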

&lt;h2&gt;
  
  
  Avoiding increased latency from re-balancing during Temporal upgrades
&lt;/h2&gt;

&lt;p&gt;Temporal Server’s &lt;a href="https://docs.temporal.io/clusters#history-service"&gt;History&lt;/a&gt; service automatically balances history shards across the available history pods; this is what allows Temporal Cluster to scale horizontally. &lt;em&gt;Note: Although we use the term balance here, Temporal does not guarantee that there will be an equal number of shards on each pod.&lt;/em&gt; The History service will rebalance shards every time a history pod is added or removed, and this process can take a while to settle. Depending on the scale of your cluster, this rebalancing can increase the latency for requests, as a shard cannot be written to while it is being reassigned to a new history pod. The effect of this will vary depending on what percentage of shards each of the pods is responsible for. The fewer pods you have, the greater the effect on latency when they are added/removed.&lt;/p&gt;

&lt;p&gt;The latency spike during a rollout can be mitigated in two ways, depending on the number of history pods you have:&lt;/p&gt;

&lt;p&gt;If you have more than 10 pods, the best option will be to do rollouts slowly, ideally one pod at a time. You can use low values for &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge"&gt;maxSurge&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable"&gt;maxUnavailable&lt;/a&gt; to ensure pods are rotated slowly. Using &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#min-ready-seconds"&gt;minReadySeconds&lt;/a&gt;, or a &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/"&gt;startupProbe&lt;/a&gt; with initialDelaySeconds, can give Temporal Server time to rebalance as each pod is added.&lt;/p&gt;
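
&lt;p&gt;A slow, one-pod-at-a-time rollout could be expressed along these lines. This is a sketch of the relevant Deployment fields only. The 30 second value is a placeholder rather than a recommendation, so tune it to how long rebalancing takes to settle on your cluster.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fragment of a Deployment spec (illustrative values)
spec:
  minReadySeconds: 30      # wait before a new pod counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # add at most one extra pod at a time
      maxUnavailable: 0    # never remove a pod before its replacement is ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;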

&lt;p&gt;If you have fewer than 10 pods, it’s better to rotate pods quickly so that rebalancing can settle quickly. You will see latency spikes for each change, but the overall impact will be lower. You can experiment with the &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge"&gt;maxSurge&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable"&gt;maxUnavailable&lt;/a&gt; settings to allow Kubernetes to roll out more pods at the same time. The defaults are 25% for each, which for 4 pods would mean only 1 pod will be rotated at once. Your mileage will vary based on scale and load, but we’ve had good success with 50% for maxSurge/maxUnavailable on low (4 or fewer) pod counts.&lt;/p&gt;
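
&lt;p&gt;For a small deployment, the faster strategy might be written as in the fragment below. It is a sketch using the 50% figure mentioned above, with an assumed replica count of 4, where 50% works out to 2 pods for each setting.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fragment of a Deployment spec (illustrative values)
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%          # up to 2 extra pods during the rollout
      maxUnavailable: 50%    # up to 2 pods may be unavailable at once
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;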

&lt;p&gt;Pull-based monitoring systems such as Prometheus use a discovery mechanism to find pods to scrape for metrics. As there is a delay between a pod being started and Prometheus being aware of it, the pod may not be scraped for a few intervals after starting up. This means metrics can report inaccurate values during a deployment, until all the new pods are being scraped.&lt;/p&gt;

&lt;p&gt;For this reason, it’s best to ensure you are not using metrics that are emitted by the History service when evaluating History deployment strategies. Instead, SDK metrics such as &lt;code&gt;StartWorkflowExecution&lt;/code&gt; request latency are a good fit here. Frontend metrics can also be useful, as long as the Frontend service is not being rolled out at the same time as the History service.&lt;/p&gt;

&lt;p&gt;These same deployment strategies are also useful for the &lt;a href="https://docs.temporal.io/clusters#matching-service"&gt;Matching&lt;/a&gt; service, which balances task queue partitions across matching pods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this post we’ve discussed CPU throttling, CPU limits, and the effect of rebalancing during Temporal upgrades/rollouts. Hopefully, these tips will help you save some money on resources, by using less CPU, and improve the performance and reliability of your self-hosted Temporal Cluster.&lt;/p&gt;

&lt;p&gt;We hope you’ve found this useful. We’d love to discuss it further or answer any questions you might have. Please reach out with any questions or comments on the &lt;a href="https://community.temporal.io/"&gt;Community Forum&lt;/a&gt; or &lt;a href="https://t.mp/slack"&gt;Slack&lt;/a&gt;. My name is Rob Holland; feel free to reach out to me directly on &lt;a href="https://t.mp/slack"&gt;Temporal’s Slack&lt;/a&gt; if you like. I would love to hear from you. You can also follow us on &lt;a href="https://twitter.com/temporalio"&gt;Twitter&lt;/a&gt; if you’d like more of this kind of content.&lt;/p&gt;

</description>
      <category>temporal</category>
      <category>docker</category>
      <category>kubernetes</category>
      <category>go</category>
    </item>
  </channel>
</rss>
