DEV Community: Temporal

Someone was wrong on the internet, so I built a Temporal Docker extension

Shy Ruparel — Thu, 18 Jun 2026 14:00:00 +0000

Naturally, I took that personally, remembered I used to teach Docker Extensions to devs, forgot just enough of the rules to make the first version wrong, and eventually built the thing properly.

A funny thing happened during a Temporal community session a few months back.

Someone said that building a Temporal Docker Desktop extension probably was not possible.

This was offered in a perfectly reasonable, community-minded way. Nobody was being dramatic. Nobody was throwing down a gauntlet. It was just one of those "yeah, I am not sure the platform can do that" comments that usually pass peacefully through a conversation and disappear.

Unfortunately, I heard it.

And my brain did the deeply healthy thing where it whispered: someone is wrong on the internet.

This is not always my most productive mode. Sometimes it leads to unnecessary tabs. Sometimes it leads to me reading specifications at hours when no one should be reading specifications. Occasionally, though, it leads to something useful.

In this case, it led to the Temporal Docker Desktop extension (https://github.com/temporal-community/docker-extension).

I had some history here

The comment landed extra funny for me because I was the sole Developer Advocate at Docker when Docker Extensions launched. I helped teach the feature. I wrote some of the original tutorials for building them: Build Your First Docker Extension.

So when someone said a Temporal extension might not be possible, my reaction was not exactly neutral.

Selfie of Shy Ruparel with a co-MC Kat Speer for Dockercon 2022

It was more like: wait, I know this system.

Getting started with Docker

Or at least, I knew this system four years ago, which is a very different flavor of knowing.

Four years is plenty of time for APIs to shift, design rules to get sharper, product expectations to become more real, and for me to forget the part of the docs that would have saved me from my first idea.

DockerCon 2022 stream still of Shy Ruparel teaching Getting Started with Docker workshop

Naturally, my first idea was an iframe.

The first version was too clever

The earliest version of the extension embedded the Temporal UI directly inside Docker Desktop.

You can see why this was tempting. Temporal already has a good UI. Docker Desktop extensions can render UI. My brain connected those two dots with the confidence of a person who had not yet reread the current extension design guidance.

It worked.

And honestly, I was ready to ship it.

This is not one of those stories where I had a sudden, elegant product realization while staring thoughtfully out a window. The iframe version did the thing I wanted it to do, and part of me was delighted by how direct it was.

Then I got to the part where I needed to think about actually submitting it.

That is when I went back to the Docker docs, reread the extension design guidelines, and realized the clever version was probably going to get rejected. The Docker guidance is pretty clear about making extensions feel native to Docker Desktop, using the Docker theme, respecting the surrounding UI, and giving users clear actions instead of dumping an entire product surface into a panel.

Annoyingly, the docs were right.

Docs love doing that.

So I rebuilt it.

The better shape was not "Temporal UI, but trapped inside Docker Desktop." It was much smaller and more useful:

Start Temporal locally.

Know that it is running.
See the important endpoints.
Confirm the server is ready.
Get a quick glance at recent workflow activity.
Jump into the real Temporal UI when it is time to inspect anything seriously.

That last point matters.

The Temporal UI is already the right place to understand workflow history, event details, retries, task queues, and all the other machinery that makes Temporal useful. Rebuilding a miniature version of that inside Docker Desktop would have been worse for everyone, including me, and especially future me.

What the extension actually does

The final extension is intentionally boring in the best way.

It does the same thing as the temporal server start-dev command we know and love, but it turns that terminal prompt into a button.

You open Docker Desktop, click Temporal, and click Start Server.

That is it.

You get a local Temporal server running from temporalio/temporal:latest, using server start-dev under the hood. The extension starts a temporal-dev container, exposes gRPC on localhost:7233, exposes Temporal Web on localhost:8233, and persists data in a Docker volume called temporal-dev-data.

The server uses SQLite at /data/temporal.db, because local development should not require a distributed systems ceremony just to learn a distributed systems tool.

There is a readiness panel that checks the pieces people actually care about:

Is the server running?
Is the gRPC endpoint ready?
Is the Web UI ready?
Is persistent data mounted?

There is also a worker status section, because once you start running workflows, the next natural question is: "Is anything polling this task queue, or am I just confidently yelling into the void?"

The extension lists recent workflows, but deliberately does not pretend to be the full workflow inspection experience. Each workflow row deep-links into Temporal UI. The header has quick links for running workflows and failed workflows. The big button says Open Temporal UI, because the real product decision here is not "hide Temporal UI inside Docker Desktop."

It is "make getting to Temporal UI feel immediate."

The extension is the front door. Temporal UI is the house.

The reset button exists because local dev is local dev

There is also a Clear History button.

This is one of those features that sounds scary until you remember what local development is like.

Eventually you will start workflows with names like test-thing-final-2-for-real, point a worker at the wrong queue, restart something halfway through, or create enough experimental history that the most merciful thing you can do is throw the local database into the sea.

The extension lets you reset the local SQLite-backed Temporal data volume and restart cleanly. It asks you to type RESET, because deleting workflow history should involve at least one tiny moment of reflection.

Not a big one. We are not writing a novel about it.

Just enough friction to avoid "oops."

The UI became a real UI eventually

I should also admit something: I am not a frontend engineer.

Early extension UI screenshot

My original Docker extension tutorial UI looked exactly like what happens when a backend person is handed a component library and told, "Just make it usable." It was usable. It had buttons. Some of them were even in places.

Polished extension UI screenshot

This one looks better because I had Codex helping with the frontend pass, and because I have grown as a person.

Mostly the first thing.

The UI pass ended up being less about making it pretty and more about making it honest.

At one point, the troubleshooting logs lived as a footer below the workflow list. This seemed fine until the workflow list had actual workflows in it, at which point the footer was just stealing useful vertical space like a tiny real-estate speculator. So the logs moved into a Troubleshoot Startup dialog. They are there when you need them, out of the way when you do not.

The source code link started in the header, where it looked much more important than it was. That moved under Clear History, where it belongs: available, visible, but not shouting over the primary job of the extension.

The server status and server readiness panels started as separate boxes. They were redundant, so they became one Server panel. This is the kind of UI cleanup that feels obvious after you do it and somehow invisible before that.

There was also a surprisingly fussy screenshot pass, because the Marketplace screenshot should show the extension doing something real. I spun up a local Temporal server, created a bunch of running workflows, checked that the UI resized properly inside Docker Desktop, and captured the extension under actual load.

This is how you end up caring deeply about whether a scroll bar belongs to the page or the panel.

A glamorous life.

Why this matters

Temporal already has a great local development path with temporal server start-dev. What this extension does is put that same familiar starting point somewhere developers are already used to being: Docker Desktop.

That matters because the first minute of trying a tool has a weird amount of power. If the first step is easy, people are more likely to keep going. With the extension, you click a button, get a local Temporal server with persistent SQLite storage, see enough health and workflow activity to know it is alive, and then jump into the actual Temporal UI when you want to do real work.

It is not trying to replace the CLI. It is not trying to replace the Temporal Web UI. It is just another front door. Temporal is fun quickly once you are actually running workflows. This just gives people one more familiar place to start.

The community part

My favorite part of this whole thing is still the beginning.

Someone in the community wondered whether this was impossible, and instead of that being the end of the conversation, it became the start of a small, useful artifact.

That is the best version of community work to me. Not polished announcements. Not perfectly scoped roadmaps. Just people saying things out loud, someone getting curious, and a mildly unreasonable idea becoming real because enough of the pieces were already sitting nearby.

I had old Docker Extensions context. Temporal had a great local dev server and UI. Docker Desktop had the extension surface. Codex helped me sand down my frontend instincts. The community provided the spark.

The result is a small extension that does not try to be more than it is.

Honestly, I like that about it.

It starts Temporal. It tells you what is happening. It gets you into Temporal UI. It gets out of the way.

That is a pretty good shape for a tool.

Try it

The source code is here:

github.com/temporal-community/docker-extension

You can search and install the extension directly from Docker Desktop. Once installed in Docker Desktop, open the Temporal extension, click Start Server, and you will have local Temporal running with:

gRPC at localhost:7233
Temporal Web at localhost:8233
persisted local state in temporal-dev-data
quick links into Temporal UI for workflow inspection

And if you ever make a mess locally, there is a Clear History button.

Use it with care.

Or, realistically, use it with the exact amount of confidence that got you into the mess in the first place.

How we turned the Replay keynote surprise into an open-source embedded playground

Shy Ruparel — Tue, 02 Jun 2026 15:36:42 +0000

In the first post, I told the story of how the Replay badge went from "wouldn't it be cool if..." to thousands of circuit boards showing up at a developer conference.

This one is about what was hiding inside that surprise.

Samar Abbas, our CEO and co-founder announcing the Replay Badges

During the final keynote at Replay, our CEO and co-founder, Samar Abbas, came back on stage at the end of the session for a very "one more thing" kind of moment. He wanted to give Replay attendees one more place to deploy Temporal, then revealed that we had secretly manufactured 2,000 hardware badges for the people in attendance.

As attendees left the keynote hall, we handed them something that looked, at first glance, like a very over-engineered conference badge. Which, to be clear, it absolutely was. But it was also a tiny hackable computer: an ESP32-S3, an OLED screen, an LED matrix, buttons, joystick, motion sensor, haptics, IR, MicroPython, games, firmware updates, and enough sharp edges to remind me that embedded systems do not care about my feelings.

Now that the source is public on GitHub with our docs live at badge.temporal.io, I want to open the thing up and show you all the strange little systems inside it.

Because we did not just open source "some firmware."

We open sourced an entire ecosystem.

Our hard work visualized. These are the Temporal badges laid out and ready for attendees to grab after the day 2 keynote. A huge moment for me and the team!

Crowds swarming the tables to pick up their badges.

Attendees checking out their badges

What we actually open sourced

When I say the badge is open source, I do not just mean "there is a folder of C++ somewhere." The public repo includes the firmware, hardware design package, fabrication files, docs site, release asset documentation, flashing tools, MicroPython examples, community app infrastructure, and enough context for someone else to pick up the project without needing to inherit every weird Slack thread, manufacturing panic, and half-obsolete architecture decision that got us here.

The public repo looks roughly like this:

firmware/        C++ runtime, MicroPython bridge, apps, Doom, OTA
hardware/        KiCad source, fab outputs, artwork, mechanical files
ignition/        Temporal-powered flashing workflows
docs/            public badge docs site
community_apps/  installable apps from contributors
data/            schedule, speaker, and floor bundles
release-assets/  firmware and factory image release notes

The goal was for the badge to stop being only a thing we handed out at Replay and become something people could inspect, modify, repair, reflash, extend, and, if they are brave enough, theoretically manufacture themselves.

The hardware package includes KiCad projects, fabrication outputs, BOM and CPL files, mechanical references, and artwork. In theory, you can take the files we published and manufacture your own badge.

In practice, this is where you discover that "open hardware" still means making friends with fab notes, component substitutions, assembly constraints, and files named things like GBR_260331-R3.zip.

Which brings us to the first lesson.

Hardware has different gravity

The hardware is where the project becomes physical. Every feature has to fit inside real power, space, component, manufacturing, and assembly constraints.

The badge runs on an ESP32-S3 module with 16 MB of flash and 8 MB of PSRAM. It has a 128x64 monochrome OLED, an 8x8 LED matrix, four face buttons, an analog joystick, an accelerometer, haptics, IR send and receive, USB-C, battery power, and a two-board assembly where the backplate is also part of the visual design.

That sounds like a spec list, but in hardware every bullet point is also a negotiation. The screen wants power. The LEDs want power. The battery wants to be treated like a tiny volatile roommate. The IR receiver wants timing. The buttons want debounce. The enclosure wants tolerances. The PCB wants trace routing. The manufacturer wants files in a very specific shape. The designer wants the badge to look like it belongs at Replay and not like a dev board fell into a lanyard.

Our hardware guys did not believe I was going to be useful writing code at first, which is fair. From their perspective, I was a DevRel person walking into a hardware project with a lot of opinions and an alarming amount of confidence. It took them seeing my personal hardware hacking setup: the bench, the parts, the alarming amount of half-finished projects, and the Gridfinity-filled tool chest to become fully confident in me. After that, the dynamic changed a little. I was still learning their world, but I was not a tourist in hardware.

I also want to be clear about something: at the beginning, I did not understand half of the hardware file types involved.

KiCad projects, Gerbers, CPLs, fab zips, board backups, assembly exports, pick-and-place files — all of it looked like someone had taken a normal software repo and fed it through a manufacturing portal.

Our hardware consultants were moving fast in the tools they actually use. That matters. Hardware collaboration is not shaped like web app collaboration. A lot of hardware work naturally happens through big generated files, CAD exports, board snapshots, and zip packages that are meant for humans, factories, and design tools, not for a tidy Git history.

They also contributed to the C firmware along the way, so the Git problem was not just "how do I store board files?" It was also "how do I rebase active firmware changes from people who are simultaneously thinking about traces, components, and whether a factory can actually assemble this thing?"

So, somehow, my job became translating all of that into Git: rebasing branches, separating source from generated artifacts, preserving the files a manufacturer would need, and cleaning the history enough that a future human could make sense of it. At some point the hardware guys started referring to me as the Merge Czar, which is not a job title I expected to earn on a PCB project, but here we are.

There is a commit in the hardware history literally titled "i got spooked by the datasheet." I respect that commit deeply. That is hardware development in one sentence. You read something, your stomach drops a little, and suddenly there is a new BOM, a new CPL, a new Gerber package, and an R number attached to the end of the filename.

Software teams debate commit hygiene. Hardware teams have fab revisions.

Both are valid. One of them is just much heavier on zip files.

Firmware as joy infrastructure

The firmware had two jobs that were always in tension: create joy and delight for attendees, and make the badge easy enough to hack on that the delight could keep going after Replay.

A Replay attendee putting our fun firmware to good use

At first glance, firmware sounds like the boring layer. It is not. It is the layer that decides whether the badge feels alive.

The native C++ firmware handles boot, the OLED UI, inputs, LED animations, haptics, IR, storage, OTA updates, app launching, power behavior, the file system, and the basic resource ownership rules that keep everything from stepping on everything else.

This is where the badge became less like a single Arduino sketch and more like a tiny operating environment.

Every fun thing on the badge wants to own the same small set of resources. The screen. The buttons. The joystick. The LED matrix. The IR receiver. The filesystem. Memory. Time.

If you launch a MicroPython app, who owns the OLED? If Doom starts, what happens to the scheduler? What even is a scheduler? If an LED animation is running in the background, what happens when a game wants to take over the matrix? If IR is listening, how long can Python go without polling before frames start getting dropped? If a file is being written, what happens if power gets weird?

These are not abstract architecture questions. These are "why is this thing frozen in my hand?" questions.

The public firmware is organized into subsystems because eventually it had to be. Hardware, UI, screens, IR, boops, MicroPython, OTA, LED runtime, identity, storage, and infrastructure all got their own homes. That structure was earned, mostly by discovering what happens when too many responsibilities live in one place and then you add a deadline, a battery, and 2,000 units.

The firmware is where the badge learned how to be playful without being fragile.

Mostly.

The badge is not a browser tab

Most of my recent software life has been in environments where memory is something you notice when the bill arrives, the page gets slow, or a process gets very dramatic in production.

Embedded work does not let you be that vague.

On paper, the badge has 16 MB of flash and 8 MB of PSRAM. It also has the ESP32-S3's 512 KB of on-chip SRAM, which is technically true in the same way that an apartment technically has floor space before you move in. By the time the runtime, stacks, radios, buffers, static data, and display code take their share, the actual headroom feels much smaller.

That sounds like a lot if you are thinking "tiny conference badge." It sounds like nothing if you are thinking "computer that runs a UI, games, Python apps, a file system, OTA updates, IR, LED animations, and Doom."

The useful mental model is that flash is the badge's persistent storage. It survives power loss and holds the firmware, OTA update slots, filesystem image, apps, data bundles, and Doom's resources. FatFS is a partition of that flash presented like a tiny disk, which is why MicroPython apps can feel like files you can inspect and edit even though underneath it is all still a carefully partitioned chunk of SPI flash.

Internal SRAM is the tiny fast workbench where the firmware keeps the things it needs right now: task stacks, driver state, Wi-Fi and TLS buffers, timing-sensitive code paths, and anything that really cannot tolerate being slow or weirdly placed. PSRAM is the bigger folding table next to it. It gives us breathing room for larger allocations like the MicroPython heap, app metadata, editor buffers, and Doom, but it is still external RAM, not a free substitute for internal SRAM.

The public firmware gives MicroPython a 2 MB heap from PSRAM. That is the friendly layer where people can write apps. But the native firmware still needs memory too: task stacks, display buffers, IR queues, JSON-ish state, app registries, file buffers, OTA metadata, Doom resources, and all the small allocations you do not think about until the badge starts behaving like it has developed opinions.

On the web, if you accidentally allocate a few extra objects in a render loop, maybe Chrome sighs at you. On the badge, you feel it immediately. A garbage collection pause is not just an implementation detail when the UI is trying to stay responsive and timing-sensitive hardware is still running.

So memory became a design material.

We reused buffers. We avoided giant temporary objects. We cared about whether something lived in internal RAM or PSRAM. We had to think about what survived reboot, what survived a firmware flash, and what would be wiped when the filesystem was replaced. We built helpers like GCTicker and DualScreenSession so MicroPython games could keep running without making garbage collection everyone else's problem.

This is also where I learned about MessagePack. MessagePack is a binary serialization format: it carries JSON-shaped data like arrays, maps, strings, numbers, and booleans, but encodes it as compact bytes, completely dropping the benefit of being human-readable text, while keeping all the other benefits of JSON. It exists for the moments when you still want something portable and language-friendly, but you care more about size and parsing cost than being able to open the payload in a text editor.

Before this project, I had almost exclusively reached for JSON because JSON is just what we all use. It is readable, debuggable, and extremely convenient. It is not what you want on a tiny device. The badge still generates JSON versions of the schedule, speaker, and floor data so humans can inspect the output, but the firmware consumes a MessagePack bundle instead. In the public data package, those JSON files are about 35.9 KB together; the packed bundle is about 20.6 KB. Saving roughly 43 percent on one data bundle is not life-changing on a laptop or server. On the badge, it was one more place where the project taught me that every comfortable default has a cost.

Doom wanted everything

The OLED framebuffer is only 1024 bytes, which is adorable right up until you remember that every byte has to be drawn, rotated, copied, or handed across a boundary at the right time. The IR receive queue is small enough that "read your frames promptly" is not a suggestion. The LED matrix is tiny, but animations still want timing.

Doom is probably the best way to explain how quickly the badge stopped feeling spacious. The shareware WAD, which is Doom's asset bundle, is 4,196,020 bytes. It does not live inside the firmware binary; it lives on the flash-backed FatFS partition next to the MicroPython apps and data files. Even there, it takes up about two-thirds of the badge's 6 MB filesystem partition and almost exactly one quarter of the entire 16 MB flash chip.

The compiled Doom code and static data contribute on the order of another 330 KB to the firmware, which is about 12 percent of the current firmware image, or about 8 percent of one OTA app slot. At runtime, Doom's zone allocator reserves 2 MB of PSRAM before you count the screen buffers, luma buffer, and OLED buffer. It also uses a 24 KB internal SRAM stack for its own FreeRTOS task. FreeRTOS is the little real-time operating system under the firmware; a task is its version of a thread.

And yes, we had to leave room for OTA. That was the part of the memory map that made the flash budget feel much less like "16 MB" and much more like "two carefully guarded parking spaces." The production partition table has two app slots, each about 3.875 MB, so the badge can keep the currently running firmware while writing the next firmware image into the other slot. The current firmware binary is about 2.8 MB, which is already around 68 percent of one slot. If the firmware grows past the slot size, OTA does not happen. It does not matter that there is technically flash elsewhere; the update has to fit in the slot.

This was one of the most difficult parts of the project for our entire team. It forced embedded humility.

Every allocation felt physical.

C++ as the kernel, MicroPython as the place I wanted to play

When I do hardware hacking for myself, I usually reach for MicroPython or CircuitPython. That is the layer where I feel fast and curious instead of precious. I can make an LED blink, read a sensor, draw something on a screen, and keep going before my brain has time to turn the whole thing into a software architecture dissertation.

So part of the badge goal was selfish in the best way: I wanted a surface where I could contribute comfortably.

But I also wanted that for everyone else.

Not everyone who gets excited by a weird piece of hardware wants to become an embedded C developer before they can make it do something personal. The badge needed a low-friction layer for people who wanted to write apps, make tiny games, animate the LEDs, draw on the OLED, use the IMU, trigger haptics, or send IR without rebuilding the firmware.

That is what MicroPython is for.

The trick was not "put Python on a badge." The trick was deciding what Python should never be responsible for.

C++ owns the scary stuff: timing-sensitive hardware, resource ownership, storage safety, recovery paths, native UI, device lifecycle, and the pieces that need to survive a Python script doing something ambitious. MicroPython gets the friendly APIs: draw text, set pixels, read buttons, move a cursor, rumble the haptics, send IR, store app state, write a little game.

That meant I also had to learn how to expose C++ functions as MicroPython functions. If MicroPython apps were going to feel like they belonged on the badge, they needed to use the same UI primitives as the native firmware: layout helpers, button prompts, icon drawing, storage helpers, input handling, and all the small pieces that make an app feel consistent instead of like a loose script that happened to run on the same hardware.

I did not expect to spend as much time as I did on glyphs.

At 128x64 pixels, a button icon is not just decoration. It is either legible or it is dust. We tested version after version of the tiny glyphs for the physical buttons because those shapes were part of the language of the badge. The first version used a Nintendo-style button layout. Then we swapped toward a PlayStation/Xbox convention. Then we discovered that most of us had already internalized the Switch version so deeply that the "more standard" mapping felt wrong, and we changed it back.

That is firmware work too. Not glamorous, exactly, but it is the difference between a hackable thing and a thing people can actually use.

For a tiny app, the badge can feel almost casual:

oled_clear()
oled_println("Hello, Replay")
oled_show()

led_override_begin()
led_show_image(IMG_HEART)
led_override_end()

You can connect over USB, use JumperIDE, inspect files, run scripts through the REPL, and iterate without treating the firmware like a locked box.

That is the part I care about most. The C++ runtime makes the badge stable enough to trust. MicroPython makes it personal enough to love.

Games, Doom, and hidden delight

The games were partly technical demos, partly love letters to weird handheld games, and partly a way to hide little moments of surprise inside the badge.

Replay attendees doing exactly what I hoped they would! Playing with the badges

I grew up loving Nintendo DS games that treated hardware constraints as creative prompts instead of limitations. Games like Henry Hatsworth and The World Ends with You were always doing something strange with two screens, odd controls, timing, and divided attention. The device shaped the play.

So once the badge had an OLED screen, joystick, buttons, accelerometer, haptics, and an LED matrix, my brain went directly to "what can this do that a normal screen cannot?"

That is how you end up with Flappy Asteroids: Flappy Bird on the lower LED matrix, Asteroids on the top OLED, and absolutely no mercy anywhere.

The current record is 27 seconds, which I understand as a number but not as a human achievement. I made the thing and I cannot reliably survive five.

Breaksnake, IR Block Battle, Synth, the drawing tools, the LED toys, the community apps — they all served the same purpose. They made the badge feel discoverable. I wanted people to keep finding things. A menu item. A hidden interaction. A weird little game. Something that made them turn to the person next to them and say, "wait, this does what?"

And then there is Doom.

Doom is funny because getting Doom running on strange hardware is almost a genre of engineering by itself. It is a joke, but it is also a test. Can the device handle input ownership? Can it render fast enough? Can it allocate the buffers? Can it pause the normal badge services and restore them afterward? Can the partition layout hold the WAD? Can release tooling distribute the pieces correctly?

On the badge, Doom runs as a guest mode. It takes over the screen and input, uses PSRAM for large buffers, renders a 160x100 view down to the 128x64 monochrome OLED, drives the LED matrix as part of the HUD, and then has to cleanly hand the device back when you are done.

That forced architectural honesty. Resource ownership stopped being theoretical once a 1993 game wanted to borrow the entire badge.

I highly recommend adding a ridiculous feature to your embedded project if you want to find out whether your abstractions are real.

Temporal flashing Temporal badges

Flashing one badge is a command.

Flashing thousands of badges is a distributed systems problem wearing a USB cable.

At some point, manually flashing badges stops being a developer workflow and starts being a production line. You need to detect devices, build firmware, prepare the filesystem image, flash in parallel, retry failures, verify boot, sync clocks, keep logs, and know which unit failed without squinting at a dozen terminal windows.

So I built Ignition.

Ignition is the Temporal-powered flashing system in the public repo. It runs locally, no cloud required, and uses Temporal workflows to orchestrate the build and flash pipeline. It can download the latest factory image from GitHub Releases, detect connected badges over USB, start child workflows per badge, flash firmware and apps in parallel, verify the badge booted, sync the clock, and show progress through workflow history and activity heartbeats.

The point was not that existing factory flashing tools are bad. The factory already has tools that can write bytes to chips very efficiently. Ignition wrapped that kind of work in a software-shaped control plane: retries, per-badge state, logs, boot verification, clock sync, and a visible history of what happened to each physical device. That is the difference between "we flashed a batch" and "we know which badges flashed, which ones failed, why they failed, and where to pick back up."

In other words, at some point the badge project became Temporal enough that we needed Temporal to ship the Temporal badge.

Which is objectively too on the nose, but I am choosing to embrace it.

The funny thing is that this was not just brand poetry. It was useful. Temporal gave us retries, status, history, and a durable mental model for a process that involved physical devices being plugged in and unplugged in batches. The same tool we talk about for long-running software processes turned out to be a pretty good fit for "please flash this table full of 32 tiny computers and tell me which ones survived."

And now that same tool is in the public repo. You can use it to flash one badge on your desk, which is calmer than what we were doing with it, but spiritually related.

The open-source victory lap

Getting the badge ready for open source was the victory lap after shipping these crazy things.

It was also a second engineering project.

Locally on my machine, this project lived in what I called the omni directory, which is exactly as ominous as it sounds. I have no idea how anyone else structured their copy of the work, but mine had firmware repos, hardware repos, prototype repos, QA firmware, flashing tools, event tooling, design assets, worktrees, specs, agent reports, old branches, website QA code, and the printing software I had to build for our badge generator all piled into one place.

Some of that belonged in the public artifact. Some of it very much did not. Some of it was useful context. Some of it was archaeology. Some of it was a reminder that every ambitious project creates a wake behind it.

So open sourcing the badge meant curating.

We had to remove private and event-only context. We had to make the docs usable. We had to separate source from generated artifacts. We had to package release images. We had to publish the hardware files in a form that was useful without dumping every temporary file a design tool had ever breathed on. We had to preserve licensing and third-party notices, especially because Doom brings its own legal furniture. We had to create a Community Apps path, contribution rules, and review checks so the badge could grow without turning into a tiny chaos machine.

We also had to decide what future contributors, human or AI, actually need to know.

One underrated part of the cleanup was producing a small memory layer for future agents. Not the whole project history. Not every false start. Not every conversation. Just enough context to know where the important code lives, which repos matter, what the architectural boundaries are, what old branches mean, and which parts should be treated as historical sediment instead of active direction.

That feels like a very 2026 kind of maintainability problem.

Open source is no longer only about code being available. It is about whether the next person, or the next coding agent helping that person, can arrive in the project and become useful before they get overwhelmed. As my colleague Cornelia Davis put it in a line that has been rattling around in my head for months: "We are no longer just teaching humans, we are teaching agents now."

The public repo is not the whole messy garage. It is the workbench after we cleaned it enough that someone else could actually build something there.

The second life of the badge

The badge did its job at Replay if it made people smile.

And something that made me smile — after all the hard work, seeing the badges in the wild hanging around attendees necks

It keeps doing its job if people keep changing it.

That was the whole point. We did not want to make 2,000 objects that felt magical for three days and then became drawer fossils. The badge has a second life because the firmware is open, the hardware files are there, the docs are readable, MicroPython is embedded, and the community app path exists.

Badge.temporal.io

You can reflash it. You can write an app. You can install community apps. You can inspect the hardware. You can try to manufacture one if you are feeling brave. You can make the LED matrix do something wonderful or cursed. You can build a tiny game. You can make an IR remote. You can find some corner of the device we did not fully explore and make it yours.

If the first life of the badge was surprise, I hope the second life is mischief.

Everything you need is at badge.temporal.io and all the code is up on GitHub ready for you to clone.

github.com/temporal-community/badge.temporal.io

We have also made updates since Replay, so even if your badge worked perfectly at the conference, you will want to update your firmware.

Flash it, fork it, write an app, inspect the hardware files, submit a PR with a Community App or a fix, or send us something weird enough that I have to stop what I am doing and try it. You can submit any cool stuff in this Slack channel.

We also have a few extras, and we will be sharing them with people at our San Francisco events over the next few months. So if you missed Replay, come find us.

There may still be a tiny hackable computer waiting for you to put your name on it.

Temporal AI Question Planetarium using Temporal Workflows

Shy Ruparel — Mon, 01 Jun 2026 17:23:38 +0000

Behind The Badge: How We Built 2,000 Hackable Badges For Temporal Replay

Shy Ruparel — Tue, 26 May 2026 04:00:00 +0000

At some point in December, Candace, our Head of Design, asked me: what if your time at Replay could be represented as a Workflow? Every check-in, every fun event, every person you met, a living timeline, running on Temporal.

Four months later, I'm coordinating manufacturing circuit board badges in Shenzhen over WhatsApp at 11 p.m., and I've hand-soldered more PCBs than I'd like to admit.

To understand why I was asked to figure out this project is to know that my role at Temporal has this informal understanding baked into it: give Shy the weird projects. The ones that involve breadboards on a kitchen table at 11 p.m., a document translation pipeline for toys, or a half-baked idea (or two, or three...) that could either be really cool or a complete disaster at a conference in front of 2,000 people.

The latter is this project where I set out to create badges for Temporal's annual developer conference, Replay.

The Concept Stage

I hit the ground running and started spec'ing it out. I could make one badge by myself, but to scale up to make a badge for everyone at Replay I knew I would need help. Through an introduction from some fellow Recurse Center alum, I found a hardware consultant and we put together a proposal.

Next, I had to pitch that proposal to Andrew Baker, our VP of Developer Relations, during his first five days on the job. Tricky because he didn't know me at all, but lucky for me, he trusted some DevRel guy with a hardware project budget that required convoluted manufacturing.

The early concept drawings are genuinely charming to look at now.

Our hardware consultant's early sketches are full of hand-drawn illustrations of attendees holding a chunky handheld device with a T9 keyboard, an e-paper name tag on the front, a kickstand so it could sit on a table. There's even a little Tamagotchi-esque character on the screen. The vision at that stage was something closer to a personal game console, or honestly a weird cell phone that only worked at Replay.

Then on to artistic inspiration. Our consultant came back with these stunning hand-drawn illustrations. It captures the spirit of what we were going for better than any spec doc ever could: it shows attendees gathered together, shrouded in a collective glow, illuminated by the feeling of knowledge and community. I'd even say the energy is a little cosmic.

This is just the type of energy we captured for the event, as Replay is the community's conference. We just get the privilege of paying for it. Finding ways for attendees to interact, share moments, learn from each other, and spark something in each other is always the actual goal. The badge is the vehicle taking you there.

What The Badge Does

When you hear "badge," you probably picture a rigid little plastic thing with your name and an outdated LinkedIn headshot on it. Scrap that entire notion when we're talking about this badge because this one is more like a game piece.

Attendees navigate Replay with a mission of gathering with their peers to traverse the cosmos of the development unknown. Along the way, they collect connections, unlock new levels, and even find hidden treasures, all while the badge handles the technical complexity involved underneath.

So what's actually inside it?

The badge runs on an ESP32-S3 microcontroller with a 128x64 pixel OLED screen, an IR transmitter and receiver, a joystick, buttons, haptic feedback via a vibration motor, and an LED matrix with programmable colors.

It also has a gyroscope, so the screen reorients depending on how you're holding it. Firmware is written in C, with a MicroPython layer sitting on top so attendees can write their own apps without needing to learn anything new. Attendees are able to update their contact information by using the MicroPython REPL and then beam it over to each other with the IR transmitter. The conference schedule is right there on the badge so you don't have to download an app. I don't know about you, but I've encountered very few conference apps that I've liked.

And this is what it became after Kathy, our Senior Brand Designer, got involved. The back of the badge has Ziggy, our tardigrade mascot, as an astronaut, which I love, etched into a PCB.

The AI Of It All

I have to tell you something, and I want you to keep it just between us... I never wrote firmware code before this project. Like ever.

Now, I'm a decent enough programmer and I know how to have an architecture conversation with the best of 'em. I can confidently identify when something is going wrong and I know the right questions to ask to reveal a solution, but C firmware for an embedded microcontroller wasn't something I had ever done professionally.

My testimony is this: generative AI made this project possible for me!

I used Claude and Claude Code extensively throughout my development process, eventually swapping to Codex as newer models became available. The way I think about it: I became a full-time architect with access to a team of very high int, low wisdom junior engineers who are great at implementation but need a lot of hand-holding. You can't give them something vague. You have to know what you want, know when they've gotten it wrong, and stay deliberate about what you let them build. I genuinely don't think this workflow would have worked without my ten-plus years of developer experience behind it because the experience tells you which questions matter and when to push back.

The codebase ended up around 5,000 lines of production logic, with tests more than quadrupling that. I used those tests as a way to validate what the AI was building and build trust in the output. I would record and review Playwright tests to confirm user interactions worked the way I wanted them to. I'll admit, when I code without AI, I mostly skipped writing tests in all my previous work. For me to have trust in the code that was getting generated, I needed extremely solid test coverage from the start.

The documentation from all the architecture debates I had with the AI, the back-and-forth about structure, what to build, and what to cut, is over 21,000 lines, managed through GitHub Spec Kit that tracked specs and architecture decisions as the project evolved. That ratio says something interesting about what the job actually looks like now. It was very out of my comfort zone as someone with a CS degree and a full career without any AI tools even existing. As this project took four months, I swapped my tooling quite a bit. I finalized the project leveraging Codex with the Grill Me Skill. It was a lot more reminiscent of wordsmithing an English essay than anything hands-on-keyboard, which I'm still adjusting to.

Something for you to look forward to: the whole codebase is open source! That means you'll be able to see the output of every argument I had with AI to get here! and the crowd goes wild

From One To Two Thousand

Now, I've built weird IoT stuff before. In fact, that's kind of my whole thing. But I've never had to build 2,000 of the same thing, and it turns out those are completely different problems.

The original inspiration for this whole category of badge, ours, Twilio's, GitHub's, traces back to DEF CON, the hacker conference in Las Vegas. They've been doing programmatic badges for over twenty years. The badges run cryptography puzzles, communicate with each other, and unlock hidden games. Last year's badge had a full Pokemon clone where you could walk around a virtual version of the convention center. The level of engineering that community pours into these things is genuinely absurd in the best way, and I have a lot of love for it. I used to go when I was back in college during the era of Ryan "LostboY/1o57" Clarke making all the conference badges, and my memories here helped guide a lot of the creative energy I poured into this project.

I had a lot of guidance too. Pimoroni, the team behind the GitHub Universe badge, were also incredibly generous. They couldn't meet our timeline but they sat with me anyway and gave me advice, so this is a huge shout out to them. Their work inspired me and the updated Badgeware ecosystem served as my prototyping platform to prove that I could pull this off. Their contributions helped us get here.

On the manufacturing side: I knew, at least in an abstract way, that hardware companies manage factory relationships through WhatsApp and that most electronics manufacturing happens in Shenzhen. Actually living that is something else. I'd start getting messages from the factory at 10 p.m. I'd answer until I fell asleep. I'd wake up and our west coast based hardware consultant had been chatting with them for another four hours on top of that. You're always catching up, always a little behind.

Our manufacturer, Elecrow, was incredibly generous with their time and expertise, helping us solve the problems that came up throughout this entire process. I'm especially grateful to our account manager, Chris, who not only stepped up to make sure everything worked, arrived safely, and showed up on time, but also helped us get the most out of our time in Shenzhen. She did double duty, guiding us through all the electronics markets Shenzhen is known for.

Sourcing LiPo batteries in bulk is harder than you'd think because you can't air ship them, so everything goes ground, and vendors get suspicious when you try to order large quantities at once. This resulted in our dev kits running off triple A batteries. Getting enough screens for the dev kits turned into placing about 30 separate Amazon orders across different sellers to work around quantity throttling. Our accounting team is going to have feelings about that for a while... sorry guys.

There's also the China trip. Temporal's security policy doesn't allow company compute into the country, which is reasonable, so when I needed to go supervise manufacturing, it required multi-week negotiations between HR, security, and finance to work out.

The answer turned out to be a clean MacBook and a clean phone provisioned specifically for the trip, access to exactly one Slack channel, and no corporate accounts. I had to say goodbye to Claude for the duration, which means I had to go back to the ways of ye olden days, pre-2023, for the development, which definitely made the firmware work interesting.

That setup was the right call, but it did mean that once I was in China, I was mostly on my own for anything that came up. Between the limited access, the timezone gap, and the fact that problems were being discovered in real time on the manufacturing floor, there was not much opportunity to phone a friend. I got one hour-long call to get help on a specific issue, and otherwise the last-minute problems had to be debugged with whatever I had with me in the moment.

Meanwhile, my single allowed Slack channel basically turned into a mini travel blog: factory floors, production lines, electronics markets, and a steady stream of "look at this, manufacturing is magic and also chaos." It was useful for keeping people in the loop, but it was not exactly the same thing as having the full company brain available while trying very hard not to become the bottleneck between "we found a problem" and "thousands of these need to ship."

Here are a few photos of what went down on my trip.

Now, I'm back, the badges were in tow, and I was excited to get them in your hands at Replay.

It's In Your Hands At Replay! And Beyond

We live in a hyper-consumerist world, and we don't want to pile on to that by creating a bunch of e-waste.

So the badge isn't e-waste when the conference ends. That was a hard constraint from the start: we're not building 2,000 things that get thrown in a drawer after three days. Because the firmware is open source and MicroPython is embedded, you can keep hacking on yours after Replay.

The sky's the limit really. The IR sensor means you could build a universal remote, the screen and joystick give you a tiny game device, and it'll keep being whatever you make it.

People also started hacking on the firmware almost immediately. One attendee built a full Tamagotchi game during the conference. That's exactly the kind of thing we hoped for: someone picking up the hardware and immediately thinking cool, what can I make this do?

For next year, I want to shift from the ESP32-S3 to the Raspberry Pi RP2350 and have every badge function as a Temporal worker. We managed to get Temporal's Rust SDK running on the badge thanks to Edward Amsden, Staff Software Engineer on the SDK Language Runtime team, who saw me demo the badge during an engineering sprint showcase in his first week at Temporal and then casually spent the weekend hacking together Rust SDK support for us. I more or less seconded him to the badge team for his first month with us so we could get the project over the finish line. SDK team, I promise you can have him back now that we're finished with Replay. Because Edward got Temporal running in the final weeks of the project, we didn't have enough time to really showcase what we could do with it. Next year, though: two thousand workers hanging around people's necks in the same room. I don't know what I'd run on them yet, but I very much want to find out.

We also made a bunch of deeply weird games for the badge, because apparently once you put an OLED screen, joystick, accelerometer, LED matrix, Wi-Fi, and a questionable amount of ambition into something people wear around their necks, then put a person who spent far too much time playing The World Ends with You and Henry Hatsworth on the Nintendo DS in their youth in charge of the project, aka me, this is what happens. Some are real games, some are firmware tests that got out of hand, and at least one is Flappy Asteroids: Flappy Bird on the lower LED matrix, Asteroids on the top, and absolutely no mercy anywhere. If you survived for more than 30 seconds at Replay, I hope you got your special variant joystick cap. And yes, we got Doom running.

2,000+ badges showed up and worked. If you want to get hands-on with yours, everything you need is at badge.temporal.io, and if you missed out this year, make sure you come to one of our SF events over the next few months and attend Replay 2027!

Why I needed Durable Execution to Read a Toy Manual

Shy Ruparel — Fri, 10 Apr 2026 15:53:01 +0000

Watch me take a Japanese toy manual and turn its translation into a bulletproof, AI-powered ETL pipeline. I’ll show you how I use Temporal Workflows to guarantee an AI pipeline never loses progress, surviving network failures, API crashes, and more.

What You'll Learn

Why Spider-man is the reason that Power Rangers has a giant robot.
Guaranteed Completion with Temporal: I’ll show you how to ensure your code keeps running even if servers crash or APIs fail.
Parallel OCR & Translation: Learn how I used a "fan-out" pattern with Google Document AI to process a 20-page manual in 50 seconds instead of 10 minutes.
Resilient AI Cleanup: See how I use Pydantic and Temporal together to handle non-deterministic LLM outputs from Gemini and automatically retry failed validations.

Ready to build it yourself? 👉 Check out the code here!

Decoupling Temporal Services with Nexus and the Java SDK

Nikolay Advolodkin — Thu, 02 Apr 2026 13:50:51 +0000

Your Temporal services share a blast radius. A bug in Compliance at 3 AM crashes Payments, too, because they share the same Worker. The obvious fix is separate services with HTTP calls between them - but then you're managing HTTP clients, routing, error mapping, and callback infrastructure yourself.

We published a hands-on tutorial on learn.temporal.io where you take a monolithic banking payment system and split it into two independently deployable services connected through Temporal Nexus.

You'll learn:

Nexus Endpoints, Services, and Operations from scratch
Two handler patterns for different use cases
How to swap an Activity call for a durable cross-namespace Nexus call

The caller-side change is minimal - the method call stays the same:

// BEFORE (monolith - direct activity call):
ComplianceResult compliance = complianceActivity.checkCompliance(compReq);

// AFTER (Nexus - durable cross-team call):
ComplianceResult compliance = complianceService.checkCompliance(compReq);

Same method name. Same input. Same output. Behind that swap: a shared service contract, a Nexus handler, an endpoint registration, and a Worker configuration change.

Here's what the Nexus handler looks like - it backs the operation with a long-running workflow so retries reuse the existing workflow instead of creating duplicates:

@OperationImpl
public OperationHandler<ComplianceRequest, ComplianceResult> checkCompliance() {
    return WorkflowRunOperation.fromWorkflowHandle((ctx, details, input) -> {
        WorkflowClient client = Nexus.getOperationContext().getWorkflowClient();
        ComplianceWorkflow wf = client.newWorkflowStub(
            ComplianceWorkflow.class,
            WorkflowOptions.newBuilder()
                .setTaskQueue("compliance-risk")
                .setWorkflowId("compliance-" + input.getTransactionId())
                .build());
        return WorkflowHandle.fromWorkflowMethod(wf::run, input);
    });
}

The tutorial includes a durability checkpoint: you kill the Compliance Worker mid-transaction, restart it, and watch the payment resume exactly where it left off. No retry logic, no data loss across the namespace boundary. Java SDK, runs entirely on Temporal's dev server.

Try it

2025 — Part 2

Sergey Bykov — Tue, 18 Nov 2025 16:55:03 +0000

(Part 1)

Company

At the time of my last update, the company had 116 people. Now we are over 300. The Go-to-Market organization is now larger than Engineering. Some studies claim that our ancestors couldn’t handle tribes of over about 150 people. We are definitely past the point when one could know every employee. The loss of intimacy is offset by the feeling that we now have resources — a growing number of teams focusing on different areas while collaborating on cross-group efforts.

With such growth, we are doubling down on our efforts to foster and reemphasize consistency in our hiring practices, decision-making, behavioral patterns, and rules of engagement, otherwise referred to as values and culture. In my previous life within a huge corporation, those things generally made sense to me, but they also felt somewhat artificial and performative. Within the context of a small company with a relatively flat structure, it feels very different — much closer to home. This makes me genuinely attentive to such aspects and eager to contribute where I can. Just recently, we rolled out our updated values.

My impression is that at least half of the VC money these days goes to companies with corporate domains ending in “.ai,” and aside from that, funding isn’t easy. We raised our C round early this year with a very good, some say almost exceptional, multiple. This tells us that the investors have a strong conviction about our product, business model, and growth. I’m no VC, but I see how they are impressed with the quality of the use cases and the caliber of customers that come to our cloud. I hope they know better than I do how to assess and evaluate such factors. Since the C round, we’ve also had a secondary round that pushed the company’s valuation significantly higher.

Keeping the hiring bar high continues to be a top priority. With the turmoil in the job market and Temporal becoming a better-known brand, we now have access to a larger pool of high-quality engineering talent. The interview process is still more art than science, and scaling and improving this art as the company grows is a challenge by itself. Hiring at the junior levels has its own difficulties. Recently, we had to close an open SDE 1 position after only a few hours because, during that time, we received more than 3,000 applications. We found that the old recipe still works well — filling junior positions via internships.

We are still fully remote, with WeWork as an option for folks who want to come into the office. We are geo-distributed but not very balanced. Most of Engineering is on the U.S. West Coast, with roughly a tie between the Seattle and Bay areas. Smaller pockets are in Colorado, North Carolina, and the cities of New York, Chicago, Toronto, and Vancouver. The GTM team has its own distribution. My impression is they are more heavily tilted toward the East Coast.

We settled on an annual all-company offsite (we started with twice a year). We complemented it with smaller team offsites and are now aggregating them into an annual R&D offsite, side by side with GTM’s sales kickoff event. We’ll see how this goes. There doesn’t appear to be a simple solution for doing it right, and each company needs to find its own rhythm. From time to time, we leverage the West Coast’s locality for in-person meetings to discuss some critical decisions or designs. In such cases, we consciously violate the remote-first setup for the sake of high-throughput discussions and faster decision-making — at the unfair expense of colleagues who can’t attend in person and have to connect via Zoom.

Replay

It was a bold move in August 2022 to start our own annual conference. The inaugural edition was in Seattle. The 2023 and 2024 editions were in Bellevue, WA, growing bigger each year. In 2025, we held the event in London to reach audiences unlikely to travel to the U.S. Attending Replay is a very special experience. Seeing so many engineers and engineering leaders talking non-stop about your product and presenting on stage what they’ve built with it is a special kind of pleasure. I presented at all Replays but the very first one. In 2023, my talk was on the second day and I talked with folks so much before then that my voice let me down close to the end of my presentation. I guess that’s why no recording of it was published. But I gave slightly different versions of the same talk at J on the Beach and QCon SF that year.

Replay 2026 will be in San Francisco — at Moscone, no less. It should be epic. I’ll need to rewatch the Silicon Valley documentary before going there.

Operations

We operate a multi-million-dollar business based on a single product — Temporal Cloud. Our customers trust us with their hot-path business processes — often their most critical ones. This is an interesting phenomenon. They choose Durable Execution of Temporal to make their applications resilient to various failures. Naturally, they first and foremost care about the reliability of their most critical services. Some choose to self-host Temporal Server with all its dependencies. Many don’t view it as their core competency — operating such complex production machinery — and they come to our cloud service with their most precious workloads. It is amazing and sobering at the same time when big Internet household names bring us their “crown jewels” to run — even those who have a policy of not taking a dependency on SaaS vendors in the critical path. It was eye-opening to hear, on a couple of occasions, a customer say, “We only have two external dependencies — AWS and Temporal Cloud.”

Customer expectations are very high. Sometimes it feels like they set them higher for us than for the hyperscalers. We now have about eight engineering on-call rotations (teams), covering different areas of the system, plus one for on-call managers who coordinate across teams, and another for the Developer Success team that communicates with customers. This may seem large for our company size, but that’s the nature of the service we run.

We use incident.io for managing incidents. It integrates nicely with Slack, creates a per-incident channel, and automatically adds the current on-call engineers to it, among other things. We saw great promise in the early days of their product. They haven’t disappointed and are growing fast. Like most folks, we use statuspage.io for public incidents and pagerduty.com for on-call paging. Incident.io also integrates with Jira to automatically turn incident follow-ups into tickets, helping us continuously improve the system.

Replication

Temporal inherited the application-level replication stack from Cadence. Over the years, we dramatically improved it and added Control Plane functionality to manage it. Initially, we used replication to transparently migrate customer namespaces from Cell to Cell. After we got it working at the level we were happy with, we exposed it to customers as high-availability options — multi-region, cross-cloud, and single-region replication.

At first, few customers immediately understood why they would want to pay double (due to the duplicate hardware needed) for such a feature. Some just used it, at our suggestion, as a tool for migrating their workloads from one region or cloud provider to another. The recent GCP and AWS us-east-1 outages vindicated the paranoid among our customers who refused to accept that “cloud regions pretty much never go down.”

Customers who had replication enabled for their namespaces were able to fail over to the other region or cloud, and their applications continued executing as if nothing had happened. We discovered a few misses on our side and had to fail over some namespaces manually, with a longer delay than we expected. The important part is that replicated namespaces continued running after failover. We saw a major spike in customers setting up replication in the days after the AWS us-east-1 outage. One customer was in the process of migrating their namespace from AWS to GCP during GCP’s global outage. They weren’t impacted and didn’t even need to fail over because their active replica was still in AWS. They were considering keeping the cross-cloud replication running indefinitely after that.

I gave a conceptual talk about replicated namespaces, but the topic probably deserves its own post.

Road ahead

With great opportunities come great responsibility and pressure to execute and realize those opportunities. We still have to strike the right balance between running a highly reliable service and investing in new functionality. It’s a deeply humbling experience to see that some of the world’s top companies — household names with tens or even hundreds of millions of users — take an all-in dependency on Temporal Cloud. This leaves no room for hubris, complacency, or sloppiness. We have to keep pushing the reliability and quality bar higher without hampering further development of the product.

I don’t believe there’s a general recipe for how to grow an organization, be it engineering, R&D, or the whole company. We’ll have to navigate our own path — growing sustainably while preserving what has made us successful so far and learning new ways in parallel. It’s exciting and somewhat dizzying at the same time. Yet I feel we are still only getting started.

2025

Sergey Bykov — Wed, 12 Nov 2025 21:56:41 +0000

Part 1

Belated update. Yes, it’s been five years, can’t believe it myself. What’s the “delta” for the last three years that flew by too fast?
Certain things haven’t changed much. We are still under the “dual mandate” — OSS server, SDKs (clients), CLI, and a whole bunch of other peripheral software, plus a cloud service where we charge customers for running and managing the invisible infrastructure so that they don’t have to. My focus continues to be primarily on the cloud side.

At the same time, obviously, everything has changed — some things even multiple times. We went through COVID with its obligatory work-from-home setup, only for many companies to start imposing, some more gradually than others, return-to-office policies. I interviewed a number of candidates recently who moved away from major tech hubs during COVID and had to leave their jobs because of the RTO push. We are still fully remote.

Most of the Big Tech companies went through rounds of mass layoffs — a tectonic shift from the previous 20 or so years of competing for talent and outbidding each other in offers. Startups suddenly became much more attractive for Big Tech employees who were previously reluctant to take the risk of leaving their well-paid jobs. At the same time, startup founders faced the funding drought starting in late 2021 and early 2022 caused by interest rate changes. Many had to close or fire-sell their ventures where just a couple of years earlier, cheap money seemed unlimited.

Product

Developers choose Temporal for its programming model. They experience it in a language of their choice via Temporal SDKs. We started with two languages. Now we support seven: Go, Java, TypeScript, Python, .NET, PHP, and Ruby. Four of them are built on the same Core SDK written in Rust. No, we still don’t have an official Rust SDK.

We started Temporal Cloud by hosting the OSS Temporal Server with an added layer of security and multi-tenancy. The original value proposition included that, plus general operational concerns such as monitoring, alerting, configuration, upgrades, and scale. We’ve been investing along several dimensions since then and are now running the fifth-generation Cells (Temporal Cloud “clusters”).

Building a custom storage layer between the server and database to absorb reads and coalesce writes was one of the first bold undertakings. Rolling it out to production over the course of 2023 gave us a significant increase in reliability, performance, and scalability compared to the vanilla OSS server. Another major investment was making it possible to incrementally add multiple databases to a running server. With these improvements, the scenario I mentioned in one of my previous posts, a major customer needing to increase their already outsized traffic level up to 10x for a day, became routine. At the end of 2021, a day like that was a big deal for both companies, with teams of engineers monitoring the system, communicating live, and taking action. The subsequent occurrences became increasingly “boring” and turned into non-events.

On the authentication/authorization dimension, we went from initially supporting only mTLS and Google SSO to adding API keys, service accounts, SAML, SCIM, and a bunch of other features critical for enterprise — and not only enterprise — customers.

We started with prospective cloud customers filling out a form, getting contacted by our sales team to complete the paperwork, and then creating an account on their behalf. Embarrassing. Now, we have a complete self-signup process that guides prospective customers along the path, with a full PLG motion behind it. When we opened up Temporal Cloud to the world, we were missing a number of table-stakes features. At the time, I called the bar we had to meet a “reasonable cloud service.” I believe we passed this milestone 12–18 months ago.

I like that we don’t play licensing games (our OSS is under the MIT license) and instead extend and enhance it with proprietary features to differentiate our cloud offering.

We launched Temporal Cloud in 2022 with support for AWS only. We added GCP in 2025 and are working on bringing in Azure, the last of the big three providers. Even though support for Kubernetes clusters across them is similar, most of the integration effort goes into their disparate security and resource hierarchy models, differences in networking, and subtle behavioral differences in their seemingly compatible APIs — for example, GCS vs. S3. Recently, we’ve been chasing GCP load balancers mysteriously ghosting a fraction of the connections. Support for hosted Elasticsearch is another headache — only AWS has it, but in the form of OpenSearch, their fork of ES from before Elastic changed its license.

AI

The agentic AI “storm” turned into a sudden tailwind for Temporal. The very nature of such applications — being stateful, depending on a significant number of semi-reliable API calls to external services, and taking seconds to minutes to execute — made code-first Durable Execution a compelling programming model for this fast-moving, massive herd. While there are still some rough edges for AI use cases in the near term (such as payload and history size limits and required determinism of workflow code), the immediate benefits — high-velocity development of much more reliable code in the language of your choice, guaranteed scalability, and unparalleled visibility into execution for debugging — keep bringing AI-focused companies to Temporal. More traditional businesses that are scrambling to integrate AI into their systems do the same. I was told that as of late 2024, out of the top 20 AI companies, only two were aware of Temporal — and now, 16 of them already run Temporal-based apps.

Nexus

This year we launched the initial version of Nexus, an open-standard-based protocol for APIs that may take arbitrarily long to complete. I think of it as a great frontend layer for Durable Execution. But the protocol itself is implementation-agnostic. One could implement it using more traditional tools and approaches, for example, within the paradigm of event-driven architecture. The idea was conceived in the early days of Temporal. We started talking about it publicly in 2022, only to do nothing for another year due to other priorities.

We believe that Nexus is an immense opportunity to integrate systems and services in a new, powerful way. Nexus deserves a dedicated post, and I’m contemplating a conference talk about how the combination of Durable Execution and Nexus could define a major evolution of the Microservice Architecture. I understand this is a very bold statement, but sometimes you have to shoot for the Moon.

(Continued in Part 2)

Building Durable Cloud Control Systems with Temporal

Sergey Bykov — Sat, 09 Aug 2025 00:47:09 +0000

In today’s world of managed cloud services, delivering exceptional user experiences often requires rethinking traditional architecture and operational strategies. At Temporal, we faced this challenge head-on, navigating complex decisions about tenancy models, resource management, and durable execution to build a reliable, scalable cloud service. This post explores our approach and the lessons we learned while creating Temporal Cloud.

The Case for Managed Cloud Services

Managed services have become the default for delivering hosted solutions to customers. Whether it’s a database, queueing system, or another server-side technology, hosting a service not only provides a better user experience but also opens doors for monetization, especially for open-source projects. The challenge is how to do it effectively while maintaining reliability and scalability.

One of the first decisions we made was about tenancy models. Should we pursue single-tenancy — provisioning dedicated clusters for each customer — or opt for multi-tenancy, which allows multiple customers to share the same resources? While single-tenancy offers simplicity and isolation, its inefficiencies quickly become apparent. Customers end up paying for unused capacity, and providers shoulder higher operational costs. Multi-tenancy, though harder to implement, emerged as the clear winner. It optimizes resource usage, allows customers to pay for actual usage, and creates shared headroom for handling traffic spikes.

Data Plane vs. Control Plane: Defining Responsibilities

Architecting a managed service in terms of the data plane and control plane is an industry best practice that we followed, clearly defining and implementing their distinct roles within our cloud architecture.

Data Plane: This is where the actual work happens — processing transactions, executing workflows, and handling customer data. It must maintain high availability, low latency, and resilience to failures. For Temporal Cloud, we adopted a cell-based architecture to isolate resources and minimize the blast radius of potential failures.
Control Plane: This acts as the brain of the system, managing resources, provisioning namespaces, and handling configurations. While its performance is less critical than the data plane, reliability here still matters for customer experience. For instance, provisioning a namespace may not be urgent, but delays or errors in this process can frustrate users.

Implementing the Data Plane: A Cell-Based Architecture

For the data plane, we applied a cell-based architecture to achieve strong isolation and scalability. Each cell operates as a self-contained unit with its own AWS account, VPC, EKS cluster, and supporting infrastructure. While this approach is framed within the context of AWS, we have applied the same principles to Google Cloud Platform (GCP), leveraging its equivalent primitives to ensure consistency and reliability across cloud providers. This approach ensures that failures or updates in one cell do not impact others, reducing the risk of cascading outages.

Each cell in Temporal Cloud includes:

Compute Pods: Running Temporal services and infrastructure tools for observability, ingress management, and certificate handling.
Databases: Both primary databases and Elasticsearch for enhanced visibility.
Additional Components: Load balancers, private connectivity endpoints, and other supporting infrastructure that ensures smooth operation and integration across environments. Currently, Temporal Cloud operates across 14 AWS regions, and we’ve also added support for GCP. This architecture allows us to meet the diverse needs of our customers while maintaining reliability at scale.

Durable Execution: The Foundation of the Control Plane

Building the control plane presented its own set of challenges, particularly around reliability and maintainability. Control plane tasks, such as provisioning namespaces or rolling out updates, involve complex long-running processes with many interdependent steps. Writing this logic as traditional, ad-hoc code often leads to brittle systems that are hard to debug and evolve.

This is where Temporal’s durable execution model shines. Designed based on experience with earlier systems like AWS Simple Workflow Service and Azure Durable Functions, Temporal’s approach separates business logic from state management and failure handling. Developers can write workflows as straightforward, happy-path code without worrying about retries, error handling, or state persistence. The system automatically manages these concerns, allowing workflows to seamlessly recover from failures.

Namespace Provisioning: A Real-World Example

Consider the process of creating a new namespace in Temporal Cloud. When a user clicks “Create Namespace” on the web interface, the control plane orchestrates a series of tasks:

Selecting a suitable cell within the chosen region.
Creating database records and roles.
Generating and provisioning mTLS certificates.
Configuring ingress routes and verifying connectivity. Each step involves external API calls, DNS propagation, and other potential points of failure.

Without durable execution, managing retries, backoffs, and state persistence would result in a tangle of brittle code. With Temporal, these tasks are encapsulated in workflows, which transparently handle retries and maintain state across failures. Developers can focus on the high-level logic, confident that the system will handle the edge cases.

Rolling Upgrades: Ensuring Safe Deployments

Another common control plane scenario is rolling out updates to the Temporal Cloud fleet. Our deployment strategy involves organizing cells into deployment rings, progressing from pre-production environments to customer-facing cells with increasing priority of traffic.

The rollout process is carefully staged:

Ring 0: Synthetic traffic only, no customer impact. Changes are monitored here for at least a week.
Ring 1: Low-priority traffic namespaces, allowing for additional testing with minimal risk.
Higher Rings: Gradually expanding to critical, high-priority traffic customers. Within each ring, updates are applied in batches, with pauses between batches to observe for potential issues like memory leaks or race conditions. Temporal workflows handle this process, ensuring that even long-running deployments (which can span weeks) are resilient to failures or restarts.

Entity Workflows: A Powerful Pattern

Temporal’s durable execution also enables powerful patterns like entity workflows. These are workflows tied to specific resources, such as cells or namespaces, providing a natural way to model state and operations. For example, each cell in Temporal Cloud has an entity workflow that manages its lifecycle, from provisioning to upgrades. This approach ensures consistency and simplifies concurrency control.

Developer Happiness and Productivity

One of the biggest benefits of Temporal’s approach is the impact on developer experience. By eliminating the need to write boilerplate code for retries, backoffs, and state management, developers can focus on delivering business value. Temporal’s built-in tools for observing and debugging workflows further enhance productivity, making it easier to understand and troubleshoot complex systems.

Happy developers are productive developers, and Temporal’s approach fosters this by reducing the cognitive load and frustration associated with traditional workflow coding.

Why Durable Execution Matters

Durable execution is more than a technical innovation; it’s a paradigm shift for building cloud-native systems. By decoupling business logic from state management and failure handling, Temporal empowers developers to build reliable, scalable systems with less effort. Whether you’re managing control planes, provisioning resources, orchestrating complex workflows, performing money transfers, training AI models, or processing social media posts, this approach delivers clear benefits.

At Temporal, we’ve seen firsthand how durable execution transforms the development process, enabling us to deliver a robust managed service that scales with our customers’ needs.

Ready to Transform Your Control Plane?

Temporal isn’t just a tool for building cloud systems; it’s a better way to think about workflows and application architecture. If you’re building or planning a managed cloud service, consider how durable execution can simplify your journey and unlock new possibilities. For more insights into our approach, check out my full talk at QCon.

Why Top Developers Prioritize Failure Management

Sergey Bykov — Sat, 09 Aug 2025 00:35:05 +0000

There’s a saying: “Amateurs study tactics, while professionals study logistics.” In software, this translates to: “Amateurs focus on algorithms, while professionals focus on failures.”

At J on the Beach, I took time in my talk to expand on this saying and explain that real-world systems don’t just need code that works on the “happy path” — they need a safety net for when things go wrong.

Modern software development has layers of complexity. You’re not just writing code; you’re connecting systems across time and space, handling data that doesn’t sleep, and ensuring flawless performance at scale. What sets top developers apart is how they manage failures. Building resilience focuses on ensuring reliability when things inevitably go wrong, not just maintaining uptime.

In this post, we’ll walk through three common approaches to handling failures in software, each with its own strengths and weaknesses. Then we’ll introduce Temporal’s approach, workflow-as-code, which makes it easier to build reliability into your systems from day one.

Three Ways to Handle Failure in Your Software

Failures are inevitable in your distributed systems. When a network link fails, a server times out, or a service crashes, systems need strategies to respond properly and ensure that your operations remain reliable.

Below, we’ll explore three common approaches to coordination between systems — Remote Procedure Calls (RPCs), persistent queues, and workflows — and their relationship to failure management.

1. Request-Response (RPC)

The request-response, or RPC model, is a classic approach. A client makes a request, the server processes it, and sends back a response. In the best-case scenario — the “happy path” — everything works smoothly. Imagine a money transfer request: one service debits the sender while another credits the receiver. If all goes as planned, the transfer completes with no issues.

Pros of the RPC Model

Simplicity: The direct client-server connection makes this model easy to implement for straightforward workflows.
Efficiency on the “happy path”: When things go smoothly, RPC provides fast, efficient responses and low latency.

Cons of the RPC Model

Limited resilience for partial failures: If the client’s request is successful, but a response isn’t received, or a step in the process fails, RPC often requires extensive error-handling code on the client side.
Heavy client burden: Clients must handle errors, recovery, and retries, complicating systems as they scale.
The RPC model works well for simple, synchronous tasks. However, for resilience, it falls short by placing the onus on developers of the RPCs and those consuming them to manage every failure scenario — and this is no trivial matter.

2. Persistent Queues

Persistent queues add a degree of flexibility by decoupling the client from the server. Messages are placed in a queue, and the system processes them asynchronously. Queues help distribute workloads: they support automatic retries and asynchronous processing, which can smooth out demand spikes.

Pros of Persistent Queues

Automatic retries: Persistent queues often support automatic retries, attempting tasks multiple times if they initially fail.
Load distribution: Queues smooth processing under heavy loads, distributing requests over time, to improve system reliability.
Producer-consumer separation: Decoupling producers and consumers allow the queue to function independently, improving fault tolerance.

Cons of Persistent Queues

Loss of ordering: Since queues process messages independently, tasks may execute out of order, causing unexpected issues for dependent operations.
Dead-letter queues: Tasks that continuously fail may require a separate “dead-letter” queue, adding complexity and, typically, manual intervention.
Limited visibility into status: Visibility becomes even more challenging when you have systems that use multiple queues, requiring additional tooling and infrastructure.
Queues work well when you need flexibility and decoupling, but they lack the control and visibility needed for comprehensive failure management.

3. Workflows

Workflows provide a robust solution for orchestrating complex processes across distributed systems. Unlike RPC or queue-based models, workflows manage retries, state, and error handling automatically, making them ideal for long-running or multi-step processes.

Pros of Workflows

Built-in resilience: Workflows handle retries, recovery, and compensation steps automatically, reducing the need for custom error-handling code.
Support for long-running processes: Workflows accommodate processes that span minutes, hours, or even days, making them well-suited for complex tasks.
Enhanced visibility: Workflow systems enable real-time tracking and querying, so both clients and developers can see exactly where each process stands.

Cons of Workflows

Infrastructure requirements: Workflows require a solid infrastructure to manage states, retries, and tracking, which some teams may lack.
Setup complexity: Workflow systems can be complex to set up, especially when building custom solutions to manage workflows.
For complex processes that demand reliability and transparency, workflows provide the most comprehensive solution, though they require dedicated infrastructure to deploy effectively.

Resilience Without Extra Overhead

At Temporal, we addressed these challenges by designing a platform that handles resilience, error handling, and state management so you don’t have to.

With Temporal, you write workflows as code - no extra XML, JSON, or YAML definition of workflow logic that is difficult to understand and debug down the line. Define your steps in regular code, and Temporal does the rest, managing retries, maintaining state, and ensuring that your workflows are reliable and simple to create.

Companies like ANZ Bank, one of the largest banks in the Asia-Pacific region, rely on Temporal to strengthen the resilience and reliability of critical financial processes. With Temporal, ANZ orchestrates and manages complex operations across distributed systems, ensuring tasks are retried automatically, failures are handled, and long-running processes are tracked seamlessly. This has enabled ANZ to boost system reliability, reduce operational complexity, and uphold strict compliance standards in their high-stakes FinServ environment.

Failure Management Is a Strategy, Not a Setback

Any complex system will encounter failures. But how you handle those failures makes all the difference. For developers, focusing on failure management from the start distinguished exceptional teams from the average. Building resilience into your system sets your project up for long-term success.

Time-Travel Debugging Production Code

Loren 🤓 — Tue, 08 Aug 2023 07:03:31 +0000

In this post, I’ll give an overview of time travel debugging (what it is, its history, how it’s implemented) and show how it relates to debugging your production code.

Normally, when we use debuggers, we set a breakpoint on a line of code, we run our code, execution pauses on our breakpoint, we look at values of variables and maybe the call stack, and then we manually step forward through our code's execution. In time-travel debugging, also known as reverse debugging, we can step backward as well as forward. This is powerful because debugging is an exercise in figuring out what happened: traditional debuggers are good at telling you what your program is doing right now, whereas time-travel debuggers let you see what happened. You can wind back to any line of code that executed and see the full program state at any point in your program’s history.

History and current state

It all started with Smalltalk-76, developed in 1976 at Xerox PARC. (Everything started at PARC 😄.) It had the ability to retrospectively inspect checkpointed places in execution. Around 1980, MIT added a "retrograde motion" command to its DDT debugger, which gave a limited ability to move backward through execution. In a 1995 paper, MIT researchers released ZStep 95, the first true reverse debugger, which recorded all operations as they were performed and supported stepping backward, reverting the system to the previous state. However, it was a research tool and not widely adopted outside academia.

ODB, the Omniscient Debugger, was a Java reverse debugger that was introduced in 2003, marking the first instance of time-travel debugging in a widely used programming language. GDB (perhaps the most well-known command-line debugger, used mostly with C/C++) added it in 2009.

Now, time-travel debugging is available for many languages, platforms, and IDEs, including:

Replay for JavaScript in Chrome, Firefox, and Node, and Wallaby for tests in Node
WinDbg for Windows applications
rr for C, C++, Rust, Go, and others on Linux
Undo for C, C++, Java, Kotlin, Rust, and Go on Linux
Various extensions (often rr- or Undo-based) for Visual Studio, VS Code, JetBrains IDEs, Emacs, etc.

Implementation techniques

There are three main approaches to implementing time-travel debugging:

Record & Replay: Record all non-deterministic inputs to a program during its execution. Then, during the debug phase, the program can be deterministically replayed using the recorded inputs in order to reconstruct any prior state.
Snapshotting: Periodically take snapshots of a program's entire state. During debugging, the program can be rolled back to these saved states. This method can be memory-intensive because it involves storing the entire state of the program at multiple points in time.
Instrumentation: Add extra code to the program that logs changes in its state. This extra code allows the debugger to step the program backwards by reverting changes. However, this approach can significantly slow down the program's execution.

rr uses the first (the rr name stands for Record and Replay), as does Replay. WinDbg uses the first two, and Undo uses all three (see how it differs from rr).

Time-traveling in production

Traditionally, running a debugger in prod doesn't make much sense. Sure, we could SSH into a prod machine and start the process handling requests with a debugger and a breakpoint, but once we hit the breakpoint, we're delaying responses to all current requests and unable to respond to new requests. Also, debugging non-trivial issues is an iterative process: we get a clue, we keep looking and find more clues; discovery of each clue is typically rerunning the program and reproducing the failure. So, instead of debugging in production, what we do is replicate on our dev machine whatever issue we're investigating and use a debugger locally (or, more often, add log statements 😄), and re-run as many times as required to figure it out. Replicating takes time (and in some cases a lot of time, and in some cases infinite time), so it would be really useful if we didn't have to.

While running traditional debuggers doesn't make sense, time-travel debuggers can record a process execution on one machine and replay it on another machine. So we can record (or snapshot or instrument) production and replay it on our dev machine for debugging (depending on the tool, our machine may need to have the same CPU instruction set as prod). However, the recording step generally doesn't make sense to use in prod given the high amount of overhead—if we set up recording and then have to use ten times as many servers to handle the same load, whoever pays our AWS bill will not be happy 😁.

But there are a couple scenarios in which it does make sense:

Undo only slows down execution 2–5x, so while we don't want to leave it on just in case, we can turn it on temporarily on a subset of prod processes for hard-to-repro bugs until we have captured the bug happening, and then we turn it off.
When we're already recording the execution of a program in the normal course of operation.

The rest of this post is about #2, which is a way of running programs called durable execution.

Durable execution

What's that?

First, a brief backstory. After Amazon (one of the first large adopters of microservices) decided that using message queues to communicate between services was not the way to go (hear the story first-hand here), they started using orchestration. And once they realized defining orchestration logic in YAML/JSON wasn't a good developer experience, they created AWS Simple Workfow Service to define logic in code. This technique of backing code by an orchestration engine is called durable execution, and it spread to Azure Durable Functions, Cadence (used at Uber for > 1,000 services), and Temporal (used by Stripe, Netflix, Datadog, Snap, Coinbase, and many more).

Durable execution runs code durably—recording each step in a database, so that when anything fails, it can be retried from the same step. The machine running the function can even lose power before it gets to line 10, and another process is guaranteed to pick up executing at line 10, with all variables and threads intact.¹ It does this with a form of record & replay: all input from the outside is recorded, so when the second process picks up the partially-executed function, it can replay the code (in a side-effect–free manner) with the recorded input in order to get the code into the right state by line 10.

Durable execution's flavor of record & replay doesn't use high-overhead methods like software JIT binary translation, snapshotting, or instrumentation. It also doesn't require special hardware. It does require one constraint: durable code must be deterministic (i.e., given the same input, it must take the same code path). So it can't do things that might have different results at different times, like use the network or disk. However, it can call other functions that are run normally ("volatile functions", as we like to call them 😄), and while each step of those functions isn't persisted, the functions are automatically retried on transient failures (like a service being down).

Only the steps that require interacting with the outside world (like calling a volatile function, or calling sleep('30 days'), which stores a timer in the database) are persisted. Their results are also persisted, so that when you replay the durable function that died on line 10, if it previously called the volatile function on line 5 that returned "foo", during replay, "foo" will immediately be returned (instead of the volatile function getting called again). While yes, it adds latency to be saving things to the database, Temporal supports extremely high throughput (tested up to a million recorded steps per second). And in addition to function recoverability and automatic retries, it comes with many more benefits, including extraordinary visibility into and debuggability of production.

Debugging prod

With durable execution, we can read through the steps that every single durable function took in production. We can also download the execution’s history, checkout the version of the code that's running in prod, and pass the file to a replayer (Temporal has runtimes for Go, Java, JavaScript, Python, .NET, and PHP) so we can see in a debugger exactly what the code did during that production function execution. Read this post or watch this video to see an example in VS Code.²

Being able to debug any past production code is a huge step up from the other option (finding a bug, trying to repro locally, failing, turning on Undo recording in prod until it happens again, turning it off, then debugging locally). It's also a (sometimes necessary) step up distributed tracing.

I hope you found this post interesting! If you'd like to learn more about durable execution, I recommend reading:

and watching:

Thanks to Greg Law, Jason Laster, Chad Retz, and Fitz for reviewing drafts of this post.

Technically, it doesn't have line-by-line granularity. It only records certain steps that the code takes—read on for more info ☺️. ↩
The astute reader may note that our extension uses the default VS Code debugger, which doesn’t have a back button 😄. I transitioned from talking about TTD to methods of debugging production code via recording, so while Temporal doesn’t have TTD yet, it does record all the non-deterministic inputs to the program and is able to replay execution, so it’s definitely possible to implement. Upvote this issue or comment if you have thoughts on implementation! ↩

Actors and Workflows: Building a Customer Loyalty Program with Temporal

Fitz — Thu, 03 Aug 2023 17:28:24 +0000

This post is technically a followup of another post. You don't need to read that one to make sense of this one, but it might give some useful background.

That post talked through how the Actor Model can be implemented using "Workflows" (on https://github.com/temporalio/temporal), even though these two concepts don't immediately appear compatible.

Here, I dive into a concrete example: a Workflow representing a customer's loyalty status.

If you want to skip the prose and just jump right into the code, you can find it all in this GitHub repository, with implementations in Go, Java, and Python.

Actor Model Refresher

As formally defined, Actors must be able to do three things:

Send and receive messages
Create new Actors
Maintain state

Exact implementation details vary depending on what framework, library, or tools you're using, but the biggest challenge is having some kind of software artifact running somewhere that can handle these things.

That's where most Actor frameworks come in to help: providing both the programming model and the runtime environment for being able to build an Actor-based application in a highly distributed, concurrent, and scalable way.

Temporal differs here in that it’s general-purpose, rather than specific to one model or system design pattern. With Workflows, you define a function that Temporal will ensure runs to completion (or reliably runs forever, if the function doesn’t return).

I recognize that statement is both rather bold and also so generic as to be hard to disprove. So, let's look at a concrete example.

Loyal Customers

Many consumer businesses have some kind of loyalty program. Buy 10 items, get the 11th free! Fly 10,000 miles, get free access to the airport lounge! Earn one million points over the lifetime of your account, earn a gold star!

At the highest level, the application's logic isn't complex: Each customer has an integer counter that's incremented after the customer does certain things (e.g., buy something, or take a trip). When that counter crosses different thresholds, new rewards are unlocked. And, although we may not like it, customers can always close their accounts.

When we create the diagram for the app, it might look like this:

In terms of the Actor Model, two of the three requirements are on display:

Send and receive messages: A customer can send either an "earn points" message or a "try to use reward" message.
Create new Actors: ??? (This is the Actor requirement not apparent in this application, but we'll see later how it can be incorporated.)
Maintain state: A customer loyalty account needs to maintain the points counter and which rewards are unlocked (or be able to look up this information based on the points value).

Requirement #2, the ability to create other Actors, isn't immediately obvious here, but it isn't too far out of reach. We could define in this example application that one of the rewards for earning enough points is the ability to gift status to someone else, inviting them (i.e., creating their account) to the program if they aren't already a member.

If our goal is to create a demo application for the Actor Model (as it is in this post), then there's actually one other thing missing: the ability for a customer (or rather, their loyalty account) to send messages. For that, we could also declare that customers with enough points can gift points or status levels (i.e., which rewards are unlocked) to their guests. Then they can send messages, too!

Reworking the previous diagram to be more befitting of a full "Actor," we'd get the following:

And, as for the exact implementation details, read on!

Loyal (Temporal!) Customers

Imagine being able to write the customer loyalty program above in just a function or two. Conceptually, that's not hard. In pseudocode, that might look like the following:

INVITE_REWARD_MINIMUM_POINTS = 1000

function CustomerLoyaltyAccount:
    account_canceled = false
    points = 0

    while !account_canceled:
        message = receive_message()
        switch message.type:
            case 'cancel':
                account_canceled = true
            case 'add_points':
                fallthrough
            case 'gift_points':
                points += message.value
            case 'invite_guest'
                if points >= INVITE_REWARD_MINIMUM_POINTS:
                    spawn(new CustomerLoyaltyAccount())

But there are a few crucial details that are, well, rather undefined in this pseudo-function. Specifically:

What's receive_message() doing? How is it receiving messages?
Similarly, what's spawn(new CustomerLoyaltyAccount()) doing?
And most importantly, where is this function running? What happens if that runtime crashes or the function otherwise stops running?

Each of these maps to core Temporal features that we can implement in an example Workflow:

Data can be sent to Workflows via Signals
Workflows can create new Workflow instances
As long as there are Workers running somewhere that can communicate with the Temporal Server, then if the Worker running the function dies, the function will continue running on another (you know, kind of Temporal's main benefit)

Customers Go Loyal

Let's build this up in Go. If you are more comfortable with other languages, I've also written the same Workflow in Python and Java. While the languages are different, most of the same concepts and patterns should carry over.

(For brevity in the body of this blog post, I'll in most cases omit error handling but include it when non-trivial and relevant.)

First, we write the skeleton of a Workflow and an Activity. For some of the milestones in a customer's lifecycle, it'd be nice to send them some kind of notification. In a real application, you'd call out to SendGrid, Mailchimp, Constant Contact, or some other email provider, but for simplicity's sake, I'm just logging out the details. This initial Workflow does just that: if it's a new customer, send a welcome email, but otherwise move on.

func CustomerLoyaltyWorkflow(ctx workflow.Context, customer CustomerInfo, newCustomer bool) error {
    logger := workflow.GetLogger(ctx)
    logger.Info("Loyalty workflow started.", "CustomerInfo", customer)

  var activities Activities

    if newCustomer {
        logger.Info("New customer workflow; sending welcome email.")
        err := workflow.ExecuteActivity(ctx, activities.SendEmail,
            fmt.Sprintf("Welcome, %v, to our loyalty program!", customer.Name)).
            Get(ctx, nil)
        if err != nil {
            logger.Error("Error running SendEmail activity for welcome email.", "Error", err)
        }
    } else {
        logger.Info("Skipping welcome email for non-new customer.")
    }

    // ... [to be added later] ... //

    return nil
}

type Activities struct {
    Client client.Client
}

func (*Activities) SendEmail(ctx context.Context, body string) error {
    logger := activity.GetLogger(ctx)
    logger.Info("Sending email.", "Contents", body)
    return nil
}

Next up, we need to be able to handle messages. This is the primary thing the Workflow (i.e., customer loyalty Actor) does: sit around waiting for new messages to come in.

The following code replaces the // ... [to be added below] ... // line from the previous snippet:

  selector := workflow.NewSelector(ctx)

    // Signal handler for adding points
    selector.AddReceive(workflow.GetSignalChannel(ctx, "addPoints"),
        func(c workflow.ReceiveChannel, _ bool) {
            signalAddPoints(ctx, c, &customer)
        })

    // Signal handler for canceling account
    selector.AddReceive(workflow.GetSignalChannel(ctx, "cancelAccount"),
        func(c workflow.ReceiveChannel, _ bool) {
            signalCancelAccount(ctx, c, &customer)
        })

    // ... [register other Signal handlers here] ... //

  logger.Info("Waiting for new messages")
    for customer.AccountActive {
        selector.Select(ctx)
    }

The Signal handler function for adding points does very little, adding in the given points to the customer's state and then sending an email to the customer with the new value.

As you might imagine, the cancel account handler is very similar, setting the customer.AccountActive flag used above to false and then notifying the customer.

func signalAddPoints(ctx workflow.Context, c workflow.ReceiveChannel, customer *CustomerInfo) {
    logger := workflow.GetLogger(ctx)
    var activities Activities

    var pointsToAdd int
    c.Receive(ctx, &pointsToAdd)

    logger.Info("Adding points to customer account.", "PointsAdded", pointsToAdd)
    customer.LoyaltyPoints += pointsToAdd

    err := workflow.ExecuteActivity(ctx, activities.SendEmail,
        fmt.Sprintf("You've earned more points! You now have %v.", customer.LoyaltyPoints)).
        Get(ctx, nil)
    if err != nil {
        logger.Error("Error running SendEmail activity for added points.", "Error", err)
    }

    // ... [insert logic for unlocking status levels or rewards] ... //
}

All combined, the code so far does three things:

First, it registers the signalAddPoints and signalCancelAccount functions as the handlers for the "addPoints" and "cancelAccount" Signals, respectively.
Then, it blocks forward progress on the Workflow, via selector.Select(ctx), until a registered Signal comes in. Unless that Signal is "cancelAccount," the Workflow will keep looping on this select.
I've chosen for this application to not fail the Workflow when an email fails to send. This keeps the Workflow representing the customer's loyalty account active and running even in spite of external system failure.
- For that, you'll want to set an appropriate retry policy to ensure that the Workflow doesn't completely block on email failures, for example by setting the MaximumAttempts to a reasonably low number like 10.

Already this gives us most of the application. We have a function that runs perpetually, thanks to Temporal, and can receive two different kinds of messages, both of which modify the state of the Workflow with one that also results in the Workflow finishing.

What remains is a couple of more Temporal-specific considerations.

Long-Lived Customers

In my last post, I spilled many words on the topic of "Continue-As-New." If you didn't—or don't want to!—read those words, the gist is this: at some point, a Workflow's history may get unwieldily big; Continue-As-New resets it.

For this customer loyalty example Workflow, the far-and-away biggest contributor to the Event History is the number of events, not the size. With the addPoints Signal only taking a single integer argument and the cancelAccount Signal taking none, the combined contribution to the size of the history is minimal.

A Signal with only a single integer parameter will, by itself, contribute one Event and about 500 bytes to the History, even with very large values. And so, how many of these Signals would be required to hit either the size or length limits?

If nothing else happened but addPoints Signals, it'd take 51,200 of them to reach the length limit, but 50 * 1024 * 1024 / 500 or 104,857.6 to reach the size limit. Knowing that many of these Signals will result in the SendEmail Activity running, and each Activity contributes a handful of (small) events to the history, this Workflow will hit the History length limit well before the size limit.

So, let's add a check for that into our Workflow loop:

    const eventsThreshold = 10_000
    // ... snip ...

    info := workflow.GetInfo(ctx)

    logger.Info("Waiting for new messages")
    for customer.AccountActive && info.GetCurrentHistoryLength() < eventsThreshold {
        selector.Select(ctx)
    }

Finally, trigger Continue-As-New as needed, draining any pending signals:

    if customer.AccountActive {
        logger.Info("Account still active, but hit continue-as-new threshold.")
        // Drain signals before continuing-as-new
        for selector.HasPending() {
            selector.Select(ctx)
        }
        return workflow.NewContinueAsNewError(ctx, CustomerLoyaltyWorkflow, customer, false)
    }

My previous post on this topic explained in a little more detail about why it's necessary to drain signals before continuing-as-new. To briefly recap, Continue-As-New finishes the current Workflow run and starts a new instance of the Workflow regardless of any pending Signals. If we don't drain (and handle!) Signals before calling workflow.NewContinueAsNewError (or workflow.continue_as_new in Python, or Workflow.continueAsNew in Java), those pending Signals will be forever lost.

The last major thing this Workflow needs to make it a true, stage-worthy Actor is the ability to create others.

Spawning New Customers

While Temporal has support for Parent/Child relationships between Workflows, in this customer loyalty application, the only thing we need is the ability to send a message from one to the other in the case of gifting status or points.

Temporal provides an API in the Client that can do this and create other Workflows all in one call, called Signal-with-Start. Since this is only available in the Client, not from a Workflow, we'll need to do this in an Activity.

First, I'm setting the ID Reuse Policy to REJECT. This is in some ways a "business logic" kind of decision, where I'm declaring that once a customer's account is closed, it can't be re-invited. (Note that after a namespace's retention period has passed, IDs from closed Workflows can be reused regardless of this policy, and so in a real-life production version of this app, you'd want to have this check an external source for customer account statuses.)

func (a *Activities) StartGuestWorkflow(ctx context.Context, guest CustomerInfo) error {
    // ...
    workflowOptions := client.StartWorkflowOptions{
        TaskQueue:             TaskQueue,
        WorkflowIDReusePolicy: enums.WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE,
    }

Then, we can call Client.SignalWithStartWorkflow:

    logger.Info("Starting and signaling guest workflow.", "GuestID", guest.CustomerID)
    _, err := a.Client.SignalWithStartWorkflow(ctx, CustomerWorkflowID(guest.CustomerID),
        SignalEnsureMinimumStatus, guest.StatusLevel.Ordinal,
        workflowOptions, CustomerLoyaltyWorkflow, guest, true)

Note the use of the Client from the Activities receiver struct! I'm making use of something in the way Temporal works in Go: if, when we instantiate and register the Activities in the Worker, we also set this Client, then the same connection will be available within the Activities. This way, we don't have to worry about re-creating the Client.

I'm also ignoring the returned future from SignalWithStartWorkflow via a Go convention of assigning to _; because this "guest" Workflow is expected to run indefinitely long, blocking on its results would prevent the original Workflow from doing anything else. Since the future returned from starting a Workflow is either used for waiting for the Workflow to finish, or getting its IDs (which we already know from the CustomerWorkflowID(guest.CustomerID) call), we can safely ignore it.

But, it's still necessary to handle the error. With the ID Reuse Policy set to REJECT, retrying the resulting error from trying to start a an already-closed Workflow will get us nowhere, and so we should instead send some useful information back to the Workflow:

    target := &serviceerror.WorkflowExecutionAlreadyStarted{}
    if errors.As(err, &target) {
        return GuestAlreadyCanceled, nil
    } else if err != nil {
        return -1, err
    }

    return GuestInvited, nil
}

// ... [Defined at top] ...
type GuestInviteResult int

const (
    GuestInvited GuestInviteResult = iota
    GuestAlreadyCanceled
)

Back in the Workflow, after running this Activity I can then check for that error and notify the customer as appropriate. As before, I'm allowing the Workflow to continue if sending the email failed. But if that SignalWithStartWorkflow call failed for any reason other than the guest's account already existing, I want to make some noise and fail the Workflow—something unusual is likely happening.

var inviteResult GuestInviteResult
err := workflow.ExecuteActivity(ctx, activities.StartGuestWorkflow, guest).
    Get(ctx, &inviteResult)
if err != nil {
    return fmt.Errorf("could not signal-with-start guest/child workflow for guest ID '%v': %w", guestID, err)
}

if inviteResult == GuestAlreadyCanceled {
    emailToSend = "Your guest has canceled!"
} else {
    emailToSend = "Your guest has been invited!"
}

err := workflow.ExecuteActivity(ctx, activities.SendEmail, emailToSend).Get(ctx, nil)

This snippet of code would end up being in a Signal handler for something like an "invite guest" Signal. The handler would also include, as discussed at the top of this post, a check for if the current customer is even allowed to do this action.

Summing it all up

There are a few other things to explore in this app, like catching a cancellation request or looking through the tests, but this post has gotten long enough as it is. 🙂

Hopefully this post serves as a nice "close-to-real-world" example for you of how to build something that looks like an "Actor"—aka, a really, really long running Workflow that can send and receive messages and maintain state without a database—using Temporal.

For more information related to this post and about Temporal, check out the following links:

This post's source code (As of publishing, available in Java, Go, and Python)
Actors & Workflows, Part 1
SignaIWithStart
Documentation
Developer Guide

And the best way to learn Temporal is with our free courses.

Cover image from John Jennings on Unsplash