Why Rust powers Temporal’s new Core SDK


Perhaps you’ve heard that here at Temporal, we’re working on new SDKs to support more languages. I’m an engineer on our SDK team, and I’m writing here to elaborate on our challenges and how we’re meeting them.

Temporal faces some unique challenges with respect to our client-side libraries, namely:

The client-side logic isn't just a thin wrapper over some HTTP/gRPC/etc. calls. The SDK needs to handle the reconciliation of events and their related state changes, which may be generated at different times by different actors in the distributed system that is a Temporal deployment. More on this below.
We want to support as many languages as possible, while avoiding duplicating the complex logic in each language. On top of that, we need to present an idiomatic-to-their-language interface to our users.
We expect instances of the SDK (workflow and activity workers) to often be long-lived, and hence the SDK must be extremely reliable.

In this post, we'll dive into the points mentioned above and explain why we chose to write the Core SDK in Rust to help meet these goals. Some familiarity with Temporal's programming model will be helpful; the Temporal documentation covers it in more detail.

A Temporal SDK provides the APIs needed to author Workflows and Activities in your language of choice, as well as the behind-the-scenes logic required to drive them. It lets you write durable, long-lived business logic without worrying about burdensome retries or other transient failure concerns, all in a way that feels natural to your language.

What's complex about it?

From a 10,000-foot view, a Temporal worker follows this algorithm when running a workflow:

1. Long poll the server for workflow tasks (i.e.: Server says "I need you to run the user's workflow code").
2. Apply the event history contained in the task to a collection of state machines associated with the workflow.
3. Run the user's workflow code, appropriately providing values from history for the results of Activities, firing Timers and Signals, etc. until the workflow code eventually blocks on something that isn't in history yet, or exits the workflow function.
4. Reply to the server, possibly telling it about some new commands (the things we blocked on, for example: "I want to start this timer") that have been generated by the user's code. Otherwise, notify that the workflow is completed.
5. goto 1
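
The five steps above can be sketched as a small loop. This is a toy illustration under heavy assumptions, not Temporal's implementation: `FakeServer`, `WorkflowTask`, `HistoryEvent`, `Command`, and `run_workflow` are hypothetical names invented here, the "server" is an in-memory queue, and steps 2 and 3 are collapsed into one function that only checks history for a fired timer.

```rust
use std::collections::VecDeque;

/// Hypothetical history events; real Temporal histories are far richer.
#[derive(Debug, Clone, PartialEq)]
enum HistoryEvent {
    WorkflowStarted,
    TimerFired { timer_id: u32 },
}

/// Hypothetical commands the worker sends back to the server.
#[derive(Debug, Clone, PartialEq)]
enum Command {
    StartTimer { timer_id: u32 },
    CompleteWorkflow,
}

/// A workflow task carries the event history to replay.
struct WorkflowTask {
    history: Vec<HistoryEvent>,
}

/// In-memory stand-in for the server: hands out tasks, records replies.
struct FakeServer {
    tasks: VecDeque<WorkflowTask>,
    replies: Vec<Vec<Command>>,
}

impl FakeServer {
    fn new(tasks: Vec<WorkflowTask>) -> Self {
        FakeServer { tasks: tasks.into(), replies: Vec::new() }
    }

    /// Step 1: a real worker would long poll over the network here.
    fn poll(&mut self) -> Option<WorkflowTask> {
        self.tasks.pop_front()
    }

    /// Step 4: report the commands generated by the workflow code.
    fn reply(&mut self, commands: Vec<Command>) {
        self.replies.push(commands);
    }
}

/// Steps 2 and 3, collapsed: replay history, then "run" a toy workflow
/// that waits on timer 1 and completes once that timer has fired.
fn run_workflow(history: &[HistoryEvent]) -> Vec<Command> {
    let timer_fired = history
        .iter()
        .any(|e| matches!(e, HistoryEvent::TimerFired { timer_id: 1 }));
    if timer_fired {
        vec![Command::CompleteWorkflow]
    } else {
        // The workflow blocked on something not yet in history.
        vec![Command::StartTimer { timer_id: 1 }]
    }
}

/// Step 5 is the loop itself; it ends only because the fake queue drains.
fn worker_loop(server: &mut FakeServer) {
    while let Some(task) = server.poll() {
        let commands = run_workflow(&task.history);
        server.reply(commands);
    }
}
```

Feeding this loop a task whose history contains only `WorkflowStarted` produces a `StartTimer` command; a later task whose history also contains `TimerFired { timer_id: 1 }` produces `CompleteWorkflow`.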

As you might imagine, steps 2 and 3 are pretty complex, especially step 2. This complexity arises from the huge number of combinations of actions that need to be taken as a result of workflow history being fed into these state machines. In turn, they determine what happens in the user's workflow code, and what needs to be sent to the server.

The state machine for timers, for example, encodes the logic that determines when we tell the server that it needs to track a new timer or when one should be cancelled. It also determines whether the user's workflow code should stay blocked on a timer. It's one of the simplest machines, and there are about 16 of them we need to implement.
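
To make the idea concrete, here is a minimal sketch of what a timer state machine might look like. The states, inputs, and outputs are assumptions chosen for illustration (`TimerState`, `TimerInput`, `TimerOutput`, and `transition` are hypothetical names); the real machine in the Core SDK has more states and handles more cases.

```rust
/// Hypothetical states for a single timer.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimerState {
    Created,
    StartCommandSent,
    Fired,
    Canceled,
}

/// Hypothetical inputs: two come from workflow code, one from history.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimerInput {
    Schedule,       // workflow code asked for a timer
    FiredInHistory, // history says the timer fired
    Cancel,         // workflow code cancelled the timer
}

/// What the rest of the worker should do after a transition.
#[derive(Debug, PartialEq)]
enum TimerOutput {
    SendStartTimer,  // tell the server to track a new timer
    SendCancelTimer, // tell the server to cancel the timer
    UnblockWorkflow, // let the workflow code continue past the timer
}

/// Apply one input to the machine, returning the next state and the
/// action to take, or an error for a transition this sketch disallows.
fn transition(
    state: TimerState,
    input: TimerInput,
) -> Result<(TimerState, TimerOutput), String> {
    use TimerInput::*;
    use TimerState::*;
    match (state, input) {
        (Created, Schedule) => Ok((StartCommandSent, TimerOutput::SendStartTimer)),
        (StartCommandSent, FiredInHistory) => Ok((Fired, TimerOutput::UnblockWorkflow)),
        (StartCommandSent, Cancel) => Ok((Canceled, TimerOutput::SendCancelTimer)),
        (s, i) => Err(format!("invalid transition: {:?} on {:?}", s, i)),
    }
}
```

Encoding the transitions as a single `match` over `(state, input)` keeps every legal combination in one place, and anything not listed is rejected rather than silently ignored.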

What may not be immediately clear is that there’s nothing language-specific about these state machines. Conceptually, they translate actions taken in your workflow code to commands that must be sent to the Temporal server. In the other direction, they translate workflow history into new information exposed to your workflow. This “translation” is the same regardless of what language your workflow is written in. In fact, there’s no reason why semantically identical workflows written in different languages running on different workers couldn’t handle each other’s histories - though you probably wouldn't want to do this.

Yet, as it stands, each of our existing SDKs re-implements this difficult logic. Clearly, we don't want to repeat this for each language. We need some kind of Core SDK that all other language SDKs can be built upon.

Towards a shared core

It’s clear we could substantially accelerate the development of new SDKs and increase the maintainability of existing ones by building a shared common core library used by the language-specific SDKs.

We knew any design would need to meet the following requirements:

Clean integration with other languages
Good ergonomics for the end user (i.e., avoid imposing new operational requirements)
High performance
Maintainable

Those requirements are pretty restrictive. To expand on the operational requirements: it's desirable from a packaging and performance perspective to be able to live in the same process as the language-specific SDK. For an end user that means they can simply deploy one binary which will run their worker, rather than needing to deploy the core SDK separately.

To implement the core and meet the requirements, we need a hero to rise to the challenge! Enter... Rust.

Why Rust?

There are a lot of good reasons to pick Rust, some of which could fill up entire separate blog posts. The same reasons it's gaining in popularity so quickly these days apply to why we chose it: "fearless concurrency", "performance and safety", a quality type system, etc. These reasons check the “high performance” and “maintainable” boxes.

When it comes to connecting multiple languages to a shared library, the traditional choice is often C, but Rust makes for a safer and more modern alternative. Having language SDKs directly link to the Rust core meets our end-user-ease goal and keeps overhead low.

There are other ways we could've tackled the problem (for example by running another process that communicates with the language-specific SDK over some kind of IPC), but those options probably fail our ease-of-use goal. If desirable, we can always run out-of-process later because we use Protobufs to represent data passed between the Core and language-specific SDKs. This technique also reduces the amount of duplicate code we would have to write for each new language.
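
As a rough illustration of the in-process approach, the sketch below shows one way a Rust library could accept and return opaque byte buffers across the C ABI. It is not the Core SDK's actual interface: `ByteBuffer`, `core_handle_message`, and `core_free_buffer` are hypothetical names, and the raw bytes merely stand in for encoded Protobuf messages (no Protobuf library is used here).

```rust
use std::{ptr, slice};

/// Hypothetical owned byte buffer handed back across the C ABI.
#[repr(C)]
pub struct ByteBuffer {
    pub ptr: *mut u8,
    pub len: usize,
}

/// Hypothetical entry point a language SDK could call. A real exported
/// symbol would also need the `no_mangle` attribute; it is omitted here
/// so the sketch compiles unchanged across Rust editions.
///
/// # Safety
/// `input` must be null or point to `input_len` readable bytes.
pub unsafe extern "C" fn core_handle_message(
    input: *const u8,
    input_len: usize,
) -> ByteBuffer {
    let request: &[u8] = if input.is_null() {
        &[]
    } else {
        unsafe { slice::from_raw_parts(input, input_len) }
    };

    // Stand-in for real work: reply with a status byte (0 = OK)
    // followed by the request bytes. A real core would decode the
    // request, drive its state machines, and encode a response.
    let mut response = Vec::with_capacity(request.len() + 1);
    response.push(0u8);
    response.extend_from_slice(request);

    // Transfer ownership of the bytes to the caller.
    let boxed: Box<[u8]> = response.into_boxed_slice();
    let len = boxed.len();
    let ptr = Box::into_raw(boxed) as *mut u8;
    ByteBuffer { ptr, len }
}

/// Give a buffer back to Rust so it can be freed by the allocator
/// that created it.
///
/// # Safety
/// `buf` must come from `core_handle_message` and be freed only once.
pub unsafe extern "C" fn core_free_buffer(buf: ByteBuffer) {
    if !buf.ptr.is_null() {
        unsafe {
            drop(Box::from_raw(ptr::slice_from_raw_parts_mut(buf.ptr, buf.len)));
        }
    }
}
```

Because only serialized bytes cross the boundary, the same messages could later be sent over IPC to a separate process without changing the language-side code that builds and parses them.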

Rust also very easily compiles to WASM, which unlocks some very interesting possibilities for us that we'll likely discuss in a future blog post.

There's one other reason that matters a lot to me personally: Rust is fun to write. That isn't said often (though it certainly has been said), but I think it's a huge part of the reason the language has grown in popularity so quickly. It presents a challenging mental model that, once internalized, provides some deeply rewarding "a-ha!" moments while also bolstering you with that "if it compiles, it probably works" confidence.

What's next?

In the future, we plan to support a larger selection of languages than we do currently, all built on top of the common core. We expect Core to provide a stable, well-tested basis for new language SDKs, all of which can benefit from its reliability and performance.

Keep your eyes peeled for the first alpha release of our Node.js SDK, which will be coming quite soon. It is built on top of the Rust core that we've been developing in tandem. We'll announce that release here on the blog as well as on other communication channels.

Lastly, we plan to port our existing Go SDK to the Rust core as well. We're excited to grow the number of languages supported by Temporal, and bring you rock-solid reliability while doing it!
