lostghost

Programming language of the future Part 4: Distributed Actors

A good programming language should model its domain in the most straightforward way - that way there are no rough edges and no friction. The programmer has no gotchas to work around; the code flows naturally. Distributed programming in modern mainstream languages is not that experience, and in this blog we will go over a part of the solution, the one having to do with coordination of distributed actors.

Asynchronous Actors

OOP models a program as an interaction between objects, and its implementation in most modern languages works well in a single-threaded context. But as soon as concurrency is introduced, you either bring in a whole different mechanism than the objects you've been building with up to this point (threads, mutexes and thread-local variables), or an event loop and a split in the language - where half the language has async-await all over it, the other half doesn't, and the two halves don't work well together.

The solution is the actor model, in which objects interact as actors: they exchange messages and process them one at a time, asynchronously. Parallelism is achieved between different objects. And the fact that any one object processes one message at a time serves as the natural point of synchronisation - you don't need mutexes, and you don't need shared data; you share messages, and thus copies of data. There is no second mechanism to work with - it's just objects, which now work asynchronously. No split in the language either: code sections dispatching requests to a bunch of objects and aggregating the results later can be marked as asynchronous explicitly, or be made asynchronous under the hood.
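
Here is a minimal sketch of that idea in Python, with asyncio standing in for the hypothetical runtime - the Actor base class and the Counter example are invented for illustration, not part of any real framework:

```python
import asyncio

class Actor:
    """An object whose state is only touched by its own mailbox loop."""
    def __init__(self):
        self._mailbox: asyncio.Queue = asyncio.Queue()

    async def send(self, message):
        await self._mailbox.put(message)

    async def run(self):
        # One message at a time: this loop is the only code that touches
        # the actor's state, so no mutexes are needed.
        while True:
            message = await self._mailbox.get()
            await self.receive(message)
            self._mailbox.task_done()

    async def receive(self, message):
        raise NotImplementedError

class Counter(Actor):
    def __init__(self):
        super().__init__()
        self.count = 0

    async def receive(self, message):
        if message == "increment":
            self.count += 1

async def main():
    counter = Counter()
    worker = asyncio.create_task(counter.run())
    # Callers share messages (copies of data), never the actor's state.
    for _ in range(100):
        await counter.send("increment")
    await counter._mailbox.join()   # wait for the mailbox to drain
    print(counter.count)            # 100
    worker.cancel()

asyncio.run(main())
```

Even with a hundred concurrent senders there is no lock anywhere - serialising messages through the mailbox is the synchronisation.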

In fact, synchronous and asynchronous code aren't polar opposites - one is a special case of the other. Synchronous code is just asynchronous code where every new instruction awaits the result of the previous one, and "the stack" is an optimization for storing the intermediate results of such a series of coroutines. But asynchronous execution on an event loop is the general mechanism - and we are interested in the general solution. So making all code implicitly asynchronous, where the results of operations are awaited on first use, is a good idea.
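
To make "awaited on first use" concrete, here is a rough approximation with asyncio, where explicitly created tasks play the role of the implicit futures the language would hand out (fetch_price is an invented placeholder for a remote call):

```python
import asyncio

async def fetch_price(item: str) -> int:
    await asyncio.sleep(0.1)      # stand-in for a remote call
    return len(item) * 10

async def main():
    # In the hypothetical language every call would return such a handle
    # implicitly; asyncio just makes the machinery visible.
    apples = asyncio.create_task(fetch_price("apples"))    # starts immediately
    oranges = asyncio.create_task(fetch_price("oranges"))  # runs in parallel
    # ... other work can happen here ...
    # "Await on first use": the values are only forced where they are needed.
    total = await apples + await oranges
    print(total)

asyncio.run(main())
```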

How much consistency is enough?

We're building a distributed system, and that comes with its own set of challenges. You can't have Consistency, Availability and Partition Tolerance all at once - you have to choose two. And over time, the industry has been moving from CA to AP. The issue is that the ergonomics of AP are much worse, at least with the tools available today. That is fixable.

I'll argue we don't need Consistency, because it's an illusion anyway - no two computers are fully synchronous. You can emulate it to a point, but the price of the emulation is steep. Namely, you can only have one node you write to - the rest have to be read-only followers. And if there's an issue with replication and a latency spike, the illusion breaks: the follower nodes either start serving stale information, or the leader node stops accepting writes. Additionally, every write needs to ensure that the data reaches every replica, which doesn't scale.

Consistency is good because actions are taken based on information, but with a delay - and it is good if, between observing the information and acting on it, the information doesn't change. For a multi-threaded program, that means calling a section that first observes data and then takes action a "critical section", and putting a lock around it. This has a performance penalty, which is why lock-free and wait-free data structures are popular. But it also, in a way, bends reality by making the observation of state a special action - and as with quantum physics, that only works at the small scale. A computer is part of the larger world: it can put mutexes on things to keep its internal state consistent, but it can't put mutexes on the reality around it, which moves at its own pace.

In fact, consistency at the small scale is needed to protect internal state and prevent data corruption. But if you try to scale it out and put mutexes on other things, you reduce parallelism and introduce latency. And for protecting internal state, the actor model is enough - since an object processes one message at a time, its internal state is always consistent. Outside of it, you should be building around eventual consistency. But how?

Embracing Eventual Consistency

How does a program deal with eventual consistency? How can a program make a decision based on information, if that information can change under it?

A program doesn't just read information - it checks for conditions. If we want to protect every possible condition from changing, we can introduce locking, which is suboptimal. Instead, we can have the program poll for a condition and proceed with the transaction while it holds. If it stops holding in the middle of a transaction, we need to roll back all of the steps taken up to that point, in reverse order. That is what the Saga pattern achieves.
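
A bare-bones sketch of the Saga pattern - the Step and run_saga names and the booking flow are invented for illustration - where each forward step carries a compensating step, and a failure undoes the completed steps in reverse order:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    action: Callable[[], None]        # forward step of the transaction
    compensate: Callable[[], None]    # how to undo it if a later step fails

def run_saga(steps: List[Step]) -> bool:
    done: List[Step] = []
    try:
        for step in steps:
            step.action()             # a step raises if its condition no longer holds
            done.append(step)
        return True
    except Exception:
        # Roll back everything completed so far, in reverse order.
        for step in reversed(done):
            step.compensate()
        return False

# Hypothetical booking flow.
saga = [
    Step(lambda: print("reserve seat"), lambda: print("release seat")),
    Step(lambda: print("charge card"),  lambda: print("refund card")),
    Step(lambda: print("issue ticket"), lambda: print("void ticket")),
]
print("committed" if run_saga(saga) else "rolled back")
```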

But we could be polling for data from different nodes at random, where each node holds a different replica of the data. How do we deal with the timeline going back and forth between nodes? We need versioning of objects: the program would, under the hood, remember the last version of an object that it observed (or, if it modified the object, the version after the modification), and when polling for that object it would include that version - and the node would either respond with a version of the object at least as fresh, or return a temporary error. And for causal relationships to propagate down the call stack, the last observed versions should propagate with them - that way, history progresses monotonically for the program and never jumps backwards.
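
A toy sketch of that versioning scheme - Replica, Client and the StaleReplica error are all hypothetical names - showing the read path: the client sends the freshest version it has seen, and a lagging replica refuses to answer:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Versioned:
    version: int
    value: object

class StaleReplica(Exception):
    """The replica has not caught up to the version the caller already saw."""

class Replica:
    def __init__(self):
        self.objects: Dict[str, Versioned] = {}

    def read(self, key: str, min_version: int) -> Versioned:
        obj = self.objects.get(key, Versioned(0, None))
        if obj.version < min_version:
            raise StaleReplica(key)   # temporary error: caller retries elsewhere
        return obj

class Client:
    """Tracks the freshest version it has observed per object (its causal context)."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.seen: Dict[str, int] = {}

    def read(self, key: str):
        for replica in self.replicas:            # replicas may be polled at random
            try:
                obj = replica.read(key, self.seen.get(key, 0))
            except StaleReplica:
                continue
            self.seen[key] = max(self.seen.get(key, 0), obj.version)
            return obj.value
        raise StaleReplica(key)

# Hypothetical usage: one replica lags behind a write the client already saw.
a, b = Replica(), Replica()
a.objects["cart"] = Versioned(2, ["apples", "pears"])
b.objects["cart"] = Versioned(1, ["apples"])    # stale
client = Client([b, a])
client.seen["cart"] = 2          # e.g. the client just wrote version 2
print(client.read("cart"))       # skips the stale replica: ['apples', 'pears']
```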

This is how a program would observe eventually consistent state correctly. What about modifying it?

CQRS and Event Sourcing

Ensuring idempotent mutations is tricky, especially if the logic of the application itself is not idempotent. Then you just have to bite the bullet and slap an ID onto everything. The IDs then need to not collide, to be stored, to be checked - all a giant mess.

Wouldn't it be far less of a mess if the language runtime took care of all of it? The programmer would cleanly separate functions into commands, queries, and primitives. Commands return their status - pending, success, failure - and are idempotent; they can call other commands or queries. Queries are non-mutating and can call other queries. Primitives are regular functions, which only run locally and call other primitives. The runtime takes care of all the infrastructure for that.
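
A rough sketch of what such a runtime could do for commands, with a hand-rolled decorator standing in for it - the command log, the Status enum and the withdraw example are invented for illustration:

```python
import uuid
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILURE = "failure"

# What a runtime could do behind the scenes: remember the outcome of every
# command ID so that retries do not re-apply the mutation.
_command_log: dict = {}

def command(fn):
    def wrapper(*args, command_id=None, **kwargs):
        command_id = command_id or str(uuid.uuid4())
        if command_id in _command_log:            # retry: return the recorded status
            return _command_log[command_id]
        _command_log[command_id] = Status.PENDING
        try:
            fn(*args, **kwargs)
            _command_log[command_id] = Status.SUCCESS
        except Exception:
            _command_log[command_id] = Status.FAILURE
        return _command_log[command_id]
    return wrapper

balances = {"alice": 100}

@command
def withdraw(account: str, amount: int):          # not idempotent by itself
    balances[account] -= amount

cid = "transfer-42"
print(withdraw("alice", 10, command_id=cid))      # Status.SUCCESS
print(withdraw("alice", 10, command_id=cid))      # deduplicated: no second withdrawal
print(balances["alice"])                          # 90
```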

In fact, the need to store an ever-growing log of command IDs can be turned into a feature by recording updates into an event log, also handled entirely by the runtime. The event log is good for replication, for looking back through history (especially code history, since code is objects), for recovering from failures by re-applying events from the log, and for evening out the write load on physical storage - while a snapshot of the current state is kept in RAM.
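
A minimal sketch of that split between the event log and the in-RAM snapshot - EventLog, AccountState and the replay loop are illustrative, not a real runtime:

```python
from typing import List, Tuple

class EventLog:
    """Append-only history; in a real runtime this lives on durable storage."""
    def __init__(self):
        self.events: List[Tuple[str, dict]] = []

    def append(self, kind: str, payload: dict):
        self.events.append((kind, payload))

class AccountState:
    """Snapshot of current state, kept in RAM and rebuilt by replaying events."""
    def __init__(self):
        self.balance = 0

    def apply(self, kind: str, payload: dict):
        if kind == "deposited":
            self.balance += payload["amount"]
        elif kind == "withdrawn":
            self.balance -= payload["amount"]

log = EventLog()
state = AccountState()

def deposit(amount: int):
    log.append("deposited", {"amount": amount})    # record the fact first
    state.apply("deposited", {"amount": amount})   # then update the snapshot

deposit(50)
deposit(25)

# Recovery after a crash: replay the log into a fresh snapshot.
recovered = AccountState()
for kind, payload in log.events:
    recovered.apply(kind, payload)
print(recovered.balance)   # 75
```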

Conclusion

The approach outlined in this blog has been proven scalable and robust by industry experience, but it is less ergonomic - and that is what integration with the language addresses. In the next blog, we will look at how programs are to be modeled - and modeled correctly.
