DEV Community

Is Cooperative Concurrency Here to Stay?

Nested Software on September 18, 2018

Introduction Cooperative concurrency has been gaining momentum in Web applications over the past decade. Node.js uses it. It's been impl...

edA‑qa mort‑ora‑y • Sep 18 '18

A nice overview to the differences in multitasking. I'd like to add a few points though:

Switching between threads is on the order of nanoseconds. Much of this overhead comes from switching in and out of kernel space, which means the same overhead is incurred by any kernel call, such as IO. I doubt that the thread-switching overhead has more than a trivial impact on the concurrency model.
Cooperative multi-tasking servers still tend to run in parallel mode as well. If you have a host with four cores, you'll run four instances of NodeJS behind NGINX. You don't have to worry about data races, but synchronization of data is still an issue.
Cooperative servers rely heavily on libraries and local servers that have been built to support the cooperative model. Your server code behaves like glue that holds all the bits together.

Nested Software • Sep 18 '18 • Edited

Thank you, yes. For example, NGINX runs a worker process per CPU core as you describe (nginx.com/blog/inside-nginx-how-we...). And as you point out, it does have to deal with blocking third-party dependencies, including the OS (nginx.com/blog/thread-pools-boost-...).

Re thread switching, I looked at this article which seems to indicate it takes in the range of microseconds. What do you think?

blog.tsunanet.net/2010/11/how-long...

...Intel L5630: ~1600ns/process context switch, ~1400ns/thread context switch
Intel E5-2620: ~1600ns/process context switch, ~1300ns/thread context switch...

Zuodian Hu • Sep 18 '18

Microseconds sounds right. I wrote a context switcher in ARM in college, and know the switching itself tends to be pretty run of the mill as performance goes, just moving stuff between memory locations, so it's mostly memory bound. However, a context switch implies a scheduler run, and then you're additionally incurring scheduler run time on the CPU. The scheduler then has to access the data structure the kernel keeps to track schedulable entities (threads in Linux, I think).

Now, you're switching to a kernel context and filling the L1 with kernel memory, then running the scheduler, then switching to the new user context and filling the L1 with the new user memory.

Just thinking about it this way, you're easily going to hit microseconds on a typical context switch.

edA‑qa mort‑ora‑y • Sep 18 '18

We need to be careful about what we're measuring and comparing. A "context switch" will happen anytime you make an OS call, this includes the IO functions I mentioned. There is definitely overhead in this alone, even if we don't get into scheduling.

For a thread switch there will be more overhead, as Zuodian mentions. On Linux, or any OS, I don't think the actual process selection takes much time (these aren't complex data structures). The switching of page/privilege tables possibly.

But, here's an important aspect. When you switch to a new thread you have different memory pointers and your caches, especially the L0/L1 levels may not have that data in memory. This causes new loads to be requested. This will also happen if you have green threads and switch in user space, since your cooperative "threads" also have different memory.

Without a concrete comparison of use-cases written in both approaches I still think it'd be hard to say whether the actual context/thread switch is a significant part of the problem.

Nested Software • Sep 18 '18 • Edited

That's a good point that we should compare apples to apples.

In a very rough way, it seems that the difference in performance between NGINX and Apache (with worker/event model) suggests that preemptive multitasking has enough overhead to make a significant difference. NGINX can handle at least twice the number of requests per second.

Why that happens is less clear to me. Is context switching a thread significantly slower than context switching in-process? Do both have the same effect on the CPU cache? If that doesn't make enough of a difference, maybe there's something else going on.

I think threads/processes are preempted in Linux something like every 1-10 milliseconds in CFS, so that's a fixed, guaranteed cost. I wonder if maybe the frequency of context switching is significantly lower with a cooperative approach, since we only context switch voluntarily, usually because of I/O. There wouldn't be any context switching at all between tasks while they're using the CPU. Could that be the difference?

This suggests that might be the case:

When an NGINX server is active, only the worker processes are busy. Each worker process handles multiple connections in a nonblocking fashion, reducing the number of context switches.
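
To make the "only switch voluntarily" idea concrete, here is a minimal sketch using Python's asyncio rather than NGINX itself. The task names `worker` and `hog` are hypothetical. It only illustrates that, under a cooperative event loop, a task is never interrupted between await points; it says nothing about the actual cost of a switch.

```python
import asyncio


async def worker(name: str, steps: int, log: list[str]) -> None:
    """Hypothetical task: logs a step, then voluntarily yields to the loop."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        # The only point where another task can run. Without this await,
        # the task would keep the thread until it finished.
        await asyncio.sleep(0)


async def hog(name: str, steps: int, log: list[str]) -> None:
    """Hypothetical task that never awaits, so nothing can interleave with it."""
    for i in range(steps):
        log.append(f"{name}:{i}")


async def run_demo() -> list[str]:
    log: list[str] = []
    await asyncio.gather(worker("a", 2, log), hog("h", 3, log), worker("b", 2, log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

The entries from `hog` always appear as one uninterrupted run in the log, while `a` and `b` interleave because each of them yields after every step.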

Zuodian Hu • Sep 18 '18

When you change from one user process to another, the virtual page table must change since each process has its own virtual address space. That's the main thing I can think of that makes in-process context switches faster.

You're right in a sense; a cooperative event loop avoids OS-managed context switches entirely by running all tasks in the same thread context. So the only time that event loop has to give up CPU time to the OS is when the OS preempts the entire event loop thread, or some task in the event loop performs a system call. It's a technicality, but Linux will still preempt the event loop periodically to run other user processes, handle interrupts, and run kernel threads. The event loop just keeps all the tasks that it owns from fragmenting into multiple user threads.

Nested Software • Sep 19 '18

This is a good point, though I think in Linux at least, it only applies to actual process switching. That is, if we're switching between threads that are all running under the same process (as would be the case for Apache process running on a given core), I believe the memory isn't switched out the way it is when we switch from one process to another. I am not familiar with the details though, just that heap memory is generally shared, so I may be missing something.
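
As a small illustration of the "heap memory is generally shared" point, the sketch below (Python, with a hypothetical helper name `append_from_threads`) has several threads of one process write to the same list object. It shows the shared address space at the language level only; it does not demonstrate anything about page tables or caches.

```python
import threading


def append_from_threads(n_threads: int) -> list[int]:
    # One heap object, visible to every thread in the process.
    shared: list[int] = []

    def work(i: int) -> None:
        # Each thread appends to the same list; list.append is
        # thread-safe in CPython, so no lock is used here.
        shared.append(i)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


if __name__ == "__main__":
    print(sorted(append_from_threads(4)))
```

Separate processes would each get their own copy of `shared` instead, which is the distinction being discussed above.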

Zuodian Hu • Sep 19 '18

I would encourage finding some reading material on operating systems if you're interested in this kind of stuff, I personally love it. Here is the one my OS professor wrote.

Zuodian Hu • Sep 18 '18

Oh and in that time, the kernel can run its own kernel threads as needed.

Zuodian Hu • Sep 18 '18 • Edited

The reason context switches are slow isn't because of the amount of stuff the CPU has to do, it has a lot more to do with memory access. Each thread has its own stack space and registers saved in memory, and the system has to swap cache lines in L1, and most likely L2, cache on every context switch, hence the slowdowns. You can imagine how thread-specific memory access patterns can make that worse if one is not careful.

Phil Ashby • Sep 18 '18

Very nice description :), I would like to add what I consider to be important factors in the choice of multi-tasking models: complexity and predictability.

Pre-emptive multi-tasking systems require the programmer to deal with shared state very carefully, ensuring consistency and avoiding deadlocks - we humans are not always very good at this, especially so in unfamiliar languages, and this complexity is likely to result in issues, then there is the fun of trying to debug a multi-threaded app - this is the old adage that 'threads are hard'.
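
A minimal sketch of what "dealing with shared state carefully" means in practice, in Python with a hypothetical `Counter` class. The increment is a read-modify-write, and a thread can be preempted between the read and the write, so the update is guarded by a lock.

```python
import threading


class Counter:
    """Hypothetical shared counter guarded by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # The read-modify-write below is not atomic, so it is done
        # while holding the lock.
        with self._lock:
            self.value += 1


def run_counter(n_threads: int, n_increments: int) -> int:
    counter = Counter()

    def work() -> None:
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_counter(4, 1000))
```

Forgetting the lock usually still works in a quick test, which is part of why this class of bug is hard to find.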

In addition, pre-emptive systems are generally not predictable as to when the pre-emption will occur in an execution path, which can often be a problem when time critical work (eg: local hardware interaction) is taking place, leading to locked execution blocks, priority management and other strategies that increase complexity, defeat the pre-emption and reduce its inherent safety mechanism.

Co-operative multi-tasking does not suffer from threaded-programming complexities, nor does it have unpredictable task interruptions - debugging is a relative joy, and execution hogs are likely to be easily detected at run time. It does introduce new ways of thinking about a program (ie: event driven), especially if the chosen language does not support keywords such as async/await, and typically requires more state to be managed than the procedural code used in a threaded design.

I feel both have their place: I wouldn't consider a co-operative OS where there are a significant number of applications, any of which could go wrong in unpredictable ways, but neither would I wish to debug the kernel; I have used a co-operative model in an environment where my team have full control of the code (a microcontroller in this case), needing minimal complexity to reduce the chance of human error and predictable execution. One can also apply this to an application within a process atop a pre-emptive OS, ie: the NGiNX example.

Nested Software • Sep 18 '18 • Edited

Thank you for your comments!

Pre-emptive multi-tasking systems require the programmer to deal with shared state very carefully,

This is a good point. With some applications - certainly most Web apps - there is no real-time communication among threads. In that case it's luckily not an issue. There may be communication mediated by something like a database that all the threads can access, but those semantics are much simpler than the programmer having to deal with shared memory manually.

I have used a co-operative model in an environment where my team have full control of the code (a microcontroller in this case)

That's a cool example. Especially if there's only one physical CPU/core, the simplicity of doing an event loop makes sense to me. I suppose in the case of multiple cores, one could implement a message passing system between threads and continue using an event model, though having to implement that from scratch sounds like it could be a lot of work. Was the microcontroller code done in C? Or would you have to drop down all the way into assembly?
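
The message-passing idea for multiple cores could look roughly like the sketch below. It is written in Python for brevity, not in C as on a microcontroller, and the names `worker_loop` and `square_all` are hypothetical. Each worker owns a simple loop over its own inbox, and the only thing shared between threads is the queues.

```python
import queue
import threading


def worker_loop(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Hypothetical per-core loop: handles one message at a time until told to stop."""
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel: shut down
            break
        outbox.put(msg * msg)


def square_all(values: list[int], n_workers: int = 2) -> list[int]:
    inboxes = [queue.Queue() for _ in range(n_workers)]
    outbox: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker_loop, args=(q, outbox)) for q in inboxes]
    for t in threads:
        t.start()
    for i, v in enumerate(values):
        inboxes[i % n_workers].put(v)  # round-robin dispatch
    for q in inboxes:
        q.put(None)
    for t in threads:
        t.join()
    # Every worker has finished, so exactly len(values) results are waiting.
    return sorted(outbox.get() for _ in values)


if __name__ == "__main__":
    print(square_all([1, 2, 3, 4]))
```

Here `queue.Queue` does the synchronization, so the worker code itself stays as simple as a single-threaded loop.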

Phil Ashby • Sep 20 '18

Good point on delegating thread synchronisation / communication to an existing trusted system (like a DB), although that still leaves room for pain (eventual consistency anyone?).

The MCU design was for a tiny 8-bit Freescale device on a satellite, so one core, little RAM (6k) and bare metal coding in C, with a couple of lines of ASM to issue the WAIT instruction. Also zero opportunity for code fixes once launched - we did 2 years of testing!

Nested Software • Sep 20 '18 • Edited

That's really cool! If you have a chance to write it up as an article at some point, I for one would love to read it!

Phil Ashby • Sep 20 '18

I really should do that - although ITAR prevented us publishing the source code when we did the work (en.wikipedia.org/wiki/Internationa...) it might be ok now.. the joys of regulation!

Having done a few talks for radio hams I'll probably write that up as an article for the satellite team website (funcube.org.uk) and syndicate it here :)

Adrian B.G. • Sep 18 '18

I may be in over my head, but I think Nginx, HAProxy and other proxies don't implement cooperative concurrency, but rather a thread pool with a max number of workers set in their config file. They don't have a scheduler or context switching, but I may be confusing things.

The other day I was reading this related article if anyone is interested eli.thegreenplace.net/2018/measuri... I stumble upon it because it has a Go test as a comparison.

Nested Software • Sep 18 '18 • Edited

I believe the core architecture in NGINX is the use of cooperative concurrency with an event loop. Each worker process (one per CPU core) implements the event loop based on nonblocking I/O.

nginx.com/blog/inside-nginx-how-we...

When an NGINX server is active, only the worker processes are busy. Each worker process handles multiple connections in a nonblocking fashion, reducing the number of context switches.

NGINX does have thread pools to deal with the problem of dependencies that are not non-blocking though.

nginx.com/blog/thread-pools-boost-...

But the asynchronous, event‑driven approach still has a problem. Or, as I like to think of it, an “enemy”. And the name of the enemy is: blocking. Unfortunately, many third‑party modules use blocking calls, and users (and sometimes even the developers of the modules) aren’t aware of the drawbacks. Blocking operations can ruin NGINX performance and must be avoided at all costs.
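
The same pattern, an event loop that hands blocking calls to a thread pool, can be sketched with Python's asyncio. This is an analogy, not NGINX's implementation. `blocking_read` and `ticker` are hypothetical stand-ins; `asyncio.to_thread` is the standard-library call (Python 3.9+) that runs a function in a worker thread.

```python
import asyncio
import time


def blocking_read(delay: float) -> str:
    """Stand-in for a blocking third-party call (hypothetical)."""
    time.sleep(delay)
    return "done"


async def ticker(ticks: list[int], n: int) -> None:
    """Makes progress only while the event loop is not blocked."""
    for i in range(n):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def run_offloaded() -> tuple[str, int]:
    ticks: list[int] = []
    tick_task = asyncio.create_task(ticker(ticks, 3))
    # The blocking function runs in a worker thread, so the loop
    # stays free to run ticker in the meantime.
    result = await asyncio.to_thread(blocking_read, 0.1)
    ticks_during_call = len(ticks)
    await tick_task
    return result, ticks_during_call


if __name__ == "__main__":
    print(asyncio.run(run_offloaded()))
```

Calling `blocking_read(0.1)` directly inside the coroutine would instead stall every task on the loop for the full duration of the call.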

Adrian B.G. • Sep 18 '18 • Edited

Oh I see, so the event loop is inside each worker.

I haven't gone that deep; I was only touching on the subject as I was learning the Worker Pool pattern while studying Go and distributed systems.

So thanks a lot for the info!

PS: the urls are broken (leading ":")

Nested Software • Sep 18 '18

Oops - links fixed!

Rodrigo Nonose • Sep 18 '18

I'll chime in to point to Erlang for curiosity.

Erlang's VM, BEAM, implements preemptive scheduling. It lives in a single OS process and spawns a scheduler (thread) per CPU core (default, but configurable). The processes (not OS, but user-land) are fully preemptive and have non-blocking IO with dedicated schedulers.

Erlang achieves full preemption by having complete memory isolation in the process and handling all side-effects through messages. It's a pretty unique runtime compared to others like Go and Node.js and it results in a bunch of benefits such as high availability, fault-tolerance, high concurrency and non-blocking garbage collection.

Nested Software • Sep 18 '18 • Edited

Thanks @rhnonose ! I have a question:

the processes (not OS, but user-land) are fully preemptive

How does this work? By 'processes' I think you are referring to the functions that do work inside of Erlang, kind of like coroutines in Python's asyncio, is that right? If so, how can they be preempted?

Update: I was curious enough to look into this, and how it works is kind of interesting:

happi.github.io/theBeamBook/#CH-Sc...

The preemptive multitasking on the Erlang level is achieved by cooperative multitasking on the C level. The Erlang language, the compiler and the virtual machine works together to ensure that the execution of an Erlang process will yield within a limited time and let the next process run. The technique used to measure and limit the allowed execution time is called reduction counting, we will look at all the details of reduction counting soon.

The Erlang VM runs an event loop that implements cooperative concurrency. In that respect it's conceptually very similar to how Node.js and NGINX work. However, every time we call a function in Erlang, the VM has the option to run a different task instead. That's the sense in which it is preemptive. I guess I'd call this "pseudo-preemptive."

I'm not certain, but I suspect that from the point of view of performance, it is probably pretty close to cooperative concurrency, since context switching still happens within an OS thread, only at well defined switching points like function calls, and only if the VM thinks it's time to do so (via reduction counting). The overhead ought to be similar to what Node.js does when we call an async function with await in JavaScript.

What about tight loops? Well, since the only way to loop in Erlang is via recursion, that just doesn't apply. The one area where this can be a problem is with native code. If we call native code from Erlang, we have to be very careful with it, since it isn't bound by the same restrictions as the Erlang language and VM. A poorly written native module could in fact hang the VM.
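
Here is a toy model of reduction counting, written in Python and not based on BEAM's actual code. Each `yield` in a generator stands for one reduction (a function call), and the scheduler lets a process run until it has used up its budget. The names `count_up` and `run_scheduler`, and the budget value, are all assumptions made for illustration.

```python
from collections import deque
from typing import Generator, Iterable

Process = Generator[None, None, None]


def count_up(name: str, n: int, log: list[str]) -> Process:
    """Hypothetical process: each yield stands for one reduction."""
    for i in range(n):
        log.append(f"{name}{i}")
        yield


def run_scheduler(processes: Iterable[Process], budget: int) -> None:
    """Round-robin scheduler: a process runs until it has spent `budget` reductions."""
    ready = deque(processes)
    while ready:
        proc = ready.popleft()
        try:
            for _ in range(budget):
                next(proc)  # run up to the next reduction point
        except StopIteration:
            continue  # process finished; do not requeue it
        ready.append(proc)  # budget spent: go to the back of the queue


def demo() -> list[str]:
    log: list[str] = []
    run_scheduler([count_up("a", 3, log), count_up("b", 2, log)], budget=2)
    return log


if __name__ == "__main__":
    print(demo())
```

With a budget of 2, process `a` is switched out after two steps even though it never asked to yield, which is the "preemptive on top of cooperative" effect described above. A generator that looped without ever reaching a `yield` would hang this scheduler, much like the native code caveat.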

Zuodian Hu • Sep 18 '18

I'd just like to add a subtlety to the horizontal scaling issue in server side.

On the server, concurrent event loops help you scale horizontally, and large applications never run a small number of event loops. So if one event loop crashes, the server application still chugs along running all the other event loops. If we go even bigger, each server instance then is hardened against other instances crashing. So in this sense, you're hardening bigger chunks of an application to random crashes since your application is also bigger.

Ben Halpern • Sep 18 '18

It seems like the addendum here has a lot of answers. I’d think that the difference between a more CPU-bound OS and the I/O-bound Node etc. shows pretty well why one might go extinct in one context and be adequate in another.

Unless I’m off with this logic?

Nested Software • Sep 18 '18 • Edited

I think that's a good point. In an OS, many applications (like Web browsers, word processors) are also I/O bound, but you're right, it's an important use-case to support applications that just want to use as much of the CPU as possible. So preemption makes a lot of sense from that perspective alone.

I think even in a world where all applications were I/O bound though, we'd still want the OS to be preemptive - the fragility of any single application being able to break everything just seems untenable.

Steven Liao • Nov 2 '19

Just learned from Andrew Clark that React's experimental Concurrent Mode is built on cooperative concurrency.