Nested Software

Posted on Sep 18, 2018 • Edited on Nov 3, 2019 • Originally published at nestedsoftware.com

Is Cooperative Concurrency Here to Stay?

#concurrency #nonblocking #scalability #architecture

Introduction

Cooperative concurrency has been gaining momentum in Web applications over the past decade. Node.js uses it. It's been implemented in Python's asyncio and is used in many different programming languages and frameworks. In this article, I will discuss the pros and cons of this kind of concurrency and speculate a bit about its future.

There are two ways to handle multi-tasking in software: Cooperative multitasking and preemptive multitasking.

Preemptive Multitasking: Multithreading

Until fairly recently, concurrency for Web applications has been achieved mostly with preemptive multitasking by using threads: Each request is assigned an operating system thread. When the request is done, the thread can be released back to a pool.

For Web applications, threads have some nice properties:

Threads are preemptive, so if a given thread spends a lot of time using the CPU, other threads will still be serviced.
Threads can be assigned to different CPUs or CPU cores in parallel, which means that if two concurrent requests both use a lot of CPU, multi-threading will deliver a significant speedup.

If threads are so great, then why are technologies like Node.js on the server becoming more popular? After all, Node uses a cooperative multitasking model.

Cooperative Multitasking: A Bit of History

What is cooperative multitasking anyway? It means that each task runs in the context of an event loop. The event loop gives each task a turn to use the CPU. However, it's each task’s job to release control back to the event loop. If a task fails to release control, there's nothing that can be done about it.

Cooperative multitasking for Web applications takes us backward in time in some ways. Windows 3.1 and Mac Os 9 both had cooperative multitasking. One of the drawbacks of cooperative multitasking was that it made the whole system fragile. It was very easy for a single application to crash the entire OS.

By the early 2000s, Microsoft (as of Win 95, I believe) and Apple (OS X) both had preemptive multitasking: The OS would assign time slices to each process. It interrupted, i.e. preempted, each application automatically after each time slice. That way, a given application might freeze or crash, but that would not prevent the system as a whole from working.

Concurrency without Multithreading

In systems like Node.js, the approach is similar to that of Windows 3.1 and Mac OS 9. Node.js runs in the JavaScript event loop. All requests are handled within a single thread. Node gives control to requests in a queue. Each request does some amount of work and then returns control back to the event loop so progress can be made on other requests.

This approach definitely has some disadvantages: If a piece of code needs to do a lot of processing on the CPU, or fails to yield for some other reason, that will not only slow down the processing of that particular request, it will also stall all other requests that are currently being processed. Every user of the application will experience the slowdown. That's a pretty significant drawback!

Cooperative concurrency has become quite popular though. Why is that? I think there are two main reasons:

It takes a certain amount of time for the operating system to context switch from one thread to another. It's not a lot of time (on the order of microseconds), but since it does involve talking to the OS, it's slower than switching from one task to another within a single thread.
As the number of users accessing a Web app at the same time increases, so does the number of threads being used. A typical linux system should be able to handle tens of thousands of threads pretty easily. However, if the number of threads gets high enough, it can start to impact the performance of the application as well as the operating system itself. Once we do start to hit that limit, a single computer isn't enough anymore and we need to invest in additional hardware.

Comparison

In looking for a simple comparison between these two approaches, I think I found an interesting candidate in the performance difference between Apache and NGINX when serving static content. Apache still basically uses threads to service requests, whereas in NGINX the standard setup is to assign one process per CPU core. Beyond that, an event loop runs within each process and handles all of the requests assigned to that process internally. The difference is pretty significant. According to this analysis, and this one, NGINX is at least twice as fast.

Discussion

If we're choosing whether to use NGINX or Apache to serve static pages, the choice is pretty simple, since the details of how they're implemented don't matter. When it comes to our own software architecture choices, it's a bit tougher though. Clearly the cooperative approach can pay dividends when it comes to performance of I/O-bound tasks. It does however require us to take extra care with our code:

All I/O must be non-blocking.
CPU-bound processing may have to be broken up into separate chunks so that other tasks can make progress in between. Alternatively, a CPU-bound task may be sent to run in a separate thread.

In the end, is it likely to be worth it for our own applications? I think the value of a cooperative approach to concurrency increases with scale. For a relatively small application with several thousand concurrent users, I don't think it will make a difference. However, as we scale up to more and more users, our hardware costs will start to go up. Eventually, when we reach the big players like Google, Amazon, Microsoft, shaving several percentage points from the cost of data-centre maintenance by increasing the efficiency of each machine can add up to millions of dollars in savings. In that context, even if more money has to be spent on the development and maintenance of software, the net benefit may make it worthwhile.

The reason I brought up old operating systems earlier in this article was to emphasize the point that cooperative multitasking does not do well when there are a lot of 3rd party dependencies, such as all of the possible applications a user may install on an OS. In the same vein, the more we intend to use various libraries and frameworks in our applications, the more fraught cooperative concurrency becomes: The chance that at least one of them will behave badly and take down the whole application increases.

I don't really know, but my guess is that relatively little of the NGINX code relies on external dependencies. Combined with the simple job a Web server does, I think we can see how the use of cooperative concurrency makes sense for that use case: NGINX is a specialized, highly tuned piece of software.

In the end, every software project has to make its own determination about which concurrency model to choose. I do think it's worth it to realize that both approaches have potential benefits and drawbacks. It's not a matter of choosing an absolute "best" solution. Rather it's a matter of weighing various trade-offs and applying them to the problem at hand.

Is It Here to Stay?

Given that it does have drawbacks, I wonder if cooperative concurrency will eventually disappear from server-side programming in the same way it did from operating systems. Perhaps as hardware gets better, and improvements are made to how threads work, the differences in performance will begin to get smaller again.

On the other hand, the fact that so much computing now happens in the cloud means that scaling across large numbers of concurrent connections will continue to be a priority. That's not a problem for applications running locally on a user's machine. It may be a fundamental difference that could explain why cooperative concurrency disappeared from PCs but is going strong in Web and Internet applications.

I do think this approach is probably here to stay for the foreseeable future in situations where getting the most requests/second possible is critical or where horizontal scalability (supporting more users concurrently) matters a lot.

Addendum: What is Non-blocking I/O?

I have the impression that the term "non-blocking" can sometimes be a bit confusing. What it means is very simple though. Let's say we have a function that retrieves some data. When we call the function, the data may or may not be available yet.

If our function is non-blocking, that means it returns right away no matter what. If the data is available, our function will return the desired data. If the data isn't available, then it just returns a status like NOT_READY, but either way, it returns. A blocking function on the other hand can block the current thread and force it to wait until the data is available. With cooperative concurrency, we really have to make sure all of our calls to I/O are non-blocking since our entire event loop depends on it.

One way to deal with nonblocking I/O is to keep polling until the data we want becomes available. For example, see something like the select function. Another approach, used in Node.js, is to supply a callback to the non-blocking call. After the data becomes available, the event loop calls the callback, and the request can proceed from there (in the meantime, the non-blocking call returns and the event loop keeps working on other requests). In the latter case, there's still ultimately a loop polling behind the scenes, but the programmer using Node.js doesn't have to worry about how it's implemented.

We often hear about "non-blocking" in the context of cooperative concurrency, but that doesn't have to be the case. For example, Java's atomic variables allow multiple threads to access shared data without blocking. They do this by trying to perform an atomic operation in a loop: If the operation succeeds, great. If it fails, then we have to keep trying. It's same idea of polling, here in the context of a multithreaded application. Synchronizing on a lock on the other hand is blocking: A thread that wants to modify data protected by a lock has to block until the lock goes away.

Addendum: I/O Bound vs. CPU Bound

Since cooperative concurrency means multiple requests are being handled within a single thread, and therefore only one task is using the CPU at any given time, cooperative concurrency works best when the application logic is I/O bound. That means that a given request may require a bit of CPU time, but much of the time will actually be spent waiting for I/O: Accessing a database, connecting to a remote service, that sort of thing. From the point of view of the CPU, I/O takes forever. While a particular request is waiting for I/O, it can yield control back to the event loop so other requests can continue to be processed. As long as all requests are giving up control to wait for I/O frequently, the whole system should run smoothly. Many commonly used applications - e-mail, online shopping, social-networks - are about loading or sending data somewhere. They're great candidates for this kind of concurrency. If our application logic is CPU bound, that is, each request spends most of its time doing actual processing and not waiting for I/O, then this model is probably not a good idea.

Top comments (25)

edA‑qa mort‑ora‑y • Sep 18 '18

A nice overview to the differences in multitasking. I'd like to add a few points though:

switching between threads is on the order of nanoseconds. This overhead comes a lot from switching in/out of kernel space. That means the same overhead is encountered when doing any kernel call, such as IO. I would doubt whether the thread switching overhead has a more than trivial impact on the concurrency model.
Cooperative multi-tasking servers still tend to run in parallel mode as well. If you have a host with four cores you'll run fours instances of NodeJS behind NGINX. You don't have to worry about data races, but synchronization of data is still an issue.
Cooperative servers rely heavily on libraries and local servers that have been built to support the cooperative model. Your server code behaves like glue that holds all the bits together.

Nested Software • Sep 18 '18 • Edited

Thank you, yes, for eg NGINX runs a worker process per cpu core as you describe (nginx.com/blog/inside-nginx-how-we...). And as you point out, it does have to deal with blocking 3rd party dependencies, including the os (nginx.com/blog/thread-pools-boost-...).

Re thread switching, I looked at this article which seems to indicate it takes in the range of microseconds. What do you think?

blog.tsunanet.net/2010/11/how-long...

...Intel L5630: ~1600ns/process context switch, ~1400ns/thread context switch
Intel E5-2620: ~1600ns/process context switch, ~1300ns/thread context siwtch...

Zuodian Hu • Sep 18 '18

Microseconds sounds right. I wrote a context switcher in ARM in college, and know the switching itself tends to be pretty run of the mill as performance goes, just moving stuff between memory locations, so it's mostly memory bound. However, a context switch implies a scheduler run, and then you're additionally incurring scheduler run time on the CPU. The scheduler then has to access the data structure the kernel keeps to track schedulable entities (threads in Linux, I think).

Now, you're switching to a kernel context and filling the L1 with kernel memory, then run the scheduler, then switch to the new user context and filling the L1 with the new user memory.

Just thinking about it this way, you're easily going to hit microseconds on a typical context switch.

edA‑qa mort‑ora‑y • Sep 18 '18

We need to be careful about what we're measuring and comparing. A "context switch" will happen anytime you make an OS call, this includes the IO functions I mentioned. There is definitely overhead in this alone, even if we don't get into scheduling.

For a thread switch there will be more overhead, as Zuodian mentions. On Linux, or any OS, I don't think the actual process selection takes much time (these aren't complex data structures). The switching of page/privilege tables possibly.

But, here's an important aspect. When you switch to a new thread you have different memory pointers and your caches, especially the L0/L1 levels may not have that data in memory. This causes new loads to be requested. This will also happen if you have green threads and switch in user space, since your cooperative "threads" also have different memory.

Without a concrete comparison of use-cases written in both approaches I still think it'd be hard to say whether the actual context/thread switch is a significant part of the problem.

Nested Software • Sep 18 '18 • Edited

That's a good point that we should compare apples to apples.

In a very rough way, it seems that the difference in performance between NGINX and Apache (with worker/event model) suggests that preemptive multitasking has enough overhead to make a significant difference. NGINX can handle at least twice the number of requests per second.

Why that happens is less clear to me. Is context switching a thread significantly slower than context switching in-process? Do both have the same effect on the CPU cache? If that doesn't make enough of a difference, maybe there's something else going on.

I think threads/processes are preempted in Linux something like every 1-10 milliseconds in CFS, so that's a fixed, guaranteed cost. I wonder if maybe the frequency of context switching is significantly lower with a cooperative approach, since we only context switch voluntarily, usually because of I/O. There wouldn't be any context switching at all between tasks while they're using the CPU. Could that be the difference?

This suggests that might be the case:

When an NGINX server is active, only the worker processes are busy. Each worker process handles multiple connections in a nonblocking fashion, reducing the number of context switches.

Zuodian Hu • Sep 18 '18

When you change from one user process to another, the virtual page table must change since each process has its own virtual address space. That's the main thing I can think of that makes in-process context switches faster.

You're right in a sense; a cooperative event loop avoids OS-managed context switches entirely by running all tasks in the same thread context. So the only time that event loop has to give up CPU time to the OS is when the OS preempt the entire event loop thread, or some task in the event loop performs a system call. It's a technicality, but Linux will still preempt the event loop periodically to run other user processes, handle interrupts, and run kernel threads. The event loop just keeps all the tasks that it owns from fragmenting into multiple user threads.

Nested Software • Sep 19 '18

This is a good point, though I think in Linux at least, it only applies to actual process switching. That is, if we're switching between threads that are all running under the same process (as would be the case for Apache process running on a given core), I believe the memory isn't switched out the way it is when we switch from one process to another. I am not familiar with the details though, just that heap memory is generally shared, so I may be missing something.

Zuodian Hu • Sep 19 '18

I would encourage finding some reading material on operating systems if you're interested in this kind of stuff, I personally love it. Here is the one my OS professor wrote.

Zuodian Hu • Sep 18 '18

Oh and in that time, the kernel can run its own kernel threads as needed.

Zuodian Hu • Sep 18 '18 • Edited

The reason context switches are slow isn't because of the amount of stuff the CPU has to do, it has a lot more to do with memory access. Each thread has its own stack space and registers saved in memory, and the system has to swap cache lines in L1, and most likely L2, cache on every context switch, hence the slowdowns. You can imagine how thread-specific memory access patterns can make that worse if one is not careful.

Nested Software • Sep 18 '18 • Edited

Thank you for your comments!

Pre-emptive multi-tasking systems require the programmer to deal with shared state very carefully,

This is a good point. With some applications - certainly most Web apps - there is no real-time communication among threads. In that case it's luckily not an issue. There may be communication mediated by something like a database that all the threads can access, but those semantics are much simpler than the programmer having to deal with shared memory manually.

I have used a co-operative model in an environment where my team have full control of the code (a microcontroller in this case)

That's a cool example. Especially if there's only one physical CPU/core, the simplicity of doing an event loop makes sense to me. I suppose in the case of multiple cores, one could implement a message passing system between threads and continue using an event model, though having to implement that from scratch sounds like it could be a lot of work. Was the microcontroller code done in C? Or would you have to drop down all the way into assembly?

Adrian B.G. • Sep 18 '18

I may be over my head but I think Nginx, HAProxy and other proxies don't implement cooperative concurrency, but a Thread Pool with a max number of workers in their config file. They don't have a scheduler or a context switching, but I may confuse things.

The other day I was reading this related article if anyone is interested eli.thegreenplace.net/2018/measuri... I stumble upon it because it has a Go test as a comparison.

Nested Software • Sep 18 '18 • Edited

I believe the core architecture in NGINX is the use of cooperative concurrency with an event loop. Each worker process (one per CPU core) implements the event loop based on nonblocking I/O.

nginx.com/blog/inside-nginx-how-we...

When an NGINX server is active, only the worker processes are busy. Each worker process handles multiple connections in a nonblocking fashion, reducing the number of context switches.

NGINX does have thread pools to deal with the problem of dependencies that are not non-blocking though.

nginx.com/blog/thread-pools-boost-...

But the asynchronous, event‑driven approach still has a problem. Or, as I like to think of it, an “enemy”. And the name of the enemy is: blocking. Unfortunately, many third‑party modules use blocking calls, and users (and sometimes even the developers of the modules) aren’t aware of the drawbacks. Blocking operations can ruin NGINX performance and must be avoided at all costs.

Adrian B.G. • Sep 18 '18 • Edited

Oh I see, so the event loop is inside each worker.

I haven't got that deep, I was only touching the subject was I was learning the Worker Pool pattern as I study Go and distributed systems.

So thanks a lot for the info!

PS: the urls are broken (leading ":")

Nested Software • Sep 18 '18

Oops - links fixed!

Ben Halpern • Sep 18 '18

It seems like the addendum here has a lot of answers. I’d think that the difference between a more CPU bound OS and the I/O bound Node etc. show pretty well why one might go extinct in one context and be adequate in another.

Unless I’m off with this logic?

Nested Software • Sep 18 '18 • Edited

I think that's a good point. In an OS, many applications (like Web browsers, word processors) are also I/O bound, but you're right, it's an important use-case to support applications that just want to use as much of the CPU as possible. So preemption makes a lot of sense from that perspective alone.

I think even in a world where all applications were I/O bound though, we'd still want the OS to be preemptive - the fragility of any single application being able to break everything just seems untenable.

Zuodian Hu • Sep 18 '18

I'd just like to add a subtlety to the horizontal scaling issue in server side.

On the server, concurrent event loops help you scale horizontally, and large applications never run a small number of event loops. So if one event loop crashes, the server application still chugs along running all the other event loops. If we go even bigger, each server instance then is hardened against other instances crashing. So in this sense, you're hardening bigger chunks of an application to random crashes since your application is also bigger.

Rodrigo Nonose • Sep 18 '18

I'll chime in to point to Erlang for curiosity.

Erlang's VM, BEAM, implements preemptive scheduling. It lives in a single OS process and spawns a scheduler (thread) per CPU core (default, but configurable) and the processes (not OS, but user-land) are fully preemptive and has non-blocking IO with dedicated schedulers.

Erlang achieves full preemption by having complete memory isolation in the process and handling all side-effects through messages. It's a pretty unique runtime compared to others like Go and Node.js and it results in a bunch of benefits such as high availability, fault-tolerance, high concurrency and non-blocking garbage collection.

Nested Software • Sep 18 '18 • Edited

Thanks @rhnonose ! I have a question:

the processes (not OS, but user-land) are fully preemptive

How does this work? By 'processes' I think you are referring to the functions that do work inside of Erlang, kind of like coroutines in Python's asyncio, is that right? If so, how can they be preempted?

Update: I was curious enough to look into this, and how it works is kind of interesting:

happi.github.io/theBeamBook/#CH-Sc...

The preemptive multitasking on the Erlang level is achieved by cooperative multitasking on the C level. The Erlang language, the compiler and the virtual machine works together to ensure that the execution of an Erlang process will yield within a limited time and let the next process run. The technique used to measure and limit the allowed execution time is called reduction counting, we will look at all the details of reduction counting soon.

The Erlang VM runs an event loop that implements cooperative concurrency. In that respect it's conceptually very similar to how Node.js and NGINX work. However, every time we call a function in Erlang, the VM has the option to run a different task instead. That's the sense in which it is preemptive. I guess I'd call this this "pseudo-preemptive."

I'm not certain, but I suspect that from the point of view of performance, it is probably pretty close to cooperative concurrency, since context switching still happens within an OS thread, only at well defined switching points like function calls, and only if the VM thinks it's time to do so (via reduction counting). The overhead ought to be similar to what Node.js does when we call an async function with await in JavaScript.

What about tight loops? Well, since the only way to loop in Erlang is via recursion, that just doesn't apply. The one area where this can be a problem is with native code. If we call native code from Erlang, we have to be very careful with it, since it isn't bound by the same restrictions as the Erlang language and VM. A poorly written native module could in fact hang the VM.