Discussion on: Is Cooperative Concurrency Here to Stay?

edA‑qa mort‑ora‑y

A nice overview of the differences in multitasking. I'd like to add a few points though:

  • switching between threads is on the order of nanoseconds. Much of this overhead comes from switching in/out of kernel space, which means the same overhead is incurred on any kernel call, such as IO. I doubt the thread-switching overhead has more than a trivial impact on the concurrency model.
  • Cooperative multitasking servers still tend to run in parallel as well. If you have a host with four cores, you'll run four instances of NodeJS behind NGINX (see the sketch just after this list). You don't have to worry about data races, but synchronization of data is still an issue.
  • Cooperative servers rely heavily on libraries and local servers that have been built to support the cooperative model. Your server code behaves like glue that holds all the bits together.
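
Something like this minimal sketch, using Node's built-in cluster module (the port number and handler are invented for illustration):

```typescript
// One cooperative event loop per core, as with NodeJS instances behind NGINX.
import cluster from "node:cluster";
import { cpus } from "node:os";
import { createServer } from "node:http";

if (cluster.isPrimary) {
  // Fork one single-threaded worker per core.
  for (let i = 0; i < cpus().length; i++) cluster.fork();
} else {
  // Workers share the listening socket. There are no data races inside a
  // worker, but state shared across workers (sessions, caches) still has
  // to be coordinated externally, e.g. through Redis or a database.
  createServer((_req, res) => {
    res.end(`handled by pid ${process.pid}\n`);
  }).listen(8080);
}
```
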
Zuodian Hu

The reason context switches are slow isn't the amount of work the CPU has to do; it has much more to do with memory access. Each thread has its own stack space and registers saved in memory, and the system has to swap cache lines in the L1, and most likely L2, caches on every context switch, hence the slowdown. You can imagine how thread-specific memory access patterns can make that worse if one is not careful.

Nested Software

Thank you, yes. For example, NGINX runs a worker process per CPU core as you describe (nginx.com/blog/inside-nginx-how-we...). And as you point out, it does have to deal with blocking 3rd-party dependencies, including the OS (nginx.com/blog/thread-pools-boost-...).

Re thread switching, I looked at this article, which seems to indicate it takes on the order of microseconds. What do you think?

blog.tsunanet.net/2010/11/how-long...

...Intel L5630: ~1600ns/process context switch, ~1400ns/thread context switch
Intel E5-2620: ~1600ns/process context switch, ~1300ns/thread context switch...
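
For a rough feel of where numbers like these come from, here is a hypothetical ping-pong microbenchmark (file name, round count, and the "one-way wakeup" framing are my own): two Node worker threads hand a token back and forth through Atomics.wait/notify, so every round trip forces the kernel to put one thread to sleep and wake the other.

```typescript
// pingpong.ts -- measures futex-style sleep/wake round trips between threads.
import { Worker, isMainThread, workerData } from "node:worker_threads";

const ROUNDS = 100_000;

if (isMainThread) {
  const sab = new SharedArrayBuffer(4);
  const flag = new Int32Array(sab);
  const worker = new Worker(new URL(import.meta.url), { workerData: sab });

  const start = process.hrtime.bigint();
  for (let i = 0; i < ROUNDS; i++) {
    Atomics.store(flag, 0, 1); // hand the token to the worker...
    Atomics.notify(flag, 0);   // ...and wake it
    Atomics.wait(flag, 0, 1);  // sleep until the worker hands it back
  }
  const elapsedNs = Number(process.hrtime.bigint() - start);

  console.log(`~${Math.round(elapsedNs / (ROUNDS * 2))} ns per one-way wakeup`);
  void worker.terminate();
} else {
  const flag = new Int32Array(workerData as SharedArrayBuffer);
  for (;;) {
    Atomics.wait(flag, 0, 0);  // sleep until the main thread stores 1
    Atomics.store(flag, 0, 0); // hand the token back...
    Atomics.notify(flag, 0);   // ...and wake the main thread
  }
}
```

This measures wakeup latency, which includes the scheduler run and the cache effects mentioned above, so it's an upper bound on the switch itself.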

Zuodian Hu

Oh, and in that time, the kernel can run its own kernel threads as needed.

edA‑qa mort‑ora‑y

We need to be careful about what we're measuring and comparing. A "context switch" happens any time you make an OS call, including the IO functions I mentioned. There is definitely overhead in this alone, even before we get into scheduling.

For a thread switch there will be more overhead, as Zuodian mentions. On Linux, or any OS, I don't think the actual process selection takes much time (these aren't complex data structures); the switching of page/privilege tables possibly does.

But here's an important aspect: when you switch to a new thread you have different memory pointers, and your caches, especially the L0/L1 levels, may not have that data loaded. This causes new loads to be requested. The same will happen if you have green threads and switch in user space, since your cooperative "threads" also have different memory.

Without a concrete comparison of use cases written in both approaches, I still think it'd be hard to say whether the actual context/thread switch is a significant part of the problem.

Nested Software

That's a good point that we should compare apples to apples.

In a very rough way, the difference in performance between NGINX and Apache (with the worker/event model) suggests that preemptive multitasking has enough overhead to make a significant difference: NGINX can handle at least twice the number of requests per second.

Why that happens is less clear to me. Is context switching between threads significantly slower than switching within a process? Do both have the same effect on the CPU cache? If that doesn't make enough of a difference, maybe there's something else going on.

I think threads/processes in Linux are preempted something like every 1-10 milliseconds under CFS, so that's a fixed, guaranteed cost. I wonder if the frequency of context switching is significantly lower with a cooperative approach, since we only context switch voluntarily, usually because of I/O. There wouldn't be any context switching at all between tasks while they're using the CPU. Could that be the difference?

This suggests that might be the case:

When an NGINX server is active, only the worker processes are busy. Each worker process handles multiple connections in a nonblocking fashion, reducing the number of context switches.
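
One way to sanity-check the switching frequency on Linux (a sketch; the procfs fields are real, the spin workload is just for illustration) is to read the kernel's per-process context-switch counters before and after some work:

```typescript
// Reads voluntary/nonvoluntary context-switch counters from /proc/self/status.
import { readFileSync } from "node:fs";

function ctxtSwitches() {
  const status = readFileSync("/proc/self/status", "utf8");
  const grab = (name: string) =>
    Number((status.match(new RegExp(`${name}:\\s+(\\d+)`)) ?? [])[1]);
  return {
    voluntary: grab("voluntary_ctxt_switches"),       // we blocked (e.g. on I/O)
    nonvoluntary: grab("nonvoluntary_ctxt_switches"), // the scheduler preempted us
  };
}

const before = ctxtSwitches();
const end = Date.now() + 100;
while (Date.now() < end) {} // spin on the CPU for ~100ms without ever yielding
const after = ctxtSwitches();

console.log(`voluntary:    +${after.voluntary - before.voluntary}`);
console.log(`nonvoluntary: +${after.nonvoluntary - before.nonvoluntary}`);
```

A pure CPU spin like this should rack up mostly nonvoluntary switches (the periodic preemptions), while an I/O-bound loop should show mostly voluntary ones.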

Zuodian Hu

When you switch from one user process to another, the virtual page table must change, since each process has its own virtual address space. That's the main thing I can think of that makes in-process context switches faster.

You're right in a sense: a cooperative event loop avoids OS-managed context switches entirely by running all tasks in the same thread context. So the only time that event loop has to give up CPU time to the OS is when the OS preempts the entire event loop thread, or some task in the event loop performs a system call. It's a technicality, but Linux will still preempt the event loop periodically to run other user processes, handle interrupts, and run kernel threads. The event loop just keeps all the tasks it owns from fragmenting into multiple user threads.
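
To make the "same thread context" point concrete, a tiny sketch (task names invented): two async tasks interleave on one thread, and the only switches between them happen at explicit await points, with no OS thread switch involved.

```typescript
// Two cooperative tasks sharing a single event loop thread.
async function task(name: string) {
  for (let step = 0; step < 3; step++) {
    console.log(`${name}: step ${step}`); // runs unpreempted by the loop until the await
    await new Promise((resolve) => setTimeout(resolve, 0)); // voluntary yield
  }
}

void task("A");
void task("B"); // output interleaves A and B, all on the same thread
```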

Nested Software

This is a good point, though I think in Linux at least, it only applies to actual process switching. That is, if we're switching between threads that are all running under the same process (as would be the case for an Apache process running on a given core), I believe the memory isn't switched out the way it is when we switch from one process to another. I'm not familiar with the details though, just that heap memory is generally shared, so I may be missing something.

Zuodian Hu

I would encourage finding some reading material on operating systems if you're interested in this kind of stuff; I personally love it. Here is the one my OS professor wrote.

Zuodian Hu

Microseconds sounds right. I wrote a context switcher for ARM in college, and the switching itself tends to be pretty run-of-the-mill as performance goes: just moving stuff between memory locations, so it's mostly memory-bound. However, a context switch implies a scheduler run, so you're additionally incurring scheduler run time on the CPU. The scheduler then has to access the data structures the kernel keeps to track schedulable entities (threads, in Linux I think).

Now you're switching to a kernel context and filling the L1 with kernel memory, then running the scheduler, then switching to the new user context and filling the L1 with the new user memory.

Just thinking about it this way, you're easily going to hit microseconds on a typical context switch.