<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gerardo Puga</title>
    <description>The latest articles on DEV Community by Gerardo Puga (@glpuga).</description>
    <link>https://dev.to/glpuga</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F403526%2F970aa96d-d3c4-4ca6-b790-3655a7c76acd.jpg</url>
      <title>DEV Community: Gerardo Puga</title>
      <link>https://dev.to/glpuga</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/glpuga"/>
    <language>en</language>
    <item>
      <title>C++ lambdas, for beginners</title>
      <dc:creator>Gerardo Puga</dc:creator>
      <pubDate>Mon, 14 Sep 2020 00:09:54 +0000</pubDate>
      <link>https://dev.to/glpuga/c-lambdas-for-beginners-313c</link>
      <guid>https://dev.to/glpuga/c-lambdas-for-beginners-313c</guid>
      <description>&lt;ul&gt;
&lt;li&gt;What's this all about&lt;/li&gt;
&lt;li&gt;A Lambda's full name&lt;/li&gt;
&lt;li&gt;
Analysis of the lambda expression

&lt;ul&gt;
&lt;li&gt;The return type of the lambda&lt;/li&gt;
&lt;li&gt;The capture list&lt;/li&gt;
&lt;li&gt;Mutable lambdas&lt;/li&gt;
&lt;li&gt;Generalized captures (C++14 and beyond)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;What can and cannot be captured&lt;/li&gt;
&lt;li&gt;Captures and the &lt;code&gt;this&lt;/code&gt; pointer&lt;/li&gt;
&lt;li&gt;An intuition of the closure class&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's this all about  &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;When C++11 came out it came with a lot of goodies, and one of the best additions to the language repertoire was lambda functions.&lt;/p&gt;

&lt;p&gt;Most people's first exposure to lambda functions in C++ comes in the form of "throwaway functions" scattered around the code, because that's what lambdas are used for most of the time. Take for instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto addition = [](int x, int y) { return x + y; };
std::cout &amp;lt;&amp;lt; addition(4, 3) &amp;lt;&amp;lt; std::endl;
std::cout &amp;lt;&amp;lt; addition(99, 1) &amp;lt;&amp;lt; std::endl;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here a lambda function is created to perform a simple addition between two numbers, and then the lambda gets called a couple of times with different input arguments each time. When execution exits the scope where the lambda object was created, the lambda will be destroyed just like any other local variable.&lt;/p&gt;

&lt;p&gt;Lambdas can do more than that, though. Lambdas are executable objects, and just like other objects, they can be moved around, copied, stored in containers, passed as arguments, etc. Lambdas can carry a context with them, so that they can be created in one place and transferred to another for execution. Lambdas can be stateful, and can also be linked to the state of an external object.&lt;/p&gt;

&lt;p&gt;It must be said that lambdas don't really bring anything radically new to the C++ table. Everything that can be done with lambdas today could already be done in pre-C++11 code using good old functor objects (objects that can be called like functions), if you bothered to make the effort to write them.&lt;/p&gt;

&lt;p&gt;The real advantage of the lambda notation, though, is that it provides a compact way to describe an executable object in such a way that &lt;strong&gt;the compiler can do the implementation for us&lt;/strong&gt;. The programmer can then focus on the &lt;em&gt;what&lt;/em&gt; instead of on the &lt;em&gt;how&lt;/em&gt;, and explore design possibilities that might have been avoided otherwise.&lt;/p&gt;

&lt;p&gt;During the rest of the article I'll try to introduce you to how lambda functions get declared, and what the ins and outs of variable capture are. &lt;/p&gt;

&lt;p&gt;Please notice that the article is aimed at beginners trying to "get it" and/or intermediate users looking for a refresher. If you already "get" lambdas, you can try the references cited at the very end for better material.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Lambda's full name &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Let's start by getting semantics out of the way.&lt;/p&gt;

&lt;p&gt;In reality, there's no such thing as a lambda. "Lambda" is just a colloquial term that, depending on the context, will usually refer to one of three things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;lambda expression&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;closure object&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;closure class&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three of them are present in the following statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto addition = [](int x, int y) { return x + y; };
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here, &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;[](int x, int y) { return x + y; }&lt;/code&gt; is the lambda expression.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;addition&lt;/code&gt; is the closure object.&lt;/li&gt;
&lt;li&gt;You can't see the closure class, but it's there in the form of the type of the closure object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whenever you hear someone talking about a lambda, they are most likely talking about one of the three above, and in most cases it'll be one of the first two (the third one gets far less press time).&lt;/p&gt;

&lt;p&gt;I'll keep it light and talk about &lt;code&gt;lambdas&lt;/code&gt; when it suits me better, but I'll use the terms above when the distinction is important.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis of the lambda expression &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The lambda expression is the most visible expression of the lambda in the code.&lt;/p&gt;

&lt;p&gt;The structure of the lambda expression is&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[...capture list...](...parameters...) -&amp;gt; &amp;lt;return type&amp;gt; { ... body ... }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;but in practice, it will be a lot simpler than this. In fact, the humblest of all lambda expressions is&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[]{}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;which defines a lambda that captures nothing, takes in nothing as parameters, does nothing, and returns no data; that lambda is, in other words, equivalent to this function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void f() {}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Just like for function declarations, the code of the lambda is enclosed within the &lt;code&gt;{}&lt;/code&gt; brackets, and the input parameters within the &lt;code&gt;()&lt;/code&gt; pair. There's not a lot to say about either of these components of the lambda, since they follow both in form and in function from their equivalents on the function declaration.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;()&lt;/code&gt; of the parameters list can be completely omitted when the lambda is not meant to receive any parameters. That's why the humblest lambda is &lt;code&gt;[]{}&lt;/code&gt; and not &lt;code&gt;[](){}&lt;/code&gt;. If a return type is specified, however, the &lt;code&gt;()&lt;/code&gt; needs to be written even if the parameter list is empty.&lt;/p&gt;

&lt;p&gt;I want to talk at length about captures, which are what separates lambdas from being mere throwaway functions, but allow me to get something out of the way first and discuss the return type of the lambda.&lt;/p&gt;

&lt;h3&gt;
  
  
  The return type of the lambda &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The return value type of the lambda can be omitted from the expression, and most people will not write it. When the type is not present the compiler will infer it from the rest of the expression.&lt;/p&gt;

&lt;p&gt;I won't be writing the return type in the rest of the examples in this article unless I really need it. That's why I wanted to talk about this early: it's less typing for me.&lt;/p&gt;

&lt;p&gt;Only in a minority of cases will you see lambda expressions that explicitly state the return type of the lambda. In those cases it will usually be because of one of two reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lambda body is such that the return type is ambiguous to the compiler.&lt;/li&gt;
&lt;li&gt;The coder wants to coerce the return type into something different from what the compiler would have inferred.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whenever the return type is not written in the lambda expression the compiler uses a simple but effective set of rules to determine what type &lt;em&gt;you meant&lt;/em&gt; to return:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If there are no &lt;code&gt;return&lt;/code&gt; statements, or if the ones present are bare &lt;code&gt;returns&lt;/code&gt; with no return value (such as &lt;code&gt;return;&lt;/code&gt;), then the lambda is assumed to return &lt;code&gt;void&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If there's a single &lt;code&gt;return&lt;/code&gt; statement and it has a return expression, the return type will be the type that results from evaluating the expression.&lt;/li&gt;
&lt;li&gt;If there are multiple &lt;code&gt;return&lt;/code&gt; statements and all of them have a return expression that evaluates to the very same type, that will be the return type of the lambda.&lt;/li&gt;
&lt;li&gt;If none of the above is the case, the compiler will give up and then you'll have to explicitly state the return type.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few examples:&lt;/p&gt;

&lt;p&gt;This declares a lambda that just returns &lt;code&gt;void&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto f1 = [](){};  // could be just []{}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here the return expression evaluates to &lt;code&gt;int&lt;/code&gt;, and therefore the return type of the lambda will be the same type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto f3 = [](int n) { return n; };
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The return type in the expression here evaluates to &lt;code&gt;bool&lt;/code&gt;, so that'll be the type of the return value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto f4 = [](int n) { return n &amp;gt; 5; };
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There are two return expressions in the following example, but both are &lt;code&gt;bool&lt;/code&gt;. Since there's no ambiguity the return type will be &lt;code&gt;bool&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto f5 = [](int n) {
  if (n == 5) {
      return true;
  } else {
      return false;
  }
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In this case there are two return statements with expressions that evaluate to different types (&lt;code&gt;bool&lt;/code&gt; and &lt;code&gt;int&lt;/code&gt;), so the programmer needs to state the return type explicitly, because it's not possible for the compiler to decide what the return type should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto f6 = [](int n) -&amp;gt; bool {                
  if (n &amp;lt; 0) {
      return false;
  } else {
      return n; /* returns true if n &amp;gt; 0 */
  }
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In this case we state the return type to coerce it to be a &lt;code&gt;bool&lt;/code&gt; instead of an &lt;code&gt;int&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto f7 = [](const std::vector&amp;amp; n) -&amp;gt; bool { return v.size(); };
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  The capture list &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The capture list of a lambda is a list of variable names &lt;em&gt;in the enclosing scope of the lambda&lt;/em&gt; that have to be &lt;em&gt;captured&lt;/em&gt; to become accessible within the lambda. &lt;/p&gt;

&lt;p&gt;In a lambda expression the capture list is written between the square brackets &lt;code&gt;[]&lt;/code&gt;. It is not optional, and the brackets must be present even if the capture list is empty.&lt;/p&gt;

&lt;p&gt;A capture creates a variable within the scope of the lambda body that has the same name as the variable being captured, and which can be (depending on the mode of capture) either &lt;strong&gt;a copy&lt;/strong&gt; of the external variable, or &lt;strong&gt;a reference&lt;/strong&gt; to it.&lt;/p&gt;

&lt;p&gt;Captures that generate a copy of the external variable will be called &lt;em&gt;by-copy&lt;/em&gt; captures, while the ones that generate a reference to an external variable will be called &lt;em&gt;by-reference&lt;/em&gt; captures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;em&gt;by-copy&lt;/em&gt; capture creates a variable within the scope of the lambda that is a copy of the variable being captured, with the same value the latter had at the time the lambda was created. By-copy captures are read-only, unless the lambda is &lt;em&gt;mutable&lt;/em&gt; (more on that later).&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;by-reference&lt;/em&gt; capture stores a reference to the variable being captured, which can be used to read or update the value of the external variable at any time during the lifetime of the lambda.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To ground those definitions with an example, take a look at the code fragment below. In it two local variables get defined, followed by the definition of a lambda function that captures them both.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int foo = 33;
int bar = 22;
auto within_range = [foo, &amp;amp;bar](int n) {
  return foo + bar;
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The capture list in the lambda expression captures both variables: &lt;code&gt;foo&lt;/code&gt; is captured by-copy, while &lt;code&gt;bar&lt;/code&gt; is captured by-reference.&lt;/p&gt;

&lt;p&gt;It's important to realize that while the names of the captured variables are the same as the names of the external variables that they mirror, they are different variables, i.e. &lt;strong&gt;the &lt;code&gt;foo&lt;/code&gt; variable within the body of the lambda is not the same &lt;code&gt;foo&lt;/code&gt; outside.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As you probably guessed from the example, the capture mode is declared by prefixing by-reference captures with &lt;code&gt;&amp;amp;&lt;/code&gt;. By-copy captures, on the other hand, have no prefix at all.&lt;/p&gt;

&lt;p&gt;Additionally, there are two default capture modes that allow us to capture every variable that is used in the body but not explicitly mentioned in the capture list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A bare &lt;code&gt;&amp;amp;&lt;/code&gt; will capture by-reference anything used but not explicitly captured by-copy.&lt;/li&gt;
&lt;li&gt;A bare &lt;code&gt;=&lt;/code&gt; will capture by-copy anything used but not explicitly captured by-reference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a typical capture list you'll find a mixture of default capture modes with named captures. There are a few rules to this mix, however. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The default modes &lt;code&gt;=&lt;/code&gt; and &lt;code&gt;&amp;amp;&lt;/code&gt; cannot both be present.&lt;/li&gt;
&lt;li&gt;If a default mode is present, it must lead the list.&lt;/li&gt;
&lt;li&gt;If a &lt;code&gt;=&lt;/code&gt; default capture is present, any named capture that follows must be by-reference.&lt;/li&gt;
&lt;li&gt;If a &lt;code&gt;&amp;amp;&lt;/code&gt; default capture is present, any named capture that follows must be by-copy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rules make sense if you think about it, so it's not really necessary to worry about them too much.&lt;/p&gt;

&lt;p&gt;These are some examples of typical capture lists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;[]&lt;/strong&gt; Empty capture list, nothing will be captured.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[foo]&lt;/strong&gt; Capture &lt;code&gt;foo&lt;/code&gt; by copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[&amp;amp;bar]&lt;/strong&gt; Capture &lt;code&gt;bar&lt;/code&gt; by reference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[foo, &amp;amp;bar]&lt;/strong&gt; Capture &lt;code&gt;foo&lt;/code&gt; by-copy and &lt;code&gt;bar&lt;/code&gt; by-reference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[=]&lt;/strong&gt; Capture anything named from the enclosing scope by-copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[&amp;amp;]&lt;/strong&gt; Capture anything named from the enclosing scope by-reference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[&amp;amp;, foo, bar]&lt;/strong&gt; Capture anything named from the enclosing scope by reference, except &lt;code&gt;foo&lt;/code&gt; and &lt;code&gt;bar&lt;/code&gt; which must be captured by-copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[=, &amp;amp;foo]&lt;/strong&gt; Capture anything named from the enclosing scope by copy, except &lt;code&gt;foo&lt;/code&gt; which must be captured by-reference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Mutable lambdas &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;By default &lt;em&gt;by-copy&lt;/em&gt; captures are not writable, and therefore the following fragment is an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int value;
auto bad_lambda = [value]() { value += 10; };
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;By-copy captures can be made writable if the lambda is declared as &lt;code&gt;mutable&lt;/code&gt;. This makes the lambda stateful: any change you do to a by-copy capture will be carried over to the next execution of the same lambda.&lt;/p&gt;

&lt;p&gt;For example, in this example the lambda will &lt;em&gt;remember&lt;/em&gt; any update to the value of the captured &lt;code&gt;initial_value&lt;/code&gt; variable. The value of the external variable, however, will remain unchanged because the lambda updates a copy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int initial_value{5};

auto counter_lambda = [initial_value]() mutable {
    std::cout &amp;lt;&amp;lt; initial_value++ &amp;lt;&amp;lt; std::endl;
};

// each call will increment the internal copy
// stored within the lambda, and the change carries over to
// the next call.
counter_lambda(); // will print 5
counter_lambda(); // will print 6
counter_lambda(); // will print 7

// the original variable outside of the lambda is unchanged
std::cout &amp;lt;&amp;lt; initial_value &amp;lt;&amp;lt; std::endl;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;em&gt;by-reference&lt;/em&gt; captures, on the other hand, can be both read and written regardless of whether the lambda is &lt;code&gt;mutable&lt;/code&gt; or not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int total{0};

auto accumulate = [&amp;amp;total](int n) { total += n; };

// each call updates the value of the referenced variable
accumulate(1);
accumulate(2);
accumulate(3);

// print the accumulated value, 6
std::cout &amp;lt;&amp;lt; total &amp;lt;&amp;lt; std::endl;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h4&gt;
  
  
  Generalized captures (C++14 and beyond) &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;All the talk so far has been about captures as they were introduced when C++11 came out.&lt;/p&gt;

&lt;p&gt;These captures work great, and there's a lot you can do with them, but after a while you'll notice that there are a couple of ways in which they fall short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A capture always has the same name as the variable that was captured. This is not a big deal, of course, but sometimes you'd like to be able to name them something else.&lt;/li&gt;
&lt;li&gt;To capture a value it needs to be previously stored in a variable; it's not possible to capture the result of an expression.&lt;/li&gt;
&lt;li&gt;You can't use move semantics with captures. Captured objects need to be copyable; if they are not then you'll need to capture them by-reference, which may create a problem of ownership, or you'll need to do some other trick. This may be particularly annoying if you use &lt;code&gt;unique_ptr&lt;/code&gt; a lot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To mend this, C++14 upgraded lambdas with &lt;code&gt;generalized lambda captures&lt;/code&gt; which &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;allow you to give the capture any internal name you like.&lt;/li&gt;
&lt;li&gt;allow you to capture not only variables, but also the result of expressions (only by-copy).&lt;/li&gt;
&lt;li&gt;more importantly, allow you to capture move-only variables like &lt;code&gt;unique_ptr&lt;/code&gt; instances.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The price to pay for these welcome improvements is that generalized captures are a bit more verbose than regular captures, because you need to state both the name of the variable being captured and the name of the capture variable created within the lambda. The syntax is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;internal_name=expression&lt;/code&gt; for by-copy captures.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;amp;internal_name=external_name&lt;/code&gt; for by-reference captures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, this example uses generalized captures to capture &lt;code&gt;counter&lt;/code&gt; by reference (naming it &lt;code&gt;cnt&lt;/code&gt; within the lambda), and also captures the result of &lt;code&gt;3 * mean_level&lt;/code&gt; by copy (naming the result &lt;code&gt;limit&lt;/code&gt; within the lambda).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int mean_level{5};
int counter{0};

auto f = [&amp;amp;cnt = counter, limit = 3 * mean_level]() {
  if (cnt &amp;lt; limit) cnt++;
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To capture a move-only object, you just need to make sure the right side of the by-copy assignment is an rvalue, which can be done by providing a temporary value, or by using &lt;code&gt;std::move&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto adapter = std::make_unique&amp;lt;Adapter&amp;gt;();

auto runner = [adapter = std::move(adapter)]() { adapter-&amp;gt;run(); };
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The extra verbosity of generalized lambda captures is a very small price to pay given that you're not even required to pay for it: you can still use regular C++11 captures when that suits you better, and mix generalized and regular captures to get the best of each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto f = [&amp;amp;counter, limit = 3 * mean_level]() {
  if (counter &amp;lt; limit) counter++;
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  What can and cannot be captured &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Earlier I said that only variables in the immediate local scope of the lambda can be captured. I mentioned it only in passing, and it probably flew under the radar when I said it.&lt;/p&gt;

&lt;p&gt;However, this is not a minor detail or a technicality, and to see why let's look at what &lt;strong&gt;cannot&lt;/strong&gt; be captured.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Global scope variables and static data members can't be captured.&lt;/li&gt;
&lt;li&gt;Non-static class members can't be captured directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first one may strike you as obvious if you think about it: since globals are accessible from within any function, and lambdas are function-like ("callable") objects, there's no reason for them to be an exception to this.&lt;/p&gt;

&lt;p&gt;Still, you should keep in mind that globals must be regarded within a lambda just like by-reference captures are. This may have important implications in multi-threaded programs.&lt;/p&gt;

&lt;p&gt;Static class members are just globals in disguise, so it's no surprise that as far as lambdas are concerned, they have the exact same restrictions.&lt;/p&gt;

&lt;p&gt;Non-static class members cannot be captured either, but here the truth is more nuanced: actually they can, kind-of, but they cannot be captured in the same sense in which captures work for regular local variables.&lt;/p&gt;

&lt;p&gt;Before we dig deeper into this, however, we need to take a short detour to talk about the pointer &lt;code&gt;this&lt;/code&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Captures and the &lt;code&gt;this&lt;/code&gt; pointer &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;During execution each non-static class method has access to an implicitly created &lt;code&gt;this&lt;/code&gt; pointer that references the instance on which the method was called. This is what gives methods access to non-static data members of the class.&lt;/p&gt;

&lt;p&gt;Typically you don't need to de-reference that pointer explicitly since the compiler will do it for you, but you can if you want to be more explicit. For instance, in this fragment&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Value {
 public:
  void set(const int x) { x_ = x; }
  int get() const { return x_; }
 private:
  int x_;
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;we can make the dependency on &lt;code&gt;this&lt;/code&gt; more visible by rewriting both methods as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void set(const int x) { this-&amp;gt;x_ = x; }
int get() const { return this-&amp;gt;x_; }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It's important to notice that this does not change the way the code is compiled, it only makes explicit what the compiler is doing behind your back. &lt;/p&gt;

&lt;p&gt;This detour to talk about &lt;code&gt;this&lt;/code&gt; (pun intended) is because now that we have unmasked how access to data members works it's easier to understand the &lt;em&gt;nuanced&lt;/em&gt; version of how non-static data member capture works.&lt;/p&gt;

&lt;p&gt;Non-static class members cannot be captured because they are not in the local scope around the lambda, they are in the class scope.&lt;/p&gt;

&lt;p&gt;However, the &lt;code&gt;this&lt;/code&gt; pointer can be captured, because &lt;em&gt;it is&lt;/em&gt; a variable within the immediate scope of lambdas that get created in non-static methods of a class! By capturing &lt;code&gt;this&lt;/code&gt; the lambda gets access to all non-static data members and also to instance methods.&lt;/p&gt;

&lt;p&gt;In order to capture &lt;code&gt;this&lt;/code&gt;, you just add it to the capture list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto is_empty = [this]() { return queue_.empty(); } 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It's important to note that &lt;code&gt;this&lt;/code&gt; is not a regular variable, and its nature imposes a limitation on the capture process: &lt;code&gt;this&lt;/code&gt; can only be captured &lt;em&gt;by-copy&lt;/em&gt;; trying to capture the pointer like in &lt;code&gt;[&amp;amp;this]&lt;/code&gt; is a syntax error. Default capture modes will also capture &lt;code&gt;this&lt;/code&gt;, but notice that even if &lt;code&gt;this&lt;/code&gt; is captured by the &lt;code&gt;&amp;amp;&lt;/code&gt; default capture mode, the pointer will still be copied.&lt;/p&gt;

&lt;p&gt;Now, here comes the catch: remember that &lt;code&gt;this&lt;/code&gt; is not the instance, &lt;code&gt;this&lt;/code&gt; is a pointer to the instance. You're not capturing the instance by copy, you're only copying a pointer to it.&lt;/p&gt;

&lt;p&gt;This is really important, because it means that any access to the instance members is still &lt;em&gt;reference-like&lt;/em&gt;: by de-referencing &lt;code&gt;this&lt;/code&gt; the lambda is accessing the original external variables, not copies of them!&lt;/p&gt;

&lt;p&gt;This is the reason &lt;a href="https://en.cppreference.com/w/cpp/language/lambda#Lambda_capture"&gt;some sources&lt;/a&gt; explain the capture &lt;code&gt;[this]&lt;/code&gt; as capturing &lt;em&gt;the object by reference&lt;/em&gt;. It's not exactly true, but it's close enough.&lt;/p&gt;

&lt;p&gt;Now, this detail can bite you hard, especially if you fall into thinking that &lt;code&gt;[=]&lt;/code&gt; means that "everything gets captured by-copy", as in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;iostream&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;string&amp;gt;

using namespace std;
using Filter = std::function&amp;lt;bool(const std::string &amp;amp;)&amp;gt;;

class FilterFactory {
 public:
  Filter buildFilter(const string &amp;amp;name) {
    name_ = name;
    return [=](const std::string &amp;amp;name) { 
        return (name == name_);
    };
  }
 private:
  std::string name_;
};

int main() {
    FilterFactory factory;
    auto filter_adam = factory.buildFilter("adam");
    auto filter_eva  = factory.buildFilter("eva");
    cout &amp;lt;&amp;lt; filter_adam("adam") &amp;lt;&amp;lt; endl; // should have returned true, but returns false
    cout &amp;lt;&amp;lt; filter_adam("eva") &amp;lt;&amp;lt; endl;  // should have returned false, but returns true
    cout &amp;lt;&amp;lt; filter_eva("adam") &amp;lt;&amp;lt; endl;
    cout &amp;lt;&amp;lt; filter_eva("eva") &amp;lt;&amp;lt; endl;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here the unsuspecting programmer may have expected &lt;code&gt;[=]&lt;/code&gt; to capture everything by-copy, including class members. That would have made each lambda completely self-contained and independent of the original factory instance.&lt;/p&gt;

&lt;p&gt;In reality, however, the default capture mode is not capturing &lt;code&gt;name_&lt;/code&gt; at all; it is &lt;code&gt;this&lt;/code&gt; that's getting captured, and the access to &lt;code&gt;name_&lt;/code&gt; is actually being done through &lt;code&gt;this-&amp;gt;name_&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The code builds, but the behavior is not what the coder expected. For all practical purposes &lt;code&gt;name_&lt;/code&gt; is being accessed by reference, allowing lambdas to "see" changes to the variable's value that are made after the creation of the lambda.&lt;/p&gt;

&lt;p&gt;And it gets worse: if the &lt;code&gt;factory&lt;/code&gt; variable is destroyed before the lambdas, the &lt;code&gt;this&lt;/code&gt; pointer stored in the lambdas becomes invalid and any access to the data members of the instance it used to point to becomes undefined behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  An intuition of the closure class &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Before closing the article I'll talk a bit about the closure class. The idea is not to be rigorous, but just to provide the reader with an intuition of how lambdas get implemented by the compiler under the hood.&lt;/p&gt;

&lt;p&gt;The first thing to state is that there's no single closure class. A custom closure class gets created automatically by the compiler for each lambda expression.&lt;/p&gt;

&lt;p&gt;These compiler-generated classes cannot be seen or changed, because only the compiler knows what they look like. That's why it is frequently said that the closure objects have an anonymous type.&lt;/p&gt;

&lt;p&gt;To take a peek under the hood we can implement our own closure class from the description in the lambda expression. Without loss of generality, let's say we were asked to compile the following fragment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int foo;
bool bar;

auto lf = [foo, &amp;amp;bar](int factor) { return foo * bar * factor; };
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;From the expression we can deduce that upon construction the lambda needs to capture two variables, one by copy (an integer) and another by reference (a boolean). The lambda objects that get instantiated from the expression need to be callable with  a single input parameter (an integer) and must return a value after execution (another integer).&lt;/p&gt;

&lt;p&gt;A possible implementation of the closure type for the lambda in the example above would be the following one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class ClosureType {
 public:
    ClosureType(int foo, bool &amp;amp;bar) : foo_{foo}, bar_{bar} {}

    int operator()(int factor) const {
        return foo_ * bar_ * factor;
    }

 private:
    int foo_;
    bool &amp;amp;bar_;
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
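As a sanity check, the hand-written class can be exercised side by side with the lambda. Under the assumptions of the example above, both behave identically, including seeing later changes to the by-reference capture:

```cpp
#include <cassert>

// Hand-written stand-in for the compiler-generated closure type,
// repeated here so the snippet is self-contained.
class ClosureType {
 public:
  ClosureType(int foo, bool &bar) : foo_{foo}, bar_{bar} {}
  int operator()(int factor) const { return foo_ * bar_ * factor; }

 private:
  int foo_;
  bool &bar_;
};

bool lambdaMatchesHandWrittenClosure() {
  int foo = 3;
  bool bar = true;

  auto lf = [foo, &bar](int factor) { return foo * bar * factor; };
  ClosureType ct{foo, bar};

  bool same_before = (lf(10) == 30) && (ct(10) == 30);
  bar = false;  // both "see" this change through their stored reference
  bool same_after = (lf(10) == 0) && (ct(10) == 0);
  return same_before && same_after;
}
```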



&lt;p&gt;Where you can see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Captures become closure class constructor parameters. &lt;/li&gt;
&lt;li&gt;By-copy captures like &lt;code&gt;foo&lt;/code&gt; are stored within member variables in each instance.&lt;/li&gt;
&lt;li&gt;By-reference captures like &lt;code&gt;&amp;amp;bar&lt;/code&gt; are not stored themselves; instead, a reference to them is stored in the class.&lt;/li&gt;
&lt;li&gt;By overloading &lt;code&gt;operator()&lt;/code&gt; the closure objects produced by the class become callable objects.&lt;/li&gt;
&lt;li&gt;The code body, return type and parameter list of the lambda expression become the body, return type and parameter list of the &lt;code&gt;operator()&lt;/code&gt; overload.&lt;/li&gt;
&lt;li&gt;In  this case the &lt;code&gt;operator()&lt;/code&gt; overload is a &lt;code&gt;const&lt;/code&gt; method because the lambda is not &lt;code&gt;mutable&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the closure example above it is readily visible that captured variables get read on construction, while lambda parameters get passed to the lambda on execution.&lt;/p&gt;

&lt;p&gt;While not very rigorous, our example above is good enough to get an idea of what a closure class may look like and how each part of the expression affects the implementation of the lambda.&lt;/p&gt;

&lt;p&gt;The closure class example shows how the number and mode of the captures impact the size of the closure objects that get instantiated from it: lambdas are more than just code (as a plain function would be), they carry a context with them, and the size of that context depends on the size and type of the captures.&lt;/p&gt;

&lt;p&gt;By-reference captures are light, and only add a pointer to the size of the lambda object, but they don't guarantee that the captured object will exist at least as long as the lambda exists. By-copy captures, on the other hand, do guarantee the lifetime, but they can be heavy to move around because of the copy operation. Moving around lambdas will be at least as expensive as the most expensive by-copy capture they own.&lt;/p&gt;
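The size effect is easy to observe with `sizeof`. The exact numbers are implementation-defined, but the inequalities below hold on mainstream compilers:

```cpp
#include <array>
#include <cassert>

// Compare closure sizes for different capture modes. The 1 KiB array makes
// the difference between by-copy and by-reference capture obvious.
bool captureSizesAreOrdered() {
  std::array<char, 1024> big{};
  auto no_capture = [] { return 0; };
  auto by_ref = [&big] { return big[0]; };  // stores one reference
  auto by_copy = [big] { return big[0]; };  // stores the whole 1 KiB array

  return sizeof(no_capture) <= sizeof(by_ref) &&  // empty closure is tiny
         sizeof(by_ref) < sizeof(by_copy) &&      // a pointer vs. a full copy
         sizeof(by_copy) >= sizeof(big);          // the copy lives inline
}
```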

&lt;p&gt;In reality compilers can use a variety of implementations for lambdas, depending on what's better suited to each case. For instance, if a lambda does not capture any variables, it's usually cheaper to implement it as a simple anonymous function instead of an anonymous class; closure objects in that case become just pointers to that anonymous function.&lt;/p&gt;

&lt;p&gt;That's why no-capture lambdas can be assigned to variables of the correct pointer-to-function type, but lambdas that capture variables cannot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int factor = 2;
auto no_capture_lambda = [](int n) { return 2 * n; }; 
auto with_capture_lambda = [factor](int n) { return factor * n; }; 

int (*f_ptr_1)(int) = no_capture_lambda;   // this is ok
int (*f_ptr_2)(int) = with_capture_lambda; // this fails to build
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;There's more to say about lambdas, of course, but this is probably enough for a good first bite.&lt;/p&gt;

&lt;p&gt;For a deeper coverage you can visit the somewhat terse but extremely complete &lt;a href="https://en.cppreference.com/w/cpp/language/lambda"&gt;cppreference.com page on the topic&lt;/a&gt;. There you will find also lots of information on new developments in lambda functions in C++17 and beyond, which is something I intentionally left out.&lt;/p&gt;

&lt;p&gt;If you have access to a copy of &lt;a href="https://www.oreilly.com/library/view/effective-modern-c/9781491908419/"&gt;Effective Modern C++&lt;/a&gt;, go read it. If you don't, go get it. The chapter on lambdas is extremely well written and informative, just like the rest of the book. You'll find there a neat trick to use move semantics with lambdas when generalized captures are not available in your compiler. &lt;/p&gt;

&lt;p&gt;I did not get a lot into passing lambdas around, but I did try to put some emphasis on the problems caused by capturing stuff by reference, either intentionally or by mistake. Understanding that is extremely important to safely execute lambdas in a different context from the one they were created in, and to determine how they will interact with other objects in multi-threaded code.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this, until next time!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>cpp</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Multithreading by example, the stuff they didn't tell you</title>
      <dc:creator>Gerardo Puga</dc:creator>
      <pubDate>Thu, 20 Aug 2020 23:54:24 +0000</pubDate>
      <link>https://dev.to/glpuga/multithreading-by-example-the-stuff-they-didn-t-tell-you-4ed8</link>
      <guid>https://dev.to/glpuga/multithreading-by-example-the-stuff-they-didn-t-tell-you-4ed8</guid>
      <description>&lt;h1&gt;
  
  
  The problem
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Filu4j5ea4se1p8ljpy56.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Filu4j5ea4se1p8ljpy56.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Concurrent code can be daunting. In the very moment when you trigger the creation of a second thread in your system you're opening the gates for a completely new type of bug to come crashing through the doors...&lt;/p&gt;

&lt;p&gt;And it's not just bugs of a &lt;em&gt;new kind&lt;/em&gt;, but bugs of the &lt;em&gt;worst kind&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They happen infrequently.&lt;/li&gt;
&lt;li&gt;They are strongly dependent on execution context (timing, load, etc.).&lt;/li&gt;
&lt;li&gt;When they happen, they are hard to reproduce, and hence to test for.&lt;/li&gt;
&lt;li&gt;Very often they are perfect examples of &lt;a href="https://en.wikipedia.org/wiki/Heisenbug" rel="noopener noreferrer"&gt;heisenbugs&lt;/a&gt;: they vanish or mutate as soon as you set your debug tools on them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea for this article came to me while implementing a very simple multi-threaded event queue, where one or more classes can post events, and one or more independent subscribers can be notified of these events through a callback.&lt;/p&gt;

&lt;p&gt;My initial requirements were actually very relaxed, since the use case I needed to satisfy was a really tame one.&lt;/p&gt;

&lt;p&gt;And yet, as I created the implementation I realized that, simple as it was, the code was a very good example of a large number of implementation issues that I often get asked about.&lt;/p&gt;

&lt;p&gt;So I decided I would write about it, because I'm lazy, and I'd like to avoid repeating myself in the future.&lt;/p&gt;

&lt;p&gt;This is what you can find in the rest of the article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That's nice. Where's the code?&lt;/li&gt;
&lt;li&gt;
Unpacking the stuff

&lt;ul&gt;
&lt;li&gt;Event sources should be decoupled from subscribers&lt;/li&gt;
&lt;li&gt;Subscribers can be decoupled from each other, but I won't&lt;/li&gt;
&lt;li&gt;Identify the execution threads at play&lt;/li&gt;
&lt;li&gt;Before we move on: callbacks are evil&lt;/li&gt;
&lt;li&gt;volatile vs atomic&lt;/li&gt;
&lt;li&gt;What if I have lots of shared variables between threads?&lt;/li&gt;
&lt;li&gt;Callbacks, yes, but RAII-ish&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

The devil is in the details

&lt;ul&gt;
&lt;li&gt;The "lost event" pitfall&lt;/li&gt;
&lt;li&gt;The "deaf condition variable" pitfall (part I)&lt;/li&gt;
&lt;li&gt;The "deaf condition variable" pitfall (part II, the revenge)&lt;/li&gt;
&lt;li&gt;One more, for the road&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Overcoming concurrency's hardships

&lt;ul&gt;
&lt;li&gt;The solution: write bug-free concurrent code&lt;/li&gt;
&lt;li&gt;That's wishful thinking. What else have you got?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;That's it, the end&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  That's nice. Where's the code?  &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This is the interface for the class this article will talk about, minus the boilerplate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class EventQueue {
 public:
  void postEvent();
  CallbackHandler subscribeToEvent(const CallbackFunction &amp;amp;callback);
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find the rest of the code &lt;a href="https://github.com/glpuga/myEventQueue" rel="noopener noreferrer"&gt;in this repo&lt;/a&gt;. I'll be cutting and pasting interesting bits from it below as they get talked about.&lt;/p&gt;

&lt;p&gt;It's a publisher/subscriber architecture, where one or more clients subscribe to an &lt;strong&gt;event&lt;/strong&gt;, and register a callback to be called when the event is posted.&lt;/p&gt;

&lt;p&gt;One or more sources can &lt;strong&gt;post&lt;/strong&gt; instances of the event. Each time the event is posted, all of the client's callbacks get triggered. Events carry no information, other than the fact that they "happened". This is what makes this different from a message queue.&lt;/p&gt;

&lt;p&gt;For the sake of simplicity, and because it suited my use case better, the code only guarantees that callbacks will get called at least once after each posted event, and not once for each posted event.&lt;/p&gt;

&lt;p&gt;This means that if an event is posted multiple times before the event subscribers were notified of the first one, then events after the first will not be counted as a separate instance, and subscribers will be called only once.&lt;/p&gt;

&lt;p&gt;This makes &lt;code&gt;EventQueue&lt;/code&gt; kind of a misnomer, because it's not really a queue if you can "miss" events. However, from a design point of view, since events carry no information with them, nothing is lost because of this design choice unless you want to count events (which is a perfectly valid use case, just not mine).&lt;/p&gt;

&lt;p&gt;The implementation of &lt;code&gt;EventQueue&lt;/code&gt; has a &lt;code&gt;postEvent()&lt;/code&gt; method for event sources, and a &lt;code&gt;subscribeToEvent()&lt;/code&gt; method to allow clients to register a callback.&lt;/p&gt;

&lt;p&gt;This is as simple as asynchronous message passing can get. And yet, there's still stuff to unpack here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unpacking the stuff &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Event sources should be decoupled from subscribers &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;It would be trivial to write &lt;code&gt;postEvent()&lt;/code&gt; so that it loops through a vector of client callbacks calling each, but that would mean that the caller is &lt;em&gt;blocked&lt;/em&gt; by the clients while the callbacks execute. &lt;/p&gt;

&lt;p&gt;That couples the execution flow of the event source to the event receivers, and in a multi-threaded system where modules on each end should be protecting their internal states with mutexes, this kind of coupling is likely to end up causing a &lt;a href="https://en.wikipedia.org/wiki/Deadlock" rel="noopener noreferrer"&gt;deadlock&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We'll decouple the event source by making &lt;code&gt;postEvent()&lt;/code&gt; non-blocking. Posting the event only &lt;em&gt;triggers&lt;/em&gt; the event notification, but control will be returned to the sender right-away, potentially even before any of the subscribers have been notified.&lt;/p&gt;

&lt;p&gt;Notice that this goes beyond just making &lt;code&gt;postEvent()&lt;/code&gt; not to wait for the event to be delivered before returning. It's easy to forget that if &lt;code&gt;postEvent()&lt;/code&gt; needs to take ownership of the same &lt;code&gt;EventQueue&lt;/code&gt; mutex that gets taken by the parts of the code that deliver events to clients, then it might block trying to get access to that mutex while a previous event is getting delivered to event subscribers. In most applications this is almost as bad as blocking while the event gets delivered, and just as likely to cause a deadlock.&lt;/p&gt;

&lt;p&gt;No mutex in &lt;code&gt;postEvent()&lt;/code&gt;, got it. However, this method still needs to access the &lt;code&gt;EventQueue&lt;/code&gt; object internal state in a thread-safe manner, because callbacks will be executing from a separate thread.&lt;/p&gt;

&lt;p&gt;To do this, the code uses a &lt;a href="https://en.wikipedia.org/wiki/Monitor_(synchronization)#Condition_variables_2" rel="noopener noreferrer"&gt;condition variable&lt;/a&gt; to trigger the processing of the event in an event forwarder thread:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void EventQueue::postEvent() {
  pending_event_ = true;
  cv_.notify_all();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that &lt;code&gt;postEvent()&lt;/code&gt; never locks. It just sets a boolean flag, notifies some code with a condition variable, and moves on. There's no way to block it.&lt;/p&gt;

&lt;p&gt;I'll go back to that flag later.&lt;/p&gt;

&lt;h4&gt;
  
  
  Subscribers can be decoupled from each other, but I won't &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Subscribers could be decoupled from each other too, but that's not my use case. I will just call each subscriber callback in the list sequentially, one after the other, every time an  event gets posted.&lt;/p&gt;

&lt;p&gt;This is unlikely to cause a deadlock by itself, since callbacks will usually take ownership of their target's mutex on entry and release it on exit. &lt;/p&gt;

&lt;p&gt;This can, however, cause additional processing delays if one of the callbacks calls something blocking or takes a long time to execute, but we're in a &lt;a href="https://en.wikipedia.org/wiki/Cooperative_multitasking" rel="noopener noreferrer"&gt;cooperative-scheduling&lt;/a&gt; scenario here, and we are all good gentlemen.&lt;/p&gt;

&lt;p&gt;An alternative design would be to hold a separate notification thread for each subscriber's callback, but in most cases the extra effort is not worth it if your subscribers' callbacks are guaranteed to be quick and non-blocking (and all callbacks should always be). &lt;/p&gt;

&lt;p&gt;If your callbacks can be considered unsafe or prone to being manipulated by an adversary then you would do well to play it safe and walk the extra mile as a precaution. This is seldom the case in monolithic software, however.&lt;/p&gt;

&lt;h4&gt;
  
  
  Identify the execution threads at play &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;When writing concurrent code you should always be aware of what  threads are at play around a given piece of code.&lt;/p&gt;

&lt;p&gt;There will be threads external to your object, such as the ones executing &lt;code&gt;postEvent()&lt;/code&gt;. These are potentially rooted in other objects in your system, and they are only temporarily executing code that belongs to your object (in this case &lt;code&gt;EventQueue&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;These external threads may be holding locks on mutexes elsewhere in the system (typically, the one protecting the object that created them), and may need to acquire additional mutexes in order to gain access to protected assets within an object they need to interact with.&lt;/p&gt;

&lt;p&gt;Take for instance the case of a piece of code creating a copy of a pre-existing &lt;a href="https://en.cppreference.com/w/cpp/memory/shared_ptr" rel="noopener noreferrer"&gt;shared pointer&lt;/a&gt;. The execution thread (the one making the copy) is external to the shared pointer, and that thread needs to update the pointer's reference count in a thread-safe way (in practice, via an atomic counter in the control block). It does so while executing the copy-assignment operator or the copy-constructor of the smart pointer class.&lt;/p&gt;

&lt;p&gt;External threads are enough for a large number of types of objects whose internal state is only updated by interactions through their public interface. The &lt;code&gt;std::shared_ptr&lt;/code&gt; class is an example of that kind of object.&lt;/p&gt;
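That reference-count handling can be observed directly: many external threads copying the same `shared_ptr` concurrently still produce an exact final `use_count()`, with no locking in our own code. A minimal sketch (not code from the article's repo):

```cpp
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Several threads copy the same shared_ptr concurrently. The control
// block's atomic reference count keeps the final use_count exact.
long copyFromManyThreads(int n_threads, int copies_per_thread) {
  auto original = std::make_shared<int>(42);

  // One private bucket per thread, so threads never touch shared vectors.
  std::vector<std::vector<std::shared_ptr<int>>> copies(n_threads);
  std::vector<std::thread> threads;
  for (int t = 0; t < n_threads; ++t) {
    threads.emplace_back([&, t] {
      for (int i = 0; i < copies_per_thread; ++i) {
        copies[t].push_back(original);  // thread-safe refcount increment
      }
    });
  }
  for (auto &th : threads) th.join();

  return original.use_count();  // 1 + n_threads * copies_per_thread
}
```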

&lt;p&gt;Other objects, however, also need to be able to &lt;em&gt;self-propel&lt;/em&gt; their own state forward, even in the complete absence of external stimuli. These objects will usually have one or more internal threads -threads created by the object itself- tasked with driving the evolution of the internal state of the object through interactions with the exterior.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;EventQueue&lt;/code&gt; class is of this last kind: In order to decouple the event source from the subscribers the object needs to take upon itself the task of delivering the posted events to the subscribers. Executing the subscriber callbacks requires a thread, and therefore &lt;code&gt;EventQueue&lt;/code&gt; creates an internal thread to do this, one that we'll call &lt;code&gt;forwarderThread&lt;/code&gt; (since that's the name of the function that hosts its code).&lt;/p&gt;

&lt;p&gt;This thread spends most of its time asleep. Each time an event is posted through &lt;code&gt;postEvent()&lt;/code&gt; this internal thread gets woken up to take care of the delivery. The code of the thread looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void EventQueue::forwarderThread() {
  std::unique_lock&amp;lt;std::mutex&amp;gt; lock{mutex_};
  while (!halting_) {
    cv_.wait_for(
      lock,
      max_sleep_interval_,
      [this]() { return pending_event_ || halting_; }
    );
    if (pending_event_) {
      pending_event_ = false;
      triggerSubscribers();
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll talk a lot about this function, but let's go one step at a time. For now let's just say that this code spends most of its time sleeping on the condition variable &lt;code&gt;cv_&lt;/code&gt; until something wakes it up using &lt;code&gt;cv_.notify_all()&lt;/code&gt;, such as, for instance, a call to &lt;code&gt;postEvent()&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;It's also important to understand that the thread owns the mutex while it's awake, but the condition variable releases it while it sleeps in &lt;code&gt;wait_for()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Whenever a &lt;code&gt;cv_.notify_all()&lt;/code&gt; call wakes up the forwarder thread, it re-acquires the mutex and springs to action, delivering an event to clients by calling their registered callbacks. After doing that, it goes back to sleep.&lt;/p&gt;

&lt;p&gt;Only the destructor can terminate this loop, by setting the &lt;code&gt;halting_&lt;/code&gt; flag to indicate that the thread needs to exit. This is important, because threads need to be &lt;em&gt;joined&lt;/em&gt; before destroying them. &lt;code&gt;thread::join()&lt;/code&gt; is a blocking call that waits until a thread terminates execution; if a thread is destroyed without having been joined, it will kill your system as steadfastly as a segmentation fault, and with about the same amount of pomp. The &lt;code&gt;halting_&lt;/code&gt; flag tells the thread that it needs to exit, so that it can be joined. It's like whistling to bring back your dog. &lt;/p&gt;
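Putting those pieces together, the shutdown handshake might look like the sketch below. The names mirror the article's `EventQueue`, but this is a simplified stand-in, not the code from the repo:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Minimal stand-in for the shutdown logic described above.
class StoppableWorker {
 public:
  StoppableWorker() : thread_{[this] { run(); }} {}

  ~StoppableWorker() {
    halting_ = true;   // tell the thread it has to exit...
    cv_.notify_all();  // ...wake it up so it notices right away...
    thread_.join();    // ...and wait for it before tearing the object down
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lock{mutex_};
    while (!halting_) {
      // The timeout bounds the damage of a missed notification: in the
      // worst case the predicate gets re-checked 100 ms later.
      cv_.wait_for(lock, std::chrono::milliseconds(100),
                   [this] { return halting_.load(); });
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::atomic<bool> halting_{false};
  std::thread thread_;  // declared last: the members it uses must outlive it
};

bool startAndStopCleanly() {
  StoppableWorker worker;
  return true;  // the destructor joins the thread on scope exit
}
```

Note the member ordering: `thread_` is declared last so that the mutex, condition variable and flag it uses are constructed before the thread starts and destroyed after it is joined.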

&lt;h4&gt;
  
  
  Before we move on: callbacks are evil &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;That's true. They are.&lt;/p&gt;

&lt;p&gt;A callback is someone else's code that gets executed on your thread. Or it's your code, that executes on someone else's thread. Either way, you're not fully in the driving seat; you're literally sharing the seat.&lt;/p&gt;

&lt;p&gt;Also, at its very best, a callback is just a raw pointer from one module to another, and this raises lots of questions regarding the relative lifetimes of the objects on each end. Think of this scenario:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;ObjectA&lt;/code&gt; owns &lt;code&gt;ObjectX&lt;/code&gt; and &lt;code&gt;ObjectY&lt;/code&gt;. During construction it sets a callback in &lt;code&gt;ObjectY&lt;/code&gt; to trigger a method in &lt;code&gt;ObjectX&lt;/code&gt; when a message &lt;code&gt;M&lt;/code&gt; is received. &lt;code&gt;ObjectA&lt;/code&gt; is destroyed, and during the termination sequence it first destroys &lt;code&gt;ObjectX&lt;/code&gt;, then &lt;code&gt;ObjectY&lt;/code&gt;. After the demise of &lt;code&gt;ObjectX&lt;/code&gt;, but before the end of &lt;code&gt;ObjectY&lt;/code&gt;, a message &lt;code&gt;M&lt;/code&gt; arrives, and the callback gets triggered and calls the method on &lt;code&gt;ObjectX&lt;/code&gt;, all of whose data members have already been destroyed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hear you say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;"That's bad design, you have horizontally rigged your system and your architecture does not enforce a functional dependency between X and Y"&lt;/em&gt;. It's true, I say, but the callback making the system crash did not make this any better. Also, even if the callback was hosted in &lt;code&gt;ObjectA&lt;/code&gt; instead of &lt;code&gt;ObjectX&lt;/code&gt;, this sequence of events would still cause a crash if the callback code used &lt;code&gt;ObjectX&lt;/code&gt; during its execution.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"That's a corner case, the system is winding down anyway so no harm is done"&lt;/em&gt;. A corner case, true; harmless, false. You're potentially wrecking the destruction process. I've seen code like this try to acquire an already destroyed mutex, which is undefined behavior, and then block forever causing the software to linger until I had to SIGKILL it. If you're "lucky", your code will crash right away, but even in that case you may lose information that was supposed to be orderly stored by your termination sequence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This only gets worse if you think of a case in which you have subscribers that come and go during the lifetime of the callback-calling class. If the subscriber objects are temporary you need to make sure you unsubscribe the callback before destroying each of them. But there's no way to enforce this, because a callback is just a raw pointer, so it's error-prone and it's easy to cause a crash during execution.&lt;/p&gt;

&lt;p&gt;Callbacks are a necessary evil, though. We can at least make them a bit more RAII-ish. More on that later.&lt;/p&gt;

&lt;h4&gt;
  
  
  volatile vs atomic &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;You can injure yourself if you don't know this. If more than one thread is at play, there are some guarantees that are lost. &lt;/p&gt;

&lt;p&gt;The short version is, if you have two asynchronous execution threads (for instance two Linux process threads, but also a task's code and an interrupt service routine on a humble embedded RTOS), and you want them to signal each other through a variable such as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bool halting_{false};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then that's a recipe for failure. &lt;/p&gt;

&lt;p&gt;The reason for that is that the compiler does not know about that hidden Read-After-Write (RAW) dependency between the writes on one flow of execution and the reads on the second one, and it's very likely to either partially or totally optimize the connection through memory away.&lt;/p&gt;

&lt;p&gt;This can happen in a number of ways, but the most frequent one is because any compiler worth its salt will optimize your object code so that variables get cached within CPU registers for as long as possible to improve the speed of access, and no actual memory accesses will really be performed except for reading the initial value of the variable at the start of the use scope, and storing the final value at the very end of it. If your code is waiting for a variable value to change in a loop, caused by a source that the compiler does not know about, then your code will most likely wait forever.&lt;/p&gt;

&lt;p&gt;Notice that if you risk it and decide to use a regular variable to synchronize independent execution flows, you may find that everything works "as expected", but the actual outcome will vary depending on the architecture of your code, your build configuration, the version of your compiler and what you had for breakfast last week. Your code may start failing a few minutes/days/months/years later when someone else makes an unrelated change to the code, updates the compiler version, or eats last night's pizza for breakfast instead of butter on bread.&lt;/p&gt;

&lt;p&gt;So bare variables are a no-no. How about &lt;code&gt;volatile&lt;/code&gt;?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;volatile bool halting_{false};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a big no-no too, but it's a different kind of error. Lots of people think &lt;code&gt;volatile&lt;/code&gt; can be used for blissful inter-thread communication through a shared variable, and that's correct for other programming languages, and it even works for C++ code running on simpler single-threaded processors with simple memory hierarchies.&lt;/p&gt;

&lt;p&gt;It's not true in general, however, because &lt;code&gt;volatile&lt;/code&gt; is not meant to be used for that, and you should not use it that way because the outcome in that case would be undefined. Quoting directly &lt;a href="https://en.cppreference.com/w/cpp/language/cv" rel="noopener noreferrer"&gt;from cppreference.com on the topic&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access. &lt;strong&gt;This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution&lt;/strong&gt;, see std::memory_order)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The prose here is admittedly a bit obscure, but the warning is clear. A much clearer discussion of the topic can be found in &lt;a href="https://www.oreilly.com/library/view/effective-modern-c/9781491908419/" rel="noopener noreferrer"&gt;Effective Modern C++&lt;/a&gt;, which I encourage you to read. In there, Meyers states that the guideline is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use &lt;code&gt;std::atomic&lt;/code&gt; for concurrency, &lt;code&gt;volatile&lt;/code&gt; for special memory.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Special memory in the context of the quote means, for instance, a memory-mapped I/O register in a micro-controller.&lt;/p&gt;

&lt;p&gt;So, what is &lt;code&gt;std::atomic&lt;/code&gt;? It's a template that "decorates" a bare type in such a way as to ensure two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updates to the variable of an atomic type are seen as atomic by readers. That means they will either see the value after or before the update, but never a mixture or partially updated state.&lt;/li&gt;
&lt;li&gt;The variable of an atomic type performs inter-thread memory synchronization on updates, making sure that after a thread writes on it, any other thread that reads the same variable will read the updated value.&lt;/li&gt;
&lt;/ul&gt;
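A minimal sketch of that second guarantee: one thread publishes data to another through an atomic flag. The memory orders are shown explicitly, though the default `seq_cst` would also work:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int runSignalDemo() {
  std::atomic<bool> done{false};
  int payload = 0;

  std::thread worker{[&] {
    payload = 42;                                 // write the data first...
    done.store(true, std::memory_order_release);  // ...then publish the flag
  }};

  // With a plain (or volatile) bool this loop could legally spin forever;
  // the atomic load guarantees the writer's update becomes visible.
  while (!done.load(std::memory_order_acquire)) {
    // spin-wait: fine for a demo, wasteful in real code
  }
  worker.join();

  return payload;  // release/acquire ordering guarantees we read 42
}
```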

&lt;p&gt;What's the downside of this? Why not default to atomic behavior? As you probably expect: it's not cheap. An atomic access is always more expensive than an access to a regular-type variable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There's no in-register caching of an atomic variable. That means no single-cycle read/write access even for atomics of simple integer-like types, such as &lt;code&gt;bool&lt;/code&gt;. You'll always go through the memory hierarchy to access them, and pay the price for it.&lt;/li&gt;
&lt;li&gt;Accesses to an atomic trigger a synchronization process between the caches of different cores in the CPU to ensure a consistent event-ordering for all threads on the system. This is a punch right on the performance on any modern multi-core, deeply-pipelined processor.&lt;/li&gt;
&lt;li&gt;For smaller datatypes (&lt;code&gt;bool&lt;/code&gt;, for instance), &lt;code&gt;std::atomic&amp;lt;T&amp;gt;&lt;/code&gt; may use an implementation that uses atomic load/store instructions available on most modern processors, but for larger or user-defined datatypes, &lt;code&gt;std::atomic&lt;/code&gt; will basically attach a mutex to the variable and acquire/release around each access. This can be even more expensive.&lt;/li&gt;
&lt;/ul&gt;
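Whether a given `std::atomic<T>` is lock-free can be checked at compile time with C++17's `is_always_lock_free`. The 256-byte type below is a hypothetical stand-in for a large user-defined datatype:

```cpp
#include <atomic>
#include <cassert>

// Large trivially-copyable type: too big for the CPU's atomic instructions.
struct BigBlob {
  char bytes[256];
};

// On mainstream platforms the bool atomic is lock-free, while the big type
// falls back to a hidden lock. Both results are formally
// implementation-defined, which is why this is only a sketch.
constexpr bool small_is_lock_free = std::atomic<bool>::is_always_lock_free;
constexpr bool big_is_lock_free = std::atomic<BigBlob>::is_always_lock_free;
```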

&lt;p&gt;So, and here I'm setting the text in bold again: &lt;strong&gt;use std::atomic for unprotected variables you share between two threads, and use them sparingly&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;That's why in &lt;code&gt;EventQueue&lt;/code&gt; we use&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;std::atomic_bool pending_event_{false};
std::atomic&amp;lt;bool&amp;gt; halting_{false};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to transfer information between some external thread (executing &lt;code&gt;postEvent()&lt;/code&gt; or the &lt;code&gt;EventQueue&lt;/code&gt; destructor) and the internal thread that drives event delivery to subscribers. We &lt;em&gt;need that information to get through&lt;/em&gt; across threads.&lt;/p&gt;

&lt;p&gt;As a final note on this section, &lt;em&gt;unprotected&lt;/em&gt; in the bold text means &lt;em&gt;out of the protection scope of a mutex&lt;/em&gt;. See next section.&lt;/p&gt;

&lt;h4&gt;
  
  
  What if I have lots of shared variables between threads? &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;This is frequently the case. Most often your object's internal state is made of a multitude of internal data members, and updates to that state can be triggered by code executed by either internal or external threads.&lt;/p&gt;

&lt;p&gt;In that case you don't make everything &lt;code&gt;std::atomic&amp;lt;T&amp;gt;&lt;/code&gt;. That would only make accesses to your individual variables atomic, but the internal state of your class would still not be thread-safe, because different threads' execution flows could still intermingle in undefined-and-most-likely-not-safe ways.&lt;/p&gt;
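&lt;p&gt;A toy illustration of the difference (entirely mine, not from the repository): every individual operation below is atomic, yet the check-then-act sequence as a whole is not. To keep the failure deterministic, the unlucky interleaving that would normally depend on thread timing is forced by hand, in a single thread:&lt;br&gt;
&lt;/p&gt;

```cpp
#include <atomic>

// Sells the last ticket twice: "thread A"'s check and its purchase
// are each atomic, but another "thread" can run between them.
int oversell_tickets() {
  std::atomic<int> tickets{1};

  // "Thread A" checks that a ticket is left...
  bool a_saw_a_ticket = tickets.load() > 0;

  // ...but before A buys it, "thread B" runs its whole sequence.
  if (tickets.load() > 0) {
    tickets.fetch_sub(1);  // B buys the last ticket
  }

  // A resumes with stale knowledge and buys one too.
  if (a_saw_a_ticket) {
    tickets.fetch_sub(1);
  }

  return tickets.load();  // -1: one more ticket sold than existed
}
```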

&lt;p&gt;Take for instance,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EventQueue::CallbackHandle EventQueue::subscribeToEvent(const CallbackFunction &amp;amp;callback) {
  std::lock_guard&amp;lt;std::mutex&amp;gt; lock{mutex_};
  auto handler = std::make_shared&amp;lt;CallbackFunction&amp;gt;(callback);
  subscribers_.emplace_back(std::weak_ptr&amp;lt;CallbackFunction&amp;gt;(handler));
  return handler;
}

void EventQueue::forwarderThread() {
  std::unique_lock&amp;lt;std::mutex&amp;gt; lock{mutex_};
  while (!halting_) {
    cv_.wait_for(lock, max_sleep_interval_, [this]() { return static_cast&amp;lt;bool&amp;gt;(pending_event_); });
    if (pending_event_) {
      pending_event_ = false;
      triggerSubscribers();
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These can execute concurrently; the first one is driven by an external thread, from some other object on the system making a call on &lt;code&gt;EventQueue&lt;/code&gt;, while the second one is our internal thread's main body.&lt;/p&gt;

&lt;p&gt;Both are somewhat complex functions, especially the second one. You might be tempted to say that the first one only updates the &lt;code&gt;subscribers_&lt;/code&gt; vector, and that that operation could be made atomic with &lt;code&gt;std::atomic&amp;lt;T&amp;gt;&lt;/code&gt;. That would be true, but as you'll see later, &lt;code&gt;triggerSubscribers()&lt;/code&gt; &lt;strong&gt;iterates&lt;/strong&gt; over that vector, and iterating is not an atomic operation; it is a sequence of operations.&lt;/p&gt;

&lt;p&gt;That's why we need to prevent these two methods from executing at the same time, and to do that we force them to only make changes to the state of the object (that is, to any member of the object) while holding a lock on a shared &lt;code&gt;std::mutex&lt;/code&gt; variable. &lt;/p&gt;

&lt;p&gt;The mutex can only be owned by a single thread at a time. If we enforce that all changes to the state are only made while owning the mutex, then all updates to the state happen in an ordered fashion, and each individual change can be thought of as atomic.&lt;/p&gt;
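&lt;p&gt;A minimal, self-contained sketch of that guarantee (names are mine): two threads hammer a plain &lt;code&gt;int&lt;/code&gt; counter, and because every read-modify-write happens while holding the mutex, no update is lost:&lt;br&gt;
&lt;/p&gt;

```cpp
#include <mutex>
#include <thread>

// Two threads increment a plain counter; the shared mutex makes each
// increment an atomic unit, so the final count is exact.
int locked_counter(int iterations_per_thread) {
  int counter = 0;  // a regular variable, protected by the mutex
  std::mutex mutex;

  auto work = [&]() {
    for (int i = 0; i < iterations_per_thread; ++i) {
      std::lock_guard<std::mutex> lock{mutex};  // RAII: released at '}'
      ++counter;  // atomic while we hold the mutex
    }
  };

  std::thread t1{work};
  std::thread t2{work};
  t1.join();
  t2.join();
  return counter;
}
```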

&lt;p&gt;Notice that enforcing this is up to the developer. Nothing prevents me from updating the state without holding the mutex, but that's usually unsafe unless you have some other way to ensure that changes to the object state are made atomically. &lt;code&gt;postEvent()&lt;/code&gt; does not use a mutex, but it only makes a very bounded type of change by setting an &lt;code&gt;std::atomic_bool&lt;/code&gt; flag, so we're safe (to the extent anything in concurrency is safe).&lt;/p&gt;

&lt;p&gt;There are many ways to lock a mutex, but two are the most common and both are on display in the code above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;std::lock_guard&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;std::unique_lock&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;std::lock_guard&lt;/code&gt; is the simplest one. In fact, it only has two methods: the constructor and the destructor. An instance of this class locks the mutex when it gets constructed, and releases it when it gets destroyed. This is effectively a use of the RAII idiom, where the resource being owned is the lock on the mutex.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;std::unique_lock&lt;/code&gt; behaves in mostly the same way, but it has a few more features. The defining one is that it provides &lt;code&gt;lock()&lt;/code&gt; and &lt;code&gt;unlock()&lt;/code&gt; methods to manually release and re-acquire the lock. &lt;code&gt;std::condition_variable::wait_for()&lt;/code&gt; uses these to unlock the mutex while it waits for the condition variable notification, which is why &lt;code&gt;std::unique_lock&lt;/code&gt; needs to be used there.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;std::unique_lock&lt;/code&gt; provides a superset of the features of &lt;code&gt;std::lock_guard&lt;/code&gt;, but you should always prefer the latter unless you really need the extra features of the former.&lt;/p&gt;
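&lt;p&gt;A small sketch of the defining difference (the function is mine, for illustration): a &lt;code&gt;std::unique_lock&lt;/code&gt; can release and re-acquire the mutex mid-scope, which a &lt;code&gt;std::lock_guard&lt;/code&gt; cannot do:&lt;br&gt;
&lt;/p&gt;

```cpp
#include <mutex>

// unique_lock can drop and re-take the mutex mid-scope; this is
// exactly what a condition variable does on your behalf while waiting.
bool unique_lock_round_trip() {
  std::mutex mutex;
  std::unique_lock<std::mutex> lock{mutex};
  const bool locked_on_construction = lock.owns_lock();  // true

  lock.unlock();  // temporarily give the mutex up
  const bool released_mid_scope = !lock.owns_lock();     // true

  lock.lock();    // take it back; the destructor will release it again
  return locked_on_construction && released_mid_scope && lock.owns_lock();
}
```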

&lt;p&gt;The creation of the lock (either &lt;code&gt;lock_guard&lt;/code&gt; or &lt;code&gt;unique_lock&lt;/code&gt;) should be the first statement in the scope you want to protect, usually a public method of your object; anything before it will not be protected by the mutex. Everything after it, all the way to the end of the scope, will be.&lt;/p&gt;

&lt;p&gt;Ok, locking a mutex gives us mutual exclusion, and therefore atomicity. What about inter-thread synchronization to guarantee side-effect ordering consistency on memory accesses and stuff? &lt;code&gt;std::atomic&amp;lt;T&amp;gt;&lt;/code&gt; was big on this too.&lt;/p&gt;

&lt;p&gt;Well, each time you lock/unlock a mutex memory access synchronization is guaranteed so that any variable you update within the scope of the lock will be visible with the updated value to any other thread that wants to read it later.&lt;/p&gt;

&lt;p&gt;So, if you protect all accesses to a variable within the scope of locked mutex, the variable can be just a regular variable, because the mutex will ensure synchronization on lock/unlock. That's the case for &lt;code&gt;subscribers_&lt;/code&gt;, but not for &lt;code&gt;pending_event_&lt;/code&gt;, which is also updated from &lt;code&gt;postEvent()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As I said earlier, &lt;code&gt;postEvent()&lt;/code&gt; does not lock the mutex, to avoid potentially having to wait for the mutex to be released while the callbacks are getting executed. To make sure the updates to &lt;code&gt;pending_event_&lt;/code&gt; are seen by the forwarder thread we need to make that variable an &lt;code&gt;std::atomic_bool&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This design choice will impact something else, one that can be the source of a very subtle bug if you're not careful.&lt;/p&gt;

&lt;h4&gt;
  
  
  Callbacks, yes, but RAII-ish &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Callbacks are a necessary evil, I said. I need callbacks on this code.&lt;/p&gt;

&lt;p&gt;However, instead of going for raw callbacks that would have to be both registered and unregistered manually by event subscribers, this code tries to at least ensure that callbacks will get automatically unregistered at the very moment when a handler returned by the &lt;code&gt;subscribeToEvent()&lt;/code&gt; method goes out of scope.&lt;/p&gt;

&lt;p&gt;The idea is that a subscriber will register a callback to  receive notifications from the event by doing something like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto handler = event_queue.subscribeToEvent(
  []() { doSomething(); }
); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The callback (in this case a lambda) will remain registered to the event represented by the &lt;code&gt;event_queue&lt;/code&gt; instance all the way until the point where &lt;code&gt;handler&lt;/code&gt; goes out of scope or is intentionally destroyed.&lt;/p&gt;

&lt;p&gt;Once the handler goes out of scope, the &lt;code&gt;event_queue&lt;/code&gt; instance will automatically remove the callback from the &lt;code&gt;subscribers_&lt;/code&gt; list with no need of any further interaction from the subscriber.&lt;/p&gt;

&lt;p&gt;To implement this the code uses &lt;code&gt;std::weak_ptr&lt;/code&gt;, which is a weak version of &lt;code&gt;shared_ptr&lt;/code&gt; that does not hold ownership over the pointed-to-data, but that allows us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check if an associated &lt;code&gt;shared_ptr&lt;/code&gt; still exists.&lt;/li&gt;
&lt;li&gt;Get a copy of the associated &lt;code&gt;shared_ptr&lt;/code&gt; if at least one other copy still exists.&lt;/li&gt;
&lt;/ul&gt;
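&lt;p&gt;Both behaviors are easy to see in a minimal sketch (not taken from the repository):&lt;br&gt;
&lt;/p&gt;

```cpp
#include <memory>

// A weak_ptr observes a shared_ptr without keeping it alive: lock()
// yields a usable shared_ptr only while at least one owner remains.
bool weak_ptr_round_trip() {
  auto shared = std::make_shared<int>(42);
  std::weak_ptr<int> weak{shared};

  const bool alive_before = !weak.expired() && *weak.lock() == 42;

  shared.reset();  // last owner gone, pointee destroyed

  const bool dead_after = weak.expired() && weak.lock() == nullptr;
  return alive_before && dead_after;
}
```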

&lt;p&gt;So, keeping in mind that &lt;code&gt;EventQueue&lt;/code&gt; defines following aliases in class scope,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  using CallbackFunction = std::function&amp;lt;void()&amp;gt;;
  using CallbackHandle = std::shared_ptr&amp;lt;CallbackFunction&amp;gt;;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the implementation of &lt;code&gt;subscribeToEvent()&lt;/code&gt; becomes very simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  auto handler = std::make_shared&amp;lt;CallbackFunction&amp;gt;(callback);
  subscribers_.emplace_back(std::weak_ptr&amp;lt;CallbackFunction&amp;gt;(handler));
  return handler;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All this does is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It creates the original &lt;code&gt;shared_ptr&amp;lt;CallbackFunction&amp;gt;&lt;/code&gt;. This shared pointer is the &lt;code&gt;CallbackHandle&lt;/code&gt; that &lt;code&gt;subscribeToEvent()&lt;/code&gt; will return to the caller.&lt;/li&gt;
&lt;li&gt;It creates a &lt;code&gt;weak_ptr&amp;lt;CallbackFunction&amp;gt;&lt;/code&gt; associated to the &lt;code&gt;shared_ptr&lt;/code&gt; created above, and stores that weak pointer in the internal list of subscribers. This weak pointer will be used by &lt;code&gt;EventQueue&lt;/code&gt; to track which subscribers have been destroyed before calling on their callbacks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;triggerSubscribers()&lt;/code&gt; method, which gets called by &lt;code&gt;forwarderThread()&lt;/code&gt; and therefore runs within the scope of the latter's mutex lock, basically does only two things: it removes any &lt;code&gt;weak_ptr&lt;/code&gt; whose associated &lt;code&gt;shared_ptr&lt;/code&gt; has gone out of scope, and then it iterates over the remaining subscribers calling the callback on each.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void EventQueue::triggerSubscribers() {
  subscribers_.remove_if([](const EventQueue::WeakCallbackHandler &amp;amp;handler) { return handler.expired(); });
  ...
  std::for_each(subscribers_.begin(), subscribers_.end(), caller);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the &lt;code&gt;caller()&lt;/code&gt; lambda needs to check again whether the &lt;code&gt;weak_ptr&lt;/code&gt; is still valid, because the &lt;code&gt;shared_ptr&lt;/code&gt; might have been destroyed after the call to &lt;code&gt;subscribers_.remove_if()&lt;/code&gt; but before &lt;code&gt;std::for_each()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto caller = [](const EventQueue::WeakCallbackHandler &amp;amp;weak_handler) {
  auto handler = weak_handler.lock();
  if (handler) {
    (*handler)();
  }
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that while this is an improvement over manually registering/unregistering callbacks, there's still one issue with this code. We'll go back to it in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  The devil hides in the details &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;There's more than one devil hidden here. &lt;/p&gt;

&lt;h3&gt;
  
  
  The "lost event" pitfall &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The first hidden devil, which is easy to fall for, is also relatively straightforward to detect and understand.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;forwarderThread()&lt;/code&gt;, if we change&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if (pending_event_) {
  pending_event_ = false;
  triggerSubscribers();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;for&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if (pending_event_) {
  triggerSubscribers();
  pending_event_ = false;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then we are introducing a bug that could very easily be missed during testing unless we are looking for it.&lt;/p&gt;

&lt;p&gt;Setting the flag to &lt;code&gt;false&lt;/code&gt; &lt;strong&gt;after&lt;/strong&gt; running the callbacks means that if the event is posted again after checking the value at &lt;code&gt;if (pending_event_)&lt;/code&gt; but before &lt;code&gt;pending_event_ = false;&lt;/code&gt;, the event might be missed by some of the subscribers.&lt;/p&gt;

&lt;p&gt;Instead, setting it &lt;code&gt;false&lt;/code&gt; &lt;strong&gt;before&lt;/strong&gt; running the callbacks means that regardless of where a second event falls, the guarantee that all callbacks will be called at least once after  each posted event still holds for all subscribers.&lt;/p&gt;
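&lt;p&gt;As an aside (this is an alternative, not what the article's code does), the check-and-clear pair can even be collapsed into a single atomic operation with &lt;code&gt;exchange()&lt;/code&gt;, which reads the old value and stores the new one in one indivisible step:&lt;br&gt;
&lt;/p&gt;

```cpp
#include <atomic>

// Atomically clears the flag and reports whether it was set, so an
// event posted while the callbacks run stays recorded for the next
// loop iteration instead of being wiped out afterwards.
bool consume_pending_event(std::atomic<bool>& pending_event) {
  return pending_event.exchange(false);
}
```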

&lt;p&gt;That one was easy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "deaf condition variable" pitfall (part I) &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;You may have noticed that the &lt;code&gt;wait_for()&lt;/code&gt; call in &lt;code&gt;forwarderThread()&lt;/code&gt; has a timeout parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cv_.wait_for(
    lock,
    max_sleep_interval_,
    [this]() { return pending_event_ || halting_; }
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's perfectly reasonable to question this: Why would a timeout be needed here? This is code that should only wake up whenever there's an event to deliver, there's not really a reason to wake it up just because a timeout has expired.&lt;/p&gt;

&lt;p&gt;Even more so, since the timeout causes the condition variable to be able to wake up for reasons other than receiving an event, it needs to check if an event was posted when it wakes up, and if not it'll just go back to sleep!&lt;/p&gt;

&lt;p&gt;The first part is almost true, and we'll go back to that later. The second one is less so, because the destructor may also wake the condition variable, and we don't want to trigger the callbacks spuriously just because &lt;code&gt;EventQueue&lt;/code&gt; is getting destroyed, so we need to check if there's an event to deliver to clients regardless of the timeout.&lt;/p&gt;

&lt;p&gt;Let's take a look at how destruction takes place.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EventQueue::~EventQueue() {
  halting_ = true;
  cv_.notify_all();
  forwarder_thread_.join();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a typical way to terminate an internal thread in objects that have one.&lt;/p&gt;

&lt;p&gt;Notice that we don't lock the mutex in this method. We can't. &lt;/p&gt;

&lt;p&gt;If we did, &lt;code&gt;cv_.wait_for()&lt;/code&gt; in &lt;code&gt;forwarderThread()&lt;/code&gt; would be unable to wake up because it would not be possible for it to lock the mutex, and the destructor would block in &lt;code&gt;forwarder_thread_.join()&lt;/code&gt; waiting for the thread to terminate execution. Since each thread is waiting for the other to move forward before it can make progress itself, no progress can be made. This is a classic deadlock scenario.&lt;/p&gt;

&lt;p&gt;To avoid this the destructor does not try to lock the mutex and instead it signals the intention to kill the thread through an &lt;code&gt;std::atomic_bool&lt;/code&gt; variable.&lt;/p&gt;

&lt;p&gt;Notice that, because the destructor code as a whole is not atomic, the order of the instructions matters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;halting_ = true;
cv_.notify_all();
forwarder_thread_.join();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the order of these lines is changed the forwarder thread might not wake up, or might wake up, miss the halting flag, and go back to sleep.&lt;/p&gt;

&lt;p&gt;We're good then. We won't miss the flag because we are setting stuff in the right order.&lt;/p&gt;

&lt;p&gt;Not so.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "deaf condition variable" pitfall (part II, the revenge) &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;There's more to say about the condition variable. Let's talk about the lambda in the third argument to &lt;code&gt;wait_for&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's start by stating that, in theory, this should be enough for the code to work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while (!halting_) {
  cv_.wait(lock);
  if (pending_event_) {
    ...
  }
 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both the destructor notification and newly posted events would be able to wake this loop, which would then process the input according to its source. However, this has at least two issues.&lt;/p&gt;

&lt;p&gt;The first one is that a condition variable does not remember notifications: any &lt;code&gt;notify_all()&lt;/code&gt;/&lt;code&gt;notify_one()&lt;/code&gt; that is called while the waiting thread is awake is simply lost.&lt;/p&gt;

&lt;p&gt;So, if the destructor gets called in the unlucky interval in which &lt;code&gt;forwarderThread()&lt;/code&gt; goes from evaluating &lt;code&gt;while (!halting_)&lt;/code&gt; to setting the thread to sleep in &lt;code&gt;cv_.wait(lock);&lt;/code&gt;, then you're dead. Unresponsive. The destructor will block waiting for the thread to terminate, but the thread will be sleeping and won't wake up because it missed the condition variable notification and the flag.&lt;/p&gt;

&lt;p&gt;Notice that a very similar scenario would miss (or at least delay) a newly posted event if the notification arrives while &lt;code&gt;forwarderThread()&lt;/code&gt; is still awake delivering a previous event. &lt;/p&gt;

&lt;p&gt;In both cases the system is not really totally blocked, because the lucky arrival of a new &lt;code&gt;postEvent()&lt;/code&gt; call would bring the loop back to the land of the living by waking up the &lt;code&gt;forwarderThread()&lt;/code&gt;, but you should not count on this.&lt;/p&gt;

&lt;p&gt;Also, it turns out that condition variables have the annoying habit of sometimes waking up just because they feel like it. If you don't want to take my word for it, you can read &lt;a href="https://en.cppreference.com/w/cpp/thread/condition_variable/wait" rel="noopener noreferrer"&gt;the very small print on cppreference.com&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It may also be unblocked spuriously.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before you wonder, &lt;a href="https://en.wikipedia.org/wiki/Spurious_wakeup" rel="noopener noreferrer"&gt;it's not a bug, it's a feature&lt;/a&gt;. It's just too hard to create condition variable implementations that don't do that, so we need to live and code with this in mind.&lt;/p&gt;

&lt;p&gt;To avoid spurious awakenings, a second form of the &lt;code&gt;wait()&lt;/code&gt;/&lt;code&gt;wait_for()&lt;/code&gt; calls takes a &lt;em&gt;predicate&lt;/em&gt; function parameter. This is usually just a lambda that returns &lt;code&gt;false&lt;/code&gt; if the condition for awakening is not satisfied. When a predicate is provided, the condition variable checks the predicate upon waking up, and if the value is &lt;code&gt;false&lt;/code&gt; it assumes that it woke up spuriously and goes back to sleep.&lt;/p&gt;
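&lt;p&gt;Here is the predicate form in a complete, runnable sketch outside of &lt;code&gt;EventQueue&lt;/code&gt; (all names are mine). Note that the producer updates the condition while holding the mutex before notifying, which is what makes the hand-off reliable:&lt;br&gt;
&lt;/p&gt;

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// The predicate is re-checked under the lock on every wakeup, so both
// spurious wakeups and notifications sent before the wait started are
// handled correctly.
int predicate_wait_demo() {
  std::mutex mutex;
  std::condition_variable cv;
  bool ready = false;
  int result = 0;

  std::thread consumer{[&]() {
    std::unique_lock<std::mutex> lock{mutex};
    cv.wait(lock, [&]() { return ready; });  // sleeps only while !ready
    result = 42;
  }};

  {
    std::lock_guard<std::mutex> lock{mutex};
    ready = true;  // update the condition while holding the mutex
  }
  cv.notify_all();  // wake the consumer; it re-checks 'ready' itself

  consumer.join();
  return result;
}
```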

&lt;p&gt;Agreed, the predicate sounds like a good idea. The code would look like this now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while (!halting_) {
  cv_.wait(
     lock,
     [this]() { return pending_event_ || halting_; }
  );
  if (pending_event_) {
    ...
  }
 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the predicate not only solves the issue of spurious awakenings, it also fixes missed notifications! If &lt;code&gt;postEvent()&lt;/code&gt; got called while the code was running the callbacks, then when it loops back to waiting the predicate will be true, so the condition variable will not even go to sleep and a new processing round will start.&lt;/p&gt;

&lt;p&gt;Similarly, it also fixes the unfortunate case where the destructor gets called in that narrow stretch between the while and the &lt;code&gt;wait_for&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's it, right? We're done! The timeout was useless.&lt;/p&gt;

&lt;p&gt;Not so. Concurrency is such a fragile thing.&lt;/p&gt;

&lt;p&gt;It may not look like it from our abstraction level, but&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[this]() { return pending_event_ || halting_; }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;actually takes some time to get executed. A small amount, but still. There's a non-zero chance that the two lines&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; halting_ = true;
 cv_.notify_all();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will get executed after &lt;code&gt;halting_&lt;/code&gt; was evaluated by the lambda, but before &lt;code&gt;cv_.wait_for()&lt;/code&gt; got to sleep. &lt;/p&gt;

&lt;p&gt;The end result is the same as before: the destructor blocks, waiting for the thread to terminate, but the forwarder thread misses the notification and goes to sleep with no-one to wake it up because the destructor won't try twice.&lt;/p&gt;

&lt;p&gt;Worse, a similarly ill-timed call to &lt;code&gt;postEvent()&lt;/code&gt; can suffer the same fate.&lt;/p&gt;

&lt;p&gt;So, after adding the predicate our worst possible case is still the same as before: blocking for an indeterminate amount of time. We just narrowed the time window during which that may happen.&lt;/p&gt;

&lt;p&gt;Feeling depressed? I know the feeling. But hey! what are the chances, right? You need to be really unlucky to hit this issue!&lt;/p&gt;

&lt;p&gt;Not so unlucky, it turns out. The repository provides a few unit tests along with the &lt;code&gt;EventQueue&lt;/code&gt; code. This is the first one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TEST_F(QueueEventTest, ConstructionTest) {
  auto uut = std::make_unique&amp;lt;EventQueue&amp;gt;();
  uut = nullptr;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This unit test only creates an EventQueue and then destroys it right away. It's a perfectly valid test that checks the least amount of functionality that can be tested: construction and destruction don't blow up.&lt;/p&gt;

&lt;p&gt;Turns out that this test hits the sweet spot of the missed-notification failure described above frequently enough that if it is run on a &lt;a href="https://en.wikipedia.org/wiki/Continuous_integration" rel="noopener noreferrer"&gt;CI&lt;/a&gt;, your coworkers will be upset by its random failures very soon after it gets added.&lt;/p&gt;

&lt;p&gt;There are two solutions to this:&lt;/p&gt;

&lt;h5&gt;
  
  
  Lock the mutex around any code that may wake the condition variable
&lt;/h5&gt;

&lt;p&gt;If we lock the mutex around the code that sets the flags and signals the condition variable, the problem is fixed: since the forwarder thread owns the mutex while it's awake, mutual exclusion ensures that flag updates and wake-up notifications can only get sent while it is asleep!&lt;/p&gt;

&lt;p&gt;Great!&lt;/p&gt;

&lt;p&gt;No! As I said before, we can't lock the mutex in the destructor. Not for the whole length of it, at least. We could, potentially, do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EventQueue::~EventQueue() {
  {
    std::lock_guard&amp;lt;std::mutex&amp;gt; lock{mutex_};
    halting_ = true;
  }
  cv_.notify_all();
  forwarder_thread_.join();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's not pretty, but it would get the job done, and in this case &lt;code&gt;halting_&lt;/code&gt; could be just a regular &lt;code&gt;bool&lt;/code&gt; instead of an &lt;code&gt;std::atomic_bool&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;However, the problem with not allowing anyone to do anything unless they own the mutex is that, well, no-one can do anything unless they own the mutex. That includes &lt;code&gt;postEvent()&lt;/code&gt; which would have to wait until the mutex is released before being able to post a new event. This might cause the &lt;code&gt;postEvent()&lt;/code&gt; caller to block while a callback is waiting for a different mutex, opening the opportunity for an indirect deadlock.&lt;/p&gt;

&lt;p&gt;Alternative designs can avoid this, at the cost of some additional complexity, but more complexity also means more potential for other subtle bugs like this creeping in the system.&lt;/p&gt;

&lt;h5&gt;
  
  
  The cheap solution
&lt;/h5&gt;

&lt;p&gt;The chances of this failure scenario happening are very low. If we keep the damage low too, the total risk is low. And if the risk is low, the cost/benefit leans towards a cheap-but-safe fix, rather than a complex fix that may open the door to equally complex bugs.&lt;/p&gt;

&lt;p&gt;It turns out that a very cheap solution is to use a timeout.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cv_.wait_for(
    lock,
    max_sleep_interval_,
    [this]() { return pending_event_ || halting_; }
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The timeout puts an upper bound on the time the code can stay blocked if any of the unfortunate scenarios I described above plays out: if something like that happens, your system won't block for an indeterminate amount of time; in the worst case it will only block for the duration of the timeout.&lt;/p&gt;

&lt;p&gt;The side effect is that the loop in &lt;code&gt;forwarderThread()&lt;/code&gt; is going to wake up periodically even when nothing is happening, so most of the time the thread will wake up only to go back to sleep again.&lt;/p&gt;

&lt;p&gt;However, if your timeout is reasonable, the cost of periodically waking up because of it is negligible, and a fair price to pay to keep bugs at bay.&lt;/p&gt;
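&lt;p&gt;That upper bound is easy to demonstrate in isolation (names are mine): with a predicate that never becomes true, &lt;code&gt;wait_for()&lt;/code&gt; returns the predicate's value, &lt;code&gt;false&lt;/code&gt;, once the timeout expires, instead of blocking forever:&lt;br&gt;
&lt;/p&gt;

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// No-one ever notifies this condition variable, yet the call returns
// after roughly 50ms with the predicate's value: false.
bool wait_for_timed_out_wakeup() {
  std::mutex mutex;
  std::condition_variable cv;
  const bool event_pending = false;  // the condition never holds

  std::unique_lock<std::mutex> lock{mutex};
  return cv.wait_for(lock, std::chrono::milliseconds(50),
                     [&]() { return event_pending; });
}
```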

&lt;p&gt;I won't go as far as claiming that this is the best solution. It's not a compromise I'm particularly happy to make, but it's one that usually gets the work done. I'll be happy to hear ideas for alternative solutions!&lt;/p&gt;

&lt;h4&gt;
  
  
  One more, for the road &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Finally, let's go back to the &lt;code&gt;caller()&lt;/code&gt; lambda in  &lt;code&gt;triggerSubscribers()&lt;/code&gt;. As I said before, this only partially solves some of the issues I enumerated about callbacks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto caller = [](const EventQueue::WeakCallbackHandler &amp;amp;weak_handler) {
  auto handler = weak_handler.lock();
  if (handler) {
    (*handler)();
  }
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a kind of worrying problem here, and you may have already noticed it.&lt;/p&gt;

&lt;p&gt;While &lt;code&gt;weak_handler.lock()&lt;/code&gt; will atomically return a shared pointer to the callback, or a &lt;code&gt;nullptr&lt;/code&gt; if no more copies of the callback exist, there's no way to guarantee, with this implementation, that the handler will not be destroyed after&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto handler = weak_handler.lock(); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;but before the handler callback gets called in&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(*handler)();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Crap! We might be calling a callback right after the client decided it did not want any more calls! This is a completely plausible scenario, in which, for a short moment, the only remaining copy of the callback handler is the one in the &lt;code&gt;caller()&lt;/code&gt; lambda, which is for sure not the intention of the code here.&lt;/p&gt;

&lt;p&gt;I won't go as far as fixing this here. The article is already long as it stands, so I'll let the reader figure this one out.&lt;/p&gt;

&lt;p&gt;A general idea would be that we need to make getting a copy of the shared pointer, testing and running the handler an &lt;em&gt;atomic block&lt;/em&gt; with respect to callback handler destruction. That is, either all of this gets executed &lt;em&gt;before&lt;/em&gt; destruction of the handle ends on the client side, or &lt;em&gt;none&lt;/em&gt; of it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overcoming concurrency's hardships &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The code for &lt;code&gt;EventQueue&lt;/code&gt; is almost as simple as any multi-threaded object can be in C++1x, and yet at no point were we short of potential problems.&lt;/p&gt;

&lt;p&gt;Concurrency is hard. Even well-weathered professionals with years of coding multi-threaded software find themselves amazed by new ways in which a system can throw itself into a pit, such as managing to deadlock through a loop formed by three different mutexes in different places of the system when the input message timing is just unlucky enough.&lt;/p&gt;

&lt;p&gt;Fortunately there's a solution for those endless hours of debugging multi threaded code.&lt;/p&gt;

&lt;h4&gt;
  
  
  The solution: write bug-free concurrent code &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;When creating concurrent code, testing will just not cut it, so you need to write flawless code.&lt;/p&gt;

&lt;p&gt;That's it. Solved. You're welcome.&lt;/p&gt;

&lt;h4&gt;
  
  
  That's wishful thinking. What else have you got? &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Ok, the next best thing then. The next best thing right after writing flawless code is writing safe code.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be conservative. Assume the worst possible timing for anything that might happen, worry that your algorithm can be unscheduled and replaced by a competing thread at any moment, keep in mind that the system is full of latencies, and that things will never execute in any particular order unless you force it through explicit synchronization.&lt;/li&gt;
&lt;li&gt;Keep it simple. In concurrency complexity arises naturally from the huge number of ways in which different execution flows can be shuffled like a pack of cards. The more complex your code is, the larger the number of execution outcomes that can result from those combinations. The fewer possible outcomes, the easier it is to plan for all of them.&lt;/li&gt;
&lt;li&gt;Correctness over performance. Concurrency is very hard to debug. If your code is fast but only a bit buggy, you'll spend more time debugging than your code will ever save compared to a simpler but less error-prone solution.&lt;/li&gt;
&lt;li&gt;When given the choice between competing design solutions use the most conservative one that still makes sense for your system.&lt;/li&gt;
&lt;li&gt;Your code will still fail, so when possible fail gracefully.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a procedure. It's a mindset. And it's a kind-of-obvious one, but most people don't code like this, citing some pretext that usually alludes to &lt;em&gt;efficiency&lt;/em&gt;, &lt;em&gt;performance&lt;/em&gt;, &lt;em&gt;execution time&lt;/em&gt; or some other imaginary quality of their software that they have usually only measured in their minds.&lt;/p&gt;

&lt;p&gt;This mindset is especially important for concurrent software, because most bugs, if present, will get through testing and into production code unseen and unheard, like ninjas.&lt;/p&gt;

&lt;h2&gt;
  
  
  That's it, the end &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;That was much longer than I thought when I started. Thanks for reading this far. Please let me know if you liked it!&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
