
How to make your code embarrassingly faster?

Pol Monroig ・ 5 min read

Amdahl’s law, named after Gene Amdahl and formulated in 1967, is well known in the computer science community. It tells us how much faster a program can run as we add more resources to it, but how do we add resources to a program in the first place? Imagine a program as a single task. If we could divide it into smaller tasks, and those tasks into smaller tasks still, we would end up with a program made of many tasks; if every task is independent of the others, we could assign each one to a different processor. Dividing work across multiple processors in this way is called parallel computing. The size of the tasks relative to the total amount of work is called task granularity: the smaller the tasks, the finer the granularity.

It is easy to see that running many tasks concurrently can save time, since we do many things at once. Awesome, right? Well… not always. First of all, creating a new task has an overhead we need to take into account, and tasks may depend on one another (e.g. task A must finish before task B starts), so we might not be able to parallelize everything. Finally, our computer can only handle a limited number of concurrent tasks at the same time. If you still think parallel computing is the best thing that ever happened, keep reading and I will show you how to make your code embarrassingly parallel (despite the issues above).

OpenMP Introduction

Now, there are multiple ways to parallelize your code, but I will focus on the OpenMP API because it is easy to learn and to implement. Let’s see a simple example of how you can parallelize the sum of two vectors:

std::vector<int> sum(std::vector<int> const& v1, std::vector<int> const& v2){

    int size = v1.size(); // could have also done v2.size()
    std::vector<int> output(size); // initialize output with zeros 

    #pragma omp parallel 
    #pragma omp for 
    for(int i = 0; i < size; ++i){
      output[i] = v1[i] + v2[i];
    }

    return output;

}

Summing two vectors is simple: you sum each pair of elements individually and write the result into another vector. But what are those #pragma omp (something) lines? They are what enable us to parallelize the code.

  • #pragma omp parallel creates a parallel region; every available thread executes it.
  • #pragma omp for divides the loop into k chunks, and each thread executes a different chunk (task). This way each thread sums a different part of the vector.

The sum of two vectors is what parallel computing calls embarrassingly parallel: we can make the tasks as small as we want and they have no dependencies (the sum of each pair of elements is independent of every other pair). What consequences does this have? We can parallelize as much as the hardware supports, and we don’t have to worry about data being shared between threads.
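As a side note, these two directives are commonly merged into the combined form #pragma omp parallel for, which creates the parallel region and splits the loop in one line. A minimal sketch of the same vector sum using the combined form:

```cpp
#include <vector>

// Same vector sum as above, written with the combined "parallel for"
// directive: one line both spawns the thread team and divides the loop.
std::vector<int> sum(std::vector<int> const& v1, std::vector<int> const& v2) {
    int size = v1.size();
    std::vector<int> output(size);

    #pragma omp parallel for
    for (int i = 0; i < size; ++i) {
        output[i] = v1[i] + v2[i];
    }
    return output;
}
```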

How to avoid data races

Imagine that instead of summing two vectors we want the sum of the elements of a single one. It is tempting to reuse the same OpenMP structure as before, but that would cause a data race. A data race happens when two threads access the same data at the same time; it makes the results inconsistent, since the updates happen in a different order on every run. To visualize this, let's see how we could solve the issue.

    std::vector<int> v; 
    ...
    int sum = 0;
    #pragma omp parallel 
    #pragma omp for 
    for(int i = 0; i < v.size(); ++i){
      #pragma omp critical 
      sum += v[i];
    }

A simple way to fix this is adding #pragma omp critical; this directive ensures that no two threads execute the protected code at the same time, so when a thread wants to update sum it has to wait for any other thread to finish first. A more efficient way (less overhead) is to replace critical with atomic. Either way, if you try to execute this code you’ll see that it does not go as fast as you would think, because most of the time the threads are just waiting. I wish there were a way to make it faster…
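For illustration, here is a sketch of the atomic variant (the wrapper function and its name sum_elements are mine; note that atomic only accepts simple updates such as sum += v[i]):

```cpp
#include <vector>

// Sum the elements of a vector; "atomic" protects the single update to
// sum with a hardware-level atomic operation, which is cheaper than a
// full critical section but still serializes the updates.
int sum_elements(std::vector<int> const& v) {
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < (int)v.size(); ++i) {
        #pragma omp atomic
        sum += v[i];
    }
    return sum;
}
```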

    std::vector<int> v; 
    ...
    int sum = 0;
    #pragma omp parallel 
    #pragma omp for reduction(+:sum)
    for(int i = 0; i < v.size(); ++i){
      sum += v[i];
    }

In fact, there is: the reduction clause solves all our problems. When a thread is created, it gets a private copy of sum initialized to 0. Each thread then updates its own copy, and at the end the copies are summed together. By now you may be wondering how to make a variable explicitly private or shared; in the #pragma omp for statement you can add a shared(var) clause or a private(var) clause.
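As a sketch of these data-sharing clauses (the function scaled_sum and the variables factor and term are made up for illustration):

```cpp
#include <vector>

// Sum each element multiplied by a factor. "factor" is shared (read-only,
// visible to all threads), "term" is private (each thread gets its own
// scratch copy), and "sum" is combined across threads by the reduction.
int scaled_sum(std::vector<int> const& v, int factor) {
    int sum = 0;
    int term = 0; // per-thread scratch variable
    #pragma omp parallel for shared(factor) private(term) reduction(+:sum)
    for (int i = 0; i < (int)v.size(); ++i) {
        term = v[i] * factor;
        sum += term;
    }
    return sum;
}
```

Note that a private copy starts out uninitialized, so the loop must write term before reading it.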

Controlling the number of tasks

Until now we have let OpenMP decide how many tasks are generated, and thus the number of threads, but how can we control this ourselves?


#include <omp.h> // remember to include the OpenMP header 

// equivalent to setting the env bash var 
// OMP_NUM_THREADS=N
omp_set_num_threads(N);

#pragma omp parallel num_threads(N)

The first way is to set the environment variable OMP_NUM_THREADS. A second method is the omp_set_num_threads function, which sets the number of threads for all subsequent parallel regions; if you want to set the number of threads for one specific region, use the num_threads clause on that region directly. But this only controls how many threads are created. What if we want to control the number of tasks?


// sum a vector from index begin to index end and return the value 
int sum_vector(std::vector<int> const& v, int begin, int end);

int sum1, sum2; 

// n is the size of the vector "v"
#pragma omp parallel // create the thread team 
#pragma omp single   // only one thread spawns the tasks 
{
    #pragma omp task shared(sum1)
    sum1 = sum_vector(v, 0, n / 2); // sum first half 
    #pragma omp task shared(sum2)
    sum2 = sum_vector(v, n / 2, n); // sum second half 
    #pragma omp taskwait // wait for both tasks to finish 
}
int sum = sum1 + sum2;

Based on the sum example, we can sum a vector by dividing the work into two tasks: the first sums the first half and the second sums the second half. That is exactly what we are doing here, creating two tasks with the task pragma (task constructs only run in parallel when executed inside a parallel region, usually with a single thread spawning them). Simple, right? You might have noticed something strange after the second task: #pragma omp taskwait. Task wait does what the name says, waiting for the current tasks to finish before continuing execution (it works as a sort of explicit barrier). But why do we have to wait for them to finish? Because if we don’t, sum1 and sum2 might not be available for the final sum.
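Tasks also compose nicely with recursion. Here is a sketch of a divide-and-conquer version of the sum, where a cutoff constant controls the task granularity discussed at the start (the names and the cutoff value 1000 are my own choices):

```cpp
#include <vector>

// Recursive task-based sum: each call spawns two child tasks until the
// range drops below a cutoff, then sums serially. Lowering the cutoff
// makes the granularity finer but raises the task-creation overhead.
int sum_recursive(std::vector<int> const& v, int begin, int end) {
    if (end - begin < 1000) {            // cutoff: serial base case
        int s = 0;
        for (int i = begin; i < end; ++i) s += v[i];
        return s;
    }
    int mid = begin + (end - begin) / 2;
    int left = 0, right = 0;
    #pragma omp task shared(left)
    left = sum_recursive(v, begin, mid);
    #pragma omp task shared(right)
    right = sum_recursive(v, mid, end);
    #pragma omp taskwait                 // wait for both halves
    return left + right;
}

int sum(std::vector<int> const& v) {
    int result = 0;
    #pragma omp parallel   // create the thread team
    #pragma omp single     // one thread spawns the root tasks
    result = sum_recursive(v, 0, v.size());
    return result;
}
```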

Conclusion

This has been a very short and limited introduction to OpenMP, since there are a lot of things I wish I could have covered, but parallel computing really is an extensive topic. Nevertheless, I hope this introduction has opened your eyes to how you can improve your code with simple additions.
