Winston Puckett

OpenMP Notes

What is this document?

I recently went through Tim Mattson's Introduction to OpenMP, a 27-video course which outlines all of the major constructs and underlying principles in OpenMP. It's a wonderful course with lots of helpful information and I recommend you take it yourself. Here are my condensed notes on the course.

Why OpenMP?

Parallelism is important because of power usage - Moore's law continues to hold, but the power required to run a chip with twice as many transistors grows almost quadratically with the transistor count.

Instead, by running just two cores in parallel, we see roughly a 40% reduction in the power used for a given amount of work.

https://youtu.be/cMWGeJyrc9w?si=XYPQzEBkBwGHgg7_

Concurrency means tasks could be in progress at the same time (async). Parallelism is when you map concurrent work onto hardware so it actually runs at the same time.

OpenMP assumes shared address space hardware.

Examples of the pragmas OpenMP uses:

// general form
#pragma omp construct [clauses]

// simplest
#pragma omp parallel

// You can combine constructs like
#pragma omp parallel for
for ...

// Instead of
#pragma omp parallel
{
    #pragma omp for
    for ...
}

A structured block must have a single entry and a single exit; you can't branch into or out of the middle of it.

Shared memory machines

SMP (Symmetric MultiProcessor)

The operating system treats every thread the same, devoting the same amount of time to each.

There are no special blocks of memory - all memory is equally performant and accessible.

NUMA (Non-Uniform Memory Access)

Understands and takes advantage of the fact that memory that is physically closer to a processor is faster to access.

Nothing is truly an SMP these days, but we pretend it is... at least for CPUs. I'm curious about SMPs vs NUMAs in the world of GPUs.

Random good info

  • Race conditions and synchronization are the most common errors in parallel programming.
  • The fundamental model behind OpenMP is the Fork-Join model
  • A collection of threads is called a Team.
  • #pragma omp parallel is what forks
  • omp_set_num_threads(4) tells OMP how many threads to use if it's unspecified in the pragma.
  • Data allocated outside the parallel block goes on the heap and is visible to all threads. Data allocated inside the block goes on each thread's stack and is visible only to that thread. The canonical term for thread-local data is "private".
  • A generated function is called a thunk.
  • Question, OpenMP uses pthreads under the hood when it compiles down. Would an NVIDIA GPU also do that or does it compile to a CUDA-specific implementation?

    SPMD - Single Program, Multiple Data

  • Each thread runs identical code against different data

  • Cyclic distribution of loop iterations: each thread starts at its thread id and increments the loop counter by the number of threads (see the sketch after this list).

  • To eliminate race conditions, we can promote a scalar like sum to an array.

  • Because we may get a different number of threads than we request, have one thread (id 0 most likely) record the thread count we actually got.

    False sharing:

  • If independent data happens to sit on the same cache line, each update will cause cache lines to "slosh back and forth" between threads.

  • Promoting a scalar to an array and using a round-robin strategy is one example where we create false sharing and poor scalability.

  • One way to avoid false sharing is to pad the array so that each thread's element is guaranteed to land on its own cache line. To do this, make the sum array two-dimensional and use only the first element of each row, [i][0] (see the sketch after this list).

  • Synchronization solves false sharing without us having to know the underlying cache sizes.

    Synchronization

  • Barrier Synchronization is "each thread should wait until all threads arrive"

  • Mutual exclusion: "only one thread can access x resource"
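
To make the SPMD, cyclic-distribution, and padding bullets concrete, here is a minimal sketch of the numerical-integration pi program in the style the course uses. The PAD value of 8 assumes 64-byte cache lines and 8-byte doubles, and the step count is my own choice, not something from the notes.

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4
#define PAD 8                       /* assumes a 64-byte cache line / 8-byte doubles */

static long num_steps = 100000000;

int main(void)
{
    double step = 1.0 / (double)num_steps;
    double sum[NUM_THREADS][PAD];   /* scalar promoted to a padded array */
    int nthreads = 1;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();   /* may be fewer than requested */
        if (id == 0) nthreads = nthrds;       /* one thread records the real count */

        sum[id][0] = 0.0;
        /* cyclic (round-robin) distribution: start at id, stride by the thread count */
        for (long i = id; i < num_steps; i += nthrds) {
            double x = (i + 0.5) * step;
            sum[id][0] += 4.0 / (1.0 + x * x);
        }
    }

    double pi = 0.0;
    for (int i = 0; i < nthreads; i++)
        pi += sum[i][0] * step;
    printf("pi = %f\n", pi);
    return 0;
}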

Constructs:
Note that a pragma applies to the statement directly below it, so use a block (braces) if you want it to affect multiple statements.

Critical

Only one thread may execute the block at a time:

int sum = 0;
#pragma omp parallel
{
    int id = omp_get_thread_num();
#pragma omp critical
    sum += id;
}

Atomic

#pragma omp atomic

#pragma omp atomic read|write|update|capture

"If the low-level, high performance constructs for mutual exclusion exist on this hardware, use them. Otherwise act like this is a critical section."

Is there any benefit to critical sections in this case? Perhaps critical sections allow arbitrary code such as function calls, whereas atomic only covers a scalar update? Yes - the video says atomic is only available for simple binary operations that update a value.
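
As a minimal sketch (mine, mirroring the critical example above), atomic fits here because the update is a single read-modify-write of a scalar:

int sum = 0;
#pragma omp parallel
{
    int id = omp_get_thread_num();
    #pragma omp atomic
    sum += id;      /* a simple binary update, so atomic applies */
}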

Barrier

Wait until all threads reach this point before moving on:

#pragma omp parallel
{
    int id = omp_get_thread_num();
#pragma omp barrier
    printf("%d", id);
}


Flush

Compilers are really good at optimizing where reads and writes occur. The order you place operations in may not be the order they actually happen in, if the compiler decides the results are equivalent. This holds true for OpenMP. If you need to make reads and writes consistent, you need to use a Flush.

Creates a synchronization point that says, "you are guaranteed to have a consistent view of memory with the flush set." The flush set is the list of variables inside parenthesis passed to the flush pragma. When you leave off the flush set, everything must be consistent.

All reads and writes to variables in the flush set that occur before the flush must resolve to memory before it; reads and writes that occur after the flush must not be moved ahead of it.

Flushes with overlapping flush sets may not be reordered with respect to each other.

For all intents and purposes, flush is equivalent to a fence in compiler terminology.

Flushes are hard to get right, so OpenMP provides implicit flushes at:

  • entering/exiting parallel regions
  • implicit/explicit barriers
  • entry/exit to critical sections
  • set/unset of a lock

Flush makes variables available to other threads.

If you spin-wait on a variable, you also need to put a flush in the body of the loop. That forces the compiler to re-read the value from memory each iteration rather than using a cached copy (see the sketch below).

#pragma omp flush

#pragma omp flush(variableOne, variableTwo)
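
Here is a hedged sketch of the spin-wait case above (the flag name and structure are mine): one thread publishes a flag, the other flushes inside the loop so it re-reads memory every iteration.

int flag = 0;                         /* shared between the two threads */

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        /* produce some data, then publish it */
        #pragma omp flush
        flag = 1;
        #pragma omp flush(flag)
    } else {
        /* spin until the flag is set, flushing so the read isn't hoisted out */
        #pragma omp flush(flag)
        while (flag == 0) {
            #pragma omp flush(flag)
        }
        #pragma omp flush             /* now the produced data is visible here */
    }
}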

Master

#pragma omp master runs the next block on the main thread only; the other threads skip it, and there is no implied barrier. For most use cases of master, you want an explicit barrier on the next statement (see the sketch below).
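
A small sketch of that pattern (my own example): the main thread fills in shared data, and the explicit barrier keeps the other threads from reading it too early.

int table[256];

#pragma omp parallel
{
    #pragma omp master
    {
        for (int i = 0; i < 256; i++)
            table[i] = i * i;         /* only the main thread initializes the table */
    }
    #pragma omp barrier               /* master has no implied barrier, so wait here */

    printf("thread %d sees table[10] = %d\n", omp_get_thread_num(), table[10]);
}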

Work sharing

Note: there is an implicit barrier at the end of any work sharing construct. nowait skips the barrier. There's no way to turn off the barrier at the end of a parallel region.

Constructs

  • Loop
    • #pragma omp for
    • #pragma omp for schedule
      • tells compiler how to split up loops between threads
      • ... schedule(static, optional chunk size): figure out how to break these loops into blocks at compile time
      • ... schedule(dynamic, optional chunk size): create a task queue and pull from it until there's no more work. This is better when you have radically different work from one thread to the next. If you leave off chunk size, it will do one at a time.
      • ... schedule(guided, optional chunk size): start with a large chunk size and get smaller as you go along. This was popular after a paper said this was optimal, but these days we know that the other two are usually better.
      • ... schedule(runtime): take the schedule and chunk size from the OMP_SCHEDULE environment variable or runtime library. This is useful when you aren't sure what schedule you want to use, so you compile the program once and try a bunch of them out. omp_set_schedule() is how you set the schedule.
      • ... schedule(auto): a new schedule which leaves the decision up to the compiler. How is this different from leaving off the schedule parameter? I'm guessing leaving it off means that you accept a default vs do compiler magic.
  • Reduction
    • #pragma omp parallel reduction (operator: variables) | #pragma omp parallel for reduction(operator: variables)
    • A common workflow: find the compute-intensive loops, make their iterations independent, then insert omp for.
    • "Loop carry dependencies" Are variables which are dependent on the loops
    • accumulating across a data structure is called a reduction.
    • When we use the reduction clause, OMP creates a local copy of each variable in the list and initializes it to the identity of the operator (0 for +, 1 for *).
    • When the parallel work is done, the reduction combines the local copies into a single value, then combines that with the global copy to give you the final answer (see the sketch after this list).
    • One interesting thing is that you can do reductions over binary operations like &. Question, does that short circuit?
  • Section/Sections
    • schedule different portions of a sequence on different threads
    • sections for the outer block, section for each inner statement
  • Single
    • "only the first three to hit this will execute it."
    • note that there's still an implicit barrier because this is a work sharing construct.
    • ... omp single
  • Task
    • Consists of data environment, code to execute, "Internal Control Variables" or ICV.
      • In OpenMP there are certain features of the environment that constructs control (like number of threads). These are ICVs.
    • #pragma omp task , #pragma omp taskwait
    • It looks like taskwait is what provides the waiting here: the current thread keeps executing after it has spawned a task with the task construct, and taskwait makes it wait for its child tasks to finish (see the sketch after this list).
    • When you use recursion, look at whether you need to mark anything as shared, firstprivate, etc. Needing the initial value is so common in tasks that variables which would otherwise be private default to firstprivate.
  • Lock
    • omp_init_lock() , omp_set_lock() for acquire, omp_unset_lock() for release. omp_destroy_lock() to release the lock's memory. omp_test_lock() to find out if the lock is available, but not wait for it to be available.
    • Useful when you have lots of bins (a histogram, for example) and a low chance of two threads updating the same bin at the same time.
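
To tie the loop, schedule, and reduction bullets together, here is a sketch of the same pi calculation written with parallel for and a reduction (the step count and the static schedule are my choices):

#include <stdio.h>

static long num_steps = 100000000;

int main(void)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    /* each thread gets a private sum initialized to 0 (the identity for +);
       OpenMP combines the partial sums when the loop ends */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi = %f\n", sum * step);
    return 0;
}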
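
And a sketch of task and taskwait using the usual recursive Fibonacci illustration (naive and purely illustrative, not from the notes):

#include <stdio.h>

long fib(int n)
{
    long x, y;
    if (n < 2) return n;

    /* x and y are written by the child tasks, so they must be shared;
       n defaults to firstprivate inside each task */
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);

    #pragma omp taskwait              /* wait for both child tasks before combining */
    return x + y;
}

int main(void)
{
    long result;
    #pragma omp parallel
    {
        #pragma omp single            /* one thread creates the root task; the team runs the children */
        result = fib(30);
    }
    printf("fib(30) = %ld\n", result);
    return 0;
}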

Runtime Library Routines

  • omp_get_num_threads() get the current number of threads omp works with.
    • outside a parallel region, this returns 1.
  • omp_set_num_threads() set the current number of threads omp works with.
  • omp_get_thread_num() get the current thread id.
  • omp_get_max_threads() get the max number of threads we could use for this system.
  • omp_in_parallel() is this running in parallel?
  • omp_set_dynamic() set dynamic mode (whether omp gets smart about how many threads you get for a given parallel region)
  • omp_get_dynamic() is this running in dynamic mode?
  • omp_get_num_procs() how many processors are there on this system?
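
A small sketch (mine) that exercises a few of these routines inside and outside a parallel region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("procs: %d, max threads: %d\n",
           omp_get_num_procs(), omp_get_max_threads());
    printf("outside: num_threads = %d, in_parallel = %d\n",
           omp_get_num_threads(), omp_in_parallel());

    omp_set_num_threads(4);           /* a request, not a guarantee */
    #pragma omp parallel
    {
        #pragma omp single
        printf("inside: num_threads = %d, in_parallel = %d\n",
               omp_get_num_threads(), omp_in_parallel());
    }
    return 0;
}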

    Environment Variables

  • OMP_NUM_THREADS the number of threads we request, given as an integer. In a production system, we almost always want to set this in the environment rather than have the code be responsible.

  • OMP_STACKSIZE ask for extra stack size in cases where you allocate large variables to the stack. This prevents stack overflow.

  • OMP_WAIT_POLICY ACTIVE | PASSIVE if ACTIVE, spin the thread until you get the lock (spin lock). If PASSIVE, suspend the thread until there's available resources. Waking suspended threads costs a lot, so use PASSIVE when you expect to have long wait times and don't want the thread eating resources.

  • OMP_PROC_BIND TRUE | FALSE if enabled, once you bind a thread to a processor, leave it there. This helps as no processor actually has uniform memory access, so we can optimize for the regions of memory closer to the processor. The disadvantage is that if something else eats up that processor (antivirus scan for example), you don't remap onto a processor which is available.
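
As a sketch of how these variables plug into schedule(runtime) (the loop and the example schedule string are mine):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* run as, e.g., OMP_NUM_THREADS=8 OMP_SCHEDULE="dynamic,4" ./a.out
       to try different schedules without recompiling */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < 16; i++) {
        printf("iteration %d ran on thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}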

    Data Environment

Heap is shared, stack is local to a thread.

Variables can be shared, private, or firstprivate/lastprivate. firstprivate means each private copy starts with the variable's original value; lastprivate means the value from the sequentially last iteration (or last section) is copied back to the original variable.

... for private(tmp) is an example of a data-sharing clause.

  • firstprivate(variable)
  • lastprivate(variable)

You can declare something as both first and last private. I have questions about what that would do.
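
A hedged sketch of what combining them does (my example and my reading of the behavior, not from the course): each thread's copy starts from the original value, and the copy from the sequentially last iteration is written back when the loop ends.

#include <stdio.h>

int main(void)
{
    int tmp = 100;

    /* firstprivate: each thread's private copy of tmp starts at 100, so it can
       be read before it is written without being garbage.
       lastprivate: when the loop finishes, the copy from iteration i == 9 is
       written back to the original tmp. */
    #pragma omp parallel for firstprivate(tmp) lastprivate(tmp)
    for (int i = 0; i < 10; i++) {
        tmp = 100 + i;      /* each thread updates only its own copy */
    }

    printf("tmp after the loop = %d\n", tmp);   /* prints 109 */
    return 0;
}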

Thread private

What if we want a global variable that's private to each thread?

int counter = 0;
#pragma omp threadprivate(counter)

void incrementCounter()
{
    #pragma omp parallel for copyin(counter)
    for (int i = 0; i < 1000; i++)
    {
        counter += 1;
    }
}

This isn't a real use case btw, I just wanted an example. For this, we should be using atomic.

Debugging

You need to use a parallel debugger. Visual Studio has one.

default(none) for your data environment can help you debug, because it forces you to declare the data-sharing attribute of every variable.

Remember, pragmas the compiler doesn't recognize are skipped.

Memory Model of OpenMP

There is shared memory and each thread gets a cache. How do you guarantee whether you're getting the most up to date value?

The order you write loads and stores in may be quite different from the order in which they execute.

The memory model gives you precise controls over reads (r), writes (w), and synchronization (s) operations.

Sequential consistency is when r's, w's, s's are in the order you put them in. All threads will see them in this order, and execute as such. This creates a lot of constraints on optimizations.

Relaxed consistency means the compiler and hardware can reorder operations as long as they don't change the result. For instance, they can't reorder across synchronization operations. You as the programmer have to use those synchronization operations to order what needs to be consistent.

For more on this, see the Flush construct above.

Other Resources
