<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chris Rhodes</title>
    <description>The latest articles on DEV Community by Chris Rhodes (@crr0004).</description>
    <link>https://dev.to/crr0004</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F153346%2F2d73a7c7-58a9-49bd-a9ff-47e11630c26f.png</url>
      <title>DEV Community: Chris Rhodes</title>
      <link>https://dev.to/crr0004</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/crr0004"/>
    <language>en</language>
    <item>
      <title>Improving Life For Part Time Workers</title>
      <dc:creator>Chris Rhodes</dc:creator>
      <pubDate>Mon, 16 Aug 2021 06:45:26 +0000</pubDate>
      <link>https://dev.to/crr0004/improving-life-for-part-time-workers-4ngj</link>
      <guid>https://dev.to/crr0004/improving-life-for-part-time-workers-4ngj</guid>
      <description>&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;Recently I was asked by my manager how they could improve the working life of part-time software engineers. As a part-time software engineer myself, I can draw on my lived experience.&lt;/p&gt;

&lt;p&gt;To give you some context about myself, I normally work on a range of development tasks. Professionally I have mostly worked on web-based systems, from the front end with React to the back end with Node.js and Java (Spring/Spring Boot, custom AWS Lambdas and enterprise-level work with Oracle message passing). My personal work includes, but is not limited to, C++ projects dealing with lower-level tasks such as parallel programming with OpenCL, Android applications, a Windows native terminal multiplexer and machine learning prototypes. I don't outline this because I want to brag, far from it. I outline my experience because I hope it gives context to my point of view and how I have experienced the software development industry.&lt;/p&gt;

&lt;p&gt;In the following sections I use the term "do" as an instructional guideline and not as a hard rule; at the end of the day this is my opinion and you are free to do what you think is best. What I am about to say even goes against some &lt;a href="https://agilemanifesto.org/principles.html"&gt;Agile Principles&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is especially true for principle 6,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Again, form your own opinion and never take anything anyone says as universal truth.&lt;/p&gt;

&lt;h1&gt;
  
  
  My Do And Do Nots
&lt;/h1&gt;

&lt;p&gt;1. Do prefer written communication - Don't assume any vocalised message is remembered and understood&lt;/p&gt;

&lt;p&gt;2. Do give agency to people - Don't micro-manage people and their work&lt;/p&gt;

&lt;p&gt;3. Do use tools for writing - Don't require every update to be vocal&lt;/p&gt;

&lt;h1&gt;
  
  
  Explanations
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Do prefer written communication
&lt;/h2&gt;

&lt;p&gt;Too often in meetings, things are said or instructions are given while I mishear them, or I am tired, or am simply not paying full attention when I should be. Using written communication helps ensure that the message is articulated, can be referred back to, and can be read by everyone when they need it.&lt;/p&gt;

&lt;p&gt;An example I recently experienced: I changed teams at the end of a piece of work I had completed, and I didn't get a chance to walk people through it. Taking my own advice, I had written instructions, so no knowledge was lost.&lt;/p&gt;

&lt;p&gt;A pertinent example during the 2020 Covid-19 lock-downs was being on a Zoom call for a meeting, only to have your internet disconnect. Writing down your explanations and actions would mitigate any missed information.&lt;/p&gt;

&lt;p&gt;Another example is joining a new job or team. Sitting through hours of meetings explaining things is not a great way to pass on knowledge. Instead, try to write down as much of it as possible. That way you create a source of truth that everyone can read in their own time and at their own pace.&lt;/p&gt;

&lt;p&gt;There are many situations where written communication is superior to verbal communication, though I am biased towards the written form. It is not perfect for every situation and we still need to communicate verbally when the time comes; just be ready to challenge your assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do give agency to people
&lt;/h2&gt;

&lt;p&gt;The worst thing I can imagine when working with a manager, or as a direct report, is being micromanaged. People need agency to do work how they want and in their own time. If you need boundaries, use your tools to establish those boundaries and expectations. For instance, a lot of work has time restrictions, so communicate that through your tools. I say tools because they ensure both parties share a good understanding through a source of truth. I've used Jira professionally in the past for communicating technical requirements, and having all the information on the card was extremely helpful when the time was taken to write it down. This goes doubly for part-time workers: when they are working, they know there is a source of truth they can read, and they don't have to worry about missing information that was shared in a meeting held while they weren't working.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do use tools for writing
&lt;/h2&gt;

&lt;p&gt;Re-emphasising the case for written communication: project and day-to-day updates are the most common form of meeting I have experienced, and nearly all of them could be written. Using tools to provide written updates means you don't have to set up meetings just to communicate them. For example, write updates in the Jira cards, in the git pull request, in email, in instant messengers, etc.&lt;/p&gt;

&lt;h1&gt;
  
  
  Post Note
&lt;/h1&gt;

&lt;p&gt;I wrote this post several months ago, when I transitioned from part-time to full-time employment, and I never published it. The reasons aren't really relevant; they were mostly to do with my emotional state. After spending time working full-time, I think the points I raised here could be greatly beneficial to full-time work as well, especially now that flexible working is much more common as we continue to fight this pandemic.&lt;/p&gt;

</description>
      <category>productivity</category>
    </item>
    <item>
      <title>OpenCL - Let's Go Deeper - Part 2.3</title>
      <dc:creator>Chris Rhodes</dc:creator>
      <pubDate>Tue, 28 Jan 2020 00:30:06 +0000</pubDate>
      <link>https://dev.to/crr0004/opencl-let-s-go-deeper-part-2-3-3jle</link>
      <guid>https://dev.to/crr0004/opencl-let-s-go-deeper-part-2-3-3jle</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In this post I will be going through turning the &lt;a href="https://web.archive.org/web/20171219002015/https://articles.leetcode.com/the-painters-partition-problem/"&gt;Painters Partition problem&lt;/a&gt; into a parallel solution. The sequential solution to this is a recursive one, and I will show you how I broke down the problem to turn it into a parallel one.&lt;/p&gt;

&lt;p&gt;As always, the code for this is available on GitHub in the "part_2" folder. You can also find a python version of the sequential solution in &lt;code&gt;painters_partition.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i3JOwpme--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/crr0004"&gt;
        crr0004
      &lt;/a&gt; / &lt;a href="https://github.com/crr0004/adventures-in-opencl"&gt;
        adventures-in-opencl
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
An accompanying repo for my writing about OpenCL
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
adventures-in-opencl&lt;/h1&gt;
&lt;p&gt;An accompanying repo for my writing about OpenCL - &lt;a href="https://dev.to/crr0004/series/1481" rel="nofollow"&gt;https://dev.to/crr0004/series/1481&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
Building&lt;/h1&gt;
&lt;p&gt;You will need a driver that implements OpenCL 2.0 and cmake.&lt;/p&gt;
&lt;p&gt;This was compiled using clang. It should automatically pick up your primary
OpenCL vendor and use that.&lt;/p&gt;
&lt;h2&gt;
Headers&lt;/h2&gt;
&lt;p&gt;The OpenCL headers are included in this repo under shared, if CMake doesn't find them, you can override the directory with "-DOPENCL_INCLUDE_DIRS=../../shared/" when building from one of the build directories. Change the path if it is different for you.&lt;/p&gt;
&lt;p&gt;When running it will print out what device is being used but if your
vendor implementation is broken it can crash your driver.&lt;/p&gt;
&lt;h1&gt;
Running&lt;/h1&gt;
&lt;p&gt;Run each sample from the source directory as it needs to pick up the kernel
files&lt;/p&gt;
&lt;h1&gt;
Used libraries&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Catch2 &lt;a href="https://github.com/catchorg/Catch2"&gt;https://github.com/catchorg/Catch2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Google Benchmark &lt;a href="https://github.com/google/benchmark/"&gt;https://github.com/google/benchmark/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;See LICENSE file for addition of respective licenses&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/crr0004/adventures-in-opencl"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;h1&gt;
  
  
  Problem Statement
&lt;/h1&gt;

&lt;p&gt;The problem statement for the painters partition is as follows, "You have to paint N boards of length {A0, A1, A2 … AN-1}. There are K painters available and you are also given how much time a painter takes to paint 1 unit of board. You have to get this job done as soon as possible under the constraints that any painter will only paint continuous sections of board, say board {2, 3, 4} or only board {1} or nothing but not board {2, 4, 5}.", LeetCode (2017).&lt;/p&gt;

&lt;p&gt;Basically, it is asking us to find the set of partitions with the minimum weight, where the weight is the maximum sum over the partitions. The recursive function can be defined as &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eCPBzuFf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/gif.latex%3FM%255Bn%2Ck%255D%2520%3D%2520min_%257Bi%3D1%257D%255E%257Bn%257Dmax%28M%255Bi%2C%2520k-1%255D%2C%2520sum_%257Bj%3Di%2B1%257D%255E%257Bn%257D%2520s_j%29" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eCPBzuFf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/gif.latex%3FM%255Bn%2Ck%255D%2520%3D%2520min_%257Bi%3D1%257D%255E%257Bn%257Dmax%28M%255Bi%2C%2520k-1%255D%2C%2520sum_%257Bj%3Di%2B1%257D%255E%257Bn%257D%2520s_j%29" alt="equation for minimum of the maximum of each partition"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That equation looks a lot scarier than it is. It comes from &lt;a href="https://www.amazon.com/Algorithm-Design-Manual-Steven-Skiena/dp/1849967202"&gt;The Algorithm Design Manual&lt;/a&gt;, section 8.5, p. 295, which is also where I originally found the problem.&lt;/p&gt;

&lt;p&gt;The part that threw me about the equation was why we take the maximum of the sums of the parts of each set: it's because the maximum represents how long the busiest painter will take, and we can't do better than the longest-running painter.&lt;/p&gt;

&lt;p&gt;The equation is only one solution to the problem, and you could use other methods, but it represents the recursive part. The two base cases we need are when the sub-set is of length 1 (&lt;code&gt;n=1&lt;/code&gt;) and when we have exhausted how many partitions we can have (&lt;code&gt;k=1&lt;/code&gt;). When &lt;code&gt;n=1&lt;/code&gt;, &lt;code&gt;M[n,k]&lt;/code&gt; is just that single number; when &lt;code&gt;k=1&lt;/code&gt;, &lt;code&gt;M[n,k]=sum(n)&lt;/code&gt;, keeping in mind &lt;code&gt;n&lt;/code&gt; denotes the sub-set of numbers we are looking at for the partition.&lt;/p&gt;
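&lt;p&gt;To make the recurrence concrete, here is a minimal Python sketch of it (the naming is mine, not the repo's &lt;code&gt;painters_partition.py&lt;/code&gt;): &lt;code&gt;m(n, k)&lt;/code&gt; is the best achievable cost for the first &lt;code&gt;n&lt;/code&gt; boards with &lt;code&gt;k&lt;/code&gt; painters.&lt;/p&gt;

```python
def painters_partition(boards, k):
    def m(n, k):
        if n == 1:        # base case: a single board costs its own value
            return boards[0]
        if k == 1:        # base case: one painter paints all n boards
            return sum(boards[:n])
        # try every split point i: the first k-1 painters take boards[:i],
        # the last painter takes the contiguous tail boards[i:n]
        return min(
            max(m(i, k - 1), sum(boards[i:n])) for i in range(1, n)
        )
    return m(len(boards), k)
```

&lt;p&gt;For &lt;code&gt;[6, 7, 4, 9]&lt;/code&gt; and 3 painters this returns 11, the cost of the split {6} {7 4} {9}.&lt;/p&gt;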

&lt;h1&gt;
  
  
  Moving recursion into OpenCL
&lt;/h1&gt;

&lt;p&gt;Moving something into OpenCL, and hence parallel code, requires establishing what work can be done, and the dependencies between each piece of work. &lt;/p&gt;

&lt;p&gt;The two primary ways of thinking about parallel programming are from a data perspective or from a task perspective. From the data perspective we take each piece of data and apply the same logic to it: Single Instruction Multiple Data (SIMD). From the task perspective we take a set of data and apply any logic to it: Multiple Instructions Multiple Data (MIMD). Of course these definitions aren't complete and there is more nuance, especially when it comes to memory. If you're curious to learn more, you can look at &lt;a href="https://en.wikipedia.org/wiki/Flynn%27s_taxonomy"&gt;Flynn's Taxonomy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For our problem we can't directly move the recursive algorithm into a parallel solution because we have a dependency issue. &lt;/p&gt;

&lt;p&gt;The primary piece of work we do in the recursive solution is stepping through each partition subset, and the dependency is that each step requires knowing the previous step until we hit a base case, see &lt;a href="https://github.com/crr0004/adventures-in-opencl/blob/master/part_2/painters_partition.py#L10"&gt;Line 10 of painters_partition.py&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;To get the gist of what I am talking about, I would suggest you run the Python file and uncomment line 9 for &lt;code&gt;w&lt;/code&gt; to see what is happening. So we have established what work we are doing, and established our dependencies. If we can somehow move our dependencies around, we can turn this into a parallel version. We won't worry about changing the work itself; it turns out we don't need to, though it is perfectly valid to do so.&lt;/p&gt;

&lt;p&gt;The required leap in logic is realising that the recursive step is actually generating unique subsets and then computing on them. What if we could generate these subsets another way?&lt;/p&gt;

&lt;h2&gt;
  
  
  Recursion to Permutations
&lt;/h2&gt;

&lt;p&gt;I could write another post about figuring out how this recursion can be represented with sets, so instead I will do my best to keep it short in explaining it.&lt;/p&gt;

&lt;p&gt;The first step is going back to the problem at hand and establishing that we are doing an exhaustive search for the optimal arrangement, also read as subsets.&lt;/p&gt;

&lt;p&gt;The way the original recursive algorithm does this is by establishing a recursive function that places a separator to create a partition, then gives the remaining partition to the same function, which places another separator, and so on until we hit a base case (we run out of separators or have one number left). The end result is that we step through all valid permutations of the set.&lt;/p&gt;

&lt;p&gt;If you're having trouble convincing yourself of this (like I did), I would implore you to write a recursive solution yourself and see how it fits together. We can now see that this recursive algorithm is actually generating sets, so we can treat those as permutations and use permutation generation to do the same thing. We don't need to change how we decide the optimal arrangement, because it's the same as in the recursive algorithm: take the minimum of the maximum sums. If you understood none of that, which is okay, then just assume that the recursive function is generating sets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating Permutations
&lt;/h2&gt;

&lt;p&gt;If you've spent any time studying algorithms, you will have come across some form of permutation generation. One of my favourite resources for the more classical study of algorithms is &lt;a href="https://www-cs-faculty.stanford.edu/~knuth/taocp.html"&gt;The Art of Computer Programming&lt;/a&gt;; I mainly used Volume 4a to solve the following problems.&lt;/p&gt;

&lt;p&gt;As we have established that we need unique subsets, we can now transform this problem into something that fixes the dependency issue: we need a sequential algorithm that generates the subsets without introducing dependencies between them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Buffer or In-Place
&lt;/h3&gt;

&lt;p&gt;In investigating this problem I realised there were two main ways to generate the permutations to compute on: generate each permutation inside the OpenCL kernel (and therefore have no sequential dependencies), or pass them in as a buffer. I'm embarrassed to say I spent way too long trying to find a way to generate a permutation from an index, when the answer was on the very page I was reading, even though I read it multiple times.&lt;/p&gt;

&lt;p&gt;You &lt;em&gt;can&lt;/em&gt; generate the permutation in the OpenCL kernel as you run it, by using the global index as the permutation index and applying Theorem L of Section 7.2.1.3 of &lt;a href="https://www-cs-faculty.stanford.edu/~knuth/taocp.html"&gt;The Art of Computer Programming Volume 4a&lt;/a&gt;; it is, however, quite complex to implement. So instead of trying to implement that, I use Algorithm L of the same section to generate the permutations sequentially and pass them in as a buffer. This is likely a good spot to check for optimisations. After all, wouldn't it be better if it was all parallel?&lt;/p&gt;
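&lt;p&gt;For the curious, the index-to-subset idea behind Theorem L (the combinatorial number system, as I understand it) can be sketched in Python. This is my own illustrative code, not the repo's, and it decodes a work item's index straight into divider positions:&lt;/p&gt;

```python
from math import comb

def unrank_dividers(rank, k):
    # decode `rank` into the rank-th k-subset of divider positions,
    # in lexicographic order (combinatorial number system)
    positions = []
    for i in range(k, 0, -1):
        c = i - 1
        # advance c while the next binomial still fits under rank
        while rank >= comb(c + 1, i):
            c += 1
        positions.append(c)
        rank -= comb(c, i)
    return tuple(reversed(positions))
```

&lt;p&gt;In a kernel, each work item could feed its global index in as &lt;code&gt;rank&lt;/code&gt; and reconstruct its own divider placement, at the cost of running this loop per work item.&lt;/p&gt;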

&lt;h3&gt;
  
  
  Permutation Representation
&lt;/h3&gt;

&lt;p&gt;There are a couple of ways to represent a permutation: the set of each painter's sections, the length of each painter's section, or the position of the end of each painter's section. I opted for the last one because it gives us permutations without replacement (unique subsets). To get an idea of what I am talking about, take the following example of the set &lt;code&gt;N = {6, 7, 4, 9}, k = 2&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sections&lt;/th&gt;
&lt;th&gt;Length of Sections&lt;/th&gt;
&lt;th&gt;Section End&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;{6} {7} {4 9}&lt;/td&gt;
&lt;td&gt;1, 1, 2&lt;/td&gt;
&lt;td&gt;0, 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;{6} {7 4} {9}&lt;/td&gt;
&lt;td&gt;1, 2, 1&lt;/td&gt;
&lt;td&gt;0, 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;{6 7} {4} {9}&lt;/td&gt;
&lt;td&gt;2, 1, 1&lt;/td&gt;
&lt;td&gt;1, 2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
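&lt;p&gt;The "Section End" column above is exactly the set of ways to choose &lt;code&gt;k&lt;/code&gt; divider positions out of the &lt;code&gt;n - 1&lt;/code&gt; gaps between numbers. A quick Python sketch (my naming; the actual C++ code uses Algorithm L instead) generates it directly:&lt;/p&gt;

```python
from itertools import combinations

def divider_placements(n, k):
    # a divider at position i splits between numbers[i] and numbers[i+1],
    # so the valid positions are the n-1 gaps 0 .. n-2
    return list(combinations(range(n - 1), k))
```

&lt;p&gt;&lt;code&gt;divider_placements(4, 2)&lt;/code&gt; yields &lt;code&gt;(0, 1)&lt;/code&gt;, &lt;code&gt;(0, 2)&lt;/code&gt; and &lt;code&gt;(1, 2)&lt;/code&gt;, the three rows of the table.&lt;/p&gt;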

&lt;h1&gt;
  
  
  Implementing in OpenCL
&lt;/h1&gt;

&lt;p&gt;Now that we have established what we need, I am going to discuss the actual implementation in OpenCL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inputs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Buffers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Permutations of partitions&lt;/li&gt;
&lt;li&gt;Numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Constants
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How many dividers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Outputs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Minimum of the maximum subset sum&lt;/li&gt;
&lt;li&gt;Could also use local memory to improve the performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/crr0004/adventures-in-opencl/blob/fb37657524bb3089a87c2c6c020a9fbf9c1d2670/part_2/kernel.cl"&gt;Kernel&lt;/a&gt; Explained
&lt;/h2&gt;

&lt;p&gt;The heading above links to the GitHub file, and I am going to reference its line numbers for the sake of my sanity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line 1-8 - Kernel signature for the function that is going to be invoked

&lt;ul&gt;
&lt;li&gt;2 - buffer in for the permutations of partitions&lt;/li&gt;
&lt;li&gt;3 - buffer in for the numbers to operate on&lt;/li&gt;
&lt;li&gt;4 - constant for how many subsets we have (zero indexed)&lt;/li&gt;
&lt;li&gt;5 - size of the numbers buffer&lt;/li&gt;
&lt;li&gt;6 - local memory, not really used but could be&lt;/li&gt;
&lt;li&gt;7 - buffer for the output&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Line 11-12 - The indexes for the kernel in the first dimension (we only ever use the 0th)

&lt;ul&gt;
&lt;li&gt;12 - Local index isn't actually used, I am lazy&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Line 21 - Start index for the subsets we are going to compute on.&lt;/li&gt;
&lt;li&gt;Line 22 - Private max sum for the loops to come&lt;/li&gt;
&lt;li&gt;Line 24 - Private running sum for the loops&lt;/li&gt;
&lt;li&gt;Line 27-30 - The sum for the subset that is from the start of the numbers to the first divider&lt;/li&gt;
&lt;li&gt;Line 31 - Built-in function for maximum number of the inputs&lt;/li&gt;
&lt;li&gt;Line 32 - Reset the running sum&lt;/li&gt;
&lt;li&gt;Line 35-44 - Primary loop for all the partitions that aren't the start or end partitions&lt;/li&gt;
&lt;li&gt;Line 45-51 - The sum for the subset that is from the last divider to the end of the numbers&lt;/li&gt;
&lt;li&gt;Line 57 - Set the output to the minimum of all the maximums of the work items.

&lt;ul&gt;
&lt;li&gt;I am very lazy here and use an atomic version of the built-in min function so I don't have to deal with any local memory collapsing. This is a very good spot to optimise, as the atomic will block all the other work items.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
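&lt;p&gt;As a sanity check on the kernel's logic, here is the per-work-item computation simulated in Python (my own naming, not the kernel's): sum each contiguous section of one divider placement, keep the maximum, then apply the atomic-min step over all placements.&lt;/p&gt;

```python
def work_item_max(numbers, dividers):
    # one work item: the largest section sum for this divider placement
    max_sum = 0
    start = 0
    for d in dividers:
        max_sum = max(max_sum, sum(numbers[start:d + 1]))
        start = d + 1
    return max(max_sum, sum(numbers[start:]))  # trailing section

def min_of_max(numbers, placements):
    # the atomic min across work items; the host seeds this with INT_MAX
    result = float("inf")
    for p in placements:
        result = min(result, work_item_max(numbers, p))
    return result
```

&lt;p&gt;For the table's example, &lt;code&gt;min_of_max([6, 7, 4, 9], [(0, 1), (0, 2), (1, 2)])&lt;/code&gt; gives 11.&lt;/p&gt;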

&lt;h2&gt;
  
  
  Executing the Kernel
&lt;/h2&gt;

&lt;p&gt;Actually executing the kernel is fairly straightforward. We just need to generate our permutations, set up the OpenCL kernel, set up the buffers and set the kernel to run over an index space. The index space is the size of the permutations vector divided by how many dividers we have set. I mainly use wrapper functions I have written to simplify things; however, I still use direct OpenCL calls when they are one line. You can see the &lt;a href="https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/"&gt;documentation for the OpenCL functions in the reference pages&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One odd thing we have to do is set the output buffer to &lt;code&gt;INT_MAX&lt;/code&gt; so the minimum function works correctly, as the values default to 0.&lt;/p&gt;

&lt;h1&gt;
  
  
  Performance profiling and optimisation of OpenCL
&lt;/h1&gt;

&lt;p&gt;Unfortunately I won't be going through optimising the C++ and OpenCL code; I might make that a follow-up piece. Optimisation is a rather difficult task, especially when it comes to parallel code. There are, however, lots of resources out there aimed at general C/C++ optimisation which are still relevant, and you can also look at how CUDA code is optimised. I try to point out spots that you could optimise along the way.&lt;/p&gt;

&lt;h1&gt;
  
  
  Testing and debugging OpenCL
&lt;/h1&gt;

&lt;p&gt;Lazily, I have not written any tests for the program. I should have; don't make the same mistake. Ideally I would even have written a test that reads in a series of numbers and the expected output, to ensure it's all actually working.&lt;/p&gt;

&lt;p&gt;OpenCL is harder to test and debug; you are mainly doing black-box testing and debugging. You can &lt;code&gt;printf&lt;/code&gt; out of OpenCL kernels; however, the output is flushed at the end of each work item, so output from different work items can come out in any order, which mainly means you need to attach the index to the print statements.&lt;/p&gt;

&lt;h1&gt;
  
  
  FIN
&lt;/h1&gt;

&lt;p&gt;That's it. You will get a number out which represents our desired output. You can remove the comment on line 83 of &lt;code&gt;main.cpp&lt;/code&gt; to hard-code the numbers; just remember to set &lt;code&gt;numbersToGenerate&lt;/code&gt; to 4 in &lt;code&gt;main()&lt;/code&gt;, and comment out the number-generation loop (lines 86-89 of &lt;code&gt;main.cpp&lt;/code&gt;)!&lt;/p&gt;

&lt;p&gt;I hope this has helped explain how you can think about recursive algorithms in a parallel fashion and use OpenCL to implement them. This has taken me a while to write and I have learnt a lot in doing so. Cheers for reading.&lt;/p&gt;

</description>
      <category>opencl</category>
      <category>parallel</category>
      <category>algorithms</category>
      <category>cpp</category>
    </item>
    <item>
      <title>OpenCL - Summing Numbers - Part 2.2</title>
      <dc:creator>Chris Rhodes</dc:creator>
      <pubDate>Tue, 28 Jan 2020 00:29:54 +0000</pubDate>
      <link>https://dev.to/crr0004/opencl-summing-numbers-part-2-2-4iip</link>
      <guid>https://dev.to/crr0004/opencl-summing-numbers-part-2-2-4iip</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;This is a direct continuation of the previous post, with a working version of summing a list of numbers.&lt;/p&gt;

&lt;h1&gt;
  
  
  Problem Statement
&lt;/h1&gt;

&lt;p&gt;As always, a problem statement is Step 1. Given a list of numbers, sum all numbers in the list.&lt;/p&gt;

&lt;p&gt;For this post we are going to focus on the sub-problem of adding a list of numbers together in a parallel way. It is much like the previous problem; however, we only care about the final sum, not the intermediate sums. You may ask why I am explaining such a simple task, but as it turns out it is quite complex to do with parallel computation, and we can learn a lot from it. When we all first started programming we dealt with simple problems and learnt much from them; we are doing the same for parallel computing.&lt;/p&gt;

&lt;h1&gt;
  
  
  Solution
&lt;/h1&gt;

&lt;p&gt;If we take the property that a sum can be computed in any order, that is &lt;code&gt;a + b = c = b + a&lt;/code&gt;, we can reason that it doesn't matter what order the OpenCL kernels execute in, which is one part that blocked us before. We do still have a data dependency: to compute the final sum, we must first compute the smaller parts of the sum. This is often the case; we need everything to complete in one step before we can move on to the next.&lt;/p&gt;

&lt;p&gt;So we could compute each separate part of the sum, and then at some point do the last part. If we have the numbers &lt;code&gt;1, 2, 3, 4, 5, 6&lt;/code&gt;, we could compute &lt;code&gt;1 + 2 + 3 + 4&lt;/code&gt; in one part and &lt;code&gt;5 + 6&lt;/code&gt; in another; then we are left with &lt;code&gt;10 + 11&lt;/code&gt;. Some of you may recognise this as a form of map and reduce. The issues are how much to split our data and when to do the reduce part. This is ultimately an optimisation problem, and if we parametrise our code correctly we can profile and optimise. The most optimal solution will depend on the hardware, especially when ensuring we are compute bound, not memory bound.&lt;/p&gt;
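&lt;p&gt;The split-then-reduce idea can be sketched in a few lines of Python (the chunk sums stand in for the parallel work items; the names are mine):&lt;/p&gt;

```python
def chunked_sum(numbers, chunk):
    # "map": each work item sums one fixed-size chunk independently
    partials = [sum(numbers[i:i + chunk])
                for i in range(0, len(numbers), chunk)]
    # "reduce": combine the partial sums in a final step
    return sum(partials)
```

&lt;p&gt;With &lt;code&gt;[1, 2, 3, 4, 5, 6]&lt;/code&gt; and a chunk size of 4, the partials are 10 and 11, and the reduce step yields 21.&lt;/p&gt;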

&lt;p&gt;GPUs have different memory models to CPUs and perform differently under different circumstances; we need to keep this in mind when determining how to split up the data to be computed on. While this is an optimisation, and I don't encourage you to hunt for the right combination at the beginning, just keep in mind that different compute combinations yield different results. Try to keep things simple when first designing solutions. The following diagram is a visual representation of one solution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2F6Ee3L6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2F6Ee3L6a.png" alt="solution diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Kernel Breakdown
&lt;/h1&gt;

&lt;p&gt;In this section I am going to break down the &lt;a href="https://raw.githubusercontent.com/crr0004/adventures-in-opencl/master/summing_numbers/kernel.cl" rel="noopener noreferrer"&gt;source code from one of the kernels&lt;/a&gt; I have written. Most of the code used to set this up and run it is the same for the different kernels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;__kernel void add(__global int *numbers, __global int *numOut, __local int *numReduce) {

    // Get the index of the current element to be processed
    int i = get_global_id(0)*2;
    int locali = get_local_id(0);
    int2 va = vload2(i, numbers);
    int2 vb = vload2(i+1, numbers);
    int2 vc = va + vb;
    numReduce[locali] = vc[0] + vc[1];

    printf(
           "%d\t%d: %d + %d + %d + %d = %d\n", 
           i, 
           locali,
           va[0], va[1], vb[0], vb[1], 
           numReduce[locali]
           );

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Header Signature
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;__kernel void add(__global int *numbers, __global int *numOut, __local int *numReduce)&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;__kernel&lt;/code&gt; marks the function as one that can be invoked on the device&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;__global&lt;/code&gt; marks the pointer as part of global memory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;__local&lt;/code&gt; marks the pointer as part of local memory&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Memory Types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Global
&lt;/h3&gt;

&lt;p&gt;A type of memory that all the kernels, as well as the host, have read and write access to. The host allocates it. We use it to read and write the integers into the kernels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local
&lt;/h3&gt;

&lt;p&gt;A type of memory that only kernels within a work group have read and write access to; the host can only allocate it. Kernels can statically allocate it, though. We use it to hold the reduced numbers so we can do further computation on them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Function Body
&lt;/h2&gt;

&lt;p&gt;In respect to lines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;int i&lt;/code&gt; - We grab the index of the array we are operating on; the &lt;code&gt;*2&lt;/code&gt; is because each work item operates on two vectors of two, i.e. 4 numbers.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;int locali&lt;/code&gt; - We grab the local index of the workgroup so the work item can store its results.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;int2 va&lt;/code&gt; - We grab two numbers from the array and store them in a fixed size array of 2&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;int2 vb&lt;/code&gt; - Same as &lt;code&gt;int2 va&lt;/code&gt;, but we want the two numbers after &lt;code&gt;va&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;int2 vc&lt;/code&gt; - Add the corresponding elements of each array. This is NOT adding each number sequentially; it operates similarly to SIMD. The vendor implementation determines whether SIMD will actually be used. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;numReduce[locali]&lt;/code&gt; - Add the two lanes of the previous result together and store that in the local memory slot for this work item&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;printf&lt;/code&gt; - Print out the results we just computed&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Making this useful
&lt;/h2&gt;

&lt;p&gt;The astute reader will notice that this isn't particularly useful yet; we still need the result of all the numbers added together. The next step would be either to keep going with this adding process, or to output the intermediate results and compute the final sum somewhere else. As we're writing the intermediate results to local memory, you want each work item to wait until the others in its work group are done, and then continue adding the numbers together.&lt;/p&gt;
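&lt;p&gt;One way to sketch that follow-on step (in Python, as a stand-in for a barrier-then-combine pass over local memory; the naming is mine) is a tree reduction, where pairs of partial sums are combined each round until one value remains:&lt;/p&gt;

```python
def tree_reduce(partials):
    step = list(partials)
    while len(step) > 1:
        # one "round": each active work item adds a neighbouring pair;
        # in a kernel, a barrier would separate the rounds
        pairs = [step[i] + step[i + 1] for i in range(0, len(step) - 1, 2)]
        if len(step) % 2:
            pairs.append(step[-1])   # odd element carried to the next round
        step = pairs
    return step[0]
```

&lt;p&gt;Feeding it the partials from the earlier example, &lt;code&gt;tree_reduce([10, 11])&lt;/code&gt; gives 21.&lt;/p&gt;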

&lt;h1&gt;
  
  
  SIMD Aside
&lt;/h1&gt;

&lt;p&gt;I want to quickly introduce the idea of Single Instruction Multiple Data (SIMD) here, as it plays an important part in building a mental model of how data is mapped into memory and then operated on. SIMD is a CPU instruction set (GPUs have equivalents) which allows you to apply the same operation to multiple pieces of data. It does this by putting a chunk of data into a large register (128b, 256b, 512b, etc.) and then operating on it. In source code this often looks like applying operations (add, multiply, swap, shift, etc.) to fixed-size vectors. The following code is a simplified version of adding two sets of numbers together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uint4 a = {1, 2, 3, 0}
uint4 b = {9, 5, 2, 7}

uint4 c = a + b
# c is now {10, 7, 5, 7}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part to realise here is that the addition happens on all pairs of data at once, not on each pair separately. This quick aside is meant to demonstrate that the real running time of a program is often very different from its Big O algorithmic complexity, and we need to pay special attention to this in OpenCL, as things are not as they seem.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;When trying to create parallel algorithms, it's important to pay attention to what hardware you're working on and how your memory is going to be mapped out. While cache optimisation is important (I purposely haven't touched on it in this post), it's more important to understand how you're going to map your memory to your compute resources.&lt;/p&gt;

&lt;h1&gt;
  
  
  Next Time
&lt;/h1&gt;

&lt;p&gt;In the next post I will be going over how to solve the painter's partition problem, with special attention to transforming a recursive solution into a parallel one.&lt;/p&gt;

</description>
      <category>opencl</category>
      <category>parallel</category>
    </item>
    <item>
      <title>OpenCL - The Wrong Way - Part 2.1</title>
      <dc:creator>Chris Rhodes</dc:creator>
      <pubDate>Tue, 28 Jan 2020 00:28:47 +0000</pubDate>
      <link>https://dev.to/crr0004/opencl-the-wrong-way-part-2-1-4p8d</link>
      <guid>https://dev.to/crr0004/opencl-the-wrong-way-part-2-1-4p8d</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In the last post in this series we looked at a general overview of OpenCL and how you could apply a kernel to a list of data. In this post we are going to look at how to apply the same idea to a list of integers, why this approach is the wrong way, and how the data dependencies get in our way. We will then reduce the problem further and do some actual writing of OpenCL.&lt;/p&gt;

&lt;p&gt;As a quick meta aside, I have been having a lot of trouble getting these posts out. As I dive deeper and deeper into OpenCL, parallel algorithms and writing about them, I find I could include a lot of information, and I get stuck deciding what to include. So for this and future posts, I am going to reduce the scope of what I talk about until I can write them in a decent amount of time.&lt;/p&gt;

&lt;p&gt;As always, the code for this is available on GitHub in the "summing_numbers" folder; note that it only contains a solution for the second part of this post.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i3JOwpme--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/crr0004"&gt;
        crr0004
      &lt;/a&gt; / &lt;a href="https://github.com/crr0004/adventures-in-opencl"&gt;
        adventures-in-opencl
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
An accompanying repo for my writing about OpenCL
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
adventures-in-opencl&lt;/h1&gt;
&lt;p&gt;An accompanying repo for my writing about OpenCL - &lt;a href="https://dev.to/crr0004/series/1481" rel="nofollow"&gt;https://dev.to/crr0004/series/1481&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
Building&lt;/h1&gt;
&lt;p&gt;You will need a driver that implements OpenCL 2.0 and cmake.&lt;/p&gt;
&lt;p&gt;This was compiled using clang. It should automatically pick up your primary
OpenCL vendor and use that.&lt;/p&gt;
&lt;h2&gt;
Headers&lt;/h2&gt;
&lt;p&gt;The OpenCL headers are included in this repo under shared, if CMake doesn't find them, you can override the directory with "-DOPENCL_INCLUDE_DIRS=../../shared/" when building from one of the build directories. Change the path if it is different for you.&lt;/p&gt;
&lt;p&gt;When running it will print out what device is being used but if your
vendor implementation is broken it can crash your driver.&lt;/p&gt;
&lt;h1&gt;
Running&lt;/h1&gt;
&lt;p&gt;Run each sample from the source directory as it needs to pick up the kernel
files&lt;/p&gt;
&lt;h1&gt;
Used libraries&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Catch2 &lt;a href="https://github.com/catchorg/Catch2"&gt;https://github.com/catchorg/Catch2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Google Benchmark &lt;a href="https://github.com/google/benchmark/"&gt;https://github.com/google/benchmark/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;See LICENSE file for addition of respective licenses&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/crr0004/adventures-in-opencl"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;h1&gt;
  
  
  Problem Statement
&lt;/h1&gt;

&lt;p&gt;As with all problem solving, start with your problem statement. &lt;/p&gt;

&lt;p&gt;Ours is: given a list of integers, for each integer, sum all the previous integers together with the current one.&lt;/p&gt;

&lt;p&gt;For those who have seen this before, it is quite a common task in generation problems that involve weighting. It is a sub-problem of the painter's partition problem when using dynamic programming to solve it. In a following post we will solve the painter's partition problem with a parallel solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;Because a running total already contains every previous number, we can transform this problem into simply adding the previous running total to the current number. The formula &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2tlt9_3E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://latex.codecogs.com/gif.latex%3Fn_i%3D%255Csum_%257Bk%3D0%257D%255E%257Bi%257Dn_k" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2tlt9_3E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://latex.codecogs.com/gif.latex%3Fn_i%3D%255Csum_%257Bk%3D0%257D%255E%257Bi%257Dn_k" alt="n_i=\\sum_{k=0}^{i}n_k"&gt;&lt;/a&gt; becomes &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rgavHaxf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://latex.codecogs.com/gif.latex%3Fn_i%3Dn_i%2Bn_%257Bi-1%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rgavHaxf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://latex.codecogs.com/gif.latex%3Fn_i%3Dn_i%2Bn_%257Bi-1%257D" alt="n_i=n_i+n_{i-1}"&gt;&lt;/a&gt; where i is the index of each number in the list. This just means that if we keep a running sum of all the numbers so far, we don't have to add up the entire set each time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Translating to OpenCL
&lt;/h1&gt;

&lt;p&gt;The natural translation to OpenCL would be to take each step of the summation and simply compute it. This would work, in that it would compile and run, however you would hit a race condition: while each number is being computed, all the other numbers are also being computed. So if we were trying to compute the 10th number in the list, the 7th number might still be in flight, and we would end up with an undefined result, otherwise known as a race condition.&lt;/p&gt;

&lt;p&gt;OpenCL is kind to us (atomics) and ensures we don't get inconsistencies in the data structures themselves (imagine if two people tried to write on the same line of paper), however we still don't know the order in which all these numbers will be computed on the device. We could introduce a sync point for each number, that is, each time we want to compute a number, we first ensure all the previous numbers are complete. That would work. However it isn't any better than a sequential solution, because we would spend most of our time waiting and gain nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Change the Data Dependency
&lt;/h2&gt;

&lt;p&gt;You could get around this race condition by undoing the transformation and simply computing &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2tlt9_3E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://latex.codecogs.com/gif.latex%3Fn_i%3D%255Csum_%257Bk%3D0%257D%255E%257Bi%257Dn_k" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2tlt9_3E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://latex.codecogs.com/gif.latex%3Fn_i%3D%255Csum_%257Bk%3D0%257D%255E%257Bi%257Dn_k" alt="n_i=\\sum_{k=0}^{i}n_k"&gt;&lt;/a&gt; for each element. This would work perfectly fine; however, it wouldn't be any faster than a sequential version. So the data dependency is really what stops us from creating a parallel solution that beats a sequential one. That doesn't mean an OpenCL version would never be faster than a non-OpenCL one; that depends on the situation you are in. There is also the option of doing the sequential sum as one step of a larger parallel algorithm: for any list we're dealing with, just sum the list, even if we've summed it before. That fixes the dependency issue, but it means we compute the same thing many times.&lt;/p&gt;

&lt;p&gt;We could also just take the naive approach and have each step be the sum of all the previous numbers; that might be faster than other solutions for your problem, even though its Big O complexity is worse, O(n^2).&lt;/p&gt;

&lt;h1&gt;
  
  
  Next Time
&lt;/h1&gt;

&lt;p&gt;In the next post we are going to look at a sub-problem of this, and how to write an OpenCL kernel that partly solves it.&lt;/p&gt;

</description>
      <category>opencl</category>
      <category>parallel</category>
    </item>
    <item>
      <title>OpenCL - The First Approach - Part 1</title>
      <dc:creator>Chris Rhodes</dc:creator>
      <pubDate>Tue, 30 Jul 2019 09:04:35 +0000</pubDate>
      <link>https://dev.to/crr0004/opencl-the-first-approach-part-1-39h5</link>
      <guid>https://dev.to/crr0004/opencl-the-first-approach-part-1-39h5</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Recently I have been doing a lot of work and study in machine learning, including DeepRacer, and part of that is finding a way to accelerate the required computation. A lot of frameworks and libraries use CUDA, however a couple, like PlaidML, use OpenCL. I am a big advocate for open source software, preferably free, so when frameworks and libraries only use CUDA for acceleration that really irks me. For those not in the know, NVIDIA (owner of CUDA) has a horrible history with open source communities and with supporting implementations for them (looking at the NVIDIA drivers for Linux). So I really want to get into OpenCL and be part of the change I want to see, which is supporting a standard that is free and open to all.&lt;/p&gt;

&lt;p&gt;When I started writing this piece I wanted to go into OpenCL and how to use it to implement recursion, but as I got into it I realised there is simply too much content. So this is Part 1 of many. In this piece I will go over the basic approach of OpenCL, with a sample, and how that breaks down into OpenCL calls. Onwards to the content!&lt;/p&gt;

&lt;p&gt;Just quickly though, all the sample code can be found in my repository for these posts.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i3JOwpme--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/crr0004"&gt;
        crr0004
      &lt;/a&gt; / &lt;a href="https://github.com/crr0004/adventures-in-opencl"&gt;
        adventures-in-opencl
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
An accompanying repo for my writing about OpenCL
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
adventures-in-opencl&lt;/h1&gt;
&lt;p&gt;An accompanying repo for my writing about OpenCL - &lt;a href="https://dev.to/crr0004/series/1481" rel="nofollow"&gt;https://dev.to/crr0004/series/1481&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
Building&lt;/h1&gt;
&lt;p&gt;You will need a driver that implements OpenCL 2.0 and cmake.&lt;/p&gt;
&lt;p&gt;This was compiled using clang. It should automatically pick up your primary
OpenCL vendor and use that.&lt;/p&gt;
&lt;h2&gt;
Headers&lt;/h2&gt;
&lt;p&gt;The OpenCL headers are included in this repo under shared, if CMake doesn't find them, you can override the directory with "-DOPENCL_INCLUDE_DIRS=../../shared/" when building from one of the build directories. Change the path if it is different for you.&lt;/p&gt;
&lt;p&gt;When running it will print out what device is being used but if your
vendor implementation is broken it can crash your driver.&lt;/p&gt;
&lt;h1&gt;
Running&lt;/h1&gt;
&lt;p&gt;Run each sample from the source directory as it needs to pick up the kernel
files&lt;/p&gt;
&lt;h1&gt;
Used libraries&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Catch2 &lt;a href="https://github.com/catchorg/Catch2"&gt;https://github.com/catchorg/Catch2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Google Benchmark &lt;a href="https://github.com/google/benchmark/"&gt;https://github.com/google/benchmark/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;See LICENSE file for addition of respective licenses&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/crr0004/adventures-in-opencl"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;h1&gt;
  
  
  What is OpenCL
&lt;/h1&gt;

&lt;p&gt;"OpenCL (Open Computing Language) is an open royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors, giving software developers portable and efficient access to the power of these heterogeneous processing platforms." (Khronos OpenCL 20 API Specification 2019 p. 11). &lt;/p&gt;

&lt;p&gt;There have been multiple standards/libraries/frameworks/languages over the years that cater to parallel computing, and they have various benefits and faults. Very few of those work at the hardware level, and even fewer are device agnostic, that is, they aren't programmed for a specific device. This is where OpenCL really shines. It's implemented against a specification, so you're not vendor locked (there is a caveat here for extensions and optimisation). It's royalty free, so vendors don't pay to implement it and communities can pop up around it with little initial friction. If you want a parallel to draw (pun intended), look at the adoption of OpenGL, Vulkan, OpenGL ES (phones use this a lot), WebGL, and basically any Khronos Group specification.&lt;/p&gt;

&lt;p&gt;Going forward I will mainly be talking about OpenCL 2.0, however I will try to note where parts are not present in OpenCL 1.2, as the specification directly mentions being backwards compatible (Khronos OpenCL 2.0 API Specification 2019, p. 57). That goes for all pieces I write on OpenCL.&lt;/p&gt;

&lt;h1&gt;
  
  
  Okay but how do we go fast?
&lt;/h1&gt;

&lt;p&gt;In case you're new to parallel computing (this is my first deep dive), the easiest way I've found to think about it is to ask "Given an infinite number of compute units and infinite memory, how do we use it all?". Hopefully that leads you to the realisation that we need to write software that can take advantage of those resources. So to go fast we have to find some way to get all those compute units operating on our memory in some useful way. OpenCL gives us a lot of ways to do this, and one way is to simply map one section of memory directly to one compute unit. OpenCL does us a kindness and specifies how this works when we have more memory than compute units: it specifies "work groups" made up of "work items".&lt;/p&gt;

&lt;p&gt;Work groups are made up of work items, and the device driver specifies how big a work group can get. So if we had a list of 1024 integers and a device with 64 compute units, there would be 16 work groups (1024/64). The device does not have to run all these work groups at the same time; it just has to get through them eventually. The following diagram shows the basic structure of this, without worrying about how work items are broken up into groups.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Rv9UdIg6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/rgunA72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Rv9UdIg6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/rgunA72.png" alt="diagram of the basic structure for an opencl program"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To do the actual computation, you write a "kernel", compile it, and then hand it off to the device with the data you want to compute on. You can see the &lt;a href="https://github.com/crr0004/adventures-in-opencl/tree/master/naive_approach"&gt;naive_approach&lt;/a&gt; sample in my repository to see how this is actually done.&lt;/p&gt;

&lt;p&gt;If you have an AMD or Intel device that supports it, you can even use a debugger from their SDKs. I know AMD's is &lt;a href="https://github.com/GPUOpen-Tools/CodeXL"&gt;CodeXL&lt;/a&gt;, and &lt;a href="https://software.intel.com/en-us/openclsdk-devguide-debugging-opencl-kernels-on-gpu"&gt;Intel has debuggers in their SDK&lt;/a&gt;. For NVIDIA cards, you can try their &lt;a href="https://developer.nvidia.com/nsight-visual-studio-edition"&gt;Nsight software&lt;/a&gt;, however I am not entirely sure how reliable it is, and there is no Linux version I can find.&lt;/p&gt;

&lt;h1&gt;
  
  
  Sample In Diagram Form
&lt;/h1&gt;

&lt;p&gt;The following diagram is a rough break down of the APIs that the sample uses. Don't feel like you have to completely understand the diagram because it is just trying to give you an idea how the various API calls map into an actual program.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1k53AxFj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/bX8nNhQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1k53AxFj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/bX8nNhQ.png" alt="API breakdown of the sample program calls to OpenCL"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find all the details of the OpenCL API calls in the &lt;a href="https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/"&gt;OpenCL API Reference pages&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;OpenCL is another standard from the Khronos Group that enables a device-agnostic approach to parallel computing. There are various ways to use it; the first approach you should take is to take a list of numbers and your compute kernel, and hand them to the device. Devices can do fancy stuff to make this work properly; more detail to come in future posts!&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Khronos OpenCL Registry - The Khronos Group Inc. 2019. [ONLINE] Available at: &lt;a href="https://www.khronos.org/registry/OpenCL/"&gt;https://www.khronos.org/registry/OpenCL/&lt;/a&gt;. [Accessed 21 July 2019].&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Khronos OpenCL 2.0 API Specification - The Khronos Group Inc. 2019. [ONLINE] Available at: &lt;a href="https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf"&gt;https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf&lt;/a&gt;. [Accessed 21 July 2019].&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opencl</category>
      <category>parallel</category>
    </item>
  </channel>
</rss>
