Welcome back, heterogeneous computing enthusiasts! In our previous guides, we covered the fundamentals of OpenCL and SYCL, from setting up your environment to writing basic kernels and optimizing them with local memory. Now it's time to go deeper. In this advanced tutorial, we'll explore:
- Key features of the OpenCL specification (2.x/3.0) that enable sophisticated programming.
- What SYCL 2020 brings to the table, including unified shared memory and group algorithms.
- Implementation of complex parallel patterns: reductions, scans, and stencils.
- How to target CPUs, GPUs, and FPGAs with SYCL and compare performance.
- A tour of oneAPI's domain‑specific libraries (oneMKL, oneDNN, oneDPL, etc.) that supercharge your development.
Whether you're looking to squeeze every drop of performance from your hardware or write portable code that scales across architectures, this guide has you covered.
Part 1: Deep Dive into the OpenCL Specification
The OpenCL specification has evolved significantly. While OpenCL 1.2 is the most widely supported baseline, modern devices support newer features that can greatly simplify and accelerate development. Let's look at the highlights.
1.1 OpenCL 2.x Features
Shared Virtual Memory (SVM)
OpenCL 2.0 introduced SVM, allowing the host and devices to share pointers. This eliminates explicit buffer mappings and can simplify programming. Three types exist:
- Coarse-grained buffer SVM: sharing at the granularity of whole SVM buffers; explicit synchronization (map/unmap or kernel boundaries) is required.
- Fine-grained buffer SVM: sharing at byte granularity within SVM buffers; atomics are optional.
- Fine-grained system SVM: any host allocation (even from plain malloc) can be shared; atomics are optional.
With SVM, you can pass complex data structures (like linked lists) to kernels without deep copies.
Device-side Enqueue
Kernels can enqueue other kernels, creating nested parallelism. This is great for recursive algorithms or dynamic workloads. The parent kernel creates a child kernel and enqueues it to a device queue.
Pipes
Pipes are FIFO data structures connecting kernels. They facilitate streaming patterns without global memory round trips. One kernel writes to a pipe, another reads from it, with synchronization handled by the hardware.
Work-group Collective Functions
OpenCL 2.0 added built‑in work‑group functions like work_group_reduce_add, work_group_scan_exclusive_add, etc., enabling efficient intra‑group reductions and scans.
C11 Atomics
Kernels can use atomic operations conforming to the C11 memory model, essential for lock‑free algorithms.
1.2 OpenCL 3.0: The New Baseline
OpenCL 3.0, released in 2020, takes a different approach: it makes OpenCL 1.2 the mandatory baseline, with all 2.x features optional. This allows vendors to implement only what they support while maintaining compatibility. Key additions:
- A new query mechanism to check for optional features.
- Support for C++ features in kernels (such as auto and lambda expressions) via the C++ for OpenCL kernel language.
- Better integration with Vulkan and other APIs.
1.3 Using Advanced Features in Practice
To use these features, you must check for extensions and enable them at program creation. For example, to use SVM:
cl_device_svm_capabilities caps;
clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES, sizeof(caps), &caps, NULL);
if (caps & CL_DEVICE_SVM_COARSE_GRAIN_BUFFER) {
// Allocate a coarse-grain SVM buffer (alignment 0 = default)
void *svm_ptr = clSVMAlloc(context, CL_MEM_READ_WRITE, size, 0);
// ... pass to kernels via clSetKernelArgSVMPointer, then release:
clSVMFree(context, svm_ptr);
}
Device‑side enqueue requires building the program with the -cl-std=CL2.0 flag; under OpenCL 3.0 it is an optional feature (__opencl_c_device_enqueue) that you should query before relying on it.
While these features are powerful, they are not universally supported across all devices. Always fall back to portable code when necessary.
Part 2: Deep Dive into the SYCL 2020 Specification
SYCL 2020 is a major update that aligns with modern C++ and adds many features inspired by OpenCL 2.x and beyond. Let's explore the most impactful additions.
2.1 Unified Shared Memory (USM)
SYCL 2020 standardizes USM, providing pointer‑based memory management. Three allocation types:
- malloc_device: device memory, not directly accessible from the host.
- malloc_host: host memory that is also accessible from the device.
- malloc_shared: automatically migrated memory, accessible from both sides.
Example:
queue q;
int *data = malloc_shared<int>(N, q);
q.parallel_for(range<1>(N), [=](id<1> i) { data[i] = i; }).wait();
// data is updated on host after synchronization
sycl::free(data, q);
USM simplifies porting CUDA/HIP code and reduces verbosity compared to buffers.
2.2 Reductions
SYCL 2020 introduces built‑in reduction support. You can use reduction objects with parallel_for to perform reductions efficiently.
queue q;
std::vector<float> vec(N, 1.0f);
buffer<float> buf(vec);
float init = 0.0f;
{
buffer<float> sum_buf(&init, 1);
q.submit([&](handler& h) {
auto in = buf.get_access<access::mode::read>(h);
h.parallel_for(range<1>(N), reduction(sum_buf, h, 0.0f, std::plus<>()), [=](id<1> i, auto& sum) {
sum += in[i];
});
});
}
// sum_buf now contains the total
The implementation chooses the most efficient strategy (tree reduction, atomics, etc.) based on the device.
2.3 Group Algorithms and Sub‑Groups
SYCL 2020 provides collective algorithms for work‑groups, similar to OpenCL 2.0:
- group_barrier() and group_broadcast()
- reduce_over_group(), inclusive_scan_over_group(), exclusive_scan_over_group()
- a sub_group class for warp/wavefront operations.
Example of group reduction:
h.parallel_for(nd_range<1>(N, 256), [=](nd_item<1> it) {
auto g = it.get_group();
int local_sum = ...;
int group_sum = reduce_over_group(g, local_sum, std::plus<>());
if (it.get_local_id(0) == 0)
atomic_ref<int, memory_order::relaxed, memory_scope::device>(*global_sum).fetch_add(group_sum); // global_sum: a USM int pointer
});
Sub‑groups allow portable SIMD programming. You can query sub‑group size and use shuffles.
2.4 In‑Order Queues and Properties
SYCL 2020 adds queue properties to specify in‑order execution, which can simplify dependencies:
queue q{ property::queue::in_order() };
2.5 Interoperability with OpenCL
SYCL 2020 improves interop: you can create a SYCL platform, device, context, or queue from native OpenCL handles, and vice versa. This allows incremental migration or mixing low‑level control with high‑level productivity.
Part 3: Complex Parallel Patterns
Now that we've covered the specifications, let's implement three fundamental parallel patterns: reduction, scan, and stencil. We'll show both OpenCL and SYCL versions, with optimizations.
3.1 Reduction (Sum of an Array)
Reduction combines elements using an associative operator (e.g., sum). A common approach is hierarchical: each work‑group reduces its chunk in local memory, then one thread per group adds to global atomic.
OpenCL kernel:
__kernel void reduce(__global const float *in, __global float *out, __local float *localMem, int n) {
int gid = get_global_id(0);
int lid = get_local_id(0);
int groupSize = get_local_size(0);
int groupId = get_group_id(0);
// Load data into local memory
float sum = (gid < n) ? in[gid] : 0.0f;
localMem[lid] = sum;
barrier(CLK_LOCAL_MEM_FENCE);
// Reduce within group
for (int stride = groupSize / 2; stride > 0; stride >>= 1) {
if (lid < stride) {
localMem[lid] += localMem[lid + stride];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
// Write partial sum
if (lid == 0) {
out[groupId] = localMem[0];
}
}
After this kernel, you have partial sums. Run a second kernel with one work‑group to combine them, or copy to host and finalize.
SYCL with reductions:
SYCL's built‑in reduction handles this automatically, but for educational purposes, here's a manual version:
q.submit([&](handler& h) {
auto in = buf_in.get_access<access::mode::read>(h);
auto out = buf_out.get_access<access::mode::write>(h);
local_accessor<float, 1> localMem(range<1>(256), h);
h.parallel_for(nd_range<1>(n, 256), [=](nd_item<1> it) {
int gid = it.get_global_id(0);
int lid = it.get_local_id(0);
float sum = (gid < n) ? in[gid] : 0.0f;
localMem[lid] = sum;
it.barrier();
for (int stride = 128; stride > 0; stride >>= 1) {
if (lid < stride) {
localMem[lid] += localMem[lid + stride];
}
it.barrier();
}
if (lid == 0) {
out[it.get_group(0)] = localMem[0];
}
});
});
But the SYCL 2020 way is simpler:
q.submit([&](handler& h) {
auto in = buf_in.get_access<access::mode::read>(h);
h.parallel_for(range<1>(n), reduction(sum_buf, h, 0.0f, std::plus<>()), [=](id<1> i, auto& sum) {
sum += in[i];
});
});
3.2 Scan (Prefix Sum)
Scan computes cumulative sums. An efficient work‑group scan uses a double‑buffered local memory approach.
OpenCL kernel for inclusive scan (one block):
__kernel void scan(__global const float *in, __global float *out, __local float *temp, int n) {
int lid = get_local_id(0);
int groupSize = get_local_size(0);
// Each work-item handles two consecutive elements, so each group scans
// 2 * groupSize elements; launch n/2 work-items in total.
int base = 2 * get_group_id(0) * groupSize;
// Load data
temp[2 * lid] = (base + 2 * lid < n) ? in[base + 2 * lid] : 0;
temp[2 * lid + 1] = (base + 2 * lid + 1 < n) ? in[base + 2 * lid + 1] : 0;
barrier(CLK_LOCAL_MEM_FENCE);
// Upsweep phase (parallel reduction)
for (int stride = 1; stride < 2 * groupSize; stride <<= 1) {
int index = (lid + 1) * stride * 2 - 1;
if (index < 2 * groupSize) {
temp[index] += temp[index - stride];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
// Downsweep phase
for (int stride = groupSize / 2; stride > 0; stride >>= 1) {
int index = (lid + 1) * stride * 2 - 1;
if (index + stride < 2 * groupSize) {
temp[index + stride] += temp[index];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
// Write results
if (base + 2 * lid < n)
out[base + 2 * lid] = temp[2 * lid];
if (base + 2 * lid + 1 < n)
out[base + 2 * lid + 1] = temp[2 * lid + 1];
}
After this, you need a second pass to add the total of previous groups. For large arrays, you'd combine with a top‑level scan of group sums.
SYCL using group_scan:
SYCL 2020 provides inclusive_scan_over_group and exclusive_scan_over_group:
q.submit([&](handler& h) {
auto in = buf_in.get_access<access::mode::read>(h);
auto out = buf_out.get_access<access::mode::write>(h);
h.parallel_for(nd_range<1>(n, 256), [=](nd_item<1> it) {
int gid = it.get_global_id(0);
int lid = it.get_local_id(0);
float val = (gid < n) ? in[gid] : 0.0f;
auto grp = it.get_group();
float scan_val = inclusive_scan_over_group(grp, val, std::plus<>());
if (gid < n)
out[gid] = scan_val;
});
});
This performs an inclusive scan within each work‑group. To get a global scan, you'd again need to handle the group offsets.
3.3 Stencil (2D Heat Conduction)
Stencils update each element based on its neighbors. Optimizations: use local memory to cache a tile plus halo.
OpenCL kernel for 2D heat stencil:
__kernel void heat(__global const float *in, __global float *out, int width, int height) {
int x = get_global_id(0);
int y = get_global_id(1);
int localX = get_local_id(0);
int localY = get_local_id(1);
int localSizeX = get_local_size(0);
int localSizeY = get_local_size(1);
// Local memory with halo of 1
__local float tile[LOCAL_SIZE_Y + 2][LOCAL_SIZE_X + 2];
// Load interior
if (x < width && y < height)
tile[localY + 1][localX + 1] = in[y * width + x];
else
tile[localY + 1][localX + 1] = 0.0f;
// Load halo regions (simplified: assumes boundary conditions)
if (localY == 0 && y > 0)
tile[0][localX + 1] = in[(y-1) * width + x];
if (localY == localSizeY - 1 && y < height - 1)
tile[localSizeY + 1][localX + 1] = in[(y+1) * width + x];
// similarly for left/right
barrier(CLK_LOCAL_MEM_FENCE);
if (x < width && y < height) {
float center = tile[localY + 1][localX + 1];
float left = tile[localY + 1][localX];
float right = tile[localY + 1][localX + 2];
float up = tile[localY][localX + 1];
float down = tile[localY + 2][localX + 1];
out[y * width + x] = center + 0.125f * (left + right + up + down - 4.0f * center);
}
}
SYCL version (with local accessor and nd_range):
q.submit([&](handler& h) {
auto in = buf_in.get_access<access::mode::read>(h);
auto out = buf_out.get_access<access::mode::write>(h);
local_accessor<float, 2> tile(range<2>(LOCAL_Y+2, LOCAL_X+2), h);
h.parallel_for(nd_range<2>(range<2>(height, width), range<2>(LOCAL_Y, LOCAL_X)), [=](nd_item<2> it) {
int x = it.get_global_id(1);
int y = it.get_global_id(0);
int lx = it.get_local_id(1);
int ly = it.get_local_id(0);
// Load interior
if (x < width && y < height)
tile[ly+1][lx+1] = in[y][x];
// Load halos (simplified)
// ... similar to OpenCL
it.barrier();
if (x < width && y < height) {
float center = tile[ly+1][lx+1];
float left = tile[ly+1][lx];
float right = tile[ly+1][lx+2];
float up = tile[ly][lx+1];
float down = tile[ly+2][lx+1];
out[y][x] = center + 0.125f * (left + right + up + down - 4.0f * center);
}
});
});
These patterns are building blocks for many scientific and engineering applications.
Part 4: SYCL on Different Hardware and Performance Comparison
One of SYCL's greatest promises is "write once, run anywhere." But how does performance compare across CPUs, GPUs, and FPGAs? Let's explore the practical considerations.
4.1 Targeting Devices with SYCL
Selecting a device is straightforward:
// Default selector (often picks a GPU if one is available)
queue q;
// Specific selectors (SYCL 2020 uses selector objects, replacing the old selector classes)
queue q_cpu{cpu_selector_v};
queue q_gpu{gpu_selector_v};
queue q_fpga{accelerator_selector_v}; // FPGAs are reported as accelerators
You can also write custom selectors based on device info (e.g., preferring a GPU with larger local memory).
4.2 Performance Considerations per Device
CPU:
- Optimize for vectorization: use wide SIMD (work‑group size = vector width).
- Minimize data transfers; CPU shares memory with host, so USM can be efficient.
- Avoid excessive local memory; CPU caches are good.
GPU:
- Maximize occupancy: choose work‑group size that hides latency (e.g., 128‑256).
- Use local memory for data reuse.
- Ensure coalesced global memory accesses.
FPGA:
- Kernels are synthesized into hardware pipelines.
- Loop unrolling, pipelining, and on‑chip memory are key.
- SYCL for FPGA supports [[intel::max_work_group_size(1)]] to create single-work-item kernels that are highly pipelined.
- Use #pragma unroll and [[intel::initiation_interval(1)]] directives.
4.3 Benchmarking and Comparison
To compare performance across devices, you need a common benchmark. Let's take matrix multiplication (tiled) and run it on different hardware.
Methodology:
- Compile with appropriate flags: for CPU, use -O3 -march=native; for GPU, -O3 -fsycl; for FPGA, you need the Intel FPGA SDK and use -Xshardware for actual hardware runs.
- Measure kernel execution time, e.g., with a host timer around a synchronized submit:
q.submit([&](handler& h) {
// ... warm-up run so JIT compilation doesn't skew the timing
}).wait();
auto start = std::chrono::steady_clock::now();
q.submit([&](handler& h) {
// kernel
}).wait();
auto end = std::chrono::steady_clock::now();
Better: use SYCL profiling info via property::queue::enable_profiling().
Expected Observations:
- CPU: Good for small matrices, latency may be higher due to OS scheduling.
- GPU: Excels at large, regular computations.
- FPGA: Often slower for compute‑bound tasks but can be very power‑efficient and excel at pipeline‑friendly streaming.
Case Study: Matrix Multiplication
We can run a tiled SYCL matrix-multiplication kernel (like the one from our earlier optimization guide) on a CPU, GPU, and FPGA (if available). Performance will vary dramatically. For example, on an Intel Core i9, the CPU might achieve ~50 GFLOP/s; on an Intel UHD Graphics GPU, maybe ~200 GFLOP/s; on an Arria 10 FPGA, perhaps ~50 GFLOP/s but at lower power. The FPGA can also offer lower latency for small batches and better predictability.
4.4 Portable Optimization Tips
- Specialize per architecture where needed, e.g., with compile-time macros (such as __SYCL_DEVICE_ONLY__) or runtime device checks.
- Tune work-group size dynamically using device.get_info<info::device::max_work_group_size>().
- For FPGA, design kernels around pipeline parallelism rather than data parallelism.
Part 5: oneAPI Domain-Specific Libraries
oneAPI includes a rich set of libraries that provide optimized building blocks for common tasks. Using them can drastically accelerate development while ensuring high performance across devices.
5.1 oneMKL (Math Kernel Library)
oneMKL provides BLAS, LAPACK, FFT, RNG, and more. It has SYCL interfaces that accept SYCL buffers or USM pointers.
Example: Matrix multiplication using oneMKL:
#include <oneapi/mkl/blas.hpp>
queue q;
float alpha = 1.0f, beta = 0.0f;
oneapi::mkl::blas::gemm(q, oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
q.wait();
This will call an optimized GEMM for the target device (CPU, GPU, or FPGA). No need to write your own kernel!
FFT example:
#include <oneapi/mkl/dft.hpp>
using namespace oneapi::mkl::dft;
descriptor<precision::single, domain::real> transform(N);
transform.commit(q);
// in and out buffers...
compute_forward(transform, in, out); // note: compute_forward is a free function, not a member
5.2 oneDNN (Deep Neural Networks)
oneDNN (formerly DNNL) provides primitives for deep learning: convolution, pooling, normalization, etc. It has SYCL and OpenCL backends.
Example: Convolution:
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;
engine eng(engine::kind::gpu, 0);
stream s(eng);
memory::dims conv_src_tz = {N, C, H, W};
memory::dims conv_weights_tz = {K, C, R, S};
memory::dims conv_dst_tz = {N, K, OH, OW}; // OH/OW: output height/width
// Create memory objects
auto conv_src_mem = memory({{conv_src_tz}, memory::data_type::f32, memory::format_tag::any}, eng);
// Initialize convolution descriptor and primitive
convolution_forward::desc conv_desc(prop_kind::forward, algorithm::convolution_direct,
conv_src_md, conv_weights_md, conv_dst_md,
strides, padding_l, padding_r);
convolution_forward::primitive_desc conv_pd(conv_desc, eng);
auto conv = convolution_forward(conv_pd);
conv.execute(s, {{DNNL_ARG_SRC, conv_src_mem}, {DNNL_ARG_WEIGHTS, conv_weights_mem}, {DNNL_ARG_DST, conv_dst_mem}});
s.wait();
oneDNN automatically selects the best implementation for the hardware.
5.3 oneDPL (DPC++ Library)
oneDPL provides parallel algorithms similar to C++17's Parallel STL, implemented with SYCL.
Example: parallel sort:
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/iterator>
std::vector<float> data(N);
// fill data...
sycl::buffer<float> buf(data.data(), data.size());
auto policy = oneapi::dpl::execution::make_device_policy(q);
std::sort(policy, oneapi::dpl::begin(buf), oneapi::dpl::end(buf));
Other algorithms: reduce, transform, scan, etc.
5.4 oneTBB (Threading Building Blocks)
While not exclusively for devices, oneTBB integrates with oneAPI to provide task parallelism on the host, which can coordinate with device work.
5.5 Other oneAPI Libraries
- oneVPL – Video Processing Library for encode/decode.
- oneDAL – Data Analytics Library.
- oneCCL – Collective Communications Library for distributed computing.
5.6 Combining Libraries in a Project
A typical SYCL + oneAPI project might use:
- oneDPL for data preparation on CPU.
- oneMKL for FFT on GPU.
- oneDNN for inference on FPGA.
- oneTBB to manage host-side tasks.
All with the same SYCL queue and memory model.
Conclusion
We've journeyed through the advanced features of OpenCL and SYCL, implemented essential parallel patterns, explored performance across diverse hardware, and surveyed the powerful oneAPI libraries. Armed with this knowledge, you can now tackle complex computational problems with confidence and write code that runs efficiently on CPUs, GPUs, and FPGAs alike.
Remember, the key to mastering heterogeneous computing is practice. Experiment with these patterns on your own hardware, profile them, and see where optimizations take you. And when you need a productivity boost, don't reinvent the wheel – leverage the oneAPI libraries.
Further Resources:
- OpenCL Specification
- SYCL 2020 Specification
- oneAPI Specification and Libraries
- Intel oneAPI DPC++ Compiler Documentation
Thank you for reading! If you have questions or want to share your experiences with OpenCL, SYCL, or oneAPI, leave a comment below. Happy coding!