Welcome back, GPU programming experts! You've mastered the fundamentals of Metal and Vulkan compute. Now it's time to explore the advanced features that truly unlock the power of modern GPUs. In this comprehensive guide, we'll dive deep into:
- Metal Performance Shaders (MPS) – Apple's library of highly optimized compute primitives.
- Metal Ray Tracing – Hardware‑accelerated ray tracing on Apple Silicon.
- Vulkan Subgroup Operations – Efficient intra‑workgroup communication.
- Vulkan Ray Tracing – Cross‑platform ray tracing extensions.
- Vulkan Video – Hardware‑accelerated video encoding/decoding.
- Cross‑Platform Alternatives – How SYCL and OpenCL fit into the picture.
By the end, you'll be equipped to build cutting‑edge applications that leverage the full capabilities of modern GPUs.
Part 1: Metal Performance Shaders (MPS)
Metal Performance Shaders is a framework that provides a collection of highly optimized compute and graphics functions for image processing, machine learning, linear algebra, and more. MPS is tightly integrated with Metal and takes advantage of the specific hardware capabilities of Apple GPUs.
1.1 Overview and Setup
MPS is available on all Apple platforms (iOS, macOS, tvOS) with Metal support. To use MPS, import the framework:
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>
MPS kernels are objects that you create with a specific device and then encode into a Metal command buffer. They manage their own internal state and resources, so you don't have to worry about shader compilation or pipeline state.
1.2 Key MPS Features
- Image Processing: Convolution, morphology, histograms, resizing, and more.
- Matrix Multiplication: MPSMatrixMultiplication for fast linear algebra.
- Neural Networks: MPSCNNConvolution, MPSCNNFullyConnected, MPSCNNPooling – building blocks for deep learning.
- Ray Tracing: MPSRayIntersector for ray–triangle intersection and acceleration structure building.
- Random Number Generation: MPSMatrixRandom kernels for GPU-side random numbers.
1.3 Example: Matrix Multiplication with MPS
Let's see how to perform matrix multiplication using MPSMatrixMultiplication. This is far simpler than writing your own kernel and often faster.
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLCommandQueue> commandQueue = [device newCommandQueue];
// Create matrices as MPSMatrix objects (wrappers around MTLBuffer)
MPSMatrixDescriptor *desc = [MPSMatrixDescriptor matrixDescriptorWithRows:1024 columns:1024 rowBytes:1024 * sizeof(float) dataType:MPSDataTypeFloat32];
id<MTLBuffer> bufferA = [device newBufferWithLength:desc.rowBytes * desc.rows options:MTLResourceStorageModeShared];
id<MTLBuffer> bufferB = [device newBufferWithLength:desc.rowBytes * desc.rows options:MTLResourceStorageModeShared];
id<MTLBuffer> bufferC = [device newBufferWithLength:desc.rowBytes * desc.rows options:MTLResourceStorageModeShared];
MPSMatrix *matrixA = [[MPSMatrix alloc] initWithBuffer:bufferA descriptor:desc];
MPSMatrix *matrixB = [[MPSMatrix alloc] initWithBuffer:bufferB descriptor:desc];
MPSMatrix *matrixC = [[MPSMatrix alloc] initWithBuffer:bufferC descriptor:desc];
// Create the multiplication kernel
MPSMatrixMultiplication *matMul = [[MPSMatrixMultiplication alloc] initWithDevice:device transposeLeft:NO transposeRight:NO resultRows:1024 resultColumns:1024 interiorColumns:1024 alpha:1.0 beta:0.0];
// Encode the kernel
id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
[matMul encodeToCommandBuffer:commandBuffer leftMatrix:matrixA rightMatrix:matrixB resultMatrix:matrixC];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
// Result in bufferC
That's it! The kernel automatically chooses the optimal algorithm for the device.
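MPSMatrixMultiplication computes C = alpha * A * B + beta * C (with optional transposes). When validating GPU output, it helps to have a CPU reference for the same operation. The sketch below is a hypothetical row-major helper, not part of MPS:

```c
#include <stddef.h>

/* Reference for C = alpha * A * B + beta * C, row-major, no transposes --
   the same operation MPSMatrixMultiplication encodes. A is m x k,
   B is k x n, C is m x n. */
void gemm_ref(size_t m, size_t n, size_t k,
              float alpha, const float *A, const float *B,
              float beta, float *C)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

Running this on the host data and comparing against the contents of bufferC (within a small tolerance, since GPU accumulation order differs) is a quick correctness check.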
1.4 Example: Convolution for Deep Learning
MPS provides a full suite of neural network layers. Here's a simple convolution layer:
MPSCNNConvolutionDescriptor *convDesc = [MPSCNNConvolutionDescriptor cnnConvolutionDescriptorWithKernelWidth:3 kernelHeight:3 inputFeatureChannels:3 outputFeatureChannels:64];
convDesc.strideInPixelsX = 1; convDesc.strideInPixelsY = 1;
MPSCNNConvolution *conv = [[MPSCNNConvolution alloc] initWithDevice:device convolutionDescriptor:convDesc kernelWeights:weightsData biasTerms:biasData flags:MPSCNNConvolutionFlagsNone];
// Assume you have an MPSImage *srcImage
[conv encodeToCommandBuffer:commandBuffer sourceImage:srcImage destinationImage:dstImage];
MPS automatically handles weight layout and uses hardware‑accelerated matrix multiplication or Winograd algorithms.
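To make the layer's arithmetic concrete, here is a CPU sketch of what one output channel computes for a single input channel: a 2D cross-correlation with stride 1 and no padding. This is an illustration of the math only, not the MPS weight layout:

```c
#include <stddef.h>

/* Single-channel 2D convolution (cross-correlation), stride 1, "valid"
   padding: out is (ih - kh + 1) x (iw - kw + 1). A CPU sketch of one
   output feature channel of a conv layer. */
void conv2d_valid(const float *in, size_t ih, size_t iw,
                  const float *kern, size_t kh, size_t kw,
                  float bias, float *out)
{
    size_t oh = ih - kh + 1, ow = iw - kw + 1;
    for (size_t y = 0; y < oh; y++) {
        for (size_t x = 0; x < ow; x++) {
            float acc = bias;
            for (size_t ky = 0; ky < kh; ky++)
                for (size_t kx = 0; kx < kw; kx++)
                    acc += in[(y + ky) * iw + (x + kx)] * kern[ky * kw + kx];
            out[y * ow + x] = acc;
        }
    }
}
```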
1.5 MPSRayIntersector for Ray Tracing
MPSRayIntersector provides a fast, high‑level interface for ray tracing on Apple GPUs. It builds acceleration structures (BVHs) and performs ray‑primitive intersection tests.
Basic usage:
- Create an MPSRayIntersector with a device.
- Build an acceleration structure from vertex and index data.
- Encode intersection tests with rays.
MPSRayIntersector *intersector = [[MPSRayIntersector alloc] initWithDevice:device];
// Create a vertex buffer and index buffer
id<MTLBuffer> vertexBuffer = [device newBufferWithBytes:vertices length:vertexDataSize options:MTLResourceStorageModeManaged];
id<MTLBuffer> indexBuffer = [device newBufferWithBytes:indices length:indexDataSize options:MTLResourceStorageModeManaged];
// Build the acceleration structure directly from the vertex and index buffers
MPSTriangleAccelerationStructure *accel = [[MPSTriangleAccelerationStructure alloc] initWithDevice:device];
accel.vertexBuffer = vertexBuffer;
accel.indexBuffer = indexBuffer;
accel.indexType = MPSDataTypeUInt32;
accel.triangleCount = indexDataSize / (3 * sizeof(uint32_t));
[accel rebuild];
// Configure the ray and intersection layouts on the intersector
intersector.rayDataType = MPSRayDataTypeOriginDirection;
intersector.intersectionDataType = MPSIntersectionDataTypeDistance;
// Prepare the ray buffer (origin and direction per ray)
id<MTLBuffer> rayBuffer = [device newBufferWithLength:rayCount * sizeof(MPSRayOriginDirection) options:MTLResourceStorageModeManaged];
MPSRayOriginDirection *rays = (MPSRayOriginDirection *)rayBuffer.contents;
// fill rays...
// Encode the intersection test
id<MTLBuffer> intersectionBuffer = [device newBufferWithLength:rayCount * sizeof(MPSIntersectionDistance) options:MTLResourceStorageModeManaged];
[intersector encodeIntersectionToCommandBuffer:commandBuffer intersectionType:MPSIntersectionTypeNearest rayBuffer:rayBuffer rayBufferOffset:0 intersectionBuffer:intersectionBuffer intersectionBufferOffset:0 rayCount:rayCount accelerationStructure:accel];
You can then read the intersection distances from intersectionBuffer. MPSRayIntersector supports both triangle and sphere primitives, and can handle large scenes efficiently.
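Under the hood, an intersector performs ray–triangle tests against the BVH. The standard Möller–Trumbore algorithm is a useful CPU reference for sanity-checking the distances an intersector returns; the sketch below is an illustration, not MPS code:

```c
#include <math.h>

/* Small vector helpers for the test below. */
static void sub3(const float a[3], const float b[3], float r[3]) {
    r[0] = a[0] - b[0]; r[1] = a[1] - b[1]; r[2] = a[2] - b[2];
}
static void cross3(const float a[3], const float b[3], float r[3]) {
    r[0] = a[1]*b[2] - a[2]*b[1];
    r[1] = a[2]*b[0] - a[0]*b[2];
    r[2] = a[0]*b[1] - a[1]*b[0];
}
static float dot3(const float a[3], const float b[3]) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

/* Moeller-Trumbore ray/triangle intersection. Returns 1 and writes the
   hit distance to *t on a hit in front of the origin; 0 on a miss. */
int ray_triangle(const float orig[3], const float dir[3],
                 const float v0[3], const float v1[3], const float v2[3],
                 float *t)
{
    float e1[3], e2[3], p[3], tv[3], q[3];
    sub3(v1, v0, e1);
    sub3(v2, v0, e2);
    cross3(dir, e2, p);
    float det = dot3(e1, p);
    if (fabsf(det) < 1e-8f) return 0;   /* ray parallel to triangle */
    float inv = 1.0f / det;
    sub3(orig, v0, tv);
    float u = dot3(tv, p) * inv;        /* first barycentric coordinate */
    if (u < 0.0f || u > 1.0f) return 0;
    cross3(tv, e1, q);
    float v = dot3(dir, q) * inv;       /* second barycentric coordinate */
    if (v < 0.0f || u + v > 1.0f) return 0;
    float tt = dot3(e2, q) * inv;
    if (tt <= 0.0f) return 0;           /* hit behind the origin */
    *t = tt;
    return 1;
}
```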
1.6 Performance Tips for MPS
- Reuse MPS kernels and acceleration structures across frames.
- Use MPSKernelOptions to tune behavior (e.g., MPSKernelOptionsAllowReducedPrecision for FP16).
- For neural networks, consider MPSNNGraph to fuse multiple layers into a single optimized graph.
- Profile with Xcode's Metal System Trace to see how MPS kernels are scheduled.
Part 2: Metal Ray Tracing
Beyond MPSRayIntersector, Metal provides an explicit ray tracing API that integrates with Metal's existing pipelines: you build acceleration structures with dedicated command encoders and trace rays directly from compute (or render) shaders, optionally supplying custom intersection functions.
2.1 Metal Ray Tracing Concepts
- Acceleration Structure: A BVH built from primitives (bottom level) or instances (top level) via the MTLAccelerationStructure APIs.
- Intersector: The MSL metal::raytracing::intersector object a shader uses to trace a ray against an acceleration structure.
- Intersection Function: An optional custom function that decides whether a ray hits a primitive. Metal provides built-in triangle intersection; custom functions handle procedural geometry.
- Intersection Function Table: An MTLIntersectionFunctionTable that binds custom intersection functions to a pipeline.
2.2 Setting Up a Basic Ray Tracer
Step 1: Create an acceleration structure
Fill out a MTLPrimitiveAccelerationStructureDescriptor with your triangle geometry, query the required sizes with accelerationStructureSizesWithDescriptor:, allocate the structure, and encode the build with an MTLAccelerationStructureCommandEncoder.
Step 2: Create a compute pipeline
Metal has no dedicated ray tracing pipeline: you compile an ordinary compute kernel that calls the MSL intersector. If you need custom primitives, attach intersection functions through the pipeline descriptor's linked functions.
Step 3: Allocate resources
- The acceleration structure from step 1
- An output texture
- Any scene buffers (vertices, normals, materials) the kernel reads
Step 4: Encode the work
Use a regular MTLComputeCommandEncoder: set the pipeline, bind the acceleration structure with setAccelerationStructure:atBufferIndex:, and dispatch one thread per pixel.
2.3 Example: Tracing Rays from a Compute Kernel (MSL)
#include <metal_stdlib>
using namespace metal;
using namespace metal::raytracing;
kernel void raytraceKernel(primitive_acceleration_structure accel [[buffer(0)]],
                           texture2d<float, access::write> output [[texture(0)]],
                           uint2 tid [[thread_position_in_grid]]) {
    // Generate a camera ray for pixel (tid.x, tid.y)
    float2 uv = float2(tid) / float2(output.get_width(), output.get_height());
    ray r;
    r.origin = float3(uv * 2.0f - 1.0f, -1.0f);
    r.direction = float3(0, 0, 1);
    r.min_distance = 0.001f;
    r.max_distance = INFINITY;
    // Trace the ray inline against the acceleration structure
    intersector<triangle_data> isect;
    intersection_result<triangle_data> hit = isect.intersect(r, accel);
    float3 color = (hit.type == intersection_type::triangle) ? float3(1, 0, 0) : float3(0, 0, 0);
    output.write(float4(color, 1), tid);
}
Unlike Vulkan or DXR, there is no separate closest-hit stage: shading happens right in the kernel, where you can read the hit's distance, primitive index, and barycentric coordinates to compute the surface color.
2.4 Building the Pipeline
On the host side, you compile the kernel and create an ordinary compute pipeline:
NSError *error = nil;
id<MTLLibrary> library = [device newDefaultLibrary];
id<MTLFunction> raytraceFunc = [library newFunctionWithName:@"raytraceKernel"];
id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithFunction:raytraceFunc error:&error];
If you use custom intersection functions, link them through a MTLComputePipelineDescriptor's linkedFunctions and build an MTLIntersectionFunctionTable from the resulting pipeline state.
2.5 Encoding and Dispatch
id<MTLComputeCommandEncoder> encoder = [commandBuffer computeCommandEncoder];
[encoder setComputePipelineState:pipeline];
[encoder setAccelerationStructure:accel atBufferIndex:0];
[encoder setTexture:outputTexture atIndex:0];
// Dispatch one thread per pixel
[encoder dispatchThreads:MTLSizeMake(width, height, 1) threadsPerThreadgroup:MTLSizeMake(8, 8, 1)];
[encoder endEncoding];
This is a minimal example. A full ray tracer adds instance acceleration structures, intersection function tables for custom geometry, and a bounce loop for secondary rays.
2.6 Performance Considerations
- Use MTLAccelerationStructure for static geometry; rebuild only when the geometry changes (prefer a refit for small updates).
- Batch rays into coherent groups so neighboring threads traverse similar BVH nodes.
- Use the smallest possible payload and attribute sizes.
- Profile with Xcode to identify bottlenecks.
Part 3: Vulkan Subgroup Operations
Subgroup operations are a powerful feature in Vulkan (and other APIs) that allow efficient communication and computation among a set of invocations that execute together (a subgroup, roughly corresponding to a CUDA warp or AMD wavefront). They can dramatically improve performance for reductions, scans, and other data‑parallel patterns.
3.1 What are Subgroups?
A subgroup is a collection of invocations within a workgroup that can communicate and synchronize using specialized operations. The subgroup size is implementation-defined but can be queried: typically 32 on NVIDIA, 32 or 64 on AMD (wave32/wave64), and 8, 16, or 32 on Intel.
Subgroup operations include:
- Vote: subgroupAll(), subgroupAny(), subgroupBallot().
- Broadcast: subgroupBroadcast().
- Shuffle: subgroupShuffle(), subgroupShuffleXor().
- Arithmetic: subgroupAdd(), subgroupMin(), etc. (reductions).
- Scan: subgroupInclusiveAdd(), subgroupExclusiveAdd().
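The semantics of these operations are easy to pin down with a CPU model. The sketch below simulates subgroupAdd and subgroupExclusiveAdd over a hypothetical 32-lane subgroup (real hardware sizes vary, as noted above):

```c
#include <stddef.h>

#define LANES 32  /* a typical subgroup size; the real value is queried */

/* CPU model of subgroupAdd(): every lane observes the sum over all lanes. */
float sim_subgroup_add(const float lane_vals[LANES])
{
    float sum = 0.0f;
    for (size_t i = 0; i < LANES; i++)
        sum += lane_vals[i];
    return sum;
}

/* CPU model of subgroupExclusiveAdd(): lane i observes the sum of
   lanes 0..i-1 (the lane's own value is excluded). */
void sim_subgroup_exclusive_add(const float lane_vals[LANES], float out[LANES])
{
    float run = 0.0f;
    for (size_t i = 0; i < LANES; i++) {
        out[i] = run;
        run += lane_vals[i];
    }
}
```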
3.2 Enabling Subgroup Support in Vulkan
Subgroup support is core since Vulkan 1.1 (query it via VkPhysicalDeviceSubgroupProperties); the VK_EXT_subgroup_size_control extension additionally lets you pin or vary the subgroup size. In your shader, enable the relevant extensions:
#version 450
#extension GL_KHR_shader_subgroup_arithmetic : enable
#extension GL_KHR_shader_subgroup_ballot : enable
#extension GL_KHR_shader_subgroup_shuffle : enable
3.3 Example: Parallel Reduction with Subgroups
Suppose you want to sum an array within a workgroup. Without subgroups, you'd use shared memory and barriers. With subgroups, you can do a hierarchical reduction:
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable
layout(local_size_x = 256) in;
layout(binding = 0) buffer In { float data[]; } inBuf;
layout(binding = 1) buffer Out { float data[]; } outBuf;
// Shared variables must be declared at global scope, and the array size
// must be a compile-time constant: 32 slots covers the smallest subgroup
// size this shader supports (256 / 8 = 32).
shared float sharedSums[32];
void main() {
    uint gid = gl_GlobalInvocationID.x;
    float val = inBuf.data[gid];
    // First, reduce within each subgroup
    float subgroupSum = subgroupAdd(val);
    // One invocation per subgroup writes its partial sum to shared memory
    if (subgroupElect()) {
        sharedSums[gl_SubgroupID] = subgroupSum;
    }
    barrier();
    // The first subgroup reduces the partial sums
    if (gl_SubgroupID == 0) {
        float total = (gl_SubgroupInvocationID < gl_NumSubgroups)
                    ? sharedSums[gl_SubgroupInvocationID] : 0.0;
        total = subgroupAdd(total);
        if (subgroupElect()) {
            outBuf.data[gl_WorkGroupID.x] = total;
        }
    }
}
This is much faster than a naive loop because it leverages hardware acceleration.
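The two-level structure of the reduction can be mirrored on the CPU, which is handy for generating expected outputs in tests. This sketch assumes a workgroup of 256 and, for illustration, a subgroup size of 32:

```c
#include <stddef.h>

/* CPU mirror of the two-level shader reduction: reduce each subgroup-sized
   chunk, then reduce the partial sums. Sizes mirror the shader above
   (local_size_x = 256; a subgroup size of 32 is assumed here). */
#define WG_SIZE 256
#define SG_SIZE 32

float workgroup_reduce(const float vals[WG_SIZE])
{
    float partials[WG_SIZE / SG_SIZE];      /* one slot per subgroup */
    for (size_t sg = 0; sg < WG_SIZE / SG_SIZE; sg++) {
        float s = 0.0f;                     /* subgroupAdd within chunk sg */
        for (size_t l = 0; l < SG_SIZE; l++)
            s += vals[sg * SG_SIZE + l];
        partials[sg] = s;                   /* the subgroupElect() writer */
    }
    float total = 0.0f;                     /* final reduction of partials */
    for (size_t sg = 0; sg < WG_SIZE / SG_SIZE; sg++)
        total += partials[sg];
    return total;
}
```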
3.4 Vulkan Subgroup Query on Host
You can query subgroup properties:
VkPhysicalDeviceSubgroupProperties subgroupProps = {};
subgroupProps.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;
VkPhysicalDeviceProperties2 props2 = {};
props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
props2.pNext = &subgroupProps;
vkGetPhysicalDeviceProperties2(physicalDevice, &props2);
uint32_t subgroupSize = subgroupProps.subgroupSize;
// Check supported operations in subgroupProps.supportedOperations
3.5 Best Practices
- Use subgroups when you need intra-workgroup communication with low latency.
- Avoid branching that diverges within a subgroup (all invocations in a subgroup should take the same path).
- Prefer subgroup operations over shared memory for small reductions and scans.
- Fall back to shared memory when the operations you need are missing from subgroupProps.supportedOperations; select the appropriate shader variant on the host (e.g., via specialization constants).
Part 4: Vulkan Ray Tracing
Vulkan ray tracing extensions (VK_KHR_ray_tracing_pipeline, VK_KHR_acceleration_structure, etc.) provide a cross‑platform API for hardware‑accelerated ray tracing. They are supported on NVIDIA RTX, AMD RDNA2+, and Intel Arc GPUs.
4.1 Core Concepts
- Acceleration Structure: A BVH built from bottom‑level (geometry) and top‑level (instances) structures.
- Shader Binding Table (SBT): A table of shader handles for ray generation, hit groups, and miss shaders.
- Ray Tracing Pipeline: Combines shader stages (raygen, miss, closesthit, anyhit, intersection).
- Ray Tracing Commands: vkCmdTraceRaysKHR to launch rays.
4.2 Building Acceleration Structures
First, create buffers with geometry data. Then use vkCmdBuildAccelerationStructuresKHR to build bottom‑level and top‑level AS.
Example: Build a bottom‑level AS from a triangle mesh:
VkAccelerationStructureGeometryKHR geometry = {};
geometry.sType = VK_STRUCTURE_TYPE_ACCELERATION_STRUCTURE_GEOMETRY_KHR;
geometry.geometryType = VK_GEOMETRY_TYPE_TRIANGLES_KHR;
geometry.flags = VK_GEOMETRY_OPAQUE_BIT_KHR;
geometry.geometry.triangles.sType = VK_STRUCTURE_TYPE_ACCELERATION_STRUCTURE_GEOMETRY_TRIANGLES_DATA_KHR;
geometry.geometry.triangles.vertexFormat = VK_FORMAT_R32G32B32_SFLOAT;
geometry.geometry.triangles.vertexData.deviceAddress = vertexBufferAddress;
geometry.geometry.triangles.vertexStride = sizeof(Vertex);
geometry.geometry.triangles.maxVertex = vertexCount - 1;
geometry.geometry.triangles.indexType = VK_INDEX_TYPE_UINT32;
geometry.geometry.triangles.indexData.deviceAddress = indexBufferAddress;
VkAccelerationStructureBuildRangeInfoKHR range = {};
range.primitiveCount = indexCount / 3;
range.primitiveOffset = 0;
range.firstVertex = 0;
range.transformOffset = 0;
// Create and build bottom AS...
You'll need to query the required memory size, allocate, and then build.
4.3 Shader Example (GLSL)
Ray generation shader:
#version 460
#extension GL_EXT_ray_tracing : require
layout(binding = 0) uniform accelerationStructureEXT topLevelAS;
layout(binding = 1, rgba8) uniform image2D image;
layout(location = 0) rayPayloadEXT vec3 hitColor;
void main() {
ivec2 pixel = ivec2(gl_LaunchIDEXT.xy);
vec2 dim = vec2(gl_LaunchSizeEXT.xy);
vec2 d = (vec2(pixel) + vec2(0.5)) / dim - 0.5;
float aspect = dim.x / dim.y;
vec3 origin = vec3(0,0,-2);
vec3 direction = normalize(vec3(d.x * aspect, -d.y, 1));
traceRayEXT(topLevelAS, gl_RayFlagsOpaqueEXT, 0xff, 0, 0, 0, origin, 0.001, direction, 1000.0, 0);
imageStore(image, pixel, vec4(hitColor, 1));
}
Closest hit shader:
#version 460
#extension GL_EXT_ray_tracing : require
layout(location = 0) rayPayloadInEXT vec3 hitColor;
hitAttributeEXT vec2 attribs;
void main() {
hitColor = vec3(1,0,0); // red
}
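The hitAttributeEXT barycentrics identify where on the triangle the ray hit, and are typically used to interpolate per-vertex attributes. A hedged C sketch of that interpolation (with GL_EXT_ray_tracing, u weights vertex 1 and v weights vertex 2):

```c
/* Interpolate a per-vertex attribute at barycentric coordinates (u, v):
   weight (1 - u - v) on vertex 0, u on vertex 1, v on vertex 2. */
float bary_interp(float a0, float a1, float a2, float u, float v)
{
    return (1.0f - u - v) * a0 + u * a1 + v * a2;
}
```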
Miss shader:
#version 460
#extension GL_EXT_ray_tracing : require
layout(location = 0) rayPayloadInEXT vec3 hitColor;
void main() {
hitColor = vec3(0,0,0); // black
}
4.4 Creating a Ray Tracing Pipeline
You need to create shader modules for each stage, then a ray tracing pipeline with VkRayTracingPipelineCreateInfoKHR. This includes the shader groups (raygen, miss, hit groups).
4.5 Shader Binding Table (SBT)
The SBT is a buffer that holds shader handles and any local data for each shader group. You must fill it according to the pipeline layout. It's complex but crucial.
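Much of the SBT bookkeeping is alignment arithmetic: each record's stride is the shader group handle size rounded up to the handle alignment, and each region's base address must also satisfy the base alignment. A hedged sketch of that layout math, using illustrative values in place of the device's VkPhysicalDeviceRayTracingPipelinePropertiesKHR fields:

```c
#include <stdint.h>

/* Round x up to the next multiple of a (a must be a power of two). */
static uint32_t align_up(uint32_t x, uint32_t a)
{
    return (x + a - 1) & ~(a - 1);
}

typedef struct {
    uint32_t handle_stride;   /* stride between records within a region */
    uint32_t raygen_size;     /* raygen region holds exactly one record */
    uint32_t miss_size;
    uint32_t hit_size;
} SbtLayout;

/* Compute SBT region sizes from the device's group handle size and
   alignment (region base addresses must additionally respect
   shaderGroupBaseAlignment, not modeled here). */
SbtLayout sbt_layout(uint32_t handle_size, uint32_t handle_align,
                     uint32_t miss_count, uint32_t hit_count)
{
    SbtLayout l;
    l.handle_stride = align_up(handle_size, handle_align);
    l.raygen_size   = l.handle_stride;
    l.miss_size     = miss_count * l.handle_stride;
    l.hit_size      = hit_count  * l.handle_stride;
    return l;
}
```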
4.6 Dispatching Rays
After recording the SBT buffer, use:
vkCmdTraceRaysKHR(commandBuffer, &raygenSBT, &missSBT, &hitSBT, &callableSBT, width, height, 1);
4.7 Performance Tips
- Use opaque geometry and avoid any-hit shaders when possible.
- Keep the ray payload small.
- Group rays coherently (e.g., sort by direction).
- Use inline ray tracing (VK_KHR_ray_query) for simpler tasks without full pipeline overhead.
Part 5: Vulkan Video
Vulkan Video extensions (VK_KHR_video_queue, VK_KHR_video_decode_queue, VK_KHR_video_encode_queue) provide hardware‑accelerated video encoding and decoding. This is a relatively new area but powerful for applications needing real‑time video processing.
5.1 Overview
Vulkan Video exposes video codec capabilities as Vulkan queues and operations. You can decode H.264, H.265, etc., directly to Vulkan images, and encode from images. It integrates seamlessly with the rest of Vulkan for post‑processing and rendering.
5.2 Setting Up Video Queues
When creating a logical device, you need to request a video decode/encode queue:
float priority = 1.0f;
VkDeviceQueueCreateInfo queueCreateInfo = {};
queueCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueCreateInfo.queueFamilyIndex = videoQueueFamilyIndex;
queueCreateInfo.queueCount = 1;
queueCreateInfo.pQueuePriorities = &priority;
You must first query queue families that support VK_QUEUE_VIDEO_DECODE_BIT_KHR or VK_QUEUE_VIDEO_ENCODE_BIT_KHR.
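That query reduces to an index search over queue family flags. The sketch below operates on plain bitmasks so it can run anywhere; in real code the flags come from vkGetPhysicalDeviceQueueFamilyProperties and the required mask would be VK_QUEUE_VIDEO_DECODE_BIT_KHR or VK_QUEUE_VIDEO_ENCODE_BIT_KHR:

```c
#include <stdint.h>

/* Find the first queue family whose flags contain all required bits.
   flags[i] stands in for VkQueueFamilyProperties::queueFlags.
   Returns the family index, or UINT32_MAX if none qualifies. */
uint32_t find_queue_family(const uint32_t *flags, uint32_t family_count,
                           uint32_t required)
{
    for (uint32_t i = 0; i < family_count; i++)
        if ((flags[i] & required) == required)
            return i;
    return UINT32_MAX;
}
```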
5.3 Decoding a Video Frame
- Create a video session (VkVideoSessionKHR) with codec parameters.
- Create video picture resources (reference and output images).
- Submit decode operations via vkCmdDecodeVideoKHR.
Example decode command:
VkVideoDecodeInfoKHR decodeInfo = {};
decodeInfo.sType = VK_STRUCTURE_TYPE_VIDEO_DECODE_INFO_KHR;
decodeInfo.srcBuffer = bitstreamBuffer;
decodeInfo.srcBufferOffset = bitstreamOffset;
decodeInfo.srcBufferRange = bitstreamSize;
decodeInfo.dstPictureResource = pictureResource; // reference to image view
decodeInfo.pSetupReferenceSlot = ...; // DPB slot the decoded picture will occupy
vkCmdDecodeVideoKHR(commandBuffer, &decodeInfo);
You'll need to manage the decoded picture buffer (DPB) and reference frames.
5.4 Encoding
Similarly, you can encode frames by providing raw images and receiving a bitstream.
5.5 Limitations and Considerations
- Vulkan Video is still evolving; not all vendors support all codecs.
- You need to handle bitstream parsing yourself (the API expects raw NAL units).
- It's a low‑level interface; for simpler use, consider libraries like FFmpeg with Vulkan acceleration.
Part 6: Cross‑Platform Considerations
While Metal and Vulkan are powerful, they are platform‑specific. For applications that need to run on both Apple and non‑Apple hardware, you have several options:
6.1 SYCL
SYCL (as covered in previous tutorials) is a high-level C++ abstraction that targets multiple backends, including OpenCL, Level Zero, and CUDA. With SYCL 2020 you can write single-source code that compiles for a variety of devices. However, no current SYCL implementation targets Metal or Vulkan compute directly. SYCL can target OpenCL, which runs on many platforms, but OpenCL is deprecated on macOS, so SYCL is not a practical fit for Apple Silicon today.
6.2 OpenCL
OpenCL is widely supported (including on Intel GPUs, AMD, NVIDIA), but Apple has deprecated OpenCL on macOS in favor of Metal. For cross‑platform, you might write OpenCL kernels and use them on non‑Apple platforms, and rewrite for Metal on Apple. This is double the work.
6.3 Vulkan Everywhere?
Vulkan runs on Windows, Linux, Android, and macOS via MoltenVK. You could therefore write your compute kernels in Vulkan and use MoltenVK, which translates Vulkan to Metal, to run on macOS; for compute workloads this works quite well and gives you a single codebase for all platforms. You lose some Metal-specific optimizations, but the portability is often worth it.
For ray tracing, Vulkan ray tracing extensions are not yet supported through MoltenVK (Metal ray tracing is different). So if you need ray tracing on Apple, you'd need separate Metal code.
6.4 Recommendations
- If you target only Apple: Use Metal and MPS for maximum performance.
- If you target multiple platforms and need compute only: Use Vulkan with MoltenVK for macOS, or SYCL with appropriate backends.
- If you need ray tracing across platforms: Write separate backends for Metal and Vulkan, as the APIs differ significantly.
- If you need video encode/decode: Vulkan Video is cross‑vendor but not yet on Apple; use platform‑specific APIs (VideoToolbox on macOS, MediaFoundation on Windows, VAAPI on Linux).
Conclusion
You've now explored the advanced corners of Metal and Vulkan: the high‑level performance primitives of MPS, the excitement of ray tracing on both platforms, the efficiency of subgroup operations, and the emerging field of Vulkan Video. Armed with this knowledge, you can build applications that push the boundaries of what's possible on modern GPUs.
Remember, the key to mastery is practice. Implement a small path tracer in Metal, try a subgroup‑accelerated reduction in Vulkan, decode a video stream with Vulkan Video, and see how MPS can accelerate your machine learning models. The possibilities are endless.
If you have questions or want to share your own projects, drop a comment below. Happy coding!