From Smart Pixels to the Backbone of an AI-driven World
Every decade, GPUs have reinvented themselves - from drawing triangles to generating worlds, and now to reasoning with language. At some point I realized that throughout my entire programming journey I have been working closely with GPUs and trying countless ways to program them: writing pixel shaders in GLSL, implementing real-time 3D scanning algorithms in OpenCL, and optimizing deep learning models in PyTorch and TensorFlow. So what better way to share my experience than to write a blog post about the evolution of GPU programming, full of nostalgia and memes?
A lot has changed in the GPU programming landscape over the years. New programming models, new frameworks, and new hardware architectures have emerged. There is little point in studying the obsolete ones nowadays; the evolutionary path of GPU programming, however, is quite interesting. If you're an AI expert or a developer in another field, it can help you broaden your expertise or give you the inspiration to dive into the world of GPU programming. It can also give you new ideas for addressing current problems, especially since some of the issues we face today in AI were already faced by graphics programmers 25 years ago. If you are a GPU programming veteran, or not into programming at all, just enjoy the story and the memes.
Here is a mildly entertaining, nostalgia-induced journey through the history of GPU programming, from making brick walls look bumpy in 2000 to optimizing attention mechanisms in LLMs in 2025. Feel free to skip the code snippets if you're not interested in programming, or if you're already familiar with the material and would rather just enjoy the story.
Smart Pixels
In the early 2000s, GPUs were used exclusively for visualization, and the rendering pipeline was completely fixed-function. It was akin to HTML: you would predefine your scene (geometry, textures, positions of lights and camera), and the GPU would take care of rendering it. You could, of course, customize it on the fly, but only in a limited way, by changing parameters of the predefined functions, and this customization happened entirely on the CPU side.
Here is a simple example of rendering a triangle using old-school OpenGL, taken from here.
// Set every pixel in the frame buffer to the current clear color.
glClear(GL_COLOR_BUFFER_BIT);
// Drawing is done by specifying a sequence of vertices. The argument to
// glBegin determines how these vertices are connected. GL_POLYGON constructs a filled polygon.
glBegin(GL_POLYGON);
glColor3f(1, 0, 0); glVertex3f(-0.6, -0.75, 0.5);
glColor3f(0, 1, 0); glVertex3f(0.6, -0.75, 0);
glColor3f(0, 0, 1); glVertex3f(0, 0.75, 0);
glEnd();
// Flush drawing command buffer to make drawing happen as soon as possible.
glFlush();
The idea that you can actually program how pixels are rendered on the screen was quite revolutionary in the early 2000s.
And my first interaction with these ideas was through this article from 2001 on a popular Russian game-development website about the NV_register_combiners extension for OpenGL. Surprisingly, the article is still available online.
This extension enabled you to program how the final color of a pixel is computed from various inputs, such as texture colors and lighting, allowing you to create more complex visual effects. This computation is performed on the GPU, enabling real-time performance. It was akin to running a small assembly program on the GPU for each pixel being rendered.
Graphics developers were fascinated by this idea, as it enabled them to dramatically increase the visual fidelity of their scenes. Shortly after, GLSL was conceived and formally introduced in 2004, allowing developers to write more complex shaders (small programs that define how to manipulate geometry or pixels) in a C-like language.
Are you feeling GPU poor? Imagine that it was even worse back then! Every new generation of GPUs introduced new features and capabilities, dramatically increasing the visual fidelity of games. Having a new GPU was a prerequisite for playing the latest and greatest games. For those into computer graphics, the frustration of the wait and the excitement of getting the new card were doubled! Luckily, I could trick my parents into buying me a new card, because it supported SHADERS! Which, of course, were essential to advance my computer science education. Having the ability to play Oblivion on high settings was just a nice bonus.
Here is an example of a simple GLSL program from rastertek.com that performs bump mapping, an effect achieved by perturbing surface normals with values from a texture to simulate small-scale bumps and wrinkles on the surface of an object.
in vec2 texCoord;
in vec3 normal;
in vec3 tangent;
in vec3 binormal;

uniform sampler2D shaderTexture1; // diffuse texture
uniform sampler2D shaderTexture2; // normal map
uniform vec3 lightDirection;
uniform vec4 diffuseLightColor;

out vec4 outputColor;
void main(void)
{
// Sample the pixel color from the texture using the sampler at this texture coordinate location.
vec4 textureColor = texture(shaderTexture1, texCoord);
// Sample the pixel from the normal map.
vec4 bumpMap = texture(shaderTexture2, texCoord);
// Expand the range of the normal value from (0, +1) to (-1, +1).
bumpMap = (bumpMap * 2.0f) - 1.0f;
// Calculate the normal from the data in the normal map.
vec3 bumpNormal = (bumpMap.x * tangent) + (bumpMap.y * binormal) + (bumpMap.z * normal);
// Normalize the resulting bump normal.
bumpNormal = normalize(bumpNormal);
// Calculate the amount of light on this pixel based on the normal map value.
float lightIntensity = clamp(dot(bumpNormal, -lightDirection), 0.0f, 1.0f);
// Determine the final amount of diffuse color based on the diffuse color combined with the light intensity.
outputColor = clamp((diffuseLightColor * lightIntensity), 0.0f, 1.0f);
// Combine the final light color with the texture color.
outputColor = outputColor * textureColor;
}
What do all these in vec3 variables mean? They are the inputs to the fragment shader: values specified per vertex and interpolated across the surface of the triangle being rendered. The interpolation is done by the GPU hardware and the result is fed into the shader program for each pixel being rendered. This way, every pixel gets its own values, enabling more complex effects, and since each pixel can be processed independently, the computation parallelizes across all pixels being rendered.
Shaders quickly progressed from simple pixel color manipulation to complex effects simulating shadows, reflections, and refractions. Graphics programmers were especially obsessed with simulating complex surface details without increasing the geometric complexity of the scene. The deepest point of this rabbit hole was the Parallax Occlusion Mapping technique, which performs a type of ray-marching in a pixel shader, i.e., traversing space to determine the intersection of a ray with a surface defined by a heightmap texture. This way, a completely flat surface can appear to have complex 3D details.
GPUs as General Purpose Computers
At this point, you may wonder about LLMs, deep learning, and the ability to perform general-purpose computations on GPUs. However, take a look at the shader program above. It is just like a piece of C code. Why can't we use that to perform arbitrary computations on the GPU? Indeed, we can, and people have been doing so since the early 2000s. However, we need to address one problem first. How do we get data in and out of the GPU?
Getting data in is pretty straightforward. We can encode our data as a texture or geometry and upload it to the GPU. But how do we get data out? To help with that, we can use techniques like render to texture. It allows us to render the output of our shader program to a texture instead of the screen. Then we can read that texture back to the CPU.
For those not familiar with computer graphics terms: a texture is just an image. In computer graphics, textures are used to store image data that can be applied to the surface of 3D models to give them color and detail. A texture is typically a 2D array of pixels, where each pixel contains color information (e.g., RGB values) and sometimes additional information like alpha (transparency) or normal vectors for bump mapping.
This technique is actually even older than shaders themselves, as it was used in the pre-shader era to create effects like dynamic reflections and shadows. For example, to create a reflection effect, you can render the scene from the point of view of a reflected camera (e.g., below the water surface) to a texture, and then use that texture to render the water surface. You can use a pixel shader to distort the texture coordinates, simulating water ripples.
Some ingenious people figured out that you can use this technique to perform arbitrary computations on the GPU by encoding your input data as a texture, writing a shader program to perform the calculation, rendering the output to a texture, and then reading that texture back to the CPU.
What can you achieve with this technique? Everything you can with CUDA today. A popular technique in early GPGPU was ping-pong rendering, where two textures are alternated for reading and writing: you take your input texture, compute some function on it, write the result to the output texture, then use that output texture as the input for the next pass, and so on. This way, you can build complex computations by chaining together multiple shader programs. And you don't have to work with images specifically; you can encode any data as a texture, e.g., a 2D array of floats, a 3D volume of voxels, a graph, and so on.
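If it helps to see the data flow in more familiar terms, here is roughly what the same ping-pong pattern looks like expressed with modern CUDA buffers instead of textures (CUDA itself only enters our story a couple of sections later, and step_kernel here is just a hypothetical per-element update standing in for "some function"). In the shader era, the two buffers were textures, and each "launch" was a full-screen draw into the other one.
// Hypothetical per-element update: one "pass" of some iterative computation.
__global__ void step_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.5f * (in[i] + 1.0f); // placeholder for "some function"
}

// Ping-pong between two device buffers: the output of one pass becomes the input of the next.
void run_pingpong(float *d_a, float *d_b, int n, int steps) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    for (int s = 0; s < steps; ++s) {
        step_kernel<<<blocks, threads>>>(d_a, d_b, n);
        // Swap the roles of the two buffers for the next pass.
        float *tmp = d_a; d_a = d_b; d_b = tmp;
    }
    cudaDeviceSynchronize();
}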
For example, the Fast Fourier Transform (FFT) algorithm can be implemented using shaders and the render-to-texture technique. Here is an example of a GPU-based FFT implementation from GPU Gems 2, along with its application to medical image reconstruction.
Here is how a fragment shader for a single FFT pass looks. It is similar to the CUDA kernel you would write today, as shown below. It is essentially a function that is invoked for each pixel of the output texture. It reads data from the input textures, performs some computation, and writes the result as the color of the pixel, which is then stored in the output texture.
void FragmentProgram(in float4 TexCoordRect : TEXCOORD0,
out float4 sColor0 : COLOR0,
out float4 sColor1 : COLOR1,
out float4 sColor2 : COLOR2,
out float4 sColor3 : COLOR3,
uniform samplerRECT Real1,
uniform samplerRECT Imag1,
uniform samplerRECT Real2,
uniform samplerRECT Imag2,
uniform samplerRECT ButterflyLookupI,
uniform samplerRECT ButterflyLookupWR,
uniform samplerRECT ButterflyLookupWI)
{
// Read in the butterfly indices (scrambling coordinates)
float4 i = texRECT(ButterflyLookupI, TexCoordRect.xy);
// Read in the real parts of the weights (twiddle factors)
float4 WR = texRECT(ButterflyLookupWR, TexCoordRect.xy);
// Read in the imaginary parts of the weights
float4 WI = texRECT(ButterflyLookupWI, TexCoordRect.xy);
// Perform the butterfly operation, storing results in the output colors
float2 Res;
float2 r1 = float2(i.x, TexCoordRect.y);
float2 r2 = float2(i.w, TexCoordRect.y);
float4 InputX1 = texRECT(Real1, r1);
float4 InputY1 = texRECT(Imag1, r1);
float4 InputX2 = texRECT(Real1, r2);
float4 InputY2 = texRECT(Imag1, r2);
Res.x = WR.x * InputX2.x - WI.x * InputY2.x;
Res.y = WI.x * InputX2.x + WR.x * InputY2.x;
sColor0.x = InputX1.x + Res.x;
sColor1.x = InputY1.x + Res.y;
float4 InputX1_ = texRECT(Real2, r1);
float4 InputY1_ = texRECT(Imag2, r1);
float4 InputX2_ = texRECT(Real2, r2);
float4 InputY2_ = texRECT(Imag2, r2);
Res.x = WR.x * InputX2_.x - WI.x * InputY2_.x;
Res.y = WI.x * InputX2_.x + WR.x * InputY2_.x;
sColor2.x = InputX1_.x + Res.x;
sColor3.x = InputY1_.x + Res.y;
}
The code above is written in the Cg language, an early attempt by NVIDIA to ~~monopolize the graphics computing market~~ make shader programming more convenient. Luckily, nobody cared much about it, and the market relied on the more universally supported GLSL and HLSL languages.
I was fascinated by these developments! This technique unlocked a remarkable number of new applications in computer graphics, science, and the medical field, among others. Personally, I've used it to implement advanced graphics effects. Here is an example of using FFT to generate a complex water surface. This technique was used in the movie Titanic and in some advanced games like Assassin's Creed.
Realistic ocean surface rendering. The wave geometry was computed via a mathematical model that required performing a large 2D IFFT, which was implemented using shaders and a render-to-texture technique entirely on a GPU.
Are any of those articles worth reading? Of course not. I just wanted to show how I used the Web Archive to recover some old articles that are no longer available online, and to add a meme image to the post.
Enter the CUDA
Although the technique of using shaders for general-purpose computations was quite powerful, it was still somewhat limited. The programming model was not very friendly, as you had to encode your data as textures or other graphics primitives. The render-to-texture approach involves drawing a screen-aligned rectangle and making sure that all rendered pixels align precisely with the texels of the output texture. It was easy to misconfigure the graphics pipeline, such as forgetting to turn off texture filtering, which would lead to incorrect results.
All of these details were quite distracting and made it hard to focus on the actual computation, especially for non-graphics programmers. Thus, NVIDIA introduced CUDA in 2007, which provided a C-like programming model for writing general-purpose computations on NVIDIA GPUs.
The programming model is similar to the shader programming model, as you still write a kernel function that is executed in parallel by many threads. Each thread is identified by its 1D, 2D, or 3D index, which you can use to compute the memory address of the data you want to process. In the shader programming model, you would do that using texture coordinates or other varying variables, while in CUDA you use thread indices. However, all the scaffolding of setting up the graphics pipeline, managing textures, framebuffers, and so on, is gone. You simply allocate memory on the GPU, copy data to it, launch a kernel, and copy the results back.
Here is how the FFT kernel from above would look in CUDA. Again, feel free to skip if you're here for the story.
// Helper function to perform a complex multiply and add operation.
__device__ float2 butterfly_op(float2 a, float2 b, float2 twiddle) {
// Perform complex multiplication and addition
float2 temp_result;
temp_result.x = b.x * twiddle.x - b.y * twiddle.y;
temp_result.y = b.y * twiddle.x + b.x * twiddle.y;
return make_float2(a.x + temp_result.x, a.y + temp_result.y);
}
__global__ void fft_stage_kernel(
// Input data arrays (now using float2 for complex numbers)
float2 *d_input1,
float2 *d_input2,
// Combined butterfly lookup tables (now float2 for complex twiddle factors)
float *d_butterflyLookupI,
float2 *d_butterflyTwiddles,
// Output data arrays (now using float2)
float2 *d_out1,
float2 *d_out2,
int width,
int height
) {
int tx = blockIdx.x * blockDim.x + threadIdx.x;
int ty = blockIdx.y * blockDim.y + threadIdx.y;
if (tx >= width || ty >= height) {
return;
}
int index = ty * width + tx;
// Read butterfly lookup index and complex twiddle factor
int lookup_i = (int)d_butterflyLookupI[index];
float2 twiddle_factor = d_butterflyTwiddles[index];
// Read input data using combined float2 arrays
int r1_idx = ty * width + tx;
int r2_idx = ty * width + lookup_i;
float2 input1 = d_input1[r1_idx];
float2 input2 = d_input1[r2_idx];
// Perform the butterfly operation for the first pair of inputs
d_out1[index] = butterfly_op(input1, input2, twiddle_factor);
// Process the second pair of data arrays
float2 input1_prime = d_input2[r1_idx];
float2 input2_prime = d_input2[r2_idx];
// Perform the second butterfly operation
d_out2[index] = butterfly_op(input1_prime, input2_prime, twiddle_factor);
}
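For completeness, here is a rough sketch of the host-side code that would allocate the buffers and launch this kernel; the sizes, block shape, and buffer names are made up for illustration, and error handling plus the actual data uploads are omitted. A full FFT would run a kernel like this once per stage, ping-ponging between the input and output buffers, much like the shader version did with textures.
#include <cuda_runtime.h>

int main() {
    const int width = 1024, height = 1024;
    const size_t n = (size_t)width * height;

    // Allocate input/output buffers and lookup tables on the GPU.
    float2 *d_input1, *d_input2, *d_out1, *d_out2, *d_twiddles;
    float *d_lookupI;
    cudaMalloc(&d_input1, n * sizeof(float2));
    cudaMalloc(&d_input2, n * sizeof(float2));
    cudaMalloc(&d_out1, n * sizeof(float2));
    cudaMalloc(&d_out2, n * sizeof(float2));
    cudaMalloc(&d_twiddles, n * sizeof(float2));
    cudaMalloc(&d_lookupI, n * sizeof(float));

    // ... upload the input data and precomputed lookup tables with cudaMemcpy ...

    // Launch one FFT stage: one thread per element, organized into 16x16 blocks.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fft_stage_kernel<<<grid, block>>>(d_input1, d_input2, d_lookupI, d_twiddles,
                                      d_out1, d_out2, width, height);

    // ... copy the results back with cudaMemcpy(..., cudaMemcpyDeviceToHost) ...
    cudaDeviceSynchronize();
    return 0;
}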
Once again, I was waiting to get my hands on a new GPU - this time, one that supported CUDA. I was earning money by then, so there was no need to trick my parents anymore, but high-end PC upgrades were still a considerable expense, and you needed to do them often. My first CUDA-capable GPU was the 8800 GT, a card from the most legendary series of all time. The series was built on an entirely new unified architecture and introduced CUDA. In addition, a single 8800 GTX card was able to outperform two previous-generation 7900 GTX cards in SLI, at comparable power consumption and price ($599 - hold back your tears). When will we see such leaps in performance and value again, Mr. Leather-jacket CEO?
![An entry-level GPU in 2030 with an MSRP of $8799](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rvy9ahmcnorqzqrix22n.webp)
CUDA Moat?
As a true open-source warrior, I did not use CUDA and relied on OpenCL instead for my work. It was not as well-supported as CUDA: debuggers and other tools were not so advanced, there were more glitches, and you could get slightly better performance out of CUDA on NVIDIA hardware. However, its drawbacks were outweighed by the fact that it was an open standard and worked on both AMD and Intel GPUs, so CUDA was far from being a monopoly at that time.
At my job, I was using OpenCL to implement an algorithm for real-time 3D scanning. The Artec Eva is a professional 3D scanner used for medical or industrial applications. Real-time 3D scanning involves a significant amount of GPU computation to process the input video stream, identify your position with respect to the environment (similar algorithms are employed as in self-driving cars), fuse all the input data into a single 3D model, and display it on the screen. All of this had to happen in real-time, so the user could see the result immediately and adjust their position if needed.
I opted for OpenCL, which was a brave choice back then and possibly a bad product decision at the time: when you buy a $12,000 3D scanner, you can afford a decent GPU and don't need to worry about vendor lock-in. However, over time, as GPUs became more powerful and it became possible to run the pipeline on a laptop GPU, specifically a Microsoft Surface tablet, the choice of OpenCL paid off. Now an operator had a lightweight display in their hands and could walk around the object being scanned. At least, this is what I tell myself to feel better about my choice 😅
Real-time scanning of a 3D object with an Artec scanner. Scanner localization, data fusion, and visualization are performed in real-time on a GPU using OpenCL.
In addition to OpenCL, there were many other hardware-agnostic GPGPU frameworks to choose from, including Halide, ArrayFire, and Numba. So, all things considered, the open-source and open-standard ecosystem was a fair contender to CUDA back then, and CUDA hasn't had its moat yet.
Deep Learning Revolution
The new GPU programming capabilities unlocked by CUDA and OpenCL have enabled numerous new applications in computer graphics, science, and medicine, among other fields. However, the popularization of deep learning (that is what we called AI before ChatGPT came along) is arguably the most notable outcome.
Many think that thanks to AI, GPUs have become the central compute platform. In fact, it is the other way around: thanks to GPUs, we have AI in the first place. Deep convolutional neural networks had been known since the 90s. In 2012, a graduate student, Alex Krizhevsky, motivated by Ilya Sutskever and working under the guidance of Geoffrey Hinton, trained a deep convolutional neural network on a couple of GeForce GPUs to enter the ImageNet challenge. The model was called AlexNet, and the dataset consisted of 1.2 million images belonging to 1000 categories.
The results? AlexNet obliterated the state-of-the-art computer vision models of the time, demonstrating a whopping 9.4% improvement in accuracy over the previous state of the art. This was a game-changer. It triggered the deep learning revolution, in which breakthroughs in computer vision, natural language processing, and other fields were achieved using deep learning models trained on GPUs.
The Array Programming Model
GPU computing has caused great upheaval in the machine learning field, while the latter has retaliated by drastically changing the way we program GPUs. The programming model has shifted from writing kernels that operate on individual elements of an array to writing code that operates on entire arrays (tensors) at once.
The reason for this is that deep learning frameworks like TensorFlow and PyTorch were inspired not by graphics programming, but by scientific computing frameworks like NumPy and MATLAB. The programming model differs significantly from those of CUDA and OpenCL. Instead of writing kernels that operate on individual elements of an array, you write code that operates on entire arrays (tensors) at once. The framework breaks down the operations into smaller pieces that can be executed in parallel on the GPU. This programming model, known as array programming, dates back to the 1960s and the development of languages like APL and Fortran.
I am skipping Caffe, the first popular declarative deep learning framework. It was suitable for defining a large number of models, but it was not appropriate for expressing arbitrary computations on tensors.
This programming model has one tremendous advantage: it is much easier to reason about the code, as you don't have to think about how to parallelize the computation. You write code that operates on entire arrays, and the framework takes care of the rest. It made GPU programming accessible to a much wider audience, as you no longer had to be a GPU programming expert to write code that runs on the GPU. It is so convenient that many GPU programming experts, myself included, have switched to using these frameworks for their work. They allow you to express your ideas much more concisely and focus on the problem at hand, rather than the intricacies of GPU programming. Additionally, frameworks like PyTorch and TensorFlow come with an automatic differentiation engine, which computes the gradients of your functions automatically. This is especially useful for training neural networks, but it is handy in other domains as well.
Here is a simple numpy program. Even without knowing numpy, you can figure out what it does. It creates a couple of arrays, performs some basic operations on them, and prints the results.
import numpy as np
# Create a 1-dimensional array from a Python list
array1d = np.array([1, 2, 3, 4, 5])
# Create a 2-dimensional array (matrix)
array2d = np.array([[10, 20, 30], [40, 50, 60]])
# Element-wise addition
sum_array = array1d + 5
# Element-wise multiplication
product_array = array1d * 2
# Sum of all elements in an array
total_sum = np.sum(array1d)
# Mean of elements in an array
mean_value = np.mean(array1d)
# Accessing elements
print("First element of array1d:", array1d[0])
print("Element at row 0, column 1 of array2d:", array2d[0, 1])
Why Is GPU Programming Hard?
With the convenience of the array programming model comes a significant drawback: it is hard to optimize the code for performance. To understand why, we first need to consider why it is hard to optimize code for GPUs in the first place.
There are several reasons why GPU programming is a complex task, but the primary one is that GPUs are heavily restricted by memory bandwidth, so GPU architects have introduced numerous complex mechanisms to hide the latency of memory accesses and maximize the utilization of the available bandwidth. Developers need to understand these mechanisms and write code that leverages them. This is not an easy task, as it requires a deep understanding of the GPU architecture and the specific details of the memory hierarchy.
Think about the following example. The most powerful CPU at the time of writing is AMD EPYC 9965. It offers a whopping 192 cores and 384 threads. The per-socket memory bandwidth is about 614 GB/s. However, its number of cores pales in comparison with the most powerful GPU, which is NVIDIA B200 at the time of writing. It offers 16,896 CUDA cores and up to 8TB/s of memory bandwidth per GPU.
Now you can see the problem: each CPU core has about 3.2 GB/s of memory bandwidth available to it, while each GPU core has only about 0.47 GB/s. The situation with consumer GPUs is even worse, e.g., the RTX 5090 has 21,760 CUDA cores and 1,792 GB/s of memory bandwidth, which gives only about 0.082 GB/s per core. This means that to keep its cores busy and hide the latency of memory accesses, a GPU must perform significantly more computations per memory access than a CPU.
The relationship between compute power and memory bandwidth in the GPU computing world is referred to as the ALU-to-memory ratio, which represents the number of operations a GPU core can perform per memory access. For GPUs, this ratio is much higher than for CPUs. It can be dozens or even hundreds of operations per memory access.
The same problem exists on all other parallel computing platforms, such as TPUs, neural processors, and FPGAs: the memory bandwidth per processing unit is always much lower than that of a CPU core. Between 2017 and 2022, I was optimizing neural network inference at Apple for their custom neural processors. We shipped models such as Animoji, FaceID, and Portrait mode, as well as numerous models that run on Apple Vision Pro. For each of these models, we had to ensure there was no swapping of data between the on-chip memory and DRAM, as memory bandwidth was the main bottleneck.
To work around this limitation, GPUs employ several techniques. One is shared memory - a small amount of fast on-chip memory shared among a group of threads. It allows threads to cooperate and share data without touching global memory, which is significantly slower. Another technique is memory coalescing: when threads access contiguous memory locations, the GPU can fetch multiple data elements in a single memory transaction, minimizing the total number of transactions. GPU cores also have access to more registers than CPU cores, which can be used to store intermediate data. However, the register file is shared among the threads resident on a core, so if each thread uses too many registers, fewer threads can run concurrently and the hardware is underutilized.
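To make these terms a bit more concrete, here is a minimal sketch (my own illustration, not code from any library) of a dot-product kernel that uses both tricks: consecutive threads read consecutive elements, so the loads are coalesced, and per-thread partial sums are combined in shared memory, so only a single value per block ever travels back to global memory. It assumes 256 threads per block; the per-block results would then be summed on the CPU or by a second tiny kernel.
// Dot product of A and B: coalesced loads + shared-memory reduction.
// Launch with 256 threads per block; block_sums receives one partial result per block.
__global__ void dot_product(const float *A, const float *B, float *block_sums, int N) {
    __shared__ float partial[256];                 // one slot per thread in the block
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Grid-stride loop: neighboring threads touch neighboring addresses,
    // so every load is coalesced into wide memory transactions.
    float sum = 0.0f;
    for (int i = idx; i < N; i += gridDim.x * blockDim.x) {
        sum += A[i] * B[i];
    }
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory: the data never leaves the core until the end.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = partial[0]; // one value per block goes back to DRAM
}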
Enough complex terms! If you take away one thing from this post, let it be this: the most effective way to optimize a GPU program is to perform more computations per memory access. In other words, make sure data doesn't leave the GPU core for as long as possible. Let's pin this and come back to the array programming model and the performance issues it introduces.
I Love PyTorch! What Could Possibly Be Wrong with It?
Let's take a look at how a simple CUDA kernel to perform an array operation like `A*B + C` would look. Here, `A`, `B`, and `C` are large arrays (tensors), and the operation is performed element-wise, e.g., `[1, 2, 3] * [2, 2, 2] + [1, 1, 1] = [3, 5, 7]`.
__global__ void array_op(const float *A, const float *B, const float *C, float *D, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
D[idx] = A[idx] * B[idx] + C[idx];
}
}
This kernel is straightforward. Each thread computes a single element of the output array `D` by reading the corresponding elements from the input arrays `A`, `B`, and `C`.
Now, let's take a look at how the same operation would look in PyTorch.
import torch
A = torch.randn(1000000, device='cuda')
B = torch.randn(1000000, device='cuda')
C = torch.randn(1000000, device='cuda')
D = A * B + C
If you naively translate PyTorch operations like element-wise multiplication and addition to CUDA, which is how it is actually done in practice, you get two kernels: one for multiplication and one for addition. The runtime launches a kernel to perform the element-wise multiplication, stores the result in a temporary array `E`, and then launches another kernel to perform the element-wise addition of `E` and `C` to produce the final tensor `D`.
__global__ void array_mul(const float *A, const float *B, float *E, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
E[idx] = A[idx] * B[idx];
}
}
__global__ void array_add(const float *E, const float *C, float *D, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
D[idx] = E[idx] + C[idx];
}
}
You can see the problem now:
- In the fused kernel, each element makes a single round trip through memory: we read `A`, `B`, and `C`, perform both operations (multiplication and addition) in registers, and write `D`. That is four global memory accesses per element for two arithmetic operations.
- In the PyTorch version, the intermediate result `E` makes an extra round trip: the first kernel reads `A` and `B` and writes `E`, and the second reads `E` and `C` and writes `D`. That is six global memory accesses per element for the same two arithmetic operations.
Given that our program is completely memory-bound, the unfused PyTorch version will run roughly 1.5x slower than the fused CUDA kernel, because it performs fewer operations per memory access. Every additional unfused element-wise operation adds yet another round trip of intermediate data through global memory, making things progressively worse.
You may wonder: can't we generate a single fused kernel that performs both operations in one pass? The answer is yes, we can. In fact, both PyTorch and TensorFlow have mechanisms to do that. However, this is not an easy problem to solve in a general way. PyTorch officially supports more than 1200 operations on tensors, and the number of possible combinations of these operations is astronomical. Many of them are not even element-wise, e.g., matrix multiplications, convolutions, reductions, and so on. For PyTorch, it is especially difficult, as it is a dynamic framework, i.e., the computation graph is built on the fly as the code is executed. This makes it challenging to analyze the entire computation graph and determine which operations can be fused.
This problem remains unsolved in a general way to date, as you'll see when we discuss Flash Attention in the context of LLM inference.
NVIDIA Domination
The Deep Learning revolution has dramatically changed the GPU programming landscape. The array programming model has made GPU programming accessible to a much wider audience, as you don't have to be a GPU programming expert to write code that runs on the GPU. However, it has also introduced new challenges, such as optimizing memory access patterns and fusing operations to achieve optimal performance.
This has created a strong moat for NVIDIA. Although CUDA was just one of the many GPGPU frameworks available at the time, the CUDA ecosystem had a great deal more to offer the community. For example, it had CUDNN, a highly optimized library for deep learning primitives such as convolutions, pooling, and normalization. This library was used by all major deep learning frameworks like TensorFlow and PyTorch to achieve good performance on NVIDIA GPUs. Additionally, NVIDIA has invested heavily in optimizing its hardware for deep learning workloads, for example, by introducing Tensor Cores, which are specialized hardware units designed for performing matrix multiplications and convolutions.
In the deep learning age, NVIDIA GPUs became the de facto standard for deep learning workloads. All major deep learning frameworks like PyTorch and TensorFlow were built on top of CUDNN, initially not even offering the option to use other backends like OpenCL or ROCm. All research was done on NVIDIA hardware, as it was the only hardware supported by the tools researchers were using. This created a strong network effect: everyone was using NVIDIA hardware, so everyone was optimizing their code for NVIDIA hardware, which made NVIDIA hardware even more attractive.
From 2010 to the present, I have exclusively owned NVIDIA GPUs. Even though some AMD models were offering more value, the need to be able to perform AI-related work has always steered me into the Team Green camp.
Ironically, as innovative as CUDA was, the moat was created not by CUDA itself, but by the army of NVIDIA engineers who have optimized CUDNN and other libraries for deep learning workloads. There was simply no good algorithm to optimize computational graphs in a general way, so NVIDIA engineers have hand-optimized the most common patterns that appear in deep learning workloads.
There have been many attempts to come up with an automatic way to optimize computational graphs, or at least a universal, hardware-agnostic AI stack that makes the optimization process easier: XLA from Google, TVM from the Apache Foundation, MLIR from LLVM, and MAX from Modular AI. However, none of these attempts has managed to beat hand-optimized libraries like CUDNN on NVIDIA hardware across a large enough number of real-world use cases and establish a strong enough network effect.
The AI Era - Bigger is Better
History doesn't repeat itself, but it often rhymes. The computational power of GPUs triggered the deep learning revolution: we used the same algorithms that had been known since the 90s, but now we could train much larger models on much larger datasets. The same thing happened with LLMs. The transformer architecture had been known since 2017, but it was only around 2020 that we saw the first truly large-scale transformer models like GPT-3. The reason is that training these models requires an enormous amount of computational power and memory bandwidth. OpenAI trained GPT-3 on a cluster of about 10,000 GPUs. The model has 175 billion parameters and was trained on a dataset of 570GB of text data. The training process took several weeks and cost several million dollars (and probably raised the global temperature by a degree or so).
How did AI affect the GPU programming landscape? Not much, actually. The same array programming model is used for training and inference of LLMs. The same challenges of optimizing memory access patterns and fusing operations to achieve good performance still exist. However, the scale of the models has increased dramatically, which has introduced new challenges, like distributing the model across multiple GPUs and optimizing communication between GPUs.
The large scale of the models has also introduced new challenges for inference. The models are so large that they don't fit into the memory of a single GPU. For example, GPT-3's 175 billion parameters take about 350GB of memory in half precision (700GB in full 32-bit floats), which is much larger than the memory of even the most powerful GPUs available today. This has led to the development of techniques such as model parallelism, where the model is split across multiple GPUs, and pipeline parallelism, where different parts of the model are executed on separate GPUs in a pipelined manner. A sketch of the first idea follows below.
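Here is a toy sketch of model parallelism (my own illustration, with made-up sizes): a weight matrix is split row-wise across two GPUs, each GPU multiplies its slice by the input vector, and the host gathers the pieces of the result. For brevity this loop drives the GPUs one after another; a real implementation would run them concurrently and exchange activations directly with a library like NCCL instead of bouncing everything through the host.
#include <cuda_runtime.h>
#include <vector>

// Each GPU owns a horizontal slice of the weight matrix W (rows_per_gpu x cols)
// and computes its slice of the output y = W * x.
__global__ void matvec_slice(const float *W, const float *x, float *y, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float sum = 0.0f;
    for (int c = 0; c < cols; ++c) sum += W[r * cols + c] * x[c];
    y[r] = sum;
}

int main() {
    const int rows = 4096, cols = 4096, n_gpus = 2;
    const int rows_per_gpu = rows / n_gpus;
    std::vector<float> h_x(cols, 1.0f), h_W((size_t)rows * cols, 0.01f), h_y(rows);

    for (int g = 0; g < n_gpus; ++g) {
        cudaSetDevice(g);
        float *d_W, *d_x, *d_y;
        cudaMalloc(&d_W, (size_t)rows_per_gpu * cols * sizeof(float));
        cudaMalloc(&d_x, cols * sizeof(float));
        cudaMalloc(&d_y, rows_per_gpu * sizeof(float));

        // Each GPU receives only its slice of the weights, plus the full input vector.
        cudaMemcpy(d_W, h_W.data() + (size_t)g * rows_per_gpu * cols,
                   (size_t)rows_per_gpu * cols * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_x, h_x.data(), cols * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256, blocks = (rows_per_gpu + threads - 1) / threads;
        matvec_slice<<<blocks, threads>>>(d_W, d_x, d_y, rows_per_gpu, cols);

        // Gather this GPU's slice of the result back on the host.
        cudaMemcpy(h_y.data() + g * rows_per_gpu, d_y,
                   rows_per_gpu * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_W); cudaFree(d_x); cudaFree(d_y);
    }
    return 0;
}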
The Case of Flash Attention
Surprisingly, after all these years, the problem of optimizing memory access patterns and fusing operations to achieve good performance is still not solved in a general way. Let's take a look at a specific example of this problem in the context of LLM inference.
One of the most important operations in transformer models is the attention mechanism, which allows the model to focus on different parts of the input sequence when making predictions. It is implemented as a series of matrix multiplications and softmax operations (see the rightmost diagram in the image below).
The softmax operation involves computing the exponential of each element in a row of the score matrix, summing these exponentials up, and then dividing each element by that sum.
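To make this concrete, here is a minimal (and deliberately naive) CUDA sketch of a row-wise softmax: one thread per row, three passes over the data. This is not how a production kernel would parallelize the work, but it makes explicit how many times each value has to be read from or written to memory. The subtraction of the row maximum is the standard numerical-stability trick and does not change the result.
// Naive row-wise softmax: out[r][i] = exp(in[r][i]) / sum_j exp(in[r][j]).
__global__ void row_softmax(const float *in, float *out, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    const float *x = in + (size_t)row * cols;
    float *y = out + (size_t)row * cols;

    // Pass 1: find the row maximum (for numerical stability).
    float m = x[0];
    for (int i = 1; i < cols; ++i) m = fmaxf(m, x[i]);

    // Pass 2: exponentiate and accumulate the sum of the row.
    float sum = 0.0f;
    for (int i = 0; i < cols; ++i) { y[i] = expf(x[i] - m); sum += y[i]; }

    // Pass 3: divide every element by the sum.
    for (int i = 0; i < cols; ++i) y[i] /= sum;
}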
Looks challenging to optimize, right? How can we reduce the number of memory accesses here? The naive implementation would involve reading the input matrices from memory, multiplying them together, storing the result in a temporary matrix, reading the temporary matrix from memory, computing the exponential of each element, summing them up, and then dividing each element by the sum. This would involve a lot of memory accesses and would be very slow. And it is slow!
However, I mentioned previously that GPUs come with a bit of fast on-chip memory called shared memory (SRAM in hardware terms - static random-access memory). It is a small amount of memory shared by a block of threads running on the same GPU core. This memory is much faster than the global memory (GDDR or HBM) and can be used to store intermediate results. The original Flash Attention was developed and benchmarked on the A100, which has up to 80GB of HBM and 192KB of combined L1/shared memory per SM. The SRAM bandwidth is about 19TB/s, while the HBM bandwidth is about 1.5-2.0TB/s.
The authors of Flash Attention devised a way to partition the computation so that intermediate results fit into shared memory, enabling the entire attention computation to be performed with far fewer trips to global memory. The input matrices are split into smaller tiles, the calculations (matrix multiplications and softmax) are carried out tile by tile, and only the final results are streamed back to global memory. The tricky part is the softmax, which normally needs an entire row of scores before it can normalize anything; Flash Attention sidesteps this with an online softmax that keeps a running maximum and a running sum and rescales the partial results as new data arrives, as sketched below. The result is a significant speedup over the naive implementation.
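Here is a heavily simplified sketch of that idea (my own illustration, not the actual Flash Attention kernel): one thread handles one query row, walks over the keys, and maintains a running maximum, a running denominator, and a running output accumulator in registers, rescaling them whenever a new maximum shows up. The real implementation processes keys and values in tiles staged in shared memory and parallelizes across many threads per row, but the online-softmax bookkeeping is the same.
// Simplified online-softmax attention: O = softmax(Q K^T / sqrt(d)) V, one thread per query row.
// Nothing intermediate is ever written to global memory.
__global__ void attention_one_pass(const float *Q, const float *K, const float *V,
                                   float *O, int n, int d) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    const float *q = Q + (size_t)row * d;
    float m = -1e30f;   // running maximum of the scores (effectively minus infinity)
    float l = 0.0f;     // running sum of exp(score - m)
    float acc[128];     // running weighted sum of V rows (assumes head dimension d <= 128)
    for (int j = 0; j < d; ++j) acc[j] = 0.0f;

    for (int k = 0; k < n; ++k) {
        // Attention score between this query and key k.
        float s = 0.0f;
        for (int j = 0; j < d; ++j) s += q[j] * K[(size_t)k * d + j];
        s *= rsqrtf((float)d);

        // Online softmax update: if a new maximum appears, rescale the old state.
        float m_new = fmaxf(m, s);
        float scale = expf(m - m_new);   // rescales everything accumulated so far
        float p = expf(s - m_new);       // weight of the current key
        l = l * scale + p;
        for (int j = 0; j < d; ++j)
            acc[j] = acc[j] * scale + p * V[(size_t)k * d + j];
        m = m_new;
    }

    // Normalize and write the final output row.
    for (int j = 0; j < d; ++j)
        O[(size_t)row * d + j] = acc[j] / l;
}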
Conclusion
The GPU programming landscape has changed dramatically over the past two decades. The introduction of CUDA and OpenCL has made GPU programming accessible to a much wider audience and triggered the deep learning revolution, which in turn changed the way we program GPUs. The array programming model has made it easier to write code that runs on the GPU, but it has also introduced new challenges, such as optimizing memory access patterns and fusing operations to achieve optimal performance.
Now that you're a certified GPU programming expert, enjoy the last meme and get your GPUs cranking!