<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dmitry Trifonov</title>
    <description>The latest articles on DEV Community by Dmitry Trifonov (@novibecoding).</description>
    <link>https://dev.to/novibecoding</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3181089%2F0838ce1f-c589-40b8-ba8b-339fe8f1bddf.jpeg</url>
      <title>DEV Community: Dmitry Trifonov</title>
      <link>https://dev.to/novibecoding</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/novibecoding"/>
    <language>en</language>
    <item>
      <title>Evolution of GPU Programming</title>
      <dc:creator>Dmitry Trifonov</dc:creator>
      <pubDate>Wed, 03 Sep 2025 18:02:42 +0000</pubDate>
      <link>https://dev.to/novibecoding/evolution-of-gpu-programming-3o94</link>
      <guid>https://dev.to/novibecoding/evolution-of-gpu-programming-3o94</guid>
      <description>&lt;h3&gt;
  
  
  From Smart Pixels to the Backbone of an AI-driven World
&lt;/h3&gt;

&lt;p&gt;Every decade, GPUs have reinvented themselves - from drawing triangles to generating worlds and, now, reasoning with language. I have realized that throughout my entire programming journey, I have been working closely with GPUs and have tried countless ways to program them: writing pixel shaders in GLSL, implementing real-time 3D scanning algorithms in OpenCL, and optimizing deep learning models in PyTorch and TensorFlow. So what better way to share my experience than a blog post about the evolution of GPU programming, full of &lt;strong&gt;nostalgia and memes&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;A lot has changed in the GPU programming landscape over the years: new programming models, new frameworks, and new hardware architectures have emerged. There is little practical reason to study the older approaches today; the evolutionary path itself, however, is quite interesting. If you're an AI expert or a developer in another field, it can broaden your expertise or give you the inspiration to dive into the world of GPU programming. It can also spark new ideas for current problems, especially since some of the issues we face in AI today were already faced by graphics programmers 25 years ago. If you are a GPU programming veteran, or not into programming at all - enjoy the story and the memes.&lt;/p&gt;

&lt;p&gt;Here is a mildly entertaining, nostalgia-fueled journey through the history of GPU programming, from making brick walls look bumpy in 2000 to optimizing attention mechanisms in LLMs in 2025. Feel free to skip the code snippets if you're not interested in programming, or if you already know the material and would rather enjoy the story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyts846qahc9uy70va0a9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyts846qahc9uy70va0a9.jpg" alt=" " width="519" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Smart Pixels
&lt;/h2&gt;

&lt;p&gt;In the early 2000s, GPUs were used exclusively for visualization, and the rendering pipeline was completely fixed-function. It was akin to HTML: you would declare your scene - geometry, textures, positions of lights and camera - and the GPU would take care of rendering it. You could, of course, customize the result on the fly, but only in a limited way, by changing parameters of predefined functions, and this customization happened entirely on the CPU side.&lt;/p&gt;

&lt;p&gt;Here is a simple example of rendering a triangle using old-school OpenGL, taken from &lt;a href="https://cs.lmu.edu/~ray/notes/openglexamples/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Set every pixel in the frame buffer to the current clear color.&lt;/span&gt;
&lt;span class="n"&gt;glClear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GL_COLOR_BUFFER_BIT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Drawing is done by specifying a sequence of vertices. The way these&lt;/span&gt;
&lt;span class="c1"&gt;// vertices are connected. GL_POLYGON constructs a filled polygon.&lt;/span&gt;
&lt;span class="n"&gt;glBegin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GL_POLYGON&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;glColor3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;glVertex3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;glColor3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;glVertex3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;glColor3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;glVertex3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;glEnd&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Flush drawing command buffer to make drawing happen as soon as possible.&lt;/span&gt;
&lt;span class="n"&gt;glFlush&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foot5wu48v8788rv9ftm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foot5wu48v8788rv9ftm0.png" alt="Rendering a triangle with OpenGL" width="400" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The idea that you can actually program how pixels are rendered on the screen was quite revolutionary in the early 2000s.&lt;/p&gt;

&lt;p&gt;And my first interaction with these ideas was through &lt;a href="https://gamedev-ru.translate.goog/code/articles/?id=4155&amp;amp;_x_tr_sl=ru&amp;amp;_x_tr_tl=en&amp;amp;_x_tr_hl=en&amp;amp;_x_tr_pto=wapp" rel="noopener noreferrer"&gt;this article&lt;/a&gt; from 2001 on a popular Russian game-development website about the &lt;a href="https://registry.khronos.org/OpenGL/extensions/NV/NV_register_combiners.txt" rel="noopener noreferrer"&gt;NV_register_combiners&lt;/a&gt; extension for OpenGL. Surprisingly, the article is still available online.&lt;/p&gt;

&lt;p&gt;This extension enabled you to program how the final color of a pixel is computed from various inputs, such as texture colors and lighting, allowing you to create more complex visual effects. This computation is performed on the GPU, enabling real-time performance. It was akin to running a small assembly program on the GPU for each pixel being rendered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42jt1gv4of4n6xmdfkcf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42jt1gv4of4n6xmdfkcf.png" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Graphics developers were fascinated by this idea, as it let them dramatically increase the visual fidelity of their scenes. Shortly after, GLSL was conceptualized and formally introduced in 2004, allowing developers to write more complex shaders (small programs that define how geometry or pixels are processed) in a C-like language.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are you feeling GPU poor? Imagine that it was even worse back then! Every new generation of GPUs introduced new features and capabilities, dramatically increasing the visual fidelity of games. Having a new GPU was a prerequisite for playing the latest and greatest games. For those into computer graphics, the frustration of the wait and the excitement of getting the new card were doubled! Luckily, I could trick my parents into buying me a new card, because it supported SHADERS! Which, of course, were essential to advance my computer science education. Having the ability to play Oblivion on high settings was just a nice bonus.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79fe4vh9vlj75qivhytu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79fe4vh9vlj75qivhytu.webp" alt="The " width="800" height="911"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of a simple GLSL program from &lt;a href="https://www.rastertek.com/gl4linuxtut20.html" rel="noopener noreferrer"&gt;rastertek.com&lt;/a&gt; that performs bump mapping - an effect achieved by perturbing a surface's normals using a texture to simulate small-scale bumps and wrinkles on the surface of an object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight glsl"&gt;&lt;code&gt;&lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kt"&gt;vec2&lt;/span&gt; &lt;span class="n"&gt;texCoord&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kt"&gt;vec3&lt;/span&gt; &lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kt"&gt;vec3&lt;/span&gt; &lt;span class="n"&gt;tangent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kt"&gt;vec3&lt;/span&gt; &lt;span class="n"&gt;binormal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Sample the pixel color from the texture using the sampler at this texture coordinate location.&lt;/span&gt;
    &lt;span class="kt"&gt;vec4&lt;/span&gt; &lt;span class="n"&gt;textureColor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shaderTexture1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;texCoord&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Sample the pixel from the normal map.&lt;/span&gt;
    &lt;span class="kt"&gt;vec4&lt;/span&gt; &lt;span class="n"&gt;bumpMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shaderTexture2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;texCoord&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Expand the range of the normal value from (0, +1) to (-1, +1).&lt;/span&gt;
    &lt;span class="kt"&gt;vec3&lt;/span&gt; &lt;span class="n"&gt;bumpMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpMap&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Calculate the normal from the data in the normal map.&lt;/span&gt;
    &lt;span class="n"&gt;bumpNormal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tangent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;binormal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Normalize the resulting bump normal.&lt;/span&gt;
    &lt;span class="n"&gt;bumpNormal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpNormal&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Calculate the amount of light on this pixel based on the normal map value.&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;lightIntensity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpNormal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lightDirection&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Determine the final amount of diffuse color based on the diffuse color combined with the light intensity.&lt;/span&gt;
    &lt;span class="n"&gt;outputColor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;diffuseLightColor&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;lightIntensity&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Combine the final light color with the texture color.&lt;/span&gt;
    &lt;span class="n"&gt;outputColor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputColor&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;textureColor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;What do all these &lt;code&gt;in vec3&lt;/code&gt; variables mean? They are the inputs to the shader program. They are specified per vertex and interpolated across the surface of the triangle being rendered. The interpolation is done by the GPU hardware, and the result is fed into the shader program for each pixel being rendered, so each pixel can receive different values, enabling more complex effects. It also allows the computation to be parallelized across all the pixels being rendered, as each pixel can be processed independently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Shaders quickly progressed from simple pixel color manipulation to complex effects simulating shadows, reflections, and refractions. Graphics programmers were especially obsessed with simulating rich surface detail without increasing the geometric complexity of the scene. The deepest point of this rabbit hole was the &lt;a href="https://learnopengl.com/Advanced-Lighting/Parallax-Mapping" rel="noopener noreferrer"&gt;Parallax Occlusion Mapping&lt;/a&gt; technique, which performs a form of ray marching in a pixel shader - traversing space to find the intersection of a ray with a surface defined by a heightmap texture. This way, a completely flat surface can appear to have complex 3D details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oinrclhd6chybvopzu5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oinrclhd6chybvopzu5.jpg" alt="Parallax Occlusion Mapping technique. The cube's surface is entirely flat, but it appears to have details - image from babylon.js" width="418" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GPUs as General Purpose Computers
&lt;/h2&gt;

&lt;p&gt;At this point, you may wonder about LLMs, deep learning, and the ability to perform general-purpose computations on GPUs. Take a look at the shader program above: it reads just like a piece of C code. Why can't we use it to perform arbitrary computations on the GPU? Indeed, we can, and people have been doing so since the early 2000s. But first we need to address one problem: how do we get data in and out of the GPU?&lt;/p&gt;

&lt;p&gt;Getting data in is pretty straightforward: we can encode our data as a texture or geometry and upload it to the GPU. But how do we get data out? For that, we can use techniques like &lt;a href="http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-14-render-to-texture/" rel="noopener noreferrer"&gt;render to texture&lt;/a&gt;, which lets us render the output of our shader program to a texture instead of the screen, and then read that texture back to the CPU.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For those not familiar with computer graphics terminology: a texture is just an image. In computer graphics, textures store image data that can be applied to the surfaces of 3D models to give them color and detail. A texture is typically a 2D array of pixels, where each pixel contains color information (e.g., RGB values) and sometimes additional data such as alpha (transparency) or normal vectors for bump mapping.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This technique is actually even older than shaders themselves, as it was used in the pre-shader era to create effects like dynamic reflections and shadows. For example, to create a reflection effect, you can render the scene from the point of view of a reflected camera (e.g., below the water surface) to a texture, and then use that texture to render the water surface. You can use a pixel shader to distort the texture coordinates, simulating water ripples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4x05ass1nrv3j1kvg8ae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4x05ass1nrv3j1kvg8ae.png" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj07l2pod0emyjsm3roi6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj07l2pod0emyjsm3roi6.jpg" alt="An example of a water reflection effect achieved via the render-to-texture technique. Apparently, I was too lazy to fix the face orientation on the yacht model at the time of making that demo." width="640" height="513"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Some ingenious people figured out that you can use this technique to perform arbitrary computations on the GPU by encoding your input data as a texture, writing a shader program to perform the calculation, rendering the output to a texture, and then reading that texture back to the CPU.&lt;/p&gt;

&lt;p&gt;What can you achieve with this technique? Everything you can with CUDA today. A popular trick in early GPGPU was ping-pong rendering, where two textures alternate between the reading and writing roles: take your input texture, compute some function on it, write the result to the output texture, then use that output texture as the input for the next pass, and so on. By chaining multiple shader programs together, you can build up complex computations. And you don't have to work with images specifically - you can encode any data as a texture: a 2D array of floats, a 3D volume of voxels, a graph, and so on.&lt;/p&gt;

&lt;p&gt;For example, the Fast Fourier Transform (FFT) algorithm can be implemented using shaders and the render-to-texture technique. Here is an example of a GPU-based FFT implementation from &lt;a href="https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-48-medical-image-reconstruction" rel="noopener noreferrer"&gt;GPU Gems 2&lt;/a&gt;, used there for medical image reconstruction.&lt;/p&gt;

&lt;p&gt;Here is how a fragment shader for a single FFT pass looks. It is similar to a CUDA kernel you would write today: essentially a function invoked for each pixel of the output texture, which reads data from the input textures, performs some computation, and writes the result as the pixel's color, which is then stored in the output texture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;FragmentProgram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;
                     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TEXCOORD0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;sColor0&lt;/span&gt;
                     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;COLOR0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;sColor1&lt;/span&gt;
                     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;COLOR1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;sColor2&lt;/span&gt;
                     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;COLOR2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;sColor3&lt;/span&gt;
                     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;COLOR3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;Real1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;Imag1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;Real2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;Imag2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;ButterflyLookupI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;ButterflyLookupWR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;ButterflyLookupWI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Read in butterfly indices&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ButterflyLookupI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Read in scrambling coordinates&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;WR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ButterflyLookupWR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Read in weights&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;WI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ButterflyLookupWI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Perform the butterfly operation, storing results in the output colors&lt;/span&gt;
  &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;float2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;float2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputX1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Real1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputY1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Imag1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputX2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Real1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputY2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Imag1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputX2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;WI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputY2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputX2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;WR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputY2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;sColor0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InputX1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;sColor1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InputY1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputX1_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Real2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputY1_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Imag2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputX2_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Real2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputY2_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Imag2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputX2_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;WI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputY2_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputX2_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;WR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputY2_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;sColor2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InputX1_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;sColor3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InputY1_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The code above is written in the &lt;a href="https://www.khronos.org/opengl/wiki/cg" rel="noopener noreferrer"&gt;Cg&lt;/a&gt; language. It was an early attempt by NVIDIA to &lt;del&gt;monopolize the graphics computing market&lt;/del&gt; make shader programming more convenient. Luckily, nobody cared much about it, and the market relied on the more universally supported GLSL and HLSL languages instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I was fascinated by these developments! This technique unlocked a remarkable number of new applications in computer graphics, science, and the medical field, among others. Personally, I've used it to implement advanced graphics effects. Here is an example of &lt;a href="http://www.uraldev.ru/articles/35/page/2" rel="noopener noreferrer"&gt;using FFT to generate a complex water surface&lt;/a&gt;. This technique was used in the movie &lt;a href="https://en.wikipedia.org/wiki/Titanic_(1997_film)" rel="noopener noreferrer"&gt;Titanic&lt;/a&gt; and in some advanced games like &lt;a href="https://en.wikipedia.org/wiki/Assassin%27s_Creed" rel="noopener noreferrer"&gt;Assassin's Creed&lt;/a&gt;.&lt;/p&gt;

&lt;center&gt;

&lt;p&gt;Realistic ocean surface rendering. The wave geometry was computed via a mathematical model that required performing a large 2D IFFT, which was implemented using shaders and a render-to-texture technique entirely on a GPU.&lt;/p&gt;
&lt;/center&gt;
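&lt;p&gt;The core of that trick is easy to demonstrate on the CPU: fill a 2D grid of Fourier coefficients and run an inverse FFT to turn it into a spatial height field. Below is a minimal NumPy sketch; the grid size and the random spectrum are purely illustrative, as a real ocean simulation fills the grid from a physically based wave spectrum and the GPU version performs the transform stage by stage with shaders:&lt;/p&gt;

```python
import numpy as np

# Illustrative 64x64 grid of random complex Fourier coefficients.
# A real water simulation would fill this with a physically based
# wave spectrum instead of random noise.
N = 64
rng = np.random.default_rng(42)
spectrum = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))

# The inverse 2D FFT converts the frequency-domain spectrum into a
# spatial height field. In practice the spectrum is made Hermitian-
# symmetric so the result is real; here we simply take the real part.
height_field = np.real(np.fft.ifft2(spectrum))  # shape: (64, 64)
```

This one-liner `ifft2` is exactly the large transform that had to be decomposed into many render-to-texture passes on the GPUs of that era.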

&lt;blockquote&gt;
&lt;p&gt;Are any of those articles worth reading? Of course not. I just wanted to demonstrate how I used the Web Archive to recover some old articles that are no longer available online, and to add a meme image to the post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gqgtt7xnqterxbk1ill.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gqgtt7xnqterxbk1ill.jpeg" width="600" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter the CUDA
&lt;/h2&gt;

&lt;p&gt;Although the technique of using shaders for general-purpose computations was quite powerful, it was still somewhat limited. The programming model was not very friendly, as you had to encode your data as textures or other graphics primitives. The render-to-texture approach involved rendering a rectangle covering the entire screen so that every rendered pixel aligned precisely with a texel of the output texture. It was also easy to misconfigure the graphics pipeline, such as by forgetting to turn off texture filtering, which would lead to incorrect results.&lt;/p&gt;

&lt;p&gt;All of these details were quite distracting and made it hard to focus on the actual computation, especially for non-graphics programmers. Thus, NVIDIA introduced CUDA in 2007, which provided a C-like programming model for writing general-purpose computations on NVIDIA GPUs.&lt;/p&gt;

&lt;p&gt;The programming model is similar to the shader programming model, as you still write a kernel function that is executed in parallel by many threads. Each thread is identified by its 1D, 2D, or 3D index, which you can use to compute the memory address of the data you want to process. In the shader programming model, you would do that using texture coordinates or other varying variables, while in CUDA you use thread indices. However, all the scaffolding of setting up the graphics pipeline, managing textures, framebuffers, and so on, is eliminated. You simply allocate memory on the GPU, copy data to it, launch a kernel, and copy the results back.&lt;/p&gt;

&lt;p&gt;Here is how the FFT kernel from above would look in CUDA. Again, feel free to skip if you're here for the story.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Helper function to perform a complex multiply and add operation.&lt;/span&gt;
&lt;span class="n"&gt;__device__&lt;/span&gt; &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="nf"&gt;butterfly_op&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;twiddle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Perform complex multiplication and addition&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;temp_result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;temp_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;twiddle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;twiddle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;temp_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;twiddle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;twiddle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;temp_result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;fft_stage_kernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;// Input data arrays (now using float2 for complex numbers)&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_input1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_input2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Combined butterfly lookup tables (now float2 for complex twiddle factors)&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_butterflyLookupI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_butterflyTwiddles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Output data arrays (now using float2)&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_out1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_out2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Read butterfly lookup index and complex twiddle factor&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;lookup_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;d_butterflyLookupI&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;twiddle_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d_butterflyTwiddles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Read input data using combined float2 arrays&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;r1_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;r2_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;lookup_i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;input1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d_input1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r1_idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;input2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d_input1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r2_idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Perform the butterfly operation for the first pair of inputs&lt;/span&gt;
    &lt;span class="n"&gt;d_out1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;butterfly_op&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;twiddle_factor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Process the second pair of data arrays&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;input1_prime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d_input2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r1_idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;input2_prime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d_input2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r2_idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Perform the second butterfly operation&lt;/span&gt;
    &lt;span class="n"&gt;d_out2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;butterfly_op&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input1_prime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input2_prime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;twiddle_factor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
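&lt;p&gt;To see that the butterfly above is nothing more than complex arithmetic, here is the same operation expressed in NumPy: &lt;code&gt;butterfly_op(a, b, twiddle)&lt;/code&gt; is simply &lt;code&gt;a + b * twiddle&lt;/code&gt; on complex numbers. This is a small sanity-check sketch with made-up sample values, not the production kernel:&lt;/p&gt;

```python
import numpy as np

def butterfly_op(a, b, twiddle):
    # Same math as the CUDA helper above, written with complex numbers:
    # (b * twiddle) is the complex multiply, + a is the add.
    return a + b * twiddle

# Made-up sample inputs for the check
a = np.array([1 + 2j, 3 - 1j])
b = np.array([0.5 - 0.5j, 2 + 0j])
twiddle = np.exp(-2j * np.pi * np.arange(2) / 4)  # twiddle factors e^{-2*pi*i*k/N}

result = butterfly_op(a, b, twiddle)

# Expand the first element by hand to check that the real/imaginary
# parts match the component-wise formulas in the kernel.
manual_real = a[0].real + (b[0].real * twiddle[0].real - b[0].imag * twiddle[0].imag)
manual_imag = a[0].imag + (b[0].imag * twiddle[0].real + b[0].real * twiddle[0].imag)
assert np.isclose(result[0], manual_real + 1j * manual_imag)
```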



&lt;blockquote&gt;
&lt;p&gt;I was waiting to get my hands on a GPU that supported CUDA, again. I was earning money by then, so there was no need to trick my parents anymore, but high-end PC upgrades were still a considerable expense, and you needed to make them often. My first CUDA-capable GPU was the 8800 GT, a card from the most legendary series of all time. It was built on an entirely new architecture and introduced CUDA. In addition, a single 8800 GTX was able to outperform two previous-generation 7900 GTX cards in SLI, at comparable power consumption and price ($599, hold back your tears). When will we see such leaps in performance and value again, Mr. Leather-jacket CEO?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rvy9ahmcnorqzqrix22n.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rvy9ahmcnorqzqrix22n.webp" alt="An entry-level GPU in 2030 with an MSRP of $8799"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CUDA Moat?
&lt;/h2&gt;

&lt;p&gt;As a &lt;strong&gt;true open-source warrior&lt;/strong&gt;, I did not use CUDA and relied on OpenCL instead for my work. It was not as well supported as CUDA: debuggers and other tools were less advanced, there were more glitches, and you could squeeze slightly better performance out of CUDA on NVIDIA hardware. However, these drawbacks were outweighed by the fact that OpenCL was an open standard and worked on AMD and Intel GPUs as well, so CUDA was far from a monopoly at that time.&lt;/p&gt;

&lt;p&gt;At my job, I was using OpenCL to implement an algorithm for real-time 3D scanning. The &lt;a href="https://www.artec3d.com/portable-3d-scanners/artec-eva" rel="noopener noreferrer"&gt;Artec Eva&lt;/a&gt; is a professional 3D scanner used for medical or industrial applications. Real-time 3D scanning involves a significant amount of GPU computation to process the input video stream, identify your position with respect to the environment (similar algorithms are employed as in self-driving cars), fuse all the input data into a single 3D model, and display it on the screen. All of this had to happen in real-time, so the user could see the result immediately and adjust their position if needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxh7peu5ec0cai35j32i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxh7peu5ec0cai35j32i.jpg" alt="Scanning an object with an Artec 3D scanner" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Opting for OpenCL was a brave choice back then, and possibly a bad product decision at the time: when you buy a $12,000 3D scanner, you can afford a decent GPU and need not worry about vendor lock-in. However, over time, as GPUs became more powerful and it became possible to run the pipeline on a laptop GPU, specifically on a &lt;a href="https://en.wikipedia.org/wiki/Microsoft_Surface" rel="noopener noreferrer"&gt;Microsoft Surface&lt;/a&gt; tablet, the choice of OpenCL paid off. Now an operator had a lightweight display in their hands and could walk around the object being scanned. At least, this is what I tell myself to feel better about my choice 😅&lt;/p&gt;

&lt;center&gt;

&lt;p&gt;Real-time scanning of a 3D object with an Artec scanner. Scanner localization, data fusion, and visualization are performed in real-time on a GPU using OpenCL.&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;In addition to OpenCL, there were many other hardware-agnostic GPGPU frameworks to choose from, including &lt;a href="https://halide-lang.org" rel="noopener noreferrer"&gt;Halide&lt;/a&gt;, &lt;a href="https://arrayfire.com" rel="noopener noreferrer"&gt;ArrayFire&lt;/a&gt;, and &lt;a href="https://numba.pydata.org" rel="noopener noreferrer"&gt;Numba&lt;/a&gt;. So, all things considered, the open-source and open-standard ecosystem was a fair contender to CUDA back then, and CUDA hasn't had its moat yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Learning Revolution
&lt;/h2&gt;

&lt;p&gt;The new GPU programming capabilities unlocked by CUDA and OpenCL enabled numerous new applications in computer graphics, science, and medicine, among other fields. However, the popularization of deep learning (that is what we called AI before ChatGPT came along) is arguably the most notable outcome.&lt;/p&gt;

&lt;p&gt;Many think that thanks to AI, GPUs have become the central compute platform. In fact, it is the other way around: thanks to GPUs, we have AI in the first place. Deep convolutional neural networks had been known since the 90s. In 2012, a graduate student, &lt;a href="https://en.wikipedia.org/wiki/Alex_Krizhevsky" rel="noopener noreferrer"&gt;Alex Krizhevsky&lt;/a&gt;, motivated by &lt;a href="https://en.wikipedia.org/wiki/Ilya_Sutskever" rel="noopener noreferrer"&gt;Ilya Sutskever&lt;/a&gt;, trained a deep convolutional neural network under the guidance of &lt;a href="https://en.wikipedia.org/wiki/Geoffrey_Hinton" rel="noopener noreferrer"&gt;Geoffrey Hinton&lt;/a&gt; using a couple of GeForce GPUs to enter the &lt;a href="https://en.wikipedia.org/wiki/ImageNet#ImageNet_Challenge" rel="noopener noreferrer"&gt;ImageNet challenge&lt;/a&gt;. The model was called &lt;a href="https://en.wikipedia.org/wiki/AlexNet" rel="noopener noreferrer"&gt;AlexNet&lt;/a&gt;, and the dataset consisted of 1.2 million images belonging to 1000 categories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6dzv9c6hyvabthgse35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6dzv9c6hyvabthgse35.png" alt="The obligatory xkcd meme: https://xkcd.com/2347/" width="770" height="978"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results? AlexNet obliterated the state-of-the-art computer vision models of the time, demonstrating a whopping 9.4% improvement in accuracy over the previous best result. This was a game-changer. It triggered the deep learning revolution, in which breakthrough after breakthrough in computer vision, natural language processing, and other fields was achieved using deep learning models trained on GPUs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqz14g9yk19x0rnr1h2c.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqz14g9yk19x0rnr1h2c.webp" alt="The best ImageNet challenge results in 2010 and 2011, compared against all results in 2012, including AlexNet. Image from Pinecone's article: AlexNet and ImageNet: The Birth of Deep Learning" width="665" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Array Programming Model
&lt;/h2&gt;

&lt;p&gt;GPU computing caused a great upheaval in the machine learning field, and the latter retaliated by drastically changing the way we program GPUs. The programming model has shifted from writing kernels that operate on individual elements of an array to writing code that operates on entire arrays (tensors) at once.&lt;/p&gt;

&lt;p&gt;The reason for this is that deep-learning frameworks like &lt;a href="https://www.tensorflow.org" rel="noopener noreferrer"&gt;Tensorflow&lt;/a&gt; and &lt;a href="https://pytorch.org" rel="noopener noreferrer"&gt;PyTorch&lt;/a&gt; were inspired not by graphics programming but by scientific computing frameworks like &lt;a href="https://numpy.org/" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt; and &lt;a href="https://www.mathworks.com/products/matlab.html" rel="noopener noreferrer"&gt;MATLAB&lt;/a&gt;. The programming model differs significantly from those of CUDA and OpenCL. Instead of writing kernels that operate on individual elements of an array, you write code that operates on entire arrays (tensors) at once, and the framework breaks the operations down into smaller pieces that can be executed in parallel on the GPU. This programming model, known as &lt;a href="https://en.wikipedia.org/wiki/Array_programming" rel="noopener noreferrer"&gt;array programming&lt;/a&gt;, dates back to the 60s and the development of languages like &lt;a href="https://en.wikipedia.org/wiki/APL_(programming_language)" rel="noopener noreferrer"&gt;APL&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Fortran" rel="noopener noreferrer"&gt;Fortran&lt;/a&gt;.&lt;/p&gt;
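&lt;p&gt;The contrast is easy to see in miniature. The kernel style spells out one explicit operation per element, which is what a CUDA thread grid does implicitly, while the array style states the whole-array operation and lets the framework decide how to execute it. A small NumPy sketch with an illustrative SAXPY-like operation:&lt;/p&gt;

```python
import numpy as np

# Kernel-style thinking: one explicit operation per element,
# like the body of a CUDA kernel indexed by its thread id.
def saxpy_loop(alpha, x, y):
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = alpha * x[i] + y[i]
    return out

# Array-style thinking: a single expression over whole arrays;
# the framework maps it onto parallel hardware for you.
def saxpy_array(alpha, x, y):
    return alpha * x + y

x = np.arange(100, dtype=np.float32)
y = np.ones_like(x)
assert np.allclose(saxpy_loop(2.0, x, y), saxpy_array(2.0, x, y))
```

On large arrays the vectorized version is also dramatically faster even on a CPU, since the per-element loop runs in compiled code rather than in the Python interpreter.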

&lt;blockquote&gt;
&lt;p&gt;I am skipping &lt;a href="https://en.wikipedia.org/wiki/Caffe_(software)" rel="noopener noreferrer"&gt;Caffe&lt;/a&gt;, the first, and at the time popular, declarative deep learning framework. It was suitable for defining a large number of models, but not for expressing arbitrary computations on tensors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This programming model has one tremendous advantage. It is much easier to reason about the code, as you don't have to think about how to parallelize the computation. You write code that operates on entire arrays, and the framework takes care of the rest. It made GPU programming accessible to a much wider audience, as you didn't have to be a GPU programming expert to write code that runs on the GPU. It is so convenient that many GPU programming experts, myself included, have switched to using these frameworks for their work. It allows you to express your ideas much more concisely and focus on the problem at hand, rather than the intricacies of GPU programming. Additionally, frameworks like PyTorch and Tensorflow come with an automatic differentiation engine, which allows you to compute gradients of your functions automatically. This is especially useful for training neural networks, but it can also be applied to other applications.&lt;/p&gt;
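&lt;p&gt;The idea behind automatic differentiation can be illustrated with a toy forward-mode implementation based on dual numbers, where every value carries its derivative along with it. This is only a sketch of the concept; PyTorch and Tensorflow actually use reverse-mode differentiation over a recorded computation graph, which scales far better to functions of millions of parameters:&lt;/p&gt;

```python
class Dual:
    """A number carrying its value and its derivative (forward-mode AD)."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def grad(f, x):
    # Seed the derivative with 1.0 and let the rules propagate it through f.
    return f(Dual(x, 1.0)).deriv

# d/dx (3x^2 + 2x) = 6x + 2, so the gradient at x = 4 is 26.
assert grad(lambda x: 3 * x * x + 2 * x, 4.0) == 26.0
```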

&lt;p&gt;Here is a simple NumPy program. Even without knowing NumPy, you can figure out what it does: it creates a couple of arrays, performs some basic operations on them, and prints the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Create a 1-dimensional array from a Python list
&lt;/span&gt;&lt;span class="n"&gt;array1d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Create a 2-dimensional array (matrix)
&lt;/span&gt;&lt;span class="n"&gt;array2d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

&lt;span class="c1"&gt;# Element-wise addition
&lt;/span&gt;&lt;span class="n"&gt;sum_array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array1d&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="c1"&gt;# Element-wise multiplication
&lt;/span&gt;&lt;span class="n"&gt;product_array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array1d&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="c1"&gt;# Sum of all elements in an array
&lt;/span&gt;&lt;span class="n"&gt;total_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array1d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Mean of elements in an array
&lt;/span&gt;&lt;span class="n"&gt;mean_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array1d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Accessing elements
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;First element of array1d:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;array1d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Element at row 0, column 1 of array2d:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;array2d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Is Programming GPUs Hard?
&lt;/h2&gt;

&lt;p&gt;With the convenience of the array programming model comes a significant drawback: it is hard to optimize the code for performance. To understand why, we first need to look at what makes GPU code hard to optimize in the first place.&lt;/p&gt;

&lt;p&gt;There are several reasons why GPU programming is a complex task, but the primary one is that GPUs are heavily constrained by memory bandwidth. GPU architects have introduced numerous mechanisms to hide the latency of memory accesses and maximize the utilization of the available bandwidth, and developers need to understand these mechanisms and write code that leverages them. This is not an easy task, as it requires a deep understanding of the GPU architecture and the specific details of the memory hierarchy.&lt;/p&gt;

&lt;p&gt;Consider the following example. The most powerful CPU at the time of writing is the &lt;a href="https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9965.html" rel="noopener noreferrer"&gt;AMD EPYC 9965&lt;/a&gt;. It offers a whopping 192 cores and 384 threads, with a per-socket memory bandwidth of about 614 GB/s. However, its core count pales in comparison with the most powerful GPU at the time of writing, the &lt;a href="https://www.nvidia.com/en-us/data-center/dgx-b200/" rel="noopener noreferrer"&gt;NVIDIA B200&lt;/a&gt;, which offers 16,896 CUDA cores and up to 8TB/s of memory bandwidth per GPU.&lt;/p&gt;

&lt;p&gt;Now you can see the problem: each CPU core has about 3.2 GB/s of memory bandwidth to itself, while each GPU core gets only about 0.47 GB/s. So a GPU core must perform far more computation per byte fetched to hide the latency of memory accesses and make the best use of the available bandwidth. The situation with consumer GPUs is even worse: the &lt;a href="https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/" rel="noopener noreferrer"&gt;RTX 5090&lt;/a&gt; has 21,760 CUDA cores and 1,792 GB/s of memory bandwidth, which works out to only about 0.082 GB/s per core.&lt;/p&gt;
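
&lt;p&gt;The arithmetic above is easy to reproduce. The numbers below are the spec figures quoted in this section; divide bandwidth by core count to get each core's share:&lt;/p&gt;

```python
# Back-of-the-envelope bandwidth-per-core figures from the text.
specs = {
    "AMD EPYC 9965 (CPU)":   {"cores": 192,   "bw_gb_s": 614},
    "NVIDIA B200 (GPU)":     {"cores": 16896, "bw_gb_s": 8000},
    "NVIDIA RTX 5090 (GPU)": {"cores": 21760, "bw_gb_s": 1792},
}

for name, s in specs.items():
    per_core = s["bw_gb_s"] / s["cores"]
    print(f"{name}: {per_core:.3f} GB/s per core")
```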

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1eirdl3cqvrhhcgg9xs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1eirdl3cqvrhhcgg9xs.jpeg" width="800" height="827"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The relationship between compute power and memory bandwidth in the GPU computing world is referred to as the ALU-to-memory ratio, which represents the number of operations a GPU core can perform per memory access. For GPUs, this ratio is much higher than for CPUs. It can be dozens or even hundreds of operations per memory access.&lt;/p&gt;

&lt;p&gt;The same problem exists on all other parallel computing platforms, such as TPUs, neural processors, and FPGAs: the memory bandwidth per processing unit is always much lower than that of a CPU core. Between 2017 and 2022, I optimized neural network inference at Apple for their custom neural processors. We shipped models such as Animoji, FaceID, Portrait mode, and numerous models that run on Apple Vision Pro. For each of these models, we had to ensure there was no swapping of data between the on-chip memory and DRAM, as memory bandwidth was the main bottleneck.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To work around this limitation, GPUs employ several techniques. One is &lt;strong&gt;shared memory&lt;/strong&gt;, a small amount of fast on-chip memory shared among a group of threads. It allows threads to cooperate and share data without touching global memory, which is significantly slower. Another technique is &lt;strong&gt;memory coalescing&lt;/strong&gt;: when threads access contiguous memory locations, the GPU can fetch multiple data elements in a single memory transaction, minimizing the total number of transactions. GPU cores also have access to &lt;strong&gt;more registers&lt;/strong&gt; than CPU cores, which can be used to keep intermediate data on-chip. However, the register file is shared among all threads resident on a compute unit, so if each thread uses too many registers, fewer threads can run concurrently and occupancy drops.&lt;/p&gt;
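
&lt;p&gt;You can get a feel for coalescing with a CPU-side analogy in numpy (not a CUDA example, and the hardware details differ): summing along the contiguous axis walks memory sequentially, while summing across it jumps by a full row stride on every access.&lt;/p&gt;

```python
import numpy as np

# CPU analogy for memory coalescing: traversing an array along its
# contiguous axis touches memory sequentially; traversing across rows
# jumps by a full row stride on every access.
a = np.arange(1_000_000, dtype=np.float32).reshape(1000, 1000)

row_sums = a.sum(axis=1)  # each sum reads 1000 contiguous floats
col_sums = a.sum(axis=0)  # each sum reads floats 4000 bytes apart

# Same math either way; only the access pattern differs. On a GPU the
# analogous distinction decides whether the loads of neighboring
# threads coalesce into one memory transaction or fan out into many.
print(a.strides)  # (4000, 4) - row stride vs element stride in bytes
```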

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwk1l9v28sqxe1fwq36nk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwk1l9v28sqxe1fwq36nk.jpg" alt="GPU-machine memory hierarchy for NVIDIA Fermi (2010) architecture-transfer speeds on modern GPUs are about 5–10 times faster, but relationships are similar. Illustration from the publication Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units" width="800" height="831"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enough complex terms! If you take away one thing from this post, let it be this: &lt;strong&gt;the most effective way to optimize a GPU program is to perform more computations per memory access&lt;/strong&gt;. In other words, &lt;strong&gt;keep data inside the GPU core for as long as possible&lt;/strong&gt;. Let's pin this and come back to the array programming model and the performance issues it introduces.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Love PyTorch! What Could Possibly Be Wrong with It?
&lt;/h2&gt;

&lt;p&gt;Let's look at what a simple CUDA kernel for an array operation like &lt;code&gt;A*B + C&lt;/code&gt; looks like. Here, &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, and &lt;code&gt;C&lt;/code&gt; are large arrays (tensors), and the operation is performed element-wise, e.g., &lt;code&gt;[1, 2, 3] * [2, 2, 2] + [1, 1, 1] = [3, 5, 7]&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;array_op&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kernel is straightforward. Each thread computes a single element of the output array &lt;code&gt;D&lt;/code&gt; by reading the corresponding elements from the input arrays &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, and &lt;code&gt;C&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, let's take a look at how the same operation would look in PyTorch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you naively translate PyTorch operations like element-wise multiplication and addition to CUDA, which is how it is actually done in practice, you get two kernels: one for multiplication and one for addition. The runtime launches a kernel that performs the element-wise multiplication and stores the result in a temporary array &lt;code&gt;E&lt;/code&gt;, and then launches another kernel that performs the element-wise addition of &lt;code&gt;E&lt;/code&gt; and &lt;code&gt;C&lt;/code&gt; to produce the final tensor &lt;code&gt;D&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;array_mul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;array_add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the problem now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The fused kernel reads &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, and &lt;code&gt;C&lt;/code&gt; once and writes &lt;code&gt;D&lt;/code&gt; once: four memory transfers per element for two arithmetic operations.&lt;/li&gt;
&lt;li&gt;The unfused version reads four arrays and writes two (including the temporary &lt;code&gt;E&lt;/code&gt;): six memory transfers per element for the same two operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Given that our program is completely memory-bound, the PyTorch version will be roughly &lt;strong&gt;1.5 times as slow as the fused CUDA version&lt;/strong&gt;: it moves 6N values instead of 4N. The gap widens with every additional unfused element-wise operation, which adds another 3N transfers to the chain, while a fused kernel would add only N. For long chains of element-wise operations, the slowdown approaches &lt;strong&gt;3x&lt;/strong&gt;.&lt;/p&gt;
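
&lt;p&gt;Counting global-memory transfers per element makes the difference explicit. This little sketch assumes a perfectly memory-bound workload and no caching between kernel launches:&lt;/p&gt;

```python
# Count global-memory transfers per element (reads + writes),
# assuming no caching between kernel launches.
def traffic_fused(num_inputs):
    # one kernel: read every input once, write the result once
    return num_inputs + 1

def traffic_unfused(num_ops):
    # each element-wise binary op: read 2 arrays, write 1
    return 3 * num_ops

# D = A * B + C : 3 inputs, 2 operations
fused = traffic_fused(3)       # 4 values per element
unfused = traffic_unfused(2)   # 6 values per element
print(unfused / fused)         # 1.5x the memory traffic
```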

&lt;p&gt;You may wonder: can't we generate a single fused kernel that performs both operations at once? The answer is yes, we can; in fact, both PyTorch (via &lt;code&gt;torch.compile&lt;/code&gt;) and TensorFlow (via XLA) have mechanisms to do exactly that. However, this is not an easy problem to solve in a general way. PyTorch officially supports more than 1200 operations on tensors, and the number of possible combinations of these operations is astronomical. Many of them are not even element-wise, e.g., matrix multiplications, convolutions, and reductions. For PyTorch it is especially difficult because it is a dynamic framework: the computation graph is built on the fly as the code executes, which makes it challenging to analyze the entire graph and determine which operations can be fused.&lt;/p&gt;
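
&lt;p&gt;To see what a fusion pass does in the easy, element-wise case, here is a deliberately tiny sketch in pure Python (real compilers such as &lt;code&gt;torch.compile&lt;/code&gt; or XLA handle vastly more than this): instead of materializing a temporary array after every operation, the ops are composed into a single function and the data is traversed once.&lt;/p&gt;

```python
import numpy as np

# Unfused: one full pass (and one temporary array) per operation.
def run_unfused(ops, x):
    for op in ops:
        x = op(x)          # materializes an intermediate each time
    return x

# "Fused": compose the element-wise ops into a single kernel-like
# function, so the data is traversed once with no temporaries.
def fuse(ops):
    def fused(x):
        out = np.empty_like(x)
        for i in range(x.size):   # one pass over memory
            v = x.flat[i]
            for op in ops:
                v = op(v)
            out.flat[i] = v
        return out
    return fused

ops = [lambda v: v * 2.0, lambda v: v + 1.0]   # v*2 + 1
x = np.arange(4, dtype=np.float32)
print(run_unfused(ops, x))   # [1. 3. 5. 7.]
print(fuse(ops)(x))          # same result, one traversal
```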

&lt;p&gt;This problem remains unsolved in a general way to date, as you'll see when we discuss &lt;a href="https://github.com/Dao-AILab/flash-attention" rel="noopener noreferrer"&gt;Flash Attention&lt;/a&gt; in the context of LLM inference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flf2hndh7e17dr62zwa19.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flf2hndh7e17dr62zwa19.jpeg" width="720" height="709"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  NVIDIA Domination
&lt;/h2&gt;

&lt;p&gt;The Deep Learning revolution has dramatically changed the GPU programming landscape. The array programming model opened GPU programming to a much wider audience, but it also introduced new challenges, such as optimizing memory access patterns and fusing operations to achieve good performance.&lt;/p&gt;

&lt;p&gt;This has created a strong moat for NVIDIA. Although CUDA was just one of many GPGPU frameworks available at the time, the CUDA ecosystem had a great deal more to offer the community. For example, it included cuDNN, a highly optimized library of deep learning primitives such as convolutions, pooling, and normalization. All major deep learning frameworks, including TensorFlow and PyTorch, relied on this library to achieve good performance on NVIDIA GPUs. Additionally, NVIDIA invested heavily in optimizing its hardware for deep learning workloads, for example by introducing Tensor Cores, specialized hardware units designed for performing matrix multiplications and convolutions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn295tcgj4d9pa9r7c1l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn295tcgj4d9pa9r7c1l.jpg" width="742" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the Deep Learning age, NVIDIA GPUs have become the &lt;strong&gt;de facto standard for deep learning workloads&lt;/strong&gt;. All major deep learning frameworks like PyTorch and TensorFlow were built on top of cuDNN, initially not even offering the option to use other backends like OpenCL or ROCm. Nearly all research was done on NVIDIA hardware, as it was the only hardware supported by the tools researchers were using. This created a strong network effect: everyone was using NVIDIA hardware, so everyone was optimizing their code for NVIDIA hardware, which made NVIDIA hardware even more attractive.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From 2010 to the present, I have exclusively owned NVIDIA GPUs. Even though some AMD models offered better value, the need to do AI-related work has always steered me into the Team Green camp.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ironically, as innovative as CUDA was, the moat was created not by CUDA itself, but by &lt;strong&gt;the army of NVIDIA engineers who optimized cuDNN and other libraries for deep learning workloads&lt;/strong&gt;. There was simply no good algorithm for optimizing computational graphs in a general way, so NVIDIA engineers hand-optimized the most common patterns that appear in deep learning workloads.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There have been many attempts to come up with an automatic way to optimize computational graphs, or at least a universal, hardware-agnostic AI stack that makes the optimization process easier, like &lt;a href="https://www.tensorflow.org/xla" rel="noopener noreferrer"&gt;XLA&lt;/a&gt; from Google, &lt;a href="https://tvm.apache.org" rel="noopener noreferrer"&gt;TVM&lt;/a&gt; from the Apache Foundation, &lt;a href="https://mlir.llvm.org" rel="noopener noreferrer"&gt;MLIR&lt;/a&gt; from LLVM or &lt;a href="https://www.modular.com/max" rel="noopener noreferrer"&gt;MAX&lt;/a&gt; from Modular AI. However, none of them has managed to beat hand-optimized libraries like cuDNN on NVIDIA hardware across a large enough number of real-world use cases and establish a strong enough network effect.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The AI Era - Bigger is Better
&lt;/h2&gt;

&lt;p&gt;History doesn't repeat itself, but it often rhymes. The computational power of GPUs triggered the deep learning revolution: we used algorithms that had been known since the 1990s, but now we could train much larger models on much larger datasets. The same thing happened with LLMs. The &lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;transformer architecture&lt;/a&gt; was introduced in 2017, but it took until 2020 for truly large-scale transformer models like GPT-3 to appear (BERT, its much smaller predecessor, arrived in 2018). The reason is that training these models requires enormous computational power and memory bandwidth. OpenAI reportedly trained GPT-3 on a cluster of 10,000 GPUs; the model has 175 billion parameters and was trained on a dataset of 570GB of text data. The training process took several weeks and cost several million dollars (and probably raised the global temperature by a degree or so).&lt;/p&gt;

&lt;p&gt;How did AI affect the GPU programming landscape? Not much, actually. The same array programming model is used for training and inference of LLMs. The same challenges of optimizing memory access patterns and fusing operations to achieve good performance still exist. However, the scale of the models has increased dramatically, which has introduced new challenges, like distributing the model across multiple GPUs and optimizing communication between GPUs.&lt;/p&gt;

&lt;p&gt;The large scale of the models has also introduced new challenges for inference. The models are so large that they don't fit into the memory of a single GPU. For example, GPT-3 requires about 700GB of memory just to store its parameters in FP32 (175 billion parameters at 4 bytes each), which is much larger than the memory of even the most powerful GPUs available today. This has led to the development of techniques such as model parallelism, where the model is split across multiple GPUs, and pipeline parallelism, where different parts of the model are executed on separate GPUs in a pipelined manner.&lt;/p&gt;
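
&lt;p&gt;The memory math is easy to check. The sketch below counts only the weights (activations, KV cache, and optimizer state would add considerably more) and assumes a hypothetical 80GB-per-GPU budget:&lt;/p&gt;

```python
# Memory needed just for GPT-3's 175B weights, at different precisions.
params = 175e9
bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}

for dtype, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    gpus = -(-gb // 80)  # ceil-divide by an 80GB GPU, weights only
    print(f"{dtype}: {gb:.0f} GB -> at least {gpus:.0f} x 80GB GPUs")
```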

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jc74jrtuf3b1mbo0l1t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jc74jrtuf3b1mbo0l1t.jpg" width="640" height="753"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case of Flash Attention
&lt;/h2&gt;

&lt;p&gt;Surprisingly, after all these years, the problem of optimizing memory access patterns and fusing operations to achieve good performance is still not solved in a general way. Let's take a look at a specific example of this problem in the context of LLM inference.&lt;/p&gt;

&lt;p&gt;One of the most important operations in transformer models is the attention mechanism, which allows the model to focus on different parts of the input sequence when making predictions. It is implemented as a series of matrix multiplications and softmax operations (see the rightmost diagram in the image below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kq6ofsw4xjbrs6dqe2d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kq6ofsw4xjbrs6dqe2d.png" alt="Attention mechanism in transformers" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The softmax operation involves computing the exponential of each element in the input matrix, summing the exponentials along each row, and then dividing each element by that sum.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9qdw168de6tolfze7y2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9qdw168de6tolfze7y2.jpg" alt="Softmax Operation" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looks challenging to optimize, right? How can we reduce the number of memory accesses here? The naive implementation reads the input matrices from memory, multiplies them together, stores the result in a temporary matrix, reads the temporary matrix back, computes the exponential of each element, sums them up, and then divides each element by the sum. That is a lot of memory traffic. And indeed, it is slow!&lt;/p&gt;

&lt;p&gt;However, I mentioned earlier that GPUs come with a bit of fast on-chip memory called shared memory (SRAM in hardware terms: static random-access memory). It is a small amount of memory shared among a block of GPU threads, and it is much faster than global memory (GDDR or HBM), which makes it ideal for storing intermediate results. The original Flash Attention implementation was developed and benchmarked on the A100, which has 40GB of HBM and up to 192KB of combined L1/shared memory per SM. The SRAM bandwidth is about 19TB/s, while the HBM bandwidth is about 1.5–2.0TB/s.&lt;/p&gt;

&lt;p&gt;The authors of Flash Attention devised a way to partition the computation so that the intermediate results fit into shared memory, allowing the entire attention computation to be performed with far fewer trips to global memory. The input matrices are split into smaller tiles, the calculations (matrix multiplications and softmax) are performed tile by tile, and the results are streamed back to global memory. The result is a significant speedup over the naive implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95mfe3nnjtnm7hu2d28v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95mfe3nnjtnm7hu2d28v.png" alt="Left: FlashAttention uses tiling to prevent materialization of the large 𝑁 × 𝑁 attention matrix (dotted box) on (relatively) slow GPU HBM. In the outer loop (red arrows), FlashAttention loops through blocks of the K and V matrices and loads them to fast on-chip SRAM. In each block, FlashAttention loops over blocks of Q matrix (blue arrows), loading them to SRAM, and writing the output of the attention computation back to HBM. Right: Speedup over the PyTorch implementation of attention on GPT-2. FlashAttention does not read and write the large 𝑁 × 𝑁 attention matrix to HBM, resulting in an 7.6× speedup on the attention computation." width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;
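&lt;p&gt;The idea can be sketched in NumPy with the "online softmax" recurrence. This is a hedged simplification, not the actual CUDA kernel: it tiles only over the K/V dimension, while the real algorithm also blocks over Q and runs as a fused kernel keeping the tiles in SRAM. The key point it shows is that the N &amp;times; N score matrix is never materialized; only block-sized tiles and per-row running statistics are kept:&lt;/p&gt;

```python
import numpy as np

def flash_attention_sketch(Q, K, V, block=64):
    """Tiled attention with online softmax. Only a block-sized tile
    of scores exists at a time; running max and sum per query row
    let us rescale earlier partial results as new tiles arrive."""
    N, d = Q.shape
    O = np.zeros((N, d))
    row_max = np.full((N, 1), -np.inf)  # running max per query row
    row_sum = np.zeros((N, 1))          # running softmax denominator
    for j in range(0, N, block):        # stream K/V tiles from "HBM"
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = Q @ Kj.T                    # N x block tile of scores
        tile_max = S.max(axis=1, keepdims=True)
        new_max = np.maximum(row_max, tile_max)
        scale = np.exp(row_max - new_max)  # rescale earlier partials
        P = np.exp(S - new_max)
        row_sum = row_sum * scale + P.sum(axis=1, keepdims=True)
        O = O * scale + P @ Vj
        row_max = new_max
    return O / row_sum
```

The output matches the naive implementation exactly (up to floating-point error), which is why the tiling is a pure performance optimization: same math, far fewer global-memory accesses.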

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The GPU programming landscape has changed dramatically over the past two decades. The introduction of CUDA and OpenCL has made GPU programming accessible to a much wider audience and triggered the deep learning revolution, which in turn changed the way we program GPUs. The array programming model has made it easier to write code that runs on the GPU, but it has also introduced new challenges, such as optimizing memory access patterns and fusing operations to achieve optimal performance.&lt;/p&gt;

&lt;p&gt;Now that you're a certified GPU programming expert, enjoy the last meme and get your GPU cranking!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk0q7t0de30w7jq6cju3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk0q7t0de30w7jq6cju3.jpeg" alt="If you frequently run into this issue — check out our GPU rental service." width="720" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>gpu</category>
      <category>deeplearning</category>
      <category>cuda</category>
    </item>
    <item>
      <title>Host Setup for QEMU KVM GPU Passthrough with VFIO on Linux</title>
      <dc:creator>Dmitry Trifonov</dc:creator>
      <pubDate>Tue, 26 Aug 2025 21:27:08 +0000</pubDate>
      <link>https://dev.to/novibecoding/host-setup-for-qemu-kvm-gpu-passthrough-with-vfio-on-linux-46lc</link>
      <guid>https://dev.to/novibecoding/host-setup-for-qemu-kvm-gpu-passthrough-with-vfio-on-linux-46lc</guid>
      <description>&lt;h3&gt;
  
  
  From “black magic” to reproducible results
&lt;/h3&gt;

&lt;p&gt;GPU passthrough shouldn't feel like sorcery. If you've ever lost a weekend to half-working configs, random resets, or a guest that only boots when the moon is right, this guide is for you. I pulled out a lot of hair while hardening the &lt;a href="https://cloudrift.ai" rel="noopener noreferrer"&gt;CloudRift&lt;/a&gt;&lt;br&gt;
VM service for a variety of consumer (RTX 4090, 5090, PRO 6000) and data center (H100, B200) GPUs, so I'm writing this guide to help you avoid the common pitfalls.&lt;/p&gt;

&lt;p&gt;I'll focus specifically on the host node configuration for GPU passthrough. Thus, this guide is relevant regardless of whether you're using Proxmox or plain libvirt/QEMU. The provided instructions have been tested on Ubuntu 22.04 and 24.04 with various NVIDIA GPUs.&lt;/p&gt;

&lt;p&gt;To keep this guide manageable, I won't delve into lower-level details, such as specific domain XML tricks, Linux kernel builds, or GPU firmware flashing. In most cases, you don't need to fiddle with those.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Remove NVIDIA drivers
&lt;/h2&gt;

&lt;p&gt;The first step is to remove the NVIDIA drivers. It is not strictly required, but NVIDIA drivers tend to interfere with passthrough in one way or another, so it's best to remove them altogether.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you're configuring your own work PC with multiple GPUs, skip this step, since without the NVIDIA drivers you won't be able to run GUI applications. In that case, passthrough robustness is likely not a priority for you. However, I strongly recommend removing the NVIDIA drivers on headless servers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the NVIDIA driver is installed from the repository, you can remove it using the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get remove &lt;span class="nt"&gt;--purge&lt;/span&gt; &lt;span class="s1"&gt;'^nvidia-.*'&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt autoremove
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you've installed the driver using the RUN file, remove it using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; /usr/bin/nvidia-uninstall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remove any leftover configuration files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /etc/X11/xorg.conf
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /etc/modprobe.d/nvidia&lt;span class="k"&gt;*&lt;/span&gt;.conf
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /lib/modprobe.d/nvidia&lt;span class="k"&gt;*&lt;/span&gt;.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reboot the system after removing the drivers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Check BIOS, IOMMU Support and IOMMU Group Assignment
&lt;/h2&gt;

&lt;p&gt;The next step is to check virtualization and IOMMU support. We need to check four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Virtualization is enabled (the AMD-Vi / Intel VT-d option is enabled in the BIOS). If present, also enable the "Above 4G Decoding" and "Resizable BAR (ReBAR)" BIOS options.&lt;/li&gt;
&lt;li&gt;IOMMU is active (groups exist).&lt;/li&gt;
&lt;li&gt;Each GPU and its audio function are isolated in their own IOMMU group.&lt;/li&gt;
&lt;li&gt;GPU groups contain only GPU/video-audio functions and PCI bridges — no NICs, NVMe, SATA, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2boo48ogsvmbugducvy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2boo48ogsvmbugducvy.png" alt=" " width="797" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can use the following handy-dandy script to check those preconditions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI goes overboard when generating helper scripts, doesn't it? I can't complain, though. It provides a lot of useful information.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="c"&gt;# VFIO host sanity check: IOMMU support + GPU-containing groups&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;  &lt;span class="c"&gt;# don't use -e so greps that find nothing don't abort&lt;/span&gt;

&lt;span class="c"&gt;# --- helpers ---------------------------------------------------------------&lt;/span&gt;
have&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null 2&amp;gt;&amp;amp;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

read_klog&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;have journalctl&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then &lt;/span&gt;journalctl &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="nt"&gt;-b&lt;/span&gt; 0 2&amp;gt;/dev/null
  &lt;span class="k"&gt;else &lt;/span&gt;dmesg 2&amp;gt;/dev/null
  &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

trim&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/^[[:space:]]*//'&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/[[:space:]]*$//'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# --- 1) CPU vendor + boot flags -------------------------------------------&lt;/span&gt;
&lt;span class="nv"&gt;CPU_VENDOR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;lscpu 2&amp;gt;/dev/null | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt;: &lt;span class="s1"&gt;'/Vendor ID/{print $2}'&lt;/span&gt; | trim&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-m1&lt;/span&gt; &lt;span class="s1"&gt;'vendor_id'&lt;/span&gt; /proc/cpuinfo 2&amp;gt;/dev/null | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $3}'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CPU_VENDOR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;CPU_VENDOR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"(unknown)"&lt;/span&gt;

&lt;span class="nv"&gt;CMDLINE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/cmdline 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;HAS_INTEL_FLAG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMDLINE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'intel_iommu=on'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;HAS_AMD_FLAG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMDLINE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'amd_iommu=on'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;HAS_PT_FLAG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMDLINE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'iommu=pt'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# --- 2) Kernel log signals ------------------------------------------------&lt;/span&gt;
&lt;span class="nv"&gt;KLOG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;read_klog&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nv"&gt;DISABLED_MSG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$KLOG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | egrep &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'IOMMU.*disabled by BIOS|DMAR:.*disabled|AMD-Vi:.*disabled'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;ENABLED_MSG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$KLOG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | egrep &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'DMAR: IOMMU enabled|AMD-Vi:.*IOMMU.*enabled|IOMMU: .*enabled'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;IR_MSG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$KLOG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | egrep &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'Interrupt remapping enabled'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# --- 3) IOMMU groups presence --------------------------------------------&lt;/span&gt;
&lt;span class="nv"&gt;GROUPS_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/sys/kernel/iommu_groups"&lt;/span&gt;
&lt;span class="nv"&gt;GROUP_COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nv"&gt;GROUP_COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-mindepth&lt;/span&gt; 1 &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 1 &lt;span class="nt"&gt;-type&lt;/span&gt; d 2&amp;gt;/dev/null | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $1}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Heuristic: active if groups exist (&amp;gt;0). Logs help explain state.&lt;/span&gt;
&lt;span class="nv"&gt;IOMMU_ACTIVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"no"&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUP_COUNT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;IOMMU_ACTIVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"yes"&lt;/span&gt;

&lt;span class="c"&gt;# --- 4) Report summary ----------------------------------------------------&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== IOMMU Summary ==="&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CPU vendor           : &lt;/span&gt;&lt;span class="nv"&gt;$CPU_VENDOR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Kernel cmdline       : &lt;/span&gt;&lt;span class="nv"&gt;$CMDLINE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Boot flags           : intel_iommu=&lt;/span&gt;&lt;span class="nv"&gt;$HAS_INTEL_FLAG&lt;/span&gt;&lt;span class="s2"&gt;  amd_iommu=&lt;/span&gt;&lt;span class="nv"&gt;$HAS_AMD_FLAG&lt;/span&gt;&lt;span class="s2"&gt;  iommu=pt=&lt;/span&gt;&lt;span class="nv"&gt;$HAS_PT_FLAG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Groups directory     : &lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;  (exists: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"IOMMU group count    : &lt;/span&gt;&lt;span class="nv"&gt;$GROUP_COUNT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Kernel says enabled  : &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Interrupt remapping  : &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IR_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Kernel says disabled : &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DISABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"IOMMU ACTIVE?        : &lt;/span&gt;&lt;span class="nv"&gt;$IOMMU_ACTIVE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo

&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"--- Kernel enable lines ---"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo
&lt;/span&gt;&lt;span class="k"&gt;fi
if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DISABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"--- Kernel disable lines ---"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DISABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo
&lt;/span&gt;&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# --- 5) Original: list only GPU-containing groups -------------------------&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== GPU-Containing IOMMU Groups ==="&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUP_COUNT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"(no IOMMU groups found)"&lt;/span&gt;
&lt;span class="k"&gt;else
  &lt;/span&gt;&lt;span class="nb"&gt;declare&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nv"&gt;GPU_COUNT_BY_GROUP&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;
  &lt;span class="nv"&gt;group_warnings&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;g &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$g&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;continue
    &lt;/span&gt;&lt;span class="nv"&gt;group_num&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$g&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;gpu_found&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false
    &lt;/span&gt;&lt;span class="nv"&gt;device_lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
    &lt;span class="nv"&gt;non_gpu_non_bridge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false
    &lt;/span&gt;&lt;span class="nv"&gt;gpu_count_in_this_group&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0

    &lt;span class="k"&gt;for &lt;/span&gt;d &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$g&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;/devices/&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$d&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;continue
      &lt;/span&gt;&lt;span class="nv"&gt;pci_addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$d&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
      &lt;span class="c"&gt;# -nns prints class code [XXXX] and vendor:device [vvvv:dddd]&lt;/span&gt;
      &lt;span class="nv"&gt;line&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;lspci &lt;span class="nt"&gt;-nns&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pci_addr&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pci_addr&lt;/span&gt;&lt;span class="s2"&gt; (unlisted)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
      device_lines+&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s1"&gt;$'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;

      &lt;span class="c"&gt;# Extract first [...] which is the class code, e.g. 0300, 0302, 0403, 0604, 0600&lt;/span&gt;
      &lt;span class="nv"&gt;class_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt;&lt;span class="s1"&gt;'[][]'&lt;/span&gt; &lt;span class="s1"&gt;'{print $2}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

      &lt;span class="c"&gt;# Detect GPUs / 3D controllers and their HDA audio functions&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s1"&gt;'VGA compatible controller|3D controller'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nv"&gt;gpu_found&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true
        &lt;/span&gt;&lt;span class="nv"&gt;gpu_count_in_this_group&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;gpu_count_in_this_group+1&lt;span class="k"&gt;))&lt;/span&gt;
      &lt;span class="k"&gt;fi&lt;/span&gt;

      &lt;span class="c"&gt;# Allowlist: 0300(VGA), 0302(3D), 0403(HDA audio), 0600(host bridge), 0604(PCI bridge)&lt;/span&gt;
      &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$class_code&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="k"&gt;in
        &lt;/span&gt;0300|0302|0403|0600|0604&lt;span class="p"&gt;)&lt;/span&gt; : &lt;span class="p"&gt;;;&lt;/span&gt;
        &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;non_gpu_non_bridge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
      &lt;span class="k"&gt;esac&lt;/span&gt;
    &lt;span class="k"&gt;done

    if&lt;/span&gt; &lt;span class="nv"&gt;$gpu_found&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"IOMMU Group &lt;/span&gt;&lt;span class="nv"&gt;$group_num&lt;/span&gt;&lt;span class="s2"&gt;:"&lt;/span&gt;
      &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$device_lines&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

      &lt;span class="c"&gt;# Track GPUs per group&lt;/span&gt;
      GPU_COUNT_BY_GROUP[&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$group_num&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;]=&lt;/span&gt;&lt;span class="nv"&gt;$gpu_count_in_this_group&lt;/span&gt;

      &lt;span class="c"&gt;# Warn if unexpected devices share the group with the GPU&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;$non_gpu_non_bridge&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;group_warnings+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"WARN: Group &lt;/span&gt;&lt;span class="nv"&gt;$group_num&lt;/span&gt;&lt;span class="s2"&gt; contains non-GPU, non-audio, non-bridge devices (consider different slot/CPU root complex or ACS)."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;fi
    fi
  done&lt;/span&gt;

  &lt;span class="c"&gt;# Post-checks&lt;/span&gt;
  &lt;span class="c"&gt;# 1) Each GPU should be alone (one GPU per group)&lt;/span&gt;
  &lt;span class="nv"&gt;shared_groups&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;gnum &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="p"&gt;!GPU_COUNT_BY_GROUP[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GPU_COUNT_BY_GROUP&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;$gnum&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 1 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;shared_groups+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$gnum&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;fi
  done

  if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${#&lt;/span&gt;&lt;span class="nv"&gt;shared_groups&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo
    echo&lt;/span&gt; &lt;span class="s2"&gt;"WARN: Multiple GPUs share these IOMMU groups: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;shared_groups&lt;/span&gt;&lt;span class="p"&gt;[*]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; (prefer one GPU per group for VFIO)."&lt;/span&gt;
  &lt;span class="k"&gt;fi&lt;/span&gt;

  &lt;span class="c"&gt;# 2) Any non-bridge co-residents?&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${#&lt;/span&gt;&lt;span class="nv"&gt;group_warnings&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo
    printf&lt;/span&gt; &lt;span class="s2"&gt;"%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;group_warnings&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;fi
fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what a good summary should look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;===&lt;/span&gt; IOMMU Summary &lt;span class="o"&gt;===&lt;/span&gt;
CPU vendor           : AuthenticAMD
Kernel cmdline       : &lt;span class="nv"&gt;BOOT_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/boot/vmlinuz-6.8.0-71-generic &lt;span class="nv"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/mapper/vgroot-lvroot ro systemd.unified_cgroup_hierarchy&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false &lt;/span&gt;&lt;span class="nv"&gt;default_hugepagesz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1G &lt;span class="nv"&gt;hugepages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;576 &lt;span class="nv"&gt;hugepagesz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1G nomodeset &lt;span class="nv"&gt;video&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;efifb:off &lt;span class="nv"&gt;iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pt &lt;span class="nv"&gt;pci&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;realloc &lt;span class="nv"&gt;pcie_aspm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;off &lt;span class="nv"&gt;amd_iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;on vfio-pci.ids&lt;span class="o"&gt;=&lt;/span&gt;10de:0000,10de:204b,10de:22e8,10de:2bb1 modprobe.blacklist&lt;span class="o"&gt;=&lt;/span&gt;nouveau,nvidia,nvidiafb,snd_hda_intel
Boot flags           : &lt;span class="nv"&gt;intel_iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;no  &lt;span class="nv"&gt;amd_iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;yes  &lt;/span&gt;&lt;span class="nv"&gt;iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;yes
&lt;/span&gt;Groups directory     : /sys/kernel/iommu_groups  &lt;span class="o"&gt;(&lt;/span&gt;exists: &lt;span class="nb"&gt;yes&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
IOMMU group count    : 57
Kernel says enabled  : no
Interrupt remapping  : no
Kernel says disabled : no
IOMMU ACTIVE?        : &lt;span class="nb"&gt;yes&lt;/span&gt;

&lt;span class="o"&gt;===&lt;/span&gt; GPU-Containing IOMMU Groups &lt;span class="o"&gt;===&lt;/span&gt;
IOMMU Group 13:
c1:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2bb1] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
c1:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;

IOMMU Group 16:
c6:00.0 PCI bridge &lt;span class="o"&gt;[&lt;/span&gt;0604]: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge &lt;span class="o"&gt;[&lt;/span&gt;1a03:1150] &lt;span class="o"&gt;(&lt;/span&gt;rev 06&lt;span class="o"&gt;)&lt;/span&gt;
c7:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: ASPEED Technology, Inc. ASPEED Graphics Family &lt;span class="o"&gt;[&lt;/span&gt;1a03:2000] &lt;span class="o"&gt;(&lt;/span&gt;rev 52&lt;span class="o"&gt;)&lt;/span&gt;

IOMMU Group 27:
81:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2bb1] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
81:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;

IOMMU Group 42:
01:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2bb1] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
01:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;

IOMMU Group 54:
41:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2bb1] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
41:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, IOMMU support is enabled, and all GPUs and their corresponding audio devices are in separate IOMMU groups.&lt;/p&gt;

&lt;p&gt;Sometimes you may see PCI bridges in the GPU IOMMU group. This is normal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;===&lt;/span&gt; GPU-Containing IOMMU Groups &lt;span class="o"&gt;===&lt;/span&gt;
IOMMU Group 13:
40:01.0 Host bridge &lt;span class="o"&gt;[&lt;/span&gt;0600]: Advanced Micro Devices, Inc. &lt;span class="o"&gt;[&lt;/span&gt;AMD] Starship/Matisse PCIe Dummy Host Bridge &lt;span class="o"&gt;[&lt;/span&gt;1022:1482]
40:01.1 PCI bridge &lt;span class="o"&gt;[&lt;/span&gt;0604]: Advanced Micro Devices, Inc. &lt;span class="o"&gt;[&lt;/span&gt;AMD] Starship/Matisse GPP Bridge &lt;span class="o"&gt;[&lt;/span&gt;1022:1483]
41:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2b85] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
41:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;

IOMMU Group 32:
20:03.0 Host bridge &lt;span class="o"&gt;[&lt;/span&gt;0600]: Advanced Micro Devices, Inc. &lt;span class="o"&gt;[&lt;/span&gt;AMD] Starship/Matisse PCIe Dummy Host Bridge &lt;span class="o"&gt;[&lt;/span&gt;1022:1482]
20:03.1 PCI bridge &lt;span class="o"&gt;[&lt;/span&gt;0604]: Advanced Micro Devices, Inc. &lt;span class="o"&gt;[&lt;/span&gt;AMD] Starship/Matisse GPP Bridge &lt;span class="o"&gt;[&lt;/span&gt;1022:1483]
25:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2b85] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
25:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
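&lt;p&gt;If you want to double-check a single card without rerunning the whole script, its group membership can be read straight from sysfs. A minimal sketch, assuming the standard sysfs layout; the PCI address is an example, and the optional root parameter exists only to make the helper testable:&lt;/p&gt;

```shell
# Minimal manual check: read a device's IOMMU group from sysfs.
# 0000:01:00.0 is an example PCI address; adjust for your GPU.
iommu_group_of() {
  # $1: PCI address; $2: optional sysfs root override (useful for testing)
  basename "$(readlink -f "${2:-}/sys/bus/pci/devices/$1/iommu_group")"
}

iommu_group_of 0000:01:00.0
# then list everything that shares the group:
# ls /sys/kernel/iommu_groups/&lt;group&gt;/devices
```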



&lt;h2&gt;
  
  
  3. Leverage 1G Huge Pages
&lt;/h2&gt;

&lt;p&gt;This step is optional. However, if you have more than 512 GB of RAM, it is highly encouraged. In my experience, besides the performance benefit, 1 GiB huge pages make VM startup much more reliable on high-memory systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt; 128 GB RAM&lt;/strong&gt;: usually skip (benefit is small).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;128–512 GB&lt;/strong&gt;: optional; can reduce latency jitter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;gt; 512 GB&lt;/strong&gt;: recommended for reliability and predictable performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why 1 GiB pages help&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer page-table walks → fewer TLB misses.&lt;/li&gt;
&lt;li&gt;Lower page management overhead.&lt;/li&gt;
&lt;li&gt;More predictable VM start times on large RAM allocations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.1 Check Huge Page Support
&lt;/h3&gt;

&lt;p&gt;To confirm 1 GiB huge page support on your system, check for the &lt;code&gt;pdpe1gb&lt;/code&gt; CPU flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-m1&lt;/span&gt; pdpe1gb /proc/cpuinfo &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✓ CPU supports 1GiB pages"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✗ No 1GiB page support"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 Allocate Huge Pages
&lt;/h3&gt;

&lt;p&gt;Determine how much memory you want to dedicate to the VMs; you need to reserve that amount for huge pages, plus a buffer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note that the memory reserved for huge pages will not be usable by the host system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example, if you want to dedicate &lt;code&gt;2000 GB&lt;/code&gt; to virtual machines with an &lt;code&gt;80 GB&lt;/code&gt; buffer, you would need &lt;code&gt;2080&lt;/code&gt; huge pages.&lt;/p&gt;
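&lt;p&gt;The arithmetic is trivial but worth spelling out, using the example numbers above:&lt;/p&gt;

```shell
# Sizing sketch: 1 GiB pages needed = VM allocation + safety buffer (both in GiB).
vm_gb=2000      # memory dedicated to VMs
buffer_gb=80    # empirical safety buffer
pages=$((vm_gb + buffer_gb))
echo "nr_hugepages=$pages"
# → nr_hugepages=2080
```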

&lt;p&gt;I use the following empirically validated table to determine the huge page configuration on a high-memory multi-GPU system.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Total System RAM&lt;/th&gt;
&lt;th&gt;VM Allocation&lt;/th&gt;
&lt;th&gt;Buffer&lt;/th&gt;
&lt;th&gt;Huge Pages&lt;/th&gt;
&lt;th&gt;Left for System&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;768 GB&lt;/td&gt;
&lt;td&gt;640 (8x80) GB&lt;/td&gt;
&lt;td&gt;60 GB&lt;/td&gt;
&lt;td&gt;700&lt;/td&gt;
&lt;td&gt;68 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024 GB&lt;/td&gt;
&lt;td&gt;800 (8x100) GB&lt;/td&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;td&gt;880&lt;/td&gt;
&lt;td&gt;144 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1256 GB&lt;/td&gt;
&lt;td&gt;1040 (8x130) GB&lt;/td&gt;
&lt;td&gt;100 GB&lt;/td&gt;
&lt;td&gt;1140&lt;/td&gt;
&lt;td&gt;116 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1512 GB&lt;/td&gt;
&lt;td&gt;1280 (8x160) GB&lt;/td&gt;
&lt;td&gt;120 GB&lt;/td&gt;
&lt;td&gt;1300&lt;/td&gt;
&lt;td&gt;212 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048 GB&lt;/td&gt;
&lt;td&gt;1760 (8x220) GB&lt;/td&gt;
&lt;td&gt;160 GB&lt;/td&gt;
&lt;td&gt;1920&lt;/td&gt;
&lt;td&gt;128 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4096 GB&lt;/td&gt;
&lt;td&gt;3680 (8x460) GB&lt;/td&gt;
&lt;td&gt;200 GB&lt;/td&gt;
&lt;td&gt;3880&lt;/td&gt;
&lt;td&gt;216 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Is there a reliable formula to determine the huge page buffer size? Good question. If you know one, let me know in the comments. It makes sense that we need to leave some memory for the system, but it feels like the gap between the memory dedicated to VM allocation and the number of huge pages is unnecessary. After VM startup, the system reports exactly the requested number of huge pages allocated, so why do we need a buffer, and how big should it be? Is it because of fragmentation? Empirically, I have confirmed that it is needed: without a buffer, I occasionally ran into OOM errors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run the following command to allocate 2080 pages (it will take a while):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;2080 | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check that the huge pages were allocated, run &lt;code&gt;grep -i huge /proc/meminfo&lt;/code&gt;. Look at the &lt;code&gt;Hugepagesize&lt;/code&gt; and &lt;code&gt;Hugetlb&lt;/code&gt; values: they show the huge page size and the total amount of RAM reserved for huge pages. You should see output like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;AnonHugePages:     79872 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    2080
HugePages_Free:     1580
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        2181038080 kB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
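&lt;p&gt;On multi-socket machines it is also worth checking how the 1 GiB pages were spread across NUMA nodes, since a VM pinned to one node can fail to start even when the global count looks fine. A small sketch, assuming the standard sysfs layout; the optional root parameter exists only to make the helper testable:&lt;/p&gt;

```shell
# Print the number of reserved 1 GiB pages on each NUMA node.
show_1g_pages_per_node() {
  # $1: optional sysfs root override (useful for testing)
  local root="${1:-}"
  local n
  for n in "$root"/sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages; do
    [ -r "$n" ] && printf '%s: %s\n' "${n#"$root"}" "$(cat "$n")"
  done
  return 0
}

show_1g_pages_per_node
```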



&lt;p&gt;To deallocate, invoke:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;0 | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.3 Make Huge Pages Persistent
&lt;/h3&gt;

&lt;p&gt;Edit the &lt;code&gt;/etc/default/grub&lt;/code&gt; file and modify the line containing &lt;code&gt;GRUB_CMDLINE_LINUX&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Add &lt;code&gt;default_hugepagesz=1G hugepagesz=1G hugepages=&amp;lt;num&amp;gt;&lt;/code&gt; to the &lt;code&gt;GRUB_CMDLINE_LINUX&lt;/code&gt; options. The &lt;code&gt;&amp;lt;num&amp;gt;&lt;/code&gt; is the number of huge pages to allocate. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;GRUB_CMDLINE_LINUX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"... default_hugepagesz=1G hugepagesz=1G hugepages=200"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Be careful. If you specify more huge pages than the system can allocate, the machine will not boot.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Apply the GRUB changes, reboot, and verify that the huge pages are allocated (or defer this until the end).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;update-grub
&lt;span class="nb"&gt;sudo &lt;/span&gt;reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.4 (Optional) Mount a Huge Page Filesystem
&lt;/h3&gt;

&lt;p&gt;Many systems already have &lt;code&gt;/dev/hugepages&lt;/code&gt;. If not, or if you want a dedicated mount:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/hugepages-1G
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-t&lt;/span&gt; hugetlbfs &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;pagesize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1G none /mnt/hugepages-1G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that the mount point is present by running &lt;code&gt;grep hugetlbfs /proc/mounts&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You should see something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hugetlbfs /dev/hugepages hugetlbfs rw,nosuid,nodev,relatime,pagesize&lt;span class="o"&gt;=&lt;/span&gt;1024M 0 0
hugetlbfs /mnt/hugepages-1G hugetlbfs rw,relatime,pagesize&lt;span class="o"&gt;=&lt;/span&gt;1024M 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make the mount persistent across reboots, add it to &lt;code&gt;/etc/fstab&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"none /mnt/hugepages-1G hugetlbfs pagesize=1G 0 0"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/fstab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.5 Configure your Virtualization Software to use Huge Pages
&lt;/h3&gt;

&lt;p&gt;Neither Proxmox nor libvirt uses huge pages by default.&lt;/p&gt;

&lt;p&gt;To use them in libvirt, add the following section to the domain XML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;memoryBacking&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;hugepages&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;page&lt;/span&gt; &lt;span class="na"&gt;size=&lt;/span&gt;&lt;span class="s"&gt;'1048576'&lt;/span&gt; &lt;span class="na"&gt;unit=&lt;/span&gt;&lt;span class="s"&gt;'KiB'&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/hugepages&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;locked/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/memoryBacking&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the Proxmox CLI, it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;qm &lt;span class="nb"&gt;set&lt;/span&gt; &amp;lt;vmid&amp;gt; &lt;span class="nt"&gt;--hugepages&lt;/span&gt; 1024   &lt;span class="c"&gt;# use 1GiB pages&lt;/span&gt;
qm &lt;span class="nb"&gt;set&lt;/span&gt; &amp;lt;vmid&amp;gt; &lt;span class="nt"&gt;--keephugepages&lt;/span&gt; 1  &lt;span class="c"&gt;# optional: keep reserved after shutdown&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Bind to VFIO Early
&lt;/h2&gt;

&lt;p&gt;For maximum stability, have VFIO claim the GPU at boot so no runtime driver swaps occur (Proxmox/libvirt will otherwise bind/unbind around VM start/stop).&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 Identify the PCI IDs to bind
&lt;/h3&gt;

&lt;p&gt;First, you need to determine the PCI vendor ID and device ID for your GPUs.&lt;/p&gt;

&lt;p&gt;List all NVIDIA functions (display + audio, and any auxiliary functions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lspci &lt;span class="nt"&gt;-nn&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; nvidia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example (RTX 5090):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;01:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2b85] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
01:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
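&lt;p&gt;The ID pairs in square brackets are what goes into &lt;code&gt;vfio-pci.ids&lt;/code&gt;. A small sketch of turning that output into the comma-separated list; here it is fed the sample lines above, while on a real host you would pipe in &lt;code&gt;lspci -nn | grep -i nvidia&lt;/code&gt; instead:&lt;/p&gt;

```shell
# Extract unique NVIDIA (10de) vendor:device pairs and join them with commas.
to_vfio_ids() {
  grep -oE '\[10de:[0-9a-f]{4}\]' | tr -d '[]' | sort -u | paste -sd, -
}

printf '%s\n' \
  '01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2b85] (rev a1)' \
  '01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)' |
  to_vfio_ids
# → 10de:22e8,10de:2b85
```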



&lt;h3&gt;
  
  
  4.2 Give VFIO first claim
&lt;/h3&gt;

&lt;p&gt;Add the following lines to &lt;code&gt;GRUB_CMDLINE_LINUX_DEFAULT&lt;/code&gt; in &lt;code&gt;/etc/default/grub&lt;/code&gt;, replacing the PCI vendor and device IDs with the appropriate values. Keep any other options you already have.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GRUB_CMDLINE_LINUX_DEFAULT="modprobe.blacklist=nouveau,nvidia,nvidiafb,snd_hda_intel vfio-pci.ids=10de:2b85,10de:22e8 ..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Proxmox is likely using systemd-boot by default instead of GRUB. &lt;a href="https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot" rel="noopener noreferrer"&gt;Check the bootloader&lt;/a&gt; you're using and adjust the kernel command line accordingly.&lt;/p&gt;

&lt;p&gt;Many online manuals suggest adding VFIO modules to &lt;code&gt;/etc/modprobe.d/vfio.conf&lt;/code&gt;, but this approach has not always worked for me. I recommend early binding via the kernel command line.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4.3 Ensure VFIO is in the initramfs
&lt;/h3&gt;

&lt;p&gt;We need to make sure that the VFIO modules are loaded early in the boot process. To achieve this, we include them in the &lt;code&gt;initramfs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/initramfs-tools/modules &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.4 Reboot and verify
&lt;/h3&gt;

&lt;p&gt;Update GRUB and the initramfs, then reboot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;update-initramfs &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-k&lt;/span&gt; all
&lt;span class="nb"&gt;sudo &lt;/span&gt;update-grub
&lt;span class="nb"&gt;sudo &lt;/span&gt;reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the reboot, check that the VFIO driver is in use with &lt;code&gt;lspci -k | grep -A 3 -i nvidia&lt;/code&gt;. You should see &lt;code&gt;vfio-pci&lt;/code&gt; as the kernel driver in use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;81:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
    Subsystem: Gigabyte Technology Co., Ltd Device 416f
    Kernel driver &lt;span class="k"&gt;in &lt;/span&gt;use: vfio-pci
    Kernel modules: nvidiafb, nouveau
81:00.1 Audio device: NVIDIA Corporation Device 22e8 &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
    Subsystem: NVIDIA Corporation Device 0000
    Kernel driver &lt;span class="k"&gt;in &lt;/span&gt;use: vfio-pci
    Kernel modules: snd_hda_intel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
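&lt;p&gt;Checking every function by eye gets tedious on an 8-GPU box. Here is a sketch of an automated check; it parses the sample output above, while on a real host you would pipe in &lt;code&gt;lspci -k -d 10de:&lt;/code&gt; instead:&lt;/p&gt;

```shell
# Flag any listed PCI function whose "Kernel driver in use" is not vfio-pci.
check_vfio() {
  awk '/^[0-9a-f]/ { dev = $1 }
       /Kernel driver in use:/ { if ($NF != "vfio-pci") { print dev " uses " $NF; bad = 1 } }
       END { exit bad }'
}

printf '%s\n' \
  '81:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 (rev a1)' \
  '    Kernel driver in use: vfio-pci' \
  '81:00.1 Audio device: NVIDIA Corporation Device 22e8 (rev a1)' \
  '    Kernel driver in use: snd_hda_intel' |
  check_vfio || echo 'WARN: not all functions are bound to vfio-pci'
```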



&lt;blockquote&gt;
&lt;p&gt;To be fair, there was one machine where this technique for binding VFIO failed: the system kept aggressively binding the &lt;code&gt;snd_hda_intel&lt;/code&gt; driver to the GPU audio function. However, this method worked for me in all other cases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5. Other GRUB Options
&lt;/h2&gt;

&lt;p&gt;Here is a summary of other kernel command line options that you may want to consider, along with my thoughts on each.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pci=realloc&lt;/code&gt;: Forces the kernel to reassign PCI bus resources (MMIO/IO BARs) from scratch, ignoring what the firmware/BIOS assigned. It helps avoid issues when the BIOS didn't allocate enough space for devices (common with large GPUs or multiple devices) and fixes “BAR can't be assigned” or “resource busy” errors. &lt;em&gt;This option is helpful. I like to include it in the guest OS kernel params as well, where it occasionally helps to work around BAR allocation issues. However, there is no need to list it unless the system has PCI device enumeration issues.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;iommu=pt&lt;/code&gt;: IOMMU passthrough mode tells the kernel to enable the IOMMU but use pass-through DMA mappings by default. For VFIO GPU passthrough, this allows devices to access physical memory directly with minimal performance penalty. &lt;em&gt;I haven't had a chance to measure the performance gains, so I can only say that this option didn't break anything.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pcie_aspm=off&lt;/code&gt;: Disables PCIe Active State Power Management, a power-saving feature that reduces PCIe link power in idle states. Some PCIe devices (especially GPUs) have trouble retraining links or waking from ASPM low-power states, leading to hangs or “device inaccessible” errors. I added this option to my configs after losing a lot of time on the &lt;a href="https://dev.to/blog/bug-bounty-nvidia-reset-bug"&gt;Reset Bug&lt;/a&gt;. &lt;em&gt;It didn't help. I don't consider this option helpful at the moment, but I am still evaluating it.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nomodeset&lt;/code&gt;: Disable kernel mode setting (KMS) for all GPUs; prevents DRM drivers from taking over the console. &lt;strong&gt;This option is intended for use with headless servers only. It can break desktop/console output. I typically use it since we're working with headless servers.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;video=efifb:off&lt;/code&gt;: Disables the firmware EFI framebuffer so simpledrm/efifb won’t grab the boot GPU before VFIO claims it. &lt;em&gt;This option is outdated and has no effect on systems with modern kernels. I list it for completeness.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;intel_iommu=on&lt;/code&gt; / &lt;code&gt;amd_iommu=on&lt;/code&gt;: Enable IOMMU support for Intel and AMD. &lt;em&gt;These are enabled by default, so there is no need to add them to kernel parameters&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is how a typical kernel command line might look on a headless server with over 500 GB of RAM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nomodeset
modprobe.blacklist&lt;span class="o"&gt;=&lt;/span&gt;nouveau,nvidia,nvidiafb,snd_hda_intel
vfio-pci.ids&lt;span class="o"&gt;=&lt;/span&gt;10de:2b85,10de:22e8
&lt;span class="nv"&gt;default_hugepagesz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1G &lt;span class="nv"&gt;hugepagesz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1G &lt;span class="nv"&gt;hugepages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;400
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
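&lt;p&gt;After rebooting, it is worth confirming that the options actually reached the running kernel; the live command line is exposed in &lt;code&gt;/proc/cmdline&lt;/code&gt;:&lt;/p&gt;

```shell
# Show the effective huge page setting, or a notice if it is absent.
grep -o 'hugepages=[0-9]*' /proc/cmdline 2>/dev/null || echo 'hugepages not set on this kernel'
```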



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;VFIO GPU passthrough is a finicky process, sensitive to host hardware and software configuration. However, with enough diligence, you can make it robust and reliable. &lt;strong&gt;I strongly believe in this approach and rely on VFIO GPU passthrough as the primary tool for our GPU rental service at &lt;a href="https://cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I hope this guide helped you improve your homelab or data center setup. If you notice any inaccuracies or have suggestions, please don't hesitate to let me know so we can improve the workflow together.&lt;/p&gt;

&lt;p&gt;Final host checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable IOMMU, Above 4G, and (where applicable) ReBAR in the BIOS.&lt;/li&gt;
&lt;li&gt;Verify clean IOMMU groups; each GPU (+ audio) isolated.&lt;/li&gt;
&lt;li&gt;Bind to vfio-pci early.&lt;/li&gt;
&lt;li&gt;Size huge pages (1 GiB on high-RAM hosts) and confirm in /proc/meminfo.&lt;/li&gt;
&lt;li&gt;Configure other kernel command-line options as needed.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpu</category>
      <category>kvm</category>
      <category>qemu</category>
      <category>virtualmachine</category>
    </item>
    <item>
      <title>UnSaaS your Stack with Self-hosted Cloud IDEs</title>
      <dc:creator>Dmitry Trifonov</dc:creator>
      <pubDate>Wed, 20 Aug 2025 22:53:54 +0000</pubDate>
      <link>https://dev.to/novibecoding/unsaas-your-stack-with-self-hosted-cloud-ides-27j6</link>
      <guid>https://dev.to/novibecoding/unsaas-your-stack-with-self-hosted-cloud-ides-27j6</guid>
      <description>&lt;p&gt;I am a PC enthusiast and use it as much as possible. However, with the speed at which LLMs are growing in size, it is challenging to avoid the cloud for AI development.&lt;/p&gt;

&lt;p&gt;Many good GPU-enabled SaaS options exist for remote development, like &lt;a href="https://colab.google/" rel="noopener noreferrer"&gt;Google Colab&lt;/a&gt;. Yet, if you need to go beyond the free tier, the compute cost on these SaaS platforms will quickly empty your pockets. Additionally, self-hosting allows you to use your favorite tools and is the most secure option if you do it right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw1ztabxvk2x7rqxn1ja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw1ztabxvk2x7rqxn1ja.png" alt="JetBrains, Zed, VS Code and Jupyter Lab" width="800" height="533"&gt;&lt;/a&gt;&lt;em&gt;JetBrains, Zed, VS Code and Jupyter Lab&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Renting a GPU Server
&lt;/h2&gt;

&lt;p&gt;There are &lt;a href="https://research.aimultiple.com/cloud-gpu-providers/" rel="noopener noreferrer"&gt;plenty of places&lt;/a&gt; to rent GPUs, and this tutorial is valid for any machine with SSH access. I am obviously using our own service &lt;a href="https://www.cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt; to test solutions in this tutorial. It provides good value, supports virtual machines, and provisions them fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jupyter Lab — Plain and Simple
&lt;/h2&gt;

&lt;p&gt;Jupyter Lab is my go-to option for short experiments. It is the simplest of these IDEs and the easiest to pick up if you work in Python and are familiar with Jupyter. It contains everything needed for a quick experiment: a file explorer, a command line, and the Jupyter Notebook.&lt;/p&gt;

&lt;p&gt;Install the necessary system dependencies after starting a VM and connecting to it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install python3-venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Create a virtual environment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m venv venv
source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Install Jupyter Lab and start it. Replace the JUPYTER_TOKEN value with your own secret.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install jupyterlab
JUPYTER_TOKEN=ide-tutorial jupyter lab --no-browser --port=8080 --ip=0.0.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You need to add the &lt;em&gt;--ip=0.0.0.0&lt;/em&gt; flag to make the notebook reachable from outside the remote server, since by default it only listens on localhost. The IDE will be available at &lt;a href="http://{node-ip-address}:8080/" rel="noopener noreferrer"&gt;http://{node-ip-address}:8080/&lt;/a&gt;. Specify JUPYTER_TOKEN when prompted to log in.&lt;/p&gt;
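
&lt;p&gt;Exposing the port to the whole internet is convenient but not ideal security-wise. A safer alternative, assuming you have SSH access to the node, is to keep Jupyter listening on localhost and forward the port through an SSH tunnel (the username and IP below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Forward local port 8080 to port 8080 on the remote machine;
# -N means no remote command, just the tunnel
ssh -N -L 8080:localhost:8080 riftuser@{node-ip-address}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the tunnel running, start Jupyter without the --ip flag and open http://localhost:8080/ in your local browser.&lt;/p&gt;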

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5016%2F1%2AC5DNTr1XKGr5xIuVovBdnA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5016%2F1%2AC5DNTr1XKGr5xIuVovBdnA.png" alt="Jupyter Lab hosted on [cloudrift.ai](https://www.cloudrift.ai/)" width="800" height="444"&gt;&lt;/a&gt;&lt;em&gt;Jupyter Lab hosted on &lt;a href="https://www.cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  VS Code — Most Versatile
&lt;/h2&gt;

&lt;p&gt;VS Code is convenient if you need to do more serious development work. It includes a debugger, supports many languages, and provides a command line and file explorer, along with a gazillion features you probably won’t need.&lt;/p&gt;

&lt;p&gt;Install the &lt;a href="https://github.com/coder/code-server" rel="noopener noreferrer"&gt;code-server&lt;/a&gt; on a remote machine and run it using the following command.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://code-server.dev/install.sh | sh
PASSWORD=ide-tutorial code-server --bind-addr 0.0.0.0:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Don’t forget to substitute the password with your own. The IDE will be available at &lt;a href="http://{node-ip-address}:8080/" rel="noopener noreferrer"&gt;http://{node-ip-address}:8080/&lt;/a&gt;.&lt;/p&gt;
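
&lt;p&gt;Instead of passing the password on the command line, code-server can read its settings from a config file. A minimal sketch, using code-server’s default config location (the password value is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ~/.config/code-server/config.yaml
bind-addr: 0.0.0.0:8080
auth: password
password: ide-tutorial
cert: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After editing the file, start code-server with no extra flags and it will pick these settings up.&lt;/p&gt;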

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5004%2F1%2Al3P6IU9LSUSjGfpSJWIQpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5004%2F1%2Al3P6IU9LSUSjGfpSJWIQpw.png" alt="VS Code hosted on [cloudrift.ai](https://www.cloudrift.ai/) via [code-server](https://github.com/coder/code-server)" width="800" height="470"&gt;&lt;/a&gt;&lt;em&gt;VS Code hosted on &lt;a href="https://www.cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt; via &lt;a href="https://github.com/coder/code-server" rel="noopener noreferrer"&gt;code-server&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  JetBrains — Neat Features
&lt;/h2&gt;

&lt;p&gt;I am a fan of JetBrains and use its IDEs for my local development. JetBrains takes a different approach from the aforementioned code editors: instead of running the IDE in the browser, your local IDE communicates with the remote server, so it feels like working locally. Additionally, it offers nice features like cloning a repository on the remote machine using your local SSH agent for authentication.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At the time of this writing, the JetBrains Gateway is in Beta. Many features were not working as expected in PyCharm or RustRover (testing on Ubuntu 22.04). Hopefully, the situation will improve over time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To start, open any JetBrains IDE (update to the latest version) and select &lt;em&gt;File -&amp;gt; Remote Development&lt;/em&gt;. You can also do it without installing JetBrains IDE via &lt;a href="https://www.jetbrains.com/remote-development/gateway/" rel="noopener noreferrer"&gt;JetBrains Gateway&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94pnhpfrizsic67s7shw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94pnhpfrizsic67s7shw.png" width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “New Connection” and specify riftuser as the username and a node IP address as the Host.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84tslx3insqkzbkha7rd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84tslx3insqkzbkha7rd.png" width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the next screen, choose the IDE you want to use and specify ~ as the Project directory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j2ytmq3hqjnx70ihtq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j2ytmq3hqjnx70ihtq4.png" width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The IDE will take some time to download and configure. Afterwards, you can use it just like your local one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5590%2F1%2Acp3tWKZzJpT3bm9tIp53xQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5590%2F1%2Acp3tWKZzJpT3bm9tIp53xQ.png" alt="PyCharm running on a remote server on [cloudrift.ai](https://www.cloudrift.ai/)" width="800" height="571"&gt;&lt;/a&gt;&lt;em&gt;PyCharm running on a remote server on &lt;a href="https://www.cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PyCharm has another, more mature feature for remote development called a remote interpreter. To use it, go to Settings -&amp;gt; Python Interpreter -&amp;gt; Add Interpreter -&amp;gt; On SSH and configure the connection similarly. It will synchronize your code with the remote server and run your app remotely. It is a good option if you have a fast, symmetric internet connection. Otherwise, the experience might be sluggish, and you will need to configure which directories are synchronized to avoid uploading heavy ones like the Python virtual environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8fkad888v6gpdpo9s09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8fkad888v6gpdpo9s09.png" alt="Using remote interpreter feature in JetBrains IDE" width="800" height="588"&gt;&lt;/a&gt;&lt;em&gt;Using remote interpreter feature in JetBrains IDE&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Zed — Fast and Lean
&lt;/h2&gt;

&lt;p&gt;I just learned about Zed and was quite impressed. The installation and remote access were a breeze, and it is also the most responsive of the tested IDEs. Like JetBrains, Zed is a local editor that communicates with the remote server. So, &lt;a href="https://zed.dev/docs/getting-started" rel="noopener noreferrer"&gt;install&lt;/a&gt; Zed &lt;strong&gt;locally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After installation, to connect to a remote server, click &lt;em&gt;File -&amp;gt; Open Remote -&amp;gt; Connect New Server&lt;/em&gt; and specify the SSH command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ohzv27mxdsxygjj5o24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ohzv27mxdsxygjj5o24.png" alt="Connecting to a remote server using Zed" width="800" height="547"&gt;&lt;/a&gt;&lt;em&gt;Connecting to a remote server using Zed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That’s it. Afterward, you would typically clone your repository using the integrated terminal, and you’re all set.&lt;/p&gt;
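
&lt;p&gt;For example, once connected, cloning a project from the integrated terminal looks like this (the repository URL is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone your project onto the remote machine and open it
git clone https://github.com/your-name/your-project.git
cd your-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;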

&lt;p&gt;Despite its strengths, Zed is still a niche product. You often need to edit its config files by hand, and as of today the team has yet to add &lt;a href="https://github.com/zed-industries/zed/issues/5065" rel="noopener noreferrer"&gt;Build and Debug&lt;/a&gt; features. However, if you’re comfortable working from the command line, Zed might be the best option for remote development. The IDE comes with an AI assistant and all the modern features, like support for MCP servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F7444%2F1%2ALvGqkSMOZmzLehCU6ouZ2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F7444%2F1%2ALvGqkSMOZmzLehCU6ouZ2w.png" alt="Zed working with a remote server on [cloudrift.ai](https://www.cloudrift.ai/)" width="800" height="455"&gt;&lt;/a&gt;&lt;em&gt;Zed working with a remote server on &lt;a href="https://www.cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;My recommendations for self-hosted cloud editors as of April 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Jupyter Lab&lt;/strong&gt; is best for simple Python projects if you’re familiar with Jupyter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VS Code&lt;/strong&gt; will handle most programming tasks well. It is the only tested editor with a properly working Build and Debug feature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zed&lt;/strong&gt; is the best option if you’re proficient with the command line. It is easy to set up, fast, and has modern AI and collaborative features. However, it doesn’t have Build and Debug features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JetBrains Gateway&lt;/strong&gt; is in Beta and difficult to recommend at the moment. However, it has great potential due to its neat features that seamlessly blend local and remote environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Take a look at this &lt;a href="https://github.com/awesome-selfhosted/awesome-selfhosted?tab=readme-ov-file#software-development---ide--tools" rel="noopener noreferrer"&gt;list&lt;/a&gt; if you want to explore more options.&lt;/p&gt;

&lt;p&gt;The IDE choice is personal, so choose the one you’re most familiar with and enjoy working with. Happy coding!&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>cloud</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
