<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dmitry Trifonov</title>
    <description>The latest articles on DEV Community by Dmitry Trifonov (@novibecoding).</description>
    <link>https://dev.to/novibecoding</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3181089%2F0838ce1f-c589-40b8-ba8b-339fe8f1bddf.jpeg</url>
      <title>DEV Community: Dmitry Trifonov</title>
      <link>https://dev.to/novibecoding</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/novibecoding"/>
    <language>en</language>
    <item>
      <title>Evolution of GPU Programming</title>
      <dc:creator>Dmitry Trifonov</dc:creator>
      <pubDate>Wed, 03 Sep 2025 18:02:42 +0000</pubDate>
      <link>https://dev.to/novibecoding/evolution-of-gpu-programming-3o94</link>
      <guid>https://dev.to/novibecoding/evolution-of-gpu-programming-3o94</guid>
      <description>&lt;h3&gt;
  
  
  From Smart Pixels to the Backbone of an AI-driven World
&lt;/h3&gt;

&lt;p&gt;Every decade, GPUs have reinvented themselves - from drawing triangles to generating worlds and, now, reasoning with language. I have realized that throughout my entire programming journey, I have been working closely with GPUs and have tried countless ways to program them: writing pixel shaders in GLSL, implementing real-time 3D scanning algorithms in OpenCL, and optimizing deep learning models in PyTorch and TensorFlow. So what better way to share my experience than a blog post about the evolution of GPU programming, full of &lt;strong&gt;nostalgia and memes&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;A lot has changed in the GPU programming landscape over the years: new programming models, new frameworks, and new hardware architectures have emerged. There is little practical reason to study the older approaches today; the evolutionary path itself, however, is quite interesting. If you're an AI expert or a developer in another field, it can broaden your expertise or give you the inspiration to dive into the world of GPU programming. It can also spark new ideas for current problems, especially since some of the issues we face in AI today were already faced by graphics programmers 25 years ago. If you are a GPU programming veteran, or not into programming at all - enjoy the story and the memes.&lt;/p&gt;

&lt;p&gt;Here is a mildly entertaining, nostalgia-fueled journey through the history of GPU programming, from making brick walls look bumpy in 2000 to optimizing attention mechanisms in LLMs in 2025. Feel free to skip the code snippets if you're not interested in programming, or if you already know the material and would rather enjoy the story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyts846qahc9uy70va0a9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyts846qahc9uy70va0a9.jpg" alt=" " width="519" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Smart Pixels
&lt;/h2&gt;

&lt;p&gt;In the early 2000s, GPUs were used exclusively for visualization, and the rendering pipeline was completely fixed-function. It was akin to HTML: you would declare your scene - geometry, textures, positions of lights and camera - and the GPU would take care of rendering it. You could, of course, customize the result on the fly, but only in a limited way, by changing parameters of predefined functions, and this customization happened entirely on the CPU side.&lt;/p&gt;

&lt;p&gt;Here is a simple example of rendering a triangle using old-school OpenGL, taken from &lt;a href="https://cs.lmu.edu/~ray/notes/openglexamples/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Set every pixel in the frame buffer to the current clear color.&lt;/span&gt;
&lt;span class="n"&gt;glClear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GL_COLOR_BUFFER_BIT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Drawing is done by specifying a sequence of vertices. The way these&lt;/span&gt;
&lt;span class="c1"&gt;// vertices are connected. GL_POLYGON constructs a filled polygon.&lt;/span&gt;
&lt;span class="n"&gt;glBegin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GL_POLYGON&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;glColor3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;glVertex3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;glColor3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;glVertex3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;glColor3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;glVertex3f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;glEnd&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Flush drawing command buffer to make drawing happen as soon as possible.&lt;/span&gt;
&lt;span class="n"&gt;glFlush&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foot5wu48v8788rv9ftm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foot5wu48v8788rv9ftm0.png" alt="Rendering a triangle with OpenGL" width="400" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The idea that you can actually program how pixels are rendered on the screen was quite revolutionary in the early 2000s.&lt;/p&gt;

&lt;p&gt;And my first interaction with these ideas was through &lt;a href="https://gamedev-ru.translate.goog/code/articles/?id=4155&amp;amp;_x_tr_sl=ru&amp;amp;_x_tr_tl=en&amp;amp;_x_tr_hl=en&amp;amp;_x_tr_pto=wapp" rel="noopener noreferrer"&gt;this article&lt;/a&gt; from 2001 on a popular Russian game-development website about the &lt;a href="https://registry.khronos.org/OpenGL/extensions/NV/NV_register_combiners.txt" rel="noopener noreferrer"&gt;NV_register_combiners&lt;/a&gt; extension for OpenGL. Surprisingly, the article is still available online.&lt;/p&gt;

&lt;p&gt;This extension enabled you to program how the final color of a pixel is computed from various inputs, such as texture colors and lighting, allowing you to create more complex visual effects. This computation is performed on the GPU, enabling real-time performance. It was akin to running a small assembly program on the GPU for each pixel being rendered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42jt1gv4of4n6xmdfkcf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42jt1gv4of4n6xmdfkcf.png" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Graphics developers were fascinated by this idea, as it let them dramatically increase the visual fidelity of their scenes. Shortly after, GLSL was conceptualized and formally introduced in 2004, allowing developers to write more complex shaders (small programs that define how geometry or pixels are processed) in a C-like language.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are you feeling GPU poor? Imagine that it was even worse back then! Every new generation of GPUs introduced new features and capabilities, dramatically increasing the visual fidelity of games. Having a new GPU was a prerequisite for playing the latest and greatest games. For those into computer graphics, the frustration of the wait and the excitement of getting the new card were doubled! Luckily, I could trick my parents into buying me a new card, because it supported SHADERS! Which, of course, were essential to advance my computer science education. Having the ability to play Oblivion on high settings was just a nice bonus.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79fe4vh9vlj75qivhytu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79fe4vh9vlj75qivhytu.webp" alt="The " width="800" height="911"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of a simple GLSL program from &lt;a href="https://www.rastertek.com/gl4linuxtut20.html" rel="noopener noreferrer"&gt;rastertek.com&lt;/a&gt; that performs bump mapping - an effect achieved by perturbing a surface's normals using a texture to simulate small-scale bumps and wrinkles on the surface of an object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight glsl"&gt;&lt;code&gt;&lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kt"&gt;vec2&lt;/span&gt; &lt;span class="n"&gt;texCoord&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kt"&gt;vec3&lt;/span&gt; &lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kt"&gt;vec3&lt;/span&gt; &lt;span class="n"&gt;tangent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kt"&gt;vec3&lt;/span&gt; &lt;span class="n"&gt;binormal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Sample the pixel color from the texture using the sampler at this texture coordinate location.&lt;/span&gt;
    &lt;span class="kt"&gt;vec4&lt;/span&gt; &lt;span class="n"&gt;textureColor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shaderTexture1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;texCoord&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Sample the pixel from the normal map.&lt;/span&gt;
    &lt;span class="kt"&gt;vec4&lt;/span&gt; &lt;span class="n"&gt;bumpMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shaderTexture2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;texCoord&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Expand the range of the normal value from (0, +1) to (-1, +1).&lt;/span&gt;
    &lt;span class="kt"&gt;vec3&lt;/span&gt; &lt;span class="n"&gt;bumpMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpMap&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Calculate the normal from the data in the normal map.&lt;/span&gt;
    &lt;span class="n"&gt;bumpNormal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tangent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;binormal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Normalize the resulting bump normal.&lt;/span&gt;
    &lt;span class="n"&gt;bumpNormal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpNormal&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Calculate the amount of light on this pixel based on the normal map value.&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;lightIntensity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bumpNormal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lightDirection&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Determine the final amount of diffuse color based on the diffuse color combined with the light intensity.&lt;/span&gt;
    &lt;span class="n"&gt;outputColor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;diffuseLightColor&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;lightIntensity&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Combine the final light color with the texture color.&lt;/span&gt;
    &lt;span class="n"&gt;outputColor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputColor&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;textureColor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;What do all these &lt;code&gt;in vec3&lt;/code&gt; variables mean? They are the inputs to the shader program. They are specified per vertex and interpolated across the surface of the triangle being rendered. The interpolation is done by the GPU hardware, and the result is fed into the shader program for each pixel being rendered, so each pixel can receive different values, enabling more complex effects. It also allows the computation to be parallelized across all the pixels being rendered, as each pixel can be processed independently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Shaders quickly progressed from simple pixel color manipulation to complex effects simulating shadows, reflections, and refractions. Graphics programmers were especially obsessed with simulating rich surface detail without increasing the geometric complexity of the scene. The deepest point of this rabbit hole was the &lt;a href="https://learnopengl.com/Advanced-Lighting/Parallax-Mapping" rel="noopener noreferrer"&gt;Parallax Occlusion Mapping&lt;/a&gt; technique, which performs a form of ray marching in a pixel shader - traversing space to find the intersection of a ray with a surface defined by a heightmap texture. This way, a completely flat surface can appear to have complex 3D details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oinrclhd6chybvopzu5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oinrclhd6chybvopzu5.jpg" alt="Parallax Occlusion Mapping technique. The cube's surface is entirely flat, but it appears to have details - image from babylon.js" width="418" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GPUs as General Purpose Computers
&lt;/h2&gt;

&lt;p&gt;At this point, you may wonder about LLMs, deep learning, and the ability to perform general-purpose computations on GPUs. Take a look at the shader program above: it reads just like a piece of C code. Why can't we use it to perform arbitrary computations on the GPU? Indeed, we can, and people have been doing so since the early 2000s. But first we need to address one problem: how do we get data in and out of the GPU?&lt;/p&gt;

&lt;p&gt;Getting data in is pretty straightforward: we can encode our data as a texture or geometry and upload it to the GPU. But how do we get data out? For that, we can use techniques like &lt;a href="http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-14-render-to-texture/" rel="noopener noreferrer"&gt;render to texture&lt;/a&gt;, which lets us render the output of our shader program to a texture instead of the screen, and then read that texture back to the CPU.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For those not familiar with computer graphics terminology: a texture is just an image. In computer graphics, textures store image data that can be applied to the surfaces of 3D models to give them color and detail. A texture is typically a 2D array of pixels, where each pixel contains color information (e.g., RGB values) and sometimes additional data such as alpha (transparency) or normal vectors for bump mapping.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This technique is actually even older than shaders themselves, as it was used in the pre-shader era to create effects like dynamic reflections and shadows. For example, to create a reflection effect, you can render the scene from the point of view of a reflected camera (e.g., below the water surface) to a texture, and then use that texture to render the water surface. You can use a pixel shader to distort the texture coordinates, simulating water ripples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4x05ass1nrv3j1kvg8ae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4x05ass1nrv3j1kvg8ae.png" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj07l2pod0emyjsm3roi6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj07l2pod0emyjsm3roi6.jpg" alt="An example of a water reflection effect achieved via the render-to-texture technique. Apparently, I was too lazy to fix the face orientation on the yacht model at the time of making that demo." width="640" height="513"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Some ingenious people figured out that you can use this technique to perform arbitrary computations on the GPU by encoding your input data as a texture, writing a shader program to perform the calculation, rendering the output to a texture, and then reading that texture back to the CPU.&lt;/p&gt;

&lt;p&gt;What can you achieve with this technique? Everything you can with CUDA today. A popular trick in early GPGPU was ping-pong rendering, where two textures alternate between the reading and writing roles: take your input texture, compute some function on it, write the result to the output texture, then use that output texture as the input for the next pass, and so on. By chaining multiple shader programs together, you can build up complex computations. And you don't have to work with images specifically - you can encode any data as a texture: a 2D array of floats, a 3D volume of voxels, a graph, and so on.&lt;/p&gt;

&lt;p&gt;For example, the Fast Fourier Transform (FFT) algorithm can be implemented using shaders and the render-to-texture technique. Here is an example of a GPU-based FFT implementation from &lt;a href="https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-48-medical-image-reconstruction" rel="noopener noreferrer"&gt;GPU Gems 2&lt;/a&gt;, used there for medical image reconstruction.&lt;/p&gt;

&lt;p&gt;Here is how a fragment shader for a single FFT pass looks. It is similar to a CUDA kernel you would write today: essentially a function invoked for each pixel of the output texture, which reads data from the input textures, performs some computation, and writes the result as the pixel's color, which is then stored in the output texture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;FragmentProgram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;
                     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TEXCOORD0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;sColor0&lt;/span&gt;
                     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;COLOR0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;sColor1&lt;/span&gt;
                     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;COLOR1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;sColor2&lt;/span&gt;
                     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;COLOR2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;sColor3&lt;/span&gt;
                     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;COLOR3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;Real1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;Imag1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;Real2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;Imag2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;ButterflyLookupI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;ButterflyLookupWR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;uniform&lt;/span&gt; &lt;span class="n"&gt;samplerRECT&lt;/span&gt; &lt;span class="n"&gt;ButterflyLookupWI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Read in butterfly indices&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ButterflyLookupI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Read in scrambling coordinates&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;WR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ButterflyLookupWR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Read in weights&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;WI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ButterflyLookupWI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Perform the butterfly operation, storing results in the output colors&lt;/span&gt;
  &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;float2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;float2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TexCoordRect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputX1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Real1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputY1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Imag1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputX2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Real1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputY2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Imag1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputX2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;WI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputY2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputX2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;WR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputY2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;sColor0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InputX1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;sColor1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InputY1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputX1_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Real2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputY1_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Imag2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputX2_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Real2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;float4&lt;/span&gt; &lt;span class="n"&gt;InputY2_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texRECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Imag2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputX2_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;WI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputY2_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputX2_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;WR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;InputY2_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;sColor2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InputX1_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;sColor3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InputY1_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The code above is written in the &lt;a href="https://www.khronos.org/opengl/wiki/cg" rel="noopener noreferrer"&gt;Cg&lt;/a&gt; language. It was an early attempt by NVIDIA to &lt;del&gt;monopolize the graphics computing market&lt;/del&gt; make shader programming more convenient. Luckily, nobody cared much about it, and the market relied on the more universally supported GLSL and HLSL languages instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I was fascinated by these developments! This technique unlocked a remarkable number of new applications in computer graphics, science, and the medical field, among others. Personally, I've used it to implement advanced graphics effects. Here is an example of &lt;a href="http://www.uraldev.ru/articles/35/page/2" rel="noopener noreferrer"&gt;using FFT to generate a complex water surface&lt;/a&gt;. This technique was used in the movie &lt;a href="https://en.wikipedia.org/wiki/Titanic_(1997_film)" rel="noopener noreferrer"&gt;Titanic&lt;/a&gt; and in some advanced games like &lt;a href="https://en.wikipedia.org/wiki/Assassin%27s_Creed" rel="noopener noreferrer"&gt;Assassin's Creed&lt;/a&gt;.&lt;/p&gt;

&lt;center&gt;

&lt;p&gt;Realistic ocean surface rendering. The wave geometry was computed via a mathematical model that required performing a large 2D IFFT, which was implemented using shaders and a render-to-texture technique entirely on a GPU.&lt;/p&gt;
&lt;/center&gt;
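&lt;p&gt;The core of that trick is easy to demonstrate on the CPU: fill a 2D grid of Fourier coefficients and run an inverse FFT to turn it into a spatial height field. Below is a minimal NumPy sketch; the grid size and the random spectrum are purely illustrative, as a real ocean simulation fills the grid from a physically based wave spectrum and the GPU version performs the transform stage by stage with shaders:&lt;/p&gt;

```python
import numpy as np

# Illustrative 64x64 grid of random complex Fourier coefficients.
# A real water simulation would fill this with a physically based
# wave spectrum instead of random noise.
N = 64
rng = np.random.default_rng(42)
spectrum = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))

# The inverse 2D FFT converts the frequency-domain spectrum into a
# spatial height field. In practice the spectrum is made Hermitian-
# symmetric so the result is real; here we simply take the real part.
height_field = np.real(np.fft.ifft2(spectrum))  # shape: (64, 64)
```

This one-liner `ifft2` is exactly the large transform that had to be decomposed into many render-to-texture passes on the GPUs of that era.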

&lt;blockquote&gt;
&lt;p&gt;Are any of those articles worth reading? Of course not. I just wanted to demonstrate how I used the Web Archive to recover some old articles that are no longer available online, and to add a meme image to the post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gqgtt7xnqterxbk1ill.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gqgtt7xnqterxbk1ill.jpeg" width="600" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter the CUDA
&lt;/h2&gt;

&lt;p&gt;Although the technique of using shaders for general-purpose computations was quite powerful, it was still somewhat limited. The programming model was not very friendly, as you had to encode your data as textures or other graphics primitives. The render-to-texture approach involved rendering a rectangle covering the entire screen so that every rendered pixel aligned precisely with a texel of the output texture. It was also easy to misconfigure the graphics pipeline, such as by forgetting to turn off texture filtering, which would lead to incorrect results.&lt;/p&gt;

&lt;p&gt;All of these details were quite distracting and made it hard to focus on the actual computation, especially for non-graphics programmers. Thus, NVIDIA introduced CUDA in 2007, which provided a C-like programming model for writing general-purpose computations on NVIDIA GPUs.&lt;/p&gt;

&lt;p&gt;The programming model is similar to the shader programming model, as you still write a kernel function that is executed in parallel by many threads. Each thread is identified by its 1D, 2D, or 3D index, which you can use to compute the memory address of the data you want to process. In the shader programming model, you would do that using texture coordinates or other varying variables, while in CUDA you use thread indices. However, all the scaffolding of setting up the graphics pipeline, managing textures, framebuffers, and so on, is eliminated. You simply allocate memory on the GPU, copy data to it, launch a kernel, and copy the results back.&lt;/p&gt;

&lt;p&gt;Here is how the FFT kernel from above would look in CUDA. Again, feel free to skip if you're here for the story.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Helper function to perform a complex multiply and add operation.&lt;/span&gt;
&lt;span class="n"&gt;__device__&lt;/span&gt; &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="nf"&gt;butterfly_op&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;twiddle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Perform complex multiplication and addition&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;temp_result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;temp_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;twiddle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;twiddle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;temp_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;twiddle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;twiddle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;temp_result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;fft_stage_kernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;// Input data arrays (now using float2 for complex numbers)&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_input1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_input2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Combined butterfly lookup tables (now float2 for complex twiddle factors)&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_butterflyLookupI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_butterflyTwiddles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Output data arrays (now using float2)&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_out1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;d_out2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Read butterfly lookup index and complex twiddle factor&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;lookup_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;d_butterflyLookupI&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;twiddle_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d_butterflyTwiddles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Read input data using combined float2 arrays&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;r1_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;r2_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;lookup_i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;input1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d_input1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r1_idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;input2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d_input1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r2_idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Perform the butterfly operation for the first pair of inputs&lt;/span&gt;
    &lt;span class="n"&gt;d_out1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;butterfly_op&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;twiddle_factor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Process the second pair of data arrays&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;input1_prime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d_input2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r1_idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="n"&gt;float2&lt;/span&gt; &lt;span class="n"&gt;input2_prime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d_input2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r2_idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Perform the second butterfly operation&lt;/span&gt;
    &lt;span class="n"&gt;d_out2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;butterfly_op&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input1_prime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input2_prime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;twiddle_factor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
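&lt;p&gt;To see that the butterfly above is nothing more than complex arithmetic, here is the same operation expressed in NumPy: &lt;code&gt;butterfly_op(a, b, twiddle)&lt;/code&gt; is simply &lt;code&gt;a + b * twiddle&lt;/code&gt; on complex numbers. This is a small sanity-check sketch with made-up sample values, not the production kernel:&lt;/p&gt;

```python
import numpy as np

def butterfly_op(a, b, twiddle):
    # Same math as the CUDA helper above, written with complex numbers:
    # (b * twiddle) is the complex multiply, + a is the add.
    return a + b * twiddle

# Made-up sample inputs for the check
a = np.array([1 + 2j, 3 - 1j])
b = np.array([0.5 - 0.5j, 2 + 0j])
twiddle = np.exp(-2j * np.pi * np.arange(2) / 4)  # twiddle factors e^{-2*pi*i*k/N}

result = butterfly_op(a, b, twiddle)

# Expand the first element by hand to check that the real/imaginary
# parts match the component-wise formulas in the kernel.
manual_real = a[0].real + (b[0].real * twiddle[0].real - b[0].imag * twiddle[0].imag)
manual_imag = a[0].imag + (b[0].imag * twiddle[0].real + b[0].real * twiddle[0].imag)
assert np.isclose(result[0], manual_real + 1j * manual_imag)
```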



&lt;blockquote&gt;
&lt;p&gt;I was waiting to get my hands on a GPU that supported CUDA, again. I was earning money by then, so there was no need to trick my parents anymore, but high-end PC upgrades were still a considerable expense, and you needed to make them often. My first CUDA-capable GPU was the 8800 GT, a card from the most legendary series of all time. It was built on an entirely new architecture and introduced CUDA. In addition, a single 8800 GTX was able to outperform two previous-generation 7900 GTX cards in SLI, at comparable power consumption and price ($599, hold back your tears). When will we see such leaps in performance and value again, Mr. Leather-jacket CEO?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rvy9ahmcnorqzqrix22n.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rvy9ahmcnorqzqrix22n.webp" alt="An entry-level GPU in 2030 with an MSRP of $8799"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CUDA Moat?
&lt;/h2&gt;

&lt;p&gt;As a &lt;strong&gt;true open-source warrior&lt;/strong&gt;, I did not use CUDA and relied on OpenCL instead for my work. It was not as well supported as CUDA: debuggers and other tools were less advanced, there were more glitches, and you could squeeze slightly better performance out of CUDA on NVIDIA hardware. However, these drawbacks were outweighed by the fact that OpenCL was an open standard and worked on AMD and Intel GPUs as well, so CUDA was far from a monopoly at that time.&lt;/p&gt;

&lt;p&gt;At my job, I was using OpenCL to implement an algorithm for real-time 3D scanning. The &lt;a href="https://www.artec3d.com/portable-3d-scanners/artec-eva" rel="noopener noreferrer"&gt;Artec Eva&lt;/a&gt; is a professional 3D scanner used for medical or industrial applications. Real-time 3D scanning involves a significant amount of GPU computation to process the input video stream, identify your position with respect to the environment (similar algorithms are employed as in self-driving cars), fuse all the input data into a single 3D model, and display it on the screen. All of this had to happen in real-time, so the user could see the result immediately and adjust their position if needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxh7peu5ec0cai35j32i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxh7peu5ec0cai35j32i.jpg" alt="Scanning an object with an Artec 3D scanner" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Opting for OpenCL was a brave choice back then, and possibly a bad product decision at the time: when you buy a $12,000 3D scanner, you can afford a decent GPU and need not worry about vendor lock-in. However, over time, as GPUs became more powerful and it became possible to run the pipeline on a laptop GPU, specifically on a &lt;a href="https://en.wikipedia.org/wiki/Microsoft_Surface" rel="noopener noreferrer"&gt;Microsoft Surface&lt;/a&gt; tablet, the choice of OpenCL paid off. Now an operator had a lightweight display in their hands and could walk around the object being scanned. At least, this is what I tell myself to feel better about my choice 😅&lt;/p&gt;

&lt;center&gt;

&lt;p&gt;Real-time scanning of a 3D object with an Artec scanner. Scanner localization, data fusion, and visualization are performed in real-time on a GPU using OpenCL.&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;In addition to OpenCL, there were many other hardware-agnostic GPGPU frameworks to choose from, including &lt;a href="https://halide-lang.org" rel="noopener noreferrer"&gt;Halide&lt;/a&gt;, &lt;a href="https://arrayfire.com" rel="noopener noreferrer"&gt;ArrayFire&lt;/a&gt;, and &lt;a href="https://numba.pydata.org" rel="noopener noreferrer"&gt;Numba&lt;/a&gt;. So, all things considered, the open-source and open-standard ecosystem was a fair contender to CUDA back then, and CUDA hasn't had its moat yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Learning Revolution
&lt;/h2&gt;

&lt;p&gt;The new GPU programming capabilities unlocked by CUDA and OpenCL enabled numerous new applications in computer graphics, science, and medicine, among other fields. However, the popularization of deep learning (that is what we called AI before ChatGPT came along) is arguably the most notable outcome.&lt;/p&gt;

&lt;p&gt;Many think that thanks to AI, GPUs have become the central compute platform. In fact, it is the other way around: thanks to GPUs, we have AI in the first place. Deep convolutional neural networks had been known since the 90s. In 2012, a graduate student, &lt;a href="https://en.wikipedia.org/wiki/Alex_Krizhevsky" rel="noopener noreferrer"&gt;Alex Krizhevsky&lt;/a&gt;, motivated by &lt;a href="https://en.wikipedia.org/wiki/Ilya_Sutskever" rel="noopener noreferrer"&gt;Ilya Sutskever&lt;/a&gt;, trained a deep convolutional neural network under the guidance of &lt;a href="https://en.wikipedia.org/wiki/Geoffrey_Hinton" rel="noopener noreferrer"&gt;Geoffrey Hinton&lt;/a&gt; using a couple of GeForce GPUs to enter the &lt;a href="https://en.wikipedia.org/wiki/ImageNet#ImageNet_Challenge" rel="noopener noreferrer"&gt;ImageNet challenge&lt;/a&gt;. The model was called &lt;a href="https://en.wikipedia.org/wiki/AlexNet" rel="noopener noreferrer"&gt;AlexNet&lt;/a&gt;, and the dataset consisted of 1.2 million images belonging to 1000 categories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6dzv9c6hyvabthgse35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6dzv9c6hyvabthgse35.png" alt="The obligatory xkcd meme: https://xkcd.com/2347/" width="770" height="978"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results? AlexNet obliterated the state-of-the-art computer vision models of the time, demonstrating a whopping 9.4% improvement in accuracy over the previous best result. This was a game-changer. It triggered the deep learning revolution, in which breakthrough after breakthrough in computer vision, natural language processing, and other fields was achieved using deep learning models trained on GPUs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqz14g9yk19x0rnr1h2c.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqz14g9yk19x0rnr1h2c.webp" alt="The best ImageNet challenge results in 2010 and 2011, compared against all results in 2012, including AlexNet. Image from Pinecone's article: AlexNet and ImageNet: The Birth of Deep Learning" width="665" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Array Programming Model
&lt;/h2&gt;

&lt;p&gt;GPU computing caused a great upheaval in the machine learning field, and the latter retaliated by drastically changing the way we program GPUs. The programming model has shifted from writing kernels that operate on individual elements of an array to writing code that operates on entire arrays (tensors) at once.&lt;/p&gt;

&lt;p&gt;The reason for this is that deep-learning frameworks like &lt;a href="https://www.tensorflow.org" rel="noopener noreferrer"&gt;Tensorflow&lt;/a&gt; and &lt;a href="https://pytorch.org" rel="noopener noreferrer"&gt;PyTorch&lt;/a&gt; were inspired not by graphics programming but by scientific computing frameworks like &lt;a href="https://numpy.org/" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt; and &lt;a href="https://www.mathworks.com/products/matlab.html" rel="noopener noreferrer"&gt;MATLAB&lt;/a&gt;. The programming model differs significantly from those of CUDA and OpenCL. Instead of writing kernels that operate on individual elements of an array, you write code that operates on entire arrays (tensors) at once, and the framework breaks the operations down into smaller pieces that can be executed in parallel on the GPU. This programming model, known as &lt;a href="https://en.wikipedia.org/wiki/Array_programming" rel="noopener noreferrer"&gt;array programming&lt;/a&gt;, dates back to the 60s and the development of languages like &lt;a href="https://en.wikipedia.org/wiki/APL_(programming_language)" rel="noopener noreferrer"&gt;APL&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Fortran" rel="noopener noreferrer"&gt;Fortran&lt;/a&gt;.&lt;/p&gt;
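&lt;p&gt;The contrast is easy to see in miniature. The kernel style spells out one explicit operation per element, which is what a CUDA thread grid does implicitly, while the array style states the whole-array operation and lets the framework decide how to execute it. A small NumPy sketch with an illustrative SAXPY-like operation:&lt;/p&gt;

```python
import numpy as np

# Kernel-style thinking: one explicit operation per element,
# like the body of a CUDA kernel indexed by its thread id.
def saxpy_loop(alpha, x, y):
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = alpha * x[i] + y[i]
    return out

# Array-style thinking: a single expression over whole arrays;
# the framework maps it onto parallel hardware for you.
def saxpy_array(alpha, x, y):
    return alpha * x + y

x = np.arange(100, dtype=np.float32)
y = np.ones_like(x)
assert np.allclose(saxpy_loop(2.0, x, y), saxpy_array(2.0, x, y))
```

On large arrays the vectorized version is also dramatically faster even on a CPU, since the per-element loop runs in compiled code rather than in the Python interpreter.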

&lt;blockquote&gt;
&lt;p&gt;I am skipping &lt;a href="https://en.wikipedia.org/wiki/Caffe_(software)" rel="noopener noreferrer"&gt;Caffe&lt;/a&gt;, the first, and at the time popular, declarative deep learning framework. It was suitable for defining a large number of models, but not for expressing arbitrary computations on tensors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This programming model has one tremendous advantage. It is much easier to reason about the code, as you don't have to think about how to parallelize the computation. You write code that operates on entire arrays, and the framework takes care of the rest. It made GPU programming accessible to a much wider audience, as you didn't have to be a GPU programming expert to write code that runs on the GPU. It is so convenient that many GPU programming experts, myself included, have switched to using these frameworks for their work. It allows you to express your ideas much more concisely and focus on the problem at hand, rather than the intricacies of GPU programming. Additionally, frameworks like PyTorch and Tensorflow come with an automatic differentiation engine, which allows you to compute gradients of your functions automatically. This is especially useful for training neural networks, but it can also be applied to other applications.&lt;/p&gt;
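&lt;p&gt;The idea behind automatic differentiation can be illustrated with a toy forward-mode implementation based on dual numbers, where every value carries its derivative along with it. This is only a sketch of the concept; PyTorch and Tensorflow actually use reverse-mode differentiation over a recorded computation graph, which scales far better to functions of millions of parameters:&lt;/p&gt;

```python
class Dual:
    """A number carrying its value and its derivative (forward-mode AD)."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def grad(f, x):
    # Seed the derivative with 1.0 and let the rules propagate it through f.
    return f(Dual(x, 1.0)).deriv

# d/dx (3x^2 + 2x) = 6x + 2, so the gradient at x = 4 is 26.
assert grad(lambda x: 3 * x * x + 2 * x, 4.0) == 26.0
```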

&lt;p&gt;Here is a simple NumPy program. Even without knowing NumPy, you can figure out what it does: it creates a couple of arrays, performs some basic operations on them, and prints the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Create a 1-dimensional array from a Python list
&lt;/span&gt;&lt;span class="n"&gt;array1d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Create a 2-dimensional array (matrix)
&lt;/span&gt;&lt;span class="n"&gt;array2d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

&lt;span class="c1"&gt;# Element-wise addition
&lt;/span&gt;&lt;span class="n"&gt;sum_array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array1d&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="c1"&gt;# Element-wise multiplication
&lt;/span&gt;&lt;span class="n"&gt;product_array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array1d&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="c1"&gt;# Sum of all elements in an array
&lt;/span&gt;&lt;span class="n"&gt;total_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array1d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Mean of elements in an array
&lt;/span&gt;&lt;span class="n"&gt;mean_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array1d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Accessing elements
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;First element of array1d:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;array1d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Element at row 0, column 1 of array2d:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;array2d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Is Programming GPUs Hard?
&lt;/h2&gt;

&lt;p&gt;With the convenience of the array programming model comes a significant drawback: it is hard to optimize the code for performance. To understand why, we first need to look at what makes GPU code hard to optimize in the first place.&lt;/p&gt;

&lt;p&gt;There are several reasons why GPU programming is a complex task, but the primary one is that GPUs are heavily constrained by memory bandwidth. GPU architects have introduced numerous mechanisms to hide the latency of memory accesses and maximize the utilization of the available bandwidth, and developers need to understand these mechanisms and write code that leverages them. This is not an easy task, as it requires a deep understanding of the GPU architecture and the specific details of the memory hierarchy.&lt;/p&gt;

&lt;p&gt;Consider the following example. The most powerful CPU at the time of writing is the &lt;a href="https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9965.html" rel="noopener noreferrer"&gt;AMD EPYC 9965&lt;/a&gt;. It offers a whopping 192 cores and 384 threads, with a per-socket memory bandwidth of about 614 GB/s. However, its core count pales in comparison with the most powerful GPU at the time of writing, the &lt;a href="https://www.nvidia.com/en-us/data-center/dgx-b200/" rel="noopener noreferrer"&gt;NVIDIA B200&lt;/a&gt;, which offers 16,896 CUDA cores and up to 8TB/s of memory bandwidth per GPU.&lt;/p&gt;

&lt;p&gt;Now you can see the problem: each CPU core has about 3.2 GB/s of memory bandwidth to itself, while each GPU core gets only about 0.47 GB/s. So a GPU core must perform far more computation per byte fetched to hide the latency of memory accesses and make the best use of the available bandwidth. The situation with consumer GPUs is even worse: the &lt;a href="https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/" rel="noopener noreferrer"&gt;RTX 5090&lt;/a&gt; has 21,760 CUDA cores and 1,792 GB/s of memory bandwidth, which works out to only about 0.082 GB/s per core.&lt;/p&gt;
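
&lt;p&gt;The arithmetic above is easy to reproduce. The numbers below are the spec figures quoted in this section; divide bandwidth by core count to get each core's share:&lt;/p&gt;

```python
# Back-of-the-envelope bandwidth-per-core figures from the text.
specs = {
    "AMD EPYC 9965 (CPU)":   {"cores": 192,   "bw_gb_s": 614},
    "NVIDIA B200 (GPU)":     {"cores": 16896, "bw_gb_s": 8000},
    "NVIDIA RTX 5090 (GPU)": {"cores": 21760, "bw_gb_s": 1792},
}

for name, s in specs.items():
    per_core = s["bw_gb_s"] / s["cores"]
    print(f"{name}: {per_core:.3f} GB/s per core")
```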

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1eirdl3cqvrhhcgg9xs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1eirdl3cqvrhhcgg9xs.jpeg" width="800" height="827"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The relationship between compute power and memory bandwidth in the GPU computing world is referred to as the ALU-to-memory ratio, which represents the number of operations a GPU core can perform per memory access. For GPUs, this ratio is much higher than for CPUs. It can be dozens or even hundreds of operations per memory access.&lt;/p&gt;

&lt;p&gt;The same problem exists on all other parallel computing platforms, such as TPUs, neural processors, and FPGAs: the memory bandwidth per processing unit is always much lower than that of a CPU core. Between 2017 and 2022, I optimized neural network inference at Apple for their custom neural processors. We shipped models such as Animoji, FaceID, Portrait mode, and numerous models that run on Apple Vision Pro. For each of these models, we had to ensure there was no swapping of data between the on-chip memory and DRAM, as memory bandwidth was the main bottleneck.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To work around this limitation, GPUs employ several techniques. One is &lt;strong&gt;shared memory&lt;/strong&gt;, a small amount of fast on-chip memory shared among a group of threads. It allows threads to cooperate and share data without touching global memory, which is significantly slower. Another technique is &lt;strong&gt;memory coalescing&lt;/strong&gt;: when threads access contiguous memory locations, the GPU can fetch multiple data elements in a single memory transaction, minimizing the total number of transactions. GPU cores also have access to &lt;strong&gt;more registers&lt;/strong&gt; than CPU cores, which can be used to keep intermediate data on-chip. However, the register file is shared among all threads resident on a compute unit, so if each thread uses too many registers, fewer threads can run concurrently and occupancy drops.&lt;/p&gt;
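
&lt;p&gt;You can get a feel for coalescing with a CPU-side analogy in numpy (not a CUDA example, and the hardware details differ): summing along the contiguous axis walks memory sequentially, while summing across it jumps by a full row stride on every access.&lt;/p&gt;

```python
import numpy as np

# CPU analogy for memory coalescing: traversing an array along its
# contiguous axis touches memory sequentially; traversing across rows
# jumps by a full row stride on every access.
a = np.arange(1_000_000, dtype=np.float32).reshape(1000, 1000)

row_sums = a.sum(axis=1)  # each sum reads 1000 contiguous floats
col_sums = a.sum(axis=0)  # each sum reads floats 4000 bytes apart

# Same math either way; only the access pattern differs. On a GPU the
# analogous distinction decides whether the loads of neighboring
# threads coalesce into one memory transaction or fan out into many.
print(a.strides)  # (4000, 4) - row stride vs element stride in bytes
```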

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwk1l9v28sqxe1fwq36nk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwk1l9v28sqxe1fwq36nk.jpg" alt="GPU-machine memory hierarchy for NVIDIA Fermi (2010) architecture-transfer speeds on modern GPUs are about 5–10 times faster, but relationships are similar. Illustration from the publication Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units" width="800" height="831"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enough complex terms! If you take away one thing from this post, let it be this: &lt;strong&gt;the most effective way to optimize a GPU program is to perform more computations per memory access&lt;/strong&gt;. In other words, &lt;strong&gt;keep data inside the GPU core for as long as possible&lt;/strong&gt;. Let's pin this and come back to the array programming model and the performance issues it introduces.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Love PyTorch! What Could Possibly Be Wrong with It?
&lt;/h2&gt;

&lt;p&gt;Let's look at what a simple CUDA kernel for an array operation like &lt;code&gt;A*B + C&lt;/code&gt; looks like. Here, &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, and &lt;code&gt;C&lt;/code&gt; are large arrays (tensors), and the operation is performed element-wise, e.g., &lt;code&gt;[1, 2, 3] * [2, 2, 2] + [1, 1, 1] = [3, 5, 7]&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;array_op&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kernel is straightforward. Each thread computes a single element of the output array &lt;code&gt;D&lt;/code&gt; by reading the corresponding elements from the input arrays &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, and &lt;code&gt;C&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, let's take a look at how the same operation would look in PyTorch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you naively translate PyTorch operations like element-wise multiplication and addition to CUDA, which is how it is actually done in practice, you get two kernels: one for multiplication and one for addition. The runtime launches a kernel that performs the element-wise multiplication and stores the result in a temporary array &lt;code&gt;E&lt;/code&gt;, and then launches another kernel that performs the element-wise addition of &lt;code&gt;E&lt;/code&gt; and &lt;code&gt;C&lt;/code&gt; to produce the final tensor &lt;code&gt;D&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;array_mul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;array_add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the problem now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The fused kernel reads &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, and &lt;code&gt;C&lt;/code&gt; once and writes &lt;code&gt;D&lt;/code&gt; once: four memory transfers per element for two arithmetic operations.&lt;/li&gt;
&lt;li&gt;The unfused version reads four arrays and writes two (including the temporary &lt;code&gt;E&lt;/code&gt;): six memory transfers per element for the same two operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Given that our program is completely memory-bound, the PyTorch version will be roughly &lt;strong&gt;1.5 times as slow as the fused CUDA version&lt;/strong&gt;: it moves 6N values instead of 4N. The gap widens with every additional unfused element-wise operation, which adds another 3N transfers to the chain, while a fused kernel would add only N. For long chains of element-wise operations, the slowdown approaches &lt;strong&gt;3x&lt;/strong&gt;.&lt;/p&gt;
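
&lt;p&gt;Counting global-memory transfers per element makes the difference explicit. This little sketch assumes a perfectly memory-bound workload and no caching between kernel launches:&lt;/p&gt;

```python
# Count global-memory transfers per element (reads + writes),
# assuming no caching between kernel launches.
def traffic_fused(num_inputs):
    # one kernel: read every input once, write the result once
    return num_inputs + 1

def traffic_unfused(num_ops):
    # each element-wise binary op: read 2 arrays, write 1
    return 3 * num_ops

# D = A * B + C : 3 inputs, 2 operations
fused = traffic_fused(3)       # 4 values per element
unfused = traffic_unfused(2)   # 6 values per element
print(unfused / fused)         # 1.5x the memory traffic
```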

&lt;p&gt;You may wonder: can't we generate a single fused kernel that performs both operations at once? The answer is yes, we can; in fact, both PyTorch (via &lt;code&gt;torch.compile&lt;/code&gt;) and TensorFlow (via XLA) have mechanisms to do exactly that. However, this is not an easy problem to solve in a general way. PyTorch officially supports more than 1200 operations on tensors, and the number of possible combinations of these operations is astronomical. Many of them are not even element-wise, e.g., matrix multiplications, convolutions, and reductions. For PyTorch it is especially difficult because it is a dynamic framework: the computation graph is built on the fly as the code executes, which makes it challenging to analyze the entire graph and determine which operations can be fused.&lt;/p&gt;
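
&lt;p&gt;To see what a fusion pass does in the easy, element-wise case, here is a deliberately tiny sketch in pure Python (real compilers such as &lt;code&gt;torch.compile&lt;/code&gt; or XLA handle vastly more than this): instead of materializing a temporary array after every operation, the ops are composed into a single function and the data is traversed once.&lt;/p&gt;

```python
import numpy as np

# Unfused: one full pass (and one temporary array) per operation.
def run_unfused(ops, x):
    for op in ops:
        x = op(x)          # materializes an intermediate each time
    return x

# "Fused": compose the element-wise ops into a single kernel-like
# function, so the data is traversed once with no temporaries.
def fuse(ops):
    def fused(x):
        out = np.empty_like(x)
        for i in range(x.size):   # one pass over memory
            v = x.flat[i]
            for op in ops:
                v = op(v)
            out.flat[i] = v
        return out
    return fused

ops = [lambda v: v * 2.0, lambda v: v + 1.0]   # v*2 + 1
x = np.arange(4, dtype=np.float32)
print(run_unfused(ops, x))   # [1. 3. 5. 7.]
print(fuse(ops)(x))          # same result, one traversal
```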

&lt;p&gt;This problem remains unsolved in a general way to date, as you'll see when we discuss &lt;a href="https://github.com/Dao-AILab/flash-attention" rel="noopener noreferrer"&gt;Flash Attention&lt;/a&gt; in the context of LLM inference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flf2hndh7e17dr62zwa19.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flf2hndh7e17dr62zwa19.jpeg" width="720" height="709"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  NVIDIA Domination
&lt;/h2&gt;

&lt;p&gt;The Deep Learning revolution has dramatically changed the GPU programming landscape. The array programming model opened GPU programming to a much wider audience, but it also introduced new challenges, such as optimizing memory access patterns and fusing operations to achieve good performance.&lt;/p&gt;

&lt;p&gt;This has created a strong moat for NVIDIA. Although CUDA was just one of many GPGPU frameworks available at the time, the CUDA ecosystem had a great deal more to offer the community. For example, it included cuDNN, a highly optimized library of deep learning primitives such as convolutions, pooling, and normalization. All major deep learning frameworks, including TensorFlow and PyTorch, relied on this library to achieve good performance on NVIDIA GPUs. Additionally, NVIDIA invested heavily in optimizing its hardware for deep learning workloads, for example by introducing Tensor Cores, specialized hardware units designed for performing matrix multiplications and convolutions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn295tcgj4d9pa9r7c1l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn295tcgj4d9pa9r7c1l.jpg" width="742" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the Deep Learning age, NVIDIA GPUs have become the &lt;strong&gt;de facto standard for deep learning workloads&lt;/strong&gt;. All major deep learning frameworks like PyTorch and TensorFlow were built on top of cuDNN, initially not even offering the option to use other backends like OpenCL or ROCm. Nearly all research was done on NVIDIA hardware, as it was the only hardware supported by the tools researchers were using. This created a strong network effect: everyone was using NVIDIA hardware, so everyone was optimizing their code for NVIDIA hardware, which made NVIDIA hardware even more attractive.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From 2010 to the present, I have exclusively owned NVIDIA GPUs. Even though some AMD models offered better value, the need to do AI-related work has always steered me into the Team Green camp.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ironically, as innovative as CUDA was, the moat was created not by CUDA itself, but by &lt;strong&gt;the army of NVIDIA engineers who optimized cuDNN and other libraries for deep learning workloads&lt;/strong&gt;. There was simply no good algorithm for optimizing computational graphs in a general way, so NVIDIA engineers hand-optimized the most common patterns that appear in deep learning workloads.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There have been many attempts to come up with an automatic way to optimize computational graphs, or at least a universal, hardware-agnostic AI stack that makes the optimization process easier, like &lt;a href="https://www.tensorflow.org/xla" rel="noopener noreferrer"&gt;XLA&lt;/a&gt; from Google, &lt;a href="https://tvm.apache.org" rel="noopener noreferrer"&gt;TVM&lt;/a&gt; from the Apache Foundation, &lt;a href="https://mlir.llvm.org" rel="noopener noreferrer"&gt;MLIR&lt;/a&gt; from LLVM or &lt;a href="https://www.modular.com/max" rel="noopener noreferrer"&gt;MAX&lt;/a&gt; from Modular AI. However, none of them has managed to beat hand-optimized libraries like cuDNN on NVIDIA hardware across a large enough number of real-world use cases and establish a strong enough network effect.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The AI Era - Bigger is Better
&lt;/h2&gt;

&lt;p&gt;History doesn't repeat itself, but it often rhymes. The computational power of GPUs triggered the deep learning revolution: we used algorithms that had been known since the 1990s, but now we could train much larger models on much larger datasets. The same thing happened with LLMs. The &lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;transformer architecture&lt;/a&gt; was introduced in 2017, but it took until 2020 for truly large-scale transformer models like GPT-3 to appear (BERT, its much smaller predecessor, arrived in 2018). The reason is that training these models requires enormous computational power and memory bandwidth. OpenAI reportedly trained GPT-3 on a cluster of 10,000 GPUs; the model has 175 billion parameters and was trained on a dataset of 570GB of text data. The training process took several weeks and cost several million dollars (and probably raised the global temperature by a degree or so).&lt;/p&gt;

&lt;p&gt;How did AI affect the GPU programming landscape? Not much, actually. The same array programming model is used for training and inference of LLMs. The same challenges of optimizing memory access patterns and fusing operations to achieve good performance still exist. However, the scale of the models has increased dramatically, which has introduced new challenges, like distributing the model across multiple GPUs and optimizing communication between GPUs.&lt;/p&gt;

&lt;p&gt;The large scale of the models has also introduced new challenges for inference. The models are so large that they don't fit into the memory of a single GPU. For example, GPT-3 requires about 700GB of memory just to store its parameters in FP32 (175 billion parameters at 4 bytes each), which is much larger than the memory of even the most powerful GPUs available today. This has led to the development of techniques such as model parallelism, where the model is split across multiple GPUs, and pipeline parallelism, where different parts of the model are executed on separate GPUs in a pipelined manner.&lt;/p&gt;
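
&lt;p&gt;The memory math is easy to check. The sketch below counts only the weights (activations, KV cache, and optimizer state would add considerably more) and assumes a hypothetical 80GB-per-GPU budget:&lt;/p&gt;

```python
# Memory needed just for GPT-3's 175B weights, at different precisions.
params = 175e9
bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}

for dtype, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    gpus = -(-gb // 80)  # ceil-divide by an 80GB GPU, weights only
    print(f"{dtype}: {gb:.0f} GB -> at least {gpus:.0f} x 80GB GPUs")
```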

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jc74jrtuf3b1mbo0l1t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jc74jrtuf3b1mbo0l1t.jpg" width="640" height="753"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case of Flash Attention
&lt;/h2&gt;

&lt;p&gt;Surprisingly, after all these years, the problem of optimizing memory access patterns and fusing operations to achieve good performance is still not solved in a general way. Let's take a look at a specific example of this problem in the context of LLM inference.&lt;/p&gt;

&lt;p&gt;One of the most important operations in transformer models is the attention mechanism, which allows the model to focus on different parts of the input sequence when making predictions. It is implemented as a series of matrix multiplications and softmax operations (see the rightmost diagram in the image below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kq6ofsw4xjbrs6dqe2d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kq6ofsw4xjbrs6dqe2d.png" alt="Attention mechanism in transformers" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The softmax operation involves computing the exponential of each element in the input matrix, summing the exponentials along each row, and then dividing each element by that sum.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9qdw168de6tolfze7y2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9qdw168de6tolfze7y2.jpg" alt="Softmax Operation" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looks challenging to optimize, right? How can we reduce the number of memory accesses here? The naive implementation reads the input matrices from memory, multiplies them together, stores the result in a temporary matrix, reads the temporary matrix back, computes the exponential of each element, sums them up, and then divides each element by the sum. That is a lot of memory traffic. And indeed, it is slow!&lt;/p&gt;

&lt;p&gt;However, I mentioned earlier that GPUs come with a bit of fast on-chip memory called shared memory (SRAM in hardware terms: static random-access memory). It is a small amount of memory shared among a block of GPU threads, and it is much faster than global memory (GDDR or HBM), which makes it ideal for storing intermediate results. The original Flash Attention implementation was developed and benchmarked on the A100, which has 40GB of HBM and up to 192KB of combined L1/shared memory per SM. The SRAM bandwidth is about 19TB/s, while the HBM bandwidth is about 1.5–2.0TB/s.&lt;/p&gt;

&lt;p&gt;The authors of Flash Attention devised a way to partition the computation so that the intermediate results fit into shared memory, allowing the entire attention computation to be performed with far fewer trips to global memory. The input matrices are split into smaller tiles, the calculations (matrix multiplications and softmax) are performed tile by tile, and the results are streamed back to global memory. The result is a significant speedup over the naive implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95mfe3nnjtnm7hu2d28v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95mfe3nnjtnm7hu2d28v.png" alt="Left: FlashAttention uses tiling to prevent materialization of the large 𝑁 × 𝑁 attention matrix (dotted box) on (relatively) slow GPU HBM. In the outer loop (red arrows), FlashAttention loops through blocks of the K and V matrices and loads them to fast on-chip SRAM. In each block, FlashAttention loops over blocks of Q matrix (blue arrows), loading them to SRAM, and writing the output of the attention computation back to HBM. Right: Speedup over the PyTorch implementation of attention on GPT-2. FlashAttention does not read and write the large 𝑁 × 𝑁 attention matrix to HBM, resulting in an 7.6× speedup on the attention computation." width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;
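&lt;p&gt;The idea can be sketched in NumPy with the "online softmax" recurrence. This is a hedged simplification, not the actual CUDA kernel: it tiles only over the K/V dimension, while the real algorithm also blocks over Q and runs as a fused kernel keeping the tiles in SRAM. The key point it shows is that the N &amp;times; N score matrix is never materialized; only block-sized tiles and per-row running statistics are kept:&lt;/p&gt;

```python
import numpy as np

def flash_attention_sketch(Q, K, V, block=64):
    """Tiled attention with online softmax. Only a block-sized tile
    of scores exists at a time; running max and sum per query row
    let us rescale earlier partial results as new tiles arrive."""
    N, d = Q.shape
    O = np.zeros((N, d))
    row_max = np.full((N, 1), -np.inf)  # running max per query row
    row_sum = np.zeros((N, 1))          # running softmax denominator
    for j in range(0, N, block):        # stream K/V tiles from "HBM"
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = Q @ Kj.T                    # N x block tile of scores
        tile_max = S.max(axis=1, keepdims=True)
        new_max = np.maximum(row_max, tile_max)
        scale = np.exp(row_max - new_max)  # rescale earlier partials
        P = np.exp(S - new_max)
        row_sum = row_sum * scale + P.sum(axis=1, keepdims=True)
        O = O * scale + P @ Vj
        row_max = new_max
    return O / row_sum
```

The output matches the naive implementation exactly (up to floating-point error), which is why the tiling is a pure performance optimization: same math, far fewer global-memory accesses.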

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The GPU programming landscape has changed dramatically over the past two decades. The introduction of CUDA and OpenCL has made GPU programming accessible to a much wider audience and triggered the deep learning revolution, which in turn changed the way we program GPUs. The array programming model has made it easier to write code that runs on the GPU, but it has also introduced new challenges, such as optimizing memory access patterns and fusing operations to achieve optimal performance.&lt;/p&gt;

&lt;p&gt;Now that you're a certified GPU programming expert, enjoy the last meme and get your GPU cranking!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk0q7t0de30w7jq6cju3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk0q7t0de30w7jq6cju3.jpeg" alt="If you frequently run into this issue — check out our GPU rental service." width="720" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>gpu</category>
      <category>deeplearning</category>
      <category>cuda</category>
    </item>
    <item>
      <title>Host Setup for QEMU KVM GPU Passthrough with VFIO on Linux</title>
      <dc:creator>Dmitry Trifonov</dc:creator>
      <pubDate>Tue, 26 Aug 2025 21:27:08 +0000</pubDate>
      <link>https://dev.to/novibecoding/host-setup-for-qemu-kvm-gpu-passthrough-with-vfio-on-linux-46lc</link>
      <guid>https://dev.to/novibecoding/host-setup-for-qemu-kvm-gpu-passthrough-with-vfio-on-linux-46lc</guid>
      <description>&lt;h3&gt;
  
  
  From “black magic” to reproducible results
&lt;/h3&gt;

&lt;p&gt;GPU passthrough shouldn't feel like sorcery. If you've ever lost a weekend to half-working configs, random resets, or a guest that only boots when the moon is right, this guide is for you. I pulled out a lot of hair while hardening the &lt;a href="https://cloudrift.ai" rel="noopener noreferrer"&gt;CloudRift&lt;/a&gt;&lt;br&gt;
VM service for a variety of consumer (RTX 4090, 5090, PRO 6000) and data center (H100, B200) GPUs, so I'm writing this guide to help you avoid the common pitfalls.&lt;/p&gt;

&lt;p&gt;I'll focus specifically on the host node configuration for GPU passthrough. Thus, this guide is relevant regardless of whether you're using Proxmox or plain libvirt/QEMU. The provided instructions have been tested on Ubuntu 22.04 and 24.04 with various NVIDIA GPUs.&lt;/p&gt;

&lt;p&gt;To keep this guide manageable, I won't delve into lower-level details, such as specific domain XML tricks, Linux kernel builds, or GPU firmware flashing. In most cases, you don't need to fiddle with those.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Remove NVIDIA drivers
&lt;/h2&gt;

&lt;p&gt;The first step is to remove the NVIDIA drivers. It is not strictly required, but NVIDIA drivers tend to interfere with passthrough in one way or another, so it's best to remove them altogether.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you're configuring your own work PC with multiple GPUs, skip this step, since without the NVIDIA drivers you won't be able to run GUI applications. In that case, passthrough robustness is likely not a priority for you. However, I strongly recommend removing the NVIDIA drivers on headless servers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the NVIDIA driver is installed from the repository, you can remove it using the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get remove &lt;span class="nt"&gt;--purge&lt;/span&gt; &lt;span class="s1"&gt;'^nvidia-.*'&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt autoremove
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you've installed the driver using the RUN file, remove it using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; /usr/bin/nvidia-uninstall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remove any leftover configuration files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /etc/X11/xorg.conf
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /etc/modprobe.d/nvidia&lt;span class="k"&gt;*&lt;/span&gt;.conf
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /lib/modprobe.d/nvidia&lt;span class="k"&gt;*&lt;/span&gt;.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reboot the system after removing the drivers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Check BIOS, IOMMU Support and IOMMU Group Assignment
&lt;/h2&gt;

&lt;p&gt;The next step is to check virtualization and IOMMU support. We need to check four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Virtualization is enabled (the AMD-Vi / Intel VT-d option is enabled in the BIOS). If present, also enable the "Above 4G Decoding" and "Resizable BAR (ReBAR)" BIOS options.&lt;/li&gt;
&lt;li&gt;IOMMU is active (groups exist).&lt;/li&gt;
&lt;li&gt;Each GPU and its audio function are isolated in their own IOMMU group.&lt;/li&gt;
&lt;li&gt;GPU groups contain only GPU/video-audio functions and PCI bridges — no NICs, NVMe, SATA, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2boo48ogsvmbugducvy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2boo48ogsvmbugducvy.png" alt=" " width="797" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can use the following handy-dandy script to check those preconditions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI goes overboard when generating helper scripts, doesn't it? I can't complain, though. It provides a lot of useful information.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="c"&gt;# VFIO host sanity check: IOMMU support + GPU-containing groups&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;  &lt;span class="c"&gt;# don't use -e so greps that find nothing don't abort&lt;/span&gt;

&lt;span class="c"&gt;# --- helpers ---------------------------------------------------------------&lt;/span&gt;
have&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null 2&amp;gt;&amp;amp;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

read_klog&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;have journalctl&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then &lt;/span&gt;journalctl &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="nt"&gt;-b&lt;/span&gt; 0 2&amp;gt;/dev/null
  &lt;span class="k"&gt;else &lt;/span&gt;dmesg 2&amp;gt;/dev/null
  &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

trim&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/^[[:space:]]*//'&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/[[:space:]]*$//'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# --- 1) CPU vendor + boot flags -------------------------------------------&lt;/span&gt;
&lt;span class="nv"&gt;CPU_VENDOR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;lscpu 2&amp;gt;/dev/null | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt;: &lt;span class="s1"&gt;'/Vendor ID/{print $2}'&lt;/span&gt; | trim&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-m1&lt;/span&gt; &lt;span class="s1"&gt;'vendor_id'&lt;/span&gt; /proc/cpuinfo 2&amp;gt;/dev/null | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $3}'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CPU_VENDOR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;CPU_VENDOR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"(unknown)"&lt;/span&gt;

&lt;span class="nv"&gt;CMDLINE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/cmdline 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;HAS_INTEL_FLAG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMDLINE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'intel_iommu=on'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;HAS_AMD_FLAG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMDLINE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'amd_iommu=on'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;HAS_PT_FLAG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMDLINE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'iommu=pt'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# --- 2) Kernel log signals ------------------------------------------------&lt;/span&gt;
&lt;span class="nv"&gt;KLOG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;read_klog&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nv"&gt;DISABLED_MSG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$KLOG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | egrep &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'IOMMU.*disabled by BIOS|DMAR:.*disabled|AMD-Vi:.*disabled'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;ENABLED_MSG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$KLOG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | egrep &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'DMAR: IOMMU enabled|AMD-Vi:.*IOMMU.*enabled|IOMMU: .*enabled'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;IR_MSG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$KLOG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | egrep &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'Interrupt remapping enabled'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# --- 3) IOMMU groups presence --------------------------------------------&lt;/span&gt;
&lt;span class="nv"&gt;GROUPS_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/sys/kernel/iommu_groups"&lt;/span&gt;
&lt;span class="nv"&gt;GROUP_COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nv"&gt;GROUP_COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-mindepth&lt;/span&gt; 1 &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 1 &lt;span class="nt"&gt;-type&lt;/span&gt; d 2&amp;gt;/dev/null | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $1}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Heuristic: active if groups exist (&amp;gt;0). Logs help explain state.&lt;/span&gt;
&lt;span class="nv"&gt;IOMMU_ACTIVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"no"&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUP_COUNT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;IOMMU_ACTIVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"yes"&lt;/span&gt;

&lt;span class="c"&gt;# --- 4) Report summary ----------------------------------------------------&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== IOMMU Summary ==="&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CPU vendor           : &lt;/span&gt;&lt;span class="nv"&gt;$CPU_VENDOR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Kernel cmdline       : &lt;/span&gt;&lt;span class="nv"&gt;$CMDLINE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Boot flags           : intel_iommu=&lt;/span&gt;&lt;span class="nv"&gt;$HAS_INTEL_FLAG&lt;/span&gt;&lt;span class="s2"&gt;  amd_iommu=&lt;/span&gt;&lt;span class="nv"&gt;$HAS_AMD_FLAG&lt;/span&gt;&lt;span class="s2"&gt;  iommu=pt=&lt;/span&gt;&lt;span class="nv"&gt;$HAS_PT_FLAG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Groups directory     : &lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;  (exists: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"IOMMU group count    : &lt;/span&gt;&lt;span class="nv"&gt;$GROUP_COUNT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Kernel says enabled  : &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Interrupt remapping  : &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IR_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Kernel says disabled : &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DISABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo yes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;no&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"IOMMU ACTIVE?        : &lt;/span&gt;&lt;span class="nv"&gt;$IOMMU_ACTIVE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo

&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"--- Kernel enable lines ---"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo
&lt;/span&gt;&lt;span class="k"&gt;fi
if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DISABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"--- Kernel disable lines ---"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DISABLED_MSG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo
&lt;/span&gt;&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# --- 5) Original: list only GPU-containing groups -------------------------&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== GPU-Containing IOMMU Groups ==="&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUP_COUNT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"(no IOMMU groups found)"&lt;/span&gt;
&lt;span class="k"&gt;else
  &lt;/span&gt;&lt;span class="nb"&gt;declare&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nv"&gt;GPU_COUNT_BY_GROUP&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;
  &lt;span class="nv"&gt;group_warnings&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;g &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GROUPS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$g&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;continue
    &lt;/span&gt;&lt;span class="nv"&gt;group_num&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$g&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;gpu_found&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false
    &lt;/span&gt;&lt;span class="nv"&gt;device_lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
    &lt;span class="nv"&gt;non_gpu_non_bridge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false
    &lt;/span&gt;&lt;span class="nv"&gt;gpu_count_in_this_group&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0

    &lt;span class="k"&gt;for &lt;/span&gt;d &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$g&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;/devices/&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$d&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;continue
      &lt;/span&gt;&lt;span class="nv"&gt;pci_addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$d&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
      &lt;span class="c"&gt;# -nns prints class code [XXXX] and vendor:device [vvvv:dddd]&lt;/span&gt;
      &lt;span class="nv"&gt;line&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;lspci &lt;span class="nt"&gt;-nns&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pci_addr&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pci_addr&lt;/span&gt;&lt;span class="s2"&gt; (unlisted)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
      device_lines+&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s1"&gt;$'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;

      &lt;span class="c"&gt;# Extract first [...] which is the class code, e.g. 0300, 0302, 0403, 0604, 0600&lt;/span&gt;
      &lt;span class="nv"&gt;class_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt;&lt;span class="s1"&gt;'[][]'&lt;/span&gt; &lt;span class="s1"&gt;'{print $2}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

      &lt;span class="c"&gt;# Detect GPUs / 3D controllers and their HDA audio functions&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s1"&gt;'VGA compatible controller|3D controller'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nv"&gt;gpu_found&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true
        &lt;/span&gt;&lt;span class="nv"&gt;gpu_count_in_this_group&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;gpu_count_in_this_group+1&lt;span class="k"&gt;))&lt;/span&gt;
      &lt;span class="k"&gt;fi&lt;/span&gt;

      &lt;span class="c"&gt;# Allowlist: 0300(VGA), 0302(3D), 0403(HDA audio), 0600(host bridge), 0604(PCI bridge)&lt;/span&gt;
      &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$class_code&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="k"&gt;in
        &lt;/span&gt;0300|0302|0403|0600|0604&lt;span class="p"&gt;)&lt;/span&gt; : &lt;span class="p"&gt;;;&lt;/span&gt;
        &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;non_gpu_non_bridge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
      &lt;span class="k"&gt;esac&lt;/span&gt;
    &lt;span class="k"&gt;done

    if&lt;/span&gt; &lt;span class="nv"&gt;$gpu_found&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"IOMMU Group &lt;/span&gt;&lt;span class="nv"&gt;$group_num&lt;/span&gt;&lt;span class="s2"&gt;:"&lt;/span&gt;
      &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$device_lines&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

      &lt;span class="c"&gt;# Track GPUs per group&lt;/span&gt;
      GPU_COUNT_BY_GROUP[&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$group_num&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;]=&lt;/span&gt;&lt;span class="nv"&gt;$gpu_count_in_this_group&lt;/span&gt;

      &lt;span class="c"&gt;# Warn if unexpected devices share the group with the GPU&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;$non_gpu_non_bridge&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;group_warnings+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"WARN: Group &lt;/span&gt;&lt;span class="nv"&gt;$group_num&lt;/span&gt;&lt;span class="s2"&gt; contains non-GPU, non-audio, non-bridge devices (consider different slot/CPU root complex or ACS)."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;fi
    fi
  done&lt;/span&gt;

  &lt;span class="c"&gt;# Post-checks&lt;/span&gt;
  &lt;span class="c"&gt;# 1) Each GPU should be alone (one GPU per group)&lt;/span&gt;
  &lt;span class="nv"&gt;shared_groups&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;gnum &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="p"&gt;!GPU_COUNT_BY_GROUP[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GPU_COUNT_BY_GROUP&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;$gnum&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 1 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;shared_groups+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$gnum&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;fi
  done

  if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${#&lt;/span&gt;&lt;span class="nv"&gt;shared_groups&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo
    echo&lt;/span&gt; &lt;span class="s2"&gt;"WARN: Multiple GPUs share these IOMMU groups: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;shared_groups&lt;/span&gt;&lt;span class="p"&gt;[*]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; (prefer one GPU per group for VFIO)."&lt;/span&gt;
  &lt;span class="k"&gt;fi&lt;/span&gt;

  &lt;span class="c"&gt;# 2) Any non-bridge co-residents?&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${#&lt;/span&gt;&lt;span class="nv"&gt;group_warnings&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo
    printf&lt;/span&gt; &lt;span class="s2"&gt;"%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;group_warnings&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;fi
fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what a good summary should look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;===&lt;/span&gt; IOMMU Summary &lt;span class="o"&gt;===&lt;/span&gt;
CPU vendor           : AuthenticAMD
Kernel cmdline       : &lt;span class="nv"&gt;BOOT_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/boot/vmlinuz-6.8.0-71-generic &lt;span class="nv"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/mapper/vgroot-lvroot ro systemd.unified_cgroup_hierarchy&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false &lt;/span&gt;&lt;span class="nv"&gt;default_hugepagesz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1G &lt;span class="nv"&gt;hugepages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;576 &lt;span class="nv"&gt;hugepagesz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1G nomodeset &lt;span class="nv"&gt;video&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;efifb:off &lt;span class="nv"&gt;iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pt &lt;span class="nv"&gt;pci&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;realloc &lt;span class="nv"&gt;pcie_aspm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;off &lt;span class="nv"&gt;amd_iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;on vfio-pci.ids&lt;span class="o"&gt;=&lt;/span&gt;10de:0000,10de:204b,10de:22e8,10de:2bb1 modprobe.blacklist&lt;span class="o"&gt;=&lt;/span&gt;nouveau,nvidia,nvidiafb,snd_hda_intel
Boot flags           : &lt;span class="nv"&gt;intel_iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;no  &lt;span class="nv"&gt;amd_iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;yes  &lt;/span&gt;&lt;span class="nv"&gt;iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;yes
&lt;/span&gt;Groups directory     : /sys/kernel/iommu_groups  &lt;span class="o"&gt;(&lt;/span&gt;exists: &lt;span class="nb"&gt;yes&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
IOMMU group count    : 57
Kernel says enabled  : no
Interrupt remapping  : no
Kernel says disabled : no
IOMMU ACTIVE?        : &lt;span class="nb"&gt;yes&lt;/span&gt;

&lt;span class="o"&gt;===&lt;/span&gt; GPU-Containing IOMMU Groups &lt;span class="o"&gt;===&lt;/span&gt;
IOMMU Group 13:
c1:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2bb1] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
c1:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;

IOMMU Group 16:
c6:00.0 PCI bridge &lt;span class="o"&gt;[&lt;/span&gt;0604]: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge &lt;span class="o"&gt;[&lt;/span&gt;1a03:1150] &lt;span class="o"&gt;(&lt;/span&gt;rev 06&lt;span class="o"&gt;)&lt;/span&gt;
c7:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: ASPEED Technology, Inc. ASPEED Graphics Family &lt;span class="o"&gt;[&lt;/span&gt;1a03:2000] &lt;span class="o"&gt;(&lt;/span&gt;rev 52&lt;span class="o"&gt;)&lt;/span&gt;

IOMMU Group 27:
81:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2bb1] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
81:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;

IOMMU Group 42:
01:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2bb1] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
01:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;

IOMMU Group 54:
41:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2bb1] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
41:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, IOMMU support is enabled, and all GPUs and their corresponding audio devices are in separate IOMMU groups.&lt;/p&gt;

&lt;p&gt;Sometimes you may see PCI bridges in the GPU IOMMU group. This is normal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;===&lt;/span&gt; GPU-Containing IOMMU Groups &lt;span class="o"&gt;===&lt;/span&gt;
IOMMU Group 13:
40:01.0 Host bridge &lt;span class="o"&gt;[&lt;/span&gt;0600]: Advanced Micro Devices, Inc. &lt;span class="o"&gt;[&lt;/span&gt;AMD] Starship/Matisse PCIe Dummy Host Bridge &lt;span class="o"&gt;[&lt;/span&gt;1022:1482]
40:01.1 PCI bridge &lt;span class="o"&gt;[&lt;/span&gt;0604]: Advanced Micro Devices, Inc. &lt;span class="o"&gt;[&lt;/span&gt;AMD] Starship/Matisse GPP Bridge &lt;span class="o"&gt;[&lt;/span&gt;1022:1483]
41:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2b85] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
41:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;

IOMMU Group 32:
20:03.0 Host bridge &lt;span class="o"&gt;[&lt;/span&gt;0600]: Advanced Micro Devices, Inc. &lt;span class="o"&gt;[&lt;/span&gt;AMD] Starship/Matisse PCIe Dummy Host Bridge &lt;span class="o"&gt;[&lt;/span&gt;1022:1482]
20:03.1 PCI bridge &lt;span class="o"&gt;[&lt;/span&gt;0604]: Advanced Micro Devices, Inc. &lt;span class="o"&gt;[&lt;/span&gt;AMD] Starship/Matisse GPP Bridge &lt;span class="o"&gt;[&lt;/span&gt;1022:1483]
25:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2b85] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
25:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
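&lt;p&gt;If you want to double-check a single card without rerunning the whole script, its group membership can be read straight from sysfs. A minimal sketch, assuming the standard sysfs layout; the PCI address is an example, and the optional root parameter exists only to make the helper testable:&lt;/p&gt;

```shell
# Minimal manual check: read a device's IOMMU group from sysfs.
# 0000:01:00.0 is an example PCI address; adjust for your GPU.
iommu_group_of() {
  # $1: PCI address; $2: optional sysfs root override (useful for testing)
  basename "$(readlink -f "${2:-}/sys/bus/pci/devices/$1/iommu_group")"
}

iommu_group_of 0000:01:00.0
# then list everything that shares the group:
# ls /sys/kernel/iommu_groups/&lt;group&gt;/devices
```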



&lt;h2&gt;
  
  
  3. Leverage 1G Huge Pages
&lt;/h2&gt;

&lt;p&gt;This step is optional. However, if you have more than 512 GB of RAM, it is highly encouraged. In my experience, besides the performance benefit, 1 GiB huge pages make VM startup much more reliable on high-memory systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt; 128 GB RAM&lt;/strong&gt;: usually skip (benefit is small).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;128–512 GB&lt;/strong&gt;: optional; can reduce latency jitter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;gt; 512 GB&lt;/strong&gt;: recommended for reliability and predictable performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why 1 GiB pages help&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer page-table walks → fewer TLB misses.&lt;/li&gt;
&lt;li&gt;Lower page management overhead.&lt;/li&gt;
&lt;li&gt;More predictable VM start times on large RAM allocations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.1 Check Huge Page Support
&lt;/h3&gt;

&lt;p&gt;To confirm 1 GiB huge page support on your system, check for the &lt;code&gt;pdpe1gb&lt;/code&gt; CPU flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-m1&lt;/span&gt; pdpe1gb /proc/cpuinfo &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✓ CPU supports 1GiB pages"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✗ No 1GiB page support"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 Allocate Huge Pages
&lt;/h3&gt;

&lt;p&gt;Determine how much memory you want to dedicate to the VMs; you need to reserve that amount for huge pages, plus a buffer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note that the memory reserved for huge pages will not be usable by the host system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example, if you want to dedicate &lt;code&gt;2000 GB&lt;/code&gt; to virtual machines with an &lt;code&gt;80 GB&lt;/code&gt; buffer, you would need &lt;code&gt;2080&lt;/code&gt; huge pages.&lt;/p&gt;
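&lt;p&gt;The arithmetic is trivial but worth spelling out, using the example numbers above:&lt;/p&gt;

```shell
# Sizing sketch: 1 GiB pages needed = VM allocation + safety buffer (both in GiB).
vm_gb=2000      # memory dedicated to VMs
buffer_gb=80    # empirical safety buffer
pages=$((vm_gb + buffer_gb))
echo "nr_hugepages=$pages"
# → nr_hugepages=2080
```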

&lt;p&gt;I use the following empirically validated table to determine the huge page configuration on a high-memory multi-GPU system.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Total System RAM&lt;/th&gt;
&lt;th&gt;VM Allocation&lt;/th&gt;
&lt;th&gt;Buffer&lt;/th&gt;
&lt;th&gt;Huge Pages&lt;/th&gt;
&lt;th&gt;Left for System&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;768 GB&lt;/td&gt;
&lt;td&gt;640 (8x80) GB&lt;/td&gt;
&lt;td&gt;60 GB&lt;/td&gt;
&lt;td&gt;700&lt;/td&gt;
&lt;td&gt;68 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024 GB&lt;/td&gt;
&lt;td&gt;800 (8x100) GB&lt;/td&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;td&gt;880&lt;/td&gt;
&lt;td&gt;144 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1256 GB&lt;/td&gt;
&lt;td&gt;1040 (8x130) GB&lt;/td&gt;
&lt;td&gt;100 GB&lt;/td&gt;
&lt;td&gt;1140&lt;/td&gt;
&lt;td&gt;116 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1512 GB&lt;/td&gt;
&lt;td&gt;1280 (8x160) GB&lt;/td&gt;
&lt;td&gt;120 GB&lt;/td&gt;
&lt;td&gt;1300&lt;/td&gt;
&lt;td&gt;212 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048 GB&lt;/td&gt;
&lt;td&gt;1760 (8x220) GB&lt;/td&gt;
&lt;td&gt;160 GB&lt;/td&gt;
&lt;td&gt;1920&lt;/td&gt;
&lt;td&gt;128 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4096 GB&lt;/td&gt;
&lt;td&gt;3680 (8x460) GB&lt;/td&gt;
&lt;td&gt;200 GB&lt;/td&gt;
&lt;td&gt;3880&lt;/td&gt;
&lt;td&gt;216 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Is there a reliable formula to determine the huge page buffer size? Good question. If you know one, let me know in the comments. It makes sense that we need to leave some memory for the system, but it feels like the gap between the memory dedicated to VM allocation and the number of huge pages is unnecessary. After VM startup, the system reports exactly the requested number of huge pages allocated, so why do we need a buffer, and how big should it be? Is it because of fragmentation? Empirically, I have confirmed that it is needed: without a buffer, I occasionally ran into OOM errors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run the following command to allocate 2080 pages (it will take a while):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;2080 | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check that the huge pages were allocated, run &lt;code&gt;grep -i huge /proc/meminfo&lt;/code&gt;. Look at the &lt;code&gt;Hugepagesize&lt;/code&gt; and &lt;code&gt;Hugetlb&lt;/code&gt; values: they show the huge page size and the total amount of RAM reserved for huge pages. You should see output like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;AnonHugePages:     79872 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    2080
HugePages_Free:     1580
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        2181038080 kB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
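&lt;p&gt;On multi-socket machines it is also worth checking how the 1 GiB pages were spread across NUMA nodes, since a VM pinned to one node can fail to start even when the global count looks fine. A small sketch, assuming the standard sysfs layout; the optional root parameter exists only to make the helper testable:&lt;/p&gt;

```shell
# Print the number of reserved 1 GiB pages on each NUMA node.
show_1g_pages_per_node() {
  # $1: optional sysfs root override (useful for testing)
  local root="${1:-}"
  local n
  for n in "$root"/sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages; do
    [ -r "$n" ] && printf '%s: %s\n' "${n#"$root"}" "$(cat "$n")"
  done
  return 0
}

show_1g_pages_per_node
```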



&lt;p&gt;To deallocate, invoke:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;0 | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.3 Make Huge Pages Persistent
&lt;/h3&gt;

&lt;p&gt;Edit the &lt;code&gt;/etc/default/grub&lt;/code&gt; file and modify the line containing &lt;code&gt;GRUB_CMDLINE_LINUX&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Add &lt;code&gt;default_hugepagesz=1G hugepagesz=1G hugepages=&amp;lt;num&amp;gt;&lt;/code&gt; to the &lt;code&gt;GRUB_CMDLINE_LINUX&lt;/code&gt; options. The &lt;code&gt;&amp;lt;num&amp;gt;&lt;/code&gt; is the number of huge pages to allocate. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;GRUB_CMDLINE_LINUX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"... default_hugepagesz=1G hugepagesz=1G hugepages=200"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Be careful. If you specify more huge pages than the system can allocate, the machine will not boot.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Apply the GRUB changes, reboot, and verify that the huge pages are allocated (or defer this until the end).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;update-grub
&lt;span class="nb"&gt;sudo &lt;/span&gt;reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.4 (Optional) Mount a Huge Page Filesystem
&lt;/h3&gt;

&lt;p&gt;Many systems already have &lt;code&gt;/dev/hugepages&lt;/code&gt;. If not, or if you want a dedicated mount:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/hugepages-1G
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-t&lt;/span&gt; hugetlbfs &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;pagesize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1G none /mnt/hugepages-1G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that the mount point is present by running &lt;code&gt;grep hugetlbfs /proc/mounts&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You should see something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hugetlbfs /dev/hugepages hugetlbfs rw,nosuid,nodev,relatime,pagesize&lt;span class="o"&gt;=&lt;/span&gt;1024M 0 0
hugetlbfs /mnt/hugepages-1G hugetlbfs rw,relatime,pagesize&lt;span class="o"&gt;=&lt;/span&gt;1024M 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make the mount persistent across reboots, add it to &lt;code&gt;/etc/fstab&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"none /mnt/hugepages-1G hugetlbfs pagesize=1G 0 0"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/fstab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.5 Configure your Virtualization Software to use Huge Pages
&lt;/h3&gt;

&lt;p&gt;Neither Proxmox nor libvirt uses huge pages by default.&lt;/p&gt;

&lt;p&gt;To use them in libvirt, add the following section to the domain XML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;memoryBacking&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;hugepages&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;page&lt;/span&gt; &lt;span class="na"&gt;size=&lt;/span&gt;&lt;span class="s"&gt;'1048576'&lt;/span&gt; &lt;span class="na"&gt;unit=&lt;/span&gt;&lt;span class="s"&gt;'KiB'&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/hugepages&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;locked/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/memoryBacking&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the Proxmox CLI, it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;qm &lt;span class="nb"&gt;set&lt;/span&gt; &amp;lt;vmid&amp;gt; &lt;span class="nt"&gt;--hugepages&lt;/span&gt; 1024   &lt;span class="c"&gt;# use 1GiB pages&lt;/span&gt;
qm &lt;span class="nb"&gt;set&lt;/span&gt; &amp;lt;vmid&amp;gt; &lt;span class="nt"&gt;--keephugepages&lt;/span&gt; 1  &lt;span class="c"&gt;# optional: keep reserved after shutdown&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Bind to VFIO Early
&lt;/h2&gt;

&lt;p&gt;For maximum stability, have VFIO claim the GPU at boot so no runtime driver swaps occur (Proxmox/libvirt will otherwise bind/unbind around VM start/stop).&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 Identify the PCI IDs to bind
&lt;/h3&gt;

&lt;p&gt;First, you need to determine the PCI vendor ID and device ID for your GPUs.&lt;/p&gt;

&lt;p&gt;List all NVIDIA functions (display + audio, and any auxiliary functions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lspci &lt;span class="nt"&gt;-nn&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; nvidia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example (RTX 5090):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;01:00.0 VGA compatible controller &lt;span class="o"&gt;[&lt;/span&gt;0300]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:2b85] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
01:00.1 Audio device &lt;span class="o"&gt;[&lt;/span&gt;0403]: NVIDIA Corporation Device &lt;span class="o"&gt;[&lt;/span&gt;10de:22e8] &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
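&lt;p&gt;The ID pairs in square brackets are what goes into &lt;code&gt;vfio-pci.ids&lt;/code&gt;. A small sketch of turning that output into the comma-separated list; here it is fed the sample lines above, while on a real host you would pipe in &lt;code&gt;lspci -nn | grep -i nvidia&lt;/code&gt; instead:&lt;/p&gt;

```shell
# Extract unique NVIDIA (10de) vendor:device pairs and join them with commas.
to_vfio_ids() {
  grep -oE '\[10de:[0-9a-f]{4}\]' | tr -d '[]' | sort -u | paste -sd, -
}

printf '%s\n' \
  '01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2b85] (rev a1)' \
  '01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)' |
  to_vfio_ids
# → 10de:22e8,10de:2b85
```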



&lt;h3&gt;
  
  
  4.2 Give VFIO first claim
&lt;/h3&gt;

&lt;p&gt;Add the following lines to &lt;code&gt;GRUB_CMDLINE_LINUX_DEFAULT&lt;/code&gt; in &lt;code&gt;/etc/default/grub&lt;/code&gt;, replacing the PCI vendor and device IDs with the appropriate values. Keep any other options you already have.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GRUB_CMDLINE_LINUX_DEFAULT="modprobe.blacklist=nouveau,nvidia,nvidiafb,snd_hda_intel vfio-pci.ids=10de:2b85,10de:22e8 ..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Proxmox is likely using systemd-boot by default instead of GRUB. &lt;a href="https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot" rel="noopener noreferrer"&gt;Check the bootloader&lt;/a&gt; you're using and adjust the kernel command line accordingly.&lt;/p&gt;

&lt;p&gt;Many online manuals suggest adding VFIO modules to &lt;code&gt;/etc/modprobe.d/vfio.conf&lt;/code&gt;, but this approach has not always worked for me. I recommend early binding via the kernel command line.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4.3 Ensure VFIO is in the initramfs
&lt;/h3&gt;

&lt;p&gt;We need to make sure that the VFIO modules are loaded early in the boot process. To achieve this, we include them in the &lt;code&gt;initramfs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/initramfs-tools/modules &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.4 Reboot and verify
&lt;/h3&gt;

&lt;p&gt;Update GRUB and the initramfs, then reboot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;update-initramfs &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-k&lt;/span&gt; all
&lt;span class="nb"&gt;sudo &lt;/span&gt;update-grub
&lt;span class="nb"&gt;sudo &lt;/span&gt;reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the reboot, check that the VFIO driver is in use with &lt;code&gt;lspci -k | grep -A 3 -i nvidia&lt;/code&gt;. You should see &lt;code&gt;vfio-pci&lt;/code&gt; as the kernel driver in use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;81:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
    Subsystem: Gigabyte Technology Co., Ltd Device 416f
    Kernel driver &lt;span class="k"&gt;in &lt;/span&gt;use: vfio-pci
    Kernel modules: nvidiafb, nouveau
81:00.1 Audio device: NVIDIA Corporation Device 22e8 &lt;span class="o"&gt;(&lt;/span&gt;rev a1&lt;span class="o"&gt;)&lt;/span&gt;
    Subsystem: NVIDIA Corporation Device 0000
    Kernel driver &lt;span class="k"&gt;in &lt;/span&gt;use: vfio-pci
    Kernel modules: snd_hda_intel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
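&lt;p&gt;Checking every function by eye gets tedious on an 8-GPU box. Here is a sketch of an automated check; it parses the sample output above, while on a real host you would pipe in &lt;code&gt;lspci -k -d 10de:&lt;/code&gt; instead:&lt;/p&gt;

```shell
# Flag any listed PCI function whose "Kernel driver in use" is not vfio-pci.
check_vfio() {
  awk '/^[0-9a-f]/ { dev = $1 }
       /Kernel driver in use:/ { if ($NF != "vfio-pci") { print dev " uses " $NF; bad = 1 } }
       END { exit bad }'
}

printf '%s\n' \
  '81:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 (rev a1)' \
  '    Kernel driver in use: vfio-pci' \
  '81:00.1 Audio device: NVIDIA Corporation Device 22e8 (rev a1)' \
  '    Kernel driver in use: snd_hda_intel' |
  check_vfio || echo 'WARN: not all functions are bound to vfio-pci'
```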



&lt;blockquote&gt;
&lt;p&gt;To be fair, there was one machine where this technique for binding VFIO failed: the system kept aggressively binding the &lt;code&gt;snd_hda_intel&lt;/code&gt; driver to the GPU audio function. However, this method worked for me in all other cases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5. Other GRUB Options
&lt;/h2&gt;

&lt;p&gt;Here is a summary of other kernel command line options that you may want to consider, along with my thoughts on each.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pci=realloc&lt;/code&gt;: Forces the kernel to reassign PCI bus resources (MMIO/IO BARs) from scratch, ignoring what the firmware/BIOS assigned. It helps avoid issues when the BIOS didn't allocate enough space for devices (common with large GPUs or multiple devices) and fixes “BAR can't be assigned” or “resource busy” errors. &lt;em&gt;This option is helpful. I like to include it in the guest OS kernel params as well, where it occasionally helps to work around BAR allocation issues. However, there is no need to list it unless the system has PCI device enumeration issues.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;iommu=pt&lt;/code&gt;: IOMMU passthrough mode tells the kernel to enable the IOMMU but use pass-through DMA mappings by default. For VFIO GPU passthrough, this allows devices to access physical memory directly with minimal performance penalty. &lt;em&gt;I haven't had a chance to measure the performance gains, so I can only say that this option didn't break anything.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pcie_aspm=off&lt;/code&gt;: Disables PCIe Active State Power Management, a power-saving feature that reduces PCIe link power in idle states. Some PCIe devices (especially GPUs) have trouble retraining links or waking from ASPM low-power states, leading to hangs or “device inaccessible” errors. I added this option to my configs after losing a lot of time on the &lt;a href="https://dev.to/blog/bug-bounty-nvidia-reset-bug"&gt;Reset Bug&lt;/a&gt;. &lt;em&gt;It didn't help. I don't consider this option helpful at the moment, but I am still evaluating it.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nomodeset&lt;/code&gt;: Disable kernel mode setting (KMS) for all GPUs; prevents DRM drivers from taking over the console. &lt;strong&gt;This option is intended for use with headless servers only. It can break desktop/console output. I typically use it since we're working with headless servers.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;video=efifb:off&lt;/code&gt;: Disables the firmware EFI framebuffer so simpledrm/efifb won’t grab the boot GPU before VFIO claims it. &lt;em&gt;This option is outdated and has no effect on systems with modern kernels. I list it for completeness.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;intel_iommu=on&lt;/code&gt; / &lt;code&gt;amd_iommu=on&lt;/code&gt;: Enable IOMMU support for Intel and AMD. &lt;em&gt;These are enabled by default, so there is no need to add them to kernel parameters&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is how a typical kernel command line might look on a headless server with over 500 GB of RAM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nomodeset
modprobe.blacklist&lt;span class="o"&gt;=&lt;/span&gt;nouveau,nvidia,nvidiafb,snd_hda_intel
vfio-pci.ids&lt;span class="o"&gt;=&lt;/span&gt;10de:2b85,10de:22e8
&lt;span class="nv"&gt;default_hugepagesz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1G &lt;span class="nv"&gt;hugepagesz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1G &lt;span class="nv"&gt;hugepages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;400
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
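&lt;p&gt;After rebooting, it is worth confirming that the options actually reached the running kernel; the live command line is exposed in &lt;code&gt;/proc/cmdline&lt;/code&gt;:&lt;/p&gt;

```shell
# Show the effective huge page setting, or a notice if it is absent.
grep -o 'hugepages=[0-9]*' /proc/cmdline 2>/dev/null || echo 'hugepages not set on this kernel'
```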



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;VFIO GPU passthrough is a finicky process, sensitive to host hardware and software configuration. However, with enough diligence, you can make it robust and reliable. &lt;strong&gt;I strongly believe in this approach and rely on VFIO GPU passthrough as the primary tool for our GPU rental service at &lt;a href="https://cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I hope this guide helped you improve your homelab or data center setup. If you notice any inaccuracies or have suggestions, please don't hesitate to let me know so we can improve the workflow together.&lt;/p&gt;

&lt;p&gt;Final host checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable IOMMU, Above 4G, and (where applicable) ReBAR in the BIOS.&lt;/li&gt;
&lt;li&gt;Verify clean IOMMU groups; each GPU (+ audio) isolated.&lt;/li&gt;
&lt;li&gt;Bind to vfio-pci early.&lt;/li&gt;
&lt;li&gt;Size huge pages (1 GiB on high-RAM hosts) and confirm in /proc/meminfo.&lt;/li&gt;
&lt;li&gt;Configure other kernel command-line options as needed.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpu</category>
      <category>kvm</category>
      <category>qemu</category>
      <category>virtualmachine</category>
    </item>
    <item>
      <title>UnSaaS your Stack with Self-hosted Cloud IDEs</title>
      <dc:creator>Dmitry Trifonov</dc:creator>
      <pubDate>Wed, 20 Aug 2025 22:53:54 +0000</pubDate>
      <link>https://dev.to/novibecoding/unsaas-your-stack-with-self-hosted-cloud-ides-27j6</link>
      <guid>https://dev.to/novibecoding/unsaas-your-stack-with-self-hosted-cloud-ides-27j6</guid>
      <description>&lt;p&gt;I am a PC enthusiast and use it as much as possible. However, with the speed at which LLMs are growing in size, it is challenging to avoid the cloud for AI development.&lt;/p&gt;

&lt;p&gt;Many good GPU-enabled SaaS options exist for remote development, like &lt;a href="https://colab.google/" rel="noopener noreferrer"&gt;Google Colab&lt;/a&gt;. Yet, if you need to go beyond the free tier, the compute cost on these SaaS platforms will quickly empty your pockets. Additionally, self-hosting allows you to use your favorite tools and is the most secure option if you do it right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw1ztabxvk2x7rqxn1ja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw1ztabxvk2x7rqxn1ja.png" alt="JetBrains, Zed, VS Code and Jupyter Lab" width="800" height="533"&gt;&lt;/a&gt;&lt;em&gt;JetBrains, Zed, VS Code and Jupyter Lab&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Renting a GPU Server
&lt;/h2&gt;

&lt;p&gt;There are &lt;a href="https://research.aimultiple.com/cloud-gpu-providers/" rel="noopener noreferrer"&gt;plenty of places&lt;/a&gt; to rent GPUs, and this tutorial is valid for any machine with SSH access. I am obviously using our own service &lt;a href="https://www.cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt; to test solutions in this tutorial. It provides good value, supports virtual machines, and provisions them fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jupyter Lab — Plain and Simple
&lt;/h2&gt;

&lt;p&gt;Jupyter Lab is my go-to option for short experiments. It is the simplest of these IDEs and the easiest to pick up if you work in Python and are familiar with Jupyter. It contains everything needed for a quick experiment: a file explorer, a command line, and the Jupyter Notebook.&lt;/p&gt;

&lt;p&gt;Install the necessary system dependencies after starting a VM and connecting to it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install python3-venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Create a virtual environment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m venv venv
source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Install Jupyter Lab and start it. Replace the JUPYTER_TOKEN value with your own secret.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install jupyterlab
JUPYTER_TOKEN=ide-tutorial jupyter lab --no-browser --port=8080 --ip=0.0.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You need to add the &lt;em&gt;--ip=0.0.0.0&lt;/em&gt; flag to make the notebook reachable from outside the remote server, since by default it only listens on localhost. The IDE will be available at &lt;a href="http://{node-ip-address}:8080/" rel="noopener noreferrer"&gt;http://{node-ip-address}:8080/&lt;/a&gt;. Specify JUPYTER_TOKEN when prompted to log in.&lt;/p&gt;
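
&lt;p&gt;Exposing the port to the whole internet is convenient but not ideal security-wise. A safer alternative, assuming you have SSH access to the node, is to keep Jupyter listening on localhost and forward the port through an SSH tunnel (the username and IP below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Forward local port 8080 to port 8080 on the remote machine;
# -N means no remote command, just the tunnel
ssh -N -L 8080:localhost:8080 riftuser@{node-ip-address}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the tunnel running, start Jupyter without the --ip flag and open http://localhost:8080/ in your local browser.&lt;/p&gt;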

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5016%2F1%2AC5DNTr1XKGr5xIuVovBdnA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5016%2F1%2AC5DNTr1XKGr5xIuVovBdnA.png" alt="Jupyter Lab hosted on [cloudrift.ai](https://www.cloudrift.ai/)" width="800" height="444"&gt;&lt;/a&gt;&lt;em&gt;Jupyter Lab hosted on &lt;a href="https://www.cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  VS Code — Most Versatile
&lt;/h2&gt;

&lt;p&gt;VS Code is convenient if you need to do more serious development work. It includes a debugger, supports many languages, and provides a command line and file explorer, along with a gazillion features you probably won’t need.&lt;/p&gt;

&lt;p&gt;Install the &lt;a href="https://github.com/coder/code-server" rel="noopener noreferrer"&gt;code-server&lt;/a&gt; on a remote machine and run it using the following command.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://code-server.dev/install.sh | sh
PASSWORD=ide-tutorial code-server --bind-addr 0.0.0.0:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Don’t forget to substitute the password with your own. The IDE will be available at &lt;a href="http://{node-ip-address}:8080/" rel="noopener noreferrer"&gt;http://{node-ip-address}:8080/&lt;/a&gt;.&lt;/p&gt;
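
&lt;p&gt;Instead of passing the password on the command line, code-server can read its settings from a config file. A minimal sketch, using code-server’s default config location (the password value is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ~/.config/code-server/config.yaml
bind-addr: 0.0.0.0:8080
auth: password
password: ide-tutorial
cert: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After editing the file, start code-server with no extra flags and it will pick these settings up.&lt;/p&gt;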

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5004%2F1%2Al3P6IU9LSUSjGfpSJWIQpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5004%2F1%2Al3P6IU9LSUSjGfpSJWIQpw.png" alt="VS Code hosted on [cloudrift.ai](https://www.cloudrift.ai/) via [code-server](https://github.com/coder/code-server)" width="800" height="470"&gt;&lt;/a&gt;&lt;em&gt;VS Code hosted on &lt;a href="https://www.cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt; via &lt;a href="https://github.com/coder/code-server" rel="noopener noreferrer"&gt;code-server&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  JetBrains — Neat Features
&lt;/h2&gt;

&lt;p&gt;I am a fan of JetBrains and use its IDEs for my local development. JetBrains takes a different approach from the aforementioned code editors: instead of running the IDE in the browser, your local IDE communicates with the remote server, so it feels like working locally. Additionally, it offers nice features like cloning a repository on the remote machine using your local SSH agent for authentication.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At the time of this writing, the JetBrains Gateway is in Beta. Many features were not working as expected in PyCharm or RustRover (testing on Ubuntu 22.04). Hopefully, the situation will improve over time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To start, open any JetBrains IDE (update to the latest version) and select &lt;em&gt;File -&amp;gt; Remote Development&lt;/em&gt;. You can also do it without installing JetBrains IDE via &lt;a href="https://www.jetbrains.com/remote-development/gateway/" rel="noopener noreferrer"&gt;JetBrains Gateway&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94pnhpfrizsic67s7shw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94pnhpfrizsic67s7shw.png" width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “New Connection” and specify riftuser as the username and a node IP address as the Host.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84tslx3insqkzbkha7rd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84tslx3insqkzbkha7rd.png" width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the next screen, choose the IDE you want to use and specify ~ as the Project directory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j2ytmq3hqjnx70ihtq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j2ytmq3hqjnx70ihtq4.png" width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The IDE will take some time to download and configure. Afterwards, you can use it just like your local one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5590%2F1%2Acp3tWKZzJpT3bm9tIp53xQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5590%2F1%2Acp3tWKZzJpT3bm9tIp53xQ.png" alt="PyCharm running on a remote server on [cloudrift.ai](https://www.cloudrift.ai/)" width="800" height="571"&gt;&lt;/a&gt;&lt;em&gt;PyCharm running on a remote server on &lt;a href="https://www.cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PyCharm has another, more mature feature for remote development called a remote interpreter. To use it, go to Settings -&amp;gt; Python Interpreter -&amp;gt; Add Interpreter -&amp;gt; On SSH and configure the connection similarly. It will synchronize your code with the remote server and run your app remotely. It is a good option if you have a fast, symmetric internet connection. Otherwise, the experience might be sluggish, and you will need to configure which directories are synchronized to avoid uploading heavy ones like the Python virtual environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8fkad888v6gpdpo9s09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8fkad888v6gpdpo9s09.png" alt="Using remote interpreter feature in JetBrains IDE" width="800" height="588"&gt;&lt;/a&gt;&lt;em&gt;Using remote interpreter feature in JetBrains IDE&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Zed — Fast and Lean
&lt;/h2&gt;

&lt;p&gt;I just learned about Zed and was quite impressed. The installation and remote access were a breeze, and it is also the most responsive of the tested IDEs. Like JetBrains, Zed is a local editor that communicates with the remote server. So, &lt;a href="https://zed.dev/docs/getting-started" rel="noopener noreferrer"&gt;install&lt;/a&gt; Zed &lt;strong&gt;locally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After installation, to connect to a remote server, click &lt;em&gt;File -&amp;gt; Open Remote -&amp;gt; Connect New Server&lt;/em&gt; and specify the SSH command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ohzv27mxdsxygjj5o24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ohzv27mxdsxygjj5o24.png" alt="Connecting to a remote server using Zed" width="800" height="547"&gt;&lt;/a&gt;&lt;em&gt;Connecting to a remote server using Zed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That’s it. Afterward, you would typically clone your repository using the integrated terminal, and you’re all set.&lt;/p&gt;
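
&lt;p&gt;For example, once connected, cloning a project from the integrated terminal looks like this (the repository URL is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone your project onto the remote machine and open it
git clone https://github.com/your-name/your-project.git
cd your-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;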

&lt;p&gt;Despite its strengths, Zed is still a niche product. You often need to edit its config files by hand, and as of today the team has yet to add &lt;a href="https://github.com/zed-industries/zed/issues/5065" rel="noopener noreferrer"&gt;Build and Debug&lt;/a&gt; features. However, if you’re comfortable working from the command line, Zed might be the best option for remote development. The IDE comes with an AI assistant and all the modern features, like support for MCP servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F7444%2F1%2ALvGqkSMOZmzLehCU6ouZ2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F7444%2F1%2ALvGqkSMOZmzLehCU6ouZ2w.png" alt="Zed working with a remote server on [cloudrift.ai](https://www.cloudrift.ai/)" width="800" height="455"&gt;&lt;/a&gt;&lt;em&gt;Zed working with a remote server on &lt;a href="https://www.cloudrift.ai/" rel="noopener noreferrer"&gt;cloudrift.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;My recommendations for self-hosted cloud editors as of April 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Jupyter Lab&lt;/strong&gt; is best for simple Python projects if you’re familiar with Jupyter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VS Code&lt;/strong&gt; will handle most programming tasks well. It is the only tested editor with a properly working Build and Debug feature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zed&lt;/strong&gt; is the best option if you’re proficient with the command line. It is easy to set up, fast, and has modern AI and collaborative features. However, it doesn’t have Build and Debug features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JetBrains Gateway&lt;/strong&gt; is in Beta and difficult to recommend at the moment. However, it has great potential due to its neat features that seamlessly blend local and remote environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Take a look at this &lt;a href="https://github.com/awesome-selfhosted/awesome-selfhosted?tab=readme-ov-file#software-development---ide--tools" rel="noopener noreferrer"&gt;list&lt;/a&gt; if you want to explore more options.&lt;/p&gt;

&lt;p&gt;The IDE choice is personal, so choose the one you’re most familiar with and enjoy working with. Happy coding!&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>cloud</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
