Hello everyone! Before we start, let me introduce myself for those who don't know me yet. My name is Elijah, and I am a technical director at a company that builds machine-learning-based products. Before that, I spent 15 years in game development and went through several of its technical milestones. I was partly inspired to write this article by the recent flame war that broke out after Nvidia's announcement of DLSS 5.
In this article, we will set aside Nvidia's own marketing missteps and the technically rough demos, and instead dive into the near-term technical future that awaits the gamedev industry, walking through the history of the graphics pipeline along the way.
The beginning of a revolution in the graphics pipeline
There has been a truly huge shift in the rendering architecture of games over the past five years. For two decades before that, progress rested on the inexorable mathematics of Moore's law, where improving rendering quality came down to more computing power, more polygons, and more shaders. Today everything has changed: high visual quality in games is no longer achieved by brute force, but by new approaches that overturn the way a game image has been built for decades, including approaches based on AI. Today, more than 80% of all pixels on screen in the most technically advanced games are produced not by the classic path but by mixed approaches and new tricks (new ways of computing light, AI-based super-sampling, and so on).
Today we will focus on how AI has influenced (and will continue to influence) rendering in games, to understand the essence of this shift in graphics. To do that, we will first look at two key topics: the fundamental differences between DLSS and all previous approaches to upscaling and anti-aliasing (from SSAA to TAAU), and the place that neural network upscaling occupies in the modern graphics pipeline. After that, we will review the near future of rendering in games.
From super-sampling to neural synthesis of a scene
To assess the place of DLSS (including the upcoming 5th version) in the graphics pipeline, let's trace the evolution of anti-aliasing and upscaling from the very first implementations to the modern AI-based approach.
The Era of "pure mathematics": SSAA, MSAA, FXAA and their limitations
Classical methods of fighting the "staircase" artifacts at the boundaries of game objects were based on a simple but computationally expensive principle: rendering and processing the image, or parts of it, at a higher resolution.
Let's look at three classic anti-aliasing algorithms in games:
- SSAA (Supersampling Anti-Aliasing): The most "honest" but also the most resource-hungry method (see the sketch after this list). The scene is rendered at a resolution higher than the user's target (for example, 4K for output to an FHD monitor), and the resulting frame is then downscaled to the desired size. SSAA shades every subpixel, which gives reference image quality but causes a severe drop in performance (I hope you understand why). For modern games with complex shaders and geometry this approach is impractical.
- MSAA (Multisample Anti-Aliasing): This method appeared as an attempt to optimize SSAA. The optimization boils down to running shading only once per primitive (for example, a triangle) inside a pixel rather than once per sample, as SSAA does. Although this noticeably reduces the load on the GPU, MSAA effectively smooths only the edges of the geometry and does not help with other parts of the pipeline, such as aliasing inside textures. Its cost is also still high for complex scenes.
- FXAA (Fast Approximate Anti-Aliasing): Post-processing anti-aliasing that analyzes the finished 2D frame, finds high-contrast edges in it, and blurs them. This is the cheapest method in terms of performance, but its main drawback is the inevitable blurring ("soap") of the entire image, including textures and interface elements, which leads to a loss of clarity and detail.
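To make the cost of the classic approach tangible, here is the sketch of the SSAA idea promised above: render at K times the resolution in each dimension, then average the sub-samples back down. The types and buffer layout are illustrative assumptions, not any engine's real API; the point is that both shading work and memory grow with K squared.

```cpp
// A minimal sketch of the SSAA idea: the scene is rendered at K x K times the
// target resolution, and each output pixel is the average of its K*K sub-samples.
#include <vector>

struct Color { float r, g, b; };

// Downsample a (w*K) x (h*K) super-sampled frame to w x h.
std::vector<Color> ssaaResolve(const std::vector<Color>& highRes,
                               int w, int h, int K)
{
    std::vector<Color> out(w * h);
    const int hw = w * K;                      // width of the high-res buffer
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            Color acc{0, 0, 0};
            for (int sy = 0; sy < K; ++sy)     // average all K*K shaded sub-samples
                for (int sx = 0; sx < K; ++sx) {
                    const Color& s = highRes[(y * K + sy) * hw + (x * K + sx)];
                    acc.r += s.r; acc.g += s.g; acc.b += s.b;
                }
            const float inv = 1.0f / (K * K);
            out[y * w + x] = { acc.r * inv, acc.g * inv, acc.b * inv };
        }
    }
    return out;
}
```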
The Age of Temporal Accumulation: TAA and its Heirs (TAAU)
A key breakthrough occurred with the transition to temporal methods that use information not only from the current frame, but also from previous ones.
What has changed in the approaches:
- TAA (Temporal Anti-Aliasing): Instead of rendering multiple samples per pixel, TAA slightly jitters the camera each frame, accumulates information from previous frames using per-object motion vectors, and then averages the result. This gives quality close to SSAA at a cost comparable to rendering a single frame. TAA became the industry standard for many years, but it has fundamental problems of its own: ghosting, loss of detail on small and fast-moving objects, and an overall "soapy" feel to the picture in some implementations and scenes.
- TAAU (Temporal Anti-Aliasing Upsampling): The logical development of TAA. Realizing that temporal accumulation can recover detail from subpixel information, developers began to use it to scale the image: the scene is rendered at a lower resolution and "upscaled" to the target resolution using the frame history.
It is TAAU that is the direct predecessor of DLSS, but with one critical difference: TAAU relies on hand-tuned heuristics and is not able to "understand" the scene; it only averages pixels mathematically, which often leads to artifacts.
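For readers who prefer code to prose, here is a minimal per-pixel sketch of the temporal accumulation that both TAA and TAAU build on: reproject the history with the motion vector, then blend. Everything below (the types, the fixed blend factor, the skipped neighborhood clamp) is a simplification for illustration, not a production resolve pass.

```cpp
// One pixel of a simplified TAA resolve: reproject last frame's accumulated
// color using the motion vector and blend it with the current jittered sample.
struct Float2 { float x, y; };
struct Color  { float r, g, b; };

Color lerp(const Color& a, const Color& b, float t) {
    return { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t, a.b + (b.b - a.b) * t };
}

// currentSample : freshly rendered, jittered sample for this pixel
// sampleHistory : callable that fetches the accumulated history at a position
// motionVector  : how far this surface point moved since the previous frame (pixels)
template <typename HistoryFetch>
Color taaResolvePixel(const Color& currentSample,
                      HistoryFetch sampleHistory,
                      Float2 pixelPos, Float2 motionVector,
                      float blendFactor = 0.1f)
{
    // 1. Reproject: look up history where this surface point was last frame.
    Float2 prevPos { pixelPos.x - motionVector.x, pixelPos.y - motionVector.y };
    Color history = sampleHistory(prevPos);

    // 2. Real implementations clamp 'history' to the neighborhood of the current
    //    sample here to reject stale data; skipping that step causes ghosting.

    // 3. Exponential blend: mostly history, a little bit of the new sample.
    return lerp(history, currentSample, blendFactor);
}
```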
The DLSS Revolution: From CNN to Vision Transformer
NVIDIA DLSS (Deep Learning Super Sampling) and, later, its FSR counterpart made a qualitative leap by replacing purely mathematical TAAU-style approaches with machine learning models capable of making more "intelligent" decisions about how to reconstruct an image.
The first generations used convolutional neural networks (CNNs). Their main disadvantage is "myopia": the model analyzes pixels only within a limited spatial window (the receptive field, i.e. the region of the input image it can see at once). This led to problems familiar to gamers: if an object moved too fast, the CNN would "lose sight" of it, causing flickering and ghosting on fine details like foliage, wires, or hair.
The transition to the Vision Transformer architecture in the 4th version of DLSS was a fundamental step forward. Unlike a CNN, a Transformer can evaluate the significance of and the relationships between any pixels in a frame, regardless of the distance between them, thanks to the attention mechanism. The model learned to "understand" the context of the entire scene, which made it possible to radically improve image stability, reduce the number of artifacts, and for the first time bring the upscaled image close to native resolution (and sometimes surpass it in clarity).
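To show what "any pixel can attend to any other pixel" means mechanically, here is a toy scaled dot-product attention over pixel tokens. This is the generic textbook formulation with made-up tiny feature vectors, not NVIDIA's actual model, dimensions, or weights.

```cpp
// Toy scaled dot-product attention: for each query (pixel token), weight every
// value by softmax(q.k / sqrt(d)) over ALL keys in the frame - no spatial window.
#include <vector>
#include <cmath>
#include <algorithm>

using Vec = std::vector<float>;

float dot(const Vec& a, const Vec& b) {
    float s = 0.f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

std::vector<Vec> attention(const std::vector<Vec>& Q,
                           const std::vector<Vec>& K,
                           const std::vector<Vec>& V)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(Q[0].size()));
    std::vector<Vec> out(Q.size(), Vec(V[0].size(), 0.f));

    for (size_t i = 0; i < Q.size(); ++i) {
        std::vector<float> w(K.size());
        float maxW = -1e30f, sum = 0.f;
        for (size_t j = 0; j < K.size(); ++j) {        // scores against every token
            w[j] = dot(Q[i], K[j]) * scale;
            maxW = std::max(maxW, w[j]);
        }
        for (size_t j = 0; j < K.size(); ++j) {        // numerically stable softmax
            w[j] = std::exp(w[j] - maxW);
            sum += w[j];
        }
        for (size_t j = 0; j < K.size(); ++j)          // weighted sum of values
            for (size_t d = 0; d < V[j].size(); ++d)
                out[i][d] += (w[j] / sum) * V[j][d];
    }
    return out;
}
```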
Technical integration: DLSS's place in the rendering pipeline
For those interested in how graphics pipelines work and what shaders have to do with it, I wrote an article about this a while ago using the Unity game engine as an example. The essence, however, is the same for all game engines, so I recommend reading it for a better understanding of the topic.
DLSS is not a "black box" that simply takes a low-resolution input and outputs a high-resolution one. It is a complex system deeply integrated into the rendering process, requiring developers not just to call an API but to prepare specific data.
So when you see DLSS behaving badly, the culprit is most likely the game's developers, not the technology itself.
Input data: what is needed for DLSS to work?
For DLSS to work correctly, the game engine must provide a special set of buffers, each of which is critical to the algorithm (a schematic sketch of this bundle follows the list):
- Color Buffer: Roughly speaking, this is a frame rendered in low resolution. This is a "rough sketch" based on which the final image will be built.
- Motion Vectors: A critical component. For each pixel, this buffer shows where it has moved compared to the previous frame. This allows DLSS to understand the dynamics of the scene, correctly link pixels across frames, and avoid ghosting.
- Depth Buffer: Information about the distance from the camera to each pixel. It helps the model understand the structure of the scene and which objects overlap each other, which is especially important for proper processing of the edges of the geometry.
- Jitter Offsets: To get more information than is contained in one frame, the camera is slightly shifted by a fraction of a pixel in each frame according to a special pattern (as in the previously described TAA). DLSS must know exactly the amount of this offset in order to "subtract" it from the motion vectors and correctly combine pixels from different frames.
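Put together, the per-frame contract looks roughly like the structure below. This is a hypothetical illustration of what an engine has to fill in, with field names of my own; the real NGX/Streamline structures differ.

```cpp
// A schematic (hypothetical) bundle of the inputs an engine prepares each frame
// for an upscaler like DLSS. Not a real SDK type - just the data that must exist.
#include <cstdint>

struct GpuTexture;   // opaque handle to a GPU resource (engine-specific)

struct UpscalerInputs {
    // Rendered at the LOW (render) resolution:
    GpuTexture* colorBuffer   = nullptr;  // the "rough sketch" of the frame
    GpuTexture* depthBuffer   = nullptr;  // per-pixel distance to the camera
    GpuTexture* motionVectors = nullptr;  // per-pixel offset vs. the previous frame

    // Sub-pixel camera jitter applied this frame (must match the projection used):
    float jitterOffsetX = 0.0f;
    float jitterOffsetY = 0.0f;

    // Resolutions, so the upscaler knows the scaling factor:
    uint32_t renderWidth = 0, renderHeight = 0;   // e.g. 1920 x 1080
    uint32_t outputWidth = 0, outputHeight = 0;   // e.g. 3840 x 2160

    bool resetHistory = false;  // set on camera cuts so stale history is discarded
};
```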
How DLSS is integrated into the pipeline
DLSS is not embedded at an arbitrary point, but at a strictly defined place in the graphics pipeline. Integration with the game engine happens through an open SDK that provides a unified interface for these technologies (DLSS, Reflex, etc.).
An example of the pipeline stages around a DLSS pass (a simplified ordering sketch follows the list):
- Geometry rendering and shading: The game engine does all the "heavy" work: it computes geometry, lighting, materials, and shadows. All of this happens at a reduced resolution.
- Early post-processing: DLSS must be inserted before effects such as film grain, chromatic aberration, and vignetting, and most importantly before the user interface (UI). This is necessary so the neural network works with a "clean" image of the scene rather than with effects layered on top of it, which could confuse it and distort the final result.
- DLSS call: At this stage, all the prepared input data (color buffer, depth, motion vectors) is handed to DLSS. The model processes it using its weights and generates the final frame at the target (high) resolution.
- Late post-processing and UI: After DLSS has done its job, the effects that should remain "native" to the monitor resolution (chromatic aberration, vignetting, etc.) are applied on top of the resulting high-quality image, and, most importantly, the interface is drawn at the target resolution, staying as sharp and undistorted as possible.
- Screen output: The final frame is sent to the display.
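As a rough sketch of that ordering in engine code (all types and helper functions below are placeholders invented for illustration, not a real SDK's API; only the order of the stages matters):

```cpp
// Simplified frame flow around an upscaler pass. Stubs stand in for engine code.
struct GpuTexture {};
struct UpscalerInputs {};
struct FrameContext {
    UpscalerInputs upscalerInputs;
    GpuTexture     highResTarget;
};

void applyCameraJitter(FrameContext&) {}            // sub-pixel jitter for this frame
void renderGeometryAndLighting(FrameContext&) {}    // G-buffer, shading, shadows (low-res)
void applyEarlyPostProcessing(FrameContext&) {}     // passes the network is allowed to "see"
void evaluateUpscaler(const UpscalerInputs&, GpuTexture&) {}
void applyLatePostProcessing(FrameContext&) {}      // grain, chromatic aberration, vignette
void drawUI(FrameContext&) {}                       // rendered directly at target resolution
void present(const GpuTexture&) {}

void renderFrame(FrameContext& ctx)
{
    applyCameraJitter(ctx);                    // 1. heavy work at reduced resolution
    renderGeometryAndLighting(ctx);
    applyEarlyPostProcessing(ctx);             // 2. pre-upscale post-processing only
    evaluateUpscaler(ctx.upscalerInputs,       // 3. color + depth + motion vectors + jitter
                     ctx.highResTarget);       //    -> high-resolution frame
    applyLatePostProcessing(ctx);              // 4. native-resolution effects...
    drawUI(ctx);                               //    ...then the UI, at target resolution
    present(ctx.highResTarget);                // 5. send the finished frame to the display
}
```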
All of this paves the way for mass adoption of these technologies, especially as Microsoft's DirectX 12 and other vendors expose new DLSS-like, AI-based rendering approaches (texture and material compression) at the API level, making them a standard part of the modern developer's toolkit.
DLSS 5: A Paradigm Shift Towards AI Scene Synthesis
DLSS 5 begins to redefine the very formulation of the problem. Where previous versions worked with an already rendered frame, restoring or completing pixels, the new model uses the engine's structural data: the depth buffer, albedo, motion vectors, normals, material identifiers, and lighting. The network doesn't just "paint in" missing pixels: it rethinks the visual properties of the scene.
NVIDIA claims that DLSS 5 is capable of analyzing the semantics of a scene: recognizing skin, hair, fabrics, various types of lighting, and generating more accurate pixels for subsurface scattering on the skin or more realistic material responses. We are talking about a controlled modification of the final image, which remains deterministic and temporally stable.
As for the final result that many "experts" have been cursing at: right now it is mostly down to the rawness of the technology itself and the sloppiness of the developers who helped put together the tech demo. Keep in mind that the very approach to rendering is changing, so the first steps will look rough, but earlier, pre-AI technologies went through exactly the same stage.
Multi Frame Generation: from linear interpolation to adaptive synthesis
The mathematics of frame generation
In the classical frame generation scheme (DLSS 3), one interpolated frame was generated for each rendered frame, which gave a speed increase of about 100%. DLSS 4 roughly doubled that gain again (three generated frames per rendered one).
The key question is: at what FPS does this make sense at all? Suppose that with an input stream of 60 FPS (after super-sampling), the output can reach 360 FPS on the display. That corresponds to a window of ~16.6 ms between rendered frames, within which the neural network must predict five intermediate states of the scene. Think about the answer yourself and write it in the comments.
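For reference, here is the arithmetic behind those numbers. The hypothetical 6x mode (five generated frames per rendered one) is taken straight from the example above; whether the result actually makes sense is left as the question above.

```cpp
// Back-of-the-envelope frame pacing for multi-frame generation.
// This shows only the time budget, not NVIDIA's actual scheduling.
#include <cstdio>

int main() {
    const double renderedFps          = 60.0;  // frames actually rendered
    const int    generatedPerRendered = 5;     // hypothetical 6x mode: 5 generated + 1 rendered
    const double displayedFps         = renderedFps * (generatedPerRendered + 1);

    const double renderInterval  = 1000.0 / renderedFps;   // ~16.7 ms between rendered frames
    const double displayInterval = 1000.0 / displayedFps;  // ~2.8 ms between displayed frames

    std::printf("Displayed FPS: %.0f\n", displayedFps);                       // 360
    std::printf("Window to predict %d frames: %.1f ms\n",
                generatedPerRendered, renderInterval);                        // ~16.7 ms
    std::printf("Pacing budget per displayed frame: %.2f ms\n",
                displayInterval);                                             // ~2.78 ms
    return 0;
}
```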
In the Blackwell architecture, which serves as the hardware basis for motion prediction, linear interpolation of motion vectors becomes insufficient at high generation ratios. DLSS 4.5 adds dynamic adjustment of the ratio: in highly complex scenes (for example, particle explosions) the model can lower the ratio to preserve quality, and in relatively static scenes it can raise it. But again, this is up to the developers.
And here it is worth making a remark. Nvidia's engineers work solely on pushing their technology forward, while graphics programmers who work on rendering are practically an endangered species: first, they are used to established approaches and cannot quickly switch to an updated render pipeline without breaking anything, and second, like everyone else, they are bound by release dates. If you have any idea what "rebuilding the graphics pipeline" actually involves, you should understand why the first uses of a new technology inevitably come with rough edges.
The input lag problem and the Reflex solution
Generating intermediate frames fundamentally increases input lag: a generated frame contains no reaction to user input; that reaction appears only in the next rendered frame.
NVIDIA compensates for this with Reflex, a technology that synchronizes the CPU and GPU so that the render queue stays minimal. However, the result here also depends heavily on the game engine and on how correctly Reflex is embedded into the overall rendering lifecycle and the rest of the game logic.
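A rough back-of-the-envelope model of why this matters, under two simplifications of mine (not NVIDIA's published numbers): interpolation has to hold back one rendered frame, and Reflex's effect is reduced to "shrink the render queue".

```cpp
// Rough latency model for frame generation + Reflex. Numbers are illustrative.
#include <cstdio>

int main() {
    const double renderInterval = 1000.0 / 60.0;   // 60 rendered FPS -> ~16.7 ms

    // Queue depth between input sampling and display, without/with Reflex.
    const double queueWithoutReflex = 2.0 * renderInterval;  // a couple of queued frames
    const double queueWithReflex    = 0.5 * renderInterval;  // queue kept nearly empty

    // Interpolation adds roughly one rendered-frame interval of delay, because the
    // "next" real frame must exist before the in-between frames can be shown.
    const double frameGenPenalty = renderInterval;

    std::printf("No FG, no Reflex : ~%.1f ms of pipeline latency\n", queueWithoutReflex);
    std::printf("FG,    no Reflex : ~%.1f ms\n", queueWithoutReflex + frameGenPenalty);
    std::printf("FG  +  Reflex    : ~%.1f ms\n", queueWithReflex + frameGenPenalty);
    return 0;
}
```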
What if you compress data instead of pixels?
DLSS is certainly cool and well-known. But this is just one of the approaches where AI is used in the rendering process. Let's look at other approaches.
AI-assisted texture compression: saving video memory
One of the most pressing problems of modern game development is the enormous growth of texture sizes. Traditional block compression methods (BC1-BC7) achieve ratios of about 4:1 to 8:1, but their effectiveness runs into a fundamental limitation: each block is compressed independently, without taking the global structure of the texture into account.
The engineers decided to approach the task in a fundamentally different way: instead of storing a compressed image, you store a trained neural network (more precisely, its weights) capable of reconstructing the texture at arbitrary resolution at runtime. NVIDIA, for example, claims a sevenfold reduction in VRAM and system memory usage compared to traditional block-compressed textures at comparable visual quality.
Technically, it works as follows: at the game's build stage, textures go through a training procedure that takes less than a minute for thousands of assets (depending on the hardware, of course). The result is a compact representation of the model, which is decompressed by tensor cores in real time when loaded into memory. Since unpacking happens on the fly, there is no need to keep all MIP levels in video memory at the same time: the AI can generate the required level on demand.
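As a toy illustration of "weights instead of texels", here is a tiny MLP that maps (u, v) to RGB, so that sampling the texture at any resolution is just a forward pass. The layer sizes, the missing feature grid, and the placeholder weights are my simplifications; the real neural texture compression pipeline is considerably more sophisticated.

```cpp
// A toy "compressed texture": sampling = one small inference over learned weights.
#include <array>
#include <cmath>

constexpr int HIDDEN = 16;

struct TinyTextureNet {
    // Weights learned offline at build time (placeholders here).
    std::array<std::array<float, 2>, HIDDEN> w1{};  // input (u, v) -> hidden
    std::array<float, HIDDEN>                b1{};
    std::array<std::array<float, HIDDEN>, 3> w2{};  // hidden -> RGB
    std::array<float, 3>                     b2{};

    // "Sampling" the texture at (u, v) at any resolution is a forward pass.
    std::array<float, 3> sample(float u, float v) const {
        std::array<float, HIDDEN> h{};
        for (int i = 0; i < HIDDEN; ++i) {
            float a = w1[i][0] * u + w1[i][1] * v + b1[i];
            h[i] = a > 0.f ? a : 0.f;                 // ReLU
        }
        std::array<float, 3> rgb{};
        for (int c = 0; c < 3; ++c) {
            float a = b2[c];
            for (int i = 0; i < HIDDEN; ++i) a += w2[c][i] * h[i];
            rgb[c] = 1.f / (1.f + std::exp(-a));      // squash into [0, 1]
        }
        return rgb;
    }
};
```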
For AAA games, this means not only saving video memory, but also drastically reducing the size of the games themselves, speeding up downloads, and allowing you to increase texture density without increasing memory requirements.
Compression of shaders and materials
Complex materials are layered structures that combine dozens of maps and hundreds of mathematical operations. Rendering high-end materials (e.g. porcelain, silk, multi-layered leather) in real time has so far been impractical because of the high computational cost.
AI shaders use trained neural networks to approximate the output of complex shader code.
Architecturally, this means that instead of executing the full shader graph for each pixel, the GPU runs an inference that outputs the final shading parameters. The gain comes from the fact that tensor cores perform matrix operations much more efficiently than shader cores.
Inference instead of ray tracing
Everyone knows how resource-hungry ray tracing remains, and path tracing in particular. Path tracing requires hundreds or thousands of rays per pixel for indirect lighting to converge. The new approach replaces most of this work with inference: after tracing one or two bounces, a neural network predicts the result of the remaining (theoretically infinite) sequence of bounces.
For example, one such technology, NRC (Neural Radiance Cache), is already available through the RTX Global Illumination SDK and will soon appear in RTX Remix. The practical consequence of this approach is the ability to achieve the visual quality of traditional path tracing at a performance cost comparable to simpler global illumination techniques.
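The shape of the idea in code looks roughly like the sketch below: a path tracer that traces only a couple of real bounces and then asks a cache for the rest. Every type and function here is a hypothetical stand-in for a real renderer, not the actual NRC API.

```cpp
// Truncated path tracing: trace 1-2 bounces, then one inference replaces the tail.
#include <optional>

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, direction; };
struct Hit  { Vec3 position, normal, albedo; };

// Stubs standing in for the real renderer and the neural cache.
std::optional<Hit> traceRay(const Ray&)           { return std::nullopt; }           // intersect scene
Vec3 directLighting(const Hit&)                   { return {0, 0, 0}; }              // light sampling
Ray  sampleBrdf(const Hit& h, const Ray&)         { return { h.position, h.normal }; } // next bounce
Vec3 queryRadianceCache(const Vec3&, const Vec3&) { return {0, 0, 0}; }              // the network

Vec3 shade(Ray ray)
{
    Vec3 radiance{0, 0, 0}, throughput{1, 1, 1};
    const int tracedBounces = 2;                      // only 1-2 real bounces are traced

    for (int bounce = 0; bounce < tracedBounces; ++bounce) {
        auto hit = traceRay(ray);
        if (!hit) break;

        Vec3 direct = directLighting(*hit);
        radiance = { radiance.x + throughput.x * direct.x,
                     radiance.y + throughput.y * direct.y,
                     radiance.z + throughput.z * direct.z };

        ray = sampleBrdf(*hit, ray);
        throughput = { throughput.x * hit->albedo.x,
                       throughput.y * hit->albedo.y,
                       throughput.z * hit->albedo.z };

        if (bounce == tracedBounces - 1) {
            // Instead of tracing the remaining (theoretically infinite) bounces,
            // one inference predicts the radiance arriving along this ray.
            Vec3 rest = queryRadianceCache(hit->position, ray.direction);
            radiance = { radiance.x + throughput.x * rest.x,
                         radiance.y + throughput.y * rest.y,
                         radiance.z + throughput.z * rest.z };
        }
    }
    return radiance;
}
```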
A small note: all of the technologies described rely on specialized compute units. The evolution of tensor cores from generation to generation directly determines how effectively AI rendering models can run.
Because of this, RTX 20/30 series users get essentially the same image quality, for example from DLSS 4.5, but with a significant drop in performance due to the lack of native FP8/FP4 support in their tensor cores.
Let's look into the inevitable future: AI rendering as a new standard
So, having covered the current technologies and the changing approach (still imperfect and experimental in places, but already gaining momentum), let's look into the near future of AI-based rendering in games.
Development trajectory until 2030
Analyzing the current technological vectors and already available solutions, we can safely predict several key areas:
- Complete replacement of traditional shaders with inference: the first technologies like RTX Neural Shaders already demonstrate that neural networks can evaluate complex material calculations more efficiently than hand-written shader code, especially thanks to the improved architecture of tensor cores. The next step, naturally, is unifying all materials under an AI model, where a shader is effectively compiled into the weights of a small inference network.
- A smooth transition from frame generation to scene generation: existing technologies for working with scene primitives (geometry, lighting, textures) are paving the way to a more optimized production pipeline at the level of whole scenes, freeing artists and technical artists from routine scene-optimization work by delegating it to AI. The next step, potentially, is generating any of these primitives so that artists get initial drafts in seconds instead of hours of manual work.
- Hybrid computing models: Modern video cards already contain separate tensor cores, ray tracing cores, and shader cores. Future architectures are likely to push this specialization even further, allowing classical rendering, ray tracing, and inference to run in parallel.
- Standardization of AI rendering: Microsoft has already added support for AI rendering to DirectX, which paves the way for universal adoption (at least at the level of the DirectX API). ARM, for its part, is developing its own GDK for developers, opening the door to AI-based super-sampling and denoising on mobile devices.
However, with all the achievements, there are still a number of fundamental problems that hundreds of engineers are working on:
- Determinism: Generative models are inherently stochastic. Competitive gaming requires pixel-for-pixel repeatability of the result, which is hard to achieve without fixing the random seed. However, newer and newer approaches keep reducing this randomness to a minimum.
- Energy consumption: inference certainly consumes a lot of power. In mobile gaming and on handheld devices (e.g. Steam Deck, Nintendo Switch) this is a critical limitation. But a lot of work is going on in this direction too, with new options for optimizing models.
- Backward compatibility: As the computing requirements of new models grow, older GPUs lose the ability to run them efficiently, which fragments the user base. Here the outcome will depend mostly on how quickly generally accepted development standards emerge, since we are only at the beginning of the road.
Rejection of new technology has happened before in this industry's history - you just don't remember it. So how has the industry digested past graphics revolutions?
The controversies surrounding the introduction of AI rendering (from "fake frames" and "blur instead of graphics" to the fear of losing control over the visual result) are certainly loud, but they are nothing new for the industry. Virtually every fundamental change in rendering architecture over the past 25 years has met similar resistance before becoming the new standard.
Programmable shaders vs. the fixed-function pipeline (2001-2004)
Before the advent of the GeForce 3 and DirectX 8, the graphics pipeline was a rigidly fixed chain of operations: vertex transformation, lighting by fixed formulas, rasterization, texture blending. Then developers were allowed to program their own vertex and pixel shaders for these stages, which opened the way to normal maps, dynamic shadows, and complex materials.
At the time, people feared that shaders were too slow, that developers couldn't handle writing such complex code, and that programmable shaders were a crutch rather than a step forward.
In reality, after just five years, games without shaders had become a relic of the past. Half-Life 2, Doom 3, and Far Cry demonstrated that a programmable pipeline is not just a replacement for the old one, but a tool for creating things that were previously impossible. Developers quickly mastered HLSL and GLSL, and performance grew thanks to hardware acceleration of the shader units.
Transition to Deferred lighting (Deferred Rendering, 2007-2011)
Classic forward rendering recalculated lighting for every object, which made large numbers of dynamic light sources impractical. Deferred rendering split the process into two passes: first, geometric attributes are written into the G-buffer, and only then is lighting computed, purely in screen space. This made it possible to have dozens or hundreds of dynamic light sources in a frame, although today that seems commonplace.
But at the time, many argued that the G-buffer would eat up too much memory, that the powerful MSAA anti-aliasing would have to be thrown in the trash, and that something would have to be done about transparent objects, because they break in a deferred pipeline.
But in reality the industry found compromises, because the pros outweighed the cons. Instead of MSAA, new kinds of anti-aliasing appeared (first FXAA and SMAA, later replaced by TAA and TAAU), which eventually even surpassed it in quality. And the approach itself became an industry standard.
Physically based rendering (PBR, 2013-2016)
In the pre-PBR era, materials were described by ad-hoc parameters (specular power, glossiness) that behaved differently under different lighting and required manual tweaking for every scene. The physically based approach introduced a unified BRDF model built on measurable properties of real materials: metallic, roughness, albedo.
At first, however, the reaction was harsh: people said that all games would become identical and plasticky, that old textures would have to be redone from scratch, and that artists would lose creative control.
But something else happened: PBR did not destroy stylization. It gave artists a predictable foundation on top of which stylistic decisions can be layered. The transition to PBR did, of course, require retraining staff and adopting new tools (such as Substance Painter and Quixel), but the result was a leap in realism and in the consistency of materials across scenes and projects. Today even stylized, cartoonish games use a PBR pipeline adapted to their aesthetics.
Today
The current wave began in 2018 with the transfer of ray tracing from film rendering to games. The first implementations were modest: only shadows or reflections, with a low number of rays per pixel.
Seven years later, ray tracing has become the standard in the AAA segment. Cyberpunk 2077 in Path Tracing mode, Alan Wake 2, and Black Myth: Wukong with full RT are examples of how the technology has matured. Denoising and upscaling (including DLSS) evolved to the point of making RT playable. Current-generation consoles received hardware RT units, and AMD and Intel are catching up with NVIDIA in tracing performance.
The current transition to AI rendering shows all the same patterns as previous revolutions.
Is another revolution waiting for us? Let's summarize the results
Each of the past revolutions took time to adapt, update the toolkit, train developers, and optimize hardware. AI rendering is no exception.
It does not cancel out the work of artists and programmers, but gives them a new level of abstraction: instead of manually dealing with noise, smoothing or optimizing materials, developers will be able to rely on models trained on huge data sets of photo-realistic or stylized content.
Moreover, as we have already seen above, history shows that after the adoption of a new paradigm or technology, the industry does not just return to its previous level of quality, but enters a new stage. Shaders didn't kill 2D sprites, but they gave us normal maps and dynamic shadows. PBR did not make the games monotonous, but gave a new basis for both photo realism and stylization (for example, Dishonored 2).
AI rendering will not destroy "classic" graphics, but it will allow a level of real-time visual complexity that was previously out of reach.
Another question is how developers will use the technology, because there are examples where developers, even with a significant technological leap at hand, take a step back (hello, Battlefield 6).
The next five years will determine how much AI rendering will permeate every aspect of game creation. Judging by the current trajectory, by 2030 the line between "rendering" and "generation" will become indistinguishable, or very blurred.
So, thanks for reading!
I am waiting for your forecasts and ideas in the comments!