<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adam Sawicki</title>
    <description>The latest articles on DEV Community by Adam Sawicki (@reg__).</description>
    <link>https://dev.to/reg__</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F22514%2Ff0267cd9-44a8-4b46-ba01-3f995e4c3938.png</url>
      <title>DEV Community: Adam Sawicki</title>
      <link>https://dev.to/reg__</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/reg__"/>
    <language>en</language>
    <item>
      <title>Book review: C++ Initialization Story</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Mon, 27 Mar 2023 19:23:47 +0000</pubDate>
      <link>https://dev.to/reg__/book-review-c-initialization-story-4gnf</link>
      <guid>https://dev.to/reg__/book-review-c-initialization-story-4gnf</guid>
      <description>&lt;p&gt;Courtesy its author Bartłomiej Filipek (author of &lt;a href="https://www.cppstories.com/"&gt;cppstories.com&lt;/a&gt; website), I was given an opportunity to read a book &lt;strong&gt;“C++ Initialization Story"&lt;/strong&gt;. Below you will find my review.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9CsilcRL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Book_CPP_init_story.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9CsilcRL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Book_CPP_init_story.jpg" alt="" width="406" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How many ways are there to initialize a variable in C++? I can think of at least the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int i1;
int i2; i2 = 123;
int i3 = 123;
int i4(); // function declaration not a variable
int i5(123);
int i6 = int(123);
int i7{};
int i8 = {};
int i9 = int{};
int iA{123};
int iB = {123};
int iC = int{123};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do you know the difference between them? Which variable stays uninitialized, and which is initialized with the value 0 or 123? What if I used a custom type instead of the basic &lt;code&gt;int&lt;/code&gt;? How many copies of the object would be created? What if that type was a class having some custom constructors? Which constructor would get called? What if it was a &lt;code&gt;std::vector&lt;/code&gt; or some other container?&lt;/p&gt;

&lt;p&gt;Questions like these are the foundation of this book, but the topics it covers are much broader. It is a relatively big book: on its 279 pages, the author treats the topic of "initialization" as an opportunity to describe various concepts of the C++ language. Modern versions of the language standard are covered, up to C++23, but features that require new versions are explicitly marked as such. The book is not about some exotic quirks and tricks that can be done by stretching the language to its limits; it is about concepts that are fundamental in any C++ program.&lt;/p&gt;

&lt;p&gt;Initialization of local variables, as shown in the code above, is just the subject of the first chapter. Then initialization of "non-static data members" is described, which basically means variables inside structures and classes. Constructors obviously play the major role here, so their syntax and behavior are also described in detail. When talking about constructors, a description of assignment operators and destructors follows naturally. Of course, these language constructs are also described in light of the move semantics introduced by C++11. For example, did you know that &lt;code&gt;std::vector&amp;lt;T&amp;gt;&lt;/code&gt; on resize is able to use the move constructor of your type &lt;code&gt;T&lt;/code&gt; instead of performing a copy only when the move constructor is marked as &lt;code&gt;noexcept&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Another topic related to initialization is automatic deduction of types: the &lt;code&gt;auto&lt;/code&gt; keyword and template arguments. Special kinds of variables - &lt;code&gt;static&lt;/code&gt; and &lt;code&gt;thread_local&lt;/code&gt; - are also described. The book also teaches new language constructs added for convenient variable initialization, like structured bindings, designated initializers, or &lt;code&gt;static inline&lt;/code&gt;. If you have only used older versions of C++ so far, do you know that the following syntax is now possible? Do you know what it means?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto[iter, inserted] = mySet.insert(10);

Point p {
    .x = 10.0,
    .y = 20.0
};

class C {
    static inline int classCounter = 0;
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it comes to the difficulty level of the book, I would call it intermediate. Only some basic knowledge of C++ is required, as the author explains every covered topic from the very basics and shows simple code samples. The book additionally features a quiz in the middle and at the end, as well as a chapter with "techniques and use cases". For example, did you know that the most robust and efficient way to initialize a class with a string is to pass it by... value?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;struct StringWrapper {
    std::string str_;
    StringWrapper(std::string str) : str_{std::move(str)} { }
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a long time I've been skeptical about new language standards like C++11, C++14, C++17, C++20. C++ is a tough language already, so every fresh addition only adds more complexity to it. It used to remind me of some elaborate, tricky Boost-style templates. But now, the more I use new features of the language (at least in my personal code), the more I like them. I always liked RAII and &lt;code&gt;unique_ptr&lt;/code&gt;, but now with move semantics, return value optimization, &lt;code&gt;std::optional&lt;/code&gt;, &lt;code&gt;std::variant&lt;/code&gt;, and many other additions to the language, small and big, it all starts to fit together. The code is clean, concise, readable, safe (no explicit &lt;code&gt;new&lt;/code&gt; or &lt;code&gt;delete&lt;/code&gt;!), and efficient at the same time. I now think that it is not an inherent feature of C++ to be verbose (with tons of boilerplate code required) and unsafe (with memory access violation errors easy to make); that comes from the old-fashioned approach of treating it as "C with classes". I hope that over time more and more developers, especially those who make key decisions in software projects, will also notice that and will allow using modern C++.&lt;/p&gt;

&lt;p&gt;The book can be bought as an ebook on &lt;strong&gt;&lt;a href="https://leanpub.com/cppinitbook"&gt;leanpub.com&lt;/a&gt;&lt;/strong&gt;, as well as in a printed version on &lt;strong&gt;&lt;a href="https://www.amazon.com/dp/B0BW38DDBK?&amp;amp;linkCode=sl1&amp;amp;tag=bfilipek-20&amp;amp;linkId=76565ec8504083fbaae116133da82c20&amp;amp;language=en_US&amp;amp;ref_=as_li_ss_tl"&gt;Amazon&lt;/a&gt;&lt;/strong&gt;. I can strongly recommend it - it is really good! See also my reviews of previous books by this author: &lt;a href="https://asawicki.info/news_1715_book_review_c17_in_detail"&gt;"C++17 in Detail"&lt;/a&gt; and &lt;a href="https://asawicki.info/news_1739_book_review_c_lambda_story"&gt;"C++ Lambda Story"&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cpp</category>
    </item>
    <item>
      <title>VkExtensionsFeaturesHelp - My New Library</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Thu, 01 Apr 2021 18:46:31 +0000</pubDate>
      <link>https://dev.to/reg__/vkextensionsfeatureshelp-my-new-library-5c5g</link>
      <guid>https://dev.to/reg__/vkextensionsfeatureshelp-my-new-library-5c5g</guid>
      <description>&lt;p&gt;I had this idea for quite some time and finally I've spent last weekend coding it, so here it is: 611 lines of code (and many times more of documentation), shared for free on MIT license:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sawickiap/VkExtensionsFeaturesHelp/"&gt;&lt;strong&gt;** VkExtensionsFeaturesHelp **&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vulkan Extensions &amp;amp; Features Help&lt;/strong&gt;, or &lt;strong&gt;VkExtensionsFeaturesHelp&lt;/strong&gt;, is a small, header-only C++ library for developers who use the Vulkan API. It helps to avoid boilerplate code while creating the &lt;code&gt;VkInstance&lt;/code&gt; and &lt;code&gt;VkDevice&lt;/code&gt; objects by providing a convenient way to query and then enable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;instance layers&lt;/li&gt;
&lt;li&gt;instance extensions&lt;/li&gt;
&lt;li&gt;instance feature structures&lt;/li&gt;
&lt;li&gt;device features&lt;/li&gt;
&lt;li&gt;device extensions&lt;/li&gt;
&lt;li&gt;device feature structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The library provides a domain-specific language to describe the list of required or supported extensions, features, and layers. The language is fully defined in terms of preprocessor macros, so no custom build step is needed.&lt;/p&gt;

&lt;p&gt;Any feedback is welcome :)&lt;/p&gt;

</description>
      <category>vulkan</category>
      <category>rendering</category>
      <category>graphics</category>
      <category>library</category>
    </item>
    <item>
      <title>Why Not Use Heterogeneous Multi-GPU?</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Wed, 22 Jul 2020 20:22:06 +0000</pubDate>
      <link>https://dev.to/reg__/why-not-use-heterogeneous-multi-gpu-18lh</link>
      <guid>https://dev.to/reg__/why-not-use-heterogeneous-multi-gpu-18lh</guid>
      <description>&lt;p&gt;There was an interesting discussion recently on one Slack channel about using integrated GPU (iGPU) together with discrete GPU (dGPU). Many sound ideas were said there, so I think it's worth writing them down. But because I probably never blogged about multi-GPU before, few words of introduction first:&lt;/p&gt;

&lt;p&gt;The idea to use multiple GPUs in one program is not new, but not very widespread either. In old graphics APIs like Direct3D 11 it wasn't easy to implement. Doing it right in a complex game often involved engaging driver engineers from the GPU manufacturer (like AMD, NVIDIA) or using custom vendor extensions (like &lt;a href="https://gpuopen.com/amd-gpu-services-ags-library/"&gt;AMD GPU Services&lt;/a&gt; - see for example &lt;a href="https://gpuopen-librariesandsdks.github.io/ags/group__cfxapi.html"&gt;Explicit Crossfire API&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The new generation of graphics APIs – Direct3D 12 and Vulkan – is lower level and gives more direct access to the hardware. This includes the possibility to implement multi-GPU support on your own. There are two modes of operation. If the GPUs are identical (e.g. two graphics cards of the same model plugged into the motherboard), you can use them as one device object. In D3D12 you then index them as Node 0, Node 1, ... and specify a &lt;code&gt;NodeMask&lt;/code&gt; bit mask when allocating GPU memory, submitting commands, and doing all sorts of GPU things. Similarly, in Vulkan you have the VK_KHR_device_group extension available, which allows you to create one logical device object that will use multiple physical devices.&lt;/p&gt;

&lt;p&gt;But this post is about heterogeneous/asymmetric multi-GPU, where there are two different GPUs installed in the system, e.g. one integrated with the CPU and one discrete. A common example is a laptop with "switchable graphics", which may have an Intel CPU with its integrated "HD" graphics plus an NVIDIA GPU. There may even be two different GPUs from the same manufacturer! My new laptop (ASUS TUF Gaming FX505DY) has an AMD Radeon Vega 8 + Radeon RX 560X. Another example is a desktop PC with CPU-integrated graphics and a discrete graphics card installed. Such a combination may still be used by a single app, but to do that, you must create and use two separate Device objects. But just because you could doesn't mean you should…&lt;/p&gt;

&lt;p&gt;The first question is: Are there games that support this technique? Probably only a few… There is just one example I have heard of: &lt;a href="https://www.tomshardware.com/news/oxide-games-dan-baker-interview,30665.html"&gt;Ashes of the Singularity by Oxide Games&lt;/a&gt;, and it was many years ago, when DX12 was still fresh. Other than that, there are mostly tech demos, e.g. &lt;a href="https://devblogs.microsoft.com/directx/directx-12-multiadapter-lighting-up-dormant-silicon-and-making-it-work-for-you/"&gt;"WITCH CHAPTER 0 [cry]" by Square Enix as described on the DirectX Developer Blog&lt;/a&gt; (also 5 years old).&lt;/p&gt;

&lt;p&gt;An iGPU typically has lower computational power than a dGPU, but it could still accelerate some pieces of the computations needed each frame. One idea is to hand over the already rendered 3D scene to the iGPU so it can finish it with screen-space postprocessing effects and present it, which sounds even better if the display is connected to the iGPU. Another option is to accelerate some computations, like occlusion culling, particles, or water simulation. There are some excellent learning materials about this technique. The best one I can think of is: &lt;a href="https://devmesh.intel.com/projects/multi-adapter-particles"&gt;Multi-Adapter with Integrated and Discrete GPUs by Allen Hux (Intel), GDC 2020&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, there are many drawbacks of this technique, which were discussed in the Slack chat I mentioned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's difficult to implement multi-GPU support in general and to synchronize things properly.&lt;/li&gt;
&lt;li&gt;iGPUs have greatly varying performance, from quite fast to very slow, so implementing it to always give a performance uplift is even harder.&lt;/li&gt;
&lt;li&gt;Passing data back and forth between dGPU and iGPU involves multiple copies. The cost of it may be larger than the performance benefit of computing on iGPU.&lt;/li&gt;
&lt;li&gt;The iGPU shares the same power and thermal limitations, memory bandwidth, and caches as the CPU, so they may slow each other down.&lt;/li&gt;
&lt;li&gt;If you offload finishing the frame (postprocessing and Present) to the iGPU, you may improve throughput a bit, but you increase latency a lot.&lt;/li&gt;
&lt;li&gt;You need to support systems without an iGPU as well, so your testing matrix expands. (An interesting idea was posted that if it's a DirectX workload, you might fall back to the software-emulated WARP device - it's quite efficient and of good quality in terms of correctness and compliance with GPU-accelerated DX.)&lt;/li&gt;
&lt;li&gt;Finishing and presenting a frame on the iGPU sounds like a good idea if the display is connected to the iGPU, but it's not so certain. Multi-GPU laptops usually have the built-in display connected to the iGPU, but an external display output (e.g. HDMI) may be connected to the iGPU or to the dGPU (especially in "gaming laptops") – you never know.&lt;/li&gt;
&lt;li&gt;Conscientious gamers tend to update their graphics drivers for the dGPU, but the driver for the iGPU is often left at an ancient version, full of bugs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conclusion: Supporting heterogeneous multi-GPU in a game engine sounds like an interesting technical challenge, but better think twice before doing it in production code.&lt;/p&gt;

&lt;p&gt;BTW, if you want to use just one GPU and worry about selecting the right one, see my old post: &lt;a href="https://asawicki.info/news_1675_switchable_graphics_versus_d3d11_adapters"&gt;Switchable graphics versus D3D11 adapters&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>directx</category>
      <category>rendering</category>
      <category>graphics</category>
    </item>
    <item>
      <title>How to Disable Notification Sound in Messenger for Android?</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Thu, 09 Jul 2020 20:39:55 +0000</pubDate>
      <link>https://dev.to/reg__/how-to-disable-notification-sound-in-messenger-for-android-1bln</link>
      <guid>https://dev.to/reg__/how-to-disable-notification-sound-in-messenger-for-android-1bln</guid>
      <description>&lt;p&gt;Applications and websites fight for our attention. We want to stay connected and informed, but too many interruptions are not good for our productivity or mental health. Different applications have different settings dedicated to silencing notifications. I recently bought a new smartphone and so I needed to install and configure all the apps (which is a big task these days, same way as it always used to be with Windows PC after "format C:" and system reinstall).&lt;/p&gt;

&lt;p&gt;Facebook Messenger for Android offers an on/off setting for all the notifications, and a choice of the sound of a notification and an incoming call. Unfortunately, it doesn't offer an option to silence the sound. You can only either choose among several different sound effects or disable all notifications of the app entirely. But what if you want to keep notifications active (so they appear in the Android drawer, use vibration, get sent to a &lt;a href="https://asawicki.info/news_1718_xiaomi_smart_band_-_a_very_good_purcharse"&gt;smart band&lt;/a&gt;, and incoming calls still ring), and just mute the sound of incoming messages?&lt;/p&gt;

&lt;p&gt;Here is the solution I found. It turns out you can upload a custom sound file to your smartphone and use it. For that, I generated a small WAV file: 0.1 seconds of total silence. 1) You can download it from here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://asawicki.info/Download/Misc/Silence_100ms.wav"&gt;&lt;strong&gt;Silence_100ms.wav&lt;/strong&gt;&lt;/a&gt; (8.65 KB)&lt;/p&gt;

&lt;p&gt;2) Now you need to put it into a specific directory in the memory of your smartphone, called "Notifications". To do this, you need to use an app that allows you to freely manipulate files and directories, as opposed to just looking for specific content like image or music players do. If you downloaded the file directly to your smartphone, use the free &lt;a href="https://play.google.com/store/apps/details?id=com.ghisler.android.TotalCommander"&gt;Total Commander&lt;/a&gt; to move this file to the "Notifications" directory. If you have it on your PC, &lt;a href="https://play.google.com/store/apps/details?id=com.fjsoft.myphoneexplorer.client"&gt;MyPhoneExplorer&lt;/a&gt; will be a good app for connecting to your phone over a USB cable or a WiFi network and transferring the file.&lt;/p&gt;

&lt;p&gt;3) Finally, you need to select the file in Messenger. To do this, go to its settings &amp;gt; Notifications &amp;amp; Sounds &amp;gt; Notification Sound. The new file "Silence_100ms" should appear mixed with the list of default sound effects. After choosing it, your message notifications in Messenger will be silent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9KZe7Msu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Facebook_Messenger_Android_Notification_Sound_Silence.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9KZe7Msu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Facebook_Messenger_Android_Notification_Sound_Silence.png" alt="Facebook Messenger Android Notification Sound Silence" width="234" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is one downside to this method. While not audible, the sound is still played on every incoming message, so if you listen to music, e.g. using Spotify, the music will fade out for a second every time the sound is played.&lt;/p&gt;

</description>
      <category>facebook</category>
      <category>android</category>
      <category>messenger</category>
    </item>
    <item>
      <title>Avoid double negation, unless...</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Thu, 11 Jun 2020 20:56:16 +0000</pubDate>
      <link>https://dev.to/reg__/avoid-double-negation-unless-288n</link>
      <guid>https://dev.to/reg__/avoid-double-negation-unless-288n</guid>
      <description>&lt;p&gt;Boolean algebra is a branch of mathematics frequently used in computer science. It deals with simple operations like AND, OR, NOT - which we also use in natural language.&lt;/p&gt;

&lt;p&gt;In programming, two negations of a boolean variable cancel each other. You could express it in C++ as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!!x == x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In natural languages, it's not the case. In English we don't use double negatives (unless it's intended, like in the famous song quote &lt;em&gt;"We don't need no education"&lt;/em&gt; :) In Polish double negatives are used but don't result in a positive statement. For example, we say &lt;em&gt;"Wegetarianie nie jedzą żadnego rodzaju mięsa."&lt;/em&gt;, which literally translates as &lt;em&gt;"Vegetarians don't eat no kind of meat."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;No matter what your native language is, in programming it's good to simplify things and so to avoid double negations. For example, if you design an API for your library and you need to name a configuration setting for an optional optimization Foo, it's better to call it "FooOptimizationEnabled". You can then just check if it's true and, if so, do the optimization. If the setting was called "FooOptimizationDisabled", its documentation could say: &lt;em&gt;"When the FooOptimizationDisabled setting is disabled, then the optimization is enabled."&lt;/em&gt; - which sounds confusing, and we don't want that.&lt;/p&gt;

&lt;p&gt;But there are two cases where negative flags are justified in the design of an API. Let me give examples from the world of graphics.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;When enabled should be the default. It's easier to set flags to 0 or to a minimum set of required flags rather than thinking about every available flag and whether or not it should be set, so a setting that most users will want enabled, disabling it only on special occasions, could be a negative flag. For example, Direct3D 12 has &lt;code&gt;D3D12_HEAP_FLAG_DENY_BUFFERS&lt;/code&gt;. A heap you create can host any GPU resources by default (buffers and textures), but when you use this flag, you declare you will not use it for buffers. (Note to graphics programmers: I know this is not the best example, because usage of these flags is actually required on Resource Heap Tier 1 and there are also "ALLOW" flags, but I hope you get the idea.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For backward compatibility. If something has always been enabled in previous versions and you want to give users the possibility to disable it in the new version of your library, you don't want to break their old code or ask them to update their code everywhere by adding a new "enable" flag, so it's better to add a negative flag that will disable the otherwise still enabled feature. That's what Vulkan does with &lt;code&gt;VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT&lt;/code&gt;. There was no such flag in Vulkan 1.0 - pipelines were always optimized. The latest Vulkan 1.2 has it, so you can ask to disable optimization, which may speed up the creation of the pipeline. All the existing code that doesn't use this flag continues to have its pipelines optimized.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>beginners</category>
      <category>api</category>
      <category>programming</category>
      <category>design</category>
    </item>
    <item>
      <title>On Debug, Release, and Other Project Configurations</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Sun, 17 May 2020 20:44:44 +0000</pubDate>
      <link>https://dev.to/reg__/on-debug-release-and-other-project-configurations-2bdd</link>
      <guid>https://dev.to/reg__/on-debug-release-and-other-project-configurations-2bdd</guid>
      <description>&lt;p&gt;Foreword: I was going to write a post about &lt;code&gt;#pragma optimize&lt;/code&gt; in Visual Studio, which I learned recently, but later I decided to describe the whole topic more broadly. As a result, this blog post can be useful or inspiring to every programmer coding in C++ or even in other languages, although I give examples based on just C++ as used in Microsoft Visual Studio on Windows.&lt;/p&gt;

&lt;p&gt;When we compile a program, we often need to choose one of several possible "configurations". Visual Studio creates two of those for a new C++ project, called "Debug" and "Release". As their names imply, the first one is mostly intended to be used during development and debugging, while the other should be used to generate the final binary of the program to be distributed to external users. But there is more to it. Each of these configurations actually sets multiple parameters of the project, and you can change them. You can also define your own custom configurations and have more than two of them, which can be very useful, as you will see later.&lt;/p&gt;

&lt;p&gt;First, let's think about the specific settings that are defined by a project configuration. They can be divided into two broad categories. The first one is all the parameters that control the compiler and linker. The difference between Debug and Release here is mostly about optimizations. The Debug configuration is all about having the optimizations disabled (which allows full debugging functionality and also makes the compilation time short), while Release has the optimizations enabled (which obviously makes the program run faster). For example, Visual Studio sets these options in Release:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;/O2 - Optimization = Maximum Optimization (Favor Speed)&lt;/li&gt;
&lt;li&gt;/Oi - Enable Intrinsic Functions = Yes&lt;/li&gt;
&lt;li&gt;/Gy - Enable Function-Level Linking = Yes&lt;/li&gt;
&lt;li&gt;/GL - Whole Program Optimization = Yes&lt;/li&gt;
&lt;li&gt;/OPT:REF - linker option: References = Yes&lt;/li&gt;
&lt;li&gt;/OPT:ICF - linker option: Enable COMDAT Folding = Yes&lt;/li&gt;
&lt;li&gt;/LTCG:incremental - linker option: Link Time Code Generation = Use Fast Link Time Code Generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Visual Studio also inserts additional code in the Debug configuration to fill memory with bit patterns that help with debugging low-level memory access errors, which plague C and C++ programmers. For example, seeing 0xCCCCCCCC in the debugger usually means uninitialized memory on the stack, 0xCDCDCDCD means allocated but uninitialized memory on the heap, and 0xFEEEFEEE means memory that was already freed and should no longer be used. In Release, memory under such incorrectly used pointers will just hold its previous data.&lt;/p&gt;

&lt;p&gt;The second category of things controlled by project configurations is specific features inside the code. In the case of C and C++, these are usually enabled and disabled using preprocessor macros, like &lt;code&gt;#ifdef&lt;/code&gt; and &lt;code&gt;#if&lt;/code&gt;. Such macros can not only be defined inside the code using &lt;code&gt;#define&lt;/code&gt;, but also passed from the outside, among the parameters of the compiler, and so they can be set and changed depending on the project configuration.&lt;/p&gt;

&lt;p&gt;The features controlled by such macros can be very diverse. Probably the most canonical example is the standard &lt;code&gt;assert&lt;/code&gt; macro (or your custom equivalent), which we define as some error logging, an instruction to break into the debugger, or even complete program termination in the Debug config, and as an empty macro in Release. In the case of C++ in Visual Studio, the macro defined in Debug is &lt;code&gt;_DEBUG&lt;/code&gt; and in Release it is &lt;code&gt;NDEBUG&lt;/code&gt;, and depending on the latter, the standard &lt;code&gt;assert&lt;/code&gt; macro either does "something" or is just ignored.&lt;/p&gt;

&lt;p&gt;There are more possibilities. Depending on these standard pre-defined macros or your custom ones, you can cut out different functionalities from the code. One example is any instrumentation that lets you analyze and profile its execution (like calls to &lt;a href="https://github.com/wolfpld/tracy"&gt;Tracy&lt;/a&gt;). You probably don't want it in the final client build. Same with detailed logging functionality, any hidden developer setting or cheat codes (in case of games). On the other hand, you may want to include in the final build something that's not needed during development, like checking user's license, some anti-piracy or anti-cheat protection, and generation of certificates needed for the program to work on non-developer machines.&lt;/p&gt;

&lt;p&gt;As you can see, there are many options to consider. Sometimes it can make sense to have more than two project configurations. Probably the most common case is a need for a "Profile" configuration that allows you to measure performance accurately: it has all the compiler optimizations enabled, but still keeps the instrumentation needed for profiling in the code. Another idea would be to wrap the super low-level, frequently called checks like &lt;code&gt;(index &amp;lt; size())&lt;/code&gt; inside &lt;code&gt;vector::operator[]&lt;/code&gt; into some separate macro called &lt;code&gt;HEAVY_ASSERT&lt;/code&gt; and have some configuration called "SuperDebug" that we know works very slowly but has all those checks enabled. On the other end, remember that the "FinalFinal" configuration that you will use to generate the final binary for the users should be built and tested in your Continuous Integration during development, not only one week before the release date. Bugs that occur in only one configuration and not in the others are not uncommon!&lt;/p&gt;

&lt;p&gt;Some bugs just don't happen in Debug, e.g. due to uninitialized memory consistently containing 0xCCCCCCCC instead of garbage data, or a race condition between threads not occurring because of the different time it takes to execute certain functions. In some projects, the Debug configuration works so slowly that it's not even possible to test the program on a real, large data set in this configuration. I consider it a bad coding practice and I think it shouldn't happen, but it happens quite often, especially when the STL is used, where every reference to a &lt;code&gt;myVector[i]&lt;/code&gt; element in unoptimized code is a function call with a range check instead of just a pointer dereference. In any case, sometimes we need to investigate bugs occurring in the Release configuration. Not all hope is lost then, because in Visual Studio the debugger still works, just not as reliably as in Debug. Because of the optimizations made by the compiler, the instruction pointer (the yellow arrow) may jump across the code inconsistently, and some variables may be impossible to preview.&lt;/p&gt;

&lt;p&gt;Here comes the trick that inspired me to write this whole blog post. I recently learned that there is this custom Microsoft preprocessor directive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#pragma optimize("", off)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which, if you put it at the beginning of a .cpp file or just before your function of interest, disables all compiler optimizations from that point until the end of the file, making its debugging nice and smooth, while the rest of the program behaves as before. (See also &lt;a href="https://docs.microsoft.com/en-us/cpp/preprocessor/optimize?view=vs-2019"&gt;its documentation&lt;/a&gt;.) A nice trick!&lt;/p&gt;

</description>
      <category>visual</category>
      <category>studio</category>
      <category>msvs</category>
      <category>compiler</category>
    </item>
    <item>
      <title>Secrets of Direct3D 12: Resource Alignment</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Sun, 19 Apr 2020 11:33:32 +0000</pubDate>
      <link>https://dev.to/reg__/secrets-of-direct3d-12-resource-alignment-2aan</link>
      <guid>https://dev.to/reg__/secrets-of-direct3d-12-resource-alignment-2aan</guid>
      <description>&lt;p&gt;In the new graphics APIs - Direct3D 12 and Vulkan - creation of resources (textures and buffers) is a multi-step process. You need to allocate some memory and place your resource in it. In D3D12 there is a convenient function &lt;code&gt;ID3D12Device::CreateCommittedResource&lt;/code&gt; that lets you do it in one go, allocating the resource with its own, implicit memory heap, but it's recommended to allocate bigger memory blocks and place multiple resources in them using &lt;code&gt;ID3D12Device::CreatePlacedResource&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When placing a resource in memory, you need to know and respect its required size and alignment. Size is simply the number of bytes that the resource needs. Alignment is a power-of-two number that the offset of the beginning of the resource must be a multiple of (&lt;code&gt;offset % alignment == 0&lt;/code&gt;). I'm thinking about writing a separate article for beginners explaining the concept of memory alignment, but that's a separate topic...&lt;/p&gt;
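
&lt;p&gt;A sub-allocator typically satisfies this requirement by rounding offsets up to the next multiple of the alignment. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;

// Round offset up to the nearest multiple of alignment.
// Assumes alignment is a power of two, as D3D12/Vulkan alignments always are.
uint64_t AlignUp(uint64_t offset, uint64_t alignment)
{
    return (offset + alignment - 1) &amp;amp; ~(alignment - 1);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;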

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3Rtc9vKU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/dx12_resource_alignment.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3Rtc9vKU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/dx12_resource_alignment.png" alt="Direct3D 12 resource alignment" width="267" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Back to graphics: in Vulkan, you first need to create your resource (e.g. &lt;code&gt;vkCreateBuffer&lt;/code&gt;) and then pass it to a function (e.g. &lt;code&gt;vkGetBufferMemoryRequirements&lt;/code&gt;) that returns the required size and alignment of this resource (&lt;code&gt;VkMemoryRequirements::size&lt;/code&gt;, &lt;code&gt;alignment&lt;/code&gt;). In DirectX 12 it looks similar, or at first glance even simpler, as it's enough to have a structure &lt;code&gt;D3D12_RESOURCE_DESC&lt;/code&gt; describing the resource you will create, call &lt;code&gt;ID3D12Device::GetResourceAllocationInfo&lt;/code&gt;, and get &lt;code&gt;D3D12_RESOURCE_ALLOCATION_INFO&lt;/code&gt; - a similar structure with &lt;code&gt;SizeInBytes&lt;/code&gt; and &lt;code&gt;Alignment&lt;/code&gt;. I've described it briefly in my article &lt;a href="https://asawicki.info/articles/memory_management_vulkan_direct3d_12.php5"&gt;Differences in memory management between Direct3D 12 and Vulkan&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But if you dig deeper, there is more to it. While using the mentioned function is enough to make your program work correctly, applying some additional knowledge may let you save some memory, so read on if you want to make your GPU memory allocator better. The first interesting fact is that alignments in D3D12, unlike in Vulkan, are really fixed constants, independent of the particular GPU or graphics driver that the user may have installed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alignment required for buffers and normal textures is always 64 KB (65536), available as constant &lt;code&gt;D3D12_DEFAULT_RESOURCE_PLACEMENT_ALIGNMENT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Alignment required for MSAA textures is always 4 MB (4194304), available as constant &lt;code&gt;D3D12_DEFAULT_MSAA_RESOURCE_PLACEMENT_ALIGNMENT&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, we have these constants, and we also have a function to query for the actual alignment. To make things even more complicated, the structure &lt;code&gt;D3D12_RESOURCE_DESC&lt;/code&gt; contains an &lt;code&gt;Alignment&lt;/code&gt; member, so you have one alignment on the input and another one on the output! Fortunately, the &lt;code&gt;GetResourceAllocationInfo&lt;/code&gt; function lets you set &lt;code&gt;D3D12_RESOURCE_DESC::Alignment&lt;/code&gt; to 0, which makes it return the default alignment for the resource.&lt;/p&gt;

&lt;p&gt;Now, let me introduce the concept of "small textures". It turns out that some textures can be aligned to 4 KB and some MSAA textures can be aligned to 64 KB. This is called "small" alignment (as opposed to "default" alignment), and there are also constants for it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alignment allowed for small textures is 4 KB (4096), available as constant &lt;code&gt;D3D12_SMALL_RESOURCE_PLACEMENT_ALIGNMENT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Alignment allowed for small MSAA textures is 64 KB (65536), available as constant &lt;code&gt;D3D12_SMALL_MSAA_RESOURCE_PLACEMENT_ALIGNMENT&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Small&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Buffer&lt;/td&gt;
&lt;td&gt;64 KB&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Texture&lt;/td&gt;
&lt;td&gt;64 KB&lt;/td&gt;
&lt;td&gt;4 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MSAA texture&lt;/td&gt;
&lt;td&gt;4 MB&lt;/td&gt;
&lt;td&gt;64 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Using this smaller alignment saves some GPU memory that would otherwise be wasted as padding between resources. Unfortunately, it's unavailable for buffers, and for textures it's available only for "small" ones, with a very convoluted definition of "small". The rules are hidden in the description of the &lt;a href="https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_resource_desc#alignment"&gt;Alignment member of the D3D12_RESOURCE_DESC structure&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It must have &lt;code&gt;UNKNOWN&lt;/code&gt; layout.&lt;/li&gt;
&lt;li&gt;It must not be &lt;code&gt;RENDER_TARGET&lt;/code&gt; or &lt;code&gt;DEPTH_STENCIL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Its most detailed mip level (considering texture width, height, depth, pixel format, and number of samples), aligned up to some imaginary "tiles", must require no more bytes than the larger alignment. So for a normal texture, when this calculated size is &amp;lt;= 64 KB, you can use the 4 KB alignment. For an MSAA texture, when this calculated size is &amp;lt;= 4 MB, you can use the 64 KB alignment.&lt;/li&gt;
&lt;/ul&gt;
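
&lt;p&gt;For a rough intuition only (not the exact rule - the precise "tile" rounding is format- and driver-specific), an approximate pre-check for non-MSAA textures could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;

// Rough pre-check for the 4 KB "small" alignment of a non-MSAA texture:
// estimate whether the most detailed mip fits in 64 KB. This ignores the
// exact "tile" rounding, so treat a positive result as "worth trying",
// not as a guarantee.
bool MayUseSmallAlignment(uint32_t width, uint32_t height, uint32_t bytesPerPixel)
{
    const uint64_t mip0Size = uint64_t(width) * height * bytesPerPixel;
    return mip0Size &amp;lt;= 64 * 1024;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;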

&lt;p&gt;Could &lt;code&gt;GetResourceAllocationInfo&lt;/code&gt; calculate all this automatically and just return the optimal alignment for a resource, like the Vulkan function does? Possibly, but this is not what happens. You have to ask for it explicitly. When you pass &lt;code&gt;D3D12_RESOURCE_DESC::Alignment&lt;/code&gt; = 0 on the input, you always get the default (larger) alignment on the output. Only when you set &lt;code&gt;D3D12_RESOURCE_DESC::Alignment&lt;/code&gt; to the small alignment value does the function return that same value - and only if the small alignment has been "granted".&lt;/p&gt;

&lt;p&gt;There are two ways to use this in practice. The first is to calculate the eligibility of a texture for small alignment on your own and pass it to the function only when you know the texture fulfills the conditions. The second is to always try the small alignment first. When it's not granted, &lt;code&gt;GetResourceAllocationInfo&lt;/code&gt; returns values other than expected (in my test, &lt;code&gt;Alignment&lt;/code&gt; = 64 KB and &lt;code&gt;SizeInBytes&lt;/code&gt; = 0xFFFFFFFFFFFFFFFF). Then you should call it again with the default alignment. That's the method that Microsoft shows in their &lt;a href="https://github.com/microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12SmallResources"&gt;"Small Resources Sample"&lt;/a&gt;. It looks good, but a problem with it is that calling this function with an alignment that is not accepted generates D3D12 Debug Layer error #721 &lt;em&gt;CREATERESOURCE_INVALIDALIGNMENT&lt;/em&gt;. Or at least it used to, because on one of my machines the error no longer occurs. Maybe Microsoft fixed it in some recent update of Windows or Visual Studio / Windows SDK?&lt;/p&gt;
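
&lt;p&gt;A sketch of the second method, with the fallback (the &lt;code&gt;query&lt;/code&gt; callback here is a hypothetical stand-in for calling &lt;code&gt;GetResourceAllocationInfo&lt;/code&gt; with a given &lt;code&gt;D3D12_RESOURCE_DESC::Alignment&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;functional&amp;gt;

struct AllocInfo { uint64_t SizeInBytes; uint64_t Alignment; };

// Sketch of the "try small, then fall back" method. SizeInBytes == UINT64_MAX
// marks "not granted", matching the values I observed in my tests.
AllocInfo QueryWithFallback(const std::function&amp;lt;AllocInfo(uint64_t)&amp;gt;&amp;amp; query,
                            uint64_t smallAlignment, uint64_t defaultAlignment)
{
    AllocInfo info = query(smallAlignment);   // optimistically try the small alignment
    if(info.SizeInBytes == UINT64_MAX)        // not granted - ask again with the default
        info = query(defaultAlignment);
    return info;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;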

&lt;p&gt;Here comes the last quirk of this whole D3D12 resource alignment topic: alignment applies to the offset used in &lt;code&gt;CreatePlacedResource&lt;/code&gt;, which we understand as relative to the beginning of an &lt;code&gt;ID3D12Heap&lt;/code&gt;, but the heap itself has an alignment too! The &lt;code&gt;D3D12_HEAP_DESC&lt;/code&gt; structure has an &lt;code&gt;Alignment&lt;/code&gt; member. There is no equivalent of this in Vulkan. The documentation of the &lt;a href="https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_heap_desc"&gt;D3D12_HEAP_DESC structure&lt;/a&gt; says it must be 64 KB or 4 MB. Whenever you predict you might create MSAA textures in a heap, you must choose 4 MB. Otherwise, you can use 64 KB.&lt;/p&gt;
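
&lt;p&gt;This decision can be condensed into a one-liner (a sketch; the two constants come from the documentation cited above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;

// Pick the heap alignment: 4 MB if the heap may ever contain MSAA textures,
// 64 KB otherwise (the only two values D3D12_HEAP_DESC allows).
uint64_t ChooseHeapAlignment(bool mayContainMsaaTextures)
{
    return mayContainMsaaTextures ? 4ull * 1024 * 1024 : 64ull * 1024;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;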

&lt;p&gt;Thank you, Microsoft, for making this so complicated! ;) This article wouldn't be complete without an advertisement for an open source library: &lt;a href="https://github.com/GPUOpen-LibrariesAndSDKs/D3D12MemoryAllocator"&gt;D3D12 Memory Allocator&lt;/a&gt; (and the similar &lt;a href="https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator/"&gt;Vulkan Memory Allocator&lt;/a&gt;), which automatically handles all this complexity. It also implements both ways of using small alignment, selectable using a preprocessor macro.&lt;/p&gt;

</description>
      <category>directx</category>
      <category>direct3d</category>
      <category>d3d12</category>
      <category>dx12</category>
    </item>
    <item>
      <title>Initializing DX12 Textures After Allocation and Aliasing</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Thu, 19 Mar 2020 21:10:28 +0000</pubDate>
      <link>https://dev.to/reg__/initializing-dx12-textures-after-allocation-and-aliasing-3h5n</link>
      <guid>https://dev.to/reg__/initializing-dx12-textures-after-allocation-and-aliasing-3h5n</guid>
      <description>&lt;p&gt;If you are a graphics programmer using Direct3D 12, you may wonder what's the initial content of a newly allocated buffer or texture. Microsoft admitted it was not clearly defined, but in practice such new memory is filled with zeros (unless you use the new flag &lt;code&gt;D3D12_HEAP_FLAG_CREATE_NOT_ZEROED&lt;/code&gt;). See article &lt;a href="https://devblogs.microsoft.com/directx/coming-to-directx-12-more-control-over-memory-allocation/"&gt;“Coming to DirectX 12: More control over memory allocation”&lt;/a&gt;. This behavior has its pros and cons. Clearing all new memory makes sense, as the operating system surely doesn't want to disclose to us the data left by some other process, possibly containing passwords or other sensitive information. However, writing to a long memory region takes lots of time. Maybe that's one reason GPU memory allocation is so slow. I've seen large allocations taking even hundreds of milliseconds.&lt;/p&gt;

&lt;p&gt;There are situations when the memory of your new buffer or texture is not zeroed, but may contain some random data. The first case is &lt;strong&gt;when you create a resource using the CreatePlacedResource function&lt;/strong&gt;, inside a memory block that you might have used before for some other, already released resources. That's also what the &lt;a href="https://github.com/GPUOpen-LibrariesAndSDKs/D3D12MemoryAllocator"&gt;D3D12 Memory Allocator&lt;/a&gt; library does by default.&lt;/p&gt;

&lt;p&gt;It is important to know that in this case you must initialize the resource in a specific way! The rules are described on the page &lt;a href="https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device-createplacedresource"&gt;“ID3D12Device::CreatePlacedResource method”&lt;/a&gt; and say: if your resource is a texture that has either the &lt;code&gt;RENDER_TARGET&lt;/code&gt; or &lt;code&gt;DEPTH_STENCIL&lt;/code&gt; flag, you must initialize it after allocation and before any other usage, using one of these methods: 1. a clear operation (&lt;code&gt;ClearRenderTargetView&lt;/code&gt; or &lt;code&gt;ClearDepthStencilView&lt;/code&gt;), 2. a discard (&lt;code&gt;DiscardResource&lt;/code&gt;), 3. a copy to the entire resource as a destination (&lt;code&gt;CopyResource&lt;/code&gt;, &lt;code&gt;CopyBufferRegion&lt;/code&gt;, or &lt;code&gt;CopyTextureRegion&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Please note that rendering to the texture as a Render Target or writing to it as an Unordered Access View is not on the list! It means that if, for example, you implement a postprocessing effect, you allocated an intermediate 1920x1080 texture, and you want to overwrite all its pixels by rendering a fullscreen quad or triangle (better to use one triangle - see article &lt;a href="https://michaldrobot.com/2014/04/01/gcn-execution-patterns-in-full-screen-passes/"&gt;"GCN Execution Patterns in Full Screen Passes"&lt;/a&gt;), then initializing the texture before your draw call seems redundant, but you still need to do it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fnJ10DrU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/initializing_dx12_textures_1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fnJ10DrU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/initializing_dx12_textures_1.png" alt="Correct texture initialization after allocation" width="380" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What happens if you don't? Why are we asked to perform this initialization? Wouldn't we just see random colorful pixels if we used an uninitialized texture, which may or may not be a problem, depending on our use case? Not really... As I explained in my previous post &lt;a href="https://asawicki.info/news_1723_texture_compression_what_can_it_mean"&gt;“Texture Compression: What Can It Mean?”&lt;/a&gt;, a texture may be stored in video memory in some vendor-specific, compressed format. If the metadata of such compression is uninitialized, the consequences can be more severe than observing random colors. It's actually undefined behavior. On one GPU everything may work fine, while on another you may see graphical corruption that even rendering to the texture as a Render Target cannot fix (or maybe even a total GPU crash). I've experienced this problem myself recently.&lt;/p&gt;

&lt;p&gt;Thinking in terms of internal GPU texture compression also helps to explain why this initialization is required only for render-target and depth-stencil textures: GPUs use more aggressive compression techniques for those. Having the requirements for initialization defined like that implies that you can leave buffers and other textures uninitialized and just experience random data in their content, without the danger of anything worse happening.&lt;/p&gt;

&lt;p&gt;I feel that a side note on the &lt;code&gt;ID3D12GraphicsCommandList::DiscardResource&lt;/code&gt; function is needed, because many of you probably don't know it. Contrary to its name, this function doesn't release a resource or its memory. Its meaning is more like the mapping flag &lt;code&gt;D3D11_MAP_WRITE_DISCARD&lt;/code&gt; from the old D3D11. It informs the driver that the current content of the resource might be garbage; we know about it, and we don't care - we don't need it, we're not going to read it, we're just going to fill the entire resource with new content. Sometimes, calling this function may let the driver reach better performance. For example, it may skip downloading previous data from VRAM to the graphics chip. This is especially important and beneficial on tile-based, mobile GPUs. In some other cases, like the initialization of a newly allocated texture described here, it is required. Inside it, the driver might, for example, clear the metadata of its internal compression format. It is correct to call &lt;code&gt;DiscardResource&lt;/code&gt; and then render to your new texture as a Render Target. It could also potentially be faster than doing a &lt;code&gt;ClearRenderTargetView&lt;/code&gt; instead of &lt;code&gt;DiscardResource&lt;/code&gt;. By the way, if you happen to use Vulkan and have read this far, you might find it useful to know that the Vulkan equivalent of &lt;code&gt;DiscardResource&lt;/code&gt; is an image memory barrier with &lt;code&gt;oldLayout = VK_IMAGE_LAYOUT_UNDEFINED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There is a second case when a resource may contain some random data. It happens &lt;strong&gt;when you use memory aliasing&lt;/strong&gt;. This technique saves GPU memory by creating multiple resources in the same or overlapping regions of an &lt;code&gt;ID3D12Heap&lt;/code&gt;. It was not possible in the old APIs (Direct3D 11, OpenGL), where each resource got its own implicit memory allocation. In Direct3D 12 you can use &lt;code&gt;CreatePlacedResource&lt;/code&gt; to put your resource in a specific heap, at a specific offset. Aliasing resources must not be used at the same time. Sometimes you need some intermediate buffers or render targets only for a specific, short time during each frame. You can then reuse their memory for different resources needed in a later part of the frame. That's the key idea of aliasing.&lt;/p&gt;

&lt;p&gt;To do it correctly, you must do two things. First, between the usages you must issue a barrier of the special type &lt;code&gt;D3D12_RESOURCE_BARRIER_TYPE_ALIASING&lt;/code&gt;. Second, the resource to be used next (also called the "ResourceAfter", as opposed to the "ResourceBefore") needs to be initialized. The idea is similar to what I described before. You can find the rules of this initialization on the page &lt;a href="https://docs.microsoft.com/en-us/windows/win32/direct3d12/memory-aliasing-and-data-inheritance"&gt;“Memory Aliasing and Data Inheritance”&lt;/a&gt;. This time, however, we are told to initialize every texture that has the &lt;code&gt;RENDER_TARGET&lt;/code&gt; or &lt;code&gt;DEPTH_STENCIL&lt;/code&gt; flag with 1. a clear or 2. a copy operation to an entire subresource. &lt;code&gt;DiscardResource&lt;/code&gt; is not allowed. Whether it's an omission or intentional, we have to stick to these rules, even if we feel such clears are redundant and will slow down our rendering. Otherwise, we may experience hard-to-find bugs on some GPUs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0iY88G98--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/initializing_dx12_textures_2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0iY88G98--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/initializing_dx12_textures_2.png" alt="Correct texture initialization after aliasing" width="408" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>directx</category>
      <category>direct3d</category>
      <category>d3d12</category>
      <category>dx12</category>
    </item>
    <item>
      <title>Texture Compression: What Can It Mean?</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Sun, 15 Mar 2020 13:57:06 +0000</pubDate>
      <link>https://dev.to/reg__/texture-compression-what-can-it-mean-55o0</link>
      <guid>https://dev.to/reg__/texture-compression-what-can-it-mean-55o0</guid>
      <description>&lt;p&gt;&lt;em&gt;"Data compression - the process of encoding information using fewer bits than the original representation."&lt;/em&gt; That's the &lt;a href="https://en.wikipedia.org/wiki/Data_compression"&gt;definition from Wikipedia&lt;/a&gt;. But when we talk about textures (images that we use while rendering 3D graphics), it's not that simple. There are 4 different things we can mean by talking about texture compression, some of them you may not know. In this article, I'd like to give you some basic information about them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Lossless data compression.&lt;/strong&gt; That's the compression used to shrink binary data in size without losing a single bit. We may talk about compression algorithms and the libraries that implement them, like the popular &lt;a href="https://www.zlib.net/"&gt;zlib&lt;/a&gt; or &lt;a href="https://asawicki.info/news_1368_lzma_sdk_-_how_to_use.html"&gt;LZMA SDK&lt;/a&gt;. We may also mean file formats like ZIP or 7Z, which use these algorithms, but also define a way to pack multiple files with their whole directory structure into a single archive file.&lt;/p&gt;

&lt;p&gt;An important thing to note here is that we can use this compression for any data. Some file types, like text documents or binary executables, have to be compressed in a lossless way so that no bits are lost or altered. You can also compress image files this way. The compression ratio depends on the data. The compressed file will be smaller if there are many repeating patterns - when the data looks pretty boring, like many pixels with the same color. If the data is more varied - every next pixel has an even slightly different value - then you may end up with a compressed file as large as the original one, or even larger. For example, the following two images are 480 x 480. When saved as an uncompressed BMP R8G8B8 file, each takes 691,322 bytes. When compressed to a ZIP file, the first one is only 15,993 bytes, while the second one is 552,782 bytes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RqcdbMj8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Texture_compression/TextureCompression1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RqcdbMj8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Texture_compression/TextureCompression1.png" alt="" width="480" height="480"&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xBwTxwZm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Texture_compression/TextureCompression2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xBwTxwZm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Texture_compression/TextureCompression2.jpg" alt="" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can talk about this compression in the context of textures because assets in games are often packed into archives in some custom format which protects the data from modification, speeds up loading, and may also use compression. For example, the new Call of Duty Warzone takes 162 GB of disk space after installation, but it has only 442 files because developers packed the largest data in some archives in files Data/data/data.000, 001 etc., 1 GB each.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1EQBYSLn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Texture_compression/CoD_Warzone_data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1EQBYSLn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Texture_compression/CoD_Warzone_data.png" alt="" width="515" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Lossy compression.&lt;/strong&gt; These are the algorithms that allow some data loss, but offer higher compression ratios than lossless ones. We use them for specific kinds of data, usually some media - images, sound, and video. For video it's virtually essential, because raw uncompressed data would take enormous space for each second of recording. Algorithms for lossy compression use the knowledge about the structure of the data to remove the information that will be unnoticeable or degrade quality to the lowest degree possible, from the perspective of human perception. We all know them - these are formats like JPEG for images and MP3 for music.&lt;/p&gt;

&lt;p&gt;They have their pros and cons. JPEG compresses images in 8x8 blocks using the Discrete Cosine Transform (DCT). You can find an awesome, in-depth explanation of it on the page &lt;a href="https://parametric.press/issue-01/unraveling-the-jpeg/"&gt;Unraveling the JPEG&lt;/a&gt;. It's good for natural images, but with text and diagrams it may fail to maintain the desired quality. My first example saved as JPEG with Quality = 20% (this is very low; I usually use 90%) takes only 24,753 B, but it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A7CxlIxw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Texture_compression/TextureCompression1_jpeg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A7CxlIxw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/Texture_compression/TextureCompression1_jpeg.jpg" alt="" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GIF is good for such synthetic images, but fails on natural images. I saved my second example as GIF with a color palette of 32 entries. The file is only 90,686 B, but it looks like this (look closer to see dithering used due to a limited number of colors):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dMHPf6z4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://asawicki.info/files/Texture_compression/TextureCompression2_gif.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dMHPf6z4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://asawicki.info/files/Texture_compression/TextureCompression2_gif.gif" alt="" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lossy compression is usually accompanied by lossless compression - file formats like JPEG, GIF, MP3, MP4 etc. compress the data losslessly on top of their core algorithms, so there is no point in compressing such files again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. GPU texture compression.&lt;/strong&gt; Here comes the interesting part. All the formats described so far are designed to optimize data storage and transfer. We need to decompress all the textures packed in ZIP files or saved as JPEG before uploading them to video memory and using them for rendering. But there are other types of texture compression formats that can be used by the GPU directly. They are lossy as well, but they work in a different way - they use a fixed number of bytes per block of NxN pixels. Thanks to this, a graphics card can easily pick the right block from memory and decompress it on the fly, e.g. while sampling the texture. Examples of such formats are BC1..7 (BC stands for Block Compression) and ASTC (used on mobile platforms). For example, BC7 uses 1 byte per pixel, or 16 bytes per 4x4 block. You can find an overview of these formats here: &lt;a href="http://www.reedbeta.com/blog/understanding-bcn-texture-compression-formats/"&gt;Understanding BCn Texture Compression Formats&lt;/a&gt;.&lt;/p&gt;
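
&lt;p&gt;Because the block size is fixed, the memory footprint of a BCn mip level is trivial to compute - a sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;

// Compressed size of one mip level in a BCn format. Dimensions are rounded
// up to whole 4x4 blocks; bytesPerBlock is 8 for BC1/BC4 and 16 for
// BC2/BC3/BC5/BC6H/BC7.
uint64_t BcLevelSize(uint32_t width, uint32_t height, uint32_t bytesPerBlock)
{
    const uint32_t blocksX = (width + 3) / 4;
    const uint32_t blocksY = (height + 3) / 4;
    return uint64_t(blocksX) * blocksY * bytesPerBlock;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;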

&lt;p&gt;The only file format I know of that supports this compression is DDS, as it allows storing any texture that can be loaded straight into DirectX, in various pixel formats - including not only block-compressed but also cube, 3D, etc. Most game developers design their own file formats for this purpose anyway, to load textures straight into GPU memory with no conversion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Internal GPU texture compression.&lt;/strong&gt; Pixels of a texture may not be stored in video memory the way you think - row-major order, one pixel after the other, R8G8B8A8 or whatever format you chose. When you create a texture with &lt;code&gt;D3D12_TEXTURE_LAYOUT_UNKNOWN&lt;/code&gt; / &lt;code&gt;VK_IMAGE_TILING_OPTIMAL&lt;/code&gt; (always do that, except for some very special cases), the GPU is free to use some optimized internal format. This may not be true "compression" by its definition, because it must be lossless, so the memory reserved for the texture will not be smaller. It may even be larger because of the requirement to store additional metadata. (That's why you have to take care of extra &lt;code&gt;VK_IMAGE_ASPECT_METADATA_BIT&lt;/code&gt; when working with sparse textures in Vulkan.) The goal of these formats is to speed up access to the texture.&lt;/p&gt;

&lt;p&gt;Details of these formats are specific to GPU vendors and may or may not be public. Some ideas of how a GPU could optimize a texture in its memory include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Swizzle the pixels, e.g. into &lt;a href="https://en.wikipedia.org/wiki/Z-order_curve"&gt;Morton order&lt;/a&gt; or some other layout that improves locality of reference and the cache hit rate when accessing spatially neighboring pixels.&lt;/li&gt;
&lt;li&gt;Store metadata telling that a block of pixels or an entire texture is cleared to a specific color, so that the clear operation is fast because it doesn't need to write all the pixels.&lt;/li&gt;
&lt;li&gt;For depth textures: store the minimum and/or maximum depth per block of MxN pixels, so that a whole group of rendered pixels can be tested and rejected early, without testing each individual pixel. This is commonly known as Hi-Z.&lt;/li&gt;
&lt;li&gt;For MSAA textures: store a bit mask per pixel telling how many different colors its samples contain, so that not all the samples necessarily need to be read from or written to memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How can we make the best use of these internal GPU compression formats if they differ per graphics card vendor and we don't know their details? Just make sure you leave the driver as many optimization opportunities as possible by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;always using &lt;code&gt;D3D12_TEXTURE_LAYOUT_UNKNOWN&lt;/code&gt; / &lt;code&gt;VK_IMAGE_TILING_OPTIMAL&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;not using flags &lt;code&gt;D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET&lt;/code&gt;, &lt;code&gt;D3D12_RESOURCE_FLAG_ALLOW_DEPTH_STENCIL&lt;/code&gt;, &lt;code&gt;D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS&lt;/code&gt;, &lt;code&gt;D3D12_RESOURCE_FLAG_ALLOW_SIMULTANEOUS_ACCESS&lt;/code&gt; / &lt;code&gt;VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT&lt;/code&gt;, &lt;code&gt;VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT&lt;/code&gt;, &lt;code&gt;VK_IMAGE_USAGE_STORAGE_BIT&lt;/code&gt;, &lt;code&gt;VK_SHARING_MODE_CONCURRENT&lt;/code&gt; for any textures that don't need them,&lt;/li&gt;
&lt;li&gt;not using formats &lt;code&gt;DXGI_FORMAT_*_TYPELESS&lt;/code&gt; / &lt;code&gt;VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT&lt;/code&gt; for any textures that don't need them,&lt;/li&gt;
&lt;li&gt;issuing the minimum necessary number of barriers, always to the state optimal for the intended usage, and never to &lt;code&gt;D3D12_RESOURCE_STATE_COMMON&lt;/code&gt; / &lt;code&gt;VK_IMAGE_LAYOUT_GENERAL&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See also article &lt;a href="https://gpuopen.com/dcc-overview/"&gt;Delta Color Compression Overview at GPUOpen.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt; As you can see, the term "texture compression" can mean different things, so when talking about anything like this, always make sure it's clear what you mean, unless it's obvious from the context.&lt;/p&gt;

</description>
      <category>directx</category>
      <category>vulkan</category>
      <category>rendering</category>
      <category>graphics</category>
    </item>
    <item>
      <title>Secrets of Direct3D 12: Copies to the Same Buffer</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Wed, 04 Mar 2020 21:59:58 +0000</pubDate>
      <link>https://dev.to/reg__/secrets-of-direct3d-12-copies-to-the-same-buffer-ah1</link>
      <guid>https://dev.to/reg__/secrets-of-direct3d-12-copies-to-the-same-buffer-ah1</guid>
      <description>&lt;p&gt;Modern graphics APIs (D3D12, Vulkan) are complicated. They are designed to squeeze maximum performance out of graphics cards. GPUs are so fast at rendering not because they work with high clock frequencies (actually they don't - frequency of 1.5 GHz is high for a GPU, as opposed to many GHz on a CPU), but because they execute their workloads in a highly parallel and pipelined way. In other words: many tasks may be executed at the same time. To make it working correctly, we must manually synchronize them using barriers. At least sometimes...&lt;/p&gt;

&lt;p&gt;Let's consider a few scenarios. &lt;strong&gt;Scenario 1:&lt;/strong&gt; A draw call rendering to a texture as a Render Target View (RTV), followed by a draw call sampling from that texture as a Shader Resource View (SRV). We know we must put a &lt;code&gt;D3D12_RESOURCE_BARRIER_TYPE_TRANSITION&lt;/code&gt; barrier between them to transition the texture from &lt;code&gt;D3D12_RESOURCE_STATE_RENDER_TARGET&lt;/code&gt; to &lt;code&gt;D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE&lt;/code&gt;.&lt;/p&gt;
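
&lt;p&gt;Such a transition barrier can be sketched like this (a minimal sketch; &lt;code&gt;texture&lt;/code&gt; and &lt;code&gt;g_CommandList&lt;/code&gt; are assumed to exist):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Transition.pResource = texture;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
barrier.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
g_CommandList-&amp;gt;ResourceBarrier(1, &amp;amp;barrier);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;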

&lt;p&gt;&lt;strong&gt;Scenario 2:&lt;/strong&gt; Two subsequent compute shader dispatches, executed in one command list, access the same texture as an Unordered Access View (UAV). The texture stays in &lt;code&gt;D3D12_RESOURCE_STATE_UNORDERED_ACCESS&lt;/code&gt;, but if the second dispatch needs to wait for the first one to finish, we must still issue a barrier of the special type &lt;code&gt;D3D12_RESOURCE_BARRIER_TYPE_UAV&lt;/code&gt;. That's what this type of barrier was created for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3:&lt;/strong&gt; Two subsequent draw calls rendering to the same texture as a Render Target View (RTV). The texture stays in the same state, &lt;code&gt;D3D12_RESOURCE_STATE_RENDER_TARGET&lt;/code&gt;. We don't need to put a barrier between them. The draw calls are free to overlap in time, but the GPU has its own ways to guarantee that multiple writes to the same pixel always happen in the order of the draw calls, and even more - in the order of the primitives as given in the index and vertex buffers!&lt;/p&gt;

&lt;p&gt;Now on to &lt;strong&gt;scenario 4&lt;/strong&gt;, the most interesting one: two subsequent copies to the same resource. Let's say we work with buffers here, just for simplicity, but I suspect textures work the same way. What if the copies affect the same or overlapping regions of the destination buffer? Do they always execute in order, or can they overlap in time? Do we need to synchronize them to get a proper result? What if some copies are fast, made from another buffer in GPU memory (&lt;code&gt;D3D12_HEAP_TYPE_DEFAULT&lt;/code&gt;), and some are slow, accessing system memory (&lt;code&gt;D3D12_HEAP_TYPE_UPLOAD&lt;/code&gt;) through the PCI Express bus? What if the card uses a compute shader to perform the copy? Isn't this the same as scenario 2?&lt;/p&gt;

&lt;p&gt;That's a puzzle a colleague of mine posed recently. I didn't know the immediate answer, so I wrote a simple program to test this case. I prepared two buffers: &lt;code&gt;gpuBuffer&lt;/code&gt; placed in a DEFAULT heap and &lt;code&gt;cpuBuffer&lt;/code&gt; placed in an UPLOAD heap, 120 MB each, both filled with distinct data and both transitioned to &lt;code&gt;D3D12_RESOURCE_STATE_COPY_SOURCE&lt;/code&gt;. I then created another buffer, &lt;code&gt;destBuffer&lt;/code&gt;, as the destination of my copies. During the test I executed a few &lt;code&gt;CopyBufferRegion&lt;/code&gt; calls, from one source buffer or the other, copying small or large numbers of bytes. I then read back destBuffer and checked whether the result was valid.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g_CommandList-&amp;gt;CopyBufferRegion(destBuffer, 5 * (10 * 1024 * 1024),
    gpuBuffer, 5 * (10 * 1024 * 1024), 4 * (10 * 1024 * 1024));
g_CommandList-&amp;gt;CopyBufferRegion(destBuffer, 3 * (10 * 1024 * 1024),
    cpuBuffer, 3 * (10 * 1024 * 1024), 4 * (10 * 1024 * 1024));
g_CommandList-&amp;gt;CopyBufferRegion(destBuffer, SPECIAL_OFFSET,
    gpuBuffer, 102714720, 4);
g_CommandList-&amp;gt;CopyBufferRegion(destBuffer, SPECIAL_OFFSET,
    cpuBuffer, 102714720, 4);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It turned out the result is valid! I checked it on both an AMD card (Radeon RX 5700 XT) and an NVIDIA card (GeForce GTX 1070). The driver serializes such copies, making sure they execute in order and the destination data is as expected, even when the memory regions written by the copy operations overlap.&lt;/p&gt;

&lt;p&gt;I also made a capture using &lt;a href="https://github.com/GPUOpen-Tools/Radeon-GPUProfiler"&gt;Radeon GPU Profiler (RGP)&lt;/a&gt; and looked at the graph. The copies are executed as compute shader dispatches; large ones are split into multiple events, but after each copy there is an implicit barrier inserted by the driver, described as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CmdBarrierBlitSync()&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;The AMD driver issued a barrier in between back-to-back blit operations to the same destination resource.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://asawicki.info/files/D3D12_CopyDestinationTest_RGP.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--75ZjUbdK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://asawicki.info/files/D3D12_CopyDestinationTest_RGP.png" alt="" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think this explains everything. If the driver had to insert such a barrier, we can assume it is required. I just can't find anything in the &lt;a href="https://docs.microsoft.com/en-us/windows/win32/direct3d12/direct3d-12-graphics"&gt;Direct3D documentation&lt;/a&gt; that would explicitly specify this behavior. If you find it, please let me know - e-mail me or leave a comment under this post.&lt;/p&gt;

&lt;p&gt;Maybe we could insert a barrier manually between these copies, just to be sure? Nope, there is no way to do it. I tried two different approaches:&lt;/p&gt;

&lt;p&gt;1) A UAV barrier like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;D3D12_RESOURCE_BARRIER uavBarrier = {};
uavBarrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
uavBarrier.UAV.pResource = destBuffer;
g_CommandList-&amp;gt;ResourceBarrier(1, &amp;amp;uavBarrier);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It triggers a D3D Debug Layer error complaining that the buffer doesn't have the UAV flag among its resource flags:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;D3D12 ERROR: ID3D12GraphicsCommandList::ResourceBarrier: Missing resource bind flags. [RESOURCE_MANIPULATION ERROR #523: RESOURCE_BARRIER_MISSING_BIND_FLAGS]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;2) A transition barrier from COPY_DEST to COPY_DEST:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;D3D12_RESOURCE_BARRIER transitionBarrier = {};
transitionBarrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
transitionBarrier.Transition.pResource = destBuffer;
transitionBarrier.Transition.StateBefore = D3D12_RESOURCE_STATE_COPY_DEST;
transitionBarrier.Transition.StateAfter = D3D12_RESOURCE_STATE_COPY_DEST;
transitionBarrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
g_CommandList-&amp;gt;ResourceBarrier(1, &amp;amp;transitionBarrier);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad luck again. This time the Debug Layer complains that the "before" and "after" states must be different.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;D3D12 ERROR: ID3D12CommandList::ResourceBarrier: Before and after states must be different. [RESOURCE_MANIPULATION ERROR #525: RESOURCE_BARRIER_MATCHING_STATES]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Bonus &lt;strong&gt;scenario 5:&lt;/strong&gt; &lt;code&gt;ClearRenderTargetView&lt;/code&gt;, followed by a draw call rendering to the same texture as a Render Target View. The texture needs to be in &lt;code&gt;D3D12_RESOURCE_STATE_RENDER_TARGET&lt;/code&gt; for both operations. We don't put a barrier between them and don't even have a way to do it, just like in scenario 4. So clear operations must also guarantee the order of their execution, although I can't find anything about it in the DX12 documentation either.&lt;/p&gt;

&lt;p&gt;What a mess! It seems that Direct3D 12 sometimes requires putting explicit barriers between our commands, automatically synchronizes some others, and doesn't clearly describe all of it in the documentation. The only general rule I can think of is that it cannot track resources bound through descriptors (like SRV, UAV), but it does track those bound in a more direct way (as a render target, depth-stencil, clear target, or copy destination) and synchronizes them automatically. I hope this post helped to clarify some situations that may happen in your rendering code.&lt;/p&gt;

</description>
      <category>directx</category>
      <category>direct3d</category>
      <category>d3d12</category>
      <category>graphics</category>
    </item>
    <item>
      <title>How Do Graphics Cards Execute Vector Instructions?</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Sun, 19 Jan 2020 17:13:50 +0000</pubDate>
      <link>https://dev.to/reg__/how-do-graphics-cards-execute-vector-instructions-19hm</link>
      <guid>https://dev.to/reg__/how-do-graphics-cards-execute-vector-instructions-19hm</guid>
      <description>&lt;p&gt;Intel announced that together with their new graphics architecture they will provide a new API, called oneAPI, that will allow to program GPU, CPU, and even FPGA in an unified way, and will support SIMD as well as SIMT mode. If you are not sure what does it mean but you want to be prepared for it, read this article. Here I try to explain concepts like SIMD, SIMT, AoS, SoA, and the vector instruction execution on CPU and GPU. I think it may interest to you as a programmer even if you don't write shaders or GPU computations. Also, don't worry if you don't know any assembly language - the examples below are simple and may be understandable to you, anyway. Below I will show three examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. CPU, scalar&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say we write a program that operates on a numerical value. The value comes from somewhere, and before we pass it on for further processing, we want to execute the following logic: if it's negative (less than zero), increase it by 1. In C++ it may look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;float number = ...;
bool needsIncrease = number &amp;lt; 0.0f;
if(needsIncrease)
  number += 1.0f;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you compile this code in Visual Studio 2019 for the 64-bit x86 architecture, you may get the following assembly (comments after semicolons added by me):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00007FF6474C1086 movss  xmm1,dword ptr [number]   ; xmm1 = number
00007FF6474C108C xorps  xmm0,xmm0                 ; xmm0 = 0
00007FF6474C108F comiss xmm0,xmm1                 ; compare xmm0 with xmm1, set flags
00007FF6474C1092 jbe    main+32h (07FF6474C10A2h) ; jump to 07FF6474C10A2 depending on flags
00007FF6474C1094 addss  xmm1,dword ptr [__real@3f800000 (07FF6474C2244h)]  ; xmm1 += 1
00007FF6474C109C movss  dword ptr [number],xmm1   ; number = xmm1
00007FF6474C10A2 ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is nothing special here, just normal CPU code. Each instruction operates on a single value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. CPU, vector&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some time ago, vector instructions were introduced to CPUs. They allow operating on many values at a time, not just a single one. For example, the CPU vector extension called Streaming SIMD Extensions (SSE) is accessible in Visual C++ through data types like &lt;code&gt;__m128&lt;/code&gt; (which stores a 128-bit value representing e.g. 4x 32-bit floating-point numbers) and intrinsic functions like &lt;code&gt;_mm_add_ps&lt;/code&gt; (which adds two such variables per component, outputting a new vector of 4 floats as a result). We call this approach Single Instruction Multiple Data (&lt;strong&gt;SIMD&lt;/strong&gt;), because one instruction operates not on a single numerical value, but on a whole vector of such values in parallel.&lt;/p&gt;

&lt;p&gt;Let's say we want to implement the following logic: given a vector (x, y, z, w) of 4x 32-bit floating-point numbers, if its first component (x) is less than zero, increase the whole vector per component by (1, 2, 3, 4). In Visual C++ we can implement it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const float constant[] = {1.0f, 2.0f, 3.0f, 4.0f};
__m128 number = ...;
float x; _mm_store_ss(&amp;amp;x, number);
bool needsIncrease = x &amp;lt; 0.0f;
if(needsIncrease)
  number = _mm_add_ps(number, _mm_loadu_ps(constant));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which gives the following assembly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00007FF7318C10CA  comiss xmm0,xmm1  ; compare xmm0 with xmm1, set flags
00007FF7318C10CD  jbe    main+69h (07FF7318C10D9h)  ; jump to 07FF7318C10D9 depending on flags
00007FF7318C10CF  movaps xmm5,xmmword ptr [__xmm@(...) (07FF7318C2250h)]  ; xmm5 = (1, 2, 3, 4)
00007FF7318C10D6  addps  xmm5,xmm1  ; xmm5 = xmm5 + xmm1
00007FF7318C10D9  movaps xmm0,xmm5  ; xmm0 = xmm5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time the &lt;code&gt;xmm&lt;/code&gt; registers are used to store not just single numbers, but vectors of 4 floats. A single instruction - &lt;code&gt;addps&lt;/code&gt; (as opposed to the &lt;code&gt;addss&lt;/code&gt; used in the previous example) - adds the 4 numbers from &lt;code&gt;xmm1&lt;/code&gt; to the 4 numbers in &lt;code&gt;xmm5&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It may seem obvious, but it's important for later considerations to note that the condition here, and the boolean variable driving it (&lt;code&gt;needsIncrease&lt;/code&gt;), is not a vector but a single value, calculated from the first component of the vector &lt;code&gt;number&lt;/code&gt;. Such a single value in the SIMD world is also called a "scalar". Based on it, the condition is true or false and the branch is taken or not, so either the whole vector is increased by (1, 2, 3, 4) or nothing happens. This is how CPUs work, because we execute just one program, with one thread, which has one instruction pointer and executes its instructions sequentially.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. GPU&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now let's move from the CPU world to the world of graphics processors (GPUs). These are programmed in different languages. One of them is GLSL, used in the OpenGL and Vulkan graphics APIs. This language also has a data type that holds 4x 32-bit floating-point numbers, called &lt;code&gt;vec4&lt;/code&gt;. You can add two vectors per component using just the '+' operator.&lt;/p&gt;

&lt;p&gt;The same logic as in section 2, implemented in GLSL, looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vec4 number = ...;
bool needsIncrease = number.x &amp;lt; 0.0;
if(needsIncrease)
  number += vec4(1.0, 2.0, 3.0, 4.0);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you compile a shader with such code for an AMD GPU, you may see the following GPU assembly. (For offline shader compilation I used &lt;a href="https://github.com/GPUOpen-Tools/RGA"&gt;Radeon GPU Analyzer (RGA)&lt;/a&gt; - a free tool from AMD.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;v_add_f32      v5, 1.0, v2      ; v5 = v2 + 1
v_add_f32      v1, 2.0, v3      ; v1 = v3 + 2
v_cmp_gt_f32   vcc, 0, v2       ; compare v2 with 0, set flags
v_cndmask_b32  v2, v2, v5, vcc  ; override v2 with v5 depending on flags
v_add_f32      v5, lit(0x40400000), v4  ; v5 = v4 + 3
v_cndmask_b32  v1, v3, v1, vcc  ; override v1 with v3 depending on flags
v_add_f32      v3, 4.0, v0      ; v3 = v0 + 4
v_cndmask_b32  v4, v4, v5, vcc  ; override v4 with v5 depending on flags
v_cndmask_b32  v3, v0, v3, vcc  ; override v3 with v0 depending on flags
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see something interesting here: although the high-level shader language is vector-based, the actual GPU assembly operates on individual vector components (x, y, z, w) using separate instructions and stores their values in separate registers (v2, v3, v4, v0). Does this mean GPUs don't support vector instructions?!&lt;/p&gt;

&lt;p&gt;Actually, they do, but differently. The first GPUs from years ago (right after they became programmable with shaders) really did operate on those vectors the way we see them. Nowadays, it's true that what we treat as vector components (x, y, z, w) or color components (R, G, B, A) in the shaders we write become separate values. But GPU instructions are still vector instructions, as denoted by their "v_" prefix. The SIMD in GPUs is used to process not a single vertex or pixel, but many of them (e.g. 64) at once. This means that a single register like &lt;code&gt;v2&lt;/code&gt; stores 64x 32-bit numbers, and a single instruction like &lt;code&gt;v_add_f32&lt;/code&gt; adds 64 such numbers per component - just the Xs, or Ys, or Zs, or Ws - one for each pixel calculated in a separate SIMD lane.&lt;/p&gt;

&lt;p&gt;Some people call this Structure of Arrays (&lt;strong&gt;SoA&lt;/strong&gt;), as opposed to Array of Structures (&lt;strong&gt;AoS&lt;/strong&gt;). The term comes from imagining how the data structure, as stored in memory, could be defined. If we were to define such a data structure in C, the way we see it when programming in GLSL is an array of structures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;struct {
  float x, y, z, w;
} number[64];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the way the GPU actually operates is a kind of transpose of this - a structure of arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;struct {
  float x[64], y[64], z[64], w[64];
} number;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has an interesting implication if you consider the condition we evaluate before the addition. Note that we write our shader as if we calculated just a single vertex or pixel, without even having to know that 64 of them will execute together in a vector manner. This means we have 64 Xs, Ys, Zs, and Ws. The X component of each pixel may or may not be less than 0, so for each SIMD lane the condition may be fulfilled or not. The boolean variable &lt;code&gt;needsIncrease&lt;/code&gt; inside the GPU is therefore not a scalar, but also a vector, holding 64 individual boolean values - one for each pixel! Each pixel may want to enter the &lt;code&gt;if&lt;/code&gt; clause or skip it. That's what we call Single Instruction Multiple Threads (&lt;strong&gt;SIMT&lt;/strong&gt;), and that's how real modern GPUs operate. How is it implemented if some threads want to take the &lt;code&gt;if&lt;/code&gt; branch and others the &lt;code&gt;else&lt;/code&gt; branch? That's a different story...&lt;/p&gt;
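
&lt;p&gt;The masking idea can be illustrated with a CPU-side sketch (plain C++ written only for illustration, not real GPU code; 64 stands for the hypothetical SIMD width):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;float x[64], y[64], z[64], w[64]; // SoA: one component per SIMD lane
// ... fill with data ...
bool mask[64];                    // per-lane "needsIncrease"
for(int lane = 0; lane &amp;lt; 64; ++lane)
  mask[lane] = x[lane] &amp;lt; 0.0f;    // like v_cmp_gt_f32 setting a bit per lane
for(int lane = 0; lane &amp;lt; 64; ++lane)
  if(mask[lane])                  // like v_cndmask_b32 selecting per lane
  {
    x[lane] += 1.0f; y[lane] += 2.0f;
    z[lane] += 3.0f; w[lane] += 4.0f;
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;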

</description>
      <category>gpu</category>
      <category>graphics</category>
      <category>rendering</category>
    </item>
    <item>
      <title>Two Shader Compilers of Direct3D 12</title>
      <dc:creator>Adam Sawicki</dc:creator>
      <pubDate>Mon, 23 Dec 2019 19:40:35 +0000</pubDate>
      <link>https://dev.to/reg__/two-shader-compilers-of-direct3d-12-3oj7</link>
      <guid>https://dev.to/reg__/two-shader-compilers-of-direct3d-12-3oj7</guid>
      <description>&lt;p&gt;If we write a game or other graphics application using Direct3D 12, we also need to write some shaders. We author these in high-level language called HLSL and compile them before passing to the DirectX API while creating pipeline state objects (&lt;code&gt;ID3D12Device::CreateGraphicsPipelineState&lt;/code&gt;). There are currently two shader compilers available, both from Microsoft, each outputting different binary format:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;old compiler “FXC”&lt;/li&gt;
&lt;li&gt;new compiler “DXC”&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Which one should you choose? The new compiler, called &lt;a href="https://github.com/Microsoft/DirectXShaderCompiler"&gt;DirectX Shader Compiler&lt;/a&gt;, is more modern, based on LLVM/Clang, and open source. We must use it if we want to use Shader Model 6 or above. On the other hand, shaders compiled with it require a relatively recent version of Windows and graphics drivers, so they won’t work on systems that haven’t been updated for years.&lt;/p&gt;

&lt;p&gt;Shaders can be compiled offline using a command-line program (a standalone executable compiler) and then bundled with your program in compiled binary form. That’s probably the best way to go for the release version, but for development and debugging purposes it’s easier if we can change the shader source just as we change the source of CPU code, quickly rebuild and run, or even reload a changed shader while the app is running. For this, it’s convenient to integrate the shader compiler into your program, which is possible through a compiler API.&lt;/p&gt;

&lt;p&gt;This gives us 4 different ways of compiling shaders. This article is a quick tutorial on all of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Old Compiler - Offline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The standalone executable of the old compiler is called “fxc.exe”. You can find it bundled with the Windows SDK, which is installed together with Visual Studio. For example, on my system I located it at this path: “c:\Program Files (x86)\Windows Kits\10\bin\10.0.17763.0\x64\fxc.exe”.&lt;/p&gt;

&lt;p&gt;To compile a shader from HLSL source to the old binary format, issue a command like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fxc.exe fxc.exe /T ps_5_0 /E main PS.hlsl /Fo PS.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/T&lt;/code&gt; is the target profile&lt;br&gt;&lt;br&gt;
&lt;code&gt;ps_5_0&lt;/code&gt; means a pixel shader with Shader Model 5.0&lt;br&gt;&lt;br&gt;
&lt;code&gt;/E&lt;/code&gt; is the entry point - the name of the main shader function, “main” in my case&lt;br&gt;&lt;br&gt;
&lt;code&gt;PS.hlsl&lt;/code&gt; is the text file with the shader source&lt;br&gt;&lt;br&gt;
&lt;code&gt;/Fo&lt;/code&gt; is the binary output file to be written&lt;/p&gt;

&lt;p&gt;There are many more command-line parameters supported by this tool. You can display help about them by passing the &lt;code&gt;/?&lt;/code&gt; parameter. Using appropriate parameters you can change the optimization level and other compilation settings, provide additional &lt;code&gt;#include&lt;/code&gt; directories, &lt;code&gt;#define&lt;/code&gt; macros, preview intermediate data (preprocessed source, compiled assembly), or even disassemble an existing binary file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Old compiler - API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To use the old compiler as a library in your C++ program:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;#include &amp;lt;d3dcompiler.h&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;link with "d3dcompiler.lib"&lt;/li&gt;
&lt;li&gt;call function &lt;code&gt;D3DCompileFromFile&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CComPtr&amp;lt;ID3DBlob&amp;gt; code, errorMsgs;
HRESULT hr = D3DCompileFromFile(
    L"PS.hlsl", // pFileName
    nullptr, // pDefines
    nullptr, // pInclude
    "main", // pEntrypoint
    "PS_5_0", // pTarget
    0, // Flags1, can be e.g. D3DCOMPILE_DEBUG, D3DCOMPILE_SKIP_OPTIMIZATION
    0, // Flags2
    &amp;amp;code, // ppCode
    &amp;amp;errorMsgs); // ppErrorMsgs
if(FAILED(hr))
{
    if(errorMsgs)
    {
        wprintf(L"Compilation failed with errors:\n%hs\n",
            (const char*)errorMsgs-&amp;gt;GetBufferPointer());
    }
    // Handle compilation error...
}

D3D12_GRAPHICS_PIPELINE_STATE_DESC psoDesc = {};
// (...)
psoDesc.PS.BytecodeLength = code-&amp;gt;GetBufferSize();
psoDesc.PS.pShaderBytecode = code-&amp;gt;GetBufferPointer();
CComPtr&amp;lt;ID3D12PipelineState&amp;gt; pso;
hr = device-&amp;gt;CreateGraphicsPipelineState(&amp;amp;psoDesc, IID_PPV_ARGS(&amp;amp;pso));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first parameter is the path to the file containing the HLSL source. If you want to load the source some other way, there is also a function that takes a buffer in memory: &lt;code&gt;D3DCompile&lt;/code&gt;. The second parameter (optional) can specify preprocessor macros to be &lt;code&gt;#define&lt;/code&gt;-d during compilation. The third parameter (optional) can point to your own implementation of the &lt;code&gt;ID3DInclude&lt;/code&gt; interface, which would provide additional files requested via &lt;code&gt;#include&lt;/code&gt;. The entry point and target profile are strings, just like in the command-line compiler. Other options that have command-line equivalents (e.g. &lt;code&gt;/Zi&lt;/code&gt;, &lt;code&gt;/Od&lt;/code&gt;) can be specified as bit flags.&lt;/p&gt;

&lt;p&gt;The two objects returned from this function are just buffers of binary data. &lt;code&gt;ID3DBlob&lt;/code&gt; is a simple interface that you can query for its size and a pointer to its data. In case of successful compilation, the &lt;code&gt;ppCode&lt;/code&gt; output parameter returns a buffer with the compiled shader binary. You should pass its data to &lt;code&gt;ID3D12PipelineState&lt;/code&gt; creation; after successful creation, the blob can be &lt;code&gt;Release&lt;/code&gt;-d. The second buffer, &lt;code&gt;ppErrorMsgs&lt;/code&gt;, contains a null-terminated string with the error messages generated during compilation. It can be useful even when compilation succeeded, as it then contains warnings.&lt;/p&gt;

&lt;p&gt;Update: The "d3dcompiler_47.dll" file is needed. Typically some version of it is available on the machine, but you generally still want to redistribute the exact version you're using from the Windows 10 SDK. Otherwise you could end up compiling with an older or newer version on an end user's machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. New Compiler - Offline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using the new compiler in its standalone form is very similar to the old one. The executable is called “dxc.exe” and it’s also bundled with the Windows SDK, in the same directory. The documentation of the command-line syntax mentions parameters starting with &lt;code&gt;"-"&lt;/code&gt;, but the old &lt;code&gt;"/"&lt;/code&gt; also seems to work. To compile the same shader using Shader Model 6.0, issue the following command, which looks almost the same as for "fxc.exe":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dxc.exe -T ps_6_0 -E main PS.hlsl -Fo PS.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Despite using a new binary format (called “DXIL”, based on LLVM IR), you can load it and pass it to D3D12 PSO creation the same way as before. There is a tricky issue, though: you need to ship the file “dxil.dll” with your program. Otherwise, PSO creation will fail! You can find this file in the Windows SDK, at a path like: “c:\Program Files (x86)\Windows Kits\10\Redist\D3D\x64\dxil.dll”. Just copy it to the directory with the target EXE of your project, or the one you use as the working directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. New Compiler - API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The new compiler can also be used programmatically as a library, but its usage is a bit more involved. Just as with any C++ library, start with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;#include &amp;lt;dxcapi.h&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;link "dxcompiler.lib"&lt;/li&gt;
&lt;li&gt;create and use object of type &lt;code&gt;IDxcCompiler&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This time, though, you need to bundle an additional DLL with your program (next to the “dxil.dll” mentioned above): “dxcompiler.dll”, found in the same “Redist\D3D\x64” directory. More code is needed to perform the compilation. First create the &lt;code&gt;IDxcLibrary&lt;/code&gt; and &lt;code&gt;IDxcCompiler&lt;/code&gt; objects. They can stay alive for the whole lifetime of your application, or for as long as you need to compile more shaders. Then, for each shader, load it from a file (or any source of your choice) into a blob, call the &lt;code&gt;Compile&lt;/code&gt; method, and inspect the result: either an error plus a blob with error messages, or a success plus a blob with the compiled shader binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CComPtr&amp;lt;IDxcLibrary&amp;gt; library;
HRESULT hr = DxcCreateInstance(CLSID_DxcLibrary, IID_PPV_ARGS(&amp;amp;library));
//if(FAILED(hr)) Handle error...

CComPtr&amp;lt;IDxcCompiler&amp;gt; compiler;
hr = DxcCreateInstance(CLSID_DxcCompiler, IID_PPV_ARGS(&amp;amp;compiler));
//if(FAILED(hr)) Handle error...

uint32_t codePage = CP_UTF8;
CComPtr&amp;lt;IDxcBlobEncoding&amp;gt; sourceBlob;
hr = library-&amp;gt;CreateBlobFromFile(L"PS.hlsl", &amp;amp;codePage, &amp;amp;sourceBlob);
//if(FAILED(hr)) Handle file loading error...

CComPtr&amp;lt;IDxcOperationResult&amp;gt; result;
hr = compiler-&amp;gt;Compile(
    sourceBlob, // pSource
    L"PS.hlsl", // pSourceName
    L"main", // pEntryPoint
    L"PS_6_0", // pTargetProfile
    NULL, 0, // pArguments, argCount
    NULL, 0, // pDefines, defineCount
    NULL, // pIncludeHandler
    &amp;amp;result); // ppResult
if(SUCCEEDED(hr))
    result-&amp;gt;GetStatus(&amp;amp;hr);
if(FAILED(hr))
{
    if(result)
    {
        CComPtr&amp;lt;IDxcBlobEncoding&amp;gt; errorsBlob;
        hr = result-&amp;gt;GetErrorBuffer(&amp;amp;errorsBlob);
        if(SUCCEEDED(hr) &amp;amp;&amp;amp; errorsBlob)
        {
            wprintf(L"Compilation failed with errors:\n%hs\n",
                (const char*)errorsBlob-&amp;gt;GetBufferPointer());
        }
    }
    // Handle compilation error...
}
CComPtr&amp;lt;IDxcBlob&amp;gt; code;
result-&amp;gt;GetResult(&amp;amp;code);

D3D12_GRAPHICS_PIPELINE_STATE_DESC psoDesc = {};
// (...)
psoDesc.PS.BytecodeLength = code-&amp;gt;GetBufferSize();
psoDesc.PS.pShaderBytecode = code-&amp;gt;GetBufferPointer();
CComPtr&amp;lt;ID3D12PipelineState&amp;gt; pso;
hr = device-&amp;gt;CreateGraphicsPipelineState(&amp;amp;psoDesc, IID_PPV_ARGS(&amp;amp;pso));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Compile&lt;/code&gt; function also takes strings with the entry point and target profile, but this time in Unicode format. The way to pass additional flags also changed: instead of bit flags, the &lt;code&gt;pArguments&lt;/code&gt; and &lt;code&gt;argCount&lt;/code&gt; parameters take an array of strings specifying the same parameters you would pass to the command-line compiler, e.g. &lt;code&gt;L"-Zi"&lt;/code&gt; to attach debug information or &lt;code&gt;L"-Od"&lt;/code&gt; to disable optimizations.&lt;/p&gt;
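
&lt;p&gt;For example (a sketch reusing the &lt;code&gt;compiler&lt;/code&gt;, &lt;code&gt;sourceBlob&lt;/code&gt;, and &lt;code&gt;result&lt;/code&gt; variables from the listing above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LPCWSTR arguments[] = { L"-Zi", L"-Od" };
hr = compiler-&amp;gt;Compile(
    sourceBlob, L"PS.hlsl", L"main", L"PS_6_0",
    arguments, _countof(arguments), // pArguments, argCount
    NULL, 0, // pDefines, defineCount
    NULL, // pIncludeHandler
    &amp;amp;result); // ppResult
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;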

&lt;p&gt;Update 2020-01-05: Thanks @MyNameIsMJP for &lt;a href="https://twitter.com/MyNameIsMJP/status/1211039614347206658"&gt;your feedback&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>directx</category>
      <category>rendering</category>
    </item>
  </channel>
</rss>
