Michael (Poetry) Flynn
Code Optimization on Windows: Timing

When I first started programming, one of the first things I wanted to do was optimize. No, really. I believe it’s a core tenet of programming: it’s fun, exciting, and makes programming much more than gluing frameworks together.

That said, most programmers struggle with basic optimization ideas. Of course, most of those interested in the subject know at least the bare basics, such as algorithmic complexity, always timing your code before optimizing, and avoiding memory leaks. However, since serious nitty-gritty optimization is no longer a major facet of most developers’ lives — technological constraints are mostly gone, time is short, and the push for new features is constant — many lack any real experience at all.

Therefore, this article is part of a series that hopes to act as an introduction for those working above the embedded environment, using C++ and MSVC, on an Intel architecture, on the Windows operating system.

Starting Code

Two sorting algorithms, Bubble Sort and Exchange Sort, will be compared to determine which is faster. The code is as follows:



#include <cstdint>
#include <cstdlib>
#include <vector>

void BubbleSort(const uint32_t array_size) {

  //Fill Vector
  auto vec = std::vector<uint32_t>(array_size);
  for (auto& i : vec) {
    i = std::rand();
  }; //vector

  //Sort Vector
  for (uint32_t i = 0; i < array_size - 1; ++i) {
    for (uint32_t j = 0; j < array_size - i - 1; ++j)
      if (vec[j] > vec[j + 1]) {
        auto temp = vec[j];
        vec[j] = vec[j + 1];
        vec[j + 1] = temp;
      }; //if vec[j] greater than vec[j+1]
  }; //for every idx

}; //BubbleSort

void ExchangeSort(uint32_t array_size) {
  //Fill Vector
  auto vec = std::vector<uint32_t>(array_size);
  for (auto& i : vec) {
    i = std::rand();
  }; //vector

  //Sort Vector
  for (uint_fast32_t i = 0; i < array_size - 1; ++i) {
    for (uint_fast32_t j = (i + 1); j < array_size; ++j)
    {
      if (vec[i] < vec[j])
      {
        auto temp = vec[i];
        vec[i] = vec[j];
        vec[j] = temp;
      }; //if vec[i] less than vec[j]
    }; //for every next variable
  }; //for every idx
}; //ExchangeSort

int main() {
  BubbleSort(7000);
  ExchangeSort(7000);
}; //main



Assembly

The first thing to know is the layer below high-level code: assembly. While assembly is not strictly needed to optimize, it’s still important for making approximations and for extremely niche optimizations. Using assembly for said niche optimizations won’t be shown here because it’s rarely needed on modern systems, and inline assembly isn’t available for x64 architectures in MSVC anyhow.

To view the assembly of a program, you must view the disassembly. There are two primary ways to see the disassembly in MSVC. The first is simply to press Ctrl-Alt-D while the debugger is paused at a breakpoint. Keep in mind Ctrl-Alt-D will only show the assembly within a breakpoint’s scope, so one should be careful to place breakpoints appropriately.

The second option is to enable the assembly output in your project settings, as shown in the image below. Keep in mind this option produces far more code than Ctrl-Alt-D because it covers the entire program, which can make it much harder to read.

[Image: the assembler output option in the MSVC project property pages]

The main use of seeing assembly is the ability to approximate which function is faster. Keep in mind one can only approximate to a certain point. Judging assembly without running it primarily works for smaller functions or for language features over which the programmer has little control.

Here is the disassembly output using Ctrl-Alt-D for the BubbleSort function:



 mov         dword ptr [rsp+8],ecx  
 push        rbp  
 push        rdi  
 sub         rsp,238h  
 lea         rbp,[rsp+20h]  
 lea         rdi,[rsp+20h]  
 mov         ecx,56h  
 mov         eax,0CCCCCCCCh  
 rep stos    dword ptr [rdi]  
 mov         ecx,dword ptr [rsp+258h]  
 mov         rax,qword ptr [__security_cookie (07FF6FA562018h)]  
 xor         rax,rbp  
 mov         qword ptr [rbp+200h],rax  
 lea         rcx,[__76EBFBA9_main@cpp (07FF6FA568069h)]  
 call        __CheckForDebuggerJustMyCode (07FF6FA551573h)  
 mov         edx,20h  
 lea         rcx,[vec]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::__autoclassinit2 (07FF6FA5516FEh)  
 lea         rcx,[rbp+1E4h]  
 call        std::allocator<unsigned int>::allocator<unsigned int> (07FF6FA5516E0h)  
 mov         ecx,dword ptr [array_size]  
 mov         r8,rax  
 mov         edx,ecx  
 lea         rcx,[vec]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::vector<unsigned int,std::allocator<unsigned int> > (07FF6FA5516E5h)  
 lea         rax,[vec]  
 mov         qword ptr [rbp+48h],rax  
 mov         rcx,qword ptr [rbp+48h]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::_Unchecked_begin (07FF6FA5516B8h)  
 mov         qword ptr [rbp+68h],rax  
 mov         rcx,qword ptr [rbp+48h]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::_Unchecked_end (07FF6FA551690h)  
 mov         qword ptr [rbp+88h],rax  
 jmp         __$EncStackInitStart+96h (07FF6FA5597C8h)  
 mov         rax,qword ptr [rbp+68h]  
 add         rax,4  
 mov         qword ptr [rbp+68h],rax  
 mov         rax,qword ptr [rbp+88h]  
 cmp         qword ptr [rbp+68h],rax  
 je          __$EncStackInitStart+0BFh (07FF6FA5597F1h)  
 mov         rax,qword ptr [rbp+68h]  
 mov         qword ptr [rbp+0A8h],rax  
 call        qword ptr [__imp_rand (07FF6FA5663F0h)]  
 mov         rcx,qword ptr [rbp+0A8h]  
 mov         dword ptr [rcx],eax  
 jmp         __$EncStackInitStart+8Ah (07FF6FA5597BCh)  
 mov         dword ptr [rbp+0C4h],0  
 jmp         __$EncStackInitStart+0D9h (07FF6FA55980Bh)  
 mov         eax,dword ptr [rbp+0C4h]  
 inc         eax  
 mov         dword ptr [rbp+0C4h],eax  
 mov         eax,dword ptr [array_size]  
 dec         eax  
 cmp         dword ptr [rbp+0C4h],eax  
 jae         __$EncStackInitStart+1C3h (07FF6FA5598F5h)  
 mov         eax,dword ptr [rbp+0C4h]  
 inc         eax  
 mov         dword ptr [rbp+0E4h],eax  
 jmp         __$EncStackInitStart+10Bh (07FF6FA55983Dh)  
 mov         eax,dword ptr [rbp+0E4h]  
 inc         eax  
 mov         dword ptr [rbp+0E4h],eax  
 mov         eax,dword ptr [array_size]  
 cmp         dword ptr [rbp+0E4h],eax  
 jae         __$EncStackInitStart+1BEh (07FF6FA5598F0h)  
 mov         eax,dword ptr [rbp+0C4h]  
 mov         edx,eax  
 lea         rcx,[vec]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)  
 mov         qword ptr [rbp+1F8h],rax  
 mov         ecx,dword ptr [rbp+0E4h]  
 mov         edx,ecx  
 lea         rcx,[vec]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)  
 mov         eax,dword ptr [rax]  
 mov         rcx,qword ptr [rbp+1F8h]  
 cmp         dword ptr [rcx],eax  
 jae         __$EncStackInitStart+1B9h (07FF6FA5598EBh)  
 mov         eax,dword ptr [rbp+0C4h]  
 mov         edx,eax  
 lea         rcx,[vec]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)  
 mov         eax,dword ptr [rax]  
 mov         dword ptr [rbp+104h],eax  
 mov         eax,dword ptr [rbp+0E4h]  
 mov         edx,eax  
 lea         rcx,[vec]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)  
 mov         qword ptr [rbp+1F8h],rax  
 mov         ecx,dword ptr [rbp+0C4h]  
 mov         edx,ecx  
 lea         rcx,[vec]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)  
 mov         rcx,qword ptr [rbp+1F8h]  
 mov         ecx,dword ptr [rcx]  
 mov         dword ptr [rax],ecx  
 mov         eax,dword ptr [rbp+0E4h]  
 mov         edx,eax  
 lea         rcx,[vec]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)  
 mov         ecx,dword ptr [rbp+104h]  
 mov         dword ptr [rax],ecx  
 jmp         __$EncStackInitStart+0FDh (07FF6FA55982Fh)  
 jmp         __$EncStackInitStart+0CBh (07FF6FA5597FDh)  
 lea         rcx,[vec]  
 call        std::vector<unsigned int,std::allocator<unsigned int> >::~vector<unsigned int,std::allocator<unsigned int> > (07FF6FA5516F4h)  
 lea         rcx,[rbp-20h]  
 lea         rdx,[string "null pointer cannot point to a @"...+110h (07FF6FA55DEE0h)]  
 call        _RTC_CheckStackVars (07FF6FA5514ABh)  
 mov         rcx,qword ptr [rbp+200h]  
 xor         rcx,rbp  
 call        __security_check_cookie (07FF6FA55129Eh)  
 lea         rsp,[rbp+218h]  
 pop         rdi  
 pop         rbp  
 ret  



The way one approximates is simply by comparing the number of lines between functions. The line count is the exact number of instructions issued to the CPU, disregarding the recurring vector and debugger operations. The vector operations have their own assembly beyond just the “call,” and the debugger operations are only there because we are in debug mode. You can view the vector assembly, if relevant, by simply stepping into the compiler’s implementation with the debugger.

Comparing the two functions, Exchange Sort has the smaller line count at 116 vs 120. However, keep in mind what was stated earlier: the approximation typically only works with small or near-identical functions. Two sorting functions clearly differ in their underlying algorithm even if they’re similar in length. That gives way to the next section: timing.

Timing

Timing, as the old wisdom goes, is the only true way to guarantee your program wins on the speed front. There are only two native ways to measure time spent in code, at least regarding the CPU itself and not peripherals like GPUs: the literal wall-clock time spent, or the CPU clock cycles used. Keep in mind that when timing you should always use the “Release” variant of a program. Debug mode inserts slow safety checks specifically to make bugs easier to analyze, regardless of performance.

The STL has a standard clock easily available under the chrono header. Of its multiple clock options, high_resolution_clock should be the best choice; on MSVC it is simply an alias for steady_clock. You can select timing increments from nanoseconds to years.



#include <vector>
#include <cstdint>
#include <cstdlib> 
#include <iostream> 
#include <chrono>
#include <string>

double bubblesort_time_average;
double exchangesort_time_average;

std::chrono::high_resolution_clock::time_point init_time;
std::chrono::high_resolution_clock::time_point fin_time;

std::vector<uint32_t> vector;

inline void End_Timer() {
  fin_time = std::chrono::high_resolution_clock::now();
}; //End_Timer

inline void Start_Timer() {
  init_time = std::chrono::high_resolution_clock::now();
}; //Start_Timer

void BubbleSort(const uint32_t array_size) {

  //Fill Vector
  for (auto& i : vector) {
    i = std::rand();
  }; //vector

  Start_Timer();

  //Sort Vector
  for (uint32_t i = 0; i < array_size - 1; ++i) {
    for (uint32_t j = 0; j < array_size - i - 1; ++j)
      if (vector[j] > vector[j + 1]) {
        auto temp = vector[j];
        vector[j] = vector[j + 1];
        vector[j + 1] = temp;
      }; //if vector[j] greater than vector[j+1]
  }; //for every idx

  End_Timer();

  bubblesort_time_average += std::chrono::duration<double,
    std::chrono::seconds::period>(fin_time -
      init_time).count();
}; //BubbleSort

void ExchangeSort(uint32_t array_size) {
  //Fill Vector
  for (auto& i : vector) {
    i = std::rand();
  }; //vector

  Start_Timer();

  //Sort Vector
  for (uint_fast32_t i = 0; i < array_size-1; ++i) {
    for (uint_fast32_t j = (i + 1); j < array_size; ++j)
    {
      if (vector[i] < vector[j])
      {
        auto temp = vector[i];
        vector[i] = vector[j];
        vector[j] = temp;
      };//if vector[i] less than vector[j]
    }; //for every next variable
  }; //for every idx

  End_Timer();

  exchangesort_time_average += std::chrono::duration<double,
    std::chrono::seconds::period>(fin_time -
      init_time).count();
}; //ExchangeSort

int main() {
  vector = std::vector<uint32_t>(7000);

  uint_fast8_t i = 0;
  for (; i < UINT8_MAX; ++i) {
    BubbleSort(7000);
    ExchangeSort(7000);
  }; //for loop

  exchangesort_time_average /= i;
  bubblesort_time_average /= i;

  std::cout << "ExchangeSort:"  << std::to_string(exchangesort_time_average)  << "\n";
  std::cout << "BubbleSort:"    << std::to_string(bubblesort_time_average)    << "\n";
}; //main



Here we add two functions: Start_Timer and End_Timer. The names should be quite self-explanatory. Each stores the current time in a variable external to the function. Then, after placing the timer calls around the sorting work, we subtract the two time points and accumulate the difference in another variable. After the sorting functions are looped multiple times, the totals are averaged using the number of loops and printed to the console. My output gave me these numbers: Exchange Sort at 0.048897 seconds and Bubble Sort at 0.034960 seconds.

Clearly Bubble Sort is faster, contrary to the guess made with the assembly approximation. The reason is the same one previously stated: the approximation primarily works with small functions that are extremely similar algorithmically, and these two sorting functions clearly differ in their runtime behavior.

One could easily stop there. However, timing CPU clock cycles is still an important option, especially when digging into the nitty-gritty. In situations where the regular clock is barely enough to see a difference, the CPU clock cycles can provide more clarity. The next code snippet shows how.



#include <vector>
#include <cstdint>
#include <cstdlib> 
#include <iostream> 
#include <chrono>
#include <string>
#include <intrin.h>

uint64_t init_clock_cycles;
uint64_t fin_clock_cycles;
double bubblesort_time_average;
double exchangesort_time_average;
double bubblesort_clock_cycle_average;
double exchangesort_clock_cycle_average;

std::chrono::high_resolution_clock::time_point init_time;
std::chrono::high_resolution_clock::time_point fin_time;

std::vector<uint32_t> vector;

inline void End_Timer() {
  fin_clock_cycles = __rdtsc();
  fin_time = std::chrono::high_resolution_clock::now();
}; //End_Timer

inline void Start_Timer() {
  init_clock_cycles = __rdtsc();
  init_time = std::chrono::high_resolution_clock::now();
}; //Start_Timer

void BubbleSort(const uint32_t array_size) {

  //Fill Vector
  for (auto& i : vector) {
    i = std::rand();
  }; //vector

  Start_Timer();

  //Sort Vector
  for (uint32_t i = 0; i < array_size - 1; ++i) {
    for (uint32_t j = 0; j < array_size - i - 1; ++j)
      if (vector[j] > vector[j + 1]) {
        auto temp = vector[j];
        vector[j] = vector[j + 1];
        vector[j + 1] = temp;
      }; //if vector[j] greater than vector[j+1]
  }; //for every idx

  End_Timer();

  bubblesort_time_average += std::chrono::duration<double,
    std::chrono::seconds::period>(fin_time -
      init_time).count();
  bubblesort_clock_cycle_average += fin_clock_cycles - init_clock_cycles;
}; //BubbleSort

void ExchangeSort(uint32_t array_size) {
  //Fill Vector
  for (auto& i : vector) {
    i = std::rand();
  }; //vector

  Start_Timer();

  //Sort Vector
  for (uint_fast32_t i = 0; i < array_size-1; ++i) {
    for (uint_fast32_t j = (i + 1); j < array_size; ++j)
    {
      if (vector[i] < vector[j])
      {
        auto temp = vector[i];
        vector[i] = vector[j];
        vector[j] = temp;
      };//if vector[i] less than vector[j]
    }; //for every next variable
  }; //for every idx

  End_Timer();

  exchangesort_time_average += std::chrono::duration<double,
    std::chrono::seconds::period>(fin_time -
      init_time).count();
  exchangesort_clock_cycle_average += fin_clock_cycles - init_clock_cycles;
}; //ExchangeSort

int main() {
  vector = std::vector<uint32_t>(7000);

  uint_fast8_t i = 0;
  for (; i < UINT8_MAX; ++i) {
    BubbleSort(7000);
    ExchangeSort(7000);
  }; //for loop

  exchangesort_time_average /= i;
  bubblesort_time_average /= i;

  exchangesort_clock_cycle_average /= i;
  bubblesort_clock_cycle_average /= i;

  std::cout << "ExchangeSort time:"   << std::to_string(exchangesort_time_average)  << "\n";
  std::cout << "BubbleSort time:"     << std::to_string(bubblesort_time_average)    << "\n";

  std::cout << "ExchangeSort cycles:" << std::to_string(exchangesort_clock_cycle_average) << "\n";
  std::cout << "BubbleSort cycles:"   << std::to_string(bubblesort_clock_cycle_average) << "\n";
}; //main



Just like the timing from the previous option, gathering the clock cycles works exactly the same way, except now you use the __rdtsc() function. This is a compiler intrinsic: a function the compiler replaces directly with the corresponding machine instruction, making intrinsics a borderline replacement for inline assembly. It can be found in the intrin.h header.

Removing OS Noise

If you’ve profiled the previous functions multiple times, you may have noticed the results aren’t really consistent. Even though the difference between attempts is rather small, the number gathered for an average is rarely ever the same. To help alleviate this, one has a few options at their disposal.

One option is simply to play with the OS. The OS has a scheduler that manages tasks through a priority queue. It selects based on differing criteria, such as whether the program is the one in focus for the user or whether it’s just in the background calculating and brooding its next move. To keep the scheduler from taking processor time away from your program, you can do two things: close as many other programs as possible, and specifically ask the OS to divert as many resources as possible to your program. The following program shows the latter; one can assume any competent programmer can do the former.



#include <vector>
#include <cstdint>
#include <cstdlib> 
#include <iostream> 
#include <chrono>
#include <string>
#include <intrin.h>
#include <Windows.h>
...
int main() {
  auto process = GetCurrentProcess();
  auto thread = GetCurrentThread();
  SetPriorityClass(process, REALTIME_PRIORITY_CLASS);
  SetThreadPriority(thread, THREAD_PRIORITY_TIME_CRITICAL);

  vector = std::vector<uint32_t>(7000);

  uint_fast8_t i = 0;
  for (; i < UINT8_MAX; ++i) {
    BubbleSort(7000);
    ExchangeSort(7000);
  }; //for loop

  exchangesort_time_average /= i;
  bubblesort_time_average /= i;

  std::cout << "ExchangeSort:"  << std::to_string(exchangesort_time_average)  << "\n";
  std::cout << "BubbleSort:"    << std::to_string(bubblesort_time_average)    << "\n";

  std::cout << "ExchangeSort:" << std::to_string(exchangesort_clock_cycle_average) << "\n";
  std::cout << "BubbleSort:" << std::to_string(bubblesort_clock_cycle_average) << "\n";
}; //main



One may notice that not only the process is being modified; the thread is as well. That’s because, in case it wasn’t known, the operating system manages the true CPU threads for programs. Threads created through facilities such as std::thread and std::future are really just asking the OS scheduler to divert resources as best it can; if a program can’t get a hardware thread to itself, its threads are rapidly time-sliced alongside other processes. Even though no such facilities are used here, we still need to tell the OS that the main thread of the program should be prioritized. The settings REALTIME_PRIORITY_CLASS and THREAD_PRIORITY_TIME_CRITICAL are the maximum priority settings. You can view all the settings in the Win32 API documentation.

Profiling with other programs

No one should have to profile this way by hand. In fact, entire suites of programs called profilers exist that provide a visual timeline with minimal code intervention. One of the best open-source ones currently out there is Tracy.

To install Tracy, go to its GitHub releases page and download both the source code and the Windows binary zip. For this specific task, the source cannot be taken directly from the repo; it can only be taken from the releases.

Run the typical CMake build command directly in the source directory, then open the SLN in the build folder. After the library has been compiled [use Release mode for the best profiling], link it to the program. The necessary headers can be found in the “public” folder at the top level of the source tree. If you did not compile in Release mode, the debug information format will have to be changed in the project properties. This is displayed below.

[Image: changing the debug information format in the project properties]

Tracy works using macros called zone scopes; most other profilers work in a similar vein. A zone scope marks a certain “scope” in a C++ program to be analyzed. Inside the BubbleSort or ExchangeSort function, the ZoneScoped macro can be added as shown in the code below, along with the defines and includes needed.



#define TRACY_ENABLE //best defined project-wide so every translation unit agrees
#include "include/tracy/Tracy.hpp"

...

void BubbleSort(const uint32_t array_size) {
  //Fill Vector
  for (auto& i : vector) {
    i = std::rand();
  }; //vector

  Start_Timer();
  {
    ZoneScopedNCS("Bubble Sort", tracy::Color::Blue, 10); //10 = call stack depth to capture

    //Sort Vector
    for (uint32_t i = 0; i < array_size - 1; ++i) {
      for (uint32_t j = 0; j < array_size - i - 1; ++j)
        if (vector[j] > vector[j + 1]) {
          auto temp = vector[j];
          vector[j] = vector[j + 1];
          vector[j + 1] = temp;
        }; //if vector[j] greater than vector[j+1]
    }; //for every idx
  }

  End_Timer();

  bubblesort_time_average += std::chrono::duration<double,
    std::chrono::seconds::period>(fin_time -
      init_time).count();
  bubblesort_clock_cycle_average += fin_clock_cycles - init_clock_cycles;
}; //BubbleSort



Although anyone can simply use ZoneScoped with no arguments, ZoneScopedNC allows the addition of a name and color to make the visualization easier to use. Without the N, ZoneScoped will just use the function name, and without the C it’ll produce an identical default color for every function on the timeline. It’s suggested to use a different color for ExchangeSort. The colors use the X11 scheme, which is well documented on Wikipedia; using a hex code is also an option.

To actually see the visualization, unzip the Windows binary zip and start the tracy-profiler executable inside. Press the connect button, then start your program. After a few moments, if the previous instructions were followed correctly, something like this should appear:

[Image: the Tracy profiler timeline after connecting]

It doesn’t seem so useful at first, but one can zoom in on the main thread with the scroll wheel to see the timing difference.

[Image: the timeline zoomed in on the main thread, showing the sort zones]

Even more useful, one can click on a function in the timeline, then click on statistics to bring up a histogram perfect for watching timing in progress.

[Image: the statistics histogram for a selected zone]

It’s beautiful, but Tracy is a much, much deeper program. One can add messages with TracyMessage() as well as plot values [such as the CPU clock cycles or seconds spent] with TracyPlot. An S can be added to the ZoneScoped macro to see the call stack directly in Tracy, and Tracy can even be run over a network to view the profile from another computer. It truly is versatile.

That completes the section on timing. Timing is extremely simple. The juicier bits come when one starts analyzing memory to figure out how to micromanage allocation, and there are always special compiler or CPU-specific functions to really accelerate a program. Tracy is perfectly built for that as well, but that comes in another article.
