Marcos Oliveira

Posted on Mar 12

8 flags to drastically improve the speed of your software

#cpp #c #compiling #performance

We've already written an article about flags that work with both GCC and Clang; however, those tips covered compilation in general.

In this article, we focus on compile-time flags that directly influence the performance of the resulting binary, making it faster at runtime!

01. The basics

The -fsanitize=address flag (AddressSanitizer, originally developed at Google, with its runtime library libasan) and the other -fsanitize=... options, supported by both GCC and Clang, are used to detect memory leaks, out-of-bounds accesses, use-after-free and other memory errors.

However, they should only be used during development. When you build the release version for production, configure your Makefile, CMake or other build tool without them. Many projects also drop the other debugging-oriented flags from release builds, such as -g, -Wall, -Werror, -pedantic and -Wpedantic, although these only add debug information or warnings and do not slow down the generated code.

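The sanitizers, especially -fsanitize=address, are what make the binary run much slower (AddressSanitizer checks memory accesses at runtime, with a typical slowdown of about 2x), and so does building without optimization, since GCC and Clang default to -O0. In their place, enable an optimization level, for example -O1, -O2 or -O3: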

-O1 (Basic optimization) - Enables optimizations that improve performance without significantly increasing compilation time. Examples: dead code elimination, constant propagation, limited inlining.
-O2 (Moderate optimization) - Includes all optimizations from -O1 and adds nearly all the remaining ones that do not trade code size for speed. Examples: global common subexpression elimination, inlining of small functions, better instruction scheduling.
-O3 (Aggressive optimization) - Includes all optimizations from -O2 and adds new, more aggressive ones, such as increased inlining and loop vectorization. May increase code size and, in some cases, reduce performance due to over-optimization.

There is also -Ofast, the most aggressive level. It enables everything in -O3 plus -ffast-math, which can make floating-point code even faster. This may be worthwhile, but test first: -ffast-math lets the compiler reorder and simplify float and double arithmetic as if it were exact, so precision-sensitive calculations can produce unexpected results (for example, fewer correct significant digits), and the generated code no longer strictly follows the C and C++ standards or IEEE 754.

Even so, it is a common choice for release builds, for example:

g++ -Ofast main.cpp

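If you want a milder option than -ffast-math, -ffp-contract=fast (combined with -O2 or -O3) only allows the compiler to contract floating-point expressions, for example fusing a multiplication and an addition into a single FMA (Fused Multiply-Add) instruction.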

In short:

Use `-Ofast` if exact numerical precision is not critical and you want to extract maximum performance.

02. Architecture-specific tuning

The -march=native flag allows the compiler to generate code optimized for your CPU architecture:

g++ -Ofast -march=native main.cpp

Using it in combination with -Ofast can be a great idea for performance.

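This lets the compiler use the instruction-set extensions your processor supports, such as SSE and AVX. The trade-off is portability: the binary may crash with an "illegal instruction" error on a CPU that lacks them.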

If you need to distribute the binary to other machines, choose a specific value instead of native, such as -march=haswell, -march=znver3, etc.

In short:

Use `-march=native` to let the compiler generate code optimized for your CPU, as long as the binary only needs to run on that machine (or on CPUs with the same features).

03. Parallelism with OpenMP

If the code is parallelizable, add support for OpenMP to take advantage of multiple CPU cores:

g++ -Ofast -march=native -fopenmp main.cpp

Also in combination with the flags mentioned above.

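With -fopenmp, the #pragma omp directives in your source are enabled, so the loops and regions you mark with them can run in parallel; code without these directives is not parallelized automatically.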

OpenMP (Open Multi-Processing) is an application programming interface (API) for shared-memory multiprocessing on multiple platforms. It allows adding concurrency to programs written in C, C++ and Fortran based on the fork-join execution model.

04. Improve CPU cache usage

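The -funroll-loops and -fprefetch-loop-arrays flags target loops: the first unrolls loops whose number of iterations can be determined at compile time or on entry to the loop, reducing loop overhead, and the second, where the target supports it, inserts prefetch instructions so that large arrays are already in the cache when they are needed. Both can make the code larger and may or may not make it faster, so measure: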

g++ -Ofast -march=native -funroll-loops -fprefetch-loop-arrays main.cpp

Also in combination with the flags mentioned above.

If we had used them in our video ranking programming languages, C and C++ would have left the others even further behind! 😃

Also worth remembering is ccache, which we covered in the article "Use Ccache and compile much faster". Keep in mind, however, that it only reduces compilation time and does not change the performance of the binary.

05. Optimized linking

The -flto (Link-Time Optimization) flag lets the optimizer see the whole program at link time, so it can inline and optimize across source files:

g++ -Ofast -march=native -flto main.cpp

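It is good to use it in conjunction with the first two flags mentioned. In a project with several source files, pass -flto (together with the optimization flags) both when compiling each file and when linking; with a single main.cpp, as above, there is usually little extra to gain.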

06. Avoid exceptions and RTTI if they are not necessary

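If the code does not use exceptions or RTTI (Run-Time Type Information), disable them with -fno-exceptions and -fno-rtti. The compiler then rejects throw, try and typeid, as well as dynamic_cast on polymorphic types. The gain is mostly smaller binaries and fewer constraints on the optimizer; with the table-based exception handling used by GCC and Clang, code that never throws pays little at runtime, so measure: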

g++ -Ofast -march=native -fno-exceptions -fno-rtti main.cpp

RTTI (Run-time Type Information) is a technique that stores information about the data type of an object during the execution of a program. RTTI is available in some programming languages, such as Delphi and C++.

07. Use execution profiles with Profile-Guided Optimization (PGO)

If you can run the program before final compilation (AND DO IT!!!), use PGO with the -fprofile-generate flag to optimize based on real execution data:

Don't confuse it with the so-called borrow checker!

Compile with instrumentation:

g++ -Ofast -march=native -fprofile-generate main.cpp

It is good to use it together with the first two flags mentioned.

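Run the program normally, ideally with representative input, to generate the profile data (GCC writes it to .gcda files; with Clang, the raw .profraw files must first be merged with llvm-profdata). Then recompile with the same flags, replacing -fprofile-generate with -fprofile-use: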

g++ -Ofast -march=native -fprofile-use main.cpp

Profile-guided optimization (PGO), also known as profile-directed feedback (PDF) or feedback-directed optimization (FDO), is a compiler optimization technique that uses prior analysis of software artifacts or behaviors ("profiling") to improve the expected runtime performance of the program.

08. Improve the "tuning"

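The -mtune flag tunes code generation (such as instruction scheduling) for a given CPU without using instructions that other processors lack, so the binary stays compatible. Note that -march=native already implies -mtune=native, so to get CPU-specific tuning while keeping the binary portable, use -mtune on its own:

g++ -Ofast -mtune=native main.cpp

It can still be combined with -Ofast; just leave out -march=native if you want to keep compatibility.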

If you need to run on multiple architectures, use something more generic, like -mtune=generic.

If you want to use the flags we mentioned together (PGO aside), feel free:

g++ -Ofast -march=native -flto -funroll-loops -fprefetch-loop-arrays \
-fno-exceptions -fno-rtti -fopenmp main.cpp

