<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eugene Ostroukhov</title>
    <description>The latest articles on DEV Community by Eugene Ostroukhov (@eugeneo_17).</description>
    <link>https://dev.to/eugeneo_17</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1436933%2F41607852-6f9d-4a07-8f60-ca5b6c4d1abb.jpeg</url>
      <title>DEV Community: Eugene Ostroukhov</title>
      <link>https://dev.to/eugeneo_17</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eugeneo_17"/>
    <language>en</language>
    <item>
      <title>Open Source C++ Stack</title>
      <dc:creator>Eugene Ostroukhov</dc:creator>
      <pubDate>Tue, 16 Jul 2024 21:00:00 +0000</pubDate>
      <link>https://dev.to/eugeneo_17/open-source-c-stack-2j67</link>
      <guid>https://dev.to/eugeneo_17/open-source-c-stack-2j67</guid>
      <description>&lt;p&gt;C++ is often labeled as "unsafe" and "complex," but I find these critiques unjustified. My experience working on major projects like Chromium, Node, and gRPC — each a non-trivial codebase deployed on millions of devices, both virtual and physical, and subject to rigorous scrutiny—has shown me the true power and reliability of C++. Let's not forget the remarkable engineering feats made possible by C++, such as Unreal Engine. Even Linux and Git, both written in C (arguably even less "safe" than C++), stand as testaments to the robust potential of these languages.&lt;/p&gt;

&lt;p&gt;There is a trick to writing C++ code in a scalable way. I would call it "write C++ the Google way". Google has been using C++ for decades and has accumulated a wealth of expertise in writing C++ code that is safe, performant, and maintainable. And that expertise is readily available on GitHub.&lt;/p&gt;

&lt;p&gt;In this article, I want to introduce you to some open-source projects that make writing C++ code enjoyable. These projects are designed to work seamlessly together, yet you can pick and choose the ones that best fit your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google C++ Style Guide: Lingua Franca
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://google.github.io/styleguide/cppguide.html" rel="noopener noreferrer"&gt;Explore the Guide on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Google C++ Style Guide explains how to make C++ code beautiful. This comprehensive set of conventions ensures consistency across projects, making it easier for developers to dive into new codebases with confidence.&lt;/p&gt;

&lt;p&gt;The guide goes beyond mere stylistic choices like indentation width (&lt;a href="https://google.github.io/styleguide/cppguide.html#Spaces_vs._Tabs" rel="noopener noreferrer"&gt;2 spaces&lt;/a&gt;) or file naming conventions (&lt;a href="https://google.github.io/styleguide/cppguide.html#File_Names" rel="noopener noreferrer"&gt;all lowercase, with underscores (_) or dashes (-)&lt;/a&gt;). It delves deep into language features and provides the rationale behind each decision, offering a clear path to writing high-quality, maintainable C++ code.&lt;/p&gt;

&lt;p&gt;Perhaps the most controversial convention in the guide is the "ban" on exceptions. "&lt;a href="https://google.github.io/styleguide/cppguide.html#Exceptions" rel="noopener noreferrer"&gt;We do not use C++ exceptions&lt;/a&gt;" is a rule that can be hard to accept for developers coming from other languages. Yet, exceptions are not as essential as they might seem. Languages like Go thrive without them, and C++ projects like Chromium and gRPC demonstrate that robust and efficient code can be written without exceptions.&lt;/p&gt;

&lt;p&gt;I frequently recommend this guide to developers both inside and outside Google as the simplest way to elevate the quality of their C++ code. By adhering to these well-established conventions, anyone can write C++ "the Google way" and enjoy the benefits of safer, more maintainable, and performant code.&lt;/p&gt;
&lt;h2&gt;
  
  
  Bazel: If You Build It, They Will Come
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://bazel.build/" rel="noopener noreferrer"&gt;Discover Bazel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bazel is a build system inspired by Blaze, Google's internal build system. Thousands of engineers at Google use Blaze daily to build countless projects, written not only in C++ but also in Java, Python, Go, and other languages.&lt;/p&gt;

&lt;p&gt;Bazel highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Starlark Language&lt;/strong&gt;: Bazel build files are written in Starlark, a language that is both extensible and easy to read. This simplicity makes it approachable for new users and powerful enough for complex build configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Caching&lt;/strong&gt;: One of Bazel's standout features is its efficient caching mechanism, which relies on checksums rather than timestamps. This results in significantly faster and more reliable builds, a feature I rely on daily for large C++ projects like gRPC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Tracking&lt;/strong&gt;: Bazel excels in dependency tracking, minimizing the size of build artifacts and speeding up the build process. This feature ensures that only the necessary components are rebuilt, saving valuable time and resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Powerful Query Language&lt;/strong&gt;: Bazel includes a robust query language that allows developers to analyze the build graph, providing deep insights into the build process and helping to optimize it further.&lt;/li&gt;
&lt;/ul&gt;
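&lt;p&gt;For a flavor of Starlark, a hypothetical &lt;code&gt;BUILD&lt;/code&gt; file for a small C++ library and its test might look like this (target and file names are made up for illustration):&lt;/p&gt;

```python
cc_library(
    name = "tensor",
    srcs = ["tensor.cc"],
    hdrs = ["tensor.h"],
    deps = ["@abseil-cpp//absl/types:span"],
)

cc_test(
    name = "tensor_test",
    srcs = ["tensor_test.cc"],
    deps = [
        ":tensor",
        "@googletest//:gtest_main",
    ],
)
```

Declarative targets like these are what makes Bazel's dependency tracking and caching possible: the build graph is fully known before anything is compiled.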

&lt;p&gt;Bazel makes it trivial to build Protobuf libraries, supports multiple languages, and most of the libraries I mention below are easy to add to a Bazel project. Just look at the Uchen.ML module file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
Uchen core - ML framework
&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;
&lt;span class="nf"&gt;module&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uchen-core&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compatibility_level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;bazel_dep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abseil-cpp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;20240116.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;bazel_dep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;googletest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.14.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;git_override&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;module_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;googletest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;remote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/google/googletest.git&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;commit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1d17ea141d2c11b8917d2c7d029f1c4e2b9769b2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;bazel_dep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_benchmark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.8.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;git_override&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;module_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_benchmark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;remote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/google/benchmark.git&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;commit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;447752540c71f34d5d71046e08192db181e9b02b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Dev dependencies
&lt;/span&gt;&lt;span class="nf"&gt;bazel_dep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hedron_compile_commands&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dev_dependency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;git_override&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;module_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hedron_compile_commands&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;remote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/hedronvision/bazel-compile-commands-extractor.git&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;commit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a14ad3a64e7bf398ab48105aaa0348e032ac87f8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Abseil: Utilities
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://abseil.io/" rel="noopener noreferrer"&gt;abseil.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Abseil provides a wide array of utilities across various categories. Some of these utilities, such as &lt;code&gt;absl::string_view&lt;/code&gt; and &lt;code&gt;absl::optional&lt;/code&gt;, have already been adopted into the standard C++ library, with Abseil seamlessly using the standard versions when available. Others, like my favorite &lt;code&gt;absl::Cleanup&lt;/code&gt;, seem to be on track to be standardized later (see &lt;code&gt;std::scope_exit&lt;/code&gt;). Many utilities in Abseil remain beyond the current scope of the standard library, offering unique functionality that enhances C++ development.&lt;/p&gt;

&lt;p&gt;I heavily rely on the following parts of Abseil for my projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://abseil.io/docs/cpp/guides/flags" rel="noopener noreferrer"&gt;Command Line Flags&lt;/a&gt;&lt;/strong&gt;: Simplifies the management of command line arguments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://abseil.io/docs/cpp/guides/logging" rel="noopener noreferrer"&gt;Logging&lt;/a&gt;&lt;/strong&gt;: Provides robust logging capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://abseil.io/docs/cpp/guides/strings" rel="noopener noreferrer"&gt;String Utilities&lt;/a&gt;&lt;/strong&gt;: Includes utilities for string joining, formatting, and more.&lt;/li&gt;
&lt;/ul&gt;
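&lt;p&gt;A hedged sketch of how these three pieces typically come together (the flag name and message are invented; this needs the Abseil flags, log, and strings libraries to build):&lt;/p&gt;

```cpp
#include <string>
#include <vector>

#include "absl/flags/flag.h"
#include "absl/flags/parse.h"
#include "absl/log/initialize.h"
#include "absl/log/log.h"
#include "absl/strings/str_join.h"

// Command line flag with a default value and a help string.
ABSL_FLAG(std::string, user, "world", "Name to greet");

int main(int argc, char** argv) {
  absl::ParseCommandLine(argc, argv);
  absl::InitializeLog();

  std::vector<std::string> parts = {"Hello", absl::GetFlag(FLAGS_user)};
  LOG(INFO) << absl::StrJoin(parts, ", ");  // "Hello, world" by default
  return 0;
}
```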

&lt;p&gt;Many classes in Uchen.ML are augmented with &lt;code&gt;AbslStringify&lt;/code&gt;, which allows for very easy tracing and debugging.&lt;/p&gt;

&lt;p&gt;I would also like to point out the &lt;a href="https://abseil.io/tips/" rel="noopener noreferrer"&gt;C++ Tips&lt;/a&gt; section that I would consider an essential reading for any C++ developer, on par with the Google C++ Style Guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Test: Industry Standard
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://google.github.io/googletest/" rel="noopener noreferrer"&gt;Google Test&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My impression is that Google Test is the most popular C++ testing framework and I often see it used in projects outside Google. Not really much to add.&lt;/p&gt;

&lt;p&gt;I would definitely recommend also taking a look at GMock. GMock provides some &lt;a href="https://google.github.io/googletest/reference/matchers.html" rel="noopener noreferrer"&gt;matchers&lt;/a&gt; that can also be used for regular assertions. E.g., this is how the contents of a collection can be checked while ignoring order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;EXPECT_THAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;UnorderedElementsAre&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benchmarking: Google Benchmark
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/google/benchmark" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Writing benchmarks can be a fun and enlightening process, though it's easy to get caught up in the quest for better numbers. Despite this, benchmarks are crucial for performance-conscious development. Modern CPUs and compilers are highly complex, making performance reasoning anything but straightforward.&lt;/p&gt;
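&lt;p&gt;For illustration, a hypothetical micro-benchmark looks like this (the measured function and argument are made up; this needs the benchmark library to link):&lt;/p&gt;

```cpp
#include <vector>

#include <benchmark/benchmark.h>

// Measures the cost of growing a vector without a reserve() call.
static void BM_VectorPushBack(benchmark::State& state) {
  for (auto _ : state) {  // the framework decides how many iterations to run
    std::vector<int> v;
    for (int i = 0; i < state.range(0); ++i) v.push_back(i);
    benchmark::DoNotOptimize(v.data());  // keep the work from being elided
  }
}
BENCHMARK(BM_VectorPushBack)->Arg(1 << 10);

BENCHMARK_MAIN();
```

The framework handles the hard parts — warm-up, choosing an iteration count, and statistical reporting — which is exactly where hand-rolled timing loops tend to mislead.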

&lt;h2&gt;
  
  
  Linters
&lt;/h2&gt;

&lt;p&gt;Linters keep the codebase consistent and help catch many issues, and even bugs, before the compiler is even run. &lt;a href="https://clang.llvm.org/extra/clang-tidy/" rel="noopener noreferrer"&gt;Clang-Tidy&lt;/a&gt; is the one I rely on. &lt;a href="https://include-what-you-use.org/" rel="noopener noreferrer"&gt;IWYU&lt;/a&gt; is really helpful in keeping includes clean, reducing the number of dependencies, and reducing build times.&lt;/p&gt;
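&lt;p&gt;As an illustration, a minimal &lt;code&gt;.clang-tidy&lt;/code&gt; configuration could look like this (the check selection here is just an example, not a recommendation):&lt;/p&gt;

```yaml
Checks: 'bugprone-*,clang-analyzer-*,modernize-*,performance-*,readability-*'
WarningsAsErrors: 'bugprone-*'
HeaderFilterRegex: '.*'
```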

&lt;h2&gt;
  
  
  Sanitizers
&lt;/h2&gt;

&lt;p&gt;Sanitizers detect and troubleshoot issues that are difficult to debug, such as memory leaks, concurrency issues, and undefined behavior. &lt;a href="https://clang.llvm.org/docs/AddressSanitizer.html" rel="noopener noreferrer"&gt;ASAN&lt;/a&gt; is the one I use the most. The following &lt;code&gt;.bazelrc&lt;/code&gt; contents add a special Bazel config so ASAN can be run at any time on any target by adding &lt;code&gt;--config=asan&lt;/code&gt; to the build command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;build:asan &lt;span class="nt"&gt;--strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;never
build:asan &lt;span class="nt"&gt;--copt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-fsanitize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;address
build:asan &lt;span class="nt"&gt;--copt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-O0&lt;/span&gt;
build:asan &lt;span class="nt"&gt;--copt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-fno-omit-frame-pointer&lt;/span&gt;
build:asan &lt;span class="nt"&gt;--copt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-DGPR_NO_DIRECT_SYSCALLS&lt;/span&gt;
build:asan &lt;span class="nt"&gt;--copt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-DADDRESS_SANITIZER&lt;/span&gt;  &lt;span class="c"&gt;# used by absl&lt;/span&gt;
build:asan &lt;span class="nt"&gt;--linkopt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-fsanitize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;address
build:asan &lt;span class="nt"&gt;--action_env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;ASAN_OPTIONS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;detect_leaks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1:color&lt;span class="o"&gt;=&lt;/span&gt;always
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other sanitizers worth mentioning are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://clang.llvm.org/docs/ThreadSanitizer.html" rel="noopener noreferrer"&gt;TSAN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://clang.llvm.org/docs/MemorySanitizer.html" rel="noopener noreferrer"&gt;MSAN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html" rel="noopener noreferrer"&gt;UBSAN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://llvm.org/docs/LibFuzzer.html" rel="noopener noreferrer"&gt;Fuzzer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Serialization: Protobuf
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://protobuf.dev/" rel="noopener noreferrer"&gt;protobuf.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a language-agnostic serialization library that is much faster than JSON and provides a far more compact representation. There are several implementations with different features and trade-offs. &lt;a href="https://github.com/protocolbuffers/protobuf/tree/main/upb" rel="noopener noreferrer"&gt;μpb&lt;/a&gt; is a very lightweight implementation that uses arena allocations.&lt;/p&gt;

&lt;p&gt;Protobufs have a large number of features that are very useful in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reflection API&lt;/li&gt;
&lt;li&gt;JSON serialization&lt;/li&gt;
&lt;li&gt;Text serialization (makes it very tempting to use protobufs as a configuration format)&lt;/li&gt;
&lt;/ul&gt;
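&lt;p&gt;A toy &lt;code&gt;.proto&lt;/code&gt; schema (message and field names are purely illustrative) shows how compact the definitions are:&lt;/p&gt;

```proto
syntax = "proto3";

package demo;

message TrainingExample {
  string id = 1;               // field numbers, not names, go on the wire
  repeated float features = 2;
  float label = 3;
}
```

From a definition like this, &lt;code&gt;protoc&lt;/code&gt; generates serialization code, the reflection API, and JSON/text conversions for every supported language.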

&lt;h2&gt;
  
  
  RPC: gRPC
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://grpc.io/" rel="noopener noreferrer"&gt;grpc.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Disclaimer: I work on gRPC full-time, so I am biased.&lt;/p&gt;

&lt;p&gt;gRPC is another open-source effort that was informed and inspired by Google's internal architecture. It is a proven solution (most Google Cloud APIs are implemented in gRPC) and is used by projects like TensorFlow, Firebase, and many others (e.g., Bazel uses it for distributed build support).&lt;/p&gt;

&lt;p&gt;This is what gRPC offers that is not readily available in REST:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strongly typed API with codegen support for most popular languages&lt;/li&gt;
&lt;li&gt;Streaming support, including bi-directional streaming&lt;/li&gt;
&lt;li&gt;Client and server side load balancing&lt;/li&gt;
&lt;li&gt;Authentication and authorization support&lt;/li&gt;
&lt;li&gt;A lot of configuration options for things like timeouts, retries, etc.&lt;/li&gt;
&lt;li&gt;Built-in support for tracing and monitoring&lt;/li&gt;
&lt;/ul&gt;
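&lt;p&gt;A gRPC API is declared as a &lt;code&gt;service&lt;/code&gt; in a &lt;code&gt;.proto&lt;/code&gt; file. The following hypothetical service (names are invented) shows both a unary call and the streaming support mentioned above:&lt;/p&gt;

```proto
syntax = "proto3";

package demo;

message PredictRequest {
  repeated float features = 1;
}

message PredictResponse {
  float score = 1;
}

service Predictor {
  // A unary request/response call.
  rpc Predict(PredictRequest) returns (PredictResponse);
  // Bi-directional streaming: client and server exchange messages freely.
  rpc PredictStream(stream PredictRequest) returns (stream PredictResponse);
}
```

Running this through &lt;code&gt;protoc&lt;/code&gt; with the gRPC plugin produces strongly typed client stubs and server skeletons for each target language.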

&lt;h2&gt;
  
  
  Portable SIMD: Highway
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/google/highway" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One cannot utilize the full power of modern CPUs without using SIMD instructions. The Highway library provides a portable way to use SIMD instructions across different platforms, making leveraging SIMD much more practical.&lt;/p&gt;
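&lt;p&gt;As a rough sketch of what Highway code looks like, here is a hypothetical element-wise multiply built from Highway's documented ops (&lt;code&gt;Load&lt;/code&gt;, &lt;code&gt;Mul&lt;/code&gt;, &lt;code&gt;Store&lt;/code&gt;, &lt;code&gt;Lanes&lt;/code&gt;); real Highway code usually also uses its per-target dispatch machinery, which is omitted here:&lt;/p&gt;

```cpp
#include <cstddef>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Element-wise product using whatever SIMD width the target CPU provides.
void MulArrays(const float* a, const float* b, float* out, size_t n) {
  const hn::ScalableTag<float> d;  // widest available float vector
  const size_t lanes = hn::Lanes(d);
  size_t i = 0;
  for (; i + lanes <= n; i += lanes) {
    const auto va = hn::Load(d, a + i);
    const auto vb = hn::Load(d, b + i);
    hn::Store(hn::Mul(va, vb), d, out + i);
  }
  for (; i < n; ++i) out[i] = a[i] * b[i];  // scalar tail
}
```

The same source compiles to SSE4, AVX2, AVX-512, or NEON depending on the target, which is the portability the section is about.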

</description>
      <category>cpp</category>
      <category>google</category>
      <category>softwaredevelopment</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Case Study - TDD in Node.js Inspector Server and Other Projects</title>
      <dc:creator>Eugene Ostroukhov</dc:creator>
      <pubDate>Tue, 18 Jun 2024 21:00:00 +0000</pubDate>
      <link>https://dev.to/eugeneo_17/case-study-tdd-in-nodejs-inspector-server-and-other-projects-30e6</link>
      <guid>https://dev.to/eugeneo_17/case-study-tdd-in-nodejs-inspector-server-and-other-projects-30e6</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Test Driven Development (TDD) is a software development methodology where tests are written before the actual code. The progress of implementation is then guided by the status of these tests.&lt;/p&gt;

&lt;p&gt;There is often confusion between the terms "automated testing," "unit testing," and "TDD." To clarify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Testing&lt;/strong&gt; refers to any testing performed by specialized software without requiring manual intervention. This includes various types of testing, depending on the scope (unit/integration) or the metrics being evaluated (correctness, security, load, benchmarking).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unit Testing&lt;/strong&gt; is a subset of automated testing that focuses on the smallest, independent logical units of code. These tests can be created at any stage of development, whether before or after the code is written.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test Driven Development (TDD)&lt;/strong&gt; is a practice where tests are designed and implemented before writing the actual code. While these tests are typically automated, they can also be manual in some cases. TDD can be applied at any level of granularity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Node.js Inspector Server
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem Statement
&lt;/h3&gt;

&lt;p&gt;The goal was to transition Node.js to utilize a new V8 debugging API and expose a WebSocket endpoint compatible with the Chrome DevTools protocol. This required ensuring a smooth ecosystem transition and providing tools vendors with a clear migration path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The implementation needed to reside in the core Node.js binary, adhering to strict performance and security requirements.&lt;/li&gt;
&lt;li&gt;The low-level C++ code had to run on all platforms supported by Node.js.&lt;/li&gt;
&lt;li&gt;Rebuilding the Node.js binary is a time-consuming process that can significantly impact developer productivity.&lt;/li&gt;
&lt;li&gt;I was initially unfamiliar with &lt;code&gt;libuv&lt;/code&gt; and the internals of Node.js.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Approach
&lt;/h3&gt;

&lt;p&gt;The initial focus was on creating a WebSocket server in C++ to run outside the V8 engine on a separate thread. This design ensured that the server would continue running even when V8 was paused at a breakpoint, and it also minimized the impact on profiling data of the user code.&lt;/p&gt;

&lt;p&gt;To avoid a full rebuild of the Node.js binary during development, the server implementation was initially contained within the test code. As the codebase evolved, it was split into multiple source files and gradually integrated into the core Node.js code.&lt;/p&gt;

&lt;p&gt;The current C++ test suite includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/nodejs/node/blob/main/test/cctest/test_inspector_socket_server.cc"&gt;test_inspector_socket_server.cc&lt;/a&gt;: Tests the server, including socket listening, HTTP protocol support, and potential error states.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nodejs/node/blob/main/test/cctest/test_inspector_socket.cc"&gt;test_inspector_socket.cc&lt;/a&gt;: WebSocket protocol tests with a focus on edge cases and error handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One interesting consequence of using &lt;code&gt;libuv&lt;/code&gt; was that the tests could be single-threaded, which greatly simplified the implementation of the test suite. This was a fun coding challenge and crucial for catching hard-to-reproduce bugs and regressions, especially those caused by differences in &lt;code&gt;libuv&lt;/code&gt; behavior across platforms.&lt;/p&gt;

&lt;p&gt;Once the server was stable and inspector integration began, tests were written in JavaScript using the WebSocket protocol. These tests were not strictly "unit tests," as V8 inspector already had significant testing coverage in the core V8, and duplicating it would have increased maintenance without adding much value.&lt;/p&gt;

&lt;p&gt;Later, a JavaScript API was introduced by community demand, making it even easier to write tests in JavaScript, particularly to cover Node-specific protocol extensions such as &lt;a href="https://github.com/nodejs/node/blob/main/test/parallel/test-inspector-tracing-domain.js"&gt;tracing&lt;/a&gt; or &lt;a href="https://github.com/nodejs/node/blob/main/test/parallel/test-worker-debug.js"&gt;workers&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Highlights
&lt;/h3&gt;

&lt;p&gt;The transition to the new protocol was completed ahead of schedule, allowing the legacy protocol to be deprecated and removed altogether. The integration underwent several deep reworks without disrupting the ecosystem, including the addition of support for worker threads. In all cases, new test cases were added to ensure stability.&lt;/p&gt;

&lt;p&gt;A significant flakiness in Inspector tests prompted a deep refactor (&lt;a href="https://github.com/nodejs/node/pull/21182"&gt;PR&lt;/a&gt;), improving the performance and stability of the entire DevTools protocol.&lt;/p&gt;

&lt;p&gt;At least one &lt;a href="https://github.com/nodejs/node/pull/25455"&gt;test case&lt;/a&gt; was added to justify keeping code in the native C++ part after contributors proposed moving it to JavaScript.&lt;/p&gt;

&lt;p&gt;The community identified several potential security vulnerabilities, leading to the addition of tests to prevent regressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partner API Endpoint
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem Statement
&lt;/h3&gt;

&lt;p&gt;The task was to implement a REST API endpoint according to the specifications provided by a partner company. Their software would query this endpoint to obtain information from our systems, streamlining the customer experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The specification required a large amount of the data points, raising concerns about whether we had all the required information or if it was in the expected format.&lt;/li&gt;
&lt;li&gt;There were uncertainties about whether the requested access complied with our security and privacy policies.&lt;/li&gt;
&lt;li&gt;The necessary information had to be sourced from multiple internal systems, and it was unclear how readily available this data was.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Approach
&lt;/h3&gt;

&lt;p&gt;The service code implementing the API was divided into multiple layers and engineered into several components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Response Packaging&lt;/strong&gt;: A component to format the response according to the partner’s specifications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Aggregation and Sanitization&lt;/strong&gt;: An internal component to aggregate data and ensure it was sanitized (e.g., converting internal codes to the partner’s specifications, normalizing addresses).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Source Connectors&lt;/strong&gt;: Independent components to connect to each internal data source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request Processing and Validation&lt;/strong&gt;: A separate component to handle request validation and processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first test involved directly calling the endpoint implementation with a mock request and checking the response. The initial implementation returned a hardcoded response, which was then gradually enhanced with more logic. E.g. a code that returns a hardcoded customer address would be replaced with a component that retrieved the address from the customer service. Unit tests were created for each component, focusing on mocking dependencies to verify logic, validation, and error propagation. For example, unit tests for the customer service connector mocked the network layer to directly check requests sent to the customer service, and mock responses were used to validate the connector’s logic, both in a happy path and in error scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Highlights
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The project codebase was split into clear maintainable components, enabling parallel development, including discussions with the teams responsible for each data source.&lt;/li&gt;
&lt;li&gt;Significant discussions with stakeholders (e.g., service developers, data owners, security, and privacy teams) were necessary, and we were able to start these discussions sooner, which reduced the risk of delays.&lt;/li&gt;
&lt;li&gt;Testing provided plenty of examples that were really useful in communication. For example, when discussing the data format with the partner, we could provide examples of the data we were sending, which helped clarify the requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project was delivered on time and promptly accepted by the partner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Uchen.ML
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem Statement
&lt;/h3&gt;

&lt;p&gt;This project began as an attempt to build deep learning models that could be easily deployed in specific scenarios. It was developed alongside learning the theory of deep neural network training. Both the external API and internal implementation were in constant flux, with significant rewrites anticipated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach
&lt;/h3&gt;

&lt;p&gt;Each component started as a test case. For example, each gradient calculator began in the test class, with all numbers verified against values returned by the PyTorch implementation. As the framework matured, the underlying math of the stacked components grew increasingly complex, making the tests essential for detecting subtle issues. Extensive rework often required benchmarks to justify code changes. Writing test cases helped refine the framework's API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Highlights
&lt;/h3&gt;

&lt;p&gt;The project continues to evolve, despite extended breaks in development. Test cases have been invaluable for catching new issues early, including identifying when new APIs are too cumbersome for unconsidered use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Clean up aggressively and avoid duplicate test cases. Do not test trivial code (such as getters and setters). Tests carry maintenance costs and can be a significant drain on engineer productivity and even team morale.&lt;/li&gt;
&lt;li&gt;Test behaviors, not the implementation. Use higher-level APIs and data that mimics real-world usage.&lt;/li&gt;
&lt;li&gt;Use a tool that reruns the tests on file save, such as &lt;code&gt;jest --watch&lt;/code&gt; or &lt;code&gt;ibazel&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Do not add &lt;code&gt;TODO&lt;/code&gt; comments in the code. Add disabled or failing tests instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;TDD goes beyond just writing tests; it fundamentally shapes the design and architecture of the code. Tests help developers understand the requirements and constraints, leading to more robust and error-resistant code. Test cases also serve multiple purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tracking Implementation Progress: They provide a clear, incremental path of development, showing what features have been implemented and what remains to be done. Each passing test signifies a step forward in the project, offering a sense of accomplishment and a clear indicator of progress.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Onboarding New Team Members: For new developers joining the team, test cases offer a practical insight into the functionality and expected behavior of the software. They serve as an up-to-date documentation that new team members can use to understand the codebase more quickly and effectively.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Providing a Safety Net for Future Changes: One of the most significant benefits of TDD is the confidence it provides when making future modifications. As the software evolves, having a comprehensive suite of tests ensures that new changes do not introduce regressions. This safety net allows developers to refactor and improve the code with greater assurance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By integrating TDD into the development process, teams can achieve a higher standard of software quality, foster a culture of continuous improvement, and reduce long-term maintenance costs.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>testing</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Case Study - Optimizing Linear Layer</title>
      <dc:creator>Eugene Ostroukhov</dc:creator>
      <pubDate>Mon, 22 Apr 2024 19:00:00 +0000</pubDate>
      <link>https://dev.to/eugeneo_17/case-study-optimizing-linear-layer-55dl</link>
      <guid>https://dev.to/eugeneo_17/case-study-optimizing-linear-layer-55dl</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;As Uchen.ml heads towards its public announcement and first demos, some low-hanging optimization fruit needs to be picked. The most heavily used piece of any ML library is the linear layer, as it is the basic building block of any neural net. This post details the process of optimizing that code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;Uchen is designed for implementing ML solutions that can be easily&lt;br&gt;
integrated into existing systems, with WebAssembly, embedded, and video games as specific targets.&lt;/p&gt;

&lt;p&gt;To maintain velocity and avoid overcomplicating the build and validation process, the following constraints are in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only the C++20 standard library is used. An ABSL dependency exists (logging, asserts, and some utilities), but whether its inclusion will remain mandatory is under consideration.&lt;/li&gt;
&lt;li&gt;No compiler-specific optimizations, including pragmas, conditional compilations or intrinsics.&lt;/li&gt;
&lt;li&gt;No CPU architecture-specific optimizations. In particular, no optimizations for one architecture that may be detrimental to others. Apple M2 and Intel Core CPUs are used to inform and direct the optimization efforts.&lt;/li&gt;
&lt;li&gt;Uchen is and will remain a CPU-only ML framework. There are no plans at this point to implement GPU or other acceleration support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These constraints will be lifted as the deployment targets and actual requirements are better understood.&lt;/p&gt;
&lt;h2&gt;
  
  
  Benchmark code
&lt;/h2&gt;

&lt;p&gt;The benchmark runs inference through linear layers of different configurations. Inputs are initialized to 0; parameters are initialized, outside the benchmark loop, to random values in the range between -1 and 1. Output values are not checked. The &lt;code&gt;float&lt;/code&gt; data type is used for inputs and outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;template&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;Os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;typename&lt;/span&gt; &lt;span class="nc"&gt;D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;BM_Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Linear layer with Is inputs and Os outputs&lt;/span&gt;
  &lt;span class="n"&gt;uchen&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;uchen&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Os&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// zero-initialized input vector. This operation is O(n) to the number&lt;/span&gt;
  &lt;span class="c1"&gt;// of the inputs and may have a negligible impact on benchmark.&lt;/span&gt;
  &lt;span class="n"&gt;uchen&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// Parameters are using the store filled with random values outside the loop.&lt;/span&gt;
  &lt;span class="c1"&gt;// This operation is O(1) to the number of parameters and has no impact&lt;/span&gt;
  &lt;span class="c1"&gt;// on benchmark.&lt;/span&gt;
  &lt;span class="n"&gt;uchen&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;parameters_t&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;decltype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;DoNotOptimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hardware
&lt;/h2&gt;

&lt;p&gt;Regular PCs are used. Note that the numbers cannot be compared across&lt;br&gt;
architectures; this post is only concerned with relative gains,&lt;br&gt;
not absolute values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apple M2 Pro
  10 Cores
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz   3.79 GHz
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Naive version
&lt;/h2&gt;

&lt;p&gt;Linear layer runs the following operation to produce the output:&lt;/p&gt;

&lt;p&gt;y&lt;sub&gt;j&lt;/sub&gt; = b&lt;sub&gt;j&lt;/sub&gt; + ∑&lt;sub&gt;i=0&lt;/sub&gt;&lt;sup&gt;n&lt;/sup&gt; w&lt;sub&gt;ji&lt;/sub&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/p&gt;

&lt;p&gt;This translates into the following C++ code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;  &lt;span class="n"&gt;output_t&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;()(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;input_t&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;Parameters&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;parameter_count&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;output_t&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;constexpr&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;Is&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_t&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Is&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
      &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt;
            &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Is&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parameters are a flat array in the following format (&lt;em&gt;n&lt;/em&gt; is the number of inputs, &lt;em&gt;m&lt;/em&gt; is the number of outputs):&lt;/p&gt;

&lt;p&gt;{w&lt;sub&gt;00&lt;/sub&gt;, …, w&lt;sub&gt;0n&lt;/sub&gt;, b&lt;sub&gt;0&lt;/sub&gt;, w&lt;sub&gt;10&lt;/sub&gt;, …, w&lt;sub&gt;mn&lt;/sub&gt;, b&lt;sub&gt;m&lt;/sub&gt;}&lt;/p&gt;
&lt;h3&gt;
  
  
  Benchmark Results:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Parameter Count&lt;/th&gt;
&lt;th&gt;i7-10700KF&lt;/th&gt;
&lt;th&gt;Apple M2 Pro&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt;100, 200&amp;gt;&lt;/td&gt;
&lt;td&gt;20,200&lt;/td&gt;
&lt;td&gt;13,645 ns&lt;/td&gt;
&lt;td&gt;6882 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt;2500, 8&amp;gt;&lt;/td&gt;
&lt;td&gt;20,008&lt;/td&gt;
&lt;td&gt;17,086 ns&lt;/td&gt;
&lt;td&gt;16,307 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt;8, 2500&amp;gt;&lt;/td&gt;
&lt;td&gt;22,500&lt;/td&gt;
&lt;td&gt;5032 ns&lt;/td&gt;
&lt;td&gt;2177 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note that the number of operations (memory reads, stores, and arithmetic) is directly correlated with the number of parameters, so all the models were set up to have roughly the same number of them.&lt;/p&gt;

&lt;p&gt;The Intel architecture shows an approximately 3.4x spread between the best-case scenario (the number of outputs drastically exceeds the number of inputs) and the worst-case scenario (the number of inputs is much greater than the number of outputs). The Apple ARM implementation showed a 7.5x spread.&lt;/p&gt;
&lt;h2&gt;
  
  
  Transposed iteration order
&lt;/h2&gt;

&lt;p&gt;The first optimization is to change the iteration order. Instead of iterating over the outputs and then over the inputs, we iterate over the inputs and then over the outputs. This change allows for better cache utilization and reduces the number of cache misses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;  &lt;span class="n"&gt;output_t&lt;/span&gt; &lt;span class="nf"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;()(&lt;/span&gt;
      &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;input_t&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;Parameters&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Outputs&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;output_t&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the code uses a linear iteration over the parameters, making the memory&lt;br&gt;
access pattern more predictable and cache-friendly.&lt;/p&gt;

&lt;p&gt;The parameter layout is as follows (&lt;em&gt;n&lt;/em&gt; is the number of inputs, &lt;em&gt;m&lt;/em&gt; is the number of outputs): all biases come first, then the weights grouped by input index, matching the linear iteration in the code above:&lt;/p&gt;

&lt;p&gt;{b&lt;sub&gt;0&lt;/sub&gt;, …, b&lt;sub&gt;m&lt;/sub&gt;, w&lt;sub&gt;00&lt;/sub&gt;, w&lt;sub&gt;10&lt;/sub&gt;, …, w&lt;sub&gt;m0&lt;/sub&gt;, w&lt;sub&gt;01&lt;/sub&gt;, …, w&lt;sub&gt;mn&lt;/sub&gt;}&lt;/p&gt;
&lt;h3&gt;
  
  
  Benchmark Results:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;i7-10700KF&lt;/th&gt;
&lt;th&gt;Apple M2 Pro&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100, 200&lt;/td&gt;
&lt;td&gt;1880 ns&lt;/td&gt;
&lt;td&gt;1326 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2500, 8&lt;/td&gt;
&lt;td&gt;5611 ns&lt;/td&gt;
&lt;td&gt;11,124 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8, 2500&lt;/td&gt;
&lt;td&gt;2015 ns&lt;/td&gt;
&lt;td&gt;1354 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The number of inputs has a significant negative impact on performance, as each&lt;br&gt;
input adds an extra memory load.&lt;/p&gt;
&lt;h2&gt;
  
  
  Compile for the specific CPU
&lt;/h2&gt;

&lt;p&gt;By default, the compiler generates code that works on most CPUs. This&lt;br&gt;
precludes the use of some newer instructions that have a significant impact&lt;br&gt;
on performance. To test this, I pass &lt;code&gt;-march=native&lt;/code&gt; to the compiler, which&lt;br&gt;
makes it target my current CPU. I am using Bazel, so the invocation looked like&lt;br&gt;
this (&lt;code&gt;-g&lt;/code&gt; includes debugging symbols, as I use &lt;code&gt;gdb&lt;/code&gt; to look at&lt;br&gt;
the assembly code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bazel run &lt;span class="nt"&gt;-c&lt;/span&gt; opt &lt;span class="nt"&gt;--cxxopt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-g"&lt;/span&gt; &lt;span class="nt"&gt;--cxxopt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-march=native"&lt;/span&gt; //benchmark:linear
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Benchmark Results (Intel only):
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;i7-10700KF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100, 200&lt;/td&gt;
&lt;td&gt;1203 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2500, 8&lt;/td&gt;
&lt;td&gt;4871 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8, 2500&lt;/td&gt;
&lt;td&gt;1080 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This "optimization" shows pretty solid across the board. Ultimately, it will be&lt;br&gt;
up to embedders to decide the CPU target to compile for.&lt;/p&gt;
&lt;h2&gt;
  
  
  A bigger layer
&lt;/h2&gt;

&lt;p&gt;With the numbers in the single-digit microseconds, it seems reasonable to increase&lt;br&gt;
the size of the linear layer to see what impact that has on performance.&lt;/p&gt;
&lt;h3&gt;
  
  
  Benchmark Results (Intel only):
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;i7-10700KF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100, 200&lt;/td&gt;
&lt;td&gt;1180 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2500, 8&lt;/td&gt;
&lt;td&gt;5056 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8, 2500&lt;/td&gt;
&lt;td&gt;1148 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4000, 2000&lt;/td&gt;
&lt;td&gt;1496652 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000000, 8&lt;/td&gt;
&lt;td&gt;2986400 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8, 1000000&lt;/td&gt;
&lt;td&gt;2185253 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The worst case is "only" 2x slower than the best case.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using SIMD intrinsics directly
&lt;/h2&gt;

&lt;p&gt;As mentioned above, SIMD intrinsics are off the table (e.g. WebAssembly still&lt;br&gt;
has no support for them). However, it is still interesting to see what&lt;br&gt;
the performance gains could be if we tried them. I did not use AVX, as data&lt;br&gt;
alignment is not supported yet, though this will change soon.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;  &lt;span class="n"&gt;output_t&lt;/span&gt; &lt;span class="nf"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;()(&lt;/span&gt;
      &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;input_t&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;Parameters&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Outputs&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;output_t&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;__m128&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Parameters are accessed in the wrong order - this is a bug!&lt;/span&gt;
        &lt;span class="c1"&gt;// This code is for benchmark only.&lt;/span&gt;
        &lt;span class="n"&gt;__m128&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_load_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="n"&gt;__m128&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_dp_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xf1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;_mm_cvtss_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Benchmark Results (Intel only):
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;i7-10700KF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100, 200&lt;/td&gt;
&lt;td&gt;5765 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2500, 8&lt;/td&gt;
&lt;td&gt;5955 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8, 2500&lt;/td&gt;
&lt;td&gt;5761 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4000, 2000&lt;/td&gt;
&lt;td&gt;2783286 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000000, 8&lt;/td&gt;
&lt;td&gt;2964343 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8, 1000000&lt;/td&gt;
&lt;td&gt;3214760 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;"Good" cases worsened, which shows that the updated code is ran. Shape of&lt;br&gt;
the data is known at compile time when using Uchen so the compilers make&lt;br&gt;
informed decisions about vectorization and other optimizations and it looks&lt;br&gt;
like competing with them is an unnecessary exercise.&lt;/p&gt;

&lt;p&gt;Manually unrolling the loop yields similar results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;  &lt;span class="n"&gt;output_t&lt;/span&gt; &lt;span class="nf"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;()(&lt;/span&gt;
      &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;input_t&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;Parameters&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Outputs&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;output_t&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;i7-10700KF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100, 200&lt;/td&gt;
&lt;td&gt;6523 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2500, 8&lt;/td&gt;
&lt;td&gt;5088 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8, 2500&lt;/td&gt;
&lt;td&gt;6763 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4000, 2000&lt;/td&gt;
&lt;td&gt;3226969 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000000, 8&lt;/td&gt;
&lt;td&gt;2918274 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8, 1000000&lt;/td&gt;
&lt;td&gt;3734533 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;At this point, performance at this granularity looks close to the practical&lt;br&gt;
limit for &lt;em&gt;a single core&lt;/em&gt;. The next article will cover multi-threading&lt;br&gt;
and memory alignment (in particular, handling parameter counts that are not&lt;br&gt;
divisible by 8).&lt;/p&gt;
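As a preview of the alignment discussion, one common approach (a sketch, not necessarily the one Uchen will take) is to round the parameter count up to a multiple of the SIMD lane width and zero-pad the tail, so every load is full-width and padded weights contribute nothing to the result:

```cpp
#include <cstddef>

// Hypothetical helper: round a count up to a multiple of the SIMD lane
// count (8 floats for AVX). Zero-padded weight tails leave sums intact.
constexpr size_t PadToLanes(size_t count, size_t lanes = 8) {
  return (count + lanes - 1) / lanes * lanes;
}

static_assert(PadToLanes(8) == 8, "already aligned");
static_assert(PadToLanes(9) == 16, "rounds up to next multiple");
```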

&lt;p&gt;Backpropagation optimizations and Uchen's approach to memory management&lt;br&gt;
will be covered in future articles.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>performance</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
