<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kartikay dubey</title>
    <description>The latest articles on DEV Community by kartikay dubey (@dubeykartikay).</description>
    <link>https://dev.to/dubeykartikay</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818892%2Fe8587f68-3f37-4472-b7b0-89fae4ac0c9c.jpg</url>
      <title>DEV Community: kartikay dubey</title>
      <link>https://dev.to/dubeykartikay</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dubeykartikay"/>
    <language>en</language>
    <item>
      <title>The Fastest Set Is Often Not a Set: 4050 Duplicate-Detection Benchmarks</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Tue, 02 Jun 2026 18:37:21 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/the-fastest-set-is-often-not-a-set-4050-duplicate-detection-benchmarks-31im</link>
      <guid>https://dev.to/dubeykartikay/the-fastest-set-is-often-not-a-set-4050-duplicate-detection-benchmarks-31im</guid>
      <description>&lt;p&gt;Duplicate detection looks solved: keep a hash set, skip what you have already seen. A benchmark suite of &lt;strong&gt;4050 measurements across 480 workloads&lt;/strong&gt; says otherwise: the right structure can be 94x faster than &lt;code&gt;std::unordered_set&lt;/code&gt;, and the wrong one 90,000x slower than an in-memory alternative, depending on what you are deduplicating and what guarantees you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dense integers are an array problem
&lt;/h2&gt;

&lt;p&gt;When keys are dense, bounded 32-bit integers, a hash set wastes work: it hashes, probes buckets, and chases pointers. A bitset turns membership into one indexed bit. At one million uniform integers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;strategy&lt;/th&gt;
&lt;th&gt;ns per insert&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;growable bitset&lt;/td&gt;
&lt;td&gt;5.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sort then unique&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;roaring bitmap&lt;/td&gt;
&lt;td&gt;165&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;std::unordered_set&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;483&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;std::set&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1154&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bitset is &lt;strong&gt;94x faster&lt;/strong&gt; than the hash set for the same correct answer. If your key is already an array index, do not turn it into a hashing problem.&lt;/p&gt;
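
&lt;p&gt;To make "one indexed bit" concrete, here is a minimal sketch in Python. It is not the benchmarked C++ code: &lt;code&gt;dedupe_dense&lt;/code&gt;, &lt;code&gt;universe_size&lt;/code&gt;, and the byte-backed layout are illustrative assumptions, and it only works when every key is a non-negative integer below a known bound.&lt;/p&gt;

```python
# Byte-backed bitset: bit k of the array records whether key k was seen.
# Names and layout are illustrative assumptions, not the benchmark's code.
MASKS = (1, 2, 4, 8, 16, 32, 64, 128)  # one mask per bit position in a byte


def dedupe_dense(keys, universe_size):
    """Return keys in first-seen order with duplicates removed.

    Assumes every key is an integer in range(universe_size).
    """
    seen = bytearray((universe_size + 7) // 8)  # pre-sized, all bits zero
    unique = []
    for key in keys:
        word, bit = divmod(key, 8)
        mask = MASKS[bit]
        if (seen[word] | mask) == seen[word]:
            continue  # bit already set: this key is a duplicate
        seen[word] |= mask
        unique.append(key)
    return unique


print(dedupe_dense([3, 1, 3, 7, 1, 0], universe_size=8))  # [3, 1, 7, 0]
```

&lt;p&gt;There is no hashing and no probing: membership is a division, a table lookup, and one byte read.&lt;/p&gt;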

&lt;h2&gt;
  
  
  Text keys change the cost model
&lt;/h2&gt;

&lt;p&gt;For long strings, comparison and hashing dominate. Sorting with fingerprints (with full-key verification when correctness matters) can beat a hash set by 1.8x to 2.7x. For clustered duplicate strings, a hash set is excellent because recent buckets stay hot in cache.&lt;/p&gt;
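
&lt;p&gt;A rough sketch of fingerprinted sorting with full-key verification, in Python. The choice of CRC32 as the fingerprint and the name &lt;code&gt;dedupe_by_fingerprint&lt;/code&gt; are assumptions made for illustration; the benchmark's own fingerprint function may well differ.&lt;/p&gt;

```python
import zlib
from itertools import groupby


def fingerprint(text):
    # CRC32 is only an illustrative stand-in for a fast fingerprint.
    return zlib.crc32(text.encode("utf-8"))


def dedupe_by_fingerprint(strings):
    """Return the distinct strings, comparing full keys only inside
    groups that share a fingerprint. Output order follows the fingerprints."""
    # Sort small (fingerprint, index) pairs instead of the long strings.
    tagged = sorted((fingerprint(s), i) for i, s in enumerate(strings))
    unique = []
    for _, group in groupby(tagged, key=lambda pair: pair[0]):
        kept = []  # distinct full keys seen under this fingerprint
        for _, index in group:
            candidate = strings[index]
            # Equal fingerprints do not prove equality: verify the full key.
            if candidate not in kept:
                kept.append(candidate)
        unique.extend(kept)
    return unique


print(sorted(dedupe_by_fingerprint(["alpha", "beta", "alpha", "gamma", "beta"])))
```

&lt;p&gt;The sort only ever moves integer pairs; the long strings are touched once to fingerprint them and again only when two fingerprints collide.&lt;/p&gt;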

&lt;h2&gt;
  
  
  Streaming is a forgetting problem
&lt;/h2&gt;

&lt;p&gt;For unbounded streams, the question is what to remember. An in-memory sliding window costs ~68 ns/event. A PostgreSQL-backed detector with per-event transactions costs ~6.1 ms/event, a 90,000x gap for the same logical check. Batching commits closes most of it.&lt;/p&gt;
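
&lt;p&gt;One way to picture the in-memory sliding window is the Python sketch below. The class name, the count-based window, and the eviction policy are assumptions for illustration; a real detector might bound the window by time instead.&lt;/p&gt;

```python
from collections import deque


class SlidingWindowDeduper:
    """Remembers only the most recent `capacity` distinct event ids.

    Illustrative sketch: the count-based window is an assumption.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.order = deque()   # ids in arrival order, oldest on the left
        self.members = set()   # the same ids, for O(1) membership checks

    def is_duplicate(self, event_id):
        if event_id in self.members:
            return True
        self.order.append(event_id)
        self.members.add(event_id)
        if len(self.order) > self.capacity:
            # Forget the oldest id so memory stays bounded.
            self.members.discard(self.order.popleft())
        return False


window = SlidingWindowDeduper(capacity=2)
print([window.is_duplicate(e) for e in ["a", "a", "b", "c", "a"]])
```

&lt;p&gt;The final &lt;code&gt;"a"&lt;/code&gt; is reported as new because it has already been evicted. That is the forgetting trade-off: bounded memory in exchange for missing duplicates older than the window.&lt;/p&gt;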

&lt;h2&gt;
  
  
  A practical decision table
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dense bounded ints&lt;/strong&gt; -&amp;gt; pre-sized bitset (30x to 110x faster).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sparse 64-bit ints&lt;/strong&gt; -&amp;gt; Roaring bitmap, or sort + unique on finite batches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long strings&lt;/strong&gt; -&amp;gt; fingerprinted sorting, verify on match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming, bounded memory&lt;/strong&gt; -&amp;gt; in-memory sliding window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming, durable&lt;/strong&gt; -&amp;gt; RocksDB or Postgres with &lt;em&gt;batched&lt;/em&gt; writes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fastest set is often not a set. It is the data structure your key space was trying to be.&lt;/p&gt;

&lt;p&gt;For all 4050 measurements, the winner heatmaps, and the streaming benchmarks: &lt;a href="https://dubeykartikay.com/posts/the-shape-of-duplicate-detection/" rel="noopener noreferrer"&gt;The Shape of Duplicate Detection&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>performance</category>
      <category>algorithms</category>
      <category>datastructures</category>
    </item>
    <item>
      <title>How to Install Boost in Any C++ Project: CMake, vcpkg, Conan, and More</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Tue, 02 Jun 2026 18:36:14 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/how-to-install-boost-in-any-c-project-cmake-vcpkg-conan-and-more-2nk2</link>
      <guid>https://dev.to/dubeykartikay/how-to-install-boost-in-any-c-project-cmake-vcpkg-conan-and-more-2nk2</guid>
      <description>&lt;p&gt;If you have written much C++, you have reached for Boost, and probably lost an afternoon to linker errors getting it installed. Here are the practical ways to add Boost to a project, with the snippets that actually work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Header-only vs compiled
&lt;/h2&gt;

&lt;p&gt;Boost is ~160 libraries in two camps. &lt;strong&gt;Header-only&lt;/strong&gt; ones (&lt;code&gt;asio&lt;/code&gt;, &lt;code&gt;beast&lt;/code&gt;, &lt;code&gt;mp11&lt;/code&gt;, &lt;code&gt;hana&lt;/code&gt;, &lt;code&gt;pfr&lt;/code&gt;) need only an &lt;code&gt;#include&lt;/code&gt;. &lt;strong&gt;Compiled&lt;/strong&gt; ones (&lt;code&gt;filesystem&lt;/code&gt;, &lt;code&gt;program_options&lt;/code&gt;, &lt;code&gt;thread&lt;/code&gt;, &lt;code&gt;regex&lt;/code&gt;, &lt;code&gt;serialization&lt;/code&gt;, &lt;code&gt;log&lt;/code&gt;) ship &lt;code&gt;.so&lt;/code&gt;/&lt;code&gt;.a&lt;/code&gt; files you must link. One gotcha: &lt;code&gt;boost::system&lt;/code&gt; has been mostly header-only since 1.69, so you rarely need &lt;code&gt;-lboost_system&lt;/code&gt; anymore, despite what older tutorials say.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 1: CMake find_package (system Boost)
&lt;/h2&gt;

&lt;p&gt;Install through your package manager, then let CMake find it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cmake"&gt;&lt;code&gt;&lt;span class="nb"&gt;find_package&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;Boost 1.71 REQUIRED COMPONENTS filesystem program_options&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;target_link_libraries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;my_app PRIVATE
    Boost::filesystem
    Boost::program_options
    Boost::headers        &lt;span class="c1"&gt;# header-only libs&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install commands per platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;libboost-all-dev   &lt;span class="c"&gt;# Ubuntu/Debian (~500 MB; prefer per-component)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dnf &lt;span class="nb"&gt;install &lt;/span&gt;boost-devel        &lt;span class="c"&gt;# Fedora/RHEL&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;pacman &lt;span class="nt"&gt;-S&lt;/span&gt; boost boost-libs     &lt;span class="c"&gt;# Arch&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;boost                  &lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Method 2: FetchContent (and why it bites)
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;FetchContent&lt;/code&gt; works, but Boost's modular CMake means you must declare component dependencies explicitly or you hit cryptic missing-target errors. It also compiles Boost as part of your build, which is slow. Good for reproducibility, bad for iteration speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methods 3 &amp;amp; 4: vcpkg and Conan
&lt;/h2&gt;

&lt;p&gt;Package managers give you pinned, reproducible Boost that plugs into CMake's &lt;code&gt;find_package&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# vcpkg manifest mode: list boost in vcpkg.json, then configure with the toolchain file&lt;/span&gt;
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DCMAKE_TOOLCHAIN_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.../vcpkg/scripts/buildsystems/vcpkg.cmake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conan is the same idea with a &lt;code&gt;conanfile.txt&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 5: Manual g++ linking
&lt;/h2&gt;

&lt;p&gt;No build system, just the compiler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;g++ main.cpp &lt;span class="nt"&gt;-o&lt;/span&gt; app                            &lt;span class="c"&gt;# header-only&lt;/span&gt;
g++ main.cpp &lt;span class="nt"&gt;-o&lt;/span&gt; app &lt;span class="nt"&gt;-lboost_filesystem&lt;/span&gt;         &lt;span class="c"&gt;# compiled lib&lt;/span&gt;
g++ main.cpp &lt;span class="nt"&gt;-o&lt;/span&gt; app &lt;span class="nt"&gt;-l&lt;/span&gt;:libboost_filesystem.a   &lt;span class="c"&gt;# static&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



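&lt;p&gt;Link order matters, especially with static libraries: a library must appear after the object files and libraries that use it, so dependents come before dependencies on the command line.&lt;/p&gt;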

&lt;h2&gt;
  
  
  Which method should you use?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System &lt;code&gt;find_package&lt;/code&gt;&lt;/strong&gt; for quick local builds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vcpkg or Conan&lt;/strong&gt; for reproducible, cross-platform projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build from source with b2&lt;/strong&gt; only when you need a specific version or custom variant.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the complete walkthrough, including building from source with &lt;code&gt;b2&lt;/code&gt;, Nix, Docker dev containers, static-linking details, and an FAQ: &lt;a href="https://dubeykartikay.com/posts/install-boost-cpp/" rel="noopener noreferrer"&gt;How to Install Boost in Any C++ Project&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>cmake</category>
      <category>boost</category>
      <category>vcpkg</category>
    </item>
    <item>
      <title>Making a Go Log Viewer 12x Faster (and the Benchmark Bug That Fooled Me)</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Tue, 02 Jun 2026 18:36:10 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/making-a-go-log-viewer-12x-faster-and-the-benchmark-bug-that-fooled-me-564n</link>
      <guid>https://dev.to/dubeykartikay/making-a-go-log-viewer-12x-faster-and-the-benchmark-bug-that-fooled-me-564n</guid>
      <description>&lt;p&gt;I built Peacock, a terminal JSON log viewer in Go, and it could not keep up with a busy log stream. So I profiled it with &lt;code&gt;go tool pprof&lt;/code&gt;: read the profile, fix the hottest line, re-profile, repeat. On a real 70x240 terminal, throughput went from &lt;strong&gt;52 lines/sec to 651 lines/sec, about 12x&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The most useful lesson, though, came from an evening I lost to a bug in my own benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleanup: a pointless join/split
&lt;/h2&gt;

&lt;p&gt;The baseline profile flagged a viewport setter eating 8% of CPU. &lt;code&gt;SetContent&lt;/code&gt; takes a string and splits it on &lt;code&gt;\n&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0   29.66s   227:   m.SetContentLines(strings.Split(s, "\n"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But my code already had a &lt;code&gt;[]string&lt;/code&gt;. It was joining the lines into one giant string with &lt;code&gt;lipgloss.JoinVertical&lt;/code&gt;, just so &lt;code&gt;SetContent&lt;/code&gt; could split them again. Calling &lt;code&gt;SetContentLines&lt;/code&gt; directly removed the round trip.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real win: render only what is visible
&lt;/h2&gt;

&lt;p&gt;The hottest function rendered &lt;strong&gt;every buffered entry on every frame&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;70ms   8.06s   130:   rendered, _ := m.styles.renderEntry(m.visibleEntries[i], width)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal shows ~70 lines, yet Peacock was word-wrapping the entire backlog each frame. I capped rendering to the viewport height. &lt;code&gt;contentLines&lt;/code&gt; cumulative time dropped from 66.86% to 14.11%. This single algorithmic change carried the practical win.&lt;/p&gt;
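
&lt;p&gt;The idea fits in a few lines. This is a Python sketch of visible-only rendering, not Peacock's Go code: &lt;code&gt;render_visible&lt;/code&gt; is a hypothetical name, &lt;code&gt;textwrap.wrap&lt;/code&gt; stands in for the real entry renderer, and it assumes a positive width and height.&lt;/p&gt;

```python
import textwrap


def render_visible(entries, width, height):
    """Word-wrap only as many trailing entries as fit in `height` rows.

    Illustrative sketch; assumes width and height are positive.
    """
    lines = []
    # Walk backwards from the newest entry and stop once the screen is full.
    for entry in reversed(entries):
        wrapped = textwrap.wrap(entry, width=width) or [""]
        lines[:0] = wrapped  # prepend so output stays in chronological order
        if len(lines) >= height:
            break
    return lines[-height:]


backlog = ["line %d" % i for i in range(100000)]
print(render_visible(backlog, width=240, height=3))
```

&lt;p&gt;The work per frame now depends on the viewport height, not on how many entries are buffered.&lt;/p&gt;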

&lt;h2&gt;
  
  
  A ring buffer instead of slice trimming
&lt;/h2&gt;

&lt;p&gt;Appending entries and re-slicing the backlog churned &lt;code&gt;memmove&lt;/code&gt; and the GC. A fixed-size circular buffer overwrites in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;entryRing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this, &lt;code&gt;appendEntry&lt;/code&gt; disappeared from the profile.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cache that did nothing, and the 0x0 terminal
&lt;/h2&gt;

&lt;p&gt;I cached each rendered entry by viewport width. Throughput did not move at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ring buffer:    6,095 l/s
Cache rendered: 6,095 l/s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I re-read the cache logic three times. The bug was not in my code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;script &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'tput lines; tput cols'&lt;/span&gt; /dev/null
0
0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The benchmark's pseudo-terminal had &lt;strong&gt;no dimensions&lt;/strong&gt;. With width 0, the wrap function returned early, so there was almost no rendering work for the cache to skip. I set explicit &lt;code&gt;stty rows/cols&lt;/code&gt; plus &lt;code&gt;LINES&lt;/code&gt;/&lt;code&gt;COLUMNS&lt;/code&gt;, and the cache finally showed a 3x jump.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The biggest wins are algorithmic.&lt;/strong&gt; Visible-only rendering beat every string-allocation trick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your benchmark is part of the system.&lt;/strong&gt; When an optimization shows zero improvement, suspect the measurement before the code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For every pprof command, every profile output, and the full corrected throughput ladder: &lt;a href="https://dubeykartikay.com/posts/go-optimization-pprof/" rel="noopener noreferrer"&gt;Making a Log Viewer 12x Faster&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>go</category>
      <category>pprof</category>
      <category>performance</category>
      <category>profiling</category>
    </item>
    <item>
      <title>How 3 Lines of Code Caused a 10x Kafka Throughput Drop</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Sun, 03 May 2026 16:06:30 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/how-3-lines-of-code-caused-a-10x-kafka-throughput-drop-3ln5</link>
      <guid>https://dev.to/dubeykartikay/how-3-lines-of-code-caused-a-10x-kafka-throughput-drop-3ln5</guid>
      <description>&lt;p&gt;In August 2025, a user reported that Apache Kafka v3.9.0 dropped consumer throughput by 10x. Other users reproduced it. The culprit was how a configuration called &lt;code&gt;min.insync.replicas&lt;/code&gt; is enforced, and the change responsible was three lines of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The report
&lt;/h2&gt;

&lt;p&gt;Sharad Garg opened a ticket titled "Consumer throughput drops by 10 times with Kafka v3.9.0 in ZK mode." Ritvik Gupta ran controlled tests and traced the issue to &lt;code&gt;min.insync.replicas&lt;/code&gt;. Raising it from 1 to 2 caused a massive drop:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Message Rate&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 Producer 1 Consumer&lt;/td&gt;
&lt;td&gt;89.21&lt;/td&gt;
&lt;td&gt;min.insync.replicas = 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 Producer 1 Consumer&lt;/td&gt;
&lt;td&gt;298.99&lt;/td&gt;
&lt;td&gt;min.insync.replicas = 1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Another user reported throughput falling from 147 MB/s on Kafka 3.4 to 58 MB/s on Kafka 3.9 with the same setting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The root cause
&lt;/h2&gt;

&lt;p&gt;Chia-Ping Tsai, a long-time Kafka contributor, identified the issue. It traced back to KAFKA-15583, titled "High watermark can only advance if ISR size is larger than min ISR."&lt;/p&gt;

&lt;p&gt;The high watermark (HW) is the offset of the latest message copied to all in-sync replicas. Consumers are only allowed to read up to the HW. This guarantees that consumed data will not disappear if a broker crashes.&lt;/p&gt;

&lt;p&gt;The change added this check inside the leader's watermark advancement logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isUnderMinIsr&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"Not increasing HWM because partition is under min ISR"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before v3.9.0, &lt;code&gt;min.insync.replicas&lt;/code&gt; only affected producers using &lt;code&gt;acks=all&lt;/code&gt;. It dictated how many replicas had to acknowledge a write before the producer considered it successful. It had nothing to do with consumers.&lt;/p&gt;

&lt;p&gt;After v3.9.0, the same setting also blocks consumer reads. If a slow follower drops out of the ISR and the ISR shrinks below &lt;code&gt;min.insync.replicas&lt;/code&gt;, the leader stops advancing the high watermark until enough followers catch up and rejoin. Consumers stall until the watermark moves again.&lt;/p&gt;
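
&lt;p&gt;A toy model in Python shows the effect of that check. It is a simplification and not Kafka's code: &lt;code&gt;advance_high_watermark&lt;/code&gt; and its parameters are hypothetical, and the offsets are made up for the example.&lt;/p&gt;

```python
def advance_high_watermark(current_hw, isr_end_offsets, min_insync_replicas,
                           enforce_min_isr):
    """Return the new high watermark for one partition (toy model).

    isr_end_offsets holds the log end offset of every in-sync replica,
    leader included, so it is assumed to be non-empty.
    enforce_min_isr toggles the under-min-ISR check described above.
    """
    if enforce_min_isr and min_insync_replicas > len(isr_end_offsets):
        return current_hw  # under min ISR: the watermark does not move
    # Otherwise the watermark is the offset every ISR member has reached.
    return max(current_hw, min(isr_end_offsets))


# The leader is at offset 500; the slow follower has dropped out of the ISR.
isr = [500]
print(advance_high_watermark(120, isr, 2, enforce_min_isr=False))  # 500
print(advance_high_watermark(120, isr, 2, enforce_min_isr=True))   # 120
```

&lt;p&gt;Without the check, consumers can read up to offset 500 even though only the leader holds that data. With it, they wait at 120 until the ISR is back to two members.&lt;/p&gt;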

&lt;h2&gt;
  
  
  Why this is a feature, not a bug
&lt;/h2&gt;

&lt;p&gt;Kafka prioritizes durability over throughput. Blocking reads until at least &lt;code&gt;min.insync.replicas&lt;/code&gt; replicas are in sync prevents consumers from reading data that has not been sufficiently replicated. If the leader crashes after a consumer reads an under-replicated message, that message is gone, and the consumer has already processed it.&lt;/p&gt;

&lt;p&gt;The trade-off is real. The change arguably deserved a major version bump, because a 10x throughput drop in a minor release can break production pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;If you hit this, your options are straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower &lt;code&gt;min.insync.replicas&lt;/code&gt; if your durability requirements allow it.&lt;/li&gt;
&lt;li&gt;Ensure followers have enough resources to keep up with the leader.&lt;/li&gt;
&lt;li&gt;Monitor ISR size and follower lag as critical metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three lines of code. A massive performance impact. A reminder that distributed systems are full of sharp edges.&lt;/p&gt;

&lt;p&gt;For the full timeline, mailing list discussion, and the exact PR diff: &lt;a href="https://dubeykartikay.com/posts/kafka-throughput-drop-min-insync-replicas/" rel="noopener noreferrer"&gt;How a Minor Release Caused a 10x Throughput Drop in Kafka&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>distributedsystems</category>
      <category>performance</category>
      <category>apachekafka</category>
    </item>
    <item>
      <title>Optimizing My Hugo Blog: From 3.6 MB of JavaScript to Zero</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Sun, 03 May 2026 16:06:28 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/optimizing-my-hugo-blog-from-36-mb-of-javascript-to-zero-22jh</link>
      <guid>https://dev.to/dubeykartikay/optimizing-my-hugo-blog-from-36-mb-of-javascript-to-zero-22jh</guid>
      <description>&lt;p&gt;My Hugo blog was downloading 3.6 MB of JavaScript and 40 KB of external CSS on every page load. For a static blog with mostly text and a few diagrams, that was absurd. Here is how I fixed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Baseline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;HTML: 86 KB&lt;/li&gt;
&lt;li&gt;JavaScript: 3.6 MB (Mermaid + KaTeX)&lt;/li&gt;
&lt;li&gt;CSS: 40 KB (KaTeX stylesheets)&lt;/li&gt;
&lt;li&gt;Problem: render-blocking scripts loaded on every page for math and diagrams&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Optimization 1: HTML minification
&lt;/h2&gt;

&lt;p&gt;Adding &lt;code&gt;minifyOutput = true&lt;/code&gt; to &lt;code&gt;hugo.toml&lt;/code&gt; shrank HTML by 16%. Small win, zero risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization 2: Inline CSS
&lt;/h2&gt;

&lt;p&gt;I removed the external &lt;code&gt;main.css&lt;/code&gt; link and inlined the styles directly into the HTML. The HTML grew slightly, but I eliminated one render-blocking network request. First Contentful Paint improved because the browser no longer waits for a CSS fetch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization 3: Native MathML
&lt;/h2&gt;

&lt;p&gt;My blog used KaTeX to render equations. That meant JavaScript, CSS, and font files for every page with math. I switched to Hugo's Goldmark passthrough extensions, which output native MathML. Browsers render this directly.&lt;/p&gt;

&lt;p&gt;Result: 278 KB of JavaScript removed, all external stylesheets eliminated. Math now renders without any scripts or fonts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization 4: Conditional asset loading
&lt;/h2&gt;

&lt;p&gt;Mermaid.js was loading on every page, even text-only posts. I used Hugo's &lt;code&gt;.Store&lt;/code&gt; to set a &lt;code&gt;hasMermaid&lt;/code&gt; flag during Markdown processing. The script tag only injects when a page actually contains a diagram.&lt;/p&gt;

&lt;p&gt;Text-only pages no longer download Mermaid. Diagram pages still get it, but only when needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization 5: Server-side rendering for Mermaid
&lt;/h2&gt;

&lt;p&gt;Even conditional loading left a 3.3 MB script on diagram pages. I added a Node.js build step that pre-renders Mermaid blocks into static SVG files at build time. The frontend outputs &lt;code&gt;&amp;lt;img src="diagram.svg"&amp;gt;&lt;/code&gt; instead of a &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag.&lt;/p&gt;

&lt;p&gt;Result: zero JavaScript on the frontend. Total Blocking Time dropped because the browser no longer executes JS to calculate layouts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization 6: Early Hints and caching
&lt;/h2&gt;

&lt;p&gt;I generated a &lt;code&gt;_headers&lt;/code&gt; file with strict &lt;code&gt;Cache-Control&lt;/code&gt; rules for immutable assets. The build script also injects &lt;code&gt;Link: rel=preload&lt;/code&gt; headers for images and SVGs. Cloudflare returns &lt;code&gt;103 Early Hints&lt;/code&gt;, telling the browser to fetch assets before the HTML document finishes downloading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JavaScript&lt;/td&gt;
&lt;td&gt;3.6 MB&lt;/td&gt;
&lt;td&gt;0 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External CSS&lt;/td&gt;
&lt;td&gt;40 KB&lt;/td&gt;
&lt;td&gt;0 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTML&lt;/td&gt;
&lt;td&gt;86 KB&lt;/td&gt;
&lt;td&gt;72 KB (minified)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The site is now 100% JavaScript-free on the frontend. Performance matters, and static sites do not need a heavy JS framework to be fast.&lt;/p&gt;

&lt;p&gt;For the full &lt;code&gt;hugo.toml&lt;/code&gt; config, build scripts, and Lighthouse score breakdown: &lt;a href="https://dubeykartikay.com/posts/hugo-optimization-zero-js/" rel="noopener noreferrer"&gt;Optimizing My Hugo Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>hugo</category>
      <category>webperf</category>
      <category>staticsite</category>
      <category>zerojs</category>
    </item>
    <item>
      <title>Vector Databases and Semantic Search: A Practical Introduction</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Sun, 03 May 2026 16:06:26 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/vector-databases-and-semantic-search-a-practical-introduction-414a</link>
      <guid>https://dev.to/dubeykartikay/vector-databases-and-semantic-search-a-practical-introduction-414a</guid>
      <description>&lt;p&gt;Traditional search engines match keywords. If you search for "dog shelters around Gurgaon" and the indexed page says "animal shelters near Delhi," you get few or no results. Apart from "shelters", the words do not overlap.&lt;/p&gt;

&lt;p&gt;Semantic search fixes this by converting text into vectors. Similar ideas end up close together in vector space, even when the words differ.&lt;/p&gt;

&lt;h2&gt;
  
  
  From words to vectors
&lt;/h2&gt;

&lt;p&gt;An embedding model takes a word or sentence and produces a high-dimensional vector. The key property: semantically similar inputs produce vectors that are close to each other. "Dog" and "animal" sit near each other. "Dog" and "car" do not.&lt;/p&gt;

&lt;p&gt;For a search engine, the pipeline is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convert every document in the corpus into a vector and store it.&lt;/li&gt;
&lt;li&gt;Convert the user's query into a vector using the same model.&lt;/li&gt;
&lt;li&gt;Find the documents whose vectors are closest to the query vector.&lt;/li&gt;
&lt;/ol&gt;

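&lt;p&gt;The hard part is step 3. A corpus of a million documents with 768-dimensional vectors is a massive dataset: roughly 3 GB if stored as 32-bit floats. Computing the exact distance from the query to every document means a million 768-dimensional distance computations per query, which is too slow for interactive search.&lt;/p&gt;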

&lt;h2&gt;
  
  
  Approximate Nearest Neighbors
&lt;/h2&gt;

&lt;p&gt;Exact search is &lt;code&gt;O(n)&lt;/code&gt; per query: it compares the query against all n vectors. ANN algorithms trade a small amount of accuracy for massive speedups. The metric is &lt;code&gt;recall@k&lt;/code&gt;: out of the true k closest vectors, how many does the approximation find? A 95% recall@100 means 95 of the 100 true nearest neighbors are returned.&lt;/p&gt;
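
&lt;p&gt;As a small illustration, recall@k can be computed like this in Python. The function name and the list-of-ids inputs are assumptions; the true neighbors would normally come from an exact brute-force search.&lt;/p&gt;

```python
def recall_at_k(true_neighbors, approx_neighbors, k):
    """Fraction of the true top-k ids that appear in the approximate top-k.

    Both inputs are id lists ordered from nearest to farthest;
    true_neighbors is assumed to be non-empty.
    """
    truth = set(true_neighbors[:k])
    found = truth.intersection(approx_neighbors[:k])
    return len(found) / len(truth)


# Three of the four true nearest neighbors were returned.
print(recall_at_k([1, 2, 3, 4], [1, 2, 9, 4], k=4))  # 0.75
```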

&lt;p&gt;Graph-based ANN builds a navigable graph over the dataset. Search starts at an entry point and greedily walks toward the query. Each step moves to the neighbor closest to the query, expanding the frontier until the best candidates are found.&lt;/p&gt;

&lt;h2&gt;
  
  
  DiskANN and Vamana
&lt;/h2&gt;

&lt;p&gt;Microsoft Research developed DiskANN and the Vamana index to make graph-based ANN work at scale. The algorithm has three pieces:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Greedy Search&lt;/strong&gt; maintains a candidate list and a visited set. It repeatedly expands the closest unvisited candidate, adds its graph neighbors, and keeps the best candidates bounded by a search-list size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Robust Prune&lt;/strong&gt; builds the graph edges. For each point, it considers possible neighbors and keeps a bounded set of useful outgoing edges. An &lt;code&gt;alpha&lt;/code&gt; parameter controls how aggressively candidates are pruned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vamana Construction&lt;/strong&gt; iterates over the dataset in random order. For each point, it runs greedy search, prunes the visited set into outgoing edges, adds backlinks, and repairs any degree violations.&lt;/p&gt;

&lt;p&gt;The result is a sparse graph where greedy search finds high-recall neighbors quickly.&lt;/p&gt;
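
&lt;p&gt;The greedy search step can be sketched in a few lines of Python. This is a simplified illustration and not the DiskANN reference code: the function name, the toy graph, and the plain sorted list used for the candidate list are all assumptions made for the example.&lt;/p&gt;

```python
import math


def greedy_search(graph, vectors, query, entry, k, list_size):
    """Best-first walk over `graph`; returns the k closest node ids it found."""
    def dist(node):
        return math.dist(vectors[node], query)

    candidates = [(dist(entry), entry)]  # sorted by distance, at most list_size long
    visited = set()
    while True:
        # Pick the closest candidate that has not been expanded yet.
        frontier = [c for c in candidates if c[1] not in visited]
        if not frontier:
            break
        _, node = frontier[0]
        visited.add(node)
        known = {n for _, n in candidates}
        for nb in graph[node]:
            if nb not in known:
                candidates.append((dist(nb), nb))
                known.add(nb)
        candidates.sort()
        del candidates[list_size:]  # enforce the search-list bound
    return [n for _, n in candidates[:k]]


# Hypothetical toy data: five points on a line, each linked to its neighbors.
vectors = {0: (0.0,), 1: (1.0,), 2: (2.0,), 3: (3.0,), 4: (4.0,)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_search(graph, vectors, (3.2,), entry=0, k=2, list_size=3))  # [3, 4]
```

&lt;p&gt;Rescanning and re-sorting the list on every step keeps the sketch short. A real implementation would keep the list sorted on insert and track the next unexpanded position.&lt;/p&gt;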

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Vector databases like Pinecone, Weaviate, and Milvus package these ideas into production systems. They handle indexing, query routing, replication, and metadata filtering. If you are building semantic search, recommendation, or retrieval-augmented generation, you are probably using these algorithms whether you know it or not.&lt;/p&gt;

&lt;p&gt;For the full mathematical walkthrough with pseudocode, LaTeX equations, and diagrams: &lt;a href="https://dubeykartikay.com/posts/vector-databases-semantic-search/" rel="noopener noreferrer"&gt;How Google Search Actually Works&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>vectordb</category>
      <category>semanticsearch</category>
      <category>ann</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How I Made My Vector Search Engine 16x Faster Without Changing the Algorithm</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Sun, 03 May 2026 16:06:24 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/how-i-made-my-vector-search-engine-16x-faster-without-changing-the-algorithm-4c8o</link>
      <guid>https://dev.to/dubeykartikay/how-i-made-my-vector-search-engine-16x-faster-without-changing-the-algorithm-4c8o</guid>
      <description>&lt;p&gt;I built a Vamana-based vector search engine in C++ called &lt;code&gt;sembed-engine&lt;/code&gt;. Recently I made a pull request that sped up queries by 16x and builds by 9x. The algorithm stayed exactly the same. The recall stayed at 1.0. The number of visited nodes did not change.&lt;/p&gt;

&lt;p&gt;The speedup came from data layout.&lt;/p&gt;

&lt;h2&gt;
  
  
  The old design
&lt;/h2&gt;

&lt;p&gt;The original code stored vectors as separate objects pointed to by &lt;code&gt;shared_ptr&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Record&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int64_t&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_ptr&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Vector&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is clean C++. Every record has an id and a vector. The vector knows how to calculate distance. In the hot path, though, the CPU had to load the record, read the &lt;code&gt;shared_ptr&lt;/code&gt;, follow the pointer, call virtual methods, and read each float through an abstraction layer. Millions of times per query.&lt;/p&gt;

&lt;h2&gt;
  
  
  The new layout
&lt;/h2&gt;

&lt;p&gt;I replaced the object graph with a flat array. All vector values live in one contiguous block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;ids&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v0_dim0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v0_dim1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;v1_dim0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v1_dim1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Vector &lt;code&gt;i&lt;/code&gt; starts at &lt;code&gt;values[i * D]&lt;/code&gt;. A &lt;code&gt;FloatVectorView&lt;/code&gt; is just a pointer and a dimension count. No allocations. No pointer chasing. The next vector is right after the previous one in memory.&lt;/p&gt;
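
&lt;p&gt;The indexing rule is easy to try outside C++. The Python sketch below stores three hypothetical 3-dimensional vectors in one contiguous &lt;code&gt;array&lt;/code&gt; and hands out zero-copy slices through a &lt;code&gt;memoryview&lt;/code&gt;. The helper &lt;code&gt;vector_view&lt;/code&gt; only mimics the idea behind &lt;code&gt;FloatVectorView&lt;/code&gt;; it is not the engine's code.&lt;/p&gt;

```python
from array import array

D = 3  # dimensions per vector (hypothetical)
ids = [10, 11, 12]
values = array("f", [0.0, 0.0, 0.0,   # vector 0
                     1.0, 2.0, 2.0,   # vector 1
                     3.0, 4.0, 0.0])  # vector 2
flat = memoryview(values)


def vector_view(i):
    """Zero-copy view of vector i: it starts at values[i * D] and spans D floats."""
    return flat[i * D:(i + 1) * D]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


print(list(vector_view(1)))                              # [1.0, 2.0, 2.0]
print(squared_distance(vector_view(0), vector_view(1)))  # 9.0
```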

&lt;p&gt;The assembly tells the story. The old code had virtual calls and scalar square roots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;call rax          ; virtual dispatch
sqrtss xmm2, xmm2 ; scalar square root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new code loads packed floats and operates on four at a time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;movups xmm1, XMMWORD PTR [rdi+rax]
subps xmm1, xmm3
mulps xmm1, xmm1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Removing unnecessary square roots
&lt;/h2&gt;

&lt;p&gt;Euclidean distance includes a square root. For nearest-neighbor search, we only care about ordering, not the absolute distance value. Square root is monotonically increasing on non-negative numbers, so it never reorders distances: &lt;code&gt;sqrt(25) &amp;lt; sqrt(100)&lt;/code&gt; exactly because &lt;code&gt;25 &amp;lt; 100&lt;/code&gt;. Comparing squared distances gives the same ranking.&lt;/p&gt;

&lt;p&gt;Switching to squared distances eliminated &lt;code&gt;sqrtss&lt;/code&gt; entirely from the hot path. One caveat: Vamana pruning uses an &lt;code&gt;alpha&lt;/code&gt; parameter. When everything is squared, &lt;code&gt;alpha&lt;/code&gt; must be squared too to preserve the same comparison semantics.&lt;/p&gt;
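
&lt;p&gt;Both points can be checked numerically. In this Python sketch the points, the value of &lt;code&gt;alpha&lt;/code&gt;, and the shape of the comparison (&lt;code&gt;d_far &gt;= alpha * d_near&lt;/code&gt;) are illustrative assumptions, not values taken from the engine.&lt;/p&gt;

```python
import math

query = (0.0, 0.0)
points = {"a": (3.0, 4.0), "b": (5.5, 0.0), "c": (1.0, 1.0)}


def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


by_true = sorted(points, key=lambda n: math.sqrt(squared_distance(points[n], query)))
by_squared = sorted(points, key=lambda n: squared_distance(points[n], query))
print(by_true == by_squared)  # True: dropping sqrt keeps the ranking

alpha = 1.2                                     # hypothetical pruning parameter
d2_near = squared_distance(points["a"], query)  # 25.0
d2_far = squared_distance(points["b"], query)   # 30.25

plain = math.sqrt(d2_far) >= alpha * math.sqrt(d2_near)  # test on true distances
squared = d2_far >= alpha ** 2 * d2_near                 # same test, everything squared
forgot = d2_far >= alpha * d2_near                       # alpha left unsquared
print(plain, squared, forgot)  # False False True
```

&lt;p&gt;The last value shows the failure mode: with squared distances and an unsquared &lt;code&gt;alpha&lt;/code&gt;, the comparison flips for this pair.&lt;/p&gt;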

&lt;h2&gt;
  
  
  Caching scores during sort
&lt;/h2&gt;

&lt;p&gt;The old comparator computed distances inside the sort function. Sorting calls the comparator many times, so the same distance was recomputed repeatedly. The fix was to compute each distance once, store it in a &lt;code&gt;ScoredNode { node; score; }&lt;/code&gt;, and sort by the cached score.&lt;/p&gt;

&lt;p&gt;Old comparator assembly called &lt;code&gt;new_view_squared&lt;/code&gt; repeatedly. New comparator assembly just loaded two floats and compared them.&lt;/p&gt;
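
&lt;p&gt;The effect is easy to reproduce by counting distance calls. The Python sketch below uses made-up vectors and a call counter; it only illustrates the pattern, not the engine's actual comparator.&lt;/p&gt;

```python
from functools import cmp_to_key

calls = 0


def squared_distance(p, q):
    global calls
    calls += 1
    return sum((x - y) ** 2 for x, y in zip(p, q))


query = (0.0, 0.0)
nodes = [(float(i % 7), float(i % 5)) for i in range(64)]  # hypothetical vectors


def compare(a, b):
    # Old style: two distance computations on every comparison.
    da, db = squared_distance(a, query), squared_distance(b, query)
    return (da > db) - (db > da)


slow = sorted(nodes, key=cmp_to_key(compare))
calls_in_comparator = calls

calls = 0
# New style: score each node once, then sort by the cached score.
scored = [(squared_distance(n, query), n) for n in nodes]
scored.sort(key=lambda pair: pair[0])
calls_cached = calls

print(calls_cached)                        # 64
print(calls_in_comparator > calls_cached)  # True
```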

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gvec query latency&lt;/td&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;4.094 ms&lt;/td&gt;
&lt;td&gt;0.631 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;w2v query latency&lt;/td&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;25.15 ms&lt;/td&gt;
&lt;td&gt;1.524 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;w2v build time&lt;/td&gt;
&lt;td&gt;total&lt;/td&gt;
&lt;td&gt;17.91 s&lt;/td&gt;
&lt;td&gt;1.889 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The search visited the same number of nodes. It stopped paying unnecessary tax at every node.&lt;/p&gt;

&lt;p&gt;For the full benchmark methodology, assembly breakdown, and PR diff: &lt;a href="https://dubeykartikay.com/posts/sembed-engine-vector-search-performance/" rel="noopener noreferrer"&gt;How I Made My Vector Search Engine 16x Faster&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>vectorsearch</category>
      <category>cpp</category>
      <category>performance</category>
      <category>vamana</category>
    </item>
    <item>
      <title>Setting Up Dual GPU Gaming Laptops in Hyprland</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Sun, 03 May 2026 16:06:22 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/setting-up-dual-gpu-gaming-laptops-in-hyprland-3n9i</link>
      <guid>https://dev.to/dubeykartikay/setting-up-dual-gpu-gaming-laptops-in-hyprland-3n9i</guid>
      <description>&lt;p&gt;Gaming laptops with dual GPUs are common, and they are a pain on Linux. I run an ASUS Zephyrus G15 with an AMD integrated GPU and an NVIDIA discrete GPU. Before I fixed the setup, I dealt with broken resume from suspend, terrible battery life, overheating, and games that ran worse than they should.&lt;/p&gt;

&lt;p&gt;This is a practical guide for setting up dual GPU systems in Hyprland. Most of it applies to other Wayland compositors too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Set the iGPU as primary
&lt;/h2&gt;

&lt;p&gt;Hyprland uses the &lt;code&gt;AQ_DRM_DEVICES&lt;/code&gt; environment variable to decide which GPU drives the display. You want the iGPU first for power efficiency and better Linux compatibility.&lt;/p&gt;

&lt;p&gt;First, find your GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lspci &lt;span class="nt"&gt;-d&lt;/span&gt; ::03xx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My output shows an RTX 3060 at &lt;code&gt;01:00.0&lt;/code&gt; and an AMD Vega at &lt;code&gt;06:00.0&lt;/code&gt;. Create udev rules to symlink these to friendly names:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/etc/udev/rules.d/igpu-device-path.rules&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;KERNEL&lt;/span&gt;==&lt;span class="s2"&gt;"card*"&lt;/span&gt;, &lt;span class="n"&gt;KERNELS&lt;/span&gt;==&lt;span class="s2"&gt;"0000:06:00.0"&lt;/span&gt;, &lt;span class="n"&gt;SUBSYSTEM&lt;/span&gt;==&lt;span class="s2"&gt;"drm"&lt;/span&gt;, &lt;span class="n"&gt;SUBSYSTEMS&lt;/span&gt;==&lt;span class="s2"&gt;"pci"&lt;/span&gt;, &lt;span class="n"&gt;SYMLINK&lt;/span&gt;+=&lt;span class="s2"&gt;"dri/amd-igpu"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/etc/udev/rules.d/dgpu-device-path.rules&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;KERNEL&lt;/span&gt;==&lt;span class="s2"&gt;"card*"&lt;/span&gt;, &lt;span class="n"&gt;KERNELS&lt;/span&gt;==&lt;span class="s2"&gt;"0000:01:00.0"&lt;/span&gt;, &lt;span class="n"&gt;SUBSYSTEM&lt;/span&gt;==&lt;span class="s2"&gt;"drm"&lt;/span&gt;, &lt;span class="n"&gt;SUBSYSTEMS&lt;/span&gt;==&lt;span class="s2"&gt;"pci"&lt;/span&gt;, &lt;span class="n"&gt;SYMLINK&lt;/span&gt;+=&lt;span class="s2"&gt;"dri/nvidia-dgpu"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reload rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;udevadm control &lt;span class="nt"&gt;--reload-rules&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;udevadm trigger
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then tell Hyprland to prefer the iGPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;env&lt;/span&gt; = &lt;span class="n"&gt;AQ_DRM_DEVICES&lt;/span&gt;, /&lt;span class="n"&gt;dev&lt;/span&gt;/&lt;span class="n"&gt;dri&lt;/span&gt;/&lt;span class="n"&gt;amd&lt;/span&gt;-&lt;span class="n"&gt;igpu&lt;/span&gt;:/&lt;span class="n"&gt;dev&lt;/span&gt;/&lt;span class="n"&gt;dri&lt;/span&gt;/&lt;span class="n"&gt;nvidia&lt;/span&gt;-&lt;span class="n"&gt;dgpu&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Fix hardware video decoding
&lt;/h2&gt;

&lt;p&gt;Without hardware decoding, video playback burns CPU, drains battery, and stutters at high resolution. Check if your system already supports it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;pacman &lt;span class="nt"&gt;-S&lt;/span&gt; libva-utils
vainfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;vainfo&lt;/code&gt; fails or picks the wrong GPU, set the driver explicitly. For AMD, add to &lt;code&gt;hyprland.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;env&lt;/span&gt; = &lt;span class="n"&gt;LIBVA_DRIVER_NAME&lt;/span&gt;, &lt;span class="n"&gt;radeonsi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common driver names: NVIDIA uses &lt;code&gt;nvidia&lt;/code&gt;, AMD uses &lt;code&gt;radeonsi&lt;/code&gt;, Intel uses &lt;code&gt;i965&lt;/code&gt; or &lt;code&gt;iHD&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Switch between Hybrid and Integrated mode
&lt;/h2&gt;

&lt;p&gt;For gaming, you want both GPUs active. For battery life, you want the dGPU completely off.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;pacman &lt;span class="nt"&gt;-S&lt;/span&gt; supergfxctl
supergfxctl &lt;span class="nt"&gt;-s&lt;/span&gt;    &lt;span class="c"&gt;# list supported modes&lt;/span&gt;
supergfxctl &lt;span class="nt"&gt;-g&lt;/span&gt;    &lt;span class="c"&gt;# check current mode&lt;/span&gt;
supergfxctl &lt;span class="nt"&gt;-m&lt;/span&gt; Integrated   &lt;span class="c"&gt;# iGPU only, saves battery&lt;/span&gt;
supergfxctl &lt;span class="nt"&gt;-m&lt;/span&gt; Hybrid       &lt;span class="c"&gt;# both GPUs, for gaming&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That covers the essentials. I wrote a longer post with full &lt;code&gt;hyprland.conf&lt;/code&gt; snippets, troubleshooting tips for NVIDIA-specific quirks, and screenshots of the setup: &lt;a href="https://dubeykartikay.com/posts/hyprland-dual-gpu-gaming-laptops/" rel="noopener noreferrer"&gt;How to Setup Dual GPU Systems in Hyprland&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>hyprland</category>
      <category>linux</category>
      <category>nvidia</category>
      <category>gaminglaptops</category>
    </item>
    <item>
      <title>How No Man's Sky Creates 18 Quintillion Planets With Just Math</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Sun, 03 May 2026 16:06:20 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/how-no-mans-sky-creates-18-quintillion-planets-with-just-math-3fgf</link>
      <guid>https://dev.to/dubeykartikay/how-no-mans-sky-creates-18-quintillion-planets-with-just-math-3fgf</guid>
      <description>&lt;p&gt;No Man's Sky advertises 18 quintillion planets. That is not because someone modeled them by hand. It is because the game generates terrain, flora, and atmosphere from mathematical functions seeded by the planet's coordinates.&lt;/p&gt;

&lt;p&gt;The core idea is procedural generation, and the simplest building block is noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why raw randomness fails
&lt;/h2&gt;

&lt;p&gt;If you fill a height map with random numbers, you get chaos. Real terrain has smooth transitions: hills blend into valleys, coastlines curve gradually. The solution is a noise function that produces smooth, continuous random values.&lt;/p&gt;

&lt;p&gt;Perlin noise does exactly this. It generates values that vary gradually across space, so nearby points have similar heights. Feed a 2D grid of Perlin noise into a renderer, add color and lighting, and you get something that looks like terrain.&lt;/p&gt;

&lt;p&gt;The trick is layering. A single layer of Perlin noise looks too uniform, like rolling hills with no variation. Games stack multiple layers at different frequencies and amplitudes. Low-frequency layers define the broad shape of continents. High-frequency layers add rocks, cracks, and surface detail. This is called fractal Brownian motion, and it is the reason generated worlds look organic instead of synthetic.&lt;/p&gt;
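
&lt;p&gt;The layering can be sketched in one dimension. For brevity this Python example uses simple value noise rather than true Perlin (gradient) noise, and the seed string, octave count, &lt;code&gt;lacunarity&lt;/code&gt;, and &lt;code&gt;gain&lt;/code&gt; are arbitrary choices for illustration.&lt;/p&gt;

```python
import math
import random


def lattice(seed, i):
    """Deterministic pseudo-random height in [0, 1) for integer lattice point i."""
    return random.Random(f"{seed}:{i}").random()


def smooth_noise(seed, x):
    """1D value noise: lattice heights blended with a smoothstep curve."""
    i = math.floor(x)
    t = x - i
    t = t * t * (3 - 2 * t)
    return (1 - t) * lattice(seed, i) + t * lattice(seed, i + 1)


def fbm(seed, x, octaves=4, lacunarity=2.0, gain=0.5):
    """Sum octaves: each one doubles the frequency and halves the amplitude."""
    total, amplitude, frequency, norm = 0.0, 1.0, 1.0, 0.0
    for octave in range(octaves):
        total += amplitude * smooth_noise(f"{seed}/{octave}", x * frequency)
        norm += amplitude
        amplitude *= gain
        frequency *= lacunarity
    return total / norm  # normalized back into the 0..1 range


heights = [round(fbm("planet-42", x / 10), 3) for x in range(8)]
print(heights)
```

&lt;p&gt;The first octave sets the broad shape and each later octave adds finer, weaker detail. Raising &lt;code&gt;octaves&lt;/code&gt; adds more surface roughness without changing the large features.&lt;/p&gt;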

&lt;h2&gt;
  
  
  What No Man's Sky adds
&lt;/h2&gt;

&lt;p&gt;Sean Murray and the team at Hello Games went further than basic layered noise. Their GDC talk outlines several techniques:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain warping&lt;/strong&gt; twists the noise field itself. Instead of sampling noise at the raw coordinates, you sample at coordinates that have been displaced by another noise function. This creates caves, overhangs, and twisted terrain that straight noise cannot produce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filtering and image processing&lt;/strong&gt; clean up the raw noise. Unfiltered procedural terrain often looks muddy or repetitive. The team runs filters to emphasize ridges and valleys, suppress bland regions, and sculpt the terrain into more interesting shapes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DEM blending&lt;/strong&gt; mixes in real-world elevation data for grounding. The risk is making everything look like Earth, which is familiar but boring. The game uses this sparingly, blending real data with warped noise to keep things alien but plausible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Biome rules&lt;/strong&gt; layer on top of the terrain. Temperature, humidity, and elevation determine what plants and animals spawn. These rules are also procedural, driven by the same coordinate seeds that generated the planet itself. Visit the same planet twice and you get the same terrain and the same wildlife. Visit a different planet and everything changes.&lt;/p&gt;

&lt;p&gt;The result is a universe where every planet is deterministic (the same seed always produces the same world) but effectively infinite (the coordinate space is so large you will never see the same planet twice).&lt;/p&gt;
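
&lt;p&gt;Domain warping and seed determinism fit in a short Python sketch. Everything here is a simplified stand-in: the noise is 1D value noise, and the seed strings and &lt;code&gt;strength&lt;/code&gt; value are invented for the example rather than taken from the game.&lt;/p&gt;

```python
import math
import random


def value_noise(seed, x):
    """Smooth 1D noise in the 0..1 range; a stand-in for the game's real noise."""
    i = math.floor(x)
    t = x - i
    t = t * t * (3 - 2 * t)
    a = random.Random(f"{seed}:{i}").random()
    b = random.Random(f"{seed}:{i + 1}").random()
    return (1 - t) * a + t * b


def warped_height(planet_seed, x, strength=2.0):
    """Domain warping: displace the sample coordinate with a second noise field."""
    offset = strength * (value_noise(f"{planet_seed}/warp", x) - 0.5)
    return value_noise(f"{planet_seed}/terrain", x + offset)


first_visit = [warped_height("planet-42", x / 4) for x in range(6)]
second_visit = [warped_height("planet-42", x / 4) for x in range(6)]
elsewhere = [warped_height("planet-43", x / 4) for x in range(6)]
print(first_visit == second_visit)  # True: same seed, same terrain
print(first_visit == elsewhere)     # a different seed yields different heights
```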

&lt;p&gt;If you want to see the Perlin noise graphs and a deeper walkthrough of the layering math: &lt;a href="https://dubeykartikay.com/posts/procedural-generation-no-mans-sky/" rel="noopener noreferrer"&gt;How No Man's Sky Creates 18 Quintillion Planets&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gamedev</category>
      <category>proceduralgeneration</category>
      <category>perlinnoise</category>
      <category>nomansky</category>
    </item>
    <item>
      <title>Reading Algorithms Like an Engineer: What DiskANN Taught Me About Pseudocode</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Sun, 03 May 2026 16:06:18 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/reading-algorithms-like-an-engineer-what-diskann-taught-me-about-pseudocode-2979</link>
      <guid>https://dev.to/dubeykartikay/reading-algorithms-like-an-engineer-what-diskann-taught-me-about-pseudocode-2979</guid>
      <description>&lt;p&gt;The first time I implemented Vamana from the DiskANN paper, my approximate nearest neighbor index was slower than brute force. On tiny test fixtures, brute force took 0.27 ms per query. My Vamana implementation took 22.98 ms.&lt;/p&gt;

&lt;p&gt;That sounds absurd. ANN exists to skip work. The problem was not the algorithm. It was how I mapped the paper's abstractions to actual data structures.&lt;/p&gt;

&lt;h2&gt;
  
  
  A set is not a data structure
&lt;/h2&gt;

&lt;p&gt;The DiskANN pseudocode talks about sets &lt;code&gt;L&lt;/code&gt;, &lt;code&gt;V&lt;/code&gt;, and &lt;code&gt;Nout(p)&lt;/code&gt;. That is fine for explanation. Code cannot store an abstract set.&lt;/p&gt;

&lt;p&gt;When the paper says &lt;code&gt;L&lt;/code&gt; (the candidate list), I had to decide: sorted vector? heap? bounded priority queue? How do I find the closest unvisited element? How do I enforce the search-list bound? How do I remove duplicates?&lt;/p&gt;

&lt;p&gt;When the paper says &lt;code&gt;V&lt;/code&gt; (the visited set), I had to decide: &lt;code&gt;unordered_set&lt;/code&gt;? dense bitset? boolean array? Node ids in my case were dense integers, so an indexed bit operation beat a hash-table lookup by a wide margin.&lt;/p&gt;

&lt;p&gt;When the paper says "remove candidates," I had to ask whether removal is physical or logical. In a hot loop, marking a candidate as deleted and skipping it is much cheaper than erasing from a vector and reshuffling everything behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;In my &lt;code&gt;sembed-engine&lt;/code&gt; project, I changed the implementation to match the invariants the algorithm already needed, rather than copying the pseudocode literally.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;Neighbour&lt;/code&gt; struct became &lt;code&gt;{ float distance; NodeId node; bool marked; }&lt;/code&gt;. A &lt;code&gt;SortedBoundedVector&lt;/code&gt; kept candidates sorted as they were inserted, capped the list size, rejected duplicates, and tracked the next unexpanded node. Visited tracking moved to &lt;code&gt;boost::dynamic_bitset&lt;/code&gt;. Pruning switched from physical deletion to marker-style bookkeeping.&lt;/p&gt;

&lt;p&gt;The algorithm did not change. Only the data structures underneath it did.&lt;/p&gt;

&lt;p&gt;After the fix, Vamana went from 22.98 ms to 0.02 ms on the same small fixture. On a larger dataset, it delivered 5.34x the query throughput of brute force while keeping recall at 1.0.&lt;/p&gt;
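
&lt;p&gt;A bounded, sorted, duplicate-free candidate list might look like the Python sketch below. It is a simplified illustration, not the &lt;code&gt;SortedBoundedVector&lt;/code&gt; from the project: the class name &lt;code&gt;BoundedCandidates&lt;/code&gt;, its methods, and the use of a set in place of a per-entry &lt;code&gt;marked&lt;/code&gt; flag are assumptions.&lt;/p&gt;

```python
import bisect


class BoundedCandidates:
    """Hypothetical stand-in for a sorted, bounded candidate list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []        # (distance, node) pairs, kept sorted
        self.nodes = set()     # current members, for duplicate rejection
        self.expanded = set()  # logical marks instead of physical removal

    def insert(self, distance, node):
        if node in self.nodes:
            return False                     # reject duplicates
        bisect.insort(self.items, (distance, node))
        self.nodes.add(node)
        if len(self.items) > self.capacity:  # enforce the bound at insert time
            _, dropped = self.items.pop()
            self.nodes.discard(dropped)
            return dropped != node
        return True

    def next_unexpanded(self):
        """Closest candidate not yet expanded; marks it instead of erasing it."""
        for _, node in self.items:
            if node not in self.expanded:
                self.expanded.add(node)
                return node
        return None


cands = BoundedCandidates(capacity=2)
cands.insert(0.9, 7)
cands.insert(0.4, 3)
cands.insert(0.4, 3)  # duplicate, ignored
cands.insert(0.1, 5)  # evicts node 7, the farthest
print([n for _, n in cands.items])  # [5, 3]
print(cands.next_unexpanded(), cands.next_unexpanded(), cands.next_unexpanded())  # 5 3 None
```

&lt;p&gt;Each question from the previous section has an explicit answer here: the bound is enforced in &lt;code&gt;insert&lt;/code&gt;, duplicates are rejected by a membership check, and expansion is tracked by marking rather than deleting.&lt;/p&gt;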

&lt;h2&gt;
  
  
  The lesson
&lt;/h2&gt;

&lt;p&gt;Slow down at the nouns in pseudocode. If it says &lt;code&gt;L&lt;/code&gt;, ask what operations &lt;code&gt;L&lt;/code&gt; needs. If it says &lt;code&gt;V&lt;/code&gt;, ask how membership is checked. If it says "remove," ask whether deletion is physical or logical. If it says "bounded," ask where that bound is enforced.&lt;/p&gt;

&lt;p&gt;The paper gives the map. Implementation is the terrain.&lt;/p&gt;

&lt;p&gt;For the full benchmark data, PR details, and code snippets: &lt;a href="https://dubeykartikay.com/posts/reading-algorithms-like-an-engineer/" rel="noopener noreferrer"&gt;Reading Algorithms Like an Engineer&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>ann</category>
      <category>cpp</category>
      <category>diskann</category>
    </item>
    <item>
      <title>Why You Should Never Use std::unordered_set in Hot C++ Loops</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Sun, 03 May 2026 16:06:17 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/why-you-should-never-use-stdunorderedset-in-hot-c-loops-2lc4</link>
      <guid>https://dev.to/dubeykartikay/why-you-should-never-use-stdunorderedset-in-hot-c-loops-2lc4</guid>
      <description>&lt;p&gt;Hash tables feel like the default choice for membership tests. &lt;code&gt;std::unordered_set&lt;/code&gt; promises average &lt;code&gt;O(1)&lt;/code&gt; lookup, so we reach for it automatically. In performance-sensitive C++ code, that habit can cost you an order of magnitude.&lt;/p&gt;

&lt;p&gt;I ran into this while building a Vamana graph index for approximate nearest neighbor search. The algorithm needs to track visited nodes. Node ids are dense integers, and the visited check runs inside the hottest loop in the entire search path.&lt;/p&gt;

&lt;p&gt;My first implementation used &lt;code&gt;std::unordered_set&amp;lt;uint32_t&amp;gt;&lt;/code&gt;. It was correct, and it was slow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the benchmark says
&lt;/h2&gt;

&lt;p&gt;I generated 1000 vectors of random &lt;code&gt;uint32_t&lt;/code&gt; ids and deduplicated them using three approaches: &lt;code&gt;std::unordered_set&lt;/code&gt;, &lt;code&gt;sort + unique&lt;/code&gt;, and &lt;code&gt;boost::dynamic_bitset&amp;lt;&amp;gt;&lt;/code&gt;. For dense ids sampled from &lt;code&gt;[0, 2n)&lt;/code&gt;, the numbers were brutal:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;unordered_set ms&lt;/th&gt;
&lt;th&gt;sort+unique ms&lt;/th&gt;
&lt;th&gt;boost bitset ms&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32,768&lt;/td&gt;
&lt;td&gt;1,649&lt;/td&gt;
&lt;td&gt;1,455&lt;/td&gt;
&lt;td&gt;177&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500,000&lt;/td&gt;
&lt;td&gt;50,302&lt;/td&gt;
&lt;td&gt;26,759&lt;/td&gt;
&lt;td&gt;3,423&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At &lt;code&gt;n = 500,000&lt;/code&gt;, the bitset was 14.7x faster than &lt;code&gt;std::unordered_set&lt;/code&gt;. The hash table had to hash keys, grow buckets, rehash, and chase pointers through memory. The bitset did one indexed memory operation per id.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sort + unique&lt;/code&gt; also beat the hash table at scale because it walks contiguous memory, and CPUs love that.&lt;/p&gt;
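
&lt;p&gt;The difference between the two membership checks is visible even in a Python sketch. It uses one flag byte per id instead of a packed bitset, and the function names and sample ids are hypothetical. It shows the access pattern only and is not a benchmark.&lt;/p&gt;

```python
def dedup_dense(ids, universe):
    """Keep the first occurrence of each id; ids must lie in [0, universe)."""
    seen = bytearray(universe)  # one flag byte per possible id, zero-initialized
    unique = []
    for i in ids:
        if not seen[i]:         # membership test is a single indexed load
            seen[i] = 1         # insertion is a single indexed store
            unique.append(i)
    return unique


def dedup_hashed(ids):
    """Same result through a hash set, for ids that are sparse or unbounded."""
    seen = set()
    unique = []
    for i in ids:
        if i not in seen:
            seen.add(i)
            unique.append(i)
    return unique


ids = [4, 1, 4, 7, 1, 0]
print(dedup_dense(ids, universe=8))  # [4, 1, 7, 0]
print(dedup_hashed(ids))             # [4, 1, 7, 0]
```

&lt;p&gt;Allocating &lt;code&gt;bytearray(universe)&lt;/code&gt; is the fixed cost of the dense approach: it grows with the size of the id universe, not with the number of ids actually seen.&lt;/p&gt;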

&lt;h2&gt;
  
  
  When the hash table wins
&lt;/h2&gt;

&lt;p&gt;Sparse ids change the picture. When I sampled only &lt;code&gt;n&lt;/code&gt; ids from a universe of 100,000,000 possible values, the bitset had to clear a massive mostly-empty array before every vector:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;unordered_set ms&lt;/th&gt;
&lt;th&gt;boost bitset ms&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;6.3&lt;/td&gt;
&lt;td&gt;149.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2,048&lt;/td&gt;
&lt;td&gt;91.9&lt;/td&gt;
&lt;td&gt;145.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65,536&lt;/td&gt;
&lt;td&gt;4,169.3&lt;/td&gt;
&lt;td&gt;985.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For small sparse inputs, &lt;code&gt;std::unordered_set&lt;/code&gt; is genuinely better. The bitset only pulls ahead once the input is large enough to amortize the fixed clearing cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical rule
&lt;/h2&gt;

&lt;p&gt;Reach for &lt;code&gt;std::unordered_set&lt;/code&gt; when ids are sparse, unbounded, or not integer-indexable. When ids are dense integers inside a hot loop, make the membership check an indexed load or store instead.&lt;/p&gt;

&lt;p&gt;The CPU does not care about your Big-O notation. It cares about memory access patterns.&lt;/p&gt;

&lt;p&gt;I wrote a longer post with the full methodology, assembly-level analysis, and raw CSV data: &lt;a href="https://dubeykartikay.com/posts/why-never-use-std-unordered-set/" rel="noopener noreferrer"&gt;Why You Should Never Use a set&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>performance</category>
      <category>algorithms</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>Optimize Hugo Blog Performance: Zero JS and 100% Lighthouse Score</title>
      <dc:creator>kartikay dubey</dc:creator>
      <pubDate>Sat, 04 Apr 2026 14:30:32 +0000</pubDate>
      <link>https://dev.to/dubeykartikay/optimize-hugo-blog-performance-zero-js-and-100-lighthouse-score-4128</link>
      <guid>https://dev.to/dubeykartikay/optimize-hugo-blog-performance-zero-js-and-100-lighthouse-score-4128</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I reduced my Hugo blog's page weight by eliminating 3.6 MB of JavaScript and 40 KB of external CSS, achieving a 100% JS-free frontend. Key optimizations included HTML minification, inlining CSS, switching to native MathML, and pre-rendering Mermaid diagrams server-side.&lt;/p&gt;

&lt;p&gt;I recently looked into my blog's performance and was surprised to find my pages were downloading over 3.6 MB of JavaScript and render-blocking CSS on every load. For a simple static site, this was too much, so I decided to optimize it.&lt;/p&gt;

&lt;p&gt;Here is the step-by-step breakdown of how I reduced my payload and removed JavaScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Baseline
&lt;/h2&gt;

&lt;p&gt;Before starting, my site had some major issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTML Size:&lt;/strong&gt; 86,348 bytes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JS Size:&lt;/strong&gt; 3,617,515 bytes (3.6 MB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSS Size:&lt;/strong&gt; 40,560 bytes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Issue:&lt;/strong&gt; Large render-blocking JS and CSS files for Mermaid diagrams and math rendering were loaded on every single page. The HTML was also not minified.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxueyze3w5x0fyw4xvuuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxueyze3w5x0fyw4xvuuh.png" alt="Hugo Blog Performance Baseline Metrics" width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization 1: HTML Minification
&lt;/h2&gt;

&lt;p&gt;The first step was simple: adding &lt;code&gt;minifyOutput = true&lt;/code&gt; under the &lt;code&gt;[minify]&lt;/code&gt; table in &lt;code&gt;hugo.toml&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTML Size:&lt;/strong&gt; 72,370 bytes (16% smaller)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Reduced parsing time for HTML, leading to a faster First Paint.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j7zizubjnffs9lg3loo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j7zizubjnffs9lg3loo.png" alt="Hugo HTML Minification Results and Faster First Paint" width="800" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization 2: Inlining CSS
&lt;/h2&gt;

&lt;p&gt;Next, I removed the &lt;code&gt;&amp;lt;link&amp;gt;&lt;/code&gt; tag pointing to my &lt;code&gt;main.css&lt;/code&gt; file and replaced it with an inline &lt;code&gt;&amp;lt;style&amp;gt;{{.Content|safeCSS}}&amp;lt;/style&amp;gt;&lt;/code&gt; block.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTML Size:&lt;/strong&gt; Increased to 127,350 bytes (because CSS is now inside the HTML document).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; This eliminated 1 critical render-blocking HTTP request. The browser no longer waits for an external CSS fetch, which improves &lt;strong&gt;First Contentful Paint (FCP)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpdsfhulc7wstyg965xd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpdsfhulc7wstyg965xd.png" alt="Hugo Inline CSS Performance Impact and FCP Improvement" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization 3: Native MathML
&lt;/h2&gt;

&lt;p&gt;My blog used the KaTeX library (JS, CSS, and fonts) to render equations. I removed it and enabled Hugo's Goldmark passthrough extension to render native MathML instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTML Size:&lt;/strong&gt; 123,341 bytes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JS Size:&lt;/strong&gt; 3,338,725 bytes (278 KB smaller)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSS Size:&lt;/strong&gt; 0 bytes (Removed KaTeX CSS, meaning zero external stylesheets are loaded).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; A significant reduction in payload size. I removed the need for JavaScript and font files for math. The browser now renders it natively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmjsdg57d0e3suhzal9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmjsdg57d0e3suhzal9y.png" alt="Hugo Native MathML vs KaTeX JavaScript Performance" width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization 4: Conditional Asset Loading
&lt;/h2&gt;

&lt;p&gt;My Mermaid script was loading on every page. I used Hugo's &lt;code&gt;.Store&lt;/code&gt; to set a &lt;code&gt;hasMermaid&lt;/code&gt; flag while processing Markdown, and the template now injects the Mermaid &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag only when that flag is true.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTML Size:&lt;/strong&gt; 117,632 bytes (Saved 6 KB across all generated pages).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Text-only blog posts no longer force the browser to download &lt;code&gt;mermaid.min.js&lt;/code&gt;. The JavaScript is only loaded when necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsojhfxtl4uqly41rsy3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsojhfxtl4uqly41rsy3u.png" alt="Hugo Conditional Asset Loading for Mermaid JS on Text Pages" width="800" height="161"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;(Text-only pages don't load Mermaid)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhf1c3k5tsyc3fws5uqe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhf1c3k5tsyc3fws5uqe.png" alt="Hugo Mermaid Diagram Rendering Output with Conditional Logic" width="800" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Pages with diagrams load Mermaid conditionally)&lt;/em&gt;&lt;/p&gt;
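
&lt;p&gt;The flag logic can be sketched outside Hugo. This is a minimal Python illustration, not the site's actual template code: &lt;code&gt;has_mermaid&lt;/code&gt;, &lt;code&gt;scripts_for_page&lt;/code&gt;, and the script path are hypothetical names, and the detection rule (a fenced block tagged &lt;code&gt;mermaid&lt;/code&gt;) is an assumption about how the flag gets set.&lt;/p&gt;

```python
# Sketch of the flag logic only; the real site does this in Hugo templates.
FENCE = "`" * 3  # built this way so the sample can live inside a fenced block

# Hypothetical script path, used only for illustration.
MERMAID_SCRIPT = "/js/mermaid.min.js"


def has_mermaid(markdown):
    """Return True if any line opens a fenced block tagged 'mermaid'."""
    opener = FENCE + "mermaid"
    return any(line.strip() == opener for line in markdown.splitlines())


def scripts_for_page(markdown):
    """List the script URLs a page should reference; empty for text-only pages."""
    return [MERMAID_SCRIPT] if has_mermaid(markdown) else []


text_only = "# Notes\n\nJust prose here.\n"
with_diagram = "# Flow\n\n" + FENCE + "mermaid\ngraph TD\n  A --- B\n" + FENCE + "\n"

print(scripts_for_page(text_only))
print(scripts_for_page(with_diagram))
```

&lt;p&gt;The text-only page gets an empty script list, so nothing is downloaded; only the page containing a diagram references the Mermaid bundle.&lt;/p&gt;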

&lt;h2&gt;
  
  
  Optimization 5: Server-side Rendering for Mermaid Diagrams
&lt;/h2&gt;

&lt;p&gt;Even when loaded conditionally, a 3.3 MB Mermaid script was heavy for the pages that still needed it. I introduced a Node.js build step that pre-renders Mermaid blocks into static SVG files, so the frontend now outputs an &lt;code&gt;&amp;lt;img src="diagram.svg"&amp;gt;&lt;/code&gt; instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JS Size:&lt;/strong&gt; 0 bytes (Removed the remaining 3.3 MB of Mermaid JavaScript).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; The site is now &lt;strong&gt;100% JavaScript-free&lt;/strong&gt; on the frontend. &lt;strong&gt;Total Blocking Time (TBT)&lt;/strong&gt; improved because the browser no longer executes JavaScript to lay out diagrams.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faidqgyhivmjmimdfzx7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faidqgyhivmjmimdfzx7e.png" alt="Hugo Server-side Rendering (SSR) for Mermaid Diagrams with Zero JS" width="800" height="100"&gt;&lt;/a&gt;&lt;/p&gt;
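
&lt;p&gt;The pre-rendering idea can be sketched as a planning pass over the Markdown. The post's build step is written in Node.js; the version below is a Python stand-in with hypothetical names (&lt;code&gt;prerender_plan&lt;/code&gt;, the &lt;code&gt;stem&lt;/code&gt; naming scheme). It only rewrites the text and lists the render commands, assuming the mermaid-cli tool &lt;code&gt;mmdc&lt;/code&gt; is what turns each diagram source into an SVG.&lt;/p&gt;

```python
import re

FENCE = "`" * 3  # built this way so the sample can live inside a fenced block

# Matches a fenced block tagged 'mermaid' and captures its body.
MERMAID_BLOCK = re.compile(
    "^" + FENCE + r"mermaid[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def prerender_plan(markdown, stem):
    """Return (rewritten_markdown, jobs).

    Each job is (source_filename, diagram_source, command). Nothing is
    executed here; the commands assume mermaid-cli ('mmdc') is installed.
    """
    jobs = []

    def replace(match):
        index = len(jobs)
        src = f"{stem}-{index}.mmd"
        svg = f"{stem}-{index}.svg"
        jobs.append((src, match.group(1), ["mmdc", "-i", src, "-o", svg]))
        # A Markdown image reference; the site generator turns it into an img tag.
        return f"![diagram]({svg})"

    return MERMAID_BLOCK.sub(replace, markdown), jobs


doc = "Intro\n\n" + FENCE + "mermaid\ngraph TD\n  A --- B\n" + FENCE + "\n\nOutro\n"
rewritten, jobs = prerender_plan(doc, "post")
print(rewritten)
for src, body, command in jobs:
    print(src, command)
```

&lt;p&gt;A real build step would write each diagram source to its &lt;code&gt;.mmd&lt;/code&gt; file and run the listed command before the site is generated, so no Mermaid JavaScript reaches the browser.&lt;/p&gt;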

&lt;h2&gt;
  
  
  Optimization 6: Early Hints &amp;amp; Caching
&lt;/h2&gt;

&lt;p&gt;Finally, I optimized the network layer. I generated a &lt;code&gt;_headers&lt;/code&gt; file that defines strict &lt;code&gt;Cache-Control&lt;/code&gt; rules for immutable assets, and the build script automatically adds &lt;code&gt;Link: &amp;lt;image&amp;gt;; rel=preload; as=image&lt;/code&gt; headers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Cloudflare now returns &lt;code&gt;103 Early Hints&lt;/code&gt; responses that tell the browser to fetch SVGs and images immediately, even before the HTML document finishes downloading. Immutable assets are served from cache on repeat visits, eliminating those secondary network fetches.&lt;/li&gt;
&lt;/ul&gt;
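
&lt;p&gt;As a rough sketch of the caching half only, the snippet below builds &lt;code&gt;_headers&lt;/code&gt; text in the path-then-indented-header layout. The function name, the path patterns, and the one-year &lt;code&gt;max-age&lt;/code&gt; are assumptions for illustration, not values from my build script, and the preload &lt;code&gt;Link&lt;/code&gt; lines are left out.&lt;/p&gt;

```python
# One year in seconds; a common choice for fingerprinted assets (assumption).
ONE_YEAR = 365 * 24 * 60 * 60


def headers_file(immutable_patterns):
    """Build _headers text: a path pattern line, then indented header lines."""
    blocks = []
    for pattern in immutable_patterns:
        blocks.append(
            f"{pattern}\n  Cache-Control: public, max-age={ONE_YEAR}, immutable\n"
        )
    # A blank line separates the rule for one pattern from the next.
    return "\n".join(blocks)


# Hypothetical path patterns, used only for illustration.
print(headers_file(["/images/*", "/diagrams/*"]))
```

&lt;p&gt;Long-lived &lt;code&gt;immutable&lt;/code&gt; caching is only safe when a changed file also gets a new URL, for example through fingerprinted filenames.&lt;/p&gt;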

&lt;h2&gt;
  
  
  Final Summary
&lt;/h2&gt;

&lt;p&gt;Across these six optimizations, the frontend's static vendor payload changed as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JS Payload:&lt;/strong&gt; 3.6 MB -&amp;gt; &lt;strong&gt;0 bytes&lt;/strong&gt; (100% reduction).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External CSS:&lt;/strong&gt; 40 KB -&amp;gt; &lt;strong&gt;0 bytes&lt;/strong&gt; (eliminated all external stylesheets, saving a round-trip on every page).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTML Payload:&lt;/strong&gt; Reduced by 16% through minification, then offset slightly by inlining the CSS via &lt;code&gt;safeCSS&lt;/code&gt;, which removes the render-blocking stylesheet request and improves &lt;strong&gt;First Contentful Paint&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Performance matters, and sometimes you don't need a heavy JS framework to deliver a fast experience!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
