DEV Community: kartikay dubey

The Fastest Set Is Often Not a Set: 4050 Duplicate-Detection Benchmarks

kartikay dubey — Tue, 02 Jun 2026 18:37:21 +0000

Duplicate detection looks solved: keep a hash set, skip what you have already seen. A benchmark suite of 4050 measurements across 480 workloads says otherwise: the right strategy can be 94x faster than std::unordered_set, and the wrong one 90,000x slower than an in-memory check, depending on what you are deduplicating and what guarantees you need.

Dense integers are an array problem

When keys are dense, bounded 32-bit integers, a hash set wastes work: it hashes, probes buckets, and chases pointers. A bitset turns membership into one indexed bit. At one million uniform integers:

strategy	ns per insert
growable bitset	5.1
sort then unique	60
roaring bitmap	165
`std::unordered_set`	483
`std::set`	1154

The bitset is 94x faster than the hash set for the same correct answer. If your key is already an array index, do not turn it into a hashing problem.
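To make the "one indexed bit" idea concrete, here is a minimal sketch of a pre-sized bitset deduplicator. It is not the benchmarked implementation: the function name dedupDense and the universe parameter are illustrative, and the sketch assumes every key is smaller than universe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the first occurrence of each key, preserving input order.
// Assumption: every key is < universe (dense, bounded ids).
std::vector<std::uint32_t> dedupDense(const std::vector<std::uint32_t>& keys,
                                      std::uint32_t universe) {
    // One bit per possible key, packed into 64-bit words.
    std::vector<std::uint64_t> seen((static_cast<std::size_t>(universe) + 63) / 64, 0);
    std::vector<std::uint32_t> unique;
    for (std::uint32_t key : keys) {
        assert(key < universe);
        std::uint64_t& word = seen[key >> 6];                       // key / 64
        const std::uint64_t mask = std::uint64_t{1} << (key & 63);  // key % 64
        if ((word & mask) == 0) {  // first time this key shows up
            word |= mask;
            unique.push_back(key);
        }
    }
    return unique;
}
```

Membership is a shift, a mask, and one load. There is no hashing and no probing, which is where the gap against the hash set comes from.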

Text keys change the cost model

For long strings, comparison and hashing dominate. Sorting with fingerprints (with full-key verification when correctness matters) can beat a hash set by 1.8x to 2.7x. For clustered duplicate strings, a hash set is excellent because recent buckets stay hot in cache.
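A rough sketch of fingerprinted sorting with full-key verification might look like the following. The names dedupByFingerprint and Fingerprint are made up for illustration, and std::hash stands in for whatever fingerprint function the benchmark suite actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Fingerprint = std::function<std::uint64_t(const std::string&)>;

// Returns each distinct key once (in fingerprint order, not input order).
// Sorting compares 64-bit fingerprints; full strings are compared only
// inside a run of equal fingerprints, so collisions cannot merge keys.
std::vector<std::string> dedupByFingerprint(
    const std::vector<std::string>& keys,
    const Fingerprint& fingerprint = std::hash<std::string>{}) {
    assert(fingerprint);
    std::vector<std::pair<std::uint64_t, std::size_t>> tagged;  // (fingerprint, index)
    tagged.reserve(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        tagged.emplace_back(fingerprint(keys[i]), i);
    }
    std::sort(tagged.begin(), tagged.end());

    std::vector<std::string> unique;
    std::size_t runStart = 0;  // where the current fingerprint run starts in `unique`
    for (std::size_t i = 0; i < tagged.size(); ++i) {
        if (i == 0 || tagged[i].first != tagged[i - 1].first) {
            runStart = unique.size();
        }
        const std::string& key = keys[tagged[i].second];
        // Verify on match: compare against keys already kept in this run.
        bool duplicate = false;
        for (std::size_t j = runStart; j < unique.size(); ++j) {
            if (unique[j] == key) {
                duplicate = true;
                break;
            }
        }
        if (!duplicate) unique.push_back(key);
    }
    return unique;
}
```

The sort only ever compares integers. Full-string comparisons happen only when two fingerprints match, which is the "verify on match" step.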

Streaming is a forgetting problem

For unbounded streams, the question is what to remember. An in-memory sliding window costs ~68 ns/event. A PostgreSQL-backed detector with per-event transactions costs ~6.1 ms/event, a 90,000x gap for the same logical check. Batching commits closes most of it.
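As a sketch of the "what to remember" trade-off, here is one possible in-memory sliding window. It is an assumption about the shape of such a detector, not the benchmarked code: the class name SlidingWindowDedup is hypothetical, and it remembers only the last capacity distinct ids.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_set>

// Remembers the most recent `capacity` distinct ids and forgets older ones.
class SlidingWindowDedup {
public:
    explicit SlidingWindowDedup(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns true if `id` is still inside the window (a duplicate).
    // Otherwise records it, evicting the oldest id when the window is full.
    bool isDuplicate(std::uint64_t id) {
        if (!seen_.insert(id).second) return true;
        order_.push_back(id);
        if (order_.size() > capacity_) {
            seen_.erase(order_.front());
            order_.pop_front();
        }
        return false;
    }

private:
    std::size_t capacity_;
    std::deque<std::uint64_t> order_;         // arrival order, oldest at the front
    std::unordered_set<std::uint64_t> seen_;  // ids currently inside the window
};
```

Anything older than the window is reported as new again. That is the guarantee you give up in exchange for bounded memory.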

A practical decision table

Dense bounded ints -> pre-sized bitset (30x to 110x faster).
Sparse 64-bit ints -> Roaring bitmap, or sort + unique on finite batches.
Long strings -> fingerprinted sorting, verify on match.
Streaming, bounded memory -> in-memory sliding window.
Streaming, durable -> RocksDB or Postgres with batched writes.

The fastest set is often not a set. It is the data structure your key space was trying to be.

For all 4050 measurements, the winner heatmaps, and the streaming benchmarks: The Shape of Duplicate Detection.

How to Install Boost in Any C++ Project: CMake, vcpkg, Conan, and More

kartikay dubey — Tue, 02 Jun 2026 18:36:14 +0000

If you have written much C++, you have reached for Boost, and probably lost an afternoon to linker errors getting it installed. Here are the practical ways to add Boost to a project, with the snippets that actually work.

Header-only vs compiled

Boost is ~160 libraries in two camps. Header-only ones (asio, beast, mp11, hana, pfr) need only an #include. Compiled ones (filesystem, program_options, thread, regex, serialization, log) ship .so/.a files you must link. One gotcha: boost::system has been mostly header-only since 1.69, so you rarely need -lboost_system anymore, despite what older tutorials say.

Method 1: CMake find_package (system Boost)

Install through your package manager, then let CMake find it:

find_package(Boost 1.71 REQUIRED COMPONENTS filesystem program_options)

target_link_libraries(my_app PRIVATE
    Boost::filesystem
    Boost::program_options
    Boost::headers        # header-only libs
)

Install commands per platform:

sudo apt install libboost-all-dev   # Ubuntu/Debian (~500 MB; prefer per-component)
sudo dnf install boost-devel        # Fedora/RHEL
sudo pacman -S boost boost-libs     # Arch
brew install boost                  # macOS

Method 2: FetchContent (and why it bites)

FetchContent works, but Boost's modular CMake means you must declare component dependencies explicitly or you hit cryptic missing-target errors. It also compiles Boost as part of your build, which is slow. Good for reproducibility, bad for iteration speed.

Methods 3 & 4: vcpkg and Conan

Package managers give you pinned, reproducible Boost that plugs into CMake's find_package:

# vcpkg manifest mode: list boost in vcpkg.json, then configure with the toolchain file
cmake -B build -DCMAKE_TOOLCHAIN_FILE=.../vcpkg/scripts/buildsystems/vcpkg.cmake

Conan is the same idea with a conanfile.txt.

Method 5: Manual g++ linking

No build system, just the compiler:

g++ main.cpp -o app                            # header-only
g++ main.cpp -o app -lboost_filesystem         # compiled lib
g++ main.cpp -o app -l:libboost_filesystem.a   # static

Link order matters: dependents come before dependencies.

Which method should you use?

System find_package for quick local builds.
vcpkg or Conan for reproducible, cross-platform projects.
Build from source with b2 only when you need a specific version or custom variant.

For the complete walkthrough, including building from source with b2, Nix, Docker dev containers, static-linking details, and an FAQ: How to Install Boost in Any C++ Project.

Making a Go Log Viewer 12x Faster (and the Benchmark Bug That Fooled Me)

kartikay dubey — Tue, 02 Jun 2026 18:36:10 +0000

I built Peacock, a terminal JSON log viewer in Go, and it could not keep up with a busy log stream. So I profiled it with go tool pprof: read the profile, fix the hottest line, re-profile, repeat. On a real 70x240 terminal, throughput went from 52 lines/sec to 651 lines/sec, about 12x.

The most useful lesson, though, came from an evening I lost to a bug in my own benchmark.

Cleanup: a pointless join/split

The baseline profile flagged a viewport setter eating 8% of CPU. SetContent takes a string and splits it on \n:

0   29.66s   227:   m.SetContentLines(strings.Split(s, "\n"))

But my code already had a []string. It was joining the lines into one giant string with lipgloss.JoinVertical, just so SetContent could split them again. Calling SetContentLines directly removed the round trip.

The real win: render only what is visible

The hottest function rendered every buffered entry on every frame:

70ms   8.06s   130:   rendered, _ := m.styles.renderEntry(m.visibleEntries[i], width)

The terminal shows ~70 lines, yet Peacock was word-wrapping the entire backlog each frame. I capped rendering to the viewport height. contentLines cumulative time dropped from 66.86% to 14.11%. This single algorithmic change carried the practical win.

A ring buffer instead of slice trimming

Appending entries and re-slicing the backlog churned memmove and the GC. A fixed-size circular buffer overwrites in place:

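// Append adds an entry, overwriting the oldest one once the ring is full.
// Assumes len(r.entries) > 0.
func (r *entryRing) Append(entry logs.Entry) {
    if r.size == len(r.entries) {
        // Full: replace the oldest entry and advance the start index.
        r.entries[r.start] = entry
        r.start = (r.start + 1) % len(r.entries)
        return
    }
    // Not full yet: write into the next free slot after the newest entry.
    r.entries[(r.start+r.size)%len(r.entries)] = entry
    r.size++
}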

After this, appendEntry disappeared from the profile.

The cache that did nothing, and the 0x0 terminal

I cached each rendered entry by viewport width. Throughput did not move at all:

Ring buffer:    6,095 l/s
Cache rendered: 6,095 l/s

I re-read the cache logic three times. The bug was not in my code:

$ script -q -c 'tput lines; tput cols' /dev/null
0
0

The benchmark's pseudo-terminal had no dimensions. With width 0, the wrap function returned early, so there was almost no rendering work for the cache to skip. I set explicit stty rows/cols plus LINES/COLUMNS, and the cache finally showed a 3x jump.

Lessons

The biggest wins are algorithmic. Visible-only rendering beat every string-allocation trick.
Your benchmark is part of the system. When an optimization shows zero improvement, suspect the measurement before the code.

For every pprof command, every profile output, and the full corrected throughput ladder: Making a Log Viewer 12x Faster.

How 3 Lines of Code Caused a 10x Kafka Throughput Drop

kartikay dubey — Sun, 03 May 2026 16:06:30 +0000

In August 2025, a user reported that Apache Kafka v3.9.0 dropped consumer throughput by 10x. Other users reproduced it. The culprit was a configuration called min.insync.replicas, and the fix was three lines of code.

The report

Sharad Garg opened a ticket titled "Consumer throughput drops by 10 times with Kafka v3.9.0 in ZK mode." Ritvik Gupta ran controlled tests and traced the issue to min.insync.replicas. Raising it from 1 to 2 caused a massive drop:

Test	Message Rate	Configuration
1 Producer 1 Consumer	89.21	min.insync.replicas = 2
1 Producer 1 Consumer	298.99	min.insync.replicas = 1

Another user reported throughput falling from 147 MB/s on Kafka 3.4 to 58 MB/s on Kafka 3.9 with the same setting.

The root cause

Chia-Ping Tsai, a long-time Kafka contributor, identified the issue. It traced back to KAFKA-15583, titled "High watermark can only advance if ISR size is larger than min ISR."

The high watermark (HW) is the offset of the latest message copied to all in-sync replicas. Consumers are only allowed to read up to the HW. This guarantees that consumed data will not disappear if a broker crashes.

The change added this check inside the leader's watermark advancement logic:

if (isUnderMinIsr) {
  trace(s"Not increasing HWM because partition is under min ISR")
  return false
}

Before v3.9.0, min.insync.replicas only affected producers using acks=all. It dictated how many replicas had to acknowledge a write before the producer considered it successful. It had nothing to do with consumers.

After v3.9.0, the same setting also blocks consumer reads. If a slow follower drops out of the ISR and the ISR shrinks below min.insync.replicas, the leader stops advancing the high watermark until enough followers catch up and rejoin. Consumers stall until the watermark moves again.

Why this is a feature, not a bug

Kafka prioritizes durability over throughput. Blocking reads until min.insync.replicas are healthy prevents consumers from reading data that has not been sufficiently replicated. If the leader crashes after a consumer reads an under-replicated message, that message is gone, and the consumer has already processed it.

The trade-off is real. The change arguably deserved a major version bump, because a 10x throughput drop in a minor release can break production pipelines.

The fix

If you hit this, your options are straightforward:

Lower min.insync.replicas if your durability requirements allow it.
Ensure followers have enough resources to keep up with the leader.
Monitor ISR size and follower lag as critical metrics.

Three lines of code. A massive performance impact. A reminder that distributed systems are full of sharp edges.

For the full timeline, mailing list discussion, and the exact PR diff: How a Minor Release Caused a 10x Throughput Drop in Kafka.

Optimizing My Hugo Blog: From 3.6 MB of JavaScript to Zero

kartikay dubey — Sun, 03 May 2026 16:06:28 +0000

My Hugo blog was downloading 3.6 MB of JavaScript and 40 KB of external CSS on every page load. For a static blog with mostly text and a few diagrams, that was absurd. Here is how I fixed it.

Baseline

HTML: 86 KB
JavaScript: 3.6 MB (Mermaid + KaTeX)
CSS: 40 KB (KaTeX stylesheets)
Problem: render-blocking scripts loaded on every page for math and diagrams

Optimization 1: HTML minification

Adding minifyOutput = true to hugo.toml shrank HTML by 16%. Small win, zero risk.

Optimization 2: Inline CSS

I removed the external main.css link and inlined the styles directly into the HTML. The HTML grew slightly, but I eliminated one render-blocking network request. First Contentful Paint improved because the browser no longer waits for a CSS fetch.

Optimization 3: Native MathML

My blog used KaTeX to render equations. That meant JavaScript, CSS, and font files for every page with math. I switched to Hugo's Goldmark passthrough extensions, which output native MathML. Browsers render this directly.

Result: 278 KB of JavaScript removed, all external stylesheets eliminated. Math now renders without any scripts or fonts.

Optimization 4: Conditional asset loading

Mermaid.js was loading on every page, even text-only posts. I used Hugo's .Store to set a hasMermaid flag during Markdown processing. The script tag is injected only when a page actually contains a diagram.

Text-only pages no longer download Mermaid. Diagram pages still get it, but only when needed.

Optimization 5: Server-side rendering for Mermaid

Even conditional loading left a 3.3 MB script on diagram pages. I added a Node.js build step that pre-renders Mermaid blocks into static SVG files at build time. The frontend outputs <img src="diagram.svg"> instead of a <script> tag.

Result: zero JavaScript on the frontend. Total Blocking Time dropped because the browser no longer executes JS to calculate layouts.

Optimization 6: Early Hints and caching

I generated a _headers file with strict Cache-Control rules for immutable assets. The build script also injects Link: rel=preload headers for images and SVGs. Cloudflare returns 103 Early Hints, telling the browser to fetch assets before the HTML document finishes downloading.

Summary

Metric	Before	After
JavaScript	3.6 MB	0 bytes
External CSS	40 KB	0 bytes
HTML	86 KB	72 KB (minified)

The site is now 100% JavaScript-free on the frontend. Performance matters, and static sites do not need a heavy JS framework to be fast.

For the full hugo.toml config, build scripts, and Lighthouse score breakdown: Optimizing My Hugo Blog.

Vector Databases and Semantic Search: A Practical Introduction

kartikay dubey — Sun, 03 May 2026 16:06:26 +0000

Traditional search engines match keywords. If you search for "dog shelters around Gurgaon" and the indexed page says "animal shelters near Delhi," you get no results. The words do not overlap.

Semantic search fixes this by converting text into vectors. Similar ideas end up close together in vector space, even when the words differ.

From words to vectors

An embedding model takes a word or sentence and produces a high-dimensional vector. The key property: semantically similar inputs produce vectors that are close to each other. "Dog" and "animal" sit near each other. "Dog" and "car" do not.

For a search engine, the pipeline is straightforward:

Convert every document in the corpus into a vector and store it.
Convert the user's query into a vector using the same model.
Find the documents whose vectors are closest to the query vector.

The hard part is step 3. A corpus of a million documents with 768-dimensional vectors is roughly 3 GB of float32 data (1,000,000 x 768 x 4 bytes). Computing the exact distance from the query to every document is too slow for interactive search.

Approximate Nearest Neighbors

Exact search is O(n) per query: it compares the query against every stored vector. ANN algorithms trade a small amount of accuracy for massive speedups. The metric is recall@k: out of the true k closest vectors, how many does the approximation find? A 95% recall@100 means 95 of the 100 true nearest neighbors are returned.
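Recall@k is simple enough to compute directly. A small sketch, where the function name recallAtK is illustrative and both lists are assumed to be ordered closest-first:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Fraction of the true k nearest neighbors that appear in the first k results.
// `truth` comes from exact search, `found` from the approximate index.
double recallAtK(const std::vector<std::uint32_t>& truth,
                 const std::vector<std::uint32_t>& found, std::size_t k) {
    assert(k > 0 && truth.size() >= k);
    std::unordered_set<std::uint32_t> expected(
        truth.begin(), truth.begin() + static_cast<std::ptrdiff_t>(k));
    std::size_t hits = 0;
    const std::size_t limit = std::min(k, found.size());
    for (std::size_t i = 0; i < limit; ++i) {
        // erase returns 1 on a hit and keeps a repeated id from counting twice.
        hits += expected.erase(found[i]);
    }
    return static_cast<double>(hits) / static_cast<double>(k);
}
```

With truth = {1, 2, 3, 4} and found = {4, 9, 1, 7}, two of the four true neighbors are present, so recall@4 is 0.5.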

Graph-based ANN builds a navigable graph over the dataset. Search starts at an entry point and greedily walks toward the query. Each step moves to the neighbor closest to the query, expanding the frontier until the best candidates are found.

DiskANN and Vamana

Microsoft Research developed DiskANN and the Vamana index to make graph-based ANN work at scale. The algorithm has three pieces:

Greedy Search maintains a candidate list and a visited set. It repeatedly expands the closest unvisited candidate, adds its graph neighbors, and keeps the best candidates bounded by a search-list size.

Robust Prune builds the graph edges. For each point, it considers possible neighbors and keeps a bounded set of useful outgoing edges. An alpha parameter controls how aggressively candidates are pruned.

Vamana Construction iterates over the dataset in random order. For each point, it runs greedy search, prunes the visited set into outgoing edges, adds backlinks, and repairs any degree violations.

The result is a sparse graph where greedy search finds high-recall neighbors quickly.
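The greedy search step can be sketched in a few dozen lines. This is a simplified illustration under my own assumptions (adjacency lists in memory, squared L2 distance, a re-sorted vector as the candidate list), not DiskANN's production code. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Point = std::vector<float>;
using Graph = std::vector<std::vector<std::uint32_t>>;  // graph[i] = out-neighbors of i

struct Candidate {
    float distance;  // squared distance to the query
    std::uint32_t node;
    bool expanded;   // true once this node's neighbors have been added
};

float squaredL2(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

// Returns up to k node ids, closest first, using a candidate list capped at listSize.
std::vector<std::uint32_t> greedySearch(const std::vector<Point>& points, const Graph& graph,
                                        std::uint32_t entry, const Point& query,
                                        std::size_t k, std::size_t listSize) {
    assert(entry < points.size() && k > 0 && listSize >= k);
    std::vector<bool> offered(points.size(), false);  // nodes already added to the list once
    std::vector<Candidate> list;
    list.push_back({squaredL2(points[entry], query), entry, false});
    offered[entry] = true;

    for (;;) {
        // The list stays sorted, so the first unexpanded entry is the closest one.
        auto next = std::find_if(list.begin(), list.end(),
                                 [](const Candidate& c) { return !c.expanded; });
        if (next == list.end()) break;
        next->expanded = true;
        const std::uint32_t current = next->node;  // copy: push_back may reallocate
        for (std::uint32_t neighbor : graph[current]) {
            if (offered[neighbor]) continue;
            offered[neighbor] = true;
            list.push_back({squaredL2(points[neighbor], query), neighbor, false});
        }
        std::sort(list.begin(), list.end(), [](const Candidate& a, const Candidate& b) {
            return a.distance != b.distance ? a.distance < b.distance : a.node < b.node;
        });
        if (list.size() > listSize) list.resize(listSize);  // enforce the search-list bound
    }

    std::vector<std::uint32_t> result;
    for (std::size_t i = 0; i < list.size() && i < k; ++i) result.push_back(list[i].node);
    return result;
}
```

Re-sorting the whole list on every step keeps the sketch short. A real implementation would insert in sorted position instead.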

Why this matters

Vector databases like Pinecone, Weaviate, and Milvus package these ideas into production systems. They handle indexing, query routing, replication, and metadata filtering. If you are building semantic search, recommendation, or retrieval-augmented generation, you are probably using these algorithms whether you know it or not.

For the full mathematical walkthrough with pseudocode, LaTeX equations, and diagrams: How Google Search Actually Works.

How I Made My Vector Search Engine 16x Faster Without Changing the Algorithm

kartikay dubey — Sun, 03 May 2026 16:06:24 +0000

I built a Vamana-based vector search engine in C++ called sembed-engine. Recently I made a pull request that sped up queries by up to 16x and builds by 9x. The algorithm stayed exactly the same. The recall stayed at 1.0. The number of visited nodes did not change.

The speedup came from data layout.

The old design

The original code stored vectors as separate objects pointed to by shared_ptr:

struct Record {
    int64_t id;
    std::shared_ptr<Vector> vector;
};

This is clean C++. Every record has an id and a vector. The vector knows how to calculate distance. In the hot path, though, the CPU had to load the record, read the shared_ptr, follow the pointer, call virtual methods, and read each float through an abstraction layer. Millions of times per query.

The new layout

I replaced the object graph with a flat array. All vector values live in one contiguous block:

ids    = [id0, id1, id2, ...]
values = [v0_dim0, v0_dim1, ..., v1_dim0, v1_dim1, ...]

Vector i starts at values[i * D]. A FloatVectorView is just a pointer and a dimension count. No allocations. No pointer chasing. The next vector is right after the previous one in memory.
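A minimal sketch of this layout is below. Only the FloatVectorView name and its "pointer plus dimension count" shape come from the description above. The FlatVectorStore class and the squaredDistance helper are hypothetical stand-ins, not the sembed-engine code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Non-owning view of one vector inside the flat buffer.
struct FloatVectorView {
    const float* data;
    std::size_t dim;
};

class FlatVectorStore {
public:
    explicit FlatVectorStore(std::size_t dim) : dim_(dim) { assert(dim > 0); }

    void add(std::int64_t id, const std::vector<float>& values) {
        assert(values.size() == dim_);
        ids_.push_back(id);
        values_.insert(values_.end(), values.begin(), values.end());
    }

    std::size_t size() const { return ids_.size(); }
    std::int64_t id(std::size_t i) const { return ids_[i]; }

    // Vector i starts at values_[i * dim_]. Views are invalidated by add().
    FloatVectorView view(std::size_t i) const {
        assert(i < ids_.size());
        return {values_.data() + i * dim_, dim_};
    }

private:
    std::size_t dim_;
    std::vector<std::int64_t> ids_;
    std::vector<float> values_;  // all vectors, back to back
};

// Tight loop over two contiguous float ranges: no virtual calls, no allocation.
float squaredDistance(FloatVectorView a, FloatVectorView b) {
    assert(a.dim == b.dim);
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.dim; ++i) {
        const float diff = a.data[i] - b.data[i];
        sum += diff * diff;
    }
    return sum;
}
```

A loop of this shape is something a compiler can vectorize when its floating-point settings allow it. The shared_ptr version never gave it that chance.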

The assembly tells the story. The old code had virtual calls and scalar square roots:

call rax          ; virtual dispatch
sqrtss xmm2, xmm2 ; scalar square root

The new code loads packed floats and operates on four at a time:

movups xmm1, XMMWORD PTR [rdi+rax]
subps xmm1, xmm3
mulps xmm1, xmm1

Removing unnecessary square roots

Euclidean distance includes a square root. For nearest-neighbor search, we only care about ordering, not the absolute distance value. The square root is monotonically increasing on non-negative numbers, so sqrt(25) < sqrt(100) exactly when 25 < 100. The ordering is identical.

Switching to squared distances eliminated sqrtss entirely from the hot path. One caveat: Vamana pruning uses an alpha parameter. When everything is squared, alpha must be squared too to preserve the same comparison semantics.
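The alpha caveat is easy to get wrong, so here is a small worked sketch. It assumes the pruning comparison from the DiskANN paper, alpha * d(p*, p') <= d(p, p'), and a non-negative alpha. The function names are illustrative and this is not the sembed-engine code.

```cpp
#include <cassert>

// Pruning test on true distances:
// drop candidate p' when alpha * d(p*, p') <= d(p, p').
bool prunesByDistance(float alpha, float distStarToCand, float distPointToCand) {
    assert(alpha >= 0.0f);
    return alpha * distStarToCand <= distPointToCand;
}

// The same test on squared distances. Squaring both sides of the inequality
// (all terms are non-negative) turns alpha into alpha * alpha.
bool prunesBySquaredDistance(float alpha, float sqStarToCand, float sqPointToCand) {
    assert(alpha >= 0.0f);
    return alpha * alpha * sqStarToCand <= sqPointToCand;
}
```

Take alpha = 1.2 with distances 5 and 5.5. On true distances, 1.2 * 5 = 6 is not <= 5.5, so the candidate stays. On squared distances, 1.44 * 25 = 36 is not <= 30.25, which agrees. Forgetting to square alpha gives 1.2 * 25 = 30 <= 30.25 and prunes a candidate the original rule would have kept.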

Caching scores during sort

The old comparator computed distances inside the sort function. Sorting calls the comparator many times, so the same distance was recomputed repeatedly. The fix was to compute each distance once, store it in a ScoredNode { node; score; }, and sort by the cached score.

Old comparator assembly called new_view_squared repeatedly. New comparator assembly just loaded two floats and compared them.
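A sketch of the cached-score pattern follows. ScoredNode mirrors the struct named above, while sortByCachedScore and the ScoreFn parameter are hypothetical names for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct ScoredNode {
    std::uint32_t node;
    float score;
};

// Calls score(node) exactly once per node, then sorts by the cached value.
template <typename ScoreFn>
std::vector<ScoredNode> sortByCachedScore(const std::vector<std::uint32_t>& nodes,
                                          ScoreFn score) {
    std::vector<ScoredNode> scored;
    scored.reserve(nodes.size());
    for (std::uint32_t node : nodes) {
        const float s = static_cast<float>(score(node));
        assert(!std::isnan(s));  // NaN would break the ordering std::sort requires
        scored.push_back({node, s});
    }
    // The comparator only loads and compares two floats.
    std::sort(scored.begin(), scored.end(),
              [](const ScoredNode& a, const ScoredNode& b) { return a.score < b.score; });
    return scored;
}
```

Sorting n items costs on the order of n log n comparator calls, so caching reduces the distance computations from roughly that many to exactly n.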

Results

Workload	Metric	Before	After
gvec query latency	p50	4.094 ms	0.631 ms
w2v query latency	p50	25.15 ms	1.524 ms
w2v build time	total	17.91 s	1.889 s

The search visited the same number of nodes. It stopped paying unnecessary tax at every node.

For the full benchmark methodology, assembly breakdown, and PR diff: How I Made My Vector Search Engine 16x Faster.

Setting Up Dual GPU Gaming Laptops in Hyprland

kartikay dubey — Sun, 03 May 2026 16:06:22 +0000

Gaming laptops with dual GPUs are common, and they are a pain on Linux. I run an ASUS Zephyrus G15 with an AMD integrated GPU and an NVIDIA discrete GPU. Before I fixed the setup, I dealt with broken resume from suspend, terrible battery life, overheating, and games that ran worse than they should.

This is a practical guide for setting up dual GPU systems in Hyprland. Most of it applies to other Wayland compositors too.

Step 1: Set the iGPU as primary

Hyprland uses the AQ_DRM_DEVICES environment variable to decide which GPU drives the display. You want the iGPU first for power efficiency and better Linux compatibility.

First, find your GPUs:

lspci -d ::03xx

My output shows an RTX 3060 at 01:00.0 and an AMD Vega at 06:00.0. Create udev rules to symlink these to friendly names:

/etc/udev/rules.d/igpu-device-path.rules:

KERNEL=="card*", KERNELS=="0000:06:00.0", SUBSYSTEM=="drm", SUBSYSTEMS=="pci", SYMLINK+="dri/amd-igpu"

/etc/udev/rules.d/dgpu-device-path.rules:

KERNEL=="card*", KERNELS=="0000:01:00.0", SUBSYSTEM=="drm", SUBSYSTEMS=="pci", SYMLINK+="dri/nvidia-dgpu"

Reload rules:

sudo udevadm control --reload-rules
sudo udevadm trigger

Then tell Hyprland to prefer the iGPU:

env = AQ_DRM_DEVICES, /dev/dri/amd-igpu:/dev/dri/nvidia-dgpu

Step 2: Fix hardware video decoding

Without hardware decoding, video playback burns CPU, drains battery, and stutters at high resolution. Check if your system already supports it:

sudo pacman -S libva-utils
vainfo

If vainfo fails or picks the wrong GPU, set the driver explicitly. For AMD, add to hyprland.conf:

env = LIBVA_DRIVER_NAME, radeonsi

Common driver names: NVIDIA uses nvidia, AMD uses radeonsi, Intel uses i965 or iHD.

Step 3: Switch between Hybrid and Integrated mode

For gaming, you want both GPUs active. For battery life, you want the dGPU completely off.

sudo pacman -S supergfxctl
supergfxctl -s    # list supported modes
supergfxctl -g    # check current mode
supergfxctl -m Integrated   # iGPU only, saves battery
supergfxctl -m Hybrid       # both GPUs, for gaming

That covers the essentials. I wrote a longer post with full hyprland.conf snippets, troubleshooting tips for NVIDIA-specific quirks, and screenshots of the setup: How to Setup Dual GPU Systems in Hyprland.

How No Man's Sky Creates 18 Quintillion Planets With Just Math

kartikay dubey — Sun, 03 May 2026 16:06:20 +0000

No Man's Sky advertises 18 quintillion planets. That is not because someone modeled them by hand. It is because the game generates terrain, flora, and atmosphere from mathematical functions seeded by the planet's coordinates.

The core idea is procedural generation, and the simplest building block is noise.

Why raw randomness fails

If you fill a height map with random numbers, you get chaos. Real terrain has smooth transitions: hills blend into valleys, coastlines curve gradually. The solution is a noise function that produces smooth, continuous random values.

Perlin noise does exactly this. It generates values that vary gradually across space, so nearby points have similar heights. Feed a 2D grid of Perlin noise into a renderer, add color and lighting, and you get something that looks like terrain.

The trick is layering. A single layer of Perlin noise looks too uniform, like rolling hills with no variation. Games stack multiple layers at different frequencies and amplitudes. Low-frequency layers define the broad shape of continents. High-frequency layers add rocks, cracks, and surface detail. This is called fractal Brownian motion, and it is the reason generated worlds look organic instead of synthetic.
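The layering is short enough to sketch in code. To keep it compact, this example uses hash-based value noise rather than true Perlin (gradient) noise, and the hash constants and function names are my own choices. It illustrates the stacking idea and is not anything from the game.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Deterministic pseudo-random value in [0, 1) for an integer lattice point.
float latticeValue(std::int32_t x, std::int32_t y, std::uint32_t seed) {
    std::uint32_t h = seed;
    h ^= static_cast<std::uint32_t>(x) * 0x85ebca6bu;
    h ^= static_cast<std::uint32_t>(y) * 0xc2b2ae35u;
    h ^= h >> 16; h *= 0x7feb352du;
    h ^= h >> 15; h *= 0x846ca68bu;
    h ^= h >> 16;
    return static_cast<float>(h >> 8) / 16777216.0f;  // top 24 bits / 2^24
}

float mix(float a, float b, float t) { return a + (b - a) * t; }

// Smoothly interpolated value noise: nearby points get similar values.
float valueNoise(float x, float y, std::uint32_t seed) {
    const float fx = std::floor(x), fy = std::floor(y);
    const auto ix = static_cast<std::int32_t>(fx);
    const auto iy = static_cast<std::int32_t>(fy);
    const float tx = x - fx, ty = y - fy;
    const float sx = tx * tx * (3.0f - 2.0f * tx);  // smoothstep fade
    const float sy = ty * ty * (3.0f - 2.0f * ty);
    const float top = mix(latticeValue(ix, iy, seed), latticeValue(ix + 1, iy, seed), sx);
    const float bottom =
        mix(latticeValue(ix, iy + 1, seed), latticeValue(ix + 1, iy + 1, seed), sx);
    return mix(top, bottom, sy);
}

// Fractal Brownian motion: each octave doubles the frequency and halves the amplitude.
float fbm(float x, float y, std::uint32_t seed, int octaves) {
    assert(octaves > 0);
    float sum = 0.0f, amplitude = 1.0f, frequency = 1.0f, total = 0.0f;
    for (int octave = 0; octave < octaves; ++octave) {
        sum += amplitude *
               valueNoise(x * frequency, y * frequency, seed + static_cast<std::uint32_t>(octave));
        total += amplitude;
        amplitude *= 0.5f;
        frequency *= 2.0f;
    }
    return sum / total;  // normalized back to roughly [0, 1]
}
```

The first octave sets the broad shape and later octaves add progressively finer detail. The same seed and coordinates always return the same height.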

What No Man's Sky adds

Sean Murray and the team at Hello Games went further than basic layered noise. Their GDC talk outlines several techniques:

Domain warping twists the noise field itself. Instead of sampling noise at the raw coordinates, you sample at coordinates that have been displaced by another noise function. This creates caves, overhangs, and twisted terrain that straight noise cannot produce.

Filtering and image processing cleans up the raw noise. Unfiltered procedural terrain often looks muddy or repetitive. The team runs filters to emphasize ridges and valleys, suppress bland regions, and sculpt the terrain into more interesting shapes.

DEM blending mixes in real-world elevation data for grounding. The risk is making everything look like Earth, which is familiar but boring. The game uses this sparingly, blending real data with warped noise to keep things alien but plausible.

Biome rules layer on top of the terrain. Temperature, humidity, and elevation determine what plants and animals spawn. These rules are also procedural, driven by the same coordinate seeds that generated the planet itself. Visit the same planet twice, you get the same terrain and the same wildlife. Visit a different planet, everything changes.

The result is a universe where every planet is deterministic (the same seed always produces the same world) but effectively infinite (the coordinate space is so large you will never see the same planet twice).

If you want to see the Perlin noise graphs and a deeper walkthrough of the layering math: How No Man's Sky Creates 18 Quintillion Planets.

Reading Algorithms Like an Engineer: What DiskANN Taught Me About Pseudocode

kartikay dubey — Sun, 03 May 2026 16:06:18 +0000

The first time I implemented Vamana from the DiskANN paper, my approximate nearest neighbor index was slower than brute force. On tiny test fixtures, brute force took 0.27 ms per query. My Vamana implementation took 22.98 ms.

That sounds absurd. ANN exists to skip work. The problem was not the algorithm. It was how I mapped the paper's abstractions to actual data structures.

A set is not a data structure

The DiskANN pseudocode talks about sets L, V, and Nout(p). That is fine for explanation. Code cannot store an abstract set.

When the paper says L (the candidate list), I had to decide: sorted vector? heap? bounded priority queue? How do I find the closest unvisited element? How do I enforce the search-list bound? How do I remove duplicates?

When the paper says V (the visited set), I had to decide: unordered_set? dense bitset? boolean array? Node ids in my case were dense integers, so an indexed bit operation beat a hash-table lookup by a wide margin.

When the paper says "remove candidates," I had to ask whether removal is physical or logical. In a hot loop, marking a candidate as deleted and skipping it is much cheaper than erasing from a vector and reshuffling everything behind it.

The fix

In my sembed-engine project, I changed the implementation to match the invariants the algorithm already needed, rather than copying the pseudocode literally.

A Neighbour struct became { float distance; NodeId node; bool marked; }. A SortedBoundedVector kept candidates sorted as they were inserted, capped the list size, rejected duplicates, and tracked the next unexpanded node. Visited tracking moved to boost::dynamic_bitset. Pruning switched from physical deletion to marker-style bookkeeping.
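To give a feel for what such a container involves, here is a simplified sketch. The Neighbour fields and the SortedBoundedVector name come from the description above. The method names, the linear duplicate check, and the cursor logic are my assumptions, not the sembed-engine implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using NodeId = std::uint32_t;

struct Neighbour {
    float distance;
    NodeId node;
    bool marked;  // true once the node has been expanded
};

class SortedBoundedVector {
public:
    explicit SortedBoundedVector(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
        items_.reserve(capacity + 1);
    }

    // Inserts in sorted position. Returns false for duplicates and for
    // candidates that are too far away to fit in a full list.
    bool insert(float distance, NodeId node) {
        for (const Neighbour& n : items_) {
            if (n.node == node) return false;  // linear scan: the list is small
        }
        if (items_.size() == capacity_ && distance >= items_.back().distance) return false;
        auto pos = std::upper_bound(
            items_.begin(), items_.end(), distance,
            [](float d, const Neighbour& n) { return d < n.distance; });
        const auto index = static_cast<std::size_t>(pos - items_.begin());
        items_.insert(pos, Neighbour{distance, node, false});
        if (items_.size() > capacity_) items_.pop_back();  // enforce the bound
        if (index < cursor_) cursor_ = index;  // new unexpanded entry ahead of the cursor
        return true;
    }

    // Marks and returns the closest unexpanded neighbour, if any remains.
    std::optional<Neighbour> nextUnexpanded() {
        while (cursor_ < items_.size() && items_[cursor_].marked) ++cursor_;
        if (cursor_ == items_.size()) return std::nullopt;
        items_[cursor_].marked = true;
        return items_[cursor_++];
    }

    std::size_t size() const { return items_.size(); }

private:
    std::size_t capacity_;
    std::size_t cursor_ = 0;  // every entry before this index is marked
    std::vector<Neighbour> items_;
};
```

Each question from the previous section (ordering, bound, duplicates, next unexpanded node) becomes one explicit line of code here, rather than a property the pseudocode takes for granted.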

The algorithm did not change. Only the data structures behind its nouns did.

After the fix, Vamana went from 22.98 ms to 0.02 ms on the same small fixture. On a larger dataset, it delivered 5.34x the query throughput of brute force while keeping recall at 1.0.

The lesson

Slow down at the nouns in pseudocode. If it says L, ask what operations L needs. If it says V, ask how membership is checked. If it says "remove," ask whether deletion is physical or logical. If it says "bounded," ask where that bound is enforced.

The paper gives the map. Implementation is the terrain.

For the full benchmark data, PR details, and code snippets: Reading Algorithms Like an Engineer.

Why You Should Never Use std::unordered_set in Hot C++ Loops

kartikay dubey — Sun, 03 May 2026 16:06:17 +0000

Hash tables feel like the default choice for membership tests. std::unordered_set promises average O(1) lookup, so we reach for it automatically. In performance-sensitive C++ code, that habit can cost you an order of magnitude.

I ran into this while building a Vamana graph index for approximate nearest neighbor search. The algorithm needs to track visited nodes. Node ids are dense integers, and the visited check runs inside the hottest loop in the entire search path.

My first implementation used std::unordered_set<uint32_t>. It was correct, and it was slow.

What the benchmark says

I generated 1000 vectors of random uint32_t ids and deduplicated them using three approaches: std::unordered_set, sort + unique, and boost::dynamic_bitset<>. For dense ids sampled from [0, 2n), the numbers were brutal:

n	unordered_set ms	sort+unique ms	boost bitset ms
128	5	3	1
32,768	1,649	1,455	177
500,000	50,302	26,759	3,423

At n = 500,000, the bitset was 14.7x faster. The hash table had to hash keys, grow buckets, rehash, and chase pointers through memory. The bitset did one indexed memory operation.

sort + unique also beat the hash table at scale because it walks contiguous memory, and CPUs love that.
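For reference, the two general-purpose approaches look roughly like this. It is a minimal sketch with illustrative function names, not the benchmark harness itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Sort, then drop adjacent duplicates. Works entirely on contiguous memory.
std::vector<std::uint32_t> dedupSortUnique(std::vector<std::uint32_t> ids) {
    std::sort(ids.begin(), ids.end());
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
    return ids;
}

// Hash-set baseline: one hash and one node-based lookup per id.
std::size_t countDistinctHashed(const std::vector<std::uint32_t>& ids) {
    std::unordered_set<std::uint32_t> seen;
    seen.reserve(ids.size());
    for (std::uint32_t id : ids) seen.insert(id);
    return seen.size();
}

// Debug-only cross-check that both strategies agree on the distinct count.
std::size_t countDistinct(const std::vector<std::uint32_t>& ids) {
    const std::size_t count = dedupSortUnique(ids).size();
    assert(count == countDistinctHashed(ids));
    return count;
}
```

Both return the same answer. The difference is that sort + unique streams through one array, while the hash set scatters its writes across memory.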

When the hash table wins

Sparse ids change the picture. When I sampled only n ids from a universe of 100,000,000 possible values, the bitset had to clear a massive mostly-empty array before every vector:

n	unordered_set ms	boost bitset ms
128	6.3	149.7
2,048	91.9	145.5
65,536	4,169.3	985.4

For small sparse inputs, std::unordered_set is genuinely better. The bitset only pulls ahead once the input is large enough to amortize the fixed clearing cost.

The practical rule

Reach for std::unordered_set when ids are sparse, unbounded, or not integer-indexable. When ids are dense integers inside a hot loop, make the membership check an indexed load or store instead.

The CPU does not care about your Big-O notation. It cares about memory access patterns.

I wrote a longer post with the full methodology, assembly-level analysis, and raw CSV data: Why You Should Never Use a set.

Optimize Hugo Blog Performance: Zero JS and 100% Lighthouse Score

kartikay dubey — Sat, 04 Apr 2026 14:30:32 +0000

TL;DR

I reduced my Hugo blog's page weight by eliminating 3.6 MB of JavaScript and 40 KB of external CSS, achieving a 100% JS-free frontend. Key optimizations included HTML minification, inlining CSS, switching to native MathML, and pre-rendering Mermaid diagrams server-side.

I recently looked into my blog's performance and was surprised to find my pages were downloading over 3.6 MB of JavaScript and render-blocking CSS on every load. For a simple static site, this was too much, so I decided to optimize it.

Here is the step-by-step breakdown of how I reduced my payload and removed JavaScript.

The Baseline

Before starting, my site had some major issues:

HTML Size: 86,348 bytes
JS Size: 3,617,515 bytes (3.6 MB)
CSS Size: 40,560 bytes
Issue: Large render-blocking JS and CSS files were loaded on every single page for Mermaid diagrams and math rendering. The HTML was also not minified.

Optimization 1: HTML Minification

The first step was simple: adding minifyOutput = true to hugo.toml.

HTML Size: 72,370 bytes (16% smaller)
Impact: Reduced parsing time for HTML, leading to a faster First Paint.

Optimization 2: Inlining CSS

Next, I removed the <link> tag pointing to my main.css file and replaced it with an inline <style>{{.Content|safeCSS}}</style> block.

HTML Size: Increased to 127,350 bytes (because CSS is now inside the HTML document).
Impact: This eliminated 1 critical render-blocking HTTP request. The browser no longer waits for an external CSS fetch, which improves First Contentful Paint (FCP).

Optimization 3: Native MathML

My blog used the KaTeX library (JS, CSS, and fonts) to render equations. I removed it and enabled Hugo's Goldmark passthrough extensions to render Native MathML instead.

HTML Size: 123,341 bytes
JS Size: 3,338,725 bytes (278 KB smaller)
CSS Size: 0 bytes (Removed KaTeX CSS, meaning zero external stylesheets are loaded).
Impact: A significant reduction in payload size. I removed the need for JavaScript and font files for math. The browser now renders it natively.

Optimization 4: Conditional Asset Loading

My Mermaid script was loading on every page. I used Hugo's .Store to set a flag hasMermaid when processing Markdown, and only injected the Mermaid <script> tag if that flag is true.

HTML Size: 117,632 bytes (Saved 6 KB across all generated pages).
Impact: Text-only blog posts no longer force the browser to download mermaid.min.js. The JavaScript is only loaded when necessary.

(Text-only pages don't load Mermaid)

(Pages with diagrams load Mermaid conditionally)

Optimization 5: Server-side Rendering for Mermaid Diagrams

Even conditionally, loading a 3.3 MB Mermaid script on some pages was heavy. I introduced a Node.js build step to pre-render Mermaid blocks into static SVG files. Now, the frontend outputs an <img src="diagram.svg">.

JS Size: 0 bytes (Removed the remaining 3.3 MB of Mermaid JavaScript).
Impact: The site is now 100% JavaScript-free on the frontend. Total Blocking Time (TBT) improved because the browser no longer executes JS to calculate layouts.

Optimization 6: Early Hints & Caching

Finally, I optimized the network layer. I generated a _headers file to define strict Cache-Control rules for immutable assets. I also added Link: <image>; rel=preload; as=image directives automatically via the build script.

Impact: Cloudflare now returns 103 Early Hints responses, telling the browser to fetch SVGs and images immediately, even before the HTML document finishes downloading. Assets cache indefinitely on repeat visits, eliminating secondary network fetch delays.

Final Summary

Over the course of these 6 optimizations, the frontend payload went from:

JS Payload: 3.6 MB -> 0 bytes (100% reduction)
External CSS: 40 KB -> 0 bytes (Eliminated all external style sheets, saving a round-trip on every page).
HTML Payload: Minified by 16% initially, then grew to 117,632 bytes once the CSS was inlined. That trade removes a render-blocking request and speeds up First Contentful Paint.

Performance matters, and sometimes you don't need a heavy JS framework to deliver a fast experience!