<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Erich</title>
    <description>The latest articles on DEV Community by Erich (@h0tb0x).</description>
    <link>https://dev.to/h0tb0x</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3495304%2F6d8e9993-312e-44bd-a07b-0ace2c6ad47e.JPG</url>
      <title>DEV Community: Erich</title>
      <link>https://dev.to/h0tb0x</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/h0tb0x"/>
    <language>en</language>
    <item>
      <title>Notes on not getting hired</title>
      <dc:creator>Erich</dc:creator>
      <pubDate>Thu, 12 Mar 2026 19:02:24 +0000</pubDate>
      <link>https://dev.to/h0tb0x/notes-on-not-getting-hired-1ph1</link>
      <guid>https://dev.to/h0tb0x/notes-on-not-getting-hired-1ph1</guid>
      <description>&lt;p&gt;On a whim, I applied to a defense tech company. Their recruiter emailed me three hours later. We had a phone screen the next day. A coding interview the week after. A second coding interview the week after that. Just like that I was in a final loop. One application, no networking, no LinkedIn DMs, no referral.&lt;/p&gt;

&lt;p&gt;I nailed the coding portion. Had a genuinely good conversation with the hiring manager. Then came the system design round.&lt;/p&gt;

&lt;p&gt;I had never done a system design interview before.&lt;/p&gt;

&lt;p&gt;I walked through an architecture and started second-guessing myself out loud. It's exactly as bad as it sounds. You can feel the moment an interview turns. It's like a key that doesn't quite catch. You keep turning it, hoping it'll grab, and it never does. I had no idea what I was doing, and worse, I was demonstrating that fact in real time to several strangers.&lt;/p&gt;

&lt;p&gt;The recruiter called with a rejection two days later. At least it wasn't an automated email.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;1 application. 1 final loop. 0 offers.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;75 applications went out over the next six weeks. August into September. I tracked everything in a spreadsheet. Which company, what role, what stage of the process. The spreadsheet was meticulous. The responses were not. The silence was nearly total. My inbox was empty.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;76 applications. 1 final round. 0 offers.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;January. 85 more applications. I spent the fall building. A search engine in C++, a prediction market arbitrage system in Python, a database in Rust. Things I could point to and say, "I built this, here's how it works, here's why it's cool."&lt;/p&gt;

&lt;p&gt;Still nothing in my inbox. The market did not care. The market does not care.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;161 applications. 1 final round. 0 offers.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The average job posting attracts around 250 applications. At a recognizable tech company the number is higher. At a large one that number is astronomical.¹ 75% of those resumes are rejected by ATS software before a human ever sees them. Your resume doesn't go into a pile. It goes into a filter.² Wrong keywords, wrong format, wrong anything and you're out before a person weighs in. Run 161 applications through the funnel: roughly 40 reach a human. Roughly 33% of those make it to interview scheduling, which leaves 13. About 32% of those pass the intermediate screening, so 4 should reach a final loop.&lt;/p&gt;

&lt;p&gt;When a resume does reach a recruiter, the initial scan takes around 7 seconds. The average recruiter today manages 2,500+ applications across all their open roles. Screening 500 applications at even 30 seconds each is 4 hours of pure triage before any meaningful conversation happens. 40 applications reviewed for 7 seconds each is 4 minutes and 40 seconds of human attention.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;161 applications. 4 minutes and 40 seconds.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;A final loop at a large tech company runs 4 to 6 interviews. That is after a recruiter screen and one or two technical screens. Then the loop itself contains coding, system design, and behavioral rounds. Six to nine interviews per company before you get a decision. The onsite-to-offer ratio runs about 3:1, so those 4 final loops should produce roughly one offer. I had one final loop. The math says I should have three more loops and an offer. &lt;/p&gt;

&lt;p&gt;In engineering specifically, the average number of interviews per hire is the highest in tech, meaning the conversion rate from interview to offer is the lowest of any sector. A candidate today is about a third as likely to get hired for a role as they were three years ago. On average, it takes 20 total interviews across multiple applications to land one offer.&lt;/p&gt;

&lt;p&gt;The alternative is a referral. The funnel above assumes a cold application. With a referral, your resume skips the filter entirely and goes into the hands of a real person. Industry data puts referred candidates at roughly a 30% hire rate compared to under 3% for cold applications.³ A warm introduction is doing more work than anything in your portfolio.&lt;/p&gt;




&lt;p&gt;Then two things happened in the same month.&lt;/p&gt;

&lt;p&gt;Someone referred me somewhere. One phone screen, then a full loop.&lt;/p&gt;

&lt;p&gt;One of the 85 applications came back to life. No phone screen, no recruiter call. Just an email to schedule a virtual coding interview with a real person. I passed it. Then a full loop.&lt;/p&gt;

&lt;p&gt;Two companies whose rejection emails I'd actually be sad to receive. One referral, one cold application, both arriving at the same destination in the same month.&lt;/p&gt;

&lt;p&gt;Both loops went well enough that I can't tell which way they'll land. That's a strange thing to say after months of inbox silence. The last time I left a final loop I knew exactly how it went. Uncertainty feels better than dread.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;161 applications. 3 final rounds. 0 offers?&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In college I applied to hundreds of jobs without knowing LeetCode existed. No projects, no interview experience, nothing to show. I thought I deserved a job and that someone should take a chance on me.&lt;/p&gt;

&lt;p&gt;I had a diploma and a lot of confidence in the wrong things.&lt;/p&gt;

&lt;p&gt;Now I have the experience. A search engine. An arbitrage system. A database. Things I built because I wanted to understand how they work.&lt;/p&gt;

&lt;p&gt;Then again, my inbox is still empty.&lt;/p&gt;

&lt;p&gt;In a few months I'll send another 75 or 80 applications. I'll keep building in the meantime.&lt;/p&gt;

&lt;p&gt;I have no lessons for you. I have no job.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;¹ These numbers come from a quick search across general hiring reports. Gem, Standout-CV, Shortlistd, and others. Software-engineering-specific data is harder to isolate cleanly and varies enough across sources that you should treat the numbers as directional rather than precise. The picture they paint is accurate even if the exact percentages aren't.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;² I like to imagine this filter as a printer directly dropping applications into a shredder.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;³ Referral hire rate data comes from separate industry sources and is not derived from the funnel math above. The funnel describes a cold application pipeline. Referral numbers are industry-wide averages. Both are directional.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>career</category>
    </item>
    <item>
      <title>Webcrawling is just a brute force algorithm</title>
      <dc:creator>Erich</dc:creator>
      <pubDate>Tue, 27 Jan 2026 23:45:08 +0000</pubDate>
      <link>https://dev.to/h0tb0x/webcrawling-is-just-a-brute-force-algorithm-2meg</link>
      <guid>https://dev.to/h0tb0x/webcrawling-is-just-a-brute-force-algorithm-2meg</guid>
      <description>&lt;p&gt;Every search engine starts with a crawler. Before ranking algorithms, before inverted indexes, before any of the clever stuff, someone has to actually go get the pages. Google, Bing, DuckDuckGo, all of them. Crawlers are where the data comes from.&lt;/p&gt;

&lt;p&gt;The algorithm is brute force BFS. Visit a page, read the rules, download the content, extract the links, add them to the queue. Repeat until you've visited every page on the internet. Then start over, because pages change.&lt;/p&gt;

&lt;p&gt;That's it. No cleverness, no optimization at the conceptual level. You're just walking a graph, one node at a time, until you've seen all of it. The webcrawler's job is completeness, not efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 0: The naive crawler
&lt;/h2&gt;

&lt;p&gt;You need an HTTP client, an HTML parser, and a queue. libcurl handles requests. Any HTML parsing library works, or you can write your own parser if you enjoy suffering. The queue just holds URL strings. You also need a set to track visited URLs or you'll loop forever when site A links to site B links back to site A.&lt;/p&gt;
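
&lt;p&gt;As a rough sketch of that loop (hypothetical helper names, not BloomSearch's actual code; &lt;code&gt;fetch&lt;/code&gt; and &lt;code&gt;extract_links&lt;/code&gt; stand in for libcurl and the parser):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;queue&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;unordered_set&amp;gt;
#include &amp;lt;vector&amp;gt;

// Hypothetical stand-ins for the HTTP client (libcurl) and the HTML parser.
std::string fetch(const std::string &amp;amp;url) { return {}; /* HTTP request goes here */ }
std::vector&amp;lt;std::string&amp;gt; extract_links(const std::string &amp;amp;html) { return {}; /* parsing goes here */ }

void crawl(const std::string &amp;amp;seed) {
  std::queue&amp;lt;std::string&amp;gt; frontier;
  std::unordered_set&amp;lt;std::string&amp;gt; visited;  // without this, A -&amp;gt; B -&amp;gt; A loops forever
  frontier.push(seed);
  while (!frontier.empty()) {
    std::string url = frontier.front();
    frontier.pop();
    if (!visited.insert(url).second) continue;  // already crawled
    for (const std::string &amp;amp;link : extract_links(fetch(url))) {
      frontier.push(link);
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;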

&lt;p&gt;This version works. It also gets your IP banned. A tight loop with no delays will send dozens of requests per second to the same server. Admins notice. Rate limiters trigger. Your crawler either gets blocked or your ISP gets complaints.&lt;/p&gt;

&lt;p&gt;If you run this on a cloud provider, you won't just get banned. You'll also receive a bill that makes you reconsider your career choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 1: Politeness
&lt;/h2&gt;

&lt;p&gt;The internet has a convention for this: robots.txt. Every domain can publish one. Go to the base URL of any site right now, add "/robots.txt", and see what it looks like. It specifies which paths crawlers can access and, more importantly, how many seconds to wait between requests.&lt;/p&gt;

&lt;p&gt;Now you need a robots.txt parser and a per-domain delay mechanism. Before each request, check the domain's crawl delay and sleep for that duration.&lt;/p&gt;
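
&lt;p&gt;A minimal sketch of that mechanism, assuming the delay has already been parsed out of robots.txt (names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;chrono&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;thread&amp;gt;
#include &amp;lt;unordered_map&amp;gt;

// Remember when each domain was last hit and sleep out whatever remains of its delay.
std::unordered_map&amp;lt;std::string, std::chrono::steady_clock::time_point&amp;gt; last_hit;

void wait_for_turn(const std::string &amp;amp;domain, std::chrono::seconds crawl_delay) {
  auto it = last_hit.find(domain);
  if (it != last_hit.end()) {
    auto next_allowed = it-&amp;gt;second + crawl_delay;
    auto now = std::chrono::steady_clock::now();
    if (now &amp;lt; next_allowed) std::this_thread::sleep_for(next_allowed - now);
  }
  last_hit[domain] = std::chrono::steady_clock::now();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;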

&lt;p&gt;The crawler is now polite. It's also slow. A 5-second crawl delay means 12 pages per minute from one domain. Crawling a million pages at that rate takes longer than you want to wait (~57 days).&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 2: Parallelism
&lt;/h2&gt;

&lt;p&gt;The obvious fix is threads. If you're waiting 5 seconds between requests to domain A, you could be fetching from domains B, C, and D in the meantime. Spawn a pool of worker threads, give them a shared queue, and let each one crawl independently.&lt;/p&gt;

&lt;p&gt;This is where most tutorials stop. It's also where you'll get banned again.&lt;/p&gt;

&lt;p&gt;The problem is subtle. Your queue contains thousands of URLs from hundreds of domains, all interleaved. Thread A grabs example.com/page1. Thread B grabs example.com/page2. Thread C grabs example.com/about. Each thread respects the crawl delay individually. It waits 5 seconds after its own last request to that domain. But threads don't know about each other. All three check their own timers, see no recent request, and fire simultaneously. The server sees three requests in the same instant. Your distributed politeness is actually coordinated rudeness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 3: Coordinated politeness
&lt;/h2&gt;

&lt;p&gt;The solution is to centralize the coordination. Instead of each worker managing its own timing, you push that responsibility into the frontier itself.&lt;/p&gt;

&lt;p&gt;Workers do three things: they register a domain's required delay (parsed from robots.txt), they mark when they've hit a domain, and they ask for the next URL. The frontier only hands out a URL when that domain's crawl delay has elapsed. Workers don't sleep manually. They just ask for work. The frontier decides when work is available.&lt;/p&gt;

&lt;p&gt;This inverts the control. Workers don't coordinate with each other. They don't even know each other exists. They all talk to the frontier, and the frontier enforces the timing. One mutex, one source of truth, no race conditions between workers checking timestamps simultaneously.&lt;/p&gt;
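
&lt;p&gt;A minimal sketch of that frontier, assuming domain extraction and robots.txt parsing happen elsewhere (illustrative, not BloomSearch's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;chrono&amp;gt;
#include &amp;lt;mutex&amp;gt;
#include &amp;lt;optional&amp;gt;
#include &amp;lt;queue&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;unordered_map&amp;gt;

// One mutex, one source of truth. Workers only ever call these three methods.
class Frontier {
 public:
  void register_delay(const std::string &amp;amp;domain, std::chrono::seconds delay) {
    std::lock_guard&amp;lt;std::mutex&amp;gt; lock(mu_);
    delay_[domain] = delay;
  }
  void add(const std::string &amp;amp;domain, const std::string &amp;amp;url) {
    std::lock_guard&amp;lt;std::mutex&amp;gt; lock(mu_);
    queues_[domain].push(url);
  }
  // Hands out a URL only when that domain's crawl delay has elapsed.
  std::optional&amp;lt;std::string&amp;gt; next() {
    std::lock_guard&amp;lt;std::mutex&amp;gt; lock(mu_);
    auto now = std::chrono::steady_clock::now();
    for (auto &amp;amp;[domain, urls] : queues_) {
      if (urls.empty() || now - last_hit_[domain] &amp;lt; delay_[domain]) continue;
      last_hit_[domain] = now;  // the hit is marked here, at hand-out time
      std::string url = urls.front();
      urls.pop();
      return url;
    }
    return std::nullopt;  // nothing is ready yet; the worker just asks again
  }

 private:
  std::mutex mu_;
  std::unordered_map&amp;lt;std::string, std::queue&amp;lt;std::string&amp;gt;&amp;gt; queues_;
  std::unordered_map&amp;lt;std::string, std::chrono::seconds&amp;gt; delay_;
  std::unordered_map&amp;lt;std::string, std::chrono::steady_clock::time_point&amp;gt; last_hit_;
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;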

&lt;p&gt;Now you have a crawler that could index the entire internet. The only obstacles are time, money, storage, compute, and reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reality
&lt;/h2&gt;

&lt;p&gt;I've only described the happy path. Real crawlers hit messier problems.&lt;/p&gt;

&lt;p&gt;DNS resolution blocks. Every URL needs a DNS lookup, and those can take seconds. If your threads block on DNS, your parallelism disappears. You either need async DNS or a caching layer.&lt;/p&gt;

&lt;p&gt;Memory pressure builds. A million URLs in a queue takes real memory. A visited set with a million entries takes more. You eventually need to spill to disk or use probabilistic data structures like bloom filters.&lt;/p&gt;

&lt;p&gt;HTML is cursed. Real-world pages have malformed tags, broken encoding, and markup that would make the W3C weep. Your parser will encounter things that technically shouldn't exist. It needs to not crash.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;If this was interesting, check out &lt;a href="https://github.com/H0TB0X420/BloomSearch" rel="noopener noreferrer"&gt;BloomSearch&lt;/a&gt;, a search engine I built that uses a crawler like this to index the web.&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>computerscience</category>
      <category>tutorial</category>
      <category>web</category>
    </item>
    <item>
      <title>The LLM Imposter</title>
      <dc:creator>Erich</dc:creator>
      <pubDate>Wed, 21 Jan 2026 15:24:24 +0000</pubDate>
      <link>https://dev.to/h0tb0x/the-llm-imposter-2072</link>
      <guid>https://dev.to/h0tb0x/the-llm-imposter-2072</guid>
      <description>&lt;p&gt;A few weeks ago I finished a project that actually works. Handles real data, solves a real problem, runs well. I'm proud of it. I'm also... something else. Not ashamed exactly. Just aware of a voice I can't shake: &lt;em&gt;You didn't really do this.&lt;/em&gt; &lt;strong&gt;&lt;em&gt;This doesn't count.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used LLMs¹ heavily throughout the process. Not vibe coding. I wasn't just prompting "build me a thing" and shipping whatever came out. I made architectural decisions, debugged failures, understood trade-offs. But still. I can't shake that voice.&lt;/p&gt;

&lt;p&gt;There's an image of what a "real programmer" looks like. Someone who writes syntax from memory, who suffered through documentation for years, who earned their skills through late nights and cryptic error messages. The suffering was the point. If you didn't struggle, you didn't learn.&lt;/p&gt;

&lt;p&gt;I internalized that standard somewhere along the way. And by that standard, using an LLM to accelerate past the friction feels like skipping the exam.&lt;/p&gt;

&lt;p&gt;But this isn't the first time the standard changed.&lt;/p&gt;

&lt;p&gt;Every abstraction layer in programming history faced the same resistance. Assembly to C: "You're hiding the machine, you'll never understand what's actually happening." C to managed languages: "Garbage collection? Memory management &lt;em&gt;is&lt;/em&gt; the job." Using libraries: "You're importing code you've never read." And for a decade straight: "You're not a real engineer, you're just copying from Stack Overflow."&lt;/p&gt;

&lt;p&gt;Each time, skeptics said the new way wasn't real programming. Each time, they were defending a standard that was about to become obsolete.&lt;/p&gt;

&lt;p&gt;The abstraction didn't eliminate the need for understanding. You didn't need to manage registers anymore, but you still needed to understand performance. You didn't need to manually free memory, but you still needed to know why your program was leaking.&lt;/p&gt;

&lt;p&gt;The programmers who insisted assembly was the only "real" programming were guarding a gate nobody needed to pass through anymore. Not because they were wrong about assembly being powerful. Because they were wrong about what mattered.&lt;/p&gt;

&lt;p&gt;So what matters this time?&lt;/p&gt;

&lt;p&gt;Code became cheap. Producing working syntax is commoditized now. An LLM can generate a function faster than I can type the signature. Maybe I just type slow.&lt;/p&gt;

&lt;p&gt;But software is still expensive.&lt;/p&gt;

&lt;p&gt;Knowing which components the system needs, how they interact, where it will fail at scale, what trade-offs you're making. None of that got cheaper. The LLM produces parts. It doesn't know which parts matter or where they go.&lt;/p&gt;

&lt;p&gt;Think about what it means to be a mechanic. A parts supplier can hand you a carburetor². That's the easy part. Being a mechanic means knowing where it goes, how it connects to everything else, whether this particular carburetor is right for this particular engine. It means looking at a car that won't start and tracing the problem backward through systems you understand. It means knowing that a failing fuel pump will starve the engine. It means finding out the problem wasn't with the carburetor at all.&lt;/p&gt;

&lt;p&gt;Anyone can order parts. The mechanic knows why the car runs.&lt;/p&gt;

&lt;p&gt;Vibe coding is ordering parts and bolting them on until something happens. Sometimes you get a car. Usually you get an expensive mess that breaks in ways you can't diagnose because you never understood how it was supposed to work in the first place.&lt;/p&gt;

&lt;p&gt;The friction didn't disappear when LLMs arrived. It relocated.&lt;/p&gt;

&lt;p&gt;The old slog was syntax memorization, Stack Overflow archaeology, decoding documentation written by someone who hated you. The new slog is architecture, system design, evaluating outputs, catching "You're absolutely right!" mistakes, knowing when generated code is subtly wrong in ways that won't surface until production.&lt;/p&gt;

&lt;p&gt;Different friction. Still friction. Still earns the outcome.&lt;/p&gt;

&lt;p&gt;I've stopped asking myself "did I use AI to build this?"&lt;/p&gt;

&lt;p&gt;The better question: if this breaks, can &lt;strong&gt;I&lt;/strong&gt; fix it?&lt;/p&gt;

&lt;p&gt;If yes, you built it. The tool you used to get there is irrelevant. If no, you have a pile of parts and a prayer.&lt;/p&gt;

&lt;p&gt;I can debug my project. I can explain why the components exist and what they do. I can extend it, refactor it, reason about its failure modes. The LLM accelerated the syntax production. The engineering was mine.&lt;/p&gt;

&lt;p&gt;That voice is real. But it's grading me on a standard for a game that already changed.&lt;/p&gt;

&lt;p&gt;Besides, I use VS Code. Half the internet already doesn't think I'm a real programmer.&lt;/p&gt;




&lt;p&gt;¹ Large Language Model - it's worth distinguishing from the broader "AI" label.&lt;/p&gt;

&lt;p&gt;² Intentionally obsolete car part.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>HFT-Lite: Prediction market arbitrage engine</title>
      <dc:creator>Erich</dc:creator>
      <pubDate>Sun, 14 Dec 2025 21:58:50 +0000</pubDate>
      <link>https://dev.to/h0tb0x/hft-lite-prediction-market-arbitrage-engine-51lc</link>
      <guid>https://dev.to/h0tb0x/hft-lite-prediction-market-arbitrage-engine-51lc</guid>
      <description>&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The system connects to Kalshi and Interactive Brokers ForecastEx via WebSockets. Events are mapped to a unified symbol config tracking equivalent contracts across platforms. Market data is normalized into a central order book where an arbitrage detector continuously scans for cross-venue mispricings.&lt;/p&gt;

&lt;p&gt;When complementary contracts (YES on one exchange, NO on the other) can be purchased for less than the guaranteed $1.00 settlement minus fees, both trades execute. The outcome doesn't matter; one side always pays.&lt;/p&gt;
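
&lt;p&gt;The core check is simple. A toy version of the condition (illustrative only, not the project's actual code; the fee figure below is made up and real fee handling is per-exchange):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;iostream&amp;gt;

// Buying YES on one venue and NO on the other locks in a $1.00 settlement.
// The trade is only worth taking if the combined cost plus fees stays under $1.00.
bool is_arbitrage(double yes_price, double no_price, double fees) {
  return yes_price + no_price + fees &amp;lt; 1.0;
}

int main() {
  double yes = 0.68, no = 0.26, fees = 0.02;  // example prices; the fee figure is hypothetical
  if (is_arbitrage(yes, no, fees)) {
    std::cout &amp;lt;&amp;lt; "locked-in edge per contract: " &amp;lt;&amp;lt; (1.0 - yes - no - fees) &amp;lt;&amp;lt; "\n";
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;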

&lt;p&gt;Current scope: political and economic events (Fed decisions, presidential nominations, Senate majority control, House majority control).&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Pure arbitrage opportunities showed up immediately. In 35 minutes of monitoring:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Kalshi&lt;/th&gt;
&lt;th&gt;IBKR&lt;/th&gt;
&lt;th&gt;Combined&lt;/th&gt;
&lt;th&gt;Net Margin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SENATE_2026_REP&lt;/td&gt;
&lt;td&gt;YES @ $0.68&lt;/td&gt;
&lt;td&gt;NO @ $0.26&lt;/td&gt;
&lt;td&gt;$0.94&lt;/td&gt;
&lt;td&gt;2.62%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HOUSE_2026_REP&lt;/td&gt;
&lt;td&gt;YES @ $0.27&lt;/td&gt;
&lt;td&gt;NO @ $0.69&lt;/td&gt;
&lt;td&gt;$0.96&lt;/td&gt;
&lt;td&gt;0.62%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SENATE_2026_DEM&lt;/td&gt;
&lt;td&gt;NO @ $0.68&lt;/td&gt;
&lt;td&gt;YES @ $0.28&lt;/td&gt;
&lt;td&gt;$0.96&lt;/td&gt;
&lt;td&gt;0.49%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The catch: these contracts settle February 1, 2027. That's 414 days of capital lock-up for 1-3% return. Treasury bills pay better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Leg risk&lt;/strong&gt; is the main concern. If one side fills and the other doesn't, the system rolls back by buying the opposite side on the filled exchange. Loss is limited to fees, but it's still a loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regulatory uncertainty&lt;/strong&gt; hangs over the entire space. Prediction markets occupy gray legal territory. Platforms could face restrictions that impact liquidity or access. That means holding cash and positions on an exchange that suddenly has problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The current margins only make sense with shorter-term contracts. Weekly or daily events reduce capital lock-up and make 1-3% spreads worthwhile.&lt;/p&gt;

&lt;p&gt;The next evolution of this system is comparing options-implied probabilities to prediction market prices. If SPY options imply a 30% probability of closing between $595 and $600, and Kalshi has that bracket at 15 cents, someone's wrong. Retail prediction markets are probably the soft target.&lt;/p&gt;

&lt;p&gt;Other improvements need to be made as well. On the execution side, parallel order placement and Kelly criterion position sizing. On infrastructure, WebSocket reconnection handling and a real-time dashboard. Risk management needs category exposure limits and correlation tracking. Holding ten different contracts with the same outcome isn't diversification.&lt;/p&gt;

&lt;p&gt;Check out the project at the link below. &lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/H0TB0X420/HFT-Lite" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>One bug, nine errors: what templates actually are</title>
      <dc:creator>Erich</dc:creator>
      <pubDate>Wed, 10 Dec 2025 19:06:58 +0000</pubDate>
      <link>https://dev.to/h0tb0x/one-bug-nine-errors-what-templates-actually-are-7nm</link>
      <guid>https://dev.to/h0tb0x/one-bug-nine-errors-what-templates-actually-are-7nm</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 4 of "You Didn't Learn C++ in College"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm four weeks into CMU's 15-445 Database Systems project on B+Trees. My code compiles on the previous commit. I changed one line. The compiler responds with nine identical errors. Same bug, reported nine times, each with a different template instantiation. &lt;/p&gt;

&lt;p&gt;Nine opportunities to learn about templates.&lt;/p&gt;

&lt;h2&gt;
  
  
  The project that broke me
&lt;/h2&gt;

&lt;p&gt;CMU 15-445 has you build a database storage engine from scratch. Project 1 is a buffer pool manager. Project 2 is a B+Tree index that sits on top of it. The B+Tree stores key-value pairs where both the key type and value type are template parameters. Your tree needs to work with 4-byte keys, 8-byte keys, 16-byte keys, 64-byte composite keys. All without writing separate implementations for each.&lt;/p&gt;

&lt;p&gt;The tree class has three template parameters: KeyType, ValueType, and KeyComparator. Every method you write needs to handle any combination. And your B+Tree pages live in the buffer pool as raw memory that you cast into typed nodes. One wrong type and you're debugging memory corruption. No one wants to do that.&lt;/p&gt;

&lt;p&gt;I can't post the implementation (course policy), but the template structure is public. Three type parameters, dozens of methods, all generic over key and value types.&lt;/p&gt;

&lt;h2&gt;
  
  
  What college taught me about templates
&lt;/h2&gt;

&lt;p&gt;My undergraduate C++ course covered templates in maybe two lectures. "Here's &lt;code&gt;vector&amp;lt;int&amp;gt;&lt;/code&gt;, here's &lt;code&gt;vector&amp;lt;string&amp;gt;&lt;/code&gt;, templates let you reuse code." That's pretty much it, two lectures condensed into two sentences.&lt;/p&gt;

&lt;p&gt;I thought templates were syntax sugar. Write one function, use it with different types. The textbooks show &lt;code&gt;template&amp;lt;typename T&amp;gt; T max(T a, T b)&lt;/code&gt; and I assumed the compiler did something clever at runtime to figure out the types.&lt;/p&gt;

&lt;p&gt;I was completely wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Templates generate code at compile time
&lt;/h2&gt;

&lt;p&gt;When you write &lt;code&gt;BPlusTree&amp;lt;GenericKey&amp;lt;8&amp;gt;, RID, GenericComparator&amp;lt;8&amp;gt;&amp;gt;&lt;/code&gt;, the compiler doesn't create a generic class that handles all types. It generates a completely new class. Specific to those exact types. With its own machine code.&lt;/p&gt;

&lt;p&gt;Two instantiations with different key sizes are two entirely separate classes. The 8-byte version has no idea the 16-byte version exists. They don't share code. They don't share vtables. The compiler literally generates distinct implementations and compiles them independently. C++ templates have no runtime component. By the time your program runs, the templates are gone. They are replaced by concrete, specialized code for each type combination you actually used. &lt;/p&gt;
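
&lt;p&gt;A toy example of the same idea (nothing to do with BusTub, just the instantiation mechanics):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;type_traits&amp;gt;

template &amp;lt;typename Key, int KeySize&amp;gt;
struct Node {
  Key keys[KeySize];  // the size is baked into the generated class
};

// Two instantiations are two unrelated types with different layouts.
static_assert(!std::is_same_v&amp;lt;Node&amp;lt;int, 8&amp;gt;, Node&amp;lt;int, 16&amp;gt;&amp;gt;);
static_assert(sizeof(Node&amp;lt;int, 16&amp;gt;) == 2 * sizeof(Node&amp;lt;int, 8&amp;gt;));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;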

&lt;h2&gt;
  
  
  One bug, nine errors
&lt;/h2&gt;

&lt;p&gt;Here's an actual error I caused. I passed a &lt;code&gt;page_id_t&lt;/code&gt; (an int) where the function expected an &lt;code&gt;RID&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error: reference to type 'const bustub::RID' could not bind to 
       an lvalue of type 'bustub::page_id_t' (aka 'int')
    leaf-&amp;gt;SetValueAt(insert_index, cause_error);
                                   ^~~~~~~~~~~
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clear enough. But the compiler didn't report one error. It reported nine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;note: in instantiation of member function 
  'bustub::BPlusTree&amp;lt;bustub::GenericKey&amp;lt;4&amp;gt;, bustub::RID, 
   bustub::GenericComparator&amp;lt;4&amp;gt;, 0&amp;gt;::InsertWithCrabbing' requested here
template class BPlusTree&amp;lt;GenericKey&amp;lt;4&amp;gt;, RID, GenericComparator&amp;lt;4&amp;gt;&amp;gt;;
               ^
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the same error for &lt;code&gt;GenericKey&amp;lt;8&amp;gt;&lt;/code&gt;. Then &lt;code&gt;GenericKey&amp;lt;16&amp;gt;&lt;/code&gt;. Then &lt;code&gt;GenericKey&amp;lt;32&amp;gt;&lt;/code&gt;. Then &lt;code&gt;GenericKey&amp;lt;64&amp;gt;&lt;/code&gt;. Plus variants with different fourth template parameters.&lt;/p&gt;

&lt;p&gt;The bottom of the B+Tree implementation file has explicit template instantiations. Lines that tell the compiler: generate complete code for each of these type combinations right here, in this translation unit. The codebase does this so linking works correctly. Template code normally lives in headers, but explicit instantiation lets you put implementations in &lt;code&gt;.cpp&lt;/code&gt; files.&lt;/p&gt;
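
&lt;p&gt;Reduced to a toy (not the course code), the pattern looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;string&amp;gt;
#include &amp;lt;vector&amp;gt;

// stack.h -- declarations only
template &amp;lt;typename T&amp;gt;
class Stack {
 public:
  void push(const T &amp;amp;value);
 private:
  std::vector&amp;lt;T&amp;gt; data_;
};

// stack.cpp -- definitions plus explicit instantiations
template &amp;lt;typename T&amp;gt;
void Stack&amp;lt;T&amp;gt;::push(const T &amp;amp;value) { data_.push_back(value); }

// "Generate complete code for these types here, in this translation unit."
template class Stack&amp;lt;int&amp;gt;;
template class Stack&amp;lt;std::string&amp;gt;;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;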

&lt;p&gt;My one type error existed in a method called &lt;code&gt;InsertWithCrabbing&lt;/code&gt;. The compiler instantiated that method nine times, once per explicit instantiation. Each instantiation hit the same bug. Nine identical errors, each with its own "in instantiation of" note showing which type combination triggered it.&lt;/p&gt;

&lt;p&gt;The error itself was on line 311. The instantiation requests were on lines 1383-1395. A thousand lines apart in the output, connected by template machinery. Once I understood that each "note: in instantiation of" was just the compiler saying "I tried to generate code for this type combination and hit your bug," the error dump became readable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why databases use templates
&lt;/h2&gt;

&lt;p&gt;After staring at enough error messages, the design made sense. Databases need type-specific code without paying for runtime polymorphism.&lt;/p&gt;

&lt;p&gt;Consider the alternative with virtual functions. Every operation requires a virtual function call through a vtable. The compiler can't inline across the indirection. You end up casting &lt;code&gt;void*&lt;/code&gt; everywhere, losing type safety. And if you want to optimize key comparisons for different key sizes, you need runtime branches.&lt;/p&gt;

&lt;p&gt;Templates eliminate all of this. The compiler sees the exact types at compile time. For &lt;code&gt;BPlusTree&amp;lt;GenericKey&amp;lt;8&amp;gt;, RID, GenericComparator&amp;lt;8&amp;gt;&amp;gt;&lt;/code&gt;, it generates code that operates directly on 8-byte keys. No indirection. No vtables. The optimizer can inline the comparator, see through all the abstractions, and generate tight machine code.&lt;/p&gt;

&lt;p&gt;This is what C++ people mean by "zero-overhead abstraction." You write generic code, the compiler generates specialized code. The abstraction costs nothing at runtime because it doesn't exist at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compile-time specialization
&lt;/h2&gt;

&lt;p&gt;The template power in BusTub is straightforward: the compiler generates specialized code for each key size, and all the size calculations happen at compile time.&lt;/p&gt;

&lt;p&gt;The comparator shows how non-type template parameters work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;template&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;KeySize&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GenericComparator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="nl"&gt;public:&lt;/span&gt;
  &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;()(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;GenericKey&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;KeySize&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;lhs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                         &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;GenericKey&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;KeySize&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;rhs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;key_schema_&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;GetColumnCount&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;lhs_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lhs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_schema_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;rhs_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rhs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_schema_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lhs_value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CompareLessThan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rhs_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CmpBool&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CmpTrue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lhs_value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CompareGreaterThan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rhs_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CmpBool&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CmpTrue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;private&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;Schema&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;key_schema_&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;KeySize&lt;/code&gt; isn't a type, it's a compile-time constant. &lt;code&gt;GenericComparator&amp;lt;8&amp;gt;&lt;/code&gt; only compares &lt;code&gt;GenericKey&amp;lt;8&amp;gt;&lt;/code&gt; values. Try to compare a &lt;code&gt;GenericKey&amp;lt;16&amp;gt;&lt;/code&gt; and you get a type error at compile time, not a runtime bug. The template parameter acts as a compile-time constraint that prevents mismatched key sizes from ever reaching production.&lt;/p&gt;

&lt;p&gt;This is why the codebase has those explicit instantiations. The database knows it will index columns of certain sizes. Rather than let templates instantiate lazily and potentially bloat binary size with unused combinations, explicit instantiation says: generate exactly these versions, nothing else.&lt;/p&gt;

&lt;h2&gt;
  
  
  What C++20 fixed (and what you'll still encounter)
&lt;/h2&gt;

&lt;p&gt;The 15-445 codebase uses C++17. Modern C++ has concepts, which make template constraints explicit and readable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;template&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt; &lt;span class="nc"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;requires&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;integral&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="nf"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the error messages become human-readable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error: cannot call square with type 'GenericKey&amp;lt;8&amp;gt;'
note: constraints not satisfied: std::integral&amp;lt;GenericKey&amp;lt;8&amp;gt;&amp;gt; evaluated to false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line telling you exactly what went wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rust learned from this
&lt;/h2&gt;

&lt;p&gt;Rust calls template instantiation "monomorphization" and builds constraints into the language from day one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;compare&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Ord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Ordering&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="nf"&gt;.cmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;T: Ord&lt;/code&gt; is a trait bound. If you try to call this with a type that doesn't implement &lt;code&gt;Ord&lt;/code&gt;, the error tells you exactly that. No instantiation chain. No nine repeated errors.&lt;/p&gt;

&lt;p&gt;Rust also catches constraint violations where you define the generic function, not where you call it. C++ templates don't check that &lt;code&gt;KeyType&lt;/code&gt; has the methods you need until instantiation. Rust checks the trait bounds immediately. This is what happens when language designers learn from 30 years of C++ template errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually learned
&lt;/h2&gt;

&lt;p&gt;The B+Tree project took me a month. I'm currently on a break from it because implementing concurrent access with latching broke my brain. But debugging template errors taught me something I never got from college.&lt;/p&gt;

&lt;p&gt;Templates are a code generation system. When you write a template, you're writing instructions for the compiler to follow when it generates real code. Each instantiation creates a new, specialized version. The error messages are repetitive because the compiler hits your bug once per instantiation. Understanding this changes how you debug. &lt;/p&gt;

&lt;p&gt;The B+Tree uses templates because database indexes need to work with arbitrary key types while generating optimal code for each one. Virtual functions would add overhead on every comparison, every key copy, every node traversal. Templates let you write the generic algorithm once and get specialized assembly for each key type. The compile-time pain is the price for runtime performance.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: Part 5 covers move semantics&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>programming</category>
    </item>
    <item>
      <title>Smart pointers: memory safety without garbage collection</title>
      <dc:creator>Erich</dc:creator>
      <pubDate>Mon, 24 Nov 2025 20:07:39 +0000</pubDate>
      <link>https://dev.to/h0tb0x/smart-pointers-memory-safety-without-garbage-collection-5674</link>
      <guid>https://dev.to/h0tb0x/smart-pointers-memory-safety-without-garbage-collection-5674</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 3 of "You Didn't Learn C++ in College"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm building a web crawler to learn C++ and understand how search engines work. Not a toy project that crawls ten pages and calls it done, but something that needs to run for hours, handle thousands of URLs, and not explode. This means dealing with the reality that every college programming project conveniently ignores: programs that actually stay running.&lt;/p&gt;

&lt;p&gt;My college data structures course taught &lt;code&gt;new&lt;/code&gt; and &lt;code&gt;delete&lt;/code&gt;, then handed us assignments that ran for 30 seconds and exited. Memory leaks? Dangling pointers? "Just be careful" was the advice. The assignments ended before the leaks mattered. Those short-lived programs never exposed the problems with manual memory management.&lt;/p&gt;

&lt;p&gt;A web crawler runs for hours and processes thousands of documents. Miss a single &lt;code&gt;delete&lt;/code&gt; in an error path, and you leak memory on every failed HTTP request. Forget to clean up when a parsing exception gets thrown, and memory usage climbs until the system kills your process. Delete the same object twice because two threads finished at the same time, and the program crashes with a memory corruption error that's nearly impossible to debug.&lt;/p&gt;

&lt;p&gt;Raw pointers and manual &lt;code&gt;delete&lt;/code&gt; calls don't scale to long-running programs. So for this crawler, I'm using smart pointers from the start. They're RAII applied to memory management, and they make the whole "be careful" thing obsolete.&lt;/p&gt;

&lt;h2&gt;
  
  
  What smart pointers actually are
&lt;/h2&gt;

&lt;p&gt;A smart pointer is a class that wraps a raw pointer and manages its lifetime. When the smart pointer goes out of scope, it automatically deletes the object it owns. The destructor does the cleanup. Every exit path, every exception, every early return, the object gets deleted exactly once.&lt;/p&gt;

&lt;p&gt;C++ provides three types in the standard library: &lt;code&gt;unique_ptr&lt;/code&gt; for single ownership, &lt;code&gt;shared_ptr&lt;/code&gt; for shared ownership with reference counting, and &lt;code&gt;weak_ptr&lt;/code&gt; for non-owning observation. Each solves different ownership patterns.&lt;/p&gt;

&lt;p&gt;In the crawler, every HTTP response will need a parser. The crawler creates the parser, uses it to extract links and content, then should destroy it. With raw pointers, I'd need &lt;code&gt;delete&lt;/code&gt; calls after normal completion, after parsing errors, after network timeouts, after receiving invalid HTML. Miss one path and memory leaks.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;unique_ptr&lt;/code&gt;, the parser gets deleted automatically when I'm done with it. The function will create a parser using &lt;code&gt;make_unique&lt;/code&gt;, fetch HTML content, and process it. If the fetch returns empty, the function returns early and the parser destructor runs automatically. If parsing throws an exception, the stack unwinds and the parser destructor runs. On normal completion, the function ends and the parser destructor runs. Every path works correctly without manual cleanup scattered everywhere.&lt;/p&gt;
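
&lt;p&gt;A sketch of that shape (&lt;code&gt;HtmlParser&lt;/code&gt; and &lt;code&gt;fetch&lt;/code&gt; are placeholders, not the crawler's real API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;memory&amp;gt;
#include &amp;lt;string&amp;gt;

struct HtmlParser {
  void parse(const std::string &amp;amp;html) { /* may throw on malformed input */ }
  void extract_links() { /* ... */ }
};
std::string fetch(const std::string &amp;amp;url) { return "stub"; /* HTTP request goes here */ }

void process(const std::string &amp;amp;url) {
  auto parser = std::make_unique&amp;lt;HtmlParser&amp;gt;();  // this scope owns the parser
  std::string html = fetch(url);
  if (html.empty()) return;     // early return: the destructor still runs, nothing leaks
  parser-&amp;gt;parse(html);          // if this throws, stack unwinding runs the destructor
  parser-&amp;gt;extract_links();
}                                // normal completion: the destructor runs here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;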

&lt;h2&gt;
  
  
  unique_ptr: single ownership
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;unique_ptr&lt;/code&gt; owns exactly one object and cannot be copied, only moved. This transfers ownership explicitly. When the &lt;code&gt;unique_ptr&lt;/code&gt; goes out of scope or gets reset, it calls &lt;code&gt;delete&lt;/code&gt; on the object it owns.&lt;/p&gt;

&lt;p&gt;The performance cost is zero. A &lt;code&gt;unique_ptr&amp;lt;T&amp;gt;&lt;/code&gt; compiles to the exact same assembly as a raw &lt;code&gt;T*&lt;/code&gt; pointer. The compiler optimizes away the wrapper completely. You get automatic memory management at no runtime cost.&lt;/p&gt;

&lt;p&gt;The crawler's URL queue will use this pattern. Each URL gets fetched exactly once, and one component owns that work. The queue will store crawl tasks wrapped in &lt;code&gt;unique_ptr&lt;/code&gt;. Each task contains a URL, a depth counter for limiting how deep the crawler goes, and the logic to fetch and process that URL. When I need to process a task, I'll pop it from the queue by moving ownership out. The queue no longer owns it, the processing function now owns it. When processing completes, the task goes out of scope and gets deleted automatically.&lt;/p&gt;

&lt;p&gt;This prevents the bug where the queue thinks it still owns the task and tries to delete it while another thread is using it. Move semantics make this impossible. Once ownership transfers out, the queue has nothing. It can't accidentally delete something it no longer owns. The compiler enforces this. Try to copy a &lt;code&gt;unique_ptr&lt;/code&gt; and the code won't compile.&lt;/p&gt;

&lt;p&gt;The type system documents who's responsible for cleanup. The queue owns tasks, processing borrows them temporarily. No ambiguity about whose job it is to call &lt;code&gt;delete&lt;/code&gt;.&lt;/p&gt;
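
&lt;p&gt;A sketch of the planned queue (&lt;code&gt;CrawlTask&lt;/code&gt; is a placeholder, not the real class):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;memory&amp;gt;
#include &amp;lt;queue&amp;gt;
#include &amp;lt;string&amp;gt;

struct CrawlTask {
  std::string url;
  int depth = 0;
  void run() { /* fetch, parse, enqueue discovered links */ }
};

std::queue&amp;lt;std::unique_ptr&amp;lt;CrawlTask&amp;gt;&amp;gt; tasks;

void process_next() {
  if (tasks.empty()) return;
  auto task = std::move(tasks.front());  // ownership transfers out of the queue
  tasks.pop();                           // the queue keeps nothing it could double-delete
  task-&amp;gt;run();
  // auto copy = task;                   // would not compile: unique_ptr cannot be copied
}                                        // task goes out of scope: deleted exactly once
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;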

&lt;h2&gt;
  
  
  shared_ptr: when you need shared ownership
&lt;/h2&gt;

&lt;p&gt;The crawler will maintain a cache of parsed robots.txt files. Multiple URLs from the same domain need to check the same robots.txt. The cache owns these files, but active crawl tasks also need access to them. The file shouldn't be deleted until both the cache evicts it and all tasks using it complete.&lt;/p&gt;

&lt;p&gt;This needs shared ownership. Multiple &lt;code&gt;shared_ptr&lt;/code&gt; instances can point to the same object. A reference count tracks how many owners exist. When a new &lt;code&gt;shared_ptr&lt;/code&gt; copies from an existing one, the reference count increments. When a &lt;code&gt;shared_ptr&lt;/code&gt; gets destroyed, the reference count decrements. When the count hits zero, the last &lt;code&gt;shared_ptr&lt;/code&gt; deletes the object.&lt;/p&gt;

&lt;p&gt;The cache will store robots.txt files as &lt;code&gt;shared_ptr&lt;/code&gt;. When a task needs to check if a URL is allowed, it asks the cache for that domain's robots.txt. The cache returns a copy of the &lt;code&gt;shared_ptr&lt;/code&gt;, incrementing the reference count. Now both the cache and the task own the robots.txt. If the cache decides to evict that entry to save memory, it can delete its copy of the &lt;code&gt;shared_ptr&lt;/code&gt;. The reference count decrements but doesn't hit zero because the task still owns a copy. The robots.txt stays alive. When the task finishes and its &lt;code&gt;shared_ptr&lt;/code&gt; gets destroyed, the reference count hits zero and the robots.txt gets deleted.&lt;/p&gt;

&lt;p&gt;No dangling pointers. No use-after-free bugs. The task can safely use the robots.txt even after the cache evicted it.&lt;/p&gt;
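
&lt;p&gt;A sketch of the planned cache (&lt;code&gt;RobotsTxt&lt;/code&gt; is a placeholder type):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;memory&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;unordered_map&amp;gt;

struct RobotsTxt { /* parsed rules and crawl delay */ };

class RobotsCache {
 public:
  void put(const std::string &amp;amp;domain, std::shared_ptr&amp;lt;RobotsTxt&amp;gt; robots) {
    cache_[domain] = std::move(robots);
  }
  std::shared_ptr&amp;lt;RobotsTxt&amp;gt; get(const std::string &amp;amp;domain) {
    auto it = cache_.find(domain);
    if (it == cache_.end()) return nullptr;  // not cached (or already evicted)
    return it-&amp;gt;second;  // copying bumps the reference count: the task now co-owns it
  }
  void evict(const std::string &amp;amp;domain) {
    cache_.erase(domain);  // count drops; the object survives while any task still holds a copy
  }
 private:
  std::unordered_map&amp;lt;std::string, std::shared_ptr&amp;lt;RobotsTxt&amp;gt;&amp;gt; cache_;
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Entries would be created with &lt;code&gt;make_shared&lt;/code&gt;, for the reasons covered below.&lt;/p&gt;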

&lt;p&gt;The cost is real though. Each &lt;code&gt;shared_ptr&lt;/code&gt; stores two pointers: one to the object and one to a control block that holds the reference count. That's 16 bytes on a 64-bit system instead of 8 bytes for a raw pointer. Incrementing and decrementing the reference count uses atomic operations for thread safety. Atomic operations are significantly slower than regular integer operations because they need to coordinate across CPU cores. Creating a &lt;code&gt;shared_ptr&lt;/code&gt; with the naive approach allocates memory twice: once for the object and once for the control block.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;make_shared&lt;/code&gt; to fix the double allocation problem. It allocates the object and control block in one contiguous chunk, cutting allocation overhead in half and improving cache locality since the object and its metadata sit next to each other in memory.&lt;/p&gt;

&lt;p&gt;Don't default to &lt;code&gt;shared_ptr&lt;/code&gt; because it seems easier than thinking about ownership. Shared ownership makes reasoning about lifetimes harder. When ten different components all own something, figuring out when it actually gets deleted requires tracking all ten owners. Use &lt;code&gt;shared_ptr&lt;/code&gt; only when you actually need multiple owners, like caches where clients need to keep using objects even after eviction, callbacks that outlive the code that registered them, or async operations where multiple threads need access to shared state.&lt;/p&gt;

&lt;h2&gt;
  
  
  weak_ptr: breaking cycles
&lt;/h2&gt;

&lt;p&gt;The crawler will represent the web as a graph of pages. Each page object stores its URL, parsed content, and references to other pages it links to. If I use &lt;code&gt;shared_ptr&lt;/code&gt; for these outbound links, I create circular references. Page A links to Page B, which links back to Page A. Both hold &lt;code&gt;shared_ptr&lt;/code&gt;s to each other. The reference counts never hit zero. Memory leaks despite using smart pointers.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;weak_ptr&lt;/code&gt; solves this. It holds a non-owning reference to an object managed by &lt;code&gt;shared_ptr&lt;/code&gt;. It doesn't increment the reference count. The object can be deleted while &lt;code&gt;weak_ptr&lt;/code&gt;s still reference it. Before using a &lt;code&gt;weak_ptr&lt;/code&gt;, convert it to a temporary &lt;code&gt;shared_ptr&lt;/code&gt; by calling &lt;code&gt;lock()&lt;/code&gt;. This returns an empty &lt;code&gt;shared_ptr&lt;/code&gt; if the object was already deleted, or a valid &lt;code&gt;shared_ptr&lt;/code&gt; if it still exists.&lt;/p&gt;

&lt;p&gt;The page cache will own pages with &lt;code&gt;shared_ptr&lt;/code&gt;. When I add a link from one page to another, the source page stores a &lt;code&gt;weak_ptr&lt;/code&gt; to the target. The target's reference count doesn't increase. When the cache evicts the target page, that page gets deleted even though other pages still reference it. The &lt;code&gt;weak_ptr&lt;/code&gt;s don't keep it alive.&lt;/p&gt;

&lt;p&gt;When I need to traverse the graph and visit all pages a given page links to, I'll iterate through its &lt;code&gt;weak_ptr&lt;/code&gt; list and call &lt;code&gt;lock()&lt;/code&gt; on each one. If the target page still exists, &lt;code&gt;lock()&lt;/code&gt; returns a valid &lt;code&gt;shared_ptr&lt;/code&gt; and I can access the URL. If the target was deleted, &lt;code&gt;lock()&lt;/code&gt; returns empty and I skip it. The code handles missing pages gracefully without crashes or undefined behavior.&lt;/p&gt;
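
&lt;p&gt;A sketch of that traversal (&lt;code&gt;Page&lt;/code&gt; is a placeholder type):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;memory&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;vector&amp;gt;

struct Page {
  std::string url;
  std::vector&amp;lt;std::weak_ptr&amp;lt;Page&amp;gt;&amp;gt; links;  // outbound edges don't keep targets alive
};

std::vector&amp;lt;std::string&amp;gt; live_link_urls(const Page &amp;amp;page) {
  std::vector&amp;lt;std::string&amp;gt; urls;
  for (const auto &amp;amp;link : page.links) {
    if (std::shared_ptr&amp;lt;Page&amp;gt; target = link.lock()) {  // promote to a temporary owner
      urls.push_back(target-&amp;gt;url);                      // safe: target stays alive for this block
    }
    // lock() returned empty: the cache evicted that page, so it's skipped
  }
  return urls;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;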

&lt;p&gt;This pattern shows up everywhere in large systems. Parent-child relationships use it: parents own children with &lt;code&gt;shared_ptr&lt;/code&gt;, children reference parents with &lt;code&gt;weak_ptr&lt;/code&gt;. Otherwise parents and children would keep each other alive forever. Observer patterns use it: the subject being observed is owned elsewhere, observers hold &lt;code&gt;weak_ptr&lt;/code&gt; so they don't prevent the subject from being deleted. Caches use it: the cache uses &lt;code&gt;shared_ptr&lt;/code&gt; for ownership, clients get &lt;code&gt;weak_ptr&lt;/code&gt; so they can access objects but don't prevent eviction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why raw pointers still exist
&lt;/h2&gt;

&lt;p&gt;Raw pointers aren't gone. They're for non-owning references within a limited scope. When a function takes a parameter it doesn't own and won't outlive the call, use a raw pointer or reference.&lt;/p&gt;

&lt;p&gt;The crawler will have a function that processes HTML given a parser. The function doesn't own the parser and doesn't need to keep it alive. It just needs to use it during the function call. Passing a raw pointer or reference is perfect here. The caller owns the parser, the processing function borrows it. When processing completes, the parser goes back to being owned by the caller.&lt;/p&gt;

&lt;p&gt;The rule became: smart pointers for ownership, raw pointers for borrowing. The type system documents who's responsible for cleanup. A function taking &lt;code&gt;unique_ptr&lt;/code&gt; by value takes ownership. A function taking &lt;code&gt;shared_ptr&lt;/code&gt; by value shares ownership. A function taking a raw pointer borrows without ownership. You can see the memory management contract in the function signature.&lt;/p&gt;
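
&lt;p&gt;The contracts, side by side (signatures only, names illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;memory&amp;gt;

struct HtmlParser;  // placeholder

void take_ownership(std::unique_ptr&amp;lt;HtmlParser&amp;gt; parser);   // caller hands the parser over for good
void share_ownership(std::shared_ptr&amp;lt;HtmlParser&amp;gt; parser);  // caller and callee both own it
void borrow(const HtmlParser &amp;amp;parser);                      // callee just uses it during the call
void maybe_borrow(const HtmlParser *parser);                 // same, but "no parser" is a valid input
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;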

&lt;h2&gt;
  
  
  What this changes
&lt;/h2&gt;

&lt;p&gt;Smart pointers make ownership explicit in the type system. The cache will use &lt;code&gt;shared_ptr&lt;/code&gt; because multiple systems need access and it's unclear who finishes last. Tasks will use &lt;code&gt;unique_ptr&lt;/code&gt; because they have clear single owners. Links will use &lt;code&gt;weak_ptr&lt;/code&gt; to avoid cycles. The code will say what it does through the types instead of through comments and developer discipline.&lt;/p&gt;

&lt;p&gt;This approach showed up in Rust as the entire language design. Every type has ownership semantics enforced at compile time. You can't compile code that would cause a use-after-free. You can't accidentally create circular references. The borrow checker rejects programs with ambiguous ownership. C++ made smart pointers optional, letting you choose between manual memory management and automatic cleanup. Rust made ownership tracking mandatory, moving all these bugs from runtime to compile time.&lt;/p&gt;

&lt;p&gt;Go went the opposite direction and chose garbage collection. Memory management happens automatically at runtime through a concurrent mark-and-sweep collector. No ownership tracking needed. No thinking about when objects get deleted. You pay for this with GC pauses where the program stops to clean up memory, and less control over when cleanup actually happens. Each language learned from C++'s complexity and made different trade-offs based on their priorities.&lt;/p&gt;

&lt;p&gt;In modern C++, if you're writing &lt;code&gt;new&lt;/code&gt; and &lt;code&gt;delete&lt;/code&gt; by hand, you're writing C++98. The language moved on two decades ago. Use &lt;code&gt;make_unique&lt;/code&gt; for single ownership, &lt;code&gt;make_shared&lt;/code&gt; when you need multiple owners, and &lt;code&gt;weak_ptr&lt;/code&gt; to observe without owning. The ownership model becomes clear in the code instead of existing only in comments and documentation. The compiler handles the cleanup, and you get zero-overhead abstractions that cost nothing at runtime.&lt;/p&gt;

&lt;p&gt;The crawler isn't built yet, but the design decisions are already clear. Smart pointers make the ownership explicit before I write the implementation. College taught "be careful" with raw pointers. Modern C++ provides actual tools instead of advice.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: Templates - Why C++ compiles so slowly&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cpp</category>
    </item>
  </channel>
</rss>
