Part 3 of "You Didn't Learn C++ in College"
I'm building a web crawler to learn C++ and understand how search engines work. Not a toy project that crawls ten pages and calls it done, but something that needs to run for hours, handle thousands of URLs, and not explode. This means dealing with the reality that every college programming project conveniently ignores: programs that actually stay running.
My college data structures course taught new and delete, then handed us assignments that ran for 30 seconds and exited. Memory leaks? Dangling pointers? "Just be careful" was the advice. The assignments ended before the leaks mattered. Those short-lived programs never exposed the problems with manual memory management.
A web crawler runs for hours and processes thousands of documents. Miss a single delete in an error path, and you leak memory on every failed HTTP request. Forget to clean up when a parsing exception gets thrown, and memory usage climbs until the system kills your process. Delete the same object twice because two threads finished at the same time, and the program crashes with a memory corruption error that's nearly impossible to debug.
Raw pointers and manual delete calls don't scale to long-running programs. So for this crawler, I'm using smart pointers from the start. They're RAII applied to memory management, and they make the whole "be careful" thing obsolete.
What smart pointers actually are
A smart pointer is a class that wraps a raw pointer and manages its lifetime. When the smart pointer goes out of scope, it automatically deletes the object it owns. The destructor does the cleanup. Every exit path, every exception, every early return, the object gets deleted exactly once.
C++ provides three types in the standard library: unique_ptr for single ownership, shared_ptr for shared ownership with reference counting, and weak_ptr for non-owning observation. Each solves different ownership patterns.
In the crawler, every HTTP response will need a parser. The crawler creates the parser, uses it to extract links and content, then should destroy it. With raw pointers, I'd need delete calls after normal completion, after parsing errors, after network timeouts, after receiving invalid HTML. Miss one path and memory leaks.
With unique_ptr, the parser gets deleted automatically when I'm done with it. The function will create a parser using make_unique, fetch HTML content, and process it. If the fetch returns empty, the function returns early and the parser destructor runs automatically. If parsing throws an exception, the stack unwinds and the parser destructor runs. On normal completion, the function ends and the parser destructor runs. Every path works correctly without manual cleanup scattered everywhere.
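Here's a rough sketch of that flow. HtmlParser and fetch_html are placeholder names I'm using for illustration, not the crawler's real components:

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins: the real crawler's parser and fetcher aren't written yet.
struct HtmlParser {
    std::vector<std::string> parse(const std::string& /*html*/) {
        // Real link extraction would go here.
        return {};
    }
};

std::string fetch_html(const std::string& /*url*/) {
    return "";  // stub: pretend the request failed
}

std::vector<std::string> crawl_page(const std::string& url) {
    auto parser = std::make_unique<HtmlParser>();

    std::string html = fetch_html(url);
    if (html.empty())
        return {};                 // early return: the parser destructor still runs

    // If parse() throws, the stack unwinds and the parser is deleted anyway.
    return parser->parse(html);
    // Normal completion: parser goes out of scope here, deleted exactly once.
}
```

Three exit paths, zero delete calls.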
unique_ptr: single ownership
A unique_ptr owns exactly one object and cannot be copied, only moved. This transfers ownership explicitly. When the unique_ptr goes out of scope or gets reset, it calls delete on the object it owns.
With the default deleter, the performance cost is essentially zero. A unique_ptr<T> is the same size as a raw T*, and in typical use the compiler optimizes the wrapper away entirely. You get automatic memory management at effectively no runtime cost.
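You can check the size claim directly. On the mainstream standard library implementations the default deleter is stateless, so the wrapper adds no storage beyond the pointer itself:

```cpp
#include <memory>

// Holds on mainstream implementations (libstdc++, libc++, MSVC): the default
// deleter is stateless, so unique_ptr carries nothing but the raw pointer.
static_assert(sizeof(std::unique_ptr<int>) == sizeof(int*),
              "unique_ptr with the default deleter is pointer-sized");
```

A stateful custom deleter would grow it, but the default case is free.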
The crawler's URL queue will use this pattern. Each URL gets fetched exactly once, and one component owns that work. The queue will store crawl tasks wrapped in unique_ptr. Each task contains a URL, a depth counter for limiting how deep the crawler goes, and the logic to fetch and process that URL. When I need to process a task, I'll pop it from the queue by moving ownership out. The queue no longer owns it, the processing function now owns it. When processing completes, the task goes out of scope and gets deleted automatically.
This prevents the bug where the queue thinks it still owns the task and tries to delete it while another thread is using it. Move semantics make this impossible. Once ownership transfers out, the queue has nothing. It can't accidentally delete something it no longer owns. The compiler enforces this. Try to copy a unique_ptr and the code won't compile.
The type system documents who's responsible for cleanup. The queue owns tasks, processing borrows them temporarily. No ambiguity about whose job it is to call delete.
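A sketch of that queue, assuming a CrawlTask type that doesn't exist yet. The point is that push takes ownership and pop hands it back out by moving:

```cpp
#include <deque>
#include <memory>
#include <string>

// Hypothetical CrawlTask; the fields and run() are illustrative, not the final design.
struct CrawlTask {
    std::string url;
    int depth = 0;
    void run() { /* fetch and process url, respecting depth */ }
};

class TaskQueue {
public:
    void push(std::unique_ptr<CrawlTask> task) {
        tasks_.push_back(std::move(task));      // the queue takes ownership
    }

    std::unique_ptr<CrawlTask> pop() {
        auto task = std::move(tasks_.front());  // ownership moves out of the queue
        tasks_.pop_front();                     // the queue keeps nothing behind
        return task;
    }

    bool empty() const { return tasks_.empty(); }

private:
    std::deque<std::unique_ptr<CrawlTask>> tasks_;
};

int main() {
    TaskQueue queue;

    auto task = std::make_unique<CrawlTask>();
    task->url = "https://example.com";
    queue.push(std::move(task));                // won't compile without std::move

    while (!queue.empty()) {
        auto current = queue.pop();             // this scope owns the task now
        current->run();
    }                                           // current destroyed here, exactly once
}
```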
shared_ptr: when you need shared ownership
The crawler will maintain a cache of parsed robots.txt files. Multiple URLs from the same domain need to check the same robots.txt. The cache owns these files, but active crawl tasks also need access to them. The file shouldn't be deleted until both the cache evicts it and all tasks using it complete.
This needs shared ownership. Multiple shared_ptr instances can point to the same object. A reference count tracks how many owners exist. When a new shared_ptr copies from an existing one, the reference count increments. When a shared_ptr gets destroyed, the reference count decrements. When the count hits zero, the last shared_ptr deletes the object.
The cache will store robots.txt files as shared_ptr. When a task needs to check if a URL is allowed, it asks the cache for that domain's robots.txt. The cache returns a copy of the shared_ptr, incrementing the reference count. Now both the cache and the task own the robots.txt. If the cache decides to evict that entry to save memory, it can delete its copy of the shared_ptr. The reference count decrements but doesn't hit zero because the task still owns a copy. The robots.txt stays alive. When the task finishes and its shared_ptr gets destroyed, the reference count hits zero and the robots.txt gets deleted.
No dangling pointers. No use-after-free bugs. The task can safely use the robots.txt even after the cache evicted it.
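A minimal version of that cache might look like this, with RobotsTxt standing in for whatever parsed representation I end up with:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical parsed robots.txt; the real type would hold the allow/deny rules.
struct RobotsTxt {
    bool allows(const std::string& /*path*/) const { return true; }
};

class RobotsCache {
public:
    // Hands back a copy of the shared_ptr, bumping the count; nullptr if not cached.
    std::shared_ptr<RobotsTxt> get(const std::string& domain) const {
        auto it = cache_.find(domain);
        if (it == cache_.end()) return nullptr;
        return it->second;
    }

    void put(const std::string& domain, std::shared_ptr<RobotsTxt> robots) {
        cache_[domain] = std::move(robots);
    }

    // Eviction drops the cache's copy; the object lives on while any task holds one.
    void evict(const std::string& domain) { cache_.erase(domain); }

private:
    std::unordered_map<std::string, std::shared_ptr<RobotsTxt>> cache_;
};

int main() {
    RobotsCache cache;
    cache.put("example.com", std::make_shared<RobotsTxt>());

    auto robots = cache.get("example.com");  // task shares ownership, count is now 2
    cache.evict("example.com");              // cache's copy is gone, count drops to 1

    bool allowed = robots->allows("/index.html");  // still valid: the task keeps it alive
    (void)allowed;
}                                            // last owner gone, RobotsTxt deleted here
```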
The cost is real though. Each shared_ptr stores two pointers: one to the object and one to a control block that holds the reference count. That's 16 bytes on a 64-bit system instead of 8 bytes for a raw pointer. Incrementing and decrementing the reference count uses atomic operations for thread safety. Atomic operations are significantly slower than regular integer operations because they need to coordinate across CPU cores. Creating a shared_ptr with the naive approach allocates memory twice: once for the object and once for the control block.
Use make_shared to fix the double allocation problem. It allocates the object and control block in one contiguous chunk, cutting allocation overhead in half and improving cache locality since the object and its metadata sit next to each other in memory.
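The difference is one line (RobotsTxt again standing in for the real type):

```cpp
#include <memory>

struct RobotsTxt { /* parsed rules */ };

// Two heap allocations: one for the RobotsTxt, one for the reference-count block.
std::shared_ptr<RobotsTxt> naive{new RobotsTxt()};

// One allocation: the object and its control block share a single chunk.
auto better = std::make_shared<RobotsTxt>();
```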
Don't default to shared_ptr because it seems easier than thinking about ownership. Shared ownership makes reasoning about lifetimes harder. When ten different components all own something, figuring out when it actually gets deleted requires tracking all ten owners. Use shared_ptr only when you actually need multiple owners, like caches where clients need to keep using objects even after eviction, callbacks that outlive the code that registered them, or async operations where multiple threads need access to shared state.
weak_ptr: breaking cycles
The crawler will represent the web as a graph of pages. Each page object stores its URL, parsed content, and references to other pages it links to. If I use shared_ptr for these outbound links, I create circular references. Page A links to Page B, which links back to Page A. Both hold shared_ptrs to each other. The reference counts never hit zero. Memory leaks despite using smart pointers.
weak_ptr solves this. It holds a non-owning reference to an object managed by shared_ptr. It doesn't increment the reference count. The object can be deleted while weak_ptrs still reference it. Before using a weak_ptr, convert it to a temporary shared_ptr by calling lock(). This returns an empty shared_ptr if the object was already deleted, or a valid shared_ptr if it still exists.
The page cache will own pages with shared_ptr. When I add a link from one page to another, the source page stores a weak_ptr to the target. The target's reference count doesn't increase. When the cache evicts the target page, that page gets deleted even though other pages still reference it. The weak_ptrs don't keep it alive.
When I need to traverse the graph and visit all pages a given page links to, I'll iterate through its weak_ptr list and call lock() on each one. If the target page still exists, lock() returns a valid shared_ptr and I can access the URL. If the target was deleted, lock() returns empty and I skip it. The code handles missing pages gracefully without crashes or undefined behavior.
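Here's what that traversal could look like, with Page reduced to just a URL and its outbound links:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Hypothetical Page node; link structure only, content omitted.
struct Page {
    std::string url;
    std::vector<std::weak_ptr<Page>> links;   // non-owning: doesn't keep targets alive
};

void print_live_links(const Page& page) {
    for (const auto& weak : page.links) {
        if (auto target = weak.lock()) {      // valid shared_ptr if the page still exists
            std::cout << page.url << " -> " << target->url << '\n';
        }
        // If lock() returned empty, the target was evicted; skip it without crashing.
    }
}

int main() {
    auto a = std::make_shared<Page>();
    a->url = "https://a.example";
    auto b = std::make_shared<Page>();
    b->url = "https://b.example";

    a->links.push_back(b);   // weak_ptr built from the shared_ptr; count stays at 1
    b->links.push_back(a);   // back-link: no cycle, because neither link is owning

    print_live_links(*a);    // prints the link to b

    b.reset();               // "cache eviction": b is deleted despite a's link to it
    print_live_links(*a);    // prints nothing; the dead link is skipped
}
```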
This pattern shows up everywhere in large systems. Parent-child relationships use it: parents own children with shared_ptr, children reference parents with weak_ptr. Otherwise parents and children would keep each other alive forever. Observer patterns use it: the subject being observed is owned elsewhere, observers hold weak_ptr so they don't prevent the subject from being deleted. Caches use it: the cache uses shared_ptr for ownership, clients get weak_ptr so they can access objects but don't prevent eviction.
Why raw pointers still exist
Raw pointers aren't gone. They're for non-owning references within a limited scope. When a function takes a parameter it doesn't own and won't outlive the call, use a raw pointer or reference.
The crawler will have a function that processes HTML given a parser. The function doesn't own the parser and doesn't need to keep it alive. It just needs to use it during the function call. Passing a raw pointer or reference is perfect here. The caller owns the parser, the processing function borrows it. When processing completes, the parser goes back to being owned by the caller.
The rule is simple: smart pointers for ownership, raw pointers and references for borrowing. The type system documents who's responsible for cleanup. A function taking unique_ptr by value takes ownership. A function taking shared_ptr by value shares ownership. A function taking a raw pointer or reference borrows without ownership. You can see the memory management contract in the function signature.
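In signature form, the contract reads something like this (HtmlParser and the function names are placeholders):

```cpp
#include <memory>

struct HtmlParser { /* ... */ };

// Takes ownership: the caller must std::move its unique_ptr in, and loses it.
void process_and_keep(std::unique_ptr<HtmlParser> parser);

// Shares ownership: the reference count goes up for at least the duration of the call.
void process_async(std::shared_ptr<HtmlParser> parser);

// Borrows: the caller still owns the parser; this function just uses it and returns.
void process(HtmlParser& parser);
void process_if_present(HtmlParser* parser);  // raw pointer when "no parser" is valid
```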
What this changes
Smart pointers make ownership explicit in the type system. The cache will use shared_ptr because multiple systems need access and it's unclear who finishes last. Tasks will use unique_ptr because they have clear single owners. Links will use weak_ptr to avoid cycles. The code will say what it does through the types instead of through comments and developer discipline.
This approach showed up in Rust as the entire language design. Every type has ownership semantics enforced at compile time. Code that would cause a use-after-free doesn't compile, and the borrow checker rejects programs with ambiguous ownership. (Reference cycles can still leak in Rust if you reach for its shared-ownership type Rc, but single ownership is the default rather than something you opt into.) C++ made smart pointers optional, letting you choose between manual memory management and automatic cleanup. Rust made ownership tracking mandatory, moving most of these bugs from runtime to compile time.
Go went the opposite direction and chose garbage collection. Memory management happens automatically at runtime through a concurrent mark-and-sweep collector. No ownership tracking needed. No thinking about when objects get deleted. You pay for this with GC pauses where the program stops to clean up memory, and less control over when cleanup actually happens. Each language learned from C++'s complexity and made different trade-offs based on their priorities.
In modern C++, if you're writing new and delete by hand, you're writing C++98. The language moved on two decades ago. Use make_unique for single ownership, make_shared when you need multiple owners, and weak_ptr to observe without owning. The ownership model becomes clear in the code instead of existing only in comments and documentation. The compiler handles the cleanup, and in the common unique_ptr case the abstraction costs essentially nothing at runtime.
The crawler isn't built yet, but the design decisions are already clear. Smart pointers make the ownership explicit before I write the implementation. College taught "be careful" with raw pointers. Modern C++ provides actual tools instead of advice.
Next: Templates - Why C++ compiles so slowly