DEV Community: Emmanuel Onuiteshi

Scaling to 1 Million Users : Load Balancing & Caching Strategies

Emmanuel Onuiteshi — Mon, 25 May 2026 22:45:28 +0000

You build a URL shortener in a weekend. It works perfectly. Then it goes viral.

At first it’s just your friends. Pages load instantly, and they love what you’ve built. They share the link with others, and you watch the user count tick towards a hundred. That quiet excitement hits; you’ve made something real. Then more people start using it.

You keep opening your dashboard. The numbers are climbing faster than you can refresh. You are half excited, half terrified. Hundreds is turning into thousands. Then the notifications start: the app is slow, links are not redirecting, users are complaining publicly. The same system that handled a hundred users without blinking is now falling apart under a thousand. You have not changed a single line of code.
So what went wrong?

Nothing went wrong. You just hit the wall that every growing system hits eventually.
The question is:

What do you actually do?

The Scaling Roadmap

Scaling isn’t a single decision; it’s a series of targeted upgrades, each one unlocking the next order of magnitude. Here’s the progression most high-traffic apps follow:

•Single Server - handles your first few thousand users
•Load Balancer - distributes traffic across multiple servers
•Caching Layer - serves popular data from memory instead of the database
•Content Delivery Network (CDN) - pushes content closer to users globally
•Distributed Cache - spreads cache across multiple machines for millions of users

Single Server : Your first deployment runs everything on one machine: request handling, database queries, link generation, and page serving. This is perfectly fine up to a few thousand users; the pages load fast and the setup is simple. Don’t over-engineer this stage.

Load Balancer : Once you’re in the tens of thousands, a single server starts to buckle. Requests queue up, response times climb, and occasional timeouts start appearing. A load balancer sits in front of your servers and distributes incoming traffic across a pool of app servers, ensuring no single machine becomes a bottleneck. Traffic spikes that would have crashed your app are now absorbed gracefully.

Caching Layer : At hundreds of thousands of users, a pattern becomes obvious: the same short codes are being resolved over and over. Instead of hitting the database every time, a cache layer stores the most frequently accessed mappings in memory. A lookup that previously cost a 40ms database round-trip now completes in under 1ms. Database load drops dramatically, and your app can handle far more concurrent users on the same hardware.

Content Delivery Network (CDN) : Once your users are spread across the globe, physical distance becomes a problem. A CDN places copies of your static assets and cache-able responses at edge locations around the world. A user in Lagos, Berlin, or Sydney gets their redirect served from a nearby edge node rather than your origin server in, say, Virginia. Latency drops from hundreds of milliseconds to single digits.

Distributed Caching : At millions of users, even a single powerful cache server becomes a constraint. A distributed cache, such as Redis Cluster, spreads data across multiple nodes. The most popular short links are served instantly from memory, read throughput scales horizontally, and the system stays fast even under massive, sustained load.

Load Balancing: Distributing Traffic Across Servers

Round-Robin : Round-robin is the simplest traffic distribution strategy: each incoming request is sent to the next server in rotation, cycling back to the start. It works well when servers are equally capable and traffic is fairly uniform. For a URL shortener handling stateless redirect requests, round-robin is a reasonable starting point at modest scale.

But round-robin has a critical blind spot. It knows nothing about data locality. If one server has cached a hot short code in memory, round-robin may send the next request for that code to a different server entirely, causing a cache miss. At scale, this causes unnecessary database pressure and unpredictable latency. Adding or removing servers also reshuffles which server handles which requests, wiping out accumulated cache state.
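
The blind spot is easy to see in a minimal sketch. The class and server names below are hypothetical, and the balancer is an in-process stand-in for a real load balancer: it hands out servers in rotation and never looks at which short code is being requested.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in rotation, ignoring what each request is for."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def pick(self, short_code):
        # short_code is deliberately unused: round-robin has no cache affinity.
        return next(self._rotation)


balancer = RoundRobinBalancer(["app1", "app2", "app3"])
picks = [balancer.pick("abc123") for _ in range(4)]
print(picks)  # ['app1', 'app2', 'app3', 'app1']
```

Four requests for the same short code touch all three servers, so each server has to warm its own copy of that mapping.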

The Rehashing Problem : Imagine your URL shortener has four servers, each caching a quarter of your popular short codes. You add a fifth server to handle increased load. With naive modulo hashing (hash(short_code) % number_of_servers), roughly 80% of your cache keys now map to different servers, because a key stays put only when its hash leaves the same remainder for both 4 and 5. Users experience redirect failures and slowdowns while servers frantically rebuild their caches.
It’s like rearranging a warehouse mid-shipment.
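
A quick simulation makes the 80% figure concrete. This is a sketch under assumptions: the key names are made up, and MD5 is used only as a convenient stable hash (Python's built-in hash() is randomized per process for strings).

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash of a key, identical across runs and machines."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def server_for(key: str, num_servers: int) -> int:
    """Naive mod-N placement: server index is hash modulo pool size."""
    return stable_hash(key) % num_servers


keys = [f"code{i}" for i in range(10_000)]
moved = sum(server_for(k, 4) != server_for(k, 5) for k in keys)
fraction_moved = moved / len(keys)
print(f"{fraction_moved:.0%} of keys changed servers")
```

A key keeps its server only when its hash modulo 20 is 0, 1, 2, or 3, which is 4 cases out of 20, so about 80% of keys are expected to move.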

Consistent Hashing: The Production Solution
Consistent hashing solves this cleanly. Picture a ring. Servers occupy fixed positions along the ring, and each short code is hashed to a point on the ring. Requests route clockwise to the nearest server. When you add a new server, only the keys in the arc immediately preceding its position need to migrate: roughly 1/N of total keys, where N is the number of servers after the addition. Virtual nodes (multiple positions per server) smooth out load distribution even further.

For your URL shortener, consistent hashing on the short_code ensures that popular links reliably route to the server holding their cache, and that adding capacity during a traffic spike doesn’t cascade into a cache stampede.
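
The ring can be sketched in a few lines. This is an illustrative in-memory version, not a production router: the HashRing class, the 100 virtual nodes per server, and the use of MD5 as the ring hash are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a stable position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions of all virtual nodes
        self._owner = {}    # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.vnodes):
            point = ring_hash(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def lookup(self, key):
        # First virtual node clockwise from the key, wrapping past the end.
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"code{i}" for i in range(10_000)]
ring = HashRing(["app1", "app2", "app3", "app4"])
before = {k: ring.lookup(k) for k in keys}
ring.add("app5")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved")
```

Every key that moves lands on the new server; keys owned by the other four stay where they were, so their caches remain warm.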

Here’s how round-robin looks in an NGINX upstream configuration:

upstream app_servers {
    server app1.example.com;
    server app2.example.com;
    server app3.example.com;
}

server {
    location / {
        proxy_pass http://app_servers;
    }
}

Algorithm comparison:

Algorithm	Keys Moved on Change	Used By	Best For
Round-Robin	N/A (no cache affinity)	NGINX default	Stateless, uniform requests
Mod-N Hashing	~80% (going from 4 to 5 servers)	Legacy systems	Static server pools only
Consistent Hashing	~1/N (minimum possible)	DynamoDB, Cassandra, Akamai	Dynamic scaling, cache affinity
Power of Two Choices	N/A (load-aware)	AWS Lambda, Envoy	Multi-LB environments, service mesh

Real-world precedent : Netflix applies consistent hashing to route requests to the servers holding cached video segment data. Popular content is served without repeatedly querying origin storage, keeping playback smooth even under massive load. The same principle applies directly to your URL shortener.

HTTP Caching: Making the Web Faster

HTTP caching is built into the web protocol. When configured correctly, browsers and CDN edge nodes store responses locally, eliminating redundant trips to your origin servers. The key headers are:

•Cache-Control - defines how long content should be stored and by whom
•ETag - a fingerprint that lets clients check whether cached content is still fresh
•Vary - specifies which request headers affect the cached response

Understanding Cache-Control : A common misconception: Cache-Control: no-cache does not mean “don’t cache.” It means “cache, but revalidate before serving.” The response can live in memory; it just can’t be served without checking freshness first. Understanding this distinction is essential to using caching effectively.

A more powerful pattern is splitting browser and CDN TTLs:

Cache-Control: public, max-age=60, s-maxage=3600

This tells browsers to cache for 60 seconds (so users get fast responses on repeated clicks) and CDNs to cache for an hour (so your origin servers rarely see requests for popular links). Browsers validate frequently; CDNs absorb the bulk of the load.
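
To illustrate how the two TTLs are read, here is a simplified sketch of the freshness rule: a shared cache such as a CDN prefers s-maxage, while a browser only looks at max-age. The function names are hypothetical, and the parser ignores quoting and the other directives that a full RFC 9111 implementation handles.

```python
def parse_cache_control(header: str) -> dict:
    """Split a Cache-Control value into {directive: value-or-True}."""
    directives = {}
    for part in header.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = int(value) if value.isdigit() else (value or True)
    return directives


def freshness_lifetime(header: str, shared_cache: bool) -> int:
    """Seconds a response stays fresh; s-maxage applies only to shared caches."""
    directives = parse_cache_control(header)
    if shared_cache and "s-maxage" in directives:
        return directives["s-maxage"]
    return directives.get("max-age", 0)


header = "public, max-age=60, s-maxage=3600"
browser_ttl = freshness_lifetime(header, shared_cache=False)
cdn_ttl = freshness_lifetime(header, shared_cache=True)
print(browser_ttl, cdn_ttl)  # 60 3600
```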

ETags and Conditional Requests : On the first request, your server returns a response with an ETag header, a hash or version identifier. The browser stores it. On the next request, the browser sends the ETag back in an If-None-Match header. If the content hasn’t changed, the server responds with 304 Not Modified. No body is sent, bandwidth is saved, and the user experiences an instant load. For a URL shortener, this matters for any metadata pages where content changes infrequently.
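
The exchange can be sketched as a pure function. This is a minimal illustration, assuming the ETag is derived from a hash of the body (one common choice among several) and that the client returns it in the If-None-Match request header; respond and make_etag are hypothetical names.

```python
import hashlib
from typing import Optional


def make_etag(body: bytes) -> str:
    """Build a quoted ETag from a hash of the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str] = None):
    """Return (status, headers, body) for a GET, honoring If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match == etag:
        return 304, headers, b""   # client copy is still valid, send no body
    return 200, headers, body


page = b"<h1>Stats for abc123</h1>"
status1, headers1, _ = respond(page)
status2, _, body2 = respond(page, if_none_match=headers1["ETag"])
print(status1, status2)  # 200 304
```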

Stale-while-revalidate : stale-while-revalidate allows serving an expired cache entry immediately while fetching a fresh copy in the background. Applied to your URL shortener, this means a redirect response can be served from cache even after its TTL expires, with the cache refreshed transparently. Users never see a delay during high-traffic bursts.

The Vary Header Trap
Vary: User-Agent forces caches to store a separate copy for every distinct User-Agent string, which in practice means every browser version and device type. This silently destroys cache efficiency: every variation gets its own slot, and cache hit rates collapse. Avoid broad Vary headers unless you’re genuinely serving different content per device.
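
A rough sketch of why this happens: a cache builds its lookup key from the URL plus the value of every request header named in Vary. The cache_key function and the synthetic User-Agent strings below are assumptions for illustration, not how any particular CDN stores entries.

```python
def cache_key(url: str, vary: str, request_headers: dict) -> tuple:
    """Build a cache key from the URL plus each request header named in Vary."""
    varied = tuple(
        (name.strip().lower(), request_headers.get(name.strip().lower(), ""))
        for name in vary.split(",")
        if name.strip()
    )
    return (url, varied)


# 1,000 synthetic clients that differ only in their User-Agent string.
user_agents = [f"Browser/{v} (Device {d})" for v in range(50) for d in range(20)]
requests = [{"user-agent": ua, "accept-encoding": "gzip"} for ua in user_agents]

by_user_agent = {cache_key("/abc123", "User-Agent", r) for r in requests}
by_encoding = {cache_key("/abc123", "Accept-Encoding", r) for r in requests}
print(len(by_user_agent), len(by_encoding))  # 1000 1
```

The same URL occupies a thousand cache slots under Vary: User-Agent and a single slot under Vary: Accept-Encoding.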

CDN Architecture: Bringing Content Closer to Users

At its core, a CDN is a distributed HTTP cache. Instead of routing every request back to your origin server, copies of your content live at dozens of edge locations worldwide. For a URL shortener, this means viral links, the small fraction that receive massive traffic, can be served entirely from the edge, with zero database involvement.

Pull CDN vs Push CDN

Pull CDN : lazily fetches content from your origin only when a user first requests it. The cache fills naturally over time. Ideal for dynamic or unpredictable content, like short codes whose popularity you can’t know in advance.

Push CDN : requires you to proactively upload content to edge nodes. Best for static resources or pre-generated redirect tables for your most popular links.

Real-world precedent : Netflix Open Connect achieves a 98% CDN cache hit rate for video streams. Nearly every video chunk is served from the edge, not from Netflix’s origin data centers. The same model applies directly to a URL shortener: the top 0.1% of links can be handled entirely at the edge, leaving your database untouched.

Cache invalidation strategies:

Strategy	How It Works	Speed	Use Case
TTL Expiration	Content expires automatically after N seconds	Delayed (waits for TTL)	Slow-changing content: blog posts, product pages
Purge API	Manual API call instantly removes cached content	Fastly: 150ms global	News, e-commerce inventory, breaking content
Surrogate Keys	Tag responses; purge all tagged objects at once	Same as purge	Complex relationships: purge all product-123 pages
Soft Purge	Mark stale, serve old while refreshing in background	Immediate serve	High-traffic pages where downtime is unacceptable

CDN provider comparison:

Feature	Cloudflare	AWS CloudFront	Fastly
PoPs	330+ cities	750+ PoPs + 1,140 embedded	~200 strategic
Routing	Anycast (single IP, BGP routing)	DNS-based (+ Anycast option)	Anycast
Purge Speed	Sub-150ms global	Seconds to minutes	150ms global (since 2011)
Edge Compute	Workers: V8 Isolates, <1ms cold start	Lambda@Edge or CF Functions	Compute@Edge: WebAssembly
Cache Invalidation	Purge API + Cache Rules	API (slow) + versioned URLs	Surrogate keys: best in class
Free Tier	Generous: unlimited bandwidth	Pay per GB from first byte	No free tier

Redis: Application-Level Caching

Beyond the HTTP layer, your application needs its own in-memory cache. Redis is the industry standard: it stores data in RAM rather than on disk, making look-ups orders of magnitude faster than a database query. For a URL shortener, Redis is the layer that makes redirect responses feel instantaneous.

Cache-Aside: The Recommended Pattern
When a user clicks a short link, your app checks Redis first. If the mapping is there, it’s returned immediately. If not, the app queries the database, returns the result, and stores it in Redis for future requests. Most subsequent clicks on that link never touch the database.

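def get_short_url(short_code):
    # `cache` and `db` are assumed to be pre-configured clients
    url = cache.get(short_code)            # Step 1: Check cache
    if url is None:                        # Step 2: Cache miss
        url = db.query(short_code)         #   Query database
        if url is not None:                #   Skip caching unknown codes
            cache.set(short_code, url)     # Step 3: Populate cache
    return url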

Write-Through vs Write-Behind

Write-Through : writes to both cache and database before acknowledging the request. Guarantees consistency but adds latency to every write, since both stores must confirm it. Use this when data correctness is non-negotiable.

Write-Behind : writes to cache first and flushes to the database asynchronously. Faster writes, but risks data loss if the cache crashes before the flush completes. Use this for high-throughput analytics where some loss is acceptable.
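
The difference between the two patterns is where the database write happens. The sketch below uses plain dictionaries as stand-ins for Redis and the database, and both class names are hypothetical; a real write-behind setup would flush from a background worker rather than an explicit call.

```python
from collections import deque


class WriteThroughStore:
    """Every write goes to the database and the cache before returning."""

    def __init__(self):
        self.cache, self.db = {}, {}

    def put(self, key, value):
        self.db[key] = value        # durable write happens in the request path
        self.cache[key] = value


class WriteBehindStore:
    """Writes land in the cache at once and reach the database on flush()."""

    def __init__(self):
        self.cache, self.db = {}, {}
        self.pending = deque()

    def put(self, key, value):
        self.cache[key] = value
        self.pending.append((key, value))   # lost if the process dies before flush

    def flush(self):
        while self.pending:
            key, value = self.pending.popleft()
            self.db[key] = value


clicks = WriteBehindStore()
clicks.put("abc123:clicks", 41)
clicks.put("abc123:clicks", 42)
unflushed = dict(clicks.db)     # still empty: nothing has reached the database
clicks.flush()
print(unflushed, clicks.db)     # {} {'abc123:clicks': 42}
```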

Cache Stampede: The Failure Mode You Must Plan For
A cache stampede happens when a popular cache key expires and thousands of concurrent requests simultaneously find a miss. Each one fires a database query. The database buckles under the load. For a URL shortener, a single viral link expiring at the wrong moment can trigger exactly this scenario.

Three defences:

1.TTL jitter: randomize expiry times slightly so keys don’t expire simultaneously
2.Distributed lock (Redis SET NX EX): only one request rebuilds the cache; others wait
3.XFetch (probabilistic early expiration): proactively refresh hot keys just before they expire, preventing the miss entirely
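
Two of these defences are small enough to sketch directly. The function names, the 10% spread, and the beta default are assumptions for illustration; the early-refresh check follows the usual XFetch formulation, in which a request refreshes when now - cost * beta * ln(rand) reaches the expiry time. The distributed lock is not shown because it needs a live Redis connection.

```python
import math
import random


def jittered_ttl(base_ttl: int, spread: float = 0.1, rng=random) -> int:
    """Base TTL plus or minus up to `spread` (10% by default)."""
    return round(base_ttl * (1 + rng.uniform(-spread, spread)))


def should_refresh_early(now, expiry, recompute_cost, beta=1.0, rng=random) -> bool:
    """XFetch-style check: the closer to expiry, the likelier an early refresh."""
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    return now - recompute_cost * beta * math.log(1.0 - rng.random()) >= expiry


rng = random.Random(42)
ttls = [jittered_ttl(3600, rng=rng) for _ in range(5)]
far = sum(should_refresh_early(0.0, 60.0, 0.05, rng=rng) for _ in range(1000))
near = sum(should_refresh_early(59.95, 60.0, 0.05, rng=rng) for _ in range(1000))
print(ttls, far, near)
```

A key that is a minute from expiry is practically never refreshed early, while one that is a single recompute-time away is refreshed by roughly a third of requests, so the rebuild tends to happen before the key actually expires.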

Memory Optimization

Real-world precedent: Instagram needed to keep about 300 million ID mappings (media ID to user ID) in Redis. Stored as plain key-value pairs, that came to roughly 21 GB. By bucketing the keys into small hashes that qualify for Redis ziplist encoding (which compacts small structures), they brought it down to about 5 GB, a 76% reduction. For your URL shortener, similar techniques (efficient serialization, compact data structures) can dramatically cut infrastructure costs at scale.

Eviction Policy
allkeys-lru (Least Recently Used) is the right default for general workloads. If your traffic follows an 80/20 pattern, with 20% of links generating 80% of clicks, then allkeys-lfu (Least Frequently Used) keeps your hottest links in memory while evicting cold ones. Choosing the right policy ensures cache performance holds under sustained load.
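
To make the LRU idea concrete, here is a small in-process sketch. It illustrates the policy only: the LRUCache class is hypothetical, and Redis itself approximates LRU by sampling keys rather than keeping an exact ordering like this.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict least recently used


cache = LRUCache(capacity=2)
cache.set("a", "https://example.com/a")
cache.set("b", "https://example.com/b")
cache.get("a")                                # "a" is now fresher than "b"
cache.set("c", "https://example.com/c")
print(cache.get("b"))  # None
```

An LFU policy would instead track a hit counter per key and evict the key with the fewest hits, which protects a viral link even if it goes unrequested for a short while.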

Everything Applied: One Request, End to End

So let’s say a user in Nigeria clicks a short link in a tweet. Here’s exactly what happens across the full stack:

Step 1: CDN Edge (~5ms)

The request hits the nearest CDN edge location. For the top 0.1% of viral links, the ones cached at the edge via s-maxage, the redirect response is returned immediately. The request never reaches your servers. TTL jitter ensures popular links don’t expire in sync, preventing coordinated cache misses.

Step 2: Redis / Cache-Aside (~10ms)

If the CDN doesn’t have the link, the request reaches your app servers. Cache-Aside checks Redis for the short_code. A hit returns the mapping in under 10ms. A miss triggers the database path. Distributed locks or XFetch prevent simultaneous misses from cascading into a stampede.

Step 3: Database (~40ms)

On a cache miss, the app queries the sharded database (consistent hashing routes the query to the correct shard), retrieves the mapping, writes it back to Redis, and responds. Ziplist encoding and appropriate eviction policies keep the cache lean and performant for the next request.

Step 4: Redirect Response: 301 vs 302

Why 302 and not 301? For companies like Bitly, click analytics are the core product. A 301 lets the browser cache the redirect indefinitely, making future clicks invisible to tracking. A 302 is not cached by default, so every click reaches your servers and is recorded. For your URL shortener, the answer depends on whether analytics matter more than marginal performance gains.

Performance at scale (back of the envelope):

Metric	Calculation	Result
New URLs created	100M per month / 30 days / 86,400 sec	~40 writes/second
URL redirects (100:1 read ratio)	40 writes/sec × 100	4,000 reads/sec (40K at peak)
Short code space (7 chars, Base62)	62^7	3.52 trillion combinations (~2,900 years at 100M per month)
Storage (5 years)	100M × 12 months × 5 years × 500 bytes	~3 TB before replication
Redis hot cache (top 1% = 90% of traffic)	1% of daily URLs (~33K) × 500 bytes	~17 MB caches 90%+ of reads
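
The first four rows of the table can be reproduced in a few lines, which makes it easy to rerun the estimate with your own inputs. The variable names are illustrative; the inputs (100M new URLs per month, a 100:1 read ratio, 500 bytes per record) come from the table.

```python
SECONDS_PER_DAY = 86_400
urls_per_month = 100_000_000
read_ratio = 100
bytes_per_record = 500

writes_per_sec = urls_per_month / 30 / SECONDS_PER_DAY
reads_per_sec = writes_per_sec * read_ratio
code_space = 62 ** 7
storage_bytes = urls_per_month * 12 * 5 * bytes_per_record

print(f"writes/sec: {writes_per_sec:.1f}")
print(f"reads/sec:  {reads_per_sec:.0f}")
print(f"code space: {code_space:,}")
print(f"storage:    {storage_bytes / 1e12:.0f} TB over 5 years")
```

The exact write rate comes out just under 39 per second, which the table rounds up to 40.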

What breaks at each scale stage:

Scale	What Breaks	The Fix
0-1K users	Nothing	Single server, SQLite or MySQL, no Redis needed
1K-10K users	Database read bottleneck	Add Redis cache-aside, add read replica
10K-100K users	App server CPU ceiling	Load balancer + 2-3 app servers (Round-Robin)
100K-500K users	Cache miss spikes overwhelming DB	CDN for redirects, TTL jitter, Redis cluster
500K-1M users	Database write throughput ceiling	Sharding with consistent hashing, async analytics
1M+ users	Single-region latency for global users	Multi-region, GeoDNS routing, regional Redis clusters

Key Trade-Offs

Decision	Option A	Option B	Choose Based On
Redirect type	301 Permanent (browser caches; no return trips)	302 Temporary (every click reaches your servers)	Need analytics? Use 302. Click data is the product for companies like Bitly.
Short code generation	Auto-increment + Base62 encode (predictable, zero collisions)	Hash (MD5 truncated; collision risk)	At scale, auto-increment + XOR obfuscation beats hash complexity.
Database choice	NoSQL (DynamoDB/Cassandra): horizontal sharding native	SQL (MySQL): simpler, vertical scaling ceiling	At 4,000 reads/sec, NoSQL with consistent hashing wins.
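
The auto-increment plus Base62 option from the table can be sketched as follows. The alphabet order (digits, then lowercase, then uppercase) is an arbitrary choice made for this example, and the XOR obfuscation step mentioned above is left out.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols


def encode_base62(number: int) -> str:
    """Turn a non-negative database ID into a short code."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_base62(code: str) -> int:
    """Turn a short code back into the database ID."""
    number = 0
    for char in code:
        number = number * 62 + ALPHABET.index(char)
    return number


short_code = encode_base62(125)
print(short_code, decode_base62(short_code))  # 21 125
```

Because every ID maps to exactly one code, there are no collisions to check for; the cost is that sequential IDs produce guessable codes unless they are obfuscated first.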

The Engineering Mindset

The best engineers are not the ones who know every tool. They are the ones who understand trade-offs deeply enough to make the right call for their specific system, their specific constraints, and their specific users.

For example, as a developer in Nigeria, your constraints are real: users on expensive data plans, connectivity that drops without warning, servers that are geographically far away. Every caching decision you make is an act of empathy for the person on a 3G connection in Kano trying to load your app. Build accordingly. The goal isn’t to over-engineer early; it’s to design systems that evolve as demand increases.

Scaling to one million users is less about powerful hardware and more about smart architecture. Three principles drive it:

•Distribute traffic using load balancing
•Reduce repeated computation through caching
•Move content closer to users using CDNs

When combined, these strategies dramatically reduce database load, lower infrastructure costs, and keep your application fast, from your first hundred users to your first million.

Resources

•Designing Data-Intensive Applications. Kleppmann (chapters 5-6)
•RFC 9111. HTTP Caching (current standard)
•ByteByteGo YouTube. Alex Xu; visual system design
•Instagram Engineering Blog. Redis memory optimization
•Scaling Memcache at Facebook. NSDI 2013 (Nishtala et al.)
•Redis Official Docs. Caching patterns & eviction reference

What’s Next?

Once caching and load balancing are in place and your system is serving one million users reliably from cache, the next frontier is real-time communication at scale. Technologies like WebSockets and Server-Sent Events introduce a fundamentally different set of constraints: persistent connections, event fan-out, and stateful session management.

The next post will cover WebSockets, HTTP polling, and Server-Sent Events. Follow or subscribe so you don’t miss it.

From Browser to Server : The Journey of an HTTP Request (Demystifying the Web’s Infrastructure)

Emmanuel Onuiteshi — Mon, 25 May 2026 14:30:25 +0000

What Actually Happens When You Press Enter?

You type www.google.com and press Enter. Half a second later, a fully rendered page appears. Nobody taught you to find that remarkable. But as a developer, that half second is your responsibility.

It takes 0.5 seconds. But it touches 7 layers of infrastructure.
Here is every layer, in order.

Step 1: DNS Lookup; The Internet’s Phonebook

Humans remember names. Computers understand numbers. DNS translates google.com into an IP address such as 142.250.190.46.
The lookup chain: your browser cache → OS cache → Recursive Resolver → Root Server → TLD Server → Authoritative Server.
The whole chain completes in milliseconds.
When DNS fails, no website loads at all. It is so foundational that its failure looks like the entire internet is broken.

Step 2: TCP Connection; The 3-Way Handshake

Having the IP address is not enough. Your device and the server need to confirm they are both ready to communicate reliably. TCP handles this with three messages before a single byte of your request moves:
•SYN: “can we talk?”
•SYN-ACK: “yes, let’s talk”
•ACK: “connection open”

On a Lagos to Frankfurt connection, this round trip takes 100 to 150ms. On a local server, it takes under 5ms. That gap is why CDN edge nodes matter: they bring the handshake closer to your users.

Step 3: The HTTP Request

With the connection open, your browser sends a structured request. Three parts:

Part	Purpose	Example
Request Line	The verb and path	GET /index.html HTTP/1.1
Headers	Context about the request	Host, User-Agent, Cookie
Body	Payload (POST/PUT only)	JSON data, form fields

HTTP methods define intent: GET retrieves, POST creates, PUT/PATCH updates, DELETE removes. Using the wrong method breaks caching. GET responses are cacheable by default; POST responses generally are not. If you are using POST to fetch data, you are bypassing the entire caching layer unnecessarily.
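
The three parts of a request are easy to see by splitting a raw HTTP/1.1 message by hand. This is a simplified sketch: parse_http_request is a hypothetical helper, the /links endpoint is made up, and a real server would rely on a proper HTTP parser rather than string splitting.

```python
def parse_http_request(raw: bytes):
    """Split a raw HTTP/1.1 request into (method, path, version, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, path, version = lines[0].split(" ", 2)   # the request line
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers, body


raw = (
    b"POST /links HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Content-Type: application/json\r\n"
    b"Content-Length: 30\r\n"
    b"\r\n"
    b'{"url": "https://example.com"}'
)
method, path, version, headers, body = parse_http_request(raw)
print(method, path, version)                 # POST /links HTTP/1.1
print(headers["host"], len(body))            # example.com 30
```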

Step 4: Client-Server Architecture

Right, so the request has arrived somewhere. But where exactly? And who is allowed to touch what?
The Rule: the client is never allowed inside the kitchen. They must ask the waiter, who asks the kitchen on their behalf.

Think of it like a restaurant. You are the customer. You can read the menu and place an order, but you do not walk into the kitchen yourself. The waiter (the network) carries your request. The kitchen (the server) does the actual work. And the pantry (the database) holds all the ingredients.

Most production systems follow the 3-tier model:

Layer	Role	Technologies
Presentation	What the user sees	HTML, CSS, JavaScript, React
Application	Business logic and rules	Node.js, Python, Go, Java
Data	Persistent storage	PostgreSQL, MongoDB, Redis

Each layer talks only to the layer immediately next to it. The browser never touches the database directly. This boundary is a security constraint, not just a convention.

Step 5: Server Processing and REST

So the request has made it past the front door. Now your application server actually does something with it. Here is the typical flow:

• The web server (Nginx, Apache) receives the raw request and routes it inward
• Middleware runs: authentication checks, rate limiting, request logging
• The router matches the URL and HTTP method to a specific handler function
• The handler runs your business logic, queries the database if needed, and builds a response

This is where REST comes in. REST is the set of conventions that makes this process predictable and consistent. The four rules:

• URLs are nouns, not verbs. Use /users/123, not /getUser?id=123
• Use HTTP methods correctly and consistently
• Every request is stateless: it carries everything the server needs to process it
• Structure is consistent: /users returns a list, /users/123 returns one record

A well-designed REST API is one your teammates can read without a dictionary. A poorly designed one is a support ticket waiting to happen.

GET    /users          // List all users
GET    /users/123      // Get one user
POST   /users          // Create a user
DELETE /users/123      // Remove a user
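
The routing step behind these endpoints can be sketched as a small method-and-path dispatcher. Everything here is illustrative: the route decorator, the in-memory USERS store, and the {user_id} placeholder syntax are assumptions, not the API of any particular framework.

```python
import re

ROUTES = []


def route(method: str, pattern: str):
    """Register a handler for an HTTP method and a path like /users/{user_id}."""
    regex = re.compile("^" + re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", pattern) + "$")

    def register(handler):
        ROUTES.append((method, regex, handler))
        return handler

    return register


USERS = {"123": {"id": "123", "name": "Ada"}}


@route("GET", "/users")
def list_users():
    return 200, list(USERS.values())


@route("GET", "/users/{user_id}")
def get_user(user_id):
    user = USERS.get(user_id)
    return (200, user) if user else (404, {"error": "user not found"})


@route("DELETE", "/users/{user_id}")
def delete_user(user_id):
    return (204, None) if USERS.pop(user_id, None) else (404, {"error": "user not found"})


def dispatch(method: str, path: str):
    """Find the handler for a method and path, or return 404 or 405."""
    path_matched = False
    for route_method, regex, handler in ROUTES:
        match = regex.match(path)
        if match:
            path_matched = True
            if route_method == method:
                return handler(**match.groupdict())
    if path_matched:
        return 405, {"error": "method not allowed"}
    return 404, {"error": "not found"}


print(dispatch("GET", "/users/123"))   # (200, {'id': '123', 'name': 'Ada'})
print(dispatch("PUT", "/users/123"))   # (405, {'error': 'method not allowed'})
```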

Step 6: The HTTP Response

The server has done its job. Now it sends back what it found or what went wrong. Every response has a status code, headers, and a body.

Status codes are the internet’s traffic lights. Every developer needs these internalized:

Range	Meaning	Key Codes
2xx	Success	200 OK, 201 Created, 204 No Content
3xx	Redirect	301 Permanent, 302 Temporary, 304 Not Modified
4xx	Client Error	400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found
5xx	Server Error	500 Internal Error, 502 Bad Gateway, 503 Unavailable

One mistake that drives everyone mad: returning 200 OK when an error has occurred. It is among the most common API mistakes. It breaks clients, breaks monitoring, and makes debugging painful. Return the right code every time.
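
A short sketch of the rule in practice, using the standard library's http.HTTPStatus constants. The lookup_response helper and the links dictionary are hypothetical; the point is that a missing resource produces a 404 rather than a 200 with an error message in the body.

```python
from http import HTTPStatus


def status_class(code: int) -> str:
    """Name the range a status code falls in."""
    classes = {1: "informational", 2: "success", 3: "redirect",
               4: "client error", 5: "server error"}
    return classes.get(code // 100, "unknown")


def lookup_response(links: dict, short_code: str):
    """Return (status, body); a missing code is a 404, never a 200 with an error body."""
    if short_code in links:
        return HTTPStatus.OK.value, {"url": links[short_code]}
    return HTTPStatus.NOT_FOUND.value, {"error": "unknown short code"}


links = {"abc123": "https://example.com"}
ok_status, _ = lookup_response(links, "abc123")
missing_status, _ = lookup_response(links, "nope")
print(ok_status, status_class(ok_status))            # 200 success
print(missing_status, status_class(missing_status))  # 404 client error
```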

Step 7: Browser Rendering; Code into Pixels

The response is sitting in your browser. It is raw HTML, CSS, and JavaScript. None of it is visible yet. What happens next is actually one of the most impressive things your computer does silently, several times a day.

The browser runs through the Critical Rendering Path:

• Parse HTML → build the DOM tree
• Parse CSS → build the CSSOM
• Combine into a Render Tree (visible elements only)
• Layout: calculate exact positions and sizes for everything on the page
• Paint and Composite: pixels hit the screen

JavaScript can interrupt this pipeline at any point. A 200 KB render-blocking script sitting in the wrong place is the difference between a 0.5 second load and a 3 second one. On a 3G connection in Kano or Benin City, that delay is not a minor inconvenience. It is the difference between a user who waits and one who closes the tab.

HTML is the blueprint. CSS is the paint bucket. JavaScript is the interior designer rearranging furniture after the house is built. On a well-optimized page, the browser does all of it in a few hundred milliseconds.

The Full Journey at a Glance

Step	Layer	What Happens
1	DNS Lookup	Domain name resolved to IP address
2	TCP Connection	3-way handshake establishes reliable channel
3	HTTP Request	Browser sends method, headers, and body
4	Client-Server	Request routed through 3-tier architecture
5	Server Processing	Business logic runs; database queried; response built
6	HTTP Response	Status code, headers, and payload returned
7	Browser Rendering	HTML, CSS, JS converted to pixels

What’s Next?
You now know what happens every time a user hits your app. DNS finds the address, TCP builds the connection, HTTP carries the message, your server does the work, and the browser makes it visible. Seven layers, half a second.

The next question is: what happens when a million users do all of that simultaneously? Look out for my next post.