Akshat Kumar Gupta

Posted on May 26 • Edited on Jun 15

It's Just Thumbnails. Until It Isn't. Here's the Distributed System I Built to Fix That.

#distributedsystems #infrastructure #performance #systemdesign

A thumbnail is 20KB. That's nothing — smaller than most profile pictures, smaller than a favicon. But a thumbnail is also the cover of a book. It's the first thing a user sees, the thing that makes them click or keep scrolling. It carries the entire first impression of your content.

And when you have thousands of pieces of content, each with thousands of users requesting their thumbnails simultaneously — that 20KB file starts multiplying fast. Thousands of requests per second. Gigabytes of outbound traffic every minute — and that number scales fast with the cluster. Suddenly the smallest file in your system is driving one of your biggest infrastructure problems.

At my company we had one server handling all of it. No backup, no redundancy — just one machine. A single point of failure sitting at the center of everything the user sees when they open the app.

You already know where this is going. Server goes down — every thumbnail disappears. The app loads, the content is there, but every image is a broken placeholder. Nothing kills user trust faster than that.

So we had to build something better. A system that could:

Accept thumbnail uploads at thousands of requests per second
Immediately replicate every upload to a backup server
Serve reads at 5× the upload rate without breaking a sweat
Automatically failover to the backup the moment the primary goes down — with no downtime, no manual intervention, no user ever seeing a broken image

This is how we built it.

How We Got Here — The Decisions That Shaped Everything

Before jumping into the architecture, let me tell you how we got there — because the decisions matter as much as the design.

We knew one thing early: we didn't need a Netflix-level architecture. We didn't have Netflix-level numbers. Over-engineering this would slow us down and add complexity we'd never need. So we made a rule — make the best use of the systems we're already running.

That's where Redis came in.

We were already using Redis. And Redis has a feature called Pub/Sub — a messaging system where one server broadcasts an event and others listen. Sounds perfect for replication, right? Server A uploads a file, broadcasts an event, Server B picks it up and saves a copy.

The problem: if a subscriber is offline when the event fires, the event is gone forever. No retry, no replay. If your backup server crashes and restarts, it has no idea what it missed. Files silently missing, no errors, no way to recover.

So we moved to Redis Streams with AOF (Append Only File — every event is written to disk and survives crashes). Unlike Pub/Sub, Streams remember every event and every consumer's position. If the backup server goes down and comes back up, it picks up exactly where it left off. Nothing lost.

That solved replication. But we still had another problem: routing.

How does a load balancer know which server holds a specific file? The obvious answer is a lookup table — a big map of "file X lives on server B." But lookup tables don't scale. With millions of files and thousands of requests per second, that table becomes a bottleneck. Every single request has to check it. It's slow, it's a single point of failure, and it grows forever.

We needed something that could answer "which server owns this file?" in constant time, with pure math — no table, no database call, no coordination.

That's consistent hashing. Same file, same formula, same answer — every time, instantly. Distributed storage systems like Ceph use this exact principle at massive scale. We didn't need all of Ceph. We just needed the idea, implemented minimally for our use case.

So the skeleton was in place: a load balancer doing consistent hashing, servers with a primary and secondary each, and Redis Streams connecting them.

Decisions Before the Final Design

The skeleton was in place. But we still had two decisions to make before anything was final.

How many replicas? We landed on the simplest answer: primary and secondary — two servers per file, that's it. Yes, if both fail simultaneously the file is unavailable. We were completely fine with that. Two specific machines failing at the exact same time is extremely unlikely — we weren't building for that edge case.

How to store files on disk? The same hash we use for routing also determines the file's path on disk — so any file can be located instantly, always, with no database lookup.

The stream topology that falls out of this:

Each server gets its own dedicated Redis Stream. It posts upload events to its stream, and its secondary consumes that stream. Every server is both a producer and a consumer — forming a circular topology:

Server A → posts to Stream A → consumed by Server B (A's secondary)
Server B → posts to Stream B → consumed by Server C (B's secondary)
Server C → posts to Stream C → consumed by Server A (C's secondary)

Every server has exactly one stream to write to and one stream to read from. Each server is someone's primary and someone else's secondary. It fits together like mechanical parts — each piece has one job, and together they cover everything.

When the Primary Goes Down — The Problem Nobody Talks About

At this point someone smart in the room asks the obvious question:

"You said if the primary goes down, you serve from the secondary. Fine. But what about uploads? If Server A is down and an upload comes in — you route it to Server B. Server B saves the file. Now that file exists only on Server B. Server A has no idea it exists. When Server A comes back online, your two-server replication guarantee is broken. The file lives on one server. Your architecture just fell apart."

They're right.

And honestly — I was struggling with this one too.

We even used AI to brainstorm. The answer it gave us was the most obvious one: just add another stream. A dedicated fallback stream for exactly this scenario.

I knew immediately that was wrong. Not because it wouldn't work — it would. But the moment you add a stream for this, you add a consumer, logic to decide when to use which stream, and suddenly the beautiful simplicity of "each server has one stream, reads one stream" is gone. You've traded an elegant design for a complicated one. Maintenance gets harder. Edge cases multiply. The whole thing starts feeling fragile.

I shelved it. I knew there was a better answer somewhere.

And then one morning it clicked.

Why don't we just use the stream that Server A already consumes?

That was it. One sentence. One of the most satisfying moments I've had thinking through a system design.

Here's how it works. In our circular topology:

Server C → posts to Stream C → consumed by Server A

Server A is already consuming Stream C as part of normal operations. So — what if when Server B handles a fallback upload, instead of posting to its own stream, it posts to Stream C instead?

When Server A recovers and comes back online, it resumes consuming Stream C from where it left off — and there's the event waiting for it: "a file was saved on Server B, come pull it." Server A pulls the file, saves its copy, and consistency is restored. Automatically. With zero changes to the existing infrastructure.

Normal:   upload → Server A (primary) → posts to Stream A → Server B copies it

Fallback: Server A is down
          upload → Server B (fallback, header tagged)
          Server B saves the file
          Server B posts event to Stream C  ← the stream Server A already consumes
          Server A recovers → consumes Stream C → sees the event → pulls file from B → done

No new streams. No new consumers. No new logic on Server A's side — it's just consuming its existing stream and doing what it always does. The system heals itself.

The Full Picture

Here's the complete architecture — every piece we've discussed, all together:

Upload path: client → load balancer → primary server → disk + Stream event → secondary server pulls and copies.
Fallback upload path: primary down → load balancer routes to secondary with fallback header → secondary posts to the stream the primary already consumes → primary self-heals on recovery.

Under the Hood: How the Ring Routes a File

The word "ring" has come up a few times. Before we get into migration — which relies on it heavily — let me make the mechanic concrete.

Each server gets 100 virtual positions on the ring. Without them, each server owns one big continuous arc, and the sizes of those arcs are essentially random — one server could end up owning 30% of the hash space, another only 10%. A hundred virtual nodes per server breaks ownership into small scattered segments, so the variance averages out and load stays even across the cluster.

Routing a file uses two steps:

Step 1 — Primary via the ring: Hash the filename → walk clockwise until you hit the first ring position → that server is the primary.

Step 2 — Secondary via the server list: Servers have a fixed predefined order. Once you know the primary, take the next server in the list. Primary A → secondary B. Primary C → wraps back to A.

Why not use the ring for the secondary too? Because then secondary assignments become random — Server A could end up secondary for files across multiple different servers, meaning it'd need to consume from multiple streams. The circular topology collapses. The fixed server list is what holds the whole stream architecture together.

Same file, same steps, same answer every time. No lookup table. No database call. Pure math.

Migration — The Annoying Part

Every engineer who designs something elegant eventually hits the part that isn't. For us, that was migration.

We had the design, the POC worked — and then the obvious question: what happens when you add or remove a server from the ring?

Files that used to belong to Server A now belong to Server D. Those files still physically sit on Server A. Something has to move them — while the system keeps serving traffic, zero downtime.

Consistent hashing already does most of the work. When you have 3 servers and add a 4th, each existing server gives up roughly 8% of its hash space. Only ~25% of all files need to move. Everything else stays exactly where it is. And here's the beautiful part — the more servers you have, the less migration happens. Adding an 11th server to a 10-server cluster moves only ~9%. Adding a 101st to a 100-server cluster moves just ~1%. The larger the cluster, the cheaper it becomes to scale it further.

The replication multiplier. Those numbers are only for primary copies. Since every file has a secondary too, the same ~25% needs its secondary repositioned as well. Actual data movement is roughly double the hash space math. Not catastrophic — but worth knowing going in.

Pull, not push. The obvious move is to push — directly transfer the file to the new server. Simple, right? Except: if the new server crashes after saving the file but before posting an event to its stream, the secondary never gets notified. Replication breaks mid-migration, silently, with no way to know. Push also means adding infrastructure we didn't have — direct file transfers, connection management, timeouts, retry logic. A whole new code path. And everything in this system was deliberately built around streams and events. Adding direct HTTP transfers to migration felt like exactly the kind of complexity we'd been avoiding from day one.

So we flipped it. The background worker on the old server posts a flagged migration event to two streams simultaneously — the new primary's stream and the new secondary's stream — with a pointer back to itself as the source. Both servers consume their streams independently, see the event, and pull the file directly. The secondary doesn't wait for the new primary. Both pull in parallel. If either crashes mid-migration, the AOF-persisted event is still waiting when it recovers.

During migration, serve from both rings. The load balancer holds the old ring and the new ring simultaneously. Every request tries the new ring first. File not there yet? Fall back to the old ring. No errors, no broken images, no downtime. Once migration completes, the old ring is retired.

The removal edge case that will catch you off guard. When you remove a server, the fallback ring still routes requests to it until every file has migrated off. Keep it alive and serving traffic until the background worker has finished and the old ring is fully retired. Shut it down early and you've created exactly the downtime you were trying to avoid.

Temporary duplicates during migration are fine. A file sitting on three servers for a short window doesn't break anything — it just uses extra disk space. Run cleanup after migration finishes: walk every server, find any file where you're no longer the primary or secondary, delete it.

The Read/Write Split

Writes go through the application server. Hash the filename, determine the primary, write to disk, post an event to Redis Streams. CPU work, disk I/O, network calls.

Reads bypass the application server entirely. NGINX serves thumbnail requests directly from disk — no application logic, no Redis. It reads the file and streams it to the client. That's the whole operation.

The result is a 5–6× gap between read and write throughput on the same hardware. Not from any clever trick — just from not making reads do more work than they need to. Keep the application server out of the read path, let NGINX do what it was built for, and the gap appears on its own.

One Last Thing

This held up under heavy load testing. That's as close to real as we could get.

This was also my first time working seriously on the devops side. The team helped with a lot of the foundational setup — provisioning the servers, configuring the Redis environments, getting the multi-server topology actually running. I wrote the routing scripts and ran the load tests myself, but none of that happens without someone who knows how to stand up infrastructure properly. Learned a lot from that side of things.

The moment that still stands out: realizing we could route fallback uploads through the stream the primary already consumes. No new infrastructure. Just the same parts, doing one more job. That's the kind of solution that feels right — not because it's clever, but because nothing about the system needed to change to support it.

And that's probably the most important thing I learned building this — and it has nothing to do with Redis or consistent hashing.

You don't need a perfect solution. You need the right solution for your system.

Every tradeoff we accepted — two servers instead of three, slow migration instead of fast, temporary duplicates during cleanup — wasn't a compromise. It was a deliberate choice. We weren't carrying requirements we didn't actually have. A tradeoff only costs you something if you needed what you gave up.

The core of system design isn't knowing every tool. It's understanding the problem in front of you well enough to know which tools you don't need.

Top comments (6)

Devireddy Vinayreddy • May 26

As a Devops lead, It has been a pleasure working alongside you on our DevOps initiatives. I truly valued the knowledge we exchanged and the opportunity to tackle complex problems together, relying on genuine, hands-on engineering.

Akshat Kumar Gupta • May 26 • Edited

Same here — the infrastructure side was new territory for me and having someone who actually knows what they're doing made the difference between a design on paper and something that runs. Appreciated.

Jaya krishna Rayudu • May 26 • Edited

As part of the DevOps team, it has been a great experience working closely with you on this architecture. I truly appreciated the knowledge sharing, the deep technical discussions, and the opportunity to solve real distributed systems challenges together through practical, hands-on engineering.

Akshat Kumar Gupta • May 26

Honestly the unsung part of this whole thing Jaya Krishna — every time I needed a new iteration stood up, Redis configured, NGINX routing sorted, you made it happen. The design work doesn't go anywhere without that. Appreciated.