DEV Community

Akshat Kumar Gupta


It's Just Thumbnails. Until It Isn't. Here's the Distributed System I Built to Fix That.

A thumbnail is 20KB. That's nothing — smaller than most profile pictures, smaller than a favicon. But a thumbnail is also the cover of a book. It's the first thing a user sees, the thing that makes them click or keep scrolling. It carries the entire first impression of your content.

And when you have thousands of pieces of content, each with thousands of users requesting their thumbnails simultaneously — that 20KB file starts multiplying fast. Thousands of requests per second. Gigabytes of outbound traffic every minute — and that number scales fast with the cluster. Suddenly the smallest file in your system is driving one of your biggest infrastructure problems.

At my company we had one server handling all of it. No backup, no redundancy — just one machine. A single point of failure sitting at the center of everything the user sees when they open the app.

You already know where this is going. Server goes down — every thumbnail disappears. The app loads, the content is there, but every image is a broken placeholder. Nothing kills user trust faster than that.

So we had to build something better. A system that could:

  • Accept thumbnail uploads at thousands of requests per second
  • Immediately replicate every upload to a backup server
  • Serve reads at 5× the upload rate without breaking a sweat
  • Automatically fail over to the backup the moment the primary goes down, with no downtime, no manual intervention, and no user ever seeing a broken image

This is how we built it.


How We Got Here — The Decisions That Shaped Everything

Before jumping into the architecture, let me tell you how we got there — because the decisions matter as much as the design.

We knew one thing early: we didn't need a Netflix-level architecture. We didn't have Netflix-level numbers. Over-engineering this would slow us down and add complexity we'd never need. So we made a rule — make the best use of the systems we're already running.

That's where Redis came in.

We were already using Redis. And Redis has a feature called Pub/Sub — a messaging system where one server can broadcast an event and others can listen. Sounds perfect for replication, right? Server A uploads a file, broadcasts an event, Server B picks it up and saves a copy.

The problem: in Redis Pub/Sub, if a subscriber is offline when the event fires, the event is gone forever. No retry, no replay. For replication that means if your backup server crashes and restarts, it has no idea what it missed. Files silently missing, no errors, no way to recover.

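So we moved to Redis Streams with AOF (Append Only File) persistence turned on: Redis logs each write to disk, so stream entries survive a restart. How much can be lost in a hard crash depends on the fsync policy; with the default appendfsync everysec it is up to about a second of writes. Unlike Pub/Sub, a Stream keeps its entries until they are trimmed, and a consumer group (or a consumer that stores its last-read ID) gives each reader a position to resume from. If the backup server goes down and comes back up, it picks up exactly where it left off.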
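To make the difference concrete, here is a toy in-memory model in plain Python. It is not the Redis API, and the class names `ToyPubSub` and `ToyStream` are made up for illustration. It only shows the behaviour described above: a late subscriber misses Pub/Sub messages, while a stream reader with a remembered position catches up.

```python
class ToyPubSub:
    """Fire-and-forget: only subscribers connected right now get the message."""

    def __init__(self):
        self.online = {}  # subscriber name -> list of received messages

    def connect(self, name):
        self.online[name] = []

    def publish(self, message):
        for inbox in self.online.values():
            inbox.append(message)


class ToyStream:
    """Append-only log plus a remembered read position per consumer."""

    def __init__(self):
        self.entries = []
        self.position = {}  # consumer name -> index of the next unread entry

    def add(self, message):
        self.entries.append(message)

    def read_new(self, consumer):
        start = self.position.get(consumer, 0)
        unread = self.entries[start:]
        self.position[consumer] = len(self.entries)
        return unread


pubsub = ToyPubSub()
stream = ToyStream()

# Server B is offline while two uploads land on Server A.
for name in ["a.jpg", "b.jpg"]:
    pubsub.publish(name)
    stream.add(name)

# Server B comes back online.
pubsub.connect("B")
missed_via_pubsub = pubsub.online["B"]        # nothing was kept for B
caught_up_via_stream = stream.read_new("B")   # both events are still there
```

In real Redis the stream side would roughly correspond to appending with XADD and reading through a consumer group, but the exact commands and options are outside what this sketch claims.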

That solved replication. But we still had another problem: routing.

We had load balancers already in place. But how does a load balancer know which server holds a specific file? And when the primary server fails, how does it know where the backup copy is?

The obvious answer is a lookup table — a big map of "file X lives on server B." But lookup tables don't scale. With millions of files and thousands of requests per second, that table becomes a bottleneck. Every single request has to check it. It's slow, it's a single point of failure, and it grows forever.

We needed something that could answer "which server owns this file?" in effectively constant time, with pure math: no table, no database call, no coordination.

That's when we looked at Ceph, one of the leading distributed storage systems used in production at massive scale. Ceph solves exactly this problem with a placement algorithm called CRUSH, which is in the same family as consistent hashing. The core idea: use a mathematical formula to deterministically map any file to a server. Same file, same formula, same answer, every time, with no lookup.

We didn't need Ceph itself. We just needed the same principle, implemented minimally for our use case.

So by this point we had a skeleton:

  • A load balancer that routes requests using consistent hashing — no lookup table, just math
  • A group of servers — one primary per file, one secondary as the backup
  • Redis Streams connecting them — so every upload on the primary is automatically replicated to the secondary, and even if the secondary crashes, it catches up when it comes back

That skeleton became our architecture.


Decisions Before the Final Design

The skeleton was in place. But we still had two important decisions to make before we could finalize anything.

Decision 1: How many servers should each thumbnail be replicated to?

This wasn't just a reliability question — it directly determined how we'd integrate Redis Streams and how files would be stored. More replicas means more streams, more consumers, more complexity. We thought about it and landed on the simplest answer: primary and secondary. Two servers per file. That's it.

Yes, there's a drawback — if both servers fail simultaneously, the file is unavailable. We were completely fine with that. In a pool of servers, two specific machines failing at the exact same time is extremely unlikely. We weren't building for that edge case. We were building for the 99.99% scenario, and the simplest design that covered that was two servers.

Decision 2: How do we store the files on disk?

Dumping every thumbnail into a single directory is a disaster at scale. Filesystems slow down badly when a single folder holds hundreds of thousands of files.

Since we were already hashing every filename for routing, we used that same hash to build a multi-level directory structure — up to 3 levels deep, using the first few characters of the hash as folder names:

/ab/cd/ef/thumbnail.jpg

This kept every directory small and scannable. And because we were hashing based on the thumbnail's filename — which followed a fixed naming convention that never changed — that same hash could always be recomputed to find the file instantly. Same filename, same hash, same path. Always. No database lookup needed to locate a file on disk.
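A minimal sketch of that path scheme, assuming MD5 (the hash used for routing later in this post), two hex characters per level, and a storage root of `/`. The function name `sharded_path` and its parameters are illustrative, not our actual code.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(filename: str, levels: int = 3, width: int = 2) -> PurePosixPath:
    """Build /ab/cd/ef/<filename> from the leading hex characters of the hash."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Slice the first levels * width hex characters into directory names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath("/", *parts, filename)


path = sharded_path("thumbnail.jpg")
```

Because the path is a pure function of the filename, the upload handler and whatever serves reads can both compute it without sharing any state.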

The Redis Streams topology that came out of these two decisions:

Quick note on Redis Streams if you haven't used them: think of a Stream as a queue. One server drops a message at the back of the queue — "hey, I just saved this file." Another server is sitting at the other end, picking up messages one by one and acting on them. The key thing is the queue remembers every message and remembers exactly where each consumer left off — so if the consuming server goes down and comes back up, it just continues from where it stopped. Nothing gets skipped, nothing gets lost.

Each server gets its own dedicated Redis Stream. It posts upload events to its own stream, and its secondary server consumes that stream. Because consistent hashing can make any server a primary for some files and a secondary for others, every server both produces and consumes — forming a circular topology:

Server A → posts to Stream A → consumed by Server B (A's secondary)
Server B → posts to Stream B → consumed by Server C (B's secondary)
Server C → posts to Stream C → consumed by Server A (C's secondary)

The happy path flow is clean. A request comes in, the load balancer routes it to Server A. Server A saves the file, posts an event to Stream A. Server B picks up the event, pulls the file, saves its copy. Done.

In normal operation, every server has exactly one stream to write to and one stream to read from. Each server is someone's primary and someone else's secondary. It fits together like mechanical parts: each piece has one job, and together they cover everything.
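The circular wiring falls straight out of the server order. A small sketch, where the `stream:<name>` key format and the function name `stream_topology` are assumptions made for the example:

```python
def stream_topology(servers):
    """Map each server to its secondary, the stream it writes, and the stream it reads."""
    n = len(servers)
    topology = {}
    for i, server in enumerate(servers):
        predecessor = servers[(i - 1) % n]
        topology[server] = {
            "secondary": servers[(i + 1) % n],  # next server in the list
            "writes": f"stream:{server}",       # its own stream
            "reads": f"stream:{predecessor}",   # it is the predecessor's secondary
        }
    return topology


topology = stream_topology(["A", "B", "C"])
```

For `["A", "B", "C"]` this gives A reading `stream:C`, B reading `stream:A`, and C reading `stream:B`, which is the same loop as the diagram above.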


When the Primary Goes Down — The Problem Nobody Talks About

At this point someone smart in the room asks the obvious question:

"You said if the primary goes down, you serve from the secondary. Fine. But what about uploads? If Server A is down and an upload comes in — you route it to Server B. Server B saves the file. Now that file exists only on Server B. Server A has no idea it exists. When Server A comes back online, your two-server replication guarantee is broken. The file lives on one server. Your architecture just fell apart."

They're right.

And honestly — I was struggling with this one too.

We even used AI to brainstorm. The answer it gave us was the most obvious one: just add another stream. A dedicated fallback stream for exactly this scenario.

I knew immediately that was wrong. Not because it wouldn't work — it would. But the moment you add a stream for this, you add a consumer for that stream, logic to decide when to use which stream, and suddenly the beautiful simplicity of "each server has one stream, reads one stream" is gone. You've traded an elegant design for a complicated one. Maintenance gets harder. Edge cases multiply. The whole thing starts feeling fragile.

I shelved it. I knew there was a better answer somewhere.

And then one morning it clicked.

Why don't we just use the stream that Server A already consumes?

That was it. One sentence. One of the most satisfying moments I've had thinking through a system design.

Here's how it works. In our circular topology:

Server C → posts to Stream C → consumed by Server A

Server A is already consuming Stream C as part of normal operations. It's always listening to that stream. So — what if when Server B handles a fallback upload, instead of posting to Stream B (which Server B normally does), it posts to Stream C instead?

When Server A recovers and comes back online, it resumes consuming Stream C from where it left off — and there's the event waiting for it: "a file was saved on Server B, come pull it." Server A pulls the file, saves its copy, and consistency is restored. Automatically. With zero changes to the existing infrastructure.

Walk through it concretely:

Normal:   upload → Server A (primary) → posts to Stream A → Server B copies it

Fallback: Server A is down
          upload → Server B (fallback, header tagged)
          Server B saves the file
          Server B posts event to Stream C  ← the stream Server A already consumes
          Server A recovers → consumes Stream C → sees the event → pulls file from B → done

No new streams. No new consumers. No new logic on Server A's side — it's just consuming its existing stream and doing what it always does. The system heals itself.
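The rule for picking the fallback stream can be sketched in a few lines. The `stream:<name>` key format and the event field names (`file`, `source`, `fallback`) are hypothetical; the only thing taken from the design is that the event lands on the stream the downed primary already reads, and that it says where the file currently lives.

```python
def fallback_stream(servers, primary):
    """Stream a stand-in server should post to while `primary` is down.

    The primary consumes the stream of the server just before it in the
    list, so that is where the fallback event has to land.
    """
    i = servers.index(primary)
    return f"stream:{servers[i - 1]}"  # index -1 wraps to the last server


def fallback_event(filename, stored_on):
    """Event telling the recovering primary where to pull the file from."""
    return {"file": filename, "source": stored_on, "fallback": "true"}


target = fallback_stream(["A", "B", "C"], "A")
event = fallback_event("cat.png", stored_on="B")
```

With Server A down, `target` is `stream:C`, matching the walkthrough above.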

This was exactly the level of complexity we wanted. The same moving parts, doing one extra job they were already set up to do.


The Full Picture

Here's the complete architecture — every piece we've discussed, all together:

[Figure: Distributed Thumbnail Storage Architecture. A load balancer with consistent hashing routes to primary and secondary servers, each connected via its own Redis Stream in a circular topology.]

Upload path: client → load balancer → primary server → disk + Stream event → secondary server pulls and copies.
Fallback upload path: primary down → load balancer routes to secondary with fallback header → secondary posts to the stream the primary already consumes → primary self-heals on recovery.


Under the Hood: How the Ring Routes a File

The word "ring" has come up a few times. Before we get into migration math — which relies on it heavily — let me make the mechanic concrete.

Each server gets 100 virtual positions on the ring. With 3 servers, that's 300 positions total, arranged in sorted order around a circular number line. Sorted by their hash values, they end up interleaved — not in clean blocks, but spread out:

... → [A#5] → [B#23] → [C#11] → [A#57] → [C#89] → [B#77] → ...

Routing a file uses two separate steps:

Step 1 — Primary via the ring: MD5 hash the filename → get a long integer → walk clockwise until you hit the first ring position → that server is the primary.

Step 2 — Secondary via the server list: The servers have a fixed predefined order — [A, B, C]. Once you know the primary, look it up in that list and take the next one. Primary A → secondary B. Primary B → secondary C. Primary C → wraps back to A.

This is why the stream topology locks in the way it does. The streams are built to match the list exactly:

A → Stream A → B
B → Stream B → C
C → Stream C → A

The secondary is always deterministic from the server list — not from wherever you happen to land on the ring.

Concrete: cat.png hashes to 45422. Walking clockwise, the first ring position belongs to B. Primary: B. B's next in the list: C. Secondary: C.

Edge case — dog.jpg hashes to 98311, bigger than every position on the ring. Wrap around to the first position. Primary: A. A's next in the list: B. Secondary: B.
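Here is a minimal Python sketch of those two steps. The class and method names are made up for the example, and the `"A#5"` style labels for virtual positions are an assumption borrowed from the notation above. Real MD5 values are 128-bit integers, so the small numbers in the examples above are illustrative.

```python
import bisect
import hashlib


def h(key: str) -> int:
    """MD5 of the key as a (large) integer."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, servers, vnodes=100):
        self.servers = list(servers)  # fixed order, used for the secondary
        self.points = sorted(
            (h(f"{s}#{i}"), s) for s in self.servers for i in range(vnodes)
        )
        self.keys = [position for position, _ in self.points]

    def owner_of(self, hash_value: int) -> str:
        """First ring position at or after hash_value, wrapping past the end."""
        idx = bisect.bisect_left(self.keys, hash_value)
        if idx == len(self.keys):
            idx = 0  # bigger than every position: wrap to the first one
        return self.points[idx][1]

    def primary(self, filename: str) -> str:
        return self.owner_of(h(filename))

    def secondary(self, filename: str) -> str:
        """Next server in the predefined list, not the next one on the ring."""
        i = self.servers.index(self.primary(filename))
        return self.servers[(i + 1) % len(self.servers)]


ring = Ring(["A", "B", "C"])
placement = (ring.primary("cat.png"), ring.secondary("cat.png"))
```

The lookup is a binary search over the sorted positions, so with 300 positions it takes a handful of comparisons per request.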

Why 100 virtual nodes instead of 1? Without them, each server owns one big continuous arc of the ring, and the sizes of those arcs are essentially random. With three servers, one could end up owning 60% of the hash space and another only 10%. That kind of skew defeats the whole point. Virtual nodes break each server's ownership into 100 small scattered segments, so the variance averages out and every server ends up with roughly equal load.
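You can measure the effect directly by adding up the arcs each server owns. This sketch assumes MD5 positions and `"A#0"` style labels for virtual nodes, and the function name `ownership` is illustrative. The exact shares depend on those labels, so treat any numbers it prints as an illustration rather than our production distribution.

```python
import hashlib

SPACE = 2 ** 128  # size of the MD5 hash space


def h(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def ownership(servers, vnodes):
    """Fraction of the hash space each server owns."""
    points = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
    share = {s: 0 for s in servers}
    # A position owns the arc from the previous position up to itself.
    # The arc before the first position wraps around from the last one.
    prev = points[-1][0] - SPACE
    for position, server in points:
        share[server] += position - prev
        prev = position
    return {s: share[s] / SPACE for s in servers}


one_each = ownership(["A", "B", "C"], vnodes=1)
hundred_each = ownership(["A", "B", "C"], vnodes=100)
```

Comparing `one_each` with `hundred_each` shows how far the single-position shares drift from an even third, and how much closer the 100-position shares sit.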

That's the ring. Now — what happens when it changes?


Migration — The Annoying Part

Every engineer who designs something elegant eventually hits the part that isn't. For us, that was migration.

We had the design. We'd built the POC. It worked exactly as expected. And then the obvious question came: what happens when you add or remove a server from the ring?

We couldn't ignore it. You add a server, the hash ring changes, files that used to belong to Server A now belong to Server D. Those files still physically sit on Server A. Something has to move them. And it has to do it while the system keeps serving requests — zero downtime.

The good news: we weren't expecting Amazon-level traffic. We were fine with a migration that was good enough for our scale, not one engineered for millions of concurrent users. And we already had everything we needed — we just had to use it.

The obvious starting point: a background worker.


Why Consistent Hashing Already Does Most of the Work

Before writing a single line of migration code, we did the math — and consistent hashing surprised us with how little data actually needs to move.

Adding a server — 3 servers becoming 4:

When you have 3 servers, the hash ring is divided roughly equally:

Server A owns ~33.3% of the hash space
Server B owns ~33.3% of the hash space
Server C owns ~33.3% of the hash space

Add a 4th server and the ring rebalances:

Server A owns ~25%  (lost ~8.3%)
Server B owns ~25%  (lost ~8.3%)
Server C owns ~25%  (lost ~8.3%)
Server D owns ~25%  (gained ~8.3% from each)

Only ~25% of all files need to move. Everything else stays exactly where it is. And here's the beautiful part: the more servers you have, the less migration happens. Adding a 4th server to a 3-server cluster moves ~25% of files. Adding an 11th server to a 10-server cluster moves only ~9%. Adding a 101st server to a 100-server cluster moves just ~1%. The larger your cluster grows, the cheaper it becomes to scale it further.

Removing a server — 4 servers becoming 3:

The reverse happens. Server D is removed. Its 25% of the hash space gets redistributed across the remaining three servers — roughly 8.3% going to each. Again, only ~25% of files move — and again, the larger the cluster, the smaller that percentage. The majority of the cluster is always untouched.
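The math is easy to sanity-check with a quick simulation: build the ring before and after, hash a batch of made-up filenames, and count how many change primary. Everything here (the `thumb_<n>.jpg` names, the `"A#0"` virtual node labels, the sample size) is an assumption for the sketch. In the ideal case the moved fraction when going from N to N+1 servers is 1/(N+1).

```python
import bisect
import hashlib


def h(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def make_lookup(servers, vnodes=100):
    """Return a function mapping a filename to its primary server."""
    points = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
    keys = [position for position, _ in points]

    def primary(filename):
        # First position at or after the hash; modulo handles the wrap-around.
        idx = bisect.bisect_left(keys, h(filename)) % len(keys)
        return points[idx][1]

    return primary


def moved_fraction(before, after, sample=20000):
    """Share of sampled files whose primary differs between two rings."""
    old, new = make_lookup(before), make_lookup(after)
    names = [f"thumb_{i}.jpg" for i in range(sample)]
    return sum(old(n) != new(n) for n in names) / sample


added = moved_fraction(["A", "B", "C"], ["A", "B", "C", "D"])
removed = moved_fraction(["A", "B", "C", "D"], ["A", "B", "C"])
```

Both directions touch the same set of files, the ones that D owns in the 4-server ring, so `added` and `removed` come out identical and land in the neighbourhood of 25%.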


The Replication Multiplier

Here's the catch we noticed: those calculations are only for primary copies.

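Because every file has a secondary too, when a file's primary moves to a new server, its secondary assignment also changes. The same percentage of files that move as primaries also need their secondary copies repositioned.

There is a second effect that comes from picking the secondary off the server list rather than the ring. If Server D is appended to [A, B, C], C's "next in the list" changes from A to D, so files whose primary stays on C still need a secondary copy on D. Part of that is offset by servers that already hold a copy: a file moving from A to D has A as its new secondary, and A already has it.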

So the actual data movement is roughly double the hash space calculation. Not catastrophic — but worth knowing going in.
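A back-of-the-envelope version of that count, under loudly stated assumptions: the ring shares are perfectly even, the new server is appended to the end of the server list, and a "transfer" is one copy landing on a server that did not already hold the file. The function names are illustrative.

```python
from fractions import Fraction


def placement(servers, primary):
    """The two servers that hold a file: primary plus next in the list."""
    i = servers.index(primary)
    return {primary, servers[(i + 1) % len(servers)]}


def copies_to_transfer(before, after):
    """Expected new copies per file when servers are appended to the list."""
    added = [s for s in after if s not in before]
    total = Fraction(0)
    for old_primary in before:
        share = Fraction(1, len(before))
        stays = share * Fraction(len(before), len(after))  # primary unchanged
        moves = share - stays                              # primary goes to a new server
        old_holders = placement(before, old_primary)
        # Files that keep their primary may still get a new secondary.
        total += stays * len(placement(after, old_primary) - old_holders)
        # Files that move need copies wherever the new pair is not covered yet.
        for new_primary in added:
            total += (moves / len(added)) * len(placement(after, new_primary) - old_holders)
    return total


per_file = copies_to_transfer(["A", "B", "C"], ["A", "B", "C", "D"])
```

Under those assumptions `per_file` comes out to 7/12, about 58% of a copy per file, which is in the same ballpark as "double the 25%" but not exactly it. Inserting the new server at a different point in the list would shift which files pick up a new secondary.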


Pull, Not Push — And Why the Streams Already Had the Answer

When the background worker finds a file that no longer belongs to its current server, the obvious move is to push it — directly transfer the file to the new primary via HTTP. The worker controls exactly how fast it sends, add a small delay between transfers, you've got rate limiting built in. Simple, right?

We thought about it. Then we thought about what actually happens after the new primary receives that file.

It saves it. Then it has to post an event to its stream. Then the secondary consumes that event and pulls a copy. Three hops. And if the new primary crashes after receiving the file but before posting the event — the secondary never gets notified. Your replication guarantee breaks mid-migration, silently, with no way to know.

There's a deeper problem too: push adds infrastructure we didn't have. Direct file transfer means connection management, timeouts, retry logic — a whole new code path. And everything in this system was deliberately built around streams and events. Adding direct HTTP file transfers to migration felt like exactly the kind of complexity we'd been avoiding from day one.

So we flipped it: let the streams do the work.

The background worker on the old server doesn't push files. It posts a flagged migration event to two streams: the one the new primary already consumes and the one the new secondary already consumes. Each event carries a pointer back to the old server as the source. Both servers read their usual streams independently, see the event, and pull the file directly from the old server.

Old server walks files → finds file that needs to move
  → posts event {source_url = self} to the stream the new primary consumes
  → posts same event to the stream the new secondary consumes
  → both servers consume their streams independently
  → both pull the file directly from old server
  → done. No hops. No dependency.

The secondary doesn't wait for the new primary. Both pull in parallel. If the new primary crashes mid-migration, the secondary still has its own event sitting in its stream — it pulls the file regardless. And if any server crashes and restarts, the event is AOF-persisted, waiting exactly where it was left off.

Rate limiting works the same way it would have with push — the background worker adds a small delay between posting events and controls exactly how fast migration runs.

The part that made this click: the stream consumers on the new servers already knew how to do this. One extra flag — rebalance=true — and the existing consumer logic handles migration the same way it handles normal replication. No new code path. The same moving parts, doing one extra job they were already set up to do.
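A rough sketch of both halves, the worker posting events and the consumer handling them, using plain Python lists in place of Redis Streams. It assumes each event is written to whichever stream the target server already reads (its predecessor's stream in the list). The `stream:<name>` keys, the URL shape, and the field names other than `source_url` and `rebalance` are hypothetical.

```python
from collections import defaultdict


def consumed_stream(servers, server):
    """Stream that `server` reads: the one written by its predecessor in the list."""
    return f"stream:{servers[servers.index(server) - 1]}"


def post_migration_events(streams, servers, filename, old_server, new_primary):
    """Post one flagged event per target; each target pulls on its own schedule."""
    new_secondary = servers[(servers.index(new_primary) + 1) % len(servers)]
    event = {
        "file": filename,
        "source_url": f"http://{old_server}/{filename}",  # hypothetical URL shape
        "rebalance": "true",
    }
    for target in (new_primary, new_secondary):
        streams[consumed_stream(servers, target)].append(dict(event))


def handle_event(event, local_files, fetch):
    """Same handler for replication and migration: pull from the event's source."""
    if event["file"] not in local_files:  # skip if this server already holds it
        local_files[event["file"]] = fetch(event["source_url"])


servers = ["A", "B", "C", "D"]
streams = defaultdict(list)
post_migration_events(streams, servers, "cat.png", old_server="B", new_primary="D")
```

Here D reads `stream:C` and D's secondary, A, reads `stream:D`, so those are the two streams that receive the event. Rate limiting would be a sleep between calls to `post_migration_events` in the worker loop.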

And we were completely fine with migration taking its time. No deadline, no SLA on how fast a new server had to be loaded. Slow and steady, no disruption to live traffic.


A Note on How We Think About Design

This is probably the most important thing I learned building this system — and it has nothing to do with Redis or consistent hashing.

You don't need a perfect solution. You need the right solution for your system.

Every tradeoff we accepted (two servers instead of three, slow migration instead of fast, temporary duplicates during migration) wasn't a compromise. It was a deliberate choice. We weren't carrying requirements we didn't actually have. A tradeoff only costs you something if you needed what you gave up.

The core of system design isn't knowing every tool. It's understanding the problem in front of you well enough to know which tools you don't need.


The Question Nobody Asks Until It's Too Late

So migration is running. Files are moving in the background. And then it hits you:

During migration, the ring has changed. Requests are already being routed using the new ring. But the files haven't fully moved yet. A request comes in for a file that belongs to Server D in the new ring — except that file is still sitting on Server A because the background worker hasn't gotten to it yet. What happens?

This is the moment where a lot of designs fall apart.

Our answer was simpler than we expected: maintain a fallback ring in the load balancer.

The load balancer holds two rings simultaneously during migration — the new ring and the old ring. Every request tries the new ring first. If the file is there, great. If the request fails — file not found — it falls back to the old ring automatically.

Request comes in during migration:
  → Try new ring first
  → File not there yet? Fall back to old ring
  → Old ring knows exactly where it was
  → Served. User sees nothing.

If the new ring doesn't have the file, it simply means migration hasn't reached it yet. The old ring still has it. No errors, no broken images, no downtime. Once migration completes and every file has moved, the old ring is retired and the load balancer runs on the new ring alone.
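The lookup order is simple enough to write down. This is plain Python rather than load balancer configuration, and `locate`, `has_file`, and the two ring callables are stand-ins: in practice "has the file" would be a request to the chosen server that either succeeds or comes back not found.

```python
def locate(filename, new_ring_primary, old_ring_primary, has_file):
    """Try the new ring's server first, then fall back to the old ring's."""
    for pick in (new_ring_primary, old_ring_primary):
        if pick is None:  # old ring already retired
            continue
        server = pick(filename)
        if has_file(server, filename):
            return server
    return None  # not found on either ring


# Mid-migration: cat.png belongs to D in the new ring but still sits on A.
disk = {"A": {"cat.png"}, "D": set()}
new_ring = lambda f: "D"
old_ring = lambda f: "A"
on_disk = lambda server, f: f in disk[server]

served_from = locate("cat.png", new_ring, old_ring, on_disk)
```

Once migration finishes, passing `None` for the old ring turns this back into a single-ring lookup.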


The Removal Edge Case That Will Catch You Off Guard

Adding a server is straightforward with this approach. Removing one has a trap.

When you remove Server D, the load balancer immediately builds a new ring without it. But the fallback ring — the old ring — still includes Server D. And that fallback ring will keep routing requests to Server D until every file has migrated off it.

Which means Server D cannot go offline until migration is complete.

Shut it down early and the fallback ring starts hitting a dead server. Files that haven't migrated yet become unreachable. You've created exactly the downtime you were trying to avoid.

The rule: when removing a server, keep it alive and serving traffic until the background worker has walked every file and the old ring is fully retired. Only then shut it down gracefully.

It sounds obvious in hindsight. But it's the kind of thing that only becomes obvious after you've thought through every step of what the fallback ring is actually doing.


The Duplicates Problem (And Why We Were Fine With It)

During migration, files temporarily exist on more than two servers. A file that's moving from Server A to Server D still sits on Server A while migration is in progress. For a window of time, three or more servers have that file.

We were completely fine with this. Temporary duplicates don't break anything — they just use a bit of extra disk space for a short period.

What we decided: run cleanup only after migration completes. Once every background worker signals it's done walking all its files, a cleanup job starts. It walks every server, and for each file asks: am I still the primary or secondary for this file? If neither — it's an orphan. Delete it.

Clean, no risk of deleting a file mid-migration.
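The per-file check is the same primary and secondary math used everywhere else. A minimal sketch, where `find_orphans` and `primary_of` are illustrative names and `primary_of` stands in for the ring lookup on the final ring:

```python
def find_orphans(server, local_files, servers, primary_of):
    """Files on `server` for which it is neither primary nor secondary."""
    orphans = []
    for filename in local_files:
        primary = primary_of(filename)
        secondary = servers[(servers.index(primary) + 1) % len(servers)]
        if server not in (primary, secondary):
            orphans.append(filename)
    return orphans


servers = ["A", "B", "C", "D"]
primary_of = {"cat.png": "D", "dog.jpg": "A"}.get  # stand-in for the ring

orphans_on_b = find_orphans("B", ["cat.png", "dog.jpg"], servers, primary_of)
```

In this example B keeps `dog.jpg` (it is A's secondary) and flags `cat.png`, which now lives on D and A. The sketch only lists orphans; the delete step should run only after every worker has reported that migration is done.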


The Read/Write Split

Honest answer upfront: these numbers depend on your hardware. Faster CPUs, faster disks, more bandwidth — the numbers shift. Our servers are 12-core machines, files are 20KB thumbnails. With that setup, we hit roughly 10,000 read requests per second per node.

Uploads go through Spring Boot. The server hashes the filename, determines the primary, writes to disk, posts an event to Redis Streams. CPU work, disk I/O, network calls. In load testing, Spring Boot handled around 1,500–3,000 upload requests per second per node.

Reads bypass Spring Boot entirely. NGINX serves thumbnail fetch requests directly from disk — no JVM overhead, no application logic, no Redis. NGINX reads the file and streams it to the client. That's the whole operation.

The result is a gap of roughly 3× to 7× between read and write throughput on the same hardware (10,000 reads per second against 1,500–3,000 uploads per second), which brackets the 5× we set out to hit. Not from any clever trick, just from not making reads do more work than they need to. Keep Spring Boot out of the read path, let NGINX do what it was built for, and the gap appears on its own.


One Last Thing

We haven't launched yet. So I can't tell you this held up under real user traffic — I can tell you it held up under 2 million requests in a load test, 444ms average, 80-85% CPU. That's as close to real as we could get before the real thing.

What I've learned building this: the hardest part isn't the code. It's resisting the urge to solve problems you don't have yet. Every time we stripped something out — a third replica, a dedicated fallback stream, a faster migration algorithm — the system got cleaner.

The moment that still stands out: realizing we could route fallback uploads through the stream the primary already consumes. No new infrastructure. Just the same parts, doing one more job. That's the kind of solution that feels right — not because it's clever, but because nothing about the system needed to change to support it.

This was also my first time working seriously on the DevOps side. The team helped with a lot of the foundational setup: provisioning the servers, configuring the Redis environments, getting the multi-server topology actually running. I wrote the Lua scripts and ran the JMeter tests myself, but none of that happens without someone who knows how to stand up infrastructure properly. Learned a lot from that side of things.

Top comments (5)

Devireddy Vinayreddy

As a DevOps lead, it has been a pleasure working alongside you on our DevOps initiatives. I truly valued the knowledge we exchanged and the opportunity to tackle complex problems together, relying on genuine, hands-on engineering.

Akshat Kumar Gupta • Edited

Same here — the infrastructure side was new territory for me and having someone who actually knows what they're doing made the difference between a design on paper and something that runs. Appreciated.

Jaya krishna Rayudu • Edited

As part of the DevOps team, it has been a great experience working closely with you on this architecture. I truly appreciated the knowledge sharing, the deep technical discussions, and the opportunity to solve real distributed systems challenges together through practical, hands-on engineering.

Akshat Kumar Gupta

Honestly the unsung part of this whole thing Jaya Krishna — every time I needed a new iteration stood up, Redis configured, NGINX routing sorted, you made it happen. The design work doesn't go anywhere without that. Appreciated.

Shashank verma

awsm article brother, too much things to learn. I am really glad to be a part of this work.