DEV Community

mohamed Tayel

Welcome to the Distributed Systems World — The Challenges Nobody Warned You About

tags: distributedsystems, beginners, architecture, webdev
series: Fundamentals of Distributed Systems

You built a small web app. It runs on one machine. It works.

Then traffic grows. Or you hear someone in a meeting say "microservices." Or your boss talks about scaling. And suddenly there's a whole new world of words: replication, sharding, consensus, CAP theorem.

This series is for that moment. We'll go through it slowly, with simple pictures and everyday examples — no fancy terms, no scary code. By the end you'll understand what people mean by "distributed systems" and why everyone makes such a big deal about them.

This first article is the map: what changes when you go from one machine to many, and the six big challenges you'll meet along the way.

Let's start.


So… what is a distributed system?

Here's the simple version:

A distributed system is a group of computers working together so the user thinks it's just one computer.

That's it. That's the whole idea.

Think about ordering food on an app. You tap a button. Two hundred milliseconds later you see "Order placed."

In those 200 milliseconds, a lot happened:

  • A server picked you out of millions of users
  • A payment system charged your card (maybe from a different country)
  • The order was saved on multiple computers
  • The restaurant's screen got a notification
  • Your phone got a confirmation

Many machines just had a tiny conversation to make your single tap work. That conversation is the hard part.


Why bother with multiple machines?

One machine is simpler. So why does anyone leave?

1. Growth. Every machine has a limit. The best CPU, the most RAM — there's always a ceiling. When your traffic passes that ceiling, you have no choice but to add more machines.

2. Safety. Machines die. Hard drives fail, power goes out, someone trips on a cable. If everything lives on one machine, when it dies, your app dies. With many machines, the others keep going.

3. Speed for everyone. A user in Japan talking to a server in the US feels a delay on every request, because even light takes time to cross an ocean. To feel "instant" worldwide, you need machines close to people.
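
If you want to put a rough number on that delay, here is a small optional sketch. The two constants are assumptions for illustration only: signals in fiber travel at about two thirds of the speed of light, and the distance from Tokyo to the US west coast is taken as roughly 8,300 km. Real requests are slower because of routing and processing.

```python
# Rough lower bound on round-trip time over fiber.
# Both constants are approximations chosen for illustration.
SPEED_IN_FIBER_KM_PER_S = 200_000  # about two thirds of light speed in a vacuum


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round trip in milliseconds, ignoring routing and processing."""
    one_way_s = distance_km / SPEED_IN_FIBER_KM_PER_S
    return 2 * one_way_s * 1000


tokyo_to_us_west_km = 8_300  # approximate straight-line distance (assumption)
print(round(min_round_trip_ms(tokyo_to_us_west_km)))  # 83
```

So even a perfect network spends about 83 ms just on travel for that one round trip. A server a few hundred kilometers from the user cuts that to a few milliseconds.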

That's the trade-off. You get growth, safety, and speed. In return, you inherit every problem we're about to talk about. Going distributed is not free.

Now, the six challenges.


Challenge 1: The network can't be trusted

Inside one computer, calling code is like talking to someone in the same room. You speak, they hear. Simple.

The moment two computers need to talk over a network, it's like texting. And texts can:

  • Get lost — never arrive at all
  • Arrive late — 5 seconds after you gave up waiting
  • Arrive twice — the same message appears two times
  • Arrive in the wrong order — your second message reaches them first

This isn't just bad luck you can engineer away. A famous thought experiment called the Two Generals Problem shows something brutal: when two parties talk over an unreliable channel, no number of messages and confirmations can make both sides 100% sure they agree, because the last message might always be the one that got lost. Not "hard". Actually impossible.

So when your code talks to another machine, you have to plan for the network lying to you. Every time.
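
For the curious, here is one common way to plan for it, as a minimal sketch rather than a real protocol. The sender numbers every message and retries until it gets through. The receiver ignores repeats and holds early arrivals until the gap is filled. All names here (`Receiver`, `unreliable_send`, `send_with_retry`) are made up for this example, and the "network" is just a random number generator.

```python
import random


class Receiver:
    """Applies each message once, in sequence order, however it arrives."""

    def __init__(self):
        self.next_seq = 0   # next sequence number we expect to apply
        self.pending = {}   # messages that arrived early, keyed by sequence number
        self.applied = []   # payloads applied so far, in order

    def receive(self, seq, payload):
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate: already applied or already buffered
        self.pending[seq] = payload
        # Apply everything that is now in order.
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


def unreliable_send(receiver, seq, payload, rng):
    """Hypothetical flaky channel: may drop the message or deliver it twice."""
    roll = rng.random()
    if roll < 0.3:
        return False                      # lost on the way
    receiver.receive(seq, payload)
    if roll > 0.8:
        receiver.receive(seq, payload)    # delivered a second time
    return True


def send_with_retry(receiver, seq, payload, rng, max_attempts=50):
    """Keep sending until one attempt gets through, or give up."""
    for _ in range(max_attempts):
        if unreliable_send(receiver, seq, payload, rng):
            return True
    return False


rng = random.Random(42)
receiver = Receiver()
messages = ["order placed", "card charged", "restaurant notified"]
for seq in [2, 0, 1]:  # sent out of order on purpose
    send_with_retry(receiver, seq, messages[seq], rng)
print(receiver.applied)
```

The sketch cheats in one place: the sender learns instantly whether a message arrived. In a real system that confirmation is itself a message that can get lost, which is exactly why the receiver has to tolerate duplicates.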


Challenge 2: Many copies, one truth

[Figure: A teacher dictating to students]

When you have many machines, you usually have many copies of the same data. That's good — if one machine dies, the others still have it.

But it creates a new problem: how do you make sure all the copies agree?

Imagine a teacher saying "everyone write X equals 5 in your notebook." If one student is daydreaming, their notebook is wrong. Now whoever reads from that student gets the wrong answer.

The standard trick is simple:

  1. One machine is the boss. Everybody listens to the boss.
  2. The boss sends the change to everyone else.
  3. The boss waits until most of them confirm they got it.
  4. Only then does the boss say "done."

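That "most of them" rule (called a majority or quorum) is the magic. Any two majorities always share at least one machine, so as long as more than half of the machines are still working, at least one of them has every confirmed change and the system still knows what the truth is.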
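
The four steps above can be sketched in a few lines. This is a toy model under loud assumptions: `Replica` and `Leader` are invented names, a "crashed" machine is just a flag, and nothing here handles choosing a new boss or cleaning up after a write that failed partway. Real replication protocols do much more.

```python
class Replica:
    """A follower machine that may or may not be reachable."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.data = {}

    def store(self, key, value):
        """Return True if the write was saved, False if the replica is down."""
        if not self.online:
            return False
        self.data[key] = value
        return True


class Leader:
    """The boss: sends each change to followers and waits for a majority."""

    def __init__(self, followers):
        self.followers = followers
        self.data = {}

    def write(self, key, value):
        cluster_size = len(self.followers) + 1   # followers plus the leader
        majority = cluster_size // 2 + 1
        self.data[key] = value
        acks = 1                                 # the leader counts itself
        for follower in self.followers:
            if follower.store(key, value):
                acks += 1
        return acks >= majority                  # "done" only with a majority


followers = [
    Replica("b"),
    Replica("c"),
    Replica("d", online=False),
    Replica("e", online=False),
]
leader = Leader(followers)
print(leader.write("x", 5))   # True: 3 of 5 machines confirmed
followers[1].online = False
print(leader.write("x", 6))   # False: only 2 of 5 confirmed
```

With five machines the system survives two failures. Lose a third and the boss can no longer say "done", which is the safe answer when it cannot be sure.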

You're probably already relying on the majority rule without knowing it: many replicated databases and coordination services use some version of it.


Challenge 3: How do you grow?

[Figure: Bigger kitchen vs more chefs]

When traffic grows, you have two choices.

Option A: Make the machine bigger. Stronger CPU, more memory, faster disk. Your code doesn't change at all. Easy.

The problem? There's a biggest machine on Earth, and one day you'll hit it. Plus, it's still one machine — when it dies, everything dies with it.

Option B: Add more machines. Split the work across many.

The good news: almost no limit. One machine dies? The others keep going.

The bad news: now your code has to deal with all the headaches from this article. Coordination, copying data, splitting work, network problems — they all become real.

Most real systems do a bit of both: bigger machines where it makes sense, more machines for everything else.


Challenge 4: Some parts get hammered, others nap

[Figure: Three cashiers, one overloaded]

When your data outgrows one machine, you split it across many. This is called partitioning (or sharding — same thing, different name).

It works beautifully — until traffic isn't evenly spread.

Imagine a supermarket with three cashiers. Sounds great, right? Now imagine every customer for some reason chose Cashier #2. Cashier #1 is reading a book. Cashier #3 is looking at the ceiling. Cashier #2 has a line out the door.

The total work isn't too much. It's just stuck in one place.

This is super common in real systems. One viral post, one huge customer, one trending topic — and suddenly one machine is melting while the rest are bored. The problem isn't more traffic. It's uneven traffic.
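
Here is the cashier problem as a small optional sketch. It assumes one simple way of splitting data: hash the key and take the remainder to pick a machine. The function names and the key names ("viral-post", "user-0") are made up for the example, and real systems often use fancier schemes.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key. The same key always lands on the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def load_per_shard(requests, num_shards):
    """Count how many requests each shard would receive."""
    counts = Counter(shard_for(key, num_shards) for key in requests)
    return [counts.get(shard, 0) for shard in range(num_shards)]


# 3000 requests spread over 3000 different keys.
even = [f"user-{i}" for i in range(3000)]
# 3000 requests, but 2400 of them are for one popular key.
hot = ["viral-post"] * 2400 + [f"user-{i}" for i in range(600)]

print(load_per_shard(even, 3))  # roughly a third each
print(load_per_shard(hot, 3))   # one shard gets at least 2400 of the 3000
```

Both runs handle the same total number of requests. The only difference is that every request for the popular key has to go to the same machine.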

Fixing it is its own art — we'll spend a whole article on it later.


Challenge 5: Testing is genuinely harder

[Figure: Three levels of testing]

In a small app, when something breaks, you can usually figure it out. Open the logs, find the error, fix it.

In a distributed system, bugs are sneaky. They hide. They show up:

  • Only when many users hit the app at once
  • Only on one specific machine
  • Only at 3 AM during a holiday
  • Only when the network is slow that day

To catch them before they reach real users, we test on three levels:

  • Many small tests that check tiny pieces of code. Cheap and fast.
  • Some medium tests that check whether pieces work together. A bit slower.
  • A few big tests that try the whole app like a real user would. Slow and expensive.

The trick is having lots of small ones and very few big ones. The opposite (lots of big tests) makes everything slow and people stop running them.
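
To make the first two levels concrete, here is a tiny sketch. Everything in it is hypothetical: `order_total`, `place_order`, and the `FakePayments` stand-in are invented for this example. The small test checks one function alone. The medium test checks that two pieces work together, using a fake in place of the real payment service so it stays fast. A big test would need the whole app running, so it isn't shown.

```python
def order_total(prices, delivery_fee):
    """Pure function: easy to check with a small test."""
    return sum(prices) + delivery_fee


class FakePayments:
    """In-memory stand-in for a remote payment service (hypothetical)."""

    def __init__(self):
        self.charges = []

    def charge(self, amount):
        self.charges.append(amount)
        return True


def place_order(prices, delivery_fee, payments):
    """Combines the total calculation with the payment call."""
    return payments.charge(order_total(prices, delivery_fee))


def test_small_order_total():
    # Small test: one function, no other parts involved.
    assert order_total([5, 7], 2) == 14


def test_medium_order_is_charged():
    # Medium test: two pieces together, with a fake instead of the network.
    payments = FakePayments()
    assert place_order([5, 7], 2, payments) is True
    assert payments.charges == [14]


test_small_order_total()
test_medium_order_is_charged()
print("all tests passed")
```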


Challenge 6: You can't fix what you can't see

[Figure: Logs, numbers, and alerts]

Writing code is half the work. The other half is keeping it running 24/7 across all those machines.

For that, you need three things:

Logs — a written record of what happened. "User clicked button," "Database returned error," "Payment failed." Like a diary.

Numbers — how the system is doing right now. How fast? How busy? How many errors? Like a car dashboard.

Alerts — a phone call when something breaks. Because if nobody's watching, the numbers don't matter.

The basic rule of running things: if you can't see it, you can't fix it.
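
Here is how the three pieces fit together in the smallest possible form. This is a sketch, not a monitoring setup: the `Monitor` class, the "orders" logger name, and the 50% error threshold are all made up for the example. Real systems send these signals to dedicated tools instead of printing them.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("orders")  # hypothetical service name


class Monitor:
    """Keeps simple counters and decides when someone should be called."""

    def __init__(self, max_error_rate=0.5, min_requests=10):
        self.max_error_rate = max_error_rate  # made-up threshold for the example
        self.min_requests = min_requests      # avoid alerting on tiny samples
        self.requests = 0
        self.errors = 0

    def record(self, ok):
        """Log what happened and update the numbers."""
        self.requests += 1
        if ok:
            log.info("order placed")
        else:
            self.errors += 1
            log.error("payment failed")

    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0

    def should_alert(self):
        """Alert only when there is enough traffic and too many errors."""
        return (self.requests >= self.min_requests
                and self.error_rate() > self.max_error_rate)


monitor = Monitor()
for i in range(10):
    monitor.record(ok=(i < 4))    # 4 successes, then 6 failures
print(monitor.error_rate())       # 0.6
print(monitor.should_alert())     # True
```

The log lines are the diary, the two counters are the dashboard, and `should_alert` is the phone call.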


Quick recap

Six challenges. None of them ever goes away, but all of them are manageable.

#  | Challenge   | The short version
1  | Network     | Messages can be lost, late, doubled, or out of order.
2  | Copies      | Many copies of data must agree.
3  | Growth      | Bigger machine vs. more machines.
4  | Uneven load | Splitting only works if traffic splits evenly too.
5  | Testing     | Bugs hide in timing. Use small, medium, and big tests.
6  | Watching    | You need logs, numbers, and alerts.

The honest truth: distributed systems aren't magic. They're a list of trade-offs. Once you see the trade-offs clearly, the choices start making sense.


What's next

Next article we'll dig into the first piece of the puzzle: how services talk to each other. Sometimes they wait for an answer (like a phone call). Sometimes they leave a message and move on (like a voicemail). When to use which? That's part 2.

After that: how data is stored, what "eventually consistent" means, how to keep things running when machines fail, and how to actually watch all of this in real time.

If anything here felt unclear — perfect. That means we have something good to dig into next. Drop a comment with what you want explored first.

See you in part 2. 👋


Follow the series — every article is simple language, lots of pictures, no scary code.
