Three things every indie multiplayer game gets wrong in production

Most indie multiplayer games ship with three architectural decisions that look fine at MVP and break somewhere between 50 and 500 concurrent players. I've spent several years hosting servers for indie survival/MMO games built in UE, Unity, and Godot, and these are the three failure modes I keep seeing.

Failure 1: Trusting the client for any value that affects other players

The pattern: the client computes "I dealt 25 damage to player X" or "my final survival time was 14:32" and sends that to the server. The server records it.

This fails the moment one player decompiles the client and starts sending fake values. They're suddenly invincible, or topping leaderboards with impossible scores, or duplicating resources. Trust collapses for the honest players who watched it happen.

The fix that has held up: make the server authoritative for anything that touches other players. The client sends INTENT ("I shot at player X from this position") and the server validates and applies it. Latency goes up because there's a round trip. The cheat surface drops to almost zero because the client never gets to be the source of truth on anything competitive.
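
Here's a minimal sketch of what that looks like server-side, in TypeScript. Every name in it (ShotIntent, handleShotIntent, the constants) is illustrative, not from any particular engine or netcode library:

```typescript
interface Vec3 { x: number; y: number; z: number }

interface Player {
  id: string;
  position: Vec3;
  lastShotAt: number;   // server clock, ms
  weaponDamage: number;
}

// The client reports what it tried to do, never the result.
interface ShotIntent {
  shooterId: string;
  targetId: string;
  origin: Vec3;         // where the client claims it fired from
}

const MAX_WEAPON_RANGE = 50;
const FIRE_COOLDOWN_MS = 200;
const POSITION_TOLERANCE = 2; // allowed drift between client and server positions

const dist = (a: Vec3, b: Vec3) =>
  Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);

function handleShotIntent(
  players: Map<string, Player>,
  intent: ShotIntent,
  now: number,
  applyDamage: (target: Player, amount: number, sourceId: string) => void,
): void {
  const shooter = players.get(intent.shooterId);
  const target = players.get(intent.targetId);
  if (!shooter || !target) return; // stale or spoofed ids: drop

  // Every check runs against server state, never the client's claims.
  if (now - shooter.lastShotAt < FIRE_COOLDOWN_MS) return;                // rate limit
  if (dist(shooter.position, intent.origin) > POSITION_TOLERANCE) return; // position sanity
  if (dist(shooter.position, target.position) > MAX_WEAPON_RANGE) return; // range check

  // The server computes the outcome; the client never sends "damage: 25".
  shooter.lastShotAt = now;
  applyDamage(target, shooter.weaponDamage, shooter.id);
}
```

The important part is the shape, not the specific checks: the intent carries claims, the server holds the truth, and anything that doesn't reconcile gets dropped.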

Single-player progression can stay client-side. Leaderboards, PvP outcomes, shared world state, and currency cannot.

Failure 2: No versioning or event log on player save data

The pattern: each player's save is a JSON blob. Every save overwrites the previous version. The only backup is whatever your hosting provider's nightly snapshot grabbed.

This fails when:

  • A bad patch corrupts saves and you find out 18 hours later.
  • A duplication exploit briefly works and you can't tell which players exploited.
  • Two parallel sessions (a player reconnects on their phone while the desktop session is still active) race on the same save and one silently wins.
  • A support ticket comes in saying "I lost 3 hours of progress" and you can only restore to 4am yesterday.

The fix: a per-player append-only event log. The "save" becomes a projection of the events. Rollback to any second is a replay of the log up to event N, not a restore from backup. Audit becomes trivial because every progression jump has a source event. Race conditions become detectable instead of silent.
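
A sketch of the shape, using an in-memory log for brevity (in production the log lives in your database; all event and function names here are made up):

```typescript
// Hypothetical event shapes for one player's progression log.
type PlayerEvent =
  | { seq: number; at: number; type: 'xp_gained'; amount: number }
  | { seq: number; at: number; type: 'item_gained'; item: string; qty: number }
  | { seq: number; at: number; type: 'item_spent'; item: string; qty: number };

interface SaveState {
  xp: number;
  inventory: Record<string, number>;
}

// The "save" is never written directly; it's a projection of the log.
// Rollback to any point in time is a replay with a smaller upToSeq.
function projectSave(log: PlayerEvent[], upToSeq = Infinity): SaveState {
  const save: SaveState = { xp: 0, inventory: {} };
  for (const e of log) {
    if (e.seq > upToSeq) break;
    if (e.type === 'xp_gained') {
      save.xp += e.amount;
    } else if (e.type === 'item_gained') {
      save.inventory[e.item] = (save.inventory[e.item] ?? 0) + e.qty;
    } else {
      save.inventory[e.item] = (save.inventory[e.item] ?? 0) - e.qty;
    }
  }
  return save;
}

// The sequence number doubles as an optimistic lock: two parallel sessions
// both writing seq 42 produce a visible conflict instead of a silent overwrite.
function appendEvent(log: PlayerEvent[], event: PlayerEvent): boolean {
  const expected = (log[log.length - 1]?.seq ?? 0) + 1;
  if (event.seq !== expected) return false; // concurrent session detected
  log.push(event);
  return true;
}
```

The same log answers the support ticket ("replay up to 14:32"), the exploit audit ("which players have item_gained events with impossible quantities"), and the dual-session race (the second session's append fails loudly).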

This is more work than a JSON blob. It's also the difference between "support is a war zone" and "support takes 5 minutes."

Failure 3: One binary doing auth, matchmaking, simulation, persistence, and anti-cheat

The pattern: the game server process does everything. Player auth happens in the same binary as the world tick. Persistence writes block on the same thread as physics. Anti-cheat runs inline with the simulation step.

This fails because those workloads are different shapes. World simulation is CPU-bound. Auth is I/O-bound and bursty. Persistence is database-bound and write-heavy. Anti-cheat scans are CPU-bursty.

When you bolt them all into one process, you get cascading failures. A noisy auth attack spikes CPU and the world tick starts dropping frames. A bad database write blocks for 200ms and the simulation hitches. An anti-cheat scan kicks in and 80 players see a momentary disconnect.

The fix is unglamorous: split the stateless concerns out. Auth, persistence, matchmaking, anti-cheat coordination all live as separate services that talk to the game server over an internal API. The game server keeps the one job that only it can do, which is running the world simulation. Everything else scales horizontally, fails independently, and can be replaced without taking down the game.
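
Here's a minimal sketch of the persistence and auth side of that split, assuming internal HTTP endpoints. The URLs, payload shapes, and function names are placeholders, not a real API:

```typescript
// Sketch: the game server keeps only the tick loop; everything else
// crosses a service boundary.

const persistQueue: object[] = [];

// The tick does one job: simulate. It never awaits auth, the database,
// or an anti-cheat scan; it only pushes to an in-memory queue.
function onTick(): void {
  // ... world simulation ...
  persistQueue.push({ type: 'world_snapshot', at: Date.now() });
}

// Runs outside the tick. A slow database write delays persistence by a
// few hundred ms; the simulation never notices.
async function flushLoop(): Promise<void> {
  for (;;) {
    const batch = persistQueue.splice(0, persistQueue.length);
    if (batch.length > 0) {
      try {
        await fetch('http://persistence.internal/events', {
          method: 'POST',
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify(batch),
        });
      } catch {
        persistQueue.unshift(...batch); // retry next pass; tick unaffected
      }
    }
    await new Promise((r) => setTimeout(r, 250));
  }
}

// Auth lives in its own service behind an internal endpoint: an auth
// flood saturates that service's CPU, not the world tick.
async function authenticate(token: string): Promise<string | null> {
  const res = await fetch('http://auth.internal/verify', {
    method: 'POST',
    body: token,
  });
  if (!res.ok) return null;
  const { playerId } = (await res.json()) as { playerId: string };
  return playerId;
}

setInterval(onTick, 50); // 20 Hz simulation
void flushLoop();
```

The queue-and-flush pattern is the cheapest version of the idea; the fuller version moves the flusher and the auth handler into separate processes so they can't even share a crash.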

Most indie games discover this around the time they hit 100-200 concurrent and players start reporting "everyone got disconnected for 90 seconds." That's usually one of the stateless concerns starving the simulation thread.

The shape that works

The three fixes share a pattern: separate concerns that have different failure modes, and don't let the client be the source of truth for anything that crosses the player boundary.

If you want to read deeper on the architecture that holds these together, I've written about the tick-server vs event-driven split here: https://gsb.supercraft.host/blog/multiplayer-game-backend-architecture/ and on per-player event-sourcing for save data here: https://gsb.supercraft.host/blog/player-data-schema-design-nosql-vs-sql/. The orchestration side (splitting stateless services from the game server) is covered here: https://gsb.supercraft.host/blog/game-server-orchestration-guide/.

None of this is required at 5 players. All of it becomes required somewhere between 50 and 500 concurrent. The earlier you put the architecture in place, the cheaper the migration is.

What's the failure mode you've seen most often in production multiplayer? I've been catching mostly category 3 lately as more indie teams hit the auth-starves-simulation problem.
