Matheus Gomes 👨‍💻
🧠 System Design: Foundations, Scaling Strategies, and Resilience Patterns

💡 What Is System Design and Why It’s Valuable

System design is the process of planning how different parts of a software system work together: the architecture, components, data flow, and how everything scales or recovers from failure.

It aims to make sure your system:

✅ Works correctly (meets functional requirements)

⚙️ Performs efficiently and reliably (meets non-functional requirements like scalability, latency, and fault tolerance)

🎯 Why It’s Valuable

👩‍💻 Team Growth: Clear boundaries let multiple teams develop without interfering.

📈 Traffic Growth: Plan for scaling so your app doesn’t crash under load.

🧰 Risk Reduction: Identify and eliminate bottlenecks or single points of failure.

💰 Cost Efficiency: Optimize infrastructure to save money at scale.

🛡️ Reliability: Design for uptime—your users expect it.


🧱 Separating Out the Database

When you begin, your app and database often run on the same machine.

As traffic grows, the two compete for the same CPU, memory, and disk, so you’ll want to move the database onto its own machine.

💬 Example

Imagine a simple blog app:

  • Your code runs on a web server (for example, Node.js or Python/Django).

  • It stores posts in a database (e.g., PostgreSQL).

By running the database separately, you can:

  • Scale your web servers independently.

  • Back up the database securely.

  • Use different database technologies for different needs.

🏗️ In production, databases often run on their own managed services, like Amazon RDS or Google Cloud SQL.
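One way to keep the app independent of where the database lives is to read its location from configuration. Here is a minimal sketch, assuming a hypothetical `DATABASE_URL` environment variable and made-up host names (neither comes from a specific framework):

```python
import os
from urllib.parse import urlparse

# Hypothetical default; in production this would point at a separate DB host.
DEFAULT_URL = "postgresql://blog_user@db.internal.example:5432/blog"


def load_db_config(env=None):
    """Parse the database location from the environment instead of hard-coding localhost."""
    env = os.environ if env is None else env
    url = urlparse(env.get("DATABASE_URL", DEFAULT_URL))
    return {
        "host": url.hostname,
        "port": url.port or 5432,  # 5432 is the default PostgreSQL port
        "user": url.username,
        "database": url.path.lstrip("/"),
    }


# Pointing the app at another database server is a config change, not a code change.
config = load_db_config({"DATABASE_URL": "postgresql://app@10.0.0.5:5432/blog"})
```

With this shape, every web server you add can share the same setting and talk to the same database.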


🏋️ Vertical Scaling (Scaling Up)

Vertical scaling means upgrading your current machine, adding more CPU, memory, or faster SSDs.

🖥️ Example

You start with:

t2.micro: 1 CPU, 1 GB RAM

Traffic grows, so you upgrade to:

t2.large: 2 CPUs, 8 GB RAM

✅ Pros

  • Simple to implement, often no code changes required.

  • Low latency: everything runs on one machine, so there are no extra network hops.

⚠️ Cons

  • 💸 Costs rise quickly.

  • 🚫 Machine size has physical limits.

  • ❌ The machine is a single point of failure: if it goes down, the whole system goes down.

Use vertical scaling when:

  • You’re starting out.

  • Your app doesn’t yet need multiple servers.


🔁 Horizontal Scaling (Scaling Out)

Horizontal scaling means adding more machines instead of upgrading one.

It’s like adding more waiters to a busy restaurant instead of hiring one superhuman waiter.

💬 Example

You start with:

  • 1 web server handling all requests.

When traffic increases:

  • Add more servers.

A load balancer will distribute requests among them.


⚖️ Load Balancer

A Load Balancer (LB) spreads requests evenly across several servers.

🧩 How It Works

  1. Client → LB

  2. LB → Forwards the request to a healthy server (for example, the least busy one)

  3. Server responds → LB → Client

⚙️ LB Responsibilities

  • Distribute traffic 🕸️

  • Check server health 💉

  • Terminate SSL/TLS 🔐

  • Remove bad servers from rotation 🚫
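The flow and responsibilities above can be sketched in a few lines. This is only an illustration, assuming a "least connections" strategy and hypothetical `Server` and `LeastConnectionsBalancer` classes; real load balancers such as NGINX or HAProxy are configured rather than hand-written:

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True        # would be updated by periodic health checks
    active_requests: int = 0    # how busy the server currently is


class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self):
        """Choose the healthy server with the fewest in-flight requests."""
        healthy = [s for s in self.servers if s.healthy]
        if not healthy:
            raise RuntimeError("no healthy servers available")
        return min(healthy, key=lambda s: s.active_requests)

    def handle(self, request):
        server = self.pick()
        server.active_requests += 1
        try:
            # A real LB would proxy the request over the network here.
            return f"{server.name} handled {request}"
        finally:
            server.active_requests -= 1


servers = [Server("web-1", active_requests=2), Server("web-2"), Server("web-3", healthy=False)]
lb = LeastConnectionsBalancer(servers)
chosen = lb.pick().name  # web-3 is skipped (unhealthy), web-1 is busier
```

Round robin is another common strategy; least connections is used here only because it matches the "least busy server" step.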

💬 Example

AWS users might use Elastic Load Balancing (ELB).

In local setups, you might try NGINX or HAProxy.

✅ Benefits

  • Seamless scaling by adding/removing servers.

  • Zero-downtime updates using rolling deployments.


📦 Stateless Services

A stateless service keeps nothing in its own memory between requests.

All data or sessions are stored elsewhere (like a database or cache).

💬 Example

Imagine a shopping cart:

  • ❌ Stateful: Stored in web server memory. If that server dies, cart is gone.

  • ✅ Stateless: Cart stored in a database or Redis. Any server can respond.
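Here is a small sketch of the stateless version of the cart. The `CartStore` class is a hypothetical in-memory stand-in for a shared store like Redis or a database, so the example runs without any external service:

```python
class CartStore:
    """Stand-in for a shared store (Redis, a database) that lives outside the web servers."""

    def __init__(self):
        self._carts = {}

    def add_item(self, user_id, item):
        self._carts.setdefault(user_id, []).append(item)

    def get_cart(self, user_id):
        return list(self._carts.get(user_id, []))


class WebServer:
    def __init__(self, name, store):
        self.name = name
        self.store = store  # the server itself holds no per-user state

    def add_to_cart(self, user_id, item):
        self.store.add_item(user_id, item)

    def view_cart(self, user_id):
        return self.store.get_cart(user_id)


store = CartStore()
server_a = WebServer("web-a", store)
server_b = WebServer("web-b", store)

server_a.add_to_cart("user-1", "book")
del server_a  # simulate web-a dying
cart_seen_by_b = server_b.view_cart("user-1")  # the cart survives on the shared store
```

Because the cart lives in the store, it does not matter which server the next request lands on.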

🧭 Benefits

  • 🔄 Easy to scale horizontally.

  • 💪 Increased fault tolerance.

  • 🚀 Updates and deployments are simpler.


☁️ Serverless

Serverless computing means you write functions, not servers.

Cloud providers run them on demand.

💬 Example

You upload a photo → this triggers a Lambda function that stores it in S3 and updates a database.

You don’t manage infrastructure, and you pay only per execution.
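A function for this flow might look roughly like the sketch below. It assumes the function is triggered by an S3-style upload notification (an assumption about the wiring, not something the example above specifies), and it leaves the database update as a comment so the code runs without cloud credentials:

```python
def handler(event, context):
    """Entry point the platform calls once per event; no server code around it."""
    uploaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # A real function would update a database row for this photo here.
        uploaded.append(f"{bucket}/{key}")
    return {"processed": len(uploaded), "objects": uploaded}


# Local usage with a hand-built event; bucket and key names are made up.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "photos"}, "object": {"key": "cat.jpg"}}}
    ]
}
result = handler(sample_event, context=None)
```

Being able to call the handler with a plain dictionary like this also helps with the debugging difficulty listed in the cons.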

✅ Pros

  • Zero infrastructure management.

  • Scales automatically with demand.

  • You pay only when your code runs.

⚠️ Cons

  • Startup delay (cold starts).

  • Harder debugging and monitoring.

  • Time and memory limits.

🪄 Serverless is ideal for:

  • Event-driven apps.

  • APIs with unpredictable traffic.

  • Lightweight background jobs (e.g., sending emails).


🗃️ Scaling the Databases

Databases are often the hardest to scale, since they hold state.

⚙️ Strategies

📖 1. Read Replicas

Use additional servers for read operations, so the main database focuses on writes.

✅ Example:

A news website can serve millions of readers using read replicas, while journalists write only to the primary database.
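The routing decision can be sketched like this. `ReplicatedDatabase` is a hypothetical helper, and checking for a leading `SELECT` is a deliberate simplification; in practice a database driver or proxy would usually handle this:

```python
import itertools


class ReplicatedDatabase:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # rotate through replicas

    def route(self, sql):
        """Return the name of the node that should run this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


db = ReplicatedDatabase("primary", ["replica-1", "replica-2"])
routes = [
    db.route("SELECT * FROM articles WHERE id = 1"),
    db.route("SELECT * FROM articles WHERE id = 2"),
    db.route("INSERT INTO articles (title) VALUES ('Breaking news')"),
]
```

Replicas usually lag slightly behind the primary, so a read that must see a just-written value may still need to go to the primary.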


⚡ 2. Caching

Store frequently accessed data in memory.

This reduces database load.

💬 Example:
Instead of running SELECT * FROM product WHERE id=123 on every request, cache the result for 10 minutes.
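Here is a minimal sketch of that idea, assuming a hypothetical `TTLCache` class and a fake `load_product` function standing in for the real query. The injectable clock exists only so the expiry can be shown without waiting ten minutes:

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value, or call loader(key) and cache the result."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self.ttl, value)
        return value


db_calls = []


def load_product(product_id):
    db_calls.append(product_id)  # stands in for the SELECT against the database
    return {"id": product_id, "name": "demo product"}


fake_now = [0.0]
cache = TTLCache(ttl_seconds=600, clock=lambda: fake_now[0])
cache.get_or_load(123, load_product)  # miss: goes to the database
cache.get_or_load(123, load_product)  # hit: served from memory
fake_now[0] = 601.0                   # just over ten minutes later
cache.get_or_load(123, load_product)  # expired: goes to the database again
```

Three lookups result in two database calls here; with real traffic the saving grows with every request that arrives inside the TTL window.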


🧩 3. Sharding (Partitioning)

Split large datasets into smaller parts by a chosen key.

Example:

  • Shard 1: Users 1–1 million

  • Shard 2: Users 1,000,001–2 million

✅ Benefits:

  • Boosts throughput and storage.

  • Avoids single DB bottlenecks.

⚠️ Challenges:

  • Harder migrations.

  • Managing cross-shard queries.
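Picking the shard for a given key can be sketched two ways. Both functions below are illustrative assumptions: the range version follows the user ID ranges from the example, and the hash version is a common alternative the example does not mention:

```python
import hashlib

USERS_PER_SHARD = 1_000_000


def range_shard(user_id):
    """Range-based: users 1..1,000,000 go to shard 1, the next million to shard 2, and so on."""
    return (user_id - 1) // USERS_PER_SHARD + 1


def hash_shard(user_id, shard_count):
    """Hash-based: spreads keys more uniformly, at the cost of losing range locality."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count + 1


first = range_shard(1)
boundary = range_shard(1_000_000)
next_shard = range_shard(1_000_001)
```

Range sharding keeps neighboring IDs together but can create hot shards when new users are the most active; hash sharding avoids that but makes changing `shard_count` a migration of its own.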


🧮 4. Connection Pooling

Cap the number of DB connections by sharing a fixed pool instead of opening a new connection per request (e.g., with PgBouncer).

This avoids a DB overload when many app servers connect at once.
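The idea can be illustrated with an in-process pool. Note that this is not how PgBouncer itself works (it is a separate proxy that sits in front of PostgreSQL); the hypothetical `ConnectionPool` below only shows the core mechanism of reusing a fixed set of connections:

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # open all connections up front

    @contextmanager
    def connection(self, timeout=1.0):
        # Waits for a free connection; raises queue.Empty if none frees up in time.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


created = []


def fake_connect():
    created.append(object())  # stands in for opening a real DB connection
    return created[-1]


pool = ConnectionPool(fake_connect, size=2)
for _ in range(5):
    with pool.connection() as conn:
        pass  # run a query with conn here
```

Five units of work ran, but only two connections were ever opened, which is the point: the database sees a bounded number of clients no matter how busy the app is.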


💡 5. CQRS (Command Query Responsibility Segregation)

Separate read and write operations into different models:

  • Commands: Insert, update, delete.

  • Queries: Fetch data, often denormalized.

This enables independent optimization and scaling.
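A toy version of the split might look like this. All class names are hypothetical, and the read model is updated synchronously for simplicity; real systems often update it asynchronously through events:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreatePost:
    """A command: describes a change the caller wants to make."""
    post_id: int
    title: str
    author: str


class ReadModel:
    """Denormalized view shaped for one query: titles grouped by author."""

    def __init__(self):
        self._titles_by_author = {}

    def project(self, command):
        self._titles_by_author.setdefault(command.author, []).append(command.title)

    def titles_for(self, author):
        return list(self._titles_by_author.get(author, []))


class WriteModel:
    """Normalized store that accepts commands and keeps the read model in sync."""

    def __init__(self, read_model):
        self.posts = {}
        self.read_model = read_model

    def handle(self, command):
        self.posts[command.post_id] = {"title": command.title, "author": command.author}
        self.read_model.project(command)


reads = ReadModel()
writes = WriteModel(reads)
writes.handle(CreatePost(1, "Scaling 101", "ana"))
writes.handle(CreatePost(2, "Caching tips", "ana"))
ana_titles = reads.titles_for("ana")
```

The query side never touches `writes.posts`, so each side can be stored, indexed, and scaled on its own terms.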


🌍 6. Multi‑Region Setup

Replicate data across regions to reduce latency and improve resilience.

💬 Example:

Users in Brazil read/write from the São Paulo region, while users in Germany use Frankfurt.
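The routing part of this example can be sketched as a simple lookup. The region codes below are the AWS identifiers for São Paulo and Frankfurt; the fallback region and the `region_for` helper are assumptions made for illustration:

```python
# Country code -> region that should serve the user.
REGION_BY_COUNTRY = {
    "BR": "sa-east-1",     # São Paulo
    "DE": "eu-central-1",  # Frankfurt
}
DEFAULT_REGION = "us-east-1"  # hypothetical fallback for unmapped countries


def region_for(country_code):
    """Pick the region closest to the user, falling back to a default."""
    return REGION_BY_COUNTRY.get(country_code.upper(), DEFAULT_REGION)


brazil_region = region_for("br")
germany_region = region_for("DE")
```

In practice this decision is usually made by DNS or a global load balancer rather than application code, but the mapping is the same idea.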


🧯 Failover Strategies

When something fails (and it will), your system needs a way to recover, ideally without manual intervention.

Below are standard failover patterns, from cheapest to most resilient:


🧊 Cold Standby

  • Backup system exists but is turned off.

  • Restored manually from backups.

⏰ RTO: Hours

💰 Cost: Low

🧩 Example: Archive systems or staging environments.


🌤️ Warm Standby

  • Partially active backup that receives continuous data updates.

  • Scaled up on demand during failure.

⏰ RTO: Minutes

💰 Cost: Medium

🧩 Example: E-commerce store backups.


🔥 Hot Standby

  • Fully provisioned clone, continuously updated and ready to take traffic.

⏰ RTO: Seconds

💰 Cost: High

🧩 Example: Critical financial or healthcare systems.
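The step where traffic moves to the standby can be sketched as a small monitor. The `FailoverMonitor` class and the threshold of three failed checks are assumptions for illustration; managed databases and orchestrators typically provide this logic for you:

```python
class FailoverMonitor:
    def __init__(self, primary, standby, max_failures=3):
        self.active = primary
        self.standby = standby
        self.max_failures = max_failures
        self._failures = 0

    def record_health_check(self, ok):
        """Feed in one health check result; returns the node that should take traffic."""
        if ok:
            self._failures = 0
            return self.active
        self._failures += 1
        # Require several consecutive failures so one blip does not trigger failover.
        if self._failures >= self.max_failures and self.standby is not None:
            self.active, self.standby = self.standby, None
            self._failures = 0
        return self.active


monitor = FailoverMonitor("db-primary", "db-standby", max_failures=3)
results = [monitor.record_health_check(ok) for ok in (True, False, False, False)]
```

The standby patterns mostly differ in how long the promoted node needs before it can actually serve traffic, which is what the RTO values describe.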


🌎 Multi‑Primary (Active‑Active)

  • Multiple regions serve traffic simultaneously.

  • Requires bidirectional replication and conflict handling.

✅ Fastest recovery and lowest latency

⚠️ Hardest to manage due to data conflicts

🧩 Example:

A global chat app — EU users connect to the EU data center, US users to the US, both stay synchronized.
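To make "conflict handling" concrete, here is a sketch of one simple strategy, last write wins. This is an assumption chosen for brevity, not the only option, and it silently drops the older of two concurrent writes. Each entry is a hypothetical `(timestamp, region, value)` tuple:

```python
def merge_last_write_wins(local, remote):
    """Merge two replicas' key -> (timestamp, region, value) maps."""
    merged = dict(local)
    for key, entry in remote.items():
        current = merged.get(key)
        # Compare (timestamp, region) so ties resolve the same way on every replica.
        if current is None or entry[:2] > current[:2]:
            merged[key] = entry
    return merged


eu = {"status": (10, "eu", "online"), "nick": (5, "eu", "ana")}
us = {"status": (12, "us", "away")}

merged_in_eu = merge_last_write_wins(eu, us)
merged_in_us = merge_last_write_wins(us, eu)
```

Both regions end up with the same data regardless of merge direction, which is the property that keeps them synchronized. Apps that cannot afford to lose writes need something richer, such as merging values or asking the user.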


🧭 Putting It All Together (A Growth Journey)

| Stage | What You Add | Purpose |
| --- | --- | --- |
| 🚀 Early Start | Single server, vertical scaling | Simple and low-cost setup |
| ⚙️ Growth Stage | Separate database, stateless app | Better reliability and maintainability |
| 🌐 Scaling Stage | Load balancer with multiple servers | Handles more traffic |
| 🗂️ Data Scaling | Caching, read replicas, sharding | Reduces load on the main database |
| 🔁 Reliability | Failover mechanisms, automation | Increases uptime and resilience |
| ⚡ Mature System | Multi-region deployment, global monitoring | Supports global traffic and quick recovery |

🧩 Key Takeaways

  • 🧠 System design = trade‑offs under constraints.

  • 🌱 Start small, evolve realistically — don’t over‑engineer early on.

  • 🏗️ Stateless design + separate databases unlock horizontal scaling.

  • 📊 Database scaling = replicas + caching + sharding + pooling.

  • 💪 Failover design ensures reliability during disasters.

  • 📈 Evolve incrementally — track performance, failure rates, and cost.
