Matheus Gomes 👨‍💻

Posted on Oct 20

🧠 System Design: Foundations, Scaling Strategies, and Resilience Patterns

#systemdesign #programming #todayilearned

💡 What Is System Design and Why It’s Valuable

System design is the process of planning how different parts of a software system work together: the architecture, components, data flow, and how everything scales or recovers from failure.

It aims to make sure your system:

✅ Works correctly (meets functional requirements)

⚙️ Performs efficiently and reliably (meets non-functional requirements like scalability, latency, and fault tolerance)

🎯 Why It’s Valuable

👩‍💻 Team Growth: Clear boundaries let multiple teams develop without interfering with each other.

📈 Traffic Growth: Plan for scaling so your app doesn’t crash under load.

🧰 Risk Reduction: Identify and eliminate bottlenecks or single points of failure.

💰 Cost Efficiency: Optimize infrastructure to save money at scale.

🛡️ Reliability: Design for uptime—your users expect it.

🧱 Separating Out the Database

When you begin, you might have your app and database all on one machine.

But as your user base grows, you’ll usually want to move them onto separate machines.

💬 Example

Imagine a simple blog app:

Your code runs on a web server (for example, Node.js or Python/Django).
It stores posts in a database (e.g., PostgreSQL).

By running the database separately, you can:

Scale your web servers independently.
Back up the database securely.
Use different database technologies for different needs.

🏗️ In production, databases often run on their own managed services, like Amazon RDS or Google Cloud SQL.
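
A minimal sketch of what the split can look like from the app's side, assuming the database location comes from an environment variable. The variable name DATABASE_URL, the host db.internal.example, and the helper database_settings are illustrative assumptions, not something the setup above prescribes.

```python
import os
from urllib.parse import urlparse


def database_settings(env=os.environ):
    """Read the database location from configuration instead of hard-coding it."""
    # Falls back to a local database, as in the single-machine setup.
    url = urlparse(env.get("DATABASE_URL", "postgresql://localhost:5432/blog"))
    return {
        "host": url.hostname,
        "port": url.port or 5432,
        "name": url.path.lstrip("/"),
    }


# Same code, two deployments: only the configuration changes.
same_machine = database_settings({})
separate_host = database_settings(
    {"DATABASE_URL": "postgresql://db.internal.example:5432/blog"}
)
```

Because the app only knows a URL, moving the database to its own machine or to a managed service becomes a configuration change rather than a code change.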

🏋️ Vertical Scaling (Scaling Up)

Vertical scaling means upgrading your current machine by adding more CPU, more memory, or faster SSDs.

🖥️ Example

You start with:

t2.micro: 1 vCPU, 1 GB RAM

Traffic grows, so you upgrade to:

t2.large: 2 vCPUs, 8 GB RAM

✅ Pros

Simple to implement, often no code changes required.
Low latency, since everything runs on one machine with no extra network hops.

⚠️ Cons

💸 Costs rise quickly.
🚫 Machine size has physical limits.
❌ The machine is a single point of failure: if it goes down, so does the whole system.

Use vertical scaling when:

You’re starting out.
Your app doesn’t yet need multiple servers.

🔁 Horizontal Scaling (Scaling Out)

Horizontal scaling means adding more machines instead of upgrading one.

It’s like adding more waiters to a busy restaurant instead of hiring one superhuman waiter.

💬 Example

You start with:

1 web server handling all requests.

When traffic increases:

Add more servers.

A load balancer will distribute requests among them.

⚖️ Load Balancer

A Load Balancer (LB) spreads incoming requests across several servers.

🧩 How It Works

Client → LB
LB → Sends the request to a server chosen by its balancing algorithm (e.g., round robin or least connections)
Server responds → LB → Client

⚙️ LB Responsibilities

Distribute traffic 🕸️
Check server health 💉
Terminate SSL/TLS 🔐
Remove bad servers from rotation 🚫
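
A toy sketch of two of these responsibilities, distributing traffic and skipping unhealthy servers, using a least-connections policy as one possible selection rule. The class names and server names are made up for illustration; real load balancers such as NGINX or HAProxy are configured rather than hand-written like this.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True
    active_requests: int = 0


class LeastConnectionsBalancer:
    """Picks the healthy server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self) -> Server:
        # Unhealthy servers are skipped, i.e. taken out of rotation.
        candidates = [s for s in self.servers if s.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers available")
        # On ties, min() returns the first candidate in list order.
        chosen = min(candidates, key=lambda s: s.active_requests)
        chosen.active_requests += 1
        return chosen

    def release(self, server: Server) -> None:
        # Call once the server has finished responding.
        server.active_requests -= 1


servers = [Server("web-1"), Server("web-2"), Server("web-3")]
lb = LeastConnectionsBalancer(servers)

first = lb.pick()           # all idle, so the first server is chosen
second = lb.pick()          # web-1 is busy, so the next idle server is chosen
servers[2].healthy = False  # a failed health check marks web-3 as bad
third = lb.pick()           # web-3 is skipped even though it is idle
```

Swapping the `min(...)` line for a rotating index would turn this into round robin, which is the simpler policy many setups start with.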

💬 Example

AWS users might use Elastic Load Balancing (ELB).

In local setups, you might try NGINX or HAProxy.

✅ Benefits

Seamless scaling by adding/removing servers.
Zero-downtime updates using rolling deployments.

📦 Stateless Services

A stateless service doesn’t remember anything between requests.

All data or sessions are stored elsewhere (like a database or cache).

💬 Example

Imagine a shopping cart:

❌ Stateful: Stored in web server memory. If that server dies, cart is gone.
✅ Stateless: Cart stored in a database or Redis. Any server can respond.
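
A small sketch of the stateless version, with a plain in-memory class standing in for Redis or a database. SharedStore and CartServer are hypothetical names used only for this illustration.

```python
class SharedStore:
    """Stand-in for an external store such as Redis or a database."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class CartServer:
    """A web server that keeps no cart data in its own memory."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_item(self, user_id, item):
        cart = self.store.get(f"cart:{user_id}", [])
        self.store.set(f"cart:{user_id}", cart + [item])

    def view_cart(self, user_id):
        return self.store.get(f"cart:{user_id}", [])


store = SharedStore()
server_a = CartServer("web-1", store)
server_b = CartServer("web-2", store)

server_a.add_item("user-42", "keyboard")
del server_a                          # simulate web-1 dying
cart = server_b.view_cart("user-42")  # web-2 still sees the cart
```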

🧭 Benefits

🔄 Easy to scale horizontally.
💪 Increased fault tolerance.
🚀 Updates and deployments are simpler.

☁️ Serverless

Serverless computing means you deploy functions instead of managing servers.

Cloud providers run them on demand.

💬 Example

You upload a photo → this triggers a Lambda function that stores it in S3 and updates a database.

You don’t manage infrastructure, and you pay only per execution.
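
A hedged sketch of what such a function might look like, assuming the trigger is an S3 upload notification. The `handler(event, context)` shape and the `Records` layout follow AWS Lambda's Python conventions; PHOTO_INDEX is a hypothetical stand-in for the database, and a real function would call a database client instead.

```python
import urllib.parse

# Hypothetical stand-in for a database table.
PHOTO_INDEX = {}


def handler(event, context):
    """Record metadata for every uploaded object named in the event."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        PHOTO_INDEX[key] = {
            "bucket": bucket,
            "size": record["s3"]["object"].get("size"),
        }
        processed.append(key)
    return {"processed": processed}


# Local invocation with a hand-built event; no cloud account needed.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "photos"},
                "object": {"key": "cats/tom+1.jpg", "size": 2048},
            }
        }
    ]
}
result = handler(sample_event, None)
```

Keeping the handler a plain function makes it easy to test locally, which helps with the debugging difficulty that serverless setups tend to have.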

✅ Pros

Zero infrastructure management.
Scales automatically with demand (within provider limits).
You pay only when your code runs.

⚠️ Cons

Startup delay (cold starts).
Harder debugging and monitoring.
Time and memory limits.

🪄 Serverless is ideal for:

Event-driven apps.
APIs with unpredictable traffic.
Lightweight background jobs (e.g., sending emails).

🗃️ Scaling the Databases

Databases are often the hardest part of a system to scale, since they hold state.

⚙️ Strategies

📖 1. Read Replicas

Send read queries to replica servers so the primary database can focus on writes. Replicas can lag slightly behind the primary, so reads may be briefly stale.

✅ Example:

A news website can serve millions of readers using read replicas, while journalists write only to the primary database.
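
A minimal sketch of the routing idea, assuming a naive rule that any statement starting with SELECT is a read. ReplicaRouter and the server names are hypothetical; many database drivers and proxies offer this kind of routing out of the box.

```python
import itertools


class ReplicaRouter:
    """Sends reads to replicas in turn and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        # Naive classification: only statements starting with SELECT are reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


router = ReplicaRouter("primary-db", ["replica-1", "replica-2"])
targets = [
    router.route("SELECT * FROM articles"),
    router.route("INSERT INTO articles (title) VALUES ('Hello')"),
    router.route("select * from articles where id = 1"),
]
```

A read that must see a write made a moment earlier (for example, a journalist previewing a just-saved article) is usually safer sent to the primary.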

⚡ 2. Caching

Store frequently accessed data in memory.

This reduces database load.

💬 Example:
Instead of running SELECT * FROM product WHERE id=123 on every request, cache the result for 10 minutes.
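
A small in-process sketch of that idea with a 10-minute time-to-live. TTLCache and load_product are hypothetical names; in production this role is often played by a shared cache such as Redis, and the injectable clock exists only so the expiry can be demonstrated without waiting.

```python
import time


class TTLCache:
    """Caches loaded values and reloads them once they are older than the TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # cache hit: no database call
        value = loader(key)          # cache miss: query the database
        self._entries[key] = (now + self.ttl, value)
        return value


db_calls = []


def load_product(product_id):
    """Pretend database query that records how often it is called."""
    db_calls.append(product_id)
    return {"id": product_id, "name": "Sample product"}


fake_now = [0.0]
cache = TTLCache(ttl_seconds=600, clock=lambda: fake_now[0])

cache.get_or_load(123, load_product)  # miss: hits the database
cache.get_or_load(123, load_product)  # hit: served from memory
fake_now[0] = 601.0                   # just over 10 minutes later
cache.get_or_load(123, load_product)  # expired: hits the database again
```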

🧩 3. Sharding (Partitioning)

Split large datasets into smaller parts by a chosen key.

Example:

Shard 1: Users 1–1,000,000
Shard 2: Users 1,000,001–2,000,000
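
A sketch of how a shard could be looked up from a user ID, assuming non-overlapping ranges of one million users per shard. The hash-based variant is an alternative added for comparison, not something the range example implies; both function names are illustrative.

```python
import hashlib

USERS_PER_SHARD = 1_000_000


def range_shard(user_id: int) -> int:
    """Range sharding: IDs 1..1,000,000 -> shard 1, the next million -> shard 2."""
    if user_id < 1:
        raise ValueError("user IDs start at 1")
    return (user_id - 1) // USERS_PER_SHARD + 1


def hash_shard(user_id: int, shard_count: int) -> int:
    """Hash sharding: spreads IDs across a fixed number of shards."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count + 1


last_on_first = range_shard(1_000_000)   # still on shard 1
first_on_second = range_shard(1_000_001)  # first user on shard 2
```

Range sharding keeps neighboring IDs together, which helps range queries but can concentrate new users on the newest shard. Hash sharding spreads load more evenly, at the cost of moving data when the shard count changes.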

✅ Benefits:

Boosts throughput and storage.
Avoids single DB bottlenecks.

⚠️ Challenges:

Harder migrations.
Managing cross-shard queries.

🧮 4. Connection Pooling

Cap the number of database connections by routing them through a shared pool (e.g., PgBouncer).

This keeps the database from being overloaded when many app servers connect at once.
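
A minimal in-process sketch of the pooling idea: a fixed number of connections is created once and handed out on demand. ConnectionPool and fake_connect are hypothetical; a real pool would wrap actual database connections, and an external pooler works at the network level instead.

```python
import queue


class ConnectionPool:
    """Holds a fixed set of connections; callers wait when all are in use."""

    def __init__(self, create_connection, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(create_connection())

    def acquire(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)


created = []


def fake_connect():
    """Pretend to open a database connection."""
    conn = f"conn-{len(created) + 1}"
    created.append(conn)
    return conn


pool = ConnectionPool(fake_connect, size=2)
a = pool.acquire()
b = pool.acquire()
try:
    pool.acquire(timeout=0.01)  # pool is empty, so this times out
    exhausted = False
except queue.Empty:
    exhausted = True
pool.release(a)
c = pool.acquire()              # reuses the released connection
```

However many requests arrive, the database only ever sees two connections here; extra callers wait instead of opening new ones.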

💡 5. CQRS (Command Query Responsibility Segregation)

Separate read and write operations into different models:

Commands: Insert, update, delete.
Queries: Fetch data, often denormalized.

This enables independent optimization and scaling.
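
A compact sketch of the split, assuming an order system as the example domain. PlaceOrder, OrderWriteModel, and CustomerSpendReadModel are made-up names, and the read model is updated synchronously here for simplicity; in practice it is often updated asynchronously.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceOrder:
    """A command: expresses the intent to change state."""
    order_id: str
    customer: str
    total: float


class CustomerSpendReadModel:
    """Denormalized view, shaped for one query: total spend per customer."""

    def __init__(self):
        self.total_by_customer = {}

    def apply(self, cmd: PlaceOrder) -> None:
        current = self.total_by_customer.get(cmd.customer, 0.0)
        self.total_by_customer[cmd.customer] = current + cmd.total

    def total_for(self, customer: str) -> float:
        return self.total_by_customer.get(customer, 0.0)


class OrderWriteModel:
    """Validates and stores commands, then updates the read side."""

    def __init__(self, read_model):
        self.orders = {}
        self.read_model = read_model

    def handle(self, cmd: PlaceOrder) -> None:
        if cmd.order_id in self.orders:
            raise ValueError("duplicate order")
        self.orders[cmd.order_id] = cmd
        self.read_model.apply(cmd)


reads = CustomerSpendReadModel()
writes = OrderWriteModel(reads)
writes.handle(PlaceOrder("o-1", "alice", 20.0))
writes.handle(PlaceOrder("o-2", "alice", 5.5))
alice_total = reads.total_for("alice")
```

The query side never scans the orders; it answers from a structure built for exactly that question.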

🌍 6. Multi‑Region Setup

Replicate data across regions to reduce latency and improve resilience.

💬 Example:

Users in Brazil read/write from the São Paulo region, while users in Germany use Frankfurt.
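
A small sketch of the routing decision, assuming requests carry a country code. The mapping uses the AWS region identifiers for São Paulo (sa-east-1) and Frankfurt (eu-central-1); the default region and the fallback rule are assumptions made for illustration.

```python
REGION_BY_COUNTRY = {
    "BR": "sa-east-1",     # São Paulo
    "DE": "eu-central-1",  # Frankfurt
}
DEFAULT_REGION = "us-east-1"  # assumed fallback for unmapped countries


def pick_region(country_code: str, healthy_regions: set) -> str:
    """Prefer the user's home region; fall back to any healthy one."""
    if not healthy_regions:
        raise RuntimeError("no healthy regions available")
    preferred = REGION_BY_COUNTRY.get(country_code.upper(), DEFAULT_REGION)
    if preferred in healthy_regions:
        return preferred
    # Sorted only to make the fallback choice deterministic.
    return sorted(healthy_regions)[0]


all_up = {"sa-east-1", "eu-central-1", "us-east-1"}
normal = pick_region("br", all_up)
during_outage = pick_region("BR", {"eu-central-1", "us-east-1"})
```

In real deployments this decision is usually made by DNS or a global load balancer rather than application code.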

🧯 Failover Strategies

When something fails (and it will), your system must recover automatically.

Below are standard failover patterns, from cheapest to most resilient:

🧊 Cold Standby

Backup system exists but is turned off.
Restored manually from backups.

⏰ RTO: Hours

💰 Cost: Low

🧩 Example: Archive systems or staging environments.

🌤️ Warm Standby

Partially active backup that receives continuous data updates.
Scaled up on demand during failure.

⏰ RTO: Minutes

💰 Cost: Medium

🧩 Example: An e-commerce store that can tolerate a few minutes of downtime.

🔥 Hot Standby

Fully provisioned clone, continuously updated and ready to take traffic.

⏰ RTO: Seconds

💰 Cost: High

🧩 Example: Critical financial or healthcare systems.
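
A sketch of the decision logic behind automatic failover to a hot standby: promote the standby after several consecutive failed health checks. FailoverMonitor, the node names, and the threshold of three are assumptions for illustration; real systems also need to fence the old primary so it cannot keep accepting writes.

```python
class FailoverMonitor:
    """Promotes the standby after N consecutive failed health checks."""

    def __init__(self, primary, standby, max_failures=3):
        self.active = primary
        self.standby = standby
        self.max_failures = max_failures
        self.failures = 0

    def record_health_check(self, ok: bool) -> str:
        """Feed in one health check result; returns the node taking traffic."""
        if ok:
            self.failures = 0  # any success resets the counter
        else:
            self.failures += 1
            if self.failures >= self.max_failures and self.standby is not None:
                # Promote the standby; there is no further standby after this.
                self.active, self.standby = self.standby, None
                self.failures = 0
        return self.active


monitor = FailoverMonitor("db-primary", "db-standby")
checks = [False, False, True, False, False, False]
history = [monitor.record_health_check(ok) for ok in checks]
```

Requiring consecutive failures avoids failing over on a single dropped health check, at the cost of a slightly longer recovery time.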

🌎 Multi‑Primary (Active‑Active)

Multiple regions serve traffic simultaneously.
Requires bidirectional replication and conflict handling.

✅ Fastest recovery and lowest latency

⚠️ Hardest to manage due to data conflicts

🧩 Example:

A global chat app: EU users connect to the EU data center and US users to the US one, while replication keeps both in sync.
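
One simple conflict-handling rule is last write wins, sketched below. This is only one option among several, chosen here because it is short; it silently discards the older of two concurrent writes and relies on reasonably synchronized clocks. The Versioned type and region names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # when the write happened
    region: str       # where the write happened


def merge(a: Versioned, b: Versioned) -> Versioned:
    """Last write wins; the region name breaks ties so every side agrees."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))


eu_write = Versioned("status: away", 100.0, "eu")
us_write = Versioned("status: online", 101.0, "us")

# Both regions run the same merge and converge on the same value.
winner_in_eu = merge(eu_write, us_write)
winner_in_us = merge(us_write, eu_write)
```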

🧭 Putting It All Together (A Growth Journey)

Stage	What You Add	Purpose
🚀 Early Start	Single server, vertical scaling	Simple and low-cost setup
⚙️ Growth Stage	Separate database, stateless app	Better reliability and maintainability
🌐 Scaling Stage	Load balancer with multiple servers	Handles more traffic
🗂️ Data Scaling	Caching, read replicas, sharding	Reduces load on the main database
🔁 Reliability	Failover mechanisms, automation	Increases uptime and resilience
⚡ Mature System	Multi-region deployment, global monitoring	Supports global traffic and quick recovery

🧩 Key Takeaways

🧠 System design = trade‑offs under constraints.
🌱 Start small, evolve realistically — don’t over‑engineer early on.
🏗️ Stateless design + separate databases unlock horizontal scaling.
📊 Database scaling = replicas + caching + sharding + pooling.
💪 Failover design ensures reliability during disasters.
📈 Evolve incrementally — track performance, failure rates, and cost.

DEV Community
