System Design

System Design study guide / interview prep for me

Content here is not original. It is a summary of the following sources:


API design


System Availability



Load Balancing

Database design

Leader Election

Endpoint protection

  • Use cases
    • protecting against DOS
  • rate-limiting

pub-sub Messaging / MQs

  • Benefits
  • Use case / purpose
  • pub/sub messaging
  • comparison to DB

References to getting started

reference: SO & stackprint

Standard Error Code

  • 400 Bad request
    • client-side input fails validation
  • 401 Unauthorized
    • user not authenticated
  • 403 forbidden
    • user authenticated but not privileged to access resources
  • 404 Not Found
    • resource not found
  • 500 Internal Server Error
    • generic server error. shouldn't be thrown explicitly.
  • 502 Bad gateway
    • invalid response from upstream server
  • 503 Service Unavailable
    • server is temporarily unable to handle the request (server overload, some parts failing)

reference: SO
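As a rough illustration of how the 4xx codes above relate to each other, here is a minimal sketch using Python's standard `http.HTTPStatus`. The function name `pick_status`, its boolean parameters, and the order of the checks are assumptions made for this example, not a rule from the sources.

```python
from http import HTTPStatus


def pick_status(authenticated: bool, authorized: bool,
                valid_input: bool, resource_exists: bool) -> HTTPStatus:
    """Hypothetical helper: choose a status code for one request.

    The order of checks (auth first, then validation, then lookup)
    is an assumption for illustration.
    """
    if not authenticated:
        return HTTPStatus.UNAUTHORIZED   # 401: user not authenticated
    if not authorized:
        return HTTPStatus.FORBIDDEN      # 403: authenticated but not privileged
    if not valid_input:
        return HTTPStatus.BAD_REQUEST    # 400: input failed validation
    if not resource_exists:
        return HTTPStatus.NOT_FOUND      # 404: resource not found
    return HTTPStatus.OK                 # 200: everything checked out
```

`HTTPStatus` members compare equal to their integer codes, so `pick_status(False, False, True, True) == 401` holds.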

Storage: Memory VS Disk

Disk is persistent (when power is off, data will "persist").
RAM is transient. When power is lost, memory is wiped.

If data is useful only during a server session, keep it in memory.

Reading and writing RAM is much faster, and cheaper per operation, than writing to a persistent database.

Latency VS Throughput

Latency: time taken for an action to complete / produce a result
Throughput: amount of work done per unit time; its upper limit is the maximum capacity of the system.

  • Latency:
    • load site fast & smoothly
    • fast lookups
    • avoid pinging distant servers.
  • Throughput
    • can improve by reducing bottleneck
    • system will only be as fast as its slowest bottleneck
    • increase by improving slowest bottleneck
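The bottleneck point can be sketched in a few lines. This assumes a simple model where every request passes through each stage in turn; the stage names and the requests-per-second figures are made up for illustration.

```python
def pipeline_throughput(stage_rates):
    """Overall throughput is capped by the slowest stage."""
    return min(stage_rates.values())


def bottleneck(stage_rates):
    """Name of the stage that limits the whole pipeline."""
    return min(stage_rates, key=stage_rates.get)


# Hypothetical capacities in requests per second.
stages = {"load_balancer": 5000, "app_server": 1200, "database": 300}
```

In this model, speeding up the load balancer or the app server changes nothing; only raising the database's capacity raises the overall throughput.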

Hardware definition from Comparitech:

Latency: time taken for a packet to be transferred across the network
Throughput: quantity of data being sent & received per unit time

Scaling: Vertical VS Horizontal

Vertical Scaling: adding compute (CPU) and memory / storage (RAM, disk, SSD) resources to a single computer.
Horizontal Scaling: adding more computers to a cluster

  • Vertical
    • fairly straightforward
    • overall capacity is much lower, capped by what a single machine can hold
  • Horizontal
    • higher overall compute and storage capacity
    • can be sized dynamically without downtime
    • the most popular DB model (relational) is hard to scale horizontally.

Availability: Intro

Availability is like the resiliency of a system. A fault-tolerant system makes for an available system.

Measured using uptime: the percentage of time the system's primary function is available in a given window of time

Service Level Agreement / Assurance
set of guaranteed service level metrics.
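The uptime metric is plain arithmetic, sketched below. The function names and the 30-day window are assumptions chosen for the example; real SLAs define their own measurement windows.

```python
def availability(uptime_seconds, total_seconds):
    """Fraction of the window in which the system was available."""
    return uptime_seconds / total_seconds


def allowed_downtime_minutes(target, window_days=30):
    """Downtime budget (in minutes) implied by an availability target."""
    window_minutes = window_days * 24 * 60
    return (1 - target) * window_minutes
```

For instance, a 99.9% target over a 30-day window leaves a budget of roughly 43 minutes of downtime.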

How to achieve High Availability

Reduce or eliminate single points of failure
This is done by designing redundancy into the system.

Eg. two or more servers to provide services.
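A minimal sketch of why redundancy helps, under the strong assumption that replicas fail independently of each other (which real deployments only approximate). The service is treated as down only when every replica is down at the same time.

```python
def combined_availability(single, replicas):
    """Availability of `replicas` redundant servers.

    Assumes independent failures: the service is unavailable only
    when all replicas are unavailable at once.
    """
    return 1 - (1 - single) ** replicas
```

With two servers that are each available 99% of the time, this model gives about 99.99% for the pair.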

Why Caching

Caching helps reduce latency of system.

  • app can perform faster
    • faster to retrieve data from memory than disk

Works best when

  1. storing static / infrequently changing data
  2. the source of change is a single operation rather than user-generated operations

If data consistency & freshness are critical, caching may not be optimal unless the cache is refreshed frequently (and refreshing the cache frequently may not be ideal either).

Caching: Use Case

  1. when using a certain piece of data often
  2. backend has computationally intensive work (a cached result can be looked up in O(1) instead of being recomputed)
  3. server makes multiple network requests & API calls
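Use case 2 can be sketched with the standard library's `functools.lru_cache`. The function `slow_square` and the counter `call_count` are stand-ins invented here for some genuinely expensive computation.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive work really runs


@lru_cache(maxsize=None)
def slow_square(n):
    """Stand-in for a computationally intensive backend call."""
    global call_count
    call_count += 1
    return n * n


first = slow_square(12)   # computed, then stored in the cache
second = slow_square(12)  # served from the cache, no recomputation
```

After both calls `call_count` is still 1, because the second call never reaches the function body.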


  • CDN (content delivery network)
    • caches content (images / videos/ webpages) in proxy server located closer to end user than origin server
    • can deliver content more quickly.

Caching Strategy

deals with the "write" operations of cache.

  1. keep cache & DB in sync
  2. should it be done synchronously or asynchronously
  3. data eviction - Which old records to "kick out" for new data?
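One common answer to the eviction question is least recently used (LRU): when the cache is full, kick out the entry that has gone unused the longest. Below is a minimal sketch built on `collections.OrderedDict`; the class name `LRUCache` and its `get` / `put` methods are choices made for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry
```

With a capacity of 2, putting `a` and `b`, reading `a`, then putting `c` evicts `b`, since `a` was touched more recently.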

Cache Eviction Policies

  • LIFO (last in, first out)
  • FIFO (first in, first out)
  • LRU (least recently used)
  • LFU (least frequently used)

Proxy: basics

proxy: basically a middleman between client & origin server.

There are 2 kinds of proxies

  1. forward
  2. reverse

Forward

  • computer makes requests to sites on the internet
  • proxy server intercepts the request
  • proxy communicates with the web server on behalf of the client

Reverse

  • client sends a request to the origin server of a website
  • request is intercepted at the network edge by the reverse proxy server
  • reverse proxy sends the request to & receives the response from the origin server

Forward VS Reverse Proxy

How proxies work

Forward: sits in front of the client, so no origin server communicates directly with a specific client. Server has no knowledge of the client.
Reverse: sits in front of the origin server, so no client communicates directly with the origin server. Client has no knowledge of the server.

  • Forward proxy

    • avoid company browsing restrictions
      • clients connect to the proxy rather than directly to the sites they are visiting
    • block access to certain content
      • eg. a school network configured to connect to the web via a proxy that blocks certain sites
    • privacy
      • IP address of the client is harder to trace, only the IP address of the proxy server is visible
  • Reverse proxy

    • load balancing
      • distribute incoming traffic among different origin servers
    • protection from attacks
      • web site / service never needs to reveal the IP address of its origin server
      • harder to launch a targeted attack against the origin server (eg. DDoS). Attackers can only target the reverse proxy, which will have tighter security & more resources.
    • Global Server Load Balancing
      • website distributed on several servers around the globe
      • reverse proxy sends clients to the server geographically closest to them
    • caching
      • reverse proxy can cache content -> faster performance
    • SSL encryption
      • reverse proxy can decrypt all incoming requests & encrypt all outgoing responses
      • frees up valuable resources on the origin server

Reference: cloudflare

Why Load balancing

Maintain availability & throughput
By distributing incoming request loads across multiple servers

When a server gets lots of requests it can:

  • slow down (throughput reduces, more latency)
  • fail (no availability)

Load balancing helps by distributing incoming traffic among origin servers

Load Balancing Strategies

Probability server selection

  • Round Robin
    • loop through servers in fixed sequence
  • Weighted Round Robin
    • assign different weights / probabilities to each server.
    • traffic split up according to weights
  • Load-based server selection
    • monitor current capacity / performance of servers
    • send request to server with highest throughput / lowest latency.
  • IP Hashing based selection
    • hash IP address of incoming request
    • use hash value to allocate server
  • Path / Service based selection
    • route requests based on path / service provided.
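The selection strategies above can be sketched as small functions. This is a simplified illustration, not how any particular load balancer is implemented: the function names, server names and weights are all hypothetical, and the weighted variant simply repeats each server according to its integer weight.

```python
import hashlib
import itertools


def round_robin(servers):
    """Loop through servers in a fixed sequence, forever."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Repeat each server `weight` times per cycle (integer weights)."""
    expanded = [s for s, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


def ip_hash(client_ip, servers):
    """Map a client IP to a server via a stable hash of the address."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def least_latency(latencies_ms):
    """Load-based selection: pick the server with the lowest latency."""
    return min(latencies_ms, key=latencies_ms.get)
```

`hashlib` is used instead of the built-in `hash()` because Python randomizes string hashes per process, which would send the same client to different servers after a restart.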

CAP Theorem

any distributed DB can only satisfy 2 of the following 3 properties

  • Consistency: every node responds with the most recent version of the data
  • Availability: every request to a working node gets a response (though not necessarily the most recent data)
  • Partition tolerance: system continues working even if communication between any of the nodes is broken

A DB is usually a CP or AP database. Since the stability of the network cannot be guaranteed, P is non-negotiable.

Relational Database

A database using a relational data model organizes data in tables, with rows of data entries and columns of predetermined data types.

Use when

  • many-to-many relationships between entries
  • data needs to follow a predetermined schema
  • consistent transactions are important
  • relationships between data always need to be accurate

ACID properties

  • Atomicity
    • a transaction is atomic / the smallest unit of work.
    • all instructions in a transaction execute, or none at all
  • Consistency
    • if the DB is initially consistent, it should remain consistent after every transaction
    • eg. a write operation (transfer money from A to B) fails & the transaction is not rolled back. The DB is inconsistent because the total amount of money held by A & B (A+B) is not equal before & after the transaction
  • Isolation
    • multiple transactions running concurrently should not be affected by each other
    • the result should be the same as if the transactions had run sequentially.
  • Durability
    • changes committed to the DB should remain even in case of software & system failure.
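The money-transfer example can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the helper names `make_demo_db` and `transfer`, and the balances are invented for this illustration. Using the connection as a context manager commits on success and rolls back if an exception is raised, which is what keeps the transfer atomic.

```python
import sqlite3


def make_demo_db():
    """In-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
    conn.commit()
    return conn


def balance(conn, name):
    row = conn.execute("SELECT balance FROM accounts WHERE name = ?", (name,)).fetchone()
    return row[0]


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    with conn:  # commit on success, roll back on any exception
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        if balance(conn, src) < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
```

If the transfer fails partway through, the debit from the source account is undone, so A+B is the same before and after.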

Reference educative & tutorialspoint

Non-relational Database

also known as NoSQL database.
At its core, the DB holds data in a hash-table-like structure.
Use cases: caching, environment variables, configuration files / session state
Use environment: in memory & persistent storage

Since the structure is like a hash table, there is minimal overhead

  • extremely fast
  • simple & easy to use

NoSQL BASE properties

Basically available: system guarantees availability
Soft state: state of the system may change over time, even without input
Eventual consistency: system will become consistent over time, provided no new inputs are received in the meantime

Types of Nosql DB

  • Graph database
    • many-to-many relationships
    • fast at following graph edges
  • Document Store
    • isolated documents, retrieve by a key
    • documents with different schemas that are easy to update
    • easy to scale
  • Key-value store
    • like a very large hashtable
    • opaque values (DB has no notion of what is stored in value only provides read, overwrite and delete operations)
    • simple operations (no schemas, joins or indices)
    • minimal overhead - easy to scale
    • suitable for caching implementations
  • column-family DB
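The key-value store described above can be sketched as a thin wrapper around a dictionary. The class name `KeyValueStore` is hypothetical, and a real store would add persistence, replication and so on. The point is that values are opaque bytes: the store only reads, overwrites and deletes them, and never looks inside.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store with opaque byte values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store or overwrite the bytes held under `key`."""
        self._data[key] = value

    def get(self, key):
        """Return the stored bytes, or None if the key is absent."""
        return self._data.get(key)

    def delete(self, key):
        """Remove the key if present; deleting a missing key is a no-op."""
        self._data.pop(key, None)


store = KeyValueStore()
# The caller decides the encoding; the store never parses the value.
store.put("session:42", json.dumps({"user": "alice"}).encode("utf-8"))
```

Because there are no schemas, joins or indices, every operation is a single hash lookup, which is why this model suits caching.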

NoSQL VS Relational

  • NoSQL

    • dynamic schema
      • devs can use "unstructured data" & build an application without defining a schema
    • scaling
      • scales horizontally over cheap commodity servers
    • simple operations
      • data retrieval is simple
    • cheap hardware
      • app deployed to commodity hardware like public clouds
  • SQL

    • workload volume is consistent
    • ACID guarantees required
    • data is predictable & highly structured
    • data is best expressed relationally
    • write safety required
    • app deployed to large high-end hardware

Leader Election: Basics

When a system scales horizontally, some tasks require precise coordination between nodes. In that case a leader node directs the follower nodes.

Leader Election algo

How a cluster of nodes without a leader communicate with each other & choose one to be leader.

algo executed when cluster starts or when leader node goes down.
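A heavily simplified sketch of the outcome of one election round, loosely in the spirit of the bully algorithm, where the live node with the highest ID wins. It leaves out all the messaging and failure detection a real algorithm needs; the node IDs and the `elect_leader` helper are hypothetical.

```python
def elect_leader(nodes):
    """Pick the highest-ID node that is still alive.

    `nodes` maps node ID -> True if the node is reachable.
    """
    alive = [node_id for node_id, is_alive in nodes.items() if is_alive]
    if not alive:
        raise RuntimeError("no live nodes to elect")
    return max(alive)


cluster = {1: True, 2: True, 3: True}
leader = elect_leader(cluster)      # election when the cluster starts

cluster[leader] = False             # the leader node goes down
new_leader = elect_leader(cluster)  # election runs again
```

Here node 3 leads first, and node 2 takes over once node 3 is marked as down.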

Use Case

  • Any node can become leader, so coordinating the system does not depend on one fixed node (no single point of failure)
  • System doing complex work that needs good coordination
    • eg. computing how a protein folds. The cluster needs a leader node to assign each node a different part of the work, then add the results together
  • System executes many distributed writes to data & requires strong consistency
    • no matter which node handles the request, the user will always have the most up-to-date version of the data.
    • leader creates consistency by being the source of truth on what the most recent state of the system is.

Drawbacks

  • split brain
    • bad implementation -> 2 nodes controlling the system
  • the leader is itself a single point of failure / bottleneck
  • if the leader starts making bad decisions, the entire system will follow
Reference: iGotAnOffer

MQ Benefits

  • Resilience
    • app-specific faults won't impact the system
    • if one component fails, all others can continue interacting with queue, processing / producing messages.
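The resilience point can be sketched with a tiny in-memory queue. The `MessageQueue` class and `drain` helper are made up for this example; real message queues add persistence, acknowledgements and more. The producer keeps publishing while the consumer is down, and the consumer catches up once it is back.

```python
from collections import deque


class MessageQueue:
    """Toy FIFO queue sitting between a producer and a consumer."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        """Return the oldest message, or None when the queue is empty."""
        return self._messages.popleft() if self._messages else None

    def __len__(self):
        return len(self._messages)


def drain(queue, handler):
    """Process every pending message with `handler`, oldest first."""
    results = []
    message = queue.consume()
    while message is not None:
        results.append(handler(message))
        message = queue.consume()
    return results


queue = MessageQueue()
# The consumer is "down": the producer is unaffected and keeps publishing.
for order in ["order-1", "order-2", "order-3"]:
    queue.publish(order)
pending = len(queue)

# The consumer recovers and works through the backlog.
processed = drain(queue, str.upper)
```

No message is lost while the consumer is unavailable; the three pending orders are processed in their original order once it resumes.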

Reference: IBM
