A comprehensive guide for developers, architects, and system design interview candidates
Introduction
Choosing the right database is one of the most critical decisions when designing systems. It determines performance, scalability, reliability, and development and maintenance costs. In this guide, we'll examine all major database types, their characteristics, when to use them, when to avoid them, and how to justify your choice correctly in a system design interview.
Part 1: Fundamental Division — Relational vs Non-Relational Databases
Relational Databases (SQL / RDBMS)
What it is: Databases that store data as tables (rows and columns) with relationships between them through foreign keys.
Key characteristics:
- Structured data in tables
- Strict, predefined schema
- SQL query language
- ACID transactions
- Vertical scaling (usually)
Popular examples:
- PostgreSQL — powerful open-source DBMS with advanced features
- MySQL — most popular in web development
- Oracle Database — enterprise-level
- Microsoft SQL Server — .NET ecosystem integration
- SQLite — embedded database for small applications
When to Use Relational Databases
| Scenario | Justification |
|---|---|
| Structured data with clear relationships | E-commerce: Users → Orders → Products → Reviews |
| Strict data consistency | Banking operations, financial transactions |
| Complex queries with JOINs | Analytical reports, business intelligence |
| ACID transactions are critical | Reservation systems, inventory management (see the sketch below) |
| Medium data volumes | Up to several terabytes on a single server |
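To make the ACID row above concrete, here is a minimal sketch of an atomic transfer with psycopg2; the accounts table, ids, and amounts are hypothetical:
import psycopg2

conn = psycopg2.connect("dbname=mydb")
try:
    with conn:  # psycopg2 commits on success and rolls back on any exception
        with conn.cursor() as cur:
            # Both updates succeed together or not at all
            cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
            cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))
finally:
    conn.close()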
When NOT to Use
- ❌ Unstructured or frequently changing data
- ❌ Need for horizontal scaling across thousands of servers
- ❌ Extremely demanding write-latency or write-throughput requirements
- ❌ Schema constantly evolves
- ❌ Working with graphs or hierarchical structures
Pros and Cons
Pros:
- ✅ Data integrity through ACID
- ✅ Powerful SQL query language
- ✅ Mature ecosystem of tools
- ✅ Support for complex transactions
- ✅ Excellent for analytics
Cons:
- ❌ Complexity of horizontal scaling
- ❌ Rigid schema
- ❌ Vertical scaling is expensive
- ❌ Performance issues with huge data volumes
- ❌ JOINs can be slow on large volumes
Non-Relational Databases (NoSQL)
What it is: Databases that don't use tabular structure and don't require fixed schema.
Key characteristics:
- Flexible or absent schema
- Horizontal scaling
- Trade-offs framed by the CAP theorem (Consistency, Availability, Partition tolerance)
- Eventual consistency (usually)
- Various data models (document, key-value, graph, wide-column)
When to Use NoSQL:
- ✅ Unstructured or semi-structured data
- ✅ Need for horizontal scaling
- ✅ Very high write throughput
- ✅ Flexible, evolving schema
- ✅ Large data volumes (big data)
Part 2: Types of NoSQL Databases
2.1 Document Databases (Document-Oriented)
What it is: Store data as documents (usually JSON, BSON, XML).
Popular examples:
- MongoDB — most popular document database
- CouchDB — with focus on offline-first
- Amazon DocumentDB — MongoDB-compatible managed service
Data model:
{
  "_id": "507f1f77bcf86cd799439011",
  "name": "John Doe",
  "email": "john@example.com",
  "orders": [
    {
      "order_id": "ORD123",
      "total": 99.99,
      "items": [...]
    }
  ]
}
When to Use
| Scenario | Example |
|---|---|
| Hierarchical data | Product catalogs with nested categories |
| Heterogeneous record structure | CMS, blogs (different content types) |
| Fast development and prototyping | MVP, startups |
| Aggregations and full-text search | Logs, event analytics |
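A minimal pymongo sketch of the flexible-schema point above, assuming a local MongoDB; the database and collection names are hypothetical. Documents of different shapes coexist in one collection, and nested fields are queried with dot notation:
from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017").mydb.users

# Two documents with different shapes in the same collection
users.insert_one({"name": "John Doe", "orders": [{"order_id": "ORD123", "total": 99.99}]})
users.insert_one({"name": "Jane Roe", "newsletter": True})  # no orders field at all

# Query a nested field with dot notation
for doc in users.find({"orders.order_id": "ORD123"}):
    print(doc["name"])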
Pros and Cons
Pros:
- ✅ Flexible schema
- ✅ Natural object representation (JSON)
- ✅ Good performance for reading nested data
- ✅ Horizontal scaling (sharding)
Cons:
- ❌ Complex JOINs (data denormalization)
- ❌ Eventual consistency (in distributed setup)
- ❌ Limited multi-document transactions (historically; MongoDB added them in version 4.0)
- ❌ Data duplication with denormalization
Pitfalls:
- Easy to create schema chaos with inconsistent structures
- Denormalization can lead to database bloat
- Indexes need to be planned in advance for performance
Optimization:
// ❌ Bad: Deep nesting
{
  user: {
    profile: {
      address: {
        city: {
          name: "Moscow"
        }
      }
    }
  }
}

// ✅ Good: Flat structure with indexes
{
  user_id: "123",
  city_name: "Moscow",  // indexed
  full_address: {...}
}
2.2 Key-Value Stores
What it is: Simplest model — key → value. Like a giant HashMap.
Popular examples:
- Redis — in-memory, lightning-fast
- Amazon DynamoDB — managed, auto-scaling
- Memcached — simple cache
- Riak KV — distributed, fault-tolerant
Data model:
Key: "user:123:session"
Value: "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
Key: "product:456:stock"
Value: "42"
When to Use
| Scenario | Example |
|---|---|
| Caching | Page cache, API responses |
| Session management | Web sessions, JWT tokens |
| Queues and pub/sub | Redis Streams, background jobs |
| Counters and real-time statistics | Likes, views, ratings |
| Rate limiting | API throttling (see the sketch below) |
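A minimal fixed-window rate limiter with redis-py, assuming a local Redis; the key format and the 100-requests-per-minute limit are arbitrary choices:
import redis

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    key = f"rate:{user_id}"
    count = r.incr(key)                # atomic per-user counter
    if count == 1:
        r.expire(key, window_seconds)  # start the window on the first request
    return count <= limit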
Pros and Cons
Pros:
- ✅ Maximum performance (O(1) access)
- ✅ Simple to use
- ✅ Ideal for caching
- ✅ Horizontal scaling (consistent hashing)
Cons:
- ❌ No complex queries
- ❌ No relationships between data
- ❌ Limited filtering capabilities
- ❌ In-memory solutions require lots of RAM
Pitfalls:
- Redis persistence: Need to configure RDB/AOF for data persistence
- Memory limits: Plan eviction policies
- Single-threaded: Redis doesn't use multi-core without clustering
Optimization:
# ❌ Bad: Many small requests
for user_id in user_ids:
    redis.get(f"user:{user_id}")

# ✅ Good: Batch operations
redis.mget([f"user:{uid}" for uid in user_ids])

# ✅ Use pipelining
pipe = redis.pipeline()
for user_id in user_ids:
    pipe.get(f"user:{user_id}")
results = pipe.execute()
2.3 Wide-Column Stores (Columnar)
What it is: Data is organized into column families within partitions, so each row can have its own set of columns. Optimized for writing and reading huge volumes across many nodes.
Popular examples:
- Apache Cassandra — distributed, high availability
- Google Bigtable — managed from Google
- HBase — built on Hadoop
- ScyllaDB — Cassandra-compatible, but faster
Data model:
Row Key: user_123
  Column Family: profile
    name: "John Doe"
    email: "john@example.com"
  Column Family: activity
    last_login: "2024-01-15"
    page_views: 1523
When to Use
| Scenario | Example |
|---|---|
| Time-series data | IoT sensors, monitoring metrics |
| Huge write workloads | Event logging, analytics |
| Geographically distributed data | Global applications |
| Recommendation systems | Content personalization |
Pros and Cons
Pros:
- ✅ Linear scalability (add nodes)
- ✅ High availability (no single point of failure)
- ✅ Excellent for write-heavy workloads
- ✅ Flexible schema (dynamic columns)
Cons:
- ❌ Eventual consistency
- ❌ Complex data modeling
- ❌ No JOINs and complex transactions
- ❌ Complexity of cluster administration
Pitfalls:
- Partition key choice: Wrong choice kills performance
- Hotspots: Some partitions can be overloaded
- Tombstones: Deleted data creates "tombstones" that slow reads
Optimization:
-- ❌ Bad: Using single value as partition key
CREATE TABLE events (
user_id text,
event_time timestamp,
event_data text,
PRIMARY KEY (user_id, event_time) -- user_id same for all events!
);
-- ✅ Good: Combined key for distribution
CREATE TABLE events (
user_id text,
bucket int, -- e.g., hash of event_time
event_time timestamp,
event_data text,
PRIMARY KEY ((user_id, bucket), event_time)
);
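A sketch of how application code might write into the bucketed table above, assuming the DataStax cassandra-driver package and an existing keyspace; using one bucket per day is just one possible scheme:
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

def write_event(user_id: str, event_data: str) -> None:
    now = datetime.now(timezone.utc)
    bucket = int(now.strftime("%Y%m%d"))  # one partition per user per day
    session.execute(
        "INSERT INTO events (user_id, bucket, event_time, event_data) "
        "VALUES (%s, %s, %s, %s)",
        (user_id, bucket, now, event_data),
    )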
2.4 Graph Databases
What it is: Optimized for storing and traversing graphs (nodes and edges).
Popular examples:
- Neo4j — market leader in graph databases
- Amazon Neptune — managed graph database
- ArangoDB — multi-model (documents + graphs)
- JanusGraph — distributed graph database
Data model:
// Nodes and relationships
(alice:Person {name: "Alice"})-[:FOLLOWS]->(bob:Person {name: "Bob"})
(alice)-[:LIKES]->(post:Post {title: "Graph DBs are cool"})
(bob)-[:COMMENTED_ON]->(post)
When to Use
| Scenario | Example |
|---|---|
| Social networks | Friend connections, followers, recommendations |
| Fraud detection | Detecting fraudulent schemes |
| Knowledge graphs | Wikipedia, recommendation systems |
| Routing and navigation | GPS, logistics |
| Identity and Access Management | Access rights management |
Pros and Cons
Pros:
- ✅ Natural relationship modeling
- ✅ Fast graph traversal (replaces deep JOINs)
- ✅ Flexible relationship schema
- ✅ Powerful query language (Cypher, Gremlin)
Cons:
- ❌ Not optimal for tabular data
- ❌ Scaling complexity (distributed graphs are hard)
- ❌ Less mature ecosystem
- ❌ Specific use case
Pitfalls:
- Traversal depth: Overly deep traversals can be slow
- Indexing: Need to index node properties for fast search
- Sharding: Splitting graphs across servers is very complex
Optimization:
// ❌ Bad: Unlimited depth
MATCH (user:User {id: 123})-[:FOLLOWS*]->(friend)
RETURN friend
// ✅ Good: Limited depth
MATCH (user:User {id: 123})-[:FOLLOWS*1..3]->(friend)
RETURN friend
LIMIT 100
// ✅ Use indexes
CREATE INDEX ON :User(id)
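The same depth-limited traversal can be run from application code; a minimal sketch assuming the official neo4j Python driver and local credentials:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def friends_of_friends(user_id: int) -> list[str]:
    query = (
        "MATCH (u:User {id: $id})-[:FOLLOWS*1..3]->(friend) "
        "RETURN DISTINCT friend.name AS name LIMIT 100"
    )
    with driver.session() as session:
        return [record["name"] for record in session.run(query, id=user_id)]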
2.5 Time-Series Databases
What it is: Optimized for data with timestamps (metrics, events, logs).
Popular examples:
- InfluxDB — popular open-source TSDB
- TimescaleDB — PostgreSQL extension for time-series
- Prometheus — monitoring and alerting
- Amazon Timestream — managed time-series
Data model:
Timestamp: 2024-01-15T10:30:00Z
Metric: cpu_usage
Tags: {server: "web-01", region: "us-east-1"}
Value: 78.5
When to Use
| Scenario | Example |
|---|---|
| Infrastructure monitoring | CPU, memory, disk usage |
| IoT and sensors | Temperature, pressure, geolocation |
| Financial data | Stock prices, exchange rates |
| Application analytics | APM, performance metrics |
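A sketch of writing the data point from the model above, assuming the influxdb-client Python package (InfluxDB 2.x); the URL, token, org, and bucket are placeholders:
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

point = (
    Point("cpu_usage")
    .tag("server", "web-01")
    .tag("region", "us-east-1")
    .field("value", 78.5)
)
write_api.write(bucket="metrics", record=point)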
Pros and Cons
Pros:
- ✅ Optimized for time-based queries
- ✅ Efficient data compression
- ✅ Automatic aggregation
- ✅ Retention policies (auto-delete old data)
Cons:
- ❌ Specialized only for time-series
- ❌ Updating old data is complex/impossible
- ❌ Not suitable for transactional data
Optimization:
-- ✅ Use downsampling for old data (TimescaleDB syntax)
SELECT time_bucket('1 hour', time) AS hour,
avg(cpu_usage) as avg_cpu
FROM metrics
WHERE time > now() - interval '30 days'
GROUP BY hour;
-- ✅ Configure retention policies (InfluxQL, InfluxDB 1.x syntax)
CREATE RETENTION POLICY "one_year" ON "mydb"
DURATION 52w REPLICATION 1 DEFAULT;
Part 3: Specialized Database Types
3.1 Search Engines (Search Platforms)
Examples:
- Elasticsearch — full-text search and analytics
- Apache Solr — enterprise search
- Meilisearch — fast, easy to use
When to use:
- Full-text search
- Faceted search (category filters)
- Autocomplete and typo-tolerance
- Log processing and analysis (ELK stack)
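A minimal indexing and search sketch, assuming the official elasticsearch Python client (8.x style API) and a local node; the articles index is hypothetical:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# refresh="wait_for" makes the document immediately searchable in this demo
es.index(index="articles", id="1", refresh="wait_for", document={
    "title": "Choosing a database",
    "body": "Full-text search with facets and typo tolerance",
})

# fuzziness gives basic typo tolerance
resp = es.search(index="articles",
                 query={"match": {"body": {"query": "full text serch", "fuzziness": "AUTO"}}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["title"])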
3.2 Vector Databases
Examples:
- Pinecone — managed vector database
- Weaviate — AI-native vector search
- Qdrant — open-source vector search
- Milvus — scalable vector database
When to use:
- Semantic search (meaning-based search)
- Recommendation systems
- AI/ML applications
- Similarity search (similar images/text)
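Under the hood these systems rank stored embedding vectors by similarity to a query vector. A conceptual NumPy sketch of cosine-similarity search; real vector databases add approximate indexes (e.g. HNSW) so they do not scan every vector:
import numpy as np

def cosine_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    # Normalize so that a dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    return np.argsort(-scores)[:k]  # indices of the k most similar vectors

embeddings = np.random.rand(10_000, 384)  # e.g. precomputed text embeddings
query_vec = np.random.rand(384)
print(cosine_top_k(query_vec, embeddings))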
3.3 NewSQL Databases
Examples:
- Google Spanner — globally distributed SQL database
- CockroachDB — distributed SQL with ACID
- TiDB — MySQL-compatible distributed database
When to use:
- Need ACID transactions + horizontal scaling
- Global distribution with low latency
- Migration from traditional SQL with scalability requirements
Part 4: CAP Theorem and Database Selection
CAP Theorem
A distributed system cannot guarantee all three properties simultaneously:
- Consistency — all nodes see identical data
- Availability — every request receives a response
- Partition tolerance — system works during network failures
Can only choose 2 of 3:
| Type | Priority | Examples | Use case |
|---|---|---|---|
| CP | Consistency + Partition tolerance | MongoDB, HBase, Redis | Finance, inventory |
| AP | Availability + Partition tolerance | Cassandra, DynamoDB, CouchDB | Social media, IoT |
| CA | Consistency + Availability | PostgreSQL, MySQL (single node) | Traditional applications |
Practical selection:
Need strict consistency (bank, exchange)?
→ CP system (PostgreSQL, MongoDB with strong consistency)
Need maximum availability (social media, IoT)?
→ AP system (Cassandra, DynamoDB)
Can live with eventual consistency?
→ Use AP and gain scalability
Part 5: Best Practices and Optimization
Indexing
Relational Databases:
-- ✅ Index frequently used WHERE columns
CREATE INDEX idx_user_email ON users(email);
-- ✅ Composite indexes for compound queries
CREATE INDEX idx_order_user_date ON orders(user_id, created_at DESC);
-- ❌ Avoid indexes on low-cardinality columns
-- CREATE INDEX idx_user_active ON users(is_active); -- bad, only 2 values
-- ✅ Partial indexes for specific queries
CREATE INDEX idx_active_users ON users(email) WHERE is_active = true;
MongoDB:
// ✅ Compound indexes
db.products.createIndex({ category: 1, price: -1 })
// ✅ Text indexes for full-text search
db.articles.createIndex({ title: "text", content: "text" })
// ❌ Avoid too many indexes (slow writes)
Normalization vs Denormalization
Relational Databases — Normalization:
-- ✅ 3NF (Third Normal Form)
CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE orders (id SERIAL PRIMARY KEY, user_id INT REFERENCES users(id), total NUMERIC);
CREATE TABLE order_items (id SERIAL PRIMARY KEY, order_id INT REFERENCES orders(id), product_id INT, quantity INT);
-- Pros: no duplication, easy updates
-- Cons: JOINs needed for complete data
NoSQL — Denormalization:
// ✅ Embedded data for fast reads
{
  "order_id": "123",
  "user": {
    "id": "456",
    "name": "John Doe",
    "email": "john@example.com"
  },
  "items": [
    {"product_id": "789", "name": "Laptop", "price": 999}
  ]
}
// Pros: single read for all data
// Cons: duplication, complex updates
Partitioning (Sharding)
Sharding strategies:
- Range-based sharding:
Shard 1: user_id 1-1000000
Shard 2: user_id 1000001-2000000
Shard 3: user_id 2000001-3000000
✅ Simple implementation
❌ Risk of hotspots (uneven distribution)
- Hash-based sharding:
shard = hash(user_id) % num_shards
✅ Even distribution
❌ Hard to add new shards (resharding moves most keys; see the sketch below)
- Geo-based sharding:
Shard US-EAST: users from USA
Shard EU-WEST: users from Europe
Shard ASIA: users from Asia
✅ Low latency for regional queries
❌ Uneven load across regions
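A minimal sketch of hash-based routing; Python's built-in hash() is randomized between processes, so a stable digest such as MD5 is used instead. The shard count and key format are arbitrary:
import hashlib

NUM_SHARDS = 8

def shard_for(user_id: str) -> int:
    # Stable across processes and machines, unlike the built-in hash()
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user:123"))  # the same user always routes to the same shard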
Caching
Caching strategies:
- Cache-Aside (Lazy Loading):
def get_user(user_id):
    # Try to get from cache
    user = cache.get(f"user:{user_id}")
    if user:
        return user
    # Not in cache — read from database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    # Save to cache
    cache.set(f"user:{user_id}", user, ttl=3600)
    return user
- Write-Through:
def update_user(user_id, data):
    # First update database
    db.update("users", user_id, data)
    # Then update cache
    cache.set(f"user:{user_id}", data, ttl=3600)
- Write-Behind (Write-Back):
def update_user(user_id, data):
    # First update cache
    cache.set(f"user:{user_id}", data)
    # Asynchronously write to database
    queue.enqueue(lambda: db.update("users", user_id, data))
Connection Pooling
# ❌ Bad: creating a new connection each time
import psycopg2

def get_data():
    conn = psycopg2.connect("dbname=mydb")
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users")
    conn.close()

# ✅ Good: connection pool
from psycopg2 import pool

connection_pool = pool.SimpleConnectionPool(
    minconn=5,
    maxconn=20,
    dbname="mydb"
)

def get_data():
    conn = connection_pool.getconn()
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users")
        return cursor.fetchall()
    finally:
        connection_pool.putconn(conn)
Part 6: Database Selection Guide for System Design
Database Selection Flowchart
1. Data structure?
   ├─ Structured, relationships matter
   │  └─ Need complex transactions?
   │     ├─ Yes → PostgreSQL/MySQL
   │     └─ No → Evaluate scale
   └─ Unstructured/semi-structured
      ├─ Documents/JSON?
      │  └─ Yes → MongoDB
      ├─ Graphs and relationships?
      │  └─ Yes → Neo4j
      └─ Simple key-value?
         └─ Yes → Redis/DynamoDB

2. Scale?
   ├─ < 1TB → Relational database
   └─ > 1TB, need horizontal scale
      ├─ Write-heavy → Cassandra
      └─ Read-heavy → MongoDB (with replication)

3. Consistency?
   ├─ Strict (finance) → PostgreSQL, MySQL
   └─ Eventual (social media) → Cassandra, DynamoDB

4. Latency requirements?
   ├─ < 1ms → Redis (in-memory)
   ├─ < 10ms → DynamoDB, Cassandra
   └─ < 100ms → PostgreSQL, MongoDB
Database Selection Matrix by Use Case
| Use Case | Recommendation | Alternatives |
|---|---|---|
| E-commerce catalog | MongoDB | PostgreSQL (JSON), DynamoDB |
| Payment system | PostgreSQL | MySQL, CockroachDB |
| Social network | Neo4j (graphs) + Cassandra (posts) | MongoDB, DynamoDB |
| Chat/messenger | Cassandra | MongoDB, DynamoDB |
| Analytics/Dashboards | ClickHouse, TimescaleDB | Elasticsearch |
| Caching | Redis | Memcached |
| Logging | Elasticsearch | ClickHouse, TimescaleDB |
| IoT/Time-series | InfluxDB, TimescaleDB | Cassandra |
| Content search | Elasticsearch | Solr, Meilisearch |
| ML/AI vector search | Pinecone, Weaviate | Milvus, Qdrant |
Typical Database Combinations in Single System
Twitter-like application:
┌─────────────────────────┐
│      User Profiles      │
│       PostgreSQL        │  ← Structured data, relationships
└─────────────────────────┘
┌─────────────────────────┐
│      Tweets/Posts       │
│        Cassandra        │  ← High write throughput, scale
└─────────────────────────┘
┌─────────────────────────┐
│      Social Graph       │
│          Neo4j          │  ← Subscriptions, recommendations
└─────────────────────────┘
┌─────────────────────────┐
│     Trending Topics     │
│          Redis          │  ← Real-time counters
└─────────────────────────┘
┌─────────────────────────┐
│         Search          │
│      Elasticsearch      │  ← Full-text search
└─────────────────────────┘
E-commerce platform:
PostgreSQL → Orders, Payments, Inventory (ACID critical)
MongoDB → Product Catalog (flexible schema)
Redis → Session, Cart, Cache
Elasticsearch → Product Search
Neo4j → Recommendations (bought together)
Part 7: Interview Questions and Answers
Frequent Questions
Q: Why not use PostgreSQL for everything?
A: PostgreSQL is great, but:
- Vertical scaling is expensive
- Not optimal for graphs (recommendations, social connections)
- Not ideal choice for full-text search
- Fixed schema — bad for rapidly changing data
Q: When to use NoSQL instead of SQL?
A: NoSQL when:
- Flexible schema is critical
- Horizontal scaling is mandatory
- Eventual consistency is acceptable
- Data is unstructured or semi-structured
Q: How to ensure consistency in distributed system?
A: Strategies:
- Distributed transactions (2PC, Saga pattern; see the sketch below)
- Event sourcing (log of all changes)
- CQRS (Command Query Responsibility Segregation)
- Eventual consistency with compensating transactions
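A conceptual sketch of the Saga pattern from the list above: every step is paired with a compensating action, and on failure the already-completed steps are undone in reverse order. The business functions named in the usage comment are purely illustrative:
def run_saga(steps):
    """steps: list of (action, compensation) pairs."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            # Undo everything that already succeeded, newest first
            for undo in reversed(completed):
                undo()
            raise

# Usage (illustrative):
#   run_saga([(reserve_inventory, release_inventory),
#             (charge_payment, refund_payment),
#             (create_shipment, cancel_shipment)])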
Q: How to choose partition key in Cassandra?
A: Good partition key:
- Even data distribution
- High cardinality
- Matches access patterns
- Example: user_id + timestamp_bucket instead of just user_id
Red Flags in Interviews
❌ Wrong:
- "I'll choose MongoDB because it's NoSQL and fast"
- "We use MySQL because I'm familiar with it"
- "NoSQL doesn't support JOINs" (false for many)
- "SQL databases don't scale" (false, just harder)
✅ Correct:
- "I'll use PostgreSQL for transactional data due to ACID guarantees, and Cassandra for time-series metrics due to high write throughput"
- "I'll consider trade-offs: PostgreSQL provides consistency, Cassandra provides availability during partition"
- "For MVP we start with PostgreSQL, then add specialized databases if needed"
Part 8: Advanced Topics
Replication
Master-Slave (Primary-Replica):
┌──────────┐   writes   ┌──────────┐
│  Client  │ ─────────> │  Master  │
└──────────┘            └──────────┘
                             │
                     async replication
                             │
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
      ┌──────────┐      ┌──────────┐      ┌──────────┐
      │ Replica1 │      │ Replica2 │      │ Replica3 │
      └──────────┘      └──────────┘      └──────────┘
           ▲                 ▲                 ▲
           └─────── reads ───┴─────────────────┘
Multi-Master:
┌──────────┐     ┌──────────┐     ┌──────────┐
│ Master 1 │ ◄─► │ Master 2 │ ◄─► │ Master 3 │
└──────────┘     └──────────┘     └──────────┘
     ▲                ▲                ▲
     └──────── writes anywhere ───────┘
Consistency Levels (Cassandra example)
ONE: Fast, but may return stale data
QUORUM: Majority of replicas (N/2 + 1) — balance
ALL: All replicas — slow, maximum consistency
Formula for strong consistency:
W + R > N
W = write consistency level
R = read consistency level
N = replication factor
Example: N=3, W=2, R=2 → 2+2=4 > 3 ✓ (strong consistency)
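The same rule as a tiny helper for sanity-checking replication settings:
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    # Read and write quorums must overlap in at least one replica
    return w + r > n

print(is_strongly_consistent(3, 2, 2))  # True  (QUORUM writes + QUORUM reads)
print(is_strongly_consistent(3, 1, 1))  # False (ONE + ONE can miss the latest write)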
Database Migration Strategies
Blue-Green Deployment:
- Old database (Blue) is running
- New database (Green) is configured
- Replicate data Blue → Green
- Switch traffic to Green
- Rollback to Blue if issues
Dual Writes:
def write_data(data):
    # Write to both databases
    old_db.write(data)
    new_db.write(data)
    # Read from old database (gradually switch)
    return old_db.read(data)
Conclusion
Key Takeaways
- No universal database — choice depends on use case
- CAP theorem — can't have everything simultaneously
- Measure and test — don't rely on assumptions
- Start simple — can migrate later
- Polyglot persistence — use multiple databases in one system
Interview Checklist for System Design
When selecting a database, justify:
- [ ] Data structure (structured/unstructured)
- [ ] Access patterns (read-heavy/write-heavy)
- [ ] Consistency requirements (ACID vs eventual)
- [ ] Scale (data volume, QPS)
- [ ] Latency requirements
- [ ] CAP trade-offs
- [ ] Indexing and optimization
- [ ] Sharding strategy (if needed)
- [ ] Replication and failover
- [ ] Cost (managed vs self-hosted)
Useful Resources:
- Database of Databases — catalog of all databases
- System Design Primer
- Use The Index, Luke — guide to indexes
- Designing Data-Intensive Applications — must-read book
Practice:
- LeetCode Database Problems
- DB-Engines Ranking — database popularity ranking
- Pramp System Design Interviews — mock interviews
Good luck with your interview! 🚀