Oleg Goncharov
Complete Guide to Database Types: From Basics to System Design

A comprehensive guide for developers, architects, and system design interview candidates

Introduction

Choosing the right database is one of the most critical decisions when designing systems. It determines performance, scalability, reliability, and development and maintenance costs. In this guide, we'll examine all major database types, their characteristics, when to use them, when to avoid them, and how to justify your choice correctly in a system design interview.


Part 1: Fundamental Division — Relational vs Non-Relational Databases

Relational Databases (SQL / RDBMS)

What it is: Databases that store data as tables (rows and columns) with relationships between them through foreign keys.

Key characteristics:

  • Structured data in tables
  • Strict schema
  • SQL query language
  • ACID transactions
  • Vertical scaling (usually)

Popular examples:

  • PostgreSQL — powerful open-source DBMS with advanced features
  • MySQL — most popular in web development
  • Oracle Database — enterprise-level
  • Microsoft SQL Server — .NET ecosystem integration
  • SQLite — embedded database for small applications

When to Use Relational Databases

| Scenario | Justification |
| --- | --- |
| Structured data with clear relationships | E-commerce: Users → Orders → Products → Reviews |
| Strict data consistency | Banking operations, financial transactions |
| Complex queries with JOINs | Analytical reports, business intelligence |
| ACID transactions are critical | Reservation systems, inventory management |
| Medium data volumes | Up to several terabytes on a single server |
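
As a quick illustration of the ACID row above, here is a minimal money-transfer transaction sketch in Python with psycopg2; the accounts table and connection string are hypothetical:

import psycopg2

# `with conn:` starts a transaction that commits on success and
# rolls back on any exception, so the two updates are atomic.
conn = psycopg2.connect("dbname=mydb")
try:
    with conn:
        with conn.cursor() as cur:
            cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (100, 1))
            cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (100, 2))
finally:
    conn.close()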

When NOT to Use

  • ❌ Unstructured or frequently changing data
  • ❌ Need for horizontal scaling across thousands of servers
  • ❌ Very strict write-latency or write-throughput requirements
  • ❌ Schema constantly evolves
  • ❌ Working with graphs or hierarchical structures

Pros and Cons

Pros:

  • ✅ Data integrity through ACID
  • ✅ Powerful SQL query language
  • ✅ Mature ecosystem of tools
  • ✅ Support for complex transactions
  • ✅ Excellent for analytics

Cons:

  • ❌ Complexity of horizontal scaling
  • ❌ Rigid schema
  • ❌ Vertical scaling is expensive
  • ❌ Performance issues with huge data volumes
  • ❌ JOINs can be slow on large volumes

Non-Relational Databases (NoSQL)

What it is: Databases that don't use a tabular structure and don't require a fixed schema.

Key characteristics:

  • Flexible or absent schema
  • Horizontal scaling
  • Designed around CAP-theorem trade-offs (Consistency, Availability, Partition tolerance)
  • Eventual consistency (usually)
  • Various data models (document, key-value, graph, wide-column)

When to Use NoSQL:

  • ✅ Unstructured or semi-structured data
  • ✅ Need for horizontal scaling
  • ✅ Very high write throughput
  • ✅ Flexible, evolving schema
  • ✅ Large data volumes (big data)

Part 2: Types of NoSQL Databases

2.1 Document Databases (Document-Oriented)

What it is: Store data as documents (usually JSON, BSON, XML).

Popular examples:

  • MongoDB — most popular document database
  • CouchDB — focused on offline-first use cases
  • Amazon DocumentDB — MongoDB-compatible managed service

Data model:

{
  "_id": "507f1f77bcf86cd799439011",
  "name": "John Doe",
  "email": "john@example.com",
  "orders": [
    {
      "order_id": "ORD123",
      "total": 99.99,
      "items": [...]
    }
  ]
}

When to Use

| Scenario | Example |
| --- | --- |
| Hierarchical data | Product catalogs with nested categories |
| Heterogeneous record structure | CMS, blogs (different content types) |
| Fast development and prototyping | MVP, startups |
| Aggregations and full-text search | Logs, event analytics |
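
A minimal pymongo sketch of the patterns above: inserting a user with embedded orders and querying by a nested field (database and collection names are illustrative):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client.shop.users

# Embed orders directly in the user document
users.insert_one({
    "name": "John Doe",
    "email": "john@example.com",
    "orders": [{"order_id": "ORD123", "total": 99.99}],
})

# Dot notation queries into embedded documents
for doc in users.find({"orders.order_id": "ORD123"}):
    print(doc["name"])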

Pros and Cons

Pros:

  • ✅ Flexible schema
  • ✅ Natural object representation (JSON)
  • ✅ Good performance for reading nested data
  • ✅ Horizontal scaling (sharding)

Cons:

  • ❌ Complex JOINs (data denormalization)
  • ❌ Eventual consistency (in distributed setup)
  • ❌ No strict transactions between documents (historically)
  • ❌ Data duplication with denormalization

Pitfalls:

  • Easy to create schema chaos with inconsistent structures
  • Denormalization can lead to database bloat
  • Indexes need to be planned in advance for performance

Optimization:

// ❌ Bad: Deep nesting
{
  user: {
    profile: {
      address: {
        city: {
          name: "Moscow"
        }
      }
    }
  }
}

// ✅ Good: Flat structure with indexes
{
  user_id: "123",
  city_name: "Moscow",  // indexed
  full_address: {...}
}

2.2 Key-Value Stores

What it is: The simplest model — a key mapped to a value, like a giant HashMap.

Popular examples:

  • Redis — in-memory, lightning-fast
  • Amazon DynamoDB — managed, auto-scaling
  • Memcached — simple cache
  • Riak KV — distributed, fault-tolerant

Data model:

Key: "user:123:session"
Value: "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."

Key: "product:456:stock"
Value: "42"

When to Use

| Scenario | Example |
| --- | --- |
| Caching | Page cache, API responses |
| Session management | Web sessions, JWT tokens |
| Queues and pub/sub | Redis Streams, background jobs |
| Counters and real-time statistics | Likes, views, ratings |
| Rate limiting | API throttling |
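
The rate-limiting row above takes only a few lines on top of a key-value store. A hedged sketch with redis-py, assuming a fixed window of 100 requests per minute:

import redis

r = redis.Redis()

def allow_request(user_id: str, limit: int = 100, window: int = 60) -> bool:
    key = f"rate:{user_id}"
    count = r.incr(key)        # atomic per-user counter
    if count == 1:
        r.expire(key, window)  # start the window on the first hit
    return count <= limit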

Pros and Cons

Pros:

  • ✅ Maximum performance (O(1) access)
  • ✅ Simple to use
  • ✅ Ideal for caching
  • ✅ Horizontal scaling (consistent hashing)

Cons:

  • ❌ No complex queries
  • ❌ No relationships between data
  • ❌ Limited filtering capabilities
  • ❌ In-memory solutions require lots of RAM

Pitfalls:

  • Redis persistence: Need to configure RDB/AOF for data persistence
  • Memory limits: Plan eviction policies
  • Single-threaded: Redis doesn't use multi-core without clustering

Optimization:

# ❌ Bad: Many small requests
for user_id in user_ids:
    redis.get(f"user:{user_id}")

# ✅ Good: Batch operations
redis.mget([f"user:{uid}" for uid in user_ids])

# ✅ Use pipelining
pipe = redis.pipeline()
for user_id in user_ids:
    pipe.get(f"user:{user_id}")
results = pipe.execute()

2.3 Wide-Column Stores (Columnar)

What it is: Data is grouped into column families within rows, with dynamic columns per row. Optimized for reading and writing huge volumes.

Popular examples:

  • Apache Cassandra — distributed, high availability
  • Google Bigtable — managed service from Google
  • HBase — built on Hadoop
  • ScyllaDB — Cassandra-compatible, but faster

Data model:

Row Key: user_123
  Column Family: profile
    name: "John Doe"
    email: "john@example.com"
  Column Family: activity
    last_login: "2024-01-15"
    page_views: 1523

When to Use

| Scenario | Example |
| --- | --- |
| Time-series data | IoT sensors, monitoring metrics |
| Huge write workloads | Event logging, analytics |
| Geographically distributed data | Global applications |
| Recommendation systems | Content personalization |
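
To illustrate the write-heavy time-series pattern, here is a sketch with the DataStax Python driver that buckets events by hour; the keyspace and table are illustrative and match the schema in the Optimization example below:

from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("metrics")

def write_event(user_id: str, event_data: str) -> None:
    now = datetime.now(timezone.utc)
    bucket = now.hour  # one partition per user per hour
    session.execute(
        "INSERT INTO events (user_id, bucket, event_time, event_data) "
        "VALUES (%s, %s, %s, %s)",
        (user_id, bucket, now, event_data),
    )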

Pros and Cons

Pros:

  • ✅ Linear scalability (add nodes)
  • ✅ High availability (no single point of failure)
  • ✅ Excellent for write-heavy workloads
  • ✅ Flexible schema (dynamic columns)

Cons:

  • ❌ Eventual consistency
  • ❌ Complex data modeling
  • ❌ No JOINs and complex transactions
  • ❌ Complexity of cluster administration

Pitfalls:

  • Partition key choice: Wrong choice kills performance
  • Hotspots: Some partitions can be overloaded
  • Tombstones: Deleted data creates "tombstones" that slow reads

Optimization:

-- ❌ Bad: Using single value as partition key
CREATE TABLE events (
    user_id text,
    event_time timestamp,
    event_data text,
    PRIMARY KEY (user_id, event_time)  -- all of a user's events land in one partition!
);

-- ✅ Good: Combined key for distribution
CREATE TABLE events (
    user_id text,
    bucket int,  -- e.g., hash of event_time
    event_time timestamp,
    event_data text,
    PRIMARY KEY ((user_id, bucket), event_time)
);

2.4 Graph Databases

What it is: Optimized for storing and traversing graphs (nodes and edges).

Popular examples:

  • Neo4j — market leader in graph databases
  • Amazon Neptune — managed graph database
  • ArangoDB — multi-model (documents + graphs)
  • JanusGraph — distributed graph database

Data model:

// Nodes and relationships
(alice:Person {name: "Alice"})-[:FOLLOWS]->(bob:Person {name: "Bob"})
(alice)-[:LIKES]->(post:Post {title: "Graph DBs are cool"})
(bob)-[:COMMENTED_ON]->(post)

When to Use

| Scenario | Example |
| --- | --- |
| Social networks | Friend connections, followers, recommendations |
| Fraud detection | Detecting fraudulent schemes |
| Knowledge graphs | Wikipedia, recommendation systems |
| Routing and navigation | GPS, logistics |
| Identity and Access Management | Access rights management |
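
As an illustration of graph traversal, a friend-of-friend recommendation sketch with the official neo4j Python driver; the URI, credentials, and FOLLOWS model are assumptions:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def recommend_friends(user_id: int) -> list[str]:
    # People followed by people I follow, whom I don't follow yet
    query = (
        "MATCH (u:User {id: $id})-[:FOLLOWS]->()-[:FOLLOWS]->(fof:User) "
        "WHERE NOT (u)-[:FOLLOWS]->(fof) AND u <> fof "
        "RETURN DISTINCT fof.name AS name LIMIT 10"
    )
    with driver.session() as session:
        return [record["name"] for record in session.run(query, id=user_id)]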

Pros and Cons

Pros:

  • ✅ Natural relationship modeling
  • ✅ Fast graph traversal (replaces chains of deep JOINs)
  • ✅ Flexible relationship schema
  • ✅ Powerful query language (Cypher, Gremlin)

Cons:

  • ❌ Not optimal for tabular data
  • ❌ Scaling complexity (distributed graphs are hard)
  • ❌ Less mature ecosystem
  • ❌ Narrow, specialized use cases

Pitfalls:

  • Traversal depth: Overly deep queries can be slow
  • Indexing: Need to index node properties for fast search
  • Sharding: Splitting graphs across servers is very complex

Optimization:

// ❌ Bad: Unlimited depth
MATCH (user:User {id: 123})-[:FOLLOWS*]->(friend)
RETURN friend

// ✅ Good: Limited depth
MATCH (user:User {id: 123})-[:FOLLOWS*1..3]->(friend)
RETURN friend
LIMIT 100

// ✅ Use indexes (Neo4j 4+ syntax)
CREATE INDEX user_id_index FOR (u:User) ON (u.id)

2.5 Time-Series Databases

What it is: Optimized for data with timestamps (metrics, events, logs).

Popular examples:

  • InfluxDB — popular open-source TSDB
  • TimescaleDB — PostgreSQL extension for time-series
  • Prometheus — monitoring and alerting
  • Amazon Timestream — managed time-series

Data model:

Timestamp: 2024-01-15T10:30:00Z
Metric: cpu_usage
Tags: {server: "web-01", region: "us-east-1"}
Value: 78.5

When to Use

| Scenario | Example |
| --- | --- |
| Infrastructure monitoring | CPU, memory, disk usage |
| IoT and sensors | Temperature, pressure, geolocation |
| Financial data | Stock prices, exchange rates |
| Application analytics | APM, performance metrics |

Pros and Cons

Pros:

  • ✅ Optimized for time-based queries
  • ✅ Efficient data compression
  • ✅ Automatic aggregation
  • ✅ Retention policies (auto-delete old data)

Cons:

  • ❌ Specialized only for time-series
  • ❌ Updating old data is complex/impossible
  • ❌ Not suitable for transactional data

Optimization:

-- ✅ Use downsampling for old data (TimescaleDB syntax)
SELECT time_bucket('1 hour', time) AS hour,
       avg(cpu_usage) AS avg_cpu
FROM metrics
WHERE time > now() - interval '30 days'
GROUP BY hour;

-- ✅ Configure retention policies (InfluxDB syntax)
CREATE RETENTION POLICY "one_year" ON "mydb"
  DURATION 52w REPLICATION 1 DEFAULT;

Part 3: Specialized Database Types

3.1 Search Engines (Search Platforms)

Examples:

  • Elasticsearch — full-text search and analytics
  • Apache Solr — enterprise search
  • Meilisearch — fast, easy to use

When to use:

  • Full-text search
  • Faceted search (category filters)
  • Autocomplete and typo-tolerance
  • Log processing and analysis (ELK stack)
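
A minimal full-text query with boosting and typo tolerance, sketched with elasticsearch-py 8.x (index name and fields are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="articles",
    query={
        "multi_match": {
            "query": "database sharding",
            "fields": ["title^2", "content"],  # boost title matches
            "fuzziness": "AUTO",               # tolerate typos
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["title"])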

3.2 Vector Databases

Examples:

  • Pinecone — managed vector database
  • Weaviate — AI-native vector search
  • Qdrant — open-source vector search
  • Milvus — scalable vector database

When to use:

  • Semantic search (meaning-based search)
  • Recommendation systems
  • AI/ML applications
  • Similarity search (similar images/text)
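
Under the hood these systems rank stored embeddings by similarity to a query vector. A vendor-neutral brute-force sketch with NumPy (production systems use approximate indexes such as HNSW or IVF instead):

import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    # After normalization, cosine similarity is just a dot product
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar vectors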

3.3 NewSQL Databases

Examples:

  • Google Spanner — globally distributed SQL database
  • CockroachDB — distributed SQL with ACID
  • TiDB — MySQL-compatible distributed database

When to use:

  • Need ACID transactions + horizontal scaling
  • Global distribution with low latency
  • Migration from traditional SQL with scalability requirements

Part 4: CAP Theorem and Database Selection

CAP Theorem

It is impossible to guarantee all three properties simultaneously in a distributed system:

  • Consistency — all nodes see identical data
  • Availability — every request receives a response
  • Partition tolerance — system works during network failures

You can only choose two of the three:

| Type | Priority | Examples | Use case |
| --- | --- | --- | --- |
| CP | Consistency + Partition tolerance | MongoDB, HBase, Redis | Finance, inventory |
| AP | Availability + Partition tolerance | Cassandra, DynamoDB, CouchDB | Social media, IoT |
| CA | Consistency + Availability | PostgreSQL, MySQL (single node) | Traditional applications |

Practical selection:

Need strict consistency (bank, exchange)?
  → CP system (PostgreSQL, MongoDB with strong consistency)

Need maximum availability (social media, IoT)?
  → AP system (Cassandra, DynamoDB)

Can live with eventual consistency?
  → Choose an AP system and gain scalability

Part 5: Best Practices and Optimization

Indexing

Relational Databases:

-- ✅ Index frequently used WHERE columns
CREATE INDEX idx_user_email ON users(email);

-- ✅ Composite indexes for compound queries
CREATE INDEX idx_order_user_date ON orders(user_id, created_at DESC);

-- ❌ Avoid indexes on low-cardinality columns
-- CREATE INDEX idx_user_active ON users(is_active);  -- bad, only 2 values

-- ✅ Partial indexes for specific queries
CREATE INDEX idx_active_users ON users(email) WHERE is_active = true;

MongoDB:

// ✅ Compound indexes
db.products.createIndex({ category: 1, price: -1 })

// ✅ Text indexes for full-text search
db.articles.createIndex({ title: "text", content: "text" })

// ❌ Avoid too many indexes (slow writes)

Normalization vs Denormalization

Relational Databases — Normalization:

-- ✅ 3NF (Third Normal Form)
CREATE TABLE users       (id SERIAL PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE orders      (id SERIAL PRIMARY KEY, user_id INT REFERENCES users(id), total NUMERIC);
CREATE TABLE order_items (id SERIAL PRIMARY KEY, order_id INT REFERENCES orders(id), product_id INT, quantity INT);

-- Pros: no duplication, easy updates
-- Cons: JOINs needed for complete data

NoSQL — Denormalization:

// ✅ Embedded data for fast reads
{
  "order_id": "123",
  "user": {
    "id": "456",
    "name": "John Doe",
    "email": "john@example.com"
  },
  "items": [
    {"product_id": "789", "name": "Laptop", "price": 999}
  ]
}

// Pros: single read for all data
// Cons: duplication, complex updates

Partitioning (Sharding)

Sharding strategies:

  1. Range-based sharding:
Shard 1: user_id 1-1000000
Shard 2: user_id 1000001-2000000
Shard 3: user_id 2000001-3000000

✅ Simple implementation

❌ Risk of hotspots (uneven distribution)

  2. Hash-based sharding:
shard = hash(user_id) % num_shards

✅ Even distribution

❌ Hard to add new shards

  3. Geo-based sharding:
Shard US-EAST: users from USA
Shard EU-WEST: users from Europe
Shard ASIA: users from Asia

✅ Low latency for regional queries

❌ Uneven load across regions
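
Consistent hashing (mentioned in the key-value section) mitigates the "hard to add new shards" drawback of modulo hashing: adding a node remaps only the keys between it and its predecessor on the ring. A minimal sketch:

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        # Virtual nodes smooth out the key distribution
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        # First virtual node clockwise from the key's hash
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["shard-1", "shard-2", "shard-3"])
print(ring.get_node("user:123"))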

Caching

Caching strategies:

  1. Cache-Aside (Lazy Loading):
def get_user(user_id):
    # Try to get from cache
    user = cache.get(f"user:{user_id}")
    if user:
        return user

    # Not in cache — read from database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)

    # Save to cache
    cache.set(f"user:{user_id}", user, ttl=3600)
    return user
  2. Write-Through:
def update_user(user_id, data):
    # First update database
    db.update("users", user_id, data)

    # Then update cache
    cache.set(f"user:{user_id}", data, ttl=3600)
  3. Write-Behind (Write-Back):
def update_user(user_id, data):
    # First update cache
    cache.set(f"user:{user_id}", data)

    # Asynchronously write to database
    queue.enqueue(lambda: db.update("users", user_id, data))

Connection Pooling

# ❌ Bad: creating a new connection each time
import psycopg2

def get_data():
    conn = psycopg2.connect("dbname=mydb")
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users")
    rows = cursor.fetchall()
    conn.close()
    return rows

# ✅ Good: connection pool
from psycopg2 import pool

connection_pool = pool.SimpleConnectionPool(
    minconn=5,
    maxconn=20,
    dbname="mydb"
)

def get_data():
    conn = connection_pool.getconn()
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users")
        return cursor.fetchall()
    finally:
        connection_pool.putconn(conn)

Part 6: Database Selection Guide for System Design

Database Selection Flowchart

1. Data structure?
   └─ Structured, relationships matter
      └─ Need complex transactions?
         ├─ Yes → PostgreSQL/MySQL
         └─ No → Evaluate scale
   └─ Unstructured/semi-structured
      └─ Documents/JSON?
         └─ Yes → MongoDB
      └─ Graphs and relationships?
         └─ Yes → Neo4j
      └─ Simple key-value?
         └─ Yes → Redis/DynamoDB

2. Scale?
   └─ < 1TB → Relational database
   └─ > 1TB, need horizontal scale
      └─ Write-heavy → Cassandra
      └─ Read-heavy → MongoDB (with replication)

3. Consistency?
   └─ Strict (finance) → PostgreSQL, MySQL
   └─ Eventual (social media) → Cassandra, DynamoDB

4. Latency requirements?
   └─ < 1ms → Redis (in-memory)
   └─ < 10ms → DynamoDB, Cassandra
   └─ < 100ms → PostgreSQL, MongoDB

Database Selection Matrix by Use Case

| Use Case | Recommendation | Alternatives |
| --- | --- | --- |
| E-commerce catalog | MongoDB | PostgreSQL (JSON), DynamoDB |
| Payment system | PostgreSQL | MySQL, CockroachDB |
| Social network | Neo4j (graphs) + Cassandra (posts) | MongoDB, DynamoDB |
| Chat/messenger | Cassandra | MongoDB, DynamoDB |
| Analytics/Dashboards | ClickHouse, TimescaleDB | Elasticsearch |
| Caching | Redis | Memcached |
| Logging | Elasticsearch | ClickHouse, TimescaleDB |
| IoT/Time-series | InfluxDB, TimescaleDB | Cassandra |
| Content search | Elasticsearch | Solr, Meilisearch |
| ML/AI vector search | Pinecone, Weaviate | Milvus, Qdrant |

Typical Database Combinations in a Single System

Twitter-like application:

┌─────────────────────────┐
│ User Profiles           │
│ PostgreSQL              │ ← Structured data, relationships
└─────────────────────────┘

┌─────────────────────────┐
│ Tweets/Posts            │
│ Cassandra               │ ← High write throughput, scale
└─────────────────────────┘

┌─────────────────────────┐
│ Social Graph            │
│ Neo4j                   │ ← Follow graph, recommendations
└─────────────────────────┘

┌─────────────────────────┐
│ Trending Topics         │
│ Redis                   │ ← Real-time counters
└─────────────────────────┘

┌─────────────────────────┐
│ Search                  │
│ Elasticsearch           │ ← Full-text search
└─────────────────────────┘

E-commerce platform:

PostgreSQL    → Orders, Payments, Inventory (ACID critical)
MongoDB       → Product Catalog (flexible schema)
Redis         → Session, Cart, Cache
Elasticsearch → Product Search
Neo4j         → Recommendations (bought together)

Part 7: Interview Questions and Answers

Frequent Questions

Q: Why not use PostgreSQL for everything?

A: PostgreSQL is great, but:

  • Vertical scaling is expensive
  • Not optimal for graphs (recommendations, social connections)
  • Not ideal choice for full-text search
  • Fixed schema — bad for rapidly changing data

Q: When to use NoSQL instead of SQL?

A: NoSQL when:

  • Flexible schema is critical
  • Horizontal scaling is mandatory
  • Eventual consistency is acceptable
  • Data is unstructured or semi-structured

Q: How do you ensure consistency in a distributed system?

A: Strategies:

  1. Distributed transactions (2PC, Saga pattern)
  2. Event sourcing (log of all changes)
  3. CQRS (Command Query Responsibility Segregation)
  4. Eventual consistency with compensating transactions
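
A minimal sketch of the Saga pattern from point 1: every step is paired with a compensating action that undoes it if a later step fails (the order-workflow steps are illustrative):

def run_saga(steps):
    """Execute (action, compensate) pairs; roll back completed steps on failure."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):  # undo in reverse order
            compensate()
        raise

run_saga([
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge payment"),    lambda: print("refund payment")),
])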

Q: How do you choose a partition key in Cassandra?

A: Good partition key:

  • Even data distribution
  • High cardinality
  • Matches access patterns
  • Example: user_id + timestamp_bucket instead of just user_id

Red Flags in Interviews

Wrong:

  • "I'll choose MongoDB because it's NoSQL and fast"
  • "We use MySQL because I'm familiar with it"
  • "NoSQL doesn't support JOINs" (false for many)
  • "SQL databases don't scale" (false, just harder)

Correct:

  • "I'll use PostgreSQL for transactional data due to ACID guarantees, and Cassandra for time-series metrics due to high write throughput"
  • "I'll consider trade-offs: PostgreSQL provides consistency, Cassandra provides availability during partition"
  • "For MVP we start with PostgreSQL, then add specialized databases if needed"

Part 8: Advanced Topics

Replication

Master-Slave (Primary-Replica):

┌──────────┐  writes   ┌──────────┐
│  Client  │ ────────> │  Master  │
└──────────┘           └──────────┘
                            │
                    async replication
                            │
          ┌─────────────────┼─────────────────┐
          ▼                 ▼                 ▼
    ┌──────────┐      ┌──────────┐      ┌──────────┐
    │ Replica1 │      │ Replica2 │      │ Replica3 │
    └──────────┘      └──────────┘      └──────────┘
         ▲                 ▲                 ▲
         └─────── reads ───┴─────────────────┘

Multi-Master:

┌──────────┐     ┌──────────┐     ┌──────────┐
│ Master 1 │ ◄─► │ Master 2 │ ◄─► │ Master 3 │
└──────────┘     └──────────┘     └──────────┘
     ▲                ▲                ▲
     └──────── writes anywhere ────────┘

Consistency Levels (Cassandra example)

ONE:    Fast, but may return stale data
QUORUM: Majority of replicas (N/2 + 1) — balance
ALL:    All replicas — slow, maximum consistency
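
With the DataStax Python driver the consistency level can be set per query; a hedged sketch (keyspace and table names are illustrative):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("metrics")

# QUORUM: a majority of replicas must acknowledge the read
stmt = SimpleStatement(
    "SELECT * FROM events WHERE user_id = %s AND bucket = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(stmt, ("user_123", 10))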

Formula for strong consistency:

W + R > N

W = write consistency level
R = read consistency level  
N = replication factor

Example: N=3, W=2, R=2 → 2+2=4 > 3 ✓ (strong consistency)

Database Migration Strategies

Blue-Green Deployment:

  1. Old database (Blue) is running
  2. New database (Green) is configured
  3. Replicate data Blue → Green
  4. Switch traffic to Green
  5. Rollback to Blue if issues

Dual Writes:

def write_data(data):
    # Write to both databases; in production, wrap this in retries or an
    # outbox, since a failure between the two writes leaves them inconsistent
    old_db.write(data)
    new_db.write(data)

def read_data(key):
    # Keep reading from the old database while it is the source of truth,
    # then gradually switch reads over to the new one
    return old_db.read(key)

Conclusion

Key Takeaways

  1. No universal database — choice depends on use case
  2. CAP theorem — can't have everything simultaneously
  3. Measure and test — don't rely on assumptions
  4. Start simple — can migrate later
  5. Polyglot persistence — use multiple databases in one system

Interview Checklist for System Design

When selecting a database, justify:

  • [ ] Data structure (structured/unstructured)
  • [ ] Access patterns (read-heavy/write-heavy)
  • [ ] Consistency requirements (ACID vs eventual)
  • [ ] Scale (data volume, QPS)
  • [ ] Latency requirements
  • [ ] CAP trade-offs
  • [ ] Indexing and optimization
  • [ ] Sharding strategy (if needed)
  • [ ] Replication and failover
  • [ ] Cost (managed vs self-hosted)


Good luck with your interview! 🚀
