A comprehensive guide for developers, architects, and system design interview candidates
Introduction
Choosing the right database is one of the most critical decisions when designing systems. It determines performance, scalability, reliability, and development and maintenance costs. In this guide, we'll examine all major database types, their characteristics, when to use them, when to avoid them, and how to justify your choice correctly in a system design interview.
Part 1: Fundamental Division — Relational vs Non-Relational Databases
Relational Databases (SQL / RDBMS)
What it is: Databases that store data as tables (rows and columns) with relationships between them through foreign keys.
Key characteristics:
- Structured data in tables
- Strict, predefined schema
- SQL query language
- ACID transactions
- Vertical scaling (usually)
Popular examples:
- PostgreSQL — powerful open-source DBMS with advanced features
- MySQL — most popular in web development
- Oracle Database — enterprise-level
- Microsoft SQL Server — .NET ecosystem integration
- SQLite — embedded database for small applications
When to Use Relational Databases
| Scenario | Justification |
|---|---|
| Structured data with clear relationships | E-commerce: Users → Orders → Products → Reviews |
| Strict data consistency | Banking operations, financial transactions |
| Complex queries with JOINs | Analytical reports, business intelligence |
| ACID transactions are critical | Reservation systems, inventory management (see the sketch below) |
| Medium data volumes | Up to several terabytes on a single server |
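To make the ACID row above concrete, here is a minimal sketch of an atomic transfer with psycopg2; the accounts table, ids, and amounts are hypothetical:
import psycopg2

conn = psycopg2.connect("dbname=mydb")
try:
    with conn:  # psycopg2 commits on success and rolls back on any exception
        with conn.cursor() as cur:
            # Both updates succeed together or not at all
            cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
            cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))
finally:
    conn.close()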
When NOT to Use
- ❌ Unstructured or frequently changing data
- ❌ Need for horizontal scaling across thousands of servers
- ❌ Extremely demanding write-latency or write-throughput requirements
- ❌ Schema constantly evolves
- ❌ Working with graphs or hierarchical structures
Pros and Cons
Pros:
- ✅ Data integrity through ACID
- ✅ Powerful SQL query language
- ✅ Mature ecosystem of tools
- ✅ Support for complex transactions
- ✅ Excellent for analytics
Cons:
- ❌ Complexity of horizontal scaling
- ❌ Rigid schema
- ❌ Vertical scaling is expensive
- ❌ Performance issues with huge data volumes
- ❌ JOINs can be slow on large volumes
Non-Relational Databases (NoSQL)
What it is: Databases that don't use tabular structure and don't require fixed schema.
Key characteristics:
- Flexible or absent schema
- Horizontal scaling
- Trade-offs framed by the CAP theorem (Consistency, Availability, Partition tolerance)
- Eventual consistency (usually)
- Various data models (document, key-value, graph, wide-column)
When to Use NoSQL:
- ✅ Unstructured or semi-structured data
- ✅ Need for horizontal scaling
- ✅ Very high write throughput
- ✅ Flexible, evolving schema
- ✅ Large data volumes (big data)
Part 2: Types of NoSQL Databases
2.1 Document Databases (Document-Oriented)
What it is: Store data as documents (usually JSON, BSON, XML).
Popular examples:
- MongoDB — most popular document database
- CouchDB — with focus on offline-first
- Amazon DocumentDB — MongoDB-compatible managed service
Data model:
{
  "_id": "507f1f77bcf86cd799439011",
  "name": "John Doe",
  "email": "john@example.com",
  "orders": [
    {
      "order_id": "ORD123",
      "total": 99.99,
      "items": [...]
    }
  ]
}
When to Use
| Scenario | Example |
|---|---|
| Hierarchical data | Product catalogs with nested categories |
| Heterogeneous record structure | CMS, blogs (different content types) |
| Fast development and prototyping | MVP, startups |
| Aggregations and full-text search | Logs, event analytics |
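A minimal pymongo sketch of the flexible-schema point above, assuming a local MongoDB; the database and collection names are hypothetical. Documents of different shapes coexist in one collection, and nested fields are queried with dot notation:
from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017").mydb.users

# Two documents with different shapes in the same collection
users.insert_one({"name": "John Doe", "orders": [{"order_id": "ORD123", "total": 99.99}]})
users.insert_one({"name": "Jane Roe", "newsletter": True})  # no orders field at all

# Query a nested field with dot notation
for doc in users.find({"orders.order_id": "ORD123"}):
    print(doc["name"])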
Pros and Cons
Pros:
- ✅ Flexible schema
- ✅ Natural object representation (JSON)
- ✅ Good performance for reading nested data
- ✅ Horizontal scaling (sharding)
Cons:
- ❌ Complex JOINs (data denormalization)
- ❌ Eventual consistency (in distributed setup)
- ❌ Limited multi-document transactions (historically; MongoDB added them in version 4.0)
- ❌ Data duplication with denormalization
Pitfalls:
- Easy to create schema chaos with inconsistent structures
- Denormalization can lead to database bloat
- Indexes need to be planned in advance for performance
Optimization:
// ❌ Bad: Deep nesting
{
  user: {
    profile: {
      address: {
        city: {
          name: "Moscow"
        }
      }
    }
  }
}

// ✅ Good: Flat structure with indexes
{
  user_id: "123",
  city_name: "Moscow",  // indexed
  full_address: {...}
}
2.2 Key-Value Stores
What it is: Simplest model — key → value. Like a giant HashMap.
Popular examples:
- Redis — in-memory, lightning-fast
- Amazon DynamoDB — managed, auto-scaling
- Memcached — simple cache
- Riak KV — distributed, fault-tolerant
Data model:
Key: "user:123:session"
Value: "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
Key: "product:456:stock"
Value: "42"
When to Use
| Scenario | Example |
|---|---|
| Caching | Page cache, API responses |
| Session management | Web sessions, JWT tokens |
| Queues and pub/sub | Redis Streams, background jobs |
| Counters and real-time statistics | Likes, views, ratings |
| Rate limiting | API throttling (see the sketch below) |
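A minimal fixed-window rate limiter with redis-py, assuming a local Redis; the key format and the 100-requests-per-minute limit are arbitrary choices:
import redis

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    key = f"rate:{user_id}"
    count = r.incr(key)                # atomic per-user counter
    if count == 1:
        r.expire(key, window_seconds)  # start the window on the first request
    return count <= limit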
Pros and Cons
Pros:
- ✅ Maximum performance (O(1) access)
- ✅ Simple to use
- ✅ Ideal for caching
- ✅ Horizontal scaling (consistent hashing)
Cons:
- ❌ No complex queries
- ❌ No relationships between data
- ❌ Limited filtering capabilities
- ❌ In-memory solutions require lots of RAM
Pitfalls:
- Redis persistence: Need to configure RDB/AOF for data persistence
- Memory limits: Plan eviction policies
- Single-threaded: Redis doesn't use multi-core without clustering
Optimization:
# ❌ Bad: Many small requests
for user_id in user_ids:
    redis.get(f"user:{user_id}")

# ✅ Good: Batch operations
redis.mget([f"user:{uid}" for uid in user_ids])

# ✅ Use pipelining
pipe = redis.pipeline()
for user_id in user_ids:
    pipe.get(f"user:{user_id}")
results = pipe.execute()
2.3 Wide-Column Stores (Columnar)
What it is: Data is organized into column families within partitions, so each row can have its own set of columns. Optimized for writing and reading huge volumes across many nodes.
Popular examples:
- Apache Cassandra — distributed, high availability
- Google Bigtable — managed from Google
- HBase — built on Hadoop
- ScyllaDB — Cassandra-compatible, but faster
Data model:
Row Key: user_123
  Column Family: profile
    name: "John Doe"
    email: "john@example.com"
  Column Family: activity
    last_login: "2024-01-15"
    page_views: 1523
When to Use
| Scenario | Example |
|---|---|
| Time-series data | IoT sensors, monitoring metrics |
| Huge write workloads | Event logging, analytics |
| Geographically distributed data | Global applications |
| Recommendation systems | Content personalization |
Pros and Cons
Pros:
- ✅ Linear scalability (add nodes)
- ✅ High availability (no single point of failure)
- ✅ Excellent for write-heavy workloads
- ✅ Flexible schema (dynamic columns)
Cons:
- ❌ Eventual consistency
- ❌ Complex data modeling
- ❌ No JOINs and complex transactions
- ❌ Complexity of cluster administration
Pitfalls:
- Partition key choice: Wrong choice kills performance
- Hotspots: Some partitions can be overloaded
- Tombstones: Deleted data creates "tombstones" that slow reads
Optimization:
-- ❌ Bad: Using single value as partition key
CREATE TABLE events (
user_id text,
event_time timestamp,
event_data text,
PRIMARY KEY (user_id, event_time) -- user_id same for all events!
);
-- ✅ Good: Combined key for distribution
CREATE TABLE events (
user_id text,
bucket int, -- e.g., hash of event_time
event_time timestamp,
event_data text,
PRIMARY KEY ((user_id, bucket), event_time)
);
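A sketch of how application code might write into the bucketed table above, assuming the DataStax cassandra-driver package and an existing keyspace; using one bucket per day is just one possible scheme:
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

def write_event(user_id: str, event_data: str) -> None:
    now = datetime.now(timezone.utc)
    bucket = int(now.strftime("%Y%m%d"))  # one partition per user per day
    session.execute(
        "INSERT INTO events (user_id, bucket, event_time, event_data) "
        "VALUES (%s, %s, %s, %s)",
        (user_id, bucket, now, event_data),
    )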
2.4 Graph Databases
What it is: Optimized for storing and traversing graphs (nodes and edges).
Popular examples:
- Neo4j — market leader in graph databases
- Amazon Neptune — managed graph database
- ArangoDB — multi-model (documents + graphs)
- JanusGraph — distributed graph database
Data model:
// Nodes and relationships
(alice:Person {name: "Alice"})-[:FOLLOWS]->(bob:Person {name: "Bob"})
(alice)-[:LIKES]->(post:Post {title: "Graph DBs are cool"})
(bob)-[:COMMENTED_ON]->(post)
When to Use
| Scenario | Example |
|---|---|
| Social networks | Friend connections, followers, recommendations |
| Fraud detection | Detecting fraudulent schemes |
| Knowledge graphs | Wikipedia, recommendation systems |
| Routing and navigation | GPS, logistics |
| Identity and Access Management | Access rights management |
Pros and Cons
Pros:
- ✅ Natural relationship modeling
- ✅ Fast graph traversal (replaces deep JOINs)
- ✅ Flexible relationship schema
- ✅ Powerful query language (Cypher, Gremlin)
Cons:
- ❌ Not optimal for tabular data
- ❌ Scaling complexity (distributed graphs are hard)
- ❌ Less mature ecosystem
- ❌ Specific use case
Pitfalls:
- Traversal depth: Overly deep traversals can be slow
- Indexing: Need to index node properties for fast search
- Sharding: Splitting graphs across servers is very complex
Optimization:
// ❌ Bad: Unlimited depth
MATCH (user:User {id: 123})-[:FOLLOWS*]->(friend)
RETURN friend
// ✅ Good: Limited depth
MATCH (user:User {id: 123})-[:FOLLOWS*1..3]->(friend)
RETURN friend
LIMIT 100
// ✅ Use indexes
CREATE INDEX ON :User(id)
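The same depth-limited traversal can be run from application code; a minimal sketch assuming the official neo4j Python driver and local credentials:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def friends_of_friends(user_id: int) -> list[str]:
    query = (
        "MATCH (u:User {id: $id})-[:FOLLOWS*1..3]->(friend) "
        "RETURN DISTINCT friend.name AS name LIMIT 100"
    )
    with driver.session() as session:
        return [record["name"] for record in session.run(query, id=user_id)]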
2.5 Time-Series Databases
What it is: Optimized for data with timestamps (metrics, events, logs).
Popular examples:
- InfluxDB — popular open-source TSDB
- TimescaleDB — PostgreSQL extension for time-series
- Prometheus — monitoring and alerting
- Amazon Timestream — managed time-series
Data model:
Timestamp: 2024-01-15T10:30:00Z
Metric: cpu_usage
Tags: {server: "web-01", region: "us-east-1"}
Value: 78.5
When to Use
| Scenario | Example |
|---|---|
| Infrastructure monitoring | CPU, memory, disk usage |
| IoT and sensors | Temperature, pressure, geolocation |
| Financial data | Stock prices, exchange rates |
| Application analytics | APM, performance metrics |
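A sketch of writing the data point from the model above, assuming the influxdb-client Python package (InfluxDB 2.x); the URL, token, org, and bucket are placeholders:
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

point = (
    Point("cpu_usage")
    .tag("server", "web-01")
    .tag("region", "us-east-1")
    .field("value", 78.5)
)
write_api.write(bucket="metrics", record=point)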
Pros and Cons
Pros:
- ✅ Optimized for time-based queries
- ✅ Efficient data compression
- ✅ Automatic aggregation
- ✅ Retention policies (auto-delete old data)
Cons:
- ❌ Specialized only for time-series
- ❌ Updating old data is complex/impossible
- ❌ Not suitable for transactional data
Optimization:
-- ✅ Use downsampling for old data (TimescaleDB syntax)
SELECT time_bucket('1 hour', time) AS hour,
avg(cpu_usage) as avg_cpu
FROM metrics
WHERE time > now() - interval '30 days'
GROUP BY hour;
-- ✅ Configure retention policies (InfluxQL, InfluxDB 1.x syntax)
CREATE RETENTION POLICY "one_year" ON "mydb"
DURATION 52w REPLICATION 1 DEFAULT;
Part 3: Specialized Database Types
3.1 Search Engines (Search Platforms)
Examples:
- Elasticsearch — full-text search and analytics
- Apache Solr — enterprise search
- Meilisearch — fast, easy to use
When to use:
- Full-text search
- Faceted search (category filters)
- Autocomplete and typo-tolerance
- Log processing and analysis (ELK stack)
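A minimal indexing and search sketch, assuming the official elasticsearch Python client (8.x style API) and a local node; the articles index is hypothetical:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# refresh="wait_for" makes the document immediately searchable in this demo
es.index(index="articles", id="1", refresh="wait_for", document={
    "title": "Choosing a database",
    "body": "Full-text search with facets and typo tolerance",
})

# fuzziness gives basic typo tolerance
resp = es.search(index="articles",
                 query={"match": {"body": {"query": "full text serch", "fuzziness": "AUTO"}}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["title"])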
3.2 Vector Databases
Examples:
- Pinecone — managed vector database
- Weaviate — AI-native vector search
- Qdrant — open-source vector search
- Milvus — scalable vector database
When to use:
- Semantic search (meaning-based search)
- Recommendation systems
- AI/ML applications
- Similarity search (similar images/text)
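Under the hood these systems rank stored embedding vectors by similarity to a query vector. A conceptual NumPy sketch of cosine-similarity search; real vector databases add approximate indexes (e.g. HNSW) so they do not scan every vector:
import numpy as np

def cosine_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    # Normalize so that a dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    return np.argsort(-scores)[:k]  # indices of the k most similar vectors

embeddings = np.random.rand(10_000, 384)  # e.g. precomputed text embeddings
query_vec = np.random.rand(384)
print(cosine_top_k(query_vec, embeddings))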
3.3 NewSQL Databases
Examples:
- Google Spanner — globally distributed SQL database
- CockroachDB — distributed SQL with ACID
- TiDB — MySQL-compatible distributed database
When to use:
- Need ACID transactions + horizontal scaling
- Global distribution with low latency
- Migration from traditional SQL with scalability requirements
Part 4: CAP Theorem and Database Selection
CAP Theorem
A distributed system cannot guarantee all three properties simultaneously:
- Consistency — all nodes see identical data
- Availability — every request receives a response
- Partition tolerance — system works during network failures
Can only choose 2 of 3:
| Type | Priority | Examples | Use case |
|---|---|---|---|
| CP | Consistency + Partition tolerance | MongoDB, HBase, Redis | Finance, inventory |
| AP | Availability + Partition tolerance | Cassandra, DynamoDB, CouchDB | Social media, IoT |
| CA | Consistency + Availability | PostgreSQL, MySQL (single node) | Traditional applications |
Practical selection:
Need strict consistency (bank, exchange)?
→ CP system (PostgreSQL, MongoDB with strong consistency)
Need maximum availability (social media, IoT)?
→ AP system (Cassandra, DynamoDB)
Can live with eventual consistency?
→ Use AP and gain scalability
Part 5: Best Practices and Optimization
Indexing
Relational Databases:
-- ✅ Index frequently used WHERE columns
CREATE INDEX idx_user_email ON users(email);
-- ✅ Composite indexes for compound queries
CREATE INDEX idx_order_user_date ON orders(user_id, created_at DESC);
-- ❌ Avoid indexes on low-cardinality columns
-- CREATE INDEX idx_user_active ON users(is_active); -- bad, only 2 values
-- ✅ Partial indexes for specific queries
CREATE INDEX idx_active_users ON users(email) WHERE is_active = true;
MongoDB:
// ✅ Compound indexes
db.products.createIndex({ category: 1, price: -1 })
// ✅ Text indexes for full-text search
db.articles.createIndex({ title: "text", content: "text" })
// ❌ Avoid too many indexes (slow writes)
Normalization vs Denormalization
Relational Databases — Normalization:
-- ✅ 3NF (Third Normal Form)
CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE orders (id SERIAL PRIMARY KEY, user_id INT REFERENCES users(id), total NUMERIC);
CREATE TABLE order_items (id SERIAL PRIMARY KEY, order_id INT REFERENCES orders(id), product_id INT, quantity INT);
-- Pros: no duplication, easy updates
-- Cons: JOINs needed for complete data
NoSQL — Denormalization:
// ✅ Embedded data for fast reads
{
  "order_id": "123",
  "user": {
    "id": "456",
    "name": "John Doe",
    "email": "john@example.com"
  },
  "items": [
    {"product_id": "789", "name": "Laptop", "price": 999}
  ]
}
// Pros: single read for all data
// Cons: duplication, complex updates
Partitioning (Sharding)
Sharding strategies:
- Range-based sharding:
Shard 1: user_id 1-1000000
Shard 2: user_id 1000001-2000000
Shard 3: user_id 2000001-3000000
✅ Simple implementation
❌ Risk of hotspots (uneven distribution)
- Hash-based sharding:
shard = hash(user_id) % num_shards
✅ Even distribution
❌ Hard to add new shards (resharding moves most keys; see the sketch below)
- Geo-based sharding:
Shard US-EAST: users from USA
Shard EU-WEST: users from Europe
Shard ASIA: users from Asia
✅ Low latency for regional queries
❌ Uneven load across regions
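A minimal sketch of hash-based routing; Python's built-in hash() is randomized between processes, so a stable digest such as MD5 is used instead. The shard count and key format are arbitrary:
import hashlib

NUM_SHARDS = 8

def shard_for(user_id: str) -> int:
    # Stable across processes and machines, unlike the built-in hash()
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user:123"))  # the same user always routes to the same shard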
Caching
Caching strategies:
- Cache-Aside (Lazy Loading):
def get_user(user_id):
    # Try to get from cache
    user = cache.get(f"user:{user_id}")
    if user:
        return user
    # Not in cache — read from database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    # Save to cache
    cache.set(f"user:{user_id}", user, ttl=3600)
    return user
- Write-Through:
def update_user(user_id, data):
    # First update database
    db.update("users", user_id, data)
    # Then update cache
    cache.set(f"user:{user_id}", data, ttl=3600)
- Write-Behind (Write-Back):
def update_user(user_id, data):
    # First update cache
    cache.set(f"user:{user_id}", data)
    # Asynchronously write to database
    queue.enqueue(lambda: db.update("users", user_id, data))
Connection Pooling
# ❌ Bad: creating a new connection each time
import psycopg2

def get_data():
    conn = psycopg2.connect("dbname=mydb")
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users")
    conn.close()

# ✅ Good: connection pool
from psycopg2 import pool

connection_pool = pool.SimpleConnectionPool(
    minconn=5,
    maxconn=20,
    dbname="mydb"
)

def get_data():
    conn = connection_pool.getconn()
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users")
        return cursor.fetchall()
    finally:
        connection_pool.putconn(conn)
Part 6: Database Selection Guide for System Design
Database Selection Flowchart
1. Data structure?
   ├─ Structured, relationships matter
   │  └─ Need complex transactions?
   │     ├─ Yes → PostgreSQL/MySQL
   │     └─ No → Evaluate scale
   └─ Unstructured/semi-structured
      ├─ Documents/JSON?
      │  └─ Yes → MongoDB
      ├─ Graphs and relationships?
      │  └─ Yes → Neo4j
      └─ Simple key-value?
         └─ Yes → Redis/DynamoDB

2. Scale?
   ├─ < 1TB → Relational database
   └─ > 1TB, need horizontal scale
      ├─ Write-heavy → Cassandra
      └─ Read-heavy → MongoDB (with replication)

3. Consistency?
   ├─ Strict (finance) → PostgreSQL, MySQL
   └─ Eventual (social media) → Cassandra, DynamoDB

4. Latency requirements?
   ├─ < 1ms → Redis (in-memory)
   ├─ < 10ms → DynamoDB, Cassandra
   └─ < 100ms → PostgreSQL, MongoDB
Database Selection Matrix by Use Case
| Use Case | Recommendation | Alternatives |
|---|---|---|
| E-commerce catalog | MongoDB | PostgreSQL (JSON), DynamoDB |
| Payment system | PostgreSQL | MySQL, CockroachDB |
| Social network | Neo4j (graphs) + Cassandra (posts) | MongoDB, DynamoDB |
| Chat/messenger | Cassandra | MongoDB, DynamoDB |
| Analytics/Dashboards | ClickHouse, TimescaleDB | Elasticsearch |
| Caching | Redis | Memcached |
| Logging | Elasticsearch | ClickHouse, TimescaleDB |
| IoT/Time-series | InfluxDB, TimescaleDB | Cassandra |
| Content search | Elasticsearch | Solr, Meilisearch |
| ML/AI vector search | Pinecone, Weaviate | Milvus, Qdrant |
Typical Database Combinations in Single System
Twitter-like application:
┌─────────────────────────┐
│      User Profiles      │
│       PostgreSQL        │  ← Structured data, relationships
└─────────────────────────┘
┌─────────────────────────┐
│      Tweets/Posts       │
│        Cassandra        │  ← High write throughput, scale
└─────────────────────────┘
┌─────────────────────────┐
│      Social Graph       │
│          Neo4j          │  ← Subscriptions, recommendations
└─────────────────────────┘
┌─────────────────────────┐
│     Trending Topics     │
│          Redis          │  ← Real-time counters
└─────────────────────────┘
┌─────────────────────────┐
│         Search          │
│      Elasticsearch      │  ← Full-text search
└─────────────────────────┘
E-commerce platform:
PostgreSQL → Orders, Payments, Inventory (ACID critical)
MongoDB → Product Catalog (flexible schema)
Redis → Session, Cart, Cache
Elasticsearch → Product Search
Neo4j → Recommendations (bought together)
Part 7: Interview Questions and Answers
Frequent Questions
Q: Why not use PostgreSQL for everything?
A: PostgreSQL is great, but:
- Vertical scaling is expensive
- Not optimal for graphs (recommendations, social connections)
- Not ideal choice for full-text search
- Fixed schema — bad for rapidly changing data
Q: When to use NoSQL instead of SQL?
A: NoSQL when:
- Flexible schema is critical
- Horizontal scaling is mandatory
- Eventual consistency is acceptable
- Data is unstructured or semi-structured
Q: How to ensure consistency in distributed system?
A: Strategies:
- Distributed transactions (2PC, Saga pattern; see the sketch below)
- Event sourcing (log of all changes)
- CQRS (Command Query Responsibility Segregation)
- Eventual consistency with compensating transactions
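A conceptual sketch of the Saga pattern from the list above: every step is paired with a compensating action, and on failure the already-completed steps are undone in reverse order. The business functions named in the usage comment are purely illustrative:
def run_saga(steps):
    """steps: list of (action, compensation) pairs."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            # Undo everything that already succeeded, newest first
            for undo in reversed(completed):
                undo()
            raise

# Usage (illustrative):
#   run_saga([(reserve_inventory, release_inventory),
#             (charge_payment, refund_payment),
#             (create_shipment, cancel_shipment)])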
Q: How to choose partition key in Cassandra?
A: Good partition key:
- Even data distribution
- High cardinality
- Matches access patterns
- Example: user_id + timestamp_bucket instead of just user_id
Red Flags in Interviews
❌ Wrong:
- "I'll choose MongoDB because it's NoSQL and fast"
- "We use MySQL because I'm familiar with it"
- "NoSQL doesn't support JOINs" (false for many)
- "SQL databases don't scale" (false, just harder)
✅ Correct:
- "I'll use PostgreSQL for transactional data due to ACID guarantees, and Cassandra for time-series metrics due to high write throughput"
- "I'll consider trade-offs: PostgreSQL provides consistency, Cassandra provides availability during partition"
- "For MVP we start with PostgreSQL, then add specialized databases if needed"
Part 8: Advanced Topics
Replication
Master-Slave (Primary-Replica):
┌──────────┐   writes   ┌──────────┐
│  Client  │ ─────────> │  Master  │
└──────────┘            └──────────┘
                             │
                     async replication
                             │
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
      ┌──────────┐      ┌──────────┐      ┌──────────┐
      │ Replica1 │      │ Replica2 │      │ Replica3 │
      └──────────┘      └──────────┘      └──────────┘
           ▲                 ▲                 ▲
           └─────── reads ───┴─────────────────┘
Multi-Master:
┌──────────┐     ┌──────────┐     ┌──────────┐
│ Master 1 │ ◄─► │ Master 2 │ ◄─► │ Master 3 │
└──────────┘     └──────────┘     └──────────┘
     ▲                ▲                ▲
     └──────── writes anywhere ───────┘
Consistency Levels (Cassandra example)
ONE: Fast, but may return stale data
QUORUM: Majority of replicas (N/2 + 1) — balance
ALL: All replicas — slow, maximum consistency
Formula for strong consistency:
W + R > N
W = write consistency level
R = read consistency level
N = replication factor
Example: N=3, W=2, R=2 → 2+2=4 > 3 ✓ (strong consistency)
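The same rule as a tiny helper for sanity-checking replication settings:
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    # Read and write quorums must overlap in at least one replica
    return w + r > n

print(is_strongly_consistent(3, 2, 2))  # True  (QUORUM writes + QUORUM reads)
print(is_strongly_consistent(3, 1, 1))  # False (ONE + ONE can miss the latest write)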
Database Migration Strategies
Blue-Green Deployment:
- Old database (Blue) is running
- New database (Green) is configured
- Replicate data Blue → Green
- Switch traffic to Green
- Rollback to Blue if issues
Dual Writes:
def write_data(data):
    # Write to both databases
    old_db.write(data)
    new_db.write(data)
    # Read from old database (gradually switch)
    return old_db.read(data)
Conclusion
Key Takeaways
- No universal database — choice depends on use case
- CAP theorem — can't have everything simultaneously
- Measure and test — don't rely on assumptions
- Start simple — can migrate later
- Polyglot persistence — use multiple databases in one system
Interview Checklist for System Design
When selecting a database, justify:
- [ ] Data structure (structured/unstructured)
- [ ] Access patterns (read-heavy/write-heavy)
- [ ] Consistency requirements (ACID vs eventual)
- [ ] Scale (data volume, QPS)
- [ ] Latency requirements
- [ ] CAP trade-offs
- [ ] Indexing and optimization
- [ ] Sharding strategy (if needed)
- [ ] Replication and failover
- [ ] Cost (managed vs self-hosted)
Useful Resources:
- Database of Databases — catalog of all databases
- System Design Primer
- Use The Index, Luke — guide to indexes
- Designing Data-Intensive Applications — must-read book
Practice:
- LeetCode Database Problems
- DB-Engines Ranking — database popularity ranking
- Pramp System Design Interviews — mock interviews
Good luck with your interview! 🚀