Hey fellow dev — let's tackle the choice that keeps us up at night: which database should you use?
I've boiled down the major database archetypes into a practical field guide so you can make the right call for your next project or system design interview. For each archetype you'll get: a real-world example, core characteristics, pros/cons, common pitfalls, and concrete product examples.
Before You Even Look at a Database, Answer These 6 Questions
Every database choice I've made that turned out well started here. Every bad choice skipped at least one of these.
What shape is your data? Rows and columns with fixed fields? Nested JSON documents? Nodes and edges? Timestamped measurements?
How stable is your schema? If your fields change often between records, you'll hate rigid relational schemas. If they're locked in, schema flexibility is a non-issue.
How complex are your relationships? Do entities reference each other heavily? Many-to-many? Deep traversals? This is where most teams underestimate.
What are your query patterns? Short key lookups vs. aggregations vs. full-text vs. joins — these push you toward completely different architectures.
What's the expected scale? A single-region app serving 10K users is a different problem from global OLTP at 1M writes/sec.
Who is operating this thing? The best database for your use case is the one your team can actually run well. Don't adopt Cassandra if no one on your team has operated it.
Quick thumb rule
| If... | Start with... |
|---|---|
| ACID + complex relationships | Relational DB |
| Flexible schema + hierarchical documents | Document DB |
| Extreme low-latency key lookups | Key–Value store |
| Graph traversals | Graph DB |
| High-cardinality time series | Time-Series DB |
Relational Databases (RDBMS) — The Reliable Workhorse
RDBMS is the foundational technology for systems where data consistency, reliability, and complex relationships are critical. Because they're designed to be highly structured and secure, they're the industry standard for any application where "getting the data right" matters more than raw speed.
Characteristics
- Structured rows and columns with a defined schema
- Strong ACID guarantees for transactions
- Powerful SQL for joins, aggregation, and complex queries
- Mature tooling for backups, migrations, and analytics
Pros
- Battle-tested for correctness and consistency
- Great for normalized data and multi-table transactions
- Rich ecosystem (ORMs, monitoring, tooling)
Cons
- Schema migrations can be painful at large scale
- Joins across huge tables can be slow without careful indexing
- Vertical scaling limits unless you adopt sharding or distributed SQL
Examples: PostgreSQL, MySQL, Microsoft SQL Server
Who uses them: Core transactional systems at Shopify, payment systems, most SaaS apps
Common pitfalls
- Over-normalizing for read-heavy workloads — optimize with materialized views or read replicas
- Ignoring indexing and query plans; assuming indexes are "free"
Document Databases — The Flexible Multi-Tool
Document databases shine when your data requirements are evolving rapidly or too varied to fit neatly into a rigid table schema. They're the go-to for modern, agile development — think CMS or user profiles where each record may have different fields.
Characteristics
- Schema-flexible JSON-like documents (nested structures welcome)
- Queryable fields and secondary indexes; supports partial updates
- Good balance between flexible modeling and querying capability
Pros
- Evolve your data model quickly without rigid migrations
- Excellent for denormalized data and aggregations that map to document shapes
- Often easier to scale horizontally than single-instance RDBMS
Cons
- Complex joins are either unsupported or expensive (application-level)
- Risk of inconsistent schemas and duplicated data if not managed
- Transactions historically limited (though many modern engines now support them)
Examples: MongoDB, Couchbase, Amazon DocumentDB
Who uses them: Content systems, user profile stores, product catalogs
Common pitfalls
- Modeling relational semantics as documents without a strategy — leads to data redundancy and maintenance headaches
- Uncontrolled document growth (oversized documents, deeply nested arrays) hurts performance
Key–Value Stores — The Blazing-Fast Lookup
Think of these as a giant, super-organized dictionary where you can grab any piece of information instantly by its unique key. The go-to choice for session management and caching.
Characteristics
- Simple API: store and retrieve by primary key
- Extremely low-latency reads and writes
- Minimal structure — values are opaque blobs to the store
Pros
- Excellent performance at scale; trivially sharded
- Simple to reason about and operate
Cons
- Limited querying (no secondary indexes or complex queries)
- Application must handle consistency and indexing logic
Examples: Redis, Amazon DynamoDB (used as KV), Memcached
Common pitfalls
- Using KV where you actually need complex queries or relationships
- Treating cache as primary storage without durable persistence or correct eviction handling
Column-Family (Wide-Column) Stores — The High-Throughput Workhorse
Unlike standard systems that store data row by row, these databases organize information by columns — enabling specific data points to be read and written at incredible speeds. Ideal for massive workloads like IoT telemetry, web logs, or real-time user activity feeds.
Characteristics
- Data modeled as rows with many sparse columns grouped into families
- Optimized for write-heavy workloads and large-scale partitioning
- Tunable consistency and compaction strategies
Pros
- High write throughput, designed for horizontal scale
- Efficient for queries reading contiguous ranges or specific columns
Cons
- Query flexibility is lower than relational databases
- Requires careful data modeling to avoid hot partitions
Examples: Apache Cassandra, ScyllaDB, HBase
Common pitfalls
- Poor partition key choices causing hotspotting and degraded performance
- Trying to support ad-hoc analytics without ETL into an analytics store
Graph Databases — The Relationship Specialists
Graph databases store data as a network of nodes (entities) and edges (relationships). By treating connections as first-class citizens, they allow high-performance traversal of complex data paths without the heavy cost of JOIN operations.
Characteristics
- Native representation of nodes and edges with rich traversal capabilities
- Efficient for deep, variable-length relationship queries
- Query languages: Cypher, Gremlin, or GQL-like syntaxes
Pros
- Expressive for relationship-heavy domains; traversals are fast
- Intuitive modeling for networks and hierarchies
Cons
- Not ideal for wide analytical queries or massive ad-hoc aggregations
- Scaling can be more complex than key-value or column stores
Examples: Neo4j, Amazon Neptune, JanusGraph
Common pitfalls
- Modeling everything as a graph when simpler data models suffice
- Ignoring traversal complexity — performance degrades with high-degree nodes
Time-Series Databases — Optimized for Temporal Data
Purpose-built to store and query data points indexed by time — sensor readings, server logs, stock market fluctuations. Essential for applications where tracking trends and anomalies over time is critical.
Characteristics
- Time is a first-class citizen (efficient append, retention, downsampling)
- Built-in functions for aggregations over windows and rate calculations
- Often compact storage and compression for high-ingest workloads
Pros
- Excellent for metric retention, querying recent data, and rollups
- Features like retention policies and continuous aggregations
Cons
- Not a general-purpose store for arbitrary relational data
- May need separate systems for long-term archival or complex joins
Examples: InfluxDB, TimescaleDB (Postgres extension), Prometheus
Common pitfalls
- Using a TSDB for non-temporal data
- Trying to join TSDB data with relational transactional data without ETL
Search Engines / Full-Text Stores — The Queryable Text Engine
These systems excel at ranking results by relevance and handling typos through fuzzy matching. The industry standard for powering site-wide search bars, log analysis tools, and document discovery.
Characteristics
- Inverted indexes optimized for text search and relevance ranking
- Support for filters, facets, and near-real-time indexing
- Powerful query DSLs for scoring and boosting
Pros
- Great for full-text search, autocomplete, and ranked results
- Can support analytics over indexed fields
Cons
- Not designed as a source-of-truth transactional database
- Indexing lag and eventual consistency between primary store and index
Examples: Elasticsearch, OpenSearch, Algolia
Common pitfalls
- Treating the search index as the primary data store (data loss risk)
- Forgetting to reindex when source data schema or semantics change
NewSQL / Distributed SQL — SQL with Scale
NewSQL databases offer both strict ACID consistency and massive horizontal scalability. They automate data distribution across cloud clusters, removing the need for manual sharding while maintaining a standard SQL interface.
Characteristics
- SQL and ACID semantics combined with distributed architecture
- Built-in sharding/replication to scale reads and writes
Pros
- Familiar SQL model with modern horizontal scaling
- Often simpler to operate than hand-sharded RDBMS clusters
Cons
- Complexity and operational costs may still be higher than single-node RDBMS
- Some trade-offs in latency or consistency depending on config
Examples: CockroachDB, Google Spanner, YugabyteDB
Common pitfalls
- Expecting the same latency characteristics as a single-machine RDBMS
- Underestimating the operational model (geo-partitioning, transaction latency)
Multi-Model Databases — One Engine, Multiple Models
Versatile platforms built to handle multiple data structures — documents, graphs, relational tables — within a single unified engine. Instead of managing separate databases for different data types, you store and query varied structures using one consistent API.
Characteristics
- Support for two or more data models within one engine
- Aims to reduce polyglot persistence complexity
Pros
- Use the right model for each feature without managing multiple systems
- Simplified operational footprint
Cons
- Each model may not be best-in-class; vendor lock-in risk
- Complexity in modeling and backups across models
Examples: ArangoDB, Azure Cosmos DB
Common pitfalls
- Believing multi-model removes the need for careful modeling and performance testing
Vector Databases — The AI-Powered Semantic Engine
AI models convert text, images, or audio into vectors, and these databases find the most similar vectors using semantic meaning rather than exact keyword matches. The backbone of modern RAG pipelines and AI-powered search.
Characteristics
- Stores high-dimensional vectors (embeddings) generated by ML models
- Optimized for similarity search (Approximate Nearest Neighbor — ANN)
- Supports hybrid search (metadata filtering + vector similarity)
Pros
- Highly efficient at finding semantically similar items at scale
- Enables advanced AI use cases like LLM memory and RAG
- Flexible support for various distance metrics (cosine, L2, inner product)
Cons
- High memory footprint; vector indexes (like HNSW) can be RAM-intensive
- Query results are probabilistic (approximate), not deterministic
- Requires a pipeline to generate and sync embeddings from source data
Examples: Pinecone, Milvus, Weaviate, Qdrant, pgvector (PostgreSQL extension)
Who uses them: AI startups, platforms building generative AI features, search infrastructure teams
Common pitfalls
- Storing massive raw data in the vector DB — use it for embeddings, store raw data in a traditional DB
- Ignoring embedding staleness when source content changes
- Choosing the wrong distance metric or index parameters, leading to poor recall/precision
How I Choose: Practical Heuristics
Start with the shape of your data and queries:
- Heavily relational and transactional → RDBMS or Distributed SQL
- Document-like, evolving schema → Document DB
- Extreme single-key performance → Key–Value store
- Traversals and relationships are core → Graph DB
- Metric/time-based workload → Time-Series DB
Then apply these filters:
- Prioritize correctness first for money/health/safety systems (ACID > scale). You can scale later.
- Consider operational burden: prefer familiar tools unless scale demands otherwise.
- Prototype the hot paths: build and load-test a minimal model of your expected queries and traffic.
- Plan for backups, migrations, and observability from day one.
Quick Reference: Decision Table
| Database Type | Best Used For | Example Use Cases | Popular Options |
|---|---|---|---|
| Relational (SQL) | Structured data with relationships & transactions | Banking, E-commerce orders, ERP, CRM | PostgreSQL, MySQL, SQL Server |
| Document | Flexible, schema-less JSON data | CMS, Product catalogs, User profiles | MongoDB, Couchbase |
| Key-Value | Ultra-fast reads and writes | Caching, Sessions, Rate limiting | Redis, DynamoDB |
| Vector | Semantic search using AI embeddings | RAG, AI chatbots, Recommendations | Pinecone, Weaviate, Milvus |
| Graph | Highly connected data and relationships | Social networks, Fraud detection, Knowledge graphs | Neo4j, Amazon Neptune |
| Time-Series | Data that changes over time | Monitoring, IoT sensors, Stock data | InfluxDB, TimescaleDB |
| Search | Full-text search and analytics | Log analysis, Search engines, Observability | Elasticsearch, OpenSearch |
| Columnar | Analytical workloads and large-scale reporting | Data warehouses, BI dashboards | ClickHouse, Snowflake |
Quick Design Patterns
- Read-heavy with complex joins → RDBMS + read replicas or materialized views
- Flexible user data + frequent reads → Document DB with controlled denormalization
- High ingest telemetry → TSDB or column-family store with downsampling pipeline
- Low-latency session/cache → Redis or managed in-memory KV
- Search plus data store → Primary store (RDBMS/DocDB) + search index (Elasticsearch) with a sync strategy
Originally published on gauravbytes.dev
Top comments (0)