When people think about databases, they usually think about SQL, queries, tables, indexes, and transactions.
But underneath all of those features lies something even more fundamental:
Data Structures.
Every database engine—from PostgreSQL and MySQL to MongoDB and SQLite—is essentially a collection of carefully chosen data structures working together.
If you want to understand how databases really work—or build your own database engine from scratch—you need to understand the data structures that make modern databases possible.
In this article, we'll explore the most important data structures used in database development, why they exist, and where they are used.
Why Databases Need Specialized Data Structures
Imagine a database containing:
- 1 million rows
- 100 million rows
- 1 billion rows
A simple linear search through all records would be far too slow.
Databases must solve several critical problems:
- Fast lookups
- Efficient inserts
- Efficient updates
- Fast range queries
- Minimal disk I/O
- Efficient memory usage
- Concurrent access by many users
Different data structures solve different parts of these problems.
There is no single "perfect" data structure.
Instead, database engines combine multiple structures together.
1. Arrays
What is an Array?
An array is a contiguous block of memory containing elements of the same type.
int numbers[5] = {10,20,30,40,50};
Arrays provide:
- O(1) random access
- Cache-friendly memory layout
- Simple implementation
Where Databases Use Arrays
Arrays appear everywhere inside database engines:
Buffer Pools
Databases load disk pages into memory.
Buffer Pool
[Page 1]
[Page 2]
[Page 3]
[Page 4]
The buffer pool is often managed using arrays.
Query Execution
Intermediate query results frequently use arrays.
Example:
SELECT * FROM users;
Rows returned by the executor may temporarily be stored in array-like structures.
Columnar Databases
Column stores often keep values in arrays:
Age Column
20
25
30
40
50
This layout enables efficient analytics.
2. Linked Lists
What is a Linked List?
A linked list stores elements connected through pointers.
[Node] -> [Node] -> [Node]
Each node contains:
struct Node {
int value;
struct Node* next;
};
Where Databases Use Linked Lists
Buffer Management
Many databases maintain free pages using linked lists.
Free Page List
Page 5 -> Page 8 -> Page 10
Transaction Management
Active transactions are often tracked through linked structures.
Memory Allocators
Database memory managers frequently organize free blocks using linked lists.
3. Hash Tables
What is a Hash Table?
A hash table maps:
Key -> Value
through a hash function.
Example:
"Ali" -> 102
"Sara" -> 211
Why Databases Love Hash Tables
Hash tables provide nearly:
O(1)
lookup time.
This makes them ideal for exact-match searches.
Where Hash Tables Are Used
Query Execution
Hash Join:
SELECT *
FROM orders
JOIN customers
ON orders.customer_id = customers.id;
Database builds a hash table for one side of the join.
Query Cache
Many engines use hash tables for:
Query -> Result
mapping.
Metadata Lookup
Finding:
- tables
- indexes
- columns
- users
is often done through hash structures.
Limitation
Hash tables cannot efficiently perform:
WHERE age BETWEEN 20 AND 30
Range queries require ordered structures.
That's why databases also use trees.
4. Binary Search Trees (BST)
A BST stores values in sorted order.
50
/ \
25 75
Properties:
Left < Root < Right
Why Databases Rarely Use Plain BSTs
A normal BST can become:
1
\
2
\
3
\
4
Performance degrades to:
O(n)
Therefore databases prefer balanced trees.
5. AVL Trees
AVL Trees automatically balance themselves.
30
/ \
20 40
Height remains controlled.
Database Usage
AVL trees are sometimes used in:
- in-memory indexes
- metadata structures
- cache management
However, they are less common for large disk-based indexes.
6. Red-Black Trees
Red-Black Trees are balanced binary search trees.
Operations:
Search O(log n)
Insert O(log n)
Delete O(log n)
Where Used
Many database internals use Red-Black Trees for:
- lock management
- transaction tracking
- memory management
- scheduling structures
7. B-Trees
This is where databases become interesting.
B-Trees are arguably the most important data structure in database history.
The Problem
Disk access is expensive.
Reading one node per disk access would be too slow.
The Solution
Store many keys in one node.
|10|20|30|40|
Each node contains multiple keys.
This dramatically reduces tree height.
Example
Instead of:
Height = 20
you might get:
Height = 3
for millions of records.
Why Databases Use B-Trees
Advantages:
- Fast lookup
- Fast insertion
- Fast deletion
- Excellent range queries
- Minimal disk reads
Used In
Traditional database indexes:
CREATE INDEX idx_users_name
ON users(name);
Often becomes a B-Tree index internally.
8. B+ Trees
Most modern databases actually use B+ Trees.
Not plain B-Trees.
Structure
Internal nodes:
Only keys
Leaf nodes:
Actual records
Leaf Nodes Are Linked
[Leaf1] <-> [Leaf2] <-> [Leaf3]
This is extremely important.
Why?
Range queries become very fast.
Example:
SELECT *
FROM users
WHERE age BETWEEN 20 AND 40;
Database finds the first record and then scans linked leaf pages.
Used By
- MySQL InnoDB
- SQLite
- Many relational databases
B+ Trees are arguably the king of database indexing.
9. LSM Trees (Log-Structured Merge Trees)
Modern databases face a different challenge:
Massive write workloads.
Problem
Updating a B+ Tree repeatedly causes random disk writes.
Random writes are expensive.
Solution
LSM Trees.
Writes first go to memory.
MemTable
Later they are flushed to disk.
SSTables
Background processes merge files.
Benefits
Excellent write performance.
Ideal for:
- Logging systems
- Analytics systems
- Distributed databases
Used By
10. Skip Lists
A Skip List is a probabilistic alternative to balanced trees.
Level 3 ------
Level 2 ------
Level 1 ------
Level 0 ------
Multiple layers accelerate searching.
Advantages
- Simpler implementation
- Good performance
- Easy concurrent operations
Used By
- Redis sorted sets
- Some in-memory databases
11. Tries (Prefix Trees)
A Trie stores strings character by character.
Example:
car
cat
can
Shared prefixes are reused.
Database Usage
Useful for:
- Full-text search
- Autocomplete
- Dictionary lookups
- Query suggestions
Example
c
|
a
|
+--r
+--t
+--n
12. Heaps (Priority Queues)
A Heap allows quick access to:
Smallest
or
Largest
element.
Database Usage
Query Optimization
Database planners choose execution strategies using priority structures.
Top-K Queries
ORDER BY score DESC
LIMIT 10;
Heaps help efficiently maintain the top results.
13. Bloom Filters
A Bloom Filter is a probabilistic structure.
It answers:
"Possibly exists"
or
"Definitely does not exist"
Why Useful
Avoids unnecessary disk reads.
Used By
- Cassandra
- RocksDB
- BigTable-inspired systems
Benefit
A tiny memory footprint can save huge amounts of I/O.
14. Graph Structures
Graph databases use nodes and edges.
Alice ---- Friend ---- Bob
Used By
Perfect For
- Social networks
- Recommendation systems
- Fraud detection
- Knowledge graphs
Importance Ranking for Database Engine Developers
If your goal is to build a database engine from scratch, focus on these first:
| Data Structure | Importance |
|---|---|
| Arrays | ⭐⭐⭐⭐⭐ |
| Linked Lists | ⭐⭐⭐⭐ |
| Hash Tables | ⭐⭐⭐⭐⭐ |
| Binary Search Trees | ⭐⭐⭐ |
| AVL Trees | ⭐⭐⭐ |
| Red-Black Trees | ⭐⭐⭐⭐ |
| B-Trees | ⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐ |
| B+ Trees | ⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐ |
| LSM Trees | ⭐⭐⭐⭐⭐⭐⭐ |
| Skip Lists | ⭐⭐⭐⭐⭐ |
| Tries | ⭐⭐⭐⭐ |
| Heaps | ⭐⭐⭐ |
| Bloom Filters | ⭐⭐⭐⭐⭐ |
| Graph Structures | Depends on DB type |
Final Thoughts
Database engines are not magical systems. At their core, they are sophisticated combinations of data structures designed to optimize storage, retrieval, indexing, caching, concurrency, and disk access.
If you want to become a database engineer, your learning path should look roughly like this:
- Arrays
- Linked Lists
- Stacks & Queues
- Hash Tables
- Trees
- AVL Trees
- Red-Black Trees
- B-Trees
- B+ Trees
- Heaps
- Tries
- Bloom Filters
- Skip Lists
- LSM Trees
Master these structures, and you'll begin to see database engines not as black boxes, but as collections of elegant algorithms and data structures working together to manage data at scale.
Top comments (0)