<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nikita Kutsokon</title>
    <description>The latest articles on DEV Community by Nikita Kutsokon (@nikita_kutsokon).</description>
    <link>https://dev.to/nikita_kutsokon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2830019%2F2e23c9ae-5db5-46cc-bc89-d4166c868cfa.png</url>
      <title>DEV Community: Nikita Kutsokon</title>
      <link>https://dev.to/nikita_kutsokon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nikita_kutsokon"/>
    <language>en</language>
    <item>
      <title>Databases: Concurrency Control. Part 1</title>
      <dc:creator>Nikita Kutsokon</dc:creator>
      <pubDate>Fri, 25 Apr 2025 12:02:54 +0000</pubDate>
      <link>https://dev.to/nikita_kutsokon/databases-concurrency-control-part-1-10j0</link>
      <guid>https://dev.to/nikita_kutsokon/databases-concurrency-control-part-1-10j0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Imagine you're working in a busy office with multiple people trying to update the same spreadsheet at the same time. One person is adding new information, while another is making changes to existing data. At the same time, others are just looking at the document, but they need to see the latest, accurate version of the data. In this situation, how can we ensure that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;People making changes don’t accidentally overwrite each other’s work?&lt;/li&gt;
&lt;li&gt;People who are just reading the document always see the most up-to-date version, without waiting for someone else to finish their work?&lt;/li&gt;
&lt;li&gt;The document doesn't end up in a jumbled state?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, imagine this happening in a database, where multiple users or applications are trying to read from and write to the same tables at the same time. The challenge is even bigger in this case because databases are designed to store huge amounts of data, and many transactions need to happen simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Is Concurrency Hard in Databases?&lt;/strong&gt;&lt;br&gt;
When multiple users or processes are interacting with the database at once, several things can go wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data consistency issues&lt;/strong&gt; - One user might see outdated information because another user is changing it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locking problems&lt;/strong&gt; - If one transaction locks a row, other transactions might have to wait, leading to delays and frustration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency anomalies&lt;/strong&gt; - Without the right mechanisms in place, transactions could lead to inconsistent results, such as one user overwriting another’s work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Databases need to figure out how to let users work simultaneously without blocking each other, and also make sure the data stays consistent and correct no matter how many people are using it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is MVCC?
&lt;/h2&gt;

&lt;p&gt;The solution lies in using techniques that allow each transaction to see a consistent snapshot of the data without interfering with other transactions. One of the most powerful and widely used techniques in modern databases is called &lt;strong&gt;Multiversion Concurrency Control&lt;/strong&gt; (MVCC). It is a method that enables multiple transactions to access the database simultaneously without blocking each other. This is accomplished by keeping multiple versions of a piece of data instead of a single active version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tyxkr3hgmmqzvt1vfo0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tyxkr3hgmmqzvt1vfo0.png" alt="Table with MVCC" width="584" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In an MVCC-based system, when data is updated or inserted, a new version of the row is created. A transaction that only reads the data continues to see the version that was current when its snapshot was taken. This ensures that read operations are not blocked by write operations, and write operations do not block reads. The new version becomes visible to other transactions only once the writing transaction commits, so concurrent transactions can work with different versions of the same row without interfering with each other. This approach lets databases maintain data consistency and high concurrency while minimizing the need for locking resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of MVCC in Action&lt;/strong&gt;&lt;br&gt;
Let's say three transactions, T1, T2, and T3, are operating on the same row in the database:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transaction T1&lt;/strong&gt; reads a row and decides to update it. This creates version 2 of the row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transaction T2&lt;/strong&gt; also reads the same row (before T1 has committed) and performs a different update, creating version 3 of the row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transaction T3&lt;/strong&gt; reads the row as well and writes a new value to it, creating version 4.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point, there are three new versions of the same row, each carrying a different update from a different transaction. The database must now ensure that the operations from these transactions do not interfere with each other. In the next parts of this article, we will explore how the database resolves conflicts when multiple transactions try to modify the same data; this conflict resolution keeps data consistent and transactions properly isolated.&lt;/p&gt;
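
&lt;p&gt;The version chain above can be made concrete with a toy Python sketch (entirely hypothetical; this is not PostgreSQL’s actual storage format): each write appends a version tagged with the writer’s transaction id, and a reader sees the newest version whose writer appears in the reader’s snapshot of committed transactions.&lt;/p&gt;

```python
# Toy model of MVCC version chains (hypothetical, not PostgreSQL's actual
# on-disk format). Each write appends a version tagged with the writing
# transaction's id; a reader sees the newest version whose writer is in
# the reader's snapshot of committed transactions.

class MVCCRow:
    def __init__(self, initial_value):
        # list of (writer_txid, value); txid 0 is the original version
        self.versions = [(0, initial_value)]

    def write(self, txid, value):
        # a new version is appended; the old one is never overwritten
        self.versions.append((txid, value))

    def read(self, committed_txids):
        # newest version whose writer the reader considers committed
        visible = [v for v in self.versions if v[0] in committed_txids]
        return max(visible)[1]

row = MVCCRow("v1")
row.write(1, "v2")   # T1 creates version 2
row.write(2, "v3")   # T2 creates version 3
row.write(3, "v4")   # T3 creates version 4

print(row.read({0}))       # "v1": snapshot taken before T1-T3 committed
print(row.read({0, 1}))    # "v2": snapshot sees only T1 as committed
```

&lt;p&gt;Note that readers never block writers here: a read is just a scan of the version list, which mirrors how T1, T2, and T3 above each leave their own version behind.&lt;/p&gt;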

&lt;h2&gt;
  
  
  The Hidden Cost of MVCC: Bloat
&lt;/h2&gt;

&lt;p&gt;While Multiversion Concurrency Control offers many benefits, such as high concurrency and reduced locking, it comes with its own set of challenges. One of the most significant drawbacks of MVCC is bloat. In this section, we’ll explore what bloat is, how it occurs in MVCC systems, and its impact on database performance.&lt;/p&gt;

&lt;p&gt;Bloat refers to the buildup of outdated, "dead" row versions that MVCC leaves behind. As time passes, these obsolete versions accumulate and consume storage space, leading to inefficient resource usage. If left unmanaged, this waste compounds over time and can significantly degrade the overall health and performance of the database. Let’s look at the main ways bloat affects a database:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased Storage Usage&lt;/strong&gt; 💾🫠&lt;br&gt;
The primary cost of bloat is that it leads to increased disk space usage. As new versions of rows accumulate and old versions remain in the database, the overall size of the database increases. This means more disk space is needed for storage, and more memory is required to process queries that involve large amounts of outdated data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Slower Query Performance&lt;/strong&gt; 🏃😶&lt;br&gt;
As bloat accumulates, queries may need to process more data than necessary. Even though old versions of rows are no longer needed, they still have to be read and checked during queries, especially if they are not cleaned up properly. This can slow down query performance, particularly for read-heavy operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased Maintenance Overhead&lt;/strong&gt; 🛠️☹️&lt;br&gt;
Managing and cleaning up bloat introduces additional complexity. Dead tuples must be removed periodically, which requires an active maintenance process. Without proper cleanup, the database becomes less efficient and slows down over time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Enter VACUUM 🦸🌟
&lt;/h2&gt;

&lt;p&gt;To combat the negative effects of bloat, databases like PostgreSQL use a process called VACUUM. VACUUM is designed to reclaim storage space by removing outdated or "dead" row versions that are no longer visible to any transaction. Over time, as multiple versions of rows accumulate due to ongoing transactions, the database becomes inefficient and sluggish; without VACUUM, this bloat would keep growing, wasting storage and slowing queries. VACUUM keeps the database running efficiently by removing these obsolete row versions while respecting transaction visibility rules, so that ongoing operations aren't impacted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvtghkykc7ih0zvkklnt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvtghkykc7ih0zvkklnt.png" alt="MVCC cleaning" width="800" height="719"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of VACUUM like a housekeeper in a busy office. Just as an office becomes cluttered if old, irrelevant papers are never cleared away, a database accumulates dead rows unless something tidies up. VACUUM performs that role, keeping the database tidy and efficient.&lt;/p&gt;

&lt;p&gt;PostgreSQL offers several types of VACUUM processes to manage this cleanup. The standard VACUUM reclaims storage and removes dead rows without locking the table, allowing the database to continue operating with minimal disruption. VACUUM FULL, on the other hand, not only removes dead rows but also compacts the table by physically rewriting it, reducing its disk space usage. Standard VACUUM runs frequently as part of regular database maintenance (usually via autovacuum), while VACUUM FULL is typically reserved for reclaiming large amounts of space, such as after mass deletions or updates.&lt;/p&gt;
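
&lt;p&gt;Conceptually, VACUUM walks the version chains and discards versions that no running transaction can still see. Here is a hedged toy sketch in Python (real PostgreSQL tracks visibility with xmin/xmax and the oldest active transaction id, not with lists like this):&lt;/p&gt;

```python
# Hedged sketch of what VACUUM does conceptually. A version is "dead"
# once a newer version exists that every active transaction can already
# see; dead versions are reclaimed.

def vacuum(versions, oldest_active_txid):
    # versions: list of (writer_txid, value), oldest first, all committed
    keep = []
    for i, (txid, value) in enumerate(versions):
        superseded = any(
            newer > txid and oldest_active_txid > newer
            for newer, _ in versions[i + 1:]
        )
        if not superseded:
            keep.append((txid, value))
    return keep

versions = [(0, "v1"), (1, "v2"), (2, "v3")]

# The oldest running transaction started after txid 2 committed, so no
# one can still see v1 or v2: both are dead tuples and get reclaimed.
print(vacuum(versions, 3))   # [(2, 'v3')]

# A long-running old transaction blocks cleanup, which is one reason
# long transactions cause bloat in practice.
print(vacuum(versions, 1))   # all three versions kept
```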

&lt;h2&gt;
  
  
  How MVCC and VACUUM Work Together 🤝
&lt;/h2&gt;

&lt;p&gt;MVCC and VACUUM work together to maintain PostgreSQL’s performance and consistency. MVCC enables multiple transactions to run concurrently by creating new versions of data when updates or inserts occur, allowing reads to access the original version without waiting for writes. However, this leads to the accumulation of outdated row versions, causing bloat. VACUUM addresses this issue by periodically removing these "dead" versions, reclaiming storage space, and ensuring the database remains efficient. Together, MVCC allows high concurrency while VACUUM prevents performance degradation, keeping the database both responsive and well-maintained.&lt;/p&gt;

&lt;h2&gt;
  
  
  Helpful Links 🤓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/what-is-vacuum-in-postgresql/" rel="noopener noreferrer"&gt;What is Vacuum in PostgreSQL ?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/what-is-multi-version-concurrency-control-mvcc-in-dbms/" rel="noopener noreferrer"&gt;What is Multi-Version Concurrency Control (MVCC) in DBMS ?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=xnps3zQCcZc" rel="noopener noreferrer"&gt;PostgreSQL Understanding Vacuum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=TBmDBw1IIoY" rel="noopener noreferrer"&gt;PostgreSQL Internals in Action: MVCC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=iM71d2krbS4" rel="noopener noreferrer"&gt;Multiversion Concurrency Control (MVCC) Explained in Simple Terms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=meU2qKRzkCM" rel="noopener noreferrer"&gt;A Detailed Understanding of MVCC and Autovacuum Internals in PostgreSQL 14 - Avinash Vallarapu&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>database</category>
      <category>systemdesign</category>
      <category>backenddevelopment</category>
    </item>
    <item>
      <title>Databases: Replication</title>
      <dc:creator>Nikita Kutsokon</dc:creator>
      <pubDate>Wed, 23 Apr 2025 10:11:52 +0000</pubDate>
      <link>https://dev.to/nikita_kutsokon/databases-replication-1cfp</link>
      <guid>https://dev.to/nikita_kutsokon/databases-replication-1cfp</guid>
      <description>&lt;h2&gt;
  
  
  Why Replication Matters
&lt;/h2&gt;

&lt;p&gt;Database replication plays a crucial role in ensuring that business systems remain resilient, responsive, and reliable. For modern companies, data is at the heart of nearly every operation — from handling customer transactions to running analytics and decision-making tools. If the primary database fails, even for a few minutes, it can lead to lost revenue, broken services, and a damaged reputation. Replication helps prevent this by creating and maintaining exact copies of the database (replicas) on other servers, often in real-time. These replicas can be used for &lt;strong&gt;failover&lt;/strong&gt;, meaning if the main system goes down, a replica can immediately take over with minimal disruption.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1i4jelrye19wzendhefb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1i4jelrye19wzendhefb.png" alt="Database replication" width="300" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beyond disaster recovery, replication supports &lt;strong&gt;performance&lt;/strong&gt; and &lt;strong&gt;scalability&lt;/strong&gt;. In high-traffic environments, read-heavy operations (like showing product listings or dashboards) can be redirected to replicas, easing the load on the primary server. This is known as read scaling, and it's especially valuable for applications with many users or global reach. Replication also enables data locality — storing copies closer to users in different regions — which reduces latency and improves user experience. Additionally, developers and analysts can run heavy reporting or analytics on replicas without affecting the main production system, which ensures better system stability. In essence, database replication is a foundational practice for building systems that are fast, fault-tolerant, and future-ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcaje2zu8ea3lwbhr37l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcaje2zu8ea3lwbhr37l.png" alt="Read replicas for database" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Replication Topologies
&lt;/h2&gt;

&lt;p&gt;In distributed systems, replication is a key technique used to enhance availability, fault tolerance, and performance. The structure and behavior of data replication are determined by the replication topology, which defines how data flows and is synchronized across multiple nodes. Each topology offers a different balance of consistency, latency, scalability, and complexity, making it suitable for various types of applications. Below are the main replication topologies, along with their advantages, disadvantages, and real-world examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Single-Leader&lt;/strong&gt;&lt;br&gt;
In a single-leader replication topology, one node, called the leader, handles all write operations, while one or more replica nodes handle read operations by copying data from the leader. This setup ensures strong consistency because all writes go through a single node, which makes it easier to reason about system state and resolve conflicts. However, the leader becomes a single point of failure—if it goes down, the system cannot process writes until a new leader is chosen. Additionally, write throughput is limited by the leader's capacity, and there may be read inconsistencies if the replicas lag behind.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fgmhh8xfe787skl5qgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fgmhh8xfe787skl5qgt.png" alt="Single-Leader replication topology" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🤓 This topology is ideal for systems where strong consistency is a priority and write operations need to be tightly controlled, such as financial applications or systems where it is critical that all data changes are reflected correctly. It is also useful when read scalability is needed, as replicas can handle read-heavy workloads without affecting the leader's performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multi-Leader&lt;/strong&gt;&lt;br&gt;
A multi-leader topology allows multiple nodes to accept write operations and synchronize with each other. This setup is particularly beneficial for geographically distributed applications, where users across different regions need to perform updates with minimal delay. By having multiple leaders, the system can offer better write availability and reduce latency for users who are far apart. However, this topology introduces the challenge of conflict resolution, as multiple nodes may try to write to the same data simultaneously, and syncing those changes becomes more complex. There’s also the risk of data divergence if synchronization between leaders is delayed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9hq9qcpdz6e3rwpmw3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9hq9qcpdz6e3rwpmw3l.png" alt="Multi-Leader replication" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🤓This topology is most useful for geo-distributed applications that require low-latency writes, such as collaborative platforms or social media apps where users across different regions need to interact and update data simultaneously without waiting for a central leader to process the changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Leaderless&lt;/strong&gt;&lt;br&gt;
In a leaderless replication topology, every node is equal and can accept both read and write operations. Consistency is typically achieved with quorum-based approaches: a write must be acknowledged by enough nodes, and a read must consult enough nodes, that any read set overlaps the latest write set. There is no designated leader in this model, so the system avoids the single point of failure that exists in topologies like single-leader. However, leaderless replication usually provides only eventual consistency, meaning that data might not be immediately consistent across all nodes, especially during network partitions or concurrent updates. Conflict resolution also becomes more complex as the number of nodes increases.&lt;/p&gt;

&lt;p&gt;Leaderless replication can be implemented in several configurations, each with its own advantages and trade-offs. Three common configurations are star, circle, and all-to-all:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Star&lt;/strong&gt; - In this setup, one node acts as a central hub that coordinates communication between other nodes. The central node may not necessarily handle all writes, but it ensures that the data is synchronized across all nodes. This configuration reduces the complexity of having every node communicate with each other but can introduce a single point of failure at the central hub if not managed properly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circle&lt;/strong&gt; - The nodes are arranged in a circular fashion, where each node communicates with its immediate neighbors. When a write happens, it propagates through the circle of nodes until it is agreed upon by a majority. The circle setup helps distribute the load more evenly compared to the star configuration, but there is still potential for delays in synchronization due to the sequential nature of the communication.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;All-to-All&lt;/strong&gt; - In an all-to-all setup, each node communicates directly with every other node. This configuration allows for high redundancy and fault tolerance, as there is no central point of communication. However, it can be complex to manage because the system must ensure that all nodes agree on data changes, which can lead to conflicts and synchronization issues in larger networks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfegx28w8oukifyhpqbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfegx28w8oukifyhpqbc.png" alt="Leaderless replication" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🤓 This topology is ideal for highly available, partition-tolerant systems where eventual consistency is acceptable. It is often used in systems that require horizontal scaling without a single point of failure, such as large-scale e-commerce websites or distributed data storage systems that need to handle massive traffic loads without downtime.&lt;/p&gt;
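
&lt;p&gt;The quorum rule that makes leaderless reads safe can be sketched in a few lines of Python (a toy, with hypothetical names): with N replicas, a write waits for W acknowledgements and a read consults R replicas; choosing W + R &gt; N guarantees every read set overlaps the latest write set.&lt;/p&gt;

```python
# Toy sketch of quorum-based leaderless replication (names hypothetical):
# with N replicas, writes wait for W acks and reads consult R replicas.
# Choosing W + R greater than N guarantees every read quorum overlaps the
# latest write quorum, so a read can pick the value with the highest version.
import random

N, W, R = 5, 3, 3   # W + R = 6 exceeds N = 5

replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, version):
    # send to all nodes, succeed once W acknowledge (here: any W nodes)
    for node in random.sample(replicas, W):
        node["version"], node["value"] = version, value

def read():
    # consult R replicas and return the newest value seen
    sampled = random.sample(replicas, R)
    newest = max(sampled, key=lambda node: node["version"])
    return newest["value"]

write("hello", version=1)
# Any 3-node read sample must intersect the 3 nodes that acknowledged
# the write, so the read always finds version 1.
print(read())   # "hello"
```

&lt;p&gt;Dropping W or R so that W + R no longer exceeds N trades this guarantee away for lower latency, which is exactly the eventual-consistency trade-off described above.&lt;/p&gt;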

&lt;h2&gt;
  
  
  Replication Strategies
&lt;/h2&gt;

&lt;p&gt;When it comes to database replication, the strategy you choose will depend on your business needs and how you want your system to function. Here are the most common replication strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full Replication&lt;/strong&gt;&lt;br&gt;
In full replication, every replica of the database contains an identical copy of all data. This means that each server stores a complete version of the database, ensuring that data is always available across multiple locations. This approach is particularly useful for systems where high availability and fast access to all data are critical. It’s like having several backup copies of everything in the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partial Replication&lt;/strong&gt;&lt;br&gt;
Partial replication is when only a subset of the database is replicated to each server. Instead of duplicating the entire database, each replica holds only part of the data that’s needed for its specific use case. For example, one server may hold customer data, while another holds order information. This approach is more efficient when certain types of data are accessed more frequently than others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snapshot Replication&lt;/strong&gt;&lt;br&gt;
Snapshot replication involves periodically taking a snapshot (or copy) of the database and sending it to the replicas. Rather than constantly updating the replicas in real time, the system refreshes the replica at set intervals—such as once a day or once an hour. This is useful when you don’t need constant updates, but want to ensure that all replicas are eventually up-to-date at specific times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transactional Replication&lt;/strong&gt;&lt;br&gt;
Transactional replication is a more dynamic approach where data changes (such as inserts, updates, and deletes) are continuously propagated to the replicas. This ensures that all replicas remain in sync with the leader server in near real-time. Whenever there is a change to the database, whether it's adding a new record or modifying an existing one, those changes are immediately reflected across all replicas. This is ideal for applications that require near-instant consistency across all nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Merge Replication&lt;/strong&gt;&lt;br&gt;
Merge replication allows multiple replicas to independently modify the data. Changes made at different nodes are later merged together to maintain consistency across all replicas. This approach is especially useful in scenarios where different users or locations need to make updates simultaneously, such as collaborative platforms or distributed systems. Once the data is merged, any conflicting changes are resolved using pre-defined rules to ensure consistency.&lt;/p&gt;

&lt;p&gt;🤓 Each replication strategy serves a different purpose, and choosing the right one depends on your system’s needs. For businesses that require high availability and low latency, full replication might be the best option. For applications with less critical real-time data needs, snapshot replication might be a more efficient choice. On the other hand, transactional replication works well for systems that require consistency and real-time synchronization. Merge replication is perfect for systems where independent data changes are common, and data needs to be synchronized later.&lt;/p&gt;
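
&lt;p&gt;A tiny Python sketch (all names made up) makes the contrast between snapshot and transactional replication concrete: the snapshot replica is refreshed wholesale at intervals, while the transactional replica receives every change as it commits.&lt;/p&gt;

```python
# Made-up minimal contrast between two of the strategies above: the
# transactional replica receives each change as it commits, while the
# snapshot replica is only refreshed wholesale at set intervals.

primary = {"a": 1}
replica_txn = dict(primary)    # initial sync, then per-change streaming
replica_snap = dict(primary)   # refreshed only when a snapshot is taken

def apply_change(key, value):
    primary[key] = value
    replica_txn[key] = value   # transactional: shipped immediately

apply_change("b", 2)
print(replica_txn == primary)    # True: in sync after every commit
print(replica_snap == primary)   # False: stale until the next snapshot

replica_snap = dict(primary)     # the periodic snapshot runs
print(replica_snap == primary)   # True again
```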

&lt;h2&gt;
  
  
  Synchronous vs Asynchronous Replication
&lt;/h2&gt;

&lt;p&gt;In the world of database replication, one of the most important decisions revolves around timing: when should data be copied from the primary node to its replicas? This decision shapes the system’s performance, consistency, and resilience — and it boils down to two main strategies: &lt;strong&gt;synchronous&lt;/strong&gt; and &lt;strong&gt;asynchronous&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synchronous Replication&lt;/strong&gt;&lt;br&gt;
Synchronous replication means the primary database and its replicas stay perfectly in sync at all times. Every time a user performs a write — like submitting a transaction or saving a form — that change must be successfully written not only to the primary database but also to one or more replicas before the transaction is confirmed as complete. In other words, the user has to wait until the data is safely stored in all participating databases.&lt;/p&gt;

&lt;p&gt;The main benefit of this strategy is strong consistency. You can be confident that if a node goes down, no recent transactions will be lost — all copies are up-to-date. This is essential in systems where data integrity is critical, such as banking, medical records, or inventory systems. But the cost of this safety is performance: every write is slower, and system responsiveness can suffer if any replica is slow or unreachable. In extreme cases, a single failed node might prevent new transactions from being processed at all.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld7prjo6xe8qn8291ndw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld7prjo6xe8qn8291ndw.png" alt="Synchronous replication" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👷🌎 &lt;strong&gt;Example&lt;/strong&gt; 🌎👷&lt;br&gt;
Online banking platforms and financial exchanges are textbook cases for synchronous replication. In these environments, even a tiny inconsistency — like a payment not registering or a trade executing twice — can result in major financial and legal consequences. Therefore, maintaining perfect synchronization is worth the performance trade-off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Asynchronous Replication&lt;/strong&gt;&lt;br&gt;
Asynchronous replication takes a different approach: it prioritizes speed. When a user submits a write, the primary database processes and commits the change immediately, without waiting for replicas. The changes are sent to replicas afterward, in the background. This dramatically improves performance and responsiveness, making it ideal for systems with very high throughput or where users are distributed globally.&lt;/p&gt;

&lt;p&gt;However, this speed comes at the cost of eventual consistency. If the primary node crashes before all changes reach the replicas, there’s a risk of data loss. This is acceptable in use cases where perfect, real-time consistency is less important — for example, in social media apps, content platforms, or analytics dashboards, where occasional data lag isn’t a major concern.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fata4dpk5k6opr99qllgx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fata4dpk5k6opr99qllgx.png" alt="Asynchronous replication" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👷🌎 &lt;strong&gt;Example&lt;/strong&gt; 🌎👷&lt;br&gt;
Instagram’s “like” system is a great example of asynchronous replication in action. When you double-tap a photo, your like is recorded instantly, giving you fast feedback — but the update might take a second or two to appear for your friends or even disappear briefly during high load. That’s okay, because likes are non-critical data. Prioritizing speed over strict consistency enables Instagram to serve billions of actions per day efficiently without slowing the user experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztoetqj25zuck9kcfkw6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztoetqj25zuck9kcfkw6.jpg" alt="Sync vs Async replication" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
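
&lt;p&gt;The contrast pictured above can be sketched in Python (a hedged toy; all names are hypothetical): in synchronous mode a write returns only after the replica has applied it, while in asynchronous mode the primary commits first and replication is deferred, so a replica read may briefly return stale data.&lt;/p&gt;

```python
# Toy model of synchronous vs asynchronous replication (hypothetical).
# Synchronous: the write waits for the replica before returning.
# Asynchronous: the primary commits immediately; the replica catches up
# later in the background, so it can briefly lag behind.

class ReplicatedDB:
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []   # changes not yet shipped to the replica

    def write(self, key, value, synchronous):
        self.primary[key] = value
        if synchronous:
            self.replica[key] = value          # wait for replica ack
        else:
            self.pending.append((key, value))  # ship later, in background

    def flush(self):
        # background replication catching up (async mode)
        for key, value in self.pending:
            self.replica[key] = value
        self.pending = []

db = ReplicatedDB()
db.write("balance", 100, synchronous=True)
print(db.replica.get("balance"))   # 100: replica consistent immediately

db.write("likes", 42, synchronous=False)
print(db.replica.get("likes"))     # None: replica briefly lags behind
db.flush()
print(db.replica.get("likes"))     # 42 after background catch-up
```

&lt;p&gt;The "likes" key lagging until &lt;code&gt;flush()&lt;/code&gt; runs is exactly the short delay in the Instagram example; the "balance" key never lags, which is what banking systems pay for in write latency.&lt;/p&gt;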

&lt;h2&gt;
  
  
  Database Backup vs Replication
&lt;/h2&gt;

&lt;p&gt;Though often mentioned together, database backup and replication serve very different purposes in a data infrastructure. Understanding their roles can help businesses make smarter choices when designing systems for reliability, disaster recovery, and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database Backup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Backups are periodic snapshots of your database, usually stored in a separate location from the original system. Their primary goal is &lt;strong&gt;data recovery&lt;/strong&gt; — if something catastrophic happens (like accidental data deletion, or corruption), you can restore your system to a previous, known-good state.&lt;/p&gt;
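&lt;p&gt;For example, with PostgreSQL such a snapshot is commonly taken with &lt;code&gt;pg_dump&lt;/code&gt; (the database and file names below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Take a logical snapshot of one database into a file
pg_dump mydb &gt; mydb_2025-04-25.sql

# Later, restore the known-good state into a fresh database
createdb mydb_restored
psql mydb_restored &lt; mydb_2025-04-25.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;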

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycnfzvpixvldxmdltyyx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycnfzvpixvldxmdltyyx.jpg" alt="Database Backup" width="445" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👷🌎 Backups are ideal when you need long-term recovery options or protection from human error. They’re also essential for compliance in industries where historical data must be retained, such as healthcare or finance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database Replication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Replication, on the other hand, is about maintaining live copies of your data across multiple servers in real time or near-real time. It’s designed for &lt;strong&gt;high availability&lt;/strong&gt;, &lt;strong&gt;load balancing&lt;/strong&gt;, and &lt;strong&gt;fault tolerance&lt;/strong&gt;. If one server goes down, another replica can immediately take over with little or no downtime. Unlike backups, replication doesn’t protect against logical errors — if bad data is written to the primary database, it’s usually also written to all replicas. Replication is about system continuity, not recovery from past states.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb0w1wprcl68cxevq5w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb0w1wprcl68cxevq5w6.png" alt="Database replication" width="636" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👷🌎 Replication is ideal when you need a system to stay online 24/7, or when you're serving global users and want to distribute read requests closer to them. It’s widely used in microservices, content platforms, SaaS applications, and any system requiring real-time access to data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Considerations and Challenges
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Helpful Links 🤓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.qlik.com/us/data-replication/database-replication#:~:text=Database%20replication%20refers%20to%20the,system%20fault%2Dtolerance%20and%20reliability." rel="noopener noreferrer"&gt;Database Replication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rivery.io/data-learning-center/complete-guide-to-data-replication/" rel="noopener noreferrer"&gt;Complete Guide to Database Replication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.fivetran.com/learn/database-replication" rel="noopener noreferrer"&gt;Database replication: Definition, types and setup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=jLEp1XI_L6Q" rel="noopener noreferrer"&gt;Database Replication &amp;amp; Sharding Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=bI8Ry6GhMSE" rel="noopener noreferrer"&gt;Database Replication Explained (in 5 Minutes)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=aE2UPg3Ckck" rel="noopener noreferrer"&gt;All Types of Database Replication Discussed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=WG6k74VSOOU" rel="noopener noreferrer"&gt;Database Replication Explained | System Design Interview Basics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>backenddevelopment</category>
      <category>webdev</category>
      <category>database</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Databases: SQL vs NoSQL</title>
      <dc:creator>Nikita Kutsokon</dc:creator>
      <pubDate>Fri, 21 Mar 2025 11:42:06 +0000</pubDate>
      <link>https://dev.to/nikita_kutsokon/databases-sql-vs-nosql-4f1j</link>
      <guid>https://dev.to/nikita_kutsokon/databases-sql-vs-nosql-4f1j</guid>
      <description>&lt;h2&gt;
  
  
  What is SQL?
&lt;/h2&gt;

&lt;p&gt;SQL (Structured Query Language) is a specialized language designed for managing and manipulating databases. It enables users to efficiently store, retrieve, update, and delete structured data within relational databases.&lt;/p&gt;

&lt;p&gt;A relational database is a type of database that organizes data into tables (relations) consisting of rows and columns. Each table has a predefined schema, ensuring data consistency and relationships between different tables. These relationships are established using primary keys and foreign keys, allowing efficient data retrieval and integrity. Let's look at some examples of SQL code along with their purpose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetches all users from the table
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Updates the email of the user with ID = 1
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt; 
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'john.doe@example.com'&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Deletes the user with ID = 1
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
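
&lt;p&gt;The primary-key and foreign-key relationships mentioned above can be sketched with two tables (the &lt;em&gt;Orders&lt;/em&gt; table is hypothetical, added only to illustrate the link):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE Users (
    id    INT PRIMARY KEY,        -- primary key: uniquely identifies each user
    name  VARCHAR(100),
    email VARCHAR(255)
);

CREATE TABLE Orders (
    id      INT PRIMARY KEY,
    user_id INT REFERENCES Users(id),  -- foreign key: ties each order to a user
    total   DECIMAL(10, 2)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;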



&lt;h2&gt;
  
  
  What is NoSQL?
&lt;/h2&gt;

&lt;p&gt;NoSQL (Not Only SQL) is a type of database designed for storing and managing large volumes of unstructured, semi-structured, or rapidly changing data. Unlike traditional relational databases, NoSQL databases do not rely on fixed schemas or tables with rows and columns. Instead, they allow for more flexible data models that can accommodate a variety of data types such as documents, key-value pairs, graphs, and wide-column stores.&lt;/p&gt;

&lt;p&gt;NoSQL databases are often used in applications that require high performance, scalability, and the ability to handle big data or rapidly changing data. Here are a few examples of NoSQL code and their purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insert a document into a Users collection (MongoDB, a document store)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insertOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;John Doe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;john@example.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;age&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;address&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;street&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;123 Elm St.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;city&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Springfield&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zip&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;12345&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Find all movies liked by John Doe's friends (Cypher, the query language of the Neo4j graph database)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nc"&gt;MATCH &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;john&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;User&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;John Doe&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="nx"&gt;FRIEND&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;friend&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="nx"&gt;LIKES&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;movie&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;RETURN&lt;/span&gt; &lt;span class="nx"&gt;friend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;movie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Retrieve all items from the cart (Redis, a key-value store)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;LRANGE&lt;/span&gt; &lt;span class="nx"&gt;cart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pros and Cons of SQL Databases
&lt;/h2&gt;

&lt;p&gt;SQL databases are widely used for storing and managing structured data in a relational model, where data is organized into tables with predefined schemas. They are known for strong consistency, reliability, and complex querying capabilities, making them ideal for applications that require data integrity and structured relationships. However, while SQL databases offer many advantages, they also have limitations, especially when dealing with scalability and unstructured data. Below is a breakdown of the key pros and cons of SQL databases to help determine when they are the right choice: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured Data &amp;amp; Schema Enforcement&lt;/strong&gt;&lt;br&gt;
SQL databases follow a strict schema, ensuring organized and consistent data storage. This makes them well-suited for applications requiring well-defined relationships between data entities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ACID Compliance&lt;/strong&gt;&lt;br&gt;
SQL databases adhere to Atomicity, Consistency, Isolation, and Durability principles, ensuring data accuracy, reliability, and integrity, which is critical for financial and transactional applications.&lt;/p&gt;
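&lt;p&gt;For instance, atomicity is what makes a money transfer safe: wrapped in a transaction, either both updates persist or neither does (the &lt;em&gt;accounts&lt;/em&gt; table is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- if anything fails before this point, ROLLBACK undoes both updates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;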

&lt;p&gt;&lt;strong&gt;Powerful Querying Capabilities&lt;/strong&gt;&lt;br&gt;
SQL provides advanced querying features such as JOIN, aggregations, and indexing, allowing users to retrieve complex data efficiently from multiple tables.&lt;/p&gt;
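&lt;p&gt;A typical example, assuming a hypothetical &lt;em&gt;Orders&lt;/em&gt; table related to &lt;em&gt;Users&lt;/em&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Combine rows from two tables and aggregate per user
SELECT u.name, COUNT(o.id) AS order_count
FROM Users u
JOIN Orders o ON o.user_id = u.id
GROUP BY u.name
HAVING COUNT(o.id) &gt; 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;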

&lt;p&gt;&lt;strong&gt;Standardization &amp;amp; Wide Adoption&lt;/strong&gt;&lt;br&gt;
SQL is a universally recognized language with extensive documentation, making it easy for developers to learn and work with across various database management systems like MySQL, PostgreSQL, SQL Server, and Oracle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability Limitations&lt;/strong&gt; &lt;br&gt;
SQL databases typically scale vertically by adding more resources (CPU, RAM, storage) to a single server. However, due to their relational nature, they struggle with horizontal scaling, which involves distributing data across multiple servers. This limitation can create bottlenecks when dealing with high traffic or large datasets, making it difficult to efficiently scale out in distributed environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixed Schema&lt;/strong&gt;&lt;br&gt;
Changing the schema (adding or modifying columns) often requires migrations, which can be time-consuming and impact application performance. This makes SQL databases less adaptable to rapidly evolving data structures.&lt;/p&gt;
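&lt;p&gt;Even a simple change is an explicit migration step, and on large tables it may lock or rewrite data (the column definition below is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Adding a column requires a schema migration
ALTER TABLE Users ADD COLUMN phone VARCHAR(20);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;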

&lt;p&gt;&lt;strong&gt;High Licensing Costs for Enterprise Solutions&lt;/strong&gt;&lt;br&gt;
While open-source options like MySQL and PostgreSQL are free, commercial SQL databases such as Oracle and Microsoft SQL Server can have high licensing costs, making them expensive for large-scale deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros and Cons of NoSQL Databases
&lt;/h2&gt;

&lt;p&gt;NoSQL databases are designed to handle large volumes of unstructured, semi-structured, and structured data with a focus on scalability, flexibility, and high-speed performance. Unlike SQL databases, NoSQL does not enforce a strict schema, making it an excellent choice for big data, real-time applications, and distributed systems. While NoSQL databases offer significant advantages in terms of performance and scalability, they also have drawbacks, particularly in areas like data consistency and complex querying. Below is a breakdown of the key pros and cons of NoSQL databases to help determine when they are the right choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flexible Schema &amp;amp; Dynamic Data Models&lt;/strong&gt; &lt;br&gt;
NoSQL databases do not require a predefined schema, allowing for rapid changes in data structure without complex migrations. This is ideal for applications where data formats evolve over time.&lt;/p&gt;
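&lt;p&gt;For example, in a document store like MongoDB, two documents in the same collection can have completely different fields with no migration required (the field names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.Users.insertOne({ "_id": 2, "name": "Jane Roe", "email": "jane@example.com" });

// A later document adds fields the first one never had
db.Users.insertOne({ "_id": 3, "name": "Sam Poe", "tags": ["admin"], "lastLogin": "2025-03-01" });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;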

&lt;p&gt;&lt;strong&gt;High Scalability&lt;/strong&gt;&lt;br&gt;
NoSQL databases scale horizontally by distributing data across multiple servers, making them well-suited for handling massive amounts of data and high traffic loads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Availability &amp;amp; Fault Tolerance&lt;/strong&gt;&lt;br&gt;
Many NoSQL databases use replication and sharding to ensure high availability and resilience, making them ideal for distributed and cloud-native applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster Write Operations&lt;/strong&gt;&lt;br&gt;
Unlike traditional SQL databases that enforce strict consistency, NoSQL databases often prioritize speed over consistency, making them much faster for write-heavy workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eventual Consistency (BASE Model)&lt;/strong&gt;&lt;br&gt;
Unlike SQL databases, which follow ACID properties, most NoSQL databases follow the BASE model. This can lead to temporary inconsistencies, which may not be acceptable for applications requiring strong data integrity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steeper Learning Curve &amp;amp; Maintenance Challenges&lt;/strong&gt;&lt;br&gt;
While SQL is a standardized language, NoSQL databases vary significantly in design (e.g., MongoDB, Cassandra, Redis, Neo4j), requiring specialized knowledge to manage and optimize performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not Ideal for Transactions&lt;/strong&gt;&lt;br&gt;
NoSQL databases are not the best choice for applications that require multi-row transactions, as they often lack full ACID compliance, leading to potential issues with data consistency and rollback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Differences Between SQL and NoSQL
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data Structure&lt;/strong&gt;&lt;br&gt;
SQL databases use a relational model, where data is stored in structured tables with predefined schemas consisting of rows and columns. This structure enforces a strict organization, ensuring that data adheres to a set pattern. This makes SQL ideal for applications that require complex queries and strong consistency in structured data, such as financial systems. In contrast, NoSQL databases offer a more flexible approach, using a variety of data models such as key-value pairs, documents, graphs, and column families. This flexibility allows NoSQL to handle unstructured or semi-structured data, making it suitable for applications that need dynamic schemas and easy adaptation to changes in data over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;br&gt;
SQL databases typically scale vertically, meaning they add more resources (like CPU, RAM, or storage) to a single server to handle increased load. However, due to their relational nature, SQL databases face challenges with horizontal scaling, where data is distributed across multiple servers. This makes scaling more complex and resource-intensive as the amount of data and user requests increase. On the other hand, NoSQL databases are designed for horizontal scaling, which allows them to easily distribute data across multiple servers, enabling them to handle high volumes of data and traffic more efficiently. This scalability makes NoSQL an ideal choice for applications with large-scale, distributed data needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;br&gt;
SQL databases offer solid performance for read-heavy workloads, where data is relatively stable, and complex queries with multiple joins are necessary. However, SQL databases can experience performance issues under heavy write operations due to their need to maintain strong consistency across the system. This can lead to slower query performance as the database grows. NoSQL databases, however, are optimized for high-speed writes. With their flexible data models and ability to distribute data, NoSQL databases can process large amounts of data quickly, making them ideal for applications that require fast access to large datasets, such as real-time analytics or social media platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency vs Availability (&lt;a href="https://dev.to/nikita_kutsokon/databases-sql-vs-nosql-3d4f-temp-slug-5673266"&gt;ACID vs BASE&lt;/a&gt;)&lt;/strong&gt;&lt;br&gt;
SQL databases follow the ACID model, which ensures that database transactions are processed reliably and maintain strong consistency. This makes them ideal for applications where data integrity and reliability are crucial, such as banking systems. However, this strict consistency can limit the system’s ability to scale easily or recover from failures. NoSQL databases typically follow the BASE model, prioritizing availability and partition tolerance over immediate consistency. This means that while data may not be consistent immediately, NoSQL systems are designed to remain operational and available, making them suitable for applications where uptime and performance are more critical than absolute consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage and Data Integrity&lt;/strong&gt;&lt;br&gt;
SQL databases provide robust storage and data integrity features, ensuring that the data remains consistent and accurate across transactions. With built-in mechanisms like foreign keys and constraints, SQL databases ensure that data relationships are maintained and that data integrity is not compromised. However, this can be limiting when dealing with rapidly changing or vast amounts of data. NoSQL databases, while generally providing less rigid data integrity guarantees, offer flexibility by allowing data to be stored in various formats and distributed across multiple nodes. This design ensures high availability and fault tolerance, but can sometimes result in eventual consistency, where data across nodes may not be immediately synchronized. This trade-off is often acceptable in scenarios that prioritize scalability and speed over strict consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of SQL and NoSQL Databases
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea4gtdwd4h5qi4phkobz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea4gtdwd4h5qi4phkobz.jpg" alt="Most popular RDBMS" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re looking at SQL databases, there are several popular options, each with its own strengths. For instance, &lt;strong&gt;MySQL&lt;/strong&gt; is widely used in web applications due to its reliability, scalability, and ease of use. Then, there’s &lt;strong&gt;PostgreSQL&lt;/strong&gt;, which is great for handling large datasets and complex data structures, thanks to its support for complex queries, joins, and ACID compliance. If you’re working in an enterprise environment, &lt;strong&gt;Microsoft SQL Server&lt;/strong&gt; is often the go-to choice, offering high security and seamless integration with other Microsoft products. For larger, more advanced needs, &lt;strong&gt;Oracle Database&lt;/strong&gt; shines with its robustness, scalability, and features like clustering and partitioning. Lastly, if you need something lightweight for smaller-scale applications or embedded systems, &lt;strong&gt;SQLite&lt;/strong&gt; is a great choice, being file-based and easy to integrate directly into applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsl59fv8pddbr0jnjtsy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsl59fv8pddbr0jnjtsy.jpeg" alt="Types of NoSQL DBMS" width="645" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NoSQL databases come in different types, each suited for specific use cases. Key-Value Stores are the simplest, storing data as key-value pairs, making them fast and ideal for caching and quick retrieval. Examples of this type include &lt;strong&gt;Redis&lt;/strong&gt; and &lt;strong&gt;Riak&lt;/strong&gt;. Document Stores store data as documents, typically in formats like JSON, BSON, or XML, making them well-suited for semi-structured data. Popular examples are &lt;strong&gt;MongoDB&lt;/strong&gt; and &lt;strong&gt;CouchDB&lt;/strong&gt;. Column-Family Stores organize data in columns instead of rows, which is great for fast read and write operations on large datasets. &lt;strong&gt;Apache Cassandra&lt;/strong&gt; and &lt;strong&gt;HBase&lt;/strong&gt; are key examples here. Graph Databases focus on relationships between data points, storing data as nodes and edges, making them perfect for applications like social networks or recommendation systems. &lt;strong&gt;Neo4j&lt;/strong&gt; and &lt;strong&gt;ArangoDB&lt;/strong&gt; are prominent graph databases. Lastly, Time-Series Databases are optimized for handling time-ordered data, such as logs or sensor readings. &lt;strong&gt;InfluxDB&lt;/strong&gt; and &lt;strong&gt;TimescaleDB&lt;/strong&gt; are examples used for such tasks. Each type of NoSQL database is designed with a specific strength in mind, catering to the needs of different data models and applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Choose Between SQL and NoSQL
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When to Choose SQL&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your data is structured&lt;/li&gt;
&lt;li&gt;You need ACID compliance&lt;/li&gt;
&lt;li&gt;Your queries are complex&lt;/li&gt;
&lt;li&gt;Data relationships are crucial&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;When to Choose NoSQL&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your data is semi-structured or unstructured&lt;/li&gt;
&lt;li&gt;You need high scalability&lt;/li&gt;
&lt;li&gt;Performance in writes is critical&lt;/li&gt;
&lt;li&gt;Your application prioritizes availability&lt;/li&gt;
&lt;li&gt;You handle large-scale, dynamic data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Choosing between SQL and NoSQL depends on your application's needs. SQL is ideal for applications that require structured data, complex queries, and high consistency, such as banking systems, customer relationship management (CRM) tools, and enterprise resource planning (ERP) systems, where transactions need to be reliable and data relationships are crucial. NoSQL, on the other hand, is perfect for large-scale applications with unstructured or semi-structured data, like social media platforms, real-time analytics, content management systems, e-commerce websites, and IoT applications, where high scalability, flexibility, and quick data retrieval are essential. NoSQL databases like MongoDB, Cassandra, and Couchbase excel in handling large volumes of data across distributed systems, making them ideal for applications that need to handle massive amounts of unstructured data, real-time updates, and horizontal scalability. In some cases, combining both SQL and NoSQL databases in a hybrid approach can deliver the best of both worlds, as seen in modern web applications and cloud-based platforms where specific use cases demand the strengths of each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Helpful Links 🤓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/difference-between-sql-and-nosql/" rel="noopener noreferrer"&gt;Difference between SQL and NoSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/resources/basics/databases/nosql-explained/nosql-vs-sql" rel="noopener noreferrer"&gt;Understanding SQL vs NoSQL Databases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.coursera.org/articles/nosql-vs-sql" rel="noopener noreferrer"&gt;SQL vs. NoSQL: The Differences Explained + When to Use Each&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=_Ss42Vb1SU4" rel="noopener noreferrer"&gt;SQL vs. NoSQL Explained (in 4 Minutes)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=Q5aTUc7c4jg" rel="noopener noreferrer"&gt;SQL vs. NoSQL: What's the difference?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=YDR3D2bsv9Y" rel="noopener noreferrer"&gt;Why does NoSQL exist? (MongoDB, Cassandra) | System Design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=YgTLqO54UOA" rel="noopener noreferrer"&gt;SQL vs NoSQL - Who Wins? | Systems Design Interview 0 to 1 with Ex-Google SWE&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Databases: Indexes. Part 2</title>
      <dc:creator>Nikita Kutsokon</dc:creator>
      <pubDate>Fri, 07 Mar 2025 18:10:31 +0000</pubDate>
      <link>https://dev.to/nikita_kutsokon/databases-indexes-part-2-32j7</link>
      <guid>https://dev.to/nikita_kutsokon/databases-indexes-part-2-32j7</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/nikita_kutsokon/databases-indexes-part-1-3579"&gt;previous section&lt;/a&gt;, we explored the fundamental concepts of database indexes and their role in enhancing query performance. Now, we will delve deeper into advanced indexing techniques that can further optimize your database. As data grows and queries become more complex, understanding how to fine-tune and select the right indexing strategy becomes crucial. In this section, we’ll cover different types of indexes, such as clustered, non-clustered, and composite indexes, and explore when and how to use them effectively. We will also discuss common indexing pitfalls and best practices to ensure your database remains efficient as it scales.&lt;/p&gt;




&lt;h2&gt;
  
  
  Types of Indexes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Single-Column Indexes&lt;/strong&gt;&lt;br&gt;
This is the most basic type of index. It is created on a single column of a table.&lt;/p&gt;

&lt;p&gt;Let’s say we have a users table with columns &lt;em&gt;id&lt;/em&gt;, &lt;em&gt;name&lt;/em&gt;, and &lt;em&gt;age&lt;/em&gt;. If we frequently search for users by their name, we could create an index on the name column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX idx_name ON users(name);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This index will speed up searches like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM users WHERE name = 'Alice';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without the index, the database would have to scan every row, which is slow for large tables. But with the index, it can directly look up rows where the name is 'Alice'.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Composite Indexes&lt;/strong&gt;&lt;br&gt;
A composite index is an index that involves more than one column. It’s useful when you frequently query the database using multiple columns together in the WHERE clause or as part of JOIN conditions.&lt;/p&gt;

&lt;p&gt;Imagine we have a users table with the following columns: &lt;em&gt;first_name&lt;/em&gt;, &lt;em&gt;last_name&lt;/em&gt;, &lt;em&gt;age&lt;/em&gt;. If we often search for users by both first_name and last_name together, we can create a composite index like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX idx_name_age ON users(first_name, last_name);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if we run a query like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM users WHERE first_name = 'John' AND last_name = 'Doe';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database will use the composite index to quickly find rows where both first_name is 'John' and last_name is 'Doe', without scanning the entire table.&lt;/p&gt;

&lt;p&gt;⚠️ Important Note:&lt;/p&gt;

&lt;p&gt;A composite index is like a combined shortcut that helps the database find information faster when searching through multiple columns at once. The order of columns in this index is important. For example, if the index is created with &lt;em&gt;first_name&lt;/em&gt; first and &lt;em&gt;last_name&lt;/em&gt; second, it works best when you search using both &lt;em&gt;first_name&lt;/em&gt; and &lt;em&gt;last_name&lt;/em&gt;. However, if you only search by &lt;em&gt;last_name&lt;/em&gt;, the index won't be as effective because it was designed to prioritize the first column (&lt;em&gt;first_name&lt;/em&gt;). In short, the database can use the index for queries that match the first column or more, but not just the later ones.&lt;/p&gt;
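&lt;p&gt;For example, given the composite index on (first_name, last_name), the first query below can use it, while the second typically cannot (exact behavior depends on the query planner):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Matches the leading column, so the index can be used
SELECT * FROM users WHERE first_name = 'John';

-- Skips the leading column, so the index usually cannot help
SELECT * FROM users WHERE last_name = 'Doe';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;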

&lt;p&gt;&lt;strong&gt;3. Unique Indexes&lt;/strong&gt;&lt;br&gt;
A unique index is an index that ensures the uniqueness of the values in the indexed column. It is automatically created when you define a column with a UNIQUE constraint.&lt;/p&gt;

&lt;p&gt;Suppose we have a users table with an email column, and we want to ensure no two users have the same email. We can create a unique index like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE UNIQUE INDEX idx_email ON users(email);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that every email in the email column is unique. If someone tries to insert a duplicate email, the database will reject the insertion.&lt;/p&gt;
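&lt;p&gt;For instance, with the unique index in place, a second insert of the same email fails (the email value here is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO users (email) VALUES ('alice@example.com'); -- succeeds
INSERT INTO users (email) VALUES ('alice@example.com'); -- rejected: duplicate key value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;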

&lt;p&gt;&lt;strong&gt;4. Full-Text Indexes&lt;/strong&gt;&lt;br&gt;
A full-text index is specialized for text searching. It allows the database to efficiently search for words or phrases within large text columns, often used in fields like articles, descriptions, and comments.&lt;/p&gt;

&lt;p&gt;If you have a posts table with a content column, you can create a full-text index for efficient searching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE FULLTEXT INDEX idx_content ON posts(content);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, you can perform a search like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM posts WHERE MATCH(content) AGAINST ('keyword');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is much faster than a basic search because the full-text index is designed specifically to handle such queries efficiently.&lt;/p&gt;
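&lt;p&gt;Note that the MATCH ... AGAINST syntax shown above is MySQL's. In PostgreSQL, the rough equivalent uses a GIN index over a tsvector (a sketch, assuming the same posts table):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX idx_content ON posts USING GIN (to_tsvector('english', content));

SELECT * FROM posts WHERE to_tsvector('english', content) @@ to_tsquery('keyword');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;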

&lt;p&gt;&lt;strong&gt;5. Clustered Index&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A clustered index determines the physical order of data in the table. In many database systems (for example SQL Server and MySQL's InnoDB), creating a primary key automatically creates a clustered index. There can be only one clustered index per table, because the data rows can be physically sorted in only one order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    age INT
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database will create a clustered index on the id column by default, and the data in the table will be physically stored in order of id. This makes range queries like the following especially fast, since the matching rows sit next to each other on disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM users WHERE id BETWEEN 1 AND 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Non-Clustered Index&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A non-clustered index is a separate structure from the data table. It contains a copy of the indexed columns and a pointer to the actual rows in the table. Unlike clustered indexes, non-clustered indexes don't change the physical order of the data.&lt;/p&gt;

&lt;p&gt;For a table with a non-clustered index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX idx_name ON users(name);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This index will help quickly look up rows based on the name column, but the data in the table remains unordered based on name.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table Scans
&lt;/h2&gt;

&lt;p&gt;When querying a database, the way the system retrieves data from tables significantly affects performance. Databases use different scanning methods to fetch records, depending on whether indexes are available and how the query is structured. Understanding these table scans helps in optimizing queries and designing efficient indexes. Let's look at some of them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Sequential Table Scan&lt;/strong&gt;&lt;br&gt;
A Sequential Table Scan occurs when the database reads every row in a table to find matching records. This happens when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No index exists on the column being queried&lt;/li&gt;
&lt;li&gt;The query optimizer determines that scanning the entire table is more efficient than using an index&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suppose we have a users table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM users WHERE age = 25;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If age is not indexed, the database will check each row, making this slow for large tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Index Scan&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An Index Scan occurs when the database reads the index entries instead of scanning the table directly. This is more efficient than a full table scan, but it can still incur high I/O costs on large datasets because, after scanning the index, the database must fetch the matching rows from the table.&lt;/p&gt;

&lt;p&gt;If age is indexed, the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM users WHERE age = 25;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will first scan the index for matching values before retrieving the actual rows. This is faster than a sequential scan but can still be expensive if many rows match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Bitmap Index Scan&lt;/strong&gt;&lt;br&gt;
A Bitmap Index Scan works differently from a traditional index scan because it is optimized for columns with low cardinality (few unique values). Instead of storing direct row pointers, it uses a bitmap (a series of bits) to represent which rows contain a particular value.&lt;/p&gt;

&lt;p&gt;Let's say we have a users table with a status column that can have only three values: active, inactive, and banned. Since there are only three possible values, this column is a great candidate for a bitmap index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine we have a users table with 10 million rows and a status column, which can have only three values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;active&lt;/li&gt;
&lt;li&gt;inactive&lt;/li&gt;
&lt;li&gt;banned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we don’t use an index, searching for all active users would mean checking each row one by one, which is slow. Instead, the database creates a separate bitmap for each unique value (active, inactive, banned). Each bitmap is simply a long string of 1s and 0s, where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 means the row has that value.&lt;/li&gt;
&lt;li&gt;0 means the row does not have that value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm8cdms4ce7tk5efxpl8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm8cdms4ce7tk5efxpl8.png" alt="Bitmap indexing" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, if we run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM users WHERE status = 'active';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of scanning the whole table, the database retrieves the bitmap for 'active' (shown here for a simplified four-row table):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;1 0 1 0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This tells the database exactly which rows contain active (1st, 3rd). The system skips all rows where the bit is 0, making the lookup very fast. Finally, it fetches only those rows from the actual table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Is This Efficient ?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bitmaps are small&lt;/strong&gt; → They use less space compared to traditional indexes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast Filtering&lt;/strong&gt; → The database can quickly determine matching rows using bitwise operations (AND, OR).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Great for Multiple Conditions&lt;/strong&gt; → If we add another condition, like:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM users WHERE status = 'active' AND age = 30;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database just combines bitmaps for status = 'active' and age = 30 using a bitwise AND operation, avoiding unnecessary scans.&lt;/p&gt;
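&lt;p&gt;Conceptually, using four-row bitmaps like the one above (the values are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;status = 'active':  1 0 1 0
age = 30:           1 1 0 0
bitwise AND:        1 0 0 0   -- only row 1 satisfies both conditions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;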

&lt;p&gt;&lt;strong&gt;When to Use a Bitmap Index&lt;/strong&gt;&lt;br&gt;
✔ Best for columns with few unique values (status, gender, category).&lt;br&gt;
❌ Not efficient for high-cardinality columns (username, email).&lt;/p&gt;


&lt;h2&gt;
  
  
  Understanding the EXPLAIN Query
&lt;/h2&gt;

&lt;p&gt;The EXPLAIN command in PostgreSQL shows the query execution plan, i.e. how the database will execute your query, which helps you analyze and optimize query performance. To see how it works, let's create a &lt;em&gt;customer&lt;/em&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customer (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    status VARCHAR(10) CHECK (status IN ('active', 'inactive', 'banned')) NOT NULL,
    age INT CHECK (age &amp;gt;= 0)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a customer table with columns for &lt;em&gt;id&lt;/em&gt;, &lt;em&gt;first_name&lt;/em&gt;, &lt;em&gt;last_name&lt;/em&gt;, &lt;em&gt;status&lt;/em&gt;, and &lt;em&gt;age&lt;/em&gt;. To experiment with this table we need some data in it, so let's insert random rows with the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO customer (first_name, last_name, status, age)
SELECT 
    LEFT(md5(random()::text), 10),
    LEFT(md5(random()::text), 10),
    (ARRAY['active', 'inactive', 'banned'])[floor(random() * 3 + 1)], 
    floor(random() * 80 + 18)
FROM generate_series(1, 10);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the table contains the following data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dm6f2ypucyjf586o66k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dm6f2ypucyjf586o66k.png" alt="customer table overview" width="712" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's see how the EXPLAIN query works in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN SELECT * FROM customer;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn29wepguzwfya402nas5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn29wepguzwfya402nas5.png" alt="Explain query" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This command shows the execution plan for selecting all rows from the customer table. It will likely use a Seq Scan (sequential scan) because no index is defined yet, so the entire table is scanned. Let's extend EXPLAIN with the ANALYZE option. &lt;strong&gt;EXPLAIN ANALYZE&lt;/strong&gt; executes the query and returns both the execution plan and the actual runtime statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN ANALYZE  SELECT * FROM customer WHERE status = 'banned';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkzgsiu16qkefl9logqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkzgsiu16qkefl9logqy.png" alt="explain analyze" width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's break down the EXPLAIN ANALYZE output for the query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Seq Scan on customer&lt;/strong&gt;&lt;br&gt;
The query starts with a sequential scan (Seq Scan) on the customer table. This means PostgreSQL is checking each row of the table to find the records that match the condition (status = 'banned'). Since no index is set up for the status column, PostgreSQL has no shortcut to directly find the rows it needs. Instead, it has to look through the entire table, one row at a time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cost=0.00..13.25 rows=1 width=282&lt;/strong&gt;&lt;br&gt;
The cost of executing this query is estimated at 13.25, with 0.00 as the startup cost and 13.25 as the total cost. The startup cost is low because PostgreSQL only needs minimal resources to start executing the query. The total cost is higher because PostgreSQL needs to go through all the rows to apply the filter (status = 'banned'). The value 13.25 represents the internal calculation PostgreSQL uses to estimate how much work it will take to process the query. It also expects that about 1 row will match the condition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;rows=1&lt;/strong&gt;&lt;br&gt;
PostgreSQL's initial estimation of the number of rows that will match the condition (status = 'banned'). This is an estimate based on table statistics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;width=282&lt;/strong&gt;&lt;br&gt;
The average size (in bytes) of the rows in the customer table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;actual time 0.147..0.151&lt;/strong&gt;&lt;br&gt;
In terms of actual execution, the query took around 0.147 to 0.151 milliseconds to complete. This is how long PostgreSQL took to scan the table and find the rows that matched the condition. Since only 5 rows matched the condition, it was able to process the query very quickly, with the total time of only 0.151 ms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;rows=5 loops=1&lt;/strong&gt;&lt;br&gt;
Finally, the query executed in a single loop (loops=1). This means PostgreSQL only had to scan the table once to find the matching rows. Even though it estimated that only 1 row would be returned, it actually found 5 rows that had the status 'banned', which is why 5 rows were returned. This reflects the actual results of the query.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;To illustrate how adding an index can change the table scan type, let's walk through the process with the customer table:&lt;/p&gt;

&lt;p&gt;Next, we insert a much larger amount of random data (1,000,000 rows) into the customer table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO customer (first_name, last_name, status, age)
SELECT 
    LEFT(md5(random()::text), 10),
    LEFT(md5(random()::text), 10),
    (ARRAY['active', 'inactive', 'banned'])[floor(random() * 3 + 1)], 
    floor(random() * 80 + 18)
FROM generate_series(1, 1000000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsu8dnf8vh1d5qzyl7n3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsu8dnf8vh1d5qzyl7n3e.png" alt="More data for customer table" width="750" height="834"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's check the query plan for selecting rows based on the &lt;em&gt;first_name&lt;/em&gt; column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN ANALYZE  SELECT * FROM customer WHERE first_name  = '698ffdfe4d';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database decides to use a Sequential Scan (Seq Scan), meaning it scans all rows in the table to find the matching entry. Additionally, parallel workers are used to improve the scan efficiency when dealing with larger datasets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wgmxd4qqi6vakuanhba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wgmxd4qqi6vakuanhba.png" alt="seq scan" width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, we add an index on the first_name column to improve search performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX idx_customer_first_name ON customer(first_name);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the index is created, let's run the same query again to see the changes in the execution plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN ANALYZE  SELECT * FROM customer WHERE first_name  = '698ffdfe4d';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, after adding the index, the database now uses an Index Scan rather than a Sequential Scan. This significantly reduces the query execution time and cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdl2s76yxbq6cawiiamtb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdl2s76yxbq6cawiiamtb.png" alt="Image description" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎉 The Index Scan using &lt;em&gt;idx_customer_first_name&lt;/em&gt; is approximately 263 times faster than the Sequential Scan (0.233 ms vs 61.311 ms). This demonstrates the significant performance boost indexing provides, especially when filtering by a specific column in large datasets. By avoiding a full table scan, the index dramatically reduces query execution time and improves efficiency.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Choose the Right Index
&lt;/h2&gt;

&lt;p&gt;Choosing the right index for a table is critical to optimizing query performance. While indexes can significantly speed up read operations, they come with overhead on write operations (like INSERT, UPDATE, DELETE) because the indexes need to be maintained. Therefore, it’s essential to strike a balance between the speed of queries and the cost of maintaining indexes. Here are some guidelines to help you choose the best index for your scenario:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understand Your Queries&lt;/strong&gt;&lt;br&gt;
Before creating an index, analyze which queries run most often. Focus on columns in WHERE, JOIN, and ORDER BY clauses, as indexing them can speed up search and sorting operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose High-Selectivity Columns&lt;/strong&gt;&lt;br&gt;
Indexes work best on columns with many unique values, like email or user_id. Columns with few values (status or gender) may not benefit much from a regular index, but bitmap indexes can be useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single vs Composite Indexes&lt;/strong&gt;&lt;br&gt;
A single-column index is best for filtering by one column, while a composite index is better for queries using multiple columns together. Remember, in composite indexes, the order of columns matters—queries should use the left-most column first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Balance Indexes with Write Performance&lt;/strong&gt;&lt;br&gt;
Indexes speed up searches but slow down writes (INSERT, UPDATE, DELETE) because they need to be updated too. Avoid unnecessary indexes and regularly check for unused ones to maintain fast performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider Table Size&lt;/strong&gt;&lt;br&gt;
For small tables, indexes might not make a big difference. But for large tables, they are essential to avoid slow full-table scans. Sparse data (many NULL values) can also reduce index efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test and Optimize&lt;/strong&gt;&lt;br&gt;
Use tools like EXPLAIN to check how your queries use indexes. Try different index types and compare performance to find the best option. Regularly update your indexing strategy as your data and queries evolve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Helpful Links 🤓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://thoughtbot.com/blog/reading-an-explain-analyze-query-plan" rel="noopener noreferrer"&gt;Reading a Postgres EXPLAIN ANALYZE Query Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://patrickkarsh.medium.com/choosing-the-right-index-database-design-basics-e797cb49a7c8" rel="noopener noreferrer"&gt;Choosing the Right Index: Database Design Basics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://vertabelo.com/blog/database-index-types/" rel="noopener noreferrer"&gt;What Are the Types of Indexes in a Relational Database?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=BxAj3bl00-o&amp;amp;list=WL&amp;amp;index=12" rel="noopener noreferrer"&gt;SQL Indexes | Clustered vs. Nonclustered Index&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=31EmOKBP1PY" rel="noopener noreferrer"&gt;A beginners guide to EXPLAIN ANALYZE – Michael Christofides&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=EZ3jBam2IEA" rel="noopener noreferrer"&gt;Database Design 39 - Indexes (Clustered, Nonclustered, Composite Index)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>systemdesign</category>
      <category>webdev</category>
      <category>backend</category>
    </item>
    <item>
      <title>Databases: Indexes. Part 1</title>
      <dc:creator>Nikita Kutsokon</dc:creator>
      <pubDate>Tue, 04 Mar 2025 20:05:37 +0000</pubDate>
      <link>https://dev.to/nikita_kutsokon/databases-indexes-part-1-3579</link>
      <guid>https://dev.to/nikita_kutsokon/databases-indexes-part-1-3579</guid>
      <description>&lt;p&gt;This article will introduce you to the concept of database indexes. Before diving into indexes, it’s recommended to first read about &lt;a href="https://dev.to/nikita_kutsokon/databases-how-data-is-stored-on-disk-22k8"&gt;&lt;strong&gt;how databases store data&lt;/strong&gt;&lt;/a&gt;, as this will provide a better understanding of how indexes fit into the overall database architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is An Index In a Database?
&lt;/h2&gt;

&lt;p&gt;An index in a database is like the table of contents in a book. It helps the database find information faster, just like a table of contents helps you quickly find a chapter in a book. Normally, when you search for something in a database, it has to scan every row in a table, which takes time. An index is a special structure that keeps track of where data is stored, allowing the database to find it quickly. It works similarly to a library catalog, which helps locate books without checking every shelf, making searches more efficient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhup09lvc3vpewkt8mxut.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhup09lvc3vpewkt8mxut.jpg" alt="library-catalog" width="640" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Are Indexes Important ?
&lt;/h2&gt;

&lt;p&gt;In database management, indexes are very helpful tools. They make it easier and faster to find and retrieve specific information. Here are some key benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Faster Data Retrieval&lt;/strong&gt; - when searching for a specific record, an index allows the database to find the result directly, instead of scanning every row. This significantly reduces query time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Sorting and Filtering&lt;/strong&gt; - queries with ORDER BY or WHERE conditions run much faster because the database can use the index instead of sorting or filtering all rows manually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better Performance for Large Databases&lt;/strong&gt; - as data increases, queries can slow down. Indexes help keep search times consistent, even when dealing with millions of records.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized Joins Between Tables&lt;/strong&gt; - indexes speed up JOIN operations by quickly matching related records between tables, making complex queries more efficient.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These benefits make indexes essential for maintaining the performance and efficiency of databases, especially as they grow larger.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Indexes Are Stored in a Database
&lt;/h2&gt;

&lt;p&gt;Indexes are stored separately from the main table data. They act as a separate data structure that is linked to the table. An index typically stores a reference to the row in the table and the indexed column's value. Most relational databases use B-trees (Balanced Trees) or hash tables for storing indexes. The structure allows efficient lookups, insertions, and deletions. Here's how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;B-trees (or B+ trees):&lt;/strong&gt; These are the most common data structures used for indexing. The tree structure allows the database to maintain sorted values, and the search can quickly navigate the tree in a logarithmic manner, making the lookup faster. The leaf nodes of the tree contain the actual data pointers, allowing for efficient retrieval.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hash Indexes:&lt;/strong&gt; A hash table uses a hash function to convert the indexed column's value into a fixed-size hash code, which then points to the data rows. This is efficient when dealing with equality searches (WHERE column = value), but it doesn’t support range queries (WHERE column &amp;gt; value).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
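&lt;p&gt;In PostgreSQL, for example, you can choose the index method explicitly (a sketch; table and column names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- B-tree is the default method
CREATE INDEX idx_users_name ON users (name);

-- Hash index: equality lookups only, no range queries
CREATE INDEX idx_users_email_hash ON users USING HASH (email);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;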

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17zpf554db3nlw13nips.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17zpf554db3nlw13nips.jpg" alt="db-index" width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Indexes Are Fast
&lt;/h2&gt;

&lt;p&gt;The reason why an index speeds up searches in databases is the use of &lt;strong&gt;binary search&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Without an index, if you wanted to search for a specific value in a large table, the database would have to scan every row to find the match. This process is known as a full table scan, and it has a time complexity of O(n), where n is the number of rows. With an index, the database doesn’t need to scan every row. It can use binary search to locate the desired data much faster. Binary search works by dividing the data in half repeatedly, narrowing down the search space quickly. Instead of checking every single record, it cuts the search space in half with each step.&lt;/p&gt;

&lt;p&gt;For example, in a sorted list of 1,000,000 items, binary search locates an element in at most about 20 comparisons (O(log n)), while a full scan may have to check all 1,000,000. The larger the data set, the more significant the performance improvement when using an index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Big O Notation Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full Table Scan (No Index): O(n)&lt;/strong&gt;&lt;br&gt;
Every record is checked one by one, so time grows linearly with the number of records.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Indexed Search (Binary Search): O(log n)&lt;/strong&gt;&lt;br&gt;
The search space halves with each step, so even for very large datasets the number of comparisons stays far below a full table scan.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn4y64zydiqjogcen23d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn4y64zydiqjogcen23d.png" alt="Big O Notation Comparison" width="800" height="691"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see from the graph, the time complexity for a full table scan increases linearly (O(n)) with the size of the dataset, while the time complexity for an indexed search using binary search increases logarithmically (O(log n)).&lt;/p&gt;

&lt;h2&gt;
  
  
  How Indexes Help with Search in a Library
&lt;/h2&gt;

&lt;p&gt;Imagine you’re in a large library trying to find a specific book. The library has thousands of books, and you’re looking for a book titled “The Art of Programming”. Here’s how an index would help speed up the search:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without an Index&lt;/strong&gt;&lt;br&gt;
If the books are randomly arranged on shelves, you’d have to check each book one by one to find the one you’re looking for. This is similar to performing a full table scan. The librarian would start at the first book and keep checking each title until they find “The Art of Programming”. If there are 100,000 books, you might have to check a lot of them before finding your desired book.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Time Complexity:&lt;/strong&gt; O(n) (You might have to check all the books, linearly).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With an Index&lt;/strong&gt;&lt;br&gt;
Now imagine the library has a catalog that lists all books alphabetically. Instead of scanning the whole library, you use the catalog to jump straight to the place where the title would appear alphabetically, and then go directly to that shelf. In a database, indexes work the same way: they let the database jump directly to the section of the data where the relevant records are located, rather than scanning through the entire table.&lt;/p&gt;

&lt;p&gt;Since the catalog is sorted, the librarian can apply binary search to find the book much faster. They wouldn’t check each title individually but would instead look at the middle entry and decide whether the book is before or after that entry. This process repeats, reducing the number of books the librarian needs to check.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Time Complexity:&lt;/strong&gt; O(log n) (With each step, you halve the number of books you need to check).&lt;/p&gt;

&lt;p&gt;This catalog-based approach, powered by binary search, is much faster than checking every single book, just like how an index in a database speeds up queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfzz918yv75tbwvidham.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfzz918yv75tbwvidham.png" alt="Comparison of non-indexed and indexed search" width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Index Every Column?
&lt;/h2&gt;

&lt;p&gt;While indexes are incredibly useful for speeding up searches, creating an index on every column in a database can lead to several issues, making it a bad practice. Let’s explore why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Performance Overhead During Data Insertion/Updates&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Indexes need to be updated whenever you add, update, or delete data in a table. The more indexes you have, the more work the database has to do.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine a library where every book is indexed by title, author, genre, and publication year. The title index is like a sorted list of book titles, where each title points to the shelf where the book is located. When a new book arrives, the librarian can’t just put it on a shelf randomly—they first need to insert the book’s title in the correct alphabetical position in the title index, update the author index by adding the book under the right author’s name, place it in the correct category in the genre index, and do the same for the publication year index. The more indexes exist, the more time and effort it takes to properly store the book. Similarly, in a database, each index must be updated when a new row is inserted, which slows down write operations significantly if there are too many indexes.&lt;/p&gt;
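&lt;p&gt;The same write amplification can be shown with a toy in-memory table, where every insert must also update each sorted index (the column names here are purely illustrative):&lt;/p&gt;

```python
import bisect

rows = []  # the "table": one dict per row
indexes = {"title": [], "author": [], "year": []}  # one sorted list per index

def insert(row):
    rows.append(row)
    # Every index must be kept sorted, so each insert pays one
    # extra ordered insertion per index it maintains.
    for column, index in indexes.items():
        bisect.insort(index, (row[column], len(rows) - 1))

insert({"title": "The Art of Programming", "author": "Knuth", "year": 1968})
insert({"title": "A Byte of Python", "author": "Swaroop", "year": 2003})
print(indexes["title"][0][0])  # first title in sorted order
```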

&lt;p&gt;&lt;strong&gt;2. Increased Storage Usage&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each index consumes disk space. The more indexes you create, the more storage is required to maintain them. This can be problematic when dealing with large tables or databases with limited storage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine a library where, in addition to storing physical books, the librarian also keeps separate catalogs for sorting books by title, author, genre, and publication year. Each of these catalogs takes up space. If the librarian creates a separate catalog for every possible book attribute—such as number of pages, language, publisher, edition, and even cover color—the storage for these catalogs could eventually take up more space than the books themselves! Similarly, in a database, each index requires additional storage. If every column in a large table has an index, the total space used by the indexes could exceed the actual data, leading to inefficient storage usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Redundant Indexes&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Some columns may not need indexing at all, especially if they're rarely used in queries. Having an index on every column would lead to redundant indexes that don’t improve performance and just add unnecessary overhead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine a library where the librarian creates a separate catalog for every single book detail, including the number of pages, cover color, and even the weight of the book. However, readers never search for books based on their weight or cover color—they only look for title, author, or genre. Maintaining these unnecessary catalogs takes time and space but provides no real benefit. Similarly, in a database, indexing columns that are rarely searched or sorted creates redundant indexes that consume resources without improving performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Decreased Read Performance&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;While indexes speed up reads (searches and lookups), having too many indexes can actually hurt read performance in some situations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine a library with dozens of different catalogs—by title, author, genre, publisher, year, cover color, number of pages, and more. When a reader asks for a book, the librarian must decide which catalog to use. If there are too many catalogs, it takes extra time to check them all and find the best one. Similarly, in a database, when a query is executed, the database must choose the most efficient index. If there are too many indexes, this decision-making process can slow down searches rather than speed them up. Additionally, when new books are added or removed, all these indexes need updating, causing delays and reducing overall performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/nikita_kutsokon/databases-indexes-part-2-32j7"&gt;Part 2 -&amp;gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Helpful Links 🤓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@rtawadrous/introduction-to-database-indexes-9b488e243cc1" rel="noopener noreferrer"&gt;Introduction To Database Indexing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.baeldung.com/sql/databases-indexing" rel="noopener noreferrer"&gt;How Does Database Indexing Work?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.freecodecamp.org/news/database-indexing-at-a-glance-bb50809d48bd/" rel="noopener noreferrer"&gt;An in-depth look at Database Indexing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=Jemuod4wKWo" rel="noopener noreferrer"&gt;what is a database index?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=g2o22C3CRfU" rel="noopener noreferrer"&gt;Big-O Notation in 100 Seconds&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=5t1fW3KG920" rel="noopener noreferrer"&gt;SQL Indexes Explained in 20 Minutes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=lYh6LrSIDvY" rel="noopener noreferrer"&gt;Database Indexing for Dumb Developers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=MFhxShGxHWc&amp;amp;t=74s" rel="noopener noreferrer"&gt;Binary Search Algorithm in 100 Seconds&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=BIlFTFrEFOI&amp;amp;t=151s" rel="noopener noreferrer"&gt;SQL indexing best practices | How to make your database FASTER!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=EZ3jBam2IEA" rel="noopener noreferrer"&gt;Database Design 39 - Indexes (Clustered, Nonclustered, Composite Index)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>webdev</category>
      <category>systemdesign</category>
      <category>backend</category>
    </item>
    <item>
      <title>Databases: How Data is Stored on Disk</title>
      <dc:creator>Nikita Kutsokon</dc:creator>
      <pubDate>Sat, 01 Mar 2025 19:57:57 +0000</pubDate>
      <link>https://dev.to/nikita_kutsokon/databases-how-data-is-stored-on-disk-22k8</link>
      <guid>https://dev.to/nikita_kutsokon/databases-how-data-is-stored-on-disk-22k8</guid>
      <description>&lt;p&gt;A database system is designed to store, read, and write data efficiently. It organizes the data into &lt;strong&gt;pages&lt;/strong&gt; and utilizes both &lt;strong&gt;disk storage&lt;/strong&gt; and &lt;strong&gt;RAM&lt;/strong&gt; to manage these pages. But before diving into how databases handle data, let's first explain what RAM is for a better understanding of the article.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RAM
&lt;/h2&gt;

&lt;p&gt;RAM (Random Access Memory) is the short-term memory of your computer, used to store data and programs that are actively in use. It’s a critical component because it allows the computer to quickly access the information it needs without waiting for slower storage devices like hard drives (HDDs) or solid-state drives (SSDs). RAM is much faster than these storage devices, and when you open a program or file, it gets loaded into RAM to improve performance. The more RAM your computer has, the more data and programs it can store and process at once, which makes multitasking and running demanding applications faster and smoother. However, &lt;strong&gt;RAM is volatile&lt;/strong&gt;, meaning that when you turn off your computer, all the data stored in RAM is erased. This is why it’s essential for running programs in real-time, but it doesn’t hold onto data permanently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wim6nhzf3yzxhb4e4oi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wim6nhzf3yzxhb4e4oi.png" alt="RAM schema" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Database Stores Data
&lt;/h2&gt;

&lt;p&gt;A database stores data in small blocks called &lt;strong&gt;pages&lt;/strong&gt;, which typically hold several rows of a table or other pieces of data. Each page is a fixed size (usually between 4 KB and 8 KB), and this consistency helps the system efficiently manage large datasets. When data from a table is stored in a database, the rows of that table are divided and placed across multiple pages. &lt;em&gt;Each page can contain one or more rows&lt;/em&gt; depending on the size of the rows and the page size. For example, if a table has large rows (such as those with many columns or large text), fewer rows may fit on a page. Smaller rows will allow more to fit on a single page. This method of storing data ensures efficient access, as the database can quickly load entire pages into memory instead of fetching individual rows one by one. This helps speed up queries, especially when large amounts of data need to be processed or retrieved.&lt;/p&gt;
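&lt;p&gt;A back-of-the-envelope sketch of this packing (the 8 KB page size matches PostgreSQL's default; the header size is an illustrative assumption, since real pages also reserve space for per-row metadata):&lt;/p&gt;

```python
PAGE_SIZE = 8 * 1024   # 8 KB page, PostgreSQL's default
PAGE_HEADER = 24       # bytes reserved for page metadata (illustrative)

def rows_per_page(row_size_bytes):
    """Rough number of fixed-size rows that fit on one page."""
    return (PAGE_SIZE - PAGE_HEADER) // row_size_bytes

print(rows_per_page(100))    # narrow rows: dozens fit on one page
print(rows_per_page(2000))   # wide rows: only a handful per page
```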

&lt;p&gt;The data is &lt;strong&gt;stored on a disk&lt;/strong&gt;, like a &lt;strong&gt;hard drive (HDD)&lt;/strong&gt; or a &lt;strong&gt;solid-state drive (SSD)&lt;/strong&gt;. These disks preserve the data even when the computer is turned off. However, reading from the disk is slower than reading from the computer's RAM, which is faster but only works while the computer is on. To speed up data access, the database uses a special area in RAM called the &lt;strong&gt;buffer pool&lt;/strong&gt;. This area stores frequently accessed data. Instead of reading small pieces of data one by one, the database reads whole pages at once. These pages are stored on the disk, and when needed, they’re quickly &lt;strong&gt;loaded into RAM&lt;/strong&gt; for faster access.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Database Reads Data
&lt;/h2&gt;

&lt;p&gt;When the database needs to access data, it doesn’t read individual rows or records directly from the disk. Instead, it &lt;strong&gt;reads entire pages&lt;/strong&gt;, which are fixed-size blocks containing multiple rows or pieces of data. The database first checks if the required page is already in the buffer pool. If the page is found in RAM, it can be accessed quickly. If not, the database reads the page from the disk and loads it into RAM for future use. This way, the database minimizes the need to access the slower disk.&lt;/p&gt;
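&lt;p&gt;This check-RAM-first, fall-back-to-disk logic can be sketched as a small LRU cache of pages (a deliberate simplification of a real buffer pool):&lt;/p&gt;

```python
from collections import OrderedDict

class BufferPool:
    """Tiny sketch of a buffer pool: an LRU cache of fixed-size pages."""
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk            # maps page_id to page contents
        self.pages = OrderedDict()  # pages currently in RAM, LRU order
        self.hits = self.misses = 0

    def read_page(self, page_id):
        if page_id in self.pages:            # fast path: already in RAM
            self.pages.move_to_end(page_id)
            self.hits += 1
        else:                                # slow path: read from "disk"
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict least recently used
            self.pages[page_id] = self.disk[page_id]
        return self.pages[page_id]

disk = {i: "page-%d" % i for i in range(10)}
pool = BufferPool(capacity=3, disk=disk)
pool.read_page(1); pool.read_page(2); pool.read_page(1)
print(pool.hits, pool.misses)  # 1 hit (page 1 re-read), 2 misses
```

&lt;p&gt;Real buffer pools additionally track which cached pages are dirty (modified but not yet written back), which ties into the write path described next.&lt;/p&gt;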

&lt;h2&gt;
  
  
  How the Database Writes Data
&lt;/h2&gt;

&lt;p&gt;When the database needs to write data (adding new rows or updating existing data), it &lt;strong&gt;first writes the changes to the buffer pool&lt;/strong&gt; in RAM, not directly to the disk. This is because writing to RAM is much faster. Later, the database writes the changes from RAM to the disk, typically in large batches, to minimize disk access time. This process is called buffering or write-back. To ensure data isn't lost, databases use transaction logs to track changes. If the system crashes, the transaction logs help recover the changes when the system restarts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disk I/O Optimization
&lt;/h2&gt;

&lt;p&gt;Disk I/O (Input/Output) optimization plays a key role in enhancing the performance of databases. Since databases store large volumes of data on disks, which are slower than RAM, minimizing the time spent reading from and writing to these disks is essential. Optimizing disk I/O involves improving how a database interacts with its disk storage, resulting in faster and more responsive system performance. Let's explore some techniques used to achieve this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Buffer Pool (Caching)&lt;/strong&gt;&lt;br&gt;
Think of the buffer pool like a temporary storage area in your computer's memory (RAM). When the database needs data, it first checks if the data is already in the buffer pool. If it is, the database can quickly use it without waiting for it to be read from the slower disk. This speeds up data retrieval, especially for commonly used data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Read-Ahead (Prefetching)&lt;/strong&gt;&lt;br&gt;
Imagine you’re reading a book and you know the next chapter is important. Instead of waiting for each page, you turn several pages ahead. The database does something similar by predicting which data it might need next and loading it into memory ahead of time. This way, the data is ready when needed, reducing delays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Write-Ahead Logging (WAL)&lt;/strong&gt;&lt;br&gt;
Before the database writes any new data to the disk, it first records the change in a special log file. This is like making a note before making any changes to ensure that if something goes wrong (like a power failure), the database can recover the changes from the log, preventing data loss.&lt;/p&gt;
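&lt;p&gt;The essential WAL idea fits in a few lines: append the change to a log and make it durable before touching the data, so that after a crash the state can always be replayed from the log (the file layout here is illustrative; real databases log at transaction granularity):&lt;/p&gt;

```python
import json, os, tempfile

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
data = {}  # the in-memory "table"

def write(key, value):
    # 1) Append the change to the log and force it to disk first...
    with open(log_path, "a") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())
    # 2) ...only then apply it to the data itself.
    data[key] = value

def recover():
    """Replay the log from the start to rebuild state after a crash."""
    state = {}
    with open(log_path) as log:
        for line in log:
            entry = json.loads(line)
            state[entry["key"]] = entry["value"]
    return state

write("balance:alice", 100)
write("balance:alice", 80)
print(recover())  # the replayed state matches the last applied write
```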

&lt;p&gt;&lt;strong&gt;4. Batching Writes&lt;/strong&gt;&lt;br&gt;
Instead of writing every single change to the disk one by one, the database groups several changes together and writes them all at once. This reduces the number of times the system has to access the disk, making the overall process more efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Disk Striping (RAID)&lt;/strong&gt;&lt;br&gt;
Disk striping is like cutting a file into pieces and storing those pieces on different disks. When the database needs to access the file, it can read from multiple disks at once, which speeds up the process. This is often used in RAID setups to improve performance and ensure the data is safe.&lt;/p&gt;

&lt;h2&gt;
  
  
  HDD vs SSD in Database Storage
&lt;/h2&gt;

&lt;p&gt;When storing data for a database, you can use either &lt;strong&gt;HDD (Hard Disk Drive)&lt;/strong&gt; or &lt;strong&gt;SSD (Solid State Drive)&lt;/strong&gt;. Both have their strengths and weaknesses, so let’s break it down in simple terms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HDD&lt;/strong&gt;&lt;br&gt;
HDDs work by using spinning disks, similar to a record player, with a small arm that moves to read or write data. Because these parts need to physically move, HDDs are slower and take more time to locate and load information. However, they are more affordable per gigabyte, making them a great choice for storing large amounts of data that don’t require fast access. This makes HDDs ideal for &lt;strong&gt;backups, archives, and databases with low traffic,&lt;/strong&gt; where speed is less important than storage capacity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwq3a0fehjyzckridxhq2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwq3a0fehjyzckridxhq2.jpg" alt="HDD structure" width="600" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSD&lt;/strong&gt;&lt;br&gt;
SSDs use flash memory, a type of non-volatile storage that retains data even when the power is off. Unlike HDDs, which rely on spinning disks, flash memory stores data electronically in memory cells made of transistors. These cells hold an electrical charge to represent binary data (0s and 1s). Because there are no moving parts, SSDs have much lower latency, meaning they can access and transfer data almost instantly, which results in significantly faster read and write speeds. This makes SSDs ideal for &lt;strong&gt;applications that need high-speed performance,&lt;/strong&gt; like real-time analytics and databases with heavy user traffic. Additionally, SSDs are more durable, consume less power, and generate less heat, making them a reliable and energy-efficient choice for high-performance computing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyqtpc4asrmuub1lwjg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyqtpc4asrmuub1lwjg2.png" alt="SSD Structure" width="671" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Helpful Links 🤓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cs186berkeley.net/notes/note3/" rel="noopener noreferrer"&gt;Disk and files&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/how-your-data-is-stored-on-disk-and-memory-8842891da52/" rel="noopener noreferrer"&gt;How your data is stored on disk and memory?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/compare/the-difference-between-ssd-hard-drive/?nc1=h_ls" rel="noopener noreferrer"&gt;What’s the Difference Between an SSD and a Hard Drive?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=haz2h7_xFDk" rel="noopener noreferrer"&gt;How databases store data on disk?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=OyBwIjnQLtI" rel="noopener noreferrer"&gt;How is data stored in sql database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3mmMxgBQ0Yc" rel="noopener noreferrer"&gt;Should I store data "In Memory" or "On Disk"?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=DbxddGtHl70" rel="noopener noreferrer"&gt;How Do Databases Store Tables on Disk? Explained both SSD &amp;amp; HDD&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>backend</category>
      <category>database</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Databases: CAP Theorem</title>
      <dc:creator>Nikita Kutsokon</dc:creator>
      <pubDate>Fri, 28 Feb 2025 09:33:10 +0000</pubDate>
      <link>https://dev.to/nikita_kutsokon/databases-cap-theorem-5c5c</link>
      <guid>https://dev.to/nikita_kutsokon/databases-cap-theorem-5c5c</guid>
      <description>&lt;h2&gt;
  
  
  What is CAP Theorem?
&lt;/h2&gt;

&lt;p&gt;The CAP theorem, also known as Brewer's theorem, is a fundamental concept in &lt;strong&gt;distributed computing&lt;/strong&gt; that states it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Tolerance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s36htplv3rm54utcfq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s36htplv3rm54utcfq9.png" alt="CAP diagram" width="800" height="796"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Consistency
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Every node in a distributed system sees the same data at the same time. Once a write operation is completed, any subsequent read operation will return the updated value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Consistency means all parts of the system see the same data at the same time. This is done using methods like &lt;strong&gt;consensus algorithms&lt;/strong&gt;, which make sure all parts agree on the data. It's important for things like banking or inventory systems where accurate data is crucial.&lt;/p&gt;

&lt;p&gt;⚠️ Keeping data consistent can slow down the system and make it less available, especially if parts of the network fail. The system might need to wait for all parts to agree, which can cause delays.&lt;/p&gt;

&lt;h2&gt;
  
  
  Availability
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Every request (read or write) receives a response, regardless of the state of any individual node in the system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Availability means the system always responds to requests, even if some parts fail. This is achieved by copying data across multiple nodes and having backup plans. It's vital for services like social media or online games, where always being online is important.&lt;/p&gt;

&lt;p&gt;⚠️ Focusing on availability can mean sometimes showing old data, especially if network problems occur. The system might prioritize staying online over having the latest data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partition Tolerance
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The system continues to operate despite network partitions, where communication between nodes is disrupted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Partition tolerance means the system keeps working even if parts of the network are disconnected. This is done by handling network failures gracefully. It's essential for apps used in places with poor network connections, like mobile apps.&lt;/p&gt;

&lt;p&gt;⚠️ A system that handles network failures must choose between consistency and availability. This choice can affect how reliable and up-to-date the data is during network problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CAP Triangle: Trade-offs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Consistency vs Availability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the event of a network partition, a system must decide whether to maintain consistency (ensuring all nodes have the same data) or availability (ensuring the system remains operational). This trade-off is critical because it directly impacts how the system behaves when parts of the network are disconnected. Choosing consistency can lead to better data integrity but may result in downtime or slower responses during network issues. Conversely, prioritizing availability ensures the system stays responsive but might serve stale or inconsistent data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Availability vs Partition Tolerance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A system that prioritizes availability and partition tolerance (AP) remains operational during network partitions but may return stale or inconsistent data. This trade-off is common in systems where uptime is more important than immediate data consistency. While this approach ensures high availability, it can lead to temporary data inconsistencies. Users might see outdated information, which can be acceptable in some applications but problematic in others, like financial systems.&lt;/p&gt;

&lt;p&gt;🤓 The most frequent trade-off is between &lt;strong&gt;consistency&lt;/strong&gt; and &lt;strong&gt;availability&lt;/strong&gt;, especially in systems that must handle network partitions. This trade-off is crucial because network partitions are inevitable in distributed systems. Many modern distributed databases and applications opt for eventual consistency to balance this trade-off, ensuring data will become consistent over time as the network stabilizes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kui3wj1bgymgc2m2ruw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kui3wj1bgymgc2m2ruw.png" alt="CAP databases" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Consistency (CAP) vs Consistency (ACID)
&lt;/h2&gt;

&lt;p&gt;CAP consistency ensures that all nodes in a distributed system always return the most recent data, focusing on synchronization across replicas, even at the cost of availability during network failures. In contrast, ACID consistency ensures that a database remains in a valid state by enforcing rules and constraints within transactions, preventing partial updates or invalid data. &lt;strong&gt;While CAP consistency is crucial for distributed systems&lt;/strong&gt; like Google Spanner or ZooKeeper, &lt;strong&gt;ACID consistency is fundamental for relational databases&lt;/strong&gt; like PostgreSQL and MySQL, ensuring correctness but potentially impacting performance.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcybuajfs9lbp9dad5j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcybuajfs9lbp9dad5j0.png" alt="Consistency(CAP) vs Consistency(ACID)" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Helpful Links 🤓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.ibm.com/think/topics/cap-theorem" rel="noopener noreferrer"&gt;What is the CAP theorem?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.bmc.com/blogs/cap-theorem/#:~:text=The%20CAP%20theorem%20is%20a,or%20availability%E2%80%94but%20not%20both." rel="noopener noreferrer"&gt;CAP Theorem Explained: Consistency, Availability &amp;amp; Partition Tolerance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://talent500.com/blog/cap-theorem-database-selectionguide/" rel="noopener noreferrer"&gt;Navigating the CAP Theorem: A Guide to Selecting the Right Database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@ngneha090/understanding-the-cap-theorem-balancing-consistency-availability-and-partition-cb11c2b97e2b" rel="noopener noreferrer"&gt;Understanding the CAP Theorem: Balancing Consistency, Availability, and Partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@ajayverma23/demystifying-the-cap-theorem-understanding-consistency-availability-and-partition-tolerance-446de8452fac" rel="noopener noreferrer"&gt;Demystifying the CAP Theorem: Understanding Consistency, Availability, and Partition Tolerance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=BHqjEjzAicA" rel="noopener noreferrer"&gt;CAP Theorem Simplified&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=gkg-FAEXIkY" rel="noopener noreferrer"&gt;Friendly Intro To the CAP Theorem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=NKnHgU9nciU" rel="noopener noreferrer"&gt;CAP Theorem: PostgreSQL vs Cassandra&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>backend</category>
      <category>database</category>
      <category>webdev</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Databases: Cursor</title>
      <dc:creator>Nikita Kutsokon</dc:creator>
      <pubDate>Mon, 24 Feb 2025 10:15:00 +0000</pubDate>
      <link>https://dev.to/nikita_kutsokon/databases-cursor-1lib</link>
      <guid>https://dev.to/nikita_kutsokon/databases-cursor-1lib</guid>
      <description>&lt;h2&gt;
  
  
  What is it?
&lt;/h2&gt;

&lt;p&gt;A cursor in a database acts like a pointer that lets you handle each row in a result set one at a time. It's similar to a bookmark, helping you move through the rows sequentially. Imagine the cursor as a marker that begins just before the first row of your data; you can advance this marker through the rows, fetching and processing each row individually as you go.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaj5vraw4mm14yd1vkk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaj5vraw4mm14yd1vkk0.png" alt="Cursor vizualization" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's explore the syntax of cursors in SQL using a simple example. This example demonstrates how to declare, open, fetch, and process data using a cursor:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You have a table named Products with columns ProductID, ProductName, and Price. You want to print the name and price of each product.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECLARE @ProductID INT;
DECLARE @ProductName NVARCHAR(100);
DECLARE @Price DECIMAL(10, 2);

-- Declare the cursor
DECLARE ProductCursor CURSOR FOR
SELECT ProductID, ProductName, Price FROM Products;

-- Open the cursor
OPEN ProductCursor;

-- Fetch the first row into the variables
FETCH NEXT FROM ProductCursor INTO @ProductID, @ProductName, @Price;

-- Loop through the rows
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Print the product name and price
    PRINT 'Product ID: ' + CAST(@ProductID AS NVARCHAR(10)) +
          ', Name: ' + @ProductName +
          ', Price: ' + CAST(@Price AS NVARCHAR(10));

    -- Fetch the next row
    FETCH NEXT FROM ProductCursor INTO @ProductID, @ProductName, @Price;
END;

-- Close and deallocate the cursor
CLOSE ProductCursor;
DEALLOCATE ProductCursor;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why do we need it?
&lt;/h2&gt;

&lt;p&gt;You need a cursor in a database when your task requires detailed, row-by-row processing that cannot be efficiently handled with a single SELECT statement. Here are some key reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a cursor when each row needs different, complex actions based on its data.&lt;/li&gt;
&lt;li&gt;When the order of processing matters, like calculating running totals, a cursor ensures rows are handled in sequence.&lt;/li&gt;
&lt;li&gt;Cursors allow you to run different SQL commands for each row, adapting to each row's data.&lt;/li&gt;
&lt;li&gt;With a cursor, you can handle errors for each row separately, logging issues without stopping the entire process.&lt;/li&gt;
&lt;li&gt;Cursors help apply complex transformation rules to each row individually.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of Cursors
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Forward-Only&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Moves in one direction from the first to the last row.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You need to read through a list of customer orders to generate a summary report&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A forward-only cursor is ideal for this task. You open the cursor to fetch each order sequentially, calculate the total sales, and then move to the next order. This type of cursor is efficient for read-only operations where you don't need to revisit previous rows, making it suitable for generating reports or summaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Static&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Creates a temporary copy of the data, allowing modifications to the underlying data without affecting the cursor.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are generating a monthly sales report and want to ensure that the data remains consistent throughout the report generation process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A static cursor is perfect for this situation. It takes a snapshot of the data at the time the cursor is opened, ensuring that any changes made to the data by other users do not affect your report. This consistency is crucial for accurate reporting and analysis over a specific period.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Dynamic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reflects changes made to the data as you scroll through the cursor.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are monitoring real-time inventory levels and need to react to changes immediately.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A dynamic cursor is suitable here as it reflects changes made to the data as you move through the cursor. If items are added or removed from the inventory while you are processing, the cursor will reflect these changes, allowing you to make real-time adjustments and maintain accurate inventory levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Keyset-Driven&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Similar to a dynamic cursor, but it uses a set of keys, captured when the cursor is opened, to track rows. The membership and order of rows are fixed at open time; changes to the data within those rows remain visible, and rows deleted by other transactions are detected as missing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are processing a list of active user accounts and want to see the latest details for each account, without newly added accounts shifting the result set mid-run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A keyset-driven cursor is useful in this case. The set of keys is built when the cursor is opened, so rows inserted afterwards are not part of the result set, which keeps the membership stable while you work. At the same time, updates made to the non-key columns of the tracked rows are visible as you fetch them, and a row deleted by another transaction is reported as missing rather than returned with stale data. This makes the keyset-driven cursor a middle ground between static and dynamic cursors: stable membership combined with fresh data values.&lt;/p&gt;
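
&lt;p&gt;In SQL Server's T-SQL, each of these cursor types maps to a keyword in the DECLARE CURSOR statement. The sketch below assumes a hypothetical Orders table with OrderID and Amount columns; keyword support and default behavior vary between database systems:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Forward-only, read-only: the lightest option for sequential reads
DECLARE OrderCursor CURSOR FORWARD_ONLY READ_ONLY FOR
SELECT OrderID, Amount FROM Orders;

-- Static: operates on a snapshot taken when the cursor is opened
DECLARE ReportCursor CURSOR STATIC FOR
SELECT OrderID, Amount FROM Orders;

-- Dynamic: reflects inserts, updates, and deletes made while scrolling
DECLARE InventoryCursor CURSOR DYNAMIC FOR
SELECT OrderID, Amount FROM Orders;

-- Keyset-driven: row membership is fixed at open time,
-- but value changes to the tracked rows stay visible
DECLARE AccountCursor CURSOR KEYSET FOR
SELECT OrderID, Amount FROM Orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;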

&lt;h2&gt;
  
  
  Cursor vs Select
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cursors&lt;/strong&gt; are ideal for detailed, row-by-row processing, such as handling complex logic or sequential operations, like sending personalized emails or calculating running totals. They offer flexibility but can be less efficient due to their row-by-row nature.&lt;/p&gt;

&lt;p&gt;In contrast, &lt;strong&gt;SELECT&lt;/strong&gt; operations are optimized for set-based processing, efficiently retrieving and manipulating multiple rows at once. They are best for simple aggregations and read-only queries, like generating sales reports. SELECT operations are generally more performant, especially with large datasets.&lt;/p&gt;

&lt;p&gt;The choice between using a cursor and a SELECT operation depends on your task's complexity and performance needs. Use cursors for complex, row-level tasks and SELECT for simpler, set-based data retrieval.&lt;/p&gt;
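
&lt;p&gt;As a quick illustration, a 10% price increase for one category could be written as a fetch-and-update cursor loop, or as a single set-based statement that lets the engine process all matching rows in one operation (table and column names here are invented for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Set-based: one statement replaces an entire cursor loop
UPDATE Products
SET Price = Price * 1.10
WHERE Category = 'Electronics';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;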

&lt;h2&gt;
  
  
  Helpful Links 🤓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/office/client-developer/access/desktop-database-reference/what-is-a-cursor" rel="noopener noreferrer"&gt;What is a cursor?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/what-is-cursor-in-sql/" rel="noopener noreferrer"&gt;What is Cursor in SQL ?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.codeproject.com/Articles/5326773/What-is-a-Database-Cursor" rel="noopener noreferrer"&gt;What is a Database Cursor?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=N0z3gVHIxEo&amp;amp;t=19s" rel="noopener noreferrer"&gt;What is a Cursor?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=WDJRRNCGIRs" rel="noopener noreferrer"&gt;don’t use “offset” in your SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=zwDIN04lIpc&amp;amp;t=372s" rel="noopener noreferrer"&gt;Pagination in MySQL - offset vs. cursor&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>database</category>
    </item>
    <item>
      <title>Databases: ACID &amp; BASE</title>
      <dc:creator>Nikita Kutsokon</dc:creator>
      <pubDate>Sun, 23 Feb 2025 11:41:57 +0000</pubDate>
      <link>https://dev.to/nikita_kutsokon/databases-acid-base-282p</link>
      <guid>https://dev.to/nikita_kutsokon/databases-acid-base-282p</guid>
      <description>&lt;h2&gt;
  
  
  ACID
&lt;/h2&gt;

&lt;p&gt;In the realm of database management, ensuring the reliability and integrity of data is paramount. This is where the ACID principles come into play. ACID, an acronym for &lt;strong&gt;Atomicity&lt;/strong&gt;, &lt;strong&gt;Consistency&lt;/strong&gt;, &lt;strong&gt;Isolation&lt;/strong&gt;, and &lt;strong&gt;Durability&lt;/strong&gt;, represents a set of properties that guarantee reliable processing of database transactions. These principles are foundational to traditional relational database management systems (RDBMS) and are crucial for applications where data accuracy and consistency are non-negotiable. &lt;/p&gt;

&lt;p&gt;Before we start, let's recap what a transaction is:&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;transaction&lt;/strong&gt; is a sequence of one or more queries executed as a single unit of work. The main principle of a transaction is &lt;u&gt;"all or nothing"&lt;/u&gt;, meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If all queries in the transaction succeed, the changes are safely saved in the database.&lt;/li&gt;
&lt;li&gt;If any query in the transaction fails, all changes are undone, restoring the database to its original state before the transaction started.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ When changes are saved, we say the transaction is &lt;strong&gt;committed&lt;/strong&gt;. If changes are undone, the transaction is &lt;strong&gt;rolled back&lt;/strong&gt; ⚠️&lt;/p&gt;

&lt;p&gt;Imagine you are buying a laptop from an online store. When you click "Place Order", several things must happen in the database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check if the laptop is in stock.&lt;/li&gt;
&lt;li&gt;Deduct the laptop from the stock.&lt;/li&gt;
&lt;li&gt;Charge your credit card.&lt;/li&gt;
&lt;li&gt;Create an order record in the system.&lt;/li&gt;
&lt;li&gt;Send a confirmation email.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can think of it as a single unit of work. Let's also consider that each query interacts with various tables:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g54b2khnrnn4jl2qd97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g54b2khnrnn4jl2qd97.png" alt="Transaction example of place order" width="800" height="167"&gt;&lt;/a&gt;&lt;br&gt;
🤓 You might ask why we need a transaction in this scenario. Well, let's see what would happen if we didn't use one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The laptop is taken from stock, but the payment fails — now the store has fewer laptops but no money.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhez3qsu1jvix68bp0l5.png" alt="Transaction example: failure" width="800" height="278"&gt;
&lt;/li&gt;
&lt;li&gt;The payment goes through, but there’s no order record — you paid, but the store has no idea what you bought!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using a transaction keeps everything in sync. If something goes wrong, it rolls back all the steps to make sure nothing is left half-finished. Let's consider an example where all changes are committed only if every query is successful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd303exf9rjdcpzsvcy77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd303exf9rjdcpzsvcy77.png" alt="Transaction example: success" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A transaction makes sure everything is done correctly or nothing is done at all!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you are familiar with SQL, you can also see how this would look in a query (pseudocode):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN TRANSACTION;

-- 1. Check if the laptop is in stock
SELECT stock_quantity FROM laptops WHERE laptop_id = 1;

-- 2. Deduct the laptop from the stock
UPDATE laptops SET stock_quantity = stock_quantity - 1 WHERE laptop_id = 1;

-- 3. Charge the credit card (simplified)
INSERT INTO payments (user_id, amount) VALUES (123, 1000);

-- 4. Create an order record in the system
INSERT INTO orders (user_id, laptop_id, order_date) VALUES (123, 1, '2025-02-23');

-- 5. Send a confirmation email (simplified)
INSERT INTO email_queue (user_id, email_type, status) VALUES (123, 'order_confirmation', 'pending');

-- 6. If everything is successful, commit the transaction
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we understand what a transaction is, let's explore what the ACID acronym stands for! 🎉🎉🎉&lt;/p&gt;

&lt;h2&gt;
  
  
  Atomicity
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;All or nothing. A transaction either fully completes or has no effect.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Atomicity is one of the core principles of database transactions, ensuring that a transaction is treated as a single, indivisible unit of work. This means:&lt;br&gt;
✅ &lt;strong&gt;All-or-Nothing&lt;/strong&gt; – If a transaction completes successfully, all its changes are saved.&lt;br&gt;
✅ &lt;strong&gt;Rollback on Failure&lt;/strong&gt; – If any part of the transaction fails, all changes are discarded, keeping the database unchanged.&lt;/p&gt;

&lt;p&gt;🔑 Key points that define atomicity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All-or-Nothing Execution&lt;/strong&gt; – A transaction is either fully completed or not executed at all&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollback on Failure&lt;/strong&gt; – If one query in the transaction fails, all previous successful queries are undone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Partial Updates&lt;/strong&gt; – If a problem occurs, the database remains as if the transaction never happened&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚨 What happens if Atomicity is NOT ensured?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The payment succeeds, but the stock update fails → You are charged, but the store doesn't reserve your laptop!&lt;/li&gt;
&lt;li&gt;The stock is updated, but the payment fails → The laptop is removed from inventory without receiving money.&lt;/li&gt;
&lt;/ol&gt;
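
&lt;p&gt;In practice, atomicity is something you enforce by wrapping the steps in an explicit transaction and rolling back on any error. A simplified T-SQL sketch of the laptop order from above (error-handling syntax differs between database systems):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN TRY
    BEGIN TRANSACTION;

    UPDATE laptops SET stock_quantity = stock_quantity - 1 WHERE laptop_id = 1;
    INSERT INTO payments (user_id, amount) VALUES (123, 1000);
    INSERT INTO orders (user_id, laptop_id, order_date) VALUES (123, 1, '2025-02-23');

    COMMIT;  -- every step succeeded: make all changes permanent
END TRY
BEGIN CATCH
    -- any step failed: undo everything, as if the order never happened
    IF @@TRANCOUNT &gt; 0
        ROLLBACK;
END CATCH;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;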

&lt;h2&gt;
  
  
  Consistency
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Rules are followed. A transaction moves the database from one valid state to another.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It ensures that a transaction brings the database from one valid state to another. In other words, a transaction must always follow the rules, constraints, and data integrity of the database, maintaining its integrity throughout the process. This means:&lt;br&gt;
✅ &lt;strong&gt;Valid Transitions&lt;/strong&gt; – A transaction will only commit if it preserves the database's rules and integrity. &lt;br&gt;
✅ &lt;strong&gt;Invalid Transitions&lt;/strong&gt; – If the transaction violates any of the database’s constraints (data type rules, referential integrity), it will be rolled back.&lt;/p&gt;

&lt;p&gt;🔑 Key Points that Define Consistency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Integrity Preservation&lt;/strong&gt; – The database must transition from one consistent state to another, adhering to defined rules (constraints, triggers, business rules).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Invalid Data&lt;/strong&gt; – Transactions that would result in invalid data (violating primary key, foreign key, or unique constraints) are not allowed to commit. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforcing Constraints&lt;/strong&gt; – All database rules such as constraints (like unique values, not null, foreign keys) are maintained throughout the transaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Rollback&lt;/strong&gt; – If a transaction causes an inconsistency, the database will automatically roll back to its last consistent state, ensuring no corrupt data is saved. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Partial Data&lt;/strong&gt; – Transactions that leave data in an inconsistent state (such as violating business logic) will not complete.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚨 What happens if Consistency is NOT ensured?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Payment fails after inventory update -&amp;gt; If the stock is deducted but payment is not successful (due to a system error), the database may reflect that the laptop has been sold, but the payment was not processed. This creates inconsistency, as the stock is reduced without receiving money.&lt;/li&gt;
&lt;li&gt;Order without valid data -&amp;gt; If the transaction violates integrity constraints (an order is created without a valid customer ID or payment), the database might store incomplete or invalid data.&lt;/li&gt;
&lt;li&gt;Incomplete transactions -&amp;gt; The database might store records with missing or invalid information, leading to errors in reporting, business logic, or even operational decisions.&lt;/li&gt;
&lt;/ol&gt;
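
&lt;p&gt;Consistency rules are typically declared as constraints, so the database itself rejects invalid states. For example, a check constraint on the hypothetical laptops table guarantees that stock can never go negative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE laptops
ADD CONSTRAINT chk_stock_non_negative CHECK (stock_quantity &gt;= 0);

-- A transaction that tries to oversell now fails and is rolled back:
-- UPDATE laptops SET stock_quantity = stock_quantity - 1 WHERE laptop_id = 1;
-- (rejected if stock_quantity is already 0, keeping the database in a valid state)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;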

&lt;h2&gt;
  
  
  Isolation
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Transactions don't interfere. Each transaction is independent of others happening at the same time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Isolation is one of the key principles of the ACID properties in database transactions. It ensures that each transaction is executed in isolation from other concurrent transactions: even if multiple transactions are running at the same time, the changes made by each one are not visible to the others until they are fully committed, preventing them from interfering with each other. This means:&lt;br&gt;
✅ &lt;strong&gt;Independent Transactions&lt;/strong&gt; – Transactions are isolated from each other, ensuring that they don’t affect each other’s results. &lt;br&gt;
✅ &lt;strong&gt;No Interference&lt;/strong&gt; – Even though transactions are executed simultaneously, one transaction’s data changes will not be visible to another transaction until fully committed.&lt;/p&gt;

&lt;p&gt;🔑 Key Points that Define Isolation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transaction Independence&lt;/strong&gt; – Each transaction is executed as if it is the only transaction running in the system. No transaction can access the intermediate (uncommitted) data of another transaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prevents Dirty Reads&lt;/strong&gt; – A transaction should not read data that is in the middle of being modified by another transaction (dirty reads). &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prevents Non-Repeatable Reads&lt;/strong&gt; – Data read by a transaction cannot change before it is completed, ensuring that subsequent reads within the same transaction give consistent results. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prevents Phantom Reads&lt;/strong&gt; – A transaction should not see new rows that were added or deleted by another transaction after it started. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation Levels&lt;/strong&gt; – Isolation can be adjusted with different levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable), offering a trade-off between performance and strict isolation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚨 What happens if Isolation is NOT ensured?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Dirty Reads -&amp;gt; One transaction might read uncommitted changes made by another transaction. For example, Customer A might deduct the stock of the laptop, but Customer B reads the same stock quantity before Customer A commits the transaction.&lt;/li&gt;
&lt;li&gt;Lost Updates -&amp;gt; If two transactions are modifying the same data at the same time, one transaction might overwrite the changes of the other, leading to lost data.&lt;/li&gt;
&lt;li&gt;Inconsistent Data -&amp;gt; A transaction might use data that is still being modified by another transaction, leading to inconsistent results and potential errors.&lt;/li&gt;
&lt;li&gt;Phantom Reads -&amp;gt; A transaction may not see the same set of rows (inventory) every time it queries the database if another transaction inserts, deletes, or updates rows during the course of the first transaction.&lt;/li&gt;
&lt;/ol&gt;
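
&lt;p&gt;Most databases let you choose the isolation level per session or per transaction. A sketch in standard SQL syntax (default levels and exact behavior differ between systems):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN TRANSACTION;

SELECT stock_quantity FROM laptops WHERE laptop_id = 1;

-- Re-reading inside the same transaction returns the same value,
-- even if another session updates the row in between
SELECT stock_quantity FROM laptops WHERE laptop_id = 1;

COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;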

&lt;h2&gt;
  
  
  Durability
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Changes stick. Once a transaction is complete, its changes are permanent, even if the system fails.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This principle ensures that once a transaction has been committed, its changes are permanent, even in the event of a system failure (power loss or crash). This means: &lt;br&gt;
✅ &lt;strong&gt;Permanent Changes&lt;/strong&gt; – Once a transaction is committed, the changes are guaranteed to persist, no matter what happens next. &lt;br&gt;
✅ &lt;strong&gt;System Failures Don't Lose Data&lt;/strong&gt; – Even if the database crashes after the transaction commits, the changes will remain intact.&lt;/p&gt;

&lt;p&gt;🔑 Key Points that Define Durability&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Permanent Commit&lt;/strong&gt; – Once a transaction is successfully committed, its results are saved permanently to disk and will survive system crashes. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crash Recovery&lt;/strong&gt; – In the event of a crash, the database can recover to its last committed state, ensuring no data loss. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Integrity After Failure&lt;/strong&gt; – Even if there is a failure after the transaction is committed, the database will never revert to a previous, inconsistent state. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Rollback After Commit&lt;/strong&gt; – After a transaction is committed, there’s no way to undo its changes unless explicitly done by another transaction. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guarantee of Persistence&lt;/strong&gt; – Durability guarantees that once the transaction is completed, it becomes part of the database history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚨 What happens if Durability is NOT ensured?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If the system crashes right after a transaction commits, the committed changes could be lost -&amp;gt; the payment might be processed, but the system could fail before saving the updated inventory or generating the order record.&lt;/li&gt;
&lt;li&gt;When the system recovers, the transaction could be missing entirely -&amp;gt; inconsistent data such as uncharged payments, incorrect stock levels, or missing orders.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;📝 In summary, ACID provides the foundation for secure and efficient transaction management, helping prevent errors, inconsistencies, and data corruption, which ultimately builds trust in database systems and ensures smooth operations for end-users and businesses.&lt;/p&gt;




&lt;h2&gt;
  
  
  BASE
&lt;/h2&gt;

&lt;p&gt;With the advent of NoSQL databases, a new paradigm emerged for managing and manipulating data, emphasizing flexibility and scalability over rigid consistency. This shift led to the development of the BASE model, which stands for Basically Available, Soft state, Eventual consistency. Unlike traditional relational databases that prioritize strict consistency and transactional integrity, the BASE model embraces a more relaxed approach to data management. The BASE model is designed to address the challenges of distributed systems, where data is spread across multiple nodes and consistency is not always immediate. It allows for high availability and partition tolerance, making it ideal for large-scale applications where real-time consistency is less critical than continuous accessibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basically Available
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The system always provides a response to queries, but the data might not be the most recent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine you have a distributed database with multiple nodes. If one node fails, the system can still respond to queries using data from other nodes, even if that data is slightly outdated. This ensures that the system is always available to users, even during failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Soft State
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The system's state can change over time, even without new input, due to eventual consistency.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In a shopping cart application, if you add an item to your cart, the system might not immediately reflect this change across all servers. Over time, the system will update and synchronize, ensuring that all servers eventually show the same state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eventual Consistency
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;If no new updates are made, all accesses to a given data item will eventually return the last updated value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Suppose you update your profile picture on a social media platform. Due to eventual consistency, some users might see your old profile picture for a short period. However, after some time, everyone will see the new picture as the system synchronizes the updates across all servers.&lt;/p&gt;

&lt;p&gt;📝 In summary, BASE provides a flexible and scalable approach to database management, prioritizing availability and eventual consistency over strict transaction rules. This makes it ideal for distributed systems, allowing businesses to handle large-scale data efficiently while maintaining system responsiveness and fault tolerance.&lt;/p&gt;




&lt;h2&gt;
  
  
  ACID vs BASE
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consistency &amp;amp; Availability&lt;/strong&gt; - ACID prioritizes strong consistency and reliability, making it suitable for critical applications where data integrity is paramount. BASE prioritizes availability and partition tolerance, making it suitable for large-scale, distributed systems where immediate consistency is less critical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; - BASE systems are generally more scalable due to their relaxed consistency requirements, making them suitable for high-traffic applications. ACID systems can be less scalable due to the overhead of maintaining strong consistency. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Cases&lt;/strong&gt; - ACID is preferred for traditional enterprise applications like banking and healthcare, where data integrity and consistency are crucial. BASE is preferred for modern web applications, IoT, and real-time analytics, where scalability and availability are more important.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;📝 In summary, the choice between ACID and BASE depends on the specific requirements of the application, particularly the trade-offs between consistency, availability, and scalability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Helpful Links 🤓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.lifewire.com/abandoning-acid-in-favor-of-base-1019674" rel="noopener noreferrer"&gt;What Is BASE in Database Engineering?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/resources/basics/databases/acid-transactions" rel="noopener noreferrer"&gt;A Guide to ACID Properties in Database Management Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/compare/the-difference-between-acid-and-base-database/?nc1=h_ls" rel="noopener noreferrer"&gt;What’s the Difference Between an ACID and a BASE Database?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.freecodecamp.org/news/acid-databases-explained/" rel="noopener noreferrer"&gt;ACID Databases – Atomicity, Consistency, Isolation &amp;amp; Durability Explained&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GAe5oB742dw" rel="noopener noreferrer"&gt;ACID Properties in Databases With Examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=rqotnwpwPoA" rel="noopener noreferrer"&gt;ACID vs BASE SQL vs NoSQL Database Basics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=pomxJOFVcQs" rel="noopener noreferrer"&gt;Relational Database ACID Transactions (Explained by Example)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=6mLWpDFeR0c" rel="noopener noreferrer"&gt;What are ACID Transactions? | Which databases are ACID compliant?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>backend</category>
      <category>database</category>
      <category>systemdesign</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Databases: Partitioning &amp; Sharding</title>
      <dc:creator>Nikita Kutsokon</dc:creator>
      <pubDate>Wed, 19 Feb 2025 16:31:37 +0000</pubDate>
      <link>https://dev.to/nikita_kutsokon/databases-partitioning-sharding-4l50</link>
      <guid>https://dev.to/nikita_kutsokon/databases-partitioning-sharding-4l50</guid>
      <description>&lt;p&gt;As databases grow in size and complexity, ensuring efficient storage, retrieval, and management of data becomes a significant challenge. Two key strategies to handle large-scale data distribution are &lt;strong&gt;partitioning&lt;/strong&gt; and &lt;strong&gt;sharding&lt;/strong&gt;. While both techniques involve breaking down data into smaller segments, they serve different purposes and are used in different scenarios.&lt;/p&gt;




&lt;h2&gt;
  
  
  Partitioning
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Partitioning is splitting a database table into smaller parts within one database&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;✨ Think of a database as a club with different rooms for different music genres. Partitioning is how you decide who goes where—pop lovers in one room, rock fans in another. ✨&lt;/p&gt;

&lt;h2&gt;
  
  
  Vertical &amp;amp; Horizontal Partitioning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Vertical Partitioning&lt;/strong&gt; - splits a table into multiple tables &lt;u&gt;by columns&lt;/u&gt;. Each new table contains a subset of the columns from the original table, let's look at an example:&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Original Table&lt;/u&gt;: &lt;br&gt;
&lt;em&gt;customers (customer_id, name, email, address, phone_number)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Partitioned Tables&lt;/u&gt;:&lt;br&gt;
&lt;em&gt;customer_details (customer_id, name, email)&lt;br&gt;
customer_contact (customer_id, address, phone_number)&lt;/em&gt;&lt;/p&gt;
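
&lt;p&gt;In DDL terms, the vertical split above is simply two tables sharing the same key. A sketch (column types are assumed for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customer_details (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    email       VARCHAR(100)
);

-- Same key links the partitions back together when needed
CREATE TABLE customer_contact (
    customer_id  INT PRIMARY KEY REFERENCES customer_details(customer_id),
    address      VARCHAR(200),
    phone_number VARCHAR(20)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;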

&lt;p&gt;&lt;strong&gt;Horizontal partitioning&lt;/strong&gt; - splits a table into multiple tables &lt;u&gt;by rows&lt;/u&gt;. Each new table contains a subset of the rows from the original table, let's explore an example:&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Original Table&lt;/u&gt;:&lt;br&gt;
&lt;em&gt;sales (sale_id, product_id, sale_date, amount)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Partitioned Tables:&lt;/u&gt;&lt;br&gt;
&lt;em&gt;sales_2023 (sales from 2023)&lt;br&gt;
sales_2024 (sales from 2024)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of partitioning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. By range&lt;/strong&gt;: &lt;em&gt;Data is split based on a range of values (dates or numbers)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Suppose you have a music database and you want to partition the data based on the release year of the songs: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Songs released from 1960 to 1969&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Songs released from 1970 to 1979&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Songs released from 1980 to 1989&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Songs released from 1990 to 1999&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE songs (
    song_id INT,
    title VARCHAR(100),
    artist VARCHAR(100),
    release_year INT
)
PARTITION BY RANGE (release_year) (
    PARTITION p1 VALUES LESS THAN (1970),
    PARTITION p2 VALUES LESS THAN (1980),
    PARTITION p3 VALUES LESS THAN (1990),
    PARTITION p4 VALUES LESS THAN (2000)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;😎 Range partitioning is ideal when you need to analyze data over specific time periods, for example, studying the evolution of music genres across decades. It is also useful for archiving old data while keeping recent data easily accessible: you can archive songs from the 1960s and 1970s while keeping newer songs in more frequently accessed partitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. By list&lt;/strong&gt;: &lt;em&gt;Data is split based on predefined categories (regions or groups)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Suppose you have a music database and you want to partition the data based on the genre of the songs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Rock songs&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Pop songs&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Jazz songs&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Classical songs&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE songs (
    song_id INT,
    title VARCHAR(100),
    artist VARCHAR(100),
    genre VARCHAR(50)
)
PARTITION BY LIST (genre) (
    PARTITION p1 VALUES IN ('Rock'),
    PARTITION p2 VALUES IN ('Pop'),
    PARTITION p3 VALUES IN ('Jazz'),
    PARTITION p4 VALUES IN ('Classical')
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;😎 List partitioning is best when you frequently query data based on specific categories, such as music genres, allowing for efficient retrieval of songs within each genre.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. By hash&lt;/strong&gt;: &lt;em&gt;Data is evenly split using a hashing function&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Suppose you have a music database and you want to partition the data evenly based on the song ID using a hashing function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Songs with song_id % 4 = 0&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Songs with song_id % 4 = 1&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Songs with song_id % 4 = 2&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Partition&lt;/u&gt;: &lt;em&gt;Songs with song_id % 4 = 3&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE songs (
    song_id INT,
    title VARCHAR(100),
    artist VARCHAR(100),
    genre VARCHAR(50)
)
PARTITION BY HASH (song_id)
PARTITIONS 4;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;😎 Hash partitioning is suitable when you need to ensure an even distribution of data across partitions, preventing hotspots and balancing the load evenly, especially useful for large datasets with uniform access patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros &amp;amp; Cons
&lt;/h2&gt;

&lt;p&gt;🟢 &lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Improved Performance&lt;/strong&gt; - queries that access a small portion of the data can be faster because they only need to scan relevant partitions. Indexing can be more efficient, as smaller indexes are faster to search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier Management&lt;/strong&gt; - administrative tasks like backups and archiving can be performed on individual partitions rather than the entire table. Maintenance operations can be done on a per-partition basis, reducing downtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔴 &lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt; - designing and implementing an effective partitioning strategy can be complex and requires careful planning.
It may require additional administrative overhead to manage partitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Changes&lt;/strong&gt; - existing applications may need to be modified to take full advantage of partitioning.
Queries may need to be rewritten to optimize for partitioned tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Partitioning is most beneficial for very large tables. Smaller tables may not see significant benefits and could even experience performance degradation.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Sharding
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Sharding is splitting a database into smaller, independent databases (shards), where each shard stores a portion of the data&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;✨ Sharding is like hosting your party in multiple locations—one club for the 90s hits, another for techno. Each place handles its own crowd. ✨&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of sharding
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Ranged/Dynamic Sharding&lt;/strong&gt;: &lt;em&gt;Data is allocated to shards based on a predefined range of values from a specific field (shard key)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;😎 Suitable for datasets where queries often target specific ranges of data, such as date ranges or numerical sequences. For instance, consider this case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;u&gt;Shard A&lt;/u&gt;: Records with IDs from 0 to 19&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Shard B&lt;/u&gt;: Records with IDs from 20 to 39&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Shard C&lt;/u&gt;: Records with IDs from 40 to 50&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ Effective shard keys should have high cardinality and well-distributed frequency to avoid unbalanced shards.&lt;/p&gt;
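
&lt;p&gt;The routing above can be sketched in application code. This is a minimal illustration, not a real driver API; the shard names and ID boundaries are the hypothetical ones from the example:&lt;/p&gt;

```python
# Hypothetical range-based shard router; shard names and ID
# boundaries mirror the example above and are illustrative only.
RANGES = [
    (0, 19, "shard_a"),
    (20, 39, "shard_b"),
    (40, 50, "shard_c"),
]

def route_by_range(record_id):
    """Return the shard whose ID range contains record_id."""
    for low, high, shard in RANGES:
        if record_id in range(low, high + 1):
            return shard
    raise ValueError(f"no shard covers id {record_id}")

print(route_by_range(7))   # shard_a
print(route_by_range(42))  # shard_c
```

&lt;p&gt;A lookup table like this is straightforward to rebalance: moving a boundary only requires migrating the rows in the affected range.&lt;/p&gt;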

&lt;p&gt;&lt;strong&gt;2. Algorithmic/Hashed Sharding&lt;/strong&gt;: &lt;em&gt;Data is allocated to shards using a hash function applied to a field or set of fields&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;😎 Ideal for evenly distributing data across shards when a suitable shard key is not available. To give you an idea, here’s an example:&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Hash Value&lt;/u&gt; = ID % Number of Shards&lt;/p&gt;

&lt;p&gt;⚠️ Can lead to increased broadcast operations and complex resharding processes when the number of shards changes.&lt;/p&gt;
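
&lt;p&gt;The formula above, and the resharding caveat, can be demonstrated with a small sketch using plain modulo hashing (purely illustrative):&lt;/p&gt;

```python
def route_by_hash(record_id, num_shards):
    """Map an ID to a shard index using the modulo formula above."""
    return record_id % num_shards

print(route_by_hash(10, 4))  # 2: with 4 shards, ID 10 lives on shard 2
print(route_by_hash(10, 5))  # 0: adding a 5th shard moves it to shard 0

# Growing from 4 to 5 shards relocates most keys, which is why
# resharding is expensive with plain modulo hashing.
moved = sum(1 for k in range(100) if k % 4 != k % 5)
print(moved)  # 80 of the first 100 IDs change shards
```

&lt;p&gt;Schemes such as consistent hashing exist precisely to reduce this relocation cost, at the price of extra routing machinery.&lt;/p&gt;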

&lt;p&gt;&lt;strong&gt;3. Entity/Relationship-Based Sharding&lt;/strong&gt;: &lt;em&gt;Related data is kept together on the same physical shard.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;😎 Effective in relational databases where related data is frequently accessed together, reducing the need for broadcast operations. Suppose we take this scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;u&gt;Shard A&lt;/u&gt;: User data and related payment methods for users A-M&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Shard B&lt;/u&gt;: User data and related payment methods for users N-Z&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ Requires careful planning to ensure related data is correctly grouped and managed.&lt;/p&gt;
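
&lt;p&gt;Here is a sketch of that grouping, using the hypothetical A-M/N-Z split above (names and shard labels are illustrative):&lt;/p&gt;

```python
# Hypothetical entity-based router: users and their payment methods
# share one shard key (first letter of the user name), so related
# rows land on the same shard and can be joined locally.
def shard_for_user(name):
    first = name[0].upper()
    return "shard_a" if first in "ABCDEFGHIJKLM" else "shard_b"

def shard_for_payment(owner_name):
    # Payment methods are routed by their owner, never independently.
    return shard_for_user(owner_name)

print(shard_for_user("Alice"), shard_for_payment("Alice"))  # shard_a shard_a
print(shard_for_user("Zoe"), shard_for_payment("Zoe"))      # shard_b shard_b
```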

&lt;p&gt;&lt;strong&gt;4. Geography-Based Sharding&lt;/strong&gt;: &lt;em&gt;Data is allocated to shards based on geographic information, with shards often located in corresponding geographic regions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;😎 Improves performance and reduces latency by storing data closer to the users accessing it. As an illustration, let’s look at this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;u&gt;Shard A&lt;/u&gt;: Data for users in North America&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Shard B&lt;/u&gt;: Data for users in Europe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ Effective for global applications where data locality is important for performance and compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros &amp;amp; Cons
&lt;/h2&gt;

&lt;p&gt;🟢 &lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; - sharding allows databases to scale horizontally by distributing data across multiple servers. This makes it easier to handle large volumes of data and high traffic loads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; - by distributing data and queries across multiple shards, you can improve query performance and reduce latency, as each shard handles a smaller portion of the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Availability&lt;/strong&gt; - sharding can enhance availability by isolating failures to individual shards. If one shard goes down, the others can continue to operate, minimizing downtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Isolation&lt;/strong&gt; - issues in one shard do not necessarily affect others, which can improve the overall reliability of the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geographic Distribution&lt;/strong&gt; - sharding allows data to be distributed across different geographic locations, which can reduce latency for users in different regions and comply with data sovereignty regulations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔴 &lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt; - implementing and managing a sharded database architecture can be complex. It requires careful planning and expertise to ensure data is distributed and accessed efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Consistency&lt;/strong&gt; - Maintaining data consistency across shards can be challenging, especially in environments where data changes frequently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Complexity&lt;/strong&gt; - queries that span multiple shards can be more complex to implement and may require additional logic to aggregate results from different shards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Overhead&lt;/strong&gt; - sharding introduces additional operational overhead, including the need to manage multiple database instances and ensure they are properly synchronized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Management&lt;/strong&gt; - each shard requires its own resources (e.g., CPU, memory), which can lead to increased infrastructure costs and the need for more sophisticated resource management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup and Recovery&lt;/strong&gt; - backing up and recovering data from a sharded database can be more complex compared to a single database instance, as each shard needs to be managed individually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Sharding improves scalability by distributing data across multiple databases. It’s useful for large-scale systems but adds complexity. Smaller databases may not benefit.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Partitioning vs Sharding
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67fq95kovt6ewc7fok2d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67fq95kovt6ewc7fok2d.png" alt="Partitioning and sharding comperision" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Helpful Links 🤓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/what-is/database-sharding/#:~:text=A%20growing%20database%20consumes%20more,down%20the%20application%20for%20maintenance." rel="noopener noreferrer"&gt;Sharding AWS team&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/resources/products/capabilities/database-sharding-explained" rel="noopener noreferrer"&gt;Sharding MongoDb team&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://planetscale.com/blog/sharding-vs-partitioning-whats-the-difference" rel="noopener noreferrer"&gt;Sharding vs Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.timescale.com/learn/when-to-consider-postgres-partitioning" rel="noopener noreferrer"&gt;When to Consider Postgres Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.brentozar.com/archive/2012/03/how-decide-if-should-use-table-partitioning/" rel="noopener noreferrer"&gt;How To Decide if You Should Use Table Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=XP98YCr-iXQ" rel="noopener noreferrer"&gt;What is Database Sharding?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=KWyVn0aC3kc" rel="noopener noreferrer"&gt;Database Sharding vs Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=JepwOrjeLnk" rel="noopener noreferrer"&gt;Sharding strategies: lookup-based, range-based, and hash-based&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=be6PLMKKSto" rel="noopener noreferrer"&gt;The Basics of Database Sharding and Partitioning in System Design&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>backend</category>
      <category>database</category>
      <category>systemdesign</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
