Sagar Dutta

DBMS Concepts

ACID

To ensure that the database is in a consistent state both before and after a transaction, certain properties must be followed, known as the ACID properties. ACID is an acronym for Atomicity, Consistency, Isolation, and Durability. Much of the architecture of any modern relational database is founded on these properties.

  • Atomicity

    Atomicity describes how the data stays atomic: if an action is performed on the data, it is either executed completely or not executed at all. All the DML statements (insert, update, delete) inside a transaction must either complete successfully or all be rolled back. A transaction shouldn't pause or run in segments; if an operation is to be performed on the transaction, it must be carried out entirely, not in parts. Atomicity involves two operations (a short SQL sketch follows this list):

    • Commit: if a transaction commits, its changes become visible.
    • Abort: if a transaction aborts, its changes are invisible (rolled back).
  • Consistency

    Consistency ensures that the integrity of the data is maintained at all times. In a DBMS this means every change must respect the database's integrity constraints, so that the database moves from one valid state to another. When it comes to transactions, data integrity is crucial to ensuring that the database is consistent both before and after the transaction.

  • Isolation

    Isolation guarantees that numerous transactions can take place simultaneously without leaving the database in an inconsistent state. Transactions proceed independently and without interference: the data modifications made by one transaction must be isolated from the data modifications made by all other transactions. This property guarantees that the state produced by concurrently running transactions is the same as the state produced by running them serially in some order.

  • Durability

    Durability guarantees that once a transaction has completed successfully, its data remains in the database permanently. Stored data must survive system failures, malfunctions, and crashes; if data is lost, the recovery manager is responsible for restoring the database to a durable state. Whenever changes are made to the database, the COMMIT command must be used to make those changes permanent and avoid the risk of losing them.
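
As a minimal sketch of commit and abort, assuming a hypothetical Accounts(id, balance) table and PostgreSQL/MySQL-style syntax:

    -- Abort: the uncommitted change is undone and never becomes visible.
    BEGIN;
    UPDATE Accounts SET balance = balance - 100 WHERE id = 1;
    ROLLBACK;

    -- Commit: both updates take effect together, as one atomic unit.
    BEGIN;
    UPDATE Accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE Accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;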


CAP Theorem

The CAP theorem states that a distributed database system has to make a tradeoff between Consistency and Availability when a Partition occurs.

  • Consistency

    Consistency means that the user sees the same data no matter which node of the system they connect to, and that this data is the most recent data written to the system. If a write has occurred on one node, it must be replicated to all of that node's replicas, so that whenever a user connects to the system, they see the same information.

    However, a system that maintains consistency instantaneously and globally is nearly impossible to build. Therefore, the goal is to make this transition fast enough that it is hardly noticeable.

  • Availability

    Availability means that every request from the user should elicit a response from the system. Whether the user wants to read or write, the user should get a response even if the operation was unsuccessful.

    Availability matters when the client or user must be able to access the data at all times, even if it is not consistent.

  • Partition Tolerance

    A partition refers to a communication break between nodes within a distributed system: if a node cannot receive messages from another node in the system, there is a partition between the two. A partition may be caused by a network failure, a server crash, or some other fault.

    So, if a partition means a break in communication, then partition tolerance means that the system should still work even when a partition occurs: if a node cannot be reached, one of its replicas should be able to serve the data the user requires.

    This is handled by keeping replicas of the records on multiple different nodes, so that even if a partition occurs, the data can be retrieved from a replica.


Joins

Let's understand joins with an example. Consider these two tables:

Student

ROLL_NO  NAME  AGE
1        John  18
2        Amy   19
3        Bob   20
4        Eve   21
6        Tom   22

StudentCourse

COURSE_ID  ROLL_NO
C1         1
C2         2
C3         3
C4         4
C5         5
  • Inner Join

    This join returns the records that have matching values in both tables. For example, this query shows the names of students together with the courses they are enrolled in.

    SELECT Student.NAME, StudentCourse.COURSE_ID
    FROM Student
    INNER JOIN StudentCourse
    ON Student.ROLL_NO = StudentCourse.ROLL_NO;
    
    NAME  COURSE_ID
    John  C1
    Amy   C2
    Bob   C3
    Eve   C4
  • Left Join

    This join will return all the records from the left table, and the matched records from the right table. If there is no match, the right side will contain null. For example, this query will show the names and courses of all students, even if they are not enrolled in any course.

    SELECT Student.NAME, StudentCourse.COURSE_ID
    FROM Student
    LEFT JOIN StudentCourse
    ON Student.ROLL_NO = StudentCourse.ROLL_NO;
    
    NAME  COURSE_ID
    John  C1
    Amy   C2
    Bob   C3
    Eve   C4
    Tom   null
  • Right Join

    This join returns all the records from the right table and the matched records from the left table. If there is no match, the left side will contain null. For example, this query shows all course enrollments, even those with no matching student.

    SELECT Student.NAME, StudentCourse.COURSE_ID
    FROM Student
    RIGHT JOIN StudentCourse
    ON Student.ROLL_NO = StudentCourse.ROLL_NO;
    
    NAME  COURSE_ID
    John  C1
    Amy   C2
    Bob   C3
    Eve   C4
    null  C5
  • Full Join

    This join returns all the records when there is a match in either the left or the right table; where there is no match, the missing side contains null. For example, this query shows all students and all course enrollments.

    SELECT Student.NAME, StudentCourse.COURSE_ID
    FROM Student
    FULL JOIN StudentCourse
    ON Student.ROLL_NO = StudentCourse.ROLL_NO;
    
    NAME  COURSE_ID
    John  C1
    Amy   C2
    Bob   C3
    Eve   C4
    Tom   null
    null  C5
  • Natural Join

    This join returns the records that have matching values in all common columns, where the common columns are matched by name and data type. For example, this query shows the full details of students together with the courses they are enrolled in.

    SELECT *
    FROM Student
    NATURAL JOIN StudentCourse;
    
    ROLL_NO  NAME  AGE  COURSE_ID
    1        John  18   C1
    2        Amy   19   C2
    3        Bob   20   C3
    4        Eve   21   C4

Aggregations, Filters in queries

Data aggregation is any process whereby data is gathered and expressed in a summary form. When data is aggregated, atomic data rows, typically gathered from multiple sources, are replaced with totals or summary statistics based on those observations. Aggregate data is typically found in a data warehouse, as it can provide answers to analytical questions and also dramatically reduce the time needed to query large sets of data.

Data aggregation is often used to provide statistical analysis for groups of people and to create useful summary data for business analysis.

Data aggregators work by combining atomic data from multiple sources, processing the data for new insights and presenting the aggregate data in a summary view. Furthermore, data aggregators usually provide the ability to track data lineage and can trace back to the underlying atomic data that was aggregated.

A natural extension to aggregation is filtering: predicates that exclude data not matching a condition. Filters cannot do anything that could not be done directly in the query itself, but they allow us to separate some aspect of a query and reuse that logic across different filtered views. In SQL, the WHERE clause filters individual rows before any grouping happens, while the HAVING clause filters the groups produced by GROUP BY, as the sketch below shows.
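
A small sketch using the Student and StudentCourse tables from the Joins section, showing both kinds of filter around an aggregation:

    SELECT s.NAME, COUNT(sc.COURSE_ID) AS num_courses
    FROM Student s
    LEFT JOIN StudentCourse sc ON s.ROLL_NO = sc.ROLL_NO
    WHERE s.AGE >= 19                  -- row filter: applied before grouping
    GROUP BY s.NAME
    HAVING COUNT(sc.COURSE_ID) >= 1;   -- group filter: applied after aggregation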


Normalization

Database Normalization is a technique of organizing the data in the database to minimize redundancy and improve data integrity. It is a multi-step process that puts data into tabular form, removing duplicated data from the relation tables.

Normalization is used for mainly two purposes:

  • Eliminating redundant (useless) data.

  • Ensuring data dependencies make sense, i.e., data is logically stored.

There are different levels of normalization, called normal forms, that follow certain rules and constraints:

  • First Normal Form (1NF)

    A relation is in 1NF if it contains only atomic values, i.e., each attribute has a single value for each tuple. There are no repeating groups or arrays in the relation.

  • Second Normal Form (2NF)

    A relation is in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key, i.e., there are no partial dependencies. A partial dependency occurs when a non-key attribute depends on only a part of a composite primary key (see the decomposition sketch after this list).

  • Third Normal Form (3NF)

    A relation is in 3NF if it is in 2NF and there are no transitive dependencies, i.e., no non-key attribute depends on another non-key attribute. A transitive dependency occurs when a non-key attribute depends on another non-key attribute that depends on the primary key.

  • Boyce Codd Normal Form (BCNF)

    Boyce Codd Normal Form (BCNF) is a stronger version of Third Normal Form (3NF) that eliminates more redundancy and anomalies. A relation is in BCNF if for every functional dependency X -> Y, X is a superkey or a candidate key. A superkey is a set of attributes that uniquely identifies a tuple, and a candidate key is a minimal superkey.
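
As a small sketch of the 2NF rule, consider a hypothetical Enrollment table whose composite key is (ROLL_NO, COURSE_ID) but which also stores COURSE_NAME. COURSE_NAME depends only on COURSE_ID, a partial dependency, so the table is decomposed:

    -- Violates 2NF: COURSE_NAME depends only on COURSE_ID,
    -- which is just part of the key (ROLL_NO, COURSE_ID).
    CREATE TABLE Enrollment (
        ROLL_NO     INT,
        COURSE_ID   VARCHAR(10),
        COURSE_NAME VARCHAR(50),
        PRIMARY KEY (ROLL_NO, COURSE_ID)
    );

    -- 2NF decomposition: the partial dependency moves to its own table.
    CREATE TABLE Course (
        COURSE_ID   VARCHAR(10) PRIMARY KEY,
        COURSE_NAME VARCHAR(50)
    );

    CREATE TABLE Enrollment2 (
        ROLL_NO   INT,
        COURSE_ID VARCHAR(10),
        PRIMARY KEY (ROLL_NO, COURSE_ID)
    );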


Indexes

Indexing is a data structure technique that allows us to quickly retrieve records from a database file. An index is a small table with only two columns: the first column holds a copy of the primary or candidate key of the table, and the second column holds a set of pointers to the disk blocks where each key value is stored.

An index:

  • Takes a search key as input.

  • Efficiently returns a collection of matching records.

Syntax:

CREATE INDEX index_name ON table_name (column_name);
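
For example, an index on the Student table's NAME column (idx_student_name is just an illustrative name) lets the database find rows by name without scanning the whole table:

    CREATE INDEX idx_student_name ON Student (NAME);

    -- This lookup can now use the index:
    SELECT * FROM Student WHERE NAME = 'Amy';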

Transactions

A database transaction is a sequence of operations on a database that is treated as a single unit, executed on the principle of "do everything or do nothing"; a successful transaction must pass the ACID test. A transaction may consist of a single read, write, delete, or update operation, or a combination of these.
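
A minimal sketch, reusing the tables from the Joins section: enrolling a new student touches two tables, and the transaction guarantees that either both rows are added or neither is.

    BEGIN;

    INSERT INTO Student (ROLL_NO, NAME, AGE) VALUES (7, 'Ann', 20);
    INSERT INTO StudentCourse (COURSE_ID, ROLL_NO) VALUES ('C6', 7);

    COMMIT;  -- on any failure, issue ROLLBACK instead and neither insert survives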


Locking mechanism

Locking mechanisms are a way for databases to produce the results of sequential execution without actually running transactions one at a time. Locks secure the data that is being used, so that anomalies such as lost updates, or spurious data left behind by a lost transaction, cannot occur.

Problems that Locks Solve:

  • Lost Update Problem

    A lost update occurs when two different transactions try to update the same column on the same row at the same time, and one update silently overwrites the other.

  • Temporary Update Problem

    The temporary update (dirty read) problem occurs when one transaction updates an item and then fails before committing, after another transaction has already read the uncommitted value.

  • Incorrect Summary Problem

    The incorrect summary problem occurs when one transaction computes an aggregate (summary) over all instances of a repeated data item while a second transaction updates some of those instances; the summary then mixes values from before and after the update.

  • Phantom Reads

    A phantom read occurs when a transaction re-runs a query and finds rows that another, concurrently committed transaction has inserted (or removed) since the first read. In general, a lower isolation level increases the ability of many users to access the same data at the same time, but it also increases the number of such concurrency effects (dirty reads, lost updates, phantoms) that users might encounter.

Different Types of Locks:

There are three primary types of locks that are used in a database.

  • Read Locks

    These locks ensure that the data can only be read. Depending on the database system and the restrictions on the read lock, this can range from allowing only one user to read the data, to allowing every user to read it while preventing anyone from modifying it. A read lock can be applied to a single row, a section of rows, or an entire table; the database system in use may also limit how much data can be locked for reading.

  • Write Locks

    This type of lock is used for updates in the database system. When a write lock is applied, it prevents other transactions from changing the records being accessed, while still allowing them to read the data before it is modified and the changes are made permanent.

  • Exclusive Write Locks

    This type of lock is similar to a write lock, except that with an exclusive write lock the only transaction that can read or modify the data is the one holding the lock. No other transaction can even read the data while this lock is applied.

There are also locks that signal the intention to request another type of lock; these are called multi-level (intention) locks. They are useful for showing other transactions what kinds of locks a transaction holds throughout the hierarchy of data levels.
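
As a sketch of explicit write locking (the SELECT ... FOR UPDATE form is supported by e.g. PostgreSQL and MySQL), a transaction can lock a row so that concurrent transactions cannot change it until the lock is released at commit:

    BEGIN;

    -- Take a write lock on John's row: other transactions that try to
    -- update it (or lock it themselves) will wait until this one ends.
    SELECT * FROM Student WHERE ROLL_NO = 1 FOR UPDATE;

    UPDATE Student SET AGE = 19 WHERE ROLL_NO = 1;

    COMMIT;  -- releases the lock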


Database Isolation Levels

Database isolation levels are used to define the degree to which one transaction must be isolated from the data modifications made by other concurrent transactions. Isolation levels prevent certain anomalies or side effects that can occur when transactions run concurrently, such as dirty reads, non-repeatable reads, and phantom reads.

The SQL standard defines four isolation levels: Read Uncommitted, Read Committed, Repeatable Read, and Serializable. However, different database systems may implement them differently or provide additional levels.

  • Read Uncommitted

    It is the lowest isolation level. It allows a transaction to read uncommitted changes made by other transactions, which may lead to dirty reads. It does not provide any locking on the data being read or written.

  • Read Committed

    It is a higher isolation level than Read Uncommitted. It guarantees that a transaction only reads data that has been committed by other transactions. It prevents dirty reads but not non-repeatable reads or phantom reads. It usually uses shared locks on the data being read and exclusive locks on the data being written.

  • Repeatable Read

    It is a higher isolation level than Read Committed. It ensures that a transaction sees the same data when it reads the same row multiple times. It prevents dirty reads and non-repeatable reads but not phantom reads. It usually uses shared locks on all the rows read by the transaction and exclusive locks on the rows written by the transaction.

  • Serializable

    It is the highest isolation level. It guarantees that a transaction sees a consistent snapshot of the database as if no other transactions were running concurrently. It prevents dirty reads, non-repeatable reads, and phantom reads. It usually uses range locks or predicate locks on the data accessed by the transaction.

Higher isolation levels provide more consistency but also more locking and less concurrency. Therefore, choosing an appropriate isolation level depends on the application requirements and trade-offs.
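
As a sketch (PostgreSQL-style syntax shown; MySQL instead issues SET TRANSACTION before starting the transaction), a transaction can choose its own isolation level:

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

    SELECT AGE FROM Student WHERE ROLL_NO = 1;
    -- Even if another transaction commits a change to this row now,
    -- the repeated read below returns the same value as the first one.
    SELECT AGE FROM Student WHERE ROLL_NO = 1;

    COMMIT;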


Triggers

A database trigger is a stored procedure that is automatically executed when a certain event occurs on a table or view in a database. For example, a trigger can be used to enforce data integrity, update another table, or audit changes.
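
A minimal sketch in MySQL-style syntax (PostgreSQL instead attaches a separate trigger function); StudentAudit is a hypothetical log table with the columns shown:

    CREATE TRIGGER student_name_audit
    AFTER UPDATE ON Student
    FOR EACH ROW
    INSERT INTO StudentAudit (ROLL_NO, OLD_NAME, NEW_NAME, CHANGED_AT)
    VALUES (OLD.ROLL_NO, OLD.NAME, NEW.NAME, NOW());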

