<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mervyn Lee</title>
    <description>The latest articles on DEV Community by Mervyn Lee (@mervynlee94).</description>
    <link>https://dev.to/mervynlee94</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F297430%2F21dcbeb0-2702-4a89-a2c4-f91e10dea728.jpeg</url>
      <title>DEV Community: Mervyn Lee</title>
      <link>https://dev.to/mervynlee94</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mervynlee94"/>
    <language>en</language>
    <item>
      <title>The journey of using Git - Zero to Workflow</title>
      <dc:creator>Mervyn Lee</dc:creator>
      <pubDate>Wed, 06 Jan 2021 16:00:23 +0000</pubDate>
      <link>https://dev.to/mervynlee94/the-journey-of-using-git-zero-to-workflow-3l87</link>
      <guid>https://dev.to/mervynlee94/the-journey-of-using-git-zero-to-workflow-3l87</guid>
      <description>&lt;p&gt;During my first year in the University, people have been talking about Git. Some of us, the freshies, tried to adapt it in our assignments. We rushed through the basic tutorial, learning about the magic commands of &lt;code&gt;git clone&lt;/code&gt;, &lt;code&gt;git pull&lt;/code&gt;, &lt;code&gt;git add&lt;/code&gt;, &lt;code&gt;git commit&lt;/code&gt; and &lt;code&gt;git push origin master&lt;/code&gt; and assume that they are the gist of Git. We knew a little about the scary merge conflict little monster, therefore we had some discussion to avoid it at the first place. &lt;/p&gt;

&lt;p&gt;We managed to pull through.&lt;/p&gt;

&lt;p&gt;In fact, with the small improvement of handling merge conflicts properly, I managed to pull through the early part of my career at a startup and in a new team at a mid-size company, until I joined a large organisation that requires heavy collaboration between engineers. I was required to study the onboarding materials and got exposed to Git Workflows. In the meantime, I had started to contribute to open source projects on GitHub. It was an Ah-ha! moment for me as everything connected. &lt;/p&gt;

&lt;p&gt;I wish I'd known the term &lt;em&gt;Git Workflow&lt;/em&gt; earlier. Hopefully this article helps you gain awareness of some common Git Workflows and encourages you to explore them in depth if you haven't already. &lt;/p&gt;

&lt;h3&gt;
  
  
  Centralized Workflow
&lt;/h3&gt;

&lt;p&gt;If you work individually, or in a team where everyone develops features directly on the master/main branch, you are using the &lt;a href="https://www.atlassian.com/git/tutorials/comparing-workflows#centralized-workflow"&gt;&lt;strong&gt;Centralized Workflow&lt;/strong&gt;&lt;/a&gt;. It is okay for a small team, but the conflict resolution process can become a bottleneck as the team scales. This workflow is intuitive for Git beginners and for teams transitioning from SVN. &lt;/p&gt;

&lt;p&gt;The Centralized Workflow is essentially a building block for other Git workflows.&lt;/p&gt;
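
&lt;p&gt;Day-to-day work in this workflow only needs the basic commands mentioned earlier. A minimal sketch, assuming a remote named origin and a master branch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# get a local copy of the central repository
git clone https://host/path/to/repo.git
# ...edit files, then stage and record the change...
git add .
git commit -m "Describe the change"
# replay your commits on top of the latest central history
git pull --rebase origin master
# publish the work back to the central repository
git push origin master
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;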

&lt;h3&gt;
  
  
  Feature branching
&lt;/h3&gt;

&lt;p&gt;With some further reading on Git branching, it becomes intuitive to create branches for development. Perhaps you have tried creating one branch for the entire development, or assigning one branch to each developer with the developer's name as the branch name. If so, you are very close to the idea of Feature branching.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow"&gt;&lt;strong&gt;Feature Branching&lt;/strong&gt;&lt;/a&gt; is a logical extension of the Centralized Workflow. The core idea behind the Feature Branch Workflow is that all feature development should take place in a dedicated branch instead of the master branch. This encapsulation makes it easy for multiple developers to work on a particular feature without disturbing the main codebase.&lt;/p&gt;
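
&lt;p&gt;A rough sketch of a feature branch's life cycle (branch and remote names are only examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# start a dedicated branch off the main codebase
git checkout -b new-feature master
# ...commit work on the branch as usual...
git add .
git commit -m "Implement part of new-feature"
# share the branch without touching master
git push -u origin new-feature
# once reviewed, merge it back into master
git checkout master
git merge new-feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;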

&lt;h3&gt;
  
  
  Gitflow Workflow
&lt;/h3&gt;

&lt;p&gt;With the foundation of Feature Branching, you can proceed to understand the famous &lt;a href="https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow"&gt;&lt;strong&gt;Gitflow Workflow&lt;/strong&gt;&lt;/a&gt;. This workflow doesn’t add any new concepts or commands beyond what’s required for the Feature Branch Workflow. Instead, it &lt;strong&gt;assigns very specific roles to different branches and defines how and when they should interact&lt;/strong&gt;. The overall flow of the Gitflow Workflow (sketched in commands after this list) is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A develop branch is created from master&lt;/li&gt;
&lt;li&gt;A release branch is created from develop&lt;/li&gt;
&lt;li&gt;Feature branches are created from develop&lt;/li&gt;
&lt;li&gt;When a feature is complete it is merged into the develop branch&lt;/li&gt;
&lt;li&gt;When the release branch is done it is merged into develop and master&lt;/li&gt;
&lt;li&gt;If an issue in master is detected a hotfix branch is created from master&lt;/li&gt;
&lt;li&gt;Once the hotfix is complete it is merged to both develop and master&lt;/li&gt;
&lt;/ol&gt;
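
&lt;p&gt;Expressed in plain Git commands, the seven steps above look roughly like this sketch (the git-flow helper tool wraps the same steps; branch names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git branch develop                      # 1. develop is created from master
git checkout -b release/1.0 develop     # 2. a release branch from develop
git checkout -b feature/login develop   # 3. a feature branch from develop
git checkout develop                    # 4. a finished feature merges into develop
git merge feature/login
git checkout master                     # 5. a finished release merges into master...
git merge release/1.0
git checkout develop                    # ...and into develop
git merge release/1.0
git checkout -b hotfix/1.0.1 master     # 6. a hotfix branch is created from master
# 7. a finished hotfix is merged into both master and develop, as in steps 4-5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;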

&lt;h3&gt;
  
  
  Forking Workflow
&lt;/h3&gt;

&lt;p&gt;If you are interested in contributing to Open Source, I strongly recommend understanding the &lt;a href="https://www.atlassian.com/git/tutorials/comparing-workflows/forking-workflow"&gt;&lt;strong&gt;Forking Workflow&lt;/strong&gt;&lt;/a&gt;. The Forking Workflow is fundamentally different from the other workflows discussed above. Instead of using a single server-side repository to act as the “central” codebase, it gives every developer a server-side repository. This means that each contributor has not one, but two Git repositories: a private local one and a public server-side one. You can imagine yourself contributing to a public project with the step-by-step example of this workflow (a command-level sketch follows the list).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A developer 'forks' an 'official' server-side repository. This creates their own server-side copy.&lt;/li&gt;
&lt;li&gt;The new server-side copy is cloned to their local system.&lt;/li&gt;
&lt;li&gt;A Git remote path for the 'official' repository is added to the local clone.&lt;/li&gt;
&lt;li&gt;A new local feature branch is created.&lt;/li&gt;
&lt;li&gt;The developer makes changes on the new branch.&lt;/li&gt;
&lt;li&gt;New commits are created for the changes.&lt;/li&gt;
&lt;li&gt;The branch gets pushed to the developer's own server-side copy.&lt;/li&gt;
&lt;li&gt;The developer opens a pull request from the new branch to the 'official' repository.&lt;/li&gt;
&lt;li&gt;The pull request gets approved for merge and is merged into the original server-side repository.&lt;/li&gt;
&lt;/ol&gt;
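
&lt;p&gt;In commands, steps 1-8 map to something like the following sketch (repository URLs and branch names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1-2. fork via the hosting service, then clone your server-side copy
git clone https://github.com/you/project.git
cd project
# 3. add the official repository as a second remote, commonly named upstream
git remote add upstream https://github.com/official/project.git
# 4-6. branch, make changes, commit
git checkout -b my-feature
git add .
git commit -m "Add my feature"
# 7. push the branch to your own fork
git push origin my-feature
# 8. open a pull request from my-feature to the official repository
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;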

&lt;p&gt;You can learn more from the external links attached to each workflow, or by searching the keywords online. This blog is not meant to be a complete explanation of each workflow; it is intended to give an overview of some common Git Workflows. &lt;/p&gt;

&lt;p&gt;Some key takeaways are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no one-size-fits-all Git workflow&lt;/li&gt;
&lt;li&gt;A workflow should be simple and enhance the productivity of your team&lt;/li&gt;
&lt;li&gt;Your business requirements should help shape your Git workflow&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>beginners</category>
      <category>git</category>
    </item>
    <item>
      <title>Trick to rebuild summary tables when the existing data have to remain available during the operation</title>
      <dc:creator>Mervyn Lee</dc:creator>
      <pubDate>Tue, 24 Mar 2020 07:47:49 +0000</pubDate>
      <link>https://dev.to/mervynlee94/trick-to-rebuild-summary-tables-when-the-existing-data-have-to-remain-available-during-the-operation-36ek</link>
      <guid>https://dev.to/mervynlee94/trick-to-rebuild-summary-tables-when-the-existing-data-have-to-remain-available-during-the-operation-36ek</guid>
      <description>&lt;p&gt;As a Data Engineer, I encountered situations where the summary or reporting tables need to be repopulated daily. When the population is ongoing, we need to make sure that the existing data is still available to the user. &lt;/p&gt;

&lt;p&gt;We can achieve this by using a “shadow table”, which acts like a staging table you build “behind” the real table. When you’re done building it, you can swap the tables with an &lt;strong&gt;atomic&lt;/strong&gt; rename. For example, if you need to rebuild my_table, you create my_table_new, fill it with data, and rename it to my_table, keeping the old table around as my_table_old for a quick rollback.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; DROP TABLE IF EXISTS my_table_new, my_table_old;
mysql&amp;gt; CREATE TABLE my_table_new LIKE my_table;
-- populate my_table_new as desired
mysql&amp;gt; RENAME TABLE my_table TO my_table_old, my_table_new TO my_table;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
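
&lt;p&gt;If the freshly swapped-in table turns out to have a problem, the rollback is the same atomic trick in reverse (a sketch; my_table_broken is just an illustrative name):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; RENAME TABLE my_table TO my_table_broken, my_table_old TO my_table;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;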



</description>
      <category>sql</category>
    </item>
    <item>
      <title>Types of queries that can &amp; can't use a B-Tree index</title>
      <dc:creator>Mervyn Lee</dc:creator>
      <pubDate>Wed, 18 Mar 2020 14:44:41 +0000</pubDate>
      <link>https://dev.to/mervynlee94/types-of-queries-that-can-can-t-use-a-b-tree-index-1ncp</link>
      <guid>https://dev.to/mervynlee94/types-of-queries-that-can-can-t-use-a-b-tree-index-1ncp</guid>
      <description>&lt;p&gt;Storage engines use B-Tree indexes in various ways, which can affect performance. The general idea of a B-Tree is that all the values are stored in order, and each leaf page is the same distance from the root. The figure below shows an abstract representation of a B-Tree(technically a B+ Tree) index, which corresponds roughly to how InnoDB’s indexes work. MyISAM uses a different structure, but the principles are similar.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WluB7YED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/j47is45g0je45ylqtpu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WluB7YED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/j47is45g0je45ylqtpu2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A B-Tree index speeds up data access because the storage engine doesn’t have to scan the whole table to find the desired data. A B-Tree is an implementation of a multiway tree (M-way tree) with some rules and restrictions on how the tree is built; a Binary Search Tree is simply a multiway tree with M=2 (two children), so you can relate the B-Tree to the Binary Search Tree concept. A search starts at the root node (not shown in this figure). The slots in the root node hold pointers to child nodes, and the storage engine follows these pointers. It finds the right pointer by looking at the values in the node pages, which define the upper and lower bounds of the values in the child nodes. Eventually, the storage engine either determines that the desired value doesn’t exist or successfully reaches a leaf page.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can learn more about the concept of B-Tree and how it applies to index in Database &lt;a href="https://www.youtube.com/watch?v=aZjYr87r1b8"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Suppose you have the following table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE People (
  last_name   varchar(50)     not null,
  first_name  varchar(50)     not null,
  dob         date            not null,
  gender      enum('m', 'f')  not null,
  key(last_name, first_name, dob) );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The index will contain the values from the last_name, first_name, and dob columns for every row in the table. The figure below illustrates how the index arranges the data it stores.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XRxdes5Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wphnihv9i2yj6uvc12o6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XRxdes5Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wphnihv9i2yj6uvc12o6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that there are two people with the same name but different birth dates, and they’re sorted by birth date.&lt;/p&gt;

&lt;p&gt;B-Tree indexes work well for lookups by the full key value, a key range, or a key prefix. They are useful only if the lookup uses a &lt;strong&gt;leftmost prefix&lt;/strong&gt; of the index. The index we showed above will be useful for the following kinds of queries (illustrated with example queries after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Match the full value&lt;br&gt;
A match on the full key value specifies values for &lt;strong&gt;all columns&lt;/strong&gt; in the index. For example, this index can help you find a person named Cuba Allen who was born on 1960-01-01.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Match a leftmost prefix&lt;br&gt;
This index can help you find all people with the last name Allen. This uses only the first column in the index.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Match a column prefix&lt;br&gt;
You can match on the first part of a column’s value. This index can help you find all people whose last names begin with J. This uses only the first column in the index.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Match a range of values&lt;br&gt;
This index can help you find people whose last names are between Allen and Barrymore. This also uses only the first column.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Match one part exactly and match a range on another part&lt;br&gt;
This index can help you find everyone whose last name is Allen and whose first name starts with the letter K (Kim, Karl, etc.). This is an exact match on last_name and a range query on first_name.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
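
&lt;p&gt;In SQL, the five lookup types above map to queries like these (a sketch against the People table defined earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- match the full value
SELECT * FROM People WHERE last_name='Allen' AND first_name='Cuba' AND dob='1960-01-01';
-- match a leftmost prefix
SELECT * FROM People WHERE last_name='Allen';
-- match a column prefix
SELECT * FROM People WHERE last_name LIKE 'J%';
-- match a range of values
SELECT * FROM People WHERE last_name BETWEEN 'Allen' AND 'Barrymore';
-- match one part exactly, a range on another
SELECT * FROM People WHERE last_name='Allen' AND first_name LIKE 'K%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;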

&lt;p&gt;As the tree’s nodes are sorted, they are helpful for both lookups (finding values) and ORDER BY queries (finding values in sorted order). In general, if a B-Tree can help you find a row in a particular way, it can help you sort rows by the same criteria. So, our index will be helpful for ORDER BY clauses that match all the types of lookups we just listed.&lt;/p&gt;

&lt;p&gt;Here are some limitations of B-Tree indexes (again illustrated with example queries after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;They are not useful if the lookup does not start from the leftmost side of the indexed columns. For example, this index won’t help you find all people named Bill or all people born on a certain date, because those columns are not leftmost in the index. Likewise, you can’t use the index to find people whose last name &lt;strong&gt;ends&lt;/strong&gt; with a particular letter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can’t skip columns in the index. That is, you won’t be able to find all people whose last name is Smith and who were born on a particular date. If you don’t specify a value for the first_name column, &lt;strong&gt;MySQL can use only the first column of the index&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The storage engine can’t optimize accesses with any columns to the right of the first range condition. For example, if your query is WHERE last_name="Smith" AND first_name LIKE 'J%' AND dob='1976-12-23', the index access will use only the first two columns in the index, because the LIKE is a range condition (the server can use the rest of the columns for other purposes, though).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
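
&lt;p&gt;The same limitations in query form, assuming the (last_name, first_name, dob) index:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- not a leftmost prefix: the index cannot be used
SELECT * FROM People WHERE first_name='Bill';
-- suffix match: the index cannot be used
SELECT * FROM People WHERE last_name LIKE '%n';
-- skipped column: only the last_name part of the index is usable
SELECT * FROM People WHERE last_name='Smith' AND dob='1976-12-23';
-- the range on first_name stops index access at that column
SELECT * FROM People WHERE last_name='Smith' AND first_name LIKE 'J%' AND dob='1976-12-23';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;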

</description>
      <category>sql</category>
    </item>
    <item>
      <title>Multi-threading vs Event Loop in Python</title>
      <dc:creator>Mervyn Lee</dc:creator>
      <pubDate>Fri, 21 Feb 2020 14:26:13 +0000</pubDate>
      <link>https://dev.to/mervynlee94/multi-threading-vs-event-loop-in-python-1h4h</link>
      <guid>https://dev.to/mervynlee94/multi-threading-vs-event-loop-in-python-1h4h</guid>
      <description>&lt;p&gt;I came into a network I/O bound optimization problem and manage to solve it using Multi-threading solution &lt;a href="https://dev.to/mervynlee94/a-better-approach-with-python-requests-1hcg"&gt;here&lt;/a&gt;. In the middle of research, I came into Asyncio — Asynchronous I/O library in Python, which brings into the question it may be a better solution.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Asyncio
&lt;/h1&gt;

&lt;p&gt;There are a lot of articles out there explaining what Asyncio is. The core concept to take away is that Asyncio provides us with an event loop. The event loop tracks different I/O events, switches to tasks that are ready, and pauses the ones that are waiting on I/O. Thus we don’t waste time on tasks that are not ready to run right now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sounds like a thread? What are their similarities and differences?
&lt;/h3&gt;

&lt;p&gt;They're both limited by the &lt;a href="https://realpython.com/python-gil/"&gt;Global Interpreter Lock&lt;/a&gt; and both run in a single process. They are both forms of concurrency but not parallelism - concurrency makes progress together, while parallelism makes progress in parallel. Parallelism is in fact a form of concurrency, but concurrency doesn't imply parallelism.&lt;/p&gt;

&lt;h4&gt;
  
  
  Threading, via &lt;code&gt;concurrent.futures&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;It employs time-slicing of the CPU: every thread is given a slot of CPU time to do work. If a thread is blocking (sleeping or blocked on sockets), off it goes to the next thread. The issue with threading is that when many threads are blocked for long periods, scheduling begins to degrade into polling (polling vs. interrupt).&lt;/p&gt;

&lt;h4&gt;
  
  
  Asyncio
&lt;/h4&gt;

&lt;p&gt;Asyncio uses an event loop, which you can imagine as a pub-sub, push-notification model. Tasks announce that they're blocked by awaiting Asyncio methods. The next available task at the top of the queue is then processed until it completes or is blocked again. Because this scheduling is cooperative, one long-running task can starve out the others. If the access pattern is tasks that stay blocked for a long time, this model ensures that you don't bother checking a task; you wait for it to announce that it's available.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about Multi-processing?
&lt;/h3&gt;

&lt;p&gt;Multi-processing is indeed useful for CPU-bound optimization problems, as it is the only real way to achieve true parallelism despite the restriction of the Global Interpreter Lock. For our problem, which consists of I/O-bound operations, the GIL usually doesn’t hurt much.&lt;/p&gt;
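
&lt;p&gt;For completeness, here is a minimal sketch of the multi-processing variant using the same concurrent.futures interface (cpu_heavy is a stand-in for any CPU-bound function, not code from the benchmark below):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import concurrent.futures

def cpu_heavy(n):
    # placeholder CPU-bound work: a sum of squares
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # each task runs in a separate worker process, bypassing the GIL
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(cpu_heavy, [10**6] * 8))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;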

&lt;h2&gt;
  
  
  Benchmark
&lt;/h2&gt;

&lt;p&gt;I have modified the code from my previous blog into an Asyncio implementation. Both implementations are shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from urllib3 import HTTPConnectionPool
import concurrent.futures
import Retry
import json

url = "www.example.com/employees/"
# Assume you have obtained the total number of pages
total_page = N
result = []
number_of_workers = W
retry = Retry(
    total=3,
    read=3,
    connect=3,
    backoff_factor=0.1,
    status_forcelist=(500, 502, 504))

pool = HTTPSConnectionPool(url, maxsize=1, headers={'API-Auth-Secret-Key':API_SECRET_KEY}, retries=retry)

def get_data_from_api(pool, page_number):
    return pool.request('GET',
        url='/v1/employee',
        fields={'page': page_number}
    )

with concurrent.futures.ThreadPoolExecutor(max_workers=worker) as executor:
    future_result = {executor.submit(get_data_from_api, pool, i): i for i in range(2, total_page+1)}
    for future in concurrent.futures.as_completed(future_result):
        result += json.loads(future.result().data.decode('utf-8'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from aiohttp.connector import TCPConnector
from aiohttp_retry import RetryClient
import asyncio
import json
url = "www.example.com/employees/v1/employee"
# Assume you have obtained the total number of pages
total_page = N
result = []
number_of_workers = W
conn = TCPConnector(limit=number_of_workers)

async def execute(url, session, page_number):
    async with session.get(url+'?page='+str(page_number), headers={'API-Auth-Secret-Key':API_SECRET_KEY}, retry_attempts=3, retry_factor=0.1, retry_for_statuses=(500, 502, 504)) as response:
        return await response.read()

async def bound_fetch(sem, url, session, page_number):
    # Getter function with semaphore.
    async with sem:
        return await execute(url, session, page_number)

async def run():
    url = 'https://cms.lifestyle.rea-asia.com/wp-json/rea-content/v1/contents'
    tasks = []
    # Limits to 50 async tasks to compare the performance of 50 threads
    sem = asyncio.Semaphore(50)
    global result
    # Fetch all responses within 50 sessions,
    # keep connection alive for all requests.
    async with RetryClient(connector=conn) as session:
        for i in range (total_page+1):
            task = asyncio.ensure_future(bound_fetch(sem, url.format(i), session, i))
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        # you now have all response bodies in this variable
        for res in responses:
            result += json.loads(res.decode("utf-8"))

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;p&gt;The time taken for 6000 pages (API calls) was around 6 minutes for both approaches, so the improvement is not significant. Asyncio is helpful when tasks are blocked for a long time; I believe this server responds promptly, so multi-threading is fine here. As a rule of thumb, always benchmark the performance when it comes to an optimization problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;Should I use Asyncio then? The purpose of many applications is to act on network packets entering an interface, timeouts expiring, mouse clicks, or other types of events. Such applications are often very well suited to using an event loop.&lt;/p&gt;

&lt;p&gt;Applications that need to churn through massively parallel algorithms are more suitable for running multiple (independent) threads on several CPU cores. However, threaded applications must deal with the side effects of concurrency, like race conditions, deadlocks, and livelocks. Writing error-free threaded applications is hard, and debugging them can be even harder.&lt;/p&gt;

&lt;p&gt;We can follow this pseudo code to make an intuitive approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#IO Bound
if io_bound:
    if io_very_slow:
        print("Use Asyncio")
    else:
       print("Use Multi-threading")
# CPU bound
else:
    print("Multi-processing")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
    </item>
    <item>
      <title>A better way to use Python requests in API data collection</title>
      <dc:creator>Mervyn Lee</dc:creator>
      <pubDate>Sat, 01 Feb 2020 17:25:09 +0000</pubDate>
      <link>https://dev.to/mervynlee94/a-better-approach-with-python-requests-1hcg</link>
      <guid>https://dev.to/mervynlee94/a-better-approach-with-python-requests-1hcg</guid>
      <description>&lt;p&gt;We need to collect data through API requests sometimes rather than performing data extraction from databases directly due to a couple of reasons such as database authorization restriction, or we like the fact that the complicated data processing logic has been implemented by backend and available through API, avoiding the need to perform ETL on the raw data ourselves.  However, due to the huge amount of API requests and the nature of the unreliable network (or server), careful implementation is needed to avoid partial data loss or unexpected response. &lt;/p&gt;

&lt;p&gt;Imagine that there is an API that returns a list of employees' information through pagination. There are M rows of data per page and N pages in total.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first implementation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
url = "www.example.com/employees/"
# Assume you have obtained the total number of pages
total_page = N
result = []

def get_data_from_api(url, page_number):
    # a list of employee's data
    return requests.get(url, params={'page': total_page})

for i in range(0, total_page+1):
    try:
        result += get_data_from_api(url, i)
    except Exception:
        # sleep for a bit in case that helps
        time.sleep(1)
        # try again
         result += get_data_from_api(url, i)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a recommended solution: you are putting faith in the server that any problem can be solved on the second attempt. You could tweak the solution with a while loop and keep retrying, but that is not an elegant solution either - your code would never stop running while the API server is facing problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improved Implementation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry


def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504),
    session=None,
):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, we allow the request to retry on response status codes 500, 502 and 504, at most 3 times, with a backoff factor of 0.3. By the formula quoted below, that backoff factor yields sleeps of 0s, 0.6s and 1.2s between the attempts.&lt;/p&gt;

&lt;h6&gt;
  
  
  A backoff factor to apply between attempts after the second try (most errors are resolved immediately by a second try without a delay). urllib3 will sleep for: {backoff factor} * (2 ^ ({number of total retries} - 1)) seconds. If the backoff_factor is 0.1, then sleep() will sleep for [0.0s, 0.2s, 0.4s, ...] between retries. It will never be longer than Retry.BACKOFF_MAX. By default, backoff is disabled (set to 0).
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

url = "www.example.com/employees/"
# Assume you have obtained the total number of pages
total_page = N
result = []

def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504),
    session=None,
):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def get_data_from_api(url, page_number):
    # returns a list of employees' data for one page
    return requests_retry_session().get(url, params={'page': page_number}).json()

for i in range(0, total_page+1):
    try:
        result += get_data_from_api(url, i)
    except Exception as e:
        print('It failed :(', e.__class__.__name__)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By introducing retries and a backoff factor, we are able to control the situation and deal with failures accordingly. &lt;/p&gt;

&lt;h2&gt;
  
  
  Speed it up!
&lt;/h2&gt;

&lt;p&gt;Firing requests one after another will take a long time. We can make it faster by introducing multithreading and asynchronous programming.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import concurrent.futures
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

url = "www.example.com/employees/"
# Assume you have obtained the total number of pages
total_page = N
result = []
number_of_workers = W

def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504),
    session=None,
):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def get_data_from_api(url, page_number):
    # each call builds a brand-new session, i.e. a new TCP connection per request
    return requests_retry_session().get(url, params={'page': page_number}).json()

with concurrent.futures.ThreadPoolExecutor(max_workers=number_of_workers) as executor:
    future_result = {executor.submit(get_data_from_api, url, i): i for i in range(0, total_page+1)}
    try:
        for future in concurrent.futures.as_completed(future_result):
            result += future.result()
    except Exception as e:
        print('It failed :(', e.__class__.__name__)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Session to the rescue!
&lt;/h2&gt;

&lt;p&gt;Good. Now we are able to fire multiple requests at the same time asynchronously. However, we have one more issue to solve: in our current implementation, each request initiates a new TCP connection to the server. Too many TCP connections will slow down the server and also introduce overhead such as the TCP 3-way handshake. In one of my tasks, the excessive number of connections handicapped the server, causing timeouts on half of the requests even after retries!&lt;/p&gt;

&lt;p&gt;Introducing Session Object. The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance and will use urllib3’s connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase. &lt;/p&gt;

&lt;p&gt;One idea is to create the same number of sessions as our thread workers. The workers then reuse their sessions to perform requests. Note that the object returned by &lt;code&gt;requests_retry_session&lt;/code&gt; is a Session object.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ultimate solution
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import concurrent.futures
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

url = "www.example.com/employees/"
# Assume you have obtained the total number of pages
total_page = N
result = []
number_of_workers = W

def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504),
    session=None,
):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def get_data_from_api(session, url, page_number):
    # returns a list of employees' data, reusing the worker's session
    return session.get(url, params={'page': page_number}).json()

sessions = [requests_retry_session() for i in range(number_of_workers)]

with concurrent.futures.ThreadPoolExecutor(max_workers=number_of_workers) as executor:

    # Assign the (i modulo W)th session to page i for equal distribution
    future_result = {executor.submit(get_data_from_api, sessions[i % number_of_workers], url, i): i for i in range(0, total_page+1)}
    try:
        for future in concurrent.futures.as_completed(future_result):
            result += future.result()
    except Exception as e:
        print('It failed :(', e.__class__.__name__)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this solution in mind, you are now able to make predictable requests, collecting data from the API concurrently and asynchronously while reusing sessions. Note that it is an opinionated solution, but it demonstrates how the pieces work so you can adapt them to your own use case.&lt;/p&gt;

&lt;p&gt;I improvised the solution based on the proposed implementation from &lt;a href="https://www.peterbe.com/plog/best-practice-with-retries-with-requests"&gt;here&lt;/a&gt;. Feel free to read more there. Have a nice day!&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Understanding SQL Server Isolation Levels through example</title>
      <dc:creator>Mervyn Lee</dc:creator>
      <pubDate>Fri, 20 Dec 2019 17:06:15 +0000</pubDate>
      <link>https://dev.to/mervynlee94/understanding-sql-server-isolation-levels-through-examples-nf0</link>
      <guid>https://dev.to/mervynlee94/understanding-sql-server-isolation-levels-through-examples-nf0</guid>
      <description>&lt;p&gt;After reading a lot helpful contents in dev.to, I decided to start contributing while learning and start writing! I am inspired to write my first post when I was reading a topic regarding Isolation Levels in High Performance MySQL by Baron Schwartz. It looks confusing, therefore I decided to mess around in an actual SQL server and justify my understanding. So here it is, my very first blog post. I will explain about these isolations along with SQL queries.&lt;/p&gt;

&lt;p&gt;I found this knowledge very helpful in one of my previous projects. We have a table that serves data to clients through an API endpoint according to different parameters and filters, and new data comes in daily. With this knowledge, I am more confident handling different scenarios, including a data import failing halfway, and knowing whether part of the new data will be exposed to clients during an import. I will revisit this project and write a separate article soon. &lt;/p&gt;

&lt;p&gt;The SQL standard defines four isolation levels, with specific rules for which changes are and aren't visible inside and outside a transaction. A transaction is a group of SQL queries that are treated atomically, as a single unit of work. If the database engine can apply the entire group of queries to the database, it does so; but if any of them can't be applied because of a crash or another reason, none of them is applied.&lt;/p&gt;
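
&lt;p&gt;A minimal sketch of such a unit of work (the account table here is hypothetical, purely for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;START TRANSACTION;
UPDATE account SET balance = balance - 100 WHERE id = 1;
UPDATE account SET balance = balance + 100 WHERE id = 2;
-- both updates become visible together on COMMIT;
-- on ROLLBACK (or a crash), neither is applied
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;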




&lt;h3&gt;
  
  
  READ UNCOMMITTED
&lt;/h3&gt;

&lt;p&gt;In this isolation level, transactions can view the results of uncommitted transactions. This level is rarely used in practice, as its performance isn't much better than that of the other levels, which have many advantages.&lt;/p&gt;

&lt;h6&gt;
  
  
  The examples require two separate transactions running at the same time. I did this with two terminals logged in at the same time to get two sessions, which I named Session A and Session B. These examples were executed in MySQL 8.
&lt;/h6&gt;

&lt;p&gt;This is what the sample table looks like:&lt;/p&gt;

&lt;h6&gt;
  
  
  Table and data generation code is at the end of this post.
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM student;
+----+-----------+-----------+------+
| id | firstname | lastname  | cgpa |
+----+-----------+-----------+------+
|  1 | Wilman    | Kala      |  3.5 |
|  2 | Karl      | Jablonski | 2.67 |
|  3 | Matti     | Karttunen | 3.86 |
|  4 | John      | Doe       | 1.93 |
|  5 | Will      | Smith     | 2.25 |
+----+-----------+-----------+------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Session A:
&amp;gt; SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
&amp;gt; START TRANSACTION;
&amp;gt; SELECT * FROM student WHERE firstname = 'Will';

+----+-----------+----------+------+
| id | firstname | lastname | cgpa |
+----+-----------+----------+------+
|  5 | Will      | Smith    | 2.25 |
+----+-----------+----------+------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Session B:
&amp;gt; START TRANSACTION;
&amp;gt; UPDATE student SET CGPA = 3.1 WHERE firstname = 'WILL';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
&amp;gt; SELECT * FROM student WHERE firstname = 'Will';
+----+-----------+----------+------+
| id | firstname | lastname | cgpa |
+----+-----------+----------+------+
|  5 | Will      | Smith    |  3.1 |
+----+-----------+----------+------+
&amp;gt; COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the transaction in Session A is able to see the uncommitted row. This phenomenon is called a Dirty Read - a situation where a transaction reads data that has not yet been committed. This isolation level prevents only dirty writes. Let's finish the transaction in Session B.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Session B:
&amp;gt; COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  READ COMMITTED
&lt;/h3&gt;

&lt;p&gt;This is the default isolation level for most database systems (but not MySQL). A transaction will see only those changes made by transactions that were already committed when it began, and its own changes won't be visible to others until it has committed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
&amp;gt; SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
&amp;gt; SELECT * FROM student WHERE firstname = 'Will';
+----+-----------+----------+------+
| id | firstname | lastname | cgpa |
+----+-----------+----------+------+
|  5 | Will      | Smith    |  3.1 |
+----+-----------+----------+------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session B:
&amp;gt; START TRANSACTION;
&amp;gt; UPDATE student SET CGPA = 3.999 WHERE firstname = 'WILL';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
&amp;gt; SELECT * FROM student WHERE firstname = 'Will';
+----+-----------+----------+------+
| id | firstname | lastname | cgpa |
+----+-----------+----------+------+
|  5 | Will      | Smith    |  3.1 |
+----+-----------+----------+------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, when the transaction in Session B is not committed yet, the transaction in Session A is not able to see the change. Now we go back to Session B to commit the change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session B:
&amp;gt; COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
SELECT * FROM student WHERE firstname = 'Will';
+----+-----------+----------+-------+
| id | firstname | lastname | cgpa  |
+----+-----------+----------+-------+
|  5 | Will      | Smith    | 3.999 |
+----+-----------+----------+-------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great. Now the transaction in Session A is able to see the change only after the commit operation is done by the transaction in Session B. However, if your transaction wants to update or delete only the data it selected in previous queries, excluding new data coming in parallel from other transactions, you are in trouble. This level still allows what's known as a Non-Repeatable Read - which occurs when a transaction reads the same row twice and gets a different value each time.&lt;/p&gt;

&lt;p&gt;Confused? Let me show you an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
&amp;gt; START TRANSACTION;
&amp;gt; SELECT * FROM student;
+----+-----------+-----------+-------+
| id | firstname | lastname  | cgpa  |
+----+-----------+-----------+-------+
|  1 | Wilman    | Kala      |   3.5 |
|  2 | Karl      | Jablonski |  2.67 |
|  3 | Matti     | Karttunen |  3.86 |
|  4 | John      | Doe       |  1.93 |
|  5 | Will      | Smith     | 3.999 |
+----+-----------+-----------+-------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session B:
&amp;gt; START TRANSACTION;
&amp;gt; INSERT INTO student (firstname, lastname, cgpa) VALUES ('Will', 'Smack', 0.98);
&amp;gt; COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
&amp;gt; SELECT * FROM student;
+----+-----------+-----------+-------+
| id | firstname | lastname  | cgpa  |
+----+-----------+-----------+-------+
|  1 | Wilman    | Kala      |   3.5 |
|  2 | Karl      | Jablonski |  2.67 |
|  3 | Matti     | Karttunen |  3.86 |
|  4 | John      | Doe       |  1.93 |
|  5 | Will      | Smith     | 3.999 |
|  6 | Will      | Smack     |  0.98 |
+----+-----------+-----------+-------+
&amp;gt; COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simply put, when new data is inserted, modified, or deleted by a committed transaction in Session B, the transaction in Session A is able to see the new change. To solve that, let's move on to the next level.&lt;/p&gt;




&lt;h3&gt;
  
  
  REPEATABLE READ
&lt;/h3&gt;

&lt;p&gt;This level solves the problems that READ COMMITTED allows. It guarantees that any row a transaction reads will "look the same" in subsequent reads within the same transaction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
&amp;gt; START TRANSACTION;
&amp;gt; SELECT * FROM student;
+----+-----------+-----------+-------+
| id | firstname | lastname  | cgpa  |
+----+-----------+-----------+-------+
|  1 | Wilman    | Kala      |   3.5 |
|  2 | Karl      | Jablonski |  2.67 |
|  3 | Matti     | Karttunen |  3.86 |
|  4 | John      | Doe       |  1.93 |
|  5 | Will      | Smith     | 3.999 |
|  6 | Will      | Smack     |  0.98 |
+----+-----------+-----------+-------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session B:
&amp;gt; START TRANSACTION;
&amp;gt; INSERT INTO student (firstname, lastname, cgpa) VALUES ('Will', 'Win', 3.414);
&amp;gt; COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
&amp;gt; SELECT * FROM student;
+----+-----------+-----------+-------+
| id | firstname | lastname  | cgpa  |
+----+-----------+-----------+-------+
|  1 | Wilman    | Kala      |   3.5 |
|  2 | Karl      | Jablonski |  2.67 |
|  3 | Matti     | Karttunen |  3.86 |
|  4 | John      | Doe       |  1.93 |
|  5 | Will      | Smith     | 3.999 |
|  6 | Will      | Smack     |  0.98 |
+----+-----------+-----------+-------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, even though the inserting transaction in Session B has committed, the transaction in Session A still sees the same data it retrieved previously. To see the change, end the transaction in Session A with a commit and start a new transaction to perform the query again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; COMMIT;
&amp;gt; SELECT * FROM student;
+----+-----------+-----------+-------+
| id | firstname | lastname  | cgpa  |
+----+-----------+-----------+-------+
|  1 | Wilman    | Kala      |   3.5 |
|  2 | Karl      | Jablonski |  2.67 |
|  3 | Matti     | Karttunen |  3.86 |
|  4 | John      | Doe       |  1.93 |
|  5 | Will      | Smith     | 3.999 |
|  6 | Will      | Smack     |  0.98 |
|  7 | Will      | Win       | 3.414 |
+----+-----------+-----------+-------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Woohoo! Perfect? Sadly, it still allows another tricky problem: phantom reads. Simply put, a phantom read can happen when you &lt;strong&gt;select some range of rows using a WHERE clause&lt;/strong&gt;, another transaction inserts a new row into that range, and then you select the same range again; you will then see the new "phantom" row. Let's see the phantom for ourselves!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
&amp;gt; START TRANSACTION;
&amp;gt; SELECT * FROM student WHERE firstname='Wilman';
+----+-----------+----------+------+
| id | firstname | lastname | cgpa |
+----+-----------+----------+------+
|  1 | Wilman    | Kala     |  3.5 |
+----+-----------+----------+------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session B:
&amp;gt; START TRANSACTION;
&amp;gt; INSERT INTO student (firstname, lastname, cgpa) VALUES ('Wilman', 'Jackson', 3.98);
&amp;gt; COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM student WHERE firstname='Wilman';
+----+-----------+----------+------+
| id | firstname | lastname | cgpa |
+----+-----------+----------+------+
|  1 | Wilman    | Kala     |  3.5 |
|  8 | Wilman    | Jackson  | 3.98 |
&amp;gt; COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, the rule of thumb: a SELECT statement without a WHERE clause will be repeatable at this isolation level, but not when a WHERE clause is involved. To solve this, we need to move to the fourth level.&lt;/p&gt;




&lt;h3&gt;
  
  
  SERIALIZABLE
&lt;/h3&gt;

&lt;p&gt;The highest level of isolation, SERIALIZABLE, solves the phantom read problem by forcing transactions to be ordered so that they can't possibly conflict. In short, it places a lock on every row it reads. At this level, a lot of timeouts and lock contention can occur. It trades off concurrency in favour of the data stability that results. Here is an example where your transaction will wait for another transaction...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
&amp;gt; SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
&amp;gt; START TRANSACTION;
&amp;gt; SELECT * FROM student WHERE firstname='Wilman';
+----+-----------+----------+------+
| id | firstname | lastname | cgpa |
+----+-----------+----------+------+
|  1 | Wilman    | Kala     |  3.5 |
| 10 | Wilman    | Jackson  | 3.98 |
+----+-----------+----------+------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session B:
&amp;gt; START TRANSACTION;
&amp;gt; INSERT INTO student (firstname, lastname, cgpa) VALUES ('Wilman', 'Mars', 2.777);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you fire the INSERT command, it will just hang there...&lt;br&gt;
Don't worry. Go back to Session A and commit the transaction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--Session A:
&amp;gt; COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the completion of the transaction in Session A, Session B is finally able to complete the INSERT operation. &lt;/p&gt;

&lt;p&gt;I hope you enjoyed the hands-on examples! The table below summarizes the various isolation levels and the drawbacks associated with each one.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Isolation Level&lt;/th&gt;
&lt;th&gt;Dirty Reads possible&lt;/th&gt;
&lt;th&gt;Non Repeatable Reads possible&lt;/th&gt;
&lt;th&gt;Phantom Reads possible&lt;/th&gt;
&lt;th&gt;Locking reads&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;READ UNCOMMITTED&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;READ COMMITTED&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REPEATABLE READ&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SERIALIZABLE&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h4&gt;
  
  
  Table and data generation code
&lt;/h4&gt;

&lt;p&gt;I use Docker to run a MySQL server in a container, using the image from &lt;a href="https://hub.docker.com/_/mysql"&gt;Docker Hub.&lt;/a&gt; Skip this step if you don't feel like using Docker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; docker run --name mysql -e MYSQL_ROOT_PASSWORD=your_password -d mysql:tag
&amp;gt; docker inspect mysql #Identify the IP address
&amp;gt; docker run -it --rm mysql mysql -h172.17.0.2 -uroot -p
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE student (
id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
firstname VARCHAR(30) NOT NULL,
lastname VARCHAR(30) NOT NULL,
cgpa FLOAT(50)
);

INSERT INTO student (firstname, lastname, cgpa) VALUES ('Wilman', 'Kala', 3.5);
INSERT INTO student (firstname, lastname, cgpa) VALUES ('Karl', 'Jablonski', 2.67);
INSERT INTO student (firstname, lastname, cgpa) VALUES ('Matti', 'Karttunen', 3.86);
INSERT INTO student (firstname, lastname, cgpa) VALUES ('John', 'Doe', 1.93);
INSERT INTO student (firstname, lastname, cgpa) VALUES ('Will', 'Smith', 2.25);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>sql</category>
    </item>
  </channel>
</rss>
