<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arpit Bhayani</title>
    <description>The latest articles on DEV Community by Arpit Bhayani (@arpit_bhayani).</description>
    <link>https://dev.to/arpit_bhayani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F293593%2Ff5aa7e13-d00c-4f17-88de-54ab134ce53c.jpg</url>
      <title>DEV Community: Arpit Bhayani</title>
      <link>https://dev.to/arpit_bhayani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arpit_bhayani"/>
    <language>en</language>
    <item>
      <title>Bitcask - a log-structured fast KV store</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 19 Jul 2020 14:39:34 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/bitcask-a-log-structured-fast-kv-store-1mic</link>
      <guid>https://dev.to/arpit_bhayani/bitcask-a-log-structured-fast-kv-store-1mic</guid>
      <description>&lt;p&gt;Bitcask is one of the most efficient embedded Key-Value (KV) Databases designed to handle production-grade traffic. The paper that introduced Bitcask to the world says it is a &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Log-structured_file_system"&gt;Log-Structured&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Hash_table"&gt;Hash Table&lt;/a&gt; for Fast Key/Value Data&lt;/em&gt; which, in a simpler language, means that the data will be written sequentially to an append-only log file and there will be pointers for each &lt;code&gt;key&lt;/code&gt; pointing to the &lt;code&gt;position&lt;/code&gt; of its log entry. Building a KV store off the append-only log files seems like a really weird design choice, but Bitcask does not only make it efficient but it also gives a really high Read-Write throughput.&lt;/p&gt;

&lt;p&gt;Bitcask was introduced as the backend for a distributed database named &lt;a href="https://riak.com/"&gt;Riak&lt;/a&gt;, in which each node ran one instance of Bitcask to hold the data it was responsible for. In this essay, we take a detailed look at Bitcask and its design, and find the secret sauce that makes it so performant.&lt;/p&gt;

&lt;h1&gt;Design of Bitcask&lt;/h1&gt;

&lt;p&gt;Bitcask uses many principles from &lt;a href="https://en.wikipedia.org/wiki/Log-structured_file_system"&gt;log-structured file systems&lt;/a&gt; and draws inspiration from a number of designs that involve log file merging - for example, merging in LSM Trees. It is essentially just a directory of append-only (log) files with a fixed structure, plus an in-memory index mapping each key to the information needed for point lookups - a reference to the entry in the datafile.&lt;/p&gt;

&lt;h2&gt;Datafiles&lt;/h2&gt;

&lt;p&gt;Datafiles are append-only log files that hold the KV pairs along with some meta-information. A single Bitcask instance can have many datafiles, of which exactly one is active and open for writing, while the others are considered immutable and are used only for reads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lGvhaAj3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/87866701-78fdb800-c9a2-11ea-9c35-9a706ac96d97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lGvhaAj3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/87866701-78fdb800-c9a2-11ea-9c35-9a706ac96d97.png" alt="Bitcask Datafiles"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each entry in the datafile has the fixed structure illustrated above and stores &lt;code&gt;crc&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;key_size&lt;/code&gt;, &lt;code&gt;value_size&lt;/code&gt;, the actual &lt;code&gt;key&lt;/code&gt;, and the actual &lt;code&gt;value&lt;/code&gt;. All write operations - create, update, and delete - made on the engine translate into entries in this active datafile. When the active datafile crosses a size threshold, it is closed and a new active datafile is created; and, as stated earlier, once closed (intentionally or unintentionally), a datafile is considered immutable and is never opened for writing again.&lt;/p&gt;
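&lt;p&gt;To make the entry layout concrete, here is a minimal sketch of how such an entry could be serialized; the 4-byte field widths and network byte order are assumptions for illustration, not the exact on-disk format used by Bitcask.&lt;/p&gt;

```python
import struct
import time
import zlib

# Assumed framing: crc | timestamp | key_size | value_size | key | value,
# with each header field packed as a 4-byte unsigned int (network order).
HEADER = struct.Struct("!III")

def encode_entry(key: bytes, value: bytes) -> bytes:
    """Serialize one datafile entry; the CRC covers everything after itself."""
    header = HEADER.pack(int(time.time()), len(key), len(value))
    payload = header + key + value
    return struct.pack("!I", zlib.crc32(payload)) + payload
```

&lt;p&gt;On reads, the engine can recompute the CRC over the payload and compare it with the stored one to detect corruption.&lt;/p&gt;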

&lt;h2&gt;KeyDir&lt;/h2&gt;

&lt;p&gt;KeyDir is an in-memory hash table that holds all the keys present in the Bitcask instance and maps each key to the offset in the datafile where its log entry (value) resides, thus facilitating point lookups. The mapped value in the hash table is a structure holding &lt;code&gt;file_id&lt;/code&gt;, &lt;code&gt;offset&lt;/code&gt;, and some meta-information like &lt;code&gt;timestamp&lt;/code&gt;, as illustrated below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LrTcJ1CF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/87866707-96cb1d00-c9a2-11ea-9730-fc7f8cb79b92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LrTcJ1CF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/87866707-96cb1d00-c9a2-11ea-9730-fc7f8cb79b92.png" alt="Bitcask KeyDir"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Operations on Bitcask&lt;/h1&gt;

&lt;p&gt;Now that we have seen the overall design and components of Bitcask, we can jump into exploring the operations it supports and the details of their implementations.&lt;/p&gt;

&lt;h3&gt;Putting a new Key Value&lt;/h3&gt;

&lt;p&gt;When a new KV pair is submitted to be stored in Bitcask, the engine first appends it to the active datafile and then creates a new entry in the KeyDir specifying the offset and the file where the value is stored. Both of these actions are performed atomically, which means either the entry is made in both structures or in neither.&lt;/p&gt;

&lt;p&gt;Putting a new Key-Value pair requires just one atomic operation encapsulating one disk write and a few in-memory accesses and updates. Since the active datafile is append-only, the disk write does not have to perform any disk seek whatsoever, letting writes proceed at an optimal rate and providing a high write throughput.&lt;/p&gt;
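&lt;p&gt;A put can be sketched in a few lines; the entry framing here is deliberately simplified (no CRC or length fields) to highlight the two steps - an append to the active file and a KeyDir update.&lt;/p&gt;

```python
def put(active_file, keydir, key: bytes, value: bytes, file_id: int, timestamp: int) -> int:
    """Append the KV pair to the active datafile, then point the KeyDir at it."""
    offset = active_file.tell()            # append-only: position is already at the end
    active_file.write(key + b"=" + value + b"\n")
    keydir[key] = (file_id, offset, timestamp)
    return offset
```

&lt;p&gt;In a real engine the two steps would be performed atomically, as described above.&lt;/p&gt;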

&lt;h3&gt;Updating an existing Key Value&lt;/h3&gt;

&lt;p&gt;This KV store does not support partial updates out of the box, but it does support full value replacement. Hence the update operation is very similar to putting a new KV pair, the only difference being that instead of creating a new entry in the KeyDir, the existing entry is updated with the new position - possibly in a new datafile.&lt;/p&gt;

&lt;p&gt;The entry corresponding to the old value is now dangling and will be garbage collected explicitly during merging and compaction.&lt;/p&gt;

&lt;h3&gt;Deleting a Key&lt;/h3&gt;

&lt;p&gt;Deleting a key is a special operation in which the engine atomically appends a new entry to the active datafile whose value is a special tombstone value, denoting deletion, and deletes the entry from the in-memory KeyDir. The tombstone value is chosen to be unique enough that it does not interfere with the existing value space.&lt;/p&gt;

&lt;p&gt;The delete operation, just like the update, is very lightweight and requires one disk write and an in-memory update. Here as well, the older entries corresponding to the deleted key are left dangling and will be garbage collected explicitly during merging and compaction.&lt;/p&gt;
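&lt;p&gt;A sketch of the delete path, assuming the same simplified framing as before; the tombstone constant here is a stand-in, the real marker being implementation-defined.&lt;/p&gt;

```python
TOMBSTONE = b"__tombstone__"  # hypothetical sentinel; must never collide with real values

def delete(active_file, keydir, key: bytes) -> None:
    """Append a tombstone entry for the key and drop it from the in-memory KeyDir."""
    active_file.write(key + b"=" + TOMBSTONE + b"\n")
    keydir.pop(key, None)
```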

&lt;h3&gt;Reading a Key-Value&lt;/h3&gt;

&lt;p&gt;Reading a KV pair from the store requires the engine to first find the datafile, and the offset within it, for the given key - which is done using the KeyDir. Once that information is available, the engine performs one disk read from the corresponding datafile at that offset to retrieve the log entry. The retrieved value is verified against the stored CRC and then returned to the client.&lt;/p&gt;

&lt;p&gt;The operation is inherently fast as it requires just one disk read and a few in-memory accesses, but it can be made even faster using the filesystem's read-ahead cache.&lt;/p&gt;
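&lt;p&gt;The read path is a KeyDir lookup followed by a single positioned read; here is a minimal sketch (CRC verification omitted), assuming the KeyDir maps a key to &lt;code&gt;(file_id, offset, size)&lt;/code&gt;:&lt;/p&gt;

```python
def get(datafiles, keydir, key: bytes):
    """Point lookup: consult the in-memory KeyDir, then do one read from the datafile."""
    entry = keydir.get(key)
    if entry is None:
        return None                 # key was never written or has been deleted
    file_id, offset, size = entry
    f = datafiles[file_id]
    f.seek(offset)                  # the only disk seek on the read path
    return f.read(size)
```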

&lt;h1&gt;Merge and Compaction&lt;/h1&gt;

&lt;p&gt;As we saw with the update and delete operations, the old entries corresponding to a key remain untouched and dangling, which leads to Bitcask consuming a lot of disk space. To keep disk utilization efficient, the engine every once in a while compacts the older closed datafiles into one or many merged files having the same structure as the existing datafiles.&lt;/p&gt;

&lt;p&gt;The merge process iterates over all the immutable files in the Bitcask and produces a set of datafiles containing only the &lt;em&gt;live&lt;/em&gt; and &lt;em&gt;latest&lt;/em&gt; version of each present key. This way, stale and deleted keys are omitted from the newer datafiles, saving a lot of disk space. Since a record now exists in a different merged datafile and at a new offset, its entry in the KeyDir needs to be updated atomically.&lt;/p&gt;
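&lt;p&gt;The merge step can be sketched as follows; here the immutable datafiles are modeled as in-memory maps for illustration, and only entries that the live KeyDir still points to survive into the merged file.&lt;/p&gt;

```python
def merge(datafiles, keydir):
    """Compaction sketch: copy only live, latest values into a single merged file.
    `datafiles` maps file_id to {offset: (key, value)}; `keydir` maps key to (file_id, offset).
    Returns the merged entries plus a rebuilt KeyDir pointing into them."""
    merged, new_keydir = [], {}
    for key, (file_id, offset) in keydir.items():
        _, value = datafiles[file_id][offset]
        new_keydir[key] = (0, len(merged))   # everything now lives in merged "file 0"
        merged.append((key, value))
    return merged, new_keydir
```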

&lt;h1&gt;Performant bootup&lt;/h1&gt;

&lt;p&gt;If Bitcask crashes and needs to boot up, it has to read all the datafiles and build a fresh KeyDir. Merging and compaction help here, as they reduce the need to read data that will eventually be evicted. But there is another mechanism that helps make boot times even faster.&lt;/p&gt;

&lt;p&gt;For every datafile, a &lt;em&gt;hint&lt;/em&gt; file is created which holds everything in the datafile except the value, i.e. it holds each key and its meta-information. A &lt;em&gt;hint&lt;/em&gt; file is therefore very small, and by reading it the engine can quickly recreate the entire KeyDir and complete the bootup process much faster.&lt;/p&gt;
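&lt;p&gt;Bootup from hint files can be sketched as below; each hint file is modeled as a list of &lt;code&gt;(key, file_id, offset, value_size, timestamp)&lt;/code&gt; tuples, a layout assumed purely for illustration.&lt;/p&gt;

```python
def keydir_from_hints(hint_files):
    """Rebuild the KeyDir by scanning hint files only - values are never read."""
    keydir = {}
    for hints in hint_files:
        for key, file_id, offset, value_size, timestamp in hints:
            current = keydir.get(key)
            if current is None or timestamp >= current[3]:
                keydir[key] = (file_id, offset, value_size, timestamp)  # keep the newest
    return keydir
```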

&lt;h1&gt;Strengths and Weaknesses of Bitcask&lt;/h1&gt;

&lt;h2&gt;Strengths&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Low latency for read and write operations&lt;/li&gt;
&lt;li&gt;High Write Throughput&lt;/li&gt;
&lt;li&gt;Single disk seek to retrieve any value&lt;/li&gt;
&lt;li&gt;Predictable lookup and insert performance&lt;/li&gt;
&lt;li&gt;Crash recovery is fast and bounded&lt;/li&gt;
&lt;li&gt;Backing up is easy - just copying the directory suffices&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Weaknesses&lt;/h2&gt;

&lt;p&gt;The KeyDir holds all the keys in memory at all times, which imposes a hard constraint: the system needs enough memory to hold the entire keyspace along with other essentials like filesystem buffers. Thus the limiting factor for a Bitcask instance is the RAM available to hold the KeyDir.&lt;/p&gt;

&lt;p&gt;Although this weakness seems like a major one, the solution is fairly simple: we can shard the keys and scale horizontally without losing much of the basic operations - Create, Read, Update, and Delete.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://riak.com/assets/bitcask-intro.pdf"&gt;Bitcask Paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Bitcask"&gt;Bitcask - Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://highscalability.com/blog/2011/1/10/riaks-bitcask-a-log-structured-hash-table-for-fast-keyvalue.html/"&gt;Riak's Bitcask - High Scalability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://topic.alibabacloud.com/a/implementation-of-the-bitcask-storage-model-merge-and-hint-files_8_8_31516931.html"&gt;Implementation of the Bitcask storage model-merge and hint files&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;Other articles that you might like&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/consistent-hashing"&gt;Consistent Hashing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/phi-accrual"&gt;Phi φ Accrual Failure Detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/copy-on-write"&gt;Copy-on-Write Semantics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/mysql-cache"&gt;What makes MySQL LRU cache scan resistant&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you liked what you read, consider subscribing to my weekly newsletter at &lt;a href="https://arpit.substack.com/"&gt;arpit.substack.com&lt;/a&gt; where, once a week, I write an essay about programming languages internals, or a deep dive on some super-clever algorithm, or just a few tips on building highly scalable distributed systems.&lt;/p&gt;

&lt;p&gt;You can always find me browsing through twitter &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
    </item>
    <item>
      <title>Phi φ Accrual Failure Detection</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 12 Jul 2020 09:50:56 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/phi-accrual-failure-detection-1aj1</link>
      <guid>https://dev.to/arpit_bhayani/phi-accrual-failure-detection-1aj1</guid>
      <description>&lt;p&gt;One of the most important virtues of any distributed system is its ability to detect failures in any of its subsystems before things go havoc. Early detection of failures helps in taking preventive actions and ensuring that the system stays fault-tolerant. The conventional way of failure detection is by using a bunch of heartbeat messages with a fixed timeout, indicating if a subsystem is down or not.&lt;/p&gt;

&lt;p&gt;In this essay, we take a look at an adaptive failure detection algorithm called &lt;em&gt;Phi Accrual Failure Detection&lt;/em&gt;, which was introduced in a &lt;a href="https://pdfs.semanticscholar.org/11ae/4c0c0d0c36dc177c1fff5eb84fa49aa3e1a8.pdf"&gt;paper&lt;/a&gt; by Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama. The algorithm uses historical heartbeat information to make the threshold adaptive. Instead of generating a binary value, like conventional methods do, it generates continuous values suggesting its level of confidence that the system has crashed.&lt;/p&gt;

&lt;h1&gt;Conventional Failure Detection&lt;/h1&gt;

&lt;p&gt;Accurately detecting failures is an impossible problem to solve, as we can never say whether a system has crashed or is just very slow to respond. Conventional failure detection algorithms output a boolean value stating whether the system is down or not; there is no middle ground.&lt;/p&gt;

&lt;h2&gt;Heartbeats with constant timeouts&lt;/h2&gt;

&lt;p&gt;Conventional failure detection algorithms use &lt;em&gt;heartbeat&lt;/em&gt; messages with a fixed timeout to determine whether a system is alive. The monitored system periodically sends a heartbeat message to the monitoring system, indicating that it is still alive. The monitoring system suspects that the process has crashed if it fails to receive any heartbeat message within the configured timeout period.&lt;/p&gt;

&lt;p&gt;Here the value of the timeout is crucial: keeping it short means we detect failures quickly but with a lot of false positives, while keeping it long reduces the false positives but makes detection slower.&lt;/p&gt;

&lt;h1&gt;Phi Accrual Failure Detection&lt;/h1&gt;

&lt;p&gt;Phi Accrual Failure Detection is an adaptive failure detection algorithm that provides a building block for implementing failure detectors in any distributed system. A generic accrual failure detector, instead of outputting a boolean (system up or down), outputs suspicion information (a level) on a continuous scale, such that the higher the suspicion value, the higher the chance that the system is down.&lt;/p&gt;

&lt;h2&gt;Detailing φ&lt;/h2&gt;

&lt;p&gt;We define φ as the suspicion level output by this failure detector; since the algorithm is adaptive, the value is dynamic and reflects the current network conditions and system behavior. As we established earlier, the lower the chance of receiving the heartbeat, the higher the chance that the system has crashed, and hence the higher the value of φ should be; the details of expressing φ mathematically are illustrated below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---whIau0e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/87240784-469c0a00-c43a-11ea-8689-9dc41eb1ccf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---whIau0e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/87240784-469c0a00-c43a-11ea-8689-9dc41eb1ccf1.png" alt="Phi Accrual Failure Detection"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The illustration above expresses this relationship mathematically and shows how applying the &lt;code&gt;-log10(x)&lt;/code&gt; function to the probability yields a gradual negative slope in the value of φ. We observe that as the probability of receiving a heartbeat increases, the value of φ decreases and approaches &lt;code&gt;0&lt;/code&gt;, and as the probability of receiving a heartbeat decreases and approaches &lt;code&gt;0&lt;/code&gt;, the value of φ tends to infinity ∞.&lt;/p&gt;

&lt;p&gt;Computing φ with &lt;code&gt;-log10(x)&lt;/code&gt; also means the likelihood of making a mistake decreases exponentially as φ increases. So if we declare a system down when φ crosses a certain threshold &lt;code&gt;X&lt;/code&gt; where &lt;code&gt;X&lt;/code&gt; is &lt;code&gt;1&lt;/code&gt;, the probability that our decision will later be contradicted by the reception of a late heartbeat is about &lt;code&gt;10%&lt;/code&gt;. For &lt;code&gt;X = 2&lt;/code&gt;, the likelihood of a mistake is &lt;code&gt;1%&lt;/code&gt;, for &lt;code&gt;X = 3&lt;/code&gt; it is &lt;code&gt;0.1%&lt;/code&gt;, and so on.&lt;/p&gt;

&lt;h2&gt;Estimating the probability of receiving another heartbeat&lt;/h2&gt;

&lt;p&gt;Now that we have defined what φ is, we need a way to compute the probability of receiving another heartbeat given that we have seen some heartbeats before. This probability is proportional to the probability that the heartbeat will arrive more than &lt;code&gt;t&lt;/code&gt; units after the previous one, i.e. the longer the wait, the lower the chance of receiving the heartbeat.&lt;/p&gt;

&lt;p&gt;To implement this, we keep a sampled sliding window holding the arrival times of past heartbeats. Whenever a new heartbeat arrives, its arrival time is stored in the window and the data for the oldest heartbeat is deleted.&lt;/p&gt;

&lt;p&gt;We observe that the arrival intervals follow a &lt;a href="https://en.wikipedia.org/wiki/Normal_distribution"&gt;Normal Distribution&lt;/a&gt;, indicating that most heartbeats arrive within a specific range while a few arrive late due to various network or system conditions. From the information stored in the window, we can easily compute the arrival intervals and their mean and variance, which we need to estimate the probability.&lt;/p&gt;
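&lt;p&gt;The sliding window and the statistics derived from it can be sketched as follows; the window size of 1000 samples is an arbitrary choice for illustration.&lt;/p&gt;

```python
from collections import deque
import statistics

class ArrivalWindow:
    """Sampled sliding window of heartbeat arrival times."""
    def __init__(self, size: int = 1000):
        self.arrivals = deque(maxlen=size)  # oldest sample is dropped automatically

    def record(self, arrival_time: float) -> None:
        self.arrivals.append(arrival_time)

    def intervals(self):
        times = list(self.arrivals)
        return [b - a for a, b in zip(times, times[1:])]

    def stats(self):
        ivals = self.intervals()
        return statistics.mean(ivals), statistics.variance(ivals)
```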

&lt;p&gt;Since arrival intervals follow a Normal Distribution, we can integrate the &lt;a href="https://en.wikipedia.org/wiki/Probability_density"&gt;Probability Density Function&lt;/a&gt; over the interval &lt;code&gt;(t, ∞)&lt;/code&gt; to get the probability of receiving a heartbeat after &lt;code&gt;t&lt;/code&gt; units of time. The expression for deriving this is illustrated below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4pkYMawT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/87231591-fbe8a680-c3d5-11ea-9427-d4cd66e8e717.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4pkYMawT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/87231591-fbe8a680-c3d5-11ea-9427-d4cd66e8e717.png" alt="Estimating probability of receiving another heartbeat"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We observe that if the process actually crashes, the value is guaranteed to accrue (accumulate) over time and tend to infinity ∞. Since accrual failure detectors output a value in a continuous range, we need to explicitly define thresholds beyond which we declare that the system has crashed.&lt;/p&gt;

&lt;h1&gt;Benefits of using Accrual Failure Detectors&lt;/h1&gt;

&lt;p&gt;We can define multiple thresholds and, on crossing each one, take the precautionary measures defined for it; the higher the threshold, the more drastic the action can be. Another major benefit of this approach is that it favors a nearly complete decoupling between application requirements and monitoring, as it leaves applications free to define thresholds according to their QoS requirements.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pdfs.semanticscholar.org/11ae/4c0c0d0c36dc177c1fff5eb84fa49aa3e1a8.pdf"&gt;The φ Accrual Failure Detector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Normal_distribution"&gt;Normal Distribution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;Other articles that you might like&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/copy-on-write"&gt;Copy-on-Write Semantics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/better-programmer"&gt;Eight rituals to be a better programmer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/isolation-forest"&gt;Isolation Forest algorithm for anomaly detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/inheritance-c"&gt;Powering inheritance in C using structure composition&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you liked what you read, consider subscribing to my weekly newsletter at &lt;a href="https://arpitbhayani.me/newsletter"&gt;arpitbhayani.me/newsletter&lt;/a&gt; where, once a week, I write an essay about programming languages internals, or a deep dive on some super-clever algorithm, or just a few tips on building highly scalable distributed systems.&lt;/p&gt;

&lt;p&gt;You can always find me browsing through twitter &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Deciphering Single-byte XOR Ciphertext</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 21 Jun 2020 13:22:09 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/deciphering-single-byte-xor-ciphertext-9kb</link>
      <guid>https://dev.to/arpit_bhayani/deciphering-single-byte-xor-ciphertext-9kb</guid>
      <description>&lt;p&gt;Encryption is a process of encoding messages such that it can only be read and understood by the intended parties. The process of extracting the original message from an encrypted one is called Decryption. Encryption usually scrambles the original message using a key, called encryption key, that the involved parties agree on.&lt;/p&gt;

&lt;p&gt;The strength of an encryption algorithm is determined by how hard it would be to extract the original message without knowing the encryption key. Usually this depends on the number of bits in the key - the bigger the key, the longer it takes to decrypt the enciphered data.&lt;/p&gt;

&lt;p&gt;In this essay, we will work with a very simple cipher (encryption algorithm) that uses an encryption key with a size of one byte, and try to decipher the ciphered text and retrieve the original message without knowing the encryption key. The problem statement, defined above, is based on &lt;a href="https://cryptopals.com/sets/1/challenges/3"&gt;Cryptopals Set 1 Challenge 3&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;Single-byte XOR cipher&lt;/h1&gt;

&lt;p&gt;The Single-byte XOR cipher works with an encryption key of size 1 byte - which means the encryption key can be any of the 256 possible values of a byte. Let us now take a detailed look at what the encryption and decryption processes look like for this cipher.&lt;/p&gt;

&lt;h2&gt;Encryption&lt;/h2&gt;

&lt;p&gt;As part of the encryption process, the original message is iterated over bytewise, every single byte &lt;code&gt;b&lt;/code&gt; is XORed with the encryption key &lt;code&gt;key&lt;/code&gt;, and the resultant stream of bytes is translated back into characters and sent to the other party. These encrypted bytes need not be among the usual printable characters and should ideally be interpreted as a stream of bytes. Following is a Python-based implementation of the encryption process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;single_byte_xor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;"""Given a plain text `text` as bytes and an encryption key `key` as a byte
    in range [0, 256) the function encrypts the text by performing
    XOR of all the bytes and the `key` and returns the resultant.
    """&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As an example, we can encrypt the plain text &lt;code&gt;abcd&lt;/code&gt; with the encryption key &lt;code&gt;69&lt;/code&gt;; as per the algorithm, we XOR the given plain text bytewise. For the character &lt;code&gt;a&lt;/code&gt;, the byte (ASCII) value is &lt;code&gt;97&lt;/code&gt;, which when XORed with &lt;code&gt;69&lt;/code&gt; results in &lt;code&gt;36&lt;/code&gt;, whose character equivalent is &lt;code&gt;$&lt;/code&gt;; similarly, for &lt;code&gt;b&lt;/code&gt; the encrypted byte is &lt;code&gt;'&lt;/code&gt;, for &lt;code&gt;c&lt;/code&gt; it is &lt;code&gt;&amp;amp;&lt;/code&gt;, and for &lt;code&gt;d&lt;/code&gt; it is &lt;code&gt;!&lt;/code&gt;. Hence when &lt;code&gt;abcd&lt;/code&gt; is encrypted using the single-byte XOR cipher with encryption key &lt;code&gt;69&lt;/code&gt;, the resultant ciphertext, i.e. the encrypted message, is &lt;code&gt;$'&amp;amp;!&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NQsN0GAA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/85209379-0b377f80-b355-11ea-8206-54ad558b4a6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NQsN0GAA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/85209379-0b377f80-b355-11ea-8206-54ad558b4a6f.png" alt="https://user-images.githubusercontent.com/4745789/85209379-0b377f80-b355-11ea-8206-54ad558b4a6f.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Decryption&lt;/h2&gt;

&lt;p&gt;Decryption is the process of extracting the original message from the encrypted ciphertext given the encryption key. XOR has a &lt;a href="https://brainly.in/question/3038497"&gt;property&lt;/a&gt; - if &lt;code&gt;a = b ^ c&lt;/code&gt; then &lt;code&gt;b = a ^ c&lt;/code&gt; - hence the decryption process is exactly the same as encryption, i.e. we iterate through the encrypted message bytewise and XOR each byte with the encryption key; the resultant is the original message.&lt;/p&gt;

&lt;p&gt;Since encryption and decryption have the exact same implementation, we pass the ciphertext to the function &lt;code&gt;single_byte_xor&lt;/code&gt;, defined above, to get the original message back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;single_byte_xor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;b"$'&amp;amp;!"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;69&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="s"&gt;b'abcd'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h1&gt;Deciphering without the encryption key&lt;/h1&gt;

&lt;p&gt;Things become really interesting when we have to recover the original message given the ciphertext but no knowledge of the encryption key - although we do know the encryption algorithm.&lt;/p&gt;

&lt;p&gt;As a sample plain text, we take the last couple of messages sent on the German military radio network during World War II. These messages were intercepted and decrypted by British troops. During wartime, German messages were encrypted using the &lt;a href="https://en.wikipedia.org/wiki/Enigma_machine"&gt;Enigma Machine&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/Alan_Turing"&gt;Alan Turing&lt;/a&gt; famously &lt;a href="https://www.iwm.org.uk/history/how-alan-turing-cracked-the-enigma-code"&gt;cracked the Enigma Code&lt;/a&gt; (analogous to an encryption key) used to encipher them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KjpX1QXs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/85209533-72096880-b356-11ea-8a84-97f2feb86b44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KjpX1QXs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/85209533-72096880-b356-11ea-8a84-97f2feb86b44.png" alt="https://user-images.githubusercontent.com/4745789/85209533-72096880-b356-11ea-8a84-97f2feb86b44.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In this essay, instead of encrypting the message using the Enigma Code, we are going to use Single-byte XOR cipher and try to recover the original message back without any knowledge of the encryption key.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here, we assume that the original message to be encrypted is a genuine lowercase English sentence. The ciphertext that we will try to decipher can be obtained as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;82&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;plain_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;b'british troops entered cuxhaven at 1400 on 6 may - from now on all radio traffic will cease - wishing you all the best. lt kunkel.'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;single_byte_xor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plain_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="s"&gt;b'0 ;&amp;amp;;!:r&amp;amp; =="!r7&amp;lt;&amp;amp;7 76r1&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s"&gt;*:3$7&amp;lt;r3&amp;amp;rcfbbr=&amp;lt;rdr?3+r&lt;/span&gt;&lt;span class="se"&gt;\x7f&lt;/span&gt;&lt;span class="s"&gt;r4 =?r&amp;lt;=%r=&amp;lt;r3&amp;gt;&amp;gt;r 36;=r&amp;amp; 344;1r%;&amp;gt;&amp;gt;r173!7r&lt;/span&gt;&lt;span class="se"&gt;\x7f&lt;/span&gt;&lt;span class="s"&gt;r%;!:;&amp;lt;5r+=&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s"&gt;r3&amp;gt;&amp;gt;r&amp;amp;:7r07!&amp;amp;|r&amp;gt;&amp;amp;r9&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;97&amp;gt;|'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
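&lt;p&gt;The &lt;code&gt;single_byte_xor&lt;/code&gt; helper used above is defined earlier in the article; a minimal sketch of what such a function does - XOR every byte of the input with the one-byte key - would look like this (the exact implementation in the article may differ):&lt;/p&gt;

```python
def single_byte_xor(text, key):
    """XOR every byte of `text` with the single-byte `key` (0 to 255).

    Since XOR is its own inverse, the same function both encrypts and
    decrypts: single_byte_xor(single_byte_xor(t, k), k) == t.
    """
    return bytes(b ^ key for b in text)
```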



&lt;h2&gt;
  
  
  Brute Force
&lt;/h2&gt;

&lt;p&gt;There is a very limited number of possible encryption keys - 256, to be exact - so we can, very conveniently, take the brute-force approach and try to decrypt the ciphertext with every single one of them. We iterate over all keys in the range &lt;code&gt;[0, 256)&lt;/code&gt;, decrypt the ciphertext with each, and see which result resembles the original message the most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sRgUb9K5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/85209704-ad586700-b357-11ea-8b7c-4d4616af609a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sRgUb9K5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/85209704-ad586700-b357-11ea-8b7c-4d4616af609a.png" alt="https://user-images.githubusercontent.com/4745789/85209704-ad586700-b357-11ea-8b7c-4d4616af609a.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the illustration above, we see that the message decrypted with key &lt;code&gt;82&lt;/code&gt; is, in fact, our original message, while the other retrieved plain texts look like scrambled garbage. Doing this visually is very easy; we, as humans, can recognize familiar text - but how will a computer recognize it?&lt;/p&gt;

&lt;p&gt;We need a way to quantify how close a text is to a genuine English sentence. The closer the decrypted text is to a genuine English sentence, the more likely it is to be our original plain text.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We can do this only because of our assumption - that the original plain text is a genuine English sentence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  ETAOIN SHRDLU
&lt;/h2&gt;

&lt;p&gt;Letter Frequency is the number of times the letters of an alphabet appear, on average, in written language. In the English language, the letter frequency of the letter &lt;code&gt;a&lt;/code&gt; is &lt;code&gt;8.239%&lt;/code&gt; and for &lt;code&gt;b&lt;/code&gt; it is &lt;code&gt;1.505%&lt;/code&gt;, which means that out of every 100 letters written in English, the letter &lt;code&gt;a&lt;/code&gt;, on average, shows up about 8 times while &lt;code&gt;b&lt;/code&gt; shows up about 1.5 times. Letter frequencies (in percentage) for the other letters are as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;occurance_english&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.2389258&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.5051398&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'c'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.8065007&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'d'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.2904556&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'e'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;12.813865&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'f'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.2476217&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'g'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.0327458&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'h'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;6.1476691&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'i'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;6.1476691&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'j'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1543474&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'k'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7787989&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'l'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.0604477&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'m'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.4271893&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'n'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;6.8084376&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'o'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;7.5731132&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'p'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.9459884&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'q'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0958366&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;6.0397268&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'s'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;6.3827211&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'t'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;9.1357551&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'u'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.7822893&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'v'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9866131&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'w'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.3807842&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'x'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1513210&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'y'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.9913847&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="s"&gt;'z'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0746517&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This Letter Frequency analysis is a rudimentary method of language identification, in which we check whether the letter frequency distribution of a text matches the average letter frequency distribution of the English language. &lt;a href="https://en.wikipedia.org/wiki/Etaoin_shrdlu"&gt;ETAOIN SHRDLU&lt;/a&gt; is the approximate order of frequency of the 12 most commonly used letters in the English language.&lt;/p&gt;
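&lt;p&gt;As a quick sanity check - a standalone sketch reusing the &lt;code&gt;occurance_english&lt;/code&gt; table above - sorting the letters in descending order of frequency approximately reproduces that ordering. Note that in the table above &lt;code&gt;h&lt;/code&gt; and &lt;code&gt;i&lt;/code&gt; happen to share a value, so the tail of the computed order deviates slightly from ETAOIN SHRDLU:&lt;/p&gt;

```python
# Letter-frequency table reproduced from above so the snippet runs standalone.
occurance_english = {
    'a': 8.2389258, 'b': 1.5051398, 'c': 2.8065007, 'd': 4.2904556,
    'e': 12.813865, 'f': 2.2476217, 'g': 2.0327458, 'h': 6.1476691,
    'i': 6.1476691, 'j': 0.1543474, 'k': 0.7787989, 'l': 4.0604477,
    'm': 2.4271893, 'n': 6.8084376, 'o': 7.5731132, 'p': 1.9459884,
    'q': 0.0958366, 'r': 6.0397268, 's': 6.3827211, 't': 9.1357551,
    'u': 2.7822893, 'v': 0.9866131, 'w': 2.3807842, 'x': 0.1513210,
    'y': 1.9913847, 'z': 0.0746517,
}

# Sort letters by descending frequency; the head of the ordering starts
# with "etao...", approximately matching ETAOIN SHRDLU.
by_frequency = sorted(occurance_english, key=occurance_english.get, reverse=True)
print(''.join(by_frequency[:12]))
```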

&lt;p&gt;The following chart shows Letter Frequency analysis for decrypted plain texts with encryption keys from &lt;code&gt;79&lt;/code&gt; to &lt;code&gt;84&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9zK0gUJv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/85209804-5a32e400-b358-11ea-8e1b-2b6bb3e22868.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9zK0gUJv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/85209804-5a32e400-b358-11ea-8e1b-2b6bb3e22868.png" alt="https://user-images.githubusercontent.com/4745789/85209804-5a32e400-b358-11ea-8e1b-2b6bb3e22868.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the illustration above, we can clearly see how well the Letter Frequency distribution for encryption key &lt;code&gt;82&lt;/code&gt; fits the distribution of the English language. Now that our hypothesis holds, we need a way to quantify this measure; we call it the Fitting Quotient.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fitting Quotient
&lt;/h2&gt;

&lt;p&gt;The Fitting Quotient is a measure of how well two Letter Frequency Distributions match. Heuristically, we define the Fitting Quotient as the average of the absolute differences between the frequencies (in percentage) of letters in &lt;code&gt;text&lt;/code&gt; and the corresponding letters in the English language. Thus, a smaller Fitting Quotient implies that the text is closer to the English language.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--moz9lhZb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/85219888-f2ff4900-b3c4-11ea-933a-96e26580a3fb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--moz9lhZb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/85219888-f2ff4900-b3c4-11ea-933a-96e26580a3fb.png" alt="https://user-images.githubusercontent.com/4745789/85219888-f2ff4900-b3c4-11ea-933a-96e26580a3fb.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Python implementation of the Fitting Quotient defined above is shown below. The function first computes the relative frequency of each letter in &lt;code&gt;text&lt;/code&gt; and then takes the average of the absolute differences between the two distributions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dist_english&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;occurance_english&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_fitting_quotient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;"""Given the stream of bytes `text` the function computes the fitting
    quotient of the letter frequency distribution for `text` with the
    letter frequency distribution of the English language.

    The function returns the average of the absolute difference between the
    frequencies (in percentage) of letters in `text` and the corresponding
    letter in the English Language.
    """&lt;/span&gt;
    &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dist_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;occurance_english&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nb"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist_english&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dist_text&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
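&lt;p&gt;For a quick sanity check of the quotient - a standalone sketch reproducing the table and function above; the sample byte strings are our own - a genuine English sentence scores a noticeably smaller value than scrambled bytes:&lt;/p&gt;

```python
from collections import Counter

# Letter-frequency table and function reproduced from above so the
# snippet runs standalone.
occurance_english = {
    'a': 8.2389258, 'b': 1.5051398, 'c': 2.8065007, 'd': 4.2904556,
    'e': 12.813865, 'f': 2.2476217, 'g': 2.0327458, 'h': 6.1476691,
    'i': 6.1476691, 'j': 0.1543474, 'k': 0.7787989, 'l': 4.0604477,
    'm': 2.4271893, 'n': 6.8084376, 'o': 7.5731132, 'p': 1.9459884,
    'q': 0.0958366, 'r': 6.0397268, 's': 6.3827211, 't': 9.1357551,
    'u': 2.7822893, 'v': 0.9866131, 'w': 2.3807842, 'x': 0.1513210,
    'y': 1.9913847, 'z': 0.0746517,
}
dist_english = list(occurance_english.values())

def compute_fitting_quotient(text):
    # Relative frequency (in percent) of each lowercase letter in `text`,
    # averaged letter-by-letter against the English frequencies.
    counter = Counter(text)
    dist_text = [
        (counter.get(ord(ch), 0) * 100) / len(text)
        for ch in occurance_english
    ]
    return sum(abs(a - b) for a, b in zip(dist_english, dist_text)) / len(dist_text)

english_fq = compute_fitting_quotient(b'wishing you all the best')
scrambled_fq = compute_fitting_quotient(b'0 ;!:r =="!r7 76r1')
print(english_fq, scrambled_fq)
```

&lt;p&gt;The first value printed is the smaller of the two, so the English sentence would be picked as the better fit.&lt;/p&gt;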



&lt;h2&gt;
  
  
  Deciphering
&lt;/h2&gt;

&lt;p&gt;Now that we have everything we need to recover the plain text from the given ciphertext, we wrap it all in a function that iterates over every possible encryption key in the range &lt;code&gt;[0, 256)&lt;/code&gt;, decrypts the ciphertext, computes the fitting quotient of the resulting plain text, and returns, as the original message, the one that minimizes the quotient. A Python implementation of this deciphering logic is shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decipher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="s"&gt;"""The function deciphers an encrypted text using Single Byte XOR and returns
    the original plain text message and the encryption key.
    """&lt;/span&gt;
    &lt;span class="n"&gt;original_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encryption_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_fq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# we generate the plain text using encryption key `k`
&lt;/span&gt;        &lt;span class="n"&gt;_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;single_byte_xor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# we compute the fitting quotient for this decrypted plain text
&lt;/span&gt;        &lt;span class="n"&gt;_fq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compute_fitting_quotient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# if the fitting quotient of this generated plain text is lesser
&lt;/span&gt;        &lt;span class="c1"&gt;# than the minimum seen till now `min_fq` we update.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;min_fq&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;_fq&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;min_fq&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;encryption_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_fq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_fq&lt;/span&gt;

    &lt;span class="c1"&gt;# return the text and key that has the minimum fitting quotient
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;original_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encryption_key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
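&lt;p&gt;Putting the pieces together end to end - a self-contained sketch in which &lt;code&gt;single_byte_xor&lt;/code&gt; is assumed to be the byte-wise XOR helper from earlier in the article, and the brute-force loop is expressed with &lt;code&gt;min()&lt;/code&gt; rather than the explicit minimum-tracking above - we can encrypt the sample message and recover it without knowing the key:&lt;/p&gt;

```python
from collections import Counter

# Letter-frequency table and fitting quotient reproduced from above so
# this end-to-end sketch runs standalone.
occurance_english = {
    'a': 8.2389258, 'b': 1.5051398, 'c': 2.8065007, 'd': 4.2904556,
    'e': 12.813865, 'f': 2.2476217, 'g': 2.0327458, 'h': 6.1476691,
    'i': 6.1476691, 'j': 0.1543474, 'k': 0.7787989, 'l': 4.0604477,
    'm': 2.4271893, 'n': 6.8084376, 'o': 7.5731132, 'p': 1.9459884,
    'q': 0.0958366, 'r': 6.0397268, 's': 6.3827211, 't': 9.1357551,
    'u': 2.7822893, 'v': 0.9866131, 'w': 2.3807842, 'x': 0.1513210,
    'y': 1.9913847, 'z': 0.0746517,
}
dist_english = list(occurance_english.values())

def single_byte_xor(text, key):
    # Assumed shape of the helper from earlier in the article: XOR every
    # byte with the one-byte key; XOR is its own inverse.
    return bytes(b ^ key for b in text)

def compute_fitting_quotient(text):
    counter = Counter(text)
    dist_text = [
        (counter.get(ord(ch), 0) * 100) / len(text)
        for ch in occurance_english
    ]
    return sum(abs(a - b) for a, b in zip(dist_english, dist_text)) / len(dist_text)

def decipher(text):
    # Brute force: decrypt with every key and keep the candidate whose
    # letter-frequency distribution fits English best (smallest quotient).
    candidates = [(single_byte_xor(text, k), k) for k in range(256)]
    return min(candidates, key=lambda c: compute_fitting_quotient(c[0]))

plain = b'british troops entered cuxhaven at 1400 on 6 may - from now on all radio traffic will cease - wishing you all the best. lt kunkel.'
cipher = single_byte_xor(plain, 82)
text, key = decipher(cipher)
print(key, text == plain)
```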



&lt;p&gt;This approach was also tested against 100 random English sentences with random encryption keys, and this deciphering technique fared well on all the samples. The approach would fail if the sentence is very short or contains a lot of symbols. The source code for the entire deciphering process is available in a Jupyter notebook at &lt;a href="https://github.com/arpitbbhayani/decipher-single-byte-xor/blob/master/decipher-single-byte-xor.ipynb"&gt;arpitbhayani.me/decipher-single-byte-xor&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Etaoin_shrdlu"&gt;Etaoin shrdlu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Letter_frequency"&gt;English Letter Frequency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wiki.bi0s.in/crypto/xor/#single-byte-xor-cipher"&gt;Single-byte XOR encryption&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cryptopals.com/sets/1/challenges/3"&gt;Cryptopals Challenge - Set 1 Challenge 3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Other articles that you might like
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/idf"&gt;All you need to know about Inverse Document Frequency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/fast-and-efficient-pagination-in-mongodb"&gt;Fast and Efficient Pagination in MongoDB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/how-sleepsort-helped-me-understand-concurrency-in-golang"&gt;How Sleepsort helped me understand concurrency in Golang&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/super-long-integers"&gt;How python implements super long integers?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you liked what you read, consider subscribing to my weekly newsletter at &lt;a href="https://arpitbhayani.me/newsletter"&gt;arpitbhayani.me/newsletter&lt;/a&gt; where, once a week, I write an essay about programming languages internals, or a deep dive on some super-clever algorithm, or just a few tips on building highly scalable distributed systems.&lt;/p&gt;

&lt;p&gt;You can always find me browsing through twitter &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>security</category>
      <category>algorithm</category>
    </item>
    <item>
      <title>Making Python Integers Iterable</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 14 Jun 2020 09:35:53 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/making-python-integers-iterable-2lnl</link>
      <guid>https://dev.to/arpit_bhayani/making-python-integers-iterable-2lnl</guid>
      <description>&lt;p&gt;Iterables in Python are objects and containers that could be stepped through one item at a time, usually using a &lt;code&gt;for ... in&lt;/code&gt; loop. Not all objects can be iterated, for example - we cannot iterate an integer, it is a singular value. The best we can do here is iterate on a range of integers using the &lt;code&gt;range&lt;/code&gt; type which helps us iterate through all integers in the range &lt;code&gt;[0, n)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Since integers, individually, are not iterable, when we try to do a &lt;code&gt;for x in 7&lt;/code&gt;, it raises an exception stating &lt;code&gt;TypeError: 'int' object is not iterable&lt;/code&gt;. So what if we changed Python's source code and made integers iterable, such that every time we do a &lt;code&gt;for x in 7&lt;/code&gt;, instead of raising an exception it actually iterates through the values &lt;code&gt;[0, 7)&lt;/code&gt;? In this essay, we will do exactly that, the agenda being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is a Python iterable?&lt;/li&gt;
&lt;li&gt;What is an iterator protocol?&lt;/li&gt;
&lt;li&gt;Changing Python's source code to make integers iterable, and&lt;/li&gt;
&lt;li&gt;Why it might be a bad idea to do so&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Python Iterables
&lt;/h1&gt;

&lt;p&gt;Any object that can be iterated is an Iterable in Python. The list has to be the most popular iterable out there, and it finds its usage in almost every single Python application - directly or indirectly. Even before the first user command is executed, the Python interpreter, while booting up, has already created &lt;code&gt;406&lt;/code&gt; lists for its internal usage.&lt;/p&gt;

&lt;p&gt;In the example below, we see how a list &lt;code&gt;a&lt;/code&gt; is iterated through using a &lt;code&gt;for ... in&lt;/code&gt; loop and each element can be accessed via variable &lt;code&gt;x&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Similar to &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;range&lt;/code&gt; is a Python type that allows us to iterate over integer values starting at &lt;code&gt;start&lt;/code&gt;, going up to (but not including) &lt;code&gt;end&lt;/code&gt;, and advancing by &lt;code&gt;step&lt;/code&gt; each time. &lt;code&gt;range&lt;/code&gt; is most commonly used for implementing a C-like &lt;code&gt;for&lt;/code&gt; loop in Python. In the example below, the &lt;code&gt;for&lt;/code&gt; loop iterates over a &lt;code&gt;range&lt;/code&gt; that starts at &lt;code&gt;0&lt;/code&gt;, goes till &lt;code&gt;7&lt;/code&gt; with a step of &lt;code&gt;1&lt;/code&gt; - producing the sequence &lt;code&gt;[0, 7)&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The range(0, 7, 1) will iterate through values 0 to 6 and every time
# it will increment the current value by 1 i.e. the step.
&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Apart from &lt;code&gt;list&lt;/code&gt; and &lt;code&gt;range&lt;/code&gt; other &lt;a href="https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range"&gt;iterables&lt;/a&gt; are - &lt;code&gt;tuple&lt;/code&gt;, &lt;code&gt;set&lt;/code&gt;, &lt;code&gt;frozenset&lt;/code&gt;, &lt;code&gt;str&lt;/code&gt;, &lt;code&gt;bytes&lt;/code&gt;, &lt;code&gt;bytearray&lt;/code&gt;, &lt;code&gt;memoryview&lt;/code&gt;, and &lt;code&gt;dict&lt;/code&gt;. Python also allows us to create custom iterables by making objects and types follow the &lt;a href="https://docs.python.org/3/c-api/iter.html"&gt;Iterator Protocol&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Iterators and Iterator Protocol
&lt;/h1&gt;

&lt;p&gt;Python, keeping things simple, defines an iterable as any object that follows the &lt;a href="https://docs.python.org/3/c-api/iter.html"&gt;Iterator Protocol&lt;/a&gt;, which means the object or container implements the following functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;__iter__&lt;/code&gt; should return an iterator object having implemented the &lt;code&gt;__next__&lt;/code&gt; method&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;__next__&lt;/code&gt; should return the next item of the iteration and if items are exhausted then raise a &lt;code&gt;StopIteration&lt;/code&gt; exception.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, in a gist, &lt;code&gt;__iter__&lt;/code&gt; is what makes any Python object iterable; hence, to make integers iterable, we need to have the &lt;code&gt;__iter__&lt;/code&gt; function defined for integers.&lt;/p&gt;
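&lt;p&gt;Before touching CPython, the protocol can be illustrated in pure Python. Here is a minimal sketch - the class name is ours, purely for illustration - that wraps an integer and yields the sequence &lt;code&gt;[0, n)&lt;/code&gt;, which is exactly the behavior we want to bake into &lt;code&gt;int&lt;/code&gt; itself:&lt;/p&gt;

```python
class IterableInt:
    """A hypothetical wrapper that makes an integer n iterable,
    yielding 0, 1, ..., n - 1, by following the iterator protocol."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Must return an iterator object; here the object is its own iterator.
        self.current = 0
        return self

    def __next__(self):
        # Must return the next item, raising StopIteration when exhausted.
        if self.current == self.n:
            raise StopIteration
        value = self.current
        self.current = value + 1
        return value

print(list(IterableInt(7)))  # [0, 1, 2, 3, 4, 5, 6]
```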

&lt;h1&gt;
  
  
  Iterable in CPython
&lt;/h1&gt;

&lt;p&gt;The most famous and widely used implementation of Python is &lt;a href="https://github.com/python/cpython/"&gt;CPython&lt;/a&gt;, whose core is implemented in pure C. Since we need to make changes to one of the core datatypes of Python, we will modify CPython, add the &lt;code&gt;__iter__&lt;/code&gt; function to the integer type, and rebuild the binary. But before jumping into the implementation, it is important to understand a few fundamentals.&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;PyTypeObject&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Every object in Python is associated with a type, and each &lt;a href="https://docs.python.org/3/c-api/typeobj.html#type-objects"&gt;type&lt;/a&gt; is an instance of a struct named &lt;a href="https://docs.python.org/3/c-api/typeobj.html"&gt;PyTypeObject&lt;/a&gt;. A new instance of this structure is, effectively, a new type in Python. The structure holds some metadata and a bunch of C function pointers, each implementing a small segment of the type's functionality. Most of these "slots" in the structure are optional and can be filled with appropriate function pointers to drive the corresponding functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;tp_iter&lt;/code&gt; slot
&lt;/h2&gt;

&lt;p&gt;Among all the slots available, the one that interests us is the &lt;code&gt;tp_iter&lt;/code&gt; slot, which can hold a pointer to a function that returns an iterator object. This slot corresponds to the &lt;code&gt;__iter__&lt;/code&gt; function, which effectively makes the object iterable; a non-&lt;code&gt;NULL&lt;/code&gt; value in this slot indicates iterability. The &lt;code&gt;tp_iter&lt;/code&gt; slot holds a function with the following signature&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;tp_iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Integers in Python do not have a fixed size; rather, the size of an integer depends on the value it holds. &lt;a href="https://arpitbhayani.me/blogs/super-long-integers"&gt;How Python implements super long integers&lt;/a&gt; is a story on its own, but the core implementation can be found in &lt;a href="https://github.com/python/cpython/blob/master/Objects/longobject.c"&gt;longobject.c&lt;/a&gt;. The instance of &lt;code&gt;PyTypeObject&lt;/code&gt; that defines the integer/long type is &lt;code&gt;PyLong_Type&lt;/code&gt;, and it has its &lt;code&gt;tp_iter&lt;/code&gt; slot set to &lt;code&gt;0&lt;/code&gt;, i.e. &lt;code&gt;NULL&lt;/code&gt;, confirming that integers in Python are not iterable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;PyTypeObject&lt;/span&gt; &lt;span class="n"&gt;PyLong_Type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;

    &lt;span class="s"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                      &lt;span class="cm"&gt;/* tp_name */&lt;/span&gt;
    &lt;span class="n"&gt;offsetof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyLongObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ob_digit&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;           &lt;span class="cm"&gt;/* tp_basicsize */&lt;/span&gt;
    &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digit&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                              &lt;span class="cm"&gt;/* tp_itemsize */&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                          &lt;span class="cm"&gt;/* tp_iter */&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;NULL&lt;/code&gt; value for &lt;code&gt;tp_iter&lt;/code&gt; is what makes the &lt;code&gt;int&lt;/code&gt; object non-iterable; hence, if this slot were occupied by an appropriate function pointer with the aforementioned signature, it could well make any integer iterable.&lt;/p&gt;
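&lt;p&gt;As a quick sanity check in a stock CPython build (a sketch, not part of the patch), the &lt;code&gt;NULL&lt;/code&gt; &lt;code&gt;tp_iter&lt;/code&gt; surfaces at the Python level as &lt;code&gt;int&lt;/code&gt; rejecting iteration:&lt;/p&gt;

```python
# In an unmodified CPython, PyLong_Type has tp_iter set to NULL,
# so asking for an iterator over an int raises TypeError.
try:
    iter(7)
except TypeError as e:
    print(e)
```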

&lt;h1&gt;
  
  
  Implementing &lt;code&gt;long_iter&lt;/code&gt;
&lt;/h1&gt;

&lt;p&gt;Now we implement the &lt;code&gt;tp_iter&lt;/code&gt; function for the integer type, naming it &lt;code&gt;long_iter&lt;/code&gt;; as required by the convention, it returns an iterator object. The core functionality we are looking to implement is this: when an integer &lt;code&gt;n&lt;/code&gt; is iterated, it should iterate through the sequence &lt;code&gt;[0, n)&lt;/code&gt; with step &lt;code&gt;1&lt;/code&gt;. This behavior is very close to the pre-defined &lt;code&gt;range&lt;/code&gt; type, which iterates over a range of integer values; more specifically, a &lt;code&gt;range&lt;/code&gt; that starts at &lt;code&gt;0&lt;/code&gt;, goes till &lt;code&gt;n&lt;/code&gt;, with a step of &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;
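&lt;p&gt;Before touching the C code, the target behavior can be sketched in pure Python with a hypothetical &lt;code&gt;int&lt;/code&gt; subclass (not part of the patch) whose &lt;code&gt;__iter__&lt;/code&gt; - the Python-level face of &lt;code&gt;tp_iter&lt;/code&gt; - delegates to &lt;code&gt;range&lt;/code&gt;:&lt;/p&gt;

```python
# Hypothetical sketch: the real patch changes the int type itself;
# here a subclass emulates the same [0, n) iteration via range.
class IterableInt(int):
    def __iter__(self):
        return iter(range(self))

print(list(IterableInt(7)))  # [0, 1, 2, 3, 4, 5, 6]
```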

&lt;p&gt;We define a utility function in &lt;code&gt;rangeobject.c&lt;/code&gt; that, given a Python integer, returns an instance of &lt;code&gt;longrangeiterobject&lt;/code&gt; as per our specification. The utility function instantiates the &lt;code&gt;longrangeiterobject&lt;/code&gt; with the start as &lt;code&gt;0&lt;/code&gt;, the end as the long value given in the argument, and the step as &lt;code&gt;1&lt;/code&gt;, as illustrated below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 *  PyLongRangeIter_ZeroToN creates and returns a range iterator on long
 *  iterating on values in the range [0, n).
 *
 *  The function creates and returns a range iterator from 0 till the
 *  provided long value.
 */&lt;/span&gt;
&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="nf"&gt;PyLongRangeIter_ZeroToN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;long_obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// creating a new instance of longrangeiterobject&lt;/span&gt;
    &lt;span class="n"&gt;longrangeiterobject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PyObject_New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;longrangeiterobject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;PyLongRangeIter_Type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// if unable to allocate memoty to it, return NULL.&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// we set the start to 0&lt;/span&gt;
    &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_PyLong_Zero&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// we set the step to 1&lt;/span&gt;
    &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_PyLong_One&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// we set the index to 0, since we want to always start from the first&lt;/span&gt;
    &lt;span class="c1"&gt;// element of the iteration&lt;/span&gt;
    &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_PyLong_Zero&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// we set the total length of iteration to be equal to the provided value&lt;/span&gt;
    &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;long_obj&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// we increment the reference count for each of the values referenced&lt;/span&gt;
    &lt;span class="n"&gt;Py_INCREF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Py_INCREF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Py_INCREF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Py_INCREF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// downcast the iterator instance to PyObject and return&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
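&lt;p&gt;The state that &lt;code&gt;PyLongRangeIter_ZeroToN&lt;/code&gt; wires up can be mirrored in a rough Python sketch; the names here are borrowed from the &lt;code&gt;longrangeiterobject&lt;/code&gt; fields purely for illustration:&lt;/p&gt;

```python
# Rough Python analogue of the iterator state set up in C:
# start=0, step=1, index=0, len=n.
def zero_to_n(n):
    start, step, index, length = 0, 1, 0, n
    while index < length:
        yield start + index * step
        index += 1

print(list(zero_to_n(5)))  # [0, 1, 2, 3, 4]
```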



&lt;p&gt;The utility function &lt;code&gt;PyLongRangeIter_ZeroToN&lt;/code&gt; is defined in &lt;code&gt;rangeobject.c&lt;/code&gt; and is declared in &lt;code&gt;rangeobject.h&lt;/code&gt; so that it can be used across CPython. The declaration in &lt;code&gt;rangeobject.h&lt;/code&gt;, using the standard Python macros, goes like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;PyAPI_FUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="n"&gt;PyLongRangeIter_ZeroToN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The function occupying the &lt;code&gt;tp_iter&lt;/code&gt; slot receives the &lt;code&gt;self&lt;/code&gt; object as its input argument and is expected to return an iterator instance. Hence, the &lt;code&gt;long_iter&lt;/code&gt; function receives the Python integer object (self) being iterated and returns the iterator instance, using the utility function &lt;code&gt;PyLongRangeIter_ZeroToN&lt;/code&gt; we just defined. The entire &lt;code&gt;long_iter&lt;/code&gt; function can be defined as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 *  long_iter creates an instance of range iterator using PyLongRangeIter_ZeroToN
 *  and returns the iterator instance.
 *
 *  The argument to the `tp_iter` is the `self` object and since we are trying to
 *  iterate an integer here, the input argument to `long_iter` will be the
 *  PyObject of type PyLong_Type, holding the integer value.
 */&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;long_iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;long_obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;PyLongRangeIter_ZeroToN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_obj&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now that we have &lt;code&gt;long_iter&lt;/code&gt; defined, we can place the function in the &lt;code&gt;tp_iter&lt;/code&gt; slot of &lt;code&gt;PyLong_Type&lt;/code&gt;, which enables the required iterability on integers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;PyTypeObject&lt;/span&gt; &lt;span class="n"&gt;PyLong_Type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;

    &lt;span class="s"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                      &lt;span class="cm"&gt;/* tp_name */&lt;/span&gt;
    &lt;span class="n"&gt;offsetof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyLongObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ob_digit&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;           &lt;span class="cm"&gt;/* tp_basicsize */&lt;/span&gt;
    &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digit&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                              &lt;span class="cm"&gt;/* tp_itemsize */&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;long_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                  &lt;span class="cm"&gt;/* tp_iter */&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Consolidated flow
&lt;/h2&gt;

&lt;p&gt;Once we have everything in place, the entire flow goes like this -&lt;/p&gt;

&lt;p&gt;Every time an integer is iterated, using any iteration method - for example &lt;code&gt;for ... in&lt;/code&gt; - Python checks the &lt;code&gt;tp_iter&lt;/code&gt; slot of &lt;code&gt;PyLong_Type&lt;/code&gt;, and since it now holds the function pointer &lt;code&gt;long_iter&lt;/code&gt;, that function is invoked. The invocation returns an iterator object of type &lt;code&gt;longrangeiterobject&lt;/code&gt; with fixed start, index, and step values - which in Pythonic terms is effectively a &lt;code&gt;range(0, n, 1)&lt;/code&gt;. Hence &lt;code&gt;for x in 7&lt;/code&gt; is inherently evaluated as &lt;code&gt;for x in range(0, 7, 1)&lt;/code&gt;, allowing us to iterate integers.&lt;/p&gt;
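&lt;p&gt;One detail worth noting: the interpreter consults the &lt;code&gt;tp_iter&lt;/code&gt; slot of the &lt;em&gt;type&lt;/em&gt;, not the instance - which is why, in regular Python, attaching &lt;code&gt;__iter__&lt;/code&gt; to an instance does not make it iterable:&lt;/p&gt;

```python
# Special-method lookup happens on the type (the tp_iter slot),
# so an instance-level __iter__ attribute is ignored by iter().
class Plain:
    pass

obj = Plain()
obj.__iter__ = lambda: iter(range(3))  # set on the instance only

try:
    iter(obj)
except TypeError:
    print("still not iterable")
```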

&lt;blockquote&gt;
&lt;p&gt;These changes are also hosted on a remote branch &lt;a href="https://github.com/arpitbbhayani/cpython/tree/02-long-iter"&gt;cpython@02-long-iter&lt;/a&gt; and Pull Request holding the &lt;code&gt;diff&lt;/code&gt; can be found &lt;a href="https://github.com/arpitbbhayani/cpython/pull/7"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Integer iteration in action
&lt;/h1&gt;

&lt;p&gt;Once we build a new Python binary with the aforementioned changes, we can see iterable integers in action. Now when we do &lt;code&gt;for x in 7&lt;/code&gt;, instead of raising an exception, it actually iterates through the values &lt;code&gt;[0, 7)&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;

&lt;span class="cp"&gt;# Since integers are now iterable, we can create a list of [0, 7) using `list`
# Internally `list` tries to iterate on the given object i.e. `7`
# now that the iteration is defined as [0, 7) we get the list
# from the iteration, instead of an exception
&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h1&gt;
  
  
  Why it is not a good idea
&lt;/h1&gt;

&lt;p&gt;Although it seems fun, and somewhat useful, to have iterable integers, it is really not a great idea. The core reason is that it makes unpacking unpredictable. Unpacking assigns the items of an iterable to multiple variables; for example, &lt;code&gt;a, b = 3, 4&lt;/code&gt; assigns 3 to &lt;code&gt;a&lt;/code&gt; and 4 to &lt;code&gt;b&lt;/code&gt;. So &lt;code&gt;a, b = 7&lt;/code&gt; should be an error because there is just one value on the right side and multiple variables on the left.&lt;/p&gt;

&lt;p&gt;Unpacking treats the right-hand side as an iterable and tries to iterate it; and now that integers are iterable, the right-hand side yields 7 values after iteration while the left-hand side has a mere 2 variables. Hence it raises the exception &lt;code&gt;ValueError: too many values to unpack (expected 2)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Things would work just fine if we did &lt;code&gt;a, b = 2&lt;/code&gt;, as now the right-hand side, post iteration, has two values and the left-hand side has two variables. Thus two very similar statements produce two very different outcomes, making unpacking unpredictable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="n"&gt;Traceback&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;stdin&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;too&lt;/span&gt; &lt;span class="n"&gt;many&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;unpack&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
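&lt;p&gt;The same unpacking mechanics can be observed in stock Python with &lt;code&gt;range&lt;/code&gt; standing in for our iterable integers - the right-hand side is iterated, and a count mismatch raises &lt;code&gt;ValueError&lt;/code&gt;:&lt;/p&gt;

```python
# range(n) behaves like the patched integer n: it yields [0, n).
a, b = range(2)          # two values, two targets: fine
print(a, b)              # 0 1

try:
    a, b = range(7)      # seven values, two targets
except ValueError as e:
    print(e)             # too many values to unpack (expected 2)
```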



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this essay, we modified Python's source code and made integers iterable. Even though it is not a good idea to do so, it is fun to play around with the code and make changes in our favorite programming language. It helps us get a detailed idea of the core Python implementation and may pave the way for us to become a Python core developer. This is one of many articles in the Python Internals series - &lt;a href="https://arpitbhayani.me/blogs/super-long-integers"&gt;How python implements super long integers?&lt;/a&gt; and &lt;a href="https://arpitbhayani.me/blogs/python-caches-integers"&gt;Python Caches Integers&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/c-api/type.html#c.PyTypeObject"&gt;PyTypeObject&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/c-api/typeobj.html"&gt;Python Type Objects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/c-api/iter.html"&gt;Python Iterator Protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/arpitbbhayani/cpython/pull/7"&gt;CPython with long_iter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Other articles that you might like
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/python-caches-integers"&gt;Python Caches Integers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/super-long-integers"&gt;How python implements super long integers?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/i-changed-my-python"&gt;I changed my Python and made it dubious | Python Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/fsm"&gt;Building Finite State Machines with Python Coroutines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/python-prompts"&gt;Personalize your python prompt&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you liked what you read, consider subscribing to my weekly newsletter at &lt;a href="https://arpitbhayani.me/newsletter"&gt;arpitbhayani.me/newsletter&lt;/a&gt; where, once a week, I write an essay about programming language internals, a deep dive on some super-clever algorithm, or a few tips on building highly scalable distributed systems.&lt;/p&gt;

&lt;p&gt;You can always find me on Twitter &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>cpython</category>
      <category>internals</category>
    </item>
    <item>
      <title>Powering inheritance in C using structure composition</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 07 Jun 2020 08:41:11 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/powering-inheritance-in-c-using-structure-composition-20jk</link>
      <guid>https://dev.to/arpit_bhayani/powering-inheritance-in-c-using-structure-composition-20jk</guid>
      <description>&lt;p&gt;C language does not support inheritance however it does support Structure Compositions which can be tweaked to serve use-cases requiring parent-child relationships. In this article, we find out how Structure Compositions help us emulate inheritance in C and keep our code extensible. We will also find how it powers two of the most important things to have ever been invented in the field of computer science.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is structure composition?
&lt;/h1&gt;

&lt;p&gt;Structure Composition is when we put one structure within another, not through its pointer but as a native member - something like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// this structure defines a node of a linked list and&lt;/span&gt;
&lt;span class="c1"&gt;// it only holds the pointers to the next and the previous&lt;/span&gt;
&lt;span class="c1"&gt;// nodes in the linked list.&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_head&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_head&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// pointer to the node next to the current one&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_head&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// pointer to the node previous to the current one&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// list_int holds an list_head and an integer data member&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_head&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// common next and prev pointers&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;              &lt;span class="c1"&gt;// specific member as per implementation&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// list_int holds an list_head and an char * data member&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_str&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_head&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// common next and prev pointers&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;             &lt;span class="c1"&gt;// specific member as per implementation&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above, we define a node of a linked list using structure composition. Usually, a linked list node has 3 members - two pointers to the adjacent nodes (next and previous) and a third that is either the data or a pointer to it. The defining factor of a linked list is the two pointers that logically form a chain of nodes. To keep things abstract, we create a struct named &lt;code&gt;list_head&lt;/code&gt; which holds these two pointers, &lt;code&gt;next&lt;/code&gt; and &lt;code&gt;prev&lt;/code&gt;, and omits the specifics, i.e. the data.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;list_head&lt;/code&gt; structure, if we were to define a node of a linked list holding an integer value we could create another struct, named &lt;code&gt;list_int&lt;/code&gt; that holds a member of type &lt;code&gt;list_head&lt;/code&gt; and an integer value &lt;code&gt;value&lt;/code&gt;. The next and previous pointers are brought into this struct through &lt;code&gt;list_head list&lt;/code&gt; and could be referred to as &lt;code&gt;list.next&lt;/code&gt; and &lt;code&gt;list.prev&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There is a very genuine reason for picking such weird names for the linked list node and the structure members; the reason will become clear in the later sections of this essay.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because of the above structure definition, building a linked list node holding data of any type becomes a breeze. For example, a node holding a string can be quickly defined as a struct &lt;code&gt;list_str&lt;/code&gt; having a &lt;code&gt;list_head&lt;/code&gt; and a &lt;code&gt;char *&lt;/code&gt;. This ability to extend &lt;code&gt;list_head&lt;/code&gt; and build a node holding data of any type and any specifics makes low-level code simple, uniform, and extensible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Representation of &lt;code&gt;list_int&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Structures in C do not hold any meta information, not even the member names; hence during allocation they are given just enough space to hold the actual data, apart from any alignment padding the compiler may insert (none for the structs here on the machine used below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F4745789%2F83953834-694a6a00-a861-11ea-8ff7-fa69af6af7d6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F4745789%2F83953834-694a6a00-a861-11ea-8ff7-fa69af6af7d6.png" alt="https://user-images.githubusercontent.com/4745789/83953834-694a6a00-a861-11ea-8ff7-fa69af6af7d6.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the illustration above, we see how the members of &lt;code&gt;list_int&lt;/code&gt; are mapped onto the allocated space. It is allocated a contiguous space of 12 bytes - 4 bytes for each of the two pointers and another 4 bytes for the integer value. The contiguity of the allocation and the order of the members can be verified by printing out their addresses, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;print_addrs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// creating a node of the list_int holding value 41434&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_int&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ll&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_list_int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;41434&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// printing the address of individual members&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%p: head&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%p: head-&amp;gt;list.next&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%p: head-&amp;gt;list.prev&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%p: head-&amp;gt;value&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="n"&gt;make&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;
&lt;span class="mh"&gt;0x4058f0&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;
&lt;span class="mh"&gt;0x4058f0&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;
&lt;span class="mh"&gt;0x4058f4&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;
&lt;span class="mh"&gt;0x4058f8&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We clearly see all 3 members occupying a contiguous 12-byte memory segment, in the order of their definition within the struct.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The above code was executed on a machine where the sizes of integers and pointers were 4 bytes each. The results may differ depending on the machine and CPU architecture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Casting pointers pointing to struct
&lt;/h2&gt;

&lt;p&gt;In the C language, when a pointer to a struct is cast to a pointer to another struct type, the compiler maps the individual members of the target struct type, based on their order and offsets, onto the slice of memory occupied by the source struct instance.&lt;/p&gt;

&lt;p&gt;When we cast &lt;code&gt;list_int *&lt;/code&gt; into &lt;code&gt;list_head *&lt;/code&gt;, the compiler maps the space required by the target type, i.e. &lt;code&gt;list_head&lt;/code&gt;, onto the space occupied by &lt;code&gt;list_int&lt;/code&gt;. This means it maps the 8 bytes required by &lt;code&gt;list_head&lt;/code&gt; onto the first 8 bytes occupied by the &lt;code&gt;list_int&lt;/code&gt; instance. Going by the memory representation discussed above, we find that the first 8 bytes of &lt;code&gt;list_int&lt;/code&gt; are in fact a &lt;code&gt;list_head&lt;/code&gt;, and hence casting &lt;code&gt;list_int *&lt;/code&gt; to &lt;code&gt;list_head *&lt;/code&gt; is effectively just referencing the &lt;code&gt;list_head&lt;/code&gt; member of &lt;code&gt;list_int&lt;/code&gt; through a new variable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F4745789%2F83943610-2e254800-a81b-11ea-8b25-056e1b1df85e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F4745789%2F83943610-2e254800-a81b-11ea-8b25-056e1b1df85e.png" alt="https://user-images.githubusercontent.com/4745789/83943610-2e254800-a81b-11ea-8b25-056e1b1df85e.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This effectively builds a parent-child relationship between the two structs where we can safely typecast a child &lt;code&gt;list_int&lt;/code&gt; to its parent &lt;code&gt;list_head&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is important to note here that the parent-child relationship is established only because the first member of &lt;code&gt;list_int&lt;/code&gt; is of type &lt;code&gt;list_head&lt;/code&gt;. It would not have worked had we changed the order of members in &lt;code&gt;list_int&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
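&lt;p&gt;The layout guarantee behind this cast can even be observed from Python via &lt;code&gt;ctypes&lt;/code&gt;. The following is an illustrative sketch only (the struct names mirror &lt;code&gt;list_head&lt;/code&gt; and &lt;code&gt;list_int&lt;/code&gt; from this article; it is not code from any real project):&lt;/p&gt;

```python
import ctypes

class ListHead(ctypes.Structure):
    # mirrors: struct list_head { struct list_head *next, *prev; };
    pass

# fields are assigned after the class body so the struct can reference itself
ListHead._fields_ = [("next", ctypes.POINTER(ListHead)),
                     ("prev", ctypes.POINTER(ListHead))]

class ListInt(ctypes.Structure):
    # mirrors: struct list_int { struct list_head list; int value; };
    _fields_ = [("list", ListHead), ("value", ctypes.c_int)]

node = ListInt(value=42)

# the child struct and its first member start at the exact same address...
assert ctypes.addressof(node) == ctypes.addressof(node.list)

# ...so a ListInt pointer can be reinterpreted as a ListHead pointer
head = ctypes.cast(ctypes.pointer(node), ctypes.POINTER(ListHead))
assert ctypes.addressof(head.contents) == ctypes.addressof(node)
```

&lt;p&gt;The assertions hold precisely because &lt;code&gt;list&lt;/code&gt; is the first member; had it been declared after &lt;code&gt;value&lt;/code&gt;, the two addresses would differ by the member's offset.&lt;/p&gt;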

&lt;h1&gt;
  
  
  How does this drive inheritance?
&lt;/h1&gt;

&lt;p&gt;As established above, by putting one struct within another as its first member, we effectively create a parent-child relationship between the two. Since this gives us the ability to safely typecast a child to its parent, we can define functions that accept a pointer to the parent struct as an argument and perform operations that do not need to deal with the specifics. This allows us to &lt;strong&gt;NOT&lt;/strong&gt; rewrite the functional logic for every child extension and thus avoid redundant code.&lt;/p&gt;

&lt;p&gt;From the context we have set up, say we want to write a function that inserts a node between two existing nodes in a linked list. The core logic of this operation does not need to deal with any specifics; all it takes is a few pointer manipulations of &lt;code&gt;next&lt;/code&gt; and &lt;code&gt;prev&lt;/code&gt;. Hence, we can define the function to accept arguments of type &lt;code&gt;list_head *&lt;/code&gt; and write it as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 * Insert a new entry between two known consecutive entries.
 *
 * This is only for internal list manipulation where we know
 * the prev/next entries already!
 */&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;__list_add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_head&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_head&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_head&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we can safely typecast &lt;code&gt;list_int *&lt;/code&gt; and &lt;code&gt;list_str *&lt;/code&gt; to &lt;code&gt;list_head *&lt;/code&gt;, we can pass any of the specific implementations to the function &lt;code&gt;__list_add&lt;/code&gt; and it will still add the node between the other two seamlessly.&lt;/p&gt;

&lt;p&gt;Since the core operations on linked lists only require pointer manipulations, we can define these operations as functions accepting &lt;code&gt;list_head *&lt;/code&gt; instead of specific types like &lt;code&gt;list_int *&lt;/code&gt;, so we do not need to write near-identical functions for each specific type. A function to delete a node could be written as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 * Delete a list entry by making the prev/next entries
 * point to each other.
 *
 * This is only for internal list manipulation where we know
 * the prev/next entries already!
 */&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;__list_del&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_head&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;list_head&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other linked list utilities like &lt;em&gt;adding a node to the tail&lt;/em&gt;, &lt;em&gt;swapping nodes&lt;/em&gt;, &lt;em&gt;splicing the list&lt;/em&gt;, &lt;em&gt;rotating the list&lt;/em&gt;, etc. only require manipulations of the &lt;code&gt;next&lt;/code&gt; and &lt;code&gt;prev&lt;/code&gt; pointers. Hence they can also be written in a very similar way, i.e. accepting &lt;code&gt;list_head *&lt;/code&gt;, eliminating the need to reimplement the logic for every single child implementation.&lt;/p&gt;

&lt;p&gt;This behavior is very similar to how inheritance in modern OOP languages, like Python and Java, works, where the child is allowed to invoke any parent function.&lt;/p&gt;
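&lt;p&gt;The analogy can be made concrete with a short, hypothetical pure-Python sketch (the class and function names below are made up for illustration): the generic insert routine is written once against the base class and works unchanged for every subclass, just like &lt;code&gt;__list_add&lt;/code&gt; does for &lt;code&gt;list_head&lt;/code&gt;.&lt;/p&gt;

```python
class ListHead:
    """Base 'struct': only the linkage references, no payload."""
    def __init__(self):
        self.next = None
        self.prev = None

class IntNode(ListHead):
    def __init__(self, value):
        super().__init__()
        self.value = value

class StrNode(ListHead):
    def __init__(self, text):
        super().__init__()
        self.text = text

def list_add(new, prev, nxt):
    # mirrors __list_add: pure reference manipulation that never touches
    # the payload, so it works for any ListHead subclass
    nxt.prev = new
    new.next = nxt
    new.prev = prev
    prev.next = new

# the same routine works identically for both child types
a, b = IntNode(1), IntNode(3)
a.next, b.prev = b, a
list_add(IntNode(2), a, b)      # a -> 2 -> b

p, q = StrNode("x"), StrNode("z")
p.next, q.prev = q, p
list_add(StrNode("y"), p, q)    # p -> "y" -> q
```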

&lt;h1&gt;
  
  
  Who uses structure compositions?
&lt;/h1&gt;

&lt;p&gt;There are a ton of practical uses of Structure Composition, but the most famous ones are&lt;/p&gt;

&lt;h2&gt;
  
  
  Linux Kernel
&lt;/h2&gt;

&lt;p&gt;In order to keep things abstract and extensible, the Linux Kernel uses Structure Composition in several places. One of the most important is managing and maintaining Linked Lists, exactly as we saw above. The struct definitions and code snippets are taken as-is from the &lt;a href="https://elixir.bootlin.com/linux/latest/source/include/linux/list.h" rel="noopener noreferrer"&gt;Kernel's source code&lt;/a&gt;, and hence the structure and variable names look different than usual.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Type and Object Hierarchy
&lt;/h2&gt;

&lt;p&gt;Python, one of the most important languages in today's world, uses Structure Composition to build its type hierarchy. Python defines a root structure called &lt;code&gt;PyObject&lt;/code&gt; which holds the reference count - the number of places from which the object is referenced - and the object type - determining whether the object is an &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;str&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;dict&lt;/code&gt;, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;_object&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Py_ssize_t&lt;/span&gt;     &lt;span class="n"&gt;ob_refcnt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// holds reference count of the object&lt;/span&gt;
    &lt;span class="n"&gt;PyTypeObject&lt;/span&gt;   &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ob_type&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// holds the type of the object&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;PyObject&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since Python wants these fields to be present in every single object created at runtime, it uses structure composition to ensure that objects like integers, floats, strings, etc. put &lt;code&gt;PyObject&lt;/code&gt; as their first element, thus establishing a parent-child relationship. A Float object in Python is defined as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#define PyObject_HEAD PyObject ob_base;
&lt;/span&gt;
&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;PyObject_HEAD&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;ob_fval&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// holds the actual float value&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;PyFloatObject&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now a utility function that increments the reference count on every access of any object can be written just once, accepting &lt;code&gt;PyObject *&lt;/code&gt;, as shown below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;_Py_INCREF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ob_refcnt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thus we eliminate the need to rewrite &lt;code&gt;INCREF&lt;/code&gt; for every single object type; we write it once for &lt;code&gt;PyObject&lt;/code&gt; and it works for every Python object type that extends &lt;code&gt;PyObject&lt;/code&gt;.&lt;/p&gt;
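&lt;p&gt;One visible consequence of this design is &lt;code&gt;sys.getrefcount&lt;/code&gt;, which reads &lt;code&gt;ob_refcnt&lt;/code&gt; from the shared &lt;code&gt;PyObject&lt;/code&gt; header; the very same call works for every object type, with no per-type variant. A small illustrative sketch:&lt;/p&gt;

```python
import sys

# one function, every type: getrefcount reaches into the common
# PyObject header, so ints, floats, strings, lists, and dicts all work
for obj in (1 << 70, 3.14, "hello world", [1, 2, 3], {"k": "v"}):
    # at least 2: our reference plus the temporary reference held by
    # getrefcount's own argument
    assert sys.getrefcount(obj) >= 2
```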

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://elixir.bootlin.com/linux/latest/source/include/linux/list.h" rel="noopener noreferrer"&gt;LinkedList in Linux Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/c-api/structures.html#c.PyObject" rel="noopener noreferrer"&gt;PyObject - Python Internals Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/c-api/float.html" rel="noopener noreferrer"&gt;PyFloatObject - Python Internals Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Other articles that you might like
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/copy-on-write" rel="noopener noreferrer"&gt;Copy-on-Write Semantics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/efficient-way-to-stop-an-iterating-loop" rel="noopener noreferrer"&gt;An efficient way to stop an iterating loop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/mongodb-cursor-skip-is-slow" rel="noopener noreferrer"&gt;Why MongoDB's cursor.skip() is slow?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/fast-and-efficient-pagination-in-mongodb" rel="noopener noreferrer"&gt;Fast and Efficient Pagination in MongoDB&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you liked what you read, consider subscribing to my weekly newsletter at &lt;a href="https://arpitbhayani.me/newsletter" rel="noopener noreferrer"&gt;arpitbhayani.me/newsletter&lt;/a&gt; where, once a week, I write an essay about programming language internals, a deep dive on some super-clever algorithm, or a few tips on building highly scalable distributed systems.&lt;/p&gt;

&lt;p&gt;You can always find me on Twitter &lt;a href="https://twitter.com/arpit_bhayani" rel="noopener noreferrer"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>cpp</category>
      <category>oop</category>
      <category>linux</category>
    </item>
    <item>
      <title>The RUM Conjecture</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 31 May 2020 12:10:37 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/the-rum-conjecture-33gf</link>
      <guid>https://dev.to/arpit_bhayani/the-rum-conjecture-33gf</guid>
      <description>&lt;p&gt;The RUM Conjecture states that we cannot design an access method for a storage system that is optimal in all the following three aspects - Reads, Updates, and, Memory. The conjecture puts forth that we always have to trade one to make the other two optimal and this makes the three constitutes a competing triangle, very similar to the famous &lt;a href="https://en.wikipedia.org/wiki/CAP_theorem"&gt;CAP theorem&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r7iOKxeW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/83323578-6eb21e00-a27d-11ea-941b-43e875169c97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r7iOKxeW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/83323578-6eb21e00-a27d-11ea-941b-43e875169c97.png" alt="https://user-images.githubusercontent.com/4745789/83323578-6eb21e00-a27d-11ea-941b-43e875169c97.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Access Method
&lt;/h1&gt;

&lt;p&gt;Data access refers to the ability to access and retrieve data stored within a storage system, driven by an underlying storage engine. Usually, a storage system is designed to be optimal for a niche use case, and it achieves that by carefully and judiciously deciding the memory and disk storage requirements, defining well-structured access and retrieval patterns, designing data structures for primary and auxiliary data, and picking additional techniques like compression, encryption, etc. These decisions define, and to some extent restrict, the possible ways the storage engine can read and update the data in the system.&lt;/p&gt;

&lt;h1&gt;
  
  
  RUM Overheads
&lt;/h1&gt;

&lt;p&gt;An ideal storage system would be one whose access method provides the lowest read overhead, incurs minimal update cost, and requires no extra memory or storage space over the main data. In the real world, achieving all three is near impossible, and that is precisely what this conjecture dictates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Read Overhead
&lt;/h3&gt;

&lt;p&gt;Read Overhead occurs when the storage engine performs reads on auxiliary data to fetch the intended main data. This usually happens when we use an auxiliary data structure, like a Secondary Index, to speed up reads. The reads happening on this auxiliary structure constitute the read overhead.&lt;/p&gt;

&lt;p&gt;Read Overhead is measured through Read Amplification and it is defined as the ratio between the total amount of data read (main + auxiliary) and the amount of main data intended to be read.&lt;/p&gt;

&lt;h3&gt;
  
  
  Update Overhead
&lt;/h3&gt;

&lt;p&gt;Update Overhead occurs when the storage engine performs writes on auxiliary data, or on some unmodified main data, along with the intended updates on the main data. A typical example of Update Overhead is the writes that happen on an auxiliary structure, like a Secondary Index, alongside the write happening on the intended main data.&lt;/p&gt;

&lt;p&gt;Update Overhead is measured through Write Amplification and it is defined as the ratio between the total amount of data written (main + auxiliary) and the amount of main data intended to be updated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Overhead
&lt;/h3&gt;

&lt;p&gt;Memory overhead occurs when the storage system uses an auxiliary data structure to speed up reads, writes, or to serve common access patterns. This storage is in addition to the storage needs of the main data.&lt;/p&gt;

&lt;p&gt;Memory Overhead is measured through Space Amplification and it is defined as the ratio between the space utilized by the auxiliary plus main data and the space utilized by the main data alone.&lt;/p&gt;
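&lt;p&gt;To make the three amplification ratios concrete, here is a small worked example; all the I/O numbers below are hypothetical, chosen purely for illustration:&lt;/p&gt;

```python
# a query fetches a 4 KB row but traverses 16 KB of index pages to find it
main_read, aux_read = 4, 16
read_amplification = (main_read + aux_read) / main_read      # 20 / 4 = 5.0

# an update writes a 4 KB row plus 8 KB of secondary-index maintenance
main_written, aux_written = 4, 8
write_amplification = (main_written + aux_written) / main_written  # 3.0

# 100 GB of main data carries 25 GB of auxiliary structures
main_size, aux_size = 100, 25
space_amplification = (main_size + aux_size) / main_size     # 1.25
```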

&lt;h1&gt;
  
  
  The Conjecture
&lt;/h1&gt;

&lt;p&gt;The RUM Conjecture, in a formal way, states that&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An access method that can set an upper bound for two out of the read, update, and memory overheads, also sets a lower bound for the third overhead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not a hard rule, and hence it is not a theorem but a conjecture - widely observed but not proven. Still, we can safely keep it in mind while designing the next big storage system.&lt;/p&gt;

&lt;h1&gt;
  
  
  Categorizing Storage Systems
&lt;/h1&gt;

&lt;p&gt;Now that we have seen the RUM overheads and the RUM Conjecture, let's take a look at examples of storage systems that fall into each of the three categories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Read Optimised
&lt;/h2&gt;

&lt;p&gt;Read Optimised storage systems offer very low read overhead but require some extra auxiliary space to gain the necessary performance. That space, in turn, comes at the cost of the updates required to keep the auxiliary data in sync with the main data, adding to the update overhead. When updates to the main data become frequent, the performance of a read-optimized storage system takes a dip.&lt;/p&gt;

&lt;p&gt;A fine example of a read-optimized storage system is one that supports Point Indexes, also called Hash-based Indexes, offering constant-time access. Systems that provide logarithmic-time access, like &lt;a href="https://en.wikipedia.org/wiki/B-tree"&gt;B-Trees&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Skip_list"&gt;Skiplists&lt;/a&gt;, also fall into this category.&lt;/p&gt;

&lt;h2&gt;
  
  
  Update Optimised
&lt;/h2&gt;

&lt;p&gt;Update Optimised storage systems offer very low update overhead, usually by using an auxiliary space that holds differential data (deltas) and flushing it onto the main data in a bulk operation. The need for auxiliary data to keep track of deltas for the bulk update adds to the memory overhead.&lt;/p&gt;

&lt;p&gt;A few examples of Update Optimised systems are &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree"&gt;LSM Trees&lt;/a&gt;, &lt;a href="http://cs.emis.de/LNI/Proceedings/Proceedings26/GI-Proceedings.26-47.pdf"&gt;Partitioned B-Trees&lt;/a&gt;, and the &lt;a href="http://pages.cs.wisc.edu/~yinan/fdtree.html"&gt;FD-Tree&lt;/a&gt;. These structures offer very good performance for an update-heavy system but suffer from increased read and space overheads. While reading data from an LSM Tree, the engine needs to perform reads on all the tiers and then perform conflict resolution, and maintaining the tiers of data is itself a huge space overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Optimised
&lt;/h2&gt;

&lt;p&gt;Memory Optimised storage systems are designed to minimize the auxiliary memory required for access and updates on the main data. To be memory-optimized, the systems usually compress the main data and auxiliary storage, or allow some error rate, like false positives.&lt;/p&gt;

&lt;p&gt;A few examples of Memory Optimised systems are lossy index structures like &lt;a href="https://en.wikipedia.org/wiki/Bloom_filter"&gt;Bloom Filters&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch"&gt;Count-min sketches&lt;/a&gt;, Lossy encodings, and Sparse Indexes. By keeping either the main or the auxiliary data compressed to be memory efficient, the system takes a toll on writes and reads, as they now must additionally perform compression and decompression, adding to the update and read overheads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aH9P2q1E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/83323560-55a96d00-a27d-11ea-9d33-4001c672b920.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aH9P2q1E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/83323560-55a96d00-a27d-11ea-9d33-4001c672b920.png" alt="https://user-images.githubusercontent.com/4745789/83323560-55a96d00-a27d-11ea-9d33-4001c672b920.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Storage System examples for RUM Conjecture&lt;/p&gt;

&lt;h1&gt;
  
  
  Block-based Clustered Indexing
&lt;/h1&gt;

&lt;p&gt;Block-based Clustered Indexing sits comfortably between these three optimized system types. It is not only efficient on Reads but also on Updates and Memory. It builds a very short tree for its auxiliary data by storing only a few pointers to pages, and since the index is clustered, i.e. the main data itself is stored in the index, the system does not need to fetch the main data from the main storage, thus providing minimal read overhead.&lt;/p&gt;

&lt;h1&gt;
  
  
  Being RUM Adaptive
&lt;/h1&gt;

&lt;p&gt;Storage systems have always been rigid with respect to the kinds of use cases they aim to solve. The application, the workload, and the hardware should dictate how we access our data, not the constraints of our systems. Storage systems could instead be designed to be RUM Adaptive: they would possess the ability to be tuned to reduce the RUM overheads depending on the data access pattern and knowledge of the computation. RUM Adaptive storage systems are a discussion for another day.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;There will always be trade-offs between Reads, Updates, and Memory while choosing one storage system over another; the RUM Conjecture facilitates, and to some extent formalizes, the entire decision process. Although this is just a conjecture, it still helps us disambiguate the options and make an informed, better, and viable decision that will go a long way.&lt;/p&gt;

&lt;p&gt;This essay was heavily based on the original research paper introducing The RUM Conjecture.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://stratos.seas.harvard.edu/files/stratos/files/rum.pdf"&gt;Designing Access Methods: The RUM Conjecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Other articles that you might like
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/copy-on-write"&gt;Copy-on-Write Semantics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/mysql-cache"&gt;What makes MySQL LRU cache scan resistant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/isolation-forest"&gt;Isolation Forest algorithm for anomaly detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/image-steganography"&gt;Everything that you need to know about Image Steganography&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you liked what you read, consider subscribing to my weekly newsletter at &lt;a href="https://arpit.substack.com/"&gt;arpit.substack.com&lt;/a&gt; where, once a week, I write an essay about programming language internals, a deep dive on some super-clever algorithm, or a few tips on building highly scalable distributed systems.&lt;/p&gt;

&lt;p&gt;You can always find me on Twitter &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Consistent Hashing</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 24 May 2020 17:11:28 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/consistent-hashing-with-binary-search-47ik</link>
      <guid>https://dev.to/arpit_bhayani/consistent-hashing-with-binary-search-47ik</guid>
      <description>&lt;p&gt;Consistent hashing is a hashing technique that performs really well when operated in a dynamic environment where the distributed system scales up and scales down frequently. The core concept of Consistent Hashing was introduced in the paper &lt;a href="https://www.akamai.com/us/en/multimedia/documents/technical-publication/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf"&gt;Consistent Hashing and RandomTrees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web&lt;/a&gt; but it gained popularity after the famous paper introducing DynamoDB - &lt;a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf"&gt;Dynamo: Amazon’s Highly Available Key-value Store&lt;/a&gt;. Since then the consistent hashing gained traction and found a ton of use cases in designing and scaling distributed systems efficiently. The two famous examples that exhaustively use this technique are Bit Torrent, for their peer-to-peer networks and Akamai, for their web caches. In this article we dive deep into the need of Consistent Hashing, the internals of it, and more importantly along the way implement it using arrays and &lt;a href="https://en.wikipedia.org/wiki/Binary_search_algorithm"&gt;Binary Search&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Hash Functions
&lt;/h1&gt;

&lt;p&gt;Before we jump into the core Consistent Hashing technique, we first need to get a few things cleared up, one of which is Hash Functions. Hash Functions are functions that map values from an arbitrarily sized domain to another fixed-sized domain, usually called the Hash Space. For example, mapping URLs to 32-bit integers or web pages' HTML content to 256-byte strings. The values generated as the output of these hash functions are typically used as keys to enable efficient lookups of the original entity.&lt;/p&gt;

&lt;p&gt;An example of a simple hash function is one that maps a 32-bit integer into an 8-bit integer hash space. The function could be implemented using the arithmetic operator &lt;code&gt;modulo&lt;/code&gt;: taking a &lt;code&gt;modulo 256&lt;/code&gt; yields numbers in the range &lt;code&gt;[0, 255]&lt;/code&gt;, taking up 8 bits for their representation. A hash function that maps keys to such an integer domain more often than not applies &lt;code&gt;modulo N&lt;/code&gt; so as to restrict the values, or the hash space, to the range &lt;code&gt;[0, N-1]&lt;/code&gt;.&lt;/p&gt;
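&lt;p&gt;That modulo-based function is tiny enough to sketch; the following is a toy illustration, not a production-quality hash:&lt;/p&gt;

```python
def hash_32_to_8(key: int) -> int:
    # restrict a 32-bit integer to the 8-bit hash space [0, 255]
    return key % 256

# the same folding idea generalizes to modulo N for the range [0, N-1]
def hash_to_n(key: int, n: int) -> int:
    return key % n
```

&lt;p&gt;For instance, &lt;code&gt;hash_32_to_8(1000)&lt;/code&gt; yields &lt;code&gt;232&lt;/code&gt;, since 1000 mod 256 = 232.&lt;/p&gt;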

&lt;p&gt;A good hash function has the following properties&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The function is computationally efficient and the values generated are easy for lookups&lt;/li&gt;
&lt;li&gt;The function, for most general use cases, behaves like a pseudorandom generator that spreads data out evenly without any noticeable correlation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we have seen what a hash function is, we take a look into how we could use them and build a somewhat scalable distributed system.&lt;/p&gt;

&lt;h1&gt;
  
  
  Building a distributed storage system
&lt;/h1&gt;

&lt;p&gt;Say we are building a distributed storage system in which users can upload files and access them on demand. The service exposes the following APIs to the users&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;upload&lt;/code&gt; to upload the file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fetch&lt;/code&gt; to fetch the file and return its content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Behind the scenes the system has Storage Nodes on which the files are stored and accessed. These nodes expose the functions &lt;code&gt;put_file&lt;/code&gt; and &lt;code&gt;fetch_file&lt;/code&gt; that puts and gets the file content to/from the disk and sends the response to the main API server which in turn sends it back to the user.&lt;/p&gt;

&lt;p&gt;To sustain the initial load, the system has 5 Storage Nodes which store the uploaded files in a distributed manner. Having multiple nodes ensures that the system, as a whole, is not overwhelmed, and the storage is distributed almost evenly across them.&lt;/p&gt;

&lt;p&gt;When the user invokes the &lt;code&gt;upload&lt;/code&gt; function with the path of a file, the system first needs to identify the storage node that will be responsible for holding the file; we do this by applying a hash function to the path, which in turn gives us the storage node index. Once we have the storage node, we read the content of the file and put it on the node by invoking the node's &lt;code&gt;put_file&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# storage_nodes holding instances of actual storage node objects
&lt;/span&gt;&lt;span class="n"&gt;storage_nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;StorageNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'A'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'10.131.213.12'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;StorageNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'B'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'10.131.217.11'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;StorageNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'C'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'10.131.142.46'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;StorageNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'D'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'10.131.114.17'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;StorageNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'E'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'10.131.189.18'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""The function sums the bytes present in the `key` and then
    take a mod with 5. This hash function thus generates output
    in the range [0, 4].
    """&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;bytearray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# we use the hash function to get the index of the storage node
&lt;/span&gt;    &lt;span class="c1"&gt;# that would hold the file
&lt;/span&gt;    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# we get the StorageNode instance
&lt;/span&gt;    &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage_nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# we put the file on the node and return
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# we use the hash function to get the index of the storage node
&lt;/span&gt;    &lt;span class="c1"&gt;# that would hold the file
&lt;/span&gt;    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# we get the StorageNode instance
&lt;/span&gt;    &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage_nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# we fetch the file from the node and return
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fetch_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The hash function used here simply sums the bytes of the path and takes the modulo by &lt;code&gt;5&lt;/code&gt; (since there are 5 storage nodes in the system), thus generating output in the hash space &lt;code&gt;[0, 4]&lt;/code&gt;. This output value represents the index of the storage node that will be responsible for holding the file.&lt;/p&gt;

&lt;p&gt;Say we have 5 files: 'f1.txt', 'f2.txt', 'f3.txt', 'f4.txt', and 'f5.txt'. If we apply the hash function to these, we find that they are stored on storage nodes E, A, B, C, and D respectively.&lt;/p&gt;

&lt;p&gt;Things become interesting when the system gains some traction and needs to be scaled to 7 nodes, which means the hash function should now do a &lt;code&gt;mod 7&lt;/code&gt; instead of a &lt;code&gt;mod 5&lt;/code&gt;. Changing the hash function implies changing the mapping and association of files with storage nodes. We first need to determine the new associations and see which files need to be moved from one node to another.&lt;/p&gt;

&lt;p&gt;With the new hash function, the same 5 files 'f1.txt', 'f2.txt', 'f3.txt', 'f4.txt', and 'f5.txt' will now be associated with storage nodes D, E, F, G, and A respectively. Here we see that changing the hash function requires us to move every single one of the 5 files to a different node.&lt;/p&gt;
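&lt;p&gt;This remapping can be verified in a few lines of Python, using the same byte-sum hash function as the snippet above:&lt;/p&gt;

```python
# the same byte-sum hash used earlier: sum of the key's bytes mod number of nodes
def hash_fn(key, n_nodes):
    return sum(bytearray(key.encode('utf-8'))) % n_nodes

files = ['f1.txt', 'f2.txt', 'f3.txt', 'f4.txt', 'f5.txt']

# node index for each file with 5 nodes vs 7 nodes
with_5 = [hash_fn(f, 5) for f in files]   # [4, 0, 1, 2, 3] -> E, A, B, C, D
with_7 = [hash_fn(f, 7) for f in files]   # [3, 4, 5, 6, 0] -> D, E, F, G, A

# every single file lands on a different node after the resize
moved = sum(a != b for a, b in zip(with_5, with_7))
print(moved)  # 5
```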

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dxkdD2kz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/82746677-16c47480-9db0-11ea-8dea-7b5a3cb73e91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dxkdD2kz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/82746677-16c47480-9db0-11ea-8dea-7b5a3cb73e91.png" alt="File association changed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we have to change the hash function every time we scale up or down, and if doing so requires us to move not all but even half of the data, the process becomes prohibitively expensive and, in the long run, infeasible. So we need a way to minimize the data movement required during scale-ups and scale-downs, and this is exactly where Consistent Hashing fits in.&lt;/p&gt;

&lt;h1&gt;
  
  
  Consistent Hashing
&lt;/h1&gt;

&lt;p&gt;The major pain point of the above system is that it is sensitive to events like scale-ups and scale-downs, as they require a lot of alterations in associations. These associations are purely driven by the underlying Hash Function, hence if we could somehow make this hash function independent of the number of storage nodes in the system, we would address this flaw.&lt;/p&gt;

&lt;p&gt;Consistent Hashing addresses this situation by keeping the Hash Space huge and constant, somewhere in the order of &lt;code&gt;[0, 2^128 - 1]&lt;/code&gt;, and both the storage nodes and the objects map to one of the slots in this huge Hash Space. Unlike in the traditional system, where the file was associated with the storage node at the index it hashed to, in this system the chances of a collision between a file and a storage node are infinitesimally small, and hence we need a different way to define this association.&lt;/p&gt;

&lt;p&gt;Instead of using a collision-based approach, we define the association as follows: the file will be associated with the storage node present to the immediate right of its hashed location. Defining the association this way helps us&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keep the hash function independent of the number of storage nodes&lt;/li&gt;
&lt;li&gt;keep associations relative and not driven by absolute collisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UX0KLu_l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/82748149-4d54bc00-9dbd-11ea-8f06-6710a5c98f20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UX0KLu_l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/82748149-4d54bc00-9dbd-11ea-8f06-6710a5c98f20.png" alt="Associations in Consistent Hashing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Consistent Hashing on average requires only k/n units of data to be migrated during a scale up or down, where k is the total number of keys and n is the number of nodes in the system.&lt;/p&gt;
&lt;/blockquote&gt;
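&lt;p&gt;We can sanity-check this claim empirically. The sketch below (node and file names like &lt;code&gt;node-F&lt;/code&gt; are made up for illustration) builds a ring of 5 node positions, adds a sixth, and verifies that the only items whose association changed are the ones now owned by the new node:&lt;/p&gt;

```python
import hashlib
from bisect import bisect_right

def hash_fn(key, total_slots=2**32):
    # SHA-256 digest reduced into the hash space
    return int(hashlib.sha256(key.encode('utf-8')).hexdigest(), 16) % total_slots

def assign(ring, item):
    # first node position to the right of the item, wrapping around the ring
    return ring[bisect_right(ring, hash_fn(item)) % len(ring)]

ring = sorted(hash_fn(f'node-{c}') for c in 'ABCDE')
items = [f'file-{i}.txt' for i in range(1000)]

before = {it: assign(ring, it) for it in items}

# scale up: add a sixth node to the ring
new_key = hash_fn('node-F')
ring = sorted(ring + [new_key])
after = {it: assign(ring, it) for it in items}

moved = [it for it in items if before[it] != after[it]]
# every migrated item is now owned by the new node; everything else is untouched
assert all(after[it] == new_key for it in moved)
print(len(moved), 'of', len(items), 'items migrated')
```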

&lt;p&gt;A very naive way to implement this is by allocating an array of size equal to the Hash Space and literally placing files and storage nodes in the array at their hashed locations. To get the association, we iterate from the item's hashed location towards the right and find the first Storage Node. If we reach the end of the array without finding a Storage Node, we circle back to index 0 and continue the search. The approach is very easy to implement but suffers from the following limitations&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requires huge memory to hold such a large array&lt;/li&gt;
&lt;li&gt;finding association by iterating every time to the right is &lt;code&gt;O(hash_space)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
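&lt;p&gt;For illustration only, here is a toy version of that naive approach with a deliberately tiny hash space of 32 slots (the slot positions are hand-picked rather than hashed), showing the rightward scan and the wrap-around:&lt;/p&gt;

```python
TOTAL_SLOTS = 32

# a sparse array: most slots are empty, a few hold a storage node's name
slots = [None] * TOTAL_SLOTS
slots[5], slots[13], slots[27] = 'A', 'B', 'C'

def assign_naive(position):
    # scan rightwards from the hashed position; wrap to index 0 at the end
    for step in range(TOTAL_SLOTS):
        node = slots[(position + step) % TOTAL_SLOTS]
        if node is not None:
            return node
    raise Exception('no storage nodes in the hash space')

print(assign_naive(6))    # 'B'  (first node to the right of slot 6 is at slot 13)
print(assign_naive(28))   # 'A'  (wraps past slot 31 back around to slot 5)
```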

&lt;p&gt;A better way of implementing this is by using two arrays: one to hold the Storage Nodes, called &lt;code&gt;nodes&lt;/code&gt; and another one to hold the positions of the Storage Nodes in the hash space, called &lt;code&gt;keys&lt;/code&gt;. There is a one-to-one correspondence between the two arrays - the Storage Node &lt;code&gt;nodes[i]&lt;/code&gt; is present at position &lt;code&gt;keys[i]&lt;/code&gt; in the hash space. Both the arrays are kept sorted as per the &lt;code&gt;keys&lt;/code&gt; array.&lt;/p&gt;
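&lt;p&gt;The method snippets that follow (&lt;code&gt;add_node&lt;/code&gt;, &lt;code&gt;remove_node&lt;/code&gt;, &lt;code&gt;assign&lt;/code&gt;) are written against a class shaped roughly like this; the class name and the default &lt;code&gt;total_slots&lt;/code&gt; value here are assumptions for the sketch, not from the original post:&lt;/p&gt;

```python
class ConsistentHashing:
    """Keeps two parallel arrays, sorted by ring position: _keys holds
    the positions in the hash space, nodes holds the Storage Node placed
    at each position, i.e. nodes[i] lives at position _keys[i]."""

    def __init__(self, total_slots=2**256):
        self._keys = []       # positions of nodes in the hash space, kept sorted
        self.nodes = []       # nodes[i] corresponds to _keys[i]
        self.total_slots = total_slots

ring = ConsistentHashing()
print(ring.total_slots)  # the size of the hash space, constant
```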

&lt;h2&gt;
  
  
  Hash Function in Consistent Hashing
&lt;/h2&gt;

&lt;p&gt;We define &lt;code&gt;total_slots&lt;/code&gt; as the size of this entire hash space, typically of the order of &lt;code&gt;2^256&lt;/code&gt;, and the hash function could be implemented by taking &lt;a href="https://en.wikipedia.org/wiki/SHA-2"&gt;SHA-256&lt;/a&gt; followed by a &lt;code&gt;mod total_slots&lt;/code&gt;. Since &lt;code&gt;total_slots&lt;/code&gt; is huge and constant, the following hash function implementation is independent of the actual number of Storage Nodes present in the system and hence remains unaffected by events like scale-ups and scale-downs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_slots&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;"""hash_fn creates an integer equivalent of a SHA256 hash and
    takes a modulo with the total number of slots in hash space.
    """&lt;/span&gt;
    &lt;span class="n"&gt;hsh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# converting data into bytes and passing it to hash function
&lt;/span&gt;    &lt;span class="n"&gt;hsh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="c1"&gt;# converting the HEX digest into equivalent integer value
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hsh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;total_slots&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding a new node in the system
&lt;/h2&gt;

&lt;p&gt;When there is a need to scale up and add a new node to the system, in our case a new Storage Node, we&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;find the position of the node where it resides in the Hash Space&lt;/li&gt;
&lt;li&gt;populate the new node with data it is supposed to serve&lt;/li&gt;
&lt;li&gt;add the node in the Hash Space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a new node is added to the system, it only affects the files that hash between the new node's position and the position of the node preceding it; these are the files previously associated with the node lying to the new node's immediate right. All other files and associations remain unaffected, thus minimizing the amount of data to be migrated and the mappings that need to change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--04MExfKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/82751279-c959fe80-9dd3-11ea-86de-62d162519262.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--04MExfKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/82751279-c959fe80-9dd3-11ea-86de-62d162519262.png" alt="Adding a new node in the system - Consistent Hashing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the illustration above, we see that when a new node K is added between nodes B and E, we change the associations of the files present in the segment B-K and assign them to node K. The data belonging to the segment B-K could be found at node E, with which they were previously associated. Thus the only files affected, and the only files that need migration, are the ones in the segment B-K; their association changes from node E to node K.&lt;/p&gt;

&lt;p&gt;In order to implement this at a low level using the &lt;code&gt;nodes&lt;/code&gt; and &lt;code&gt;keys&lt;/code&gt; arrays, we first get the position of the new node in the Hash Space using the hash function. We then use binary search to find the index of the smallest key greater than this position in the sorted &lt;code&gt;keys&lt;/code&gt; array. This is the index at which the key and the new Storage Node will be placed in the &lt;code&gt;keys&lt;/code&gt; and &lt;code&gt;nodes&lt;/code&gt; arrays respectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StorageNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;"""add_node function adds a new node in the system and returns the key
    from the hash space where it was placed
    """&lt;/span&gt;

    &lt;span class="c1"&gt;# handling error when hash space is full.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_slots&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hash space is full"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_slots&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# find the index where the key should be inserted in the keys array
&lt;/span&gt;    &lt;span class="c1"&gt;# this will be the index where the Storage Node will be added in the
&lt;/span&gt;    &lt;span class="c1"&gt;# nodes array.
&lt;/span&gt;    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bisect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# if we have already seen the key i.e. node already is present
&lt;/span&gt;    &lt;span class="c1"&gt;# for the same key, we raise Collision Exception
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"collision occurred"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform data migration
&lt;/span&gt;
    &lt;span class="c1"&gt;# insert the node_id and the key at the same `index` location.
&lt;/span&gt;    &lt;span class="c1"&gt;# this insertion will keep nodes and keys sorted w.r.t keys.
&lt;/span&gt;    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Removing a node from the system
&lt;/h2&gt;

&lt;p&gt;When there is a need to scale down and remove an existing node from the system, we&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;find the position of the node to be removed from the Hash Space&lt;/li&gt;
&lt;li&gt;populate the node to the right with data that was associated with the node to be removed&lt;/li&gt;
&lt;li&gt;remove the node from the Hash Space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a node is removed from the system, it only affects the files associated with the node itself. All other files and associations remain unaffected, thus minimizing the amount of data to be migrated and the mappings that need to change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fYERFLew--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/82751261-b0e9e400-9dd3-11ea-81ee-3fd3f0187857.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fYERFLew--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/82751261-b0e9e400-9dd3-11ea-81ee-3fd3f0187857.png" alt="Removing a new node from the system - Consistent Hashing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the illustration above, we see that when node K is removed from the system, we change the associations of the files associated with node K to the node that lies to its immediate right, i.e. node E. Thus the only files affected, and the only files that need migration, are the ones associated with node K.&lt;/p&gt;

&lt;p&gt;In order to implement this at a low level using the &lt;code&gt;nodes&lt;/code&gt; and &lt;code&gt;keys&lt;/code&gt; arrays, we get the index where node K lies in the &lt;code&gt;keys&lt;/code&gt; array using binary search. Once we have the index, we remove the key from the &lt;code&gt;keys&lt;/code&gt; array and the Storage Node from the &lt;code&gt;nodes&lt;/code&gt; array at that index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StorageNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;"""remove_node removes the node and returns the key
    from the hash space on which the node was placed.
    """&lt;/span&gt;

    &lt;span class="c1"&gt;# handling error when space is empty
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hash space is empty"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_slots&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# we find the index where the key would reside in the keys
&lt;/span&gt;    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bisect_left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# if key does not exist in the array we raise Exception
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"node does not exist"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform data migration
&lt;/span&gt;
    &lt;span class="c1"&gt;# now that all sanity checks are done we popping the
&lt;/span&gt;    &lt;span class="c1"&gt;# keys and nodes at the index and thus removing the presence of the node.
&lt;/span&gt;    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Associating an item to a node
&lt;/h2&gt;

&lt;p&gt;Now that we have seen how consistent hashing helps keep data migration during scale-ups and scale-downs to a bare minimum, it is time we see how efficiently we can find the "node to the right" for a given item. The operation to find the association has to be super fast and efficient, as it is invoked for every single read and write that happens on the system.&lt;/p&gt;

&lt;p&gt;To implement this at a low level, we again leverage binary search and perform this operation in &lt;code&gt;O(log(n))&lt;/code&gt;. We first pass the item to the hash function and fetch the position where the item hashes in the hash space. This position is then binary searched in the &lt;code&gt;keys&lt;/code&gt; array to obtain the index of the first key greater than the position. If there are no keys greater than the position in the &lt;code&gt;keys&lt;/code&gt; array, we circle back and return the 0th index. The index thus obtained is the index of the storage node in the &lt;code&gt;nodes&lt;/code&gt; array associated with the item.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;"""Given an item, the function returns the node it is associated with.
    """&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_slots&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# we find the first node to the right of this key
&lt;/span&gt;    &lt;span class="c1"&gt;# if bisect_right returns index which is out of bounds then
&lt;/span&gt;    &lt;span class="c1"&gt;# we circle back to the first in the array in a circular fashion.
&lt;/span&gt;    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bisect_right&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# return the node present at the index
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
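&lt;p&gt;A condensed standalone sketch (with hand-picked ring positions instead of SHA-256 hashes, purely for readability) shows the circling-back behaviour of &lt;code&gt;assign&lt;/code&gt;:&lt;/p&gt;

```python
from bisect import bisect_right

# a ring with three nodes at hand-picked positions in a 100-slot hash space
keys = [10, 40, 75]           # sorted positions of the nodes in the hash space
nodes = ['A', 'B', 'C']       # nodes[i] sits at position keys[i]

def assign(position):
    # first node to the right of the position, wrapping back to index 0
    return nodes[bisect_right(keys, position) % len(keys)]

print(assign(12))   # 'B'  -> first key to the right of 12 is 40
print(assign(80))   # 'A'  -> nothing to the right of 80, circle back to key 10
print(assign(40))   # 'C'  -> bisect_right places position 40 after key 40
```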



&lt;p&gt;The source code with the implementation of Consistent Hashing in Python could be found at &lt;a href="https://github.com/arpitbbhayani/consistent-hashing/blob/master/consistent-hashing.ipynb"&gt;github.com/arpitbbhayani/consistent-hashing&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Consistent Hashing is one of the most important algorithms helping us horizontally scale and manage any distributed system. The algorithm not only works in sharded systems but also finds application in load balancing, data partitioning, managing server-based sticky sessions, routing algorithms, and much more. A lot of databases owe their scale, performance, and ability to handle humongous load to Consistent Hashing.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Hash_function"&gt;Hash Functions - Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Consistent_hashing"&gt;Consistent Hashing - Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.stanford.edu/class/cs168/l/l1.pdf"&gt;Consistent Hashing - Stanford&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.akamai.com/us/en/multimedia/documents/technical-publication/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf"&gt;Consistent Hashing and RandomTrees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf"&gt;Dynamo: Amazon’s Highly Available Key-value Store&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Other articles you might like:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/python-caches-integers"&gt;Python Caches Integers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/fractional-cascading"&gt;Fractional Cascading - Speeding up Binary Searches&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/copy-on-write"&gt;Copy-on-Write Semantics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/mysql-cache"&gt;What makes MySQL LRU cache scan resistant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/fsm"&gt;Building Finite State Machines with Python Coroutines&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on my &lt;a href="https://arpitbhayani.me/blogs/consistent-hashing"&gt;blog - Consistent Hashing&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you liked what you read, subscribe to my newsletter and get the post delivered directly to your inbox and give me a shout-out &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arpit.substack.com"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lIXNVWfY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81502081-6eb3a380-92f9-11ea-8039-fe665d145b2d.png" alt="Subscribe to Arpit's newsletter"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Python Caches Integers</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 17 May 2020 17:35:00 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/python-caches-integers-3chh</link>
      <guid>https://dev.to/arpit_bhayani/python-caches-integers-3chh</guid>
      <description>&lt;p&gt;An integer in Python is not a traditional 2, 4, or 8-byte implementation but rather it is implemented as an array of digits in base 2^30 which enables Python to support &lt;a href="https://arpitbhayani.me/blogs/super-long-integers"&gt;super long integers&lt;/a&gt;. Since there is no explicit limit on the size, working with integers in Python is extremely convenient as we can carry out operations on very long numbers without worrying about integer overflows. This convenience comes at a cost of allocation being expensive and trivial operations like addition, multiplication, division being inefficient.&lt;/p&gt;

&lt;p&gt;Each integer in Python is implemented as the C structure illustrated below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;_longobject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;Py_ssize_t&lt;/span&gt;    &lt;span class="n"&gt;ob_refcnt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// &amp;lt;--- holds reference count&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;Py_ssize_t&lt;/span&gt;    &lt;span class="n"&gt;ob_size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;// &amp;lt;--- holds number of digits&lt;/span&gt;
    &lt;span class="n"&gt;digit&lt;/span&gt;         &lt;span class="n"&gt;ob_digit&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;    &lt;span class="c1"&gt;// &amp;lt;--- holds the digits in base 2^30&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
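&lt;p&gt;The growing &lt;code&gt;ob_digit&lt;/code&gt; array can be observed from pure Python with &lt;code&gt;sys.getsizeof&lt;/code&gt;. The exact byte counts below vary across CPython versions and platforms, so treat this as a sketch of the trend rather than fixed numbers.&lt;/p&gt;

```python
import sys

# each value below needs progressively more base-2^30 digits, so the
# object size reported by sys.getsizeof grows in steps of one digit's width
for n in [0, 2**10, 2**40, 2**70, 2**100]:
    print(n.bit_length(), sys.getsizeof(n))

# the exact byte counts differ across CPython versions, but the
# ordering always holds: more digits means a bigger object
assert sys.getsizeof(2**100) > sys.getsizeof(2**10)
```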



&lt;p&gt;Smaller integers, in the range -5 to 256, are used far more frequently than larger ones. To gain a performance benefit, Python preallocates this range of integers during initialization and makes them singletons; every time a small integer value is referenced, instead of allocating a new integer, Python passes a reference to the corresponding singleton.&lt;/p&gt;

&lt;p&gt;Here is what &lt;a href="https://docs.python.org/3/c-api/long.html#c.PyLong_FromLong"&gt;Python's official documentation&lt;/a&gt; says about this preallocation&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The current implementation keeps an array of integer objects for all integers between -5 and 256. When you create an int in that range you actually just get back a reference to the existing object.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In CPython's &lt;a href="https://github.com/python/cpython/"&gt;source code&lt;/a&gt;, this optimization can be traced to the macro &lt;code&gt;IS_SMALL_INT&lt;/code&gt; and the function &lt;a href="https://github.com/python/cpython/blob/master/Objects/longobject.c#L40"&gt;&lt;code&gt;get_small_int&lt;/code&gt;&lt;/a&gt; in &lt;a href="https://github.com/python/cpython/blob/master/Objects/longobject.c"&gt;longobject.c&lt;/a&gt;. This way Python saves a lot of space and computation for commonly used integers.&lt;/p&gt;

&lt;h1&gt;
  
  
  Verifying that smaller integers are indeed singletons
&lt;/h1&gt;

&lt;p&gt;For the CPython implementation, the built-in &lt;a href="https://docs.python.org/3/library/functions.html#id"&gt;&lt;code&gt;id&lt;/code&gt; function&lt;/a&gt; returns the address of the object in memory. If the smaller integers are indeed singletons, the &lt;code&gt;id&lt;/code&gt; function should return the same memory address for two instances of the same small value, while multiple instances of larger values should return different addresses. This is indeed what we observe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;36&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="bp"&gt;True&lt;/span&gt;


&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;257&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;257&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The singletons can also be seen in action during computations. In the example below, we reach the same target value &lt;code&gt;6&lt;/code&gt; by performing two operations on three different numbers, 2, 4, and 10, and the &lt;code&gt;id&lt;/code&gt; function returns the same memory reference in both cases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h1&gt;
  
  
  Verifying that these integers are indeed referenced often
&lt;/h1&gt;

&lt;p&gt;We have established that Python consumes smaller integers through their corresponding singleton instances, without reallocating them every time. Now we verify the hypothesis that Python saves a bunch of allocations during its initialization through these singletons, by checking the reference count of each of the integer values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference Counts
&lt;/h2&gt;

&lt;p&gt;The reference count holds the number of places that currently hold a reference to the object. Every time an object is referenced, the &lt;code&gt;ob_refcnt&lt;/code&gt; field in its structure is increased by &lt;code&gt;1&lt;/code&gt;, and when it is dereferenced, the count is decreased by &lt;code&gt;1&lt;/code&gt;. When the reference count becomes &lt;code&gt;0&lt;/code&gt;, the object is garbage collected.&lt;/p&gt;

&lt;p&gt;In order to get the current reference count of an object, we use the function &lt;code&gt;getrefcount&lt;/code&gt; from the &lt;code&gt;sys&lt;/code&gt; module.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ref_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getrefcount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;11&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
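&lt;p&gt;The increment and decrement behavior described above can be observed directly. This is a sketch using a list, whose reference count is easier to track than that of the shared integer singletons (note that the &lt;code&gt;getrefcount&lt;/code&gt; call itself temporarily adds one reference).&lt;/p&gt;

```python
import sys

a = []                   # a fresh list, referenced only by the name "a"
r0 = sys.getrefcount(a)  # the call itself adds one temporary reference

b = a                    # binding another name bumps ob_refcnt by 1
r1 = sys.getrefcount(a)

del b                    # dropping the name decrements it again
r2 = sys.getrefcount(a)

print(r0, r1, r2)        # r1 is exactly r0 + 1, and r2 equals r0
```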



&lt;p&gt;When we do this for all the integers in the range -5 to 300, we get the following distribution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zkftF-mC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/82141240-1e38ca80-9852-11ea-8133-fd8e1b26fc01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zkftF-mC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/82141240-1e38ca80-9852-11ea-8133-fd8e1b26fc01.png" alt="Reference counts of interger values"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above graph shows that the reference counts of smaller integer values are high, indicating heavy usage, and that they decrease as the value increases. This asserts that many more objects reference smaller integer values than larger ones during Python's initialization.&lt;/p&gt;

&lt;p&gt;The value &lt;code&gt;0&lt;/code&gt; is referenced the most, &lt;code&gt;359&lt;/code&gt; times, while along the long tail we see spikes in reference counts at powers of 2, i.e. 32, 64, 128, and 256. Python itself requires small integer values during initialization, and by creating singletons it saves about &lt;code&gt;1993&lt;/code&gt; allocations.&lt;/p&gt;

&lt;p&gt;The reference counts were computed on a freshly started Python interpreter, which means that the interpreter itself requires some integers for computations during initialization, and these are facilitated by the singleton instances of the smaller values.&lt;/p&gt;
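&lt;p&gt;A sketch of how such a distribution can be collected is shown below. Note that on CPython 3.12 and later the small integers are immortal, so &lt;code&gt;getrefcount&lt;/code&gt; reports a large sentinel value for them instead of a true count; the shape of the distribution described here was observed on older interpreters.&lt;/p&gt;

```python
import sys

# reference count of every value in the preallocated range and a bit beyond;
# on CPython 3.12+ small ints are immortal and report a huge sentinel count
counts = {i: sys.getrefcount(i) for i in range(-5, 301)}

# the most-referenced value in the range
most_referenced = max(counts, key=counts.get)
print(most_referenced, counts[most_referenced])
```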

&lt;p&gt;In usual programming, smaller integer values are accessed much more frequently than larger ones; having singleton instances of these saves Python a bunch of computation and allocations.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/c-api/intro.html#objects-types-and-reference-counts"&gt;Python Object Types and Reference Counts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/super-long-integers"&gt;How python implements super-long integers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/"&gt;Why Python is Slow: Looking Under the Hood&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Other articles you may like
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/fractional-cascading"&gt;Fractional Cascading - Speeding up Binary Searches&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/copy-on-write"&gt;Copy-on-Write Semantics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/mysql-cache"&gt;What makes MySQL LRU cache scan resistant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/fsm"&gt;Building Finite State Machines with Python Coroutines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/bayesian-average"&gt;Solving an age-old problem using Bayesian Average&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on my &lt;a href="https://arpitbhayani.me/blogs/python-caches-integers"&gt;blog - Python Caches Integers&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you liked what you read, subscribe to my newsletter and get the post delivered directly to your inbox and give me a shout-out &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arpit.substack.com"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lIXNVWfY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81502081-6eb3a380-92f9-11ea-8039-fe665d145b2d.png" alt="Subscribe to Arpit's newsletter"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Fractional Cascading - Speeding up Binary Searches</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Mon, 11 May 2020 03:33:30 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/fractional-cascading-speeding-up-binary-searches-4cma</link>
      <guid>https://dev.to/arpit_bhayani/fractional-cascading-speeding-up-binary-searches-4cma</guid>
      <description>&lt;p&gt;Binary Search is an algorithm that finds the position of a target value in a sorted list. The algorithm exploits the fact that the list is sorted, and is devised such that is does not have to even look at all the &lt;code&gt;n&lt;/code&gt; elements, to decide if a value is present or not. In the worst case, the algorithm checks the &lt;code&gt;log(n)&lt;/code&gt; number of elements to make the decision.&lt;/p&gt;

&lt;p&gt;Binary Search can be tweaked to output the position of the target value, or to return the position of the smallest number greater than the target value, i.e. the position where the target value should have been present in the list.&lt;/p&gt;

&lt;p&gt;Things become more interesting when we have to perform an iterative binary search on &lt;code&gt;k&lt;/code&gt; lists, in which we find the target value in each of the &lt;code&gt;k&lt;/code&gt; lists independently. The problem statement can be formally defined as&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given &lt;code&gt;k&lt;/code&gt; lists of &lt;code&gt;n&lt;/code&gt; sorted integers each, and a target value &lt;code&gt;x&lt;/code&gt;, return the position of the smallest value greater than or equal to &lt;code&gt;x&lt;/code&gt; in each of the &lt;code&gt;k&lt;/code&gt; lists. Preprocessing of the list is allowed before answering the queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  The naive approach - k binary searches
&lt;/h1&gt;

&lt;p&gt;The expected output of this iterative search is the position of the smallest value greater than or equal to &lt;code&gt;x&lt;/code&gt; in each of the &lt;code&gt;k&lt;/code&gt; lists. This is a classical Binary Search problem and hence in this approach, we fire &lt;code&gt;k&lt;/code&gt; binary searches on &lt;code&gt;k&lt;/code&gt; lists for the target value &lt;code&gt;x&lt;/code&gt; and collect the positions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6rRf9Yxr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81492614-dbf21500-92b6-11ea-9f75-29eb3522186f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6rRf9Yxr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81492614-dbf21500-92b6-11ea-9f75-29eb3522186f.png" alt="k-binary searches"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python has an in-built module called &lt;code&gt;bisect&lt;/code&gt; whose function &lt;code&gt;bisect_left&lt;/code&gt; outputs the position of the smallest value greater than or equal to &lt;code&gt;x&lt;/code&gt; in a list, which is exactly what we need to output. A Python solution using this k-binary-searches approach could be&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;bisect&lt;/span&gt;

&lt;span class="n"&gt;arr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;93&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;46&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;47&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;62&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;66&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;94&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;46&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;83&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_positions_k_bin_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bisect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bisect_left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;get_positions_k_bin_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Time and Space Complexity
&lt;/h2&gt;

&lt;p&gt;Each of the &lt;code&gt;k&lt;/code&gt; lists has size &lt;code&gt;n&lt;/code&gt;, and we know the time complexity of performing a binary search on one list of &lt;code&gt;n&lt;/code&gt; elements is &lt;code&gt;O(log(n))&lt;/code&gt;. Hence the time complexity of this k-binary-searches approach is &lt;code&gt;O(klog(n))&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This approach does not really require any additional space and hence the space complexity is &lt;code&gt;O(1)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The k-binary-searches approach is thus super-efficient on space but not so much on time. Hence, by trading some space we could reap some benefit on time, and the unified binary search approach is based on this exact principle.&lt;/p&gt;

&lt;h1&gt;
  
  
  Unified binary search
&lt;/h1&gt;

&lt;p&gt;This approach uses some extra space, preprocessing, and computation to reduce the search time. The preprocessing involves precomputing the positions of all elements in all the &lt;code&gt;k&lt;/code&gt; lists. This precomputation enables us to perform just one binary search and get the required precalculated positions in one go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preprocess
&lt;/h2&gt;

&lt;p&gt;The preprocessing is done in two phases. In the first phase, we compute a position tuple for each element and associate it with that element. In the second phase, we create an auxiliary list containing all the elements of all the lists, on which we then perform a binary search for the given target value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Computing position tuple for each element
&lt;/h3&gt;

&lt;p&gt;A position tuple is a &lt;code&gt;k&lt;/code&gt;-item tuple in which the &lt;code&gt;i&lt;/code&gt;th item denotes the position of the associated element in the &lt;code&gt;i&lt;/code&gt;th list. We compute this tuple by performing a binary search on all &lt;code&gt;k&lt;/code&gt; lists, treating the element as the target value.&lt;/p&gt;

&lt;p&gt;From the example above, the position tuple of the 4th element in the 4th list, i.e. 79, will be &lt;code&gt;[3, 5, 4, 3]&lt;/code&gt;, which denotes its position in all 4 lists. In list 1, 79 is at index &lt;code&gt;3&lt;/code&gt;. In list 2, 79 is out of bounds but would be inserted at index &lt;code&gt;5&lt;/code&gt;, hence the output &lt;code&gt;5&lt;/code&gt; (we could also have returned a value marking out of bounds, like &lt;code&gt;-2&lt;/code&gt;). In list 3, 79 is not present, but the smallest number greater than 79 is 94, which is at index &lt;code&gt;4&lt;/code&gt;. In list 4, 79 is present at index &lt;code&gt;3&lt;/code&gt;. This makes the position tuple for 79 &lt;code&gt;[3, 5, 4, 3]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Given a 2-dimensional array &lt;code&gt;arr&lt;/code&gt;, we compute the position tuple for an element &lt;code&gt;(i, j)&lt;/code&gt; by performing a binary search on all &lt;code&gt;k&lt;/code&gt; lists, as shown in the Python code below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;positions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bisect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bisect_left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating a huge list
&lt;/h3&gt;

&lt;p&gt;Once we have all the position tuples and they are well associated with the corresponding elements, we create an auxiliary list of size &lt;code&gt;k * n&lt;/code&gt; that holds all the elements from all the &lt;code&gt;k&lt;/code&gt; lists. This auxiliary list is kept sorted so that we can perform a binary search on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Working
&lt;/h2&gt;

&lt;p&gt;Given a target value, we perform a binary search in the above auxiliary list and get the smallest element greater than or equal to this target value. Once we get the element, we now get the associated position tuple. This position tuple is precisely the position of the target element in all the &lt;code&gt;k&lt;/code&gt; lists. Thus by performing one binary search in this huge list, we are able to get the required positions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MMdVQZRd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81492609-ca107200-92b6-11ea-8fdf-999852f4d9b1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MMdVQZRd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81492609-ca107200-92b6-11ea-8fdf-999852f4d9b1.png" alt="unified binary search"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Complexity
&lt;/h2&gt;

&lt;p&gt;We are performing binary search just once on the list of size &lt;code&gt;k * n&lt;/code&gt; hence, the time complexity of this approach is &lt;code&gt;O(log(kn))&lt;/code&gt; which is a huge improvement over the k-binary searches approach where it was &lt;code&gt;O(klog(n))&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This approach, unlike k-binary searches, requires additional space of &lt;code&gt;O(k^2n)&lt;/code&gt;, since each element holds a &lt;code&gt;k&lt;/code&gt;-item position tuple and there are &lt;code&gt;k * n&lt;/code&gt; elements in all.&lt;/p&gt;

&lt;p&gt;Fractional cascading is something that gives us the best of both worlds by creating bridges between the lists and narrowing the scope of binary searches on subsequent iterations. Let's find out how.&lt;/p&gt;

&lt;h1&gt;
  
  
  Fractional Cascading
&lt;/h1&gt;

&lt;p&gt;Fractional cascading is a technique through which we speed up the iterative binary searches by creating bridges between the lists. The main idea behind this approach is to reduce the need to perform full binary searches on subsequent lists after performing the search on one.&lt;/p&gt;

&lt;p&gt;In the k-binary-searches approach, we solved the problem by performing &lt;code&gt;k&lt;/code&gt; binary searches on &lt;code&gt;k&lt;/code&gt; lists. If, after the binary search on the first list, we knew a range within which the target value must lie in the 2nd list, we could limit our search to that subset and save a bunch of computation time. The bridges, defined above, provide us with a shortcut to reach the subset of the next list where the target value would be present.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dEqx2UmV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81495324-241c3200-92cd-11ea-9d7d-9c9b0911071b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dEqx2UmV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81495324-241c3200-92cd-11ea-9d7d-9c9b0911071b.png" alt="Fractional Cascading the Idea"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fractional cascading is just an idea through which we could speed up binary searches, implementations vary with respect to the underlying data. The bridges could be implemented using pointers, graphs, or array indexes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preprocess
&lt;/h2&gt;

&lt;p&gt;Preprocessing is a super-critical step in fractional cascading because it is responsible for speeding up the iterative binary searches. Preprocessing sets up the bridges from every element of one list to the range of items in the list on the level below where that element could be found. These bridges then cascade through all the lists on the lower levels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create Auxiliary Lists
&lt;/h3&gt;

&lt;p&gt;The first step in preprocessing is to create &lt;code&gt;k&lt;/code&gt; auxiliary lists from the &lt;code&gt;k&lt;/code&gt; original lists. These lists are created bottom-up, which means lists on the lower levels are created first - &lt;code&gt;M(i+1)&lt;/code&gt; is created before &lt;code&gt;M(i)&lt;/code&gt;. An auxiliary list &lt;code&gt;M(i)&lt;/code&gt; is created as a sorted list of the elements of the original list &lt;code&gt;L(i)&lt;/code&gt; and half of the elements of the previously created auxiliary list &lt;code&gt;M(i+1)&lt;/code&gt;. That half is chosen by picking every other element from &lt;code&gt;M(i+1)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K7_TALO_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81494077-8112ea80-92c3-11ea-9416-bb2422334744.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K7_TALO_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81494077-8112ea80-92c3-11ea-9416-bb2422334744.png" alt="Create Auxiliary Lists"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By picking every other element from the lower-level list, we fill the gaps in the value ranges of the original list &lt;code&gt;L(i)&lt;/code&gt;, giving us a uniform spread of values across all auxiliary lists. Another advantage of picking every other element is that we no longer need to perform a binary search on the subsequent lists at all. We only perform a binary search on the list &lt;code&gt;M(0)&lt;/code&gt;; for every other list, we only need to check the element we reach via the bridge and the element before it - a constant-time comparison.&lt;/p&gt;
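&lt;p&gt;As a rough sketch of this construction (the function name &lt;code&gt;build_auxiliary_lists&lt;/code&gt; is mine, and promoting the even-indexed elements is just one possible convention), the bottom-up creation of the auxiliary lists could look like:&lt;/p&gt;

```python
def build_auxiliary_lists(lists):
    """Build the auxiliary lists bottom-up: M(k-1) is L(k-1) itself, and
    each M(i) merges L(i) with every other element promoted from M(i+1)."""
    k = len(lists)
    m = [None] * k
    m[k - 1] = sorted(lists[k - 1])
    for i in range(k - 2, -1, -1):
        promoted = m[i + 1][::2]          # every other element from the level below
        m[i] = sorted(lists[i] + promoted)
    return m

lists = [[1, 4, 9], [2, 4, 7, 11], [3, 5, 8]]
m = build_auxiliary_lists(lists)
# m[2] == [3, 5, 8]
# m[1] == [2, 3, 4, 7, 8, 11]   (3 and 8 promoted from m[2])
# m[0] == [1, 2, 4, 4, 8, 9]    (2, 4, 8 promoted from m[1])
```

&lt;p&gt;Note how values from the lowest list percolate upward, so searching &lt;code&gt;M(0)&lt;/code&gt; already narrows the search in every level below.&lt;/p&gt;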

&lt;h3&gt;
  
  
  Position tuples
&lt;/h3&gt;

&lt;p&gt;A position tuple in Fractional Cascading is a 2-item tuple associated with each element of an auxiliary list. The first item is the position of the element in the original list on the same level - the position we actually want to report. The second item is the position of the element in the auxiliary list one level below - the bridge from one level to the next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_tk8nQxT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81494709-92122a80-92c8-11ea-89c0-e180a735eb2d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_tk8nQxT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81494709-92122a80-92c8-11ea-89c0-e180a735eb2d.png" alt="Create position pointerss"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The position tuple for each element in the auxiliary list can be created by doing a binary search on the original list and on the auxiliary list one level below. Given a 2-dimensional array &lt;code&gt;arr&lt;/code&gt; and auxiliary lists &lt;code&gt;m_arr&lt;/code&gt;, we compute the position tuple for element &lt;code&gt;(i, j)&lt;/code&gt; by performing binary searches across all &lt;code&gt;k&lt;/code&gt; original and auxiliary lists, as shown in the Python code below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_arr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;bisect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bisect_left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;m_arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="n"&gt;bisect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bisect_left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;m_arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Fractional Cascading in action
&lt;/h2&gt;

&lt;p&gt;We start by performing a binary search on the first auxiliary list &lt;code&gt;M(0)&lt;/code&gt;, which gives us the element corresponding to the target value. The position tuple of this element contains the position in the original list &lt;code&gt;L(0)&lt;/code&gt; and the bridge that takes us to the list &lt;code&gt;M(1)&lt;/code&gt;. We then move to the list &lt;code&gt;M(1)&lt;/code&gt; through the bridge and reach some index &lt;code&gt;b&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Since the auxiliary lists have a uniform spread of values, because every other element is promoted, we only need to check the target value against the elements at indexes &lt;code&gt;b&lt;/code&gt; and &lt;code&gt;b - 1&lt;/code&gt;; had the target belonged anywhere lower, that value would have been promoted and bridged differently, and the trail we trace would not have led us here.&lt;/p&gt;

&lt;p&gt;Once we know whether to pick index &lt;code&gt;b&lt;/code&gt; or &lt;code&gt;b - 1&lt;/code&gt; (depending on the values at those indexes and the target value), we add the first item of that position tuple to the solution set, move to the auxiliary list on the next lower level, and the entire process continues.&lt;/p&gt;

&lt;p&gt;Once we reach the last auxiliary list, process the position tuple there, and pick the element, our solution set contains all the required positions and we can stop the iteration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_locations_fractional_cascading&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# the first and only binary search on the auxiliary list M[0]
&lt;/span&gt;    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bisect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bisect_left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# loc always holds the required location from the original list on same level
&lt;/span&gt;    &lt;span class="c1"&gt;# next_loc holds the bridge index on the lower level
&lt;/span&gt;    &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# adding loc to the solution
&lt;/span&gt;    &lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_arr&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="c1"&gt;# we check for the element we reach through the bridge
&lt;/span&gt;        &lt;span class="c1"&gt;# and the one before it and make the decision to go with one
&lt;/span&gt;        &lt;span class="c1"&gt;# depending on the target value.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;m_arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;next_loc&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;next_loc&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;next_loc&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# adding loc to the solution
&lt;/span&gt;        &lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# returning the required locations
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The entire working code can be found here: &lt;a href="https://github.com/arpitbbhayani/fractional-cascading/blob/master/fractional-cascading.ipynb"&gt;github.com/arpitbbhayani/fractional-cascading&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Time and space complexity
&lt;/h2&gt;

&lt;p&gt;In Fractional Cascading, we perform a binary search only once, on the auxiliary list &lt;code&gt;M(0)&lt;/code&gt;, and then make a constant number of comparisons at each of the remaining levels; hence the time complexity is &lt;code&gt;O(k + log(n))&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The auxiliary lists contain, at most, all the elements of the original lists plus the promoted elements, which add up to &lt;code&gt;1/2 |L(n)| + 1/4 |L(n-1)| + 1/8 |L(n-2)| + ...&lt;/code&gt; - fewer than all the elements of the original lists combined. Thus the total size of the auxiliary lists cannot exceed twice the total size of the original lists. The position tuple for each element is a constant 2-item tuple, so the space complexity of Fractional Cascading is &lt;code&gt;O(kn)&lt;/code&gt;.&lt;/p&gt;
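&lt;p&gt;The geometric series above can be sanity-checked numerically. This small sketch (the function name is mine) computes the auxiliary-list sizes for &lt;code&gt;k&lt;/code&gt; lists and verifies the twice-the-input bound:&lt;/p&gt;

```python
def auxiliary_sizes(sizes):
    """|M(k-1)| = |L(k-1)|; |M(i)| = |L(i)| + |M(i+1)| // 2, since every
    other element of the level below is promoted upward."""
    m = [0] * len(sizes)
    m[-1] = sizes[-1]
    for i in range(len(sizes) - 2, -1, -1):
        m[i] = sizes[i] + m[i + 1] // 2
    return m

orig = [1000] * 8                        # 8 original lists, 1000 elements each
aux = auxiliary_sizes(orig)
assert all(s < 2 * 1000 for s in aux)    # no auxiliary list doubles its original
assert sum(aux) < 2 * sum(orig)          # total stays under twice the input
```

&lt;p&gt;Each level absorbs at most half of the level below it, so the blow-up converges instead of compounding.&lt;/p&gt;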

&lt;p&gt;Thus Fractional Cascading has a time complexity very close to that of the unified binary search approach with a much lower space complexity, giving us the best of both worlds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fractional Cascading in the real world
&lt;/h2&gt;

&lt;p&gt;Fractional Cascading is used in &lt;a href="http://pages.cs.wisc.edu/~yinan/fdtree.html"&gt;FD-Trees&lt;/a&gt; which are used in databases to address the asymmetry of read-write speeds in tree indexing on the flash disk. Fractional cascading is typically used in &lt;a href="https://en.wikipedia.org/wiki/Range_searching"&gt;range search&lt;/a&gt; data structures like &lt;a href="https://en.wikipedia.org/wiki/Segment_tree"&gt;Segment Trees&lt;/a&gt; to speed up lookups and filters.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Fractional_cascading"&gt;Fractional Cascading - Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cs.princeton.edu/~chazelle/pubs/FractionalCascading1.pdf"&gt;Fractional Cascading - Original Paper by Bernard Chazelle and Leonidas Guibas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.cse.iitd.ernet.in/~ssen/journals/frac.pdf"&gt;Fractional Cascading Revisited&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://cs.brown.edu/courses/cs252/misc/resources/lectures/pdf/notes08.pdf"&gt;Fractional Cascading - Brown University&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Other articles you might like:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/copy-on-write"&gt;Copy-on-Write Semantics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/mysql-cache"&gt;What makes MySQL LRU cache scan resistant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/fsm"&gt;Building Finite State Machines with Python Coroutines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/bayesian-average"&gt;Solving an age-old problem using Bayesian Average &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/rule-30"&gt;Pseudorandom numbers using Cellular Automata - Rule 30&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on my &lt;a href="https://arpitbhayani.me/blogs/fractional-cascading"&gt;blog - Fractional Cascading - Speeding up Binary Searches&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you liked what you read, subscribe to my newsletter and get the post delivered directly to your inbox and give me a shout-out &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arpit.substack.com"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lIXNVWfY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/81502081-6eb3a380-92f9-11ea-8039-fe665d145b2d.png" alt="Subscribe to Arpit's newsletter"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>algorithm</category>
    </item>
    <item>
      <title>Copy-on-Write Semantics</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 03 May 2020 12:56:18 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/copy-on-write-semantics-opj</link>
      <guid>https://dev.to/arpit_bhayani/copy-on-write-semantics-opj</guid>
      <description>&lt;p&gt;Copy-On-Write, abbreviately referred to as CoW suggests deferring the copy process until the first modification. A resource is usually copied when we do not want the changes made in the either to be visible to the other. A resource here could be anything - an in-memory page, a database disk block, an item in a structure, or even the entire data structure.&lt;/p&gt;

&lt;p&gt;CoW suggests that we first copy by reference and let both instances share the same resource and just before the first modification we clone the original resource and then apply the updates.&lt;/p&gt;
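&lt;p&gt;As a minimal sketch of these semantics (the &lt;code&gt;CowList&lt;/code&gt; class and its methods are illustrative, not a standard API), a list that is shared on copy and deep-copied only on the first write could look like:&lt;/p&gt;

```python
import copy

class CowList:
    """A list that is copied by reference and deep-copied only on the
    first write (illustrative sketch, not a standard API)."""
    def __init__(self, data):
        self._data = data
        self._owns = False        # becomes True once we hold a private copy

    def clone(self):
        # copy-by-reference: both instances share the same underlying list
        return CowList(self._data)

    def get(self, i):
        return self._data[i]

    def set(self, i, value):
        if not self._owns:
            self._data = copy.deepcopy(self._data)  # the deferred deep copy
            self._owns = True
        self._data[i] = value

a = CowList([1, 2, 3])
b = a.clone()             # instant: no data is copied yet
b.set(0, 99)              # first write triggers the deep copy; a is untouched
```

&lt;p&gt;Until the first &lt;code&gt;set&lt;/code&gt;, both instances read from the same memory; only a writer pays the cloning cost.&lt;/p&gt;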

&lt;h1&gt;
  
  
  Deep copying
&lt;/h1&gt;

&lt;p&gt;The process of creating a pure clone of a resource is called &lt;a href="https://en.wikipedia.org/wiki/Object_copying#Deep_copy"&gt;Deep Copying&lt;/a&gt;, and it copies not only the immediate content but also all the remote resources referenced within it. Thus if we were to deep copy a &lt;a href="https://en.wikipedia.org/wiki/Linked_list"&gt;Linked List&lt;/a&gt;, we would not just copy the head pointer; rather, we would clone all the nodes of the list and create an entirely new list from the original one. A C function for deep copying a Linked List is illustrated below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;node&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;nhead&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;calloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;node&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;nhead&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nhead&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;calloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;node&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;nhead&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qwO6uSQ6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80907205-76d87580-8d32-11ea-88a8-153a94d92d72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qwO6uSQ6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80907205-76d87580-8d32-11ea-88a8-153a94d92d72.png" alt="deep copying a linked list"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going by these details, we understand that deep copying is a very memory- and compute-intensive operation, and hence we try not to do it very often.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Copy-on-Write
&lt;/h1&gt;

&lt;p&gt;Copy-on-Write, as established earlier, suggests we defer the copy operation until the first modification is requested. The approach suits best when traversal and access operations vastly outnumber mutations. CoW has a number of advantages, some of which are discussed below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Perceived performance gain
&lt;/h2&gt;

&lt;p&gt;With CoW, the process need not wait for the deep copy to happen; it can proceed directly with a copy-by-reference, in which the resource is shared between the two instances. This is much faster than a deep copy and thus gives a performance boost. We cannot get rid of the deep copy entirely, though, because the first modification still has to trigger it.&lt;/p&gt;

&lt;p&gt;A particular example where we gain a significant performance boost is during the &lt;code&gt;fork&lt;/code&gt; system call.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;fork&lt;/code&gt; system call creates a child process that is an exact copy of its parent. During this call, if the parent's program space is huge and we trigger a deep copy, the time taken to create the child process shoots up. But if we just do a copy-by-reference, the child process can be spun up super fast. Only when the child decides to make some modification to its program space do we trigger the deep copy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better resource management
&lt;/h2&gt;

&lt;p&gt;CoW gives us an optimistic way to manage memory. One peculiar property that CoW exploits is that, before any modification to the copied instance, the original and the copied resources are exactly the same. The readers, thus, cannot distinguish whether they are reading from the original resource or the copied one.&lt;/p&gt;

&lt;p&gt;Things change when the first modification is made to the copied instance and that's where readers of the corresponding resource would expect to see things differently. But what if the copied instance is never modified?&lt;/p&gt;

&lt;p&gt;Since there are no modifications, in CoW, the deep copy would never happen and hence the only operation that ever happened was a super-fast copy-by-reference of the original resource; and thus we just saved an expensive deep copy operation.&lt;/p&gt;

&lt;p&gt;One very common pattern in operating systems is called &lt;a href="https://en.wikipedia.org/wiki/Fork%E2%80%93exec"&gt;fork-exec&lt;/a&gt;, in which a child process is forked as an exact copy of its parent but immediately executes another program, using the &lt;code&gt;exec&lt;/code&gt; family of functions, replacing its entire program space. Since the child never intends to modify the program space inherited from the parent, and just wants to replace it with the new program, a deep copy plays no part and is a waste. So if we defer the deep copy until modification, it never happens, and we save a bunch of memory and CPU cycles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include &amp;lt;stdio.h&amp;gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="c1"&gt;// fork spins the child process and both child and the parent&lt;/span&gt;
    &lt;span class="c1"&gt;// continues to co-exist from this point with the same&lt;/span&gt;
    &lt;span class="c1"&gt;// program space.&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// The entire child program space is replace by the&lt;/span&gt;
        &lt;span class="c1"&gt;// execvp function call.&lt;/span&gt;
        &lt;span class="c1"&gt;// The child continues to execute the `ls` command.&lt;/span&gt;
        &lt;span class="n"&gt;execvp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ls"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Child process will never reach here.&lt;/span&gt;
    &lt;span class="c1"&gt;// hence all memory that was copied from its parent's&lt;/span&gt;
    &lt;span class="c1"&gt;// program space is of no use.&lt;/span&gt;

    &lt;span class="c1"&gt;// The parent will continue its execution and print the&lt;/span&gt;
    &lt;span class="c1"&gt;// following message.&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"parent finishes...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Updating without locks
&lt;/h2&gt;

&lt;p&gt;Locks are required when we have in-place updates. Multiple writers try to modify the same instance of the resource, and hence we need to define a &lt;a href="https://en.wikipedia.org/wiki/Critical_section"&gt;critical section&lt;/a&gt; where the updates happen. This critical section is guarded by locks, and any writer who wishes to modify the resource has to acquire the lock first. This serializes the writers and ensures only one writer can enter the critical section at any point in time, creating a chokepoint.&lt;/p&gt;

&lt;p&gt;If we follow CoW aggressively, which suggests we copy before we write, there are no in-place updates. Every single write creates a clone, applies the updates to it, and then, in one atomic &lt;a href="https://en.wikipedia.org/wiki/Compare-and-swap"&gt;compare-and-swap&lt;/a&gt; operation, switches the reference to point to this newer version; thus eradicating the need for locking entirely. Garbage collection of the unused instances holding old values can happen from time to time.&lt;/p&gt;
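&lt;p&gt;Python has no user-level compare-and-swap primitive, so the following sketch (all names are mine) only mimics the pattern: writers copy, modify, and then publish the new version with a single reference assignment, while readers never take a lock. A real lock-free implementation would retry the swap in a CAS loop.&lt;/p&gt;

```python
class CowBox:
    """Holds a snapshot of a dict; writers clone it, apply the update, and
    publish the clone with one reference assignment, so readers never block."""
    def __init__(self, value):
        self._snapshot = dict(value)

    def read(self):
        return self._snapshot          # a single reference read, no lock

    def set(self, key, value):
        clone = dict(self._snapshot)   # copy before write
        clone[key] = value
        self._snapshot = clone         # the "swap": old readers keep old view

box = CowBox({"count": 1})
old = box.read()
box.set("count", 2)
# old["count"] == 1  -- a reader holding the old snapshot is unaffected
# box.read()["count"] == 2
```

&lt;p&gt;Readers that grabbed the old snapshot keep a consistent view for as long as they hold it, which is exactly what makes the old versions candidates for deferred garbage collection.&lt;/p&gt;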

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0oF89GFV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80912595-9fc13080-8d5b-11ea-9b73-599b673e6715.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0oF89GFV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80912595-9fc13080-8d5b-11ea-9b73-599b673e6715.png" alt="Updating variables without locks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Versioning and point in time snapshots
&lt;/h2&gt;

&lt;p&gt;If we aggressively follow CoW then on every write we create a clone of the original resource and apply updates to it. If we do not garbage collect the older unused instances, what we get is the history of the resource that shows us how it has been changing with time (every write operation).&lt;/p&gt;

&lt;p&gt;Each update creates a new version of the resource and thus we get resource versioning; enabling us to take point-in-time snapshots. This particular behavior is used by all collaborative document tools, like &lt;a href="https://en.wikipedia.org/wiki/Google_Docs"&gt;Google Docs&lt;/a&gt;, to provide document versioning. Point-in-time snapshots are also used in the databases to take timely backups allowing us to have a rollback and recovery plan in case of some data loss or worse a database failure.&lt;/p&gt;
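&lt;p&gt;Keeping the old snapshots around instead of garbage-collecting them gives versioning almost for free. A toy sketch (the class and method names are mine; strings are immutable in Python, so each write is naturally its own snapshot):&lt;/p&gt;

```python
class VersionedDoc:
    """Copy-on-write document that keeps every version instead of
    discarding old snapshots (illustrative sketch)."""
    def __init__(self, text=""):
        self._versions = [text]

    def write(self, text):
        self._versions.append(text)      # each write is a new snapshot

    def latest(self):
        return self._versions[-1]

    def at(self, version):
        return self._versions[version]   # point-in-time snapshot

doc = VersionedDoc("draft")
doc.write("draft v2")
doc.write("final")
# doc.latest() == "final"; doc.at(0) == "draft"
```

&lt;p&gt;Rollback is then just reading (or re-appending) an older snapshot, which is the essence of how versioned backups and document history work.&lt;/p&gt;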

&lt;h1&gt;
  
  
  Implementing CoW
&lt;/h1&gt;

&lt;p&gt;CoW is just a technique: it tells us what to do, not how to do it. The implementation is entirely in the hands of the system, and the details differ depending on the type of resource being CoW'ed.&lt;/p&gt;

&lt;p&gt;The naive way to perform the copy operation is a deep copy which, as established before, is super inefficient. We can do a lot better by understanding the nuances of the underlying resource. To gain a deeper understanding, we look at how efficiently we can CoW a &lt;a href="https://en.wikipedia.org/wiki/Binary_tree"&gt;Binary Tree&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Efficient Copy-on-write on a Binary Tree
&lt;/h2&gt;

&lt;p&gt;Given a Binary Tree &lt;code&gt;A&lt;/code&gt;, we create a copy &lt;code&gt;B&lt;/code&gt; such that any modifications made through &lt;code&gt;A&lt;/code&gt; are not visible to &lt;code&gt;B&lt;/code&gt; and any modifications made through &lt;code&gt;B&lt;/code&gt; are not visible to &lt;code&gt;A&lt;/code&gt;. The simplest way to achieve this is to clone all the nodes of the tree and their pointer references, creating a second tree that &lt;code&gt;B&lt;/code&gt; then points to - as illustrated in the diagram below. Any modification made to either tree will not be visible to the other because their spaces are mutually exclusive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0icJ2aLC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80859895-b3986400-8c81-11ea-9ebe-829540df77d5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0icJ2aLC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80859895-b3986400-8c81-11ea-9ebe-829540df77d5.png" alt="Deep Copying a Binary Tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy-on-Write semantics suggest an optimistic approach where &lt;code&gt;B&lt;/code&gt; instead of pointing to the cloned &lt;code&gt;A&lt;/code&gt;, shares the same reference as &lt;code&gt;A&lt;/code&gt; which means it also points to the exact same tree as &lt;code&gt;A&lt;/code&gt;. Now say, we modify the node &lt;code&gt;2&lt;/code&gt; in tree &lt;code&gt;B&lt;/code&gt; and change its value to &lt;code&gt;9&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Observing closely, we find that a lot of pointers could be reused, and hence a better approach is to copy only the path from the updated node up to the root, keeping all other pointer references the same, and let &lt;code&gt;B&lt;/code&gt; point to this new root, as shown in the illustration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qtcF6r9w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80869877-7606fb80-8cc0-11ea-8a9b-2b7312a59f11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qtcF6r9w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80869877-7606fb80-8cc0-11ea-8a9b-2b7312a59f11.png" alt="Copy-on-Write a Binary Tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thus instead of maintaining two separate mutually exclusive trees, we make space partially exclusive depending on which node is updated and in the process make things efficient with respect to memory and time. This behavior is core to a family of data structures called &lt;a href="https://en.wikipedia.org/wiki/Persistent_data_structure"&gt;Persistent Data Structures&lt;/a&gt;.&lt;/p&gt;
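&lt;p&gt;The path-copying idea can be sketched in a few lines of Python. This is a minimal illustration, not code from any real system; the &lt;code&gt;Node&lt;/code&gt; class and &lt;code&gt;update&lt;/code&gt; helper are hypothetical names introduced here.&lt;/p&gt;

```python
from dataclasses import dataclass

# A hypothetical immutable tree node for illustrating path copying.
@dataclass(frozen=True)
class Node:
    value: int
    left: "Node" = None
    right: "Node" = None

def update(root, path, new_value):
    # path is a string of 'L'/'R' steps from the root to the target node.
    # Only the nodes along that path are cloned; every untouched subtree
    # is shared between the old and the new version of the tree.
    if not path:
        return Node(new_value, root.left, root.right)
    if path[0] == 'L':
        return Node(root.value, update(root.left, path[1:], new_value), root.right)
    return Node(root.value, root.left, update(root.right, path[1:], new_value))

a = Node(1, Node(2), Node(3))   # tree A: root 1 with children 2 and 3
b = update(a, 'L', 9)           # tree B: the left child becomes 9

assert b.left.value == 9 and a.left.value == 2  # A is untouched
assert a.right is b.right                        # right subtree is shared, not copied
```

Note that only two nodes were allocated for the new version (the new root and the new left child); the entire right subtree is shared between both versions.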

&lt;blockquote&gt;
&lt;p&gt;Fun fact: You can model Time Travel using Copy-on-Write semantics.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Why shouldn't we Copy-on-Write
&lt;/h1&gt;

&lt;p&gt;CoW is an expensive process if done aggressively. If on every single write we create a copy, then in a write-heavy system things could go out of hand very soon. A lot of CPU cycles will be spent on garbage collection, stalling the core processes. Picking which battles to win is important while choosing something as critical as Copy-on-Write.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Copy-on-write"&gt;Copy on Write&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Persistent_data_structure"&gt;Persistent Data Structures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Fork%E2%80%93exec"&gt;Fork Exec Pattern&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Other articles you might like:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/mysql-cache"&gt;What makes MySQL LRU cache scan resistant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/fsm"&gt;Building Finite State Machines with Python Coroutines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/bayesian-average"&gt;Solving an age-old problem using Bayesian Average &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/rule-30"&gt;Pseudorandom numbers using Cellular Automata - Rule 30&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/function-overloading"&gt;Overload Functions in Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on my &lt;a href="https://arpitbhayani.me/blogs/copy-on-write"&gt;blog - Copy-on-Write Semantics&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you liked what you read, subscribe to my newsletter and get the post delivered directly to your inbox and give me a shout-out &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arpit.substack.com"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WpJKYG2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79068776-07e59f00-7ce7-11ea-8eff-3918556a3682.png" alt="Subscribe to Arpit's newsletter"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>operatingsystems</category>
      <category>datastructures</category>
      <category>locks</category>
    </item>
    <item>
      <title>What makes MySQL LRU cache scan resistant</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 26 Apr 2020 11:43:40 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/what-makes-mysql-lru-cache-scan-resistant-2cjl</link>
      <guid>https://dev.to/arpit_bhayani/what-makes-mysql-lru-cache-scan-resistant-2cjl</guid>
<description>&lt;p&gt;Disk reads are 4x (for SSD) to 80x (for magnetic disk) &lt;a href="https://gist.github.com/hellerbarde/2843375"&gt;slower&lt;/a&gt; than main memory (RAM) reads, and hence it becomes extremely important for a database to utilize main memory as much as it can and be super-performant while keeping its latencies to a bare minimum. Engines cannot simply replace disks with RAM because of volatility and cost, hence they need to strike a balance between the two - maximize main-memory utilization and minimize disk access.&lt;/p&gt;

&lt;p&gt;The database engine virtually splits the data files into pages. A page is the unit of data the engine transfers at any one time between the disk (the data files) and main memory. It is usually a few kilobytes (4KB, 8KB, 16KB, 32KB, etc.) and is configurable via engine parameters. Because of this relatively large size, a page can hold one or more rows of a table, depending on the length of each row.&lt;/p&gt;

&lt;h1&gt;
  
  
  Locality of reference
&lt;/h1&gt;

&lt;p&gt;Database systems exhibit a strong and predictable behaviour called &lt;a href="https://en.wikipedia.org/wiki/Locality_of_reference"&gt;locality of reference&lt;/a&gt;, which predicts how a page and its neighbours will be accessed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Spatial Locality of Reference
&lt;/h2&gt;

&lt;p&gt;The spatial locality of reference suggests if a row is accessed, there is a high probability that the neighbouring rows will be accessed in the near future.&lt;/p&gt;

&lt;p&gt;Having a larger page size addresses this situation to some extent. Since one page can fit multiple rows, once that page is cached in main memory the engine saves a disk read whenever neighbouring rows residing on the same page are accessed.&lt;/p&gt;

&lt;p&gt;Another way to address this situation is to &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-disk-io.html"&gt;read-ahead&lt;/a&gt; pages that are very likely to be accessed in the future and keep them available in main memory. This way, if the read-ahead pages are referenced, the engine does not need to go to the disk to fetch them; instead, it finds them already residing in main memory, saving a bunch of disk reads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temporal Locality of Reference
&lt;/h2&gt;

&lt;p&gt;The temporal locality of reference suggests that if a page is recently accessed, it is very likely that the same page will be accessed again in the near future.&lt;/p&gt;

&lt;p&gt;Caching exploits this behaviour by putting every single page accessed from the disk into main-memory (cache). Hence the next time a page which is available in the cache is referenced, the engine need not make a disk read to get the page, rather it could reference it from the cache directly, again saving a disk read.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hk0rHT3s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80286313-4e57e680-8748-11ea-88c2-dcb67f6ac566.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hk0rHT3s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80286313-4e57e680-8748-11ea-88c2-dcb67f6ac566.png" alt="Disk cache-control flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since cache memory is very costly, it is orders of magnitude smaller in capacity than the disk. It can only hold some fixed number of pages, which means the cache suffers from the problem of getting full very quickly. Once the cache gets full, the engine needs to evict an old page so that the new page, which according to the temporal locality of reference is going to be accessed in the near future, can get a place in the cache.&lt;/p&gt;

&lt;p&gt;The most common strategy that decides the page that will be evicted from the cache is the &lt;a href="https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)"&gt;Least Recently Used cache eviction strategy&lt;/a&gt;. This strategy uses Temporal Locality of Reference to the core and hence evicts the page which was not accessed the longest, thus maximizing the time the most-recently accessed pages are held in the cache.&lt;/p&gt;

&lt;h1&gt;
  
  
  LRU Cache
&lt;/h1&gt;

&lt;p&gt;The LRU cache holds the items in the order of their last access, allowing us to identify which item has not been used the longest. When the cache is full and a newer item needs to make an entry, the item which has not been accessed the longest is evicted - hence the name Least Recently Used.&lt;/p&gt;

&lt;p&gt;One end (the head) of the list holds the most recently referenced page while the other end (the tail) holds the least recently referenced one. A new page, being the most recently accessed, is always added at the head of the list while eviction happens at the tail. If a page in the cache is referenced again, it is moved to the head of the list as it is now the most recently referenced.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;LRU cache is often implemented by pairing a &lt;a href="https://en.wikipedia.org/wiki/Doubly_linked_list"&gt;doubly-linked list&lt;/a&gt; with a &lt;a href="https://en.wikipedia.org/wiki/Hash_table"&gt;hash map&lt;/a&gt;. The cache is thus just a linked list of pages and the hashmap maps the &lt;code&gt;page_id&lt;/code&gt; to the node in the linked list, enabling &lt;code&gt;O(1)&lt;/code&gt; lookups.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oDyrb8cl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80288324-d7751a80-8754-11ea-96ab-6a8e25730bff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oDyrb8cl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80288324-d7751a80-8754-11ea-96ab-6a8e25730bff.png" alt="LRU Cache"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  InnoDB's Buffer Pool
&lt;/h2&gt;

&lt;p&gt;MySQL InnoDB's cache is called &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool.html"&gt;Buffer Pool&lt;/a&gt; which does exactly what has been established earlier. Pseudocode implementation of &lt;code&gt;get_page&lt;/code&gt; function, using which the engine gets the page for further processing, could be summarized as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Check if the page is available in the cache
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# if the page is retrieved from the main memory
&lt;/span&gt;    &lt;span class="c1"&gt;# return the page.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;

    &lt;span class="c1"&gt;# retrieve the page from the disk
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;disk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# put the page in the cache,
&lt;/span&gt;    &lt;span class="c1"&gt;# if the cache is full, evict a page which is
&lt;/span&gt;    &lt;span class="c1"&gt;# least recently used.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_full&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evict_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# put the page in the cache
&lt;/span&gt;    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# return the pages
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  A notorious problem with Sequential Scans
&lt;/h2&gt;

&lt;p&gt;The above caching strategy works wonders and helps the engine stay super-performant. The &lt;a href="https://www.stix.id.au/wiki/Cache_Hit_Ratio"&gt;cache hit ratio&lt;/a&gt; is usually more than 80% for mid-sized production-level traffic, which means 80% of the time pages are served from main memory (the cache) and the engine does not need to make a disk read.&lt;/p&gt;

&lt;p&gt;What would happen if an entire table is scanned - say, while taking a &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/mysqldump.html"&gt;DB dump&lt;/a&gt;, or running a &lt;code&gt;SELECT&lt;/code&gt; without a &lt;code&gt;WHERE&lt;/code&gt; clause to perform some statistical computations?&lt;/p&gt;

&lt;p&gt;Going by MySQL's aforementioned behaviour, the engine iterates over all the pages and, since each page accessed is now the most recent one, puts each at the head of the cache while evicting one from the tail.&lt;/p&gt;

&lt;p&gt;If the table is bigger than the cache, this process will wipe out the entire cache and fill it with the pages from just one table. If these pages are never referenced again, this is a total loss, and the performance of the database takes a hit. The performance will pick up only once these pages are evicted from the cache and other pages make an entry.&lt;/p&gt;

&lt;h1&gt;
  
  
  Midpoint Insertion Strategy
&lt;/h1&gt;

&lt;p&gt;MySQL's InnoDB engine employs an extremely smart solution to the notorious problem with sequential scans. Instead of keeping its Buffer Pool a strict LRU, it tweaks it a little.&lt;/p&gt;

&lt;p&gt;Instead of treating the Buffer Pool as a single doubly-linked list, it treats it as a combination of two smaller sublists - usually 5/8th and 3/8th of the total size. One sublist holds the younger data while the other one holds the older data. The head of the Young sublist holds the most recent pages and the recency decreases as we reach the tail of the Old sublist.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wxqgf18U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80299447-138a9880-87b2-11ea-9b0a-888e0ccf4b49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wxqgf18U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/80299447-138a9880-87b2-11ea-9b0a-888e0ccf4b49.png" alt="MySQL InnoDB Midpoint Insertion Strategy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Eviction
&lt;/h2&gt;

&lt;p&gt;The tail of the Old Sublist holds the Least Recently Used page and the eviction thus happens as per the LRU Strategy i.e. at the tail of the Old Sublist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Insertion
&lt;/h2&gt;

&lt;p&gt;This is where this strategy differs from strict LRU. The insertion, instead of happening at the "newest" end of the list i.e. the head of the Young sublist, happens at the head of the Old sublist i.e. in the "middle" of the list. The position where the tail of the Young sublist meets the head of the Old sublist is referred to as the "midpoint", hence the name Midpoint Insertion Strategy.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;By inserting in the middle, the pages that are only read once, such as during a full table scan, can be aged out of the Buffer Pool sooner than with a strict LRU algorithm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Moving page from Old to the Young sublist
&lt;/h2&gt;

&lt;p&gt;In this strategy, as in a strict LRU implementation, whenever a page is accessed again it moves to the newest end of the list i.e. the head of the Young sublist. On its first access, however, a page enters the cache at the "midpoint" position.&lt;/p&gt;

&lt;p&gt;If the page is referenced a second time, it is moved to the head of the Young sublist and hence stays in the cache longer. If the page, after being inserted at the midpoint, is never referenced again (as during full scans), it is evicted sooner because the Old sublist is usually shorter than the Young sublist.&lt;/p&gt;

&lt;p&gt;The Young sublist thus remains unaffected by table scans bringing in new blocks that might or might not be accessed afterwards. The engine thus remains performant as more frequently accessed pages continue to remain in the cache (Young sublist).&lt;/p&gt;
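&lt;p&gt;The two-sublist behaviour described above can be sketched with the same kind of structure. This is a toy model for illustration only - the names and numbers are mine, not InnoDB's code, and it omits details such as the &lt;code&gt;innodb_old_blocks_time&lt;/code&gt; window that real InnoDB also applies before promoting a page.&lt;/p&gt;

```python
from collections import OrderedDict

class MidpointLRU:
    # Toy model of the midpoint insertion strategy: a first-time page
    # enters at the head of the Old sublist; only a second access
    # promotes it to the Young sublist. The default split mirrors the
    # 5/8 (Young) : 3/8 (Old) ratio described above.
    def __init__(self, capacity, old_pct=0.375):
        self.old_cap = max(1, int(capacity * old_pct))
        self.young_cap = capacity - self.old_cap
        self.young = OrderedDict()   # frequently accessed pages
        self.old = OrderedDict()     # pages seen only once so far

    def access(self, page_id):
        if page_id in self.young:         # already hot: refresh recency
            self.young.move_to_end(page_id)
        elif page_id in self.old:         # second access: promote to Young
            del self.old[page_id]
            self.young[page_id] = True
            if len(self.young) > self.young_cap:
                # demote the Young sublist's LRU page into the Old sublist
                demoted, _ = self.young.popitem(last=False)
                self.old[demoted] = True
        else:                             # first access: insert at midpoint
            self.old[page_id] = True
        if len(self.old) > self.old_cap:
            self.old.popitem(last=False)  # evict at the tail of Old

pool = MidpointLRU(8)
pool.access("hot")
pool.access("hot")            # second access promotes "hot" to Young
for page in range(100):       # simulate a full table scan
    pool.access(page)
assert "hot" in pool.young    # the scan never touched the Young sublist
```

Running the simulated table scan shows the point of the strategy: a hundred once-read pages churn only through the Old sublist, while the frequently accessed page stays cached.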

&lt;h2&gt;
  
  
  MySQL parameter to tune the midpoint
&lt;/h2&gt;

&lt;p&gt;InnoDB allows us to tune the midpoint of the Buffer Pool through the parameter &lt;code&gt;innodb_old_blocks_pct&lt;/code&gt;. This parameter controls the percentage of the Buffer Pool reserved for the Old sublist. The default value is 37, which corresponds to the ratio 3/8.&lt;/p&gt;

&lt;p&gt;To get greater insight into the Buffer Pool, we can invoke the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ SHOW ENGINE INNODB STATUS

----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 137363456; in additional pool allocated 0
Dictionary memory allocated 159646
Buffer pool size   8191
Free buffers       7741
Database pages     449
Old database pages 0

...

Pages made young 12, not young 0
43.00 youngs/s, 27.00 non-youngs/s

...

Buffer pool hit rate 997 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The command &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt; outputs a lot of interesting metrics; the most critical ones, w.r.t. memory and the Buffer Pool, are&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;number of pages that were made young&lt;/li&gt;
&lt;li&gt;rate of eviction without access&lt;/li&gt;
&lt;li&gt;cache hit ratio&lt;/li&gt;
&lt;li&gt;read ahead rate&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We saw how, by changing just one aspect of the LRU cache, MySQL InnoDB makes itself scan resistant. Sequential scanning was a critical issue for the cache, but it was addressed in a very elegant way.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/hellerbarde/2843375"&gt;Latency numbers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Locality_of_reference"&gt;Locality of reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://serge.frezefond.com/2009/12/innodb-making-buffer-cache-scan-resistant/"&gt;InnoDB: Making Buffer Cache Scan Resistant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool.html"&gt;MySQL Dev - Buffer Pool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-performance-midpoint_insertion.html"&gt;MySQL Dev - Making the Buffer Pool Scan Resistant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-disk-io.html"&gt;MySQL Dev - InnoDB Disk I/O&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Other articles you might like:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/fsm"&gt;Building Finite State Machines with Python Coroutines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/bayesian-average"&gt;Solving an age-old problem using Bayesian Average &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/better-programmer"&gt;Eight rituals to be a better programmer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/rule-30"&gt;Pseudorandom numbers using Cellular Automata - Rule 30&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/function-overloading"&gt;Overload Functions in Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on my &lt;a href="https://arpitbhayani.me/blogs/mysql-cache"&gt;blog - What makes MySQL LRU cache scan resistant&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you liked what you read, subscribe to my newsletter and get the post delivered directly to your inbox and give me a shout-out &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arpit.substack.com"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WpJKYG2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79068776-07e59f00-7ce7-11ea-8eff-3918556a3682.png" alt="Subscribe to Arpit's newsletter"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>algorithms</category>
      <category>mysql</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Building Finite State Machines with Python Coroutines</title>
      <dc:creator>Arpit Bhayani</dc:creator>
      <pubDate>Sun, 19 Apr 2020 10:15:48 +0000</pubDate>
      <link>https://dev.to/arpit_bhayani/building-finite-state-machines-with-python-coroutines-5gm2</link>
      <guid>https://dev.to/arpit_bhayani/building-finite-state-machines-with-python-coroutines-5gm2</guid>
<description>&lt;p&gt;A Finite State Machine is a mathematical model of computation that models sequential logic. An FSM consists of a finite number of states, transition functions, input alphabets, a start state and end state(s). In computer science, FSMs are used in designing compilers, linguistic processing, step workflows, game design, protocol procedures (like TCP/IP), event-driven programming, conversational AI and many more.&lt;/p&gt;

&lt;p&gt;To understand what a finite state machine is, we take a look at a traffic signal. The Finite State Machine for a traffic signal is designed and rendered below. &lt;code&gt;Green&lt;/code&gt; is the start/initial state, which upon receiving a trigger moves to &lt;code&gt;Yellow&lt;/code&gt;, which in turn, upon receiving a trigger, transitions to &lt;code&gt;Red&lt;/code&gt;. The &lt;code&gt;Red&lt;/code&gt; then circles back to &lt;code&gt;Green&lt;/code&gt; and the loop continues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--92t9oZF4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79678813-d572ff00-821c-11ea-8437-b4a3b7fd1a60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--92t9oZF4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79678813-d572ff00-821c-11ea-8437-b4a3b7fd1a60.png" alt="traffic signal fsm"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An FSM must be in exactly one of its finite states at any given point in time; in response to an input it receives, the machine transitions to another state. In the example above, the traffic signal is in exactly one of the 3 states - &lt;code&gt;Green&lt;/code&gt;, &lt;code&gt;Yellow&lt;/code&gt; or &lt;code&gt;Red&lt;/code&gt;. Transition rules defined for each state determine which sequential logic plays out upon an input.&lt;/p&gt;

&lt;p&gt;Implementing an FSM is crucial to solving some of the most interesting problems in Computer Science and in this article, we dive deep into modeling a Finite State Machine using Python coroutines.&lt;/p&gt;

&lt;h1&gt;
  
  
  Python Coroutines
&lt;/h1&gt;

&lt;p&gt;Before diving into the implementation, we take a detour and look at what Generators and Coroutines are, how they keep the implementation intuitive, and how they fit into the scheme of things.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generators
&lt;/h2&gt;

&lt;p&gt;Generators are &lt;strong&gt;resumable functions&lt;/strong&gt; that yield values as long as someone keeps asking for them by calling the &lt;code&gt;next&lt;/code&gt; function. If there are no more values to yield, the generator raises a &lt;code&gt;StopIteration&lt;/code&gt; exception.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;yield&lt;/code&gt; statement is where the magic happens. Upon reaching the &lt;code&gt;yield&lt;/code&gt; statement, the generator function's execution is paused, the yielded value is returned to the caller, and the caller continues its execution. The flow returns to the generator when the caller asks for the next value. Once the next value is requested by calling &lt;code&gt;next&lt;/code&gt; (explicitly or implicitly), the generator function resumes from where it left off i.e. the &lt;code&gt;yield&lt;/code&gt; statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fgen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fgen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Using a Fibonacci generator is memory-efficient because we need not compute a lot of Fibonacci numbers upfront and hold them in memory in a list; instead, the requesting process asks for as many values as it needs and the generator keeps yielding values one by one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coroutines
&lt;/h2&gt;

&lt;p&gt;Coroutines, just like generators, are resumable functions, but instead of generating values they consume values on the fly. Their working is very similar to generators, and again the &lt;code&gt;yield&lt;/code&gt; statement is where the magic happens. When a coroutine is paused at the &lt;code&gt;yield&lt;/code&gt; statement, we can send a value to it using the &lt;code&gt;send&lt;/code&gt; function, and the value can be captured through the assignment operator &lt;code&gt;=&lt;/code&gt; on &lt;code&gt;yield&lt;/code&gt; as shown below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;grep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;substr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;substr&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;f"found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;substr&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In the example above, we wrote a simple &lt;code&gt;grep&lt;/code&gt; utility that checks for a substring in a given stream of text. When the coroutine &lt;code&gt;grep&lt;/code&gt; is paused at the &lt;code&gt;yield&lt;/code&gt; statement, using the &lt;code&gt;send&lt;/code&gt; function, we send the text to it, and it will be referenced by the variable &lt;code&gt;line&lt;/code&gt;. The coroutine then continues its execution to check if &lt;code&gt;substr&lt;/code&gt; is in &lt;code&gt;line&lt;/code&gt; or not. Once the flow reaches the &lt;code&gt;yield&lt;/code&gt; statement again, the coroutine pauses and waits for the caller to &lt;code&gt;send&lt;/code&gt; it a new value.&lt;/p&gt;

&lt;p&gt;Note that this is not a thread that keeps running and hogging the CPU. It is just a function whose execution is paused at the &lt;code&gt;yield&lt;/code&gt; statement while it waits for a value; its state is persisted and control is passed back to the caller. When resumed, the coroutine picks up from exactly where it left off.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before sending the value to a coroutine we need to "prime" it so that the flow reaches the yield statement and the execution is paused while waiting for the value to be sent.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"users/created"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# priming the generator
&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"users/get api took 1 ms."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"users/created api took 3 ms."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;found&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;created&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"users/get api took 1 ms."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"users/created api took 4 ms."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;found&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;created&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"users/get api took 1 ms."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In the invocations above we see how we can keep sending text to the coroutine, and it keeps reporting whenever it finds the given substring &lt;code&gt;users/created&lt;/code&gt; in the text. This ability of a coroutine to pause its execution and accept input on the fly helps us model FSMs in a very intuitive way.&lt;/p&gt;
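&lt;p&gt;The priming step is easy to forget, so it is commonly automated with a small decorator that advances a freshly created coroutine to its first &lt;code&gt;yield&lt;/code&gt;. The sketch below is one such decorator; the &lt;code&gt;@prime&lt;/code&gt; decorator used in the FSM class later follows the same idea (the &lt;code&gt;found&lt;/code&gt; list is added here just to make the result observable):&lt;/p&gt;

```python
from functools import wraps

def prime(fn):
    """Wrap a coroutine function so the returned coroutine is
    already advanced to its first `yield`."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        g = fn(*args, **kwargs)
        next(g)  # run up to the first `yield`
        return g
    return wrapper

found = []

@prime
def grep(substr):
    while True:
        line = yield
        if substr in line:
            found.append(line)

g = grep("users/created")  # no manual next(g) needed anymore
g.send("users/get api took 1 ms.")
g.send("users/created api took 3 ms.")
print(found)  # ['users/created api took 3 ms.']
```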

&lt;h1&gt;
  
  
  Building a Finite State Machine
&lt;/h1&gt;

&lt;p&gt;While building FSMs, the most important decision is how we model and implement states and transition functions. States can be modeled as Python coroutines that run an infinite loop within which they accept the input, decide the transition, and update the current state of the FSM. The transition function can be as simple as a bunch of &lt;code&gt;if&lt;/code&gt; and &lt;code&gt;elif&lt;/code&gt; statements; in a more complex system it could be a full decision function.&lt;/p&gt;

&lt;p&gt;To dive into the low-level details, we build an FSM for the regular expression &lt;code&gt;ab*c&lt;/code&gt;: we say a string matches the regex if and only if, after consuming the entire string, the machine ends in the end state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GTzDZbgI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79634655-84fe9180-8189-11ea-9b94-f9ee563394bf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GTzDZbgI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79634655-84fe9180-8189-11ea-9b94-f9ee563394bf.png" alt="fsm for ab*c"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  State
&lt;/h2&gt;

&lt;p&gt;From the FSM above we model the state &lt;code&gt;q2&lt;/code&gt; as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_q2&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Wait till the input is received.
&lt;/span&gt;        &lt;span class="c1"&gt;# once received store the input in `char`
&lt;/span&gt;        &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt;

        &lt;span class="c1"&gt;# depending on what we received as the input
&lt;/span&gt;        &lt;span class="c1"&gt;# change the current state of the fsm
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# on receiving `b` the state moves to `q2`
&lt;/span&gt;            &lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q2&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'c'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# on receiving `c` the state moves to `q3`
&lt;/span&gt;            &lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q3&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# on receiving any other input, break the loop
&lt;/span&gt;            &lt;span class="c1"&gt;# so that next time when someone sends any input to
&lt;/span&gt;            &lt;span class="c1"&gt;# the coroutine it raises StopIteration
&lt;/span&gt;            &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The coroutine runs an infinite loop in which it waits for the input token at the &lt;code&gt;yield&lt;/code&gt; statement. Upon receiving the input, say &lt;code&gt;b&lt;/code&gt;, it changes the current state of the FSM to &lt;code&gt;q2&lt;/code&gt;, and on receiving &lt;code&gt;c&lt;/code&gt; it changes the state to &lt;code&gt;q3&lt;/code&gt;; this is precisely what we see in the FSM diagram.&lt;/p&gt;

&lt;h2&gt;
  
  
  FSM Class
&lt;/h2&gt;

&lt;p&gt;To keep things encapsulated, we define a class for the FSM that holds all the states and maintains the current state of the machine. It also has a method called &lt;code&gt;send&lt;/code&gt; which routes the received input to the current state. The current state, upon receiving this input, makes a decision and updates the &lt;code&gt;current_state&lt;/code&gt; of the FSM as shown above.&lt;/p&gt;

&lt;p&gt;Depending on the use case, the FSM could also have a function that answers the core problem statement, for example: does the given line match the regular expression? Or is the number divisible by 3?&lt;/p&gt;

&lt;p&gt;The FSM class for the regular expression &lt;code&gt;ab*c&lt;/code&gt; could be modeled as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FSM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# initializing states
&lt;/span&gt;        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_create_start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_create_q1&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_create_q2&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_create_q3&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# setting current state of the system
&lt;/span&gt;        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;

        &lt;span class="c1"&gt;# stopped flag to denote that iteration is stopped due to bad
&lt;/span&gt;        &lt;span class="c1"&gt;# input against which transition was not defined.
&lt;/span&gt;        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stopped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="s"&gt;"""The function sends the curretn input to the current state
        It captures the StopIteration exception and marks the stopped flag.
        """&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;StopIteration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stopped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;does_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="s"&gt;"""The function at any point in time returns if till the current input
        the string matches the given regular expression.

        It does so by comparing the current state with the end state `q3`.
        It also checks for `stopped` flag which sees that due to bad input the iteration of FSM had to be stopped.
        """&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stopped&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q3&lt;/span&gt;

    &lt;span class="p"&gt;...&lt;/span&gt;

    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;prime&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_q2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Wait till the input is received.
&lt;/span&gt;            &lt;span class="c1"&gt;# once received store the input in `char`
&lt;/span&gt;            &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt;

            &lt;span class="c1"&gt;# depending on what we received as the input
&lt;/span&gt;            &lt;span class="c1"&gt;# change the current state of the fsm
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# on receiving `b` the state moves to `q2`
&lt;/span&gt;                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'c'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# on receiving `c` the state moves to `q3`
&lt;/span&gt;                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q3&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# on receiving any other input, break the loop
&lt;/span&gt;                &lt;span class="c1"&gt;# so that next time when someone sends any input to
&lt;/span&gt;                &lt;span class="c1"&gt;# the coroutine it raises StopIteration
&lt;/span&gt;                &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Similar to how we defined the function &lt;code&gt;_create_q2&lt;/code&gt;, we can define functions for the other three states: &lt;code&gt;start&lt;/code&gt;, &lt;code&gt;q1&lt;/code&gt;, and &lt;code&gt;q3&lt;/code&gt;. You can find the complete FSM modeled at &lt;a href="https://github.com/arpitbbhayani/fsm/blob/master/regex-1.ipynb"&gt;arpitbbhayani/fsm/regex-1&lt;/a&gt;.&lt;/p&gt;
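&lt;p&gt;For illustration, the start state (whose only defined transition is on &lt;code&gt;a&lt;/code&gt;, moving the machine to &lt;code&gt;q1&lt;/code&gt;) could be sketched like this. The &lt;code&gt;_Demo&lt;/code&gt; harness and the stub &lt;code&gt;q1&lt;/code&gt; are hypothetical, added only so the snippet runs standalone; the linked notebook holds the authoritative version:&lt;/p&gt;

```python
def prime(fn):
    # advance a coroutine to its first `yield` upon creation
    def wrapper(*args, **kwargs):
        g = fn(*args, **kwargs)
        next(g)
        return g
    return wrapper

class _Demo:
    """Hypothetical minimal harness holding just the start state."""
    def __init__(self):
        self.q1 = object()  # stand-in for the real q1 coroutine
        self.start = self._create_start()
        self.current_state = self.start

    @prime
    def _create_start(self):
        while True:
            char = yield
            if char == 'a':
                # the only defined transition out of the start state
                self.current_state = self.q1
            else:
                # undefined transition: break, so the next send
                # raises StopIteration
                break

m = _Demo()
m.start.send('a')
print(m.current_state is m.q1)  # True
```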

&lt;h2&gt;
  
  
  Driver function
&lt;/h2&gt;

&lt;p&gt;The goal of this exercise is to define a function called &lt;code&gt;grep_regex&lt;/code&gt; which tests a given &lt;code&gt;text&lt;/code&gt; against the regex &lt;code&gt;ab*c&lt;/code&gt;. The function internally creates an instance of &lt;code&gt;FSM&lt;/code&gt; and passes the stream of characters to it. Once all the characters are exhausted, we invoke the &lt;code&gt;does_match&lt;/code&gt; function on the FSM, which tells whether the given &lt;code&gt;text&lt;/code&gt; matches the regex &lt;code&gt;ab*c&lt;/code&gt; or not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;grep_regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FSM&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;does_match&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;grep_regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"abc"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;grep_regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aba"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The entire execution runs purely sequentially - and that's because of coroutines. All states seem to run in parallel, but they are all executing concurrently in a single thread. The coroutine of the current state is executing while all the others are suspended at their corresponding &lt;code&gt;yield&lt;/code&gt; statements. When a new input is sent to a coroutine, it is unblocked, completes its execution, changes the current state of the FSM, and pauses itself at its &lt;code&gt;yield&lt;/code&gt; statement again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  More FSMs
&lt;/h1&gt;

&lt;p&gt;We have seen how intuitive it is to build regular-expression FSMs using Python coroutines. If our hypothesis holds, things should be equally intuitive when implementing FSMs for other use cases, so here we take a look at two more examples and see how a state is implemented in each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Divisibility by 3
&lt;/h2&gt;

&lt;p&gt;Here we build an FSM that tells whether a number, fed in as a stream of digits, is divisible by 3. The state machine is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f8wsRunP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79641628-564ae000-81b6-11ea-9c84-147cae3a30a6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f8wsRunP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79641628-564ae000-81b6-11ea-9c84-147cae3a30a6.png" alt="div3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can implement the state &lt;code&gt;q1&lt;/code&gt; as a coroutine as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_q1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;digit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt;  &lt;span class="n"&gt;digit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt;  &lt;span class="n"&gt;digit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt;  &lt;span class="n"&gt;digit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We can see the similarity between the coroutine implementation and the transition function for a state. The entire implementation of this FSM can be found at &lt;a href="https://github.com/arpitbbhayani/fsm/blob/master/divisibility-by-3.ipynb"&gt;arpitbbhayani/fsm/divisibility-by-3&lt;/a&gt;.&lt;/p&gt;
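&lt;p&gt;As a compact, self-contained sketch of the same machine, the three hand-written state coroutines can be collapsed into one parameterized factory. The observation is that 10 % 3 == 1, so appending a digit &lt;code&gt;d&lt;/code&gt; moves remainder &lt;code&gt;r&lt;/code&gt; to &lt;code&gt;(r + d) % 3&lt;/code&gt;. The class and function names here are illustrative, not the notebook's:&lt;/p&gt;

```python
def prime(fn):
    # advance a coroutine to its first `yield` upon creation
    def wrapper(*args, **kwargs):
        g = fn(*args, **kwargs)
        next(g)
        return g
    return wrapper

class DivisibilityFSM:
    """Sketch: q0, q1 and q2 track the remainder (mod 3) of the
    number seen so far; q0 (remainder 0) is the accepting state."""
    def __init__(self):
        self.q0 = self._create_state(0)
        self.q1 = self._create_state(1)
        self.q2 = self._create_state(2)
        self.states = {0: self.q0, 1: self.q1, 2: self.q2}
        self.current_state = self.q0

    @prime
    def _create_state(self, remainder):
        while True:
            digit = yield
            # 10 % 3 == 1, so appending digit d takes
            # remainder r to (r + d) % 3
            self.current_state = self.states[(remainder + digit) % 3]

def is_divisible_by_3(digits):
    fsm = DivisibilityFSM()
    for d in digits:
        fsm.current_state.send(d)
    return fsm.current_state is fsm.q0

print(is_divisible_by_3([1, 2, 3]))  # 123 is divisible by 3 -> True
print(is_divisible_by_3([1, 4]))     # 14 is not -> False
```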

&lt;h2&gt;
  
  
  SQL Query Validator
&lt;/h2&gt;

&lt;p&gt;Here we build an FSM for a SQL query validator which, given a SQL query, tells whether it is valid. An FSM covering all SQL queries would be massive, hence we deal with just the subset where we support the following SQL queries&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * from TABLE_NAME;
SELECT column, [...columns] from TABLE_NAME;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fBxNhd5R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79635523-1c1a1800-818f-11ea-8afe-fe8065b55791.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fBxNhd5R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79635523-1c1a1800-818f-11ea-8afe-fe8065b55791.png" alt="fsm for sql query validator"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can implement the state &lt;code&gt;explicit_cols&lt;/code&gt; as a coroutine as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_explicit_cols&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'from'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_clause&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;','&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;more_cols&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Again, the coroutine through which the state is implemented closely mirrors the state's transition function, keeping things intuitive. The entire implementation of this FSM can be found at &lt;a href="https://github.com/arpitbbhayani/fsm/blob/master/sql-query-validator.ipynb"&gt;arpitbbhayani/fsm/sql-query-validator&lt;/a&gt;.&lt;/p&gt;
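&lt;p&gt;One practical detail: unlike the character-by-character regex FSM, this machine consumes whole tokens, so a query has to be tokenized before being fed to it. A deliberately naive sketch (the linked notebook may tokenize differently):&lt;/p&gt;

```python
def tokenize(query):
    # naive tokenizer: pad punctuation with spaces so that `,` and `;`
    # become standalone tokens, then split on whitespace
    for punct in (",", ";"):
        query = query.replace(punct, f" {punct} ")
    return query.split()

print(tokenize("SELECT id, name from USERS;"))
# ['SELECT', 'id', ',', 'name', 'from', 'USERS', ';']
```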

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Even though this may not be the most efficient way to build an FSM, it is certainly one of the most intuitive. The edges and state transitions translate well into &lt;code&gt;if&lt;/code&gt; and &lt;code&gt;elif&lt;/code&gt; statements or decision functions, and while each state is modeled as an independent coroutine, everything still happens sequentially. The entire execution is like a relay race where the baton of execution is passed from one coroutine to another.&lt;/p&gt;

&lt;h1&gt;
  
  
  References and Readings
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Finite-state_machine"&gt;Finite State Machines - Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://brilliant.org/wiki/finite-state-machines/"&gt;Finite State Machines - Brilliant.org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.cs.ucdavis.edu/~rogaway/classes/120/spring13/eric-applications.pdf"&gt;FSM Applications&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://realpython.com/lessons/what-are-python-coroutines/"&gt;What Are Python Coroutines?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://realpython.com/introduction-to-python-generators/"&gt;How to Use Generators and yield in Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Other articles you might like:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/bayesian-average"&gt;Solving an age-old problem using Bayesian Average&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/better-programmer"&gt;Eight rituals to be a better programmer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/rule-30"&gt;Pseudorandom numbers using Cellular Automata - Rule 30&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/function-overloading"&gt;Overload Functions in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arpitbhayani.me/blogs/isolation-forest"&gt;Isolation Forest algorithm for anomaly detection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on my &lt;a href="https://arpitbhayani.me/blogs/fsm"&gt;blog - Building Finite State Machines with Python Coroutines&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you liked what you read, subscribe to my newsletter and get the post delivered directly to your inbox and give me a shout-out &lt;a href="https://twitter.com/arpit_bhayani"&gt;@arpit_bhayani&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arpit.substack.com"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WpJKYG2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/4745789/79068776-07e59f00-7ce7-11ea-8eff-3918556a3682.png" alt="Subscribe to Arpit's newsletter"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>fsm</category>
    </item>
  </channel>
</rss>
