Optimizing a Full-Text Search Engine - Compression

About three months ago I was presented with a problem: I wanted to be able to search for podcasts based on an episode's transcript instead of its title, tags, etc. However, searching large amounts of text gets very hard very quickly. Normal database engines (MySQL, MongoDB, PostgreSQL, etc.) did not provide the level of performance I was looking for. I wanted something that was extremely fast and easy to deploy without having to write much code.

So, to solve my dilemma I decided to take matters into my own hands and build my own solution: a full-text indexing and search engine server that could handle large quantities of information and provide near-instant search results.

Since Fist has been in development, a few people have taken an interest and have greatly helped advance the project to the next level. We are moving fast toward the goal of creating a production-ready full-text search engine that is lightweight and easy to deploy.

The problem is that it is not there yet. There are still issues, quite a few of them, that make Fist unfit for any kind of production use. In order to take Fist to that next level we need to make it lighter, faster, and far more efficient.

There are currently three pressing problems that Fist faces. The first is that the index file saved to disk grows too big too fast. The second is that large numbers of simultaneous requests are not processed quickly enough. The third is that the indexing algorithm is too slow.

In this three-part blog series we will be implementing the necessary changes to help alleviate these problems.

We will be:

  1. Implementing index file compression
  2. Optimizing request handling
  3. Optimizing indexing

This post, which is the first in the series, will go over how we can use liblzf to compress our index file before it is saved to disk. This greatly increases the amount of index data we can store on our system without eating unnecessary amounts of storage space.

liblzf is an incredibly small data compression library. The entire library consists of four files: two .c files and two .h files. This makes it easily embeddable inside a codebase while still keeping things lightweight. Pretty much, it's exactly what we are looking for.

LZF compression is a very fast and efficient data compression algorithm. It is used by both Redis and OpenVPN for their data compression needs. It has been proven to work well in real production environments and is a good choice if things need to stay lightweight.
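
The whole public API boils down to two calls, lzf_compress() and lzf_decompress(), both of which return the number of bytes they produced, or 0 on failure. Below is a minimal sketch of how they are used; it assumes liblzf's source files are compiled alongside it, and the sample input string is just an illustration.

#include <stdio.h>
#include <string.h>
#include "lzf.h" // public header from liblzf

int main(void) {
    // Repetitive input compresses well, just like the words in an inverted index
    const char *input = "the quick brown fox jumps over the lazy dog "
                        "the quick brown fox jumps over the lazy dog";
    unsigned int in_len = (unsigned int)strlen(input);

    unsigned char compressed[256];
    char restored[256];

    // lzf_compress returns the compressed size, or 0 if the output buffer is too small
    unsigned int c_len = lzf_compress(input, in_len, compressed, sizeof(compressed));
    if(c_len == 0) {
        printf("Input did not fit in the output buffer\n");
        return 1;
    }

    // lzf_decompress returns the decompressed size, or 0 on error
    unsigned int d_len = lzf_decompress(compressed, c_len, restored, sizeof(restored));
    printf("original: %u bytes, compressed: %u bytes, restored: %u bytes\n",
           in_len, c_len, d_len);
    return 0;
}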

Fist currently does not use any kind of data compression when saving the database file to disk. In the beginning this made it easier to implement the index serialization and write it to disk. We are past that point now, though, and the rate at which the index grows is starting to become a problem.

Since I am an avid listener of the Joe Rogan Experience podcast, I thought it would be a good source of real data for Fist. Episodes are often upwards of three hours long, so there is a lot of dialog that can be indexed. Fortunately for me, Aaron Hendrick was kind enough to open source the transcripts for a bunch of JRE episodes for us to use: https://github.com/achendrick/jrescribe-transcripts

I threw a Python script together that parsed through the transcript files, extracted just the dialog, and indexed it in Fist using fistpy. Each file that contained an episode's transcript (some didn't have one) was roughly 120KB - 170KB in size.

The script I used is below. It requires that the jrescribe-transcripts folder be in the same directory as where it is run and it depends on the fistpy client library. It also requires Fist to be running locally on port 5575, which is the default port.

import os
import json
import re
import string
from fist.client import Fist

f = Fist("localhost", 5575)
files = os.listdir("jrescribe-transcripts")

bytes_indexed = 0

for fname in files:
    fpath = "jrescribe-transcripts/{}".format(fname)
    if not os.path.isfile(fpath):
        continue

    with open(fpath, 'r') as transcript_file:
        data = transcript_file.read()

    # Each transcript starts with a JSON front matter block delimited by --- markers
    data = data.replace("---json", '')
    data = data.split("---")
    try:
        json_data = json.loads(data[0])
    except Exception:
        print(f"Failed to read JSON data from {fname}")
        continue

    # Keep only the dialog: strip layout tags, timemarks, punctuation, and newlines
    script = data[1].lower()
    remove_strings = ["<episode-header />", "<episode-footer />", "<transcribe-call-to-action />", '\n']

    for string_to_remove in remove_strings:
        script = script.replace(string_to_remove, '')
    for i in re.findall("<timemark seconds=\"[0-9]+\" />", script):
        script = script.replace(i, '')
    for p in string.punctuation:
        script = script.replace(p, '')

    bytes_indexed += len(script)
    f.index(f"{fname} {script}")

print(f"Indexed {bytes_indexed} bytes")

After indexing just 358 of the 1395 files available the index was already a whopping 607MB. This didn't come as too much of a surprise, and luckily there is a lot of room for improvement. Since our index naturally contains a lot of duplicate words, LZF should be able to do a very good job of compressing the information.

Below is what the serialization function looks like before implementing compression.

void sdump(hashmap *hmap) {
    // Write size of hashmap to file. (# keys)

    FILE *dump = fopen("fist.db", "wb");

    uint32_t num_indices = 0;

    for(int i = 0; i < HMAP_SIZE; i++) {
        // Get number of indices that have values
        hashmap on = hmap[i];
        if(on.length > 0)
            num_indices++;
    }

    fwrite(&num_indices, sizeof(num_indices), 1, dump);
    // Iterate through hashmap and write key and array of values to file

    for(int i = 0; i < HMAP_SIZE; i++) {
        hashmap on = hmap[i];
        if(on.length > 0) {
            for(int key = 0; key < on.length; key++) {
                keyval object = on.maps[key];
                uint32_t length = object.key.length;
                // Writes key length and key name to db file
                fwrite(&length, sizeof(length), 1, dump);
                fwrite(dtext(object.key), object.key.length, 1, dump);

                uint32_t num_values = object.values.length;
                // Writes number of values associated with key to db file
                fwrite(&num_values, sizeof(num_values), 1, dump);
                for(int value = 0; value < object.values.length; value++) {
                    // Writes value to db file
                    dstring value_on = object.values.values[value];
                    uint32_t val_length = value_on.length;
                    fwrite(&val_length, sizeof(val_length), 1, dump);
                    fwrite(dtext(value_on), value_on.length, 1, dump);
                }
            }
        }
    }
    fclose(dump);
}

The function, named sdump(), is very simple. It takes in a pointer to a hashmap and serializes the data into a custom binary format that will ultimately get written to a file on the disk named fist.db. The goal here is to take the data that would be written and compress it before it is written.
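
For reference, here is a rough sketch of the on-disk layout those fwrite() calls produce, reconstructed from the code above (the integers are written straight from memory, so they use the host's byte order):

uint32_t num_indices                 // number of hashmap slots that contain at least one key
// then, repeated for every key in every occupied slot:
uint32_t key_length
char     key[key_length]             // the indexed word, no NUL terminator
uint32_t num_values
// then, repeated num_values times:
uint32_t value_length
char     value[value_length]         // a document/phrase associated with the word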

In order to do this we have to make a few changes to sdump() and implement a new function called sdump_compress(). The binary data is first written to a temporary file, that file is then read back and compressed, and finally the compressed data is written to disk. The code for this is below.

void sdump_compress(unsigned char *data, uint64_t original_size) {
    FILE *compressed = fopen("fist.db", "wb");
    if(compressed == NULL) {
        perror("Could not open fist.db for writing. DB file will not be saved.");
        return;
    }
    // Store the uncompressed size in the first 8 bytes so decompression knows
    // how large a buffer to allocate
    fwrite(&original_size, sizeof(original_size), 1, compressed);

    char *buffer;
    if((buffer = malloc(original_size * 3)) == NULL) {
        perror("Could not allocate memory during compression. DB file will not be saved.");
        fclose(compressed);
        return;
    }
    long size;
    if(!(size = lzf_compress(data, original_size, buffer, original_size * 3))) {
        printf("Compression error. DB file will not be saved.\n");
        fclose(compressed);
        free(buffer);
        return;
    }
    fwrite(buffer, size, 1, compressed);
    fclose(compressed);
    free(buffer);
}

void sdump(hashmap *hmap) {
    // Write binary data to a temporary file, load the temp file into memory, compress it, save it
    // to disk.

    FILE *dump = tmpfile();

    if(dump == NULL) {
        perror("Could not create tmpfile during sdump. DB file will not be saved.");
        return;
    }

    uint32_t num_indices = 0;

    for(int i = 0; i < HMAP_SIZE; i++) {
        // Get number of indices that have values
        hashmap on = hmap[i];
        if(on.length > 0)
            num_indices++;
    }

    fwrite(&num_indices, sizeof(num_indices), 1, dump);
    // Iterate through hashmap and write key and array of values to file

    for(int i = 0; i < HMAP_SIZE; i++) {
        hashmap on = hmap[i];
        if(on.length > 0) {
            for(int key = 0; key < on.length; key++) {
                keyval object = on.maps[key];
                uint32_t length = object.key.length;
                // Writes key length and key name to db file

                fwrite(&length, sizeof(length), 1, dump);
                fwrite(dtext(object.key), object.key.length, 1, dump);

                uint32_t num_values = object.values.length;
                // Writes number of values associated with key to db file
                fwrite(&num_values, sizeof(num_values), 1, dump);
                for(int value = 0; value < object.values.length; value++) {
                    // Writes value to db file
                    dstring value_on = object.values.values[value];
                    uint32_t val_length = value_on.length;

                    fwrite(&val_length, sizeof(val_length), 1, dump);
                    fwrite(dtext(value_on), value_on.length, 1, dump);
                }
            }
        }
    }

    fseek(dump, 0, SEEK_END);
    uint64_t len = ftell(dump);
    fseek(dump, 0, SEEK_SET);
    unsigned char *buffer;
    if((buffer = malloc(len)) == NULL) {
        perror("Could not allocate memory during sdump. DB file will not be saved.");
        fclose(dump);
        return;
    }
    fread(buffer, 1, len, dump);

    sdump_compress(buffer, len);
    fclose(dump);
    free(buffer);
}

Only a few changes had to be made to sdump(). First, we now open a tmpfile() instead of fist.db for writing. We write the binary data to this temporary file, allocate a buffer the size of the file, and read the data back into it. That buffer then gets passed to the new function sdump_compress(), which compresses the binary data using lzf_compress() and writes it to a file named fist.db.

Notice in sdump_compress() that before writing the compressed binary information to fist.db we write the original size of the decompressed information to the first 8 bytes of the output file. This is needed during decompression.

Decompression is the same process in reverse. If you are interested in seeing how that works, you can view the entire serializer.c file in the Fist repository.
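
For completeness, here is a rough sketch of what the loading side has to do before deserializing: read the stored original size from the first 8 bytes, read the rest of the file into memory, and hand both to lzf_decompress() along with a buffer big enough to hold the uncompressed data. This is just the shape of the code, not the actual implementation from serializer.c.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "lzf.h"

// Sketch of the decompression step, not the actual serializer.c code.
// Returns a malloc'd buffer holding the uncompressed index data, or NULL on failure.
unsigned char *sload_decompress(const char *path, uint64_t *original_size) {
    FILE *compressed = fopen(path, "rb");
    if(compressed == NULL)
        return NULL;

    // First 8 bytes: the size of the data before compression, written by sdump_compress()
    if(fread(original_size, sizeof(*original_size), 1, compressed) != 1) {
        fclose(compressed);
        return NULL;
    }

    // The rest of the file is the LZF-compressed payload
    fseek(compressed, 0, SEEK_END);
    long compressed_size = ftell(compressed) - (long)sizeof(*original_size);
    fseek(compressed, sizeof(*original_size), SEEK_SET);

    unsigned char *in = malloc(compressed_size);
    unsigned char *out = malloc(*original_size);
    if(in == NULL || out == NULL || fread(in, 1, compressed_size, compressed) != (size_t)compressed_size) {
        free(in);
        free(out);
        fclose(compressed);
        return NULL;
    }
    fclose(compressed);

    // lzf_decompress returns the number of bytes recovered, or 0 on error
    if(lzf_decompress(in, compressed_size, out, *original_size) == 0) {
        free(in);
        free(out);
        return NULL;
    }
    free(in);
    return out;  // caller deserializes this buffer and frees it when done
}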

What's most important is how much this helped. Going back to our original test with the JRE podcast files, running the same Python script over the same number of transcript files produced a fist.db of ~300MB, as opposed to the 600MB+ file from the initial test. The file size was effectively cut in half. This isn't too surprising, since there is a lot of duplicate information being compressed.

Timing the new compression-enabled dump shows that it is roughly 45% slower than the old one. However, saving to disk only happens at two-minute intervals and when the server is stopped. This does not have any impact on indexing or search speed, so it is acceptable for now. The dramatic decrease in file size also makes up for the decrease in speed.

The speed issue could be helped by not using a temporary file at all, but to save time and code this implementation was chosen.
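
For example, on POSIX systems the temporary file could be swapped for an in-memory stream created with open_memstream(), which hands back a FILE * backed by a growing heap buffer, so the existing fwrite() serialization code could be reused unchanged. Below is a minimal sketch of the idea, not something that has been benchmarked against Fist.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
    char *buffer = NULL;
    size_t len = 0;

    // open_memstream() gives us a FILE * whose contents accumulate in a
    // malloc'd buffer instead of on disk (POSIX.1-2008)
    FILE *dump = open_memstream(&buffer, &len);
    if(dump == NULL) {
        perror("open_memstream");
        return 1;
    }

    // The same fwrite() calls used in sdump() would work here unchanged
    uint32_t example = 42;
    fwrite(&example, sizeof(example), 1, dump);

    fclose(dump);  // finalizes buffer and len

    // buffer/len could now be handed straight to sdump_compress() instead of
    // reading the bytes back out of a tmpfile()
    printf("serialized %zu bytes in memory\n", len);
    free(buffer);
    return 0;
}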
