<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Guillaume Gautreau</title>
    <description>The latest articles on DEV Community by Guillaume Gautreau (@ghusse).</description>
    <link>https://dev.to/ghusse</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F225737%2F30033869-4ab4-4fa2-81f5-d8ddca8e5086.png</url>
      <title>DEV Community: Guillaume Gautreau</title>
      <link>https://dev.to/ghusse</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ghusse"/>
    <language>en</language>
    <item>
      <title>Large documents in redis: is it worth compressing them (Part 2)</title>
      <dc:creator>Guillaume Gautreau</dc:creator>
      <pubDate>Wed, 14 Jul 2021 20:06:09 +0000</pubDate>
      <link>https://dev.to/ghusse/large-documents-in-redis-is-it-worth-compressing-them-part-2-4m7j</link>
      <guid>https://dev.to/ghusse/large-documents-in-redis-is-it-worth-compressing-them-part-2-4m7j</guid>
      <description>&lt;p&gt;In a previous post, we measured that compressing large JSON documents before sending them to redis was faster than sending them as is. I made the measurement on my own computer and a local redis database.&lt;/p&gt;

&lt;p&gt;Now that the principle has been validated, I needed to know if this result could be replicated on an environment that is closer to the production environment we have at &lt;a href="https://www.forestadmin.com" rel="noopener noreferrer"&gt;Forest Admin&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;On a server, I ran the same benchmark:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server performance is the same as in production (except for the load induced by other requests)&lt;/li&gt;
&lt;li&gt;The redis server is comparable to the one that is used in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ⬇️ Download speed comparison
&lt;/h2&gt;

&lt;p&gt;In this first graph, we will compare the &lt;strong&gt;download performance from redis + decompression time&lt;/strong&gt; of the same JSON documents, using four different methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uncompressed JSON document&lt;/li&gt;
&lt;li&gt;Compressed brotli-1&lt;/li&gt;
&lt;li&gt;Compressed gzip-3&lt;/li&gt;
&lt;li&gt;Compressed deflate-3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These algorithms appeared to be the fastest of their family during my first tests. This second test in a &lt;em&gt;production-like&lt;/em&gt; environment confirmed it, which is why I decided not to publish the same level of detail as &lt;a href="https://dev.to/ghusse/large-documents-in-redis-does-it-worth-compressing-them-part-1-59k9"&gt;last time&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ue4cxfqjk9979hfj6uz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ue4cxfqjk9979hfj6uz.png" title="Overall download duration with the different methods" alt="Overall download duration with the different methods"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Performance is very similar for documents smaller than 4 MB, but the difference becomes significant above that size.&lt;/p&gt;

&lt;p&gt;For larger documents, all compression methods have similar performance regarding the time it takes to download and decompress them.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⬆️ Upload speed comparison
&lt;/h2&gt;

&lt;p&gt;The same protocol was applied to all algorithms. As in the download performance comparison, we compare the same challengers that were selected in the previous test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fse34vg67ot8fn9ydafbr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fse34vg67ot8fn9ydafbr.png" title="Overall upload duration with the different methods" alt="Overall upload duration with the different methods"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When writing JSON documents larger than 4 MB, the results show that compressing them is worth it. For 10 MB documents, the overall compression-plus-upload time is roughly halved, from just under 300 ms down to about 150 ms with brotli-1.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏅 Brotli-1: the winner in production-like environment
&lt;/h2&gt;

&lt;p&gt;This second test shows that brotli-1 makes a significant difference in both upload and download performance for documents larger than 4 MB.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://www.forestadmin.com" rel="noopener noreferrer"&gt;Forest Admin&lt;/a&gt;, we need to store some JSON documents that are larger than 4 MB, and we will definitely try this solution in production, using a canary deployment.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;space saving ratio&lt;/strong&gt; of the brotli-1 algorithm has been measured to be &lt;strong&gt;more than 90%&lt;/strong&gt;, on the types of documents that we are storing. So, in addition to allowing faster data transfer, this solution will also &lt;strong&gt;save a lot of space&lt;/strong&gt; on our redis instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚠ Be sure to test the algorithm on a production-like environment
&lt;/h2&gt;

&lt;p&gt;Be careful before using this solution in your own environment, because results can vary a lot with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the type of document&lt;/li&gt;
&lt;li&gt;the compression algorithm&lt;/li&gt;
&lt;li&gt;the compression level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, &lt;code&gt;brotli&lt;/code&gt; with the default compression level is very effective in terms of space saving, but also &lt;strong&gt;very slow&lt;/strong&gt; during the compression phase.&lt;/p&gt;

&lt;p&gt;When testing in a production-like environment, some algorithms also measured slower than the original solution of sending plain JSON documents, as you can see below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5etg5qv6j01o4hnzfqib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5etg5qv6j01o4hnzfqib.png" title="Overall download duration with deflate" alt="Overall download duration with deflate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compressing documents with deflate and a compression level of 0 was faster than using no compression on my laptop, but became slower on a real server.&lt;/p&gt;

</description>
      <category>redis</category>
      <category>performance</category>
    </item>
    <item>
      <title>Large documents in redis: is it worth compressing them (Part 1)</title>
      <dc:creator>Guillaume Gautreau</dc:creator>
      <pubDate>Fri, 28 May 2021 15:36:21 +0000</pubDate>
      <link>https://dev.to/ghusse/large-documents-in-redis-does-it-worth-compressing-them-part-1-59k9</link>
      <guid>https://dev.to/ghusse/large-documents-in-redis-does-it-worth-compressing-them-part-1-59k9</guid>
      <description>&lt;p&gt;At &lt;a href="https://www.forestadmin.com"&gt;Forest Admin&lt;/a&gt;, we build admin panels for which we need to compute and cache &lt;strong&gt;large JSON documents&lt;/strong&gt;. These documents are stored in &lt;strong&gt;redis&lt;/strong&gt; and retrieved from this storage in order to be as fast as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  🐘 The problem with large JSON documents: latency
&lt;/h2&gt;

&lt;p&gt;Some of these JSON documents can weigh more than 20 MB. Storing and retrieving such large documents introduces latency in our services, as the data needs to be uploaded or downloaded through the network.&lt;/p&gt;

&lt;p&gt;So, every time we need to retrieve these documents from the server, we download up to 20 MB from redis and parse the received JSON. &lt;/p&gt;

&lt;p&gt;Looking at performance logs in production, I could see that storing and retrieving these documents cost us a significant amount of time.&lt;/p&gt;

&lt;p&gt;I wanted to test whether it would be worth uploading compressed JSON documents to redis instead of plain JSON content.&lt;/p&gt;

&lt;p&gt;That's why I created &lt;a href="https://github.com/ghusse/evaluate-redis-compression"&gt;a repository on github&lt;/a&gt; with some code to &lt;strong&gt;run benchmarks&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🕗 Protocol
&lt;/h2&gt;

&lt;p&gt;I want to evaluate the total time of pushing a document and retrieving it from redis, with 5 different implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;none&lt;/code&gt;: &lt;code&gt;JSON.stringify()&lt;/code&gt; + push the result to redis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;brotli&lt;/code&gt;: &lt;code&gt;JSON.stringify()&lt;/code&gt; + &lt;code&gt;brotli&lt;/code&gt; + push the buffer to redis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deflate&lt;/code&gt;: &lt;code&gt;JSON.stringify()&lt;/code&gt; + &lt;code&gt;deflate&lt;/code&gt; + push the buffer to redis&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;gzip&lt;/code&gt;: &lt;code&gt;JSON.stringify()&lt;/code&gt; + &lt;code&gt;gzip&lt;/code&gt; + push the buffer to redis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://msgpack.org/index.html"&gt;&lt;code&gt;msgpack&lt;/code&gt;&lt;/a&gt;: &lt;code&gt;msgpack.pack()&lt;/code&gt; + push the buffer to redis&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each compression algorithm will be evaluated with every possible compression level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;brotli&lt;/code&gt;: 1 to 11&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deflate&lt;/code&gt;: -1 to 9&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gzip&lt;/code&gt;: -1 to 9&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;JSON documents will be retrieved from a redis server, based on a key pattern. For my own test, I will use &lt;strong&gt;real JSON documents&lt;/strong&gt; of various sizes that we handle in production.&lt;/p&gt;

&lt;p&gt;The benchmark is run against a redis server, and in my case, it'll be a &lt;strong&gt;local server running in docker&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Finally, these tests are run on my MacBook Pro (2.8 GHz Intel Core i7), on Node 14.&lt;/p&gt;

&lt;h2&gt;
  
  
  📈  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Brotli compression
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7haI3Wac--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d1dixlifyhnffsea0ozh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7haI3Wac--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d1dixlifyhnffsea0ozh.png" alt="Overall download duration with brotli" title="Overall download duration with brotli"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that a brotli-compressed document is always faster to retrieve and decompress than a large uncompressed document.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WK3wbOVi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/33aoha56r8imaq90ancd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WK3wbOVi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/33aoha56r8imaq90ancd.png" alt="Overall upload duration with brotli" title="Overall upload duration with brotli"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These first results show that &lt;strong&gt;⚠ brotli is very slow&lt;/strong&gt; during compression, when used with quality levels of &lt;strong&gt;10&lt;/strong&gt; and &lt;strong&gt;11&lt;/strong&gt;. That's something worth noting as &lt;strong&gt;11 is the default value&lt;/strong&gt; in node 14.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;document upload&lt;/strong&gt;, using &lt;strong&gt;brotli with its default quality&lt;/strong&gt; is actually slower than not compressing at all.&lt;/p&gt;

&lt;p&gt;I plotted another graph without these 2 values for brotli compression, to be able to compare quality levels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gDyTozid--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/00w071apo7g1n0cgw8i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gDyTozid--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/00w071apo7g1n0cgw8i4.png" alt="Overall upload duration with brotli (without 10 and 11)" title="Overall upload duration with brotli without quality levels of 10 and 11"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this 3&lt;sup&gt;rd&lt;/sup&gt; plot, we can see that every other quality level yields a &lt;strong&gt;better upload time&lt;/strong&gt; for our documents to redis.&lt;/p&gt;

&lt;p&gt;We also can see that the minimal compression level has the best possible performance in the conditions of my experiments. It might change for instance with a remote redis server.&lt;/p&gt;

&lt;h3&gt;
  
  
  To summarize
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Download:

&lt;ul&gt;
&lt;li&gt;🏅 brotli-11 for download performance&lt;/li&gt;
&lt;li&gt;✅ every compression level has better performance&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Upload:

&lt;ul&gt;
&lt;li&gt;🏅 brotli-1 for upload performance&lt;/li&gt;
&lt;li&gt;✅ compression levels up to 9 are faster than no compression&lt;/li&gt;
&lt;li&gt;❌ brotli-10 and brotli-11 are &lt;strong&gt;slower&lt;/strong&gt; than no compression&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Potential winner:

&lt;ul&gt;
&lt;li&gt;🏅 &lt;strong&gt;brotli-1&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deflate compression
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n4SCiA6_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l1diomgay3ozgytmvfp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n4SCiA6_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l1diomgay3ozgytmvfp5.png" alt="Overall download duration with deflate" title="Overall download duration with deflate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deflate shows the same results as brotli: compressed documents are always faster to retrieve from redis than uncompressed ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5h7Lb0HO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l3bw9uzj2saauoel2mb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5h7Lb0HO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l3bw9uzj2saauoel2mb7.png" alt="Overall upload duration with deflate" title="Overall upload duration with deflate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Likewise, it's always faster to upload compressed documents to redis with deflate, whatever compression level we choose.&lt;/p&gt;

&lt;p&gt;We can see that &lt;strong&gt;deflate 3&lt;/strong&gt; seems to be the fastest for uploading documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  To summarize
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Download:

&lt;ul&gt;
&lt;li&gt;✅ Results are very similar across compression levels, and we cannot spot a clear winner from the results&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Upload:

&lt;ul&gt;
&lt;li&gt;🏅 deflate-3 seems to be slightly faster than other compression levels&lt;/li&gt;
&lt;li&gt;✅ all compression levels are faster to use than nothing at all&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Potential winner:

&lt;ul&gt;
&lt;li&gt;🏅 &lt;strong&gt;deflate-3&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gzip compression
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZAIjyqx0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7dji8ecbwy20q3o4zfag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZAIjyqx0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7dji8ecbwy20q3o4zfag.png" alt="Overall download duration with gzip" title="Overall download duration with gzip"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With gzip too, it's always faster to retrieve compressed documents from redis than uncompressed documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8br_VCTv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sq9pe6y18q83xz4cie32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8br_VCTv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sq9pe6y18q83xz4cie32.png" alt="Overall upload duration with gzip" title="Overall upload duration with gzip"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using gzip to upload documents is also always faster, but there is a clear difference with the stronger compression levels (8 &amp;amp; 9), which are a little slower.&lt;/p&gt;

&lt;h3&gt;
  
  
  To summarize
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Download:

&lt;ul&gt;
&lt;li&gt;✅ Results are very similar across compression levels, and we cannot spot a clear winner from the results&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Upload:

&lt;ul&gt;
&lt;li&gt;🏅 gzip-3 seems to be slightly faster than other compression levels&lt;/li&gt;
&lt;li&gt;✅ all compression levels are faster to use than nothing at all&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Potential winner:

&lt;ul&gt;
&lt;li&gt;🏅 &lt;strong&gt;gzip-3&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  msgpack
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://msgpack.org/index.html"&gt;&lt;code&gt;msgpack&lt;/code&gt;&lt;/a&gt; library serializes JavaScript objects into a smaller form than the standard &lt;code&gt;JSON.stringify&lt;/code&gt;. It is documented as slower than JSON, but I wanted to measure whether the serialization time was compensated by a faster transfer.&lt;/p&gt;

&lt;p&gt;It also has the advantage of transforming objects into a buffer in one operation, unlike the solutions based on a compression algorithm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;msgpack transforms an object into a buffer&lt;/li&gt;
&lt;li&gt;solutions with compression require two steps: one to serialize to JSON, another to compress the resulting string.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dH3sYLL0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xqbysisf6elbu6zaiac2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dH3sYLL0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xqbysisf6elbu6zaiac2.png" alt="Overall download duration with msgpack" title="Overall download duration with msgpack"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results are not good for msgpack in my local environment: it seems &lt;strong&gt;always slower to use msgpack&lt;/strong&gt; than just sending plain JSON documents to my redis server.&lt;/p&gt;

&lt;p&gt;⚠ This library can still produce interesting results when the network gets slower. In my situation, it cannot be faster than solutions based on compression, because the space saving obtained with msgpack (around 60%) is far lower than what compression algorithms achieve (more than 90%).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vUHMXa-N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n604o0a8ns80aeish865.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vUHMXa-N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n604o0a8ns80aeish865.png" alt="Overall upload duration with msgpack" title="Overall upload duration with msgpack"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upload results are not good for msgpack either.&lt;/p&gt;

&lt;h3&gt;
  
  
  To summarize
&lt;/h3&gt;

&lt;p&gt;❌ msgpack is slower than the original implementation both for upload &amp;amp; download&lt;/p&gt;

&lt;h2&gt;
  
  
  Final match: brotli-1 vs deflate-3 vs gzip-3
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T7wYWaD4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m5arwwzdw51emn2zzfba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T7wYWaD4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m5arwwzdw51emn2zzfba.png" alt="Overall download duration with best compression algorithms" title="Overall download duration with the best compression algorithms"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results here are very similar between compression algorithms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PssAizJ2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mwvvgkjj81wimlarye18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PssAizJ2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mwvvgkjj81wimlarye18.png" alt="Overall upload duration with best compression algorithms" title="Overall download upload with the best compression algorithms"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Regarding upload time, &lt;strong&gt;brotli-1&lt;/strong&gt; seems to have a clear advantage over other algorithms.&lt;/p&gt;

&lt;h2&gt;
  
  
  And the winner is: 🏅 &lt;strong&gt;brotli-1&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Next step: validate these results in an environment that is closer to production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;on an instance that has similar performance to production instances&lt;/li&gt;
&lt;li&gt;with a remote redis instance, similar to the one used in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These results will be shared in a part 2 of this series!&lt;/p&gt;

</description>
      <category>redis</category>
      <category>compression</category>
      <category>performance</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>Git tip: get back to work after a revert on master </title>
      <dc:creator>Guillaume Gautreau</dc:creator>
      <pubDate>Sat, 24 Apr 2021 22:02:28 +0000</pubDate>
      <link>https://dev.to/ghusse/git-tip-get-back-to-work-after-a-revert-on-master-2gji</link>
      <guid>https://dev.to/ghusse/git-tip-get-back-to-work-after-a-revert-on-master-2gji</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8ozl4wOj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/903inki3qfkgkp8efgeq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8ozl4wOj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/903inki3qfkgkp8efgeq.jpg" alt="Git diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Shit happens.&lt;/p&gt;

&lt;p&gt;Sometimes, a &lt;strong&gt;sneaky bug&lt;/strong&gt; hides itself in the beautiful change you worked on. It even flies under the unit tests' radar and tiptoes past manual testing unnoticed.&lt;/p&gt;

&lt;p&gt;Now, this nasty &lt;strong&gt;bug&lt;/strong&gt; is live in production and &lt;strong&gt;EVERYONE notices it&lt;/strong&gt;. You have to &lt;strong&gt;revert&lt;/strong&gt; your changes from master. 😢 &lt;/p&gt;

&lt;h2&gt;
  
  
  ↩ The revert
&lt;/h2&gt;

&lt;p&gt;Ok, this is the time where you revert your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Assuming that you have to create a PR&lt;/span&gt;
&lt;span class="c"&gt;# for the revert&lt;/span&gt;
git checkout master
git pull
git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; fix/revert-superb-change
git revert HASH-OF-MERGE-COMMIT
git push &lt;span class="nt"&gt;-u&lt;/span&gt; origin fix/revert-superb-change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once your PR gets approved and merged, the revert cancels everything that was in your cool change.&lt;/p&gt;

&lt;h2&gt;
  
  
  👷 Work on a fix
&lt;/h2&gt;

&lt;p&gt;At this point, the easiest thing to do is to just make a fix on the branch containing all your changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout feat/superb-change
&lt;span class="c"&gt;# Work on a fix&lt;/span&gt;
&lt;span class="c"&gt;# ...&lt;/span&gt;
&lt;span class="c"&gt;# ...&lt;/span&gt;
&lt;span class="c"&gt;# ...&lt;/span&gt;
git commit &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"fix: sneaky bug"&lt;/span&gt;
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  😨 OMG, if I merge master on my branch, I lose almost all my work
&lt;/h2&gt;

&lt;p&gt;Here's the thing: when you want to prepare your branch to be merged into master again, you'll face another problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Master contains a commit that &lt;strong&gt;removes&lt;/strong&gt; the work from your branch.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you merge master into your feature branch as usual, it will actually remove a large proportion of your changes on your branch. 🤯 &lt;/p&gt;

&lt;h2&gt;
  
  
  🚒 Merge mastery to the rescue
&lt;/h2&gt;

&lt;p&gt;This is the trick: you can &lt;strong&gt;tell git&lt;/strong&gt; that a particular commit &lt;strong&gt;has been merged&lt;/strong&gt; &lt;strong&gt;without&lt;/strong&gt; actually &lt;strong&gt;merging it&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout feat/superb-change

&lt;span class="c"&gt;# This will allow you to apply all &lt;/span&gt;
&lt;span class="c"&gt;# changes between your first merge &lt;/span&gt;
&lt;span class="c"&gt;# and the revert, if any&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# ⚠ It's important to carefully choose&lt;/span&gt;
&lt;span class="c"&gt;# the commit JUST BEFORE the revert commit&lt;/span&gt;
git merge HASH-OF-COMMIT-JUST-BEFORE-REVERT

&lt;span class="c"&gt;# This is how you tell git to merge&lt;/span&gt;
&lt;span class="c"&gt;# without really merging the revert&lt;/span&gt;
git merge HASH-OF-REVERT-COMMIT &lt;span class="nt"&gt;--strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ours
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The option &lt;code&gt;--strategy=ours&lt;/code&gt; tells git to keep all our &lt;strong&gt;current changes&lt;/strong&gt; when merging. &lt;/p&gt;

&lt;p&gt;It means that it will only &lt;strong&gt;record&lt;/strong&gt; the merge &lt;strong&gt;without changing anything&lt;/strong&gt; in your branch.&lt;/p&gt;

&lt;p&gt;It's important to note that you should first merge all changes made &lt;strong&gt;before the revert&lt;/strong&gt; in order to correctly apply them. This way, only the revert is merged without applying its changes to your code.&lt;/p&gt;

&lt;p&gt;Once that's done, you can proceed as usual:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Will merge all changes made after the revert&lt;/span&gt;
git merge master
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now, your branch is ready to be merged into master, with all your changes!&lt;/p&gt;




&lt;p&gt;&lt;small&gt;Thanks to &lt;a href="https://gist.github.com/bryanbraun"&gt;@bryanbraun&lt;/a&gt; for his awesome &lt;a href="https://gist.github.com/bryanbraun/8c93e154a93a08794291df1fcdce6918"&gt;git diagram template&lt;/a&gt;.&lt;/small&gt;&lt;/p&gt;

</description>
      <category>git</category>
    </item>
    <item>
      <title>Extract-Transform-Load with RxJS: save time and memory with backpressure</title>
      <dc:creator>Guillaume Gautreau</dc:creator>
      <pubDate>Tue, 20 Apr 2021 13:22:20 +0000</pubDate>
      <link>https://dev.to/ghusse/extract-transform-load-with-rxjs-save-time-and-memory-with-backpressure-jaa</link>
      <guid>https://dev.to/ghusse/extract-transform-load-with-rxjs-save-time-and-memory-with-backpressure-jaa</guid>
      <description>&lt;p&gt;Let's say that you have to &lt;strong&gt;extract 100M objects&lt;/strong&gt; from a database, make some &lt;strong&gt;transformations&lt;/strong&gt; on them and then &lt;strong&gt;load&lt;/strong&gt; them into &lt;strong&gt;another storage system&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Problems arise as soon as writing into the second DB becomes &lt;strong&gt;slower&lt;/strong&gt; than reading from the first. Depending on the implementation, you could face one of these issues: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extracted data stacks up in your memory, and your program crashes because of the &lt;strong&gt;memory usage&lt;/strong&gt;;&lt;/li&gt;
&lt;li&gt;you send &lt;em&gt;too many requests&lt;/em&gt; in parallel to your target database;&lt;/li&gt;
&lt;li&gt;your program is &lt;strong&gt;slow&lt;/strong&gt; because you process each page of data in sequence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At &lt;a href="https://www.forestadmin.com"&gt;Forest Admin&lt;/a&gt;, we recently faced this issue to move data from a Postgresql database to ElasticSearch.&lt;/p&gt;

&lt;p&gt;These problems can be addressed by processing data in streams that support &lt;strong&gt;backpressure&lt;/strong&gt;. It allows the stream to process data &lt;strong&gt;at the pace of the slowest&lt;/strong&gt; asynchronous processing in the chain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rxjs.dev/"&gt;RxJS&lt;/a&gt; is a great streaming library, but it does not natively support backpressure, and it's not easy to find examples. So, I decided to share one.&lt;/p&gt;

&lt;h2&gt;Let's illustrate with an example&lt;/h2&gt;

&lt;p&gt;Let's fake the extract method just for the purpose of this article:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageSize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Just fake an async network access that&lt;/span&gt;
  &lt;span class="c1"&gt;// resolves after 200ms&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageSize&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="nx"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageSize&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Random label &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Random title &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The transform method could be asynchronous, but that's not useful in this example, so it simply returns its input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now, let's fake the load method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="c1"&gt;// Let's fake an async network access that takes&lt;/span&gt;
  &lt;span class="c1"&gt;// max 150ms to write all the items&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; 
    &lt;span class="nx"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Example of backpressure in RxJS&lt;/h2&gt;

&lt;p&gt;The backpressure is ensured by the &lt;code&gt;BehaviorSubject&lt;/code&gt; named &lt;code&gt;drain&lt;/code&gt; in the example below. You'll see that the code allows pushing data concurrently to the target database, with a &lt;strong&gt;limit of 5&lt;/strong&gt; requests in parallel.&lt;/p&gt;

&lt;p&gt;Input data is also extracted concurrently, but this time the pace is regulated by the &lt;code&gt;drain&lt;/code&gt; subject. Every time a page has been sent to the target database, we allow another one to be extracted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BehaviorSubject&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rxjs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;mergeMap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rxjs/operators&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;extractTransformLoad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CONCURRENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PAGE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// This allows us to load a fixed number&lt;/span&gt;
  &lt;span class="c1"&gt;// of pages from the beginning&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;drain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;BehaviorSubject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;CONCURRENCY&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;drain&lt;/span&gt;
    &lt;span class="c1"&gt;// This is necessary because the observable&lt;/span&gt;
    &lt;span class="c1"&gt;// streams arrays. This allows us to push&lt;/span&gt;
    &lt;span class="c1"&gt;// a fixed number of pages to load from &lt;/span&gt;
    &lt;span class="c1"&gt;// the beginning&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mergeMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;// Values inside the arrays don't really matter&lt;/span&gt;
    &lt;span class="c1"&gt;// we only use values indices to generate page&lt;/span&gt;
    &lt;span class="c1"&gt;// numbers&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;// EXTRACT&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mergeMap&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;PAGE_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="c1"&gt;// Terminate if it was an empty page = the last page&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tap&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;// TRANSFORM and LOAD&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mergeMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CONCURRENCY&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;// Just make sure to not keep results in memory&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;// When a page has been processed, allow to extract&lt;/span&gt;
    &lt;span class="c1"&gt;// a new one&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tap&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toPromise&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above, we set the concurrency to 5, meaning that 5 requests can be sent to the target database at the same time. To reduce the time spent waiting for new data, the &lt;code&gt;BehaviorSubject&lt;/code&gt; named &lt;code&gt;drain&lt;/code&gt; ensures that twice as many pages of data are extracted in advance.&lt;/p&gt;

&lt;p&gt;In this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory holds at most 10 pages of data;&lt;/li&gt;
&lt;li&gt;processing runs as fast as possible, at the maximum concurrency we defined;&lt;/li&gt;
&lt;li&gt;at most 5 queries are made in parallel to the target database.&lt;/li&gt;
&lt;/ul&gt;
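&lt;p&gt;As a closing sketch, the same bounded, pull-based idea can be written without RxJS, using a fixed pool of async workers. This is only an illustration under our own naming: &lt;code&gt;run&lt;/code&gt;, &lt;code&gt;worker&lt;/code&gt; and the fake &lt;code&gt;extract&lt;/code&gt;/&lt;code&gt;transform&lt;/code&gt;/&lt;code&gt;load&lt;/code&gt; below are not taken from the RxJS example above:&lt;/p&gt;

```javascript
// Minimal backpressure sketch without RxJS (hypothetical names):
// a fixed pool of workers pulls page numbers one at a time, so at
// most CONCURRENCY pages are in flight (and in memory) at once.

// Fakes standing in for real I/O (20 pages of 3 items each):
const extract = async (page) => (page > 20 ? [] : new Array(3).fill(page));
const transform = (items) => items;
const load = async (items) => new Promise((r) => setTimeout(r, 5));

async function run() {
  const CONCURRENCY = 5;
  let nextPage = 1;
  const processedPages = [];

  async function worker() {
    for (;;) {
      const page = nextPage++;           // pull the next page number
      const items = await extract(page); // EXTRACT
      if (items.length === 0) return;    // empty page = last page, stop
      await load(transform(items));      // TRANSFORM + LOAD
      processedPages.push(page);
    }
  }

  // Each worker only extracts a new page after it has finished
  // loading the previous one, which bounds memory and parallelism.
  await Promise.all(new Array(CONCURRENCY).fill().map(() => worker()));
  return processedPages;
}
```

&lt;p&gt;With the fakes above, &lt;code&gt;run()&lt;/code&gt; resolves once all 20 pages have been loaded, while never holding more than 5 pages in memory.&lt;/p&gt;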

</description>
      <category>javascript</category>
      <category>node</category>
      <category>rxjs</category>
      <category>streams</category>
    </item>
  </channel>
</rss>
