DEV Community

Michał Knasiecki
Michał Knasiecki

Posted on • Updated on

Easy tips to reduce MongoDB database size in no time

In this series of articles, I show how we can reduce database size and improve it’s performance just by using the right structures and proper data types. In the previous post, we were checking how json structure may influence database storage size. The results were promising, so it's worth taking a broader look at the topic. Therefore what else can we do to reduce the size of data on disk?

Enumerated types

Now, let's also look at the enumerated types. You can store them in two ways, either by using a label:

{"_id": 1, "source": "ALLEGRO"}
Enter fullscreen mode Exit fullscreen mode

or by ordinal value:

{"_id": 1, "source": 1}
Enter fullscreen mode Exit fullscreen mode

Here an analogy with the first experiment can be found: again we replace a character string with a numerical value.
Since we got a great result the first time, maybe we could repeat it here?

label vs index size

Unfortunately, the result is disappointing and the reduction in size is small. However, if we think more deeply, we will come to the conclusion that it could not have been otherwise. The enumerated types are not unique identifiers. We are dealing only with a few possible values, appearing many times in the whole collection, so it’s a great chance for MongoDB data compression to prove its worth. Most likely the values from the first collection have been automatically replaced by the engine to its corresponding ordinal values. Additionally, we don't have any profit on the schema here, because they are the same in both cases. The situation might be different for collections that do not have data compression enabled.

This is a good time to take a closer look at how snappy compression works in MongoDB. I've prepared two more collections, with identical data, but with data compression turned off. The results are shown in the chart below, compiled together with the collections with data compression turned on.

snappy vs plain size

It is clear that the use of an ordinal value instead of a label of the enumerated type brings considerable profit for collections with data compression disabled. In case of lack of compression it is definitely worth considering using numerical values.

Useless _class field

If we use spring-data, each of our collections additionally contains a _class field storing the package and the name of the entity class containing the mapping for documents from this collection. This mechanism is used to support inheritance of model classes, that is not widely used. In most cases this field stores exactly the same values for each document what makes it useless. And how much does it cost? Let's compare the following documents:

{"_id": 1}
Enter fullscreen mode Exit fullscreen mode

and

{"_id": 1, "_class": "pl.allegro.some.project.domain.sample"}
Enter fullscreen mode Exit fullscreen mode

_class field size

The difference is considerable, over 50%. Bearing in mind that the compression is enabled, I believe that the impact of the data itself is small and the result is caused by the schema of the collection containing only the key being half as small as the one with the key (1 field vs 2 fields). After removal of the _class field from a document with more fields, the difference expressed in percent will be much smaller of course. However, storing useless data does not make sense.

Useless indexes

When investigating the case with identifiers stored as strings, I checked whether manipulating the data type could reduce the index size. Unfortunately I did not succeed, so I decided to focus on the problem of redundant indexes.

It is usually a good practice to cover each query with an index so that the number of collscan operations is as small as possible. This often leads to situations where we have too many indexes. We add more queries, create new indexes for them, and often it turns out that this newly created index takes over the queries of another one. As a result, we have to maintain many unnecessary indexes, wasting disk space and CPU time.

Fortunately, it is possible to check the number of index uses with the query:

db.COLLECTION_NAME.aggregate([{$indexStats:{}}])
Enter fullscreen mode Exit fullscreen mode

It allows you to find indexes that are not used so you can safely delete them.

And finally, one more thing. Indexes can be removed quickly, but they take much longer to rebuild. This means that the consequences of removal of an index by mistake can be severe. Therefore it is better to make sure that the index to be deleted is definitely not used by any query. The latest MongoDB 4.4 provides a command that helps to avoid errors:

db.COLLECTION_NAME.hideIndex(<index>)
Enter fullscreen mode Exit fullscreen mode

The above-mentioned command hides the index from the query optimizer. It does not take this index into account when building the query execution plan, but the index is still updated when modifying documents. Thanks to that, if it turns out that it was needed, we are able to restore it immediately and it will still be up-to-date.

Conclusion

Using a few simple techniques, preparing several versions of the schema and using stats() command we are able to design a model which does not overload our infrastructure. I encourage you to experiment with your own databases. Maybe you too can save some disk space and CPU time.

Top comments (0)