
Django: Why you might want to set up extra DB level indexes for your ManyToMany relationships

László Károlyi ・5 min read

Preface: This is Django 1.11.17 and MariaDB 10.3.11.

Disclaimer: I'm in no way a MariaDB professional, but as with everything, I strive to know it better. Suggestions on the subject are welcome.

In one of my current projects, where the DB has grown significantly, I was clicking through pages and looking at SQL query speeds. One of them used a prefetch_related ManyToMany relationship which, when fed a huge data set, slowed down considerably.

The model is called SeatReservation, and it has a ManyToMany relation to AppliedDiscount. Something like this:

from django.db.models import ManyToManyField, Model


class TestModel(Model):
    pass

class AppliedDiscount(Model):
    pass

class SeatReservation(Model):
    ...
    test_model = ManyToManyField(to=TestModel)
    applied_discounts = ManyToManyField(to=AppliedDiscount)
    ...

Now, on the table generated for the AppliedDiscount relation, two indexes will be created:

CREATE TABLE `module_seatreservation_applied_discounts` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `seatreservation_id` int(11) NOT NULL,
  `applieddiscount_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unique_key` (`seatreservation_id`,`applieddiscount_id`),
  KEY `constraint_1` (`applieddiscount_id`),
  CONSTRAINT `constraint_1` FOREIGN KEY (`applieddiscount_id`) REFERENCES `module_applieddiscount` (`id`),
  CONSTRAINT `constraint_2` FOREIGN KEY (`seatreservation_id`) REFERENCES `module_seatreservation` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
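To double-check which indexes actually exist on the through table (before and after any manual changes), MariaDB can list them directly:

```sql
SHOW INDEX FROM `module_seatreservation_applied_discounts`;
```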

When selecting from the TestModel side with a prefetch_related for the AppliedDiscounts on the SeatReservation model, the prefetch_related query will look something like this:

SELECT
(`module_seatreservation_applied_discounts`.`seatreservation_id`) AS `_prefetch_related_val_seatreservation_id`, `module_applieddiscount`.`id`, `module_applieddiscount`.`name`, `module_applieddiscount`.`price`
FROM `module_applieddiscount` INNER JOIN `module_seatreservation_applied_discounts` ON (`module_applieddiscount`.`id` = `module_seatreservation_applied_discounts`.`applieddiscount_id`) WHERE `module_seatreservation_applied_discounts`.`seatreservation_id` IN (485696, 485697, 485693, 485694, 485695)

Depending on how much buffer memory your MariaDB has, and how many seatreservation_ids you shove into the query, MariaDB will choose different query plans.
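A side note on the "how many ids" part: Django sends all prefetch ids in a single IN list, but if you ever assemble such queries by hand, batching the ids is one way to stay in the fast-plan regime. A minimal sketch (the helper and its default batch size are my own, not something Django provides):

```python
def batched_ids(ids, batch_size=500):
    """Yield successive slices of `ids`, keeping each SQL IN (...) list small."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]
```

Each slice can then be interpolated into its own query, trading one huge statement for several small ones; whether that wins depends on your data, so measure with EXPLAIN first.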

Let's look at the EXPLAIN EXTENDED version of the exact query above:

+------+-------------+------------------------------------------+-------+-------------------------+------------+---------+------+------+----------+--------------------------------------------------------------+
| id   | select_type | table                                    | type  | possible_keys           | key        | key_len | ref  | rows | filtered | Extra                                                        |
+------+-------------+------------------------------------------+-------+-------------------------+------------+---------+------+------+----------+--------------------------------------------------------------+
|    1 | SIMPLE      | module_applieddiscount                   | ALL   | PRIMARY                 | NULL       | NULL    | NULL |    1 |   100.00 |                                                              |
|    1 | SIMPLE      | module_seatreservation_applied_discounts | range | unique_key,constraint_1 | unique_key | 4       | NULL |    5 |    80.00 | Using where; Using index; Using join buffer (flat, BNL join) |
+------+-------------+------------------------------------------+-------+-------------------------+------------+---------+------+------+----------+--------------------------------------------------------------+

So far so good: it executes very fast, since it doesn't have to use temporary tables. But fire off the same SQL query with thousands of seatreservation_ids, and you'll end up with a much different EXPLAIN EXTENDED result:

+------+--------------+------------------------------------------+------+--------------------------+--------------+---------+----------------------------+-------+----------+----------------+
| id   | select_type  | table                                    | type | possible_keys            | key          | key_len | ref                        | rows  | filtered | Extra          |
+------+--------------+------------------------------------------+------+--------------------------+--------------+---------+----------------------------+-------+----------+----------------+
|    1 | PRIMARY      | module_applieddiscount                   | ALL  | PRIMARY                  | NULL         | NULL    | NULL                       |     1 |   100.00 |                |
|    1 | PRIMARY      | <subquery2>                              | ALL  | distinct_key             | NULL         | NULL    | NULL                       |  1000 |   100.00 |                |
|    1 | PRIMARY      | module_seatreservation_applied_discounts | ref  | unique_key,constraint_1  | constraint_1 | 4       | module_applieddiscount.id  | 25846 |   100.00 | Using where    |
|    2 | MATERIALIZED | <derived3>                               | ALL  | NULL                     | NULL         | NULL    | NULL                       |  1000 |   100.00 |                |
|    3 | DERIVED      | NULL                                     | NULL | NULL                     | NULL         | NULL    | NULL                       |  NULL |     NULL | No tables used |
+------+--------------+------------------------------------------+------+--------------------------+--------------+---------+----------------------------+-------+----------+----------------+

As you can see, there is now a step that iterates through the entire table, which currently has 25846 records, for each result. This is where queries start to slow down drastically; I've seen queries run for 5 minutes because of this.

I have no idea why Django won't create an index for seatreservation_id, despite the bug report I found about the opposite problem (Django needlessly creating indexes on the first column of every unique index). Apparently, index creation will become optional (depending on the DB backend used) once the PR on GitHub is merged.

As one comment says:

The question of whether an index for the first column of a unique constraint is redundant, depends on how the database implements the unique constraint.

By empirical testing (see above), it seems that in my case the index on seatreservation_id wouldn't be redundant, as MariaDB doesn't use the unique index for the query.

So I added an index manually for seatreservation_id:

ALTER TABLE module_seatreservation_applied_discounts ADD INDEX new_index (seatreservation_id);
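If you'd rather generate that statement from code (for instance to reuse the exact table and column names Django reports), simple string building suffices; the function name and signature here are my own:

```python
def add_index_sql(db_table: str, column: str, index_name: str = 'new_index') -> str:
    """Build the MariaDB ALTER TABLE statement that adds a secondary index."""
    return (
        f'ALTER TABLE `{db_table}` '
        f'ADD INDEX `{index_name}` (`{column}`)'
    )
```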

...and executed the same EXPLAIN EXTENDED select above, with 1000 IDs requested:

+------+--------------+------------------------------------------+------+-----------------------------------+-----------+---------+--------------+------+----------+----------------+
| id   | select_type  | table                                    | type | possible_keys                     | key       | key_len | ref          | rows | filtered | Extra          |
+------+--------------+------------------------------------------+------+-----------------------------------+-----------+---------+--------------+------+----------+----------------+
|    1 | PRIMARY      | module_applieddiscount                   | ALL  | PRIMARY                           | NULL      | NULL    | NULL         |    1 |   100.00 |                |
|    1 | PRIMARY      | <subquery2>                              | ALL  | distinct_key                      | NULL      | NULL    | NULL         | 1000 |   100.00 |                |
|    1 | PRIMARY      | module_seatreservation_applied_discounts | ref  | unique_key,constraint_1,new_index | new_index | 4       | tvc_0._col_1 |    1 |   100.00 | Using where    |
|    2 | MATERIALIZED | <derived3>                               | ALL  | NULL                              | NULL      | NULL    | NULL         | 1000 |   100.00 |                |
|    3 | DERIVED      | NULL                                     | NULL | NULL                              | NULL      | NULL    | NULL         | NULL |     NULL | No tables used |
+------+--------------+------------------------------------------+------+-----------------------------------+-----------+---------+--------------+------+----------+----------------+

See what happened? MariaDB immediately started using the index I just added, and the query now needs only one row lookup per result instead of a full table scan. A significant improvement, and one that is visible at the production level as well.

How you apply this key manually at the SQL level is up to you. I did it by putting custom logic into the apps.py of the module that contains the models.py, which executes on Django checks (and a local runserver).

So what are the takeaways?

  1. If you're big on speed like I am, keep your DB memory settings intentionally low in production if you can afford it, while observing response times for various views; that way, slow query plans surface early instead of being masked by caching. MariaDB gets 2 GB in my production setup, despite the machine having 32 GB, with around 20 GB always free.
  2. Observe your production log. I use multitail, with a uWSGI syntax definition I created for it.
  3. If a view takes long, evaluate the SQL query times in it first (django-debug-toolbar is your friend).
  4. Strive to know your application inside out, from bottom to top. This involves knowing the underlying DB, the modules your project uses, and all the external services. This is my one-man-army approach: I build greenfield projects top to bottom, project management and training included.
  5. Have no fear of diving into the source code of Django and the modules you use. This is how you find bugs, improve your code, and even contribute to the open-source projects themselves.

Of course, this only applies if you have the time and will to actively support your ongoing projects. If that's not the case, just throw a large amount of memory at MariaDB and put fires out as they appear. In my experience, those fires are far bigger and harder to put out when you skip these endeavors.

Caching and optimizing my projects is one of my shticks. This project, for example, has 4 levels of caching above the DB:

  1. MariaDB query cache
  2. Redis storing processed data structures fetched from MariaDB
  3. ttl_cache() and lru_cache() over Redis on a per-request basis (the request is part of the cache key, in conjunction with the PK)
  4. An internal cache class used to store pk-object dictionaries, that's passed around in the logic.
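For illustration, layer 3 can be sketched in pure Python. This is a simplified stand-in for a cachetools-style ttl_cache, not the project's actual implementation; a real per-request key would also include things like the request's user PK:

```python
import time
from functools import wraps


def ttl_cache(ttl=60.0):
    """Memoize a function, expiring each entry `ttl` seconds after it was stored."""
    def decorator(func):
        store = {}  # maps positional args -> (expiry timestamp, cached value)

        @wraps(func)
        def wrapper(*args):  # positional args only, to keep the key hashable
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: serve from cache
            value = func(*args)
            store[args] = (now + ttl, value)
            return value
        return wrapper
    return decorator
```

Wrapping a function that fetches processed structures from Redis with such a decorator means repeated lookups within one request (or TTL window) never leave the process.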

This heavy caching means that even views dealing with large amounts of data execute fast.

If you managed to read this far, kudos to you. This is the abrupt end of the post.


Discussion


Very interesting. I think I might be hitting the same issue as you had with a join between two fairly large tables. I think that nothing else explains the speed issues I'm facing, and why PostgreSQL's planner is ignoring the indices I do have. Care to elaborate a bit more on this point?

The way of putting this key manually on the SQL level varies by developer, I'll leave it to you. I did it with putting custom logic into my apps.py on the module that contains the models.py, which will execute on Django checks (and local runserver).

 

Oh, I almost forgot to reply to you.

In my apps.py, under the Django module that contains the models.py with the model, I added the following functions, the result of experimenting until I found the right way of doing it. It basically assembles the SQL for creating the index in MariaDB:

from django.apps import AppConfig
from django.conf import settings
from django.core.checks import Info, Tags, register
from django.db import connection
from django.db.models.signals import post_migrate
from django.db.utils import OperationalError
from django.utils.translation import ugettext_lazy as _


def add_applieddiscount_sr_index() -> list:
    """
    As discussed in the link below, add an index to `seatreservation_id`
    on the `ManyToMany` `AppliedDiscount` relation.
    https://dev.to/karolyi/django-why-you-might-want-to-set-up-extra-db-level-indexes-for-your-manytomany-relationships-1ecg
    """
    from .models import SeatReservation
    errors = []
    m2m_model = SeatReservation.applied_discounts.through
    db_table = m2m_model._meta.db_table
    field_name = m2m_model._meta.get_field('seatreservation').column
    cursor = connection.cursor()
    try:
        cursor.execute(
            f'ALTER TABLE `{db_table}` ADD INDEX sr_id_index (`{field_name}`)')
    except OperationalError as exc:
        if exc.args[0] == settings.MARIADB_INDEX_ALREADY_EXISTS_CODE:
            # Index already exists (error 1061, ER_DUP_KEYNAME); nothing to do
            return errors
        raise
    errors.append(Info(msg='Added index on SeatReservation.applied_discounts'))
    return errors


def post_migrate_ensure_defaults(**kwargs):
    """
    Run the default models installer when it is our application's
    `post_migrate` signal.
    """
    if not isinstance(kwargs['sender'], BookingConfig):
        # Running from migration post-signal, but not from our app's
        # signal
        return
    return ensure_defaults()


def ensure_defaults(**kwargs):
    'Run the default setter and maintenance functions.'
    errors = []
    errors.extend(add_applieddiscount_sr_index())
    return errors


class BookingConfig(AppConfig):
    name = 'project.booking'
    verbose_name = _('Project: Booking')
    label = 'project_booking'

    def ready(self):
        'Fix database, run maintenance scripts, etc ...'
        post_migrate.connect(
            post_migrate_ensure_defaults,
            dispatch_uid=f'{self.label}.ensure_defaults'
        )
        register(ensure_defaults, Tags.models)
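Worth noting: if your MariaDB is recent enough, the IF NOT EXISTS clause (a MariaDB extension to ALTER TABLE ... ADD INDEX) makes the statement idempotent, so the error-code handling above can be sidestepped entirely:

```sql
ALTER TABLE `module_seatreservation_applied_discounts`
  ADD INDEX IF NOT EXISTS `sr_id_index` (`seatreservation_id`);
```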

Also, you'll have to modify the __init__.py next to the apps.py to contain the following lines:

from django.conf import settings

default_app_config = 'project.booking.apps.BookingConfig'

Where the path in default_app_config is the python import path to your apps.py.