rodbv

Posted on Mar 29 • Edited on Mar 31

Djangonaut diaries, week 4: Eliminating a Redundant Index in Django's ORM

#djangonaut #django #database

Hello again!

Last week, we discussed issue #22125, to prevent a redundant database index from being created on Many-to-many "through" tables. We geeked out a lot about composite indexes, reproduced the issue, and now it's time to send a patch for it.

One interesting thing about this ticket, is that there was in fact a PR created addressing it, 8 years ago! A few years later, it was closed due to inactivity.

That PR had some aspects that make a lot of sense, most importantly, it sets db_index=False in the function create_many_to_many_intermediary_model, which seems to be the change we need to do. Let's understand what happens there.

How are many-to-many models defined in Django?

As I've explained in the previous article, a many-to-many (m2m) relationship normally needs an intermediary table to store these links. For example if I have model Book and a model Author, and I want my app to allow a book to have several authors and an author to have several books, I will need a through-table to have the unique pairs (book_id, author_id).

Specifying this m2m relationship in code is straightfoward:

class Book(models.Model):
    # ...other book fields
    authors = models.ManyToManyField(Author, related_name="books")

What's curious here is that while every Django model normally has one backing database table, in this case we will have a third table besides books and authors which is not behind a model we declared. It will be a model and table defined by the framework itself, unless we tell it otherwise.

This is achieved using some Python metaprogramming, more specifically the type function which allows a class to be defined dynamically.

In create_many_to_many_intermediary_model:

def create_many_to_many_intermediary_model(field, klass):
    # ...some code
    name = "%s_%s" % (klass._meta.object_name, field.name)

    # Construct and return the new class.
    return type(
        name,
        (models.Model,),
        {
            "Meta": meta,
            "__module__": klass.__module__,
            from_: models.ForeignKey(
                klass,
                related_name="%s+" % name,
                db_tablespace=field.db_tablespace,
                db_constraint=field.remote_field.db_constraint,
                db_index=False, # <- our change
                on_delete=CASCADE,
            ),
            to: models.ForeignKey(
                to_model,
                related_name="%s+" % name,
                db_tablespace=field.db_tablespace,
                db_constraint=field.remote_field.db_constraint,
                on_delete=CASCADE,
            ),
        },
    )

This function receives two parameters, the field name which is defining the m2m relationship, authors, and the model class in which it was defined, Book. Based on this it defines a name for the class it wants to create dynamically, Book_authors, and the two fields it will need, from and to.

In the from field is where we can add our change to flip the default behavior of foreign keys, which is to create an index.

Does this change work?

Before we move forward, it's good to check: does this change actually work, at least on the happy path?

Let's run the migration with the change on and off, and compare the SQL produced. Without our change, we get this block when we run makemigrations and sqlmigrate over SQLite:

CREATE TABLE "blog_book_authors" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT, "book_id" integer NOT NULL REFERENCES "blog_book" ("id") DEFERRABLE INITIALLY DEFERRED, "author_id" integer NOT NULL REFERENCES "blog_author" ("id") DEFERRABLE INITIALLY DEFERRED);
CREATE UNIQUE INDEX "blog_book_authors_book_id_author_id_0a5bb3b3_uniq" ON "blog_book_authors" ("book_id", "author_id");
CREATE INDEX "blog_book_authors_book_id_35eae5ed" ON "blog_book_authors" ("book_id");
CREATE INDEX "blog_book_authors_author_id_fa034e3d" ON "blog_book_authors" ("author_id");

The second line from bottom is the interesting one, it's defining an index blog_book_authors_book_id_35eae5ed for book_id on our through-table, blog_book_authors.

Now with db_index=False:

CREATE TABLE "blog_book_authors" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT, "book_id" bigint NOT NULL REFERENCES "blog_book" ("id") DEFERRABLE INITIALLY DEFERRED, "author_id" bigint NOT NULL REFERENCES "blog_author" ("id") DEFERRABLE INITIALLY DEFERRED);
CREATE UNIQUE INDEX "blog_book_authors_book_id_author_id_0a5bb3b3_uniq" ON "blog_book_authors" ("book_id", "author_id");
CREATE INDEX "blog_book_authors_author_id_fa034e3d" ON "blog_book_authors" ("author_id");
COMMIT;

Nice, so our change works, the happy path is checked.

Updating tests

Now that we made the change and it seems to work, let's see the test coverage for it. I could search for the related functions and classes to run a subset of the tests, but since running the full test suite doesn't take that long, let's give a shot. Ideally, one or more tests will fail, and we can start by reviewing them. Let's run python tests/runtests.py and see how it goes

............................................................................s......................................................................................................................................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 18412 tests in 220.346s

Interesting, all tests passed. So there's no specific test check for the creation of the FK index for the first field of an m2m table, which is fine, as the creation of FK indexes are well covered and this is just another FK index which happens to be in a m2m model.

Looking at that PR which was closed a few years ago, we get a good hint: In tests/schema/tests.py, there are lots of tests related to how m2m artifacts are created. We can add a new test there, similar to the one in the original PR, but that also looks similar to the current approach we see in tests nearby, which has evolved since.

Let's create a test helper _test_m2m_create_skips_redundant_source_fk_index(self, M2MFieldClass) that receives different variations of m2m fields that inherit from RelatedField, which is where we did our change: ManyToManyField, CustomManyToManyField and InheritedManyToManyField.

def _test_m2m_create_skips_redundant_source_fk_index(self, M2MFieldClass):
    class LocalBookWithM2M(Model):
        tags = M2MFieldClass("TagM2MTest", related_name="books")

        class Meta:
            app_label = "schema"
            apps = new_apps

    self.local_models = [LocalBookWithM2M]

    with connection.schema_editor() as editor:
        editor.create_model(TagM2MTest)
        editor.create_model(LocalBookWithM2M)

    through = LocalBookWithM2M._meta.get_field("tags").remote_field.through
    through_table = through._meta.db_table

    # ensure index for localbookwithm2m_id was NOT created
    self.assertNotIn("localbookwithm2m_id", self.get_indexes(through_table))
    self.assertIn("tagm2mtest_id", self.get_indexes(through_table))

    constraints = self.get_constraints(through_table).values()
    unique_columns = [
        constraint["columns"] for constraint in constraints if constraint["unique"]
    ]
    self.assertIn(["localbookwithm2m_id", "tagm2mtest_id"], unique_columns)

Our change is being checked in the line where we assert that an index for localbookwithm2m_id was NOT created in the through-table, check the comment.

To run the tests in this file only, we can pass the test class name using the -k param.

> python tests/runtests.py -k SchemaTests

sssssssssssssssssssssssssssssssssssssssssss....s.sss....................sss....s.ss.....s.ss.ss.sss..s......ss.s.s..................s........s......................ss..s.s......................s..s....s.s.ss.........................s.....ss.........s........s..sss..s...sss
----------------------------------------------------------------------
Ran 273 tests in 18.881s

Cool, test passed. Now it's crucial to "undo" our change and ensure the related tests fail, otherwise our test is a false-positive, i.e. a test that always passes and therefore has no value.

So we commented out the line with db_index=False...

from_: models.ForeignKey(
     klass,
     related_name="%s+" % name,
     db_tablespace=field.db_tablespace,
     db_constraint=field.remote_field.db_constraint,
     # db_index=False,
     on_delete=CASCADE,
),

... and re-run the tests:

    ^^^^^^^^^^^^^^^^^
  File "/Users/rodrigo/.pyenv/versions/3.12.8/lib/python3.12/unittest/case.py", line 715, in fail
    raise self.failureException(msg)
    ^^^^^^^^^^^^^^^^^
AssertionError: 'localbookwithm2m_id' unexpectedly found in ['tagm2mtest_id', 'localbookwithm2m_id']

----------------------------------------------------------------------
Ran 273 tests in 18.994s

FAILED (failures=3, skipped=90)

Cool! Tests failed, so they work as a regression test. Now we can reintroduce our code change, make sure tests pass again, and we're close to send our PR.

One thing I want to do is adding a comment on the line we're setting db_index=False. There's a chance the reviewers will ask to remove this comment, but personally I like to add comments on unusual or non-obvious code. In the future, one may find that line with db_index=False and wonder why that's the case; after all, we normally expect every FK to have an index for it. So pointing to the ticket can be helpful. We'll see if others agree, this kind of preference depends a lot on the culture of the team/project.

from_: models.ForeignKey(
    klass,
    related_name="%s+" % name,
    db_tablespace=field.db_tablespace,
    db_constraint=field.remote_field.db_constraint,
    on_delete=CASCADE,
    # Ticket #22125: index is redundant with
    # unique_together (from_, to).
    db_index=False,
),

Ok, time to send the PR Fixed #22125 - Unnecessary creation of index for ManyToManyField.! My second!

Uh oh, some tests failed on CI

Oh well. Remember we had all test passing locally? That's not what happened when the PR ran its CI pipeline. Some tests failed for Postgres and Postgis.

So what do we have here? On top, I see that the test that failed was running Postgres. Then, we see the failing test is called test_tablespace_for_many_to_many_field - name makes sense - and that it was expecting to find 2 instances of something but found just one.

Let's dig into this test. I added a comment on where they failed.

@skipUnlessDBFeature("supports_tablespaces")
def test_tablespace_for_many_to_many_field(self):
    sql = sql_for_table(Authors).lower()
    # ... more code

    sql = sql_for_index(Authors).lower()
    # The ManyToManyField declares no db_tablespace, its indexes go to
    # the model's tablespace, unless DEFAULT_INDEX_TABLESPACE is set.
    if settings.DEFAULT_INDEX_TABLESPACE:
        self.assertNumContains(sql, settings.DEFAULT_INDEX_TABLESPACE, 2) # <-- failed here
    else:
        self.assertNumContains(sql, "tbl_tbsp", 2) # <- and here
    self.assertNumContains(sql, "idx_tbsp", 0)

So this is why we had no failures running tests locally on the default database (SQLite), but it failed on CI: this test only runs when the database backend supports Tablespaces (check the decorator on top), which is the case for Postgres and Postgis in the CI.

We could just set the value to what's expected and see how it goes, but it's important to understand a bit more what's going on. First of all, what's a tablespace? From Postgres docs:

Tablespaces in PostgreSQL allow database administrators to define locations in the file system where the files representing database objects can be stored. Once created, a tablespace can be referred to by name when creating database objects.
By using tablespaces, an administrator can control the disk layout of a PostgreSQL installation.

So it's a way to tell the db manager to store some of its files in different disks, which seems handy for huge databases.

I'd love to debug these tests and see the values being passed for these assertions, to understand what's going on. I could run Postgres locally, or set it up to run on Docker, but there's an easier way...

Enter django-docker-box

Django-docker-box can be better explained by its README, which includes this:

🧪 Test supported core database backends — MariaDB, MySQL, Oracle, PostgreSQL, SQLite
🌍 Test supported core geospatial backends — MariaDB, MySQL, Oracle, PostGIS, SpatiaLite

Among many other awesome things, it allows us to easily run tests on all databases supported by Django, one docker-compose command away, lovely.

To run it with Postgres is pretty easy, and we can pass the same parameters we normally pass to runtests.py to limit to the tests we care about, with -k TablespacesTests. I also set --parallel 1, to avoid spawning lots of processes to run only 5 tests.

❯ docker compose run --rm postgresql --parallel=1 -k TablespacesTests
[+]  4/4t 4/44
...
Testing against Django installed in '/django/source/django'
Found 5 test(s).
Creating test database for alias 'default'...
System check identified no issues (46 silenced).
.F.ss
======================================================================
FAIL: test_tablespace_for_many_to_many_field (model_options.test_tablespaces.TablespacesTests.test_tablespace_for_many_to_many_field)
...
AssertionError: 1 != 2 : Found 1 instances of 'tbl_tbsp', expected 2

----------------------------------------------------------------------
Ran 5 tests in 0.891s

FAILED (failures=1, skipped=2)

Nice! Tests failed just like in CI. Now we can pass the parameter --pdb that will break into debugging mode when the assertion fails, and have a look around. Knowing how to use pdb or ipdb is a very useful skill, worth checking in case you don't know them.

At this point, it's good to know that in these tests, there's a through table named "Authors" to link the "from" model Article and the "to" model Scientist.

This is a bit of the pdb session I had:

Opening PDB: AssertionError("1 != 2 : Found 1 instances of 'tbl_tbsp', expected 2")
> /django/source/tests/model_options/test_tablespaces.py(52)assertNumContains()
-> self.assertEqual(
(Pdb) l
 47             apps.all_models["model_options"] = self._old_models
 48             apps.clear_cache()
 49
 50         def assertNumContains(self, haystack, needle, count):
 51             real_count = haystack.count(needle)
 52  ->         self.assertEqual(
 53                 real_count,
 54                 count,
 55                 "Found %d instances of '%s', expected %d" % (real_count, needle, count),
 56             )
 57
(Pdb) u
> /django/source/tests/model_options/test_tablespaces.py(116)test_tablespace_for_many_to_many_field()
-> self.assertNumContains(sql, "tbl_tbsp", 2)
(Pdb) sql
'create index "model_options_articleref_authors_scientist_id_4ad98b60" on "model_options_articleref_authors" ("scientist_id") tablespace "tbl_tbsp"'
(Pdb)

First, I pressed l (as in "list") to list the code around the failure, and since it's in the asssertNumEqual function itself, I pressed u to "step out" and going to the outer scope, the test that called assertNumEqual. There, I could type the name of the variable being tested, sql, and got this

-> self.assertNumContains(sql, "tbl_tbsp", 2)
(Pdb) sql
'create index "model_options_articleref_authors_scientist_id_4ad98b60" on "model_options_articleref_authors" ("scientist_id") tablespace "tbl_tbsp"'
(Pdb)

So it's getting the SQL that was produced for the "to" model, Scientist, and not the "from" model, Article. To make sure our understanding is correct, let's undo our change on related.py and run again

(Pdb) sql
'create index "model_options_articleref_authors_article_id_727a3ebf" on "model_options_articleref_authors" ("article_id") tablespace "tbl_tbsp"
create index "model_options_articleref_authors_scientist_id_4ad98b60" on "model_options_articleref_authors" ("scientist_id") tablespace "tbl_tbsp"'

In this case, I see two indexes, one for article_id and one for scientist_id. So it makes sense, we should really reduce the assertion from 2 to 1 occurence, as the index for article_id will be gone. Nice!

Lowering the values on the tests and running them again got the tests to pass:

Found 5 test(s).
Creating test database for alias 'default'...
System check identified no issues (46 silenced).
...ss
----------------------------------------------------------------------
Ran 5 tests in 0.872s

Time to push this and see if the CI is happy (that takes a while, so go have some dinner, pet the cat ;))

Great! I think this is it by now, now I can make sure the PR is according to the guidelines etc and wait for it to be reviewed.

See you!

Top comments (2)

Annabelle Wiegart • Mar 30

Fantastic article @rodbv! I really like how you take us along on your Django ticket journey. I learned a lot, and I'm sure the post is helpful for your own learning, too! Such a good idea to document the solution process this way.

rodbv • Mar 30

Thank you so much!