rodbv

Posted on Mar 21

Djangonaut diaries, week 3 - Working on an ORM issue

#djangonaut #django #opensource

Hi there!

The PR from the previous week is currently being reviewed, with some great suggestions from one of the Django fellows, Jacob Walls.

In the meantime I picked up one of the issues that were suggested by the navigator for this cohort, Akash, an ORM optimization issue titled "Unnecessary creation of index for ManyToManyField".

In summary, when Django creates the migrations for a many-to-many "through" table, it creates a database index for the leftmost column which is redundant as the database engine will favor the composite index.

Now, if you understood every word on this sentence and the ticket title, maybe this blog post is not for you, you can wait for the next one in which I actually fix the issue. I won't judge you.

...

What? you here still? Great!! Let's dig a bit on what are database indexes, how the most common index for relational databases work (balanced trees, aka b-trees), what are many-to-many (m2m) relations and junction tables (also called through-tables), and finally how "unique" indexes for m2m tables work.

Knowing the basics of all that, we will have the tools to work on a proper fix.

What is an index, anyway? Why not index everything?

So, we all know indexes from books: if I want to find a topic or a chapter in a book, instead of reading every page, I can check its index at the beginning or the end of the book, search for the topic there, get the page number and jump straight to that page. In a way, it's a trade-off: we're making the book a bit heavier and expensive (having more pages) in exchange for making it more practical to use.

Database indexes also help us find data quickly, they're shortcuts to find data and avoid table scans, which is when the database engine reads every row in a table to find what was requested.

An index is a separate data structure from the main table, optimized for speed and size. The most common type is the B-Tree (Balanced Tree), which maintains a consistent depth so that searching for any record takes roughly the same amount of time.

Just like book indexes, there's also a trade-off here: to make data retrieval faster, we will use a bit more disk space, and also make writing operations a bit slower, as we need to update not only the table but also the indexes affected. In other words, indexes are not "free", and having unnecessary indexes is a bad idea.

How indexes are used

In the diagram below, the database is searching for the book "Two Scoops of Django" (an excellent book btw!). It follows a specific logical path through the pages:

The search starts at the Root Page. The engine compares "Two Scoops" against the keys J and S. Since "T" comes after "S," it ignores the first two branches and follows the pointer to the third.

It lands on an Internal Page for the S-Z range. It sees another split and determines the title must be in the S-U section.

Finally, it arrives at the Leaf Page, which contains the actual index entries. it finds "Two Scoops of Django" and its associated rowid: 88, which is where the record is physically located on disk and can be retrieved for the user.

By traversing this tree, the database only reads a few pages from disk instead of scanning the entire table. Those who remember big-O notation, that's O(log N), which means much better :D

What about M2M relations, composite indexes and unique constraints?

Many-to-many relations, or M2M for short, are relationships where both parties involved can occur several times. For example, if you have a table for Books and a table for Authors, you need a third table to link them: a book can have one or more authors, and an author can have written one or more books. This middle table is called a through-table, or junction table, or a join table, or Larry (just kidding, not Larry).

One rule we usually want for our through-table is to enforce that the pairs (book_id, author_id) are unique. There is no reason to record twice that a specific book has a specific author!

To enforce this rule (aka constraint) efficiently, a Composite Index combining both IDs is needed. The database needs to check this index before every insert or update to ensure the pair doesn't already exist. It’s the same B-Tree structure we saw before, but instead of sorting by a single value, it sorts by tuples of (book_id, author_id).

The order of columns in this index is critical. Because the B-Tree is sorted, the intermediate pages will first split the data by the first value book_id. If there are multiple authors for that same book, it then splits those by the second value author_id. In the database world we call the first the "leftmost" column, and it's the most important of the pair, for the reasons we'll explain now.

Check the diagram below, see that when we search for Book ID 10 and author ID 5, it will first find the book, and then its authors.

So whenever I try to associate an author to a book, the database engine can use this index to quickly check if the relation already exists, and if so raise an error.

From this arrangement we may note two things:

If I want to search for all authors for a given book ID, this index is good enough, it can just get all nodes once it reaches the Book ID;
On the other hand, if I want to quickly search for books related to a specific author, this index won't help me at all. In this case, I would need a separate index on this table for author_id.

Back to the ticket

Ooookay, I hope you're still with me after this rabbit hole :D

Onto our ticket: it says that for M2M relations, Django is creating three indexes:

One for the first entity ID
One for the second entity ID
One for the pair (unique constraint index)

From the explanation above you can agree with the creator of the ticket that the first index is redundant, right?

But, let's see if this is actually true in practice, if a real database engine would optimize and just use the composite index if I search by book ID.

To do this, we can use a nice tool (mentioned in the ticket comments, which is how I learned about it) called SQL Fiddle. This tool allow us to provide code for tables and queries, pick a database engine, and by running a special SQL command called EXPLAIN QUERY, see how the database would perform the query, its "query plan". Very handy, especially because SQL Fiddle supports all relational databases supported by Django (SQLite, Postgres, MySQL, Oracle, MariaDB).

This is the test we did there. Note that, just like issue described, we created three indexes, one for each column and one for the composite index (unique constraint).

-- 1. Create the main tables
CREATE TABLE Book (
  ID INTEGER PRIMARY KEY AUTOINCREMENT,
  Title TEXT
);

CREATE TABLE Author (
  ID INTEGER PRIMARY KEY AUTOINCREMENT,
  Name TEXT
);

-- 2. Create the M2M through-table
CREATE TABLE Book_Authors (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  book_id INTEGER,
  author_id INTEGER,
  FOREIGN KEY (book_id) REFERENCES Book(ID),
  FOREIGN KEY (author_id) REFERENCES Author(ID)
);

-- 3. Add the indexes (Fixed Typo and Names)
CREATE INDEX idx_book_author_book_id ON Book_Authors(book_id);
CREATE INDEX idx_book_author_author_id ON Book_Authors(author_id);
CREATE UNIQUE INDEX idx_book_author_unique_book_id_author_id ON Book_Authors(book_id, author_id);

-- 4. Gather statistics
ANALYZE;

-- 5. THE TEST
EXPLAIN QUERY PLAN 
SELECT * FROM Book_Authors WHERE book_id = 10;

...and the result is...drumroll....

QUERY PLAN
`--SEARCH Book_Authors USING COVERING INDEX 
idx_book_author_unique_book_id_author_id (book_id=?)

Okay! So we can confirmed the thesis of the issue. We searched by book ID and SQLite opted to use the composite index despite having a dedicated index for book ID. If we use SQL Fiddle to toggle Postgres, Oracle and MariaDB, the same will happen (try it!)

If we search by Author, on the other hand, it will use the specific index for author:

EXPLAIN QUERY PLAN 
SELECT * FROM Book_Authors WHERE author_id = 5;

QUERY PLAN
`--SEARCH Book_Authors USING INDEX idx_book_author_author_id (author_id=?)

Is Django really producing the extra index?

Finally, it's good to "check the plug": is Django really creating three indexes, including the extra one for the leftmost column? if not, you may ask me why did I make you read such long article... well, at least you didn't have to write it!

To test on Django I will create the models. Note that the m2m relation is defined in the Book model.


class Author(models.Model):
    name = models.CharField(max_length=100)

    def __str__(self):
        return self.name


class Book(models.Model):
    title = models.CharField(max_length=200)
    authors = models.ManyToManyField(Author, related_name="books")
    published_date = models.DateField()

    def __str__(self):
        return f"{self.title}"

Now I can run the command to create the migration...

> python manage.py makemigrations

Migrations for 'blog':
  blog/migrations/0003_author_book.py
    + Create model Author
    + Create model Book

And use the command manage.py sqlmigrations to see the SQL commands that will produce the tables and indexes

... create table etc

CREATE UNIQUE INDEX "blog_book_authors_book_id_author_id_0a5bb3b3_uniq" ON "blog_book_authors" ("book_id", "author_id");
CREATE INDEX "blog_book_authors_book_id_35eae5ed" ON "blog_book_authors" ("book_id");
CREATE INDEX "blog_book_authors_author_id_fa034e3d" ON "blog_book_authors" ("author_id");

There it is! The redundant index on book_id, named blog_book_authors_book_id_35eae5ed.

Now that we know

Django creates 3 indexes for m2m through-tables, including one for the leftmost column
Database engines like SQLite, Postgres etc optimize usage of indexes and will prefer using the composite index when possible

...we can move on and find a fix to prevent the extra index from being created! But that's the for the next post, soon ;)

Cheers!

PS: If you want to know a bit more about other types of indexes, I have some slides for a talk I did for Python Cerrado (a regional conference here in Brazil) in 2024. This is how I got so nerdy about b-trees etc, to have a reason to travel and meet more Pythonistas ;)

PPS: Yes I used AI to draw these Mermaid diagrams, I'm not that good on Mermaid. And I learned that dev.to does not support Mermaid too :(