Dmitrii Doroshev
Originally published at doroshev.substack.com

The Cartesian Product Trap in Django ORM

I hit a performance wall at work recently after adding a single extra Count() to a Django queryset that looked harmless. Luckily, the problematic code didn't make it to production, but I had to think hard to figure out what went wrong and find a solution.


The Demo Models

All the following models and entities are made up by me to illustrate the problem.

Imagine we have a PostgreSQL database storing information about stores, their departments and employees, and the products each store sells:

[Database schema showing Store, Department, Employee, and Product relationships]

The DB tables may contain many more fields, but in our case we only care about the relations between the models.

So in Django, models.py would look like this:

from django.db.models import Model, CharField, ForeignKey, CASCADE

class Store(Model):
    name = CharField(max_length=200)
    # ...

class Department(Model):
    store = ForeignKey(
        Store, related_name="departments", on_delete=CASCADE
    )
    # ...

class Employee(Model):
    department = ForeignKey(
        Department, related_name="employees", on_delete=CASCADE
    )
    # ...

class Product(Model):
    store = ForeignKey(
        Store, related_name="products", on_delete=CASCADE
    )
    # ...

Seed data (roughly what bit me; see the seeding sketch below):

  • 2 stores,
  • 10 departments per store,
  • 500 employees per department - 10 000 employees overall,
  • 2500 products per store.
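A minimal seeding sketch, assuming the models live in an app called shop (matching the shop_ table prefix in the SQL later on); all fields except the foreign keys are omitted:

from shop.models import Department, Employee, Product, Store

for i in range(2):
    store = Store.objects.create(name=f"Store {i}")
    # 2500 products per store
    Product.objects.bulk_create(
        Product(store=store) for _ in range(2500)
    )
    for _ in range(10):
        department = Department.objects.create(store=store)
        # 500 employees per department
        Employee.objects.bulk_create(
            Employee(department=department) for _ in range(500)
        )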

In one part of our system, we show a list of the stores, including the number of employees in each store. Pretty easy with Django ORM, right?

from django.db.models import Count

stores = Store.objects.annotate(
    total_employees=Count("departments__employees")
).values(
    "id",
    # ...
    "total_employees"
)

This query is relatively fast and works like a charm.
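If you're curious what exactly the ORM sends to the database, you can print the compiled statement straight from the queryset (it's close to, though not byte-for-byte identical to, what the driver executes):

# Roughly the SQL Django will run for this queryset
print(stores.query)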

The Problem

Now let's imagine we were asked to add another counter: the number of products per store. We already have total_employees, why not just add total_products?

Well, most likely we already have a unit test for this piece of code that checks the logic on a small amount of data. Before releasing, we figure out that another JOIN was added and some rows got duplicated, so instead of a plain COUNT(...) we switch to COUNT(DISTINCT ...), eventually coming up with something like this:

stores = Store.objects.annotate(
    total_employees=Count("departments__employees",
                          distinct=True),
    total_products=Count("products", distinct=True),
).values(
    "id",
    # ...
    "total_employees",
    "total_products",
)

Looks safe to commit, push, wait for the green tests on CI, merge and deploy.

However, after the deployment you'd almost immediately see that everything hangs. As I said, there aren't that many stores (only 2 for now), not that many employees, and not that many departments.

And it takes, like, 10 seconds to fetch the numbers! What's wrong with it?

Let's take a closer look at the generated SQL for this seemingly innocent queryset:

SELECT 
    "shop_store"."id",
    COUNT(DISTINCT "shop_product"."id") AS "total_products", 
    COUNT(DISTINCT "shop_employee"."id") AS "total_employees"
FROM "shop_store"
  LEFT OUTER JOIN "shop_product"
    ON "shop_store"."id" = "shop_product"."store_id"
  LEFT OUTER JOIN "shop_department"
    ON "shop_store"."id" = "shop_department"."store_id"
  LEFT OUTER JOIN "shop_employee"
    ON "shop_department"."id" = "shop_employee"."department_id" 
GROUP BY "shop_store"."id"
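By the way, you don't have to leave Python to get the plan: QuerySet.explain() runs EXPLAIN for you, and on PostgreSQL passing analyze=True executes the query and reports real timings.

# EXPLAIN ANALYZE straight from the queryset (analyze is PostgreSQL-specific)
print(stores.explain(analyze=True))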

And let's check the actual query plan:

GroupAggregate ... rows=2 ...
  ->  Nested Loop Left Join ... rows=25000000 ...
  ...
...
Execution Time: 11376.072 ms

Translation: these three JOINs turn into a 25-million-row Cartesian mess before GROUP BY and COUNT(DISTINCT):

Products × Departments × Employees
= 2500 × 10 × 500
= 12 500 000  (per store)
× 2 stores
= 25 000 000 joined rows

The Fix

There are multiple ways of handling this case, but the easiest fix is to use subqueries:

from django.db.models import Count, IntegerField, OuterRef, Subquery
from django.db.models.functions import Coalesce

subquery_products = Subquery(
    Product.objects.filter(store_id=OuterRef("pk"))
        .values("store_id")
        .annotate(count=Count("pk"))
        .values("count"),
    output_field=IntegerField()
)
subquery_employees = Subquery(
    Employee.objects.filter(department__store_id=OuterRef("pk"))
        .values("department__store_id")
        .annotate(count=Count("pk"))
        .values("count"),
    output_field=IntegerField()
)
stores = Store.objects.annotate(
    total_products=Coalesce(subquery_products, 0),
    total_employees=Coalesce(subquery_employees, 0),
).values("id", "total_products", "total_employees")
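A quick sanity check against the seed data above: each store should report 2 500 products and 5 000 employees (10 departments × 500 each).

for row in stores:
    print(row)
# e.g. {'id': 1, 'total_products': 2500, 'total_employees': 5000}
#      {'id': 2, 'total_products': 2500, 'total_employees': 5000}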

SQL query:

SELECT "shop_store"."id",
       COALESCE((
         SELECT COUNT(U0."id") AS "count"  
         FROM "shop_product" U0  
         WHERE U0."store_id" = "shop_store"."id"  
         GROUP BY U0."store_id"), 0
       ) AS "total_products",
       COALESCE((
         SELECT COUNT(U0."id") AS "count"
         FROM "shop_employee" U0
           INNER JOIN "shop_department" U1
             ON U0."department_id" = U1."id"
         WHERE U1."store_id" = "shop_store"."id"
         GROUP BY U1."store_id"), 0
       ) AS "total_employees"
FROM "shop_store";

Now this one takes just a few milliseconds, with a modest and predictable plan:

Seq Scan on shop_store ... rows=2 ...
  SubPlan 1
     -> Seq Scan on shop_product u0 ... rows=2500 ...
  SubPlan 2
     -> Hash Join ... rows=5000 ...
        -> Seq Scan on shop_employee u0_1 ... rows=10000 ...
        -> Hash ... rows=10 ...
...
Execution Time: 5.600 ms

No giant intermediate data sets, just two tiny scans:

  • before: 11 376 ms (~11 seconds)
  • after: 5.6 ms (2000x faster)

Takeaways

  • COUNT(DISTINCT) with multi-branch LEFT JOINs makes the database loop through the entire Cartesian product.
  • Correlated subqueries aggregate each branch separately and scale linearly with data size.
  • Always test aggregate queries against production-sized data before you ship; a crude guardrail test is sketched below.
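For example, a minimal (and admittedly crude) performance test. It assumes a hypothetical seed_production_sized_data() helper, like the seeding sketch above, and a hypothetical store_counters() helper that returns the fixed queryset; the 0.5 s budget is arbitrary:

import time

from django.test import TestCase

class StoreCountersPerfTest(TestCase):
    @classmethod
    def setUpTestData(cls):
        seed_production_sized_data()  # hypothetical helper

    def test_store_list_stays_fast(self):
        start = time.monotonic()
        list(store_counters())  # hypothetical helper returning the annotated queryset
        self.assertLess(time.monotonic() - start, 0.5)  # arbitrary budget; tune for your CI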

Top comments (2)

david duymelinck

I don't know how to do it in the Django ORM, but another way to write the query is with a UNION.

SELECT 
    store_id,
    SUM(total_products) AS total_products,
    SUM(total_employees) AS total_employees
FROM (
    SELECT 
        store_id,
        COUNT(DISTINCT id) AS total_products,
        0 AS total_employees
    FROM shop_product
    GROUP BY store_id

    UNION ALL

    SELECT 
        d.store_id,
        0 AS total_products,
        COUNT(DISTINCT e.id) AS total_employees
    FROM shop_department d
    LEFT JOIN shop_employee e ON d.id = e.department_id
    GROUP BY d.store_id
) counts
GROUP BY store_id;

I find aggregate queries so awkward.

Dmitrii Doroshev

@xwero I don't think it's possible to compose a Django queryset with UNION in FROM, but I can see that your query is way more efficient, especially when there are more than just 2 stores.
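For reference, the closest escape hatch I know of is running it as raw SQL through a cursor; a sketch with your query:

from django.db import connection

SQL = """
SELECT store_id,
       SUM(total_products) AS total_products,
       SUM(total_employees) AS total_employees
FROM (
    SELECT store_id,
           COUNT(DISTINCT id) AS total_products,
           0 AS total_employees
    FROM shop_product
    GROUP BY store_id

    UNION ALL

    SELECT d.store_id,
           0 AS total_products,
           COUNT(DISTINCT e.id) AS total_employees
    FROM shop_department d
    LEFT JOIN shop_employee e ON d.id = e.department_id
    GROUP BY d.store_id
) counts
GROUP BY store_id
"""

with connection.cursor() as cursor:
    cursor.execute(SQL)
    rows = cursor.fetchall()  # [(store_id, total_products, total_employees), ...]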

I created 32 stores, 160 000 employees and 80 000 products, and compared the plan of your query with FROM ( ... UNION ALL ... ) against the plan of the final query from the article: the UNION version takes ~130 ms, the other one ~645 ms!

Thank you so much for the comment, I learned something new today :)