Code Review Doctor

Posted on Jan 3, 2021 • Edited on Jan 8, 2021

666 Django projects checked for inefficient database queries. Over half had these 4 anti-patterns

#webdev #python #django #performance

All mature codebases contain anti-patterns: code problems that introduce security holes, tech debt that raises the cost of working with the codebase, and code that goes against best practice that slows it down.

Django Doctor performs static analysis to find and auto fix such problems. We checked 666 Django codebases for inefficient Django ORM calls and were surprised with the extent of the problem: 50% of all Django codebases we checked had the following anti patterns:

16% used queryset.count() > 0 instead of queryset.exists()
15% used len(queryset) instead of queryset.count()
10% checked if queryset: instead of if queryset.exists():
10% did not use foreign keys directly

These at best slow down the Django app, and at worse can bring production down.

How would you solve these performance anti-patterns? Try our Django performance refactor challenge.

Given these problems were present in over half of the Django github repositories checked, there's a good chance that yours has at least one of these problems. An experienced Django dev may think they would not make these mistakes, but the thing is no dev is an island: over time devs change teams, inherit brownfield codebase that have accumulated tech debt, and will review code written by junior devs. It's therefore important to:

avoid the mistake
know how to effectively find and fix the mistake
know how to communicate to junior devs why it's a problem.

Using `.count() > 0` instead of `.exists()`

Comparing queryset.count() is less efficient than checking queryset.exists(). The Django docs tells us to use queryset.count() if you only want the count, and use queryset.exists() if you only want to find out if at least one result exists.

The reason for this is because queryset.count() performs an SQL operation that scans every row in the database table to calculate the sum. On the other hand queryset.exists() simply reads a single record in the most optimized way:

remove ordering
remove grouping
clear any dev-defined select_related and distinct from the queryset

So instead of

def check_hounds():
    queryset = HoundsModel.objects.all()
    if queryset.count() > 0:
         return "oh no. Run!"

Django Doctor would auto fix this to:

def check_hounds():
    queryset = HoundsModel.objects.all()
    if queryset.exists():
         return "oh no. Run!"

Using `len(queryset)` instead of `queryset.count()`

len(queryset) performs the count at application level. That is much less efficient than doing queryset.count(), which does the calculation at database level and just returns the count.

Querysets are lazily evaluated - meaning the records are not read from the database until code interact with the data. Thats's why we can do queryset.all() without downloading every records from the database.

An example of something that interacts with the data is doing len(queryset). This will read all the records in the queryset: effectively downloading the database over the internet. Not particularly efficient.

On the other hand doing queryset.count() handles the calculation at the database level by executing SQL like SELECT COUNT(*) FROM table. That means using queryset.count() makes the code quicker, and improves database performance. Plus imagine the waste in downloading 5000 records only to check the length and throw them away at the end!

So instead of

def check_hounds():
    queryset = HoundsModel.objects.all()
    if len(queryset) > 2:
         return "oh no. Run!"

Django Doctor would auto fix this to:

def check_hounds():
    if HoundsModel.objects.count() > 2:
         return "oh no. Run!"

Having said that though, if the records will need reading after the length is checked, then len(queryset) can be valid.

Using `if queryset:` instead of `if queryset.exists():`

Similar to above, this can load every row in the database table into memory because the queryset is evaluated. Checking if a queryset is truthy/falsey is much less efficient than checking queryset.exists().

This is especially inefficient if the table is very large: it can cause a CPU spike on the database and use a lot of memory on the web server. So instead of pulling every single row from the table, check queryset.exists(), which simply tries to read a single record from the table in a very efficient way.

So instead of

def check_hounds():
    queryset = HoundsModel.objects.all()
    if queryset:
         return "oh no. Run!"

Django Doctor would auto fix this to:

def check_hounds():
    queryset = HoundsModel.objects.all()
    if queryset.exists():
         return "oh no. Run!"

Not use foreign keys directly

When working with foreign keys, accessing model_instance.related_field.id will result in a database read when related_field.id is evaluated. That can be eliminated by doing model_instance.related_field_id, which is the foreign key value that Django has already cached on the object to make this scenario more efficient.

So instead of

def check_hounds(pk, farm_ids):
    hound = HoundsModel.objects.get(pk=pk)
    if hound.farm.id in farm_ids:
        ...

Django Doctor would auto fix this to:

def check_hounds(pk, farm_ids):
    hound = HoundsModel.objects.get(pk=pk)
    if hound.farm_id in farm_ids:
      ...

Does your Django code have these anti-patterns?

Over time it's easy for tech debt to slip into your codebase. I can check that for you at django.doctor, or can review your GitHub PRs:

Or try out Django refactor challenges.

Top comments (4)

Maxim Danilov • Jan 8 '21 • Edited

wtf man? Are you really think that advice "Not use foreign keys directly" help you?
NOT. Its wrong solution!
Try to imagine something like that:

def check_hounds(pk, farm_ids):
    if HoundsModel.objects.filter(pk=pk, hound__farm__id__in=farm_ids).exists():
        return "Ja! it is really better than before!!!"

text
if queryset.count() > 0: is not pythonic.
Somebody who work with python, wrote if queryset.count():

Using len(queryset) is not bad, if you work with objects from queryset after len. Django put furst 25 objects as default in cache, and not ask DB again. It can be wery useful in high-load project.

and, of course, used object manager directly in code is wery bad practice.

it should be:

def check_hounds(*args, **kwargs):
    if hound_manager.check_hounds_at_home(*args, **kwargs):

i work with django many Years, if you want, we can talk about work together.

Code Review Doctor • Jan 8 '21 • Edited

The example code is imperfect, but the message I'm conveying is not "use my code". The message I'm trying to convey is "use foreign key values directly if you have to use foreign keys" - which is what Django docs themselves suggesting doing.

The advice of "use foreign keys directly" is fine, but yes I agree the way I illustrate the problem has other issues that are outside of scope of the purpose of the point I'm making

Using len(queryset) is not bad, if you work with objects from queryset after le

Agreed, I make that point in other words too: "Having said that though, if the records will need reading after the length is checked, then len(queryset) can be valid."