DEV Community: Kyle Johnson

How to Name Django Migrations (and Why It's Important)

Kyle Johnson — Fri, 25 Jun 2021 20:11:12 +0000

Have you ever needed to undo, fake, or deal with migrations more deeply than the typical python manage.py migrate? Did you find yourself opening the migration files to find out what they contained? Everybody probably will at some point.

And if you have, you probably know what a pain it can be to go searching through all those files. In this post, we'll talk about the benefits of properly naming your Django migrations and show how doing a little prep work can save you -- and your fellow developers -- a lot of time. But before that, let's cover the basics.

What are Django Migrations?

Django migrations are a core part of the Django Object-Relational Mapper, commonly shortened to ORM. If you’re unfamiliar with the ORM, it’s one of Django’s most powerful features: it lets you interact with your database through Python objects rather than hand-written SQL.

The migration framework was brought into Django in version 1.7. Migrations themselves consist of Python code and define the operations to apply to a database schema as it progresses through the Software Development Life Cycle, or SDLC. The migration files for each application live in a migrations directory within the app and, according to Django’s documentation, are designed to be committed to, and distributed as part of, its codebase.

Django’s documentation describes the intended workflow as making migrations “once on your development machine and then running the same migrations on your colleagues’ machines, your staging machines, and eventually your production machines.”

It's helpful to think “of migrations as a version control system for your database schema.”

In terms of backend support, migrations are supported on all backends that Django ships with, as well as any third-party backends that support schema alteration. But not all databases are equal when it comes to migrations. Django’s documentation states that some are more capable than others and it’s worth understanding the differences.

PostgreSQL

In terms of schema support, PostgreSQL is deemed the most capable option available. The lone caveat applies to versions before PostgreSQL 11, where adding a column with a default value causes a full rewrite of the table, taking time proportional to its size.

Django’s documentation recommends you always create new columns with null=True, as this way they will be added immediately.

MySQL

Django’s migration documentation lists three caveats when it comes to MySQL support.

Roll-Back Risks: MySQL lacks support for transactions around schema alteration operations. If a migration fails, you will have to manually unpick the changes in order to try again. In short, if you find yourself here, it’s impossible to roll back with Django to an earlier point -- for that you’ll need to use raw SQL.
Slow Execution Times: MySQL fully rewrites tables for almost every schema operation, so adding or removing columns takes time proportional to the number of rows in the table. On slower hardware, this can drag the process out to an estimated minute per million rows; adding a few columns to a table with just a few million rows could lock your site up for over ten minutes.
Limited Name Lengths: MySQL has relatively small limits on name lengths for columns, tables and indexes, as well as a limit on the combined size of all columns an index covers. This means that indexes that are possible on other backends will fail to be created under MySQL.
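
To put the "minute per million rows" figure in perspective, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each added column triggers its own full table rewrite and that the rewrites run one after another; the function name and parameters are made up for this example.

```python
def estimated_lock_minutes(row_count, schema_operations, minutes_per_million_rows=1.0):
    """Rough estimate of how long a table stays busy during a migration.

    Assumes every schema operation rewrites the whole table once and that
    the rewrite time scales linearly with the number of rows.
    """
    millions_of_rows = row_count / 1_000_000
    return millions_of_rows * minutes_per_million_rows * schema_operations


# Adding 3 columns to a table with 4 million rows on slower hardware.
print(estimated_lock_minutes(4_000_000, 3))  # 12.0
```

Real timings depend on hardware, row width, and indexes, so treat the result as an order-of-magnitude guide only.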

SQLite

SQLite lacks robust built-in schema alteration support, forcing Django to step in and emulate it by:

Creating a new table with the new schema
Copying the data across
Dropping the old table
Renaming the new table to match the original name
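
The four steps above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration of the rebuild technique, not Django's actual implementation; the person table, its columns, and the person__new scratch name are made up for the example. It relaxes a NOT NULL constraint, a change that SQLite's ALTER TABLE cannot make in place.

```python
import sqlite3


def make_email_nullable(conn):
    """Rebuild the person table so that email no longer has NOT NULL."""
    cur = conn.cursor()
    # 1. Create a new table with the new schema.
    cur.execute(
        "CREATE TABLE person__new ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NULL)"
    )
    # 2. Copy the data across.
    cur.execute("INSERT INTO person__new (id, name, email) SELECT id, name, email FROM person")
    # 3. Drop the old table.
    cur.execute("DROP TABLE person")
    # 4. Rename the new table to match the original name.
    cur.execute("ALTER TABLE person__new RENAME TO person")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO person (name, email) VALUES (?, ?)",
    [("Ada", "ada@example.com"), ("Grace", "grace@example.com")],
)
conn.commit()

make_email_nullable(conn)

# PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
not_null_flags = {row[1]: row[3] for row in conn.execute("PRAGMA table_info(person)")}
rows = conn.execute("SELECT id, name, email FROM person ORDER BY id").fetchall()
print(not_null_flags["email"], rows)
```

Every row is copied during the rebuild, which is why the approach gets slower as the table grows.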

While Django’s documentation is generally optimistic about how well this process works, it also admits that it can be slow and “occasionally buggy”. Developers are advised not to run and migrate SQLite in a production environment without first considering the risks and limitations.

What’s So Great About Django Migrations?

Above nearly everything else, Django Migrations are important and helpful because they make developers' lives easier.

Django Migrations Help with Agility

Databases are central components of an application and in today’s fast paced world, things change quickly. Agile projects are always changing and sometimes, adjustments are necessary to meet updated requirements. Django migrations help with the process of making, applying, and tracking changes to database schemas.

Developers Use Python for Writing Django Migrations

If you are working with Django then you are more than likely adept at coding in Python because the two go hand-in-hand. In Django, Migrations are written (primarily) in Python, so you’ll be working in a language you already know.

Without Migrations, developers would have to learn SQL commands or work with a GUI like phpMyAdmin every time they wanted to change the model definition and modify the database schema.

Django Migrations Help Reduce Repetition

Migrations are generated from the models you create so the process doesn’t have to be repeated. In comparison, building a model and then writing SQL to generate database tables is repetitive.

Django Migrations Benefit from Version Control (Git)

Database schemas don’t live in the code but Django Migrations are housed in the application, right in a folder with an appropriate name. So when you commit changes and make updates to the Migration, the changes are tracked.

Why Should Developers Consider Naming Migrations?

One of the really helpful features of Django Migrations is that they can be named. You can attach a descriptive title to each migration in an effort to distinguish it from others. And if your project becomes complex and goes on for an extended period of time, the number of migrations can grow large.

Now, there’s absolutely no issue with deciding to not name migrations. You can certainly get by without doing it.

Still, taking a minute to name Django Migrations is helpful, if only because it gives you an indication as to what’s inside. Just as it’s worthwhile to be descriptive in naming commits, branches, and nearly anything when it comes to version control in Git, naming Django Migrations can help you when you’re looking back through versions for a specific file.

You can either spend your time going through files one-by-one until you find what you’re looking for or you can see a list of properly named Django Migrations and get a good idea of where something is at first glance.

Clean code makes for easier maintenance. It just makes life a little easier.

How Do You Name Django Migrations?

Django’s makemigrations command has a flag --name that can be used to make migrations more readable and easy on the eyes.
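
As a small sketch of how the flag is used, the helper below assembles the command line for a named migration. The app label people is hypothetical, and the lowercase snake_case check is this sketch's own convention rather than a Django rule. Django adds the numeric prefix (such as 0005_) itself, so only the descriptive part is passed to --name.

```python
import re


def makemigrations_command(app_label, migration_name):
    """Build the argument list for creating a named migration."""
    # Convention for this sketch only: lowercase letters, digits, underscores.
    if not re.fullmatch(r"[a-z0-9_]+", migration_name):
        raise ValueError(f"Not a snake_case migration name: {migration_name!r}")
    return ["python", "manage.py", "makemigrations", app_label, "--name", migration_name]


command = makemigrations_command("people", "person_email_and_opt_out")
print(" ".join(command))
# python manage.py makemigrations people --name person_email_and_opt_out
```

The resulting list can be run from a shell or handed to subprocess.run from the project directory.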

Django will sometimes pick a descriptive name for you, but when it can't, the resulting title is unhelpful to human readers. An automatically named migration comes out looking like this: 0005_auto_20210608_2154.

What does that file contain? Who knows. Important stuff? Maybe. Non-essential parts? Your guess is as good as mine.

Alternatively, what if you took the time to name a migration like this: 0005_person_email_and_opt_out? Even someone unfamiliar with the project will probably be able to figure out what that migration contains.

Here’s a simple example I made. The first list only used python manage.py makemigrations:

0001_initial
0002_auto_20210608_2147
0003_auto_20210608_2149
0004_person_phone_number
0005_auto_20210608_2154
0006_auto_20210608_2155

Now below are the exact same migrations but with the --name flag included in the makemigrations command:

0001_initial
0002_business_address_fields
0003_business_owner_and_person
0004_person_phone_number
0005_person_email_and_opt_out
0006_business_description_and_services

Taking a small bit of time to name your migrations results in a more readable migration history which will make life easier in the future.
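
If you want to find the migrations in an existing project that still carry an automatic name, a short check along these lines can help. It is a sketch that assumes the auto-generated pattern seen in the list above (a four-digit number, auto, a date, and a time); the names are passed in as plain strings, so how you collect them from the migrations directory is up to you.

```python
import re

# Matches names such as 0005_auto_20210608_2154.
AUTO_NAME = re.compile(r"^\d{4}_auto_\d{8}_\d{4}$")


def auto_named(migration_names):
    """Return the migration names that still use the automatic naming pattern."""
    return [name for name in migration_names if AUTO_NAME.match(name)]


history = [
    "0001_initial",
    "0002_auto_20210608_2147",
    "0003_auto_20210608_2149",
    "0004_person_phone_number",
    "0005_auto_20210608_2154",
    "0006_auto_20210608_2155",
]
print(auto_named(history))
```

Run against the first list in this post, it flags four of the six migrations as candidates for a better name.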

Look at the internal Django apps as an example. Every migration is named.

admin
- 0001_initial
- 0002_logentry_remove_auto_add
- 0003_logentry_add_action_flag_choices
auth
- 0001_initial
- 0002_alter_permission_name_max_length
- 0003_alter_user_email_max_length
- 0004_alter_user_username_opts
- 0005_alter_user_last_login_null
- 0006_require_contenttypes_0002
- 0007_alter_validators_add_error_messages
- 0008_alter_user_username_max_length
- 0009_alter_user_last_name_max_length
- 0010_alter_group_name_max_length
- 0011_update_proxy_permissions
- 0012_alter_user_first_name_max_length
contenttypes
- 0001_initial
- 0002_remove_content_type_name

It's that easy.

Django migrations go a long way in making a developer's life easier. But you can level up your game simply and make life easier for everyone when you give the migrations a descriptive, appropriate name. This post originally appeared on our insights blog, where we write about devops consulting services.

A Django Upgrade Guide for Major and Minor Releases

Kyle Johnson — Fri, 07 May 2021 12:14:56 +0000

The Django project has major updates every eight months and minor updates as needed. It's a good idea to keep your project up to date with the latest version or at least a supported version. This keeps your application secure and allows you to use new features like ASGI support, built-in cross-db JSON field, upcoming functional indexes, and more.

With the upcoming Django 3.2 release in mind, this post goes through the general process for updating a Django project.

How to Upgrade to Minor Releases of Django

Patch releases are made available as needed.

These minor releases fix bugs and security issues in the major version. Always upgrade your projects to the latest patch of the major version you are running. It is easy and there is no reason not to do this.

Per the Django documentation:

These releases will be 100% compatible with the associated feature release, unless this is impossible for security reasons or to prevent data loss. So the answer to “should I upgrade to the latest patch release?” will always be “yes.”

Here's how to upgrade using the latest minor patch

Read the release notes and pay attention to anything that might affect your project. Since the minor releases are “100% compatible with the associated feature release” there should be nothing you need to change in your project, with the exception of the next step below.
Change the Django version in your requirements file. For example, if you are running Django==3.1.6 and the 3.1.7 patch comes out, update your requirements file to Django==3.1.7.
Install the updated requirements and test your project.
That's it. You've completed the upgrade.
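
The requirements change in the steps above is small enough to script. Below is a minimal sketch that rewrites an exact pin in the text of a requirements file; the function name is made up, the requests line is a hypothetical second dependency, and only simple package==version pins are handled.

```python
import re


def pin_version(requirements_text, package, new_version):
    """Replace an exact 'package==x.y.z' pin with the new version."""
    pattern = re.compile(
        rf"^{re.escape(package)}==\S+$",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    return pattern.sub(f"{package}=={new_version}", requirements_text)


before = "Django==3.1.6\nrequests==2.25.1\n"
after = pin_version(before, "Django", "3.1.7")
print(after)
```

After writing the result back to the file, install the requirements and run the test suite as described above.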

How to Upgrade to Major Releases of Django

Feature releases are made available every 8 months.

Version 3.2 comes out in April of 2021 and is going to be a long term support (LTS) release.

Every .2 release will be an LTS release starting with 2.2. Prior to 2.2, LTS releases did not end in .2 (1.11, 1.8, etc).
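
The LTS rule can be written down as a short check. This is a sketch of the rule as stated here: the set of earlier LTS releases contains only the two named in this post, so it should not be read as a complete release history.

```python
# Pre-2.2 LTS releases named in this post; not a complete list.
EARLY_LTS = {(1, 8), (1, 11)}


def is_lts(version):
    """Return True if a Django version string like '3.2' or '3.2.4' is an LTS release."""
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) >= (2, 2):
        # From 2.2 onward, every .2 feature release is an LTS release.
        return minor == 2
    return (major, minor) in EARLY_LTS


print(is_lts("3.2"), is_lts("3.1"), is_lts("1.11"))  # True False True
```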

It is a good idea to keep up with the latest Django version even if it is not an LTS release. You get the latest security enhancements, features, and performance improvements. You also reduce the technical debt of upgrading down the road when you are multiple versions behind.

Here's how to upgrade using the latest major release

1. First, add a source control system such as Git to your application if it isn’t using one already. This will allow you to maintain various versions of your application such as the old version and your work-in-progress upgraded project. If you need help, find an authorized GitLab partner to make sure you're doing it right.
2. Dockerize your application if it isn’t already. This isn’t necessary but it is extremely helpful when you are upgrading from an ancient version of Django (or Python 2, database, cache, etc). (If you are still on Django 1.1 you are not alone - I recently upgraded one of those.)
3. Test the project and add tests where they are lacking. Automated software tests are a lifesaver in keeping software up to date.
4. Run tests with python -Wall manage.py test and deal with any deprecation warnings.
5. Read the release notes for the major Django version that comes after the version you are currently running. Make any changes to your project. If any third party packages you are using are not compatible with the new major Django version, you will need to also read those release notes and update your project accordingly.
6. Change your requirements file to use the new major Django version and any dependencies that need to be updated. Install the updated requirements.
7. Run Django checks with python -Wall manage.py check.
8. Fix any issues and deal with any deprecation warnings.
9. Run tests with python -Wall manage.py test.
10. Fix any issues and deal with any deprecation warnings.
11. Repeat steps 4 through 10 until you get to the latest version of Django.
12. Upgrade all dependencies to supported versions. Basically, repeat steps 4 through 10 but with the third party packages instead of with Django.
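
The -Wall option used in the steps above tells Python to show every warning every time, including deprecation warnings that may otherwise stay hidden. The sketch below reproduces that behaviour inside a single process with the standard warnings module; old_helper is a made-up function standing in for a deprecated API.

```python
import warnings


def old_helper():
    """Stand-in for a deprecated API that a project might still call."""
    warnings.warn(
        "old_helper() is deprecated; use new_helper()",
        DeprecationWarning,
        stacklevel=2,
    )
    return 42


def collect_warnings(func):
    """Call func with all warnings enabled and return what was emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # show every warning, every time
        func()
    return [(w.category.__name__, str(w.message)) for w in caught]


print(collect_warnings(old_helper))
# [('DeprecationWarning', 'old_helper() is deprecated; use new_helper()')]
```

Each warning surfaced this way is something to fix before moving to the next Django version, since deprecated features are eventually removed.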

How Do I Upgrade If I'm on Python 2?

If you are still on Python 2, it's time to jump ship and get on board with Python 3.

The steps to update are the same but first, upgrade to Django 1.11 while staying on Python 2, then switch to Python 3, test, and continue the Django upgrade to the latest version. Django 1.11 is the last Django version to support Python 2 and it is also the beginning of where Django upgrades become very stable.

What if I’m using a Django version under 1.7 (Where migrations were added)?

As with everything, this can vary on a project-by-project basis, but the simple version is: upgrade to Django 1.7, delete any South migrations, generate new initial migrations that match the existing database schema, and then run them. If you are running the new migrations for the first time on Django 1.8 or greater, use the --fake-initial flag.

Once the migrations are finished, continue upgrading Django to the latest version.

What if my database is no longer supported by Django?

First things first, back up your database(s). If the project is small enough then the simplest solution might be to use python manage.py dumpdata and python manage.py loaddata to copy the data from the old database version into a newer database.

For larger projects refer to database specific guides for tools such as pg_upgrade or mysql_upgrade.
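
For the small-project route, the two commands can be sketched as below. This assumes, for illustration only, that both the old and the new database are configured in the project's DATABASES setting; the aliases old and default and the fixture file name are hypothetical.

```python
def copy_data_commands(fixture_path, old_alias, new_alias):
    """Build the dumpdata and loaddata commands for copying data between databases."""
    dump = [
        "python", "manage.py", "dumpdata",
        "--database", old_alias,
        "--output", fixture_path,
    ]
    load = [
        "python", "manage.py", "loaddata",
        "--database", new_alias,
        fixture_path,
    ]
    return dump, load


dump_command, load_command = copy_data_commands("backup.json", "old", "default")
print(" ".join(dump_command))
print(" ".join(load_command))
```

Run the migrations against the new database before loading, so the tables exist when loaddata inserts the rows.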

Summary

Upgrading software can be intimidating but there is no better time to do it than now since it only gets more difficult and insecure down the road. Tools such as source control and Docker help reduce environment differences and lead to a better development/deployment experience. Finally, automated testing is a massive help in preventing errors during regular development and software upgrades.

This post originally appeared on our devops consulting blog.

Improving Django View Performance with Async Support

Kyle Johnson — Thu, 06 May 2021 19:52:05 +0000

Django 3.1 was recently released and along with it came support for asynchronous views.

To anyone that uses Django and works with lots of data from lots of different Web sources, this is a big deal. Supporting asynchronous views means cleaner code, a different way to think about things, and most importantly, the ability to dramatically improve performance of applications.

But let’s back up a bit.

If you’re unfamiliar with the term “views”, don’t worry, it’s an easy concept to understand.

Views are key components of applications built with the Django framework. At their very simplest, views are Python functions or classes that take a web request and produce a web response. Prior to Django 3.1, views had to run with Python threads. Now, views can run in an asynchronous event loop without Python threads. This means Python’s asyncio library can be used inside of Django views.

According to the documentation:

The main benefits are the ability to service hundreds of connections without using Python threads.

There are other benefits as well including the use of Python’s asyncio.gather.
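
To see what asyncio.gather does without touching the network, here is a minimal sketch that uses asyncio.sleep as a stand-in for a slow API call. The function names and the 0.1 second delay are made up for the example.

```python
import asyncio
import time


async def fake_api_call_async(call_id, delay=0.1):
    """Pretend to wait on a remote API, then return a result."""
    await asyncio.sleep(delay)
    return f"result-{call_id}"


async def fetch_all(call_ids):
    # gather runs the awaitables concurrently and returns the results
    # in the same order as the awaitables were passed in.
    return await asyncio.gather(*(fake_api_call_async(c) for c in call_ids))


start = time.perf_counter()
results = asyncio.run(fetch_all(range(4)))
elapsed_seconds = time.perf_counter() - start
print(results, f"{elapsed_seconds:.2f}s")
```

Because the four waits overlap, the total is close to the longest single delay rather than the sum of all four.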

Let’s say you have a view that makes four API calls. Even in a best case scenario, if each call takes only a second, it’s a total of four seconds if executed synchronously. And that’s the best case scenario.

We can cut down on that time frame significantly and improve the situation overall by using Python’s concurrent.futures library.

This makes it possible to make the four API calls in the previous example concurrently, meaning the view could take roughly one second in total if using four workers with the ThreadPoolExecutor. The total time is then set by the slowest call rather than by the sum of all four.

That’s important in a world where seconds matter and making someone wait around for an application to load can cause frustration.
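
The difference is easy to reproduce with simulated calls. The sketch below replaces the API calls with time.sleep so it runs anywhere; the function names and the 0.1 second delay are assumptions for the example, not part of the project described below.

```python
import concurrent.futures
import time


def fake_api_call(call_id, delay=0.1):
    """Pretend to block on a remote API, then return a result."""
    time.sleep(delay)
    return f"result-{call_id}"


def run_sequentially(call_ids):
    start = time.perf_counter()
    results = [fake_api_call(call_id) for call_id in call_ids]
    return results, time.perf_counter() - start


def run_in_threads(call_ids, max_workers=4):
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs.
        results = list(executor.map(fake_api_call, call_ids))
    return results, time.perf_counter() - start


sequential_results, sequential_seconds = run_sequentially(range(4))
threaded_results, threaded_seconds = run_in_threads(range(4))
print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

With four workers and four calls, every call gets its own thread, so the threaded run takes about as long as one call.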

A Real-World Example: Tracking Montana's Yellowstone River

To illustrate how asynchronous views improve performance, I created an example project to display statistical data from the United States Geological Survey (USGS).

The project makes six API calls to the USGS to collect data about six access points on the Yellowstone River in my home state of Montana. This data includes the volume of water moving at each access point at the time, known as discharge rate, as well as the gage height, which is the surface level of the water relative to its streambed.

Example 1: The Synchronous Method

Code:

import time

import requests
from django.shortcuts import render

# SITES (a mapping keyed by USGS site ID) and parse_flow_and_height_from_json
# are defined elsewhere in the example project.


def get_river_flow_and_height(site_id):
    """
    Synchronous method to get river data from USGS
    """
    response = requests.get(f'https://waterservices.usgs.gov/nwis/iv/?format=json&sites={site_id}&parameterCd=00060,00065&siteStatus=all')
    data = parse_flow_and_height_from_json(response.json())
    return data


def dashboard_v1(request):
    """
    Synchronous view that loads data one at a time
    """
    start_time = time.time()

    river_data = []

    # Each request blocks until the previous one has finished.
    for site_id in SITES.keys():
        river_data.append((SITES[site_id], get_river_flow_and_height(site_id)))

    return render(request, 'rivers/dashboard.html', {
        'river_data': river_data,
        'version': 'v1',
        'load_time': time.time() - start_time,
    })

Result:

The data loads and takes almost four seconds. For the purposes of this post, that’ll be our baseline. Let’s see if we can improve that situation.

Example 2: A Concurrent View Loading Some Data Simultaneously

import concurrent.futures

# Also needs time, render, SITES and get_river_flow_and_height, as in Example 1.


def dashboard_v2(request):
    """
    Concurrent view that loads some data simultaneously
    """
    start_time = time.time()

    river_data = []

    # Up to three requests run at once; executor.map keeps the input order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        for site_id, data in zip(SITES.keys(), executor.map(get_river_flow_and_height, SITES.keys())):
            river_data.append((SITES[site_id], data))

    return render(request, 'rivers/dashboard.html', {
        'river_data': river_data,
        'version': 'v2',
        'load_time': time.time() - start_time,
    })

Result:

Now we’re down to roughly 1.5 seconds and that’s a big improvement. Let’s see what happens when we leverage asynchronous views.

Example 3: The Asynchronous Method

import asyncio

import httpx

# Also needs time, render, SITES and parse_flow_and_height_from_json, as in Example 1.


async def get_river_flow_and_height_async(site_id):
    """
    Asynchronous method to get river data from USGS
    """
    async with httpx.AsyncClient() as client:
        response = await client.get(f'https://waterservices.usgs.gov/nwis/iv/?format=json&sites={site_id}&parameterCd=00060,00065&siteStatus=all')
        data = parse_flow_and_height_from_json(response.json())
    return data


async def dashboard_v3(request):
    """
    Asynchronous view that loads data using asyncio
    """
    start_time = time.time()

    river_data = []

    # All requests are started together; gather returns results in input order.
    datas = await asyncio.gather(*[get_river_flow_and_height_async(site_id) for site_id in SITES.keys()])
    for site_id, data in zip(SITES.keys(), datas):
        river_data.append((SITES[site_id], data))

    return render(request, 'rivers/dashboard.html', {
        'river_data': river_data,
        'version': 'v3',
        'load_time': time.time() - start_time,
    })

Result:

Wow, we got results back in just under a second. That’s roughly a three second improvement on the original method.

This example shows pretty clearly how asynchronous views can be leveraged to drastically improve performance.

View the full project on GitLab: https://gitlab.com/nextlink/example-django-async-rivers

This post originally appeared on our blog at NextLink Labs where, among other things, we write about devops consulting

Efficient Iteration of Big Data in Django

Kyle Johnson — Thu, 06 May 2021 19:24:27 +0000

Running out of memory is not fun.

Unfortunately, when working with larger datasets, it's bound to happen at some point.

For example, I tried to run a Django management command that updated a value on a model with a large number of rows in the database table:

python manage.py my_update_command
Killed

That was inside of Kubernetes which killed the process when it exceeded its memory limit. In a more traditional environment, you can completely freeze up the server if it runs out of memory.

Context

You're probably wondering why I'm trying to run a management command like this in the first place. When working with large datasets, it's best to avoid anything that is O(n) or worse. In this case, I had a JSONField with a bunch of data. I also had an IntegerField on the model that stored a calculation based on some of the data in the JSONField.

Of course, requirements change and the calculation I had been using needed to use different values from the JSONField. This also needed to happen for all the existing data in the database (the large number of rows). Luckily I had everything stored in the JSONField and making this change was as simple as running the management command and patiently waiting.

from django.core.management import BaseCommand

from ...models import SomeModel
from ...utils import some_calculation


class Command(BaseCommand):
    help = "Updates SomeModel.field based on some_calculation"

    def handle(self, *args, **options):
        self.stdout.write("Starting")

        try:
            queryset = SomeModel.objects.all()

            for obj in queryset:
                obj.field = some_calculation(obj)
                obj.save(update_fields=["field"])
        except KeyboardInterrupt:
            self.stdout.write("KeyboardInterrupt")

        self.stdout.write("Done")

Killed, now what?

Naturally, it wasn't that simple. Method 1, the plain QuerySet loop above, was using enough memory that Kubernetes killed it. I tried a few different things here including moving the code into a data migration and running multiple asynchronous tasks. I had difficulties getting these approaches working and struggled to monitor progress.

Really, I just wanted a simple, memory-efficient management command to iterate through the data and update it.

QuerySet.iterator

Django's built-in solution to iterating through a larger QuerySet is the QuerySet.iterator method. This helps immensely and is probably good enough in most cases.

However, method 2 was still getting killed in my case.

# simplified command using QuerySet.iterator
class Command(BaseCommand):
    def handle(self, *args, **options):
        queryset = SomeModel.objects.all().iterator(chunk_size=1000)

        for obj in queryset:
            obj.field = some_calculation(obj)
            obj.save(update_fields=["field"])

Behold the Paginator

I needed to iterate through the QuerySet in smaller chunks, in a more memory-efficient manner than the iterator method. I started to roll out my own solution before I realized that this sounded very familiar. Django has pagination support built-in which is exactly what I was about to implement. I ended up using the Django Paginator to iterate through the QuerySet in chunks.

Method 4 works great.

# simplified command using Paginator
from django.core.paginator import Paginator


class Command(BaseCommand):
    def handle(self, *args, **options):
        # Order explicitly: paginating an unordered QuerySet can yield inconsistent pages.
        queryset = SomeModel.objects.order_by("pk")

        paginator = Paginator(queryset, 1000)

        for page_number in paginator.page_range:
            page = paginator.page(page_number)

            for obj in page.object_list:
                obj.field = some_calculation(obj)
                obj.save(update_fields=["field"])
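
To make the chunking concrete, here is a Django-free sketch of the slicing arithmetic behind page-by-page iteration. It is an illustration of the idea, not Django's Paginator code, and the function name is made up. The 50,000 row count matches the dataset size used for the comparison in this post.

```python
import math


def page_slices(total_count, per_page):
    """Yield (page_number, start, stop) for each page of a result set."""
    num_pages = math.ceil(total_count / per_page)
    for page_number in range(1, num_pages + 1):
        start = (page_number - 1) * per_page
        stop = min(start + per_page, total_count)
        yield page_number, start, stop


pages = list(page_slices(50_000, 1_000))
print(len(pages), pages[0], pages[-1])  # 50 (1, 0, 1000) (50, 49000, 50000)
```

Only one slice of at most 1,000 objects needs to be in memory at a time, which is where the memory savings come from.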

Memory Usage

At this point there are the three methods described above, plus another two I added that use QuerySet.bulk_update:

1. Regular QuerySet
2. QuerySet.iterator
3. QuerySet.iterator and QuerySet.bulk_update
4. Paginator
5. Paginator and QuerySet.bulk_update

I ran comparisons on these approaches with 50,000 items in the database. Here is memory usage for all five methods with three runs each:

The plot clearly shows that method 1 is not a good choice since the entire QuerySet is loaded into memory before it can be used. Method 3 is also showing a steady increase in memory (note: it appears there is a memory leak here that I was unable to resolve). Zooming in on methods 2, 4 and 5 it becomes more clear that methods 4 and 5 are the winners:

I don't claim that the Paginator solution is always the best or even the best for my problem. More importantly it solved my problem and gave me the opportunity to dive into the memory differences between these approaches described above. If you have a similar problem, I recommend diving in and seeing how the comparison pans out for the specific problem.
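
If you want a feel for this kind of comparison before involving a database, the standard tracemalloc module can measure peak memory for plain Python code. The sketch below is not the benchmark behind the plots in this post; it uses generated dictionaries as stand-ins for model instances and compares loading everything at once with processing in chunks of 1,000.

```python
import itertools
import tracemalloc


def make_rows(count):
    """Generate stand-in rows lazily, one dictionary at a time."""
    for i in range(count):
        yield {"id": i, "field": 0}


def load_everything(count=50_000):
    rows = list(make_rows(count))  # the whole dataset is held in memory
    for row in rows:
        row["field"] = row["id"] * 2


def process_in_chunks(count=50_000, chunk_size=1_000):
    rows = make_rows(count)
    while True:
        chunk = list(itertools.islice(rows, chunk_size))  # one chunk in memory
        if not chunk:
            break
        for row in chunk:
            row["field"] = row["id"] * 2


def peak_bytes(func):
    """Run func and return the peak traced memory in bytes."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


peak_all = peak_bytes(load_everything)
peak_chunked = peak_bytes(process_in_chunks)
print(f"all at once: {peak_all:,} bytes, chunked: {peak_chunked:,} bytes")
```

The absolute numbers will differ from a real Django run, but the shape of the result is the same: peak memory follows the chunk size rather than the table size.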

A shortened version of the method 5 management command is:

# simplified command using Paginator and QuerySet.bulk_update
from django.core.paginator import Paginator


class Command(BaseCommand):
    def handle(self, *args, **options):
        # Order explicitly: paginating an unordered QuerySet can yield inconsistent pages.
        queryset = SomeModel.objects.order_by("pk")

        paginator = Paginator(queryset, 1000)

        for page_number in paginator.page_range:
            page = paginator.page(page_number)
            updates = []

            for obj in page.object_list:
                obj.field = some_calculation(obj)
                updates.append(obj)

            # One UPDATE query per page instead of one per object.
            SomeModel.objects.bulk_update(updates, ["field"])

This article was originally posted on our blog.