<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matthew Segal</title>
    <description>The latest articles on DEV Community by Matthew Segal (@mattdsegal).</description>
    <link>https://dev.to/mattdsegal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F412071%2Fe5f02b87-1c51-4d08-99fc-08fed789fe68.jpg</url>
      <title>DEV Community: Matthew Segal</title>
      <link>https://dev.to/mattdsegal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mattdsegal"/>
    <language>en</language>
    <item>
      <title>How to find what you want in the Django documentation</title>
      <dc:creator>Matthew Segal</dc:creator>
      <pubDate>Fri, 26 Jun 2020 09:50:14 +0000</pubDate>
      <link>https://dev.to/mattdsegal/how-to-find-what-you-want-in-the-django-documentation-3b7k</link>
      <guid>https://dev.to/mattdsegal/how-to-find-what-you-want-in-the-django-documentation-3b7k</guid>
      <description>&lt;p&gt;Many beginner programmers find the &lt;a href="https://docs.djangoproject.com/en/3.0/"&gt;Django documentation&lt;/a&gt; overwhelming.&lt;/p&gt;

&lt;p&gt;Let's say you want to learn how to perform a login for a user. Seems like it would be pretty simple: logins are a core feature of Django. If you &lt;a href="https://www.google.com/search?q=django+login"&gt;google for "django login"&lt;/a&gt; or &lt;a href="https://docs.djangoproject.com/en/3.0/search/?q=login"&gt;search the docs&lt;/a&gt; you see a few options, with "Using the Django authentication system" as the most promising result. You click the link, happily anticipating that your login problems will soon be over, and you get smacked in the face with &lt;a href="https://docs.djangoproject.com/en/3.0/topics/auth/default/"&gt;thirty-nine full browser pages of text&lt;/a&gt;. This is way too much information!&lt;/p&gt;

&lt;p&gt;Alternatively, you find your way to the reference page on &lt;a href="https://docs.djangoproject.com/en/3.0/ref/contrib/auth/"&gt;django.contrib.auth&lt;/a&gt;, because that's where all the auth stuff is, right? If you browse this page you will see an endless enumeration of all the different authentication models and fields and functions, but no explanation of how they're supposed to fit together.&lt;/p&gt;

&lt;p&gt;At this stage you may want to close your browser tab in despair and reconsider your decision to learn Django. It turns out the info that you wanted was somewhere in that really long page &lt;a href="https://docs.djangoproject.com/en/3.0/topics/auth/default/#how-to-log-a-user-in"&gt;here&lt;/a&gt; and &lt;a href="https://docs.djangoproject.com/en/3.0/topics/auth/default/#django.contrib.auth.authenticate"&gt;here&lt;/a&gt;. Why was it so hard to find? Why is this documentation so fragmented?&lt;/p&gt;

&lt;p&gt;God forbid that you should complain to anyone about this struggle. Experienced devs will say things like "you are looking in the wrong place" and "you need more experience before you try Django". This response raises the question, though: how does anyone know where the "right place" is? The table of contents in the Django documentation &lt;a href="https://docs.djangoproject.com/en/3.0/contents/"&gt;is unreadably long&lt;/a&gt;. Meanwhile, you read other people raving about how great the Django docs are: what are they talking about? You may wonder: am I missing something?&lt;/p&gt;

&lt;p&gt;Wouldn't it be great if you could go from having a question to finding the answer in a few minutes or less? A quick Google and a scan, and boom: you know how to solve your Django problem. This is possible. As a professional Django dev I do this daily. I rarely remember how to do anything by heart; I am constantly scanning the docs to figure out how to solve problems, and you can too.&lt;/p&gt;

&lt;p&gt;In this post I will outline how to find what you want in the Django documentation, so that you spend less time frustrated and stuck, and more time writing your web app. I also include a list of key references that I find useful.&lt;/p&gt;

&lt;p&gt;Experienced devs can be dismissive when you complain about documentation, but they're right about one thing: knowing how to read docs is a really important skill for a programmer, and being good at this will save you lots of time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Find the right section
&lt;/h2&gt;

&lt;p&gt;Library documentation is almost always written with distinct sections. If you do not understand what these sections are for, then you will be totally lost.&lt;br&gt;
If you have time, watch &lt;a href="https://www.youtube.com/watch?v=t4vKPhjcMZg"&gt;Daniele Procida's excellent talk&lt;/a&gt; on how documentation should be structured. In the talk he describes four different sections of documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tutorials&lt;/strong&gt;: lessons that show you how to complete a small project (&lt;a href="https://docs.djangoproject.com/en/3.0/intro/install/"&gt;example&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How-to guides&lt;/strong&gt;: guides with steps for solving a common problem (&lt;a href="https://docs.djangoproject.com/en/3.0/howto/custom-management-commands/"&gt;example&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API References&lt;/strong&gt;: detailed technical descriptions of all the bits of code (&lt;a href="https://docs.djangoproject.com/en/3.0/ref/models/querysets/"&gt;example&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explanations&lt;/strong&gt;: high level discussion of design decisions (&lt;a href="https://docs.djangoproject.com/en/3.0/topics/templates/#module-django.template"&gt;example&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to these, there's also commonly a &lt;strong&gt;Quickstart&lt;/strong&gt; (&lt;a href="http://whitenoise.evans.io/en/stable/#quickstart-for-django-apps"&gt;example&lt;/a&gt;), which lists the absolute minimum steps you need to take to get started with the library.&lt;/p&gt;

&lt;p&gt;The Django Rest Framework docs use a structure similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WEFPfvxG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/img/drf-sections.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WEFPfvxG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/img/drf-sections.png" alt="django rest framework sections"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ReactJS docs use a structure similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P0kVKfX---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/img/react-sections.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P0kVKfX---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/img/react-sections.png" alt="react sections"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Django docs use a &lt;a href="https://docs.djangoproject.com/en/3.0/#how-the-documentation-is-organized"&gt;structure similar to this&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vj-AhYVE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/img/django-sections.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vj-AhYVE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/img/django-sections.png" alt="django sections"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hopefully you see the pattern here: all these docs have been split up into distinct sections. Learn this structure once and you can quickly navigate most documentation.&lt;br&gt;
Now that you understand that library documentation is usually structured in a particular way, I will explain how to navigate that structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do the tutorial first
&lt;/h2&gt;

&lt;p&gt;This might seem obvious, but I have to say it. If there is a tutorial in the docs and you are feeling lost, then do the tutorial. It is often where the authors introduce the concepts that are key to understanding everything else. If you're feeling like a badass, then don't "do" the tutorial, but at the very least skim-read it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Find an example, guide or overview
&lt;/h2&gt;

&lt;p&gt;Avoid the &lt;a href="https://docs.djangoproject.com/en/3.0/ref/"&gt;API reference&lt;/a&gt; section, unless you already know &lt;em&gt;exactly&lt;/em&gt; what you're looking for. You will recognise that you are in an API reference section because the title will have "reference" in it, and the content will be very detailed with few high-level explanations. For example, &lt;a href="https://docs.djangoproject.com/en/3.0/ref/contrib/auth/"&gt;django.contrib.auth&lt;/a&gt; is a reference section - it is not a good place to learn how "Django login" works.&lt;/p&gt;

&lt;p&gt;You need to understand how the bits of code fit together before looking at an API reference. This can be hard since most documentation, even the really good stuff, is incomplete. Still, the best thing to try is to look for overviews and explanations of framework features.&lt;/p&gt;

&lt;p&gt;Find and scan the list of &lt;a href="https://docs.djangoproject.com/en/3.0/howto/"&gt;how-to guides&lt;/a&gt; to see if one solves your exact problem; if it does, you will save a lot of time. Using our login example, there is no "how to log a user in" guide, which is bad luck.&lt;/p&gt;

&lt;p&gt;If there is no guide, then quickly scan the &lt;a href="https://docs.djangoproject.com/en/3.0/topics/"&gt;topic list&lt;/a&gt; and try to find the topic that you need. If you do not already understand the topic well, then read the overview. &lt;strong&gt;Google terms that you do not understand&lt;/strong&gt;, like "authentication" and "authorization" (they're different, specific things). In our login case, "&lt;a href="https://docs.djangoproject.com/en/3.0/topics/auth/"&gt;User authentication in Django&lt;/a&gt;" is the topic that we want from the list.&lt;/p&gt;

&lt;p&gt;Once you think you sort-of understand how everything should fit together, then you can move to the detailed API reference, so that you can ensure that you're using the code correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Find and remember key references
&lt;/h2&gt;

&lt;p&gt;Once you understand what you want to do, you will need to use the API reference pages to figure out exactly what code you should write. It's good to remember the key pages that contain the most useful references. Here are my personal favourites, which I use all the time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.djangoproject.com/en/3.0/ref/settings/"&gt;&lt;strong&gt;Settings reference&lt;/strong&gt;&lt;/a&gt;: A list of all the settings and what they do&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.djangoproject.com/en/3.0/ref/templates/builtins/"&gt;&lt;strong&gt;Built-in template tags&lt;/strong&gt;&lt;/a&gt;: All the template tags with examples&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.djangoproject.com/en/3.0/ref/models/querysets/"&gt;&lt;strong&gt;Queryset API reference&lt;/strong&gt;&lt;/a&gt;: All the different tools for using the ORM to access the database&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.djangoproject.com/en/3.0/ref/models/fields/"&gt;&lt;strong&gt;Model field reference&lt;/strong&gt;&lt;/a&gt;: All the different model fields&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ccbv.co.uk/"&gt;&lt;strong&gt;Classy Class Based Views&lt;/strong&gt;&lt;/a&gt;: Detailed descriptions for each of Django's class-based views&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I don't have any of these pages bookmarked; I just google for them and then search using &lt;code&gt;ctrl-f&lt;/code&gt; to find what I need in seconds.&lt;/p&gt;

&lt;p&gt;When using Django REST Framework I often find myself referring to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://www.cdrf.co/"&gt;&lt;strong&gt;Classy DRF&lt;/strong&gt;&lt;/a&gt;: Like Classy Class Based Views but for DRF&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.django-rest-framework.org/api-guide/serializers/"&gt;&lt;strong&gt;Serializer reference&lt;/strong&gt;&lt;/a&gt;: To make serializers work&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.django-rest-framework.org/api-guide/fields/"&gt;&lt;strong&gt;Serializer field reference&lt;/strong&gt;&lt;/a&gt;: All the different serializer fields&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.django-rest-framework.org/api-guide/relations/#nested-relationships"&gt;&lt;strong&gt;Nested relationships&lt;/strong&gt;&lt;/a&gt;: How to put serializers &lt;a href="https://mattsegal.dev/img/xzibit.png"&gt;inside of other serializers&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Search instead of reading
&lt;/h2&gt;

&lt;p&gt;Most documentation is not meant to be read linearly, from start to end, like a novel: most pages are too long to read. Instead, you should strategically search for what you want. Most documentation involves big lists of things, because there's so much stuff that the authors need to explain in detail. You cannot rely on brute-force reading all the content to find the info you need.&lt;/p&gt;

&lt;p&gt;You can use your browser's built-in text search feature (&lt;code&gt;ctrl-f&lt;/code&gt;) to quickly find the text that you need. This will save you a lot of scrolling and squinting at your screen. I use this technique all the time when browsing the Django docs. &lt;a href="https://www.loom.com/share/cc4b030513b0406c91a1eadcd08514a2"&gt;Here's a video&lt;/a&gt; of me finding out how to log in with Django using &lt;code&gt;ctrl-f&lt;/code&gt;. &lt;a href="https://www.loom.com/share/1be42c1709334817ab3cb055ad8acf69"&gt;Here's me struggling&lt;/a&gt; to get past the first list by trying to read all the words with my pathetic human eyes. I genuinely did miss the "auth" section several times when trying to read that list manually while writing this post.&lt;/p&gt;

&lt;p&gt;Using search is how you navigate the enormous &lt;a href="https://docs.djangoproject.com/en/3.0/contents/"&gt;table of contents&lt;/a&gt; or the &lt;a href="https://docs.djangoproject.com/en/3.0/topics/auth/default/"&gt;39 browser pages of authentication overview&lt;/a&gt;. You're not supposed to read all that stuff, you're supposed to strategically search it. In our login example, good search terms would be "auth", "login", "log in" and "user".&lt;/p&gt;
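
&lt;p&gt;The same "strategic search" can be mimicked in a few lines of Python, filtering a made-up, abbreviated table of contents against the login example's search terms, just like &lt;code&gt;ctrl-f&lt;/code&gt; filters a page:&lt;/p&gt;

```python
# Illustrative only: filter a made-up table of contents against
# search terms, the way ctrl-f narrows down a long docs page.
toc = [
    "Models and databases",
    "Handling HTTP requests",
    "User authentication in Django",
    "Forms",
    "Templates",
]
terms = ["auth", "login", "log in", "user"]
hits = [title for title in toc if any(t in title.lower() for t in terms)]
print(hits)  # -> ['User authentication in Django']
```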

&lt;p&gt;In addition, most really long pages will have a sidebar summarising all the content. If you're going to read something, read that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0A6QiRRQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/img/docs-sidebar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0A6QiRRQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/img/docs-sidebar.png" alt="django sections"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Read the source code
&lt;/h2&gt;

&lt;p&gt;This is kind of the documentation equivalent of "go fuck yourself", but when you need an answer and the documentation doesn't have it, then the code is the authoritative source on how the library works. There are many library details that would be too laborious to document in full, and at some point the expectation is that if you &lt;em&gt;really need to know&lt;/em&gt; how something works, then you should try reading the code. The &lt;a href="https://github.com/django/django"&gt;Django source code&lt;/a&gt; is pretty well written, and the more time you spend immersed in it, the easier it will be to navigate. This isn't really advice for beginners, but if you're feeling brave, then give it a try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The Django docs, in my opinion, really are quite good, but like most code docs, they're hard for beginners to navigate. I hope that these tips will make learning Django a more enjoyable experience for you. To summarise my tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify the different sections of the documentation&lt;/li&gt;
&lt;li&gt;Do the tutorial first if you're not feeling confident, or at least skim-read it&lt;/li&gt;
&lt;li&gt;Avoid the API reference early on&lt;/li&gt;
&lt;li&gt;Try to find a how-to guide for your problem&lt;/li&gt;
&lt;li&gt;Try to find an overview and explanation of your topic&lt;/li&gt;
&lt;li&gt;Remember key references for quick lookup later&lt;/li&gt;
&lt;li&gt;Search the docs, don't read them like a book&lt;/li&gt;
&lt;li&gt;Read the source code if you're desperate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As good as it is, the Django docs do not, and should not, tell you everything there is to know about how to use Django. At some point, you will need to turn to Django community blogs like &lt;a href="https://simpleisbetterthancomplex.com/"&gt;Simple is Better than Complex&lt;/a&gt;, YouTube videos, courses and books. When you need to deploy your Django app, you might enjoy my guide on &lt;a href="https://mattsegal.dev/simple-django-deployment.html"&gt;Django deployment&lt;/a&gt; and my overview of &lt;a href="https://mattsegal.dev/django-prod-architectures.html"&gt;Django server setups&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>django</category>
      <category>python</category>
    </item>
    <item>
      <title>How to pull production data into your local Postgres database</title>
      <dc:creator>Matthew Segal</dc:creator>
      <pubDate>Sun, 21 Jun 2020 10:47:47 +0000</pubDate>
      <link>https://dev.to/mattdsegal/how-to-pull-production-data-into-your-local-postgres-database-277f</link>
      <guid>https://dev.to/mattdsegal/how-to-pull-production-data-into-your-local-postgres-database-277f</guid>
      <description>&lt;p&gt;Sometimes you want to write a feature for your Django app that requires a lot of structured data that already exists in production. This happened to me recently: I needed to create a reporting tool for internal business users. The problem was that I didn't have much data in my local database. How can I see what my reports will look like if I don't have any data?&lt;/p&gt;

&lt;p&gt;It's possible to generate a bunch of fake data using a management command. I've written earlier about &lt;a href="https://mattsegal.dev/django-factoryboy-dummy-data.html"&gt;how to do this with FactoryBoy&lt;/a&gt;. This approach is great for filling web pages with dummy content, but it's tedious if your data is highly structured and follows a bunch of implicit rules. In the case of my reporting tool, the data I wanted involved hundreds of form submissions, and each submission has dozens of answers with many different data types. Writing a script to generate data like this would have taken ages! I've also seen situations like this when working with billing systems and online stores with many product categories.&lt;/p&gt;

&lt;p&gt;Wouldn't it be nice if we could just get a copy of our production data and use that for local development? You could just pull the latest data from prod and work on your feature with the confidence that you have plenty of data that is structured correctly.&lt;/p&gt;

&lt;p&gt;In this post I'll show you a script which you can use to fetch a Postgres database backup from cloud storage and use it to populate your local Postgres database with prod data. This post builds on three previous posts of mine, which you might want to read if you can't follow the scripting in this post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mattsegal.dev/reset-django-local-database.html"&gt;How to automatically reset your local Django database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mattsegal.dev/postgres-backup-and-restore.html"&gt;How to backup and restore a Postgres database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mattsegal.dev/postgres-backup-automate.html"&gt;How to automate your Postgres database backups&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm going to do all of my scripting in bash, but it's also possible to write similar scripts in PowerShell, with only a few tweaks to the syntax.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting script
&lt;/h3&gt;

&lt;p&gt;Let's start with the "database reset" bash script from my &lt;a href="https://mattsegal.dev/reset-django-local-database.html"&gt;previous post&lt;/a&gt;. This script resets your local database, runs migrations and creates a local superuser for you to use. We're going to extend this script with an additional step to download and restore from our latest database backup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Resets the local Django database, adding an admin login and migrations&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Resetting the database"&lt;/span&gt;
./manage.py reset_db &lt;span class="nt"&gt;--close-sessions&lt;/span&gt; &lt;span class="nt"&gt;--noinput&lt;/span&gt;

&lt;span class="c"&gt;# =========================================&lt;/span&gt;
&lt;span class="c"&gt;# DOWNLOAD AND RESTORE DATABASE BACKUP HERE&lt;/span&gt;
&lt;span class="c"&gt;# =========================================&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Running migrations"&lt;/span&gt;
./manage.py migrate

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Creating new superuser 'admin'"&lt;/span&gt;
./manage.py createsuperuser &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--username&lt;/span&gt; admin &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--email&lt;/span&gt; admin@example.com &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--noinput&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Setting superuser 'admin' password to 12345"&lt;/span&gt;
./manage.py shell_plus &lt;span class="nt"&gt;--quiet-load&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
u=User.objects.get(username='admin')
u.set_password('12345')
u.save()
"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Database restore finished."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Fetching the latest database backup
&lt;/h3&gt;

&lt;p&gt;Now that we have a base script to work with, we need to fetch the latest database backup. I'm going to assume that you've followed my guide on &lt;a href="https://mattsegal.dev/postgres-backup-automate.html"&gt;automating your Postgres database backups&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's say your database is saved in an AWS S3 bucket called &lt;code&gt;mydatabase-backups&lt;/code&gt;, and you've saved your backups with a timestamp in the filename, like &lt;code&gt;postgres_mydatabase_1592731247.pgdump&lt;/code&gt;. Using these two facts we can use a little bit of bash scripting to find the name of the latest backup from our S3 bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find the latest backup file&lt;/span&gt;
&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;s3://mydatabase-backups
&lt;span class="nv"&gt;LATEST_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws s3 &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $4}'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 1&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Found file &lt;/span&gt;&lt;span class="nv"&gt;$LATEST_FILE&lt;/span&gt;&lt;span class="s2"&gt; in bucket &lt;/span&gt;&lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
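
&lt;p&gt;If the shell pipeline looks opaque, here's the same "pick the newest file" idea sketched in plain Python, using hypothetical filenames. A plain lexicographic sort works because every filename embeds a fixed-width unix timestamp:&lt;/p&gt;

```python
# Same selection logic as `sort | tail -n 1` above, in Python.
# The filenames are hypothetical examples with embedded unix timestamps.
backups = [
    "postgres_mydatabase_1592644847.pgdump",
    "postgres_mydatabase_1592558447.pgdump",
    "postgres_mydatabase_1592731247.pgdump",
]
# Timestamps are fixed-width, so sorting the strings sorts by time.
latest = sorted(backups)[-1]
print(latest)  # -> postgres_mydatabase_1592731247.pgdump
```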



&lt;p&gt;Once you know the name of the latest backup file, you can download it to the current directory with the &lt;code&gt;aws&lt;/code&gt; CLI tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download the latest backup file&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LATEST_FILE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.&lt;/code&gt; in this case refers to the current directory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Restoring from the latest backup
&lt;/h3&gt;

&lt;p&gt;Now that you've downloaded the backup file, you can apply it to your local database with &lt;code&gt;pg_restore&lt;/code&gt;. You may need to install a Postgres client on your local machine to get access to this tool. Assuming your local Postgres credentials aren't a secret, you can just hardcode them into the script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;pg_restore &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--clean&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--dbname&lt;/span&gt; postgres &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--host&lt;/span&gt; localhost &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--port&lt;/span&gt; 5432 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--username&lt;/span&gt; postgres &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--no-owner&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;$LATEST_FILE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In this case we use &lt;code&gt;--clean&lt;/code&gt; to remove any existing data and we use &lt;code&gt;--no-owner&lt;/code&gt; to ignore any commands that set ownership of objects in the database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Look ma, no files!
&lt;/h3&gt;

&lt;p&gt;You don't have to save your backup file to disk before you use it to restore your local database: you can stream the data directly from &lt;code&gt;aws s3 cp&lt;/code&gt; to &lt;code&gt;pg_restore&lt;/code&gt; using pipes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LATEST_FILE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; - | &lt;span class="se"&gt;\&lt;/span&gt;
    pg_restore &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--clean&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--dbname&lt;/span&gt; postgres &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--host&lt;/span&gt; localhost &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--port&lt;/span&gt; 5432 &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--username&lt;/span&gt; postgres &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--no-owner&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-&lt;/code&gt; in this case means "stream to stdout", which we use so that we can pipe the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final script
&lt;/h3&gt;

&lt;p&gt;Here's the whole thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Resets the local Django database,&lt;/span&gt;
&lt;span class="c"&gt;# restores from latest prod backup,&lt;/span&gt;
&lt;span class="c"&gt;# and adds an admin login and migrations&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Resetting the database"&lt;/span&gt;
./manage.py reset_db &lt;span class="nt"&gt;--close-sessions&lt;/span&gt; &lt;span class="nt"&gt;--noinput&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Restoring database from S3 backups"&lt;/span&gt;
&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;s3://mydatabase-backups
&lt;span class="nv"&gt;LATEST_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws s3 &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $4}'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 1&lt;span class="si"&gt;)&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LATEST_FILE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; - | &lt;span class="se"&gt;\&lt;/span&gt;
    pg_restore &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--clean&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--dbname&lt;/span&gt; postgres &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--host&lt;/span&gt; localhost &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--port&lt;/span&gt; 5432 &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--username&lt;/span&gt; postgres &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--no-owner&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Running migrations"&lt;/span&gt;
./manage.py migrate

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Creating new superuser 'admin'"&lt;/span&gt;
./manage.py createsuperuser &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--username&lt;/span&gt; admin &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--email&lt;/span&gt; admin@example.com &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--noinput&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Setting superuser 'admin' password to 12345"&lt;/span&gt;
./manage.py shell_plus &lt;span class="nt"&gt;--quiet-load&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
u=User.objects.get(username='admin')
u.set_password('12345')
u.save()
"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Database restore finished."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You should be able to run this script over and over to get the latest database backup working on your local machine.&lt;/p&gt;
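&lt;p&gt;The trickiest line in that script is the one that picks &lt;code&gt;LATEST_FILE&lt;/code&gt;. It works because the backup filenames embed a fixed-width Unix timestamp, so a plain lexical &lt;code&gt;sort&lt;/code&gt; orders them chronologically. Here's a minimal, self-contained sketch of that pipeline, using a made-up &lt;code&gt;aws s3 ls&lt;/code&gt; listing rather than a real bucket:&lt;/p&gt;

```shell
# Simulated output of `aws s3 ls s3://mydatabase-backups`:
# date, time, size, filename (the filename is the 4th column).
listing="2020-06-01 00:00:03  52344 postgres_mydb_1590969600.pgdump
2020-06-02 00:00:02  52391 postgres_mydb_1591056000.pgdump
2020-06-03 00:00:04  52401 postgres_mydb_1591142400.pgdump"

# Same pipeline as the restore script: take the 4th column,
# sort lexically (safe here because the timestamps are all the
# same width), and keep the last entry.
latest=$(echo "$listing" | awk '{print $4}' | sort | tail -n 1)
echo "$latest"
```

If your backup filenames used a human-readable date instead of a timestamp, you would want a format that also sorts lexically, such as &lt;code&gt;YYYY-MM-DD&lt;/code&gt;.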

&lt;h3&gt;
  
  
  Other considerations
&lt;/h3&gt;

&lt;p&gt;When using production backups locally, there are two important points to keep in mind.&lt;/p&gt;

&lt;p&gt;First, production data can contain sensitive user information, including names, addresses, emails and even credit card details. You need to ensure that this data is only distributed to people who are authorised to access it, or alternatively sanitize the backups so that the sensitive data is overwritten or removed.&lt;/p&gt;
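&lt;p&gt;As an illustration of the sanitization option, here's a rough sketch of scrubbing email addresses from a plain-text dump with &lt;code&gt;sed&lt;/code&gt;. This is just to show the idea: in a real Django app you'd more likely do this inside &lt;code&gt;shell_plus&lt;/code&gt; against the actual models, and you'd need to cover names, addresses and payment details too:&lt;/p&gt;

```shell
# Replace anything that looks like an email address with a
# safe placeholder. Illustrative only: a real sanitization pass
# needs to handle much more than emails.
sanitize_emails() {
    sed -E 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/redacted@example.com/g'
}

echo "contact: jane.doe@gmail.com" | sanitize_emails
# prints: contact: redacted@example.com
```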

&lt;p&gt;Second, it's possible to use database backups to debug issues in production. I think it's a great method for squashing hard-to-reproduce bugs, but it shouldn't be your only way to solve production errors. Before you move on to this technique, you should first ensure you have &lt;a href="https://mattsegal.dev/file-logging-django.html"&gt;application logging&lt;/a&gt; and &lt;a href="https://mattsegal.dev/sentry-for-django-error-monitoring.html"&gt;error monitoring&lt;/a&gt; set up for your Django app, so that you don't lean on your backups as a crutch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next steps
&lt;/h3&gt;

&lt;p&gt;If you don't already have automated prod backups, I encourage you to set that up if you have any valuable data in your Django app. Once that's done, you'll be able to use this script to pull down prod data into your local dev environment on demand.&lt;/p&gt;

</description>
      <category>django</category>
      <category>postgres</category>
      <category>bash</category>
      <category>database</category>
    </item>
    <item>
      <title>How to automatically reset your local Django database</title>
      <dc:creator>Matthew Segal</dc:creator>
      <pubDate>Sun, 21 Jun 2020 10:45:40 +0000</pubDate>
      <link>https://dev.to/mattdsegal/how-to-automatically-reset-your-local-django-database-354b</link>
      <guid>https://dev.to/mattdsegal/how-to-automatically-reset-your-local-django-database-354b</guid>
      <description>&lt;p&gt;Sometimes when you're working on a Django app you want a fresh start. You want to nuke all of the data in your local database and start again from scratch. Maybe you ran some migrations that you don't want to keep, or perhaps there's some test data that you want to get rid of. This kind of problem doesn't crop up very often, but when it does it's &lt;em&gt;super&lt;/em&gt; annoying to do it manually over and over.&lt;/p&gt;

&lt;p&gt;In this post I'll show you a small script that you can use to reset your local Django database. It completely automates deleting the old data, running migrations and setting up new users. I've written the script in &lt;code&gt;bash&lt;/code&gt; but most of it will also work in &lt;code&gt;powershell&lt;/code&gt; or &lt;code&gt;cmd&lt;/code&gt; with only minor changes.&lt;/p&gt;

&lt;p&gt;For those of you who hate reading, the full script is near the bottom.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resetting the database
&lt;/h3&gt;

&lt;p&gt;We're going to reset our local database with the &lt;a href="https://django-extensions.readthedocs.io/en/latest/installation_instructions.html"&gt;django-extensions&lt;/a&gt; package, which provides a nifty little helper command called &lt;code&gt;reset_db&lt;/code&gt;. This command destroys and recreates your Django app's database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;./manage.py reset_db
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I like to add the &lt;code&gt;--noinput&lt;/code&gt; flag so the script does not ask me for confirmation, and the &lt;code&gt;--close-sessions&lt;/code&gt; flag if I'm using PostgreSQL locally so that the command does not fail if my Django app is connected to the database at the same time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;./manage.py reset_db &lt;span class="nt"&gt;--noinput&lt;/span&gt; &lt;span class="nt"&gt;--close-sessions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This is a good start, but now we have no migrations, users or any other data in our database. We need to add some data back in there before we can start using the app again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running migrations
&lt;/h3&gt;

&lt;p&gt;Before you do anything else it's important to run migrations so that all your database tables are set up correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;./manage.py migrate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating an admin user
&lt;/h3&gt;

&lt;p&gt;You want to have a superuser set up so you can log into the Django admin. It's nice when a script guarantees that your superuser always has the same username and password. The first part of creating a superuser is pretty standard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;./manage.py createsuperuser &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--username&lt;/span&gt; admin &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--email&lt;/span&gt; admin@example.com &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--noinput&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now we want to set the admin user's password to something easy to remember, like "12345". This isn't a security risk because it's just for local development. This step involves a little more scripting trickery. Here we can use &lt;code&gt;shell_plus&lt;/code&gt;, which is an enhanced Django shell provided by django-extensions. The &lt;code&gt;shell_plus&lt;/code&gt; command will automatically import all of our models, which means we can write short one-liners like this one, which prints the number of Users in the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;./manage.py shell_plus &lt;span class="nt"&gt;--quiet-load&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"print(User.objects.count())"&lt;/span&gt;
&lt;span class="c"&gt;# 13&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Using this method we can grab our admin user and set their password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;./manage.py shell_plus &lt;span class="nt"&gt;--quiet-load&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
u = User.objects.get(username='admin')
u.set_password('12345')
u.save()
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Setting up new data
&lt;/h3&gt;

&lt;p&gt;There might be a little bit of data that you want to set up every time you reset your database. For example, in one app I run, I want to ensure that there is always a &lt;code&gt;SlackMessage&lt;/code&gt; model that has a &lt;code&gt;SlackChannel&lt;/code&gt;. We can set up this data in the same way we set up the admin user's password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;./manage.py shell_plus &lt;span class="nt"&gt;--quiet-load&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
c = SlackChannel.objects.create(name='Test Alerts')
SlackMessage.objects.create(channel=c)
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you need to set up a &lt;em&gt;lot&lt;/em&gt; of data then there are options like &lt;a href="https://docs.djangoproject.com/en/3.0/howto/initial-data/"&gt;fixtures&lt;/a&gt; or tools like &lt;a href="https://factoryboy.readthedocs.io/en/latest/"&gt;Factory Boy&lt;/a&gt; (which I heartily recommend). If you only need to do a few lines of scripting to create your data, then you can include them in this script. If your development data setup is very complicated, then I recommend putting all the setup code into a custom management command.&lt;/p&gt;

&lt;h3&gt;
  
  
  The final script
&lt;/h3&gt;

&lt;p&gt;This is the script that you can use to reset your local Django database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Resets the local Django database, adding an admin login and migrations&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Resetting the database"&lt;/span&gt;
./manage.py reset_db &lt;span class="nt"&gt;--close-sessions&lt;/span&gt; &lt;span class="nt"&gt;--noinput&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Running migrations"&lt;/span&gt;
./manage.py migrate

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Creating new superuser 'admin'"&lt;/span&gt;
./manage.py createsuperuser &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--username&lt;/span&gt; admin &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--email&lt;/span&gt; admin@example.com &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--noinput&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Setting superuser 'admin' password to 12345"&lt;/span&gt;
./manage.py shell_plus &lt;span class="nt"&gt;--quiet-load&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
u=User.objects.get(username='admin')
u.set_password('12345')
u.save()
"&lt;/span&gt;

&lt;span class="c"&gt;# Any extra data setup goes here.&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;&amp;gt;&amp;gt; Database restore finished."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Other methods
&lt;/h3&gt;

&lt;p&gt;It's good to note that what I'm proposing is the "nuclear option": purge everything and restart from scratch. There are also some more precise methods available for managing your local database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you just want to reverse some particular migrations, then you can use the &lt;code&gt;migrate&lt;/code&gt; command &lt;a href="https://docs.djangoproject.com/en/3.0/topics/migrations/#reversing-migrations"&gt;as documented here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If you just want to delete all your data and you don't care about re-applying the migrations, then the &lt;code&gt;flush&lt;/code&gt; management command, &lt;a href="https://docs.djangoproject.com/en/3.0/ref/django-admin/#flush"&gt;documented here&lt;/a&gt; will take care of that.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Docker environments
&lt;/h3&gt;

&lt;p&gt;If you're running your local Django app in a Docker container via &lt;code&gt;docker-compose&lt;/code&gt;, then this process is a little trickier, but not by much: you just need to add two helper functions to your script.&lt;/p&gt;

&lt;p&gt;First you want a command to kill all running containers, which I do because I'm superstitious and don't trust that &lt;code&gt;reset_db&lt;/code&gt; will actually close all database connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;function &lt;/span&gt;stop_docker &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Stopping all running Docker containers"&lt;/span&gt;
    &lt;span class="c"&gt;# Ensure that no containers automatically restart&lt;/span&gt;
    docker update &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;no &lt;span class="sb"&gt;`&lt;/span&gt;docker ps &lt;span class="nt"&gt;-q&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;
    &lt;span class="c"&gt;# Kill everything&lt;/span&gt;
    docker &lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="sb"&gt;`&lt;/span&gt;docker ps &lt;span class="nt"&gt;-q&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We also want a shorthand way to run commands inside your docker environment. Let's say you are working with a compose file located at &lt;code&gt;docker/docker-compose.local.yml&lt;/code&gt; and your Django app's container is called &lt;code&gt;web&lt;/code&gt;. Then you can run your commands inside the container as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;function &lt;/span&gt;run_docker &lt;span class="o"&gt;{&lt;/span&gt;
    docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker/docker-compose.local.yml run &lt;span class="nt"&gt;--rm&lt;/span&gt; web &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now we can just prefix &lt;code&gt;run_docker&lt;/code&gt; to all the management commands we run. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Without Docker&lt;/span&gt;
./manage.py reset_db &lt;span class="nt"&gt;--close-sessions&lt;/span&gt; &lt;span class="nt"&gt;--noinput&lt;/span&gt;
&lt;span class="c"&gt;# With Docker&lt;/span&gt;
run_docker ./manage.py reset_db &lt;span class="nt"&gt;--close-sessions&lt;/span&gt; &lt;span class="nt"&gt;--noinput&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I will note that this &lt;code&gt;run_docker&lt;/code&gt; shortcut can act a little strangely when you're passing strings to &lt;code&gt;shell_plus&lt;/code&gt;: bash word-splitting can mangle arguments that contain spaces. Forwarding the arguments as a quoted &lt;code&gt;"$@"&lt;/code&gt; helps, but you might still need to experiment with different methods of escaping whitespace.&lt;/p&gt;
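&lt;p&gt;The weirdness mostly comes down to how bash re-splits unquoted arguments: forwarding them as a bare &lt;code&gt;$@&lt;/code&gt; breaks up any argument containing spaces, while &lt;code&gt;"$@"&lt;/code&gt; preserves each argument exactly. A quick self-contained demonstration:&lt;/p&gt;

```shell
# count_args reports how many arguments it received.
count_args() { echo $#; }

# Forwarding unquoted: word-splitting re-splits "a b" into two args.
forward_unquoted() { count_args $@; }

# Forwarding quoted: each original argument is preserved intact.
forward_quoted() { count_args "$@"; }

forward_unquoted "a b" c   # prints 3
forward_quoted "a b" c     # prints 2
```

This is why a wrapper like &lt;code&gt;run_docker&lt;/code&gt; should forward its arguments as &lt;code&gt;"$@"&lt;/code&gt; if you want multi-word strings to survive the trip into &lt;code&gt;shell_plus -c&lt;/code&gt;.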

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Hopefully this script will save you some time when you're working on your Django app. If you're interested in more Django-related database stuff then you might enjoy reading about how to &lt;a href="https://mattsegal.dev/postgres-backup-and-restore.html"&gt;back up and restore a Postgres database&lt;/a&gt; and then how to &lt;a href="https://mattsegal.dev/postgres-backup-automate.html"&gt;fully automate your prod backup process&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>django</category>
      <category>bash</category>
      <category>database</category>
    </item>
    <item>
      <title>How to automate your Postgres database backups</title>
      <dc:creator>Matthew Segal</dc:creator>
      <pubDate>Sun, 21 Jun 2020 10:43:16 +0000</pubDate>
      <link>https://dev.to/mattdsegal/how-to-automate-your-postgres-database-backups-54p9</link>
      <guid>https://dev.to/mattdsegal/how-to-automate-your-postgres-database-backups-54p9</guid>
      <description>&lt;p&gt;If you've got a web app running in production, then you'll want to take &lt;a href="https://mattsegal.dev/postgres-backup-and-restore.html"&gt;regular database backups&lt;/a&gt;, or else you risk losing all your data. Taking these backups manually is fine, but it's easy to forget to do it. It's better to remove the chance of human error and automate the whole process. To automate your backup and restore you will need three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A safe place to store your backup files&lt;/li&gt;
&lt;li&gt;A script that creates the backups and uploads them to the safe place&lt;/li&gt;
&lt;li&gt;A method to automatically run the backup script every day&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A safe place for your database backup files
&lt;/h3&gt;

&lt;p&gt;You don't want to store your backup files on the same server as your database. If your database server gets deleted, then you'll lose your backups as well. Instead, you should store your backups somewhere else, like an external hard drive, your PC, or the cloud.&lt;/p&gt;

&lt;p&gt;I like using cloud object storage for this kind of use-case. If you haven't heard of "object storage" before: it's just a kind of cloud service where you can store a bunch of files. All major cloud providers offer this service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon's AWS has the &lt;a href="https://aws.amazon.com/s3/"&gt;Simple Storage Service (S3)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Microsoft's Azure has &lt;a href="https://azure.microsoft.com/en-us/services/storage/"&gt;Storage&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Google Cloud also has &lt;a href="https://cloud.google.com/storage"&gt;Storage&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;DigitalOcean has &lt;a href="https://www.digitalocean.com/products/spaces/"&gt;Spaces&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These object storage services are &lt;em&gt;very&lt;/em&gt; cheap at around 2c/GB/month, you'll never run out of disk space, they're easy to access from command line tools and they have very fast upload/download speeds, especially to/from other services hosted with the same cloud provider. I use these services a lot: this blog is being served from AWS S3.&lt;/p&gt;

&lt;p&gt;I like using S3 simply because I'm quite familiar with it, so that's what we're going to use for the rest of this post. If you're not already familiar with using the AWS command-line, then check out this post I wrote about &lt;a href="https://mattsegal.dev/aws-s3-intro.html"&gt;getting started with AWS S3&lt;/a&gt; before you continue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a database backup script
&lt;/h3&gt;

&lt;p&gt;In my &lt;a href="https://mattsegal.dev/postgres-backup-and-restore.html"&gt;previous post on database backups&lt;/a&gt; I showed you a small script to automatically take a backup using PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Backs up mydatabase to a file.&lt;/span&gt;
&lt;span class="nv"&gt;TIME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="s2"&gt;"+%s"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgres_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PGDATABASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.pgdump"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backing up &lt;/span&gt;&lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt;&lt;span class="s2"&gt; to &lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
pg_dump &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;custom &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backup completed for &lt;/span&gt;&lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I'm going to assume you have set up your Postgres database environment variables (&lt;code&gt;PGHOST&lt;/code&gt;, etc) either in the script, or elsewhere, as mentioned in the previous post.&lt;br&gt;
Next we're going to get our script to upload all backups to AWS S3.&lt;/p&gt;
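&lt;p&gt;For reference, those Postgres environment variables look something like this (the values below are placeholders, not real settings or credentials):&lt;/p&gt;

```shell
# Placeholder connection settings read by pg_dump / pg_restore.
# Substitute your own host, database name and credentials.
export PGHOST=localhost
export PGPORT=5432
export PGDATABASE=mydatabase
export PGUSER=postgres
export PGPASSWORD=example-password  # use a secrets store in production
```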
&lt;h3&gt;
  
  
  Uploading backups to AWS Simple Storage Service (S3)
&lt;/h3&gt;

&lt;p&gt;We will be uploading our backups to S3 with the &lt;code&gt;aws&lt;/code&gt; command line (CLI) tool. To get this tool to work, we need to set up our AWS credentials on the server by either using &lt;code&gt;aws configure&lt;/code&gt; or by setting the environment variables &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;. Once that's done we can use &lt;code&gt;aws s3 cp&lt;/code&gt; to upload our backup files. Let's say we're using a bucket called "&lt;code&gt;mydatabase-backups&lt;/code&gt;":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Backs up mydatabase to a file and then uploads it to AWS S3.&lt;/span&gt;
&lt;span class="c"&gt;# First, dump database backup to a file&lt;/span&gt;
&lt;span class="nv"&gt;TIME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="s2"&gt;"+%s"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgres_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PGDATABASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.pgdump"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backing up &lt;/span&gt;&lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt;&lt;span class="s2"&gt; to &lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
pg_dump &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;custom &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;

&lt;span class="c"&gt;# Second, copy file to AWS S3&lt;/span&gt;
&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;s3://mydatabase-backups
&lt;span class="nv"&gt;S3_TARGET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt;/&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Copying &lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt; to &lt;/span&gt;&lt;span class="nv"&gt;$S3_TARGET&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt; &lt;span class="nv"&gt;$S3_TARGET&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backup completed for &lt;/span&gt;&lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You should be able to run this multiple times and see a new backup appear in your S3 bucket each time. As a bonus, you can add a little one-liner at the end of your script that prints the most recently uploaded file in the S3 bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nv"&gt;BACKUP_RESULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws s3 &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 1&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Latest S3 backup: &lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Once you're confident that your backup script works, we can move on to getting it to run every day.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running cron jobs
&lt;/h3&gt;

&lt;p&gt;Now we need to get our server to run this script every day, even when we're not around. The simplest way to do this on a Linux server is with &lt;a href="https://en.wikipedia.org/wiki/Cron"&gt;cron&lt;/a&gt;. Cron can automatically run scripts for us on a schedule. We'll be using the &lt;code&gt;crontab&lt;/code&gt; tool to set up our backup job.&lt;/p&gt;

&lt;p&gt;You can read more about how to use crontab &lt;a href="https://linuxize.com/post/scheduling-cron-jobs-with-crontab/"&gt;here&lt;/a&gt;. If you find that you're having issues setting up cron, you might also find this &lt;a href="https://serverfault.com/questions/449651/why-is-my-crontab-not-working-and-how-can-i-troubleshoot-it"&gt;StackOverflow post&lt;/a&gt; useful.&lt;/p&gt;

&lt;p&gt;Before we set up our daily database backup job, I suggest trying out a test script to make sure that your cron setup is working. For example, this script prints the current time when it is run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;nano&lt;/code&gt;, you can create a new file called &lt;code&gt;~/test.sh&lt;/code&gt;, save it, then make it executable as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;nano ~/test.sh
&lt;span class="c"&gt;# Write out the time printing script in nano, save the file.&lt;/span&gt;
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ~/test.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then you can test it out a little by running it a couple of times to check that it is printing the time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;~/test.sh
&lt;span class="c"&gt;# Sat Jun  6 08:05:14 UTC 2020&lt;/span&gt;
~/test.sh
&lt;span class="c"&gt;# Sat Jun  6 08:05:14 UTC 2020&lt;/span&gt;
~/test.sh
&lt;span class="c"&gt;# Sat Jun  6 08:05:14 UTC 2020&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Once you're confident that your test script works, you can create a cron job to run it every minute. Cron uses a special syntax to specify how often a job runs. These "cron expressions" are a pain to write by hand, so I use &lt;a href="https://crontab.cronhub.io/"&gt;this tool&lt;/a&gt; to generate them. The cron expression for "every minute" is the inscrutable string "&lt;code&gt;* * * * *&lt;/code&gt;". This is the crontab entry that we're going to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test crontab entry&lt;/span&gt;
&lt;span class="nv"&gt;SHELL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/bin/bash
&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; ~/test.sh &amp;amp;&amp;gt;&amp;gt; ~/time.log
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;SHELL&lt;/code&gt; setting tells crontab to use bash to execute our command&lt;/li&gt;
&lt;li&gt;The "&lt;code&gt;* * * * *&lt;/code&gt;" entry tells cron to execute our command every minute&lt;/li&gt;
&lt;li&gt;The command &lt;code&gt;~/test.sh &amp;amp;&amp;gt;&amp;gt; ~/time.log&lt;/code&gt; runs our test script &lt;code&gt;~/test.sh&lt;/code&gt; and then appends all output to a log file called &lt;code&gt;~/time.log&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
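&lt;p&gt;If you'd rather decode cron expressions yourself than use a generator, this is how the five fields break down, along with a few common schedules:&lt;/p&gt;

```shell
# Field order in a cron expression (all five fields are required):
#
#  * * * * *
#  | | | | |
#  | | | | +-- day of week (0-6, Sunday = 0)
#  | | | +---- month (1-12)
#  | | +------ day of month (1-31)
#  | +-------- hour (0-23)
#  +---------- minute (0-59)
#
# Example schedules:
#   * * * * *      every minute
#   0 0 * * *      every day at midnight
#   0 3 * * 0      every Sunday at 3am
#   */15 * * * *   every 15 minutes
```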

&lt;p&gt;Enter the text above into your user's crontab file using the crontab editor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;crontab &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Once you've saved your entry, you should then be able to view your crontab entry using the list command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;crontab &lt;span class="nt"&gt;-l&lt;/span&gt;
&lt;span class="c"&gt;# SHELL=/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# * * * * * ~/test.sh &amp;amp;&amp;gt;&amp;gt; ~/time.log&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can check that cron is actually trying to run your script by watching the system log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/log/syslog | &lt;span class="nb"&gt;grep &lt;/span&gt;CRON
&lt;span class="c"&gt;# Jun  6 11:17:01 swarm CRON[6908]: (root) CMD (~/test.sh &amp;amp;&amp;gt;&amp;gt; ~/time.log)&lt;/span&gt;
&lt;span class="c"&gt;# Jun  6 11:17:01 swarm CRON[6908]: (root) CMD (~/test.sh &amp;amp;&amp;gt;&amp;gt; ~/time.log)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can also watch your logfile to see that time is being written every minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; time.log
&lt;span class="c"&gt;# Sat Jun 6 11:34:01 UTC 2020&lt;/span&gt;
&lt;span class="c"&gt;# Sat Jun 6 11:35:01 UTC 2020&lt;/span&gt;
&lt;span class="c"&gt;# Sat Jun 6 11:36:01 UTC 2020&lt;/span&gt;
&lt;span class="c"&gt;# Sat Jun 6 11:37:01 UTC 2020&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Once you're happy that you can run a test script every minute with cron, we can move on to running your database backup script daily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running our backup script daily
&lt;/h3&gt;

&lt;p&gt;Now we're nearly ready to run our backup script using a cron job. There are a few changes that we'll need to make to our existing setup. First we need to write our database backup script to &lt;code&gt;~/backup.sh&lt;/code&gt; and make sure it is executable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ~/backup.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
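&lt;p&gt;If you'd like to do this in one step, here's a sketch that writes the script with a heredoc and marks it executable. The contents are based on the backup script from my previous post; the bucket name &lt;code&gt;mydatabase-backups&lt;/code&gt; is an example, and the Postgres environment variables are assumed to be exported elsewhere:&lt;/p&gt;

```shell
# Write a minimal ~/backup.sh in one go (a sketch, not the definitive script).
# The quoted 'EOF' stops the shell expanding $TIME etc. while writing the file.
cat > ~/backup.sh <<'EOF'
#!/bin/bash
TIME=$(date "+%s")
BACKUP_FILE="postgres_${PGDATABASE}_${TIME}.pgdump"
echo "Backing up $PGDATABASE to $BACKUP_FILE"
pg_dump --format=custom > "$BACKUP_FILE"
aws s3 cp "$BACKUP_FILE" s3://mydatabase-backups/
echo "Backup completed"
EOF
chmod +x ~/backup.sh
```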



&lt;p&gt;Then we need our crontab entry to run every day, which will be "&lt;a href="https://crontab.cronhub.io/"&gt;&lt;code&gt;0 0 * * *&lt;/code&gt;&lt;/a&gt;", and we need to update our cron command to run our backup script. Our new crontab entry should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Database backup crontab entry&lt;/span&gt;
&lt;span class="nv"&gt;SHELL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/bin/bash
0 0 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; ~/backup.sh &amp;amp;&amp;gt;&amp;gt; ~/backup.log
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Update your crontab with &lt;code&gt;crontab -e&lt;/code&gt;. Now we wait! This script should run every night at midnight (server time) to take your database backups and upload them to AWS S3. If this isn't working, then change your cron expression so that it runs the script every minute, and use the steps I showed above to try to debug the problem.&lt;/p&gt;
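&lt;p&gt;If you'd rather not use the interactive editor, you can also install the entry non-interactively. This is a sketch: it appends our entry to whatever crontab already exists, then installs the result (the &lt;code&gt;command -v&lt;/code&gt; guard just skips the install on machines without cron):&lt;/p&gt;

```shell
# Non-interactive alternative to `crontab -e`: append our entry to the
# existing crontab (if any) and install the combined result.
ENTRY='0 0 * * * ~/backup.sh &>> ~/backup.log'
if command -v crontab >/dev/null; then
  (crontab -l 2>/dev/null; echo "SHELL=/bin/bash"; echo "$ENTRY") | crontab -
  crontab -l   # check the result
fi
```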

&lt;p&gt;Hopefully it all runs OK and you will have plenty of daily database backups to roll back to if anything ever goes wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic restore from the latest backup
&lt;/h3&gt;

&lt;p&gt;When disaster strikes and you need your backups, you could manually view your S3 bucket, download the backup file, upload it to the server and manually run the restore, which I documented in my &lt;a href="https://mattsegal.dev/postgres-backup-and-restore.html"&gt;previous post&lt;/a&gt;. This is totally fine, but as a bonus I thought it would be nice to include a script that automatically downloads the latest backup file and uses it to restore your database. This kind of script is ideal for dumping production data into a test server. First I'll show you the script, then I'll explain how it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Restoring database &lt;/span&gt;&lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt;&lt;span class="s2"&gt; from S3 backups"&lt;/span&gt;

&lt;span class="c"&gt;# Find the latest backup file&lt;/span&gt;
&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;s3://mydatabase-backups
&lt;span class="nv"&gt;LATEST_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws s3 &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $4}'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 1&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Found file &lt;/span&gt;&lt;span class="nv"&gt;$LATEST_FILE&lt;/span&gt;&lt;span class="s2"&gt; in bucket &lt;/span&gt;&lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Restore from the latest backup file&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Restoring &lt;/span&gt;&lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt;&lt;span class="s2"&gt; from &lt;/span&gt;&lt;span class="nv"&gt;$LATEST_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;S3_TARGET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt;/&lt;span class="nv"&gt;$LATEST_FILE&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nv"&gt;$S3_TARGET&lt;/span&gt; - | pg_restore &lt;span class="nt"&gt;--dbname&lt;/span&gt; &lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt; &lt;span class="nt"&gt;--clean&lt;/span&gt; &lt;span class="nt"&gt;--no-owner&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Restore completed"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I've assumed that all the Postgres environment variables (&lt;code&gt;PGHOST&lt;/code&gt;, etc) are already set elsewhere.&lt;/p&gt;

&lt;p&gt;There are three tasks that are done in this script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;finding the latest backup file in S3&lt;/li&gt;
&lt;li&gt;downloading the backup file&lt;/li&gt;
&lt;li&gt;restoring from the backup file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the first part of this script finds the latest database backup file. We know which file is the latest because of the Unix timestamp that we added to each filename. The first command we use is &lt;code&gt;aws s3 ls&lt;/code&gt;, which shows us all the files in our backup bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt;
&lt;span class="c"&gt;# 2019-04-04 10:04:58     112309 postgres_mydatabase_1554372295.pgdump&lt;/span&gt;
&lt;span class="c"&gt;# 2019-04-06 07:48:53     112622 postgres_mydatabase_1554536929.pgdump&lt;/span&gt;
&lt;span class="c"&gt;# 2019-04-14 07:24:02     113484 postgres_mydatabase_1555226638.pgdump&lt;/span&gt;
&lt;span class="c"&gt;# 2019-05-06 11:37:39     115805 postgres_mydatabase_1557142655.pgdump&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We then use &lt;code&gt;awk&lt;/code&gt; to isolate the filename. &lt;code&gt;awk&lt;/code&gt; is a text processing tool which I use occasionally, along with &lt;code&gt;cut&lt;/code&gt; and &lt;code&gt;sed&lt;/code&gt;, to mangle streams of text into the shape I want. I hate them all, but they can be useful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $4}'&lt;/span&gt;
&lt;span class="c"&gt;# postgres_mydatabase_1554372295.pgdump&lt;/span&gt;
&lt;span class="c"&gt;# postgres_mydatabase_1554536929.pgdump&lt;/span&gt;
&lt;span class="c"&gt;# postgres_mydatabase_1555226638.pgdump&lt;/span&gt;
&lt;span class="c"&gt;# postgres_mydatabase_1557142655.pgdump&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We then run &lt;code&gt;sort&lt;/code&gt; over this output to ensure that the lines are ordered by timestamp. The aws CLI tool seems to return this data sorted by upload time, but we want to use &lt;em&gt;our&lt;/em&gt; timestamp, just in case a file was manually uploaded out-of-order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $4}'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt;
&lt;span class="c"&gt;# postgres_mydatabase_1554372295.pgdump&lt;/span&gt;
&lt;span class="c"&gt;# postgres_mydatabase_1554536929.pgdump&lt;/span&gt;
&lt;span class="c"&gt;# postgres_mydatabase_1555226638.pgdump&lt;/span&gt;
&lt;span class="c"&gt;# postgres_mydatabase_1557142655.pgdump&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
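&lt;p&gt;One thing to note: plain &lt;code&gt;sort&lt;/code&gt; compares lines as text, which only matches chronological order here because the epoch timestamps all have the same number of digits (10 digits until the year 2286). A quick sanity check with some out-of-order filenames:&lt;/p&gt;

```shell
# Lexicographic sort matches chronological order while the embedded
# timestamps are all the same width (10-digit epoch seconds).
LATEST=$(printf '%s\n' \
  postgres_mydatabase_1557142655.pgdump \
  postgres_mydatabase_1554372295.pgdump \
  postgres_mydatabase_1554536929.pgdump \
  | sort | tail -n 1)
echo "$LATEST"
# postgres_mydatabase_1557142655.pgdump
```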



&lt;p&gt;We use &lt;code&gt;tail&lt;/code&gt; to grab the last line of the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $4}'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 1
&lt;span class="c"&gt;# postgres_mydatabase_1557142655.pgdump&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And there's our filename! We use the &lt;code&gt;$()&lt;/code&gt; &lt;a href="http://www.tldp.org/LDP/abs/html/commandsub.html"&gt;command substitution&lt;/a&gt; thingy to capture the command output and store it in a variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LATEST_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws s3 &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $4}'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 1&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$LATEST_FILE&lt;/span&gt;
&lt;span class="c"&gt;# postgres_mydatabase_1557142655.pgdump&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And that's part one of our script done: find the latest backup file. Now we need to download that file and use it to restore our database. We use the &lt;code&gt;aws&lt;/code&gt; CLI to copy the backup file from S3 and stream the bytes to stdout. This literally prints your whole backup file into the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;S3_TARGET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt;/&lt;span class="nv"&gt;$LATEST_FILE&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nv"&gt;$S3_TARGET&lt;/span&gt; -
&lt;span class="c"&gt;# xtshirt9.5.199.5.19k0ENCODINENCODING&lt;/span&gt;
&lt;span class="c"&gt;# SET client_encoding = 'UTF8';&lt;/span&gt;
&lt;span class="c"&gt;# false00&lt;/span&gt;
&lt;span class="c"&gt;# ... etc ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-&lt;/code&gt; symbol is commonly used in shell scripting to mean "write to stdout". This isn't very useful on its own, but we can send that data to the &lt;code&gt;pg_restore&lt;/code&gt; command via a pipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;S3_TARGET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$S3_BUCKET&lt;/span&gt;/&lt;span class="nv"&gt;$LATEST_FILE&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nv"&gt;$S3_TARGET&lt;/span&gt; - | pg_restore &lt;span class="nt"&gt;--dbname&lt;/span&gt; &lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt; &lt;span class="nt"&gt;--clean&lt;/span&gt; &lt;span class="nt"&gt;--no-owner&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And that's the whole script!&lt;/p&gt;

&lt;h3&gt;
  
  
  Next steps
&lt;/h3&gt;

&lt;p&gt;Now you can set up automated backups for your Postgres database. Hopefully having these daily backups will take a weight off your mind. Don't forget to do a test restore every now and then, because backups are worthless if you aren't confident that they actually work.&lt;/p&gt;

&lt;p&gt;If you want to learn more about the Unix shell tools I used in this post, then I recommend having a go at the &lt;a href="https://overthewire.org/"&gt;OverTheWire wargames&lt;/a&gt;, which teach you about bash scripting and hacking at the same time.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>django</category>
      <category>bash</category>
      <category>database</category>
    </item>
    <item>
      <title>How to backup and restore a Postgres database</title>
      <dc:creator>Matthew Segal</dc:creator>
      <pubDate>Sun, 21 Jun 2020 10:38:44 +0000</pubDate>
      <link>https://dev.to/mattdsegal/how-to-backup-and-restore-a-postgres-database-44o0</link>
      <guid>https://dev.to/mattdsegal/how-to-backup-and-restore-a-postgres-database-44o0</guid>
      <description>&lt;p&gt;You've deployed your Django web app to to the internet. Grats! Now you have a fun new problem: your app's database is full of precious "live" data, and if you lose that data, it's gone forever. If your database gets blown away or corrupted, then you will need backups to restore your data. This post will go over how to backup and restore PostgreSQL, which is the database most commonly deployed with Django.&lt;/p&gt;

&lt;p&gt;Not everyone needs backups. If your Django app is just a hobby project then losing all your data might not be such a big deal. That said, if your app is a critical part of a business, then losing your app's data could literally mean the end of the business: people losing their jobs and going bankrupt. So, at least some of the time, you don't want to lose all your data.&lt;/p&gt;

&lt;p&gt;The good news is that backing up and restoring Postgres is pretty easy: you only need two commands, &lt;code&gt;pg_dump&lt;/code&gt; and &lt;code&gt;pg_restore&lt;/code&gt;. If you're using MySQL instead of Postgres, then you can do something very similar to the instructions in this post using &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/mysqldump.html"&gt;&lt;code&gt;mysqldump&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Taking database backups
&lt;/h3&gt;

&lt;p&gt;I'm going to assume that you've already got a Postgres database running somewhere. You'll need to run the following code from a &lt;code&gt;bash&lt;/code&gt; shell on a Linux machine that can access the database. In this example, let's say you're logged into the database server with &lt;code&gt;ssh&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The first thing to do is set some &lt;a href="https://www.postgresql.org/docs/current/libpq-envars.html"&gt;Postgres-specific environment variables&lt;/a&gt; to specify your target database and login credentials. This is mostly for our convenience later on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The server Postgres is running on&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGHOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localhost
&lt;span class="c"&gt;# The port Postgres is listening on&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGPORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5432
&lt;span class="c"&gt;# The database you want to back up&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGDATABASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mydatabase
&lt;span class="c"&gt;# The database user you are logging in as&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGUSER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myusername
&lt;span class="c"&gt;# The database user's password&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGPASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mypassw0rd
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can test these environment variables by running a &lt;a href="https://www.postgresql.org/docs/current/app-psql.html"&gt;&lt;code&gt;psql&lt;/code&gt;&lt;/a&gt; command to list all the tables in your app's database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;psql &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s2"&gt;t"&lt;/span&gt;

&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;# List of relations&lt;/span&gt;
&lt;span class="c"&gt;# Schema | Name          | Type  | Owner&lt;/span&gt;
&lt;span class="c"&gt;#--------+---------------+-------+--------&lt;/span&gt;
&lt;span class="c"&gt;# public | auth_group    | table | myusername&lt;/span&gt;
&lt;span class="c"&gt;# public | auth_group... | table | myusername&lt;/span&gt;
&lt;span class="c"&gt;# public | auth_permi... | table | myusername&lt;/span&gt;
&lt;span class="c"&gt;# public | django_adm... | table | myusername&lt;/span&gt;
&lt;span class="c"&gt;# .. etc ..&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;psql&lt;/code&gt; is missing you can install it on Ubuntu or Debian using &lt;code&gt;apt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;postgresql-client
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now we're ready to create a database dump with &lt;a href="https://www.postgresql.org/docs/12/app-pgdump.html"&gt;&lt;code&gt;pg_dump&lt;/code&gt;&lt;/a&gt;. It's pretty simple to use because we set up those environment variables earlier. When you run &lt;code&gt;pg_dump&lt;/code&gt;, it just spits out a bunch of SQL statements as hundreds or even thousands of lines of text. You can take a look at the output using &lt;code&gt;head&lt;/code&gt; to view the first 10 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;pg_dump | &lt;span class="nb"&gt;head&lt;/span&gt;

&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;# --&lt;/span&gt;
&lt;span class="c"&gt;# -- PostgreSQL database dump&lt;/span&gt;
&lt;span class="c"&gt;# --&lt;/span&gt;
&lt;span class="c"&gt;# -- Dumped from database version 9.5.19&lt;/span&gt;
&lt;span class="c"&gt;# -- Dumped by pg_dump version 9.5.19&lt;/span&gt;
&lt;span class="c"&gt;# SET statement_timeout = 0;&lt;/span&gt;
&lt;span class="c"&gt;# SET lock_timeout = 0;&lt;/span&gt;
&lt;span class="c"&gt;# SET client_encoding = 'UTF8';&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The SQL statements produced by &lt;code&gt;pg_dump&lt;/code&gt; are instructions on how to re-create your database. You can turn this output into a backup by writing all this SQL text into a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;pg_dump &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; mybackup.sql
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That's it! You now have a database backup. You might have noticed that storing all your data as SQL statements is rather inefficient. You can compress this data by using the "custom" dump format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;pg_dump &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;custom &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; mybackup.pgdump
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This "custom" format is ~3x smaller in terms of file size, but it's not as pretty for humans to read because it's now in some funky non-text binary format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;pg_dump &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;custom | &lt;span class="nb"&gt;head&lt;/span&gt;

&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;# xtshirt9.5.199.5.19k0ENCODINENCODING&lt;/span&gt;
&lt;span class="c"&gt;# SET client_encoding = 'UTF8';&lt;/span&gt;
&lt;span class="c"&gt;# false00&lt;/span&gt;
&lt;span class="c"&gt;# ... etc ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Finally, &lt;code&gt;mybackup.pgdump&lt;/code&gt; is a crappy file name. It's not clear what is inside the file. Are we going to remember which database this is for? How do we know that this is the freshest copy? Let's add a &lt;a href="https://en.wikipedia.org/wiki/Unix_time"&gt;timestamp&lt;/a&gt; plus a descriptive name to help us remember:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get Unix epoch timestamp&lt;/span&gt;
&lt;span class="c"&gt;# Eg. 1591255548&lt;/span&gt;
&lt;span class="nv"&gt;TIME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="s2"&gt;"+%s"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# Descriptive file name&lt;/span&gt;
&lt;span class="c"&gt;# Eg. postgres_mydatabase_1591255548.pgdump&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgres_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PGDATABASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.pgdump"&lt;/span&gt;
pg_dump &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;custom &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
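&lt;p&gt;If you'd rather have a filename you can read at a glance, an ISO-style UTC timestamp works too. This is just a sketch of an alternative; because every field is zero-padded, these filenames still sort chronologically:&lt;/p&gt;

```shell
# Human-readable alternative to the epoch timestamp.
PGDATABASE=mydatabase          # example value; normally exported elsewhere
TIME=$(date -u "+%Y-%m-%d_%H%M%S")
BACKUP_FILE="postgres_${PGDATABASE}_${TIME}.pgdump"
echo "$BACKUP_FILE"
# e.g. postgres_mydatabase_2020-06-06_113401.pgdump
```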



&lt;p&gt;Now you can run these commands every month, week, or day to get a snapshot of your data. If you wanted, you could write this whole thing into a &lt;code&gt;bash&lt;/code&gt; script called &lt;code&gt;backup.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Backs up mydatabase to a file.&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGHOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localhost
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGPORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5432
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGDATABASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mydatabase
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGUSER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myusername
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGPASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mypassw0rd
&lt;span class="nv"&gt;TIME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="s2"&gt;"+%s"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgres_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PGDATABASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.pgdump"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backing up &lt;/span&gt;&lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt;&lt;span class="s2"&gt; to &lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
pg_dump &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;custom &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backup completed"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You should avoid hardcoding passwords like I just did above; it's better to pass credentials in as a script argument or environment variable. The file &lt;code&gt;/etc/environment&lt;/code&gt; is a nice place to store these kinds of credentials on a secure server.&lt;/p&gt;
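&lt;p&gt;For example, you could keep the credentials in a &lt;code&gt;KEY=value&lt;/code&gt; file and load them at the top of the script instead of hardcoding them. This is a sketch: a local example file stands in for &lt;code&gt;/etc/environment&lt;/code&gt; (which isn't sourced automatically by cron jobs, so loading it explicitly like this is useful):&lt;/p&gt;

```shell
# Load KEY=value credentials from a file rather than hardcoding them.
cat > pg_credentials <<'EOF'
PGUSER=myusername
PGPASSWORD=mypassw0rd
EOF
set -a                # auto-export every variable assigned while sourcing
. ./pg_credentials
set +a
echo "$PGUSER"
# myusername
```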

&lt;h3&gt;
  
  
  Restoring your database from backups
&lt;/h3&gt;

&lt;p&gt;It's pointless creating backups if you don't know how to use them to restore your data. There are three scenarios that I can think of where you want to run a restore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to set up your database from scratch&lt;/li&gt;
&lt;li&gt;You want to roll back your existing database to a previous point in time&lt;/li&gt;
&lt;li&gt;You want to restore data in your dev environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll go over these scenarios one at a time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Restoring from scratch
&lt;/h3&gt;

&lt;p&gt;Sometimes you can lose the database server and there is nothing left. Maybe you deleted it by accident, thinking it was a different server. Luckily you have your database backup file, and hopefully some &lt;a href="https://mattsegal.dev/intro-config-management.html"&gt;automated configuration management&lt;/a&gt; to help you quickly set the server up again.&lt;/p&gt;

&lt;p&gt;Once you've got the new server provisioned and PostgreSQL installed, you'll need to recreate the database and the user who owns it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; postgres psql &lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
    CREATE USER &lt;/span&gt;&lt;span class="nv"&gt;$PGUSER&lt;/span&gt;&lt;span class="sh"&gt; WITH PASSWORD '&lt;/span&gt;&lt;span class="nv"&gt;$PGPASSWORD&lt;/span&gt;&lt;span class="sh"&gt;';
    CREATE DATABASE &lt;/span&gt;&lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt;&lt;span class="sh"&gt; WITH OWNER &lt;/span&gt;&lt;span class="nv"&gt;$PGUSER&lt;/span&gt;&lt;span class="sh"&gt;;
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then you can set up the same environment variables that we did earlier (PGHOST, etc.) and then use &lt;a href="https://www.postgresql.org/docs/12/app-pgrestore.html"&gt;&lt;code&gt;pg_restore&lt;/code&gt;&lt;/a&gt; to restore your data.&lt;br&gt;
You'll probably see some warnings and errors, which is normal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;BACKUP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres_mydatabase_1591255548.pgdump
pg_restore &lt;span class="nt"&gt;--dbname&lt;/span&gt; &lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;

&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;# ... lots of errors ...&lt;/span&gt;
&lt;span class="c"&gt;# pg_restore: WARNING:  no privileges were granted for "public"&lt;/span&gt;
&lt;span class="c"&gt;# WARNING: errors ignored on restore: 1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I'm not 100% sure what all these errors mean, but I believe they're mostly caused by the restore script trying to modify Postgres objects that your user does not have permission to modify. For a standard Django app this shouldn't be an issue. You can check that the restore actually worked by inspecting your tables with &lt;code&gt;psql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check the tables&lt;/span&gt;
psql &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s2"&gt;t"&lt;/span&gt;

&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;# List of relations&lt;/span&gt;
&lt;span class="c"&gt;# Schema | Name          | Type  | Owner&lt;/span&gt;
&lt;span class="c"&gt;#--------+---------------+-------+--------&lt;/span&gt;
&lt;span class="c"&gt;# public | auth_group    | table | myusername&lt;/span&gt;
&lt;span class="c"&gt;# public | auth_group... | table | myusername&lt;/span&gt;
&lt;span class="c"&gt;# public | auth_permi... | table | myusername&lt;/span&gt;
&lt;span class="c"&gt;# public | django_adm... | table | myusername&lt;/span&gt;
&lt;span class="c"&gt;# .. etc ..&lt;/span&gt;

&lt;span class="c"&gt;# Check the last migration&lt;/span&gt;
psql &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT * FROM django_migrations ORDER BY id DESC LIMIT 1"&lt;/span&gt;

&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;#  id |  app   | name      | applied&lt;/span&gt;
&lt;span class="c"&gt;# ----+--------+-----------+---------------&lt;/span&gt;
&lt;span class="c"&gt;#  20 | tshirt | 0003_a... | 2019-08-26...&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There you go! Your database has been restored. Crisis averted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rolling back an existing database
&lt;/h3&gt;

&lt;p&gt;If you want to roll your existing database back to a previous point in time, deleting all newer data, then you will need to use the &lt;code&gt;--clean&lt;/code&gt; flag, which drops your database tables before re-creating them from the backup (&lt;a href="https://www.postgresql.org/docs/12/app-pgrestore.html"&gt;docs here&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;BACKUP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres_mydatabase_1591255548.pgdump
pg_restore &lt;span class="nt"&gt;--clean&lt;/span&gt; &lt;span class="nt"&gt;--dbname&lt;/span&gt; &lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Restoring a dev environment
&lt;/h3&gt;

&lt;p&gt;It's often beneficial to restore a testing or development database from a known backup.&lt;br&gt;
When you do this, you're not so worried about setting up the right user permissions.&lt;br&gt;
In this case you want to completely destroy and re-create the database to get a fresh start, and you want to use the &lt;code&gt;--no-owner&lt;/code&gt; flag to ignore any database-user related stuff in the restore script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; postgres psql &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"DROP DATABASE &lt;/span&gt;&lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; postgres psql &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"CREATE DATABASE &lt;/span&gt;&lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres_mydatabase_1591255548.pgdump
pg_restore &lt;span class="nt"&gt;--no-owner&lt;/span&gt; &lt;span class="nt"&gt;--dbname&lt;/span&gt; &lt;span class="nv"&gt;$PGDATABASE&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I use this method quite often to pull non-sensitive data down from production environments to try to reproduce bugs that have occurred in prod. It's much easier to fix mysterious bugs when you have regular database backups, &lt;a href="https://mattsegal.dev/sentry-for-django-error-monitoring.html"&gt;error reporting&lt;/a&gt; and &lt;a href="https://mattsegal.dev/django-logging-papertrail.html"&gt;centralized logging&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next steps
&lt;/h3&gt;

&lt;p&gt;I hope you now have the tools you need to back up and restore your Django app's Postgres database. If you want to read more, the &lt;a href="https://www.postgresql.org/docs/12/index.html"&gt;Postgres docs&lt;/a&gt; have a good section on &lt;a href="https://www.postgresql.org/docs/12/backup-dump.html"&gt;database backups&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once you've got your head around database backups, you should automate the process to make it more reliable. I will show you how to do this in &lt;a href="https://mattsegal.dev/postgres-backup-automate.html"&gt;this follow-up post&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>django</category>
      <category>bash</category>
      <category>database</category>
    </item>
    <item>
      <title>How to generate lots of dummy data for your Django app</title>
      <dc:creator>Matthew Segal</dc:creator>
      <pubDate>Sat, 20 Jun 2020 07:17:48 +0000</pubDate>
      <link>https://dev.to/mattdsegal/how-to-generate-lots-of-dummy-data-for-your-django-app-3ajl</link>
      <guid>https://dev.to/mattdsegal/how-to-generate-lots-of-dummy-data-for-your-django-app-3ajl</guid>
      <description>&lt;p&gt;It sucks when you're working on a Django app and all your pages are empty. For example, if you're working on a forum webapp, then all your discussion boards will be empty by default:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LiK8t1eJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/dummy-threads-empty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LiK8t1eJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/dummy-threads-empty.png" alt="dummy-threads-empty"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Manually creating enough data for your pages to look realistic is a lot of work. Wouldn't it be nice if there was an automatic way to populate your local database with dummy data that looks real? E.g. so that your forum app has many threads:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rjP2UEda--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/dummy-threads-full.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rjP2UEda--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/dummy-threads-full.png" alt="dummy-threads"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even better, wouldn't it be cool if there was an easy way to populate each thread with as many comments&lt;br&gt;
as you like?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rRVizitz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/dummy-comments.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rRVizitz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/dummy-comments.png" alt="dummy-comments"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post I'll show you how to use &lt;a href="https://factoryboy.readthedocs.io/en/latest/"&gt;Factory Boy&lt;/a&gt; and a few other tricks to quickly and repeatably generate an endless amount of dummy data for your Django app. By the end of the post you'll be able to generate all your test data using a management command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;./manage.py setup_test_data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There is example code for this blog post hosted in &lt;a href="https://github.com/MattSegal/djdt-perf-demo"&gt;this GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example application
&lt;/h3&gt;

&lt;p&gt;In this post we'll be working with an example app that is an online forum. There are four models that we'll be working with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models.py
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""A person who uses the website"""&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CharField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""A forum comment thread"""&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CharField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;creator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ForeignKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Comment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""A comment by a user on a thread"""&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CharField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;poster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ForeignKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ForeignKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Club&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""A group of users interested in the same thing"""&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CharField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;member&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ManyToManyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Building data with Factory Boy
&lt;/h3&gt;

&lt;p&gt;We'll be using &lt;a href="https://factoryboy.readthedocs.io/en/latest/"&gt;Factory Boy&lt;/a&gt; to generate all our dummy data. It's a library that's built for automated testing, but it also works well for this use-case. Factory Boy can easily be configured to generate random but realistic data like names, emails and paragraphs by internally using the &lt;a href="https://faker.readthedocs.io/en/master/"&gt;Faker&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;When using Factory Boy you create classes called "factories", which each represent a Django model. For example, for a user, you would create a factory class as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# factories.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;factory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;factory.django&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DjangoModelFactory&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;

&lt;span class="c1"&gt;# Defining a factory
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserFactory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DjangoModelFactory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;

    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Faker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"first_name"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Using a factory with auto-generated data
&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UserFactory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="c1"&gt;# Kimberly
&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="c1"&gt;# 51
&lt;/span&gt;
&lt;span class="c1"&gt;# You can optionally pass in your own data
&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UserFactory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Alice"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="c1"&gt;# Alice
&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="c1"&gt;# 52
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can find the data types that Faker can produce by looking at the "&lt;a href="https://faker.readthedocs.io/en/master/providers.html"&gt;providers&lt;/a&gt;" that the library offers. E.g. I found "first_name" by reviewing the options inside the &lt;a href="https://faker.readthedocs.io/en/master/providers/faker.providers.person.html"&gt;person provider&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another benefit of Factory Boy is that it can be set up to generate related data using &lt;a href="https://factoryboy.readthedocs.io/en/latest/recipes.html#dependent-objects-foreignkey"&gt;SubFactory&lt;/a&gt;, saving you a lot of boilerplate and time. For example, we can set up the &lt;code&gt;ThreadFactory&lt;/code&gt; so that it generates a &lt;code&gt;User&lt;/code&gt; as its creator automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# factories.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ThreadFactory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DjangoModelFactory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Thread&lt;/span&gt;

    &lt;span class="n"&gt;creator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SubFactory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UserFactory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Faker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"sentence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;nb_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;variable_nb_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a new thread
&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ThreadFactory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;  &lt;span class="c1"&gt;# Room marriage study
&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;creator&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;User: Michelle&amp;gt;
&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;creator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;  &lt;span class="c1"&gt;# Michelle
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The ability to automatically generate related models and fake data makes Factory Boy quite powerful. It's worth taking a quick look at the &lt;a href="https://factoryboy.readthedocs.io/en/latest/recipes.html"&gt;other suggested patterns&lt;/a&gt; if you decide to try it out.&lt;/p&gt;
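&lt;p&gt;The &lt;code&gt;setup_test_data&lt;/code&gt; script in the next section imports a &lt;code&gt;ClubFactory&lt;/code&gt; and &lt;code&gt;CommentFactory&lt;/code&gt; that aren't shown in the post. As a rough sketch (not the author's exact code), they could reuse the same &lt;code&gt;SubFactory&lt;/code&gt; and &lt;code&gt;Faker&lt;/code&gt; patterns:&lt;/p&gt;

```python
# factories.py (sketch) - assumes the models and factories defined earlier
import factory
from factory.django import DjangoModelFactory

from .models import Club, Comment


class CommentFactory(DjangoModelFactory):
    class Meta:
        model = Comment

    # Generate a related User and Thread if none are passed in.
    poster = factory.SubFactory(UserFactory)
    thread = factory.SubFactory(ThreadFactory)
    body = factory.Faker("sentence", nb_words=12)


class ClubFactory(DjangoModelFactory):
    class Meta:
        model = Club

    name = factory.Faker("word")
```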

&lt;h3&gt;
  
  
  Adding a management command
&lt;/h3&gt;

&lt;p&gt;Once you've defined all the models that you want to generate with Factory Boy, you can write a &lt;a href="https://simpleisbetterthancomplex.com/tutorial/2018/08/27/how-to-create-custom-django-management-commands.html"&gt;management command&lt;/a&gt; to automatically populate your database. This is a pretty crude script that doesn't take advantage of all of Factory Boy's features, like sub-factories, but I didn't want to spend too much time getting fancy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# setup_test_data.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;random&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;django.db&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;django.core.management.base&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseCommand&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;forum.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Club&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Comment&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;forum.factories&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;UserFactory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ThreadFactory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ClubFactory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CommentFactory&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;NUM_USERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="n"&gt;NUM_CLUBS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;NUM_THREADS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;
&lt;span class="n"&gt;COMMENTS_PER_THREAD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;
&lt;span class="n"&gt;USERS_PER_CLUB&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseCommand&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;help&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Generates test data"&lt;/span&gt;

    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;transaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Deleting old data..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Comment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Club&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;all&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Creating new data..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Create all the users
&lt;/span&gt;        &lt;span class="n"&gt;people&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NUM_USERS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UserFactory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Add some users to clubs
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NUM_CLUBS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;club&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ClubFactory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;members&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;USERS_PER_CLUB&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;club&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create all the threads
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NUM_THREADS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;creator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ThreadFactory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;creator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;creator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Create comments for each thread
&lt;/span&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COMMENTS_PER_THREAD&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;commentor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;CommentFactory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;commentor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Using the &lt;code&gt;transaction.atomic&lt;/code&gt; decorator makes a big difference to the runtime of this script, since it bundles up hundreds of queries and commits them all in a single transaction.&lt;/p&gt;
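&lt;p&gt;You can see the shape of this idea in a standalone example using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (nothing here is from the original post; it just illustrates why one transaction wrapping many writes is cheaper than committing each write individually):&lt;/p&gt;

```python
import sqlite3

# An in-memory database stands in for the app's real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comment (body TEXT)")

# The "with" block opens one transaction and commits once at the end,
# analogous to wrapping handle() in @transaction.atomic: many INSERTs,
# a single commit.
with conn:
    for i in range(300):
        conn.execute("INSERT INTO comment (body) VALUES (?)", (f"comment {i}",))

count = conn.execute("SELECT COUNT(*) FROM comment").fetchone()[0]
print(count)  # 300
```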

&lt;h3&gt;
  
  
  Images
&lt;/h3&gt;

&lt;p&gt;If you need dummy images for your website as well then there are a lot of great free tools online to help. I use &lt;a href="https://api.adorable.io"&gt;adorable.io&lt;/a&gt; for dummy profile pics and &lt;a href="https://picsum.photos/"&gt;Picsum&lt;/a&gt; or &lt;a href="https://unsplash.com/developers"&gt;Unsplash&lt;/a&gt; for larger pictures like this one: &lt;a href="https://picsum.photos/700/500"&gt;https://picsum.photos/700/500&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KWbvUxf8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://picsum.photos/700/500" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KWbvUxf8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://picsum.photos/700/500" alt="picsum-example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Next steps
&lt;/h3&gt;

&lt;p&gt;Hopefully this post helps you spin up a lot of fake data for your Django app very quickly. If you enjoy using Factory Boy to generate your dummy data, then you also might like incorporating it into your unit tests.&lt;/p&gt;
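&lt;p&gt;For example, a factory-based unit test might look something like this (a sketch that assumes the &lt;code&gt;ThreadFactory&lt;/code&gt; from earlier, not code from the original post):&lt;/p&gt;

```python
# tests.py (sketch) - factories replace hand-written fixture data
from django.test import TestCase

from .factories import ThreadFactory


class ThreadTests(TestCase):
    def test_thread_is_created_with_a_creator(self):
        thread = ThreadFactory()
        # SubFactory generated a related User for us automatically.
        self.assertIsNotNone(thread.creator)
        self.assertTrue(thread.title)
```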

</description>
      <category>django</category>
      <category>webdev</category>
    </item>
    <item>
      <title>A tour of Django server setups</title>
      <dc:creator>Matthew Segal</dc:creator>
      <pubDate>Thu, 18 Jun 2020 23:32:05 +0000</pubDate>
      <link>https://dev.to/mattdsegal/a-tour-of-django-server-setups-2h06</link>
      <guid>https://dev.to/mattdsegal/a-tour-of-django-server-setups-2h06</guid>
      <description>&lt;p&gt;If you haven't deployed a lot of Django apps, then you might wonder: how do professionals put Django apps on the internet? What does Django typically look like when it's running in production? You might even be thinking &lt;em&gt;what the hell is &lt;a href="https://www.techopedia.com/definition/8989/production-environment"&gt;production&lt;/a&gt;?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Before I started working as a developer, there was just a fuzzy cloud in my head where the knowledge of production infrastructure should be. If there's a fuzzy cloud in your head, let's fix it. There are many ways to extend a Django server setup to achieve better performance, cost-effectiveness and reliability. This post will take you on a tour of some common Django server setups, from the simplest to the more complex and powerful. I hope it will build up your mental model of how Django is hosted in production, piece-by-piece.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your local machine
&lt;/h3&gt;

&lt;p&gt;Let's start by reviewing a Django setup that you are already familiar with: your local machine. Going over this will be a warm-up for later sections. When you run Django locally, you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your web browser (Chrome, Safari, Firefox, etc)&lt;/li&gt;
&lt;li&gt;Django running with the runserver management command&lt;/li&gt;
&lt;li&gt;A SQLite database sitting in your project folder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TbHjR2IU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/local-server.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TbHjR2IU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/local-server.png" alt="local server setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pretty simple, right? Next let's look at something similar, but deployed to a web server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simplest possible webserver
&lt;/h3&gt;

&lt;p&gt;The simplest Django web server you can set up is very similar to your local dev environment. Most professional Django devs don't use a basic setup like this for their production environments. It works perfectly fine, but it has some limitations that we'll discuss later. It looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FXy6Q2dw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/simple-server.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FXy6Q2dw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/simple-server.png" alt="simple server setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typically people run Django on a Linux virtual machine, often using the Ubuntu distribution. The virtual machine is hosted by a cloud provider like &lt;a href="https://aws.amazon.com/"&gt;Amazon&lt;/a&gt;, &lt;a href="https://cloud.google.com/gcp/"&gt;Google&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-au/"&gt;Azure&lt;/a&gt;, &lt;a href="https://www.digitalocean.com/"&gt;DigitalOcean&lt;/a&gt; or &lt;a href="https://www.linode.com/"&gt;Linode&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Instead of using runserver, you should use a WSGI server like &lt;a href="https://gunicorn.org/"&gt;Gunicorn&lt;/a&gt; to run your Django app. I go into more detail on why you shouldn't use runserver in production, and explain WSGI &lt;a href="https://mattsegal.dev/simple-django-deployment-2.html#wsgi"&gt;here&lt;/a&gt;. Otherwise, not that much is different from your local machine: you can still use SQLite as the database (&lt;a href="https://mattsegal.dev/simple-django-deployment-2.html#sqlite"&gt;more here&lt;/a&gt;).&lt;/p&gt;
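&lt;p&gt;Starting the app with Gunicorn looks something like this (a minimal sketch; the project name is a placeholder, and real setups usually add config for workers, logging and timeouts):&lt;/p&gt;

```shell
pip install gunicorn

# "myproject.wsgi" is a placeholder for your own project's WSGI module.
gunicorn myproject.wsgi:application --bind 0.0.0.0:8000 --workers 3
```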

&lt;p&gt;This is the bare bones of the setup. There are a few other details that you'll need to manage, like &lt;a href="https://mattsegal.dev/dns-for-noobs.html"&gt;setting up DNS&lt;/a&gt;, creating a virtual environment, babysitting Gunicorn with a process supervisor like &lt;a href="https://mattsegal.dev/simple-django-deployment-4.html"&gt;Supervisord&lt;/a&gt;, and serving static files with &lt;a href="http://whitenoise.evans.io/en/stable/"&gt;Whitenoise&lt;/a&gt;. If you're interested in a more complete guide on how to set up a simple server like this, I wrote &lt;a href="https://mattsegal.dev/simple-django-deployment.html"&gt;a guide&lt;/a&gt; that explains how to deploy Django.&lt;/p&gt;

&lt;h3&gt;
  
  
  Typical standalone webserver
&lt;/h3&gt;

&lt;p&gt;Let's go over an environment that a professional Django dev might set up in production when using a single server. It's not the exact setup that everyone will always use, but the structure is very common.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V8anZOcv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/typical-server.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V8anZOcv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/typical-server.png" alt="typical server setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some things are the same as the simple setup above: it's still a Linux virtual machine with Django being run by Gunicorn. There are three main differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQLite has been replaced by a different database, &lt;a href="https://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://www.nginx.com/"&gt;NGINX&lt;/a&gt; web server is now sitting in-front of Gunicorn in a &lt;a href="https://www.nginx.com/resources/glossary/reverse-proxy-server/"&gt;reverse-proxy&lt;/a&gt; setup&lt;/li&gt;
&lt;li&gt;Static files are now being served from outside of Django&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why did we swap SQLite for PostgreSQL? In general, Postgres is a little more advanced and full-featured. For example, Postgres can handle multiple writes at the same time, while SQLite can't.&lt;/p&gt;
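
&lt;p&gt;In Django, swapping SQLite for Postgres is mostly a settings change. A rough sketch (the database name, user and password here are placeholders for your own values):&lt;/p&gt;

```python
# Sketch of a Django DATABASES setting for PostgreSQL.
# NAME, USER and PASSWORD are placeholders, not real credentials.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "myapp",
        "USER": "myapp",
        "PASSWORD": "change-me",
        "HOST": "localhost",  # Postgres runs on the same machine here
        "PORT": "5432",       # the default Postgres port
    }
}
```

&lt;p&gt;The rest of your app code stays the same: the ORM hides most of the differences between the two databases.&lt;/p&gt;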

&lt;p&gt;Why did we add NGINX to our setup? NGINX is a dedicated webserver which provides extra features and performance improvements over just using Gunicorn to serve web requests. For example, we can use NGINX to directly serve our app's static and media files more efficiently. NGINX can also be configured to do a lot of other useful things, like encrypt your web traffic using HTTPS and compress your files to make your site faster. NGINX is the web server most commonly combined with Django, but there are also alternatives like the &lt;a href="https://httpd.apache.org/"&gt;Apache HTTP server&lt;/a&gt; and &lt;a href="https://docs.traefik.io/"&gt;Traefik&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's important to note that everything here lives on a single server, which means that if the server goes away, so does all your data, &lt;a href="https://mattsegal.dev/postgres-backup-and-restore.html"&gt;unless you have backups&lt;/a&gt;. This data includes your Django tables, which are stored in Postgres, and files uploaded by users, which will be stored in the &lt;a href="https://docs.djangoproject.com/en/3.0/ref/settings/#media-root"&gt;MEDIA_ROOT&lt;/a&gt; folder, somewhere on your filesystem. Having only one server also means that if your server restarts or shuts off, so does your website. This is OK for smaller projects, but it's not acceptable for big sites like StackOverflow or Instagram, where the cost of downtime is very high.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single webserver with multiple apps
&lt;/h3&gt;

&lt;p&gt;Once you start using NGINX and PostgreSQL, you can run multiple Django apps on the same machine. You can save money on hosting fees by packing multiple apps onto a single server rather than paying for a separate server for each app. This setup also allows you to re-use some of the services and configurations that you've already set up.&lt;/p&gt;

&lt;p&gt;NGINX is able to route incoming HTTP requests to different apps based on the domain name, and Postgres can host multiple databases on a single machine.&lt;br&gt;
For example, I use a single server to host some of my personal Django projects: &lt;a href="http://mattslinks.xyz/"&gt;Matt's Links&lt;/a&gt;, &lt;a href="http://memories.ninja/"&gt;Memories Ninja&lt;/a&gt; and &lt;a href="https://www.blogreader.com.au/"&gt;Blog Reader&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x-MXvQyY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/multi-app-server.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x-MXvQyY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/multi-app-server.png" alt="multi-app server setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've omitted the static files for simplicity. Note that having multiple apps on one server saves you hosting costs, but there are downsides: restarting the server restarts all of your apps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single webserver with a worker
&lt;/h3&gt;

&lt;p&gt;Some web apps need to do things other than just &lt;a href="https://www.codecademy.com/articles/what-is-crud"&gt;CRUD&lt;/a&gt;. For example, my website &lt;a href="https://www.blogreader.com.au/"&gt;Blog Reader&lt;/a&gt; needs to scrape &lt;a href="https://slatestarcodex.com/2020/04/24/employer-provided-health-insurance-delenda-est/"&gt;text&lt;/a&gt; from a website and then send it to an Amazon API to be translated into &lt;a href="https://media.blogreader.com.au/media/043dcf9fe4c1df539468000cb97af1d7.mp3"&gt;audio files&lt;/a&gt;. Another common example is "thumbnailing", where you upload a huge 5MB image file to Facebook and they downsize it into a crappy 120kB JPEG. These kinds of tasks do not happen inside a Django view, because they take too long to run. Instead they have to happen "offline", in a separate worker process, using tools like &lt;a href="http://www.celeryproject.org/"&gt;Celery&lt;/a&gt;, &lt;a href="https://huey.readthedocs.io/en/latest/django.html"&gt;Huey&lt;/a&gt;, &lt;a href="https://github.com/rq/django-rq"&gt;Django-RQ&lt;/a&gt; or &lt;a href="https://django-q.readthedocs.io/en/latest/"&gt;Django-Q&lt;/a&gt;. All these tools provide you with a way to run tasks outside of Django views and do more complicated things, like co-ordinate multiple tasks and run them on schedules.&lt;/p&gt;

&lt;p&gt;All of these tools follow a similar pattern: tasks are dispatched by Django and put in a queue where they wait to be executed. This queue is managed by a service called a "broker", which keeps track of all the tasks that need to be done. Common brokers for Django tasks are Redis and RabbitMQ. A worker process, which uses the same codebase as your Django app, pulls tasks out of the broker and runs them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qn-OIKAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/worker-server.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qn-OIKAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/worker-server.png" alt="worker server setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you haven't worked with task queues before then it's not immediately obvious how this all works, so let me give an example. You want to upload a 2MB &lt;a href="https://memories-ninja-prod.s3-ap-southeast-2.amazonaws.com/original/7e26334177b6ee7d5ab4c21f7149190e.jpeg"&gt;photo of your breakfast&lt;/a&gt; from your phone to a Django site. To optimise image loading performance, the Django site will turn that 2MB photo upload into a 70kB &lt;a href="https://memories-ninja-prod.s3.amazonaws.com/display/7e26334177b6ee7d5ab4c21f7149190e.jpeg"&gt;display image&lt;/a&gt; and a smaller &lt;a href="https://memories-ninja-prod.s3.amazonaws.com/thumbnail/7e26334177b6ee7d5ab4c21f7149190e.jpeg"&gt;thumbnail image&lt;/a&gt;. So this is what happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user uploads a photo to a Django view, which saves the original photo to the filesystem and updates the database to show that the file has been received&lt;/li&gt;
&lt;li&gt;The view also pushes a thumbnailing task to the task broker&lt;/li&gt;
&lt;li&gt;The broker receives the task and puts it in a queue, where it waits to be executed&lt;/li&gt;
&lt;li&gt;The worker asks the broker for the next task and the broker sends the thumbnailing task&lt;/li&gt;
&lt;li&gt;The worker reads the task description and runs some Python function, which reads the original image from the filesystem, creates the smaller thumbnail images, saves them and then updates the database to show that the thumbnailing is complete&lt;/li&gt;
&lt;/ul&gt;
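
&lt;p&gt;The flow above can be sketched in a few lines of Python, with an in-memory queue standing in for the broker and a dict standing in for the database (all names here are illustrative - a real setup would use Celery or Django-Q with Redis or RabbitMQ):&lt;/p&gt;

```python
import queue

broker = queue.Queue()  # stands in for Redis/RabbitMQ
database = {}           # stands in for the Postgres records

def upload_view(photo_name):
    """The Django view: record the upload, then dispatch a task."""
    database[photo_name] = "uploaded"
    broker.put(("thumbnail", photo_name))  # task waits in the queue

def worker_run_once():
    """The worker process: pull the next task and execute it."""
    task_name, photo_name = broker.get()
    if task_name == "thumbnail":
        # A real worker would read the original image, resize it
        # and save the smaller copies here.
        database[photo_name] = "thumbnailed"

upload_view("breakfast.jpeg")
worker_run_once()
```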

&lt;p&gt;If you want to learn more about this stuff, I've written guides for getting started with &lt;a href="https://mattsegal.dev/offline-tasks.html"&gt;offline tasks&lt;/a&gt; and &lt;a href="https://mattsegal.dev/simple-scheduled-tasks.html"&gt;scheduled tasks&lt;/a&gt; with Django Q.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single webserver with a cache
&lt;/h3&gt;

&lt;p&gt;Sometimes you'll want to &lt;a href="https://docs.djangoproject.com/en/3.0/topics/cache/"&gt;use a cache&lt;/a&gt; to store data for a short time. For example, caches are commonly used when you have some data that was expensive to pull from the database or an API and you want to re-use it for a little while. &lt;a href="https://redis.io/"&gt;Redis&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Memcached"&gt;Memcached&lt;/a&gt; are both popular cache services that are used in production with Django. It's not a very complicated setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OqshTh9S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/cache-on-server.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OqshTh9S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/cache-on-server.png" alt="cache on server setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Single webserver with Docker
&lt;/h3&gt;

&lt;p&gt;If you've heard of &lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt; before you might be wondering where it factors into these setups. It's a great tool for creating consistent programming environments, but it doesn't fundamentally change how any of this works. Most of the setups I've described would work basically the same way... except everything is inside a Docker container.&lt;/p&gt;

&lt;p&gt;For example, if you were running multiple Django apps on one server and you wanted to use Docker containers, then you might do something like this using &lt;a href="https://docs.docker.com/engine/swarm/"&gt;Docker Swarm&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jBVAzY1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/swarm-server.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jBVAzY1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/swarm-server.png" alt="docker on server setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see it's not such a different structure compared to what we were doing before Docker. The containers are just wrappers around the services that we were already running. Putting things inside of Docker containers doesn't really change how all the services talk to each other. If you really wanted to you could wrap Docker containers around more things like NGINX, the database, a Redis cache, whatever. This is why I think it's valuable to learn how to deploy Django without Docker first. That said, you can do some more complicated setups with Docker containers, which we'll get into later.&lt;/p&gt;

&lt;h3&gt;
  
  
  External services
&lt;/h3&gt;

&lt;p&gt;So far I've been showing you server setups with just one virtual machine running Ubuntu. This is the simplest setup that you can use, but it has limitations: there are some things that you might need that a single server can't give you. In this section I'm going to walk you through how we can break apart our single server into more advanced setups.&lt;/p&gt;

&lt;p&gt;If you've studied programming you might have read about &lt;a href="https://en.wikipedia.org/wiki/Separation_of_concerns"&gt;separation of concerns&lt;/a&gt;, the &lt;a href="https://en.wikipedia.org/wiki/Single-responsibility_principle"&gt;single responsibility principle&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller"&gt;model-view-controller (MVC)&lt;/a&gt;. A lot of the changes that we're going to make will have a similar kind of vibe: we're going to split up our services into smaller, more specialised units, based on their "responsibilities". We're going to pull apart our services bit-by-bit until there's nothing left. Just a note: you might not need to do this for your services, this is just an overview of what you &lt;em&gt;could&lt;/em&gt; do.&lt;/p&gt;

&lt;h3&gt;
  
  
  External services - database
&lt;/h3&gt;

&lt;p&gt;The first thing you'd want to pull off of your server is the database. This involves putting PostgreSQL onto its own virtual machine. You can set this up yourself or pay a little extra for an off-the-shelf service like &lt;a href="https://aws.amazon.com/rds/"&gt;Amazon RDS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cgBUk6Fs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/postgres-external.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cgBUk6Fs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/postgres-external.png" alt="postgres on server setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are a couple of reasons that you'd want to put the database on its own server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You might have multiple apps on different servers that depend on the same database&lt;/li&gt;
&lt;li&gt;Your database performance will not be impacted by "noisy neighbours" eating up CPU, RAM or disk space on the same machine&lt;/li&gt;
&lt;li&gt;You've moved your precious database away from your Django web server, which means you can delete and re-create your Django app's server with less concern&lt;/li&gt;
&lt;li&gt;&lt;em&gt;mumble mumble security mumble&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using an off-the-shelf option like AWS RDS is attractive because it reduces the amount of admin work that you need to run your database server. If you're a backend web developer with a lot of work to do and more money than time then this is a good move.&lt;/p&gt;

&lt;h3&gt;
  
  
  External services - object storage
&lt;/h3&gt;

&lt;p&gt;It is common to push file storage off the web server into "object storage", which is basically a filesystem behind a nice API. This is often done using &lt;a href="https://django-storages.readthedocs.io/en/latest/"&gt;django-storages&lt;/a&gt;, which I enjoy using. Object storage is usually used for user-uploaded "media" such as documents, photos and videos. I use AWS S3 (Simple Storage Service) for this, but every big cloud hosting provider has some sort of "object storage" offering.&lt;/p&gt;
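
&lt;p&gt;With django-storages, pointing your media uploads at S3 comes down to a few settings. A rough sketch, with a placeholder bucket name and region:&lt;/p&gt;

```python
# Sketch of Django settings for storing media uploads in S3
# via django-storages. Bucket name and region are placeholders.
DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "my-app-media"
AWS_S3_REGION_NAME = "ap-southeast-2"
```

&lt;p&gt;Once this is in place, &lt;code&gt;FileField&lt;/code&gt; and &lt;code&gt;ImageField&lt;/code&gt; uploads land in the bucket instead of &lt;code&gt;MEDIA_ROOT&lt;/code&gt; on your server's disk.&lt;/p&gt;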

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UhcDB9U7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/files-external-revised.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UhcDB9U7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/files-external-revised.png" alt="AWS S3 setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are a few reasons why this is a good idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You've moved all of your app's state (files, database) off of your server, so now you can move, destroy and re-create the Django server with no data loss&lt;/li&gt;
&lt;li&gt;File downloads hit the object storage service, rather than your server, meaning you can scale your file downloads more easily&lt;/li&gt;
&lt;li&gt;You don't need to worry about any filesystem admin, like running out of disk space&lt;/li&gt;
&lt;li&gt;Multiple servers can easily share the same set of files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hopefully you see a theme here, we're taking shit we don't care about and making it someone else's problem. Paying someone else to do the work of managing our files and database leaves us more free time to work on more important things.&lt;/p&gt;

&lt;h3&gt;
  
  
  External services - web server
&lt;/h3&gt;

&lt;p&gt;You can also run your "web server" (NGINX) on a different virtual machine to your "app server" (Gunicorn + Django):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZkLkchOu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/nginx-1-external.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZkLkchOu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/nginx-1-external.png" alt="nginx external setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This seems kind of pointless though: why would you bother? Well, for one, you might have multiple identical app servers set up for redundancy and to handle high traffic, and NGINX can act as a &lt;a href="https://www.nginx.com/resources/glossary/load-balancing/"&gt;load balancer&lt;/a&gt; between the different servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nkSBQz0h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/nginx-2-external.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nkSBQz0h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/nginx-2-external.png" alt="nginx external setup 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You could also replace NGINX with an off-the-shelf load balancer like an AWS Elastic Load Balancer or something similar.&lt;/p&gt;

&lt;p&gt;Note how putting our services on their own servers allows us to scale them out over multiple virtual machines. We couldn't run our Django app on three servers at the same time if we also had three copies of our filesystem and three databases.&lt;/p&gt;

&lt;h3&gt;
  
  
  External services - task queue
&lt;/h3&gt;

&lt;p&gt;You can also push your "offline task" services onto their own servers. Typically the broker service would get its own machine and the worker would live on another:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XAV12JHA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/worker-1-external.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XAV12JHA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/worker-1-external.png" alt="worker external setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Splitting your worker onto its own server is useful because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can protect your Django web app from "noisy neighbours": workers which are hogging all the RAM and CPU&lt;/li&gt;
&lt;li&gt;You can give the worker server extra resources that it needs: CPU, RAM, or access to a GPU&lt;/li&gt;
&lt;li&gt;You can now make changes to the worker server without risking damage to the task queue or the web server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you've split things up, you can also scale out your workers to run more tasks in parallel:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eUp6fDec--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/worker-2-external.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eUp6fDec--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/worker-2-external.png" alt="worker external setup 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You could potentially swap out your self-managed broker (Redis or RabbitMQ) for a managed queue like &lt;a href="https://aws.amazon.com/sqs/"&gt;Amazon SQS&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  External services - final form
&lt;/h3&gt;

&lt;p&gt;If you went totally all-out, your Django app could be set up like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1seeaIEU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/full-external.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1seeaIEU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/full-external.png" alt="fully external setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, you can go pretty crazy splitting up all the parts of your Django app and spreading them across multiple servers. There are many upsides to this, but the downside is that you now have multiple servers to provision, update, monitor and maintain. Sometimes the extra complexity is well worth it, and sometimes it's a waste of your time. That said, there are many benefits to this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your web and worker servers are completely replaceable: you can destroy, create and update them without affecting uptime at all&lt;/li&gt;
&lt;li&gt;You can now do &lt;a href="https://martinfowler.com/bliki/BlueGreenDeployment.html"&gt;blue-green deployments&lt;/a&gt; with zero web app downtime&lt;/li&gt;
&lt;li&gt;Your files and database are easily shared between multiple servers and applications&lt;/li&gt;
&lt;li&gt;You can provision different sized servers for their different workloads&lt;/li&gt;
&lt;li&gt;You can swap out your self-managed servers for managed infrastructure, like moving your task broker to AWS SQS, or your database to AWS RDS&lt;/li&gt;
&lt;li&gt;You can now autoscale your servers (more on this later)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you have complicated infrastructure like this you need to start automating your infrastructure setup and server config. It's just not feasible to manage this stuff manually once your setup has this many moving parts. I recorded a talk on &lt;a href="https://mattsegal.dev/intro-config-management.html"&gt;configuration management&lt;/a&gt; that introduces these concepts. You'll need to start looking into tools like &lt;a href="https://www.ansible.com/"&gt;Ansible&lt;/a&gt; and &lt;a href="https://www.packer.io/"&gt;Packer&lt;/a&gt; to configure your virtual machines, and tools like &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt; or &lt;a href="https://aws.amazon.com/cloudformation/"&gt;CloudFormation&lt;/a&gt; to configure your cloud services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto scaling groups
&lt;/h3&gt;

&lt;p&gt;You've already seen how you can have multiple web servers running the same app, or multiple worker servers all pulling tasks from a queue. These servers cost money, dollars per hour, and it can get very expensive to run more servers than you need.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://aws.amazon.com/autoscaling/"&gt;autoscaling&lt;/a&gt; comes in. You can set up your cloud services to use some sort of trigger, such as virtual machine CPU usage, to automatically create new virtual machines from an image and add them to an autoscaling group.&lt;/p&gt;

&lt;p&gt;Let's use our task worker servers as an example. If you have a thumbnailing service that turns &lt;a href="https://memories-ninja-prod.s3-ap-southeast-2.amazonaws.com/original/7e26334177b6ee7d5ab4c21f7149190e.jpeg"&gt;big uploaded photos&lt;/a&gt; into &lt;a href="https://memories-ninja-prod.s3.amazonaws.com/thumbnail/7e26334177b6ee7d5ab4c21f7149190e.jpeg"&gt;smaller photos&lt;/a&gt; then one server should be able to handle dozens of file uploads per second. What if during some periods of the day, like around 6pm after work, you saw file uploads spike from dozens per second to &lt;em&gt;thousands&lt;/em&gt; per second? Then you'd need more servers! With an autoscaling setup, the CPU usage on your worker servers would spike, triggering the creation of more and more worker servers, until you had enough to handle all the uploads. When the rate of file uploads drops, the extra servers would be automatically destroyed, so you aren't always paying for them.&lt;/p&gt;
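
&lt;p&gt;To make the trigger idea concrete, here's an illustrative sketch of the decision an autoscaler makes each time it checks the group. The CPU thresholds and group size limits are made up for the example:&lt;/p&gt;

```python
# Illustrative autoscaling decision, re-evaluated every minute or so.
# The 80%/20% thresholds and the group size limits are made up.
def desired_server_count(current_count, avg_cpu_percent,
                         min_servers=1, max_servers=10):
    if avg_cpu_percent > 80:
        # Workers are overloaded: scale out, up to the group limit.
        return min(current_count + 1, max_servers)
    if avg_cpu_percent > 20:
        return current_count  # load is fine, change nothing
    # Workers are mostly idle: scale in to stop paying for them.
    return max(current_count - 1, min_servers)
```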

&lt;h3&gt;
  
  
  Container clusterfuck
&lt;/h3&gt;

&lt;p&gt;There is a whole world of container fuckery that I haven't covered in much detail, because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I don't know it very well&lt;/li&gt;
&lt;li&gt;It's a little complicated for the target audience of this post; and&lt;/li&gt;
&lt;li&gt;I don't think that most people need it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For completeness I'll quickly go over some of the cool, crazy things you can do with containers. You can use tools like &lt;a href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt; and &lt;a href="https://www.sumologic.com/glossary/docker-swarm/"&gt;Docker Swarm&lt;/a&gt; with a set of config files to define all your services as Docker containers and how they should all talk to each other. All your containers run somewhere in your Kubernetes/Swarm cluster, but as a developer, you don't really care what server they're on. You just build your Docker containers, write your config file, and push it up to your infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TNi5PBJN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/kubernetes-maybe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TNi5PBJN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://mattsegal.dev/django-prod-architecture/kubernetes-maybe.png" alt="maybe kubernetes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using these "container orchestration" tools allows you to decouple your containers from their underlying infrastructure. Multiple teams can deploy their apps to the same set of servers without any conflict between their apps.&lt;br&gt;
This is the kind of infrastructure that enables teams to deploy &lt;a href="https://www.youtube.com/watch?v=y8OnoxKotPQ"&gt;microservices&lt;/a&gt;. Big enterprises like Target will have specialised teams dedicated to setting up and maintaining these container orchestration systems, while other teams can use them without having to think about the underlying servers. These teams are essentially supplying a "platform as a service" (PaaS) to the rest of the organisation.&lt;/p&gt;

&lt;p&gt;As you might have noticed, there is probably too much complexity in these container orchestration tools for them to be worth your while as a solo developer or even as a small team. If you're interested in this sort of thing you might like &lt;a href="http://dokku.viewdocs.io/dokku/"&gt;Dokku&lt;/a&gt;, which claims to be "the smallest PaaS implementation you've ever seen".&lt;/p&gt;

&lt;h3&gt;
  
  
  End of tour
&lt;/h3&gt;

&lt;p&gt;That's basically everything that I know about how Django can be set up in production. If you're interested in building up your infrastructure skills, then I recommend you try out one of the setups or tools that I've mentioned in this post. Hopefully I've built up your mental models of how Django gets deployed so that the next time someone mentions "task broker" or "autoscaling", you have some idea of what they're talking about.&lt;/p&gt;

&lt;p&gt;If you enjoyed reading this you might also like other things I've written about &lt;a href="https://mattsegal.dev/simple-django-deployment.html"&gt;deploying Django as simply as possible&lt;/a&gt;, how to &lt;a href="https://mattsegal.dev/offline-tasks.html"&gt;get started with offline tasks&lt;/a&gt;, how to start &lt;a href="https://mattsegal.dev/file-logging-django.html"&gt;logging to files&lt;/a&gt; and &lt;a href="https://mattsegal.dev/sentry-for-django-error-monitoring.html"&gt;tracking errors&lt;/a&gt; in prod and my &lt;a href="https://mattsegal.dev/intro-config-management.html"&gt;introduction to configuration management&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you liked the box diagrams in this post check out &lt;a href="https://excalidraw.com/"&gt;Excalidraw&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>django</category>
      <category>webdev</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to polish your GitHub projects when you're looking for a job</title>
      <dc:creator>Matthew Segal</dc:creator>
      <pubDate>Thu, 18 Jun 2020 23:10:34 +0000</pubDate>
      <link>https://dev.to/mattdsegal/how-to-polish-your-github-projects-when-you-re-looking-for-a-job-5afg</link>
      <guid>https://dev.to/mattdsegal/how-to-polish-your-github-projects-when-you-re-looking-for-a-job-5afg</guid>
      <description>&lt;p&gt;When you're going for your first programming job, you don't have any work experience or references to show that you can write code. You might not even have a relevant degree (I didn't). What you &lt;em&gt;can&lt;/em&gt; do is write some code and throw it up on GitHub to demonstrate to employers that you can build a complete app all by yourself.&lt;/p&gt;

&lt;p&gt;A lot of junior devs don't know how to show off their projects on GitHub. They spend &lt;em&gt;hours and hours&lt;/em&gt; writing code and then forget to do some basic things to make their project seem interesting. In this post I want to share some tips that you can apply in a few hours to make an existing project much more effective at getting you an interview.&lt;/p&gt;

&lt;h3&gt;
  
  
  Remove all the clutter
&lt;/h3&gt;

&lt;p&gt;Your project should only contain source code, plus the minimum files required to run it. It should not contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Editor config files (.idea, .vscode)&lt;/li&gt;
&lt;li&gt;Database files (eg. SQLite)&lt;/li&gt;
&lt;li&gt;Random documents (.pdf, .xls)&lt;/li&gt;
&lt;li&gt;Media files (images, videos, audio)&lt;/li&gt;
&lt;li&gt;Build outputs and artifacts (*.dll files, *.exe, etc)&lt;/li&gt;
&lt;li&gt;Bytecode (eg. *.pyc files for Python)&lt;/li&gt;
&lt;li&gt;Log files (eg. *.log)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having these files in your repo makes you look sloppy. Professional developers don't like finding random crap cluttering up their codebase. You can keep these files out of your git repo using a &lt;a href="https://www.atlassian.com/git/tutorials/saving-changes/gitignore"&gt;.gitignore&lt;/a&gt; file. If you already have these files inside your repo, make sure to delete them. If you're using &lt;code&gt;bash&lt;/code&gt; you can use &lt;code&gt;find&lt;/code&gt; to delete all files that match a pattern, like Python bytecode files ending in &lt;code&gt;.pyc&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;.pyc &lt;span class="nt"&gt;-delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can achieve a similar result in Windows PowerShell, but it'll be a little more verbose.&lt;/p&gt;

&lt;p&gt;Sometimes you do need to keep some media files, documents or even small databases in your source control. This is okay to do as long as it's an essential part of running, testing or documenting the code, as opposed to random clutter that you forgot to remove or gitignore. A good example of non-code files that you should keep in source control is website static files, like favicons and fonts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Write a README
&lt;/h3&gt;

&lt;p&gt;Your project &lt;em&gt;must&lt;/em&gt; have a README file. This is a file in the root of your project's repository called &lt;code&gt;README.md&lt;/code&gt;. It's a text file written in &lt;a href="https://github.com/adam-p/markdown-here/wiki/markdown-cheatsheet"&gt;Markdown&lt;/a&gt; that gives a quick overview of what your project is and what it does. Not having a README makes your project seem crappy, and many people, including me, may close the browser window without checking any code if there isn't one present.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://github.com/anikalegal/clerk"&gt;one I prepared earlier&lt;/a&gt;, and &lt;a href="https://github.com/AnikaLegal/intake"&gt;here's another&lt;/a&gt;. They're not&lt;br&gt;
perfect, but I hope they give you a general idea of what to do.&lt;/p&gt;

&lt;p&gt;One hour of paying attention to your project's README is worth 20 extra hours of coding when it comes to impressing hiring managers. You know how people mindlessly write that they have "excellent communication skills" on their resume? No one believes that - it's far too easy to just say it. Don't &lt;em&gt;tell them&lt;/em&gt; that you have excellent communication skills, &lt;em&gt;show them&lt;/em&gt; by writing an excellent README.&lt;/p&gt;

&lt;p&gt;Enough of me waffling about why you should write a README - what do you put in it?&lt;/p&gt;

&lt;p&gt;First, you should describe what your project does at a high level: what problem it solves. Is it a command line tool that plays music? Is it a website that finds you low prices on Amazon? Is it a Reddit bot that sends people reminders? A reader should be able to read the first few sentences and decide whether it's something they might want to use. You should also summarize the main features of your project in this section.&lt;/p&gt;

&lt;p&gt;A key point to remember is that the employer or recruiter reading your GitHub is both lazy and time-poor. They might not read past the first few sentences... they might not even read the code! They may well assume that your project works without checking anything. Before you rush to pack your README with features that don't exist, you scallywag, note that they may ask you more about your project in a job interview. So, uh... don't lie about anything.&lt;/p&gt;

&lt;p&gt;Beyond a basic overview of your project, it's also good to outline the high-level architecture of your code - how it's structured. For example, in a Django web app, you could explain the different apps that you've implemented and their responsibilities.&lt;/p&gt;

&lt;p&gt;If your project is a website, then you can also talk about the production infrastructure that your website runs on. For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This website is deployed to a DigitalOcean virtual machine. The Django app runs inside a Gunicorn WSGI app server and depends on a Postgres database. A separate Celery worker process runs offline tasks. Redis is responsible for both caching and serving as a task broker.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or for something a little simpler:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This project is a static webpage that is hosted on Netlify.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Simply indicating that you know how to deploy your application makes you look good. "Isn't that obvious though?" - you may ask. No, it's not obvious and you need to be explicit.&lt;/p&gt;
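<p>To make all this concrete, here's a rough skeleton of the kind of README I'm describing - the project name, sections and wording are placeholders for your own:</p>

<div class="highlight"><pre class="highlight plaintext"><code># My Project

A one-paragraph summary of what the project does and the problem it solves.

## Features

- Feature one
- Feature two

## Architecture

A few sentences on how the code is structured and where things live.

## Deployment

Where and how the project runs in production
(e.g. "a static site hosted on Netlify").
</code></pre></div>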

&lt;p&gt;A little warning on READMEs: they're for other people to read, not you. Do not include personal to-dos or notes to yourself in your README. Put those somewhere else, like Trello or Workflowy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add a screenshot
&lt;/h3&gt;

&lt;p&gt;Add a screenshot of your website or tool and embed it in the README. It'll take you 10 minutes and it makes the project look way better. Store the screenshot in a "docs" folder and embed it in your README using Markdown. If it's a command line app, you can use &lt;a href="https://asciinema.org/"&gt;asciinema&lt;/a&gt; to record the tool in action; if your project has a GUI, you can quickly record yourself using it with &lt;a href="https://www.loom.com/my-videos"&gt;Loom&lt;/a&gt;. This will make your project seem much more impressive for only a small amount of effort.&lt;/p&gt;
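<p>The Markdown embed itself is one line - the path here assumes you've saved the image as <code>docs/screenshot.png</code>:</p>

<div class="highlight"><pre class="highlight plaintext"><code>![Screenshot of the app](docs/screenshot.png)
</code></pre></div>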

&lt;h3&gt;
  
  
  Give instructions for other developers
&lt;/h3&gt;

&lt;p&gt;You should include instructions on how other devs can get started using your project. This is important because it demonstrates that you can document project setup instructions, and also because someone may actually try to run your code. These instructions should state what tools are required to run your project. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You will need Python 3 and pip installed&lt;/li&gt;
&lt;li&gt;You will need yarn and node v11+&lt;/li&gt;
&lt;li&gt;You will need docker and docker-compose&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, you should explain the steps, with explicit command line examples if possible, that are required to get the app built or running. If your project has external libraries that need to be installed, then you should have a file that specifies these dependencies, like a &lt;code&gt;requirements.txt&lt;/code&gt; (Python) or &lt;code&gt;package.json&lt;/code&gt; (Node) or &lt;code&gt;Dockerfile&lt;/code&gt; / &lt;code&gt;docker-compose.yaml&lt;/code&gt; (Docker).&lt;/p&gt;
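<p>For example, the setup section of a README for a typical Python project might read like this - the repo URL and filenames are placeholders, not a real project:</p>

<div class="highlight"><pre class="highlight shell"><code>git clone https://github.com/yourname/yourproject.git
cd yourproject
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
</code></pre></div>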

&lt;p&gt;You should also include instructions on how to run your automated tests. You have some tests, right? More on that later.&lt;/p&gt;

&lt;p&gt;If you've scripted your project's deployment, you can mention how to do it here, if you like.&lt;/p&gt;

&lt;h3&gt;
  
  
  Have a nice, readable commit history
&lt;/h3&gt;

&lt;p&gt;If possible, your git commit history should tell a story about what you've been working on. Each commit should represent a distinct unit of work, and the commit message should explain what work was done. For example your commit messages could look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Added smoke tests for payment API&lt;/li&gt;
&lt;li&gt;Refactored image compression&lt;/li&gt;
&lt;li&gt;Added Windows compatibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are differing opinions amongst devs on what exactly makes a "good" commit message, but it's very, very clear what bad commit messages look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;zzzz&lt;/li&gt;
&lt;li&gt;add code&lt;/li&gt;
&lt;li&gt;more code&lt;/li&gt;
&lt;li&gt;fuck&lt;/li&gt;
&lt;li&gt;remove shitty code&lt;/li&gt;
&lt;li&gt;fuckfuckfuckfuck&lt;/li&gt;
&lt;li&gt;still broken&lt;/li&gt;
&lt;li&gt;fuck Windows&lt;/li&gt;
&lt;li&gt;zzz&lt;/li&gt;
&lt;li&gt;adsafsf&lt;/li&gt;
&lt;li&gt;broken&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I for one have written my fair share of "zzz"s. This tip is hard to implement if you've already written all your commits. If you're feeling brave, or if you need to remove a few "fucks", you can re-write your commit history with &lt;code&gt;git rebase&lt;/code&gt;. Be warned though, you can lose your code if you screw this up.&lt;/p&gt;
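<p>Here's a minimal sketch of what fixing a bad message looks like, in a throwaway repo. Note that <code>--amend</code> only rewrites the most recent commit; anything older needs <code>git rebase -i</code>:</p>

<div class="highlight"><pre class="highlight shell"><code># create a scratch repo to experiment in safely
git init demo
# commit with a lazy message, then rewrite it
git -C demo -c user.name=demo -c user.email=demo@example.com commit --allow-empty -m "zzz"
git -C demo -c user.name=demo -c user.email=demo@example.com commit --amend -m "Add initial project scaffolding"
# check the result
git -C demo log --oneline
</code></pre></div>

<p>Never rewrite history on a branch that other people are already working from.</p>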

&lt;h3&gt;
  
  
  Fix your formatting
&lt;/h3&gt;

&lt;p&gt;If I see inconsistent indentation or other poor formatting in someone's code, my opinion of their programming ability drops dramatically. Is this fair? Maybe, maybe not, but that's how it is. Make sure all your code sticks to your language's standard styling conventions. If you don't know what those are, find out - you'll need to learn them eventually. Fixing bad coding style is much easier if you use a linter or auto-formatter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add linting or formatting
&lt;/h3&gt;

&lt;p&gt;This one is a bonus, but it's reasonably quick to do. Grab your language community's favorite linter and run it over your code - something like &lt;code&gt;eslint&lt;/code&gt; for JavaScript or &lt;code&gt;flake8&lt;/code&gt; for Python. For those not in the know, a linter is a program that identifies style issues in your code. You run it over your codebase and it yells at you if you do anything wrong. You think your impostor syndrome is bad? Try using a tool that screams at you about all your shitty style choices. These tools are quite common in industry and using one will help you stand out from other junior devs.&lt;/p&gt;

&lt;p&gt;Even better than a linter, try using an auto-formatter - I prefer these personally. These tools automatically re-write your code so that it conforms to a standard style. Examples include &lt;a href="https://golang.org/cmd/gofmt/"&gt;gofmt&lt;/a&gt; for Go, &lt;a href="https://github.com/psf/black"&gt;Black&lt;/a&gt; for Python and&lt;br&gt;
&lt;a href="https://prettier.io/"&gt;Prettier&lt;/a&gt; for JavaScript. I've written more about getting started with Black &lt;a href="https://mattsegal.dev/python-formatting-with-black.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Whatever you choose, make sure you document how to run the linter or formatting tool in your README.&lt;/p&gt;

&lt;h3&gt;
  
  
  Write some tests
&lt;/h3&gt;

&lt;p&gt;Automated code testing is an important part of writing reliable professional-grade software. If you want someone to pay you money to be a professional software developer, then you should demonstrate that you know what a unit test is and how to write one. You don't need to write 100s of tests or get a high test coverage, but write a &lt;em&gt;few&lt;/em&gt; at least.&lt;/p&gt;
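<p>If you've never written one, a unit test can be as small as this - a throwaway <code>test_example.py</code> using Python's built-in <code>unittest</code> module:</p>

<div class="highlight"><pre class="highlight shell"><code>cat &gt; test_example.py &lt;&lt;'EOF'
import unittest

def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def test_add_sums_two_numbers(self):
        self.assertEqual(add(2, 3), 5)

if __name__ == "__main__":
    unittest.main()
EOF
# run the test suite from the command line
python3 -m unittest -v test_example
</code></pre></div>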

&lt;p&gt;Needless to say, explain how to run your tests in your README.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run your tests automatically
&lt;/h3&gt;

&lt;p&gt;If you want to look super fancy then you can run your automated tests in GitHub Actions. This isn't a must-have but it looks nice. It'll take you 30 minutes if you've already written some tests and you can put a cool "tests passing" badge in your README that looks really good. I've written more on how to do this &lt;a href="https://mattsegal.dev/pytest-on-github-actions.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
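<p>A minimal workflow file lives at <code>.github/workflows/tests.yml</code> and might look something like this - it assumes a Python project with a <code>requirements.txt</code> and pytest, so adapt it to your own stack:</p>

<div class="highlight"><pre class="highlight yaml"><code>name: tests
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python -m pytest
</code></pre></div>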

&lt;h3&gt;
  
  
  Deploy your project
&lt;/h3&gt;

&lt;p&gt;If your project is a website then make sure it's deployed and available online. If you have deployed it, make sure there's a link to the live site in the README. This could be a large undertaking, taking hours or days, especially if you haven't done this before, so I'll leave it to you to decide if it's worthwhile.&lt;/p&gt;

&lt;p&gt;If your project is a Django app and you want to get it online, then you might like my guide on &lt;a href="https://mattsegal.dev/simple-django-deployment.html"&gt;simple Django deployments&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add documentation
&lt;/h3&gt;

&lt;p&gt;This is a high effort endeavour so I don't really recommend it if you're just trying to quickly improve the appeal of your project. That said, building HTML documentation with something like &lt;a href="https://www.sphinx-doc.org/en/master/"&gt;Sphinx&lt;/a&gt; and hosting it on &lt;a href="https://pages.github.com/"&gt;GitHub Pages&lt;/a&gt; looks pretty pro. This only really makes sense if your app is reasonably complicated and requires documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next steps
&lt;/h3&gt;

&lt;p&gt;I mention GitHub a lot in this post, but the same tips apply for projects hosted on Bitbucket and GitLab. All these tips also apply to employer-supplied coding tests that are hosted on GitHub, although I'd caution you not to spend too much time jazzing up coding tests: too many beautiful submissions end up in the garbage.&lt;/p&gt;

&lt;p&gt;Now you should have a few things you can do to spiff up your projects before you show them to prospective employers. I think it's important to make sure that the code that you've spent hours on isn't overlooked or dismissed because you didn't write a README.&lt;/p&gt;

&lt;p&gt;Good luck, and please don't hesitate to mail me money if this post helps you get a job.&lt;/p&gt;

&lt;p&gt;If you enjoyed reading this, then you might like my other blog posts over at &lt;a href="https://mattsegal.dev/"&gt;mattsegal.dev&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>career</category>
      <category>github</category>
    </item>
  </channel>
</rss>
