If you ever plan to run multiple processes for the same server code, it is inevitable that you have to deal with the age-old problem of managing concurrency.
The Django ORM is no different. If you call the save() method on an object, there is a good chance that two different instances of your server will call it on the same object at the same time, corrupting your data.
Consider the example of an account model that is responsible for storing how much bank balance somebody has in their account:
```python
from django.contrib.auth.models import User
from django.db import models


class InsufficientFunds(Exception):
    """Raised when a withdrawal exceeds the available balance."""


class Account(models.Model):
    balance = models.IntegerField(default=0)
    user = models.ForeignKey(User, on_delete=models.CASCADE)

    def deposit(self, amount):
        self.balance += amount
        self.save()

    def withdraw(self, amount):
        if amount > self.balance:
            raise InsufficientFunds()
        self.balance -= amount
        self.save()
```
As you can see, the model includes two methods: one for depositing money into the account and one for withdrawing money from it.
Seems straightforward enough, right? Nothing could go wrong here. It's basic addition and subtraction. WRONG!
Let’s say there is an account with a balance of $1000, accessible by two different users. Think of it as a joint account.
Now let’s say User1 is the earner and User2 is the spender. User1 deposited $100 into the account, so the server invoked account.deposit(100); at the exact same time, User2 withdrew $100, invoking account.withdraw(100).
What should happen in this case? Ideally, the balance at the end of these two transactions should remain $1000. If you are running a single instance of your server, this would indeed be the case, because the two transactions would always run one after another.
But if these transactions are run by different instances of your server in parallel, there is a good chance that the balance at the end of it would be $900. Why does this happen?
Here are the steps that occur in these transactions:
Step 1: User1 retrieves the account
- Balance is $1000
Step 2: User2 retrieves the account
- Balance is $1000
Step 3: User1 deposits $100
- Balance is $1000 + $100 = $1100
Step 4: User2 withdraws $100
- Balance is $1000 - $100 = $900
In step 4, the balance that the second server instance has loaded into memory is stale: it was already updated to $1100 in step 3, but that instance is unaware of the change and still believes the current balance is $1000.
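The lost update above can be reproduced in plain Python, with two in-memory objects standing in for the two server instances that each loaded the same row:

```python
# A minimal sketch of the lost-update race. Two copies of the same
# "row" are loaded, each instance updates its own copy, and the second
# save overwrites the first.
class AccountRow:
    def __init__(self, balance):
        self.balance = balance

db_balance = 1000
copy1 = AccountRow(db_balance)  # instance 1 loads the row
copy2 = AccountRow(db_balance)  # instance 2 loads the same row

copy1.balance += 100            # instance 1 deposits $100...
db_balance = copy1.balance      # ...and saves: DB now holds $1100

copy2.balance -= 100            # instance 2 still sees $1000, withdraws $100...
db_balance = copy2.balance      # ...and saves: DB now holds $900

print(db_balance)  # 900, not the expected 1000
```

The deposit is silently lost because instance 2 never saw it.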
This is the classic concurrency problem and thankfully this age-old problem has an age-old solution.
The solution is quite simple: while a database operation is in progress, the object (or set of objects) being updated must be locked until the operation completes, so that no other process can access it.
This will prevent multiple instances of a server from loading stale data into memory and corrupting the database.
The best place to lock an object is at the database level, as opposed to the application level. This will also protect your data from being corrupted by other processes, such as cron jobs.
Besides, when you run multiple workers of your Django application, maintaining locks at the application level is a pain: you would need a third-party tool that stays in sync across all your workers to achieve a global lock.
The select_for_update method offered by the Django ORM solves this concurrency problem: it returns a queryset that locks all the rows it matches until the outermost transaction it is inside is committed, preventing data corruption.
Here is how you can modify the Account model to use select_for_update and lock the account object:
```python
from django.contrib.auth.models import User
from django.db import models, transaction


class InsufficientFunds(Exception):
    """Raised when a withdrawal exceeds the available balance."""


class Account(models.Model):
    balance = models.IntegerField(default=0)
    user = models.ForeignKey(User, on_delete=models.CASCADE)

    def get_queryset(self):
        return self.__class__.objects.filter(id=self.id)

    @transaction.atomic
    def deposit(self, amount):
        # Re-fetch the row with a lock; the row stays locked until
        # the transaction commits.
        obj = self.get_queryset().select_for_update().get()
        obj.balance += amount
        obj.save()

    @transaction.atomic
    def withdraw(self, amount):
        obj = self.get_queryset().select_for_update().get()
        if amount > obj.balance:
            raise InsufficientFunds()
        obj.balance -= amount
        obj.save()
```
To acquire a lock, we need to re-fetch the object from the database using select_for_update. Operating on self will not work, since it was fetched before the lock was taken and may already be stale. This is why the code above defines a get_queryset method that re-fetches the object being operated on at the time of withdrawal or deposit.
Do keep in mind that for this to work, your database must support transactions and row-level locks. On SQLite, select_for_update has no effect, so it is pretty much useless there. My personal recommendation would be to use PostgreSQL.
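For reference, a typical settings.py fragment pointing Django at PostgreSQL might look like this (the database name, user, and credentials below are placeholders, not values from this article):

```python
# Hypothetical settings.py fragment: select_for_update needs a backend
# with real row locks, such as PostgreSQL. All names/credentials here
# are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "bank",
        "USER": "bank_user",
        "PASSWORD": "change-me",
        "HOST": "localhost",
        "PORT": "5432",
    }
}
```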
The steps that have been defined in the concurrency problem above will now change to this:
Step 1: User1 raises a request to deposit $100
- User1 acquires a lock on the account
- Balance is $1000
Step 2: User2 raises a request to withdraw $100
- User2 attempts to acquire a lock, which fails because the account is already locked by User1
- User2 waits for the lock to be released
Step 3: User1 deposits $100 into the account
- Balance is $1000 + $100 = $1100
- The lock held by User1 is released
- User2 acquires the lock on the account soon after
Step 4: User2 withdraws $100 from the account
- Balance is $1100 - $100 = $1000
- The lock held by User2 is released
Step 5: Balance is $1000 and the data is free of corruption
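The steps above can be sketched with a threading.Lock standing in for the row lock that select_for_update takes: each operation reads the balance only after acquiring the lock, so the read-modify-write sequences can no longer interleave.

```python
import threading

balance = 1000
lock = threading.Lock()  # stands in for the database row lock

def deposit(amount):
    global balance
    with lock:              # acquire the "row lock" first
        current = balance   # the read happens only while the lock is held
        balance = current + amount

def withdraw(amount):
    global balance
    with lock:
        current = balance
        balance = current - amount

t1 = threading.Thread(target=deposit, args=(100,))
t2 = threading.Thread(target=withdraw, args=(100,))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)  # 1000, regardless of which thread ran first
```

Whichever thread wins the lock runs its whole read-modify-write first; the other waits, then sees the fresh value, just as User2 does in the steps above.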
When you run multiple workers of your Django application, you will run into concurrency issues when the same queryset is updated by different processes at the same time.
To prevent this, use select_for_update inside a transaction block to fetch your queryset so that it is locked until the transaction is completed.