Using Django's JSONField? You probably don't need it. Here's why

#django #database #postgres #python

In my current role, we use Django with Django Rest Framework (DRF) to power our platform’s API, and leverage PostgreSQL to store both relational and nonrelational data. As part of the platform, we run asynchronous, on-demand batch jobs that can operate up to half a million rows at once.

Each of these jobs is represented by a row in a table and contains both top-level, single-responsibility columns as well as “deeper” structures that may contain more arbitrary data. As a working proof-of-concept, we avoided over-optimizing early and didn’t have a clear definition of what these job outputs might look like. We added JSON columns to our job model that would store different forms of data - metadata, debug output, and namely, job report output.

While we didn’t need to expose metadata or debug output to our API, returning the report data to describe the job results was the main requirement. This data varied little in size and complexity, only maintaining a couple of keys and a few levels of nesting.

This job model did not have any special qualities. It was defined and represented like any other type of model within our Django app, except for additional JSONB columns to represent arbitrary and unnormalized data.

Unexpectedly, we experienced major slowdowns both in API response time and internal queries made through an RDBMS. At a point, requesting a single one of these jobs would take upwards of 5 minutes. When JSONB columns were omitted from queries, the query had a subsecond execution time.

Interestingly, we found that converting all columns that did not need to use any of JSONB’s benefits – such as indexing and advanced querying – significantly improved our application’s performance. While I’m doubtful that this was the true culprit, it did get us to our desired performance in the shortest time possible.

With that personal anecdote, let’s explore the differences between JSONB and JSON, and how you can use JSON in your Django apps.

JSONB vs JSON, and how they’re handled

Both the JSON and JSONB data types are used for storing JSON data in PostgreSQL tables. These data types are almost identical, different only in how these values are stored in the database, and the efficiency with which operations [1] are performed.

Notably, JSON stores the exact input text, whereas JSONB instead stores a decomposed binary format. As a result, there’s a small amount of write/conversion overhead (converting from text to binary), making storing input on this field slightly slower overall. The binary format of JSONB allows for significantly faster processing and operations, as the field does not have to be reparsed each time.

In addition to the fields represented in a different format, JSONB allows for indexing of keys in the field, which allows for efficient arbitrary key lookups. This would take significantly longer if attempting to search within a JSON field, as you would have to both parse the field and perform a search within a text object.

In summary,

JSON:

Stores an exact copy of the input text, including “semantically-insignificant” whitespace, duplicate keys, and ordering.
The last value of a duplicate key is used.
Must reparse the field for each operation, causing these operations to take significantly longer

JSONB:

Stores a decomposed binary format, which does not preserve whitespace (outside of values), the order of keys, or duplicate keys. Only the last value of duplicate keys is kept.
No reparsing is needed and operations over the field are significantly quicker
Can take up more disk space than a regular JSON column
Supports indexing of keys, significantly improving the performance of lookups

When to use which

Deciding which format of JSON to use will largely depend on the needs of your application. Generally, the Postgres manual suggests that most applications should use the JSONB field, except in specialized circumstances or legacy applications where keys would not be able to be re-ordered on retrieval.

In normal usage, this makes sense as you are likely to be storing some form of structured yet arbitrary data that you may need to perform operations or lookups on. However, we found that a JSON field was a better option for our use case, as we stored large amounts of varying data to be only used for retrieval without manipulation.

Deciding between JSONB and JSON will depend on several factors, such as requirements for indexing, lookups, operations, and retrieval.

The short answer is,

Use JSON when:

You are only storing and retrieving data within this field
You do not need JSONB’s benefits and would prefer to save write time and disk space

Use JSONB when:

You need indexing of keys within the object
You will perform lookups over this field
You will perform operations over this field, such as sorting, ordering, or slicing, before working with the representation in your application [2].

Switching from JSONB to JSON

If you’re using Postgres with Django and importing the JSONField object, you’re using JSONB. This is the default in recent Django versions and is a sensible choice for most applications, but you may find that you’d prefer a standard JSON based on your data and workflow.

If you find that you don’t need JSONB’s features, you have two options:

You can use a text field, which will store your JSON in the exact input format. There are a few caveats to this approach:
1. You will need to serialize and deserialize the object when working with it, especially in an API context
2. A native text field does not provide JSON input validation. To ensure that you’re storing valid JSON objects, you’ll need to validate this before inserting it into your database.
3. You will not be able to retroactively cast the field to a jsonb if you later decide to use JSONB format.
If you want the convenience of a package, you can use packages like django-jsonfield or jsonfield, which will handle validation and serialization by default. If you opt-out of using JSONB, this is the preferred option.

If you’re using other databases like MySQL, you may opt for a package like django-mysql, which provides support for JSON fields and extends the functionality of MySQL in your Django application.

Using JSON with Django Rest Framework

To solve the performance issue, we opted to switch to using the JSONField field provided by django-jsonfield, allowing us to remove JSONB from the model entirely. With our new JSON fields, we suddenly found that the data provided by our API to our processing/job service was causing exceptions when we were attempting to extract and load information from the object. Where we originally were inferring a native JSON object, which was already deserialized by Django, we instead were receiving a JSON object in the form of a string. Our data providers didn’t expect this and quickly began failing when we attempted to use any metadata from our API.

After some experimentation, we found that responses needed to have these new JSON fields deserialized manually before being returned in the API. Django handled this originally, but switching to a non-default field required some adjustment.

Assuming that you’re using serializers within Django Rest Framework (DRF), you can update the serializer for the field to be a SerializerMethodField, which can be used to add any arbitrary data to your object’s serialized representation or change the representation of an existing field. Here’s how you can deserialize the JSONField within a serializer:

from rest_framework import serializers
from rest_framework.fields import SerializerMethodField

from models.cluster import Cluster

class ClusterSerializer(serializers.ModelSerializer):
  metadata = SerializerMethodField()

  class Meta:
    model = Cluster
    fields = ["metadata"]

  def get_metadata(self, obj):
    return obj.metadata

This forces the field to deserialize from a string to a Python dict. If you intend on using this field in more than one serializer, you may benefit from adding a property to the model to automatically deserialize this field instead of within your serializer:

from django.db import models

import jsonfield

class Cluster(models.Model):
  metadata = jsonfield.JSONField()

  @property
  def _metadata(self):
    return self.metadata

Further anecdotes

Admittedly, there were too many confounding factors relating to this issue to blame for the usage of the JSONB field. Because we had resolved the problem with this change, I stopped researching possible causes beyond what I had originally found and stopped looking further into performance benchmarks.

Jobs in this table and the corresponding column were being populated by data produced through a disconnected processor. The output had been sent through our API at the time (this has since changed as another performance consideration), and inserts were being handled by our main Django application.

Based on our usage, there may have been issues with indexing, or the query planner was somehow unable to work with the data in this column across all of our jobs. As we experiencing slowdowns both in our application and using an RDBMS, I’m unsure if this was amplified by passing the model to a DRF serializer, which has an overhead of its own; or if our managed database simply could not handle returning this data as part of our queries.

Interestingly, we had millions of rows in another table that had five times the number of JSONB columns with no obvious differences in performance. These JSONB columns were fully queryable and responded in a reasonable amount of time, concerning the size of the data being returned, despite no difference in the model’s definition of the JSON field. Many of these columns had significantly more data than the related column in any of the problematic jobs, and yet they were handled significantly better by the RDBMS. Our problematic table with only a few rows took up around 16MB of space, whereas switching to a normal JSON reduced table size to a few KB. The table with millions of rows and sets of JSONB columns held 20GB+ of data and responded within seconds to both complex and simple queries.

While this change likely isn’t the single reason behind performance improvement, it helped recognize that there is a need to make considerations about how you’re storing and handling related data. As our system matured and we had more examples of what distinct outputs of these jobs may look like, we eventually normalized this report column into its separate table, allowing us to improve performance even further by lazily loading report data while keeping our API less complex. After reviewing the output of several distinct jobs, we extracted upwards of 20 attributes into dedicated columns with explicit data types, which will help significantly with performance over time as the size of the data stored in this table continues to grow.

I don’t have the full answer to what happened here, as I didn’t get to spend more time investigating. If you have any thoughts on what may have occurred beyond what I’ve covered here or have experienced this yourself, I'd love to chat.

Footnotes

[1] Operations here refer to things like search, sorting, slicing, and ordering done directly through database queries.

[2] "Working with the representation in your application" refers to retrieving your object from the database, storing it in memory/as a local variable, and then working with the local object. At this point, it wouldn't matter what format the JSON is stored in, as you'll only get JSONB benefits when operations are done within the database context.