Kevin Gilpin

Posted on Feb 15, 2021 • Edited on Feb 17, 2021

Effective SQL - Indexing and denormalizing JSONB

#architecture #database #sql #postgres

Lately, one of my favorite database features is the JSONB data type in PostgreSQL. To recap the SQL vs NoSQL debate:

SQL crowd Structure is good!
NoSQL crowd Structure is bad!

With JSONB data in PostgreSQL, you can get the best of both worlds. We know that a rigid schema is not a good fit for every type of data, because not all real-world data adheres to a strict format. For example, log messages, errors, code traces, and JSON web services all return diverse families of data. And, there are plenty of good use cases for storing this kind of data in the database. But it's also true that relational technology is extremely powerful, and it it would be a shame to have to give it all up to just store a bit of unstructured data.

In this post, I’m not just recommending that you check out JSONB; you’ve probably already done that. I’m going to suggest two techniques that we use at AppLand to make JSONB data even more effective. Let's dig in.

Efficient querying - Indexes on JSONB

You know that when you have SQL data stored in strongly typed columns, every row has a uniform structure, and data is both efficient to retrieve and easy to to index. What you may not know is that PostgreSQL can also create indexes on JSONB columns. So, if you have a common field such as name within a JSONB document, you can add an index on that field. Then you can efficiently search by name within the JSONB, and PostgreSQL won’t have to slog through all that JSON to find the matching names. Not only that, but you can also SELECT the name from within the JSON and PostgreSQL can retrieve that data without parsing the JSON, using a technique called index-only scan.

Efficient retrieval - Using generated columns with JSONB

Now, what if you want to be able to efficiently fetch a subset of JSONB data, without having to load and parse every object? At AppLand, we store AppMap data in the database. Each AppMap has a mandatory section called metadata, and we want to be able to efficiently pull out the metadata of selected rows without having to parse the entire JSONB column (the metadata is usually less than 1% of the total JSON object). Awesomely, since PostgreSQL 12, PostgreSQL has a feature called generated columns.

If you’ve ever worked with triggers, you know that when a row is inserted, updated, or deleted, you can instruct the database to run an in-database function which can process the row in a pretty much unlimited way. A generated column is like a trigger; it runs automatically whenever a row is inserted or updated. The generated column specifically applies a function to the new row, and then stores the result of that function on the row.

So, you can use generated functions to pull frequently used data out of a JSONB column, and store it on the row. Then you can write a query to access the copied/cached (“de-normalized”) data. It’s truly the best of both relational and non-relational worlds.

An example

Here's a concrete example of a JSONB index and generated column. For this example, I will use AppMap JSON, which represents a class dependency diagram and execution trace of a program. At the top level, the AppMap format includes metadata, classMap, and events.

Let's store the AppMap data in a simple table called appmaps, and add a minimal record:

CREATE TABLE appmaps (
    id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
    data jsonb
);
-- data is schema-less so we can omit all fields except metadata
INSERT INTO appmaps ( data )
VALUES ( '{"metadata":{"name":"login"}}'::jsonb );

Optimizing with an index

When a user wants to retrieve an AppMap by name, the server needs to search data->'metadata'->>'name'. The double-arrow ->> instructs PostgreSQL to convert the result to text (from JSON), so that the query condition can be ordinary text.

CREATE INDEX idx_appmaps_name
ON appmaps USING BTREE ((data->'metadata'->>'name'));

Note: There are different types of indexes you can use, with advantages and disadvantages. BTREE is good for text.

This entire query can now be executed using just the index, using index-only scan:

SELECT data->'metadata'->>'name' AS name
FROM appmaps WHERE data->'metadata'->>'name' = 'login';
 name
-------
 login
(1 row)

Optimizing with a generated column

Let's say we want to be able to efficiently pull out the metadata from the data column. Add the GENERATED column:

ALTER TABLE appmaps ADD COLUMN metadata JSONB
GENERATED ALWAYS AS (data->'metadata') STORED;

All rows now contain the metadata column which holds stored, denormalized data:

SELECT id, metadata from appmaps;
                  id                  | metadata
--------------------------------------+-------
 c5e4cfa5-1945-4685-a50f-7e5b0edd62e5 | {"name": "login"}
(1 row)

Wrapping up

These are a couple of techniques that we are using at AppLand to make JSON data act "relational" when we want it to. Not only does PostgreSQL provide the powerful JSONB data type, but it also provides the shortcuts and usability tricks that make JSONB truly useful.

While you're here

We are conducting a survey: State of Software Architecture Quality. We are aiming for 300 responses, and once we meet our goal we will be donating $1,000 to Girls Who Code. Please contribute to our understanding of software architecture quality by filling out the survey! Of course, we will be summarizing and publishing the results once they are available.

If you don't want to fill out the survey, but you want to be notified when the results are available, you can fill out this form: https://forms.gle/u8CPS3GGD6A7WHsG7.

DEV Community

Effective SQL - Indexing and denormalizing JSONB

Efficient querying - Indexes on JSONB

Efficient retrieval - Using generated columns with JSONB

An example

Wrapping up

Top comments (0)