Introduction
PostgreSQL is a swiss knife of databases — it supports all kinds of data. But different data types create different challenges. If you are using PostgreSQL to store JSON data you probably have heard about potential performance problems of querying JSON data. The general rule of thumb is to use JSONB data types when you can. But sometimes you want to use JSON not JSONB to preserve some information, like duplicated keys, maybe some null values or order of keys in the JSON. But how can we query the data more efficiently if we use JSON? In this blog I will talk about using B-tree indexes on JSON data.
JSON and B tree indexes
Let’s start from the basics and talk about JSON in Postgres. It can be stored either as a JSON data type where JSON is stored as a text or as a JSONB in a binary format. You could hear that you need JSONB to use indexes for your data. That’s not entirely true. While JSONB has more options for indexes, the JSON data type still supports some indexing. For example, you can create a B-Tree index for your known JSON keys and that index will be similar in behaviour to any other B-Tree indexes on expression in Postgres.
Here is an example of how you can create one.
I have a table jproducts with two columns — id and product where id is a primary key and the product is a JSON data type. The table has about 29 000 rows.
testdb=> \d ecomm.jproducts
Table "ecomm.jproducts"
Column | Type | Collation | Nullable | Default
---------+--------+-----------+----------+---------
id | bigint | | not null |
product | json | | |
Indexes:
"jproducts_pkey" PRIMARY KEY, btree (id)
testdb=>
And here is the list of keys in my JSON:
key | value_type
------------------------+------------
brand | string
category | string
cost | number
department | string
distribution_center_id | number
id | number
name | string
product_description | string
product_image_uri | string
retail_price | number
sku | string
Let’s say we know that our application is going to filter data by brand and it is going to be executed often enough. If we run the query without an index then we can see it takes about 42 ms.
testdb=> explain analyze select product->>'department' from ecomm.jproducts where product @> jsonb_build_object('brand','Victor');
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..7143.47 rows=2036 width=32) (actual time=0.310..41.958 rows=178 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on jproducts (cost=0.00..5939.87 rows=848 width=32) (actual time=0.068..39.653 rows=59 loops=3)
Filter: ((product ->> 'brand'::text) = 'Victor'::text)
Rows Removed by Filter: 9647
Planning Time: 0.051 ms
Execution Time: 41.998 ms
(8 rows)
Even if 42ms looks reasonable enough it can create some performance problems when it scales to hundreds of requests per second. What if we create an index for the key ‘brand’?
create index jproducts_brand on ecomm.jproducts using btree ((product->>'brand'));
Now if we execute the same query we can see the index is used:
testdb=> explain analyze select product->>'department' from ecomm.jproducts where product->>'brand' = 'Victor';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on jproducts (cost=5.42..488.04 rows=146 width=32) (actual time=0.051..0.481 rows=178 loops=1)
Recheck Cond: ((product ->> 'brand'::text) = 'Victor'::text)
Heap Blocks: exact=164
-> Bitmap Index Scan on jproducts_brand (cost=0.00..5.38 rows=146 width=0) (actual time=0.024..0.024 rows=178 loops=1)
Index Cond: ((product ->> 'brand'::text) = 'Victor'::text)
Planning Time: 0.153 ms
Execution Time: 0.455 ms
(7 rows)
Our new index reduces the execution time almost by 100 times. That is a significant performance boost. But will the index be used for any operation when we work with the ‘brand’ key? Let’s change our query a bit and use the LIKE operator or the UPPER function.
testdb=> explain analyze select product->>'department' from ecomm.jproducts where product->>'brand' like '%Victor%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on jproducts (cost=0.00..3550.82 rows=9 width=32) (actual time=0.079..42.731 rows=178 loops=1)
Filter: ((product ->> 'brand'::text) ~~'%Victor%'::text)
Rows Removed by Filter: 28942
Planning Time: 0.053 ms
Execution Time: 42.768 ms
(5 rows)
testdb=> explain analyze select product->>'department' from ecomm.jproducts where upper(product->>'brand') = 'VICTOR';
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on jproducts (cost=0.00..3623.96 rows=146 width=32) (actual time=1.221..54.989 rows=178 loops=1)
Filter: (upper((product ->> 'brand'::text)) = 'VICTOR'::text)
Rows Removed by Filter: 28942
Planning Time: 0.052 ms
Execution Time: 55.028 ms
(5 rows)
testdb=>
The index is not used and it resembles exactly the same behaviour as for any other B-tree indexes on expressions. It has to be able to search in its binary tree using exactly the same expression as specified during the index creation. If we want the index to be used for a query with the UPPER function then the function has to be defined in the index.
create index jproducts_brand_upper on ecomm.jproducts using btree (UPPER(product->>'brand'));
And then the index will indeed be used if we have a query with the UPPER function.
testdb=> explain analyze select product->>'department' from ecomm.jproducts where upper(product->>'brand') = 'VICTOR';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on jproducts (cost=5.42..488.41 rows=146 width=32) (actual time=0.090..0.910 rows=178 loops=1)
Recheck Cond: (upper((product ->> 'brand'::text)) = 'VICTOR'::text)
Heap Blocks: exact=164
-> Bitmap Index Scan on jproducts_brand_upper (cost=0.00..5.38 rows=146 width=0) (actual time=0.023..0.023 rows=178 loops=1)
Index Cond: (upper((product ->> 'brand'::text)) = 'VICTOR'::text)
Planning Time: 0.149 ms
Execution Time: 0.952 ms
(7 rows)
testdb=>
Let’s talk about planner and stats for the index. Does PostgreSQL gather statistics for the keys in the JSON, and would it impact the planner’s decision on whether or not to use the index? I analyzed the table before creating our index but didn’t do it after.
Let’s check the planner’s expectation about the number of rows for the brands ‘Victor’ and ‘Verona Q’ .
testdb=> explain select product->>'department' from ecomm.jproducts where product->>'brand' = 'Victor';
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on jproducts (cost=5.42..488.04 rows=146 width=32)
Recheck Cond: ((product ->> 'brand'::text) = 'Victor'::text)
-> Bitmap Index Scan on jproducts_brand (cost=0.00..5.38 rows=146 width=0)
Index Cond: ((product ->> 'brand'::text) = 'Victor'::text)
(4 rows)
testdb=> explain select product->>'department' from ecomm.jproducts where product->>'brand' = 'Verona Q';
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on jproducts (cost=5.42..488.04 rows=146 width=32)
Recheck Cond: ((product ->> 'brand'::text) = 'Verona Q'::text)
-> Bitmap Index Scan on jproducts_brand (cost=0.00..5.38 rows=146 width=0)
Index Cond: ((product ->> 'brand'::text) = 'Verona Q'::text)
(4 rows)
testdb=>
It seems that the planner assumes the number of rows is roughly the same and equal to 146. By default PostgreSQL doesn’t gather statistics for keys in your JSON but it can do it for expressions or when you create an index on expression. We have created the index and we can help our planner by adding the statistics.
Let’s analyze the table and repeat one of our queries.
testdb=> analyze ecomm.jproducts;
ANALYZE
testdb=> explain select product->>'department' from ecomm.jproducts where product->>'brand' = 'Verona Q';
QUERY PLAN
----------------------------------------------------------------------------------
Bitmap Heap Scan on jproducts (cost=16.30..2162.20 rows=1034 width=32)
Recheck Cond: ((product ->> 'brand'::text) = 'Verona Q'::text)
-> Bitmap Index Scan on jproducts_brand (cost=0.00..16.04 rows=1034 width=0)
Index Cond: ((product ->> 'brand'::text) = 'Verona Q'::text)
(4 rows)
testdb=> select count(*) from ecomm.jproducts where product->>'brand' = 'Verona Q';
count
-------
1034
(1 row)
Now the planner knows exactly how many rows of each particular brand we have and that can help to make the right decision. It might prefer the sequential scan (seq_scan) operation sometimes because it can be less expensive if cardinality (number of rows returned by the query block) is too high. The reason for that is the cost of querying using seq_scan is by default four times less costly than the random scan used for index. You can see (and change) it using the following parameters.
testdb=> show seq_page_cost ;
seq_page_cost
---------------
1
(1 row)
testdb=> show random_page_cost ;
random_page_cost
------------------
4
(1 row)
We can demonstrate it using an index on product->department key.
testdb=> create index jproducts_department on ecomm.jproducts using btree ((product->>'department'));
CREATE INDEX
Time: 93.934 ms
testdb=> explain select product->>'brand' from ecomm.jproducts where product->>'department' = 'Women';
QUERY PLAN
-------------------------------------------------------------------------------------
Bitmap Heap Scan on jproducts (cost=5.42..491.90 rows=146 width=32)
Recheck Cond: ((product ->> 'department'::text) = 'Women'::text)
-> Bitmap Index Scan on jproducts_department (cost=0.00..5.38 rows=146 width=0)
Index Cond: ((product ->> 'department'::text) = 'Women'::text)
(4 rows)
Time: 63.887 ms
testdb=> analyze ecomm.jproducts;
ANALYZE
Time: 166.026 ms
testdb=> explain select product->>'brand' from ecomm.jproducts where product->>'department' = 'Women';
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on jproducts (cost=0.00..3869.77 rows=15989 width=32)
Filter: ((product ->> 'department'::text) = 'Women'::text)
(2 rows)
Time: 58.981 ms
testdb=>
You can see that after creating the index we have the same default assumption about the number of rows for each department and the index is chosen to get the data. But as soon as we analyzed the table again it updated the stats and recognized that more than half of the table needs to be scanned. The overall cost for our index scan is higher than the sequential scan for the table. This is because we’re retrieving most of our table’s pages, and an index scan has a higher cost per page. As a result, the planner chooses to use sequential scan and ignore the index.
That’s great but what would we do if we don’t know what keys would be used in our application queries? Can we create an index for each key? We can but it might not be the best decision. Each index generates significant overhead for all operations on the data. Every inserted, deleted or updated row should update the indexes and then later be a subject of the vacuuming process. Here is a simple example of impact on insert from our two indexes.
-- Without any indexes on JSON
testdb=> insert into ecomm.jproducts select id, to_json(t) from products t;
INSERT 0 29120
Time: 954.380 ms
testdb=>
-- Creating indexes
testdb=> create index jproducts_brand on ecomm.jproducts using btree ((product->>'brand'));
CREATE INDEX
Time: 69.097 ms
testdb=> create index jproducts_department on ecomm.jproducts using btree ((product->>'department'));
CREATE INDEX
Time: 63.080 ms
testdb=>
-- Insert with two indexes
testdb=> insert into ecomm.jproducts select id, to_json(t) from products t;
INSERT 0 29120
Time: 3008.885 ms (00:03.009)
testdb=>
Just two indexes increased insert time by 3 times and it is not taking into consideration the full impact from any deletes or updates, which will result in additional vacuuming of obsolete tuples.
Conclusion
Let’s recap what we’ve discussed here.
- You can use index on expressions for your JSON data when you know what JSON keys your application or queries are going to use.
- The indexes follow the same rules as any other b-tree indexes on expressions — your query should use a compatible expression.
- You need to analyze your table after creating an index to have correct statistics for your indexed keys to help planner to make correct decisions about using the index.
- Be mindful of index overhead and potential impact on your DML and maintenance operations. Create only indexes you really need.
What if you don’t know what keys are going to be used in your application and maybe don’t know what new keys can appear in your JSON data? Then maybe it makes sense to look into JSONB data type and GIN index. And that is what we are going to discuss in the next blog. Stay tuned.

Top comments (0)