The UK coronavirus dashboard is the primary data reporting service for the COVID-19 pandemic in the United Kingdom. It sustains on average 45 to 50 million hits a day, and is regarded as a critical national service.
We run the service on an active-active architecture, meaning that there are at least two, often three independent replicas of our services running in geographically different locations.
There is only one exception: our database. The service runs using a special PostgreSQL database: Hyperscale Citus. The fact that our database does not conform to the active-active architecture is not because we don't know how to create read replicas, rather, it is due to some logistical reasons beyond the scope of this article.
The database, however, runs on a highly available ecosystem, wherein an unhealthy instance is replaced within a couple of minutes.
An exciting update
PostgreSQL 14 was officially released on 30 September 2021. Amongst other features, the new version offered enhanced security and a boost in performance of queries from partitioned tables.
We were quite excited about that, though because we use a fully managed service, we thought it was very unlikely for us to be able to upgrade our database until mid-2022 at the earliest.
Thanks to the great work done by the Azure Postgres team, however, our perception was proven wrong and the new update was made available on managed databases across Azure within 24 hours.
This was a great achievement for the Azure Postgres team — a new milestone in the maintenance of managed services in the Cloud. The update was made even more exciting as it also offered a new version of the Citus extension: Citus 10, which resolved a number of bugs we had encountered in the previous version.
Collaborating with the Azure Postgres (Citus) team, we deployed the new version in our development environment on the evening of Wednesday, 13 October 2021. It went very smoothly; we monitored everything for 48 hours and did some load testing before setting off the deployment to the production environment at midnight on Saturday, 16 October 2021. The deployment process ended at around 11am on Saturday without any issues.
Everything was nice and rosy.
What else happened?
The week that followed the update saw a number of changes to the service:
- We released new data for booster vaccines.
- We released a new version of summary pages implemented in F#.
The second update was intended to:
- reduce reliance on the database
- decrease latency at peak time
- leverage Redis for expensive queries
- use the cache pre-population solution implemented in our ETL
The week after the night before
Towards the end of the week, several people reached out to me on Twitter reporting that they were encountering an increasing number of failed requests when using APIv2.
In particular, they said that they couldn't download new data from the API after the release — i.e. at 4pm UK time.
My first course of action in response to such reports is to check the API service and our PGBouncer instances to ensure that they are healthy. Then I usually clear the storage cache for APIv2, which tends to resolve the problem.
The number of grievances directed at me on Twitter increased. People were now reporting increased latency in our Generic API. This concerned me because unlike APIv2, which is aggressively throttled and is designed to serve very large downloads, the Generic API is implemented in the Go programming language using highly optimised queries for rapid response. Whilst a download from APIv2 may take 3 or 4 minutes to complete, the average latency for our Generic API is ~30-50 milliseconds.
Once again, I investigated the infrastructure and even reverted some changes to the Generic API to ensure that the issue wasn't from the recent — albeit minor — changes. It wasn't.
Disaster weekend
On Sunday, 31 October 2021, we experienced a critical failure: our ETL processes were failing — we couldn't deploy new data to the database. The reason? Failure to establish or maintain connections with the database.
I highlighted earlier that all connections to our database go through a connection pool (PGBouncer), which is hosted on two powerful virtual machines with 64 GB memory and 16 CPU cores (Standard_D16_v4), maintained using Azure VM Scale Sets.
The more connections you have, the more memory you use: in Postgres, each additional connection uses ~10 MB of memory. More connections also mean that more concurrent processes can be initiated, which in turn need more resources to run. Most connections to the database don't need to be maintained, either. You only need a connection while you submit a query and await the response; applications, however, tend to occupy connections forever.

PGBouncer, and connection pools in general, mitigate the situation by maintaining a predefined number of connections with the database (986 in our case) and providing short-lived connections to client applications in return (5,000 in our case). PGBouncer forwards requests from each client to an available database connection from its pool, then routes the response back to the requesting client. In short, our 986 connections become 5,000.

How long is the connection between a client and PGBouncer maintained? That depends on your settings; in our case, it is maintained for one transaction.

Why do we use PGBouncer?
Well, in a nutshell, a database offers a limited number of concurrent connections. You can change or increase the number, but there is a trade-off.
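The multiplexing described above can be sketched in a few lines of Python. This is a toy model, not how PGBouncer is actually implemented; only the numbers (986 server connections, 5,000 client transactions) come from our setup:

```python
from collections import deque

class TransactionPool:
    """Toy model of transaction-level pooling: many client sessions share
    a small, fixed set of server connections, each borrowed for exactly
    one transaction. Illustrative only, not PGBouncer's implementation."""

    def __init__(self, server_connections: int):
        self.free = deque(range(server_connections))  # idle server connections

    def run_transaction(self, query: str) -> str:
        if not self.free:
            # In the real pool, the client would wait in a queue here.
            raise RuntimeError("all server connections are busy")
        conn_id = self.free.popleft()      # borrow a server connection
        try:
            return f"conn {conn_id} ran: {query}"
        finally:
            self.free.append(conn_id)      # returned as soon as the transaction ends

# 986 server connections can serve 5,000 client sessions because a client
# only occupies a connection while a transaction is actually in flight.
pool = TransactionPool(server_connections=986)
results = [pool.run_transaction("SELECT 1") for _ in range(5000)]
print(len(results))  # 5000 transactions served through 986 connections
```

The crucial point is the borrow window: in transaction pooling mode it lasts exactly one transaction, which is what lets a small pool of real connections serve far more clients.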
I had also just enabled accelerated networking on PGBouncer instances, so there was no reason for them to fail — and yet, they were both showing as unhealthy.
The health check probe for PGBouncer instances is set to ping Postgres. This means that — other than running out of CPU / RAM — there is only one other issue that could make them unhealthy: failure of the database.
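In other words, the probe's logic amounts to roughly the following. This is an illustrative Python sketch, not our actual probe; `check_db` stands in for a trivial query (such as `SELECT 1`) routed through the pool:

```python
def pgbouncer_health(check_db) -> str:
    """Illustrative sketch of the probe's logic: a PGBouncer instance only
    reports healthy if a trivial query round-trips through it to Postgres.
    check_db is any callable that raises on failure."""
    try:
        check_db()                 # e.g. SELECT 1 via the connection pool
        return "healthy"
    except Exception:
        return "unhealthy"         # pool exhausted OR the database itself is down

def db_down():
    raise ConnectionError("database unreachable")

print(pgbouncer_health(lambda: None))  # healthy
print(pgbouncer_health(db_down))       # unhealthy
```

This is why both instances turning unhealthy at once points past PGBouncer itself and towards the database.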
Naturally, the next thing I did was to check the database, and this is what I saw:
Everything was healthy 🤯!
But this could not have been, and I needed to dig deeper. The only problem was that it was now 12:45pm, and we hadn't deployed any data yet. We were running out of time...
Next, I checked the number of connections:
Ok, so nothing wrong there... except for the spikes highlighted using red arrows. They correlated with our failed attempts to deploy the data. This was weird as we don't normally need 150 connections to deploy the data. It gets done with 30 or 40 connections.
Next, I checked the CPU... and that's when I learned the true meaning of Halloween 👻:
This could not be happening; not on this server. But it was!
I alerted the Azure Postgres team; but they are mostly based in the US, and it was all happening on a Sunday... and not just any Sunday: it was Halloween 🎃. Remember that this isn't Azure support, this is the core development team for the service, and folks who, I know, aren't on-call 24/7.
I had no time to do much at that point; it was 2:50pm already. All I knew was that I needed to reduce the load on the database to enable the deployment of data. All I could think of at that point was to forcibly reduce the number of connections to the database, and so I turned off 70% of our servers. That did the trick, and we managed to deploy the first chunk of data at 3:08pm. We finished by the skin of our teeth at 3:58pm, I turned the servers back on, and we managed to release the data at 4:01pm.
It was far from over
Next, we experienced significant deterioration in APIv1. The issue had now become critical. APIv1 is our Open Data API: the one we use to show the data on the website, and it is heavily used by different organisations — from local authorities, to the Department of Health and No 10 Downing Street, to the WHO, to Johns Hopkins University. Issues with this API became noticeable at 4:30pm on Sunday, 31 October 2021.
It is important to put this into perspective. Most of the reports I had received thus far concerned issues at release time: that is, when we release the daily data and sustain a massive lightning surge in demand. Here is how our traffic looks at release time:
This means that a few failures here and there are expected, but not at 50% rate, as was the case on Sunday, 31 October 2021:
There was another interesting pattern there. See the part highlighted using a grey arrow: that's when 70% of the servers were turned off, and yet the number of failed requests decreased! But this couldn't be. We had plenty of free connections available, so we couldn't have been running out of them. Granted, it wasn't during the peak, so we weren't using all the servers anyway, but still: reduced failure on 70% less capacity?
Why did the number of failures drop?
That's where things became even more complicated.
After extensive investigation and benchmarking, it turned out that our queries were taking significantly longer to run.
For instance, this is a well-optimised, tested query from our Generic API that should — under normal circumstances — run in ~1.5 milliseconds:
SELECT
    area_code AS "areaCode",
    area_name AS "areaName",
    ar.area_type AS "areaType"
FROM covid19.area_reference AS ar
WHERE LOWER(area_type) = LOWER('region')
  AND area_name ILIKE 'London'
LIMIT 1 OFFSET 0;
So I analysed the query as follows:
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    area_code AS "areaCode",
    area_name AS "areaName",
    ar.area_type AS "areaType"
FROM covid19.area_reference AS ar
WHERE LOWER(area_type) = LOWER('region')
  AND area_name ILIKE 'London'
LIMIT 1 OFFSET 0;
And the response was as follows:
...
Buffers: shared hit=5
Planning Time: 0.119 ms
Execution Time: 57.190 ms
It took ~57.2 ms to run! But how is that possible? That's an increase of nearly 40x.
I then checked the activity in the database to see how many queries were running, and for how long:
SELECT usename,
       state,
       query,
       state_change - query_start AS time_lapsed
FROM pg_stat_activity;
There were 140 queries, but more than half of them included this:
WITH RECURSIVE typeinfo_tree(
oid, ns, name, kind, basetype, elemtype, elemdelim,
range_subtype, attrtypoids, attrnames, depth)
AS (
SELECT
ti.oid, ti.ns, ti.name, ti.kind, ti.basetype,
ti.elemtype, ti.elemdelim, ti.range_subtype,
ti.attrtypoids, ti.attrnames, 0
FROM
(...
These queries showed as IDLE, and the time_lapsed between submission and state change was quite noticeable.
After more investigation, it turned out that this query is sent by asyncpg, the driver API that we use in our Python codebase to perform async operations with Postgres.
Further research revealed that the issue had already been encountered by others in the community:
Simple query runs long when DB schema contains thousands of tables #186
- asyncpg version: 0.12
- PostgreSQL version: 9.6.4
- Python version: 3.5.3
- Platform: Linux x86_64 (Debian Stretch)
- Do you use pgbouncer?: no
- Did you install asyncpg with pip?: no
- If you built asyncpg locally, which version of Cython did you use?: 0.25.2
- Can the issue be reproduced under both asyncio and uvloop?: yes
My DB schema contains thousands of tables inherited from one table. These tables store some logs or historical information and have around 100 columns. As a result, the PostgreSQL system catalog pg_attribute contains an enormous number of records.
I connect to PostgreSQL with my own DB user sergey, who is a superuser. When I connect as this user, some simple queries run very long in asyncpg.
Preparation. This code creates 5000 inherited tables with 100 columns to prepare a similar environment for the test cases below:
import asyncpg
import asyncio

loop = asyncio.get_event_loop()

async def run():
    conn = await asyncpg.connect(user='sergey')
    await conn.execute(
        'create table testlog ({})'.format(
            ','.join('c{:02} varchar'.format(n) for n in range(100))
        )
    )
    for n in range(5000):
        await conn.execute(
            'create table testlog_{:04} () inherits (testlog)'.format(n)
        )

loop.run_until_complete(run())
Example 1: Connect as sergey and run this simple query:
import asyncpg
import asyncio

#import uvloop
#asyncio.set_event_loop(uvloop.new_event_loop())

loop = asyncio.get_event_loop()

async def run():
    conn = await asyncpg.connect(user='sergey')
    return await conn.fetch('select $1::integer[]', [1, 2, 3])

print(loop.run_until_complete(run()))
While it was running for about 2 minutes, I saw this query in pg_activity_stat:
WITH RECURSIVE typeinfo_tree(
oid, ns, name, kind, basetype, has_bin_io, elemtype, elemdelim,
range_subtype, elem_has_bin_io, attrtypoids, attrnames, depth)
AS (
WITH composite_attrs
AS (
SELECT
c.reltype AS comptype_oid,
array_agg(ia.atttypid ORDER BY ia.attnum) AS typoids,
array_agg(ia.attname::text ORDER BY ia.attnum) AS names
FROM
pg_attribute ia
INNER JOIN pg_class c
ON (ia.attrelid = c.oid)
WHERE
ia.attnum > 0 AND NOT ia.attisdropped
GROUP BY
c.reltype
),
typeinfo
AS (
SELECT
t.oid AS oid,
ns.nspname AS ns,
t.typname AS name,
t.typtype AS kind,
(CASE WHEN t.typtype = 'd' THEN
(WITH RECURSIVE typebases(oid, depth) AS (
Unfortunately, pg_activity_stat does not show the whole query, but only the first part. I guess the slow execution has to do with the size of pg_attribute.
Example 2: Now connect as user postgres and run the same query. It is still not instant, but much faster (~1 second):
import asyncpg
import asyncio

#import uvloop
#asyncio.set_event_loop(uvloop.new_event_loop())

loop = asyncio.get_event_loop()

async def run():
    conn = await asyncpg.connect(user='postgres')
    return await conn.fetch('select $1::integer[]', [1, 2, 3])

print(loop.run_until_complete(run()))
Example 3: Not all queries execute long. This one is fast:
import asyncpg
import asyncio

#import uvloop
#asyncio.set_event_loop(uvloop.new_event_loop())

loop = asyncio.get_event_loop()

async def run():
    conn = await asyncpg.connect(user='sergey')
    return await conn.fetch('select $1::integer', 1)

print(loop.run_until_complete(run()))
P.S. I tried vacuum full analyze pg_attribute and vacuum full analyze pg_class; it did not help.
Based on discussions in the issue ticket, it appears that the problem is associated with a setting called just-in-time compilation (JIT), which is turned on by default in Postgres 14 when it is compiled using LLVM.
I found this at around 11:40pm on Monday, 1 November 2021, after some discussions I had earlier at around 9:30pm with the Azure Postgres (Citus) team, where JIT was mentioned as a possible culprit.
I was still not satisfied that this was the issue, but then the Citus team messaged me, saying that after some extensive investigation, they also think the issue originates in JIT having been turned on. What was more interesting was that they sent me the same discussion I had found on GitHub.
We now had a hypothesis, and needed to test it. So they sent me a fairly chunky query to run and evaluate the result:
Planning Time: 3.283 ms
JIT:
Functions: 147
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 17.460 ms, Inlining 15.796 ms, Optimization 525.188 ms, Emission 350.897 ms, Total 909.340 ms
Execution Time: 911.799 ms
Right then, we had a benchmark. But look again: the issue reveals itself right there too:
JIT:
Functions: 147
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 17.460 ms, Inlining 15.796 ms, Optimization 525.188 ms, Emission 350.897 ms, Total 909.340 ms

The JIT optimisation process alone took 909.340 ms to run!
Time to turn JIT off — it's now 12:20am, so it's as good a time as any to experiment... and we did:
At 12:22am, we ran the above query again, and this is what we saw:
Planning Time: 3.159 ms
Execution Time: 0.409 ms
And just like that, the execution time dropped 2,229-fold 😲.
The next day, we saw a substantial change in the CPU stats in Postgres. The vertical line in the following image shows 12:22am, the point at which JIT was turned off:
Problem solved! 🎉
Concluding remarks
Issues such as this are quite difficult to identify. This is because they manifest themselves as a cascade of problems whose trail needs to be followed. Even then, one needs to do some detective work and research to pinpoint the cause.
JIT in Postgres
JIT compilation is a useful optimisation feature — no question. It is, however, not one to be turned on by default. It should only be used in scenarios such as the following:
- columnar or analytical processes that need to be executed on the entire table;
- queries that tend to take a very long time to process without JIT; and,
- queries that are not run repetitively in quick succession.
Think of it like the JIT compilers in Python or .NET, which optimise the code to make it run faster. They are extremely useful in long-running applications: they make you wait less, and they reduce the cost of running the application.
The benefits in such cases are significant because the additional overhead is not noticed beyond the startup phase, which is when the JIT process does the compilation.
This is not, however, the case when we are making short, well-optimised queries against a database tens, if not hundreds, of times per second — certainly not when we are compiling the same query each time it is submitted. This wouldn't cause any issues if compiled queries were cached, but as things currently stand, they are not.
The cache for such queries could be stored in binary format, with the hash of the query string as the key. The time-to-live (TTL) of the cache could be an adjustable setting, with a default of something like 5 minutes. It is, however, important to appreciate that the core Postgres team includes some very smart and competent software engineers who know what they are doing and would certainly have thought of caching compiled queries. This leads me to believe that the approach has not been adopted for reasons of which we may not be aware.
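For illustration only, here is a rough Python sketch of the kind of cache described above. To be clear: Postgres implements nothing like this, and both the hashing scheme and the 5-minute TTL are this article's suggestion, not real behaviour:

```python
import hashlib
import time

class CompiledQueryCache:
    """Hypothetical cache for JIT-compiled queries: the key is a hash of the
    query string, the value is the compiled artefact, and entries expire
    after an adjustable TTL (defaulting to the suggested 5 minutes).
    Purely illustrative; Postgres does not do this."""

    def __init__(self, ttl_seconds: float = 300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (compiled_blob, expiry_time)

    @staticmethod
    def key(query: str) -> str:
        return hashlib.sha256(query.encode()).hexdigest()

    def get_or_compile(self, query, compile_fn):
        k = self.key(query)
        entry = self._store.get(k)
        now = time.monotonic()
        if entry is not None and entry[1] > now:   # still fresh: reuse
            return entry[0]
        blob = compile_fn(query)                   # pay the compilation cost once
        self._store[k] = (blob, now + self.ttl)
        return blob

compilations = []
def fake_jit(q):
    compilations.append(q)                         # stands in for LLVM compilation
    return b"machine-code-for:" + q.encode()

cache = CompiledQueryCache(ttl_seconds=300)
for _ in range(100):                               # the same query, submitted 100 times
    cache.get_or_compile("SELECT 1", fake_jit)
print(len(compilations))  # 1: compiled once, not 100 times
```

With such a cache, the 909 ms compilation cost in the benchmark above would be paid once per TTL window rather than on every submission.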
Thresholds
Playing with JIT thresholds to base its activation on transaction cost won't, in my opinion, have a significant impact — not when you are dealing with a massive, but adequately indexed database.
Back-pressure
Another issue here is back-pressure. When an otherwise optimised query that should take 1 millisecond takes 5 seconds to execute — possibly because it contains a recursive CTE — the database (CPU) remains occupied. JIT compilation is a CPU intensive process, and there are only so many such queries the database can concurrently execute.
This means that other queries, possibly even identical ones — of which there may be hundreds — are idled and queued for execution, creating a backlog of delayed jobs.
This sets off a chain of events, or as I call it: the cascade of doom. You end up with seemingly small processes saturating the CPU or awaiting JIT compilation:
- which causes increased latency,
- which may cause your client application to reach its timeout threshold,
- which causes queries to fail,
- which causes the instance to report as unhealthy,
- which leads to the creation of new instances and the activation of auto-scaling processes,
- which increases the number of connections,
- which adds to the pressure on the database.
... and the rest is history.
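Some back-of-the-envelope arithmetic shows why the cascade takes hold so quickly. The numbers below are hypothetical, chosen only to mirror the 1 ms versus 5 s scenario described above:

```python
# Throughput of a database is roughly concurrency / service_time
# (Little's law). Watch what a 1 ms -> 5 s regression does to capacity.
concurrency = 100                          # queries run at once (hypothetical)

fast_qps = concurrency * 1000 // 1         # 1 ms per query -> 100,000 queries/sec
slow_qps = concurrency // 5                # 5 s per query  ->      20 queries/sec

arrival_rate = 1000                        # incoming queries per second (hypothetical)
backlog_growth = arrival_rate - slow_qps   # queries queued up every second

print(fast_qps, slow_qps, backlog_growth)  # 100000 20 980
```

A 5,000x slowdown per query becomes a 5,000x drop in capacity, so even a modest arrival rate builds a backlog of hundreds of queries every second — the cascade of doom in miniature.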
How do we know if JIT is useful?
There is no absolute workflow to follow. It really depends on the application. A good way to determine its usefulness, however, is to turn JIT on for a specific transaction and do some benchmarking:
BEGIN;
SET LOCAL jit = ON;
EXPLAIN (BUFFERS, ANALYSE)
-- The query
;
COMMIT;
Note that if your query is too simple and its cost falls below the jit_above_cost threshold, JIT may not be activated. You can temporarily decrease the threshold to mitigate this.
Let's evaluate a query from our database as an example.
Without JIT:
EXPLAIN (BUFFERS, ANALYSE)
SELECT metric,
area_code,
MAX(area_name) AS area_name,
percentile_cont(.25) WITHIN GROUP (ORDER BY (ts.payload ->> 'value')::FLOAT) AS first_quantile,
percentile_cont(.5) WITHIN GROUP (ORDER BY (ts.payload ->> 'value')::FLOAT) AS median,
percentile_cont(.75) WITHIN GROUP (ORDER BY (ts.payload ->> 'value')::FLOAT) AS third_quantile
FROM covid19.time_series AS ts
JOIN covid19.area_reference AS ar ON ar.id = ts.area_id
JOIN covid19.metric_reference AS mr ON mr.id = ts.metric_id
WHERE ts.date BETWEEN '2021-08-05' AND '2021-11-05'
AND ts.partition_id = '2021_11_10|ltla'
AND NOT (metric ILIKE '%direction%' OR metric ILIKE '%demog%')
GROUP BY metric, area_code
ORDER BY median DESC;
QUERY PLAN
Planning:
Buffers: shared hit=29
Planning Time: 2.163 ms
Execution Time: 13024.737 ms
Now with JIT:
BEGIN;
SET LOCAL jit = ON;
SET LOCAL jit_above_cost = 10;
EXPLAIN (BUFFERS, ANALYSE)
SELECT metric,
area_code,
MAX(area_name) AS area_name,
percentile_cont(.25) WITHIN GROUP (ORDER BY (ts.payload ->> 'value')::FLOAT) AS first_quantile,
percentile_cont(.5) WITHIN GROUP (ORDER BY (ts.payload ->> 'value')::FLOAT) AS median,
percentile_cont(.75) WITHIN GROUP (ORDER BY (ts.payload ->> 'value')::FLOAT) AS third_quantile
FROM covid19.time_series AS ts
JOIN covid19.area_reference AS ar ON ar.id = ts.area_id
JOIN covid19.metric_reference AS mr ON mr.id = ts.metric_id
WHERE ts.date BETWEEN '2021-08-05' AND '2021-11-05'
AND ts.partition_id = '2021_11_10|ltla'
AND NOT (metric ILIKE '%direction%' OR metric ILIKE '%demog%')
GROUP BY metric, area_code
ORDER BY median DESC;
COMMIT;
QUERY PLAN
Planning:
Buffers: shared hit=29
Planning Time: 2.090 ms
JIT:
Functions: 8
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 1.471 ms, Inlining 0.000 ms, Optimization 0.485 ms, Emission 6.421 ms, Total 8.377 ms
Execution Time: 12750.490 ms
We can see that despite the overhead, the execution time is — albeit marginally — shorter with JIT.
No quick answer
We all enjoy one-size-fits-all solutions, but unfortunately there is no such thing in high performance computing.
It is often easy to look from the outside and prescribe a generic solution to any given problem; however, whether or not such solutions prove helpful may only be determined through meticulous experimentation on the specific product.
Following this incident, the Postgres team at Microsoft Azure have decided to consider changing the default setting in Postgres 14 and turn JIT off for new instances.
Acknowledgement
Finally, I would like to thank the Azure Postgres (Hyperscale) team for their commitment and dedicated support: day, night, and midnight!
Comments
Thanks for sharing this insightful - but unfortunate 😅 - incident!
It is really useful to know how tech leads investigate and tackle such problems. The write-up was well illustrated, too — great work!
Glad you found it useful.
Hi! Just wondered if you changed anything else apart from disabling JIT. Any changes in the monitoring tools' settings to catch increasing response times? I'm pretty sure that in a system as complex as yours, there are a lot of metrics being monitored.
Also it would be great to read another article about an architecture of your system. It's a great example of a real high-load application.
BTW I made a Russian translation of your article here: habr.com/ru/company/haulmont/blog/...
Thank you for letting me know. Thank you for taking the time to do it — it's wonderful. Hopefully Russian speakers will find it useful. I'll share the link on Twitter and LinkedIn.
Re monitoring, we have end-to-end tracking of requests implemented using Azure Application Insights. Here is a fairly simple example of the events for the postcode search on the landing page (CDN, load balancers, and PGBouncer aren't included here):
We also have alerts with a very high sensitivity in place.
What is also important to notice is that the issue doesn't really manifest itself until we have a lot of requests coming in — especially at peak time.
This is in part because of the aggressive caching policies (4 layers) that we have in place. Even if the response time is longer, it's only long once every few hours, so it gets registered as an outlier. I mean, I had to search through 20 queries before I found an example (the screenshot above) that included a database transaction. Having said that, I did get alerts re response time, but I'm sure you appreciate that these things are often judgement calls, and they didn't initially concern me as they weren't particularly out of the ordinary... until that fateful Sunday.
Another reason is that we have quite a lot of servers. If one zone in a region is experiencing issues, we do expect to see an increase in latency. We might also see slightly increased latency if someone is hammering our API for, say, 30 minutes with lots of different (unique) requests. These are things that do happen every once in a while, and they tend to resolve themselves after a bit. All in all, there were things that could have pointed to a potential issue, but based on my experience with the service, they weren't sufficiently out of the ordinary to merit an extensive investigation. What would normally concern me is a sustained / prolonged increase in latency, and we didn't have that until Sunday, 31 October.
Re writing an article about our architecture, I am hoping to do something along those lines in the future. Right now, I am collaborating with folks at Microsoft to write an article on our database. I have also given a few talks on our daily processes, a couple of which are available on YouTube. If that's something of interest, I'll be happy to share the links.
Once again, thank you for taking the time to translate the article, and for sharing your thoughts here.
Yes, please! Thanks in advance :-)
These are the two that are available online. They're a bit old / generic, but here goes:
youtube.com/watch?v=ukMWR_4_XZo
youtube.com/watch?v=nKYCJP0T03M
I've had a similar experience with JIT - for short queries, well-optimized by index, it can impose a cost that outweighs any benefit it provides. It also greatly increases the size of the postgres package (though for most I doubt this is a big concern) - it'd be nice if distro packagers considered splitting its optimization symbols into optional dependency packages, as removing their files both saves space and is a convenient way to disable JIT.
A related issue occurred when they changed CTEs ("WITH x ...") to no longer materialize X on the grounds that it would give the optimizer more options. In our case we were using it as an optimization gate (since the community has an aversion to them otherwise...) and changing that caused several queries that can be very expensive if the query planner makes a bad call to run long, causing CPU issues 'til we added MATERIALIZED.
Of course PostgreSQL 14 is quite new, and in my experience it can take a few revisions to shake out issues with both new and existing functionality (as shown in the 14.1 patch notes) - so as a general rule, it's best to hang back. But I can appreciate wanting to switch to it quickly, especially if it fixes issues you've experienced.
Not much to add here - I agree with almost everything you said. CTEs, especially recursive ones, tend to be very expensive processes in Postgres. Needless to say that expensive processes, regardless of their execution time, activate JIT... and that's part of the problem.
Once you have lots of data (we have ~8 billion records in the DB) and the need to accommodate high-throughput clients, a lot of things change... generic tutorials you find on the internet tend to become a lot less helpful. These situations need real-world experience, and that's something that is becoming increasingly difficult to find in the market.
Materialised views, of course, help with CTEs, as they effectively create the result as a separate, all-inclusive, indexed table and eliminate the need for lots of in-memory processing.
Re the upgrade: I tend to be of the opinion that we shouldn't upgrade to the latest release of a piece of software immediately - not in production, anyway. The issue for us is that we use partitioned tables. Prior to PG14, querying a partitioned table through the main (parent) table via the partition key was virtually impossible because of the latency, so we had to query the partitions directly - for instance:
PG14 introduced a fix that allows us to query the parent partition without the added latency, so the same query now looks like this:
Makes life a lot easier and allows us to generate queries and use ORMs more easily.
There were other reasons too, but this was the most important one. Also, it's Postgres for crying out loud, these things shouldn't happen :/
Thanks for taking the time to share your experience and insight.
Yeah, I can appreciate the frustration there. It's probably frustrating for the developers, too!
My issue was not materialized views, but WITH x AS MATERIALIZED, which is required to force PG 12+ to use non-recursive CTEs without side-effects as an optimization fence.
Half a decade ago we had an issue with mailing list code that generated a list of recommended users who had recently posted work that was liked by users that the person being mailed had watched, who they did not themselves watch (if they did, they would've seen the work already).
Many users did not actually watch anyone and so the check was quick, but it fell apart when the planner decided to compute the list of liked work (which could be expensive) before it considered whether there were any watched users who might have made those likes.
I hoisted watched users into a CTE, forcing it to quickly find out whether it should not execute the rest (apologies for 'legacy' code):
This fell apart again when we switched to PostgreSQL 12 - though we didn't notice until the weekly mailer, when things got noticeably slower for a long time. Thankfully it was limited to one CPU core, but that's on a quad-core which normally hit an average of three at peak - and it wasn't the only place we'd used that pattern, just one of the most expensive.
Having said that, version upgrades usually bring benefits. But they're definitely a two-edged sword - I'm still not sure about upgrading to 14 because it means we'll lose SSL compression (used to avoid overages due to replication transfers), or we'll have to hack OpenSSL further to ignore it being explicitly disabled. A replacement feature is still stuck in the commitfest process. Perhaps we'll stick with 13 for now - we have another four years to think about migration.
Gotcha. Don't get me wrong though, PG14 is great, and it's faster... just make sure you turn JIT off if you compile with LLVM. :))
Thanks again for sharing your thoughts. Great to hear about your experience. Some of these edge cases are quite rare, and it's great to have them documented in a consolidated format - even as a comment.
I guess the reason why they are edge cases is - at least to some extent - because nowadays most devs rely almost exclusively on ORMs to deal with databases, which means they never really get to know or learn about the true power of an RDBMS.
Once again, thanks Laurence.
Excellent article. Thank you for putting all this together!
It is a bit odd that the compilation is performed (again and again) for all queries. This seems to partially contradict the point of JIT itself.
Glad you liked it.
Definitely agree that repetitive compilation is not only very counterintuitive here, but also destructive. It certainly defeats the point of JIT, or of any sort of compilation for that matter. It's somewhat like treating the JIT compiler as an interpreter, if you know what I mean...
Great write up, thanks for sharing. You mention why the query was not cached:
This reminds me of a similar issue I had with the Entity Framework query (de?)optimiser, we found that certain recursive queries just caused the EF caching plan to give up, wonder if the Postgres team took the same kind of shortcut?
oh dear...
I hear you. I was trying to be courteous. I mean, I don't particularly like it when people come around and tell me "you should have done ...", so I was trying to not sound condescending. Otherwise, the fact that this thing has been turned on by default seems like a massive oversight to me.
Thank you for sharing your thoughts.
I'd say your biggest issue happened at 'It went very smoothly and we monitored everything for 48 hours and did some load testing before setting off the deployment to the production environment at midnight of Saturday, 16 October 2021.'
Two days of testing from initial Dev deployment to Production for a critical national service sounds totally insufficient. In 48 hours there's no way to test if the database might have longer term stability issues.
Aside from the load testing, which failed to spot a 2,229-fold performance regression and the associated CPU spikes, how much testing was done to check that the HA, backup, and recovery solutions all still worked?
Being that close to the bleeding edge on a nationally critical system is asking for trouble.
All worked. I did load testing from 10 different locations, at 1k to 3k requests per minute. There was nothing to suggest that there would be any regression - in fact, we noticed an improvement.
This may have been because we didn't manage to simulate the massive lightning spike, though we tried. It could be because we didn't have quite the same variety of requests in our tests - again, we tried. It could also have been because we have extensive caching in place, which could at least in part be the reason why we didn't see total failure.
Ultimately, the thing is that the main issue (APIv1) didn't really manifest itself until 31 October. Why did we not notice anything earlier? There was nothing there to notice. We just didn't have that level of failure earlier in the week.
Again, you're looking at it from the outside - which is always easier. We don't always have the resources to do all the nice things that are written in textbooks. I created this service from scratch in two weeks, and have since been working on average 16 hours a day to maintain it.
As for being that close to the bleeding edge; there was a reason for it, which I outlined in my response to another comment. This service relies on modern technologies primarily because it is very demanding, and nothing remotely like it has been done in the public sector before. It is the most hit service in the history of UK Government's online services.
Anyhow, I shared this experience in the hope that it will help others. If your take from it is that we didn't do enough, then so be it.
Thanks for sharing your thoughts.
Sorry, I wasn't trying to belittle your efforts in any way; it sounds like you've done a miraculous job getting such a large site up and running from scratch at such short notice.
I was trying to offer some advice on how you could improve your procedures to minimise the risks of running into similar issues in the future. I've been working with very large clients including the UK Government offices for many years and I would never consider going from Dev to Production so quickly, even on mature software. This is especially true for software which had only just been released.
I know there is always a lot of pressure from above to deliver everything yesterday, but the same people are usually also the first ones to complain if anything goes wrong. Unless a service is down, I think negotiating more time for testing and QA prior to going to Production could save you a lot of aggravation in the long run. If you are getting pushback to deliver faster, you can use this as an example of what can go wrong. In my experience the UK Government is always concerned about reputational impact of major services being down and I have always been able to negotiate adequate time based on this risk.
I suspect as you mentioned your failure to spot this issue during testing was most likely due to the lack of variety of requests in your tests. Caching obviously plays a significant role in the performance profile of a database so having a large representative test scenario is critical.
The article from a technical perspective regarding the analysis of the impact of JIT on the overall response time was very interesting and informative. Thank you for your time to write it all up.
I don't disagree with you at all. All I'm saying is that there were mitigating circumstances. We had issues that we couldn't resolve in the previous version, so it was a catch-22, and a risk that, on balance, I considered to be reasonable. For what it's worth, I stand by my decision to do the upgrade. Despite the issue, we have seen significant improvements and we have managed to reduce the running cost quite substantially.
We have millions of test cases - literally. But no matter how much we try, there will always be cases that we can't anticipate. That's why major services with far more resources than we can imagine break every once in a while.
Unfortunately, things happen, and all we can do is learn from them.
Thank you again.
In my last project, the APM solution (Dynatrace) would have noticed and alerted such a performance issue almost immediately after deploy. There is no shortage of good APM tools, and you really shouldn't be hearing about performance and stability issues first from users, nor taking most of the day to find where the issues are coming from. In this case, an APM probably wouldn't have told you specifically that PostgreSQL JIT was the problem, but it certainly would have highlighted the sudden huge change in query times. Combined with load testing, you could have caught this before it reached prod.
We have all that in place, and it didn't help. It's a bit more complicated when you get 50-60 million requests a day, of which 90% succeed. The issue was barely noticeable at release time. There were alerts regarding some outliers, but nothing out of the ordinary. You don't get the whole back-pressure issue until you sustain heavy traffic.
Also remember that most services were fine for a few days... until we had an increase in demand... and the increase was going from 45 million a day to 50 million a day.
We also did load testing, and it didn't catch it. As I said, it's a bit more complicated than it might appear.
really good write-up mate, loved the snippets. thank you
Seems like a bad architectural choice not going with Yugabyte.
I disagree. I would never deploy a technology that I haven't used in production before on a major service, especially one that I had less than 2 weeks to plan, develop, and deploy from scratch as part of an emergency response. There is also the matter of the resources at our disposal in the public sector.
Postgres is a proven, open-source technology, as is Citus. They also come as a managed service with extensive support from Microsoft. I have the core dev on speed dial. Can't say that about Yugabyte.
As I said in my concluding remarks, it is often easier to come up with solutions from the outside, but when you're inside, things aren't always that easy.
What are you talking about?
Wow! We faced the same thing with Postgres on commudle.com.
Hope this helps you deal with it. Let me know if I can be of assistance.
Very good analysis! I like the way you have put together your debug process.