DEV Community

Cover image for Cascade of doom: JIT, and how a Postgres update led to 70% failure on a critical national service

Cascade of doom: JIT, and how a Postgres update led to 70% failure on a critical national service

Pouria Hadjibagheri on November 13, 2021

The UK coronavirus dashboard is the primary data reporting service for the COVID-19 pandemic in the United Kingdom. It sustains on average 45 to 50...
Collapse
 
deathwaiting profile image
ahmed galal

Thanks for sharing this insightfull - but unfortunate 😅 - incident!
It is really usefull to know how tech leads investigate and tackle such problems, the illustration was well written as well, great work !

Collapse
 
xenatisch profile image
Pouria Hadjibagheri

Glad you found it useful.

Collapse
 
belyaevandrey profile image
Andrey Belyaev

Hi! Just wondered if you changed anything else apart from disabling JIT. Any changes in the monitoring tools settings to catch increasing responce time? I'm pretty sure that in such complex system as yours there are a lot of metrics being monitored.

Also it would be great to read another article about an architecture of your system. It's a great example of a real high-load application.

Thread Thread
 
belyaevandrey profile image
Andrey Belyaev

BTW I made a Russian translation of your article here: habr.com/ru/company/haulmont/blog/...

Thread Thread
 
xenatisch profile image
Pouria Hadjibagheri • Edited

Thank you for letting me know. Thank you for taking the time to do it — it's wonderful. Hopefully Russian speakers will find it useful. I'll share the link on Twitter and LinkedIn.

Re monitoring, we have end-to-end tracking of requests implemented using Azure Application Insight. Here is a fairly simple example of the events for the postcode search on the landing page (CDN, load balancers, and PGBouncer aren't included here):

App insight events showing the request, storage, Postgres, and Redis events

We also have alerts with a very high sensitivity in place.

What is also important to notice is that the issue doesn't really manifest itself until we have a lot of requests coming in — especially at peak time.

This is in part because of the aggressive caching policies (4 layers) that we have in place. Even if the response time is longer, it's only long once every few hours, so it gets registered as an outlier. I mean, I had to search 20 queries before I found an example (the screenshot above) that included a database transaction). Having said that, I did get alerts re response time, but I'm sure you appreciate that these things are often judgement calls, and they didn't initially concern me as they weren't particularly out of the ordinary... until that fateful Sunday.

Another reason is that we have quite a lot of servers. If one zone in one zone in a region is experiencing issues, we do expect to see an increase in latency. We might also see a slight increased latency if someone is hammering our API for, say 30 minutes with lots of different (unique) requests. These are things that do happen every once in a while, and tend to resolve themselves after a bit. All in all, there were things that could have pointed to a potential issue, but based on my experience with the service, they weren't sufficiently out of the ordinary to merit an extensive investigation. What would normally concern me is sustained / prolonged increase in latency, and we didn't have that until Sunday, 30 October.

Re writing an article about our architecture, I am hoping to do something along those lines in the future. Right now, I am collaborating with folks at Microsoft to write an article on our database. I have also given a few talks on our daily processes, a couple of which are available on YouTube. If that's something of interest, I'll be happy to share the links.

Once again, thank you for taking the time to translate the article, and for sharing your thoughts here.

Thread Thread
 
belyaevandrey profile image
Andrey Belyaev

If that's something of interest, I'll be happy to share the links.

Yes, please! Thanks in advance :-)

Thread Thread
 
xenatisch profile image
Pouria Hadjibagheri

These are the two that are available online. They're a bit old / generic, but here goes:

youtube.com/watch?v=ukMWR_4_XZo

youtube.com/watch?v=nKYCJP0T03M

Collapse
 
greenreaper profile image
Laurence 'GreenReaper' Parry

I've had a similar experience with JIT - for short queries, well-optimized by index, it can impose a cost that outweighs any benefit it provides. It also greatly increases the size of the postgres package (though for most I doubt this is a big concern) - it'd be nice if distro packagers considered splitting its optimization symbols into optional dependency packages, as removing their files both saves space and is a convenient way to disable JIT.

A related issue occurred when they changed CTEs ("WITH x ...") to no longer materialize X on the grounds that it would give the optimizer more options. In our case we were using it as an optimization gate (since the community has an aversion to them otherwise...) and changing that caused several queries that can be very expensive if the query planner makes a bad call to run long, causing CPU issues 'til we added MATERIALIZED.

Of course PostgreSQL 14 is quite new, and in my experience it can take a few revisions to shake out issues with both new and existing functionality (as shown in the 14.1 patch notes) - so as a general rule, it's best to hang back. But I can appreciate wanting to switch to it quickly, especially if it fixes issues you've experienced.

Collapse
 
xenatisch profile image
Pouria Hadjibagheri

Not much to add here - I agree with almost everything you said. CTEs, especially recursive ones, tend to be very expensive processes in Postgres. Needless to say that expensive processes, regardless of their execution time, activate JIT... and that's part of the problem.

Once you have lots of data (we have ~8 billion records in the DB) and the need to accommodate high-throughput clients, a lot of things change... generic tutorials you find on the internet tend to become a lot less helpful. They need real world experience, and that's something which is becoming increasingly more difficult to find in the market.

Materialised views of course help with CTEs as they virtually create the stuff as a separate, all inclusive, indexed table, and eliminate the need for lots of in-memory processes.

Re the upgrade; I tend to be of the opinion the we shouldn't upgrade to the latest release of a software immediately - not in production anyway. The issue for us is that we use partitioned tables. Prior to PG14, querying a partitioned tables using the main (parent) table via the partition key was virtually impossible because of the latency, so we had to directly query the partitions - for instance:

SELECT -- whatever
FROM covid19.time_series_p2021_11_14_ltla AS ts
  JOIN -- whatever
WHERE 
  -- whatever
Enter fullscreen mode Exit fullscreen mode

PG14 introduced a fix that allows us to query the parent partition without the added latency, so the same query now looks like this:

SELECT -- whatever
FROM covid19.time_series AS ts
  JOIN -- whatever
WHERE 
  ts.partition_id = '2021_11_14|ltla'
  -- whatever
Enter fullscreen mode Exit fullscreen mode

Makes life a lot easier and allows us to generate queries and use ORMs more easily.

There were other reasons too, but this was the most important one. Also, it's Postgres for crying out loud, these things shouldn't happen :/

Thanks for taking the time to share your experience and insight.

Collapse
 
greenreaper profile image
Laurence 'GreenReaper' Parry

Yeah, I can appreciate the frustration there. It's probably frustrating for the developers, too!

My issue was not materialized views, but WITH x AS MATERIALISED, which is required to force PG 12+ to use non-recursive CTEs without side-effects as an optimization fence.

Half a decade ago we had an issue with mailing list code that generated a list of recommended users who had recently posted work that was liked by users that the person being mailed had watched, who they did not themselves watch (if they did, they would've seen the work already).

Many users did not actually watch anyone and so the check was quick, but it fell apart when the planner decided to compute the list of liked work (which could be expensive) before it considered whether there were any watched users who might have made those likes.

I hoisted watched users into a CTE, forcing it to quickly find out whether it should not execute the rest (apologies for 'legacy' code):

        $tmp_user_id = $this->sanitize($user_id);
        $tmp_count = $this->sanitize($count);
        $tmp_date = $this->sanitize($date);

-       $arr = $this->runQuery("select u.username from watches w, submission_favorites sf, submissions s, user_statistics t, users u where w.user_id = '$tmp_user_id' and w.watched_user_id = sf.user_id and sf.create_datetime > '$tmp_date' and sf.submission_id = s.submission_id and s.deleted != 't' and s.public = 't' {$this->acl_filter} and s.user_id not in (select watched_user_id from watches where user_id = '".$tmp_user_id."') and s.user_id != '$tmp_user_id' and s.user_id = u.user_id and u.user_id = t.user_id group by t.submission_views_count, u.username order by t.submission_views_count desc, u.username asc limit '$tmp_count';");
+       #The with clause here is necessary to avoid an unduly expensive query plan
+       $arr = $this->runQuery("with watched_users as (select watched_user_id from watches where user_id = '$tmp_user_id') select u.username from watches w, submission_favorites sf, submissions s, user_statistics t, users u where w.user_id = '$tmp_user_id' and w.watched_user_id = sf.user_id and sf.create_datetime > '$tmp_date' and sf.submission_id = s.submission_id and s.deleted != 't' and s.public = 't' {$this->acl_filter} and s.user_id not in (select * from watched_users) and s.user_id != '$tmp_user_id' and s.user_id = u.user_id and u.user_id = t.user_id group by t.submission_views_count, u.username order by t.submission_views_count desc, u.username asc limit '$tmp_count';");

        return $arr;
    }
Enter fullscreen mode Exit fullscreen mode

This fell apart again when we switched to PostgreSQL 12 - though we didn't notice until the weekly mailer, when things got noticeably slower for a long time. Thankfully it was limited to one CPU core, but that's on a quad-core which normally hit an average of three at peak - and it wasn't the only place we'd used that pattern, just one of the most expensive.

Having said that, usually version upgrades being benefits. But they're definitely a two-edged sword - I'm still not sure about upgrading to 14 because it means we'll lose SSL compression (used to avoid overages due to replication transfers), or have to hack OpenSSL more to ignore it being explicitly disabled. A replacement feature is still stuck in the conmitfest process. Perhaps we'll stick with 13 for now - we have another four years to think about migration.

Thread Thread
 
xenatisch profile image
Pouria Hadjibagheri

Gotcha. Don't get me wrong though, PG14 is great, and it's faster... just make sure you turn JIT off if you compile with LLVM. :))

Thanks again for sharing your thoughts. Great to hear about your experience. Some of these edge cases are quite rare, and it's great to have them documented in a consolidated format - even as a comment.

I guess the reason why they are edge cases is - at least to some extent - because nowadays most devs rely almost exclusively on ORMs to deal with databases, which means they never really get to know or learn about the true power of an RDBMS.

Once again, thanks Laurence.

Collapse
 
andreistefanie profile image
Andrei Stefanie

Excellent article. Thank you for putting all this together!

It is a bit odd that the compilation is performed (again and again) for all queries. This seems to partially contradict the point of JIT itself

Collapse
 
xenatisch profile image
Pouria Hadjibagheri

Glad you liked it.

Definitely agree that repetitive compilation is not only very counterintuitive here, but also destructive. It certainly defies the point of JIT, or any sort of compilation for that matter. It's somehow like treating JIT compiler as an interpreter, if you know what I mean...

Collapse
 
asmithnhsx profile image
Alex Smith • Edited

Great write up, thanks for sharing. You mention why the query was not cached:

It is, however, important to appreciate that the core Postgres team encompasses some very smart and competent software engineers who know what they are doing and would have definitely thought of caching compiled queries

This reminds me of a similar issue I had with the Entity Framework query (de?)optimiser, we found that certain recursive queries just caused the EF caching plan to give up, wonder if the Postgres team took the same kind of shortcut?

Collapse
 
xenatisch profile image
Pouria Hadjibagheri • Edited

oh dear...

I hear you. I was trying to be courteous. I mean, I don't particularly like it when people come around and tell me "you should have done ...", so I was trying to not sound condescending. Otherwise, the fact that this thing has been turned on by default seems like a massive oversight to me.

Thank you for sharing your thoughts.

Collapse
 
basisinstinct profile image
Robert Thorneycroft

I'd say your biggest issue happened at 'It went very smoothly and we monitored everything for 48 hours and did some load testing before setting off the deployment to the production environment at midnight of Saturday, 16 October 2021.'

Two days of testing from initial Dev deployment to Production for a critical national service sounds totally insufficient. In 48 hours there's no way to test if the database might have longer term stability issues.

Aside from the load testing which failed to spot a 2229 fold performance regression and the associated CPU spikes, how much testing was done to check HA, backup and recovery solutions all still worked?

Being that close to the bleeding edge on a nationally critical system is asking for trouble.

Collapse
 
xenatisch profile image
Pouria Hadjibagheri

All worked. I did load testing from 10 different locations, on 1k to 3k requests per minute. There was nothing to suggest that there would be any regression - in fact, we noticed an improvement.

This may possibly have been because we didn't simulate the massive lightening spike even though we tried. It could be because we didn't have quite the same variety of requests in our test - again, we tried. It could have been because we have extensive caching in there, which could at least in part be the reason why we didn't see total failure.

Ultimately, the thing is that the main issue (APIv1) didn't really manifest itself until 31 October. Why did we not notice anything earlier? There was nothing there to notice. We just didn't have that level of failure earlier in the week.

Again, you're looking at it from the outside - which is always easier. We don't always have the resources to do all the nice things that are written in textbooks. I created this service from the scratch in two weeks, and have since been working on average 16 hours a day to maintain it.

As for being that close to the bleeding edge; there was a reason for it, which I outlined in my response to another comment. This service relies on modern technologies primarily because it is very demanding, and nothing remotely like it has been done in the public sector before. It is the most hit service in the history of UK Government's online services.

Anyhow, I shared this experience in the hope that it will help others. If your take from it is that we didn't do enough, then so be it.

Thanks for sharing your thoughts.

Collapse
 
basisinstinct profile image
Robert Thorneycroft

Sorry I wasn't trying to belittle your efforts in anyway, it sounds like you've done a miraculous job getting such a large site up and running from scratch at such short notice.

I was trying to offer some advice on how you could improve your procedures to minimise the risks of running into similar issues in the future. I've been working with very large clients including the UK Government offices for many years and I would never consider going from Dev to Production so quickly, even on mature software. This is especially true for software which had only just been released.

I know there is always a lot of pressure from above to deliver everything yesterday, but the same people are usually also the first ones to complain if anything goes wrong. Unless a service is down, I think negotiating more time for testing and QA prior to going to Production could save you a lot of aggravation in the long run. If you are getting pushback to deliver faster, you can use this as an example of what can go wrong. In my experience the UK Government is always concerned about reputational impact of major services being down and I have always been able to negotiate adequate time based on this risk.

I suspect as you mentioned your failure to spot this issue during testing was most likely due to the lack of variety of requests in your tests. Caching obviously plays a significant role in the performance profile of a database so having a large representative test scenario is critical.

The article from a technical perspective regarding the analysis of the impact of JIT on the overall response time was very interesting and informative. Thank you for your time to write it all up.

Thread Thread
 
xenatisch profile image
Pouria Hadjibagheri • Edited

I don't disagree with you at all. All I'm saying is that there were mitigating circumstances. We had issues that we couldn't resolve in the previous version, so it was a catch-22, and a risk that on balance, I considered to be reasonable. For what it is worth, I stand by my decision to do the upgrade. Despite the issue, we have have significant improvements and we have managed to reduce the running cost quite substantially.

We have million of test cases - literally. But not matter how much we try, there would always be cases that we couldn't anticipate. That's why major services with a whole lot more resource than that which we can imagine break every once in a while.

Unfortunately, things happen, and all we can do is learn from them.

Thank you again.

Collapse
 
ashton profile image
Ashton

In my last project, the APM solution (Dynatrace) would have noticed and alerted such a performance issue almost immediately after deploy. There is no shortage of good APM tools, and you really shouldn't be hearing about performance and stability issues first from users, nor taking most of the day to find where the issues are coming from. In this case, an APM probably wouldn't have told you specifically that PostgreSQL JIT was the problem, but it certainly would have highlighted the sudden huge change in query times. Combined with load testing, you could have caught this before it reached prod.

Collapse
 
xenatisch profile image
Pouria Hadjibagheri • Edited

We have all that in place, and it didn't help. It's a bit more complicated when you get 50-60 million requests a day, of which 90% succeed. The issue was noticeable at release time. There were alerts regarding some outliers, but nothing out of the ordinary. You don't get the whole back-pressure issue until you sustain heavy traffic.

Also remember that most services were fine for a few days... until we had an increase in demand... and the increase was going from 45 million a day to 50 million a day.

We also did load testing, and it didn't catch it. As I said, it's a bit more complicated than it might appear.

Collapse
 
avarugh1 profile image
avarugh1

really good write-up mate, loved the snippets. thank you

Collapse
 
marcus-sa profile image
Marcus S. Abildskov

Seems like a bad architectural choice not going with Yugabyte.

Collapse
 
xenatisch profile image
Pouria Hadjibagheri

I disagree. I would never deploy a service that I haven't used in production before on a major service, especially one that I had less than 2 weeks to plan, develop, and deploy from the scratch as a part of emergency response. There is also a matter resources at our disposal in the public sector.

Postgres is a proven, open-source technology, as is Citus. They also come as a managed service with extensive support from Microsoft. I have the core dev on speed dial. Can't say that about Yugabyte.

As I said in my concluding remarks, it is often easier to come up with solutions from the outside, but when you're inside, things aren't always that easy.

Collapse
 
avarugh1 profile image
avarugh1

what are you talking about

Collapse
 
arpansac profile image
Arpan Garg

Wow! We faced the same thing with Postgres on commudle.com.

Collapse
 
xenatisch profile image
Pouria Hadjibagheri

Hope this helps you deal with it. Let me know if I can be of assistance.

Collapse
 
hnhegde profile image
Harsha N Hegde

Very good analysis! I like the way you have put together your debug process.

Collapse
 
ubaidurrehman1 profile image
Ubaid ur Rehman

Amazing,