Technical and development lead of the official UK COVID-19 Dashboard.
Software geek. Researcher in neuroscience & biomedical data science. Former teaching fellow at University College London.
Location
London, UK
Education
Birkbeck College, University of London and UCL
Work
Deputy head of software engineering for web and data services at UK Health Security Agency
Thanks for sharing this insightful (but unfortunate 😅) incident!
It's really useful to know how tech leads investigate and tackle such problems, and the illustrations were well done too. Great work!
Glad you found it useful.
Hi! Just wondering: did you change anything else apart from disabling JIT? Any changes to the monitoring settings to catch increasing response times? I'm pretty sure that in such a complex system as yours, a lot of metrics are being monitored.
It would also be great to read another article about the architecture of your system. It's a great example of a real high-load application.
BTW I made a Russian translation of your article here: habr.com/ru/company/haulmont/blog/...
Thank you for letting me know, and for taking the time to do it; it's wonderful. Hopefully Russian speakers will find it useful. I'll share the link on Twitter and LinkedIn.
Re monitoring, we have end-to-end tracking of requests implemented using Azure Application Insights. Here is a fairly simple example of the events for the postcode search on the landing page (CDN, load balancers, and PGBouncer aren't included here):
We also have alerts with a very high sensitivity in place.
It's also important to note that the issue doesn't really manifest itself until we have a lot of requests coming in — especially at peak time.
This is in part because of the aggressive caching policies (4 layers) that we have in place. Even when the response time is longer, it's only long once every few hours, so it gets registered as an outlier. I had to search 20 queries before I found an example (the screenshot above) that included a database transaction. Having said that, I did get alerts re response time, but I'm sure you appreciate that these things are often judgement calls, and they didn't initially concern me as they weren't particularly out of the ordinary... until that fateful Sunday.
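The effect of layered caching on visibility can be sketched with a toy model (the layer count matches the 4 layers mentioned above, but the class, names, and timings here are made up for illustration, not the dashboard's actual stack): only a request that misses every layer reaches the database, so even a badly slow query surfaces rarely and reads as an outlier in telemetry.

```python
# Toy multi-layer cache: each layer is a dict standing in for a real
# store (e.g. CDN edge, server-side cache). Only a miss through every
# layer reaches the (expensive) database query.
class LayeredCache:
    def __init__(self, n_layers, db_query):
        self.layers = [{} for _ in range(n_layers)]
        self.db_query = db_query
        self.db_hits = 0  # how often we actually touched the database

    def get(self, key):
        for layer in self.layers:
            if key in layer:
                return layer[key]
        # Full miss: run the expensive query, then fill every layer.
        self.db_hits += 1
        value = self.db_query(key)
        for layer in self.layers:
            layer[key] = value
        return value

def slow_db_query(key):
    # Stand-in for the real (potentially slow) database transaction.
    return f"result-for-{key}"

cache = LayeredCache(n_layers=4, db_query=slow_db_query)

# 1,000 requests for the same popular postcode: the database is hit
# exactly once, so its latency shows up as a rare outlier in telemetry.
for _ in range(1000):
    cache.get("SW1A")

print(cache.db_hits)  # → 1
```

This is why sampling 20 traces can turn up only one with a database transaction in it: the cache absorbs nearly everything.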
Another reason is that we have quite a lot of servers. If one zone in a region is experiencing issues, we do expect to see an increase in latency. We might also see slightly increased latency if someone is hammering our API for, say, 30 minutes with lots of different (unique) requests. These are things that do happen every once in a while, and they tend to resolve themselves after a bit. All in all, there were things that could have pointed to a potential issue, but based on my experience with the service, they weren't sufficiently out of the ordinary to merit an extensive investigation. What would normally concern me is a sustained / prolonged increase in latency, and we didn't have that until Sunday, 30 October.
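The sustained-vs-transient distinction can be captured by an alert rule that only fires when latency stays elevated across a whole rolling window, rather than on a single spike. This is a simplified sketch; the threshold and window size are invented for illustration, not the dashboard's real alert settings:

```python
from collections import deque

def sustained_latency_alert(latencies_ms, threshold_ms=500, window=5):
    """Return a list of booleans, one per sample: True where the last
    `window` samples are ALL above the threshold - i.e. a sustained
    increase in latency, not a one-off blip."""
    recent = deque(maxlen=window)  # rolling window of recent samples
    alerts = []
    for sample in latencies_ms:
        recent.append(sample)
        alerts.append(
            len(recent) == window and all(s > threshold_ms for s in recent)
        )
    return alerts

# A single spike (an outlier absorbed by caching) never fires the alert...
spiky = [120, 110, 2500, 130, 125, 118, 140, 122]
print(any(sustained_latency_alert(spiky)))      # → False

# ...but a sustained / prolonged increase does.
sustained = [120, 110, 130, 900, 950, 1100, 980, 1020]
print(any(sustained_latency_alert(sustained)))  # → True
```

The design choice is the `all(...)` over the window: a transient zone failover or a 30-minute burst of unique requests produces isolated spikes that fall back below the threshold, while a systemic issue keeps the whole window elevated.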
Re writing an article about our architecture, I am hoping to do something along those lines in the future. Right now, I am collaborating with folks at Microsoft to write an article on our database. I have also given a few talks on our daily processes, a couple of which are available on YouTube. If that's something of interest, I'll be happy to share the links.
Once again, thank you for taking the time to translate the article, and for sharing your thoughts here.
Yes, please! Thanks in advance :-)
These are the two that are available online. They're a bit old / generic, but here goes:
youtube.com/watch?v=ukMWR_4_XZo
youtube.com/watch?v=nKYCJP0T03M