Discussion on: Cascade of doom: JIT, and how a Postgres update led to 70% failure on a critical national service

Ashton

In my last project, the APM solution (Dynatrace) would have noticed such a performance issue and alerted on it almost immediately after deployment. There is no shortage of good APM tools, and you really shouldn't be hearing about performance and stability issues from users first, nor spending most of the day tracking down where they are coming from. In this case, an APM probably wouldn't have told you specifically that PostgreSQL JIT was the problem, but it certainly would have highlighted the sudden, huge change in query times. Combined with load testing, you could have caught this before it reached prod.
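For what it's worth, the JIT overhead does show up at the query level if you go looking for it: on PostgreSQL 12+ with `jit` enabled, `EXPLAIN (ANALYZE)` appends a separate JIT section with compile timings, which is one way to tie a sudden jump in query times back to JIT. A minimal sketch (the table and query below are made up, purely to show where that output appears):

```sql
-- Hypothetical query against a made-up table, just to illustrate the output.
-- With jit = on and a plan cost above jit_above_cost, EXPLAIN (ANALYZE) appends
-- a "JIT:" block with generation/inlining/optimization/emission timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT area_code, max(value)
FROM time_series            -- hypothetical table name
GROUP BY area_code;

-- To rule JIT in or out, it can be disabled for the current session:
SET jit = off;
-- ...then re-run the same EXPLAIN (ANALYZE) and compare the timings.
```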

Pouria Hadjibagheri • Edited

We have all that in place, and it didn't help. It's a bit more complicated when you get 50-60 million requests a day, of which 90% succeed. At release time, the issue was only noticeable as alerts about a few outliers, nothing out of the ordinary. You don't get the full back-pressure problem until you sustain heavy traffic.

Also remember that most services were fine for a few days... until we had an increase in demand, and that increase was from 45 million requests a day to 50 million.

We also did load testing, and it didn't catch it. As I said, it's a bit more complicated than it might appear.