In my last project, the APM solution (Dynatrace) would have detected and alerted on a performance issue like this almost immediately after the deploy. There is no shortage of good APM tools, and you really shouldn't be hearing about performance and stability issues first from users, nor spending most of a day working out where they are coming from. In this case an APM probably wouldn't have told you specifically that PostgreSQL JIT was the problem, but it certainly would have highlighted the sudden, large change in query times. Combined with load testing, you could have caught this before it reached prod.
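For context on the JIT point: PostgreSQL (11+) exposes JIT compilation through the `jit` session setting and the `jit_above_cost` threshold, so a sudden query-time regression caused by JIT is easy to reproduce once you suspect it. Below is a minimal sketch, not anything from the original post: it assumes psycopg2, a local connection string, and a hypothetical `metrics` table, and simply times the same query with JIT on and off.

```python
# Minimal sketch (not from the original post): compare a query's latency with
# PostgreSQL JIT on vs. off. Connection string, query, and table are hypothetical.
import time
import psycopg2

QUERY = "SELECT count(*) FROM metrics WHERE value > 0.5"  # hypothetical query

with psycopg2.connect("dbname=app") as conn:
    with conn.cursor() as cur:
        for jit in ("on", "off"):
            cur.execute(f"SET jit = {jit}")  # session-level toggle, PostgreSQL 11+
            start = time.perf_counter()
            cur.execute(QUERY)
            cur.fetchall()
            print(f"jit={jit}: {time.perf_counter() - start:.3f}s")
```

In production, the equivalent signal is what an APM's query-time charts would surface, and `EXPLAIN (ANALYZE)` prints a JIT section with generation, inlining, and optimization times whenever JIT kicks in.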
Reply from the technical and development lead of the official UK COVID-19 Dashboard and deputy head of software engineering for web and data services at the UK Health Security Agency:
We have all that in place, and it didn't help. It's a bit more complicated when you get 50-60 million requests a day, of which 90% succeed. The issue was noticeable at release time: there were alerts about some outliers, but nothing out of the ordinary. You don't see the full back-pressure problem until you sustain heavy traffic.
Also remember that most services were fine for a few days... until we had an increase in demand... and that increase was going from 45 million to 50 million requests a day.
We also did load testing, and it didn't catch it. As I said, it's a bit more complicated than it might appear.
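For readers who want to reproduce the "sustained traffic" condition described above: this kind of failure tends to show up only in an open-loop test that keeps issuing requests at a fixed rate for a long time, whether or not earlier requests have completed. Here is a rough, hypothetical sketch, not the team's actual harness; the endpoint, rate, and duration are made up, and it assumes aiohttp is installed.

```python
# Rough sketch of an open-loop, sustained-rate load probe. The endpoint, rate,
# and duration are hypothetical; the point is to hold a steady rate for a long
# time so queuing and back-pressure can build up, rather than running a short burst.
import asyncio
import time
import aiohttp

TARGET_URL = "https://example.invalid/api/v1/data"  # hypothetical endpoint
RATE_PER_SEC = 600          # ~50M requests/day is roughly 580/s sustained
DURATION_SEC = 2 * 60 * 60  # hold the rate for hours, not minutes

async def fire(session: aiohttp.ClientSession, results: list) -> None:
    started = time.perf_counter()
    try:
        async with session.get(TARGET_URL) as resp:
            await resp.read()
            results.append((resp.status, time.perf_counter() - started))
    except aiohttp.ClientError:
        results.append((0, time.perf_counter() - started))

async def main() -> None:
    results, tasks = [], []
    async with aiohttp.ClientSession() as session:
        deadline = time.monotonic() + DURATION_SEC
        while time.monotonic() < deadline:
            # Open-loop: schedule this second's requests whether or not the
            # previous ones have finished, so slow responses pile up as they
            # would under real sustained demand.
            tasks += [asyncio.create_task(fire(session, results))
                      for _ in range(RATE_PER_SEC)]
            await asyncio.sleep(1)
        await asyncio.gather(*tasks)
    ok = sum(1 for status, _ in results if 200 <= status < 300)
    print(f"{ok}/{len(results)} requests succeeded over {DURATION_SEC}s")

asyncio.run(main())
```

The open-loop part is the important bit: a closed-loop tool that waits for each response before sending the next one slows itself down exactly when the service starts struggling, which hides the back-pressure you are trying to observe.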