
Discussion on: Cascade of doom: JIT, and how a Postgres update led to 70% failure on a critical national service

Robert Thorneycroft

Sorry, I wasn't trying to belittle your efforts in any way; it sounds like you've done a miraculous job getting such a large site up and running from scratch at such short notice.

I was trying to offer some advice on how you could improve your procedures to minimise the risks of running into similar issues in the future. I've been working with very large clients including the UK Government offices for many years and I would never consider going from Dev to Production so quickly, even on mature software. This is especially true for software which had only just been released.

I know there is always a lot of pressure from above to deliver everything yesterday, but the same people are usually also the first ones to complain if anything goes wrong. Unless a service is down, I think negotiating more time for testing and QA prior to going to Production could save you a lot of aggravation in the long run. If you are getting pushback to deliver faster, you can use this as an example of what can go wrong. In my experience the UK Government is always concerned about the reputational impact of major services being down, and I have always been able to negotiate adequate time based on this risk.

I suspect, as you mentioned, that the failure to spot this issue during testing was most likely due to the lack of variety in your test requests. Caching obviously plays a significant role in the performance profile of a database, so having a large, representative test scenario is critical.

From a technical perspective, the article's analysis of the impact of JIT on overall response time was very interesting and informative. Thank you for taking the time to write it all up.

Pouria Hadjibagheri • Edited

I don't disagree with you at all. All I'm saying is that there were mitigating circumstances. We had issues that we couldn't resolve in the previous version, so it was a catch-22, and a risk that, on balance, I considered to be reasonable. For what it's worth, I stand by my decision to do the upgrade. Despite the issue, we have seen significant improvements and have managed to reduce the running cost quite substantially.

We have millions of test cases - literally. But no matter how much we try, there will always be cases that we can't anticipate. That's why major services with a whole lot more resources than we can imagine break every once in a while.

Unfortunately, things happen, and all we can do is learn from them.

Thank you again.