
Discussion on: Cascade of doom: JIT, and how a Postgres update led to 70% failure on a critical national service

Pouria Hadjibagheri

All worked. I did load testing from 10 different locations, at 1k to 3k requests per minute. There was nothing to suggest that there would be any regression - in fact, we noticed an improvement.

This may have been because we didn't simulate the massive lightning spike, even though we tried. It could be because we didn't have quite the same variety of requests in our test - again, we tried. It could have been because we have extensive caching in there, which could at least in part be the reason why we didn't see total failure.

Ultimately, the thing is that the main issue (APIv1) didn't really manifest itself until 31 October. Why did we not notice anything earlier? There was nothing there to notice. We just didn't have that level of failure earlier in the week.

Again, you're looking at it from the outside - which is always easier. We don't always have the resources to do all the nice things that are written in textbooks. I created this service from scratch in two weeks, and have since been working on average 16 hours a day to maintain it.

As for being that close to the bleeding edge: there was a reason for it, which I outlined in my response to another comment. This service relies on modern technologies primarily because it is very demanding, and nothing remotely like it has been done in the public sector before. It is the most hit service in the history of the UK Government's online services.

Anyhow, I shared this experience in the hope that it will help others. If your take from it is that we didn't do enough, then so be it.

Thanks for sharing your thoughts.

Robert Thorneycroft

Sorry, I wasn't trying to belittle your efforts in any way; it sounds like you've done a miraculous job getting such a large site up and running from scratch at such short notice.

I was trying to offer some advice on how you could improve your procedures to minimise the risks of running into similar issues in the future. I've been working with very large clients including the UK Government offices for many years and I would never consider going from Dev to Production so quickly, even on mature software. This is especially true for software which had only just been released.

I know there is always a lot of pressure from above to deliver everything yesterday, but the same people are usually also the first ones to complain if anything goes wrong. Unless a service is down, I think negotiating more time for testing and QA prior to going to Production could save you a lot of aggravation in the long run. If you are getting pushback to deliver faster, you can use this as an example of what can go wrong. In my experience the UK Government is always concerned about reputational impact of major services being down and I have always been able to negotiate adequate time based on this risk.

I suspect, as you mentioned, that your failure to spot this issue during testing was most likely due to the lack of variety of requests in your tests. Caching obviously plays a significant role in the performance profile of a database, so having a large, representative test scenario is critical.

From a technical perspective, the article's analysis of JIT's impact on the overall response time was very interesting and informative. Thank you for taking the time to write it all up.

Pouria Hadjibagheri • Edited

I don't disagree with you at all. All I'm saying is that there were mitigating circumstances. We had issues that we couldn't resolve in the previous version, so it was a catch-22, and a risk that, on balance, I considered reasonable. For what it is worth, I stand by my decision to do the upgrade. Despite the issue, we have seen significant improvements and have managed to reduce the running cost quite substantially.

We have millions of test cases - literally. But no matter how much we try, there will always be cases that we couldn't anticipate. That's why major services with far more resources than we could imagine break every once in a while.

Unfortunately, things happen, and all we can do is learn from them.

Thank you again.