Discussion on: A junior, a mid and a senior dev walk into a bar

Hopefully yours is not a culture of blame.

In order for this migration to work, you would have had to be extremely careful to get it exactly right. If it blows up, it's easy to say that you should have been more careful. And of course, you say that you'll be more careful next time.

That sounds good, but an environment in which everyone just has to be extremely careful or everything blows up is going to face catastrophic failures often. I've been there.

Developers make mistakes. That's reality. It doesn't have to break production systems if we understand it and work with it, not against it.

That's one reason why we write unit tests for our code. We catch little mistakes. If we didn't make small mistakes, then why write unit tests?

We also test the behavior of our applications, including QA tests. If we never introduced bugs, why test our applications?

Then we deploy our code. That can be a complex process, as you experienced. Do we test that deployment process? We test everything else, so why would we not test the part that has the greatest potential to crash everything?

That means having a staging environment that mirrors production as closely as possible, where we can test our deployments. Containerization also helps, because the same image can run in both places.
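To make the idea concrete, here is a minimal sketch of rehearsing a migration on a throwaway copy before touching the real database. It uses SQLite purely as a stand-in, and the function name `rehearse_then_apply` and its parameters are hypothetical, not anything from the original post. A real staging environment involves far more than a copied database, so treat this as an illustration of the principle only.

```python
import sqlite3


def rehearse_then_apply(prod, migration_sql, check_sql):
    """Run the migration on a copy first; apply it to prod only if the check passes.

    Hypothetical helper: `check_sql` is assumed to return a single truthy
    value when the migrated data looks right.
    """
    staging = sqlite3.connect(":memory:")
    try:
        # Copy the current production data into the throwaway database.
        prod.backup(staging)
        staging.executescript(migration_sql)
        ok = bool(staging.execute(check_sql).fetchone()[0])
    except sqlite3.Error:
        # The rehearsal blew up, so production is left untouched.
        ok = False
    finally:
        staging.close()
    if ok:
        prod.executescript(migration_sql)
    return ok
```

The point is not the helper itself but the order of operations: the risky step fails somewhere harmless first, without relying on anyone remembering to be careful.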

Was it your personal idea (or S's or M's) not to have a staging environment for such tests? I bet it wasn't. The risk of catastrophic failure was built into your deployment process. It was a matter of time.

"Let's all be more careful" is not an answer. I doubt that you or anyone else was careless. Humans simply aren't capable of mentally calculating every variable and visualizing how complex systems will behave. (We do okay sometimes, but relying on that is a horrible idea.) If we could do that, then we wouldn't need computers. We could just do everything in our heads.

Testing everything, which includes having the environment for such tests, is the answer.

anabella • Dec 31 '18

Wise words! Loved your comment, it's almost like a mini-article on its own 👏

We did introduce a lot of monitoring and testing after this happened, but we still do not have staging as a part of the deployment process. We did have lower environments, but they were used in a more "ad hoc" manner. That means they were used for testing only when someone decided it was necessary to test something there first... And this might have been the biggest problem.