In the effort to reduce time to recovery, I find it’s often easy to forget some of the most obvious things.
Over the last few weeks, I’ve seen a broken deployment go unnoticed for several hours before it was fixed. While this didn’t mean the service was unresponsive (since the old instances kept running), it did pose a problem, for somewhat obvious reasons.
In at least one case, it made responding to a production issue difficult, because it was impossible to deploy the fix without first fixing the deployment.
So now we’re monitoring deployments. Any failure to deploy creates a Slack alert that everyone on the team will see.
If you’re not already monitoring and alerting for failed deployments, I encourage you to do so today. It’s pretty quick. If you’re using GitHub Actions and Slack, you can use the Notify Slack Action. If you’re using some other configuration, spend the 30 minutes it takes to google and configure your solution. You can thank me later 😊
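If you're on one of those other configurations, here's a rough sketch of what the alert can look like in Python. It isn't what we run; it assumes your deploy is a single command whose non-zero exit code means failure, and that you've created a Slack incoming webhook. The function names, the service name, and the webhook URL are all placeholders I made up for illustration.

```python
import json
import subprocess
import urllib.request


def build_alert(service: str, exit_code: int, detail: str) -> dict:
    """Build the JSON body for a Slack incoming webhook (a plain "text" message)."""
    text = f"Deployment of {service} failed (exit code {exit_code}).\n{detail}"
    return {"text": text}


def send_alert(webhook_url: str, payload: dict) -> None:
    """POST the payload to the webhook URL (hypothetical URL supplied by the caller)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


def deploy_with_alert(command: list[str], service: str, webhook_url: str,
                      notify=send_alert) -> int:
    """Run the deploy command and notify Slack only if it exits non-zero."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Send only the tail of stderr so the message stays readable.
        tail = result.stderr.strip()[-500:]
        notify(webhook_url, build_alert(service, result.returncode, tail))
    return result.returncode


# Example call (placeholder command and URL, replace with your own):
# deploy_with_alert(["./deploy.sh"], "my-service", "https://hooks.slack.com/services/...")
```

The `notify` parameter is only there so you can swap in a fake sender when testing; in normal use you leave it at the default.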