Note: While it draws inspiration from several real events, this story is a work of fiction. After the story are some of my thoughts on Friday deploys.
It's Friday, 10AM. Standup has finished and I'm looking at the log files on our test servers. Hang on, our log files aren't rotating. Is this happening in production? I look. This is bad. The disk is going to be full before Monday unless we do something. I call over Bethany and Thomas and show them what I've found.
We call incident management to tell them about the issue. They tell us to do whatever it takes to fix it and end the call. We agree that we should aim to fix this before 11:30. We have a team lunch that we don't want to miss.
Thomas starts raising a change request. He needs to find someone from another team to approve it and finds Alexandra. She has reviewed our changes before and knows how our team works. He tells her that we will be sending her a change for approval around 11.
Meanwhile Bethany and I are investigating the issue. A config change stopped our log files from rotating. It will be a simple thing to fix. How long has this issue existed? Were we lucky to have not had issues before? We look back through our change history and find it. The issue has existed for 3 months.
10:30AM. All we need to do is change one line in a config file. I make the change. Our build system kicks off.
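The story doesn't say what the config actually was, but a one-line log rotation fix often looks something like this — a sketch of a logrotate entry, where the log path, schedule, and retention are all assumptions rather than details from the story:

```
# Hypothetical logrotate config for the app's logs.
# Path and settings are illustrative, not from the story.
/var/log/myapp/*.log {
    daily           # rotate every day
    rotate 14       # keep two weeks of rotated logs
    compress        # gzip old logs to save disk space
    missingok       # don't error if a log file is absent
    notifempty      # skip rotation when the log is empty
    copytruncate    # rotate in place without restarting the app
}
```

A single wrong path or a deleted `daily` line in a file like this is enough to quietly stop rotation and let a disk fill over months.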
We watch the deployment to our Test environment. All tests pass. Test environment deployment completed. Now to our Stage environment.
I click "Deploy To Stage". Our deployment script raises a change request and the deployment steps kick off. All the tests pass. Stage environment deployment complete.
11AM. Thomas links the stage change to our production change and goes to submit it. "Change must be more than 10 minutes in the future". None of us have ever seen that message before. Thomas sets the change to start at 11:12 and submits it again. Bethany messages to let Alexandra know. The change gets approved almost immediately.
Incident management calls back.
"How is everything going?"
"Good, we're waiting for our change window to start"
"Why did you do that? You can get approval later."
"Our normal process is fast enough, so we decided to follow it."
The call ends.
It's time for our change window. We hit the deploy button and wait.
The deployment kicks off.
The tests run.
Tests pass. Change complete! We head to the lifts to leave the office for our team lunch.
Incident Management calls me at 12PM. They came by to see our team, couldn't find us, and wanted to know how the incident was going. I let them know that we are out for a team lunch and the issue is fixed.
2PM. Team lunch was great. I had the lamb shank. Everyone headed straight home afterwards. I live close to the restaurant, so I'm home first. I log into work and close the change.
This isn't a scary story. Everyone is happy with what is happening and nothing goes wrong. This is what changes should be like. Deployments shouldn't be scary and dangerous. They should be something that is normal and boring.
I have worked in a team where everyone worked hard to build confidence in our pipelines and processes. We were always looking for ways to build more confidence. We wanted to stay at the point where we could diagnose issues quickly and deploy fixes even faster.
We didn't start there.
We started with our deployments being scary and dangerous. We did them at 7PM in the middle of the week so that if any issues occurred they wouldn't cause a major impact. Our deployments would fail without warning. Doing the same thing again would work.
One change we did started at 1PM on Valentine's Day and should have finished by 2PM. Every time we deployed, the new version would fall over. We would look at it, it would be a pipeline problem, so we would try again. We extended the change window 3 times and eventually had to call someone from the networks team for help. At 8PM we finally had a successful deployment. My Valentine's Day plans disrupted, my workmates' evenings gone, and the network engineer's picnic date disturbed.
After this, we worked to improve our deployments. Infrastructure was causing the failures, but the application was what we were changing. So we split our infrastructure from our application. This gave us an immediate improvement.
We slowly built more confidence.
Infrastructure deployments still failed too often, so we investigated. We found that the timeout was too low, which caused the deployment to fail on a random step. So we looked at what the timeout should be and gave it a buffer. That made our infrastructure deployments more stable.
We looked back to our application deployments. We had integration tests, but they had to be manually run. So we automated them. We looked at what tests we had and saw places where we could add more. Everyone in the team worked to write these tests together to give us all more confidence.
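The story doesn't name a CI system, but automating a manually-run test suite is usually a small change to the pipeline config. A minimal sketch, assuming GitHub Actions and a hypothetical `integration-tests.sh` entry point:

```yaml
# Hypothetical CI workflow: run the integration tests on every push
# instead of relying on someone remembering to run them by hand.
# Workflow name, branch, and script path are all assumptions.
name: build-and-test
on:
  push:
    branches: [main]
jobs:
  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run integration tests
        run: ./scripts/integration-tests.sh
```

The point isn't the specific tool — it's that once the tests run on every change without human effort, they build confidence instead of sitting unused.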
We found holes in our monitoring and alerting. There were metrics that we cared about that we couldn't see. So we made them easier to see. And we added alerts on them to tell us when they weren't right. More confidence.
We looked at the entire process from beginning to end. We could make a change and deploy it to our stage environment the same day. Moving that change to production would then take a week due to change processes outside the control of our team.
We spoke to change management to find out what we could do to reduce the lead time. They came back with suggestions to template our changes and re-evaluate the risk classification. We looked at how we were classifying our changes and found we were erring on the side of caution. Our changes were safer than ever before. We standardized our steps and created a template. The change management team removed the lead time after we started using the template.
We talked to the people approving our changes and found out what they cared about. We were doing changes so often, and they weren't failing, that approving them had become a chore for the other teams. They wanted to be removed as approvers, and only added back when a change required extra scrutiny, so we removed them. We had built enough confidence in our system that they were happy.
After a year of slowly building confidence, we were finally at the point where we could make a change and have it safely deployed to production the same day.
And then we did have an incident where we had to make a change and deploy it the same day. And we did leave as a team for team lunch as soon as it finished.
But we weren't done. There were always things we could find that would build our confidence. We investigated contract tests, performance tests and UI tests. We weighed up the effort to implement them against the confidence they would build. Where it made sense we added them to our pipelines. There is always room for more improvement, but we were at the point where we would happily deploy on a Friday.
Should you deploy on a Friday?
- If you've got the confidence in your build and deploy pipelines, go for it.
- If you don't, go build some confidence.