DEV Community

Discussion on: 7 Ways Continuous Delivery Helps Build a Culture of Learning

Marko Anastasov Author

"If it hurts, do it more often." — I've definitely seen this bring good things in many contexts.

I'm currently researching other people's journeys to CD, so I'd love to hear more about yours. :) If you'd like to share feel free to post here or reach me on marko x renderedtext x com.

simonhaisz

Sure!

The players: Medium-sized company (100-200 people in R&D).
The stage: One large monolithic product with tons of features and a couple decades of history.
The audience: Large companies (1B+ revenue) that don't like change.
Previous chapters: Release cycle of a major release every 1-2 years with minor releases every 3-6 months.

We were following a ScrummerFall process at that point. We would have a release plan that would start with all the high priority features, guess at how long the longest poles would be and then fill in the gaps with lower priority features. As breaking the build is bad, most features would be developed on a feature branch and kept out of the official builds. Some of these feature branches would be worked on for months before being merged.

In our favor, each team had thousands of automated tests for their layer. The downside is that these were mostly integration tests, so they were slow and flaky. If you ran them all sequentially it would take a day. Then we had hundreds of full-stack E2E tests which were even slower and flakier - their total runtime was several days. Then there were the performance tests, hundreds of tests across dozens of data sets. Their runtime was measured in weeks.

Obviously we did not have a quick feedback loop 😢 And because of the flakiness you almost never saw 100% green. A good build was 90-98% green. Because investigating those intermittent failures was so expensive we got in the bad habit of ignoring them during dev and just re-running them at release time. Run the failing tests multiple times on the same build; if they almost always pass then it's a problem with the tests and not the product 😭
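That re-run heuristic can be sketched roughly like this (a minimal illustration; the function name and pass threshold are made up for the example, not our actual tooling):

```python
def classify_failure(run_test, attempts=10, pass_threshold=0.8):
    """Re-run a failing test on the same build.

    If it passes most of the time, the failure is probably in the
    test (flaky); if it keeps failing, it's probably a product bug.
    """
    passes = sum(1 for _ in range(attempts) if run_test())
    return "flaky test" if passes / attempts >= pass_threshold else "product bug"

# A test that always passes on re-run looks like test flakiness...
print(classify_failure(lambda: True))   # flaky test
# ...while one that keeps failing points at the product.
print(classify_failure(lambda: False))  # product bug
```

Of course, the whole point of the story is that this classification is a band-aid, not a fix.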

Enough was enough so we began our journey. We actually brought Dave in as a consultant to help us. I can personally testify that he's great.

The first thing we did was start treating master as if we would ship off of it at any time. We obviously didn't, because we weren't ready. But it was the start of doing something painful frequently until it stops being painful.

Every test failure became a bug, as it should be. It had to be logged, investigated, and fixed. It was considered a defect in the product unless you could prove it was in the tests. And regardless of where it was it couldn't be closed until it was fixed. Not surprisingly, velocity dropped like a piano. PM was 😠.

So we worked on it and improved our tests. Rewrote whole sections of our test framework so that they would be reliable. Got everyone to treat writing tests as carefully as they wrote prod code. After a while (it sure took a while) we reached the point where a failed test meant there was a real bug. We actually got 100% green builds. I can't say we never had a flaky test again, but they're really rare now, and if a new type of flakiness pops up we've got good tools and techniques to get rid of it.

At the same time we stopped using feature branches and started branching by abstraction using feature toggles/flags. So everyone was pushing to master, which made it easy to test features in development and even test their combinations. And if your feature 'leaked' out of its toggle? Ooh, that's a bad bug. So we were finding issues right away instead of six months later at merge time.
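The toggle idea is simple: in-progress code lives on master behind a flag, and the shipped behavior stays the default until the flag flips. A minimal sketch (the registry and function names are illustrative; real setups usually back flags with config or a flag service):

```python
# In-memory flag registry; toggled per environment, not per branch.
FLAGS = {"new_checkout": False}

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)

def new_checkout_flow(cart):
    return f"new:{len(cart)}"       # in-progress code, merged to master

def legacy_checkout_flow(cart):
    return f"legacy:{len(cart)}"    # shipped behavior stays the default

def checkout(cart):
    if is_enabled("new_checkout"):
        return new_checkout_flow(cart)
    return legacy_checkout_flow(cart)

print(checkout([1, 2]))             # legacy:2 (flag off)
FLAGS["new_checkout"] = True        # flip the toggle to exercise new code
print(checkout([1, 2]))             # new:2
```

A feature 'leaking' means the new path runs while the flag is off, which is why we treated that as a serious bug.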

At the same time we worked on improving the feedback loop. Slow tests were investigated and improved or rewritten to be faster. We invested in test farms to run tests in parallel easily. Any PR could have thousands of tests run against it before it was merged. We implemented CI so that we ran builds and the fast tests with every commit. Each official build now runs all of the INT tests in ~30 minutes (instead of hours) and all of the E2E tests in a few hours (instead of days). Performance tests are run every night (instead of monthly).
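The payoff of parallelizing is that wall-clock time collapses to roughly the slowest suite instead of the sum of all of them. A toy sketch of the idea (suite names and timings are made up; a real test farm shards across machines, not threads):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_suite(name, seconds):
    time.sleep(seconds)          # stand-in for actual test runtime
    return name, "green"

# Three independent suites; sequentially they'd take 0.6s total.
suites = [("int", 0.2), ("e2e", 0.3), ("ui", 0.1)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(suites)) as pool:
    results = dict(pool.map(lambda s: run_suite(*s), suites))
elapsed = time.perf_counter() - start

print(results)            # every suite reports green
print(elapsed < 0.5)      # wall time ~ the slowest suite (0.3s), not 0.6s
```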

The end result was that after a year of blood, sweat, and tears we started releasing monthly. We can actually be Agile now and deliver value incrementally, even with our Enterprise customers.

Marko Anastasov Author

This is pure gold. Thank you so much for sharing! 🙇‍♂️