DEV Community

Zef Hemel for Mattermost

The Platformer #26: The Subtle Art Of Letting Things Blow Up In Your Face

Sorry for the long title on this one, but it’s too good a tease not to use. Before we go there, let’s start with the pickings de la cereza — that’s Spanish (according to Google).

Cherry picks

Hacktoberfest continues, so we’re still spending a lot of time on reviewing all those incoming PRs. Thanks for all your contributions!

On the mobile platform end, we’re going to try something new: a weekly mobile hangout on Mondays (in the public ~Developers: Mobile channel of course, using Calls). The goal is a casual “drop in and out as you like” type of environment. Some of our mobile engineers will be there; if you’d like to talk about mobile development, are stuck somewhere, or would like to share your love of Dynamic Islands, this is your opportunity. We’re also pushing three initiatives forward: automated performance testing (channel: ~Client Performance Automation) in collaboration with the web platform team, Sentry integration, and the “single command” installer for development environments. On that last topic, we polled people last week; we’ll now create a prototype of the “single binary” option, written in Go.

On the web platform end, the team has been working on unifying post components (and cleaning up afterwards), more fighting with Webpack and friends in the context of the front-end part of the multi-product architecture, testing out a prototype of menus based on Material UI, and the regular cycle of dependency updates.

On the desktop platform end, the team is working on native notifications and fixing bugs in the download list. We’ve now upgraded to Electron v21! In addition, during last week’s web guild meeting we gave some details on the plan for webapp/desktop module integration. Feedback is welcome!

On the server platform side, some time ago (I somehow missed it at the time) we posted a thorough investigation of a performance regression that you may be interested in. Last week I mentioned “community on demand” (a way to deploy duplicates of our community server for testing purposes); it is now available to Mattermost employees. We haven’t broken community this week, so we’re continuing our refactoring streak.

On the QA platform side, we've been doing a lot of work stabilizing our QA infrastructure, adding observability (collecting metrics), and automating more things (such as building AMIs for iOS-based testing). We have now also added Percy to our stack (a tool to support visual regression testing). On the test case management topic, we are currently witnessing a piece of history in the making — the first QA-Engineering collaboration on test cases, happening in a PR (because now we can) 🤯.

Quality Utopia

Picture this:

As an engineer, you get a new feature to build. The spec is clear, the designs are done. And... you have an acceptance test suite that will only succeed when the feature is implemented correctly.

Energized by the new assignment, you start typin’ the codez.

Tap, tap, tap.

In the background, all sorts of quality magic is happening. Safety nets activate.

Your compiler checks the types. The compiler is happy; a green checkmark lights up.
Then your unit tests run. Another green checkmark.

A suite of integration tests runs. Another checkmark.

At this point we’ve reached the limits of what your whiny laptop can handle without spinning its fans up too loudly (for those who haven’t yet ascended to ARM nirvana), so your code makes its way to the QA Cloud.

The QA Cloud spins up your code and runs elaborate suites of end-to-end tests covering all of your product’s functionality. This coverage is, of course, extended with the acceptance test for your particular feature. Check. Check.

When these pass, the code is deployed to various environments that mirror actual production setups our customers use, along with representative sets of their data. Here, too, an elaborate functional test set is run. Check. Check. Check.

Then, a slew of load tests runs on these deployments. Your code is exercised as if 5k, 10k, 20k, 50k, 100k users were concurrently typin’ the messagez, makin’ the callz, draggin’ the cardz, and checkin’ the boxez. Check, check.
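For the curious, the shape of such a concurrent exercise can be sketched in a few lines. This is a hypothetical toy driver (the names and the stubbed action are made up for illustration, not Mattermost’s actual load-test tooling):

```typescript
// Hypothetical sketch of a tiny load driver: fire N simulated users
// concurrently against an action and count how many succeed.
async function simulateUsers(
  userCount: number,
  action: (userId: number) => Promise<void>
): Promise<number> {
  // Launch all simulated users at once; allSettled collects every outcome
  // instead of bailing on the first failure.
  const results = await Promise.allSettled(
    Array.from({ length: userCount }, (_, id) => action(id))
  );
  return results.filter((r) => r.status === "fulfilled").length;
}

// Example: 5000 simulated users "typin' the messagez" against a stub.
simulateUsers(5000, async (id) => {
  // In a real test this would hit the server API; here it always succeeds.
  if (id < 0) throw new Error("unreachable");
}).then((ok) => console.log(`${ok}/5000 succeeded`));
```

Walking the 5k → 10k → … → 100k ladder is then just a matter of re-running the driver with a larger count; the real work, of course, lives in what the per-user action does.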

Then, for a while, things go quiet.

As your mind drifts, all of a sudden you receive a message on Mattermost from a colleague.

“Nice job on that new feature, I’ve been playing with it for a few minutes and it works great!”

Your code is live on community.

A few hours later, a phone call.

“Ahoy-hoy! KeyCustomer Inc. just called, and they just love the new feature and our turnaround on delivering it. They’ve expanded their contract and added 5k more seats!”

-

While this may sound utopian, it’s not as far out of reach as you may think. In fact, it’s what we’ve been building towards (perhaps implicitly) for quite a while. Let’s consider it a kind of “quality north star.”

If I could have this tomorrow, I would take it, wouldn’t you?

Next week is also fine.

The good news is: we’re moving there. Step by step. However, there are still plenty of gaps.

The annoying thing about gaps is: where are they, how big are they, and how should we prioritize them? And what about the gaps we don’t know about?

One obvious gap is that we don’t have 100% test coverage at any level: not at the unit test, integration test, or end-to-end test level. I hope nobody is shocked by this.

Achieving that would be an excessive amount of work; we could create a virtually infinite backlog of tests to write. So how do we prioritize?

Less obvious gaps are... well, unknown. We know they’re there (we can smell them), but we don’t really know where.

So, what now?

For this, let me introduce you to a surprisingly simple, yet powerful art form:

The subtle art of letting things blow up in your face.
...and then learn from it.

Right, I shouldn’t forget about that last part, it’s kinda critical.

This technique helps in prioritization because it gives us a realistic idea of the things that actually tend to break in real life.

For instance, reading The Platformer, you may naively think it’s pretty important to have solid test coverage for excessively long posts. In practice, however, you may find that only a handful of nutcases in the world write >1k-word posts (most of them on platform teams), that this isn’t actually blowing up all that often, and that it’s therefore not worth a lot of effort.

Letting things blow up “by design” should give you a whole different perspective on what others may refer to as “failure.” Because it’s not failure, it’s opportunity! We identified a gap, w00t! Now let’s learn from it and adjust our priorities. Good times.

And if things don’t blow up — equally good times. We keep typin’ the codez.

The “subtle art” part is containing the blow-ups to reasonable levels; we can only YOLO our way through life so far. The earlier things blow up, the smaller the impact, and the better. We want things to blow up further and further upstream, if you will. This is why typed languages are valuable: they hint within milliseconds that you’re doing crazy stuff. You don’t want a customer to have to call you to learn that “undefined is not a function.”
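As a tiny illustration of that upstream blow-up, here is a hypothetical TypeScript snippet (made-up names, strict null checks assumed). In plain JavaScript, calling a missing handler throws “undefined is not a function” at runtime, possibly in front of a customer; with types, the compiler refuses to build until the undefined case is handled:

```typescript
// Hypothetical example: a notification handler that may not be registered.
type Handler = (msg: string) => void;

function notify(handler: Handler | undefined, msg: string): string {
  if (handler === undefined) {
    // Remove this check and `handler(msg)` becomes a compile error under
    // strict null checks: the mistake blows up at build time, in milliseconds,
    // instead of at runtime on a customer's machine.
    return "no handler registered";
  }
  handler(msg);
  return `delivered: ${msg}`;
}

console.log(notify(undefined, "hello")); // → "no handler registered"
console.log(notify((m) => void m, "hello")); // → "delivered: hello"
```

Same bug, same blow-up — just moved as far upstream as it can go.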

What do we do when we’re “successful” and our face needs some fixing up, post-blowup? When your half-baked feature somehow ends up deployed to customers in a broken state?

We retrospect.

We investigate where things went wrong. Already today we have a lot of layers of safety nets in place: we have types, unit tests, integration tests, and end-to-end tests; we do code reviews; we deploy to the Community server; and we have Rainforest tests, (some) manual QA test suites, and explicit QA sign-offs.

We need to pinpoint where in this line of layer after layer of safety nets this should have been caught and fix it, or decide whether we need even more layers. And while we’re at it, look around a bit more and see if there are other obvious gaps we can close, since we’re cleaning up anyway.

That’s all.

Then: rinse and repeat.

And before you know it (give it 20–30 years), it’s here: Quality Utopia.

-

This post was adapted from its original version on the Mattermost Community Server. Want to take a closer look at the inner workings of the Platform team? Join the Community server to be the first to read The Platformer every Friday.
