I hope this post makes it to you safely. Let’s immediately address the mastodon in the room: these last two weeks haven’t been the greatest in terms of the stability of our community server. Even today, teams continue to firefight and figure out what the issues are. It seems to be an accumulation of many things. Sadly, reality is always more complex than we’d hope. Obviously, this needs to be fully and confidently ironed out before we release anything to customers.
For those unaware: we deploy the latest (and typically greatest) version of Mattermost to community.mattermost.com daily (I think around 10am CET). The reason for this is what some call drinking your own champagne, but I prefer the original, more disgusting version: eating your own dogfood (or “dogfooding” for short).
Of course, we test features and fixes as well as we can before merging them (and we’re constantly looking at ways of doing this more reliably). However, you only really know if something works by putting it into some sort of production environment, and we’re volunteering ourselves for this. In the highly hypothetical case that something doesn’t quite work, we’d notice it immediately. Also, thanks to our use of feature flags, we can enable certain features that are not fully baked yet just for this community environment, for testing. The downside of this approach is what we’re currently witnessing: the hypothetical case where something doesn’t work quite right.
Trade-offs!
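To make the feature-flag side of that dogfooding a bit more concrete, here’s a minimal sketch of an environment-scoped flag gate. The flag name, environments, and helper below are hypothetical and purely illustrative, not Mattermost’s actual configuration API.

```ts
// Minimal sketch of an environment-scoped feature flag gate.
// Flag names, environments, and the helper are hypothetical,
// not Mattermost's actual configuration API.

type Environment = 'production' | 'community' | 'development';

interface FeatureFlags {
    // A hypothetical not-yet-GA feature we only want on the community server.
    experimentalThreadsView: boolean;
}

// Everything defaults to off; flags are then enabled per environment.
const flagsByEnvironment: Record<Environment, FeatureFlags> = {
    production: {experimentalThreadsView: false},
    community: {experimentalThreadsView: true}, // dogfooding only
    development: {experimentalThreadsView: true},
};

export function isFeatureEnabled(flag: keyof FeatureFlags, env: Environment): boolean {
    return flagsByEnvironment[env][flag];
}

// Usage: keep the unfinished code path behind the flag.
if (isFeatureEnabled('experimentalThreadsView', 'community')) {
    // render the experimental UI
}
```

The point of the pattern is simply that the community server gets the half-baked feature while production customers never see it until it’s ready.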
I could spend a few hundred words looking for the silver lining (the dynamic island, if you will) in these outages, because there are potential long-term benefits to production issues like this. However, a few things need to happen first:
- We need to put out the fire.
- We need to refresh, take a shower, eat something, and chill for a bit.
- We need to properly retrospect and root-cause all of this, extract the learnings, and come up with actionable plans for how to improve.

We’re still at stage 1, so one thing at a time. That said, I’ll mention that early in my career in industry, for a year or two, I was responsible for running a fairly large SaaS service. I was woken up at 2am multiple times per week, for many months. Those were tough times, but I learned almost everything I know today about building resilient software, monitoring, release strategies, and scaling during that period. There’s opportunity in everything. It just sucks while you’re in it.
This is all wisdom for later. Until then: HugOps!
While a lot of these issues are owned by the SRE and Boards teams, people on the server platform teams are also helping out wherever possible, which shifted the focus a bit this week.
On to this week’s kersen plukkerij (that’s Dutch for cherry picking, if I remember it correctly).
Cherry picks
Here are the notes for this week’s multi-product architecture sync.
On the mobile platform end, this week we have the code freeze planned for our v2 GA release. So, besides getting certain Sentry changes in, the focus has been on bug-fixing all the things. The end is nigh.
On the web platform end, we now have an initial version of performance regression tests integrated into our pipelines; here’s an example. While a performance regression doesn’t block a PR’s merge yet, it will appear as a failing test. We’re still wrapping up the work on the unification of the post component, which is now mostly about fixing broken end-to-end tests. Work continues on menus, and we’re figuring out a caching issue related to webpack module federation.
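For the curious, a performance regression check in a pipeline can be as simple as measuring a duration and asserting it against a budget. The sketch below is illustrative only and assumes a Jest-style runner; the helper, component, and budget are made-up names and numbers, not what our pipeline actually measures.

```ts
// Minimal, illustrative performance regression test (Jest-style).
// All names and numbers here are hypothetical.
import {performance} from 'perf_hooks';
import {expect, test} from '@jest/globals';

// Hypothetical stand-in for whatever the pipeline actually measures,
// e.g. rendering a post list a fixed number of times.
function renderPostListRepeatedly(iterations: number): void {
    for (let i = 0; i < iterations; i++) {
        // ...expensive rendering work under test...
    }
}

// Illustrative budget; a regression surfaces as a failing test,
// which today is reported but does not yet block the merge.
const RENDER_BUDGET_MS = 500;

test('post list rendering stays within its performance budget', () => {
    const start = performance.now();
    renderPostListRepeatedly(100);
    const elapsedMs = performance.now() - start;

    expect(elapsedMs).toBeLessThanOrEqual(RENDER_BUDGET_MS);
});
```

Reporting the result without failing the merge is a deliberate middle ground: it builds a history of measurements and surfaces regressions early, without blocking contributors on a still-maturing signal.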
On the desktop platform end, it seems Apple is just as excited about upgrading the desktop app, and has therefore approved 5.2.2! Available in your friendly neighborhood App Store now!
On the server platform side, besides helping out with the recent outages, we made progress on other fronts, including running the regular release regression tests (no regressions found with the load test suite, but given the current issues we may find there are gaps, likely because all load tests focus on Channels and we don’t have Playbooks or Boards coverage yet; did I mention opportunity?). We also added some documentation with best practices on how to configure coverage frequency.
On the QA platform side, we continue where we left off last week: release testing and fixing flaky tests, but also preparing for a serious “winter cleaning” of all our tests in January. And... we are now getting iOS mobile v2 test runs reported to our test report channel 🥳
That is all for this week. The end of 2022 is nearing. Let’s go into the holiday season in as stable a shape as we can.
Have a quiet weekend!