Jeremy Friesen for The DEV Team

Posted on Nov 30, 2021 • Originally published at takeonrules.com on Nov 30, 2021

Keeping the Stakes Low while Breaking Production

#programming #todayilearned #meta #sre

Using Existing Tooling to Make Changes Reversible

This morning I deployed a major change to the production servers of DEV.to.
You can read more about that change at Feature update: Feed - DEV Community.

Adjusting the feed is a big deal. It’s the front door of DEV.to.

Preparing for Change

We wanted to gather information about the impact of changing the front door. At Forem we use the Field Test gem. This allows us to run different experiments. I configured Field Test to assign 20% of the folks to the new strategy and 80% of the folks to the prior strategy.
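To make the 20/80 split concrete, here is a minimal sketch of weighted, deterministic variant assignment in plain Ruby. It is not how Field Test works internally and it is not Forem's code; the constant, method, experiment, and variant names are all made up for illustration.

```ruby
require "digest"

# Hypothetical sketch: deterministic 20/80 variant assignment.
# VARIANT_WEIGHTS and variant_for are illustrative names only; they are not
# part of Field Test or Forem.
VARIANT_WEIGHTS = {
  "new_strategy" => 20,
  "prior_strategy" => 80
}.freeze

# Hashes the participant into a stable bucket in 0...total, then walks the
# cumulative weights to pick a variant.
def variant_for(participant_id, experiment: "feed_strategy", weights: VARIANT_WEIGHTS)
  total = weights.values.sum
  digest = Digest::SHA256.hexdigest("#{experiment}:#{participant_id}")
  bucket = digest[0, 8].to_i(16) % total

  cumulative = 0
  weights.each do |variant, weight|
    cumulative += weight
    return variant if bucket < cumulative
  end
end

# Rough check of the split across many participants.
counts = (1..10_000).map { |id| variant_for(id) }.tally
puts counts
```

Because the bucket comes from a hash of the participant, the same person keeps landing in the same segment, which matters later when you try to reproduce what a specific user saw.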

To insulate ourselves from upstream libraries, we often wrap those libraries in our own implementations. This helps with the principle of least exposure: rely on the narrowest interface possible. I wrote the AbExperiment class to be the wrapper around Field Test. With the wrapper, I allowed the system to set an ENV variable and thus force all requests to use the same experiment variant. This helped with running local tests.

In addition, with that change, I created a means for our Site Reliability Engineers (SREs) to disable, with an application restart, the changes that I made. They would just need to add an ENV variable.
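The wrapper-plus-override idea can be sketched roughly like this. This is not the actual AbExperiment class; the class name, the ENV variable name, and the backend interface are assumptions chosen so the example runs on its own.

```ruby
# Hypothetical sketch of a thin wrapper around an experiment library.
# The class name, the ENV variable name, and the backend interface are
# assumptions for illustration; this is not Forem's AbExperiment code.
class ExperimentWrapper
  OVERRIDE_ENV_KEY = "AB_EXPERIMENT_FEED_STRATEGY" # assumed name

  # backend: any object that responds to #call(user) and returns a variant.
  # env: injectable so tests do not have to mutate the real ENV.
  def initialize(backend:, env: ENV)
    @backend = backend
    @env = env
  end

  # Returns the forced variant when the override is set;
  # otherwise defers to the wrapped library.
  def variant_for(user)
    forced = @env[OVERRIDE_ENV_KEY]
    return forced unless forced.nil? || forced.empty?

    @backend.call(user)
  end
end

# Stand-in for the real experiment library.
fake_backend = ->(_user) { "new_strategy" }

normal = ExperimentWrapper.new(backend: fake_backend, env: {})
forced = ExperimentWrapper.new(
  backend: fake_backend,
  env: { "AB_EXPERIMENT_FEED_STRATEGY" => "prior_strategy" }
)

puts normal.variant_for(:any_user)
puts forced.variant_for(:any_user)
```

The override is read at call time here for simplicity. In a setup like the one described above, where the value comes from the process environment, changing it means restarting the application.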

The next step came about when I learned more about our use of Flipper, a Ruby gem for dynamically toggling features on and off. I didn’t know when the feature would roll out, but I wanted control over the feature. I also wanted admins of other Forems to have control as well. This was trivial with Flipper. Once I deployed the code, Forems got the original behavior unless they “flipped” on the feature.
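The shape of that toggle, sketched with an in-memory stand-in so it runs without any gems. With Flipper itself the corresponding calls are `Flipper.enable`, `Flipper.disable`, and `Flipper.enabled?`; the `FeatureToggle` class, the flag name, and the strategy names below are made up for illustration.

```ruby
# Hypothetical sketch of gating a code path behind a runtime toggle.
# FeatureToggle is an in-memory stand-in, not Flipper; the flag and
# strategy names are illustrative only.
class FeatureToggle
  def initialize
    @enabled = {}
  end

  def enable(name)
    @enabled[name] = true
  end

  def disable(name)
    @enabled[name] = false
  end

  # Unknown flags are off, so a fresh deploy keeps the original behavior.
  def enabled?(name)
    @enabled.fetch(name, false)
  end
end

# Picks which feed strategy to use based on the toggle.
def feed_strategy(toggle)
  toggle.enabled?(:new_feed_strategy) ? :new_strategy : :original_strategy
end

toggle = FeatureToggle.new
puts feed_strategy(toggle)          # flag never set: original behavior
toggle.enable(:new_feed_strategy)
puts feed_strategy(toggle)          # flag flipped on
toggle.disable(:new_feed_strategy)  # the rollback path
puts feed_strategy(toggle)
```

Defaulting to "off" is the design choice that matters: deploying the code and releasing the feature become two separate decisions.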

Enter the Bug

I flipped the feature on in the mid-morning, and things looked fine until someone reported a broken feed. They had no articles in their Relevant feed. I checked the SQL query and things appeared to work. I also checked the database, and saw that I was in the same experiment segment. Things worked for me…until they didn’t. As in things worked, and then after poking around a bit they failed.

Now with confirmation, I pinged our SREs to toggle off the feature. It takes a minute or so for the feature toggle to propagate. This meant we didn’t have to rush out a hot fix. If you’ve ever done one of those, you know the odds are about even that you first make things worse.

With the immediate problem contained, we moved on to figuring out the cause. A few SREs and I started debugging. They have access to the Rails console, and we reconstructed the situation.

And in a few minutes, we found the problem. I spent the next hour writing a new pull request to fix the problem. As of writing this, I haven’t merged that pull request.

Once we merge the pull request, we’ll need to toggle the feature back on and watch to see whether we’ve fixed things.

Prepare for Problems

This was my first major feature release, and the collaboration during the pull request really helped prepare all of us for what we would do if things went sideways.

I was a bit disheartened when an SRE and I decided to turn off the new feature. I wanted it to work; to be bug free. But that was not the case.

Turning off the feature freed my mind from worrying about all the folks on DEV. And we could all turn and calmly explore the root cause of the bug.

That calmness helped create the space for finding the problem. And it took all of us.

The SREs had powers I didn’t have and I had context that they didn’t.

For the Curious

We ran the following in the console:

# Print the SQL that the weighted feed strategy generates for one user.
puts Articles::Feeds::WeightedQueryStrategy.new(
  user: User.find_by(id: 1),
  page: nil
).call.to_sql

We then pasted that into Blazer and started looking at the SQL. As we moved around the massive SQL statement, we saw the culprit: a very narrow range of allowed article publication dates.

Alas, temporality strikes again!

Lessons Learned

I deployed a logic bug. My change broke the feed for some folks. What could’ve been a super stressful afternoon was instead a calm exploration of the problem.

This calmness was made possible because I added two key mechanisms for testing and rollback. This bit of effort meant that we could quickly and confidently roll back the change.

And with the fire put out, we could approach the problem with clearer minds. We didn’t have the stress of a broken system screaming on the sidelines saying “Fix it! Fix it! Fix it!”

Top comments (5)

Comment deleted

Jeremy Friesen • Dec 1 '21

You're definitely right, the complexity of rolling back a feature or change varies. And going into any major change it is useful to consider "How would you roll things back?" In this case, the feature was a query change (and introducing new classes to help build that query). That made the rollback easy.

Were I to "migrate liquid tags usage" in production, that would be a far more complicated rollback plan, because those tag uses are stored in the database.

GrahamTheDev • Nov 30 '21 • Edited

Great write up on the (very minor) disruption yesterday.

Some valuable nuggets in here:

having a way to roll back quickly when things go sideways (as they will, more often than not!)
especially the part about having space to consider a proper fix, rather than hot fixing a problem and potentially making it worse.

I look forward to attempt 2 (and 3 and 4 if necessary! But I have my fingers crossed for you that everything goes smoothly this time! 🤞) for this new test, I hope it has the desired outcome! ❤🦄

Andy Piper • Nov 30 '21

Interesting post, thanks for sharing the process and what happened - and I’m excited to see what the new Feed changes bring to the site!

Minor point, but it looks like there is a hanging sentence in the last section of the post? “What could”; and the last paragraph reads as unfinished, but maybe I’m mistaken?

Thanks for all your hard work on DEV and Forem!

Jeremy Friesen • Nov 30 '21

Removed the "What could" dangling sentence.

I'm eager to get the second attempt out for the feed.
