A kill switch is a tool that allows anyone on your team to turn off a feature in production with the click of a button. For example, let’s say you have a new banner on your e-commerce site informing users of free shipping. After you release it to production, you notice a bug. Instead of going through a whole release process to ship a fix, you can log in to your feature flagging system and trigger the kill switch for that feature, disabling it immediately. A kill switch is especially helpful when incidents happen in production.
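As a minimal sketch of the idea, the banner can be wrapped in a flag check so that turning the flag off restores the old code path without a redeploy. The `FlagStore` class and `render_header` function here are hypothetical stand-ins for illustration, not the API of any particular feature flagging product.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory stand-in for a feature flagging service."""
    flags: dict = field(default_factory=dict)

    def is_on(self, name: str) -> bool:
        # Unknown flags default to off, so a missing flag never exposes a feature.
        return self.flags.get(name, False)

    def kill(self, name: str) -> None:
        # The kill switch: flip the flag off without shipping new code.
        self.flags[name] = False


def render_header(store: FlagStore) -> str:
    if store.is_on("free_shipping"):
        return "Free shipping on all orders!"
    return ""  # old code path: no banner


store = FlagStore({"free_shipping": True})
banner_before = render_header(store)   # banner is shown
store.kill("free_shipping")
banner_after = render_header(store)    # banner is gone
```

Because the check runs on every request, the change takes effect as soon as the flag value changes.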
In many feature flag management systems, you can configure automatic kill switch triggers. You can connect your metrics to your feature flagging system and automate the kill switch to turn off a flag when there is a significant negative impact on a KPI tied to that flag. A KPI is a key performance indicator, and when you import your KPIs into your feature flagging system, you are creating a list of things to monitor. Let’s say the free shipping feature from the example above is tied to a flag called free_shipping. The metrics dashboard for that flag shows a significant increase in page load time since you turned the flag on, so the kill switch is automatically triggered and turns the flag off.
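The trigger logic can be sketched roughly as follows. This is an illustration under assumptions: the `should_kill` helper, the 20% threshold, and the sample timings are all made up, and a real system would more likely use a statistical significance test than a simple comparison of means.

```python
from statistics import mean


def should_kill(baseline_ms, current_ms, max_increase=0.20):
    """Return True if mean page load time rose by more than max_increase.

    max_increase is an assumed threshold (20%), not a recommended value.
    """
    base = mean(baseline_ms)
    return (mean(current_ms) - base) / base > max_increase


flags = {"free_shipping": True}

# Hypothetical page load samples (ms) before and after the flag was turned on.
baseline = [800, 820, 790, 810]
after_release = [1200, 1150, 1300, 1250]

# Automated kill switch: turn the flag off when the KPI regresses.
if flags["free_shipping"] and should_kill(baseline, after_release):
    flags["free_shipping"] = False
```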
There are pros and cons to automating your kill switch. Let’s take a look at both.
If feature launches have a history of causing crashes in your app, you should set up kill switch automation. For example, let’s say that every user who receives the “on” treatment of your feature flag sees their browser crash, meaning everyone who receives your new feature experiences this behavior. It makes sense to set up your feature flagging system to kill this feature automatically because it ruins your users’ experience and you can directly correlate the crashing with the new feature.
Killing a release by way of a kill switch is typically a safe operation as long as you’ve designed your releases in a backward-compatible way. You need to be confident that there won’t be other issues when the system automatically rolls back the changes.
When the risk of leaving the feature on is greater than the risk of automatically turning it off, you should set up automation for that feature flag. This is especially true when you are changing critical business flows. Suppose your metrics drop because of a new change to your app’s login page. Because login is a critical flow, it is far better to stop the bleeding and turn off that feature as soon as possible than to investigate first and then make a call.
While there are many good reasons to automate your kill switch, it can also be an antipattern. You need to be sure that a specific feature influences a specific metric before triggering the kill switch. Automating your kill switch on some arbitrary threshold (page load time over X, exception count over Y) is of little use if you can’t correlate a breach of that threshold with a specific feature.
A common setup is to automate the kill switch by monitoring a particular metric, let’s say page load time, and killing the feature if that metric spikes. However, if you don’t have the granularity to tie those increases or decreases to a specific feature, you don’t know whether the metric is being influenced by the feature you just released. You can’t make that correlation. When your feature flagging system shows both the feature flag’s state and the metrics in the same place, it’s much easier to make that correlation.
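One way to approximate that correlation is to compare the metric between users who received the “on” treatment and users who received “off”, instead of looking at the site-wide number. The sketch below assumes each sample is recorded together with the treatment the user saw; the `flag_explains_regression` helper and its 1.2 ratio are hypothetical choices for illustration.

```python
from statistics import mean


def flag_explains_regression(samples, min_ratio=1.2):
    """samples: list of (treatment, page_load_ms) tuples.

    Returns True only when the "on" group's mean is at least min_ratio
    times the "off" group's mean. min_ratio is an assumed value.
    """
    on = [ms for treatment, ms in samples if treatment == "on"]
    off = [ms for treatment, ms in samples if treatment == "off"]
    if not on or not off:
        return False  # cannot compare without both groups
    return mean(on) >= min_ratio * mean(off)


# Everyone is slow, regardless of treatment: the flag is probably not the cause.
site_wide = [("on", 1500), ("off", 1480), ("on", 1520), ("off", 1510)]

# Only users who see the feature are slow: the flag is the likely cause.
feature_only = [("on", 1500), ("off", 800), ("on", 1400), ("off", 820)]

kill_for_site_wide = flag_explains_regression(site_wide)
kill_for_feature_only = flag_explains_regression(feature_only)
```

In the first case, killing the feature would not fix the slowdown, which is exactly the situation where a threshold-only trigger misfires.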
You should not automate your kill switch if the release has been out for a long time. Sometimes code gets mostly or entirely ramped up, but the kill switch hook lingers. There is a risk that the kill switch breaks flows users have come to rely on, or that some edge cases no longer work in the old code path. This is why, when you set up an alert, you define a monitoring window.
With a kill switch, something can be out for a few weeks, and that old path might no longer be safe. If something has been out for an extended period of time, new features may have been released that interact poorly with the old path. For this reason, it’s a best practice to limit your kill switch to a reasonable window, which also means you aren’t left with technical debt to clean up later. There are use cases for long-lived kill switches that stay in your codebase and provide graceful degradation when something fails, but these are not likely something you would want to automate.
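A monitoring window can be enforced with a simple time check before the automated trigger is allowed to fire. This is a sketch only: the seven-day window, the dates, and the `auto_kill_armed` helper are assumptions for illustration, and the right window depends on your release cadence.

```python
from datetime import datetime, timedelta

# Assumed value: how long after release the automated trigger stays armed.
MONITORING_WINDOW = timedelta(days=7)


def auto_kill_armed(released_at: datetime, now: datetime,
                    window: timedelta = MONITORING_WINDOW) -> bool:
    """True while the release is still inside its monitoring window."""
    return now - released_at <= window


released = datetime(2024, 1, 1)
armed_day_3 = auto_kill_armed(released, datetime(2024, 1, 4))    # inside window
armed_week_4 = auto_kill_armed(released, datetime(2024, 1, 29))  # window expired
```

Once the window expires, the flag can still be turned off manually, but a person makes that call.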
Feature flags help mitigate risk. The kill switch is one aspect that gives you the flexibility to quickly turn a feature on or off when it might be negatively affecting your user base. You should automate your kill switch if you can confidently correlate a change in metrics with your feature flag state and you know that turning the flag off won’t make things worse.
Feature flags provide you with the tools you need to build and deploy safer and faster!