Deploying into Production: The Need for a Red Light


Stare at the Abyss

As scale and complexity grow, pre-deployment testing yields diminishing returns. No test writer can envision the combinatorial explosion of coincidences that yields calamity. We must accept that deploying into production is the only definitive test.

Embrace the Unknown

What now? Nihilistic dirges are unhelpful. If we can't avoid it, then we must embrace it... or at the very least, plan for it. If we're willing to assume that an unknown problem might deploy and surface in production, how can we prepare?

Limit Exposure

Depending on our stack, we might be able to deploy with feature flags. Feature flags let us organize production users into A/B/X, and give us a very simple way to try newly-deployed code and turn it off. Alternatively, we might canary into A/B/X by node, or by pod, and have an automated rollback procedure.
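As a rough illustration of the feature-flag approach, here is a minimal sketch of a percentage rollout with a kill switch. The names `bucket`, `is_enabled`, `rollout_percent`, and `kill_switch` are hypothetical and not taken from any particular flagging product; a real system would also need to store and distribute the flag state.

```python
import hashlib


def bucket(user_id: str, flag: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) for a given flag.

    Hashing the flag name together with the user id means the same user
    always lands in the same bucket for this flag, while different flags
    split the user population differently.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_enabled(user_id: str, flag: str, rollout_percent: int,
               kill_switch: bool = False) -> bool:
    """Return True if the new code path should run for this user.

    rollout_percent is the share of users exposed (0 to 100).
    kill_switch turns the feature off for everyone at once.
    """
    if kill_switch:
        return False
    return bucket(user_id, flag) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    exposed = sum(is_enabled(u, "new-checkout", 10) for u in users)
    print(f"{exposed} of {len(users)} users see the new code")
```

Because the bucket is derived from a hash rather than stored, raising `rollout_percent` only ever adds users to the exposed group, and setting `kill_switch` removes everyone without a redeploy.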

Observe the Brokenness: a "Red Light"

We still need a way to tell that A is broken: a "red light", of sorts. Instrumentation plays a role here, but we would like our red light to be as general as possible. Just as a tester can't test for every unknown, a developer can't instrument for every unknown. It's tempting to think that as new failure modes are found, we can instrument for them, and that should be enough.

Unfortunately, as our code base grows, so does the surface area of unknowns. We will never catch up with manual effort and reactive improvements alone.

Make Use of What You've Got

We would like to use all the clues and cues available to feed our red light. Let's see what some recent RCAs might offer as ways to think about detecting un-instrumented problems.

Detect Events that Stop Happening

Stripe recently had an outage due to database bugs, combined with a configuration change (https://stripe.com/rcas/2019-07-10). The thoughtful RCA shows that a problem went undetected for some time since DB nodes were responding as up but had stopped sending their replication metrics.

Here, attention might have been drawn to the problem much earlier by viewing these metric updates as a train of roughly periodic events that stopped arriving. This highlights the importance of being able to tell when something regular has stopped happening.
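One simple way to notice that a periodic stream has gone quiet is to compare the time since its last event against its typical spacing. The sketch below assumes each source's event timestamps are available as a sorted list of seconds; the function names and the `tolerance` multiplier are illustrative choices, not anything described in the Stripe RCA.

```python
from statistics import median
from typing import Dict, List, Sequence


def typical_interval(timestamps: Sequence[float]) -> float:
    """Median gap between consecutive events (timestamps must be sorted)."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return median(gaps)


def has_gone_quiet(timestamps: Sequence[float], now: float,
                   tolerance: float = 3.0) -> bool:
    """True if the stream has been silent for much longer than usual."""
    if len(timestamps) < 3:
        return False  # too little history to treat the stream as periodic
    silence = now - timestamps[-1]
    return silence > tolerance * typical_interval(timestamps)


def quiet_sources(streams: Dict[str, List[float]], now: float,
                  tolerance: float = 3.0) -> List[str]:
    """Names of all sources whose events have stopped arriving."""
    return sorted(name for name, ts in streams.items()
                  if has_gone_quiet(ts, now, tolerance))


if __name__ == "__main__":
    streams = {
        "db-1": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        "db-2": [0, 10, 20, 30],  # stopped reporting after t=30
    }
    print(quiet_sources(streams, now=100))
```

Using the median gap keeps a single late event from distorting the baseline. Note that the check needs no knowledge of what the events mean, only that they used to arrive regularly.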

Don't Maintain Regexes (and Do Use Metrics)

Cloudflare recently had an outage due to a bad regex (https://blog.cloudflare.com/cloudflare-outage). They were testing a regex-based rule to scan JavaScript, looking at the rule's false positive rates. The regex induced excessive backtracking (I'm guessing) and pegged a lot of CPUs.

This example highlights the difficulty of creating, curating, and maintaining regexes (the same difficulty that keeps a lot of folks from leveraging the full power of their logs and event streams). It also shows why monitoring metrics such as CPU utilization matters: the pegged CPUs were the visible symptom.

Do Use Logs

Honeycomb recently had an outage due to a missing binary (https://www.honeycomb.io/blog/incident-review-you-cant-deploy-binaries-that-dont-exist/). Long story short, the buildevents tool regressed, so it did not exit with a nonzero code despite build errors.

Here, I imagine that noticing new or exceptionally rare events in the build logs themselves could have provided a red light, perhaps before deployment. But without automatically structuring events and building a dictionary of event types (like Zebrium provides), it would be impossible to do this reliably.
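To make the idea of an event-type dictionary concrete, here is a toy sketch. It is not Zebrium's method: it simply treats any whitespace-separated token containing a digit as a variable, masks it, and counts the resulting templates. The function names and the sample log lines are invented for illustration, and real log structuring needs far more care than this.

```python
from collections import Counter
from typing import Iterable, List


def event_type(line: str) -> str:
    """Reduce a log line to a template by masking tokens that contain digits."""
    tokens = line.split()
    return " ".join("<*>" if any(ch.isdigit() for ch in tok) else tok
                    for tok in tokens)


def build_dictionary(lines: Iterable[str]) -> Counter:
    """Count how often each event type appears in known-good logs."""
    return Counter(event_type(line) for line in lines if line.strip())


def unusual_events(baseline: Counter, new_lines: Iterable[str],
                   min_count: int = 1) -> List[str]:
    """Return lines whose event type was seen fewer than min_count times.

    With the default min_count=1, only never-before-seen types are flagged.
    """
    return [line for line in new_lines
            if line.strip() and baseline[event_type(line)] < min_count]


if __name__ == "__main__":
    past_builds = [
        "compiled module 12 in 340ms",
        "compiled module 13 in 290ms",
        "uploaded artifact build-881.tar.gz",
    ]
    todays_build = [
        "compiled module 14 in 301ms",
        "error: linker exited with status 1",
    ]
    baseline = build_dictionary(past_builds)
    for line in unusual_events(baseline, todays_build):
        print("RED LIGHT:", line)
```

The point of the sketch is that the flagged line is caught because its shape is new, not because anyone wrote a rule for linker errors in advance.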

Our Way

Zebrium's mission is to be the best "red light" possible for production deployments. We acknowledge and embrace the importance of instrumentation, but we insist that an automatically structured understanding of logged events and incidental metrics is required to complete the mission. It's clear that multiple data sources are required, as is higher-level learning and machine interpretation of patterns in such data.

If you're interested in trying our technology, you can get started for free.

Posted with permission of the author: Larry Lancaster, CTO, Zebrium.