Anyone can have a bad day. Take, for example, the top-performing pitcher in professional baseball. He can be unhittable for most of the season and then have a single game where he just can't find the strike zone, or where the opposing team hits and scores with ease. Yesterday the CDN provider Cloudflare had a bad day: they experienced a service outage that resulted in about 30 minutes of downtime for customers. In baseball, the root of a bad day is often mental, which is not easily avoidable. In technology, we stand a better chance of avoiding a bad day through testing and through the technology we choose to implement. In this post, I highlight the root cause of Cloudflare's outage, but more importantly, I discuss the risks that come with the cloud choices we make.
The Outage
As described in Cloudflare's blog, a global software deploy resulted in what they called a "massive spike in CPU utilization". This overconsumption of CPU crippled their ability to process requests, effectively creating a denial of service (DoS). As a result, users trying to access sites fronted by Cloudflare received a 502 Bad Gateway error response.
The Cause
The software deploy that caused the outage was an update to the Cloudflare Web Application Firewall (WAF). Specifically, it was a particular WAF rule; as an update to the blog states, "The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules." The rule contained an erroneous regular expression that caused the CPU spike. Regular expressions are sequences of characters that define a pattern for searching text. WAF technologies traditionally use them to search application inputs for patterns that match the signature of an attack payload. Regular expressions can be very complex, and complexity requires CPU to process: the more complex the expression, the more CPU required. One particular risk with complex regular expressions is Regular Expression Denial of Service (ReDoS). A regular expression susceptible to ReDoS can take an extremely long time to evaluate, consuming CPU time and degrading system performance. It is not clear whether this was the exact problem for Cloudflare, but certainly something similar occurred.
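To make the ReDoS risk concrete, here is a minimal sketch in Python. The pattern below is a textbook example of catastrophic backtracking, not Cloudflare's actual rule, and the input lengths are illustrative only.

```python
import re
import time

# A classically vulnerable pattern: the nested quantifiers force the engine
# to try an exponential number of ways to split the input when the match
# fails at the very end. (Illustrative only; not Cloudflare's actual rule.)
vulnerable = re.compile(r"^(a+)+$")

for n in (18, 22, 26):
    payload = "a" * n + "!"  # the trailing "!" guarantees a failed match
    start = time.perf_counter()
    vulnerable.match(payload)
    elapsed = time.perf_counter() - start
    # Evaluation time roughly doubles with each extra "a" in the input.
    print(f"input length {n + 1}: {elapsed:.3f}s")
```

Each additional character roughly doubles the work, which is how a single rule can pin CPUs across an entire fleet once it meets the wrong input.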
Before continuing, I want to emphasize that this post is not a criticism of Cloudflare. While I am employed by Signal Sciences, a competitor to Cloudflare WAF, I am also a user of Cloudflare's free tier on my personal web site projects. I don't use their WAF, though; I use Signal Sciences as my WAF. In my opinion, incidents happen, and how a company responds is what matters. Cloudflare communicated and resolved the issue quickly, and posted informative updates on the incident. To me, this seems to be a good response so far given the unfortunate circumstances. The last piece of the response will be the changes they make to avoid a repeat of the incident. Regardless, the point of this post is that anyone can have a bad day, so let's turn to the more thematic issues.
The Root Cause
Taking a step back, there is a broader issue that I submit as the root cause of the incident: the WAF technology itself, or more precisely, legacy WAF technology. The primary attack detection mechanism in legacy WAF solutions is regular expressions. The result is a security control with a high risk of false positives, which can break applications when run in blocking mode or overload analysts with fruitless alerts. In addition, complex regular expressions become very difficult to maintain over time and, in the worst case, introduce the risk of a ReDoS. There is an unfortunate trade-off to be made: either dumb the rules down to avoid headaches and risk, or invest heavily in time and resources to ensure effectiveness and efficiency and, hopefully, reduce outage risk. Given today's modern web, business drivers, and threats, this is not a trade-off a security team should be forced to wrestle with. In the long term, it's simply unaffordable. Unfortunately, Cloudflare, like nearly every other WAF technology, is primarily based on the legacy approach of regular expressions.
Another piece of data that underscores the risk is the set of recent CVEs related to ReDoS in the ModSecurity Core Rule Set project, five CVEs in total. ModSecurity has been the foundational technology in nearly all legacy WAF products. If you are using a WAF product today and are not sure what technology it is based on, ask your provider. If the answer includes ModSecurity or regular expressions, be aware of the risks. Overall, this is not a great story for a security control.
Cloud Outages
Another component of risk to acknowledge with this incident, and others like it, is outages of cloud services. Now more than ever, our businesses depend on cloud providers and their services; an outage for them translates to an outage for us. That dependency is only going to increase and become more critical. On the surface, I think most users of the Internet, and of our sites and applications, have come to terms with the possibility of an outage and even accept it to a degree as a fact of life. While such an incident is embarrassing for the provider and inconvenient for users, any outrage will subside and the incident will be forgotten in a few months, if not a few weeks. For a 30-minute outage like Cloudflare's, this will likely be the case.
However, as a business dependent on cloud services, you need to understand the impact from a cost and opportunity loss perspective. Understand the risk. What are some of the impacts to consider when the risk is realized? It depends on the context of your online business, and you should understand what is important to it. For a low-traffic web site, 30 minutes may not be a big deal. But what about an hour, two hours, six hours, or more? What is your tolerance for downtime? For a high-traffic web site, 30 minutes can certainly be a big deal, and anything beyond that only compounds the impact.
A few specific examples to think about are:
New user registration. If a user attempts to create a new account on your site but can't due to an outage, they may go to a competitor. To help measure this risk you'll need to track the average frequency of new user registrations over time. A simple example could be an average of 25 new user registrations per day.
Financial transactions. Specifically purchases, or anything else that impacts your bottom line. To help measure this risk you'll need to track the average number of transactions over time and, to be even more specific, the average value of those transactions. A rough calculation combining both examples is sketched below.
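As a back-of-the-envelope sketch, here is one way to turn those tracked averages into an expected outage impact. The function and the transaction numbers (1,000 transactions per day at an average of $40) are hypothetical, and it assumes traffic is spread evenly across the day, which real traffic rarely is.

```python
def estimate_outage_impact(outage_minutes, registrations_per_day,
                           transactions_per_day, avg_transaction_value):
    """Rough expected loss for an outage, assuming evenly distributed traffic."""
    fraction_of_day = outage_minutes / (24 * 60)
    return {
        "lost_registrations": registrations_per_day * fraction_of_day,
        "lost_transactions": transactions_per_day * fraction_of_day,
        "lost_revenue": transactions_per_day * fraction_of_day * avg_transaction_value,
    }

# The post's 25 registrations/day figure plus hypothetical transaction numbers
print(estimate_outage_impact(outage_minutes=30,
                             registrations_per_day=25,
                             transactions_per_day=1_000,
                             avg_transaction_value=40.0))
```

Even a crude model like this gives you a concrete number to weigh against the cost of mitigation.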
These are two high-impact examples, but your business applications may have more. From these examples, there are two primary points I'd like to highlight: the first is on mitigation, and the second is on instrumenting your applications so you can know and understand the impact of an outage.
Mitigating the impact of a cloud outage is not so different from what you may already have planned for disaster recovery. The key is having a plan. In the cloud world, this could mean implementing a multi-cloud solution for your application deployments. The question to ask yourself is: can your deployments on cloud provider A be seamlessly and quickly deployed to cloud provider B? That means having accounts and provisioning already established, along with the orchestration tools that enable the seamless move. Certainly, a multi-cloud solution adds cost to your cloud plans, but that cost should be weighed against your tolerance for an outage.
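As a minimal sketch of what that plan can look like in code, here is a health-check loop that triggers a failover to a standby deployment when the primary stops responding. The URL and the fail_over_to_standby hook are hypothetical placeholders; in a real setup that hook would drive your DNS, load balancer, or orchestration tooling.

```python
import time
import urllib.request
import urllib.error

# Hypothetical endpoint and thresholds; substitute your own health checks.
PRIMARY_HEALTH_URL = "https://provider-a.example.com/healthz"
FAILURE_THRESHOLD = 3          # consecutive failures before failing over
CHECK_INTERVAL_SECONDS = 30

def is_healthy(url, timeout=5):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def fail_over_to_standby():
    """Placeholder: point DNS or your load balancer at provider B."""
    print("Primary unhealthy -- triggering failover to provider B")

failures = 0
while True:
    if is_healthy(PRIMARY_HEALTH_URL):
        failures = 0
    else:
        failures += 1
        if failures >= FAILURE_THRESHOLD:
            fail_over_to_standby()
            break
    time.sleep(CHECK_INTERVAL_SECONDS)
```

The hard part is not the loop itself but everything behind the placeholder: pre-provisioned accounts, synchronized data, and orchestration that can actually stand the application up on provider B.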
My second point, on the instrumentation of applications, is about measuring and understanding what your tolerance for an outage may be. The capability to instrument your applications has far more benefits than measuring the frequency of user registrations, logins, or high-risk transactions; there are significant threat detection and mitigation benefits to be gained as well. For the purposes of this post, though, instrumentation gives you visibility into what is important to your business, along with the data behind it. Without that data, you cannot assess your risk accurately, which is necessary to plan appropriately for cloud outage mitigation.
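A minimal sketch of that kind of instrumentation, assuming a small in-process counter rather than a full metrics pipeline (the event names and the record_event helper are hypothetical):

```python
from collections import Counter
from datetime import datetime, timezone

# In-process event counts; in production you would ship these to your
# metrics or logging pipeline instead of keeping them in memory.
event_counts = Counter()

def record_event(name, **details):
    """Count a business-level event and emit a structured log line."""
    event_counts[name] += 1
    print({"ts": datetime.now(timezone.utc).isoformat(), "event": name, **details})

# Hypothetical call sites inside your application code
record_event("user_registration", plan="free")
record_event("purchase_completed", value=42.50)

print(event_counts)
```

A sudden drop in these baselines during an incident tells you, in business terms, what the outage is actually costing.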
Conclusion
Outages happen! They are a fact of cloud life. Cloud providers will continue to improve service robustness, so we can expect outages to become rarer over time. However, that doesn't mean it is appropriate, from a risk perspective, to rely on cloud providers alone to ensure the robustness of our businesses online. That is your responsibility to plan for. If you are using a legacy WAF technology, consider a modern WAF such as Signal Sciences, which helps eliminate the risks associated with regular expression based attack detection. In addition, plan for cloud resiliency, which might mean a multi-cloud solution. Finally, instrumentation of applications is crucial to gaining the visibility required to understand business risk and defend applications in the modern web.