<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sven Delmas</title>
    <description>The latest articles on DEV Community by Sven Delmas (@sdelmas).</description>
    <link>https://dev.to/sdelmas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F393798%2F5ae27a5f-b956-4155-ae4d-fb948916b856.jpeg</url>
      <title>DEV Community: Sven Delmas</title>
      <link>https://dev.to/sdelmas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sdelmas"/>
    <language>en</language>
    <item>
      <title>Postmortem of Incident on 08 June 2020</title>
      <dc:creator>Sven Delmas</dc:creator>
      <pubDate>Thu, 09 Jul 2020 23:18:40 +0000</pubDate>
      <link>https://dev.to/logdna/postmortem-of-incident-on-08-june-2020-hnm</link>
      <guid>https://dev.to/logdna/postmortem-of-incident-on-08-june-2020-hnm</guid>
      <description>&lt;p&gt;We encountered four interwoven bugs that caused a degradation of service in one of our production instances. As the bugs have particularly interesting implications and some significant lessons learned, we thought it would be prudent to explain the situation to the community to help the next person who may encounter one of these problems. The majority of the problems we encountered were external in nature, and we’re working toward mitigating the impact of future external problems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: All times listed are in UTC.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On 08 June 2020 at 10:35, our on-call SRE received an alert that our production instance needed attention. Investigation began immediately.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;What Happened&lt;/h2&gt;

&lt;p&gt;When addressing an unusual problem with a spike in incoming data, we discovered four distinct but interconnected bugs that compounded the effects of the data spike. Our Elasticsearch clusters were exhibiting high thread pool counts that exacerbated the impact of the spike, and when we attempted to increase our hardware allocation, we ran into three external bugs with our disks that triggered index corruption issues, further compounding the problem. While identifying mitigations and workarounds for two of the external bugs, we found a way to adjust our system to handle data spikes better and address the high thread pool counts in the process.&lt;/p&gt;

&lt;h3&gt;Detailed Times (in UTC)&lt;/h3&gt;

&lt;dl&gt;
&lt;dt&gt;2020-06-08 ~07:00&lt;/dt&gt;
&lt;dd&gt;A customer’s incoming data started spiking. This spike didn’t start impacting other systems until our first alert at 10:35.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-08 10:35&lt;/dt&gt;
&lt;dd&gt;We received an alert, which kicked off our incident management process.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-08 11:31 - 2020-06-09 14:58&lt;/dt&gt;
&lt;dd&gt;We found our first indication of a “noisy neighbor” problem: one customer’s spike in incoming data was so large that other clusters were affected as the load rebalanced. Our typical mitigation strategy wasn’t working well, and we started investigating why. One issue we identified was that our prioritization strategies were not working properly.&lt;br&gt;&lt;br&gt;
We were also getting a number of noisy alerts about pod counts that briefly sent us down the wrong path.&lt;br&gt;&lt;br&gt;
Elasticsearch high thread pool issues arose that were unlike any we had seen before. We continued to mitigate with manual intervention to keep the system running while we tried to identify the underlying issue.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-09 15:30 - 2020-06-10 02:00&lt;/dt&gt;
&lt;dd&gt;We started noticing Elasticsearch pods on various clusters had high thread pools and were constantly restarting. We attempted to fix it by clearing the Elasticsearch cache and adjusting how different components could request CPU resources to free up those resources for Elasticsearch. Neither approach worked, and we continued to manage the environment manually while searching for a cause.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-10 05:00 - 16:45&lt;/dt&gt;
&lt;dd&gt;We provisioned new hardware for our backend systems to scale up and manage the load better. While setting up the new hardware, we continued to manage the environment and kept hitting the thread pool problem.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-10 16:45 - 2020-06-11 04:45&lt;/dt&gt;
&lt;dd&gt;When scaling up the new hardware, we discovered a bug, which we’ll call here a group label bug, in a third-party provider that blocked connections from Elasticsearch to storage volumes provisioned on the new hardware. We engaged with the third-party vendor and began to work around the problem manually.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 04:45 - 09:30&lt;/dt&gt;
&lt;dd&gt;While still working around the group label bug, we observed Elasticsearch pod restarts across the clusters, seemingly at random, making the Elasticsearch recovery even more difficult. We continued engaging with the third-party vendor while managing the environment as best we could.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 09:30&lt;/dt&gt;
&lt;dd&gt;We identified patterns in the pod restarts. The restarts did not only affect Elasticsearch pods; all disk-backed pods on one hardware node would restart simultaneously, then another random node would proceed through the same restart cycle. This was our first suspicion of a second bug related to the third-party provider.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 09:35&lt;/dt&gt;
&lt;dd&gt;We confirmed that we encountered a bug with Stork, which our third-party storage provider uses for orchestration. A Stork healthcheck that ensures the storage driver is running on a node killed disk-backed pods on any nodes that did not seem to be running that storage driver. Despite these nodes actually having Portworx (our storage driver) running, the healthcheck didn’t identify the running instance and therefore killed all of the pods on the node. We disabled the healthchecks to stop the pods from crashing while the core devs from the Stork project started digging.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 13:00 - 18:15&lt;/dt&gt;
&lt;dd&gt;We continued to manage the environment as it started to stabilize a bit after the healthcheck bug was identified and mitigated, and we continued to work around the group label bug manually to scale up. We still were seeing high thread pool issues, which slowed down Elasticsearch. We decided to lock all Elasticsearch indices from past days and stop shard assignment from all days other than the current day.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 19:15 - 20:00&lt;/dt&gt;
&lt;dd&gt;We disabled thin provisioning, which is enabled by default, on our nodes to manage our resources in a more controlled fashion.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-11 22:30 - 2020-06-12 02:00&lt;/dt&gt;
&lt;dd&gt;We found a strong correlation between high merge counts and high thread pool queue counts in our Elasticsearch pods. While we tried to identify why we were dealing with high thread pool issues, we continued to manage the environment.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 02:21 - 04:50&lt;/dt&gt;
&lt;dd&gt;While trying to manage the environment, we observed a sudden drop in our deferred backlog and realized that some of the batches in the deferred queue hit our 18-hour retention limit, causing them to get deleted from disk. We immediately worked on the data loss and engaged with our Elasticsearch experts to understand how we might mitigate the data loss.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 05:00 - 08:00&lt;/dt&gt;
&lt;dd&gt;We hit a setback with the group label bug where nodes rebooted and reset back to the state where the bug prevented volume mounting, so we had to re-decommission the nodes. Meanwhile, we worked with our Elasticsearch experts to handle the thread pool issue while continuing to manage the environment and investigate the data loss.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 08:32&lt;/dt&gt;
&lt;dd&gt;We believed we had gotten the situation under control and had updated the status page accordingly.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 09:30 - 12:15&lt;/dt&gt;
&lt;dd&gt;While monitoring the environment, since the incident remained open until all Elasticsearch clusters had fully recovered, the on-call SRE noticed some Elasticsearch pods living on the same hardware node crash-looping with an Elasticsearch error about index corruption. This set of errors kicked off more investigation and another round of environment management.&lt;br&gt;&lt;br&gt;
We discovered that one of the nodes was missing primaries on a volume that showed corruption errors. As a result, we engaged with our third-party vendor to see if we were facing disk corruption.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 13:00 - 18:00&lt;/dt&gt;
&lt;dd&gt;We thought we had encountered an NVMe bug that corrupted the data and started digging to confirm or refute that theory. By 15:37, we were able to confirm that the NVMe bug was inapplicable to our environment and therefore a red herring. Meanwhile, we continued to manage the environment as the initial noisy neighbor hit again and the thread pool issue persisted, and we continued all of the previous investigations with the assistance of our Elasticsearch experts.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-12 18:30 - 20:15&lt;/dt&gt;
&lt;dd&gt;We found and started allocating stale primaries for all of the remaining unassigned shards in an attempt to retrieve lost data. The nodes holding the stale primaries had not been touched before the data corruption began, so they were less likely to be corrupted.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 02:10&lt;/dt&gt;
&lt;dd&gt;We found the same index corruption error on a different hardware node and its Elasticsearch pods.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 ~02:20 - 03:00&lt;/dt&gt;
&lt;dd&gt;We started to provision more nodes to handle increased load and allocated more stale primaries on certain clusters to recover data. Some customers on these clusters experienced a degradation in service of the search function while this action was performed.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 03:15&lt;/dt&gt;
&lt;dd&gt;We discovered one corrupted Elasticsearch pod that had a few primary shards from unreplicated indices, so there was no replica shard available to use to recover lost data.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 03:25 - 04:50&lt;/dt&gt;
&lt;dd&gt;We released a set of hotfixes to handle this new version of the noisy neighbor problem. We immediately began seeing significant improvement in the clusters, which we later attributed to the hotfix.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 05:31 - 05:49&lt;/dt&gt;
&lt;dd&gt;The new nodes were ready and bootstrapping began.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 06:07&lt;/dt&gt;
&lt;dd&gt;We again believed we had the situation under control and updated the status page accordingly.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 06:38&lt;/dt&gt;
&lt;dd&gt;A final hotfix was released to handle the last of the edge cases.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 10:57 - onward&lt;/dt&gt;
&lt;dd&gt;We shifted to monitoring the incident rather than actively managing the environment, as the last of the hotfixes appeared to be working. We moved toward a stronger focus on recovery.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 18:18&lt;/dt&gt;
&lt;dd&gt;A final update to the status page was added to mark this incident as resolved.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 21:19&lt;/dt&gt;
&lt;dd&gt;We ran into a similar thread pool issue, but it didn’t seem to have the same effect as before with the new hotfix.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-13 21:56&lt;/dt&gt;
&lt;dd&gt;The on-call SRE observed that the increased speed from the hotfix caused a change in our deferred batch processing. As the problem seemed to be resolving itself fairly quickly, the bug was considered low severity and was relegated to the development backlog.&lt;/dd&gt;
&lt;br&gt;
&lt;dt&gt;2020-06-15 15:56 - 17:14&lt;/dt&gt;
&lt;dd&gt;After some further monitoring during a particularly busy period that was similar to the trigger of the initial event, we decided the hotfix was holding and the incident was fully resolved in our incident management process.&lt;/dd&gt;
&lt;br&gt;
&lt;/dl&gt;

&lt;h2&gt;Key Factors&lt;/h2&gt;

&lt;p&gt;First, a quick general note: Elasticsearch is built on top of the Apache Lucene library,&lt;sup id="fnref1"&gt;1&lt;/sup&gt; and many of the following discussions refer to the underlying Lucene processes and how they affected our Elasticsearch clusters.&lt;/p&gt;

&lt;h3&gt;The “Noisy Neighbor” Problem&lt;/h3&gt;

&lt;p&gt;In the past, we have handled the issue of one customer generating a data spike through a semi-automated system that shifts resources to absorb the spike. Normally, such a spike would be at most a small blip on the radar, requiring minimal intervention to offset. While this factor was mainly the trigger that kicked off the incident, it led us to add a more permanent, code-based solution to our architecture that reduces the need for manual intervention in similar future cases.&lt;/p&gt;

&lt;h3&gt;Merge Threads Problem&lt;/h3&gt;

&lt;p&gt;Lucene has the concept of an index, which is essentially a specialized data store that holds documents: objects that hold data in a standardized format. These indices are further broken down into segments, or sub-indices, which are immutable once created. As more data is added to the overall index and more segments are generated, Lucene merges smaller segments together into larger ones.&lt;sup id="fnref2"&gt;2&lt;/sup&gt; Elasticsearch’s shards are essentially Lucene indices, so Elasticsearch shards also inherit this merge behavior. Elasticsearch also uses the thread pool design pattern throughout its architecture to manage functions like bulk insertion of data into shards and flushing indices to disk.&lt;/p&gt;
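
&lt;p&gt;The merge behavior can be sketched in a few lines of Python. This is a deliberately tiny model, not Lucene’s actual tiered merge policy: documents flush into small immutable segments, and once enough segments pile up, the smallest are combined into one larger segment.&lt;/p&gt;

```python
# Toy model of Lucene-style segment merging (illustration only; real
# Lucene uses a tiered, size-aware merge policy).

MERGE_FACTOR = 4  # merge whenever this many segments accumulate

def merge(segments):
    """Combine the smallest segments into one while too many exist."""
    while len(segments) >= MERGE_FACTOR:
        smallest = sorted(segments)[:MERGE_FACTOR]
        for size in smallest:
            segments.remove(size)
        segments.append(sum(smallest))  # one larger, merged segment

def add_document(segments):
    """Flushing a new document creates a new, immutable one-doc segment."""
    segments.append(1)
    merge(segments)

segments = []
for _ in range(16):
    add_document(segments)
print(segments)  # [16]: 16 docs collapsed into a single merged segment
```

&lt;p&gt;Under a data spike, segments are created and merged much faster, which is where the merge-thread pressure comes from.&lt;/p&gt;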

&lt;p&gt;In our case, we were constantly hitting high thread pool counts while data was being written, which caused significant lag. Generally, high thread pool counts accompany the noisy neighbor problem, with significant spikes of data incoming to the system. Here, the coupling of the noisy neighbor problem with the external factors we were dealing with made the issue appear in an unusual form. The fact that the hotfix reduced the impact of high thread pool counts on our architecture bolsters this argument.&lt;/p&gt;

&lt;p&gt;Incidentally, the thread pool spike could also account for the initial impact on search when the data spike occurred. High thread pool counts on merge can be a symptom of a large number of segments being created all at once as large amounts of data are added to the index, and that rapid turnover of memory cache can cause invalidation of search results and therefore degradation of search availability.&lt;/p&gt;
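
&lt;p&gt;A fixed-size thread pool backing up under a burst can be illustrated with a back-of-the-envelope simulation. This is a sketch of the queueing dynamics with hypothetical numbers, not Elasticsearch’s actual thread pool implementation:&lt;/p&gt;

```python
# Minimal queueing sketch: when work arrives faster than a fixed pool of
# workers can drain it, the pending queue (and thus latency) grows without
# bound. All numbers are illustrative.

def simulate(pool_size, arrivals_per_tick, completions_per_worker, ticks):
    """Return the pending-queue depth after each tick."""
    capacity = pool_size * completions_per_worker  # max drained per tick
    depth, history = 0, []
    for _ in range(ticks):
        depth = max(0, depth + arrivals_per_tick - capacity)
        history.append(depth)
    return history

# A spike of 10 tasks/tick against 4 workers draining 2 tasks each:
print(simulate(4, 10, 2, 5))  # [2, 4, 6, 8, 10]: a steadily growing backlog
print(simulate(4, 8, 2, 5))   # [0, 0, 0, 0, 0]: arrivals match capacity
```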

&lt;h3&gt;External Concerns&lt;/h3&gt;

&lt;h4&gt;Index Corruption and Storage Timeouts&lt;/h4&gt;

&lt;p&gt;After some investigation, the index corruption issue we saw on our longer-term storage systems seemed to stem from a connection timeout to our third-party storage system, and this symptom does not seem to have been isolated to our systems.&lt;sup id="fnref3"&gt;3&lt;/sup&gt; The underlying cause was out of our scope, but the technical recovery process is worth describing separately.&lt;/p&gt;

&lt;p&gt;Elasticsearch replicates data across a cluster, creating primary and secondary (replica) shards on different nodes.&lt;sup id="fnref4"&gt;4&lt;/sup&gt; The source of truth for a document is the primary shard. Generally, whatever happens on the primary shard is then replicated to the secondary shards, and new secondaries can be created or destroyed at any time. In our case, the secondaries are created over time, leaving a trail of “stale” copies of the data until those secondary shards are eventually reassigned and wiped. The newer secondaries had the same data corruption issues as the failed primaries. However, the stale secondaries had the majority of the data from the primary shard without the corruption problem, and they luckily had not yet been wiped. We were able to manually promote the stale secondaries to primaries, buffering them to memory and then flushing them to disk. In doing so, we recovered nearly all of the missing data from the corrupted indices.&lt;/p&gt;
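
&lt;p&gt;Mechanically, promoting a stale copy goes through Elasticsearch’s cluster reroute API and its &lt;code&gt;allocate_stale_primary&lt;/code&gt; command. A sketch of building such a request body follows; the index and node names are hypothetical, and &lt;code&gt;accept_data_loss&lt;/code&gt; must be set explicitly because a stale copy may be missing recent writes.&lt;/p&gt;

```python
# Sketch of an allocate_stale_primary request body for Elasticsearch's
# POST /_cluster/reroute API. Index and node names are hypothetical.
import json

def allocate_stale_primary(index, shard, node):
    """Build a reroute command promoting a stale shard copy to primary."""
    return {
        "commands": [{
            "allocate_stale_primary": {
                "index": index,
                "shard": shard,
                "node": node,
                "accept_data_loss": True,  # explicit acknowledgment required
            }
        }]
    }

body = json.dumps(allocate_stale_primary("logs-2020.06.11", 0, "es-data-7"))
print(body)  # POST this body to the cluster's /_cluster/reroute endpoint
```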

&lt;h4&gt;Other External Bugs&lt;/h4&gt;

&lt;p&gt;Two other factors involved external sources where we ran into unique bugs that have since been reported and solved by third parties. As these bugs belong to those third parties, we won’t spend too much time explaining them here. One bug involved labeling of groups of volumes in Kubernetes-based clusters.&lt;sup id="fnref5"&gt;5&lt;/sup&gt; The other, as noted, was related to the Stork health check call. The health check didn’t recognize that the storage driver was running (or still booting), which caused all of the disk-backed pods to be ejected from a node. For more information on this last bug, we refer you to the &lt;a href="https://github.com/libopenstorage/stork/pull/646"&gt;relevant pull request&lt;/a&gt; on the open-source project.&lt;/p&gt;

&lt;p&gt;In a larger sense, one of the hallmarks of a mature platform is the ability to mitigate external factors swiftly with minimal impact on the customer’s experience. That hallmark is why the industry balances loads across datacenters and encourages rolling updates and restarts across microservices architectures, as just two examples. The aim of mitigation strategies is to handle issues gracefully, such as seamlessly transitioning a user’s session to a different datacenter if the first datacenter’s connectivity falters or preserving a known working function if an update to an external dependency breaks part of an application’s functionality. No one expects that a third party’s systems work all of the time. Instead, we all work off of service-level agreements and uptime statistics and then attempt to have worst-case scenarios planned and ready should something we rely on go down. And we find solid third-party partners that react swiftly to reports that something isn’t working properly.&lt;/p&gt;

&lt;h2&gt;Next Steps&lt;/h2&gt;

&lt;p&gt;We are continuing to monitor and tweak the hotfix we released to production that seems to have addressed our “noisy neighbor” problem once and for all. More optimization will certainly be needed, simply because it was a hotfix released at speed to address a pressing issue in production. Examples include the following potential investigations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improving our cluster balance based on the new metrics around data spikes to ensure no single customer can take over all of the processing power available in a cluster when their incoming data spikes.&lt;/li&gt;
&lt;li&gt;Decoupling our clusters by ensuring our worker pools pause processing on problematic clusters while maintaining throughput on all other clusters.&lt;/li&gt;
&lt;li&gt;Revising our auto-resume functionality. It would automatically attempt to resume problematic clusters in an exponential backoff fashion, but it actually compounded the issue: work from the slow clusters would leak into the worker pool and slow the total worker throughput down. For now, the SRE team will determine when to pause and resume clusters, and we may modify our data pipeline to provide finer throttling tools for the SRE team when there are spikes in data.&lt;/li&gt;
&lt;/ul&gt;
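
&lt;p&gt;For reference, the auto-resume schedule described above looked roughly like the following sketch (parameter values are hypothetical). The schedule itself is standard exponential backoff; the problem was that every attempt let work from still-slow clusters leak back into the shared worker pool.&lt;/p&gt;

```python
# Hypothetical sketch of an exponential-backoff resume schedule: each
# failed attempt doubles the wait, up to a cap. Values are illustrative.

def resume_delays(base_seconds=30, cap_seconds=3600, attempts=8):
    """Delays (in seconds) before each successive resume attempt."""
    return [min(base_seconds * 2 ** n, cap_seconds) for n in range(attempts)]

print(resume_delays())  # [30, 60, 120, 240, 480, 960, 1920, 3600]
```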

&lt;p&gt;With regard to the external factors that complicated this incident, we were generally lucky to get quick engagement from our third-party partners. We have some new internal mitigation techniques to standardize in case we run into similar issues in the future:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we are working to ensure that we are running the same version of our storage driver in every environment. Due to code freezes requested by various business partners and a transition to a different storage driver that is more robust, our environments have gotten out of sync, and we did not have a set way to catch up everything in a safe, controlled manner. We are standardizing that catchup process and adding in additional testing capacity to ensure that any process we do land on does not introduce further errors. We are moving everything to the more robust storage driver, as well, which should address some of the errors we have encountered in other incidents.&lt;/li&gt;
&lt;li&gt;We are adding additional testing capacity to our post-upgrade and post-deployment processes, such as canary processes, to reduce the potential for uncaught or unknown bugs to affect our production environments. This addition will help in the case of unidentified kernel bugs, for example, so we can immediately mitigate and roll back as necessary.&lt;/li&gt;
&lt;li&gt;We are adding additional checks and tooling to ensure that no shard is unreplicated in any of our environments to prevent data loss as we experienced here. While we may still run into similar index corruption issues in the future, replication and better backups will mitigate that external factor. We’re also standardizing the manual secondaries assignment to replicate the process in any necessary future incident.&lt;/li&gt;
&lt;li&gt;We are standardizing the procedure generated during the incident to manage the high thread pool issue in Elasticsearch. By writing up and standardizing this playbook, we aim to ensure anyone on the team can replicate the process efficiently and accurately, addressing issues as soon as they arise rather than letting them build up while we reconstruct how we handled the issue before.&lt;/li&gt;
&lt;li&gt;Finally, we are improving our tooling to ensure faster transparency with our customers when we do encounter incidents that affect them. We aim to ensure that customers who are immediately affected get faster communications and provide better information to our customer support and success teams.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Wrap Up&lt;/h2&gt;

&lt;p&gt;While this incident was mostly compounded by external factors, we learned more about how to mitigate those factors and addressed some underlying infrastructure issues that our mitigation attempts uncovered. Kudos go to our partners, who stepped in immediately to help understand the problem and worked on their respective ends to address the underlying issues we encountered. We were also pleased that our new in-app notification systems for customers worked so well in this first test of our initial action items from the last postmortem.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;See &lt;a href="https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up"&gt;https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up&lt;/a&gt; for a still-relevant, solid dive into how Elasticsearch is built on Lucene. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;See &lt;a href="http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html"&gt;http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;See &lt;a href="https://forums.portworx.com/t/failed-to-mount-volumes-in-pod-http-error-404/445"&gt;https://forums.portworx.com/t/failed-to-mount-volumes-in-pod-http-error-404/445&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;See &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html"&gt;https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;See &lt;a href="https://docs.portworx.com/reference/release-notes/portworx/#2-5-0-2"&gt;https://docs.portworx.com/reference/release-notes/portworx/#2-5-0-2&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>postmortem</category>
    </item>
    <item>
      <title>Postmortem of Root Certificate Expiration from 30 May 2020</title>
      <dc:creator>Sven Delmas</dc:creator>
      <pubDate>Wed, 10 Jun 2020 22:01:00 +0000</pubDate>
      <link>https://dev.to/logdna/postmortem-of-root-certificate-expiration-from-30-may-2020-5d9b</link>
      <guid>https://dev.to/logdna/postmortem-of-root-certificate-expiration-from-30-may-2020-5d9b</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published to &lt;a href="https://logdna.com/postmortem-of-root-certificate-expiration-from-30-may-2020"&gt;our blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We had a partial production outage the weekend of 30 May 2020, and we missed a few things, outlined in the next few sections. Since others may encounter a similar incident in the future, we thought it would be worthwhile to share our experience and help others learn as much as we have. We’re addressing where we failed through better customer communication, improved planning for future workarounds, accelerated completion of our CI/CD improvements, and stronger endpoint alerting.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: All times listed are in UTC.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On 30 May 2020 at 13:04, a customer alerted us in our public Slack that their LogDNA agents suddenly stopped shipping data. Our on-call SRE began an investigation immediately.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;What Happened&lt;/h2&gt;

&lt;p&gt;The certificate chain for our systems relied on the AddTrust External CA Root. That root certificate expired on 30 May 2020 at 10:48, which caused a certificate expiry error that broke the TLS handshake between our systems and some of our customers’ agents and libraries. Any system that attempted a new TLS handshake with our various endpoints using a client with an older certificate authority store or an older TLS implementation failed to connect.&lt;/p&gt;
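
&lt;p&gt;The failure mode is easy to reproduce in miniature: once a certificate’s &lt;code&gt;notAfter&lt;/code&gt; time passes, any chain anchored on it stops validating. A small Python sketch using the expiry date cited above (the seconds value in the timestamp is illustrative):&lt;/p&gt;

```python
# Checking whether a certificate's notAfter timestamp has passed, using
# the expiry of the AddTrust External CA Root cited above (10:48 UTC on
# 30 May 2020; the seconds value here is illustrative).
import ssl
import time

not_after = "May 30 10:48:00 2020 GMT"
expiry_epoch = ssl.cert_time_to_seconds(not_after)

def is_expired(expiry, now=None):
    """True once the clock passes the certificate's notAfter time."""
    return (time.time() if now is None else now) > expiry

print(is_expired(expiry_epoch))  # True on any date after 30 May 2020
```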

&lt;h3&gt;Detailed Times (in UTC)&lt;/h3&gt;

&lt;dl&gt;
&lt;dt&gt;2020-05-30 10:48&lt;/dt&gt;
&lt;dd&gt;The certificate expired as noted.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 12:59&lt;/dt&gt;
&lt;dd&gt;We received our first customer ticket reporting something wasn’t working correctly.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 13:04&lt;/dt&gt;
&lt;dd&gt;We first received word from a customer in our public Slack that their agents stopped shipping data.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 13:13&lt;/dt&gt;
&lt;dd&gt;The SRE on-call finished verifying the customer report and opened an incident.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 13:52&lt;/dt&gt;
&lt;dd&gt;In analyzing the data coming in for ingestion, we misunderstood the impact of the certificate expiration due to a discrepancy between two different datasets. Our data did not show a complete loss of ingestion traffic, so, while the problem was severe, we thought it was isolated to one version of our agent. Based on customer reports, we thought that only the Docker-based image of our v1 agent was affected. As such, we focused on releasing a new v1 agent build and identifying a way for customers running older operating systems to update their certificate lists.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 15:22&lt;/dt&gt;
&lt;dd&gt;We identified the patch for Debian-based systems and attempted to apply the same patch to our v1 agent image. Then, we realized that the problem with the v1 agent was due to how NodeJS manages certificates (described under NodeJS Certificate Management). We focused our efforts on rebuilding the v1 agent with the correct certificate store for both Docker and Debian-based systems.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 19:10&lt;/dt&gt;
&lt;dd&gt;We started validating packages internally before shipping them to customers. We also believed we had identified why ingestion traffic showed no drop: existing agents with established connections did not need to attempt a new TLS handshake.&lt;/dd&gt;
&lt;dt&gt;2020-05-30 19:13&lt;/dt&gt;
&lt;dd&gt;We pushed a new v1 agent build that allowed customers to restart ingestion on most platforms and documented hotfixes for older Debian-based systems in our public Slack. We did not realize our build had failures on AWS due to a hiccup in our CI/CD process.&lt;/dd&gt;
&lt;dt&gt;2020-05-31&lt;/dt&gt;
&lt;dd&gt;We continued to think that the patched image solved the problem and pointed customers to the new 1.6.3 image as needed.&lt;/dd&gt;
&lt;dt&gt;2020-06-01 13:07&lt;/dt&gt;
&lt;dd&gt;We received reports from multiple customers with data that the new 1.6.3 image had issues that would cause CrashLoopBackOff. We discovered a bug in our release chain that prevented us from releasing a new image. We were back at square one.&lt;/dd&gt;
&lt;dt&gt;2020-06-01 15:00&lt;/dt&gt;
&lt;dd&gt;We fixed the bug in our release chain and started generating new packages. An intermittent hiccup in service with our CI/CD provider caused a further delay.&lt;/dd&gt;
&lt;dt&gt;2020-06-01 17:14&lt;/dt&gt;
&lt;dd&gt;We released the new packages and started exploring the option of switching to a new certificate authority. We wanted to remove the need for customer intervention and to avoid issues with how NodeJS vendored OpenSSL (described under NodeJS Certificate Management and Certificate Update Process).&lt;/dd&gt;
&lt;dt&gt;2020-06-01 18:13&lt;/dt&gt;
&lt;dd&gt;We began the process of updating our certificates on testing environments, pushing the changes out to higher environments one by one after testing was complete.&lt;/dd&gt;
&lt;dt&gt;2020-06-01 21:46&lt;/dt&gt;
&lt;dd&gt;We completed a switchover to a new certificate authority and certificate chain. This fix meant all systems except those that would require manual certificate chain installation, such as those using Syslog ingestion, would immediately begin ingestion again without any further customer action. Those systems with manual certificate chain installation were provided with further instructions and pointed to the new chain on the CDN.&lt;/dd&gt;
&lt;/dl&gt;

&lt;h2&gt;Key Factors&lt;/h2&gt;

&lt;h3&gt;NodeJS Certificate Management&lt;/h3&gt;

&lt;p&gt;Along with our libraries and direct endpoints, we have two separate agents we currently support: the v1 agent, written in NodeJS, and the v2 agent, written in Rust. The v1 agent has been kept on an older version of NodeJS to provide compatibility with older operating systems. That older version of NodeJS uses a default list of trusted certificate authority certificates that did not include the new certificate, and NodeJS does not read from the local system’s trusted certificate authority list at all.&lt;/p&gt;

&lt;p&gt;NodeJS, similarly to Java, ships with a bundled list of trusted certificates to ensure the security of TLS calls. Browsers do the same thing except they manage their own lists; NodeJS uses Mozilla’s list as the core developers deferred to Mozilla's "&lt;a href="https://developer.ibm.com/node/2017/03/02/extend-nodes-built-ca-certificate-store-using-node_extra_ca_certs/"&gt;well-defined policy&lt;/a&gt;" to ensure the list stays current. In the past, there &lt;a href="https://github.com/nodejs/node/issues/1256"&gt;were&lt;/a&gt; a &lt;a href="https://github.com/nodejs/node/issues/3159"&gt;number&lt;/a&gt; of &lt;a href="https://github.com/nodejs/node/issues/4175"&gt;calls&lt;/a&gt; for NodeJS to enable teams to add new certificates or otherwise enable better management of the certificate store. As of version 7.3.0 (coupled with LTS 6.10.0 and LTS 4.8.0), the core developers of NodeJS added the ability to include new certificates in that trusted list. Before that release, end-user developers and ops teams would have to recompile NodeJS with their own certificate additions or patches. Coincidentally, the NodeJS community &lt;a href="https://github.com/nodejs/node/issues/33681"&gt;raised a request&lt;/a&gt; to remove the AddTrust certificate and put the proper certificate in, and that fix &lt;a href="https://github.com/nodejs/node/pull/33682"&gt;landed&lt;/a&gt; on 01 June 2020. We discovered this change in the source code itself during the postmortem phase of this incident.&lt;/p&gt;
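
&lt;p&gt;For contrast, some runtimes defer entirely to OpenSSL and the operating system instead of bundling a list. A quick Python aside (a sketch, simply to show where a non-bundling runtime sources its trust):&lt;/p&gt;

```python
# Where CPython (via OpenSSL) looks for trusted CA certificates, in
# contrast to NodeJS's compiled-in Mozilla list.
import ssl

paths = ssl.get_default_verify_paths()
print(paths.openssl_cafile_env, "=", paths.openssl_cafile)  # SSL_CERT_FILE
print(paths.openssl_capath_env, "=", paths.openssl_capath)  # SSL_CERT_DIR

# Extra CAs can also be appended per-context, the same idea NodeJS
# later exposed through its own mechanism for adding certificates:
ctx = ssl.create_default_context()
# ctx.load_verify_locations(cafile="/path/to/extra-ca.pem")  # hypothetical path
```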

&lt;p&gt;NodeJS also ran into an issue with how it vendored OpenSSL. Different versions of NodeJS bundle different versions of OpenSSL, and some older versions of OpenSSL give up upon finding an invalid certificate in a given path instead of trying alternative chains.&lt;/p&gt;

&lt;p&gt;How does that work, exactly? To ensure we’re all on the same page, let’s talk about certificate chains. The basic chain involves three pieces: a root certificate from a certificate authority, an intermediate certificate that is signed by the root certificate, and a leaf certificate that is signed by the intermediate certificate. The root certificates, as noted, are generally built into operating systems, browsers, and other local certificate stores to ensure no one can (easily) impersonate a certificate authority. Leaf certificates are familiar to most technical teams, as they are the certificates a team receives in response to its request to a certificate authority. Intermediate certificates, on the other hand, serve a few functions. The most important is adding a layer of security between the all-powerful root certificate, with its private keys, and the external world. The function most relevant here, however, is bridging between old root certificates and new ones, a practice known as cross-signing: the certificate authority releases two intermediate certificates, one signed by each root, and either one can validate the same leaf certificate.&lt;/p&gt;

&lt;p&gt;Older versions of OpenSSL, specifically 1.0.x and older, have issues with this system where they follow the certificate path up the chain once and then fail if that path leads to an expired root certificate. In newer versions, OpenSSL attempts to follow an alternate chain when one is available, such as in this case where there was an additional cross-signed intermediate available that pointed to the new root. This last issue caused problems when we built our systems on different versions of NodeJS that ship with older versions of OpenSSL, and it also caused problems with other systems because of the next point.&lt;/p&gt;
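&lt;p&gt;A toy model makes the difference concrete. The sketch below is purely illustrative (it is not OpenSSL’s actual path-building code, and the certificate names are stand-ins): a verifier that stops at the first issuer path fails on the expired root, while one that tries alternate chains finds the cross-signed intermediate leading to the valid root.&lt;/p&gt;

```python
# Toy model of certificate-path building; names and structure are
# illustrative only, not OpenSSL's real data model or algorithm.
leaf = {"name": "leaf.example", "issuer": "Sectigo RSA CA"}
intermediates = [
    # Old path: cross-signed intermediate chaining up to the expired root.
    {"name": "Sectigo RSA CA", "issuer": "AddTrust External CA Root"},
    # New path: intermediate chaining up to the still-valid root.
    {"name": "Sectigo RSA CA", "issuer": "USERTrust RSA CA"},
]
roots = {
    "AddTrust External CA Root": {"expired": True},
    "USERTrust RSA CA": {"expired": False},
}

def verify(leaf, intermediates, roots, try_alternates):
    """Follow issuer links from the leaf toward a trusted, unexpired root."""
    for inter in intermediates:
        if inter["name"] != leaf["issuer"]:
            continue
        root = roots.get(inter["issuer"])
        if root and not root["expired"]:
            return True           # reached a valid root: chain verifies
        if not try_alternates:
            return False          # older OpenSSL: first path or nothing
    return False

print(verify(leaf, intermediates, roots, try_alternates=False))  # False
print(verify(leaf, intermediates, roots, try_alternates=True))   # True
```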

&lt;h3&gt;
  
  
  Certificate Update Process
&lt;/h3&gt;

&lt;p&gt;When we updated certificates in the past, we only updated the leaf certificate (the bottommost part of the certificate chain) rather than including the intermediate certificate. Sectigo offered intermediate certificates that cross-signed the AddTrust certificate with the new USERTrust RSA certificate to ensure that older systems remained supported. Since we didn’t add intermediate certificates during our update process, we missed adding the cross-signed intermediate certificate. As noted, this omission caused errors with OpenSSL and GnuTLS on other older systems such as older Debian or Fedora builds, not just our NodeJS build.&lt;/p&gt;
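&lt;p&gt;A quick sanity check can catch this class of mistake. The sketch below uses dummy PEM blocks (stand-ins, not real certificates) and simply counts the certificate blocks in the bundle a server would present; a count of one means the intermediate was omitted, which is exactly the gap described above.&lt;/p&gt;

```python
# Minimal bundle check with dummy PEM blocks (stand-ins, not real certs).
def count_pem_certs(bundle: str) -> int:
    """Count certificate blocks in a PEM bundle."""
    return bundle.count("-----BEGIN CERTIFICATE-----")

leaf_only = (
    "-----BEGIN CERTIFICATE-----\n"
    "...leaf...\n"
    "-----END CERTIFICATE-----\n"
)
# A full chain is just the leaf followed by the intermediate(s), in order.
full_chain = leaf_only + (
    "-----BEGIN CERTIFICATE-----\n"
    "...cross-signed intermediate...\n"
    "-----END CERTIFICATE-----\n"
)

print(count_pem_certs(leaf_only))   # 1 -- intermediate missing
print(count_pem_certs(full_chain))  # 2 -- leaf plus intermediate
```

&lt;p&gt;In a real deployment, the equivalent check would run against the bundle the endpoint actually serves rather than files on disk.&lt;/p&gt;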

&lt;h3&gt;
  
  
  Endpoint Monitoring
&lt;/h3&gt;

&lt;p&gt;Finally, and possibly most importantly, we did not have any external endpoint monitoring to alert us when this certificate chain broke. We ended up relying on customers flagging the incident for us.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Outage and Customer Communications
&lt;/h3&gt;

&lt;p&gt;We weren't as prompt, nor as expansive, as we should have been in communicating with our customers. That is changing immediately, starting with this public postmortem. We are defining the channels of communication customers should expect based on incident severity, and we will implement them as quickly as possible. Here’s the most up-to-date version of our plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low-severity incidents:&lt;/strong&gt; Our customers can expect timely status page updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium-severity incidents:&lt;/strong&gt; In addition to status page updates, we'll deliver in-app notifications for all impacted customers. Note that we are working on this functionality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-severity incidents:&lt;/strong&gt; You can expect the above communications, as well as direct email notice to all impacted customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, if there is an incident, we owe it to every impacted customer to notify them as early as possible and to provide consistent status page updates. Our customers can expect effective communication moving forward.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workarounds and Solutions
&lt;/h3&gt;

&lt;p&gt;We know customers rely on us every day to help with troubleshooting, security, compliance, and other critical tasks. We were not fast enough in providing potential workarounds for the full problem. We are starting work on documenting temporary workarounds for future use so we can respond faster during an incident.&lt;/p&gt;

&lt;p&gt;We know some customers need a backfill solution for security or compliance. We are aware of the problem and are in the extremely early stages of defining solutions and workarounds for future use. Given the current complexities, we do not recommend attempting to backfill missing logs from this recent incident. Some of the difficulties we need to address for a backfill solution are identifying when the problem actually requires backfill, minimizing impact of any potential option on throughput to livetail, reducing duplicate entries, and ensuring that timestamps are preserved and placed in order. For example, any log management system can see occasional delays when coupled with slower delivery from external systems, and it might appear that a batch of log lines was not ingested when, in fact, they are in the pipeline and would appear on their own without any intervention. As these difficulties could seriously warp or destroy data if not handled properly, we are starting with small, noncritical experiments and ensuring that any process we eventually recommend for any workaround our customers need is thoroughly tested, repeatable, and reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Initiatives
&lt;/h3&gt;

&lt;p&gt;We’re improving our CI/CD systems to speed up releases when we need them. We were already in the middle of this very effort, which contributed to the delay in releasing an initial patch, as we discovered issues in the pipeline that we had to fix first.&lt;/p&gt;

&lt;p&gt;Our current monitoring solution does not include external endpoint monitoring. To increase our level of proactive alerting and notification, we have already started talking to vendors to supplement our current monitoring so we are notified the moment any of our endpoints or internal systems goes down. We are also working to identify the right automated solution for checking certificates, one that validates the entire certificate chain rather than just the tail end of it.&lt;/p&gt;
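&lt;p&gt;As a sketch of what such a whole-chain check might look like (assumed monitoring logic, not any specific vendor’s product), the snippet below flags every link in a chain that is expired or close to expiry. In production the &lt;code&gt;notAfter&lt;/code&gt; values would come from the certificates a live endpoint actually serves; here they are hard-coded, with illustrative dates.&lt;/p&gt;

```python
import ssl
import time

# notAfter values for each link in a hypothetical chain; in practice these
# would be read from the certificates a live endpoint presents.
chain_not_after = {
    "leaf": "Jan  1 00:00:00 2031 GMT",
    "cross-signed intermediate": "May 30 10:48:38 2020 GMT",  # long expired
    "root": "Jan  1 00:00:00 2038 GMT",
}

def expiring(chain, now=None, warn_days=30):
    """Return every chain link that is expired or within warn_days of expiry."""
    now = time.time() if now is None else now
    horizon = now + warn_days * 86400
    return [name for name, not_after in chain.items()
            if horizon > ssl.cert_time_to_seconds(not_after)]

print(expiring(chain_not_after))  # only the cross-signed intermediate
```

&lt;p&gt;Checking every link, not just the leaf, is the point: in this incident the leaf was fine and it was an intermediate path that had gone stale.&lt;/p&gt;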

&lt;h2&gt;
  
  
  Wrap Up
&lt;/h2&gt;

&lt;p&gt;While this certificate expiration incident affected multiple software providers over the weekend, we should have detected the issues proactively, we should have been more actively communicating updates during the incident, we should have been faster at identifying and providing workarounds, and ultimately, we should have resolved the issue much faster. Our commitment to you is that we are going to do better, starting with the fixes outlined above.&lt;/p&gt;

</description>
      <category>postmortem</category>
    </item>
  </channel>
</rss>
