<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ivo Mägi</title>
    <description>The latest articles on DEV Community by Ivo Mägi (@ivomagi).</description>
    <link>https://dev.to/ivomagi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F275918%2F96b4e9a2-1130-4306-9893-bad066fef13b.jpg</url>
      <title>DEV Community: Ivo Mägi</title>
      <link>https://dev.to/ivomagi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ivomagi"/>
    <language>en</language>
    <item>
      <title>Debugging virtualised/containerised environments is hard</title>
      <dc:creator>Ivo Mägi</dc:creator>
      <pubDate>Wed, 04 Dec 2019 12:55:33 +0000</pubDate>
      <link>https://dev.to/ivomagi/debugging-virtualised-containerised-environments-is-hard-onb</link>
      <guid>https://dev.to/ivomagi/debugging-virtualised-containerised-environments-is-hard-onb</guid>
      <description>&lt;p&gt;One of the customers using &lt;a href="https://plumbr.io/product/apm"&gt;Plumbr APM&lt;/a&gt; was recently facing a peculiar issue in their production environment where one of the Docker containers in production exited with the exit code 137. The setup of the environment was fairly simple&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;self-hosted hardware with Ubuntu OS;&lt;/li&gt;
&lt;li&gt;multiple Docker containers, also running Ubuntu OS, on the machine;&lt;/li&gt;
&lt;li&gt;Java Virtual Machines running inside the Docker containers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Investigating the issue led us to the &lt;a href="https://success.docker.com/article/what-causes-a-container-to-exit-with-code-137"&gt;Docker documentation&lt;/a&gt; on the topic. Reading it made clear that the cause for this exit code is either a manual docker stop command or an out-of-memory condition coupled with the subsequent kill signal sent by the kernel’s OOM killer. &lt;/p&gt;
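
&lt;p&gt;As a quick aside, the exit code itself already encodes the cause: shell-style exit codes above 128 mean the process died from a signal, numbered (code - 128). A small Python sketch to decode it:&lt;/p&gt;

```python
import signal

# Exit codes above 128 encode a fatal signal: code = 128 + signal number.
def fatal_signal(exit_code):
    """Return the signal number encoded in a shell exit code, or None."""
    return exit_code - 128 if exit_code > 128 else None

# 137 = 128 + 9, i.e. the process was terminated by SIGKILL -- exactly
# what the OOM killer (or docker stop, after its grace period) sends.
print(fatal_signal(137))                        # 9
print(signal.Signals(fatal_signal(137)).name)   # SIGKILL
```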

&lt;p&gt;Grepping through the syslog confirmed that the OOM killer of the OS kernel deployed on the hardware was indeed being triggered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;138805.608851] java invoked oom-killer: &lt;span class="nv"&gt;gfp_mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0xd0, &lt;span class="nv"&gt;order&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;oom_score_adj&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="o"&gt;[&lt;/span&gt;138805.608887] &lt;span class="o"&gt;[&lt;/span&gt;&amp;lt;ffffffff8116d20e&amp;gt;] oom_kill_process+0x24e/0x3b0
&lt;span class="o"&gt;[&lt;/span&gt;138805.608916] Task &lt;span class="k"&gt;in&lt;/span&gt; /docker/264b771811d88f4dbd3249a54261f224a69ebffed6c7f76c7aa3bc83b3aaaf71 killed as a result of limit of /docker/264b771811d88f4dbd3249a54261f224a69ebffed6c7f76c7aa3bc83b3aaaf71
&lt;span class="o"&gt;[&lt;/span&gt;138805.608902] &lt;span class="o"&gt;[&lt;/span&gt;&amp;lt;ffffffff8116da84&amp;gt;] pagefault_out_of_memory+0x14/0x90
&lt;span class="o"&gt;[&lt;/span&gt;138805.608918] memory: usage 3140120kB, limit 3145728kB, failcnt 616038
&lt;span class="o"&gt;[&lt;/span&gt;138805.608940] memory+swap: usage 6291456kB, limit 6291456kB, failcnt 2837
&lt;span class="o"&gt;[&lt;/span&gt;138805.609043] Memory cgroup out of memory: Kill process 20611 &lt;span class="o"&gt;(&lt;/span&gt;java&lt;span class="o"&gt;)&lt;/span&gt; score 1068 or sacrifice child
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As can be seen from this log excerpt, the Java process was approaching a limit of 3145728kB (around 3GB), triggering the termination of the Docker container. This was peculiar, because the container itself was launched with a 4GB limit in the docker-compose file. &lt;/p&gt;

&lt;p&gt;As you likely know, the JVM also limits its memory usage. While the container was configured with a 4GB limit in the docker-compose file, the JVM was started with -Xmx3g. This might create additional confusion, but note that the memory usage of the JVM at hand can exceed the limit specified by -Xmx, as described in one of our previous posts analyzing JVM memory usage.&lt;/p&gt;
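
&lt;p&gt;To illustrate why, here is a hedged back-of-the-envelope sketch in Python; the component sizes below are made-up defaults for illustration, not measurements from this incident:&lt;/p&gt;

```python
# Rough, illustrative estimate of total JVM process memory: the heap
# (-Xmx) is only one of several pools. All numbers below are hypothetical
# defaults, not measurements from the incident described in this post.
def estimate_jvm_rss_mb(xmx_mb, metaspace_mb=256, threads=200,
                        stack_mb=1, native_overhead_mb=300):
    # heap + metaspace + per-thread stacks + GC/code-cache/native overhead
    return xmx_mb + metaspace_mb + threads * stack_mb + native_overhead_mb

# A JVM started with -Xmx3g can plausibly grow well past 3GB of RSS:
print(estimate_jvm_rss_mb(3072))   # 3828 -- already close to a 4GB cap
```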

&lt;p&gt;Understanding all this still left us confused. The container should be allowed 4GB of memory, so why on earth was it OOM-killed already at around 3GB? Some more googling revealed that there is one more memory limit involved, enforced by the OS deployed directly on the hardware.&lt;/p&gt;

&lt;p&gt;Say hello to cgroups. cgroups (a.k.a. control groups) is a Linux kernel feature to limit, police, and account for the resource usage of a set of processes. Compared to other approaches (the 'nice' command or /etc/security/limits.conf), cgroups offer more flexibility as they can operate on (sub)sets of processes. &lt;/p&gt;

&lt;p&gt;In our situation, the cgroups limited the memory usage (via memory.limit_in_bytes) to 3GB. Now we were getting somewhere!&lt;/p&gt;

&lt;p&gt;Inspecting memory utilization and GC events with Plumbr revealed that most of the time, the memory utilization of the JVM running inside the Docker container was around 700MB. The only exceptions occurred just before the termination events, when memory allocation spiked. The spike was followed by a lengthy GC pause. So what seemed to be going on was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Java code running inside the JVM was trying to allocate a lot of memory.&lt;/li&gt;
&lt;li&gt;The JVM, after checking that there was plenty of room below its 3GB -Xmx limit, asked the operating system for more memory.&lt;/li&gt;
&lt;li&gt;Docker verified that usage was well below its 4GB limit and did not enforce anything.&lt;/li&gt;
&lt;li&gt;The OS kernel saw that the 3GB cgroup memory limit was being approached and killed the Docker container.&lt;/li&gt;
&lt;li&gt;The JVM process was killed along with the Docker container before it could perform its own OutOfMemoryError procedures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding this, we tuned the memory limits for all the involved components: 2.5GB for Docker and 1.5GB for the JVM. After this was done, the JVM was able to run its own OutOfMemoryError procedures and throw the error. This finally enabled Plumbr to do its magic: capture the memory snapshot with the relevant stack dumps and promptly expose a database query that, in a certain situation, tried to download almost the whole database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Take-away
&lt;/h2&gt;

&lt;p&gt;Even in such a simple deployment, three different memory limits were involved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JVM via the -Xmx parameter&lt;/li&gt;
&lt;li&gt;Docker via the docker-compose parameter&lt;/li&gt;
&lt;li&gt;OS via the memory.limit_in_bytes cgroups parameter&lt;/li&gt;
&lt;/ul&gt;
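
&lt;p&gt;The interplay of the three limits above can be sketched as follows; keep in mind that -Xmx caps only the Java heap, while the Docker and cgroup limits cap the whole process:&lt;/p&gt;

```python
# The effective cap on the Java process is the smallest of the three
# limits. -Xmx bounds only the heap, so the process can be killed at the
# cgroup limit even though the heap never exceeded -Xmx -- which is what
# happened in this post.
def effective_limit_gb(jvm_xmx_gb, docker_gb, cgroup_gb):
    return min(jvm_xmx_gb, docker_gb, cgroup_gb)

print(effective_limit_gb(3, 4, 3))      # 3   -- the original, conflicting setup
print(effective_limit_gb(1.5, 2.5, 3))  # 1.5 -- the tuned setup from the post
```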

&lt;p&gt;So, whenever you face a process being killed by the OOM killer, you need to pay attention to the memory limits of all the controls involved. &lt;/p&gt;

&lt;p&gt;Another take-away is for Docker developers. It does not seem to make sense to allow such “Matryoshkas” to be launched, where the nested Docker container memory limit is set higher than the cgroup memory limit. A simple check on container startup, coupled with a warning message, would save hundreds of debugging hours for your users in the future.&lt;/p&gt;

</description>
      <category>java</category>
      <category>docker</category>
      <category>devops</category>
    </item>
    <item>
      <title>Did you know that background TABs in your browser load 20+ times slower?</title>
      <dc:creator>Ivo Mägi</dc:creator>
      <pubDate>Thu, 28 Nov 2019 12:14:13 +0000</pubDate>
      <link>https://dev.to/ivomagi/did-you-know-that-background-tabs-in-your-browser-load-20-times-slower-a3e</link>
      <guid>https://dev.to/ivomagi/did-you-know-that-background-tabs-in-your-browser-load-20-times-slower-a3e</guid>
      <description>&lt;p&gt;Recently we troubleshooted a performance issue, reported by one of the customers of Plumbr who was using our &lt;a href="https://plumbr.io/product/rum"&gt;Real User Monitoring solution&lt;/a&gt;. While investigating the behaviour we stumbled upon a major difference in time it takes to load a web page in background tabs vs the tabs in foreground. &lt;/p&gt;

&lt;p&gt;To quantify this difference, we investigated 1.8 million user interactions in the UI and compared their durations for two subsets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;interactions which loaded fully while being in foreground;&lt;/li&gt;
&lt;li&gt;interactions which loaded partially or fully while being in background.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference was stunning. The load time for interactions in background tabs was 22 to 56 times longer than for interactions in the foreground:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eKRdaym2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://plumbr.io/app/uploads/2019/11/load-time-difference-in-background-tab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eKRdaym2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://plumbr.io/app/uploads/2019/11/load-time-difference-in-background-tab.png" alt="loading time difference in background"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the chart above, we plotted the (partially) background interactions against the fully foreground interactions. Different performance percentiles gave a slightly different view: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the median load time for background interactions was 24 times worse,&lt;/li&gt;
&lt;li&gt;the 90th percentile was 22 times slower,&lt;/li&gt;
&lt;li&gt;the 99th percentile loaded 56 times slower&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;than for the foreground interactions. &lt;/p&gt;

&lt;p&gt;The metric we were investigating was the time from the interaction in the UI (a click on a button, for example) until the last resource fetched from the server as a result of the interaction was downloaded in the browser. So yes, TTLB (time to last byte) is the metric here.&lt;/p&gt;

&lt;p&gt;When we understood the extent of this difference, we started looking for the cause. Apparently, browser vendors have been heavily optimizing resource usage in order to save battery on handheld devices. We discovered at least two such optimizations that impact background tabs:&lt;/p&gt;

&lt;h2&gt;
  
  
  Page load time difference in background: limited parallelism
&lt;/h2&gt;

&lt;p&gt;Browsers set a lower limit on simultaneous resource loading for background tabs than for foreground tabs. For example, Google Chrome allows six parallel requests per server/proxy when the tab is in focus and three when it is in the background. Other browser vendors use different limits; IE 7, for example, used to allow just two parallel requests for foreground tabs, and IE 10 increased this to eight per server/proxy. &lt;/p&gt;

&lt;p&gt;What this means is that only a limited number of requests from the browser are permitted to reach the network stack in parallel. Excess requests are enqueued and executed when a previous request finishes. Thus, all requests will run eventually, but with a delay depending on the number of simultaneous loads permitted and the time it takes to complete the requests.&lt;/p&gt;

&lt;p&gt;To illustrate this behavior, we built a small test case which loads 13 resources from the server. Loading each resource takes one second to complete (a server-side delay simulating a dynamic response). When launching two interactions, one in the foreground and one in the background, we saw the following in the Chrome Developer Tools:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iJ-MX9uq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://plumbr.io/app/uploads/2019/11/six-parallel-requests.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iJ-MX9uq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://plumbr.io/app/uploads/2019/11/six-parallel-requests.png" alt="six parallel connections"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TJkoW09v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://plumbr.io/app/uploads/2019/11/three-parallel-requests.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TJkoW09v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://plumbr.io/app/uploads/2019/11/three-parallel-requests.png" alt="three parallel connections"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first image, the page was loaded in the foreground. In the second image, the page was loaded in a background tab. It is immediately visible that the first instance loads six resources in parallel and is thus able to complete the load in around three seconds, while the second uses just three parallel requests and thus completes the page load in five seconds. &lt;/p&gt;
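
&lt;p&gt;The arithmetic behind these timings can be sketched as a simple wave model, assuming each request takes exactly one second and requests are dispatched in full waves:&lt;/p&gt;

```python
import math

# Load time for n equally slow resources under a parallel-request cap:
# requests are served in waves of `parallel` at a time, so the total time
# is the number of waves multiplied by the per-request duration.
def load_time_s(n_resources, parallel, per_request_s=1.0):
    return math.ceil(n_resources / parallel) * per_request_s

print(load_time_s(13, 6))  # 3.0 -- foreground tab, six parallel requests
print(load_time_s(13, 3))  # 5.0 -- background tab, three parallel requests
```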

&lt;h2&gt;
  
  
  Page load time difference in background: CPU throttling
&lt;/h2&gt;

&lt;p&gt;The second reason for interactions being slower in background tabs is related to how CPU access gets throttled for background tabs. Again, the intentions are good: CPU-heavy background tabs would put a significant burden on battery life. &lt;/p&gt;

&lt;p&gt;Different browser vendors implement this differently. For example, Google Chrome limits timers in background tabs to run only once per second. In addition, Chrome will delay timers to limit the average CPU load to 1% of a processor core when running in the background.&lt;/p&gt;

&lt;p&gt;A small experiment we carried out involved loading and parsing the Angular 1.7.8 JS library and measuring the first contentful paint event on an otherwise empty page. We used the version hosted on the Cloudflare CDN for the experiment. What we ended up confirming is that the page which just loads the library (and does nothing with it afterwards) renders in 200ms in a foreground tab and in 2,200ms in a background tab. &lt;/p&gt;

&lt;h2&gt;
  
  
  Take-away
&lt;/h2&gt;

&lt;p&gt;The fact that background tabs load slower due to browser-specific optimizations is likely not a surprise to our tech-savvy readers. What might be surprising is how heavy such optimizations are: the results show a difference of 22-56x when measuring user interaction duration in foreground versus background tabs.&lt;/p&gt;

&lt;p&gt;Should you be worried about this behavior and start optimizing accordingly? Most likely not: the user experience as such is not really impacted if the slowly loading tab is not in focus for the user. However, it is crucial to understand this massive difference and to be able to exclude such interactions from your performance optimizations as (likely) wasteful. &lt;/p&gt;

&lt;p&gt;In addition, we only exposed two such optimizations in this blog post. There are likely several others; in case our readers are aware of any other limitations in background tabs, let us know in the comments. &lt;/p&gt;

</description>
      <category>javascript</category>
      <category>monitoring</category>
      <category>browser</category>
      <category>optimization</category>
    </item>
    <item>
      <title>APM is good at root cause detection. But there is much more to it.</title>
      <dc:creator>Ivo Mägi</dc:creator>
      <pubDate>Wed, 20 Nov 2019 13:30:54 +0000</pubDate>
      <link>https://dev.to/ivomagi/apm-is-good-at-root-cause-detection-but-there-is-much-more-to-it-1d83</link>
      <guid>https://dev.to/ivomagi/apm-is-good-at-root-cause-detection-but-there-is-much-more-to-it-1d83</guid>
      <description>&lt;p&gt;It is indeed true that faster root cause detection is one of the benefits an APM deployment can unlock. That being true, using just the root cause detection feature in isolation leads to poor decisions – the root cause should be in focus only in situations where the impact of the particular root cause justifies the intervention. To illustrate this, let me walk you through one of the quite typical situations what we see in our daily lives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customers complain that opening monthly reports is “too slow”.&lt;/li&gt;
&lt;li&gt;The product owner reacts to the recurring complaints and assigns the engineers the task of “let’s make the reports faster”.&lt;/li&gt;
&lt;li&gt;The engineer charged with the request digs into the data exposed by the APM, finds a slow database call, and spends a week optimizing the query to speed up the reporting 2x.&lt;/li&gt;
&lt;li&gt;The patched version is released and the complaints about reporting disappear.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is wrong with using APM only for root cause detection?
&lt;/h2&gt;

&lt;p&gt;What was wrong with the approach above? The problem was resolved, so what could have been improved? As it happens, there are multiple flaws in this approach, so let me walk you through them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First and foremost&lt;/strong&gt;, the team does not seem to have any clear performance requirements set. A symptom of this is the missing objective; instead, vague tasks containing phrases like “too slow” or “make it faster” pop up in the backlog.&lt;/p&gt;

&lt;p&gt;Performance happens to be only one of the non-functional requirements an engineering team needs to deal with; usability, security, and availability are examples of others. To have a clear understanding of priorities, the team should understand whether or not to even focus on performance issues. It could very well be that the application is performing well enough and the time should instead be spent on improving other aspects of the product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second&lt;/strong&gt;, the problem reached the engineering team only after a number of customer complaints had reached the product manager. As a result of this delay, days or even weeks passed before the task was made a priority for the engineers. During all this time, more and more users got frustrated with the product. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Last but not least&lt;/strong&gt;, the engineering team focused on improving the aspect the customers were complaining about, which does not always mean spending time on the bottlenecks with the most impact. There might be other aspects of the product that are even slower than the monthly reporting functionality the team was focusing on. As a result, the resources spent towards better performance might not be used efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to improve the situation?
&lt;/h2&gt;

&lt;p&gt;Understanding the problems with the described approach sets the scene for improvement. After all, the first step to recovery is admitting that you have a problem, isn’t it? So let me walk you through how to really benefit from an APM when performance issues occur.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Start by setting the performance objective(s) for the service&lt;/em&gt;. This builds the foundation for success and removes the vague “make it faster” and “too slow” tickets from the issue tracker. Having a clear requirement such as “The median response time must be under 800ms and 99% of the responses must complete in under 5,000ms” builds a clear and measurable target towards which to optimize. &lt;/p&gt;

&lt;p&gt;Knowing where to set the latency thresholds might sound complex. Indeed, the ultimate knowledge in this field is derived from understanding the correlation between performance metrics and business metrics, which is specific to the business the company is running. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For digital media, improving the median latency by 20% increases engagement with content by 6%.&lt;/li&gt;
&lt;li&gt;For e-commerce, the conversion rate in a funnel step increases by 17% if the first contentful paint of the step is reduced from 2,500ms to 1,300ms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gaining this knowledge might not be possible initially as it requires domain-specific data correlation with performance and business metrics. For most companies however, the starting objectives can be set using a simple three-step approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand the status quo by analyzing the current baseline. As a result, you will understand the performance your users are currently experiencing. This performance should be expressed as a latency distribution in percentiles, similar to: the median latency is 900ms and the 99th percentile is 6,200ms. &lt;/li&gt;
&lt;li&gt;Pick a few low-hanging and juicy improvements from the root causes detected by the APM. This typically requires allocating a few man-weeks of the engineering time to tackle 2-3 bottlenecks with big impact, which are easy to mitigate.&lt;/li&gt;
&lt;li&gt;After the improvements are released, measure the performance again using the very same latency distribution. You should be seeing better performance; for example, the same two latency metrics could now be 800ms for the median and 4,500ms for the 99th percentile. &lt;/li&gt;
&lt;/ol&gt;
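
&lt;p&gt;The baselining in step 1 can be sketched in a few lines of Python using the standard library; the input data below is synthetic, for illustration only:&lt;/p&gt;

```python
import statistics

# Baseline the current latency distribution from raw per-interaction
# latencies (milliseconds): report the median and the 99th percentile.
def baseline(latencies_ms):
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"median": statistics.median(latencies_ms), "p99": cuts[98]}

# Synthetic example data, not real measurements: 100ms .. 10,000ms
latencies = [100 * i for i in range(1, 101)]
print(baseline(latencies))   # {'median': 5050.0, 'p99': 9999.0}
```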

&lt;p&gt;Use this new baseline to formalize the objective. Using the numbers from the examples above, the team might agree that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Median response time of the API must be under 850ms for 99% of the time in any given month&lt;/li&gt;
&lt;li&gt;99th percentile of the response time of the API must be under 5,000ms for 90% of the time in any given month.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After going through this exercise, you now have a clear objective to meet and you can continue with the next steps to reap the full benefits of an APM deployment.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Set up alerting.&lt;/em&gt;  The objective here is to use the APM data to be aware immediately whenever the performance objectives are no longer met. This removes the lag in waiting for users to complain and significantly cuts down the time it takes to mitigate the issues when the performance drops. &lt;/p&gt;

&lt;p&gt;The key to successful alerting lies in accepting that alerts should be based on symptoms, not root causes or technical metrics. What this means in practice is that instead of alerts based on increased memory usage or network traffic, you should alert your team based on what the real users experience. APMs (and RUMs) are a great source for such metrics.&lt;/p&gt;

&lt;p&gt;Google, with their Site Reliability Engineering movement, has done a lot to educate the market in this regard, but there is still significant resistance in the field. I can only encourage you to give it a try: have a week where you disable the dozens of different alerts based on technical metrics and replace them with APM- or RUM-based alerts. You will then be alerted only in situations where the throughput is abnormal, the error rate exceeds a particular threshold, or the performance experienced by users decreases. I am willing to bet that you will be pleasantly surprised by the quality of the signal. Brace yourself for significantly fewer false positives and negatives, and gain confidence in your alerts!&lt;/p&gt;

&lt;p&gt;Setting up the alerts in practice is simple. Pick the underlying metric, set the SLO-based thresholds, and stream the alerts to a channel already in use (Slack, PagerDuty, email, …). For performance alerts, expanding on the example used in the previous section, the team can set up alerts for situations where the median response time has exceeded 850ms or the 99th percentile of the response times has exceeded 5,000ms. &lt;/p&gt;
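
&lt;p&gt;The threshold logic such an alert evaluates can be sketched as follows (delivery to Slack or PagerDuty omitted):&lt;/p&gt;

```python
# SLO check matching the thresholds above: alert when the median response
# time exceeds 850ms or the 99th percentile exceeds 5,000ms.
def slo_breached(median_ms, p99_ms, median_slo=850, p99_slo=5000):
    return median_ms > median_slo or p99_ms > p99_slo

print(slo_breached(800, 4500))   # False -- within the objective
print(slo_breached(900, 4500))   # True  -- median threshold breached
```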

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RWnmd1sE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://plumbr.io/app/uploads/2019/11/set-up-alerts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RWnmd1sE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://plumbr.io/app/uploads/2019/11/set-up-alerts.png" alt="Set up alerting" title="Set up alerts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Work on the root causes with the most impact&lt;/em&gt;. In situations where the performance SLO is breached and an alert has been triggered, it is crucial to make sure the engineering time is spent investigating and improving the bottlenecks that contributed the most towards the breach. &lt;/p&gt;

&lt;p&gt;Ironically, this is often not the case. Time and time again I have seen teams stubbornly shaving milliseconds off a particular feature in situations where the real problems are several orders of magnitude larger and located in completely unrelated code sections. Use the power of your APM: before improving anything, make sure the improvements are carried out in the most problematic areas of the source code. &lt;/p&gt;

&lt;p&gt;The best APM vendors make your job in this field trivial by ranking the different bottlenecks based on their impact. As a result, you would have information similar to the following at your fingertips, giving you confidence that the improvements must focus on mitigating the first bottleneck:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bBFIxN2Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://plumbr.io/app/uploads/2019/11/APM-bottlenecks-prioritized.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bBFIxN2Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://plumbr.io/app/uploads/2019/11/APM-bottlenecks-prioritized.png" alt="Prioritize bottlenecks" title="Bottlenecks prioritized by APM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Take-away
&lt;/h2&gt;

&lt;p&gt;Any APM worth adopting will simplify root cause resolution by locating the bottlenecks and errors in your application via distributed traces and additional technology-specific instrumentation. However, using the APM only for root cause resolution is selling it short. Expand how you use the APM to really stay in control of the performance and availability of the digital services you are monitoring. &lt;/p&gt;

&lt;p&gt;Being a strong believer in this, I can encourage you to take Plumbr APM out for a test run. &lt;a href="https://plumbr.io/product/apm"&gt;Grab your 13-day free trial&lt;/a&gt; to enjoy the benefits described above.&lt;/p&gt;

</description>
      <category>apm</category>
      <category>devops</category>
      <category>alerting</category>
      <category>rootcause</category>
    </item>
  </channel>
</rss>
