Iñaki Villar

Using RSS to Understand Memory Pressure in CI Builds

Once in a while, you may have wondered why builds running on CI agents can still hit out-of-memory (OOM) errors, even on machines with large amounts of memory. For example, how is it possible to hit an OOM on a 32 GB machine even after setting a 16 GB heap?

The first and most immediate answer is that the value configured via `org.gradle.jvmargs` in `gradle.properties` caps only the heap of the Gradle process. From the operating system’s point of view, a JVM process is composed of more than just the heap. Several additional components contribute to the total memory footprint, and these are often overlooked when sizing CI agents or tuning memory limits:

  • Metaspace
  • Code cache
  • Thread stacks
  • Direct buffers
  • GC native memory
  • Native / OS memory

All of these are grouped under the RSS (Resident Set Size) of the Java process on Unix-like systems.
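For example, the usual way to set that 16 GB heap is a single property in `gradle.properties`; every component in the list above sits outside this cap, so the process RSS will always exceed it (the exact values here are illustrative):

```properties
# Illustrative gradle.properties: -Xmx caps only the Gradle daemon heap.
# Metaspace, code cache, thread stacks, direct buffers and GC structures
# all contribute to RSS on top of this value.
org.gradle.jvmargs=-Xmx16g -XX:MaxMetaspaceSize=1g
```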

Another important reason is that the Gradle process is not the only JVM involved in a build. We also have the Kotlin daemon, test JVMs, and in Android builds, additional isolated processes such as Lint or R8. Each of these processes has its own heap and its own RSS footprint. Together, all of them contribute to the total memory pressure on the machine.
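To see this multi-process picture on a live agent, plain `ps` is enough to enumerate every JVM with its RSS (matching on the `java` command name is an assumption; adjust the filter if your daemons run under a different launcher):

```shell
# List every process whose command name contains "java", with PID and RSS in KB.
# NR == 1 keeps the header row; the /java/ filter is an assumption about the launcher.
ps -eo pid,rss,comm | awk 'NR == 1 || /java/'
```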

In OOM scenarios, there is an additional problem: the host machine may kill the Gradle process before the build finishes. When that happens, we lose valuable diagnostic data. In CI environments, and especially in GitHub Actions, this is even worse because we usually cannot attach post-build steps to collect more information.
Post setup actions

Since OOM scenarios are exactly the situations where visibility matters most, I ended up building a GitHub Action for that: Process Watcher.

In this article, we track memory behavior over time across JVM processes, combining RSS, heap usage, and GC activity. The goal is to move beyond static numbers and understand how memory pressure evolves during the build.

Capturing the RSS of a process

To understand real memory usage during build execution, we need to analyze RSS, not just heap size. On Unix-like systems, RSS reflects the physical memory currently held by the process, which makes it a better signal for understanding memory pressure.

To check the RSS of a process:

ps -o rss= -p "$PID"

This command outputs values in kilobytes, for example:

654321

That means ~639 MB of physical RAM.
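The conversion from the `ps` output to that figure is simple arithmetic (the value below is the sample from above):

```shell
# Convert an RSS sample from kilobytes (as printed by ps) to megabytes.
rss_kb=654321
rss_mb=$(( (rss_kb + 512) / 1024 ))   # rounded to the nearest MB
echo "${rss_mb} MB"                   # prints "639 MB"
```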

At first glance, we could collect this data at the end of the build. But this has two problems:

  • Not all processes live until the end
  • The build can be killed or time out

If the build is killed, we lose everything: no data, no diagnostics.

Because of that, I decided to take a different approach: run a separate monitoring process during the build.

Initially, the approach was simple: capture RSS and heap usage during the build and archive the data at the end of the execution. A typical output looked like this:

Elapsed_Time | PID | Name | Heap_Used_MB | Heap_Capacity_MB | RSS_MB
00:00:05 | 149 | GradleDaemon | 29.7MB | 86.0MB | 241.0MB
00:00:10 | 149 | GradleDaemon | 191.7MB | 338.0MB | 560.1MB
00:00:16 | 149 | GradleDaemon | 113.1MB | 198.0MB | 428.4MB
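A minimal version of such a monitor can be sketched as a shell loop (the names, the sampling interval, and the sample cap are illustrative; the heap columns would additionally need a tool such as `jstat`):

```shell
#!/bin/sh
# Sketch of a standalone RSS sampler: polls one PID at a fixed interval
# and prints CSV rows until the process exits or the sample cap is reached.
PID="${1:-$$}"          # process to watch; defaults to this shell for demo
INTERVAL="${2:-1}"      # seconds between samples (the article samples every ~5s)
MAX_SAMPLES="${3:-3}"   # cap so this sketch terminates on its own
echo "elapsed_s,pid,rss_mb"
start=$(date +%s)
i=0
while [ "$i" -lt "$MAX_SAMPLES" ] && kill -0 "$PID" 2>/dev/null; do
  rss_kb=$(ps -o rss= -p "$PID" | tr -d ' ')
  [ -n "$rss_kb" ] && echo "$(( $(date +%s) - start )),$PID,$(( rss_kb / 1024 ))"
  i=$(( i + 1 ))
  sleep "$INTERVAL"
done
```

Running this in the background alongside the build is what keeps the data alive even if the main Gradle process is killed.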

With this, we can visualize the data and calculate the total RSS across all Gradle processes:

Visualizing RSS monitoring

In addition to RSS and heap usage, we can also track cumulative GC time and better understand how memory behaves during the build.

That worked for successful builds, but it had the same limitation: if the main Gradle process was killed, we lost all the data.

To address that, I added a remote mode that publishes the data to a Firebase database, allowing live monitoring even when the build fails or is interrupted.

With that in place, we can now look at some practical scenarios where this kind of visibility helps explain memory behavior in Android builds.

The case of misaligned Kotlin versions

We start with a known suspect. As mentioned in previous articles, this scenario can happen when the Kotlin version embedded in the Gradle distribution is misaligned with the Kotlin version used by the project.

If we run a typical nowinandroid build (:app:assembleProdDebug) and attach our GitHub Actions instrumentation tool, we observe the following memory profile:

Nowinandroid

The image clearly shows two Kotlin processes. The first one, PID 5133, is spawned during the compilation of the included builds and remains unused during the execution phase.

Although its heap usage at the end of the build is only 429 MiB, its RSS footprint accounts for 8.4% of the total RSS memory of the build. In environments closer to the memory limit, for example on a free GitHub Actions runner, this alone can represent around 4% of the available memory.

The key point is that the RSS of this first Kotlin process is never reclaimed, so that memory remains allocated for the entire build. In practice, this reduces the memory available for the rest of the build without providing any benefit during execution.

Timeouts

In the second scenario, we analyze another common case. We have heard several times from users that some builds hit the timeout defined in the job configuration.

These timeouts act as a safeguard against builds running indefinitely, but they also indicate that something is not behaving as expected. When the timeout kills the agent, we lose the Gradle process and any information that would normally be reported at the end of the build.

In some cases, the issue is related to thread locks. In others, it is an unexpected memory situation that can be understood by analyzing memory metrics across the different JVM processes.

For instance, let’s analyze this build:

Timeout build

The total RSS is not hitting the maximum, and the agent is not killed due to memory pressure. Instead, the build is terminated by a timeout.

At first glance, this does not look like a typical OOM scenario. But if we look at how memory behaves over time, a different pattern appears.

One interesting detail is that we observe an almost flat pattern in the later stages of both the Kotlin and Gradle processes. In this case, it is useful to review the GC graph:

GC growth in timeout scenario

The GC activity of the Gradle process shows a clear linear growth over time, which indicates that the heap is under pressure and memory is not being reclaimed efficiently. This is the kind of pattern that may not immediately fail the build, but still keeps it alive in a degraded state until the timeout is reached.

The key point is that if we detect this pattern early, we can stop the build sooner and avoid wasting time and resources. In this example, that could save up to 30% of the build time.

This scenario shows the value of early detection: if we spot this behavior as it develops, we can cancel the build instead of paying the overhead of letting it run to the timeout:

Timeout gif

G1 vs Parallel GC

Another common question is which GC is more suitable. Performance is important, but we should also consider the RSS footprint when comparing different GC strategies.

In many cases, we focus only on build time, but memory behavior can vary significantly between GC implementations. Some may be faster, while others allow the OS to reclaim memory more efficiently.
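One way to run such a comparison is to switch the collector flag between otherwise identical CI runs in `gradle.properties` (the heap size here is illustrative, and G1 is already the default on modern JDKs):

```properties
# Run A: G1 (the default on modern JDKs, made explicit here)
org.gradle.jvmargs=-Xmx6g -XX:+UseG1GC

# Run B: Parallel GC
# org.gradle.jvmargs=-Xmx6g -XX:+UseParallelGC
```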

Let’s look at a G1 vs Parallel comparison of the build:

Compare G1 Vs Parallel

From these measurements, we can observe that, regardless of the performance outcome, the OS is able to reclaim memory more efficiently with G1. That tradeoff can matter in CI environments where staying below the memory threshold is more important than small differences in execution time.

For completeness, here is the cumulative view, left G1, right Parallel:

G1 vs Parallel cumulative GC comparison

The OOM puzzle

Finally, let’s look at perhaps the most valuable use case for this kind of monitoring: an OOM-killed build.

It all starts with this discouraging message in GitHub Actions, where we don’t get any additional feedback and the post steps haven’t been executed:

OOM failure in GitHub Actions

We get no useful feedback, and as mentioned before, we also do not have the chance to run post steps to archive logs, measurements, or any other diagnostic data.

In this case, if we enable the remote mode of Process Watcher, we can at least preserve the latest snapshot of the Gradle processes before the agent kills the container or the build. What we get is this:

RSS before OOM kill

We know that GitHub Actions free runners have a 16 GB memory limit, and in the image we can already see that, before the failure, the Gradle process was increasing its RSS in a clear high-memory-pressure scenario.

In this case, the Gradle heap was configured to 10 GB. One common misconception is to assume that this is enough, or to leave the Kotlin daemon heap unspecified, assuming the defaults will be good enough. But the important detail is that the Kotlin process still contributes its own memory footprint. Looking at the data, we can see that its peak RSS reached 6962.0 MB, adding more pressure on top of the Gradle process.

So it is easy to see how the build gets dangerously close to the machine limit. But the point is not only to spot the problem, it is to make the build work.

What I did here was try different memory splits between the Gradle and Kotlin processes and compare the runs with Process Watcher. Looking at RSS growth, heap usage, GC behavior, and whether the build completed or not gave me a better direction, but it still took some experimentation to find a stable configuration.

In this case, the fix was not just increasing memory. After trying different combinations, the only stable one was 7 GB for Gradle and 3 GB for the Kotlin process; other combinations still ended in OOM or timeout:
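Expressed in `gradle.properties`, that split could look like this (`kotlin.daemon.jvmargs` is the standard Kotlin Gradle plugin property for the daemon heap; the values are the ones found through the experiments above):

```properties
# Stable split found for a 16 GB runner in this experiment
org.gradle.jvmargs=-Xmx7g
kotlin.daemon.jvmargs=-Xmx3g
```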

Optimized build

So for me, the interesting part here is not only the final numbers, but how we got there. By observing the RSS pressure and comparing the runs, we were able to move from an unstable memory profile to a configuration that was sustainable for the runner limit.

Final Words

In my case, for GitHub Actions, I published Process Watcher, but the general idea is simple and can be implemented in different ways. The important part is not the specific tool, but having a way to observe RSS, heap usage, and GC behavior while the build is running. That visibility makes it much easier to understand memory pressure and iterate toward more stable configurations.

One note: to use the visualization tools in Process Watcher, you do not need to enable the remote option. The site provides a replay view and a compare view that can be used with the artifacts generated at the end of the build, without publishing data to Firebase. You can also just download the generated HTML files and open them locally.

Happy Building!