Iñaki Villar

Gradle Learning Day: Reinforcement Learning for Build Optimization

This month at Gradle, we had our Learning Day, a day dedicated to exploring new ideas and experimenting with technologies outside our usual work. The theme this time was AI.

While brainstorming ideas, I remembered a video that completely blew my mind, one that I especially loved as a soccer fan:

That led me to think about reinforcement learning, a branch of machine learning where a system learns through rewards and penalties from its actions.

In the build engineering world, there’s always a recurring question: what’s the optimal configuration for a project, for example in terms of heap memory or number of workers? It’s a tough one because the answer depends on many factors, and often the most honest reply is simply, "It depends."

So my idea was: why not use reinforcement learning to help calculate the best build configuration? During Learning Day, I started a small experiment, later expanded it, and that’s what I’m sharing here today.

A Quick Look at Reinforcement Learning

Reinforcement Learning is a type of machine learning where an agent makes decisions by interacting with an environment. Each action gives the agent a reward or a penalty, and over time, the agent learns which decisions lead to better results:
(Diagram: the agent-environment loop of reinforcement learning. Image: https://en.wikipedia.org/wiki/Reinforcement_learning)
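
To make that loop concrete, here’s a minimal sketch in Python. Everything in it is illustrative, with a toy environment standing in for a real build, but it shows the cycle of picking an action, receiving a reward, and updating an estimate of how good each action is:

```python
import random

# Illustrative action space: three hypothetical build configurations.
ACTIONS = ["small_heap", "medium_heap", "large_heap"]

def run_environment(action: str) -> float:
    """Stand-in for the environment. In this article the environment is a
    Gradle build; here it just returns a noisy reward per action."""
    base = {"small_heap": 0.2, "medium_heap": 0.6, "large_heap": 0.5}[action]
    return base + random.uniform(-0.1, 0.1)

# The agent keeps a running estimate of how good each action is.
estimates = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}

for step in range(30):
    action = random.choice(ACTIONS)      # pick an action
    reward = run_environment(action)     # the environment returns a reward
    counts[action] += 1
    # Incremental average: the estimate moves toward the observed rewards.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # after enough steps, "medium_heap" scores highest
```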

Building an RL Framework for Gradle

This was the initial idea: explore whether reinforcement learning could treat Gradle builds as its environment and use performance as the reward signal. Faster builds or lower memory usage would yield a positive reward, while slower or heavier builds would yield a negative reward. From there, the agent could learn which configurations produce the best outcomes.

Beyond the RL approach, I also wanted to understand how to deploy the agent. The goal was to demonstrate the full cycle—from defining an experiment to orchestrating the build executions and collecting the resulting data. I’m happy to say I built a working proof of concept: an agent deployed on GCP, integrated with Cloud Functions, with GitHub Runners orchestrating the build executions.

However, I simplified some parts to get a working POC without spending too much time. You’ll find more details in the next sections. Here’s a high-level diagram of the setup:

Let’s walk through the main parts.

The RL Agent

The RL agent is the brain of the system, proposing which configurations to try. What do I consider “configurations”? Any setting that materially affects build performance. Today there are hundreds of such parameters across the JVM, Gradle, Kotlin, and even component-specific systems like AGP or Dagger. Initially, I targeted JVM parameters. A production-ready optimization system would expand to include individual JVM flags (e.g., -XX:NewRatio, -XX:MaxMetaspaceSize), garbage collector selection, and compiler optimizations. For this POC, I focused on just three:

  • Gradle Workers
  • Xmx for Gradle process
  • Xmx for Kotlin process

I know it’s simple, but it’s a solid starting point. Even with just three parameters, the number of combinations grows quickly, and since we don’t have infinite resources or time to test every combination, I added guardrails to constrain the search space and define which options we’ll explore (there’s a small sketch of the resulting action space after this list):

  • I’m constrained by the environment (GitHub-hosted runners), which provides only 4 cores, so workers are capped at 4.

  • All experiments were run on modularized Android projects, and heading into 2026 it’s not realistic to build one with just 1 GB of memory, so very low heap sizes were ruled out.

  • To avoid OOMs, I limited the max heap to 8 GB in both processes.
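
To give a sense of the combinatorics, here’s a sketch of the resulting action space. The exact bounds are my assumption based on the guardrails above; the POC may use slightly different ranges:

```python
from itertools import product

# Assumed guardrails; the exact bounds in the POC may differ.
MAX_WORKERS = range(1, 5)      # GitHub-hosted runners expose 4 cores
GRADLE_HEAP_GB = range(2, 9)   # at least 2 GB, capped at 8 GB to avoid OOMs
KOTLIN_HEAP_GB = range(2, 9)

# Every candidate action the agent is allowed to propose.
ACTION_SPACE = [
    {"max_workers": w, "gradle_heap_gb": g, "kotlin_heap_gb": k}
    for w, g, k in product(MAX_WORKERS, GRADLE_HEAP_GB, KOTLIN_HEAP_GB)
]

print(len(ACTION_SPACE))  # 4 * 7 * 7 = 196 combinations, even with guardrails
```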

For this experiment, the reward was build time — yes, just build time. I initially started with a formula that included GC metrics for both processes and the mean Kotlin compile time, but the idea was to keep it simple and working first, so I can iterate later.

Next, I’ll go over the different learning models I tested:

First Attempt with Q-tables

In the initial iteration, I relied entirely on a Q-table. The Q-table is a lookup table that stores the learned value (Q-value) for each state-action combination. In our Gradle build optimization context:

  • Actions: Parameter combinations (max_workers, gradle_heap_gb, kotlin_heap_gb)
  • Q-Values: Learned rewards for each parameter combination

Here are some example Q-table entries:

{
  "4_6_8": 0.12524971,  // 4 workers, 6GB gradle, 8GB kotlin  Q-value 0.125
  "2_7_5": 0.12335408,  // 2 workers, 7GB gradle, 5GB kotlin  Q-value 0.123
  "1_2_8": 0.11712929   // 1 worker, 2GB gradle, 8GB kotlin  Q-value 0.117
}

In the first experiment run, I observed the following behavior:

The same action was selected six times, and with such a small total (15 iterations), I missed the chance to explore further.

In Q-tables, there are three different phases:

  • Initialization
  • Exploration
  • Exploitation

It’s critical to strike the right balance between exploration and exploitation when, as in our environment, the set of actions is finite: too much exploitation too early leads to the problem described above. As we’ll see in the GitHub Runners section, I parallelize N builds during action execution, which can increase exploration. I didn’t go deeper into Q-tables and instead moved on to a simpler approach.
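
Before moving on, to make that trade-off concrete, here’s roughly what an epsilon-greedy policy with a decaying epsilon looks like over a Q-table keyed the same way as the example above. The hyperparameters and the placeholder reward are my own assumptions, not the POC implementation:

```python
import random

q_table = {"4_6_8": 0.125, "2_7_5": 0.123, "1_2_8": 0.117}

def choose_action(q_table: dict, all_actions: list, epsilon: float) -> str:
    """Epsilon-greedy: explore a random action with probability epsilon,
    otherwise exploit the action with the highest Q-value seen so far."""
    if random.random() < epsilon or not q_table:
        return random.choice(all_actions)
    return max(q_table, key=q_table.get)

def update_q(q_table: dict, action: str, reward: float, lr: float = 0.1) -> None:
    """Move the stored Q-value toward the newly observed reward."""
    old = q_table.get(action, 0.0)
    q_table[action] = old + lr * (reward - old)

all_actions = ["4_6_8", "2_7_5", "1_2_8", "3_4_4"]
epsilon = 1.0
for iteration in range(15):
    action = choose_action(q_table, all_actions, epsilon)
    reward = random.uniform(0.0, 0.2)   # placeholder for the real build reward
    update_q(q_table, action, reward)
    # Decaying epsilon keeps early iterations exploratory and later ones greedy.
    epsilon = max(0.1, epsilon * 0.85)
```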

Adaptive Exploration Strategy

For the version described in this article, I chose an adaptive exploration strategy where the final “best action” is determined purely by observed build performance, not by learned Q-values. This makes the current implementation even simpler.

best_variant = max(variants, key=lambda v: v.get('reward', 0))

And build performance here is measured purely by build time. I initially tried incorporating GC time from the processes into a distributed reward formula — that was the original idea — but I still need to better understand its implications.
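
For illustration, a composite reward could look something like the sketch below. The weights, the baseline, and the normalization are entirely my own assumptions, not a formula used in the POC:

```python
def composite_reward(build_time_s: float,
                     gradle_gc_s: float,
                     kotlin_gc_s: float,
                     mean_kotlin_compile_s: float,
                     baseline_time_s: float = 600.0) -> float:
    """Hypothetical weighted reward: faster builds and less GC pressure score higher.
    All weights and the baseline are illustrative."""
    time_score = max(0.0, 1.0 - build_time_s / baseline_time_s)
    gc_score = max(0.0, 1.0 - (gradle_gc_s + kotlin_gc_s) / build_time_s)
    compile_score = max(0.0, 1.0 - mean_kotlin_compile_s / build_time_s)
    return 0.6 * time_score + 0.25 * gc_score + 0.15 * compile_score
```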

RL API Component

The RL API is built with FastAPI, deployed on GCP, and acts as the communication layer between GitHub Actions and the reinforcement learning engine. It exposes endpoints that cover the full experiment lifecycle. The primary endpoint, /get-action, receives experiment requests and returns Gradle configurations. Another key endpoint, /send-feedback, ingests build results from GitHub Actions and computes rewards on a continuous logarithmic scale. We also use Firestore to persist action results and experiment metadata.
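
As an illustration, here’s a minimal sketch of what those two endpoints could look like. The request shapes, the baseline, and the log-scale formula are my assumptions, and the Firestore persistence is omitted:

```python
import math
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Feedback(BaseModel):
    experiment_id: str
    action: dict          # e.g. {"max_workers": 4, "gradle_heap_gb": 6, "kotlin_heap_gb": 8}
    build_time_s: float
    baseline_time_s: float = 600.0  # illustrative baseline

@app.post("/get-action")
def get_action(experiment_id: str):
    # In the real service the RL agent proposes this; here it is hard-coded.
    return {"max_workers": 4, "gradle_heap_gb": 6, "kotlin_heap_gb": 8}

@app.post("/send-feedback")
def send_feedback(feedback: Feedback):
    # A log scale keeps rewards comparable across fast and slow builds:
    # builds faster than the baseline get a positive reward, slower ones negative.
    reward = math.log(feedback.baseline_time_s / feedback.build_time_s)
    # A real implementation would persist the result to Firestore here.
    return {"experiment_id": feedback.experiment_id, "reward": reward}
```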

GitHub Runners

One might assume we could simply run all of this locally, serving the RL agent and measuring build executions. While technically possible, this would essentially hijack our system resources during the RL experiment, and any other processes running at the same time could distort the results.

With Telltale, I’ve already demonstrated that it’s possible to orchestrate sequences of builds across different scenarios while maintaining both isolation and fairness. Following the same philosophy, I didn’t want to base our results on a single build — instead, we ran multiple builds to reduce noise and avoid the trap of regression to the mean.

Inspired by Telltale, we adopted a similar approach: whatever the RL agent decides to execute, we delegate to GitHub Actions, ensuring it runs in an isolated and repeatable environment:

Initially, we use a seed step with two main purposes:

  • Populate the GHA runner cache with dependencies.

  • Modify the project using actions and save the project state for later use.

After this, we execute the build n times based on the number of iterations defined for the experiment. In this initial version, we limited the total to 150 builds to avoid overloading the GHA runners and impacting my teammates in Develocity.

For each experiment, the parallel builds are distributed as follows (a quick check of the arithmetic follows the list):

  • 15 iterations: 1 seed build + 10 builds per iteration

  • 30 iterations: 1 seed build + 5 builds per iteration

  • 50 iterations: 1 seed build + 3 builds per iteration
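
The split follows directly from the build budget. Here’s a quick sanity check of the arithmetic, assuming the 150-build cap refers to the measured (non-seed) builds:

```python
# Sanity check of the distribution above, assuming a budget of 150 measured builds.
BUILD_BUDGET = 150

for iterations in (15, 30, 50):
    builds_per_iteration = BUILD_BUDGET // iterations
    total = 1 + iterations * builds_per_iteration  # 1 seed build + measured builds
    print(f"{iterations} iterations: {builds_per_iteration} builds each, {total} builds total")
```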

Finally, it’s worth noting that the action proposed by the RL agent is passed through a workflow dispatch input defined as rl-actions:

  rl-actions:
    description: 'RL-generated action parameters (JSON string)'
    required: false
    default: '{}'
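
For context, this is roughly how the agent side could trigger such a run through GitHub’s workflow dispatch REST API. The owner, repository, workflow file, and token handling are placeholders, not the actual setup of the POC:

```python
import json
import os
import requests

# Placeholders: adjust owner/repo/workflow file to your setup.
OWNER, REPO, WORKFLOW_FILE = "my-org", "my-project", "rl-experiment.yml"

def dispatch_rl_action(action: dict, ref: str = "main") -> None:
    """Trigger the workflow with the RL-proposed action as the rl-actions input."""
    url = (f"https://api.github.com/repos/{OWNER}/{REPO}"
           f"/actions/workflows/{WORKFLOW_FILE}/dispatches")
    response = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"ref": ref, "inputs": {"rl-actions": json.dumps(action)}},
    )
    response.raise_for_status()

# Example usage:
# dispatch_rl_action({"max_workers": 4, "gradle_heap_gb": 6, "kotlin_heap_gb": 8})
```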

And we’ll be able to track the progress of the different actions executed in the experiment directly from GHA:

Collecting Data

With the GitHub runners executing builds in parallel, we still need to collect the action data from each experiment and submit the feedback to the RL agent.

The projects under experimentation in this article are connected to Develocity. Each build publishes a Build Scan to Develocity, and to identify the non-seeding builds of each action we tag them with the experiment and action identifiers, such as experiment-1755896319523_W2_G4_K8:

The Develocity API provides endpoints to retrieve the initial reward fields needed for calculation, such as build duration and task execution information. Additionally, you can extend the Build Scan data, as I’m doing, to report both the Kotlin process GC time and the Gradle process GC time with the plugins InfoKotlinProcess and InfoGradleProcess, as custom values:

You can aggregate the data using your preferred Develocity API client; in the scope of this article, we use BuildExperimentResults.
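
As an illustration of this step, here’s a sketch (not BuildExperimentResults itself) that averages the non-seed build durations of one action and reports them back to the RL API’s /send-feedback endpoint. The payload shape and the base URL are assumptions:

```python
from statistics import mean
import requests

# Placeholder for the deployed RL API base URL.
RL_API_URL = "https://example-rl-api.cloudfunctions.net"

def submit_feedback(experiment_id: str, action: dict, durations_s: list[float]) -> None:
    """Average the non-seed build durations of one action and report them back."""
    payload = {
        "experiment_id": experiment_id,
        "action": action,
        "build_time_s": mean(durations_s),
    }
    requests.post(f"{RL_API_URL}/send-feedback", json=payload, timeout=30).raise_for_status()

# Example: durations (in seconds) collected from Develocity for the builds tagged
# experiment-1755896319523_W2_G4_K8.
# submit_feedback("experiment-1755896319523",
#                 {"max_workers": 2, "gradle_heap_gb": 4, "kotlin_heap_gb": 8},
#                 [412.3, 405.8, 419.1])
```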

Results in Action

I built a working POC where you can trigger an experiment by providing the repository name, the task, and the number of iterations you want to run:


The POC is available at https://rlgradleld.web.app/

I’ve disabled the creation of new experiments, but feel free to ping me if you’d like to see a live demo with one of your preferred projects that can run within the free tier of GitHub Actions.
In the UI, you’ll find a set of experiments we’ve run for different projects:

Additionally, I’ve published the repo that contains the RL Agent, Cloud Functions, UI, and GitHub Actions runners:
https://github.com/cdsap/RLGradleBuilds

Please follow the instructions and let me know if you have any questions about the setup.

Final Thoughts

Working on this Learning Day project was very interesting. I still have some mixed feelings, since the rewards were purely performance-based, and I would have liked to explore more advanced RL mechanisms such as:

  • Learning from actions we already know will be bad, to avoid further exploration of those (for instance, running with 1 worker or a very low Gradle heap size).

  • Using a composite reward formula that incorporates GC times and average Kotlin compiler task durations.

  • Extending the experiments beyond the current scenario of builds with only cached dependencies to other cases, such as best-case or incremental builds, with the ultimate goal of dynamically configuring memory per scenario, guided by the RL mechanism.

On the infrastructure side, all the core components are already in place: the RL agent is connected to Cloud Functions, Firestore stores actions and experiment data, and GitHub Actions runners orchestrate executions and process build information published to Develocity.

Happy experimenting and happy building!
