DEV Community

Fridolín Pokorný
Fridolín Pokorný

Posted on

How to beat Python’s pip: Reinforcement learning-based dependency resolution

The next episode from our series will be more theoretical. It will prepare the ground for the next article that will conclude the things we discussed so far. We will take a look at Monte Carlo tree search, Temporal Difference learning, and Markov decision process and how they can be used in a resolution process.

Alt Text

Balos beach on Crete island, Greece. Image by the author.

Markov decision process as a base for resolver

First, let’s take a look at Markov decision process (MDP). In a base, it provides us with a mathematical framework for modeling decision making (see more info in the linked Wikipedia article). To understand the decision-making process let’s apply it to the resolution process (other examples can be found on the Internet).

Instead of implementing a resolver using SAT or using backtracking, we will try to walk through the dependency graph. In that case, the resolver will try to satisfy dependencies of an application considering version range specification for direct dependencies, and recursively considering the transitive ones until it finds a valid resolution (or it concludes there is none).

At first, the resolver starts in an “initial state” which states all the requirements of an application to be included in the application stack. After a few rounds, it will end up in a state sn which will hold two sets — a set of dependencies to be resolved and a set of dependencies already resolved and included in the application stack.

In each state sn, the resolver can take an action that corresponds to making a decision on which dependency should be included in the application stack. An example can be shown in the figure down below — the resolver can take the action to resolve jinja2==2.10.2 coming from the PyPI. By doing so, the resolver adds jinja2==2.10.2 to the resolved dependencies set and adds all the dependencies on which jinja2==2.10.2 directly depends on to the unresolved dependencies set (respecting the version range specification).

As jinja2==2.10.2 can affect our application stack positively or negatively based on the knowledge we have about this dependency (e.g. build-time errors spotted in our software inspections), we can respect this fact by propagating the "reward signal" that corresponds to the action taken.

Alt Text

Decision-making process — resolving dependencies by walking through the dependency graph and retrieving the reward signal. This process can be modeled as an MDP. Image by the author.

The cumulated reward signal across all the actions taken to find a final state (all the packages included in the software stack/computed lock file) then corresponds to the overall software stack quality (i.e. software stack score).

Reinforcement learning-based dependency resolution and abstractions

The resolution process can be seen as communication between abstractions described below (following object-oriented programming paradigm).


... is an abstraction that can resolve software stacks following rules (e.g. how dependencies should be resolved respecting Python packaging standards). It uses Predictor to help with guiding which dependencies should be resolved while traversing the dependency graph (we do not need to resolve the latest packages necessarily). Resolver also triggers the Software stack resolution pipeline to compute the immediate reward signal that is subsequently forwarded to Predictor.


… helps Resolver to resolve software stacks — acts as a "decision-maker". It selects dependencies that should be included in a software stack to deliver the required quality (e.g. stable software, secure software, latest software, ...). It learns what packages should be included in software by selecting them.

Software stack resolution pipeline

... is a scoring pipeline that judges actions made by the Predictor. The output of this pipeline is primarily the "reward signal" that is computed in the pipeline units. This resolution pipeline can be dynamically constructed on each run to respect user needs (e.g. different pipeline units to deliver "secure software" in opposite to "well-performing software" given the operating system and hardware used to run the application).

Monte Carlo tree search and Temporal Difference learning in a resolution process

Simulated annealing in its adaptive form (ASA) was used as the first type of Predictor in the resolution process. Even though the ASA based predictor does not learn anything about the state space of possible software stacks, it gave a base for resolving software stacks that had many times higher quality than the "latest" software (as resolved by Pipenv or pip).

The next natural step for the resolution process was to learn actions taken during the resolution process. Temporal Difference learning and, later, Monte Carlo tree search algorithm principles were used for implementing the next predictor types. As there is no real opponent to play against (as seen in RL based algorithms), formulas like UCB1 could not be applied in their direct form. To balance exploration and exploitation, ideas from ASA were reused. Time became the real opponent to keep the resolution process responsive enough to users. The number of software stacks successfully resolved so far or the number of iterations done in the resolver became attributes for balancing exploration and exploitation (and possibly other attributes as well).

A video that includes core principles of Python dependency resolution and reinforcement learning dependency resolution principles.

The video presented above was introduced as part of DevConf.US 2020 and demonstrates these principles more in-depth. Check the linked annual event taking place in the USA, Czech Republic, and India each year (this year a virtual event). A lot of cool topics can be explored and you can also participate next year — the event is open as open-source.

See for more info.

Approximating maxima of an N-dimensional function using dimension addition and reinforcement learning

See also "Approximating maxima of an N-dimensional function using dimension addition and reinforcement learning" to get more insights from a slightly different perspective.

Project Thoth

Project Thoth is an application that aims to help Python developers. If you wish to be updated on any improvements and any progress we make in project Thoth, feel free to subscribe to our YouTube channel where we post updates as well as recordings from scrum demos. Check also our Twitter.

Stay tuned for any new updates!

Top comments (0)