How to beat Python’s pip: Software stack resolution pipelines

#python #datascience #opensource #machinelearning

Following our previous article about reinforcement learning-based dependency resolution, we will take a look at the actions taken during the resolution process. One example is resolving intel-tensorflow instead of tensorflow by following programmable rules.

Lungern, Switzerland. Image by the author.

Dependency graphs and software stack resolution

Users and maintainers have limited control over which dependencies are installed and resolved in Python applications, and few ways to attach additional semantics to them. Tools such as pip, Pipenv, or Poetry resolve a dependency stack to the latest possible candidate that satisfies the dependency specification stated by application developers (direct dependencies) and by library maintainers (transitive dependencies). This can become a limitation, especially for applications that were written months or years ago and still require maintenance. A trigger to revisit the dependency stack can be a security vulnerability found in the software stack, or an observation that the stack is no longer suitable for the given task and should be upgraded or downgraded (e.g. because of performance improvements in newer releases).
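To make the "latest possible candidate" policy concrete, here is a minimal sketch of what such a resolver does for a single requirement. The release listing is hypothetical and the version parsing only handles plain numeric releases; real tools follow the full Python packaging specifications.

```python
from typing import List, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn '2.1.0' into (2, 1, 0); handles only plain numeric releases."""
    return tuple(int(part) for part in text.split("."))


def resolve_latest(available: List[str], minimum: str, below: str) -> str:
    """Pick the newest release that satisfies `>=minimum,<below`."""
    low, high = parse_version(minimum), parse_version(below)
    candidates = [v for v in available if low <= parse_version(v) < high]
    if not candidates:
        raise ValueError("no release satisfies the requirement")
    return max(candidates, key=parse_version)


# Hypothetical release listing for illustration, not fetched from PyPI.
releases = ["1.15.0", "2.0.0", "2.0.1", "2.1.0"]
chosen = resolve_latest(releases, minimum="2.0.0", below="3.0.0")
print(chosen)  # prints 2.1.0
```

Nothing in this policy asks whether the newest release is actually the best fit for the application; it only checks that the version range is satisfied.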

Even resolving the latest software can lead to issues that library maintainers haven’t spotted or haven’t considered. We already know that the state space of all possible software stacks in the Python ecosystem is in many cases too large to explore and evaluate. Moreover, dependencies in the application stack evolve over time, and underpinning or overpinning of dependencies happens quite often.
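A rough back-of-the-envelope calculation shows how quickly the state space grows. The release counts below are invented for illustration, and the product is only an upper bound, since many combinations would be rejected by version constraints.

```python
import math

# Hypothetical number of installable releases per package in one small stack.
release_counts = {
    "tensorflow": 40,
    "numpy": 80,
    "six": 30,
    "protobuf": 60,
    "grpcio": 100,
}

# Each package can be picked independently, so the counts multiply.
combinations = math.prod(release_counts.values())
print(f"{combinations:,}")  # prints 576,000,000
```

Five packages already give hundreds of millions of candidate stacks under these assumptions, and real applications often pull in dozens of packages.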

Another aspect to consider is the human factor. The complexity behind libraries and application stacks becomes a field of expertise of its own. What’s the best performing TensorFlow stack for a given application running on specific hardware? Should I use a Red Hat build, an Intel build, or a Google TensorFlow build? All of these companies have dedicated teams that focus on the performance of the builds they produce, and answering these questions quantitatively requires considerable effort. The performance aspect is just one item in the vector used to evaluate application stack quality.

Software stack resolution pipeline and pipeline configuration

Let’s promote the whole resolution process and make it server-side. In that case, the resolver can use a shared knowledge database that assists with the software stack resolution. The whole resolution process can be treated as a pipeline made of units that cooperate to form the most suitable stack for the user’s needs.

Server-side resolution is not required, but it definitely helps with the whole process. Users are not required to maintain the database, and serving software stacks as a service has other advantages as well (e.g. an allocated pool of resources).

The software stack resolution pipeline can:

inject new packages or new package versions into the dependency graph based on packages resolved (e.g. a package accidentally not stated as a dependency of a library, dependency underpinning issues, ...)
remove a dependency in a specific version, or the whole dependency with its dependency subgraph, from the dependency graph and let the resolver find another resolution path (e.g. a package accidentally stated as a dependency, missing ABI symbols in the runtime environment, dependency overpinning issues, ...)
score a package occurring in the dependency graph positively to prioritize resolution of that package (e.g. positive performance aspect of a package in a specific version/build)
score a package in a specific version occurring in the dependency graph negatively to prioritize resolution of other versions (e.g. a security vulnerability present in a specific release)
prevent resolving a specific package in a specific version so that the resolver tries to find a different resolution path, if any (e.g. buggy package releases)
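The actions listed above can be sketched as small functions. This is a simplified illustration, not the thoth-adviser API: the function names, the `NotAcceptable` exception, the `examplelib` package, and all scores and rules are assumptions made up for this example.

```python
from typing import List, Optional, Tuple

Package = Tuple[str, str]  # (name, version)


class NotAcceptable(Exception):
    """Raised by a unit to keep a package version out of the stack."""


def score_performance(pkg: Package) -> float:
    # Positive score: prefer a build assumed (for illustration) to be faster.
    return 1.0 if pkg == ("intel-tensorflow", "2.0.1") else 0.0


def score_vulnerability(pkg: Package) -> float:
    # Negative score: penalize a release with a made-up security advisory.
    return -0.5 if pkg == ("examplelib", "1.0.0") else 0.0


def block_buggy_release(pkg: Package) -> float:
    # Prevent resolution: reject a release assumed to be broken.
    if pkg == ("examplelib", "1.1.0"):
        raise NotAcceptable(f"{pkg[0]}=={pkg[1]} is marked as buggy")
    return 0.0


def adjust_dependencies(pkg: Package, deps: List[str]) -> List[str]:
    # Inject a dependency the library forgot to declare and remove one
    # that was declared by accident (both rules are invented).
    if pkg[0] == "examplelib":
        deps = [d for d in deps if d != "unused-dep"] + ["missing-dep"]
    return deps


SCORING_UNITS = [score_performance, score_vulnerability, block_buggy_release]


def evaluate(pkg: Package) -> Optional[float]:
    """Sum the unit scores, or return None if any unit rejects the package."""
    try:
        return sum(unit(pkg) for unit in SCORING_UNITS)
    except NotAcceptable:
        return None


print(evaluate(("intel-tensorflow", "2.0.1")))  # prints 1.0
print(evaluate(("examplelib", "1.0.0")))        # prints -0.5
print(evaluate(("examplelib", "1.1.0")))        # prints None
print(adjust_dependencies(("examplelib", "1.0.0"), ["six", "unused-dep"]))
```

A rejected package (the `None` case) is what forces the resolver onto a different resolution path, while scores only shift its preferences.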

These pipeline units form autonomous pieces that know when they should be included in the resolution pipeline (thus be part of the "pipeline configuration") and know when to perform certain actions during the actual resolution.

A component called "pipeline builder" adds pipeline units to the pipeline configuration based on the decision made by the pipeline unit itself. This is done during the phase which creates the pipeline configuration.
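A minimal sketch of this idea follows, assuming each unit exposes a class method that inspects the build context and decides on its own inclusion. The class names, the `BuilderContext` fields, and the `should_include` signature are hypothetical and simplified; the actual thoth-adviser interface may differ.

```python
from dataclasses import dataclass
from typing import List, Set, Type


@dataclass
class BuilderContext:
    """What a unit may inspect when deciding on inclusion (hypothetical)."""

    direct_dependencies: Set[str]
    recommendation_type: str  # e.g. "performance" or "security"


class Unit:
    @classmethod
    def should_include(cls, ctx: BuilderContext) -> bool:
        raise NotImplementedError


class IntelTensorFlowUnit(Unit):
    @classmethod
    def should_include(cls, ctx: BuilderContext) -> bool:
        # Only relevant when TensorFlow is requested and speed matters.
        return (
            "tensorflow" in ctx.direct_dependencies
            and ctx.recommendation_type == "performance"
        )


class SecurityAdvisoryUnit(Unit):
    @classmethod
    def should_include(cls, ctx: BuilderContext) -> bool:
        return ctx.recommendation_type == "security"


def build_pipeline(ctx: BuilderContext, known: List[Type[Unit]]) -> List[Unit]:
    """Ask every known unit whether it wants to be part of the pipeline."""
    return [unit_cls() for unit_cls in known if unit_cls.should_include(ctx)]


ctx = BuilderContext({"tensorflow", "flask"}, "performance")
pipeline = build_pipeline(ctx, [IntelTensorFlowUnit, SecurityAdvisoryUnit])
print([type(unit).__name__ for unit in pipeline])  # prints ['IntelTensorFlowUnit']
```

Keeping the inclusion decision inside the unit means the builder needs no knowledge of individual units; adding a new unit does not require touching the builder.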

Creation of a resolution pipeline configuration by the pipeline builder. Image by the author.

Once the resolution pipeline is built, it is used during the resolution process.

A software stack resolution process

In the last article, we described the resolution process as a Markov decision process. This uncovered the potential to use reinforcement learning algorithms to come up with a suitable software stack candidate for applications.

Latest software is not always the greatest.

The last article described three main entities used during the resolution process:

Resolver — an entity for resolving software following Python packaging specification
Predictor — an entity used for guiding the resolution in the dependency graph
Software stack resolution pipeline — an abstraction for scoring and adjusting the dependency graph

The whole resolution process is then seen as a cooperation of these three entities.

A resolution process guided by a predictor — magician. The fairy girl corresponds to the resolver which passes the predicted part of the dependency graph (a package) to the scoring pipeline. Results of the scoring pipeline (reward signal) are reported back to the predictor. Image by the author.
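The cooperation of the three entities can be sketched as a small loop. This toy version is an assumption-heavy illustration: the candidate listing and rewards are made up, the predictor is a simple explore-then-exploit rule rather than a real reinforcement learning algorithm, and none of the names come from thoth-adviser.

```python
from typing import Dict, List, Tuple

Package = Tuple[str, str]  # (name, version)

# Hypothetical candidates the resolver can choose between for each slot.
CANDIDATES: Dict[str, List[Package]] = {
    "tensorflow-build": [("tensorflow", "2.1.0"), ("intel-tensorflow", "2.0.1")],
    "numpy": [("numpy", "1.17.0"), ("numpy", "1.18.1")],
}


def pipeline_score(pkg: Package) -> float:
    """Scoring pipeline: made-up rewards standing in for real knowledge."""
    rewards = {("intel-tensorflow", "2.0.1"): 1.0, ("numpy", "1.18.1"): 0.2}
    return rewards.get(pkg, 0.0)


class GreedyPredictor:
    """Remembers the reward seen for each package and prefers the best one."""

    def __init__(self) -> None:
        self.seen: Dict[Package, float] = {}

    def choose(self, options: List[Package]) -> Package:
        untried = [p for p in options if p not in self.seen]
        if untried:
            return untried[0]  # explore every option once
        return max(options, key=lambda p: self.seen[p])  # then exploit

    def feedback(self, pkg: Package, reward: float) -> None:
        self.seen[pkg] = reward


def resolve(predictor: GreedyPredictor, rounds: int = 3) -> Tuple[List[Package], float]:
    """Resolver: build one stack per round and keep the best-scored one."""
    best_stack: List[Package] = []
    best_score = float("-inf")
    for _ in range(rounds):
        stack, total = [], 0.0
        for options in CANDIDATES.values():
            pkg = predictor.choose(options)   # predictor guides the resolver
            reward = pipeline_score(pkg)      # pipeline scores the package
            predictor.feedback(pkg, reward)   # reward signal flows back
            stack.append(pkg)
            total += reward
        if total > best_score:
            best_stack, best_score = stack, total
    return best_stack, best_score


stack, score = resolve(GreedyPredictor())
print(stack)
```

With these invented rewards, the first round resolves the plain candidates, the second round tries the alternatives, and the best stack ends up containing intel-tensorflow 2.0.1 and numpy 1.18.1.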

The software stack resolution pipeline is formed out of units of different types, each serving its own purpose. An example is a pipeline unit of type "Step", which maps to an action taken in a Markov decision process.

The resolver can be easily extended by providing pipeline units that follow the expected semantics and API and that help with the software stack resolution process. The interface is simple, so anyone can provide their own implementation and extend the resolution process with their own knowledge. The pre-aggregated knowledge of dependencies enables offline resolution, so the system can score hundreds of software stacks per second.

The demo shown above demonstrates how pipeline units can be used in a resolution process to come up with a software stack that respects the pipeline configuration supplied. The resolution process finds an intel-tensorflow==2.0.1 software stack instead of the pinned tensorflow==2.1.0 specified in the direct dependency listing. The notebook shown can be found in the linked repository.

Thoth adviser

If you are interested in the resolution process and the core principles used in the implementation, you can check the thoth-adviser documentation and sources available on GitHub.

Also, check other articles from the “How to beat Python’s pip” series:

Project Thoth

Project Thoth is an application that aims to help Python developers. If you wish to be updated on any improvements and any progress we make in project Thoth, feel free to subscribe to our YouTube channel where we post updates as well as recordings from scrum demos. We also post updates on Twitter.

Stay tuned for any new updates!
