<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fridolín Pokorný</title>
    <description>The latest articles on DEV Community by Fridolín Pokorný (@fridex).</description>
    <link>https://dev.to/fridex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F339135%2F3f0210d1-08dc-4990-a33a-30d333332183.jpg</url>
      <title>DEV Community: Fridolín Pokorný</title>
      <link>https://dev.to/fridex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fridex"/>
    <language>en</language>
    <item>
      <title>When checking your Python package sources matters</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Sun, 16 Apr 2023 18:12:26 +0000</pubDate>
      <link>https://dev.to/fridex/when-checking-your-python-package-sources-matters-3p8h</link>
      <guid>https://dev.to/fridex/when-checking-your-python-package-sources-matters-3p8h</guid>
      <description>&lt;p&gt;In today's article, we will take a look at a small tool called &lt;a href="https://pypi.org/project/yorkshire/" rel="noopener noreferrer"&gt;Yorkshire&lt;/a&gt;. It's goal is to check configured Python package indexes in projects to make sure only desired package sources are used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsj5ktj8xc4tepwin3vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsj5ktj8xc4tepwin3vg.png" alt="Yorkshire"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A cute Yorkshire terrier, &lt;a href="https://pixabay.com/photos/terrier-dog-cute-puppy-pet-canine-410298/" rel="noopener noreferrer"&gt;Pixabay&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Python's packaging allows users to consume packages from multiple sources, i.e., multiple Python package indexes. During installation, the resolution algorithm implemented in pip searches all the configured package indexes to satisfy requirements.&lt;/p&gt;

&lt;p&gt;The resolution algorithm treats all the configured indexes as mirrors. If a package &lt;code&gt;foo&lt;/code&gt; is available on index A as well as on index B, both indexes are treated with the same relevance, considering the versions available. The &lt;code&gt;--index-url&lt;/code&gt; and &lt;code&gt;--extra-index-url&lt;/code&gt; options allow specifying primary and additional indexes, but there is no guarantee which index is actually used. If there is a network issue, the resolution process can fall back to the additional indexes, as they are just mirrors.&lt;/p&gt;

&lt;p&gt;In some cases, users want to consume packages from index A and a specific package from index B. As of today, there is no configuration option in pip to specify which index should be used for a specific package. This opens the door to dependency confusion attacks, such as &lt;a href="https://pytorch.org/blog/compromised-nightly-dependency/" rel="noopener noreferrer"&gt;the PyTorch incident&lt;/a&gt;.&lt;/p&gt;
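
For illustration, the primary and additional indexes mentioned above are passed to pip roughly like this (the package name and the second index URL are placeholders, not a real setup):

```
# Both indexes are treated as equal-priority mirrors during resolution;
# 'foo' may be fetched from either one.
pip install foo \
    --index-url https://pypi.org/simple/ \
    --extra-index-url https://private.example.org/simple/
```

There is no flag in this invocation to pin `foo` to one of the two indexes, which is exactly the gap described above.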

&lt;p&gt;There was &lt;a href="https://discuss.python.org/t/proposal-preventing-dependency-confusion-attacks-with-the-map-file/23414" rel="noopener noreferrer"&gt;a discussion on discuss.python.org about preventing dependency confusion attacks using a map file&lt;/a&gt;. The idea of the map file was not accepted; nevertheless, the proposal &lt;a href="https://peps.python.org/pep-0708/" rel="noopener noreferrer"&gt;PEP-708: Extending the Repository API to Mitigate Dependency Confusion Attacks&lt;/a&gt; pushed the idea of preventing dependency confusion attacks further.&lt;/p&gt;

&lt;p&gt;Until &lt;a href="https://peps.python.org/pep-0708/" rel="noopener noreferrer"&gt;PEP-708&lt;/a&gt; is eventually accepted and implemented, there is room to check how projects configure their Python package indexes. Even once &lt;a href="https://peps.python.org/pep-0708/" rel="noopener noreferrer"&gt;PEP-708&lt;/a&gt; is accepted, it might be a good idea for organizations to check which indexes are used, to monitor the consumption of software in their environments.&lt;/p&gt;

&lt;p&gt;To support checks of the index configuration, a tool called &lt;a href="https://pypi.org/project/yorkshire/" rel="noopener noreferrer"&gt;Yorkshire&lt;/a&gt; was developed. Yorkshire checks the index configuration in files that can be used to specify project dependencies, such as &lt;code&gt;requirements.txt&lt;/code&gt;, &lt;code&gt;pyproject.toml&lt;/code&gt;, or Pipenv files. If multiple Python package indexes are used, Yorkshire reports it. Optionally, it can check that only allowed indexes are configured.&lt;/p&gt;

&lt;p&gt;Let's take &lt;a href="https://python-poetry.org/docs/repositories/#installing-from-private-package-sources" rel="noopener noreferrer"&gt;Poetry's configuration for specifying secondary indexes&lt;/a&gt; as an example. The linked command can generate a &lt;code&gt;pyproject.toml&lt;/code&gt; file similar to this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.poetry]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"foo"&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1.0.0"&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"My package"&lt;/span&gt;
&lt;span class="py"&gt;authors&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Author &amp;lt;author@email.com&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nn"&gt;[tool.poetry.dependencies]&lt;/span&gt;
&lt;span class="py"&gt;python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^3.6"&lt;/span&gt;
&lt;span class="py"&gt;flask&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^2"&lt;/span&gt;

&lt;span class="nn"&gt;[[tool.poetry.source]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"private_repo"&lt;/span&gt;
&lt;span class="py"&gt;url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://test.pypi.org/simple/"&lt;/span&gt;
&lt;span class="py"&gt;default&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;secondary&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="nn"&gt;[build-system]&lt;/span&gt;
&lt;span class="py"&gt;requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"poetry-core"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;build-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"poetry.core.masonry.api"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The configuration above allows Poetry to consume packages hosted on &lt;a href="https://pypi.org/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; as well as the ones hosted on the &lt;a href="https://test.pypi.org/" rel="noopener noreferrer"&gt;test PyPI&lt;/a&gt;. Which one will actually be used to consume packages? Well, it depends on the dependencies of the &lt;code&gt;flask&lt;/code&gt; package. Note that the version might be relevant here as well, depending on the environment into which dependencies are installed.&lt;/p&gt;

&lt;p&gt;Let's run Yorkshire on the &lt;code&gt;pyproject.toml&lt;/code&gt; file above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;yorkshire detect ./pyproject.toml
&lt;span class="go"&gt;2023-04-15 20:13:44,984 [1767887] INFO     yorkshire._lib: Performing detection in pyproject.toml file located at '.'
2023-04-15 20:13:44,985 [1767887] WARNING  yorkshire._lib: File './pyproject.toml' uses an explicitly configured Poetry source: ['https://test.pypi.org/simple/']
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As can be seen, Yorkshire issues a warning, as multiple package sources can eventually be used.&lt;/p&gt;

&lt;p&gt;Next, let's mark the test PyPI index as allowed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;yorkshire detect &lt;span class="nt"&gt;--allowed-index-url&lt;/span&gt; &lt;span class="s2"&gt;"https://test.pypi.org/simple/"&lt;/span&gt; ./pyproject.toml
&lt;span class="go"&gt;2023-04-15 20:16:33,806 [1773955] INFO     yorkshire._lib: Performing detection in pyproject.toml file located at '.'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command above shows that the test PyPI index specified in the &lt;code&gt;pyproject.toml&lt;/code&gt; is no longer flagged as a possible issue.&lt;/p&gt;

&lt;p&gt;Yorkshire understands the requirements file types used in the Python ecosystem. Similarly to the &lt;code&gt;pyproject.toml&lt;/code&gt; configuration specific to &lt;a href="https://python-poetry.org/" rel="noopener noreferrer"&gt;Poetry&lt;/a&gt;, Yorkshire supports the configuration used by &lt;a href="https://pdm.fming.dev/" rel="noopener noreferrer"&gt;PDM&lt;/a&gt;, &lt;a href="https://pipenv.pypa.io/" rel="noopener noreferrer"&gt;Pipenv&lt;/a&gt;, &lt;a href="https://pypi.org/project/pip-tools/" rel="noopener noreferrer"&gt;pip-tools&lt;/a&gt;, or &lt;a href="https://pip.pypa.io/" rel="noopener noreferrer"&gt;pip&lt;/a&gt; itself. All these tools have their own specifics (some of them even support assigning packages to an index, as mentioned above).&lt;/p&gt;
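
As a hypothetical illustration, a pip requirements file can configure an additional index inline; Yorkshire would report the extra index configured in a file like this (the private index URL is a placeholder):

```
--index-url https://pypi.org/simple/
--extra-index-url https://private.example.org/simple/

flask
```

Running Yorkshire without passing the private URL via `--allowed-index-url` would flag this file, just as the Poetry example was flagged above.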

&lt;p&gt;Yorkshire &lt;a href="https://github.com/DataDog/yorkshire/blob/14ddb9b9b31dc92f499e0a89d8b070539c9f696d/yorkshire/__init__.py#L14-L15" rel="noopener noreferrer"&gt;provides an API&lt;/a&gt; so that the checks can be incorporated into other projects or systems.&lt;/p&gt;

&lt;p&gt;Organizations can use Yorkshire in their checks or monitoring to make sure only trusted Python package indexes are used. Would you find it useful?&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>How to get information about the provenance of Python packages installed</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Thu, 13 Apr 2023 20:59:28 +0000</pubDate>
      <link>https://dev.to/fridex/how-to-get-information-about-the-provenance-of-python-packages-installed-4f65</link>
      <guid>https://dev.to/fridex/how-to-get-information-about-the-provenance-of-python-packages-installed-4f65</guid>
      <description>&lt;p&gt;Let's take a look on how to obtain information about the provenance of installed packages in the Python ecosystem. This idea is part of &lt;a href="https://peps.python.org/pep-0710/" rel="noopener noreferrer"&gt;PEP-710&lt;/a&gt; which is in a draft state as of today.       &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxpgm3fgh3m01kjtysxq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxpgm3fgh3m01kjtysxq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://goo.gl/maps/aJhsQRMb5sBnv2W37" rel="noopener noreferrer"&gt;Židlochovice - Rozhledna Akátová věž; Czech republic&lt;/a&gt;. Image by author.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;The tutorial uses files that are available at &lt;a href="https://github.com/fridex/pip-provenance" rel="noopener noreferrer"&gt;github.com/fridex/pip-provenance&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's create a simple Python application using &lt;a href="https://github.com/chainguard-images/images/tree/main/images/python" rel="noopener noreferrer"&gt;Chainguard's Python image&lt;/a&gt;. The application will be a simple Flask hello-world application. The &lt;code&gt;app.py&lt;/code&gt; script will have the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Hello, world!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, we will create a &lt;code&gt;requirements.in&lt;/code&gt; file with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flask
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will use &lt;a href="https://pypi.org/project/pip-tools/" rel="noopener noreferrer"&gt;pip-tools&lt;/a&gt; to lock dependencies to specific versions for reproducibility. We will also keep the hashes of the installed Python distributions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;pip-compile --generate-hashes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command above will create a &lt;code&gt;requirements.txt&lt;/code&gt; file. An example of such a file can be &lt;a href="https://github.com/fridex/pip-provenance/blob/main/requirements.txt" rel="noopener noreferrer"&gt;found here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Next, let's create a containerized environment with our application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the upstream pip
&lt;/h2&gt;

&lt;p&gt;First, we will use the upstream pip, which is also shipped in Chainguard's images. We can directly take the &lt;a href="https://github.com/chainguard-images/images/tree/main/images/python" rel="noopener noreferrer"&gt;Dockerfile as written by Chainguard&lt;/a&gt; with minimal changes to containerize our application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;cgr.dev/chainguard/python:latest-dev&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt &lt;span class="nt"&gt;--user&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; cgr.dev/chainguard/python:latest&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="c"&gt;# Make sure you update Python version in path&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /home/nonroot/.local/lib/python3.11/site-packages /home/nonroot/.local/lib/python3.11/site-packages&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; app.py .&lt;/span&gt;

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["python", "/app/app.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The containerized application can be built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;podman build -f raw/Dockerfile -t pip-provenance:raw .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Subsequently, the built application can be run and accessed at &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;localhost:8080&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;podman run -p 8080:8080 pip-provenance:raw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's imagine someone published this image to a registry and we would like to get information about the packages installed. We can pull the &lt;code&gt;pip-provenance:raw&lt;/code&gt; image and run &lt;code&gt;pip freeze&lt;/code&gt;. Unfortunately, &lt;code&gt;pip freeze&lt;/code&gt; shows only the installed Python packages and their versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip freeze                     
&lt;span class="go"&gt;click==8.1.3
Flask==2.2.3
itsdangerous==2.1.2
Jinja2==3.1.2
MarkupSafe==2.1.2
Werkzeug==2.2.3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We don't have any information about where these packages were actually installed from, nor do we have any information on the digests of these packages. An exception is packages installed using a direct URL following &lt;a href="https://peps.python.org/pep-0610/" rel="noopener noreferrer"&gt;PEP-610&lt;/a&gt;, but that's not the case in our example.&lt;/p&gt;
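
For context, a PEP-610 `direct_url.json` record for a package installed from a Git URL looks roughly like the following sketch (the URL, revision, and commit values are illustrative placeholders, not from our build):

```
{
  "url": "https://github.com/pallets/flask.git",
  "vcs_info": {
    "vcs": "git",
    "requested_revision": "2.2.3",
    "commit_id": "COMMIT_SHA_PLACEHOLDER"
  }
}
```

Packages installed from an index by name, as in our `requirements.txt`, get no such file, which is the gap PEP-710 addresses.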

&lt;h2&gt;
  
  
  Using the patched pip
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://peps.python.org/pep-0710/" rel="noopener noreferrer"&gt;PEP-710&lt;/a&gt; proposes storing provenance information about installed packages when they are identified by their name and, optionally, their version (which is our case). Let's take a look at what information is stored and how we could access it.&lt;/p&gt;

&lt;p&gt;First, let's adjust our Dockerfile to use &lt;a href="https://github.com/pypa/pip/pull/11865" rel="noopener noreferrer"&gt;a patched version of pip&lt;/a&gt; that follows &lt;a href="https://peps.python.org/pep-0710/" rel="noopener noreferrer"&gt;PEP-710&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;cgr.dev/chainguard/python:latest-dev&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="c"&gt;# -----&amp;gt;%------&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--force-reinstall&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/fridex/pip.git@provenance-url
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; nonroot&lt;/span&gt;
&lt;span class="c"&gt;# -----%&amp;lt;------&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt &lt;span class="nt"&gt;--user&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; cgr.dev/chainguard/python:latest&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="c"&gt;# Make sure you update Python version in path&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /home/nonroot/.local/lib/python3.11/site-packages /home/nonroot/.local/lib/python3.11/site-packages&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; app.py .&lt;/span&gt;

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["python", "/app/app.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's build this application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;podman build -f patched/Dockerfile -t pip-provenance:patched .
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can run the application and access it at &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;localhost:8080&lt;/a&gt;; the changes introduced in pip have no effect on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;podman run -p 8080:8080 pip-provenance:patched
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following &lt;a href="https://peps.python.org/pep-0710/" rel="noopener noreferrer"&gt;PEP-710&lt;/a&gt;, pip stores provenance information in &lt;code&gt;*.dist-info&lt;/code&gt; directories located in &lt;code&gt;site-packages&lt;/code&gt;. Let's copy the &lt;code&gt;site-packages&lt;/code&gt; directory out of the containerized environment so that we can check what was installed there (substitute &lt;code&gt;[CONTAINER_HASH]&lt;/code&gt; with the hash of the container that was run in the previous example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;podman cp [CONTAINER_HASH]:/home/nonroot/.local/lib/python3.11/site-packages site-packages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can take a look at the &lt;code&gt;provenance_url.json&lt;/code&gt; file for the &lt;code&gt;flask&lt;/code&gt; package*:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ./site-packages/Flask-2.2.3.dist-info/provenance_url.json | jq
&lt;span class="go"&gt;{
  "archive_info": {
    "hash": "sha256=c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d",
    "hashes": {
      "sha256": "c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d"
    }
  },
  "url": "https://files.pythonhosted.org/packages/95/9c/a3542594ce4973786236a1b7b702b8ca81dbf40ea270f0f96284f0c27348/Flask-2.2.3-py3-none-any.whl"
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file is created by the patched pip and is described in more detail in &lt;a href="https://peps.python.org/pep-0710/" rel="noopener noreferrer"&gt;PEP-710&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A small tool called &lt;a href="https://github.com/fridex/pip-preserve" rel="noopener noreferrer"&gt;pip-preserve&lt;/a&gt; can read the content of the &lt;code&gt;site-packages&lt;/code&gt; directory and understands the &lt;code&gt;provenance_url.json&lt;/code&gt; file for each installed Python package. Moreover, if a package was installed using a direct URL, the tool can also read &lt;code&gt;direct_url.json&lt;/code&gt;, as described in &lt;a href="https://peps.python.org/pep-0610/" rel="noopener noreferrer"&gt;PEP-610&lt;/a&gt;, to fully reconstruct the environment. Let's use the tool on the &lt;code&gt;site-packages&lt;/code&gt; directory from our containerized environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pip-preserve
&lt;span class="c"&gt;...
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip-preserve &lt;span class="nt"&gt;--ignore-errors&lt;/span&gt; &lt;span class="nt"&gt;--site-packages&lt;/span&gt; ./site-packages      
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;This file is autogenerated by pip-preserve version 0.0.2.post1 with Python 3.10.6.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="go"&gt;click==8.1.3 \
  --hash=sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48
flask==2.2.3 \
  --hash=sha256:c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d
itsdangerous==2.1.2 \
  --hash=sha256:2c2349112351b88699d8d4b6b075022c0808887cb7ad10069318a8b0bc88db44
jinja2==3.1.2 \
  --hash=sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61
markupsafe==2.1.2 \
  --hash=sha256:f2bfb563d0211ce16b63c7cb9395d2c682a23187f54c3d79bfec33e6705473c6
werkzeug==2.2.3 \
  --hash=sha256:56433961bc1f12533306c624f3be5e744389ac61d722175d543e1751285da612
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the tool reconstructed a &lt;code&gt;requirements.txt&lt;/code&gt; file, listing all the installed packages together with their versions and hashes.&lt;/p&gt;

&lt;p&gt;A reader may notice that the reconstructed file has only one hash per package. The reason is that pip installs exactly one artifact for each package. Our original &lt;code&gt;requirements.txt&lt;/code&gt; file &lt;a href="https://github.com/fridex/pip-provenance/blob/main/requirements.txt" rel="noopener noreferrer"&gt;lists multiple hashes that correspond to the Python distributions published on PyPI&lt;/a&gt; at the time the &lt;code&gt;pip-compile&lt;/code&gt; command was run. At installation time, pip takes the one matching the environment into which the Python distribution is installed. For example, pip took the wheel file published for &lt;a href="https://pypi.org/project/Flask/2.2.3/#files" rel="noopener noreferrer"&gt;flask==2.2.3&lt;/a&gt;, not the source distribution available on PyPI (you can verify this by checking the artifact hashes). Using the patched version of pip, we can point to the exact artifact that was installed.&lt;/p&gt;
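
Which artifact matches the environment is decided by the compatibility tags encoded in the wheel filename. A tiny illustrative helper (not part of pip or pip-preserve) that extracts the tag triple:

```python
def wheel_tag_triple(filename: str) -> str:
    """Return the '{python}-{abi}-{platform}' tags from a wheel filename.

    Wheel filenames follow {name}-{version}(-{build})?-{python}-{abi}-{platform}.whl,
    so the tag triple is always the last three dash-separated fields.
    """
    stem = filename.removesuffix(".whl")
    return "-".join(stem.split("-")[-3:])


# A pure-Python wheel is usable everywhere ...
print(wheel_tag_triple("Flask-2.2.3-py3-none-any.whl"))  # py3-none-any
# ... while a binary wheel targets one interpreter and platform.
print(wheel_tag_triple(
    "MarkupSafe-2.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
))  # cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64
```

This is why our environment has one Flask hash (the universal wheel) but a platform-specific MarkupSafe wheel.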

&lt;p&gt;If we pass the &lt;code&gt;--direct-url&lt;/code&gt; option to the &lt;code&gt;pip-preserve&lt;/code&gt; tool, we get the exact URLs from which the Python packages were installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip-preserve --ignore-errors --direct-url --site-packages ./site-packages
#
# This file is autogenerated by pip-preserve version 0.0.2.post1 with Python 3.10.6.
#
https://files.pythonhosted.org/packages/c2/f1/df59e28c642d583f7dacffb1e0965d0e00b218e0186d7858ac5233dce840/click-8.1.3-py3-none-any.whl \
  --hash=sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48
https://files.pythonhosted.org/packages/95/9c/a3542594ce4973786236a1b7b702b8ca81dbf40ea270f0f96284f0c27348/Flask-2.2.3-py3-none-any.whl \
  --hash=sha256:c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d
https://files.pythonhosted.org/packages/68/5f/447e04e828f47465eeab35b5d408b7ebaaaee207f48b7136c5a7267a30ae/itsdangerous-2.1.2-py3-none-any.whl \
  --hash=sha256:2c2349112351b88699d8d4b6b075022c0808887cb7ad10069318a8b0bc88db44
https://files.pythonhosted.org/packages/bc/c3/f068337a370801f372f2f8f6bad74a5c140f6fda3d9de154052708dd3c65/Jinja2-3.1.2-py3-none-any.whl \
  --hash=sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61
https://files.pythonhosted.org/packages/5a/94/d056bf5dbadf7f4b193ee2a132b3d49ffa1602371e3847518b2982045425/MarkupSafe-2.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
  --hash=sha256:f2bfb563d0211ce16b63c7cb9395d2c682a23187f54c3d79bfec33e6705473c6
https://files.pythonhosted.org/packages/f6/f8/9da63c1617ae2a1dec2fbf6412f3a0cfe9d4ce029eccbda6e1e4258ca45f/Werkzeug-2.2.3-py3-none-any.whl \
  --hash=sha256:56433961bc1f12533306c624f3be5e744389ac61d722175d543e1751285da612
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why is this useful?
&lt;/h2&gt;

&lt;p&gt;Okay, now we know we can get information about the provenance of installed packages using &lt;a href="https://peps.python.org/pep-0710/" rel="noopener noreferrer"&gt;PEP-710&lt;/a&gt;. If we take a look at other packages, such as &lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;TensorFlow&lt;/a&gt;, we can see that &lt;a href="https://pypi.org/project/tensorflow/2.12.0/#files" rel="noopener noreferrer"&gt;multiple wheel files are published&lt;/a&gt; - each corresponding to a specific environment. If we just &lt;code&gt;pip install tensorflow&lt;/code&gt;, which wheel file is actually used (assuming we do not always have access to installation logs)?&lt;/p&gt;

&lt;p&gt;Also note that there can be specific builds of Python packages hosted on a private Python package index. These wheels can be built with options that might not be expressible using &lt;a href="https://packaging.python.org/en/latest/specifications/platform-compatibility-tags/" rel="noopener noreferrer"&gt;wheel tags&lt;/a&gt;. If you are using a Python environment (not necessarily a containerized one), how do you know the provenance of the installed Python packages (without access to installation logs or any build configuration)?&lt;/p&gt;

&lt;p&gt;The containerized environments built for this article are available at &lt;a href="https://hub.docker.com/repository/docker/fridex/pip-provenance/general" rel="noopener noreferrer"&gt;docker.io/fridex/pip-provenance&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;podman pull fridex/pip-provenance:raw
podman pull fridex/pip-provenance:patched
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can follow related discussion about &lt;a href="https://peps.python.org/pep-0710/" rel="noopener noreferrer"&gt;PEP-710&lt;/a&gt; at &lt;a href="https://discuss.python.org/t/pep-710-recording-the-provenance-of-installed-packages/25428" rel="noopener noreferrer"&gt;discuss.python.org&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;*Even though the &lt;code&gt;provenance_url.json&lt;/code&gt; files produced by the patched pip keep the &lt;code&gt;hash&lt;/code&gt; key, PEP-710 does not define it. The patched pip implementation reuses code defined by PEP-610 (direct URLs). The &lt;code&gt;hash&lt;/code&gt; key is now deprecated in the &lt;code&gt;direct_url.json&lt;/code&gt; file introduced by PEP-610.&lt;/p&gt;

</description>
      <category>python</category>
      <category>security</category>
      <category>containers</category>
      <category>community</category>
    </item>
    <item>
      <title>Why PyPI Doesn't Know Your Project's Dependencies but Thoth Does</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Sun, 13 Feb 2022 16:50:14 +0000</pubDate>
      <link>https://dev.to/fridex/why-pypi-doesnt-know-your-projects-dependencies-but-thoth-does-4eji</link>
      <guid>https://dev.to/fridex/why-pypi-doesnt-know-your-projects-dependencies-but-thoth-does-4eji</guid>
      <description>&lt;p&gt;How can I produce a dependency graph for Python packages? Why doesn't PyPI state the dependencies of Python packages? Let's take a look at these questions and at a solution for Python developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  PyPI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://pypi.org/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;, The Python Package Index, is the main source of open-source Python packages. It provides a way to publish, browse as well as obtain open-source Python packages. However, it does not list information about dependencies to users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why PyPI Doesn't Know Your Project's Dependencies
&lt;/h2&gt;

&lt;p&gt;Here I would like to &lt;a href="https://dustingram.com/articles/2018/03/05/why-pypi-doesnt-know-dependencies/" rel="noopener noreferrer"&gt;refer to an article by Dustin Ingram&lt;/a&gt;, one of the PyPI maintainers. The referenced article nicely explains this problem and shows why it is not possible to list all the dependencies for a Python package.&lt;/p&gt;

&lt;p&gt;Long story short, the article explains that the packaging of a Python project can execute a Python script that computes dependencies at installation time in the target environment. The script can decide which dependencies should be installed based on arbitrary code execution, creating the dependency listing dynamically. This may seem powerful, as users can express their needs in code, but in practice it is often neither necessary nor handy. The approach causes headaches for maintainers and developers, as dependencies are not statically declared and deterministically known in advance.&lt;/p&gt;

&lt;p&gt;This issue is slowly getting fixed with static wheel metadata, but source distributions can still suffer from it.&lt;/p&gt;
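&lt;p&gt;Static wheel metadata can be read without executing any code - the &lt;code&gt;METADATA&lt;/code&gt; file inside a wheel is an RFC 822-style document listing &lt;code&gt;Requires-Dist&lt;/code&gt; entries. A minimal sketch using only the standard library (the wheel built here is a tiny illustrative stand-in, not a real distribution):&lt;/p&gt;

```python
import io
import zipfile
from email.parser import Parser


def requires_dist(wheel_bytes: bytes) -> list:
    """Read the static Requires-Dist entries from a wheel's METADATA file."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        meta_name = next(n for n in wheel.namelist() if n.endswith(".dist-info/METADATA"))
        metadata = Parser().parsestr(wheel.read(meta_name).decode())
    return metadata.get_all("Requires-Dist") or []


# Build a tiny in-memory wheel to demonstrate (illustrative package name and deps):
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as wheel:
    wheel.writestr(
        "demo-0.1.dist-info/METADATA",
        "Metadata-Version: 2.1\nName: demo\nVersion: 0.1\n"
        "Requires-Dist: flask~=2.0.0\nRequires-Dist: requests; extra == 'http'\n",
    )

print(requires_dist(buf.getvalue()))  # ["flask~=2.0.0", "requests; extra == 'http'"]
```

&lt;p&gt;No setup script runs here - the dependency listing is declared data, which is exactly why static metadata avoids the problem described above.&lt;/p&gt;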

&lt;h2&gt;
  
  
  Project Thoth and Python Package Dependencies
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://thoth-station.ninja/" rel="noopener noreferrer"&gt;Project Thoth&lt;/a&gt; offers &lt;a href="https://developers.redhat.com/articles/2021/11/17/customize-python-dependency-resolution-machine-learning" rel="noopener noreferrer"&gt;a cloud Python resolver available publicly&lt;/a&gt; as an alternative to pip, Pipenv, or Poetry. Naturally, a resolver needs to know dependency graph to resolve application dependencies. Thoth's trick for obtaining the dependency graph lies in pre-computing dependency information by installing packages into containerized environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd5yglayrh7u3lgp02t7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd5yglayrh7u3lgp02t7.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine a containerized environment such as Fedora 34. It provides a prepared environment for installing Python packages - it ships a Python 3.9 interpreter and other software packages in specific versions. And that is exactly what Thoth's &lt;a href="https://developers.redhat.com/blog/2021/04/26/continuous-learning-in-project-thoth-using-kafka-and-argo" rel="noopener noreferrer"&gt;background data aggregation logic&lt;/a&gt; does: it installs each Python package into the containerized environment and checks what dependencies the given package has in the given container image.&lt;/p&gt;
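&lt;p&gt;The result of such an aggregation can be thought of as a mapping keyed by the environment, since the same package can resolve to different dependencies in different environments. A toy model (not Thoth's actual database schema; the environment keys and dependency data below are illustrative):&lt;/p&gt;

```python
# Pre-computed dependency data: (os_name, os_version, python_version) maps to
# a table of (package, version) -> list of dependency requirements.
db = {
    ("fedora", "34", "3.9"): {
        ("pandas", "1.3.3"): ["numpy>=1.17.3", "python-dateutil>=2.7.3", "pytz>=2017.3"],
    },
    ("ubi", "8", "3.8"): {
        ("pandas", "1.3.3"): ["numpy>=1.17.3", "python-dateutil>=2.7.3", "pytz>=2017.3"],
    },
}


def lookup(db, environment, package, version):
    """Return pre-computed dependencies of a package in a given environment, or None."""
    return db.get(environment, {}).get((package, version))


print(lookup(db, ("fedora", "34", "3.9"), "pandas", "1.3.3"))
```

&lt;p&gt;A resolver querying such a store never needs to execute a package's setup script at resolution time - the expensive, potentially dynamic part happened once, inside the container image.&lt;/p&gt;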

&lt;p&gt;Of course, there can be nuances when a package does not behave deterministically even in a predefined environment (the example is taken from Dustin's article linked above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;setuptools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;setup&lt;/span&gt;

&lt;span class="n"&gt;dependency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Schrodinger&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;paradox&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.0.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A nondeterministic package&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;install_requires&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is, however, rare and considered a &lt;em&gt;really&lt;/em&gt; bad practice. You should not do it. (By the way, Thoth has &lt;a href="https://developers.redhat.com/articles/2021/09/22/thoth-prescriptions-resolving-python-dependencies" rel="noopener noreferrer"&gt;a solution to fix even this&lt;/a&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Dependency Information in Thoth
&lt;/h2&gt;

&lt;p&gt;A component called &lt;a href="https://github.com/thoth-station/solver" rel="noopener noreferrer"&gt;thoth-solver&lt;/a&gt; is responsible for extracting dependency information together with additional metadata. Other components in Thoth's cloud resolver make sure that the dependency listing is kept up to date with new package releases. Check &lt;a href="https://developers.redhat.com/articles/2022/01/14/extracting-dependencies-python-packages" rel="noopener noreferrer"&gt;the following article for more information&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thoth invests resources to analyze Python packages. Once Python packages are analyzed and dependency information is extracted, data are synced into Thoth's database and made available to users as well as to Thoth's cloud resolver. You can query dependency information on &lt;a href="https://khemenu.thoth-station.ninja/" rel="noopener noreferrer"&gt;Thoth's API endpoints&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note that dependency information is obtained for each containerized environment individually. That way, the dependency information is more accurate than the dependency information available on &lt;a href="https://deps.dev/" rel="noopener noreferrer"&gt;Open Source Insights&lt;/a&gt;. Open Source Insights states dependency information for &lt;a href="https://blog.deps.dev/pypi/" rel="noopener noreferrer"&gt;a very specific setup&lt;/a&gt; and defaults to the latest versions found when the dependency information was obtained or refreshed. Thoth shows all the matching versions of available Python packages, even &lt;a href="https://developers.redhat.com/articles/2021/12/21/prevent-python-dependency-confusion-attacks-thoth" rel="noopener noreferrer"&gt;across multiple Python package indexes&lt;/a&gt;, for selected GNU/Linux distributions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consuming Dependency Information
&lt;/h2&gt;

&lt;p&gt;As of now, Thoth &lt;a href="https://khemenu.thoth-station.ninja/" rel="noopener noreferrer"&gt;provides API endpoints&lt;/a&gt; to consume the computed dependency information. API endpoints are publicly available so feel free to consume available dependency data.&lt;/p&gt;

&lt;p&gt;To obtain dependency information for &lt;a href="https://pypi.org/project/pandas/1.3.3/" rel="noopener noreferrer"&gt;package pandas in version 1.3.3&lt;/a&gt; from PyPI in Fedora 34 running Python 3.9, simply issue &lt;a href="https://khemenu.thoth-station.ninja/api/v1/python/package/version/metadata?name=pandas&amp;amp;version=1.3.3&amp;amp;index=https%3A%2F%2Fpypi.org%2Fsimple&amp;amp;os_name=fedora&amp;amp;os_version=34&amp;amp;python_version=3.9" rel="noopener noreferrer"&gt;the following HTTP GET request&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X 'GET' \
  'https://khemenu.thoth-station.ninja/api/v1/python/package/version/metadata?name=pandas&amp;amp;version=1.3.3&amp;amp;index=https%3A%2F%2Fpypi.org%2Fsimple&amp;amp;os_name=fedora&amp;amp;os_version=34&amp;amp;python_version=3.9' \
  -H 'accept: application/json'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note all the dependency versions listed, respecting &lt;a href="https://www.python.org/dev/peps/pep-0508/" rel="noopener noreferrer"&gt;extras and environment markers&lt;/a&gt;, besides other package metadata provided. The additional metadata shown include &lt;a href="https://packaging.python.org/en/latest/specifications/core-metadata/" rel="noopener noreferrer"&gt;core Python packaging metadata&lt;/a&gt;, the files available, and the packages (modules) brought in when installing pandas==1.3.3 from PyPI into the given environment. &lt;a href="https://github.com/thoth-station/solver#produced-output" rel="noopener noreferrer"&gt;Check the thoth-solver documentation&lt;/a&gt; for more information.&lt;/p&gt;

&lt;p&gt;You can &lt;a href="https://deps.dev/pypi/pandas/1.3.3" rel="noopener noreferrer"&gt;compare the shown dependency listing with Open Source Insights&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Thoth's Resolver
&lt;/h2&gt;

&lt;p&gt;The described dependency data are used in Thoth's resolver. The cloud-based &lt;a href="https://developers.redhat.com/articles/2021/11/17/customize-python-dependency-resolution-machine-learning" rel="noopener noreferrer"&gt;resolver uses reinforcement learning techniques&lt;/a&gt; to come up with the best possible libraries for your application. The mainstream dependency resolvers in Python - pip, Pipenv, and Poetry - resolve application dependencies to the latest possible versions, which might not always be the best choice. &lt;a href="https://redhat-scholars.github.io/managing-vulnerabilities-with-thoth" rel="noopener noreferrer"&gt;Check the following tutorial&lt;/a&gt;, which will walk you through some security-related aspects of Thoth.&lt;/p&gt;

&lt;p&gt;If you wish to give Thoth's cloud resolver a try, install &lt;a href="https://pypi.org/project/thamos" rel="noopener noreferrer"&gt;Thamos&lt;/a&gt;. Thamos is a command line interface to Thoth's backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;pip install thamos
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Thamos is installed, check available environments and add dependencies to your project. Finally, ask Thoth's resolver for an advisory on your application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;thamos environments
thamos config
thamos add "flask~=2.0.0"
thamos advise
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the help available for each Thamos command by supplying the &lt;code&gt;--help&lt;/code&gt; option. Do not hesitate &lt;a href="https://github.com/thoth-station/support" rel="noopener noreferrer"&gt;to provide feedback&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you wish to be updated with Thoth news, follow &lt;a href="http://twitter.com/ThothStation" rel="noopener noreferrer"&gt;@ThothStation on Twitter&lt;/a&gt; or &lt;a href="https://www.youtube.com/channel/UClUIDuq_hQ6vlzmqM59B2Lw" rel="noopener noreferrer"&gt;check Thoth-Station YouTube channel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;🧙🪄🐍&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>(Late) Hacktoberfest 2021 x Monstarlab Prague</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Sun, 14 Nov 2021 18:47:11 +0000</pubDate>
      <link>https://dev.to/fridex/late-hacktoberfest-2021-x-monstarlab-prague-5ed</link>
      <guid>https://dev.to/fridex/late-hacktoberfest-2021-x-monstarlab-prague-5ed</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--plSAbNsE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gqml0jxm42bnsebu4g3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--plSAbNsE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gqml0jxm42bnsebu4g3i.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;COVID19 made IT conferences and meetups virtual. It still feels more natural to me to meet people, exchange ideas, and socialize. Surprisingly, an e-mail landed in my inbox one day:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hello Fridolín!&lt;/p&gt;

&lt;p&gt;My name is Tiago, I'm organizing once again the Hacktoberfest in Prague and I found your profile on GitHub and would like to ask you if you're interested in joining us in the event and maybe do a small talk (around 20-25min) related to open source.&lt;br&gt;
%&amp;lt; snip &amp;gt;%&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's what the invite to "&lt;a href="https://www.meetup.com/Monstarlab-Prague/events/281441439"&gt;(Late) Hacktoberfest 2021 x Monstarlab Prague&lt;/a&gt;" looked like. The conversation with &lt;a href="https://www.linkedin.com/in/taraujodesouza/"&gt;Tiago&lt;/a&gt; resulted in an accepted talk titled "&lt;a href="https://www.youtube.com/watch?v=-Kcrx3ASbaw&amp;amp;list=PLokviu1ft5w3HupZRe2hZK9jG37eEo4YG&amp;amp;index=5"&gt;Full-time Open Source&lt;/a&gt;". The whole meetup had a very friendly atmosphere, with talks on topics from various areas of open source (... and of course, the pizza was tasty! 😋🍕).&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://www.youtube.com/playlist?list=PLokviu1ft5w3HupZRe2hZK9jG37eEo4YG"&gt;Hacktoberfest 2021 @ Monstarlab Prague playlist on YouTube&lt;/a&gt; with the recorded talks. Big thanks to Tiago and the other meetup organizers. I definitely recommend watching the talks, and I hope to see everyone interested in open source at this meetup next year.&lt;/p&gt;

&lt;p&gt;If you are not familiar with Hacktoberfest, &lt;a href="https://hacktoberfest.digitalocean.com/"&gt;visit their homepage&lt;/a&gt; to change the world with open-source and win a cool t-shirt or (newly) &lt;a href="https://tree-nation.com/profile/digitalocean"&gt;plant a tree&lt;/a&gt;! 👕🌲&lt;/p&gt;

</description>
      <category>hacktoberfest</category>
      <category>opensource</category>
      <category>meetup</category>
    </item>
    <item>
      <title>How to beat Python’s pip: Software stack resolution pipelines</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Mon, 16 Nov 2020 20:44:22 +0000</pubDate>
      <link>https://dev.to/fridex/how-to-beat-python-s-pip-software-stack-resolution-pipelines-19kg</link>
      <guid>https://dev.to/fridex/how-to-beat-python-s-pip-software-stack-resolution-pipelines-19kg</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-reinforcement-learning-based-dependency-resolution-2he2"&gt;Following our previous article about reinforcement learning-based dependency resolution&lt;/a&gt;, we will take a look at actions taken during the resolution process. An example can be resolving &lt;a href="https://pypi.org/project/intel-tensorflow/"&gt;intel-tensorflow&lt;/a&gt; instead of &lt;a href="https://pypi.org/project/tensorflow/"&gt;tensorflow&lt;/a&gt; following programmable rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HHMai6B0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/9585d4vejsl3qb6pkmst.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HHMai6B0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/9585d4vejsl3qb6pkmst.png" alt="Alt Text" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lungern, Switzerland. Image by the author.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Dependency graphs and software stack resolution
&lt;/h1&gt;

&lt;p&gt;Users and maintainers have limited control over additional semantics when it comes to dependencies installed and resolved in Python applications. Tools such as &lt;a href="https://pypi.org/project/pip/"&gt;pip&lt;/a&gt;, &lt;a href="https://pypi.org/project/pipenv/"&gt;Pipenv&lt;/a&gt;, or &lt;a href="https://pypi.org/project/poetry/"&gt;Poetry&lt;/a&gt; resolve a dependency stack to the latest possible candidate following the dependency specification stated by application developers (direct dependencies) and by library maintainers (transitive dependencies). This can become a limitation, especially considering applications that were written months or years ago and require non-zero attention in maintenance. A trigger to revisit the dependency stack can be &lt;a href="https://www.cvedetails.com/product/53738/Google-Tensorflow.html?vendor_id=1224"&gt;a security vulnerability found in the software stack&lt;/a&gt; or an observation that the given software stack is no longer suitable for the given task and should be upgraded or downgraded (e.g. performance improvements in releases).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developers.redhat.com/blog/2020/09/30/ai-software-stack-inspection-with-thoth-and-tensorflow/"&gt;Even the resolution of the latest software can lead to issues that library maintainers haven’t spotted or haven’t considered&lt;/a&gt;. We already know that the state space of all the possible software stacks in the Python ecosystem is in many cases &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-a-brief-intro-4bec"&gt;too large to explore and evaluate&lt;/a&gt;. Moreover, dependencies in the application stack evolve over time and &lt;a href="https://thoth-station.ninja/j/tf_41902.html"&gt;underpinning or overpinning dependencies happen quite often&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another aspect to consider is the human being itself. The complexity behind libraries and application stacks becomes a field of expertise on its own. &lt;em&gt;What’s the best performing TensorFlow stack for the given application running on specific hardware&lt;/em&gt;? Should I use &lt;a href="https://tensorflow.pypi.thoth-station.ninja/"&gt;a Red Hat build&lt;/a&gt;, an &lt;a href="https://pypi.org/project/intel-tensorflow/"&gt;Intel build&lt;/a&gt;, or a &lt;a href="https://pypi.org/project/tensorflow/"&gt;Google TensorFlow build&lt;/a&gt;? All of the companies mentioned have dedicated teams focusing on the performance of the builds produced, and certain manpower is required to answer these questions. The performance aspect described is just another item in the vector that goes into evaluating application stack quality.&lt;/p&gt;

&lt;h1&gt;
  
  
  Software stack resolution pipeline and pipeline configuration
&lt;/h1&gt;

&lt;p&gt;Let’s promote the whole resolution process to the server side. In that case, the resolver can use a shared database of knowledge that assists with the software stack resolution. The whole resolution process can be treated as a pipeline made out of units that cooperate to form the most suitable stack for user needs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Server-side resolution is not required, but it definitely helps with the whole process. Users are not required to maintain the database, and serving software stack resolution as a service has other pros as well (e.g. an allocated pool of resources).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The software stack resolution pipeline can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;inject new packages&lt;/strong&gt; or new package versions to the dependency graph based on packages resolved (e.g. a package accidentally not stated as a dependency of a library, dependency underpinning issues, ...)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;remove a dependency&lt;/strong&gt; in a specific version or the whole dependency with its dependency subgraph from the dependency graph and let resolver find another resolution path (e.g. a package accidentally stated as a dependency, missing ABI symbols in the runtime environment, dependency overpinning issues, ...)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;score a package occurring in the dependency graph positively — &lt;strong&gt;prioritize resolution of a specific package&lt;/strong&gt; in the dependency graph (e.g. positive performance aspect of a package in a specific version/build)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;score a package in a specific version occurring in the dependency graph negatively — &lt;strong&gt;prioritize resolution of other versions&lt;/strong&gt; (e.g. a security vulnerability present in a specific release)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;prevent resolving a specific package&lt;/strong&gt; in a specific version so that resolver tries to find a different resolution path if any (e.g. buggy package releases)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
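&lt;p&gt;The actions above can be sketched as small units that either adjust a score or prune a resolution path. This is a hypothetical sketch - the class names, method names, and package data below are illustrative, not thoth-adviser's actual API:&lt;/p&gt;

```python
class CVEPenalizationStep:
    """Score a vulnerable package version negatively to prioritize other versions."""

    KNOWN_CVES = {("tensorflow", "2.1.0")}  # illustrative data, not a real CVE feed

    def run(self, package, version):
        if (package, version) in self.KNOWN_CVES:
            return -0.2  # negative reward: prefer other versions
        return 0.0


class RemovalStep:
    """Prevent resolving a specific package version entirely."""

    BLOCKED = {("paradox", "0.0.1")}  # illustrative buggy release

    def run(self, package, version):
        if (package, version) in self.BLOCKED:
            return None  # signal: this resolution path is pruned
        return 0.0


pipeline = [CVEPenalizationStep(), RemovalStep()]


def score(package, version):
    """Run all pipeline units; a None result prunes the path, otherwise sum rewards."""
    total = 0.0
    for unit in pipeline:
        reward = unit.run(package, version)
        if reward is None:
            return None
        total += reward
    return total


print(score("tensorflow", "2.1.0"))  # -0.2
print(score("paradox", "0.0.1"))     # None
```

&lt;p&gt;A resolver guided by such scores naturally drifts away from penalized versions and never enters pruned subgraphs.&lt;/p&gt;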

&lt;p&gt;These pipeline units form &lt;em&gt;autonomous pieces&lt;/em&gt; that know when they should be included in the resolution pipeline (thus be part of the "&lt;em&gt;pipeline configuration&lt;/em&gt;") and know when to perform certain actions during the actual resolution.&lt;/p&gt;

&lt;p&gt;A component called "&lt;em&gt;pipeline builder&lt;/em&gt;" adds pipeline units to the pipeline configuration based on the decision made by the pipeline unit itself. This is done during the phase which creates the pipeline configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZD7NW7_A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/7gooadzbwg2ud4e89aac.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZD7NW7_A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/7gooadzbwg2ud4e89aac.gif" alt="Alt Text" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Creation of a resolution pipeline configuration by the pipeline builder. Image by the author.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once the resolution pipeline is built, it is used during the resolution process.&lt;/p&gt;

&lt;h1&gt;
  
  
  A software stack resolution process
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-reinforcement-learning-based-dependency-resolution-2he2"&gt;In the last article, we have described a resolution process as a Markov decision process&lt;/a&gt;. This uncovered the potential to use reinforcement learning algorithms to come up with a suitable software stack candidate for applications.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Latest software is not always the greatest.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-reinforcement-learning-based-dependency-resolution-2he2"&gt;The last article described&lt;/a&gt; three main entities used during the resolution process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resolver&lt;/strong&gt; — an entity for resolving software following Python packaging specification&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Predictor&lt;/strong&gt; — an entity used for guiding the resolution in the dependency graph&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Software stack resolution pipeline&lt;/strong&gt; — an abstraction for scoring and adjusting the dependency graph&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole resolution process is then seen as a cooperation of the three described.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---bYg6I3T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/jfcss0g16htfgvbdnrqx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---bYg6I3T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/jfcss0g16htfgvbdnrqx.gif" alt="Alt Text" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A resolution process guided by a predictor — magician. The fairy girl corresponds to the resolver which passes the predicted part of the dependency graph (a package) to the scoring pipeline. Results of the scoring pipeline (reward signal) are reported back to the predictor. Image by the author.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The software stack resolution pipeline is formed out of units of different types, each serving its own purpose. An example is the pipeline unit type "Step", which maps to an action taken in a Markov decision process.&lt;/p&gt;

&lt;p&gt;The resolver can be easily extended by &lt;a href="https://thoth-station.ninja/docs/developers/adviser/unit.html"&gt;providing pipeline units that follow the semantics and API and help with the software stack resolution process&lt;/a&gt;. The interface is simple, so anyone can provide their own implementation and extend the resolution process with the provided knowledge. &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-solving-python-dependencies-2d6e"&gt;The pre-aggregated knowledge of dependencies&lt;/a&gt; helps with the offline resolution so that the system can score hundreds of software stacks per second.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/OCX8JQDXP9s"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The demo shown above demonstrates how pipeline units can be used in a resolution process to come up with a software stack that respects the supplied pipeline configuration. The resolution process finds an &lt;code&gt;intel-tensorflow==2.0.1&lt;/code&gt; software stack instead of the pinned &lt;code&gt;tensorflow==2.1.0&lt;/code&gt; specified in the direct dependency listing. The notebook shown can be found in &lt;a href="https://github.com/thoth-station/notebooks/blob/master/notebooks/development/Pipeline%20units.ipynb"&gt;the linked repository&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Thoth adviser
&lt;/h1&gt;

&lt;p&gt;If you are interested in the resolution process and core principles used in the implementation, you can check &lt;a href="https://thoth-station.ninja/docs/developers/adviser/"&gt;thoth-adviser documentation&lt;/a&gt; and sources available on GitHub.&lt;/p&gt;

&lt;p&gt;Also, check other articles from the “How to beat Python’s pip” series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-a-brief-intro-4bec"&gt;A brief intro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-solving-python-dependencies-2d6e"&gt;Solving Python dependencies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-dependency-monkey-inspecting-the-quality-of-tensorflow-dependencies-6fc"&gt;Dependency Monkey inspecting the quality of TensorFlow dependencies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-inspecting-the-quality-of-machine-learning-software-1pkp"&gt;Inspecting the quality of machine learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Project Thoth
&lt;/h1&gt;

&lt;p&gt;Project Thoth is an application that aims to help Python developers. If you wish to be updated on any improvements and any progress we make in project Thoth, feel free to &lt;a href="https://www.youtube.com/channel/UClUIDuq_hQ6vlzmqM59B2Lw"&gt;subscribe to our YouTube channel&lt;/a&gt; where we post updates as well as recordings from scrum demos. We also &lt;a href="https://twitter.com/thothstation"&gt;post updates on Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Stay tuned for any new updates!&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How to beat Python’s pip: Reinforcement learning-based dependency resolution</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Sat, 07 Nov 2020 18:09:27 +0000</pubDate>
      <link>https://dev.to/fridex/how-to-beat-python-s-pip-reinforcement-learning-based-dependency-resolution-2he2</link>
      <guid>https://dev.to/fridex/how-to-beat-python-s-pip-reinforcement-learning-based-dependency-resolution-2he2</guid>
      <description>&lt;p&gt;The next episode of our series will be more theoretical. It prepares the ground for the next article, which will tie together the things we have discussed so far. We will take a look at Monte Carlo tree search, Temporal Difference learning, and the Markov decision process, and how they can be used in a resolution process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F18y5aidzj9l2ggizrdw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F18y5aidzj9l2ggizrdw1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Balos beach on Crete island, Greece. Image by the author.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Markov decision process as a base for resolver
&lt;/h1&gt;

&lt;p&gt;First, let’s take a look at the &lt;a href="https://en.wikipedia.org/wiki/Markov_decision_process" rel="noopener noreferrer"&gt;Markov decision process (MDP)&lt;/a&gt;. At its base, it provides us with a mathematical framework for modeling decision making (see &lt;a href="https://en.wikipedia.org/wiki/Markov_decision_process" rel="noopener noreferrer"&gt;more info in the linked Wikipedia article&lt;/a&gt;). To understand the decision-making process, let’s apply it to the resolution process (other examples can be found on the Internet).&lt;/p&gt;

&lt;p&gt;Instead of implementing a resolver using &lt;a href="https://en.wikipedia.org/wiki/Boolean_satisfiability_problem" rel="noopener noreferrer"&gt;SAT&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Backtracking" rel="noopener noreferrer"&gt;backtracking&lt;/a&gt;, we will walk through the dependency graph. In that case, the resolver tries to satisfy the dependencies of an application considering the version range specifications of direct dependencies, and recursively those of the transitive ones, until it finds a valid resolution (or concludes there is none).&lt;/p&gt;

&lt;p&gt;At first, the resolver starts in an “initial state” that holds all the requirements of the application to be included in the application stack. After a few rounds, it ends up in a state &lt;em&gt;sn&lt;/em&gt; that holds two sets: a set of dependencies still to be resolved and a set of dependencies already resolved and included in the application stack.&lt;/p&gt;

&lt;p&gt;In each state &lt;em&gt;sn&lt;/em&gt;, the resolver can take an &lt;em&gt;action&lt;/em&gt; that corresponds to making a &lt;em&gt;decision&lt;/em&gt; on which dependency should be included in the application stack. An example is shown in the figure down below — the resolver can take the action to resolve &lt;code&gt;jinja2==2.10.2&lt;/code&gt; coming from &lt;a href="https://pypi.org/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;. By doing so, the resolver adds &lt;code&gt;jinja2==2.10.2&lt;/code&gt; to the resolved dependencies set and adds all the dependencies on which &lt;code&gt;jinja2==2.10.2&lt;/code&gt; directly depends to the unresolved dependencies set (respecting the version range specifications).&lt;/p&gt;
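
&lt;p&gt;The state transition described above can be sketched in a few lines of Python (a minimal illustration only; the class and attribute names are invented, not Thoth’s actual implementation):&lt;/p&gt;

```python
class State:
    """One resolver state: what is still unresolved and what is resolved."""

    def __init__(self, unresolved, resolved=None):
        # package names yet to be resolved
        self.unresolved = set(unresolved)
        # (package, version) pairs already included in the stack
        self.resolved = set(resolved or ())

    def take_action(self, package, version, direct_deps):
        # resolve one package: move it to the resolved set and add its
        # direct dependencies to the unresolved set
        new_state = State(self.unresolved, self.resolved)
        new_state.unresolved.discard(package)
        new_state.resolved.add((package, version))
        new_state.unresolved.update(direct_deps)
        return new_state

# initial state: the direct requirements of the application
s0 = State(unresolved={"jinja2"})
s1 = s0.take_action("jinja2", "2.10.2", {"markupsafe"})
```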

&lt;p&gt;As &lt;code&gt;jinja2==2.10.2&lt;/code&gt; can affect our application stack positively or negatively based on the knowledge we have about this dependency (e.g. build-time errors &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-inspecting-the-quality-of-machine-learning-software-1pkp"&gt;spotted in our software inspections&lt;/a&gt;), we can respect this fact by propagating the "&lt;em&gt;reward signal&lt;/em&gt;" that corresponds to the action taken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fllhdjnjqaeeyrjbm4xto.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fllhdjnjqaeeyrjbm4xto.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Decision-making process — resolving dependencies by walking through the dependency graph and retrieving the reward signal. This process can be modeled as an MDP. Image by the author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The accumulated reward signal across all the actions taken to reach a final state (all the packages included in the software stack/computed lock file) then corresponds to the overall software stack quality (i.e. the software stack score).&lt;/p&gt;
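
&lt;p&gt;As a toy illustration of this scoring (the packages and reward values below are made up), the software stack score is simply the sum of the per-action reward signals:&lt;/p&gt;

```python
# hypothetical per-action reward signals (values invented)
rewards = {
    ("jinja2", "2.10.2"): 1.0,     # no known issues
    ("markupsafe", "1.1.1"): 0.5,  # e.g. a known minor issue lowers the reward
}

def stack_score(actions):
    # actions: the (package, version) decisions taken while walking the graph
    return sum(rewards.get(action, 0.0) for action in actions)

score = stack_score([("jinja2", "2.10.2"), ("markupsafe", "1.1.1")])
```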

&lt;h1&gt;
  
  
  Reinforcement learning-based dependency resolution and abstractions
&lt;/h1&gt;

&lt;p&gt;The resolution process can be seen as communication between abstractions described below (following &lt;a href="https://en.wikipedia.org/wiki/Object-oriented_programming" rel="noopener noreferrer"&gt;object-oriented programming paradigm&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Resolver
&lt;/h2&gt;

&lt;p&gt;... is an abstraction that can resolve software stacks following rules (e.g. how dependencies should be resolved respecting Python packaging standards). It uses &lt;em&gt;Predictor&lt;/em&gt; to help with guiding which dependencies should be resolved while traversing the dependency graph (we do not need to resolve the latest packages necessarily). &lt;em&gt;Resolver&lt;/em&gt; also triggers the &lt;em&gt;Software stack resolution pipeline&lt;/em&gt; to compute the immediate reward signal that is subsequently forwarded to &lt;em&gt;Predictor&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Predictor
&lt;/h2&gt;

&lt;p&gt;… helps &lt;em&gt;Resolver&lt;/em&gt; to resolve software stacks — acts as a "&lt;em&gt;decision-maker&lt;/em&gt;". It selects dependencies that should be included in a software stack to deliver the required quality (e.g. stable software, secure software, latest software, ...). By selecting packages and observing the reward signal, it learns which packages should be included in a software stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software stack resolution pipeline
&lt;/h2&gt;

&lt;p&gt;... is a scoring pipeline that judges actions made by the &lt;em&gt;Predictor&lt;/em&gt;. The output of this pipeline is primarily the "&lt;em&gt;reward signal&lt;/em&gt;" that is computed in the pipeline units. This resolution pipeline can be dynamically constructed on each run to respect user needs (e.g. different pipeline units to deliver "&lt;em&gt;secure software&lt;/em&gt;" in opposite to "&lt;em&gt;well-performing software&lt;/em&gt;" given the operating system and hardware used to run the application).&lt;/p&gt;
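
&lt;p&gt;The communication between the three abstractions can be sketched as follows (a schematic only; the real interfaces in Thoth differ, and the random predictor and the trivial scoring pipeline are placeholders):&lt;/p&gt;

```python
import random

class Predictor:
    def select(self, unresolved):
        # decision-maker: a random choice stands in for a learned policy
        return random.choice(sorted(unresolved))

    def learn(self, action, reward):
        # a real predictor would update its value estimates here
        pass

def resolution_pipeline(action):
    # stands in for the dynamically constructed scoring pipeline units
    return 1.0 if action != "flask" else -1.0

def resolver(requirements, predictor):
    unresolved, resolved, total = set(requirements), [], 0.0
    while unresolved:
        action = predictor.select(unresolved)   # Predictor decides
        unresolved.discard(action)
        resolved.append(action)
        reward = resolution_pipeline(action)    # pipeline scores the action
        predictor.learn(action, reward)         # reward signal forwarded back
        total = total + reward
    return resolved, total

stack, score = resolver({"jinja2", "markupsafe"}, Predictor())
```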

&lt;h1&gt;
  
  
  Monte Carlo tree search and Temporal Difference learning in a resolution process
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Simulated_annealing" rel="noopener noreferrer"&gt;Simulated annealing&lt;/a&gt; in its &lt;a href="https://en.wikipedia.org/wiki/Adaptive_simulated_annealing" rel="noopener noreferrer"&gt;adaptive form&lt;/a&gt; (ASA) was used as the first type of Predictor in the resolution process. Even though the ASA based predictor does not learn anything about the state space of possible software stacks, it gave a base for resolving software stacks that had many times higher quality than the "&lt;em&gt;latest&lt;/em&gt;" software (as resolved by Pipenv or pip).&lt;/p&gt;

&lt;p&gt;The next natural step for the resolution process was to learn from actions taken during the resolution process. &lt;a href="https://en.wikipedia.org/wiki/Temporal_difference_learning" rel="noopener noreferrer"&gt;Temporal Difference learning&lt;/a&gt; and, later, &lt;a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search" rel="noopener noreferrer"&gt;Monte Carlo tree search&lt;/a&gt; principles were used for implementing the next predictor types. As there is no real opponent to play against (unlike in game-playing reinforcement learning algorithms), formulas like &lt;em&gt;UCB1&lt;/em&gt; could not be applied in their direct form. To balance &lt;a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search#Exploration_and_exploitation" rel="noopener noreferrer"&gt;exploration and exploitation&lt;/a&gt;, ideas from ASA were reused. &lt;em&gt;Time&lt;/em&gt; became the real opponent, keeping the resolution process responsive enough for users. The number of software stacks successfully resolved so far or the number of iterations done in the resolver became attributes for balancing &lt;a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search#Exploration_and_exploitation" rel="noopener noreferrer"&gt;exploration and exploitation&lt;/a&gt; (possibly among other attributes).&lt;/p&gt;
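
&lt;p&gt;A hypothetical sketch of this balancing (not the actual adaptive simulated annealing formulas used in Thoth): a "temperature" derived from the resolver’s progress decides whether to explore a random candidate or exploit the best-scored one seen so far:&lt;/p&gt;

```python
import random

def temperature(iterations_done, iterations_limit):
    # cools down as the resolver approaches its limit: time is the opponent
    return max(0.0, 1.0 - iterations_done / iterations_limit)

def pick(candidates, scores, iterations_done, iterations_limit, rng=random):
    t = temperature(iterations_done, iterations_limit)
    if rng.random() > t:
        # low temperature: mostly exploit the best-scored candidate
        return max(candidates, key=lambda c: scores.get(c, 0.0))
    # high temperature: explore a random candidate
    return rng.choice(candidates)
```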

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/WEJ65Rvj3lc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A video that includes core principles of Python dependency resolution and reinforcement learning dependency resolution principles.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/thoth-station/talks/blob/master/2020-09-25-devconf-us/reinforcement_learning_based_dependency_resolution.pdf" rel="noopener noreferrer"&gt;a link to slides used during the talk&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The video presented above was &lt;a href="https://devconfus2020.sched.com/event/dkYC/reinforcement-learning-based-dependency-resolution" rel="noopener noreferrer"&gt;introduced as part of DevConf.US 2020&lt;/a&gt; and demonstrates these principles in more depth. &lt;a href="https://www.devconf.info/" rel="noopener noreferrer"&gt;Check out the linked annual event&lt;/a&gt;, taking place in the USA, the Czech Republic, and India each year (this year as a virtual event). A lot of cool topics can be explored and you can also participate next year: the event is open, just like open source.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://www.devconf.info/" rel="noopener noreferrer"&gt;devconf.info&lt;/a&gt; for more info.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approximating maxima of an N-dimensional function using dimension addition and reinforcement learning
&lt;/h2&gt;

&lt;p&gt;See also "&lt;a href="https://towardsdatascience.com/approximating-maxima-of-an-n-dimensional-function-using-dimension-addition-cb79e910fa2b" rel="noopener noreferrer"&gt;Approximating maxima of an N-dimensional function using dimension addition and reinforcement learning&lt;/a&gt;" to get more insights from a slightly different perspective.&lt;/p&gt;




&lt;h1&gt;
  
  
  Project Thoth
&lt;/h1&gt;

&lt;p&gt;Project Thoth is an application that aims to help Python developers. If you wish to be updated on any improvements and any progress we make in project Thoth, feel free to subscribe to our YouTube channel where we post updates as well as recordings from scrum demos. Check also &lt;a href="https://twitter.com/thothstation" rel="noopener noreferrer"&gt;our Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Stay tuned for any new updates!&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>micropipenv: the one installation tool that covers Pipenv, Poetry and pip-tools</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Tue, 03 Nov 2020 21:06:21 +0000</pubDate>
      <link>https://dev.to/fridex/micropipenv-the-one-installation-tool-that-covers-pipenv-poetry-and-pip-tools-3ee7</link>
      <guid>https://dev.to/fridex/micropipenv-the-one-installation-tool-that-covers-pipenv-poetry-and-pip-tools-3ee7</guid>
      <description>&lt;p&gt;In this article, we will take a look at a tool called "&lt;a href="https://github.com/thoth-station/micropipenv" rel="noopener noreferrer"&gt;micropipenv&lt;/a&gt;". Its main goal is to serve as a common layer for installing Python dependencies as specified by &lt;a href="https://pypi.org/project/pip" rel="noopener noreferrer"&gt;pip&lt;/a&gt;, &lt;a href="https://pypi.org/project/pip-tools" rel="noopener noreferrer"&gt;pip-tools&lt;/a&gt;, &lt;a href="https://pypi.org/project/pipenv" rel="noopener noreferrer"&gt;Pipenv&lt;/a&gt; or &lt;a href="https://pypi.org/project/poetry" rel="noopener noreferrer"&gt;Poetry&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4jq1tgxu0jewcjsjc116.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4jq1tgxu0jewcjsjc116.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are a Python developer, you’ve probably experienced the dependency hell you can easily end up in with Python’s packaging. A tool called &lt;code&gt;pip&lt;/code&gt; has been around for quite some time, but it is not sufficient on its own if you develop or maintain any larger Python-based application.&lt;/p&gt;

&lt;h1&gt;
  
  
  pip-tools
&lt;/h1&gt;

&lt;p&gt;A project called &lt;a href="http://pypi.org/project/pip-tools" rel="noopener noreferrer"&gt;pip-tools&lt;/a&gt; tries to address the lack of dependency management in pip. It uses two text files - &lt;code&gt;requirements.in&lt;/code&gt; and &lt;code&gt;requirements.txt&lt;/code&gt;. The first file, &lt;code&gt;requirements.in&lt;/code&gt;, managed by a user, states the direct dependencies of an application with version range specifications. The latter one, &lt;code&gt;requirements.txt&lt;/code&gt;, managed by &lt;code&gt;pip-tools&lt;/code&gt;, acts as a lock file stating all the packages in the specific versions necessary to run the application (an analogy to &lt;code&gt;npm-shrinkwrap.json&lt;/code&gt; or &lt;code&gt;package-lock.json&lt;/code&gt; from the npm ecosystem).&lt;/p&gt;
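
&lt;p&gt;To make the two files concrete, an illustrative &lt;code&gt;requirements.in&lt;/code&gt; and the lock file that &lt;code&gt;pip-compile&lt;/code&gt; would derive from it might look roughly like this (package versions invented):&lt;/p&gt;

```plaintext
# requirements.in - maintained by you: direct dependencies, version ranges
jinja2>=2.10

# requirements.txt - generated by pip-compile: everything pinned
jinja2==2.11.2
markupsafe==1.1.1         # via jinja2
```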

&lt;p&gt;To use &lt;code&gt;pip-tools&lt;/code&gt;, you need to explicitly maintain a Python virtual environment (if you need one). Starting with Python 3.3, you can issue the following commands to create and activate one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m venv venv/
source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you have created the virtual environment, commands &lt;code&gt;pip-compile&lt;/code&gt; and &lt;code&gt;pip-sync&lt;/code&gt; will become your friends.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fghuqi26hct9ei00bttbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fghuqi26hct9ei00bttbd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;pip-tools workflow as described in pip-tools package description available on PyPI&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For dependencies needed to execute your test suite, &lt;code&gt;pip-tools&lt;/code&gt; introduced a convention using &lt;code&gt;dev-requirements.in&lt;/code&gt; and &lt;code&gt;dev-requirements.txt&lt;/code&gt;. The semantics can be analogically deduced.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on &lt;code&gt;setup.py&lt;/code&gt; and &lt;code&gt;requirements.txt&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;You have probably seen a &lt;code&gt;requirements.txt&lt;/code&gt; file used with &lt;code&gt;setup.py&lt;/code&gt;. Note the difference from the &lt;code&gt;pip-tools&lt;/code&gt; usage, which can be misleading for newcomers. If you want to publish a library, you don’t want to restrict all the versions by pinning transitive dependencies to specific versions, as you don’t know what the resolved software stack will look like on the application side. The final resolution should always happen on the application level, not on the library level. You, as a library maintainer, just want to provide information about the compatibility of your library with the dependencies it uses.&lt;/p&gt;
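
&lt;p&gt;To make the contrast concrete (package names and versions below are invented): a library declares compatibility ranges, while the application’s lock file pins exact versions, including transitive dependencies:&lt;/p&gt;

```python
# what a library declares (setup.py install_requires style): ranges only
library_install_requires = ["jinja2>=2.10"]

# what an application's lock file pins (requirements.txt style): exact
# versions for everything, including transitive dependencies
application_lock = ["jinja2==2.11.2", "markupsafe==1.1.1"]

# the final resolution happens on the application level: every lock file
# entry is an exact pin
assert all("==" in entry for entry in application_lock)
```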

&lt;h1&gt;
  
  
  Pipenv
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://twitter.com/kennethreitz" rel="noopener noreferrer"&gt;Kenneth Reitz&lt;/a&gt;, the author of one of the most popular Python library — &lt;a href="https://pypi.org/project/requests/" rel="noopener noreferrer"&gt;requests&lt;/a&gt;, introduced a project called &lt;a href="https://pipenv.pypa.io/en/latest/" rel="noopener noreferrer"&gt;Pipenv&lt;/a&gt; in &lt;a href="https://www.kennethreitz.org/essays/announcing-pipenv" rel="noopener noreferrer"&gt;January 2017&lt;/a&gt;. The project gained popularity and attention from the community very quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjtqw85b2ugkv0047rxja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjtqw85b2ugkv0047rxja.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pipenv: Python Dev Workflow for Humans&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The main aim of the project was to simplify dependency management and make it more user-friendly. It helped to maintain a virtual environment and manage dependencies with one single command, using newly introduced files &lt;code&gt;Pipfile&lt;/code&gt; (TOML) and &lt;code&gt;Pipfile.lock&lt;/code&gt; (JSON).&lt;/p&gt;
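
&lt;p&gt;An illustrative &lt;code&gt;Pipfile&lt;/code&gt; might look like this (the package entries are examples); running &lt;code&gt;pipenv lock&lt;/code&gt; produces the matching, fully pinned &lt;code&gt;Pipfile.lock&lt;/code&gt;:&lt;/p&gt;

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
jinja2 = ">=2.10"

[dev-packages]
pytest = "*"
```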

&lt;h1&gt;
  
  
  Pipenv in our team
&lt;/h1&gt;

&lt;p&gt;We, at Red Hat, adopted Pipenv and made it one of the options for dependency management during deployment in the &lt;a href="https://github.com/openshift/source-to-image" rel="noopener noreferrer"&gt;OpenShift’s Source-To-Image&lt;/a&gt; build process (see for example &lt;a href="https://github.com/sclorg/s2i-python-container/tree/master/3.7" rel="noopener noreferrer"&gt;Fedora-based Python container images&lt;/a&gt; or &lt;a href="https://github.com/thoth-station/s2i-thoth" rel="noopener noreferrer"&gt;Thoth’s Python s2i container images&lt;/a&gt;). It gained popularity also in our &lt;a href="https://github.com/aicoe" rel="noopener noreferrer"&gt;AICoE&lt;/a&gt; and &lt;a href="https://github.com/thoth-station" rel="noopener noreferrer"&gt;Thoth team&lt;/a&gt; where de-facto all the repositories with Python source code use Pipenv for dependency management.&lt;/p&gt;

&lt;h1&gt;
  
  
  Poetry
&lt;/h1&gt;

&lt;p&gt;Another option is to use &lt;a href="http://pypi.org/project/poetry" rel="noopener noreferrer"&gt;Poetry&lt;/a&gt;. It looks like Poetry attracted the Python community, especially during the silent phase of Pipenv. Poetry uses a different lock file format than Pipenv and is not compatible with Pipenv or pip at all.&lt;/p&gt;

&lt;h1&gt;
  
  
  micropipenv
&lt;/h1&gt;

&lt;p&gt;Some of these requirements and ideas led to the introduction of a new tool called &lt;a href="http://pypi.org/project/micropipenv" rel="noopener noreferrer"&gt;micropipenv&lt;/a&gt;. No, there is no intention to introduce another &lt;code&gt;pip&lt;/code&gt;, &lt;code&gt;pip-tools&lt;/code&gt;, &lt;code&gt;Poetry&lt;/code&gt; or &lt;code&gt;Pipenv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fienmpp1jyvupsivjyms1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fienmpp1jyvupsivjyms1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://xkcd.com/1987/" rel="noopener noreferrer"&gt;https://xkcd.com/1987/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This lightweight implementation (one file, around 1200 LOC with comments) has one optional dependency and can install your requirements from files that are managed using Pipenv, Poetry, pip-tools, or a simple &lt;code&gt;requirements.txt&lt;/code&gt; file as commonly used together with &lt;code&gt;setup.py&lt;/code&gt;. The only thing you need to do is issue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;micropipenv install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool automatically detects what type of requirements file or lock file you use for managing your dependencies and performs the desired installation.&lt;/p&gt;

&lt;p&gt;Moreover, the tool offers a simple conversion feature where any lock file can be transformed to &lt;code&gt;requirements.txt&lt;/code&gt; and/or &lt;code&gt;requirements.in&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To produce a pip-tools style &lt;code&gt;requirements.in&lt;/code&gt; and &lt;code&gt;requirements.txt&lt;/code&gt; you can simply perform the following commands, assuming you have Pipenv or Poetry lock files present in the directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;micropipenv requirements --no-dev &amp;gt; requirements.txt
micropipenv requirements --no-dev --only-direct &amp;gt; requirements.in
micropipenv requirements --no-default &amp;gt; dev-requirements.txt
micropipenv requirements --no-default --only-direct &amp;gt; dev-requirements.in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you just want to install dependencies, you don’t need to install micropipenv at all. You can simply download it and let it do its one-time job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl https://raw.githubusercontent.com/thoth-station/micropipenv/master/micropipenv.py | python3 - install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See &lt;a href="https://github.com/thoth-station/micropipenv" rel="noopener noreferrer"&gt;documentation and sources&lt;/a&gt; for more info.&lt;/p&gt;

&lt;h1&gt;
  
  
  Thoth
&lt;/h1&gt;

&lt;p&gt;One of the ongoing efforts in Red Hat’s Office of the CTO is a project called Thoth. The newly introduced tool in this article, micropipenv, was born in this project. Check one of our publicly available demos to see micropipenv in action (the &lt;a href="https://www.youtube.com/watch?v=I-QC83BcLuo&amp;amp;t=9m" rel="noopener noreferrer"&gt;micropipenv demo starts at 9:00&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/I-QC83BcLuo"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;You can follow our &lt;a href="https://www.youtube.com/channel/UClUIDuq_hQ6vlzmqM59B2Lw" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt; for more updates.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why micropipenv?
&lt;/h1&gt;

&lt;p&gt;The main reason behind micropipenv was to reduce the maintenance cost of Pipenv during its silent phase. However, it turned out to be a good idea to have such a minimalistic tool for installing dependencies, especially when it comes to containerized applications. The main advantage of micropipenv turned out to be its size.&lt;br&gt;
When deploying applications in containerized environments, it’s a really good idea to maintain a lock file for the application. As the lock file states the whole dependency stack already resolved, there is no reason to ship Poetry or Pipenv in the container image. A tool that just installs the dependencies from any supplied lock file seems like a minimalistic way to reduce the container image size and the software present in it (and thus shipped with the application).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/fridex/s2i-example-micropipenv" rel="noopener noreferrer"&gt;A simple size comparison done a while back showed approximately 30.4MiB difference&lt;/a&gt; when Pipenv was not installed into the containerized environment in comparison to a single file approach using micropipenv.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;micropipenv can read in &lt;code&gt;Pipfile.lock&lt;/code&gt;, &lt;code&gt;requirements.txt&lt;/code&gt; or &lt;code&gt;poetry.lock&lt;/code&gt; stating already resolved software stack and install it using pip.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The design of the CLI made micropipenv a straightforward tool to make a compatibility layer between all popular Python dependency management tools available out there in the open-source world.&lt;/p&gt;




&lt;h1&gt;
  
  
  Project Thoth
&lt;/h1&gt;

&lt;p&gt;Project Thoth is an application that aims to help Python developers. If you wish to be updated on any improvements and any progress we make in project Thoth, feel free to subscribe to our &lt;a href="https://www.youtube.com/channel/UClUIDuq_hQ6vlzmqM59B2Lw" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt; where we post updates as well as recordings from scrum demos. We also have a &lt;a href="https://twitter.com/thothstation" rel="noopener noreferrer"&gt;Twitter account&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>How to beat Python’s pip: Dependency Monkey inspecting the quality of TensorFlow dependencies</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Sun, 01 Nov 2020 12:24:58 +0000</pubDate>
      <link>https://dev.to/fridex/how-to-beat-python-s-pip-dependency-monkey-inspecting-the-quality-of-tensorflow-dependencies-6fc</link>
      <guid>https://dev.to/fridex/how-to-beat-python-s-pip-dependency-monkey-inspecting-the-quality-of-tensorflow-dependencies-6fc</guid>
      <description>&lt;p&gt;In this article, we will &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-inspecting-the-quality-of-machine-learning-software-1pkp"&gt;continue inspecting the quality of the software&lt;/a&gt;. Instead of selecting packages to be checked manually, we will use a component called "&lt;em&gt;Dependency Monkey&lt;/em&gt;" which can resolve software stacks following programmed rules and verify the application correctness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cGYxyL4P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/f3jpw0k42skkp0vrnhue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cGYxyL4P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/f3jpw0k42skkp0vrnhue.png" alt="Alt Text" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Neurathen Castle located in the Bastei rocks near Rathen in Saxon Switzerland, Germany. Image by the author.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Why different combinations of packages?
&lt;/h1&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-inspecting-the-quality-of-machine-learning-software-1pkp"&gt;previous article&lt;/a&gt;, but mainly &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-a-brief-intro-4bec"&gt;in the introductory article to "How to beat Python’s pip" series&lt;/a&gt;, we have described a state space of all the possible software stacks that can be resolved for an application stack given the requirements on libraries. Each resolved software stack in such state space can be scored by a scoring function that can compute "&lt;em&gt;how good the given software is&lt;/em&gt;". In the figure below, we can see an interpolated scoring function for resolved software stacks created out of two libraries &lt;code&gt;simplelib&lt;/code&gt; and &lt;code&gt;anotherlib&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--63OkWj9a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/suq3tzu6vyn3wsiozqqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--63OkWj9a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/suq3tzu6vyn3wsiozqqo.png" alt="Alt Text" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above shows an interpolated score function for a state-space made when installing two dependencies "&lt;code&gt;simplelib&lt;/code&gt;" and "&lt;code&gt;anotherlib&lt;/code&gt;" in different versions (valid combinations of different versions installed together).&lt;/p&gt;

&lt;p&gt;The interpolated function above shows a score for a two-dimensional state space (one dimension for each package). As we add more packages to an application, this state space becomes larger and larger (especially considering transitive dependencies that need to be added as well to have a valid software stack).&lt;/p&gt;

&lt;p&gt;For real-world applications, we can very easily get tens of dimensions (e.g. by installing &lt;code&gt;tensorflow==2.3.0&lt;/code&gt; we include 36 distinct packages in different versions, thus 36 dimensions plus one dimension for the scoring function). These dimensions introduce distinct input features that affect application behavior as reflected by the scoring function. As we already know &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-inspecting-the-quality-of-machine-learning-software-1pkp"&gt;based on our last article&lt;/a&gt;, any issue in any of these packages can introduce a problem in our application (run time or build time).&lt;/p&gt;

&lt;p&gt;All the possible version combinations (all the possible 36-dimensional vectors, following our example) are impossible to test in a reasonable time, so we need some smart picking of which versions should be included in the final resolved stack. One slicing mechanism is the actual resolver — it can slice possible resolutions respecting the version range specifications of packages in the dependency graph. But how do we limit the number of possible stacks to a reasonable sample even more?&lt;/p&gt;
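
&lt;p&gt;A back-of-the-envelope calculation shows why exhaustive testing is hopeless (the per-package version count below is hypothetical):&lt;/p&gt;

```python
import math

# 36 dependency "dimensions" (as with tensorflow==2.3.0) with, say,
# 10 candidate versions each (the per-package count is hypothetical)
versions_per_package = [10] * 36
combinations = math.prod(versions_per_package)
# 10**36 version combinations: impossible to inspect exhaustively
```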

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x89Q-mvN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ygf6ojywd9cdj2hicf4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x89Q-mvN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ygf6ojywd9cdj2hicf4d.png" alt="Alt Text" width="133" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Packages "B" in versions &amp;lt;1.5.0 will be removed based on resolver — they are not valid resolutions following the version range specification of package "A". Hence, they will limit the size of the corresponding feature "B".&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  A smart offline resolver
&lt;/h1&gt;

&lt;p&gt;Besides removing packages based on version range specifications in the resolver, a component called &lt;em&gt;Dependency Monkey&lt;/em&gt; is capable of using "&lt;em&gt;pipeline units&lt;/em&gt;". The whole resolution process is treated as a pipeline made out of pipeline units of different types that decide whether packages should be considered during the resolution: in other words, whether resolved stacks formed out of the selected packages should be inspected.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An example can be an inspection of a TensorFlow software stack. If we want to test a specific TensorFlow with &lt;a href="https://pypi.org/project/numpy"&gt;NumPy&lt;/a&gt; versions for compatibility, we can skip already tested software stack combinations (e.g. based on the queries to our database with previous test results).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Pipeline units create a programmable interface to the resolver, which can act based on the pipeline units’ decisions.&lt;/p&gt;
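
&lt;p&gt;A schematic pipeline unit might look like this (the interface and names are illustrative, not Thoth’s actual API): it drops candidate package versions whose combination with the given TensorFlow version has already been inspected:&lt;/p&gt;

```python
# combinations already inspected (stored, e.g., in a results database)
already_inspected = {("tensorflow", "2.1.0", "numpy", "1.17.0")}

def skip_tested_unit(tf_version, candidate):
    package, version = candidate
    # returning False removes the candidate from this resolution
    return ("tensorflow", tf_version, package, version) not in already_inspected

candidates = [("numpy", "1.17.0"), ("numpy", "1.18.0")]
to_test = [c for c in candidates if skip_tested_unit("2.1.0", c)]
```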

&lt;h1&gt;
  
  
  Amun inspections: revisited
&lt;/h1&gt;

&lt;p&gt;In the previous article, "&lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-inspecting-the-quality-of-machine-learning-software-1pkp"&gt;How to beat Python’s pip: Inspecting the quality of machine learning software&lt;/a&gt;", we introduced a service called &lt;a href="https://github.com/thoth-station/amun-api/"&gt;Amun&lt;/a&gt; that can run software respecting a specification that states how the application is assembled and run. Besides information about the operating system or hardware used, it also accepts a list of packages that should be installed in order to build and run the software.&lt;/p&gt;

&lt;p&gt;As &lt;em&gt;Dependency Monkey&lt;/em&gt; can resolve Python software stacks, it becomes one of the users of the Amun service. Simply put, if &lt;em&gt;Dependency Monkey&lt;/em&gt; resolves a Python software stack that it considers a valid candidate for testing, it submits it to &lt;a href="https://github.com/thoth-station/amun-api/"&gt;Amun&lt;/a&gt; to inspect its quality.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We use "quality" to describe a certain aspect of the software. One such quality aspect can be performance or other runtime behavior. The fact an application fails to build is also an indicator of the software stack quality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Dependency Monkey’s resolution pipeline
&lt;/h1&gt;

&lt;p&gt;One can see &lt;em&gt;Dependency Monkey&lt;/em&gt; as a resolver that accepts an input vector and resolves one or multiple software stacks considering the input vector and aggregated knowledge about the software and packages forming the software stacks. This aggregated knowledge can accumulate information about packages or package combinations seen in software stacks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M-O8gU6h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/1dro18wivuoc42j7i799.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M-O8gU6h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/1dro18wivuoc42j7i799.png" alt="Alt Text" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dependency Monkey’s resolution pipeline is formed out of pipeline units that help resolve Python software stacks based on the input vector, considering the knowledge base.&lt;/p&gt;

&lt;h1&gt;
  
  
  Checking different package combinations in TensorFlow stacks
&lt;/h1&gt;

&lt;p&gt;Let’s check some dependencies of a TensorFlow stack (I used TensorFlow in version &lt;a href="https://pypi.org/project/tensorflow/2.1.0/"&gt;2.1.0&lt;/a&gt;; the dependency listing will differ across versions). If we take a &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-solving-python-dependencies-2d6e"&gt;look at the direct dependencies of TensorFlow&lt;/a&gt;, we will find packages such as &lt;a href="https://pypi.org/project/h5py/"&gt;h5py&lt;/a&gt;, &lt;a href="https://pypi.org/project/opt-einsum/"&gt;opt-einsum&lt;/a&gt;, &lt;a href="https://pypi.org/project/scipy/"&gt;scipy&lt;/a&gt;, &lt;a href="https://pypi.org/project/Keras-Preprocessing/"&gt;Keras-Preprocessing&lt;/a&gt;, and &lt;a href="https://pypi.org/project/tensorboard/"&gt;tensorboard&lt;/a&gt; in specific versions. They share a common dependency, &lt;a href="https://pypi.org/project/numpy/"&gt;NumPy&lt;/a&gt;, which is also a direct dependency of TensorFlow itself (see &lt;a href="https://gist.github.com/fridex/681605f0bcf6437a71c5ed64883e0a24"&gt;this GitHub gist for the listing&lt;/a&gt;, which can change over time with new package releases). All the packages stated can be installed in different versions, and those versions can have different version range requirements on NumPy. The actual version of NumPy installed depends on the resolver and the resolution process, which can also take into account other libraries that the user requested to install (besides TensorFlow as a single direct dependency). It’s worth pointing out here that any issue in NumPy (even incompatibilities introduced by &lt;em&gt;overpinning&lt;/em&gt; or &lt;em&gt;underpinning&lt;/em&gt;) can &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-inspecting-the-quality-of-machine-learning-software-1pkp"&gt;lead to a broken application&lt;/a&gt;. So let’s try to test the TensorFlow stack with different combinations of NumPy.&lt;/p&gt;
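&lt;p&gt;Finding the NumPy versions that satisfy all the version ranges the dependents declare can be sketched with pure Python; the specifiers and available versions below are made up for illustration, real ranges come from package metadata and differ across releases:&lt;/p&gt;

```python
def parse(version):
    """Turn "1.17.4" into a comparable tuple (1, 17, 4)."""
    return tuple(int(part) for part in version.split("."))

# Hypothetical NumPy version ranges declared by several dependents
# (lower bound inclusive, upper bound exclusive).
requirements = {
    "tensorflow": (parse("1.16.0"), parse("2.0.0")),  # numpy>=1.16.0,<2.0.0
    "scipy": (parse("1.13.3"), parse("2.0.0")),       # numpy>=1.13.3,<2.0.0
    "h5py": (parse("1.7.0"), parse("2.0.0")),         # numpy>=1.7.0,<2.0.0
}

available = ["1.13.3", "1.16.0", "1.17.4", "1.18.5", "2.0.0"]

def satisfies_all(version):
    """A NumPy version is a candidate only if every dependent accepts it."""
    v = parse(version)
    return all(low <= v < high for low, high in requirements.values())

print([v for v in available if satisfies_all(v)])  # ['1.16.0', '1.17.4', '1.18.5']
```

&lt;p&gt;Each surviving candidate yields a different, valid software stack; those are the combinations worth inspecting.&lt;/p&gt;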

&lt;p&gt;In the video below, you can see a brief walk-through of Dependency Monkey together with a service called &lt;a href="https://towardsdatascience.com/how-to-beat-pythons-pip-inspecting-the-quality-of-machine-learning-software-f1a028f0c42a"&gt;Amun&lt;/a&gt;. In the first part of the demo (&lt;a href="https://www.youtube.com/watch?v=S3hFn8KRsKc&amp;amp;t=19m25s"&gt;starting at 19:25&lt;/a&gt;), &lt;em&gt;Dependency Monkey&lt;/em&gt; resolves software stacks considering aggregated knowledge (one example of such knowledge is the &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-solving-python-dependencies-2d6e"&gt;dependency information needed during the resolution&lt;/a&gt;) and &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-inspecting-the-quality-of-machine-learning-software-1pkp"&gt;submits these software stacks to Amun to inspect the quality of the software&lt;/a&gt;. The tested software stack is TensorFlow in version 2.1.0, using the &lt;a href="https://pypi.org/project/tensorflow/"&gt;build published on PyPI&lt;/a&gt;, with different combinations of NumPy resolved (all other packages in the application stack are kept at fixed versions while the NumPy version is adjusted, respecting the dependency graph).&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/S3hFn8KRsKc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A note on the video: dependencies that should be locked could also be stated in the direct dependency listing. Note, however, that by doing so the dependency will always be present in all the stacks, even when it would not otherwise be used, and it could affect the dependency graph. That’s why pinning of dependencies is performed on a pipeline unit level.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The second part of the demo (&lt;a href="https://www.youtube.com/watch?v=S3hFn8KRsKc&amp;amp;t=28m13s"&gt;starting at 28:13&lt;/a&gt;) shows a &lt;em&gt;Dependency Monkey&lt;/em&gt; resolution that randomly samples the state space of all the possible TensorFlow stacks. As we already know, this state space is too large, so checking all the combinations is impossible in a reasonable time. &lt;em&gt;Dependency Monkey&lt;/em&gt; randomly generates software stacks that are valid resolutions of the TensorFlow software stack and submits them to Amun, which verifies that the software stack builds and runs correctly.&lt;/p&gt;
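&lt;p&gt;Random sampling of such a state space can be sketched with the standard library; the candidate versions below are illustrative, as the real state space comes from the dependency graph and is vastly larger:&lt;/p&gt;

```python
import itertools
import random

# Hypothetical resolved version candidates per package; the real state
# space comes from the dependency graph and is much larger.
candidates = {
    "tensorflow": ["2.1.0"],
    "numpy": ["1.16.0", "1.17.4", "1.18.5"],
    "urllib3": ["1.24.3", "1.25.10"],
    "six": ["1.14.0", "1.15.0"],
}

all_stacks = list(itertools.product(*candidates.values()))
print(len(all_stacks))  # 1 * 3 * 2 * 2 = 12 possible stacks

random.seed(42)  # deterministic for the example
sample = random.sample(all_stacks, k=3)  # stacks submitted for inspection
for stack in sample:
    print(dict(zip(candidates, stack)))
```

&lt;p&gt;With real dependency graphs, the product of per-package candidates explodes combinatorially, which is why sampling, rather than exhaustive enumeration, is the practical choice.&lt;/p&gt;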

&lt;p&gt;Such random state space sampling can spot issues. One interesting issue in the TensorFlow 2.1 stack is the dependency &lt;code&gt;urllib3&lt;/code&gt; which, when installed in a specific version, can cause runtime errors on TensorFlow imports. See &lt;a href="https://thoth-station.ninja/j/tf_21_urllib3"&gt;this document for a detailed overview&lt;/a&gt;. Note that the installed version can also depend on other libraries an application uses besides TensorFlow, so applications can be affected by this issue even when it is not apparent from their direct dependencies.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/thoth-station/dependency-monkey-zoo/tree/master/tensorflow/inspection-2020-09-07"&gt;A link to Dependency Monkey configuration for demo part 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/thoth-station/dependency-monkey-zoo/tree/master/tensorflow/inspection-2020-09-04"&gt;A link to Dependency Monkey configuration for demo part 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thoth-station.ninja/j/tf_21_urllib3"&gt;A link to the issue spotted with urllib3 and TensorFlow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Project Thoth
&lt;/h1&gt;

&lt;p&gt;Project Thoth is an application that aims to help Python developers. If you wish to be updated on any improvements and any progress we make in project Thoth, feel free to subscribe to our &lt;a href="https://www.youtube.com/channel/UClUIDuq_hQ6vlzmqM59B2Lw"&gt;YouTube channel&lt;/a&gt; where we post updates as well as recordings from scrum demos. Check also &lt;a href="https://twitter.com/thothstation"&gt;our Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Stay tuned for any updates!&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to beat Python’s pip: Inspecting the quality of machine learning software</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Mon, 26 Oct 2020 17:57:50 +0000</pubDate>
      <link>https://dev.to/fridex/how-to-beat-python-s-pip-inspecting-the-quality-of-machine-learning-software-1pkp</link>
      <guid>https://dev.to/fridex/how-to-beat-python-s-pip-inspecting-the-quality-of-machine-learning-software-1pkp</guid>
      <description>&lt;p&gt;Following &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-solving-python-dependencies-2d6e"&gt;the previous article written about solving Python dependencies&lt;/a&gt;, we will take a look at the quality of software. This article will cover "inspections" of software stacks and will link a free dataset available on Kaggle. Even though the title says the quality of "machine learning software", principles and ideas can be reused for inspecting any software quality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ws55iauK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/pe2j62vrk9lsa4674wzo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ws55iauK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/pe2j62vrk9lsa4674wzo.jpeg" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Application (Software &amp;amp; Hardware) Stack
&lt;/h1&gt;

&lt;p&gt;Let’s consider a Python machine learning application. This application can use a machine learning library, such as &lt;a href="https://www.tensorflow.org/"&gt;TensorFlow&lt;/a&gt;. TensorFlow is in that case a direct dependency of the application; by installing it, the machine learning application uses TensorFlow directly and TensorFlow’s dependencies indirectly. Examples of such indirect dependencies of our application are &lt;a href="https://pypi.org/project/numpy/"&gt;NumPy&lt;/a&gt; or &lt;a href="https://pypi.org/project/absl-py/"&gt;absl-py&lt;/a&gt;, which are used by TensorFlow.&lt;/p&gt;

&lt;p&gt;Our machine learning Python application and all the Python libraries run on top of a &lt;a href="https://developers.redhat.com/blog/2020/06/25/red-hat-enterprise-linux-8-2-brings-faster-python-3-8-run-speeds/"&gt;Python interpreter in some specific version&lt;/a&gt;. Moreover, they can use other additional native dependencies (provided by the operating system) such as &lt;a href="https://en.wikipedia.org/wiki/GNU_C_Library"&gt;glibc&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/CUDA"&gt;CUDA&lt;/a&gt; (if running computations on GPU). To visualize this fact, let’s create a stack with all the items creating the application stack running on top of some hardware.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Eb8dSxlB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/q22wuj0e5uhskskonyzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Eb8dSxlB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/q22wuj0e5uhskskonyzo.png" alt="Alt Text" width="428" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that an issue in any of the described layers can cause our Python application to misbehave, produce wrong output, raise runtime errors, or simply not start at all.&lt;/p&gt;

&lt;p&gt;Let’s try to identify any possible issues in the described stack by building the software and running it on our hardware. By doing so, we can spot possible issues before pushing our application to a production environment, or fine-tune the software so that we get the best possible performance out of our application on the available hardware.&lt;/p&gt;

&lt;h1&gt;
  
  
  On-demand software stack creation
&lt;/h1&gt;

&lt;p&gt;If our application depends on a TensorFlow release starting with version &lt;a href="https://pypi.org/project/tensorflow/2.0.0/"&gt;2.0.0&lt;/a&gt; (e.g. a requirement on the API offered, &lt;code&gt;tensorflow&amp;gt;=2.0.0&lt;/code&gt;), we can test our application with different versions of TensorFlow up to the &lt;a href="https://pypi.org/project/tensorflow/#history"&gt;2.3.0 release, the most recent on PyPI as of this writing&lt;/a&gt;. The same can be applied to transitive dependencies of TensorFlow, e.g. &lt;a href="https://pypi.org/project/absl-py/"&gt;absl-py&lt;/a&gt;, &lt;a href="https://pypi.org/project/numpy/"&gt;NumPy&lt;/a&gt;, or any other. A version change can be performed analogously for any other dependency in our software stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dependency Monkey
&lt;/h2&gt;

&lt;p&gt;Note that a single version change can completely change (or even invalidate) which dependencies, in which versions, will be present in the application stack, given the dependency graph and the version range specifications of the libraries present in the software stack. To create a pinned-down list of packages in specific versions to be installed, a resolver needs to be run to resolve the packages and their version range requirements.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do you remember &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-a-brief-intro-4bec"&gt;the state space described in the first article of "How to beat Python’s pip" series&lt;/a&gt;? Dependency Monkey can in fact create the state space of all the possible software stacks that can be resolved respecting version range specifications. If the state space is too large to resolve in a reasonable time, it can be sampled.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bByppDl4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/bnair2neojhx4lljqb24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bByppDl4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/bnair2neojhx4lljqb24.png" alt="Alt Text" width="700" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A component called "Dependency Monkey" is capable of creating different software stacks considering the dependency graph and version specifications of packages in the dependency graph. All of this is done offline, based on pre-computed results from &lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-solving-python-dependencies-2d6e"&gt;Thoth’s solver runs (see the previous article from the "How to beat Python’s pip" series)&lt;/a&gt;. The results of solver runs are synced into Thoth’s database so that they are available in a queryable form. Doing so enables Dependency Monkey to resolve software stacks at a fast pace (see a &lt;a href="https://www.youtube.com/watch?v=p0ECHhrxEq0"&gt;YouTube video on optimizing Thoth’s resolver&lt;/a&gt;). Moreover, the underlying algorithm can consider Python packages published on different Python indexes (&lt;a href="https://pypi.org/"&gt;besides PyPI&lt;/a&gt;, it can also use &lt;a href="https://tensorflow.pypi.thoth-station.ninja/"&gt;custom TensorFlow builds from an index such as the AICoE one&lt;/a&gt;). We will give a more in-depth explanation of Dependency Monkey in one of the upcoming articles. If you are too eager to wait, &lt;a href="https://thoth-station.ninja/docs/developers/adviser/dependency_monkey.html"&gt;feel free to browse its online documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amun API
&lt;/h2&gt;

&lt;p&gt;Now, let’s utilize a service called "&lt;a href="https://github.com/thoth-station/amun-api"&gt;Amun&lt;/a&gt;". This service was designed to accept a specification of the software stack and hardware and execute an application given the specification.&lt;/p&gt;

&lt;p&gt;Amun is an &lt;a href="https://www.openshift.com/"&gt;OpenShift&lt;/a&gt; cluster-native application that utilizes OpenShift features (such as builds, the container image registry, …) and &lt;a href="https://argoproj.github.io/"&gt;Argo Workflows&lt;/a&gt; to run the desired software on specific hardware using a specific software environment. The specification is accepted in a JSON format that is subsequently translated into the respective steps needed to test that the given stack builds and runs.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/yeBjnZpdMwY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The video linked above shows how Amun inspections are run and how the created knowledge is aggregated using OpenShift, Argo Workflows, and Ceph. You can see different TensorFlow builds being inspected: &lt;code&gt;tensorflow&lt;/code&gt;, &lt;code&gt;tensorflow-cpu&lt;/code&gt;, &lt;code&gt;intel-tensorflow&lt;/code&gt;, and &lt;a href="https://tensorflow.pypi.thoth-station.ninja/"&gt;community builds of TensorFlow with AVX2 instruction set support available on the AICoE index&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Thoth’s inspection dataset on Kaggle
&lt;/h1&gt;

&lt;p&gt;We (Red Hat) have produced multiple inspections as part of project Thoth, in which we tested different TensorFlow releases and different TensorFlow builds.&lt;/p&gt;

&lt;p&gt;One such dataset is &lt;a href="https://www.kaggle.com/thothstation/thoth-performance-dataset-v10"&gt;Thoth’s performance dataset in version 1 on Kaggle&lt;/a&gt;. It consists of nearly 4,000 files capturing information about inspection runs of TensorFlow stacks. &lt;a href="https://www.kaggle.com/kerneler/starter-thoth-performance-dataset-v1-0-96cc92dd-0"&gt;A notebook published together with the dataset&lt;/a&gt; can help you explore it.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/_tZo7eIOzJI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h1&gt;
  
  
  Project Thoth
&lt;/h1&gt;

&lt;p&gt;Project Thoth is an application that aims to help Python developers. If you wish to be updated on any improvements and progress we make in project Thoth, feel free to &lt;a href="https://www.youtube.com/channel/UClUIDuq_hQ6vlzmqM59B2Lw"&gt;subscribe to our YouTube channel&lt;/a&gt; where we post updates as well as recordings from scrum demos. Follow &lt;a href="http://twitter.com/thothstation"&gt;our Twitter&lt;/a&gt; as well if you want to stay informed.&lt;/p&gt;

&lt;p&gt;Stay tuned for any updates!&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How to beat Python’s pip: Solving Python dependencies</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Sat, 17 Oct 2020 18:59:09 +0000</pubDate>
      <link>https://dev.to/fridex/how-to-beat-python-s-pip-solving-python-dependencies-2d6e</link>
      <guid>https://dev.to/fridex/how-to-beat-python-s-pip-solving-python-dependencies-2d6e</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/fridex/how-to-beat-python-s-pip-a-brief-intro-4bec"&gt;In the previous blog post&lt;/a&gt;, I’ve mentioned the start of the blog series I plan to publish about Python dependencies and how to deal with them. This article is the second one coming out of this series and it will cover obtaining information about Python dependencies. You’ll also gain access to our dataset published on Kaggle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3trhqvxb0hpl2ze0kbzl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3trhqvxb0hpl2ze0kbzl.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Python dependencies
&lt;/h1&gt;

&lt;p&gt;Python’s packaging allows specifying dependencies &lt;a href="https://packaging.python.org/tutorials/packaging-projects/#creating-setup-py" rel="noopener noreferrer"&gt;using a &lt;code&gt;setup.py&lt;/code&gt; script&lt;/a&gt;. This script states all the metadata used by Python’s packaging tooling to seamlessly install Python distributions into environments. It has a pretty nice and straightforward structure and allows you to programmatically define all the needed bits when packaging your Python package.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Another popular way to write your Python package metadata is the &lt;code&gt;setup.cfg&lt;/code&gt; file. In contrast to &lt;code&gt;setup.py&lt;/code&gt;, this file is not Python source code but a static configuration file. &lt;a href="https://docs.python.org/3/distutils/configfile.html" rel="noopener noreferrer"&gt;Refer to distutils documentation for more info&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Having the ability to use a Python script to define all the packaging metadata offers great power. You can code basically anything you want and adjust your package metadata as desired during the &lt;code&gt;setup.py&lt;/code&gt; invocation. But as usual:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;With great power comes great responsibility.&lt;/p&gt;
&lt;/blockquote&gt;
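&lt;p&gt;To see that power concretely, here is a sketch of a &lt;code&gt;setup.py&lt;/code&gt; that computes its dependencies at runtime; the package name and version ranges are made up for illustration:&lt;/p&gt;

```python
# setup.py -- dependencies computed at runtime; this is exactly why the
# script has to be executed to learn what a package depends on.
import sys

from setuptools import find_packages, setup

install_requires = ["flask"]
if sys.version_info < (3, 8):
    # Hypothetical backport needed only on older Python interpreters.
    install_requires.append("importlib-metadata")

setup(
    name="myapp",  # illustrative package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=install_requires,
    python_requires=">=3.6",
)
```

&lt;p&gt;Because the dependency list depends on the interpreter running the script, a static scan of this file cannot tell you what will actually be installed.&lt;/p&gt;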

&lt;h1&gt;
  
  
  Checking dependencies of a Python package
&lt;/h1&gt;

&lt;p&gt;If you visit &lt;a href="https://pypi.org/project/hexsticker/" rel="noopener noreferrer"&gt;any project page on PyPI&lt;/a&gt;, the Python Package Index, you’ll notice there is no dependency information. The great power behind the &lt;code&gt;setup.py&lt;/code&gt; script is the main reason why: if the dependencies are stated in a Python script, the script needs to be executed to obtain the dependency information. Okay, so let’s trigger the installation process of a Python package and see what dependencies are stated there, but wait… What operating system should we use? What Python interpreter version should we use? What native dependencies (such as &lt;a href="https://en.wikipedia.org/wiki/GNU_Compiler_Collection" rel="noopener noreferrer"&gt;gcc&lt;/a&gt; for native extensions) should be present in our environment? What CPU architecture? And… And…&lt;/p&gt;

&lt;p&gt;Obtaining information about a Python package turns out not to be that straightforward. Consider all the factors and variations that can be coded in the &lt;code&gt;setup.py&lt;/code&gt; script, which can, in turn, construct different sets of dependencies or lead to Python package installation issues. Refer to the article "&lt;a href="https://dustingram.com/articles/2018/03/05/why-pypi-doesnt-know-dependencies/" rel="noopener noreferrer"&gt;Why PyPI Doesn’t Know Your Projects’ Dependencies&lt;/a&gt;" written by Dustin Ingram, one of the maintainers of &lt;a href="https://github.com/pypa/warehouse" rel="noopener noreferrer"&gt;Warehouse&lt;/a&gt; (the software that powers PyPI), for more info on this.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why PyPI doesn’t know your project's dependencies but Thoth does
&lt;/h1&gt;

&lt;p&gt;We, at Red Hat, have developed a tool that can check the dependencies of a desired Python package: it’s called &lt;a href="https://github.com/thoth-station/solver" rel="noopener noreferrer"&gt;thoth-solver&lt;/a&gt; (&lt;a href="https://pypi.org/project/thoth-solver/" rel="noopener noreferrer"&gt;see also its PyPI release&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This Python application can install dependencies from any Python package index conforming to the &lt;a href="https://www.python.org/dev/peps/pep-0503/" rel="noopener noreferrer"&gt;Simple Repository API (PEP 503)&lt;/a&gt; (such as &lt;a href="https://pypi.org/" rel="noopener noreferrer"&gt;pypi.org&lt;/a&gt; or the &lt;a href="http://tensorflow.pypi.thoth-station.ninja/" rel="noopener noreferrer"&gt;AICoE Python package index&lt;/a&gt;) and extract all the metadata of a Python package. One such piece of metadata is the package’s requirements.&lt;/p&gt;

&lt;p&gt;As there is no static metadata to rely on without actually installing a Python package (well, for some wheel distributions it is possible to do so), &lt;a href="https://github.com/thoth-station/solver" rel="noopener noreferrer"&gt;thoth-solver&lt;/a&gt; installs the given package into an environment and extracts the package metadata. The aggregated data is reported in a structured JSON format for any further processing.&lt;/p&gt;

&lt;h1&gt;
  
  
  Checking some of the Thoth solver screws
&lt;/h1&gt;

&lt;p&gt;As stated above, Thoth’s solver downloads and actually installs a Python package. The very first "observation" it captures is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the given Python application install into the given environment?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are numerous reasons why a Python package might not be installable into an environment: a missing or wrong toolchain (e.g. a missing gcc, or a gcc version lacking the proper C/C++ standard), Python interpreter incompatibilities (e.g. Python 2 versus Python 3 issues), missing manylinux support in an older pip release used, or simply a wrong release by maintainers. Basically, anything that can possibly go wrong. The implementation behind thoth-solver captures this fact with all the relevant log information (which is subsequently analyzed within project Thoth to automatically derive why the given package was not installable).&lt;/p&gt;

&lt;p&gt;Once the installation succeeds, the tool obtains all the information about dependencies (and additional metadata) using &lt;a href="https://docs.python.org/3/library/importlib.metadata.html" rel="noopener noreferrer"&gt;&lt;code&gt;importlib.metadata&lt;/code&gt;&lt;/a&gt;. This metadata gathering is done in a fresh virtual environment into which the analyzed package is installed, to reduce any interference with the dependencies of thoth-solver itself or of any other package installed in the environment where thoth-solver runs. The requirements stated are parsed and solved respecting Python standards for dependency specification, so that the resulting document also states dependencies in specific versions (rather than just dependency specifications). The obtained results are subsequently aggregated and reported in the final JSON document together with thoth-solver run metadata (OS, Python interpreter version/build, …).&lt;/p&gt;
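&lt;p&gt;The metadata extraction step can be approximated in a few lines. This sketch only prints a few fields for distributions installed in the current environment; thoth-solver additionally solves the requirement specifications:&lt;/p&gt;

```python
import importlib.metadata

# List distributions installed in the current environment and show the
# metadata fields relevant for dependency resolution.
for dist in list(importlib.metadata.distributions())[:5]:
    name = dist.metadata["Name"]
    requires = dist.requires or []  # requirement strings, possibly with markers
    print(name, dist.version, requires)
```

&lt;p&gt;Running this inside a fresh virtual environment containing only the analyzed package is what keeps the output free of interference from other installed packages.&lt;/p&gt;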

&lt;p&gt;We run thoth-solver as a containerized application in our clusters using different operating systems (such as &lt;a href="https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image" rel="noopener noreferrer"&gt;UBI&lt;/a&gt;, &lt;a href="https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux" rel="noopener noreferrer"&gt;RHEL&lt;/a&gt;, &lt;a href="https://getfedora.org/" rel="noopener noreferrer"&gt;Fedora&lt;/a&gt;, ...), different native dependencies, and different Python interpreter versions (a matrix of all the factors that can influence a Python package installation). The resulting JSON documents of thoth-solver runs are automatically placed onto &lt;a href="https://ceph.io/" rel="noopener noreferrer"&gt;Ceph&lt;/a&gt; and synced into Thoth’s knowledge base. Optimizations of the thoth-solver implementation (such as pre-baking virtual environments into the shipped containers) allowed us to analyze a few hundred Python packages per hour (the only limitations are basically cluster resources and networking). All the dependency information is available on our API endpoints.&lt;/p&gt;

&lt;h1&gt;
  
  
  Thoth solver dataset on Kaggle
&lt;/h1&gt;

&lt;p&gt;If you wish to browse some of the thoth-solver data, you can do so by accessing the &lt;a href="https://www.kaggle.com/thothstation/thoth-solver-dataset-v10" rel="noopener noreferrer"&gt;Kaggle dataset&lt;/a&gt; we published. See also a &lt;a href="https://www.kaggle.com/pacospace/explore-thoth-solver-dataset" rel="noopener noreferrer"&gt;notebook that explores the dataset&lt;/a&gt; or the &lt;a href="https://github.com/thoth-station/datasets" rel="noopener noreferrer"&gt;github.com/thoth-station/datasets&lt;/a&gt; repository with notebooks.&lt;/p&gt;

&lt;p&gt;The dataset consists of 100,000 thoth-solver JSON documents (415.79 MB in total). You can find application stacks of popular Python packages published on PyPI (such as &lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;TensorFlow&lt;/a&gt;).&lt;/p&gt;

&lt;h1&gt;
  
  
  Thoth reverse solver
&lt;/h1&gt;

&lt;p&gt;As the solver states dependencies at the point in time when it is run, we wanted to keep our knowledge base up to date with recent Python package releases. Consider a new &lt;code&gt;numpy==1.20.0&lt;/code&gt; release or a new patch &lt;code&gt;numpy==1.18.6&lt;/code&gt; release: do these releases affect any packages that depend on &lt;code&gt;numpy&amp;gt;=1.19&lt;/code&gt;? We can answer this question (offline, without running thoth-solver) using another component called &lt;a href="https://github.com/thoth-station/revsolver" rel="noopener noreferrer"&gt;thoth-revsolver&lt;/a&gt;. Check this demo for more info:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/bpDzi_Jaj4M"&gt;
&lt;/iframe&gt;
&lt;/p&gt;
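&lt;p&gt;The idea behind reverse solving can be sketched as follows. The stored requirements and package names are made up, and real specifiers follow PEP 440 rather than this simplified &lt;code&gt;&amp;gt;=&lt;/code&gt; check:&lt;/p&gt;

```python
def parse(version):
    """Turn "1.20.0" into a comparable tuple (1, 20, 0)."""
    return tuple(int(part) for part in version.split("."))

# Hypothetical knowledge base entries: dependents and the minimal numpy
# version each one requires (i.e. a stored "numpy>=X" specification).
dependents = {
    "tensorflow==2.1.0": "1.16.0",
    "scipy==1.4.1": "1.13.3",
    "futurelib==0.1.0": "1.21.0",  # not satisfied by the new release yet
}

new_release = "1.20.0"
affected = [
    pkg for pkg, minimum in dependents.items()
    if parse(new_release) >= parse(minimum)
]
print(affected)  # ['tensorflow==2.1.0', 'scipy==1.4.1']
```

&lt;p&gt;Instead of re-solving every dependent from scratch, the reverse solver only re-checks stored specifications against the newly released version.&lt;/p&gt;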




&lt;h1&gt;
  
  
  Project Thoth
&lt;/h1&gt;

&lt;p&gt;Project Thoth is an application that aims to help Python developers. If you wish to be updated on any improvements and any progress we make in project Thoth, feel free to &lt;a href="https://www.youtube.com/channel/UClUIDuq_hQ6vlzmqM59B2Lw" rel="noopener noreferrer"&gt;subscribe to our YouTube channel&lt;/a&gt; where we post our updates as well as our recordings from our scrum demos. You can also &lt;a href="https://twitter.com/thothstation" rel="noopener noreferrer"&gt;follow us on Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Stay tuned for any new updates!&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>How to beat Python’s pip: A brief intro</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Mon, 12 Oct 2020 10:01:59 +0000</pubDate>
      <link>https://dev.to/fridex/how-to-beat-python-s-pip-a-brief-intro-4bec</link>
      <guid>https://dev.to/fridex/how-to-beat-python-s-pip-a-brief-intro-4bec</guid>
      <description>&lt;p&gt;The Python’s package installer, &lt;a href="https://pypi.org/project/pip" rel="noopener noreferrer"&gt;pip&lt;/a&gt;, is known to have issues when resolving software stacks. In the upcoming series of articles, I will briefly discuss an approach that helped to resolve versions of libraries for applications faster than pip’s resolution algorithm. Moreover, the resolved software stacks are scored based on various aspects to help with shipping high-quality software.&lt;/p&gt;

&lt;p&gt;Python is one of the fastest-growing programming languages out there. There is no doubt it’s becoming the programming language of choice for data scientists, machine learning engineers, and software developers. In my eyes, Python code is pseudo-code that simply runs: easy to write, easy to maintain. An API server using &lt;a href="https://pypi.org/project/flask" rel="noopener noreferrer"&gt;Flask&lt;/a&gt;, data analysis in &lt;a href="https://pypi.org/project/jupyterlab" rel="noopener noreferrer"&gt;Jupyter notebooks&lt;/a&gt;, or a neural network using &lt;a href="https://pypi.org/project/tensorflow/" rel="noopener noreferrer"&gt;TensorFlow&lt;/a&gt; can all be written in a few lines of code. Any performance-critical parts can be optimized thanks to CPython’s C API. Python is a very effective weapon in anyone’s inventory.&lt;/p&gt;

&lt;h1&gt;
  
  
  Python, pip &amp;amp; resolvers
&lt;/h1&gt;

&lt;p&gt;The Python Packaging Authority (PyPA) is a working group that maintains a core set of projects used in Python packaging. One of these projects is &lt;a href="https://pypi.org/project/pip" rel="noopener noreferrer"&gt;pip&lt;/a&gt;, the PyPA-recommended tool for installing Python packages. If you have developed any Python application, you have most probably used it, or at least considered using it, to install libraries for your project. Similar tools are Pipenv (also maintained by PyPA) and Poetry.&lt;/p&gt;

&lt;p&gt;pip does its job pretty well in most cases — it can install your desired software from PyPI, the Python Package Index that hosts open-source projects. Alternatively, you can use your privately hosted Python indexes as a source of software to be installed. Unfortunately, pip lacks a proper resolver implementation, which can in some cases lead to painful situations. As of today, PyPA is working on a new resolver implementation for pip. Resolvers are usually discussed from the implementation side, but let’s have a look at the resolution problem from the other side.&lt;/p&gt;

&lt;h1&gt;
  
  
  A state-space made out of Python packages
&lt;/h1&gt;

&lt;p&gt;Let’s say we want to create an application that uses two libraries called &lt;code&gt;simplelib&lt;/code&gt; and &lt;code&gt;anotherlib&lt;/code&gt;. These libraries can be installed in different versions, and these versions can have a different impact on the resulting software shipped — e.g. a performance impact, a security impact, or, in the worst case, the application does not assemble at all. Now, let’s create a function that captures such observations and performs "scoring" with respect to the versions included in the installed software. Such a function would have discrete values, and for our artificial example it could look like this visualization (assuming the libraries do not have any transitive dependencies):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fs154eh8bp487bz9y98gd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fs154eh8bp487bz9y98gd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
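&lt;p&gt;To make the idea concrete, here is a tiny sketch of such a scoring function in Python. The library names come from the example above, but the version numbers and scores are invented purely for illustration:&lt;/p&gt;

```python
# A toy scoring function over (simplelib, anotherlib) version pairs.
# The versions and scores below are made up for illustration; in a real
# system they would come from aggregated observations such as
# performance, security, or build failures.
scores = {
    ("1.0", "0.9"): 0.4,
    ("1.0", "1.0"): 0.7,
    ("1.1", "0.9"): 0.9,  # an older combination may score best
    ("1.1", "1.0"): 0.5,
    ("2.0", "0.9"): 0.0,  # e.g. the application does not assemble
    ("2.0", "1.0"): 0.6,  # the latest stack is not the greatest one
}

def best_stack(scores):
    """Return the (simplelib, anotherlib) version pair with the highest score."""
    return max(scores, key=scores.get)

print(best_stack(scores))  # ('1.1', '0.9') rather than the latest ('2.0', '1.0')
```

&lt;p&gt;With only two libraries the maximum can be found by enumeration; with real dependency graphs the state space explodes, which is exactly why a smarter resolver is needed.&lt;/p&gt;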

&lt;p&gt;To make it more intuitive, let’s interpolate the values and plot the resulting figure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdllnme6b5jnoyboedvaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdllnme6b5jnoyboedvaj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the horizontal axes, you can see different versions of the &lt;code&gt;simplelib&lt;/code&gt; and &lt;code&gt;anotherlib&lt;/code&gt; libraries. On the vertical axis, you can see the values of the scoring function. If you use &lt;a href="https://pypi.org/project/pip" rel="noopener noreferrer"&gt;pip&lt;/a&gt;, &lt;a href="https://pypi.org/project/pipenv" rel="noopener noreferrer"&gt;Pipenv&lt;/a&gt;, or &lt;a href="https://pypi.org/project/poetry" rel="noopener noreferrer"&gt;Poetry&lt;/a&gt;, all of these tools will resolve the most recent versions of the libraries possible — on our graph that would be the rightmost value:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwaw9j2287u3d0zohqldt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwaw9j2287u3d0zohqldt.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what if we want to ship better software? What if the most recent releases are broken? That would require manual work and increased maintenance costs, and one can easily end up in dependency hell.&lt;/p&gt;

&lt;h1&gt;
  
  
  Thoth’s advice: install the right software
&lt;/h1&gt;

&lt;p&gt;The idea described above gave birth to a project called &lt;a href="https://github.com/thoth-station" rel="noopener noreferrer"&gt;Thoth&lt;/a&gt;. Thoth is a recommendation engine for Python applications that can resolve not the latest, but the greatest set of libraries to install for your application. Thoth’s resolver resolves and scores software stacks based on its aggregated knowledge — hence it’s a server-side resolution. You can submit the requirements you have for your application, and Thoth’s recommendation engine will resolve a software stack that satisfies them.&lt;/p&gt;




&lt;p&gt;In the upcoming articles, I will dive more into Thoth’s internals — how the resolution is performed, what key concepts are implemented, and how the implementation can resolve and score tens, hundreds, or thousands of software stacks per second. One of the concepts used there is reinforcement learning, which helps to resolve high-quality software stacks based on observations in Thoth’s knowledge base, so stay tuned!&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>“Termial” random for prioritized picking an item from a list</title>
      <dc:creator>Fridolín Pokorný</dc:creator>
      <pubDate>Thu, 20 Feb 2020 19:10:31 +0000</pubDate>
      <link>https://dev.to/fridex/termial-random-for-prioritized-picking-an-item-from-a-list-22jh</link>
      <guid>https://dev.to/fridex/termial-random-for-prioritized-picking-an-item-from-a-list-22jh</guid>
      <description>&lt;p&gt;Let’s take a look at a solution that randomly picks an item from a list. Instead of assigning equal probability to each item in the list, let’s create an algorithm that assigns the highest probability to the item at index 0 and lower probabilities to all remaining items as the index in the list grows. Can we implement such a function?😓&lt;/p&gt;

&lt;p&gt;You can use routines already available in the Python standard library to pick a random item from a list. Let’s say we want to randomly pick a number from a list of integers. The only thing you need to do is run the following snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;random&lt;/span&gt;
&lt;span class="n"&gt;my_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"deadbeef"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# results in 42 with a probability of 1 / len(my_list)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now let’s say we want to prioritize number 42 over 33, prioritize number 33 over 30, and at the same time, prioritize number 30 over “0xdeadbeef”. We have 4 numbers in total in our list, let’s assign weights to these numbers in the following manner:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------+-------------+
|    number    |    weight   |
+--------------+-------------+
|     42       |      4      |
|     33       |      3      |
|     30       |      2      |
|  0xdeadbeef  |      1      |
+--------------+-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
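&lt;p&gt;As a side note, on Python 3.6 or newer the standard library’s &lt;em&gt;random.choices&lt;/em&gt; can already perform a weighted pick with an explicit weight list like the one above. The approach derived in the rest of this article achieves the same distribution without materializing the weights:&lt;/p&gt;

```python
import random

my_list = [42, 33, 30, int("deadbeef", base=16)]
weights = [4, 3, 2, 1]  # one weight per item, highest priority first

# random.choices picks my_list[i] with probability weights[i] / sum(weights),
# e.g. 42 is picked with probability 4/10.
picked = random.choices(my_list, weights=weights, k=1)[0]
```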


&lt;p&gt;The higher the weight is, the higher the probability that we pick the given number.&lt;br&gt;
You can see the weights as the number of “buckets” we assign to each number in the list. Subsequently, we randomly (uniformly) try to hit one bucket. After hitting a bucket, we check which number it corresponds to.&lt;br&gt;
The total number of buckets we can hit is equal to the sum of the weights:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4 + 3 + 2 + 1 = 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The probability of hitting a bucket based on the number from the list is shown below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------+-----------------------+
|    number    |       probability     |
+--------------+-----------------------+
|     42       |       4/10 = 0.4      |
|     33       |       3/10 = 0.3      |
|     30       |       2/10 = 0.2      |
|  0xdeadbeef  |       1/10 = 0.1      |
+--------------+-----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;To generalize this for &lt;em&gt;n&lt;/em&gt; numbers, we can come up with the following formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Od6MHsfv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1024/1%2AfeDrwuLTeVGBZyfOphWwsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Od6MHsfv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1024/1%2AfeDrwuLTeVGBZyfOphWwsw.png" alt="formula1" width="512" height="32"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In other mathematical words:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2F30PadC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/256/1%2AuYBVFoPwtLKoBmHWgv4Bmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2F30PadC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/256/1%2AuYBVFoPwtLKoBmHWgv4Bmw.png" alt="formula2" width="128" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Does this formula look familiar to you? It’s called the &lt;a href="https://en.wikipedia.org/wiki/Termial"&gt;termial of a positive integer &lt;em&gt;n&lt;/em&gt;&lt;/a&gt;; from Wikipedia:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The termial was coined by Donald E. Knuth in his The Art of Computer Programming. It is the additive analog of the factorial function, which is the product of integers from 1 to n. He used it to illustrate the extension of the domain from positive integers to the real numbers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, let’s get our hands dirty with some code. To compute the termial of &lt;em&gt;n&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;termial_of_n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# O(N)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Another, more efficient, way to compute the termial of n is to use the &lt;a href="https://en.wikipedia.org/wiki/Binomial_coefficient"&gt;Binomial coefficient&lt;/a&gt; and compute &lt;code&gt;(len(my_list) + 1)&lt;/code&gt; over &lt;code&gt;2&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# (l + 1) over 2 = l! / (2!*(l-2)!) = l * (l - 1) / 2
&lt;/span&gt;&lt;span class="n"&gt;termial_of_n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# O(1)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, we can pick a random (random uniform) bucket from our buckets:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;random&lt;/span&gt;
&lt;span class="n"&gt;choice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;termial_of_n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The result stored in the variable &lt;em&gt;choice&lt;/em&gt; holds an integer from 0 to 9 (inclusive) and represents an index into the list of buckets we created earlier:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------+---------------+---------------+
|    choice    |     bucket    |     number    |
+--------------+---------------+---------------+
|      0       |       1       |       42      |
+--------------+---------------+---------------+
|      1       |       2       |       42      |
+--------------+---------------+---------------+
|      2       |       3       |       42      |
+--------------+---------------+---------------+
|      3       |       4       |       42      |
+--------------+---------------+---------------+
|      4       |       5       |       33      |
+--------------+---------------+---------------+
|      5       |       6       |       33      |
+--------------+---------------+---------------+
|      6       |       7       |       33      |
+--------------+---------------+---------------+
|      7       |       8       |       30      |
+--------------+---------------+---------------+
|      8       |       9       |       30      |
+--------------+---------------+---------------+
|      9       |       10      |   0xdeadbeef  |
+--------------+---------------+---------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;How do we find which number we hit through a randomly picked bucket for any &lt;em&gt;n&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Let’s revisit how we computed the termial of &lt;em&gt;n&lt;/em&gt; using the Binomial coefficient based formula:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;termial_of_n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Based on the termial function definition, we know that regardless of &lt;em&gt;n&lt;/em&gt;, we always assign 1 bucket to the number at index 0, 2 buckets to the number at index 1, 3 buckets to the number at index 2, and so on (here the index &lt;em&gt;i&lt;/em&gt; counts from the end of the list, since the last item receives the fewest buckets). Using this knowledge, we can transform the Binomial coefficient formula into the following equation:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;choice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The next step is to find the &lt;em&gt;i&lt;/em&gt; that satisfies the given equation. It is a quadratic equation of the form:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a*i**2 + b*i + c = 0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;where:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = 1/2
b = 1/2
c = -choice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;As &lt;em&gt;choice&lt;/em&gt; is always expected to be a non-negative integer (an index into the list of buckets), we can search for a solution that is always a non-negative number (discarding the discriminant root that always results in a negative &lt;em&gt;i&lt;/em&gt;).&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;math&lt;/span&gt;
&lt;span class="c1"&gt;# D = b**2 - 4*a*c
# x1 = (-b + math.sqrt(D)) / (2*a)
# x2 = (-b - math.sqrt(D)) / (2*a)
# Given:
#   a = 1/2
#   b = 1/2
#   c = -choice
# D = (1/2)**2 + 4*0.5*choice = 0.25 + 2*choice
&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The solution has to be rounded down using &lt;em&gt;math.floor&lt;/em&gt; so that every bucket belonging to the same item maps to the same &lt;em&gt;i&lt;/em&gt;. Since &lt;em&gt;i&lt;/em&gt; is the inverted index with respect to &lt;em&gt;n&lt;/em&gt;, the final solution (the index into the original list) is:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;my_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
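&lt;p&gt;Putting all the pieces above together, a minimal sketch of the whole picker could look like the following (the function name &lt;em&gt;termial_random&lt;/em&gt; is mine; see my gist for the original implementation):&lt;/p&gt;

```python
import math
import random

def termial_random(seq):
    """Pick an item from seq, preferring items at lower indices.

    The item at index k is assigned len(seq) - k buckets out of
    termial(len(seq)) buckets in total, so index 0 is the most likely pick.
    """
    n = len(seq)
    termial_of_n = (n * n + n) // 2  # n * (n + 1) / 2, computed in O(1)
    choice = random.randrange(termial_of_n)  # uniformly hit one bucket
    # Solve i**2/2 + i/2 - choice = 0 for the inverted index i.
    i = math.floor(-0.5 + math.sqrt(0.25 + 2 * choice))
    return seq[n - 1 - i]  # un-invert the index

my_list = [42, 33, 30, int("deadbeef", base=16)]
print(termial_random(my_list))  # 42 is the most likely result (probability 4/10)
```

&lt;p&gt;For &lt;em&gt;n&lt;/em&gt; = 4 this reproduces the 4/10, 3/10, 2/10, 1/10 probabilities from the tables above.&lt;/p&gt;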

&lt;h1&gt;
  
  
  Asymptotic complexity analysis
&lt;/h1&gt;

&lt;p&gt;Let’s assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;em&gt;len&lt;/em&gt; function can return the length of the list in &lt;em&gt;O(1)&lt;/em&gt; time&lt;/li&gt;
&lt;li&gt;the &lt;em&gt;random.randrange&lt;/em&gt; function operates in &lt;em&gt;O(1)&lt;/em&gt; time&lt;/li&gt;
&lt;li&gt;we use the Binomial coefficient based equation for computing the termial of &lt;em&gt;n&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole computation is done in &lt;em&gt;O(1)&lt;/em&gt; time with &lt;em&gt;O(1)&lt;/em&gt; space.&lt;/p&gt;

&lt;p&gt;If we used the &lt;em&gt;sum&lt;/em&gt;-based computation of the termial of &lt;em&gt;n&lt;/em&gt;, the algorithm would become &lt;em&gt;O(n)&lt;/em&gt; in time with &lt;em&gt;O(1)&lt;/em&gt; space.&lt;/p&gt;


&lt;h1&gt;
  
  
  Disclaimer &amp;amp; original usage
&lt;/h1&gt;

&lt;p&gt;I designed this algorithm for &lt;a href="https://thoth-station.ninja/docs/developers/adviser/"&gt;Thoth’s recommendation engine&lt;/a&gt;. Its main purpose is to prefer more recent versions of packages in the resolver during the resolution of Python software stacks.&lt;/p&gt;

&lt;p&gt;I would be happy for any feedback or any similar approaches recommended.&lt;/p&gt;

&lt;p&gt;A complete solution can be found in &lt;a href="https://gist.github.com/fridex/8a2442f7e187914af715968097688aa3"&gt;my GitHub gist&lt;/a&gt;.&lt;/p&gt;



</description>
      <category>python</category>
      <category>random</category>
    </item>
  </channel>
</rss>
