In the previous blog post, I mentioned the start of a blog series I plan to publish about Python dependencies and how to deal with them. This article is the second in the series and covers obtaining information about Python dependencies. You’ll also gain access to our dataset published on Kaggle.
Python’s packaging allows specifying dependencies using a setup.py script. This script states all the metadata that Python’s packaging tools need to install Python distributions seamlessly into environments. It has a straightforward structure and lets you programmatically define everything needed when packaging your Python package.
Another popular way to write your Python package metadata is the setup.cfg file. Unlike setup.py, this file is not Python source code but a static configuration file. Refer to the distutils documentation for more info.
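To illustrate why a static file is easier to inspect, here is a minimal sketch that reads requirements from setup.cfg content using only the standard library, without executing any packaging code. The package name and requirements in the sample are made up for illustration.

```python
import configparser

# A hypothetical setup.cfg; the package name and requirements are made up.
SETUP_CFG = """\
[metadata]
name = example-package
version = 0.1.0

[options]
install_requires =
    requests>=2.20
    click
"""


def read_static_requirements(text):
    """Return install_requires entries from setup.cfg content without running any code."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Multi-line values come back as one string with embedded newlines.
    raw = parser.get("options", "install_requires", fallback="")
    return [line.strip() for line in raw.splitlines() if line.strip()]


requirements = read_static_requirements(SETUP_CFG)
print(requirements)
```

Nothing here depends on the operating system or the interpreter version, which is exactly what a script-based setup.py cannot guarantee.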
The ability to use a Python script to define all the packaging metadata offers great power. You can code basically anything you want and adjust your package metadata as desired during the setup.py invocation. But as usual:
With great power comes great responsibility.
If you visit any project page on PyPI, the Python Package Index, you’ll notice there is no dependency information. The great power behind the setup.py script is the main reason why: if the dependencies are stated in a Python script, that script needs to be executed to obtain them. Okay, so let’s trigger the installation process of a Python package and see what dependencies are stated there, but wait… What operating system should we use? What Python interpreter version should we use? What native dependencies (such as gcc for native extensions) should be present in our environment? What CPU architecture? And… And…
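As a small illustration of the problem, the sketch below shows the kind of logic a setup.py can contain. The function name and the specific conditions are assumptions made up for this example; in a real setup.py the result would be passed to setup(install_requires=...).

```python
import sys


def compute_install_requires(python_version=None, platform=None):
    """Build a requirements list that depends on the environment (illustrative only)."""
    python_version = python_version or sys.version_info[:2]
    platform = platform or sys.platform

    requires = ["requests"]
    if python_version < (3, 8):
        # Older interpreters need the backport of importlib.metadata.
        requires.append("importlib-metadata")
    if platform.startswith("win"):
        # A Windows-only dependency, chosen purely as an example.
        requires.append("pywin32")
    return requires


# The same package yields different dependencies in different environments.
print(compute_install_requires((3, 7), "linux"))
print(compute_install_requires((3, 11), "win32"))
```

Because the list is computed at run time, the only reliable way to learn it is to run the script in the environment you care about.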
Obtaining information about a Python package is not that straightforward. Consider all the factors and variations that can be coded in the setup.py script, which can, in turn, produce different sets of dependencies or lead to installation issues. For more info, refer to the article "Why PyPI Doesn’t Know Your Projects Dependencies" written by Dustin Ingram, one of the maintainers of Warehouse (the software that powers PyPI).
To deal with this, project Thoth uses thoth-solver. This Python application can install dependencies from any Python package index conforming to the Simple Repository API (PEP 503), such as pypi.org or the AICoE Python package index, and extract all the metadata of a Python package. One piece of this metadata is the list of requirements of the Python package.
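For readers unfamiliar with PEP 503, here is a minimal sketch of how a project page URL is formed on a conforming index. The name normalization rule comes from PEP 503 itself; the helper names are made up for this example and are not part of thoth-solver.

```python
import re


def normalize_name(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_project_url(index_url, name):
    """Return the project page URL on a PEP 503 simple index."""
    return f"{index_url.rstrip('/')}/{normalize_name(name)}/"


print(simple_project_url("https://pypi.org/simple", "Thoth_Solver"))
```

Any index that serves pages under this layout can be used as a source of packages, which is why the same tooling works against indexes other than pypi.org.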
As there is no static metadata to rely on without actually installing a Python package (although for some wheel distributions it is possible), thoth-solver installs the given package into an environment and extracts its metadata. The aggregated data is reported in a structured JSON format for further processing.
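The wheel exception mentioned above can be sketched as follows: a wheel is a zip archive that carries a METADATA file in its .dist-info directory, so its requirements can be read without installing anything. This is a sketch and not how thoth-solver works; the archive built below is a made-up stand-in for a real wheel.

```python
import io
import zipfile
from email.parser import Parser


def requires_from_wheel(wheel_file):
    """Read Requires-Dist entries from a wheel's METADATA file without installing it."""
    with zipfile.ZipFile(wheel_file) as archive:
        metadata_name = next(
            name for name in archive.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        text = archive.read(metadata_name).decode("utf-8")
    # METADATA uses an email-header-like format.
    message = Parser().parsestr(text)
    return message.get_all("Requires-Dist") or []


# Build a tiny in-memory archive that mimics a wheel (hypothetical package).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "example_package-0.1.0.dist-info/METADATA",
        "Metadata-Version: 2.1\n"
        "Name: example-package\n"
        "Version: 0.1.0\n"
        "Requires-Dist: requests>=2.20\n"
        "Requires-Dist: click\n",
    )
buffer.seek(0)

wheel_requirements = requires_from_wheel(buffer)
print(wheel_requirements)
```

Source distributions offer no such guarantee, which is why installation remains the general approach.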
As stated above, Thoth’s solver downloads and actually installs a Python package. The very first "observation" it captures is:
- Does the given Python application install into the given environment?
There are numerous reasons why a Python package might not be installable into the environment. The cause might be a missing or wrong toolchain (e.g. a missing gcc, or a gcc version lacking the required C/C++ standard), Python interpreter incompatibilities (e.g. Python 2 versus Python 3 issues), missing manylinux support in an older pip release, or simply a broken release by the maintainers. Basically, anything that can possibly go wrong. The implementation behind thoth-solver captures this fact together with all the relevant log information (which is subsequently analyzed within project Thoth to automatically derive why the given package was not installable).
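Capturing this first observation can be sketched roughly as below. This is an assumption about the general shape, not thoth-solver's actual code: the function name and the result keys are made up, and the failing command merely simulates an installation error instead of calling pip.

```python
import subprocess
import sys


def try_install(command):
    """Run an install command and record whether it succeeded, together with its log."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return {
        "installable": completed.returncode == 0,
        "returncode": completed.returncode,
        "log": completed.stdout + completed.stderr,
    }


# A real run would use something like:
#   [venv_python, "-m", "pip", "install", "package==version"]
# Here a child interpreter simulates a failing installation.
failing = try_install([
    sys.executable,
    "-c",
    "import sys; print('simulated build failure', file=sys.stderr); sys.exit(1)",
])
print(failing["installable"], failing["log"].strip())
```

Keeping the full log next to the boolean result is what makes later analysis of the failure possible.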
Once the installation succeeds, the tool obtains all the information about dependencies using importlib.metadata (and some additional metadata as produced by the importlib.metadata module). The metadata is gathered from a fresh virtual environment into which the analyzed package was installed. This reduces interference with the dependencies of thoth-solver itself or of any other package installed in the environment where thoth-solver runs. The stated requirements are parsed and solved respecting Python standards for dependency specification, so the resulting document also lists dependencies in specific versions (rather than just dependency specifications). The results are subsequently aggregated and reported in the final JSON document together with thoth-solver run metadata (OS, Python interpreter version/build, …).
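The importlib.metadata calls involved look roughly like the sketch below. It inspects the current environment rather than a fresh virtual environment, and the helper name and output keys are made up for this example.

```python
from importlib import metadata


def describe_distribution(name):
    """Return name, version, and raw requirement strings of an installed distribution."""
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    return {
        "name": dist.metadata["Name"],
        "version": dist.version,
        # dist.requires is None when the distribution declares no requirements.
        "requires": list(dist.requires or []),
    }


# "pip" may or may not be installed; the second name is intentionally bogus.
for dist_name in ("pip", "surely-not-installed-example"):
    print(dist_name, describe_distribution(dist_name))
```

The requirement strings returned here are still specifications (for example a version range); resolving them to concrete versions is a separate step.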
We run thoth-solver as a containerized application in our clusters using different operating systems (such as UBI, RHEL, Fedora, ...), different native dependencies, and different Python interpreter versions (a matrix of all the factors that can influence Python package installation). The JSON documents produced by thoth-solver runs are automatically placed onto Ceph and synced into Thoth’s knowledge base. Optimizations of the thoth-solver implementation (such as pre-baking the virtual environment into the shipped containers) allow us to analyze a few hundred Python packages per hour; the only real limits are cluster resources and networking. All the dependency information is available on our API endpoints.
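The matrix of factors can be pictured as a simple cross product. The operating systems come from the list above; the Python versions are assumptions picked for illustration, and the real set of solver configurations may look different.

```python
from itertools import product

OPERATING_SYSTEMS = ["ubi", "rhel", "fedora"]
PYTHON_VERSIONS = ["3.6", "3.8"]  # assumed values, for illustration only


def build_matrix(operating_systems, python_versions):
    """Return one solver configuration per (OS, Python version) combination."""
    return [
        {"os": os_name, "python": python_version}
        for os_name, python_version in product(operating_systems, python_versions)
    ]


matrix = build_matrix(OPERATING_SYSTEMS, PYTHON_VERSIONS)
for configuration in matrix:
    print(configuration)
```

Each added factor multiplies the number of runs, which is why throughput per container matters.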
If you wish to browse some of the thoth-solver data, you can do so by accessing the Kaggle dataset we published. See also a notebook that explores the dataset, or the github.com/thoth-station/datasets repository with notebooks.
The dataset consists of 100,000 thoth-solver JSON documents (415.79 MB in total). You can find application stacks of popular Python packages (such as TensorFlow) published on PyPI.
As the solver states dependencies as of the point in time when it is run, we wanted to keep our knowledge base up to date with recent Python package releases. Consider a new numpy==1.20.0 feature release or a new numpy==1.18.6 patch release: do these releases affect any packages that depend on numpy>=1.19? We can answer this question (offline, without running thoth-solver) using another component called thoth-revsolver. Check this demo for more info:
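To make the numpy question concrete, here is a deliberately simplified sketch of the check involved. It is not thoth-revsolver's implementation: it handles only plain numeric versions and a single comparison operator, and the dependent package names are hypothetical. Real code should rely on a full PEP 440 implementation such as the packaging library.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Parse a plain numeric version such as '1.18.6' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, specifier):
    """Check a version against a single specifier such as '>=1.19'."""
    for symbol, compare in OPERATORS.items():
        if specifier.startswith(symbol):
            left = parse_version(version)
            right = parse_version(specifier[len(symbol):])
            # Pad with zeros so that 1.19 and 1.19.0 compare as equal.
            width = max(len(left), len(right))
            left += (0,) * (width - len(left))
            right += (0,) * (width - len(right))
            return compare(left, right)
    raise ValueError(f"unsupported specifier: {specifier!r}")


def affected_by(release, dependents):
    """Return the dependents whose specifier accepts the new release."""
    return sorted(name for name, spec in dependents.items() if satisfies(release, spec))


# Hypothetical packages and their numpy requirements.
DEPENDENTS = {"package-a": ">=1.19", "package-b": "<1.20", "package-c": "==1.18.6"}

print(affected_by("1.20.0", DEPENDENTS))
print(affected_by("1.18.6", DEPENDENTS))
```

With already known dependency specifications, a new release only needs to be matched against them, so no package has to be reinstalled to find out who is affected.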
Project Thoth is an application that aims to help Python developers. If you wish to be updated on any improvements and any progress we make in project Thoth, feel free to subscribe to our YouTube channel where we post our updates as well as our recordings from our scrum demos. You can also follow us on Twitter.
Stay tuned for any new updates!