What happens when you run
pip install <somepackage>? A lot more than you might think. Python's package ecosystem is quite complex.
First, pip needs to decide which distribution of the package to install. This is more complex for Python than for many other languages, since each version (or release) of a Python package usually has multiple distributions. There are 7 different kinds of distributions, but the most common these days are source distributions and binary wheels. A source distribution is exactly what it says on the tin: the raw Python and potentially C extension code that the package developers wrote. A binary wheel is a more complex archive format, which can contain compiled C extension code. This is convenient for users, because compiling, say, numpy from source takes a long time (~4 minutes on my desktop), and it is hard for package authors to ensure that their source code will compile on other people's machines. But it comes at a price: the compiled code is specific to the architecture and often the OS it was compiled on, so most packages with C extensions will build multiple wheel distributions, and pip needs to decide which, if any, are suitable for your computer.
To find the distributions available, pip requests https://pypi.org/simple/<somepackage>, which is a simple HTML page full of links, where the text of each link is the filename of a distribution. The filenames encode the version, the kind of distribution, and, for binary wheels, the architecture and OS they are compatible with. This format is complex enough to be covered by two different PEPs:
- The version scheme is covered by PEP 440.
- Binary wheel filename compatibility tags are the subject of PEP 425.
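To make the filename format concrete, here is a minimal sketch of splitting a wheel filename into its dash-separated fields. The function and class names are my own, the example filename is only illustrative, and this is not how pip itself is implemented; it also assumes the project name has already had its dashes replaced by underscores.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    """Fields encoded in a wheel filename (names here are illustrative)."""
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split 'name-version[-build]-python-abi-platform.whl' into its parts."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the python tag.
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelName(name, version, build, python_tag, abi_tag, platform_tag)


wheel = parse_wheel_filename("numpy-1.16.0-cp37-cp37m-manylinux1_x86_64.whl")
print(wheel.python_tag, wheel.abi_tag, wheel.platform_tag)
```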
To select a distribution, pip first determines which distributions are compatible with your system and implementation of Python. For binary wheels, it parses the filenames according to PEP 425, extracting the Python implementation, application binary interface, and platform. The Python implementation can be something as broad as py2.py3 (meaning "any implementation of Python 2.X or 3.X") or it can specify a Python interpreter and major version, such as pp35 (meaning PyPy version 3.5). The application binary interface is essentially what version of CPython's C-API the C extension code is compatible with, if there is any. Interpreting the platform portion of the compatibility tag is more difficult. It can be relatively obvious, like win32 for 32-bit Windows, but I am usually installing manylinux1 wheels. Which Linux distributions are compatible with manylinux1 is a subject of heavy debate on the distutils mailing list. Luckily the process for source distributions is simpler: all source distributions are assumed to be compatible, at least at this step in the process.
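The compatibility check for wheels can be sketched roughly as follows. A tag such as py2.py3 is a compressed set, so each dotted field expands into several (python, abi, platform) triples, and a wheel is usable if any of them appears in the set the interpreter supports. The supported set below is made up for illustration; a real installer computes it from the running interpreter and platform.

```python
from itertools import product
from typing import Set, Tuple

Tag = Tuple[str, str, str]


def expand_tags(python_tag: str, abi_tag: str, platform_tag: str) -> Set[Tag]:
    """Expand compressed tags like 'py2.py3' into individual tag triples."""
    return set(product(python_tag.split("."),
                       abi_tag.split("."),
                       platform_tag.split(".")))


def is_compatible(wheel_tags: Set[Tag], supported: Set[Tag]) -> bool:
    """A wheel is usable if it shares at least one triple with the supported set."""
    return not wheel_tags.isdisjoint(supported)


# Hypothetical supported set for CPython 3.7 on 64-bit Linux.
supported = {
    ("cp37", "cp37m", "manylinux1_x86_64"),
    ("py3", "none", "any"),
}

print(is_compatible(expand_tags("py2.py3", "none", "any"), supported))
print(is_compatible(expand_tags("cp37", "cp37m", "win32"), supported))
```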
Once pip has a list of compatible distributions, it sorts them by version, chooses the most recent version, and then chooses the "best" distribution for that version. It prefers binary wheels if there are any, and if there are multiple it chooses the one most specific to the install environment. These are just pip's default preferences though; they can be configured with options like --prefer-binary. The "best" distribution is either downloaded or installed from the local cache, which on Linux is usually located in
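The default preference order described above (newest version first, then wheels over source distributions, then the most specific wheel) can be sketched as a single sort key. Everything here is a simplification under stated assumptions: versions are plain tuples of integers rather than full PEP 440 versions, and tag_rank is a hypothetical number where lower means "more specific to this machine".

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    filename: str
    version: Tuple[int, ...]  # simplified stand-in for a PEP 440 version
    is_wheel: bool
    tag_rank: int             # hypothetical: lower = more specific match


def pick_best(candidates: List[Candidate]) -> Candidate:
    """Newest version wins; ties go to wheels, then to the most specific tag."""
    return max(candidates, key=lambda c: (c.version, c.is_wheel, -c.tag_rank))


candidates = [
    Candidate("demo-1.15.4-cp37-cp37m-manylinux1_x86_64.whl", (1, 15, 4), True, 1),
    Candidate("demo-1.16.0.tar.gz", (1, 16, 0), False, 99),
    Candidate("demo-1.16.0-py3-none-any.whl", (1, 16, 0), True, 5),
    Candidate("demo-1.16.0-cp37-cp37m-manylinux1_x86_64.whl", (1, 16, 0), True, 1),
]
print(pick_best(candidates).filename)
```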
Determining the dependencies for this distribution is not simple either. In theory, one could just use the
requires_dist value from
https://pypi.org/pypi/<somepackage>/<version>/json. However, this relies on the package author uploading the correct metadata, and older packaging clients do not do so. So in practice
pip (and anyone else who wants to know the dependencies of a package) has to download and inspect it.
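For reference, the "in theory" approach looks roughly like this: fetch the JSON document for a release and read requires_dist from its info section. The function names are mine, and as noted above the value may be missing or empty even when the package does have dependencies, so treat the result as a hint rather than the truth.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen


def fetch_release_json(package: str, version: str) -> Dict[str, Any]:
    """Download the JSON metadata PyPI publishes for one release."""
    url = f"https://pypi.org/pypi/{package}/{version}/json"
    with urlopen(url) as response:
        return json.load(response)


def declared_requirements(release_json: Dict[str, Any]) -> List[str]:
    """Return the uploaded requires_dist list, or [] if none was uploaded."""
    return release_json.get("info", {}).get("requires_dist") or []


# Usage (needs network access):
#   print(declared_requirements(fetch_release_json("requests", "2.21.0")))
```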
For binary wheels, the dependencies are listed in a file called
METADATA. But for source distributions the dependencies are effectively whatever gets installed when you execute their
setup.py script with the
install command. There's no way to know unless you try it, which is what
pip does! Specifically, it leverages
setuptools to run
install up to the point where it knows what dependencies to install. However, this can be further complicated by the fact that running
install might itself require dependencies. The standard way to specify this in a Python package is to pass the
setup_requires argument to
setuptools.setup. By way of
pip will run
setup.py just enough to discover
setup_requires, install those dependencies, then go back and execute
setup.py again. Naturally, this is madness and
setup_requires should never be used.
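The binary wheel case, at least, is easy to illustrate. A wheel is a zip archive, and its METADATA file uses email-style headers, so the standard library is enough to read the declared dependencies. This is a sketch of the idea rather than pip's own code, and the function name is hypothetical.

```python
import zipfile
from email.parser import Parser
from typing import List


def wheel_requirements(wheel_path) -> List[str]:
    """Read the Requires-Dist headers from a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as archive:
        # METADATA lives in the '<name>-<version>.dist-info/' directory.
        names = [n for n in archive.namelist()
                 if n.endswith(".dist-info/METADATA")]
        if not names:
            raise ValueError("no METADATA file found in wheel")
        text = archive.read(names[0]).decode("utf-8")
    # get_all returns None when the header is absent.
    return Parser().parsestr(text).get_all("Requires-Dist") or []
```

Nothing comparable exists for a source distribution, which is why the setup.py dance described above is needed.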
Once pip has a list of requirements, it starts this whole process over again for each required package, taking into account any constraints on its version. It builds a whole tree of packages this way, until every dependency of every distribution it has found is already in the tree. This process breaks of course if there is a dependency cycle, but it will always terminate. After all, there are only finitely many Python packages!
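The traversal itself can be sketched as a breadth-first walk that stops expanding a package once it has been seen. The index dictionary below is a stand-in for "download the distribution and inspect it", and the package names are invented; the point is only that tracking visited packages guarantees termination, even when the made-up graph contains a cycle.

```python
from collections import deque
from typing import Dict, List


def collect_dependencies(root: str, index: Dict[str, List[str]]) -> List[str]:
    """Walk the dependency graph from root, visiting each package once."""
    seen = {root}
    queue = deque([root])
    order = []
    while queue:
        name = queue.popleft()
        order.append(name)
        for dep in index.get(name, []):
            if dep not in seen:  # already in the tree: do not expand again
                seen.add(dep)
                queue.append(dep)
    return order


# Hypothetical index with a cycle: c depends back on a.
index = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(collect_dependencies("a", index))
```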
What happens though if one of the distributions pip finds violates the requirements of another, for example if it first finds idna 2.5 but then finds a distribution requiring idna<=2.4? Well, it ignores the requirement and installs idna anyway! There is a longstanding issue open to add a true dependency resolver to pip, with lots of false starts and partial implementations, but none have ever quite made it in. This is of course in large part due to the complexity of determining the dependencies for a Python package: it is very difficult to build an efficient dependency resolver when determining the dependencies of a single candidate requires downloading and executing potentially megabytes of code!
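Detecting such a conflict after the fact is not the hard part, as this small sketch suggests. It handles only the <=, >= and == operators on purely numeric versions, which is far less than PEP 440 allows, and every name in it is my own invention. What a real resolver must additionally do is go back and pick different versions, which is where the cost of discovering dependencies bites.

```python
import operator
from typing import Dict, List, Tuple

# Only a small, illustrative subset of version operators.
OPERATORS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.10' into (2, 10) so that comparison is numeric, not textual."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a single specifier such as '<=2.4'."""
    for symbol, compare in OPERATORS.items():
        if spec.startswith(symbol):
            return compare(parse_version(version),
                           parse_version(spec[len(symbol):]))
    raise ValueError(f"unsupported specifier: {spec!r}")


def find_conflicts(chosen: Dict[str, str],
                   requirements: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """List the requirements that the already-chosen versions violate."""
    return [(name, spec) for name, spec in requirements
            if name in chosen and not satisfies(chosen[name], spec)]


print(find_conflicts({"idna": "2.5"}, [("idna", "<=2.4"), ("idna", ">=2.0")]))
```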
Next, pip has to actually build and install the package. If it downloaded a source distribution, and the wheel package is installed, it will first build a binary wheel specifically for your machine out of the source. Then it needs to determine which library directory to install the package in: the system's, the user's, or a virtualenv's? This is controlled by sys.prefix, which in turn is controlled by pip's executable path and the PYTHONHOME environment variable. Finally, it moves the wheel files into the appropriate library directory, and compiles the Python source files into bytecode for faster execution.
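You can inspect these last two steps from inside any interpreter. The sketch below reports sys.prefix, the library directory derived from it, and whether the interpreter appears to be running in a virtual environment, and then compiles one source file to bytecode. The helper names are hypothetical, and the virtual environment check assumes Python 3's base_prefix attribute; older virtualenv releases used a different marker.

```python
import py_compile
import sys
import sysconfig
from typing import Any, Dict


def describe_install_target() -> Dict[str, Any]:
    """Report where pure-Python packages would be installed for this interpreter."""
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return {
        "prefix": sys.prefix,
        "in_virtualenv": sys.prefix != base_prefix,
        "purelib": sysconfig.get_paths()["purelib"],
    }


def compile_to_bytecode(source_path: str) -> str:
    """Compile one .py file and return the path of the resulting .pyc file."""
    return py_compile.compile(source_path, doraise=True)


print(describe_install_target())
```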
Now your package is installed! I've really only scratched the surface: there are dozens of options that change pip's behavior, many corner cases of other distribution types and platform limitations, and I didn't even touch on installing multiple packages (which is handled differently than a package with multiple dependencies). But I hope this was informative, if not useful.