ATIXAG

Managing Large Debian Repositories with Pulp

Pulp is a free, open-source platform for software repository management. You can fetch, upload, and distribute content from various sources. Repository versioning makes sure that nothing is lost as you can always roll back to previous versions. The pulp_deb plugin adds APT repository support.

Pulp has supported Debian content for a while, and ATIX expanded that support for use with Katello a few years ago. It works great for small to medium-sized repositories. For large repositories, however, performance was not ideal.

Challenge

Around 2019, ATIX consultants wanted to synchronize all of Debian Stretch and Ubuntu Xenial for a demo. Unfortunately, they found that the sync generally ran for about five hours, only to fail with a “Cannot allocate memory” error. What was going on?

To answer this question, they needed to take a closer look at the pulp_deb implementation. The code is organized into several steps. The implementation relies heavily on the python-debpkgr dependency, which in turn relies on deb822 from the python-debian library. python-debpkgr is mainly designed to take a pile of Debian packages and organize them into an APT repository. The structure of a Debian repository looks like this:

/dists/stretch/Release
/dists/stretch/main/binary-amd64/Packages
/dists/stretch/contrib/binary-amd64/Packages
/dists/stretch/non-free/binary-amd64/Packages
/pool/

During a sync, the “MetadataStep” is provided with a list of releases, components, and packages (with metadata) from MongoDB. It then applies the following logic: for every combination of architecture, component, and release, a list of packages is generated. These lists contain the paths to the actual .deb package files on disk. Finally, each list is passed to a debpkgr call as an argument.

Since debpkgr is designed to take a pile of Debian packages and turn them into a repo, it does just that: it opens each .deb file on disk to extract the metadata it needs. Because the package lists overlap across architectures, many of these .deb files are parsed multiple times.
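The overlap can be illustrated with a small sketch. This is not the pulp_deb code: the record layout, the function names, and the sample packages are all made up for illustration, and the sketch assumes that the overlap comes from architecture-independent (`all`) packages being listed in every `binary-<arch>` index.

```python
from itertools import product


def build_package_lists(release, components, architectures, packages):
    """Return {(release, component, architecture): [paths to .deb files]}.

    `packages` is a list of plain dicts standing in for database records
    (hypothetical layout, not the real pulp_deb model).
    """
    lists = {}
    for component, architecture in product(components, architectures):
        lists[(release, component, architecture)] = [
            pkg["path"]
            for pkg in packages
            if pkg["component"] == component
            # Assumption: "all" packages are listed for every architecture.
            and pkg["architecture"] in (architecture, "all")
        ]
    return lists


def count_parses(package_lists):
    """Count how often each .deb would be opened if every list is parsed."""
    counts = {}
    for paths in package_lists.values():
        for path in paths:
            counts[path] = counts.get(path, 0) + 1
    return counts


# Made-up sample data, only to show the effect.
sample_packages = [
    {"architecture": "amd64", "component": "main",
     "path": "pool/main/a/alpha/alpha_1.0_amd64.deb"},
    {"architecture": "all", "component": "main",
     "path": "pool/main/b/beta/beta_1.0_all.deb"},
    {"architecture": "i386", "component": "contrib",
     "path": "pool/contrib/g/gamma/gamma_1.0_i386.deb"},
]

package_lists = build_package_lists(
    "stretch", ["main", "contrib"], ["amd64", "i386"], sample_packages
)
parse_counts = count_parses(package_lists)
```

In this sample, the `all` package lands in both the amd64 and the i386 list, so a tool that opens every file in every list would read it twice. With more architectures, the repeated work grows accordingly.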

The solution

Our experts’ first thought was: maybe there’s a quick-and-dirty fix? However, they also considered a complete redesign of the way debpkgr works. Another alternative might be dropping debpkgr (from the MetadataStep) and implementing everything themselves.

The basic idea was to use only the information from MongoDB to create the repository structure. The old implementation already had to parse the metadata from MongoDB in order to generate the lists that were then passed to debpkgr. This essentially remained unchanged. Our experts had to create the desired directory structure themselves. They also had to build the symlinks to the actual .deb files themselves. They then needed the ability to write Packages and Release files. As one always does, they happened upon a few stumbling blocks:

  • debpkgr generates md5sum, sha1, and sha256 for metadata. The existing database model only stored sha256 hashes.

  • Actually using the metadata from the database revealed a bug.

  • User-defined metadata fields were not stored in the existing database model.
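The approach described above (directories, symlinks, and a Packages file written purely from stored metadata) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the pull requests: the record keys, `storage_path`, `publish`, and the other helper names are hypothetical, and the sketch assumes a POSIX system where symlinks are available. The stanza only carries a SHA256 field because that is the only hash the source says was stored. The digests computed at the end are for the generated Packages file itself, which is small and cheap to hash, and could then be listed in a Release file.

```python
import hashlib
import os


def packages_stanza(record):
    """Render one Packages paragraph from a metadata record (a plain dict)."""
    fields = ["Package", "Version", "Architecture", "Filename", "Size", "SHA256"]
    return "".join(f"{field}: {record[field]}\n" for field in fields)


def file_checksums(path):
    """Compute md5, sha1, and sha256 of a file in a single pass."""
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}


def publish(repo_root, release, component, architecture, records, storage_root):
    """Write one Packages index and the pool symlinks without opening any .deb."""
    index_dir = os.path.join(
        repo_root, "dists", release, component, f"binary-{architecture}"
    )
    os.makedirs(index_dir, exist_ok=True)

    for record in records:
        # The pool entry is only a symlink to wherever the file is stored.
        link = os.path.join(repo_root, record["Filename"])
        os.makedirs(os.path.dirname(link), exist_ok=True)
        if not os.path.lexists(link):
            os.symlink(os.path.join(storage_root, record["storage_path"]), link)

    # Paragraphs in a Packages file are separated by a blank line.
    index_path = os.path.join(index_dir, "Packages")
    with open(index_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(packages_stanza(record) for record in records))

    return index_path, file_checksums(index_path)
```

A call such as `publish("/tmp/repo", "stretch", "main", "amd64", records, "/tmp/storage")` would return the path of the written index together with its three digests. The point of the design is that the .deb files themselves are never read during publishing.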

Our consultants came up with the following results:

  • Two major pull requests:

1. Ensure the db is used consistently by quba42 · Pull Request #61 · pulp/pulp_deb

2. MetadataStep performance by quba42 · Pull Request #57 · pulp/pulp_deb

  • An end to our memory problems

  • Syncs for medium-sized repositories (1500 packages) that are more than twice as fast

  • Syncing Ubuntu Xenial (main, restricted, universe, multiverse) for amd64 (53837 Packages) within 3h36m on the test system

What did everyone learn? It is important to know your tools! Furthermore, you have to take your time to plan the architecture and gain the required domain knowledge.
