Thomas H Jones II

Posted on May 11, 2018 • Originally published at thjones2.blogspot.com on May 2, 2018

"How Big Is It," You Ask?

#mirror #repository #rpm #storage

For one of the projects I'm supporting, they want to deploy into a controlled network. However, they want systems in that network to be able to fetch feature and patch RPMs (and other stuff) as needed. Being on a controlled network, they don't want to overwhelm the security gateways by having a bunch of systems fetching directly from the internet. It's a lot more processing-efficient to create a local cache that you malware-scan once rather than having to malware-scan the same stuff multiple times as each system performs an individual fetch. Result: a desire to host a controlled, pre-scanned mirror of the relevant upstream software repositories.

This raised the question, "what all do we need to mirror?" I stress the word "need" because this customer is used to thinking in terms of manual rather than automated efforts (working on that with them, too). As a result of this manual mind-set, they want to keep the data-sets around tasks small.

Unfortunately, their idea of small is also a bit out of date. To them, 100GiB (at least for software repositories) falls into the "not small" category. To me, if I can fit something on my phone, it's a small amount of data. We'll ignore that I'm biased by the fact that I could jam a 512GiB micro SD card into my phone. At any rate, they wanted to minimize the number of repositories and channels they were going to need to mirror so that they could keep the copy-jobs small. I pointed out "you want these systems to have a similar degree of fetching-functionality to what they'd have if they weren't being blocked from downloading directly from the Internet: that means you need <insert exhaustive list of repositories and channels>". Naturally, they balked. The questions "how do you know they'll actually need all of that" and "how much space is all of that going to take" were asked (with a certain degree of overwroughtness).

I pointed out that I couldn't meaningfully answer the former question because I wasn't part of the design-groups for the systems that were to be deployed in the controlled network. I also pointed out that they'd probably be better off asking the owners of the prospective systems what they'd anticipate needing (knowing full well that, usually, such questions' answers are somewhere in the "I'm not sure, yet" neighborhood). As such, my exhaustive list was a hedge: better to have and not need than to need and not have. Given the stupid-cheapness of storage and that it can actually be easier to sync all of the channels in a given repository vice syncing a subset, I didn't see a benefit to not throwing storage at the problem.

To the second question, I pointed out, "I'm sitting at this meeting, not in front of my computer. I can get you a full, channel-by-channel breakdown once I get back to my desk."

One of the nice things about yum repositories (where the feature and patch RPMs would come from) is that they're easily queryable for both item-counts and aggregate sizes. The only real down side is that the OS-included tools for doing so are geared toward human-centric, ad hoc queries rather than toward output that can be jammed into an Excel spreadsheet and run through =SUM formulae. In other words, sizes are put into "friendly" units: if you have 100KiB of data, the number is reported in KiB; if you have 1055KiB of data, the number is reported in MiB; and so on. So, I needed to "wrap" the native tools' output to put everything into consistent units (which Excel prefers for =SUMing and other mathematical actions). Because it was a "quick" task, I did it in BASH. In retrospect, using another language likely would have been far less ugly. However, what I came up with worked for creating a CSV:

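#!/bin/bash

# Walk every repo ID yum knows about (enabled or disabled),
# skipping the test, debuginfo, source and media channels
for YUM in $(
   yum repolist all | awk '/abled/{ print $1}' | \
   sed -e '{
      /-test/d
      /-debuginfo/d
      /-source/d
      /-media/d
   }' | sed 's/\/.*$//'
   )
do
   # Split on newlines only, so each "Repo-*" line becomes one array element
   IFS=$'\n'
   REPOSTRUCT=($(
      yum --disablerepo=\* --enablerepo="${YUM}" repolist -v | \
      grep ^Repo- | grep -E "(id|-name|pkgs|size) *:" | \
      sed 's/ *: /:/'
   ))
   unset IFS

   # Split the size into number and unit, e.g. "6.6 G" becomes (6.6 G)
   REPSZ=($(echo "${REPOSTRUCT[3]}" | sed 's/^Repo-size://'))

   # Normalize M and G sizes to KiB; anything else is passed through as-is
   if [[ ${REPSZ[1]} = M ]]
   then
      SIZE=$(echo "${REPSZ[0]} * 1024" | bc)
   elif [[ ${REPSZ[1]} = G ]]
   then
      SIZE=$(echo "${REPSZ[0]} * 1024 * 1024" | bc)
   else
      SIZE="${REPSZ[0]}"
   fi

   # Emit id;name;pkgs;size as one semicolon-delimited line
   for CT in 0 1 2
   do
      printf "%s;" "${REPOSTRUCT[${CT}]}"
   done
   echo "${SIZE}"
done | sed 's/Repo-[a-z]*://g'   # strip every "Repo-xxx:" label, not just the first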

Yeah... hideous and likely far from optimal ...but not worth my time (even if I had it) to revisit. It gets the job done and there's a certain joy to writing hideous code to solve a problem you didn't want to be asked to solve in the first place.
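That said, the unit-normalization step is the one piece that can be pulled out and checked without a live yum. What follows is only a minimal sketch: it assumes sizes arrive as a number followed by a `k`, `M` or `G` suffix, the function name `to_kib` is made up for illustration, and passing any other suffix through unchanged simply mirrors the script above rather than anything yum promises. It uses awk rather than bc for the arithmetic, which is a swap on my part, not what the original script does.

```shell
#!/bin/bash

# to_kib: hypothetical helper, not part of yum.
# Takes a size as two arguments, "<number> <unit>" (e.g. 6.6 G),
# and prints the value in KiB with one decimal place.
# Assumption: units are k, M or G; anything else is passed through unscaled.
to_kib() {
   local num="$1" unit="$2"
   case "${unit}" in
      G) awk -v n="${num}" 'BEGIN { printf "%.1f\n", n * 1024 * 1024 }' ;;
      M) awk -v n="${num}" 'BEGIN { printf "%.1f\n", n * 1024 }' ;;
      *) awk -v n="${num}" 'BEGIN { printf "%.1f\n", n }' ;;
   esac
}

to_kib 100 k    # prints 100.0
to_kib 1.5 M    # prints 1536.0
to_kib 2 G      # prints 2097152.0
```

Keeping the conversion in one function means the `if`/`elif` chain collapses to a single call, and the conversion can be sanity-checked with made-up numbers before pointing it at real repositories.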

At any rate, checking against all of the repos that we'd want to mirror for the project, the initial-sync data-set would fit on my phone (without having to upgrade to one of the top-end beasties). I pointed out that only the initial sync would be "large" and that, since only a couple of the channels update with anything resembling regularity (the rest being essentially static), the monthly delta-sync would be vastly smaller and trivial to babysit. So, we'll see whether that assuages their anxieties or not.
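For anyone who would rather skip the spreadsheet, the "will it fit" total can also be had straight from the semicolon-delimited output. This is a rough sketch only: `total_gib` is a hypothetical name, the sample rows are invented purely to show the layout, and it assumes the last field of each line is the size in KiB.

```shell
#!/bin/bash

# total_gib: hypothetical helper that sums the last ';'-separated field
# (assumed to be a size in KiB) of each line on stdin and prints the
# total in GiB with two decimal places.
total_gib() {
   awk -F';' '{ kib += $NF } END { printf "%.2f\n", kib / 1024 / 1024 }'
}

# Made-up sample rows in the id;name;pkgs;size layout (not real repo data)
printf '%s\n' \
   'repo-a;Example Repo A;100;1048576' \
   'repo-b;Example Repo B;25;524288' | total_gib    # prints 1.50
```

Summing on the last field rather than a fixed column number keeps the total right even if a repository name happens to contain a semicolon.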
