
Taming the CUDA

Thomas H Jones II (ferricoxide). Originally published at thjones2.blogspot.com.

Recently, I was placed on a new contract supporting a data science project. I'm not doing any real data-science work, simply improving the architecture and automation of the processes used to manage and deploy their data-science tooling.

Like most of my customers, the current customer is an Enterprise Linux shop and an AWS shop. Amazon makes available several GPU-enabled instance-types that are well suited to data-science workloads. And, while RHEL is generally suitable for running on GPU-enabled instance types, to get the best performance out of them, you need to run the GPU drivers published by the GPU-vendor rather than the ones bundled with RHEL.

Unfortunately, as with any third-party drivers, there are some gotchas with using them. The one they'd been most plagued by was updating the drivers as the GPU-vendor made newer versions available. While a simple yum upgrade works for most packages, it can be problematic with third-party drivers. When you try to do a yum upgrade (after having ensured the new driver-RPMs are available via yum), you'll eventually get a bunch of dependency errors due to the driver DSOs being in use.

Ultimately, what I had to move to was a workflow that looked like:

  1. Uninstall the current GPU-driver RPMs
  2. Install the new GPU-driver RPMs
  3. Reboot

Unfortunately, "uninstall the current GPU-driver RPMs" actually means "uninstall just the 60+ RPMs that were previously installed ...and nothing beyond that." And, while I could have done something like yum remove <DRIVER_STUB-NAME>, doing so would have removed more packages than I intended.

Fortunately, RHEL (7+) includes a nice option with the yum package-management utility: yum history undo <INSTALL_ID>.

Because the data-science users' individual EC2 instances are of varying vintage (and were launched from different AMIs), the value of <INSTALL_ID> is not stable across their entire environment.

The automation gods giveth; the automation gods taketh away.

That said, there's a quick method to make the instability pretty much a non-problem:

yum history undo $( yum history info <rpm_name>| \
sed -n '/Transaction ID/p' | \
cut -d: -f 2 )

Which is to say "Undo the yum transaction-ID returned when querying the yum history for <rpm_name>". Works like a champ and made the overall update process go very smoothly.
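For wrapping into automation, the one-liner and the three-step workflow could be expressed as small shell functions. What follows is only a minimal sketch under a few assumptions: the function names (extract_txn_id, undo_install_of, update_gpu_driver) are hypothetical, the same package name is assumed to identify both the old install transaction and the new driver package, and only the first "Transaction ID" line in the yum history info output is used. It also refuses to run the undo if no plain numeric ID comes back.

```shell
#!/bin/sh
# Hypothetical helper: read `yum history info` output on stdin and print
# the first transaction ID found, with surrounding whitespace stripped.
extract_txn_id() {
    awk -F: '!found && /^Transaction ID/ {
        gsub(/[[:space:]]/, "", $2)   # " 12" -> "12"
        print $2
        found = 1                     # ignore any later matches
    }'
}

# Hypothetical wrapper: undo the yum transaction associated with the
# package named in $1. Bails out unless the ID is a plain number.
undo_install_of() {
    txn_id=$(yum history info "$1" | extract_txn_id)
    case "$txn_id" in
        ''|*[!0-9]*)
            echo "no single transaction ID found for $1" >&2
            return 1
            ;;
    esac
    yum -y history undo "$txn_id"
}

# Hypothetical end-to-end workflow: uninstall, install, reboot.
# Assumes $1 names both the previously installed and the new driver package.
update_gpu_driver() {
    undo_install_of "$1" || return 1
    yum -y install "$1" || return 1
    reboot
}
```

With those defined, the whole update would be run as root with something like update_gpu_driver <rpm_name>. Checking for a numeric ID before calling yum history undo is a deliberate guard: if the package has no history on a given instance, nothing gets undone.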

Now to wrap it up within the automation framework they're leveraging. I don't think it natively understands the above logic, so I'll probably have to shell-escape to get step #1 done.
