<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ShandonCodes</title>
    <description>The latest articles on DEV Community by ShandonCodes (@shandoncodes).</description>
    <link>https://dev.to/shandoncodes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1251114%2F116deaa9-4a82-4f13-a2e9-fb5976e70a40.png</url>
      <title>DEV Community: ShandonCodes</title>
      <link>https://dev.to/shandoncodes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shandoncodes"/>
    <language>en</language>
    <item>
      <title>Packer &amp; Proxmox: A Bumpy Road</title>
      <dc:creator>ShandonCodes</dc:creator>
      <pubDate>Tue, 16 Jul 2024 22:30:27 +0000</pubDate>
      <link>https://dev.to/shandoncodes/packer-proxmox-a-bumpy-road-1de2</link>
      <guid>https://dev.to/shandoncodes/packer-proxmox-a-bumpy-road-1de2</guid>
      <description>&lt;p&gt;Several months ago I used Packer to simplify the creation of VM templates on my organization's &lt;a href="https://www.vmware.com/products/vsphere.html" rel="noopener noreferrer"&gt;vSphere&lt;/a&gt; instance. While getting Packer to work was not the most pleasant, the benefits of using it had begun to show and so I decided to begin using Packer in my &lt;a href="https://youtu.be/Dj4YvZvDNJU?si=ZobJqezGocSR55HY" rel="noopener noreferrer"&gt;homelab&lt;/a&gt;. In my homelab I use Proxmox as my hypervisor solution, so I figured there would be a few differences but nothing that would cost me too many cycles. I was wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Issue 1: VM Creation Failure
&lt;/h2&gt;

&lt;p&gt;The first issue I encountered was the failure of the VM creation process. Initially, I attempted to authenticate using an API token created by my admin user, but Packer provided no useful feedback in the terminal. After enabling Packer’s debugging mode, I discovered a &lt;code&gt;501 Not Implemented&lt;/code&gt; error being returned from my Proxmox instance.&lt;/p&gt;

&lt;p&gt;This error led me to explore the Proxmox API reference documentation. Initially, I suspected that my Proxmox version (v8.0.3) was not compatible with the latest Packer plugin for Proxmox. However, the real issue was with the token permissions, not the API endpoint.&lt;/p&gt;

&lt;p&gt;When creating the API token, I had left the &lt;strong&gt;Privilege Separation&lt;/strong&gt; box checked, which required me to manually assign permissions for every resource the token needed to access. I tried adding permissions for each required resource, but I couldn’t find all of them in the UI. Eventually, I created a new token, ensuring that I did not select the &lt;strong&gt;Privilege Separation&lt;/strong&gt; box. This new token inherited all permissions from my admin user, allowing me to create the VM and its resources without issues.&lt;/p&gt;
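&lt;p&gt;For reference, a token with privilege separation disabled can also be created from the Proxmox host's shell; the user and token names below are examples, not the ones I actually used:&lt;/p&gt;

```shell
# Create an API token that inherits the creating user's permissions
# (privilege separation disabled); user/token names are examples
pveum user token add root@pam packer --privsep 0
```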

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ld1teevbqqmy7gyj4ch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ld1teevbqqmy7gyj4ch.png" alt="Privilege Separation box shown in Proxmox"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although diagnosing this issue was challenging due to the misleading &lt;strong&gt;501&lt;/strong&gt; error, it could have been avoided if a more appropriate error, such as &lt;code&gt;401 Unauthorized&lt;/code&gt;, had been returned. While this issue originates from the Proxmox API response, an additional warning message in the Packer plugin could help others avoid wasting time on similar problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Issue 2: Subiquity Install Freezing
&lt;/h2&gt;

&lt;p&gt;Ubuntu uses &lt;a href="https://github.com/canonical/subiquity" rel="noopener noreferrer"&gt;Subiquity&lt;/a&gt; as the framework that drives the installation of the OS, and it makes it very easy to automate installation of both Ubuntu Desktop and Server. During the Subiquity install I noticed the installer started to hang once additional packages were being installed. This was a pretty easy fix, as all I needed to do was &lt;a href="https://discuss.hashicorp.com/t/ubuntu-22-04-3-lts-install-hangs-at-subiquity-install-install-postinstall-run-unattended-upgrades-cmd-in-target/63068" rel="noopener noreferrer"&gt;add more RAM&lt;/a&gt; to my configuration. I modified my configuration to use about 4GB of RAM and the installation completed without a hitch!&lt;/p&gt;
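&lt;p&gt;In the Packer Proxmox plugin's HCL2 configuration this is a single field. A trimmed sketch, with the other required builder fields omitted:&lt;/p&gt;

```hcl
source "proxmox-iso" "ubuntu" {
  # Subiquity's package-install phase can hang when the VM has too
  # little memory; 4 GB resolved the freeze for me (value is in MB)
  memory = 4096

  # ... remaining builder configuration omitted ...
}
```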

&lt;h2&gt;
  
  
  Issue 3: Packer SSH Connection Failure
&lt;/h2&gt;

&lt;p&gt;After the initial installation was complete, Packer was not able to connect to the VM over SSH even though it had the correct credentials (I tested the credentials by SSHing in manually while the VM was running). After debugging for a few hours I came across a &lt;a href="https://github.com/hashicorp/packer-plugin-proxmox/issues/91#issuecomment-1139189942" rel="noopener noreferrer"&gt;comment&lt;/a&gt; that outlines how Packer gets the VM's IP: it relies on the QEMU guest agent running on the machine. Without it, the machine's IP was never communicated back to Packer, so the SSH connection kept timing out.&lt;br&gt;
All I had to do was add the &lt;code&gt;qemu-guest-agent&lt;/code&gt; package to the Subiquity installer so that the service would start and report the IP to Packer for the SSH connection. &lt;/p&gt;
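&lt;p&gt;In a Subiquity autoinstall user-data file, that amounts to a single extra package entry. A minimal sketch, with the rest of the autoinstall configuration omitted:&lt;/p&gt;

```yaml
#cloud-config
autoinstall:
  version: 1
  # Install the QEMU guest agent so it can report the VM's IP back
  # to Proxmox, which Packer queries before attempting SSH
  packages:
    - qemu-guest-agent
```

&lt;p&gt;Depending on the plugin version, you may also need &lt;code&gt;qemu_agent = true&lt;/code&gt; in the builder configuration so the agent device is attached to the VM.&lt;/p&gt;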

&lt;h2&gt;
  
  
  Issue 4: VM Template Clone Failures
&lt;/h2&gt;

&lt;p&gt;Once Packer was able to create the VM template correctly, I had one last issue when attempting to use it. During the installation a temporary disk is created to expose the files used by the Subiquity installer. During Packer's teardown that temporary image is destroyed, but the VM's configuration still references it, so every VM created from that template would require that drive to be manually removed before boot. To solve this I set the &lt;code&gt;unmount&lt;/code&gt; key to &lt;code&gt;true&lt;/code&gt; in the image declaration:&lt;/p&gt;
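&lt;p&gt;For reference, the relevant HCL2 fragment looks roughly like this; the paths, label, and storage pool are illustrative, not from my actual config:&lt;/p&gt;

```hcl
additional_iso_files {
  # Temporary ISO carrying the Subiquity autoinstall files
  cd_files         = ["./http/user-data", "./http/meta-data"]
  cd_label         = "cidata"
  iso_storage_pool = "local"
  # Detach the ISO during teardown so VMs cloned from the template
  # do not reference a drive that no longer exists
  unmount          = true
}
```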

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wucger37lqanezc5kdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wucger37lqanezc5kdi.png" alt="Screenshot of HCL2 config displaying unmount field"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, before Packer finishes the teardown process, it makes sure the temporary image is unmounted from the VM. With the temporary image no longer permanently mounted in the template, VMs created from it would no longer throw an error when booting the OS. &lt;/p&gt;

&lt;h2&gt;
  
  
  Issue 5: Unclear Documentation
&lt;/h2&gt;

&lt;p&gt;This is more a list of gripes than a specific issue, but I really do not understand the design behind the Packer documentation (or HashiCorp's in general). A couple of the issues that stand out to me are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Launch point is good, reference is terrible&lt;/strong&gt;: The base documentation page is actually pretty good; it explains what Packer does and lays out the terminology used in the remainder of the documentation. The tutorials are well defined and provide quite a bit of detail without completely overloading you along the way. The part I do not like about the documentation is the lack of contrast between sections in the "On this page" pane. For example, look at the following screenshot:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8sunk5kdtni0m3w8rha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8sunk5kdtni0m3w8rha.png" alt="Screenshot of Packer documentation page displaying showing tabular issue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The "Required" and "Optional" sections should be nested under the "Configuration Reference" parent section. This would help break things but in that side pane visually, also when with option is selected the remaining instances of that selection are highlighted in that pane. So selecting "Required" highlights the other four instances of "Required" in that pane. I am not sure why you would want that functionality at all.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example inconsistency:&lt;/strong&gt; Inside the docs, the examples very often use the JSON format over HCL2. Considering HashiCorp's preferred method is for everyone to use HCL2, I think it would be best if examples were shown in both languages (considering they are both supported), with HCL2 noted as the primary method to use. I will say this is not on HashiCorp alone, as community-based plugins (like the one I am using) probably do not have to adhere to the same standards, BUT considering this documentation is hosted on the official Packer site I think they can share in some of the fault.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Packer is a powerful tool with significant benefits, but using it on new platforms can present challenges. I encountered issues with VM creation, Subiquity installation, SSH connections, and VM template cloning. Additionally, unclear documentation added to the complexity. However, each problem had a solution, and with some persistence, Packer can be effectively used across different environments.&lt;/p&gt;

</description>
      <category>devops</category>
    </item>
    <item>
      <title>Pylint is...</title>
      <dc:creator>ShandonCodes</dc:creator>
      <pubDate>Wed, 28 Feb 2024 03:00:23 +0000</pubDate>
      <link>https://dev.to/shandoncodes/ci-errors-hard-to-find-harder-to-solve-1kio</link>
      <guid>https://dev.to/shandoncodes/ci-errors-hard-to-find-harder-to-solve-1kio</guid>
      <description>&lt;h3&gt;
  
  
  The Intro
&lt;/h3&gt;

&lt;p&gt;It happens at some point in everyone's career: one minute you are making a small change to your software project, and the next thing you know you come across an error. An error that no one else on your team has seen and, worse, one you have little to no idea how to debug and solve. Strap in as I walk you through an error that took me an entire workday to solve and, most importantly, one that made me re-think the entire way my team operates.&lt;/p&gt;

&lt;p&gt;Let me start by painting the scene. I picked up a maintenance ticket with a simple description: "Migrate Gitlab Runners to new vSphere instance". Sounds simple enough, but I decided to take the time to really improve how we deployed our runners. At the time, creating new runners for our project was a very manual process: an engineer would need to manually create the VM, install the OS and other software, and remember all of the configuration steps used on previous runners. Needless to say this was not ideal, so I decided to take advantage of tools like &lt;a href="https://www.packer.io/"&gt;Packer&lt;/a&gt; to ease the future creation of VMs. Specifically, I used Packer to create a VM template with all the required configurations (I'll talk about that in the future). Once I had the template, all I needed to do was perform a few clicks in vSphere and voilà, a new runner was available to the project. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Error
&lt;/h3&gt;

&lt;p&gt;Now that our runners were "migrated", I began testing our pipelines on one new runner (I decided to start with a more modest one to simplify debugging). Everything ran just fine until the pipeline began its linting stage. The linting job would fail with no output, just a notice that the job failed with an exit code of "1". Now, if you are thinking what I was, you might be wondering why zero changes to the project source code somehow caused a linting error, and you would be right to wonder. As far as the linter should be concerned nothing had changed, yet it was finding an issue. I ran the linter locally to double-check for errors and none were shown. I double-checked the pylint version being run in the pipeline vs. locally and confirmed they were exactly the same. While debugging this, two main things stood out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The version of pylint was a major version behind (we were using 2.13.9; the current release as of this writing is 3.0.3).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An error was thrown in the pipeline, but no output was displayed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;First, I touched base with my team about the outdated version of pylint, and they mentioned that due to some breaking changes in the newest major version, the linter would totally fail in our very large codebase and would require a very large refactor to work. While this was not ideal, I decided to move on and focus on the more pressing second issue: the pipeline was failing and I was getting no error. After about an hour of digging I learned our pipeline was writing the pylint output to a file, and because the job was failing, the artifacts it created (like the linter record) were not saved. So there &lt;em&gt;was&lt;/em&gt; an actual error, I just could not view it. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Easily enough, I changed the job to output to stdout, and then I could see the errors. They were related to some relative imports in part of the source, and while this was great to know, I was still puzzled by &lt;em&gt;why&lt;/em&gt; a linting error suddenly came out of seemingly nowhere and &lt;em&gt;why&lt;/em&gt; I could not replicate it on my workstation. I mean sure, I could just fix the linter issues found, but how would I or anyone else be able to ensure we will not receive linter errors in the pipeline if we cannot test for them locally beforehand? After scratching my head on this for quite a few hours I realized something (and you may have too): remember earlier when I mentioned I created one &lt;em&gt;modest&lt;/em&gt; runner for testing? Well, the exact specs were as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 CPU&lt;/li&gt;
&lt;li&gt;64 GB RAM&lt;/li&gt;
&lt;li&gt;150 GB SSD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to these specs from the now deprecated runners:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 CPU&lt;/li&gt;
&lt;li&gt;64 GB RAM&lt;/li&gt;
&lt;li&gt;150 GB HDD&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;Notice the CPU count. You see, as I was searching "why are my lint results different from my pipeline" I did not get any exact results, but I stumbled across this &lt;a href="https://github.com/pylint-dev/pylint/issues/374"&gt;bug report&lt;/a&gt;. Basically, the report says that when pylint is run with multiple cores, some errors may appear or disappear compared to running pylint on a single core. Neither our pipeline nor my local setup specifies how many cores to use (i.e. &lt;code&gt;pylint --jobs=0&lt;/code&gt;), so by default pylint will use as many cores as possible. That meant the runner's pylint instance was effectively using &lt;code&gt;--jobs=1&lt;/code&gt; while my workstation (4 cores) was using &lt;code&gt;--jobs=4&lt;/code&gt;. To confirm, I manually set pylint to use &lt;code&gt;--jobs=1&lt;/code&gt; locally and was finally able to reproduce the errors! To complete my testing I updated the runner specs to 4 CPUs and the pipeline passed with no issues!&lt;/p&gt;
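&lt;p&gt;The mismatch is easy to reproduce once you know the cause: pinning the job count makes a local run match the single-core runner. The package path below is illustrative:&lt;/p&gt;

```shell
# --jobs=0 auto-detects the core count, so results can differ from
# machine to machine (per the pylint bug report linked above)
pylint --jobs=0 my_package/

# Pin to a single process to match a 1-CPU CI runner
pylint --jobs=1 my_package/
```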

&lt;p&gt;To recap the issue was caused by a known bug in an outdated version of pylint, but it was only found by chance when I began working on a completely unrelated ticket. If I had used more CPUs on my test runner to begin with, I might not have ever found this issue at all (especially if we upgraded pylint in the near future).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Path Forward
&lt;/h3&gt;

&lt;p&gt;So how did I wrap everything up? I created two other runners, and thanks to the higher-performance specs of the new vSphere cluster, our release pipeline is now down from 4 hours to 2, a major win! Also, I did not actually fix the original linter error. I felt it did not matter, as the error was never thrown on multi-core systems and our runners and workstations always have multiple cores. I did, however, cite this issue in a ticket to update pylint. &lt;/p&gt;

&lt;p&gt;I learned that efforts that might add little to no benefit in the present may drastically save time in the future. If we as a team had prioritized updating our pylint version, then I very well may not have spent a full workday chasing bugs and learning more about an outdated version of pylint than I needed to.&lt;/p&gt;

</description>
      <category>gitlab</category>
      <category>python</category>
      <category>cicd</category>
      <category>programming</category>
    </item>
    <item>
      <title>Stop using entgo...please</title>
      <dc:creator>ShandonCodes</dc:creator>
      <pubDate>Mon, 08 Jan 2024 03:24:04 +0000</pubDate>
      <link>https://dev.to/shandoncodes/stop-using-entgoplease-5gm5</link>
      <guid>https://dev.to/shandoncodes/stop-using-entgoplease-5gm5</guid>
      <description>&lt;p&gt;If you found this article, than you are probably similar to how I was a few months ago. I started a project in Go that required a SQL backend and I wanted to use any tool that would help me build this backend quickly. I stumbled upon &lt;a href="https://entgo.io/"&gt;entgo&lt;/a&gt; (an ORM for Go) and decided to give it a try.&lt;/p&gt;

&lt;p&gt;Initially it was easy to set up and get started. The code generation seemed to work well, and in no time I was using the database in my application. Everything was going well, but I ran into a few issues I could not overlook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto Migration Failures&lt;/li&gt;
&lt;li&gt;"Magical" Queries&lt;/li&gt;
&lt;li&gt;Code Bloat&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Auto Migration Failures
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://entgo.io/docs/migrate/#auto-migration"&gt;auto migration feature&lt;/a&gt; of the module seems to have issues processing fields with &lt;code&gt;date/time&lt;/code&gt; values. After spending hours debugging, I reached out to their community to see if anyone else had hit this. I received no response in their Discord (it does not seem very active) and only one response on a &lt;a href="https://github.com/ent/ent/issues/3790"&gt;GitHub issue I posted&lt;/a&gt; on the topic (more than a month after it was posted).&lt;/p&gt;

&lt;h4&gt;
  
  
  "Magical" Queries
&lt;/h4&gt;

&lt;p&gt;During load testing on my application I began noticing performance issues from the database. When I enabled the &lt;code&gt;Debug()&lt;/code&gt; mode of the entgo client, I observed overly complex queries, even for simple cases (i.e. simple &lt;code&gt;SELECT&lt;/code&gt;s, no &lt;code&gt;JOIN&lt;/code&gt;s). Some of these issues may be resolved by modifying the &lt;a href="https://entgo.io/docs/code-gen#code-generation-options"&gt;code generation&lt;/a&gt; options of the library, but that seems like quite a bit of additional effort for an issue that should not exist.&lt;/p&gt;

&lt;h4&gt;
  
  
  Code Bloat
&lt;/h4&gt;

&lt;p&gt;The included &lt;code&gt;ent&lt;/code&gt; code generation tool could use some optimization: reviewing many of the generated files reveals numerous functions and complex logic that (as I understand it) add little value unless advanced features of the library are being used (GraphQL integration, custom hooks, etc.). &lt;/p&gt;

&lt;h2&gt;
  
  
  What to use instead?
&lt;/h2&gt;

&lt;p&gt;I just told you all of the reasons not to use entgo, but what should you use instead? Honestly, raw SQL: more precisely, a raw SQL schema and hand-written queries, with &lt;a href="https://docs.sqlc.dev/en/stable/index.html"&gt;sqlc&lt;/a&gt; for Go code generation. That may sound counterintuitive, but hear me out. &lt;code&gt;sqlc&lt;/code&gt; is not an ORM: you still write your database schema, perform your migrations, and write your own queries. &lt;code&gt;sqlc&lt;/code&gt; handles the generation of Go interfaces to perform those queries in a type-safe fashion within your application. This has a few advantages over &lt;code&gt;entgo&lt;/code&gt; (or any ORM, for that matter):&lt;/p&gt;
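&lt;p&gt;As a taste of the workflow, you annotate a plain SQL query and sqlc generates a typed Go method for it. The table and names below are illustrative, not from my project:&lt;/p&gt;

```sql
-- name: GetAuthor :one
-- sqlc reads this annotation and generates a Go method returning
-- exactly one row, typed against the schema
SELECT id, name, bio
FROM authors
WHERE id = $1;
```

&lt;p&gt;From this, sqlc emits a Go method along the lines of &lt;code&gt;GetAuthor(ctx context.Context, id int64) (Author, error)&lt;/code&gt;, so the query stays hand-written while the Go plumbing is generated.&lt;/p&gt;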

&lt;ul&gt;
&lt;li&gt;No "Magic" Queries&lt;/li&gt;
&lt;li&gt;No Auto Migration System Failures&lt;/li&gt;
&lt;li&gt;Minimal Code Generation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  No "Magic" Queries
&lt;/h4&gt;

&lt;p&gt;You write all of the queries yourself, so you can make them as efficient as your SQL knowledge allows.&lt;/p&gt;

&lt;h4&gt;
  
  
  No Auto Migration System Failures
&lt;/h4&gt;

&lt;p&gt;You perform the database migrations yourself before &lt;code&gt;sqlc&lt;/code&gt; is ever called; it cannot cause migration errors if it does not perform migrations.&lt;/p&gt;

&lt;h4&gt;
  
  
  Minimal Code Generation
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;sqlc&lt;/code&gt; generates 3 (minimal) files total as part of its tooling, regardless of the database schema or number/complexity of queries.&lt;/p&gt;

&lt;p&gt;For all of these reasons above I am officially making the switch from using &lt;code&gt;entgo&lt;/code&gt; to &lt;code&gt;sqlc&lt;/code&gt;. If there is interest I can write about that migration in more detail, let me know what you think about that in the comments.&lt;/p&gt;

&lt;p&gt;I hope this article has you considering that an ORM like &lt;code&gt;entgo&lt;/code&gt; might not actually be the faster or easier approach to adding SQL queries to your application, and that sometimes a simpler approach can be just as good, if not better.&lt;/p&gt;

</description>
      <category>go</category>
      <category>database</category>
      <category>programming</category>
      <category>postgres</category>
    </item>
  </channel>
</rss>
