<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nick Schmidt</title>
    <description>The latest articles on DEV Community by Nick Schmidt (@ngschmidt).</description>
    <link>https://dev.to/ngschmidt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F347506%2F2a9799f0-45c8-4a6b-8e2f-7560fc902686.jpeg</url>
      <title>DEV Community: Nick Schmidt</title>
      <link>https://dev.to/ngschmidt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ngschmidt"/>
    <language>en</language>
    <item>
      <title>Starting an IaC Repository with GitHub and Terraform</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sat, 13 Dec 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/starting-an-iac-repository-with-github-and-terraform-54bk</link>
      <guid>https://dev.to/ngschmidt/starting-an-iac-repository-with-github-and-terraform-54bk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;There are only two hard things in Computer Science: cache invalidation, naming things, and off by one errors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Loosely attributed to Phil Karlton&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's start off with a bit of a hot take - Terraform isn't particularly hard to learn. It does use unique configuration languages, but most people don't struggle with learning the code.&lt;/p&gt;

&lt;p&gt;Infrastructure-as-Code (IaC) isn't about the programming language - &lt;em&gt;it's about establishing a body of discipline around managing infrastructure&lt;/em&gt;. Tools like Ansible and Terraform simply facilitate the practice.&lt;/p&gt;

&lt;p&gt;Instead of focusing on some programmatically elegant tricks here, let's try to focus on how to build a "starter kit" of sorts to build upon this practice. The &lt;em&gt;managed resources&lt;/em&gt; in this example will be intentionally simple to shift focus to the structure, naming, and release management aspects of Infrastructure-as-Code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66swsmobt38z2st5lefz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66swsmobt38z2st5lefz.png" alt="IaC Starter Kit" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Repositories (Structure and Naming)
&lt;/h2&gt;

&lt;p&gt;Start a GitHub repository with some basic documentation before contributing code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;README.md&lt;/code&gt; should describe what the project is for and outline the project structure: how the software works.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;USAGE.md&lt;/code&gt; should describe how to consume resources within the project and how release management works.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CONTRIBUTING.md&lt;/code&gt; should describe how to contribute to the codebase: the branch and merge workflows and rules of conduct go here.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CHANGELOG.md&lt;/code&gt; should be created based on the &lt;a href="https://keepachangelog.com/en/1.0.0/" rel="noopener noreferrer"&gt;Keep a Changelog&lt;/a&gt; standard.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.gitignore&lt;/code&gt; should ensure that temporary files created by tooling, such as &lt;code&gt;__pycache__&lt;/code&gt; directories and Terraform lock files, don't accidentally get committed to the repository&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;markdownlint.json&lt;/code&gt; and any other linting rules - automated code QC is a good thing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;img/&lt;/code&gt; should be created to contain rendered images for documentation. Use illustrations to make the repository easy to understand!&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dwg/&lt;/code&gt; should be created to contain unrendered diagrams, e.g. &lt;code&gt;svg&lt;/code&gt;, &lt;code&gt;d2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;doc/&lt;/code&gt; may be created for any automatically rendered documentation, e.g. ReadTheDocs&lt;/li&gt;
&lt;/ul&gt;
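&lt;p&gt;As a sketch, a &lt;code&gt;.gitignore&lt;/code&gt; along these lines covers the byproducts mentioned above (these entries are common conventions, not a definitive list - adjust to the tools actually in the repository):&lt;/p&gt;

```gitignore
# Python byproducts and virtual environments
__pycache__/
*.pyc
venv/
.venv/

# Terraform working directories and state
.terraform/
*.tfstate
*.tfstate.backup

# Some teams also ignore .terraform.lock.hcl, though
# committing it pins provider versions for reproducibility
```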

&lt;p&gt;Once these are created, start mapping out what loose structures should be included in the repository. Here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;conf.d/&lt;/code&gt; for any flat file configurations that may get deployed

&lt;ul&gt;
&lt;li&gt;Make subdirectories for any machine targets&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;roles/&lt;/code&gt; for any Ansible roles. Since this is IaC, breaking the work into roles instead of one giant playbook keeps things simpler

&lt;ul&gt;
&lt;li&gt;Within each &lt;code&gt;role&lt;/code&gt;:
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;templates/&lt;/code&gt; should contain any Jinja2 templates. Ansible will auto-detect this folder by name, and it simplifies structure quite a bit.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;requirements.txt&lt;/code&gt; should contain any software prerequisites for the Ansible playbooks. This facilitates CI/CD tooling with virtual environments, in addition to better documenting software dependencies.&lt;/li&gt;
&lt;li&gt;Playbooks and truth files, of course&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;terraform/&lt;/code&gt; for any Terraform code

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;modules/&lt;/code&gt; for any Terraform re-usable modules&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;accounts/&lt;/code&gt; for any Terraform tenants, e.g. AWS Accounts, CloudFlare accounts, or other unrelated resources to keep them separate and organized&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;python/&lt;/code&gt; for any Python code&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;js/&lt;/code&gt; for any JavaScript&lt;/li&gt;

&lt;li&gt;...and so on.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Now that the raw structure is somewhat laid out, we can shift focus to the structure of each Terraform account subdirectory (&lt;code&gt;/terraform/accounts/{{ account_type }}_{{ account_id }}_{{ account_name }}&lt;/code&gt;). Here's what I've seen lead to a maintainable code base:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/terraform/accounts/cloudflare_12345_engyak_co&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;templates/&lt;/code&gt; for any &lt;code&gt;gotmpl&lt;/code&gt; templates&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;provider.tf&lt;/code&gt; should declare any Terraform pre-requisites, e.g. the Cloudflare provider minimum version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vars.tf&lt;/code&gt; should declare any input variables. In my experience, this is a good place for module inputs, but not as useful for actual infrastructure declarations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;locals.tf&lt;/code&gt; should declare any Don't Repeat Yourself (DRY) variables. I typically use them for consistent resource names and IDs. There are a lot of opinions about &lt;code&gt;vars&lt;/code&gt; versus &lt;code&gt;locals&lt;/code&gt;, but there are a few key differences:
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vars&lt;/code&gt; should actually be variable (non-static multiples of a &lt;code&gt;resource&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;locals&lt;/code&gt; can render and iterate on an input, e.g. with &lt;code&gt;for_each&lt;/code&gt; loops&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backend.tf&lt;/code&gt; should indicate where &lt;code&gt;terraform.tfstate&lt;/code&gt; is placed and how state locking is handled. Normally, this points to an S3 bucket and provides authorization for it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;data.tf&lt;/code&gt; should have any external data resources. This example doesn't need any, but AWS IAM policy documents and S3 bucket policies fit this category. Any resource prefixed with &lt;code&gt;data&lt;/code&gt; instead of &lt;code&gt;resource&lt;/code&gt; goes here, essentially&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
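&lt;p&gt;To illustrate the &lt;code&gt;vars&lt;/code&gt;/&lt;code&gt;locals&lt;/code&gt; split above, here's a hedged sketch - the &lt;code&gt;dns_records&lt;/code&gt; variable and &lt;code&gt;qualified_records&lt;/code&gt; local are invented for this example:&lt;/p&gt;

```hcl
# vars.tf - a true input: callers supply non-static multiples of a resource
variable "dns_records" {
  type = map(string)
  default = {
    blog = "blog-engyak-co.pages.dev"
  }
}

# locals.tf - DRY rendering: derive consistent names by iterating on the input
locals {
  zone_name = "engyak.co"
  qualified_records = {
    for name, target in var.dns_records : "${name}.${local.zone_name}" => target
  }
}
```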

&lt;p&gt;Now that all that's out of the way, we're able to &lt;em&gt;actually create resources&lt;/em&gt;. Things can be a lot more free-form here, because the definition of &lt;em&gt;related resources&lt;/em&gt; can vary greatly based on who's doing the work.&lt;/p&gt;

&lt;p&gt;My personal preference is to maintain small, easily readable files that function independently wherever possible. In this example, we'll use one file for each DNS zone. Here's &lt;code&gt;/terraform/accounts/cloudflare_youwish_engyak_co/engyak.co.tf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 1resource "cloudflare_record" "engyak_co_blog" {
 2 content = "blog-engyak-co.pages.dev"
 3 name = "blog"
 4 proxied = false
 5 ttl = 1
 6 type = "CNAME"
 7 zone_id = "redacted"
 8}
 9
10resource "cloudflare_record" "engyak_co_root" {
11 content = "blog-engyak-co.pages.dev"
12 name = "engyak.co"
13 proxied = true
14 ttl = 1
15 type = "CNAME"
16 zone_id = "redacted"
17}
18
19resource "cloudflare_record" "engyak_co_uri_blog" {
20 name = "engyak.co"
21 priority = 1
22 proxied = false
23 ttl = 1
24 type = "URI"
25 zone_id = "redacted"
26 data {
27 target = "blog.engyak.co"
28 weight = 1
29 }
30}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These &lt;code&gt;resource&lt;/code&gt;s are built according to the &lt;code&gt;provider&lt;/code&gt; in &lt;code&gt;provider.tf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 1terraform {
 2 required_providers {
 3 cloudflare = {
 4 source = "cloudflare/cloudflare"
 5 version = "~&amp;gt; 4"
 6 }
 7 }
 8}
 9
10provider "cloudflare" {
11}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always consult the &lt;code&gt;provider&lt;/code&gt;'s documentation on how to use their &lt;code&gt;resource&lt;/code&gt;s.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actions (Release Management)
&lt;/h2&gt;

&lt;p&gt;The biggest advantage a Git repository has for Infrastructure-as-Code is its versioning capability, but the ability to control the release of changes can really take things to the next level.&lt;/p&gt;

&lt;p&gt;First, I'd recommend starting out with a &lt;em&gt;branch management plan&lt;/em&gt;. It can start simple, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't allow any commits directly to &lt;code&gt;main&lt;/code&gt; (GitHub branch protection rules, plus general guidance in &lt;code&gt;CONTRIBUTING.md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Only allow code to be pushed to &lt;code&gt;main&lt;/code&gt; via a successful pull request (GitHub branch protection rules do this as well)

&lt;ul&gt;
&lt;li&gt;At least 1 approving peer review&lt;/li&gt;
&lt;li&gt;All testing must &lt;strong&gt;PASS&lt;/strong&gt; (more on this later)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;All prospective changes must start as a diverging branch (or fork, but forking is &lt;em&gt;much&lt;/em&gt; more advanced) that is &lt;strong&gt;up-to-date&lt;/strong&gt; with &lt;code&gt;main&lt;/code&gt;
&lt;/li&gt;

&lt;li&gt;Outline appropriate change windows, if applicable&lt;/li&gt;

&lt;/ul&gt;
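&lt;p&gt;The branch plan itself can be encoded as IaC. Here's a sketch using the &lt;code&gt;integrations/github&lt;/code&gt; Terraform provider's &lt;code&gt;github_branch_protection&lt;/code&gt; resource - verify attribute names against the provider version you pin, and note the repository name is hypothetical:&lt;/p&gt;

```hcl
resource "github_branch_protection" "main" {
  repository_id = "iac-starter-kit" # hypothetical repository
  pattern       = "main"

  # At least one approving peer review before merge
  required_pull_request_reviews {
    required_approving_review_count = 1
  }

  # Branch must be up-to-date with main and all checks must pass
  required_status_checks {
    strict   = true
    contexts = ["Terraform Plan"]
  }
}
```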

&lt;p&gt;At this point, the rules are in place, but none of it actually controls release. GitHub doesn't have credentials to release changes; ideally no users should either. The objective here is to &lt;strong&gt;prevent all direct changes to infrastructure&lt;/strong&gt;. This can be achieved with AWS IAM roles, Cloudflare RBAC, or an equivalent. Take away the keys!&lt;/p&gt;

&lt;p&gt;GitHub Actions provides a (usually free or cheap) amnesic container service to run ephemeral code from source control. This is going to be the foundation for this example moving forward, but other providers like GitLab and Atlassian have equivalents as well. If the source control provider doesn't have a built-in service, plenty of other CI tools exist to fill that gap, like Jenkins and Concourse.&lt;/p&gt;

&lt;p&gt;For a Terraform pipeline, there should be two Actions per account:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;terraform plan&lt;/code&gt;: This will test your code for validity, and also explain any potential impacts the change might have&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;terraform apply&lt;/code&gt;: This will implement tested changes. This Action &lt;em&gt;should&lt;/em&gt; be restricted to the &lt;code&gt;main&lt;/code&gt; branch!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's an example &lt;code&gt;plan&lt;/code&gt; Action. I named it &lt;code&gt;{{ event trigger }}: {{ provider }} {{ action }}&lt;/code&gt; to keep things organized.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
name: 'On-Commit: Cloudflare Terraform Plan'

on:
  push:

permissions:
  contents: read

jobs:
  plan:
    name: 'Terraform Plan'
    env:
      CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: 'Terraform Setup'
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: '&amp;gt;= 1.10.5'
      - name: 'Terraform Plan'
        run: |
          terraform init
          terraform validate
          terraform plan -input=false
        working-directory: terraform/accounts/cloudflare_youwish_engyak_co/
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Here's a rundown on how the testing works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use the &lt;code&gt;env&lt;/code&gt; directive to expose &lt;code&gt;CLOUDFLARE_API_TOKEN&lt;/code&gt; (specified in the &lt;code&gt;cloudflare&lt;/code&gt; provider as the way to pass credentials)&lt;/li&gt;
&lt;li&gt;We use &lt;code&gt;actions/checkout@v4&lt;/code&gt; (or latest version) to load a copy of &lt;code&gt;main&lt;/code&gt; into the Actions runner&lt;/li&gt;
&lt;li&gt;We use &lt;code&gt;hashicorp/setup-terraform@v3&lt;/code&gt;. Previous Actions runners shipped with Terraform, but the base image didn't update this package frequently enough. Now it doesn't ship with the image - but this tool lets us restrict and control software versions as part of the pipeline. This lets us slow releases if breaking changes occur with &lt;code&gt;terraform&lt;/code&gt; without having to monkey around with internals - it's a much better system.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Terraform Plan&lt;/code&gt; step is where most of the work gets done. We initialize Terraform in &lt;strong&gt;non-interactive mode&lt;/strong&gt; (&lt;code&gt;-input=false&lt;/code&gt;) using our workspace with the &lt;code&gt;working-directory&lt;/code&gt; key.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will now run every time code is committed to the repository, and it'll display any expected changes every time code is contributed. If it fails, it will produce an error and (ideally) notify engineers/developers on where to fix it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note: &lt;code&gt;terraform validate&lt;/code&gt; and &lt;code&gt;terraform plan&lt;/code&gt; do not catch all problems - they only test configuration validity. Resource conflicts and API idiosyncrasies will pass this step and only surface on &lt;code&gt;apply&lt;/code&gt;!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, we can finally start releasing changes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
name: 'Cron-Demand: Cloudflare Terraform Apply'

on:
  workflow_dispatch:
    branches: ['main']
  schedule:
    - cron: "15 4,5 * * *"

permissions:
  contents: read

jobs:
  plan:
    name: 'Terraform Plan'
    env:
      CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: 'Terraform Setup'
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: '&amp;gt;= 1.10.5'
      - name: 'Terraform Plan'
        id: tf_plan
        run: |
          terraform init
          terraform validate
          terraform plan -input=false --detailed-exitcode
        continue-on-error: true
        working-directory: terraform/accounts/cloudflare_youwish_engyak_co/
      - name: 'Terraform Apply'
        run: |
          terraform apply -input=false -auto-approve
        working-directory: terraform/accounts/cloudflare_youwish_engyak_co/
        if: github.ref == 'refs/heads/main' &amp;amp;&amp;amp; steps.tf_plan.outputs.exitcode == 2
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This Action runs either daily at 0415 and 0515 UTC or when executed manually. We've established a "change window", and quite a few more complexities were added to this workflow to implement change safety:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;detailed-exitcode&lt;/code&gt; and &lt;code&gt;id: tf_plan&lt;/code&gt; allow us to "catch" the results of &lt;code&gt;terraform plan&lt;/code&gt;. A return code of &lt;code&gt;0&lt;/code&gt; means no changes required, and &lt;code&gt;2&lt;/code&gt; means changes are required.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;if:&lt;/code&gt; conditionals restrict the dangerous parts of the workflow to &lt;strong&gt;only&lt;/strong&gt; execute when the branch is &lt;code&gt;main&lt;/code&gt; and &lt;code&gt;plan&lt;/code&gt; is valid and expects changes.&lt;/li&gt;
&lt;/ul&gt;
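&lt;p&gt;The branching on &lt;code&gt;plan&lt;/code&gt;'s exit code can be sketched in plain shell - here &lt;code&gt;plan&lt;/code&gt; is a hypothetical stand-in for &lt;code&gt;terraform plan -detailed-exitcode&lt;/code&gt; so the control flow is visible without real infrastructure:&lt;/p&gt;

```shell
# Stand-in for `terraform plan -detailed-exitcode`; the real command
# exits 0 (no changes), 1 (error), or 2 (changes pending)
plan() { return 2; }

plan
case $? in
  0) echo "no changes - skip apply" ;;
  1) echo "plan failed - fix before merging" ;;
  2) echo "changes pending - apply may proceed from main" ;;
esac
```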

&lt;h2&gt;
  
  
  Terraform Starter Kit
&lt;/h2&gt;

&lt;p&gt;This template should act as a foundational "starter kit" for establishing an effective, robust, mature Infrastructure-as-Code practice. I've found that it's easier to modify and improve an existing process than to start anew - the objective here is to get engineers past that "writer's block."&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>github</category>
      <category>tutorial</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Visualize and Report Ansible with OpenTelemetry and Syslog</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sun, 23 Nov 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/visualize-and-report-ansible-with-opentelemetry-and-syslog-4ef2</link>
      <guid>https://dev.to/ngschmidt/visualize-and-report-ansible-with-opentelemetry-and-syslog-4ef2</guid>
      <description>&lt;p&gt;Ansible is a fantastic tool to manage fleets of machines, but it's difficult to provide effective reporting when the fleet massively scales. Imagine hundreds of lines like this; try to find the one that failed (and why):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1PLAY RECAP *********************************************************************
2dev.lab.engyak.net : ok=6 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0   

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...it's not difficult to read, but it doesn't surface which entries deserve individual attention. It's possible to create Jinja reports that are more executive-friendly, but they're still focused on individual executions.&lt;/p&gt;

&lt;p&gt;Ansible &lt;a href="https://docs.ansible.com/projects/ansible/latest/plugins/callback.html" rel="noopener noreferrer"&gt;callback plugins&lt;/a&gt; provide us a framework to aggregate and analyze information about playbook execution without compromising idempotency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of Callback Plugins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;aggregate&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;aggregate&lt;/code&gt; callback plugins modify the summary at the end of a task's output. They don't appear to impact recap, and don't have many useful examples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.ansible.com/projects/ansible/latest/collections/callback_index_aggregate.html" rel="noopener noreferrer"&gt;Aggregate Plugin list&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;stdout&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;stdout&lt;/code&gt; callback plugins modify the continual output presented as Ansible completes work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1TASK [Update Apt!] *************************************************************
2ok: [dev.lab.engyak.net]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the fun begins! Note that only one plugin for &lt;code&gt;stdout&lt;/code&gt; can be selected for a given playbook.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using &lt;code&gt;stdout&lt;/code&gt; callbacks
&lt;/h4&gt;

&lt;p&gt;The process for &lt;a href="https://docs.ansible.com/projects/ansible/latest/reference_appendices/config.html#ansible-configuration-settings" rel="noopener noreferrer"&gt;enabling callback plugins in &lt;code&gt;ansible.cfg&lt;/code&gt;&lt;/a&gt; is documented in the Ansible configuration settings reference. Since this is executed from an environment (GitHub Actions), I prefer leveraging environment injection instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ANSIBLE_CALLBACK_RESULT_FORMAT&lt;/code&gt; controls how data from individual tasks is printed to the screen; this is up to preference. I prefer &lt;code&gt;yaml&lt;/code&gt;, and recommend playing with this setting to see what works best for you.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ANSIBLE_PYTHON_INTERPRETER&lt;/code&gt; silences any chatter about the discovered Python interpreter. Since this is a consistent environment without any tight coupling to specific releases, I don't feel the need to pin one, and I don't want to see the messages.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ANSIBLE_STDOUT_CALLBACK&lt;/code&gt; (the &lt;code&gt;DEFAULT_STDOUT_CALLBACK&lt;/code&gt; setting) will let you set the &lt;code&gt;stdout&lt;/code&gt; callback plugin&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In GitHub Actions, you can use the &lt;code&gt;env&lt;/code&gt; key to manipulate outputs without having to change any code. I'm also &lt;a href="https://blog.engyak.co/2024/04/patching/" rel="noopener noreferrer"&gt;integrating Netbox into this pipeline&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 1jobs:
 2 build:
 3 name: 'Manage Lab Configurations'
 4 runs-on: self-hosted
 5 env:
 6 ANSIBLE_PYTHON_INTERPRETER: 'auto_silent'
 7 ANSIBLE_STDOUT_CALLBACK: 'default'
 8 ANSIBLE_CALLBACK_RESULT_FORMAT: 'yaml'
 9 NETBOX_TOKEN: ${{ secrets.NETBOX_TOKEN }}
10 NETBOX_API: ${{ vars.NETBOX_URL }}
11 steps:
12 - uses: actions/checkout@v4
13 - name: Execute Ansible Management Playbook
14 run: |
15 python3 -m venv .
16 source bin/activate
17 python3 -m pip install --upgrade pip
18 python3 -m pip install -r requirements.txt
19 ansible-inventory -i local.netbox.netbox.nb_inventory.yml --graph
20 ansible-playbook -i local.netbox.netbox.nb_inventory.yml lab-management.yml          

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For reference purposes, I've added all compatible fields here. The &lt;code&gt;yaml&lt;/code&gt; results format is considerably more compact given the character limit per line.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dense&lt;/code&gt; seems to be a popular callback; it uses colorization to generate play output and tries to place everything on one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1task 1.task 1: ns2.lab.engyak.nettask 1: ns2.lab.engyak.net ns.lab.engyak.nettask 2.task 2: ns2.lab.engyak.nettask 2: ns2.lab.engyak.net ns.lab.engyak.nettask 3.task 3: ns2.lab.engyak.nettask 3: ns2.lab.engyak.net ns.lab.engyak.nettask 4.task 4: ns.lab.engyak.nettask 4: ns.lab.engyak.net ns2.lab.engyak.nettask 5.task 5: ns2.lab.engyak.nettask 5: ns2.lab.engyak.net ns.lab.engyak.nettask 6.task 6: ns2.lab.engyak.nettask 6: ns2.lab.engyak.nettask 6: ns2.lab.engyak.nettask 6: ns2.lab.engyak.net ns.lab.engyak.nettask 6: ns2.lab.engyak.net ns.lab.engyak.nettask 6: ns2.lab.engyak.net ns.lab.engyak.nettask 7.task 7: ns.lab.engyak.nettask 7: ns.lab.engyak.net ns2.lab.engyak.nettask 7: ns.lab.engyak.net ns2.lab.engyak.nettask 7: ns.lab.engyak.net ns2.lab.engyak.nettask 7: ns.lab.engyak.net ns2.lab.engyak.nettask 7: ns.lab.engyak.net ns2.lab.engyak.nettask 8.task 8: ns2.lab.engyak.nettask 8: ns2.lab.engyak.net ns.lab.engyak.nettask 9.task 9: ns2.lab.engyak.nettask 9: ns2.lab.engyak.net ns.lab.engyak.nettask 10.task 10: ns2.lab.engyak.nettask 10: ns2.lab.engyak.net ns.lab.engyak.net

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's definitely compact, but not super readable. &lt;code&gt;oneline&lt;/code&gt; is probably the best non-default plugin of the bunch, but it's much more verbose than the default one. It also displays a lot of system-specific information, so no snippet here.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;notification&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is where things get really good for those of us with execution environments! &lt;code&gt;notification&lt;/code&gt; callback plugins send data to external systems when a play finishes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Directing results to OpenTelemetry
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; is a truly neat open standard for exchanging "trace information" between systems.&lt;/p&gt;

&lt;p&gt;This is incredibly useful, but also difficult to explain in a way that's clear without providing concrete examples. Essentially, OpenTelemetry-based traces allow debugging systems that do not all exist in the same software package, and it offers a timeline for each step. As it happens, Ansible's callback plugin is well-architected and a good example of the value that a trace can have, even from an application perspective.&lt;/p&gt;

&lt;p&gt;First, we'll need to assemble an OpenTelemetry-compliant platform to stream Ansible results to. I've selected &lt;a href="https://www.jaegertracing.io/docs/2.12/getting-started/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt; for this purpose. It has an all-in-one quickstart function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1docker run --rm --name jaeger \
2 -p 16686:16686 \
3 -p 4317:4317 \
4 -p 4318:4318 \
5 -p 5778:5778 \
6 -p 9411:9411 \
7 cr.jaegertracing.io/jaegertracing/jaeger:2.12.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it's running, we need to instruct Ansible to forward data. This is achievable exclusively with environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 env:
2 ANSIBLE_CALLBACKS_ENABLED: 'community.general.opentelemetry'
3 ANSIBLE_OPENTELEMETRY_ENABLE_FROM_ENVIRONMENT: 'ANSIBLE_OPENTELEMETRY_ENABLED'
4 ANSIBLE_OPENTELEMETRY_ENABLED: 'true'
5 OTEL_EXPORTER_OTLP_ENDPOINT: 'http://jaeger.lab.engyak.net:4317'
6 OTEL_EXPORTER_INSECURE: 'true'
7 OTEL_SERVICE_NAME: 'ansible'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to these variables, the module requires the following additions to &lt;code&gt;requirements.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1opentelemetry-sdk
2opentelemetry-exporter-otlp-proto-grpc
3opentelemetry-exporter-otlp-proto-http

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once these changes are applied - with &lt;em&gt;no other changes to the Ansible code itself&lt;/em&gt; - all subsequent runs submit OTLP traces to Jaeger. It looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly6oa48wnylkgpwb0igz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly6oa48wnylkgpwb0igz.png" alt="Jaeger UI #1" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkv3xxp8ux3m16p1ky70j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkv3xxp8ux3m16p1ky70j.png" alt="Jaeger UI #2" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This provides a comprehensive "drill down" for every step taken by Ansible, and I've honestly never seen this level of detail before. Every single programmatic step is logged with a timestamp, allowing an engineer to find out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which node took too long&lt;/li&gt;
&lt;li&gt;Which step slowed things down the most&lt;/li&gt;
&lt;li&gt;Whether that matches the baseline for other nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a transactional application this has to be even more useful.&lt;/p&gt;

&lt;h4&gt;
  
  
  Directing Results to Syslog
&lt;/h4&gt;

&lt;p&gt;Now, for something quite a bit more boring (but equally important). If OpenTelemetry is a microscope, Syslog is the 10,000 foot view. This can also be set up by CI, and should run in parallel with OpenTelemetry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 env:
2 ANSIBLE_CALLBACKS_ENABLED: 'community.general.opentelemetry,community.general.syslog_json'
3 ANSIBLE_OPENTELEMETRY_ENABLE_FROM_ENVIRONMENT: 'ANSIBLE_OPENTELEMETRY_ENABLED'
4 ANSIBLE_OPENTELEMETRY_ENABLED: 'true'
5 OTEL_EXPORTER_OTLP_ENDPOINT: 'http://jaeger.lab.engyak.net:4317'
6 OTEL_EXPORTER_INSECURE: 'true'
7 OTEL_SERVICE_NAME: 'ansible'
8 SYSLOG_PORT: '54514'
9 SYSLOG_SERVER: '127.0.0.1'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of these callback plugins serves a different purpose. Syslog callbacks provide a shorter summary as JSON, which can easily be dashboarded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1&amp;lt;14&amp;gt;1 2025-11-30T07:41:00-09:00 10.66.1.143 gh-runner2 - - - ansible-command: task execution OK; host: ns.lab.engyak.net; message: {"changed": false, "checksum": "a46e7011b00c560dddcc193ef16f01fd2d05970e", "dest": "/etc/unbound/unbound.conf", "gid": 0, "group": "root", "mode": "0640", "owner": "root", "path": "/etc/unbound/unbound.conf", "size": 4531, "state": "file", "uid": 0}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some example conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"changed": true&lt;/code&gt; would indicate how many modifications were made per hostname (identified by &lt;code&gt;host: ns.lab.engyak.net&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;!= 'task execution OK'&lt;/code&gt; would search for job failures&lt;/li&gt;
&lt;/ul&gt;
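&lt;p&gt;As a sketch of the failure check, the second condition above can be expressed in shell - the sample line paraphrases the &lt;code&gt;syslog_json&lt;/code&gt; output shown earlier:&lt;/p&gt;

```shell
# Paraphrased sample line from community.general.syslog_json
line='ansible-command: task execution OK; host: ns.lab.engyak.net; message: {"changed": false}'

# Flag anything that is not "task execution OK" for follow-up
case "$line" in
  *"task execution OK"*) echo "ok" ;;
  *) echo "needs attention" ;;
esac
```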

&lt;h2&gt;
  
  
  Modernizing the Monitoring Stack
&lt;/h2&gt;

&lt;p&gt;Ansible, despite being an infrastructure tool, provides a good example of the different types of modern monitoring. Thematically, these concepts &lt;strong&gt;should&lt;/strong&gt; be applied to actual applications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traces are an excellent tool for identifying software process bottlenecks. Any tool with long-running jobs can benefit from tracing. Traces are computationally costly, though, so reserve them for tools where performance degradation truly matters.&lt;/li&gt;
&lt;li&gt;Syslog is the "swiss army knife" of monitoring. It's the best tool for simple events, and can be the foundation for event-driven programming.&lt;/li&gt;
&lt;li&gt;Metrics allow infrastructure engineers to "just send the important bits" via tools like &lt;code&gt;protobuf&lt;/code&gt;, sort of like SNMP but better. In the network realm, this is where Model-Driven Telemetry reigns supreme, and in the application stack Prometheus is a popular option.&lt;/li&gt;
&lt;/ul&gt;
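&lt;p&gt;To make the metrics bullet concrete, here's a minimal hand-rolled sketch of the Prometheus text exposition format (the metric name and label are illustrative assumptions; real jobs would use a client library like &lt;code&gt;prometheus_client&lt;/code&gt;):&lt;/p&gt;

```python
# Render one counter in the Prometheus text exposition format by hand.
# Metric name, help text, and labels are illustrative assumptions.
def render_counter(name, help_text, samples):
    lines = ["# HELP {} {}".format(name, help_text), "# TYPE {} counter".format(name)]
    for labels, value in samples.items():
        label_str = ",".join('{}="{}"'.format(k, v) for k, v in labels)
        lines.append("{}{{{}}} {}".format(name, label_str, value))
    return "\n".join(lines) + "\n"

exposition = render_counter(
    "ansible_changed_tasks_total",
    "Number of Ansible tasks that reported a change",
    {(("host", "ns.lab.engyak.net"),): 3},
)
print(exposition)
```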

&lt;p&gt;One thing I did find interesting - Grafana + Alloy allowed the unification of all of these data types. Here's a preview of what Jaeger in Grafana looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimmva4qy9kj8nxigic39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimmva4qy9kj8nxigic39.png" alt="Grafana Preview" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>automation</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Automate DNS Zone Generation and Deployment with Ansible and Netbox</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sun, 10 Nov 2024 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/automate-dns-zone-generation-and-deployment-with-ansible-and-netbox-1pke</link>
      <guid>https://dev.to/ngschmidt/automate-dns-zone-generation-and-deployment-with-ansible-and-netbox-1pke</guid>
      <description>&lt;p&gt;In a &lt;a href="https://blog.engyak.co/2024/01/dns-automation/" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;, I covered a method to automatically generate DNS zones from an embedded YAML list.&lt;/p&gt;

&lt;p&gt;This wasn't all that useful on its own, only ensuring that forward and reverse DNS entries match each other (you'd be shocked by how many places they don't!) - we still need a good way to simplify DNS administration with tooling less expensive than, say, Infoblox.&lt;/p&gt;

&lt;p&gt;This isn't to say that Infoblox is bad, but a fully loaded Infoblox license is a little pricy for home labs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;First, let's illustrate a potential design:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.engyak.co%2F2024%2F11%2Fdns-e2e%2Fpattern.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.engyak.co%2F2024%2F11%2Fdns-e2e%2Fpattern.svg" alt="Solution Diagram" width="733" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;To do this, we need a good way to pull &lt;code&gt;pre-filtered data&lt;/code&gt; for Ansible to work with, and Netbox has a GraphQL API (&lt;code&gt;/graphql/&lt;/code&gt;) that's perfect for the job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  ip_address_list(filters: {dns_name: {i_contains: "example.net"}, family: 4}) {
    dns_name
    address
  }
}
{
  ip_address_list(filters: {dns_name: {i_contains: "example.net"}, family: 6}) {
    dns_name
    address
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us separate result sets for the IPv4 and IPv6 addresses attached to a given zone - and we can assemble them without any postprocessing in Ansible.&lt;/p&gt;

&lt;p&gt;Netbox's GraphQL sandbox produces the following data output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "data": {
    "ip_address_list": [
      {
        "dns_name": "ns.example.net",
        "address": "1.1.1.1/32"
      }
    ]
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sandbox, however, is a &lt;em&gt;graphical interface&lt;/em&gt; - to consume the data programmatically, we need to package the GraphQL payload in JSON. Here's an Ansible task that does just that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: "Try Fetching `lab.engyak.net` IPv4 GraphQL!"
  ansible.builtin.uri:
    url: "https://netbox/graphql/"
    method: POST
    body:
      query: "query { ip_address_list(filters: {dns_name: {i_contains: \"example.com\"}, family: 4}) { dns_name address }}"
    body_format: "json"
    headers:
      Authorization: "Token {{ lookup('ansible.builtin.env', 'NETBOX_TOKEN') }}"
      Content-Type: "application/json"
      Accept: "application/json"
    validate_certs: false
  register: result_example_net_v4

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There aren't any &lt;code&gt;pynetbox&lt;/code&gt; based modules that automate this into Ansible, so here we're using the &lt;code&gt;ansible.builtin.uri&lt;/code&gt; module (also known as the &lt;strong&gt;Jack of All Trades&lt;/strong&gt; module) to pull JSON data. It also uses the environment variable &lt;code&gt;NETBOX_TOKEN&lt;/code&gt;, which must be exposed by secrets management / CI processes.&lt;/p&gt;
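&lt;p&gt;Outside of Ansible, the same packaging can be sketched in plain Python (the endpoint and &lt;code&gt;NETBOX_TOKEN&lt;/code&gt; environment variable mirror the task above; the HTTP call itself is left as a comment):&lt;/p&gt;

```python
import json
import os

NETBOX_URL = "https://netbox/graphql/"  # same endpoint as the uri task above

def build_graphql_request(zone, family):
    """Package the GraphQL query as a JSON POST body, mirroring the uri task."""
    query = (
        "query { ip_address_list(filters: "
        + '{{dns_name: {{i_contains: "{}"}}, family: {}}}) '.format(zone, family)
        + "{ dns_name address }}"
    )
    headers = {
        "Authorization": "Token " + os.environ.get("NETBOX_TOKEN", ""),
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    body = json.dumps({"query": query}).encode()
    return headers, body

headers, body = build_graphql_request("example.net", 4)
# To actually send it, hand NETBOX_URL, body, and headers to an HTTP client,
# e.g. urllib.request.Request(NETBOX_URL, data=body, headers=headers).
print(json.loads(body)["query"])
```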

&lt;p&gt;In this case, I'm pulling IPv4 and IPv6 records separately. Jinja doesn't know the difference between record types, so I cheat on postprocessing and let GraphQL do all the heavy lifting. IPv6 is the same, but with &lt;code&gt;family: 6&lt;/code&gt;/&lt;code&gt;result_example_net_v6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The next step is to build Jinja templates to define the zonefiles. I created them in a previous post, but will include all of them in a Gist at the end of this post. They need to be modified to process output from GraphQL, because we don't control any of the field names with it.&lt;/p&gt;

&lt;p&gt;The Jinja templates used in this example are unique to Ansible. Netbox returns addresses in &lt;code&gt;{{ address }}/{{ cidr }}&lt;/code&gt; notation, which is compact and efficient but doesn't work as an A record target - the custom filter &lt;code&gt;ansible.utils.ipaddr&lt;/code&gt; handles the conversion. Invocations like &lt;code&gt;|ansible.utils.ipaddr('address')&lt;/code&gt; or &lt;code&gt;|ansible.utils.ipaddr('revdns')&lt;/code&gt; are particularly useful here.&lt;/p&gt;
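&lt;p&gt;For illustration, the same two conversions can be approximated with Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module (an equivalent sketch, not the Ansible filter itself):&lt;/p&gt;

```python
import ipaddress

# Netbox returns addresses in address/prefix notation.
netbox_value = "1.1.1.1/32"

# Rough equivalent of |ansible.utils.ipaddr('address'): strip the prefix length.
address = str(ipaddress.ip_interface(netbox_value).ip)

# Rough equivalent of |ansible.utils.ipaddr('revdns'): build the PTR name.
revdns = ipaddress.ip_address(address).reverse_pointer

print(address)  # 1.1.1.1
print(revdns)   # 1.1.1.1.in-addr.arpa
```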

&lt;p&gt;Finally, it's good to sanity-test the resulting zonefiles; a check for this is included in the Gist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrospective
&lt;/h2&gt;

&lt;p&gt;Netbox's GraphQL API is a really effective tool for aggregating pre-filtered data and driving automation processes. I was quite impressed that I could just ask an API endpoint for this nice and tidy report, already pre-formatted for me!&lt;/p&gt;

&lt;p&gt;Lack of field and format control is an issue with GraphQL (you're stuck with whatever data structure the application architect has in store for you) - but Ansible and Jinja2 empower you to present the back-end data in any front-end manner you prefer (in my case, as DNS data loaded into an Unbound instance).&lt;/p&gt;

&lt;p&gt;Nearly any business reporting process can be driven from Netbox in this fashion, as long as the resulting format can be Jinjafied. Here are some ideas on how this can be used further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Report on Circuits per &lt;code&gt;Region&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Report on IT-Managed assets in a given &lt;code&gt;Site&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Report on how many &lt;code&gt;Site&lt;/code&gt;s have IPv6 coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Gist
&lt;/h2&gt;

&lt;p&gt;As promised, here's the raw code I created to automate DNS zonefile management from Netbox:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@import url('https://cdn.rawgit.com/lonekorean/gist-syntax-themes/d49b91b3/stylesheets/idle-fingers.css');

@import url('https://fonts.googleapis.com/css?family=Open+Sans');
body {
  font: 16px 'Open Sans', sans-serif;
}
body .gist .gist-file {
  border-color: #555 #555 #444
}
body .gist .gist-data {
  border-color: #555
}
body .gist .gist-meta {
  color: #ffffff;
  background: #373737; 
}
body .gist .gist-meta a {
  color: #ffffff
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://gist.github.com/ngschmidt/33ce644c3873d1fe3e82f91378eaa2fc" rel="noopener noreferrer"&gt;GitHub Link&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>VM Deployment Pipelines with Proxmox</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sat, 31 Aug 2024 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/vm-deployment-pipelines-with-proxmox-ihm</link>
      <guid>https://dev.to/ngschmidt/vm-deployment-pipelines-with-proxmox-ihm</guid>
      <description>&lt;p&gt;Decoupled approaches to deployment of IaaS workloads are the way of the future.&lt;/p&gt;

&lt;p&gt;Here, we'll try to construct a VM deployment pipeline leveraging GitHub Actions and Ansible's community modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proxmox Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not featured here&lt;/strong&gt; : Loading a VM ISO is particular to the Proxmox deployment, but it's necessary for future steps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's create a VM named &lt;code&gt;deb12.6-template&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="proxmox-1.png"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuaemcz7paksgkpskifml.png" alt="First creation screen" width="721" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I set a separate VM ID range for templates to keep them visually sorted apart from regular VMs.&lt;/p&gt;

&lt;p&gt;&lt;a href="proxmox-2.png"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k63308bmzmox2iij92y.png" alt="Second creation screen" width="720" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="proxmox-3.png"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cmuruvgb1vbrvg1cbo7.png" alt="Third creation screen" width="716" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: Paravirtualized hardware is still the optimal choice, as with vSphere - but here, &lt;code&gt;VirtIO&lt;/code&gt; supplies the drivers.&lt;/p&gt;

&lt;p&gt;&lt;a href="proxmox-4.png"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxyn31w1eoj59u1nwpul.png" alt="Fourth creation screen" width="719" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: SSD Emulation and &lt;code&gt;qemu-agent&lt;/code&gt; are required for virtual disk reclamation with QEMU. This is particularly important in my lab.&lt;/p&gt;

&lt;p&gt;&lt;a href="proxmox-5.png"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k4d7uumtu18b18skxdk.png" alt="Fifth creation screen" width="717" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this installation, I'm using paravirtualized network adapters and have separated my management (&lt;code&gt;vmbr0&lt;/code&gt;) and data plane (&lt;code&gt;vmbr1&lt;/code&gt;) networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debian Linux Setup
&lt;/h2&gt;

&lt;p&gt;I'll skip the Linux installer parts for brevity; Debian's installer is excellent and easy to use.&lt;/p&gt;

&lt;p&gt;At a high level, we'll want to do some preparatory steps before declaring this a usable base image:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create users

&lt;ul&gt;
&lt;li&gt;Recommended approach: Create a bootstrap user, then shred it&lt;/li&gt;
&lt;li&gt;Leave the &lt;code&gt;bootstrap&lt;/code&gt; user with an SSH key on the base image&lt;/li&gt;
&lt;li&gt;After creation, build a &lt;code&gt;takeover&lt;/code&gt; playbook that installs the latest and greatest username table, &lt;code&gt;sssd&lt;/code&gt;, SSH keys, APM, anything with confidential cryptographic material that should not be left unencrypted on the hypervisor&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;This won't slow the VM deployment speed by as much as you think&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Install packages

&lt;ul&gt;
&lt;li&gt;This is just a list of some basics that I prefer to add to each machine. It's more network-centric; anything more comprehensive should be part of a build playbook specific to whatever's being deployed.&lt;/li&gt;
&lt;li&gt;Note: This is an Ansible playbook, and therefore, it needs Ansible to run (&lt;code&gt;apt install ansible&lt;/code&gt;)
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Debian&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;machine&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prep"&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;localhost&lt;/span&gt;
  &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Install&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;standard&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;packages"&lt;/span&gt;
    &lt;span class="na"&gt;ansible.builtin.apt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;pkg&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;curl'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dnsutils'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;diffutils'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ethtool'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;git'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mtr'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;net-tools'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;netcat-traditional'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python3-requests'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python3-jinja2'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tcpdump'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;telnet'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;traceroute'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;qemu-guest-agent'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vim'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wget'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Clean up the disk. This will make the base image more compact - each clone inherits any wasted space, so consider it a 10-20x savings in disk usage. I leave this as a file on the base image and name it &lt;code&gt;reset_vm.sh&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="c"&gt;# Clean Apt&lt;/span&gt;
apt clean

&lt;span class="c"&gt;# Cleaning logs.&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/log/audit/audit.log &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /dev/null &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /var/log/audit/audit.log
&lt;span class="k"&gt;fi
if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/log/wtmp &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /dev/null &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /var/log/wtmp
&lt;span class="k"&gt;fi
if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/log/lastlog &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /dev/null &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /var/log/lastlog
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Cleaning udev rules.&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /etc/udev/rules.d/70-persistent-net.rules &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; /etc/udev/rules.d/70-persistent-net.rules
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Cleaning the /tmp directories&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /tmp/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/tmp/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# Cleaning the SSH host keys&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /etc/ssh/ssh_host_&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# Cleaning the machine-id&lt;/span&gt;
&lt;span class="nb"&gt;truncate&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; 0 /etc/machine-id
&lt;span class="nb"&gt;rm&lt;/span&gt; /var/lib/dbus/machine-id
&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /etc/machine-id /var/lib/dbus/machine-id

&lt;span class="c"&gt;# Cleaning the shell history&lt;/span&gt;
&lt;span class="nb"&gt;unset &lt;/span&gt;HISTFILE
&lt;span class="nb"&gt;history&lt;/span&gt; &lt;span class="nt"&gt;-cw&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ~/.bash_history
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-fr&lt;/span&gt; /root/.bash_history

&lt;span class="c"&gt;# Truncating hostname, hosts, resolv.conf and setting hostname to localhost&lt;/span&gt;
&lt;span class="nb"&gt;truncate&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; 0 /etc/&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt;,hosts,resolv.conf&lt;span class="o"&gt;}&lt;/span&gt;
hostnamectl set-hostname localhost

&lt;span class="c"&gt;# Clean cloud-init - deprecated because cloud-init isn't currently used&lt;/span&gt;
&lt;span class="c"&gt;# cloud-init clean -s -l&lt;/span&gt;

&lt;span class="c"&gt;# Force a filesystem sync&lt;/span&gt;
&lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shut down the virtual machine. I prefer to start it back up and shut it down from the hypervisor to ensure that &lt;code&gt;qemu-guest-agent&lt;/code&gt; is working properly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Pipeline
&lt;/h2&gt;

&lt;p&gt;First, we will want to create an API token under "Datacenter -&amp;gt; Permissions -&amp;gt; API Tokens":&lt;/p&gt;

&lt;p&gt;&lt;a href="proxmox-6.png"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl485gv6d5ydv1zqypcls.png" alt="Proxmox API token screen" width="600" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are some oddities with the &lt;code&gt;proxmoxer&lt;/code&gt;-based Ansible modules to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;api_user&lt;/code&gt; is needed and used by the API client, formatted as &lt;code&gt;{{ user }}@domain&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;api_token_id&lt;/code&gt; is not the same as the output from the command - it's what you put into the "Token ID" field.

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;{{ api_user}}!{{ api_token_id }}&lt;/code&gt; should form the combined credential presented to the API, and match the created token.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;If you attempt to use the output from the API creation screen under &lt;code&gt;api_user&lt;/code&gt; or &lt;code&gt;api_token_id&lt;/code&gt;, it'll return a &lt;code&gt;401 Invalid user&lt;/code&gt; without much explanation as to what might be the issue.&lt;/p&gt;
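&lt;p&gt;As a sketch of how these pieces combine (the user, token ID, and secret below are placeholders, and the header shape is how Proxmox's HTTP API commonly accepts tokens - verify against your version's docs):&lt;/p&gt;

```python
# Placeholder credentials - substitute your own values.
api_user = "automation@pam"   # {{ user }}@domain, as noted above
api_token_id = "ci-runner"    # the "Token ID" field, not the creation output
api_secret = "00000000-0000-0000-0000-000000000000"

# {{ api_user }}!{{ api_token_id }} forms the combined credential.
combined = "{}!{}".format(api_user, api_token_id)

# Proxmox's API accepts the token via an Authorization header of this shape.
auth_header = "PVEAPIToken={}={}".format(combined, api_secret)

print(combined)  # automation@pam!ci-runner
```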

&lt;p&gt;Here's the pipeline. GitHub's primary job is to set up the Python/Ansible environment and translate the workflow inputs into something Ansible can properly digest.&lt;/p&gt;

&lt;p&gt;I also added some &lt;code&gt;cat&lt;/code&gt; steps - this allows us to use the GitHub Actions log to store intent until Netbox registration completes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;On-Demand:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Proxmox"&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;machine_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Machine&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Name"&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;examplename"&lt;/span&gt;
      &lt;span class="na"&gt;machine_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ID&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(can't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;re-use)"&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Template&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Name"&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;choice&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deb12.6-template&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deb12.6-template"&lt;/span&gt;
      &lt;span class="na"&gt;hardware_cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vCPU&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Count"&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
      &lt;span class="na"&gt;hardware_memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Memory&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Allocation&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;MB)"&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512"&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;self-hosted&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Variable YAML File&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;cat &amp;lt;&amp;lt;EOF &amp;gt; roles/proxmox_kvm/parameters.yaml&lt;/span&gt;
          &lt;span class="s"&gt;---&lt;/span&gt;
            &lt;span class="s"&gt;vm_data:&lt;/span&gt;
              &lt;span class="s"&gt;name: "${{ github.event.inputs.machine_name }}"&lt;/span&gt;
              &lt;span class="s"&gt;id: ${{ github.event.inputs.machine_id }}&lt;/span&gt;
              &lt;span class="s"&gt;template: "${{ github.event.inputs.template }}"&lt;/span&gt;
              &lt;span class="s"&gt;node: node&lt;/span&gt;
              &lt;span class="s"&gt;hardware:&lt;/span&gt;
                &lt;span class="s"&gt;cpus: ${{ github.event.inputs.hardware_cpus }}&lt;/span&gt;
                &lt;span class="s"&gt;memory: ${{ github.event.inputs.hardware_memory }}&lt;/span&gt;
                &lt;span class="s"&gt;storage: ssd-tier&lt;/span&gt;
                &lt;span class="s"&gt;format: qcow2&lt;/span&gt;
          &lt;span class="s"&gt;EOF&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build VM&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;cd roles/proxmox_kvm/&lt;/span&gt;
          &lt;span class="s"&gt;cat parameters.yaml&lt;/span&gt;
          &lt;span class="s"&gt;python3 -m venv .&lt;/span&gt;
          &lt;span class="s"&gt;source bin/activate&lt;/span&gt;
          &lt;span class="s"&gt;python3 -m pip install --upgrade pip&lt;/span&gt;
          &lt;span class="s"&gt;python3 -m pip install -r requirements.txt&lt;/span&gt;
          &lt;span class="s"&gt;python3 --version&lt;/span&gt;
          &lt;span class="s"&gt;ansible --version&lt;/span&gt;

          &lt;span class="s"&gt;export PAPIUSER="${{ secrets.PAPIUSER }}"&lt;/span&gt;
          &lt;span class="s"&gt;export PAPI_TOKEN="${{ secrets.PAPI_TOKEN }}"&lt;/span&gt;
          &lt;span class="s"&gt;export PAPI_SECRET="${{ secrets.PAPI_SECRET }}"&lt;/span&gt;
          &lt;span class="s"&gt;export PHOSTNAME="${{ secrets.PHOSTNAME }}"&lt;/span&gt;
          &lt;span class="s"&gt;export NETBOX_TOKEN="${{ secrets.NETBOX_TOKEN }}"&lt;/span&gt;
          &lt;span class="s"&gt;export NETBOX_URL="${{ secrets.NETBOX_URL }}"&lt;/span&gt;
          &lt;span class="s"&gt;export NETBOX_CLUSTER="${{ secrets.NETBOX_CLUSTER_PROX }}"&lt;/span&gt;
          &lt;span class="s"&gt;ansible-playbook build_vm_prox.yml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition, GitHub needs a &lt;code&gt;requirements.txt&lt;/code&gt; to set up the &lt;code&gt;venv&lt;/code&gt;; it belongs in the role folder (&lt;code&gt;roles/proxmox_kvm&lt;/code&gt; as above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;###### Requirements without Version Specifiers ######
pytz
netaddr
django
jinja2
requests
pynetbox

###### Requirements with Version Specifiers ######
ansible &amp;gt;= 8.4.0              # Mostly just don't use old Ansible (e.g. v2, v3)
proxmoxer &amp;gt;= 2.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Ansible playbook also integrates with Netbox, as my vSphere workflow did, and uses a common schema to simplify code re-use. There are a few Proxmox quirks to work around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There's no module to fetch VM guest network information, but the API exposes it through the QEMU guest agent, so I retrieve it with &lt;code&gt;ansible.builtin.uri&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Proxmox has a nasty habit of breaking Ansible with JSON keys that include &lt;code&gt;-&lt;/code&gt;, which Jinja2 parses as subtraction in dotted access. A simple fix is a debug task that registers the rewritten data: &lt;code&gt;{{ prox_network_result.json.data | replace('-','_') }}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Proxmox's VM clone needs a timeout configured, and reports completion before the VM is actually ready for further actions. I added an &lt;code&gt;ansible.builtin.pause&lt;/code&gt; step before starting the VM, and another after (to allow it to boot)
&lt;/li&gt;
&lt;/ul&gt;
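Fixed pauses work, but where boot timing varies, polling is sturdier. This is a hedged sketch, not what the playbook below does; the `proxmox_vms`/`status` return fields are assumptions and should be verified against the `community.general` collection docs for your version:

```yaml
# Alternative to a fixed pause: poll until the VM reports "running".
# Assumed fields: proxmox_vms (return list), status (per-VM state).
- name: "Wait until the VM reports running"
  community.general.proxmox_vm_info:
    api_host: '{{ lookup("env", "PHOSTNAME") }}'
    api_user: '{{ lookup("env", "PAPIUSER") }}'
    api_token_id: '{{ lookup("env", "PAPI_TOKEN") }}'
    api_token_secret: '{{ lookup("env", "PAPI_SECRET") }}'
    vmid: '{{ vm_data.id }}'
  register: vm_poll
  until: vm_poll.proxmox_vms | selectattr("status", "equalto", "running") | list | length > 0
  retries: 12
  delay: 10
```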
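The hyphen quirk can be reproduced outside Ansible. This is a minimal sketch with hypothetical sample data (not the real API response), showing why hyphenated JSON keys trip up Jinja2 dotted access and what the `replace('-','_')` trick effectively does to the serialized result:

```python
# Minimal sketch (hypothetical sample data): why hyphenated JSON keys
# break Jinja2 dotted access, and what rewriting '-' to '_' does.
import json

# Shape loosely mirrors Proxmox's agent/network-get-interfaces response.
raw = '{"result": [{"hardware-address": "aa:bb:cc:dd:ee:ff"}]}'

# In a Jinja2 expression, `item.hardware-address` parses as a
# subtraction, so the key must be rewritten before dotted access works.
fixed = json.loads(raw.replace("-", "_"))
print(fixed["result"][0]["hardware_address"])  # aa:bb:cc:dd:ee:ff
```

One caveat: a blanket string replace also rewrites hyphens inside values; that happens to be harmless for MAC and IP data, but it's worth knowing.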

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Proxmox"&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;localhost&lt;/span&gt;
  &lt;span class="na"&gt;gather_facts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="c1"&gt;# Before executing ensure that the prerequisites are installed&lt;/span&gt;
  &lt;span class="c1"&gt;# `ansible-galaxy collection install netbox.netbox`&lt;/span&gt;
  &lt;span class="c1"&gt;# `python3 -m pip install aiohttp pynetbox`&lt;/span&gt;
  &lt;span class="c1"&gt;# We start with a pre-check playbook, if it fails, we don't want to&lt;/span&gt;
  &lt;span class="c1"&gt;# make changes&lt;/span&gt;
  &lt;span class="na"&gt;any_errors_fatal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;vars_files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters.yaml"&lt;/span&gt;

  &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Debug"&lt;/span&gt;
      &lt;span class="na"&gt;ansible.builtin.debug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;msg&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;connectivity&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;authentication"&lt;/span&gt;
      &lt;span class="na"&gt;community.general.proxmox_node_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;api_host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PHOSTNAME")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPIUSER")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_token_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPI_TOKEN")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_token_secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPI_SECRET")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
      &lt;span class="na"&gt;register&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prox_node_result&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Display&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Node&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Data"&lt;/span&gt;
      &lt;span class="na"&gt;ansible.builtin.debug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;msg&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prox_node_result&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM"&lt;/span&gt;
      &lt;span class="na"&gt;community.general.proxmox_kvm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;api_host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PHOSTNAME")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPIUSER")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_token_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPI_TOKEN")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_token_secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPI_SECRET")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.node&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.hardware.storage&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;newid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.id&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;clone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.template&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.hardware.format&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
        &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;present&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wait&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fully&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;register"&lt;/span&gt;
      &lt;span class="na"&gt;ansible.builtin.pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM"&lt;/span&gt;
      &lt;span class="na"&gt;community.general.proxmox_kvm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;api_host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PHOSTNAME")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPIUSER")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_token_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPI_TOKEN")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_token_secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPI_SECRET")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;started&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wait&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fully&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;boot"&lt;/span&gt;
      &lt;span class="na"&gt;ansible.builtin.pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;45&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;information"&lt;/span&gt;
      &lt;span class="na"&gt;community.general.proxmox_vm_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;api_host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PHOSTNAME")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPIUSER")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_token_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPI_TOKEN")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;api_token_secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPI_SECRET")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;vmid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.id&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
      &lt;span class="na"&gt;register&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prox_vm_result&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Report&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM!"&lt;/span&gt;
      &lt;span class="na"&gt;ansible.builtin.debug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;var&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prox_vm_result&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Networking&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;information"&lt;/span&gt;
      &lt;span class="na"&gt;ansible.builtin.uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PHOSTNAME")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}:8006/api2/json/nodes/{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.node&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}/qemu/{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.id&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}/agent/network-get-interfaces'&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GET'&lt;/span&gt;
        &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Content-Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json'&lt;/span&gt;
          &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PVEAPIToken={{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPIUSER")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}!{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPI_TOKEN")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}={{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"PAPI_SECRET")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;validate_certs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;register&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prox_network_result&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Network&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Information"&lt;/span&gt;
      &lt;span class="na"&gt;ansible.builtin.debug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;msg&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prox_network_result.json.data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;replace('-','_')&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
      &lt;span class="na"&gt;register&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prox_network_result_modified&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Register&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Netbox!"&lt;/span&gt;
      &lt;span class="na"&gt;netbox.netbox.netbox_virtual_machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;netbox_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"NETBOX_TOKEN")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;netbox_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"NETBOX_URL")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;validate_certs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"NETBOX_CLUSTER")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Built&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;GH&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Actions&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Pipeline!'&lt;/span&gt;
          &lt;span class="na"&gt;local_context_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prox_vm_result&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.hardware.memory&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
          &lt;span class="na"&gt;vcpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.hardware.cpus&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Configure&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Interface&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Netbox!"&lt;/span&gt;
      &lt;span class="na"&gt;netbox.netbox.netbox_vm_interface&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;netbox_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"NETBOX_TOKEN")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;netbox_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"NETBOX_URL")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;validate_certs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}_intf_{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;item.hardware_address&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;replace(":",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;safe&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
          &lt;span class="na"&gt;virtual_machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
          &lt;span class="na"&gt;vrf&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Campus'&lt;/span&gt;
          &lt;span class="na"&gt;mac_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;item.hardware_address&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
      &lt;span class="na"&gt;with_items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prox_network_result_modified.msg.result&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
      &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;item.hardware_address != '00:00:00:00:00:00'&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reserve&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;IP"&lt;/span&gt;
      &lt;span class="na"&gt;netbox.netbox.netbox_ip_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;netbox_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"NETBOX_TOKEN")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;netbox_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"NETBOX_URL")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;validate_certs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;item.ip_addresses[0].ip_address&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}/{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;item.ip_addresses[0].prefix&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
          &lt;span class="na"&gt;vrf&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Campus'&lt;/span&gt;
          &lt;span class="na"&gt;assigned_object&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;virtual_machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;present&lt;/span&gt;
      &lt;span class="na"&gt;with_items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prox_network_result_modified.msg.result&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
      &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;item.hardware_address != '00:00:00:00:00:00'&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Finalize&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Netbox!"&lt;/span&gt;
      &lt;span class="na"&gt;netbox.netbox.netbox_virtual_machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;netbox_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"NETBOX_TOKEN")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;netbox_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"NETBOX_URL")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
        &lt;span class="na"&gt;validate_certs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lookup("env",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"NETBOX_CLUSTER")&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
          &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lab_debian_machines'&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lab_linux_machines'&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lab_apt_updates'&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vm_data.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
          &lt;span class="na"&gt;primary_ip4&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;item.ip_addresses[0].ip_address&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}/{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;item.ip_addresses[0].prefix&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
            &lt;span class="na"&gt;vrf&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Campus"&lt;/span&gt;
      &lt;span class="na"&gt;with_items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prox_network_result_modified.msg.result&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
      &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;item.hardware_address != '00:00:00:00:00:00'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Overall, the Proxmox API/playbooks are quite a bit simpler to use than the VMware ones. The &lt;code&gt;proxmoxer&lt;/code&gt;-based modules are relatively feature-complete compared to &lt;code&gt;vmware_rest&lt;/code&gt;, and for the few gaps I did find (examples not in this post), I could always fall back on Ansible's comprehensive Linux foundation. It's a refreshing change.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Starting from scratch with Netbox IPAM</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sat, 11 May 2024 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/starting-from-scratch-with-netbox-ipam-4afj</link>
      <guid>https://dev.to/ngschmidt/starting-from-scratch-with-netbox-ipam-4afj</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Spreadsheets are not an adequate method to manage IP addressing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Different IP design strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  IPv4
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Bogons, and the basics
&lt;/h3&gt;

&lt;p&gt;There are a number of valid and invalid prefixes for use internally within an enterprise. Here's a list of &lt;em&gt;invalid&lt;/em&gt; prefixes in the global routing table; of those, the RFC 1918 prefixes are available for use:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prefix&lt;/th&gt;
&lt;th&gt;RFC&lt;/th&gt;
&lt;th&gt;Usable Internally?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.0.0.0/8&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1122#section-3.2.1.3"&gt;1122&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;🤪 Everybody more or less agreed not to use it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10.0.0.0/8&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1918"&gt;1918&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;✅ Use this block for &lt;em&gt;large&lt;/em&gt; prefix allocations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100.64.0.0/10&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc6598"&gt;6598&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;🙄 CG-NAT, can &lt;em&gt;technically&lt;/em&gt; be used, but will break in random cloud applications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;127.0.0.0/8&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1122#section-3.2.1.3"&gt;1122&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌ loopback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;169.254.0.0/16&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc3927"&gt;3927&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;✅, but don't allocate it (APIPA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;172.16.0.0/12&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1918"&gt;1918&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;✅ Use this block for &lt;em&gt;medium&lt;/em&gt; prefix allocations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.0.0.0/24&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc5736"&gt;5736&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌ IETF skunkworks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.0.2.0/24&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc5737"&gt;5737&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌ Documentation (TEST-NET-1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.88.99.0/24&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc3068"&gt;3068&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌ 6to4 relays&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.0.0/16&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1918"&gt;1918&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;✅ Avoid this block for &lt;em&gt;enterprises&lt;/em&gt;, it'll collide with home networks when people use VPN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;198.18.0.0/15&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc2544"&gt;2544&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌ device benchmarking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;198.51.100.0/24&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc5737"&gt;5737&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌ Documentation (TEST-NET-2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;203.0.113.0/24&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc5737"&gt;5737&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌ Documentation (TEST-NET-3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;224.0.0.0/4&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc3171"&gt;3171&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌ multicast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;240.0.0.0/4&lt;/td&gt;
&lt;td&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1112#section-4"&gt;1122&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;🤯 madlad play, might work, might not. Linux seems to live in this space just fine&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All of these prefixes &lt;em&gt;must&lt;/em&gt; be dropped at any network perimeter (e.g. firewalls, extranet routers) to prevent internal traffic or misconfigured NATs from leaking. Dropping them also prevents protocol abuse, which is a cheap and easy way to improve security.&lt;/p&gt;

&lt;p&gt;In multi-site networks, dropping &lt;strong&gt;all&lt;/strong&gt; of these prefixes would be wise - an ethernet loop + APIPA can turn a switching issue into a network-wide outage pretty easily. &lt;a href="https://en.wikipedia.org/wiki/Longest_prefix_match"&gt;Longest Prefix Match&lt;/a&gt; can ensure that any allocated networks remain reachable.&lt;/p&gt;
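&lt;p&gt;As a quick sketch (Python standard library, example addresses only - not part of the original post), a perimeter-style check against the table above might look like:&lt;/p&gt;

```python
import ipaddress

# The "invalid in the global routing table" prefixes from the table above.
BOGONS = [ipaddress.ip_network(p) for p in (
    "0.0.0.0/8", "10.0.0.0/8", "100.64.0.0/10", "127.0.0.0/8",
    "169.254.0.0/16", "172.16.0.0/12", "192.0.0.0/24", "192.0.2.0/24",
    "192.88.99.0/24", "192.168.0.0/16", "198.18.0.0/15",
    "198.51.100.0/24", "203.0.113.0/24", "224.0.0.0/4", "240.0.0.0/4",
)]

def is_bogon(addr):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BOGONS)

print(is_bogon("100.100.1.1"))  # True  (CG-NAT space)
print(is_bogon("8.8.8.8"))      # False
```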

&lt;p&gt;Each of these prefixes should be created in Netbox, so you can use it as a reference later. I'd recommend &lt;em&gt;tagging&lt;/em&gt; them with some form of hint to indicate usability, e.g.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;IP:Usable&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;IP:Unusable&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you get more familiar with the API/search, tag-based filters become incredibly handy.&lt;/p&gt;
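&lt;p&gt;For example, Netbox exposes tags as a simple query parameter on its REST API (the hostname and tag slug below are hypothetical):&lt;/p&gt;

```python
from urllib.parse import urlencode

# Hypothetical Netbox hostname; the tag slug mirrors the tagging scheme above.
base = "https://netbox.example.net/api/ipam/prefixes/"
query = urlencode({"tag": "ip-usable"})
print(base + "?" + query)  # https://netbox.example.net/api/ipam/prefixes/?tag=ip-usable
```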

&lt;h3&gt;
  
  
  IPv6
&lt;/h3&gt;

&lt;p&gt;IPv6 is quite easy. All valid &lt;em&gt;routable&lt;/em&gt; addresses fall under one allocated prefix:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;2000::/3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This means that you can implement a "default route" that won't accidentally leak bogons like in IPv4, but with a much simpler approach. Instead of implementing &lt;code&gt;::0/0&lt;/code&gt; for your default route, use &lt;code&gt;2000::/3&lt;/code&gt;.&lt;/p&gt;
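&lt;p&gt;A quick standard-library sketch (example addresses only) shows why &lt;code&gt;2000::/3&lt;/code&gt; covers everything routable and nothing else:&lt;/p&gt;

```python
import ipaddress

# All routable IPv6 unicast space lives under 2000::/3; ULA, link-local,
# and multicast all fall outside it.
ROUTABLE = ipaddress.ip_network("2000::/3")

def is_routable(addr):
    return ipaddress.ip_address(addr) in ROUTABLE

print(is_routable("2001:db8::1"))  # True
print(is_routable("fd00::1"))      # False (ULA)
print(is_routable("fe80::1"))      # False (link-local)
```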

&lt;p&gt;If you insist on using private addressing, I'd encourage a thorough review of why - but this is the prefix available:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fc00::/7&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Link-local addressing, or addressing that is "always on" regardless of prefix allocation, also gets a dedicated prefix. This removes the need for a &lt;em&gt;bunch&lt;/em&gt; of little helper protocols that simply don't need to exist or become standardized. Traffic like Router Advertisements (RA), routing protocols, and First Hop Redundancy Protocols has a distinct source address that can be pinged even before a network is online.&lt;/p&gt;

&lt;p&gt;It's also incredibly handy when bootstrapping new devices! All that's required is some form of helper on the default gateway to act as an SSH proxy and some neighbor discovery, and you suddenly have always-on remote management.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fe80::/10&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Multicast also has its own prefix:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ff00::/8&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is much simpler, but where to &lt;em&gt;get&lt;/em&gt; IPv6 addressing can be more complex. If it's in a lab environment and doesn't need internet access, &lt;code&gt;fc00::/7&lt;/code&gt; is just fine to use.&lt;/p&gt;

&lt;p&gt;The recommended method for acquiring an IPv6 prefix is to request it with &lt;a href="https://www.isc.org/blogs/dhcpv6-prefix-length-mode/"&gt;DHCP-PD&lt;/a&gt; or to request it through a &lt;a href="https://tunnelbroker.net/"&gt;tunnel broker&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There's one more "gotcha" to keep in mind with IPv6 - weird stuff breaks if you go with a longer prefix than /64. I'd strongly encourage avoiding cutesy CIDR block allocations like /120 or /65; that's an IPv4 solution to a problem IPv6 doesn't have. Just request enough IP addressing for your site instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Constructing an IP hierarchy
&lt;/h2&gt;

&lt;p&gt;For the purposes of this post, we're going to use the following language to describe a &lt;em&gt;network address&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1{{ prefix }}{{ subprefix }}{{ host bits }}/{{ prefix length }}
2 10.99. 100. 0 /24

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The major first step here is to decide &lt;em&gt;how&lt;/em&gt; to break down your addressing. There are two major paths to follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Location&lt;/em&gt; based addressing is used when &lt;em&gt;prefix scale&lt;/em&gt; is a concern. If you need to summarize routes on routers due to RIB limits, this is the way to go. This can be for a few reasons:

&lt;ul&gt;
&lt;li&gt;"I have a lot of routes / lot of sites and am worried about RIB capacity in my hardware"&lt;/li&gt;
&lt;li&gt;Most enterprise equipment can handle 16-64k routes; if this is not enough, follow this approach&lt;/li&gt;
&lt;li&gt;ISPs will follow this path&lt;/li&gt;
&lt;li&gt;Cloud providers will follow this path&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Purpose&lt;/em&gt; based addressing is used when &lt;em&gt;perimeter security&lt;/em&gt; is a concern. Easy summarization to a common prefix per "network role" allows for straightforward firewall policy creation, including a number of microsegmentation tools that may have laughably low table capacities.

&lt;ul&gt;
&lt;li&gt;"I want to keep my workloads separated from each other"&lt;/li&gt;
&lt;li&gt;Financial services will follow this path&lt;/li&gt;
&lt;li&gt;Healthcare will follow this path&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Committing to one or the other before allocating blocks will simplify your life later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note: With IPv6, only the largest of organizations (ones that need more than 2^8 or 2^16 networks per site) will need to allocate their own top-level prefix. It's easier to just run DHCP-PD and ask for a /56 or /48.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Guidance on prefix allocation
&lt;/h2&gt;

&lt;p&gt;To assess the proper 1918/bogon prefix for use, first assess the number of prefixes you would need as a ceiling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1num_sites*num_network_roles

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attempt to select a prefix size that will fit this prefix count with a minimum of 80% buffer (leaving a reserve for point-to-point connections, etc.)&lt;/p&gt;
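&lt;p&gt;As a sketch of that sizing math - assuming "80% buffer" means at least 80% extra headroom on top of the raw count, which is one reading of the guidance above:&lt;/p&gt;

```python
import math

def parent_prefix_length(prefix_count, child_length=24, headroom=0.8):
    """Smallest parent prefix that fits prefix_count child networks
    plus at least 80% extra headroom (one reading of "buffer")."""
    needed = math.ceil(prefix_count * (1 + headroom))
    child_bits = math.ceil(math.log2(needed))
    return child_length - child_bits

# 5 sites * 6 network roles = 30 prefixes; with headroom that's 54 /24s,
# which needs a /18 (64 x /24).
print(parent_prefix_length(5 * 6))  # 18
```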

&lt;p&gt;I would highly encourage &lt;em&gt;not&lt;/em&gt; getting creative with CIDR prefix lengths in IPv4-land. If possible, try and stick to &lt;code&gt;/24&lt;/code&gt; for a subprefix. &lt;a href="https://datatracker.ietf.org/doc/html/rfc4862"&gt;IPv6 does not support prefix lengths longer than &lt;code&gt;/64&lt;/code&gt; particularly well&lt;/a&gt; (with specific exceptions for point-to-point, &lt;code&gt;/126&lt;/code&gt; or &lt;code&gt;/127&lt;/code&gt; depending on hardware), and using prefixes like &lt;code&gt;/65&lt;/code&gt; for access segments will lead to trouble with end devices like Android.&lt;/p&gt;

&lt;p&gt;It's much simpler to translate the &lt;code&gt;/24&lt;/code&gt; in question linearly to a &lt;code&gt;/64&lt;/code&gt; and use that calculation to estimate what IPv6 prefix size you want. It's &lt;strong&gt;also much simpler to troubleshoot and maintain if you don't build a pile of weird stuff, even if it makes you feel smart!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As a starting point, it's good to set up a set of standard t-shirt sizes for networks in &lt;code&gt;{{ IPv4 }}/{{ IPv6 }}&lt;/code&gt; format. Here's an example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large Site/Role: &lt;code&gt;/16&lt;/code&gt;/&lt;code&gt;/56&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Medium Site/Role: &lt;code&gt;/18&lt;/code&gt;/&lt;code&gt;/60&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Small Site/Role: &lt;code&gt;/22&lt;/code&gt;/&lt;code&gt;/62&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"Normal" subprefix: &lt;code&gt;/24&lt;/code&gt;/&lt;code&gt;/64&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"Small" subprefix: &lt;code&gt;/26&lt;/code&gt;/&lt;code&gt;/64&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Point-to-point: &lt;code&gt;/31&lt;/code&gt;/&lt;code&gt;/127&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note: Service providers don't always support weird DHCP-PD sizes, so options may be limited to the above.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note: Service providers are typically pretty generous with prefix allocations, and keep in mind a /56 is roughly equivalent to a /16. I'd recommend allocating a /56 per site in production or in the lab whenever permitted.&lt;/strong&gt;&lt;/p&gt;
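&lt;p&gt;The "roughly equivalent" claim is easy to verify - a &lt;code&gt;/56&lt;/code&gt; carves into 256 &lt;code&gt;/64&lt;/code&gt;s just as a &lt;code&gt;/16&lt;/code&gt; carves into 256 &lt;code&gt;/24&lt;/code&gt;s (documentation prefixes used as examples):&lt;/p&gt;

```python
import ipaddress

# Both carve into 256 "normal" subprefixes: a /56 into /64s, a /16 into /24s.
v6_count = sum(1 for _ in ipaddress.ip_network("2001:db8::/56").subnets(new_prefix=64))
v4_count = sum(1 for _ in ipaddress.ip_network("10.99.0.0/16").subnets(new_prefix=24))
print(v6_count, v4_count)  # 256 256
```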

&lt;h2&gt;
  
  
  Automating it
&lt;/h2&gt;

&lt;p&gt;Once you have sizes set, it's actually pretty easy to let go of your artisanal, hand-crafted prefixes and automate aggressively. With Netbox and Ansible, it's incredibly easy to leverage the &lt;a href="https://docs.ansible.com/ansible/latest/collections/netbox/netbox/netbox_prefix_module.html#ansible-collections-netbox-netbox-netbox-prefix-module"&gt;&lt;code&gt;netbox.netbox.netbox_prefix&lt;/code&gt; module&lt;/a&gt;. The following example will grab a &lt;code&gt;/24&lt;/code&gt; from &lt;code&gt;10.99.0.0/16&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: "Example Ansible Playbook"
  connection: local
  hosts: localhost
  gather_facts: False
  tasks:
    - name: "Get next available prefix"
      netbox.netbox.netbox_prefix:
        netbox_url: "{{ netbox_url }}"
        netbox_token: "{{ netbox_token }}"
        data:
          parent: "10.99.0.0/16"
          prefix_length: 24
        state: present
        first_available: yes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It's extremely rewarding to design, deploy, and automate an IP design in this manner - and you'll find that automation is considerably easier if &lt;em&gt;what&lt;/em&gt; to automate is well-defined.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Manage Linux patching with Ansible and Netbox!</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sun, 07 Apr 2024 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/manage-linux-patching-with-ansible-and-netbox-400d</link>
      <guid>https://dev.to/ngschmidt/manage-linux-patching-with-ansible-and-netbox-400d</guid>
      <description>&lt;h2&gt;
  
  
  Patching all of my random experiments took too much of my free time, so I automated it.
&lt;/h2&gt;

&lt;p&gt;This is a pretty cheesy thing to do, but over the years it became more and more time-consuming to maintain all the different deployed workloads and infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;As with all system design, it's best to consider all relevant needs ahead of time. Given that this is a home lab, I decided to adopt an approach that is intentionally aggressive, but theoretically viable in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nightly patching&lt;/li&gt;
&lt;li&gt;Nightly reboots&lt;/li&gt;
&lt;li&gt;No exempt packages&lt;/li&gt;
&lt;li&gt;Distribution-agnostic, it should patch multiple distributions at once&lt;/li&gt;
&lt;li&gt;This workflow should execute consistently from-code&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Iteration 1: Ansible with Jenkins
&lt;/h2&gt;

&lt;p&gt;The earliest implementation I built here had the least refinement by far. Here I tied Jenkins to an internal repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="//ansible_jenkins.svg"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VM1Y3Ipo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2024/04/patching/ansible_jenkins.svg" alt="Generation 1: Ansible + Jenkins" width="637" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To leverage this, I started out with an &lt;a href="https://docs.ansible.com/ansible/latest/collections/ansible/builtin/ini_inventory.html"&gt;INI inventory&lt;/a&gt;, but it quickly became problematic. I wanted a hierarchy, with each distribution potentially fitting multiple categories. This became pretty messy pretty quickly, so I moved to a &lt;a href="https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html"&gt;YAML Inventory&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;debian_machines:
  hosts:
    hostname_1:
      ansible_host: "1.1.1.1"
    hostname_2:
      ansible_host: "2.2.2.2"
ubuntu_machines:
  hosts:
    hostname_3:
      ansible_host: "3.3.3.3"
apt_updates:
  children:
    debian_machines:
    ubuntu_machines:
nameservers:
  hosts:
    hostname_2:
    hostname_3:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allowed me to simplify my playbooks and inventory by making "groups of groups", and avoid crazy stuff like taking down all nodes for an application at once. We'll use &lt;code&gt;nameservers:&lt;/code&gt; as an example here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
- name: "Reboot APT Machines, except DNS"
  hosts: apt_updates,!nameservers
  tasks:
    - name: "Ansible Self-Test!"
      ansible.builtin.ping:
    - name: "Reboot Apt Machines!"
      ansible.builtin.reboot:
- name: "Reboot nameservers"
  hosts: nameservers
  serial: 1
  tasks:
    - name: "Ansible Self-Test!"
      ansible.builtin.ping:
    - name: "Reboot nameservers serially!"
      ansible.builtin.reboot:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;serial: 1&lt;/code&gt; key instructs the Ansible controller to only execute this playbook on one machine at a time, so DNS continuity is preserved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrospective
&lt;/h3&gt;

&lt;p&gt;I had several issues with this approach, but to my surprise, Linux patching and actual Ansible issues haven't cropped up at all. With most mainstream distributions, the QC must be good enough to patch nightly like this.&lt;/p&gt;

&lt;p&gt;I did have issues with inventory management, however. To update the Ansible inventory, I could deploy as-code, which was nice, but it was still clunky. If I &lt;a href="https://blog.engyak.co/2023/01/why-automate-vm-deployment-with/"&gt;deployed 5 Alpine images in a day&lt;/a&gt;, I want them to automatically be added to my inventory for maximum laziness.&lt;/p&gt;

&lt;p&gt;I also quickly discovered that maintaining Jenkins was labor-intensive. It's a truly powerful engine, and great if you need all the extra features, but there aren't many low-friction ways to automate all the required maintenance, particularly around plugins. I was able to update Jenkins &lt;em&gt;itself&lt;/em&gt; with a package manager, but it seems like every few days I had to patch plugins (manually).&lt;/p&gt;

&lt;h2&gt;
  
  
  Iteration 2: Ansible, Netbox, GitHub Actions
&lt;/h2&gt;

&lt;p&gt;I'll be up-front - for parameterized builds, GitHub Actions &lt;em&gt;is less capable.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It has some pretty big upsides, however:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't have to maintain the GUI &lt;em&gt;at all&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Logging is excellent&lt;/li&gt;
&lt;li&gt;Integration with GitHub is excellent&lt;/li&gt;
&lt;li&gt;Pipelines are YAML defined in their own repository&lt;/li&gt;
&lt;li&gt;Status badges in Markdown (we don't need some stinkin' badges!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="//ansible_github_netbox.svg"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AAxYEFvB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2024/04/patching/ansible_github_netbox.svg" alt="GitHub Actions and Netbox" width="556" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This workflow has been much smoother to operate. Since the &lt;a href="https://blog.engyak.co/2023/09/vsphere-dayn/"&gt;deployment workflow already updates Netbox&lt;/a&gt;, all machines are added to the "maintenance loop" after first boot.&lt;/p&gt;

&lt;p&gt;I was really surprised at how little work was required to convert these CI pipelines. This was naive of me - &lt;strong&gt;ease of conversion is the entire point of CI pipelines&lt;/strong&gt;, but it's still mind-boggling to realize how effective it is at times.&lt;/p&gt;

&lt;p&gt;To make this work, I first needed to create a CI process in &lt;code&gt;.github/workflows&lt;/code&gt; on my Lab repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: "Nightly: @0100 Update Linux Machines"

on:
  schedule:
    - cron: "0 9 * * *" # GitHub Actions cron runs in UTC

permissions:
  contents: read

jobs:
  build:
    runs-on: self-hosted
    steps:
    - uses: actions/checkout@v3
    - name: Execute Ansible Nightly Job
      run: |
        python3 -m venv .
        source bin/activate
        python3 -m pip install --upgrade pip
        python3 -m pip install -r requirements.txt
        python3 --version
        ansible --version
        export NETBOX_TOKEN=${{ secrets.NETBOX_TOKEN }}
        export NETBOX_API=${{ vars.NETBOX_URL }}
        ansible-galaxy collection install netbox.netbox
        ansible-inventory -i local.netbox.netbox.nb_inventory.yml --graph
        ansible-playbook -i local.netbox.netbox.nb_inventory.yml lab-nightly.yml      
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This executes on a &lt;a href="https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners"&gt;GitHub Self-Hosted runner&lt;/a&gt; in my lab with a Python Virtual Environment. The workflow will run a clean build, every time - by wiping out the workspace prior to each execution. &lt;em&gt;No configuration artifacts are left behind.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With GitHub Actions, all workflows are listed alphabetically - you can't use folders or trees to keep them organized - so I developed a naming convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ workflow_type }}: {{ description}}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the Actions list sane.&lt;/p&gt;

&lt;p&gt;From there, we need a way to point to Netbox as an inventory source. This requires a few files:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;requirements.txt&lt;/em&gt; is the Python 3 pip requirements list - since everything runs in a virtual environment, only the Python packages in this list will be available.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
txt&lt;/p&gt;
&lt;h6&gt;
  
  
  Requirements without Version Specifiers
&lt;/h6&gt;

&lt;p&gt;pytz&lt;br&gt;
netaddr&lt;br&gt;
django&lt;br&gt;
jinja2&lt;br&gt;
requests&lt;br&gt;
pynetbox&lt;/p&gt;
&lt;h6&gt;
  
  
  Requirements with Version Specifiers
&lt;/h6&gt;

&lt;p&gt;ansible &amp;gt;= 8.4.0              # Mostly just don't use old Ansible (e.g. v2, v3)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The next step is to build an inventory file. This has to be named specifically for the plugin to work - `local.netbox.netbox.nb_inventory.yml`:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plugin: netbox.netbox.nb_inventory
validate_certs: False
config_context: True
group_by:
  - tags
device_query_filters:
  - has_primary_ip: 'true'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
- **Not featured here - The API Endpoint and API Token directives are handled by [GitHub Actions Secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions), and therefore don't need to be in this file.**

This file is pretty straightforward. It indicates that we should use Netbox tags to develop our inventory, and we can assign multiple tags in the Netbox application to each Virtual Machine. I also added the `has_primary_ip` directive - if a machine doesn't get an IP address for some reason, it won't try to reach that VM and patch it, causing late night failures.

Here's a preview of the Netbox application with these tags:

[![Netbox Preview](https://blog.engyak.co/2024/04/patching/netbox_preview.png)](netbox_preview.png)

Refactoring the Ansible playbooks was hilariously easy. The Netbox inventory plugin prepends the `group_by` field onto the group, so all I had to do in each playbook was prepend `tags_` to each name. Here's an example:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: "Apt Machines"
  hosts: tags_lab_apt_updates
  tasks:
    - name: "Ansible Self-Test!"
      ansible.builtin.ping:
    - name: "Update Apt!"
      ansible.builtin.apt:
        name: "*"
        state: latest
        update_cache: true
- name: "Apk Machines"
  hosts: tags_lab_apk_updates
  tasks:
    - name: "Ansible Self-Test!"
      ansible.builtin.ping:
    - name: "Update Apk!"
      community.general.apk:
        available: true
        upgrade: true
        update_cache: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


After that, the CI tooling just takes care of it all for me!

### Retrospective

I'm going to stick with this method for a while. Netbox tagging makes inventory management much more intuitive, and I can develop tag "pre-sets" in my deployment pipeline to correctly categorize all the _stuff_ I deploy. Since it's effectively documentation, I have an easy place to put data I'll need to find later for those "what was I thinking?" moments.

I'll be honest - behind that, I haven't really given it much thought. This approach requires zero attention to continue, and it happens while I sleep. I haven't gotten any problems from it, and it allows me to focus my free time on things that are more important.

10/10 would recommend.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
    </item>
    <item>
      <title>Abstracting DNS Record Management with Ansible and Jinja 2</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sat, 06 Jan 2024 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/abstracting-dns-record-management-with-ansible-and-jinja-2-7mf</link>
      <guid>https://dev.to/ngschmidt/abstracting-dns-record-management-with-ansible-and-jinja-2-7mf</guid>
      <description>&lt;p&gt;Synchronizing properly implemented DNS zones is, to put it lightly, a &lt;em&gt;real chore&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating &lt;em&gt;forward&lt;/em&gt; DNS entries, e.g. &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;AAAA&lt;/code&gt;, &lt;code&gt;CNAME&lt;/code&gt;. These names are used to resolve to resources.&lt;/li&gt;
&lt;li&gt;Creating &lt;em&gt;reverse&lt;/em&gt; DNS entries, e.g. &lt;code&gt;PTR&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Creating DNS entries that define the zone, e.g. &lt;code&gt;SOA&lt;/code&gt;, &lt;code&gt;NS&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a system to behave properly, your &lt;em&gt;forward&lt;/em&gt; and &lt;em&gt;reverse&lt;/em&gt; entries need to be identical, but software like BIND/Unbound relies on zonefiles that don't connect the two. Many information systems and DNS zones run asymptomatically for a time with improperly implemented &lt;em&gt;reverse&lt;/em&gt; DNS or partially implemented &lt;em&gt;forward&lt;/em&gt; DNS. Certain events (e.g. CA validation, discovery, implementing IPv6) can bring the problem to the forefront if ordinary network management practice doesn't.&lt;/p&gt;
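&lt;p&gt;For illustration, a properly paired forward and reverse entry looks like this in zonefile syntax (names and addresses are invented for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;; in the example.com zone
www             IN A    192.0.2.10

; in the 2.0.192.in-addr.arpa zone
10              IN PTR  www.example.com.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;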

&lt;p&gt;For this post, we'll first work on &lt;em&gt;abstracting&lt;/em&gt; the DNS zonefile - ensuring that a user can deploy zonefiles conformant to a standard - and then we'll illustrate how that can be used with Netbox to automatically populate DNS entries from Netbox.&lt;/p&gt;

&lt;p&gt;Abstracting the zonefile here will achieve a few goals - though the result is &lt;em&gt;guaranteed&lt;/em&gt; to be longer than if you simply managed the zone files by hand. Here's what the pipeline has to deliver in exchange:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This pipeline &lt;strong&gt;ABSOLUTELY MUST&lt;/strong&gt; establish forward and reverse records &lt;strong&gt;from the same data!&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;This pipeline &lt;strong&gt;must&lt;/strong&gt; test zonefiles, and avoid installing them if they aren't good (prevents outages)&lt;/li&gt;
&lt;li&gt;This pipeline &lt;strong&gt;must&lt;/strong&gt; establish documentation standards for a DNS zone (abstract the standard)&lt;/li&gt;
&lt;li&gt;This pipeline &lt;strong&gt;must&lt;/strong&gt; scale to support large quantities of DNS zones / records&lt;/li&gt;
&lt;li&gt;This pipeline &lt;strong&gt;must&lt;/strong&gt; be easy to use, even with inexperienced DNS administrators (we can't have it all be on the shoulders of &lt;em&gt;that one guy who can safely make DNS changes&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To achieve this, we'll first establish a YAML schema and Jinja2 template to structure the data. Here's the YAML schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 1zones:
 2 - name: filename
 3 zonename:
 4 soa:
 5 settings:
 6 ttl:
 7 serial:
 8 refresh:
 9 retry:
10 expiry:
11 nameservers: []
12 reverse_zones:
13 ip4:
14 ip6:
15 records: [{ "name": "", "type": "", "addr": ""}]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
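&lt;p&gt;To make the schema concrete, here's a hypothetical filled-in entry - every name and address below is invented for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;zones:
  - name: example.com.zone
    zonename: example.com
    soa: ns1.example.com
    settings:
      ttl: 3600
      serial: 2024010601
      refresh: 86400
      retry: 7200
      expiry: 3600000
    nameservers: ["ns1.example.com", "ns2.example.com"]
    reverse_zones:
      ip4: 2.0.192.in-addr.arpa
      ip6: 8.b.d.0.1.0.0.2.ip6.arpa
    records: [{ "name": "www", "type": "A", "addr": "192.0.2.10" }]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;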



&lt;p&gt;There are also some subtle differences between IPv4 and IPv6 reverse zones, so in this case, we're going to use three Jinja2 templates (in the Gist below).&lt;/p&gt;
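&lt;p&gt;The real templates are in the Gist below; as a rough sketch of the idea, a forward-zone template driven by the schema above might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$TTL {{ zone.settings.ttl }}
@ IN SOA {{ zone.soa }}. hostmaster.{{ zone.zonename }}. (
        {{ zone.settings.serial }}
        {{ zone.settings.refresh }}
        {{ zone.settings.retry }}
        {{ zone.settings.expiry }}
        {{ zone.settings.ttl }} )
{% for nameserver in zone.nameservers %}
@ IN NS {{ nameserver }}.
{% endfor %}
{% for record in zone.records %}
{{ record.name }} IN {{ record.type }} {{ record.addr }}
{% endfor %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;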

&lt;p&gt;The templates also assume that there's a dedicated &lt;strong&gt;classful&lt;/strong&gt; prefix for each DNS zone. This isn't always true for more complex deployments, but those shops can also do stuff like buy Infoblox.&lt;/p&gt;

&lt;p&gt;I have also included a GitHub Action in the gist, because it demonstrates best practices (e.g. using &lt;code&gt;venv&lt;/code&gt;) in one compact place. If you want to install generated zone files on-premises, you can run this on a self-hosted runner with an Ansible inventory group (e.g. &lt;code&gt;nameservers&lt;/code&gt;).&lt;/p&gt;
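&lt;p&gt;The &lt;code&gt;venv&lt;/code&gt; pattern in that Action boils down to steps like these (the step names, file names, and the &lt;code&gt;named-checkzone&lt;/code&gt; test target here are illustrative - the real workflow is in the gist):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: "Build a virtual environment"
  run: |
    python3 -m venv .venv
    . .venv/bin/activate
    pip install -r requirements.txt
- name: "Render and test the zonefiles"
  run: |
    . .venv/bin/activate
    ansible-playbook render_zones.yml
    named-checkzone example.com build/example.com.zone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;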

&lt;p&gt;It's still a little clunky; the next step (harvesting DDI information from Netbox IPAM data) should help with that.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@import url('https://cdn.rawgit.com/lonekorean/gist-syntax-themes/d49b91b3/stylesheets/idle-fingers.css');

@import url('https://fonts.googleapis.com/css?family=Open+Sans');
body {
  font: 16px 'Open Sans', sans-serif;
}
body .gist .gist-file {
  border-color: #555 #555 #444
}
body .gist .gist-data {
  border-color: #555
}
body .gist .gist-meta {
  color: #ffffff;
  background: #373737; 
}
body .gist .gist-meta a {
  color: #ffffff
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://gist.github.com/ngschmidt/d0862985e382b052fd3f42bbc4082af3"&gt;GitHub Link&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Build and Consume Alpine Linux vSphere Images</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sun, 24 Dec 2023 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/build-and-consume-alpine-linux-vsphere-images-3g78</link>
      <guid>https://dev.to/ngschmidt/build-and-consume-alpine-linux-vsphere-images-3g78</guid>
      <description>&lt;h2&gt;
  
  
  Deploying Linux for the impatient
&lt;/h2&gt;

&lt;p&gt;If you've ever wanted to just "test something out really quick" in a live environment, you know that Linux distributions are generally lightweight - but low weight isn't the only implicit requirement for experimentation.&lt;/p&gt;

&lt;p&gt;A Linux IaaS distribution should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reasonably secure (basic hardening applied, fewer packages == fewer vulnerabilities)&lt;/li&gt;
&lt;li&gt;Light on disk usage (shortening deployment times)&lt;/li&gt;
&lt;li&gt;Light on system resources, e.g. CPU/Memory&lt;/li&gt;
&lt;li&gt;Flexible (supports a package manager with a wide ecosystem of packages)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Package flexibility is usually the compromise made here - but when you're deploying programmable code, container images and virtual environments like Python's &lt;code&gt;venv&lt;/code&gt; should be able to bridge &lt;em&gt;some&lt;/em&gt; gaps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.alpinelinux.org/about/"&gt;Alpine Linux&lt;/a&gt; focuses on these goals - but doesn't compromise on automation. Combined with a &lt;a href="https://blog.engyak.co/2023/09/vsphere-dayn/"&gt;dynamic inventory bootstrapping process&lt;/a&gt;, it's relatively straightforward to bring Alpine's &lt;a href="https://docs.ansible.com/ansible/latest/collections/community/general/apk_module.html"&gt;APK ansible module&lt;/a&gt; into play to build any extra software on a new machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customizing and building the ISO
&lt;/h3&gt;

&lt;p&gt;First, let's upload the ISO to a datastore:&lt;/p&gt;

&lt;p&gt;&lt;a href="//1_addiso.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fsi7fUhR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/1_addiso.png" alt="Add ISO Image" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's create a new Virtual Machine and attach the ISO to it:&lt;/p&gt;

&lt;p&gt;&lt;a href="//2_createvm.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M8VSU_P6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/2_createvm.png" alt="Create VM from cluster" width="351" height="165"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="//3_names.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0ihKKihy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/3_names.png" alt="Set VM Name" width="364" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Target any storage and compute as preferred. SSD datastores will be faster, of course.&lt;/p&gt;

&lt;p&gt;vSphere 8.0 Update 2 doesn't have a preset for Alpine Linux, and the guest OS options are &lt;em&gt;important&lt;/em&gt; - it defines what paravirtualized hardware is available:&lt;/p&gt;

&lt;p&gt;&lt;a href="//4_guest.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--merUN1On--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/4_guest.png" alt="Guest OS Customization: Linux Other 5.x (64 Bit)" width="787" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ensure that VMXNET 3 and PVSCSI are both available. The "New Network" will become the default port-group assigned to the template.&lt;/p&gt;

&lt;p&gt;CPU/Memory are mostly irrelevant, as the deployment pipeline can customize afterwards - and this OS doesn't need much in terms of resources:&lt;/p&gt;

&lt;p&gt;&lt;a href="//5_hardware.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ndfla6CA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/5_hardware.png" alt="Guest Hardware Customization" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the Alpine "Datastore ISO" and enable "Connect and Power On" for the assigned CD/DVD drive:&lt;/p&gt;

&lt;p&gt;&lt;a href="//6_iso.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zfLfvgkb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/6_iso.png" alt="ISO Configuration" width="547" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Start the machine - it'll boot to a command prompt &lt;em&gt;very quickly&lt;/em&gt;. Log in as root:&lt;/p&gt;

&lt;p&gt;&lt;a href="//7_alpine_start.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OwNFW3a---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/7_alpine_start.png" alt="Start Alpine" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://wiki.alpinelinux.org/wiki/Installation"&gt;installation guide&lt;/a&gt; indicates to use the &lt;code&gt;setup-alpine&lt;/code&gt; script, and follow the prompts:&lt;/p&gt;

&lt;p&gt;&lt;a href="//8_alpine_setup.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MsBW4fK5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/8_alpine_setup.png" alt="Alpine Setup" width="546" height="657"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The majority of the setup is extremely simple because it isn't installing much software. GUIs are also possible after the installation completes - but that does defeat the point.&lt;/p&gt;
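&lt;p&gt;If you build these images often, &lt;code&gt;setup-alpine&lt;/code&gt; can also run unattended with an answer file (&lt;code&gt;setup-alpine -f answers.txt&lt;/code&gt;). This sketch is from my reading of the Alpine wiki - every value is illustrative, and you should generate the authoritative variable list with &lt;code&gt;setup-alpine -c answers.txt&lt;/code&gt; first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# answers.txt - illustrative values only
KEYMAPOPTS="us us"
HOSTNAMEOPTS="-n alpine-template"
INTERFACESOPTS="auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp"
TIMEZONEOPTS="-z UTC"
SSHDOPTS="-c openssh"
DISKOPTS="-m sys /dev/sda"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;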

&lt;p&gt;Instead of rebooting as instructed, shut the virtual machine down so we can remove the CD/DVD drive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Note: Alpine doesn't install a full shutdown process, but the following command &lt;em&gt;does&lt;/em&gt; execute a graceful shutdown!&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shutdown now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="//9_remove_iso.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--izzhz73B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/9_remove_iso.png" alt="Remove CD/DVD-ROM drive" width="800" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Start the machine up - we'll want to add some quality-of-life improvements to this machine, like guest tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ insert favorite editor here }} /etc/apk/repositories
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remove the &lt;code&gt;#&lt;/code&gt; from the line ending in &lt;code&gt;/alpine/v{{ version }}/community&lt;/code&gt; and save.&lt;/p&gt;

&lt;p&gt;Per &lt;a href="https://wiki.alpinelinux.org/wiki/Open-vm-tools"&gt;Alpine's guide&lt;/a&gt;, install and enable open-vm-tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apk add open-vm-tools open-vm-tools-guestinfo open-vm-tools-deploypkg
rc-service open-vm-tools start
rc-update add open-vm-tools boot

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once running, ensure that guest power actions are available:&lt;/p&gt;

&lt;p&gt;&lt;a href="//10_test_guest.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D8BlsNdV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/10_test_guest.png" alt="Guest Power Actions" width="572" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Personally, I like testing, so instead of powering off the VM, I use the guest action to ensure everything is working. Either way, shut the VM down.&lt;/p&gt;

&lt;p&gt;Hit "Actions" on the VM, or right-click it, and select "Clone → Clone as Template to Library":&lt;/p&gt;

&lt;p&gt;&lt;a href="//11_clone_to_library.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b2-SzyEN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/11_clone_to_library.png" alt="Clone as Template to Library" width="549" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select whatever storage backing and content libraries are preferable at this point. It won't take long to clone in. Delete the old VM whenever it makes sense; I usually do so after testing a deployment:&lt;/p&gt;

&lt;p&gt;&lt;a href="//12_deploy_test.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u8cD4cea--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/12_deploy_test.png" alt="Deploy Test VM" width="399" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Note: Set "Power on VM after creation" - this will clone &lt;em&gt;extremely&lt;/em&gt; quickly and boot even faster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Modifying the deployment pipeline
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@import url('https://cdn.rawgit.com/lonekorean/gist-syntax-themes/d49b91b3/stylesheets/idle-fingers.css');

@import url('https://fonts.googleapis.com/css?family=Open+Sans');
body {
  font: 16px 'Open Sans', sans-serif;
}
body .gist .gist-file {
  border-color: #555 #555 #444
}
body .gist .gist-data {
  border-color: #555
}
body .gist .gist-meta {
  color: #ffffff;
  background: #373737; 
}
body .gist .gist-meta a {
  color: #ffffff
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The deployment pipeline code itself is available &lt;a href="https://gist.github.com/ngschmidt/88fe09a1c5733735a4232dd24c44f78e"&gt;here&lt;/a&gt;. I've made some modifications from previous versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub Actions now supports the &lt;code&gt;choice&lt;/code&gt; type, which means we can select UUIDs. There isn't a built-in way to map a "friendly name" to each UUID, so we achieve this by creating a "lookup dictionary" with the friendly name as the key and the UUID as the value. This list will need to be populated via data collection (featured below).
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, we'll need to find out the UUIDs of the template and cluster. &lt;a href="https://gist.github.com/ngschmidt/0c7687cb62ba6f7bb98feb67ff936906"&gt;Here's an example&lt;/a&gt; of how to collect the required information - the UUIDs of system resources (and this template) are only available via the API. Use this information to populate the &lt;code&gt;parameters.yml&lt;/code&gt; file created in the GitHub Action workflow, e.g. &lt;code&gt;datastore&lt;/code&gt;, &lt;code&gt;cluster&lt;/code&gt;, &lt;code&gt;folder&lt;/code&gt;.&lt;/p&gt;
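&lt;p&gt;For example, a hypothetical &lt;code&gt;parameters.yml&lt;/code&gt; - every identifier below is a placeholder for a value returned by the collection workflow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# parameters.yml - placeholder values for illustration
datastore: "datastore-1001"
cluster: "domain-c8"
folder: "group-v22"
templates:
  "alpine-linux": "11111111-2222-3333-4444-555555555555"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;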

&lt;p&gt;Adjusting and running this workflow will allow an engineer to populate the previous workflow and expose vSphere assets to further deployment automation!&lt;/p&gt;

&lt;p&gt;For reference, this machine deployed in about 3 seconds on a shared SSD (iSCSI):&lt;/p&gt;

&lt;p&gt;&lt;a href="//13_deployment_benchmark.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xo508zu9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/13_deployment_benchmark.png" alt="Deployment timeline" width="639" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The GitHub workflow takes &amp;gt;2 minutes to complete, but the workflow it's attached to has a manual wait step:&lt;/p&gt;

&lt;p&gt;&lt;a href="//14_actions.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fpAkq_ED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/12/alpine/14_actions.png" alt="GitHub Actions" width="678" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Apollo 13's "Failure is not an option", and how non-engineers misinterpret it</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sat, 25 Nov 2023 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/apollo-13s-failure-is-not-an-option-and-how-non-engineers-misinterpret-it-4g2g</link>
      <guid>https://dev.to/ngschmidt/apollo-13s-failure-is-not-an-option-and-how-non-engineers-misinterpret-it-4g2g</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Failure is not an option!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It might surprise you to know that this quote wasn't real - it &lt;em&gt;feels&lt;/em&gt; legendary, but was never said by Gene Kranz. &lt;a href="https://web.archive.org/web/20100123160551/http://www.spaceacts.com/notanoption.htm"&gt;It was written up for the film.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The aerospace engineering discipline isn't really something everybody gets to experience, so it makes sense that "spicing things up" for the movie would be generally accepted as reality.&lt;/p&gt;

&lt;p&gt;When you create a program (or release a new capability), it makes perfect sense to get all excited and release it as soon as you feel it's "done" - but this is just an example of how IT/Computer Science is relatively young compared to other engineering disciplines.&lt;/p&gt;

&lt;p&gt;With more traditional engineering disciplines, &lt;em&gt;testing&lt;/em&gt; is a key aspect to deployment and design. Everything is tested for safety. Concrete is thoroughly &lt;a href="https://www.astm.org/products-services/standards-and-publications/standards/cement-standards-and-concrete-standards.html"&gt;tested before integration in bridges and structures&lt;/a&gt;. &lt;em&gt;Most&lt;/em&gt; pickup trucks are tested &lt;a href="https://www.sae.org/standards/content/j2807_202002/"&gt;to their listed tow capacity&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This isn't a perfect ideal world, however. Bridges still fail, and in this case companies didn't follow the SAE J2807 standard until forced (Toyota: 2011, General Motors: 2015, Ford: 2015, Dodge: 2015).&lt;/p&gt;

&lt;h2&gt;
  
  
  Industry-wide changes take time
&lt;/h2&gt;

&lt;p&gt;Here's why: it's expensive to re-tool in the physical world. NASA just straight up didn't have the option, so they compensated by accounting for as many potential scenarios as possible, at significant cost. That's what "Failure is not an option" was intended to reflect: everything was tested and planned ahead of time, and the mission systems didn't try anything truly new.&lt;/p&gt;

&lt;p&gt;Engineering is the practice of taking learned experiences and codifying them, ensuring that the same mistake doesn't happen twice. The safety codes and engineering artifacts we use in the physical world are "written in blood" - many structural engineering practices were learned at the cost of lives, which is why they're so important.&lt;/p&gt;

&lt;p&gt;I don't think anybody has died due to an email not getting through, but I'd counter that the same practices are &lt;em&gt;much&lt;/em&gt; easier to execute in IT and therefore should be followed. IT is a relatively young engineering-adjacent discipline, and the standards for performance are relatively low, albeit always increasing.&lt;/p&gt;

&lt;p&gt;Here's a rough estimate of each engineering discipline's age:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chemical Engineering (~1800s AD)&lt;/li&gt;
&lt;li&gt;Civil Engineering (BC, formalized in the 1700s AD)&lt;/li&gt;
&lt;li&gt;Electrical Engineering (1700s AD, formalized in the 1800s AD)&lt;/li&gt;
&lt;li&gt;Mechanical Engineering (BC, formalized in the 1800s AD)&lt;/li&gt;
&lt;li&gt;Software Engineering (1960s AD)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More recent engineering disciplines fit in these families, and one could argue (correctly) that while they are younger, they benefit from the preceding disciplines and the broader body of knowledge. This is particularly true in the field of aerospace.&lt;/p&gt;

&lt;p&gt;Systems Engineering practitioners have collected a number of practices together to integrate new technologies and disciplines in the &lt;a href="https://sebokwiki.org/wiki/Guide_to_the_Systems_Engineering_Body_of_Knowledge_(SEBoK)"&gt;SEBoK&lt;/a&gt; - which essentially forms a "starter kit" of practices and protocols for developing new solutions. The SEBoK is an excellent (albeit overwhelming) place to procure methods for continuous improvement, either as a team or individually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't fear failure, understand it
&lt;/h2&gt;

&lt;p&gt;Across all of these disciplines, we see a common pattern around failure: the natural reaction is to avoid it. Humans don't want to be associated with failure, and a successful engineer must override this reflex.&lt;/p&gt;

&lt;p&gt;I'd like to provide an example of good failure analysis instead of harping on past failures - my concern here is that any controversy may get in the way of the idea I want to convey - which deviates from the practice of &lt;a href="https://www.sciencedirect.com/topics/engineering/failure-analysis"&gt;failure analysis&lt;/a&gt; somewhat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.wsdot.wa.gov/tnbhistory/collapse.htm"&gt;Washington State DOT's analysis of the Tacoma Narrows bridge failure&lt;/a&gt; is an example of well-executed failure analysis.&lt;/p&gt;

&lt;p&gt;In this case, the structure was too rigid - "common sense" would tell us that if a bridge is extremely strong, it won't have any issues standing up to high winds. In reality, the deck was so rigid that wind-driven oscillations (aeroelastic flutter) built up instead of dissipating, until the span tore itself apart.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying failure analysis to IT
&lt;/h2&gt;

&lt;p&gt;It's important that we learn from these shortcomings and integrate solutions into future designs. Typically, this is where "system integration" comes into play - as a product is validated for release, all known tests are applied to it to ensure that failures don't recur. The NASA engineers supporting Apollo 13 didn't try &lt;em&gt;anything&lt;/em&gt; new on the mission system (Apollo 13 itself). NASA tested all solutions &lt;em&gt;thoroughly&lt;/em&gt; with the ground crew, astronauts, and QA engineers before rollout was ever considered an option.&lt;/p&gt;

&lt;p&gt;The Apollo program was extremely expensive compared to most of our IT budgets, but we're almost always testing &lt;em&gt;software&lt;/em&gt;. Failure Analysis practices are trivial with software debugging and mature unit testing, and eventually we're going to have to perform at the standards held by traditional engineering disciplines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example - a maintenance window backfired
&lt;/h3&gt;

&lt;p&gt;We've all been here before - let's say that spanning tree did something unexpected during a maintenance window and caused unexpected downtime.&lt;/p&gt;

&lt;p&gt;The first and most effective aspect of failure analysis (at least for our careers) is to provide a compelling narrative. We need to invert the human reflexive reaction to failure and encourage interest over punitive behaviors. Writing a complete and compelling narrative both ensures that people will react more positively to the occurrence and provides confidence that due diligence will be performed to ensure it doesn't happen again.&lt;/p&gt;

&lt;p&gt;Sure, it'll always happen again with STP in some way, but other materials have common patterns and properties too. We didn't stop using aluminum because it isn't as strong as steel or as good of a conductor as copper; instead we learned its strengths and weaknesses, applying the solution judiciously. In this case, we need to prove that we will apply the solution more judiciously as well.&lt;/p&gt;

&lt;p&gt;Second, gather all possible data on the time of the outage. Don't try to filter it yet, and don't react slowly. Anything that can record system data is valuable here (telemetry in particular) - so automatic gathering is &lt;em&gt;extremely&lt;/em&gt; valuable.&lt;/p&gt;

&lt;p&gt;Third, find ways to locate precursors and the failure itself. This part should be automated and attached to any CI pipelines going forward - "set it and forget it" is the best way. As this practice evolves, the body of tests develops incredible mass, and manually executing every failure-analysis unit test after every change quickly becomes tedious and slow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why?
&lt;/h2&gt;

&lt;p&gt;The pressure to follow this pattern is only going to grow in the future. The previous decade's reliability standards were hilariously low compared to the quality of technology and service today - just look at the standards people hold us to. Instead of fearing this trend, let's analyze it and find ways to improve. It'll give us a competitive edge in the future.&lt;/p&gt;

&lt;p&gt;As with Apollo 13, our greatest failures drive our greatest successes.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Internet Load Balancing with pfSense</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sun, 08 Oct 2023 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/internet-load-balancing-with-pfsense-3322</link>
      <guid>https://dev.to/ngschmidt/internet-load-balancing-with-pfsense-3322</guid>
      <description>&lt;h2&gt;
  
  
  With full-time remote work, internet outages transform from a nuisance to a real problem
&lt;/h2&gt;

&lt;p&gt;Prior to the pandemic, "working hours" were typically considered fair game by internet service providers to schedule necessary system maintenance. It's unrealistic to expect perfect uptime from any service provider - as the saying goes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Schedule maintenance on your equipment before your equipment schedules it for you!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;ISPs are terrible about this, mostly because "old and stable" equipment means customers receive reliable service - until that trusty Toyota Corolla finally dies, causing severe customer impact.&lt;/p&gt;

&lt;p&gt;I'd suggest taking matters into your own hands here. The technologies involved in internet load balancing are fairly complex, but if you follow a known formula it's doable for most tech-savvy users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Internet Load Balancing
&lt;/h3&gt;

&lt;p&gt;Load balancing network traffic is traditionally a separate domain from routing and firewalling, with most of the general industry focus centering around Server Load Balancing (SLB). An Internet Load Balancer (IPv4) needs to provide the following functions reliably:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor each available path for viability with some form of end-to-end test.&lt;/li&gt;
&lt;li&gt;Evenly (or with a ratio) balance new flows between each available path.&lt;/li&gt;
&lt;li&gt;Track related sessions and place "affinity" on a specific path, ensuring that protocols like RTP + RTCP work.&lt;/li&gt;
&lt;li&gt;NAT outbound traffic for its relevant link (IPv4).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To clarify, this doesn't cover SD-WAN and why it's more effective. Per-packet assessment and FEC lead to a &lt;em&gt;much&lt;/em&gt; higher quality user experience and can achieve much cleaner ratios than what I provide below, but home users typically have high individual bandwidth with their internet services and like the concept of using them to their fullest. If the connectivity options at home are sufficiently mismatched or slow, it would be worthwhile to take SD-WAN solutions into consideration.&lt;/p&gt;

&lt;p&gt;Let's establish an example topology, and cover the tunables that will provide a "good enough" WAN load balancing solution that centers around minimizing impact to remote work:&lt;/p&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oi01Fkyh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/wan-example.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oi01Fkyh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/wan-example.svg" alt="WAN Redundancy Scenario" width="527" height="928"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;In this scenario, we'll just assume that one's wireline and one isn't to make things easy to explain. The transport doesn't really matter much, but it simplifies any documentation from here on out.&lt;/p&gt;

&lt;p&gt;First, let's assign a new interface for the second internet link, and configure it for DHCP. This menu can be found under &lt;strong&gt;Interfaces ⇾ Assignments&lt;/strong&gt; :&lt;/p&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--04l6KsDw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_01_interfaces.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--04l6KsDw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_01_interfaces.png" alt="pfSense Interface Assignment" width="425" height="301"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OLAjj3L4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_02_interfaces.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OLAjj3L4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_02_interfaces.png" alt="pfSense Interface Assignment" width="682" height="315"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Note: Ensure that "Block private networks and loopback addresses" and "Block bogon networks" are checked. This is a WAN link, after all.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When using DHCP, the secondary WAN link &lt;em&gt;should&lt;/em&gt; automatically install a "gateway", but it won't load balance just yet. We need to create a &lt;strong&gt;Gateway Group&lt;/strong&gt; to enforce load balancing policies, and then assign it as the default gateway for things to take effect:&lt;/p&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y74RzItm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_03_gateways.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y74RzItm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_03_gateways.png" alt="pfSense Gateway Creation" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lOO6Eu-n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_04_gateways.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lOO6Eu-n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_04_gateways.png" alt="pfSense Gateway Creation" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bwinY7EU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_05_gateways.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bwinY7EU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_05_gateways.png" alt="pfSense Gateway Creation" width="680" height="231"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;Now, let's create monitoring IPs so pfSense can periodically test for packet loss or latency on that link. The following menu is available by editing the service provider gateway under &lt;strong&gt;System → Routing → Gateways&lt;/strong&gt; :&lt;/p&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KbSNGWZu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_06_gateways.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KbSNGWZu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_06_gateways.png" alt="pfSense Gateway Monitoring" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;I'd suggest using the Service Provider's DNS services or an anycast DNS provider you &lt;em&gt;don't typically use&lt;/em&gt; for the monitor addresses. pfSense installs a static route via that WAN for the monitor address, which means that it'll go down with the WAN link.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note: Duplicate this IP and create a DNS monitor with &lt;a href="https://github.com/louislam/uptime-kuma"&gt;Uptime Kuma&lt;/a&gt; if you want to monitor per-provider reliably. It's quick and easy!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is all that's required, assuming that you want to get the most even load balancing possible. Here are a few tunables that may apply to more specific scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pfSense won't load balance asymmetric link speeds by default. If the interface speeds are different, you will need to create a policy-based routing rule ( &lt;strong&gt;Firewall → Rules → LAN → New rule&lt;/strong&gt; ), and modify the &lt;em&gt;Advanced Option&lt;/em&gt; &lt;strong&gt;Gateway&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7bRAFDfL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_07_pbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7bRAFDfL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_07_pbf.png" alt="pfSense Gateway PBF" width="746" height="83"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;While editing the gateway ( &lt;strong&gt;System → Routing → Gateways&lt;/strong&gt; ), look for an &lt;em&gt;Advanced&lt;/em&gt; setting labeled &lt;strong&gt;Weight&lt;/strong&gt;. This allows you to set a ratio between the gateways in a group, e.g. &lt;code&gt;2:1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;pfSense provides a simplified persistence mechanism that will pin each client to a specific WAN link. This is important, particularly if your remote work situation requires comprehensive use of voice and video services like Zoom or Teams. Please note that this feature &lt;em&gt;will&lt;/em&gt; impact load balancing evenness to a great degree!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y1rBEsLl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_08_sticky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y1rBEsLl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_08_sticky.png" alt="pfSense Sticky Sessions" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;pfSense provides gateway status under &lt;strong&gt;Status → Gateways&lt;/strong&gt; , but I haven't found a way to externally track those statistics via SNMP.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Internet and its scaling issues
&lt;/h3&gt;

&lt;p&gt;We've created a problem with global routing that is just plain fascinating.&lt;/p&gt;

&lt;p&gt;Network Address Translation allows us to "spoof" our internal private networks with multiple public prefixes. This both solves and creates problems - as an upside, we're able to leverage WAN redundancy with service-provider public IPv4 addressing &lt;em&gt;somewhat&lt;/em&gt; easily. This matters, because the public internet routing table currently can't support a designated prefix for every home network, and we're already experiencing internet availability issues due to route propagation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In August 2014 we celebrated "512k Day" by enjoying a number of network outages related to TCAM capacity worldwide: &lt;a href="https://www.prodriveit.co.uk/blog/the-day-the-internet-broke-512k-day"&gt;link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Internet Service Providers (ISPs) at the time managed to postpone this issue by "carving" TCAM, re-allocating ternary memory from other purposes to reset the doomsday clock. This added capacity for another 256,000 routes, but the clock kept ticking. &lt;a href="https://www.thousandeyes.com/blog/what-is-768k-day"&gt;This bought ~5 years of time&lt;/a&gt;, which was generally enough to bump up capacity and lifecycle hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, we have a new problem.&lt;/p&gt;

&lt;p&gt;IPv4 routes consume 64 bits (8 bytes) of memory each, assuming that no hardware optimization is used (you could store the prefix length 1-32 as a 5-bit integer, but route lookups would then be a multi-pass operation or require a lookup table), resulting in an internet routing table size of &lt;strong&gt;4 Megabytes&lt;/strong&gt; on 512k day, or &lt;strong&gt;6 Megabytes&lt;/strong&gt; on 768k day. It doesn't sound like much, but TCAM is designed for fast lookup and is somewhat limited in capacity.&lt;/p&gt;

&lt;p&gt;IPv6 requires 256 bits (32 bytes) of storage per prefix, but summarizes more cleanly. Apples-to-apples at a million routes each would be &lt;strong&gt;8 Megabytes (IPv4) + 32 Megabytes (IPv6), or 40 MB of TCAM&lt;/strong&gt;.&lt;/p&gt;
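&lt;p&gt;A back-of-the-envelope sketch of that arithmetic (assuming the flat 8-byte IPv4 and 32-byte IPv6 entries described above, with no hardware optimization):&lt;/p&gt;

```python
# Rough TCAM sizing: a flat 8-byte IPv4 entry and a 32-byte IPv6 entry,
# with no hardware optimization. 1 MB = 10**6 bytes here.
IPV4_ENTRY_BYTES = 8    # 64 bits per IPv4 route
IPV6_ENTRY_BYTES = 32   # 256 bits per IPv6 prefix

def table_megabytes(v4_routes, v6_routes=0):
    """Raw routing table size in megabytes."""
    total = v4_routes * IPV4_ENTRY_BYTES + v6_routes * IPV6_ENTRY_BYTES
    return total / 1_000_000

print(table_megabytes(512_000))               # 512k day: ~4 MB
print(table_megabytes(768_000))               # 768k day: ~6 MB
print(table_megabytes(1_000_000, 1_000_000))  # a million of each: 40 MB
```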

&lt;p&gt;Most of the absolute latest networking hardware is up to the task, but only after decades of hacks and best-practice engineering optimizations. If I, as a household, establish my own /64, it's not much of a problem, but every other network doing so would result in a table far larger than today's hardware can handle. Summarizing prefixes generally violates the design principle that hiding useful information is harmful, but it's driven by hardware limitations (as it &lt;em&gt;always&lt;/em&gt; has been).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Future (IPv6) Solution
&lt;/h3&gt;

&lt;p&gt;Interestingly enough, IPv6 is well-suited to solving this problem, and the solution is simple. Endpoints typically have &lt;em&gt;tons&lt;/em&gt; of compute resources available for simple tasks like internet load balancing - but the client &lt;em&gt;software&lt;/em&gt; isn't quite up to snuff. A dynamic IPv6 network leverages Router Advertisements and DHCPv6 to configure host devices with DNS and IP addresses, and there is &lt;em&gt;nothing&lt;/em&gt; restricting multiple routers from advertising multiple prefixes over the same network:&lt;/p&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JTqN6sok--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/ipv6-future.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JTqN6sok--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/ipv6-future.svg" alt="IPv6 With Multiple Router Advertisements" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;Well, nothing except our own internal limitations, and client software. This would require a client device to automatically test each "path" and decide which one to use for a given application. We're not quite there yet, but the key elements are in place to guarantee a much higher service quality than our core and home routers can execute.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrospectives
&lt;/h3&gt;

&lt;p&gt;While researching this topic, I discovered a few things that might be good for budget-conscious or hands-on users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You &lt;em&gt;don't need&lt;/em&gt; the biggest internet plan from each service provider. While two 500-Megabit plans are definitely not equal to a single gigabit service, a typical household only uses a few megabits at a time. Right-sizing your internet services will save some serious cash, and may even be cheaper than the single-provider plan!&lt;/li&gt;
&lt;li&gt;Capped services can be weighted down to a lower ratio, or shut off entirely when the cap is reached. This approach is particularly appealing if services in your area are capped, because a bandwidth cap amounts to a tiny fraction of what your link speed could deliver, and pfSense will average out your ratio rather effectively with more diverse usage.&lt;/li&gt;
&lt;li&gt;Purchase an appliance with at least &lt;strong&gt;four&lt;/strong&gt; ethernet ports! If a second service provider makes sense, it's entirely possible that a third may become an option.&lt;/li&gt;
&lt;li&gt;If your ISP provides notice of maintenance, it's trivial to disable a gateway temporarily ( &lt;strong&gt;System → Routing → Gateways → Edit&lt;/strong&gt; ):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ne28I9mL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_09_disable.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ne28I9mL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/10/internet-lb/pfsense_09_disable.png" alt="Disable Gateway" width="456" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Site-to-Site VPNs will need to be pinned to a specific WAN link via static routes, or by using dynamic tunnel IDs (&lt;em&gt;not IP address identities!&lt;/em&gt;)

&lt;ul&gt;
&lt;li&gt;Transport &lt;em&gt;within&lt;/em&gt; a service provider will typically have much higher available bandwidth and lower latency than transport crossing multiple ISPs. If a site is important, try to match the service providers on both sides and run a tunnel per service provider for best results.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Handoff to Day-N Automation with vSphere Content Libraries and Netbox</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sat, 30 Sep 2023 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/handoff-to-day-n-automation-with-vsphere-content-libraries-and-netbox-546</link>
      <guid>https://dev.to/ngschmidt/handoff-to-day-n-automation-with-vsphere-content-libraries-and-netbox-546</guid>
      <description>&lt;h2&gt;
  
  
  The challenge with build automation is &lt;em&gt;too much convenience&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Think about it. If it's easy to compose and deploy workloads, it's also easy to develop sprawl, and a good system designer would have methods in place to mitigate that.&lt;/p&gt;

&lt;p&gt;In a previous post I covered &lt;a href="https://blog.engyak.co/2023/02/deploy-vsphere-vms-with-ansible/"&gt;how to deploy vSphere VMs with Ansible&lt;/a&gt; and the Automation Value Proposition that comes with it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nPaCRgC9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/images/avp_9570184842259471803.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nPaCRgC9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/images/avp_9570184842259471803.png" alt="Automation Value Proposition" width="706" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Providing this capability to a company as-is is hazardous. Ask the following questions, in rough order of priority:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do we track decommissions/unused machines?&lt;/li&gt;
&lt;li&gt;How do we track who owns / uses what?&lt;/li&gt;
&lt;li&gt;How do we track what OS images are end-of-life?&lt;/li&gt;
&lt;li&gt;How do we track resource consumption (e.g. IP usage) and avoid re-using addresses?&lt;/li&gt;
&lt;li&gt;How do we track certificates?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VMs also don't do much good without customization, unless you're comfortable handing those root credentials to whoever wants them.&lt;/p&gt;

&lt;h3&gt;
  
  
  System Integration
&lt;/h3&gt;

&lt;p&gt;Linux heads live for this type of work - we return to the Unix design principles where a system or subsystem should excel at a single task instead of solving all possible issues at the expense of quality.&lt;/p&gt;

&lt;p&gt;Let's explore a multi-system integration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3RNEXlh8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/09/vsphere-dayn/integration.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3RNEXlh8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/09/vsphere-dayn/integration.svg" alt="Integration Diagram" width="800" height="591"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this example, we'll re-implement the previous VM build process, but orchestrate it with &lt;a href="https://docs.github.com/en/actions"&gt;GitHub Actions&lt;/a&gt;. I'll provide a &lt;code&gt;gist&lt;/code&gt; at the end of this post.&lt;/p&gt;

&lt;p&gt;I don't keep my vCenter exposed to the internet, so there will be some preparation required for this Action to function. It relies on several prerequisites; install them first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1python3 -m pip install aiohttp pynetbox
2ansible-galaxy collection install vmware.vmware_rest netbox.netbox

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Action leverages parameterization heavily, with Ansible relying on variables injected from GitHub to the virtual environment (&lt;code&gt;venv&lt;/code&gt;). It provides a little "quiz" that will let consumers define attributes about the deployed machine, e.g. vCPU count and memory. &lt;strong&gt;Any input sanitization &lt;em&gt;should&lt;/em&gt; be done by Ansible in this context&lt;/strong&gt;.&lt;/p&gt;
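&lt;p&gt;As a sketch of what that sanitization might look like (the field names and limits here are illustrative assumptions, not the actual workflow inputs):&lt;/p&gt;

```python
# Hypothetical input sanitization for a VM deployment "quiz". The field
# names and limits are assumptions for illustration only.
LIMITS = {
    "vcpus": range(1, 17),           # 1-16 vCPUs
    "memory_mb": range(512, 65537),  # 512 MB - 64 GB
}

def sanitize_vm_request(raw):
    """Coerce raw string inputs to ints and enforce allowed ranges."""
    clean = {}
    for field, allowed in LIMITS.items():
        value = int(raw[field])  # raises ValueError on non-numeric input
        if value not in allowed:
            raise ValueError(f"{field}={value} is outside the allowed range")
        clean[field] = value
    return clean

print(sanitize_vm_request({"vcpus": "4", "memory_mb": "4096"}))
```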

&lt;p&gt;Once a VM is deployed, the &lt;code&gt;vmware_rest&lt;/code&gt; module returns the virtual machine's Managed Object ID (MOID). We can use that to get operational data about the VM via VMware Tools.&lt;/p&gt;

&lt;p&gt;Ansible keeps all of this data as &lt;code&gt;register&lt;/code&gt;ed variables for future use. Now we have to put the data somewhere &lt;em&gt;persistent&lt;/em&gt;. Netbox is a valuable tool for documenting information assets, but it can also be used as an &lt;a href="https://docs.ansible.com/ansible/latest/collections/netbox/netbox/nb_inventory_inventory.html"&gt;Inventory&lt;/a&gt;. We can dump all the information about the VM into Netbox rather easily, paving the way for further customization.&lt;/p&gt;
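&lt;p&gt;A minimal sketch of that handoff, shaping registered facts into a payload for Netbox's virtualization API (the registered-variable names and the use of a custom field for the MOID are assumptions):&lt;/p&gt;

```python
# Shape facts registered from vmware_rest into a Netbox VM payload.
# The input key names and the "moid" custom field are assumptions;
# "name", "cluster", "vcpus", and "memory" follow Netbox's VM fields.
def netbox_vm_payload(vm_facts, cluster_id):
    return {
        "name": vm_facts["name"],
        "cluster": cluster_id,
        "vcpus": vm_facts["cpu_count"],
        "memory": vm_facts["memory_mib"],
        "custom_fields": {"moid": vm_facts["moid"]},
    }

payload = netbox_vm_payload(
    {"name": "web01", "cpu_count": 2, "memory_mib": 4096, "moid": "vm-1234"},
    cluster_id=1,
)
print(payload)
```

&lt;p&gt;With &lt;code&gt;pynetbox&lt;/code&gt;, a payload like this would be handed to &lt;code&gt;nb.virtualization.virtual_machines.create()&lt;/code&gt;; in the playbook itself, the &lt;code&gt;netbox.netbox&lt;/code&gt; collection's modules do the equivalent.&lt;/p&gt;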

&lt;p&gt;&lt;strong&gt;Note: I excluded the Guest Customization play in this version of the deployment script. It hasn't been particularly stable across 8.x releases with my automated testing, either failing completely with a &lt;code&gt;Service Unavailable&lt;/code&gt; or crashing vCenter. It is possible, however, to change IP addresses, install packages, copy artifacts with Ansible after the fact. Customization via Ansible might even be a better approach in more complex deployments.&lt;/strong&gt;&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@import url('https://cdn.rawgit.com/lonekorean/gist-syntax-themes/d49b91b3/stylesheets/idle-fingers.css');

@import url('https://fonts.googleapis.com/css?family=Open+Sans');
body {
  font: 16px 'Open Sans', sans-serif;
}
body .gist .gist-file {
  border-color: #555 #555 #444
}
body .gist .gist-data {
  border-color: #555
}
body .gist .gist-meta {
  color: #ffffff;
  background: #373737; 
}
body .gist .gist-meta a {
  color: #ffffff
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Circumventing Coder's block and starting a new project</title>
      <dc:creator>Nick Schmidt</dc:creator>
      <pubDate>Sat, 26 Aug 2023 09:00:00 +0000</pubDate>
      <link>https://dev.to/ngschmidt/circumventing-coders-block-and-starting-a-new-project-379d</link>
      <guid>https://dev.to/ngschmidt/circumventing-coders-block-and-starting-a-new-project-379d</guid>
      <description>&lt;h2&gt;
  
  
  It's difficult to start a new software project
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sADMydks--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/08/writers-block/share.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sADMydks--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.engyak.co/2023/08/writers-block/share.png" alt="Road Block" width="370" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Documentation
&lt;/h3&gt;

&lt;p&gt;Depending on how a software project starts, documentation can be either the easiest or the hardest aspect of a new project.&lt;/p&gt;

&lt;p&gt;Documentation suffers from a similar issue, so a good place to get things moving would be to simplify the basics of repository management. Here are a number of things that &lt;em&gt;should&lt;/em&gt; be in a Git repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1.gitignore
2CHANGELOG.md
3README.md

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of these things can be simplified in some way, and will make your future life easier. Let's start with the easy stuff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.gitignore&lt;/code&gt; is easily templated by your source control provider. At this point, it's smart to include one of their templates; this feature has really grown over the years.

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/github/gitignore"&gt;GitHub&lt;/a&gt; provides templates for common development setups, and integrates it into their new repository wizard.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.gitlab.com/ee/api/templates/gitignores.html"&gt;GitLab&lt;/a&gt; does too.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CHANGELOG.md&lt;/code&gt; takes very little effort to start, but is difficult to apply to an existing project.

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://keepachangelog.com/en/1.1.0/"&gt;Keep a changelog&lt;/a&gt; provides an excellent template and guidance on how to effectively write changelogs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;README.md&lt;/code&gt; is the core of your project's documentation, and deserves separate mention because it's a powerful tool to organize how a project should function. Spend plenty of time on this part!&lt;/p&gt;

&lt;p&gt;Since &lt;code&gt;README.md&lt;/code&gt; is the page that renders by default in SCM, the objective &lt;em&gt;should&lt;/em&gt; be to provide everything a user needs to consume your software. I personally prefer to outline how the software should function and use it as a reference when writing the actual code. Here's a decent starting point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 1# {{ Name the Project }}
 2
 3## Goal(s)
 4
 5{{ Write what your project should do }}
 6
 7## Overview
 8
 9### Validation
10
11{{ Write how functional software should be evaluated at an end-to-end level }}
12
13### Unit Testing
14
15{{ Write how functional software should be evaluated at a component level }}
16
17## HOWTO
18
19{{ Indicate how the software should be used. Provide examples, fix it later when the functional code revises your intricate plans }}
20
21## Software Dependencies
22
23{{ Include what the software will need to run. Update as you `include` new libraries}}
24
25## Contributors
26
27{{ Set up a place for software contributors to put their names. It might encourage participation }}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CI
&lt;/h3&gt;

&lt;p&gt;Continuous Integration is not easy to establish, but pays off over time as it catches small issues that appear throughout development. Since the testing strategy is written in plain language, CI tool setup is ideally started soon after the documentation. For a given CI tool, e.g. GitHub Actions or Jenkins, common practices can be templated for re-use.&lt;/p&gt;

&lt;h3&gt;
  
  
  ...Then start writing code
&lt;/h3&gt;

&lt;p&gt;From here, it's going to be &lt;em&gt;easier&lt;/em&gt; to begin authoring software from an outline. Start by writing out your plan ("pseudo-code") in comments, defining class structures if applicable.&lt;/p&gt;

&lt;p&gt;Infrastructure automation is typically a play-by-play implementation of an operating procedure, which won't necessarily need object-oriented coding. In this case, the same approach still works - transpose the operating procedure as comments, and implement it as code.&lt;/p&gt;
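&lt;p&gt;As a tiny illustration of the comment-first approach (the task and naming convention here are hypothetical):&lt;/p&gt;

```python
# Procedure transposed as comments first, then implemented step by step.
# Hypothetical task: normalize a hostname to a site-standard convention.
def standardize_hostname(raw, site):
    # Step 1: strip whitespace and lowercase the raw name.
    name = raw.strip().lower()
    # Step 2: prefix the site code, separated by a hyphen.
    name = f"{site.lower()}-{name}"
    # Step 3: replace underscores, which DNS labels disallow.
    return name.replace("_", "-")

print(standardize_hostname("  Web_01 ", "ANC"))  # anc-web-01
```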

&lt;h3&gt;
  
  
  Hindsight
&lt;/h3&gt;

&lt;p&gt;After some practice, it quickly becomes easier to start a new software repository. Structuring a software project becomes important after only a few hundred lines of code - so if you're stuck, put some work in on the structure: it'll get the process moving, and the effort spent multiplies itself as the project grows.&lt;/p&gt;

&lt;p&gt;Most Git providers also support &lt;em&gt;repository templates&lt;/em&gt; - use this feature to create a form of "starter kit", and copy it whenever a new project is created to save time.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
